Building for scale when optimising for speed
Our startup environment is characterised by one thing at its core: speed. Failing to build things fast stunts growth and inhibits creative ideas, especially when solutions to engineering problems are scrutinised for scalability instead of their inherent value. That said, scaling our product's infrastructure is an inevitable and crucial step, provided it is considered in context and at the appropriate time.
Background & Description of the Problem:
We were recently faced with a scaling problem when one of our larger customers submitted a request for 2 years' worth of data for thousands of their users, for data analytics purposes. When processing such large requests, we "chunk" the data into week-long segments. This breaks processing down into a number of individual pieces, protecting us from the fragility of normalising a single huge data payload at once.
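To make the chunking concrete, here is a minimal sketch of splitting a date range into week-long segments; the function name and boundary handling are illustrative rather than our production implementation.

```python
from datetime import date, timedelta

def weekly_chunks(start: date, end: date):
    """Split [start, end) into consecutive chunks of at most one week."""
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + timedelta(days=7), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end

# A 2-year request becomes ~100 independent week-long pieces, each of which
# can be fetched, normalised, and retried on its own.
chunks = list(weekly_chunks(date(2021, 1, 1), date(2023, 1, 1)))
print(len(chunks))  # 105 (104 full weeks plus a 2-day remainder)
```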
When a large query comes in, it gets sent into a queue of tasks and waits to be processed asynchronously. Once a worker accepts the original two-year task, it spawns one task for each week of the initial request, then waits for their completion so it can send a message indicating that all 2 years of data have been transferred successfully.
As a result, each request submitted was broken down into roughly 100 further week-long requests, pushing millions of calls into our task queue. This first created a huge backlog of tasks waiting to be processed, and other devs stopped receiving any webhooks at all - whether for their own data requests, or even notifications of users authenticating! To make matters worse, each "parent request" would wait on its sub-requests to complete, but due to the sheer number of parent requests, every worker was occupied by a parent busy waiting on sub-requests that could never actually be processed, since no worker was free to take them.
Due to the two factors above, our queues kept filling up instead of slowly being emptied.
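The failure is easier to see in miniature. The sketch below is not our task framework; it uses a small thread pool to reproduce the same shape of deadlock, where parents submit children to the same pool they themselves occupy and then block waiting on them.

```python
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=4)   # stand-in for a small, fixed worker fleet

def week_chunk(week: int) -> str:
    return f"week {week} normalised"

def parent_request(request_id: int):
    # The parent spawns one task per week on the SAME pool, then blocks on them.
    children = [pool.submit(week_chunk, week) for week in range(104)]
    return [c.result() for c in children]

# Submit more parents than there are workers: every slot fills with a blocked
# parent, so the children queued behind them can never be scheduled.
parents = [pool.submit(parent_request, i) for i in range(8)]
try:
    parents[0].result(timeout=5)
except TimeoutError:
    print("still waiting after 5s: the pool has deadlocked on itself")
os._exit(0)  # the blocked workers would otherwise hang the interpreter at exit
```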
Immediate action: How did we find out about the incident and what actions did we take?
Thankfully, we'd previously set up some internal testing infrastructure that informs us of webhook delays, tracks uptime for HTTP requests, and provides other analytics. Shortly after our customer's large requests were submitted, our testing powerhouse (Terra Thor, built from scratch in-house!) notified us on Discord about large webhook delays in the API.
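For illustration only (this is not Terra Thor's actual code), a monitor like this just needs to measure the delay of a test webhook and post to a Discord webhook URL when a threshold is crossed; the URL and threshold below are placeholders.

```python
import time
import requests

DISCORD_WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder
DELAY_THRESHOLD_SECONDS = 120  # illustrative threshold, not our real one

def check_webhook_delay(sent_at: float, received_at: float) -> None:
    """Alert the team on Discord if a test webhook took too long to arrive."""
    delay = received_at - sent_at
    if delay > DELAY_THRESHOLD_SECONDS:
        requests.post(
            DISCORD_WEBHOOK_URL,
            json={"content": f"Webhook delay is {delay:.0f}s (threshold {DELAY_THRESHOLD_SECONDS}s)"},
            timeout=10,
        )

# e.g. called by the test harness once it finally receives its own webhook
check_webhook_delay(sent_at=time.time() - 300, received_at=time.time())
```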
After some time, when we realised this wasn't just a bug in the testing framework, we immediately went into crisis mode: figure out the best thing we could do to resolve the problem right now, and leave preventing a recurrence for later.
We spent 5 minutes diagnosing the problem, and another 5 mapping out the potential solutions:
1. Clear the entire queue, and ask the developer to submit the query again once we had upgraded the infrastructure.
Drawback: this would also inadvertently clear other developers' requests, and severely hinder them if - for example - they had authentication webhooks which weren't sent through in the end.
2. Spawn a large number of workers to hopefully clear the tasks after some time.
Drawback: We couldn't know how long this would take, and devs could be waiting for the webhooks for a while.
We deliberated and felt that clearing the queue would be far too costly and would cause trouble for hundreds of devs relying on the API. We chose option 2, and the queue slowly cleared over around 1–2 hours.
Black box approach: Learnings & Changes to be made
Learning from mistakes and iterating as we go are two of our fundamental guiding principles, so this was the perfect opportunity to revisit our webhook infrastructure.
First, we gave the chunking parent tasks their own separate queue. This ensures that they cannot block other requests from completing, as had just happened.
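A sketch of what that separation could look like, assuming a Celery-style task queue (the task names, queue names, and broker URL are illustrative, not our actual setup):

```python
from celery import Celery

app = Celery("terra", broker="redis://localhost:6379/0")  # illustrative broker

# Route the chunking "parent" tasks (and their weekly children) to their own
# queues, so a flood of them can never starve the default queue that serves
# webhooks and authentication notifications.
app.conf.task_routes = {
    "tasks.process_large_request": {"queue": "chunk_parents"},
    "tasks.process_week_chunk": {"queue": "chunk_children"},
}

# Dedicated workers then consume each queue independently, e.g.:
#   celery -A terra worker -Q chunk_parents  --concurrency=4
#   celery -A terra worker -Q chunk_children --concurrency=32
#   celery -A terra worker -Q celery         # default queue: webhooks, auth, ...
```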
Second, we needed a clearer signal of urgency when something is up with the API. We set an upper threshold on webhook delay, past which we get notified by SMS on our mobile phones, so we know when something is genuinely going wrong.
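A rough sketch of that escalation step, using Twilio purely as an example SMS provider (the threshold, environment variable names, and provider choice are assumptions, not a description of our stack):

```python
import os
from twilio.rest import Client  # Twilio used here only as an example provider

SMS_DELAY_THRESHOLD_SECONDS = 600  # illustrative "page a human" threshold

def escalate_if_needed(webhook_delay_seconds: float) -> None:
    """Past the upper threshold, stop relying on Discord alone and page us by SMS."""
    if webhook_delay_seconds <= SMS_DELAY_THRESHOLD_SECONDS:
        return
    client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])
    for number in os.environ["ONCALL_PHONE_NUMBERS"].split(","):
        client.messages.create(
            to=number,
            from_=os.environ["TWILIO_FROM_NUMBER"],
            body=f"API webhook delay at {webhook_delay_seconds:.0f}s - check the queues",
        )
```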
Third, one developer's requests should never interfere with another's: from a customer's standpoint, they should not have to care what another one of Terra's users is doing on their end, and they have the right to expect reliable, uninterrupted service regardless. To this end, we decided it made sense to give larger developers their own set of queues, so that their tasks don't get tangled up with others'.
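One way to express that isolation at enqueue time, again assuming a Celery-style queue; the customer identifiers and task name are hypothetical:

```python
# Illustrative only: send a large customer's work to its own queue, so its
# bursts never share a queue with anyone else's traffic.
LARGE_CUSTOMERS = {"customer_a", "customer_b"}  # hypothetical identifiers

def queue_for(customer_id: str) -> str:
    return f"customer_{customer_id}" if customer_id in LARGE_CUSTOMERS else "celery"

# With Celery, the target queue can be chosen per call, e.g.:
#   process_large_request.apply_async(
#       kwargs={"customer_id": customer_id, "start": start, "end": end},
#       queue=queue_for(customer_id),
#   )
```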
While speed of implementation is one of our core values, we need to be able to recognise the situations that call for thinking about scalability, and adapt to our customers' needs. We learn, grow, and document our learnings so that we can build on top of robust foundations.