Alex Venetidis

April 13, 2022

Building for scale when optimising for speed

Our startup environment is characterised by one thing at its core: speed. Failing to build quickly stunts growth and stifles creative ideas, especially when solutions to engineering problems are scrutinised for scalability rather than their inherent value. That said, scaling our product's infrastructure is an inevitable and crucial step, provided it is considered in context and at the appropriate time.

Background & Description of the Problem

We were recently faced with a scaling problem when one of our larger customers submitted a request for 2 years' worth of data for thousands of their users, for data analytics purposes. When processing such large requests, we implement a method of "chunking" the data into week-long segments. This allows processing to be broken down into a number of individual pieces, protecting us from the fragility of a single huge data payload being normalised at once.
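
To make the chunking step concrete, here is a minimal sketch in Python. It is illustrative only: the `chunk_range` helper is hypothetical, and a fixed one-week chunk size is assumed.

```python
from datetime import datetime, timedelta

CHUNK = timedelta(weeks=1)  # assumed chunk size: one week per sub-request

def chunk_range(start: datetime, end: datetime):
    """Split [start, end) into consecutive week-long windows."""
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + CHUNK, end)
        cursor += CHUNK

# A 2-year request becomes ~100 week-long windows:
windows = list(chunk_range(datetime(2020, 4, 1), datetime(2022, 4, 1)))
print(len(windows))  # 105 (the last window is a partial week)
```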

When a large query comes in, it is pushed onto a task queue to be processed asynchronously. Once a worker picks up the original two-year task, it spawns one sub-task for each week of the initial request, then waits for all of them to complete before sending a message indicating that the full two years of data have been transferred successfully.
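
Expressed as code, the pattern looks roughly like the sketch below. This is a hypothetical reconstruction: Celery is used purely for illustration (the post doesn't name the queueing library), `notify_transfer_complete` is a made-up helper, and `chunk_range` comes from the sketch above.

```python
from celery import Celery, group

app = Celery("transfers", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def process_week(user_id, week_start, week_end):
    ...  # fetch and normalise one week of data

@app.task
def process_request(user_id, start, end):
    # The parent fans out ~100 week-long sub-tasks...
    job = group(process_week.s(user_id, ws, we)
                for ws, we in chunk_range(start, end))
    result = job.apply_async()
    # ...then occupies its worker slot until every child finishes.
    result.get(disable_sync_subtasks=False)
    notify_transfer_complete(user_id)  # hypothetical completion webhook
```

Blocking inside a task like this is exactly what bit us, as the next paragraph describes.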

As a result, each submitted request was broken down into roughly 100 further week-long requests, pushing millions of tasks into our queue. This caused a huge backlog of queries waiting to be processed, and other developers stopped receiving webhooks entirely - whether for their own data requests, or even for notifications of users authenticating! To make matters worse, each "parent request" would wait on its sub-queries to complete; due to the sheer number of parent requests, every worker was occupied by one, busy waiting on sub-requests that could never be processed because no worker was free to take them.

Due to the two factors above, our queues kept filling up instead of slowly being emptied.

Immediate action: How did we find out about the incident and what actions did we take?

Thankfully, we had previously set up internal testing infrastructure that informs us of webhook delays, tracks uptime for HTTP requests, and provides other analytics. Shortly after our customer's large requests were submitted, our testing powerhouse (Terra Thor, built from scratch in-house!) notified us on Discord about large webhook delays in the API.
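
For a flavour of how such an alert might be wired up (a sketch, not Terra Thor's actual code; the threshold, metric, and webhook URL are placeholders):

```python
import requests

# Placeholder: a real Discord webhook URL looks like
# https://discord.com/api/webhooks/<id>/<token>
DISCORD_WEBHOOK_URL = "https://discord.com/api/webhooks/..."

def alert_webhook_delay(p95_delay_seconds: float, threshold: float = 60.0) -> None:
    """Post an alert to a Discord channel when delivery lag gets too high."""
    if p95_delay_seconds > threshold:
        requests.post(
            DISCORD_WEBHOOK_URL,
            json={"content": f"Webhook delay p95 at {p95_delay_seconds:.0f}s "
                             f"(threshold {threshold:.0f}s)"},
            timeout=5,
        )
```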

After some time, once we realised this wasn't just a bug in the testing framework, we went into crisis mode: what could we do to resolve the problem right now, before thinking about how to prevent it from ever happening again?

We spent 5 minutes diagnosing what the problem was, and another 5 understanding what the potential solutions were:

1. Clear the entire queue, and ask the developer to re-submit the query once we had upgraded the infrastructure.

Drawback: this would also inadvertently clear other developers' requests, and severely hinder them if - for example - they had authentication webhooks which were never sent through.

2. Spawn a large number of additional workers and hope they clear the backlog after some time.

Drawback: we couldn't know how long this would take, and devs could be left waiting on their webhooks for a while.

We deliberated and felt that clearing the queue would carry far too much cost, causing trouble for hundreds of devs relying on the API. We chose option 2, and the queue was slowly cleared over the following 1–2 hours.

Black box approach: Learnings & Changes to be made

Learning from mistakes and iterating as we go are two of our fundamental guiding principles, so this was the perfect opportunity to revisit our webhook infrastructure.

First, we gave chunking parent tasks their own separate queue. This ensures that they can no longer block other requests from completing, as had just happened.
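
With Celery-style routing (continuing the assumed stack from the earlier sketch; queue and task names are illustrative), the separation might look like this:

```python
# Route long-running parents away from the default queue, so they can
# never occupy the workers that process week-long sub-tasks.
app.conf.task_routes = {
    "transfers.process_request": {"queue": "parents"},  # fan-out parents
    "transfers.process_week": {"queue": "default"},     # week-long children
}
```

A dedicated worker pool is then started per queue (e.g. `celery -A transfers worker -Q parents`), so the two classes of task can no longer starve each other.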

Second, we needed a more urgent form of notification when something is wrong with the API. We set an upper threshold on webhook delay, past which we are notified by SMS on our mobile phones, so we know when something is genuinely going wrong.
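
A sketch of that escalation, building on the alert function above (Twilio is purely an assumed provider here, and the credentials, numbers, and threshold are placeholders):

```python
from twilio.rest import Client  # assumed SMS provider, for illustration

twilio = Client("ACCOUNT_SID", "AUTH_TOKEN")  # placeholder credentials
SMS_THRESHOLD = 300.0  # seconds; illustrative hard threshold

def escalate(p95_delay_seconds: float) -> None:
    """Discord for awareness; SMS once the delay crosses the hard threshold."""
    alert_webhook_delay(p95_delay_seconds)
    if p95_delay_seconds > SMS_THRESHOLD:
        twilio.messages.create(
            to="+447700900123",    # placeholder on-call number
            from_="+15005550006",  # placeholder sender
            body=f"URGENT: webhook delay p95 at {p95_delay_seconds:.0f}s",
        )
```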

Third, one developer's requests should never interfere with another's: from a customer's standpoint, they should not have to care what another of Terra's users is doing, and have the right to expect reliable, uninterrupted service regardless. To this end, we decided it made sense to give larger developers their own set of queues, so that their tasks don't get tangled up with others'.
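
In the same illustrative Celery setup, per-developer routing can be decided at enqueue time (the developer IDs and queue-naming scheme below are invented for the example):

```python
LARGE_DEVELOPERS = {"dev_123", "dev_456"}  # hypothetical large customers

def queue_for(dev_id: str) -> str:
    """Large developers get a dedicated queue; everyone else shares."""
    return f"dev.{dev_id}" if dev_id in LARGE_DEVELOPERS else "parents"

def submit_request(dev_id: str, user_id: str, start, end) -> None:
    # Enqueue the parent task on the appropriate queue at call time.
    process_request.apply_async(args=(user_id, start, end),
                                queue=queue_for(dev_id))
```

With a worker pool per large developer, one customer's two-year backfill can no longer delay anyone else's webhooks.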

While speed of implementation is one of our core values, we need to be able to react to situations that call for scalability, and adapt to our customers' needs. We learn, grow, and document our learnings so that we can build on top of robust foundations.
