Alex Venetidis

April 13, 2022

Building for scale when optimising for speed

Our startup environment is characterised by one thing at its core: speed. Failing to build quickly stunts growth and stifles creative ideas, especially when solutions to engineering problems are scrutinised for scalability rather than for their inherent value. That said, scaling the infrastructure of our product is an inevitable and crucial step, as long as that scaling is considered in context and at the appropriate time.

Background & Description of the Problem:

We were recently faced with a scaling problem when one of our larger customers submitted a request for 2 years' worth of data for thousands of their users, for data analytics purposes. When processing such large requests, we implement a method of "chunking" the data into week-long segments. This allows processing to be broken down into a number of individual pieces, protecting us from the fragility of a single huge data payload being normalised at once.
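As an illustration of the chunking idea, here is a minimal sketch of splitting a date range into week-long segments. The function name and interface are hypothetical, not Terra's actual internals.

```python
# Minimal sketch: split a date range into week-long chunks.
# chunk_by_week is a hypothetical helper, not Terra's actual code.
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def chunk_by_week(start: datetime, end: datetime) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in week-long slices."""
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + timedelta(weeks=1), end)
        yield cursor, chunk_end
        cursor = chunk_end


# A two-year request breaks down into roughly 100 week-long chunks per user.
print(len(list(chunk_by_week(datetime(2020, 1, 1), datetime(2022, 1, 1)))))  # 105
```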

When a large query comes in, it is sent into a queue of tasks and waits to be processed asynchronously. Once a worker accepts the original two-year task, it spawns one task for each week of the initial request and waits for their completion, so that it can send a message indicating that all two years of data have been transferred successfully.
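A hedged sketch of this fan-out-and-wait pattern is below, written with Celery-style tasks purely for illustration; Terra's actual task framework, task names, and broker setup are not specified in this post, and chunk_by_week is the hypothetical helper from the sketch above.

```python
# Illustrative sketch only: a Celery-style parent task that fans out one sub-task
# per week and blocks until all of them finish. Names and broker URLs are assumptions.
from datetime import datetime

from celery import Celery, group

app = Celery(
    "terra_sketch",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


@app.task
def process_week(user_id: str, week_start: str, week_end: str) -> bool:
    """Fetch and normalise one week of data for one user (body omitted)."""
    return True


@app.task
def process_large_request(user_id: str, start: str, end: str) -> None:
    """Parent task: spawn one sub-task per week, then wait for every one of them."""
    # chunk_by_week: the hypothetical helper defined in the previous sketch.
    weeks = chunk_by_week(datetime.fromisoformat(start), datetime.fromisoformat(end))
    weekly = group(
        process_week.s(user_id, ws.isoformat(), we.isoformat()) for ws, we in weeks
    )
    results = weekly.apply_async()

    # The parent blocks here until every week completes. While it waits it still
    # occupies a worker slot -- which is exactly what caused the trouble described next.
    results.get(disable_sync_subtasks=False)
    # send_completion_webhook(user_id)  # hypothetical notification helper
```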

As a result, each submitted request was broken down into roughly 100 further week-long requests, and across thousands of users this pushed millions of tasks into our queue. This first caused a huge backlog of queries waiting to be processed, so other devs were unable to receive any webhooks - whether for their own data requests, or even notifications of users authenticating! To make matters worse, each "parent request" would wait on its sub-queries to complete, but due to the sheer number of parent requests, every worker was occupied by one, busy waiting on sub-requests which were never actually processed since they could not be allocated to any worker.

Due to the two factors above, our queues kept filling up instead of slowly being emptied.

Immediate action: How did we find out about the incident and what actions did we take?

Thankfully, we'd previously set up some internal testing infrastructure which informs us of webhook delays, tracks uptime for HTTP requests, and provides other analytics. Shortly after our customer's large requests were submitted, our testing powerhouse (Terra Thor, built from scratch in-house!) notified us on Discord about large webhook delays in the API.

Once we realised this wasn't just a bug in the testing framework, we immediately went into crisis mode: what could we best do to resolve the problem right now, before thinking about how to stop it happening again?

We spent five minutes diagnosing the problem, and another five weighing the potential solutions:

1. Clear the entire queue, and ask the developer to submit the query again once we had upgraded the infrastructure.

Drawback: this would also inadvertently clear other developers' requests, and severely hinder them if - for example - they had authentication webhooks which were never sent through.

2. Spawn a large number of workers to hopefully clear the tasks after some time.

Drawback: we couldn't know how long this would take, and devs could be waiting for their webhooks for a while.

We deliberated and felt that clearing the queue would cost far too much, causing trouble for hundreds of devs relying on the API. We chose option 2, and the queue was then slowly cleared over around 1–2 hours.

Black box approach: Learnings & Changes to be made

Learning from mistakes and iterating as we go are two of our fundamental guiding principles, so this was the perfect opportunity to revisit our webhook infrastructure.

First, we gave chunking parent tasks their own separate queue. This ensures that they cannot block other requests from completing, as had just happened.
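A minimal sketch of what that separation might look like with Celery-style routing is below; the queue names, task name, and broker URL are assumptions for illustration, not Terra's actual configuration.

```python
# Illustrative sketch: route the long-running "parent" chunking tasks to a dedicated
# queue so they never compete with ordinary webhook and normalisation work.
from celery import Celery
from kombu import Queue

app = Celery("terra_sketch", broker="redis://localhost:6379/0")

app.conf.task_queues = (
    Queue("default"),           # ordinary requests, webhooks, auth notifications
    Queue("chunking_parents"),  # only the fan-out/wait parent tasks
)
app.conf.task_default_queue = "default"
app.conf.task_routes = {
    "tasks.process_large_request": {"queue": "chunking_parents"},
}
```

With a split like this, a separate worker pool can be pointed at each queue (e.g. celery -A tasks worker -Q chunking_parents), so even a flood of parent tasks leaves the default workers free.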

Second, we needed a more urgent form of notification when something is wrong with the API. We set an upper threshold on webhook delay, past which we are notified by SMS on our mobile phones, so that we know when something is genuinely going wrong.
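The escalation rule might look something like the sketch below; the threshold value and the send_sms helper are illustrative assumptions rather than our actual alerting setup.

```python
# Illustrative sketch: escalate from a chat notification to SMS once webhook delay
# crosses an upper threshold. The threshold and send_sms helper are hypothetical.
from datetime import timedelta

WEBHOOK_DELAY_SMS_THRESHOLD = timedelta(minutes=15)  # assumed value for illustration


def send_sms(message: str) -> None:
    """Placeholder for an SMS provider integration."""
    ...


def escalate_if_delayed(observed_delay: timedelta) -> None:
    """Page the team by SMS when webhook delay exceeds the upper threshold."""
    if observed_delay > WEBHOOK_DELAY_SMS_THRESHOLD:
        send_sms(
            f"Webhook delay at {observed_delay}, above the "
            f"{WEBHOOK_DELAY_SMS_THRESHOLD} threshold - investigate now."
        )
```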

Third, one developer's requests should never interfere with another's: from a customer's standpoint, they should not have to care what another one of Terra's users is doing on their end, and they have the right to expect reliable, uninterrupted service regardless. To this effect, we decided it made sense to give larger developers their own set of queues, so that their tasks don't get tangled up with others'.
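One simple way to express that isolation is sketched below; the developer IDs and queue-naming scheme are hypothetical, and dispatching to a queue chosen at runtime is shown Celery-style purely as an example.

```python
# Illustrative sketch: large developers get a dedicated queue, everyone else shares
# the default one. Developer IDs and the queue-naming scheme are hypothetical.
LARGE_DEVELOPERS = {"dev_acme", "dev_globex"}  # hypothetical developer IDs


def queue_for(developer_id: str) -> str:
    """Pick the queue a developer's tasks should be published to."""
    return f"dev_{developer_id}" if developer_id in LARGE_DEVELOPERS else "default"


# With Celery-style dispatch, the queue can be chosen per call, e.g.:
# process_large_request.apply_async(args=[user_id, start, end], queue=queue_for(dev_id))
```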

While speed of implementation is one of our core values, we also need to recognise the situations that call for scalability and adapt to our customers' needs. We learn, grow, and document our learnings so that we can build on top of robust foundations.
