- Flo Health has grown to 80 million users, adding about a million users each month, showcasing its rapid expansion.
- Roman Bugaev joined Flo when it had just 20 employees and no revenue, highlighting the company's dramatic growth journey.
- Vladislav Nedosekin leads AI development, ensuring the platform's medical accuracy and user engagement through advanced algorithms.
- Flo Health employs a unique experimentation platform, conducting 400 A/B tests per quarter to refine user experience and product features.
- Teams at Flo are structured as autonomous units, each functioning like a startup within the company, to efficiently manage new features and user segments.
In this podcast with Kyriakos, the CEO of Terra, Roman Bugaev and Vladislav Nedosekin dive into the explosive growth of Flo Health, now boasting 80 million users. Roman shares insights from joining the company when it was just 20 people strong, while Vladislav reveals how AI is reshaping the platform's capabilities. Discover how their innovative team structures and relentless testing have propelled Flo to the forefront of health tech.
For the podcast: Apple, Spotify, YouTube, X.com
42,000 Tickets and Zero Ad Spend: The Flo Health Phenomenon
Kyriakos: Flo has been probably the fastest-growing health company in the world. You guys have 80 million users. How does this even happen?
This is Roman Bugaev, CTO at Flo Health, who has overseen the evolution of the platform into a global leader. And Vladislav Nedosekin, Director of Engineering for the AI platform at Flo Health, leading the development of the company's cutting-edge AI infrastructure.
Roman: I joined when the company was really small, 20 people in total. We had no revenue; now we have hundreds of servers and petabytes of data stored. The biggest challenge here is that you also want to be absolutely medically safe and correct. There is no room for mistakes. If you think about chatbots, for example, a few seconds of latency doesn't actually make a difference. What's more important from a user's perspective is that the model actually takes the time to give you a medically sound answer.
Kyriakos: Would you say that AI is making more mistakes, or finishing sooner?
Roman: What we can see is that AI is more consistent. It will not have fatigue. It will not have attention loss. We have like a digital avatar that kind of lives on our servers.
The Photo That Built a Flywheel: Flo's Early Days
Kyriakos: So for those who don't know, Flo has been probably the fastest-growing health company in the world, I suspect. Last time I had a discussion with Dmitry, the CEO, roughly six months ago, he mentioned that you guys had 75 million users. I checked yesterday, you guys have 80 million users. So you're adding a million users every month. How does this even happen? And Roman joined when the company was 20 employees, I believe, very early on. So, can we start with that? What were the early days of Flo once you joined the company?
Roman: Yeah. So, yes, I joined when the company was really small, 20 people in total; engineering was maybe 10 people max. We had no revenue, but we already had some users. And basically, we knew that the product was actually a hit and we had users who liked it and enjoyed it. But the product was very basic. We had symptom tracking. We had basic functionality to predict cycles based on historical data, with very basic algorithms, let's say. But we already knew that the users loved the product. And then, basically, every six or twelve months since then, depending on the year, it's been a new company, because the company has been growing really fast and we've also transformed as a business.
So we launched monetization in 2019. We launched a lot of new features, like chatbots within the app. We launched different predictive algorithms based on neural networks. And basically since then we've been growing very fast. The main success story here is that we also have a very good product, so people stick with us. Once they install our app, they stay with us for many, many years. And because of that, the audience is growing very quickly.
Kyriakos: You mentioned at the beginning, you knew that the product was a hit. How did you know?
Roman: Well, when I joined the company, I think we already had a couple of million users. And basically users were installing the app. We had no marketing, but they were searching the App Store for a period tracker app, and we were somewhere close to the top of the App Store. And it was very early days. I also checked all the competitors at the time, and they were very ugly apps. It was hard to believe that the market was so underserved. Basically everyone at the time was working on some sort of Uber alternative. It was a time when everyone was so focused on self-driving cars and so on. And nobody was building anything for women. Because of that, it was an underserved market. And when we started to build something with a nice UI, with good algorithms, with a lot of best practices in place, it became a successful product.
From Basic to Brilliant: Flo's Product Evolution
Kyriakos: And then when you joined, what was the product and how did the product evolve over time?
Roman: Yeah, so as I mentioned, the product was basic. We had a symptom panel where you could log different symptoms like headaches, cramps, and others. And then we had cycle tracking, so women were able to track periods. And then we had some basic algorithms to predict when the next cycle would start. The idea was that we had founders with engineering backgrounds, so we immediately understood that if you have data, you can use this data to predict the future. And we started to adopt neural networks to predict those different symptoms and cycles. And this was basically the product. We had almost nothing else except this very basic functionality.
Kyriakos: And then over time, how did you guys prioritize the new products that you were putting out? And what are the products today?
Roman: So the idea was always that we need to educate users and we need to give insights based on the information that they enter. But in order to go deeper into insights, you need to get more information from the user. So one of the first features that we built when I joined was a chatbot. The idea was to build a feature that worked close to what happens in real life. So when you go to the doctor, you don't really know what the issue is. You just know that something is not right. And doctors start to ask questions like, do you have a family history of these conditions? Do you have headaches? And you just answer questions like yes, no, maybe, sometimes. Doctors know what they do. They ask questions to narrow down the problems for the user. And then they say, okay, well, it's a certain condition or issue. We were inspired by this idea and we built the chatbots for a similar purpose.
One of the first chatbots was about why a cycle is late. There are multiple reasons for that. One might be the user is stressed, or maybe the user is pregnant. There are multiple other reasons, like different health conditions as well. The idea was to explain why the user has certain issues and give them insights. Another core idea was the app should be proactive. Instead of waiting for the user to ask questions, we were always proactive and said, do you want to talk about this? We have some insights for you, let's talk about that. You need to nudge the user a little bit to engage them in the functionality.
As for the prioritization framework, I think we started early on to use the ICE framework for scoring: impact, confidence, effort. We were scoring all the features. We always had a lot of ideas from users, from reviews, from support. Users were always asking about features, but we always had more ideas than we could implement. So we scored all the features through this framework and always worked on the most impactful feature where we had confidence and knew that we could build it. The framework guided us through this prioritization process.
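The ICE scoring Roman describes can be sketched in a few lines. This is a minimal illustration using the common ICE formulation (impact × confidence / effort); the feature names and scores are hypothetical, not Flo's actual backlog.

```python
from dataclasses import dataclass

@dataclass
class FeatureIdea:
    name: str
    impact: float      # expected effect on key metrics, e.g. 1-10
    confidence: float  # how sure we are the impact is real, e.g. 1-10
    effort: float      # relative cost to build, e.g. 1-10

def ice_score(idea: FeatureIdea) -> float:
    # Classic ICE: impact * confidence / effort.
    return idea.impact * idea.confidence / idea.effort

backlog = [
    FeatureIdea("cycle-delay chatbot", impact=8, confidence=7, effort=5),
    FeatureIdea("green vs. yellow button", impact=2, confidence=9, effort=1),
    FeatureIdea("wearable temperature import", impact=9, confidence=4, effort=8),
]

# Work on the highest-scoring idea first.
ranked = sorted(backlog, key=ice_score, reverse=True)
for idea in ranked:
    print(f"{idea.name}: {ice_score(idea):.1f}")
```

Note that cheap, certain wins (the button test) can outrank bigger but riskier bets, which is exactly why a scoring framework is paired with heavy A/B testing.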
We understood very early that we don't know what we don't know. In order to build a successful product, you need to test the waters a lot. You need to build a small minimum viable product, or lovable product, whatever framework you use. Then you need to ship it to end users and give them the opportunity to test it and try it. You need to check whether they'll use this product. If yes, you continue. If not, you need to change something. We started to do a lot of A/B tests to understand what users like. I think as of today, we're at about 400 A/B tests per quarter. We do a lot of tests, like whether this feature is better than that feature, or how a feature should look, whether it's a green button or a yellow one.
Kyriakos: Tactically, this means that you get a subset of, let's say, a thousand users, and you run each test with a different thousand users?
Roman: So basically, we also built an internal experimentation service, because we found that the experimentation platforms available on the market were not designed for Flo's needs. We built an experimentation platform inside the company, and this platform decides who will see each new experiment. Users are randomly selected for the experiments. We check statistical significance for experiments. We can also target experiments at a particular audience. For example, we can run parallel experiments for users who are pregnant, but also create another set of experiments for users who just track cycles or are interested in sleep or weight management. You can split your audience into slices and run a lot of A/B tests if you have a large enough audience. We are one of those lucky companies that have a lot of users.
Kyriakos: But given that, how do you optimize for cost once you have, let's say, a thousand or 10,000 or 100,000 users in an experiment? How can you tell how many people you need to run it on to get significant results?
Roman: It's actually possible to calculate in advance. Depending on what kind of effect you expect, you need more or fewer users. If you want to detect a very small change, you probably need a bigger audience, because a small impact is harder to detect on a small audience. But if you expect something to double, you need fewer users. Based on the initial assumptions of the experiment, we can decide what the audience should be. We also built a tool that takes a somewhat creative approach to the statistics: experiments run until we reach statistical significance, and the experiment is stopped at that point. Then we can analyze the experiment and decide whether we want to roll it out to 100% of users or continue iterating.
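The "calculate in advance" step Roman mentions is a standard power analysis. Here is a minimal sketch using the textbook two-proportion sample-size formula (not Flo's internal tool); the conversion rates are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def users_per_arm(p_base: float, p_new: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    # Standard two-proportion sample-size formula: the smaller the
    # expected lift (p_new - p_base), the more users each arm needs.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_base + p_new) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2
    return ceil(num / (p_new - p_base) ** 2)

# Detecting a 1-point lift on a 10% baseline needs far more users
# than detecting a doubling to 20%.
print(users_per_arm(0.10, 0.11))  # small effect: large sample
print(users_per_arm(0.10, 0.20))  # large effect: small sample
```

Note that the stop-at-significance approach Roman describes requires sequential-testing corrections in practice; naively peeking at a fixed-horizon test inflates false positives, which is presumably part of the "creative statistics" of their tool.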
The Two-Pizza Team: Engineering Growth at Flo
Kyriakos: From the time you joined to today, what did the engineering structure look like, and how did it change?
Roman: When I joined, we had a very flat structure: me and the engineers, that was it. After that, we started to hire people and divide teams into sub-teams. Initially, I was running a couple of teams, but when we reached about 25 engineers, I decided I needed more managers. The core idea was always that we want fully autonomous teams, almost like a startup within the company. Because of that, we're able to scale the company. If we want to launch a new feature or address a new user segment, we just need to create a team for that particular feature. This team will be completely cross-functional: mobile engineers, backend engineers, machine learning engineers if needed, quality, a product manager, a designer, an analyst. They work as a cross-functional unit that can execute any idea without depending on other teams. Sometimes we build features that require coordination across multiple teams, but we try to keep those dependencies to a minimum.
Kyriakos: And how small or how big are these sub-teams?
Roman: Typically, it's a two-pizza team, about seven to ten people. Smaller than that isn't really a team, and a bigger team is hard to manage: too many people, and it's very hard to invite everyone to a meeting and so on. So yeah, the two-pizza team works well for us. I think right now we have about 25 teams of this size.
Kyriakos: Until you found this structure, did you make any mistakes along the way? Are there any types of teams that don't work that people should avoid?
Roman: That was the core idea from day one; we were lucky to start with it. We experimented along the way with other structures and found that, for example, a separate mobile team doesn't work, because you have a backend team and a mobile team and they always have dependencies on each other. Coordinating multiple teams becomes a nightmare. Different product managers need to know which teams will be involved in a feature's development, so they have to coordinate a lot of teams as well. We found that it's not practical. We also experimented with different team sizes. Most of our teams right now have one mobile engineer per platform, a couple of backend engineers, and then quality engineers and so on. But we also experimented with two mobile engineers per team, three mobile engineers per platform per team, and other setups. We found that the ideal size is roughly four backend engineers, one iOS, one Android, and two QA engineers, and that's your team. But again, it really depends on the particular context of the team. For example, a team that works on chatbots is more backend-heavy. Some other teams need to be slightly more frontend-heavy, but overall, that's the setup.
Scaling Systems: Evolution Over Revolution
Kyriakos: Given this drastic growth from an engineering perspective, were you very thoughtful from the very beginning about how the systems would need to look to reach that type of scale? Or did you have to rewrite stuff every two years or something like that?
Roman: I'm a big believer in evolution, in systems that evolve over time. We try to avoid huge refactorings or rewriting systems; we just try to evolve them. We add new components and evolve the system rather than re-architecting completely from scratch. We started with a very simple architecture: four servers, two for the database and two for the backend, plus mobile clients on iOS and Android. We evolved from that. Right now we have about 600 different services, many different data stores, a separate data platform, maybe 30 Amazon accounts. We started from just a couple of servers, and now we have hundreds of servers and petabytes of data stored. It's evolution rather than revolution. You launch a new feature like chatbots, you create a service for chatbots, you create a database for chatbots, and you evolve this way. It's the same way you evolve teams.
Another core idea here is that you want to respect Conway's law. Your organizational structure will affect your software architecture and vice versa. It's very important to design teams and architecture and your technology with this in mind. If you want to make sure that some of the components are divided, put them into different teams. If you want to make sure that they will be closer together, put them into the same team. Conway's law works both ways. You can influence work structures through architecture and vice versa.
AI or Not AI? That is the Question
Kyriakos: How do you guys decide about the products that you're going to launch? Because obviously you always have the question of: I might launch an AI product, but it's not really necessary for it to be AI. How do you make these decisions? And then, as a separate question, can you list the AI products that Flo has today?
Roman: Whenever possible, we don't use AI. You don't want to start with AI or machine learning algorithms if the problem can be solved with very simple if-then-else logic or in other, more conventional ways. We use AI only when a problem is impossible to solve without machine learning, or when you reach the limits of the technology and have to go into this more complex territory. Most of the time we start with this idea. The way it typically works is: if you have one type of user, it's very easy to build a system that works for that one type of user. But if you have users of different ages, from different countries, with different goals in mind, with different health conditions, with different preferences, it becomes almost impossible to write all those if-then-elses: if the user is 20 years old, do this; if the user is 25, do that; if the user has polycystic ovary syndrome, go this way; if the user has endometriosis, go that way. It becomes too many ifs, and that's probably the time to introduce something more advanced, maybe based on machine learning or AI models. Using AI and ML is a way to make the service more personalized.
In terms of features based on AI right now: we've been using neural networks at Flo for more than 10 years already. The first feature was simply predicting cycle length based on cycle history. Then we added a lot of different variables, like temperature from wearable devices. Even that same algorithm, we've been working on it for 10 years and it's still evolving. We still have a lot of things to do, and with new wearable devices you get new signals and can improve the algorithm, because machine learning needs data to make useful predictions.
We have multiple other algorithms based on relevancy. For example, when you see content inside Flo, it's personalized based on your preferences and your interactions with the content. Sometimes people say, well, I want to become pregnant. But when we show them content about how to become pregnant, they don't actually read it. They have hidden goals, and it's almost impossible to know about them without complex ML algorithms; but ML algorithms can learn from behavior what is actually interesting to the user.
Recently we started to work a lot with large language models. With large language models, we do many things. I think the most interesting one is a chatbot based on large language models, which is basically ChatGPT but with access to all your data from Flo. We also do a lot of other things. For example, if you have a lot of articles and you need to make sure they're all up to date, you can ask people to read them and validate all the facts, but you can also ask large language models to do the same. We do this constant validation of our content with LLMs. Translation is another big application for large language models. It's not just translating word for word; it's more like adapting the content to a particular audience. You change the tone of voice. You change even small details: in some countries people are used to eating oranges, and in other countries apples, but they're about the same size, so if you need to compare something to that size, you change the fruit based on the country. We also do that with large language models because they're very good at this. You can say: translate this article from this language to that language, but also adapt the content to the audience that will read it in this country. You can also describe the audience, and so on. The models don't just translate; they adapt the content. They basically write it from scratch.
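The translate-and-adapt idea described above boils down to prompt construction. This is a hypothetical sketch of what such a prompt might look like; the template wording, field names, and example audience are all assumptions, not Flo's actual prompts.

```python
# Hypothetical locale-adaptation prompt: the model is asked to
# translate *and* adapt, not to map word for word.
ADAPT_TEMPLATE = """Translate the article below from {src} to {dst}.
Do not translate word for word. Adapt it for this audience: {audience}.
Adjust the tone of voice and swap culture-specific examples (e.g. foods
used for size comparisons) for local equivalents of the same size.

Article:
{article}
"""

def build_adaptation_prompt(article: str, src: str, dst: str,
                            audience: str) -> str:
    # Fill the template with the article and its target audience.
    return ADAPT_TEMPLATE.format(src=src, dst=dst,
                                 audience=audience, article=article)

prompt = build_adaptation_prompt(
    "An ovary is roughly the size of an apricot.",
    src="English", dst="Hindi",
    audience="young adults in India, free-tier users",
)
print(prompt.splitlines()[0])
```

In a pipeline like the one described, the output of this call would then pass through the same medical validation step as any other generated content.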
Temperature Talks: The Data Point That Matters
Kyriakos: You mentioned temperature. I think that was a very important data point for Flow. Can we speak a bit more about why temperature is important? When does it change? How does the algorithm work to predict?
Roman: Temperature is a very important data point for detecting ovulation. The challenge is that you need to measure it during the night, at a particular moment in time. Most users don't have any wearable device on them when they sleep, especially if it's a large watch like yours or mine; people tend to put the device on the charger. It's a tricky parameter to capture.
Kyriakos: Are we speaking about skin or body temperature?
Roman: Conventionally, we need basal body temperature, but we also trained our algorithms on skin temperature from the wrist, and it also works very well with temperature from your finger. Devices like rings are better at capturing temperature, because it's easier to convince someone to sleep with a ring than with a watch. We see that temperature from a finger is more popular among our users than temperature from an Apple Watch or other devices. You capture this data, and it changes over time. If you know the user's temperature, you can detect ovulation. Once you know ovulation, you can also predict the cycle and anticipate certain symptoms, and so on.
Kyriakos: How accurate can you get to a prediction of a cycle or a prediction of pregnancy?
Roman: According to our users, we are the most accurate period tracker. I think we have so many users because a lot of users find us very accurate. We expect to clinically validate this soon. I think we are one of the most accurate period trackers on the market because we have such a large population of users. We've seen every possible scenario, and because of that, we can predict cycles very well. Any other competitor probably has fewer users, and because of that, they haven't seen certain patterns in the world, so they're not accurate for those users. A simple way to think about it: assume you're a Garmin user. That data will probably be biased toward the audience that wears Garmins: people who are really into sports, who like these massive devices, who probably also have a lot of money, because a Garmin costs a lot. You'll have a certain profile of user. Flo is different. We have all possible kinds of users in all possible countries with all possible goals. Different income levels. We have people who don't pay us at all; in more than 50 countries in the world, Flo is free of charge, countries like India and many countries in Africa. We've seen everything. It's a very rare situation when someone has access to such a variety of data. Most other companies probably don't have such access. For example, health and fitness trackers like Oura probably have around six million users, something like 20 times fewer than we have, and the audience is skewed toward a particular segment again. It's hard to find anyone on the market with such a wide understanding of users.
Competition: To Watch or Not to Watch?
Kyriakos: I want to ask you the last question before we convert this into a panel. You mentioned competition. Do you guys look at competition? Do you look at what they do? Do you avoid them? What's your approach?
Roman: We try not to be distracted by competition. We're more interested in what users think about us, what they expect from us, what we need to build for them, and what kind of value we want to deliver in our vision than in competitors. Well, first of all, we don't really have real competitors on the market. Most of them are followers. We set the tone. We are here to innovate. We have the most R&D resources and the brightest minds. Competition is important: it drives the market overall, and it's good to have healthy competition. But I don't think we're that focused on them.
Behind the Curtain: AI-Powered Content and Chatbots
Kyriakos: Can we break down the two products, the content-creation AI and the chatbot, and see how they work in the background?
Roman: Yeah. Do you want to start with content creation?
Kyriakos: Sure.
Roman: We built a lot of tools that help us generate content. By content, I mean not just plain articles, but things that might have a more complex structure, maybe some widgets inside, maybe even a survey. In our case, all of that counts as content. We have certain prompts that we use to generate it. We can say: we are writing content for this particular audience; this is the goal of the content piece; this is what we expect from the article; this is how similar articles perform; these are their metrics. We can instruct a model with a lot of context. We can also feed ground truth into the content generation.
Kyriakos: Is the ultimate goal to show the right piece of content to each individual?
Roman: That's the ultimate goal. You want to show the right content to the right audience at the right moment. For that, you need a very big variety of content, because there are a lot of different users, different moments, and so on. It's almost impossible to rely on humans to generate that amount of content. The biggest challenge is that you also want to be absolutely medically safe and correct; there is no room for mistakes. So we also built systems on top of that to validate generated content, and we have humans in the loop who sign off on any piece of content before it goes into production.
Vlad: I think if you imagine a world where you can produce every piece of content correctly, that would be ideal. But we are definitely not there. One of the other challenges I'd like to highlight is what people call hallucinations when they use LLMs. In some cases, it's not hallucination as such, but essentially a variety of sources talking about the same thing. If you ask what a normal cycle length is, some institutions say it's 28 days, others 29, others 27. All of those answers are technically correct; they're just different methodologies. When an LLM gives you a particular answer, and its answer is different each time, it's not that it's wrong; it's just different data points. We decided for ourselves that we will follow ACOG as the standard for cycle lengths. We need to ensure that the content we produce, regardless of whether it's a video, audio, or an article, actually follows those guidelines consistently. That creates another challenge: when we validate the content, we have to, as Roman said, enhance it with our own knowledge and make sure that we're consistent across our messaging.
Kyriakos: Is there a part where you do labeling? Are there clinicians labeling the output, so they improve the system overall?
Vlad: Yeah, and that's where we can dive a little deeper into chatbots as well. Essentially, we'd like to ensure that our content and interactions are medically safe. Medically safe means something like 20 different things; if you're a medical professional, you know exactly what you'd like to say and what you don't want to say. There are also regulations: we cannot say anything that may be considered medical advice, so we need to be pretty careful in how we position those facts. Creating ground truth is one of the fundamental steps, where we need to ensure that the models, regardless of whether we use an LLM with in-context learning or fine-tune the model, actually follow what we consider a safe and useful scenario. To do that, we created a ground truth: a collection of data labeled by medical professionals. Some people will say, that's great, you have lots of data, it's all labeled. But humans also make mistakes. When we calibrate those data sets, from time to time we find that the AI disagrees with the human, and whoever checks the data can't be confident whether it was the AI or the human who made the mistake. To address this particular challenge, our medical team introduced a process they call a three-person blind test: they give data where the human label and the AI label disagree to three other medical professionals. Those reviewers don't know which answer was labeled by a human and which by the AI. They do a blind test and provide their own labels. Then we look at how those align and choose the right answer, which becomes ground truth for the next iteration. It's a multi-layer process that lets us build up our ground truth and get to a state where we actually correct human mistakes. Because essentially, that's the particular challenge we have: with medical terms and medical content, there's a lot of cognitive load.
If you have to label data, at some point annotators may become distracted and may not pick up all those aspects. It's important to validate their work as well.
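The adjudication step of the three-person blind test can be sketched as a simple majority vote. This is an illustrative reconstruction of the process Vlad describes, not Flo's actual tooling; the label names are hypothetical.

```python
from collections import Counter

def adjudicate(human_label: str, ai_label: str,
               blind_labels: list[str]) -> str:
    # When the human and AI labels agree, there is nothing to resolve.
    if human_label == ai_label:
        return human_label
    # Otherwise, three reviewers label the same item without seeing
    # either original label; the majority answer becomes the new
    # ground truth for the next training iteration.
    votes = Counter(blind_labels)
    label, _count = votes.most_common(1)[0]
    return label

# Here the blind reviewers side with the AI's label, so the original
# human label is overturned.
print(adjudicate("safe", "needs_review",
                 ["needs_review", "needs_review", "safe"]))
```

Using an odd number of blind reviewers guarantees a strict majority on binary labels, which is presumably why the process uses three.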
Kyriakos: At this stage, with today's AIs, would you say that AI makes more mistakes, or clinicians make more mistakes?
Vlad: I think it's a super interesting topic. I'll say this. What we have observed is that, because we work in a particular domain, the domain of women's health, those LLMs are pretty biased against women. There are different theories if you look into the research papers, and they all come to the conclusion that there is, essentially, a lack of data that would allow you to train those LLMs to an extent comparable with men's health. Simple things like aspirin were never tested on women, so we don't know if there are particular side effects applicable to that demographic. I think that gap doesn't really allow you to make a fair comparison between medical professionals and AI. But what we can see is that AI is more consistent. If you explain to the AI what you want to do and why, it will not have fatigue. It will not have attention loss. It becomes pretty consistent in how it processes similar types of requests.
Kyriakos: Given all the bias that you mentioned, what is the actual solution to this? Is it to run more studies with women over time and increase the number of studies you do?
Roman: Yeah, basically, that is the solution. You need more data. You also need to train the models. One of the things we do at Flo: we rely not just on proprietary models like OpenAI's GPT or Gemini models; we also take open-source models, like Llama 70B, for example, and fine-tune them on synthetic data that we generate with our medical team in the loop. We change the models' behavior based on what we actually want from them. It helps us create specialized models that are less biased and also more accurate, safe, and designed for a particular domain: in our case, women's health. Because of that, we can build better models. It's also very important to understand that a lot of the frontier providers are very focused on certain topics, like coding. Everyone is building the best model for writing Python code. Very few companies are focused on women's health. General health, probably; I think everyone is doing that. But specialized models for particular topics get less focus at this stage of development.
Inside Flo's AI: The User Profile and Intent Router
Kyriakos: I saw a presentation from Flo with an awesome schematic of how things work in the background. Can we break it down? First of all, you have a user profile. What does the user profile look like when you embed it into the system? Is it some sort of record, like a table of fields that says, here's the data? What does the input look like for a person?
Roman: I will start, and Vlad will help me. Basically, the user profile is one of the foundational components of our platform. An interesting fact about Flo is that users stay with it for many, many years. People install the app when they're 20 years old and use it for years; at 35, they're still using Flo. During this time, we accumulate knowledge about the user in the user profile. The user profile is a set of parameters that we know about the user: the user's age, what they like, what their goals are, whether they have polycystic ovary syndrome, whether the cycle is regular, the history of cycles. It's just a user ID with a lot of parameters associated with it. Some parameters live a really long life; age, for example, stays the same for a whole year. But other parameters change very frequently, say weight: today you're 75 kilos, tomorrow 76. These are dynamic parameters that change every day. We try to have all types of parameters in the user profile. It's also interesting that the user profile is not a static thing. We have almost like a digital avatar of the user that lives alongside them. Even if the user doesn't open the app, the user profile exists and changes every day; as I already said, age changes, for example. We have a digital avatar that lives on our servers, and its profile changes every day with the user. It allows us to send push notifications and emails to the user. We can say: hey, this is what we know about you, this is what to expect today, and so on. We know this because we recalculate the user profile every day. It has different attributes every day.
Vlad: Probably nothing missed, but I would like to double down on the digital avatar. Imagine for a second that you're not here today but in another century, and we've really mastered AI: you have this fancy AI assistant sitting somewhere in your cortex, where you can just think about something and it gets noted. I think that's our ultimate goal for the user profile and AI. Think about changes in the profile: maybe a user went through a symptom checker a year or two ago and everything was fine. Today something has changed, like temperature or weight. Say there's continuous weight loss; the user may not even be aware of it, because losing a kilo every month isn't noticeable straight away. But when it all goes through AI and machine learning, that may change the whole picture of the symptom. Our end goal is to make our AI proactive. When those changes occur, the application can reach out and say, you know what, we see you're losing weight; we now think you might be at risk of this or that, so maybe go get checked. That's where we're striving to get with the user profile: the ultimate avatar, the data source for all those changes.
Kyriakos: The user has asked this question, and this schematic is basically explaining that there is a router here that determines the intent, and then somehow adapts the question. Can we go into that one as well? How does this router work?
Roman: Let me clarify a little bit. The user asks a question: for example, why do I have headaches, why is my cycle late, why do I have these symptoms. Typically these questions are really short, just a few words, because users don't know what other information might be relevant; they don't provide it as context. But we take that context from the user profile. So first, we take the question from the user and enrich it with the user profile data we have. Then we know the user's intent, and we can route the question to a particular model. The way we think about this: imagine you go to a doctor. First you see a general practitioner, someone who knows a little bit of everything but isn't a master of particular health conditions. They route you to a specialist: the GP can say, oh, you need to see an OB-GYN, or you need to see someone else. We apply the same idea to large language models. The first model's sole goal is to understand what the user wants and which model is best placed to answer. Maybe it's a Gemini model with one prompt, maybe a GPT model with another prompt, maybe a fine-tuned Llama model that we created in-house. The router sends the question to the model designed to handle it. It's a very powerful idea because you can also extend the system very easily. Imagine that today we only have systems that answer questions about periods and cycles, but tomorrow we also want to say something about sleep. You create a new model that can handle sleep questions, then tell the router, this GP model, that from today, whenever it sees a question about sleep, it should go to that specialized model. This way you can add new topics and extend the system.
You don't need to retest previous models because they remain static. Once you build these models, and you're happy with them, you don't need to change them. They remain almost unchanged. You need to monitor them, just to avoid drifts and so on. But they stay static, and then you just have a router and new models that need to be updated, or created, if it's a new model.
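A minimal sketch of this routing idea, with a stub keyword matcher standing in for the real intent-classification LLM. The topics and handler names are hypothetical, not Flo's actual configuration:

```python
# "GP router": classify intent, then dispatch to a topic-specific backend model.
# Extending the system = adding one entry to ROUTES plus one routing rule,
# without retraining or retesting the existing handlers.

ROUTES = {
    "cycle": lambda q: f"[cycle model] {q}",
    "sleep": lambda q: f"[sleep model] {q}",   # a newly added topic
}
DEFAULT = lambda q: f"[general model] {q}"

def classify_intent(question: str) -> str:
    # Stand-in for the routing LLM (the "general practitioner").
    q = question.lower()
    if "cycle" in q or "period" in q:
        return "cycle"
    if "sleep" in q:
        return "sleep"
    return "general"

def route(question: str, profile: dict) -> str:
    # Enrich the short user question with user-profile context before dispatch.
    enriched = f"{question} | context: {profile}"
    handler = ROUTES.get(classify_intent(question), DEFAULT)
    return handler(enriched)

print(route("Why is my cycle late?", {"age": 24, "cycle_is_regular": True}))
```

The design choice mirrored here is the one Roman highlights: specialized handlers stay static, and only the router's rules change when a topic is added.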
Vlad: I'll just add: hopefully you noticed how closely we work with medical teams, given that even our conversations about AI are framed in terms of GPs and medical professionals. I think this concept ultimately has another benefit: our backend models can focus purely on the medical part. The router can also act on the output if needed; it can post-process the results and ensure the tone of voice, the way the model communicates, is consistent and aligned with how we want the conversation to feel.
AI's Role in User Intent and Interaction Continuity
Vlad: On the input side, as Roman mentioned, we use data from the user profile to understand the user's intent. The other important thing is continuity of the interaction with the model. If you spoke about your headache with the model yesterday, the user profile carries that information, so the model knows this isn't the first time you're reporting the headache but a continuation of the problem. That gives it a high success rate in understanding intent.
Choosing the Right Model for the Job
Kyriakos: How does the system decide which one to use? Is there some sort of optimization for eight billion parameters, 70 billion, like what's the difference there?
Roman: If you have seen the presentations where Steve Ballmer keeps saying "developers, developers, developers," in our case it would be "evaluation, evaluation, evaluation." We evaluate models along several dimensions. Two of them are functional: medical safety and usefulness. The third is cost, per token or per conversation. This lets us evaluate proprietary large language models like GPT, Gemini, Anthropic's models, and others. We also fine-tune models to improve their quality and look at the cost to find the best possible combination. The process is continuous: we look at new candidates for a particular topic, and if we have a better candidate, we may run an experiment and see how it performs. Stability is important, especially with medical content, to ensure consistency. Proprietary providers like Google, Anthropic, and OpenAI continuously update their models, and we don't want to be in a situation where they deprecate one version and move everyone to the next, producing different outcomes. We evaluate possible upgrade paths, and if an upgrade path isn't viable, we look at open-source models to maintain stability and consistency.
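The three evaluation dimensions Roman names could be combined along these lines; the scores, prices, and the selection rule (safety as a hard gate, then usefulness per dollar) are purely illustrative, not Flo's actual methodology:

```python
# Hypothetical model-selection sketch over the three stated dimensions:
# medical safety, usefulness, and cost.

candidates = {
    "proprietary-a":  {"safety": 0.98, "usefulness": 0.92, "cost_per_1k_tokens": 0.010},
    "proprietary-b":  {"safety": 0.97, "usefulness": 0.90, "cost_per_1k_tokens": 0.004},
    "fine-tuned-oss": {"safety": 0.99, "usefulness": 0.88, "cost_per_1k_tokens": 0.001},
}

SAFETY_FLOOR = 0.97  # treat medical safety as a gate, not a trade-off

def pick_model(cands: dict) -> str:
    """Filter by the safety floor, then maximize usefulness per unit cost."""
    safe = {k: v for k, v in cands.items() if v["safety"] >= SAFETY_FLOOR}
    return max(safe, key=lambda k: safe[k]["usefulness"] / safe[k]["cost_per_1k_tokens"])

print(pick_model(candidates))
```

Treating safety as a filter rather than a weighted term reflects Roman's point that there is no room for mistakes on the medical side, while cost and usefulness remain tradable.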
Evaluating Model Performance
Kyriakos: In the evaluations you have today, do you have conclusions about what is best in predicting certain things you do today?
Vlad: What we have found is that there is no best model that fits all cases. Depending on the topic and constraints, some models follow instructions better, some are faster, and so on; it's a variety of dimensions. For chatbots, a few seconds don't make a difference. What's more important is that the model takes time to provide a medically grounded answer, which gives users more confidence. But if you need real-time interaction, like a pop-up, it has to react faster. Localization and content creation also require concise output, so when a pop-up is produced by an LLM, the model needs to follow precise instructions, and not all of them do that equally well. For every use case, we evaluate all the models to see which one is best. That's why Roman mentioned almost everything available on the market.
Anonymous Mode: Balancing Privacy and Functionality
Kyriakos: You have the anonymous mode as well. What kind of complexity does this create when you want to provide the best user experience with AI?
Roman: The idea of anonymous mode is simple: we want users to have access to all features while maintaining privacy. To cover situations where data could be compromised, say by a government request, we invented a special mode where users have no identifiers at all. We don't collect emails, phone numbers, IP addresses, or technical identifiers that could be used to identify a user. This way we protect user privacy while keeping all the app's functionality available. It's different from end-to-end encryption, which is good for protecting user data but hard to use for machine learning and building new models. In anonymous mode we get both worlds: we can use machine learning and AI safely and privately. We also have features like mobile client protection with PIN codes and encrypted communication resistant to quantum computers. Even if someone records the traffic from the user's device to our servers, it's impossible to decrypt, because the encryption is post-quantum. We use Oblivious HTTP, a protocol also used by Apple and others, to strip the original IP address and replace it, so requests can't be traced back to the user. We published a white paper on this, and it was named one of Time's Best Inventions, the same year ChatGPT made the list.
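The identifier-free record at the heart of anonymous mode can be illustrated with a toy de-identification step. Field names here are made up, and the real protections Roman mentions (Oblivious HTTP, post-quantum encryption) operate at the network layer and aren't shown:

```python
import secrets

# Direct identifiers that anonymous mode never stores (illustrative list).
DIRECT_IDENTIFIERS = {"email", "phone", "ip_address", "device_id"}

def anonymize(record: dict) -> dict:
    """Drop direct identifiers; key the record by a random, unlinkable ID."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["anon_id"] = secrets.token_hex(16)  # opaque, not derived from the person
    return cleaned

record = {
    "email": "user@example.com",
    "ip_address": "203.0.113.7",
    "cycle_is_regular": True,
    "age_bucket": "25-29",
}
print(anonymize(record))  # keeps health attributes, drops identifiers
```

The key property is the one Roman states: health attributes remain usable for machine learning, while nothing in the stored record can identify the user.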
Handling Downtime and GPU Usage
Kyriakos: Speaking of Cloudflare, they had a massive downtime last week. We were affected, but it's the beauty of using big providers. When everyone is down, you're also kind of allowed to be down. It's like, well, LinkedIn is not working, Flo is not working, so you guys take a coffee.
Kyriakos: I also read that you have access to something like 20,000 hours of H100 GPU time.
Roman: I can explain this. We fine-tune large language models like Llama 70B, which takes time: preparing a dataset, building the pipeline. One fine-tuning run is about 10,000 GPU hours, and we use H100s for that. Each iteration is expensive but reasonable, because we get better models tailored to our needs. GPUs are expensive, and for fine-tuning we work with Databricks infrastructure, which is convenient because you only pay for the GPU hours you need.
Training vs. Inference: Cost and Efficiency
Kyriakos: How do you break down the time spent on training versus inference if you have access to GPUs?
Roman: You train a model once, but not every run is a good one; you sometimes need multiple iterations. When you start a project, you spend more on training and need more GPUs. Once models are trained, you can save on inference. With OpenAI's ChatGPT you need large prompts and you pay for every token; with fine-tuned models you don't need large prompts, because the instructions are embedded in the model. Inference becomes cheaper per request, but serving the model to millions of users still makes total inference spend larger than training. Initially, prompts written with the medical team could be as big as 100,000 tokens before optimization; we run optimization routines to bring that down to a reasonable amount. Fine-tuning also reduces response time, making latency lower and more predictable.
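A back-of-envelope sketch of the prompt-cost argument, with entirely illustrative numbers (not Flo's actual traffic or prices): a fine-tuned model embeds the instructions, so you stop paying for a huge system prompt on every request.

```python
def monthly_prompt_cost(requests: int, prompt_tokens: int, price_per_1k: float) -> float:
    """Input-token cost for one month of traffic."""
    return requests * prompt_tokens / 1000 * price_per_1k

requests_per_month = 10_000_000  # hypothetical traffic

# Proprietary model with a large instruction prompt (100k tokens, pre-optimization).
big_prompt = monthly_prompt_cost(requests_per_month, 100_000, 0.005)

# Fine-tuned model: instructions baked in, only the short user question is sent.
small_prompt = monthly_prompt_cost(requests_per_month, 500, 0.005)

print(f"large-prompt input cost: ${big_prompt:,.0f}/month")   # $5,000,000/month
print(f"fine-tuned input cost:   ${small_prompt:,.0f}/month") # $25,000/month
```

At these made-up rates the prompt shrinkage alone cuts input-token spend by 200x, which is why prompt optimization and fine-tuning dominate the inference bill at scale.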
Build vs. Buy: Strategic Decisions in Tooling
Kyriakos: I've seen that you use a lot of tooling, like Databricks. How do you decide when to build something internally or use tools?
Roman: The idea is simple: dedicate resources to building what's not available and what forms your competitive advantage, and buy everything else. Electricity is a commodity, so everyone buys it. The same applies to servers; everyone has more or less the same servers, and you don't compete on who has the best server. But cycle-prediction accuracy, content, and personalization algorithms are tailored to your app, so you dedicate engineering time to building those. If it's possible to buy, we buy; if not, we build. Most of our engineers work on specialized models and features only available in the Flo app. Sometimes you build commodity things anyway, because of unique privacy requirements or user volume: our analytics and A/B testing are in-house. We use foundational components from Databricks but have our own A/B-testing framework. Sometimes we build with partners like Databricks, extending their services to meet our requirements.
Value Creation vs. Value Capture: Structuring Teams
Kyriakos: I saw you have separate teams for value creation and value capture. How does this work?
Roman: We have 20+ teams divided into groups and streams. One stream is value capture, another is value creation. Value creation includes features that are the reason users use your app, like cycle prediction, calendar, chatbots, and community. To build a sustainable business, you need to capture some of the value created for the user. Value capture teams focus on subscription engines, optimizing user journeys, and onboarding. We aim to have more value creation teams than value capture teams, creating more value than we capture.
The Future of AI in Personalized Health Experiences
Kyriakos: How does AI help users in Flo in two to five years?
Roman: AI helps create the most personalized experience. In two to three years, our app will know users better than they know themselves, providing a super personalized experience and solving problems unique to their situation. AI helps achieve this personalized experience. My hope is that AI will take over the role of general practitioner, taking signals from your body and advising you on what to do next. Instead of going to a GP, you'll be told what you need to do.
Time Series Language Models and Health Predictions
Kyriakos: You talked about building a data representation of the user journey. Do you have any opinion on time series language models?
Roman: It's an interesting question. I don't have a professional opinion, but based on conversations with Google, you need a tremendous amount of users using wearables to do something meaningful. I believe those models will be successful because your body works on patterns. With a lot of time series data representing your body's state, you have a high chance of predicting conditions. This is the future for diagnostic devices that can pick up early stages of conditions based on sensors. More sophisticated devices tracking body signals will become regular, affordable commodities.
Addressing Stigmatized Topics in Women's Health
Audience Member: I'm part of the Imperial College's research team working on digital health for women's health. What challenges did you have with Flo from the beginning on approaching features on stigmatized topics?
Roman: We follow World Health Organization guidelines and rely on science for guidance. Some topics become political, which is part of why we have anonymous mode. In some countries, reproductive health is taboo. We invest in these topics, educate users, and created a partner mode to educate all genders. We aim to be scientific, unbiased, and to speak to all audiences. We've made good progress on this.
Vlad: We also have a feature called secret chats, like an anonymous Reddit for people to discuss sensitive topics. It creates a space for advice, moral support, or clarification on taboo topics.
AI in Code Writing: Policies and Practices
Audience Member: How are you using AI to help with code writing?
Roman: We're pragmatic. We don't force AI use, but we allow it. We have tools like GitHub Copilot, Claude, Codex from OpenAI, and Gemini. We measure the impact on developer productivity and educate engineers on what works best. 100% of our code is reviewed by AI, and about 25% is generated or touched by AI. We see a productivity boost, and the space changes dramatically every few months. Everyone has access to the same tools and competition still exists; it's about being a good driver with the tools.
Kyriakos: As a manager, I don't write code daily, but I can read PRs. AI helps me understand code quickly without bothering engineers. It's a significant time-saver for interpreting and understanding PRs.
Evaluating Prompts and Using RAG
Audience Member: Can you give examples of how you set evals for your prompts and do you use RAG?
Roman: Our evaluation is multi-layered. We evaluate prompts using several frameworks and partners to understand their effectiveness. We run continuous evaluations to improve prompts and models before releasing them to users. We have offline evaluation pipelines to monitor human-AI alignment. For RAG, we use it across the company for documents. We inject context for evaluation via RAG, using our Floppedia as a standard. We use RAG for legal and content queries. If you have specific questions, we can discuss them offline.
Vlad: We use LLM-as-a-judge, an architectural pattern where large language models check other models. Our medical team creates judges, and other teams are educated to do the same. It's a process that requires validation, but it allows judges to be created efficiently.
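The LLM-as-a-judge pattern described above can be sketched with a stub judge standing in for the real LLM call; in practice the judge would be an LLM prompted with a rubric written by the medical team. All names and the scoring logic here are illustrative:

```python
# Hypothetical rubric a medical team might encode in the judge prompt.
JUDGE_RUBRIC = (
    "Score the answer 0-1 for medical safety. "
    "Penalize firm diagnoses, dosages, or advice to skip seeing a doctor."
)

def stub_judge(question: str, answer: str, rubric: str) -> float:
    # Stand-in for an LLM judge call; flags obviously unsafe phrasing.
    unsafe = ["you definitely have", "no need to see a doctor"]
    return 0.0 if any(p in answer.lower() for p in unsafe) else 1.0

def evaluate(pairs, judge=stub_judge) -> float:
    """Mean judge score over (question, answer) pairs from the model under test."""
    scores = [judge(q, a, JUDGE_RUBRIC) for q, a in pairs]
    return sum(scores) / len(scores)

pairs = [
    ("Why do I have headaches?", "Headaches can have many causes; consider..."),
    ("Is this serious?", "You definitely have migraine, no need to see a doctor."),
]
print(evaluate(pairs))  # one safe answer, one flagged -> 0.5
```

Swapping `stub_judge` for a real LLM call keeps the evaluation loop unchanged, which is what makes the pattern cheap to extend to new judges.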
Exploring Wearables and Market Expansion
Audience Member: Have you considered creating your own wearable for developing markets?
Roman: We constantly think about it, but it's not a top priority. There are many device manufacturers, and we rely on them. We don't believe users will have multiple watches. We're constrained by the fact that users have two hands. For now, it's not the time to enter this market.
Compliance Challenges in Global Health Apps
Audience Member: How do you ensure compliance with different countries' regulations?
Roman: We operate in almost all countries except China and a few others. We comply with regulations to avoid being removed from app stores. We prioritize user privacy and use GDPR as a baseline. Anonymous mode helps with compliance, as we can't provide data we don't have. We rely on technology to protect users rather than regulations. HIPAA compliance is a challenge, but we're not currently required to comply. Regulations can be contradictory, but we've been successful so far.
Kyriakos: Regarding China, the great firewall affects our app's performance. Our servers are in the US, and accessing the app from China is slow. We don't have servers in China or an ICP license, so we don't focus on that market.
The Role of Human Judgment in Health Tech
Audience Member: As you automate more with LLMs and agents, where will human judgment remain essential in health tech?
Roman: I believe technology can't displace anyone. In manufacturing, machinery didn't reduce jobs; it increased productivity. Professionals in any industry should embrace technology and learn to use it efficiently. Those who adapt will be successful. Embrace change, learn new ways, and no tool will displace you. You'll be a master of the tool and more productive.
Building Trust in Sensitive Data Handling
Audience Member: How do you convince users that their data is secure and protected?
Roman: Trust is earned through hard work and third-party audits. We have double ISO certification in security and privacy. We maintain user privacy with features like anonymous mode and pin codes. We don't do much advertising on privacy and security because it's not exciting, but specialists know about it and recommend us. Trust is built through word of mouth and hard work.
Audience Member: Do you have issues with users not wanting to use chatbots due to privacy concerns?
Roman: It's a challenge, but we focus on building trust through our practices and certifications.
Privacy Policies and User Comfort: A Balancing Act
Roman: It's not really a big problem, actually. First of all, we have a very transparent privacy policy and an explicit consent screen where we describe how the data will be used, what kind of data will be used, and so on. We design our chatbots with a zero-data-retention policy with all the proprietary model providers like Google and OpenAI, so it's a very private experience; the data will not be stored on servers like Google's. Internally, we apply the same security and privacy standards to chatbots as to any other feature. We also rely a lot on anonymization and de-identification, and all the GDPR rights are in place: users can delete their data or access it if they want. So we don't really see a lot of issues with that. But what would you add?
Vlad: I'll add two points, one technical and one more social. We do a lot of fine-tuning, but we don't use any user data for it; we synthesize data based on user patterns. For example, our anonymous secret chats have no identifiers whatsoever, but we can synthesize data based on those chats, creating synthetic dialogues about what people are discussing, to train the model. That's the technical part. The other part is a bit sad. When you think about users who might be uncomfortable interacting with a chatbot, if you dive into the papers on this, you find it actually works the other way around: people are more comfortable talking to machines than to humans. I read a paper about the US where quite a big percentage of women didn't want to consult a doctor because of previous experiences of mistreatment or abuse; they were more comfortable going to ChatGPT for advice than actually consulting a doctor. It's a sad reality, but it is how it is.
Roman: I can actually share one more story on that. In our chatbot, we tested different icons. When users talk to our chatbot, they see a little icon that represents our chatbot. We tested a photo of a medical expert and then just our logo. We found that users are more comfortable talking with our logo than with the image of a human. It reflects what Vlad said: users are more comfortable talking to artificial intelligence than to actual humans.
Wellness vs. Medical: The AI Dilemma
Kyriakos: You've mentioned medical topics and GPs, and even the chatbot overtaking GPs. My understanding is that both Terra and Flo are wellness apps. Do you see a point where they need to become medical apps or medical devices? Do you see AI influencing that? What are the big pros and cons of going down the medical-device route?
Roman: As of today, Flo is more like a wellness app, and we try not to cross that line, not to give medical advice or replace doctors. But we've also started working on a few features that will be considered a medical device. Right now we're working with the FDA, as the first market where we're going to launch those features, to certify them; in our case it's software as a medical device. We will be able to give medical advice to end users. I think it's a very natural evolution of the system. You start with simple topics and advice, maybe simple content, then you build more advanced features like symptom checkers, and then even more advanced ones like medical software algorithms. You have to, because if you want to solve real problems, you need to go that way. It's expensive and takes a lot of time and effort. For example, the clinical studies we're starting in January will run for a couple of years, and we're spending millions of dollars on them. It's expensive, but it's something we have to do to serve our users properly.
Building a Fortress: Defending Against Giants
Kyriakos: From founding to now, how do you build up and strengthen defensibility, layer by layer? For example, if SoftBank now throws a few hundred million at another company with a similar data layer and some AI marketing, how do you defend against that?
Roman: Well, first of all, good luck, because it's important to understand that to date we have already invested hundreds of millions into R&D just to build the product. Anyone wanting to replicate this already has to invest a lot of money. And it's not just replicating the functionality; you need to build a learning machine. The beauty of Flo is that we run all these A/B tests constantly, so we have a system that constantly gets better. You'd need to replicate that learning machine as well; otherwise you'd replicate Flo as of today, and by the time you finished building a Flo copy, we'd be at a different stage. You'd also probably copy a lot of the mistakes we have right now; not all our experiments are successful. It's really hard to just throw money at it and build the same thing.
Roman: Another defense is our large audience: you'd need to convince a lot of users to switch from one app to another, and you'd also need data from those users. Imagine you want to train algorithms as we did. It's not just convincing someone to switch apps; you need to convince them to use the app for many years to collect the amount of user data we have. Only then can you build models on that information. It's hard, almost practically impossible, to replicate Flo at this stage; it's too big. Someone who has owned the market for many years probably has more of a chance than SoftBank.
Roman: There's also another aspect, which is more social. We have quite strong social capital. Flo as a company is driven by a mission, and that mission makes people dedicated; they genuinely want to make the world better, especially for women. If you build a company with the single goal of replicating a product, you'll have completely different social capital and be driven by different things, which will probably lead you to failure, to be honest.
Kyriakos: Awesome, Roman, Vlad, thank you so much for the discussion.
Roman: Thank you.
Vlad: Thank you.



