I'll just add, hopefully you notice how closely you're working with medical teams, that even our conversations about AI are tied to GPs and medical professionals. But I would like to also add yours here. I think that this concept ultimately also has another benefit: our backend models can focus on the medical part of things. One other feature is that the router can act if needed. It can process their results and ensure that the tone of voice, the way the model communicates, is consistent and aligned with what we would like this conversation to be.
Kyriakos: As an input, we also look at, like Roman mentioned, data coming from user profiles to understand the user intent. The other thing that is quite important is the continuity of their interaction with the model. If you spoke about your headache with the model yesterday, the user profile will have this information, so the model will remember that it's not the first time you're reporting the headache but a continuation of the problem. This allows it to have a high success rate in understanding the intent.
Kyriakos: How does the system decide which one to use? Is there some sort of optimization for 8 billion parameters versus 70 billion? What's the difference there?
Roman: If you have seen presentations with Steve Ballmer, who was always saying "developers, developers, developers," in our case it'll be "evaluation, evaluation, and evaluation." We run evaluation of the model against several dimensions. Two of them are more functional: medical safety and usefulness. The third one is cost, per token or per conversation. This allows us to evaluate proprietary large language models like GPT, Gemini, Anthropic, and others. We also fine-tune models to improve their quality and look at the cost to find the best possible combination. This process is continuous.
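The three evaluation dimensions described here (medical safety, usefulness, cost) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the candidate names, the scores, the safety floor, and the way usefulness is traded off against cost are hypothetical and are not Flo's actual evaluation method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str            # hypothetical candidate name
    safety: float         # medical safety score in [0, 1] (assumed scale)
    usefulness: float     # usefulness score in [0, 1] (assumed scale)
    cost_per_conv: float  # cost per conversation in USD (made-up numbers)


def pick_model(results, min_safety=0.95, cost_weight=0.5):
    """Drop candidates below the safety floor, then rank the rest by
    usefulness minus a cost penalty. Both thresholds are assumptions."""
    safe = [r for r in results if r.safety >= min_safety]
    if not safe:
        raise ValueError("no candidate meets the safety floor")
    return max(safe, key=lambda r: r.usefulness - cost_weight * r.cost_per_conv)


# Hypothetical evaluation results for three candidates.
candidates = [
    EvalResult("large-proprietary", safety=0.98, usefulness=0.92, cost_per_conv=0.40),
    EvalResult("fine-tuned-8b", safety=0.97, usefulness=0.88, cost_per_conv=0.03),
    EvalResult("cheap-baseline", safety=0.90, usefulness=0.85, cost_per_conv=0.01),
]

best = pick_model(candidates)
print(best.model)
```

In this sketch safety acts as a hard gate rather than one more term in the score, so a cheap model can never win by being unsafe; setting `cost_weight=0.0` ranks the remaining candidates on usefulness alone.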
We look at new candidates for a particular topic, and if we have a better candidate, we may run an experiment and see how it performs. Stability is important, especially with medical staff, to ensure consistency. Proprietary providers like Google, Anthropic, or OpenAI continuously update their models. We don't want to be in a situation where they deprecate GPT-5 and move everyone to 5.1, resulting in different outcomes. We evaluate possible upgrade paths, and if an upgrade path is not possible, we look at open-source models to maintain stability and consistency.
Kyriakos: In the evaluations you have today, do you have conclusions about which model is best at predicting certain things you do today?
Roman: What we have found is that there is no single best model that fits all cases. Depending on the topics and constraints, some models follow instructions better, some may be faster, and so on. It's a variety of dimensions. For chatbots, several seconds don't make a difference. What's more important is that the model takes time to provide a medically related answer, giving users more confidence. But if you need real-time interaction, like a pop-up, it has to react faster. Localization or content creation also requires concise pop-ups, so when the text is produced by an LLM, it needs to follow precise instructions. Not all models do it the same way. For every use case, we evaluate all models to see which one is best. That's why Roman mentioned almost everything available on the market.
Kyriakos: You have the anonymous mode as well. What kind of complexity does this create when you want to provide the best user experience with AI?
Roman: The idea of anonymous mode is simple. We want to ensure users have access to all features while maintaining privacy. In case our data is compromised, maybe by a government request or in other situations, we invented a special mode where users have no identifiers at all.
We don't collect emails, phone numbers, IP addresses, or technical identifiers that can be used to identify a user. This way, we protect user privacy while keeping all app functionality available. It's different from end-to-end encryption, which is good for protecting user data but hard to use for machine learning and building new models. In anonymous mode, we have both worlds working together. We use machine learning and AI safely and privately. We also have cool features like mobile client protection with PIN codes and encrypted communication resistant to quantum computers. Even if someone records the traffic from the user's device to our servers, it's impossible to decrypt because the encryption is post-quantum. We use Oblivious HTTP, a protocol used by Apple and others, to remove original IP addresses and replace them with fake ones, ensuring we can't trace back to the user. We published a white paper on this and won awards like a Best Invention from TIME, the same year ChatGPT won.
Kyriakos: Speaking of Cloudflare, they had a massive downtime last week. We were affected, but it's the beauty of using big providers. When everyone is down, you're also kind of allowed to be down. It's like, well, LinkedIn is not working, Flo is not working, so you guys take a coffee.
Kyriakos: I also read that you have access to 20,000 H100 GPU hours.
Roman: I can explain this. We fine-tune large language models like Llama 70B, and it takes time to prepare a dataset and create the pipeline. One run of fine-tuning is about 10,000 GPU hours, and we use H100s for that. Each iteration is expensive but reasonable. We get better models tailored to our needs. GPUs are expensive, and for fine-tuning, we work with Databricks infrastructure, which is convenient because you only pay for the GPU hours you need.
Kyriakos: How do you break down the time spent on training versus inference if you have access to GPUs?
Roman: You train a model once, but not every training run is a good one.
You sometimes need multiple iterations. When you start your project, you spend more on training and need more GPUs. Once models are trained, you can save on inference. With OpenAI ChatGPT, you need large prompts, paying for every token. Fine-tuning models means you don't need large prompts, as instructions are embedded in the model. Inference becomes cheaper per request, but serving the model to millions of users still makes inference more expensive than training overall. Initially, prompts written with the medical team could be as big as 100,000 tokens before optimization. We run optimization routines to decrease this to a reasonable amount. Fine-tuning also reduces response time, making latency lower and more predictable.
Kyriakos: I've seen that you use a lot of tooling, like Databricks. How do you decide when to build something internally or use tools?
Roman: The idea is simple. Dedicate resources to building what's not available and to building your competitive advantage. Buy everything else. For example, electricity is a commodity, so everyone buys it. The same applies to servers. Everyone has more or less the same servers. You don't compete on who has the best server. But cycle prediction accuracy, content, and personalization algorithms are tailored to your app, so you dedicate engineering time to build those. If it's possible to buy, we buy it. If not, we build it. Most engineers work on specialized models and features only available in the Flo app. Sometimes you build commodity things due to unique privacy requirements or user volume. Our analytics and A/B testing are in-house. We use foundational components from Databricks but have our own A/B testing framework. Sometimes we build with partners, like Databricks, to enhance their services to meet our requirements.
Kyriakos: I saw you have separate teams for value creation and value capture. How does this work?
Roman: We have 20+ teams divided into groups and streams. One stream is value capture, another is value creation.
Value creation includes features that are the reason users use your app, like cycle prediction, calendar, chatbots, and community. To build a sustainable business, you need to capture some of the value created for the user. Value capture teams focus on subscription engines, optimizing user journeys, and onboarding. We aim to have more value creation teams than value capture teams, creating more value than we capture.
Kyriakos: How will AI help Flo users in two to five years?
Roman: AI helps create the most personalized experience. In two to three years, our app will know users better than they know themselves, providing a super personalized experience and solving problems unique to their situation. My hope is that AI will take over the role of the general practitioner, taking signals from your body and advising you on what to do next. Instead of going to a GP, you'll be told what you need to do.
Kyriakos: You talked about building a data representation of the user journey. Do you have any opinion on time series language models?
Roman: It's an interesting question. I don't have a professional opinion, but based on conversations with Google, you need a tremendous number of users using wearables to do something meaningful. I believe those models will be successful because your body works on patterns. With a lot of time series data representing your body's state, you have a high chance of predicting conditions. This is the future for diagnostic devices that can pick up early stages of conditions based on sensors. More sophisticated devices tracking body signals will become regular, affordable commodities.
Audience Member: I'm part of Imperial College's research team working on digital health for women's health. What challenges did you have with Flo from the beginning in approaching features on stigmatized topics?
Roman: We follow World Health Organization guidelines and rely on science for guidance.
Some topics become political, which is why we have anonymous mode. In some countries, reproductive health is taboo. We invest in these topics, educate users, and created a partner mode to educate all genders. We aim to be scientific, unbiased, and talk to all audiences. We have made good progress on this topic.
Kyriakos: We also have a feature called secret chats, like an anonymous Reddit for people to discuss sensitive topics. It creates a space for advice, moral support, or clarification on taboo topics.
Audience Member: How are you using AI to help with code writing?
Roman: We're pragmatic. We don't force AI use but allow it. We have tools like GitHub Copilot, Claude, Codex from ChatGPT, and Gemini. We measure the impact on developer productivity and educate engineers on what works better. 100% of our code is reviewed by AI, and 25% is generated or touched by AI. We see a productivity boost, and the space changes dramatically every few months. Everyone has access to the same tools, and competition still exists. It's about being a good driver with the tools.
Kyriakos: As a manager, I don't write code daily, but I can read PRs. AI helps me understand code quickly without bothering engineers. It's a significant time-saver for interpreting and understanding PRs.
Audience Member: Can you give examples of how you set up evals for your prompts, and do you use RAG?
Roman: Our evaluation is multi-layered. We evaluate prompts using several frameworks and partners to understand their effectiveness. We run continuous evaluations to improve prompts and models before releasing them to users. We have offline evaluation pipelines to monitor human-AI alignment. As for RAG, we use it across the company for documents. We inject context for evaluation via RAG, using our Floppedia as a standard. We use RAG for legal and content queries. If you have specific questions, we can discuss them offline.
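One way to picture the offline monitoring of human-AI alignment mentioned in this answer is to compare an automated evaluator's verdicts with labels from human reviewers on the same responses. The sketch below is an assumption, not the pipeline described by the speakers: the labels are invented, and the choice of raw agreement plus Cohen's kappa as metrics is ours.

```python
from collections import Counter


def agreement_rate(human, automated):
    """Fraction of items where the two label lists agree."""
    if len(human) != len(automated) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == a for h, a in zip(human, automated)) / len(human)


def cohens_kappa(human, automated):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement_rate(human, automated)
    h_counts, a_counts = Counter(human), Counter(automated)
    labels = set(human) | set(automated)
    # Expected agreement if both raters labeled independently at their own rates.
    p_e = sum(h_counts[label] * a_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented labels for eight model responses (hypothetical data).
human_labels = ["safe", "safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]
auto_labels = ["safe", "safe", "unsafe", "unsafe", "unsafe", "safe", "safe", "safe"]

rate = agreement_rate(human_labels, auto_labels)
kappa = cohens_kappa(human_labels, auto_labels)
print(f"agreement={rate:.2f} kappa={kappa:.2f}")
```

Kappa is reported next to raw agreement because two raters who both say "safe" most of the time will agree often by chance alone; a drop in kappa between runs would be a signal to re-validate the automated evaluator against human review.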
Kyriakos: We use LLM-as-a-judge, an architectural pattern where large language models check other models. Our medical team creates judges, and other teams are educated to do the same. It's a process that requires validation but allows for efficient creation of judges.
Audience Member: Have you considered creating your own wearable for developing markets?
Roman: We constantly think about it, but it's not a top priority. There are many device manufacturers, and we rely on them. We don't believe users will have multiple watches. We're constrained by the fact that users have two hands. For now, it's not the time to enter this market.
Audience Member: How do you ensure compliance with different countries' regulations?
Roman: We operate in almost all countries except China and a few others. We comply with regulations to avoid being removed from app stores. We prioritize user privacy and use GDPR as a baseline. Anonymous mode helps with compliance, as we can't provide data we don't have. We rely on technology to protect users rather than on regulations. HIPAA compliance is a challenge, but we're not currently required to comply. Regulations can be contradictory, but we've been successful so far.
Kyriakos: Regarding China, the Great Firewall affects our app's performance. Our servers are in the US, and accessing the app from China is slow. We don't have servers in China or an ICP license, so we don't focus on that market.
Audience Member: As you automate more with LLMs and agents, where will human judgment remain essential in health tech?
Roman: I believe technology can't displace anyone. In manufacturing, machinery didn't reduce jobs; it increased productivity. Professionals in any industry should embrace technology and learn to use it efficiently. Those who adapt will be successful. Embrace change, learn new ways, and no tool will displace you. You'll be a master of the tool and more productive.
Audience Member: How do you convince users that their data is secure and protected?
Roman: Trust is earned through hard work and third-party audits. We have double ISO certification in security and privacy. We maintain user privacy with features like anonymous mode and PIN codes. We don't do much advertising on privacy and security because it's not exciting, but specialists know about it and recommend us. Trust is built through word of mouth and hard work.
Audience Member: Do you have issues with users not wanting to use chatbots due to privacy concerns?
Roman: It's a challenge, but we focus on building trust through our practices and certifications.
Kyriakos: First of all, we have a very transparent privacy policy and an implicit consent screen where we describe how the data will be used, what kind of data will be used, and so on. We design our chatbots so that we have a zero data retention policy with all the proprietary model providers like Google and OpenAI. Because of that, it's a very private experience; the data will not be stored on servers like Google's. Internally, we apply the same security and privacy standards to chatbots as to any other feature. We also rely a lot on anonymization and de-identification. We have all the GDPR rights in place: users can delete their data if they want, access their data if they want, and so on. We don't really see a lot of issues with that. But what do you want to add?
Roman: I'll add two points. One is technical and one is a bit more social. We do a lot of fine-tuning, but we don't use any user data to do the fine-tuning. We synthesize data based on user patterns. For example, our anonymous chats, like secret chats, have no identifiers whatsoever. But we can synthesize some data based on those chats to create synthetic dialogues of what people are discussing and use them to train the model. That's the technical part.
The other part is a bit sad. When you think about users who might be uncomfortable interacting with the chatbot, if you dive into papers describing this, you'll find that it actually works the other way around. People are more comfortable talking to machines than to humans. I read a paper about the US where quite a big percentage of women didn't want to consult a doctor because of previous experiences of mistreatment or abuse. They were more comfortable going to ChatGPT for advice than consulting a doctor. It's a sad reality, but it is how it is.
Roman: I can actually share one more story on that. In our chatbot, we tested different icons. When users talk to our chatbot, they see a little icon that represents it. We tested a photo of a medical expert and then just our logo. We found that users are more comfortable talking with our logo than with the image of a human. It reflects what Vlad said: users are more comfortable talking to artificial intelligence than to actual humans.
Kyriakos: You've mentioned medical and GPs. You even mentioned the chatbot overtaking GPs. My understanding is that both Terra and Flo are wellness apps. Do you see a point where they need to be medical apps or medical devices? Do you see AI influencing that? What are the big pros and cons of going down the medical device route?
Roman: As of today, Flo is more like a wellness app, and we try not to cross this line, not to give medical advice or replace doctors. But we have also started to work on a few features that will be considered a medical device. Right now, we are working with the FDA to certify those features, with the US as the first market where we are going to launch them. In our case, it's software as a medical device. We will be able to give medical advice to end users. I think it's a very natural evolution of our system.
You start with simple topics and advice, maybe simple content, but then you build more advanced features like symptom checkers and even more advanced features like medical software algorithms. You have to, because if you want to solve real problems, you need to go that way. It's expensive and takes a lot of time and effort. For example, the clinical studies that we're starting in January will run for a couple of years, and we're spending millions of dollars on this. It's expensive, but I think it's something we have to do to serve our users properly.
Kyriakos: From founding to now, how do you build up and strengthen defensibility layer by layer? For example, if SoftBank now throws a few hundred million at another company and they use a similar data layer and some AI marketing, how do you defend against that?
Roman: Well, first of all, good luck, because it's very important to understand that to date we have already invested hundreds of millions into R&D just to build the product. If someone wants to replicate this, they already have to invest a lot of money. It's not just replicating the same functionality; you need to build a learning machine. The beauty of Flo is that we run all these A/B tests constantly, and we have a system that constantly becomes better. You need to replicate this learning machine as well. Otherwise, you will replicate Flo as of today, but by the time you finish building a Flo copy, we'll be at a different stage. They will also probably copy a lot of the mistakes that we have right now. Not all our experiments are successful. It's really hard to just throw money at it and build the same thing.
Roman: Another defense we have is our large audience, and you need to convince a lot of users to switch from one app to another. You also need to have data from those users. Imagine you want to train some algorithms as we did.
It's not just convincing someone to switch from one app to another; you also need to convince them to use this app for many years to collect the amount of user data that we have. Then you can build the models based on this information. It's hard, practically impossible, to replicate Flo at this stage. It's too big. Someone who has owned the market for many years probably has a better chance than SoftBank.
Roman: I think there's another aspect, which is more of a social aspect. We have quite good social capital. If you think about Flo as a company, it's driven by a mission. Essentially, that mission makes people dedicated. They actually would like to make the world better, especially for women. If you have a company with one goal, to replicate the product, you will have completely different social capital in that company, and you'll be driven by different things, which will probably lead you to failure, to be honest.
Kyriakos: Awesome, Roman, Vlad, thank you so much for the discussion.
Roman: Thank you.
Vlad: Thank you.