Benchmarking Wearable Sleep Data Reliability

November 28, 2025

  • Garmin produces the most consistent data – it shows the longest average sleep duration, the lowest variability, and very low outlier rates, making it the most reliable for long-term trend tracking.
  • Apple consistently underestimates deep sleep and shows the most data quality issues – including missing sleep stage classifications, unusually high outlier rates, and even impossible values, suggesting its sleep staging is less reliable.
  • Oura and Fitbit add valuable nap detection but increase variability – Oura, in particular, records many short sleep episodes, which lower the average sleep time until they’re filtered out. Fitbit is cleaner overall, striking a balance between accurate nap detection and stable results.

Interested in your sleep? How accurate is your wearable sleep data?

Sleep is one of the most fundamental processes for recovery, health, and performance. When people ask me about the newest recovery tool, I like to quip, “It may help, but nothing makes up for bad sleep. Recovery is 90% sleep, 9% nutrition and everything else makes up the rest.” This may be a slightly facetious exaggeration, but it’s not too far from the truth.

As more of us turn to wearables each night to track our rest, it’s natural to wonder just how consistent these devices are. We analysed data from thousands of users: 2,581 from Oura, 1,086 from Apple, 620 from Fitbit, and 669 from Garmin. We can’t be entirely sure that each device’s users are similar, or that each device sample is representative of the broader population, although the data is drawn from two similar platforms. Garmin users, for example, could be more serious athletes than Apple users and thus might sleep more!

Total Sleep Time: Who Gives You the Longest Night?

When it comes to total sleep time, Garmin users are the most well-rested. On average, Garmin recorded 7.17 hours of nightly sleep, the highest among the four platforms. Oura users, by contrast, clocked just 6.34 hours, the lowest of the group. Apple and Fitbit fell in the middle at 6.68 and 6.39 hours, respectively.

The range of recorded values tells a different story. Apple data showed astonishing variability, with some nights apparently lasting over 51 hours—clearly impossible. Garmin, by contrast, had the tightest spread and the lowest standard deviation. In plain terms, Garmin’s numbers were the most consistent, while Apple’s showed a greater tendency towards extremes. Fitbit and Oura sat between these two poles, although Oura’s inclusion of naps and multiple events per day further inflated variability.

Figure 1: Average Total Sleep Time with Variability. The mean total sleep time (TST) recorded by each platform, with error bars representing variability (standard deviation). Garmin users averaged the longest sleep at just over 7 hours, while Oura recorded the shortest at 6.3 hours. Apple and Fitbit fell in between. Garmin also had the most consistent readings (tight error bars), whereas Apple and Oura showed greater variability. These differences reflect both device algorithms and the inclusion of naps or multiple short events on some platforms.
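As a rough illustration of how the per-platform figures above can be computed, here is a minimal sketch using only the Python standard library. The record layout, the function name `summarise_tst`, and the toy values are assumptions made for this example; they are not the study data or the pipeline actually used.

```python
from statistics import mean, stdev

# Hypothetical toy records: (platform, total sleep time in hours).
# These numbers are invented for illustration only.
records = [
    ("garmin", 7.0), ("garmin", 7.5), ("garmin", 6.5),
    ("apple", 6.0), ("apple", 9.0), ("apple", 3.0),
]

def summarise_tst(records):
    """Return {platform: (mean_hours, sample_sd_hours)}."""
    by_platform = {}
    for platform, hours in records:
        by_platform.setdefault(platform, []).append(hours)
    # stdev() is the sample standard deviation and needs >= 2 values.
    return {p: (mean(v), stdev(v)) for p, v in by_platform.items()}

summary = summarise_tst(records)
for platform, (avg, sd) in summary.items():
    print(f"{platform}: mean={avg:.2f} h, sd={sd:.2f} h")
```

A tight standard deviation, as in the toy Garmin rows, corresponds to the short error bars in Figure 1.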

The Composition of Sleep: REM, Light, and Deep

Beyond duration, wearables also attempt to break sleep into stages. REM sleep, the stage linked to dreaming and memory consolidation, was consistent across platforms, ranging from 16.4% (Garmin) to 21.4% (Fitbit) of total sleep time. Light sleep dominated across the board, accounting for around 60–66% of the night. Deep sleep, associated with recovery and growth, revealed the most striking difference. Apple recorded an average of only 10.5% deep sleep, compared with nearly 18% for Garmin, Fitbit, and Oura. This suggests Apple’s algorithms may be systematically underestimating deep sleep.

When sleep is filtered to exclude short naps (less than four hours), the percentages settle further. Garmin and Fitbit maintained a deep sleep rate of nearly 18%, Oura came in at 17.8%, while Apple remained at 10.5%. In other words, no matter how you slice it, Apple stands apart in reporting less deep sleep than other devices.

Figure 2: Sleep Stage Composition by Platform (≥4h episodes). The proportion of REM, Light, and Deep sleep detected by each wearable. The totals should sum to ~100% of sleep, but Apple consistently under-reports by around 10%, leaving some sleep time unclassified. Across the other platforms, REM sleep is steady (16–22%), Light sleep hovers around 60–66%, and Deep sleep sits at ~18%. Apple stands out with markedly lower Deep sleep (10.5%), highlighting a fundamental difference in how its algorithm stages sleep.
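The stage percentages, and the unclassified remainder noted for Apple, come down to simple arithmetic. The sketch below assumes each record exposes REM, light, deep, and total sleep in hours; the function name `stage_shares` and the example night are hypothetical.

```python
def stage_shares(rem_h, light_h, deep_h, total_h):
    """Return each stage as a percentage of total sleep,
    plus whatever share of the night is left unclassified."""
    if total_h <= 0:
        raise ValueError("total_h must be positive")
    shares = {
        "rem": 100 * rem_h / total_h,
        "light": 100 * light_h / total_h,
        "deep": 100 * deep_h / total_h,
    }
    # If the stages do not add up to the total, the gap is unclassified time.
    shares["unclassified"] = 100 - sum(shares.values())
    return shares

# Invented example: an 8-hour night where 0.8 h has no stage label.
night = stage_shares(rem_h=1.6, light_h=4.8, deep_h=0.8, total_h=8.0)
print({k: round(v, 1) for k, v in night.items()})
```

A consistently positive `unclassified` value across records would match the pattern described for Apple in Figure 2.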

The Outlier Problem

No dataset is perfect, and wearables are no exception. A significant amount of smoothing already occurs in the data processing pipeline to exclude outliers. We ran an outlier detection test using the standard 1.5 × IQR method to highlight how each platform handles noisy or erroneous values.
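For readers unfamiliar with it, the 1.5 × IQR rule flags any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1. The following is a minimal sketch of that rule; the function name `iqr_outliers`, the choice of the "inclusive" quartile convention, and the sample values are assumptions for illustration, not details taken from the analysis.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartile conventions differ between tools; "inclusive" is one common choice.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented total-sleep-time values in hours, including two suspicious records.
tst_hours = [6.5, 7.0, 7.2, 6.8, 7.1, 6.9, 7.3, 51.0, 0.7]
flagged = iqr_outliers(tst_hours)
outlier_rate_pct = 100 * len(flagged) / len(tst_hours)
print(flagged, f"{outlier_rate_pct:.1f}%")
```

Because the fences are derived from the quartiles rather than the mean, a single 51-hour record does not stretch the thresholds the way it would with a mean ± SD rule.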

Oura initially showed the highest rate of outliers, particularly in total sleep time, where 13.3% of values fell outside the expected range. Apple, while more stable in total sleep time, produced outliers in specific sleep stages—light and REM in particular. In fact, Apple displayed the single highest outlier rate for any metric: 4.33% in light sleep. Garmin, by contrast, had the lowest overall outlier rates, with less than 1.5% across most categories. Fitbit performed similarly well, showing low and balanced outlier levels.

We then removed short sleeps (naps) by dropping every event under 4 hours long. At this point Fitbit and Garmin emerged as the cleanest platforms. Oura improved substantially, dropping its total sleep time outliers to 1.43%, while Apple continued to struggle, particularly with stage classification. The presence of negative values in Apple’s deep sleep data even suggests measurement or synchronisation errors, something no statistical threshold can fix.

Figure 3: Data Exclusions (<4h) and Outlier Rates by Platform. This figure compares the percentage of records excluded as short sleep (<4 hours) against the percentage of outliers detected after filtering. Garmin excluded the largest share of records (16.3%), but produced very few outliers afterwards (~1.4%). Oura and Fitbit both identified naps explicitly and had moderate exclusion rates (12.7% and 9.1% respectively), while Apple recorded virtually no naps (<1% exclusions) but suffered the highest outlier rates (up to 4.3%). Garmin and Apple do not detect naps, which partly explains why Garmin required more exclusions (short fragments that had to be removed manually), and why Apple’s dataset retained implausible values instead.
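The two checks discussed in this section, the <4-hour exclusion and the hunt for impossible values, can be sketched as follows. The 4-hour threshold comes from the text; the 24-hour ceiling, the field names `total_h` and `deep_h`, and both function names are assumptions introduced for this example.

```python
MIN_MAIN_SLEEP_H = 4.0   # threshold used in the article
MAX_PLAUSIBLE_H = 24.0   # assumption: one sleep record cannot exceed a day

def split_short_sleep(durations_h, min_hours=MIN_MAIN_SLEEP_H):
    """Return (kept_durations, percentage_of_records_excluded)."""
    if not durations_h:
        return [], 0.0
    kept = [d for d in durations_h if d >= min_hours]
    excluded_pct = 100 * (len(durations_h) - len(kept)) / len(durations_h)
    return kept, excluded_pct

def is_implausible(record):
    """Flag records that no statistical threshold can rescue."""
    return (
        record["deep_h"] < 0                       # negative stage duration
        or record["total_h"] > MAX_PLAUSIBLE_H     # e.g. a 51-hour "night"
        or record["deep_h"] > record["total_h"]    # stage longer than the night
    )

# Invented durations: two naps among five records.
kept, excluded_pct = split_short_sleep([7.0, 0.9, 6.5, 1.5, 8.0])
print(kept, f"{excluded_pct:.0f}% excluded")
print(is_implausible({"total_h": 51.0, "deep_h": 1.0}))
```

Keeping the plausibility check separate from the IQR test makes it clear which records are statistically unusual and which are simply invalid.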

Naps and Multiple Events: The Oura Factor

Oura and Fitbit include naps in sleep data and also separately flag them in the output data. Around 12.7% of Oura’s sleep records were classified as naps by our < 4-hour criterion, averaging just under an hour. Fitbit also identified naps, though less frequently and with a longer average duration of 1 hour and 40 minutes.

This difference has a profound effect on total sleep time. The inclusion of many short episodes can partly explain Oura’s lower average TST. When these are removed, Oura’s nightly average rises to 7.12 hours—almost identical to Garmin's. It is a reminder that how a device defines and detects sleep events can alter the statistics dramatically.
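To see why short episodes pull the average down so sharply, consider this small sketch. The helper `mean_tst` and the episode list are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

def mean_tst(durations_h, min_hours=0.0):
    """Mean sleep duration over episodes of at least min_hours."""
    kept = [d for d in durations_h if d >= min_hours]
    if not kept:
        raise ValueError("no episodes left after filtering")
    return mean(kept)

# Invented episodes for one user: three nights and two short naps.
episodes = [7.0, 1.0, 7.5, 0.5, 7.0]

mean_all = mean_tst(episodes)                  # naps counted as separate records
mean_main = mean_tst(episodes, min_hours=4.0)  # naps filtered out
print(f"all episodes: {mean_all:.2f} h, >=4h only: {mean_main:.2f} h")
```

In this toy case two naps out of five records are enough to drag the mean from roughly 7.2 hours down to 4.6, which is the same mechanism, in exaggerated form, behind Oura's lower unfiltered average.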

You can see for yourself (in Figure 3) that Apple has almost no sleep episodes under 4 hours, which suggests that Apple already removes nap sessions from its sleep data.

Clustering of Outliers

The distribution of outliers across users also varies. Oura’s outliers are widely spread, with nearly all users having at least one unusual record, but each user contributes only a small share. Garmin is the opposite: fewer users show outliers, but when they do, they are concentrated in a handful of nights. Apple and Fitbit sit between these two extremes.

This pattern matters because it reflects whether anomalies are occasional glitches across the board, or recurring issues for a subset of users. For researchers or product developers, knowing the difference can guide decisions about cleaning and interpreting data.
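One way to distinguish "occasional glitches across the board" from "recurring issues for a subset of users" is to count outliers per user. The sketch below assumes you already have the user ID attached to each flagged record; the function name `outlier_spread`, the metric names, and the IDs are illustrative assumptions.

```python
from collections import Counter

def outlier_spread(outlier_user_ids, all_user_ids):
    """Summarise how outlier records are distributed across users."""
    per_user = Counter(outlier_user_ids)
    n_users = len(set(all_user_ids))
    affected = len(per_user)
    total = sum(per_user.values())
    return {
        # Share of users with at least one outlier record.
        "pct_users_affected": 100 * affected / n_users if n_users else 0.0,
        # Average number of outlier records among affected users.
        "outliers_per_affected_user": total / affected if affected else 0.0,
        # Share of all outliers contributed by the single worst user.
        "top_user_share_pct": 100 * max(per_user.values()) / total if total else 0.0,
    }

users = ["a", "b", "c", "d"]
widespread = outlier_spread(["a", "b", "c", "d"], users)    # spread thinly
concentrated = outlier_spread(["a", "a", "a", "b"], users)  # clustered
print(widespread)
print(concentrated)
```

A high `pct_users_affected` with a low `top_user_share_pct` resembles the pattern described for Oura, while the reverse resembles Garmin.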

Cross-Platform Consistency: Where They Agree, Where They Don’t

Despite the differences, there are encouraging consistencies. All platforms converge on similar ranges for REM sleep (16–21%) and total sleep time (6.3–7.2 hours). Light sleep hovers around 60–66%, though Apple’s tendency towards variability is notable. Deep sleep is the main point of divergence, with Apple showing significantly lower values.

From a data quality perspective, Garmin is the clear winner: consistent averages, low outlier rates, and stable stage classification. Fitbit performs nearly as well, with balanced results across most metrics. Oura provides the most detailed dataset, including naps, but at the cost of higher variability and occasional data issues. Apple, while popular and widely used, shows systematic problems with stage classification and higher rates of extreme outliers.

What This Means for Users and Researchers

For the casual user, these differences may not matter much. Whether you slept 6.8 or 7.1 hours is unlikely to change your morning routine. But for researchers, athletes, and clinicians, understanding the strengths and weaknesses of each platform is vital.

  • Garmin seems to be best suited for consistent, reliable long-term trends.
  • Oura offers the richest data, especially for those interested in nap behaviour, but requires careful filtering.
  • Fitbit strikes a balance: relatively clean data without the extremes seen in Apple.
  • Apple provides broad accessibility, but it should be treated with caution when interpreting deep sleep and outlier values.

Ultimately, no wearable is perfect, and all rely on algorithms that make educated guesses based on movement, heart rate, and other signals. The real value lies in recognising patterns over time, rather than obsessing over single-night accuracy.

The next stage is to compare the distribution from each wearable with the gold standard measure, polysomnography (PSG). That’s something to look forward to in a later blog post!
