Chevron pointing leftAll articles

Sleep Tracking

Sleep Metric Variation Across Wearables

By Alistair Brownlee, Faraaz Akhtar and Halvard Ramstad

January 22, 2026

  • We tested whether sleep wearables actually agree — analyzing 77,000+ nights across six devices, plus same-person, same-night comparisons to isolate true device differences.

  • Sleep duration lines up, sleep stages don’t — most trackers cluster around ~7 hours of total sleep and are usually within ~15–22 minutes on the same night, but deep/REM/light can diverge by hours between devices.

  • Key takeaway: sleep stages don’t match across devices — on the same nights, trackers differed by ~32 min (deep), ~23 min (REM), and ~38 min (light).

Get the latest Terra Research reports
By continuing, I agree to the Privacy Policy and Terms of Service.

WHOOP Data and Direct Multi-Device Comparisons

After we published the sleep data comparison a few weeks ago, we had quite a lot of requests on social media from WHOOP users eager to see how their device stacks up. So we're excited to include WHOOP in this updated analysis. 

Building on our previous population-level exploration of sleep data from Oura, Fitbit, Apple, and Garmin, we incorporate fresh insights from direct head-to-head comparisons—nights where the same individuals used multiple devices simultaneously.

The dataset spans November 17 to December 17, 2025, encompassing over 77,000 filtered main sleep episodes (excluding naps and durations outside 4-13 hours). This real-world aggregation from thousands of users offers a robust view of consumer sleep patterns and device performance.

The Crucial Distinction: Population-Level vs. Direct Comparisons

A core limitation of population-level analyses, such as our earlier one, is that they compare averages across different user groups. For instance, if Eight Sleep users average longer sleep than Apple Watch users, is that because the smart bed's temperature regulation genuinely extends sleep duration, or because Eight Sleep attracts a demographic more committed to recovery (e.g., biohackers or those with sleep issues who invest in premium solutions)? Similarly, differences could stem from lifestyle factors, age, or even users' motivation to wear/track consistently.

These ecological views are valuable for understanding broader trends among tracker users, but don't isolate device-specific effects. That's where multi-device nights change the game: by controlling for the individual (same person, same night, same sleep), we can directly assess algorithmic differences in detection and staging.

Population-Level Insights: A Solid ~7 Hours, With Notable Variations

Across the six manufacturers, mean total sleep time (TST) ranges from 6.89 to 7.65 hours—encouragingly close to the 7-9 hours recommended by sleep experts for cognitive health, immune function, and mood regulation. Wearable users, often more health-aware than the general population (where averages dip below 7 hours in many surveys), seem to prioritize sleep moderately well. 

We’d expect this, as the sample is taken from a population that we can be confident are relatively (compared to the general population) interested in their health.

manufacturer_sleep_stages_distributions.png
Figure 1: Stacked Bar Chart of Average Sleep Stages by Manufacturer - breaks down mean hours in REM (blue), Deep (green), and Light (yellow) sleep, stacked to show total sleep time. The chart highlights Apple's notably lower deep sleep and Eight Sleep's higher REM, illustrating inter-device staging differences.

Sleep stages follow expected physiology: light dominates (~60%), REM ~20-25% (1.3-1.7 hours), and deep varies most (0.8-1.3 hours). Deep sleep's variability highlights a known challenge: consumer devices rely on heart rate, movement, and sometimes temperature, but struggle to match the EEG sensitivity of lab polysomnography for slow-wave detection.

Apple's low deep sleep aligns with peer-reviewed validation studies showing wrist-based actigraphy often underestimates slow-wave sleep compared to PSG [1][2]. Conversely, Oura's higher deep sleep reporting has been found closer to gold standards in multiple independent validations [3][4].

WHOOP slots in comfortably, with averages reflecting its athlete-focused user base—decent duration but not outlier-long.

manufacturer_population_distributions.png
Figure 2: Mean Total Sleep Time with Error BarsBar chart showing average TST per device, with sample sizes noted and variability indicated. Eight Sleep leads in duration, while shorter averages on wrist-based devices like Apple and Fitbit may reflect user habits or detection algorithms. Is this because Eight Sleep users self-select as the most sleep-conscious in the sample, or because the product itself extends sleep? More of this later!
manufacturer_tst_distribution_lines (1).png
Figure 3: Distributions of Total Sleep Time - Smoothed density curves for TST across devices, showing that most nights cluster around 6.5-8 hours. Distributions overlap significantly, but peaks vary slightly—e.g., Eight Sleep skewed higher.

Direct Multi-Device Comparisons: Isolating Device Differences

Multi-device comparison is the true gold standard when we examine nights when the same individual uses multiple devices simultaneously: here, by perfectly controlling for the person, their physiology, bedtime habits, bedroom environment, and the exact sleep episode, we isolate pure device-to-device differences. 

This direct, within-subject approach - rare in consumer tech analyses due to the need for real-world overlap data - offers unparalleled clarity on how algorithms interpret identical physiological signals, revealing genuine biases in total sleep time detection and stage classification that aggregated population data can only hint at.

whoop_tst_comparison_chart (1).png
Figure 4: WHOOP TST Differences in Multi-Device ComparisonsBar chart comparing average and median differences (minutes) for Oura vs. WHOOP and Eight Sleep vs. WHOOP. Medians under 25 minutes indicate solid agreement on total time, despite larger averages pulled by outliers.

In the multi-device sleep tracker comparisons (such as Oura vs. WHOOP or Eight Sleep vs. WHOOP), the average total sleep time (TST) differences often appear substantially higher—around 38–45 minutes—compared to the much lower median differences of just 17–22 minutes. This discrepancy arises because the data is positively skewed: on most nights, the devices agree quite closely (differences under 20–30 minutes), pulling the median (the middle value when sorted) downward to reflect typical performance. 

However, a smaller number of outlier nights—where one device significantly over- or under-estimates sleep due to factors like movement artefacts, poor sensor contact, or algorithmic edge cases—can produce large deviations (60+ minutes). 

These extremes heavily inflate the average (mean), which is sensitive to outliers, while the median remains robust and more representative of what users experience on an average night. In sleep tracking research, this pattern is common, which is why medians are often preferred for assessing day-to-day reliability.

whoop_sleep_stages_comparison_chart.png
Figure 5: WHOOP Sleep Stage Differences- Average absolute differences in minutes for REM, Deep, and Light stages in the same comparisons. Stage discrepancies are substantially larger than TST, emphasising ongoing challenges in consumer staging accuracy.

From 725 users contributing 2,144 overlapping nights (mostly dual-device), we gain clearer device insights into how devices compare. We only ran comparisons on pairs with 20+ nights of data:

  1. Oura vs. WHOOP (42 nights from multiple users):
    • Median TST difference: 16.6 minutes (average 45.2 minutes)
    • WHOOP higher TST on 57.1% of nights
    • Per-user averages ranged 7.2–86.4 minutes difference; most showed night-to-night variability (71.4% mixed direction)
  2. Eight Sleep vs. WHOOP (31 nights):
    • Median TST difference: 21.6 minutes (average 37.8 minutes)
    • WHOOP higher on 80.6% of nights—a more consistent bias
    • Some users (40%) always had WHOOP higher; others mixed

Direct comparisons between WHOOP and Oura (across 42 nights) and between WHOOP and Eight Sleep (31 nights) reveal deeper insights into device-specific biases and user variability in sleep tracking. Notably, WHOOP tends to report longer total sleep time (TST) more frequently—57% of nights against Oura and a striking 81% against Eight Sleep—suggesting WHOOP's strap-based algorithm may be more lenient in classifying quiet wakefulness as sleep, perhaps due to its focus on athlete recovery metrics. However, median TST differences remain modest (under 25 minutes), indicating strong overall agreement for most nights, while larger average differences highlight occasional outliers influenced by factors like fit or movement. 

Sleep stage discrepancies are even more pronounced, with average gaps exceeding 250 minutes for light sleep, underscoring algorithmic divergences: Oura's ring sensors (incorporating temperature) appear to detect more nuanced transitions, while Eight Sleep's bed-based system emphasizes REM, potentially overestimating it relative to WHOOP. User-level patterns add nuance—71% of Oura-WHOOP users show mixed nightly biases, versus 40% of Eight Sleep-WHOOP users, where WHOOP consistently over reports—implying individual physiology or sleep position plays a role.

WhoopSleepFigure 1.png
Figure 6: Sleep Stage Duration Breakdown by Device - wide variance in deep sleep, REM sleep, and light sleep across different wearable devices. 

Broader Implications and Practical Advice

Combining both lenses: Population data shows tracker users sleeping ~7 hours with stage proportions broadly sensible, but direct comparisons reveal TST reliability within ~20 minutes across brands—reassuring for duration tracking—while stages remain interpretive.

For WHOOP fans: Your device performs comparably in aggregates and shows modest biases in direct matchups, often edging longer TST.

Ultimately, these tools motivate better habits and reveal patterns (e.g., weekend recovery). But discrepancies remind us to Focus on subjective refreshment rather than chasing identical scores. Avoid "orthosomnia"—stress from imperfect data!

References

[1] de Zambotti, Massimiliano, et al. “Wearable Sleep Technology in Clinical and Research Settings.” Medicine & Science in Sports & Exercise, 2019, https://pubmed.ncbi.nlm.nih.gov/30789439/.

[2] Walch, Olivia J., et al. “Sleep Stage Prediction with Raw Acceleration and Photoplethysmography Heart Rate Data Derived from a Consumer Wearable Device.” Sleep, vol. 42, no. 12, 2019, article zsz180, https://academic.oup.com/sleep/article/42/12/zsz180/5549536.

[3] Altini, Marco, and Hannu Kinnunen. “The Promise of Sleep: A Multi-Sensor Approach for Accurate Sleep Stage Detection Using the Oura Ring.” Sensors, vol. 21, no. 13, 2021, article 4302, https://www.mdpi.com/1424-8220/21/13/4302.

[4] Asgari Mehrabadi, Mohammad, et al. “Accuracy Assessment of Oura Ring Nocturnal Heart Rate and Heart Rate Variability in Comparison With Electrocardiography.” Journal of Medical Internet Research, vol. 24, no. 1, 2022, article e27487, https://www.jmir.org/2022/1/e27487/.

Cookie Preferences

Essential CookiesAlways On
Advertisement Cookies
Analytics Cookies

Crunch Time: Embrace the Cookie Monster Within!

We use cookies to enhance your browsing experience and analyse our traffic. By clicking “Accept All”, you consent to our use of cookies according to our Cookie Policy. You can change your mind any time by visiting out cookie policy.