What you measure, and how you interpret it, matters. Chris Zacharias described that lesson from the history of YouTube in Page Weight Matters.
After deploying a stripped-down video page that cut the page weight from 1.2 MB to 98 KB, he found that their “average aggregate page latency” had increased. That made no sense: how could a huge performance improvement produce real user monitoring (RUM) results showing the average experience getting worse? It wasn’t until they dug down to a more granular level that the answer became apparent:
When we plotted the data geographically and compared it to our total numbers broken out by region, there was a disproportionate increase in traffic from places like Southeast Asia, South America, Africa, and even remote regions of Siberia. Further investigation revealed that, in these places, the average page load time under Feather was over TWO MINUTES! This meant that a regular video page, at over a megabyte, was taking more than TWENTY MINUTES to load! This was the penalty incurred before the video stream even had a chance to show the first frame. Correspondingly, entire populations of people simply could not use YouTube because it took too long to see anything. Under Feather, despite it taking over two minutes to get to the first frame of video, watching a video actually became a real possibility. Over the week, word of Feather had spread in these areas and our numbers were completely skewed as a result. Large numbers of people who were previously unable to use YouTube before were suddenly able to.
It isn’t hard to imagine a similar process happening to sites today.
Take a site that is so big and slow on mobile devices that only 3% of its page views come from mobile. It just isn’t worth trying to load the page on anything except laptops and desktops on fast wifi connections. Then the team takes the time to dramatically reshape the site, improving mobile page load by 10X. That in turn increases mobile views from 3% to 48%. Such a huge shift in the audience is going to affect a number of metrics. In that scenario, the overall average of a modern measurement like first contentful paint could actually get worse, not better.
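The arithmetic behind that counterintuitive outcome is just a weighted average over audience segments. A toy calculation with invented numbers (none of these figures come from the article) shows how every segment can improve while the overall average gets worse:

```python
# Illustrative numbers only: average first contentful paint (FCP)
# per segment, weighted by that segment's share of page views.
def weighted_avg(segments):
    """segments: list of (share_of_views, avg_fcp_seconds)."""
    return sum(share * fcp for share, fcp in segments)

# Before: mobile is so slow that only 3% of page views are mobile.
before = weighted_avg([(0.97, 1.5),   # desktop: 1.5 s FCP
                       (0.03, 12.0)]) # mobile: 12 s FCP

# After: mobile FCP improves 3x, so mobile share jumps to 48%.
after = weighted_avg([(0.52, 1.5),   # desktop: unchanged
                      (0.48, 4.0)])  # mobile: much faster than before

print(f"before: {before:.2f}s  after: {after:.2f}s")
# No segment got slower, yet the overall average FCP went up,
# because the slow (but now usable) segment carries far more weight.
```

This is the same mix-shift effect that skewed YouTube’s aggregate latency: the newly included slow users pull the average up even though nobody’s experience got worse.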
Audience shifts like that can produce surprising results. The same effect is why it is hard to compare RUM performance results across different sites: unless the makeup of their audiences is very similar, the comparison can easily mislead.