Hello! To explore the concept of lurking variables, let’s step into the role of a Site Reliability Engineer (SRE). If you’re not an SRE, don’t worry: no SRE background is needed to grasp the concept.
Scenario
Let’s consider a hypothetical cloud provider, NebuloSky, which offers two types of virtual machines (VMs) running on different platforms:
Platform X: The well-established, stable platform that has been deployed and used by customers for years.
Platform Y: The next-generation platform, not yet considered stable and deployed at a much smaller scale.
For various reasons, including security, NebuloSky regularly deploys new kernel versions to its VMs. Since NebuloSky operates worldwide, kernel rollouts are a continuous process—there’s always a rollout happening somewhere.
One of the primary metrics for tracking the success of these rollouts is kernel panics—a system failure that can occur for various reasons, including a defective kernel version. Given the scale of NebuloSky’s infrastructure, kernel panics are inevitable, but their frequency is monitored closely.
As an SRE at NebuloSky, we receive a page alert: the number of kernel panics has doubled in a short timeframe.
Naturally, our first instinct is to suspect a bad kernel rollout. We check the related dashboard and see that a new kernel version started rolling out a few days ago. Therefore, we immediately decide to roll back this kernel version.
However, hours pass, and the kernel panic rate doesn’t drop. What have we done wrong?
Correlation vs. Causation
The new kernel rollout correlated with the spike in panics, leading us to assume causation. Yet, correlation doesn’t always imply causation—something crucial was missing in our analysis.

Kernel panics can be triggered by multiple factors: a defective kernel version, as we mentioned, but also hardware issues. In this case, the platform used by the VMs had a significant impact on kernel panic rates.
Breaking down the panic rates per platform reveals the following. Within a 24-hour window:
Platform X has a 1% chance of experiencing a kernel panic.
Platform Y has a 10% chance of experiencing a kernel panic.
So, what actually caused the increase in kernel panics we observed? The proportion of VMs running on platform Y had increased. Before the incident, the VM distribution was:
9,000,000 VMs with platform X
100,000 VMs with platform Y
With these numbers, the expected number of panics per day was around 100,000 (9,000,000 × 1% + 100,000 × 10%).
However, NebuloSky recently deployed 1,000,000 additional VMs with platform Y. Because platform Y has a much higher panic rate, the expected number of panics rose to around 200,000 per day (9,000,000 × 1% + 1,100,000 × 10%).
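To make the arithmetic explicit, here is a minimal Python sketch that reproduces these expected panic counts. The rates and fleet sizes are the hypothetical figures from the scenario above, not real data.

```python
# Hypothetical daily kernel panic probabilities per platform (scenario figures).
PANIC_RATE = {"X": 0.01, "Y": 0.10}

def expected_daily_panics(fleet: dict[str, int]) -> float:
    """Expected number of kernel panics per day for a given fleet composition."""
    return sum(count * PANIC_RATE[platform] for platform, count in fleet.items())

before = {"X": 9_000_000, "Y": 100_000}
after = {"X": 9_000_000, "Y": 1_100_000}  # 1,000,000 additional platform Y VMs

print(expected_daily_panics(before))  # 100000.0
print(expected_daily_panics(after))   # 200000.0
```

The kernel version never appears in this calculation: the doubling is fully explained by the change in fleet composition.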
Lurking Variable
In this scenario, we focused only on the kernel rollout while ignoring another key factor: the shifting proportion of platforms. This is a classic case of a lurking variable: a hidden factor that affects the relationship between the studied variables but is not included in the analysis. Note that “hidden” refers to either an unknown factor or one that was simply not taken into account.
In this scenario, the studied variables were:
Kernel version
Kernel panic count
However, the lurking variable was the platform distribution. The increase in VMs running on platform Y had a significant impact on the panic rate across the fleet.
Avoiding Lurking Variables in Analysis
In this scenario, suspecting a kernel rollout as the cause of the spike in panics was a reasonable assumption. When troubleshooting, it’s natural to start with the most obvious cause, especially if it has been responsible for similar issues in the past.
However, rolling back the kernel without clear evidence of causation was a mistake. This is a classic example of confirmation bias—where we interpret information in a way that confirms our prior assumptions. We saw a new kernel rollout, assumed it was the cause, and acted without fully verifying our hypothesis.
To avoid confirmation bias, we should have tried to collect a clear signal, for example:
Did the spike in panics coincide precisely with the start of the rollout?
Does the increase in panics correlate with the proportion of machines running the new kernel? (See the sketch below.)
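One way to turn the second check into a concrete signal is to compare the per-VM panic rate on the new kernel against the old one: if the rollout were the culprit, VMs on the new kernel should panic noticeably more often. Here is a minimal sketch with made-up data and column names, not actual NebuloSky telemetry.

```python
import pandas as pd

# Hypothetical per-VM snapshot: which kernel each VM runs and whether it
# panicked in the last 24 hours (all numbers are made up for illustration).
vms = pd.DataFrame({
    "kernel":   ["old"] * 8_000 + ["new"] * 2_000,
    "panicked": [False] * 7_900 + [True] * 100    # old kernel: 1.25% panic rate
              + [False] * 1_975 + [True] * 25,    # new kernel: 1.25% panic rate
})

# If the new kernel were defective, its panic rate should be clearly higher.
panic_rate_by_kernel = vms.groupby("kernel")["panicked"].mean()
print(panic_rate_by_kernel)
# Similar rates for "old" and "new" argue against the rollout hypothesis.
```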
If the kernel rollout hypothesis isn’t confirmed and no alternative hypotheses remain, that would indicate a lurking variable: something we’re unaware of that impacts the observed variables.
Detecting Lurking Variables with Data Segmentation
One of the keys to detecting lurking variables lies in data segmentation, which is the process of dividing a dataset into meaningful subgroups.
First, let’s have a look at the kernel panic chart:
We observe that the count of kernel panics doubled. Now, let’s segment (break down) the data per platform:
We observe that the count of kernel panics for platform X remains stable while the count of panics for platform Y increases. This suggests that the issue is specific to platform Y, but it needs further validation.
To confirm this hypothesis, we examine the count of machines per platform. If more machines are running on platform Y, it would explain the increased panics:
This chart validates our hypothesis: the increase in kernel panics is due to a higher proportion of machines running on platform Y.
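Here is a minimal sketch of this segmentation, assuming panic events and fleet sizes can be exported with a platform label. The dataset and field names are hypothetical and mirror the scenario numbers.

```python
import pandas as pd

# Hypothetical export of panic counts and fleet sizes, keyed by day and platform.
panics = pd.DataFrame({
    "day":      [1,         1,       2,         2],
    "platform": ["X",       "Y",     "X",       "Y"],
    "panics":   [90_000,    10_000,  90_000,    110_000],
    "vms":      [9_000_000, 100_000, 9_000_000, 1_100_000],
})

# Total panics per day: doubles from day 1 to day 2.
print(panics.groupby("day")["panics"].sum())

# Segmented per platform: X stays flat while Y grows...
print(panics.pivot(index="day", columns="platform", values="panics"))

# ...and the per-VM panic rate of each platform stays constant, which points
# to the fleet mix (not the kernel) as the driver.
panics["rate"] = panics["panics"] / panics["vms"]
print(panics.pivot(index="day", columns="platform", values="rate"))
```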
Yet, something needs to be clarified. We explained that a lurking variable is, by definition, something we’re initially unaware of. So how could we have instinctively decided to break down kernel panics per platform if we don’t even know that the platform impacts kernel panics? There are two possible ways:
Exploration: If we are able to thoroughly investigate the incident, we may eventually identify platform distribution as a suspect. By systematically ruling out potential causes and segmenting the data in different ways, we can discover hidden factors, including lurking variables.
Proactive observability: When designing dashboards to track system stability, we should go beyond just the primary metric (e.g., kernel panics) and ensure data can be segmented based on known influencing factors. In this case, if the kernel rollout dashboard had proactively included breakdowns by platform, the anomaly could have been detected much faster (see the sketch after this list).
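As an illustration, here is one possible way to make the panic metric segmentable from the start, using the prometheus_client library. The metric name, labels, and reporting flow are assumptions for this example rather than NebuloSky’s actual stack.

```python
from prometheus_client import Counter, start_http_server

# Expose the panic metric with labels so dashboards can break it down per
# platform or kernel version without any extra instrumentation later.
KERNEL_PANICS = Counter(
    "kernel_panics_total",
    "Number of kernel panics observed across the fleet",
    ["platform", "kernel_version"],
)

def report_panic(platform: str, kernel_version: str) -> None:
    """Record one kernel panic, tagged with its known influencing factors."""
    KERNEL_PANICS.labels(platform=platform, kernel_version=kernel_version).inc()

if __name__ == "__main__":
    start_http_server(8000)      # illustrative /metrics endpoint for scraping
    report_panic("Y", "6.1.42")  # hypothetical example event
```

With the platform and kernel version recorded as labels, a dashboard can break down the panic count by either factor the moment an incident starts.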
That said, observability remains a continuous process. If a truly unknown factor emerges, such as one that no one at NebuloSky ever predicted, documenting the need for an additional dashboard breakdown in the postmortem would be a valuable follow-up to improve future incident response.
Conclusion
Lurking variables are particularly dangerous in analysis, as they can mislead us into drawing false conclusions. As takeaways:
We need to be aware of potential lurking variables when forming a hypothesis. Just because two events correlate doesn’t mean one causes the other.
We should fight confirmation bias and stay open to the possibility of being wrong. If an initial hypothesis doesn’t fully explain an issue, we must be willing to reassess our assumptions rather than forcing the data to fit our expectations.
Proactively segmenting data by influencing factors can help uncover lurking variables. By designing observability systems to include breakdowns for known potential factors, we make it easier to detect hidden influences before they lead to misinterpretations.
💬 Have you ever encountered a situation where a lurking variable changed your understanding of a problem? Share your stories in the comments.
❤️ If you made it this far and enjoyed the post, please consider giving it a like.