Epi Night School Extra: Berkson’s Bias is One Sneaky Selection Problem
It's what brought us the bad take that coffee causes pancreatic cancer.
If you’ve been following along with Epidemiology Night School, you know that bias is one of the trickiest challenges in study design. Today we’re going to dive into a particularly sneaky culprit: Berkson’s Bias.
Imagine you’re in the hospital doing a study on diabetes and gallbladder disease. You find that these two conditions seem closely associated. But wait! Does one actually cause the other, or are you just seeing this because the hospital selects for patients with multiple health issues? Congratulations, you’ve stumbled into Berkson’s bias!
Let’s break down what Berkson’s bias is, why it keeps showing up in studies, and how we can visualize it using R.
What is Berkson’s Bias?
Berkson’s bias arises in studies that involve a selected population, like hospitalized patients, instead of the general population. When your study is limited to a specific group that meets certain inclusion criteria, relationships between variables can appear stronger (or weaker) than they really are.
Here’s how it happens:
Suppose two conditions, X (exposure) and Y (outcome), are completely independent of each other in the general population.
However, patients with X or Y are more likely to be in the hospital (or another study-selected group).
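If you only study hospitalized patients, X and Y will look associated: knowing that a patient has X changes how likely they appear to be to have Y, even though this isn’t true in the real world. The direction of the distortion depends on how each condition drives admission.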
Berkson’s bias is all about selection distortion. It creates an illusion of association where none exists, leading to misleading conclusions.
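To see the mechanism with exact numbers rather than a story, here is a minimal sketch in base R. The prevalences (10% each) and the admission probabilities (20% per condition, 1% for unrelated reasons) are made-up values chosen for illustration, not estimates from any study, and the helper odds_ratio() is hypothetical.

```r
# Hypothetical prevalences: X and Y each affect 10% of people, independently
p_x <- 0.10
p_y <- 0.10

# All four X/Y combinations with their probabilities in the general population
pop <- expand.grid(x = c(0, 1), y = c(0, 1))
pop$p_pop <- ifelse(pop$x == 1, p_x, 1 - p_x) *
  ifelse(pop$y == 1, p_y, 1 - p_y)

# Assumed admission model: each condition independently sends 20% of
# sufferers to hospital, and 1% of people are admitted for other reasons
pop$p_admit <- 1 - (1 - 0.01) * (1 - 0.20)^pop$x * (1 - 0.20)^pop$y

# Distribution of X and Y among hospitalized patients (Bayes' rule)
pop$p_hosp <- pop$p_pop * pop$p_admit / sum(pop$p_pop * pop$p_admit)

# Hypothetical helper: odds ratio from the four cell probabilities
odds_ratio <- function(p, x, y) {
  (p[x == 1 & y == 1] * p[x == 0 & y == 0]) /
    (p[x == 1 & y == 0] * p[x == 0 & y == 1])
}

or_population <- odds_ratio(pop$p_pop, pop$x, pop$y)
or_hospital <- odds_ratio(pop$p_hosp, pop$x, pop$y)

print(c(population = or_population, hospital = or_hospital))
```

Under these assumptions the population odds ratio is 1 (no association), while the odds ratio among hospitalized patients lands well below 1. Nothing about X or Y changed; only who got counted. Different admission probabilities would give a different size, and possibly a different direction, of distortion.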
Why Does Berkson’s Bias Keep Creeping In?
The problem with Berkson’s bias is that it’s subtle. It sneaks into observational studies, particularly those conducted in hospitals or clinics where people with multiple health conditions are overrepresented.
Hospital Studies: People with more than one condition are likelier to be hospitalized, so conditions that co-occur purely by chance seem associated.
Self-Selection Surveys: Participants with specific traits (e.g., symptoms, interests) are more likely to participate in studies, distorting associations.
The key challenge is that most studies require some sort of selection. That is, you can’t study everyone, so you have to take a sample. But if you don’t carefully account for how that sample was selected, Berkson’s bias can skew your findings and trick you into seeing relationships that don’t exist.
Let’s Visualize Berkson’s Bias in R
Here’s a hands-on example using R. We’ll simulate two independent variables, X (exposure) and Y (outcome), and then apply a selection criterion that mimics Berkson’s bias. You’ll see how the “association” magically appears after we impose selection.
Step 1: Simulating the Data
In this simulation:
X and Y are independent variables (normally distributed).
We impose a selection criterion: only include observations where the sum of X and Y > 1.
Here’s the code to run in R:
# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Simulate independent variables X and Y
set.seed(123) # For reproducibility
n <- 1000 # Number of observations
X <- rnorm(n, mean = 0, sd = 1)
Y <- rnorm(n, mean = 0, sd = 1)

# Combine X and Y into a data frame
data <- data.frame(X, Y)

# Apply selection criterion (e.g., sum of X and Y > 1)
selected_data <- data[data$X + data$Y > 1, ]

# Plot original data
p1 <- ggplot(data, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "black") +
  labs(x = "Exposure", y = "Outcome") +
  ggtitle("Original Data") +
  theme_minimal()

# Plot selected data to illustrate Berkson's bias
p2 <- ggplot(selected_data, aes(x = X, y = Y)) +
  geom_point(color = "red", alpha = 0.5) +
  geom_smooth(method = "lm", color = "black") +
  labs(x = "Exposure", y = "Outcome") +
  ggtitle("Data After Selection Criterion") +
  theme_minimal()

# Display plots side by side
grid.arrange(p1, p2, ncol = 2)
Step 2: Understanding the Output
On the left, you’ll see the original data: a single round cloud of points with no relationship between X and Y. This reflects the truth: X and Y are independent (the line is basically flat).
On the right, after applying the selection criterion, a false association appears. Only the points above the diagonal X + Y = 1 survive, so the fitted line slopes downward and the scatterplot suggests that X and Y are negatively correlated, even though we know they’re not!
This is Berkson’s bias in action: by restricting our “study population” (only including data points where X + Y > 1), we’ve introduced a spurious relationship.
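To put numbers on what the two panels show, the sketch below re-creates the same simulated data (same seed and selection rule) and compares the correlation and the fitted slope before and after selection. The variable names cor_all, cor_selected, slope_all, and slope_selected are introduced here for illustration.

```r
# Re-create the simulated data with the same seed and selection rule
set.seed(123)
n <- 1000
X <- rnorm(n, mean = 0, sd = 1)
Y <- rnorm(n, mean = 0, sd = 1)
data <- data.frame(X, Y)
selected_data <- data[data$X + data$Y > 1, ]

# Pearson correlation in the full sample and in the selected sample
cor_all <- cor(data$X, data$Y)
cor_selected <- cor(selected_data$X, selected_data$Y)

# Slope of the straight line that geom_smooth(method = "lm") draws
slope_all <- coef(lm(Y ~ X, data = data))[["X"]]
slope_selected <- coef(lm(Y ~ X, data = selected_data))[["X"]]

print(c(n_all = nrow(data), n_selected = nrow(selected_data)))
print(round(c(cor_all = cor_all, cor_selected = cor_selected), 2))
print(round(c(slope_all = slope_all, slope_selected = slope_selected), 2))
```

The full-sample correlation should sit close to zero, while the selected-sample correlation should be clearly negative, matching the downward-sloping line in the right-hand panel.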
Why This Matters in Real Studies
Berkson’s bias doesn’t just live in textbooks or simulations. It creeps into real-world research whenever studies focus on selected populations:
Hospital-based studies can overestimate, underestimate, or even invent associations between co-occurring diseases.
Clinic-based studies might inflate correlations between risk factors and outcomes.
Surveys with self-selection may misrepresent associations because the sample is skewed toward those most interested in participating.
If you don’t account for this bias, you risk drawing incorrect conclusions and misleading public health decisions.
Avoiding Berkson’s Bias
So how do we protect our studies from this sneaky bias?
Be Mindful of Selection Criteria: Understand how including or excluding certain individuals may distort associations.
Use Population-Based Data: Whenever possible, study the general population instead of specific subgroups.
Sensitivity Analyses: Check how your results change when you adjust for selection effects or use different inclusion criteria.
Stratification: Where you have data on people both inside and outside the selected group, analyze the two groups separately and compare the results.
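As one way to approach the sensitivity analysis idea, the sketch below reruns the simulation with several inclusion thresholds and records how the spurious correlation changes. The seed, the larger sample size, the set of thresholds, and the name sensitivity are all assumptions made for this illustration.

```r
# Simulate independent X and Y (hypothetical seed; larger n for stable estimates)
set.seed(2024)
n <- 20000
X <- rnorm(n)
Y <- rnorm(n)

# Candidate inclusion criteria: keep observations where X + Y > threshold
thresholds <- c(-2, -1, 0, 1, 2)

# For each threshold, record how many observations survive and their correlation
sensitivity <- data.frame(
  threshold = thresholds,
  n_selected = vapply(thresholds, function(t) sum(X + Y > t), numeric(1)),
  correlation = vapply(thresholds, function(t) {
    keep <- X + Y > t
    cor(X[keep], Y[keep])
  }, numeric(1))
)

print(sensitivity)
```

In this setup, the stricter the inclusion criterion, the stronger the spurious negative correlation becomes. If an association in real data swings this much as the inclusion criteria change, selection is a likely suspect.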
Final Thoughts
Berkson’s bias is a powerful reminder that how we select our study population matters. Whether you’re investigating disease risk factors, co-occurring conditions, or health behaviors, selection can create relationships that don’t actually exist and distort the ones that do.
The simulation above shows just how easy it is for this bias to appear. Luckily, once you’re aware of it, you can design better studies and interpret findings with a more critical eye.
Now, go ahead and run the code! Play with the selection criterion and see how the “association” changes. The next time someone points out a surprising correlation, you’ll know to ask: “Could this be Berkson’s bias?”
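If you want a starting point for experimenting, here is a small sketch that compares a few differently shaped selection rules on the same simulated data. The helper cor_after_selection(), the seed, the sample size, and the specific rules are hypothetical choices for illustration.

```r
# Simulate independent X and Y (hypothetical seed and sample size)
set.seed(42)
n <- 20000
sim <- data.frame(X = rnorm(n), Y = rnorm(n))

# Hypothetical helper: correlation of X and Y among the rows flagged by `keep`
cor_after_selection <- function(df, keep) {
  selected <- df[keep, ]
  cor(selected$X, selected$Y)
}

results <- c(
  no_selection   = cor(sim$X, sim$Y),
  sum_above_1    = cor_after_selection(sim, sim$X + sim$Y > 1),
  either_above_1 = cor_after_selection(sim, sim$X > 1 | sim$Y > 1),
  x_only_above_1 = cor_after_selection(sim, sim$X > 1)
)

print(round(results, 2))
```

The rules that depend on both X and Y should produce a clear negative correlation, while selecting on X alone should leave the correlation near zero. The bias needs selection to depend on both variables; restricting on just one of them does not manufacture an association between two independent variables.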
In the plot of the original data, why are there two clouds of points? If each point has an x and y coordinate (with x and y independent of each other) and that's one cloud, what makes up the second cloud? Thank you.