When working with complex surveys, pooling data from multiple sources can be a challenge, especially when there’s a significant imbalance in the number of Primary Sampling Units (PSUs). I recently came across a Reddit post that highlighted this issue, and I thought it was worth exploring further.
The original poster was trying to analyze Cannabis Use Disorder (CUD) by mode of cannabis consumption using two complex surveys from Argentina (2020 and 2022). The problem was that the 2020 survey had only 10 PSUs, while the 2022 survey had around 900 PSUs. This extreme imbalance raised concerns about the validity of variance estimation.
To address this, the poster had already taken some steps, including harmonizing the datasets, dividing the weights by 2, creating combined strata using year and geographic area, assigning unique PSU IDs, and using bootstrap replication for variance and confidence interval estimation. They also performed sensitivity analyses to compare estimates and proportions between years.
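To make the pooling steps concrete, here is a minimal sketch in plain Python. It is not the poster's code: the field names (`year`, `region`, `psu`, `weight`), the toy records, and the helper `pool_surveys` are all hypothetical, and I am assuming the weights were halved so that the pooled weights represent roughly an average population across the two years instead of double-counting it.

```python
# Hypothetical toy records; the real surveys have many more rows and variables.
survey_2020 = [
    {"year": 2020, "region": "north", "psu": 1, "weight": 1200.0},
    {"year": 2020, "region": "south", "psu": 2, "weight": 800.0},
]
survey_2022 = [
    {"year": 2022, "region": "north", "psu": 1, "weight": 150.0},
    {"year": 2022, "region": "south", "psu": 2, "weight": 90.0},
]


def pool_surveys(*surveys):
    """Stack harmonized surveys and build pooled design variables."""
    n_surveys = len(surveys)
    pooled = []
    for survey in surveys:
        for row in survey:
            new_row = dict(row)  # copy so the input records stay untouched
            # Divide each weight by the number of surveys (2 in the post).
            new_row["pooled_weight"] = row["weight"] / n_surveys
            # Combined stratum: year crossed with geographic area.
            new_row["pooled_stratum"] = f"{row['year']}_{row['region']}"
            # Prefix the PSU ID with the year so IDs never collide across surveys.
            new_row["pooled_psu"] = f"{row['year']}_{row['psu']}"
            pooled.append(new_row)
    return pooled


pooled = pool_surveys(survey_2020, survey_2022)
for row in pooled:
    print(row["pooled_stratum"], row["pooled_psu"], row["pooled_weight"])
```

Prefixing the PSU ID with the year matters because both surveys may number their PSUs from 1. Without it, software would treat PSU 1 in 2020 and PSU 1 in 2022 as the same cluster.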
Despite these efforts, the poster was still concerned about the validity of variance estimation due to the low number of PSUs in 2020. This is a crucial issue: if the variance is underestimated, confidence intervals come out too narrow, and differences between consumption modes can look significant when they are not.
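One way to see why 10 PSUs is worrying is the common rule of thumb that design-based degrees of freedom equal the number of PSUs minus the number of strata. The sketch below applies that rule to the PSU counts from the post. The number of strata per year (5) is purely a hypothetical value I picked for illustration, since the post does not say how many geographic areas were used, and `design_df` is my own helper name.

```python
def design_df(n_psus, n_strata):
    """Rule-of-thumb design degrees of freedom: number of PSUs minus number of strata."""
    if n_psus <= n_strata:
        raise ValueError("need more PSUs than strata to estimate variance")
    return n_psus - n_strata


# PSU counts come from the post; the stratum count of 5 is a hypothetical assumption.
df_2020 = design_df(n_psus=10, n_strata=5)
df_2022 = design_df(n_psus=900, n_strata=5)

print("2020 degrees of freedom:", df_2020)
print("2022 degrees of freedom:", df_2022)
```

Under these assumed numbers, the 2020 variance estimate rests on only a handful of degrees of freedom, while 2022 has hundreds. Any estimate that leans on the 2020 data inherits that instability, no matter how large the pooled sample looks.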
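So, what can be done to address this problem more rigorously? Stratification and clustering are properties of the sampling design, so they are fixed once the data are collected and cannot be applied as a fix afterwards. What the analyst can do is declare the strata and PSUs correctly, so that the variance estimator reflects the design that was actually used. Another option might be model-based methods, such as generalized linear mixed models (GLMMs) or Bayesian hierarchical models, which can represent PSUs as random effects. Even so, with only 10 PSUs in 2020, any estimate of between-PSU variability for that year is likely to be imprecise, whichever method is used.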
Pooling complex surveys with extreme PSU imbalance requires careful consideration of the survey design and statistical analysis. By acknowledging the limitations of the data and using appropriate methods, researchers can increase the validity of their findings and provide more accurate insights into the phenomenon being studied.
Do you have any experience with pooling complex surveys? How do you handle PSU imbalance in your analysis?