One of our active forecasters requested more information about the cluster analysis for the HPV-related questions on SciCast.

**Background**

The U.S. CDC reports that human papillomavirus (HPV) is the most common sexually transmitted infection in the U.S. Because some types of HPV are initially asymptomatic but increase the risk of cancer, particularly cervical cancer in women, great effort has been put into vaccinating the population against it. Two HPV vaccines have been introduced since 2006, and the CDC encourages their use for girls age 11 and older.

Studies of HPV initially focused on the 13- to 17-year-old population, and the CDC estimates that 53.8% of U.S. females aged 13-17 had been vaccinated with at least one dose of an HPV vaccine in 2012, a gain of 0.8% since 2011. Vaccination coverage varies widely across U.S. States, but some States resemble one another. Therefore, instead of linking each State directly to the US average, we grouped the States into clusters.

**Cluster Analysis**

Clusters of States were created by first analyzing variables that correlate with HPV vaccination coverage in 2011 and 2012. A simple model that uses those variables to predict HPV vaccination coverage explained over half the variation among States. To view the State-level variables, including HPV vaccination coverage estimates, *open this Google Sheet*.
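To make "explained over half the variation" concrete, here is a minimal sketch of how that share (R²) is computed for a least-squares fit. It is only an illustration: it uses a single predictor rather than the several variables in our model, and the predictor values and coverage figures are invented, not taken from the Google Sheet.

```python
def r_squared(x, y):
    """Share of variance in y explained by a one-predictor least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variation left over after the fit vs. total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: one State-level predictor and HPV coverage (%) for five States.
predictor = [10, 20, 30, 40, 50]
coverage = [42, 50, 47, 60, 58]

r2 = r_squared(predictor, coverage)
explains_over_half = r2 > 0.5
print(f"R^2 = {r2:.2f}, explains over half: {explains_over_half}")
```

An R² above 0.5 means the fitted line accounts for more of the State-to-State variation than it leaves unexplained.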

On the most useful variables, States in a given cluster are more similar to each other than to States in other clusters. To create the clusters, I used the *mclust* and *cluster* packages in the R statistical software to try several forms of cluster analysis. Results varied somewhat, and we chose a five-cluster solution in which the clusters were relatively easy to interpret and each contained a reasonable number of States.
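For readers who want to experiment with their own groupings, the sketch below shows the general idea of clustering States on standardized variables. It is not the analysis described above: it uses plain k-means with a deterministic farthest-point start as a simple stand-in for the model-based and other methods tried in R, and the six "States" and their two variables are made up for illustration.

```python
from statistics import mean

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic start: first point, then the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(
                tuple(mean(col) for col in zip(*members)) if members else centroids[c]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Hypothetical States with two standardized variables each (invented values).
states = {
    "State A": (-1.0, -0.9), "State B": (-1.1, -1.0), "State C": (-0.9, -1.2),
    "State D": (1.0, 1.1), "State E": (0.9, 1.0), "State F": (1.2, 0.8),
}
centroids, labels = kmeans(list(states.values()), k=2)
cluster_of = dict(zip(states, labels))
print(cluster_of)
```

With real data, different methods and different numbers of clusters will give somewhat different groupings, which is why interpretability and cluster size guided our final choice.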

### Cluster Model

The link structure on SciCast is a simple hierarchy:

- US Rate –> {5 Clusters}
- Cluster –> {States In the Cluster}
- Also, US Rate in 2013 –> US Rate in 2014

That means you can forecast cluster rates *given* US rates (or vice versa), and a State’s expected rate *given* its cluster’s rate. (Because States are modeled as scaled continuous questions, you cannot forecast cluster *given* State.)

Based on our model, we have set initial marginal distributions for all 5 clusters, and an initial conditional distribution of each cluster *given* that the US is in its most likely state. Both are approximately Normal, with the conditional distribution having the smaller variance.
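The reason the conditional distribution is narrower can be illustrated with a textbook bivariate Normal: if the cluster rate and the US rate are jointly Normal with correlation rho, then knowing the US rate shrinks the cluster's standard deviation by a factor of sqrt(1 - rho^2). The sketch below assumes that simplified setup and conditions on a single US value rather than a range of values; every number in it (means, standard deviations, correlation) is hypothetical and is not one of the SciCast initial values.

```python
from statistics import NormalDist

def conditional_normal(mu_c, sd_c, mu_u, sd_u, rho, u):
    """Distribution of the cluster rate given US rate == u, for a bivariate Normal."""
    cond_mean = mu_c + rho * (sd_c / sd_u) * (u - mu_u)
    cond_sd = sd_c * (1 - rho ** 2) ** 0.5
    return NormalDist(cond_mean, cond_sd)

# Hypothetical inputs: cluster rate ~ N(50, 6), US rate ~ N(54, 2), correlation 0.8.
marginal = NormalDist(mu=50.0, sigma=6.0)
conditional = conditional_normal(
    mu_c=50.0, sd_c=6.0, mu_u=54.0, sd_u=2.0, rho=0.8,
    u=54.0,  # condition on the US rate sitting at its most likely value
)

print(f"marginal sd:    {marginal.stdev:.2f}")
print(f"conditional sd: {conditional.stdev:.2f}")
```

Because the US rate is conditioned at its own mean here, the cluster's expected rate does not move; only the spread narrows.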

### Not Known: States Conditional on Clusters

Although a State’s HPV vaccination rate likely depends on the cluster to which the State belongs, we do not have a very clear model of the relation between a specific State’s vaccination rate and its cluster’s vaccination rate. Forecasts of a State rate *given* its cluster need to be filled in by users through the “Related Forecasts” section, which appears after a forecast on a cluster question. (Even a forecast that completely agrees with the current market forecast will open up conditional forecasting options.) Hopefully the statistics in the file will help forecasters devise their own ideas!

by Ken Olson