Our key independent variable is the mean percentage of vaccine-related misinformation shared via Twitter at the U.S. state or county level. We used 55 M tweets from the CoVaxxy dataset^{17}, which were collected between January 4th and March 25th, 2021, from the Twitter filtered stream API using a comprehensive list of keywords related to vaccines (see Supplementary Information). We leveraged the Carmen library^{29} to geolocate almost 1.67 M users residing in 50 U.S. states, and a subset of approximately 1.15 M users residing in over 1,300 counties. The larger set of users accounts for a total of 11 M shared tweets. Following an established approach in the literature^{25,26,27,28}, we identified misinformation by considering tweets that contained links to news articles from a list of low-credibility websites compiled by a politically neutral third party (see details in the Supplementary Information). We measured the prevalence of misinformation about vaccines in each region by (i) calculating the proportion of vaccine-related misinformation tweets shared by each geo-located account; and (ii) taking the average of this proportion across accounts within a specific region. The Twitter data collection was evaluated and deemed exempt from review by the Indiana University IRB (protocol 1102004860).
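The two-step prevalence measure can be sketched as follows. This is a minimal pandas illustration with toy data; the column names (`user_id`, `region`, `low_cred`) are ours, not taken from the paper's code:

```python
import pandas as pd

# Hypothetical input: one row per vaccine-related tweet, with the account id,
# the account's geolocated region, and a flag for links to low-credibility sites.
tweets = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3, 3],
    "region":   ["IN", "IN", "IN", "IN", "IN", "OH", "OH"],
    "low_cred": [1, 0, 0, 1, 1, 0, 0],
})

# (i) proportion of low-credibility vaccine tweets shared by each account
per_account = tweets.groupby(["region", "user_id"])["low_cred"].mean()

# (ii) average of that proportion across accounts within each region
prevalence = per_account.groupby(level="region").mean()
print(prevalence)   # IN: (1/3 + 1) / 2 ≈ 0.667, OH: 0.0
```

Note that averaging per-account proportions (rather than pooling tweets) prevents a few prolific accounts from dominating a region's score.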

Our dependent variables include vaccination uptake rates at the state level and vaccine hesitancy at the state and county levels. Vaccination uptake is measured from the number of daily vaccinations administered in each state during the week of 19–25 March 2021, with measurements derived from the CDC^{9}. Vaccine hesitancy rates are based on Facebook Symptom Surveys provided by the Delphi Group^{24} at Carnegie Mellon University. Vaccine hesitancy is likely to affect uptake rates, so we specify a longer time window to measure this variable, i.e., the period January 4th–March 25th, 2021. We computed hesitancy as one minus the proportion of individuals “who either have already received a COVID vaccine or would definitely or probably choose to get vaccinated, if a vaccine were offered to them today.” See Supplementary Information for further details.

There are no missing vaccine-hesitancy survey data at the state level. Observations are missing at the county level because Facebook survey data are available only when the number of respondents is at least 100. We use the same threshold on the minimum number of Twitter accounts geolocated in each county, resulting in a sample size of N = 548 counties.

Our multivariate regression models adjust for six potential confounding factors: percentage of the population below the poverty line, percentage aged 65+, percentage of residents in each racial and ethnic group (Asian, Black, Native American, and Hispanic; White non-Hispanic is omitted), rural–urban continuum code (RUCC, county level only), number of COVID-19 deaths per thousand, and percentage Republican vote (in 10 percent units). Other covariates, including religiosity, unemployment rate, and population density, were also considered (full list in Supplementary Table S9).

We also conduct a large number of sensitivity analyses, including different specifications of the misinformation variable (with a restricted set of keywords and different thresholds for the inclusion of Twitter accounts) as well as log-transformed versions of the misinformation variable (to correct its positive skew). These results are presented in Supplementary Information (Tables S3–S8).

We estimate multiple regression models predicting vaccination rate and vaccine hesitancy. Both dependent variables are approximately normally distributed, making weighted least squares regression an appropriate model. Data are observed (aggregated) at the state or county level rather than at the individual level, so analytic weights are applied to give more influence to observations calculated over larger samples. The weights are inversely proportional to the variance of an observation, such that the variance of the *j*-th observation is assumed to be σ^{2}/*w*_{j}, where *w*_{j} is the weight; we set the weights equal to the size of the sample from which each average is calculated. We estimate the weighted regressions using analytic weights (aweights) in Stata 16. In addition, because counties are nested hierarchically within states, we use cluster-robust standard errors to correct for the lack of independence among county-level observations.

We investigate Granger causality between vaccine hesitancy and misinformation by comparing two autoregressive models. The first considers daily vaccine hesitancy rates $x$ at time $t$ in geographical region $r$ (state or county):

$$x_{t,r} = \sum_{i=1}^{n} a_{i}\, x_{t-i,r} + \epsilon_{t,r},$$

where $n$ is the length of the time window. The second model adds daily misinformation rates per account as an exogenous variable $y$:

$$x_{t,r} = \sum_{i=1}^{n} \left( a_{i}\, x_{t-i,r} + b_{i}\, y_{t-i,r} \right) + \epsilon'_{t,r}.$$

The variable $y$ is said to Granger-cause^{30,31} $x$ if it yields a statistically significant reduction in the error term $\epsilon'_{t,r}$, i.e., if

$$E_{a,b} = \sum_{t,r} \epsilon_{t,r}^{2} - \sum_{t,r} {\epsilon'}_{t,r}^{2} > 0,$$

meaning that misinformation rates *y* help forecast hesitancy rates *x*. We assume that geographical regions are equivalent and independent in terms of how misinformation influences vaccine attitudes; thus, we use the same parameters $a_{i}$ and $b_{i}$ across all regions. We fit $a$ and $b$ by Ordinary Least Squares linear regression (Python statsmodels package, version 0.11.1), standardizing the two variables and removing trends from the time series of each region. We select the value of the time window $n$ that maximizes $E_{a,b}$; for both counties and states this was $n = 6$ days, and we present results using this value. We also tested nearby values ($n \pm 2$) and confirmed that these provide similar results. We use data points with at least 1 tweet and at least 100 survey responses for every day in the time window for the specified region.

The traditional statistic used to assess the significance of Granger causality is the F-statistic^{30}. However, there are several reasons why this is not appropriate in our case. First, we have missing time windows in some of our regions. Second, our assumptions of equivalence and independence across regions may not be accurate. For these reasons, we use a bootstrap method to estimate the expected random distribution of $E_{a,b}$ with the time signal removed. To this end, we generate trial surrogates for $y$ by randomly shuffling its data points. For each reshuffled trial, we use the same procedure to calculate the reduction in error, which we call $E^{*}_{a,b}$. The *p*-value of our Granger causality analysis is then given by the proportion of trials ($N = 10{,}000$) for which $E^{*}_{a,b} > E_{a,b}$. A potential issue with Granger causality analysis is that it may detect an underlying trend; we tested for this by linearly detrending both time series before running the analysis, finding similar results.
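The shuffle test above can be sketched as a self-contained procedure. This illustration uses plain numpy least squares for the restricted and full fits; all function and variable names are ours, not the authors' code:

```python
import numpy as np

def error_reduction(x, y, n_lags):
    # Pooled restricted vs. full least-squares fit over all regions;
    # returns E = SSR_restricted - SSR_full.
    lags_x, lags_xy, target = [], [], []
    for r in x:
        xs, ys = np.asarray(x[r]), np.asarray(y[r])
        for t in range(n_lags, len(xs)):
            lags_x.append(xs[t - n_lags:t])
            lags_xy.append(np.concatenate([xs[t - n_lags:t], ys[t - n_lags:t]]))
            target.append(xs[t])
    target = np.asarray(target)

    def ssr(rows):
        A = np.asarray(rows)
        coef = np.linalg.lstsq(A, target, rcond=None)[0]
        return float(np.sum((target - A @ coef) ** 2))

    return ssr(lags_x) - ssr(lags_xy)

def shuffle_p_value(x, y, n_lags, n_trials=10_000, seed=0):
    """p-value: fraction of trials in which the error reduction E* from a
    randomly permuted y (time signal destroyed) exceeds the observed E."""
    rng = np.random.default_rng(seed)
    e_obs = error_reduction(x, y, n_lags)
    exceed = 0
    for _ in range(n_trials):
        y_star = {r: rng.permutation(np.asarray(s)) for r, s in y.items()}
        if error_reduction(x, y_star, n_lags) > e_obs:
            exceed += 1
    return exceed / n_trials
```

Because extra regressors never increase the in-sample SSR, each shuffled $E^{*}$ is still slightly positive; the test asks whether the observed $E$ is larger than this overfitting baseline.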

https://www.nature.com/articles/s41598-022-10070-w