## Blog pages

### The Gospel according to "Frozen II" (or, why Elsa is Jesus) (edit 9)

I'm still rewriting The Gospel according to "Frozen II" (or, why Elsa is Jesus). I think the writing is basically done, and I've now published it in place of the old post. I still want to add a few more pictures.

You may next want to read:
A systematic mythology of the "Frozen" universe
The Gospel according to Disney's "Tangled"

### The Gospel according to "Frozen II" (or, why Elsa is Jesus) (edit 8)

I'm still rewriting The Gospel according to "Frozen II" (or, why Elsa is Jesus). I think I'm maybe halfway done.

You may next want to read:
A systematic mythology of the "Frozen" universe
The Gospel according to Disney's "Tangled"

### The Gospel according to "Frozen II" (or, why Elsa is Jesus) (edit 7)

I'm rewriting The Gospel according to "Frozen II" (or, why Elsa is Jesus). I haven't actually replaced the post with the new version, as the rewrite is still in an intermediary state.

You may next want to read:
A systematic mythology of the "Frozen" universe
The Gospel according to Disney's "Tangled"

### Many places can reopen now, but "how" matters more than "when"

Different parts of the country are now starting to open back up after the coronavirus lockdowns - and this is drawing a wide range of reactions. Some are protesting for faster reopening and greater freedoms, while others warn that opening too soon will trigger an exponential flare-up and cause thousands more deaths. Throw in the political polarization in our society, and it seems that people are being driven to a simplistic, binary yes/no position on reopening the country.

The reality is, of course, more complex and subtle. There are two major corrections we must make to a binary understanding. The first is that the United States is a large country, and local conditions vary greatly - and it's the local conditions that will dictate when a particular city or county can reopen. Many places are safe enough now, at least when we look at just the infection numbers. Other places still have severe outbreaks, and they need to get their numbers down further. The decision of "when" will need to be made state by state, county by county, and city by city. There is no one-size-fits-all solution. This doesn't mean that I think every locality is doing the right thing: as far as I can tell from just the numbers, Georgia is being a little reckless in reopening too early, and the San Francisco Bay Area is being too cautious in extending its shelter-in-place order until the end of May.

But these concerns pale in comparison to the second, and by far the more important, correction to the reopening question: HOW we reopen matters far more than WHEN. Remember, we need to keep R0 under 1. We need enough social distancing, personal protection, disinfecting, and other such measures, so that each infected person causes less than one additional infection. That's the "how". The question of "when" almost doesn't matter in comparison. If we really get the "how" right, much of the country can open now, or very soon. If we get it really wrong, then it won't matter how long we wait - we'll never be able to safely lift the lockdowns. That's why we must reopen very carefully and deliberately. This is the trickiest part of the whole plan, the part where the "how" is most likely to go wrong.

But more on the "how" later - for now, let's focus back on "when", and look at the data from the different locations.

This is a graph of deaths and cases for my home state of California. You see that the numbers have been growing, although they may have plateaued starting around late April. All the graphs in this post use the latest data as of May 4th.

The numbers in the title are each trailing 7-day averages of their per day values. So there have been 1636 new cases and 72.1 deaths per day on average, over the last 7 days.

The most important of these numbers is the "deaths per 1M: 1.8". This says that there have been 1.8 deaths per day per million people in the population, when averaged over the last 7 days. This is the number that tells you the level of risk for the average person in the state of California.
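As a concrete illustration, here's how a trailing 7-day average and the per-1M normalization can be computed with pandas. The daily death counts below are made-up numbers chosen for illustration (not the real California data), and the population figure of 39.5 million is my own approximation:

```python
import pandas as pd

# Hypothetical daily new-death counts for the last 7 days (illustration only)
daily_deaths = pd.Series([60, 75, 68, 80, 71, 77, 74])

CA_POPULATION_M = 39.5  # approximate California population, in millions

# Trailing 7-day average of deaths per day
avg_deaths = daily_deaths.rolling(7).mean().iloc[-1]

# Normalize to deaths per day per 1M people
deaths_per_1m = avg_deaths / CA_POPULATION_M

print(round(avg_deaths, 1), round(deaths_per_1m, 2))  # 72.1 1.83
```

Using a trailing average like this helps smooth out day-to-day reporting artifacts, such as the dip in numbers reported over weekends.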

One important point of reference for this number is 0.3 deaths per day per 1M - the risk of dying in a car accident. This is a level of risk that we're quite comfortable accepting. We won't shut down the economy to prevent people from dying in car accidents, so we should accept a similar level of risk from the coronavirus.

In fact, we can actually take on a bit more risk, because we know that the virus mostly kills the elderly, or those with pre-existing conditions. So if you're young and healthy, your risk of death drops by an order of magnitude or more - meaning that you would be able to tolerate numbers up to 3 deaths per day per 1M. With this in mind, I would say that any given locality should be safe to start reopening (slowly, carefully) if it's at or below 1 death per day per 1M. Up to 3 deaths per day per 1M may be acceptable, but places with values above that should seriously consider staying locked down until the numbers come down further.

In summary, if the deaths per day per 1M is:
Under 1: Safe to reopen
Between 1 and 3: Be cautious
Greater than 3: Should probably stay in lockdown
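Encoded directly, the rule of thumb looks like this (the function name and structure are mine; the thresholds come from the discussion above):

```python
def reopening_guidance(deaths_per_day_per_1m: float) -> str:
    """Classify a locality's risk level per the rule of thumb in the text."""
    if deaths_per_day_per_1m < 1:
        return "safe to reopen"
    elif deaths_per_day_per_1m <= 3:
        return "be cautious"
    else:
        return "should probably stay in lockdown"

# The car-accident baseline (~0.3 deaths/day/1M) falls in the "safe" band:
print(reopening_guidance(0.3))  # safe to reopen
print(reopening_guidance(1.8))  # be cautious (California's statewide value)
print(reopening_guidance(4.5))  # should probably stay in lockdown (Los Angeles)
```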

This is just a rule of thumb from some simple calculations: local leaders and experts will have more context and data, and you should defer to any such extra information they provide. Furthermore, this doesn't take into account how easy it would be to transmit the disease - even if you're young and healthy and the economy reopens, you should still try not to catch it, because you may pass it on to someone who's more at risk. But this has to do with the "how" of reopening - which, as I said, is far more important, and which I'll discuss later.

Looking at the graph for California, I'd be hesitant about opening up the whole state. The numbers are not yet on a clear downward trend, and the 1.8 deaths per day per 1M people is still a bit too high. But we can do better than looking at the whole state: when we drill down to the county level for the San Francisco Bay Area, we see this:

We see that many counties are below 1 death per day per 1M, with a few exceptions. The trends, too, all seem to be flat or decreasing. There are a few worrisome counties, but I would say that the Bay Area as a whole is a good candidate for reopening - although I wouldn't mind waiting another week or two.

What about some other counties in California? Let's look at the Sacramento and Los Angeles counties:

Sacramento is basically fine. With only 0.2 deaths per day per 1M, the coronavirus poses little danger in my state's capital. The trend, too, seems to be distinctly downwards. There was a protest there a few days ago calling for a reopening, and I'm inclined to be quite sympathetic towards them, especially if the reopening is just for the city of Sacramento and not for the whole state.

Los Angeles, on the other hand, should not reopen yet. 4.5 deaths per day per 1M is probably too high of a death rate, and the trend may still be increasing. In fact, we see that Los Angeles is basically responsible for much of the numbers for California as a whole.

We can run similar kinds of analysis for various states as well. Of course, New York and a bunch of other states in the northeast should not reopen yet:

Their deaths per day per 1M are still horrendously bad, and some of their trends are still upwards.

On the other hand, there are also states which are nearly unaffected by the virus:

These states are generally safe, and they should be able to reopen without putting their people at too much risk, as long as they get the "how" right.

Here are the graphs for all 50 states, along with their deaths per day per 1M values:

But as we saw with California, each state should really be examined at a county-by-county, city-by-city level. Furthermore, while this is an important number, it remains only one of many factors that determine when your state can reopen.

So that's the question of "when". Many states are quite safe and can reopen now, and many counties or cities in more iffy states are also safe. There are also areas that are still quite dangerous, which need to stay in lockdown. So the question of "when" needs to be approached locally. There isn't a binary yes/no answer for the whole country.

But, remember, "how" matters far more than "when". Nothing I said about which areas are safe matters at all if we get the "how" wrong. And the answer to "how" is simple: we need to keep R0 under 1. This means doing all the things that you already know about: wash your hands. Don't touch your face. Keep 6 feet of distance from others. Stay home if you're sick. Wear a face mask.
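The reason R must stay below 1 is that infections compound geometrically: each generation of spread multiplies the case count by R. A minimal sketch, using made-up starting numbers:

```python
def infections_after(initial: int, r: float, generations: int) -> float:
    """New infections after n generations, each scaled by the reproduction number r."""
    return initial * r ** generations

# Starting from 1000 hypothetical infections, over 10 generations of spread:
print(infections_after(1000, 1.2, 10))  # ~6192: the epidemic grows
print(infections_after(1000, 0.8, 10))  # ~107: the epidemic dies out
```

The same small difference in R, compounded over enough generations, separates an exponential flare-up from a fade-out - which is why the "how" measures that nudge R below 1 matter so much.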

Also consider what additional things you and your communities can do: face shields, in addition to face masks, probably help. You can work from home whenever possible, and you should avoid crowds. A temperature check can be required before entering a building or meeting someone. Hand sanitizers can be placed near commonly used surfaces, like door handles or elevator buttons.

As we progress further we will hear more about other measures that we can take: for instance, it seems that sunlight, heat, and humidity can all help us further fight the virus. If so, then these can all be incorporated into things you can do personally to fight the pandemic.

Some other "how" measures require more of a top-down approach from higher up, like testing (which we still need more of) and contact tracing. But much of the "how" question can be answered at an individual level, with personal responsibility. You are personally responsible for not catching the virus, and for not passing it on if you do.

And that's how we'll beat this thing. Especially as we reopen the economy, remember that the "how" matters more than "when".

You may next want to read:
Quick takes on the plan to re-open the country
Coronavirus endgame: how we get back to normal

### Re-analyzing the Stanford COVID-19 antibody study

#### Introduction and results

This is a re-analysis of Stanford's antibody study in Santa Clara County, which reported a population prevalence of 2.5% to 4.2% for COVID-19 antibodies, and a corresponding infection fatality rate of 0.12% to 0.2%. This result, if true, would have huge implications, as the lower fatality rate would dramatically change the calculus on important policy decisions, like when and how we should reopen the economy. However, this study has also received numerous criticisms, most notably for the results being inconsistent with the false positive rate of the antibody test.

Here, I attempt to derive what the results ought to have been, under a better methodology. I will be using a Bayesian approach, employing beta distributions to model probabilities.

The results I get at the end are as follows:
Antibody prevalence in study participants:
1.0% (95% CI: 0.16% to 1.7%)

Antibody prevalence in Santa Clara
(speculative, due to missing data):
1.9% (95% CI: 0.3% to 3.2%)

Infection fatality rate implied from above:
0.27% (95% CI: 0.17% to 1.6%)

This fatality rate is quite uncertain on its own, but it is in broad agreement with other similar kinds of studies.
#### Methodology and code

Alright, let's begin. First, let's import some packages:
In [1]:
```python
import numpy as np
import pandas as pd
from scipy.stats import beta
%matplotlib inline
```

Next, let's decide on a prior for the prevalence of antibodies in the study:
In [2]:
```python
p_prior = beta(1.1, 60)
x = np.linspace(0, 0.2, 1000)
pd.Series(p_prior.pdf(x), index=x).plot()
```

Out[2]:
```
<matplotlib.axes._subplots.AxesSubplot at 0x1f908dd0c88>
```
In [3]:
```python
p_prior.mean(), p_prior.median(), p_prior.interval(0.95)
```

Out[3]:
```
(0.01800327332242226,
 0.013074113173063773,
 (0.0006173369513848543, 0.06281995654253102))
```
Note that this prior is quite favorable to the results of the study. It piles on the prior probability right on top of where the study's results turned out to be (1.5%), with the mean and the median of the distribution falling right into the interval cited by the study (1.1-2%). So we are making an assumption a priori that the results of the study are correct. Of course, in a good Bayesian analysis this doesn't matter much in the end, as the prior should get overwhelmed by the evidence from the data.

Next, let's model the specificity and sensitivity of the antibody test they used. Pooling together the numbers they provided in the paper, we get:
In [4]:
```python
sensitivity = beta(78 + 25 + 0.5, 7 + 12 + 0.5)
specificity = beta(30 + 369 + 0.5, 0 + 2 + 0.5)
```
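As a quick sanity check on these pooled distributions (this cell is my own addition, not part of the original analysis), we can look at their means and intervals. Note in particular that the specificity's 95% interval bottoms out at around 98% - which is exactly what makes a raw positive rate of ~1.5% so sensitive to false positives:

```python
from scipy.stats import beta

# Same pooled parameters as in the In [4] cell above: counts from the paper,
# plus 0.5 on each side as a Jeffreys-style smoothing term
sensitivity = beta(78 + 25 + 0.5, 7 + 12 + 0.5)
specificity = beta(30 + 369 + 0.5, 0 + 2 + 0.5)

print(sensitivity.mean())          # ~0.84
print(specificity.interval(0.95))  # lower bound around 0.98
```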

We next run a simulation using random samples from these beta distributions, then only look at the results which actually reproduce the empirical results of the study - in essence, a rejection-sampling scheme. This satisfies Bayes' rule, which says that the parameter values that best predict the observed outcome should be favored. It's consistent with the adage that "when you have eliminated the impossible, whatever remains, however improbable, must be the truth". It undergirds Bayesian hierarchical modeling, and it's the same methodology I used in my argument for the resurrection of Jesus Christ.

The data from the study says that there were 50 positive cases out of 3330.
In [5]:
```python
n_sim = 1000000
total_cases = 3330
positive_cases = 50

df = pd.DataFrame()
df["p_population"] = p_prior.rvs(n_sim)
df["sensitivity"] = sensitivity.rvs(n_sim)
df["specificity"] = specificity.rvs(n_sim)
df["detected_positives"] = (
    # true positives
    df["p_population"] * df["sensitivity"]
    # plus false positives
    + (1 - df["p_population"]) * (1 - df["specificity"])
).apply(lambda x: round(x * total_cases))

# "eliminate the impossible":
data_df = df[df["detected_positives"] == positive_cases]
data_df.head(3)
```


Out[5]:
```
     p_population  sensitivity  specificity  detected_positives
81       0.005393     0.909420     0.989789                  50
226      0.003434     0.781180     0.987629                  50
455      0.011168     0.891503     0.994876                  50
```
The distribution of true prevalence of the antibodies can then simply be read off from the "p_population" column:
In [6]:
```python
data_df["p_population"].hist(bins=50)
```

Out[6]:
```
<matplotlib.axes._subplots.AxesSubplot at 0x1f9090fce80>
```
The relatively smooth distribution shows that the number of simulations was sufficient. Note that the distribution runs right up to 0, as you'd expect with a specificity with a lower bound of about 98%. The actual prevalence of antibodies in the study can then be characterized as follows:
In [7]:
```python
data_df["p_population"].mean()
```

Out[7]:
```
0.010314341450043712
```
In [8]:
```python
data_df["p_population"].quantile([0.025, 0.975])
```

Out[8]:
```
0.025    0.001608
0.975    0.016619
Name: p_population, dtype: float64
```
So we have an antibody prevalence of 1.0% (95% CI: 0.16% to 1.7%) for the study participants.
#### Discussion on demographic reweighing

This now needs to be re-weighed to match the demographics of Santa Clara county. The paper mentions that it re-weighed the samples by zip code, race, and sex.

In order to do this calculation, we would need every zip-race-sex combination in the study, along with the detected number of positives and negatives in that combination. Unfortunately, the paper doesn't provide that data - presumably for privacy reasons. Fair enough.

However, I will note that this re-weighing is highly unlikely to reduce the uncertainty in our metric. That is to say, if we had a perfectly random sample, then our study would perfectly reflect the population - and even so our uncertainty already spans a whole order of magnitude (0.16% to 1.7%). Could deviating from this ideal scenario make our final answer MORE certain? Any deviation, and the required adjustment, is far more likely to add uncertainty rather than certainty.

If I may engage in a bit of speculation here, I suspect this is where the study went wrong. As far as I can tell without the missing data, they calculated a point estimate for the prevalence in the study participants, then performed the reweighing to get the population-adjusted point estimate. This step nearly doubled the prevalence, from 1.5% to 2.8%. Meaning that, at this point, they were effectively thinking of the positives in their samples not as 50/3330, but 94/3330 - nearly doubling the number of positive samples artificially. Only after this adjustment did they correct for sensitivity and specificity - with the result that the inflated positive numbers were now able to outpace the false positive rate.

Now, it's not impossible for the correct procedure to end up doing something similar. After all, the individual demographic information is additional data, so adding in that data could add more certainty, and that certainty could work to increase the prevalence. But this would require a rare, specific set of circumstances, and very strong assumptions. So I'm not saying that the original paper was necessarily wrong - but I would like a release of the data itself, or at least a discussion of it, before I believe the results. Such rarities in the data would, in all likelihood, point to a flaw in the sampling rather than additional certainty.

And indeed there ARE flaws in the sampling, even apart from this issue. Two points stand out: first, the participants for the study were recruited via Facebook - which would naturally select for those who had higher reason to believe that they had been infected at some point. Second, the demographic combination mentioned above explicitly doesn't correct for age, which means that compared to the county, the study systematically underrepresents the elderly (65 or older). Of course, this is the demographic which has the most to lose from an infection, and so would be most cautious in trying not to catch it. So underrepresenting this group would cause the reported prevalence rate to be inflated.

Both of these effects would artificially increase the calculated prevalence rate - which means that we should be VERY cautious about the demographic reweighing further increasing our results. In particular, it is very difficult to justify the lower bound increasing from 0.16% in the study participants, to 2.5% in the population - which is the value that's given in the paper. Nor is that 2.5% lower bound from a proper 95% interval: it is rather just the smallest value among three constructed scenarios - where two of the scenarios have their own lower bounds BELOW this 2.5% value.
#### Results, comparison to other studies, and conclusion

Given all this, and the fact that the required data are simply not provided, perhaps the best we can do for the demographic reweighing is to simply increase our unweighed results proportionately, which would keep the relative uncertainty the same. In the study itself, the demographic reweighing increased the prevalence from 1.5% to 2.8% - a factor of 1.87. Doing the same to our results cited above gives:

Antibody prevalence of 1.9% (95% CI: 0.3% to 3.2%) for Santa Clara County.

Using the same number of deaths as in the study (100), this translates to:

An infection fatality rate of 0.27% (95% CI: 0.17% to 1.6%).
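The arithmetic behind these two numbers can be reproduced as follows. The 1.87 scaling factor comes from the study's own reweighing (1.5% to 2.8%); the Santa Clara County population of roughly 1.93 million is my own approximate figure, so the results land close to, but not exactly on, the rounded values quoted above:

```python
# Scale the unweighed simulation results by the study's reweighing factor,
# then convert prevalence into an implied infection fatality rate.
scale = 2.8 / 1.5    # ~1.87, from the study's point estimates
population = 1.93e6  # approximate Santa Clara County population (my assumption)
deaths = 100         # county deaths, as used in the study

# central estimate, CI lower, CI upper (from the simulation above)
prevalence = [0.0103, 0.0016, 0.0166]
scaled = [p * scale for p in prevalence]
print([round(100 * p, 1) for p in scaled])  # [1.9, 0.3, 3.1] percent

# IFR = deaths / (prevalence * population); a lower prevalence implies a higher IFR
ifr = [deaths / (p * population) for p in scaled]
print([round(100 * r, 2) for r in ifr])     # [0.27, 1.73, 0.17] percent
```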

This is a tentative result obtained by fudging around the missing data, and the uncertainty range is broad and not particularly helpful for policy setting - and yet it provides some measure of insight in conjunction with other similar studies.

For example, there's a German antibody study that reports an infection fatality rate of 0.37% (with a 14% prevalence, which makes it robust against false positives).

An antibody test in New York gives a rate of 0.5% (with a 10-20% prevalence, again making it robust against false positives).

Preliminary results from two other antibody tests have also been released: USC conducted a test in LA County, with an infection fatality rate of 0.13% to 0.27% and a prevalence rate of 4.1%. The University of Miami conducted a study of Miami-Dade County, which gave a prevalence rate of 4.4% to 7.9%. With some 300 deaths in the county, that translates to an infection fatality rate of 0.14% to 0.24%. These seem to share many of the same characteristics as the Santa Clara study: blood tests reporting low prevalence. In addition, they have smaller sample sizes. It's not yet known whether they share the same flaws.

A study of the Diamond Princess cruise ship gives an adjusted infection fatality rate of 0.6%, with a 95% CI of 0.2% to 1.3%.

A group of pregnant women were also tested in New York City. About 14% of them tested positive for the virus. This is an atypical demographic group who were tested with a different method, so their results cannot be extrapolated to the whole population. But the results here are roughly consistent with the previously cited New York study, reinforcing its numbers.

Lastly, Iceland performed a sampling of their whole population, testing for the active virus itself. They reported that about 0.7% of their population actively had the virus in the 20 days leading up to April 4th. Given that their total cases number about 3 times the average active count they had during those dates, this roughly translates to a 2.1% prevalence rate and a 0.13% infection fatality rate. But there are huge uncertainties associated with these numbers - it's not known how many tests were performed at which points in the infection's progress, and Iceland is a small country - they have only 10 deaths and a total population of 360 thousand people.
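Iceland's back-of-the-envelope numbers work out like this (the 0.7% active fraction and the 3x multiplier are taken from the paragraph above):

```python
# Iceland back-of-the-envelope: active infections -> total prevalence -> IFR
population = 360_000
deaths = 10
active_fraction = 0.007  # ~0.7% actively infected during the sampling window
total_to_active = 3      # total cases ~3x the average active count

prevalence = active_fraction * total_to_active  # ~2.1%
ifr = deaths / (prevalence * population)        # ~0.13%
print(round(100 * prevalence, 1), round(100 * ifr, 2))  # 2.1 0.13
```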

So looking at these previous results, we can say that our Santa Clara study, when re-analyzed as above, is in line with the rest. Though there are still large uncertainties, these studies seem to be converging roughly in the ballpark of 0.2% to 0.6% for the infection fatality rate.

You may next want to read:
Quick takes on the plan to re-open the country
Keeping score: my coronavirus predictions