Simple addition to a Jupyter table

A colleague recently commented on a report I had generated, saying it would be way more user-friendly to make the tables interactive. Sigh. Of course, he is right. When they come across big tables, most people will just scroll right past, even if they are interested in the content. It is just too much trouble to digest.

Based on his suggestion, I went searching for HTML or JavaScript tools that might help me do something like that. What I stumbled upon has been super useful not just for reports, but for just about every Jupyter script I write.

It is the itables module for Python and pandas. By adding just a few lines of code, every dataframe you display becomes interactive: column headers can be clicked to sort, long tables are broken into pages, and an awesome search bar makes filtering on the fly a snap.
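Here is a minimal sketch of what those few lines look like. The import names follow the current itables documentation (init_notebook_mode and show); older releases used a slightly different activation import, so check the version you have installed.

import pandas as pd
from itables import init_notebook_mode, show

# Turn every displayed DataFrame into an interactive table
# (per the itables docs; older versions used a different import)
init_notebook_mode(all_interactive=True)

df = pd.DataFrame({'county': ['Ada', 'Boise', 'Canyon'],
                   'cases': [10, 2, 25]})
df        # now renders sortable, paginated, and searchable

# Or make a single table interactive without the global switch:
show(df)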

First-occurrence of COVID cases and deaths by US county

When the New York Times made its compilation of covid cases available, I decided to take a look.  Although it seems like EVERYONE is doing covid analysis, there were a few things that I wasn’t seeing.  For instance, the US is a huge country and I imagine that the sense of threat from covid is going to be related to how close the virus is geographically to a given person.

At the state level, at the beginning of March, only about 10 states were showing cases, but within a week or so, almost all states registered cases.  By the beginning of April, just about all states registered a covid death:

Because the NYT data are reported down to the county level, we can get better geographic resolution. At this level, we see there are still quite a few counties that haven’t registered a case, and fewer than half have seen a covid death (as of May 6, 2020).

How about looking at the number of people in those counties?  For that, I pulled in the 2010 Census numbers for a quick comparison (I removed territories for this analysis).  The total population for 2010 is 309,585,169.  Here we see that just about all people are in counties with cases and about 85% of people are in counties that have registered deaths – many of those ‘spared’ counties have very few people.
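For the record, here is a rough sketch of how that comparison can be put together with pandas. The NYT GitHub URL, the column names, and the census file name are assumptions on my part; the notebook linked below has the actual details.

import pandas as pd

# County-level NYT data (URL and column names are assumptions; see their
# covid-19-data GitHub repository for the current layout)
nyt_url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
covid = pd.read_csv(nyt_url, parse_dates=['date'])

# 2010 Census counts, one row per county ('census_2010_counties.csv' and the
# 'fips'/'pop2010' columns are hypothetical names for illustration)
census = pd.read_csv('census_2010_counties.csv')
total_pop = census['pop2010'].sum()

fips_with_deaths = covid.loc[covid['deaths'] > 0, 'fips'].unique()
share = census.loc[census['fips'].isin(fips_with_deaths), 'pop2010'].sum() / total_pop
print(f'{share:.0%} of the 2010 population lives in a county with a covid death')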


And here are the days of first cases and deaths:

(Here is a version of the Jupyter notebook I used for these analyses.)

How many cuts of a deck of cards does it take before the deck is shuffled? Maybe the answer will surprise you.

During the current covid-19 “stay-at-home” order, my family has been playing a lot of cards.  Inevitably, someone will complain that the dealer hasn’t shuffled enough or is over-shuffling.  And that leads to questions like “how can you even know if you have shuffled enough?”

Sounds like a job for simulations!  Let’s start with some very simple models:

If we assume that, instead of the typical deck of Ace, 2, 3, …, Q, K in four different suits, the cards here are just numbered from 0 to 51, we simplify the simulation problem.  Thus, the deck, when “new,” will be consecutively ordered from 0 to 51.

Furthermore, a simple measure of its likeness to the “new” state is the accumulated difference between consecutive cards:
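In symbols, for a deck of cards $c_0, c_1, \ldots, c_{51}$ read from top to bottom, that measure is

$$\text{score} = \sum_{i=0}^{50} \lvert c_{i+1} - c_i \rvert ,$$

so a brand-new deck, where every consecutive difference is 1, scores 51.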

Our “new” deck is defined as a simple list:

def ndeck():
    return list(range(52))

And it looks like:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51]

A shuffled deck is a new deck that is scrambled with random.shuffle from Python’s standard random module:

import random

def rdeck():
    d = ndeck()
    random.shuffle(d)
    return d

… and will look something like:

[19, 45, 44, 40, 32, 33, 27, 37, 14, 10, 
34, 5, 23, 8, 17, 21, 48, 6, 47, 24, 
38, 15, 46, 29, 22, 12, 36, 42, 0, 41, 
3, 2, 13, 11, 51, 30, 9, 28, 20, 49, 
31, 16, 4, 50, 43, 39, 25, 1, 18, 7, 
26, 35]

We will calculate scores of our decks with:

def deck_score(deck):
    accum = 0
    for i, card in enumerate(deck):
        if i < len(deck) - 1:  # don't go past the final card
            accum += abs(card - deck[i + 1])
    return accum

What scores do a new and a shuffled deck yield?


The new deck score is 51.

Shuffled decks (using the standard Python module random) give us a range of scores, and the mean shuffled score is about 900.
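A quick way to reproduce those numbers with the functions above (the 10,000 repetitions below is an arbitrary choice):

import statistics

print(deck_score(ndeck()))        # 51 for a brand-new deck

scores = [deck_score(rdeck()) for _ in range(10000)]
print(statistics.mean(scores))    # roughly 900 for shuffled decks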



Cutting the deck

Let’s start with a simple form of shuffling: just cutting the deck. This is the simple process of separating the deck into two piles of random thickness and then swapping them, so the bottom pile ends up on top. A function to do that for us:

def cut_deck(deck,minThin=4):  # minThin controls how thin the piles can be
    lower = minThin # the lower and upper limits to where the cut can be in the deck
    upper = len(deck)-minThin
    cut = random.randint(lower,upper)  # location of the cut
    new = deck[cut:] + deck[:cut]  # bottom on top, top on the bottom
    return new

And the result looks something like:

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51, 0, 1, 2, 3, 4, 5, 6, 7, 
8, 9]

Pretty simple, eh?


A Null Hypothesis

A good place to start exploring is to guess what you are going to see as you start simulating. In this case, my first hypothesis is that as we cut the same deck consecutively, its deck score will climb toward the mean score of a shuffled deck, and that it will take about 12 cuts to make a random-looking deck. Why 12? Just a guess.

I think it will look something like:



Consecutive cuts

If a new deck gives us a score of 51, and a deck cut once yields 101 (fifty consecutive pairs that still differ by 1, plus a single jump of 51 where card 51 now sits next to card 0), what does cutting a deck twice yield?

151?

l = []
d = ndeck()
l.append(deck_score(d))
for i in range(10):
    d = cut_deck(d)
    #print(d)
    l.append(deck_score(d))
print(l)

And the result of the consecutive cuts?

[51, 101, 101, 101, 101, 101, 
101, 101, 101, 101, 101]

Huh?!  Let’s look at a deck after a million cuts:

d = ndeck()
for i in range(1000000):
    d = cut_deck(d)

The result:

[39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 
49, 50, 51, 0, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 
37, 38]

Just a single cut!


Thinking a little more carefully reveals what’s happening. The second cut actually reconnects the two cards that were separated by the first cut!

This was a surprise to me. Consecutive cutting doesn’t actually change things much: the only thing that moves is the single seam between the two cards most recently cut apart, because each new cut “heals” the previous one!
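Here is a quick check in code (using the ndeck and cut_deck functions above) that two cuts collapse into a single cut:

d = cut_deck(cut_deck(ndeck()))            # cut a new deck twice
top = d[0]                                 # whichever card ended up on top
print(d == ndeck()[top:] + ndeck()[:top])  # True: same as one cut at 'top'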

Thinking about it now, it makes sense. But it runs counter to my long-held belief that cutting multiple times results in a shuffled deck. I admit that I had to pull out a real deck of cards to convince myself that this simulation result wasn’t an artifact. It is not. Try it out.

Was this a surprise to you?  Just about everyone in my house guessed that the deck would be pretty shuffled after a bunch of cuts. Probably a thinking error that magicians exploit.

An empirical exploration of the Central Limit Theorem

The way I understand it, the Central Limit Theorem states that a distribution of sampled means of ANY distribution will be normal.

Well, sort of.

From Wikipedia:

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

And in another place:

The central limit theorem states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows.
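In symbols: if $X_1, X_2, \ldots, X_n$ are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2$, and $\bar{X}_n$ is their sample mean, then

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1) \quad \text{as } n \to \infty ,$$

which is why the histograms of sample means below should look increasingly bell-shaped as the sample size grows.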

In general, I only start to understand such concepts if I can play around with them.  So let’s see how this works! We will start by defining a function that will perform the sampling to create distributions of means.

In the code below, s in the function getSamp is the name of the distribution that we will be testing.


# preamble
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import pandas as pd
import numpy as np

nbins = 100

def getSamp(s='normal(5,3,', n=1000, reps=10000):
    # build the numpy.random call from the string s,
    # e.g. 'np.random.normal(5,3,size=n)'
    ddist = eval('np.random.' + s + 'size=n)')        # one sample from the raw distribution
    xb = eval('np.random.' + s + 'size=(n,reps))')    # reps samples, each of size n
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
    ax1.set_title('Data distribution')
    ax1.hist(ddist, bins=nbins)
    ax2.set_title('Sample distribution of the mean for: ' + s + ')')
    mb = xb.mean(axis=0)             # the mean of each of the reps samples
    return ax2.hist(mb, bins=nbins)  # returns the histogram arrays

tmp = getSamp()

For the default (normal distribution, mean = 5, std. dev. = 3), here’s the data distribution and the resulting distribution of means:

Let’s define a list of distributions that we want to run against.


dist = ['normal(5,3,',
        'uniform(low=0,high=5,',
        'binomial(n=100,p=.5,',
        'logistic(',
        'laplace(',
        'exponential(',
        'lognormal(',
        'poisson(',
        'poisson(lam=50,',
        'power(a=2,',
        'power(a=50,',
        'power(a=10000,',
        'wald(mean=5,scale=1,']

Then run it:


for d in dist:
    tmp = getSamp(s=d)

For uniform and binomial distributions:

For exponential and lognormal distributions:

How about a couple of different poisson distributions?

That default poisson yields a normal-looking mean distribution, but clearly the highly discrete nature of the source distribution has an effect on the mean’s distribution.

But overall, it seems to hold that the distribution of the means is approximately normal. (You might want to try other distributions for yourself…)
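If you do, any numpy.random distribution that accepts a size keyword should drop straight into getSamp's string-building approach; the three below are just examples I picked, not ones from the run above:

more = ['gamma(2,',          # shape parameter 2
        'chisquare(3,',      # 3 degrees of freedom
        'geometric(p=0.2,']  # heavily skewed and discrete

for d in more:
    tmp = getSamp(s=d)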