Getting a Complete Collection

10 June 2010

This summer, I’m teaching 400-level intro probability from Ross, 8th edition.  Recently we covered a problem from that book which has a special place in my heart.  This is a problem which I made up for myself, and then actually solved, when I was in 7th grade, long before I’d ever seen a book on probability.

The Problem.

Suppose that you wish to collect a complete set of a certain kind of collectible cards. (Insert your favorite kind of trading cards, real or imagined.)  There are a total of n different types of card, and whenever you buy a card it is equally likely to be any of the n types.  If you want to buy cards, one at a time, until you have a complete set, how many cards do you expect to buy?  (This is a problem about expected value in the probability theory sense; follow the link if that’s not in your working vocabulary.)

A direct assault on this problem gets very hard in a hurry.  For example, if you wanted to compute the probability that you will complete a collection of 500 things on your 731st try, you’d find that that probability is excruciatingly difficult to work out.  The beauty is that we don’t need to do anything nearly that hard!  All we need are two ideas from probability (both of which are important in their own right) and an elegant way of applying them.

Idea 1: Expectation is Additive.

That is, if X and Y are any random quantities whatsoever, then the expected value of X+Y is the sum of the expected values.  When you put it this way, it seems obvious, and in a sense it is: the proof is very easy.  But it’s also powerful, because it applies to any quantities whatsoever.  You don’t have to understand how X and Y are related; you only need their individual expectations.

Idea 2: How Many Times Do I Have to Roll a Die?

Let’s say you are performing a series of independent trials, and each one “succeeds” with probability p.  Then the expected number of attempts required is 1/p.

This should square pretty well with intuition, but it doesn’t go without checking.  In this case the verification takes the form of an infinite series.  Since the probability of needing exactly k attempts is (1-p)^{k-1}p, we have to sum the series \sum_{k=1}^\infty k(1-p)^{k-1}p, which has value 1/p.  The series has an elegant evaluation, but let’s not worry about how to compute it now.  (Exercise for the interested; the rest of you, take my word for it.)
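If you’d rather check this numerically than sum the series, here is a quick simulation sketch in Python.  The function name and the seeded random generator are my own choices for illustration, not part of the problem:

```python
import random

def attempts_until_success(p, rng):
    """Count independent trials until the first success (probability p each)."""
    count = 1
    while rng.random() >= p:
        count += 1
    return count

# Average many runs and compare with the claimed expectation 1/p.
rng = random.Random(0)  # fixed seed for reproducibility
p = 1 / 6               # e.g. waiting to roll a particular face of a die
trials = 100_000
average = sum(attempts_until_success(p, rng) for _ in range(trials)) / trials
print(average)  # should be close to 1/p = 6
```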

Human intuition about probabilities and expectation is notoriously faulty.  For example, imagine that I am trying to fill up a gallon jug of water.  I do this by pouring in the water from a bunch of other jugs.  Each of these other jugs contains a random amount of water, uniformly chosen from 0 to 1 gallons.  Then each jug has an expected contribution of 1/2 gallon.  So how many jugs do I expect to need to fill my jug?  It’s tempting to say 2, but the true answer is rather higher, closer to 3.  (Actually it’s e.)
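The jug example is just as easy to check by simulation.  A sketch, again with a hypothetical function name and a fixed seed:

```python
import math
import random

def jugs_needed(rng):
    """Pour uniform(0, 1)-gallon amounts until the total exceeds one gallon."""
    total, count = 0.0, 0
    while total < 1.0:
        total += rng.random()
        count += 1
    return count

rng = random.Random(1)  # fixed seed for reproducibility
trials = 200_000
average = sum(jugs_needed(rng) for _ in range(trials)) / trials
print(average, math.e)  # the average should land near e, about 2.718
```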


Here’s the key idea that makes this problem solvable.  Break up the process of collecting into a bunch of separate steps.  Imagine that you throw a little party each time your collection gets bigger (that is, each time you get a card you didn’t already have).  Instead of keeping track of how long it takes to get each particular type of card, keep track of how long it takes to throw each new party.  Let X_i be the number of cards you have to buy to increase your collection from i-1 to i different cards.  That is, if you just got your tenth new card, it will take X_{11} more purchases to get your next nonrepeated card.

Then the total number of cards that you buy is \sum_{k=1}^n X_k, so the expected total number of cards is \sum_{k=1}^n E[X_k].

But each of the X_k is precisely the situation described in Idea 2!  If you already have k-1 cards, then each additional card will be one you don’t already have with probability \frac{n-k+1}{n}.  So the expectation of X_k is just \frac{n}{n-k+1}, and we’re essentially done.


Thus the expected number of cards required to get a complete set is \frac{n}{n}+\frac{n}{n-1}+\frac{n}{n-2}+\cdots+\frac{n}{2}+\frac{n}{1} = n\sum_{k=1}^n\frac{1}{k}.  It’s a basic fact of analysis that this sum of the reciprocals of the first n positive integers (the nth harmonic number) is well approximated by \log n, the natural logarithm.  So a good estimate for the total expectation is n\log n.
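A short sketch makes both the exact formula and the simulation agree; the helper names below are my own, not anything standard:

```python
import random

def expected_cards(n):
    """Exact expectation: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def simulate_collection(n, rng):
    """Buy uniformly random card types until all n are seen; return the count."""
    seen, bought = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        bought += 1
    return bought

n = 100
print(expected_cards(n))  # about 518.7

rng = random.Random(2)  # fixed seed for reproducibility
trials = 5_000
average = sum(simulate_collection(n, rng) for _ in range(trials)) / trials
print(average)  # should be close to the exact value above
```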


This explains why it seems to be harder and harder and harder to make progress in a collection, the further you get.

As an example, suppose you are trying to collect 100 things, and you have 50 different ones right now.  Is it really reasonable to say that you are about halfway there?  At the beginning, you’d expect to need 100\sum_{k=1}^{100}\frac{1}{k} cards, about 519.  Now that you have 50, you only need 100\sum_{k=1}^{50}\frac{1}{k} more, about 450.  So in a meaningful sense, you have 450/519 of your task still ahead of you, about 86.7% (!!!)

So when are you halfway there?  At what point do you expect to need to buy about 259 more cards?  Believe it or not, you aren’t really halfway to collecting all 100 cards until you have 93 of them!
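The 93 falls out of a few lines of arithmetic.  The helper below (my own naming) computes the expected number of remaining purchases for any partial collection, then searches for the first collection size whose remaining work is at most half the original total:

```python
def remaining_expectation(n, have):
    """Expected additional purchases to finish, given `have` distinct cards.
    Summing E[X_k] = n/(n-k+1) for k = have+1, ..., n gives n * (1/1 + ... + 1/(n-have))."""
    return n * sum(1.0 / k for k in range(1, n - have + 1))

n = 100
total = remaining_expectation(n, 0)  # about 519, the full expectation

# Smallest collection size at which the expected remaining work
# drops to half of the original expectation.
halfway = next(h for h in range(n + 1)
               if remaining_expectation(n, h) <= total / 2)
print(halfway)  # 93
```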