Statistically Significant

July 1, 2009

A Probability Exercise

Filed under: Math — Hoxie @ 6:44 pm

At work, I’ve been using the random forest™ algorithm to create a classifier (lots of random classification trees make a classification forest… get it?).  The algorithm is a neat tool in applied statistics that was proposed by Leo Breiman at Berkeley and can perform classifications, regressions, and unsupervised learning.  The code was originally written in Fortran, although an R package is available.

The first step in implementing the algorithm for a training sample of size n is to uniformly sample n cases with replacement, i.e. the same observation can be chosen multiple times while other samples may not be chosen at all.  (Each sample forms a tree, and many trees are grown, à la bagging.)  While reading Dr. Breiman’s random forest™ website and other papers on the topic, I kept coming across the idea that if you uniformly sample n observations with replacement from n observations, you’ll leave out “about one-third” of the observations.  (Another paper said “about 37% of observations”.)  There seemed to be a little bit of hand-waving going on here, so I decided to do some basic probability and figure out what’s going on here.

Q: Given n observations, suppose you uniformly sample with replacement n times (to form a new sample of size n).  In the long run, approximately what proportion of the original observations will be left out of your new sample?

A: My solution.

Leave a Reply