A Probability Exercise
At work, I’ve been using the random forest™ algorithm to create a classifier (lots of random classification trees make a classification forest… get it?). The algorithm is a neat tool in applied statistics that was proposed by Leo Breiman at Berkeley and can perform classifications, regressions, and unsupervised learning. The code was originally written in Fortran, although an R package is available.
The first step in implementing the algorithm for a training sample of size n is to uniformly sample n cases with replacement, i.e. the same observation can be chosen multiple times while other samples may not be chosen at all. (Each sample forms a tree, and many trees are grown, à la bagging.) While reading Dr. Breiman’s random forest™ website and other papers on the topic, I kept coming across the idea that if you uniformly sample n observations with replacement from n observations, you’ll leave out “about one-third” of the observations. (Another paper said “about 37% of observations”.) There seemed to be a little bit of hand-waving going on here, so I decided to do some basic probability and figure out what’s going on here.
Q: Given n observations, suppose you uniformly sample with replacement n times (to form a new sample of size n). In the long run, approximately what proportion of the original observations will be left out of your new sample?
A: My solution.
Leave a Reply