Thursday, July 10, 2008

Find the tail in a distribution using the Pareto principle

In the July-August 2008 issue of Harvard Business Review, an article titled "Should You Invest in the Long Tail?" sparked a discussion between the author of the Long Tail book and the article's author. One of the major disagreements is how to distinguish the head from the tail of a distribution. The term "tail" is also used in the context of heavy-tailed distributions and power laws. The mathematical definition of a heavy tail compares the tail of a distribution (one minus the cumulative distribution function) with an exponential function. If the tail decays more slowly than any exponential function, which suggests that it is likely polynomial, as in a power law, then the distribution has a heavy tail. This definition tells us whether a tail is heavy, but using an exponential function to tell where the tail begins in a distribution is not straightforward.

I found an easier way to cut the tail off: the Pareto principle, or 80-20 rule. Whether the split turns out to be 80-20 or 70-30, it always satisfies the equation
F(k) + rank(k) = 100%,
where k is the position of an item in the studied set (of size N) sorted in descending order of popularity, rank(k) = k/N is the rank of the kth item expressed as a percentage, and F(k) is the cumulative function, i.e. the share of total popularity held by the top k items. For a set of N items, the tail of the distribution therefore begins at the intersection of F(k) and f(k) = 1 - k/N, the point where the cumulative value equals 1 - rank(k). For example, when rank(k) = 20% and F(k) = 80% at the intersection, the split is 80-20. Since F(k) is strictly increasing and f(k) = 1 - k/N is strictly decreasing, and both take values in [0, 1], the two curves cross at one and only one point. Because k is an integer, that crossing may fall between two consecutive ranks.
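The rule above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name pareto_cut and the sample popularity counts are made up for the example, and because k is discrete the code takes the first rank at which F(k) reaches or exceeds 1 - k/N as the cut point.

```python
from itertools import accumulate


def pareto_cut(popularity):
    """Return (k, F(k), k/N) for the first rank k where F(k) >= 1 - k/N.

    `popularity` is any sequence of non-negative popularity scores.
    The name and the return shape are illustrative choices, not part
    of any library.
    """
    values = sorted(popularity, reverse=True)  # most popular first
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one item with positive popularity")

    # accumulate() yields the running sum, so running / total is F(k).
    for k, running in enumerate(accumulate(values), start=1):
        share = running / total
        if share >= 1 - k / n:
            return k, share, k / n
    # Not reached: at k = n the right-hand side 1 - k/n is 0.


# Hypothetical popularity counts for ten items.
counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
k, share, rank = pareto_cut(counts)
print(k, share, rank)
```

For these made-up counts the head is the top 3 items: they hold 80% of the total popularity while making up 30% of the items, so the cut here is closer to 80-30 than 80-20, which is what happens when the crossing falls between two ranks.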

Let's use an example to check whether this method finds a "good" point to divide the tail from the head. The Zipf distribution is very popular for characterizing the distribution of ranked items such as web pages, words, films, and books. The figure shows the cumulative function of a Zipf distribution with N = 1000 and s = 1. Interestingly, the intersection falls almost exactly at the 80-20 point.
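The Zipf example can be checked numerically with a short script. This is a minimal sketch, assuming the usual Zipf weights proportional to 1/k^s; the helper name zipf_cut is hypothetical, and the cut is again taken as the first rank at which F(k) reaches 1 - k/N.

```python
def zipf_cut(n, s=1.0):
    """Return (k, F(k)) for the first rank k where F(k) >= 1 - k/n,
    for a Zipf distribution over n items with exponent s.

    Illustrative helper; the name and signature are assumptions.
    """
    # Unnormalized Zipf weights: the item at rank k has weight 1 / k**s.
    weights = [1.0 / k**s for k in range(1, n + 1)]
    total = sum(weights)

    running = 0.0
    for k, w in enumerate(weights, start=1):
        running += w
        share = running / total  # cumulative function F(k)
        if share >= 1 - k / n:
            return k, share
    # Not reached: at k = n the right-hand side 1 - k/n is 0.


k, share = zipf_cut(1000, s=1.0)
print(f"head = top {k} items ({k / 1000:.1%}), holding {share:.1%} of the total")
```

With N = 1000 and s = 1 the crossing lands a little past rank 200, so the head is roughly the top 21% of items holding roughly 79% of the total, which agrees with the near 80-20 result above. Changing s moves the crossing, so the 80-20 outcome is specific to these parameters.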