I found another easy way to cut the tail out. That is via Pareto principle, or 80-20 rule. No matter it is 80-20 or 70-30, we can always get an equation
F(k)+rank(k)=100,
where k is the sequence number of an item in the studied set (size N) sorted in a descending order according to their popularity, rank(k) is the ranking of kth item in term of percent (k/N), and F(k) is the cumulative function. Suppose we have a set of N items, the tail of its distribution can be determined by the intersection of F(k) and f(k)=1-k/N. The intersection point indicate where the cumulative function value is equal to 1- rank(k). For example, when rank(k) = 20%, F(k) = 80% at the intersection, which is 80-20. Since F(k) is a strictly increasing function, and f(k)=1- k/N is a strictly decreasing function, and their value ranges are the same, [0, 1], we can ensure that there is one and only one intersection of the two functions.

let's use an example to check if this method can find a "good" point to divide the tail from the head. Zipf distribution is very popular for characterizing the distribution of ranked items like web pages, words, films, and books. The figure shows the cumulative function of a Zipf distribution with N = 1000 and s = 1. We get a very interesting result that the intersection is right at 80-20 place.
Regarding some other ways of computing the 'head' and the 'tail' of a pareto-like distribution you may find this article interesting:
ReplyDeleteA practical model for analyzing long tails, by Kalevi Kilkii.
Cheers, Oscar