Thursday, July 10, 2008

Finding the tail of a distribution using the Pareto principle

In the July-August 2008 issue of Harvard Business Review, an article titled "Should You Invest in the Long Tail?" started a discussion between the author of the Long Tail book and the article's author. One of their major disagreements is how to distinguish the head from the tail of a distribution. The term tail is also used in the context of heavy-tailed distributions and power laws. The mathematical definition of a heavy tail compares the tail of the distribution (one minus the cumulative distribution function) with an exponential function. If the tail decays more slowly than any exponential function, which suggests that the decay is likely polynomial, then the distribution has a heavy tail. Using an exponential function to tell where the tail begins in a distribution is not straightforward.

I found another, easier way to cut the tail off: the Pareto principle, or 80-20 rule. Whether the split is 80-20 or 70-30, we can always write the equation
F(k) + rank(k) = 100%,
where k is the sequence number of an item in the studied set (of size N) sorted in descending order of popularity, rank(k) is the rank of the kth item expressed as a percentage (k/N), and F(k) is the cumulative distribution function, also expressed as a percentage. Given a set of N items, the start of the tail can be determined by the intersection of F(k) and f(k) = 1 - k/N, both written as fractions. The intersection point indicates where the cumulative function value is equal to 1 - rank(k). For example, when rank(k) = 20%, F(k) = 80% at the intersection, which is the 80-20 split. Since F(k) is a strictly increasing function, f(k) = 1 - k/N is a strictly decreasing function, and both take values in the same range, [0, 1], there is one and only one intersection of the two functions. (Because k is discrete, the intersection may fall between two adjacent ranks.)

Let's use an example to check whether this method finds a "good" point to divide the tail from the head. The Zipf distribution is very popular for characterizing the distribution of ranked items such as web pages, words, films, and books. The figure shows the cumulative function of a Zipf distribution with N = 1000 and s = 1. We get a very interesting result: the intersection is very close to the 80-20 point.
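
Here is a minimal Java sketch of the check, assuming Zipf weights proportional to 1/k^s. The class and method names (ParetoCut, zipfWeights, cumulativeShare, findCut) are made up for illustration and do not come from any library. Because k is discrete, the sketch takes the first rank where F(k) reaches 1 - k/N as the cut point.

```java
class ParetoCut {
    /** Zipf weights 1/k^s for k = 1..n, already in descending order. */
    static double[] zipfWeights(int n, double s) {
        double[] w = new double[n];
        for (int k = 1; k <= n; k++) {
            w[k - 1] = Math.pow(k, -s);
        }
        return w;
    }

    /** Cumulative share F(k) as a fraction in [0, 1]; index k - 1 holds F(k). */
    static double[] cumulativeShare(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double[] f = new double[weights.length];
        double running = 0.0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            f[i] = running / total;
        }
        return f;
    }

    /** Smallest 1-based rank k with F(k) >= 1 - k/N (weights sorted descending). */
    static int findCut(double[] weights) {
        double[] f = cumulativeShare(weights);
        int n = weights.length;
        for (int k = 1; k <= n; k++) {
            if (f[k - 1] >= 1.0 - (double) k / n) {
                return k;
            }
        }
        return n; // not reached for positive weights, since F(N) = 1 >= 0
    }

    public static void main(String[] args) {
        double[] w = zipfWeights(1000, 1.0);
        double[] f = cumulativeShare(w);
        int k = findCut(w);
        System.out.printf("cut at k=%d: top %.1f%% of items hold %.1f%% of the total%n",
                k, 100.0 * k / w.length, 100.0 * f[k - 1]);
    }
}
```

The same findCut call should work for any measured popularity counts, as long as they are sorted in descending order first.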

Monday, July 7, 2008

Solution for latex2html not generating images

Recently I updated Ghostscript and NetPbm on my Windows box, and found that latex2html started to report errors when generating images. The error messages said "bad file descriptor" when executing pstoimg.bat. Google returned several pages that describe exactly the same symptom but offer no solution. Some suggested using the debug mode of latex2html to get more detailed traces. It really helped! I found that the NetPbm executable was trying to locate a file named rgb.txt; it tried several locations that are directories on Linux systems, and failed. The output also suggested setting the RGBDEF environment variable. The file is in the misc directory of the NetPbm I installed. After I set the environment variable to point to it, the problem was fixed.

It seems every application had better have a DEBUG mode.

Wednesday, May 21, 2008

Java String to byte array and InputStream

Based on the Google results, the best solutions are
byte[] byteArray = yourString.getBytes(charsetName);

and
InputStream stream = new ByteArrayInputStream(yourString.getBytes(charsetName));
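
For a complete, compilable version of the two one-liners above, here is a small sketch. It assumes UTF-8 and uses the StandardCharsets constant (available since Java 7) instead of a charset name, which avoids the checked UnsupportedEncodingException that getBytes(String charsetName) declares. The class name StringBytesDemo and its helper methods are hypothetical, and readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StringBytesDemo {
    /** Encode a String as UTF-8 bytes. */
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Wrap the UTF-8 bytes of a String in an InputStream. */
    static InputStream toStream(String s) {
        return new ByteArrayInputStream(toBytes(s));
    }

    /** Read the stream back and decode it, to check the round trip. */
    static String roundTrip(String s) {
        try (InputStream in = toStream(s)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // contains one non-ASCII character
        System.out.println(toBytes(text).length + " bytes for " + text.length() + " chars");
        System.out.println("round trip ok: " + text.equals(roundTrip(text)));
    }
}
```

Always passing the charset explicitly matters: the no-argument getBytes() uses the platform default charset, so the same String can produce different bytes on different machines.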

Monday, May 19, 2008

Start trying Google App Engine

Just got a message from Google that my account was activated. A little anxious to start a journey of learning Python and doing experiments.

Sunday, May 18, 2008

Two font problems that keep EPS graphics from appearing in LaTeX documents

I am working on Windows XP and use Visio for drawing. My procedure for producing an EPS graphic is to print the Visio page to PS, then convert the PS to EPS using GSview. There are many other ways to get EPS out of a Visio drawing. MetafileToEPSConverter is a good, highly recommended tool; remember to set its printing quality higher than 600 dpi to get nice figures in the PS file.

If you have recently adjusted your system DPI setting for a new, big LCD, you may find that your old EPS files can no longer be displayed correctly in a new PS file. The symptom is that the graphic just flashes and disappears, leaving a blank space. The solution is simply to change the DPI setting back to the old value.

The other problem I encountered was that the fonts in EPS files looked weird and some characters were even missing. When I embedded the files in LaTeX documents, the graphics were simply not there. Later I figured out that Ghostscript might be the cause; the problem was solved when I updated both GSview and Ghostscript to the latest versions.

Tuesday, May 13, 2008

Switch to Blogger from Movable Type

I used to blog on Movable Type hosted by the University, but was a little disappointed after losing long posts I had typed several times. The first alternative blog hosting service I considered was WordPress, but I was surprised that they even charge for updating the CSS of a template. I had already registered on Blogger, and I read many blogs hosted on Blogger in my feed reader. So I finally decided to move here. My old posts will stay on MT for search engines' sake until I lose access to the university service.