Wednesday, July 30, 2008

Amazon S3 gossip and decentralized control

Via Werner, I read the technical explanation and solutions about the last S3 outrage. It is interesting to know that a "gossip" protocol is used in S3 for messaging around servers. The explanation did not give too many details about this gossip protocol. I suspect that it is a kind of p2p flooding. Then rumor flooding will produce the result as "On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping." The rumor resulted from small number of corruptions of original message. This recalls me about a manuscript I wrote about centralized or decentralized systems.
A system applying decentralized control paradigms can easily reach several local optimal solutions, while it is hard for such a system to check which solution is the global optimal solution. The systems are sometimes trapped in locally optimized situations, and cannot get out without outside interferences. The “circular mill” of army ants is a typical example for the local-optimization issue. For army ants that are blind and move by following the ant ahead, an isolated group of ants may form a circle which will get larger and larger until the ants die of starvation.

(picture from T. C. Schneirla. Army ants. A study in social organization. W. H. Freeman & Co, San Francisco, 1971.).

Thursday, July 10, 2008

Find the tail in a distribution using Pareto principle

In July-August 2008 issue of Harvard Business Review, an article titled "Should You Invest in the Long Tail?" raised discussion between the author of the Long Tail book and the article's author. One of the major disagreements is how to distinguish the head and the tail of a distribution. The term tail is also used in context of heavy tailed distributions and power law. The mathematical definition of heavy tail is to compare a cumulative distribution function with an exponential function. If the function increases slower than an exponential function, which suggested that the function is likely polynomial, then the function has a heavy tail. Using exponential function to tell where the tail begins in a distribution is not straightforward.

I found another easy way to cut the tail out. That is via Pareto principle, or 80-20 rule. No matter it is 80-20 or 70-30, we can always get an equation
where k is the sequence number of an item in the studied set (size N) sorted in a descending order according to their popularity, rank(k) is the ranking of kth item in term of percent (k/N), and F(k) is the cumulative function. Suppose we have a set of N items, the tail of its distribution can be determined by the intersection of F(k) and f(k)=1-k/N. The intersection point indicate where the cumulative function value is equal to 1- rank(k). For example, when rank(k) = 20%, F(k) = 80% at the intersection, which is 80-20. Since F(k) is a strictly increasing function, and f(k)=1- k/N is a strictly decreasing function, and their value ranges are the same, [0, 1], we can ensure that there is one and only one intersection of the two functions.

let's use an example to check if this method can find a "good" point to divide the tail from the head. Zipf distribution is very popular for characterizing the distribution of ranked items like web pages, words, films, and books. The figure shows the cumulative function of a Zipf distribution with N = 1000 and s = 1. We get a very interesting result that the intersection is right at 80-20 place.

Monday, July 7, 2008

Solution for latex2html not generating images

Recently I updated GhostScript and NetPbm on my Windows box, and found the latex2html started to report errors when generating images. The error messages said "bad file descriptor" when executed pstoimg.bat. Google gave several pages that contains exactly the same symptom of this problem but no solutions. Some suggested to use the debug mode of latex2html to see more detailed tracks. It really helps! I found the NetPbm executable file was trying to locate a file named rgb.txt, and it tried several places that are the directories on Linux systems, but failed. It also suggested a RGBDEF environment variable. The file is in the misc directory of the NetPbm I installed. After I set the environment variable, the problem is fixed.

It seems that all applications better have a DEBUG mode.