The Tag Cloud

For a while now I’ve been categorizing my posts.  At the time of design for this site I didn’t really have an idea of what I was going to do with the categories, but speculated I might build a tag cloud.  The problem was I didn’t really understand the fundamentals of a tag cloud:
  • Get all categories from posts
  • Count number of posts per category
  • Add weighting to category
  • Format text to distinguish weighting
  • Link to all posts within that category

On the surface it looks fairly simple.  Query the database, do some mathematical fiddling to get weights, create a list of links, and stick a style attribute on each link to modify its font-size based on weighting.
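As a rough sketch of the first two bullets, counting posts per category might look something like this in C#.  It assumes the categories have already been pulled out of the database into memory, one list of names per post; CategoryCounter and CountPostsPerCategory are names made up for illustration, not the code behind this site.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not the site's actual code.
public static class CategoryCounter
{
    // Each inner sequence holds the category names attached to one post.
    public static Dictionary<string, int> CountPostsPerCategory(
        IEnumerable<IEnumerable<string>> postCategories)
    {
        return postCategories
            // Distinct() makes sure a post counts once per category,
            // even if the same name was attached to it twice.
            .SelectMany(categories => categories.Distinct())
            .GroupBy(name => name)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Calling it with three posts tagged { "SQL", "C#" }, { "SQL" } and { "Open Source" } would give SQL a count of 2 and the other two a count of 1.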

Well it’s not quite that easy.  The query was fairly straightforward, and creating the list and stylizing it was pretty simple.  Getting the weights on the other hand was a pain.  The problem was figuring out a fair weighting solution.

Standard Deviation and Statistical Analysis

Originally I figured I could get away with ordering the numbers, and then finding the difference between each increment.  Unfortunately some topics I discuss more frequently than others.  For instance, I talk about SQL a whole lot more than I talk about Open Source stuff.  A huge skewing problem would have arisen.  So, going back to High School Statistics class (holy crap, I actually can use this in real life!  My teacher [,if he were dead,] would be rolling over in his grave) I went back to one of the first things I learned: The Standard Deviation.

The standard deviation is a measure of the dispersion of a collection of numbers.  It is defined as the root-mean-square (RMS) deviation of the values from their mean, or as the square root of the variance.  To calculate the standard deviation on a dataset:

  1. Find the mean, \overline{x}, of the values.
  2. For each value x_i calculate its deviation (x_i - \overline{x}) from the mean.
  3. Calculate the squares of these deviations.
  4. Find the mean of the squared deviations. This quantity is the variance \sigma^2.
  5. Take the square root of the variance.
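To make those steps concrete, here is a quick worked example on a made-up set of eight post counts (2, 4, 4, 4, 5, 5, 7, 9).  The numbers are purely illustrative, not this site's real data.

```latex
\begin{align*}
\overline{x} &= \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = 5 \\
\sum_{i=1}^{8} (x_i - \overline{x})^2 &= 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32 \\
\sigma^2 &= \frac{32}{8} = 4 \\
\sigma &= \sqrt{4} = 2
\end{align*}
```

So a category with 9 posts sits two standard deviations above the mean, while one with 4 posts sits half a standard deviation below it.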

In calculation form it looks like this:

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \overline{x})^2}

The variable \sigma signifies the standard deviation, and N is the number of values.  Thanks to Wikipedia for the pictures.

In C# the code would look like:

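using System;
using System.Collections.Generic;
using System.Linq;

// Population standard deviation; the mean is handed back through the out parameter.
double StdDev(List<double> values, out double mean)
{
    // Enumerable.Average stands in for the Statistics.Mean helper, which isn't shown here.
    mean = values.Average();
    double sumSquares = 0;

    foreach (double d in values)
    {
        double diff = d - mean;
        sumSquares += diff * diff;
    }

    // Divide by N (not N - 1), matching the formula above.
    return Math.Sqrt(sumSquares / values.Count);
}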

If they taught math in the form of programming (“this is how you do it in code…”), Math might not have sucked!  Yes, I do realize math is the precursor to programming, but a boy can dream.

Back to the Cloud

Once the standard deviation is found, I can apply the weighting.  Essentially I started at a deviation of 0 and assigned it the default font-size for this site.  As a category's deviation falls below 0 the font-size decreases, and as it rises above 0 the font-size increases.  Fairly straightforward.  Once I got the weighting, I applied CSS, and built a web part accordingly.
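For the curious, the weighting step could be sketched along these lines.  This is only one way to do it, under some assumptions: "deviation" is taken to mean how many standard deviations a category's post count sits from the mean, and the base size of 100%, the 25% step and the clamp at 3 deviations are placeholder values, not the settings used on this site.  TagCloudWeights and FontSizes are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: base size, step and clamp are assumed values.
public static class TagCloudWeights
{
    // Returns a font size in percent for each category name.
    public static Dictionary<string, int> FontSizes(
        IDictionary<string, int> postCounts,
        int basePercent = 100,      // default font size, used at a deviation of 0
        int stepPercent = 25,       // change per standard deviation from the mean
        double maxDeviations = 3)   // clamp so one huge category can't dominate
    {
        var sizes = new Dictionary<string, int>();
        if (postCounts.Count == 0) return sizes;

        // Population mean and standard deviation of the post counts.
        double mean = postCounts.Values.Average();
        double variance = postCounts.Values.Average(c => (c - mean) * (c - mean));
        double stdDev = Math.Sqrt(variance);

        foreach (var pair in postCounts)
        {
            // Deviation in units of standard deviations; 0 when all counts are equal.
            double z = stdDev == 0 ? 0 : (pair.Value - mean) / stdDev;
            z = Math.Max(-maxDeviations, Math.Min(maxDeviations, z));
            sizes[pair.Key] = (int)Math.Round(basePercent + z * stepPercent);
        }
        return sizes;
    }
}
```

Each resulting number can then go straight into a link's style attribute as a percentage font-size, so a category sitting right on the mean renders at the default size.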