Is this going to work?

Gobi started by talking about writing a **spam filter using logistic regression**. The idea is to use character 4-grams as features and hash them into p buckets, where p is some large enough prime. Then, use the number of 4-grams landing in each bucket as a predictor in a logistic regression model. This technique apparently did better than Google's spam filter technology. Another interesting thing is that the researchers who came up with this idea used Google search results for "spam" subjects and "non-spam" subjects as training data.
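The hashing trick described above can be sketched in a few lines. This is an illustrative sketch, not the researchers' actual code: the prime p = 1,000,003 and the use of Python's built-in `hash` are my own choices.

```python
import re
from collections import Counter

def hashed_ngram_features(text, n=4, p=1000003):
    """Map character n-grams into p buckets (p a large prime) via hashing.

    Returns a sparse dict {bucket_index: count}; the counts serve as
    predictors in a logistic regression model.
    """
    text = re.sub(r"\s+", " ", text.lower())
    counts = Counter(hash(text[i:i + n]) % p
                     for i in range(len(text) - n + 1))
    return dict(counts)

features = hashed_ngram_features("Cheap meds!!! Buy now", n=4, p=1000003)
# Each key is a bucket index in [0, p); each value counts the 4-grams
# that landed in that bucket.
```

Collisions (distinct 4-grams landing in the same bucket) are tolerated; with p large relative to the number of distinct 4-grams, they are rare enough not to hurt the classifier.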

David discussed his work for the Rotman International Trading Competition and **Market Microstructure**. One problem is that when you're ordering a large number of shares, the market isn't liquid enough to handle all your orders without drastically increasing the price of the shares. It is a challenge to figure out how to buy or sell a large number of shares in the most efficient way possible. Another problem is related to market making: posting both a bid and an ask at different prices to pocket the spread. The difficulty here is figuring out what the bid and ask prices should be, and how to handle the risk of the market moving.

John discussed his work at Pitney Bowes mining business-related data using economic, demographic, and geographic aspects of a business. It spurred discussions of cluster detection, visualization of high-dimensional data, and cluster validation.

Samson talked briefly about non-parametric methods and gave a review of bootstrapping: using empirical data to estimate the probability density function non-parametrically, estimating quantities such as the mean and variance from it, and using sampling with replacement to construct an estimate of the variance of an estimator.
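The sampling-with-replacement idea can be sketched as follows; the data, replicate count, and function names are illustrative, not from the talk:

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement.

    Each bootstrap replicate draws len(data) points from the empirical
    distribution; the spread of the replicates estimates the variance
    of the statistic without any distributional assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = [stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot)]
    return statistics.stdev(replicates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.1]
se_mean = bootstrap_se(data)
# se_mean approximates the usual standard error of the sample mean,
# sd(data) / sqrt(n), but the same recipe works for medians, quantiles,
# or any other statistic.
```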

As always meetings are on Mondays at 4:30 in M3-3109.

This week, the introduction of **Chernoff Faces** as a tool for graphing multivariate data made dealing with so many variables bearable, and making sense of the data as a whole more intuitive. Different data dimensions map onto different facial features. The example below (taken from "FlowingData", http://flowingdata.com/2010/08/31/how-to-visualize-data-with-cartoonish-faces/) shows how Chernoff Faces are constructed and how much easier they make data interpretation.

Parallel Coordinates was another graphical representation of multivariate data that was introduced. In this method, each variable is assigned its own vertical axis, and the axes are parallel to one another. A roughly horizontal line segment between adjacent axes implies positive correlation, while an intersection implies negative correlation. It seems to be a good tool for assessing the association between variables.

Star plotting was briefly mentioned at the meeting, and the Iris data set was used to calculate the chances of a reading (one that is neither isolated nor distinctively associated with any nearby cluster of readings) being a part of any cluster surrounding it.

Lisa talked first, discussing her latest project using **Data-Driven Documents (d3)**. The dataset of choice was the *Iris* dataset, which any R enthusiast will be familiar with. It contains five variables: *Sepal.Length*, *Sepal.Width*, *Petal.Length*, *Petal.Width*, and *Species*.

For those unfamiliar with d3.js, it is a visualization library that provides functionality to bind data to document objects. Its advantages are that it is extremely fast and flexible.

Here is a simple example of selecting a circle and updating it with new coordinates and size:

var circle = svg.selectAll("circle")
    .attr("cy", 90)
    .attr("cx", 30)
    .attr("r", 40);

But what if we want to make it more interesting? Consider a data set of size two (e.g., [20, 35]). D3.js makes it easy to add new circles and visualize the data in such a way that the size and x-coordinate of each circle reflect each data point.

var circle = svg.selectAll("circle")
    .data([20, 35])
    .enter().append("circle")
    .attr("cy", 90)
    .attr("cx", function(d) { return d; })
    .attr("r", function(d) { return d; });

Lisa’s project explored multiple visualization strategies including an easy-to-use interface to change the axis of a scatter plot.

Jeeyoung discussed **edge detection and line detection**. In edge detection, a user applies an algorithm (one that thresholds the directional derivatives) to produce an image containing the original's edges. One of the more common edge detection algorithms is *Canny edge detection* (see below; source: Wikipedia).

From there, you can apply line detection. It is easier to do this by mapping each point along the edges to a line in the parameter space (the slope-intercept coordinate plane) and looking for intersections. While the exact details will not be covered here, the transformation used is the *Hough Transform*.

Note that you can implement both using the MATLAB commands *edge()* and *hough()*.
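Here is a toy sketch of the voting idea in slope-intercept parameter space as described above. Production implementations (including MATLAB's *hough()*) use the (rho, theta) parameterization instead, which handles vertical lines; the slope grid, bin size, and points below are illustrative choices of mine.

```python
from collections import Counter

def hough_lines(points, slopes=None, b_step=1.0):
    """Vote in (slope, intercept) space: each edge point (x, y) maps to
    the line b = y - m*x, so collinear points vote for the same (m, b) cell."""
    if slopes is None:
        slopes = [i / 10 for i in range(-30, 31)]  # candidate slopes -3.0 .. 3.0
    acc = Counter()
    for x, y in points:
        for m in slopes:
            b = round((y - m * x) / b_step) * b_step  # discretize the intercept
            acc[(m, b)] += 1
    return acc

# Four points on the line y = 2x + 1, plus one outlier:
pts = [(0, 1), (2, 5), (4, 9), (6, 13), (5, 2)]
(m, b), votes = hough_lines(pts).most_common(1)[0]
# The winning accumulator cell recovers m = 2.0, b = 1.0 with 4 votes.
```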

As usual come join us in the Stats Club Office every Monday at 4:30pm!

**Advertisement Bidding**: Paul talked about internet ad bidding, i.e. bidding on advertisement slots on platforms like Google AdWords. He introduced the optimization problem inherent to ad bidding, the difference in pricing between advertisement slots on Google, and explained various bidding methods.

**Stay Alert! Kaggle Competition**: David gave a walk-through of the Stay Alert! Ford challenge on Kaggle. He described two main findings which propelled his model to finish 6th in the overall rankings. The first was a data visualization method which, in his case, zeroed out the randomly added datapoints. The second was the discovery of interaction effects between variables.

The next meeting will be held next Monday at 4:30pm in the Stats Club Office M3 3109.

A mixture model is a weighted combination of probability distributions. Mixture models are a powerful and well-understood tool for various problems in artificial intelligence, computer vision, and statistics. In this post, we will examine Gaussian mixture models, and algorithms for fitting them.

Let's first introduce the Gaussian mixture model. Let $N(\mu_1, \sigma_1), \ldots, N(\mu_K, \sigma_K)$ be a collection of Gaussian distributions, with some associated weights $\pi_1, \ldots, \pi_K$ satisfying $\sum_k \pi_k = 1$. Let $x_1, \ldots, x_N$ be a set of observations generated by the above distributions.

**Figure 1**. Mixture of two Gaussian distributions.

We call this model a **mixture of Gaussians**. The parameter $\pi_k$ is called the **mixing proportion** of the Gaussians.

By **solving** the Gaussian mixture model, we mean that we are given a set of observed points $x_1, \ldots, x_N$ and we want to **find the model parameters** $\theta = (\mu_k, \sigma_k, \pi_k)$ **that maximize the posterior probability** $P(\theta \mid x)$.

**Figure 2.** Histogram of 10,000 points generated by the above Gaussian mixture model. We want to find the model parameters which maximize the posterior probability of generating those points.

**K-means clustering**

K-means clustering solves a different problem, but it gives insight into how we can solve the mixture of Gaussians problem. In K-means clustering, a set of observed points $x_1, \ldots, x_N$ and a positive integer $K$ are given. We want to cluster the points into $K$ clusters $S_1, \ldots, S_K$ so as to minimize the within-cluster sum of squares (WCSS):

$\arg\min_S \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - \mu_k \rVert^2$

Here, $\mu_k$ is the mean of the cluster $S_k$.

**Figure 3.** Example of a K-means clustering in 2 dimensions, K = 2. The observed points are clustered into the two clusters. The X mark denotes the centre of the cluster.

**Figure 4.** Another example of a K-means clustering in 2 dimensions, K=4.

**Expectation Maximization (EM)**

Expectation Maximization is an iterative algorithm for solving maximum likelihood problems. Given the set of observed data *X*, generated by a probability model with parameter θ, we want to find the parameter θ that maximizes the posterior probability $P(\theta \mid X)$. In K-means clustering:

- $X$ – the observed data points $x_1, \ldots, x_N$.
- $\theta$ – the clustering of the observed data points into $S_1, \ldots, S_K$, and their associated centres $\mu_1, \ldots, \mu_K$.

The EM algorithm starts with an initial hypothesis for the parameter θ, iteratively calculates the posterior probability $P(\theta \mid X)$, and re-calculates the hypothesis θ at each step. The details differ between particular EM algorithms, but the approximate steps are as follows.

- Start with an initial hypothesis 𝜃.
- **(Expectation step)** Calculate the posterior probability $P(\theta \mid X)$.
- **(Maximization step)** Categorize the observed data according to the probability, and update the hypothesis accordingly.
- Repeat this process until the hypothesis converges.

**Algorithm – Hard K-means**

We assume that $P(x_i \in S_k)$, the likelihood of point $x_i$ belonging to the cluster centred at $\mu_k$, is 1 if and only if $\mu_k$ is the closest cluster centre (and 0 otherwise).

During the **E-step**, the distances $\lVert x_i - \mu_k \rVert$ are calculated to determine $P(x_i \in S_k)$. During the **M-step**, the clusters $S_k$ are determined by assigning each $x_i$ to the cluster closest to it. Finally, each $\mu_k$ is re-calculated by taking the mean of the points in $S_k$.

**Figure 5**. Example of a K-means algorithm, for K=2. Initial cluster centres are randomly chosen from the observed points. Each iteration, the observed points are categorized to the nearest cluster centres, and the cluster centres are re-calculated. This process is repeated until the cluster centres converge.
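The loop described above can be sketched in plain Python (Lloyd's algorithm); the toy data and variable names are my own:

```python
import math
import random

def hard_kmeans(points, k, iters=50, seed=0):
    """Hard K-means: E-step assigns each point to its nearest centre,
    M-step moves each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres drawn from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:             # E-step: nearest centre wins
            j = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):  # M-step: recompute the means
            if members:
                centres[j] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centres, clusters

points = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.0),
          (5.1, 5.0), (4.9, 5.2), (5.0, 4.8)]
centres, clusters = hard_kmeans(points, k=2)
# The two centres converge near (0.1, 0.03) and (5.0, 5.0).
```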

**Algorithm – Soft K-means**

In Hard K-means, the observed points can only belong to a single cluster. However, it may be useful to consider the probability $P(x_i \in S_k)$ during the computation of $\mu_k$. This is especially true for points near a boundary between two clusters.

We want to calculate $\mu_k$ via the weighted average of the $x_i$, with weights $r_{ik} = P(x_i \in S_k)$.

We assume that the likelihood decays exponentially with distance, with stiffness parameter $\beta$:

$P(x_i \in S_k) \propto e^{-\beta \lVert x_i - \mu_k \rVert}$

This way, $P(x_i \in S_k)$ is still monotonically decreasing with respect to the distance from the centre, but the probability is penalized smoothly for larger distances.

Similar to the Hard K-means algorithm, $r_{ik} = P(x_i \in S_k)$ is obtained by normalizing this value:

$r_{ik} = \dfrac{e^{-\beta \lVert x_i - \mu_k \rVert}}{\sum_{k'} e^{-\beta \lVert x_i - \mu_{k'} \rVert}}$

Each $\mu_k$ is obtained by the weighted average of all $x_i$:

$\mu_k = \dfrac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$

One thing to note is that the Hard K-means algorithm is equivalent to a Soft K-means algorithm as $\beta \to \infty$.
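The two update rules above can be sketched as a single EM iteration; the 2-D toy data and the choice of beta are illustrative:

```python
import math

def soft_kmeans_step(points, centres, beta):
    """One EM iteration of Soft K-means: responsibilities r[i][k] are
    proportional to exp(-beta * distance), and the new centres are
    responsibility-weighted means of the points."""
    k = len(centres)
    resp = []
    for p in points:                           # E-step: soft responsibilities
        w = [math.exp(-beta * math.dist(p, c)) for c in centres]
        total = sum(w)
        resp.append([wj / total for wj in w])
    new_centres = []                           # M-step: weighted means
    for j in range(k):
        weight = sum(r[j] for r in resp)
        new_centres.append(tuple(
            sum(r[j] * p[d] for r, p in zip(resp, points)) / weight
            for d in range(len(points[0]))))
    return new_centres, resp

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centres = [(1.0, 1.0), (4.0, 4.0)]
for _ in range(20):
    centres, resp = soft_kmeans_step(points, centres, beta=5.0)
# With a large beta the responsibilities approach hard 0/1 assignments,
# recovering Hard K-means in the limit.
```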

**Algorithm – Gaussian K-means**

K-means algorithms are great, but they only reveal information about cluster membership. However, we can modify the EM algorithm to calculate the full mixture parameters $(\mu_k, \sigma_k, \pi_k)$ as well.

We come back to the original assumption, that the likelihood follows a Gaussian distribution:

$P(x_i \mid x_i \in S_k) = N(x_i; \mu_k, \sigma_k)$

where $N$ is the Gaussian probability density function. The actual probability is calculated as follows:

$r_{ik} = \dfrac{\pi_k \, N(x_i; \mu_k, \sigma_k)}{\sum_{k'} \pi_{k'} \, N(x_i; \mu_{k'}, \sigma_{k'})}$

We have to re-calculate three parameters, $(\mu_k, \sigma_k, \pi_k)$. Similar to the Soft K-means algorithm, $\mu_k$ is the weighted average of all $x_i$. $\pi_k$ is the normalized ratio of the sums $\sum_i r_{ik}$. Finally, $\sigma_k$ is the weighted standard deviation.

$\sigma_k^2 = \dfrac{\sum_i r_{ik} \lVert x_i - \mu_k \rVert^2}{I \sum_i r_{ik}}$

Where *I* is the dimensionality of *x*.
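Putting the three update rules together, here is a 1-D sketch; the two-component data is simulated, and since $I = 1$ here, the division by the dimensionality disappears from the variance update:

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density N(x; mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_em(xs, k=2, iters=100):
    """EM for a 1-D mixture of Gaussians.

    E-step: responsibilities r[i][j] proportional to pi_j * N(x_i; mu_j, sigma_j).
    M-step: (pi_j, mu_j, sigma_j) become responsibility-weighted statistics.
    """
    xs_sorted = sorted(xs)
    # deterministic quantile-based initial means (an illustrative choice)
    mus = [xs_sorted[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    sigmas = [1.0] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        resp = []
        for x in xs:                              # E-step
            w = [pis[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(w)
            resp.append([wj / total for wj in w])
        for j in range(k):                        # M-step
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)  # guard against collapse
            pis[j] = nj / len(xs)
    return pis, mus, sigmas

# Sample from an assumed two-component mixture and recover its parameters:
rng = random.Random(1)
xs = ([rng.gauss(0.0, 1.0) for _ in range(300)]
      + [rng.gauss(6.0, 0.5) for _ in range(200)])
pis, mus, sigmas = gmm_em(xs, k=2)
# mus lands near (0.0, 6.0) and pis near (0.6, 0.4).
```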

**Figure 6.** Points sampled from a mixture of two Gaussians. The top left plot shows the data points generated by the model. The other plots show the results of Hard K-means, Soft K-means, and Gaussian K-means.

**Figure 7.** Another example of Gaussian K-means algorithm, with 4 clusters.

**References**

- http://en.wikipedia.org/wiki/K-means_clustering – K-means clustering.
- http://www.inference.phy.cam.ac.uk/mackay/itila/ – Information Theory, Inference, and Learning Algorithms. Chapters 20 and 22 have nice material on K-means clustering.

The data is taken from the Central Election Commission’s own (semi)-official results. The precinct-level data alone consisted of 2,743 Excel files totaling 146 MB. From it, the results across all of Russia are divided into:

- Regions: 135
- Sub-regions: 2,744
- Precincts: 94,573

In the plot below, the vertical axis is United Russia’s vote share divided by the entire *voting age population* (all the eligible voters, not just those who voted). The horizontal axis is the sum of all other parties’ vote shares, over the same set of people. It follows that the turnout of the vote is equal to the sum of these two values. Lines of equal turnout would have a -45° angle (i.e. a slope of -1); obviously, all data points are confined to the lower left triangle (0-100% turnout).
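The plot coordinates can be made concrete with a small sketch; the precinct numbers below are hypothetical, purely to illustrate the arithmetic:

```python
def plot_coords(ur_votes, other_votes, vap):
    """Coordinates used in the scatter plots: each vote count is divided
    by the whole voting-age population, so turnout = x + y."""
    y = ur_votes / vap       # United Russia's share of the VAP
    x = other_votes / vap    # all other parties' combined share of the VAP
    return x, y, x + y       # the third value is the turnout

# A hypothetical precinct: 1,200 eligible voters, 450 UR votes, 310 for others.
x, y, turnout = plot_coords(ur_votes=450, other_votes=310, vap=1200)
# y = 0.375, x ≈ 0.258, turnout ≈ 63.3%; the point sits below the
# slope -1 line through (0, 1), as all 0-100% turnout points must.
```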

The colors of the dots have no meaning other than their ordering: regions or precincts that are adjacent in the Election Commission's official results have similar colors. In the "per-precinct" plots for each region below, the ordering is done first by sub-region, then by precinct (PEC) number. Therefore, all the precincts belonging to a sub-region have similar colors, which is a good way to spot systematic vote-rigging or manipulation across a sub-region.

The regional results can be further broken down into sub-regions and precincts:

I will focus on two basic types of vote fraud:

**Ballot stuffing**: when one person casts multiple votes, or where unused ballots are fraudulently filled in and cast by election officials. However, since you can’t decrease other parties’ vote counts via this method, the total number of votes always increases and the apparent turnout is inflated. Such increases in total votes would come entirely from one party, which in an honest election would be highly unnatural. By plotting the votes from all precincts onto a graph, these telltale signs can be detected.

**Misreporting**: when vote totals are fabricated after votes are counted, either at the voting station, or even during central tabulation by the Commission (which is feasible if results at each station are not publicly posted). These often come with telltale signs, such as orderly patterns in turnout/vote shares on a graph, where one would expect randomness. It can also crop up as skewed distributions or those that simply don’t make sense.

I think the government has taken the position that, while there exists some "isolated" fraudulent activity which ought to be investigated (and which by now is too obvious to deny), these "incidents" are of little consequence overall and ultimately don't affect the validity of the election; therefore, there is no need to redo it. However, the data itself suggests otherwise.

What I just plotted is a frequency distribution of vote share for each party in each of the country's 94,573 precincts. While the six opposition parties each have support clustered at around 10% or less, their shares sum to over 30% (which is all that matters, since Russia uses proportional representation). The combined opposition seemingly has a higher mean than United Russia's blue line, but thanks to its long tail to the right, United Russia's curve actually has the higher mean, allowing it to just barely (almost, *conveniently*) eke out a majority in the parliament this time around; visually, it is hard to believe the blue curve has the higher mean, but it is true. Compare this to the 2007 election below, which was quite the opposite situation: the incumbents handily defeated the combined opposition.

In this year especially, the long tail for United Russia is a curious outcast among the more symmetric distributions for the other parties. For speculation, I’ve included an unscientific estimate of what United Russia’s actual vote share likely was, had there been no inflation of its figures. I’ve taken the liberty, of course, to assume the distribution ought to be Gaussian, which “ICM laureate” Sergey Kusnetsov points out is not always true. Nevertheless, that is beside the point. Even if the true distribution given the current voter sentiment and geography was not Gaussian, it *should not differ to this degree*.

Here are similar plots from the 2008 U.S. presidential election to show, relatively speaking, what an honest election should look like. (Of course, in the U.S. the problem is that it doesn’t matter who you elect!). I have plotted both candidates’ votes as a fraction of the total VAP (voting age population) in each region. The data has been broken down by state, as well as by county.

The above graphs show the characteristics of a fair election: a single, Gaussian-distributed cluster, low variance in voter turnout (between 50-65%), and relatively few counties with extreme support for either candidate. These same features are lacking in the Russian election.

The plot for Chechnya frankly speaks for itself, and even looking at the tables on the Commission’s website, you’d be crazy not to notice something amiss. This is the initial outrageous result that led me to investigate all the regions.

In this region, highly regular patterns become visible in the precinct results once placed on this raw vote vs. raw vote graph. The red lines indicate not only that numbers were forged, but that whoever was responsible really lacked creativity. One red cluster follows almost exactly the line **y = 3*x**, and another seems to be dictated by **y + x = 95%**. A blue line, if my eyes are not mistaken, follows **y = 10*x**. How could voters at different precincts have possibly coordinated themselves to produce such orderly results? (By the way, this is not simply due to very small numbers, such as 3 votes out of 4. The larger dots represent precincts that had hundreds of eligible voters). Finally, there is the overwhelming, 80%+ average vote share for a single party, a level unheard of in most democratic elections.

The impossibly high turnouts and simultaneously lopsided vote shares (a.k.a. points clustered into the upper left corner of the graph) seem to be endemic to the autonomous republics (for a complete listing, see below). If there is a good, legitimate reason for this occurring, I have yet to hear it. Smaller sample size alone wouldn’t explain this, since it would imply greater variance, whereas these regions show just the opposite. Perhaps the physical inability to vote for other parties, if not outright miscounting of votes, is the real culprit.

The telltale green and purple clusters below are not only suspicious but reflect a lack of statistical understanding on the part of whoever made them up (if that is the case). The green cluster in particular seems far too concentrated to have occurred by chance, and is in fact responsible for the small red dot you see in the plot above.

The figures in the next plot are very strange indeed, as if every single ballot that wasn’t used was secretly filled in with “United Russia”. Whoever conjured them also seemed to think people will blindly believe these mythical turnout rates of over 95%:

This is, surprisingly, the first region we see that has *any* points in the "legitimate" zone around (x = 30%, y = 20%). Again we see that consistent and implausible clusters around y = 60% and 90% suggest false reporting, likely the work of a single official.

If this doesn’t convince you, have a look at the same region in 2007:

Not all instances of fraud necessarily involve absurdly high vote shares. In some regions it was made to be not so dramatic: only a 10-20% inflation of United Russia’s vote share, yet indefensible under statistical scrutiny.

There is, at the very least, a manipulation of vote counts as betrayed by the decisive line formation in the centre. Whoever fabricated the numbers clearly lacked the creative acumen to do anything other than set United Russia's total to the sum of all other parties' votes times a constant. If you checked, you would see that the linear cluster in the 2011 plot is exactly described by **y = 6/7 x**. (It's rather appalling in itself that they could not find someone more statistically competent to do their bidding, in a nation so highly ranked in math competitions).

Whether the other cluster to the top-left is also fraudulent is uncertain, but its existence is awfully strange. Moreover, by looking at the bottommost cluster for this year (evidently the legitimate cluster) and seeing how much support United Russia has lost since 2007, you can understand why they’ve had to fudge the votes.

The plot for Tyumen region below shows again two distinct patterns of rigging: first, the insane precincts with greater than 80% United Russia support, but also the red and green clusters around y=50-70% which are still way off the expected values (but whose “other party” votes seem to be typical), and is likely indicative of heavy ballot stuffing.

The individual regions in Moscow (see “Plots for all regions” below) seem to be distributed strangely, but the nature of the rigging does not become obvious until all its regions are pooled as one. In this composite plot of all 3,373 precincts in the city, one can clearly see a bimodal distribution, or two distinct clusters. The bottom cluster is located near the main clusters seen in other cities, and should be the legitimate one, while the more dispersed, larger cluster above it is likely of fraudulent origin. What you see here could either be vote stuffing or miscounting taken at a massive scale.

Credit goes to Maxim Pshenichnikov for having done this exact same analysis in greater detail; I am just including the idea here for the sake of completeness.

The plot for this region has one large cluster, and everything appears to check out, with only a few extreme (six or so) outliers above y=60%. But could perhaps even one outlier be too much? Looking at the other graphs, and at how tightly packed the main cluster is here, the answer is yes: a single point located that far away would take an enormous coincidence. It’s safe to say that all six of these precincts’ results are probably bogus.

These examples have shown that vote rigging is much more widespread and dramatic than it appears, and indeed decisively changed the outcome of the election. We saw unexpectedly regular patterns at the small scale, which help explain the anomalies in the big picture, and we compared different types of irregularities. Those in Russia who are protesting these results ought to settle for nothing less than a complete re-do of the election, under better scrutiny. Since Medvedev does not seem willing to allow this, the next few days and weeks will be very interesting to watch.

…I’ve never had to visualize anything high-dimensional in my pure math classes. Working things out algebraically is much nicer, and using a lower-dimensional object as an example or source of intuition usually works out — at least at the undergrad level.

But that's not a really satisfying answer, for two reasons. One is that it *is* possible to visualize high-dimensional objects, and people have developed many ways of doing so. Dimension Math has on its website a neat series of videos for visualizing high-dimensional geometric objects using stereographic projection. The other reason is that while pure mathematicians do not have a need for visualizing high dimensions, statisticians do. Methods of visualizing high-dimensional data can give useful insights when analyzing data…

More at http://www.lisazhang.ca/2011/12/visualizing-4-dimensions.html

**Eigenfaces:** Samson explained, through principal component analysis, how eigenfaces can be used in face recognition.

**Data Visualization:** Samson also talked about information density, the data-ink ratio, and how to condense visualizations without sacrificing information. He described several methods, including sparklines, one of Tufte's inventions, as well as why the "ideal" aspect ratio of an x-y graph depends on the rate of change of the trendline.

**Semantic Clustering:** Konrad gave an overview of some of the work he did at Nuance involving semantic and k-means clustering algorithms. He described how the randomness of customer calls, among other variables, makes it difficult to know exactly which semantic tags one should use in large data sets.

Next meeting will be next Tuesday Nov 29th at 3pm in the Stats Club Office M3 3109.

- Machine Learning: http://jan2012.ml-class.org/
- Natural Language Processing: http://www.nlp-class.org/
- Probabilistic Graphical Model: http://www.pgm-class.org/
- Game Theory: http://www.game-theory-class.org/
- Human Computer Interaction: http://www.hci-class.org/

Other classes offered are:

- CS 101: http://www.cs101-class.org/
- Software As A Service: http://www.saas-class.org/