This is a test

Is this going to work?

Spam Filter, Algo Trading, Nonparametric methods etc.

Today’s meeting covered aspects of spam filter, algorithmic trading and clustering.

Gobi started by talking about writing a spam filter using logistic regression. The idea is to use character 4-grams as features, and hash it into p buckets, where p is some large enough prime. Then, use the number of 4-grams landing in each bucket as a predictor in a logistic regression model. This technique apparently did better than Google’s spam filter technology. Another interesting thing is that the researches who came up with this idea used google search results for “spam” subjects and “non-spam” subjects as training data.

David discussed his work for the Rotman International Trading Competition and Market Microstructure. One problem is that when you’re ordering a large number of shares, the market isn’t liquid enough to handle all your orders without drastically increasing price of the shares. It is a challenge to figure out how to buy/sell a large number of stocks in the optimal way possible. Another problem is related to market making: posting both a bid and ask at different prices to pocket the spread. The difficulty here is figuring out what the bid and ask prices should be, and how to handle the risk of the market moving.

John discussed his work at Pitney Bowes mining business related data using economical, demographic, and geographical aspects of a business. It spurred discussions in cluster detection, visualization of high-dimensional data, and cluster validation.

Samson talked briefly about non-parametrical methods and a review of bootstrapping: using empirical data to estimate the probability density function non-parametrically to estimate non-parametric quantities such as mean and variance, and using sampling with replacement to construct an estimate of the variance.

As always meetings are on Mondays at 4:30 in M3-3109.


Visualizing  and making sense of multivariate data geometrically in the Euclidean space  is very challenging to say the least when more than three variables are in question.

This week, the introduction of Chernoff Faces as a tool for graphing multivariate data made dealing with so many variables bearable, and making sense of the data as a whole more intuitive. Different data dimensions map onto different facial features. The example (taken from “FlowingData”, http://flowingdata.com/2010/08/31/how-to-visualize-data-with-cartoonish-faces/) below will show how Chernoff Faces are constructed and how easier it makes data interpretation.

Parallel Coordinates was another graphical representation of multivariate data that was introduced. In this method, each variable is assigned its own  vertical axis and each axis is parallel to the other. A horizontal line between any axes implies positive correlation while an intersection  implies negative correlation. It seems to be a good tool for measuring  the association between variables.

Star plotting was briefly mentioned at the meeting, and the Iris data set was used to calculate the chances of a reading ( that isn’t isolated nor distinctively associated with any near by clusters of readings) being  apart of any cluster surrounding it

Week 2: d3.js and line detection

This week’s meeting focused on talks given by Lisa and Jeeyoung.

Lisa talked first, discussing her latest project using Data-Driven Documents (d3).  The dataset of choice was the Iris dataset, that which any R enthusiast will be familiar with.  It contains five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

For those unfamiliar with d3.js, it is a visualization library that provides functionality to bind data to objects.  The advantages of using d3.js is that it is extremely fast and flexible.

Here is a simple example of selecting a circle and updating it with new coordinates and size :

var circle = svg.selectAll(“circle”)
.attr(“cy”, 90)
.attr(“cx”, 30)
.attr(“r”, 40);

But what if we want to make it more interesting.  Consider a data set of size two (i.e [20, 35]).  D3.js makes it easy to add a new circle and visualize the data in such a way that the size and x-coordinate of each circle reflects each data point.

var circle = svg.selectAll(“circle”)
.data([20, 35])
.attr(“cy”, 90)
.attr(“cx”, function(d) {return d;})
.attr(“r”, function(d) {return d;});

Lisa’s project explored multiple visualization strategies including an easy-to-use interface to change the axis of a scatter plot.

Jeeyoung discussed edge detection and line detection.  In edge detection, a user applies an algorithm (that uses a threshold on the directional derivatives) to produce an image with the original’s edges.  One of the more common edge detection algorithms is Canny edge detection (see below: source Wikipedia).

From there, you can apply line detection.  It is easier to do this by mapping each point along the edges to a line in the parameter space (slope-intercept coordinate plane) and looking for intersections.  While the exact details will not be mentioned, the transformation to be used is the Hough Transformation.

Note you can implement both using Matlab commands : edge(.) and hough(.)

As usual come join us in the Stats Club Office every Monday at 4:30pm!

Advertisement Bidding, Kaggle Contest, New Term & New Faces

Today, the Data Science team kicked off its first meeting of 2012! Welcome to Arthur, David, Paul, and Will for joining the cause! The meeting covered exciting new topics ranging from internet ad bidding to Kaggle competition.

Advertisement Bidding: Paul talked about internet ad bidding, i.e. bidding on advertisement slots on platforms like Google AdWords. He introduced the optimization problem inherent to ad bidding, the difference in pricing between advertisement slots on Google, and explained various bidding methods.

Stay Alert! Kaggle Competition: David gave a walk-through of the Stay Alert! Ford challenge on Kaggle. He described two main findings which propelled his model to finish 6th in the overall rankings. The first was a data visualization method which, in his case, zeroed out the randomly added datapoints. The second was the discovery of interaction effects between variables.

The next meeting will be held next Monday at 4:30pm in the Stats Club Office M3 3109.

Gaussian Mixture Model, K-means algorithm, and how to solve them.


Mixture model is a weighted combinations of probability distributions. Mixture model is a powerful and well-understood tool for various problems in artificial intelligence, computer vision, and statistics. In this post, we will examine Gaussian mixture models, and algorithms to solve them.

Let’s first introduce Gaussian Mixture Model. Let (\mu_j, \sigma_j) be a collection of Gaussian distributions, with some associated weights \pi_j \in [0,1]. Let (x_i)_{i=1}^n be set of observations that are generated by the above distributions.

P(x_i|\sigma_j) \sim \pi_i G(\mu_j, \sigma_j)

Figure 1. Mixture of two Gaussian distributions, with \pi = (0.25, 0.75), \mu = (0, 5), \sigma=(1,3).

We call this model Mixture of Gaussians. The parameter \pi is called the mixing portion of the Gaussians.

By solving Gaussian Mixture Model, we are given a set of observed points X=(x_i)_{i=1}^n and we want to find the model parameters (\pi, \mu, \sigma) that maximizes the posterior probability P(X | \pi,\mu,\sigma).

Figure 2. Histogram of 10,000 points generated by the above Gaussian mixture model. We want to find the model parameters which maximizes the posterior probability of generating those points.

K-means clustering

K-means clustering solves a different problem, but K-means problem gives an insight over how we can solve the mixture of Gaussian problem. In K-means clustering, a set of observed points (x_i)_{i=1}^{n}, a positive integer k are given. We want to cluster the points into k clusters (S_j)_{j=1}^{k} so that it minimizes the within cluster sum of squares (WCSS).

\underset{S_x}{argmin}\sum_{j=1}^{k} \sum_{x \in S_j} || x - \mu_j ||^2

\mu_j = \frac{\sum_{x \in S_j} x}{\#S_j} is the mean of the cluster S_j.

Figure 3. Example of a K-means clustering in 2 dimensions, K = 2. The observed points are clustered into the two clusters. The X mark denotes the centre of the cluster.

Figure 4. Another example of a K-means clustering in 2 dimensions, K=4.

Expectation Maximization (EM)

Expectation Maximization is an iterative algorithm for solving maximum likelihood problems. Given the set of observed data X, generated by the probability model with parameter 𝜃, we want to find the parameter 𝜃 that maximizes the posterior probability P(X|\theta). In K-means clustering,

  • X – observed data (x_i)_{i=1}^{n}
  • \theta – the clustering of the observed data points (S_j)_{j=1}^{k}, and their associated centres \mu_j.

EM Algorithm starts with an initial hypothesis for the parameter 𝜃 and iteratively calculate the posterior probability P(X|\theta), and re-calculates the hypothesis 𝜃 each step. The details are going to be different between each EM algorithms, but the following are the approximate steps.

  1. Start with an initial hypothesis 𝜃.
  2. (Expectation step) Calculate the posterior probability P(X | \theta).
  3. (Maximization step) Categorize the observed data according to the probability, and update the hypothesis accordingly.
  4. Repeat this process until the hypothesis converges.

Algorithm – Hard K-means

We assume that the P(x_i|S_j), the likelihood of point x_i belonging to the cluster centered at the point \mu_j is 1 iff \mu_j is the closest cluster.

P(x_i|S_j) = 1\ if\ j = \underset{k}{argmin} || x_i - \mu_k ||, 0\ otherwise

During the E-step, the distances || x_i - \mu_j || are calculated to determine P(x_i|S_j). During the M-stepS_j are determined by clustering each x_i to the cluster closest to it. Finally, each \mu_j are re-calculated by taking the mean of the points in S_j.

\mu_j = \displaystyle\sum\limits_{x \in S_j}\frac{x}{\#S_j}

Figure 5. Example of a K-means algorithm, for K=2. Initial cluster centres are randomly chosen from the observed points. Each iteration, the observed points are categorized to the nearest cluster centres, and the cluster centres are re-calculated. This process is repeated until the cluster centres converge.

Algorithm – Soft K-means

In Hard-K means. the observed points can only belong in a single cluster. However, it may be useful to consider the probability P(x_i|S_j) during the computation of \mu. This is especially true for the points near a boundary between two clusters.

We want to calculate \mu_j via the weighted average of x_i with P(x_i|S_j).

We assume that the likelihood P(x_i|S_j) has exponential distribution with the stiffness factor \beta > 0.

P(x_i|S_j) \sim e^{-\beta || x_i - \mu_j ||}

This way, P(x_i|S_j) is still monotonically decreasing with respect to the distance away from the centres, but the probability is panellized for the higher distance.

Similar to the Hard K-means algorithm, P(x_i|S_j) is obtained by normalizing this value.

P(x_i|S_j) = \frac{e^{-\beta||x_i-\mu_j||}}{\sum_l e^{-\beta || x_l - \mu_j ||}}

\mu_j is obtained by the weighted averages of all x_i .

\mu_j = \frac{\sum_i P(x_i|S_j) x_i}{\sum_i P(x_i|S_j)}

One thing to note is that Hard K-mean algorithm is equivalent to a Soft K-mean algorithm as \beta \rightarrow \infty.

Algorithm – Gaussian K-means

K-means algorithms are great, but the algorithm only reveals information about the cluster membership. However, we can modify the EM algorithm to calculate

We come back to the original assumption, that the likelihood P(x_i|S_j) follows a Gaussian distribution.

P(x_i|S_j) \sim \pi_j G(||x_i - \mu_j||, \sigma_j)

Where G(\bullet, \bullet) is the Gaussian probability density distribution. The actual probably is calculated as following.

P(x_i|S_j) = \frac{\pi_j G(||x_i-\mu_j||, \sigma_j)}{\sum_k \pi_k G(||x_i-\mu_k||, \sigma_k)}

We have to re-calculate three parameters, \mu, \sigma, \pi. Similar to the Soft K-mean algorithm, μ is the weighted averages of all x_i. \pi is the normalized ratio fo the sums \sum_i P(x_i|S_j). 𝜎 is the weighted standard deviation.

\mu_j = \frac{\sum_i P(x_i|S_j) x_i}{R_k}.

\sigma_j^2 = \frac{\sum_i P(x_i|S_j)|| x_i - \mu_j || }{I R_k}

\pi_j = \frac{R_j}{\sum_k R_k}

R_j = \sum_i P(x_i|S_j)

Where I is the dimensionality of x.

Figure 6. Points sampled from a mixture of two Gaussians. Top left plot is the data points generated by the model parameters \pi=(\frac{1}{2},\frac{1}{2}), \sigma_x = \sigma_y = (0.3, 1) \mu_x = (-2, 0), \mu_y = (0, 0). The other plots show the results of Hard K-means, Soft K-means, and Gaussian K-means.

Figure 7.  Another example of Gaussian K-means algorithm, with 4 clusters.


Graphing Russia’s Election Fraud

Following Russia’s parliamentary elections on December 4, a link was posted to Reddit reporting an impossibly high turnout (99.51%) and near unanimous support (99.48%) for Putin’s ruling party, United Russia, in the last location one would expect it: the republic of Chechnya. Even if relations with the secessionist region have improved since the Second Chechen War, both the turnout and United Russia’s vote share are a complete joke. This absurdity prompted a more thorough examination of all regions, many of which were also plagued by irregularities. In this post, I will give some detailed visualizations of both region- and precinct- level election data, and point out some highly likely instances of fraud.

Read the full post »

Visualizing 4+ Dimensions

As a student of pure math, I am often asked the question of how to visualize high dimensional objects. This question isn’t as important to pure mathematicians as it is to statisticians, so I write about it here.

…I’ve never had to visualize anything high-dimensional in my pure math classes. Working things out algebraically is much nicer, and using a lower-dimensional object as an example or source of intuition usually works out — at least at the undergrad level.

But that’s not a really satisfying answer, for two reasons. One is that it is possible to visualize high-dimensional objects, and people have developed many ways of doing so. Dimension Math has on its website a neat series of videos for visualizing high-dimensional geometric objects using stereographic projection. The other reason is that while pure mathematicians do not have a need for visualizing high-dimensions, statisticians do. Methods of visualizing high dimensional data can give useful insights when analyzing data….

More at http://www.lisazhang.ca/2011/12/visualizing-4-dimensions.html

Eigenfaces, Data Visualizations and Clustering Methods

Today’s meeting covered everything from classifying whether a certain photo is of Justin Bieber using  linear algebra, to understanding how call steering portals really work.

Eigenfaces: Samson explained through principal component analysis, how eigenfaces can be used in face recognition.

Data Visualization: Samson also talked about information density, the data-ink ratio and how to condense visualizations without sacrificing information. He described several methods including sparklines, one of Tufte’s inventions, as well as why the “ideal” length of a x-y graph is dependent on the rate of change of the trendline.

Semantic Clustering: Konrad gave an overview of some of the work he did at Nuance involving semantic and k-means clustering algorithms. He described how the randomness of customer calls, among other variables, makes it difficult to know exactly the appropriate semantic tags one should use in large data sets.

Next meeting will be next Tuesday Nov 29th at 3pm in the Stats Club Office M3 3109.

Stanford Online Classes in Winter

Once again Stanford is offering online courses next term. Here are some that might be relevant:

Other classes offered are: