Spam Filter, Algo Trading, Nonparametric methods etc.

Today’s meeting covered aspects of spam filter, algorithmic trading and clustering.

Gobi started by talking about writing a spam filter using logistic regression. The idea is to use character 4-grams as features, and hash it into p buckets, where p is some large enough prime. Then, use the number of 4-grams landing in each bucket as a predictor in a logistic regression model. This technique apparently did better than Google’s spam filter technology. Another interesting thing is that the researches who came up with this idea used google search results for “spam” subjects and “non-spam” subjects as training data.

David discussed his work for the Rotman International Trading Competition and Market Microstructure. One problem is that when you’re ordering a large number of shares, the market isn’t liquid enough to handle all your orders without drastically increasing price of the shares. It is a challenge to figure out how to buy/sell a large number of stocks in the optimal way possible. Another problem is related to market making: posting both a bid and ask at different prices to pocket the spread. The difficulty here is figuring out what the bid and ask prices should be, and how to handle the risk of the market moving.

John discussed his work at Pitney Bowes mining business related data using economical, demographic, and geographical aspects of a business. It spurred discussions in cluster detection, visualization of high-dimensional data, and cluster validation.

Samson talked briefly about non-parametrical methods and a review of bootstrapping: using empirical data to estimate the probability density function non-parametrically to estimate non-parametric quantities such as mean and variance, and using sampling with replacement to construct an estimate of the variance.

As always meetings are on Mondays at 4:30 in M3-3109.


Visualizing  and making sense of multivariate data geometrically in the Euclidean space  is very challenging to say the least when more than three variables are in question.

This week, the introduction of Chernoff Faces as a tool for graphing multivariate data made dealing with so many variables bearable, and making sense of the data as a whole more intuitive. Different data dimensions map onto different facial features. The example (taken from “FlowingData”, below will show how Chernoff Faces are constructed and how easier it makes data interpretation.

Parallel Coordinates was another graphical representation of multivariate data that was introduced. In this method, each variable is assigned its own  vertical axis and each axis is parallel to the other. A horizontal line between any axes implies positive correlation while an intersection  implies negative correlation. It seems to be a good tool for measuring  the association between variables.

Star plotting was briefly mentioned at the meeting, and the Iris data set was used to calculate the chances of a reading ( that isn’t isolated nor distinctively associated with any near by clusters of readings) being  apart of any cluster surrounding it