Today’s meeting covered spam filtering, algorithmic trading, and clustering.

Gobi started by talking about writing a **spam filter using logistic regression**. The idea is to use character 4-grams as features and hash them into p buckets, where p is some large enough prime. The number of 4-grams landing in each bucket is then used as a predictor in a logistic regression model. This technique apparently did better than Google’s spam filter technology. Another interesting point is that the researchers who came up with this idea used Google search results for “spam” subjects and “non-spam” subjects as training data.
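
As a rough illustration of the hashing idea, here is a minimal sketch in plain Python. The bucket count `P = 1009`, the use of `zlib.crc32` as the hash, the plain stochastic gradient descent training loop, and the toy messages are all assumptions made for this example; they are not taken from the work Gobi described.

```python
import math
import zlib

P = 1009  # assumed prime bucket count; a real filter would likely use a much larger p


def hashed_4gram_counts(text, p=P):
    """Count character 4-grams per hash bucket, returned as a sparse dict."""
    counts = {}
    for i in range(len(text) - 3):
        gram = text[i:i + 4]
        bucket = zlib.crc32(gram.encode("utf-8")) % p  # crc32 is an assumed choice of hash
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def predict_proba(weights, bias, features):
    """Logistic model: P(spam) = sigmoid(bias + sum of weight * bucket count)."""
    z = bias + sum(weights[b] * c for b, c in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, p=P, epochs=300, lr=0.05):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * p
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            err = predict_proba(weights, bias, features) - label
            bias -= lr * err
            for b, c in features.items():
                weights[b] -= lr * err * c
    return weights, bias


# Toy training data, made up for this sketch.
spam = ["win free money now", "free prize claim now", "cheap pills free offer"]
ham = ["meeting moved to monday", "see you at the seminar", "notes from the lecture"]
examples = [(hashed_4gram_counts(t), 1) for t in spam]
examples += [(hashed_4gram_counts(t), 0) for t in ham]

weights, bias = train(examples)
```

Hashing keeps the number of predictors fixed at p no matter how many distinct 4-grams appear, at the cost of occasional collisions between unrelated 4-grams.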

David discussed his work for the Rotman International Trading Competition and **Market Microstructure**. One problem is that when you’re ordering a large number of shares, the market isn’t liquid enough to absorb all your orders without drastically moving the price of the shares against you. It is a challenge to figure out how to buy or sell a large number of shares in the best way possible. Another problem is related to market making: posting both a bid and an ask at different prices to pocket the spread. The difficulty here is figuring out what the bid and ask prices should be, and how to handle the risk of the market moving.
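
The liquidity problem can be seen in a toy calculation. The sketch below walks a hypothetical ask book (the prices and sizes are invented) and compares the average fill price of a small buy order with that of a large one. The function name `fill_market_buy` is made up for this example, and nothing here reflects the actual competition setup.

```python
def fill_market_buy(asks, quantity):
    """Fill a market buy against asks sorted by ascending price.

    asks: list of (price, size) tuples.
    Returns (filled_quantity, average_price, last_price_paid).
    """
    remaining = quantity
    cost = 0.0
    last_price = None
    for price, size in asks:
        if remaining <= 0:
            break
        take = min(size, remaining)  # take what this price level offers
        cost += take * price
        remaining -= take
        last_price = price
    filled = quantity - remaining
    average = cost / filled if filled else None
    return filled, average, last_price


# Hypothetical ask side of an order book: (price, shares available).
asks = [(10.00, 100), (10.05, 200), (10.10, 300), (10.25, 500)]

small = fill_market_buy(asks, 50)   # fits inside the best ask level
large = fill_market_buy(asks, 900)  # eats through several price levels
```

The small order fills entirely at the best ask, while the large order pays progressively worse prices as it exhausts each level, which is why large orders are usually split up and worked over time.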

John discussed his work at Pitney Bowes mining business-related data using economic, demographic, and geographic aspects of a business. It spurred discussions on cluster detection, visualization of high-dimensional data, and cluster validation.

Samson talked briefly about non-parametric methods and gave a review of bootstrapping: using the empirical distribution of the data, rather than an assumed parametric density, to estimate quantities such as the mean and variance, and using sampling with replacement to construct an estimate of the variance.
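
A minimal sketch of the resampling step, assuming the quantity of interest is the variance of a statistic such as the sample mean. The helper name `bootstrap_variance`, the number of resamples, and the simulated Gaussian data are all choices made for this example.

```python
import random
import statistics


def bootstrap_variance(data, statistic, n_resamples=2000, seed=0):
    """Estimate the variance of `statistic` by resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw n points from the empirical distribution (with replacement).
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.variance(replicates)


# Simulated data, made up for this sketch.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

boot_var = bootstrap_variance(sample, statistics.mean)

# For the sample mean there is a textbook formula to compare against: s^2 / n.
analytic_var = statistics.variance(sample) / len(sample)
```

For the mean the two numbers should roughly agree. The appeal of the bootstrap is that the same function works unchanged for statistics with no convenient formula, for example `statistics.median`.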

As always, meetings are on Mondays at 4:30 in M3-3109.