Spam Filter, Algo Trading, Nonparametric methods etc.

Today’s meeting covered aspects of spam filter, algorithmic trading and clustering.

Gobi started by talking about writing a spam filter using logistic regression. The idea is to use character 4-grams as features and hash them into p buckets, where p is some large enough prime. The number of 4-grams landing in each bucket is then used as a predictor in a logistic regression model. This technique apparently did better than Google’s spam filter technology. Another interesting point is that the researchers who came up with this idea used Google search results for “spam” subjects and “non-spam” subjects as training data.
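To make the hashing idea concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation: the bucket count 10007, the CRC32 hash, the tiny made-up training set, and the plain stochastic gradient descent fit are all assumptions chosen for illustration.

```python
import math
import zlib
from collections import Counter

P = 10007  # a prime number of buckets (assumed; the notes only say "large enough")

def hashed_ngram_counts(text, n=4, p=P):
    """Count character n-grams per hash bucket, returned as {bucket: count}."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        counts[zlib.crc32(gram.encode("utf-8")) % p] += 1
    return counts

def predict_proba(weights, bias, features):
    """Logistic model: probability that the text is spam."""
    z = bias + sum(weights.get(b, 0.0) * c for b, c in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=50, lr=0.1):
    """Fit the logistic regression by stochastic gradient descent on log-loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = hashed_ngram_counts(text)
            err = predict_proba(weights, bias, feats) - label
            bias -= lr * err
            for b, c in feats.items():
                weights[b] = weights.get(b, 0.0) - lr * err * c
    return weights, bias

# Made-up subject lines: 1 = spam, 0 = non-spam
examples = [
    ("cheap pills buy now", 1),
    ("win free money now", 1),
    ("meeting notes for monday", 0),
    ("lecture slides attached", 0),
]
weights, bias = train(examples)
print(predict_proba(weights, bias, hashed_ngram_counts("cheap pills buy now")))
print(predict_proba(weights, bias, hashed_ngram_counts("lecture slides attached")))
```

Storing the weights in a dictionary keeps the model sparse: only buckets that actually receive a 4-gram take up space, no matter how large p is.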

David discussed his work for the Rotman International Trading Competition and market microstructure. One problem is that when you’re ordering a large number of shares, the market isn’t liquid enough to absorb all your orders without drastically moving the price of the shares. The challenge is to figure out how to buy or sell a large number of shares at the best overall price. Another problem is related to market making: posting both a bid and an ask at different prices to pocket the spread. The difficulty here is figuring out what the bid and ask prices should be, and how to handle the risk of the market moving.
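The liquidity problem can be illustrated with a toy calculation. This is only a sketch under a strong assumption that was not part of David's talk: each fill pushes the price up in proportion to its size (a linear, temporary impact), and the price recovers fully between fills. The function name and the numbers are hypothetical.

```python
def execution_cost(total_shares, n_slices, base_price=10.0, impact=1e-5):
    """Total cost of buying total_shares in n_slices equal pieces.

    Assumes a toy linear temporary-impact model: each slice fills at
    base_price + impact * slice_size, and the price resets afterwards.
    """
    slice_size = total_shares / n_slices
    cost = 0.0
    for _ in range(n_slices):
        fill_price = base_price + impact * slice_size
        cost += slice_size * fill_price
    return cost

one_block = execution_cost(100_000, 1)
ten_slices = execution_cost(100_000, 10)
print(one_block, ten_slices)
```

Under this model, splitting the order always lowers the cost, which is why the real problem is harder: a realistic version also has to account for the market moving while you wait.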

John discussed his work at Pitney Bowes mining business-related data using economic, demographic, and geographic aspects of a business. It spurred discussions on cluster detection, visualization of high-dimensional data, and cluster validation.

Samson talked briefly about nonparametric methods and gave a review of bootstrapping: using the empirical data as a nonparametric estimate of the underlying distribution, reading off quantities such as the mean and variance from it, and resampling with replacement to estimate the variance of an estimator.
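The resampling step is easy to sketch in a few lines of Python. The data values, the number of resamples, and the function name below are made up for illustration; the statistic used here is the sample mean.

```python
import random
import statistics

def bootstrap_variance(data, statistic, n_resamples=2000, seed=0):
    """Estimate the variance of `statistic` by resampling data with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the empirical distribution
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.variance(replicates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up sample
boot_var = bootstrap_variance(data, statistics.mean)

# For the mean, the bootstrap variance should land close to the
# plug-in value: (population-style variance of the sample) / n
plug_in = statistics.pvariance(data) / len(data)
print(boot_var, plug_in)
```

The mean is a convenient check because the answer is known in closed form. The same function works unchanged for statistics such as the median, where no simple formula is available.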

As always, meetings are on Mondays at 4:30 in M3-3109.


Visualizing multivariate data and making sense of it geometrically in Euclidean space is very challenging, to say the least, when more than three variables are involved.

This week, Chernoff Faces were introduced as a tool for graphing multivariate data. They make dealing with so many variables bearable and make the data as a whole more intuitive to interpret. Different data dimensions map onto different facial features. The example below (taken from “FlowingData”, http://flowingdata.com/2010/08/31/how-to-visualize-data-with-cartoonish-faces/) shows how Chernoff Faces are constructed and how much easier they make data interpretation.

Parallel Coordinates was another graphical representation of multivariate data that was introduced. In this method, each variable is assigned its own vertical axis, and all the axes are parallel to one another. Each observation is drawn as a line connecting its values on successive axes. When the lines between two adjacent axes run roughly parallel without crossing, the two variables are positively correlated, while many crossing lines imply negative correlation. It seems to be a good tool for assessing the association between variables.
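The link between crossings and correlation can be checked numerically. The sketch below was not shown at the meeting; it assumes both axes point in the same direction and are linearly scaled, in which case two observations' lines cross exactly when their order flips from one axis to the next. The function name is hypothetical.

```python
from itertools import combinations

def crossing_fraction(x, y):
    """Fraction of line pairs that cross between adjacent axes x and y.

    Two lines cross when one observation is higher on the x axis but
    lower on the y axis (a discordant pair).
    """
    pairs = list(combinations(range(len(x)), 2))
    crossings = sum(
        1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0
    )
    return crossings / len(pairs)

x = [1, 2, 3, 4]
print(crossing_fraction(x, [2, 4, 6, 8]))  # perfectly increasing relationship
print(crossing_fraction(x, [8, 6, 4, 2]))  # perfectly decreasing relationship
```

A fraction near 0 means almost no crossings (positive association), and a fraction near 1 means nearly every pair of lines crosses (negative association).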

Star plots were briefly mentioned at the meeting, and the Iris data set was used to calculate the chance that a reading (one that is neither isolated nor distinctly associated with any nearby cluster of readings) is a part of any of the clusters surrounding it.

Advertisement Bidding, Kaggle Contest, New Term & New Faces

Today, the Data Science team kicked off its first meeting of 2012! Welcome to Arthur, David, Paul, and Will, who are joining the cause! The meeting covered exciting new topics ranging from internet ad bidding to a Kaggle competition.

Advertisement Bidding: Paul talked about internet ad bidding, i.e. bidding on advertisement slots on platforms like Google AdWords. He introduced the optimization problem inherent in ad bidding, described the difference in pricing between advertisement slots on Google, and explained various bidding methods.

Stay Alert! Kaggle Competition: David gave a walk-through of the Stay Alert! Ford challenge on Kaggle. He described two main findings that propelled his model to a 6th-place finish in the overall rankings. The first was a data visualization method which, in his case, zeroed out the randomly added data points. The second was the discovery of interaction effects between variables.

The next meeting will be held next Monday at 4:30pm in the Stats Club Office M3 3109.

A Site for Data Scientists to Prove Their Skills and Make Money

Who’s up to challenge some big boys?


Airlines, insurance companies, hospitals and many other organizations are trying to figure out how to corral all their data and turn it into something useful.

Kaggle, a start-up, has figured out a way to connect these companies with the mathematicians and scientists who crunch numbers for a living or a hobby. On Thursday, it announced it had raised $11 million from investors including Khosla Ventures, Index Ventures and Hal Varian, Google’s chief economist.

“We’re really making big data into a sport,” said Anthony Goldbloom, Kaggle’s founder.