Spam Filter, Algo Trading, Nonparametric methods etc.

Today’s meeting covered spam filtering, algorithmic trading, clustering, and nonparametric methods.

Gobi started by talking about writing a spam filter using logistic regression. The idea is to use character 4-grams as features and hash them into p buckets, where p is a sufficiently large prime. The number of 4-grams landing in each bucket is then used as a predictor in a logistic regression model. This technique apparently did better than Google’s spam filter technology. Another interesting detail is that the researchers who came up with this idea used Google search results for “spam” subjects and “non-spam” subjects as training data.
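To make the feature construction concrete, here is a minimal sketch of hashed character 4-gram counts feeding a logistic regression. Everything specific in it is an assumption for illustration and not taken from the work Gobi described: the CRC32 hash, the bucket count `P = 1009`, the plain gradient descent fit, and the four toy training strings.

```python
import math
import zlib

P = 1009  # number of hash buckets; an arbitrary prime chosen for this sketch


def hashed_4gram_counts(text, p=P):
    """Count character 4-grams per bucket, where bucket = crc32(gram) mod p."""
    counts = [0] * p
    for i in range(len(text) - 3):
        gram = text[i:i + 4]
        counts[zlib.crc32(gram.encode("utf-8")) % p] += 1
    return counts


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                if xj:  # most buckets are empty, so skip zeros
                    grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(text, w, b):
    """Estimated probability that `text` is spam."""
    x = hashed_4gram_counts(text, len(w))
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy training data (made up for this sketch): label 1 = spam, 0 = not spam.
spam = ["win free money now", "cheap pills buy now"]
ham = ["lunch at noon today", "notes from class"]
X = [hashed_4gram_counts(t) for t in spam + ham]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

Hashing keeps the feature vector at a fixed length p no matter how many distinct 4-grams show up, at the cost of occasional collisions between unrelated 4-grams.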

David discussed his work for the Rotman International Trading Competition and on market microstructure. One problem is that when you’re ordering a large number of shares, the market isn’t liquid enough to fill all your orders without drastically increasing the price of the shares. The challenge is to figure out how to buy or sell a large number of shares in an optimal way. Another problem is related to market making: posting both a bid and an ask at different prices to pocket the spread. The difficulty here is figuring out what the bid and ask prices should be, and how to handle the risk of the market moving.
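A toy calculation shows why large orders are a problem. It assumes a hypothetical linear impact model that was not part of David's talk: a slice of q shares pays an extra `impact_per_share * q` on every share it fills, and the price fully recovers between slices. The function name and the impact coefficient are made up for this sketch.

```python
def impact_cost(total_shares, n_slices, impact_per_share=0.001):
    """Extra cost paid to price impact when an order is split into equal slices.

    Hypothetical model: each slice of size q pays impact_per_share * q more
    per share, and the price recovers completely before the next slice.
    """
    slice_size = total_shares / n_slices
    extra_per_share = impact_per_share * slice_size
    return n_slices * extra_per_share * slice_size


one_shot = impact_cost(10_000, n_slices=1)
ten_slices = impact_cost(10_000, n_slices=10)
```

Under these assumptions the impact cost falls in proportion to the number of slices. Real markets do not recover instantly and prices drift while you wait, which is part of what makes the actual problem hard.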

John discussed his work at Pitney Bowes mining business-related data using economic, demographic, and geographic aspects of a business. It spurred discussions on cluster detection, visualization of high-dimensional data, and cluster validation.

Samson talked briefly about nonparametric methods and reviewed bootstrapping: using the empirical distribution of the data as a nonparametric stand-in for the unknown probability distribution when estimating quantities such as the mean and variance, and sampling with replacement to construct an estimate of the variance of such an estimate.
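The resampling step can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name, the number of resamples, and the fixed seed are assumptions made for this sketch.

```python
import random
import statistics


def bootstrap_variance(data, statistic, n_resamples=2000, seed=0):
    """Estimate the variance of `statistic` by resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n points with replacement from the observed data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    # The spread of the statistic across resamples approximates its variance.
    return statistics.variance(estimates)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
var_of_mean = bootstrap_variance(data, statistics.mean)
```

For the sample mean, the result should land close to the population-style variance of the data divided by the sample size. The same function also works for statistics such as the median, where no simple formula is available.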

As always, meetings are on Mondays at 4:30 in M3-3109.

Visualizing 4+ Dimensions

As a student of pure math, I am often asked how to visualize high-dimensional objects. This question isn’t as important to pure mathematicians as it is to statisticians, so I write about it here.

…I’ve never had to visualize anything high-dimensional in my pure math classes. Working things out algebraically is much nicer, and using a lower-dimensional object as an example or source of intuition usually works out — at least at the undergrad level.

But that’s not a really satisfying answer, for two reasons. One is that it is possible to visualize high-dimensional objects, and people have developed many ways of doing so. Dimension Math has on its website a neat series of videos for visualizing high-dimensional geometric objects using stereographic projection. The other reason is that while pure mathematicians do not have a need for visualizing high dimensions, statisticians do. Methods of visualizing high-dimensional data can give useful insights when analyzing data….

More at http://www.lisazhang.ca/2011/12/visualizing-4-dimensions.html

Stanford Online Classes in Winter

Once again Stanford is offering online courses next term. Here are some that might be relevant:

Other classes offered are:

Project Discussions, User Ids

Today’s meeting was mostly sharing and discussing project ideas:

Visualization Projects: Konrad is working with housing availability data, and changes in Facebook friendship over time.

Twitter Data: Lisa is working on visualizing Twitter user growth, and used the data to attempt to show that user IDs are not random: although it is easy to do, one should not sample users by taking the user ID modulo some number. [Separate post pending]

News Data: Gobi and Samson talked about their work (and project ideas) clustering news, and methods for using supervised learning when labeled data is difficult to obtain.

Next meeting will be next Tuesday Nov 15th at 3pm in the Stats Club Office M3 3109.

Neural nets, encryption, and more

Today’s meeting was jam-packed with everything from neural networks to a work in progress on automating homework.

Neural Networks: Samson introduced us to neural networks, and how to learn the weights of a neural network by backpropagating the errors and using gradient descent. [Separate post pending.]
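Until the separate post is up, here is a minimal sketch of backpropagation for a network with one hidden layer. The architecture (sigmoid units, a single output, squared-error loss), the function names, and the learning rate are all assumptions for illustration and not necessarily what Samson presented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, seed=0):
    """Random hidden and output weights, zero biases."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return W1, b1, W2, 0.0


def forward(params, x):
    """Return hidden activations h and the network output y."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def loss(params, x, t):
    """Squared error for one example with target t."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def backprop(params, x, t):
    """Gradients of the loss, obtained by pushing the error backwards."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - t) * y * (1 - y)  # error signal at the output unit
    gW2 = [delta_out * hi for hi in h]
    # Each hidden unit receives the output error scaled by its outgoing weight.
    delta_hid = [delta_out * w * hi * (1 - hi) for w, hi in zip(W2, h)]
    gW1 = [[d * xi for xi in x] for d in delta_hid]
    return gW1, delta_hid, gW2, delta_out


def sgd_step(params, grads, lr=0.5):
    """One gradient descent update; returns new parameters."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [w - lr * g for w, g in zip(W2, gW2)]
    return W1, b1, W2, b2 - lr * gb2
```

Training repeats `backprop` followed by `sgd_step` over the examples. Comparing a backpropagated gradient against a finite-difference estimate of the same derivative is a handy sanity check when writing this by hand.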

Homomorphic Encryption: Diran talked about homomorphic encryption, an encryption scheme where one can perform operations (such as addition) on the encrypted messages and get meaningful results. With an encryption scheme that preserves both addition and multiplication, we can do things like linear regression on encrypted data. [Separate post pending.]
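As a small illustration of computing on ciphertexts, here is a toy version of the Paillier cryptosystem. Paillier is our choice for this sketch and may not be the scheme Diran discussed. It preserves only addition (plus multiplication by a known plaintext constant), not multiplication of two encrypted values, and the tiny primes used here offer no security.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt an integer 0 <= m < n with the generator g = n + 1."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:  # r must be invertible mod n
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


def scale_encrypted(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k (mod n)."""
    return pow(c, k, n * n)


n, private_key = keygen()
rng = random.Random(0)
c1 = encrypt(n, 20, rng)
c2 = encrypt(n, 22, rng)
total = decrypt(n, private_key, add_encrypted(n, c1, c2))
```

Here `total` decrypts to the sum of the two messages even though the addition was carried out on ciphertexts alone. Regression on encrypted data additionally needs products of encrypted values, which calls for a scheme that supports both operations.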

AOL Search Dataset: Jee explained the AOL search dataset, which consists of an anonymized list of users and the terms they searched for on AOL over a span of three months. Previous projects that use this dataset include the query “topic” connection visualization (type something in the gray area at the top of the page) and a visualization of question queries.

Jimmy also presented a work in progress: an automatic homework solution generator that outputs LaTeX code for solving a given formula.

Next meeting will be next Tuesday Nov 8th at 3pm in the Stats Club Office M3 3109.

Inaugural Meeting

University of Waterloo Data Science kicked off with our inaugural meeting this past Wednesday, October 26th, 2011. Five people came out and at least three more expressed interest in joining our bi-weekly discussions, starting next Tuesday, November 1st at 2:30 PM in the Stats Club Office M3 3109.

During discussions we’ll talk about anything that anyone found interesting in the past two weeks: this could be a new paper, an interesting technique, a useful tool, or feedback on a work in progress. Members are encouraged to bring something to every discussion: either something to help others (paper, technique, tool), or a question that would help them in whatever they’re working on.

If you are interested in being a part of this community, drop by our meeting next Tuesday. If you are no longer in Waterloo or can’t make the meeting, worry not. We hope to summarize most meeting discussions and post them here on this blog. We won’t be posting about members’ projects that are works in progress; they will be shared once they are published!