Eigenfaces, Data Visualizations and Clustering Methods

Today’s meeting covered everything from using linear algebra to classify whether a photo is of Justin Bieber, to understanding how call steering portals really work.

Eigenfaces: Samson explained how eigenfaces, which are obtained through principal component analysis, can be used in face recognition.
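As a rough illustration of the idea (not Samson's actual code), the sketch below computes eigenfaces as the principal components of a set of flattened face images and labels a new image by its nearest neighbour in eigenface space. The function names, the use of NumPy's SVD, and the nearest-neighbour rule are all assumptions made for this example.

```python
import numpy as np

def fit_eigenfaces(faces, n_components):
    """faces: (n_images, n_pixels) array, one flattened image per row."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # Rows of vt are the principal directions of the centered data,
    # i.e. the "eigenfaces", sorted by decreasing variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:n_components]

def project(faces, mean_face, eigenfaces):
    """Express each face as a short vector of eigenface weights."""
    return (faces - mean_face) @ eigenfaces.T

def nearest_label(query, train_weights, train_labels, mean_face, eigenfaces):
    """Label a query image by its closest training face in eigenface space."""
    w = project(query[None, :], mean_face, eigenfaces)[0]
    dists = np.linalg.norm(train_weights - w, axis=1)
    return train_labels[int(np.argmin(dists))]
```

Working in the low-dimensional weight space rather than on raw pixels is what makes the comparison cheap: each face is reduced to a handful of numbers before distances are computed.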

Data Visualization: Samson also talked about information density, the data-ink ratio and how to condense visualizations without sacrificing information. He described several methods including sparklines, one of Tufte’s inventions, as well as why the “ideal” length of an x-y graph depends on the rate of change of the trendline.

Semantic Clustering: Konrad gave an overview of some of the work he did at Nuance involving semantic and k-means clustering algorithms. He described how the randomness of customer calls, among other variables, makes it difficult to know exactly which semantic tags are appropriate for large data sets.

Next meeting will be next Tuesday Nov 29th at 3pm in the Stats Club Office M3 3109.

Compression, physical XORs, and NN clustering!

Today, Samson and Diran presented new twists on, and applications of, their previous topics.

Lossless Dataset Compression: Samson discussed lossless compression of statistical data, capable of shrinking huge datasets (often hundreds of TB) down by orders of magnitude. The goal is not to squeeze the data into a smaller footprint than other compressors, but rather to keep the data readily accessible while it is compressed.

Encryption, Encore: Diran described a way to transmit secret messages across untrustworthy carriers. Using the previously discussed homomorphic encryption, and splitting the message into multiple planes coupled with random noise, one ensures that no single carrier possessing only some of the planes is able to compute the secret. Rather, decryption requires XORing every plane together, which can in fact be done physically with multiple layers of invisible ink, or with a unique way of encoding 1s and 0s that allows the union of dark spots to simulate XORs.
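The XOR-splitting step can be sketched in a few lines. This is a minimal illustration that covers only the splitting and recombining of planes, not the homomorphic encryption layer or the physical encoding; the function names `split_secret` and `combine_shares` are hypothetical.

```python
import secrets
from typing import List

def split_secret(message: bytes, n_shares: int) -> List[bytes]:
    """Split message into n_shares planes; every plane is needed to recover it."""
    if n_shares < 2:
        raise ValueError("need at least two shares")
    # The first n_shares - 1 planes are pure random noise.
    shares = [secrets.token_bytes(len(message)) for _ in range(n_shares - 1)]
    # The last plane is the message XORed with all of the noise planes.
    last = bytearray(message)
    for share in shares:
        for i, b in enumerate(share):
            last[i] ^= b
    shares.append(bytes(last))
    return shares

def combine_shares(shares: List[bytes]) -> bytes:
    """XOR every plane together to recover the original message."""
    out = bytearray(len(shares[0]))
    for share in shares:
        for i, b in enumerate(share):
            out[i] ^= b
    return bytes(out)
```

Because each noise plane is uniformly random, any proper subset of the planes is itself indistinguishable from random noise, which is why a carrier holding only some of them learns nothing about the message.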

Neural Network Clustering: As part of his “neural network” series, Samson introduced a simple, single-hidden-layer network capable of determining, in a spatially clustered dataset, which cluster (or subset of clusters, “0” or “1”) an arbitrary data point should be marked under. The hidden layer is a Gaussian of the vector distance between a given point and each cluster, such that nearer clusters have greater influence. Finding the optimal weights during training requires only least-squares regression, allowing many data points to be trained at once.
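A minimal sketch of such a network follows, assuming the cluster centers are already known and that all clusters share one Gaussian width; both are assumptions of this example, as are the names `rbf_features`, `train_rbf` and `predict_rbf`. The hidden layer is a Gaussian of the distance to each center, and the output weights come from a single least-squares solve over all training points.

```python
import numpy as np

def rbf_features(X, centers, width):
    """Hidden layer: Gaussian of the distance from each point to each center."""
    # Squared Euclidean distance between every point and every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, y, centers, width):
    """Fit the output weights for all training points at once by least squares."""
    H = rbf_features(X, centers, width)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

def predict_rbf(X, centers, width, w):
    """Mark each point "0" or "1" by thresholding the network output at 0.5."""
    return (rbf_features(X, centers, width) @ w > 0.5).astype(int)
```

Since the hidden layer has no trainable parameters in this sketch, the only learning problem is linear in the output weights, which is what lets least squares replace iterative training.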

Have you gone mod?: Jimmy offered a counterargument to Lisa’s proposal that sequential numbers, such as user IDs of a website, would yield non-random subsets when their residue (mod a certain number) is taken.

Next meeting will be next Tuesday Nov 22nd at 3pm in the Stats Club Office M3 3109.

Project Discussions, User Ids

Today’s meeting was mostly sharing and discussing project ideas:

Visualization Projects: Konrad is working with housing availability data, and with changes in Facebook friendships over time.

Twitter Data: Lisa is working on visualizing Twitter user growth, and used the data to attempt to show that user IDs are not random: although it is easy to do, one should not sample users by taking the user ID modulo some number. [Separate post pending]

News Data: Gobi and Samson talked about their work (and project ideas) clustering news, and methods for using supervised learning when labeled data is difficult to obtain.

Next meeting will be next Tuesday Nov 15th at 3pm in the Stats Club Office M3 3109.

Neural nets, encryption, and more

Today’s meeting was jam-packed with everything from neural networks to a work in progress on an automated way of doing homework.

Neural Networks: Samson introduced us to neural networks, and how to learn the weights of a neural network by backpropagating the errors and using gradient descent. [Separate post pending.]
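For readers who want something concrete before the separate post, here is a small sketch of backpropagation with gradient descent on a network with one hidden layer. The architecture, sigmoid activations, mean-squared-error loss and hyperparameters are all assumptions chosen for brevity, not details from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=4, lr=0.5, epochs=2000, seed=0):
    """X: (n, d) inputs; y: (n, 1) targets in [0, 1]. Returns weights and loss history."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out - y
        losses.append(float((err ** 2).mean()))
        # Backward pass: push the error back through each layer (chain rule).
        d_out = (2.0 / len(X)) * err * out * (1.0 - out)
        d_h = (d_out @ W2.T) * h * (1.0 - h)
        # Gradient descent step on every weight and bias.
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)
    return (W1, b1, W2, b2), losses
```

Calling `train` on the four XOR input pairs, with targets shaped `(4, 1)`, is a common small test case; the returned `losses` list lets one check that the error is going down.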

Homomorphic Encryption: Diran talked about homomorphic encryption, an encryption scheme where one can perform operations (such as addition) on the encrypted messages and get meaningful results. With an encryption scheme that preserves addition and multiplication, we can do things like linear regression on encrypted data. [Separate post pending.]
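To make "addition on encrypted messages" concrete, here is a toy sketch of the Paillier cryptosystem, a well-known scheme that preserves addition only. The notes do not say which scheme Diran presented, so Paillier is a stand-in chosen for this example, and the tiny primes used below are for illustration and offer no security.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key generation from two primes, using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (n, lam, mu)  # (public key, private key)

def encrypt(n, m):
    """Encrypt an integer m with 0 <= m < n under public key n."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    n2 = n * n
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    n, lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)
```

For example, with `pub, priv = keygen(293, 433)`, decrypting `add_encrypted(pub, encrypt(pub, 20), encrypt(pub, 22))` recovers the sum of the two plaintexts without either value ever being decrypted on its own.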

AOL Search Dataset: Jee explained the AOL search dataset, a dataset consisting of an anonymized list of users and the terms they searched for on AOL over the span of three months. Previous projects that use this dataset include the query “topic” connection visualization (type something in the gray area at the top of the page) and a visualization of question queries.

Jimmy also presented a work in progress: an automatic homework solution generator that outputs LaTeX code for solving a given formula.

Next meeting will be next Tuesday Nov 8th at 3pm in the Stats Club Office M3 3109.