Eigenfaces, Data Visualizations and Clustering Methods

Today’s meeting covered everything from classifying whether a certain photo is of Justin Bieber using  linear algebra, to understanding how call steering portals really work.

Eigenfaces: Samson explained through principal component analysis, how eigenfaces can be used in face recognition.

Data Visualization: Samson also talked about information density, the data-ink ratio and how to condense visualizations without sacrificing information. He described several methods including sparklines, one of Tufte’s inventions, as well as why the “ideal” length of a x-y graph is dependent on the rate of change of the trendline.

Semantic Clustering: Konrad gave an overview of some of the work he did at Nuance involving semantic and k-means clustering algorithms. He described how the randomness of customer calls, among other variables, makes it difficult to know exactly the appropriate semantic tags one should use in large data sets.

Next meeting will be next Tuesday Nov 29th at 3pm in the Stats Club Office M3 3109.


Stanford Online Classes in Winter

Once again Stanford is offering online courses next term. Here are some that might be relevant:

Other classes offered are:

Compression, physical XORs, and NN clustering!

Today, Samson and Diran gave new twists and applications for their previous topics.

Lossless Dataset Compression: Samson discussed lossless compression of statistical data, capable of shrinking huge datasets (often hundreds of TB) down by orders of magnitude. The goal, instead of trying to file away the data into a smaller footprint than other compressors, is to rather allow the data to still be readily accessible while being compressed.

Encryption, Encore: Diran described a way to transmit secret messages across untrustworthy carriers. Using the previously discussed holomorphic encryption, and splitting the message into multiple planes coupled with random noise, one ensures that no single carrier possessing only some of the planes is able to compute the secret. Rather, decryption requires XORing every plane together, which can in fact be done physically with multiple layers of invisible ink, or a unique way of encoding 1s and 0s that allows the union of dark spots to simulate XORs.

Neural Network Clustering: As part of his “neural network” series, Samson introduced a simple, single-hidden-layer network capable of determining, in a spatially clustered dataset, which cluster (or subset of clusters, “0” or “1”) an arbitrary data point should be marked under. The hidden layer is a Gaussian of the vector distance between a given point and each cluster, such that nearer clusters have greater influence. Finding the optimal weights during training requires only least-squares regression, allowing many data points to be trained at once.

Have you gone mod?: Jimmy offered a counterargument to Lisa’s proposal that sequential numbers, such as user IDs of a website, would yield non-random subsets when their residue (mod a certain number) is taken.

Next meeting will be next Tuesday Nov 22nd at 3pm in the Stats Club Office M3 3109.

A Site for Data Scientists to Prove Their Skills and Make Money

Who’s up to challenge some big boys?


Airlines, insurance companies, hospitals and many other organizations are trying to figure out how to corral all their data and turn it into something useful.

Kaggle, a start-up, has figured out a way to connect these companies with the mathematicians and scientists who crunch numbers for a living or a hobby. On Thursday, it announced it had raised $11 million from investors including Khosla Ventures, Index Ventures and Hal Varian, Google‘s chief economist.

“We’re really making big data into a sport,” said Anthony Goldbloom, Kaggle’s founder.

Project Discussions, User Ids

Today’s meeting was mostly sharing and discussing project ideas:

Visualization Projects: Konrad is working with housing availability data, and changes in Facebook friendship over time.

Twitter Data: Lisa is working on visualizing twitter user growth, and used the data to attempt to show that user ids are not random: although it is easy to do, one should not sample users by the modulo of the user id by some number. [Separate post pending]

News Data: Gobi and Samson talked about their work (and project ideas) clustering news, and methods for using supervised learning when labeled data is difficult to obtain.

Next meeting will be next Tuesday Nov 15th at 3pm in the Stats Club Office M3 3109.

Stats and Crypto: the missing link

A couple weeks ago I came across a talk by Prof. Teske on the recent advancements in homomorphic encryptions. Uncharacteristic of a crypto talk, this one dwelt very little with the crypto itself. Rather, it was lighthearted and succinct, understandable for a novice like myself. The talk left me excited and eager to learn more. Channeling from the experience, I would like to keep the following post simple and accessible. Let’s get started.

Homomorphic Encryption: a walkthrough

The idea of homomorphic encryption is to perform operations on encrypted elements in a way to mirror the same operation if it were performed prior to being encrypted. (Whoa lots of wordy words, hang in there!)

Let’s run through a scenario: say Alice has 25 dollars in her pocket and Bob has 16 dollars in his. They would like to know whether their combined pocket money forms the universe number. As neither Alice nor Bob want to reveal their actual amount they agree on performing an exchange like the following:

  • Alice encrypts her number into the cipher: [[25]]
  • Bob encrypts his number into the cipher: [[16]]
  • Alice and Bob together perform a homomorphic addition of their respective ciphers, the result of which is the encryption of  the sum: [[25]] [+] [[16]] => [[25+16]]=[[41]]
  • Alice and Bob together encrypts the universe number: [[42]]
  • Finally, together they perform a check for equality: [[41]] [==] [[42]]

Both Alice and Bob get to know the True/False result, but they do not get any additional knowledge. Hence the scheme does not reveal (1) the associated sum and (2) the amount possessed by the counterpart. Therefore both Alice’s and Bob’s secret, well, stays secret.

The above example showcased an encryption scheme with a single homomorphic operation, the addition. Another type of homomorphic operation is the multiplication. Encryption schemes that support both types of operations are called fully homomorphic encryptions.

Aside from the obvious increase in the number of operations allowed, fully homomorphic encryption allows for much more (keeners can read it off from wikipedia).

A Real-Life Usage: electronic voting

Homomorphic encryptions are often used in electronic voting. The goal in doing so is to preserve the privacy of each ballot while being able to accurately determine the winner.

Let’s look at another simplified example:

Given an election with four candidates {A,B,C,D}, each voter can vote for one and only one candidate in the following way,

  • pick a candidate, say C
  • create the tuple <0,0,1,0>
  • encrypt each elements of the tuple using a randomized encryption (ie an encryption where a given message if encrypted twice will produce two different ciphers)
  • sends out the encrypted tuple <[[0]],[[0]],[[1]],[[0]]> as the ballot
After the election has ended, the commission ends up with a set of data, the encrypted tuples from all voters. Each entry of the dataset looks like gibberish to the commissioner. The commissioner proceeds as follows,
  • performs a homomorphic sum over each individual columns of the tuples resulting in <[[sum_A]], [[sum_B]], [[sum_C]], [[sum_D]]>
  • loops i from 0 to n=#number of votes casted (with consent from voters)
    • checks whether [[i]] [==] [[sum_A]], [[sum_B]],  [[sum_C]], [[sum_D]]
  • determine whether max(sum_A, sum_B, sum_C, sum_D)
Obviously, in practice, the systems contain much more rigor than my make-shift example.

What’s To Come

Fully homomorphic encryptions is a fairly novel subject: its first implementation appeared in 2010. The benefits from it are potentially enormous. Maybe one day we’ll be able to perform statistical computations on a cloud of encrypted data.

First blood: Visualizing Correlations

To start things off, here is a post I wrote a week ago about a task I had working at CPPIB: Visualizing Correlations.

…correlations in finance are difficult to estimate. If you take all the historical data you can get your hands on, you might be missing out on some recent correlation drifts. If you take too little, you are subjected to estimation error. The problem went as follows: 5 assets, 5 different ways of calculating correlation (varying time horizons and weight structures), and 1 “target” correlation matrix; the objective is to quickly spot when live correlations deviate too much from the target.

Read the full post at http://david.ma/blog/?p=24

Neural nets, encryption, and more

Today’s meeting was jam packed with everything from neural networks to work in progress on an automated way of doing homework.

Neural Networks: Samson introduced us to neural networks, and how to learn weights of a neural network by backpropogating the errors and using gradient descent. [Separate post pending.]

Homomorphic Encryption: Diran talked about homomorphic encryption, an encryption schema where one can perform operations (such as addition) on the encrypted messages and get meaningful results. With an encryption schema that preserves addition and multiplications, we can do things like linear regression on encrypted data. [Separate post pending.]

AOL Search Dataset: Jee explained the AOL search dataset, a dataset consisting of an anonymized list of users and the terms they searched for on AOL over the span of three months. Previous projects that use this dataset include the query “topic” connection visualization  (type something on the gray area at the top of the page) and a visualization of question queries.

Jimmy also presented on a work in progress: an automatic homework solution generator that outputs latex code for solving a given formula.

Next meeting will be next Tuesday Nov 8th at 3pm in the Stats Club Office M3 3109.