Week 2: d3.js and Line Detection

This week’s meeting focused on talks given by Lisa and Jeeyoung.

Lisa talked first, discussing her latest project using Data-Driven Documents (d3). The dataset of choice was the Iris dataset, which any R enthusiast will be familiar with. It contains five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

For those unfamiliar with d3.js, it is a visualization library that provides functionality to bind data to DOM elements. The advantage of using d3.js is that it is extremely fast and flexible.

Here is a simple example of selecting a circle and updating its coordinates and size:

var circle = svg.selectAll("circle")
  .attr("cy", 90)
  .attr("cx", 30)
  .attr("r", 40);

But what if we want to make it more interesting? Consider a data set of size two (i.e., [20, 35]). d3.js makes it easy to add a new circle and visualize the data in such a way that the size and x-coordinate of each circle reflect the corresponding data point.

var circle = svg.selectAll("circle")
  .data([20, 35])
  .enter().append("circle")
  .attr("cy", 90)
  .attr("cx", function(d) { return d; })
  .attr("r", function(d) { return d; });

Lisa’s project explored multiple visualization strategies, including an easy-to-use interface for changing the axes of a scatter plot.
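As a rough illustration of that kind of interaction, here is a hypothetical sketch (not Lisa’s actual code) of switching the x-axis variable of a scatter plot in d3. It assumes the Iris data is stored in data, the circles inside svg are already bound to it, x is a d3.scale.linear() used for the x-axis, gX is the group element holding the axis, and #x-variable is a dropdown listing the variable names:

// Hypothetical sketch: switch the x-axis variable of a scatter plot.
d3.select("#x-variable").on("change", function() {
  var variable = this.value; // e.g. "Sepal.Length" or "Petal.Width"

  // Rescale the x-axis to the chosen variable's range.
  x.domain(d3.extent(data, function(d) { return d[variable]; }));
  gX.transition().call(d3.svg.axis().scale(x).orient("bottom"));

  // Move each point to its new x position.
  svg.selectAll("circle")
    .transition()
    .attr("cx", function(d) { return x(d[variable]); });
});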

Jeeyoung discussed edge detection and line detection. In edge detection, a user applies an algorithm (one that thresholds the directional derivatives of the image intensity) to produce an image containing only the original’s edges. One of the more common edge detection algorithms is Canny edge detection (see Wikipedia for an example image).
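The full Canny algorithm adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding on top of this, but the core idea of thresholding the directional derivatives can be sketched in a few lines. This is a hypothetical illustration, not code from the talk; gray is assumed to be a flat, row-major array of grayscale intensities:

// Hypothetical sketch: approximate the directional derivatives with Sobel
// filters and threshold the gradient magnitude to get a rough edge map.
function edgeMap(gray, width, height, threshold) {
  var edges = new Array(width * height).fill(0);
  for (var y = 1; y < height - 1; y++) {
    for (var x = 1; x < width - 1; x++) {
      var i = y * width + x;
      // Sobel approximations of the horizontal and vertical derivatives.
      var gx = -gray[i - width - 1] + gray[i - width + 1]
               - 2 * gray[i - 1] + 2 * gray[i + 1]
               - gray[i + width - 1] + gray[i + width + 1];
      var gy = -gray[i - width - 1] - 2 * gray[i - width] - gray[i - width + 1]
               + gray[i + width - 1] + 2 * gray[i + width] + gray[i + width + 1];
      // Mark a pixel as an edge when the gradient magnitude exceeds the threshold.
      if (Math.sqrt(gx * gx + gy * gy) > threshold) edges[i] = 1;
    }
  }
  return edges;
}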

From there, you can apply line detection. It is easier to do this by mapping each point along the edges to a line in the parameter space (the slope-intercept plane) and looking for intersections. We won’t go into the exact details here, but the transformation used is the Hough transform.

Note that you can implement both in MATLAB using the edge() and hough() functions.
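For intuition only (MATLAB’s hough() handles all of this for you), here is a hypothetical sketch of the voting step. It uses the (rho, theta) normal form rho = x*cos(theta) + y*sin(theta) rather than slope-intercept coordinates, since that avoids unbounded slopes for near-vertical lines; edges is the 0/1 edge map from the previous sketch:

// Hypothetical sketch of Hough voting in the (rho, theta) parameterization.
function houghAccumulator(edges, width, height, nTheta) {
  var maxRho = Math.ceil(Math.sqrt(width * width + height * height));
  var acc = []; // acc[t][rho + maxRho] counts votes for the line (rho, theta_t)
  for (var t = 0; t < nTheta; t++) acc.push(new Array(2 * maxRho + 1).fill(0));

  for (var y = 0; y < height; y++) {
    for (var x = 0; x < width; x++) {
      if (!edges[y * width + x]) continue;
      // Each edge pixel votes for every (rho, theta) line passing through it.
      for (var t = 0; t < nTheta; t++) {
        var theta = Math.PI * t / nTheta;
        var rho = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
        acc[t][rho + maxRho]++;
      }
    }
  }
  return acc;
}

Peaks in the returned accumulator correspond to lines in the original image, which is exactly what the intersections in parameter space represent.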

As usual, come join us in the Stats Club Office every Monday at 4:30pm!

Graphing Russia’s Election Fraud

Following Russia’s parliamentary elections on December 4, a link was posted to Reddit reporting an impossibly high turnout (99.51%) and near-unanimous support (99.48%) for Putin’s ruling party, United Russia, in the last location one would expect it: the republic of Chechnya. Even if relations with the secessionist region have improved since the Second Chechen War, both the turnout and United Russia’s vote share are a complete joke. This absurdity prompted a more thorough examination of all regions, many of which were also plagued by irregularities. In this post, I will give some detailed visualizations of both region- and precinct-level election data, and point out some highly likely instances of fraud.


Visualizing 4+ Dimensions

As a student of pure math, I am often asked how to visualize high-dimensional objects. This question isn’t as important to pure mathematicians as it is to statisticians, so I write about it here.

…I’ve never had to visualize anything high-dimensional in my pure math classes. Working things out algebraically is much nicer, and using a lower-dimensional object as an example or source of intuition usually works out — at least at the undergrad level.

But that’s not a really satisfying answer, for two reasons. One is that it is possible to visualize high-dimensional objects, and people have developed many ways of doing so. Dimension Math has on its website a neat series of videos for visualizing high-dimensional geometric objects using stereographic projection. The other reason is that while pure mathematicians do not have a need for visualizing high dimensions, statisticians do. Methods of visualizing high-dimensional data can give useful insights when analyzing data….

More at http://www.lisazhang.ca/2011/12/visualizing-4-dimensions.html

Eigenfaces, Data Visualizations and Clustering Methods

Today’s meeting covered everything from classifying whether a certain photo is of Justin Bieber using linear algebra to understanding how call steering portals really work.

Eigenfaces: Samson explained how eigenfaces, obtained through principal component analysis, can be used in face recognition.

Data Visualization: Samson also talked about information density, the data-ink ratio, and how to condense visualizations without sacrificing information. He described several methods, including sparklines (one of Tufte’s inventions), and explained why the “ideal” length of an x-y graph depends on the rate of change of the trendline.

Semantic Clustering: Konrad gave an overview of some of the work he did at Nuance involving semantic and k-means clustering algorithms. He described how the randomness of customer calls, among other factors, makes it difficult to know exactly which semantic tags to use in large data sets.

Next meeting will be Tuesday, Nov 29th at 3pm in the Stats Club Office (M3 3109).

Project Discussions, User IDs

Today’s meeting was mostly sharing and discussing project ideas:

Visualization Projects: Konrad is working with housing availability data and with changes in Facebook friendships over time.

Twitter Data: Lisa is working on visualizing Twitter user growth, and used the data to try to show that user ids are not random: although it is easy to do, one should not sample users by taking the user id modulo some number. [Separate post pending]

News Data: Gobi and Samson talked about their work (and project ideas) clustering news, and methods for using supervised learning when labeled data is difficult to obtain.

Next meeting will be Tuesday, Nov 15th at 3pm in the Stats Club Office (M3 3109).

First blood: Visualizing Correlations

To start things off, here is a post I wrote a week ago about a task I had while working at CPPIB: Visualizing Correlations.

…correlations in finance are difficult to estimate. If you take all the historical data you can get your hands on, you might be missing out on some recent correlation drifts. If you take too little, you are subject to estimation error. The problem went as follows: 5 assets, 5 different ways of calculating correlation (varying time horizons and weight structures), and 1 “target” correlation matrix; the objective was to quickly spot when live correlations deviate too much from the target.
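The full post presumably tackles this visually, but the underlying check is easy to sketch. As a purely hypothetical illustration (not the approach from the post), the code below flags the entries of a live correlation matrix whose drift from the target exceeds a chosen tolerance:

// Hypothetical sketch: flag correlation entries that drift too far from the target.
function flagDeviations(live, target, tolerance) {
  var flagged = [];
  for (var i = 0; i < target.length; i++) {
    for (var j = i + 1; j < target.length; j++) { // upper triangle only
      var drift = Math.abs(live[i][j] - target[i][j]);
      if (drift > tolerance) flagged.push({ row: i, col: j, drift: drift });
    }
  }
  return flagged;
}

Running this against each of the five correlation estimates would give a quick list of asset pairs worth a closer look.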

Read the full post at http://david.ma/blog/?p=24