Monday, June 16, 2008

19th Century Reading Habits in Australia

Here is a blog post describing data mining of 19th Century reading habits in Australia.

It is a fascinating application of PCA and clustering.
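For anyone who hasn't seen the two techniques chained together, here is a minimal sketch of the general idea: reduce the data to its main direction of variation with PCA, then cluster along that direction with k-means. This is not the linked post's code. The loan counts, the genre columns, and every function name below are invented for illustration, and the PCA step is a simple power iteration that recovers only the first principal component.

```python
import math

# Made-up loan counts per reader: [fiction, history, science].
# These numbers are illustrative only, not taken from the study.
loans = [
    [9, 1, 0],
    [8, 2, 1],
    [10, 0, 1],
    [1, 2, 9],
    [0, 1, 8],
    [2, 1, 10],
]

def center(rows):
    """Subtract each column's mean so PCA sees variation, not offsets."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Power iteration: repeatedly apply cov and renormalize.

    Assumes the first column varies, so the start vector is not
    orthogonal to the dominant eigenvector.
    """
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def kmeans_1d(values, k=2, iters=20):
    """Plain k-means on scalars, with centroids seeded evenly from min to max."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(x - centroids[c]))
                  for x in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels

centered = center(loans)
component = first_component(covariance(centered))
# Project every reader onto the first principal component.
scores = [sum(x * w for x, w in zip(row, component)) for row in centered]
labels = kmeans_1d(scores)
print(labels)
```

On this toy data the three fiction-heavy readers should land in one cluster and the three science-heavy readers in the other. A real analysis would keep more than one component and lean on a library rather than hand-rolled loops.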

I don't think it will be long before commercial databases include standard data mining abilities such as feature selection, PCA, LSA, regression, k-means, SVMs, etc. The blog poster above had to use a combination of tools, including the great R, to prep his data for the analysis; this should all be done inside a database. Perhaps this is the straw that breaks SQL's back? Perhaps this is where a language such as Pig is needed? Pig's niche could very well be prepping data for data mining tasks and streaming it through a MapReduce library such as Mahout.

Regardless, business users are already there: companies like Harrah's, Citibank, and Nationwide live and die by their analysis. Now we just have to build the tools that let them bring new products up quickly and effortlessly. We are at the forefront of a huge explosion in statistical tools, modeling techniques, and machine learning. Just as the internet gave us ubiquitous access to data, these tools and techniques will help computers learn and understand us and our preferences. The internet revolution will appear to be a tiny blip on the screen compared to the wonders that are to come.

