Friday, June 20, 2008

Where is the software that enables the long tail?

For a product to reach the long tail, it has to have three things.
First, it has to have a low barrier to entry: anyone must be able to create content and publish it. Second, it has to be easy to access, which means it must be searchable. Finally, it has to let others review and comment. That's it. It's not rocket science.

Unfortunately, I'm having tremendous trouble finding software to enable the long tail. I really don't have overly strict requirements. In application 1, I want to allow people to post Java jar files. I might want to check things on post, such as whether the jar has a MANIFEST. In application 2, I have a site where I am putting up papers about record linkage. I want others to be able to post links and have the software verify that the links exist. In each application, I'd also like to let them associate a title with the jar file or the link.
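Neither check is exotic. Here's a minimal sketch in Java of the two validations I have in mind (the class and method names are mine, invented for illustration, not from any existing package):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.jar.JarFile;

public class UploadChecks {

    // Application 1: reject a posted jar that has no MANIFEST.
    static boolean hasManifest(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getManifest() != null;
        }
    }

    // Application 2: verify that a submitted link actually resolves.
    static boolean linkExists(String link) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (IOException e) {
            return false;
        }
    }
}

A HEAD request keeps the link check cheap; a GET fallback would be needed for servers that refuse HEAD.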

For search, I want application 1 to be able to search the javadocs that I generate from the source in the jar files. In application 2, I would love to have the content of the links searchable (so that people can find which PDFs contain information about blocking), but I'd settle for just being able to search the title that was given to the link.
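Even the fallback case is trivial to sketch. Something like the following, assuming the only searchable data is the user-supplied title mapped to its link (all titles and URLs below are made up), would cover the settle-for version:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TitleSearch {

    // Case-insensitive substring search over the titles users attached to their posts.
    static List<String> search(String query, Map<String, String> titleToLink) {
        String q = query.toLowerCase();
        List<String> hits = new ArrayList<String>();
        for (Map.Entry<String, String> entry : titleToLink.entrySet()) {
            if (entry.getKey().toLowerCase().contains(q)) {
                hits.add(entry.getKey() + " -> " + entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> titles = new LinkedHashMap<String, String>();
        titles.put("A Survey of Blocking Methods for Record Linkage", "http://example.org/blocking.pdf");
        titles.put("Probabilistic Record Linkage Basics", "http://example.org/basics.pdf");
        System.out.println(search("blocking", titles));
    }
}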

For ratings/reviews I would like to have a 5-star rating system and comments. However, I'd settle for just the comments. I want people to say which papers they found readable or which jar files they found useful.
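To be clear about how little I'm asking for, the data I want to capture per review is just this (a hypothetical class of my own, not from any existing package):

public class Review {

    private final int stars;      // 1 through 5
    private final String comment; // free-text feedback, e.g. "very readable survey"

    public Review(int stars, String comment) {
        if (stars < 1 || stars > 5) {
            throw new IllegalArgumentException("stars must be between 1 and 5");
        }
        this.stars = stars;
        this.comment = comment;
    }

    public int getStars() { return stars; }

    public String getComment() { return comment; }
}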

There are a lot of features you could add (an RSS feed, etc.), but I'm willing to ignore all of that for right now. I just need the basics.

Am I asking for too much? Why doesn't this already exist? In the world of CPAN, reddit, etc... why doesn't a generic version of long tail software exist? Perhaps I'm just overlooking it. If not, I guess I'll just have to write it, but I'd much rather use something already written.

Let me know if I missed something!

Monday, June 16, 2008

19th Century Reading Habits in Australia

Here is a blog post describing data mining of 19th Century reading habits in Australia.

It is a fascinating application of PCA and clustering.

I don't think it will be long before commercial databases include standard data mining capabilities such as feature selection, PCA, LSA, regressions, k-means, SVM, etc... The author of the post above had to use a combination of tools, including the great R, to prep his data for the analysis; this should all be doable inside a database. Perhaps this is the straw that broke SQL's back? Perhaps that is where a language such as Pig is needed? Pig's niche could very well be in prepping data for data mining tasks and streaming them through a MapReduce-based machine learning library such as Mahout.
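For a sense of how basic these primitives are, here is a toy k-means pass in Java over a handful of invented 2-D points; a database could run something like this directly against a query result instead of making users export data to outside tools:

import java.util.Arrays;

public class TinyKMeans {

    public static void main(String[] args) {
        // Made-up 2-D data; inside a database this would come from a query.
        double[][] points = { {1.0, 1.0}, {1.5, 2.0}, {3.0, 4.0}, {5.0, 7.0}, {3.5, 5.0}, {4.5, 5.0}, {3.5, 4.5} };
        int k = 2;

        // Start the centroids at the first k points (deterministic, fine for a toy).
        double[][] centroids = { points[0].clone(), points[1].clone() };
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 20; iter++) {
            // Assignment step: each point goes to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c]) < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: each centroid moves to the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        sumX += points[p][0];
                        sumY += points[p][1];
                        count++;
                    }
                }
                if (count > 0) {
                    centroids[c][0] = sumX / count;
                    centroids[c][1] = sumY / count;
                }
            }
        }

        System.out.println("assignments: " + Arrays.toString(assignment));
        System.out.println("centroids:   " + Arrays.deepToString(centroids));
    }

    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }
}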

Regardless, business users are already there: companies like Harrah's, Citibank, and Nationwide live and die by their analysis. Now we just have to build the tools to let them bring new products up quickly and effortlessly. We are on the forefront of a huge explosion in statistical tools, modeling techniques, and machine learning. As much as the internet helped by providing ubiquitous access to data, these tools and techniques will help computers learn and understand us and our preferences. The internet revolution will appear to be a tiny blip on the screen compared to the wonders that are to come.

Wednesday, June 04, 2008

The Future of Enterprise

What is the future of enterprise software? Is it JRuby on Rails? What about Scala? Maybe a .NET stack? What is the next big shift?

In my opinion, the next big shift is to grid computing. Things like EC2, Hadoop, and HBase will become more and more popular. Commercial versions and vendors will spring up over the next few years. Enterprises will start pushing more and more of their computation into batch grid work. Column-store databases will replace traditional RDBMSs for data warehousing (this is already happening with Teradata). Transactional applications will work on cached stores that are pulled from a large grid where they were processed and analyzed. Marketing will be on demand, but pre-computed.

In the rush to understand and market to the consumer, more and more companies have been moving to real-time analytics. However, the faster you need a decision, the less time you have to think about it. That is why grid computing will be so important: the thinking has to be done in advance. Transactional queries will be relegated to closing deals or finding a pre-computed offer.

The main challenge, as I see it, will be finding and using data structures that can be refreshed quickly. Google is in a fortunate position: they can refresh boxes asynchronously. If customer A enters query Q1 and then enters query Q2, it is OK if those two queries hit different data sets and return different results. However, if we're not dealing with a search application but instead something akin to a customer recognition or record linkage application, then the same person should always receive the same link. This is why versioning will become so important in the future. The customer will get a certain version of the data sets and will continue to use that version until their application finishes, at which point they can be upgraded. This requires more server-side storage, but it gives the client a consistent data set. This type of versioning and auto-update software will be needed in a commercial form.

In addition, quick nearest-neighbor searches will need to be commercialized. Clients will want to know quickly which market segment a customer falls into. Nearest-neighbor searches are the key to answering that, and they will need to be generalized and distributed over the next few years.
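By a nearest-neighbor search I mean nothing fancier than the following sketch, just generalized, distributed, and kept fresh; the segments, features, and numbers here are invented for illustration, and a brute-force loop obviously doesn't scale:

import java.util.LinkedHashMap;
import java.util.Map;

public class SegmentLookup {

    // Brute-force nearest neighbor: return the segment whose centroid is closest
    // to the customer's feature vector (squared Euclidean distance).
    static String nearestSegment(double[] customer, Map<String, double[]> segmentCentroids) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> entry : segmentCentroids.entrySet()) {
            double[] centroid = entry.getValue();
            double distance = 0;
            for (int i = 0; i < customer.length; i++) {
                double diff = customer[i] - centroid[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented segments and features (say, spend rate and visit frequency).
        Map<String, double[]> segments = new LinkedHashMap<String, double[]>();
        segments.put("high-value", new double[] {0.9, 0.8});
        segments.put("occasional", new double[] {0.2, 0.3});
        System.out.println(nearestSegment(new double[] {0.85, 0.75}, segments));
    }
}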

So, the enterprise is changing. The relational database will move from its position of dominance to being just another tool. The grid will take its place as a hammer looking for nails, and applications that can process terabytes of data quickly will be deemed must-haves for enterprise data centers across the world. The Enterprise Service Bus (ESB) will continue to shuffle transactional data around, but the backend will now be distributed data stores that are versioned and queryable by categorized nearest-neighbor searches. Who will be the vendor of these tools? I have no idea. Many, like Hadoop, will be open source. Some, like Teradata, already exist. Others will be proprietary and don't even exist yet. Regardless, it will be a lot of fun and I'm looking forward to the ride.