Saturday, July 12, 2008

The last month

Hopefully my employer doesn't have my blog in his RSS feed :)

Over the last month I have interviewed with a number of great companies. I have had a blast visiting with the brilliant people that inhabit the halls of Google, Yahoo!, Amazon, and Microsoft. I have, or am receiving, offers from three of the four. It is definitely an exciting time as I rush to look at places to live and schools for the children, and to compare things like benefits, stock prices, growth potential, and perks.

It will be sad to leave my family and friends; adjustments will be necessary all around. However, I am very excited about the opportunities that lie ahead.

I won't describe the interview process or questions. However, I will say that all of the questions were very intelligent and relevant. It helps to know core algorithms and data structures such as lists, queues, and trees. Graph algorithms and advanced data structures like heaps and splay trees are also good to know. More generally, you just need to be able to tear a problem apart and reduce it to its essence. They want to see that you can analyze a problem. Coming up with the right solution immediately is not as valuable as coming up with several solutions and being able to weigh them.

All of the companies had extraordinary employees, a great culture, and innovative systems. They all operate at "web scale," which means their problems are MASSIVE. No matter which company I choose, it will be a great choice.

I'm looking forward to the journey and I'll keep you all updated.

Tuesday, July 08, 2008

ArsDigita University

There are a lot of great free videos from ArsDigita University. Check them out!

Tuesday, July 01, 2008

Researching in a Search 2.0 world

A lot of my time is spent researching. A LOT of my time is spent researching. I research things around record linkage. I also research clustering, classification, natural language processing, and machine learning in general. Quite often, I have to get up to speed first. I need to understand what Fellegi and Sunter did in the 1960s before I can understand what Winkler added to it in the 90s and what the general entity resolution research is all about now. Or, perhaps I just want to be a better programmer, and move from an O(N) to an O(log N) on the Programmer Competency Matrix.

For Search 2.0, much of the hullabaloo has been about Natural Language Processing (NLP). Companies such as Powerset have touted their products as being able to understand a human query. For instance, the Powerset engineers have given demos where they ask their engine "Which politicians died of disease?" and it gives back a list. This approach is great if I'm after general information, or if I'm helping my kid with her homework. However, it doesn't give me perspective for research. Why am I asking about politicians and disease anyway? Am I trying to get a statistical view of politicians who die of disease vs. the health of the rest of the populace? Am I trying to understand the effects of an ill politician on society? I might wish to see the sites that others also searched for, much like the Amazon feature. Or, maybe I want additional statistics about that country during the time period. In other words, I need more than just the answer to my question. I need a path that others have followed that I can follow as well. Eventually, I'll have to get off the path, but I want to stay on it as long as possible.

In addition, I want to quickly understand an author's position. I want to know, with my search results, whether this author is an expert or a novice in the field. I want to know where his or her funding comes from. I want to know, based on statistical analysis of their previous posts, if they are conservative or liberal. Have they published papers? If so, in what journals? Are they top journals? It is this context that will make search valuable. Whether or not I can ask a specific question is irrelevant to me. I'll figure out a way to ask the question; however, I want more information back in an easy-to-understand manner. I want the site's PageRank. I want a general view of how other sites have posted about the site in question (positive or negative). I want to see complaints or compliments if it is a potential employer. I want CONTEXT. It seems to me that people get on the internet a lot for research. You research a good book to buy or what digital camera to get or where to go on vacation. All of these things could be enhanced by adding more context, more data mining, and better presentation of the information.

That will be Search 2.0.