Tuesday, December 30, 2008

North vs South

I've been in the Pacific Northwest for four months now and I thought I'd share a few of the differences I've noticed.

  1. In the South, we have sweet tea. In fact, we have most any fattening item you can imagine: gravy, lard, fatback, etc... In the Northwest, everyone drinks their tea unsweetened and even IHOP lists the number of calories in their food.

  2. In the South, you drive everywhere. If you take the bus or walk you are most likely poor and almost always frowned upon. In the Northwest, you are hailed as a hero and considered "green".

  3. In the South, PTA was for the crazies who had nothing better to do with their money. In the Northwest, our PTA participation is at 100% for the third year straight.

  4. In the South, the kids look forward to deer season. In the Northwest, the kids look forward to the new Star Trek movie.

  5. In the South, teachers send letters home with the children. In the Northwest, teachers email the parents.

  6. In the South, we get our coffee from McDonalds. In the Northwest, the city would shut down if Starbucks closed.

  7. In the South, the biggest technical gathering is when Ma Potter drops the security on her wireless router. In the Northwest, you are constantly bombarded with invites to technical events.

  8. In the South, a top IT graduate comes from Vanderbilt. In the Northwest, a mediocre hire comes from Berkeley.

  9. In the South, the churches are dense, huge and filled with Southerners. In the Northwest, the churches are sparse, huge and filled with former Southerners.

I'll pause here and reflect on a few things. First, with regard to schools, it is easy to find a correlation between money spent and quality. The schools in the Northwest are better funded. Heck, my daughter's school here has a buffet! However, those people who feel that spending more on education will fix it have missed the boat. The real issue is parental involvement and peer pressure. In our schools the children challenge each other and the parents are involved. At our PTA social events nearly everyone is there. We, and the children, want each other to succeed. No amount of money will make kids want to have a read-in instead of playing outside. No amount of money will make kids come to school dressed as their favorite storybook character instead of their favorite pop star. No amount of money will make kids succeed. It takes dedicated parents. Moreover, it takes a school full of dedicated parents. Otherwise, you will run into the peer effects that behavioral economists describe. The smart kid with dedicated parents will be ostracized. He or she will be considered an outcast. After all, they aren't going to be staying around after graduation, so why invest your friendship in them? You need an entire school full of dedicated children and dedicated parents in order to succeed. Money won't help; changing the culture is required.

Secondly, you can see the failure of the church in the Northwest. The educated, affluent people avoid the church while the Southerners flock to it. I think there are a number of factors at play here. First, the larger churches are getting the publicity and they are taking a stance against education. I have personally been in a church where the pastor said that anyone who believed in the Big Bang is going to hell. I didn't go back to that church. However, if a church accepts everyone and everything then there is no point in going. You can socialize on the net. So, there is no middle ground. The educated people are turned off by the backwards southern denominations and have no reason to attend the more "progressive"/"liberal" variants. If higher education becomes the norm in the South (which is a big question mark), then the church as we know it is doomed. The church must find its way in a new era. It must bridge what we know to be true, scientifically, with what we believe to be true, spiritually. It must exclude certain things as immoral, even though it accepts them civically. For instance, homosexuality is banned in the Bible. However, so is adultery. No church member protests the legality of adultery; hell, half of them have committed it themselves. But homosexuality is an abomination! The same goes for other moral sins such as lying and blaspheming. The church needs to realize that it does NOT have a civic duty to outlaw immoral behavior. No one can legislate morality and those that do are seen as haughty and hypocritical. Instead, they should accept that the world's morality and the church's morality are different. Christians are "sanctified". Before Christianity, to sanctify something was simply to set it apart; there was no religious meaning. If the church gets to legislate everything, then there is no more sanctification; we are just like everyone else and that is not our mandate. We are to work in the world and change the people.
That can only be done by example. Right now, we're not setting a good example. It won't be long until the church is completely irrelevant outside a few poor areas where their only role is to give the poor hope.

In total, I've really enjoyed living in the North. I have no idea if I'll stay here long term, but it's been a great experience so far and I'm glad I was given the opportunity.

Sunday, November 23, 2008

C# Awesomeness

As you can imagine, my current employer gives me a number of opportunities to work in C#. I've really been enjoying myself, but I've only scratched the surface of what C# is capable of. So, tonight, I tried a few more interesting ideas that use C#'s anonymous types and lambda function capabilities.

The first one is a pretty simple copy of Ruby's IO.each_line function. My goal is to have the C# function encapsulate the logic of opening and closing the file while I pass in a lambda that does the processing I want (just like Ruby's).

Here's the code:

public delegate void IOProcessor<T>(T line);

public static void ForEachLine(string fileName, IOProcessor<string> processor)
{
    using (StreamReader reader = new StreamReader(fileName, Encoding.UTF8))
    {
        while (!reader.EndOfStream)
        {
            processor(reader.ReadLine());
        }
    }
}

First, I declare a delegate. In C or C++, it would be a function pointer. Here, it is more than that because it can be generic, which is utter coolness in a box. We'll see the point of the genericity in a moment.

The next line declares my function, which takes the file name and the delegate (we will be giving it a lambda). Here, we explicitly set the generic parameter to a string because we know that is what we are retrieving from the file.

The using statement opens the file, specifying a UTF-8 encoding. It will also ensure the file is closed whether or not an exception occurs.

Finally, we have a loop that reads each line and passes it to the delegate until we run out of lines. Very simple and straightforward.

If we wanted to use this code to sum up the lines in a file we could write code like the following.

int sum = 0;
ForEachLine("my_integers.txt", (line) => { sum += int.Parse(line); });

After the call to ForEachLine finishes, sum contains the sum of the lines in the file. For instance, if the file contained the numbers 10, 20, 30, and 40 on separate lines, then sum would be 100.

Another thing I do a lot of is handle delimited files. Usually, I'll open one up, perform a line by line manipulation of the file, and close it back. Of course, the obvious string.Split() method comes to mind, but that is somewhat unsatisfactory because you are still dealing with integer offsets into an array. How do I know that offset 3 is the anchor text and offset 7 is the ubercool feature? It would be nice if I could specify their names in the beginning and then work with objects.

So, to handle that, I created a function that would objectify my delimited files.

Here goes:

public static IEnumerable<T> EachDelimitedLine<T>(
    string fileName, char delimiter, T outputObject)
{
    Type outputType = outputObject.GetType();
    using (StreamReader reader = new StreamReader(fileName, Encoding.UTF8))
    {
        while (!reader.EndOfStream)
        {
            string[] fields = reader.ReadLine().Split(delimiter);
            yield return (T) Activator.CreateInstance(outputType, fields);
        }
    }
}

I'll break this function down line by line in the next blog. In short, it takes in a prototype object called outputObject and creates an object like that from each line of the delimited file, using the strings produced by the split to initialize the object. I call this function like this:

foreach (var x in
    EachDelimitedLine("delimitedFile.txt", ',',
        new { Title = "", Body = "", Anchor = "" }))
{
    if (x.Title == "Thoughts of Me")
    {
        // do something with my blog...
    }
}

As you can see, I specify a prototype object of an anonymous type and my code uses that prototype to create objects from the delimited file. Of course, in production code you would want checks for too few fields or too many fields so that you don't get the weird error messages you would get from this code by default. However, I think it gives you a flavor of the power of C#.
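To give a flavor of what such a guard might look like, here is a sketch of a checked variant. The name EachDelimitedLineChecked, the reflection-based property count, and the error message are all my own illustration, not part of the original function:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedReader
{
    // Sketch: same idea as EachDelimitedLine, but the property count of the
    // prototype's type tells us how many fields each line should split into.
    // A mismatched line fails fast with a readable message instead of a
    // cryptic constructor-binding error.
    public static IEnumerable<T> EachDelimitedLineChecked<T>(
        string fileName, char delimiter, T outputObject)
    {
        Type outputType = outputObject.GetType();
        int expectedFields = outputType.GetProperties().Length;
        using (StreamReader reader = new StreamReader(fileName, Encoding.UTF8))
        {
            int lineNumber = 0;
            while (!reader.EndOfStream)
            {
                lineNumber++;
                string[] fields = reader.ReadLine().Split(delimiter);
                if (fields.Length != expectedFields)
                {
                    throw new FormatException(string.Format(
                        "Line {0}: expected {1} fields but found {2}.",
                        lineNumber, expectedFields, fields.Length));
                }
                yield return (T)Activator.CreateInstance(outputType, fields);
            }
        }
    }
}
```

The guard costs one reflection call up front and an integer comparison per line, so it doesn't change the streaming behavior of the iterator.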

Now, for the final piece, we combine the ideas of the ForEachLine function and the EachDelimitedLine function to get a ForEachDelimitedLine Function. This also shows you the power of the generic delegate.

public static void ForEachDelimitedLine<T>(
    string fileName, char delimiter, T outputObject, IOProcessor<T> processor)
{
    foreach (var x in EachDelimitedLine(fileName, delimiter, outputObject))
    {
        processor(x);
    }
}

In this case, the IOProcessor now accepts a generic argument T which is the same type as the prototype object I pass in (called outputObject).

Let's look at the earlier example redone to use the ForEachDelimitedLine function:

ForEachDelimitedLine(
    "delimitedFile.txt", ',', new { Title = "", Body = "", Anchor = "" }, (x) =>
    {
        if (x.Title == "Thoughts of Me")
        {
            // do something with my blog...
        }
    });

There, no more foreach statement. Now, the logic is part of the function itself.

Next, I plan to extend my C# emacs mode to generate a class around my static functions and compile and run for me so I can have a C#-script of sorts. Don't get me wrong, I still love and use Perl on a daily basis, but C# is often much faster than Perl, surprisingly enough, and seems to be able to handle memory more efficiently as well. However, I do admit that is anecdotal and not based on any tests I have performed. On a more pragmatic side, it is easier to integrate with existing C# libraries from C# rather than writing a SWIG interface.

Hope you enjoyed!

Saturday, November 22, 2008

disgruntled iPhone user

I'm tired of the Apple iPhone. As long as you don't need to do anything unusual, it is fine, but try to go outside what the folks at Apple have deemed appropriate for you and it blows up. For example, if you want to sync with two computers, you are hosed. If you want to make a different interface (for example, I want playlists to have folders), you are hosed. If you want to do anything custom, you are up a creek - I'm tired of it. When my contract expires, I will be moving to a different phone. No more Apple for me, I want something more open.


Just a quick post to give props to the Midomi team. I always enjoyed Shazam, but it only works for commercially recorded songs. Midomi lets you sing and then tries to determine what song you were singing. It is great karaoke fun! My wife and I played a game of Horse with it recently and had a blast.

Great job, guys!

Monday, September 01, 2008

The end of broadcast TV

Seems like broadcast TV should be in its death throes. I give it no more than 7 years. I imagine within 5 everyone else will chime in with its death. I wonder how Dish/Comcast/DirecTV/etc... will fight back. Will they try to lure the RIAA into banning places like hulu.com? Or will they hold out hope that people will want to see their shows as soon as possible instead of waiting for them to come online? At what point do they stop coming out in broadcast form first? Seriously, between Netflix, Hulu, Veoh, and mtv.com I don't really need cable. I have enough entertainment with just those sites. That doesn't even include iTunes and the paid providers. The tail is growing in this area and it will only be a few more years before it becomes mainstream. Not only that, but the monetization ability is already there through commercials, so it is a viable model. I do hope it doesn't get bogged down by commercials though - and I hope the commercials become personalized. Just because I'm watching The Hills doesn't mean I want to see a cosmetics commercial...however, my eyebrows do look a bit...nevermind.

Sunday, August 24, 2008

What a great family

I received these pictures from my wife and kids when I arrived in Seattle. They are a great family and I miss them. From left to right is my niece Ashlan, my daughter Brooklyn, and my son Carter. See you soon!

Wednesday, August 13, 2008

Amazon Interview

I went through my interview with Microsoft in my previous blog entry. In this entry I plan to discuss my Amazon interview. I applied to Amazon after having visited the Silicon Valley area. I knew that I didn't want to work in that area and hoped the Seattle area would be nicer. Amazon gets tons of traffic daily and I'd been very impressed with their S3 and EC2 technologies, which I discuss here. I wanted to work at internet scale and Amazon seemed to be a prosperous and growing company.

I received an email from an Amazon recruiter not too long after submitting my resume. She described the process: two phone screens where I would have to write code over the phone and then they would fly me to Seattle. I had just come off my interviews with Google and Yahoo, so I was excited to have another opportunity ahead of me.

For my first phone screen, I was out of town, so I had to do it in my car. They mentioned I would have to write code and I shouldn't be driving. I had pulled over in a parking lot and was ready with pencil and paper when the call came. I had also turned off my car and it was about 100 degrees F outside; it was definitely uncomfortable. In this first interview, he asked me to write a binary search function for an integer array. Normally I don't describe the questions that are asked; however, in this case I don't think I'm giving away any big secrets.

I wrote a recursive binary search implementation and read it back to him over the phone. It was a bit odd to read code over the phone. I had to read out curly braces, square brackets, the whole works! When I finished reading it he asked me if line X was ok. I told him that I thought it needed a +1 to it. He said ok and then asked me why I chose a recursive implementation. I told him that I thought it would be easier to write. I tend to think recursively. He asked me what the downsides of a recursive implementation are. I told him that it requires time to push on and off the stack. I also mentioned something about stack overflow, but in hindsight that was a bit silly. The greatest stack depth you can ever have for a binary search is 32 (on a 32 bit machine). I also mentioned that a tail call optimizer would eliminate the performance penalty. I'm not sure he understood that, so I didn't pursue it. Besides, I don't know of any C++ compilers that include tail-call optimizations.

We had plenty of time left, so he asked me to rewrite it iteratively. I complied and started writing it. In the middle of writing it, I realized I screwed up the recursive version. I told him about my screw up over the phone and called myself an idiot. We both got a laugh out of that. I finished my iterative implementation, double checked it for accuracy, and read it to him over the phone. He seemed happy with that approach and I mentioned that it was easier to write than the recursive version. However, I imagine that is because I had calmed down some and the recursive version had prepped me. He asked me about some of the test cases I would run on it. I mentioned as many as I could such as element not in the array, element at each end of the array, element in the middle of the array, duplicate elements, etc... He seemed satisfied and began telling me about their team and projects. In the end he asked me if I had any questions. I only had one or two about the team and working conditions. I'm not a huge questioner, I guess.
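For anyone curious, the iterative version he asked for looks something like this. This is a sketch in C# (the language I write these days), not the exact code I scratched out on paper during the call:

```csharp
using System;

public static class Searching
{
    // Iterative binary search over a sorted int array; returns the index of
    // target, or -1 if the value is not present.
    public static int BinarySearch(int[] sorted, int target)
    {
        int low = 0;
        int high = sorted.Length - 1;
        while (low <= high)
        {
            int mid = low + (high - low) / 2; // avoids overflow in (low + high) / 2
            if (sorted[mid] == target)
                return mid;
            if (sorted[mid] < target)
                low = mid + 1;   // the easy-to-forget +1 that bit me in the recursive version
            else
                high = mid - 1;
        }
        return -1;
    }
}
```

The test cases I listed map directly onto this: a missing element exits the loop with low > high, elements at either end are reached by repeatedly halving, and duplicates return some matching index (not necessarily the first).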

Honestly, at that point I wasn't sure I'd be getting a call back. I had flubbed the recursive version of a binary search, for goodness sake. I mean, come on, who does that? I should have nailed that one. I was a bit disappointed. (On a side note, there is a really great article on binary search in Beautiful Code. If you don't have that book, you really should.) Since my phone screen was on a Thursday evening I had to wait until the next week to hear my results. The recruiter called and told me that they were interested in talking to me further and they would like to set up another phone screen. I was excited to be moving on to the next round!

The next interviewer also asked me a fairly basic programming question. It is one that I ask a lot in interviews and therefore quickly pounded out the optimal solution. In this case, it seemed to take him a few minutes to verify that it worked. I was a bit surprised - I figured if you asked a question you should know the optimal answer for it. It could be that he was testing me to see if I would rush to change anything, but I was confident in my answer. He seemed pleased and we moved on to talking about other things. After this interview, I was pretty confident that I would be getting to go to Seattle and after a few days, the recruiter called to confirm it.

There were a few negatives about the trip to Seattle with regard to Amazon. First, they put you in a hotel downtown, which means that you are within walking distance of everything. That's not a negative. However, they don't imagine you need a rental car, so they don't provide one. Instead, they reimburse you for taxis. Now, I appreciate being reimbursed, but it is a lot of trouble. I have to find the cash to pay the taxis, get receipts, keep the receipts, turn them in, and wait 6 weeks for the money. Honestly, it's just not worth it. Secondly, the hotel they put me in didn't have free in-room internet. I wasn't going to pay $10 a day to get it, either. They did have free internet in the lobby, so I would go down there, download an Ars Digita video lecture, and go back to the room to watch it. I will say that the staff at the hotel was very courteous and the restaurant was great, but not having in-room internet sucks. Big time.

Amazon flew me in a day before the interview, so the first day I was there I went and walked around the downtown area. I found it vibrant and clean. I went to a suggested restaurant nearby and had some really amazing food. However, it was getting late and I was tired, so I went back to my room to rest up for the next day.

The next day I awoke and took a taxi to the Amazon building. It was on 5th street South. Either that or 4th Street South. I forget. But the reason I bring it up is because the directions explicitly say that South is different from not South :) In other words, they are on different ends of the town. I'm paranoid that my taxi driver won't know that and keep repeating to him, "South, right? I mean it's a different street or something." We get on 4th Street (or 5th Street...who knows...) and it's not South and I'm repeating again "4th Street South!". Finally, he assures me that he knows where he is going and I should just let him take me there. I sit back, say "ok", and enjoy the ride. We drove by the Seattle Public Library and I am amazed by its architecture. It is truly stunning.

Finally, we arrive at our destination. He lets me out and he pulls off after giving me a blank receipt. For that matter, all the taxi drivers have given me blank receipts. Do I fill in the amount and sign them myself? I have no idea. Anyway, I'm looking around and I don't see an Amazon sign anywhere. I'm freaking out thinking "Oh my gosh, he has taken me to the wrong street!" However, I do see a street sign and it has South on it, so I must be in the right place.

Finally, I see an entryway on the side of the building with "Amazon" on it. I go inside and speak to the receptionist. I'm about 30 minutes early, so she asks me to take a seat. Interestingly enough, as I'm sitting there thumbing through magazines, I find an advertisement for my current company, Acxiom. It had to be the worst advertisement I'd seen. It was a city-scape with a lake and the words "We make information intelligent." What does that even mean? How do I know what Acxiom does from that? I sat there for about 10 minutes coming up with better advertising campaigns. It wasn't hard. Having worked at Acxiom, I am Jack's complete lack of surprise.

A little after the appointed time, my first interviewer came and got me. He didn't say much as he walked me to the room where I would spend the rest of the day. In fact, once we were in the room he immediately asked me to code the solution to a problem. There was no introduction, soft-skills questions, or questions to make you comfortable. It was just an immediate call for action. I did well on the first question, though it took the entire time. The next interviewer had to wait outside the door a bit for us to finish up. The only odd thing was that I still have no idea what the first interviewer's name was or on what team he worked.

The second interviewer was a bit better, I think he told me his name :) However, he once again avoided any comfort questions and went straight to the coding ones. He asked a question that I put to my CS 101 students. I told him that it is a programming assignment that I give them and that it is usually a bit long to write out; however, I hoped I could do a better job than they could. Still, the program just takes time to write and it took up an entire whiteboard by the time we were done. He asked quite a few questions about it and I fixed a few bugs reading back through it. He also asked me how I would test it and I went through the various scenarios. One of those scenarios failed with my current implementation, so we fixed that as well. Then, it was time for the next interviewer.

The next interviewer asked a design question. It was a question that I had some familiarity with it as it included aspects of record linkage. In fact, I explained a few things to him from the leading edge of research that he was not familiar with. We got into a little debate about whether user click-streams could be substituted for a heuristic algorithm, but on the whole it was a good interview.

After that, it was the team's boss's turn to take me out to lunch. Amazon doesn't have a cafeteria (at least they didn't take me to it), so we went across the street to a Thai restaurant. Turns out my interviewer liked the same type of food I did and we ended up ordering the exact same thing. We talked about my experience and work history over lunch and he talked about the team. He mentioned a number of times that it was a production team and I would have to carry a pager. He wanted to make sure I was ok with that. I told him that I was and that I had done support in the past and was fine with it.

When we returned to the office, it was his turn to ask questions. His were more logic/math puzzles. At first, he started by saying "You have an array..." and I thought, "Oh boy, here we go again." Pretty much every company asks the same array question. "You are given an array with N integers. Each integer is in the range of [1,N-1]. Furthermore only one integer is repeated. How can you find the repeated integer?" I got this question and variants of it in 3 of the 4 interviews. I was getting ready to answer it again for Amazon when he threw a curve ball and asked a variant of it that I had never heard. This one was more interesting and actually fun. Now, I don't think that this type of question shows anything about the programmer other than whether or not he's heard the question before. It's kind of like the moving Mt. Fuji questions of old. However, I worked through the question, got the "Aha!" moment, and solved it. I tend to like solving those questions. Even though it didn't say anything about me as a programmer, I had fun with it.
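For reference, the stock version of the question has a tidy arithmetic answer (his variant I'll keep to myself). Assuming every value from 1 to N-1 appears exactly once except for the one value that appears twice, the duplicate is the actual sum minus the sum 1 + 2 + ... + (N-1); this little sketch is mine, not anything from the interview:

```csharp
using System;

public static class Puzzles
{
    // Values 1..(n-1) each appear once in the n-element array, except one
    // value that appears twice. The duplicate is the actual sum minus
    // n(n-1)/2, which is the sum of 1..(n-1). Long accumulators guard
    // against overflow for large n.
    public static int FindDuplicate(int[] values)
    {
        int n = values.Length;
        long expected = (long)n * (n - 1) / 2;
        long actual = 0;
        foreach (int v in values)
            actual += v;
        return (int)(actual - expected);
    }
}
```

The same trick works with XOR instead of sums, which sidesteps overflow entirely.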

The next interviewer was to assess my C++ and OO design skills. He asked me stock C++ questions, which I answered without difficulty. He then posed a coding question that was really more about interface design than actual coding. I had to do things like make sure my constructor was correct and write operator + in terms of operator+=, etc... I wasn't familiar with the domain (and told him so), but in the end I came up with a solution. He told me that it was the same solution he came up with when he first attempted the problem. However, he had to do it in a production setting, not an interview setting. I felt like coming up with the same solution as my interviewer was a good thing, so I was ok with that.

The final technical interview of the day was another design question. It was with this question that I had the most trouble. It was a question that was directly related to how they track and update pricing information. I tried to use a database to help solve the problem. I'm not sure if this was the right approach or not as it really constricted me. In the end, I came up with a pretty pathetic solution to a very complex problem. I don't think either the interviewer or I was happy with it. It was the one interview that I sucked at :)

After that was supposed to be the boss of the second team I interviewed with. However, he was out of town so he sent a sub :) This person described his team and the other team that I was interviewing for. I asked a lot of questions trying to distinguish the teams. I don't know that I ever could distinguish between the two teams, they were so similar. The only difference I could gather is that one was also responsible for a new framework.

After not getting the team structure sorted out, I was on to the final meeting of the day. It was with the human resources person. He and I chatted about a few HR matters and then he showed me to the lobby where they called a taxi for me.

On the whole, my experience with Amazon was just "OK". They were obviously not a "tech" company. Their dress code, lack of free drinks, and general frugality said that they were a retail business. The employees were smart, but seemed a bit like "cowboys". Their plans were overly ambitious and I wondered about their success rate. I'd also read online about how hard they worked and how little they were rewarded for it. Now, I certainly don't believe everything I read online, but it can't help but color your views.

In the end, I thought that Microsoft would be the better choice. I've never worked for a "true" tech company and the search space was very exciting. I really enjoyed visiting Amazon and I'm sure it would be a fantastic company to work for, but in the end the prospect of working for Microsoft was too good to pass up.

Well, I hope you enjoyed this second installment of my interviews and be sure to look for the third (Yahoo!) and fourth (Google) installments.

Tuesday, August 12, 2008

Microsoft Interview - Live Search

After reading about Ben's interview at Microsoft, I thought I would post my own. Perhaps I'll post my interviews at Google, Yahoo, and Amazon later.

Where to begin? Let me start by saying I'm not a Microsoft fanboy. I don't have anything against them and use Windows on my home desktop and Office at work, but that's about it. I spend most of my time in Cygwin, a Linux-like environment for Windows. In addition, I use Emacs and Eclipse. I have programmed professionally for 8 years, 7 1/2 of them exclusively in a Linux or Unix environment. My primary languages have been C++, Perl, and more recently Java. For a while, I ran Ubuntu on my desktop, but the kids and wife didn't like that so much :) Nevertheless, I think you can see that I am by no means a Microsoft fanboy.

I first applied for a position at Microsoft in 2003 in their Visual Studio group. I had recently completed my Ph.D. at Clemson and thought I was God's gift to programmers. I won't go into that interview at this time, since this entry is focusing on my latest interview, but I will say I bombed badly. For one thing, I ate something that disagreed with me and spent more time in the bathroom than in the interviewer's offices. Secondly, I was interviewing for both a PM role and an SDE role and so I kept having to switch modes. Thirdly, I was bull headed and wouldn't listen to any of the interviewers. I just sucked. They said I could apply again in a year, but I didn't. I figured they had a big red X on my submission folder.

Most of my professional career has been spent at Acxiom, a global interactive marketing services firm. Primarily, I study and implement ideas around matching and record linkage. Over the last two years, we've been bringing machine learning and information retrieval research to bear on the record linkage problem and it has been very exciting. I wanted to spend more time working on those large scale problems. Therefore, I submitted my resume to a few of the larger companies: Google, Yahoo, Amazon, and Microsoft. I got callbacks from the first three. Microsoft still had that big ole X on my folder.

My Google interview was first. I'll describe it in another post, but I bombed it. I didn't study and they didn't give hints. They were a great group of people, but it just wasn't my day.

Next up was the Yahoo! interview. I thought I had bombed it as well, but apparently not. I got an offer to work with the Pig team. However, by this time I also had an interview with Amazon upcoming, so I requested that they wait for my decision until after that interview. On a side note, I think the Pig team is great and I wish them all the best in the future.

Needless to say, I was very excited after receiving the Yahoo! offer. Perhaps I wasn't an idiot after all (I had begun to wonder after the Google interview - and my wife was making fun of me!). The first thing I did was announce it on twitter. The funny thing is, Gretchen, of JobsBlog fame, follows me on twitter. Her response? "Congratulations! Sounds like we'll have to do something about that." (Note: Since twitter is public, I don't have a problem reposting her words here). By the end of the day, I had three Microsoft teams contacting me. Among those teams was the Live Search team. Since I was interested in machine learning and information retrieval, I requested that team be the one I interview for.

I wasn't too worried on the day of the phone screening. After all, I had passed two phone screens for each of the three other companies; how hard could this one be? Turns out, it could be pretty hard. The screener started out by asking if I primarily used Live Search. My response: "No, I prefer Google." Once it escaped my mouth I immediately knew I had bombed. How could I interview for a product that I didn't even use! She then asked if I had ever used Live Search, to which I responded positively. I told her I didn't see much difference in the results, which was true. She then moved on to the "coding" question. I won't go into the question, but I will say I didn't do too well on it. I started out trying to use a dynamic programming algorithm. She quickly informed me that the problem could be solved in linear time. Ouch! We worked on the program for quite some time. I got down the right path, but needed quite a bit of help. She then took questions from me and the interview was over. I knew I wouldn't be moving to the second round. I even sent out a few text messages saying "When interviewing with Microsoft, the answer to 'Do you use Live Search?' is not 'No, I prefer Google'".

A few days later I was driving my family to a local water park when I got the call from Microsoft. They wanted to fly me out! Wow! Apparently it is OK to use Google after all! I certainly didn't expect that. I was already going to be in the Seattle area for my interview with Amazon, so I asked if they could schedule their interview at the same time. It required some fast moving on their part as my interview with Amazon was only a week away. However, they responded brilliantly and set up my accommodations and new return flight. The only thing left for me to do was choose which teams inside of live search I would prefer to interview with. They gave me a list of teams and their descriptions and I ranked them.

At this point, I'll skip over the Amazon interview and move forward in time to the day before the Microsoft interview. Amazon does not provide a rental car, but instead reimburses you for taxis. Therefore, I had to take a taxi back to the airport and then get the rental car from Microsoft. From there, I drove to Redmond and found my hotel. It was in the Redmond Town Center, which provided a great venue for eating and shopping. I got there in the afternoon and found a Greek diner in the town center. After lunch I went back to my hotel and watched some ArsDigita University algorithms lectures. In addition, I had spent some time reading the CLRS Algorithms book. Those two things, coupled with an online MIT course on algorithms, would be my primary study guide. Unlike Ben, I didn't work too many problems out on my own, just geeked out over videos :) On a side note, the hotel that Amazon provided didn't have free in-room internet, so it was nice to be in a hotel that did.

At this point, my worst nightmare happened. My stomach began to churn and I started spending more time in the bathroom than out. Why did this keep happening to me? Was it nerves or do I REALLY not like Seattle food? I tried a couple of times to wander around the Redmond Town Center, but each time I kept having to run (literally) back to my hotel room. If this didn't pass by the next day, I was going to just die.

Luckily it went away. I ended up eating something light for supper instead of all the Greek and Italian food I'd been eating. That seemed to help. By the morning, I was going to the bathroom at a much slower pace, so maybe I'd be ok.

I left very early in order to get to Building 19 on time. I didn't map out the route the day before, so I wanted to be sure I could find it. I'm glad I did leave early because the signs to it aren't overly visible and I looped the Microsoft campus a few times before finding the right building.

The receptionist inside the lobby of Building 19 took my name. I commented on her ear thingies. She had things in her earlobes like they have in Africa. I asked her what they were called and she said, "Gauges." I told her I was from Arkansas and we don't see too many things like that, but I thought they were cool. She thanked me and pointed to the waiting area. Yes, right, don't flirt with the receptionist...go sit down.

The waiting area came complete with an XBox 360 and a copy of Guitar Hero, which another candidate was playing. Well, at least I thought she was a candidate. I have to admit, she was really tearing up Guitar Hero - definitely a pro! After her turn ended, she looked back over at the receptionist and said, "Ok, your turn!". It turns out she was the OTHER receptionist. They swapped places and Gauge Girl began playing Guitar Hero (well, she offered it to me first, but I politely refused). Sadly, I began wondering if that was a test. Maybe they only hired people that played Guitar Hero when given the chance? (yeah, I'm sad like that)

After a bit of waiting, my recruiter came in. He took me back to his office and we began to talk about Microsoft. He gave me a great spiel about the company; he was a good salesman. We ended up talking longer than we should have and I had to hurry to get to a bus. Unfortunately, all the buses were gone and we had to wait for one to come back. It made me late to my first appointment :( He said not to worry, he would call over and tell them I was going to be late. Furthermore, like Ben, I was given a schedule that only went until 2:00. If I didn't perform, I'd be done then.

My first interview went ok. I never do great in the mornings, but I answered his questions satisfactorily. I forgot one simple case which finally came to me, but other than that, I did fine. I mentioned this failing to the next interviewer, who said not to worry about it. The next interview was more challenging. I had read a bit about what he was asking, so I had some background, which I explained to him. However, I only had the broad theory, not the intricate details, so we proceeded to hammer out the details over the board. It was actually a lot of fun because I knew certain things to be true, but didn't know why. We discovered why together over the board. It caused a lot of "Aha!" moments. The only downside is that I missed a key part that caused my runtime to explode. I figured this out as he was chatting with the next interviewer and when they returned I mentioned the optimization to him. He nodded his agreement and passed me off.

At this point, it was lunchtime. My interviewer took me to the cafeteria. I got Chinese food because I thought I could eat the rice without it upsetting my stomach. We talked over lunch and I didn't eat much. We mainly discussed coding styles, agile methods, etc... Turns out he was interested in a book club I had started with my current company because he was running one for his team.

After lunch, we returned to his office for some interface design. I teach a UML class at the University of Arkansas at Little Rock, so I sketched out my interface and design in UML. We went into detail around some of the decisions, like whether to use a visitor pattern or a factory pattern. I think I did ok, but it is hard to design an interface without seeing use cases for it first. I mentioned this toward the end and he agreed. At the end of this interview it was 2:00...and...he took me to the next interview!

During this interview I was back to coding. Actually, I never got around to coding too much as I spent most of my time determining the algorithm. This is a big difference from my first Microsoft interview, as I would have gone straight to the code. In this interview, I stepped back, looked at a number of examples and came up with the right solution first. I coded it pretty quickly and made no major mistakes. He showed me some of the cool things his team had been working on. I was really impressed! Then, he took me to the next interviewer. This interviewer had me focus on a "real world" problem. In other words, this problem was one that they had seen and solved in the past. It took me about half the interview, but I came up with the solution. He seemed impressed and said that not many of their candidates come up with that solution and it is the one that they actually used in production. I was pretty pleased with myself at that point. We spent the rest of the time talking about their projects, team structure, etc... It turns out that both teams I interviewed with spend a lot of time working with Microsoft Research, which is really cool for an academic like me :)

That was the last interview before the "big boss", who took me to get a drink and a snack. I was hungry since I didn't eat much lunch, so I took him up on a granola bar and a bottle of water. He spent a lot of time telling me about Microsoft and asking me about my previous experience and the work I did with my current company. I told him what I could without violating any NDAs. He seemed pleased with what I told him and I was very impressed with what he was saying about the future of the team. He described a lot of their internal systems and frameworks, all of which excited me. Like the recruiter, he was a great salesman. That must be a Microsoft thing :)

After our discussion, he said that I should meet the other team's "big boss" (my words, not his). He left me in the lobby of his building and went to hunt the other one down. Turns out the other "big boss" had just gotten done playing softball with the interns and needed to clean up, but offered to take me to dinner. I graciously accepted and sat down to wait as he got ready. I went to eat with him and one of his team leads (who I had interviewed with previously) at an Italian restaurant in the Redmond Town Center. We talked about Microsoft, the Yahoo deal, Google, our mutual desire to WIN, etc... He was also a great salesman.

I was pretty pumped after leaving there that night and couldn't sleep a bit on my red-eye flight back home. I called my wife and told her that I was pretty sure we'd have an offer from Microsoft to consider. Turns out we would also have an offer from Amazon.

I won't go too much into my decision making process other than to say that I did not like the Silicon Valley area. It just felt dirty to me. I was much more at home in Seattle. I liked the contemporary vibe and the fact that there were actual trees. Lots of trees. Yeah, I know it rains, but I don't like the sun much, anyway :) Therefore, though the Yahoo! job was very attractive, I just could not accept it. My decision was between Amazon and Microsoft.

At one point, I thought I wasn't going to accept any of the offers. We are settled, the kids are in school, we own our home, and life isn't too bad. However, I think the ability to work with some amazingly smart people and have Microsoft on the resume is too good to pass up. It is not often that you get to work with internet scale data. My work at Acxiom is on billions of records, but nothing like what you see when you consider search. The playing field is very, very different and you have to think WAY outside the box, which I love to do. Moreover, I think it would benefit our kids to be in an environment like Seattle. There's a lot to see and do there and there are many more opportunities there than there are here in Arkansas. Finally, the school systems there are very good and I think the kids will benefit greatly from that.

I start on August 25th and I am very excited. I'll be sad to leave a great team behind, but I know they wish me the best as well. Now, I'm looking forward to starting the next phase of my life as a member of the Live Search team. Google, look out!

New Job

I'm excited to announce that I am joining Microsoft to work on their Live Search engine. It's going to be a big transition, but I've never lived in the Northwest, so it is a very exciting time as well. I begin there on August 25!

Please keep me in your prayers during this exciting, but stressful time.

Saturday, July 12, 2008

The last month

Hopefully my employer doesn't have my blog in his RSS feed :)

Over the last month I have interviewed with a number of great companies. I have had a blast going and visiting with the brilliant people that inhabit the halls of Google, Yahoo!, Amazon, and Microsoft. I have received, or am receiving, offers from three of the four. It is definitely an exciting time as I rush to look at places to live and schools for the children and compare things like benefits, stock prices, growth potential, and perks.

It will be sad to leave my family and friends; adjustments will be necessary all the way around. However, I am very excited about the opportunities that lie ahead.

I won't describe the interview process or questions. However, I will say that all of the questions were very intelligent and relevant. It helps to know core algorithms and data structures such as lists, queues, and trees. Graph algorithms and advanced data structures like heaps and splay trees are also good to know. More generally, you just need to be able to tear a problem apart and reduce it to its essence. They want to see that you can analyze a problem; coming up with the right solution immediately is not as valuable as coming up with many solutions and being able to weigh them.

All of the companies had extraordinary employees, a great culture, and innovative systems. They all operate at "web scale" which means their problems are MASSIVE. No matter which company I choose, it will be a great choice.

I'm looking forward to the journey and I'll keep you all updated.

Tuesday, July 08, 2008

ArsDigita University

There are a lot of great free videos from ArsDigita University. Check them out!

Tuesday, July 01, 2008

Researching in a Search 2.0 world

A lot of my time is spent researching. A LOT of my time is spent researching. I research things around record linkage. I also research clustering, classification, natural language processing, and machine learning, in general. Quite a few times, I have to get up to speed. I need to understand what Fellegi and Sunter did in the 1960s before I can understand what Winkler added to it in the 90s and what the general entity resolution research is all about now. Or, perhaps I just want to be a better programmer. Perhaps move from O(n) to O(log n) on the Programmer Competency Matrix.

For Search 2.0, much of the hullabaloo has been about Natural Language Processing (NLP). Companies such as Powerset have touted their products as being able to understand a human query. For instance, the Powerset engineers have given demos where they ask their engine "Which politicians died of disease?" and it gives back a list. This approach is great if I'm after general information, or if I'm helping my kid with her homework. However, it doesn't give me perspective about research. Why am I asking about politicians and disease anyway? Am I trying to get a statistical view of politicians that die of disease vs the health of the rest of the populace? Am I trying to understand the effects of an ill politician on the society? I might wish to see the sites that others also searched for, much like the Amazon feature. Or, maybe I want additional statistics about that country during the time period. In other words, I need more than just the answer to my question. I need a path that others have followed that I can follow as well. Eventually, I'll have to get off the path, but I want to stay on it as long as possible.

In addition, I want to quickly understand an author's position. I want to know, with my search results, whether this author is an expert or a novice in the field. I want to know where his or her funding comes from. I want to know, based on statistical analysis of their previous posts, if they are conservative or liberal. Have they published papers? If so, in what journals? Are they top journals? It is this context that will make search valuable. Whether or not I can ask a specific question is irrelevant to me. I'll figure out a way to ask the question; however, I want more information back in an easy-to-understand manner. I want the site's PageRank, I want a general view of how other sites have posted about the site in question (positive or negative), I want to see complaints or compliments if it is a potential employer. I want CONTEXT. It seems to me that people get on the internet a lot for research. You research a good book to buy or what digital camera to get or where to go on vacation. All of these things could be enhanced by adding more context, more data mining, and better presentation of the information.

That will be search 2.0.

Friday, June 20, 2008

Where is the software that enables the long tail?

For a product to reach the long tail, it has to have three things.
First, it has to have a low barrier to entry. Anyone needs to be able to create content and publish it. Second, it has to be easy to access. This means it must be searchable. Finally, it has to allow others to review and comment. That's it. It's not rocket science.

Unfortunately, I'm having tremendous trouble finding software to enable the long tail. I really don't have overly strict requirements. In application 1, I want to be able to allow people to post Java jar files. I might want to check things on post, such as does it have a MANIFEST? In application 2, I have a site where I am putting up papers about record linkage. I want others to be able to post links and have the software verify that the links exist. In each application, I'd like to also allow them to associate a title with either the jar file or the link.

For search, I want application 1 to be able to search the javadocs that I generate from the source in the jar files. In application 2, I would love to have the content of the links searchable (so that people can find which PDFs contain information about blocking), but I'd settle for just being able to search the title that was given to the link.

For ratings/reviews I would like to have a 5-star rating system and comments. However, I'd settle for just the comments. I want people to say which papers they found readable or which jar files they found useful.

There are a lot of features you could add such as an RSS feed, etc..., but I'm willing to ignore all of that for right now. I just need the basics.

Am I asking for too much? Why doesn't this already exist? In the world of CPAN, reddit, etc... why doesn't a generic version of long tail software exist? Perhaps I'm just overlooking it. If not, I guess I'll just have to write it, but I'd much rather use something already written.

Let me know if I missed something!

Monday, June 16, 2008

19th Century Reading Habits in Australia

Here is a blog post describing data mining of 19th Century reading habits in Australia.

It is a fascinating application of PCA and clustering.

I don't think it will be long before commercial databases include standard data mining abilities such as feature selection, PCA, LSA, regressions, k-means, SVM, etc... The blog poster above had to use a combination of things, including the great R, to prep his data for the analysis; this should all be done inside a database. Perhaps this is the straw that breaks SQL's back? Perhaps that is where a language such as Pig is needed? Pig's niche could very well be in prepping data for data mining tasks and streaming them through a map-reduce library such as Mahout.

Regardless, business users are already there: companies like Harrah's, CitiBank, and Nationwide live and die by their analysis. Now, we just have to build the tools to let them bring new products up quickly and effortlessly. We are on the forefront of a huge explosion in statistical tools, modeling techniques, and machine learning. As much as the internet helped in providing ubiquitous access to data, these tools and techniques will help computers learn and understand us and our preferences. The internet revolution will appear to be a tiny blip on the screen compared to the wonders that are to come.

Wednesday, June 04, 2008

The Future of Enterprise

What is the future of enterprise software? Is it JRuby on Rails? What about Scala? Maybe a .NET stack? What is the next big shift?

In my opinion, the next big shift is to grid computing. Things like EC2, Hadoop, and HBase will become more and more popular. Commercial versions and vendors will spring up over the next few years. Enterprises will start pushing more and more of their computation into batch grid work. Column store databases will replace traditional RDBMS's for data warehousing (this is already happening with Teradata). Transactional applications will work on cached stores that are pulled from a large grid where they were processed and analyzed. Marketing will be on demand, but pre-computed.

In the rush to understand and market to the consumer, more and more companies have been moving to real-time analytics. However, the faster you need a decision, the less time you have to think about it. That is why grid computing will be so important. Your think time has to be done in advance. Transactional queries will be relegated to closing deals or finding a pre-computed offer.

The main challenge, as I see it, will be to find and use data structures that can be refreshed quickly. Google is in a fortunate position. They can refresh boxes asynchronously. If customer A enters query Q1 and then enters query Q2, it is ok if those two queries hit different data sets and return different results. However, if we're not dealing with a search application but instead something akin to a customer recognition or record linkage application, then the same person should always receive the same link. This is why versioning will become so important in the future. The customer will get a certain version of the data sets and will continue to use that version until their application finishes, at which point they can be upgraded. This requires more server-side storage, but allows the client the ability to use a consistent data set. This type of versioning and auto-update software will be necessary in a commercial form. In addition, quick nearest-neighbor searches will need to be commercialized. Clients will want to know quickly which market segment a client falls into. Nearest neighbor searches are the key to understanding that and they will need to be generalized and distributed over the next few years.

So, the enterprise is changing. The relational database will move from its position of dominance to just another tool. The grid will take its place as a hammer looking for nails, and applications that can process terabytes of data quickly will be deemed must-haves for enterprise data centers across the world. The Enterprise Service Bus (ESB) will continue to shuffle transactional data around, but the backend will now be distributed data stores that will be versioned and queryable by categorized nearest neighbor searches. Who will be the vendor of these tools? I have no idea. Many, like Hadoop, will be open source. Some, like Teradata, already exist. Others will be proprietary and don't even exist currently. Regardless, it will be a lot of fun and I'm looking forward to the ride.

Thursday, May 29, 2008

Including other jars in your hadoop job jar

I learned that you can include other jars in your hadoop job jar by placing them in the lib/ directory under your job jar. Very nice and convenient!

Now, if I can only make the DistributedCache work. Right now, it just isn't.

Monday, May 26, 2008

Record Linkage Papers

I have created a site where I am putting up links to record linkage papers. I hope to include comments on them soon. If you have a paper you'd like put on the site, just leave me a comment.

Saturday, May 24, 2008

Min-Hash signature for bigrams

I don't believe there is a decent description of how to create a min-hash signature for bigrams on the web, so I'm going to try and provide one. Of course, my description will probably be flawed, but I hope that it will be better than what is out there currently.

First, what is a min-hash signature?

The idea is that, given two records, you can provide signatures such that the similarity of the signatures is approximately equal to the similarity of the records.

Here is a slide show based around using min-hash signature to provide keys for indexing into a database. Here is the relevant paper.

So, basically it is a way of generating a fuzzy key such that if the two keys match there is a high probability that the two records will match.
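Before moving on to bigrams, that approximation property is easy to see in a toy implementation. Below is a sketch I wrote purely for illustration (the class and method names are my own, not from the slides or paper above): each signature component is the rank of the record's earliest element under a random permutation of the universe, and the fraction of components on which two signatures agree estimates the similarity of the two records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class MinHashSimilarity {
    // Build f random permutations of the universe (seeded for repeatability).
    static List<List<String>> permutations(Collection<String> universe, int f, long seed) {
        Random rnd = new Random(seed);
        List<List<String>> perms = new ArrayList<List<String>>();
        for (int i = 0; i < f; i++) {
            List<String> p = new ArrayList<String>(universe);
            Collections.shuffle(p, rnd);
            perms.add(p);
        }
        return perms;
    }

    // signature[i] = rank of the record's earliest element under permutation i.
    static int[] signature(Set<String> record, List<List<String>> perms) {
        int[] sig = new int[perms.size()];
        for (int i = 0; i < perms.size(); i++) {
            List<String> perm = perms.get(i);
            for (int rank = 0; rank < perm.size(); rank++) {
                if (record.contains(perm.get(rank))) {
                    sig[i] = rank;
                    break;
                }
            }
        }
        return sig;
    }

    // Fraction of components on which the two signatures agree.
    static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        Set<String> a = new TreeSet<String>(Arrays.asList("tg", "gi", "ib", "bb", "bs"));
        Set<String> b = new TreeSet<String>(
                Arrays.asList("ta", "an", "nt", "to", "on", "n_", "_g", "gi", "ib", "bb", "bs"));
        Set<String> universe = new TreeSet<String>(a);
        universe.addAll(b);
        List<List<String>> perms = permutations(universe, 100, 1L);
        // Estimated similarity of the two bigram sets (the true Jaccard value is 4/12).
        System.out.println(similarity(signature(a, perms), signature(b, perms)));
    }
}
```

Two identical records always get identical signatures (similarity 1.0), disjoint records never agree on a component (ranks within a permutation are unique), and in between the agreement rate converges on the Jaccard similarity as the number of permutations grows.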

We're going to examine a way of doing this using bigrams.

Let's assume we have the local part of an email address. In our base file, we have the following entries:
tgibbs
tanton_gibbs
gbarker
tkisth
bspears

If we look at the bigrams generated for each local part, we have the following:
tgibbs = { tg, gi, ib, bb, bs }
tanton_gibbs = {ta, an, nt, to, on, n_, _g, gi, ib, bb, bs}
gbarker = {gb, ba, ar, rk, ke, er}
tkisth = {tk, ki, is, st, th}
bspears = {bs, sp, pe, ea, ar, rs}

If we order all the bigrams alphabetically, then we have the following universe of bigrams: {an, ba, bb, bs, ... th}

Now, a min hash signature requires an input f that tells how many hash functions to use. Let's set f to 3. That means we'll hash each record in 3 different ways.

To generate a hash function, we randomly permute the bigram universe.

So, our first permutation may look like:

{tk, ib, an, pe, ... rs}

our second permutation may look like:

{rk, gi, bs, th, ... pe}

and our third permutation may look like:

{bb, ba, er, ke, ... an}

Now, the next thing we need is a similarity threshold, t. Let's assume t is 0.8.
So, for each record, we will use 80% of the bigrams to produce the hash.

So, if our input record is tgbbis, then we have the following bigrams to choose from
{tg, gb, bb, bi, is} We would choose 3 or 4 bigrams (probably both) to produce the hash. But, which 3 or 4 do we choose?

For each permutation of our bigram universe that we produced above, we will pick the bigrams that appear the earliest. So, for permutation 1, we might end up choosing {gb, bi, is}. For permutation 2, we might end up with {bi, tg, bb}. For permutation 3, we might have {bb, is, gb}. We would do the same thing for combinations of 4 bigrams.

Now, we can create a string from the bigrams {gbbiis, bitgbb, bbisgb}. These strings become our prospecting keys. We perform the same key generation routine on the base records. Then we can join our input keys to our base record keys to find our match candidates.

If we want a more approximate min-hash signature, we could sort the bigrams before creating the key so that we would have the keys {bigbis, bbbitg, bbgbis}. Obviously, the duplicate key could be thrown away. This has the effect of handling more transpositions in the input at the cost of bringing back more candidates.
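The whole keying procedure can be sketched end to end in Java. This is a toy version I put together to illustrate the steps above (the class and method names are mine, and k stands in for the "3 or 4" bigram count): extract the bigrams, permute the universe, and concatenate the record's earliest-appearing bigrams into a key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class MinHashKeys {
    // Extract the set of character bigrams from a string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new TreeSet<String>();
        for (int i = 0; i + 1 < s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Walk one permutation of the bigram universe and concatenate the first
    // k bigrams of the record that we encounter -- one prospecting key.
    static String key(Set<String> recordGrams, List<String> permutation, int k) {
        StringBuilder sb = new StringBuilder();
        int kept = 0;
        for (String gram : permutation) {
            if (recordGrams.contains(gram)) {
                sb.append(gram);
                if (++kept == k) break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Universe of bigrams drawn from the base file entries.
        Set<String> universe = new TreeSet<String>();
        for (String rec : new String[] {"tgibbs", "tanton_gibbs", "gbarker", "tkisth", "bspears"}) {
            universe.addAll(bigrams(rec));
        }
        Random rnd = new Random(42); // seeded so the demo is repeatable
        for (int i = 0; i < 3; i++) { // f = 3 hash functions
            List<String> perm = new ArrayList<String>(universe);
            Collections.shuffle(perm, rnd);
            System.out.println(key(bigrams("tgbbis"), perm, 3));
        }
    }
}
```

Running it prints three prospecting keys for tgbbis, one per permutation. Note that the input bigram bi never appears in the base-file universe, so it can never be chosen; only bigrams present in a permutation are candidates.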

Hopefully this illuminates min-hash signatures a bit more so that the references above make sense.

Friday, May 23, 2008

pig.properties

If you are going to set up pig, you need to be aware of the pig.properties file.

It has things like the cluster setting, which is important to set correctly. It also controls whether to run locally or remotely.

See this thread on the Pig mailing list for how to set the cluster setting:

Thursday, May 22, 2008

Pig redux

I rewrote my original pig script to use a nested FOREACH block.

It now looks like:

A = load 'myInputFile' using PigStorage(',') as (seqid, cl, local, domain, fn, ln, zip, email);
B = foreach A generate flatten(SortChars(local)) as sorted_local, cl;
C = GROUP B by sorted_local PARALLEL 10;
D = FOREACH C {
    CLS = B.cl;
    DCLS = DISTINCT CLS;
    GENERATE group, COUNT(DCLS) as clCount;
};
E = FILTER D by clCount > 1;
F = FOREACH E GENERATE group;
store F into 'myDir' using PigStorage();

A, B, and C are the same as before. However, we now push the distinct into the FOREACH loop for D. This allows Pig to use a DistinctBag to dedup the CLs for a sorted_local part. Before, the DISTINCT had to use a separate map-reduce step; however, now it can do it all in only one map-reduce job. This effectively halves the time the script takes.

For more on how things get executed, you can use the explain option. For instance, explain F; will show how F gets executed. More information can also be found on the pig wiki.

Pig continued

Now that I have the sorted local parts of an email address, I want to know which emails are identified by those sorted local parts. So, I know that bbgist has more than one CL associated with it. Now, I want to know that tgibbs and tgbbis both go to bbgist. Also, I only want those emails with more than one CL, so I don't want tgobbs because it (let's pretend) is only associated with one CL.

Here is the pig script to do that:

register /home/tgibbs/pig/pig/myPigFunctions.jar
A = load 'myFile' using PigStorage(',') as (seqid, cl, local, domain, fn, ln, email);
B = load 'mySortedLocalFile' using PigStorage() as (sorted_local);
C = foreach A generate local;
D = DISTINCT C;
E = FOREACH D generate local, flatten(SortChars(local)) as sorted_local;
F = JOIN E by sorted_local, B by sorted_local PARALLEL 10;
G = FOREACH F generate local;
store G into '/user/tgibbs/eproducts-local-merged-out' using PigStorage();

On line 1, once again, I register my SortChars function that is in my jar file.
Line 2 loads the original email file.
Line 3 loads the file of sorted local parts that I created last blog entry.
Line 4 reduces the original email file down to just the local part.
Line 5 (the D line) gets rid of any duplicate local parts, since I don't care that tgibbs appears 10 times and it reduces the amount of data I have to deal with later.
Line 6 (the E line) generates the sorted version of the local part and keeps the original local part as well.
Line 7 (F) joins the two files by the sorted local portion. Its schema looks something like (E::local, sorted_local, sorted_local).
As an example, it would have (tgibbs, bbgist, bbgist) and (tgbbis, bbgist, bbgist).
Line 8 (G) just gets the local part, which is what I want to store and the last line stores it.

Next up, I plan to try out the ILLUSTRATE command, which should show example data flowing through each step of the script.

Mahout

Just a quick post to point out Mahout. Mahout is a machine learning library built on top of Hadoop. They are still fairly pre-alpha, but they already have some interesting algorithms developed. I plan on trying out their canopy clustering algorithm relatively soon.

On a second note, I've gotten Pig up and running and have successfully run some large jobs. Woot!

Here is a pig script to count the number of AbiliTec Consumer Links associated with the sorted local part of an email address.

In other words, if there were two emails tgibbs@blah.com and tgbbis@blah.com, the sorted local part of both would be bbgist. If they were each associated with a different consumer link, then the following pig script would output bbgist. Otherwise, it would not.
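The normalization itself is just a character sort. In plain Java (a throwaway sketch of mine, separate from the actual Pig UDF shown at the end of this post) it looks like:

```java
import java.util.Arrays;

public class SortedLocal {
    // Sort the characters of an email local part so that transposed
    // spellings collapse to the same key.
    static String sortChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(sortChars("tgibbs")); // bbgist
        System.out.println(sortChars("tgbbis")); // bbgist
    }
}
```

Both spellings collapse to the same key, which is what lets the GROUP in the script below bring them together.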

register /home/tgibbs/pig/pig/myPigFunctions.jar
A = LOAD '/user/tgibbs/eproducts' USING PigStorage(',') AS (seqid, cl, local, domain, fn, ln, zip, email);
B = FOREACH A GENERATE FLATTEN(SortChars(local)) as sorted_local, cl;
BDIST = DISTINCT B PARALLEL 10;
C = GROUP BDIST BY sorted_local PARALLEL 10;
D = FILTER C BY COUNT(BDIST) > 1;
E = FOREACH D GENERATE group;
STORE E INTO '/user/tgibbs/eproducts-sorted-local-out' USING PigStorage();

The register line loads a custom jar file so that I can call my custom functions. More on that later.

The next line, the assignment of A, just reads a comma delimited file into A. It also associates a name with each of the comma delimited fields. So, field 0 is seqid, field 1 is cl, etc...

The third line loops through each record in A (that's the FOREACH) and sorts the characters in the local part. I wrote the SortChars class and will post it at the end of this entry. The FLATTEN is needed because SortChars returns a tuple; since I only have one element going in, there is only one element coming out, and I want to treat that one element as an atomic data item instead of as a tuple. The 'as sorted_local' portion renames the data item.

The type of B is now a tuple (sorted_local, cl).

The next line eliminates duplicates. So, if two records have the same sorted local part AND cl, then we can safely erase the duplicate because it will not affect our final count. In fact, we'll need to eliminate it for the rest of our logic to work. The PARALLEL keyword ups the number of reduces. This means we'll end up with 10 output files instead of 1, but that's ok because we'll process them all later, anyway.

In the end BDIST has the same type as B (sorted_local, cl).

The line that aliases C groups BDIST by its sorted_local part. This basically creates one record for each distinct sorted_local part. The value is a data bag that contains every BDIST record with that sorted_local part.

So, C now has the type (group, BDIST:{sorted_local, cl}).

The D line gets rid of any groups that have only one cl. D has the same type as C.

The E line gets only the groups, ignoring the actual values. And the store line writes those groups out. So, now I have one group for every sorted local part that has more than one CL. How cool is that!

Here is my PigFunction for sorting the characters:

import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataBag;
import org.apache.pig.EvalFunc;
import java.util.Arrays;
import java.io.IOException;

public class SortChars extends EvalFunc {
    public void exec(Tuple input, Tuple output) throws IOException {
        // Sort the bytes of the input field alphabetically
        String str = input.getAtomField(0).strval();
        byte[] bytes = str.getBytes();
        Arrays.sort(bytes);
        str = new String(bytes);
        Tuple newOut = new Tuple(str);
        output.copyFrom(newOut);
    }
}

Wednesday, May 21, 2008

Pig

Pig is Yahoo!'s data flow language designed to run atop Hadoop. I've spent a few hours today getting it set up and running. One thing I would like to point out is that you can't run the pig script in the bin directory (or at least not and have it connect to the hadoop cluster).

I had to manually run:
java -cp pig.jar:$HADOOPSITECONFIG org.apache.pig.Main

Also, if you dump a variable, it has to run the map phases to get to it. I thought it would just show the schema, but no....

Hadoop Streaming

Hadoop Streaming appears to be a way to write quick hadoop jobs. I've recently been playing with it and have finally gotten it to work for me.

The main parameter that I had to add was -jobconf stream.shipped.hadoopstreaming=$HADOOP_HOME/contrib/streaming

It was somehow getting set to /tmp which was causing everything in my /tmp directory to get added to the job jar it generates.

Another good thing to keep in mind is the -verbose flag. It can help figure out what is going on under the hood.
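Putting the pieces together, a minimal invocation ends up looking something like this (the jar name, input/output paths, and the cat/wc mapper and reducer are illustrative placeholders; adjust for your install):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInput \
    -output myOutput \
    -mapper /bin/cat \
    -reducer "/usr/bin/wc -l" \
    -jobconf stream.shipped.hadoopstreaming=$HADOOP_HOME/contrib/streaming \
    -verbose
```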

Tuesday, May 20, 2008

Multi-User Hadoop

I've been setting up hadoop on a few (6) boxes and have been attempting to make it work for multiple users. It is not as easy as it sounds because the docs are a bit spread out.

Nevertheless, if everyone is in the same group, then you need to set your default group to represent that.


You'll also need to change the group of any files you've already created.
hadoop dfs -chgrp -R groupname /

Next, to allow multiple users to run mapreduce jobs, you'll need to set your configuration directory to be a place that is accessible to ALL the boxes. I'm using an nfs mount, but you could use hdfs just as easily.


Make sure that directory exists and is writable by the group mentioned above (or at least that all your mapreduce users can write to it).

At this point, that is all that I know needs to be changed. I have two people (including myself) using our hadoop cluster. So, I'll let you know as we run into more problems.

Saturday, May 03, 2008

Linear Algebra

A great undergraduate linear algebra class from MIT's Open Courseware program can be found here.

Other MIT Open Courseware courses with heavy video/audio content can be found here.

Sunday, April 27, 2008

You Tube Videos

Google Tech Talks has a number of great video lectures on You Tube. I thought I would share some that I found interesting.

First, there is a 13-part series on statistical aspects of data mining. The presenter teaches the class at Stanford and decided to give it for fellow Googlers as well. He is a very good instructor and takes people through an introduction to using statistics in data mining. This is definitely required watching.

Second, Andrew McCallum has a lecture on enabling object search instead of page search. Basically, his work is in entity resolution and graph approaches.

More to come...

Thursday, March 20, 2008


I couldn't agree more

Don't measure against rivals.

It seems like companies, when they develop a new product, just try to emulate what is already available in the market. Unfortunately, it takes 3 to 5 years to create a mature product. What is currently available is somewhere in that 3 to 5 year span. If you attempt to emulate that, you will also take 3 to 5 years. However, by that time, the market will have moved on. You will always be reactionary and you will always be behind the competition.

Tuesday, March 18, 2008

Data mining

To feed my continuing interest in behavioral economics, I've decided to dig deeper into data mining. It seems that every researcher uses data mining in some way. I'm taking a course in statistical approaches to natural language processing. It is definitely improving my understanding of the difficulties involved in analyzing the data. I also hope to get a few books on the subject soon.

It seems that real estate data is quite the gem of the behavioral economics world. I guess because it is free and easily accessible, it gets more interest, much like blogs for the search community.

Probably the best thing to do would be to start with a question: what do I want to know about the world? And then start finding the right ways to answer it. I'm not sure what my question will be. Stay tuned. In the meantime, I wonder where you find real estate data...

Thursday, March 13, 2008


The other day, we lost a student in our program. Our website, still in beta, was a bit too buggy and too confusing. He or she was right to leave; we should have put more time into user testing. Instead of throwing out half-baked ideas, we should have made them a little more solid. We released a student site and sent them a link, but didn't link to it on our homepage. We did have a link to our forum, but the forum username and password weren't the same as the student ones. But we learned from that experience. We're focusing heavily now on usability, trimming unnecessary fat and trying to unify our systems. We lost one student, and I don't want to lose another for similar reasons.

So, if we send out a link to something, the link had better be on the front page too. Also, people like to have a button that says "login". Finally, if you have multiple features, the login information should be consistent across all of them. Since we grew organically and a little haphazardly, we didn't think of these things. Now they've bitten us and it is time to fix them.

Tuesday, March 11, 2008

Michael Bublé

My wife and I went to see Michael Bublé last night in Memphis. He's a very good entertainer. However, it was sad to think that he is the best standards singer of our time. He's not on the level of a Frank Sinatra or even a Bing Crosby; I think of him more as a Sammy Davis Jr. type. I actually like his original compositions more than his take on the standards. Anyway, I would recommend going if you get the chance - he's worth that. I just wouldn't go more than once.

At least Memphis in May is coming up soon. I am looking forward to that!

Friday, March 07, 2008

Remote users

Remote users are the worst. It was nice when the only users I had to deal with were me and my friend. There were no versioning problems, and when differences did arise, I just drove down the street and took care of them. Now, my software has to run on XP and Vista (yes, that can be problematic). It has to run with different versions of Internet Explorer. It has to deal with all sorts of problems that it didn't have to when it was just for me. However, I have to say I'm learning a lot. I can now manipulate the DOM like nobody's business. Is that useful? Probably not. But then again, when have I cared about usefulness, as long as it's cool :)

Why doesn't everyone use unicode?

Well, my promise to blog daily was short-lived, but I'm back now. Wednesdays are going to be hard for me to blog because I teach a night class about 2 hours from where I live. But I should have blogged on Thursday - no excuses there.

On to today's topic.

I spent multiple hours yesterday dealing with a character set encoding issue. As I've mentioned, our chess training site is multi-lingual, so we store all of our messages in UTF-8. Our database, MySQL, is set to use Unicode, as is our server side language, PHP. Our client side language, Perl, is also set to use Unicode. However, no matter how hard I tried, whenever I sent a message through red hot pawn via our software, the message got corrupted. I could manually copy and paste the message from our database viewer to the message sender and it would work fine, but when my software did it, the ä came out looking like Ã plus some other symbol that I wasn't familiar with. Printing it to my screen from Perl produced yet a third symbol, Σ.

Obviously, Internet Explorer, Perl, and my Windows console were all using different character sets. I found the setting for Internet Explorer: it was set to ISO-8859-1 (Latin 1). I did not find the setting for my Windows command prompt; I assume it was using some Windows character set. I tried changing the Internet Explorer setting, but it didn't seem to have an effect. Finally, after a few hours of hunting and validating various settings, I checked the encoding on the red hot pawn page. It was set to ISO-8859-1. Ah ha! They were enforcing their own encoding on the page.

Apparently, when I copy and paste, the operating system does the conversion in the background for me. However, when my program does it, I have to do the conversion myself. Because of a few limitations, using the Perl Encode module was not an option for me, so I settled on utf8::downgrade. This won't work if someone's desktop settings are not Latin-1, but it suits my needs at the moment. Why can't everyone just use Unicode?
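Incidentally, the garbage characters are exactly what you get when UTF-8 bytes are reinterpreted as Latin-1. A quick demonstration (in Python rather than Perl, simply because it is the shortest way to show it):

```python
utf8_bytes = "ä".encode("utf-8")         # U+00E4 becomes two bytes: 0xC3 0xA4
mojibake = utf8_bytes.decode("latin-1")  # reinterpret those bytes as Latin-1
print(mojibake)  # prints "Ã¤": an A-tilde plus an unfamiliar second symbol
```

Each byte of the two-byte UTF-8 sequence gets decoded as its own Latin-1 character, which is why one accented letter turns into two strange ones.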

Tuesday, March 04, 2008

The value of free

Predictably Irrational spends quite a bit of time talking about the value of free. Apparently, free makes you go bananas. For instance, if you are checking out at a store and are offered a Hershey's Kiss for free or a Lindt chocolate truffle for $.14, over 70% of the people take the free Kiss. If you charge just $.01 for the Hershey's Kiss, then over 70% of the people will take the truffle, even though it is now $.15 (the penny added back in so the relative difference stays the same). Free makes people bananas.

Another study came from Amazon. When Amazon started offering free shipping, their sales increased dramatically everywhere but in France. France saw no difference. After researching it, they discovered that the French division had decided not to go completely free, but instead charge 1 Franc (about US $.20 at the time). This small fee was enough to change people's brain chemistry. When they reduced the price to free, France also saw a big jump in purchases.

The next chapter expands this by considering social norms and market norms. We deal with both every day, but we try not to mix them. If you are operating under a social norm (like eating dinner at a friend's house), you don't offer to pay. You can give a small gift as appreciation, but you NEVER bring up money. Contrariwise, if you are operating under a market norm, then you want the best deal - you act selfishly. What their research found was that the smallest payment was enough to move people from a social norm to a market norm. In fact, even the idea of money was enough to make the change. For instance, if you ask someone off the street to help you move a couch, they will agree. If, instead, you offer to pay them $1.00 to help, they will turn and run because they are insulted. When operating under market norms, people want the best deal; when operating under social norms, they want to help.

Perhaps we take the free Hershey's Kiss over the Lindt truffle because we want to operate under a social norm. We are social creatures and we want to be unselfish. Perhaps by thinking of the Kiss as a free gift, we get to stay in our social environment that much longer, keeping out the harsh market realities.

Monday, March 03, 2008

Force and Counterforce

I'm programming the Personal Chess Training website with a friend. He does most of the content work (web design, chess problems, hints, etc...). I do most of the hard core programming. We split the database work, the hard stuff goes to me and he picks up the easier things.

The one thing that is interesting is how force and counterforce work together between us. For example, I prefer to take things slow and steady; he prefers to rush in and see what breaks. I have my students set to 1 or 2 problems at a time; he sent out 25 to all of his students at once. I had hoped to add 1 or 2 additional teachers; he signed up 10. Often, his forces create counterforces in the code base. For instance, I got tired of hardcoding the teacher name in the code, so now I pull it automatically from their cookies. We needed some way to handle his problem load, so we created tracks and simultaneous games that could be set per teacher. I also added automatic restarting of mates in 1, since he was having to work through so many that students missed. Each push that he has given has resulted in a counter push by me to automate the site and reduce my workload.

I wonder how many other things have come about from a force and a counter force. After all, necessity is the mother of invention.

Sunday, March 02, 2008

Personal Chess Training

Recently, I've been working on a chess training site with a friend. He's around 2100 and is a very good endgame technician. We designed a course to help others with their endgames. If you are interested in improving your chess endgame (for free!), then please visit our site. You use a free correspondence chess site www.redhotpawn.com to play the games with REAL HUMANS. We've had a number of great comments on the redhotpawn forum that we can direct you to if you are interested.

To subscribe, just join red hot pawn and message me (thgibbs) or my friend (petrovitch) with a message indicating your interest.

Comparisons and Anchors

It's been a while since I posted, but I'm going to try and post daily from this point on - we'll see.

I've been listening to the book Predictably Irrational: The Hidden Forces that Shape Our Decisions on my iPhone via my audible subscription. In this book, the author describes how we compare things and how anchors affect us. In short, an anchor is the first thing that we relate to a category. I'll give an example. If we see a bicycle advertised for $20, then $20 becomes the anchor price for all bicycles. If we later see a bicycle for $40, we believe that it is either better in some way or overpriced, but we always go back and compare it to the first price we saw. The author performed studies where anchors were used to manipulate the amount people would pay for an item. For instance, if people were asked whether $10 was a reasonable price for an item, their maximum price was much higher than that of people who were asked whether $0.10 was a reasonable price. The initial anchor greatly affects how we view future spending on an item.

I believe that an iterative approach to software development should be described in these terms. With the first iteration we give the user an anchor. He can then make decisions about the program with that initial version in mind. His requests should be more rational because he is going to frame them in the context of the initial anchor. With a traditional waterfall approach, the user has to create the requirements without an anchor. This leads to unnecessary requirements and unused features because the anchor isn't present. There doesn't exist a "good" or "bad" context to help shape the program; there is only air. By creating an initial draft of the program, we create an anchor that guides the user, giving them the all-important context that ensures a successful program.