
Tuesday, January 15, 2013

Exploring Internet Paths

Introduction

Packets on the internet are a lot like car drivers. They (and excuse my anthropomorphism) would like to reach their destination in the quickest, cheapest way possible. But unlike drivers, packets don't have the luxury of relatively static routes. The best route from your source to the destination today may not be the best tomorrow, and might even be completely unavailable by next week.

This post details my approach to learning which of today's best routes are still likely to be the best in the future, as part of Kaggle's "Facebook II - Mapping the Internet" contest. Kaggle describes the challenge as:
The Task: you will be given a path which, at one point in the training time period, was an optimal path from node A to B. The question is then to make a probabilistic prediction, for each of the 5 test graphs, whether the given path is STILL an optimal path.  This is a much more difficult task than link prediction alone. The global structure of the graph may affect many optimal routes, paths can have varying lengths (and thus varying a priori probabilities of being optimal), and there may be multiple optimal routes for a given source & destination.
I reached 4th place, but I'm especially proud that I did it in only 13 attempts. In comparison, each of the top three finishers submitted more than 30 entries.

Cleaning Up the Data

It's safe to say that I invested most of my time in this stage. I also believe that investing more time in cleaning up the data could have brought me into the top 3. Facebook, the organizer of this contest, decided that it would be more fun to add artificial errors to the training set. Upon examination, I eventually convinced myself that a node's name appears either as the true name (e.g. "SILKROAD - Silk Road Technology Inc") or as one of the following mutations:
  1. A single word dropped, e.g. "SILKROAD - Silk Technology Inc"
  2. A single word duplicated, e.g. "SILKROAD - Inc Silk Road Technology Inc"
  3. The letters of a single word reshuffled, e.g. "SILKROAD - lSki Road Technology Inc"
Luckily, the true form of the node name was much more common than any of its mutations. My final approach was simple but effective: given a set of potential node names (both true names and mutations) and their counts in the training data, look for the most common node name, write it down as a suspected true name, and remove it from the set together with any name that could be its mutation. Then repeat with the names left in the set.
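A minimal Python sketch of this greedy pass (the `is_mutation` helper and the name-count structure are my own illustration, not the exact code I used):

```python
def is_mutation(candidate, true_name):
    """True if `candidate` can be produced from `true_name` by one mutation:
    dropping a word, duplicating a word, or reshuffling the letters of a word."""
    c, t = candidate.split(), true_name.split()
    if len(c) == len(t) - 1:                      # a single word was dropped
        return any(c == t[:i] + t[i + 1:] for i in range(len(t)))
    if len(c) == len(t) + 1:                      # a single word was duplicated
        return any(c[:i] + c[i + 1:] == t and c[i] in t for i in range(len(c)))
    if len(c) == len(t):                          # letters of one word reshuffled
        diffs = [(a, b) for a, b in zip(c, t) if a != b]
        return len(diffs) == 1 and sorted(diffs[0][0]) == sorted(diffs[0][1])
    return False


def suspected_true_names(name_counts):
    """Greedy pass: repeatedly take the most frequent remaining name as a
    suspected true name, then discard every name that could be its mutation."""
    remaining = dict(name_counts)        # name -> count, e.g. a collections.Counter
    true_names = []
    while remaining:
        best = max(remaining, key=remaining.get)
        true_names.append(best)
        remaining = {n: cnt for n, cnt in remaining.items()
                     if n != best and not is_mutation(n, best)}
    return true_names
```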

For the very popular nodes, those with many links, this worked wonders. For some less popular nodes, one of the mutations was more common than the true name, which left me with a couple of the mutations in the final name set (because "SILKROAD - lSki Road Technology Inc" is not a valid one-step mutation of "SILKROAD - Inc Silk Road Technology Inc", removing one from the set of names doesn't eliminate the other, and in time it too gets written down as a suspected true name). I had a script looking for very similar suspected true names, and I manually fixed some of those when I had free time.

After getting a set of suspected true names, I went through the training (and testing) sets and fixed the node names, matching each possibly garbled name with the best-matching suspected true name. Still, there were some mutations that I couldn't uniquely map to a real name. Some government and military nodes have very similar names, for example "DNIC-AS-01346 - Navy Network Information Center (NNIC)" and "DNIC-AS-01986 - Navy Network Information Center (NNIC)". A mutation that drops the first token in those, e.g. "Navy Network Information Center (NNIC)", cannot be safely associated with either. Here I could have used information from the other training graphs, but never had the time to get it done.
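Repairing the node names then reduces to matching each observed name against the suspected true names, and leaving the ambiguous ones (like the NNIC case) alone. A small sketch, reusing `is_mutation` from above:

```python
def repair_name(observed, true_names):
    """Map an observed (possibly garbled) name to a suspected true name.
    Returns None when the match is missing or ambiguous (like the NNIC case)."""
    if observed in true_names:
        return observed
    matches = [t for t in true_names if is_mutation(observed, t)]
    return matches[0] if len(matches) == 1 else None
```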

As an aside, I really disliked this artificial noise. I think it created an entry barrier that only the very bored or very motivated bothered to overcome.

Historical Features and Model

I started with a very naive set of features: given the last 10 network graphs, I created exponentially decaying averages of the weight of the path and of the viability of the path (some links along the path may be deleted or created). I used four values for the decay rate (0.25, 0.5, 0 - meaning ignoring all the graphs but the last one, and 1 - meaning a regular arithmetic average), fed them into logistic regression, and got really nice results. It was an early stage of the competition, and I think this simple approach put me in the top 10. Getting into the top 10 is really motivating - at least for me - once I'm there, I'll make any effort to stay in the top 10.
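A rough sketch of these history features (`path_weight` and `path_is_viable` are assumed helpers that evaluate a path against one training graph; the details are illustrative, not my exact code):

```python
DECAY_RATES = (0.0, 0.25, 0.5, 1.0)

def decaying_average(values, decay):
    """Exponentially decaying average of `values` (oldest first, newest last).
    decay=0 keeps only the newest value; decay=1 is a plain arithmetic mean."""
    ages = range(len(values) - 1, -1, -1)     # the newest value has age 0
    weights = [decay ** age for age in ages]  # note: 0 ** 0 == 1 in Python
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def history_features(path, graphs):
    """One history feature vector per path; `graphs` holds the last 10 training
    graphs, oldest first."""
    weights = [path_weight(g, path) for g in graphs]
    viable = [1.0 if path_is_viable(g, path) else 0.0 for g in graphs]
    return ([decaying_average(weights, d) for d in DECAY_RATES]
            + [decaying_average(viable, d) for d in DECAY_RATES])
```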

Adding three other simple and very similar features improved my score further:
  1. The length of the path (number of hops) - it makes sense that the longer the path, the more likely it is to break.
  2. The last valid weight of the path, if it's broken in the last training graph.
  3. Whether or not the path is broken in the last graph.
Next, I tried a random forest classifier, which wasn't any better than logistic regression on my test set. However, an average of the two was a big win, improving my score by 0.01 points.
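The blend itself is just a weighted average of the two models' predicted probabilities; a sketch with scikit-learn (the 50/50 weighting shown here is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def blended_predictions(X_train, y_train, X_test, lr_weight=0.5):
    """Probability that each test path is still optimal, as a weighted average
    of logistic regression and random forest predictions. The 50/50 split is
    illustrative; the weights are worth tuning on a held-out set."""
    lr = LogisticRegression().fit(X_train, y_train)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
    p_lr = lr.predict_proba(X_test)[:, 1]   # column 1 = the "still optimal" class
    p_rf = rf.predict_proba(X_test)[:, 1]
    return lr_weight * p_lr + (1 - lr_weight) * p_rf
```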

Link Probability Features

After playing a bit more with similar features (which I nicknamed "history-based features"), I tried a totally different approach to the problem. Instead of looking at the whole path, I started considering the nodes that composed it - specifically, how likely they were to keep their incoming and outgoing links. As features I now had:
  1. The geometric mean, minimum, and maximum of the probability of nodes along the path to sustain incoming links.
  2. The geometric mean, minimum, and maximum of the probability of nodes along the path to sustain outgoing links.
  3. The geometric mean, minimum, and maximum of the product of the incoming probability of node B and the outgoing probability of node A, for node pairs (A, B) along the path.
Adding these "probability" features to the historic features, and playing with the weights of the logistic regression model and the random forest model got me another nice increase in my total score.

Giant Component Features

Motivated by my success in attacking the problem from a new perspective, I started thinking about other ways to look at it. I tried using the number of alternative routes, but it was slow and didn't seem to contribute much to my test score. However, I soon discovered another nice property of the graph - there was a super-cluster of nodes such that from each node in the super-cluster one can reach any other node in it. Taking the intersection of this super-cluster across the training graphs gives us something akin to the "internet's core". As a matter of fact, within the world of this problem, you can come up with two cores: one of nodes that can reach each other for free (let's call it the Free Super Cluster), and a bigger one of nodes that can reach each other but may have to pay for that privilege (let's call it the Non-Free Super Cluster).
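A sketch of extracting the two cores with networkx, treating the super-cluster as the largest strongly connected component; the `cost` edge attribute used to identify free links is my assumption about how the link prices could be stored:

```python
import networkx as nx

def largest_scc(graph):
    """Node set of the largest strongly connected component."""
    return max(nx.strongly_connected_components(graph), key=len)

def core_nodes(graphs, free_only=False):
    """Intersection of the giant strongly connected components across the
    training graphs. `free_only` keeps only zero-cost links."""
    cores = []
    for g in graphs:
        if free_only:
            free_edges = [e for e in g.edges if g.edges[e].get("cost", 0) == 0]
            g = g.edge_subgraph(free_edges)
        cores.append(largest_scc(g))
    return set.intersection(*cores)
```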

 This "discovery" leads to a couple of more features per path:
  1. The proportion of nodes in the path that are in the Free Super Cluster
  2. The proportion of nodes in the path that are in the Non Free Super Cluster
  3. An exponential decaying average in the number of nodes that are part of the free super cluster of each training graph (decay rate 0.5, just for kicks)
  4. An exponential decaying average in the number of nodes that are part of the non-free super cluster of each training graph (decay rate 0.5)
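A small sketch of these four features, reusing `decaying_average` from the history features above (the argument names are mine):

```python
def super_cluster_features(path, free_core, nonfree_core, free_sizes, nonfree_sizes):
    """`free_core`/`nonfree_core` are the intersected super clusters computed above;
    `free_sizes`/`nonfree_sizes` are the per-training-graph cluster sizes, oldest first."""
    n = len(path)
    return [
        sum(node in free_core for node in path) / n,
        sum(node in nonfree_core for node in path) / n,
        decaying_average(free_sizes, 0.5),
        decaying_average(nonfree_sizes, 0.5),
    ]
```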
Adding these features to the previous ones led to another 0.006 increase in my score.

Oh Stupid Me

By this time, only a couple of days were left in the competition, and I was in a very comfortable fifth place. I was tired and, well, unmotivated to try out new ideas. The competition didn't involve any monetary prize, and I was happy with my ranking. But then, two things happened. First, my girlfriend offered to take me to a restaurant if I got to fourth place, and second, some snotty newcomer (who wasn't really snotty or a newcomer; I took some literary liberty here) robbed me of fifth place. I had to do something, but what?

It turns out I had a bug in the code that combines all the features. Fixing it not only improved my score by an amazing 0.01, but also got me to fourth place. By the way, if you spot a bug (either a spelling or a grammatical one) in this blog post, feel free to leave a comment. English is not my first language.

Epilogue

My girlfriend still hasn't taken me to a restaurant (that's OK, honey). Facebook didn't call to set up an interview (that's OK, Mark). However, I learned a bunch of things from this competition, and got to roughly the 130th place in Kaggle's overall ranking. All in all, I think it was worth it.

