Once again, this is a blog post about a recent Kaggle competition. However, unlike similar posts, this one will be less useful for aspiring data scientists, as no secret techniques or deep insights are shared here. Rather, it will probably better serve future competition organizers, as it describes how I got to third place (out of ~630 challengers) without gaining any substantial understanding of the data at hand. In other words, this post describes how I used target leaks to reach a near-perfect score, trolling the leaderboard for months. Thankfully, the competition organizers were kind enough to allow solutions based on target leaks. As a matter of fact, avoiding the use of target leaks proved more difficult in this competition than finding them.
The goal of the "Accelerometer Biometric Competition" was to identify users given readings taken from their mobile phone's accelerometer, without requiring them to perform any specific pre-negotiated movement. Theoretically, such a feat would be possible if we all had micro-movements that uniquely define us. Using such an identification mechanism, doors and computer systems could be automatically unlocked once you approach them. The organizers took several precautions to mask the mobile device each user used, hampering any attempt to tell users apart by their phone model rather than by their movements (for example, a certain device may be more sensitive to movement than others). Ironically, these precautions were a major factor in the competition's downfall.
Chains
The train and test sets were prepared in the following way: each user's accelerometer recordings were split into two parts. The first part was given in the train set, after being shifted so that all the train sequences start on the same date (the time of day was left intact). The second part was shifted in a similar manner. It was then further split into batches of 300 samples each. Half of those batches were associated with the user who created them; the other half were randomly assigned to other users with similar devices, thus making model-specific features hard to exploit. Our goal was to guess which test sequences were correctly assigned to the user who generated them.
Take a moment and think about this method of data preparation.
Let's call the set of users that can be assigned to a given user's samples that user's group. The group includes the user who created the samples as well as users with similar devices.
Nothing in this scheme prevents you from chaining the test sequences back together. A chain of sequences has a high probability of belonging to a single user. Moreover, since each sequence in the chain has a 50% chance of being marked with the user who actually generated it, given a long enough chain, and assuming the user's group consists of more than two users, we can make a good guess about which user generated the chain.
For example, say we have a chain of six test sequences. Four of them are marked as belonging to user A, one to user B, and one to user C, and we know that A, B, and C form a group. Assuming that A generated the chain, the likelihood of such an assignment is (0.5^4)*(0.25^1)*(0.25^1), because in that case a sequence has a probability of 0.5 of being marked as belonging to A, and 0.25 of being marked as belonging to either B or C. On the other hand, the likelihood that B generated the chain is (0.25^4)*(0.5^1)*(0.25^1). Obviously, A is the more likely generator, and with some applications of Bayes' theorem, we can calculate the exact probability that A generated the chain.
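To make the Bayes step concrete, here is a minimal sketch of the posterior computation, assuming a uniform prior over group members and the 50%/even-split labelling model described above:

```python
def chain_posterior(label_counts, group):
    """Posterior probability that each user in `group` generated the chain,
    given how many sequences in the chain are labelled with each user.
    Assumes a uniform prior over group members: a sequence is labelled with
    its true generator with probability 0.5, and with each other group
    member with probability 0.5 / (k - 1)."""
    k = len(group)
    likelihoods = {}
    for candidate in group:
        p = 1.0
        for user in group:
            n = label_counts.get(user, 0)
            p *= (0.5 if user == candidate else 0.5 / (k - 1)) ** n
        likelihoods[candidate] = p
    total = sum(likelihoods.values())
    return {user: p / total for user, p in likelihoods.items()}

# The example from the text: six sequences, four labelled A, one B, one C.
print(chain_posterior({'A': 4, 'B': 1, 'C': 1}, ['A', 'B', 'C']))
# {'A': 0.8, 'B': 0.1, 'C': 0.1}
```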
Using only this leak, my solution gets 0.984 on the private leaderboard, or 13th place.
Chaining the test sequences together is the real trick. Basically, you can chain sequence X with sequence Y if Y starts soon after X ends. However, there can be many sequences that start soon after X ends, and likewise many sequences that end soon before Y begins. Other competitors tried looking for a Y similar in nature to X. I took another approach, consisting of four steps:
- First, chain together sequences in a rather greedy way (sketched in code after this list): given a sequence X, if there's only one sequence Y that follows it within a given time window, link them together. The time window spans from 0 to the 95th percentile of time-deltas between consecutive samples in X, and can be extended to the 99th percentile if no Y starts within the 95th. If two or more sequences start in the time window, or none do, stop building the current chain. Otherwise, try to link Y with a sequence that follows it.
- Since there can now be many sequences linked to a given Y (they all end soon before Y starts), break chains whenever there's a sequence that joins two separate chains together.
- The users appearing in any single chain probably belong to the same group. Calculate the groups according to the chains at hand.
- Repeat steps 1 and 2, but this time use the groups when linking sequences together: only link sequences that are assigned to users of the same group.
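As a rough illustration of the greedy linking rule in step 1, here is a sketch; the sequence representation and argument names are assumptions for illustration, not the original code:

```python
import numpy as np

def next_link(times, candidates):
    """Greedy linking rule from step 1. `times` holds the timestamps of
    sequence X's samples; `candidates` is a list of (start_time, seq_id)
    pairs for sequences not yet placed in a chain. Returns the id of the
    unique follower within the time window, or None to stop the chain."""
    deltas = np.diff(times)
    end = times[-1]
    for pct in (95, 99):
        window = np.percentile(deltas, pct)
        followers = [sid for start, sid in candidates
                     if end < start <= end + window]
        if len(followers) == 1:
            return followers[0]        # exactly one follower: link it
        if len(followers) > 1:
            return None                # two or more: ambiguous, stop here
    return None                        # none, even at the 99th percentile
```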
Chaining Train and Test Samples
One can also try to chain the train sequences with test sequences: simply look for test sequences that start immediately after a train sequence ends. Since only the sequences' dates were shifted, and the time of day was left untouched, this task was simple enough. Furthermore, you can give zero probability to test sequences that start before their assigned user's train sequence ends. About 5% of the test sequences can be "disqualified" in this manner.
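A sketch of this disqualification step; the data structures are assumptions for illustration:

```python
def zero_out_impossible(probs, test_start, train_end, assignment):
    """Set probability 0 for test sequences that start before their
    assigned user's train sequence ends (about 5% of the test set).
    `assignment` maps test sequence id -> assigned user id; the other
    arguments map ids to timestamps. Names are illustrative."""
    for seq_id, user in assignment.items():
        if test_start[seq_id] < train_end[user]:
            probs[seq_id] = 0.0
    return probs
```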
Next, if a test sequence T is assigned to user A, who belongs to group G, and every user in G other than A cannot be the generator of T (because their train sequences end after T begins), we can be certain that A generated T.
Taking this idea an "iterative" step further: if group G consists of k users, there are k test sequences that overlap, and we know for sure the generators of k-1 of those sequences, then we can also find the generator of the remaining sequence. That's because sequences belonging to the same user cannot overlap, so each of those k sequences must belong to a different user.
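The elimination logic of these two paragraphs might look like the following sketch; running it over all groups until no new assignments appear gives the iterative step (all names here are illustrative, not from the original code):

```python
def resolve_group(group, overlapping, known):
    """`overlapping` holds k mutually overlapping test sequences from a
    group of k users; since one user's sequences cannot overlap, each must
    have a distinct generator. If k - 1 generators are known, deduce the
    last one. `known` maps sequence id -> user id."""
    unknown = [s for s in overlapping if s not in known]
    if len(unknown) == 1:
        used = {known[s] for s in overlapping if s in known}
        (remaining,) = set(group) - used
        known[unknown[0]] = remaining
    return known
```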
I think this calls for an MCMC approach: if two sequences overlap, they cannot belong to the same user, and if one follows the other, there's a high probability that they do belong to the same user. One needs to assign probabilities for each sequence to belong to each user so as to maximize the likelihood of the overall assignment. Unfortunately, I wasn't smart enough to make it work.
Other Feature Leaks
My first stab at this challenge was to ignore the accelerometer readings altogether, and calculate the probability that a sequence belongs to a certain user by looking only at the sampling frequencies. Different devices have different sampling frequencies, even though the competition organizers tried to make them all sample every 200ms; this is probably due to the internal timer resolution of each device and other applications that may run concurrently on it. Just calculating the probability of each sequence being generated by its assigned user in this way (given its likelihood of being generated by the other users in the group) gets you as far as 58th place, with a 0.898 score. My implementation already combined this method with another leak first made public by FastML, where one can reach a 0.8 score by estimating the probability that sequence S was generated by user A from the number of samples A has in the train set.
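A histogram-based likelihood over the time-deltas is one simple way to realize this; the original implementation may well have differed:

```python
import numpy as np

def frequency_score(test_deltas, group_deltas, assigned_user, bins=50):
    """Probability that `assigned_user` generated a test sequence, based
    only on the distribution of time-deltas between consecutive samples.
    `group_deltas` maps each user in the group to that user's train-set
    deltas. The histogram likelihood is an assumption for illustration."""
    logliks = {}
    for user, deltas in group_deltas.items():
        hist, edges = np.histogram(deltas, bins=bins, density=True)
        idx = np.clip(np.digitize(test_deltas, edges) - 1, 0, len(hist) - 1)
        logliks[user] = np.sum(np.log(hist[idx] + 1e-12))
    # Normalize the likelihoods over the group (uniform prior).
    m = max(logliks.values())
    weights = {u: np.exp(ll - m) for u, ll in logliks.items()}
    return weights[assigned_user] / sum(weights.values())
```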
One can also use the time of day of the test sequence, as different users are active in different time zones. Next, looking at the actual accelerometer readings: different devices have different resolutions, ranges, and, generally speaking, possible values (this is the reason it is so difficult to avoid using a target leak). I used each of those parameters as a feature.
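Put together, the per-sequence feature extraction might look like this sketch (the feature choices here are mine, not necessarily the original set):

```python
import numpy as np

def device_features(times, x, y, z):
    """Features based on the leaks above: time of day, sampling-interval
    statistics, and device-specific value resolution and range. `times`
    is assumed to be in milliseconds; all choices are illustrative."""
    deltas = np.diff(times)
    values = np.unique(np.concatenate([x, y, z]))
    return {
        'hour_of_day': (times[0] // 3600000) % 24,
        'median_delta': np.median(deltas),
        'delta_p95': np.percentile(deltas, 95),
        # Smallest gap between distinct readings approximates resolution.
        'value_resolution': np.diff(values).min() if values.size > 1 else 0.0,
        'value_min': values.min(),
        'value_max': values.max(),
    }
```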
Combining Everything Together
I used a random forest classifier and a logistic regression classifier on the above features to get probability estimates for each test sequence, and then averaged the two estimates together, giving a stronger weight to the random forest.
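In sklearn terms, the blend might look like this; the 0.7/0.3 weights are a guess, since the exact split isn't given above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def blended_probabilities(X_train, y_train, X_test, rf_weight=0.7):
    """Average the probability estimates of a random forest and a
    logistic regression, weighting the forest more heavily."""
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    lr = LogisticRegression(max_iter=1000)
    rf.fit(X_train, y_train)
    lr.fit(X_train, y_train)
    p_rf = rf.predict_proba(X_test)[:, 1]
    p_lr = lr.predict_proba(X_test)[:, 1]
    return rf_weight * p_rf + (1 - rf_weight) * p_lr
```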
Creating a train set for those classifiers was a bit of a challenge, since sequences taken from the provided train set could be chained together more easily than those in the actual test set. That caused the classifiers to give more weight to the "chain probability" feature than warranted. I had to arbitrarily break up chains in my train set to more accurately simulate their lengths in the test set.
Epilogue
This competition was a lot of fun, especially during the time I held first place under the name "Target Leak Model". Yes, it turns out that I'm a troll. Seriously though, this competition stresses the need for some QA process at Kaggle, to make sure that such flaws are fixed before a competition is made public. Another option that I really liked was proposed by one of the competitors: giving special prizes to challengers who expose leaks early on, and fixing the train and test sets accordingly once such leaks are exposed. Either way, for Kaggle to be truly useful for competition organizers, the issue of target leaks (and this is far from the first time such an issue has surfaced) must be dealt with.
My code is available at https://github.com/rouli/kaggle_accelerometer/tree/master/code, and uses Drake to build the solution and intermediate steps. It assumes you have Python and sklearn installed. Since I had to switch computers twice during the competition (once during the final few days!), it doesn't faithfully recreate my solutions, but it is still good enough to win third place on both the public and private leaderboards.