Getting in shape for Kaggle

On the face of it, Kaggle is pretty simple. Companies offer up data sets and ask you to use them to build useful models to predict stuff. Whoever provides the best predictions wins. While this may not be everyone’s idea of fun, with a background in Statistics and a large part of my actual job devoted to forecasting and model building, I should have an advantage. The reality is somewhat different.

Like many new Kaggle recruits, I rushed to the most recent competition page. In my case, the aim was to predict the onset of epileptic seizures from EEG data. This was fantastic: an opportunity to help real, actual people right from my laptop. But first, there were two problems.

  1. Each patient data file was roughly 8 GB. The files took forever to download and crashed any program I tried to open them in
  2. I had no idea what an EEG was or how you would spot a seizure on one, even if I could look at the data

What to do?

At this point I could either give up and go back to watching Game of Thrones, or accept that I had quite a bit to learn before I could expect to do well in a competition.

On this site I plan to record my successes and failures on Kaggle and share any interesting data and analysis along the way. Hopefully others who are thinking about competing on Kaggle will find it useful.

 
