Slices

Analyst and aspiring Kaggle competitor


Installing R packages onto your EC2 RStudio instance

Once you’ve got your EC2 instance running with RStudio, you will probably want to install some of your favourite packages. I use ggplot2 and plyr a lot, but installing them isn’t as simple as on your local PC.
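For reference, the installation itself boils down to a couple of lines of R once you are connected. The sketch below is only an illustration; it assumes you want the packages in the system-wide library rather than a personal one, so the R session is started with root privileges on the instance.

    # Sketch only: run inside an R session on the EC2 instance, started with
    # root privileges (e.g. 'sudo R' after connecting over SSH), so the
    # packages land in the site library that RStudio Server can see.
    install.packages(c("ggplot2", "plyr"), repos = "http://cran.r-project.org")

    # Quick check that they load before heading back to RStudio in the browser
    library(ggplot2)
    library(plyr)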

First you need to connect to your instance. If you use Windows, you’ll need to download PuTTY. Amazon provide a walk-through of how to do this, and I’ve replicated the key steps here.

You will need to convert the key-pair file you created when you first set up your EC2 security group into a format that PuTTY can recognise. Use PuTTYgen for this: click ‘Load’ to browse for your key-pair file. You will need to select ‘All file types’, as the Amazon key-pair is a .pem file. Once you’ve found it, click ‘Save private key’. You should now see a .ppk file in the same folder as your .pem key-pair.

Now start the PuTTY.exe program.

In the Hostname box write ‘ubuntu@’ and your...

Continue reading →


Using Amazon Web Services and RStudio for Kaggle

Not all of us can afford (or fit) a supercomputer in our bedroom. Luckily, Amazon offer the next best thing.

Many Kaggle competitions use datasets large enough that you’re likely to need some specialist equipment and a big budget. Amazon Web Services changes this by allowing you to rent computing power and storage space at reasonably low prices.

This is all very well, but with only a vague understanding of cloud computing and having never used Linux, I had no idea how to actually get it working.

First, sign up for an AWS account. You’ll need to put some card details in, but they won’t charge you unless you start using some computing power or running an ‘instance’.

I do pretty much all my analysis in R using RStudio, both of which are free. It turns out that the open source community, and Louis Aslett in particular, have done a lot of the hard work in getting RStudio to work on an EC2...

Continue reading →


Getting in shape for Kaggle

On the face of it, Kaggle is pretty simple. Companies offer up data sets and ask you to use them to build useful models to predict stuff. Whoever provides the best predictions wins. While this may not be everyone’s idea of fun, with a background in statistics and a large part of my actual job devoted to forecasting and model building, I should have an advantage. The reality is somewhat different.

Like many new Kaggle recruits, I rushed to the most recent competition page. In my case, the aim was to predict the onset of epileptic seizures from EEG data. This was fantastic: an opportunity to help real, actual people right from my laptop. But there were two problems:

  1. Each patient data file was roughly 8 GB. It took forever to download and crashed any program I tried to open it in (see the sketch after this list for one way of coping).
  2. I had no idea what an EEG was or how you would spot a seizure on one, even if I...
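For the first problem, one possible workaround (not from the post, and assuming the files can be treated as comma-separated text, which will not hold for every competition’s format) is to read a large file in chunks rather than loading it whole. A rough sketch in R:

    # Hypothetical sketch: process a large comma-separated file in chunks so it
    # never has to fit in memory all at once. Path and chunk size are examples.
    process_in_chunks <- function(path, chunk_rows = 100000) {
      con <- file(path, open = "r")
      on.exit(close(con))
      col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # header row
      repeat {
        chunk <- tryCatch(
          read.csv(con, nrows = chunk_rows, header = FALSE, col.names = col_names),
          error = function(e) NULL  # read.csv errors once the file is exhausted
        )
        if (is.null(chunk)) break
        # ... summarise or filter each chunk here, keeping only what you need ...
        if (nrow(chunk) < chunk_rows) break
      }
    }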

Continue reading →