Hello world,

Welcome to qzds. I will spare you my pretty boring life story and get straight to the heart of things.

I have been doing some data sciencey things, and this blog is mainly a place for me to write them up, both so I can practice writing up analyses and so I can share what I have learned from working with data thus far. Neither post frequency nor quality is guaranteed; read on at your own risk.

In this post, we’re exploring a small set of tweet data.

Last week, the company I work for (shameless plug: AppNexus, we’re hiring for all sorts of cool roles!) hosted its annual London Summit, our largest client event to date. Now, AppNexus is kind of a big deal in the programmatic space (brushes jacket shoulder) and, as we all know, the twittersphere is dominated by what’s hot in programmatic advertising technology, so I thought it might be interesting to see how much of a splash the event made.

To do this, I set up a Python script that connects to the Twitter API and hooks into the Twitter streaming feed (using the Python tweepy library, which made things very convenient). The script sets a filter to capture any tweets containing the word appnexus or appnexussummit (you can tell we work in advertising because of the creativity of our hashtags), and then writes these tweets into a MySQL database I’d set up for some logging.

Getting the Data

For the sake of not making this first blog post too painful to write, I’m going to gloss over the line-by-line specifics of how to do this, but in short, here is what you’ll need to do to get this working, should you want to:

  1. Sign up for Twitter (true story, I didn’t have a working twitter account before last week).
  2. Create a new application under their developers section to get the API access keys.
  3. Create a MySQL database and a table to log the tweets into. You can use the code below to duplicate my table:
-- Log table for captured tweets; IDs are stored as strings.
CREATE TABLE `tweet_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `tweet_id` varchar(255) DEFAULT NULL,
  `user_id` varchar(255) DEFAULT NULL,
  `loc` varchar(255) DEFAULT NULL,
  `coords` varchar(255) DEFAULT NULL,
  `text` varchar(255) DEFAULT NULL,
  -- Filled in automatically when the row is inserted.
  `created_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
-- utf8mb4 so that emoji and other 4-byte characters survive.
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
  4. Set up a conf.json file with your API keys and database credentials.
  5. Clone the script down from my repo, replace the keywords and so on with whatever you’re interested in, and run it!
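
To make the steps above a little more concrete, here is a minimal sketch of the glue between the config file and the table. It is not the script from my repo: the conf.json key names, the helper names load_conf and tweet_to_row, and the exact mapping of tweet fields to columns are all assumptions for illustration. The tweet fields used (id_str, user, coordinates, text) follow the tweet JSON that the Twitter API returned at the time.

```python
import json

# Hypothetical conf.json contents; the key names here are assumptions,
# not necessarily the layout the real script expects.
EXAMPLE_CONF = """
{
  "twitter": {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
  },
  "mysql": {
    "host": "localhost",
    "user": "YOUR_DB_USER",
    "password": "YOUR_DB_PASSWORD",
    "database": "YOUR_DB_NAME"
  },
  "keywords": ["appnexus", "appnexussummit"]
}
"""

# Parameterised insert matching the tweet_log table; %s is the placeholder
# style used by the common Python MySQL drivers.
INSERT_SQL = (
    "INSERT INTO tweet_log (tweet_id, user_id, loc, coords, text) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def load_conf(raw):
    """Parse the config text and check that the expected sections exist."""
    conf = json.loads(raw)
    for section in ("twitter", "mysql", "keywords"):
        if section not in conf:
            raise KeyError(f"conf is missing the '{section}' section")
    return conf


def tweet_to_row(tweet):
    """Flatten one tweet payload (a dict) into the tuple INSERT_SQL expects."""
    user = tweet.get("user") or {}
    coords = tweet.get("coordinates")
    return (
        tweet.get("id_str"),
        user.get("id_str"),
        user.get("location"),
        # Coordinates arrive as a nested object, so store them as JSON text.
        json.dumps(coords) if coords else None,
        tweet.get("text"),
    )
```

With a database cursor from whichever MySQL driver you use, each incoming tweet then becomes a single cursor.execute(INSERT_SQL, tweet_to_row(tweet)) call, and created_on is filled in by the database.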

Visualising the Data

In total over the 6.5 days of data collection we ended up with over 1500 tweets. It’s hard to say whether I found this unexpectedly low, or unexpectedly high…let’s just say I had no expectations.

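The interesting thing though is that the Twitter streaming API is often described as giving only a 1% sample of submitted tweets, in which case the true number of tweets would have been over 150,000. Now, that would have been pretty cool. That 1% limit reportedly only kicks in for a keyword filter when the matching tweets exceed about 1% of all tweets, though, so the real total is probably much closer to 1,500.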

Using a Jupyter Notebook and Bokeh, I plotted the following visualisation of Tweets per hour over time, as well as showing in the tooltip the top 3 most popular words we saw per hour.

You can find the full notebook in the GitHub repository.
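
For a rough idea of the aggregation behind that chart, here is a minimal sketch using only the Python standard library. It is not the notebook code: the function name hourly_summary, the tokenising regex and the tiny stop-word list are my assumptions, and the real notebook may well clean the text differently. It takes (created_on, text) pairs, as you would get from selecting those two columns from tweet_log, and returns the tweet count and the most common words for each hour.

```python
from collections import Counter, defaultdict
from datetime import datetime
import re

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z0-9']+")
# A deliberately tiny, assumed stop-word list; extend it to taste.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "rt", "at", "is", "for", "on"}


def hourly_summary(rows, top_n=3):
    """Group (created_on, text) pairs by hour.

    Returns {"YYYY-MM-DD HH": {"tweets": count, "top_words": [...]}},
    with the hours in chronological order.
    """
    counts = Counter()
    words = defaultdict(Counter)
    for created_on, text in rows:
        hour = created_on.strftime("%Y-%m-%d %H")
        counts[hour] += 1
        # Drop links first so they do not show up as "https", "t", "co".
        cleaned = URL_RE.sub(" ", text.lower())
        tokens = WORD_RE.findall(cleaned)
        words[hour].update(t for t in tokens if t not in STOP_WORDS)
    return {
        hour: {
            "tweets": counts[hour],
            "top_words": [w for w, _ in words[hour].most_common(top_n)],
        }
        for hour in sorted(counts)
    }
```

The per-hour counts give the y-values of the line chart, and the top_words lists are what would go into the tooltip.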


You can see that there is one major peak and two minor peaks.

The first minor peak occurs on the 3rd (coincidentally, my birthday!), when AppNexus announced the deal with Axel Springer, followed by the considerably larger peak during the summit itself, where the assembled gathering sent a dizzying 156 tweets per hour (perhaps sampled down from 15,600, though that seems somewhat unlikely). Was nobody listening to the talks?

The peak tweet activity coincided, unsurprisingly, with the fireside chat between Jeff Jarvis and Alexi Mostrous, which I think captured two sides of an extremely interesting and relevant topic to the media world today. I personally would have loved to see that panel go longer.

It is also worth noting that the various trending topics per hour tended to mirror the panel topics pretty closely (so maybe people were listening after all).

The day after, on the 5th, we see some follow-on activity, mainly echoing the news from the previous two days.

My favourite hour was 2015-05-07 03, where there were 2 tweets, and the top 3 words were: signs, favorite and delightful.

I wonder if that was perhaps one of our execs, relaxing with some wine and shooting the Twitter breeze after a week’s work well done. :)


For full code and notebooks, see the repository here.

-qz