brutally short intro to weka

July 3rd, 2010

weka is a java based machine learning workbench that i’ve found useful to play with to help understand some standard machine learning algorithms. in this quick demo i show how to build a classifier for three simple datasets, two of which address the basics of text classification.

brutally short intro to weka from Mat Kelcey on Vimeo.

friend clustering by term usage

June 25th, 2010

recently signed up to the infochimps api and wanted to do something quick and dirty to get a feel for it.

so here’s a little experiment

  1. get the people i follow on twitter
  2. look up the words that “represent” them according to the infochimps word bag api
  3. build a similarity matrix based on the common use of those terms
  4. plot the connectivity for the top 30 or so pairings
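
steps 3 and 4 could be sketched roughly like this in python. it's only an illustration: jaccard overlap is an assumed similarity measure (the post only says "common use of those terms"), the names `jaccard` and `top_pairs` are made up, and the word bags are invented rather than pulled from the infochimps api.

```python
from itertools import combinations

def jaccard(a, b):
    """overlap between two term sets: |a & b| / |a | b| (an assumed measure)"""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_pairs(word_bags, n=30):
    """score every pair of users and return the n most similar pairs"""
    scored = [
        (jaccard(word_bags[u], word_bags[v]), u, v)
        for u, v in combinations(sorted(word_bags), 2)
    ]
    # highest similarity first, names as a tie break so the order is stable
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:n]

# made-up word bags, purely for illustration
word_bags = {
    "alice": {"hadoop", "hbase", "nosql"},
    "bob": {"hbase", "mongodb", "nosql"},
    "carol": {"travel", "guidebook"},
}
pairs = top_pairs(word_bags, n=2)
```

the top pairs are then the edges to draw; users who share no terms with anyone score zero and fall out of the plot.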

it’s basically partitioned into three groups…

  1. veztek (my boss john) and smcinnes (steve from the lonely planet community team) in the top right
  2. a big clump of nosqlness with mongodb – hbase – jpatanooga – kevinweil in the bottom left
  3. everyone else

an interesting enough result given the time taken; the code’s on github

country codes in world cup tweets – viz1

June 21st, 2010

#worldcup tweet viz1 from Mat Kelcey on Vimeo.

here’s a simple visualisation of the use of official country codes (eg #aus) in a week’s worth of tweets from the search stream for #worldcup.

playback rate is about 2 hours of tweets per second. orb size denotes relative frequency of that country code. edges denote that those two countries feature a lot in the same tweets. movement is based on gravity-like attraction along edges.

the quiet period at about 0:17 is a twitter outage :)

here’s the original processing applet version with a bit more discussion

moving average of a time series in R

June 15th, 2010

in this case a sliding window of 3 elements

> x = c(3,1,4,1,5,9,2,6,5,3,5,8)
> ra_x = filter(x, rep(1,3)/3)
> ra_x
Time Series:
Start = 1
End = 12
Frequency = 1
 [1]       NA 2.666667 2.000000 3.333333 5.000000 5.333333 5.666667 4.333333
 [9] 4.666667 4.333333 5.333333       NA
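
for comparison, here's the same centred window written out by hand in python. it's a sketch only: the function name is made up, it assumes an odd window width, and it uses None where R shows NA.

```python
def centred_moving_average(xs, width=3):
    """mean of each centred window; None where the window runs off either end"""
    half = width // 2  # assumes width is odd
    out = []
    for i in range(len(xs)):
        if i - half < 0 or i + half >= len(xs):
            out.append(None)
        else:
            window = xs[i - half : i + half + 1]
            out.append(sum(window) / width)
    return out

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
ra_x = centred_moving_average(x)
```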

#worldcup twitter analytics

June 14th, 2010

since the world cup started i’ve spent more time looking at twitter data about the games than the actual games themselves. what a sad data nerd i am!

anyways, here’s the first few days’ analysis based on the use of official country tags (eg #aus) in the search stream for #worldcup.

tomorrow i might look in more detail at one of the games, wondering how many variants of ‘goooooooal’ i’ll find :D

a quick study in tf/icf

June 9th, 2010

while doing some more research on trending algorithms i came across a cool little paper about term frequency normalisation for streaming data: TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams.

i’m finding streaming related algorithms quite interesting lately and think they are the way forward in terms of dealing with large amounts of constant data. it’s just not feasible to use algorithms that expect you to have all the data at any given time; it forces you to reprocess all the data you’ve ever seen as you get new examples. my thinking is the best solutions are the ones that build models of the data and fold in new examples in batches. anyways, i’m getting off topic already.

tf/icf as presented in the paper is a variant on the classic tf/idf for term weighting, but instead of requiring the weights in all docs to be recalculated as each new document comes along (as tf/idf strictly does) it just approximates based on what has been seen before.
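
to make the "only use what has been seen before" idea concrete, here's a rough python sketch. the class name, the log scaling and the +1 smoothing are all assumptions for illustration, not copied from the paper.

```python
import math
from collections import Counter

class StreamingTermWeights:
    """weights each new doc against stats from earlier docs only (hypothetical sketch)"""

    def __init__(self):
        self.docs_seen = 0
        self.doc_freq = Counter()  # term -> number of earlier docs containing it

    def weigh(self, tokens):
        tf = Counter(tokens)
        weights = {}
        for term, count in tf.items():
            # assumed formula: damped term frequency times a smoothed inverse frequency
            icf = math.log((self.docs_seen + 1) / (self.doc_freq[term] + 1))
            weights[term] = math.log(1 + count) * icf
        # fold the new doc in afterwards; earlier docs are never re-weighted
        self.docs_seen += 1
        self.doc_freq.update(tf.keys())
        return weights

model = StreamingTermWeights()
for doc in ["the cheese is old", "the cheese is blue", "the moon is cheese"]:
    model.weigh(doc.split())
w = model.weigh("the blue moon".split())
```

a term that appeared in every earlier doc (here 'the') gets a weight of zero, while rarer terms like 'blue' and 'moon' get a positive weight, and nothing already weighted has to be touched again.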

so how does it do? actually quite well, here’s my experimental breakdown

5 minute ggobi demo

June 4th, 2010

brutally short demo of ggobi from Mat Kelcey on Vimeo.

note: non embedded version has higher res at full screen

how many terms in a trend?

May 11th, 2010

i’ve been poking around with a simple trending algorithm over the last few weeks and have uncovered a problem that, like most interesting ones, i’m not sure how to solve. the question revolves around discovering multi-term trends.

a sensible place to start when looking for trends is to consider single terms, but what if we ended up with three equally trending terms ‘happy’, ‘new’ and ‘year’? it’s pretty obvious that the actual trend is ‘happy new year’ but what is the best way to express this as a single trend in an algorithmic sense?

one approach i’ve been playing with is to collect unigrams, bigrams and trigrams (1, 2 and 3 term ‘phrases’) and consider the cases where the terms overlap. basically if ‘happy new year’ is trending then, in some sense, we can ignore trends for ‘happy new’, ‘new year’, ‘happy’, ‘new’ and ‘year’. but does this result in too many false negatives? would we miss ‘happy’ as a trend if lots of people were chirpy about the change of year (as they usually are on new year’s eve)?

rather than ignoring them outright we could somehow reduce their weighting by removing the double counting.

eg if we had 3 trends; (free beer,11), (free,12) & (beer,25)
we can take 11 (from the 2gram) off both 1grams to give (free beer,11), (free,1) & (beer,14)
showing that ‘beer’, outside of the phrase ‘free beer’, is perhaps a trend in itself (as it should be)
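
the free beer arithmetic can be sketched in a few lines of python. the function name is made up and it only handles the simple case above (phrases against the single terms they contain); properly nesting trigrams, bigrams and unigrams is the harder bit.

```python
def remove_double_counting(counts):
    """subtract each multi-term phrase's count from the single terms it contains"""
    adjusted = dict(counts)
    for phrase, count in counts.items():
        terms = phrase.split()
        if len(terms) < 2:
            continue  # single terms have nothing to hand down
        for term in set(terms):
            if term in adjusted:
                # clamp at zero in case the counts are inconsistent
                adjusted[term] = max(0, adjusted[term] - count)
    return adjusted

trends = {"free beer": 11, "free": 12, "beer": 25}
adjusted = remove_double_counting(trends)
```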

this feels like it might work but would be non trivial (read: fun) to implement

another slightly different problem is around the handling of retweeting. my experiments have shown a huge amount of the ‘trends’ found are related to retweets, which is fine in itself, but it gives quite strange trends since the retweeted portion of the text is usually quite long.

for example; say lots of people are retweeting something and, as some people do, are adding various bits and pieces at the beginning and end; eg ‘RT @bob omg i just found a peanut’ or ‘omg i just found a peanut; via @bob lucky him!!’

if we’re considering bigrams (which i am in my current implementation) we end up with an odd selection of trends such as ‘just found’, ‘a peanut’, ‘omg i’, ‘found a’, ‘i just’ and in these cases it’d be great to be able to just stitch them together into the common retweeted element ‘omg i just found a peanut’.

we could ‘solve’ this problem by not just considering 1, 2 and 3 grams but considering _all_ possible n-grams for each tweet and employing the technique above of reducing the counts. it’d almost be feasible, since tweets are never that long, but feels uber clumsy and i’d hate to see the complexity of that algorithm ;)
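
to get a feel for the size of "all possible n-grams", here's a small python sketch (hypothetical function name, naive whitespace tokenising assumed). a tweet of n tokens has n(n+1)/2 contiguous n-grams, so the count grows quadratically with tweet length.

```python
def all_ngrams(tokens):
    """every contiguous run of tokens, from single terms up to the whole tweet"""
    n = len(tokens)
    return [
        " ".join(tokens[start:end])
        for start in range(n)
        for end in range(start + 1, n + 1)
    ]

tweet = "omg i just found a peanut".split()
grams = all_ngrams(tweet)  # 6 tokens, so 6 * 7 / 2 = 21 n-grams
```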

this seems more like a stitching problem of some kind; eg if we have 4 grams ‘omg i just found’, ‘i just found a’, ‘just found a peanut’ perhaps we can identify the non trivial overlap and stitch them together (?)
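
one way the stitching idea might look in python. it's a greedy sketch under a big assumption: the n-grams arrive already ordered left to right, and working out that order is probably the hard part. the function names and the `min_overlap` parameter are made up.

```python
def overlap(left, right):
    """length of the longest suffix of left that is also a prefix of right"""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0

def stitch(ngrams, min_overlap=2):
    """greedily chain ordered n-grams that overlap by at least min_overlap terms"""
    if not ngrams:
        return ""
    pieces = [g.split() for g in ngrams]
    merged = pieces[0]
    for piece in pieces[1:]:
        size = overlap(merged, piece)
        if size >= min_overlap:
            merged = merged + piece[size:]
        # pieces without a big enough overlap are simply skipped in this sketch
    return " ".join(merged)

stitched = stitch(["omg i just found", "i just found a", "just found a peanut"])
```

for the three 4 grams above this gives back 'omg i just found a peanut'.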

not sure, there are a number of things to try. was hoping that brain dumping some of this would help me see the light but nothing obvious jumps out :(

trending topics in tweets about cheese; part2

May 1st, 2010

prototyping in ruby was a great way to prove the concept but my main motivation for this project was to play some more with pig.

the main approach will be

  1. maintain a relation with one record per ngram we want to monitor for trending
  2. fold 1 hour’s worth of new data at a time into the model
  3. check the entries for the latest hour for any trends
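
the real thing is in pig, but the shape of the three steps can be sketched in python. the record layout (running count, sum and sum of squares per ngram) and the name `fold_hour` are assumptions for illustration, not what the pig script does; the mean + 3 std dev check is the one from part1, applied here against the hours seen before the new one.

```python
import math

def fold_hour(model, hour_counts, threshold=3.0):
    """fold one hour of ngram counts into the model; return ngrams trending in that hour"""
    trending = []
    for ngram, value in hour_counts.items():
        # one record per ngram: running count, sum and sum of squares
        rec = model.setdefault(ngram, {"n": 0, "total": 0.0, "total_sq": 0.0})
        if rec["n"] >= 2:  # need a little history before judging
            mean = rec["total"] / rec["n"]
            variance = max(0.0, rec["total_sq"] / rec["n"] - mean * mean)
            if value > mean + threshold * math.sqrt(variance):
                trending.append(ngram)
        # fold the new hour in; ngrams absent this hour are not counted as zero here
        rec["n"] += 1
        rec["total"] += value
        rec["total_sq"] += value * value
    return trending

model = {}
hours = [{"blue cheese": 10}, {"blue cheese": 12}, {"blue cheese": 11}, {"blue cheese": 40}]
results = [fold_hour(model, h) for h in hours]
```

only the jump to 40 in the last hour gets flagged, and the model never needs to revisit earlier hours.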

the full version is on github. read on for a line by line walkthrough


trending topics in tweets about cheese; part1

April 27th, 2010

trending topics

what does it mean for a topic to be ‘trending’? consider the following time series (430e3 tweets containing cheese collected over a one-month period, bucketed into hourly timeslots)

without a formal definition we can just look at this and say that the series was trending where there was a spike just before time 600. as a start then let’s just define a trend as a value that was greater than what was ‘expected’.

how can we calculate trending?

one really nice simple algorithm for detecting a trend is to say a value, v, is trending if v > mean + 3 * standard deviation of the data seen so far. (thanks @peteskomoroch for the suggestion, works a treat)
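
a minimal python sketch of that rule, assuming "data seen so far" means the values before the one being tested and using the population standard deviation. the function name, the `warmup` parameter and the sample numbers are made up; recomputing over the whole history each step is wasteful but keeps the sketch short.

```python
from statistics import mean, pstdev

def trending_points(series, k=3.0, warmup=2):
    """indices whose value exceeds mean + k * std dev of the values before them"""
    hits = []
    for i in range(warmup, len(series)):
        history = series[:i]
        if series[i] > mean(history) + k * pstdev(history):
            hits.append(i)
    return hits

# invented hourly counts with one obvious spike
hourly = [20, 22, 19, 21, 20, 22, 80, 22, 21]
hits = trending_points(hourly)
```

the spike at index 6 is the only point flagged; once it's part of the history it also drags the mean and std dev up, which makes the points straight after it harder to flag.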

let’s consider the same time series as before but this time with some overlaid data;
green – the mean
red – minimum trend value ( = mean + 3 * std dev )
blue – instances where the value > minimum trend value
