mongodb + twitter + yahoo term extractor = fun!

March 7th, 2010

ran a little experiment in using yahoo term extraction yesterday and it worked well enough. here’s some code to pass some text to yahoo and get back an array of terms

i’ve got to say mongodb is such an easy tool for working with json data. these 20 odd lines insert a text json tweet stream into mongo. so simple, why can’t all code be this easy…

what to do with a week off?

February 22nd, 2010

this week i’m between jobs so i have (a little) more time than usual to hack.

i’ve got a list of pending things to do but can’t decide what to do next, here’s my list in (sort of) priority order…

  • fix up my numerical underflow / overflow problems in my recent semi supervised classification project.
  • work through the exercises from the first few chapters of introductory statistics with r and all of statistics. i’m particularly keen to write an intro stats blog post about statistical significance.
  • do this mongodb tute i found; shouldn’t take too long.
  • do a weka screencast. i did some little talks at work lately about weka and they seemed to be interesting enough to others that it might be worth doing a screencast on it.
  • do some work on modelling of periodic functions. seemed like trending topics is an interesting area at the moment and this would be a good chance to learn some more about R. fourier series look like a potential solution. there is also some interesting stuff to do in this area around majority evaluation from a stream of data.
  • finish my work on detecting resemblance with hadoop. something that’s been hanging over my head for about 2 years is the first piece of work i did that led me onto hadoop. i’ve had a long running project on resemblance that ended up with me writing a map/reduce framework in erlang (until i (re)discovered hadoop).
  • revisit mahout, it’s looking a bit more polished nowadays.
  • redo and finish my project on latent semantic analysis; need to include some comparison work with probabilistic latent semantic analysis and latent dirichlet allocation (which is close to winning the scariest-formulas-on-a-wikipedia-page award)
  • finish my twitter classifier; haven’t worked on it since lists were introduced and i think they would be an interesting addition to the algorithm.

decisions, decisions….

semi supervised naive bayes for text classification

February 14th, 2010

experiment 13; a test of semi supervised naive bayes for text classification is complete.

semi supervised algorithms seem to work pretty well and i can see how they are a huge benefit for text classification where you can have an enormous corpus but not enough time to label it all…

e12.3 stat syns FAIL!

February 5th, 2010

after quite a bit of hacking the statistical synonyms idea doesn’t seem to give terribly interesting results so i’m going on to do something else.

for the record here’s what i did do though…

  1. generate 3grams from 800e3 tweets
  2. collect n-grams together that share the same first and last term; eg ‘the blue cat’, ‘the green cat’, ‘the red cat’
  3. for each set generate all the combos of the middle terms; eg ‘blue green’, ‘blue red’, ‘green red’
  4. count the occurrences of each pair
  5. draw a graph of the 150 top occurring pairs
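
steps 1 to 4 can be sketched in a few lines of ruby. this is not the original code, just a minimal illustration on a toy corpus; the whitespace tokenisation, the helper names (`trigrams`, `middle_term_pairs`) and the choice to count each pair once per shared context are all assumptions.

```ruby
# step 1: split a text into 3grams (naive whitespace tokenisation, an assumption)
def trigrams(text)
  text.downcase.split.each_cons(3).to_a
end

def middle_term_pairs(texts)
  # step 2: collect middle terms of 3grams sharing the same first and last term
  middles = Hash.new { |h, k| h[k] = [] }
  texts.each do |text|
    trigrams(text).each do |first, middle, last|
      middles[[first, last]] << middle
    end
  end
  # steps 3 and 4: generate every pair of distinct middle terms and count it
  # (each pair is counted once per shared context, which is an assumption)
  counts = Hash.new(0)
  middles.each_value do |terms|
    terms.uniq.sort.combination(2) { |pair| counts[pair] += 1 }
  end
  counts
end

tweets = [
  'the blue cat sat',
  'the green cat ran',
  'the red cat slept',
  'a blue car stopped',
  'a red car stopped'
]
pair_counts = middle_term_pairs(tweets)

# keep the 150 most frequent pairs, ready to feed to a graphing tool (step 5)
top = pair_counts.sort_by { |pair, count| [-count, pair] }.first(150)
top.each { |(x, y), count| puts "#{count}\t#{x} #{y}" }
```

on this toy data ‘blue red’ is the strongest pair since it turns up in two contexts (‘the _ cat’ and ‘a _ car’).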

[image: graph.840k.150]

voila!

some interesting results. few of the more complex things i tried worked. they were mainly based on trying to incorporate the frequencies of terms but it seemed the simplest approach gave the best result (i think it’s because my assumptions about how to use the data were wrong).

here’s the code, feel free to read my notes, correct my incorrect terrible statistical assumptions and make a better image!

an intro to semi supervised document classification

January 31st, 2010

here’s a great lecture from tom mitchell about document classification using a semi supervised version of naive bayes.

semi supervised algorithms only require some of the training examples to be labeled and are able to make use of any unlabelled ones, very common when we have a huge corpus.
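
to make the idea concrete here’s a rough sketch of one common way to do it: naive bayes trained with expectation maximisation, where the unlabelled docs get soft labels from the current model and are then fed back into training. it’s a toy under stated assumptions, not the code from the lecture or from my experiment; the tokenisation, add-one smoothing, fixed iteration count and all function names are made up for illustration.

```ruby
def tokens(text)
  text.downcase.split
end

# M step: fit priors and word likelihoods from weighted docs.
# weighted_docs is a list of [tokens, { label => weight }].
def fit(weighted_docs, labels, vocab)
  prior = Hash.new(0.0)
  word_count = Hash.new { |h, k| h[k] = Hash.new(0.0) }
  weighted_docs.each do |toks, weights|
    weights.each do |label, w|
      prior[label] += w
      toks.each { |t| word_count[label][t] += w }
    end
  end
  total = prior.values.sum
  model = {}
  labels.each do |label|
    denom = word_count[label].values.sum + vocab.size # add-one smoothing
    log_likelihood = {}
    vocab.each { |t| log_likelihood[t] = Math.log((word_count[label][t] + 1.0) / denom) }
    model[label] = {
      log_prior: Math.log((prior[label] + 1.0) / (total + labels.size)),
      log_likelihood: log_likelihood
    }
  end
  model
end

# E step: P(label | doc), worked out in log space to dodge underflow.
# tokens outside the vocab are ignored.
def posterior(model, toks)
  scores = model.transform_values do |m|
    m[:log_prior] + toks.sum { |t| m[:log_likelihood].fetch(t, 0.0) }
  end
  max = scores.values.max
  exp = scores.transform_values { |s| Math.exp(s - max) }
  z = exp.values.sum
  exp.transform_values { |e| e / z }
end

def train(labelled, unlabelled, iterations = 10)
  labels = labelled.map { |_, label| label }.uniq
  lab = labelled.map { |text, label| [tokens(text), { label => 1.0 }] }
  unl = unlabelled.map { |text| tokens(text) }
  vocab = (lab.map(&:first) + unl).flatten.uniq
  model = fit(lab, labels, vocab) # start from the labelled docs only
  iterations.times do
    soft = unl.map { |toks| [toks, posterior(model, toks)] }
    model = fit(lab + soft, labels, vocab)
  end
  model
end

labelled = [['cheese brie cheddar', :food], ['hadoop pig mapreduce', :tech]]
unlabelled = ['brie camembert', 'camembert gouda', 'pig hive', 'hive hadoop cluster']
model = train(labelled, unlabelled)
p posterior(model, tokens('gouda camembert'))
```

neither ‘gouda’ nor ‘camembert’ appears in a labelled doc; the model can only lean towards :food for them because of how they co-occur with ‘brie’ in the unlabelled docs.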

i’ve started an experiment brewing to test this out by porting some previous naive bayes work i did to use this semi supervised scheme and will publish it when it’s done.

cool stuff!!

e12.2 entity set expansion

January 28th, 2010

i’ve been doing some reading for my statistical synonyms project and have uncovered a heap of cool papers. most of them are around an idea (from the 1950’s!) called the distributional hypothesis that simply states that words that appear in similar contexts often have similar meanings.

the coolest paper so far is ‘Web-Scale Distributional Similarity and Entity Set Expansion’ by Pantel, Crestan, Borkovsky et al which has introduced me to an area of research i didn’t really know existed; entity set expansion.

entity set expansion is a bit like thesaurus building for proper nouns; given a seed set of related items can you expand the set to include other semantically similar items?

an example might be brands of japanese motorbikes. starting with ‘yamaha’ and ‘kawasaki’ we might expect the set to be expanded to include ‘honda’
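
here’s a toy sketch of that example, leaning on the distributional hypothesis: candidates are ranked by how many (previous word, next word) contexts they share with the seeds. it’s nothing like the web scale approach in the paper; the corpus, the context definition and the `contexts_by_term` / `expand` helpers are all invented for illustration.

```ruby
require 'set'

# map each term to the set of (previous word, next word) contexts it appears in
def contexts_by_term(sentences)
  contexts = Hash.new { |h, k| h[k] = Set.new }
  sentences.each do |s|
    s.downcase.split.each_cons(3) { |prev, term, nxt| contexts[term] << [prev, nxt] }
  end
  contexts
end

# rank non seed terms by the number of distinct contexts shared with the seeds
def expand(seeds, contexts, top_n = 3)
  seed_contexts = seeds.map { |s| contexts.fetch(s, Set.new) }.reduce(Set.new, :|)
  scores = {}
  contexts.each do |term, ctxs|
    next if seeds.include?(term)
    shared = (ctxs & seed_contexts).size
    scores[term] = shared if shared > 0
  end
  scores.sort_by { |term, shared| [-shared, term] }.first(top_n).map(&:first)
end

corpus = [
  'i ride a yamaha motorbike daily',
  'i ride a kawasaki motorbike daily',
  'i ride a honda motorbike daily',
  'she sold her kawasaki dirt bike',
  'she sold her honda dirt bike',
  'i ride a brown horse daily'
]
contexts = contexts_by_term(corpus)
expanded = expand(%w[yamaha kawasaki], contexts)
puts expanded.inspect # => ["honda"]
```

‘brown’ sits after ‘a’ just like the bike brands do but is never followed by ‘motorbike’, so it shares no full context with the seeds and is left out.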

i started hacking around in pig but today switched back to ruby for slightly quicker prototyping. who knows, i might give piglet a go!

the code is on github

e12.1 statistical synonyms

January 23rd, 2010

i’ve had an idea brewing in my head for a while now, seeded by a great talk by peter norvig about statistical approaches to finding patterns in data.

one thing he alludes to is the generation of synonyms based on n-gram models.

the basic intuition is this; if a corpus contains occurrences of the phrases ‘a x b’ and ‘a y b’ then to some degree x and y are synonymous.

the question becomes how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, ‘a x b’, ‘a y b’, ‘a ? b’ in the corpus? what else can we take into account?

a pig screencast

January 17th, 2010

pig demo from Mat Kelcey on Vimeo.

based on a talk i gave at work recently

tweets about cheese

November 15th, 2009

people tweet about all sorts of stuff.

sometimes it’s really important ground breaking world changing stuff…
but most of the time it’s ridiculous waste of time stuff like ‘i ate some cheese’

in fact how much do people actually tweet about cheese?
and when they do, what are the most important cheese related topics?

let’s gather some data…


xargs parallel execution

November 6th, 2009

just recently discovered xargs has a parallelise option!

i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script over

one option is

zcat sample*gz | ./script.rb > output

but this will process the files sequentially on a single core.

to get some parallel action going i could generate a temp script that produces

zcat sample.01.gz | ./script.rb > sample.01.out &
zcat sample.02.gz | ./script.rb > sample.02.out &
...
zcat sample.20.gz | ./script.rb > sample.20.out &

and run that but this will have all 20 running at the same time and produce contention

(though with only 20 files this might not be a problem)

instead i can make a temp script, parse.sh

zcat "$1" | ./script.rb > "$1.out"

and run

find sample*gz | xargs -n1 -P4 sh parse.sh
cat *out > output

what is this xargs command doing?

  • -n1 passes one arg at a time to the command being run (instead of the xargs default of passing as many args as will fit)
  • -P4 says have at most 4 commands running at the same time
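
if it helps to see what those two flags amount to, here’s a rough ruby equivalent: a shared queue of args and a fixed pool of workers that each take one arg at a time. it’s only a sketch of the idea, not how xargs is implemented; `run_in_parallel` is a made up helper.

```ruby
# hand each arg to the block one at a time (like -n1),
# with at most max_procs blocks running at once (like -P4)
def run_in_parallel(args, max_procs)
  queue = Queue.new
  args.each { |a| queue << a }
  workers = Array.new(max_procs) do
    Thread.new do
      loop do
        arg = begin
          queue.pop(true) # non blocking pop, raises ThreadError when empty
        rescue ThreadError
          break
        end
        yield arg
      end
    end
  end
  workers.each(&:join)
end

# each worker thread waits on its own child process, so up to 4 run together
run_in_parallel(Dir.glob('sample*gz').sort, 4) do |file|
  system('sh', 'parse.sh', file)
end
```

the xargs one liner is still the better tool for the job; this just shows there’s no magic in it.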

100% on all cores (and only because the disk can keep up)

awesome!