February 22nd, 2010
this week i’m between jobs so i have (a little) more time than usual to hack.
i’ve got a list of pending things to do but can’t decide what to do next. here’s my list in (sort of) priority order…
- fix up my numerical underflow / overflow problems in my recent semi supervised classification project.
- work through the exercises from the first few chapters of introductory statistics with r and all of statistics. i’m particularly keen to write an intro stats blog post about statistical significance.
- do this mongodb tute i found; shouldn’t take too long.
- do a weka screencast. i did some little talks at work lately about weka and they seemed to be interesting enough to others that it might be worth doing a screencast on it.
- do some work on modelling of periodic functions. seemed like trending topics is an interesting area at the moment and this would be a good chance to learn some more about R. fourier series look like a potential solution. there is also some interesting stuff to do in this area around majority evaluation from a stream of data.
- finish my work on detecting resemblance with hadoop. something that’s been hanging over my head for about 2 years is the first piece of work i did that led me onto hadoop. i’ve had a long running project on resemblance that ended up with me writing a map/reduce framework in erlang (until i (re)discovered hadoop).
- revisit mahout, it’s looking a bit more polished nowadays.
- redo and finish my project on latent semantic analysis; need to include some comparison work with probabilistic latent semantic analysis and latent dirichlet allocation (which is close to winning the scariest-formulas-on-a-wikipedia-page award)
- finish my twitter classifier; haven’t worked on it since lists were introduced and i think they would be an interesting addition to the algorithm.
decisions, decisions….
Posted in Uncategorized | 2 Comments »
February 14th, 2010
experiment 13; a test of semi supervised naive bayes for text classification is complete.
semi supervised algorithms seem to work pretty well and i can see how they are a huge benefit for text classification where you can have an enormous corpus but not enough time to label it all…
Tags: e13, naive bayes, semi supervised
February 5th, 2010
after quite a bit of hacking the statistical synonyms idea doesn’t seem to give terribly interesting results so i’m going on to do something else.
for the record here’s what I did do though….
- generate 3grams from 800e3 tweets
- collect n-grams together that share the same first and last term; eg ‘the blue cat’, ‘the green cat’, ‘the red cat’
- for each set generate all the combos of the middle terms; eg ‘blue green’, ‘blue red’, ‘green red’
- count the occurrences of each pair
- draw a graph of the 150 top occurring pairs
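the counting steps above can be sketched in a few lines of ruby. this is a minimal sketch, not the code from the experiment: the tokenisation (lowercase, split on whitespace), the choice to count each pair once per shared first/last context, and the names `trigrams` and `middle_term_pairs` are all assumptions made for illustration.

```ruby
require 'set'

# all consecutive 3-grams from a piece of text
# (assumed tokenisation: lowercase, split on whitespace)
def trigrams(text)
  text.downcase.split.each_cons(3).to_a
end

# count how many (first, last) contexts each pair of middle terms shares
def middle_term_pairs(texts)
  # (first, last) -> set of distinct middle terms seen in that context
  contexts = Hash.new { |h, k| h[k] = Set.new }
  texts.each do |text|
    trigrams(text).each { |first, mid, last| contexts[[first, last]] << mid }
  end

  # every combination of two middle terms within a context counts once
  pair_counts = Hash.new(0)
  contexts.each_value do |mids|
    mids.to_a.sort.combination(2) { |pair| pair_counts[pair] += 1 }
  end
  pair_counts
end

tweets = ['the blue cat sat', 'the green cat sat', 'the red cat ran']
counts = middle_term_pairs(tweets)
# the most frequent pairs, ready to be drawn as a graph
top = counts.sort_by { |pair, n| [-n, pair] }.first(150)
top.each { |pair, n| puts "#{pair.join(' ')} #{n}" }
```

sorting the middle terms before pairing them means ‘blue green’ and ‘green blue’ are counted as the same pair.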
voila! click this image for a bigger version
some interesting results. few of the more complex things i was trying worked. they were mainly based on trying to incorporate the frequencies of terms but it seemed the simplest gave the best result (i think it’s because my assumptions about how to use the data were wrong).
here’s the code, feel free to read my notes, correct my terrible statistical assumptions and make a better image!
Tags: e12, fail
January 31st, 2010
here’s a great lecture from tom mitchell about document classification using a semi supervised version of naive bayes.
semi supervised algorithms only require some of the training examples to be labeled and are able to make use of any unlabelled ones, very common when we have a huge corpus.
i’ve started an experiment brewing to test this out by porting some previous naive bayes work i did to use this semi supervised scheme and will publish it when it’s done.
cool stuff!!
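to make the idea concrete, here is a rough ruby sketch of one common way to use unlabelled documents with naive bayes: train on the labelled docs, guess soft labels for the unlabelled ones, retrain on both, and repeat (an expectation-maximisation loop). it is an assumption that this matches the scheme in the lecture, and the names `train`, `posterior` and `semi_supervised_nb`, the laplace smoothing and the fixed iteration count are all illustrative choices.

```ruby
# multinomial naive bayes from weighted docs.
# each training item is [tokens, { label => weight }]
def train(weighted_docs, labels, vocab)
  prior = Hash.new(0.0)
  counts = Hash.new { |h, k| h[k] = Hash.new(0.0) }
  weighted_docs.each do |tokens, weights|
    weights.each do |label, w|
      prior[label] += w
      tokens.each { |t| counts[label][t] += w }
    end
  end
  total = prior.values.sum
  model = {}
  labels.each do |label|
    denom = counts[label].values.sum + vocab.size # laplace smoothing
    model[label] = {
      log_prior: Math.log((prior[label] + 1.0) / (total + labels.size)),
      log_lik: vocab.map { |t| [t, Math.log((counts[label][t] + 1.0) / denom)] }.to_h
    }
  end
  model
end

# P(label | tokens), worked out in log space to avoid underflow;
# tokens never seen in training are ignored
def posterior(model, tokens)
  logs = model.transform_values do |m|
    m[:log_prior] + tokens.sum { |t| m[:log_lik].fetch(t, 0.0) }
  end
  max = logs.values.max
  exps = logs.transform_values { |l| Math.exp(l - max) }
  z = exps.values.sum
  exps.transform_values { |e| e / z }
end

# labelled: [[tokens, label], ...]   unlabelled: [tokens, ...]
def semi_supervised_nb(labelled, unlabelled, iterations: 5)
  labels = labelled.map { |_, label| label }.uniq
  vocab = (labelled.map(&:first) + unlabelled).flatten.uniq
  hard = labelled.map { |tokens, label| [tokens, { label => 1.0 }] }
  model = train(hard, labels, vocab)
  iterations.times do
    # e step: soft labels for the unlabelled docs from the current model
    soft = unlabelled.map { |tokens| [tokens, posterior(model, tokens)] }
    # m step: retrain on labelled and softly labelled docs together
    model = train(hard + soft, labels, vocab)
  end
  model
end

labelled = [[%w[goal match team], :sport], [%w[vote election party], :politics]]
unlabelled = [%w[team striker goal], %w[striker league],
              %w[party minister vote], %w[minister policy]]
model = semi_supervised_nb(labelled, unlabelled)
puts posterior(model, %w[league]).inspect
```

no labelled doc mentions ‘league’, so anything the model knows about that word comes from the unlabelled docs it co-occurs in.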
Tags: machine learning, naive bayes, semi supervised
January 28th, 2010
i’ve been doing some reading for my statistical synonyms project and have uncovered a heap of cool papers. most of them are around an idea (from the 1950’s!) called the distributional hypothesis that simply states that words that appear in similar contexts often have similar meanings.
the coolest paper so far is ‘Web-Scale Distributional Similarity and Entity Set Expansion’ by Pantel, Crestan, Borkovsky et al., which has introduced me to an area of research i didn’t really know existed: entity set expansion.
entity set expansion is a bit like thesaurus building for proper nouns; given a seed set of related items can you expand the set to include other semantically similar items?
an example might be brands of japanese motorbikes. starting with ‘yamaha’ and ‘kawasaki’ we might expect the set to be expanded to include ‘honda’
i started hacking around in pig but today switched back to ruby for slightly quicker prototyping. who knows, i might give piglet a go!
the code is on github
Tags: e12, linguistics
January 23rd, 2010
i’ve had an idea brewing in my head for a while now, seeded by a great talk by peter norvig about statistical approaches to finding patterns in data.
one thing he alludes to is the generation of synonyms based on n-gram models.
the basic intuition is this; if a corpus contains occurrences of the phrases ‘a x b’ and ‘a y b’ then to some degree x and y are synonymous.
the question becomes how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, ‘a x b’, ‘a y b’, ‘a ? b’ in the corpus? what else can we take into account?
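as a starting point only, one candidate answer is to ignore frequencies entirely and score x and y by how much their sets of (a, b) contexts overlap (jaccard similarity). this ruby sketch is an assumption for illustration, not a worked out answer to the question; `contexts_by_term` and `context_similarity` are made-up names.

```ruby
require 'set'

# map each middle term x to the set of (a, b) contexts it appears in
def contexts_by_term(trigrams)
  contexts = Hash.new { |h, k| h[k] = Set.new }
  trigrams.each { |a, x, b| contexts[x] << [a, b] }
  contexts
end

# jaccard overlap of the two terms' context sets, from 0.0 to 1.0
def context_similarity(contexts, x, y)
  cx = contexts.fetch(x, Set.new)
  cy = contexts.fetch(y, Set.new)
  union = (cx | cy).size
  union.zero? ? 0.0 : (cx & cy).size.to_f / union
end

trigrams = [%w[the blue cat], %w[the green cat], %w[a blue car],
            %w[a green car], %w[the old cat]]
ctx = contexts_by_term(trigrams)
puts context_similarity(ctx, 'blue', 'green') # same two contexts
puts context_similarity(ctx, 'blue', 'old')   # one shared context out of two
```

a frequency aware version would need to weight each context by how often it occurs, which is exactly the open part of the question.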
Tags: e12, statistics
November 6th, 2009
just recently discovered xargs has a parallelise option!
i have 20 files, sample.01.gz to sample.20.gz, each ~100mb in size that i need to run a script over
one option is
zcat sample*gz | ./script.rb > output
but this will process the files sequentially on a single core.
to get some parallel action going i could generate a temp script that produces
zcat sample.01.gz | ./script.rb > sample.01.out &
zcat sample.02.gz | ./script.rb > sample.02.out &
...
zcat sample.20.gz | ./script.rb > sample.20.out &
and run that but this will have all 20 running at the same time and produce contention
(though with only 20 files this might not be a problem)
instead i can make a temp script, parse.sh
zcat "$1" | ./script.rb > "$1.out"
and run
find sample*gz | xargs -n1 -P4 sh parse.sh
cat *out > output
what is this xargs command doing?
- -n1 passes one arg at a time to the command being run (instead of the xargs default of passing all args)
- -P4 says have at most 4 commands running at the same time
100% on all cores (and only because the disk can keep up)
awesome!
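for comparison, the same "one arg per command, at most 4 at a time" behaviour can be sketched in ruby with a small pool of worker threads pulling from a queue. this is only an illustration of what -n1 -P4 is doing, not a replacement for xargs; `run_bounded` is a made-up name and the usage assumes the parse.sh and sample files from above.

```ruby
# run the block once per item with at most `limit` running at the same time.
# items must not be nil or false, since nil marks the end of the queue.
def run_bounded(items, limit)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # once drained, pop returns nil and the workers stop

  workers = Array.new(limit) do
    Thread.new do
      while (item = queue.pop)
        yield item
      end
    end
  end
  workers.each(&:join)
end

files = Dir.glob('sample.*.gz').sort
run_bounded(files, 4) do |file|
  # one file per command, like -n1
  system('sh', 'parse.sh', file)
end
```

each worker thread just waits on its child process, so the four commands really do run side by side.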
Tags: bash, unix