BMI Students

Tuesday, May 30, 2006

Phages

Slate has an interesting article on using bacteriophages, instead of antibiotics, to attack infections. The article also speculates that, unfortunately, it will be hard to bring the technology to the US.

link

Saturday, May 20, 2006

Notes on classifiers

I have been testing a bunch of classifiers for a project I am doing. The objective is to classify an intergenic region as ACE1 (or any motif) or not-ACE1, based on features of the intergenic regions. I did this for a number of sets of features, and the results were very consistent. I have done enough tests that I feel comfortable relating some general conclusions....

Random forests and SVMs always came out on top, with random forests usually holding a slight lead. SVMs with a polynomial kernel did a bit worse. MaxEnt usually came fourth, and seemed to do better on discrete data (hence the NLP slant of this method). Finally, k-nearest neighbors always lost. Random forests were slower than SVMs, but apart from that I think they are preferable.

Random forests are just collections of voting decision trees, each trained on a bootstrap sample of the data and a random subset of the variables. Someone must have done the same thing for collections of voting SVMs. If I find it I'll add it to this post. It seems like it ought to win overall.
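
Here is a minimal sketch of that voting scheme, using only the Python standard library. It is not the code used for the tests above. The base learner is a one-split decision stump, chosen only to keep the example short; a full decision tree or an SVM could be swapped in by replacing train_stump. All function names, the toy data and the labels "motif" and "background" are made up for illustration.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature_ids):
    """Return a one-split classifier: the (feature, threshold) pair
    with the fewest training errors among the allowed features."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda r: left_label if r[f] <= t else right_label

def train_ensemble(rows, labels, n_members=25, n_features=1, seed=0):
    """Train each member on a bootstrap sample of the rows and a
    random subset of the features."""
    rng = random.Random(seed)
    all_features = range(len(rows[0]))
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        feats = rng.sample(all_features, n_features)    # random feature subset
        members.append(train_stump([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   feats))
    return members

def predict(members, row):
    """Majority vote over all members."""
    return Counter(m(row) for m in members).most_common(1)[0][0]

# Toy data (invented): two features, two well-separated classes.
rows = [(0, 0), (1, 1), (2, 0), (1, 2), (8, 9), (9, 8), (10, 10), (9, 9)]
labels = ["background"] * 4 + ["motif"] * 4

members = train_ensemble(rows, labels)
print(predict(members, (1, 1)), predict(members, (9, 9)))
```

The only random-forest-specific parts are the two sampling lines in train_ensemble; everything else is ordinary training and voting, which is why the same wrapper should work around any base classifier.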

Monday, May 15, 2006

Flowers that detect landmines

This kind of thing helps GM food's image no end...
link

Saturday, May 13, 2006

LaTeX

I use LaTeX a lot, mainly because Word on the Mac is so horrible, and I like not worrying about formatting while typing. This article explains some of the small benefits of doing so.

link

Tuesday, May 02, 2006

Mammalian promoters

Nature Genetics just published a milestone paper from the Fantom/RIKEN consortium compiling an enormous genome-wide collection of transcript start sites (TSSs) in humans and mice. The paper could be a treasure trove for bioinformaticians. They collected TSS tags from many different tissues and mapped them onto the genome. There are several different classes of promoters: some with very well defined TSSs, some with very broad distributions (transcription can start anywhere in a comparatively broad region), some with multiple well-defined sites and some with combinations of the above. The paper claims four classes. I don't know what kind of clustering they used -- but it would be interesting to know more about how distinct their classes are and if four is really the best estimate.

I think that in addition to the analyses they did in the paper, one can try a bunch of correlations quickly -- several possibilities for projects small and large. For example, do promoter classes correlate with alternatively spliced genes? Or are TSSs correlated with transcription units from tiled array experiments (Affy and others)? One can also do some gene ontology correlations, or expression analysis using these data. We know that transcription initiation, splicing and expression (and other things) are all intimately connected, so this might be leveraged in many different directions...
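
As a rough sketch of how quick the first of those checks could be: count genes by promoter class and by whether they are alternatively spliced, then compute a chi-square statistic on the resulting table. The function below is plain Python. The counts in the example are placeholders I made up, not numbers from the paper, and the row and column meanings are assumptions for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Placeholder counts (invented): one row per promoter class,
# columns = (alternatively spliced, not alternatively spliced).
counts = [
    [30, 70],
    [45, 55],
    [50, 50],
    [40, 60],
]
stat, dof = chi_square(counts)
print(stat, dof)
```

A large statistic relative to the chi-square distribution with that many degrees of freedom would suggest the classes and splicing status are not independent. It says nothing about why, so it is only a first filter for picking which correlations deserve a closer look.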

Monday, May 01, 2006

Machine learning videos

There are a bunch of machine learning webcast lectures here. Many of them are tutorials, and a few are biology-focused.

link