[ml] this week: fun with R?

Ben Weisburd ben.weisburd at gmail.com
Tue Apr 26 09:11:28 UTC 2011


Hi Brian and Mike, thanks for your suggestions!

If I understand correctly, both of you are suggesting rerunning
cross-validation and/or classification after each tweak to the feature set
or feature weights. I can see that this is the way to go in the general
case. I did this when selecting meta-parameters (SVM training parameter C,
kernel, etc.). However, I think it becomes prohibitively expensive for
feature selection (unless I can get away with searching a small subset of
the possible feature combinations and parameters). I was hoping there was
some way to measure the noisiness / separability for a particular feature
set more directly, especially in the simple case where you can assume that
the separation boundary isn't very different from a plane. For example, let's
say you take each feature individually, compute its value for your positive
and negative examples, and then draw the ROC curve for it (or just 2
box-plots). Then you adjust the feature or find another feature that
improves the ROC curve or increases the distance between the boxes. I
realize that this wouldn't work if your multi-dimensional boundary is
donut-shaped, but if the boundary is something like a plane, then increasing
the separability in each dimension separately should lead to better overall
results... right?
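To make this concrete, here's roughly the kind of per-feature check I have in
mind, as a quick R sketch. The simulated matrices just stand in for my real
positive/negative feature matrices, and the AUC is computed from ranks
(the Wilcoxon/Mann-Whitney identity) rather than a full ROC curve:

# Per-feature separability check: rank-based AUC plus boxplots,
# with simulated feature matrices standing in for the real ones.
set.seed(1)
pos <- matrix(rnorm(100 * 9, mean = 1), ncol = 9)   # 100 positive examples, 9 features
neg <- matrix(rnorm(1000 * 9, mean = 0), ncol = 9)  # 1000 negative examples

# AUC via the Wilcoxon/Mann-Whitney identity:
# P(feature value of a positive example > feature value of a negative example)
auc <- function(p, n) {
  r <- rank(c(p, n))
  (sum(r[seq_along(p)]) - length(p) * (length(p) + 1) / 2) / (length(p) * length(n))
}

for (j in 1:ncol(pos)) {
  cat(sprintf("feature %d: AUC = %.3f\n", j, auc(pos[, j], neg[, j])))
  boxplot(list(positive = pos[, j], negative = neg[, j]),
          main = paste("feature", j))
}

The idea would be to tweak a feature (or swap in a new one) and watch whether
its AUC / box separation improves, before paying for a full retrain.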

Also, Mike - when testing the importance of a feature by adding noise, why
is it better to add noise instead of just removing the feature and testing
performance without it?
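For reference, here is my understanding of the noise idea, sketched in R with
e1071's svm on the same kind of simulated data as above. I'm using a column
permutation as a stand-in for "adding noise" (it destroys the feature's signal
while keeping its marginal distribution), so tell me if that's not what you meant:

# Permutation-style "noise" test: shuffle one feature column in the held-out
# data and measure the drop in accuracy, without retraining the model.
# (Removing a feature instead would require refitting the SVM each time.)
library(e1071)

set.seed(2)
pos <- matrix(rnorm(100 * 9, mean = 1), ncol = 9)
neg <- matrix(rnorm(1000 * 9, mean = 0), ncol = 9)
x <- rbind(pos, neg)
y <- factor(c(rep("start", nrow(pos)), rep("other", nrow(neg))))
train <- sample(nrow(x), 0.7 * nrow(x))

model <- svm(x[train, ], y[train], kernel = "radial", cost = 1)
base_acc <- mean(predict(model, x[-train, ]) == y[-train])

for (j in 1:ncol(x)) {
  x_noisy <- x[-train, ]
  x_noisy[, j] <- sample(x_noisy[, j])   # destroy feature j's signal
  acc <- mean(predict(model, x_noisy) == y[-train])
  cat(sprintf("feature %d: accuracy drop = %.3f\n", j, base_acc - acc))
}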

Thanks,
-Ben

P.S. To be more specific about my application: I'm working with gene sequencing
data, and the problem comes down to peak detection, where peaks
appear simultaneously in several data channels. The data consists of
millions of short (~ 30 nucleotide) RNA fragments which are aligned to a
reference genome, resulting in a histogram of position vs. read counts. The
genome is on the order of 1/2 million bases long, so that the x-axis is 1/2
million discrete positions, and the y-axis is the number of reads that
mapped to a position. Because of the way the data was generated, there
should be a sharp (~3 nucleotide wide) peak in the histogram at positions
where genes begin. I have a training set of ~100 known gene starts and my
goal is to classify other positions in the genome as either gene starts or
not. Right now I'm training the SVM on the 100 positive examples, as well as
about 1000 negative examples, and then running it to classify the ~ 50,000
other positions in the genome where the histogram value is above a certain
minimum threshold.  The multiple channels come from the fact that there are
actually multiple data sets (i.e., histograms) generated under different
conditions. These provide different views of the same underlying biological
processes and should have peak-like shapes in the same positions. My feature
vectors are based on the values of these channels around the position to be
classified (for example, feature #1 = value of channel 1 at the position
divided by average channel 1 value within an upstream window,  feature #2 =
channel 1 value divided by channel 2 value at the position, etc.), for
a total of 9 features.
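In case it helps, here is roughly what the feature extraction looks like in R.
The simulated Poisson counts stand in for my real histograms, and the upstream
window size and minimum read-count threshold are placeholders, not my actual
values:

# Rough sketch of the feature extraction: simulated read-count histograms
# stand in for the real channels; window size and threshold are placeholders.
set.seed(3)
genome_len <- 500000
channel1 <- rpois(genome_len, lambda = 2)   # reads mapped per position
channel2 <- rpois(genome_len, lambda = 2)
upstream <- 50                              # upstream window size (placeholder)

features_at <- function(p) {
  win <- channel1[(p - upstream):(p - 1)]
  c(f1 = channel1[p] / mean(win + 1),       # +1 avoids division by zero
    f2 = channel1[p] / (channel2[p] + 1))
  # ... the remaining 7 features follow the same pattern
}

# candidate positions: above a minimum read-count threshold (placeholder value)
candidates <- which(channel1 >= 5 & seq_along(channel1) > upstream)
feats <- t(sapply(candidates, features_at))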


