[ml] this week: fun with R?

Brian Morris cymraegish at gmail.com
Mon May 2 01:26:18 UTC 2011


Per what you and I discussed after the meeting (using frequency information
rather than just relative amplitudes):

The first derivative should tell you how steep the peak is. The second should
tell you how fast it is getting steeper around the sides of the peak.

With this data I think it's best to use half (one-sided) differences, i.e.

f'(k) ~ f(k) - f(k-1)

or else average the absolute values:

f'(k) ~ ( |f(k) - f(k-1)| + |f(k+1) - f(k)| ) / 2

To let your fitting prefer whichever works better, smaller sharper peaks or
taller fatter ones, you would look at the second derivative:

f''(k) ~ f(k+1) - 2 f(k) + f(k-1) = [f(k+1) - f(k)] - [f(k) - f(k-1)]

or

|f''(k)| ~ | |f(k+1) - f(k)| - |f(k) - f(k-1)| |

An advantage of using absolute values is that you can combine them in a single
weighted score, i.e.

|f'| + a |f''|
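
A minimal sketch of the above in Python/NumPy (the array name, the default
weight a, and the handling of the endpoints are assumptions, just for
illustration):

import numpy as np

def peak_sharpness(counts, a=0.5):
    # counts: 1-D histogram of read counts per position
    # a: weight on the second-derivative term (assumed value)
    f = np.asarray(counts, dtype=float)

    back = np.diff(f, prepend=f[:1])   # f(k) - f(k-1), one-sided difference
    fwd = np.diff(f, append=f[-1:])    # f(k+1) - f(k)

    abs_d1 = (np.abs(back) + np.abs(fwd)) / 2.0   # averaged |f'|
    abs_d2 = np.abs(fwd - back)                    # |f(k+1) - 2 f(k) + f(k-1)|

    return abs_d1 + a * abs_d2                     # combined |f'| + a |f''| score

Positions where this score is large are candidate sharp peaks; increasing a
favors narrow peaks over tall, fat ones.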

hope this helps

On Tue, Apr 26, 2011 at 2:11 AM, Ben Weisburd <ben.weisburd at gmail.com> wrote:

> Hi Brian and Mike, thanks for your suggestions!
>
> If I understand correctly, both of you are suggesting rerunning
> cross-validation and/or classification after each tweak to the feature set
> or feature weights. I can see that this is the way to go in the general
> case. I did this when selecting meta-parameters (SVM training parameter C,
> kernel, etc.). However, I think it becomes prohibitively expensive for
> feature selection (unless I can get away with searching a small subset of
> the possible feature combinations and parameters). I was hoping there was
> some way to measure the noisiness / separability for a particular feature
> set more directly, especially in the simple case where you can assume that
> the separation boundary isn't very different from a plane. For example, let's
> say you take each feature individually, compute its value for your positive
> and negative examples, and then draw the ROC curve for it (or just 2
> box-plots). Then you adjust the feature or find another feature that
> improves the ROC curve or increases the distance between the boxes. I
> realize that this wouldn't work if your multi-dimensional boundary is
> donut-shaped, but if the boundary is something like a plane, then increasing
> the separability in each dimension separately should lead to better overall
> results... right?
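
A minimal sketch of that per-feature ROC check, assuming scikit-learn, a
feature matrix X (rows = examples, columns = features), and binary labels y;
the names here are illustrative, not from the actual code:

import numpy as np
from sklearn.metrics import roc_auc_score

def per_feature_auc(X, y):
    # AUC of each feature used alone as a score.
    # Values near 0.5 mean the feature barely separates the classes;
    # values near 0 or 1 mean it separates them well (possibly with inverted sign).
    X = np.asarray(X, dtype=float)
    return np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])

# e.g. rank the features by how far their AUC is from 0.5:
# order = np.argsort(np.abs(per_feature_auc(X, y) - 0.5))[::-1]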
>
> Also, Mike - when testing the importance of a feature by adding noise, why
> is it better to add noise instead of just removing the feature and testing
> performance without it?
>
> Thanks,
> -Ben
>
> P.S. To be more specific about my application: I'm working with gene sequencing
> data, and the problem comes down to peak detection where peaks
> appear simultaneously in several data channels. The data consists of
> millions of short (~ 30 nucleotide) RNA fragments which are aligned to a
> reference genome, resulting in a histogram of position vs. read counts. The
> genome is on the order of 1/2 million bases long, so that the x-axis is 1/2
> million discrete positions, and the y-axis is the number of reads that
> mapped to a position. Because of the way the data was generated, there
> should be a sharp (~3 nucleotide wide) peak in the histogram at positions
> where genes begin. I have a training set of ~100 known gene starts and my
> goal is to classify other positions in the genome as either gene starts or
> not. Right now I'm training the SVM on the 100 positive examples, as well as
> about 1000 negative examples, and then running it to classify the ~ 50,000
> other positions in the genome where the histogram value is above a certain
> minimum threshold.  The multiple channels come from the fact that there are
> actually multiple data sets (aka. histograms) generated under different
> conditions. These provide different views of the same underlying biological
> processes and should have peak-like shapes in the same positions. My feature
> vectors are based on the values of these channels around the position to be
> classified (for example, feature #1 = value of channel 1 at the position
> divided by average channel 1 value within an upstream window,  feature #2 =
> channel 1 value divided by channel 2 value at the position,  etc.), and have
> a total of 9 features.
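
For illustration only, a rough sketch of how features #1 and #2 above might be
computed, assuming two aligned count arrays channel1 and channel2, an upstream
window of 50 positions, and a small eps to avoid division by zero (all of these
are assumptions, not the actual code):

import numpy as np

def two_features(channel1, channel2, pos, upstream=50, eps=1e-9):
    # channel1, channel2: read-count histograms over the same genome coordinates
    # pos: candidate gene-start position
    c1 = np.asarray(channel1, dtype=float)
    c2 = np.asarray(channel2, dtype=float)
    start = max(0, pos - upstream)
    upstream_mean = c1[start:pos].mean() if pos > start else 0.0

    f1 = c1[pos] / (upstream_mean + eps)   # value vs. upstream background (channel 1)
    f2 = c1[pos] / (c2[pos] + eps)         # channel 1 vs. channel 2 at the position
    return f1, f2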
>
>
>