Hi Brian and Mike, thanks for your suggestions!<div><br></div><div>If I understand correctly, both of you are suggesting rerunning cross-validation and/or classification after each tweak to the feature set or feature weights. I can see that this is the way to go in the general case, and I did this when selecting meta-parameters (SVM training parameter C, kernel, etc.). However, I think it becomes prohibitively expensive for feature selection, unless I can get away with searching a small subset of the possible feature combinations and parameters. I was hoping there was some way to measure the noisiness / separability of a particular feature set more directly, especially in the simple case where you can assume that the separation boundary isn't very different from a plane. For example, let's say you take each feature individually, compute its value for your positive and negative examples, and then draw the ROC curve for it (or just two box plots). Then you adjust the feature, or find another feature, so that the ROC curve improves or the distance between the boxes increases. I realize that this wouldn't work if your multi-dimensional boundary is donut-shaped, but if the boundary is something like a plane, then increasing the separability in each dimension separately should lead to better overall results, right? <br>
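<div>To make the per-feature idea concrete, here is a rough sketch of what I have in mind. It scores each feature on its own by the area under its ROC curve, computed from ranks (the Mann-Whitney U statistic), and uses only the Python standard library. The function names <code>feature_auc</code> and <code>rank_features</code> are made up for illustration, and it scores features one at a time only, so it says nothing about how features interact.</div>

```python
from typing import List, Sequence, Tuple


def feature_auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Area under the ROC curve for one feature used alone as a score.

    Equals the probability that a random positive example has a larger
    feature value than a random negative one (ties count as one half).
    """
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative value")
    # Sort all values together, remembering which class each came from.
    labelled = sorted([(v, 1) for v in pos] + [(v, 0) for v in neg])
    rank_sum_pos = 0.0
    i = 0
    n = len(labelled)
    while i < n:
        # Find the run of tied values [i, j) and give each the average rank.
        j = i
        while j < n and labelled[j][0] == labelled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in labelled[i:j])
        i = j
    n_pos, n_neg = len(pos), len(neg)
    # Mann-Whitney U statistic for the positives, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def rank_features(pos_rows: Sequence[Sequence[float]],
                  neg_rows: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (feature_index, auc) pairs, most separable feature first."""
    n_features = len(pos_rows[0])
    scores = []
    for k in range(n_features):
        auc = feature_auc([r[k] for r in pos_rows], [r[k] for r in neg_rows])
        scores.append((k, auc))
    # An AUC far from 0.5 in either direction means the feature separates.
    return sorted(scores, key=lambda s: abs(s[1] - 0.5), reverse=True)
```

<div>An AUC near 0.5 means the feature alone tells the classes apart no better than chance, while values near 0 or 1 mean the two box plots barely overlap. Calling <code>rank_features</code> on the positive and negative feature vectors is a single pass over the data, so it is much cheaper than rerunning cross-validation after every tweak.</div>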


<div><br></div><div>Also, Mike: when testing the importance of a feature by adding noise, why is that better than simply removing the feature and testing performance without it?</div><div><br></div><div>


Thanks,</div><div><div>-Ben</div><div><br></div><div>P.S. To be more specific about my application: I'm working with gene sequencing data, and the problem comes down to peak detection where peaks appear simultaneously in several data channels. The data consists of millions of short (~30 nucleotide) RNA fragments which are aligned to a reference genome, resulting in a histogram of position vs. read counts. The genome is on the order of half a million bases long, so the x-axis has about half a million discrete positions, and the y-axis is the number of reads that mapped to each position. Because of the way the data was generated, there should be a sharp (~3 nucleotide wide) peak in the histogram at positions where genes begin. I have a training set of ~100 known gene starts, and my goal is to classify other positions in the genome as either gene starts or not. Right now I'm training the SVM on the 100 positive examples plus about 1000 negative examples, and then running it to classify the ~50,000 other positions in the genome where the histogram value is above a certain minimum threshold. The multiple channels come from the fact that there are actually multiple data sets (a.k.a. histograms) generated under different conditions. These provide different views of the same underlying biological processes and should have peak-like shapes at the same positions. My feature vectors are based on the values of these channels around the position to be classified (for example, feature #1 = value of channel 1 at the position divided by the average channel 1 value within an upstream window, feature #2 = channel 1 value divided by channel 2 value at the position, etc.), and have a total of 9 features.</div>
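<div>In case it helps to see the two example features written out, here is a minimal sketch, assuming each channel is a plain list of read counts indexed by genome position. The names <code>upstream_ratio</code> and <code>channel_ratio</code>, the window size of 50, and the pseudocount are all assumptions for illustration, not my actual settings. "Upstream" is taken to mean lower coordinates, i.e. the forward strand.</div>

```python
from typing import Sequence


def upstream_ratio(channel: Sequence[float], pos: int, window: int = 50,
                   pseudocount: float = 1.0) -> float:
    """Feature #1 style: count at `pos` over the mean count upstream of it.

    `window` and `pseudocount` are illustrative assumptions. The pseudocount
    is added to numerator and denominator so that an empty upstream region
    does not cause a division by zero; with pseudocount=0 and no upstream
    reads this raises ZeroDivisionError.
    """
    start = max(0, pos - window)      # clip the window at the genome start
    upstream = channel[start:pos]     # positions strictly before `pos`
    mean_up = sum(upstream) / len(upstream) if upstream else 0.0
    return (channel[pos] + pseudocount) / (mean_up + pseudocount)


def channel_ratio(ch_a: Sequence[float], ch_b: Sequence[float], pos: int,
                  pseudocount: float = 1.0) -> float:
    """Feature #2 style: count in one channel over another at the same position."""
    return (ch_a[pos] + pseudocount) / (ch_b[pos] + pseudocount)
```

<div>For example, with <code>channel = [2, 2, 2, 2, 10]</code>, <code>upstream_ratio(channel, 4, window=4, pseudocount=0.0)</code> gives 5.0, since the value 10 sits on an upstream average of 2.</div>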


</div></div>