Wednesday 15 September 2010

One-class SVM with skewed data in outlier detection with scikit-learn


I'm detecting outliers in a dataset using a one-class SVM with scikit-learn. I'll try to explain my problem with an example:

Imagine a simple dataset with the features h (height) and p (this is just a simplification; my real dataset is huge). For example:

   h      p
  -----------
   10    0.1     1
   12    0.5     1
   20    3.2     1
   24    2.9     1
   23    0.4    -1
   ...

The third column is the label I expect (1 = normal, -1 = outlier): when there is a strange combination of the two attributes, as in the last row, I want it detected as an outlier.

I am scaling the dataset, and I have been training and detecting outliers using:

  clf = svm.OneClassSVM(kernel="rbf", nu=0.01, gamma=0.01)  
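Roughly, my full setup looks like this (a minimal sketch: the toy values are the ones from the table above, and the use of StandardScaler for the scaling step is just one way to do it):

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

# Toy data in the spirit of the table above: columns are h and p.
X = np.array([
    [10, 0.1],
    [12, 0.5],
    [20, 3.2],
    [24, 2.9],
    [23, 0.4],  # strange combination of h and p -> should come out as -1
])

# Scale the features, then fit the one-class SVM on the scaled data.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

clf = svm.OneClassSVM(kernel="rbf", nu=0.01, gamma=0.01)
clf.fit(X_scaled)

# predict() returns +1 for points inside the learned region, -1 outside.
labels = clf.predict(X_scaled)
print(labels)
```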

This works fine when my dataset is well proportioned. However, when the dataset is skewed in the following way, I have a problem:

  • Almost all the heights are around 3 or 4, but a few are around 10. Then, instead of flagging values like height 30 with p = 0.2 as outliers, the one-class SVM flags the heights around 10 as outliers.

If I split the data into two groups and train two one-class SVMs (the heights around 10 in one dataset, the rest in the other), the problem is solved.
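A sketch of that two-model workaround, assuming a height threshold to split the groups (the threshold of 7, the synthetic cluster parameters, and the `is_outlier` helper are all made up for illustration):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Hypothetical skewed data: a big cluster of heights around 3-4 and a
# smaller cluster of heights around 10 (same two features h and p).
low = np.column_stack([rng.normal(3.5, 0.5, 200), rng.normal(1.5, 0.5, 200)])
high = np.column_stack([rng.normal(10.0, 0.5, 20), rng.normal(1.5, 0.5, 20)])

# Train one scaler + one-class SVM per height group.
models = []
for group in (low, high):
    scaler = StandardScaler().fit(group)
    clf = OneClassSVM(kernel="rbf", nu=0.01, gamma=0.1)
    clf.fit(scaler.transform(group))
    models.append((scaler, clf))

def is_outlier(x):
    # Route the sample to the model for its height range, then predict.
    scaler, clf = models[0] if x[0] < 7 else models[1]
    return clf.predict(scaler.transform([x]))[0] == -1

print(is_outlier([30.0, 0.2]))  # far from both clusters
```

This avoids the skew problem because each model only sees one height group, but it hard-codes the split point, which is what I'd like to avoid.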

Is there a way to solve this without splitting the dataset?

