Saturday, 15 September 2012

Python: How to properly deal with NaNs in a pandas DataFrame for feature selection in scikit-learn


This is related to the question I posted, but it is more specific and simpler.

I have a pandas DataFrame whose index is a unique user identifier and whose columns correspond to unique events, with values of 1 (present), 0 (not present), or NaN (was not invited / was not relevant). The matrix is quite sparse with respect to NaNs: there are several hundred events and most users were invited to at most a few tens of them.
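
For concreteness, a tiny toy version of the kind of DataFrame I mean might look like this (the column and index names are made up):

    import numpy as np
    import pandas as pd

    # rows are users, columns are events: 1 = present, 0 = not present, NaN = not invited
    my_data = pd.DataFrame(
        {'event_a': [1, 0, np.nan],
         'event_b': [np.nan, 1, 1],
         'event_c': [0, np.nan, np.nan]},
        index=['user_1', 'user_2', 'user_3'])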

I created some extra columns to measure "success", which I define simply as attendance relative to invitations:

    event_cols = my_data.columns
    my_data['invited'] = my_data[event_cols].count(axis=1)
    my_data['present'] = my_data[event_cols].sum(axis=1)
    my_data['success'] = my_data['present'] / my_data['invited']

My goal right now is to perform feature selection on the events/columns, starting with the most basic variance-based method: remove events with low variance. Then I would run a linear regression on the events and keep only the ones with large coefficients and small p-values.
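
To make that concrete, the kind of thing I have in mind is roughly the following (only a sketch, assuming the NaNs have already been dealt with somehow; statsmodels is used just to get p-values, and the variance cut-off is arbitrary):

    import statsmodels.api as sm

    events = my_data[event_cols]

    # 1. drop events whose attendance hardly varies (0.05 is an arbitrary cut-off)
    variances = events.var()
    events = events[variances[variances > 0.05].index]

    # 2. regress 'success' on the remaining events; keep large coefficients with small p-values
    X = sm.add_constant(events.values)
    results = sm.OLS(my_data['success'].values, X).fit()
    print(results.params)
    print(results.pvalues)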

But my problem is that I have so many NaNs and I am not sure how best to deal with them, as the scikit-learn methods give me errors because of them. One idea is to replace 'not present' with -1 and 'not invited' with 0 in place, but I worry that this will change the significance of the events.
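
In code, that idea would be something like the line below (using event_cols from above), which is what I am hesitant about:

    # 1 stays 'present', 0 ('not present') becomes -1, NaN ('not invited') becomes 0
    recoded = my_data[event_cols].replace(0, -1).fillna(0)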

Can someone suggest a proper way to deal with all of these NaNs, without changing the statistical significance of each feature/event?

Edit: I would like to add that I am happy to change my metric for "success" from the one above, if there is a more appropriate one that lets me move forward. I am just trying to determine which events are effective at capturing user interest; it is quite open-ended and this is mostly an exercise to practise feature selection.

Thank you!

If I understand correctly, you would like to clean your data of the NaNs without significantly altering its statistical properties, so that you can run some analysis on it afterwards.

I actually did something similar recently, and a simple approach you might be interested in is scikit-learn's Imputer. As mentioned earlier, one option is to replace the NaNs with the mean along a chosen axis; other options include, for example, replacing them with the median. Something like:

    from sklearn.preprocessing import Imputer

    imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
    cleaned_data = imp.fit_transform(original_data)

In this case, it will replace the NaNs with the mean along each axis (here axis=1, i.e. the mean of each row/user). You may then want to round the cleaned data to make sure you only get 0s and 1s.
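
For instance, a minimal way of snapping the imputed values back onto {0, 1} might be (assuming cleaned_data from the snippet above):

    import numpy as np

    # the imputed means lie between 0 and 1, so rounding maps them back to 0 or 1
    cleaned_data = np.round(cleaned_data)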

I would plot some histograms of the data, as a sanity check that this preprocessing does not significantly change your distributions - since we are swapping a large number of values with the mode/mean along each axis.
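
Something along these lines (matplotlib assumed; variable names follow the snippet above) is enough for a quick visual check:

    import numpy as np
    import matplotlib.pyplot as plt

    orig = np.asarray(original_data, dtype=float)

    # compare the distribution of values before imputation (NaNs dropped) and after
    plt.hist(orig[~np.isnan(orig)], bins=20, alpha=0.5, label='original (NaNs dropped)')
    plt.hist(np.asarray(cleaned_data).ravel(), bins=20, alpha=0.5, label='after imputation')
    plt.legend()
    plt.show()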


Taking things one step further (in case the above is not enough to handle it), you could alternatively do the following:

  • For each event column in your data, ignoring all the NaNs, calculate the probability of attending ('p') and of not attending ('1 - p'). [i.e. p = present / (present + not present)]
  • Then replace the NaNs in each event column with random numbers generated from a Bernoulli distribution, e.g. using numpy ('np'); a fuller per-column version is sketched after the snippet below:

    import numpy as np

    n = 1      # number of trials
    p = 0.25   # replace with the expected probability for each event (i.e. present / invited, as calculated above)

    s = np.random.binomial(n, p, 1000)

    # s is now a bunch of random 1s and 0s that you can swap in for the NaNs
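
Putting the two bullet points together, a rough per-column version might look like this (purely a sketch, reusing my_data and event_cols from the question above; columns that are entirely NaN would need special handling):

    import numpy as np

    events = my_data[event_cols].copy()

    for col in events.columns:
        p = events[col].mean()            # NaNs are skipped, so p = present / invited
        missing = events[col].isnull()
        # draw an independent Bernoulli(p) value for every missing entry in this column
        events.loc[missing, col] = np.random.binomial(1, p, missing.sum())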

Again, this is not perfect in itself, and you are still distorting your data a little (for example, a more accurate approach would account for dependencies between events for each user) - but since you are sampling from a roughly matching distribution rather than replacing the values arbitrarily, at least quantities such as the mean should be robust.

Hope it helps!

