Friday, 15 March 2013

arrays - Optimized method for calculating cosine distance in Python


I wrote a method to calculate the cosine distance between two arrays:

  from math import sqrt

  def cosine_distance(a, b):
      # Arrays of different length cannot be compared
      if len(a) != len(b):
          return False
      numerator = 0
      denoma = 0
      denomb = 0
      for i in range(len(a)):
          numerator += a[i] * b[i]
          denoma += abs(a[i]) ** 2
          denomb += abs(b[i]) ** 2
      result = 1 - numerator / (sqrt(denoma) * sqrt(denomb))
      return result

Running this on a large array can be very slow. Is there an optimized version of this method that would run faster?

Update: I have tried all the suggestions so far, including SciPy. Here's the version to beat, incorporating suggestions from Mike and Steve:

  from math import sqrt

  def cosine_distance(a, b):
      if len(a) != len(b):
          raise ValueError("a and b must be the same length")  # Steve
      numerator = 0
      denoma = 0
      denomb = 0
      for i in range(len(a)):   # Mike's optimizations:
          ai = a[i]             # only look up each element once
          bi = b[i]
          numerator += ai * bi
          denoma += ai * ai     # faster than the exponent (barely)
          denomb += bi * bi     # strip abs() since the value is squared
      result = 1 - numerator / (sqrt(denoma) * sqrt(denomb))
      return result

If you can use SciPy, you can use cosine from scipy.spatial.distance.
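As a rough illustration of the SciPy route, here is a minimal sketch that assumes SciPy is installed. The vectors a, b and c are made-up sample data, not taken from the question; c is chosen so that its dot product with a is zero.

```python
from scipy.spatial.distance import cosine

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # points in the same direction as a
c = [3.0, 0.0, -1.0]   # orthogonal to a: 1*3 + 2*0 + 3*(-1) == 0

# cosine(u, v) returns 1 - (u . v) / (|u| * |v|)
parallel_distance = cosine(a, b)     # close to 0.0
orthogonal_distance = cosine(a, c)   # close to 1.0

print(parallel_distance, orthogonal_distance)
```

The function accepts plain Python lists as well as NumPy arrays, so it can stand in for the hand-written cosine_distance without changing the calling code.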

If you cannot use SciPy, you could try to get a small speedup by rewriting your Python (EDIT: but it did not work out the way I thought it would; see below).

  from itertools import izip  # Python 2; in Python 3 the built-in zip is already lazy
  from math import sqrt

  def cosine_distance(a, b):
      if len(a) != len(b):
          raise ValueError("a and b must be the same length")
      numerator = sum(tup[0] * tup[1] for tup in izip(a, b))
      denoma = sum(avalue ** 2 for avalue in a)
      denomb = sum(bvalue ** 2 for bvalue in b)
      result = 1 - numerator / (sqrt(denoma) * sqrt(denomb))
      return result

It is better to raise an exception when the lengths of a and b are mismatched.

By using generator expressions inside calls to sum(), you can calculate your values with most of the work being done by the C code inside Python. This should be faster than using a for loop.

I have not timed this, so I cannot guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and should be about as fast as you can get.

If you are doing bioinformatics in Python, you really should be using SciPy anyway.

EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for everyone: when you are trying to speed things up, do not guess, measure.
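To illustrate measuring rather than guessing, here is a minimal sketch of how the two approaches could be compared with the standard library's timeit module. The names cosine_distance_loop, cosine_distance_sum and best_time are made up for this illustration, the test data is arbitrary, and the code is written for Python 3 (built-in zip instead of izip). The timings it prints will differ between machines and Python versions, so no particular result is implied.

```python
import timeit
from math import sqrt


def cosine_distance_loop(a, b):
    """Explicit for loop, modelled on the question's updated version."""
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    numerator = denoma = denomb = 0.0
    for i in range(len(a)):
        ai = a[i]
        bi = b[i]
        numerator += ai * bi
        denoma += ai * ai
        denomb += bi * bi
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))


def cosine_distance_sum(a, b):
    """Generator expressions inside sum(), modelled on the answer's rewrite."""
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    numerator = sum(x * y for x, y in zip(a, b))
    denoma = sum(x * x for x in a)
    denomb = sum(y * y for y in b)
    return 1 - numerator / (sqrt(denoma) * sqrt(denomb))


def best_time(func, a, b, repeat=3, number=20):
    """Return the best observed time per call, in seconds (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(a, b))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Arbitrary test data: two lists of 1000 floats.
a = [float(i % 7 + 1) for i in range(1000)]
b = [float(i % 5 + 1) for i in range(1000)]

for func in (cosine_distance_loop, cosine_distance_sum):
    print(func.__name__, best_time(func, a, b))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it is worth rerunning with both short and long inputs, since the ranking may depend on the input length.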

I am baffled as to why my attempt to push more of the work onto the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.

I cannot spend any more time trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.

EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (Now I am wondering whether I can trust timeit on my Windows computer; I want to try this test again on Linux.) I would not change working code to try to get it faster. And once more I urge you to try SciPy. :-)

