Monday 15 June 2015

linux - How to make this sed script faster? -


I have inherited this sed script snippet that attempts to remove certain whitespace:

  s/[\s\t]*|/|/g
  s/|[\s\t]*/|/g
  s/[\s]*$//g
  s/^|/null|/g

It runs on a file that is about 1 GB in size, and takes 2 hours on our Unix server. Any idea how to speed it up?

Note that \s stands for a space and \t stands for a tab; the actual script uses literal space and tab characters, not those symbols.

The input file is pipe-delimited and is stored locally, not on the network. The 4 lines are in a file executed with sed -f.

The best I could manage with sed was this script:

  s/[\s\t]*|[\s\t]*/|/g
  s/[\s\t]*$//
  s/^|/null|/

In my tests, this ran about 30% faster than your sed script. The performance gain comes from combining the first two regexes into one and omitting the "g" flag where it is not needed: the last two patterns are anchored to the end and start of the line, so each can match at most once.

However, 30% faster is only a mild improvement (it would still take about an hour and a half to run the above script on your 1 GB data file). I wanted to see whether I could do better.

In the end, no other method I tried (awk, Perl, and other approaches with sed) performed any better, except, of course, a plain ol' C implementation. As expected with C, the code is a bit verbose for posting here, but if you want a program that is probably faster than any other method, you might want to take a look at it.

In my tests, the C implementation finishes in about 20% of the time it takes your sed script, so it might take about 25 minutes to run on your Unix server.
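The C program itself is not included in this post. As a rough illustration only, and not the answerer's actual code, a single-pass version of the same three substitutions might look like the sketch below. The names `clean_line`, `process_stream`, and `MAX_LINE` are hypothetical, and the sketch assumes every input line fits in `MAX_LINE` bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 65536  /* assumed upper bound on line length (hypothetical) */

/* The two characters the sed script strips: space and tab. */
static int is_blank(char c) { return c == ' ' || c == '\t'; }

/*
 * Clean one line in a single pass:
 *   - drop blanks on both sides of every '|'
 *   - drop trailing blanks
 *   - prepend "null" if the cleaned line would start with '|'
 * Reading stops at '\n' or '\0'; the newline is not copied.
 * `out` must have room for strlen(in) + 5 bytes.
 * Returns the length of the cleaned line.
 */
size_t clean_line(const char *in, char *out)
{
    size_t w = 0;        /* next write position in out */
    size_t keep = 0;     /* length of out up to the last non-blank char */
    int after_pipe = 0;  /* nonzero while skipping blanks that follow a '|' */
    const char *p = in;

    assert(in != NULL && out != NULL);

    /* Blanks before a leading '|' vanish, so the first field is empty. */
    while (is_blank(*p))
        p++;
    if (*p == '|') {
        memcpy(out, "null", 4);
        w = keep = 4;
    } else {
        p = in;  /* leading blanks before ordinary text are kept, as in sed */
    }

    for (; *p != '\0' && *p != '\n'; p++) {
        if (*p == '|') {
            w = keep;            /* discard blanks written before the pipe */
            out[w++] = '|';
            keep = w;
            after_pipe = 1;
        } else if (is_blank(*p)) {
            if (!after_pipe)
                out[w++] = *p;   /* tentative: kept only if text follows */
        } else {
            out[w++] = *p;
            keep = w;
            after_pipe = 0;
        }
    }
    out[keep] = '\0';            /* cutting at keep trims trailing blanks */
    return keep;
}

/* Filter every line of `in` to `out`, one newline per cleaned line. */
void process_stream(FILE *in, FILE *out)
{
    static char line[MAX_LINE];
    static char cleaned[MAX_LINE + 5];

    while (fgets(line, sizeof line, in) != NULL) {
        size_t n = clean_line(line, cleaned);
        fwrite(cleaned, 1, n, out);
        fputc('\n', out);
    }
}
```

Calling `process_stream(stdin, stdout)` from `main` would turn this into a filter that reads the data file on standard input. Because `clean_line` returns the output length, the caller never needs `strlen()`. No timings are claimed for this sketch; the figures in this answer refer to the answerer's own program.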

I did not spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I do not know whether it is possible to shave off a significant amount of time beyond what it already achieves. If anything, I think it puts an upper limit on the performance you can expect from other methods (sed, awk, Perl, Python, etc.).

Edit: The original version had a minor bug that could cause it to print the wrong thing at the end of the output (for example, it could print a "null" that should not be there). I had some time today to look at it and fixed that. I also optimized away a call to strlen(), which gave it another slight performance boost.

