Sunday, 15 September 2013

Get random lines from large files in bash


How can I get n random lines from very large files that do not fit in memory?

It would also be good if I could add filters before or after the randomization.

Update 1

The specs in my case are:

  • 10 million lines
  • Usual random batch size: 10000-30000 lines
  • Hosted Ubuntu Server 14.10 with 512 MB of RAM

So losing a few lines from the file would not be a big problem, since each line only has about a 1 in 10000 chance of being picked anyway, but performance and resource consumption would be a problem.

    With such limiting factors, the following approach works better:

    • Seek to a random position in the file (you will usually land somewhere inside a line)
    • Go backward from that position and find the beginning of that line
    • Go forward and print the full line

    For this you need a tool that can seek within files, for example Perl.

      use strict;
      use warnings;
      use Symbol;
      use Fcntl qw( :seek O_RDONLY );

      my $seekdiff = 256;    # read from "rand_position - 256" up to "rand_position + 256"

      my ($want, $filename) = @ARGV;

      my $fd = gensym;
      sysopen($fd, $filename, O_RDONLY) || die("Cannot open $filename: $!");
      binmode $fd;
      my $endpos = sysseek($fd, 0, SEEK_END) or die("Cannot seek: $!");

      my $buffer;
      my $cnt = 0;
      while ($want > $cnt++) {
          my $randpos = int(rand($endpos));      # random file position
          my $seekpos = $randpos - $seekdiff;    # start reading here ($seekdiff characters earlier)
          $seekpos = 0 if ($seekpos < 0);

          sysseek($fd, $seekpos, SEEK_SET);      # seek to that position
          my $in_count = sysread($fd, $buffer, $seekdiff << 1);    # read 2 * $seekdiff characters

          my $rand_in_buff = ($randpos - $seekpos) - 1;    # the random position within the buffer

          my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1;    # start of the line in the buffer
          my $lineend   = index $buffer, "\n", $linestart;             # end of the line in the buffer
          my $the_line  = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart;

          print "$the_line\n";
      }

    Save the above to a file such as "randlines.pl" and use it as:

      perl randlines.pl wanted_count_of_lines file_name  

    For example:

      perl randlines.pl 10000 file_name

    The script uses very low-level I/O operations, which means it is very fast (on my notebook, selecting 30k lines from 10M took half a second).
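The question also asks about filtering before or after the randomization. Each iteration of the script picks its position independently, so the same line can come out more than once, and a small post-processing step may help. What follows is only a sketch and not part of the original answer: `post_filter` is a hypothetical helper name, `'ERROR'` and `filtered.txt` are made-up placeholders, and standard `grep` and `awk` are assumed to be available.

```shell
#!/usr/bin/env bash

# post_filter PATTERN
# Hypothetical helper (not from the original answer): reads sampled lines on
# stdin, keeps only those matching the grep pattern, and drops repeated lines
# while preserving their order.
post_filter() {
    grep -e "$1" | awk '!seen[$0]++'
}

# Filter AFTER randomization (over-sample a little to make up for dropped lines):
#   perl randlines.pl 12000 file_name | post_filter 'ERROR' | head -n 10000
#
# Filter BEFORE randomization: the script needs a seekable file, so the
# filtered lines have to be written to disk first (filtered.txt is a made-up name):
#   grep -e 'ERROR' file_name > filtered.txt
#   perl randlines.pl 10000 filtered.txt
```

The awk step remembers every distinct line it has seen, which should be fine for batches of 10000-30000 lines. It also merges lines that are genuinely duplicated in the file, so skip it if repeated content matters.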

