How can I get n random lines from very large files that do not fit in memory?
Update 1
Here are the specs in my case:
It would also be good if I could add filters before or after the randomization.
Losing a few lines from the file is not a big problem, since each line has only about a 1-in-10000 chance of being picked anyway, but performance and resource consumption are a problem.
Under such constraints, the following approach works better:
- seek to a random position in the file
- go backward from that position to find the beginning of the line
- go forward and print the full line
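The steps above can be sketched in Python as well. This is a minimal illustration of the same windowed-seek technique, assuming the file's lines are shorter than the read window; the function name `random_lines` and the retry when a line end falls outside the window are my own choices, not part of the original script:

```python
import os
import random

def random_lines(path, count, seekdiff=256):
    """Pick `count` random lines from a large file without loading it
    into memory: seek to a random byte offset, read a small window
    around it, and cut out the line containing that offset."""
    lines = []
    with open(path, "rb") as f:
        end = f.seek(0, os.SEEK_END)               # file size in bytes
        while len(lines) < count:
            randpos = random.randrange(end)        # random byte position
            seekpos = max(randpos - seekdiff, 0)   # start reading a bit earlier
            f.seek(seekpos)
            buf = f.read(2 * seekdiff)             # window around randpos
            rel = randpos - seekpos                # random position within buf
            start = buf.rfind(b"\n", 0, rel) + 1   # beginning of the line
            stop = buf.find(b"\n", start)          # end of the line
            if stop < 0:
                continue  # line end outside the window; pick another position
            lines.append(buf[start:stop].decode("utf-8", "replace"))
    return lines
```

Note that, like the Perl version below, this selects lines with probability proportional to their length, and may return the same line twice; per the update above, that is acceptable here.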
For this, you need a tool that can seek in files, for example perl.
use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY );
my $seekdiff = 256; # e.g. read from "rand_position-256" up to rand_position+256

my ($wanted, $filename) = @ARGV;

my $fd = gensym;
sysopen($fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!");

my $buffer;
my $cnt;
while ($wanted > $cnt++) {
    my $randpos = int(rand($endpos));   # random file position
    my $seekpos = $randpos - $seekdiff; # start reading here ($seekdiff chars before)
    $seekpos = 0 if ($seekpos < 0);

    sysseek($fd, $seekpos, SEEK_SET);   # seek to the position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1); # read 2*$seekdiff characters

    my $rand_in_buff = ($randpos - $seekpos) - 1; # the random position within the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; # find the beginning of the line in the buffer
    my $lineend = index($buffer, "\n", $linestart);           # find the end of the line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart;

    print "$the_line\n";
}
Save the above into "randlines.pl" and use it as:
perl randlines.pl wanted_count_of_lines file_name
Example:
perl randlines.pl 10000 file_name
The script does very low-level I/O, which means it is very fast (on my notebook, selecting 30k lines from a 10M-line file took about half a second).