How can I get n random lines from very large files that do not fit in memory?
Update 1
Here are the specs in my case:
It would also be good if I could add filters before or after the randomization.
Losing a few lines from the file is not a big problem, since each line has only about a 1-in-10000 chance of being picked anyway, but performance and resource consumption are a problem.
Under such constraints, the following approach works better:
- seek to a random position in the file
- go backward from that position to find the beginning of the line
- go forward and print the full line
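The steps above can be sketched in Python as well. This is a minimal illustration of the same windowed-seek technique, assuming the file's lines are shorter than the read window; the function name `random_lines` and the retry when a line end falls outside the window are my own choices, not part of the original script:

```python
import os
import random

def random_lines(path, count, seekdiff=256):
    """Pick `count` random lines from a large file without loading it
    into memory: seek to a random byte offset, read a small window
    around it, and cut out the line containing that offset."""
    lines = []
    with open(path, "rb") as f:
        end = f.seek(0, os.SEEK_END)               # file size in bytes
        while len(lines) < count:
            randpos = random.randrange(end)        # random byte position
            seekpos = max(randpos - seekdiff, 0)   # start reading a bit earlier
            f.seek(seekpos)
            buf = f.read(2 * seekdiff)             # window around randpos
            rel = randpos - seekpos                # random position within buf
            start = buf.rfind(b"\n", 0, rel) + 1   # beginning of the line
            stop = buf.find(b"\n", start)          # end of the line
            if stop < 0:
                continue  # line end outside the window; pick another position
            lines.append(buf[start:stop].decode("utf-8", "replace"))
    return lines
```

Note that, like the Perl version below, this selects lines with probability proportional to their length, and may return the same line twice; per the update above, that is acceptable here.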
For this, you need a tool that can seek in files, for example perl.
use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY );
my $seekdiff = 256; # e.g. read from "rand_position-256" up to rand_position+256

my ($wanted, $filename) = @ARGV;

my $fd = gensym;
sysopen($fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!");

my $buffer;
my $cnt;
while ($wanted > $cnt++) {
    my $randpos = int(rand($endpos));   # random file position
    my $seekpos = $randpos - $seekdiff; # start reading here ($seekdiff chars before)
    $seekpos = 0 if ($seekpos < 0);

    sysseek($fd, $seekpos, SEEK_SET);   # seek to the position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1); # read 2*$seekdiff characters

    my $rand_in_buff = ($randpos - $seekpos) - 1; # the random position within the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; # find the beginning of the line in the buffer
    my $lineend = index($buffer, "\n", $linestart);           # find the end of the line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart;

    print "$the_line\n";
}
Save the above into "randlines.pl" and use it as:
perl randlines.pl wanted_count_of_lines file_name
Example:
perl randlines.pl 10000 file_name
The script does very low-level I/O, which means it is very fast (on my notebook, selecting 30k lines from a 10M-line file took about half a second).