Wednesday, 15 April 2015

concurrency - What's the best way to divide large files in Python for multiprocessing?


I run into many "embarrassingly parallel" projects I would like to parallelize with the multiprocessing module. However, they often involve reading in huge files (over 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multiprocessing module? Should Queue or JoinableQueue in multiprocessing be used? Or the Queue module itself? Or should I map the file, as an iterable, over a pool of processes using multiprocessing? I have experimented with these approaches, but the overhead of distributing the data line by line is immense. I have settled on a lightweight pipe-filter design using cat file | process1 --out-file out1 --num-processes 2 | process2 --out-file out2, which passes a certain percentage of the first process's input directly on to the second input (see), but I would like a solution contained entirely in Python.
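For illustration, a minimal sketch of the pool-and-map variant mentioned above; the process_line function and the file names are hypothetical stand-ins for the real per-line calculation:

    from multiprocessing import Pool

    def process_line(line):
        # Hypothetical per-line calculation; replace with the real work.
        return line.upper()

    if __name__ == "__main__":
        pool = Pool(processes=4)
        with open("input.txt") as infile, open("output.txt", "w") as outfile:
            # imap streams lines to the workers without loading the whole
            # file; a large chunksize amortizes the per-line IPC overhead.
            for result in pool.imap(process_line, infile, chunksize=1000):
                outfile.write(result)
        pool.close()
        pool.join()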

Surprisingly, the Python documentation doesn't suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).

Thanks, Vince

Additional information: The processing time per line varies. Some problems are fast and barely I/O-bound, some are CPU-bound. The CPU-bound, non-dependent tasks stand to gain the most from parallelization, so even inefficient ways of assigning data to a processing function would still be beneficial in terms of wall-clock time.

A prime example is a script that extracts fields from lines, checks for a variety of bitwise flags, and writes lines with certain flags to a new file in an entirely new format. This seems like an I/O-bound problem, but when I ran it with my cheap concurrent version using pipes, it was about 20% faster; when I run it with pool and map, or with Queue in multiprocessing, it is always more than 100% slower.
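As a rough sketch of the kind of script described here (the field positions, the flag value, and the output format are made up for illustration):

    KEEP_FLAG = 0x10  # hypothetical bitwise flag used to select lines

    def convert(in_path, out_path):
        with open(in_path) as infile, open(out_path, "w") as outfile:
            for line in infile:
                fields = line.rstrip("\n").split("\t")
                flags = int(fields[1])  # assume the flag field is column 2
                if flags & KEEP_FLAG:
                    # rewrite the selected fields in a new format
                    outfile.write("%s,%s\n" % (fields[0], fields[2]))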

One of the best architectures is already part of the Linux OS. No special libraries are required.

You want a "fan-out" design.

  1. A "main" program creates several subproductions connected by pipes.

  2. The main program reads the file and writes lines to those pipes, doing only the minimum filtering needed to deal each line to the appropriate subprocess (see the sketch after this list).

  3. Each subprocess should probably be a pipeline of distinct processes that read from stdin and write to stdout.

    You do not need a queue data structure; an in-memory pipeline is a queue of bytes between two concurrent processes.
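A minimal sketch of this fan-out in Python, using subprocess pipes; the worker command ("worker.py", a script that reads lines from stdin) and its --out-file option are hypothetical stand-ins for the real filter:

    import subprocess
    import sys
    from itertools import cycle

    NUM_WORKERS = 4

    def fan_out(path):
        # Spawn the workers, each connected to the main program by an
        # OS pipe on its stdin.
        workers = [
            subprocess.Popen(
                [sys.executable, "worker.py", "--out-file", "out%d.txt" % i],
                stdin=subprocess.PIPE,
            )
            for i in range(NUM_WORKERS)
        ]
        # Deal lines to the workers round-robin; the pipe itself is the
        # queue of bytes between the two concurrent processes.
        with open(path) as infile:
            for worker, line in zip(cycle(workers), infile):
                worker.stdin.write(line.encode())
        for w in workers:
            w.stdin.close()
            w.wait()

    if __name__ == "__main__":
        fan_out(sys.argv[1])

Each worker reads its stdin until EOF, so closing the pipes is what signals the workers to finish.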

