I take part in many "embarrassingly parallel" projects that I want to parallelize with the multiprocessing module. However, they often involve reading in large files (over 2 GB), processing them line by line, running basic calculations, and then writing results. What is the best way to split the file and process it using Python's multiprocessing module? Should multiprocessing's Queue or JoinableQueue be used? Or the Queue module itself? Or should I map the file, line by line, over a pool of processes using multiprocessing? I have experimented with these approaches, but the overhead of distributing the data line by line is heavy, and I have settled on a lightweight pipe-filter design using shell pipes.

Surprisingly, the Python documentation does not suggest a canonical way of doing this (despite a lengthy section on programming guidelines in the multiprocessing documentation).
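For concreteness, a minimal sketch of the pool-and-map approach mentioned above, with the file names and the per-line work standing in as placeholders:

    import multiprocessing

    def process_line(line):
        # Placeholder for the real per-line work: parse the fields, run a
        # basic calculation, and return a result line.
        fields = line.split()
        return "\t".join(fields) + "\n"

    if __name__ == "__main__":
        with open("input.txt") as infile, open("output.txt", "w") as outfile:
            with multiprocessing.Pool() as pool:
                # imap streams results back in input order; a large chunksize
                # amortizes the cost of shipping data to the worker processes.
                for result in pool.imap(process_line, infile, chunksize=1000):
                    outfile.write(result)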
Thanks, Vince
Additional information: Processing time varies per line. Some problems are fast and barely I/O bound, others are CPU-bound. The CPU-bound, non-dependent tasks will gain the most from parallelization, so even inefficient ways of assigning data to the processing function would still be beneficial in terms of wall-clock time.

A prime example is a script that extracts fields from lines, checks a variety of bitwise flags, and writes lines carrying certain flags to a new file in an entirely new format. It seems like an I/O-bound problem, but my cheap concurrent version using pipes was about 20% faster; when I run it with pool and map, or with multiprocessing's Queue, it is always around 100% slower.
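For illustration only, a filter of that kind reduces to something like the following sketch, where the field positions and the flag mask are invented:

    import sys

    KEEP_MASK = 0x10  # invented flag bit; the real script checks several flags

    def main():
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            flags = int(fields[1])  # invented flag column
            if flags & KEEP_MASK:
                # Re-emit the selected fields in a different format.
                sys.stdout.write(",".join((fields[0], fields[2])) + "\n")

    if __name__ == "__main__":
        main()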
One of the best architectures is already part of the Linux OS. No special libraries are required.
You want a "fan-out" design.
- A "main" program creates several subprocesses connected by pipes.
- Each subprocess should probably have a pipeline of separate processes that read from stdin and write to stdout.

You do not need a queue data structure; an in-memory pipeline is simply a queue of bytes between two concurrent processes.
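A minimal sketch of such a fan-out, assuming a hypothetical worker script named filter.py that reads lines from its stdin and writes results to its stdout:

    import itertools
    import subprocess
    import sys

    NUM_WORKERS = 4  # assumption: tune to the number of CPUs

    def main(input_path):
        # Fan-out: the "main" program starts several worker subprocesses,
        # each connected to it by an ordinary OS pipe on the worker's stdin.
        outputs = [open(f"out{i}.txt", "w") for i in range(NUM_WORKERS)]
        workers = [
            subprocess.Popen(
                [sys.executable, "filter.py"],  # hypothetical per-line filter
                stdin=subprocess.PIPE,
                stdout=out,
                text=True,
            )
            for out in outputs
        ]

        # Deal input lines round-robin to the workers; the pipe itself is the
        # in-memory queue of bytes between the two concurrent processes.
        with open(input_path) as infile:
            for worker, line in zip(itertools.cycle(workers), infile):
                worker.stdin.write(line)

        # Closing a worker's pipe gives it EOF; then wait for it to finish.
        for worker, out in zip(workers, outputs):
            worker.stdin.close()
            worker.wait()
            out.close()

    if __name__ == "__main__":
        main(sys.argv[1])

Each worker could itself be a small pipeline of further processes, as the second point above suggests.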