Wednesday 15 August 2012

Python packaging for hive/hadoop streaming


I have a Hive query with a custom mapper and reducer written in Python. The mapper and reducer modules depend on some 3rd-party modules/packages that are not installed on my cluster (and installing them on the cluster is not an option). I discovered the problem only after running the Hive query, when it failed saying that the xyz module was not found.

How do I package the whole thing so that all the dependencies (including transitive dependencies) are available in my streaming job? And how do I import the modules in my mapper and reducer once they are packaged that way?

The question is not an easy one; even after an hour of searching I could not find an answer. Also, it is not specific to Hive: the same applies to Hadoop streaming jobs in general whenever the mapper/reducer is written in Python.
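To make the failure concrete, here is a stripped-down example of the kind of reducer that hits it. The module name some_module is a stand-in for the actual 3rd-party dependency:

    #!/usr/bin/env python
    # Minimal streaming reducer illustrating the failure mode: on a task
    # node where the dependency is not installed, the job dies with
    # "ImportError: No module named some_module".
    import sys
    import some_module  # hypothetical 3rd-party package

    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        print("%s\t%s" % (key, some_module.process(value)))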

This can be done by packaging the dependencies and the reducer script in a zip, and adding this zip as a resource in the Hive query.

Assume that the Python reducer script depends on package D1, which in turn depends on D2 (this takes care of the question about transitive dependencies), and that neither D1 nor D2 is installed on any machine in the cluster.
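One way to assemble such an archive is with Python's standard zipfile module. The sketch below assumes the d1/ and d2/ package directories have already been copied next to the build script (for example, out of site-packages), and it places everything under a top-level dep/ directory inside the zip to match the sample query below:

    import os
    import zipfile

    def add_tree(zf, src_dir, arc_prefix):
        # Recursively add every file under src_dir to the archive,
        # rooted at arc_prefix inside the zip.
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, src_dir)
                zf.write(path, os.path.join(arc_prefix, rel))

    zf = zipfile.ZipFile("dep.zip", "w", zipfile.ZIP_DEFLATED)
    zf.write("reducer.py", "dep/reducer.py")  # the streaming script itself
    add_tree(zf, "d1", "dep/d1")              # direct dependency
    add_tree(zf, "d2", "dep/d2")              # transitive dependency of D1
    zf.close()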

  • Package D1, D2, and the Python reducer script (let's call it reducer.py) into an archive, say dep.zip, as in the sketch above.
  • Add this zip as a resource and use it in the query, as in the following sample:

    ADD ARCHIVE dep.zip;

    FROM (some_table) t1
    INSERT OVERWRITE TABLE t2
    REDUCE t1.col1, t1.col2 USING 'python dep.zip/dep/reducer.py' AS output;
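Note the script path in the USING clause: Hive distributes the archive and unpacks it on each task node into a directory named after the archive itself, which is why the script is invoked as dep.zip/dep/reducer.py. No import bootstrapping is needed inside the script, because Python automatically puts the directory of the running script (here dep/) on sys.path, so the bundled copies of D1 and D2 are picked up by a plain import. As a sketch (the package name d1 and its transform function are assumptions):

    #!/usr/bin/env python
    # reducer.py -- `import d1` resolves to dep/d1, the copy shipped in
    # dep.zip, because the script's own directory is on sys.path.
    import sys
    import d1  # hypothetical bundled package; d1 in turn imports d2

    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        # Emit tab-separated output, as Hadoop streaming expects.
        print("%s\t%s" % (key, d1.transform(value)))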

