Thursday, 15 September 2011

apache - Hadoop DistributedCache caching files without absolute path?


I am in the process of migrating to YARN, and it seems that the behavior of the DistributedCache has changed.

Previously, I would add some files to the cache as follows:

  for (String file : args) {
    Path path = new Path(cache_root, file);
    URI uri = new URI(path.toUri().toString());
    DistributedCache.addCacheFile(uri, conf);
  }

The path would typically look like

  /some/path/to/my/file.txt  

which already exists on HDFS and would essentially end up in the DistributedCache as

  /$DISTRO_CACHE/some/path/to/my/file.txt

I could symlink to it in my current working directory and use it with DistributedCache.getLocalCacheFiles().

With YARN, it seems that this file instead ends up in the cache as:

  /$DISTRO_CACHE/file.txt

That is, the path part of the file URI is dropped and only the file name remains.

How does this work when different paths end up with the same file name? Consider the following case:

  DistributedCache.addCacheFile(new URI("some/path/to/file.txt"), conf);
  DistributedCache.addCacheFile(new URI("some/other/path/to/file.txt"), conf);

Arguably, one could use URI fragments:

  DistributedCache.addCacheFile(new URI("some/path/to/file.txt#file1"), conf);
  DistributedCache.addCacheFile(new URI("some/other/path/to/file.txt#file2"), conf);

But this seems unnecessarily hard to manage. Imagine the scenario where these paths are command-line arguments: you somehow need to detect that those two file names, although they have different absolute paths, will definitely clash in the DistributedCache, and then remap them to fragments and propagate those fragments to the rest of the program.
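To make the remapping burden concrete, here is a minimal sketch of what it might look like. `CacheFragments` and `assignFragments` are hypothetical names, not part of Hadoop; the sketch uses only `java.net.URI` and assumes the paths contain only URI-legal characters. Each resulting URI would still have to be passed to `addCacheFile`, and the returned map handed to the rest of the program.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: gives every path a unique fragment.
final class CacheFragments {

    // Returns link name -> URI that carries the link name as its fragment.
    static Map<String, URI> assignFragments(String... paths) {
        Map<String, URI> byLinkName = new LinkedHashMap<>();
        for (String path : paths) {
            String baseName = path.substring(path.lastIndexOf('/') + 1);
            String linkName = baseName;
            // On a clash, append a counter until the name is free.
            for (int n = 1; byLinkName.containsKey(linkName); n++) {
                linkName = baseName + "." + n;
            }
            byLinkName.put(linkName, URI.create(path + "#" + linkName));
        }
        return byLinkName;
    }

    public static void main(String[] args) {
        // Print the link name chosen for each command-line path.
        assignFragments(args).forEach((name, uri) -> System.out.println(name + " -> " + uri));
    }
}
```

The counter suffix is just one possible naming scheme; anything that keeps the link names unique and lets the program find a file again by its original path would do.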

Is there an easier way to manage this?

Try adding the files to the Job.

That is most likely how you are actually configuring the job and then accessing the files in the mapper.

When you set up the job, you would do something like this:

  job.addCacheFile(new Path("cache/file1.txt").toUri());
  job.addCacheFile(new Path("cache/file2.txt").toUri());

Then, in your mapper code, the URIs are stored in an array that can be accessed like this:

  URI file1Uri = context.getCacheFiles()[0];
  URI file2Uri = context.getCacheFiles()[1];
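Indexing by position breaks easily if the order in which files were added changes. As a rough sketch, the array could instead be indexed by name. `CacheFileIndex` is a hypothetical helper, not a Hadoop class; the `cacheFiles` parameter stands in for whatever `context.getCacheFiles()` returns. The naming rule used here (fragment if present, otherwise the last path component) is an assumption based on the behavior described in the question.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: index cache URIs by link name.
final class CacheFileIndex {

    // Assumed rule: the fragment if present, otherwise the last path component.
    // Assumes hierarchical URIs, so getPath() is never null.
    static String linkName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // cacheFiles stands in for the array returned by context.getCacheFiles().
    static Map<String, URI> byLinkName(URI[] cacheFiles) {
        Map<String, URI> index = new HashMap<>();
        for (URI uri : cacheFiles) {
            URI previous = index.put(linkName(uri), uri);
            if (previous != null) {
                // Two entries map to the same name: fail early rather than guess.
                throw new IllegalArgumentException("Cache name clash: " + previous + " vs " + uri);
            }
        }
        return index;
    }
}
```

In a mapper this might be called once from `setup`, for example `CacheFileIndex.byLinkName(context.getCacheFiles()).get("file1.txt")`.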

Hope this can help you.

