I am in the process of migrating to YARN, and it seems that the behavior of the DistributedCache has changed.
Previously, I added files to the cache as follows:

    for (String file : args) {
        Path path = new Path(cache_root, file);
        URI uri = new URI(path.toUri().toString());
        DistributedCache.addCacheFile(uri, conf);
    }

The path typically looks like

    /some/path/to/my/file.txt

which already exists on HDFS and essentially ended up in the DistributedCache as

    /$DISTRO_CACHE/some/path/to/my/file.txt

I could symlink to it in my current working directory and use it with DistributedCache.getLocalCacheFiles().

With YARN, it seems that this file instead ends up in the cache as:

    /$DISTRO_CACHE/file.txt

That is, the 'path' part of the file URI has been dropped and only the file name remains.
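
To make the naming difference concrete, here is a minimal sketch of the rule this behavior suggests: the local name is the URI fragment if one is given, otherwise just the last path component. This is an assumption inferred from the behavior described above, not a confirmed description of the YARN implementation, and `CacheLinkNames` and `linkName` are hypothetical names that are not part of Hadoop.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop: models the naming rule assumed above.
final class CacheLinkNames {

    // Returns the fragment if the URI has one, otherwise the last path component.
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // The directory part is ignored, so only the file name is printed.
        System.out.println(linkName(URI.create("/some/path/to/my/file.txt")));
        // With a fragment, the fragment is printed instead.
        System.out.println(linkName(URI.create("/some/path/to/my/file.txt#file1")));
    }
}
```

Under this assumed rule, any two URIs that share a file name and carry no fragment would map to the same local name.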
How does this work when different paths end up with the same file name? Consider the following case:

    DistributedCache.addCacheFile(new URI("some/path/to/file.txt"), conf);
    DistributedCache.addCacheFile(new URI("some/other/path/to/file.txt"), conf);

Arguably, one could use fragments:

    DistributedCache.addCacheFile(new URI("some/path/to/file.txt#file1"), conf);
    DistributedCache.addCacheFile(new URI("some/other/path/to/file.txt#file2"), conf);

But this seems unnecessarily hard to manage. Imagine the scenario where these are command-line arguments: the two file names, although they have different paths, would definitely clash in the DistributedCache, so you would somehow need to re-map them to fragments and propagate that mapping to the rest of the program.
Is there an easier way to manage this?
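
One possible way to reduce the bookkeeping, offered only as a sketch: derive a unique fragment for each command-line path automatically and keep a map from the original path to the resulting URI. Each URI could then be passed to DistributedCache.addCacheFile(uri, conf), and the rest of the program could look up the local name through the map instead of tracking fragments by hand. The names `UniqueCacheNames` and `withUniqueFragments` are hypothetical, the counter-prefix naming scheme is an arbitrary choice, and the code uses only java.net.URI, so it does not touch Hadoop itself.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Hadoop.
final class UniqueCacheNames {

    // Returns, for each distinct input path, a URI carrying a fragment that is
    // unique among all inputs. The order of the input paths is preserved.
    static Map<String, URI> withUniqueFragments(String[] paths) {
        Set<String> used = new HashSet<>();
        Map<String, URI> result = new LinkedHashMap<>();
        for (String p : paths) {
            if (result.containsKey(p)) {
                continue; // same path given twice: one cache entry is enough
            }
            String base = p.substring(p.lastIndexOf('/') + 1);
            String link = base;
            // On a clash, prefix a counter so the file extension is preserved.
            for (int n = 2; !used.add(link); n++) {
                link = n + "_" + base;
            }
            try {
                result.put(p, new URI(null, null, p, link));
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException("Not a valid path: " + p, e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] demo = {"some/path/to/file.txt", "some/other/path/to/file.txt"};
        withUniqueFragments(demo)
            .forEach((path, uri) -> System.out.println(path + " -> " + uri));
    }
}
```

The first occurrence of a file name keeps its plain name as the fragment, so code that only ever sees one file.txt behaves as before. Only later clashes get a prefixed name.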