In Hadoop, it is sometimes difficult to pass arguments to mappers and reducers. If the arguments are large (e.g., big arrays), DistributedCache may be a better choice. Here, however, we are discussing small arguments, usually a handful of configuration parameters.
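For completeness, the DistributedCache route looks roughly like this (a minimal sketch using the old org.apache.hadoop.filecache API; the HDFS path is a made-up placeholder):

    // In the driver, before submitting the job: register an HDFS file
    // (hypothetical path) to be copied to every task node.
    DistributedCache.addCacheFile(new URI("/user/me/cache/big-array.txt"), job);

    // In configure(JobConf job) of the mapper or reducer: locate the
    // node-local copies and read them with ordinary java.io.
    Path[] localCopies = DistributedCache.getLocalCacheFiles(job);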
In fact, passing these parameters is simple. When you initialize the JobConf object to launch a MapReduce job, you can set a parameter with its set method, like this:
    // In the driver: put the parameter into the job's configuration.
    JobConf job = (JobConf) getConf();
    job.set("NumberOfDocuments", args[0]);
Here, "NumberOfDocuments" is the name of the parameter and its value is read from args[0], a command-line argument. Once you set this parameter, you can retrieve its value in a mapper or reducer as follows:
    private static Long N;

    // Called once per task, before any map() or reduce() call.
    public void configure(JobConf job) {
        N = Long.parseLong(job.get("NumberOfDocuments"));
    }
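For context, here is a minimal sketch of where that configure() method lives in a mapper written against the old org.apache.hadoop.mapred API (the class name and the word-count-style key/value types are illustrative assumptions):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        private static Long N;

        @Override
        public void configure(JobConf job) {
            // The framework calls this once per task and passes the job's
            // configuration, which carries the value set in the driver.
            N = Long.parseLong(job.get("NumberOfDocuments"));
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            // ... use N in the map logic ...
        }
    }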
Note that the tricky part is that you cannot set parameters like this:
    Configuration con = new Configuration();
    con.set("NumberOfDocuments", args[0]);
and expect all mappers and reducers to be able to retrieve it. This fails at runtime because the freshly created Configuration object is not the configuration that gets serialized and shipped to the tasks; the parameter must be set on the job's own configuration (the JobConf) before the job is submitted.
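Putting it together, a minimal driver sketch might look like the following (the class names, paths, and the Tool/ToolRunner boilerplate are illustrative assumptions, not part of the original post):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            JobConf job = new JobConf(getConf(), MyJob.class);

            // Set the small parameter on the job's own configuration
            // *before* submitting, so every task can read it in configure().
            job.set("NumberOfDocuments", args[0]);

            job.setMapperClass(MyMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.setInputPaths(job, new Path(args[1]));
            FileOutputFormat.setOutputPath(job, new Path(args[2]));

            JobClient.runJob(job);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyJob(), args));
        }
    }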