What does PipeMapRed do in Hadoop streaming? - hadoop-streaming

I have run a Hadoop job more than once, and every time it takes too long to finish, about *15 mins* in total.
I checked the syslog and found that org.apache.hadoop.streaming.PipeMapRed was doing something for about 10 mins; after PipeMapRed was done, MapTask took over and finished in less than 1 min. What the heck?
What does PipeMapRed actually do? Why is it so time-consuming?
Here are some log lines printed by PipeMapRed:
17:00:57,307 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=1633/1
17:00:59,782 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10000/8763/0 in:5000=10000/2 [rec/s] out:4381=8763/2 [rec/s]
17:01:07,310 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=60670/59051
17:01:12,610 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=100000/97904/0 in:6666=100000/15 [rec/s] out:6526=97904/15 [rec/s]
17:01:17,332 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=126104/124334
17:01:27,378 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=181681/179714
17:01:30,514 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=200000/198233/0 in:6060=200000/33 [rec/s] out:6007=198233/33 [rec/s]
17:01:37,404 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=244642/242654

The logs you provided are from MapReduce streaming; they show how many records are being read and written. For example:
R/W/S=10000/8763/0 in:5000=10000/2 [rec/s] out:4381=8763/2 [rec/s]
The first part counts records:
READ/WRITE/SKIPPED=10000/8763/0
The second part reports how fast the records are processed: here you are reading 5000 records/sec (10000 records in 2 seconds) and writing 4381 records/sec.
15 min per (streaming) MapReduce job is totally OK, if not too little :)
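If you want to follow the throughput yourself, here is a hedged one-liner; it assumes the task syslog is saved locally in a file named syslog and uses the exact line format shown above (timestamp, INFO, class, counters, rates, with no thread-name field):
# print timestamp, R/W/S counters, and in/out record rates
grep 'PipeMapRed: R/W/S' syslog | awk '{print $1, $4, $5, $7}'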

Related

Scheduler logs different between airflow1 and airflow2

Because I can't use the Airflow CLI, I'm parsing the scheduler logs with grep on Airflow 1 in order to retrieve some information, such as:
whether the DAG was triggered or not / whether it was successful or not / the start timestamp, using the pattern "INFO Marking run":
[2021-12-01 11:06:50,340] {logging_mixin.py:112} INFO - [2021-12-01 11:06:50,339] {dagrun.py:307} INFO - Marking run <DagRun prd_*** # 2021-12-01 10:02:00+00:00: scheduled__2021-12-01T10:02:00+00:00, externally triggered: False>successful
When the DAG is not triggered, I use the pattern 'INFO - Created' to retrieve the DAG's start timestamp:
[2021-12-01 11:04:49,213] {scheduler_job.py:1298} INFO - Created <DagRun prd_*** # 2021-12-01T10:02:00+00:00: scheduled__2021-12-01T10:02:00+00:00, externally triggered: False>
It works well on Airflow 1, but I can't find that data in the Airflow 2 scheduler logs after the migration.
Does the configuration need to be changed?
Regards,
Troubadour
You should use the Airflow 2 REST API instead.
It was created precisely so that you do not have to parse logs: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
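For example, here is a hedged sketch of pulling DAG runs (state and timestamps) from the stable REST API; admin:admin, localhost:8080 and prd_example are placeholders, and the API has to be reachable with an auth backend that accepts basic auth:
# list the last 10 runs of one DAG, including state, start and end dates
curl -s -u admin:admin \
  "http://localhost:8080/api/v1/dags/prd_example/dagRuns?limit=10" \
  | python -m json.tool
The response contains the same information previously grepped from the scheduler logs: whether a run exists, its state (success/failed/running), and its start timestamp.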

Sqoop Hcatalog import job completed but data is not present in the table

I was trying to integrate HCatalog with Sqoop in order to import data from an RDBMS (Oracle) into the data lake (Hive).
sqoop-import --connect connection-string --username username --password pass --table --hcatalog-database data_extraction --hcatalog-table --hcatalog-storage-stanza 'stored as orcfile' -m1 --verbose
The job executed successfully, but I am not able to find the data.
I also checked the location of the table created through HCatalog and found that no directory was created for it; only a 0-byte file named _$folder$ was present.
Please find the log below:
19/09/25 17:53:37 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
19/09/25 17:54:02 DEBUG db.DBConfiguration: Fetching password from job credentials store
19/09/25 17:54:03 INFO db.DBInputFormat: Using read commited transaction isolation
19/09/25 17:54:03 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '1=1' and upper bound '1=1'
19/09/25 17:54:03 INFO mapreduce.JobSubmitter: number of splits:1
19/09/25 17:54:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569355854349_1231
19/09/25 17:54:04 INFO impl.YarnClientImpl: Submitted application application_1569355854349_1231
19/09/25 17:54:04 INFO mapreduce.Job: The url to track the job: http://<PII-removed-by-me>/application_1569355854349_1231/
19/09/25 17:54:04 INFO mapreduce.Job: Running job: job_1569355854349_1231
19/09/25 17:57:34 INFO hive.metastore: Closed a connection to metastore, current connections: 1
19/09/25 18:02:59 INFO mapreduce.Job: Job job_1569355854349_1231 running in uber mode : false
19/09/25 18:02:59 INFO mapreduce.Job: map 0% reduce 0%
19/09/25 18:03:16 INFO mapreduce.Job: map 100% reduce 0%
19/09/25 18:03:18 INFO mapreduce.Job: Job job_1569355854349_1231 completed successfully
19/09/25 18:03:18 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=425637
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=0
S3: Number of bytes written=310154
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=29274
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=14637
Total vcore-milliseconds taken by all map tasks=14637
Total megabyte-milliseconds taken by all map tasks=52459008
Map-Reduce Framework
Map input records=145608
Map output records=145608
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=199
CPU time spent (ms)=4390
Physical memory (bytes) snapshot=681046016
Virtual memory (bytes) snapshot=5230788608
Total committed heap usage (bytes)=1483210752
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 582.8069 seconds (0 bytes/sec)
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Retrieved 145608 records.
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table null
19/09/25 18:03:19 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader#1d548a08
Solved it.
As we are using AWS EMR (a managed Hadoop service), this is already documented on the AWS site (the AWS forum screenshot says the same):
When you use Sqoop to write output to an HCatalog table in Amazon S3, disable Amazon EMR direct write by setting the mapred.output.direct.NativeS3FileSystem and mapred.output.direct.EmrFileSystem properties to false. For more information, see Using HCatalog. You can use the Hadoop -D mapred.output.direct.NativeS3FileSystem=false and -D mapred.output.direct.EmrFileSystem=false commands.
If you don't disable direct write, no error occurs, but the table is created in Amazon S3 and no data is written.
Source: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-sqoop-considerations.html
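For reference, a hedged sketch of where those properties go on the sqoop command line; the generic -D options have to come before the tool-specific arguments, and the connection string and table names are placeholders:
sqoop-import \
  -D mapred.output.direct.NativeS3FileSystem=false \
  -D mapred.output.direct.EmrFileSystem=false \
  --connect connection-string --username username --password pass \
  --table SOURCE_TABLE \
  --hcatalog-database data_extraction --hcatalog-table TARGET_TABLE \
  --hcatalog-storage-stanza 'stored as orcfile' -m 1 --verbose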

Airflow execution_timeout settings not respected

In my tasks I have execution_timeout=timedelta(minutes=1) set, and 'dagrun_timeout': timedelta(minutes=2) for my DAG, and this is correctly reflected in the web GUI's Task Instance Details. However, none of my task instances are actually set to failed or retried when breaching the one-minute threshold. Rather, they time out at 11 minutes...
[2017-11-02 18:00:05,376] {base_task_runner.py:95} INFO - Subtask: [2017-11-02 18:00:05,370] {base_hook.py:67} INFO - Using connection to: [REDACTED]
[2017-11-02 18:10:06,505] {base_task_runner.py:95} INFO - Subtask: [2017-11-02 18:10:06,504] {timeout.py:37} ERROR - Process timed out
Do I have a problem with my configuration, or is there something buggy happening with how Airflow interprets time out settings?

Fastest way to write in HDFS from R (without any package)

I am trying to write some data into HDFS using a custom R map-reduce job. The read process is pretty fast, but the post-processing write takes quite a long time. I have tried functions that can write to a file connection:
output <- file("stdout", "w")
write.table(base,file=output,sep=",",row.names=F)
writeLines(t(as.matrix(base)), con = output, sep = ",", useBytes = FALSE)
However, write.table only writes partial information (the first few rows and the last few rows), and writeLines doesn't work. So now I am trying:
for (row in 1:nrow(base)) {
  cat(base[row, ]$field1, ",", base[row, ]$field2, ",", base[row, ]$field3, ",", base[row, ]$field4, ",",
      base[row, ]$field5, ",", base[row, ]$field6, "\n", sep = '')
}
But the writing speed of this is very slow. Here is a log excerpt showing how slow the writing is:
2016-07-07 08:59:30,557 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406056
2016-07-07 08:59:40,567 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406422
2016-07-07 08:59:50,582 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406710
2016-07-07 09:00:00,947 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407001
2016-07-07 09:00:11,392 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407316
2016-07-07 09:00:21,832 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407683
2016-07-07 09:00:31,883 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408103
2016-07-07 09:00:41,892 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408536
2016-07-07 09:00:51,895 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408969
2016-07-07 09:01:01,903 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409377
2016-07-07 09:01:12,187 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409782
2016-07-07 09:01:22,198 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410161
2016-07-07 09:01:32,293 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410569
2016-07-07 09:01:42,509 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410989
2016-07-07 09:01:52,515 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411435
2016-07-07 09:02:02,525 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411814
2016-07-07 09:02:12,625 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412196
2016-07-07 09:02:22,988 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412616
2016-07-07 09:02:32,991 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413078
2016-07-07 09:02:43,104 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413508
2016-07-07 09:02:53,115 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413975
2016-07-07 09:03:03,122 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414415
2016-07-07 09:03:13,128 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414835
2016-07-07 09:03:23,131 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415210
2016-07-07 09:03:33,143 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415643
2016-07-07 09:03:43,153 INFO [Thread-49]
org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/416031
So I am wondering if I am doing something wrong. I am using data.table.
Based on my experiments with various functions that have file-writing capability, I found the following to be the fastest:
library(data.table)

# Convert every column to character first, so cat() prints the actual values
# rather than the internal integer codes of factors.
base <- data.table(apply(base, 2, FUN = as.character), stringsAsFactors = FALSE)

# Iterate over the rows, writing one comma-separated line per row to stdout.
x <- sapply(1:nrow(base), FUN = function(row) {
  cat(base$field1[row], ",", base$field2[row], ",", base$field3[row], ",",
      base$field4[row], ",", base$field5[row], ",", base$field6[row], "\n", sep = '')
})
rm(x)  # sapply() returns NULL for every row; discard the result
where x is just there to capture the NULL returns that sapply produces, and the as.character conversion is there to prevent the mess cat makes of factors (printing the internal factor code rather than the actual value).

Running a R script using hadoop streaming Job Failing : PipeMapRed.waitOutputThreads(): subprocess failed with code 1

I have an R script which works perfectly fine in the R console, but when I run it via Hadoop streaming it fails in the map phase with the error below. The task attempt logs follow.
The Hadoop streaming command I have:
/home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
/home/Bibhu/hadoop-0.20.2/contrib/streaming/*.jar \
-input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
-output outsid -mapper `pwd`/code1.sh
stderr logs
Loading required package: class
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: read.csv -> read.table
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
syslog logs
2013-07-03 19:32:36,080 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2013-07-03 19:32:36,654 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2013-07-03 19:32:36,675 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2013-07-03 19:32:36,835 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2013-07-03 19:32:36,899 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/home/Bibhu/Downloads/SentimentAnalysis/Sid/smallFile/code1.sh]
2013-07-03 19:32:37,256 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=0/1
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2013-07-03 19:32:38,509 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2013-07-03 19:32:38,557 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2013-07-03 19:32:38,631 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
Write the hadoop-streaming jar with its full version, e.g. hadoop-streaming-1.0.4.jar.
Ship the mapper and reducer scripts to the cluster with the -file option.
Tell Hadoop which scripts are your mapper and reducer with the -mapper and -reducer options (a sketch of the full command follows below).
For more reference, see Running WordCount on Hadoop using R script.
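A hedged sketch of the submission with those points applied; the streaming jar name must match the one shipped with your Hadoop version, and reducer.R is a placeholder for your own reducer script (for a map-only job, drop it and pass -D mapred.reduce.tasks=0 instead):
/home/Bibhu/hadoop-0.20.2/bin/hadoop jar \
  /home/Bibhu/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -input hdfs://localhost:54310/user/Bibhu/BookTE1.csv \
  -output outsid \
  -file `pwd`/code1.sh -mapper code1.sh \
  -file `pwd`/reducer.R -reducer reducer.R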
You need to find the logs from your mappers and reducers, since this is the place where the job is failing (as indicated by java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1). This says that your R script crashed.
If you are using the Hortonworks Hadoop distribution, the easiest way is to open the JobHistory UI. It should be at http://127.0.0.1:19888/jobhistory . It should be possible to find the logs on the filesystem from the command line as well, but I haven't yet found where (a hedged yarn logs sketch is given below).
Open http://127.0.0.1:19888/jobhistory in your web browser
Click on the Job ID of the failed job
Click the number indicating the failed job count
Click an attempt link
Click the logs link
You should see a page which looks something like
Log Type: stderr
Log Length: 418
Traceback (most recent call last):
File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 45, in <module>
mapper()
File "/hadoop/yarn/local/usercache/root/appcache/application_1404203309115_0003/container_1404203309115_0003_01_000002/./mapper.py", line 37, in mapper
for record in reader:
_csv.Error: newline inside string
This is an error from my Python script; errors from R look a bit different.
source: http://hortonworks.com/community/forums/topic/map-reduce-job-log-files/
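As a command-line alternative to the JobHistory UI, here is a hedged sketch using yarn logs; it assumes YARN log aggregation is enabled, and the application id comes from the job's tracking URL or from yarn application -list:
# dump the aggregated container logs, then look for the streaming task's stderr
yarn logs -applicationId application_1404203309115_0003 > app.log
grep -n 'stderr' app.log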
I received this same error tonight, while also developing Map Reduce Streaming jobs with R.
I was working on a 10 node cluster, each with 12 cores, and tried to supply at submission time:
-D mapred.map.tasks=200\
-D mapred.reduce.tasks=200
The job completed successfully though when I changed these to
-D mapred.map.tasks=10\
-D mapred.reduce.tasks=10
This was a mysterious fix, and perhaps more context will arise this evening. But if any readers can elucidate, please do!
