Fastest way to write in HDFS from R (without any package)

I am trying to write some data into HDFS using a custom R map-reduce job. The read side is fairly fast, but the post-processing write takes quite a long time. I have tried the following (functions that can write to a file connection):
output <- file("stdout", "w")
write.table(base,file=output,sep=",",row.names=F)
writeLines(t(as.matrix(base)), con = output, sep = ",", useBytes = FALSE)
However, write.table only writes partial information (the first few rows and the last few rows), and writeLines doesn't work at all. So now I am trying:
for (row in 1:nrow(base)) {
  cat(base[row, ]$field1, ",", base[row, ]$field2, ",", base[row, ]$field3, ",", base[row, ]$field4, ",",
      base[row, ]$field5, ",", base[row, ]$field6, "\n", sep = '')
}
But the writing speed of this is very slow. Here is a log excerpt that shows how slow the writing is:
2016-07-07 08:59:30,557 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406056
2016-07-07 08:59:40,567 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406422
2016-07-07 08:59:50,582 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406710
2016-07-07 09:00:00,947 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407001
2016-07-07 09:00:11,392 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407316
2016-07-07 09:00:21,832 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407683
2016-07-07 09:00:31,883 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408103
2016-07-07 09:00:41,892 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408536
2016-07-07 09:00:51,895 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408969
2016-07-07 09:01:01,903 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409377
2016-07-07 09:01:12,187 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409782
2016-07-07 09:01:22,198 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410161
2016-07-07 09:01:32,293 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410569
2016-07-07 09:01:42,509 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410989
2016-07-07 09:01:52,515 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411435
2016-07-07 09:02:02,525 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411814
2016-07-07 09:02:12,625 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412196
2016-07-07 09:02:22,988 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412616
2016-07-07 09:02:32,991 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413078
2016-07-07 09:02:43,104 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413508
2016-07-07 09:02:53,115 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413975
2016-07-07 09:03:03,122 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414415
2016-07-07 09:03:13,128 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414835
2016-07-07 09:03:23,131 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415210
2016-07-07 09:03:33,143 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415643
2016-07-07 09:03:43,153 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/416031
So I am wondering if I am doing something wrong. I am using data.table.

Based on my experiments with the various file-writing functions, I found the following to be the fastest:
base <- data.table(apply(base, 2, FUN = as.character), stringsAsFactors = F)
x <- sapply(1:nrow(base),
            FUN = function(row) {
              cat(base$field1[row], ",", base$field2[row], ",", base$field3[row], ",",
                  base$field4[row], ",", base$field5[row], ",", base$field6[row], "\n", sep = '')
            })
rm(x)
where x is only there to capture the NULL values that sapply returns, and the apply of as.character is there to keep cat from mangling factors (it would otherwise print the internal factor codes rather than the actual values).
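For comparison, here is a fully vectorized variant that I would expect to be faster still. This is only a sketch of the idea (assuming base already holds the six character columns used above), not a benchmarked drop-in for the answer: build every CSV line with a single paste() call and write the whole vector at once instead of calling cat() once per row.
# Sketch only: vectorized CSV emission to the streaming stdout connection.
# Assumes base is a data.frame/data.table whose columns are already character.
output <- file("stdout", "w")
lines <- do.call(paste, c(unname(as.list(base)), sep = ","))  # one string per row
writeLines(lines, con = output)
flush(output)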

Related

Hive tables not showing in Spark session

If I run DBI::dbGetQuery(sc, "SHOW DATABASES") in R, the only result I get is the default database,
and not the full list of Hive tables created from the hive> command line...
Also, a derby.log file and a metastore_db folder get created in the R project dir.
So my guess is that sparklyr's Spark session is not using the global Hive config...
I'm using Spark 3.3.0, sparklyr 1.7.8 and MySQL for the metastore...
I have tried setting spark.sql.warehouse.dir to the value of Hive's hive.metastore.warehouse.dir, which is "/user/hive/warehouse", and spark.sql.catalogImplementation to "hive".
options(sparklyr.log.console = TRUE)
sc_config <- spark_config()
sc_config$spark.sql.warehouse.dir <- "/user/hive/warehouse"
sc_config$spark.sql.catalogImplementation <- "hive"
sc <- spark_connect(master = "yarn", spark_home = "/home/ml/spark", app_name = "TestAPP", config = sc_config)
sparklyr::hive_context_config(sc)
This is the log with sparklyr.log.console = TRUE:
22/10/18 11:11:43 INFO sparklyr: Session (97754) is starting under 127.0.0.1 port 8880
22/10/18 11:11:43 INFO sparklyr: Session (97754) found port 8880 is available
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) is waiting for sparklyr client to connect to port 8880
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) accepted connection
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) is waiting for sparklyr client to connect to port 8880
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) received command 0
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) found requested session matches current session
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) is creating backend and allocating system resources
22/10/18 11:11:43 INFO sparklyr: Gateway (97754) is using port 8881 for backend channel
22/10/18 11:11:44 INFO sparklyr: Gateway (97754) created the backend
22/10/18 11:11:44 INFO sparklyr: Gateway (97754) is waiting for R process to end
22/10/18 11:11:46 INFO HiveConf: Found configuration file null
22/10/18 11:11:46 INFO SparkContext: Running Spark version 3.3.0
22/10/18 11:11:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/18 11:11:47 INFO ResourceUtils: ==============================================================
22/10/18 11:11:47 INFO ResourceUtils: No custom resources configured for spark.driver.
22/10/18 11:11:47 INFO ResourceUtils: ==============================================================
22/10/18 11:11:47 INFO SparkContext: Submitted application: TestAPP
22/10/18 11:11:47 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 512, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/10/18 11:11:47 INFO ResourceProfile: Limiting resource is cpus at 1 tasks per executor
22/10/18 11:11:47 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/10/18 11:11:48 INFO SecurityManager: Changing view acls to: ml
22/10/18 11:11:48 INFO SecurityManager: Changing modify acls to: ml
22/10/18 11:11:48 INFO SecurityManager: Changing view acls groups to:
22/10/18 11:11:48 INFO SecurityManager: Changing modify acls groups to:
22/10/18 11:11:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ml); groups with view permissions: Set(); users with modify permissions: Set(ml); groups with modify permissions: Set()
22/10/18 11:11:48 INFO Utils: Successfully started service 'sparkDriver' on port 38889.
22/10/18 11:11:48 INFO SparkEnv: Registering MapOutputTracker
22/10/18 11:11:48 INFO SparkEnv: Registering BlockManagerMaster
22/10/18 11:11:48 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/10/18 11:11:48 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/10/18 11:11:48 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/10/18 11:11:49 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-65ec8b4e-6131-4fed-a227-ea5b2162e4d8
22/10/18 11:11:49 INFO MemoryStore: MemoryStore started with capacity 93.3 MiB
22/10/18 11:11:49 INFO SparkEnv: Registering OutputCommitCoordinator
22/10/18 11:11:50 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/10/18 11:11:50 INFO SparkContext: Added JAR file:/home/ml/R/x86_64-pc-linux-gnu-library/4.2/sparklyr/java/sparklyr-master-2.12.jar at spark://master:38889/jars/sparklyr-master-2.12.jar with timestamp 1666116706621
22/10/18 11:11:51 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
22/10/18 11:11:53 INFO Configuration: resource-types.xml not found
22/10/18 11:11:53 INFO ResourceUtils: Unable to find 'resource-types.xml'.
22/10/18 11:11:53 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
22/10/18 11:11:53 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
22/10/18 11:11:53 INFO Client: Setting up container launch context for our AM
22/10/18 11:11:53 INFO Client: Setting up the launch environment for our AM container
22/10/18 11:11:53 INFO Client: Preparing resources for our AM container
22/10/18 11:11:53 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/10/18 11:12:03 INFO Client: Uploading resource file:/tmp/spark-71575ad6-a8f7-43c0-974e-7c751281ef51/__spark_libs__890394313143327111.zip -> file:/home/ml/.sparkStaging/application_1665674177007_0028/__spark_libs__890394313143327111.zip
22/10/18 11:12:07 INFO Client: Uploading resource file:/tmp/spark-71575ad6-a8f7-43c0-974e-7c751281ef51/__spark_conf__9152665720324853254.zip -> file:/home/ml/.sparkStaging/application_1665674177007_0028/__spark_conf__.zip
22/10/18 11:12:08 INFO SecurityManager: Changing view acls to: ml
22/10/18 11:12:08 INFO SecurityManager: Changing modify acls to: ml
22/10/18 11:12:08 INFO SecurityManager: Changing view acls groups to:
22/10/18 11:12:08 INFO SecurityManager: Changing modify acls groups to:
22/10/18 11:12:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ml); groups with view permissions: Set(); users with modify permissions: Set(ml); groups with modify permissions: Set()
22/10/18 11:12:08 INFO Client: Submitting application application_1665674177007_0028 to ResourceManager
22/10/18 11:12:08 INFO YarnClientImpl: Submitted application application_1665674177007_0028
22/10/18 11:12:09 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:09 INFO Client:
client token: N/A
diagnostics: [Tue Oct 18 11:12:08 -0700 2022] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:16384, vCores:16> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; Queue's capacity (absolute resource) = <memory:16384, vCores:16> ; Queue's used capacity (absolute resource) = <memory:0, vCores:0> ; Queue's max capacity (absolute resource) = <memory:16384, vCores:16> ;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1666116728172
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1665674177007_0028/
user: ml
22/10/18 11:12:10 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:11 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:12 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:13 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:14 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:15 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:16 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:17 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:18 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:19 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:20 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:21 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:22 INFO Client: Application report for application_1665674177007_0028 (state: ACCEPTED)
22/10/18 11:12:23 INFO Client: Application report for application_1665674177007_0028 (state: RUNNING)
22/10/18 11:12:23 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.1.82
ApplicationMaster RPC port: -1
queue: default
start time: 1666116728172
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1665674177007_0028/
user: ml
22/10/18 11:12:23 INFO YarnClientSchedulerBackend: Application application_1665674177007_0028 has started running.
22/10/18 11:12:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43035.
22/10/18 11:12:23 INFO NettyBlockTransferService: Server created on master:43035
22/10/18 11:12:23 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/10/18 11:12:23 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master, 43035, None)
22/10/18 11:12:23 INFO BlockManagerMasterEndpoint: Registering block manager master:43035 with 93.3 MiB RAM, BlockManagerId(driver, master, 43035, None)
22/10/18 11:12:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master, 43035, None)
22/10/18 11:12:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master, 43035, None)
22/10/18 11:12:23 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master, PROXY_URI_BASES -> http://master:8088/proxy/application_1665674177007_0028), /proxy/application_1665674177007_0028
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /jobs: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /jobs/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /jobs/job: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /jobs/job/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /stages: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /stages/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /stages/stage: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /stages/stage/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /stages/pool: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /stages/pool/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /storage: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /storage/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /storage/rdd: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /storage/rdd/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /environment: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /environment/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /executors: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /executors/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /executors/threadDump: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /executors/threadDump/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:24 INFO ServerInfo: Adding filter to /static: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /api: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /jobs/job/kill: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /stages/stage/kill: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /metrics/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
22/10/18 11:12:25 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/10/18 11:12:25 INFO SharedState: Warehouse path is 'file:/user/hive/warehouse'.
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /SQL: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /SQL/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /SQL/execution: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /SQL/execution/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO ServerInfo: Adding filter to /static/sql: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
22/10/18 11:12:25 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
22/10/18 11:12:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container from a bad node: container_1665674177007_0028_02_000002 on host: worker1. Exit status: -1000. Diagnostics: [2022-10-18 11:12:26.949]File file:/home/ml/.sparkStaging/application_1665674177007_0028/__spark_libs__890394313143327111.zip does not exist
java.io.FileNotFoundException: File file:/home/ml/.sparkStaging/application_1665674177007_0028/__spark_libs__890394313143327111.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:271)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:68)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:415)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:412)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:412)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:247)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:240)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:228)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
.
22/10/18 11:12:29 INFO BlockManagerMaster: Removal of executor 1 requested
22/10/18 11:12:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 1
22/10/18 11:12:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/10/18 11:12:39 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.9 using Spark classes.
22/10/18 11:12:40 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/user/hive/warehouse
22/10/18 11:12:41 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.82:43560) with ID 2, ResourceProfileId 0
22/10/18 11:12:42 INFO BlockManagerMasterEndpoint: Registering block manager master:40397 with 93.3 MiB RAM, BlockManagerId(2, master, 40397, None)
22/10/18 11:12:49 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.82:43600) with ID 3, ResourceProfileId 0
22/10/18 11:12:50 INFO BlockManagerMasterEndpoint: Registering block manager master:44035 with 93.3 MiB RAM, BlockManagerId(3, master, 44035, None)
And this is the output of sparklyr::hive_context_config(sc): https://pastebin.com/e28KJ4wQ
Any help?
Thanks in advance.
Okay, so I found the solution on this other question.
I added this property to my hive-site.xml and also copied it to $SPARK_HOME/conf/:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>
I also removed all the spark_config() settings that I had tried before.
I would love to know why this was the solution.
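As an aside (my own untested sketch, not part of the original answer): the same metastore URI can in principle be supplied through spark_config() instead of copying hive-site.xml, since Spark forwards spark.hadoop.* keys into the Hadoop/Hive configuration of the session.
# Hedged sketch: point sparklyr at the running Hive metastore from R.
# The property value mirrors the hive-site.xml fix above; paths are the asker's.
library(sparklyr)

sc_config <- spark_config()
sc_config$spark.sql.catalogImplementation <- "hive"
sc_config$spark.hadoop.hive.metastore.uris <- "thrift://localhost:9083"

sc <- spark_connect(master = "yarn",
                    spark_home = "/home/ml/spark",
                    app_name = "TestAPP",
                    config = sc_config)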
When the Spark session is created, you can check the "spark.sql.catalogImplementation" config to see whether it is "hive".
If Spark cannot find the Hive classes, it changes the config to "in-memory".
In org.apache.spark.repl.Main:
if (conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase(Locale.ROOT) == "hive") {
  if (SparkSession.hiveClassesArePresent) {
    // In the case that the property is not set at all, builder's config
    // does not have this value set to 'hive' yet. The original default
    // behavior is that when there are hive classes, we use hive catalog.
    sparkSession = builder.enableHiveSupport().getOrCreate()
    logInfo("Created Spark session with Hive support")
  } else {
    // Need to change it back to 'in-memory' if no hive classes are found
    // in the case that the property is set to hive in spark-defaults.conf
    builder.config(CATALOG_IMPLEMENTATION.key, "in-memory")
    sparkSession = builder.getOrCreate()
    logInfo("Created Spark session")
  }
} else {
  // In the case that the property is set but not to 'hive', the internal
  // default is 'in-memory'. So the sparkSession will use in-memory catalog.
  sparkSession = builder.getOrCreate()
  logInfo("Created Spark session")
}
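Building on that explanation, here is a small sketch (my own addition, not from the answer) of how to check from R which catalog implementation the session actually ended up with:
# Sketch: inspect the catalog implementation after connecting with sparklyr.
library(sparklyr)
library(DBI)

sc <- spark_connect(master = "yarn", spark_home = "/home/ml/spark")

# Spark SQL's SET command echoes a single config key and its current value.
DBI::dbGetQuery(sc, "SET spark.sql.catalogImplementation")

# The SparkContext configuration (including anything passed via spark_config())
# is also visible from R:
sparklyr::spark_context_config(sc)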

Sqoop Hcatalog import job completed but data is not present in the table

I was trying to integrate HCatalog with Sqoop in order to import data from an RDBMS (Oracle) into the data lake (Hive).
sqoop-import --connect connection-string --username username --password pass --table --hcatalog-database data_extraction --hcatalog-table --hcatalog-storage-stanza 'stored as orcfile' -m1 --verbose
The job executed successfully, but I am not able to find the data.
I also checked the location of the table created in HCatalog; no directory had been created for it, and only a 0-byte _$folder$ file was found.
Please find the log output below:
19/09/25 17:53:37 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
19/09/25 17:54:02 DEBUG db.DBConfiguration: Fetching password from job credentials store
19/09/25 17:54:03 INFO db.DBInputFormat: Using read commited transaction isolation
19/09/25 17:54:03 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '1=1' and upper bound '1=1'
19/09/25 17:54:03 INFO mapreduce.JobSubmitter: number of splits:1
19/09/25 17:54:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569355854349_1231
19/09/25 17:54:04 INFO impl.YarnClientImpl: Submitted application application_1569355854349_1231
19/09/25 17:54:04 INFO mapreduce.Job: The url to track the job: http://<PII-removed-by-me>/application_1569355854349_1231/
19/09/25 17:54:04 INFO mapreduce.Job: Running job: job_1569355854349_1231
19/09/25 17:57:34 INFO hive.metastore: Closed a connection to metastore, current connections: 1
19/09/25 18:02:59 INFO mapreduce.Job: Job job_1569355854349_1231 running in uber mode : false
19/09/25 18:02:59 INFO mapreduce.Job: map 0% reduce 0%
19/09/25 18:03:16 INFO mapreduce.Job: map 100% reduce 0%
19/09/25 18:03:18 INFO mapreduce.Job: Job job_1569355854349_1231 completed successfully
19/09/25 18:03:18 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=425637
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=0
S3: Number of bytes written=310154
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=29274
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=14637
Total vcore-milliseconds taken by all map tasks=14637
Total megabyte-milliseconds taken by all map tasks=52459008
Map-Reduce Framework
Map input records=145608
Map output records=145608
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=199
CPU time spent (ms)=4390
Physical memory (bytes) snapshot=681046016
Virtual memory (bytes) snapshot=5230788608
Total committed heap usage (bytes)=1483210752
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 582.8069 seconds (0 bytes/sec)
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Retrieved 145608 records.
19/09/25 18:03:18 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table null
19/09/25 18:03:19 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader#1d548a08
Solved it.
We are using AWS EMR (managed Hadoop service), and it is already mentioned on their site (AWS forum screenshot):
When you use Sqoop to write output to an HCatalog table in Amazon S3, disable Amazon EMR direct write by setting the mapred.output.direct.NativeS3FileSystem and mapred.output.direct.EmrFileSystem properties to false. For more information, see Using HCatalog. You can use the Hadoop -D mapred.output.direct.NativeS3FileSystem=false and -D mapred.output.direct.EmrFileSystem=false commands.
If you don't disable direct write, no error occurs, but the table is created in Amazon S3 and no data is written.
This can be found at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-sqoop-considerations.html

hadoop streaming failed with error code 1 in RHadoop

I am working with RHadoop, using the following code:
Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
library(rJava)
library(rhdfs)
library(rmr2)
hdfs.init()
# Mapper: for each input chunk X, sum the first 79 columns (plus a row count)
# grouped by the class label in column 80.
mapper = function(., X) {
  n    <- nrow(X)
  ones <- matrix(rep(1, n), nrow = n, ncol = 1)
  ag   <- aggregate(cbind(ones, X[, 1:79]), by = list(X[, 80]), FUN = "sum")
  key  <- factor(ag[, 1])
  keyval(key, split(ag[, -1], key))
}
# Reducer: add up the partial per-class sums produced by the mappers.
reducer = function(k, A) {
  keyval(k, list(Reduce('+', A)))
}
GroupSums <- from.dfs(mapreduce(input = "/ISCXFlowMeter.csv", map = mapper,
                                reduce = reducer, combine = T))
When I run this code, I get the following error:
packageJobJar: [/tmp/hadoop-unjar7138506441946536619/] [] /tmp/streamjob6099552934186757596.jar tmpDir=null
2018-06-12 22:40:04,651 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-06-12 22:40:04,945 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-06-12 22:40:05,201 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/uel/.staging/job_1528838017005_0012
2018-06-12 22:40:06,158 INFO mapred.FileInputFormat: Total input files to process : 1
2018-06-12 22:40:06,171 INFO net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:9866
2018-06-12 22:40:06,233 INFO mapreduce.JobSubmitter: number of splits:2
2018-06-12 22:40:06,348 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-06-12 22:40:06,608 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528838017005_0012
2018-06-12 22:40:06,610 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-06-12 22:40:06,945 INFO conf.Configuration: resource-types.xml not found
2018-06-12 22:40:06,945 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-06-12 22:40:07,022 INFO impl.YarnClientImpl: Submitted application application_1528838017005_0012
2018-06-12 22:40:07,249 INFO mapreduce.Job: The url to track the job: http://uel-Deskop-VM:8088/proxy/application_1528838017005_0012/
2018-06-12 22:40:07,251 INFO mapreduce.Job: Running job: job_1528838017005_0012
2018-06-12 22:40:09,301 INFO mapreduce.Job: Job job_1528838017005_0012 running in uber mode : false
2018-06-12 22:40:09,305 INFO mapreduce.Job: map 0% reduce 0%
2018-06-12 22:40:09,337 INFO mapreduce.Job: Job job_1528838017005_0012 failed with state FAILED due to: Application application_1528838017005_0012 failed 2 times due to AM Container for appattempt_1528838017005_0012_000002 exited with exitCode: 127
Failing this attempt.Diagnostics: [2018-06-12 22:40:08.734]Exception from container-launch.
Container id: container_1528838017005_0012_02_000001
Exit code: 127
[2018-06-12 22:40:08.736]Container exited with a non-zero exit code 127. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
/bin/bash: /bin/java: No such file or directory
[2018-06-12 22:40:08.736]Container exited with a non-zero exit code 127. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
/bin/bash: /bin/java: No such file or directory
For more detailed output, check the application tracking page: http://uel-Deskop-VM:8088/cluster/app/application_1528838017005_0012 Then click on links to logs of each attempt. . Failing the application.
2018-06-12 22:40:09,368 INFO mapreduce.Job: Counters: 0
2018-06-12 22:40:09,369 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, : hadoop streaming failed with error code 1
>
The ISCXFlowMeter.csv file loaded into Hadoop is available here: https://www.dropbox.com/s/rbppzg6x2slzcjz/ISCXFlowMeter.csv?dl=1
Could you please guide me on how to rectify this issue?
After a while, I was able to rectify the error by adding the following properties to mapred-site.xml:
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
But the issue now is that the key-value result is NULL after the map-reduce completes. Any help is appreciated.

Understanding a Tsung report: clarification required between Page and Session main statistics

Here is the session I am using:
<sessions>
  <session type="ts_http" name="Test" probability="100">
    <for var="i" to="1" from="1">
      <request subst="true">
        <http version="1.1" contents="%%autoupload:readdata%%" method="POST" url="/UploadFile">
          <http_header name="key" value="testkey"/>
          <http_header name="Filename" value="test.zip"/>
        </http>
      </request>
    </for>
  </session>
</sessions>
The session has only one POST request, so the mean page response time and mean request response time are the same in the Tsung report, as expected.
But I was expecting the mean for the user session to also be nearly the same, differing only by the connection time.
Below is a snapshot of the Tsung report:
Name      highest-10sec-mean  lowest-10sec-mean  Highest-Rate  Mean-Rate   Mean       Count
connect   1.55 sec            4.11 msec          0.5 / sec     0.24 / sec  0.50 sec   47
page      26.35 sec           2.50 sec           0.9 / sec     0.24 / sec  12.83 sec  43
request   26.35 sec           2.50 sec           0.9 / sec     0.24 / sec  12.83 sec  43
session   30.83 sec           6.91 sec           0.9 / sec     0.25 / sec  17.73 sec  44
I wanted to understand what gets added to the session mean time such that the session time is higher than the page/request time.
IIRC, page means a consecutive sequence of requests within a session, without thinktimes/waits. Depending on the load you are configuring, session also includes the work required to get the session started. As starting new sessions is not free, you could try launching 1/10th of the users and letting each user do 10 requests; page and session should then be almost identical.
It's a bit strange, though, that you see an almost 5-second difference in the mean values. Could you provide more details on your environment (OS/Tsung/Erlang versions, the entire configuration, ...)?

What does PipeMapRed do in Hadoop streaming?

I have run a Hadoop job more than once, and every time it takes too much time to finish, about 15 minutes in all.
I checked the syslog and found that org.apache.hadoop.streaming.PipeMapRed was doing something for about 10 minutes; after PipeMapRed was done, MapTask took over and finished in less than 1 minute. What is going on?
What does PipeMapRed do actually? Why is it so time-consuming?
Here is some log printed by PipeMapRed:
17:00:57,307 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=1633/1
17:00:59,782 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10000/8763/0 in:5000=10000/2 [rec/s] out:4381=8763/2 [rec/s]
17:01:07,310 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=60670/59051
17:01:12,610 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=100000/97904/0 in:6666=100000/15 [rec/s] out:6526=97904/15 [rec/s]
17:01:17,332 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=126104/124334
17:01:27,378 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=181681/179714
17:01:30,514 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=200000/198233/0 in:6060=200000/33 [rec/s] out:6007=198233/33 [rec/s]
17:01:37,404 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=244642/242654
The logs you provided are from MapReduce streaming; you can see how many records are being read and written, for example:
R/W/S=10000/8763/0 in:5000=10000/2 [rec/s] out:4381=8763/2 [rec/s]
The first part shows how many records were:
READ/WRITE/SKIPPED=10000/8763/0
The second part shows how fast you are processing records: here you read 5000 records/sec and write 4381 records/sec.
15 minutes per (streaming) MapReduce job is totally OK, if not a little on the low side :)
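To make those numbers concrete, here is a small illustrative R snippet (my own addition, not part of the original answer) that parses one of these counter lines and recovers the throughput figures:
# Illustrative only: parse a PipeMapRed counter line and derive records/sec.
line <- "R/W/S=10000/8763/0 in:5000=10000/2 [rec/s] out:4381=8763/2 [rec/s]"

counts  <- as.numeric(strsplit(sub(".*R/W/S=([0-9/]+) .*", "\\1", line), "/")[[1]])
elapsed <- as.numeric(sub(".*out:[0-9]+=[0-9]+/([0-9]+) .*", "\\1", line))

read_rate  <- counts[1] / elapsed   # 10000 / 2 = 5000 rec/s read
write_rate <- counts[2] / elapsed   # 8763 / 2 ~= 4381 rec/s written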
