I am using the sparklyr library to read and write data from R to HDFS. Reading data works as expected, but writing gives issues.
To be able to use the spark_write_csv function, I need to convert my R data.frames into Spark objects.
I use sparklyr's sdf_copy_to function for this (I also tried copy_to), but I always get errors.
Code:
table1 <- sdf_copy_to(sc, dataframe, overwrite = TRUE)
spark_write_csv(table1, "path")
Error:
Error: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
hdfs://iacchadoopdev01.dap:8020/tmp/Rtmp2gpelH/spark_serialize_62547a7b0f9ad206fd384af04e585deb3a2636ca7b1f026943d4cc1d11c7759a.csv
Has anybody encountered the same problem and knows how to solve it?
A possible reason might be that the sdf_copy_to function stores the data in my Linux /tmp folder, while the write function looks for the data in the HDFS /tmp folder.
I had the same problem. You need to put the .csv into HDFS.
You can do this via the shell.
You log into your cluster via ssh, then use 'put' to copy the .csv into HDFS.
After you have connected to the cluster, run in the shell:
hdfs dfs -put 'path to local file/file.csv' 'path to folder in hdfs of your choosing'
Then you will use the hdfs path to load the file.
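Put together, the workflow might look something like this (a sketch only; the master setting and the HDFS paths are placeholders for your own setup, not part of the answer above):
library(sparklyr)
sc <- spark_connect(master = "yarn-client")   # placeholder; use whatever master your cluster needs

# after running on the cluster:  hdfs dfs -put file.csv /user/me/input/
table1 <- spark_read_csv(sc, name = "table1", path = "hdfs:///user/me/input/file.csv")
spark_write_csv(table1, path = "hdfs:///user/me/output/table1_csv")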
I'm new to Spark and newer to R, and am trying to figure out how to 'include' other R-scripts when running spark-submit.
Say I have the following R script which "sources" another R script:
main.R
source("sub/fun.R")
mult(4, 2)
The second R script, which lives in the sub-directory "sub", looks like this:
sub/fun.R
mult <- function(x, y) {
x*y
}
I can invoke this with Rscript and successfully get it to work.
Rscript main.R
[1] 8
However, I want to run this with Spark, and use spark-submit. When I run spark-submit, I need to be able to set the current working directory on the Spark workers to the directory which contains the main.R script, so that the Spark/R worker process will be able to find the "sourced" file in the "sub" subdirectory. (Note: I plan to have a shared filesystem between the Spark workers, so that all workers will have access to the files).
How can I set the current working directory that SparkR executes in such that it can discover any included (sourced) scripts?
Or, is there a flag/sparkconfig to spark-submit to set the current working directory of the worker process that I can point at the directory containing the R Scripts?
Or, does R have an environment variable that I can set to add an entry to the "R-PATH" (forgive me if no such thing exists in R)?
Or, am I able to use the --files flag to spark-submit to include these additional R-files, and if so, how?
Or is there generally a better way to include R scripts when run with spark-submit?
In summary, I'm looking for a way to include files with spark-submit and R.
Thanks for reading. Any thoughts are much appreciated.
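One possible direction for the working-directory part (a sketch, under the assumption that the driver script is launched via Rscript so that commandArgs() exposes a --file= argument; I have not verified that spark-submit's R runner preserves this): have main.R resolve its own directory and source() relative to it.
# main.R -- resolve the directory this script lives in, then source() relative to it,
# so the driver's working directory no longer matters
args <- commandArgs(trailingOnly = FALSE)
file_arg <- grep("^--file=", args, value = TRUE)
script_dir <- if (length(file_arg) > 0) {
  dirname(normalizePath(sub("^--file=", "", file_arg[1])))
} else {
  getwd()
}
source(file.path(script_dir, "sub", "fun.R"))
mult(4, 2)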
I just started working with SparkR. There is an AWS S3 bucket that is mounted in Databricks.
I would like to use list.files() or dir() to list files that contain a certain pattern in their names. However, I can't figure out the path to use for that. I can read a single file with SparkR::read.df, but I don't know how to find, out of the several thousand files in the bucket, all the ones I am interested in.
Many thanks if you can help!
There are multiple ways to do this. There is an R package called aws.s3 that can help you with that here: https://github.com/cloudyr/aws.s3
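A small sketch of that route (the bucket name "my-bucket" and the pattern "abc" are placeholders, and AWS credentials are assumed to be configured):
library(aws.s3)
# list the object keys in the bucket, then keep the ones whose names match the pattern
keys <- get_bucket_df("my-bucket")$Key
abc_files <- keys[grepl("abc", keys)]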
OR, if you have the aws command line tool installed, you could do a call to it via system from within R. You could include a grep in the call and do something like
listOfBucketsWithABCpattern <- system("aws s3 ls | grep abc", intern = TRUE)
Just to clarify, this isn't a SparkR question so much as it is an R question and an AWS Command Line Interface question.
Hopefully this helps. Cheers!
This question is basically the same as this one (but in R).
I am developing an R package that uses SparkR. I have created some unit tests (several .R files) in PkgName/inst/tests/testthat using the testthat package. For one of the tests I need to read an external data file, and since it is small and only used in the tests, I read that it can be placed in the same folder as the tests.
When I deploy this with Maven in a standalone Spark cluster, using "local[*]" as master, it works. However, if I try using a "remote" Spark cluster (via Docker; the image has Java, Spark 1.5.2 and R), where I create a master at, e.g., http://172.17.0.1 and then a worker that is successfully linked to that master, then it does not work. It complains that the data file cannot be found, because it seems to look for it using an absolute path that is valid only on my local PC but not on the workers. The same happens if I use only the file name (without a preceding path).
I have also tried delivering the file to the workers with the --files argument to spark-submit, and the file is successfully delivered (apparently it is placed at http://192.168.0.160:44977/files/myfile.dat, although the port changes with every execution). If I try to retrieve the file's location using SparkFiles.get, I get a path (with a random number in one of the intermediate folders), but apparently it still refers to a path on my local machine. If I try to read the file using the path I've retrieved, it throws the same error (file not found).
I have set the environment variables like this:
SPARK_PACKAGES = "com.databricks:spark-csv_2.10:1.3.0"
SPARKR_SUBMIT_ARGS = " --jars /path/to/extra/jars --files /local/path/to/mydata.dat sparkr-shell"
SPARK_MASTER_IP = "spark://172.17.0.1:7077"
The INFO messages say:
INFO Utils: Copying /local/path/to/mydata.dat to /tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
INFO SparkContext: Added file file:/local/path/to/mydata.dat at http://192.168.0.160:47110/files/mydata.dat with timestamp 1458995156845
This port changes from one run to another. From within R, I have tried:
fullpath <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
(here I use callJStatic only for debugging purposes) and I get
/tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
but when I try to read from fullpath in R, I get fileNotFound exception, probably because fullpath is not the location in the workers. When I try to read simply from "mydata.dat" (without a full path) I get the same error, because R is still trying to read from my local path where my project is placed (just appends "mydata.dat" to that local path).
I have also tried delivering my R package to the workers (not sure if this may help or not), following the correct packaging conventions (a JAR file with a strict structure and so on). I get no errors (seems the JAR file with my R package can be installed in the workers) but with no luck.
Could you help me please? Thanks.
EDIT: I think I was wrong and I don't need to access the file on the workers but just on the driver, because the access is not part of a distributed operation (it is just a call to SparkR::read.df). Anyway, it does not work. But surprisingly, if I pass the file with --files and read it with read.table (not from SparkR but from base R's utils), passing the full path returned by SparkFiles.get, it works (although this is useless to me). BTW, I'm using SparkR version 1.5.2.
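For reference, a minimal sketch of the combination described in the EDIT as working (the file name mydata.dat is the one used above; everything else is an assumption about the setup):
# spark-submit ... --files /local/path/to/mydata.dat ...
full_path <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
local_data <- read.table(full_path)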
I have a case where I will be running R code on data that will be downloaded from Hadoop. Then, the output of the R code will be uploaded back to Hadoop as well. Currently, I am doing this manually and I would like to avoid the manual downloading/uploading process.
Is there a way I can do this in R by connecting to hdfs? In other words, in the beginning of the R script, it connects to Hadoop and reads the data, then in the end it uploads the output data to Hadoop again. Are there any packages that can be used? Any changes required in Hadoop server or R?
I forgot to note the important part: R and Hadoop are on different servers.
Install the rmr2 package; its from.dfs function can get the data from HDFS, as shown below:
input_hdfs <- from.dfs("path_to_HDFS_file", format = "csv")  # or "native", "text", ..., depending on how the file was written
For storing the results back into HDFS, you can do something like:
write.table(data_output, file = pipe(paste('hadoop dfs -put -', path_to_output_hdfs_file)), row.names = FALSE, col.names = FALSE, sep = ',', quote = FALSE)
(or)
You can use the rmr2 to.dfs function to store the results back into HDFS.
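A minimal sketch of that option, assuming rmr2 is installed and configured against your cluster (the output path and the "csv" format are placeholders; the default format is "native"):
library(rmr2)
# write an R object straight to an HDFS path
to.dfs(data_output, output = "/user/me/output/results", format = "csv")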
So... Have you found a solution for this?
Some months ago I stumbled upon the same situation. After fiddling around for some time with the Revolution Analytics packages, I couldn't find a way to make it work in the situation where R and Hadoop are on different servers.
I tried using webHDFS, which at the time worked for me.
You can find an R package for webhdfs access here.
The package is not available on CRAN; you need to run:
devtools::install_github("saurfang/rwebhdfs")
(yeah... you will need the devtools package)
I am trying to open Multi-sensor precipitation data from EUMETSAT in R. I can only get these data using the GZIP compression method, and the data format type is GRIB. When I download the data I get a tar file.
How can I open these data in R?
I tried to use code
> untar("1098496-1of1")
but got error message
Error in gzfile(path.expand(tarfile), "rb") : cannot open the connection
In addition: Warning message:
In gzfile(path.expand(tarfile), "rb") :
cannot open compressed file '1098496-1of1', probable reason 'No such file or directory'
but when I use the following code:
> dir.create("rainfalldataeumetstatR")
> getwd()
[1] "C:/Users/st/Documents"
> untar("1098496-1of1.tar")
> untar("1098496-1of1.tar", files="rainfalldataeumetstatR")
> list.files("rainfalldataeumetstatR")
I don't get any files in my directory and get the answer:
character(0)
Maybe that error appears because the files inside the tar archive are gz archives?
I, too, have grappled with opening GRIB files in R. You have several problems and can tackle them one by one.
For the untar and gzip issues, work from the command line. I don't know how the tar package is built/packaged by EUMETSAT; does it create a directory and put all the data files in that directory? In that case, put the tarball in a top-level data directory and then:
tar xvf tar_file_name
cd (to the directory that was just created)
gunzip *.gz
Note down the full path name of the files you will want to open for later use.
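If you prefer to stay inside R, roughly the same unpacking can be sketched like this (the exdir argument and the R.utils package are my assumptions, not part of the steps above):
untar("1098496-1of1.tar", exdir = "rainfalldataeumetstatR")
gz_files <- list.files("rainfalldataeumetstatR", pattern = "\\.gz$", full.names = TRUE)
lapply(gz_files, R.utils::gunzip)   # decompress each extracted member in place
list.files("rainfalldataeumetstatR")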
Are the files in GRIB1 or GRIB2? If in GRIB1, you need to install wgrib. If in GRIB2, you need to install wgrib2. Both are available from NCEP.
You can download them from:
http://www.cpc.ncep.noaa.gov/products/wesley/
In R 3.1 and later, you install the rNOMADS package, version 2.0.1 or later.
NOAA National Operational Model Archive and Distribution System (NOMADS) distributes global grid data in GRIB format (currently in GRIB2).
rNOMADS helps you open GRIB1 and GRIB2 data in R by calling wgrib or wgrib2 to decode the binary GRIB data and pipe it (in csv format) for R to read in.
Open up R, load up rNOMADS, and then call the ReadGrib routine using the full path name of your data file in "data_file_name". This is not the way described in the rNOMADS documentation, but it works.
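As an illustration, such a call might look like this (the path, file name, variable and level names are placeholders; check the wgrib inventory of your file for the real ones):
library(rNOMADS)
# read a local GRIB1 file; "APCP" (accumulated precipitation) and "surface" are example names
grib_data <- ReadGrib("C:/Users/st/Documents/rainfalldataeumetstatR/file.grb",
                      levels = "surface",
                      variables = "APCP",
                      file.type = "grib1")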
Installing wgrib and wgrib2 is the only hard part and it may not even be that hard, depending on your system. I'm writing tutorials on how to install wgrib, wgrib2 and use rNOMADS with local data files. When I am done, they will be posted here:
http://rda.ucar.edu/datasets/ds083.2/#!software
Now for some bad news:
You need to open each file sequentially. But, you can extract and save the subfields you need, and then read in the next datafile, overwriting the large data structure into which you read the previous file. If that is too much of a PITA, have you considered using the GRADS tool for displaying GRIB data?
There is no native way to read grib files into R. Use wgrib or wgrib2 depending on whether your file is in grib or grib2 format. I am the package manager for rNOMADS - and trust me, we tried to figure out a simple R way, and ended up dropping it. Maybe the folks at NCEP will do it someday, but it's out of our skill range.
Personally, I untar my files using Cygwin, partly because the wgrib package in Cygwin lets you produce an inventory file so you can tell R what data each layer contains. Under the assumption that the data is GRIB1, rNOMADS can read it directly (through wgrib); GRIB2 requires wgrib2 on your machine, and rNOMADS is working on that challenge.
Alright I recently found a great website that shows how to install wgrib so that it can run in R in conjunction with rNOMADS.
https://bovineaerospace.wordpress.com/2015/04/26/how-to-install-rnomads-with-grib-file-support-on-windows/#comments