rxHadoopCopyFromLocal from Windows - R

What is the right syntax to copy from Windows to a remote HDFS?
I'm trying to copy a file from my local Windows machine to a remote Hadoop cluster using RStudio:
rxHadoopCopyFromLocal("C:/path/to/file.csv", "/target/on/hdfs/")
This throws:
copyFromLocal '/path/to/file.csv': no such file or directory
Notice that the C:/ prefix disappeared.
This syntax also fails
rxHadoopCopyFromLocal("C:\\path\\to\\file.csv", "/target/on/hdfs/")
with error
-copyFromLocal: Can not create a Path from a null string

This is a common mistake.
It turns out that rxHadoopCopyFromLocal is a wrapper around hadoop fs -copyFromLocal: all it does is copy from the local filesystem of the machine it runs on to an HDFS target.
In this case the compute context had been set to a remote cluster with rxSetComputeContext(remotehost), and on that remote machine there is no C:\path\to\file.csv.
Here are a couple of ways to get the files there.
Configure the local hdfs-site.xml for the remote HDFS cluster
Ensure you have the Hadoop client tools installed on your local machine
Edit your local hdfs-site.xml to point to the remote cluster
Ensure rxSetComputeContext("local")
Run rxHadoopCopyFromLocal("C:\\local\\path\\to\\file.csv", "/target/on/hdfs/")
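A minimal sketch of this first approach (paths are illustrative; assumes the Hadoop client tools and the edited hdfs-site.xml are in place):
# With a local compute context, the C:\ path is resolved on your Windows machine,
# and hdfs-site.xml tells the Hadoop client which remote cluster to write to.
rxSetComputeContext("local")
rxHadoopCopyFromLocal("C:\\local\\path\\to\\file.csv", "/target/on/hdfs/")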
SCP and Remote Compute Context
Copy your file to the remote machine with scp C:\local\path\to\file.csv user@remotehost:/tmp
Ensure rxSetComputeContext(remotehost)
Run rxHadoopCopyFromLocal("/tmp/file.csv", "/target/on/hdfs/")
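Putting the second approach together (host name and paths are illustrative):
# From a Windows command prompt: scp C:\local\path\to\file.csv user@remotehost:/tmp
# Then in R, with the compute context set to the cluster, the source path is
# resolved on the remote machine, where /tmp/file.csv now exists:
rxSetComputeContext(remotehost)
rxHadoopCopyFromLocal("/tmp/file.csv", "/target/on/hdfs/")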

The dev version of dplyrXdf now supports files in HDFS. You can upload a file from the native filesystem as follows; this works both from the edge node and from a remote client.
hdfs_upload("C:\\path\\to\\file.csv", "/target/on/hdfs")
If you have a dataset (an R object) that you want to upload, you can also use the standard dplyr copy_to verb. This will import the data to an Xdf file and upload it, returning an RxXdfData data source pointing to the uploaded file.
txt <- RxTextData("file.csv")
hd <- RxHdfsFileSystem()
hdfs_xdf <- copy_to(hd, txt, name="uploaded_xdf")
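The returned hdfs_xdf can then be used like any other RxXdfData source; for example, a quick check with the standard RevoScaleR helper:
rxGetInfo(hdfs_xdf, getVarInfo = TRUE)   # summary of the Xdf file now sitting in HDFS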

Related

Home directory in CDH 5.10.0

I am running the CDH 5.10.0 VM.
When I create .sql files with gedit from the terminal in /home/cloudera, I can see the file being created under Desktop -> Cloudera's Home, but it does not appear when I run hadoop fs -ls /home/cloudera.
Conversely, when I execute INSERT OVERWRITE INTO DIRECTORY /home/cloudera/somefolder, the output does not show up under Desktop -> Cloudera's Home, but it is displayed when I use hadoop fs -ls /home/cloudera.
Is it a permissions issue, or is my VM corrupted?
The Hadoop file system (HDFS) is separate from your OS file system (the local filesystem), so the path behind Desktop -> Cloudera's Home is completely different from /home/cloudera in HDFS.
Hive in Cloudera is configured to use HDFS by default, so the query you issued:
INSERT OVERWRITE INTO DIRECTORY /home/cloudera/somefolder
ran against HDFS, not your local file system.
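You can see the two namespaces side by side from a terminal on the VM (or via system() from R); a minimal illustration:
system("ls /home/cloudera")              # local Linux filesystem: your .sql files are here
system("hadoop fs -ls /home/cloudera")   # HDFS: a separate directory tree, where Hive wrote its output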

Download code from GCloud VM instance with expired RStudio Server license

I created a Google Compute Engine (virtual machine) instance with RStudio Server, unaware that RStudio Server is licensed software. Now my trial license for RStudio has expired and I can no longer log in to my R sessions.
However, I had written some code that I need to recover. How do I download the files?
I have SSH-ed into the virtual machine but cannot find the relevant files or a way to download them.
I had a similar issue and was able to recover the files with the following steps:
SSH into the virtual machine
Once you are in the virtual machine, run: cd ../rstudio-user/
Run ls there; you will see the file structure you used to see in the RStudio Server interface
Navigate with cd and ls between the folders to get to the desired file
Once you are in the desired location (where ls shows the files you want to recover), run: pwd
In the browser SSH window, click the gear icon and go to Download file
Enter the full path of the file you want to download; it will be something like /home/rstudio-user/FILENAME.R
Click on Download
You can do this for each of the files you want to recover.
If you want to recover a whole folder, it is easier to compress it into a zip file and then download that.
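As a sketch of that last step, assuming plain R is still available on the VM over SSH (only the RStudio IDE is locked) and using hypothetical paths:
setwd("/home/rstudio-user")
zip(zipfile = "recovered.zip", files = "my_project")   # needs a zip binary installed on the VM
Then download /home/rstudio-user/recovered.zip the same way as a single file.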

Load file in ssh server with R

Hello,
I have some files on a remote server (accessible over SSH) that I need to open from my local R session.
Does someone know a way to load these files from the server directly into R, instead of first copying them to my local computer with scp and then loading them?
Thank you for your help

SparkR cannot access to file in workers

This question is basically the same as this one (but in R).
I am developing an R package that uses SparkR. I have created some unit tests (several .R files) in PkgName/inst/tests/testthat using the testthat package. For one of the tests I need to read an external data file, and since it is small and only used in the tests, I read that it can be placed in the same folder as the tests.
When I deploy this with Maven on a standalone Spark cluster, using "local[*]" as master, it works. However, if I use a "remote" Spark cluster (via Docker; the image has Java, Spark 1.5.2 and R), where I create a master at e.g. http://172.17.0.1 and then a worker that is successfully linked to that master, it does not work. It complains that the data file cannot be found, because it seems to look for it using an absolute path that is valid only on my local PC, not on the workers. The same happens if I use only the file name (without the preceding path).
I have also tried delivering the file to the workers with the --files argument to spark-submit, and the file is successfully delivered (apparently it is placed at http://192.168.0.160:44977/files/myfile.dat, although the port changes with every execution). If I try to retrieve the file's location using SparkFiles.get, I get a path (with a random number in one of the intermediate folders), but apparently it still refers to a path on my local machine. If I try to read the file using the path I retrieved, it throws the same error (file not found).
I have set the environment variables like this:
SPARK_PACKAGES = "com.databricks:spark-csv_2.10:1.3.0"
SPARKR_SUBMIT_ARGS = " --jars /path/to/extra/jars --files /local/path/to/mydata.dat sparkr-shell"
SPARK_MASTER_IP = "spark://172.17.0.1:7077"
The INFO messages say:
INFO Utils: Copying /local/path/to/mydata.dat to /tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
INFO SparkContext: Added file file:/local/path/to/mydata.dat at http://192.168.0.160:47110/files/mydata.dat with timestamp 1458995156845
This port changes from one run to another. From within R, I have tried:
fullpath <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
(here I use callJStatic only for debugging purposes) and I get
/tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/ratings.dat
but when I try to read from fullpath in R, I get a FileNotFound exception, probably because fullpath is not the location on the workers. When I try to read simply from "mydata.dat" (without a full path) I get the same error, because R is still trying to read from the local path where my project is placed (it just appends "mydata.dat" to that local path).
I have also tried delivering my R package to the workers (not sure whether this helps or not), following the proper packaging conventions (a JAR file with a strict structure and so on). I get no errors (the JAR file with my R package seems to install on the workers), but still no luck.
Could you help me please? Thanks.
EDIT: I think I was wrong and I don't need to access the file on the workers, just on the driver, because the access is not part of a distributed operation (it is just a call to SparkR::read.df). Anyway, it does not work. But surprisingly, if I pass the file with --files and read it with read.table (from base R utils, not SparkR), passing the full path returned by SparkFiles.get, it works (although this is useless to me). Btw, I'm using SparkR version 1.5.2.
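For reference, a minimal sketch of the combination described in the EDIT (file name and submit arguments are the ones from the question; sparkR.init is the SparkR 1.5 entry point):
Sys.setenv(SPARKR_SUBMIT_ARGS = "--files /local/path/to/mydata.dat sparkr-shell")
library(SparkR)
sc <- sparkR.init(master = "spark://172.17.0.1:7077")
# --files copies the file next to the driver; resolve that local copy and read it with base R
localCopy <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
dat <- read.table(localCopy)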

How can I copy data over the Amazon's EC2 and run a script?

I am a novice at cloud computing, but I get the concept and am pretty good at following instructions. I'd like to run some simulations on my data, and each step takes several minutes. Given the hierarchy in my data, each set takes several hours. I'd like to speed this up by running it on Amazon's EC2 cloud.
After reading this, I know how to launch an AMI, connect to it via the shell, and launch R at the command prompt.
What I'd like help with is copying the data (.rdata files) and a script to the instance and just sourcing the script at the R command prompt. Then, once all the results are written to new .rdata files, I'd like to copy them back to my local machine.
How do I do this?
I don't know much about R, but I do similar things with other languages. What I suggest will probably give you some ideas.
Set up an FTP server on your local machine.
Create a "startup script" that you launch with your instance.
Let the startup script download the R files from your local machine, initialize R and do the calculations, then upload the new files back to your machine.
Startup script:
#!/bin/bash
set -e -x
apt-get update && apt-get install -y curl   # plus any other packages you need
wget -O /mnt/data_old.R ftp://yourlocalmachine:21/r_files
R CMD BATCH /mnt/data_old.R /mnt/data_new.Rout
/usr/bin/curl -T /mnt/data_new.Rout -u user:pass ftp://yourlocalmachine:21/new_r_files
Start instance with a startup script
ec2-run-instances --key KEYPAIR --user-data-file my_start_up_script ami-xxxxxx
First, I'd use Amazon S3 for storing the files, both from your local machine and back from the instance.
As stated before, you can create startup scripts, or even bundle your own customized AMI with all the needed settings and run your instances from it.
So: download the files from a bucket in S3, execute and process them, and finally upload the results back to the same (or a different) bucket in S3.
Assuming the data is small (how big can scripts be, anyway?), S3 would be very cost-effective and easy to use.
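A hedged sketch of that workflow in R using the aws.s3 CRAN package (the package choice, bucket and object names are assumptions, not part of the answer above):
library(aws.s3)   # assumes AWS credentials are configured on the instance
save_object(object = "input/data_old.rdata", bucket = "my-sim-bucket", file = "data_old.rdata")
load("data_old.rdata")                    # pull the inputs into the R session
source("simulation_script.R")             # hypothetical script that writes results.rdata
put_object(file = "results.rdata", object = "output/results.rdata", bucket = "my-sim-bucket")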
