Using sparklyr in RStudio, can I upload a LOCAL csv file to a Spark cluster?

I'm pretty new to cluster computing, so I'm not sure if this is even possible.
I am successfully creating a spark_context in RStudio (using sparklyr) to connect to our local Spark cluster. Using copy_to I can upload data frames from R to Spark, but I am trying to load a locally stored CSV file directly into the Spark cluster using spark_read_csv, without importing it into the R environment first (it's a big 5GB file). It's not working (even when prefixing the location with file:///), and it seems that it can only read files that are ALREADY stored in the cluster.
How do I upload a local file directly to Spark without loading it into R first?
Any tips appreciated.

You cannot. The file has to be reachable from every machine in your cluster, either as a local copy or placed on a distributed file system / object storage.

You can load local files into Spark using the spark_read_csv() method; just make sure you pass the path properly.
Note: it is not necessary to load the data into the R environment first.
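To make the two answers above concrete, here is a minimal sparklyr sketch; the cluster URL and file paths below are hypothetical:

library(sparklyr)

# Hypothetical connection details; adjust the master URL for your cluster.
sc <- spark_connect(master = "spark://my-cluster:7077")

# Works: the CSV already sits on storage every worker can reach (e.g. HDFS).
flights <- spark_read_csv(sc, name = "flights",
                          path = "hdfs:///data/flights.csv",
                          header = TRUE)

# A file:/// path is resolved on the worker nodes, so it only works if that
# exact path exists on every machine in the cluster (or the master is local);
# that is why a CSV that lives only on the RStudio machine is not picked up.
# spark_read_csv(sc, "flights", "file:///home/me/flights.csv")

If the file really exists only on the RStudio machine, copy it to HDFS or other shared storage first (e.g. with hdfs dfs -put), or fall back to reading it into R and using copy_to.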

Related

How to compare the content of local and remote directory via R

I would like to compare a local list of CSV files with a remote list of CSV files on an online server. I would like to use R for this; how can I do it?
I'm already connected via R to the online server, and now I would like to compare the local folder with the online one without downloading the online files.
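No answer was posted here, but as a rough sketch: if the remote server exposes an FTP directory listing (an assumption; the question does not say how the server is reached), the curl package can list it without downloading anything, and base R set operations do the comparison. The folder path and URL below are hypothetical.

library(curl)  # assumption: the remote folder is reachable over FTP

# Local CSV file names
local_csv <- list.files("~/my/local/folder", pattern = "\\.csv$")

# Remote file names, via a bare FTP directory listing (no files downloaded)
h   <- new_handle(dirlistonly = TRUE)
con <- curl("ftp://example.com/remote/folder/", "r", handle = h)
remote_csv <- grep("\\.csv$", readLines(con), value = TRUE)
close(con)

setdiff(local_csv, remote_csv)   # present locally, missing remotely
setdiff(remote_csv, local_csv)   # present remotely, missing locally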

Azure Databricks: How do we access R Scripts present on DBFS?

I'm new to Databricks. I am trying to access a .R file that is present in DBFS storage but I cannot figure out how to do so. Any help is really appreciated.
I can read data from the storage using the /dbfs file path and can also source the script, but I want to make edits to the script.
You need an editor to do that. For example, you can set up RStudio on your cluster and connect to it via the RStudio UI; in that case you can edit R files directly on DBFS.
But really, the simplest option would be to use the Databricks CLI fs command to copy the file to your local machine, make the changes in the editor of your choice, and upload the file back.
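A rough sketch of that copy/edit/upload round trip, driven from R via system(); it assumes the Databricks CLI is installed and configured, and the DBFS path is hypothetical:

# Assumes the Databricks CLI is installed and configured (databricks configure).
# The DBFS path below is hypothetical.
system("databricks fs cp dbfs:/FileStore/scripts/my_script.R my_script.R")

# ... edit my_script.R locally in the editor of your choice ...

system("databricks fs cp my_script.R dbfs:/FileStore/scripts/my_script.R --overwrite")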

Is it possible to write parquet files to local storage from h2o on hadoop?

I'm working with h2o (latest version 3.26.0.10) on a Hadoop cluster. I've read in a parquet file from HDFS and have performed some manipulation on it, built a model, etc.
I've stored some important results in an H2OFrame that I wish to export to local storage, instead of HDFS. Is there a way to export this file as a parquet?
I tried using h2o.exportFile, documented here: http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.exportFile.html, but all the examples are for writing .csv. I tried using a file path with .parquet as the extension and that didn't work: it wrote a file, but I think it was basically a .csv, as it was identical in file size to the .csv.
example: h2o.exportFile(iris_hf, path = "/path/on/h2o/server/filesystem/iris.parquet")
On a related note, if I were to export my H2OFrame to HDFS instead of local storage, would it be possible to write that in parquet format? I could at least then move that to local storage.
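One possible workaround (not h2o.exportFile itself, and only practical if the results frame fits in R's memory) is to pull the H2OFrame back into R and write Parquet locally with the arrow package; the frame name and output path below are hypothetical:

library(h2o)
library(arrow)

# Pull the H2OFrame into an R data.frame (only feasible if it fits in memory),
# then write Parquet to local storage with arrow.
results_df <- as.data.frame(results_hf)   # results_hf: the H2OFrame of interest
write_parquet(results_df, "/local/path/results.parquet")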

How to read parquet files from HDFS in R

I need to read parquet files stored on HDFS (I have a Kerberos-protected Hadoop cluster) in my R program. I came across a couple of packages, but none of them completely satisfies what I need:
rhadoop: it looks like an old project with no further development, and the rhdfs package in that collection does not support parquet files or Kerberos.
arrow: it seems like it can read parquet files, but there is no connectivity to HDFS.
Is there any other library that lets me read parquet files from HDFS in R?
I'm aware of sparklyr, but I believe I would need to install Spark on the machine that runs the Spark driver. Is that correct? My R client is a different machine.
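For what it's worth, the sparklyr route looks roughly like this. It does require a Spark installation reachable from the R session, and it assumes a valid Kerberos ticket has been obtained (e.g. via kinit) before connecting; the HDFS path is hypothetical.

library(sparklyr)

# Assumes Spark on YARN is reachable from this machine and kinit has been run.
sc <- spark_connect(master = "yarn")

events <- spark_read_parquet(sc, name = "events",
                             path = "hdfs:///data/events.parquet")
head(events)

spark_disconnect(sc)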

Starting SparkR session using external config file

I have an RStudio driver instance that is connected to a Spark cluster. I wanted to know if there is any way to connect to the Spark cluster from RStudio using an external configuration file that specifies the number of executors, memory and other Spark parameters. I know we can do it using the command below:
sparkR.session(sparkConfig = list(spark.cores.max = '2', spark.executor.memory = '8g'))
I am specifically looking for a method that takes Spark parameters from an external file to start the SparkR session.
Spark uses a standardized configuration layout, with spark-defaults.conf used for specifying configuration options. This file should be located in one of the following directories:
SPARK_HOME/conf
SPARK_CONF_DIR
All you have to do is configure the SPARK_HOME or SPARK_CONF_DIR environment variable and put your configuration there.
Each Spark installation comes with template files you can use as inspiration.
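A minimal sketch of that setup from the R side; the directory path is hypothetical, and the spark-defaults.conf contents are shown as comments:

# Hypothetical external configuration directory containing spark-defaults.conf
# with lines such as:
#   spark.cores.max        2
#   spark.executor.memory  8g
Sys.setenv(SPARK_CONF_DIR = "/etc/spark/my-conf")

library(SparkR)
sparkR.session()   # picks the settings up from spark-defaults.conf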
