The object in the S3 bucket is 5.3 GB in size. To convert the object into data in R, I used get_object("link to bucket path"), but this leads to memory issues.
So I installed Spark 2.3.0 with RStudio and am trying to load this object directly into Spark, but I don't know the command to do that.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
If I convert the object into a readable data type (such as a data.frame/tbl in R), I would use copy_to to transfer the data into Spark from R, as below:
# Copy data to Spark
spark_tbl <- copy_to(sc, data)
I was wondering: how can I convert the object inside Spark?
Relevant links:
https://github.com/cloudyr/aws.s3/issues/170
Sparklyr connection to S3 bucket throwing up error
Any guidance would be sincerely appreciated.
Solution.
I was trying to read a 5.3 GB csv file from an S3 bucket. Since get_object pulls the whole object into a single R process, it was giving memory issues (IO exceptions).
The solution is to load sparklyr in R (library(sparklyr)) and let Spark read the file, so all the cores in the computer are utilized and the data never has to fit into R's memory.
get_object("link to bucket path") can be replaced by
spark_read_csv("link to bucket path"). Because the data is loaded into Spark rather than into R, there are no memory issues.
Also, depending on the file format, you can use a different function:
spark_load_table, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text
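For example, here is a minimal sketch of reading the csv straight from S3 with sparklyr. The bucket path and table name are placeholders, and it assumes the cluster already has the s3a connector and AWS credentials configured; this is not from the original post:
library(sparklyr)
# Connect to Spark (local mode here; point master at your cluster if you have one)
sc <- spark_connect(master = "local")
# Hypothetical S3 path -- replace with your own bucket and key
s3_path <- "s3a://my-bucket/path/to/file.csv"
# Read the csv directly into Spark; nothing is pulled into R's memory
spark_tbl <- spark_read_csv(sc, name = "my_data", path = s3_path, header = TRUE)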
Related
Overview:
Azure HDInsight
Cluster Type: ML Services (R Server)
Version: R Server 9.1 (HDI 3.6)
I am trying to import a csv file from Azure blob storage into the R Server environment. But it's obviously not as easy as I thought it would be, or as easy as doing it locally.
The first thing I tried was installing the sparklyr package and setting up a connection.
#install.packages("devtools")
#devtools::install_github("rstudio/sparklyr")
install.packages("sparklyr")
library(sparklyr)
sc <- spark_connect(master = "yarn")
But due to the old Spark version installed on HDI, there's an error message.
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
sparklyr does not currently support Spark version: 2.1.1.2.6.2.38
Then I tried to use rxSparkConnect, but that didn't work either.
#Sys.setenv(SPARK_HOME_VERSION="2.1.1.2.6.2.38-1")
cc <- rxSparkConnect(interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)
origins <- file.path("wasb://STORAGENAME@CLUSTERNAME.blob.core.windows.net", "FILENAME.csv")
spark_read_csv(sc, path = origins, name = "df")
How would you read a csv file from Azure blob storage into the R Server environment?
I'm a little upset at myself that this is taking so long, and it shouldn't be this complicated. Please help, guys! Thanks in advance!
related post 1
related post 2
I found an imperfect workaround: upload the data into the "local" environment in the bottom right corner and simply read the csv file from there.
There's gotta be a better way to do it, since it's a lot of manual work, probably impractical if the data size is big, and it's a waste of blob storage.
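For reference, a rough sketch of that workaround (the file name is a placeholder, and pushing the data on into Spark with copy_to is my own addition, not from the original post):
# Read the csv that was manually uploaded to the "local" environment
local_df <- read.csv("FILENAME.csv", stringsAsFactors = FALSE)
# Once a Spark connection (sc) is available, the data can optionally be copied into Spark
spark_tbl <- copy_to(sc, local_df, name = "df", overwrite = TRUE)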
I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks
which outputs many csv files in Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is go to the folder p, download all the csv files
by right clicking the p folder and clicking download to my local drive,
and then read these csv files in R to do any analysis.
My issue is that sometimes my runs can generate more than 10000 csv files,
and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial/R package which helps me read in
the csv files from the path above without downloading them. For example,
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files the same way I do locally?
EDIT:
The full URL of the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official Azure Databricks document CSV Files, you can directly read a csv file in R in an Azure Databricks notebook, as the R example in its Read CSV files notebook example section shows.
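A minimal sketch along those lines, using SparkR inside a Databricks R notebook (the DBFS path below points at the public databricks-datasets sample, not at the asker's data):
library(SparkR)
# Read a csv that already lives in DBFS into a Spark DataFrame
df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
              source = "csv", header = "true", inferSchema = "true")
head(df)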
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to directly read a csv file from a blob url with a sas token of Azure Blob Storage.
Here are my steps:
1. I created an R notebook in my Azure Databricks workspace.
2. I installed the R package reticulate via install.packages("reticulate").
3. I installed the Python package azure-storage-blob as below.
%sh
pip install azure-storage-blob
4. I ran a Python script to generate a container-level sas token and used it to get a list of blob urls with the sas token; see the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

# Generate a container-level sas token that is valid for one hour
sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))

# List all blobs under the folder and build a url with the sas token for each
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
5. Now I can use different ways in R to read a csv file from a blob url with the sas token, such as the following.
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
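Since the original question mentions thousands of csv files, a possible follow-up (my own suggestion, not part of the steps above) is to loop over all the urls and bind the results:
library(readr)
library(dplyr)
# Read every blob url built in step 4 and combine into one data frame.
# Note: this still transfers the data over HTTPS, just without any manual clicking.
all_dfs <- lapply(blob_urls_with_sas, read_csv)
combined <- bind_rows(all_dfs)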
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.
Update for your quick question:
We have generated parquet files, one with Dask (Python) and another with R Drill (using the sergeant package). They use different implementations of parquet; see my other parquet question.
We are not able to cross-read the files (Python can't read the R file and vice versa).
When reading the Python parquet file in the R environment we receive the following error: system error: IllegalStateException: UTF8 can only annotate binary field.
When reading the R/Drill parquet file in Dask we get FileNotFoundError: [Errno 2] no such file or directory ...\_metadata (which is self explanatory).
What are the options for cross-reading parquet files between R and Python?
Any insights would be appreciated.
To read drill-like parquet data-sets with fastparquet/dask, you need to pass a list of the filenames, e.g.,
import glob
import dask.dataframe as dd

# Collect the individual part files and read them as one dask dataframe
files = glob.glob('mydata/*/*.parquet')
df = dd.read_parquet(files)
The error from going in the other direction might be a bug, or (gathering from your other question) it may indicate that you used fixed-length strings, which drill/R doesn't support.
I am working with big data and I have a 70GB JSON file.
I am using the jsonlite library to load the file into memory.
I have tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load, but R breaks with the error:
Error: cons memory exhausted (limit reached?)
after loading 1,116,500 records.
Thinking that I did not have enough RAM, I tried to load the same JSON on a bigger EC2 machine with 1.95 TB of RAM.
The process still broke after loading 1,116,500 records. I am using R version 3.1.1 and executing it with the --vanilla option. All other settings are default.
here is the code:
library(jsonlite)
data <- jsonlite::stream_in(file('one.json'))
Any ideas?
There is a handler argument to stream_in that lets you process the data in pages as it streams in, so you can write the parsed records out to a file or filter out the unneeded data instead of keeping everything in memory.
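A minimal sketch of that idea (the output file and the columns kept are hypothetical placeholders):
library(jsonlite)
out <- file("filtered.json", open = "w")
# The handler is called once per page of parsed records; keep only what you
# need and stream it back out instead of accumulating everything in memory.
process_page <- function(df) {
  keep <- df[, c("id", "value"), drop = FALSE]  # hypothetical columns
  stream_out(keep, con = out, verbose = FALSE)
}
stream_in(file("one.json"), handler = process_page, pagesize = 10000)
close(out)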
My question is about the feasibility of running a SparkR program in Spark without an R dependency.
In other words, can I run the following program in Spark when there is no R interpreter installed on the machine?
#set env var
Sys.setenv(SPARK_HOME="/home/fazlann/Downloads/spark-1.5.0-bin-hadoop2.6")
#Tell R where to find sparkR package
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
#load sparkR into this environment
library(SparkR)
#create the sparkcontext
sc <- sparkR.init(master = "local")
#to work with DataFrames we will need a SQLContext, which can be created from the SparkContext
sqlContext <- sparkRSQL.init(sc)
name <- c("Nimal","Kamal","Ashen","lan","Harin","Vishwa","Malin")
age <- c(23,24,12,25,31,22,43)
child <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE)
localdf <- data.frame(name,age,child)
#convert R dataframe into spark DataFrame
sparkdf <- createDataFrame(sqlContext,localdf);
#since we are passing a spark DataFrame into head function, the method gets executed in spark
head(sparkdf)
No, you can't. You'll need to install R and also the needed packages; otherwise your machine won't know that it needs to interpret R.
Don't try to ship your R interpreter inside the application you are submitting, as the uber application will be excessively heavy to distribute across your cluster.
You'll need a configuration management system that allows you to define the state of your IT infrastructure, then automatically enforces the correct state.
No. SparkR works by having an R process communicate with the Spark JVM backend. You will still need R installed on your machine, just as you need a JVM installed.