I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks
which outputs many csv files in Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is go to the folder p, download all the csv files
to my local drive by right-clicking the p folder and clicking download,
and then read these csv files in R to do any analysis.
My issue is that sometimes my runs can generate more than 10000 csv files,
and downloading them to the local drive takes a lot of time.
I wondered if there is a tutorial or an R package which would help me read in
the csv files from the path above without downloading them. For example,
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files in the same way I do locally?
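For reference, my usual local workflow after downloading looks roughly like this (the local path is just an example):
files <- list.files("C:/local/path/p", pattern = "\\.csv$", full.names = TRUE)
dat <- lapply(files, read.csv)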
EDIT:
the full url to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official Azure Databricks document CSV Files, you can directly read a csv file in R in an Azure Databricks notebook, as shown in the R example of its section Read CSV files notebook example.
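For instance, with SparkR in a Databricks R notebook, a minimal sketch would look roughly like this (assuming the container is mounted at the illustrative DBFS path below):
library(SparkR)
# read all csv files under the folder into a Spark DataFrame, then collect into R
sdf <- read.df("dbfs:/mnt/myfolder/subfolder/output/old/p/",
               source = "csv", header = "true", inferSchema = "true")
df <- collect(sdf)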
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to directly read a csv file from a blob url with a SAS token of Azure Blob Storage.
Here are my steps below.
1. I created an R notebook in the Azure Databricks workspace.
2. Install the R package reticulate via install.packages("reticulate").
3. Install the Python package azure-storage-blob as shown in the code below.
%sh
pip install azure-storage-blob
4. Run a Python script to generate a container-level SAS token and use it to get a list of blob urls with the SAS token; please see the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)
# generate a read-only, container-level SAS token valid for one hour
sas_token = blob_service.generate_container_shared_access_signature(
    container_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1)
)
# list all blobs under the folder and build their urls with the SAS token
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://' + account_name + '.blob.core.windows.net/' +
                      container_name + '/' + blob_name + '?' + sas_token
                      for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
5. Now, I can use different ways in R to read a csv file from a blob url with the SAS token, such as below.
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
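And since a run can produce thousands of csv files, here is a small sketch (reusing blob_urls_with_sas from step 4, and assuming all the files share the same columns) to read and combine them all:
library(data.table)
# read every blob url with the SAS token and stack the results into one table
df_all <- rbindlist(lapply(blob_urls_with_sas, fread))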
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.
Is there a way to read an open Excel file into R?
When an Excel file is open in Excel, Excel puts a lock on the file, so that the reading method in R cannot access the file.
Can you circumvent this lock?
Thanks
Edit: this occurs under Windows with the original desktop Excel.
I too do not have a problem opening xlsx files that are already open in Excel, but if you do, I have a workaround that might work:
path_to_xlsx <- "C:/Some/Path/to/test.xlsx"
temp <- tempdir()
file.copy(path_to_xlsx, to = paste0(temp, "/test.xlsx"))
df <- openxlsx::read.xlsx(paste0(temp, "/test.xlsx"))
This copies the file (which should not be blocked) to a temporary directory, and then loads the file from there. Again, I'm not sure if this is needed, as I do not have the problem you have.
You could try something like this using the ps package. I've used it on Windows and Mac to read files that I had downloaded from some web resource and opened in Excel, reading them with openxlsx2, but it should work with other packages or programs too.
# get the path to the open file via the ps package
library(ps)
p <- ps()
# get the pid for the current program, in my case Excel on Mac
ppid <- p$pid[grepl("Excel", p$name)]
# get the list of open files for that program
pfiles <- ps_open_files(ps_handle(ppid))
pfile <- pfiles[grepl(".xlsx", pfiles$path),]
# filter out Excel's temporary "~" lock files and return the path
sel <- grepl("^(.|[^~].*)\\.xlsx", basename(pfile$path))
path <- pfile$path[sel]
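You can then read from that path while the workbook is still open, e.g. df <- openxlsx2::read_xlsx(path) (openxlsx2 is just what I used; any reader should do, and note that R sees the last saved version on disk).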
What do you mean by "the reading method in R", and by "cannot access the file" (i.e. what code are you using and what error message do you get exactly)? I'm successfully importing Excel files that are currently open, with something like:
dat <- readxl::read_excel("PATH/TO/FILE.xlsx")
If the file is being edited in Excel, R imports the last saved version.
EDIT: I've now tried it on both Linux and Windows and it still works, at least with version 1.3.1 of 'readxl'.
In continuation of my unsolved post, I have altered my requirement to writing a file (a .jpeg in this specific example) to OneDrive with the help of an R script.
#Loading required packages
library(outbreaks)
library(incidence)
library(RCurl)
#Data transformation
cases = subset(nipah_malaysia, select = c("perak", "negeri_sembilan", "selangor",
"singapore"))
i = as.incidence(cases, dates = nipah_malaysia$date, interval = 7L)
#Saving to the local working directory
jpeg("plot.jpeg")
#Trying to upload using an absolute path. However, doesn't work.
#jpeg(file = "https://1drv.ms/u/s!AtWMPT_CT0l3hB6flgne1OHU34SV?e=3zAJfL/plot.jpg")
plot(i)
dev.off()
Created on 2020-07-17 by the reprex package (v0.3.0)
Below is the error when I use OneDrive's absolute path (I've used a personal account for testing, but an organization account is used in real-time).
Error:
Error in jpeg(file = "https://1drv.ms/u/s!AtWMPT_CT0l3hB6flgne1OHU34SV?e=3zAJfL/plot.jpg") :
unable to start jpeg() device
In addition: Warning messages:
1: In jpeg(file = "https://1drv.ms/u/s!AtWMPT_CT0l3hB6flgne1OHU34SV?e=3zAJfL/plot.jpg") :
unable to open file 'https://1drv.ms/u/s!AtWMPT_CT0l3hB6flgne1OHU34SV?e=3zAJfL/plot.jpg' for writing
2: In jpeg(file = "https://1drv.ms/u/s!AtWMPT_CT0l3hB6flgne1OHU34SV?e=3zAJfL/plot.jpg") :
opening device failed
Of course, from the error, I understood that uploading is not so straightforward. I did some research and came across a few articles such as:
https://community.powerbi.com/t5/Service/R-script-Write-to-csv-file-on-Onedrive/td-p/487374
Cannot write csv file on a secured OneDrive folder using R script in Power BI without OneDrive API
I learned that the R script needs to talk to OneDrive through its API, but I'm uncertain how to accomplish the task. I did find the OneDrive API for uploading small files.
Any guidance or suggestions in implementing the OneDrive API using R would be highly appreciated.
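For concreteness, here is my rough sketch of what calling that upload-small-file endpoint from R might look like (I am not sure this is right; token stands for an OAuth2 access token I would still need to obtain, e.g. via an Azure AD app registration, and the target path is illustrative):
library(httr)
# PUT the saved plot to OneDrive via the Microsoft Graph
# "upload small file" endpoint (for files up to about 4 MB)
res <- PUT(
  "https://graph.microsoft.com/v1.0/me/drive/root:/plot.jpeg:/content",
  add_headers(Authorization = paste("Bearer", token)),
  content_type("image/jpeg"),
  body = upload_file("plot.jpeg")
)
status_code(res)  # expect 200 or 201 on success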
Overview:
Azure HDInsight
Cluster Type: ML Services (R Server)
Version: R Server 9.1 (HDI 3.6)
I am trying to import a csv file from an Azure storage blob into the R server environment, but it's obviously not as easy as I thought it would be, nor as easy as doing it locally.
The first thing I tried was installing the sparklyr package and setting up a connection.
#install.packages("devtools")
#devtools::install_github("rstudio/sparklyr")
install.packages("sparklyr")
library(sparklyr)
sc <- spark_connect(master = "yarn")
But due to the old Spark version installed on HDI, there's an error message.
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
sparklyr does not currently support Spark version: 2.1.1.2.6.2.38
Then I tried to use rxSparkConnect, but that didn't work either.
#Sys.setenv(SPARK_HOME_VERSION="2.1.1.2.6.2.38-1")
cc <- rxSparkConnect(interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)
# wasb path format: wasb://<container>@<storage account>.blob.core.windows.net
origins <- file.path("wasb://CONTAINERNAME@STORAGENAME.blob.core.windows.net", "FILENAME.csv")
spark_read_csv(sc, path = origins, name = "df")
How would you read a csv file from an Azure storage blob into the R server environment?
I'm a little upset at myself that this is taking so long; it shouldn't be this complicated. Please help me out, guys! Thanks in advance!
related post 1
related post 2
I found that an imperfect workaround is to upload the data to the "local" environment in the bottom-right corner and simply read the csv file from there.
There's gotta be a better way to do it: it's a lot of manual work, probably impractical if the data size is big, and a waste of blob storage.
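One direction that might be better is to skip Spark entirely and read the blob directly with the AzureStor package; a sketch (assuming I have the storage account key; all names below are placeholders):
library(AzureStor)
# connect to the blob endpoint with the account key and read the csv into R
endp <- storage_endpoint("https://STORAGENAME.blob.core.windows.net", key = "ACCOUNTKEY")
cont <- storage_container(endp, "CONTAINERNAME")
df <- storage_read_csv(cont, "FILENAME.csv")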
I'm using RStudio Server version 0.99.903 and I want to create a script to export a data frame to my local machine running Windows 7. I can successfully export the data frame manually to the hard drive following the steps described HERE.
I have searched SO extensively. So far nothing has worked.
Any help would be greatly appreciated.
This is a guess, since I have no access to Windows 7, but it's based on this:
write_to_desktop <- function(df, fn = "tmp.csv", ...) {
  # Windows stores the user name in USERNAME, and home directories live under C:/Users
  dskpath <- file.path("C:", "Users", Sys.getenv("USERNAME"), "Desktop", fn)
  write.csv(df, file = dskpath, ...)
}
You could modify this to guess the file name from the name of the data frame (fn <- paste0(deparse(substitute(df)), ".csv")), or to write an .rda file (save()) or an .rds file (saveRDS()), as in the sketch below.
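For instance, a minimal sketch of the name-guessing variant (same guesses about the Windows paths as above):
write_to_desktop2 <- function(df, ...) {
  # derive the file name from the name of the data frame argument
  fn <- paste0(deparse(substitute(df)), ".csv")
  dskpath <- file.path("C:", "Users", Sys.getenv("USERNAME"), "Desktop", fn)
  write.csv(df, file = dskpath, ...)
}
# usage (writes mydata.csv to the desktop):
# write_to_desktop2(mydata)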
My company recently converted to SAS and did not buy the SAS SHARE license, so I cannot ODBC into the server. I am not a SAS user, but I am writing a program that needs to query data from the server, and I want to have my R script call a .sas program to retrieve the data. I think this is possible using
df <- system("sas -SYSIN path/to/sas/script.sas")
but I can't seem to make it work. I have spent a few hours on Google and decided to ask here.
error message:
running command 'sas -SYSIN C:/Desktop/test.sas' had status 127
Thanks!
Assuming your sas program generates a sas dataset, you'll need to do two things:
Through shell or system, make SAS run the program, but first cd into the directory containing the SAS executable, in case that directory isn't in your PATH environment variable.
setwd("c:\\Program Files\\SASHome 9.4\\SASFoundation\\9.4\\")
return.code <- shell("sas.exe -SYSIN c:\\temp\\myprogram.sas")
Note that what this returns is NOT the data itself, but the code issued by the OS telling you whether the task succeeded or not. A code of 0 means the task succeeded.
In the sas program, all I did was to create a copy of sashelp.baseball in the c:\temp directory.
Import the generated dataset into R using one of the packages written for that. haven is the most recent and, IMO, the most reliable one.
# Install haven from CRAN:
install.packages("haven")
# Import the dataset:
library(haven)
myData <- read_sas("c:\\temp\\baseball.sas7bdat")
And there you should have it!