In Python, this is how I would access a CSV in Azure Blob Storage:
storage_account_name = "testname"
storage_account_access_key = "..."
file_location = "wasb://example#testname.blob.core.windows.net/testfile.csv"
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
df = spark.read.format('csv').load(file_location, header = True, inferSchema = True)
How can I do this in R? I cannot find any documentation...
The AzureStor package provides an R interface to Azure storage, including files, blobs and ADLSgen2.
endp <- storage_endpoint("https://acctname.blob.core.windows.net", key="access_key")
cont <- storage_container(endp, "mycontainer")
storage_download(cont, "myblob.csv", "local_filename.csv")
Note that this will download to a file in local storage. From there, you can ingest into Spark using standard Sparklyr methods.
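For example, a minimal sketch of that sparklyr step (assuming a local Spark connection and a CSV with a header row; the Spark table name is arbitrary):
library(sparklyr)
# Connect to a local Spark instance (adjust master/config for your cluster)
sc <- spark_connect(master = "local")
# Read the file written by storage_download() into a Spark DataFrame
df <- spark_read_csv(sc, name = "myblob", path = "local_filename.csv",
                     header = TRUE, infer_schema = TRUE)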
Disclaimer: I'm the author of AzureStor.
If you do not want to keep a permanent local copy, download to a tempfile and then read from it:
endp <- storage_endpoint("https://acctname.blob.core.windows.net", key="access_key")
cont <- storage_container(endp, "mycontainer")
fname <- tempfile()
storage_download(cont, "myblob.csv", fname)
df = read.csv(fname)
I am creating an application in R Shiny where I take inputs from users and store them in an SQLite database on the backend. But my concern is that my form has one file-upload input which accepts files like .pdf, .jpeg and .png (screenshot below).
If a user uploads a file with that input, I want the file to be stored in my SQLite database table for further use, but I am not aware of how to achieve this in R.
Any help would be appreciated.
You can store objects (any R output that is not tabular, a model object for example) as BLOBs in SQLite. In R, use serialize/unserialize for this, but first you need to read the raw PDF with readBin. Here is an example:
path <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
# see the PDF
browseURL(path)
# Read raw PDF
pdf <- readBin(con = path, what = raw(), n = file.info(path)$size)
library(RSQLite)
# Connect to DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Serialize raw pdf
pdf_serialized <- serialize(pdf, NULL)
# Create a data.frame
df <- data.frame(pdfs = I(list(pdf_serialized)))
# Write the table in database
dbWriteTable(conn = con, name = "mytable", value = df, overwrite = TRUE)
# Read your table
out <- dbReadTable(conn = con, name = "mytable")
# unserialize
pdf_out <- unserialize(out$pdfs[[1]])
# Write the PDF in a temporary file
tmp <- tempfile(fileext = ".pdf")
writeBin(object = pdf_out, con = tmp)
# open it
browseURL(tmp)
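To tie this back to the Shiny upload input from the question: fileInput returns a data frame whose datapath column points to the uploaded temporary file, so the same readBin/serialize approach applies. A minimal sketch (the input ID "upload", the connection con and the table name "uploads" are assumptions, not from the original post):
# Inside the Shiny server function; "upload" is the fileInput ID
observeEvent(input$upload, {
  path <- input$upload$datapath
  raw_file <- readBin(path, what = raw(), n = file.info(path)$size)
  df <- data.frame(name = input$upload$name,
                   data = I(list(serialize(raw_file, NULL))))
  dbWriteTable(con, "uploads", df, append = TRUE)
})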
Similar to Victorp's answer, you can also base64 encode the data and store it as text in a database:
file <- "mypdf.pdf"
con <- dbConnect(...)
table <- "mypdf_table"
bin <- readBin(file, raw(), n = file.size(file))
enc_data <- base64enc::base64encode(bin)
dd <- data.frame(
file = file,
data = enc_data
)
DBI::dbWriteTable(con, table, dd, append = TRUE)
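Reading the file back is the reverse: decode the stored text with base64enc::base64decode and write the raw bytes to a file (a sketch under the same assumptions):
out <- DBI::dbReadTable(con, table)
bin_out <- base64enc::base64decode(out$data[1])
tmp <- tempfile(fileext = ".pdf")
writeBin(bin_out, tmp)
browseURL(tmp)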
I am trying to read a CSV from an AWS S3 bucket. It's the same file that I was able to write to the bucket. When I read it, I get an error. Below is the code for reading the CSV:
s3BucketName <- "pathtobucket"
Sys.setenv("AWS_ACCESS_KEY_ID" = "aaaa",
"AWS_SECRET_ACCESS_KEY" = "vvvvv",
"AWS_DEFAULT_REGION" = "us-east-1")
bucketlist()
games <- aws.s3::get_object(object = "s3://path/data.csv", bucket = s3BucketName)%>%
rawToChar() %>%
readr::read_csv()
Below is the error I get
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>_data.csv</Key><RequestId>222</RequestId><HostId>333=</HostId></Error>
For reference below is how I used to write the data to the bucket
s3write_using(data, FUN = write.csv, object = "data.csv", bucket = s3BucketName)
You don't need to include the protocol (s3://) or the bucket name in the object parameter of get_object, just the object key (the filename plus any prefixes).
You should be able to do something like:
games <- aws.s3::get_object(object = "data.csv", bucket = s3BucketName)
I am trying to understand how to connect R to Redshift using Spark. I can't connect using plain RPostgres because the dataset is huge and needs distributed computing.
So far I am able to read and write CSVs from S3 into a Spark DataFrame. Can someone please show how to configure the JARs and other settings so that I can connect to Redshift with sparklyr (spark_read_jdbc()) or SparkR?
It would also be helpful if you could show how to add JARs to a SparkContext.
So far I have figured out that Databricks provides some JARs that are needed to access a Redshift database over its JDBC URL.
rm(list=ls())
library(sparklyr)
#library(SparkR)
#detach('SparkR')
Sys.setenv("SPARK_MEM" = "15G")
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "8G"
config$`sparklyr.shell.executor-memory` <- "8G"
config$`spark.yarn.executor.memoryOverhead` <- "6GB"
config$`spark.dynamicAllocation.enabled` <- "TRUE"
# Put the Redshift JDBC driver on the driver classpath
config$`sparklyr.shell.driver-class-path` <- "/home/root/spark/spark-2.1.0-bin-hadoop2.7/jars/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar"
# Use a dedicated temp directory for Spark
spark_dir <- "/tmp/spark_temp"
config$`sparklyr.shell.driver-java-options` <- paste0("-Djava.io.tmpdir=", spark_dir)
sc <- spark_connect(master = "local[*]", config = config)
#sc <- spark_connect(master = "local")
###invoke the spark context
ctx <- sparklyr::spark_context(sc)
# Create a JavaSparkContext ("org.apache.spark.api.java.JavaSparkContext") from the Spark context
jsc <- sparklyr::invoke_static( sc, "org.apache.spark.api.java.JavaSparkContext", "fromSparkContext",ctx )
##invoke the hadoop context
hconf <- jsc %>% sparklyr::invoke("hadoopConfiguration")
hconf %>% sparklyr::invoke("set","fs.s3a.access.key","<your access key for s3>")
hconf %>% sparklyr::invoke("set","fs.s3a.secret.key", "<your secret key for s3>")
hconf%>% sparklyr::invoke("set","fs.s3a.endpoint", "<your region of s3 bucket>")
hconf %>% sparklyr::invoke("set","com.amazonaws.services.s3.enableV4", "true")
hconf %>% sparklyr::invoke("set","spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf %>% sparklyr::invoke("set","fs.s3a.impl.disable.cache", "true")
### reading from S3 buckets
spark_read_csv(sc=sc,name='sr',path="s3a://my-bucket/tmp/2district.csv",memory = TRUE)
spark_read_csv(sc=sc,name='sr_disk3',path="s3a://my-bucket/tmp/changed/",memory = FALSE)
###reading from local drive
spark_read_csv(sc=sc,name='raw_data_loc_in3',path="/tmp/distance.csv",memory = TRUE)
spark_read_csv(sc=sc,name='raw_data_loc_in5',path="/tmp/distance.csv",memory = TRUE)
####reading from redshift table
t<-sparklyr::spark_read_jdbc(sc, "connection", options = list(
url = "jdbc:redshift://<URL>:<Port>/<dbName>",
user = "<user_name>",
password = "<password>",
  dbtable = '(select * from sales limit 1000) as t',  # a subquery passed as dbtable generally needs an alias
tempS3Dir = "s3a://my-bucket/migration"),memory = T,overwrite = T,repartition = 3)
#### write the Spark DataFrame to a local CSV
sparklyr::spark_write_csv(t,path='sample.csv')
#### write the Spark DataFrame to CSV in S3
sparklyr::spark_write_csv(t,path='s3a://my-bucket/output/')
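On the question of adding JARs to the Spark context: with sparklyr, any sparklyr.shell.* entry in spark_config() is passed straight through to spark-submit, so JARs can also be attached before connecting. A sketch (the JAR path is a placeholder):
config <- spark_config()
# Passed to spark-submit as --jars (comma-separated list of JAR paths)
config$`sparklyr.shell.jars` <- "/path/to/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar"
sc <- spark_connect(master = "local[*]", config = config)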
Frequently when working with files in IBM Cloud Object Storage from a Watson Studio notebook, I need to save the files to the notebook local file system where I can then access them from R functions.
The project-lib package allows me to retrieve the file from Cloud Object Storage as a byte array; how can I save that byte array to a file?
library(projectLib)
project <- projectLib::Project$new(projectId="secret, projectToken="secret")
pc <- project$project_context
my.file <- project$get_file("myfile.csv.gz")
#
# Question: how do I save the file to disk ??
#
df = read.csv2("myfile.csv.gz", sep = "|",
colClasses=c("ASSETUNIT_GLOBALID"="character"))
I tried using save() but this was corrupting the data in the file.
The R function writeBin was the solution for me:
library(projectLib)
project <- projectLib::Project$new(projectId="secret, projectToken="secret")
pc <- project$project_context
my.file <- project$get_file("myfile.csv.gz")
#
# writeBin was the solution :
#
writeBin(my.file, 'myfile.csv.gz', size = NA_integer_,
endian = .Platform$endian, useBytes = TRUE)
df = read.csv2("myfile.csv.gz", sep = "|",
colClasses=c("ASSETUNIT_GLOBALID"="character"))
access_key<-"**************"
secret_key<-"****************"
bucket<- "temp"
filename<-"test.csv"
Sys.setenv("AWS_ACCESS_KEY_ID" = access_key,
"AWS_SECRET_ACCESS_KEY" = secret_key )
buckets<-(bucketlist())
getbucket(bucket)
usercsvobj <-get_object(bucket = "","s3://part112017rscriptanddata/test.csv")
csvcharobj <- rawToChar(usercsvobj)
con <- textConnection(csvcharobj)
data <- read.csv(con)
I am able to see the contents of the bucket, but fail to read the CSV as a data frame.
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error>
<Code>PermanentRedirect</Code><Message>The bucket you are attempting to
access must be addressed using the specified endpoint. Please send all
future requests to this endpoint.</Message><Bucket>test.csv</Bucket>
<Endpoint>test.csv.s3.amazonaws.com</Endpoint>
<RequestId>76E9C6B03AC12D8D</RequestId>
<HostId>9Cnfif4T23sJVHJyNkx8xKgWa6/+
Uo0IvCAZ9RkWqneMiC1IMqVXCvYabTqmjbDl0Ol9tj1MMhw=</HostId></Error>"
I am using the CRAN version of the aws.s3 package.
I was able to read from an S3 bucket both in local R and via RStudio Server using:
library(RCurl)
data <- read.csv(textConnection(getURL("https://s3-eu-west-1.amazonaws.com/'yourbucket'/'yourFileName'")), sep = ",", header = TRUE)
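Alternatively, the original aws.s3 call should work once the bucket name and object key are passed separately (a sketch; the bucket name, key, and region below are taken from the question and answer and may need adjusting):
library(aws.s3)
usercsvobj <- get_object(object = "test.csv",
                         bucket = "part112017rscriptanddata",
                         region = "eu-west-1")
data <- read.csv(textConnection(rawToChar(usercsvobj)))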