I am trying to understand how to connect R to Redshift using Spark. I can't connect with plain RPostgres because the dataset is huge and needs distributed computing.
So far I am able to read and write CSVs from S3 into a Spark DataFrame. Can someone please show how to configure the jars and other settings so that I can connect sparklyr (spark_read_jdbc()) or SparkR to Redshift?
It would also be helpful to see how to add jars to the Spark context.
So far I have figured out that Databricks provides some jars that are needed to reach a JDBC URL for a Redshift database.
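For reference, a minimal sketch of how extra jars can be handed to sparklyr before connecting (the jar path below is a placeholder for wherever the Redshift JDBC driver was downloaded), either on the driver classpath or through the equivalent of spark-submit --jars:
library(sparklyr)
config <- spark_config()
jdbc_jar <- "/path/to/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar"   # placeholder path
config$`sparklyr.shell.driver-class-path` <- jdbc_jar            # same as spark-submit --driver-class-path
config$`sparklyr.shell.jars` <- jdbc_jar                         # same as spark-submit --jars
sc <- spark_connect(master = "local[*]", config = config)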
rm(list=ls())
library(sparklyr)
#library(SparkR)
#detach('SparkR')
Sys.setenv("SPARK_MEM" = "15G")
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "8G"
config$`sparklyr.shell.executor-memory` <- "8G"
config$`spark.yarn.executor.memoryOverhead` <- "6GB"
config$`spark.dynamicAllocation.enabled` <- "TRUE"
# Put the Redshift JDBC driver jar on the driver classpath (a separate option,
# so it is not overwritten by the driver-java-options set below)
config$`sparklyr.shell.driver-class-path` <- "/home/root/spark/spark-2.1.0-bin-hadoop2.7/jars/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar"
spark_dir = "/tmp/spark_temp"
config$`sparklyr.shell.driver-java-options` <- paste0("-Djava.io.tmpdir=", spark_dir)
sc <- spark_connect(master = "local[*]", config = config)
#sc <- spark_connect(master = "local")
### Get the underlying Spark context
ctx <- sparklyr::spark_context(sc)
### Wrap it in a JavaSparkContext ("org.apache.spark.api.java.JavaSparkContext")
jsc <- sparklyr::invoke_static(sc, "org.apache.spark.api.java.JavaSparkContext", "fromSparkContext", ctx)
### Get the Hadoop configuration from the Java Spark context
hconf <- jsc %>% sparklyr::invoke("hadoopConfiguration")
### Set the S3A credentials and endpoint on the Hadoop configuration
hconf %>% sparklyr::invoke("set", "fs.s3a.access.key", "<your access key for s3>")
hconf %>% sparklyr::invoke("set", "fs.s3a.secret.key", "<your secret key for s3>")
hconf %>% sparklyr::invoke("set", "fs.s3a.endpoint", "<s3 endpoint for your bucket's region, e.g. s3.eu-west-1.amazonaws.com>")
hconf %>% sparklyr::invoke("set", "com.amazonaws.services.s3.enableV4", "true")
# when setting directly on the Hadoop configuration, the key is fs.s3a.impl (no spark.hadoop. prefix)
hconf %>% sparklyr::invoke("set", "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf %>% sparklyr::invoke("set", "fs.s3a.impl.disable.cache", "true")
###reading from s3 buckets
spark_read_csv(sc=sc,name='sr',path="s3a://my-bucket/tmp/2district.csv",memory = TRUE)
spark_read_csv(sc=sc,name='sr_disk3',path="s3a://my-bucket/tmp/changed/",memory = FALSE)
###reading from local drive
spark_read_csv(sc=sc,name='raw_data_loc_in3',path="/tmp/distance.csv",memory = TRUE)
#### Reading from a Redshift table over JDBC
t <- sparklyr::spark_read_jdbc(sc, "connection", options = list(
  url = "jdbc:redshift://<URL>:<Port>/<dbName>",
  user = "<user_name>",
  password = "<password>",
  dbtable = '(select * from sales limit 1000) as sales_sub',  # a subquery used as dbtable needs an alias
  tempS3Dir = "s3a://my-bucket/migration"),
  memory = TRUE, overwrite = TRUE, repartition = 3)
#### Write the Spark DataFrame to a local CSV
sparklyr::spark_write_csv(t, path = 'sample.csv')
#### Write the Spark DataFrame to CSV in S3
sparklyr::spark_write_csv(t, path = 's3a://my-bucket/output/')
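Note that tempS3Dir is really a feature of the Databricks spark-redshift connector rather than the plain JDBC source. If you want that S3-staged unload path, sparklyr can also load the connector through spark_read_source(). This is only a sketch: it assumes the spark-redshift jar and its dependencies are already on the classpath, and the URL, credentials and table name are placeholders.
#### Hedged sketch: Databricks spark-redshift connector via spark_read_source()
t2 <- sparklyr::spark_read_source(sc,
  name = "sales_sub",
  source = "com.databricks.spark.redshift",
  options = list(
    url = "jdbc:redshift://<URL>:<Port>/<dbName>?user=<user_name>&password=<password>",
    query = "select * from sales limit 1000",
    tempdir = "s3a://my-bucket/migration",          # S3 staging area used by the connector
    forward_spark_s3_credentials = "true"           # reuse the fs.s3a credentials set above
  ),
  memory = TRUE)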
Related
I have ~250 csv files I want to load into a SQLite db. I've loaded all the csvs into my global environment as data frames. I'm using the following code to copy all of them to the db, but I get Error: df must be local dataframe or a remote tbl_sql
library(DBI)
library(odbc)
library(rstudioapi)
library(tidyverse)
library(dbplyr)
library(RSQLite)
library(dm)
# Create DB Instance ---------------------------------------------
my_db <- dbConnect(RSQLite::SQLite(), "test_db.sqlite", create = TRUE)
# Load all csv files ---------------------------------------------
filenames <- list.files(pattern = ".*csv")
names <- substr(filenames, 1, nchar(filenames)-4)
for (i in names) {
filepath <- file.path(paste(i, ".csv", sep = ""))
assign(i, read.csv(filepath, sep = ","))
}
# Get list of data.frames ----------------------------------------
tables <- as.data.frame(sapply(mget(ls(), .GlobalEnv), is.data.frame))
colnames(tables) <- "is_data_frame"
tables <- tables %>%
filter(is_data_frame == "TRUE")
table_list <- row.names(tables)
# Copy dataframes to db ------------------------------------------
for (j in table_list) {
copy_to(my_db, j)
}
I have had mixed success using copy_to. I recommend the dbWriteTable command from the DBI package. Example code below:
DBI::dbWriteTable(
db_connection,
DBI::Id(
catalog = db_name,
schema = schema_name,
table = table_name
),
r_table_name
)
This would replace your copy_to command. You will need to provide a string to name the table, but the database and schema names are likely optional and can probably be omitted.
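Adapted to the loop in the question, a sketch (assuming the data frames are still in the global environment under the names in table_list, and no catalog/schema since this is SQLite) would be:
for (j in table_list) {
  DBI::dbWriteTable(
    my_db,
    name = j,          # table name as a string
    value = get(j),    # pass the data frame itself, not its name
    overwrite = TRUE
  )
}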
I am creating an application in R Shiny where I take inputs from users and store them in a SQLite database in the backend. On my form I have a file upload input which accepts files like .PDF, .jpeg, .png.
If a user uploads a file, I want that file to be stored in my SQLite database table for further use, but I am not aware of how to achieve this using R.
Any help would be appreciated.
You can store objects (any R output that is not tabular, a model's output for example) as BLOBs in SQLite. In R, use serialize/unserialize for this, but first you need to read the raw PDF with readBin. Here is an example:
path <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
# see the PDF
browseURL(path)
# Read raw PDF
pdf <- readBin(con = path, what = raw(), n = file.info(path)$size)
library(RSQLite)
# Connect to DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Serialize raw pdf
pdf_serialized <- serialize(pdf, NULL)
# Create a data.frame
df <- data.frame(pdfs = I(list(pdf_serialized)))
# Write the table in database
dbWriteTable(conn = con, name = "mytable", value = df, overwrite = TRUE)
# Read your table
out <- dbReadTable(conn = con, name = "mytable")
# unserialize
pdf_out <- unserialize(out$pdfs[[1]])
# Write the PDF in a temporary file
tmp <- tempfile(fileext = ".pdf")
writeBin(object = pdf_out, con = tmp)
# open it
browseURL(tmp)
Similar to Victorp's answer, you can also base64 encode the data and store it as text in a database:
file <- "mypdf.pdf"
con <- dbConnect(...)
table <- "mypdf_table"
bin <- readBin(file, raw(), n = file.size(file))
enc_data <- base64enc::base64encode(bin)
dd <- data.frame(
file = file,
data = enc_data
)
DBI::dbWriteTable(con, table, dd, append = TRUE)
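To get the file back, you would read the row, decode the base64 text, and write the bytes to disk again; for example (assuming the table and column written above):
dd_out <- DBI::dbReadTable(con, table)
bin_out <- base64enc::base64decode(dd_out$data[1])
writeBin(bin_out, "mypdf_restored.pdf")   # restored copy of the original PDF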
I'm using the do.call() command to read a list of csv files and combine all data points into one data set. I have been using the following:
files = list.files(path = "G:/SafeGraph201708MidWest",
pattern = "*.csv",
recursive = TRUE,
full.names = TRUE)
library(data.table)
DT = do.call(rbind, lapply(files, fread))
Instead of reading all the rows in each file, I want to read only specific rows, namely the ones within this range:
Data <- filter(DT, longitude >= -86.97 & longitude <= -86.78,
latitude >= 40.35 & latitude <= 40.49)
Is there a way I can do that using do.call()? Looking forward to a reply. Thank you!
There are several strategies for tackling this. You can import all the data into a list using lapply and then filter each list element based on your criteria, using data.table::rbindlist to make the final data.table. Another option is to do it in one step, e.g. (not tested, obviously):
library(data.table)
files = list.files(path = "G:/SafeGraph201708MidWest",
pattern = "*.csv",
recursive = TRUE,
full.names = TRUE)
xy <- lapply(files, FUN = function(x) {
  out <- fread(x)
  # subset with data.table syntax (dplyr::filter would also work, but dplyr is not loaded here)
  out <- out[longitude >= -86.97 & longitude <= -86.78 &
             latitude >= 40.35 & latitude <= 40.49]
  out
})
xy <- rbindlist(xy)
Assuming you use a Windows PC and have Microsoft Office 2007+ installed, consider directly querying the CSV with the JET/ACE SQL engine (.dll files), which is the very engine of MS Access.
Below are two connection approaches, using Access or Excel. Either version works; the Access/Excel files do need to exist but are never used except for connecting to ACE. Once connected, CSV files can be queried from the same or a different path.
library(odbc)
# VERIFY AVAILABLE DSNs AND DRIVERS
odbcListDataSources()
# DSN VERSIONS
conn <- dbConnect(odbc::odbc(), DSN ="MS Access Database;DBQ=C:\\Path\\To\\Access.accdb;");
conn <- dbConnect(odbc::odbc(), DSN ="Excel Files;DBQ=C:\\Path\\To\\Excel.xlsx;");
# DRIVER VERSIONS
conn <- dbConnect(odbc::odbc(),
.connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Path\\To\\Access.accdb;");
conn <- dbConnect(odbc::odbc(),
.connection_string ="Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\\Path\\To\\Excel.xlsx;");
# CSV QUERY
df <- dbGetQuery(conn, "SELECT t.*
FROM [text;database=C:\\Path\\To\\CSV_Folder].Name_of_File.csv AS t
WHERE t.longitude BETWEEN -86.97 AND -86.78
AND t.latitude BETWEEN 40.35 AND 40.49;")
head(df)
dbDisconnect(conn)
And in a loop:
files = list.files(path = "G:/SafeGraph201708MidWest",
pattern = "*.csv",
recursive = TRUE,
full.names = TRUE)
library(data.table)   # for rbindlist()
df_list <- lapply(files, function(f)
  dbGetQuery(conn,
             paste0("SELECT t.* ",
                    # database= must point at the folder; the FROM clause names just the file
                    " FROM [text;database=", normalizePath(dirname(f)), "].", basename(f), " AS t ",
                    " WHERE t.longitude BETWEEN -86.97 AND -86.78",
                    "   AND t.latitude BETWEEN 40.35 AND 40.49;")
  )
)
final_dt <- rbindlist(df_list)
You can use data.table::fread()'s ability to execute a command and 'read' the results.
I assume you are using Windows, so you have access to the findstr command in your command prompt.
So, if you can build a regex that hits the lines you want to extract, you can filter the wanted lines before reading the entire file into R. This is (potentially) a huge memory saver on larger files, and may speed up your workflow considerably.
sample data
Let's say coords.csv looks like this:
id,latitude,longitude
1,10,11
2,11,12
3,12,13
4,13,14
5,14,15
In this example, you want to extract lines with latitudes between 12 and 14 AND longitudes between 11 and 13.
code
#load data.table for fread() and rbindlist()
library(data.table)
#build a list of files (I created only one)
#make sure you use the full path (not relative)
x <- list.files( path = "C:/folder", pattern = "coords.csv", full.names = TRUE )
#build a regex that only hits rows with:
# latitude 12-14
# longitude 11-13
pattern = "^[0-9],1[2-4],1[1-3]$"
#read the file(s), extract the lines that match the regex pattern
#and bind the result to a data.table
rbindlist( lapply( x, function(x) {
  fread( cmd = paste0( "findstr /R ", pattern, " ", x ), header = FALSE )
} ) )
output
V1 V2 V3
1: 3 12 13
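Note that the header row never matches the regex, so the columns come back unnamed (V1, V2, V3 above). If you assign the result, you can restore the names afterwards, assuming the columns of the sample file:
dt <- rbindlist( lapply( x, function(x) {
  fread( cmd = paste0( "findstr /R ", pattern, " ", x ), header = FALSE )
} ) )
# restore the column names from the sample file's header
setnames(dt, c("id", "latitude", "longitude"))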
In Python, this is how I would access a CSV from Azure blobs:
storage_account_name = "testname"
storage_account_access_key = "..."
file_location = "wasb://example#testname.blob.core.windows.net/testfile.csv"
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
df = spark.read.format('csv').load(file_location, header = True, inferSchema = True)
How can I do this in R? I cannot find any documentation...
The AzureStor package provides an R interface to Azure storage, including files, blobs and ADLSgen2.
endp <- storage_endpoint("https://acctname.blob.core.windows.net", key="access_key")
cont <- storage_container(endp, "mycontainer")
storage_download(cont, "myblob.csv", "local_filename.csv")
Note that this will download to a file in local storage. From there, you can ingest into Spark using standard sparklyr methods (see the sketch below).
Disclaimer: I'm the author of AzureStor.
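For example, a minimal sketch of that ingest step, assuming an existing sparklyr connection sc and the file downloaded above:
library(sparklyr)
# sc is assumed to be an existing spark_connect() connection
sdf <- spark_read_csv(sc, name = "myblob",
                      path = "local_filename.csv",
                      header = TRUE, infer_schema = TRUE)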
If you do not want to keep a local copy, download to a tempfile and then read from it:
endp <- storage_endpoint("https://acctname.blob.core.windows.net", key="access_key")
cont <- storage_container(endp, "mycontainer")
fname <- tempfile()
storage_download(cont, "myblob.csv", fname)
df = read.csv(fname)
access_key<-"**************"
secret_key<-"****************"
bucket<- "temp"
filename<-"test.csv"
Sys.setenv("AWS_ACCESS_KEY_ID" = access_key,
"AWS_SECRET_ACCESS_KEY" = secret_key )
buckets<-(bucketlist())
getbucket(bucket)
usercsvobj <-get_object(bucket = "","s3://part112017rscriptanddata/test.csv")
csvcharobj <- rawToChar(usercsvobj)
con <- textConnection(csvcharobj)
data <- read.csv(con)
I am able to see the contents of the bucket, but I fail to read the csv as a data frame; instead I get this error:
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error>
<Code>PermanentRedirect</Code><Message>The bucket you are attempting to
access must be addressed using the specified endpoint. Please send all
future requests to this endpoint.</Message><Bucket>test.csv</Bucket>
<Endpoint>test.csv.s3.amazonaws.com</Endpoint>
<RequestId>76E9C6B03AC12D8D</RequestId>
<HostId>9Cnfif4T23sJVHJyNkx8xKgWa6/+
Uo0IvCAZ9RkWqneMiC1IMqVXCvYabTqmjbDl0Ol9tj1MMhw=</HostId></Error>"
I am using the CRAN version of the aws.s3 package.
I was able to read from an S3 bucket both in local R and via RStudio Server using:
library(RCurl)   # for getURL()
data <- read.csv(textConnection(getURL("https://s3-eu-west-1.amazonaws.com/'yourbucket'/'yourFileName'")),
                 sep = ",", header = TRUE)
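The PermanentRedirect error in the question usually means the request went to the wrong regional endpoint (and the object URI was also passed where the bucket name was expected). A sketch of the same read staying within aws.s3, assuming the bucket is named part112017rscriptanddata and lives in eu-west-1 (adjust both to your setup):
library(aws.s3)
# region must match the bucket's actual region, otherwise S3 replies
# with a PermanentRedirect error like the one shown above
usercsvobj <- get_object(object = "test.csv",
                         bucket = "part112017rscriptanddata",
                         region = "eu-west-1")
data <- read.csv(textConnection(rawToChar(usercsvobj)))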