Registering a temporary table using sparklyr in Databricks - r

My colleague uses pyspark in Databricks. The usual step is to import the data with data = spark.read.format('delta').parquet('parquet_table').select('column1', 'column2') and then run this caching step, which is really fast:
data.cache()
data.registerTempTable("data")
As an R user I am looking for this registerTempTable equivalent in sparklyr.
I would usually do
data = sparklyr::spark_read_parquet(sc = sc, path = "parquet_table", memory = FALSE) %>% dplyr::select(column1, column2)
If I opt for memory = TRUE or tbl_cache(sc, "data"), it keeps running and never finishes. The difference in time is stark: my colleague's registerTempTable takes seconds, whereas my sparklyr option runs with no sign of when it will stop. Is there a better function in sparklyr that does what registerTempTable does, only faster?

You can try using cache and createOrReplaceTempView
library(SparkR)
df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")
cache(df)
createOrReplaceTempView(df,"df_temp")

The above works for SparkR.
For sparklyr, your code using spark_read_parquet() with memory = FALSE is pretty similar to what happens with SparkR.
You shouldn't need to cache the data if you just want to create a temporary table so I would just use memory = FALSE.
See this question.
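For the sparklyr side, a minimal sketch of the same register-then-cache sequence might look like the following. It assumes an existing connection sc, reuses the path and column names from the question, and wraps everything in a helper called register_temp_view, a name made up for illustration. One thing that may explain the timing gap: pyspark's data.cache() is lazy (nothing is read until the first action), whereas tbl_cache() defaults to force = TRUE and loads the table right away.

```r
# Hypothetical helper (not part of sparklyr): read a parquet path lazily and
# expose the selected columns as a temporary view.
register_temp_view <- function(sc, path, view_name) {
  # memory = FALSE maps the files without loading them into Spark memory
  raw <- sparklyr::spark_read_parquet(
    sc,
    name = paste0(view_name, "_raw"),
    path = path,
    memory = FALSE
  )
  # column1 / column2 are the placeholder column names from the question
  selected <- dplyr::select(raw, column1, column2)
  # sdf_register() registers the lazy result under `view_name`
  view <- sparklyr::sdf_register(selected, view_name)
  # force = FALSE asks sparklyr not to force the load, closer to pyspark's cache()
  sparklyr::tbl_cache(sc, view_name, force = FALSE)
  view
}

# Usage, assuming `sc` is a live connection:
# data <- register_temp_view(sc, "parquet_table", "data")
```

If caching is not needed at all, dropping the tbl_cache() line leaves just the registration, which matches the advice above to stay with memory = FALSE.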

Related

Vroom using too much RAM

I was working with some data, trying to use vroom to open it, and it worked like a charm. After restarting R, something happened and it never worked again. Loading exactly the same data, if I run:
df = vroom(directory)
It cannot guess the delimiter. If I run:
df = vroom(directory, delim = ',')
It gets stuck indexing the database and I have to close RStudio. If I run:
df = vroom(directory, delim = ',', progress = FALSE)
I get Error: std::bad_alloc, even though I have 8GB of free RAM for an 80MB file. I tried a smaller file and it reads with no problem, so I'm guessing the problem is that vroom() uses too much RAM for wide files. Is there any way to optimize this? My vroom version is 1.6.0 and my R version is 4.2.1.
Thank you!
PS: the data comes from this link.

Speed up odbc::dbFetch

I'm trying to analyze data stored in an SQL database (MS SQL server) in R, and on a mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to more quickly pipe the data to my mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
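For reference, the chunked fetching mentioned above can be sketched with plain DBI calls roughly as follows. The helper name fetch_in_chunks and the default chunk size are assumptions for illustration, not something taken from the linked answer, and binding the chunks at the end still needs enough memory for the full result.

```r
# Hypothetical helper: fetch a query result in chunks of `chunk_size` rows.
fetch_in_chunks <- function(conn, query, chunk_size = 100000) {
  res <- DBI::dbSendQuery(conn, query)
  # Always release the result set, even if a fetch fails
  on.exit(DBI::dbClearResult(res), add = TRUE)

  chunks <- list()
  while (!DBI::dbHasCompleted(res)) {
    # Pull at most `chunk_size` rows per round trip
    chunks[[length(chunks) + 1]] <- DBI::dbFetch(res, n = chunk_size)
  }
  # Combine the chunks into one data.frame
  do.call(rbind, chunks)
}

# Usage, assuming `conn` is an open DBI connection:
# result <- fetch_in_chunks(conn, "SELECT * FROM dbo.table", chunk_size = 50000)
```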
My answer uses a different package: RODBC, which is available on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time that came from my previous method of exporting each query result to .csv to load it into my R environment. I found regular ODBC to be much slower than RODBC.
I use the following functions:
sqlQuery() takes the open connection as its first argument and the query itself as its second argument. Put the query in quotation marks.
odbcConnect() opens the connection and is passed as the first argument of sqlQuery(). The argument to odbcConnect() is the name of your connection to the SQL db. Put the connection name in quotation marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498") %>%
mutate(matchid = paste0(schoolID, "-", studentID)) %>%
distinct(matchid, .keep_all = TRUE)
odbcCloseAll()
I would suggest using dbcooper, which is available on GitHub: https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
First, add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
Driver = "",
Server = "",
Database = "",
UID="",
PWD="")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
con_id = "test",
tables = c("schema.table"))
This adds the function test_schema_table() to your environment, which is used to query the data. To collect the result into your environment, use test_schema_table() %>% collect().
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
DBI = DBI::dbFetch(DBI::dbSendQuery(conn,qry)),
dbcooper = ava_qry() %>% collect() , times=5
)
Here are the results of a microbenchmark I did to compare DBI with dbcooper.

Appending new data to a local Access data base file with r after a successful connection

I am currently working on connecting to an Access database. I am able to connect to the Access DB, which is located on my local system and is itself linked to a SharePoint list. I would love to automate the handling of this SharePoint list with an R and Access combo! What I want to do is actually pretty basic: I want to introduce new data via a .csv, which is processed for the relevant content and then compared to the current Access DB, and finally upload the new information from R to Access.
I've learned that you need to pair the bit version of your Windows OS, Office version, and R version. So I am x64 on all of the above. This allowed me to connect to the Access DB. You also need the 'Microsoft Access Database Engine 2016 Redistributable' which is essentially the driver for the connection.
So what I have so far is:
library(odbc)
library(DBI)
file_path <- "C:/user/Documents/R Projects/...pathtofile.../filename.accdb"
accdb_con <- dbConnect(drv = odbc(), .connection_string = paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=",file_path,";"))
access.db <- dbReadTable(accdb_con, "sNPS Deep Dives")
That now connects!
I then read in a .csv of new information
new.df <- read.csv("C:/user/Documents/R projects/...pathtofile.csv", header=T, stringsAsFactors=FALSE, na.strings=c("","NA"))
an example of the data set might just look something like this:
date <- c("15/10/2018","15/10/2018", "16/10/2018", "12/11/2018", "07/09/2018")
score <- c("6", "10", "7", "10", "9")
group <- c("a","b", "b", "a", "b")
CaseID <- c("301", "302", "303", "304", "305")
new.df <- data.frame(date,score,group,CaseID)
new.df$date <- as.character(new.df$date)
new.df$score <- as.numeric(new.df$score)
new.df$group <- as.character(new.df$group)
new.df$CaseID <- as.numeric(new.df$CaseID)
Notably there are more columns in the Access DB that people will fill in by hand with further information.
I then process it so that it is ready to go into the Access DB.
probably not that interesting...
Then I compare the new data against the Access DB like this:
library(dplyr)
new <- anti_join(new.df, access.db, by= "Case.ID")
Now I've tried:
dbWriteTable(access.db.copy, new, append = TRUE)
dbAppendTable(access.db.copy, new)
I don't seem to be able to get this to go anywhere
I am getting an error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"ACCESS", "data.frame", "missing"’
I've seen plenty of posts in which people are having trouble connecting to an Access DB but I haven't seen anything about writing new data into that database.
I know this isn't quite a reproducible example but it seems like a difficult problem to recreate since it's a connection problem between different tools. I would be happy to provide example sets that might make this easier
I would appreciate any direction you all can provide.
Thanks!
Edit:
It appears that Bing Sun was right, I was missing an argument. So it appears that we need something more like:
dbWriteTable(access.db.copy, "Name of table",new, append = TRUE)
Which produces the error:
Error in result_insert_dataframe(rs@ptr, values) :
nanodbc/nanodbc.cpp:1944: HY104: [Microsoft][ODBC Microsoft Access Driver]Invalid precision value
I wonder if this may be an error from Access about a file type?
Now, if I use dbAppendTable instead, I don't get an error; I get 0 as output:
dbAppendTable(access.db.copy, "Name of table", new, append= TRUE)
With output:
[1] 0
But I don't see any of the new values when I check the Access file.
I know it's years later, but hopefully this will help someone else with this issue since you're right CrayCrayTown, there aren't very many posts covering this issue.
I've run into this problem repeatedly when dealing with R and MS Access. The solution that I've come up with is pretty "hacky", but it gets the job done... just not very elegantly.
The way I do this is with a combo of RODBC and DBI packages.
First, I open a connection to the DB with RODBC, and use that connection to write my data to the DB as an intermediary table:
chan <- RODBC::odbcDriverConnect(connection = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=/path/to/database.accdb")
RODBC::sqlSave(channel = chan,
dat = df,
tablename = "tbl_intermediary",
rownames = FALSE,
append = FALSE)
RODBC::odbcClose(chan)
rm(chan)
Make sure to close the RODBC connection; I also destroy it for good measure, because why not? I use RODBC for the intermediary table because it supports batch insert statements. I know that the same thing can, in theory, be done with DBI::dbAppendTable() (but we wouldn't be on this post if that worked how we had hoped). I tried this in a previous SO question here, but it didn't solve my problem. I also don't know how big my intermediary tables could get in the future. Hopefully, by the time they get too big, we'll be in a different DBMS.
Next, I reopen the connection, this time with DBI, and send a statement to the DB to write those data from the intermediary table to the final resting place for those data, and then drop the intermediary table.
con <- DBI::dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=/path/to/database.accdb")
DBI::dbSendStatement(
conn = con,
statement = 'UPDATE
tbl_intermediary INNER JOIN final_tbl ON tbl_intermediary.SampleID = final_tbl.sampleNumber
SET
final_tbl.field1 = [tbl_intermediary].[field1],
final_tbl.notes = IIf(Nz([tbl_intermediary].[Notes],"")="",[final_tbl].[notes],[final_tbl].[notes] & "; Newest Notes: " & [tbl_intermediary].[Notes]);'
)
DBI::dbSendStatement(
conn = con,
statement = 'DROP TABLE tbl_intermediary;'
)
# close the DBI connection once both statements have been sent
DBI::dbDisconnect(con)
rm(con)
The main reason why I chose this method is because some of the SQL I use with Access also has some VBA in it. When I send the SQL-VBA hybrid string with RODBC, I get assorted errors in the IIF() and Nz() functions (see example above). From the RODBC CRAN docs the query argument for the sqlQuery() function is strictly assumed to be a valid SQL statement. So, RODBC has no clue how to interpret the IIf() and Nz() MS Access functions. I think this also has to do with how the ODBC driver handles communication as well (please, someone correct me if I'm wrong about this).
As I understand it, DBI::dbSendStatement(), however, lets the database engine you're working with interpret the statement argument you provide. In the situation above, the VBA is executed exactly how I would expect if it were run in Access directly. As per the DBI docs, for interactive use you'll generally want to use dbGetQuery() or dbExecute().
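Following that last remark, here is a minimal sketch of the second half of the workflow written with dbExecute(), which sends a statement, returns the number of rows affected, and frees the result for you. The helper name run_statements is hypothetical, and the statements in the usage note are shortened placeholders rather than the full UPDATE above.

```r
# Hypothetical helper: run each statement with dbExecute() and disconnect
# afterwards, even if one of the statements fails.
run_statements <- function(con, statements) {
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  vapply(
    statements,
    function(sql) DBI::dbExecute(con, sql),  # rows affected by each statement
    numeric(1),
    USE.NAMES = FALSE
  )
}

# Usage, assuming `con` is an open DBI connection to the Access file:
# run_statements(con, c(
#   "UPDATE final_tbl SET field1 = 'x' WHERE sampleNumber = 1;",
#   "DROP TABLE tbl_intermediary;"
# ))
```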

Is there a way to transfer R objects to separate R sessions on linux?

I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e6^2), 1e6)
saveRDS(big_data, file = "big_data.Rds")
# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data+rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`
# run this program a bunch
for (i in 1:1000) {
system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R`, open a terminal, and run
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data+1
print("yahoo!")
# save this as `my_improved_program.R`
# now do the following:
for (i in 1:1000) {
system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, that I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
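The clone() idea can be approximated from within R itself: on Linux, the parallel package (shipped with R) forks the current process, so a child sees data already loaded by the parent without reading it again, and its memory is released when it exits. The sketch below is only an illustration under those assumptions. It uses a small matrix in place of the real data, run_once is a made-up stand-in for the body of my_program.R, and whether TensorFlow behaves well inside forked children is not tested here.

```r
library(parallel)

# Stand-in for the expensive readRDS() call; loaded once in the parent process.
big_data <- matrix(rnorm(1e4), 100)

# The work that my_program.R would do, executed inside a forked child.
run_once <- function(i) {
  bigger_data <- big_data + 1
  sum(bigger_data)
}

results <- vapply(1:3, function(i) {
  job <- mcparallel(run_once(i))  # fork: the child sees big_data without reloading
  mccollect(job)[[1]]             # wait for the child; it exits after returning
}, numeric(1))

print(results)
```

Each iteration runs in a fresh child process, so anything the child allocates is returned to the OS when it finishes, which is the clean-up behaviour the backstory asks for.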
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
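Whether this helps depends on the data and the disk, so it is worth measuring. A small sketch, assuming a 1000 x 1000 random matrix as a stand-in for the real data (the file and variable names are made up for illustration):

```r
# Stand-in data; random doubles compress poorly, real data may differ.
x <- matrix(rnorm(1e6), 1000)

f_gz  <- tempfile(fileext = ".Rds")  # default gzip compression
f_raw <- tempfile(fileext = ".Rds")  # no compression

saveRDS(x, f_gz)
saveRDS(x, f_raw, compress = FALSE)

# Elapsed read time in seconds for each file
timings <- c(
  compressed   = system.time(readRDS(f_gz))[["elapsed"]],
  uncompressed = system.time(readRDS(f_raw))[["elapsed"]]
)

# File sizes in MB, in the same order
sizes_mb <- file.size(c(f_gz, f_raw)) / 1024^2

# The uncompressed file must round-trip to the same object
roundtrip_ok <- identical(readRDS(f_raw), x)

print(timings)
print(sizes_mb)
```

The trade-off is disk space against CPU time: the uncompressed file is larger but skips the decompression step on every load.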
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)

spark: java.io.IOException: No space left on device [again!]

I am getting java.io.IOException: No space left on device after running a simple query in sparklyr. I use the latest versions of both Spark (2.1.1) and sparklyr.
df_new <-spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)
myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>%
arrange(desc(mycount)) %>% head(10)
#this FAILS
get_result <- collect(myquery)
I have set both
spark.local.dir <- "/mypath/"
spark.worker.dir <- "/mypath/"
using the usual
config <- spark_config()
config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"
Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)
where mypath has more than 5TB of disk space (I can see these options in the Environment tab). I tried a similar command in Pyspark and it failed the same way (same error).
Looking at the Stages tab in Spark, I see that the error occurs when the shuffle write is about 60 GB (the input is about 200 GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions...
The cluster job is started with magpie https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark
Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the size of the files in there is very small (nowhere near the 60 GB shuffle write).
There are about 40 parquet files. Each is 700K (the original csv files were 100GB). They essentially contain strings.
The cluster is 10 nodes, each with 120 GB of RAM and 20 cores.
What is the problem here?
Thanks!!
I've had this problem multiple times before. The cause is temporary files: most servers have a very small partition for /tmp/, which is the default temporary directory for Spark.
I usually change that by passing the following setting to spark-submit:
$ spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....
In your case, I think you can provide it in the R configuration as follows (I have not tested this, but it should work):
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
Notice that you have to change that for the driver and executors since you're using Spark standalone master (as I can see in your question)
I hope that will help
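A quick way to check the premise above (a small /tmp partition) from R is to ask df for the free space on each candidate directory. This is only a sketch: free_space_gb is a hypothetical helper, it assumes a POSIX df is on the PATH, and it needs to be run on every worker node, not just on the driver.

```r
# Hypothetical helper: free space (in GB) on the filesystem that holds `path`.
# Assumes POSIX `df`; parsing breaks if the filesystem name contains spaces.
free_space_gb <- function(path) {
  out <- system2("df", c("-Pk", shQuote(path)), stdout = TRUE)
  # Last line: filesystem, 1024-blocks, used, available, capacity, mount point
  fields <- strsplit(trimws(out[length(out)]), "\\s+")[[1]]
  as.numeric(fields[4]) / 1024^2  # KB to GB
}

# Example: free space where R itself writes temporary files
print(free_space_gb(tempdir()))

# Usage for the directories discussed here:
# free_space_gb("/tmp")
# free_space_gb("/mypath")
```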
Change the following settings in your magpie script
export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"
so that they use the mypath prefix instead of /tmp.
Once you set the parameter, you can see the new value of spark.local.dir in the Spark environment UI, but it does not take effect.
I faced a similar problem. After setting this parameter, I restarted the machines, and then it started working.
Since you need to set this when the JVM is launched via spark-submit, you need to use the sparklyr java-options, e.g.
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
I had this very problem this week on a standalone-mode cluster. After trying different things, including some of the recommendations in this thread, it turned out that a subfolder called "work" inside the Spark home folder had grown unchecked for a while, filling up the worker's hard disk.

Resources