Alternative to copy_to in sparklyr for large data sets

I have the code below that takes a dataset and applies a SQL transformation to it via a wrapper function that calls the Spark SQL API through sparklyr. I then use invoke("createOrReplaceTempView", "name") to register the result as a temporary view in the Spark session so that I can refer to it in a later call. Next I use dplyr's "mutate" to call the Hive function "regexp_replace" to replace leading capital letters with "0". After that I need to call a SQL function again.
However, to do so I seem to have to use sparklyr's "copy_to" function. On a large dataset, "copy_to" causes the following error:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Total size of serialized results of 6 tasks (1024.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
Is there an alternative to "copy_to" that lets me get a Spark data frame that I can then query with the SQL API?
Here is my code:
sqlfunction <- function(sc, block) {
  spark_session(sc) %>% invoke("sql", block)
}

sqlfunction(sc, "SELECT * FROM test") %>%
  invoke("createOrReplaceTempView", "name4")

names <- tbl(sc, "name4") %>%
  mutate(col3 = regexp_replace(col2, "^[A-Z]", "0"))

copy_to(sc, names, overwrite = TRUE)

sqlfunction(sc, "SELECT * FROM test") %>%
  invoke("createOrReplaceTempView", "name5")

final <- tbl(sc, "name5")

It'd help if you had a reprex, but try registering the dplyr result as a temporary view with sdf_register instead of copying it back through the driver with copy_to:
final <- names %>%
  sdf_register("name5")
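Once "name5" is registered you can keep querying it through the SQL API without any call to "copy_to". A minimal sketch, assuming the sc connection and the sqlfunction wrapper defined above ("name6" is just an illustrative view name):
result_sdf <- sqlfunction(sc, "SELECT * FROM name5")  # returns a Spark DataFrame (jobj)
result_tbl <- sdf_register(result_sdf, "name6")       # register it and get a dplyr tbl back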

Related

Optimization : turning a SQLite .db file into data.frame on R

I need to analyse very heavy data (~40 GB, 3 million rows). The data is too big to open in a spreadsheet or in R.
To address that, I loaded my data into a SQLite database. I then used R (and the RSQLite package) to split it into smaller parts that I can manipulate (70,000 rows). I then needed my data in data.frame format so I could analyse it. I used as.data.frame:
# Connecting to the database
con <- dbConnect(drv = RSQLite::SQLite(), dbname = "path")
# Connecting to the table
d <- tbl(con, "Test")
# Filter the database and convert it
d %>%
  # I filtered using dplyr
  filter("reduce number of rows") %>%
  as.data.frame()
It works, but it takes a lot of time to execute. Does anyone know how to make this faster (knowing I have limited RAM)?
I also tried setDT(), but it doesn't seem to work on SQLite data.
d %>% setDT()
Error in setDT(.) :
All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: [2:1, 13:1]
The first entry with fewer than 13 entries is 1
Thanks
To process successive chunks of 70,000 rows using con from the question, replace the print statement below with whatever processing you need (base, dplyr, data.table, etc.).
rs <- dbSendQuery(con, "select * from Test")
while (!dbHasCompleted(rs)) {
  dat <- dbFetch(rs, 70000)
  print(dim(dat))  # replace with your processing
}
dbClearResult(rs)
dbDisconnect(con)
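For instance, if the end result should be a single filtered data.frame small enough for memory, each chunk can be filtered inside the loop (before the dbDisconnect above) and the pieces bound together afterwards. A sketch, assuming a hypothetical column keep_flag to filter on:
library(DBI)
library(dplyr)
rs <- dbSendQuery(con, "select * from Test")
chunks <- list()
while (!dbHasCompleted(rs)) {
  dat <- dbFetch(rs, 70000)
  # keep_flag is a made-up column; substitute your real filter condition
  chunks[[length(chunks) + 1]] <- dat %>% filter(keep_flag == 1)
}
dbClearResult(rs)
result <- bind_rows(chunks)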

R query database tables iteratively without for loop with lambda or vectorized function for Shiny app

I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>%
  select(TABLE_NAME)
# use dplyr pull to create a character vector
table_list <- pull(db_tables, TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first few rows, and export to .csv
for (table in table_list){
  write.csv(
    tbl(con, table) %>% head(),
    str_glue("{getwd()}/00_exports/tables/{table}.csv")
  )
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1", "TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed with which direction to go. I researched the R equivalent of Python list comprehensions but have not been able to get an R function to behave the same way.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, with simplify = FALSE, the character vector you pass in comes back as the names of the list elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the anonymous function to multiple steps such as write.csv, or use a named function; just be sure its last line returns a data frame. The example below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
  head_df <- dbReadTable(con, table) |> head()
  write.csv(
    head_df,
    file.path(
      "00_exports", "bibliometrics_tables", paste0(table, ".csv")
    )
  )
  return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
A data frame stored in a list loses none of its functionality; you can extract it by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
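If you later need everything in a single data frame, the named list can be collapsed in one step; a sketch assuming dplyr is available and that the tables' shared columns have compatible types (the .id column records which table each row came from):
library(dplyr)
all_tables_df <- bind_rows(df_list, .id = "source_table")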

Return logical plan using sparklyr

We are trying to get the logical plan (not to be confused with the physical plan) that Spark generates for a given query. According to the Spark docs here, you should be able to retrieve this with the Scala call:
df.explain(true)
or in sparklyr with the example code:
spark_version <- "2.4.3"
sc <- spark_connect(master = "local", version = spark_version)
iris_sdf <- copy_to(sc, iris)
iris_sdf %>%
  spark_dataframe %>%
  invoke("explain", T)
This command runs, but simply returns NULL in RStudio. My guess is that sparklyr does not retrieve content that is printed to the console. Is there a way around this or another way to retrieve the logical plan using sparklyr? The physical plan is easy to get using dplyr::explain([your_sdf]), but does not return the logical plan that was used to create it.
Looks like you can get this via:
iris_sdf %>%
  spark_dataframe %>%
  invoke("queryExecution") %>%
  invoke("toString") %>%
  cat()
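The toString() output above includes the parsed, analyzed and optimized logical plans together with the physical plan. If you only want one of them, it should also be possible (an assumption on my part, relying on Scala lazy vals being exposed as methods) to invoke the individual accessors on QueryExecution, e.g. the optimized logical plan:
iris_sdf %>%
  spark_dataframe %>%
  invoke("queryExecution") %>%
  invoke("optimizedPlan") %>%  # or "analyzed" / "logical" for the other plans
  invoke("toString") %>%
  cat()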

Sparklyr's spark_apply function seems to run on single executor and fails on moderately-large dataset

I am trying to use spark_apply to run the R function below on a Spark table. This works fine if my input table is small (e.g. 5,000 rows), but after ~30 mins throws an error when the table is moderately large (e.g. 5,000,000 rows):
sparklyr worker rscript failure, check worker logs for details
Looking at the Spark UI shows that there is only ever a single task being created, and a single executor being applied to this task.
Can anyone give advice on why this function is failing for 5 million row dataset? Could the problem be that a single executor is being made to do all the work, and failing?
# Create data and copy to Spark
testdf <- data.frame(string_id = rep(letters[1:5], times = 1000),  # 5000 row table
                     string_categories = rep(c("", "1", "2 3", "4 5 6", "7"), times = 1000))
testtbl <- sdf_copy_to(sc, testdf, overwrite = TRUE, repartition = 100L, memory = TRUE)

# Write function to return dataframe with strings split out
myFunction <- function(inputdf){
  inputdf$string_categories <- as.character(inputdf$string_categories)
  inputdf$string_categories <- with(inputdf, ifelse(string_categories == "", "blank", string_categories))
  stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
  outDF <- data.frame(string_id = rep(inputdf$string_id, times = unlist(lapply(stringCategoriesList, length))),
                      string_categories = unlist(stringCategoriesList))
  return(outDF)
}

# Use spark_apply to run function in Spark
outtbl <- testtbl %>%
  spark_apply(myFunction,
              names = c('string_id', 'string_categories'))
outtbl
The "sparklyr worker rscript failure, check worker logs for details" error is written by the driver node and indicates that the actual error needs to be found in the worker logs. Usually the worker logs can be accessed by opening stdout from the executor's tab in the Spark UI; the logs should contain RScript: entries describing what the executor is processing and the specifics of the error.
Regarding the single task being created: when columns are not specified with types in spark_apply(), it needs to compute a subset of the result to guess the column types. To avoid this, pass explicit column types as follows:
outtbl <- testtbl %>%
  spark_apply(
    myFunction,
    columns = list(
      string_id = "character",
      string_categories = "character"))
If using sparklyr 0.6.3, update to sparklyr 0.6.4 or devtools::install_github("rstudio/sparklyr"), since sparklyr 0.6.3 contains an incorrect wait time in some edge cases where package distribution is enabled and more than one executor runs in each node.
Under high load, it is common to run out of memory. Increasing the number of partitions could resolve this issue since it would reduce the total memory required to process this dataset. Try running this as:
testtbl %>%
  sdf_repartition(1000) %>%
  spark_apply(myFunction, names = c('string_id', 'string_categories'))
It could also be that the function throws an exception for some of the partitions due to its logic. You could check whether this is the case by using tryCatch() to ignore the errors, then look at which values are missing from the output and why the function would fail for those values. I would start with something like:
myFunction <- function(inputdf){
  tryCatch({
    inputdf$string_categories <- as.character(inputdf$string_categories)
    inputdf$string_categories <- with(inputdf, ifelse(string_categories == "", "blank", string_categories))
    stringCategoriesList <- strsplit(inputdf$string_categories, ' ')
    outDF <- data.frame(string_id = rep(inputdf$string_id, times = unlist(lapply(stringCategoriesList, length))),
                        string_categories = unlist(stringCategoriesList))
    return(outDF)
  }, error = function(e) {
    return(
      data.frame(string_id = c(0), string_categories = c("error"))
    )
  })
}

R to SQL Server. Make a query joining a database table with an R dataframe

I have an ODBC connection to a SQL Server database. From R, I want to query a table with lots of data, but I only want the records that match my R dataframe on certain columns (an INNER JOIN). Currently I do this by linking the ODBC tables in MS Access 2003 (linked tables "dbo_name") and then running relational queries, without downloading the entire table. I need to reproduce this process in R while avoiding downloading the entire table (i.e. avoiding sqlFetch()).
I have read the documentation for the ODBC, DBI, and rsqlserver packages without success. Is there any package or way to do this?
If you can't write a table to the database, there is another trick you can use: you essentially build a giant WHERE clause. Let's say you want to join the database table table to your data.frame called a on the column id. You could say:
ids <- paste0(a$id, collapse = ',')
# If a$id is a character, you'll have to surround each value in quotes:
# ids <- paste0(paste0("'", a$id, "'"), collapse = ',')
dbGetQuery(con, paste0('SELECT * FROM table WHERE id IN (', ids, ')'))
From your comment, it seems that SQL Server has a problem with a query of that size. I suspect that you may have to "chunk" the query into smaller bits, and then join them all together. Here is an example of splitting the ids into 1000 chunks, querying, and then combining.
id.chunks <- split(a$id, seq(1000))
result.list <- lapply(id.chunks, function(ids)
  dbGetQuery(con,
             paste0('SELECT * FROM table WHERE id IN (', paste(ids, collapse = ','), ')')))
combined.results <- do.call(rbind, result.list)
The problem was solved. The vector of ids was divided into groups of 1,000, and each group was then queried against the server. I show the unorthodox code below. Thanks nograpes!
# "lani1" is the vector with 395.474 ids
id.chunks<-split(lani1,seq(1000))
for (i in 1:length(id.chunks)){
idsi<-paste0(paste0("'",as.vector(unlist(id.chunks[i])),"'"),collapse=',')
if(i==1){ani<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
}
else {ani1<-sqlQuery(riia,paste0('SELECT * FROM T_ANIMALES WHERE an_id IN (',idsi,')'))
ani<-rbind(ani,ani1)
}
}
I adapted the answer above and the following worked for me without needing SQL syntax. The table I used was from the AdventureWorks SQL Server database.
lazy_dim_customer <- dplyr::tbl(conn, dbplyr::in_schema("dbo", "DimCustomer"))

# Create data frame of customer ids
adv_customers <- dplyr::tbl(conn, "DimCustomer")
query1 <- adv_customers %>%
  filter(CustomerKey < 20000) %>%
  select(CustomerKey)
d00df_customer_keys <- query1 %>% dplyr::collect()

# Chunk customer ids, filter, collect and bind
id.chunks <- split(d00df_customer_keys$CustomerKey, seq(10))
result.list <- lapply(id.chunks, function(ids)
  lazy_dim_customer %>%
    filter(CustomerKey %in% ids) %>%
    select(CustomerKey, FirstName, LastName) %>%
    collect())
combined.results <- do.call(rbind, result.list)
