RSQLite takes hours to write a table to an SQLite database - r

I have a simple R program that reads a table (1,000,000 rows, 10 columns) from an SQLite database into an R data.table, does some operations on the data, and then tries to write the result back into a new table of the same SQLite database. Reading the data takes a few seconds, but writing the table back into the SQLite database takes hours. I don't know exactly how long because it has never finished; the longest I have let it run is 8 hours.
This is the simplified version of the program:
library(DBI)
library(RSQLite)
library(data.table)

driver <- dbDriver("SQLite")
con <- dbConnect(driver, dbname = "C:/Database/DB.db")

# Reading the table into a data.table takes only a few seconds
DB <- data.table(dbGetQuery(con, "SELECT * from Table1"))

# Writing it back is the step that takes hours
dbSendQuery(con, "DROP TABLE IF EXISTS Table2")
dbWriteTable(con, "Table2", DB)

dbDisconnect(con)
dbUnloadDriver(driver)
I'm using R version 2.15.2; the package versions are:
data.table_1.8.6 RSQLite_0.11.2 DBI_0.2-5
I have tried this on multiple systems and on different Windows versions, and in all cases it takes an incredible amount of time to write this table into the SQLite database. Watching the file size of the SQLite database, it grows at about 50 KB per minute.
My question: does anybody know what causes this slow write speed?
Tim had the answer, but I can't flag it as such because it is in the comments. As in:
ideas to avoid hitting memory limit when using dbWriteTable to save an R data table inside a SQLite database
I wrote the data to the database in chunks:
chunks <- 100
starts.stops <- floor(seq(1, nrow(DB), length.out = chunks))

system.time({
  for (i in 2:length(starts.stops)) {
    if (i == 2) {
      rows.to.add <- starts.stops[i - 1]:starts.stops[i]
    } else {
      rows.to.add <- (starts.stops[i - 1] + 1):starts.stops[i]
    }
    dbWriteTable(con, 'Table2', DB[rows.to.add, ], append = TRUE)
  }
})
It takes:
   user  system elapsed
   4.49    9.90  214.26
to finish writing the data to the database. Apparently I was hitting the memory limit without knowing it.

Use a single transaction (commit) for all the records. Add a
dbSendQuery(con, "BEGIN")
before the insert and a
dbSendQuery(con, "END")
to complete. Much faster.
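Put together with the chunked loop above, the pattern looks roughly like this (a sketch only, using the same con, DB, and starts.stops objects as above; newer DBI versions spell BEGIN/END as dbBegin(con) and dbCommit(con)):
dbSendQuery(con, "BEGIN")   # open one transaction for all of the inserts
for (i in 2:length(starts.stops)) {
  if (i == 2) {
    rows.to.add <- starts.stops[i - 1]:starts.stops[i]
  } else {
    rows.to.add <- (starts.stops[i - 1] + 1):starts.stops[i]
  }
  dbWriteTable(con, 'Table2', DB[rows.to.add, ], append = TRUE)
}
dbSendQuery(con, "END")     # commit everything at once
This is untested against RSQLite 0.11.2; if dbWriteTable already wraps each call in its own transaction in your version, the explicit BEGIN/END may conflict and should be dropped.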

Related

Join a data frame (or other R object) to a table in a read-only Postgresql database?

I have read-only access to a Postgres database; I cannot write to the database.
Q. Is there a way to construct and run a SQL query where I join a data frame (or other R object) to a table in a read-only Postgres database?
This is for accessing data from WRDS, https://wrds-www.wharton.upenn.edu/
Here's an attempt in pseudocode:
# establish a connection to the database
con <- dbConnect(Postgres(),
                 host = 'host.org',
                 port = 1234,
                 dbname = 'db_name',
                 sslmode = 'require',
                 user = 'username', password = 'password')

# create an R data frame (or other object)
df <- data.frame(customer_id = c('a123', 'a-345', 'b0'))

# the SQL query we would like to run, treating df as if it were a table
sql_query <- "
SELECT t.customer_id, t.*
FROM df t
LEFT JOIN table_name tn
  ON t.customer_id = tn.customer_id
"

# run it and fetch the results
res <- dbSendQuery(con, sql_query)
my_query_results <- dbFetch(res)
dbClearResult(res)
my_query_results
Note and edit: The example query I provided is intentionally super simple for example purposes.
In my actual queries, there might be 3 or more columns I want to join on, and millions of rows I want to join on.
Use the copy_inline function from the dbplyr package, which was added following an issue filed on this topic. See also the question here.
An example of its use is found here.
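A minimal sketch of what that might look like, assuming dbplyr >= 2.2.0, the connection con from the pseudocode above, and a remote table called table_name (untested against WRDS):
library(dplyr)
library(dbplyr)

# build an inline (VALUES-based) lazy table from the local data frame,
# then join it to the remote table without writing anything to the database
local_tbl  <- copy_inline(con, df)
remote_tbl <- tbl(con, "table_name")

my_query_results <- local_tbl %>%
  left_join(remote_tbl, by = "customer_id") %>%
  collect()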
If your join is on a single column, it can be rewritten using an IN clause:
In SQL:
SELECT customer_id
FROM table_name
WHERE customer_id in ('a123', 'a-345', 'b0')
Programmatically from R:
sql_query <- sprintf(
  "SELECT customer_id
   FROM table_name
   WHERE customer_id IN (%s)",
  paste(sQuote(df$customer_id, q = FALSE), collapse = ", ")
)

dbplyr function equivalent to EXISTS in sql

Some time ago I asked this question: Speeding up PostgreSQL queries (Check if entry exists in another table)
But since I'm working with DBI, with dbplyr as the backend, I'd like to know what the dbplyr equivalent of PostgreSQL's EXISTS function is.
For now, I'm performing the query using literal SQL syntax:
myQuery <- 'SELECT "genomic_accession",
"assembly",
"product_accession",
"tmpcol",
( EXISTS (SELECT 1
FROM "cachedb" c
WHERE c.product_accession IN ( pt.product_accession, pt.tmpcol
)) )
AS CACHE,
( EXISTS (SELECT 1
FROM "sbpdb" s
WHERE s.product_accession IN ( pt.product_accession, pt.tmpcol
)) )
AS SBP
FROM (SELECT *
FROM "pairtable2") pt; '
tmp <- dbGetQuery(db, myQuery)  # dbGetQuery (rather than dbExecute) is needed to retrieve the SELECT results
Then, I tried to pass literal SQL instructions to mutate:
pairTable %>%
  head(n = 5000) %>%
  mutate(
    CACHE = sql('EXISTS( SELECT 1 FROM "cacheDB" AS c
                 WHERE c.product_accession IN (product_accession, tmpcol) )'),
    SBP   = sql('EXISTS( SELECT 1 FROM "SBPDB" AS s
                 WHERE s.product_accession IN (product_accession, tmpcol) )')
  )
But this way, for reasons I don't understand, I'm losing all the cases where the comparison is false. I expect there is an implementation of this in dbplyr, or at least some DBI method for it.
Thanks

Inserting an R dataframe into a SQL table using a stored proc

I have a dataframe in R containing 10 rows and 7 columns. There's a stored procedure that does a few logic checks in the background and then inserts the data into the table 'commodity_price'.
library(RMySQL)

# Connection settings
mydb <- dbConnect(MySQL(),
                  user = 'uid',
                  password = 'pwd',
                  dbname = 'database_name',
                  host = 'localhost')

# Listing the tables
dbListTables(mydb)

f <- data.frame(
  location      = rep('Bhubaneshwar', 4),
  sourceid      = c(8, 8, 9, 2),
  product       = c("Ingot", "Ingot", "Sow Ingot", "Alloy Ingot"),
  Specification = c('ie10', 'ic20', 'se07', 'se08'),
  Price         = c(14668, 14200, 14280, 20980),
  currency      = rep('INR', 4),
  uom           = rep('INR/MT', 4)
)
For inserting multiple rows, there's a pre-created stored procedure, 'PROC_COMMODITY_PRICE_INSERT', which I need to call.
for (i in 1:nrow(f))
{
dbGetQuery(mydb,"CALL PROC_COMMODITY_PRICE_INSERT(
paste(f$location[i],',',
f$sourceid[i],',',f$product[i],',',f$Specification[i],',',
f$Price[i],',',f$currency[i],',', f$uom[i],',',#xyz,')',sep='')
);")
}
I am repeatedly getting this error:
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '[i],',',
f$sourceid[i],',',f$product[i],',',f$Specification' at line 2
I tried using RODBC but it's not connecting at all. How can I insert the data from the R dataframe into the 'commodity_price' table by calling a stored procedure? Thanks in advance!
That is probably because the paste() is inside the quoted SQL string; this might work:
for (i in 1:nrow(f))
{
  dbGetQuery(mydb, paste("CALL PROC_COMMODITY_PRICE_INSERT(", f$location[i], ',',
                         f$sourceid[i], ',', f$product[i], ',', f$Specification[i], ',',
                         f$Price[i], ',', f$currency[i], ',', f$uom[i], ',', "#xyz", ");",
                         sep = ''))
}
or the one-liner:
dbGetQuery(mydb,paste0("CALL PROC_COMMODITY_PRICE_INSERT('",apply(f, 1, paste0, collapse = "', '"),"');"))
Or, with the quotes added in the for-loop version:
for (i in 1:nrow(f))
{
  dbGetQuery(mydb, paste("CALL PROC_COMMODITY_PRICE_INSERT(",
                         "'", f$location[i], "'", ',', "'", f$sourceid[i], "'", ',',
                         "'", f$product[i], "'", ',', "'", f$Specification[i], "'", ',',
                         "'", f$Price[i], "'", ',', "'", f$currency[i], "'", ',',
                         "'", f$uom[i], "'", ',', '#xyz', ");",
                         sep = ''))
}
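If hand-placing the quotes gets unwieldy, DBI's dbQuoteString() can escape each value instead. A minimal sketch, assuming the same f and mydb as above, passing every column as a quoted string, and leaving out the trailing #xyz argument from the question:
library(DBI)

for (i in 1:nrow(f)) {
  # quote/escape each value using the driver's own quoting rules
  vals <- vapply(f[i, ], function(x) as.character(dbQuoteString(mydb, as.character(x))),
                 character(1))
  stmt <- paste0("CALL PROC_COMMODITY_PRICE_INSERT(", paste(vals, collapse = ", "), ");")
  dbGetQuery(mydb, stmt)
}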

R - RMySQL - could not run statement: memory exhausted

I have an R script for data analysis. I run it on 6 different tables from my MySQL database. On 5 of them the script works fine, but on the last table it does not. Here is the relevant part of my code:
sql <- ""
#write union select for just one access to database which will optimize code
for (i in 2:length(awq)-1){
num <- awq[i]-1
sql <- paste(sql, "(SELECT * FROM mytable LIMIT ", num, ",1) UNION ")
}
sql <- paste(sql, "(SELECT * FROM mytable LIMIT ", awq[length(awq)-1], ",1)")
#database query
nb <- dbGetQuery(mydb, sql)
The MySQL table where the script doesn't work has 21,676 rows. My other tables have under 20,000 rows, and with them the script works. When it fails it gives me this error:
Error in .local(conn, statement, ...) :
could not run statement: memory exhausted near '1) UNION (SELECT * FROM mytable LIMIT 14107 ,1) UNION (SELECT * FROM mytabl' at line 1
I understand there is a memory problem, but how can I solve it? I don't want to delete rows from my table. Is there another way?
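One possible workaround (a sketch only, untested; it assumes the offsets you want are awq - 1, as in the loop above): send the UNION in batches of a few hundred single-row SELECTs instead of one enormous statement, then bind the pieces together.
# split the offsets into batches so no single SQL statement gets too long
offsets <- as.integer(awq - 1)
batches <- split(offsets, ceiling(seq_along(offsets) / 500))

nb <- do.call(rbind, lapply(batches, function(b) {
  sql <- paste(sprintf("(SELECT * FROM mytable LIMIT %d,1)", b), collapse = " UNION ")
  dbGetQuery(mydb, sql)
}))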

How to Read 3 Million Records from RODBC and write to Text File

I'm reading 3 million records from a table and I want to write them to a text file, but the program is running out of memory and throwing an error:
Exceeded maximum space of Memory 3096 MB.
My system configuration is an i5 processor with 4 GB RAM.
Please find the code below.
library(RODBC)
con <- odbcConnect("REGION", uid = "", pwd = "")
a <- sqlQuery(con, "SELECT * FROM dbo.GERMANY where CHARGE_START_DATE = '04/01/2017'")
write.table(a, "C:/Users/609354986/Desktop/R/Data/1Germany.txt",
            na = "", sep = "|", row.names = FALSE, col.names = FALSE)
close(con)
What you can do is add an index column to your DB table so you can loop through it and extract/write your data piece by piece without filling up memory. Here is an example:
# create that index
sqlQuery(channel, 'alter table dbo.GERMANY ADD MY_COL NUMBER')
sqlQuery(channel, 'update dbo.GERMANY set MY_COL = rownum ')

# the function: pull and write one slice of rows at a time
g <- function(a) {
  for (i in 1:(length(a) - 1)) {   # stop one short so a[i + 1] always exists
    query <- gsub('\n', ' ', paste("SELECT * FROM dbo.GERMANY where
                                    CHARGE_START_DATE = '04/01/2017' and
                                    my_col between ", a[i], " and ", a[i + 1]))
    df <- sqlQuery(channel, query)
    write.csv(df, paste0('my_', i, '_df.csv'))
  }
}

# use reasonable chunks; the final boundary must reach the last row
a <- c(seq(1, 3000000, 250000), 3000000)
g(a)
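If a single output file is wanted, as in the original write.table() call, each chunk could be appended to one pipe-delimited file instead of writing separate CSVs; a sketch of what the write step inside g() might look like (replacing the write.csv() line):
# append each chunk to one pipe-delimited file, matching the question's write.table() call
write.table(df, "C:/Users/609354986/Desktop/R/Data/1Germany.txt",
            na = "", sep = "|", row.names = FALSE, col.names = FALSE,
            append = TRUE)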
