Is there a way to do a bulk insert using MonetDB.R (not via a for loop and dbSendUpdate)?
Does dbWriteTable allow for updates (append=TRUE)?
About "INSERT INTO" the MonetDB documentation states:
"The benefit is clear: this is very straightforward. However, this is a seriously inefficient way of doing things in MonetDB."
Thanks.
Hannes might have a smarter solution, but for the time being, this might help :)
# start with an example data set
nrow( mtcars )
# and a MonetDB.R connection
db
# here's how many records you'd have if you stack your example data three times
nrow( mtcars ) * 3
# write to three separate tables
dbWriteTable( db , 'mtcars1' , mtcars )
dbWriteTable( db , 'mtcars2' , mtcars )
dbWriteTable( db , 'mtcars3' , mtcars )
# stack them all
dbSendUpdate( db , "CREATE TABLE mtcars AS SELECT * FROM mtcars1 UNION ALL SELECT * FROM mtcars2 UNION ALL SELECT * FROM mtcars3 WITH DATA" )
# correct number of records
nrow( dbReadTable( db , 'mtcars' ) )
I'll consider it. monetdb.read.csv does use COPY INTO, so you might get away with creating a temporary CSV file.
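Something along these lines might tide you over (a rough sketch, untested; the target table must already exist with matching column types, the file must be readable by the MonetDB server process, and the COPY INTO delimiter syntax may need adjusting):
# dump the data.frame to a temporary CSV without a header row,
# then bulk-load it with COPY INTO
tmp <- tempfile(fileext = ".csv")
write.table(mtcars, tmp, sep = ",", row.names = FALSE, col.names = FALSE)
dbSendUpdate(db, paste0("COPY INTO mtcars FROM '", tmp,
                        "' USING DELIMITERS ',','\\n','\"'"))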
I see what you mean, although it doesn't change the fact that dbWriteTable uses a for loop and "INSERT INTO", which can be rather slow. I may not have been very clear in my initial post.
As a workaround I guess "START TRANSACTION" and "COMMIT" with dbSendUpdate might work.
Ideally something like this would be great:
"COPY INTO table FROM data.frame"
We just published version 0.9.4 of MonetDB.R on CRAN. The main change in this release is a set of major improvements to the dbWriteTable method. By default, INSERTs are now chunked into 1000 rows per statement. Also, if the database is running on the same machine as R, you may use the csvdump=T parameter. This writes the data.frame to a local temporary CSV file and uses an automatically generated COPY INTO statement to import it. Both methods are designed to improve the speed with which dbWriteTable imports data. Also, the append/overwrite parameter handling has been fixed.
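For illustration, usage of the new options might look like this (the table names here are made up; csvdump=T only applies when R and the database run on the same machine):
# chunked INSERTs (the new default)
dbWriteTable(db, "mtcars_bulk", mtcars)
# append more rows to an existing table
dbWriteTable(db, "mtcars_bulk", mtcars, append = TRUE)
# write a temporary CSV and load it via an automatically generated COPY INTO
dbWriteTable(db, "mtcars_csv", mtcars, csvdump = TRUE)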
Related
What is the dbplyr verb combination that is equivalent to DBI::dbSendQuery(con, "DELETE FROM <table> WHERE <condition>")?
What I want is not to query data from the database, but to remove data from and update a table in the database.
I want to do it in a dplyr way, but I am not sure if it is possible. I could not find anything similar in the package reference.
dbplyr translates dplyr commands to query database tables. I am not aware of any inbuilt way to modify existing database tables using pure dbplyr.
This is likely a design choice.
Within R we do not need to distinguish between fetching data from a table (querying) and modifying a table. This is probably because in R we can reload the original data into memory if an error/mistake occurs.
But in databases querying and modifying a table are deliberately different things. When modifying a database, you are modifying the source so additional controls are used (because recovering deleted data is a lot harder).
The DBI package is probably your best choice for modifying the database.
This is the approach I use for all my dbplyr work: often a custom function that takes the query produced by the dbplyr translation and inserts it into a DBI call (you can see examples of this in my dbplyr helpers GitHub repo).
Two approaches to consider for this: (1) an anti-join (on all columns) followed by writing a new table, (2) the DELETE FROM syntax.
Mock up of anti-join approach
records_to_remove = remote_table %>%
  filter(conditions)

desired_final_table = remote_table %>%
  anti_join(records_to_remove, by = colnames(remote_table))

query = paste0("SELECT * INTO output_table FROM (",
               sql_render(desired_final_table),
               ") AS subquery")

DBI::dbExecute(db_con, as.character(query))
Mock up of DELETE FROM syntax
records_to_remove = remote_table %>%
  filter(conditions)

# render the query and swap the SELECT clause for DELETE
# (fixed = TRUE so the "*" is treated literally, not as a regex)
query = sql_render(records_to_remove) %>%
  as.character() %>%
  gsub(pattern = "SELECT *", replacement = "DELETE", fixed = TRUE)

DBI::dbExecute(db_con, query)
If you plan to run these queries multiple times, then wrapping them in a function with checks for validity would be recommended.
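For instance, a hypothetical wrapper around the DELETE FROM approach might look like this (a sketch only; delete_where and its arguments are made-up names, and the sanity check simply guards against the text substitution failing):
delete_where <- function(db_con, remote_table, ...) {
  records_to_remove <- dplyr::filter(remote_table, ...)
  query <- gsub("SELECT *", "DELETE",
                as.character(dbplyr::sql_render(records_to_remove)),
                fixed = TRUE)
  # refuse to execute anything that is not a DELETE statement
  stopifnot(grepl("^DELETE", query))
  DBI::dbExecute(db_con, query)
}

# usage (hypothetical column name): delete_where(db_con, remote_table, colX == 1)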
For some use cases deleting rows will not be necessary.
You could think of the filter command in R as deleting rows from a table. For example in R we might run:
prepared_table = input_table %>%
filter(colX == 1) %>%
select(colA, colB, colZ)
And think of this as deleting rows where colX == 1 before producing output:
output = prepared_table %>%
group_by(colA) %>%
summarise(sumZ = sum(colZ))
(Or you could use an anti-join above instead of a filter.)
But for this type of deleting, you do not need to edit the source data, as you can just filter out the unwanted rows at runtime every time. Yes it will make your database query larger, but this is normal for working with databases.
So combining the preparation and output in SQL is normal (something like this):
SELECT colA, SUM(colZ) AS sumZ
FROM (
SELECT colA, colB, colZ
FROM input_table
WHERE colX = 1
) AS prepared_table
GROUP BY colA
So unless you need to modify the database, I would recommend filtering instead of deleting.
I'm trying to analyze data stored in an SQL database (MS SQL Server) in R, on a Mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to more quickly pipe the data to my mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
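The chunked version looks roughly like this (a sketch; con and qry stand in for my connection and query, and the chunk size is arbitrary):
res <- DBI::dbSendQuery(con, qry)
chunks <- list()
while (!DBI::dbHasCompleted(res)) {
  chunks[[length(chunks) + 1L]] <- DBI::dbFetch(res, n = 500000)
}
DBI::dbClearResult(res)
result <- do.call(rbind, chunks)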
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
My answer involves using a different package. I use RODBC, which is found on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time compared with my previous method of exporting each query result to .csv and loading it into my R environment. I found the regular odbc package to be much slower than RODBC.
I use the following functions:
sqlQuery() wraps the function that opens the connection to the SQL db as its first argument (in parentheses), with the query itself as the second argument. Put the query itself in quote marks.
odbcConnect() is itself the first argument in sqlQuery(). The argument in odbcConnect() is the name of your connection to the SQL db. Put the connection name in quote marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
"SELECT *
FROM dbo.table
WHERE Collection_ID = 2498") %>%
mutate(matchid = paste0(schoolID, "-", studentID)) %>%
distinct(matchid, .keep_all = TRUE)
odbcCloseAll()
I would suggest using the dbcooper package found on GitHub: https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
First, add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
                       Driver = "",
                       Server = "",
                       Database = "",
                       UID = "",
                       PWD = "")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
                   con_id = "test",
                   tables = c("schema.table"))
This adds the function test_schema_table() to your environment, which is used to call the data. To collect into your environment, use test_schema_table() %>% collect().
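For example, you can filter in the database before collecting (the column name here is made up):
test_schema_table() %>%
  dplyr::filter(some_column == "some_value") %>%
  dplyr::collect()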
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
  DBI = DBI::dbFetch(DBI::dbSendQuery(conn, qry)),
  dbcooper = ava_qry() %>% collect(),
  times = 5
)
I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
Now this table has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with DBI package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) an OK/good practice?
2) If yes to 1), is the only way to be able to perform the unique() operation or other similar operations to increase the EC2 server memory?
3) If yes to 2), how can I know to which extent should I increase the EC2 server memory?
Thanks!
dbReadTable converts the entire table to a data.frame, which is not what you want to do for such a big table.
As @cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind, dbFetch or dbGetQuery.
For example, you could define a function to get the required data
filterBySQLString <- function(databaseDB, sqlString) {
  sqlString <- as.character(sqlString)
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  dbClearResult(dbResponse)
  return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT * ...
DISTINCT ..."
my_big_df <- filterBySQLString(myDB,SQLquery)
my_big_df <- unique(my_big_df)
If you cannot use SQL, then you have two options:
1) stop using RStudio and try to run your code from the terminal or via Rscript.
2) beef up your instance
A thing I ran into using the RPostgreSQL package is that dbWriteTable(… overwrite=TRUE) seems to destroy the existing table structure (datatypes and constraints), and dbRemoveTable() is equivalent to DROP TABLE.
So I’ve used:
ltvTable <- "the_table_to_use"
dfLTV <- data.frame(x, y, z)
sql_truncate <- paste("TRUNCATE ", ltvTable) ##unconditional DELETE FROM…
res <- dbSendQuery(conn=con, statement=sql_truncate)
dbWriteTable(conn=con, name=ltvTable, value=dfLTV, row.names=FALSE, append=TRUE)
Is the TRUNCATE step necessary, or is there a dbWriteTable method that overwrites just the content not the structure?
I experience different behaviour from the answer offered by Manura Omal to How we can write data to a postgres DB table using R?, as overwrite=TRUE does not appear to truncate first.
I'm using: RPostgreSQL 0.4-1; PostgreSQL 9.4
best wishes - JS
As far as I know overwrite=T does 3 things:
DROPs the table
CREATES the table
fills the table with new data
So if you want to preserve the structure, you need the TRUNCATE step.
Different behaviour might be caused by the existence or non-existence of foreign keys preventing the DROP TABLE step.
I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database using R. In order to do so I am preparing a multi-row INSERT string using R.
query <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;
INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;
UPDATE %s.meta_data_unlocalized
SET meta_data = md_updates.meta_data
FROM md_updates
WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
COMMIT;", md_values, schema, schema, schema, schema)
DBI::dbGetQuery(con,query)
The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem. I left it out, ran the query again, and it wasn't much faster. INSERTing a million+ records seems to be the issue here.
I did some research and found quite a bit of information:
bulk inserts
bulk inserts II
what causes large inserts to slow down
Answers from @Erwin Brandstetter and @Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.
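As an illustration of the index suggestion, the pattern was roughly this (a sketch; the index name and definition are made up for this example):
# drop the index before the bulk load ...
DBI::dbExecute(con, sprintf(
  "DROP INDEX IF EXISTS %s.meta_data_ts_key_idx;", schema))
# ... run the bulk INSERT / UPDATE from above here ...
# ... then recreate the index afterwards
DBI::dbExecute(con, sprintf(
  "CREATE INDEX meta_data_ts_key_idx ON %s.meta_data_unlocalized (ts_key);",
  schema))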
However, I struggled to implement another suggestion which sounded promising: COPY. The problem is I can't get it done from within R.
The following works for me:
sql <- sprintf('CREATE TABLE
md_updates(ts_key varchar, meta_data hstore);
COPY md_updates FROM STDIN;')
dbGetQuery(sandbox,"COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")
But I can't get it done without reading from an extra .csv file. So my questions are:
Is COPY really a promising approach here (over the multi-row INSERT I got)?
Is there a way to use COPY from within R without writing data to a file? The data fits in memory, and since it's already in memory, why write to disk?
I am using PostgreSQL 9.5 on OS X and 9.5 on RHEL respectively.
RPostgreSQL has a "CopyInDataframe" function that looks like it should do what you want:
install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)
where table foo has the same columns as data frame df.
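For completeness, a minimal end-to-end sketch might look like this (the table, columns, and data.frame here are made up, and the target table has to exist before the COPY):
# reusing the connection from above
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
dbSendQuery(con, "CREATE TABLE foo (a integer, b text);")
dbSendQuery(con, "COPY foo FROM STDIN;")
postgresqlCopyInDataframe(con, df)
dbGetQuery(con, "SELECT count(*) FROM foo;")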