What combination of dbplyr verbs is equivalent to DBI::dbSendQuery(con, "DELETE FROM <table> WHERE <condition>")?
What I want is not to query data from the database, but to remove data from, and update, a table in the database.
I want to do it in a dplyr way, but I am not sure if it is possible. I could not find anything similar in the package reference.
dbplyr translates dplyr commands to query database tables. I am not aware of any inbuilt way to modify existing database tables using pure dbplyr.
This is likely a design choice.
Within R we do not need to distinguish between fetching data from a table (querying) and modifying a table. This is probably because in R we can reload the original data into memory if an error/mistake occurs.
But in databases querying and modifying a table are deliberately different things. When modifying a database, you are modifying the source so additional controls are used (because recovering deleted data is a lot harder).
The DBI package is probably your best choice for modifying the database.
This is the approach I use for all my dbplyr work: often a custom function takes the query produced by the dbplyr translation and inserts it into a DBI call (you can see examples of this in my dbplyr helpers GitHub repo).
Two approaches to consider for this: (1) an anti-join (on all columns) followed by writing a new table, (2) the DELETE FROM syntax.
Mock-up of the anti-join approach:
records_to_remove <- remote_table %>%
  filter(conditions)

desired_final_table <- remote_table %>%
  anti_join(records_to_remove, by = colnames(remote_table))

# wrap the rendered query so the result is written out as a new table
query <- paste0(
  "SELECT * INTO output_table FROM (",
  dbplyr::sql_render(desired_final_table),
  ") AS subquery"
)

DBI::dbExecute(db_con, as.character(query))
Mock-up of the DELETE FROM syntax:
records_to_remove <- remote_table %>%
  filter(conditions)

# turn the rendered SELECT into a DELETE statement
query <- dbplyr::sql_render(records_to_remove) %>%
  as.character() %>%
  gsub(pattern = "SELECT *", replacement = "DELETE", x = ., fixed = TRUE)

DBI::dbExecute(db_con, query)
If you plan to run these queries multiple times, then wrapping them in a function with validity checks is recommended, as in the sketch below.
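For example, a wrapper along these lines (a sketch only; delete_rows and its checks are illustrative, not part of dbplyr, and assume dplyr/dbplyr are loaded):

# Illustrative sketch: render the filtered remote table, rewrite the SELECT
# into a DELETE, and refuse to run if the result does not look like a
# targeted DELETE statement.
delete_rows <- function(remote_tbl, db_con) {
  query <- dbplyr::sql_render(remote_tbl) %>%
    as.character() %>%
    gsub(pattern = "SELECT *", replacement = "DELETE", x = ., fixed = TRUE)

  # basic validity checks before touching the database
  stopifnot(startsWith(query, "DELETE"), grepl("WHERE", query, fixed = TRUE))

  DBI::dbExecute(db_con, query)
}

# usage: delete_rows(remote_table %>% filter(conditions), db_con)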
For some use cases deleting rows will not be necessary.
You could think of the filter command in R as deleting rows from a table. For example in R we might run:
prepared_table <- input_table %>%
  filter(colX == 1) %>%
  select(colA, colB, colZ)
And think of this as deleting rows where colX == 1 before producing output:
output <- prepared_table %>%
  group_by(colA) %>%
  summarise(sumZ = sum(colZ))
(Or you could use an anti-join instead of the filter above.)
But for this type of deleting, you do not need to edit the source data, as you can just filter out the unwanted rows at runtime every time. Yes it will make your database query larger, but this is normal for working with databases.
So combining the preparation and output in SQL is normal (something like this):
SELECT colA, SUM(colZ) AS sumZ
FROM (
  SELECT colA, colB, colZ
  FROM input_table
  WHERE colX = 1
) AS prepared_table
GROUP BY colA
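If input_table is a remote dbplyr table, you can confirm that the combined pipeline translates to SQL of roughly this shape (a sketch; the exact rendering depends on the backend):

# show_query() prints the translated SQL without executing anything.
input_table %>%
  filter(colX == 1) %>%
  select(colA, colB, colZ) %>%
  group_by(colA) %>%
  summarise(sumZ = sum(colZ)) %>%
  show_query()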
So unless you need to modify the database, I would recommend filtering instead of deleting.
Related
I have a JDBC connection and would like to query data from one schema and save the result to another.
library(tidyverse)
library(dbplyr)
library(rJava)
library(RJDBC)
# access the temp table in the native schema
tbl(conn, "temp")
temp_ed <- temp %*% mutate(new = 1)
# I would like to save temp_ed to a new schema "schmema_new"
I would like to use something like dbplyr::compute() but define the output schema specifically. It seems dbplyr::copy_to could be used, but would require bringing the data through the local machine.
I want to use something like RJDBC::dbSendUpdate() but which would ideally integrate nicely with the data manipulating pipeline above.
I do this using dbExecute from the DBI package.
The key idea is to extract the query that defines the current remote table and make this a sub-query in a larger SQL query that writes the table. This requires that (1) the schema exists, (2) you have permission to write new tables, and (3) you know the correct SQL syntax.
Doing this directly might look like:
tbl(conn, "temp")
temp_ed <- temp %*% mutate(new = 1)
save_table_query <- paste(
  "SELECT * INTO my_database.schema_new.my_table FROM (\n",
  dbplyr::sql_render(temp_ed),
  "\n) AS sub_query"
)

dbExecute(conn, as.character(save_table_query))
INTO is the clause for writing a new table in SQL Server (the flavour of SQL I use). You will need to find the equivalent clause for your database, as in the sketch below.
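For example, on a backend that uses CREATE TABLE ... AS rather than SELECT ... INTO (Teradata and PostgreSQL are of this style), the wrapper might look more like the following sketch; the three-part table name and the WITH DATA clause are illustrative and will differ by database:

# Sketch of the same idea for a CREATE TABLE ... AS style backend
# (table name and WITH DATA clause are illustrative; check your database docs).
save_table_query <- paste(
  "CREATE TABLE my_database.schema_new.my_table AS (\n",
  dbplyr::sql_render(temp_ed),
  "\n) WITH DATA"
)
dbExecute(conn, as.character(save_table_query))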
In practice I use a custom function that looks something like this:
write_to_database <- function(input_tbl, db, schema, tbl_name){
  # connection
  tbl_connection <- input_tbl$src$con

  # SQL query
  sql_query <- glue::glue(
    "SELECT *\n",
    "INTO {db}.{schema}.{tbl_name}\n",
    "FROM (\n",
    dbplyr::sql_render(input_tbl),
    "\n) AS sub_query"
  )

  result <- dbExecute(tbl_connection, as.character(sql_query))
}
Applying this in your context:
tbl(conn, "temp")
temp_ed <- temp %*% mutate(new = 1)
write_to_database(temp_ed, "my_database", "schema_new", "my_table")
I am working with database tables with dbplyr.
I have a local table and want to join it with a large (150m rows) table on the database.
The PRODUCTION database is read-only.
# Set up the connection and point to the table
library(odbc); library(dbplyr)
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
t2690_username,";PWD=",t2690_password, sep="")
t2690 <- dbConnect(odbc::odbc(), .connection_string=my_conn_string)
order_line <- tbl(t2690, "order_line") #150m rows
I also have a local table; let's call it orders:
# fill df with random data
orders <- data.frame(matrix(rexp(100000 * 5), nrow = 100000, ncol = 5))
names(orders) <- c("customer_id", paste0(rep("variable_", 4), 1:4))
Let's say I want to join these two tables; I get the following error:
complete_orders <- orders %>% left_join(order_line)
> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)
The issue is, if I were to set copy = TRUE, it would try to download the whole of order_line and my computer would quickly run out of memory.
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
The only solution I have found is to upload into the writable database and use SQL to join them, which is far from ideal.
I have found the answer: you can use the in_schema() function within the tbl() pointer to work across schemas within the same connection.
# Connect without specifying a database
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
                        t2690_username, ";PWD=", t2690_password, sep = "")
t2690 <- dbConnect(odbc::odbc(), .connection_string = my_conn_string)

# Upload the local table to the TEMP db, then point to it
orders <- tbl(t2690, in_schema("TEMP", "orders"))
order_line <- tbl(t2690, in_schema("PRODUCTION", "order_line"))
complete_orders <- orders %>% left_join(order_line)
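The upload of the local orders data frame into TEMP is not shown above; one way to do it (a sketch, assuming you have write access to TEMP, run before orders is reassigned to the remote pointer) is:

# Write the local data frame into the TEMP schema before pointing tbl() at it.
DBI::dbWriteTable(t2690,
                  DBI::Id(schema = "TEMP", table = "orders"),
                  orders)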
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
In your use case, it seems (based on the accepted answer) that your databases are on the same server and it's just a matter of using in_schema. If this were not the case, another approach would be that given here, which in effect gives a version of copy_to that works on a read-only connection.
I have a Postgres database. I want to find the minimum value of a column called calendarid, which is of type integer and has the format yyyymmdd, from a certain table. I am able to do so via the following code.
get_history_startdate <- function(src) {
  get_required_table(src) %>%  # this gives me the table: tbl(src, "table_name")
    select(calendarid) %>%
    collect() %>%              # pulls the whole column into local memory
    min()                      # result: 20150131
}
But this method is really slow as it loads all the data from the database to the memory. Any ideas how can I improve it?
get_required_table(src) %>%
  summarise(min_calendarid = min(calendarid, na.rm = TRUE)) %>%
  pull()
will run the appropriate SQL query.
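You can inspect the SQL dbplyr generates before running it:

# show_query() prints the SQL that dbplyr will send, without executing it.
get_required_table(src) %>%
  summarise(min_calendarid = min(calendarid, na.rm = TRUE)) %>%
  show_query()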
If you just want the minimum value of the calendarid column across the entire table, then use this:
SELECT MIN(calendarid) AS min_calendarid
FROM your_table;
I don't know exactly what your R code is doing under the hood, but if it's bringing the entire table from Postgres into R, then that is very wasteful. If so, running the above query directly on Postgres should give you a boost in performance, as in the sketch below.
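If you prefer to stay in R, the same query can be sent directly through DBI (a sketch; con is assumed to be a DBI connection and your_table is a placeholder name):

# Run the aggregation on the Postgres side; only one row comes back to R.
DBI::dbGetQuery(con, "SELECT MIN(calendarid) AS min_calendarid FROM your_table;")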
I want to use dbWriteTable() from R's DBI package to write data into a database. Usually, the respective tables are already present, so I use the argument append = TRUE. How do I find out which rows were added to the table by dbWriteTable()? Most of the tables have certain columns with UNIQUE values, so a SELECT will work (see below for a simple example). However, this is not true for all of them, or only several columns together are UNIQUE, making the SELECT more complicated. In addition, I would like to put the writing and querying into a function, so I would prefer a consistent approach for all cases.
I mainly need this to get the PRIMARY KEYs added by the database and to allow a user to quickly see what was added. If it matters, my database is PostgreSQL and I would like to use the odbc package for the connection.
I have something like this in mind, however, I am looking for a more general solution:
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "database")

dbWriteTable(con,
             name = "site",
             value = data.frame(name = c("abcd", "efgh")),
             append = TRUE)

dbGetQuery(con,
           "SELECT * FROM site WHERE name IN ('abcd', 'efgh');")
Is there a way to do a bulk insert using MonetDB.R (not via a for loop and dbSendUpdate)?
Does dbWriteTable allow for updates (append=TRUE)?
About "INSERT INTO" the MonetDB documentation states:
"The benefit is clear: this is very straightforward. However, this is a seriously inefficient way of doing things in MonetDB."
Thanks.
Hannes might have a smarter solution, but for the time being, this might help :)
# start with an example data set
nrow( mtcars )
# and a MonetDB.R connection
db
# here's how many records you'd have if you stack your example data three times
nrow( mtcars ) * 3
# write to three separate tables
dbWriteTable( db , 'mtcars1' , mtcars )
dbWriteTable( db , 'mtcars2' , mtcars )
dbWriteTable( db , 'mtcars3' , mtcars )
# stack them all
dbSendUpdate( db , "CREATE TABLE mtcars AS SELECT * FROM mtcars1 UNION ALL SELECT * FROM mtcars2 UNION ALL SELECT * FROM mtcars3 WITH DATA" )
# correct number of records
nrow( dbReadTable( db , 'mtcars' ) )
I'll consider it. monetdb.read.csv does use COPY INTO, so you might get away with creating a temp. CSV file.
I see what you mean, although this doesn't change the fact that dbWriteTable uses a for loop and "INSERT INTO", which can be rather slow. I may not have been very clear in my initial post.
As a workaround I guess "START TRANSACTION" and "COMMIT" with dbSendUpdate might work.
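A rough sketch of that workaround (illustrative only: dat and target_table are placeholders, and the value quoting below is naive):

# Wrap the row-by-row INSERTs in a single transaction so the work is
# committed once rather than per statement.
dbSendUpdate(db, "START TRANSACTION")
for (i in seq_len(nrow(dat))) {
  values <- paste(sprintf("'%s'", unlist(dat[i, ])), collapse = ", ")
  dbSendUpdate(db, sprintf("INSERT INTO target_table VALUES (%s)", values))
}
dbSendUpdate(db, "COMMIT")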
Ideally something like this would be great:
"COPY INTO table FROM data.frame"
We just published version 0.9.4 of MonetDB.R on CRAN. The main change in this release is a set of major improvements to the dbWriteTable method. By default, INSERTs are now chunked into 1000 rows per statement. Also, if the database is running on the same machine as R, you may use the csvdump=T parameter. This writes the data.frame to a local temporary CSV file and uses an automatically generated COPY INTO statement to import it. Both these methods are designed to improve the speed with which dbWriteTable imports data. Also, the append/overwrite parameter handling has been fixed.
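Based on the parameter described above, usage might look like this (a sketch; mtcars and the table name are illustrative):

# The default behaviour now chunks INSERTs; csvdump = TRUE instead writes a
# temporary CSV and loads it with COPY INTO (database on the same machine only).
dbWriteTable(db, "mtcars_bulk", mtcars, csvdump = TRUE)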