I have worked a little with DBI in R and have two questions. The first is about best practice: appending new data to the DB currently takes longer than I hoped. The second is an error I receive when trying to update existing information in the database. Here is my current workflow for inserting new data into an existing table in the DB:
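library(DBI)
library(odbc)
library(dplyr)   # tibble(), tbl(), anti_join(), collect(), %>%
library(dbplyr)  # in_schema()
con <- dbConnect(odbc(), "myDSN")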
# Example table 1
tbl1 <- tibble(Key = c("A", "B", "C", "D", "E"),
Val = c(1, 2, 3, 4, 5))
# Original table in DB
dbWriteTable(con, "tbl1", tbl1, overwrite = TRUE)
# Link to Original table
db_tbl <- tbl(con, in_schema("dbo", "tbl1"))
# New data
tbl2 <- tibble(Key = c("D", "E", "F", "G", "H"),
val = c(10, 11, 12, 13, 14))
# Write it to Staging
dbWriteTable(con, "tbl1_staging", tbl2, overwrite = TRUE)
# Get a link to staging
db_tblStaging <- tbl(con, in_schema("dbo", "tbl1_staging"))
# Compare Info
not_in_db <- db_tblStaging %>%
anti_join(db_tbl, by="Key") %>%
collect()
# Append missing info to DB
dbWriteTable(con, "tbl1", not_in_db, append = TRUE)
# Voila!
dbReadTable(con, "tbl1")
That does the trick, but I'm looking for a better solution. I dislike the collect() step, which (as far as I understand it) pulls the data into R's memory and could become a problem once the data gets bigger. What I hoped would work is something like the following, which would append new data to the DB on the fly without it ever passing through R's memory.
# What I hoped to have
db_tblStaging %>%
anti_join(db_tbl, by="Key") %>%
dbWriteTable(con, "tbl1", ., append = TRUE)
The second problem is updating the existing table. Here is what I tried, but it raises an error that I can't figure out. I tried to follow the answer to this question: How to pass data.frame for UPDATE with R DBI. I would like to update keys E and D with the new values in val.
# Trying to update tbl1
update_values <- db_tblStaging %>%
semi_join(db_tbl, by="Key") %>%
collect()
update <- dbSendQuery(con, 'UPDATE tbl1
SET "val" = ?
WHERE Key = ?')
dbBind(update, update_values)
Error in result_bind(res@ptr, as.list(params)) :
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Incorrect syntax near the keyword 'Key'.
Has the package changed in some way? I can't spot my syntax error.
Consider running pure SQL after your staging table uploads, since it looks like you need an INSERT with NOT EXISTS (to avoid duplicates) and an UPDATE with INNER JOIN (for existing records). This avoids any client-side imports and exports in R.
Also, Key is a reserved word in SQL Server, so escape it with square brackets.
apn_sql <- "INSERT INTO dbo.tbl ([Key], [Val])
SELECT s.[Key], s.[Val] FROM dbo.tbl_staging s
WHERE NOT EXISTS
(SELECT 1 FROM dbo.tbl t
WHERE t.[Key] = s.[Key])"
# dbExecute() runs the statement and frees the result in one call
dbExecute(con, apn_sql)
upd_sql <- "UPDATE t
SET t.Val = s.Val
FROM dbo.tbl t
INNER JOIN dbo.tbl_staging s
ON t.[Key] = s.[Key]"
dbExecute(con, upd_sql)
Rextester demo
In fact, SQL Server has the MERGE query to handle both in one call:
MERGE dbo.tbl AS Target
USING (SELECT [Key], [Val] FROM dbo.tbl_staging) AS Source
ON (Target.[Key] = Source.[Key])
WHEN MATCHED THEN
UPDATE SET Target.Val = Source.Val
WHEN NOT MATCHED BY TARGET THEN
INSERT ([Key], [Val])
VALUES (Source.[Key], Source.[Val]);
Rextester demo
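The staging pattern above can be tried end to end without a SQL Server instance. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in: the table names mirror the answer, but the SQL uses double-quoted identifiers and correlated subqueries instead of T-SQL brackets and UPDATE ... FROM, so it is an illustration of the idea rather than the SQL Server syntax itself.

```r
library(DBI)

# Assumption: in-memory SQLite stands in for the SQL Server database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "tbl",
             data.frame(Key = c("A", "B", "C", "D", "E"),
                        Val = c(1, 2, 3, 4, 5)))
dbWriteTable(con, "tbl_staging",
             data.frame(Key = c("D", "E", "F", "G", "H"),
                        Val = c(10, 11, 12, 13, 14)))

# Update rows whose key already exists in the target table
n_updated <- dbExecute(con, '
  UPDATE tbl
  SET Val = (SELECT s.Val FROM tbl_staging s WHERE s."Key" = tbl."Key")
  WHERE EXISTS (SELECT 1 FROM tbl_staging s WHERE s."Key" = tbl."Key")')

# Append rows whose key is not in the target table yet
n_inserted <- dbExecute(con, '
  INSERT INTO tbl ("Key", Val)
  SELECT s."Key", s.Val FROM tbl_staging s
  WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t."Key" = s."Key")')

result <- dbGetQuery(con, 'SELECT "Key", Val FROM tbl ORDER BY "Key"')
dbDisconnect(con)
```

Both statements run entirely inside the database; R only receives the affected-row counts returned by dbExecute().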
Related
I have multiple tables in a SQLite database. I am trying to delete specific rows from a table using the DBI package. Here is the code:
library(dplyr)
library(DBI)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "C:\\DB2.sqlite" , password="password")
DBI::dbWriteTable(con,"data_iris",iris,overwrite=TRUE)
query<-"DELETE FROM data_iris WHERE Species = ?;"
specie<-'setosa'
res <- dbExecute(con,query,params = list(specie))
res
[1] 50
The above code works well. But why does the following code not work?
query <- 'DELETE FROM ? WHERE Species = ?;'
table_name<-"data_iris"
res <- dbExecute(con,query,params = c(table_name,specie))
#Error: near "?": syntax error
I cannot use the first version, since table_name changes dynamically (in a Shiny app).
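One likely explanation is that `?` placeholders bind values only, not identifiers such as table names, so the table name has to be spliced into the SQL text itself. A minimal sketch, assuming an in-memory database and a hypothetical helper named delete_species(): the table name is quoted with DBI::dbQuoteIdentifier() before being pasted in, while the species is still passed as a bound parameter.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "data_iris", iris, overwrite = TRUE)

# Hypothetical helper: identifier is quoted and pasted, value is bound
delete_species <- function(con, table_name, species) {
  query <- paste0("DELETE FROM ", dbQuoteIdentifier(con, table_name),
                  " WHERE Species = ?")
  dbExecute(con, query, params = list(species))
}

n_deleted <- delete_species(con, "data_iris", "setosa")
remaining <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM data_iris")$n
dbDisconnect(con)
```

In a Shiny app it would also be prudent to check the incoming table name against dbListTables(con) before using it.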
I'm trying to join tables from two different datasets in the same project. How can I do this?
library(tidyverse)
library(bigrquery)
con1 <-
DBI::dbConnect(
drv = bigrquery::bigquery(),
project = PROJECT,
dataset = "dataset_1"
)
con2 <-
DBI::dbConnect(
drv = bigrquery::bigquery(),
project = PROJECT,
dataset = "dataset_2"
)
A <- con1 %>% tbl("A")
B <- con2 %>% tbl("B")
inner_join(A, B,
by = "key",
copy = T) %>%
collect()
Then I get the error: Error: BigQuery does not support temporary tables
The problem is most likely that you are using different connections to connect with the two tables. When you attempt this, R tries to copy data from one source into a temporary table on the other source.
See this question and the copy parameter in this documentation (it's a different package, but the principle is the same).
The solution is to only use a single connection for all your tables. Something like this:
con <-
DBI::dbConnect(
drv = bigrquery::bigquery(),
project = PROJECT,
dataset = "dataset_1"
)
A <- con %>% tbl("A")
B <- con %>% tbl("B")
inner_join(A, B,
by = "key") %>%
collect()
You may need to leave the dataset parameter blank in your connection string, or use in_schema to include the dataset name along with the table when you connect to a remote table. It's hard to be sure without knowing more about the structure of your database(s).
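The in_schema() idea can be sketched without BigQuery access. The example below is an assumption-laden stand-in: it uses one SQLite connection with a second attached database named dataset_2 playing the role of the second dataset, and a join column called id rather than key. With a real BigQuery connection the dataset names would take the place of the schema names.

```r
library(dplyr)
library(dbplyr)

# Assumption: SQLite with an attached database stands in for two datasets
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "ATTACH DATABASE ':memory:' AS dataset_2")

DBI::dbExecute(con, "CREATE TABLE main.A (id TEXT, x REAL)")
DBI::dbExecute(con, "INSERT INTO main.A VALUES ('a', 1), ('b', 2), ('c', 3)")
DBI::dbExecute(con, "CREATE TABLE dataset_2.B (id TEXT, y REAL)")
DBI::dbExecute(con, "INSERT INTO dataset_2.B VALUES ('b', 20), ('c', 30), ('d', 40)")

# Both lazy tables share one connection, so the join runs in the database
A <- tbl(con, in_schema("main", "A"))
B <- tbl(con, in_schema("dataset_2", "B"))

joined <- inner_join(A, B, by = "id") %>%
  arrange(id) %>%
  collect()

DBI::dbDisconnect(con)
```

Because A and B come from the same connection, no copy = TRUE is needed and no temporary table is created.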
The table reg_data is a PostgreSQL table. It turns out to be faster to run the regressions in PostgreSQL. But, as I am running it for 100,000s of data sets, I want to do it data set by data set and append the results of each to a table.
Is there a way to append PostgreSQL data to a PostgreSQL table using native dplyr verbs? I'm not sure that there's a huge cost to bringing the data to R then sending them back to PostgreSQL (it's just 6 numbers and a couple of identifying fields), but it does seem inelegant.
library(dplyr)
pg <- src_postgres()
reg_data <- tbl(pg, "reg_data")
reg_results <-
reg_data %>%
summarize(r_squared=regr_r2(y, x),
num_obs=regr_count(y, x),
constant=regr_intercept(y, x),
slope=regr_slope(y, x),
mean_analyst_fog=regr_avgx(y, x),
mean_manager_fog=regr_avgy(y, x)) %>%
collect() %>%
as.data.frame()
# Push to database.
dbWriteTable(pg$con, c("bgt", "within_call_data"), reg_results,
append=TRUE, row.names=FALSE)
dplyr does not include commands to insert or update records in a database, so there is not a complete native dplyr solution for this. But you could combine dplyr with regular SQL statements to avoid bringing the data to R.
Let's start by reproducing your steps before the collect() statement
library(dplyr)
pg <- src_postgres()
reg_data <- tbl(pg, "reg_data")
reg_results <-
reg_data %>%
summarize(r_squared=regr_r2(y, x),
num_obs=regr_count(y, x),
constant=regr_intercept(y, x),
slope=regr_slope(y, x),
mean_analyst_fog=regr_avgx(y, x),
mean_manager_fog=regr_avgy(y, x))
Now, you could use compute() instead of collect() to create a temporary table in the database.
temp.table.name <- paste0(sample(letters, 10, replace = TRUE), collapse = "")
reg_results <- reg_results %>% compute(name=temp.table.name)
Where temp.table.name is a random table name. Using the option name = temp.table.name in compute we assign this random name to the temporary table created.
Now, we will use the library RPostgreSQL to create an insert query that uses the results stored in the temporary table. As the temporary table only lives in the connection created by src_postgres(), we need to reuse that connection.
library(RPostgreSQL)
copyconn <- pg$con
class(copyconn) <- "PostgreSQLConnection" # I get an error if I don't fix the class
Finally the insert query
sql <- paste0("INSERT INTO destination_table SELECT * FROM ", temp.table.name, ";")
dbSendQuery(copyconn, sql)
So, everything is happening in the database and the data is not brought into R.
EDIT
Previous versions of this post broke encapsulation by extracting the temporary table name from reg_results. This is avoided by passing the name = option to compute().
Another option is to use sql_render() to generate each SQL statement, then db_save_query() to create the table from the first statement, and a manual INSERT statement to append the rest. The purrr functions map() and walk() are used to loop over the queries. Ideally a command like compute() would do this, but in lieu of that, the following is a fully reproducible example:
library(dplyr)
library(dbplyr)
library(purrr)
# Setting up a SQLite db with 3 tables
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, filter(mtcars, cyl == 4), "mtcars1")
copy_to(con, filter(mtcars, cyl == 6), "mtcars2")
copy_to(con, filter(mtcars, cyl == 8), "mtcars3")
# Pre-process the SQL statements
tables <- c("mtcars1","mtcars2","mtcars3")
all_results <- tables %>%
map(~{
tbl(con, .x) %>%
summarise(avg_mpg = mean(mpg),
records = n()) %>%
sql_render()
})
# Execute the SQL statements; the first one creates the table,
# subsequent queries are inserted into the table
first_query <- TRUE
all_results %>%
walk(~{
if(first_query == TRUE){
first_query <<- FALSE
db_save_query(con, ., "results")
} else {
# DBI is not attached, so call its functions with the DBI:: prefix
DBI::dbExecute(con, build_sql("INSERT INTO results ", ., con = con))
}
})
tbl(con, "results")
DBI::dbDisconnect(con)
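A further option, not part of either answer above, is the rows_append() verb. This is a minimal sketch, assuming a recent dbplyr release in which rows_append() has a method for lazy tables with in_place = TRUE, and assuming the results table is created beforehand. The summary query is never collected, so the append should happen inside the database.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, filter(mtcars, cyl == 4), "mtcars1")

# Assumption: the destination table already exists with matching columns
DBI::dbExecute(con, "CREATE TABLE results (avg_mpg REAL, records INTEGER)")

# Lazy query; nothing is pulled into R here
summary_query <- tbl(con, "mtcars1") %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE),
            records = n())

# Append the query result to the existing table inside the database
rows_append(tbl(con, "results"), summary_query, in_place = TRUE)

results <- collect(tbl(con, "results"))
DBI::dbDisconnect(con)
```

If this works on your installed versions, the same call could be repeated per data set in a loop, replacing the manual INSERT statements.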
I am using the RSQLite package in a shiny app. I need to be able to dynamically update an sqlite db as users progress through the app. I want to use the UPDATE syntax in SQLite to achieve this, but I have come up against a problem when trying to update multiple rows for the same user.
Consider the following code:
# Load libraries
library("RSQLite")
## Path for SQLite db
sqlitePath <- "test.db"
# Create db to store tables
con <- dbConnect(SQLite(),sqlitePath)
## Create toy data
who <- c("jane", "patrick", "samantha", "jane", "patrick", "samantha")
tmp_var_1 <- c(1,2,3, 4, 5, 6)
tmp_var_2 <- c(2,4,6,8,10,12)
# Create original table
users <- data.frame(who = as.character(who), tmp_var_1 = tmp_var_1, tmp_var_2 = tmp_var_2)
users$who <- as.character(users$who)
# Write original table
dbWriteTable(con, "users", users)
# Subset users data
jane <- users[who=="jane",]
patrick <- users[who=="patrick",]
samantha <- users[who=="samantha",]
# Edit Jane's data
jane$tmp_var_1 <- c(99,100)
# Save edits back to SQL (this is where the problem is!)
table <- "users"
db <- dbConnect(SQLite(), sqlitePath)
query <- sprintf(
"UPDATE %s SET %s = ('%s') WHERE who = %s",
table,
paste(names(jane), collapse = ", "),
paste(jane, collapse = "', '"),
"'jane'"
)
dbGetQuery(db, query)
## Load data to check update has worked
loadData <- function(table) {
# Connect to the database
db <- dbConnect(SQLite(), sqlitePath)
# Construct the fetching query
query <- sprintf("SELECT * FROM %s", table)
# Submit the fetch query and disconnect
data <- dbGetQuery(db, query)
dbDisconnect(db)
data
}
loadData("users")
Here I am trying to update the entry for Jane so that the values for tmp_var_1 are changed, but all other columns remain the same. In response to questions from @zx8754 and @Altons posted below, the value for query is as follows:
UPDATE users SET who, tmp_var_1, tmp_var_2 = ('c(\"jane\", \"jane\")', 'c(99, 100)', 'c(2, 8)') WHERE who = 'jane'
The problem is almost certainly coming from the way that I am specifying the query to RSQLite. When I run dbGetQuery(db, query) I get the following error:
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: near ",": syntax error
Any suggestions for improvement would be most welcome.
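One possible direction, offered as a sketch rather than a definitive fix: SQL's UPDATE expects `column = value` pairs, and with two rows for the same user the WHERE clause needs something that tells the rows apart. The example below assumes an in-memory database and uses SQLite's built-in rowid (aliased to a hypothetical row_id column) to target each row with a parameterized statement.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
users <- data.frame(
  who = c("jane", "patrick", "samantha", "jane", "patrick", "samantha"),
  tmp_var_1 = c(1, 2, 3, 4, 5, 6),
  tmp_var_2 = c(2, 4, 6, 8, 10, 12),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "users", users)

# Fetch Jane's rows together with SQLite's rowid so each row is addressable
jane <- dbGetQuery(con,
                   "SELECT rowid AS row_id, * FROM users WHERE who = ?",
                   params = list("jane"))
jane$tmp_var_1 <- c(99, 100)

# One parameterized UPDATE, executed once per row of bound values
dbExecute(con,
          "UPDATE users SET tmp_var_1 = ? WHERE rowid = ?",
          params = list(jane$tmp_var_1, jane$row_id))

jane_after <- dbGetQuery(con,
  "SELECT tmp_var_1 FROM users WHERE who = 'jane' ORDER BY rowid")
others_after <- dbGetQuery(con,
  "SELECT tmp_var_1 FROM users WHERE who <> 'jane' ORDER BY rowid")
dbDisconnect(con)
```

Binding the values as parameters also avoids pasting R vectors such as c(99, 100) into the SQL text, which is what produced the malformed query shown above.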
I have a table in a PostgreSQL database that has a BIGSERIAL auto-incrementing primary key. Recreate it using:
CREATE TABLE foo
(
"Id" bigserial PRIMARY KEY,
"SomeData" text NOT NULL
);
I want to append some data to this table from R via the RPostgreSQL package. In R, the data doesn't include the Id column because I want the database to generate those values.
dfr <- data.frame(SomeData = letters)
Here's the code I used to try and write the data:
library(RPostgreSQL)
conn <- dbConnect(
"PostgreSQL",
user = "yourname",
password = "your password",
dbname = "test"
)
dbWriteTable(conn, "foo", dfr, append = TRUE, row.names = FALSE)
dbDisconnect(conn)
Unfortunately, dbWriteTable throws an error:
## Error in postgresqlgetResult(new.con) :
## RS-DBI driver: (could not Retrieve the result : ERROR: invalid input syntax for integer: "a"
## CONTEXT: COPY foo, line 1, column Id: "a"
## )
The error message isn't completely clear, but I interpret this as R trying to pass the contents of the SomeData column to the first column in the database (which is Id).
How should I be passing the data to PostgreSQL so that the Id column is auto-generated?
From the thread in hrbrmstr's comment, I found a hack to make this work.
In the postgresqlWriteTable function of the RPostgreSQL package, you need to replace the line
sql4 <- paste("COPY", postgresqlTableRef(name), "FROM STDIN")
with
sql4 <- paste(
"COPY ",
postgresqlTableRef(name),
"(",
paste(postgresqlQuoteId(names(value)), collapse = ","),
") FROM STDIN"
)
Note that the quoting of variables (not included in the original hack) is necessary to pass case-sensitive column names.
Here's a script to do that:
body_lines <- deparse(body(RPostgreSQL::postgresqlWriteTable))
new_body_lines <- sub(
'postgresqlTableRef(name), "FROM STDIN")',
'postgresqlTableRef(name), "(", paste(postgresqlQuoteId(names(value)), collapse = ","), ") FROM STDIN")',
body_lines,
fixed = TRUE
)
fn <- RPostgreSQL::postgresqlWriteTable
body(fn) <- parse(text = new_body_lines)
while("package:RPostgreSQL" %in% search()) detach("package:RPostgreSQL")
assignInNamespace("postgresqlWriteTable", fn, "RPostgreSQL")
I struggled with a very similar issue today and stumbled across this thread while trying out different approaches. As of this writing (02/12/2018), it looks like the patch recommended above has been incorporated into the latest version of RPostgreSQL::postgresqlWriteTable, but I still kept getting an error indicating that the primary key R assigned to my new rows duplicated one in the source data table.
I ultimately implemented a workaround: generating an incrementing primary key in R and adding it to the data I insert into the source table in my PostgreSQL database. For my purposes I only needed to insert one record at a time, and I can't imagine this is an optimal solution for inserting a batch of records that require a serially incremented primary key. Predictably, omitting append=TRUE from my script threw the error "table my_table exists in database: aborting assignTable"; however, this option did not automatically assign an incrementing primary key as I had hoped, even with the code patch described above.
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
localdb <- dbConnect(drv, dbname= 'MyDatabase',
host= 'localhost',
port = 5432,
user = 'postgres',
password= 'MyPassword')
KeyPlusOne <- sum(dbGetQuery(localdb, "SELECT count(*) FROM my_table"),1)
NewRecord <- t(c(KeyPlusOne, 'Var1','Var2','Var3','Var4'))
NewRecord <- as.data.frame(NewRecord)
# Name the columns of the one-row data frame (not the key value itself)
NewRecord <- setNames(NewRecord, c("PK","VarName1","VarName2","VarName3","VarName4"))
postgresqlWriteTable(localdb, "my_table", NewRecord, append=TRUE, row.names=FALSE)
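Computing the next key as count(*) + 1 can collide once rows have been deleted or when two sessions insert at the same time, so letting the database assign the key is usually safer. A minimal sketch of that idea, assuming an in-memory SQLite table with an auto-incrementing Id as a stand-in for the PostgreSQL bigserial column: DBI::dbAppendTable() builds an INSERT that names only the columns present in the data frame, which leaves Id to the database. Whether the same call behaves identically with your PostgreSQL driver is an assumption to verify.

```r
library(DBI)

# Assumption: SQLite AUTOINCREMENT stands in for PostgreSQL bigserial
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, '
  CREATE TABLE foo (
    "Id" INTEGER PRIMARY KEY AUTOINCREMENT,
    "SomeData" TEXT NOT NULL
  )')

dfr <- data.frame(SomeData = letters, stringsAsFactors = FALSE)

# Only SomeData is listed in the generated INSERT, so Id is auto-generated
dbAppendTable(con, "foo", dfr)

foo <- dbReadTable(con, "foo")
dbDisconnect(con)
```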