I am working with Microsoft SQL Azure version 12 from RStudio Server, using the DBI library. I need to create multiple SQL tables from dataframes containing a character variable that is 4000 characters long. This can be done as
# Create dataframe
df <- data.frame("myid" = stringi::stri_rand_strings(5, 4000),
                 "mydate" = c(Sys.time(), Sys.time()-1, Sys.time()-2, Sys.time()-3, Sys.time()-4))
# Create SQL table sschema.ttable
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "sschema", table = "ttable"),
                  value = df,
                  overwrite = TRUE)
This fails with the following error:
Error in result_insert_dataframe(rs@ptr, values, batch_rows) :
  nanodbc/nanodbc.cpp:1617: 00000: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation
I tried:
Truncating the variables (suboptimal).
Creating the table, altering the variables to VARCHAR(6000) instead of VARCHAR(255), and then appending the dataframe. This results in the same "String data, right truncation" error.
Are there any solutions for creating SQL tables directly from R dataframes?
The answer is to define the variables and their desired SQL column types with field.types, as in
# Create SQL table sschema.ttable
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "sschema", table = "ttable"),
                  field.types = c(myid = "varchar(6000)"),
                  value = df,
                  overwrite = TRUE)
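If the strings could ever exceed SQL Server's 8000-character limit for varchar(n), varchar(max) may be an option for field.types. This is a variation on the call above, not part of the original answer, and support can depend on the ODBC driver version:
# Variation (assumption, not from the original answer): very long strings
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "sschema", table = "ttable"),
                  field.types = c(myid = "varchar(max)"),
                  value = df,
                  overwrite = TRUE)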
Related
In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(), Driver = "path-to-driver",
                      S3OutputLocation = "location",
                      AwsRegion = "eu-west-1", AuthenticationType = "IAM Profile",
                      AWSProfile = "profile", Schema = "prod")
tbl(con,
    # Run SQL query
    sql('SELECT *
         FROM TABLE')) %>%
  # Without having collected the data, I could further wrangle the data inside the database
  # using dplyr code
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error
con <- DBI::dbConnect(bigrquery::bigquery(),
project = "project")
tbl(con,
sql(
'SELECT *
FROM TABLE'
))
Error: dataset is not a string (a length one character vector).
Any idea whether what I'm trying to do is simply not possible with BigQuery?
Not a BigQuery user, so I can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead, it appears BigQuery does not support receiving a tbl() call with an sql query string as the second argument.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigrquery::bigquery(), in case the problem is caused by the bigrquery package.
The second approach without the conversion to sql: tbl(con, query_string).
Both checks are sketched below.
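A rough, untested sketch of both checks; the DSN name "BigQueryDSN" and the table reference "dataset.table" are placeholders, not verified settings:
library(DBI)
library(dplyr)

# Check 1: connect through a generic ODBC driver instead of bigrquery.
# Assumes a BigQuery ODBC driver is installed and configured as a DSN.
con_odbc <- dbConnect(odbc::odbc(), dsn = "BigQueryDSN")
tbl(con_odbc, sql("SELECT * FROM dataset.table"))

# Check 2: with the bigrquery connection, skip the sql() wrapper
con_bq <- dbConnect(bigrquery::bigquery(), project = "project")
query_string <- "SELECT * FROM dataset.table"
tbl(con_bq, query_string)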
I am very new to R, so please forgive any obvious or naive errors. I need to insert multiple rows of data from R into an Oracle database table.
Make the data frame (I have made the RJDBC connection earlier in the script):
df <- data.frame("field_1" = 1:2, "field_2" = c("f","k"), "field_3"= c("j","t"))
This code runs without error, but inserts only the first row into the table:
insert <- sprintf("insert into temp_r_test_u_suck values (%s')",
apply(df, 1, function(i) gsub(" ", "", paste("'", i, collapse="',"), fixed = TRUE)))
dbSendUpdate(con, insert)
This code runs:
insert <- sprintf("into temp_r_test_u_suck values (%s')",
apply(df, 1, function(i) gsub(" ", "", paste("'", i, collapse="',"), fixed = TRUE)))
insert_all <- c("insert all", insert, "select * from dual")
dbSendUpdate(con, insert_all)
But gives me this error:
Error in .local(conn, statement, ...) :
execute JDBC update query failed in dbSendUpdate (ORA-00905: missing keyword
Both of the queries work on their own in Oracle. WHAT am I doing wrong?
Thank you!
Multiple SQL statements are not supported in dbGetQuery, dbSendQuery, or dbSendUpdate calls; you need to iterate through them one statement at a time, which is why only the first statement is processed. To resolve this, extend the anonymous function inside apply to call dbSendUpdate for each row:
apply(df, 1, function(i) {
  # BUILD SQL STATEMENT
  insert <- sprintf("insert into temp_r_test_u_suck values (%s')",
                    paste0("'", i, collapse="',"))
  # RUN QUERY
  dbSendUpdate(con, insert)
})
However, RJDBC extends the DBI standard by supporting parameterization with dbSendUpdate, as mentioned in the rForge docs, which allows bulk inserts with no need to iteratively concatenate strings.
dbSendUpdate(conn, statement, ...) This function is analogous to
dbSendQuery, but works with DBML statements and thus doesn't return a
result set. It is more efficient than dbSendQuery. In addition, as of
RJDBC 0.2-9 it supports vectors in prepared statements which allows
bulk-inserts.
# ALL CHARACTER DATAFRAME
df <- data.frame(field_1=as.character(1:2), field_2=c("f","k"), field_3=c("j","t"),
stringsAsFactors=FALSE)
# PREPARED STATEMENT
sql <- "insert into temp_r_test_u_suck values (?, ?, ?)"
# RUN QUERY
dbSendUpdate(con, sql, df$field_1, df$field_2, df$field_3)
I have worked a little with DBI in R, and my first question is more about best practice, as appending new data to the DB currently takes more time than I hoped. The second is about an error I receive when trying to update old information in the database. Here is my current workflow when inserting new data into an existing table in the DB:
con <- dbConnect(odbc(), "myDSN")
# Example table 1
tbl1 <- tibble(Key = c("A", "B", "C", "D", "E"),
Val = c(1, 2, 3, 4, 5))
# Original table in DB
dbWriteTable(con, "tbl1", tbl1, overwrite = TRUE)
# Link to Original table
db_tbl <- tbl(con, in_schema("dbo", "tbl1"))
# New data
tbl2 <- tibble(Key = c("D", "E", "F", "G", "H"),
val = c(10, 11, 12, 13, 14))
# Write it to Staging
dbWriteTable(con, "tbl1_staging", tbl2, overwrite = TRUE)
# Get a link to staging
db_tblStaging <- tbl(con, in_schema("dbo", "tbl1_staging"))
# Compare Info
not_in_db <- db_tblStaging %>%
anti_join(db_tbl, by="Key") %>%
collect()
# Append missing info to DB
dbWriteTable(con, "tbl1", not_in_db, append = TRUE)
# Voila!
dbReadTable(con, "tbl1")
That will do the trick, but I'm looking for a better solution, as I dislike the collect() part of the code: as far as I understand it, it brings the data into R memory, which could become a problem in the future with bigger data. What I hoped would work is something like this, which would allow me to append new data to the DB on the fly, without it ever visiting R memory.
# What I hoped to have
db_tblStaging %>%
anti_join(db_tbl, by="Key") %>%
dbWriteTable(con, "tbl1", ., append = TRUE)
The second problem is updating an existing table. Here is what I tried, but an error emerges and I can't figure it out. This is the question whose answer I tried to copy: How to pass data.frame for UPDATE with R DBI. I would like to update keys D and E with the new values in val.
# Trying to update tbl1
update_values <- db_tblStaging %>%
semi_join(db_tbl, by="Key") %>%
collect()
update <- dbSendQuery(con, 'UPDATE tbl1
SET "val" = ?
WHERE Key = ?')
dbBind(update, update_values)
Error in result_bind(res@ptr, as.list(params)) :
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Incorrect syntax near the keyword 'Key'.
Has the package changed in some way? I can't spot my syntax error.
Consider running pure SQL after your staging table uploads, since it looks like you need NOT EXISTS (to avoid duplicates) and an UPDATE with an INNER JOIN (for existing records). This avoids any client-side imports and exports of query results in R.
Also, Key is a reserved word in SQL Server, hence escape it with square brackets.
apn_sql <- "INSERT INTO dbo.tbl ([Key], [Val])
            SELECT s.[Key], s.[Val] FROM dbo.tbl_staging s
            WHERE NOT EXISTS
                (SELECT 1 FROM dbo.tbl t
                 WHERE t.[Key] = s.[Key])"
dbExecute(con, apn_sql)
upd_sql <- "UPDATE t
            SET t.[Val] = s.[Val]
            FROM dbo.tbl t
            INNER JOIN dbo.tbl_staging s
                ON t.[Key] = s.[Key]"
dbExecute(con, upd_sql)
In fact, SQL Server has the MERGE statement to handle both operations in one call:
MERGE dbo.tbl AS Target
USING (SELECT [Key], [Val] FROM dbo.tbl_staging) AS Source
ON (Target.[Key] = Source.[Key])
WHEN MATCHED THEN
UPDATE SET Target.Val = Source.Val
WHEN NOT MATCHED BY TARGET THEN
INSERT ([Key], [Val])
VALUES (Source.[Key], Source.[Val]);
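To run that MERGE from R rather than in a SQL client, a minimal sketch using the same connection object as above (the statement text is the one just shown):
merge_sql <- "MERGE dbo.tbl AS Target
              USING (SELECT [Key], [Val] FROM dbo.tbl_staging) AS Source
              ON (Target.[Key] = Source.[Key])
              WHEN MATCHED THEN
                  UPDATE SET Target.[Val] = Source.[Val]
              WHEN NOT MATCHED BY TARGET THEN
                  INSERT ([Key], [Val])
                  VALUES (Source.[Key], Source.[Val]);"

# dbExecute sends the statement and returns the number of affected rows
dbExecute(con, merge_sql)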
I would like to insert the contents of a dataframe into an existing table in an Oracle database.
sqlSave(conn, df[1:3, c(which(names(df) == "x"), which(names(df) == "y"), which(names(df) == "z"))], tablename = "A_X", append = TRUE)
I get the error
Error in odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
  missing columns in »data«
because the chosen columns and those of the Oracle table do not match.
The Oracle table has more columns than the data frame, so the non-matching columns should be filled with NULL. How can I implement this in R? I would like to include the contents of the df in SQL code like the following:
INSERT INTO A_X
VALUES (df[1:3, c(which(names(df) == "x"), which(names(df) == "y"), which(names(df) == "z"))], AUTO_ID, NULL, NULL);
In Oracle SQL this would be possible with the following code:
INSERT INTO A_X
VALUES (300, 'text', 'text', AUTO_ID, NULL, NULL);
The second problem is generating the ID AUTO_ID automatically. I have Oracle DB version 11.2.0.3 and it's currently not possible to upgrade to version 12c.
Here is an alternative approach.
Instead of just using sqlSave, use a combination of sqlSave and sqlQuery from RODBC, or dbExecute from DBI.
You can then write the df as either a permanent or temporary staging table and wrap sqlQuery around an INSERT or UPDATE statement that operates on your target table and the staging table (see the sketch below). This is not very elegant, but should give you a scalable solution even if the schema changes in the future.
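A minimal sketch of that idea with RODBC; the staging table TEMP_A_X, the target column names, and the sequence A_X_SEQ are assumptions used for illustration only:
library(RODBC)

# Stage the three relevant columns in a helper table (TEMP_A_X is a hypothetical name)
sqlSave(conn, df[1:3, c("x", "y", "z")], tablename = "TEMP_A_X",
        rownames = FALSE, append = FALSE)

# Insert from the staging table into the target table, generating AUTO_ID from a
# sequence and filling the remaining columns with NULL.
# A_X_SEQ is a hypothetical sequence created once with: CREATE SEQUENCE A_X_SEQ;
# (on 11g this stands in for the identity columns available from 12c onward).
sqlQuery(conn, "INSERT INTO A_X (X, Y, Z, AUTO_ID, COL5, COL6)
                SELECT X, Y, Z, A_X_SEQ.NEXTVAL, NULL, NULL
                FROM TEMP_A_X")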
I have a table in a PostgreSQL database that has a BIGSERIAL auto-incrementing primary key. Recreate it using:
CREATE TABLE foo
(
"Id" bigserial PRIMARY KEY,
"SomeData" text NOT NULL
);
I want to append some data to this table from R via the RPostgreSQL package. In R, the data doesn't include the Id column because I want the database to generate those values.
dfr <- data.frame(SomeData = letters)
Here's the code I used to try and write the data:
library(RPostgreSQL)
conn <- dbConnect(
"PostgreSQL",
user = "yourname",
password = "your password",
dbname = "test"
)
dbWriteTable(conn, "foo", dfr, append = TRUE, row.names = FALSE)
dbDisconnect(conn)
Unfortunately, dbWriteTable throws an error:
## Error in postgresqlgetResult(new.con) :
## RS-DBI driver: (could not Retrieve the result : ERROR: invalid input syntax for integer: "a"
## CONTEXT: COPY foo, line 1, column Id: "a"
## )
The error message isn't completely clear, but I interpret this as R trying to pass the contents of the SomeData column to the first column in the database (which is Id).
How should I be passing the data to PostgreSQL so that the Id column is auto-generated?
From the thread in hrbrmstr's comment, I found a hack to make this work.
In the postgresqlWriteTable function in the RPostgreSQL package, you need to replace the line
sql4 <- paste("COPY", postgresqlTableRef(name), "FROM STDIN")
with
sql4 <- paste(
"COPY ",
postgresqlTableRef(name),
"(",
paste(postgresqlQuoteId(names(value)), collapse = ","),
") FROM STDIN"
)
Note that the quoting of variables (not included in the original hack) is necessary to pass case-sensitive column names.
Here's a script to do that:
body_lines <- deparse(body(RPostgreSQL::postgresqlWriteTable))
new_body_lines <- sub(
'postgresqlTableRef(name), "FROM STDIN")',
'postgresqlTableRef(name), "(", paste(postgresqlQuoteId(names(value)), collapse = ","), ") FROM STDIN")',
body_lines,
fixed = TRUE
)
fn <- RPostgreSQL::postgresqlWriteTable
body(fn) <- parse(text = new_body_lines)
while("package:RPostgreSQL" %in% search()) detach("package:RPostgreSQL")
assignInNamespace("postgresqlWriteTable", fn, "RPostgreSQL")
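Once the patched function is assigned into the namespace, the append from the question should go through, with PostgreSQL filling the Id column; a quick usage sketch reusing the objects defined above:
# Re-run the append from the question; the patched COPY statement now lists
# only the data frame's columns, so "Id" is generated by the bigserial sequence.
conn <- dbConnect("PostgreSQL", user = "yourname",
                  password = "your password", dbname = "test")
dbWriteTable(conn, "foo", dfr, append = TRUE, row.names = FALSE)
dbGetQuery(conn, "SELECT * FROM foo LIMIT 5")
dbDisconnect(conn)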
I struggled with an issue very similar to this today and stumbled across this thread while trying out different approaches. As of this writing (02/12/2018), it looks like the patch recommended above has been implemented in the latest version of RPostgreSQL::postgresqlWriteTable, but I still kept getting an error indicating that the primary key R assigned to my new rows was duplicated in the source data table.
I ultimately implemented a workaround that generates an incrementing primary key in R and appends it to the data inserted into the source table in my PostgreSQL DB. For my purposes, I only needed to insert one record at a time, and I can't imagine this is an optimal solution for inserting a batch of records requiring a serially incremented primary key. Predictably, an error of "table my_table exists in database: aborting assignTable" was thrown when I omitted append=TRUE from my script; however, this option did not automatically assign an incrementing primary key as I had hoped, even with the code patch described above.
drv <- dbDriver("PostgreSQL")
localdb <- dbConnect(drv, dbname= 'MyDatabase',
host= 'localhost',
port = 5432,
user = 'postgres',
password= 'MyPassword')
KeyPlusOne <- sum(dbGetQuery(localdb, "SELECT count(*) FROM my_table"),1)
NewRecord <- t(c(KeyPlusOne, 'Var1','Var2','Var3','Var4'))
NewRecord <- as.data.frame(NewRecord)
NewRecord <- setNames(NewRecord, c("PK","VarName1","VarName2","VarName3","VarName4"))
postgresqlWriteTable(localdb, "my_table", NewRecord, append=TRUE, row.names=FALSE)
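As an alternative sketch (not part of the workaround above): PostgreSQL can generate the key itself and hand it back in the same statement via RETURNING, which avoids deriving a key from a row count. This assumes the foo table from the original question:
# Insert one row, let the bigserial column generate "Id",
# and read the generated key back in the same call.
new_id <- dbGetQuery(localdb,
    "INSERT INTO foo (\"SomeData\") VALUES ('some text') RETURNING \"Id\"")
new_id$Id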