I would like to insert the contents of a data frame into an existing table in an Oracle database.
sqlSave(conn, df[1:3, c(which(names(df) == "x"), which(names(df) == "y"), which(names(df) == "z"))], tablename = "A_X", append = TRUE)
I get the error
Error in odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
  missing columns in »data«
because the chosen columns and the columns of the Oracle table do not match.
The Oracle table has more columns than the data frame, so the non-matching columns should be filled with NULL. How can I implement this in R? I would like to include the contents of the data frame in the SQL statement as follows:
INSERT INTO A_X
VALUES (df[1:3, c(which(names(df) == "x"), which(names(df) == "y"), which(names(df) == "z"))], AUTO_ID, NULL, NULL);
In Oracle SQL this would be possible with the following code:
INSERT INTO A_X
VALUES (300, 'text', 'text', AUTO_ID, NULL, NULL);
The second problem is generating the ID AUTO_ID automatically. I am on Oracle DB 11.2.0.3 and upgrading to 12c is currently not possible.
Here is an alternate approach.
Instead of just using sqlSave, use a combination of sqlSave and sqlQuery from RODBC, or dbExecute from DBI.
You can then write the df to either a permanent or a temporary staging table and wrap the sqlQuery around an INSERT or UPDATE statement that operates on your target table and the staging table. This is not very elegant, but it should give you a scalable solution even if the schema changes in the future.
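For illustration, here is a minimal sketch of that idea with RODBC, reusing conn and df from the question. It is untested against your schema: the staging table A_X_STAGE, the sequence A_X_SEQ (one common way to populate AUTO_ID on Oracle 11g) and the extra columns COL5/COL6 are placeholders.
library(RODBC)

# the three columns to load (same selection as above, written more compactly)
df_part <- df[1:3, c("x", "y", "z")]

# 1. write the data frame into a staging table (replaced on every run)
sqlSave(conn, df_part, tablename = "A_X_STAGE", rownames = FALSE, safer = FALSE)

# 2. insert from staging into the target table, filling the remaining
#    columns explicitly (a sequence can supply AUTO_ID on Oracle 11g)
sqlQuery(conn, "INSERT INTO A_X (X, Y, Z, AUTO_ID, COL5, COL6)
                SELECT X, Y, Z, A_X_SEQ.NEXTVAL, NULL, NULL
                FROM A_X_STAGE")

# 3. clean up the staging table
sqlQuery(conn, "DROP TABLE A_X_STAGE")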
Here is the data frame that I have:
trail_df <- data.frame(d  = seq.Date(as.Date("2020-01-01"), as.Date("2020-02-01"), by = 1),
                       AA = NA,
                       BB = NA,
                       CC = NA)
Now I loop over the columns of trail_df and, for each column name, fetch the corresponding data from the Oracle database for the given dates, which I am doing like this:
for (i in 2:ncol(trail_df)) {
  c_name <- colnames(trail_df)[i]
  query  <- paste0("SELECT * FROM tablename WHERE ID= '", c_name, "' ")  # this query returns Date and Price
  result <- dbGetQuery(con, query)  # con is the connection variable from the db

  for (k in 1:nrow(result)) {
    # just matching the date in the trail_df data frame and pasting the value into the respective column
    trail_df[which(as.Date(result[k, 1]) == as.Date(trail_df[, 1])), i] <- result[k, 2]
  }
}
This is a snippet of the code; the date filtering and so on has been taken care of in the real code.
The problem is that I have more than 6000 columns and 500 rows, for which I have to match the dates (because the dates are random) and put the price in place, and this is now taking forever.
I am new to the R language and would appreciate any help to speed this code up, perhaps with multiprocessing if that is possible in R.
There are two steps to this answer:
1. Use parameterized queries to get the raw data; and
2. Get this data into the "wide" format you desire.
Parameterized query
My first suggestion is to use parameterized queries, which is safer. It may not improve the speed relative to @RonakShah's answer (using sprintf), at least not the first time.
However, it might help a touch if the query is repeated: DBMSes tend to parse/optimize queries and cache this optimization. When a query changes even a little, this caching cannot happen, and the query is re-optimized. In this case, this cache-invalidation is unnecessary, and can be avoided if we use binding parameters.
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s)",
paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
# [1] "SELECT * FROM tablename WHERE ID IN (?,?,?)"
res <- dbGetQuery(con, query, params = list(trail_df$ID))
Some thoughts:
If the database has many more dates than you need here, you can reduce the data returned by restricting the date range in the query. This will work well if your trail_df dates are close together:
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s) and Date between ? and ?",
paste(rep("?", ncol(mtcars)), collapse = ","))
query
res <- dbGetQuery(con, query, params = c(list(trail_df$ID), as.list(range(df$d))))
If your dates are more spread out and you end up querying many more rows than you actually need, I suggest uploading your trail_df dates into a temporary table and running something like the SQL below (an R sketch follows it):
"select tb.Date, tb.ID, tb.Price
from mytemptable tmp
left join tablename tb on tmp.d = tb.Date
where ..."
Reshape
It appears as if your database table may be more "long" shaped and you want it "wide" in your frame. There are many ways to reshape from long-to-wide (examples), but these should work:
reshape2::dcast(res, Date ~ ID, value.var = "Price")  # 'Price' is the 'value' column, unknown in this question
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
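For a quick illustration of the wide result, here is a toy stand-in for res (the Date/ID/Price columns mirror the assumed query output):
library(tidyr)

res_toy <- data.frame(Date  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
                      ID    = c("AA", "BB", "AA"),
                      Price = c(10.5, 20.1, 10.7))

pivot_wider(res_toy, id_cols = "Date", names_from = "ID", values_from = "Price")
# # A tibble: 2 x 3
# #   Date          AA    BB
# # 1 2020-01-01  10.5  20.1
# # 2 2020-01-02  10.7  NA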
I am confused about updating the columns of a table dynamically. I have a data frame which contains the roles of the people in a school.
roles_sam <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(roles_sam) <- c("Student","Staff","HOD","Principal")
Similarly, I have another data frame which contains the start dates of the roles.
roles_start <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(roles_start) <- c("Student_Start","Staff_Start","HOD_Start","Principal_Start")
If the role is selected as "HOD", then the "HOD_Start" should be updated with the value given. Similarly if the role is selected as "Staff", then the "Staff_Start" should be updated in the database.
The code I tried is:
a <- "HOD"
sd <- "16/01/2019"
counter <- ncol(roles_start)
for (i in 1:counter) {
  if (names(roles_sam)[i] == a) {
    role_start_date <- roles_start[i]
  }
}
print(role_start_date)
This code matches and returns the correct role_start_date for the selected role, but I am not sure how to use dbSendQuery().
dbSendQuery(con, paste0("UPDATE table_name SET role_start_date=\'",sd,"\' ") )
Is this the right way? Can anyone help me with this? Thanks in advance...
You should use an INSERT command with the dbSendUpdate function.
for (i in 1:nrow(YOUR_DF)) {
  dbSendUpdate(jdbcConnection, "INSERT INTO table_name VALUES (?, ?, ?)",
               YOUR_DF[i, 1], YOUR_DF[i, 2], YOUR_DF[i, 3])
}
This is one way to insert new values into your existing table. In the example above we have a table with 3 columns, and we fill it with the rows of a data frame called YOUR_DF.
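If what you ultimately need is the UPDATE from the question rather than an INSERT, the same parameter binding works there too. A hedged sketch, reusing roles_sam, roles_start and jdbcConnection from above; the table name school_roles, the key column person_id and the id value 42 are made-up placeholders:
a  <- "HOD"
sd <- "16/01/2019"

# pick the start-date column that corresponds to the selected role
col <- names(roles_start)[match(a, names(roles_sam))]   # "HOD_Start"

# the column name comes from a fixed, known set, so pasting it is safe;
# the value itself is bound as a parameter (42 is a placeholder person id)
sql <- paste0("UPDATE school_roles SET ", col, " = ? WHERE person_id = ?")
dbSendUpdate(jdbcConnection, sql, sd, 42)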
I am very new to R, so please forgive any obvious or naive errors. I need to insert multiple rows of data from R into an Oracle database table.
Make the data frame (I have made the RJDBC connection earlier in the script):
df <- data.frame("field_1" = 1:2, "field_2" = c("f","k"), "field_3"= c("j","t"))
This code runs without error, but inserts only the first row into the table:
insert <- sprintf("insert into temp_r_test_u_suck values (%s')",
apply(df, 1, function(i) gsub(" ", "", paste("'", i, collapse="',"), fixed = TRUE)))
dbSendUpdate(con, insert)
This code runs:
insert <- sprintf("into temp_r_test_u_suck values (%s')",
apply(df, 1, function(i) gsub(" ", "", paste("'", i, collapse="',"), fixed = TRUE)))
insert_all <- c("insert all", insert, "select * from dual")
dbSendUpdate(con, insert_all)
But gives me this error:
Error in .local(conn, statement, ...) :
execute JDBC update query failed in dbSendUpdate (ORA-00905: missing keyword
Both of the queries work on their own in Oracle. WHAT am I doing wrong?
Thank you!
Multiple SQL statements are not supported in dbGetQuery, dbSendQuery, or dbSendUpdate calls. You need to iterate through them, one call per statement; that is why only the first statement is processed. To resolve this, extend the anonymous function inside apply to call dbSendUpdate:
apply(df, 1, function(i) {
  # BUILD SQL STATEMENT
  insert <- sprintf("insert into temp_r_test_u_suck values (%s')",
                    paste0("'", i, collapse = "',"))
  # RUN QUERY
  dbSendUpdate(con, insert)
})
However, RJDBC extends the DBI standard by supporting parameterization with dbSendUpdate, as mentioned in the rForge docs, which allows bulk inserts with no need to iteratively concatenate strings.
dbSendUpdate(conn, statement, ...) This function is analogous to
dbSendQuery, but works with DBML statements and thus doesn't return a
result set. It is more efficient than dbSendQuery. In addition, as of
RJDBC 0.2-9 it supports vectors in prepared statements which allows
bulk-inserts.
# ALL CHARACTER DATAFRAME
df <- data.frame(field_1 = as.character(1:2), field_2 = c("f","k"), field_3 = c("j","t"),
                 stringsAsFactors = FALSE)
# PREPARED STATEMENT
sql <- "insert into temp_r_test_u_suck values (?, ?, ?)"
# RUN QUERY
dbSendUpdate(con, sql, df$field_1, df$field_2, df$field_3)
I have worked a little bit with DBI in R. My first question is more about best practice, as appending new data to the database currently takes more time than I hoped. The second is about an error that I'm receiving when trying to update old information in the database. Here is my current workflow when inserting new data into an existing table:
con <- dbConnect(odbc(), "myDSN")
# Example table 1
tbl1 <- tibble(Key = c("A", "B", "C", "D", "E"),
Val = c(1, 2, 3, 4, 5))
# Original table in DB
dbWriteTable(con, "tbl1", tbl1, overwrite = TRUE)
# Link to Original table
db_tbl <- tbl(con, in_schema("dbo", "tbl1"))
# New data
tbl2 <- tibble(Key = c("D", "E", "F", "G", "H"),
val = c(10, 11, 12, 13, 14))
# Write it to Staging
dbWriteTable(con, "tbl1_staging", tbl2, overwrite = TRUE)
# Get a link to staging
db_tblStaging <- tbl(con, in_schema("dbo", "tbl1_staging"))
# Compare Info
not_in_db <- db_tblStaging %>%
  anti_join(db_tbl, by = "Key") %>%
  collect()
# Append missing info to DB
dbWriteTable(con, "tbl1", not_in_db, append = TRUE)
# Voila!
dbReadTable(con, "tbl1")
That will do the trick, but I'm looking for a better solution, as I dislike the collect() part of the code: as far as I understand it, it brings the data into R memory, which could become a problem in the future when I have bigger data. What I hoped would work is something like this, which would let me append new data to the DB on the fly, without it passing through memory:
# What I hoped to have
db_tblStaging %>%
  anti_join(db_tbl, by = "Key") %>%
  dbWriteTable(con, "tbl1", ., append = TRUE)
The second problem is updating the existing table. Here is what I tried, but an error emerges and I can't figure it out. Here is the question whose answer I tried to copy: How to pass data.frame for UPDATE with R DBI. I would like to update keys D and E with the new values in val.
# Trying to update tbl1
update_values <- db_tblStaging %>%
  semi_join(db_tbl, by = "Key") %>%
  collect()
update <- dbSendQuery(con, 'UPDATE tbl1
SET "val" = ?
WHERE Key = ?')
dbBind(update, update_values)
Error in result_bind(res@ptr, as.list(params)) :
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Incorrect syntax near the keyword 'Key'.
Has the package changed in some way? I can't spot my syntax error.
Consider running pure SQL after your staging table uploads, as it looks like you need NOT EXISTS (to avoid duplicates) and UPDATE with INNER JOIN (for existing records). This avoids any R client-side query imports and exports.
And Key is a reserved word in SQL Server. Hence, escape it with square brackets.
apn_sql <- "INSERT INTO dbo.tbl (s.[Key], s.[Val])
SELECT s.[Key], s.[Val] FROM dbo.tbl_staging s
WHERE NOT EXISTS
(SELECT 1 FROM dbo.tbl t
WHERE t.[Key] = s.[Key])"
dbSendQuery(con, apn_sql)
upd_sql <- "UPDATE t
SET t.Val = s.Val
FROM dbo.tbl t
INNER JOIN dbo.tbl_staging s
ON t.[Key] = s.[Key]"
dbSendQuery(con, upd_sql)
In fact, SQL Server has the MERGE query to handle both in one call:
MERGE dbo.tbl AS Target
USING (SELECT [Key], [Val] FROM dbo.tbl_staging) AS Source
ON (Target.[Key] = Source.[Key])
WHEN MATCHED THEN
UPDATE SET Target.Val = Source.Val
WHEN NOT MATCHED BY TARGET THEN
INSERT ([Key], [Val])
VALUES (Source.[Key], Source.[Val]);
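To run the MERGE from R, send it the same way as the statements above; a sketch using dbExecute (DBI's call for statements that return no result set):
mrg_sql <- "MERGE dbo.tbl AS Target
            USING (SELECT [Key], [Val] FROM dbo.tbl_staging) AS Source
            ON (Target.[Key] = Source.[Key])
            WHEN MATCHED THEN
                UPDATE SET Target.Val = Source.Val
            WHEN NOT MATCHED BY TARGET THEN
                INSERT ([Key], [Val]) VALUES (Source.[Key], Source.[Val]);"

dbExecute(con, mrg_sql)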
I have a table in a PostgreSQL database that has a BIGSERIAL auto-incrementing primary key. Recreate it using:
CREATE TABLE foo
(
"Id" bigserial PRIMARY KEY,
"SomeData" text NOT NULL
);
I want to append some data to this table from R via the RPostgreSQL package. In R, the data doesn't include the Id column because I want the database to generate those values.
dfr <- data.frame(SomeData = letters)
Here's the code I used to try and write the data:
library(RPostgreSQL)
conn <- dbConnect(
"PostgreSQL",
user = "yourname",
password = "your password",
dbname = "test"
)
dbWriteTable(conn, "foo", dfr, append = TRUE, row.names = FALSE)
dbDisconnect(conn)
Unfortunately, dbWriteTable throws an error:
## Error in postgresqlgetResult(new.con) :
## RS-DBI driver: (could not Retrieve the result : ERROR: invalid input syntax for integer: "a"
## CONTEXT: COPY foo, line 1, column Id: "a"
## )
The error message isn't completely clear, but I interpret this as R trying to pass the contents of the SomeData column to the first column in the database (which is Id).
How should I be passing the data to PostgreSQL so that the Id column is auto-generated?
From the thread in hrbrmstr's comment, I found a hack to make this work.
In the postgresqlWriteTable function in the RPostgreSQL package, you need to replace the line
sql4 <- paste("COPY", postgresqlTableRef(name), "FROM STDIN")
with
sql4 <- paste(
"COPY ",
postgresqlTableRef(name),
"(",
paste(postgresqlQuoteId(names(value)), collapse = ","),
") FROM STDIN"
)
Note that the quoting of variables (not included in the original hack) is necessary to pass case-sensitive column names.
Here's a script to do that:
body_lines <- deparse(body(RPostgreSQL::postgresqlWriteTable))
new_body_lines <- sub(
  'postgresqlTableRef(name), "FROM STDIN")',
  'postgresqlTableRef(name), "(", paste(shQuote(names(value)), collapse = ","), ") FROM STDIN")',
  body_lines,
  fixed = TRUE
)
fn <- RPostgreSQL::postgresqlWriteTable
body(fn) <- parse(text = new_body_lines)
while("RPostgreSQL" %in% search()) detach("package:RPostgreSQL")
assignInNamespace("postgresqlWriteTable", fn, "RPostgreSQL")
I struggled with an issue very similar to this today, and stumbled across this thread as I tried out different approaches. As of this writing (02/12/2018), it looks like the patch recommended above has been implemented into the latest version of RPostgreSQL::postgresqlWriteTable, but I still kept getting an error indicating that the primary key R assigned to my new rows was duplicated in the source data table.
I ultimately implemented a workaround that generates an incrementing primary key in R and appends it to the data being inserted into my PostgreSQL table. For my purposes I only needed to insert one record at a time, and I can't imagine this is an optimal solution for inserting a batch of records that require a serially incremented primary key. Predictably, an error of "table my_table exists in database: aborting assignTable" was thrown when I omitted append=TRUE from my script; however, that option did not automatically assign an incrementing primary key as I had hoped, even with the code patch described above.
drv <- dbDriver("PostgreSQL")
localdb <- dbConnect(drv, dbname= 'MyDatabase',
host= 'localhost',
port = 5432,
user = 'postgres',
password= 'MyPassword')
KeyPlusOne <- sum(dbGetQuery(localdb, "SELECT count(*) FROM my_table"),1)
NewRecord <- t(c(KeyPlusOne, 'Var1','Var2','Var3','Var4'))
NewRecord <- as.data.frame(NewRecord)
NewRecord <- setNames(KeyPlusOne, c("PK","VarName1","VarName2","VarName3","VarName4"))
postgresqlWriteTable(localdb, "my_table", NewRecord, append=TRUE, row.names=FALSE)