I have a dataframe in R with several different data types. While writing the dataframe from R to a Redshift server, I get an error only with character and timestamp values. I am adding an R code snippet below to give you a better idea of the issue.
library(lubridate)
library(dplyr)
library(DBI)
library(RPostgreSQL)  # provides the "PostgreSQL" driver used by dbDriver() below
dat <- data.frame(id = letters[1:2], x = 2:3, date = now())
dat
str(dat)
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host="redshift.amazonaws.com", port="5439", dbname="abcd", user="xyz", password="abc")
DBI::dbGetQuery(conn, "DROP TABLE test21;")
DBI::dbGetQuery(conn, "CREATE TABLE test21 ( id VARCHAR(255), x INT, date timestamp);")
chunksize = 100
for (i in 1:ceiling(nrow(dat)/chunksize)) {
  query = paste0('INSERT INTO test21 (', paste0(colnames(dat), collapse = ','), ') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize + j
    if (k <= nrow(dat)) {
      vals[j] = paste0('(', paste0(dat[k,], collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals, collapse = ','))
  DBI::dbExecute(conn, query)
}
When running the last part, I get the error below:
RS-DBI driver: (could not Retrieve the result : ERROR: column "date" is of type timestamp without time zone but expression is of type numeric
HINT: You will need to rewrite or cast the expression.
When I manually inserted the values into the Redshift table, they came through as expected:
DBI::dbGetQuery(conn, "INSERT INTO test21 (id, x, date) values ('a','2','2019-02-08 15:21:08'),('b','3','2019-02-08 15:21:08')")
I suspect this issue comes from a programming error somewhere. Could you advise where I am going wrong with the code?
In the date field of your dataframe, try replacing
now()
with
substr(now(), 1, 19)
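To make that concrete, here is a minimal sketch of how the change plays out. Note that the quote_val() helper is my own addition (an assumption on my part): once the values are pasted into the SQL string, character and timestamp fields still need single quotes around them, just like in your manual INSERT.
library(lubridate)
# With substr(), the date column is stored as character, so paste0() keeps the
# readable "YYYY-MM-DD HH:MM:SS" text instead of the numeric POSIXct value.
dat <- data.frame(id = letters[1:2], x = 2:3, date = substr(now(), 1, 19))
# Hypothetical helper: wrap anything non-numeric in single quotes before pasting
quote_val <- function(v) ifelse(grepl("^-?[0-9.]+$", v), v, paste0("'", v, "'"))
vals <- apply(dat, 1, function(row) paste0("(", paste0(quote_val(row), collapse = ","), ")"))
query <- paste0("INSERT INTO test21 (", paste0(colnames(dat), collapse = ","), ") VALUES ",
                paste0(vals, collapse = ","))
DBI::dbExecute(conn, query)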
Related
library(DBI)
library(RSQLite)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
iris_id <- iris |>
  mutate(id = row_number())
dbWriteTable(con, "iris_id", iris_id)
params <- list(id = c(5,6,7))
q <- "SELECT COUNT(*) FROM iris_id WHERE id IN ($id)"
res <- dbSendQuery(con, q)
dbBind(res, params)
dbFetch(res)
As per documentation, this performs the query once per entry in params$id and returns c(1,1,1).
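(Side note: clear the pending result before reusing the connection for the next attempt, otherwise DBI warns about the open result set.)
dbClearResult(res)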
This also doesn't work, because this query is actually WHERE id IN ('5,6,7'):
id <- c(5L,6L,7L)
stopifnot(is.integer(id))
params <- list(id = paste(id, collapse=","))
res <- dbSendQuery(con, q)
dbBind(res, params)
dbFetch(res)
The answer to question [0] suggests using positional ? and pasting a list of ? together. However, this loses the possibility of using named params, which would be beneficial if I have multiple parameters. Is there another way?
[0] Passing DataFrame column into WHERE clause in SQL query embedded in R via parametrized queries
One solution would be to pass all the ids as a comma-separated string (no spaces) and use the LIKE operator instead of IN:
params <- list(id = "5,6,7")
q <- "SELECT COUNT(*) FROM iris_id WHERE ',' || $id || ',' LIKE '%,' || id || ',%'"
res <- dbSendQuery(con, q)
dbBind(res, params)
dbFetch(res)
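Another option is to let dbplyr translate an R expression into the IN clause and splice the resulting SQL fragment into your query: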
library(dbplyr)
translate_sql(id %in% c(4L,5L,6L))
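The translation yields a fragment along the lines of `id` IN (5, 6, 7); a sketch of using it, assuming you are happy to build the query string yourself rather than bind parameters:
sql_in <- dbplyr::translate_sql(id %in% c(5L, 6L, 7L), con = con)
dbGetQuery(con, paste0("SELECT COUNT(*) FROM iris_id WHERE ", sql_in))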
I need to store BLOB/CLOB data with larger raw/character payloads by passing them to an Oracle database procedure from R.
The ROracle documentation indicates this should be possible. I can pass the data directly with dbWriteTable by setting attr(in.df$msg_att, "ora.type") <- "BLOB" on the input dataframe.
I assumed the same logic would apply to oracleProc, but unfortunately it does not work: setting the attribute to BLOB binds a NULL value inside the db procedure, and when it is not set, at most 32767 bytes can be passed.
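For reference, a minimal sketch of the dbWriteTable() path that does work for me (it uses the same connection ch as in the script below; the table and column names are only illustrative):
# Illustrative table with a BLOB column that receives the raw vector directly
dbGetQuery(ch, "CREATE TABLE T_BLOB_TEST (msg_id NUMBER, msg_att BLOB)")
blob.df <- data.frame(msg_id = 1)
blob.df$msg_att <- list(charToRaw(paste(rep("x", 100000), collapse = "")))
attr(blob.df$msg_att, "ora.type") <- "BLOB"   # bind the list-of-raw column as BLOB
dbWriteTable(ch, "T_BLOB_TEST", blob.df, append = TRUE, row.names = FALSE)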
Here is the oracleProc script that fails:
library(ROracle)
ch <- dbConnect(ROracle::Oracle(), user, password)
dbGetQuery(ch, "CREATE SEQUENCE ATT_ID_SEQ")
dbGetQuery(
ch,
"CREATE TABLE T_ATT_TEST
(msg_id NUMBER,
att_id NUMBER GENERATED BY DEFAULT AS IDENTITY,
msg_att CLOB)"
)
dbGetQuery(
ch,
"CREATE OR REPLACE PROCEDURE P_ATT_TEST
(msg_id IN INTEGER, att_id OUT integer, msg_att IN OUT BLOB)
IS
BEGIN
insert into T_ATT_TEST(msg_id, att_id, msg_att)
values(msg_id, att_id_seq.nextval, msg_att)
returning att_id into att_id;
END;")
raw.lst <- vector("list", 1)
raw.lst[[1L]] <- charToRaw(paste(rep('x', 32767), collapse = ""))
in.df <- data.frame(msg_id = 1,
                    att_id = as.integer(NA),
                    stringsAsFactors = FALSE)
in.df$msg_att <- raw.lst
attr(in.df$msg_id, "ora.parameter_name") <- "msg_id"
attr(in.df$msg_id, "ora.parameter_mode") <- "IN"
attr(in.df$msg_att, "ora.parameter_name") <- "msg_att"
attr(in.df$msg_att, "ora.parameter_mode") <- "IN"
#attr(in.df$msg_att, "ora.type") <- "BLOB"
attr(in.df$att_id, "ora.parameter_name") <- "att_id"
attr(in.df$att_id, "ora.parameter_mode") <- "OUT"
att_id <- oracleProc(ch, 'BEGIN P_ATT_TEST(
:msg_id,
:att_id,
:msg_att
); END;', in.df)
dbCommit(ch)
res <- dbReadTable(ch, "T_ATT_TEST")
str(res)
# remove created objects
dbGetQuery(ch, "DROP SEQUENCE ATT_ID_SEQ")
dbGetQuery(ch, "DROP TABLE T_ATT_TEST")
dbGetQuery(ch, "DROP PROCEDURE P_ATT_TEST")
dbDisconnect(ch)
Can somebody help me with an example of binding BLOB/CLOB to oracleProc?
Thank you and best regards
Ondrej
I am using RStudio to process data from Hive via RJDBC. RJDBC converts the result of a select statement into a dataframe. Unfortunately, the Hive column data types "date" and "timestamp" do not seem to be recognized, so these columns are converted to character during dbReadTable(conn, "db2.ibor_lending"), which is bad.
Do you have any idea about this? I don't want to recast the character columns back to dates in R because that 1. adds overhead, 2. leads to coupling, and 3. increases maintenance effort.
library(DBI)
library(rJava)
library(RJDBC)
print("Attempting Hive Connection...")
hadoop.class.path = list.files(path=c("/usr/hdp/current/hadoop-client"),pattern="jar", full.names=T);
hadoop.client.lib = list.files(path=c("/usr/hdp/current/hadoop-client/lib"),pattern="jar", full.names=T);
hive.class.path = list.files(path=c("/usr/hdp/current/hive-client/lib"),pattern="jar", full.names=T);
hadoop.hdfs.lib.path = list.files(path=c("/usr/hdp/current/hadoop-hdfs-client"),pattern="jar",full.names=T);
zookeeper.lib.path = list.files(path=c("/usr/hdp/current/zookeeper-client"),pattern="jar",full.names=T);
mapred.class.path = list.files(path=c("/usr/hdp/current/hadoop-mapreduce-client"),pattern="jar",full.names=T);
cp = c(hive.class.path,mapred.class.path,hadoop.class.path,hadoop.client.lib,hadoop.hdfs.lib.path)
.jinit(classpath=cp, parameters="-Djavax.security.auth.useSubjectCredsOnly=false")
drv <- JDBC("org.apache.hive.jdbc.HiveDriver","/usr/hdp/current/hive-client/lib/hive-jdbc.jar",identifier.quote="`")
conn <- dbConnect(drv,"jdbc:hive2://xxx:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST#yyyy")
show_databases <- dbGetQuery(conn, "select * from db2.ibor_lending LIMIT 100")
show_datatypes <- dbGetQuery(conn, "describe db2.ibor_lending")
show_table <- dbReadTable(conn, "db2.ibor_lending")
The result is:
Hive describe output: col_name = cutoffdate, data_type = timestamp
R dataframe: ibor_lending.cutoffdate is character
Br, Dennis
This question is an extension of How to quickly export data from R to SQL Server. Currently I am using the following code:
# DB Handle for config file #
dbhandle <- odbcDriverConnect()
# save the data in the table finally
sqlSave(dbhandle, bp, "FACT_OP", append=TRUE, rownames=FALSE, verbose = verbose, fast = TRUE)
# varTypes <- c(Date="datetime", QueryDate = "datetime")
# sqlSave(dbhandle, bp, "FACT_OP", rownames=FALSE,verbose = TRUE, fast = TRUE, varTypes=varTypes)
# DB handle close
odbcClose(dbhandle)
I have also tried the approach below, which works beautifully and gives a significant speed gain as well.
toSQL = data.frame(...);
write.table(toSQL,"C:\\export\\filename.txt",quote=FALSE,sep=",",row.names=FALSE,col.names=FALSE,append=FALSE);
sqlQuery(channel,"BULK
INSERT Yada.dbo.yada
FROM '\\\\<server-that-SQL-server-can-see>\\export\\filename.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\\n'
)");
But my issue is that I can NOT keep my data at rest between transactions (writing data to a file is not an option because of data security), so I am looking for a solution to bulk insert directly from memory or a cache. Thanks for the help.
Good question - also useful in instances where the BULK INSERT permissions cannot be set up for whatever reason.
I threw together this poor man's solution a while back when I had enough data that sqlSave was too slow, but not enough to justify setting up BULK INSERT, so it does not require any data being written to a file. The primary reason that sqlSave and parameterized queries are so slow for inserting data is that each row is inserted with a new INSERT statement. Having R write the INSERT statement manually bypasses this in my example below:
library(RODBC)
channel <- ...
dataTable <- ...relevant data...
numberOfThousands <- floor(nrow(dataTable)/1000)
extra <- nrow(dataTable)%%1000
thousandInsertQuery <- function(channel, dat, range) {
  sqlQuery(channel, paste0("INSERT INTO Database.dbo.Responses (IDNum,State,Answer)
                            VALUES ",
                           paste0(
                             sapply(range, function(k) {
                               paste0("(", dat$IDNum[k], ",'",
                                      dat$State[k], "','",
                                      gsub("'", "''", dat$Answer[k], fixed = TRUE), "')")
                             }),
                             collapse = ",")))
}
if (numberOfThousands)
  for (n in 1:numberOfThousands) {
    # arguments passed in the order the function expects: (channel, dat, range)
    thousandInsertQuery(channel, dataTable, (1000*(n-1)+1):(1000*n))
  }
if (extra)
  thousandInsertQuery(channel, dataTable, (1000*numberOfThousands+1):(1000*numberOfThousands+extra))
SQL Server's INSERT ... VALUES statements only accept up to 1000 rows at a time, so this code breaks the data into chunks (much more efficient than one row at a time).
The thousandInsertQuery function will obviously have to be customized to handle whatever columns your data frame has - note also that there are single quotes around the character/factor columns and a gsub to handle any single quotes that might be in the character column. Other than this there are no safeguards against SQL injection attacks.
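If you want the same quoting logic without hard-coding the columns, a small helper along these lines (my own sketch, not part of the original approach) builds one row's value tuple from any data frame:
# Quote one value for a VALUES list: numbers pass through, NA becomes NULL,
# and embedded single quotes in strings are doubled.
sql_value <- function(x) {
  if (is.na(x)) return("NULL")
  if (is.numeric(x)) return(as.character(x))
  paste0("'", gsub("'", "''", as.character(x), fixed = TRUE), "'")
}
# One row of dat, formatted as "(v1,v2,...)" ready to drop into an INSERT
row_tuple <- function(dat, k) {
  paste0("(", paste(vapply(dat[k, ], sql_value, character(1)), collapse = ","), ")")
}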
What about using the DBI::dbWriteTable() function?
Example below (I am connecting my R code to an AWS RDS instance of MS SQL Express):
library(DBI)
library(RJDBC)
library(tidyverse)
# Specify where your driver lives
drv <- JDBC(
  "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "c:/R/SQL/sqljdbc42.jar")
# Connect to AWS RDS instance
conn <- drv %>%
  dbConnect(
    host = "jdbc:sqlserver://xxx.ccgqenhjdi18.ap-southeast-2.rds.amazonaws.com",
    user = "xxx",
    password = "********",
    port = 1433,
    dbname = "qlik")
if (0) { # check what the conn object has access to
  queryResults <- conn %>%
    dbGetQuery("select * from information_schema.tables")
}
# Create test data
example_data <- data.frame(animal = c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel = c("furry", "furry", "squishy", "spiny"),
                           weight = c(45, 8, 1.1, 0.8))
# Works in 20ms in my case
system.time(
  conn %>% dbWriteTable(
    "qlik.export.test",
    example_data
  )
)
# Let us see if we see the exported results
conn %>% dbGetQuery("select * FROM qlik.export.test")
# Let's clean the mess and force-close connection at the end of the process
conn %>% dbDisconnect()
It works pretty fast for small amounts of data and seems rather elegant if you want a data.frame -> SQL table solution.
Enjoy!
Building on @jpd527's solution, which I found really worth digging into...
require(RODBC)
channel <- #connection parameters
dbPath <- # path to your table, database.table
data <- # the DF you have prepared for insertion, /!\ beware of column names and value types...
# Function to insert 1000 rows of data in one sqlQuery call, coming from
# any DF and into any database.table
insert1000Rows <- function(channel, dbPath, data, range) {
  # Define column names for the database.table
  columns <- paste(names(data), collapse = ", ")
  # Initialize a string which will incorporate all 1000 rows of values
  values <- ""
  # Not very elegant, but appropriately builds the values (a, b, c...), (d, e, f...) into a string
  for (i in range) {
    for (j in 1:ncol(data)) {
      # First column
      if (j == 1) {
        if (i == min(range)) {
          # First row, only "("
          values <- paste0(values, "(")
        } else {
          # Next rows, ",("
          values <- paste0(values, ",(")
        }
      }
      # Value handling
      values <- paste0(
        values
        # Handle NA values you want to insert as NULL values
        , ifelse(is.na(data[i, j])
                 , "null"
                 # Handle numeric values you want to insert as INT
                 , ifelse(is.numeric(data[i, j])
                          , data[i, j]
                          # Else handle as character to insert as VARCHAR
                          , paste0("'", data[i, j], "'")
                 )
        )
      )
      # Separator for columns
      if (j == ncol(data)) {
        # Last column, close parenthesis
        values <- paste0(values, ")")
      } else {
        # Other columns, add comma
        values <- paste0(values, ",")
      }
    }
  }
  # Once the string is built, insert it into SQL Server
  sqlQuery(channel, paste0("insert into ", dbPath, " (", columns, ") values ", values))
}
This insert1000Rows function is used in a loop in the next function, sqlInsertAll, for which you simply define which DF you want to insert into which database.table.
# Main function which uses the insert1000rows function in a loop
sqlInsertAll <- function(channel, dbPath, data) {
  numberOfThousands <- floor(nrow(data) / 1000)
  extra <- nrow(data) %% 1000
  if (numberOfThousands) {
    for (n in 1:numberOfThousands) {
      insert1000Rows(channel, dbPath, data, (1000 * (n - 1) + 1):(1000 * n))
      print(paste0(n, "/", numberOfThousands))
    }
  }
  if (extra) {
    insert1000Rows(channel, dbPath, data, (1000 * numberOfThousands + 1):(1000 * numberOfThousands + extra))
  }
}
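A usage sketch (the connection string and table path below are placeholders):
channel <- odbcDriverConnect("driver={SQL Server};server=myserver;database=MyDatabase;trusted_connection=yes")
sqlInsertAll(channel, "MyDatabase.dbo.my_table", data)
odbcClose(channel)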
With this, I am able to insert 250k rows of data in 5 minutes or so, whereas it took more than 24 hours using sqlSave from the RODBC package.
I have a table in a PostgreSQL database that has a BIGSERIAL auto-incrementing primary key. Recreate it using:
CREATE TABLE foo
(
"Id" bigserial PRIMARY KEY,
"SomeData" text NOT NULL
);
I want to append some data to this table from R via the RPostgreSQL package. In R, the data doesn't include the Id column because I want the database to generate those values.
dfr <- data.frame(SomeData = letters)
Here's the code I used to try and write the data:
library(RPostgreSQL)
conn <- dbConnect(
  "PostgreSQL",
  user = "yourname",
  password = "your password",
  dbname = "test"
)
dbWriteTable(conn, "foo", dfr, append = TRUE, row.names = FALSE)
dbDisconnect(conn)
Unfortunately, dbWriteTable throws an error:
## Error in postgresqlgetResult(new.con) :
## RS-DBI driver: (could not Retrieve the result : ERROR: invalid input syntax for integer: "a"
## CONTEXT: COPY foo, line 1, column Id: "a"
## )
The error message isn't completely clear, but I interpret this as R trying to pass the contents of the SomeData column to the first column in the database (which is Id).
How should I be passing the data to PostgreSQL so that the Id column is auto-generated?
From the thread in hrbrmstr's comment, I found a hack to make this work.
In the postgresqlWriteTable function in the RPostgreSQL package, you need to replace the line
sql4 <- paste("COPY", postgresqlTableRef(name), "FROM STDIN")
with
sql4 <- paste(
  "COPY ",
  postgresqlTableRef(name),
  "(",
  paste(postgresqlQuoteId(names(value)), collapse = ","),
  ") FROM STDIN"
)
Note that the quoting of variables (not included in the original hack) is necessary to pass case-sensitive column names.
Here's a script to do that:
body_lines <- deparse(body(RPostgreSQL::postgresqlWriteTable))
new_body_lines <- sub(
  'postgresqlTableRef(name), "FROM STDIN")',
  'postgresqlTableRef(name), "(", paste(shQuote(names(value)), collapse = ","), ") FROM STDIN")',
  body_lines,
  fixed = TRUE
)
fn <- RPostgreSQL::postgresqlWriteTable
body(fn) <- parse(text = new_body_lines)
while("RPostgreSQL" %in% search()) detach("package:RPostgreSQL")
assignInNamespace("postgresqlWriteTable", fn, "RPostgreSQL")
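After the patch is applied, the dbWriteTable() call from the question should append the rows and let PostgreSQL generate the Id values:
dbWriteTable(conn, "foo", dfr, append = TRUE, row.names = FALSE)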
I struggled with an issue very similar to this today and stumbled across this thread as I tried out different approaches. As of this writing (02/12/2018), it looks like the patch recommended above has been implemented in the latest version of RPostgreSQL::postgresqlWriteTable, but I still kept getting an error indicating that the primary key R assigned to my new rows was duplicated in the source data table.
I ultimately implemented a workaround that generates an incrementing primary key in R and appends it to the data I insert into the source table in my PostgreSQL db. For my purposes, I only need to insert one record at a time, and I can't imagine this is an optimal solution for inserting a batch of records that require a serially incremented primary key. Predictably, an error of "table my_table exists in database: aborting assignTable" was thrown when I omitted 'append=TRUE' from my script; however, that option did not automatically assign an incrementing primary key as I had hoped, even with the code patch described above.
drv <- dbDriver("PostgreSQL")
localdb <- dbConnect(drv, dbname = 'MyDatabase',
                     host = 'localhost',
                     port = 5432,
                     user = 'postgres',
                     password = 'MyPassword')
KeyPlusOne <- sum(dbGetQuery(localdb, "SELECT count(*) FROM my_table"), 1)
NewRecord <- t(c(KeyPlusOne, 'Var1', 'Var2', 'Var3', 'Var4'))
NewRecord <- as.data.frame(NewRecord)
# setNames() is applied to the data frame (not the key) so the column names line up
NewRecord <- setNames(NewRecord, c("PK", "VarName1", "VarName2", "VarName3", "VarName4"))
postgresqlWriteTable(localdb, "my_table", NewRecord, append = TRUE, row.names = FALSE)