How to save to a pre-existing Snowflake table from R using pool

I am using pool to handle connections to my Snowflake warehouse. I have created a connection to my database and can read data from a pre-existing table with no issues, e.g.:
my_pool <- dbPool(odbc::odbc(),
                  Driver = "Snowflake",
                  Server = Sys.getenv('WH_URL'),
                  UID = Sys.getenv('WH_USER'),
                  PWD = Sys.getenv('WH_PW'),
                  Warehouse = Sys.getenv('WH_WH'),
                  Database = "MY_DB")

my_data <- tbl(my_pool, in_schema(sql("schema_name"), sql("table_name"))) %>%
  collect()
I would like to save back to a table (table_name) and I believe the best way to do this is with pool::dbWriteTable:
# Create some data to save to db
data <- data.frame("user_email" = "tim@apple.com",
                   "query_run" = "arrivals_departures",
                   "data_downloaded" = FALSE,
                   "created_at" = as.character(Sys.time()))

# Define where to save the data
table_id <- Id(database = "MY_DB", schema = "MY_SCHEMA", table = "TABLE_NAME")

# Write to database
pool::dbWriteTable(my_pool, table_id, data, append = TRUE)
However this returns the error:
Error in new_result(connection@ptr, statement, immediate) :
  nanodbc/nanodbc.cpp:1594: 00000: SQL compilation error:
  Object 'MY_DB.MY_SCHEMA.TABLE_NAME' already exists.
I have read/write/update permissions for this database for the user specified in my_pool.
I have explored the accepted answers here and here to create the above attempt and can't figure out what I'm doing wrong. It's probably something simple that I've forgotten to do - any thoughts?
EDIT: Wondering if my issue is anything to do with: https://github.com/r-dbi/odbc/issues/480
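For what it's worth, if the problem really is the existence check odbc performs before writing (as the linked issue suggests), one workaround worth trying (a hedged sketch, not taken from the thread) is to append with DBI::dbAppendTable() on a checked-out connection, since dbAppendTable() only generates INSERT statements and never tries to create the table:
# Hedged workaround sketch: dbAppendTable() skips table creation and only
# issues INSERTs, so it avoids the "already exists" code path entirely.
con <- pool::poolCheckout(my_pool)
DBI::dbAppendTable(con, table_id, data)
pool::poolReturn(con)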

Related

sparklyr::spark_write_jdbc Does not Accept Spark Dataframe?

I am working within Databricks, trying to use the sparklyr function spark_write_jdbc to write a dataframe to a SQL Server table. The server name/driver etc are correct and work, as I successfully used sparklyr::spark_read_jdbc() earlier in the code.
Per the documentation (here), spark_write_jdbc should accept a Spark Dataframe.
I used SparkR::createDataFrame() to convert the dataframe I was working with to a Spark dataframe.
Here is the relevant code:
events_long_test <- SparkR::createDataFrame(events_long, schema = NULL, samplingRatio = 1, numPartitions = NULL)
sparklyr::spark_write_jdbc(events_long_test,
                           name = "who_status_long_test",
                           options = list(url = url,
                                          user = user,
                                          driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                                          password = pw,
                                          dbtable = "who_status_long_test"))
However, when I run this, it gives me the following error:
Error in UseMethod("spark_write_jdbc") :
no applicable method for 'spark_write_jdbc' applied to an object of class "SparkDataFrame"
I have searched around and cannot find other people asking about this error. Why would it say this function cannot work with a Spark Dataframe, when the documentation says it does?
Any help is appreciated.
What is in events_long? The syntax is correct; make sure your connection properties in options are correct, and make sure that events_long_test is a Spark data frame, not a table.
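As a hedged sketch (not from the thread): sparklyr's verbs dispatch on tbl_spark objects created through sparklyr itself, not on the S4 SparkDataFrame that SparkR::createDataFrame() returns, so copying the local data with sparklyr::sdf_copy_to() should give spark_write_jdbc() an object it knows how to handle:
library(sparklyr)

# `sc` is the existing sparklyr connection; the other names mirror the question
events_long_tbl <- sdf_copy_to(sc, events_long, name = "events_long_tbl", overwrite = TRUE)

spark_write_jdbc(events_long_tbl,
                 name = "who_status_long_test",
                 options = list(url = url,
                                user = user,
                                password = pw,
                                driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                                dbtable = "who_status_long_test"))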

Understanding Spark 1.6.3 <-> JDBC (Oracle) connection in sparklyr

Using sparklyr I access a table from Oracle via JDBC in the following manner:
tbl_sample_stuff <- spark_read_jdbc(
  sc = sc,
  name = "tbl_spark_some_table",
  options = list(
    url = "jdbc:oracle:thin:@//my.host.with.data:0000/my.host.service.details",
    driver = "oracle.jdbc.OracleDriver",
    user = "MY_USERNAME",
    password = "SOME_PASSWORD",
    # dbtable = "(SELECT * FROM TABLE WHERE FIELD > 10) ALIAS"),
    dbtable = "some_table"
  ),
  memory = FALSE
)
The sample_stuff table is accessible. For instance running glimpse(tbl_sample_stuff) produces the required results.
Questions
Let's say I want to derive a simple count per group using the code below:
dta_tbl_sample_stuff_category <- tbl_sample_stuff %>%
count(category_variable) %>%
collect()
As a consequence my Spark 1.6.3 delivers the following job:
What is actually going on there? Why is there a collect job running first for a long period of time (~7 mins)? My view would be that the optimal approach would be to initially run some SQL like SELECT COUNT(category_variable) FROM table GROUP BY category_variable on that data and then collect the results. It feels to me that this job is downloading the data first and then aggregating; is that correct?
What's the optimal way of using a JDBC connection via sparklyr? In particular, I would like to know:
What's wise in terms of creating temporary tables? Should I always create temporary tables for data I may want to analyse frequently?
Other details
I'm adding the Oracle driver via
configDef$sparklyr.jars.default <- ora_jar_drv
The rest is a typical connection to a Spark cluster managed on YARN, returned as the sc object to the R session.
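One hedged sketch (not part of the original post) for pushing the aggregation down to Oracle instead of collecting raw rows: register a subquery as the JDBC source via the dbtable option, so the GROUP BY runs in the database and Spark only receives the aggregated result:
# Assumes the same sc, URL, driver and credentials as above
tbl_counts <- spark_read_jdbc(
  sc = sc,
  name = "tbl_spark_counts",
  options = list(
    url = "jdbc:oracle:thin:@//my.host.with.data:0000/my.host.service.details",
    driver = "oracle.jdbc.OracleDriver",
    user = "MY_USERNAME",
    password = "SOME_PASSWORD",
    dbtable = "(SELECT category_variable, COUNT(*) AS n FROM some_table GROUP BY category_variable) agg"
  ),
  memory = FALSE
)
dta_counts <- dplyr::collect(tbl_counts)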

Appending records to a table in SQL Server from R

I am able to establish a connection to a Microsoft SQL Server and am also able to read tables.
pool <- pool::dbPool(drv = odbc::odbc(),
                     dsn = "MYDSN",
                     uid = "MYUID",
                     pwd = "XXXXX")
con <- poolCheckout(pool)
WVDListFull <- tbl(con, in_schema('Midas', "WVDListFull")) %>% head() %>% collect()
However, I am unable to append new records to the table. Assuming that I have new records in a data frame called x, I tried the following code:
dbWriteTable(pool,'[Midas].[WVDListFull]', x, append=TRUE)
This gave me an error:
nanodbc/nanodbc.cpp:1587: 42000: [FreeTDS][SQL Server]CREATE TABLE permission denied in database 'ScorpioEDW'.
I do have read and write permissions on the said database. I also tried this:
dbWriteTable(con,DBI::SQL("Midas.WVDListFull"), x, append=TRUE)
Which resulted in another error:
Error: Can't unquote Midas.WVDListFull
Here Midas is the schema containing the table WVDListFull. Can someone tell me what's going on here?
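A hedged sketch (not from the thread): specifying the schema with DBI::Id() usually lets the driver locate the existing table, so the write is treated as an append rather than a CREATE TABLE; appending with DBI::dbAppendTable() on a checked-out connection is an alternative that never attempts to create the table at all:
# Hypothetical fix sketch; `pool` and `x` are as defined in the question
tbl_id <- DBI::Id(schema = "Midas", table = "WVDListFull")
dbWriteTable(pool, tbl_id, x, append = TRUE)

# Or, appending without any table-creation step:
con <- poolCheckout(pool)
DBI::dbAppendTable(con, tbl_id, x)
poolReturn(con)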

Appending new data to a local Access data base file with r after a successful connection

So I am currently working with a connection to an Access database. I am able to get connected to the Access DB, which is located on my local system. This is actually connected to a SharePoint list. I would love to automate the process of handling this SharePoint list with an R and Access combo! What I want to be able to do is actually pretty basic: I want to introduce new data via a .csv which is processed for the relevant content, then compared to the current Access DB, and finally have the new information uploaded from R to Access.
I've learned that you need to pair the bit version of your Windows OS, Office version, and R version. So I am x64 on all of the above. This allowed me to connect to the Access DB. You also need the 'Microsoft Access Database Engine 2016 Redistributable' which is essentially the driver for the connection.
So what I have so far is:
library(odbc)
library(DBI)
file_path <- "C:/user/Documents/R Projects/...pathtofile.../filename.accdb"
accdb_con <- dbConnect(drv = odbc(), .connection_string = paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=",file_path,";"))
access.db <- dbReadTable(accdb_con, "sNPS Deep Dives")
That now connects!
I then read in a .csv of new information
new.df <- read.csv("C:/user/Documents/R projects/...pathtofile.csv", header=T, stringsAsFactors=FALSE, na.strings=c("","NA"))
An example of the data set might just look something like this:
date <- c("15/10/2018","15/10/2018", "16/10/2018", "12/11/2018", "07/09/2018")
score <- c("6", "10", "7", "10", "9")
group <- c("a","b", "b", "a", "b")
CaseID <- c("301", "302", "303", "304", "305")
new.df <- data.frame(date,score,group,CaseID)
new.df$date <- as.character(new.df$date)
new.df$score <- as.numeric(new.df$score)
new.df$group <- as.character(new.df$group)
new.df$CaseID <- as.numeric(new.df$CaseID)
Notably there are more columns in the Access DB that people will fill in by hand with further information.
and I process it to be ready to go into the Access DB (probably not that interesting).
Then I compare the new data against the Access DB as such:
library(dplyr)
new <- anti_join(new.df, access.db, by= "Case.ID")
Now I've tried:
dbWriteTable(access.db.copy, new, append = TRUE)
dbAppendTable(access.db.copy, new)
I don't seem to be able to get this to go anywhere. I am getting an error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"ACCESS", "data.frame", "missing"’
I've seen plenty of posts in which people are having trouble connecting to an Access DB but I haven't seen anything about writing new data into that database.
I know this isn't quite a reproducible example, but it seems like a difficult problem to recreate since it's a connection problem between different tools. I would be happy to provide example sets that might make this easier.
I would appreciate any direction you all can provide.
Thanks!
Edit:
It appears that Bing Sun was right, I was missing an argument. So it appears that we need something more like:
dbWriteTable(access.db.copy, "Name of table",new, append = TRUE)
Which produces the error:
Error in result_insert_dataframe(rs@ptr, values) :
  nanodbc/nanodbc.cpp:1944: HY104: [Microsoft][ODBC Microsoft Access Driver]Invalid precision value
I wonder if this may be an error from Access about a file type?
Now if I use dbAppendTable() I don't get an error; instead I get a 0 as output:
dbAppendTable(access.db.copy, "Name of table", new, append= TRUE)
With output:
[1] 0
But I don't see any of the new values when I check the Access file.
I know it's years later, but hopefully this will help someone else with this issue, since you're right, CrayCrayTown: there aren't very many posts covering it.
I've run into this problem repeatedly when dealing with R and MS Access. The solution that I've come up with is pretty "hacky" but it accomplishes what's trying to be done...just not very eloquently.
The way I do this is with a combo of RODBC and DBI packages.
First, I open a connection to the DB with RODBC, and use that connection to write my data to the DB as an intermediary table:
# Note: RODBC::odbcDriverConnect() expects a full ODBC connection string,
# not just a file path
chan <- RODBC::odbcDriverConnect(
  connection = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=/path/to/database.accdb"
)
RODBC::sqlSave(channel = chan,
               dat = df,
               tablename = "tbl_intermediary",
               rownames = FALSE,
               append = FALSE)
RODBC::odbcClose(chan)
rm(chan)
Make sure to close the RODBC connection; I also destroy it for good measure, because why not? I use RODBC for the intermediary table because it supports batch insert statements. I know that the same thing can, in theory, be done with DBI with DBI::dbAppendTable() (but we wouldn't be on this post if that worked how we had hoped). I tried this in a previous SO question here, but it didn't solve my problem. I also don't know how big my intermediary tables could get in the future. Hopefully by the time they get too big we'll be in a different DBMS.
Next, I reopen the connection, this time with DBI, and send a statement to the DB to write those data from the intermediary table to the final resting place for those data, and then drop the intermediary table.
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=/path/to/database.accdb")
DBI::dbSendStatement(
  conn = con,
  statement = 'UPDATE
    tbl_intermediary INNER JOIN final_tbl ON tbl_intermediary.SampleID = final_tbl.sampleNumber
  SET
    final_tbl.field1 = [tbl_intermediary].[field1],
    final_tbl.notes = IIf(Nz([tbl_intermediary].[Notes],"")="",[final_tbl].[notes],[final_tbl].[notes] & "; Newest Notes: " & [tbl_intermediary].[Notes]);'
)
DBI::dbSendStatement(
  conn = con,
  statement = 'DROP TABLE tbl_intermediary;'
)
DBI::dbDisconnect(con)
rm(con)
The main reason why I chose this method is because some of the SQL I use with Access also has some VBA in it. When I send the SQL-VBA hybrid string with RODBC, I get assorted errors in the IIF() and Nz() functions (see example above). From the RODBC CRAN docs the query argument for the sqlQuery() function is strictly assumed to be a valid SQL statement. So, RODBC has no clue how to interpret the IIf() and Nz() MS Access functions. I think this also has to do with how the ODBC driver handles communication as well (please, someone correct me if I'm wrong about this).
As I understand it, DBI::dbSendStatement(), however, lets the database engine you're working with interpret how to use the statement argument you provide. In the situation above, the VBA is executed exactly how I would expect if it were run in Access directly. As per the DBI docs, for interactive use you'll generally want to use dbGetQuery or dbExecute.
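For instance, a minimal sketch of the same cleanup step using dbExecute(), which runs the statement and returns the number of affected rows directly:
# Equivalent cleanup with dbExecute(); it returns the affected row count and
# leaves no pending result to clear afterwards
DBI::dbExecute(con, "DROP TABLE tbl_intermediary;")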

How to write a table in PostgreSQL from R?

At present, to insert data into a PostgreSQL table I have to create an empty table and then do an insert into table values ... with the data frame collapsed into a single string containing all the values. This doesn't work for large data frames.
dbWriteTable() doesn't work for PostgreSQL and gives the following error:
Error in postgresqlpqExec(new.con, sql4) : RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "STDIN" LINE 1: COPY "table_1" FROM STDIN
I have tried the following hack as suggested in an answer to a similar question asked before. Here's the link: How do I write data from R to PostgreSQL tables with an autoincrementing primary key?
body_lines <- deparse(body(RPostgreSQL::postgresqlWriteTable))
new_body_lines <- sub(
  'postgresqlTableRef(name), "FROM STDIN")',
  'postgresqlTableRef(name), "(", paste(shQuote(names(value)), collapse = ","), ") FROM STDIN")',
  body_lines,
  fixed = TRUE
)
fn <- RPostgreSQL::postgresqlWriteTable
body(fn) <- parse(text = new_body_lines)
while("RPostgreSQL" %in% search()) detach("package:RPostgreSQL")
assignInNamespace("postgresqlWriteTable", fn, "RPostgreSQL")
This hack still doesn't work for me. The postgresqlWriteTable() throws exactly the same error...
What exactly is the problem here?
As an alternative, I have tried using dbWriteTable2() from the caroline package, and it throws a different error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column "id" does not exist in table_1
)
creating NAs/NULLs for for fields of table that are missing in your df
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: column "id" does not exist in table_1
)
Is there any other method to write a large dataframe into a table in PostgreSQL directly?
Ok, I'm not sure why dbWriteTable() would be failing; there may be some kind of version/protocol mismatch. Perhaps you could try installing the latest versions of R, the RPostgreSQL package, and upgrading the PostgreSQL server on your system, if possible.
Regarding the insert into workaround failing for large data, what is often done in the IT world when large amounts of data must be moved and a one-shot transfer is infeasible/impractical/flaky is what is sometimes referred to as batching or batch processing. Basically, you divide the data into smaller chunks and send each chunk one at a time.
As a random example, a few years ago I wrote some Java code to query for employee information from an HR LDAP server which was constrained to only provide 1000 records at a time. So basically I had to write a loop to keep sending the same request (with the query state tracked using some kind of weird cookie-based mechanism) and accumulating the records into a local database until the server reported the query complete.
Here's some code that manually constructs the SQL to create an empty table based on a given data.frame, and then insert the content of the data.frame into the table using a parameterized batch size. It's mostly built around calls to paste() to build the SQL strings, and dbSendQuery() to send the actual queries. I also use postgresqlDataType() for the table creation.
## connect to the DB
library('RPostgreSQL'); ## loads DBI automatically
drv <- dbDriver('PostgreSQL');
con <- dbConnect(drv,host=...,port=...,dbname=...,user=...,password=...);

## define helper functions
createEmptyTable <- function(con,tn,df) {
  sql <- paste0("create table \"",tn,"\" (",paste0(collapse=',','"',names(df),'" ',sapply(df[0,],postgresqlDataType)),");");
  dbSendQuery(con,sql);
  invisible();
};

insertBatch <- function(con,tn,df,size=100L) {
  if (nrow(df)==0L) return(invisible());
  cnt <- (nrow(df)-1L)%/%size+1L;
  for (i in seq(0L,len=cnt)) {
    sql <- paste0("insert into \"",tn,"\" values (",do.call(paste,c(sep=',',collapse='),(',lapply(df[seq(i*size+1L,min(nrow(df),(i+1L)*size)),],shQuote))),");");
    dbSendQuery(con,sql);
  };
  invisible();
};
## generate test data
NC <- 1e2L; NR <- 1e3L; df <- as.data.frame(replicate(NC,runif(NR)));
## run it
tn <- 't1';
dbRemoveTable(con,tn);
createEmptyTable(con,tn,df);
insertBatch(con,tn,df);
res <- dbReadTable(con,tn);
all.equal(df,res);
## [1] TRUE
Note that I didn't bother prepending a row.names column to the database table, unlike dbWriteTable(), which always seems to include such a column (and doesn't seem to provide any means of preventing it).
I had the same error while working through this example.
For me, this worked:
dbWriteTable(con, "cartable", value = df, overwrite = T, append = F, row.names = FALSE)
I had already configured a table "cartable" in pgAdmin, so an empty table existed and I had to overwrite that table with values.
So the answer showing batch processing given earlier is 99.99% correct. However, it doesn't work on Windows because of a tiny argument required in the insertBatch function.
(was not able to add a comment for the same answer)
The 'shQuote' function requires an argument type = 'cmd2' for it to work.
However, to add an argument there, you need this answer: https://stackoverflow.com/questions/6827299/r-apply-function-with-multiple-parameters
So, the new 'insertBatch' function becomes:
insertBatch <- function(con,tn,df,size=100L) {
  if (nrow(df)==0L) return(invisible());
  cnt <- (nrow(df)-1L)%/%size+1L;
  for (i in seq(0L,len=cnt)) {
    sql <- paste0("insert into \"",tn,"\" values (",do.call(paste,c(sep=',',collapse='),(',lapply(df[seq(i*size+1L,min(nrow(df),(i+1L)*size)),],shQuote,type = 'cmd2'))),");");
    dbSendQuery(con,sql);
  };
  invisible();
};
