Using sparklyr I access a table from Oracle via JDBC in the following manner:
tbl_sample_stuff <- spark_read_jdbc(
sc = sc,
name = "tbl_spark_some_table",
options = list(
url = "jdbc:oracle:thin:#//my.host.with.data:0000/my.host.service.details",
driver = "oracle.jdbc.OracleDriver",
user = "MY_USERNAME",
password = "SOME_PASSWORD",
# dbtable = "(SELECT * FROM TABLE WHERE FIELD > 10) ALIAS"),
dbtable = "some_table"
),
memory = FALSE
)
The sample_stuff table is accessible. For instance running glimpse(tbl_sample_stuff) produces the required results.
Questions
Let's say I want to derive a simple count per group using the code below:
dta_tbl_sample_stuff_category <- tbl_sample_stuff %>%
count(category_variable) %>%
collect()
As a consequence my Spark 1.6.3 delivers the following job:
What is actually going on there, why there is a one collect job running first for a long period of time (~ 7 mins)? My view would be that the optimal approach would initially run some SQL like SELECT COUNT(category_variable) FROM table GROUP BY category_variable on that data and then collected the results. It feels to me that this job is downloading the data first and then aggregating, is that correct?
What's the optimal way of using JDBC connection via sparklyr. In particular, I would like to know:
What's wise in terms of creating temporary tables? Should I always create temporary tables for data I may want to analyse frequently?
Other details
I'm adding Oracle driver via
configDef$sparklyr.jars.default <- ora_jar_drv
Rest is typical connection to Spark cluster managed on Yarn returned as sc object to R session.
Related
I am using pool to handle connections to my Snowflake warehouse. I have created a connection to my database and can read data in a pre-existing table with no issues e.g:
my_pool <- dbPool(odbc::odbc(),
Driver = "Snowflake",
Server = Sys.getenv('WH_URL'),
UID = Sys.getenv('WH_USER'),
PWD = Sys.getenv('WH_PW'),
Warehouse = Sys.getenv('WH_WH'),
Database = "MY_DB")
my_data<-tbl(my_pool, in_schema(sql("schema_name"), sql("table_name"))) %>%
collect()
I would like to save back to a table (table_name) and I believe the best way to do this is with pool::dbWriteTable:
# Create some data to save to db
data<-data.frame("user_email" = "tim#apple.com",
"query_run" = "arrivals_departures",
"data_downloaded" = FALSE,
"created_at" = as.character(Sys.time()))
# Define where to save the data
table_id <- Id(database="MY_DB", schema="MY_SCHEMA", table="TABLE_NAME")
# Write to database
pool::dbWriteTable(my_pool, table_id, data, append=TRUE)
However this returns the error:
Error in new_result(connection#ptr, statement, immediate) :
nanodbc/nanodbc.cpp:1594: 00000: SQL compilation error:
Object 'MY_DB.MY_SCHEMA.TABLE_NAME' already exists.
I have read/write/update permissions for this database for the user specified in my_pool.
I have explored the accepted answers here and here to create the above attempt and can't figure out what I'm doing wrong. It's probably something simple that I've forgotten to do - any thoughts?
EDIT: Wondering if my issue is anything to do with: https://github.com/r-dbi/odbc/issues/480
I've connected to a SQL Server database with the code shown here, and then I try to run a query to collect data filtered on a date, which is held as an integer in the table in YYYYMMDD format
con <- DBI::dbConnect(odbc::odbc(), driver = "SQL Server", server = "***")
fact_transaction_line <- tbl(con,in_schema('***', '***'))
data <- fact_transaction_line %>%
filter(key_date_trade == 20200618)
This stores as a query, but fails when I use glimpse to look at the data, with the below error
"dbplyr_031"
WHERE ("key_date_trade" = 20200618.0)'
Why isn't this working, is there a better way for me to format the query to get this data?
Both fact_transaction_line and data in your example code are remote tables. One important consequence of this is that you are limited to interacting with them to certain dplyr commands. glimpse may not be a command that is supported for remote tables.
What you can do instead (including #Bruno's suggestions):
Use head to view the top few rows of your remote data.
If you are receiving errors, try show_query(data) to see the underlying SQL query for the remote table. Check that this query is correct.
Check the size of the remote table with remote_table%>% ungroup() %>% summarise(num = n()). If the remote table is small enough to fit into your local R memory, then local_table = collect(remote_table) will copy the table into R memory.
Combine options 1 & 3: local_table = data %>% head(100) %>% collect() will load the first 100 rows of your remote table into R. Then you can glimpse(local_table).
I would like to have a list of the queries sent to the database by my R script, but it is a bit unclear to me how/where are performed some operations involving local dataframes & database tables.
As mentionned in this post, it seems that when performing an operation between a data frame in local env and a table from a DBI connexion (e.g. a left_join(... ,copy=TRUE)), - copy=TRUE needed because data is coming from different datasources - the operations are performed on the database side, working with temporary tables.
I tried to verify this using the show_query() to see exactly what is sent to the database and what is not.
I cannot give a proper reproductible example as it involves a database connexion, but here is the logic :
con <- DBI::dbConnect(odbc::odbc(),
Driver = "SQL Server",
Server = "server",
Database = "database",
UID = "user",
PWD = "pwd",
Port = port)
db_table <- tbl(con, "tbl_A")
local_df <- read.csv("/.../file.csv",stringsAsFactors = FALSE)
q1 <- local_df %>% inner_join(db_table ,by=c('id'='id'),copy=TRUE)
Below are the outputs of the show_query() statements :
> db_table %>% show_query()
<SQL>
SELECT *
FROM "tbl_A"
q1 %>% show_query()
Error in UseMethod("show_query") :
no applicable method for 'show_query' applied to an object of class "data.frame"
This makes me think that in that sequence, the only operation performed on the database side is SELECT * FROM "tbl_A", and that q1 is performed on the local environment using the local_df and a local copy of the database table.
I tried to have a look at the dplyr documentation but there is no information for when data is coming from multiple sources.
So I am currently working with a connecting to an Access database. I am able to get connected to the Access DB which is located on my local system. This is actually connected to a SharePoint list. I would love to automate the process handling this SharePoint list with an R and Access combo! What I want to be able to do actually pretty basic, I want to introduce new data via a .csv which is processed for the relevant content and then compared to the current Access DB and finally the new information uploaded from r to Access.
I've learned that you need to pair the bit version of your Windows OS, Office version, and R version. So I am x64 on all of the above. This allowed me to connect to the Access DB. You also need the 'Microsoft Access Database Engine 2016 Redistributable' which is essentially the driver for the connection.
So what I have so far is:
library(odbc)
library(DBI)
file_path <- "C:/user/Documents/R Projects/...pathtofile.../filename.accdb"
accdb_con <- dbConnect(drv = odbc(), .connection_string = paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=",file_path,";"))
access.db <- dbReadTable(accdb_con, "sNPS Deep Dives")
That now connects!
I then read in a .csv of new information
new.df <- read.csv("C:/user/Documents/R projects/...pathtofile.csv", header=T, stringsAsFactors=FALSE, na.strings=c("","NA"))
an example of the data set might just look something like this:
date <- c("15/10/2018","15/10/2018", "16/10/2018", "12/11/2018", "07/09/2018")
score <- c("6", "10", "7", "10", "9")
group <- c("a","b", "b", "a", "b")
CaseID <- c("301", "302", "303", "304", "305")
new.df <- data.frame(date,score,group,CaseID)
new.df$date <- as.character(new.df$date)
new.df$score <- as.numeric(new.df$score)
new.df$group <- as.character(new.df$group)
new.df$CaseID <- as.numeric(new.df$CaseID)
Notably there are more columns in the Access DB that people will fill in by hand with further information.
and I process it to be ready go into the Access DB.
probably not that interesting...
Then I compare the the new data against the Access DB as such:
library(dplyr)
new <- anti_join(new.df, access.db, by= "Case.ID")
Now I've tried:
dbWriteTable(access.db.copy, new, append = TRUE)
dbAppendTable(access.db.copy, new)
I don't seem to be able to get this to go anywhere
I am getting an error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"ACCESS", "data.frame", "missing"’
I've seen plenty of posts in which people are having trouble connecting to an Access DB but I haven't seen anything about writing new data into that database.
I know this isn't quite a reproducible example but it seems like a difficult problem to recreate since it's a connection problem between different tools. I would be happy to provide example sets that might make this easier
I would appreciate any direction you all can provide.
Thanks!
Edit:
It appears that Bing Sun was right, I was missing an argument. So it appears that we need something more like:
dbWriteTable(access.db.copy, "Name of table",new, append = TRUE)
Which produces the error:
Error in result_insert_dataframe(rs#ptr, values) :
nanodbc/nanodbc.cpp:1944: HY104: [Microsoft][ODBC Microsoft Access Driver]Invalid precision value
I wonder if this may something that is an error from Access about a file type?
now if I use the append I don't get an error I get a 0 for output
dbAppendTable(access.db.copy, "Name of table", new, append= TRUE)
With output:
[1] 0
But I don't see any of the new values when I check the Access file.
I know it's years later, but hopefully this will help someone else with this issue since you're right CrayCrayTown, there aren't very many posts covering this issue.
I've run into this problem repeatedly when dealing with R and MS Access. The solution that I've come up with is pretty "hacky" but it accomplishes what's trying to be done...just not very eloquently.
The way I do this is with a combo of RODBC and DBI packages.
First, I open a connection to the DB with RODBC, and use that connection to write my data to the DB as an intermediary table:
chan <- RODBC::odbcDriverConnection(connection = "/path/to/database.accdb")
RODBC::sqlSave(channel = chan,
dat = df,
tablename = "tbl_intermediary",
rownames = FALSE,
append = FALSE)
RODBC::odbcClose(chan)
rm(chan)
Make sure to close the RODBC connection, I also destroy it for good measure, because why not? I use RODBC for the intermediary table because it supports batch insert statements. I know that the same thing can, in theory, be done with DBI with DBI::dbAppendTable()(but we wouldn't be on this post if that worked how we had hoped). I tried this in a previous SO question here, but it didn't solve my problem. I also don't know how big my intermediary tables could get in the future. Hopefully by the time they get too big we'll be in a different DBMS.
Next, I reopen the connection, this time with DBI, and send a statement to the DB to write those data from the intermediary table to the final resting place for those data, and then drop the intermediary table.
con <- DBI::dbConnect(odbc::odbc(), .connection_string = "/path/to/database.accdb")
DBI::dbSendStatement(
conn = con,
statement = 'UPDATE
tbl_intermediary INNER JOIN final_tbl ON tbl_intermediary.SampleID = final_tbl.sampleNumber
SET
final_tbl.field1 = [tbl_intermediary].[field1],
final_tbl.notes = IIf(Nz([tbl_intermediary].[Notes],"")="",[final_tbl].[notes],[final_tbl].[notes] & "; Newest Notes: " & [tbl_intermediary].[Notes]);'
)
DBI::dbSendStatement(
conn = con,
statement = 'DROP TABLE tbl_intermediary;'
DBI::dbDisconnect(con)
rm(con)
)
The main reason why I chose this method is because some of the SQL I use with Access also has some VBA in it. When I send the SQL-VBA hybrid string with RODBC, I get assorted errors in the IIF() and Nz() functions (see example above). From the RODBC CRAN docs the query argument for the sqlQuery() function is strictly assumed to be a valid SQL statement. So, RODBC has no clue how to interpret the IIf() and Nz() MS Access functions. I think this also has to do with how the ODBC driver handles communication as well (please, someone correct me if I'm wrong about this).
As I understand it, DBI::dbSendStatment() however lets the database engine you're working with interpret how to use the statement argument you provide. In the situation above, the VBA is executed exactly how I would expect if it were run in Access directly. As per the DBI docs, for interactive use you'll generally want to use dbGetQuery or dbExecute.
I am working with database tables with dbplyr
I have a local table and want to join it with a large (150m rows) table on the database
The database PRODUCTION is read only
# Set up the connection and point to the table
library(odbc); library(dbplyr)
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
t2690_username,";PWD=",t2690_password, sep="")
t2690 <- dbConnect(odbc::odbc(), .connection_string=my_conn_string)
order_line <- tbl(t2690, "order_line") #150m rows
I also have a local table, let's call it orders
# fill df with random data
orders <- data.frame(matrix(rexp(50), nrow = 100000, ncol = 5))
names(orders) <- c("customer_id", paste0(rep("variable_", 4), 1:4))
let's say I wanted to join these two tables, I get the following error:
complete_orders <- orders %>% left_join(order_line)
> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)
The issue is, if I were to set copy = TRUE, it would try to download the whole of order_line and my computer would quickly run out of memory
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
The only solution I have found is to upload into the writable database and use sql to join them, which is far from ideal
I have found the answer, you can use the in_schema() function within the tbl pointer to work across schemas within the same connection
# Connect without specifying a database
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
t2690_username,";PWD=",t2690_password, sep="")
# Upload the local table to the TEMP db then point to it
orders <- tbl(t2690, in_schema("TEMP", "orders"))
order_line <- tbl(t2690, in_schema("PRODUCTION", "order_line"))
complete_orders <- orders %>% left_join(order_line)
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
In your use case, it seems (based on the accepted answer) that your databases are on the same server and it's just a matter of using in_schema. If this were not the case, another approach would be that given here, which in effect gives a version of copy_to that works on a read-only connection.