Joining across databases with dbplyr - r

I am working with database tables via dbplyr.
I have a local table and want to join it with a large (150 million row) table on the database.
The database, PRODUCTION, is read only.
# Set up the connection and point to the table
library(odbc); library(dbplyr); library(dplyr)
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
t2690_username,";PWD=",t2690_password, sep="")
t2690 <- dbConnect(odbc::odbc(), .connection_string=my_conn_string)
order_line <- tbl(t2690, "order_line") #150m rows
I also have a local table; let's call it orders:
# fill df with random data
orders <- data.frame(matrix(rexp(100000 * 5), nrow = 100000, ncol = 5))
names(orders) <- c("customer_id", paste0(rep("variable_", 4), 1:4))
If I try to join these two tables, I get the following error:
complete_orders <- orders %>% left_join(order_line)
> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)
The issue is that if I set copy = TRUE, it would try to download the whole of order_line, and my computer would quickly run out of memory.
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
The only solution I have found is to upload into the writable database and use SQL to join them there, which is far from ideal.

I have found the answer: you can use the in_schema() function within the tbl() pointer to work across databases/schemas within the same connection.
# Connect without specifying a default database
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
                        t2690_username, ";PWD=", t2690_password, sep = "")
t2690 <- dbConnect(odbc::odbc(), .connection_string = my_conn_string)
# Upload the local table to the TEMP db then point to it
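# (Assumption, not part of the original answer: a copy_to() call along these lines
# performs the upload; exact support for schema-qualified names depends on your
# dbplyr version, and it requires write access to the TEMP database.)
copy_to(t2690, orders, in_schema("TEMP", "orders"), temporary = FALSE)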
orders <- tbl(t2690, in_schema("TEMP", "orders"))
order_line <- tbl(t2690, in_schema("PRODUCTION", "order_line"))
complete_orders <- orders %>% left_join(order_line)

In your use case, it seems (based on the accepted answer) that your databases are on the same server and it's just a matter of using in_schema. If this were not the case, another approach would be that given here, which in effect gives a version of copy_to that works on a read-only connection.
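As an aside, newer dbplyr releases (2.2.0 and later) also provide copy_inline(), which embeds a small local table directly in the generated SQL and therefore works on read-only connections. A minimal sketch, assuming your dbplyr version has it and the local table is small enough to inline:
orders_inline <- dbplyr::copy_inline(t2690, orders)  # nothing is written to the database
complete_orders <- orders_inline %>% left_join(order_line)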

Related

R and PostgreSQL - pre-specify possible column names and types

I have multiple large similar data files stored in .csv format. These are data files released annually. Most of these have the same variables but in some years they have added variables or changed the names of variables.
I am looping through my directory of files (~30 .csv files), converting them to data frames, and importing them to a Google Cloud SQL PostgreSQL 12 database via:
DBI::dbAppendTable(con, tablename, df)
where con is my connection to the database, tablename is the table name, and df is the data frame produced from a .csv.
The problem is that each of these .csv files has a different number of columns, and some won't have columns that others have.
Is there an easy way to pre-define a structure in the PostgreSQL 12 database that specifies "any of these .csv columns will all go into this one database column" and also "any columns not included in the .csv should be filled with NA in the database"? I think I could come up with something in R to make all the data frames look similar prior to uploading to the database, but it seems cumbersome. I am imagining a document, like a JSON file, that the SQL database compares against, something like the example below:
SQL | Data frame
----------------------------------
age = "age","Age","AGE"
sex = "Sex","sex","Gender","gender"
...
fnstatus = "funcstatus","FNstatus"
This would specify to the database all the possible columns it might see and how to parse those. And for columns it doesn't see in a given .csv, it would fill all records with NA.
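For reference, the "make the data frames look similar in R first" idea mentioned in the question can be driven by a small lookup of name synonyms. This is only a sketch; the synonym vector and helper function are made up for illustration, using the names from the example mapping above:
# named vector: column name as it appears in a CSV -> canonical database column
name_map <- c(age = "age", Age = "age", AGE = "age",
              Sex = "sex", sex = "sex", Gender = "sex", gender = "sex",
              funcstatus = "fnstatus", FNstatus = "fnstatus")
standardise_df <- function(df, canonical = unique(name_map)) {
  hits <- names(df) %in% names(name_map)
  names(df)[hits] <- name_map[names(df)[hits]]
  # add any canonical columns the CSV lacks, filled with NA
  for (col in setdiff(canonical, names(df))) df[[col]] <- NA
  df[canonical]
}
# df_std <- standardise_df(df)  # then DBI::dbAppendTable(con, tablename, df_std)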
While I cannot say whether such a feature is available (Postgres has many novel methods and extended data types), I would be hesitant to rely on it, as maintainability can be a challenge.
Enterprise, server-based, relational databases like PostgreSQL should be planned infrastructure systems. As r2evans comments, tables (including schemas, columns, users, etc.) should be defined up front. Designers need to think out the entire set of uses and needs before any data migration or interaction. Dynamically adjusting database tables and columns to one-off application needs is usually not recommended. So clients like R should dynamically align data to meet the planned, relational database specifications.
One approach is to use a temporary table as a staging area for all raw CSV data, with every column typed as VARCHAR. Populate this staging table with the raw data, then migrate it into the final table with a single append query, using COALESCE and :: casts to map and convert columns to their final types.
# BUILD LIST OF DFs FROM ALL CSVs
df_list <- lapply(list_of_csvs, read.csv)
# NORMALIZE ALL COLUMN NAMES TO LOWER CASE
df_list <- lapply(df_list, function(df) setNames(df, tolower(names(df))))
# RETURN VECTOR OF UNIQUE COLUMN NAMES ACROSS ALL DFs
all_names <- unique(unlist(lapply(df_list, names)))
# CREATE STAGING TABLE (ALL COLUMNS AS VARCHAR)
dbExecute(con, "DROP TABLE IF EXISTS mytemptable")
sql <- paste("CREATE TABLE mytemptable (",
             paste(all_names, collapse = " VARCHAR(100), "),
             "VARCHAR(100)",
             ")")
dbExecute(con, sql)
# APPEND DATA FRAMES TO STAGING TABLE
lapply(df_list, function(df) DBI::dbAppendTable(con, "mytemptable", df))
# RUN FINAL CLEANED APPEND QUERY
sql <- "INSERT INTO myFinalTable (age, sex, fnstatus, ...)
SELECT COALESCE(age)::int
, COALESCE(sex, gender)::varchar(5)
, COALESCE(funcstatus, fnstatus)::varchar(10)
...
        FROM mytemptable"
dbExecute(con, sql)

R dbplyr SQL date filter issue

I've connected to a SQL Server database with the code shown here, and then I try to run a query to collect data filtered on a date, which is held as an integer in the table in YYYYMMDD format:
con <- DBI::dbConnect(odbc::odbc(), driver = "SQL Server", server = "***")
fact_transaction_line <- tbl(con,in_schema('***', '***'))
data <- fact_transaction_line %>%
  filter(key_date_trade == 20200618)
This stores as a query, but it fails with the error below when I use glimpse to look at the data:
"dbplyr_031"
WHERE ("key_date_trade" = 20200618.0)'
Why isn't this working? Is there a better way for me to format the query to get this data?
Both fact_transaction_line and data in your example code are remote tables. One important consequence of this is that you are limited to interacting with them via certain dplyr commands. glimpse may not be a command that is supported for remote tables.
What you can do instead (including @Bruno's suggestions):
Use head to view the top few rows of your remote data.
If you are receiving errors, try show_query(data) to see the underlying SQL query for the remote table. Check that this query is correct.
Check the size of the remote table with remote_table %>% ungroup() %>% summarise(num = n()). If the remote table is small enough to fit into your local R memory, then local_table <- collect(remote_table) will copy the table into R memory.
Combine options 1 & 3: local_table <- data %>% head(100) %>% collect() will load the first 100 rows of your remote table into R. Then you can glimpse(local_table). These steps are sketched below.
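Putting those suggestions together (reusing the names from the question; this is just a sketch):
data %>% show_query()                            # check the SQL that dbplyr generates
data %>% head(10)                                # preview a few rows without collecting everything
local_table <- data %>% head(100) %>% collect()  # bring a small sample into local memory
glimpse(local_table)                             # glimpse works on the local copy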

Analyze big data in R on EC2 server

I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
Now this table has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059 MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with the DBI package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) OK/good practice?
2) If yes to 1), is increasing the EC2 server memory the only way to be able to perform the unique() operation or other similar operations?
3) If yes to 2), how can I know by how much I should increase the EC2 server memory?
Thanks!
dbReadTable converts the entire table to a data.frame, which is not what you want to do with such a big table.
As @cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind, dbFetch, or dbGetQuery.
For example, you could define a function to get the required data:
filterBySQLString <- function(databaseDB, sqlString) {
  sqlString <- as.character(sqlString)
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  dbClearResult(dbResponse)
  return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT DISTINCT ... FROM my_big_table"
my_big_df <- filterBySQLString(myDB, SQLquery)
my_big_df <- unique(my_big_df)
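For a one-off query, dbGetQuery() wraps the send/fetch/clear steps above into a single call (the column names here are placeholders):
my_big_df <- dbGetQuery(myDB, "SELECT DISTINCT col_a, col_b FROM my_big_table")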
If you cannot use SQL, then you have two options:
1) Stop using RStudio and try to run your code from the terminal or via Rscript.
2) Beef up your instance.

Understanding Spark 1.6.3 <-> JDBC (Oracle) connection in sparklyr

Using sparklyr I access a table from Oracle via JDBC in the following manner:
tbl_sample_stuff <- spark_read_jdbc(
  sc = sc,
  name = "tbl_spark_some_table",
  options = list(
    url = "jdbc:oracle:thin:@//my.host.with.data:0000/my.host.service.details",
    driver = "oracle.jdbc.OracleDriver",
    user = "MY_USERNAME",
    password = "SOME_PASSWORD",
    # dbtable = "(SELECT * FROM TABLE WHERE FIELD > 10) ALIAS"),
    dbtable = "some_table"
  ),
  memory = FALSE
)
The sample_stuff table is accessible. For instance, running glimpse(tbl_sample_stuff) produces the required results.
Questions
Let's say I want to derive a simple count per group using the code below:
dta_tbl_sample_stuff_category <- tbl_sample_stuff %>%
  count(category_variable) %>%
  collect()
As a consequence, my Spark 1.6.3 cluster runs the following job:
What is actually going on there? Why is there a collect job running first for a long period of time (~7 minutes)? My view would be that the optimal approach would be to first run some SQL like SELECT COUNT(category_variable) FROM table GROUP BY category_variable on that data and then collect the results. It feels to me that this job is downloading the data first and then aggregating; is that correct?
What's the optimal way of using a JDBC connection via sparklyr? In particular, I would like to know:
What's wise in terms of creating temporary tables? Should I always create temporary tables for data I may want to analyse frequently?
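One way to check the download-then-aggregate hypothesis from the first question is to push the aggregation into the dbtable subquery, so Oracle performs the GROUP BY and only the aggregated rows cross the JDBC link. This is only a sketch reusing the connection options above, not a verified answer:
tbl_category_counts <- spark_read_jdbc(
  sc = sc,
  name = "tbl_spark_category_counts",
  options = list(
    url = "jdbc:oracle:thin:@//my.host.with.data:0000/my.host.service.details",
    driver = "oracle.jdbc.OracleDriver",
    user = "MY_USERNAME",
    password = "SOME_PASSWORD",
    # the subquery is evaluated by Oracle, so only one row per category is transferred
    dbtable = "(SELECT category_variable, COUNT(*) AS n FROM some_table GROUP BY category_variable) agg"
  ),
  memory = FALSE
)
dta_tbl_sample_stuff_category <- collect(tbl_category_counts)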
Other details
I'm adding the Oracle driver via:
configDef$sparklyr.jars.default <- ora_jar_drv
The rest is a typical connection to a Spark cluster managed on YARN, returned as the sc object to the R session.

Get all rows that DBI::dbWriteTable has just written

I want to use dbWriteTable() of R's DBI package to write data into a database. Usually, the respective tables are already present, so I use the argument append = TRUE. How do I get which rows were added to the table by dbWriteTable()? Most of the tables have certain columns with UNIQUE values, so a SELECT will work (see below for a simple example). However, this is not true for all of them, or only several columns together are UNIQUE, which makes the SELECT more complicated. In addition, I would like to put the writing and querying into a function, so I would prefer a consistent approach for all cases.
I mainly need this to get the PRIMARY KEYs added by the database and to allow a user to quickly see what was added. If it matters, my database is PostgreSQL and I would like to use the odbc package for the connection.
I have something like this in mind; however, I am looking for a more general solution:
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "database")
dbWriteTable(con,
             name = "site",
             value = data.frame(name = c("abcd", "efgh")),
             append = TRUE)
dbGetQuery(con,
           "SELECT * FROM site WHERE name IN ('abcd', 'efgh');")
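A minimal sketch of the write-then-query wrapper the question has in mind (the function name and arguments are made up for illustration, and it still assumes the caller can name a column that identifies the new rows):
write_and_fetch <- function(con, table, value, key_col) {
  dbWriteTable(con, name = table, value = value, append = TRUE)
  keys <- paste(dbQuoteLiteral(con, value[[key_col]]), collapse = ", ")
  sql <- paste0("SELECT * FROM ", dbQuoteIdentifier(con, table),
                " WHERE ", dbQuoteIdentifier(con, key_col), " IN (", keys, ")")
  dbGetQuery(con, sql)
}
new_rows <- write_and_fetch(con, "site", data.frame(name = c("abcd", "efgh")), "name")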

Resources