I've connected to a SQL Server database with the code shown here. I then try to run a query to collect data filtered on a date, which is held as an integer in the table in YYYYMMDD format.
con <- DBI::dbConnect(odbc::odbc(), driver = "SQL Server", server = "***")
fact_transaction_line <- tbl(con, in_schema('***', '***'))
data <- fact_transaction_line %>%
  filter(key_date_trade == 20200618)
This is stored as a query, but it fails with the error below when I use glimpse to look at the data:
"dbplyr_031"
WHERE ("key_date_trade" = 20200618.0)'
Why isn't this working? Is there a better way for me to format the query to get this data?
Both fact_transaction_line and data in your example code are remote tables. One important consequence of this is that you can only interact with them through the dplyr commands that can be translated to SQL. glimpse may not be a command that is supported for remote tables.
What you can do instead (including @Bruno's suggestions):
(1) Use head to view the top few rows of your remote data.
(2) If you are receiving errors, try show_query(data) to see the underlying SQL query for the remote table. Check that this query is correct.
(3) Check the size of the remote table with remote_table %>% ungroup() %>% summarise(num = n()). If the remote table is small enough to fit into your local R memory, then local_table = collect(remote_table) will copy the table into R memory.
(4) Combine options 1 & 3: local_table = data %>% head(100) %>% collect() will load the first 100 rows of your remote table into R. Then you can glimpse(local_table).
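The options above can be sketched end to end as follows. This is a minimal sketch rather than the asker's setup: it assumes an in-memory SQLite database (via the RSQLite package) standing in for SQL Server, and the table contents and the amount column are made up. It also writes the date as an integer literal (20200618L). The 20200618.0 in the error message suggests the plain numeric literal is translated as a double, which may or may not be what the server objects to, so treat that part as something to try rather than a confirmed fix.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite stands in for the SQL Server connection
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "fact_transaction_line", data.frame(
  key_date_trade = c(20200617L, 20200618L, 20200618L),
  amount = c(10, 20, 30)  # hypothetical column
))

# The L suffix makes the literal an integer, so no ".0" appears in the SQL
remote <- tbl(con, "fact_transaction_line") %>%
  filter(key_date_trade == 20200618L)

# Option 2: inspect the SQL that will be sent
show_query(remote)

# Option 4: pull a small sample into R, then glimpse the local copy
local_table <- remote %>% head(100) %>% collect()
glimpse(local_table)

dbDisconnect(con)
```

Once the data is in local_table it is an ordinary tibble, so every dplyr verb is available on it.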
I have a .csv file that contains 105M rows and 30 columns that I would like to query for plotting in an R shiny app.
it contains alpha-numeric data that looks like:
# Example data
df <- data.frame(percent = as.numeric(rep(c("50", "80"), each = 5e2)),
                 maskProportion = as.numeric(rep(c("50", "80"), each = 5e2)),
                 dose = runif(1e3),
                 origin = as.factor(rep(c("ABC", "DEF"), each = 5e2)),
                 destination = as.factor(rep(c("XYZ", "GHI"), each = 5e2))
)
write.csv(df, "PassengerData.csv")
In the terminal, I have ingested it into an SQLite database as follows:
$ sqlite3 -csv PassengerData.sqlite3 '.import PassengerData.csv df'
which is from:
Creating an SQLite DB in R from an CSV file: why is the DB file 0KB and contains no tables?
So far so good.
The problem I have is speed in querying in R so I tried indexing the DB back in the terminal.
In sqlite3, I tried creating indexes on percent, maskProportion, origin and destination following this link https://data.library.virginia.edu/creating-a-sqlite-database-for-use-with-r/ :
$ sqlite3 create index "percent" on PassengerData("percent");
$ sqlite3 create index "origin" on PassengerData("origin");
$ sqlite3 create index "destination" on PassengerData("destination");
$ sqlite3 create index "maskProp" on PassengerData("maskProp");
I run out of disk space because my DB seems to grow in size every time I make an index. E.g. after running the first command the size is 20GB. How can I avoid this?
I assume the concern is that running collect() to transfer data from SQL to R is too slow for your app. It is not clear how / whether you are processing the data in SQL before passing to R.
Several things to consider:
Indexes are not copied from SQL to R. SQL works with data on disk, so knowing where to look up specific parts of your data results in time savings. R works on data in memory, so indexes are not required.
collect transfers data from a remote table (in this case SQLite) into R memory. If your goal is to transfer data into R, you could read the csv directly into R instead of writing it to SQL and then reading from SQL into R.
SQL is a better choice for doing data crunching / preparation of large datasets, and R is a better choice for detailed analysis and visualisation. But if both R and SQL are running on the same machine, then both are limited by the same CPU. This is not a concern if SQL and R are running on separate hardware.
Some things you can do to improve performance:
(1) Only read the data you need from SQL into R. Prepare the data in SQL first. For example, contrast the following:
# collect last
local_r_df = remote_sql_df %>%
  group_by(origin) %>%
  summarise(number = n()) %>%
  collect()
# collect first
local_r_df = remote_sql_df %>%
  collect() %>%
  group_by(origin) %>%
  summarise(number = n())
Both of these will produce the same output. However, in the first example, the summary takes place in SQL and only the final result is copied to R; while in the second example, the entire table is copied to R where it is then summarized. Collect last will likely have better performance than collect first because it transfers only a small amount of data between SQL and R.
(2) Preprocess the data for your app. If your app will only examine the data from a limited number of directions, then the data could be preprocessed / pre-summarized.
For example, suppose users can pick at most two dimensions and receive a cross-tab, then you could calculate all the two-way cross-tabs and save them. This is likely to be much smaller than the entire database. Then at runtime, your app loads the prepared summaries and shows the user any summary they request. This will likely be much faster.
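The pre-summarising idea could look something like the sketch below. It is only an illustration under assumptions: an in-memory SQLite table (via RSQLite) with a handful of made-up rows stands in for the real database, the dims vector and the crosstabs.rds file name are hypothetical, and the summary columns (a count and a mean dose) are just examples of what an app might need.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: a tiny in-memory SQLite table stands in for the 105M-row database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "df", data.frame(
  percent = rep(c(50, 80), each = 4),
  origin = rep(c("ABC", "DEF"), times = 4),
  destination = rep(c("XYZ", "GHI"), each = 2, times = 2),
  dose = seq(0.1, 0.8, by = 0.1)
))
remote_sql_df <- tbl(con, "df")

# Every pair of dimensions a user could choose (hypothetical list)
dims <- c("percent", "origin", "destination")
pairs <- combn(dims, 2, simplify = FALSE)

# Summarise in SQL and collect only the small result for each pair
crosstabs <- lapply(pairs, function(p) {
  remote_sql_df %>%
    group_by(!!!rlang::syms(p)) %>%
    summarise(number = n(), mean_dose = mean(dose, na.rm = TRUE), .groups = "drop") %>%
    collect()
})
names(crosstabs) <- vapply(pairs, paste, character(1), collapse = "_by_")

# Save once; the app can readRDS() this file at start-up instead of querying
out_file <- file.path(tempdir(), "crosstabs.rds")
saveRDS(crosstabs, out_file)

dbDisconnect(con)
```

At runtime the app would load the saved list and look up the requested pair by name, for example crosstabs[["origin_by_destination"]].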
I'm trying to analyze data stored in an SQL database (MS SQL server) in R, and on a mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to more quickly pipe the data to my mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
My answer uses a different package: RODBC, which is available on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time that came from my previous method of exporting each query result to .csv to load it into my R environment. I found regular ODBC to be much slower than RODBC.
I use the following functions:
sqlQuery() takes the connection as its first argument and the query itself as its second argument. Put the query itself in quote marks.
odbcConnect() opens the connection and is itself the first argument to sqlQuery(). The argument to odbcConnect() is the name of your connection to the SQL db. Put the connection name in quote marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
                   "SELECT *
                    FROM dbo.table
                    WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
                   "SELECT *
                    FROM dbo.table
                    WHERE Collection_ID = 2498") %>%
  mutate(matchid = paste0(schoolID, "-", studentID)) %>%
  distinct(matchid, .keep_all = TRUE)
odbcCloseAll()
I would suggest using the dbcooper package, found on GitHub: https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
First, add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
                       Driver = "",
                       Server = "",
                       Database = "",
                       UID = "",
                       PWD = "")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
                   con_id = "test",
                   tables = c("schema.table"))
This adds the function test_schema_table() to your environment, which is used to call the data. To collect into your environment use test_schema_table() %>% collect().
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
  DBI = DBI::dbFetch(DBI::dbSendQuery(conn, qry)),
  dbcooper = ava_qry() %>% collect(),
  times = 5
)
Here are the results of a microbenchmark I did to compare DBI with dbcooper.
Using sparklyr I access a table from Oracle via JDBC in the following manner:
tbl_sample_stuff <- spark_read_jdbc(
  sc = sc,
  name = "tbl_spark_some_table",
  options = list(
    url = "jdbc:oracle:thin:@//my.host.with.data:0000/my.host.service.details",
    driver = "oracle.jdbc.OracleDriver",
    user = "MY_USERNAME",
    password = "SOME_PASSWORD",
    # dbtable = "(SELECT * FROM TABLE WHERE FIELD > 10) ALIAS"),
    dbtable = "some_table"
  ),
  memory = FALSE
)
The sample_stuff table is accessible. For instance running glimpse(tbl_sample_stuff) produces the required results.
Questions
Let's say I want to derive a simple count per group using the code below:
dta_tbl_sample_stuff_category <- tbl_sample_stuff %>%
  count(category_variable) %>%
  collect()
As a consequence my Spark 1.6.3 delivers the following job:
What is actually going on there? Why is there one collect job running first for a long period of time (~7 mins)? My view is that the optimal approach would first run some SQL like SELECT COUNT(category_variable) FROM table GROUP BY category_variable on that data and then collect the results. It feels to me that this job is downloading the data first and then aggregating. Is that correct?
What's the optimal way of using a JDBC connection via sparklyr? In particular, I would like to know:
What's wise in terms of creating temporary tables? Should I always create temporary tables for data I may want to analyse frequently?
Other details
I'm adding Oracle driver via
configDef$sparklyr.jars.default <- ora_jar_drv
The rest is a typical connection to a Spark cluster managed on YARN, returned as the sc object to the R session.
I would like to have a list of the queries sent to the database by my R script, but it is a bit unclear to me how and where some operations involving local data frames and database tables are performed.
As mentioned in this post, it seems that when performing an operation between a data frame in the local environment and a table from a DBI connection (e.g. a left_join(..., copy = TRUE), where copy = TRUE is needed because the data comes from different data sources), the operations are performed on the database side, working with temporary tables.
I tried to verify this using show_query() to see exactly what is sent to the database and what is not.
I cannot give a proper reproducible example as it involves a database connection, but here is the logic:
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "SQL Server",
                      Server = "server",
                      Database = "database",
                      UID = "user",
                      PWD = "pwd",
                      Port = port)
db_table <- tbl(con, "tbl_A")
local_df <- read.csv("/.../file.csv", stringsAsFactors = FALSE)
q1 <- local_df %>% inner_join(db_table, by = c('id' = 'id'), copy = TRUE)
Below are the outputs of the show_query() statements :
> db_table %>% show_query()
<SQL>
SELECT *
FROM "tbl_A"
> q1 %>% show_query()
Error in UseMethod("show_query") :
no applicable method for 'show_query' applied to an object of class "data.frame"
This makes me think that in that sequence, the only operation performed on the database side is SELECT * FROM "tbl_A", and that q1 is performed on the local environment using the local_df and a local copy of the database table.
I tried to have a look at the dplyr documentation but there is no information for when data is coming from multiple sources.
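One way to check where the join runs is to compare the class of the result for both argument orders. The sketch below is an assumption-laden stand-in for the setup above: it uses an in-memory SQLite database (via RSQLite) instead of SQL Server, and the table contents are made up. As far as I understand the dplyr and dbplyr behaviour, the side that comes first decides where the work happens, but it is worth confirming on your own backend.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite stands in for the SQL Server connection
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "tbl_A", data.frame(id = 1:3, value = c("a", "b", "c")))

db_table <- tbl(con, "tbl_A")
local_df <- data.frame(id = c(2L, 3L, 4L), score = c(0.2, 0.3, 0.4))

# Local data frame first: the remote table is pulled into R, the join runs
# locally, and the result is an ordinary data frame (so show_query() fails)
q_local <- local_df %>% inner_join(db_table, by = "id", copy = TRUE)
class(q_local)

# Remote table first: local_df is copied to a temporary table in the database,
# the join stays lazy, and show_query() prints the SQL that would be sent
q_remote <- db_table %>% inner_join(local_df, by = "id", copy = TRUE)
show_query(q_remote)

dbDisconnect(con)
```

If the goal is a log of everything sent to the database, putting the remote table first keeps each step inspectable with show_query() until collect() is called.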
I am working with database tables with dbplyr.
I have a local table and want to join it with a large (150m rows) table on the database.
The database PRODUCTION is read only.
# Set up the connection and point to the table
library(DBI); library(odbc); library(dplyr); library(dbplyr)
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;DATABASE=PRODUCTION;UID=",
                        t2690_username, ";PWD=", t2690_password, sep = "")
t2690 <- dbConnect(odbc::odbc(), .connection_string = my_conn_string)
order_line <- tbl(t2690, "order_line") # 150m rows
I also have a local table, let's call it orders
# fill df with random data
orders <- data.frame(matrix(rexp(50), nrow = 100000, ncol = 5))
names(orders) <- c("customer_id", paste0(rep("variable_", 4), 1:4))
Let's say I want to join these two tables. I get the following error:
complete_orders <- orders %>% left_join(order_line)
> Error: `x` and `y` must share the same src, set `copy` = TRUE (may be slow)
The issue is, if I were to set copy = TRUE, it would try to download the whole of order_line and my computer would quickly run out of memory
Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error.
The only solution I have found is to upload into the writable database and use sql to join them, which is far from ideal
I have found the answer: you can use the in_schema() function within the tbl pointer to work across schemas within the same connection.
# Connect without specifying a database
my_conn_string <- paste("Driver={Teradata};DBCName=teradata2690;UID=",
                        t2690_username, ";PWD=", t2690_password, sep = "")
t2690 <- dbConnect(odbc::odbc(), .connection_string = my_conn_string)
# Upload the local table to the TEMP db then point to it
orders <- tbl(t2690, in_schema("TEMP", "orders"))
order_line <- tbl(t2690, in_schema("PRODUCTION", "order_line"))
complete_orders <- orders %>% left_join(order_line)
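For anyone without a Teradata server to hand, the same pattern can be tried out with SQLite. This is only a sketch under assumptions: two in-memory databases attached to one RSQLite connection play the roles of the read-only and writable databases, the schema names prod and scratch are made up, and the tables are created with plain SQL rather than uploaded from a data frame.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: one SQLite connection with two attached databases stands in
# for a single server hosting PRODUCTION (read only) and TEMP (writable)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "ATTACH DATABASE ':memory:' AS prod")
dbExecute(con, "ATTACH DATABASE ':memory:' AS scratch")

# The large table lives in one schema ...
dbExecute(con, "CREATE TABLE prod.order_line (customer_id INTEGER, item TEXT)")
dbExecute(con, "INSERT INTO prod.order_line VALUES (1, 'apple'), (1, 'pear'), (3, 'plum')")
# ... and the small uploaded table lives in the other
dbExecute(con, "CREATE TABLE scratch.orders (customer_id INTEGER, variable_1 REAL)")
dbExecute(con, "INSERT INTO scratch.orders VALUES (1, 0.5), (2, 1.5)")

# Both pointers share the same connection, so the join runs in the database
orders <- tbl(con, in_schema("scratch", "orders"))
order_line <- tbl(con, in_schema("prod", "order_line"))

complete_orders <- orders %>%
  left_join(order_line, by = "customer_id") %>%
  collect()

dbDisconnect(con)
```

Because both tables share one src, no copy = TRUE is needed and only the joined result is transferred to R.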
"Another option could be to upload the orders table to the database. The issue here is that the PRODUCTION database is read only - I would have to upload to a different database. Trying to copy across databases in dbplyr results in the same error."
In your use case, it seems (based on the accepted answer) that your databases are on the same server and it's just a matter of using in_schema. If this were not the case, another approach would be that given here, which in effect gives a version of copy_to that works on a read-only connection.