How can I receive the full results from any (general) SQL query in dplyr? Here is a toy example where the SQL query simply returns the full table.
library("plyr")
library("dplyr")
## connect to a database
hflights_sqlite <- tbl(hflights_sqlite(), "hflights")
my_con <- src_sqlite(hflights_sqlite$src$path)
## here is the problem
tbl(my_con, sql("SELECT * FROM hflights"))
## ...
## Warning message:
## Only first 500 results retrieved. Use n = -1 to retrieve all.
tbl(my_con, sql("SELECT * FROM hflights"), n=-1)
## ...
## Warning message:
## Only first 500 results retrieved. Use n = -1 to retrieve all.
(This is not a question about the particular simple SQL used here, of course)
Use collect(n=Inf) to force dplyr to fetch all data.
Here is an example:
results <- CONNECTION %>% tbl(sql(SQL_QUERY)) %>% collect(n=Inf)
where
CONNECTION in your case is from src_sqlite(hflights_sqlite$src$path)
and
SQL_QUERY in your case is "SELECT * FROM hflights"
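Applied to the question's own objects, that becomes (a minimal sketch, assuming the connection created above):
results <- my_con %>%
  tbl(sql("SELECT * FROM hflights")) %>%
  collect(n = Inf)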
It looks like there used to be some bugs in setting the limit on how many rows would be fetched, but they have been fixed: https://github.com/hadley/dplyr/issues/407
@Andreas: If I understand dplyr, it's always lazy for as long as possible. When you execute the tbl call above, or any tbl call, it fetches just enough data to show you that it worked... if you want the entire result set, you need to collect the results, per @hadley's comment, or in some other way force full evaluation, e.g.,
head(tbl(my_con, sql("SELECT * FROM hflights")), n=999999)
... n=-1 should work, but I haven't yet seen it work properly in my testing.
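As a quick illustration of the laziness (a small sketch of my own, not from the original answer), the lazy tbl does not even know its row count until it is collected:
lazy_tbl <- tbl(my_con, sql("SELECT * FROM hflights"))
nrow(lazy_tbl)                    ## NA -- nothing has been fetched yet
nrow(collect(lazy_tbl, n = Inf))  ## the real row count, after fetching everything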
I am using DBI/ROracle.
drv <- dbDriver("Oracle")
conn <- dbConnect(drv, ...)
I need to create a table from a select query in another table (i.e. a statement like create table <tabX> as select * from <tabY>).
There seem to be several functions that can perform this task, e.g.:
dbSendQuery(conn, "create table tab1 as select * from bigtable")
# Statement: create table tab1 as select * from bigtable
# Rows affected: 28196
# Row count: 0
# Select statement: FALSE
# Statement completed: TRUE
# OCI prefetch: FALSE
# Bulk read: 1000
# Bulk write: 1000
Or:
dbExecute(conn, "create table tab2 as select * from bigtable")
# [1] 28196
Or even:
tab3 <- dbGetQuery(conn, "select * from bigtable")
dbWriteTable(conn = conn, "TAB3", tab3)
# [1] TRUE
Each method seems to work, but I guess there are differences in performance/best practice. What is the best/most efficient way to run statements like create table <tabX> as select * from <tabY>?
I did not find any hint in the DBI and ROracle help pages.
Up front: use dbExecute for this; don't use dbSendQuery, since that function suggests the expectation of returned data (though it still works).
dbSendQuery should only be used when you expect data in return; most connections will do just fine even if you mis-use it, but that's the design of it. Instead, use dbSendStatement/dbClearResult or better yet just dbExecute.
The following are pairs of perfectly-equivalent pathways:
To retrieve data:
dat <- dbGetQuery(con, qry)
res <- dbSendQuery(con, qry); dat <- dbFetch(res); dbClearResult(res)
To send a statement (that does not return data, e.g. UPDATE or INSERT):
dbExecute(con, stmt)
res <- dbSendStatement(con, stmt); dbClearResult(res)
(sloppy) res <- dbSendQuery(con, stmt); dbClearResult(res) (I think some DBs complain about this method)
If you choose dbSend*, you should always call dbClearResult when done with the statement/fetch. (R will often clean up after you, but if something goes wrong here -- and I have hit this a few times over the last few years -- the connection locks up and you must recreate it. This can leave orphaned connections on the database as well.)
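One way to make that cleanup guaranteed (a sketch of my own, not part of the original answer; it assumes an existing DBI connection con and a query string qry) is to wrap the send/fetch/clear sequence in a helper that registers dbClearResult with on.exit, so it runs even if the fetch fails:
fetch_all <- function(con, qry) {
  res <- DBI::dbSendQuery(con, qry)
  on.exit(DBI::dbClearResult(res), add = TRUE)  # always runs, even if dbFetch errors
  DBI::dbFetch(res)
}
dat <- fetch_all(con, qry)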
I think most use-cases are a single-query-and-out, meaning dbGetQuery and dbExecute are the easiest to use. However, there are times when you may want to repeat a query. An example from ?dbSendQuery:
# Pass multiple sets of values with dbBind():
rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = ?")
dbBind(rs, list(6L))
dbFetch(rs)
dbBind(rs, list(8L))
dbFetch(rs)
dbClearResult(rs)
(I think it's a little hasty of that documentation to call dbFetch without capturing the data ... I would expect dat <- dbFetch(..); discarding the return value here seems counter-productive.)
One advantage to doing this multi-step (requiring dbClearResult) is with more complex queries: database servers in general tend to "compile" or optimize a query based on its execution engine. This is not always a very expensive step for the server to execute, and it can pay huge dividends on data retrieval. The server often caches this optimized query, and when it sees the same query it uses the already-optimized version of the query. This is one case where using parameter-binding can really help, as the query is identical in repeated use and therefore never needs to be re-optimized.
FYI, parameter-binding can be done repeatedly as shown above using dbBind, or it can be done using dbGetQuery using the params= argument. For instance, this equivalent set of expressions will return the same results as above:
qry <- "SELECT * FROM mtcars WHERE cyl = ?"
dat6 <- dbGetQuery(con, qry, params = list(6L))
dat8 <- dbGetQuery(con, qry, params = list(8L))
As for dbWriteTable, for me it's mostly a matter of convenience for quick work. There are times when the DBI/ODBC connection uses the wrong datatype on the server (e.g., SQL Server's DATETIME instead of DATETIMEOFFSET, or NVARCHAR(32) versus VARCHAR(MAX)), so if I need something quickly, I'll use dbWriteTable; otherwise I formally define the table with the server datatypes that I know I want, as in dbExecute(con, "create table quux (...)"). This is by no means a "best practice"; it is heavily rooted in preference and convenience. For data that is easy (float/integer/string) and where the server default datatypes are acceptable, dbWriteTable is perfectly fine. One can also use dbCreateTable (which creates the table without uploading data), which allows you to specify the fields with a bit more control.
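For example (an illustrative sketch only; the table name, column types, and the local data.frame quux_df are made up): define the schema with the server types you want, then append the data, instead of letting dbWriteTable guess.
## create an empty table with explicit server-side types
DBI::dbCreateTable(
  con, "quux",
  fields = c(id = "integer", label = "nvarchar(255)", seen = "datetimeoffset")
)
## then upload the local data.frame into it
DBI::dbAppendTable(con, "quux", quux_df)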
I'm pretty sure a similar query has been asked already. However my question is specific to when I connect an R session to my SQL Server database.
accept <- x %>%
  group_by(SiteID, MachineID, LocationID) %>%
  filter(DateTime >= "2019-01-1" & DateTime < "2019-12-31") %>%
  summarise(n = sum(TenSecCount)) %>%
  collect()
When I try to collect the data into a data frame, I get the following error --
Arithmetic overflow error converting expression to data type int.
'SELECT "SiteID", "MachineID", "LocationID", CAST("n" AS VARCHAR(MAX)) AS "n"
FROM (SELECT TOP 100 PERCENT "SiteID", "MachineID", "LocationID", (SUM("TenSecCount")) AS "n"
FROM (SELECT TOP 100 PERCENT *
FROM (SELECT TOP 100 PERCENT "MachineID", "OutletID", "TenSecCount"
Any workarounds for this?
This error arises when you collect the data into R, because it is only at collection that your remote SQL table is evaluated.
dbplyr defines your remote table by the SQL query that would return your results. Until you ask for results to be returned, your remote table definition is not much different from an SQL script waiting to be run.
When you request results from that table, in your case using collect, the SQL code is executed on the server and the results are returned to R. This means you can have an invalid remote table definition and not know it until you execute it. E.g.:
remote_table <- server_df %>%
  group_by(SiteID, MachineID, LocationID) %>%
  filter(DateTime >= "2019-01-1" & DateTime < "2019-12-31") %>%
  summarise(n = sum(TenSecCount))
# no error because all we have done is define an sql query and store it in remote_table
# review underlying sql query
show_query(remote_table)
# if you copy & paste this query and try to run it directly on the server it will error
# attempt to collect data
local_table <- remote_table %>% collect()
# error occurs on evaluation
You can tell the error arose when the SQL was evaluated because R returns the SQL code that caused the error along with the SQL error message. "Arithmetic overflow error converting expression to data type int." is an SQL error, not an R error. See this question for ways to solve it.
Hint, probably something like:
remote_table <- server_df %>%
  mutate(TenSecCount = sql("CAST(TenSecCount AS BIGINT)")) %>% # additional step changing the data type server-side
  group_by(SiteID, MachineID, LocationID) %>%
  filter(DateTime >= "2019-01-1" & DateTime < "2019-12-31") %>%
  summarise(n = sum(TenSecCount))
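Before collecting again, you can confirm that the cast now happens server-side (a short sketch building on the hint above, using the corrected remote_table definition):
show_query(remote_table)  ## the generated SQL should now contain the CAST
local_table <- remote_table %>% collect()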
Need a little help with the following R code. I've got quite a lot of data to load from a Microsoft SQL database, and I tried a few things to make the SQL queries manageable.
1) Stored each query as an object with a unique prefix
2) Used ls() and str_detect() to return a vector of the object names with that prefix
3) Used a for loop to loop through the vector and load the data <- this part didn't work.
library(odbc)
library(tidyverse)
library(stringr)
#setting up DB connection, odbc pkg
db <- DBI::dbConnect(odbc::odbc(), Driver = 'SQL Server', Server = 'Server_name', Database = 'Database name', UID = 'User ID', trusted_connection = 'yes')
#defining the sql query
Sql_query1 <- "select * from db1"
Sql_query2 <- "select top 100 * from db2"
#the following is to store the sql query object name in a vector by searching for object names with prefix sql_
Sql_list <- ls()[str_detect(ls(), regex("sql_", ignore_case = TRUE))]
#This is the part where the code didn't work
for (i in Sql_list) { i <- dbGetQuery(db, i) }
The error I've got is "Error: 'Sql_query1' nanodb.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Could not find stored procedure 'Sql_query1'"
However, if I don't use the loop, no error occurs! This might be feasible if I only had 2-3 queries to manage... unfortunately I have 20 of them!
dbGetQuery(db, Sql_query1)
Can anyone help? Thank you!
@Rohit's solution written down:
The first part, from your side, is fine:
#setting up DB connection, odbc pkg
db <- DBI::dbConnect(odbc::odbc(), Driver = 'SQL Server', Server = 'Server_name', Database = 'Database name', UID = 'User ID', trusted_connection = 'yes')
But then it would be more convenient to do something like this:
A more verbose version:
sqlqry_lst <- vector(mode = 'list', length = 2) # create a list to hold the queries; in real life use length = 20
names(sqlqry_lst) <- paste0('Sql_query', 1:2) # assign names to your list; again, just use 1:20 in your real-life example
#put the SQL code into the list elements
sqlqry_lst['Sql_query1'] <- "select * from db1"
sqlqry_lst['Sql_query2'] <- "select top 100 * from db2"
#if you really want to use for loops
res <- vector(mode = 'list', length = length(sqlqry_lst)) # result list
for (i in seq_along(sqlqry_lst)) res[[i]] <- dbGetQuery(db, sqlqry_lst[[i]])
Or, as a two-liner, a bit more R-stylish and IMHO more elegant:
sqlqry_lst <- list(Sql_query1="select * from db1", Sql_query2="select top 100 * from db2")
res <- lapply(sqlqry_lst, FUN = dbGetQuery, conn=db)
I suggest you mix and match the two: the verbose version, e.g., for creating (or, more precisely, naming) the query list, and the short version for running the queries against the database, as it suits you best.
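If the 20 queries already exist as objects named Sql_query1 ... Sql_query20, as in the question, you can also build the named list directly from the environment (a sketch; it assumes the query objects live in your global environment):
sqlqry_lst <- mget(ls(pattern = "^Sql_query"))   ## named list of the query strings
res <- lapply(sqlqry_lst, dbGetQuery, conn = db) ## one data frame per query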
I am wondering if R does not support using sqldf to delete rows from a data table. I am trying to delete rows from a data table using a delete statement. There is no underlying database, just a data.table. But when I enter the following SQL statement:
loans_good <- sqldf("Delete from LoansDT1 where status not in ('Current','Default')")
I get the following error message:
'SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().'
Since I get the same message for update, I am wondering if this is a limitation.
This question is a FAQ. See FAQ 8 on the sqldf github home page.
The operation did work. The message is a warning message, not an error message. The message is misleading and you can ignore it. Note that the question did not show the complete message -- the complete message does state that it is a warning message.
The warning message comes from RSQLite, not from sqldf itself. It is caused by a non-backwards-compatible change that was introduced into RSQLite at some point; however, as stated, the actual operation works anyway.
Also delete and update act on tables in the database. They do not return values so even if they work you won't see any result. If you want a result you have to use a select statement after the delete or update to retrieve the modified table.
Here is an example using the built-in 6 row BOD data.frame. It deletes the last row as that row has a Time greater than 5.
library(sqldf)
sqldf(c("delete from BOD where Time > 5", "select * from BOD"))
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## Warning message:
## In result_fetch(res@ptr, n = n) :
## SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().
Note that this is listed in the sqldf issues where a workaround for the message is provided: https://github.com/ggrothendieck/sqldf/issues/40
You need to use dbExecute() to perform delete, update or insert queries.
conn <- dbConnect("Put your connection to your database here")
dbExecute(
conn,
"Delete from LoansDT1 where status not in ('Current','Default')"
)
dbReadTable(conn, "LoansDT1") # Check
I've written a very quick BLAST script in R to interface with the NCBI BLAST API. Sometimes, however, the result URL takes a while to load and my script throws an error until the URL is ready. Is there an elegant way (i.e. a tryCatch option) to handle the error until the result is returned, or to time out after a specified time?
library(rvest)
## Definitive set of blast API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html
## Generate query URL
query_url <-
function(QUERY,
PROGRAM = "blastp",
DATABASE = "nr",
...) {
put_url_stem <-
'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put'
arguments = list(...)
paste0(
put_url_stem,
"&QUERY=",
QUERY,
"&PROGRAM=",
PROGRAM,
"&DATABASE=",
DATABASE,
arguments
)
}
blast_url <- query_url(QUERY = "NP_001117.2") ## test query
blast_session <- html_session(blast_url) ## create session
blast_form <- html_form(blast_session)[[1]] ## pull form from session
RID <- blast_form$fields$RID$value ## extract RID identifier
get_url <- function(RID, ...) {
get_url_stem <-
"https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get"
arguments = list(...)
paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments)
}
hits_xml <- read_xml(get_url(RID)) ## this is the sticky part
Sometimes it takes several minutes for the get_url to go live, so what I would like to do is keep trying, say every 20-30 seconds, until it either produces the URL or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful.
Regarding the 'keep trying until timeout' part, I imagine you can build on top of this other answer about a tryCatch loop on error.
Hope it helps.
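To make that concrete, here is a rough sketch of the retry-with-timeout idea (my own illustration, not code from the linked answers; it reuses the get_url() helper and RID from the question): poll the results URL with tryCatch, sleep between attempts, and give up after a deadline.
read_xml_retry <- function(url, wait = 30, timeout = 15 * 60) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- tryCatch(xml2::read_xml(url), error = function(e) NULL)
    if (!is.null(result)) return(result)  ## success: return the parsed XML
    if (Sys.time() > deadline) stop("Timed out waiting for: ", url)
    Sys.sleep(wait)                       ## wait before retrying
  }
}
hits_xml <- read_xml_retry(get_url(RID))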