I would like to understand the difference between dplyr joins and SQL joins.
I have an open connection to an Oracle database in R:
con <- dbConnect(odbc::odbc(), …)
The first query:
dbGetQuery(con, "select *
                 from result join test on result.test_1 = test.test_1
                 join sample on test.sample = sample.id_2") %>%
  setNames(make.names(names(.), unique = TRUE)) %>%
  as_tibble()
gives a tibble with 9541 rows (what I want!).
The second query:
tbl(con, "result") %>%
  inner_join(tbl(con, "sample"), by = c("test_1" = "id_2")) %>%
  collect()
gives a tibble with 2688 rows
test_1 and id_2 are character fields with spaces in them and numbers at the end, for example " 3333".
Thanks
In the SQL query I see 3 tables; in the R code I see only 2: result and sample. That is the probable cause of the difference: the dplyr version skips the test table and therefore joins on a different condition.
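For reference, here is a sketch of a dplyr pipeline that mirrors the SQL by joining through the test table as well (this assumes the join keys are the only columns the tables share; otherwise dplyr will suffix the duplicated names):
tbl(con, "result") %>%
  inner_join(tbl(con, "test"), by = "test_1") %>%
  inner_join(tbl(con, "sample"), by = c("sample" = "id_2")) %>%
  collect()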
Related
Here is the data frame that I have:
trail_df <- data.frame(d  = seq.Date(as.Date("2020-01-01"), as.Date("2020-02-01"), by = 1),
                       AA = NA,
                       BB = NA,
                       CC = NA)
Now I want to loop over the columns of trail_df and, for each column name, fetch the corresponding data from the Oracle database for the given dates, which I am doing like this:
for (i in 2:ncol(trail_df)) {
  c_name <- colnames(trail_df)[i]
  query  <- paste0("SELECT * FROM tablename WHERE ID = '", c_name, "'") # this query returns Date and price
  result <- dbGetQuery(con, query)                                      # con is the connection to the db
  for (k in seq_len(nrow(result))) {
    # match the date in trail_df and paste the value into the respective column
    trail_df[which(as.Date(result[k, 1]) == as.Date(trail_df[, 1])), i] <- result[k, 2]
  }
}
This is a snippet of the code; the date filtering and so on is taken care of in the real code.
The problem is that I have more than 6000 columns and 500 rows for which I have to match the dates (because the dates are random) and fill in the price, and this is currently taking forever.
I am new to R and would appreciate any help to speed this code up, perhaps with multiprocessing if that is possible in R.
There are two steps to this answer:
Use parameterized queries to get the raw data; and
Get this data into the "wide" format you desire.
Parameterized query
My (first) suggestion is to use parameterized queries, which is safer. It may not improve the speed relative to @RonakShah's answer (using sprintf), at least not the first time.
However, it might help a touch if the query is repeated: DBMSes tend to parse/optimize queries and cache this optimization. When a query changes even a little, this caching cannot happen, and the query is re-optimized. In this case, this cache-invalidation is unnecessary, and can be avoided if we use binding parameters.
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s)",
                 paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
# [1] "SELECT * FROM tablename WHERE ID IN (?,?,?)"
res <- dbGetQuery(con, query, params = as.list(names(trail_df)[-1]))
Some thoughts:
if the database has many more dates than you need, you can restrict the data returned by also constraining the date range in the query. This will work well if your trail_df dates are close together:
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s) AND Date BETWEEN ? AND ?",
                 paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
res <- dbGetQuery(con, query,
                  params = c(as.list(names(trail_df)[-1]), as.list(range(trail_df$d))))
if your dates are more variable and you end up querying many more rows than you actually need, I suggest uploading your trail_df dates into a temporary table (a sketch of the upload follows the query below) and running something like:
"select tb.Date, tb.ID, tb.Price
from mytemptable tmp
left join tablename tb on tmp.d = tb.Date
where ..."
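A minimal sketch of that upload via DBI, using the hypothetical table/column names above and adding any further WHERE conditions as needed; note that temporary-table behaviour differs between DBMSes and drivers (Oracle in particular uses global temporary tables), so adjust accordingly:
# upload the dates to the database as a (temporary) table
dbWriteTable(con, "mytemptable", trail_df["d"], temporary = TRUE, overwrite = TRUE)
res <- dbGetQuery(con, "
  select tb.Date, tb.ID, tb.Price
  from mytemptable tmp
  left join tablename tb on tmp.d = tb.Date")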
Reshape
It appears as if your database table may be more "long" shaped and you want it "wide" in your frame. There are many ways to reshape from long-to-wide (examples), but these should work:
reshape2::dcast(res, Date ~ ID, value.var = "Price") # 'Price' is the 'value' column, unknown here
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
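For illustration, a toy long-to-wide example with made-up values (the column names are assumed to match the query result above):
long <- data.frame(Date  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
                   ID    = c("AA", "BB", "AA"),
                   Price = c(1.1, 2.2, 1.3))
tidyr::pivot_wider(long, id_cols = "Date", names_from = "ID", values_from = "Price")
# one row per Date, with columns AA and BB holding the prices (NA where missing)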
In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(), Driver = "path-to-driver",
                      S3OutputLocation = "location",
                      AwsRegion = "eu-west-1", AuthenticationType = "IAM Profile",
                      AWSProfile = "profile", Schema = "prod")
tbl(con,
# Run SQL query
sql('SELECT *
FROM TABLE')) %>%
# Without having collected the data, I could further wrangle the data inside the database
# using dplyr code
select(var1, var2) %>%
mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error:
con <- DBI::dbConnect(bigrquery::bigquery(),
project = "project")
tbl(con,
sql(
'SELECT *
FROM TABLE'
))
Error: dataset is not a string (a length one character vector).
Any idea whether what I'm trying to do is simply not possible with BigQuery?
Not a BigQuery user, so can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead it appears BigQuery does not support receiving a tbl with an sql string as the second argument.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigrquery::bigquery(). The problem could be caused by the bigrquery package.
The second approach without the conversion to sql: tbl(con, query_string)
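If the sql() wrapper does turn out to be the problem, here is a sketch of a pipeline that stays lazy by referencing the table by name and doing the wrangling with dplyr verbs (db_name.table_name and the variables are the hypothetical names used above):
tbl(con, "db_name.table_name") %>%
  select(var1, var2) %>%
  mutate(var3 = var1 + var2)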
I have a table in a SQL Server database, and I want to manipulate it with the dplyr/dbplyr packages in R.
library(odbc)
library(DBI)
library(dbplyr)   # for in_schema()
library(tidyverse)
con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "SQL Server",
                      Server   = "xx.xxx.xxx.xxx",
                      Database = "stock",
                      UID      = "userid",
                      PWD      = "userpassword")
startday = 20150101
day = tbl(con, in_schema("dbo", "LogDay"))
I tried this simple dplyr pipeline after connecting to the remote database, but it fails with error messages.
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1), lead(priceOpen, 1) / priceClose, NA)) %>%
  select(logDate, stockCode, ovnprofit)
How can I solve this problem?
P.S. When I apply the dplyr functions after first collecting 'day' into a tibble, it works. However, I want to apply the dplyr functions directly, without converting to a tibble, because that is too time-consuming and memory-intensive.
The problem is most likely with the lead function. In R a data set has an ordering, but in SQL datasets are unordered and the order needs to be specified explicitly.
Note that the SQL code in the error message contains:
LEAD("stockCode", 1.0, NULL) OVER ()
That there is nothing in the brackets after the OVER suggests to me that SQL expects something here.
Two ways you can resolve this:
By using arrange before the mutate
By specifying the order_by argument of lead
# approach 1:
day %>%
  arrange(logDate) %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1),
                            lead(priceOpen, 1) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
# approach 2: note that order_by takes the unquoted column name
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1, order_by = logDate),
                            lead(priceOpen, 1, order_by = logDate) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
However, it also looks like you only want to lead within each stockCode. This can be done with group_by. I would recommend the following:
output = day %>%
  group_by(stockCode) %>%
  arrange(logDate) %>%
  mutate(next_priceOpen = lead(priceOpen, 1)) %>%
  mutate(ovnprofit = next_priceOpen / priceClose) %>%
  select(logDate, stockCode, ovnprofit)
If you view the generated SQL with show_query(output) you should see the SQL OVER clause similar to the following:
LEAD(priceOpen, 1.0, NULL) OVER (PARTITION BY stockCode ORDER BY logDate)
In dplyr, if tbl is a table in a database then head(tbl) gets translated into
select *
from tbl
limit 6
but there doesn't seem to be a way to use the offset keyword to read data in chunks. E.g. the equivalent of
select *
from tbl
limit 6 offset 5
doesn't seem possible with dplyr. In dbplyr, there is a do function to let you choose a chunk_size to bring back data chunk-by-chunk.
Is that the only way to do it in R? The solution doesn't have to in dplyr or the tidyverse.
Another approach would be to construct your own offset function. This assumes your database supports it, and the function is unlikely to be transferable to databases of other types.
Something like the following:
offset_head = function(table, num, offset){
# get connection
db_connection = table$src$con
sql_query = build_sql(con = db_connection,
sql_render(table),
"\nLIMIT ", num,
"\nOFFSET ", offset
)
return(tbl(db_connection, sql(sql_query)))
}
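Usage would then look something like this, assuming remote_tbl is a lazy table created with tbl():
remote_tbl <- tbl(con, "table_name")
offset_head(remote_tbl, 6, 5)   # rows 6-11 of the remote table; still lazy until collected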
The way I have done this in dbplyr is based on the addition of a reference/ID column:
my_tbl <- tbl(con, "table_name")

for (i in 0:99) {   # ID %% 100 takes the values 0 to 99
  sub_tbl <- my_tbl %>% filter(ID %% 100 == i)
  # further processing using 'sub_tbl'
  ...
}
If you add a row number to your dataset, then your filter could be replaced by filter(LowerBound < row_number & row_number < UpperBound).
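Here is a sketch of that row-number variant, assuming the remote table has a sortable column ID and using dbplyr's window_order() to set the ordering for the window function (the chunk size and loop bounds are arbitrary):
library(dplyr)
library(dbplyr)

chunk_size <- 1000
numbered <- my_tbl %>%
  window_order(ID) %>%        # ordering used by the window function
  mutate(rn = row_number())   # translates to ROW_NUMBER() OVER (ORDER BY "ID")

for (lower_bound in seq(0, 9000, by = chunk_size)) {
  chunk <- numbered %>%
    filter(rn > lower_bound, rn <= lower_bound + chunk_size) %>%
    collect()
  # further processing using 'chunk'
}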
I have worked a little bit with DBI in R. The first question is more about best practice, as appending new data to the DB currently takes more time than I hoped. The second is about an error I receive when trying to update old information in the database. Here is my current workflow when inserting new data into an existing table in the DB:
con <- dbConnect(odbc(), "myDSN")
# Example table 1
tbl1 <- tibble(Key = c("A", "B", "C", "D", "E"),
               Val = c(1, 2, 3, 4, 5))
# Original table in DB
dbWriteTable(con, "tbl1", tbl1, overwrite = TRUE)
# Link to Original table
db_tbl <- tbl(con, in_schema("dbo", "tbl1"))
# New data
tbl2 <- tibble(Key = c("D", "E", "F", "G", "H"),
               val = c(10, 11, 12, 13, 14))
# Write it to Staging
dbWriteTable(con, "tbl1_staging", tbl2, overwrite = TRUE)
# Get a link to staging
db_tblStaging <- tbl(con, in_schema("dbo", "tbl1_staging"))
# Compare Info
not_in_db <- db_tblStaging %>%
  anti_join(db_tbl, by = "Key") %>%
  collect()
# Append missing info to DB
dbWriteTable(con, "tbl1", not_in_db, append = TRUE)
# Voila!
dbReadTable(con, "tbl1")
That will do the trick, but I'm looking for a better solution, as I dislike the collect() part of the code, which (as far as I understand it) brings the data into R memory and could become a problem in the future when I have bigger data. What I hoped would work is something like this, which would let me append new data to the DB on the fly, without it ever entering R's memory:
# What I hoped to have
db_tblStaging %>%
anti_join(db_tbl, by="Key") %>%
dbWriteTable(con, "tbl1", ., append = TRUE)
The second problem is updating an existing table. Here is what I tried, but an error emerges and I can't figure it out. Here is the answer I tried to copy: How to pass data.frame for UPDATE with R DBI. I would like to update keys D and E with the new values in val.
# Trying to update tbl1
update_values <- db_tblStaging %>%
semi_join(db_tbl, by="Key") %>%
collect()
update <- dbSendQuery(con, 'UPDATE tbl1
SET "val" = ?
WHERE Key = ?')
dbBind(update, update_values)
Error in result_bind(res@ptr, as.list(params)) :
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Incorrect syntax near the keyword 'Key'.
Has the package changed in some way? I can't spot my syntax error.
Consider running pure SQL after your staging table uploads, as it looks like you need NOT EXISTS (to avoid duplicates) and UPDATE with INNER JOIN (for existing records). This avoids any client-side imports and exports between R and the database.
And Key is a reserved word in SQL Server. Hence, escape it with square brackets.
apn_sql <- "INSERT INTO dbo.tbl ([Key], [Val])
            SELECT s.[Key], s.[Val] FROM dbo.tbl_staging s
            WHERE NOT EXISTS
                  (SELECT 1 FROM dbo.tbl t
                   WHERE t.[Key] = s.[Key])"
dbSendQuery(con, apn_sql)
upd_sql <- "UPDATE t
SET t.Val = s.Val
FROM dbo.tbl t
INNER JOIN dbo.tbl_staging s
ON t.[Key] = s.[Key]"
dbSendQuery(con, upd_sql)
Rextester demo
In fact, SQL Server has the MERGE query to handle both in one call:
MERGE dbo.tbl AS Target
USING (SELECT [Key], [Val] FROM dbo.tbl_staging) AS Source
ON (Target.[Key] = Source.[Key])
WHEN MATCHED THEN
UPDATE SET Target.Val = Source.Val
WHEN NOT MATCHED BY TARGET THEN
INSERT ([Key], [Val])
VALUES (Source.[Key], Source.[Val]);
Rextester demo
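As an aside, escaping the reserved word also makes the parameterized UPDATE from the question work; a minimal sketch, assuming update_values is the collected data frame with the val and Key columns from above:
update <- dbSendQuery(con, "UPDATE tbl1 SET [val] = ? WHERE [Key] = ?")
dbBind(update, list(update_values$val, update_values$Key))
dbClearResult(update)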