I have a variable that stores a date.
library(lubridate)
date_n <- today() - years(2)
And I want to use date_n within the following sqlQuery call.
transactions_july <- sqlQuery(con,
"select DATA, VREME, PARTIJA, IZNOS
from pts
where DATA > '2016-08-10'")
So basically, date_n would replace the date - '2016-08-10'.
Any ideas?
You can use sprintf to substitute the value into the query string:
transactions_july <- sqlQuery(con,
    sprintf("select DATA, VREME, PARTIJA, IZNOS
             from pts where DATA > '%s'", date_n))
The %s placeholder is replaced by the value of date_n. Note the single quotes around %s, so the date ends up quoted in the resulting SQL.
And for running SQL queries on local data frames, you can also use sqldf.
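As a side note: if the connection were DBI-based (sqlQuery here comes from RODBC), a bound parameter would avoid quoting the date by hand. A minimal sketch, assuming a driver that uses ? placeholders (e.g. RSQLite or RMariaDB):
library(DBI)
library(lubridate)
date_n <- today() - years(2)
# the driver substitutes date_n safely, with no manual quoting
transactions_july <- dbGetQuery(con,
    "select DATA, VREME, PARTIJA, IZNOS from pts where DATA > ?",
    params = list(as.character(date_n)))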
Here is the data frame that I have:
trail_df <- data.frame(d  = seq.Date(as.Date("2020-01-01"), as.Date("2020-02-01"), by = 1),
                       AA = NA,
                       BB = NA,
                       CC = NA)
Now I loop over the columns of trail_df and fetch the data for each column name from the Oracle database for the given dates, which I am doing like this:
for (i in 2:ncol(trail_df)) {
  c_name <- colnames(trail_df)[i]
  query  <- paste0("SELECT * FROM tablename WHERE ID = '", c_name, "' ")  # this query returns Date and Price
  result <- dbGetQuery(con, query)  # 'con' is the connection variable for the db
  for (k in seq_len(nrow(result))) {
    # match the date in the trail_df data frame and paste the value into the respective column
    trail_df[which(as.Date(result[k, 1]) == as.Date(trail_df[, 1])), i] <- result[k, 2]
  }
}
This is a snippet of the code; the date filtering and so on has been taken care of in the real code.
The problem is that I have more than 6000 columns and 500 rows, for which I have to match the dates (because the dates are random) and fill in the price, and this is currently taking forever.
I am new to the R language and would appreciate any help to speed this code up, maybe with multiprocessing if that is possible in R.
There are two steps to this answer:
Use parameterized queries to get the raw data; and
Get this data into the "wide" format you desire.
Parameterized query
My (first) suggestion is to use parameterized queries, which is safer. It may not improve the speed relative to #RonakShah's answer (using sprintf), at least not the first time the query runs.
However, it might help a touch if the query is repeated: DBMSes tend to parse/optimize queries and cache this optimization. When a query changes even a little, the cache cannot be used and the query is re-optimized. Here that cache-invalidation is unnecessary, and it can be avoided by using binding parameters.
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s)",
paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
# [1] "SELECT * FROM tablename WHERE ID IN (?,?,?)"
res <- dbGetQuery(con, query, params = list(trail_df$ID))
Some thoughts:
if the database has many more dates than you need, you can restrict the data returned by adding a date range to the query. This will work well if your trail_df dates are close together:
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s) and Date between ? and ?",
paste(rep("?", ncol(mtcars)), collapse = ","))
query
res <- dbGetQuery(con, query, params = c(list(trail_df$ID), as.list(range(df$d))))
if your dates are more variable and you end up querying many more rows than you actually need, I suggest uploading your trail_df dates into a temporary table and running something like:
"select tb.Date, tb.ID, tb.Price
from mytemptable tmp
left join tablename tb on tmp.d = tb.Date
where ..."
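A minimal sketch of that temp-table step with DBI (here 'mytemptable' is a placeholder name, and the temporary = TRUE argument is only supported by some backends, e.g. odbc and RPostgres):
# upload just the dates we need, then join server-side
dbWriteTable(con, "mytemptable", trail_df["d"], temporary = TRUE)
res <- dbGetQuery(con, "
  select tb.Date, tb.ID, tb.Price
  from mytemptable tmp
  left join tablename tb on tmp.d = tb.Date")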
Reshape
It appears as if your database table may be more "long" shaped and you want it "wide" in your frame. There are many ways to reshape from long to wide, but these should work:
reshape2::dcast(res, Date ~ ID, value.var = "Price")  # 'Price' is the 'value' column, unknown here
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
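For illustration, with made-up long data in the shape assumed above (Date/ID/Price columns):
res <- data.frame(Date  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
                  ID    = c("AA", "BB", "AA"),
                  Price = c(1.2, 3.4, 5.6))
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
# one row per Date, one column per ID; missing Date/ID combinations become NA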
I am working on a project where I have to download more than 10 million records to a relatively small server. So instead of downloading the entire dataset at once, I have to download it in smaller sections. I am trying to create a loop that will call batches of the data based on date.
I'm used to coding in Stata, where you can reference a local inside a string using `x' or some variant. However, I can't find a way to do this in R. Below is a small piece of the code I'm using. Basically, whenever I run this, 'val' and 'val2' aren't updated with the dates in the defined lists, so the output reads as if the server is searching between the literal strings 'val' and 'val2' instead of between '20190101' and '20190301'. Any suggestions for how to fix this are greatly appreciated!
x<-c(20190101, 20190301)
y<-c(20190301, 20190501)
foreach (val=x, val2=y) %do% {
data<-DBI::dbGetQuery(myconn, "SELECT * FROM .... WHERE (DATE BETWEEN 'val' AND 'val2')")
}
With a basic loop
x <- c(20190101, 20190301)
y <- c(20190301, 20190501)
data_all <- c()
for (i in 1:length(x)) {
  query <- paste0("SELECT * FROM .... WHERE (DATE BETWEEN '",
                  x[i], "' AND '", y[i], "')")
  data <- DBI::dbGetQuery(myconn, query)
  data_all <- rbind(data_all, data)
}
With sprintf (which is vectorized over x and y) you can construct all the queries at once, then use lapply + do.call to combine the results into one dataframe.
x<-c(20190101, 20190301)
y<-c(20190301, 20190501)
input <- sprintf("SELECT * FROM .... WHERE (DATE BETWEEN '%s' AND '%s')", x, y)
result <- do.call(rbind, lapply(input, function(x) DBI::dbGetQuery(myconn, x)))
Using purrr::map_df is a bit shorter.
result <- purrr::map_df(input, ~DBI::dbGetQuery(myconn, .x))
In dplyr, if tbl is a table in a database then head(tbl) gets translated into
select *
from tbl
limit 6
but there doesn't seem to be a way to use the offset keyword to read data in chunks. E.g. the equivalent of
select *
from tbl
limit 6 offset 5
doesn't seem possible with dplyr. In dbplyr, there is a do function to let you choose a chunk_size to bring back data chunk-by-chunk.
Is that the only way to do it in R? The solution doesn't have to be in dplyr or the tidyverse.
Another approach would be to construct your own offset function. This assumes your database supports the OFFSET keyword, and the function is unlikely to be transferable to databases of other types.
Something like the following:
library(dplyr)
library(dbplyr)

offset_head <- function(table, num, offset){
  # get the underlying connection from the remote tbl
  db_connection <- table$src$con
  # render the tbl's query and append LIMIT/OFFSET (build_sql and sql_render are dbplyr functions)
  sql_query <- build_sql(con = db_connection,
                         sql_render(table),
                         "\nLIMIT ", num,
                         "\nOFFSET ", offset)
  return(tbl(db_connection, sql(sql_query)))
}
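Hypothetical usage, assuming my_tbl is a remote tbl on a database whose dialect accepts LIMIT ... OFFSET:
my_tbl <- tbl(con, "table_name")
offset_head(my_tbl, num = 6, offset = 5)  # roughly the 'limit 6 offset 5' query from the question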
The way I have done this in dbplyr is based on the addition of a reference/ID column:
my_tbl <- tbl(con, "table_name")
for (i in 0:99) {  # ID %% 100 takes the values 0 to 99
  sub_tbl <- my_tbl %>% filter(ID %% 100 == i)
  # further processing using 'sub_tbl'
  ...
}
If you add a row number to your dataset, then your filter could be replaced by filter(LowerBound < row_number & row_number < UpperBound).
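A minimal dbplyr sketch of that idea, assuming the table has a sortable ID column to order the window by (lower_bound and upper_bound are placeholders you supply):
sub_tbl <- my_tbl %>%
  mutate(rn = row_number(ID)) %>%  # translated to ROW_NUMBER() OVER (ORDER BY ID)
  filter(lower_bound < rn, rn < upper_bound)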
I'm using the RPostgreSQL package to load data from a PostgreSQL data base.
The problem is that a datetime column (POSIXct) is automatically converted into a date.
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="abc",host="def ",port=1234,user="ghi",password="jkl" )
Instead of using this:
df = dbGetQuery(con, "
SELECT customer_id, dttm_utc
FROM schema.table;")
I have to use that:
df = dbGetQuery(con, "
SELECT customer_id, to_char(dttm_utc, 'MM-DD-YYYY HH24:MI:SS') as dttm_utc
FROM schema.table;")
If I don't, I lose the time component and only get dates back.
I noticed this problem doesn't occur if I only fetch the first 1000 rows. It appears almost every time there are more than 300,000 rows.
How can I fix this?
How do you write a SQL query with a date using RSQLite? Here is an example below. The dbGetQuery call does not return any rows.
require(RSQLite)
require(ggplot2)
data(presidential)
m <- dbDriver("SQLite")
tmpfile <- tempfile('presidential', fileext='.db')
conn <- dbConnect(m, dbname=tmpfile)
dbWriteTable(conn, "presidential", presidential)
dbGetQuery(conn, "SELECT * FROM presidential WHERE Date(start) >= Date('1980-01-01')")
Just to illustrate, this works fine:
tmpfile <- tempfile('presidential', fileext='.db')
conn <- dbConnect(m, dbname=tmpfile)
p <- presidential
p$start <- as.character(p$start)
p$end <- as.character(p$end)
dbWriteTable(conn, "presidential", p)
dbGetQuery(conn, "SELECT * FROM presidential WHERE start >= '1980-01-01'")
You can read about the lack of native date types in SQLite in its documentation. I've been using strings as dates for so long in SQLite that I'd actually forgotten about the issue completely.
And yes, I've written a small R function that converts any Date column in a data frame to character. For simple comparisons, keeping them in YYYY-MM-DD is enough, and if I need to do arithmetic I convert them after the fact in R.
Following on from #joran's answer, here's a simple way to convert the date columns to strings in a data.frame:
library(dplyr)
mutate(df, across(where(lubridate::is.Date), ~ format(.x, "%Y.%m.%d")))
I found working with RSQLite and dplyr to be the most convenient way to stay type-consistent when using R and SQLite. In particular, extended_types = TRUE ensures that columns of type DATE, DATETIME / TIMESTAMP, and TIME are mapped to the corresponding R classes (available since RSQLite 2.2.8).
library(dplyr)
library(RSQLite)
library(ggplot2)
data(presidential)
mydb <- dbConnect(SQLite(), "presidential.sqlite", extended_types = TRUE)
dbWriteTable(mydb, "presidential", presidential)
tbl(mydb, "presidential") %>%
  filter(start >= as.Date("1980-01-01")) %>%
  collect()
You can also formulate the latter collection as a get query:
dbGetQuery(mydb, "SELECT * FROM presidential WHERE start >= CAST('1980-01-01' AS DATE)")
As #joran suggests, keeping dates as text in SQLite seems like the best way to go for the time being.
I used #Richard Knight's approach for the conversion on the way in, but with ISO format, to change the dates to strings before writing the dataframe:
local_df %>% mutate(across(where(lubridate::is.Date), ~ format(.x, "%Y-%m-%d")))
Manipulating the dates remotely can be done using sql translation, particularly:
remote_df %>% mutate(date_as_number = julianday(date_as_string))
remote_df %>% mutate(date_as_string = date(date_as_number))
N.b. that is date, not as.Date, in the second one. This is because as.Date gets translated to CAST(date_as_number AS DATE), whereas what we want is to use SQLite's date() function with a floating point number as returned by julianday().
Mapping the remote date strings back into dates can be done automatically, if you shadow dplyr's collect() with a wrapper:
collect <- function(remote_df, ...) {
  raw <- remote_df %>% dplyr::collect(...)
  # character columns where every non-NA value looks like an ISO date
  isoDateString <- function(x) is.character(x) & all(na.omit(stringr::str_detect(x, "[0-9]{4}-[0-9]{2}-[0-9]{2}"))) & !all(is.na(x))
  raw <- raw %>% mutate(across(where(isoDateString), ~ as.Date(.x, "%Y-%m-%d")))
  # numeric columns whose values all fall within a plausible Julian-day range
  maybeJulian <- function(x) is.double(x) & all(na.omit(x > 2440587.5)) & all(na.omit(x < 2488069.5)) & !all(is.na(x))
  raw <- raw %>% mutate(across(matches(".*(D|d)ate.*") & where(maybeJulian), ~ as.Date(.x - 2440587.5, origin = "1970-01-01")))
  return(raw)
}
The apparently random numbers in the maybeJulian function are the Julian day numbers for 1970-01-01 and 2100-01-01.
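Hypothetical usage, assuming remote_df is a lazy tbl whose date columns come back as ISO strings or Julian-day numbers:
local_df <- remote_df %>% collect()  # the wrapper converts those columns on the way out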