How to read data from a database by chunk in R?

In dplyr, if tbl is a table in a database then head(tbl) gets translated into

SELECT * FROM tbl LIMIT 6

but there doesn't seem to be a way to use the OFFSET keyword to read data in chunks. E.g. the equivalent of

SELECT * FROM tbl LIMIT 6 OFFSET 5

doesn't seem possible with dplyr. In dbplyr, there is a do function that lets you choose a chunk_size to bring back data chunk-by-chunk.
Is that the only way to do it in R? The solution doesn't have to be in dplyr or the tidyverse.

Another approach would be to construct your own offset function. This assumes your database supports it, and the function is unlikely to be transferable to databases of other types.
Something like the following:
library(dplyr)
library(dbplyr)

offset_head <- function(table, num, offset) {
  # get the DBI connection underlying the lazy table
  db_connection <- table$src$con
  # render the existing query and append LIMIT/OFFSET to it
  sql_query <- build_sql(con = db_connection,
                         sql_render(table),
                         "\nLIMIT ", num,
                         "\nOFFSET ", offset)
  tbl(db_connection, sql(sql_query))
}

The way I have done this in dbplyr is based on the addition of a reference/ID column:
my_tbl <- tbl(con, "table_name")

# ID %% 100 takes the values 0-99, so loop over that range
for (i in 0:99) {
  sub_tbl <- my_tbl %>% filter(ID %% 100 == i)
  # further processing using 'sub_tbl'
  ...
}

If you add a row number to your dataset, then your filter could be replaced by filter(LowerBound < row_number & row_number < UpperBound), as in the sketch below.
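A minimal sketch of that variant, assuming an integer row_number column was added when the table was populated (chunk_size and num_chunks are illustrative; !! forces the local bounds to be inlined into the SQL):

chunk_size <- 10000
num_chunks <- 50   # however many chunks cover the table

for (chunk in seq_len(num_chunks)) {
  lower <- (chunk - 1) * chunk_size
  upper <- chunk * chunk_size + 1
  sub_tbl <- my_tbl %>%
    filter(!!lower < row_number, row_number < !!upper)
  # further processing using 'sub_tbl'
}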

Related

How to retrieve data from Oracle to R in a faster way than this?

Here is the data frame that I have:

trail_df <- data.frame(d  = seq.Date(as.Date("2020-01-01"), as.Date("2020-02-01"), by = 1),
                       AA = NA,
                       BB = NA,
                       CC = NA)
Now I loop over the columns of trail_df and fetch the data for each column name from the Oracle database for the given dates, which I am doing like this:
for (i in 2:ncol(trail_df)) {
  c_name <- colnames(trail_df)[i]
  query  <- paste0("SELECT * FROM tablename WHERE ID= '", c_name, "' ")  # this query returns Date and Price
  result <- dbGetQuery(con, query)  # con is the connection variable from the db
  for (k in seq_len(nrow(result))) {
    # match the date in the trail_df data frame and paste the value into the respective column
    trail_df[which(as.Date(result[k, 1]) == as.Date(trail_df[, 1])), i] <- result[k, 2]
  }
}
This is a snippet of the code; the date filtering and so on is taken care of in the real code.
The problem is that I have more than 6000 columns and 500 rows, for which I have to match the dates (because the dates are random) and put the price in the respective column, and this is taking forever.
I am new to the R language and would appreciate any help to speed this code up, maybe with multiprocessing if that is possible in R.
There are two steps to this answer:
Use parameterized queries to get the raw data; and
Get this data into the "wide" format you desire.
Parameterized query
My (first) suggestion is to use parameterized queries, which is safer. It may not improve the speed relative to @RonakShah's answer (using sprintf), at least not the first time.
However, it might help a touch if the query is repeated: DBMSes tend to parse/optimize queries and cache this optimization. When a query changes even a little, this caching cannot happen, and the query is re-optimized. In this case, the cache invalidation is unnecessary and can be avoided if we use binding parameters.
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s)",
                 paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
# [1] "SELECT * FROM tablename WHERE ID IN (?,?,?)"

# bind the non-date column names (AA, BB, CC) to the "?" placeholders
res <- dbGetQuery(con, query, params = as.list(names(trail_df)[-1]))
Some thoughts:
if the database has many more dates than you need, you can restrict the data returned by also binding a date range. This will work well if your trail_df dates are close together:
query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s) and Date between ? and ?",
                 paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
res <- dbGetQuery(con, query,
                  params = c(as.list(names(trail_df)[-1]), as.list(range(trail_df$d))))
if your dates are more variable and you end up querying many more rows than you actually need, I suggest uploading your trail_df dates into a temporary table and using something like:
"select tb.Date, tb.ID, tb.Price
from mytemptable tmp
left join tablename tb on tmp.d = tb.Date
where ..."
Reshape
It appears as if your database table may be more "long" shaped and you want it "wide" in your frame. There are many ways to reshape from long-to-wide (examples), but these should work:
reshape2::dcast(res, Date ~ ID, value.var = "Price")   # 'Price' is the 'value' column, unknown here
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
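A toy illustration of the long-to-wide step, with made-up values in the shape the query is assumed to return:

res <- data.frame(Date  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
                  ID    = c("AA", "BB", "AA"),
                  Price = c(1.10, 2.25, 1.15))
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
# one row per Date, one column per ID (AA, BB), NA where a price is missing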

R function for looping over a string with unique values

I am working on a project where I have to download more than 10 million records on a relatively small server. So instead of downloading the entire dataset at once, I have to download it in smaller sections. I am trying to create a loop that will call batches of the data based on date. I'm used to coding in Stata, where you can reference a local inside a string using `x' or some variant. However, I can't find a way to do this in R. Below is a small piece of the code I'm using. Basically, whenever I try to run this, 'val' and 'val2' aren't updating with the dates in the defined lists, so the output literally reads as if the server is trying to search between 'val' and 'val2' instead of between '20190101' and '20190301'. Any suggestions for how to fix this are greatly appreciated!
x <- c(20190101, 20190301)
y <- c(20190301, 20190501)
foreach (val = x, val2 = y) %do% {
  data <- DBI::dbGetQuery(myconn, "SELECT * FROM .... WHERE (DATE BETWEEN 'val' AND 'val2')")
}
With a basic loop
x <- c(20190101, 20190301)
y <- c(20190301, 20190501)

data_all <- c()
for (i in seq_along(x)) {
  query <- paste0("SELECT * FROM .... WHERE (DATE BETWEEN '",
                  x[i], "' AND '", y[i], "')")
  data <- DBI::dbGetQuery(myconn, query)
  data_all <- rbind(data_all, data)
}
With sprintf you can construct the query and use lapply + do.call to combine the results into one dataframe.
x <- c(20190101, 20190301)
y <- c(20190301, 20190501)

input  <- sprintf("SELECT * FROM .... WHERE (DATE BETWEEN '%s' AND '%s')", x, y)
result <- do.call(rbind, lapply(input, function(x) DBI::dbGetQuery(myconn, x)))
Using purrr::map_df is a bit shorter.
result <- purrr::map_df(input, ~DBI::dbGetQuery(myconn, .x))
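If you want to stay with foreach as in the question, the same string construction works there too; a sketch (myconn and the elided table name are placeholders from the question):

library(foreach)
result <- foreach(val = x, val2 = y, .combine = rbind) %do% {
  DBI::dbGetQuery(myconn,
                  sprintf("SELECT * FROM .... WHERE (DATE BETWEEN '%s' AND '%s')", val, val2))
}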

How to lookup data and print values based on criteria in R?

So I have a CSV file that has 12 columns of data. What I want to do is get specific values from the CSV file based on the desired criteria.
A snippet of the data is provided; I have this list of Maps:

Maps <- c("Nuke", "Vertigo", "Inferno", "Mirage", "Train", "Overpass", "Dust2")
The goal is to get CTWinProb & TWinProb values for each of the maps in the Map list, e.g.
CTWinProbs:
Nuke = 0.5758
Dust2 = 0.4965
Inferno = 0.4885
and so on, and vice versa for TWinProb.
So far I have been using the sqldf library, which is very tedious. This is what I am currently doing:

T1NukeCT <- sqldf("select CTWinProb from Team1 where MapName like '%Nuke%'")

which outputs T1NukeCT = 0.5758, and I repeat this for each map and then again for TWinProb.
I am sure there is an easier way; I'm just quite new to R, so I am not 100% sure of the best method or how to go about this in a less tedious manner.
You may use a WHERE IN (...) clause:
Maps <- c("Nuke","Vertigo","Inferno","Mirage","Train","Overpass","Dust2")
where_in <- paste0("('", paste(Maps, collapse="','"), "')")
sql <- paste0("SELECT CTWinProb FROM Team1 WHERE MapName IN ", where_in)
T1NukeCT <- sqldf(sql)
To be clear, the SQL query generated by the above script is:
SELECT CTWinProb
FROM Team1
WHERE MapName IN ('Nuke','Vertigo','Inferno','Mirage','Train','Overpass','Dust2')
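To pull both probabilities for every map in a single query, a small extension of the same idea (column names taken from the question):

sql_both <- paste0("SELECT MapName, CTWinProb, TWinProb FROM Team1 WHERE MapName IN ", where_in)
probs <- sqldf(sql_both)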
What output/results are you looking for exactly?
If you want results in R, these are two simple functions to return the desired values.
They require the dplyr package to be loaded.
library(dplyr)
library(readr)

YourData <- read_csv("./yourfile.csv")

CTWinFunc <- function(x) {
  YourData %>% filter(MapName == x) %>% pull(CTWinProb)
}
TWinFunc <- function(x) {
  YourData %>% filter(MapName == x) %>% pull(TWinProb)
}
Now CTWinFunc("Nuke") should return CTWinProb result for Nuke, ie: 0.5758
And TWinFunc("Nuke") should return TWinProb result for Nuke, ie: 0.4242
If you want to return a vector with all the results together, I guess you could use the sapply() function. Something like this...

TWins <- sapply(Maps, TWinFunc)
TWins[lengths(TWins) == 0] <- NA
TWins <- unlist(TWins)

And this should give you a table with the results:

cbind(Maps, TWins)
Of course, it seems like all this data is already in the original table and you could just subset that.
YourData[,c(4,11,12)]
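Or, selecting by name rather than position, a small sketch using the column names from the question:

YourData %>%
  filter(MapName %in% Maps) %>%
  select(MapName, CTWinProb, TWinProb)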

How can we bulk insert data in SQLServer without creating a text file from RODBC package?

This question is an extension of How to quickly export data from R to SQL Server. Currently I am using the following code:
# DB handle from the config file
dbhandle <- odbcDriverConnect()

# save the data in the table
sqlSave(dbhandle, bp, "FACT_OP", append = TRUE, rownames = FALSE,
        verbose = verbose, fast = TRUE)
# varTypes <- c(Date = "datetime", QueryDate = "datetime")
# sqlSave(dbhandle, bp, "FACT_OP", rownames = FALSE, verbose = TRUE,
#         fast = TRUE, varTypes = varTypes)

# close the DB handle
odbcClose(dbhandle)
I have also tried the approach below, which works beautifully, and I have gained significant speed as well.
toSQL <- data.frame(...)
write.table(toSQL, "C:\\export\\filename.txt", quote = FALSE, sep = ",",
            row.names = FALSE, col.names = FALSE, append = FALSE)
sqlQuery(channel, "BULK INSERT Yada.dbo.yada
                   FROM '\\\\<server-that-SQL-server-can-see>\\export\\filename.txt'
                   WITH
                   (
                     FIELDTERMINATOR = ',',
                     ROWTERMINATOR = '\\n'
                   )")
But my issue is that I can NOT keep my data at rest between the transactions (writing data to a file is not an option because of data security), so I was looking for a solution that bulk inserts directly from memory or from a cache. Thanks for the help.
Good question - also useful in instances where the BULK INSERT permissions cannot be setup for whatever reason.
I threw together this poor man's solution a while back when I had enough data that sqlSave was too slow, but not enough to justify setting up BULK INSERT, so it does not require any data being written to a file. The primary reason that sqlSave and parameterized queries are so slow for inserting data is that each row is inserted with a new INSERT statement. Having R write the INSERT statement manually bypasses this in my example below:
library(RODBC)

channel   <- ...
dataTable <- ...relevant data...

numberOfThousands <- floor(nrow(dataTable) / 1000)
extra <- nrow(dataTable) %% 1000

# builds and runs one INSERT statement covering all the rows in 'range'
thousandInsertQuery <- function(channel, dat, range) {
  sqlQuery(channel, paste0(
    "INSERT INTO Database.dbo.Responses (IDNum,State,Answer) VALUES ",
    paste0(
      sapply(range, function(k) {
        paste0("(", dat$IDNum[k], ",'",
               dat$State[k], "','",
               gsub("'", "''", dat$Answer[k], fixed = TRUE), "')")
      }),
      collapse = ",")
  ))
}

if (numberOfThousands) {
  for (n in 1:numberOfThousands) {
    thousandInsertQuery(channel, dataTable, (1000 * (n - 1) + 1):(1000 * n))
  }
}
if (extra) {
  thousandInsertQuery(channel, dataTable,
                      (1000 * numberOfThousands + 1):(1000 * numberOfThousands + extra))
}
SQL's INSERT statements written out with values will only accept up to 1000 rows at a time, so this code breaks it up into chunks (much more efficiently than one row at a time).
The thousandInsertQuery function will obviously have to be customized to handle whatever columns your data frame has - note also that there are single quotes around the character/factor columns and a gsub to handle any single quotes that might be in the character column. Other than this there are no safeguards against SQL injection attacks.
What about using the DBI::dbWriteTable() function?
Example below (I am connecting my R code to AWS RDS instance of MS SQL Express):
library(DBI)
library(RJDBC)
library(tidyverse)

# specify where your driver lives
drv <- JDBC(
  "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "c:/R/SQL/sqljdbc42.jar")

# connect to the AWS RDS instance
conn <- drv %>%
  dbConnect(
    host = "jdbc:sqlserver://xxx.ccgqenhjdi18.ap-southeast-2.rds.amazonaws.com",
    user = "xxx",
    password = "********",
    port = 1433,
    dbname = "qlik")

if (0) {  # check what the conn object has access to
  queryResults <- conn %>%
    dbGetQuery("select * from information_schema.tables")
}

# create test data
example_data <- data.frame(animal = c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel   = c("furry", "furry", "squishy", "spiny"),
                           weight = c(45, 8, 1.1, 0.8))

# works in 20 ms in my case
system.time(
  conn %>% dbWriteTable(
    "qlik.export.test",
    example_data
  )
)

# let us see if we see the exported results
conn %>% dbGetQuery("select * FROM qlik.export.test")

# let's clean the mess and force-close the connection at the end of the process
conn %>% dbDisconnect()
It works pretty fast for small amounts of data and seems rather elegant if you want a data.frame -> SQL table solution.
Enjoy!
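One related note: if the target table already exists and you only want to add rows, dbWriteTable() also accepts an append argument. A sketch, to be run before the dbDisconnect() above; driver support may vary:

conn %>% dbWriteTable("qlik.export.test", example_data, append = TRUE)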
Building on @jpd527's solution, which I found really worth digging into...
require(RODBC)

channel <- # connection parameters
dbPath  <- # path to your table: database.table
data    <- # the DF you have prepared for insertion; beware of column names and value types...

# Function to insert 1000 rows of data in one sqlQuery call, coming from
# any DF and into any database.table
insert1000Rows <- function(channel, dbPath, data, range) {

  # Defines column names for the database.table
  columns <- paste(names(data), collapse = ", ")

  # Initialize a string which will incorporate all 1000 rows of values
  values <- ""

  # Not very elegant, but appropriately builds the values (a, b, c...), (d, e, f...) into a string
  for (i in range) {
    for (j in 1:ncol(data)) {

      # First column
      if (j == 1) {
        if (i == min(range)) {
          # First row, only "("
          values <- paste0(values, "(")
        } else {
          # Next rows, ",("
          values <- paste0(values, ",(")
        }
      }

      # Value handling
      values <- paste0(
        values
        # Handling NA values you want to insert as NULL values
        , ifelse(is.na(data[i, j])
                 , "null"
                 # Handling numeric values you want to insert as INT
                 , ifelse(is.numeric(data[i, j])
                          , data[i, j]
                          # Else handling as character to insert as VARCHAR
                          , paste0("'", data[i, j], "'")
                 )
        )
      )

      # Separator for columns
      if (j == ncol(data)) {
        # Last column, close parenthesis
        values <- paste0(values, ")")
      } else {
        # Other columns, add comma
        values <- paste0(values, ",")
      }
    }
  }

  # Once the string is built, insert it into SQL Server
  sqlQuery(channel, paste0("insert into ", dbPath, " (", columns, ") values ", values))
}
This insert1000Rows function is used in a loop in the next function, sqlInsertAll, for which you simply define which DF you want to insert into which database.table.
# Main function which uses the insert1000Rows function in a loop
sqlInsertAll <- function(channel, dbPath, data) {
  numberOfThousands <- floor(nrow(data) / 1000)
  extra <- nrow(data) %% 1000
  if (numberOfThousands) {
    for (n in 1:numberOfThousands) {
      insert1000Rows(channel, dbPath, data, (1000 * (n - 1) + 1):(1000 * n))
      print(paste0(n, "/", numberOfThousands))
    }
  }
  if (extra) {
    insert1000Rows(channel, dbPath, data,
                   (1000 * numberOfThousands + 1):(1000 * numberOfThousands + extra))
  }
}
With this, I am able to insert 250k rows of data in 5 minutes or so, whereas it took more than 24 hours using sqlSave from the RODBC package.
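For reference, a call with the placeholders above might look like this (the table path is hypothetical):

sqlInsertAll(channel, "myDatabase.dbo.myTable", data)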

Using Dates with RSQLite

How do you write a SQL query with a date using RSQLite? Here is an example below; the dbGetQuery call does not return any rows.
require(RSQLite)
require(ggplot2)
data(presidential)

m <- dbDriver("SQLite")
tmpfile <- tempfile('presidential', fileext = '.db')
conn <- dbConnect(m, dbname = tmpfile)
dbWriteTable(conn, "presidential", presidential)
dbGetQuery(conn, "SELECT * FROM presidential WHERE Date(start) >= Date('1980-01-01')")
Just to illustrate, this works fine:
tmpfile <- tempfile('presidential', fileext = '.db')
conn <- dbConnect(m, dbname = tmpfile)

p <- presidential
p$start <- as.character(p$start)
p$end   <- as.character(p$end)

dbWriteTable(conn, "presidential", p)
dbGetQuery(conn, "SELECT * FROM presidential WHERE start >= '1980-01-01'")
You can read about the lack of native date types in SQLite in the docs here. I've been using strings as dates for so long in SQLite that I'd actually forgotten about the issue completely.
And yes, I've written a small R function that converts any Date column in a data frame to character. For simple comparisons, keeping them in YYYY-MM-DD is enough, and if I need to do arithmetic I convert them after the fact in R.
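A minimal sketch of such a helper (not the author's original function, just one way to write it):

dates_to_char <- function(df) {
  # convert every Date column to character (ISO format sorts correctly as text)
  df[] <- lapply(df, function(col) if (inherits(col, "Date")) as.character(col) else col)
  df
}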
Following on from @joran's answer, here's a simple one-liner to convert the Date columns of a data.frame to strings (is.Date comes from the lubridate package):

mutate(df, across(where(lubridate::is.Date), ~ format(.x, "%Y.%m.%d")))
I found working with RSQLite and dplyr to be the most convenient way to stay type-consistent using R and SQLite. In particular, extended_types = TRUE ensures that columns of type DATE, DATETIME / TIMESTAMP, and TIME are mapped to the corresponding R classes (at least since version 2.2.8 of RSQLite).
library(dplyr)
library(RSQLite)
library(ggplot2)
data(presidential)

mydb <- dbConnect(SQLite(), "presidential.sqlite", extended_types = TRUE)
dbWriteTable(mydb, "presidential", presidential)

tbl(mydb, "presidential") %>%
  filter(start >= as.Date("1980-01-01")) %>%
  collect()
You can also formulate the latter collection as a get query:
dbGetQuery(mydb, "SELECT * FROM presidential WHERE start >= CAST('1980-01-01' AS DATE)")
As @joran suggests, keeping dates as text in SQLite seems like the best way to go for the time being.
I used @Richard Knight's approach for the conversion in, but with ISO format, to change the dates to strings before writing the data frame:

local_df %>% mutate(across(where(lubridate::is.Date), ~ format(.x, "%Y-%m-%d")))
Manipulating the dates remotely can be done using sql translation, particularly:
remote_df %>% mutate(date_as_number = julianday(date_as_string))
remote_df %>% mutate(date_as_string = date(date_as_number))
N.B. that is date(), not as.Date, in the second one. This is because as.Date would get translated to CAST(date_as_number AS DATE), whereas what we want is to use SQLite's date() function with a floating point number as returned by julianday().
Mapping the remote date strings back into dates can be done automatically if you wrap dplyr::collect():
collect <- function(remote_df, ...) {
  raw <- remote_df %>% dplyr::collect(...)
  # character columns that look like ISO dates -> Date
  isoDateString <- function(x) is.character(x) &
    all(na.omit(stringr::str_detect(x, "[0-9]{4}-[0-9]{2}-[0-9]{2}"))) & !all(is.na(x))
  raw <- raw %>% mutate(across(where(isoDateString), ~ as.Date(.x, "%Y-%m-%d")))
  # doubles in a plausible julian-day range, in columns whose name contains "date" -> Date
  maybeJulian <- function(x) is.double(x) &
    all(na.omit(x > 2440587.5)) & all(na.omit(x < 2488069.5)) & !all(is.na(x))
  raw <- raw %>% mutate(across(matches(".*(D|d)ate.*") & where(maybeJulian),
                               ~ as.Date(.x - 2440587.5, origin = "1970-01-01")))
  return(raw)
}
The apparently random numbers in the maybeJulian function correspond to 1970-01-01 and 2100-01-01
