dbFetch, write to another table, and dbFetch again?

dbFetch, write to another table, and dbFetch again? - r

I would like to work on a large database table. The idea was to read some rows, process them, append the result to another table, and so on. In code:
stmt <- "SELECT * FROM input_table WHERE cond"
rs <- DBI::dbSendQuery(con, stmt)
while (!DBI::dbHasCompleted(rs)) {
current_set <- DBI::dbFetch(rs, 50000)
res <- process(current_set)
dbWriteTable(con, "output_table", value=res, append=TRUE)
}
DBI::dbClearResult(rs)
However, I get the message "Closing open result set, pending rows". Is there any way to save the intermediate output?
I would like to work with sqlite and later on Postgres.

Just for reference, I ended up with a solution using a LIMIT / OFFSET construct. Not sure if it efficient, but it is fast enough for my case (700k rows).
batchsize <- 50000
stmt <- "SELECT * FROM input_table WHERE cond"
lim <- paste("LIMIT", batchsize, ";")
finished <- FALSE
i <- 0
while (!finished) {
curr_stmt <- paste(stmt, lim)
current_set <- dbGetQuery(con, curr_stmt)
res <- process(current_set)
dbWriteTable(con, "output_table", value=res, append=TRUE)
finished <- nrow(current_set) < batchsize
i <- i + nrow(current_set)
lim <- paste("LIMIT", batchsize, "OFFSET", i, ";")
}

Related

Paste variable in RMariaDB dbGetQuery 'where clause' [1054]

Having issues with pasting a variable into the query string for RMariaDB. I can return a query without paste and find the proper where statement I am looking for within the dataframe I query (ex. MIN). When I try to use a variable in the query it fails. I have searched stackoverflow up and down and read the dbgetquery docs but nothing seems to be working. I am sure it is something simple, just can't seem to find it.
library(RMariaDB)
team <- "MIN"
# This returns entire database with MIN in tm column.
filename <- dbGetQuery(conn, "select * from nhl_lab_lines_today")
# These will all give me a [1054] error.
test <- paste("select * from nhl_lab_lines_today WHERE tm = ",paste(team,collapse=", "),sep ="")
test <- paste("select * from nhl_lab_lines_today WHERE tm = team")
test <- paste("select * from nhl_lab_lines_today WHERE tm =", team,sep=" ")
filename <- dbGetQuery(conn, test)

dbGetQuery(con, paste0("select * from nhl_lab_lines_today WHERE tm = '", team ,"'"))

R for loop is only storing one result?

I am trying to get the row counts of all my tables with a query and I want to save the results in a dataframe. Right now, it only saves one value and I'm not sure what the issue is. Thanks for any help.
schema <- "test"
table_prefix <- "results_"
row_count <- list()
for (geo in geos){
table_name <- paste0(schema, ".", table_prefix, geo)
queries <- paste("SELECT COUNT(*) FROM", table_name)
}
for (x in queries){
row_count <- dbGetQuery(con, x)
}

Looping code for multiple db3 files in R

I'm trying to import a number of .db3 files, and rbind them together for further analysis. I'm having no troubles importing a single .db3 file, but my rbind won't work, despite it working fine for .csv files. Where have I gone wrong?
df <- c()
for (x in list.files(pattern="*.db3")){
sqlite <- dbDriver("SQLite")
mydb <- dbConnect(sqlite, x)
dbListTables(mydb)
results <- dbSendQuery(mydb, "SELECT * FROM gps_data")
data = fetch(results, n = -1)
data$Label <- factor(x)
data <- rbind(df, data)
}
Any help you can offer would be great!

Let's have a close look at that rbind call at the end of your loop:
df <- c()
for (x in list.files(pattern="*.db3")){
sqlite <- dbDriver("SQLite")
mydb <- dbConnect(sqlite, x)
dbListTables(mydb)
results <- dbSendQuery(mydb, "SELECT * FROM gps_data")
data = fetch(results, n = -1)
data$Label <- factor(x)
data <- rbind(df, data)
}
You've created the object df, then you're binding data to the end of it and using that to override the existing data (note df hasn't changed). Great. Now your loop starts again, creating a new data object, and binding it to.... df. Doh! It's a simple error, but you're binding things in the wrong order. Try changing that last line to:
df <- rbind( df, data )
and see how it goes.
What you'll be doing differently is overwriting df over and over, making it bigger each time. When you overwrote data, you went back and recreated it anew, throwing away what you'd just done.

Speed up R script looping through files/folders to check thresholds, calculate averages, and plot

I'm trying to speed up some code in R. I think my looping methods can be replaced (maybe with some form of lapply or using sqldf) but I can't seem to figure out how.
The basic premise is that I have a parent directory with ~50 subdirectories, and each of those subdirectories contains ~200 CSV files (a total of 10,000 CSVs). Each of those CSV files contains ~86,400 lines (data is daily by the second).
The goal of the script is to calculate the mean and stdev for two intervals of time from each file, and then make one summary plot for each subdirectory as follows:
library(timeSeries)
library(ggplot2)
# list subdirectories in parent directory
dir <- list.dirs(path = "/ParentDirectory", full.names = TRUE, recursive = FALSE)
num <- (length(dir))
# iterate through all subdirectories
for (idx in 1:num){
# declare empty vectors to fill for each subdirectory
DayVal <- c()
DayStd <- c()
NightVal <- c()
NightStd <- c()
date <- as.Date(character())
setwd(dir[idx])
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
# for each file in the subdirectory
for (i in c(1:numfiles)){
day <- read.csv(filenames[i], sep = ',')
today <- as.Date(day$time[1], "%Y-%m-%d")
# setting interval for times of day we care about <- SQL seems like it may be useful here but I couldn't get read.csv.sql to recognize hourly intervals
nightThreshold <- as.POSIXct(paste(today, "03:00:00"))
dayThreshold <- as.POSIXct(paste(today, "15:00:00"))
nightInt <- day[(as.POSIXct(day$time) >= nightThreshold & as.POSIXct(day$time) <= (nightThreshold + 3600)) , ]
dayInt <- day[(as.POSIXct(day$time) >= dayThreshold & as.POSIXct(day$time) <= (dayThreshold + 3600)) , ]
#check some thresholds in the data for that time period
if (sum(nightInt$val, na.rm=TRUE) < 5){
NightMean <- mean(nightInt$val, na.rm =TRUE)
NightSD <-sd(nightInt$val, na.rm =TRUE)
} else {
NightMean <- NA
NightSD <- NA
}
if (sum(dayInt$val, na.rm=TRUE) > 5){
DayMean <- mean(dayInt$val, na.rm =TRUE)
DaySD <-sd(dayInt$val, na.rm =TRUE)
} else {
DayMean <- NA
DaySD <- NA
}
NightVal <- c(NightVal, NightMean)
NightStd <- c(NightStd, NightSD)
DayVal <- c(gsrDayVal, DayMean)
DayStd <- c(gsrDayStd, DaySD)
date <-c(date, as.Date(today))
}
df<-data.frame(date,DayVal,DayStd,NightVal, NightStd)
# plot for the subdirectory
p1 <- ggplot() +
geom_point(data = df, aes(x = date, y = gsrDayVal, color = "Day Average")) +
geom_point(data = df, aes(x = date, y = gsrDayStd, color = "Day Standard Dev")) +
geom_point(data = df, aes(x = date, y = gsrNightVal, color = "Night Average")) +
geom_point(data = df, aes(x = date, y = gsrNightStd, color = "Night Standard Dev")) +
scale_colour_manual(values = c("steelblue", " turquoise3", "purple3", "violet"))
}
Thanks very much for any advice you can offer!

Consider an SQL database solution as you manage quite a bit of data in flatfiles. A Relational Database Management System (RDMS) can easily handle millions of records, even aggregate as needed using its scalable db engine rather than processing in memory per R. If not for speed and efficiency, databases can provide security, robustness, and organization as the central repository. Even work a script to import each daily csv thereafter directly into database.
Fortunately, practically all RDMS have CSV handlers and can load mulitple files in bulk. Below are open source solutions: SQLite (file level database), MySQL, and PostgreSQL (both server level databases), all of which have corresponding libraries in R. Each example recursively imports a csv file from directory list of files into database table named timeseriesdata (with same named fields/data types as csv files). At the end is one SQL call to import an aggregation of Night and Day interval mean and standard deviation (adjust as needed). The only challenge is designating a file and subdirectory indicator (which may or may not exist in actual data) and appending with csv files (possibly after each iteration, run an update query to a FileID column).
dir <- list.dirs(path = "/ParentDirectory",
full.names = TRUE, recursive = FALSE)
# SQLITE DATABASE
library(RSQLite)
sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
# (CONNECTION NOT NEEDED DUE TO CMD LINE LOAD BELOW)
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT VIA COMMAND LINE OR BASH (ASSUMES SQLITE3 IS PATH VARIABLE)
cmd <- paste0("(echo .separator ,; echo .import ' ", csvfile , " ' timeseriesdata ')",
" '| sqlite3 ' /path/to/databasename.db")
system(cmd)
}
}
# CLOSE CONNNECTION
dbDisconnect(sqconn)
# MYSQL DATABASE
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT USING LOAD DATA INFILE COMMAND
sql <- paste0("LOAD DATA INFILE '", csvfile, "'
INTO TABLE timeseriesdata
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(col1, col2, col3, col4, col5);")
dbSendQuery(myconn, sql)
dbCommit(myconn)
}
}
# CLOSE CONNECTION
dbDisconnect(myconn)
# POSTGRESQL DATABASE
library(RPostgreSQL)
pgconn <- dbConnect(PostgreSQL(), dbname="databasename", host="myhost",
user= "postgres", password="***")
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT USING COPY COMMAND
sql <- paste("COPY timeseriesdata(col1, col2, col3, col4, col5)
FROM '", csvfile , "' DELIMITER ',' CSV;")
dbSendQuery(pgconn, sql)
}
}
# CLOSE CONNNECTION
dbDisconnect(pgconn)
# CREATE PLOT DATA FRAME (MYSQL EXAMPLE)
# (ADD INSIDE SUBDIRECTORY LOOP OR INCLUDE SUBDIR COLUMN IN GROUP BY)
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
# AGGREGATE QUERY USING TWO DERIVED TABLE SUBQUERIES
# (FOR NIGHT AND DAY, ADJUST FILTERS PER NEEDS)
strSQL <- "SELECT dt.FileID, NightMean, NightSTD, DayMean, DaySTD
FROM
(SELECT nt.FileID, Avg(nt.time) As NightMean, STDDEV(nt.time) As NightSTD
FROM timeseriesdata nt
WHERE nt.time >= '15:00:00' AND nt.time <= '21:00:00'
GROUP BY nt.FileID
HAVING Sum(nt.val) < 5) AS ng
INNER JOIN
(SELECT dt.FileID, Avg(dt.time) As DayMean, STDDEV(dt.time) As DaySTD
FROM timeseriesdata dt
WHERE dt.time >= '03:00:00' AND dt.time <= '09:00:00'
GROUP BY dt.FileID
HAVING Sum(dt.val) > 5) AS dy
ON ng.FileID = dy.FileID;"
df <- dbSendQuery(myconn, strSQL)
dbFetch(df)
dbDisconnect(myconn)

One thing would be to do the conversion of day$time once instead of all the times you are doing it now. Also use the lubridate package because if you have a large number of times to convert, it is much faster than 'as.POSIXct'.
Also size the variables you are storing results in, e.g., DayVal, DayStd, to the approriate size (DayVal <- numeric(num)) and then index the result into the appropriate index.
If the CSV files are large, consider using the 'fread' function in data.table package.

How to import from SQLite database?

I have an SQLite database file exported from Scraperwiki with .sqlite file extension. How do I import it into R, presumably mapping the original database tables into separate data frames?

You could use the RSQLite package.
Some example code to store the whole data in data.frames:
library("RSQLite")
## connect to db
con <- dbConnect(drv=RSQLite::SQLite(), dbname="YOURSQLITEFILE")
## list all tables
tables <- dbListTables(con)
## exclude sqlite_sequence (contains table information)
tables <- tables[tables != "sqlite_sequence"]
lDataFrames <- vector("list", length=length(tables))
## create a data.frame for each table
for (i in seq(along=tables)) {
lDataFrames[[i]] <- dbGetQuery(conn=con, statement=paste("SELECT * FROM '", tables[[i]], "'", sep=""))
}

To anyone else that comes across this post, a nice way to do the loop from the top answer using the purr library is:
lDataFrames <- map(tables, ~{
dbGetQuery(conn=con, statement=paste("SELECT * FROM '", .x, "'", sep=""))
})
Also means you don't have to do:
lDataFrames <- vector("list", length=length(tables))

Putting together sgibb's and primaj's answers, naming tables, and adding facility to retrieve all tables or a specific table:
getDatabaseTables <- function(dbname="YOURSQLITEFILE", tableName=NULL){
library("RSQLite")
con <- dbConnect(drv=RSQLite::SQLite(), dbname=dbname) # connect to db
tables <- dbListTables(con) # list all table names
if (is.null(tableName)){
# get all tables
lDataFrames <- map(tables, ~{ dbGetQuery(conn=con, statement=paste("SELECT * FROM '", .x, "'", sep="")) })
# name tables
names(lDataFrames) <- tables
return (lDataFrames)
}
else{
# get specific table
return(dbGetQuery(conn=con, statement=paste("SELECT * FROM '", tableName, "'", sep="")))
}
}
# get all tables
lDataFrames <- getDatabaseTables(dbname="YOURSQLITEFILE")
# get specific table
df <- getDatabaseTables(dbname="YOURSQLITEFILE", tableName="YOURTABLE")

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

dbFetch, write to another table, and dbFetch again? - r

Related

Paste variable in RMariaDB dbGetQuery 'where clause' [1054]

R for loop is only storing one result?

Looping code for multiple db3 files in R

Speed up R script looping through files/folders to check thresholds, calculate averages, and plot

How to import from SQLite database?

Categories

Resources