Improving Performance with Large Data Sizes in RMySQL & R

I need to create a data downloader which may be asked to fetch anywhere from a few thousand to 30-40 million records.
In RMySQL, I currently fetch a constant chunk of 4000 records in every round using the code below:
rs <- dbSendQuery(con, query)
totRow <- 0
resultDF <- data.frame()
while (!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs, 4000)
  resultDF <- rbind(resultDF, chunk)
  totRow <- totRow + nrow(chunk)
  # Status print
  if (totRow %% 100000 == 0) {
    print(paste(OutFile, format(totRow, scientific = FALSE)))
  }
}
dbClearResult(rs)
Since I am doing an rbind on every iteration, this method slows down horribly for large data sizes. What would be a better approach? Does RMySQL support returning how many records are present in the result?
I was thinking of performing the fetch with a dynamic chunk size that keeps growing with every fetch, just like a C++ vector. Are there any better alternatives here?
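One common workaround (a minimal sketch, untested against this schema) is to stop growing resultDF with rbind() inside the loop, collect each chunk in a list instead, and bind once at the end:
rs <- dbSendQuery(con, query)
chunks <- list()
i <- 0
totRow <- 0
while (!dbHasCompleted(rs)) {
  i <- i + 1
  chunks[[i]] <- dbFetch(rs, 4000)
  totRow <- totRow + nrow(chunks[[i]])
}
dbClearResult(rs)
# bind once at the end; data.table::rbindlist(chunks) or dplyr::bind_rows(chunks) are faster still
resultDF <- do.call(rbind, chunks)
Because each chunk is copied only once in the final bind, the cost stays roughly linear in the number of rows instead of growing quadratically.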

Related

R: how to query from a database multiple times based on different dates

I have an R Markdown file in which I establish a database connection, query data, and store the data in a csv file. The query is based on a specific date range. How can I automatically make multiple queries, one after another, so that e.g. every week is queried from the database? I cannot make one query for e.g. the whole year; I need to store the data separately for each week. I could make a data frame with two columns for the start and end dates, which I would like to use for the query.
But how can I automatically run the queries multiple times based on that date data frame?
My code so far:
# load libraries
library(RPostgreSQL)   # provides the PostgreSQL() driver
library(sf)            # provides st_write()

drv <- PostgreSQL()
db_con <- dbConnect(drv, host = my_host, user = my_user, dbname = my_name,
                    port = my_port, password = my_password)

start <- "2015-01-01"
end <- "2015-01-02"

result <- dbGetQuery(
  db_con,
  "SELECT * FROM table WHERE date >= start AND date <= end;")

st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
  PostgreSQL(), host = my_host, user = my_user, dbname = my_name,
  port = my_port, password = my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
  start = seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by = "week"),
  end   = seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by = "week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
  # PREPARED STATEMENT WITH PLACEHOLDERS
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
  # BIND PARAMETERS AND QUERY
  stmt <- DBI::sqlInterpolate(db_con, sql, start = s, end = e)
  result <- dbGetQuery(db_con, stmt)
  # WRITE DATA TO DISK
  st_write(result, pathname)
  # RETURN QUERY RESULTSET
  return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
I suggest DBI::dbBind and a frame (similar to #Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
  start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like the above, and it allows the DBMS to optimize the query around the binding parameter ? instead of actual data. (With sqlInterpolate, the DBMS would see four different queries; with dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that while it makes it really easy to get all of the data, nothing here inherently tells you which of your queries a particular row came from. I suspect you can determine that from your data, and that your main intent in breaking it down by week is to limit the amount retrieved per query.
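If you do need to know which week produced each row, one variation (a minimal sketch, untested) is to re-bind the same prepared statement one parameter set at a time and tag the rows yourself; this assumes a DBI backend that supports re-binding, as RPostgres does:
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
out_list <- Map(function(s, e) {
  DBI::dbBind(res, list(s, e))                 # re-bind the same prepared statement per week
  chunk <- DBI::dbFetch(res)
  if (nrow(chunk) > 0) chunk$week_start <- s   # tag rows with the week they came from
  chunk
}, ranges$start, ranges$end)
DBI::dbClearResult(res)
out <- do.call(rbind, out_list)
That gives up the single internal iteration dbBind provides, so it is only worth it if per-week provenance actually matters.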
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
  # ...
  res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
  on.exit({
    suppressWarnings(DBI::dbClearResult(res))
  }, add = TRUE)
  DBI::dbBind(res, ranges)
  out <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
  return(out)
}

How to efficiently XML parse in R

I use R to parse XML data from a website. I have a list of 20,000 rows with URLs from which I need to extract data. I have code that gets the job done using a for loop, but it is very slow (it takes approx. 12 hours). I thought of using parallel processing (I have access to several CPUs) to speed it up, but I cannot make it work properly. Would it be more efficient to use a data table instead of a data frame? Is there any way to speed the process up? Thanks!
for (i in 1:nrow(list)) {
  t <- xmlToDataFrame(xmlParse(read_xml(list$path[i]))) # read the data into a file
  t$ID <- list$ID[i]
  emptyDF <- bind_rows(all, t) # bind all into one file
  if (i / 10 == floor(i / 10)) {
    print(i)
  } # print every 10th value to monitor progress of the loop
}
This script should point you in the correct direction:
t <- list()
for (i in 1:nrow(list)) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i])) # read the data into a file
  tempdf$ID <- list$ID[i]
  t[[i]] <- tempdf
  if (i %% 10 == 0) {
    print(i)
  } # print every 10th value to monitor progress of the loop
}
answer <- bind_rows(t) # bind all into one file
Instead of a for loop, an lapply would also work here. Without any sample data, this is untested.
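For reference, a minimal sketch (also untested) of the same idea with lapply, which makes it straightforward to swap in parallel::mclapply() or parLapply() once the serial version works; the list data frame and its path/ID columns are taken from the question:
library(XML)     # xmlToDataFrame, xmlParse
library(dplyr)   # bind_rows

parse_one <- function(i) {
  tempdf <- xmlToDataFrame(xmlParse(list$path[i]))
  tempdf$ID <- list$ID[i]
  tempdf
}

t <- lapply(seq_len(nrow(list)), parse_one)
# parallel variant on Unix-alikes (pick a sensible core count):
# t <- parallel::mclapply(seq_len(nrow(list)), parse_one, mc.cores = 4)
answer <- bind_rows(t)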

How to set counter size (total number of read lines) while reading large dataset using con in R

In the code below, I am using con to read through large data matrix files (sample.files). This loop reads each file in chunks as it loops over sample.files. Is there a way to set up this loop so that once it has read a certain total number of lines (say 500,000), counted across all chunks and all files, it executes the 'Do this job' part? Can anyone please suggest how to do this?
for (isamp in 1:length(sample.files)) {
  con <- file(sample.files[isamp], open = "r")
  num.lines <- 1  # so the while loop runs at least once
  reads <- max.reads
  counter <- -1
  while (num.lines > 0) {
    counter <- counter + 1
    print(paste0("Reading chunk ", counter))
    if (counter == 0) {
      mydata <- try(scan(con, what = character(num.vars), skip = (reads * counter) + skip.lines,
                         nlines = reads, sep = "\t", fill = TRUE, na.strings = "", quote = "\""))
    } else {
      mydata <- try(scan(con, what = character(num.vars), nlines = reads,
                         sep = "\t", fill = TRUE, na.strings = "", quote = "\""))
    }
    num.lines <- length(mydata) / num.vars
    print(num.lines)
    if (num.lines == 0) { next }
    dim(mydata) <- c(num.vars, num.lines)
    mydata <- t(mydata)
    colnames(mydata) <- column.labels
    # If (total accumulated rows reach 500,000 lines) {
    #   'Do this job'
    # }
  }
  close(con)
}
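One way to do it (a minimal sketch, untested, reusing the num.vars, max.reads, skip.lines and column.labels objects from the question) is to keep a running row total that survives both the chunk loop and the file loop, buffer the chunks, and fire the job each time the total reaches 500,000 rows:
target.rows <- 500000
buffer <- list()
total.rows <- 0

for (isamp in 1:length(sample.files)) {
  con <- file(sample.files[isamp], open = "r")
  readLines(con, n = skip.lines)   # consume the header lines once per file
  repeat {
    mydata <- try(scan(con, what = character(num.vars), nlines = max.reads,
                       sep = "\t", fill = TRUE, na.strings = "", quote = "\""))
    num.lines <- length(mydata) / num.vars
    if (num.lines == 0) break
    dim(mydata) <- c(num.vars, num.lines)
    buffer[[length(buffer) + 1]] <- t(mydata)   # keep the chunk
    total.rows <- total.rows + num.lines
    if (total.rows >= target.rows) {
      block <- do.call(rbind, buffer)
      colnames(block) <- column.labels
      # 'Do this job' on block here
      buffer <- list()
      total.rows <- 0
    }
  }
  close(con)
}
# any rows still left in buffer after the last file can be processed here as well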

rmongodb is very slow in creating data.frame

I am using MongoDB to do tick data analysis in R. Initially I used MySQL, which worked fine, but I wanted to test MongoDB for this purpose. The data set currently contains about 200 million entries. Using RODBC I could get the query result into a data.frame very quickly using sqlQuery(conn, "select * from td where prd = 'TY' and date = '2012-01-03'")
In MongoDB I have documents like Document{{_id=5537ca647a3ad42a84374f0a, prd=TY, time=1325661600043, px=130.6875, sz=11}}
In Java I can retrieve a day's worth of tick data (roughly 100,000 entries), create Tick objects and add them to an array, all in less than 2 seconds.
Using rmongodb, the code below takes forever. Any ideas how to improve this?
query <- mongo.bson.from.list( list(product = "TY", date = as.POSIXct("2012-01-04")) )
res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3, options=mongo.find.exhaust)
resdf <- mongo.cursor.to.data.frame(res.cursor)
Using find.all is equally slow.
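One workaround worth trying (a minimal sketch, untested): mongo.cursor.to.data.frame builds the data.frame row by row, so walking the cursor yourself and binding once at the end is usually much faster. This sketch assumes rmongodb's cursor functions (mongo.cursor.next, mongo.cursor.value, mongo.bson.to.list) together with data.table::rbindlist:
library(rmongodb)
library(data.table)

res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3)
rows <- list()
i <- 0
while (mongo.cursor.next(res.cursor)) {
  i <- i + 1
  rows[[i]] <- mongo.bson.to.list(mongo.cursor.value(res.cursor))
}
mongo.cursor.destroy(res.cursor)
resdf <- rbindlist(rows, fill = TRUE)   # one bind instead of 100,000 incremental rbinds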

Split Neo4j Cypher query into smaller queries

So I'm trying to extract some data from my Neo4j database to a file using R.
This is what the code looks like:
library('bitops')
library('RCurl')
library('RJSONIO')

query <- function(querystring) {
  h = basicTextGatherer()
  curlPerform(url = "localhost:7474/db/data/cypher",
              postfields = paste('query', curlEscape(querystring), sep = '='),
              writefunction = h$update,
              verbose = FALSE
  )
  result <- fromJSON(h$value())
  #print(result)
  data <- data.frame(t(sapply(result$data, unlist)))
  print(data)
  names(data) <- result$columns
  data
}
q <-"MATCH (n:`layer_1_SB`)-[r]-> (m) WHERE m:layer_1_SB RETURN n.userid, m.userid LIMIT 18000000"
data <- query(q)
head(data)
dim(data)
names(data)
write.table(data, file = "/home/dataminer/data1.dat", append=FALSE,quote=FALSE,sep=" ",eol="\n", na="NA", dec=".", row.names=FALSE)
And it works fine, returning around 147k relationships. However, when I run the same query between two different labels (layer_1 to layer_2), which should return around 18 million relationships, the program loads for a while and then returns NULL. When I run the same query returning only the count in the Neo4j browser, it works, so I'm assuming the problem has to do with the amount of data that R can handle.
The question is: How can I split my query into smaller queries so that my code works?
UPDATE
I tried doing a query with 10 million rels and it worked. So now I want to use WITH and ORDER BY to return the first and then the last relationships. However it's returning NULL; I believe my query is badly formatted:
MATCH (n:'layer_1_SB')-[r]-> (m) WITH n ORDER BY n.userid DESC WHERE m:layer_2_SB RETURN n.userid, m.userid LIMIT 8000000
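For reference, a minimal sketch (untested) of what paging with ORDER BY plus SKIP/LIMIT could look like, reusing the query() helper above; note that after WITH n the variable m is no longer in scope, so the filter is kept in the MATCH ... WHERE part here:
page.size <- 1000000
pages <- list()
i <- 0
repeat {
  q <- paste0("MATCH (n:`layer_1_SB`)-[r]->(m) WHERE m:layer_2_SB ",
              "RETURN n.userid, m.userid ORDER BY n.userid ",
              "SKIP ", i * page.size, " LIMIT ", page.size)
  page <- query(q)
  if (is.null(page) || nrow(page) == 0) break
  pages[[i + 1]] <- page
  i <- i + 1
}
data <- do.call(rbind, pages)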
You should use the transactional endpoint instead or at least pass the header X-Stream:true.
Both stream data from the server so it doesn't eat up its memory.
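A minimal sketch (untested) of the existing curlPerform call with the X-Stream header added; the rest of the query() function from the question stays the same:
h <- basicTextGatherer()
curlPerform(url = "localhost:7474/db/data/cypher",
            httpheader = c("X-Stream" = "true"),   # ask the server to stream the result
            postfields = paste('query', curlEscape(querystring), sep = '='),
            writefunction = h$update,
            verbose = FALSE)
result <- fromJSON(h$value())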
