Memory issues with an in-memory data frame. Best approach to write output? - r

I’m running into memory issues in my R script while processing a huge folder. I have to perform several operations per file and then output one row per file into my results data frame.
Sometimes the resulting data frame has hundreds of rows pasted together in one row, as if the output got stuck on the same line (it seems that rbind does not work properly when the load is huge).
I think the issue arises from keeping a temporary data frame in memory to append results to, so I’m taking another approach:
A loop that reads every file one by one, processes it, then opens a connection to the results file, writes a line, closes the connection and moves on to the next file. It came to mind that avoiding a big data frame in memory and writing immediately to file could solve my issues.
I assume this is very inefficient, so my question is: is there another way of efficiently appending output line by line, instead of binding to an in-memory data frame and writing to disk at the end?
I’m versed in the many options (sink, cat, writeLines, ...); my doubt is which one to use to avoid conflicts and be the most efficient given the conditions.

I have been using the following snippet:
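library(data.table)
# `dir` holds the input folder; full.names = TRUE returns paths that fread can open from any working directory
filepaths <- list.files(dir, full.names = TRUE)
resultFilename <- "/path/to/resultFile.txt"
for (i in seq_along(filepaths)) {
  # read only the i-th file
  content <- fread(filepaths[i], header = FALSE, sep = ",")
  ### some manipulation for the content
  results <- content[1]
  # append the single result row to the results file
  fwrite(results, resultFilename, col.names = FALSE, quote = FALSE, append = TRUE)
}
# read the accumulated results back in one go
finalData <- fread(resultFilename, header = FALSE, sep = ",")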
In my use case, for ~2000 files and tens of millions of rows, the processing time decreased by over 95% compared with read.csv and incrementally growing a data.frame in the loop. As you can see in https://csgillespie.github.io/efficientR/importing-data.html section 4.3.1 and https://www.r-bloggers.com/fast-csv-writing-for-r/, fread and fwrite are very fast data I/O functions.
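If opening and closing the results file on every iteration is the worry, a base R alternative is to keep a single connection open for the whole loop. This is only a minimal sketch: `append_results` and `process_file` are hypothetical names, and it assumes each file reduces to one atomic vector that can be written as a comma-separated line.

```r
# Process files one at a time and write one result line per file through a
# single connection that stays open for the whole loop.
# `process_file` is a hypothetical stand-in for the per-file work.
append_results <- function(filepaths, result_path, process_file) {
  con <- file(result_path, open = "w")  # use open = "a" to extend an existing file
  on.exit(close(con))                   # closed once, even if a file fails
  for (path in filepaths) {
    row <- process_file(path)           # expected: one atomic vector per file
    writeLines(paste(row, collapse = ","), con)
  }
  invisible(length(filepaths))
}
```

Only one file's data is in memory at any time, and nothing is bound into a growing data frame.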

Related

R Session Aborted When Reading Large Dataset

I need to read ~20,000 csv files (~500GB), then filter the data and bind them together. My code works when I only read ~15,000 files, but it prompts 'R session aborted' when I read ~20,000 files.
memory.limit(80000)
library(data.table)
library(dplyr)
ReadCustomer = function(x)
  fread(x, encoding = "UTF-8", select = c("customer_sysno", "event_cat2")) %>%
    filter(event_cat2 == "***") %>%
    select(customer_sysno) %>%
    rename(CustomerSysNo = customer_sysno) %>%
    mutate(CustomerSysNo = as.numeric(CustomerSysNo)) %>%
    filter(CustomerSysNo > 0)
CustomerData = rbindlist(lapply(FileList, ReadCustomer))
I tried replacing fread(x, encoding = "UTF-8", select = c("customer_sysno", "event_cat2")) by spark_read_csv(sc, "Data", x), but sparkR still didn't work.
How can I read all the files? Will Rcpp help?
Do you know how many rows you get back from each file? You don't say.
You're essentially posing this problem as a straightforward filtering exercise; you want only the customer_sysno column where certain conditions are met. What you then want to do with this will influence whether you even want to merge them all together.
I propose opening an output file and appending each new output to it. Then you've got a local file containing all your desired customer_sysno values. You can then walk through or sample that as suits your use case.
If the rows where your event_cat2 condition is met are actually a small subset of each file, and each file is big, then another approach would be to readLines your way through them, maybe in conjunction with appending results to an output file. This is basically asking R to do a job that (g)awk is awesome at, so that might be a useful preprocessing step to get you the desired data.
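A rough sketch of the readLines idea in base R, under some loud assumptions: the files are simple CSVs with a header row and no quoted fields containing commas, and the function names `filter_file` and `filter_files` are made up for illustration.

```r
# Stream one CSV in chunks, keep rows where `match_col` equals `match_value`,
# and append the `keep_col` field of those rows to an open output connection.
# Assumes plain CSVs: a header row, no quoted commas, no embedded newlines.
filter_file <- function(path, out_con, match_col, match_value, keep_col,
                        chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  first <- readLines(con, n = 1L)
  if (length(first) == 0L) return(invisible(NULL))  # empty file
  header <- strsplit(first, ",", fixed = TRUE)[[1]]
  match_idx <- which(header == match_col)
  keep_idx <- which(header == keep_col)
  stopifnot(length(match_idx) == 1L, length(keep_idx) == 1L)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break                  # end of file
    fields <- strsplit(lines, ",", fixed = TRUE)
    hit <- vapply(fields, function(f) identical(f[match_idx], match_value),
                  logical(1))
    if (any(hit)) {
      writeLines(vapply(fields[hit], function(f) f[keep_idx], character(1)),
                 out_con)
    }
  }
  invisible(NULL)
}

# Run the filter over many files, writing all hits to one output file.
filter_files <- function(paths, out_path, match_col, match_value, keep_col) {
  out_con <- file(out_path, open = "w")
  on.exit(close(out_con))
  for (p in paths) filter_file(p, out_con, match_col, match_value, keep_col)
  invisible(out_path)
}
```

For the question's data the call would look like filter_files(FileList, "customers.txt", "event_cat2", "***", "customer_sysno"), where the output file name is again just an example. Memory use is bounded by chunk_size rather than by file size or file count.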

R - running out of RAM with write.table in a loop

Every day I'm parsing around 700 MB from the web at one-second intervals with a VB script. The procedure creates around 13,000 files daily.
With R I'm trying to put those files into databases. To do that I created a for loop that goes through all the files I've downloaded and writes them into databases stored in a directory.
At each iteration I have the following code:
rm(list=c('var1', 'var2'))
unlink(file)
gc()
which I hoped to solve the problem. It didn't.
Within the main loop I have inner loop to save the files after being read.
for (i in seq_along(listofallfiles)) {
  # (here goes code to parse data out of files and store them in var1, var2, etc.)
  file = paste(path, "\\", l[i], sep = "")
  txt = readLines(file, skipNul = TRUE)
  html = htmlTreeParse(txt, useInternalNodes = TRUE)
  name = xpathSApply(html, "//td/div/span[starts-with(@class, 'name')]", xmlValue)
  # (then come many more var2, var3 that are based on xpathSApply)
  for (j in seq_along(name)) {
    final_file = paste(direction, "\\", name[j], ".csv", sep = "")
    if (file.exists(final_file)) {
      write.table(t(as.matrix(temp[j,])), file = final_file, row.names = FALSE, append = TRUE, col.names = FALSE)
    } else {
      file.create(final_file, showWarnings = FALSE)
      write.table(t(as.matrix(temp[j,])), file = final_file, row.names = FALSE, append = TRUE)
    }
  }
}
THE PROBLEM
When I open Task Manager I see that the memory usage of RStudio is around 90% after reading merely 50% of the files from one day. That means I won't be able to create even one database for one day. RAM usage at 55% is around 4.2 GB.
It's even stranger because the size of the databases created in the directory is around 40MB only!
QUESTION
Is there any way to build such a database with R?
I've chosen write.table, but it can be any function that gives me output I can store in an iterative manner (that is, a function that can append data to an existing file).
If not in R - in what programming language then?
EDIT
Database - for now it's planned as a flat file (csv). This was confusing. The goal is to store the data in any way that is possible and efficient for reading into R again (not using too much RAM).
file - these are HTML files, that's why I'm using xpathSApply. One file is roughly 28KB.
SOLUTION
My solution to the problem was to create an outer loop that reads the data in chunks. After each iteration of the loop I put
.rs.restartR()
which solved the problem.
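As a side note on the "function that can append data to an existing file" part: the if/else around write.table in the inner loop can be folded into one call that writes the header only when the file does not exist yet. A minimal sketch, where `append_row` is a hypothetical helper name and a comma separator is assumed:

```r
# Append one row to a CSV file, writing the header only on the first write.
# `row_df` is expected to be a one-row data frame (or matrix).
append_row <- function(row_df, path) {
  first_write <- !file.exists(path)
  write.table(row_df, file = path, sep = ",", row.names = FALSE,
              col.names = first_write,  # header only for a new file
              append = !first_write)    # append only to an existing file
}
```

This does not by itself address the memory growth described in the question; it only tidies the append step.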

R script, programmatically batch import multiple csv files as list of data frames (solution)

I'm relatively new to R but experienced in traditional programming languages (e.g., C, Java). I've recently run into the situation where I had so many data files to load that I was spending almost as much time on that one task as I was on the actual analysis. I spent a little time googling this but didn't run across any solutions that I found directly relevant (I might have missed something, I'm impatient that way). Despite that I came up with a simple solution to my problem that I wanted to share with the community in case anyone else found themselves in similar circumstances.
A bit of background info: The data I'm analyzing is real-time performance and diagnostic metrics for an experimental system that is driven by real-time data feeds (i.e., complicated). The upshot is that between trials filenames don't change and the data is written out directly to csv files (I wrote the logging code so I get to be my own best friend like that ;). There are dozens of files generated during a single trial and we have potentially hundreds of trials to look forward to.
I had a few ideas and after playing around with the code a bit I came up with the following solution:
# Create mapping that associates files with a handle that the loader will use to
# generate a named list of data frames (don't even try this on the cmdline)
createDataFileMapping <- function() {
  list(
    c(file = "file1.csv", descr = "descriptor1"),
    c(file = "file2.csv", descr = "descriptor2")
    # ... (further file/descriptor pairs)
  )
}
# Batch load csv files and return as list of data frames
loadTrialData <- function(load.dir, mapping) {
  dfList <- list()
  for (item in mapping) {
    file <- paste(load.dir, item[["file"]], sep = "/")
    df <- read.csv(file)
    # store each data frame under its descriptor
    dfList[[ item[["descr"]] ]] <- df
  }
  return(dfList)
}
Invoking is as simple as loadTrialData("~/data/directory", createDataFileMapping()).
I'm sure there are other ways to solve this problem but the above gets the job done in my case. I'm sure this is slightly less memory-efficient than loading the files directly into data frames in the global environment, and the syntax for passing individual data frames to analysis/plotting functions isn't as elegant as it could be, but I'm not choosy. If you have a more flexible/generalizable solution then please don't hesitate to post!
What you have is sound, I would add only two comments:
Don't worry about extra memory usage; assuming the data frames are of nontrivial size, you won't lose much by putting them in a big list.
You might add ... as an argument to your function and pass it through to read.csv, so that if another user needs to specify extra arguments because their file wasn't in quite the same format (or wants stringsAsFactors=FALSE or something) then they have the flexibility to do that.
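The second suggestion could look roughly like this. It is the question's loader with `...` forwarded to read.csv; the use of file.path instead of paste is my own substitution.

```r
# Batch load csv files and return them as a named list of data frames.
# Any extra arguments are passed straight through to read.csv().
loadTrialData <- function(load.dir, mapping, ...) {
  dfList <- list()
  for (item in mapping) {
    path <- file.path(load.dir, item[["file"]])
    dfList[[ item[["descr"]] ]] <- read.csv(path, ...)
  }
  dfList
}
```

A caller could then write, for example, loadTrialData("~/data/directory", createDataFileMapping(), stringsAsFactors = FALSE) without the loader needing to know about that argument.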

Read, process and export analysis results from multiple .csv files in R

I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same name as the file they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files (the pattern is a regular expression, so escape the dot)
filelist <- list.files(pattern = "\\.csv$")
# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
  # Read in the file specified by 'x' - an entry in filelist
  data.df <- read.csv(x, skip = 1, header = TRUE)
  # Store the filename, minus .csv. This will be important later.
  filename <- substr(x = x, start = 1, stop = (nchar(x)-4))
  # Your analysis work goes here. You only have to write it out once
  # to perform it on each individual file.
  # ...
  # Eventually you'll end up with a data frame or a vector of analysis
  # to write out. Great! Since you've kept the value of x around,
  # you can do that trivially
  write.table(x = data_to_output,
              file = paste0(filename, "_analysis.csv"),
              sep = ",")
})
And done.
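To make the pattern concrete, here is a small runnable sketch with a placeholder analysis filled in. The analysis itself (column means of the numeric columns) and the name `analyse_folder` are assumptions for illustration only, and unlike the code above it does not skip a first line.

```r
# Run the same analysis on every CSV in a folder and write one
# "<name>_analysis.csv" per input file. The analysis here (column means of
# numeric columns) is only a placeholder.
analyse_folder <- function(dir = ".") {
  filelist <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  # Skip outputs left over from an earlier run
  filelist <- filelist[!grepl("_analysis\\.csv$", filelist)]
  out_paths <- vapply(filelist, function(x) {
    data.df <- read.csv(x)
    numeric_cols <- vapply(data.df, is.numeric, logical(1))
    result <- data.frame(
      column = names(data.df)[numeric_cols],
      mean = colMeans(data.df[, numeric_cols, drop = FALSE])
    )
    out <- sub("\\.csv$", "_analysis.csv", x)
    write.csv(result, out, row.names = FALSE)
    out
  }, character(1))
  unname(out_paths)
}
```

The function returns the paths of the files it wrote, so nothing but file names accumulates in the session.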
You can try the following code after putting all the csv files in the same directory.
names = list.files(pattern = "\\.csv$") # csv file names
for (i in seq_along(names)) { assign(names[i], read.csv(names[i], skip = 1, header = TRUE)) }
Hope this helps!

Append new data to an existing dataframe (RDS) in R

I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.
saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame
On the second... nth iteration, I have the code only process the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.
df2 <- readRDS("H:/Documents/tweet.df.rds") #read in permanent
tmp.df2 <- rbind(df2, tmp.df) #append new to existing
saveRDS(tmp.df2, file="H:/Documents/tweet.df.rds") #save it
rm(df2) #housecleaning
rm(tmp.df2) #housecleaning
This approach is risky because whenever the RDS is open for reading/writing, another process wanting to touch that file has to wait. As the base file gets bigger, the risk increases.
Is there something like an appendRDS (I know literally there isn't) that can achieve what I want: iterative updating of a single data frame, saved to a file, that uses appending rather than complete replacement?
I think you can safeguard your process by using connections, opening and closing it before the next process takes over.
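# saveRDS compresses by default, so read through a gzfile connection
con <- gzfile("tmp.rds", "rb")
df <- readRDS(con)
close(con)
df.new <- rbind(df, df)
# reopen the file for writing, save, and close again right away
con <- gzfile("tmp.rds", "wb")
saveRDS(df.new, con)
close(con)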
Update:
You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.
while (isOpen(con)) { # untested but something of this nature should work
  Sys.sleep(2)
}
Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file, since data frames are simply lists of columns, so presumably they are serialized one column at a time, so only the last column ends near the end of the file.
If you want to stick with a single file but minimize the risk of reading inconsistent data from an RDS file, you can read it in, do the append operation, and then write it out to a temp file and rename the temp file to the original name once it is finished. Then at least your period of risk is not dependent on the size of the file. I'm not familiar with what kind of atomicity is guaranteed by various filesystems when renaming a file to an existing name, but it's probably better than the time taken by saveRDS.
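The temp-file-and-rename idea might be sketched like this. `append_rds` is a hypothetical helper; the temp file is created in the same directory as the target so that the rename stays on one filesystem, and whether the rename is atomic still depends on the platform.

```r
# Append rows to a data frame stored in an RDS file by writing the combined
# result to a temp file first and then renaming it over the original.
append_rds <- function(new_rows, path) {
  combined <- if (file.exists(path)) rbind(readRDS(path), new_rows) else new_rows
  # Same directory as the target, so file.rename() does not cross filesystems
  tmp <- tempfile(pattern = "append_", tmpdir = dirname(path), fileext = ".rds")
  saveRDS(combined, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)
    stop("could not replace ", path)
  }
  invisible(nrow(combined))
}
```

The whole file is still rewritten on every call, so this narrows the window in which a reader could see a half-written file; it does not make the append itself cheaper.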
