Every day I parse around 700 MB from the web at one-second intervals with a VB script. The procedure creates around 13,000 files daily.
With R I'm trying to put those files into databases. To achieve that, I created a for loop that goes through all the files I've downloaded and writes them into databases stored in a directory.
At each iteration I have the following code:
rm(list = c('var1', 'var2'))  # drop the parsed variables
unlink(file)                  # delete the source file from disk
gc()                          # ask R to release the memory
which I hoped would solve the problem. It didn't.
Within the main loop I have an inner loop that saves the files after they are read.
for (i in seq_along(listofallfiles)) {
  # read and parse one HTML file, storing the results in var1, var2, etc.
  file = paste(path, "\\", listofallfiles[i], sep = "")
  txt  = readLines(file, skipNul = TRUE)
  html = htmlTreeParse(txt, useInternalNodes = TRUE)
  name = xpathSApply(html, "//td/div/span[starts-with(@class, 'name')]", xmlValue)
  # (many more variables, var2, var3, ..., are built with xpathSApply in the same way)

  for (j in seq_along(name)) {
    final_file = paste(direction, "\\", name[j], ".csv", sep = "")
    if (file.exists(final_file)) {
      write.table(t(as.matrix(temp[j, ])), file = final_file,
                  row.names = FALSE, append = TRUE, col.names = FALSE)
    } else {
      file.create(final_file, showWarnings = FALSE)
      write.table(t(as.matrix(temp[j, ])), file = final_file,
                  row.names = FALSE, append = TRUE)
    }
  }
}
THE PROBLEM
When I open Task Manager I see that RStudio's memory usage is around 90% after reading merely 50% of the files from one day. That means I won't be able to create even one database for a single day. With 55% of the files read, RAM usage is around 4.2 GB.
It's even stranger because the databases created in the directory total only around 40 MB!
QUESTION
Is there any way to build such a database with R?
I've chosen write.table, but it could be any function that lets me store output iteratively (i.e. a function that can append data to an existing file).
If not in R, then in what programming language?
EDIT
Database - for now it's planned as a flat file (CSV). That part was confusing. The goal is to store the data in any way that is possible and efficient to read back into R (without using too much RAM).
file - these are HTML files, which is why I'm using xpathSApply. One file is roughly 28 KB.
SOLUTION
My solution was to create an outer loop that reads the data in chunks. After each iteration of the loop I put
.rs.restartR()
which solved the problem.
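For reference, a rough sketch of that structure (the chunk size and the process_chunk() wrapper around the parsing loop above are assumptions; .rs.restartR() is only available when running inside RStudio):
# split the full file list into chunks (a chunk size of 1000 is an arbitrary assumption)
chunks <- split(listofallfiles, ceiling(seq_along(listofallfiles) / 1000))

for (k in seq_along(chunks)) {
  process_chunk(chunks[[k]])   # hypothetical wrapper around the parsing/writing loop above
  gc()
  .rs.restartR()               # as described above; RStudio-only
}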
Related
I currently have a 10-million-row table that I am uploading to R. I understand that uploading to R may take some time, so I decided to set my chunk to chache = TRUE, and for good measure I did the following:
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
                      message = FALSE, cache.lazy = FALSE)
When I re-run my R Markdown document it still takes >30 minutes to execute the chunk that holds the large data frame. Why are my results not being cached so that the chunk doesn't take so long to execute? I have made no changes to the query, and the backend data has not been updated either.
In your question you have chache = TRUE, did you mean cache = TRUE?
You could also have the heavy processing done in its own R script and call it inside your R Markdown with source('rscript.R'), with the caching handled in the .R file.
As stated in this link, you can cache objects yourself with:
if (file.exists("results.rds")) {
  res <- readRDS("results.rds")
} else {
  res <- compute_it() # a time-consuming function
  saveRDS(res, "results.rds")
}
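Putting those two ideas together, a minimal sketch (the file name rscript.R and the object names are hypothetical): the slow work and its RDS cache live in their own script, and the R Markdown chunk only sources it:
## rscript.R -- heavy processing plus its own cache
if (file.exists("results.rds")) {
  res <- readRDS("results.rds")
} else {
  res <- compute_it()          # the slow query / 10-million-row load
  saveRDS(res, "results.rds")
}

## In the .Rmd document, the chunk just runs:
source("rscript.R")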
I'm running into memory issues in my R script that processes a huge folder. I have to perform several operations per file and then output one row per file into my results data frame.
Sometimes the resulting data frame has hundreds of rows pasted together into one row, as if it got stuck on the same line (it seems rbind does not work well when the load is huge).
I think the issue arises from keeping a temporary data frame in memory to append results to, so I'm taking another approach:
a loop that reads each file one by one, processes it, then opens a connection to the results file, writes a line, closes the connection, and moves on to the next file. It came to mind that avoiding a big data frame in memory and writing immediately to file could solve my issues.
I assume this is very inefficient, so my question is: is there another way of efficiently appending output line by line, instead of binding an in-memory data frame and writing it to disk at the end?
I'm aware of the many options (sink, cat, writeLines, ...); my doubt is which one to use to avoid conflicts and be the most efficient given the conditions.
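For reference, a minimal sketch of the connection-based approach described above (process_file() is a hypothetical stand-in for the per-file work):
results_file <- "results.csv"
for (f in list.files("hugefolder", full.names = TRUE)) {
  row <- process_file(f)                 # hypothetical: returns one row of results as a character vector
  con <- file(results_file, open = "a")  # open the results file in append mode
  writeLines(paste(row, collapse = ","), con)
  close(con)                             # close immediately so nothing accumulates in memory
}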
I have been using the following snippet:
library(data.table)

filepaths <- list.files(dir, full.names = TRUE)  # full paths so fread can locate the files
resultFilename <- "/path/to/resultFile.txt"

for (i in seq_along(filepaths)) {
  content <- fread(filepaths[i], header = FALSE, sep = ",")  # read one file at a time
  ### some manipulation for the content
  results <- content[1]
  fwrite(results, resultFilename, col.names = FALSE, quote = FALSE, append = TRUE)
}

finalData <- fread(resultFilename, header = FALSE, sep = ",")
In my use case, for ~2000 files and tens of millions of rows, the processing time decreased by over 95% compared with using read.csv and incrementally growing a data.frame inside the loop. As you can see in https://csgillespie.github.io/efficientR/importing-data.html section 4.3.1 and https://www.r-bloggers.com/fast-csv-writing-for-r/, fread and fwrite are very efficient data I/O functions.
I am regularly receiving data from a source that produces a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply. The files can, however, be opened just fine by XLConnect::loadWorkbook. But unfortunately, even with huge amounts of memory allocated to the Java virtual machine, it always crashes after reading a few files. I tried adding the following lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel and then they read in just fine from readxl::read_excel. I'm hoping I could also resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, using Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process (a sketch of this approach follows the example below).
Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern = "xlsx$")
filesPerChunk <- 5
clustSize <- 4  # or however many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files) %/% runSize + (length(files) %% runSize != 0)

library(parallel)

sheets <- lapply(seq(runs), function(i) {
  runStart <- (i - 1) * runSize + 1
  runEnd <- min(length(files), runStart + runSize - 1)
  runFiles <- files[runStart:runEnd]

  # periodically restart and stop the cluster to deal with memory leaks
  cl <- makeCluster(clustSize)
  on.exit(stopCluster(cl))
  parLapply(cl, runFiles, function(f) {
    require(XLConnect)
    loadWorkbook(f, ...)
  })
})

sheets <- unlist(sheets, recursive = FALSE)  # convert the list of lists to a simple list
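For completeness, a minimal sketch of the first approach (one Rscript call per spreadsheet); the helper script name read_one.R, the use of readWorksheet(wb, sheet = 1), and the tempfile output are all assumptions:
## read_one.R (hypothetical helper script): read one spreadsheet, save it as .RData
library(XLConnect)
args <- commandArgs(trailingOnly = TRUE)  # args[1] = xlsx path, args[2] = output .RData path
wb <- loadWorkbook(args[1])
data <- readWorksheet(wb, sheet = 1)      # assumes the first sheet is the one wanted
save(data, file = args[2])

## In the master R process: a fresh JVM per file, so the heap can't fill up
files <- list.files(pattern = "xlsx$")
sheets <- lapply(files, function(f) {
  out <- tempfile(fileext = ".RData")
  system2("Rscript", c("read_one.R", shQuote(f), shQuote(out)))  # separate R process per spreadsheet
  load(out)  # restores the object named 'data'
  data
})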
I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same name as the file they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files
filelist <- list.files(pattern = "*.csv")

# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
  # Read in the file specified by 'x' - an entry in filelist
  data.df <- read.csv(x, skip = 1, header = TRUE)
  # Store the filename, minus .csv. This will be important later.
  filename <- substr(x = x, start = 1, stop = (nchar(x) - 4))
  # Your analysis work goes here. You only have to write it out once
  # to perform it on each individual file.
  ...
  # Eventually you'll end up with a data frame or a vector of analysis
  # to write out. Great! Since you've kept the value of x around,
  # you can do that trivially
  write.table(x = data_to_output,
              file = paste0(filename, "_analysis.csv"),
              sep = ",")
})
And done.
You can try the following code after putting all the CSV files in the same directory:
names = list.files(pattern = "*.csv")  # csv file names
for (i in 1:length(names)) {
  assign(names[i], read.csv(names[i], skip = 1, header = TRUE))
}
Hope this helps!
I'm trying to normalize a large number of Affymetrix CEL files using R. However, some of them appear to be truncated, so when reading them I get the error
Cel file xxx does not seem to have the correct dimensions
and the normalization stops. Manually removing the corrupted files and restarting every time would take very long. Do you know if there is a fast way (in R or with a tool) to detect corrupted files?
PS I'm 99.99% sure I'm normalizing together CELs from the same platform, it's really just truncated files :-)
One simple suggestion:
Can you just use a tryCatch block around your read.table (or whichever read command you're using)? Then just skip a file if you get that error message. You can also compile a list of corrupted files within the catch block (I recommend doing that so that you are tracking corrupted files for future reference when running a big batch process like this). Here's roughly what it could look like:
corrupted.files <- character()
for (i in seq_along(files)) {
  x <- tryCatch(read.table(file = files[i]),
    error = function(e) {
      # record the corrupted file (<<- updates the outer vector), otherwise re-raise the error
      if (grepl("correct dimensions", conditionMessage(e))) {
        corrupted.files <<- c(corrupted.files, files[i])
        NULL
      } else stop(e)
    },
    finally = print(paste("finished with", files[i], "at", Sys.time())))
  if (!is.null(x)) {
    # do something with the uncorrupted data
  }
}