load new files in directory - r

I have an R script that loads multiple text files from a directory and saves the data as a compressed .rda file. It looks like this:
#!/usr/bin/Rscript --vanilla
args <- commandArgs(TRUE)
## args[1] is the folder name
outname <- paste(args[1], ".rda", sep = "")
files <- list.files(path = args[1], pattern = "\\.txt$", full.names = TRUE)
tmp <- list()
if (file.exists(outname)) {
  message("found ", outname)
  load(outname)
  tmp <- get(args[1])                  # previously read stuff
  files <- setdiff(files, names(tmp))  # keep only files not read before
}
if (length(files) == 0) {
  message("no new files")
} else {
  ## read the files into a list of matrices
  results <- plyr::llply(files, read.table, .progress = "text")
  names(results) <- files
  assign(args[1], c(tmp, results))
  message("now saving... ", args[1])
  save(list = args[1], file = outname)
}
message("all done!")
message("all done!")
The files are quite large (about 15 MB each, typically around 50 of them), so running this script takes up to a few minutes, a substantial part of which is spent writing the .rda results.
I often update the directory with new data files, so I would like to append them to the previously saved and compressed data. That is what the script above does by checking whether an output file with that name already exists. The last step, saving the .rda file, is still pretty slow.
Is there a smarter way to go about this in some package, keeping track of which files have been read and saving the result faster?
I saw that knitr uses tools:::makeLazyLoadDB to save its cached computations, but this function is not documented, so I'm not sure where it makes sense to use it.

For intermediate files that I need to read (or write) often, I use
save(..., compress = FALSE)
which speeds things up considerably.
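In the script above, the final save could then look like this (a minimal sketch; the file on disk is larger, but writing and later loading are much faster):
save(list = args[1], file = outname, compress = FALSE)  # uncompressed .rda: bigger, but quick to write and load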

Related

Creating objects from all .xlsx documents in working directory

I am trying to create objects from all files in the working directory, each named after its original file. I tried the following approach, but couldn't solve the problems that appeared.
# - SETTING WD
getwd()
setwd("PATH TO THE FILE")
library(readxl)
# - CREATING OBJECTS
file_objects <- list.files()
xlsx_objects <- unlist(grep(".xlsx", file_objects, value = T))
for (i in xlsx_objects) {
  xlsx_objects[i] <- read_xlsx(xlsx_objects[i], header = T)
}
I tried to paste the [i] item from xlsx_objects together with the path to the WD, but it only created a list of file names from the docs in the WD.
I also found information that read.csv can read only one file at a time, but I guess that should be fine with a for loop, right? It reads only one file at a time.
Using lapply (as described on this forum) I was able to get the data into the environment, but the header argument didn't work, and I lost the names of my docs in that object, which does not have the desired structure. I am, however, looking to have these files in separate objects without calling every document explicitly.
IIUC, you could do something like:
files <- list.files("PATH TO THE FILE", full.names = TRUE, pattern = "xlsx")
list_files <- purrr::map(files, readxl::read_excel)
(You can't use read.csv to read Excel files.)
Also, I recommend reading about R Projects so you never have to use setwd() again; setwd() makes your code harder to reproduce down the pipeline.
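If you really do want each workbook as a separate object named after its file (as in the question), here is one minimal sketch, assuming the .xlsx files sit in the working directory:
library(readxl)
files <- list.files(pattern = "\\.xlsx$")
data_list <- lapply(files, read_excel)   # col_names = TRUE is the default; read_excel has no header argument
names(data_list) <- tools::file_path_sans_ext(basename(files))
list2env(data_list, envir = globalenv()) # one object per file, named after the file
Keeping everything in the named list is usually easier to work with than scattering separate objects across the global environment, though.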

Opening large csv files in a loop with R

I have a folder with different csv files (more than 100), and every file is very large and takes forever to open. How can I write code just to check whether the files open correctly and, if not, which files are problematic? I've tried this, but it was not working.
library(data.table)
setwd("Working dir")
files <- list.files(pattern = "*.csv")
numfiles <- length(files)
for (i in 1:numfiles) {
  files[i] <- paste(".\\", files[i], sep = "")
  assign(gsub("[.]csv$", "", files[i]), fread(files[i], header = FALSE))
}
Thank you for your help
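A minimal sketch of one way to do this check, wrapping fread in tryCatch so a failing file is recorded instead of stopping the loop (the nrows shortcut is an assumption to keep the check fast; drop it if errors can occur late in a file):
library(data.table)
files <- list.files("Working dir", pattern = "\\.csv$", full.names = TRUE)
status <- vapply(files, function(f) {
  tryCatch({
    fread(f, header = FALSE, nrows = 1000)  # read only the start of each file
    "ok"
  }, error = function(e) conditionMessage(e),
     warning = function(w) conditionMessage(w))
}, character(1))
problem_files <- files[status != "ok"]
problem_files  # the files that did not open cleanly; their messages are in status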

Iteratively resave a directory tree of Excel files

I am regularly receiving data from a source that produces a non-standard Excel format which can't be read by readxl::read_excel (here is the GitHub issue thread). Consequently, I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply. The files can, however, be opened just fine by XLConnect::loadWorkbook. Unfortunately, even when allocating huge amounts of memory for the Java virtual machine, it always crashes after reading a few files. I tried adding these lines to my import function:
options(java.parameters = "-Xmx16g")
detach("package:XLConnect", unload = TRUE)
library(XLConnect)
xlcFreeMemory()
However, I still get:
Error: OutOfMemoryError (Java): Java heap space
All I need to do is resave them in Excel and then they read in just fine from readxl::read_excel. I'm hoping I could also resave them in batch using XLConnect and then read them in using readxl::read_excel. Unfortunately, using Linux, I can't just script Excel to resave them. Does anyone have another workaround?
Since you're on Linux, running an Excel macro to re-save the spreadsheets looks difficult.
You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:
1. Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file and read it back in your master R process (a sketch of this approach follows the example below).
2. Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
Example of the latter:
files <- list.files(pattern = "xlsx$")
filesPerChunk <- 5
clustSize <- 4  # or however many slave nodes you want
runSize <- clustSize * filesPerChunk
runs <- length(files) %/% runSize + (length(files) %% runSize != 0)
library(parallel)
sheets <- lapply(seq(runs), function(i) {
  runStart <- (i - 1) * runSize + 1
  runEnd <- min(length(files), runStart + runSize - 1)
  runFiles <- files[runStart:runEnd]
  # periodically restart and stop the cluster to deal with memory leaks
  cl <- makeCluster(clustSize)
  on.exit(stopCluster(cl))
  parLapply(cl, runFiles, function(f) {
    require(XLConnect)
    loadWorkbook(f, ...)  # "..." left as a placeholder for any extra loadWorkbook() arguments
  })
})
sheets <- unlist(sheets, recursive = FALSE)  # convert list of lists to a simple list
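For the first option, here is a minimal sketch of the Rscript hand-off (the worker script name read_one.R, the use of readWorksheet on sheet 1, and the tempfile round-trip are assumptions, not from the thread):
## read_one.R -- hypothetical worker script, run as: Rscript read_one.R input.xlsx output.RData
args <- commandArgs(TRUE)
library(XLConnect)
wb  <- loadWorkbook(args[1])
dat <- readWorksheet(wb, sheet = 1)
save(dat, file = args[2])

## in the master R process: one fresh JVM per file, so the Java heap never accumulates
files  <- list.files(pattern = "xlsx$")
sheets <- lapply(files, function(f) {
  out <- tempfile(fileext = ".RData")
  system2("Rscript", c("read_one.R", shQuote(f), shQuote(out)))
  load(out)          # creates `dat` in this function's environment
  file.remove(out)
  dat
})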

How to download a gzipped file in R without saving it to the computer

I am a new user of R.
I have some txt.gz files on the web, each of approximate size 9 x 500000.
I'm trying to uncompress a file and read it straight into R with read.table().
I have used this code (url censored):
LoadData <- function() {
  con <- gzcon(url("http://"))
  raw <- textConnection(readLines(con, n = 25000))
  close(con)
  dat <- read.table(raw, skip = 2, na.strings = "99.9")
  close(raw)
  return(dat)
}
The problem is that if I read more lines with readLines, the program takes much more time to do what it should.
How can I do this in reasonable time?
You can make a temporary file like this:
tmpfile <- tempfile(tmpdir = getwd())
file.create(tmpfile)
download.file(url, tmpfile)
# do your stuff
file.remove(tmpfile)  # delete the tmpfile
Don't do this.
Each time you want to access the file, you'll have to re-download it, which is both time-consuming for you and costly for the file hoster.
It is better practice to download the file (see download.file) and then read a local copy in a separate step.
If the file is a tar archive, you can unpack it with untar(..., compressed = "gzip"); a plain .gz file can be read directly through gzfile().
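A minimal sketch of that download-then-read workflow, reusing the censored URL and the read.table arguments from the question (mode = "wb" matters on Windows):
localfile <- tempfile(fileext = ".txt.gz")
download.file("http://", localfile, mode = "wb")  # "http://" stands in for the censored URL
# read.table() can read the gzipped file through a gzfile() connection, no manual decompression needed
dat <- read.table(gzfile(localfile), skip = 2, na.strings = "99.9")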

Prompting user for multiple input files in R

I'm trying to do something I think should be straightforward enough, but so far I've been unable to figure it out (not surprisingly, I'm a noob)...
I would like to be able to prompt a user for input file(s) in R. I've successfully used file.choose() to get a single file, but I would like to have the option of selecting more than one file at a time.
I'm trying to write a program that reads in daily data files with the same header and appends them into one large monthly file. I can do it in the console by importing the files individually and then using rbind(file1, file2, ...), but I need a script to automate the process. The number of files to append will not necessarily be constant between runs.
Thanks
Update: here is the code I came up with that works for me; maybe it will be helpful to someone else as well.
library(tcltk)
File.names <- tk_choose.files()  # prompts user for files to be combined
Num.Files <- NROW(File.names)    # number of files selected by the user

# Create one large file by combining all files
Combined.file <- read.delim(File.names[1], header = TRUE, skip = 2)  # read in first file of the selection
for (i in 2:Num.Files) {
  temp <- read.delim(File.names[i], header = TRUE, skip = 2)  # read in the next file
  Combined.file <- rbind(Combined.file, temp)                 # append it to the combined data
}

output.dir <- dirname(File.names[1])  # directory of the files that were selected
setwd(output.dir)                     # change directory so the output file ends up next to the input files
output <- readline(prompt = "Output Filename: ")  # prompt user for the output file name
outfile.name <- paste(output, ".txt", sep = "")
write.table(Combined.file, file = outfile.name, sep = "\t",
            col.names = TRUE, row.names = FALSE)  # write a tab-delimited text file in the same directory
Have you tried ?choose.files? From its help page:
"Use a Windows file dialog to choose a list of zero or more files interactively."
If you are willing to type each file name, why not just loop over all the files like this:
filenames <- c("file1", "file2", "file3")
filecontents <- lapply(filenames, function(fname) {<insert code for reading file here>})
bigfile <- do.call(rbind, filecontents)
If your code must be interactive, you can use the readline function in a loop that will stop asking for more files when the user inputs an empty line:
getFilenames <- function() {
  filenames <- list()
  x <- readline("Filename: ")
  while (x != "") {
    filenames <- append(filenames, x)
    x <- readline("Filename: ")
  }
  filenames
}
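Combining the two ideas above (interactive selection plus lapply/do.call instead of growing Combined.file row by row), a minimal sketch, assuming the same header = TRUE, skip = 2 layout as in the update:
library(tcltk)
File.names <- tk_choose.files()  # interactive multi-file selection
file.list <- lapply(File.names, read.delim, header = TRUE, skip = 2)
Combined.file <- do.call(rbind, file.list)  # bind all daily files into one monthly table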
