list.files taking account of list number in R?

I have a large number of files (>50,000) to analyze. I can get a list of these files with:
myfiles <- list.files(pattern="*output*")
and then loop with
for (file in myfiles) {
"code"
}
The problem is that my system sometimes freezes due to RAM overload, so the only option left is to kill the R session and restart the loop with the same files. How can I modify the list.files call so that it selects only a certain range of files, like 100:200 or 3500:5000? Basically, I would like to skip the files that were already analyzed before the last system freeze.
Any help would be appreciated.
Thanks.

The 'myfiles' object is a vector, so we can create a sequence (:) of positions to subset the object when we loop:
for (file in myfiles[100:200]) {
...code...
}
Also, the files can be split into a list in which each element holds 100 file names:
lst1 <- split(myfiles, as.integer(gl(length(myfiles), 100, length(myfiles))))
Then, an idea is to loop over the chunks (in parallel or sequentially), remove (rm) the temporary object after each chunk, and call gc() to release memory, as sketched below.
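A minimal sketch of that chunked pattern, using lst1 from above and a hypothetical process_file() function standing in for the real analysis code:

for (chunk in names(lst1)) {
    # analyze one chunk of (up to) 100 files and save the result to disk
    res <- lapply(lst1[[chunk]], process_file)
    saveRDS(res, paste0("results_chunk_", chunk, ".rds"))
    rm(res)  # drop the temporary object ...
    gc()     # ... and ask R to return the memory
}

If the session dies, the results_chunk_*.rds files already on disk show which chunks are done, so the loop can be restarted from the first missing chunk instead of from the beginning.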

Related

How could I use lapply with two parameters?

I have a list with the names of my files and an array of integers I want to use each time I open a file. When I open file 1 I want it to use numbers[1]; when I open file 14 I want it to use numbers[14].
I tried creating an index n inside the function I would use with lapply, but not knowing how to get an index telling me which file I am reading, I discarded that. Then I tried mapply, but it created twice as many elements as I wanted.
I want to execute my function so that each call uses the nth element of my fnames and the nth element of my array numbers, and saves the result in a list.
My function opens a file and processes the data in that file based on the value of n corresponding to that file (in a while loop). That's why I need to use the same index for fnames as for numbers.
The function returns a data frame; the intent is for lapply to put the data frame produced for each file, together with its corresponding number, into a list.
This is how I create the list of names:
x <- list.files(pattern=".txt")
This is the array of numbers:
n<-c(4,4,12,6,3,6,8,32,4,4,9,5,5,6,8,3,6,7,3,6,5,3,5)
I do not know how to execute the function with those two parameters and get a list with all the results, as I would if I were running lapply with a single argument.
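A minimal sketch of one way to do this with Map, which walks over both vectors in parallel (my_function here is a hypothetical stand-in for the real function that takes a file name and its number):

# Map(f, x, n) calls f(x[1], n[1]), f(x[2], n[2]), ... and returns a list
results <- Map(my_function, x, n)

mapply(my_function, x, n, SIMPLIFY=FALSE) is equivalent; without SIMPLIFY=FALSE, mapply may simplify and reshape the result, which is one way to end up with a different shape than expected. Alternatively, lapply(seq_along(x), function(i) my_function(x[i], n[i])) keeps an explicit index i available inside the function.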

How to create external folder in a loop in R?

Ok, actually I have a loop of 50 iterations and I need an output file for each of them. What happens is that with my current code I only obtain the output file corresponding to the last iteration, so could you give me code that lets me get all the files in my current folder? Thank you.
part is a vector of length 50 (really a list, but that does not matter).
Use
for (i in seq_along(part)) {
    # one output file per iteration: 1.txt, 2.txt, ...
    write.table(part[[i]], paste(i, "txt", sep = "."))
}
How about using list.files()?
That lists all the files in the current directory, or you can specify a directory as the first argument of the function.
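Coming back to the folder part of the title: if the output files should land in a separate folder, a minimal sketch along these lines should work; the folder name out_dir is an assumption:

out_dir <- "output"
if (!dir.exists(out_dir)) dir.create(out_dir)  # create the folder once, before the loop
for (i in seq_along(part)) {
    # write each iteration's table into the new folder
    write.table(part[[i]], file.path(out_dir, paste(i, "txt", sep = ".")))
}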

Using index number of file in directory

I'm using the list.files function in R. I know how to tell it to access all files in a directory, such as:
list.files("directory", full.names=TRUE)
But I don't really know how to subset the directory. If I just want list.files to list the 2nd, 5th, and 6th files in the directory, is there a way to tell list.files to only list those files? I've been thinking about whether it's possible to use the files' indices within the directory but I can't figure out how to do it. It's okay if I can only do this with consecutive files (such as 1:3) but non-consecutive would be even better.
The context of the question is that this is for a problem for a class, so I'm not worried about the files in the directory changing or being deleted.
If you store the result of list.files in an object, say object, you will see that it is just an atomic vector of class character (nothing more, nothing less!). You can subset it with regex syntax for character strings (using functions that rely on regex, like grep or grepl), with the regular subsetting operator [, or (most importantly) by combining both techniques.
For your example:
object[c(2,5,6)]
or exclude with:
object[-c(2,5,6)]
or if you want to find all names that start with the shuttle string with:
object[grepl("^shuttle", object)]
or with the following code if you want to find all .csv files:
object[grepl("\\.csv$", object)]
The possibilities are endless.
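For example, combining both techniques: filter by pattern first, then pick by position (the directory name and indices are just placeholders):

object <- list.files("directory", full.names=TRUE)
csvs <- object[grepl("\\.csv$", object)]  # keep only the .csv files
csvs[c(2, 5, 6)]                          # then subset by index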

append multiple large data.table's; custom data coercion using colClasses and fread; named pipes

[This is really multiple bug reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R, so apologies if I'm not following best practices in my code below. I'm trying.]
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
First I want to report that using rbindlist to append large data.tables causes R to segfault (ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from within R from CRAN). The machine has 128 GiB of RAM; so, I shouldn't be running out of memory given the size of the data.
My code:
library(data.table)

append.tables <- function(files) {
    # read each file into its own data.table
    moves.by.year <- lapply(files, fread)
    # stack them into one big table
    move <- rbindlist(moves.by.year)
    rm(moves.by.year)
    # parse the yyyymmdd column into a Date
    move[, week_end := as.Date(as.character(week_end), format="%Y%m%d")]
    return(move)
}
Crash message:
append.tables crashes with this:
> system.time(move <- append.tables(files))
*** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'
Traceback:
1: rbindlist(moves.by.year)
2: append.tables(files)
3: system.time(move <- append.tables(files))
There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.
2. Could fread accept multiple file names?
In any case, I think a better approach here would be to allow fread to take files as a vector of file names:
files <- c("my", "files", "to be", "appended")
dt <- fread(files)
Presumably you can be much more memory efficient under the hood by not having to keep all of these intermediate objects around at the same time, as appears to be necessary for a user of R.
3. colClasses gives an error message
My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:
dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
Yes, in the case of dates, a simple:
dt[,date := as.Date(as.character(date), format="%Y%m%d")]
works.
However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
Now, I can get around this with some system() calls that append the files and pipe them through some sed magic, simplified here (tfile is another temporary file):
if (has_header) {
    tfile2 <- tempfile()
    system(paste("echo fakeline >>", tfile2))
    system(paste("head -q -n1", files[[1]], ">>", tfile2))
    system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
                 " | sed 's/\\.//' >>", tfile), wait=wait)
    unlink(tfile2)
} else {
    system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}
but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)
4. fread thinks named pipes are empty files
I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe (system(paste("mkfifo", tfile))), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999
In any case, this is a bit of a hack.
Simplified code if I had my wish list
Ideally, I would be able to do something like this:
setClass("Int_Price")
setAs("character", "Int_Price",
    function(from) {
        # strip the decimal point, then coerce straight to integer
        return(as.integer(gsub("\\.", "", from)))
    }
)
dt <- fread(files, colClasses=list(price="Int_Price"))
And then I'd have a nice long data.table with properly coerced data.
Update: The rbindlist bug has been fixed in commit 1100 v1.8.11. From NEWS:
o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.
As mentioned in the comments, you're supposed to ask separate questions separately. But since they're such good points, and they link together into the wish list at the end, ok, I'll answer them in one go.
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
Please run again changing the line :
moves.by.year <- lapply(files, fread)
to
moves.by.year <- lapply(files, fread, verbose=TRUE)
and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with long vector support in R3 yet, otherwise > 2^31 rows would be fine, too).
2. Could fread accept multiple file names?
Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.
I'll file this as a feature request and post the link here.
3. colClasses gives an error message
When colClasses is a list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is to be easier to use than colClasses in read.csv, which only accepts a vector, and to save repeating "character" over and over. I could have sworn this was better documented in ?fread, but it seems not. I'll take a look at that.
So, instead of
fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
the correct syntax is
fread(tfile, colClasses=list(myDate="date"))
Given what you go on to say in the question, if I understand correctly, you actually want:
fread(tfile, colClasses=list(character="date")) # only fread accepts a list
or
fread(tfile, colClasses=c("date"="character")) # both read.csv and fread
Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, then I've still to implement that coercion automatically. You mentioned precision of numerics, so just a reminder that integer64 can also be read directly by fread.
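That read-as-character route also covers the decimal-stripping use case. A minimal sketch, assuming the column in question is called price:

dt <- fread(tfile, colClasses=list(character="price"))
# strip the decimal point while the column is still character,
# then coerce to integer without going through double
dt[, price := as.integer(gsub(".", "", price, fixed=TRUE))]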
4. fread thinks named pipes are empty files
Hopefully this goes away now, assuming the previous point is resolved? fread works by memory-mapping its input. It can accept non-files such as http addresses and connections (tbc), and for convenience the first thing it does is write the complete input to ramdisk so it can map the input from there. The reason fread is fast goes hand in hand with seeing the entire input first.

Using R to list all files with a specified extension

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?
files <- list.files(pattern = "\\.dbf$")
The $ at the end means end of string. "dbf$" will work too, but adding \\. (. is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you have, e.g., .adbf files).
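A quick illustration of the difference on some made-up file names:

fnames <- c("a.dbf", "b.adbf", "c.dbf.xml")
grepl("dbf$", fnames)     # TRUE  TRUE  FALSE -- also matches .adbf
grepl("\\.dbf$", fnames)  # TRUE FALSE  FALSE -- only the real .dbf file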
Try this, which uses globs rather than regular expressions, so it will only pick out the file names that end in .dbf:
filenames <- Sys.glob("*.dbf")
Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")
This gives you the list of files with full paths:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = directory containing the files
I am not very good at using sophisticated regular expressions, so I'd do such a task in the following way:
files <- list.files()
dbf.files <- files[-grep(".xml", files, fixed=T)]
The first line just lists all files in the working directory. The second one drops everything containing ".xml" (grep returns the indices of such strings in the 'files' vector; subsetting with negative indices removes the corresponding entries from the vector).
The "fixed" argument for grep is just my whim, as I usually want it to perform crude pattern matching without Perl-style fancy regexes, which might surprise me.
I'm aware that such a solution simply reflects drawbacks in my education, but for a novice it may be useful =) at least it's easy.
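One caveat worth noting about the negative-grep approach: if no file name contains ".xml", grep returns integer(0), and files[-integer(0)] selects nothing, silently dropping every file. A logical filter with grepl avoids that edge case:

files <- list.files()
# keep files that do NOT contain ".xml"; safe even when nothing matches
dbf.files <- files[!grepl(".xml", files, fixed=TRUE)]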
