R 3.1: sapply over a list of files

I want to apply the read.table() function to a list of .txt files. These files are in my current directory.
my.txt.list <-
list("subject_test.txt", "subject_train.txt", "X_test.txt", "X_train.txt")
Before applying read.table() to the elements of this list, I want to check whether the data table (dt) has already been computed and stored in a cache directory. Data tables from the cache directory are already loaded in my environment, under names of the form file_name.dt:
R> ls()
"subject_test.dt" "subject_train.dt"
In this example, I only want to read "X_test.txt" and "X_train.txt". I wrote a small function that tests whether the data table has already been cached and applies read.table() if not.
my.rt <- function(x,...){
# apply read.table to txt files if data table is not already cached
# x is a character vector
y <- strsplit(x,'.txt')
y <- paste(y,'.dt',sep = '')
if (y %in% ls() == FALSE){
rt <- read.table(x, header = F, sep = "", dec = '.')
}
}
This function works if I take one element this way :
subject_test.dt <- my.rt('subject_test.txt')
Now I want to sapply it over my file list this way:
my.res <- sapply(my.txt.list,my.rt)
I get my.res as a list of data frames, but the problem is that the function reads all the files and does not take the already computed ones into account.
I must be missing something, but I can't see what.
Thanks for any suggestions.

I think it has to do with the use of strsplit in your example. strsplit returns a list.
What about this?
my.txt.files <- c("subject_test.txt", "subject_train.txt", "X_test.txt", "X_train.txt")
> ls()
[1] "subject_test.dt" "subject_train.dt"
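my.rt <- function(x){
  y <- gsub(".txt", ".dt", x, fixed = TRUE)
  # ls() with no arguments would list this function's local variables,
  # so look in the global environment explicitly
  if (!(y %in% ls(envir = .GlobalEnv))) {
    read.table(x, header = FALSE, sep = "", dec = '.')
  }
}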
my.res <- sapply(my.txt.files, FUN = my.rt)
Note that I'm replacing .txt with .dt and I'm doing a "not in". You will get NULL entries in the result list if a file is not processed.
This is untested, but I think it should work...
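Note that `ls()` called with no arguments inside a function lists that function's local variables, not the workspace, so a cache check has to name the environment it looks in. Below is a minimal sketch of that check using `exists()`. The helper name `read_if_missing`, the `cache_env` argument and the temporary demo files are assumptions made for illustration, not part of the original code.

```r
# Hypothetical helper: read a file only if no object named
# "<basename>.dt" already exists in the given environment.
read_if_missing <- function(path, cache_env = globalenv()) {
  cached_name <- sub("\\.txt$", ".dt", basename(path))
  if (exists(cached_name, envir = cache_env, inherits = FALSE)) {
    return(NULL)  # already cached, skip the read
  }
  read.table(path, header = FALSE, sep = "", dec = ".")
}

# Demo with throwaway files and a throwaway "cache" environment.
demo_dir <- tempfile("demo")
dir.create(demo_dir)
paths <- file.path(demo_dir, c("subject_test.txt", "X_test.txt"))
for (p in paths) {
  write.table(data.frame(a = 1:3), p, row.names = FALSE, col.names = FALSE)
}

cache <- new.env()
assign("subject_test.dt", data.frame(a = 1:3), envir = cache)

# lapply always returns a list; skipped files come back as NULL.
res <- lapply(paths, read_if_missing, cache_env = cache)
names(res) <- basename(paths)
```

Using `lapply` rather than `sapply` keeps the result a plain list whatever the files contain, which makes the NULL entries for skipped files easy to filter out afterwards.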

Related

Extract data from text files using for loop

I have 40 text files with names :
[1] "2006-03-31.txt" "2006-06-30.txt" "2006-09-30.txt" "2006-12-31.txt" "2007-03-31.txt"
[6] "2007-06-30.txt" "2007-09-30.txt" "2007-12-31.txt" "2008-03-31.txt" etc...
I need to extract one specific value from each file. I know how to do it individually, but this takes a while:
m_value1 <- `2006-03-31.txt`$Marknadsvarde_tot[1]
m_value2 <- `2006-06-30.txt`$Marknadsvarde_tot[1]
m_value3 <- `2006-09-30.txt`$Marknadsvarde_tot[1]
m_value4 <- `2006-12-31.txt`$Marknadsvarde_tot[1]
Can someone help me with a for loop which would extract the data from a specific column and row through all the different text files please?
Assuming your files are all in the same folder, you can use list.files to get the names of all the files, then loop through them and get the value you need. So something like this?
m_value <- character() # or whatever the type of your variable is
# full.names = TRUE so the paths can be read from any working directory
filelist <- list.files(path="...", full.names = TRUE)
for (i in seq_along(filelist)){
  df <- read.table(filelist[i], header = TRUE)
  m_value[i] <- df$Marknadsvarde_tot[1]
}
EDIT:
If you have already imported all the data, you can use get:
txt_files <- list.files(pattern = "\\.txt$")
for (i in txt_files) {
  x <- read.delim(i, header = TRUE)
  assign(i, x)
}
m_value <- character()
for (i in seq_along(txt_files)) {
  m_value[i] <- get(txt_files[i])$Marknadsvarde_tot[1]
}
You could utilize the select-parameter from fread of the data.table-package for this:
library(data.table)
file.list <- list.files(pattern = '\\.txt$')
# header = TRUE so that the column can be selected by name
lapply(file.list, fread, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)
This will result in a list of data.tables. If you just want a vector with all the values:
sapply(file.list, function(x) fread(x, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)[[1]])
temp = list.files(pattern="\\.txt$")
library(data.table)
# one data.table per file, named after the file without its extension
list2env(
  lapply(setNames(temp, make.names(gsub("\\.txt$", "", temp))),
         fread), envir = .GlobalEnv)
Added data.table to an existing answer at Importing multiple .csv files into R
After you have read all your files, you can get data from the data.tables using DT[i, j, by], where i is your row condition.
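If the goal is only the vector of values, the files do not need to be kept as separate variables at all. The sketch below reads a single data row per file and collects the first value of one column. The helper `first_value` is a hypothetical name, the files are assumed to be tab-separated with a header row, and the two demo files are invented stand-ins for the real data.

```r
# Hypothetical helper: first value of one column, reading only one data row.
first_value <- function(path, column) {
  df <- read.delim(path, header = TRUE, nrows = 1)
  df[[column]][1]
}

# Demo data: two small tab-separated files in a temporary folder.
demo_dir <- tempfile("quarters")
dir.create(demo_dir)
write.table(data.frame(Marknadsvarde_tot = c(10.5, 11), Other = 1:2),
            file.path(demo_dir, "2006-03-31.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(Marknadsvarde_tot = c(20.25, 21), Other = 3:4),
            file.path(demo_dir, "2006-06-30.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# vapply checks that every file yields exactly one numeric value.
m_value <- vapply(files, first_value, numeric(1), column = "Marknadsvarde_tot")
names(m_value) <- basename(files)
```

Naming the result by file keeps each value traceable to its date without relying on the order of the file listing.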

Applying a function on all csv files from a certain folder

I am reading csv files from a certain folder, which all have the same structure. Furthermore, I have created a function which adds a certain value to a dataFrame.
I have created the "folder reading" - part and also created the function. However, I now need to connect these two with each other. This is where I am having my problems:
Here is my code:
addValue <- function(valueToAdd, df.file, writterPath) {
df.file$result <- df.file$Value + valueToAdd
x <- x + 1
df.file <- as.data.frame(do.call(cbind, df.file))
fullFilePath <- paste(writterPath, x , "myFile.csv", sep="")
write.csv(as.data.frame(df.file), fullFilePath)
}
#1.reading R files
path <- "C:/Users/RFiles/files/"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
#2.applying function
writterPath <- "C:/Users/RFiles/files/results/"
addValue(2, sys, writterPath)
How to apply the addValue() function in my #1.reading R files construct? Any recommendations?
I appreciate your answers!
UPDATE
When trying out the example code, I get:
+ }
+ ## If you really need to change filenames with numbers,
+ newfname <- file.path(npath, paste0(x, basename(fname)))
+ ## otherwise just use `file.path(npath, basename(fname))`.
+
+ ## (4) Write back to a different file location:
+ write.csv(newdat, file = newfname, row.names = FALSE)
+ }
Error in `$<-.data.frame`(`*tmp*`, "results", value = numeric(0)) :
replacement has 0 rows, data has 11
Any suggestions?
There are several problems with your code (e.g., x in your function is never defined and is not retained between calls to addValue), so I'm guessing that this is a chopped-down version of the real code and you still have remnants remaining. Instead of picking it apart verbosely, I'll just offer my own suggested code and a few pointers.
The function addValue looks like it is good for changing a data.frame, but I would not have guessed (by the name, at least) that it would also write the file to disk (and potentially over-write an existing file).
I'm guessing you are trying to (1) read a file, (2) "add value" to it, (3) assign it to a global variable, and (4) write it to disk. The third can be problematic (and contentious with some programmers), but I'll leave it for now.
Unless writing to disk is inherent to your idea of "adding value" to a data.frame, I recommend you keep #2 separate from #4. Below is a suggested alternative to your code:
addValue <- function(valueToAdd, df) {
df$results <- df$Value + valueToAdd
## more stuff here?
return(df)
}
opath <- 'c:/Users/RFiles/files/raw' # notice the difference
npath <- 'c:/Users/RFiles/files/adjusted'
files <- list.files(path = opath, pattern = '\\.csv$', full.names = TRUE)
x <- 0
for (fname in files) {
x <- x + 1
## (1) read in and (2) "add value" to it
dat <- read.csv(fname)
newdat <- addValue(2, dat)
## (3) Conditionally assign to a global variable:
varname <- gsub('\\.[^.]*$', '', basename(fname))
if (! exists(varname)) {
assign(x = varname, value = newdat)
} else {
warning('variable exists, did not overwrite: ', varname)
}
## If you really need to change filenames with numbers,
newfname <- file.path(npath, paste0(x, basename(fname)))
## otherwise just use `file.path(npath, basename(fname))`.
## (4) Write back to a different file location:
write.csv(newdat, file = newfname, row.names = FALSE)
}
Notice that it will not overwrite global variables. This may be an annoying check, but will keep you from losing data if you accidentally run this section of code.
An alternative to assigning numerous variables in the global environment is to save all of them to a single list. Assuming they are the same format, you will likely be dealing with them with identical (or very similar) analytical methods, so putting them all in one list will facilitate that. The alternative of tracking disparate variable names can be tiresome.
## addValue as defined previously
opath <- 'c:/Users/RFiles/files/raw'
npath <- 'c:/Users/RFiles/files/adjusted'
ofiles <- list.files(path = opath, pattern = '\\.csv$', full.names = TRUE)
nfiles <- file.path(npath, basename(ofiles))
dats <- mapply(function(ofname, nfname) {
dat <- read.csv(ofname)
newdat <- addValue(2, dat)
write.csv(newdat, file = nfname, row.names = FALSE)
newdat
}, ofiles, nfiles, SIMPLIFY = FALSE)
length(dats) # number of files
names(dats) # one for each file
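Regarding the error in the update ("replacement has 0 rows, data has 11"): one likely cause is that the data frame read from disk has no column called `Value` (for instance a different spelling, or a file read with the wrong separator). In that case `df$Value` is `NULL`, `NULL + 2` is `numeric(0)`, and assigning that to a column of an 11-row data frame fails. A minimal sketch of a guard follows; `add_value_checked` and its `column` argument are hypothetical names, not part of the code above.

```r
# Hypothetical variant of addValue that fails early with a clearer message.
add_value_checked <- function(value_to_add, df, column = "Value") {
  if (!column %in% names(df)) {
    stop("column '", column, "' not found; available columns: ",
         paste(names(df), collapse = ", "))
  }
  df$results <- df[[column]] + value_to_add
  df
}

# Works when the column exists ...
ok <- add_value_checked(2, data.frame(Value = c(1, 5, 9)))

# ... and reports the available column names when it does not.
bad <- tryCatch(add_value_checked(2, data.frame(value = c(1, 5, 9))),
                error = function(e) conditionMessage(e))
```

Running `names(dat)` or `str(dat)` right after `read.csv()` is a quick way to confirm what the columns are actually called.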

Function to read in multiple delimited text files

Using this answer, I have created a function that should read in all the text datasets in a directory:
read.delims = function(dir, sep = "\t"){
# Make a list of all data frames in the "data" folder
list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
# Read them in
for (i in 1:length(list.data)) {
assign(list.data[i],
read.delim(paste(dir, list.data[i], sep = "/"),
sep = sep))
}
}
However, even though there are .txt and .csv files in the specified directory, no R objects get created (I'm guessing this happens because I'm using the read.delim within a function). How to correct this?
You can add the envir parameter to your assign() call, like this:
read.delims = function(dir, sep = "\t"){
# Make a list of all data frames in the "data" folder
list.data = list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
# Read them in
for (i in 1:length(list.data)) {
assign(list.data[i],
read.delim(paste(dir, list.data[i], sep = "/"),
sep = sep),
envir=.GlobalEnv)
}
}
This way, your objects will be created in the global environment and not just in the function's environment.
As I said in my comment, it is necessary to return() a value after assigning. I don't really see the point in using assign() though, so here it is with a simple for-loop, assuming you want your output to be a list of data frames.
Note that I changed the reading function to read.table() for personal convenience. You might want to adjust that.
read.delims <- function(dir, sep = "\t"){
  # Make a list of all data files in the "data" folder
  list.data <- list.files(dir, pattern = "*.(txt|TXT|csv|CSV)")
  list.out <- vector("list", length(list.data))
  # Read them in (seq_along is safe when no file matches)
  for (i in seq_along(list.data)) {
    list.out[[i]] <- read.table(file.path(dir, list.data[i]), sep = sep)
  }
  return(list.out)
}
Maybe you should also add a $ to your regular expression.
Cheers.
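The list-returning approach can also be written with `lapply`, keeping the file names as list names so each table stays identifiable. This is only a sketch: `read_delims_list` is a hypothetical name, the stricter pattern with an escaped dot and a trailing `$` is a swapped-in regular expression, and the demo files are created in a temporary folder.

```r
# Hypothetical helper: read every .txt/.csv file in `dir` into a named list.
read_delims_list <- function(dir, sep = "\t") {
  files <- list.files(dir, pattern = "\\.(txt|csv)$",
                      ignore.case = TRUE, full.names = TRUE)
  out <- lapply(files, read.delim, sep = sep)
  names(out) <- basename(files)
  out
}

# Demo: two tab-separated files with different extensions.
demo_dir <- tempfile("delims")
dir.create(demo_dir)
write.table(data.frame(id = 1:2, score = c(0.5, 0.7)),
            file.path(demo_dir, "a.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(id = 3:5, score = c(0.1, 0.2, 0.3)),
            file.path(demo_dir, "b.CSV"),
            sep = "\t", row.names = FALSE, quote = FALSE)

tables <- read_delims_list(demo_dir)
```

Individual tables are then available as `tables[["a.txt"]]`, and the whole set can be processed with another `lapply`.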

Write list of data.frames to separate CSV files with lapply

The question says it all - I want to take a list object full of data.frames and write each data.frame to a separate .csv file where the name of the .csv file corresponds to the name of the list object.
Here's a reproducible example and the code I've written thus far.
df <- data.frame(
var1 = sample(1:10, 6, replace = TRUE)
, var2 = sample(LETTERS[1:2], 6, replace = TRUE)
, theday = c(1,1,2,2,3,3)
)
df.daily <- split(df, df$theday) #Split into separate days
lapply(df.daily, function(x){write.table(x, file = paste(names(x), ".csv", sep = ""), row.names = FALSE, sep = ",")})
And here is the top of the error message that R spits out
Error: Results must have one or more dimensions.
In addition: Warning messages:
1: In if (file == "") file <- stdout() else if (is.character(file)) { :
the condition has length > 1 and only the first element will be used
What am I missing here?
Try this:
sapply(names(df.daily),
function (x) write.table(df.daily[[x]], file=paste(x, "txt", sep=".") ) )
You should see the names ("1", "2", "3") spit out one by one, but the NULLs are the evidence that the side-effect of writing to disk files was done. (Edit: changed [] to [[]].)
You could use mapply:
mapply(
write.table,
x=df.daily, file=paste(names(df.daily), "txt", sep="."),
MoreArgs=list(row.names=FALSE, sep=",")
)
There is a thread about a similar problem on the plyr mailing list.
A couple of things:
laply performs operations on a list. What you're looking for is d_ply. And you don't have to break it up by day, you can let plyr do that for you. Also, I would not use names(x) as that returns all of the column names of a data.frame.
d_ply(df, .(theday), function(x) write.csv(x, file=paste(x$theday[1],".csv",sep=""),row.names=FALSE))
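The reason the original `lapply` call fails is that the function receives each data frame, not its name in the list, so `names(x)` returns the three column names and `paste()` produces three file names for one `write.table` call. A base R sketch that iterates over the list names instead is shown below. The output folder is a temporary directory chosen for the demo, and `set.seed` is added only to make the example reproducible.

```r
set.seed(1)
df <- data.frame(
  var1   = sample(1:10, 6, replace = TRUE),
  var2   = sample(LETTERS[1:2], 6, replace = TRUE),
  theday = c(1, 1, 2, 2, 3, 3)
)
df.daily <- split(df, df$theday)

out_dir <- tempfile("daily")   # hypothetical output folder for the demo
dir.create(out_dir)

# Iterate over the list names so each file name is a single string.
for (day in names(df.daily)) {
  write.csv(df.daily[[day]],
            file = file.path(out_dir, paste0(day, ".csv")),
            row.names = FALSE)
}

written <- list.files(out_dir)
```

A plain `for` loop fits here because the call is made purely for its side effect of writing files, so there is no result worth collecting.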

Loading many files at once?

So let's say I have a directory with a bunch of .rdata files
file_names=as.list(dir(pattern="stock_*"))
[[1]]
[1] "stock_1.rdata"
[[2]]
[1] "stock_2.rdata"
Now, how do I load these files with a single call?
I can always do:
for(i in 1:length(file_names)) load(file_names[[i]])
but why can't I do something like do.call(load, file_names)?
I suppose none of the apply functions would work because most of them would return lists but nothing should be returned, just that these files need to be loaded. I cannot get the get function to work in this context either. Ideas?
lapply works, but you have to specify that you want the objects loaded to the .GlobalEnv otherwise they're loaded into the temporary evaluation environment created (and destroyed) by lapply.
lapply(file_names,load,.GlobalEnv)
For what it's worth, the above didn't exactly work for me, so I'll post how I adapted that answer:
I have files in folder_with_files/ that are prefixed by prefix_pattern_, are all of type .RData, and are named what I want them to be named in my R environment. For example, if I had saved var_x = 5, I would save it as prefix_pattern_var_x.RData in folder_with_files.
I get the list of the file names, then generate their full path to load them, then gsub out the parts that I don't want: taking it (for object1 as an example) from folder_with_files/prefix_pattern_object1.RData to object1 as the objname to which I will assign the object stored in the RData file.
file_names=as.list(dir(path = 'folder_with_files/', pattern="prefix_pattern_*"))
file_names = lapply(file_names, function(x) paste0('folder_with_files/', x))
out = lapply(file_names,function(x){
env = new.env()
nm = load(x, envir = env)[1]
objname = gsub(pattern = 'folder_with_files/', replacement = '', x = x, fixed = T)
objname = gsub(pattern = 'prefix_pattern_|\\.RData$', replacement = '', x = objname)
# print(str(env[[nm]]))
assign(objname, env[[nm]], envir = .GlobalEnv)
0 # succeeded
} )
Loading many files in a function?
Here's a modified version of Joshua Ulrich's answer that will work both interactively and if placed within a function, by replacing .GlobalEnv with environment():
lapply(file_names, load, environment())
or
foo <- function(file_names) {
lapply(file_names, load, environment())
ls()
}
Working example below. It will write files to your current working directory.
invisible(sapply(letters[1:5], function(l) {
assign(paste0("ex_", l), data.frame(x = rnorm(10)))
do.call(save, list(paste0("ex_", l), file = paste0("ex_", l, ".rda")))
}))
file_names <- paste0("ex_", letters[1:5], ".rda")
foo(file_names)
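If you would rather not create variables in any calling environment, another option is to load each file into its own private environment and hand the contents back as a list. The following is a sketch under that assumption: `load_to_list` is a hypothetical helper, and the demo writes its `.rda` files to a temporary folder.

```r
# Hypothetical helper: load one .rda/.RData file into a private environment
# and return its objects as a named list.
load_to_list <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)   # load() returns the object names
  mget(obj_names, envir = env)
}

# Demo: save three small data frames, one per file.
demo_dir <- tempfile("rda")
dir.create(demo_dir)
for (l in letters[1:3]) {
  obj <- paste0("ex_", l)
  assign(obj, data.frame(x = seq_len(4)))
  save(list = obj, file = file.path(demo_dir, paste0(obj, ".rda")))
}

file_names <- list.files(demo_dir, pattern = "\\.rda$", full.names = TRUE)

# Combine the per-file lists into one flat list, one entry per saved object.
loaded <- do.call(c, lapply(file_names, load_to_list))
```

Because `load()` restores objects under the names they were saved with, the list names come from the saved objects rather than from the file names.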
