Hi, I'm writing a function that reads a file and returns a time series. I then need to assign this time series to a variable. I'm trying to do this in a terse way, using R's functional programming features.
library(xts)  # attaches zoo as well, for read.zoo() and na.locf()
# read a file and return an xts time series, with NAs filled forward
readFile <- function(fileName, filePath){
  fullPath <- paste(filePath, fileName, sep='')
  f <- as.xts(read.zoo(fullPath, format='%d/%m/%Y',
                       FUN=as.Date, header=TRUE, sep='\t'))
  return(na.locf(f))
}
filePath <- 'C://data/'
# real list of files is a lot longer
fnames <- c('d1.csv', 'd2.csv','d3.csv');
varnames <- c('data1', 'data2', 'data3');
In the above piece of code I would like to initialise variables named data1, data2, data3 by applying the readFile function to fnames and filePath (which is always constant).
Something like :
lapply( fnames, readFile, filePath);
The above doesn't work, of course, and neither does it do the dynamic variable assignment that I'm trying to achieve. Any R functional programming gurus out there who could guide me?
The working version of this would look something like :
data1 <- readFile('d1.csv', filePath);
data2 <- readFile('d2.csv', filePath);
YIKES
Constructing many variables with specified names is a somewhat common request on SO and can certainly be managed with the assign function, but you'll probably find it easier to handle your data if you build a list instead of multiple variables. For instance, you could read in all the results and obtain a named list with:
lst <- setNames(lapply(fnames, readFile, filePath), varnames)
Then you could access your results from each csv file with lst[["data1"]], lst[["data2"]], and lst[["data3"]].
The benefit of this approach is that you can now perform an operation over all your time series variables using lapply(lst, ...) instead of looping through all your variables or looping through variable names and using the get function.
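For instance, a summary across every series becomes a one-liner once everything lives in one list. A toy sketch, with plain numeric vectors standing in for the xts objects:

```r
# toy stand-ins for the three time series read from the csv files
lst <- list(data1 = c(1, 2, 3), data2 = c(4, 5, 6), data3 = c(7, 8, 9))

# one operation over every series, no get()/assign() needed
means <- lapply(lst, mean)
means[["data2"]]
# 5
```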
Related
I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which corresponds to a different year (2010-2019). I then filter down each .csv file based on a value in one of the columns, because the datasets are very large. Currently I am using the code below to do this, repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a for loop to create a new object data_20xx for each year, and then read in the .csv files (and apply the filter on "type") for each year too.
I think I know how to create the objects in a for loop, but I'm not entirely sure how I would also read in the .csv files and change the file path string so it updates with each year (i.e. "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time please provide a reproducible example so we can help you.
I would use data.table which contains specialized functions to do what you want.
library(data.table)
setwd("Project")
allfiles <- list.files(recursive = TRUE, full.names = TRUE)
allcsv <- allfiles[grepl("\\.csv$", allfiles)]
data_list <- list()
for (i in seq_along(allcsv)) {
  print(round(i / length(allcsv), 2))   # progress indicator
  data_list[[i]] <- fread(allcsv[i])    # [[ ]] so the whole table is stored
}
data_list_filtered <- lapply(data_list, function(x) {
  y <- as.data.frame(x)
  y[y$type == "ABC123", ]
})
result <- rbindlist(data_list_filtered)
First, list.files will tell you all the files contained in your working dir by default.
Second, read each csv file into the data_list list using the fast and efficient fread function.
Third, do the filtering within a loop, as requested.
Fourth, use rbindlist from data.table to rbind all of these data.tables into one.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert your results back to a data.frame.
I strongly encourage you to learn the data.table syntax as it is quite powerful and efficient for tabular data manipulations. These vignettes will get you started.
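As a small taste of that syntax, the filter step could also be done directly on each data.table, skipping the data.frame conversion. A minimal sketch with made-up rows standing in for one year's file:

```r
library(data.table)

# made-up rows standing in for one year's csv
dt <- data.table(id = 1:4, type = c("ABC123", "XYZ", "ABC123", "XYZ"))

# the i-expression filters rows directly; columns are visible as variables
filtered <- dt[type == "ABC123"]
nrow(filtered)
# 2
```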
I have several .RData files, each of which has letters and numbers in its name, e.g. m22.RData. Each of these contains a single data.frame object with the same name as the file, e.g. m22.RData contains a data.frame object named "m22".
I can generate the file names easily enough with something like datanames <- paste0(c("m","n"),seq(1,100)) and then use load() on those, which will leave me with a few hundred data.frame objects named m1, m2, etc. What I am not sure of is how to do the next step -- prepare and merge each of these dataframes without having to type out all their names.
I can make a function that accepts a data frame as input and does all the processing. But if I pass it datanames[22] as input, I am passing it the string "m22", not the data frame object named m22.
My end goal is to repeatedly do the same steps on a bunch of different data frames without manually typing out "prepdata(m1) prepdata(m2) ... prepdata(n100)". I can think of two ways to do it, but I don't know how to implement either of them:
Get from a vector of the names of the data frames to a list containing the actual data frames.
Modify my "prepdata" function so that it can accept the name of the data frame, but then still somehow be able to do things to the data frame itself (possibly by way of "assign"? But the last step of the function will be to merge the prepared data to a bigger data frame, and I'm not sure if there's a method that uses "assign" that can do that...)
Can anybody advise on how to implement either of the above methods, or another way to make this work?
See this answer and the corresponding R FAQ
Basically:
temp1 <- c(1, 2, 3)
save(temp1, file = "temp1.RData")
x <- c()
x[1] <- load("temp1.RData")  # load() returns the *name* of the restored object
get(x[1])                    # get() maps that name back to the object itself
#> [1] 1 2 3
Assuming all your data exists in the same folder, you can create an R object with all the paths, then write a function that takes the path to an .RData file, reads it and calls "prepdata". Finally, using the purrr package you can apply the same function to the whole vector of paths.
Something like this should work:
library(purrr)
rdata_paths <- list.files(path = "path/to/your/files", full.names = TRUE)
read_rdata <- function(path) {
  name <- load(path)  # load() returns the *name* of the restored object
  return(get(name))   # get() retrieves the object itself
}
prepdata <- function(data) {
### your prepdata implementation
}
master_function <- function(path) {
data <- read_rdata(path)
result <- prepdata(data)
return(result)
}
merged_rdatas <- map_df(rdata_paths, master_function) # row-binds all results into one data frame
I cooked up some code that is supposed to find all my .txt files (they're outputs of ODE simulations), open them all up as data frames with "read.table" and then perform some calculations on them.
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
ldf <- lapply(files, read.table)
tuse <- seq(from=0,to=100,by=0.1)
for(files in ldf)
findR <- function(r){
with(files,(sum(exp(-r*age)*fecund*surv*0.1)-1)^2)
}
{
R0 <- with(files,(sum(fecund*surv*age)))
GenTime <- with(files,(sum(tuse*fecund*surv*0.1))/R0)
r <- optimize(f=findR,seq(-5,5,.0001),tol=0.00000001)$minimum
RV <- with(files,(exp(r*tuse)/surv)*(exp(-r*tuse)*(fecund*surv)))
plot(log(surv) ~ age,files,type="l")
tmp.lm <- lm(log(surv) ~ age + I(age^2),files) #Fit log surv to a quadratic
lines(files$age,predict(tmp.lm),col="red")
}
However, the problem is that it seems to only be performing the calculations contained in my "for" loop on one file, rather than all of them. I'd like it to perform the calculations on all of my files, then save all the files together as one big data frame so I can access the results of any particular set of my simulations. I suspect the error is that I'm not indexing the files correctly in order to loop over all of them.
How about using plyr::ldply() for this? It takes a list (in your case, your list of files), performs the same function on each element, and then returns a data frame.
The main thing to remember is to create a column holding the ID of each file you read in, so you know which data comes from which file. The simplest way to do this is to use the file name and then edit it from there.
If you have additional arguments in your function they go after the function you want to use in ldply.
# create file list
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
tuse <- seq(from=0,to=100,by=0.1)
load_and_edit <- function(file, tuse){
  temp <- read.table(file)
  # here put all the calculations you want to do on each file
  temp$R0 <- sum(temp$fecund * temp$surv * temp$age)
  # make a column with the file name so you know which data comes from which file
  temp$id <- file
  return(temp)
}
new_data <- plyr::ldply(files, load_and_edit, tuse = tuse)
This is the easiest way I have found to read in and wrangle multiple files in batch.
You can then plot each one really easily.
The following works, but I'm missing a functional programming technique, an indexing trick, or a better way of structuring my data. After a month away it will take a while to remember exactly how this works; it feels like a workaround when it shouldn't be. I want to use regexes to decide which function to use for each expected group of files. When a new file format comes along, I can write the read function, then add it along with its regex to the data.frame so it runs alongside all the rest.
I have different formats of Excel and csv files that need to be read in and standardized. I want to maintain a list or data.frame of filename regexes and the appropriate read function for each. Sometimes there will be new file formats that aren't matched yet, and old formats with no new files. But then it gets complicated, which is something I would prefer to avoid.
# files to read in based on filename
fileexamples <- data.frame(
filename = c('notanyregex.xlsx','regex1today.xlsx','regex2today.xlsx','nomatch.xlsx','regex1yesterday.xlsx','regex2yesterday.xlsx','regex3yesterday.xlsx'),
readfunctionname = NA
)
# regex and corresponding read function
filesourcelist <- read.table(header = T,stringsAsFactors = F,text = "
greptext readfunction
'.*regex1.*' 'readsheettype1'
'.*nonematchthis.*' 'readsheetwrench'
'.*regex2.*' 'readsheettype2'
'.*regex3.*' 'readsheettype3'
")
# list of grepped files
fileindex <- lapply(filesourcelist$greptext, function(greptext, files){
  grep(pattern = greptext, x = files, ignore.case = TRUE)
}, files = fileexamples$filename)
# fill in the read function name based on fileindex from grep
for(i in seq_along(fileindex)){
  fileexamples[fileindex[[i]], 'readfunctionname'] <- filesourcelist$readfunction[i]
}
I have a lot of results from a parametric study to analyze. Fortunately there is an index file where the output locations are saved, so I need to read the file names from it. I used this routine:
IndexJobs <- read.csv("C:/Users/.../File versione7.1/IndexJobs.csv",
                      sep = ",", header = TRUE, stringsAsFactors = FALSE)
dir <- IndexJobs$WORKDIR
Dir <- gsub("\\\\", "/", dir)
Dir1 <- gsub(" C", "C", Dir)
Now I use a for loop to read the CSVs and create a different data frame for each:
for(i in Dir1){
filepath <- file.path(paste(i,"eplusout.csv",sep=""))
dat<-NULL
dat<-read.table(filepath,header=TRUE,sep=",")
filenames <- substr(filepath,117,150)
names <-substr(filenames,1,21)
assign(names, dat)
}
Now I want to extract selected variables from each data frame, collecting each variable from all the data frames into a separate data frame. I would also like to join the variable name with the data frame name, so the columns are clearly labelled for analysis. I tried to make something, but with bad results.
I tried to insert some extra rows inside the for loop:
for(i in Dir1){
filepath <- file.path(paste(i,"eplusout.csv",sep=""))
dat<-NULL
dat<-read.table(filepath,header=TRUE,sep=",")
filenames <- substr(filepath,117,150)
names <-substr(filenames,1,21)
assign(names, dat)
datTest<-dat$X5EC132.Surface.Outside.Face.Temperature..C..TimeStep.
nameTest<-paste(names,"_Test",sep="")
assign(nameTest,datTest)
DFtest=c[,nameTest]
}
But for each i, DFtest is overwritten, so only the last database's column remains.
Any suggestions? Thanks
Maybe it will work if you replace DFtest=c[,nameTest] with
DFtest[nameTest] <- get(nameTest)
or, alternatively,
DFtest[nameTest] <- datTest
This procedure assumes the object DFtest exists before you run the loop.
An alternative way is to create an empty list before running the loop:
DFtest <- list()
In the loop, you can use the following command:
DFtest[[nameTest]] <- datTest
After the loop, all values in the list DFtest can be combined using
do.call("cbind", DFtest)
Note that this will only work if all vectors in the list DFtest have the same length.
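A minimal illustration of the list-then-cbind approach, with toy vectors standing in for the extracted columns (the names here are made up; in the real loop each element would be datTest stored under nameTest):

```r
DFtest <- list()
for (nameTest in c("run1_Test", "run2_Test")) {
  DFtest[[nameTest]] <- 1:3   # toy column standing in for datTest
}
combined <- do.call("cbind", DFtest)  # list names become column names
colnames(combined)
# "run1_Test" "run2_Test"
```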