Looping Over a Set of Files - r

I cooked up some code that is supposed to find all my .txt files (they're outputs of ODE simulations), open them all up as data frames with "read.table" and then perform some calculations on them.
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
ldf <- lapply(files, read.table)
tuse <- seq(from=0,to=100,by=0.1)
for(files in ldf)
findR <- function(r){
with(files,(sum(exp(-r*age)*fecund*surv*0.1)-1)^2)
}
{
R0 <- with(files,(sum(fecund*surv*age)))
GenTime <- with(files,(sum(tuse*fecund*surv*0.1))/R0)
r <- optimize(f=findR,seq(-5,5,.0001),tol=0.00000001)$minimum
RV <- with(files,(exp(r*tuse)/surv)*(exp(-r*tuse)*(fecund*surv)))
plot(log(surv) ~ age,files,type="l")
tmp.lm <- lm(log(surv) ~ age + I(age^2),files) #Fit log surv to a quadratic
lines(files$age,predict(tmp.lm),col="red")
}
However, the problem is that it seems to only be performing the calculations contained in my "for" loop on one file, rather than all of them. I'd like it to perform the calculations on all of my files, then save all the files together as one big data frame so I can access the results of any particular set of my simulations. I suspect the error is that I'm not indexing the files correctly in order to loop over all of them.

How about using plyr::ldply() for this? It takes a list (in your case, the list of files), applies the same function to each element, and returns a single data frame.
The main thing to remember to do is create a column for the ID of each file you read in so you know which data comes from which file. The simplest way to do this is just to call it the file name and then you can edit it from there.
If you have additional arguments in your function they go after the function you want to use in ldply.
# create file list ("\\.txt$" is a regular expression matching names that end in .txt)
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
                    pattern="\\.txt$", full.names=TRUE)
tuse <- seq(from=0,to=100,by=0.1)
load_and_edit <- function(file, tuse){
  temp <- read.table(file)
  # here put all your calculations you want to do on each file
  temp$R0 <- sum(temp$fecund*temp$surv*temp$age)
  # make a column for each file name so you know which data comes from which file
  temp$id <- file
  return(temp)
}
# pass the vector of file names (not the list.files function) to ldply
new_data <- plyr::ldply(files, load_and_edit, tuse)
This is the easiest way I have found to read in and wrangle multiple files in batch.
You can then plot each one really easily.
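For comparison, a base R version of the same pattern might look like the sketch below. It is not the code from the question: the two small files written to a temporary directory, their values, and the helper name summarise_file are made up for illustration. Only the column names age, fecund and surv and the R0 formula come from the original post.

```r
# Hypothetical example data: two small simulation outputs in a temp directory
out_dir <- file.path(tempdir(), "ode_runs")
dir.create(out_dir, showWarnings = FALSE)
sim <- data.frame(age = c(0, 1, 2), fecund = c(0, 2, 1), surv = c(1, 0.5, 0.25))
write.table(sim, file.path(out_dir, "run1.txt"))
write.table(transform(sim, fecund = fecund * 2), file.path(out_dir, "run2.txt"))

# "\\.txt$" is a regular expression that matches names ending in .txt
files <- list.files(out_dir, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: returns one row of results per file
summarise_file <- function(file) {
  dat <- read.table(file, header = TRUE)
  data.frame(id = basename(file), R0 = sum(dat$fecund * dat$surv * dat$age))
}

# lapply() visits every file; do.call(rbind, ...) stacks the rows
results <- do.call(rbind, lapply(files, summarise_file))
print(results)
```

Returning one row per file keeps the combined data frame small; returning the whole table with an id column, as ldply does above, is the better fit when the per-row values are needed later.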

Related

Create list of files in directory, apply (lapply) a custom function to each, and cbind results to new file

Rewrote in an attempt to simplify my problem statement.
I am using R V1.3.959 and am relatively new to R overall. I have a custom Excel form, which means the objects are in various cells in Excel and the variable is also in some cell. I have over 1000 of these forms as product specs. I read in only 1 file and created a function called tidy.form to pull data out and then cbind into a new file as below.
read_customer_file = "C:/Users/..../FABRIC TECHNICAL SUBMISSION AGREEMENT J123abd.xlsx"
product_tech <- read_excel(read_customer_file, sheet = "Form") %>% clean_names()
#function for make form tidy
form.extract <- function(tidy.form) {
#extract the object / data point looking for but with entire column
fabric.supplier.name <- product_tech[c( 0,5)]
#extract the specific row in the column with the data point desired
fabric.supplier.name <- slice(fabric.supplier.name, 3,0)
#rename column to correct variable
colnames(fabric.supplier.name)[colnames(fabric.supplier.name) == "x5"] <- "fabric.supplier.name"
combine <- cbind(date, fabric.supplier.name, address)
return(combine)
}
Now I need a way to read in all of the xlsx files from a directory and do the same thing for each.
I figured out how to read the file names in through:
files <- list.files(path="C:/Users/me/productspecfolder", pattern="*.xlsx", full.names=TRUE, recursive=FALSE)
However I am stuck at how to loop / lapply through my list.files and apply the function tidy.form to each.
Any help would be so much appreciated!
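One way to approach this is to make the function take a file path, read the file inside the function, and return a one-row data frame; lapply() then runs it over every path and do.call(rbind, ...) stacks the rows. The sketch below is only an illustration under assumptions: CSV files in a temporary directory stand in for the Excel forms so that it runs without extra packages, and the cell positions, the sample values and the name extract_form are hypothetical. For the real forms, the read step would be the read_excel() call from the question.

```r
# Hypothetical stand-ins for the forms: the supplier name sits in row 3, column 2
form_dir <- file.path(tempdir(), "forms")
dir.create(form_dir, showWarnings = FALSE)
for (supplier in c("Acme", "Bolt")) {
  form <- data.frame(x1 = c("title", "date", "supplier"),
                     x2 = c("spec", "2020-01-01", supplier))
  write.csv(form, file.path(form_dir, paste0(supplier, ".csv")), row.names = FALSE)
}

files <- list.files(form_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical extractor: it takes a path, so it can be applied to any file
extract_form <- function(path) {
  # for xlsx files this line would become read_excel(path, sheet = "Form")
  form <- read.csv(path, stringsAsFactors = FALSE)
  data.frame(file = basename(path), fabric.supplier.name = form[3, 2])
}

# Apply the extractor to every file and stack the one-row results
all_forms <- do.call(rbind, lapply(files, extract_form))
print(all_forms)
```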

R: Doing the same steps on many data frames with their names stored in a vector

I have several .RData files, each of which has letters and numbers in its name, eg. m22.RData. Each of these contains a single data.frame object, with the same name as the file, eg. m22.RData contains a data.frame object named "m22".
I can generate the file names easily enough with something like datanames <- paste0(c("m","n"),seq(1,100)) and then use load() on those, which will leave me with a few hundred data.frame objects named m1, m2, etc. What I am not sure of is how to do the next step -- prepare and merge each of these dataframes without having to type out all their names.
I can make a function that accepts a data frame as input and does all the processing. But if I pass it datanames[22] as input, I am passing it the string "m22", not the data frame object named m22.
My end goal is to repeatedly do the same steps on a bunch of different data frames without manually typing out "prepdata(m1) prepdata(m2) ... prepdata(n100)". I can think of two ways to do it, but I don't know how to implement either of them:
1. Get from a vector of the names of the data frames to a list containing the actual data frames.
2. Modify my "prepdata" function so that it can accept the name of the data frame, but then still somehow be able to do things to the data frame itself (possibly by way of "assign"? But the last step of the function will be to merge the prepared data to a bigger data frame, and I'm not sure if there's a method that uses "assign" that can do that...)
Can anybody advise on how to implement either of the above methods, or another way to make this work?
See this answer and the corresponding R FAQ
Basically:
temp1 <- c(1,2,3)
save(temp1, file = "temp1.RData")
x <- c()
x[1] <- load("temp1.RData")
get(x[1])
#> [1] 1 2 3
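If the data frames are already loaded under the names held in a vector, mget() does the same lookup for all names at once and returns a named list. A minimal sketch, with made-up data frames m1 and n1 and a placeholder prepdata that only stands in for the real preparation step:

```r
# Hypothetical data frames standing in for the loaded objects
m1 <- data.frame(x = 1:2)
n1 <- data.frame(x = 3:4)
datanames <- c("m1", "n1")

# Placeholder for the real preparation step
prepdata <- function(df) {
  df$x2 <- df$x * 2
  df
}

# Names -> named list of data frames, then prepare each and stack them
df_list <- mget(datanames, envir = environment())
prepared <- lapply(df_list, prepdata)
merged <- do.call(rbind, prepared)
print(merged)
```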
Assuming all your data exists in the same folder, you can create an R object with all the paths, then create a function that takes the path to an RData file, reads it and calls "prepdata". Finally, using the purrr package you can apply the same function to an input vector.
Something like this should work:
library(purrr)
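rdata_paths <- list.files(path = "path/to/your/files", pattern = "\\.RData$", full.names = TRUE)
read_rdata <- function(path) {
  # load() returns the name of the loaded object, not the object itself,
  # so fetch the data frame with get()
  obj_name <- load(path)
  return(get(obj_name))
}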
prepdata <- function(data) {
### your prepdata implementation
}
master_function <- function(path) {
data <- read_rdata(path)
result <- prepdata(data)
return(result)
}
merged_rdatas <- map_df(rdata_paths, master_function) # This creates one dataset by row-binding all the results
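A base R variant that avoids filling the workspace is to load each file into its own environment. The sketch below creates two small .RData files in a temporary directory purely for illustration; the helper name read_rdata_obj is hypothetical, and it assumes each file holds a single object, as described in the question.

```r
# Hypothetical example files, each holding one data frame
rdata_dir <- file.path(tempdir(), "rdata_demo")
dir.create(rdata_dir, showWarnings = FALSE)
m1 <- data.frame(id = "m1", value = 1:2)
m2 <- data.frame(id = "m2", value = 3:4)
save(m1, file = file.path(rdata_dir, "m1.RData"))
save(m2, file = file.path(rdata_dir, "m2.RData"))

# Hypothetical helper: load into a fresh environment and return the object
read_rdata_obj <- function(path) {
  env <- new.env()
  obj_name <- load(path, envir = env)  # load() returns the object's name
  get(obj_name, envir = env)
}

paths <- list.files(rdata_dir, pattern = "\\.RData$", full.names = TRUE)
merged <- do.call(rbind, lapply(paths, read_rdata_obj))
print(merged)
```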

how to use "for loop" to write multiple .csv file names?

Does anyone know the best way to carry out a "for loop" that would read in different subject id's and append them to the name of an exported csv?
As an example, I have multiple output files from an electrocardiogram software program (each file belongs to one individual). The files are named C800_HR.bdf.evt, C801_HR.bdf.evt, C802_HR.bdf.evt etc. Each file gets read into r and then has a script applied to calculate heart rate variability. At the end of the script, I need to add a loop that will extract the subject id (e.g., C800, C801, C802) and write a new file name for each individual so that it becomes C800_RtoR.csv. Essentially, I would like to avoid changing the syntax every time I read in and export a file name.
I am currently using the following syntax to read in multiple files:
>setwd("/Users/kmpc/Downloads")
>myhrvdata <-lapply(Sys.glob("C8**_HR.bdf.evt"), read.delim)
Try this out:
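# escape the dots so they match literally, and anchor the pattern at both ends
cardio_files <- list.files(pattern = "^C8\\d{2}_HR\\.bdf\\.evt$")
subject_ids <- sub("^(C8\\d{2})_.*", "\\1", cardio_files)
myList <- lapply(cardio_files, read.delim)
## do calculations on the list
# myList has no names, so loop over positions rather than names(myList)
for (i in seq_along(myList)) {
  write.csv(myList[[i]], paste0(subject_ids[i], "_RtoR.csv"))
}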
The only thing is, you have to deal with using a list when doing your calculations. You could combine them to a single data.frame, but it would be best to leave it as a list to write the files at the end.
Consider generalizing your process by creating a function that: 1) reads in the file, 2) processes the data, 3) outputs to csv. Then have lapply call that function on every Sys.glob match and return a list of the calculated data frames.
proc_heart_rate <- function(f_name) {
# READ IN .evt FILE INTO df
df <- read.delim(f_name)
# CALCULATE HEART RATE VARIABILITY WITH df
...
# OUTPUT df TO CSV
subject_id <- gsub("\\_.*", "", f_name)
write.csv(df, paste0(subject_id, "_RtoR.csv"))
# RETURN df FOR OTHER USES
return(df)
}
# LIST OF DATA FRAMES WITH CALCULATIONS
myhrvdata_list <-lapply(Sys.glob("C8**_HR.bdf.evt"), proc_heart_rate)
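Note that the gsub() call above works on whatever string is passed in, so with full paths the directory would end up in the ID. A small sketch of the ID extraction, using made-up files in a temporary directory and basename() to drop the directory first (the file contents are placeholders, not real heart rate data):

```r
# Hypothetical input files in a temp directory
ecg_dir <- file.path(tempdir(), "ecg_demo")
dir.create(ecg_dir, showWarnings = FALSE)
for (id in c("C800", "C801")) {
  writeLines(c("time\tevent", "0.8\tR"),
             file.path(ecg_dir, paste0(id, "_HR.bdf.evt")))
}

in_files <- Sys.glob(file.path(ecg_dir, "C8*_HR.bdf.evt"))

# Strip the directory, then everything from the first underscore onward
subject_ids <- sub("_.*$", "", basename(in_files))
out_files <- file.path(ecg_dir, paste0(subject_ids, "_RtoR.csv"))

# Read each input and write it back out under the new name
for (i in seq_along(in_files)) {
  df <- read.delim(in_files[i])
  write.csv(df, out_files[i], row.names = FALSE)
}
print(basename(out_files))
```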

Building a mean across several csv files

I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could proceed?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv$")[10:25] # [10:25] is the example range; in production use your function parameter (id) here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles # OPTIONAL: the file names then become row-name prefixes after rbind
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
You should be able to adapt these code snippets and incorporate them into your routine.
You can aggregate your csv files into one big table like this :
bigtable <- NULL # start empty; rbind(NULL, x) simply returns x
for(i in 100:250)
{
  infile<-paste("C:/Users/cw/Documents/",i,".csv",sep="")
  newtable<-read.csv(infile)
  newtable<-cbind(newtable,rep(i,dim(newtable)[1])) # if you want to be able to identify tables after they are aggregated
  bigtable<-rbind(bigtable,newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099 because of the leading zeros, but that is easy to fix, for example by building the file name with sprintf("%03d.csv", i).
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=T), where sprintf pads the IDs to three digits so they match names like 010.csv.
They should also teach you to never use <<-.
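The file-name matching discussed in these answers can be tried out on a small scale. The sketch below is only an illustration: the three files, the column name value and its numbers are invented, and sprintf("%03d.csv", id) is used to turn numeric IDs into zero-padded names.

```r
# Hypothetical data: files 001.csv to 003.csv with one numeric column
csv_dir <- file.path(tempdir(), "monitor_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(value = c(i, i + 1)),
            file.path(csv_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}

id <- 2:3
# sprintf() pads each ID to three digits, giving "002.csv" and "003.csv"
wanted <- file.path(csv_dir, sprintf("%03d.csv", id))

# Read only the requested files, stack them, and average one column
df <- do.call(rbind, lapply(wanted, read.csv))
print(mean(df$value))
```

Because everything lives in local variables, there is no need for setwd() or <<- inside the function.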

How to not overwrite file in R

I am trying to copy and paste tables from R into Excel. Consider the following code from a previous question:
data <- list.files(path=getwd())
n <- length(data)
for (i in 1:n)
{
data1 <- read.csv(data[i])
outline <- data1[,2]
outline <- as.data.frame(table(outline))
print(outline) # this prints all n tables
name <- paste0(i,"X.csv")
write.csv(outline, name)
}
This code writes each table into separate Excel files (i.e. "1X.csv", "2X.csv", etc..). Is there any way of "shifting" each table down some rows instead of rewriting the previous table each time? I have also tried this code:
output <- as.data.frame(output)
wb = loadWorkbook("X.xlsx", create=TRUE)
createSheet(wb, name = "output")
writeWorksheet(wb,output,sheet="output",startRow=1,startCol=1)
writeNamedRegion(wb,output,name="output")
saveWorkbook(wb)
But this does not copy the dataframes exactly into Excel.
I think, as mentioned in the comments, the way to go is to first merge the data frames in R and then write them into (one) output file:
# get vector of filenames
filenames <- list.files(path=getwd())
# for each filename: load file and create outline
outlines <- lapply(filenames, function(filename) {
data <- read.csv(filename)
outline <- data[,2]
outline <- as.data.frame(table(outline))
outline
})
# merge all outlines into one data frame (by appending them row-wise)
outlines.merged <- do.call(rbind, outlines)
# save merged data frame
write.csv(outlines.merged, "all.csv")
Despite what Microsoft would like you to believe, .csv files are not Excel files; they are a common file type that can be read by Excel and many other programs.
The best approach depends on what you really want to do. Do you want all the tables read into a single worksheet in Excel? If so, you could write them all to a single file using write.table with append=TRUE (write.csv ignores the append argument and warns). Or use a connection that you keep open so each new table is appended. You may want to use cat to put a couple of newlines before each new table.
Your second attempt looks like it uses the XLConnect package (but you don't say, so it could be something else). I would think this is the best approach. How is the result different from what you are expecting?
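Appending every table to one file could be sketched as below. This is an illustration with made-up frequency tables and a temporary output file, not the asker's data. write.table() is used because it honours append = TRUE, and the warning it gives about appending column names is suppressed on purpose.

```r
out_file <- file.path(tempdir(), "all_tables.csv")
if (file.exists(out_file)) file.remove(out_file)

# Hypothetical frequency tables standing in for the per-file outlines
tables <- list(
  as.data.frame(table(outline = c("a", "a", "b"))),
  as.data.frame(table(outline = c("c", "d", "d")))
)

for (tab in tables) {
  # each table is written below the previous one, with its own header row
  suppressWarnings(
    write.table(tab, out_file, sep = ",", row.names = FALSE,
                col.names = TRUE, append = TRUE)
  )
  cat("\n", file = out_file, append = TRUE)  # blank line between tables
}
print(readLines(out_file))
```

Opened in Excel, the file shows the tables stacked in one sheet and separated by empty rows.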

Resources