R extract variable from multiple dataframe in loop - r

I have a lot of results from a parametric study to analyze. Fortunately, there is an index file that records where the output files are saved. I need to save the file names. I used this routine:
IndexJobs <- read.csv("C:/Users/.../File versione7.1/IndexJobs.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
dir<-IndexJobs$WORKDIR
Dir<-gsub("\\\\","/",dir)
Dir1<-gsub(" C","C",Dir)
Now I use a for loop to read each CSV and create a separate data frame for it:
for(i in Dir1){
filepath <- file.path(paste(i,"eplusout.csv",sep=""))
dat<-NULL
dat<-read.table(filepath,header=TRUE,sep=",")
filenames <- substr(filepath,117,150)
names <-substr(filenames,1,21)
assign(names, dat)
}
Now I want to extract selected variables from each data frame and collect each variable, across all data frames, into its own separate data frame. I would also like to combine the variable name with the name of the source data frame, so that the columns are clearly labeled for later analysis. I tried something, but with bad results.
I tried adding a few more lines to the for loop:
for(i in Dir1){
filepath <- file.path(paste(i,"eplusout.csv",sep=""))
dat<-NULL
dat<-read.table(filepath,header=TRUE,sep=",")
filenames <- substr(filepath,117,150)
names <-substr(filenames,1,21)
assign(names, dat)
datTest<-dat$X5EC132.Surface.Outside.Face.Temperature..C..TimeStep.
nameTest<-paste(names,"_Test",sep="")
assign(nameTest,datTest)
DFtest=c[,nameTest]
}
But DFtest is overwritten on each iteration, so only the column from the last data frame remains.
Any suggestions? Thanks.

Maybe it will work if you replace DFtest=c[,nameTest] with
DFtest[nameTest] <- get(nameTest)
or, alternatively,
DFtest[nameTest] <- datTest
This procedure assumes the object DFtest exists before you run the loop.
An alternative way is to create an empty list before running the loop:
DFtest <- list()
In the loop, you can use the following command:
DFtest[[nameTest]] <- datTest
After the loop, all values in the list DFtest can be combined using
do.call("cbind", DFtest)
Note that this will only work if all vectors in the list DFtest have the same length.
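For illustration, here is a minimal sketch of the list approach, under the assumption that every run yields a column of the same length. The runs list, its temp column, and the job_001/job_002 names are made-up stand-ins for the data frames read from each eplusout.csv.

```r
# Made-up stand-ins for the data frames read from each eplusout.csv
runs <- list(
  job_001 = data.frame(temp = c(20.1, 20.4, 20.9), other = 1:3),
  job_002 = data.frame(temp = c(19.8, 20.0, 20.3), other = 4:6)
)

DFtest <- list()                        # created once, before the loop
for (run_name in names(runs)) {
  dat <- runs[[run_name]]
  nameTest <- paste(run_name, "_Test", sep = "")
  DFtest[[nameTest]] <- dat$temp        # one list element per run, nothing is overwritten
}

# cbind() the equal-length vectors, then convert the matrix to a data frame
combined <- as.data.frame(do.call("cbind", DFtest))
```

Because each run is stored under its own name, the column names of combined show which run every column came from.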

Related

Iterating over CSVs to different dataframes based on file names

I have a dataframe that contains the names of a bunch of .CSV files. It looks how it does in the snippet below:
What I'm trying to do is read each of these .CSVs and append the results into combined dataframes. Specifically, I want to create three different dataframes based on what's in the file names:
Create a dataframe with all results from .CSV files with -callers- in its file name
Create a dataframe with all results from .CSV files with -results- in its file name
Create a dataframe with all results from .CSV files with -script_results- in its filename
The command to actually convert the .CSV file into a dataframe looks like this if I were using the first .CSV in the dataframe below:
data <- aws.s3::s3read_using(read.csv, object = "s3://abc-testtalk/08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv")
But what I'm trying to do is:
Iterate ALL the .csv files under Key using the s3read_using function
Put them in three separate dataframes based on the file names as listed above
Key
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-606698088.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-450823767.csv
08182020-testpilot-iowa-callers-08-18-2020-374839084.csv
08182020-testpilot-maine-callers-08-18-2020-396935866.csv
08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv
08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv
08182020-testpilot-michigan-callers-08-18-2020-367670258.csv
08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv
08182020-testpilot-michigan-results-08-18-2020--08-18-2020-544974900.csv
08182020-testpilot-michigan-script_results-08-18-2020--08-18-2020-239089219.csv
08182020-testpilot-nevada-callers-08-18-2020-782329503.csv
08182020-testpilot-nevada-results-08-18-2020--08-18-2020-348644934.csv
08182020-testpilot-nevada-script_results-08-18-2020--08-18-2020-517037762.csv
08182020-testpilot-new-hampshire-callers-08-18-2020-134150800.csv
08182020-testpilot-north-carolina-callers-08-18-2020-739838755.csv
08182020-testpilot-pennsylvania-callers-08-18-2020-223839956.csv
08182020-testpilot-pennsylvania-results-08-18-2020--08-18-2020-747438886.csv
08182020-testpilot-pennsylvania-script_results-08-18-2020--08-18-2020-546894204.csv
08182020-testpilot-virginia-callers-08-18-2020-027531377.csv
08182020-testpilot-virginia-follow-ups-08-18-2020--08-18-2020-419338697.csv
08182020-testpilot-virginia-results-08-18-2020--08-18-2020-193170030.csv
Create 3 empty dataframes. You will probably also need to specify column names that match the column names of the files you want to append:
results <- data.frame()
script_results <- data.frame()
callers <- data.frame()
Then iterate over file_name and read each file into the data object. Depending on which pattern ("-results-", "-script_results-" or "-callers-") is contained in the name of each file, it will be appended to the correct dataframe:
for (file in file_name) {
  data <- aws.s3::s3read_using(read.csv, object = paste0("s3://abc-testtalk/", file))
  # grepl() takes the pattern first, then the string to search
  if (grepl("-results-", file, fixed = TRUE)) { results <- rbind(results, data) }
  if (grepl("-script_results-", file, fixed = TRUE)) { script_results <- rbind(script_results, data) }
  if (grepl("-callers-", file, fixed = TRUE)) { callers <- rbind(callers, data) }
}
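To check the pattern matching without touching S3, the file names can be classified on their own first. This is only a sketch: the four keys are taken from the list in the question, while the type vector and the "other" label are illustrative choices.

```r
keys <- c(
  "08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv",
  "08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv",
  "08182020-testpilot-iowa-callers-08-18-2020-374839084.csv",
  "08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv"
)

# Start with a catch-all label, then overwrite it wherever a pattern matches
type <- rep("other", length(keys))
for (pat in c("callers", "results", "script_results")) {
  # grepl(pattern, x): the pattern comes first, the strings to search second
  type[grepl(paste0("-", pat, "-"), keys, fixed = TRUE)] <- pat
}

# One character vector of file names per type
groups <- split(keys, type)
```

Note that "-results-" does not match the script_results files, because there the word is preceded by an underscore rather than a hyphen.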
As an alternative to @JohnFranchak's recommendation for map_dfr (which likely works just fine), the method that I referenced in comments would look something like this:
alldat <- lapply(setNames(nm = dat$file_name),
function(obj) aws.s3::s3read_using(read.csv, object = obj))
callers <- do.call(rbind, alldat[grepl("-callers-", names(alldat))])
results <- do.call(rbind, alldat[grepl("-results-", names(alldat))])
script_results <- do.call(rbind, alldat[grepl("-script_results-", names(alldat))])
others <- do.call(rbind, alldat[!grepl("-(callers|results|script_results)-", names(alldat))])
The do.call(rbind, ...) part is analogous to dplyr::bind_rows and data.table::rbindlist in that it accepts a list of frames, and the result is a single frame. Some differences:
do.call(rbind, ...) really requires all columns to exist in all frames, in the same order. It's not hard to enforce this externally (e.g., adding missing columns, rearranging), but it's not automatic.
data.table::rbindlist will complain for the same conditions (missing columns or different order), but it has fill= and use.names= arguments that need to be set TRUE.
dplyr::bind_rows will fill and row-bind by-name by default, without message or warning. (I don't agree that a default of silence is good all of the time, but it is the simplest.)
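As a rough base-R sketch of "adding missing columns" before do.call(rbind, ...): the helper name rbind_fill and the two small frames are hypothetical, made up for this illustration.

```r
# Hypothetical helper: pad each frame with NA columns so all frames share the
# same column set, then row-bind them in a common column order.
rbind_fill <- function(frames) {
  all_cols <- unique(unlist(lapply(frames, names)))
  padded <- lapply(frames, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA                  # add each absent column, filled with NA
    }
    df[all_cols]                       # enforce a common column order
  })
  do.call(rbind, padded)
}

a <- data.frame(id = 1:2, x = c(0.5, 1.5))
b <- data.frame(id = 3L, y = "z", stringsAsFactors = FALSE)
combined <- rbind_fill(list(a, b))
```

Calling rbind(a, b) directly on these two frames fails because their column names differ, which is the situation the padding step works around.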
Lastly, my use of setNames(nm=..) is merely to assign the filename to each object. This is not strictly necessary since we still have dat$file_name, but I've found that with two separate objects, it is easy to accidentally change (delete, append, or reorder) one of them and not the other, so I prefer to keep the names and the objects (frames) perfectly tied together. These two calls produce essentially the same named list:
lapply(setNames(nm = dat$file_name), ...)
sapply(dat$file_name, ..., simplify = FALSE)
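A small check of that equivalence, using nchar as a stand-in for the reading function and two made-up file names:

```r
file_name <- c("a-callers-1.csv", "b-results-2.csv")   # made-up names

# Both calls return a list whose names are the input strings
via_lapply <- lapply(setNames(nm = file_name), nchar)
via_sapply <- sapply(file_name, nchar, simplify = FALSE)

same <- identical(via_lapply, via_sapply)
```

Either way, the names travel with the list, so subsetting by grepl() on names(...) stays in sync with the data.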

R: Doing the same steps on many data frames with their names stored in a vector

I have several .RData files, each of which has letters and numbers in its name, eg. m22.RData. Each of these contains a single data.frame object, with the same name as the file, eg. m22.RData contains a data.frame object named "m22".
I can generate the file names easily enough with something like datanames <- paste0(c("m","n"),seq(1,100)) and then use load() on those, which will leave me with a few hundred data.frame objects named m1, m2, etc. What I am not sure of is how to do the next step -- prepare and merge each of these dataframes without having to type out all their names.
I can make a function that accepts a data frame as input and does all the processing. But if I pass it datanames[22] as input, I am passing it the string "m22", not the data frame object named m22.
My end goal is to repeatedly do the same steps on a bunch of different data frames without manually typing out "prepdata(m1) prepdata(m2) ... prepdata(n100)". I can think of two ways to do it, but I don't know how to implement either of them:
Get from a vector of the names of the data frames to a list containing the actual data frames.
Modify my "prepdata" function so that it can accept the name of the data frame, but then still somehow be able to do things to the data frame itself (possibly by way of "assign"? But the last step of the function will be to merge the prepared data to a bigger data frame, and I'm not sure if there's a method that uses "assign" that can do that...)
Can anybody advise on how to implement either of the above methods, or another way to make this work?
See this answer and the corresponding R FAQ
Basically:
temp1 <- c(1,2,3)
save(temp1, file = "temp1.RData")
x <- c()
x[1] <- load("temp1.RData")
get(x[1])
#> [1] 1 2 3
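For the first option in the question (going from a vector of names to a list of the actual data frames), base R's mget() is one way to do it. In this sketch, m1 and m2 stand in for objects already created by load(), and prepdata is a hypothetical placeholder for the real processing.

```r
# Stand-ins for the data frames that load() would have created
m1 <- data.frame(id = 1:2, value = c(10, 20))
m2 <- data.frame(id = 3:4, value = c(30, 40))
datanames <- c("m1", "m2")

# mget() looks up several objects by name and returns them as a named list
frames <- mget(datanames)

# Hypothetical placeholder for the real prepdata() implementation
prepdata <- function(df) transform(df, value2 = value * 2)

# Apply the same steps to every frame, then stack the results
merged <- do.call(rbind, lapply(frames, prepdata))
```

Once the frames are in a list, there is no need to pass names around: prepdata only ever sees a data frame.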
Assuming all your data exists in the same folder, you can create an R object with all the paths, then create a function that takes the path to an .RData file, reads it and calls "prepdata". Finally, using the purrr package, you can apply the same function to an input vector.
Something like this should work:
library(purrr)
rdata_paths <- list.files(path = "path/to/your/files", full.names = TRUE)
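read_rdata <- function(path) {
  env <- new.env()
  obj_name <- load(path, envir = env)  # load() returns the object names, not the data
  get(obj_name[1], envir = env)        # fetch the single data frame stored in the file
}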
prepdata <- function(data) {
### your prepdata implementation
}
master_function <- function(path) {
data <- read_rdata(path)
result <- prepdata(data)
return(result)
}
merged_rdatas <- map_df(rdata_paths, master_function) # This creates one dataset by row-binding all results

how to use "for loop" to write multiple .csv file names?

Does anyone know the best way to carry out a "for loop" that would read in different subject id's and append them to the name of an exported csv?
As an example, I have multiple output files from an electrocardiogram software program (each file belongs to one individual). The files are named C800_HR.bdf.evt, C801_HR.bdf.evt, C802_HR.bdf.evt etc. Each file gets read into R and then has a script applied to calculate heart rate variability. At the end of the script, I need to add a loop that will extract the subject id (e.g., C800, C801, C802) and write a new file name for each individual so that it becomes C800_RtoR.csv. Essentially, I would like to avoid changing the syntax every time I read in and export a file.
I am currently using the following syntax to read in multiple files:
setwd("/Users/kmpc/Downloads")
myhrvdata <- lapply(Sys.glob("C8**_HR.bdf.evt"), read.delim)
Try this out:
cardio_files <- list.files(pattern = "^C8\\d{2}_HR\\.bdf\\.evt$")
subject_ids <- sub("^(C8\\d{2})_.*", "\\1", cardio_files)
myList <- lapply(cardio_files, read.delim)
## do calculations on the list
# myList has no names, so loop over positions instead of names(myList)
for (i in seq_along(myList)) {
  write.csv(myList[[i]], paste0(subject_ids[i], "_RtoR.csv"))
}
The only thing is, you have to deal with using a list when doing your calculations. You could combine them to a single data.frame, but it would be best to leave it as a list to write the files at the end.
Consider generalizing your process by creating a function that: 1) reads in file, 2) processes data, 3) outputs to csv. Then have lapply call the defined method iteratively across all Sys.glob items and even return a list of calculated data frames.
proc_heart_rate <- function(f_name) {
# READ IN .evt FILE INTO df
df <- read.delim(f_name)
# CALCULATE HEART RATE VARIABILITY WITH df
# ...
# OUTPUT df TO CSV
subject_id <- gsub("\\_.*", "", f_name)
write.csv(df, paste0(subject_id, "_RtoR.csv"))
# RETURN df FOR OTHER USES
return(df)
}
# LIST OF DATA FRAMES WITH CALCULATIONS
myhrvdata_list <-lapply(Sys.glob("C8**_HR.bdf.evt"), proc_heart_rate)
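Here is a self-contained sketch of the naming step, run in a temporary directory with made-up input files so it can be tried safely. The work_dir, in_files and out_files names and the file contents are assumptions for the example only.

```r
# Work in a temporary directory so the sketch leaves the real data untouched
work_dir <- tempfile("hrv_demo")
dir.create(work_dir)

# Create two small tab-delimited stand-in input files (made-up values)
for (id in c("C800", "C801")) {
  write.table(data.frame(beat = 1:3, rr = c(0.8, 0.9, 0.85)),
              file.path(work_dir, paste0(id, "_HR.bdf.evt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

in_files <- list.files(work_dir, pattern = "_HR\\.bdf\\.evt$", full.names = TRUE)

# basename() drops the directory; sub() keeps everything before the first "_"
subject_ids <- sub("_.*$", "", basename(in_files))
out_files <- file.path(work_dir, paste0(subject_ids, "_RtoR.csv"))

for (k in seq_along(in_files)) {
  df <- read.delim(in_files[k])
  write.csv(df, out_files[k], row.names = FALSE)
}
```

Taking basename() first matters when full paths are used: an underscore anywhere in the directory name would otherwise cut the subject id off in the wrong place.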

Looping Over a Set of Files

I cooked up some code that is supposed to find all my .txt files (they're outputs of ODE simulations), open them all up as data frames with "read.table" and then perform some calculations on them.
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
ldf <- lapply(files, read.table)
tuse <- seq(from=0,to=100,by=0.1)
for(files in ldf)
findR <- function(r){
with(files,(sum(exp(-r*age)*fecund*surv*0.1)-1)^2)
}
{
R0 <- with(files,(sum(fecund*surv*age)))
GenTime <- with(files,(sum(tuse*fecund*surv*0.1))/R0)
r <- optimize(f=findR,seq(-5,5,.0001),tol=0.00000001)$minimum
RV <- with(files,(exp(r*tuse)/surv)*(exp(-r*tuse)*(fecund*surv)))
plot(log(surv) ~ age,files,type="l")
tmp.lm <- lm(log(surv) ~ age + I(age^2),files) #Fit log surv to a quadratic
lines(files$age,predict(tmp.lm),col="red")
}
However, the problem is that it seems to only be performing the calculations contained in my "for" loop on one file, rather than all of them. I'd like it to perform the calculations on all of my files, then save all the files together as one big data frame so I can access the results of any particular set of my simulations. I suspect the error is that I'm not indexing the files correctly in order to loop over all of them.
How about using plyr::ldply() for this? It takes a list (in your case, your list of files), applies the same function to each element, and returns a single data frame.
The main thing to remember to do is create a column for the ID of each file you read in so you know which data comes from which file. The simplest way to do this is just to call it the file name and then you can edit it from there.
If you have additional arguments in your function they go after the function you want to use in ldply.
# create file list
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
tuse <- seq(from=0,to=100,by=0.1)
load_and_edit <- function(file, tuse){
temp <- read.table(file)
# here put all your calculations you want to do on each file
temp$R0 <- sum(temp$fecund*temp$surv*temp$age)
# make a column for each file name so you know which data comes from which file
temp$id <- file
return(temp)
}
new_data <- plyr::ldply(files, load_and_edit, tuse)
This is the easiest way I have found to read in and wrangle multiple files in batch.
You can then plot each one really easily.
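As for why the original loop only handled one file: the for statement in the question has no braces of its own, so its body is just the findR definition, and the braced block that follows runs once afterwards with the last element. A base-R sketch of the same idea as above, with every per-file calculation inside one function, could look like this. The sim() helper and its values are made up so that the example runs without the real .txt files.

```r
# Made-up stand-ins for two simulation outputs (columns as in the question)
sim <- function(scale) {
  age <- seq(0, 10, by = 0.1)
  data.frame(age = age, surv = exp(-0.1 * age), fecund = scale * exp(-0.05 * age))
}
ldf <- list(run_a = sim(0.5), run_b = sim(1))

# One function holds every per-file calculation, so nothing falls outside the loop
summarise_run <- function(df) {
  find_r <- function(r) with(df, (sum(exp(-r * age) * fecund * surv * 0.1) - 1)^2)
  data.frame(
    R0 = with(df, sum(fecund * surv * age)),
    r  = optimize(find_r, interval = c(-5, 5), tol = 1e-8)$minimum
  )
}

per_run <- lapply(ldf, summarise_run)
results <- do.call(rbind, per_run)   # one row per file
results$id <- names(ldf)             # keep track of which file each row came from
```

The result is one data frame with a row per simulation, which is the "one big data frame" asked for in the question.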

nested for loop to create histograms named according to list

I'm new to R and need to create a bunch of histograms that are named according to the population they came from. When I try running the loop without the "names" part, it works fine. The code below loops through the list of names and applies them in order, but I end up with 3,364 versions of the same exact histogram. If anyone has any suggestions, I'd really appreciate it.
popFiles <- list.files(pattern = "*.txt") # generates a list of the files I'm working with
popTables <- lapply(popFiles, read.table, header=TRUE, na.strings="NA")
popNames <- read.table(file.path("Path to file containing names", "popNamesR.txt"), header=FALSE,)
popNames <- as.matrix(popNames)
name <- NULL
table <- c(1:58)
for (table in popTables){
for (name in popNames){
pVals <- table$p
hist(pVals, breaks=20, xlab="P-val", main=name)
}
}
Try making a distinct iterator, and use that, rather than iterating over the table list itself. It's just easier to see what's going on. For example:
pdf("Myhistograms.pdf")
for (i in seq_along(popTables)) {
  table <- popTables[[i]]   # i-th table ...
  name <- popNames[i]       # ... paired with the i-th name
  pVals <- table$p
  hist(pVals, breaks = 20, xlab = "P-val", main = name)
}
dev.off()
In this case, your problem is that name and table are actually linked, but you have two nested for loops, so every combination of table and name is generated. That is consistent with the 3,364 histograms you saw: 58 tables × 58 names = 3,364.
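Another way to keep each table paired with exactly one name is Map(), which walks both inputs in parallel. This is only a sketch: the three tables of random p-values, the popA/popB/popC names and the temporary PDF path are made up for the example.

```r
# Made-up stand-ins: three tables of p-values and three matching names
set.seed(1)
popTables <- lapply(1:3, function(i) data.frame(p = runif(50)))
popNames <- c("popA", "popB", "popC")

out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
# Map() calls the function once per (table, name) pair, never per combination
titles <- Map(function(tbl, nm) {
  hist(tbl$p, breaks = 20, xlab = "P-val", main = nm)
  nm
}, popTables, popNames)
dev.off()
```

With three tables and three names this draws three histograms, one page each, rather than nine.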

Resources