I have written a loop in R (still learning). My purpose is to pick the max AvgConc and max Roll_TotDep from each looping file, and then have two data frames that each contains all the max numbers picked from individual files. The code I wrote only save the last iteration results (for only one single file)... Can someone point me a right direction to revise my code, so I can append the result of each new iteration with previous ones? Thanks!
data.folder <- "D:\\20150804"
files <- list.files(path=data.folder)
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- sub[which.max(sub$AvgConc),]
maxETD <- sub[which.max(sub$Roll_TotDep),]
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The problem is that max1Conc and maxETD are not lists data.frames or vectors (or other types of object capable of storing more than one value).
To fix this:
maxETD<-vector()
max1Conc<-vector()
for (i in 1:length(files)) {
sub <- read.table(file.path(data.folder, files[i]), header=T)
max1Conc <- append(max1Conc,sub[which.max(sub$AvgConc),])
maxETD <- append(maxETD,sub[which.max(sub$Roll_TotDep),])
write.csv(max1Conc, file= "max1Conc.csv", append=TRUE)
write.csv(maxETD, file= "maxETD.csv", append=TRUE)
}
The difference here is that I made the two variables you wish to write out empty vectors (max1Conc and maxETD), and then used the append command to add each successive value to the vectors.
There are more idiomatic R ways of accomplishing your goal; personally, I suggest you look into learning the apply family of functions. (http://adv-r.had.co.nz/Functionals.html)
I can't directly test the whole thing because I don't have a directory with files like yours, but I tested the parts, and I think this should work as an apply-driven alternative. It starts with a pair of functions, one to ingest a file from your directory and other to make a row out of the two max values from each of those files:
library(dplyr)
data.folder <- "D:\\20150804"
getfile <- function(filename) {
sub <- read.table(file.path(data.folder, filename), header=TRUE)
return(sub)
}
getmaxes <- function(df) {
rowi <- data.frame(AvConc.max = max(df[,"AvConc"]), ETD.max = max(df[,"ETD"]))
return(rowi)
}
Then it uses a couple of rounds of lapply --- embedded in piping courtesy ofdplyr --- to a) build a list with each data set as an item, b) build a second list of one-row data frames with the maxes from each item in the first list, c) rbind those rows into one big data frame, d) and then cbind the filenames to that data frame for reference.
dfmax <- lapply(as.list(list.files(path = data.folder)), getfiles) %>%
lapply(., getmaxes) %>%
Reduce(function(...) rbind(...), .) %>%
data.frame(file = list.files(path = data.folder), .)
Related
I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list=x, file=paste0(x, ".Rda"))))
Note that you need to generate the varying file names by providing x as a variable and not as a character (as part of the file name).
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
To save you can do this:
DFs <- list(color, sets, inventory)
names(DFs) = c("color", "sets", "inventory")
for (x in 1:length(DFs)){
dx = paste(names(DFs)[[x]], "Rda", sep = ".")
dfx = DFs[[x]]
save(dfx, file = dx)
}
To specify the path just inform in the construction of the dx object as following to read.
To read:
DFs <- c("colors", "sets", "inventory")
# or
DFs = dir("~/Documents/ME teaching/R notes/datasets/")
for(x in 1:length(DFs)){
arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
DFs[x] = read.csv(arq)
}
It will read as a list, so you can access using [[]] indexation.
I have a vector of file paths called dfs, and I want create a dataframe of those files and bind them together into one huge dataframe, so I did something like this :
for (df in dfs){
clean_df <- bind_rows(as.data.table(read.delim(df, header=T, sep="|")))
return(clean_df)
}
but only the last item in the dataframe is being returned. How do I fix this?
I'm not sure about your file format, so I'll take common .csv as an example. Replace the a * i part with actually reading all the different files, instead of just generating mockup data.
files = list()
for (i in 1:10) {
a = read.csv('test.csv', header = FALSE)
a = a * i
files[[i]] = a
}
full_frame = data.frame(data.table::rbindlist(files))
The problem is that you can only pass one file at a time to the function read.delim(). So the solution would be to use a function like lapply() to read in each file specified in your df.
Here's an example, and you can find other answers to your question here.
library(tidyverse)
df <- c("file1.txt","file2.txt")
all.files <- lapply(df,function(i){read.delim(i, header=T, sep="|")})
clean_df <- bind_rows(all.files)
(clean_df)
Note that you don't need the function return(), putting the clean_df in parenthesis prompts R to print the variable.
I am currently working on an imputation project where I need to evaluate my methods of imputation. I have my incomplete dataframe with NAs from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the frame containing the complete cases. the data frame with the generated NAs get stored in the object "result" as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put my loop which generates the NAs in another loop which contains the replicate() command and counts from 1:100 and saves these 100 replicated data frames but it didn't work at all.
result = data.frame(res0=rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
dat = comp_cas[,i]
missRate = Z32_miss_item$miss_per_item[i]
cat (i, " ", paste0(dat, collapse=",") ," ", missRate, "!\n")
df <- data.frame("res"= GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
colnames(df) = gsub("res", paste0("Var", i), colnames(df))
result = cbind(result, df)
}
result = result[,-1]
I expect that every data frame of the 100 runs get saved in a separate .rda file in my project folder.
also, is imputation and the evaluation of fitness of the latter beginner stuff in r or at what level of proficiency am I if you take a look at the code that I posted?
It is difficult to guess what exactly you are doing without some dummy data. But it is fine to have loops within loops and to save data.frames. Firstly, I would avoid the replicate function here as it has a strange syntax and just stick with plain loops. Secondly, you must make sure that the loops have different indexes (i.e. for(i ... should be surrounded by, say, for(j ... since functions can loop outside their scope in R. Finally, use saveRDS rather than save, as you can then have each object (data.frame) saved in separate .rds files. The save function is designed for saving your whole workspace so that you can pick up where you left off.
fun <- function(i){
df <- data.frame(x=rnorm(5))
names(df) <- paste0("x",i)
df
}
for(j in 1:100){
res <- data.frame(id=1:5)
for(i in 1:10){
res <- cbind(res, fun(i))
}
saveRDS(res, sprintf("replication_%s.rds",j))
}
I'm fairly new in working with R but trying to get this done. I have dozens of ENVI spectral datasets stored in a directory. Each dataset is seperated into two files. They all have the same name convention, i.e.:
ID_YYYYMMDD_350-200nm.asr
ID_YYYYMMDD_350-200nm.hdr
The task is to read the dataset, add two columns (ID and date from filename), and store the results in a *.csv-file. I got this to work for a single file (hardcoded).
library(caTools)
setwd("D:/some/path/software_scripts")
### filename without extension
name <- "011a_20100509_350-2500nm"
### split filename in area-id and date
flaeche<-substr(name, 0, 4)
date <- as.Date((substr(name,6,13)),"%Y%m%d")
### get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
### add columns
spectrum <- cbind(Flaeche=flaeche,Datum=as.character(date),spectrum)
### CSV-Dataset with all values
write.csv(spectrum, file = name,".csv", sep=",")
I want to combine all available files into one *.csv file. I know that I've to use list.files but have no idea, how to implement the read.ENVI function and add the resulting matrices ongoing to CSV.
Update:
library(caTools)
setwd("D:/some/path/mean")
files <- list.files() # change or leave totally empty if setwd() put you in the right spot
all_names <- sub("^([^.]*).*", "\\1", files) # strip off extensions
name <- unique(all_names) # get rid of duplicates from .esl and .hdr
# wrap your existing code in a function
mungeENVI <- function(name) {
# split filename in area-id and date
flaeche<-substr(name, 0, 4)
date <- as.Date((substr(name,6,13)),"%Y%m%d")
# get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
# add columns
spectrum <- cbind(Flaeche=flaeche,Datum=as.character(date),spectrum)
return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(name, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "all_results.csv")
you can find a sample dataset here: Sample dataset
I work with a lot of lab data where I can rely on the output files being in a reliable format (same column order, column name, header format, etc). So this is assuming that the .ENVI files you have are similar to that. If your files are not like that, I'm happy to help with that too, I'd just need to see a dummy file or two.
Anyways here's the idea:
library(caTools)
library(lubridate)
library(magrittr)
setwd("~/Binfo/TST/Stack/") # adjust as needed
files <- list.files("data/", full.name = T) # adjust as needed
all_names <- gsub("\\.\\D{3}", "", files) # strip off extensions
names1 <- unique(all_names) # get rid of duplicates
# wrap your existing code in a function
mungeENVI <- function(name) {
# split filename in area-id and date
f <- gsub(".*\\/(\\d{3}\\D)_.*", "\\1", name)
d <- gsub(".*_(\\d+)_.*", "\\1", name) %>% ymd()
# get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
# add columns
spectrum <- cbind(Flaeche=f,Datum= as.character(d),spectrum)
return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(names1, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "data/all_results.csv")
Let me know if you have any problems and we an go from there. Cheers.
I edited my answer a bit, I think the problem you were hitting is in list.files() it should have had the argument full.name = T. I also adjusted you parsing method to be a little more defensive and use grep capture expressions. I tested the code with your two example files (4 really) but I can build out a large matrix (66743 elements). Also I used lubridate, I think it's a better way to work with dates and times.
I would like to know how I solve the following problem using higher order functions like ddply, ldply, dlply, and avoid using problematic for loops.
The problem:
I have a .csv file representing a dataset loaded into a data.frame, with each row containing the path to a directory where more information is stored in files. I want to use the directory information in the datas.frame to open the files("file1.txt","file2.txt") in that directory, merge them, then combine the merged files from each entry in one large dataframe.
something like this:
df =
entryName,dir
1,/home/guest/data/entry1
2,/home/guest/data/entry2
3,/home/guest/data/entry3
4,/home/guest/data/entry4
what I would like to do is apply a function to the dataframe that take the directory,
appends a couple of file names "file1.txt", "file.txt", then merges the two files together based off a given field.
for example file1.txt could be:
entry,subEntry,value
1,A,2
1,B,3
1,C,4
1,D,5
1,E,3
1,F,3
for example file2.txt could be:
entry,subEntry,value
1,A,8
1,B,7
1,C,8
1,D,9
1,E,8
1,F,7
the output would look something like this:
entryName,subEntry,valueFromFile1,valueFromFile2
1,A,2,8
1,B,3,7
1,C,4,8
1,D,5,9
1,E,3,8
1,F,3,7
2,A,4,8
2,B,5,9
2,C,6,7
2,D,3,7
2,E,6,8
2,F,5,9
Right now I am using a for loop, but for obvious reasons would like to use a higher order function. Here is what I have so far:
allCombined <- data.frame()
df <- read.csv(file="allDataEntries.csv",header=true)
numberOfEntries = <- dim(df)[1]
for(i in 1:numberOfEntries){
dir <- df$dir[i]
file1String <- paste(dir,"/file1.txt",sep='')
file2String <- paste(dir,"/file2.txt",sep='')
file1.df <- read.csv(file=file1String,header=TRUE)
file2.df <- read.csv(file=file2String,header=TRUE)
localMerged <- merge(file1.df,file2.df, by="value")
allCombined <- rbind(allCombined,localMerged)
}
#rest of my analysis...
Here is one way to do it. The idea is to create a list with contents of all the files, and then use Reduce to merge them sequentially using the common columns entry and subEntry.
# READ DIRECTORIES, FILES AND ENTRIES
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
files <- as.vector(outer(dirs, c('file.txt', 'file2.txt'), 'file.path'))
entries <- lapply(files, 'read.csv', header = TRUE)
# APPLY CUSTOM MERGE FUNCTION TO COMBINE ENTRIES
merge_by <- function(x, y){
merge(x, y, by = c('entry', 'subEntry'))
}
Reduce('merge_by', entries)
I've not tested this, but it seems like it should work. The anonymous function takes a single row from df, reads in the two associated files, and merges them together by value. Using ddply will take these data frames and make a single one out of them by rbinding (since the requested output is a data frame). It does assume entryName is not repeated in df. If it is, you can add a unique row to group over instead.
ddply(df, .(entryName), function(DF) {
dir <- df$dir
file1String <- paste(dir,"/file1.txt",sep='')
file2String <- paste(dir,"/file2.txt",sep='')
file1.df <- read.csv(file=file1String,header=TRUE)
file2.df <- read.csv(file=file2String,header=TRUE)
merge(file1.df,file2.df, by="value")
})