How to average values across data frames? - R

file.names <- list.files(path = "mypath")
file.names <- paste("mypath", file.names, sep = "/")
for (i in 1:length(file.names)) {
  assign(paste0("Frame", i), read.table(file.names[i], sep = "", header = FALSE))
}
My above code reads files from a directory and loads each one into its own data frame. I have thousands of these files. The question is how can I get all the data frames that I create for each file and average each value across all data frames. It's just like having a 100 x 100 matrix in each of 1000 files (data frames) and wanting a single 100 x 100 matrix with the values averaged across all of them. Any help is really appreciated. I have been stuck on this for a while.

The following code seems to do the trick. Thanks to @Gregor
X <- NULL
mylist <- list()
args <- commandArgs(trailingOnly = TRUE)
# test if there is at least one argument: if not, return an error
if (length(args) == 0) {
  stop("At least one argument must be supplied (input file).\n", call. = FALSE)
} else if (length(args) == 1) {
  file.names <- list.files(path = args[1], pattern = ".gdat")
  file.names <- paste(args[1], file.names, sep = "/")
  args[2] <- paste(args[1], "avg.txt", sep = "/")
  for (i in 1:length(file.names)) {
    mylist[i] <- list(read.table(file.names[i], sep = "", header = FALSE))
  }
  X <- Reduce("+", mylist) / length(mylist)  # this is the step that averages across the data frames
  write.table(X, file = args[2], sep = "\t", row.names = FALSE, quote = FALSE)
}
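For intuition, here is a minimal sketch of the element-wise averaging step on its own, using two made-up toy data frames (the values are hypothetical):
df1 <- data.frame(a = c(1, 2), b = c(10, 20))
df2 <- data.frame(a = c(3, 4), b = c(30, 40))
mylist <- list(df1, df2)
# "+" works element-wise on data frames, so Reduce sums them cell by cell;
# dividing by the number of frames gives the cell-wise mean
avg <- Reduce("+", mylist) / length(mylist)
avg
#   a  b
# 1 2 20
# 2 3 30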

Related

Data tables in R with for loops by lists of files with pattern matching

In a directory, I have 780 files and I need to bind them by rows, using R, into 78 different files and then write a .txt for each one. The names of the files look like these:
S1_S1_F1.xlsx
S1_S2_F1.xlsx
...
S1_S5_F1.xlsx
S1_S6_F2.xlsx
...
S1_S10_F2.xlsx
S2_S1_F1.xlsx
The first part of the expression S1_(.*).xlsx repeats 10 times, then changes up to S78_(.*).xlsx, with the second part changing from (.*)_S1(.*).xlsx to (.*)_S10(.*).xlsx. I need to combine the files just by the first term to have 78 files from S1.txt to S78.txt.
I'm far from being an expert in R, so my approach was to do it file by file with the following code:
S1 <- list.files(pattern = "^S1(.*).xlsx")
S1 <- lapply(S1, read_excel)
S1 <- bind_rows(S1)
write.table(S1, "S1.txt", sep="\t", row.names=FALSE)
up to
S78 <- list.files(pattern = "^S78(.*).xlsx")
S78 <- lapply(S78, read_excel)
S78 <- bind_rows(S78)
write.table(S78, "S78.txt", sep="\t", row.names=FALSE)
As you can see, this code seems to have been written by an australopithecus (which I'm not), so I beg your help! How can I do it with a for loop?
Simply wrap another lapply (which is a loop) around your lines, iterating through the sequence 1 to 78. The code below will output the 78 txt files and leave you with a list of 78 data frames:
library(readxl)  # for read_excel
library(dplyr)   # for bind_rows

dfList <- lapply(seq(1, 78), function(i) {
  # the trailing "_" keeps, e.g., "^S1" from also matching S10-S19
  f <- list.files(pattern = paste0("^S", i, "_(.*).xlsx"))
  dfs <- lapply(f, read_excel)
  df <- bind_rows(dfs)  # OR base R's do.call(rbind, dfs)
  write.table(df, paste0("S", i, ".txt"), sep = "\t", row.names = FALSE)
  return(df)
})
dfList[[1]]
dfList[[2]]
...
dfList[[78]]
Or even use sapply to return a named list:
dfList <- sapply(paste0("S", seq(1, 78)), function(i) {
  f <- list.files(pattern = paste0("^", i, "_(.*).xlsx"))
  dfs <- lapply(f, read_excel)
  df <- bind_rows(dfs)
  write.table(df, paste0(i, ".txt"), sep = "\t", row.names = FALSE)
  return(df)
}, simplify = FALSE)
dfList$S1
dfList$S2
...
dfList$S78

How do I create one data.frame from multiple csv files in R using a function? [duplicate]

I am quite new to R and I need some help. I have multiple csv files labeled from 001 to 332. I would like to combine all of them into one data.frame. This is what I have done so far:
filesCSV <- function(id = 1:332){
  fileNames <- paste(id) ## I want fileNames to be a vector with the names of all the csv files that I want to join together
  for(i in id){
    if(i < 10){
      fileNames[i] <- paste("00", fileNames[i], ".csv", sep = "")
    }
    if(i < 100 && i > 9){
      fileNames[i] <- paste("0", fileNames[i], ".csv", sep = "")
    }
    else if (i > 99){
      fileNames[i] <- paste(fileNames[i], ".csv", sep = "")
    }
  }
  theData <- read.csv(fileNames[1], header = TRUE) ## here I was trying to create the data.frame with the first csv file
  for(a in 2:332){
    theData <- rbind(theData, read.csv(fileNames[a])) ## here I wanted to use a for loop to cycle through the names of files in the fileNames vector, open them all, and add them to the 'theData' data.frame
  }
  theData
}
Any help would be appreciated, Thanks!
Hmm it looks roughly like your function should already be working. What is the issue?
Anyway, here is a more idiomatic R way to achieve what you want that reduces the whole function to three lines of code:
Construct the filenames:
infiles <- sprintf("%03d.csv", 1:332)
The %03d means: insert an integer value (d), zero-padded (0) to a width of 3. Refer to the help page ?sprintf for details.
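For example:
sprintf("%03d.csv", c(1, 42, 332))
# [1] "001.csv" "042.csv" "332.csv"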
Read the files:
res <- lapply(infiles, read.csv, header = TRUE)
lapply applies the function read.csv (with the argument header = TRUE) to each element of the vector infiles and returns a list, in this case a list of data.frames.
Bind the data frames together:
do.call(rbind, res)
This is the same as entering rbind(df1, df2, df3, df4, ..., dfn), where df1...dfn are the elements of the list res.
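A quick toy illustration of that equivalence (with made-up data frames):
df1 <- data.frame(x = 1:2)
df2 <- data.frame(x = 3:4)
identical(do.call(rbind, list(df1, df2)), rbind(df1, df2))
# [1] TRUE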
You were very close; you just needed a way to pad the file ids with leading 0s and to handle whether the final data is simply the first csv read in or an rbind of what has been read so far:
filesCSV <- function(id = 1:332){
  library(stringr)
  # Append 0s in front of the ids
  fileNames <- paste(str_pad(id, 3, pad = "0"), ".csv", sep = "")
  # Initially the data is NULL
  the_data <- NULL
  for(i in seq_along(id)){
    # Read the data into the dat object
    dat <- read.csv(fileNames[i], header = TRUE)
    if(is.null(the_data)){ # For the first pass, when the_data is still NULL
      the_data <- dat
    }else{ # For all other passes
      the_data <- rbind(the_data, dat)
    }
  }
  return(the_data)
}
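A hypothetical call, assuming the working directory contains 001.csv through 332.csv:
all_data <- filesCSV()       # read and bind all 332 files
first_ten <- filesCSV(1:10)  # or just a subset of ids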

Calculate the mean of one column across multiple .csv files - how?

I am a newbie in R and have to calculate the mean of the column sulf from 332 files. The mean formula below works well with 1 file. The problem comes when I attempt to calculate across the files.
Perhaps reading all the files and storing them in mydata does not work well? Could you help me out?
Many thanks
pollutantmean <- function(specdata, pollutant = xor(sulf, nit), i = 1:332){
  specdata <- getwd()
  pollutant <- c(sulf, nit)
  for(i in 1:332){
    mydata <- read.csv(file_list[i])
  }
  sulfate <- (subset(mydata, select = c("sulfate")))
  sulf <- sulfate[!is.na(sulfate)]
  y <- mean(sulf)
  print(y)
}
This is not tested, but the steps are as follows. Note also that this kind of question is asked over and over again (e.g. here). Try searching for "work on multiple files", "batch processing", "import many files" or something akin to this.
lx <- list.files(pattern = ".csv", full.names = TRUE)
# gives you a character vector of file paths
xy <- sapply(lx, FUN = function(x) {
  out <- read.csv(x)
  out <- out[, "sulfate", drop = FALSE] # do not drop to vector just for fun
  out <- out[!is.na(out[, "sulfate"]), , drop = FALSE] # keep only the non-missing rows
  out
}, simplify = FALSE)
xy <- do.call(rbind, xy) # combine the result for all files into one big data.frame
mean(xy[, "sulfate"]) # calculate the mean
# or
summary(xy)
If you are short on RAM, this can be optimized a bit.
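For instance, here is a minimal, untested sketch (same file layout assumed as above) that keeps only a running sum and count per file instead of binding everything into one big data frame:
lx <- list.files(pattern = ".csv", full.names = TRUE)
total <- 0
n <- 0
for (f in lx) {
  s <- read.csv(f)[["sulfate"]]
  s <- s[!is.na(s)]          # drop missing values before accumulating
  total <- total + sum(s)
  n <- n + length(s)
}
total / n                    # mean of sulfate across all files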
Thank you for your help.
I have sorted it out. The key was to use full.names=TRUE in list.files and rbind(mydata, ... ), as otherwise it reads the files one by one and does not append them after each other, which is my aim.
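For illustration, assuming the csv files live in a hypothetical folder called specdata:
list.files("specdata")                     # "001.csv" "002.csv" ...
list.files("specdata", full.names = TRUE)  # "specdata/001.csv" "specdata/002.csv" ...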
See below. I am not sure if it is the most "R" solution, but it works:
pollutantmean <- function(directory, pollutant, id = 1:332){
  files_list <- list.files(directory, full.names = TRUE)
  mydata <- data.frame()
  for (i in id) {
    mydata <- rbind(mydata, read.csv(files_list[i]))
  }
  if (pollutant %in% "sulfate") {
    mean(mydata$sulfate, na.rm = TRUE)
  } else if (pollutant %in% "nitrate") {
    mean(mydata$nitrate, na.rm = TRUE)
  } else {
    "wrong pollutant"
  }
}
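A hypothetical call, assuming the 332 csv files sit in a folder called specdata:
pollutantmean("specdata", "sulfate", 1:10)  # mean sulfate over files 1-10
pollutantmean("specdata", "nitrate")        # mean nitrate over all 332 files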

Subset multiple dataframes in a loop in R

I am trying to drop columns from over 20 data frames that I have imported. However, I'm getting errors when I try to iterate through all of these files. I'm able to drop when I hard code the individual file name, but as soon as I try to loop through all of the files, I have errors. Here's the code:
path <- "C://Home/Data/"
files <- list.files(path=path, pattern="^.file*\\.csv$")
for(i in 1:length(files)){
  perpos <- which(strsplit(files[i], "")[[1]]==".")
  assign(
    gsub(" ", "", substr(files[i], 1, perpos-1)),
    read.csv(paste(path, files[i], sep="")))
}
mycols <- c("test", "trialruns", "practice")
`file01` = `file01`[,!(names(`file01`) %in% mycols)]
So, the above will work and drop those three columns from file01. However, I can't iterate through file02 to file20 and drop the columns from all of them. Any ideas? Thank you so much!
As @zx8754 mentions, consider lapply(), maintaining all data frames in one compiled list instead of as multiple objects in your environment (the code below also shows how to output individual dfs from the list):
path <- "C://Home/Data/"
files <- list.files(path=path, pattern="^.file*\\.csv$")
mycols <- c("test", "trialruns", "practice")
# READ IN ALL FILES AND DROP THE UNWANTED COLUMNS
dfList <- lapply(files, function(f) {
  df <- read.csv(paste0(path, f))
  df[, !(names(df) %in% mycols)]
})
# SET NAMES TO EACH DF ELEMENT
dfList <- setNames(dfList, gsub(".csv", "", files))
# IN CASE YOU REALLY NEED INDIVIDUAL DFs
list2env(dfList, envir=.GlobalEnv)
# IN CASE YOU NEED TO APPEND ALL DFs
finaldf <- do.call(rbind, dfList)
# TO RETRIEVE FIRST DF
dfList[[1]] # OR dfList$file01

How to read variable number of files and then combine the data frames in R?

I would like to design a function. Say I have files file1.csv, file2.csv, file3.csv, ..., file100.csv. I only want to read some of them every time I call the function, by specifying an integer vector id; e.g., with id = 1:10 I will read file1.csv, ..., file10.csv.
After reading those csv files, I would like to row combine them into a single variable. All csv files have the same column structure.
My code is below:
namelist <- list.files()
for (i in id) {
  assign(paste0("file", i), read.csv(namelist[i], header = TRUE))
}
As you can see, after I read in all the data, I am stuck at combining them, since they all have different variable names.
You should read in each file as an element of a list. Then you can combine them as follows:
namelist <- list.files()
df <- vector("list", length = length(id))
for (i in id) {
  df[[i]] <- read.csv(namelist[i], header = TRUE)
}
df <- do.call("rbind", df)
Or more concisely:
df <- do.call(rbind, lapply(list.files(), read.csv))
I do this, which is more R-like, without the for loop:
## assuming you have a folder full of .csv's to merge
filenames <- list.files()
all_files <- Reduce(rbind, lapply(filenames, read.csv))
If I understand correctly what you want to do, then this is all you need:
namelist <- list.files()
singlevar <- c()
for (i in id) {
  singlevar <- rbind(singlevar, read.csv(namelist[i], header = TRUE))
}
Since in the end you want one single object to contain all the partial information from the single files, rbind as you go.
