How to calculate the mean of one column across multiple .csv files? - r

I am a newbie in R and have to calculate the mean of the sulfate column from 332 files. The mean formula below works well with one file. The problem comes when I attempt to calculate across all the files.
Perhaps reading all the files and storing them in mydata does not work well? Could you help me out?
Many thanks
pollutantmean <- function(specdata,pollutant=xor(sulf,nit),i=1:332){
specdata<-getwd()
pollutant<-c(sulf,nit)
for(i in 1:332){
mydata<-read.csv(file_list[i])
}
sulfate <- (subset(mydata,select=c("sulfate")))
sulf <- sulfate[!is.na(sulfate)]
y <- mean(sulf)
print(y)
}

This is not tested, but the steps are as follows. Note also that this kind of question is asked over and over again (e.g. here). Try searching for "work on multiple files", "batch processing", "import many files" or something akin to this.
lx <- list.files(pattern = "\\.csv$", full.names = TRUE)
# gives you a character vector of file paths
xy <- sapply(lx, FUN = function(x) {
out <- read.csv(x)
out <- out[, "sulfate", drop = FALSE] # do not drop to vector just for fun
out <- out[!is.na(out[, "sulfate"]), , drop = FALSE] # keep only the non-missing rows
out
}, simplify = FALSE)
xy <- do.call(rbind, xy) # combine the result for all files into one big data.frame
mean(xy[, "sulfate"]) # calculate the mean
# or
summary(xy)
If you are short on RAM, this can be optimized a bit.
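One possible reading of "optimized", as a minimal sketch only: instead of binding every file into one big data.frame, keep a running sum and a running count so that only one file is in memory at a time. The directory name demo_dir and the three small files are made up for illustration.

```r
# Hypothetical demo data: three small CSV files in a temporary directory.
demo_dir <- file.path(tempdir(), "sulfate_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3)), file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7)), file.path(demo_dir, "002.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(NA, 8)), file.path(demo_dir, "003.csv"), row.names = FALSE)

lx <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

total <- 0  # running sum of the non-missing values
n <- 0      # running count of the non-missing values
for (f in lx) {
  x <- read.csv(f)[["sulfate"]]
  x <- x[!is.na(x)]
  total <- total + sum(x)
  n <- n + length(x)
}
running_mean <- total / n
```

The result should match mean() on the combined column, because the mean only needs the overall sum and the overall count.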

Thank you for your help.
I have sorted it out. The key was to use full.names=TRUE in list.files and rbind(mydata, ... ). Otherwise the loop reads the files one by one and does not append them to each other, which is my aim.
See below. I am not sure if it is the most "R" solution, but it works.
pollutantmean<-function(directory,pollutant,id=1:332){
files_list <- list.files(directory, full.names=TRUE)
mydata <- data.frame()
for (i in id) {
mydata <- rbind(mydata, read.csv(files_list[i]))
}
if(pollutant %in% "sulfate")
{
mean(mydata$sulfate,na.rm=TRUE)
}
else
{if(pollutant %in% "nitrate")
{
mean(mydata$nitrate,na.rm=TRUE)
}
else
{"wrong pollutant"
}
}
}
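For comparison, here is a more compact variant of the same idea. It is only a sketch: the name pollutantmean2 and the demo files are hypothetical, and it assumes that the pollutant argument is the name of a column in the files.

```r
# Sketch of a more compact variant; pollutantmean2 is a hypothetical name.
pollutantmean2 <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # read only the requested files and stack them into one data frame
  mydata <- do.call(rbind, lapply(files_list[id], read.csv))
  if (!pollutant %in% names(mydata)) stop("wrong pollutant: ", pollutant)
  # select the column by name instead of branching on each pollutant
  mean(mydata[[pollutant]], na.rm = TRUE)
}

# Hypothetical demo data: two small files in a temporary directory.
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 4)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, 5), nitrate = c(NA, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

sulfate_mean <- pollutantmean2(demo_dir, "sulfate", id = 1:2)
nitrate_mean <- pollutantmean2(demo_dir, "nitrate", id = 1:2)
```

Selecting the column with mydata[[pollutant]] avoids one if branch per pollutant, and lapply plus do.call(rbind, ...) avoids growing the data frame inside a loop.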

Related

Issues with correlation function in R

I am writing a function that takes a directory and a threshold (default of 0) for complete cases as arguments and then performs the cor function for two of the columns. The function runs, kind of. However, whenever I run it it seems like it is skipping everything I have in the function and returns numeric(0). I can't figure out what is causing it to return that. I ran each line through the console and made sure each line worked on its own. But once it comes time to put it all together it seems like it all falls apart. Below is what I'm working with.
corr <- function(directory, threshold =0){
files <- list.files(path = directory, pattern = "*.csv", full.names = T)
cr <- numeric()
for(i in seq(files)){
data <- read.csv(files[i], )
cleandata <- complete.cases
data <- data[cleandata, ]
if (nrow(data) > threshold){
correlation <- cor(data$sulfate,data$nitrate)
cr <- append(cr, correlation)
}
}
return(cr)
}
Edit : removed last part of question. After re-reading I realized that I would not get what I wanted from return(files) even if it worked correctly. I should also note that adding the return(files) changed my output from numeric(0) to character(0)
How about something like this? You put all the files in a data frame and then map out the correlations. As per @r2evans's comment, this adds an NA when the threshold condition is not met, and as per @Limey's comment, this is more efficient than a for loop with append. The benefit here is that the correlation stays next to the file name, so you can see which file has which correlation.
library(tidyverse)
corr <- function(directory, threshold =0){
tibble(file = list.files(path = directory,
pattern = "\\.csv$",
full.names = TRUE)) |>
mutate(data = map(file, read_csv),
# keep only the complete rows of each file
comp = map(data, ~ .x[complete.cases(.x), ]),
nrw = map_dbl(comp, nrow),
# compute the correlation only where the threshold is met, NA otherwise
cr = map2_dbl(comp, nrw, ~ if (.y > threshold) cor(.x$sulfate, .x$nitrate) else NA_real_))
}
corr_df <- corr(directory = "user/AndS./stack_questions")
corr_vec <- pull(corr_df, cr)

List elements getting overwritten in for loop R?

I have a bunch of csv files that I'm trying to read into R all at once, with each data frame from a csv becoming an element of a list. The loops largely work, but they keep overriding the list elements. So, for example, if I loop over the first 2 files, both data frames in list[[1]] and list[[2]] will contain the data frame for the second file.
#function to open one group of files named with "cores"
open_csv_core<- function(year, orgtype){
file<- paste(year, "/coreco.core", year, orgtype, ".csv", sep = "")
df <- read.csv(file)
names(df) <- tolower(names(df))
df <- df[df$ntee1 %in% c("C","D"),]
df<- df[!(df$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
return(df)
}
#function to open one group of files named with "nccs"
open_csv_nccs<- function(year, orgtype){
file2<- paste(year, "/nccs.core", year, orgtype, ".csv", sep="")
df2 <- read.csv(file2)
names(df2) <- tolower(names(df2))
df2 <- df2[df2$ntee1 %in% c("C","D"),]
df2<- df2[!(df2$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
return(df2)
}
#############################################################################
yrpc<- list()
yrpf<- list()
yrco<- list()
fname<- vector()
file_yrs<- as.character(c(1989:2019))
for(i in 1:length(file_yrs)){
fname<- list.files(path = file_yrs[i], pattern = NULL)
#accessing files in a folder and assigning to the proper function to open them based on how the file is named
for(j in 1:length(fname)){
if(grepl("pc.csv", fname[j])==T) {
if(grepl("nccs", fname[j])==T){
a <- open_csv_nccs(file_yrs[j], "pc")
yrpc[[paste0(file_yrs[i], "pc")]] <- a
} else {
b<- open_csv_core(file_yrs[j], "pc")
yrpc[[paste0(file_yrs[i], "pc")]] <- b
}
} else if (grepl("pf.csv", fname[j])==T){
if(grepl("nccs", fname[j])==T){
c <- open_csv_nccs(file_yrs[j], "pf")
yrpf[[paste0(file_yrs[i], "pf")]] <- c
} else {
d<- open_csv_core(file_yrs[j], "pf")
yrpf[[paste0(file_yrs[i], "pf")]] <- d
}
} else {
if(grepl("nccs", fname[j])==T){
e<- open_csv_nccs(file_yrs[j], "co")
yrco[[paste0(file_yrs[i], "co")]] <- e
} else {
f<- open_csv_core(file_yrs[j], "co")
yrco[[paste0(file_yrs[i], "co")]] <- f
}
}
}
}
Actually, both of your CSV-reading functions do exactly the same thing, except that the paths are different.
If you list your files with full paths instead of bare file names, you don't need to reconstruct the paths as you do. This is possible with full.names = TRUE in list.files().
The second point is that, for a given year and type, there never seems to be an "nccs.core" file in addition to a "coreco.core" file, so the two are mutually exclusive. No logic is needed to distinguish those cases, which simplifies the code.
The third point is that you just want to separate the data frames by file type ("pc", "pf", "co") and year.
Instead of creating three lists, one per type, I would create a single results list, res, which contains an inner list for each type.
I would solve it like this:
years <- as.character(1989:2019) # character, so a year works as a path and as a list name
path_to_type <- function(path) gsub(".*(pc|pf|co)\\.csv", "\\1", path)
res <- list("pc" = list(),
"pf" = list(),
"co" = list())
invisible(lapply(years, function(year) {
files <- list.files(path = year, pattern = "\\.csv$", full.names = TRUE)
lapply(files, function(path) {
print(path) # just to signal that the path is getting processed
df <- read.csv(path)
file_type <- path_to_type(path)
names(df) <- tolower(names(df))
df <- df[df$ntee1 %in% c("C", "D"), ]
df <- df[!(df$nteecc %in% c("D20", "D40", "D50", "D60", "D61")), ]
res[[file_type]][[year]] <<- df # <<- so the assignment reaches res outside the function
})
}))
Now you can pull a data frame from the result list by file type and year,
e.g.:
res[["co"]][["1995"]]
res[["pf"]][["2018"]]
And so on.
The return values of the lapply() calls are not interesting in this case,
only the content of res (the result list).
It seems that in your for(j in 1:length(fname)){... loop you are creating one of the variables a to f on each pass. You're reusing these variable names, so they are getting overwritten.
The "correct" way to do this is to use lapply in place of the for loop. Pass the list of files, and the required function (i.e. open_csv_core, etc) to lapply, and the return value that you get back is a list of the results.
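A minimal sketch of that lapply pattern, under stated assumptions: open_csv is a hypothetical stand-in for open_csv_core / open_csv_nccs that takes a full path, and the two demo files are made up.

```r
# Hypothetical stand-in for open_csv_core()/open_csv_nccs(): read one file and filter it.
open_csv <- function(path) {
  df <- read.csv(path)
  names(df) <- tolower(names(df))
  df[df$ntee1 %in% c("C", "D"), , drop = FALSE]
}

# Hypothetical demo data: two small files in a temporary directory.
demo_dir <- file.path(tempdir(), "lapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(NTEE1 = c("C", "A")),
          file.path(demo_dir, "coreco.core1989pc.csv"), row.names = FALSE)
write.csv(data.frame(NTEE1 = c("D", "D", "B")),
          file.path(demo_dir, "nccs.core1990pc.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "pc\\.csv$", full.names = TRUE)
yrpc <- lapply(files, open_csv)   # one list element per file, nothing is reused
names(yrpc) <- basename(files)    # name each element after its file
```

Each call to open_csv returns its own data frame, and lapply collects the return values in order, so no element can overwrite another.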

Issues with for loops in R

I am trying to combine some excel spreadsheets. There are 50 documents. I am looking to get sheets 2:5, except some only have sheets 2:3, 2:4, etc - this is why I include the try function. I need ranges F6:AZ2183 and I am transposing the data.
The issue I am running into is that only the last file is saving into the data frame df.
I attached the code below. If you have any ideas, I would much appreciate it!
Also, I'm a longtime lurker first time poster, so if my etiquette is poor, I apologize.
df <- data.frame()
for (i in 1:50){
for (j in 2:5) {
try({
df.temp <- t(read_excel((paste0('FqReport',i,'.xlsx')), sheet = j, range ='F6:AZ2183'))
df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
df <- rbind(df, df.temp)
rm(df.temp)
gc()
}, silent = TRUE)
}
}
You can read the names of the sheets available in each Excel file, which avoids the need for try. Also, growing a data frame in a loop is quite inefficient. Try this lapply approach.
library(readxl)
filename <- paste0('FqReport',1:50,'.xlsx')
df <- do.call(rbind, lapply(filename, function(x) {
sheet_name <- excel_sheets(x)[-1]
do.call(rbind, lapply(sheet_name, function(y) {
df.temp <- t(read_excel(x, y, range ='F6:AZ2183'))
df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
}))
}))

how to average values across data frames?

file.names <- list.files(path = 'mypath')
file.names <- paste("mypath", file.names, sep="/")
for(i in 1:length(file.names))
{
assign(paste("Frame",i,""), read.table(file.names[i], sep="", header=FALSE))
}
My code above reads files from a directory and stores each one in its own data frame. I have thousands of these files. The question is how I can take all the data frames that I create and average each value across them. For example, given 1000 files (data frames) that each hold a 100 x 100 matrix, I want a single 100 x 100 matrix with the values averaged across the data frames. Any help is really appreciated. I have been stuck on this for a while.
The following code seems to do the trick. Thanks to @Gregor.
X <- NULL
mylist <- list()
args = commandArgs(trailingOnly=TRUE)
# test if there is at least one argument: if not, return an error
if (length(args)==0){
stop("At least one argument must be supplied (input file).\n", call.=FALSE)
} else if (length(args)==1){
file.names <- list.files(path =args[1],pattern=".gdat")
file.names <- paste(args[1], file.names, sep="/")
args[2] <- paste(args[1], "avg.txt", sep="/")
for(i in 1:length(file.names))
{mylist[i] <- list(read.table(file.names[i], sep="", header=FALSE))}
X <- Reduce("+", mylist) / length(mylist) #this is the funx that averages across dataframes
write.table(X, file=args[2], sep="\t",row.names=FALSE, quote=FALSE)
}
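To see what the Reduce line does, here is a small worked example with three made-up 2 x 2 data frames (the values are invented for illustration). It assumes, as in the code above, that all data frames have the same shape and contain only numeric columns.

```r
# Three hypothetical data frames with identical dimensions.
mylist <- list(
  data.frame(V1 = c(1, 2), V2 = c(3, 4)),
  data.frame(V1 = c(3, 4), V2 = c(5, 6)),
  data.frame(V1 = c(5, 6), V2 = c(7, 8))
)

# Reduce("+", ...) adds the data frames cell by cell;
# dividing by the number of data frames gives the cell-wise mean.
X <- Reduce("+", mylist) / length(mylist)
```

Here X has V1 = 3, 4 and V2 = 5, 6. Note that a missing value in any one data frame makes the corresponding cell of X missing as well.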

No output when run function

New to R and to programming, I reviewed all the possible threads on SO on this Coursera assignment but couldn't figure out what the issue was. I know this function can be optimized using lapply and much more, but I would like to know why this particular function does not work. I felt like some questions on this function slightly irritated some users. To be honest, I reviewed the relevant posts on that and I don't see what I can do about this particular bug.
pollutantmean <- function (directory, pollutant, id) {
#Create the data frame with the data from the 332 files
files <- list.files(getwd())
df <- data.frame()
id <- 1:332
for (i in 1:length(id)) {df <- rbind(df, read.csv(files[i]))
if (pollutant=="nitrate"){
#Create a subset for nitrate values of df
df_nitrate <- df[df$ID==id[i], "nitrate"]
#Take mean of df_nitrate
mean (df_nitrate, na.rm = TRUE)
} else {
#Create a subset for sulfate values of df
df_sulfate <- df[df$ID==id[i],"sulfate"]
#Take mean of df_sulfate
mean(df_sulfate, na.rm = TRUE)
}
}
}
For those of you who have not heard of this assignment: I have 332 CSV files (named 001.csv, 002.csv and so on) in my working directory. The task is to get all of them into one data frame and to be able to compute the mean of a column for one file (given by the "id" variable that corresponds to that file) or across multiple files (some examples of the function and its output can be found here).
I tried to call traceback or debug functions to situate the problem, but to no avail:
pollutantmean(getwd(), "nitrate", 23)
> traceback()
No traceback available
> debug(pollutantmean)
>
The OS is Windows 10.
Any suggestions or comments are welcome. Thanks in advance.
Your for loop is wrapped around your if block. In R, a for loop evaluates to NULL (invisibly), so the values computed inside it are discarded and the function returns nothing visible (unless you call return() inside the loop, which is not what you want to do here).
pollutantmean <- function (directory, pollutant, id = 1:332) {
#Create the data frame with the data from the requested files
files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
df <- data.frame()
for (i in id) {df <- rbind(df, read.csv(files[i]))
# close for loop HERE
}
if (pollutant=="nitrate"){
#Take the mean of the nitrate values of df
mean(df$nitrate, na.rm = TRUE)
} else {
#Take the mean of the sulfate values of df
mean(df$sulfate, na.rm = TRUE)
}
# not HERE
#}
}
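The effect of where the loop closes can be checked with two toy functions. Both names are hypothetical and exist only for this demonstration.

```r
# Toy functions (hypothetical names) contrasting the two layouts.
inside_loop <- function(x) {
  for (v in x) {
    if (v > 0) mean(x)   # the value is computed but then discarded
  }
}

after_loop <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total / length(x)      # the last evaluated expression is the return value
}

r1 <- inside_loop(c(1, 2, 3))
r2 <- after_loop(c(1, 2, 3))
```

r1 is NULL because the for loop is the last expression in inside_loop, while r2 is 2. That is why calling the original function printed nothing and traceback() had nothing to report: there was no error, only an invisible NULL.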
