Calculating a mean from data held in multiple files - r

I am trying to write an R script that calculates the mean of a specified pollutant (nitrate or sulfate) based on data from one or more of 332 monitor stations. The data from each station is held in a separate file, numbered 1:332. I am new to R and, to be fair to anyone who chooses to help me, I should say that this is a homework problem. I have written the script below, which works for just one file:
pollutantmean <- function(directory, pollutant, id = 1:332) {
filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
for(i in seq_along(id)) {
if(id < 10) {
name <- paste("00", id[i], sep = "")
}
if(id >= 10 && id < 100) {
name <- paste("0", id[i], sep = "")
}
if(id >= 100) {
name <- id[i]
}
}
file <- paste(name, "csv", sep = ".")
station <- paste(filepath, directory, file, sep = "/")
monitor <- read.csv(station)
if(pollutant == "nitrate") {
x <- mean(monitor$nitrate, na.rm = T)
}
if(pollutant == "sulfate") {
x <- mean(monitor$sulfate, na.rm = T)
}
x
}
However, if I enter more than one file (e.g. 70:72) I get the mean for the last file only (72). This suggests to me that it is calculating the mean for each file and then overwriting it with the mean of the next, so that only the last is output. I would be able to solve this using rbind(), but I can't figure out how to assign unique names for each variable which would then become the arguments for rbind(). I would be grateful for any help anyone can offer.
Cheers,
Jim

You never loop over the files: the loop only builds the file name, and everything after it runs once.
You get the mean of the last file because, once the loop over the ids has finished, name holds only the last name created.
Instead, create a vector of names, then a vector of station paths, and loop over that.
Tip: you don't need a loop and conditional statements to create the names. You can use sprintf, specifying the width of the string you expect (3) and the character to pad it with (0):
> id <- c(1, 10, 100)
> names <- sprintf("%03d", id)
> names
[1] "001" "010" "100"
And this should work:
pollutantmean <- function(directory, pollutant, id = 1:332) {
filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
names <- sprintf("%03d", id)
files <- paste0(names, ".csv") # Or directly : files <- sprintf("%03d.csv", id)
station <- file.path(filepath, directory, files)
means <- numeric(length(station))
for (i in seq_along(station)) {
monitor <- read.csv(station[i])
if(pollutant == "nitrate") {
means[i] <- mean(monitor$nitrate, na.rm = T)
} else if(pollutant == "sulfate") {
means[i] <- mean(monitor$sulfate, na.rm = T)
}
}
return(means)
}
EDIT:
If you want a single mean, you can use the code above and weight each mean by the number of non-NA rows in its file. Replace the loop with:
means <- numeric(length(station))
counts <- numeric(length(station))
for (i in seq_along(station)) {
monitor <- read.csv(station[i])
if(pollutant == "nitrate") {
means[i] <- mean(monitor$nitrate, na.rm = TRUE)
counts[i] <- sum(!is.na(monitor$nitrate))
} else if(pollutant == "sulfate") {
means[i] <- mean(monitor$sulfate, na.rm = TRUE)
counts[i] <- sum(!is.na(monitor$sulfate))
}
}
myMean <- sum(means * counts) / sum(counts)
return(myMean)
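To see why the weighting matters, here is a minimal sketch with two made-up vectors standing in for the pollutant columns of two files (the names a and b are illustrative only). It checks that the count-weighted mean matches the mean of the pooled values, while a plain mean of the per-file means does not:

```r
# Made-up data: two "files" with different numbers of non-NA values
a <- c(1, 2, NA, 3)
b <- c(10, NA, 20)

means  <- c(mean(a, na.rm = TRUE), mean(b, na.rm = TRUE))
counts <- c(sum(!is.na(a)), sum(!is.na(b)))

weighted   <- sum(means * counts) / sum(counts)  # weighted by non-NA counts
pooled     <- mean(c(a, b), na.rm = TRUE)        # mean of all values together
unweighted <- mean(means)                        # ignores the file sizes

print(c(weighted = weighted, pooled = pooled, unweighted = unweighted))
```

The weighted and pooled values agree; the unweighted one differs because the two vectors contribute different numbers of observations.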
Since your original intention was to gather the data into one vector, here is a solution that builds a list in which each element is the requested pollutant column of one data frame. unlist then concatenates all the vectors into one, and the mean is computed on that vector.
pollutantmean <- function(directory, pollutant, id = 1:332) {
filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
names <- sprintf("%03d", id)
files <- paste0(names, ".csv") # Or directly : files <- sprintf("%03d.csv", id)
station <- file.path(filepath, directory, files)
li <- lapply(station, function(x) {
monitor <- read.csv(x)
if(pollutant == "nitrate") {
monitor$nitrate
} else if(pollutant == "sulfate") {
monitor$sulfate
}
})
myMean <- mean(unlist(li))
return(myMean)
}

A small correction to Julien Navarre's second pollutantmean function: when calculating the mean, it does not ignore the NA values, which could affect the overall result. The line calculating the mean should be:
myMean <- mean(unlist(li), na.rm=TRUE)
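Putting the correction together, a compact sketch of the lapply version might look like the following. It is not the code from the answer above: it assumes directory is already a usable path (instead of the hard-coded filepath), indexes the column with [[pollutant]] rather than branching on the name, and the two small CSV files are made up purely so the example runs on its own.

```r
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # Build "001.csv", "002.csv", ... inside the given directory
  files <- file.path(directory, sprintf("%03d.csv", id))
  # Pull the requested column out of every file
  values <- lapply(files, function(f) read.csv(f)[[pollutant]])
  # Pool all values and ignore the NAs
  mean(unlist(values), na.rm = TRUE)
}

# Made-up demo files in a temporary directory (hypothetical data)
tmp <- file.path(tempdir(), "specdata_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(0.5, 0.7, NA), ID = 1),
          file.path(tmp, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7), nitrate = c(NA, 0.9), ID = 2),
          file.path(tmp, "002.csv"), row.names = FALSE)

result <- pollutantmean(tmp, "sulfate", 1:2)
print(result)
```

Without na.rm = TRUE the same call would return NA, because the pooled vector contains a missing value.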

Related

My bind_rows is throwing an unusual error: Can't combine `..1$Activity ID` <logical> and `..2$Activity ID` <factor<2585d>>

My code is as shown below and has been running fine for months:
The first part is appending data to dfCampaignTask and this is running fine
for (i in files) {
if (startsWith(i, "CampaignTaskReport_2")) {
temp <- read.csv(paste(directory, i, sep = ""), header = TRUE)
names(temp) <- names(dfCampaignTask)
dfCampaignTask <- rbind(dfCampaignTask, temp)
}
else if (startsWith(i, "CombinedReport_2")) {
temp <- read.csv(paste(directory, i, sep = ""), header = TRUE)
if (ncol(temp) < 19) {
names(temp) <- dfCombinedNamesOld
}
else {
names(temp) <- names(dfCombined)
}
dfCombined <- bind_rows(dfCombined, temp)
}
}
The second part, which appends data to dfCombined, is throwing an error: Can't combine `..1$Activity ID` <logical> and `..2$Activity ID` <factor<2585d>>.
I checked str() of all the combined files in the folder and the Activity ID column is a factor in every one, so I am not sure what is causing the discrepancy.
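One way to narrow this down: in the message, ..1 is the first argument to bind_rows (here dfCombined) and ..2 is the second (temp), so it may be the accumulated data frame rather than any single file that holds the logical column. A column containing only NA values is logical in R. The sketch below uses two made-up data frames (dfCombined_demo and temp_demo are hypothetical names) to compare the column classes and coerce both sides to character before combining. It uses base rbind() only to stay free of package dependencies; the same coercion could be applied before bind_rows().

```r
# Made-up stand-ins; only the type of the "Activity ID" column matters here
dfCombined_demo <- data.frame("Activity ID" = c(NA, NA), Value = 1:2,
                              check.names = FALSE)
temp_demo <- data.frame("Activity ID" = factor(c("A1", "B2")), Value = 3:4,
                        check.names = FALSE)

# Compare the class of the column in each data frame
classes <- sapply(list(first = dfCombined_demo, second = temp_demo),
                  function(d) class(d[["Activity ID"]]))
print(classes)

# Coerce both sides to character so the types agree before binding
dfCombined_demo[["Activity ID"]] <- as.character(dfCombined_demo[["Activity ID"]])
temp_demo[["Activity ID"]] <- as.character(temp_demo[["Activity ID"]])
combined_demo <- rbind(dfCombined_demo, temp_demo)
print(combined_demo)
```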

How to subset 2 parameters in 1 function

I have a problem I'm attempting to solve and have run into a brick wall. I'm attempting to find the mean of a set of data given specific pollutant names and the ID number. I believe the code up to and including the for loop works fine. I create a function with 3 arguments, create an empty data.frame and then bind all my files into one variable called "dat".
Now I'm trying to subset this newly bound data by "id" and by the specific pollutant name (there are two of them, named sulfate and nitrate). As you can see, the code under the for loop is a mess.
Specifically, I'm unsure how to subset on two parameters/arguments in one "which" call, so I tried to make a separate one for each. I was thinking I could use the median function to find the mean between both.
pollutantmean <- function(directory, pollutant, id = 1:332) {
files_list <- list.files(directory, full.names = TRUE)
dat <- data.frame()
for (i in 1:332){
dat <- rbind(dat, read.csv(files.list[1]))
}
subset_id <-dat[which(dat[, "id"] ==id) , ]
subset_poll <-dat[which(dat[, "pollutant"] ==pollutant) , ]
median(subset_id)
}
Here is a photo of what the head/tail data looks like in R.
EDIT1: So I was able to get the function initialized (proper term?) but am getting numerous "undefined columns selected" errors when I try to run it with input.
pollutantmean <- function(directory, pollutant, ID = 1:332) {
files_list <- list.files(directory, full.names = TRUE)
dat <- data.frame()
for (i in 1:332) {
dat <- rbind(dat, read.csv(files_list[1]))
}
subset_id <- dat[which(dat[, "ID"] == ID & dat[, "pollutant"] ==
pollutant) ]
median(subset_id[, "pollutant"], na.rm = TRUE)
}
So that function gets placed into memory just fine, but when I try to input parameters "pollutantmean("specdata","sulfate", 1:10)" I get the following errors.
Error in `[.data.frame`(dat, , "pollutant") : undefined columns selected
In addition: Warning message:
In dat[, "ID"] == ID :
Error in `[.data.frame`(dat, , "pollutant") : undefined columns selected
I was able to solve this question with some outside help.
pollutantmean <- function(directory, pollutant, ID = 1:332) {
files_list <- list.files(directory, full.names = TRUE)
dat <- data.frame()
for (i in ID) {
dat <- rbind(dat, read.csv(files_list[i]))
}
mean(dat[!is.na(dat[, "ID"]),pollutant], na.rm = TRUE)
}
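For anyone hitting the same error: dat[, "pollutant"] asks for a column literally named pollutant, whereas the function argument pollutant holds the column name, so it is used unquoted. Likewise, dat[, "ID"] == ID compares element by element with recycling, while %in% tests membership. A small sketch on made-up data (the values are illustrative only):

```r
# Made-up data in the same shape as the monitor files
dat <- data.frame(
  ID      = c(1, 1, 2, 2, 3, 3),
  sulfate = c(1.0, NA, 3.0, 5.0, 7.0, 9.0),
  nitrate = c(0.1, 0.2, NA, 0.4, 0.5, 0.6)
)

pollutant <- "sulfate"  # the variable holds the column name
ID <- 1:2

rows <- dat[["ID"]] %in% ID                        # membership test, not ==
result <- mean(dat[rows, pollutant], na.rm = TRUE)  # no quotes around pollutant
print(result)
```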

Breaking the golden rules on FOR loops with R

Firstly, apologies as this may seem a bit long winded, but I hope to give as much information on this problem as I can...
I have written a script that loops through a set of files defined in a csv file. Each file within this csv listing is an XML file, each one is for a particular event in an application, and all files within this list are of the same event type. However, each file can contain different data. For instance, one could hold an attribute with no child nodes beneath, while others contain nodes.
My script works perfectly fine, but when it gets to about XML file 5000, it has slowed down considerably.
The problem is that my code creates a blank dataframe initially, and then grows it as new columns are detected.
I understand that this is a big NO NO when it comes to writing R FOR loops, but I am unsure how to get around this problem, given that my smallest file listing is 69000, which makes going through each one in turn and counting the nodes a task in itself.
Are there any ideas on how to get around this?
pseudo code or actual R code to do this would be great. So would ideas/opinions, as I am unsure on the best approach to this task.
Here is my current code.
library(XML)
library(xml2)
library(plyr)
library(tidyverse)
library(reshape2)
library(foreign)
library(rio)
library(RODBC) # provides odbcDriverConnect() and sqlSave(), used at the end of the script
# Get file data to be used
#
setwd('c:/temp/xml')
headerNames <- c('GUID','EventId','AppId','RequestFile', 'AE_Type', 'AE_Drive')
GetNames <- rowid_to_column(read.csv(file= 'c:/temp/xml/R_EventIdA.csv', fileEncoding="UTF-8-BOM", header = FALSE, col.names = headerNames),'ID')
inputfiles <- as.character(GetNames[,5]) # Gets list of files
# Create empty dataframes
#
df <- data.frame()
transposed.df1 <- data.frame()
allxmldata <- data.frame()
findchildren<-function(nodes, df) {
numchild <- sapply(nodes, function(x){length(xml_children(x))})
xml.value <- xml_text(nodes[numchild==0])
xml.name <- xml_name(nodes[numchild==0])
xml.path <- sapply(nodes[numchild==0], function(x) {gsub(', ','_', toString(rev(xml_name(xml_parents(x)))))})
fieldname <- paste(xml.path,xml.name,sep = '_')
contents <- sapply(xml.value, function(f){is.na(f)<-which(f == '');f})
if (length(fieldname) > 0) {
fieldname <- paste(fieldname,xml.value, sep = '_')
dftemp <- data.frame(fieldname, contents)
df <- rbind(df, dftemp)
print(dim(df))
}
if (sum(numchild)>0){
findchildren(xml_children(nodes[numchild>0]), df) }
else{ return(df)
}
}
findchildren2<-function(nodes, df){
numchild<-sapply(nodes, function(embeddedinputfile){length(xml_children(embeddedinputfile))})
xmlvalue<-xml_text(nodes[numchild==0])
xmlname<-xml_name(nodes[numchild==0])
xmlpath<-sapply(nodes[numchild==0], function(embeddedinputfile) {gsub(', ','_', toString(rev(xml_name(xml_parents(embeddedinputfile)))))})
fieldname<-paste(xmlpath,xmlname,sep = '_')
contents<-sapply(xmlvalue, function(f){is.na(f)<-which(f == '');f})
if (length(fieldname) > 0) {
dftemp<-data.frame(fieldname, contents)
df<-rbind(df, dftemp)
print(dim(df))
}
if (sum(numchild)>0){
findchildren2(xml_children(nodes[numchild>0]), df) }
else{ return(df)
}
}
# Loop all files
#
for (x in inputfiles) {
df1 <- findchildren(xml_children(read_xml(x)),df)
## original xml dataframe
if (length(df1) > 0) {
xml.df1 <- data.frame(spread(df1, key = fieldname, value = contents), fix.empty.names = TRUE)
}
##
xml.df1 %>%
pluck('Response_RawData') -> rawxml
if (length(rawxml)>0) {
df.rawxml <- data.frame(rawxml)
export(df.rawxml,'embedded.xml')
embeddedinputfile <-as.character('embedded.xml')
rm(df1)
df1 <- findchildren2(xml_children(read_xml(embeddedinputfile)),df)
if (length(df1) > 0) {
xml.df2 <- spread(df1, key = fieldname, value = contents)
}
allxmldata <- rbind.fill(allxmldata,cbind(xml.df1,xml.df2))
} else {
allxmldata <- rbind.fill(allxmldata,cbind(xml.df1))
}
}
if(nrow(allxmldata)==nrow(GetNames)) {
alleventdata<-cbind(GetNames,allxmldata)
}
dbConn2 <- odbcDriverConnect('driver={SQL Server};server=PC-XYZ;database=Events;trusted_connection=true')
sqlSave(dbConn2, alleventdata, tablename = 'AE_EventA', append = TRUE )
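One common way around growing a data frame inside the loop is to store each file's result in a preallocated list and bind everything once at the end. The sketch below is base R only and is not a drop-in replacement for the script above: parse_one is a hypothetical stand-in for the per-file XML parsing, returning a one-row data frame whose columns can differ between files.

```r
# Hypothetical stand-in for the per-file parsing step
parse_one <- function(i) {
  if (i %% 2 == 0) data.frame(a = i, b = i * 10) else data.frame(a = i)
}

ids <- 1:4
rows <- vector("list", length(ids))  # preallocate one slot per file
for (k in seq_along(ids)) {
  rows[[k]] <- parse_one(ids[k])     # no rbind inside the loop
}

# Align the columns: add any missing column as NA, in a common order
all_cols <- unique(unlist(lapply(rows, names)))
rows <- lapply(rows, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
})

# Bind once at the end
combined <- do.call(rbind, rows)
print(combined)
```

Since the script already loads plyr, passing the list to rbind.fill would presumably handle the column alignment step, but the point is the same either way: the expensive bind happens once rather than once per file.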

R Data Frames column names rename

I am new to R and not sure why I have to rename the data frame's columns at the end of the program, even though I defined the data frame with column names at the beginning of the program. The data frame has two columns: I have to save the sequence of ids in the ID column and a count in the NOBS column.
complete <- function(directory, id = 1:332) {
collectCounts = data.frame(id=numeric(), nobs=numeric())
for(i in id) {
fileName = sprintf("%03d",i)
fileLocation = paste(directory, "/", fileName,".csv", sep="")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[,2]), na.rm=TRUE)
collectCounts <- rbind(collectCounts, c(id=i, completeCount))
#print(completeCount)
}
colnames(collectCounts)[1] <- "id"
colnames(collectCounts)[2] <- "nobs"
print(collectCounts)
}
It's not quite clear what your specific problem is, as you did not provide a complete and verifiable example. But I can give a few pointers on improving the code, nonetheless.
1) It is not recommended to 'grow' a data.frame within a loop. This is extremely inefficient in R, as it copies the entire structure each time. It is better to allocate the whole data.frame at the outset, then fill in the rows in the loop.
2) R has a handy function, paste0, that does not require you to specify sep = "".
3) There's no need to specify na.rm = TRUE in your sum, because is.na never returns NA.
Putting this together:
complete = function(directory, id = 1:332) {
collectCounts = data.frame(id=id, nobs=numeric(length(id)))
for(i in 1:length(id)) {
fileName = sprintf("%03d", id[i])
fileLocation = paste0(directory, "/", fileName,".csv")
fileData = read.csv(fileLocation, header=TRUE)
completeCount = sum(!is.na(fileData[, 2]))
collectCounts[i, 'nobs'] <- completeCount
}
collectCounts # return the filled data frame; a for loop alone returns NULL
}
Always hard to answer questions without example data.
You could start with
collectCounts = data.frame(id, nobs=NA)
And in your loop, do:
collectCounts[i, 2] <- completeCount
Here is another way to do this:
complete <- function(directory, id = 1:332) {
nobs <- sapply(id, function(i) {
fileName = paste0(sprintf("%03d",i), ".csv")
fileLocation = file.path(directory, fileName)
fileData = read.csv(fileLocation, header=TRUE)
sum(!is.na(fileData[,2]), na.rm=TRUE)
}
)
data.frame(id=id, nobs=nobs)
}

Overwriting result with for loop in R

I have a number of csv files and my goal is to find the number of complete cases for a file or set of files given by id argument. My function should return a data frame with column id specifying the file and column obs giving the number of complete cases for this id. However, my function overwrites the previous value of nobs in each loop and the resulting data frame gives me only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction<-function(id=1:20) {
files<-list.files(pattern="*.csv")
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x,stringsAsFactors = FALSE)))
for (i in id) {
good<-complete.cases(myfiles)
newframe<-myfiles[good,]
cases<-newframe[newframe$ID %in% i,]
nobs<-nrow(cases)
}
clean<-data.frame(id,nobs)
clean
}
Thanks.
We can do it all inside lapply(), something like the code below (not tested):
myfunction <- function(id = 1:20) {
files <- list.files(pattern = "*.csv")[id]
do.call(rbind,
lapply(files, function(x){
df <- read.csv(x,stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]
data.frame(ID=x,nobs=nrow(df))
}
)
)
}
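If you would rather keep the loop, the overwriting can be avoided by preallocating nobs and writing to one position per id. A minimal sketch, using a made-up data frame in place of the combined csv files (the values are illustrative only):

```r
# Made-up stand-in for the combined data from all files
myfiles <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3),
  sulfate = c(1.2, NA, 3.1, 0.4, 0.5, NA),
  nitrate = c(0.3, 0.2, NA, 0.1, 0.6, 0.9)
)

id <- 1:3
good <- complete.cases(myfiles)  # computed once, outside the loop
nobs <- integer(length(id))      # one slot per id
for (k in seq_along(id)) {
  nobs[k] <- sum(good & myfiles$ID == id[k])  # fill slot k instead of overwriting
}

clean <- data.frame(id = id, nobs = nobs)
print(clean)
```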
