I have a problem I'm attempting to solve and have run into a brick wall. I'm trying to find the mean of a set of data given specific pollutant names and the ID number. I believe the code works fine up to the for loop: I create a function with 3 arguments, create an empty data.frame, and then bind all my files into one variable called "dat".
Now I'm trying to subset this combined data by "id" and by the specific pollutant name (there are two of them, named sulfate and nitrate). As you can see, the code under the for loop is a mess.
Specifically, I'm unsure how to subset on two parameters/arguments in one "which" call, so I tried to make a separate one for each. I was thinking I could use the median function to find the mean between both.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in 1:332) {
    dat <- rbind(dat, read.csv(files.list[1]))
  }
  subset_id <- dat[which(dat[, "id"] == id), ]
  subset_poll <- dat[which(dat[, "pollutant"] == pollutant), ]
  median(subset_id)
}
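For reference, two conditions can go into a single which() call joined by &. A minimal sketch, assuming dat has the ID and sulfate columns found in the specdata files:

# hypothetical example: rows for monitor 1 that have a non-missing sulfate reading
rows <- which(dat[, "ID"] == 1 & !is.na(dat[, "sulfate"]))
subset_one <- dat[rows, ]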
Here is a photo of what the head/tail data looks like in R.
EDIT1: So I was able to get the function initialized (proper term?), but I am getting numerous "undefined columns selected" errors when I try to run it with input.
pollutantmean <- function(directory, pollutant, ID = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in 1:332) {
    dat <- rbind(dat, read.csv(files_list[1]))
  }
  subset_id <- dat[which(dat[, "ID"] == ID & dat[, "pollutant"] == pollutant)]
  median(subset_id[, "pollutant"], na.rm = TRUE)
}
So that function gets placed into memory just fine, but when I call it with pollutantmean("specdata", "sulfate", 1:10) I get the following errors.
Error in `[.data.frame`(dat, , "pollutant") : undefined columns selected
In addition: Warning message:
In dat[, "ID"] == ID :
Error in `[.data.frame`(dat, , "pollutant") : undefined columns selected
I was able to solve this question with some outside help.
pollutantmean <- function(directory, pollutant, ID = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in ID) {  # read only the files requested by ID
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  mean(dat[!is.na(dat[, "ID"]), pollutant], na.rm = TRUE)
}
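With that version in memory, and assuming the specdata folder of CSVs is in the working directory, the earlier call should return a single number:

pollutantmean("specdata", "sulfate", 1:10)  # mean sulfate value across stations 1 to 10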
Related
I am working on R and learning how to code. I have written a piece of code that uses a for loop, and I find it very slow. I was wondering if I could get some assistance converting it to use either the sapply or lapply function. Here is my working R code:
library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)  # creates a list of files
  dat <- data.frame()                                     # creates an empty data frame
  for (i in seq_along(files_list)) {
    # loops through the files, rbinding them together
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- filter(dat, dat$ID %in% id)    # subsets the rows that match the 'ID' argument
  mean(dat_subset[, pollutant], na.rm = TRUE)  # identifies the mean of a pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
This code takes almost 20 seconds to return, which is unacceptable for 332 records. Imagine if I had a dataset with 10K records and wanted to get the mean of those variables.
You can rbind all elements in a list using do.call, and you can read in all the files into that list using lapply:
mean(
  filter(                 # here's the filter that will be applied to the rbind-ed data
    do.call("rbind",      # call "rbind" on all elements of a list
      lapply(             # create a list by reading in the files from list.files()
        # add any necessary args to read.csv:
        list.files("[::DIR_PATH::]"), function(x) read.csv(file = x, ...)
      )
    ),
    ID %in% id            # make sure id is replaced with what you want
  )$pollutant,            # and pollutant with the column you are after
  na.rm = TRUE
)
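Filled in with concrete (hypothetical) names, assuming a "specdata" folder of CSVs with ID, sulfate and nitrate columns and pollutant given as a column name string, that sketch could be wrapped in a function roughly like this:

library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  # read every CSV in the directory into a list, then bind them into one data frame
  dat <- do.call("rbind",
                 lapply(list.files(directory, full.names = TRUE), read.csv))
  # keep the requested stations and average the requested pollutant column
  mean(filter(dat, ID %in% id)[[pollutant]], na.rm = TRUE)
}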
The reason your code is slow is that you are incrementally growing your data frame inside the loop. One way to do this using dplyr and map_df from purrr is:
library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  purrr::map_df(files_list, read.csv) %>%
    filter(ID %in% id) %>%
    summarise_at(pollutant, mean, na.rm = TRUE)
}
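Assuming the same specdata folder, the original call should now return a one-row summary without growing anything in a loop:

pollutantmean("specdata", "sulfate", 1:10)  # a 1 x 1 data frame holding the mean sulfate value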
Here is the data I am working with. https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
I'm trying to create a function called pollutantmean that will load selected files, aggregate (rbind) the columns, and return a mean of a certain column. I have figured out everything except how to run the loop so I can turn the multiple files into one big data frame.
for (id in 1:5) {
  files_full <- Sys.glob("*.csv")
  fileQ <- files_full[[id]]
  empty_tbl <- rbind(empty_tbl, read.csv(fileQ, header = TRUE))
}
This for loop works by itself, but when I try to use it inside my bigger function
pollutantmean <- function(directory = "specdata", pollutant, id = 1:332) {
empty_tbl <- data.frame()
for (id in 1:332) {
files_full <- Sys.glob("*.csv")
fileQ <- files_full[[i]]
empty_tbl <- rbind(empty_tbl, read.csv(fileQ, header = TRUE))
}
goodata <- na.omit(empty_tbl)
if(pollutant == "sulfate") {
mean(goodata[,2])
} else {
mean(goodata[,3])
}
}
I get this error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file' must be a character string or connection".
I am at a complete loss over how to fix this and have tried many, many different ways. I'm sure I'm messing something up with the naming of the file, but when I run the for loop by itself it works fine...
Consider using lapply() over the csv files, making use of the function's directory argument. The code below assumes specdata is a subfolder of the current working directory:
pollutantmean <- function(directory = "specdata", pollutant) {
files_full <- Sys.glob(paste0(directory,"/*.csv"))[1:332] # FIRST 332 CSVs IN DIRECTORY
dfList <- lapply(files_full, read.csv, header=TRUE)
df <- do.call(rbind, dfList)
gooddata <- na.omit(df)
pmean <- ifelse(pollutant == "sulfate", mean(gooddata[,2]), mean(gooddata[,3]))
}
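A call would then look something like this (again assuming specdata sits under the working directory; this version takes no id argument and always uses the first 332 files):

pollutantmean("specdata", "sulfate")  # mean of the sulfate column over the complete cases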
I have a number of csv files, and my goal is to find the number of complete cases for a file or set of files given by the id argument. My function should return a data frame with a column id specifying the file and a column obs giving the number of complete cases for that id. However, my function overwrites the previous value of nobs on each loop iteration, and the resulting data frame gives me only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction <- function(id = 1:20) {
  files <- list.files(pattern = "*.csv")
  myfiles <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
  for (i in id) {
    good <- complete.cases(myfiles)
    newframe <- myfiles[good, ]
    cases <- newframe[newframe$ID %in% i, ]
    nobs <- nrow(cases)
  }
  clean <- data.frame(id, nobs)
  clean
}
Thanks.
We can do it all inside lapply(), something like below (not tested):
myfunction <- function(id = 1:20) {
  files <- list.files(pattern = "*.csv")[id]
  do.call(rbind,
          lapply(files, function(x) {
            df <- read.csv(x, stringsAsFactors = FALSE)
            df <- df[complete.cases(df), ]
            data.frame(ID = x, nobs = nrow(df))
          }))
}
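Assuming the csv files are in the working directory, a call like this should give one row per selected file (note that ID here will be the file name rather than the numeric id):

myfunction(1:5)  # a 5-row data frame with columns ID and nobs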
I am trying to write an R script that calculates the mean of a specified pollutant (nitrate or sulfate) based on data from one or more of 332 monitor stations. The data from each station is held in a separate file, numbered 1:332. I am new to R and, to be fair to anyone who chooses to help me, I should say that this is a homework problem. I have written the script below, which works for just one file:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
  for (i in seq_along(id)) {
    if (id < 10) {
      name <- paste("00", id[i], sep = "")
    }
    if (id >= 10 && id < 100) {
      name <- paste("0", id[i], sep = "")
    }
    if (id >= 100) {
      name <- id[i]
    }
  }
  file <- paste(name, "csv", sep = ".")
  station <- paste(filepath, directory, file, sep = "/")
  monitor <- read.csv(station)
  if (pollutant == "nitrate") {
    x <- mean(monitor$nitrate, na.rm = T)
  }
  if (pollutant == "sulfate") {
    x <- mean(monitor$sulfate, na.rm = T)
  }
  x
}
However, if I enter more than one file (e.g. 70:72) I get the mean for the last file only (72). This suggests to me that it is calculating the mean for each file and then overwriting it with the mean of the next, so that only the last one is output. I think I would be able to solve this using rbind(), but I can't figure out how to assign unique names to each variable, which would then become the arguments for rbind(). I would be grateful for any help anyone can offer.
Cheers,
Jim
You don't loop over the files.
And you get the mean of the last file only because, when you loop over the ids to create the names, the loop keeps just the last name created.
You should create a vector of names, then of station paths, and loop over that!
Tip: you don't need a loop and conditional statements to create the names; you can use sprintf, specifying the width you expect for the string (3) and the character to pad it with (0):
> id <- c(1, 10, 100)
> names <- sprintf("%03d", id)
> names
[1] "001" "010" "100"
And this should work:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
  names <- sprintf("%03d", id)
  files <- paste0(names, ".csv")  # or directly: files <- sprintf("%03d.csv", id)
  station <- file.path(filepath, directory, files)
  means <- numeric(length(station))
  for (i in seq_along(station)) {
    monitor <- read.csv(station[i])
    if (pollutant == "nitrate") {
      means[i] <- mean(monitor$nitrate, na.rm = TRUE)
    } else if (pollutant == "sulfate") {
      means[i] <- mean(monitor$sulfate, na.rm = TRUE)
    }
  }
  return(means)
}
EDIT:
If you want a single mean, you can use the code above and weight each per-file mean by its number of non-NA rows. Replace the loop with:
means <- numeric(length(station))
counts <- numeric(length(station))
for (i in seq_along(station)) {
  monitor <- read.csv(station[i])
  if (pollutant == "nitrate") {
    means[i] <- mean(monitor$nitrate, na.rm = TRUE)
    counts[i] <- sum(!is.na(monitor$nitrate))
  } else if (pollutant == "sulfate") {
    means[i] <- mean(monitor$sulfate, na.rm = TRUE)
    counts[i] <- sum(!is.na(monitor$sulfate))
  }
}
myMean <- sum(means * counts) / sum(counts)
return(myMean)
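As a quick illustration with made-up values, this weighted combination of per-file means matches the mean you would get by pooling all the non-NA values:

x1 <- c(1, 2, NA)        # pretend these are one station's readings
x2 <- c(10, 20, 30, 40)  # and these another station's
means  <- c(mean(x1, na.rm = TRUE), mean(x2, na.rm = TRUE))
counts <- c(sum(!is.na(x1)), sum(!is.na(x2)))
sum(means * counts) / sum(counts)  # 17.16667
mean(c(x1, x2), na.rm = TRUE)      # 17.16667, the same value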
Since your first intention was to gather your data into one vector, here is a solution that creates a list in which each element is the desired "pollutant" column of one data frame; unlist() gathers all the vectors into one, and then we can compute the mean of that vector.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
  names <- sprintf("%03d", id)
  files <- paste0(names, ".csv")  # or directly: files <- sprintf("%03d.csv", id)
  station <- file.path(filepath, directory, files)
  li <- lapply(station, function(x) {
    monitor <- read.csv(x)
    if (pollutant == "nitrate") {
      monitor$nitrate
    } else if (pollutant == "sulfate") {
      monitor$sulfate
    }
  })
  myMean <- mean(unlist(li))
  return(myMean)
}
A small correction to Julien Navarre's 2nd pollutantmean function: when calculating the mean, it does not ignore NA values, which could affect the overall result. The line calculating the mean should therefore read:
myMean <- mean(unlist(li), na.rm = TRUE)
I am new to R. I created the function below to calculate the mean of a dataset contained in 332 csv files. I am seeking advice on how I could improve this code: it takes about 38 seconds to run, which makes me think it is not very efficient.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)  # creates list of files
  dat <- data.frame()                                     # creates empty data frame
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i]))            # combine all the monitor data together
  }
  good <- complete.cases(dat)                             # flag rows with no NA values
  mean(dat[good, pollutant])                              # calculate mean
}  # run time ~ 37 sec - NEED TO OPTIMISE THE CODE
Instead of creating an empty data.frame and rbind-ing to it on every iteration of a for loop, you can store all the data.frames in a list and combine them in one shot. You can also use the na.rm option of mean so that NA values are not taken into account.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)[id]
  df <- do.call(rbind, lapply(files_list, read.csv))
  mean(df[[pollutant]], na.rm = TRUE)
}
Optional - I would increase the readability with magrittr:
library(magrittr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  list.files(directory, full.names = TRUE)[id] %>%
    lapply(read.csv) %>%
    do.call(rbind, .) %>%
    extract2(pollutant) %>%
    mean(na.rm = TRUE)
}
You can improve it by using data.table's fread function (see Quickly reading very large tables as dataframes in R)
Also binding the result using data.table::rbindlist is way faster.
require(data.table)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)[id]
  DT <- rbindlist(lapply(files_list, fread))
  mean(DT[[pollutant]], na.rm = TRUE)
}