New to R and to programming, I reviewed all the threads on SO about this Coursera assignment but couldn't figure out what the issue is. I know this function can be optimized using lapply and much more, but I would like to know why this particular version does not work. I realize that questions about this function have irritated some users; to be honest, I reviewed the relevant posts and I still don't see what I can do about this particular bug.
pollutantmean <- function(directory, pollutant, id) {
  # Create the data frame with the data from the 332 files
  files <- list.files(getwd())
  df <- data.frame()
  id <- 1:332
  for (i in 1:length(id)) {
    df <- rbind(df, read.csv(files[i]))
    if (pollutant == "nitrate") {
      # Create a subset for nitrate values of df
      df_nitrate <- df[df$ID == id[i], "nitrate"]
      # Take mean of df_nitrate
      mean(df_nitrate, na.rm = TRUE)
    } else {
      # Create a subset for sulfate values of df
      df_sulfate <- df[df$ID == id[i], "sulfate"]
      # Take mean of df_sulfate
      mean(df_sulfate, na.rm = TRUE)
    }
  }
}
For those of you who have not heard of this assignment: I have 332 CSV files (named 001.csv, 002.csv, and so on) in my working directory. The task is to read all of them into one data frame and to be able to compute the mean of a column for one file (given by the id argument that corresponds to that file) or across multiple files (some examples of the function and its output can be found here).
I tried calling the traceback and debug functions to locate the problem, but to no avail:
pollutantmean(getwd(), "nitrate", 23)
> traceback()
No traceback available
> debug(pollutantmean)
>
The OS is Windows 10.
Any suggestions or comments are welcome. Thanks in advance.
Your for loop is wrapped around your if block. An R function returns the value of the last expression evaluated in its body, and expressions evaluated inside a loop are discarded, so the function will not return a value computed there (unless you use the return function, which is not what you want to do here).
pollutantmean <- function(directory, pollutant, id) {
  # Create the data frame with the data from the 332 files
  files <- list.files(getwd())
  df <- data.frame()
  id <- 1:332
  for (i in 1:length(id)) {
    df <- rbind(df, read.csv(files[i]))
  }  # close for loop HERE
  if (pollutant == "nitrate") {
    # Create a subset for nitrate values of df
    df_nitrate <- df[df$ID == id[i], "nitrate"]
    # Take mean of df_nitrate
    mean(df_nitrate, na.rm = TRUE)
  } else {
    # Create a subset for sulfate values of df
    df_sulfate <- df[df$ID == id[i], "sulfate"]
    # Take mean of df_sulfate
    mean(df_sulfate, na.rm = TRUE)
  }
  # not HERE
  # }
}
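As a minimal illustration of that return rule (with made-up data, not the assignment files): an R function returns the value of the last expression evaluated in its body, a for loop returns NULL invisibly, and anything computed inside the loop body is silently discarded unless you store it.

```r
# Values computed inside a for loop are discarded; the loop itself
# evaluates to NULL, so this function returns nothing useful.
f_loop <- function(x) {
  for (i in seq_along(x)) {
    mean(x[[i]])  # computed, then thrown away
  }
}

# Store each result, then make the vector the last expression.
f_last <- function(x) {
  m <- numeric(length(x))
  for (i in seq_along(x)) {
    m[i] <- mean(x[[i]])
  }
  m  # last evaluated expression: this is the return value
}

f_loop(list(1:3, 4:6))  # NULL (invisibly)
f_last(list(1:3, 4:6))  # c(2, 5)
```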
back again with another question. I ended up tearing down my earlier code and starting from square one, because I was buried in errors and this seemed like an easier way to fix things. My function now returns correct values, but only when I have it look at one file (ID). Whenever I attempt to run it over a sequence of files (i.e. 1:10) I get an incorrect answer and Warning message: In dat[, "ID"] == ID : longer object length is not a multiple of shorter object length
This is the code (I had originally tried using lapply and sapply with data.table, but that seemed to have opened a whole new can of worms I was not prepared for).
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in 1:332) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- dat[which(dat[, "ID"] == id), ]
  mean(dat_subset[, pollutant], na.rm = TRUE, useNames = TRUE)
}
When I call my function as
> pollutantmean("./specdata/", "nitrate", 23)
it comes back with 1.280833 which is what I am expecting to see for this call.
However, when I call it as
pollutantmean("./specdata/", "sulfate", 1:10)
it comes back with the previously mentioned warning message
I don't know if it has something to do with the way I am defining the columns or rows of the dat <-data.frame() or something else that's maybe staring me right in the face.
Replace == with %in%. This tests membership against a vector instead of comparing against a scalar, and avoids recycling.
# which is not needed
dat_subset <- dat[dat[,"ID"] %in% id, ]
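A small sketch of the difference, with made-up IDs: == compares element-wise and recycles the shorter vector (warning when the lengths don't divide evenly, which is exactly the warning above), while %in% tests each element for membership.

```r
ids <- c(1, 2, 3, 1, 2, 3, 4)  # hypothetical ID column

# == recycles c(1, 2) across ids: compares 1,2,3,1,2,3,4 to 1,2,1,2,1,2,1
# and warns because 7 is not a multiple of 2
ids == c(1, 2)    # TRUE TRUE FALSE FALSE FALSE FALSE FALSE, plus a warning

# %in% asks, for each element, "is it one of 1 or 2?" -- no recycling
ids %in% c(1, 2)  # TRUE TRUE FALSE TRUE TRUE FALSE FALSE
```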
I'm trying to write a function called complete that takes a file directory (which has CSV files numbered 1-332) and the number of a file, and prints out the number of rows without NA in the sulfate or nitrate columns. I am trying to use mutate to add a column titled nobs which is 1 if neither column is NA, and then take the sum of nobs for my answer, but I get an error message that the object nob is not found. How can I fix this? The specific file directory in question is downloaded within this block of code.
library(tidyverse)
if (!file.exists("rprog-data-specdata.zip")) {
  temp <- tempfile()
  download.file("https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip", temp)
  unzip(temp)
  unlink(temp)
}
complete <- function(directory, id = 1:332) {
  # create a list of files
  files_full <- list.files(directory, full.names = TRUE)
  # create an empty data frame
  dat <- data.frame()
  for (i in id) {
    dat <- rbind(dat, read.csv(files_full[i]))
  }
  mutate(dat, nob = ifelse(!is.na(dat$sulfate) & !is.na(dat$nitrate), 1, 0))
  x <- summarise(dat, sum = sum(nob))
  return(x)
}
When one runs the following code, the result should be 117, but I get an error message instead:
complete("specdata", 1)
Error: object 'nob' not found
I think the function below should get what you need. Rather than a loop, I prefer map (or apply) in this setting. It's difficult to say where your code went wrong without the error message or an example I can run on my machine, however.
Happy Coding,
Daniel
library(tidyverse)
complete <- function(directory, id = 1:332) {
  # create a list of files, subset to the requested ids
  files_full <- list.files(directory, full.names = TRUE)[id]
  # cycle over each file to get the number of nonmissing rows
  purrr::map_int(
    files_full,
    ~ read.csv(.x) %>%                        # read in datafile
      dplyr::select(sulfate, nitrate) %>%     # select the two columns of interest
      tidyr::drop_na() %>%                    # drop rows with missing observations
      nrow()                                  # number of complete rows in this file
  ) %>%
    sum()  # sum the total number of complete rows among the requested files
}
As mentioned, avoid building objects inside a loop. Instead, consider building a list of data frames, one per CSV, then calling rbind once. In fact, consider base R (i.e., the "tinyverse") for all your needs:
complete <- function(directory, id = 1:332) {
  # create a list of files
  files_full <- list.files(directory, full.names = TRUE)
  # create a list of data frames, one per requested file
  df_list <- lapply(files_full[id], read.csv)
  # build a single data frame with nob column
  dat <- transform(do.call(rbind, df_list),
                   nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0))
  return(sum(dat$nob))
}
I am working in R and learning how to code. I have written a piece of code using a for loop and I find it very slow. I was wondering if I could get some assistance converting it to use either the sapply or lapply function. Here is my working R code:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)  # creates a list of files
  dat <- data.frame()  # creates an empty data frame
  for (i in seq_along(files_list)) {
    # loops through the files, rbinding them together
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- filter(dat, dat$ID %in% id)  # subsets the rows that match the 'id' argument
  mean(dat_subset[, pollutant], na.rm = TRUE)  # identifies the mean of the pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
This code takes almost 20 seconds to return, which is unacceptable for 332 files. Imagine if I had a dataset with 10K files and wanted the mean of those variables!
You can rbind all elements in a list using do.call, and you can read in all the files into that list using lapply:
mean(
  filter(  # here's the filter that will be applied to the rbind-ed data
    do.call("rbind",  # call "rbind" on all elements of a list
      lapply(  # create a list by reading in the files from list.files()
        # add any necessary args to read.csv:
        list.files("[::DIR_PATH::]"), function(x) read.csv(file = x, ...)
      )
    ),
    ID %in% id  # make sure id is replaced with what you want
  )[[pollutant]],  # [[ ]] so the pollutant argument is used, not a column literally named "pollutant"
  na.rm = TRUE
)
The reason your code is slow is that you are incrementally growing your data frame in the loop. One way to do this using dplyr and map_df from purrr:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  purrr::map_df(files_list, read.csv) %>%
    filter(ID %in% id) %>%
    summarise_at(pollutant, mean, na.rm = TRUE)
}
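A rough sketch of why growing a data frame in a loop is slow, using synthetic data frames rather than the assignment files: rbind inside the loop copies the whole accumulated data frame on every iteration (roughly quadratic work), while building a list and combining once copies each row only once.

```r
# hypothetical stand-in for read.csv: a small data frame per "file"
make_df <- function(i) data.frame(ID = i, value = rnorm(1000))

# grows the data frame one rbind at a time: each iteration re-copies
# everything accumulated so far
grow <- function(n) {
  dat <- data.frame()
  for (i in 1:n) dat <- rbind(dat, make_df(i))
  dat
}

# builds a list of pieces, then combines them in a single call
combine_once <- function(n) {
  do.call(rbind, lapply(1:n, make_df))
}

system.time(grow(100))
system.time(combine_once(100))  # typically much faster as n grows
```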
I am a newbie in R and have to calculate the mean of the sulfate column across 332 files. The mean formula below works well with one file. The problem comes when I attempt to calculate across the files.
Perhaps the reading of all files and storing them in mydata does not work well? Could you help me out?
Many thanks
pollutantmean <- function(specdata, pollutant = xor(sulf, nit), i = 1:332) {
  specdata <- getwd()
  pollutant <- c(sulf, nit)
  for (i in 1:332) {
    mydata <- read.csv(file_list[i])
  }
  sulfate <- subset(mydata, select = c("sulfate"))
  sulf <- sulfate[!is.na(sulfate)]
  y <- mean(sulf)
  print(y)
}
This is not tested, but the steps are as follows. Note also that this kind of question is asked over and over again (e.g. here). Try searching for "work on multiple files", "batch processing", "import many files", or something akin to this.
lx <- list.files(pattern = ".csv", full.names = TRUE)
# gives you a list of file paths
xy <- sapply(lx, FUN = function(x) {
  out <- read.csv(x)
  out <- out[, "sulfate", drop = FALSE]  # do not drop to vector just for fun
  out <- out[!is.na(out[, "sulfate"]), , drop = FALSE]  # keep only non-missing rows
  out
}, simplify = FALSE)

xy <- do.call(rbind, xy)  # combine the result for all files into one big data.frame
mean(xy[, "sulfate"])  # calculate the mean
# or
summary(xy)
If you are short on RAM, this can be optimized a bit.
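One way that optimization could look, as a sketch (the column name follows the assignment's layout; a hypothetical helper, not tested against the real data): instead of rbind-ing all 332 files, keep a running sum and count of non-missing values, so only one file is in memory at a time.

```r
# Low-memory mean: accumulate a running total and count per file
# rather than holding all files in one big data.frame.
mean_sulfate_low_mem <- function(files) {
  total <- 0
  n     <- 0
  for (f in files) {
    x <- read.csv(f)[["sulfate"]]
    total <- total + sum(x, na.rm = TRUE)  # add this file's non-missing values
    n     <- n + sum(!is.na(x))            # and count how many there were
  }
  total / n
}
```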
Thank you for your help. I have sorted it out. The key was to use full.names = TRUE in list.files and rbind(mydata, ... ), as otherwise it reads the files one by one and does not append them after each other, which is my aim.
See below. I am not sure it is the most "R" solution, but it works.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  mydata <- data.frame()
  for (i in id) {
    mydata <- rbind(mydata, read.csv(files_list[i]))
  }
  if (pollutant %in% "sulfate") {
    mean(mydata$sulfate, na.rm = TRUE)
  } else if (pollutant %in% "nitrate") {
    mean(mydata$nitrate, na.rm = TRUE)
  } else {
    "wrong pollutant"
  }
}
I have a number of CSV files, and my goal is to find the number of complete cases for a file or set of files given by the id argument. My function should return a data frame with a column id specifying the file and a column nobs giving the number of complete cases for that id. However, my function overwrites the previous value of nobs on each loop iteration, and the resulting data frame contains only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction <- function(id = 1:20) {
  files <- list.files(pattern = "*.csv")
  myfiles <- do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
  for (i in id) {
    good <- complete.cases(myfiles)
    newframe <- myfiles[good, ]
    cases <- newframe[newframe$ID %in% i, ]
    nobs <- nrow(cases)
  }
  clean <- data.frame(id, nobs)
  clean
}
Thanks.
We can do it all inside lapply(), something like below (not tested):
myfunction <- function(id = 1:20) {
  files <- list.files(pattern = "*.csv")[id]
  do.call(rbind,
          lapply(files, function(x) {
            df <- read.csv(x, stringsAsFactors = FALSE)
            df <- df[complete.cases(df), ]
            data.frame(ID = x, nobs = nrow(df))
          }))
}