My goal is to create a function that reads specified .csv files (all of which have the same format) from the working directory, binds them into one data frame, and then returns the mean of a specified column ("nitrate" or "sulfate") of that data frame. The current problem is that every time I call the function, no matter how many files I choose to read or how many rows the mean is calculated on, the function always returns 0. I'm not quite sure how to fix this; any help is appreciated.
pollutantmean <- function(pollutant, id = 1:332, directory =
"/Users/marsh/datasciencecoursera/specdata/") {
setwd(directory)
list <- list.files()
df <- data.frame()
for(i in id) {
x <- read.csv(list[i])
df <- rbind(df,x)
}
mean(!is.na(df["pollutant",]))
}
If you want the mean and there are NAs present in your data, use: mean(df[[pollutant]], na.rm = TRUE)
As written, you are calculating the proportion of values that are not NA. If you get 0 back, it means everything you selected is NA. So maybe there is something else wrong as well. Maybe you can use dput() on a data frame so we can have a look.
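To make the difference concrete, here is a small sketch on a made-up data frame (the values and the variable names bad, prop_bad, prop_col and col_mean are purely illustrative, not from the question). It shows why the original call returns 0, and how the proportion of non-missing values differs from the mean with NAs dropped:

```r
# Toy data frame standing in for the combined monitor data (values are made up).
df <- data.frame(sulfate = c(1, NA, 3), nitrate = c(NA, 4, 6))
pollutant <- "nitrate"

# Row lookup by a name that does not exist: every cell comes back NA.
bad <- df["pollutant", ]

# Proportion of non-missing cells in that all-NA row.
prop_bad <- mean(!is.na(bad))

# Proportion of non-missing values in the real column (this is not its mean).
prop_col <- mean(!is.na(df[[pollutant]]))

# Mean of the column values with the NAs dropped.
col_mean <- mean(df[[pollutant]], na.rm = TRUE)

print(c(prop_bad = prop_bad, prop_col = prop_col, col_mean = col_mean))
```

Here prop_bad is 0 because the selected row contains only NAs, which matches the symptom described in the question.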
The indexing is wrong. It should be
mean(df[[pollutant]], na.rm = TRUE)
We don't need quotes around pollutant: it is the function argument that holds the column name. Secondly, the position of the comma in df["pollutant", ] means we are selecting by row name, because the general format for indexing is [row, column], where either index can be numeric or a character string. No row is named "pollutant", so the result is a row of NAs. Finally, mean(!is.na(...)) gives the proportion of non-missing values rather than the mean of the values; na.rm = TRUE drops the NAs instead. [[ extracts the column as a vector, or we can do
mean(df[, pollutant], na.rm = TRUE)
The whole function should now be
pollutantmean <- function(pollutant, id = 1:332, directory =
"/Users/marsh/datasciencecoursera/specdata/") {
setwd(directory)
list <- list.files()
df <- data.frame()
for(i in id) {
x <- read.csv(list[i])
df <- rbind(df,x)
}
mean(df[[pollutant]], na.rm = TRUE)
}
This can also be optimized using data.table
library(data.table)
pollutantmean <- function(pollutant, id = 1:332, directory =
"/Users/marsh/datasciencecoursera/specdata/") {
setwd(directory)
lst <- list.files()
df <- rbindlist(lapply(lst[id], fread)) # read only the requested files
mean(df[[pollutant]], na.rm = TRUE)
}
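For comparison, a base R sketch of the same idea that avoids setwd() and stacks the files in one step. The default directory "specdata", the temporary demo directory and the demo values are assumptions made for illustration only:

```r
# Sketch: read only the requested files and bind them once, without setwd().
# The default directory name is an assumption, not taken from the question.
pollutantmean <- function(pollutant, id = 1:332, directory = "specdata") {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # files[id] picks the files at the requested positions (sorted by name).
  df <- do.call(rbind, lapply(files[id], read.csv))
  mean(df[[pollutant]], na.rm = TRUE)
}

# Made-up demo data written to a temporary directory.
demo_dir <- file.path(tempdir(), "pollutantmean_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 4), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, 5), nitrate = c(NA, 6), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_mean <- pollutantmean("nitrate", id = 1:2, directory = demo_dir)
print(demo_mean)
```

Using full.names = TRUE keeps the working directory untouched, so the function has no side effects on the caller's session.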
Just a guess, because there is no data to confirm this, but it looks like you are asking for the mean of the row labeled "pollutant", not the column.
Typically a variable is stored in a column and individual observations are stored in rows. So moving the comma (and dropping the quotes, so that the value of the argument is used) will get the right data into your calculation: all rows (observations) of the column named by pollutant.
# how a data frame is indexed: df[rows, columns]
Written the way you had it, df["pollutant", ] looks for a row named "pollutant". No such row exists, so you get a row of NAs, and mean(!is.na(...)) of that row is 0.
pollutantmean <- function(pollutant, id = 1:332, directory =
"/Users/marsh/datasciencecoursera/specdata/") {
setwd(directory)
list <- list.files()
df <- data.frame()
for(i in id) {
x <- read.csv(list[i])
df <- rbind(df,x)
}
mean(df[,pollutant], na.rm=TRUE)
}
This says: take the mean of all observations in column pollutant of the data frame df, ignoring NA values. This should give you what you want.
All the above answers helped me fix it.
mean(df[[pollutant]], na.rm = TRUE)
ended up returning the correct answers. Thanks!
I have a list of data frames with varying dimensions, filled with data and with countries as row/column names. I also have a blank "master" data frame outside of this list with square dimensions of 189x189.
I wish to merge each data frame inside the list individually on top of the "master" sheet, preserving the square matrix dimensions. I have been able to achieve this individually using this code:
rownames(Trade) <- Trade$X
Trade <- Trade[, 2:length(Trade)]
Full[row.names(Trade), colnames(Trade)] <- Trade
With "Full" being my master sheet and "Trade" being an individual df.
I have attempted to create a function to apply this process to a list of dataframes but am unable to properly do this.
Function and code in question:
DataMerge <- function(df) {
rownames(df) <- df$Country
Trade <- Trade[, 2:length(Trade)]
Country[row.names(df), colnames(df)] <- df
}
Applied using :
DataMergeDF <- lapply(TradeMatrixDF, DataMerge)
filenames <- paste0("Merged",names(DataMergeDF), ".csv")
mapply(write.csv, DataMergeDF, filenames)
Country <- read.csv("FullCountry.csv")
However what ends up happening is that the data does not end up merging properly / the dimensions are not preserved.
I asked a question pertaining to this issue a few days ago (CSV generated from not matching to what I have in R) , but I have a suspicion that I am running into this issue due to my use of "lapply". However, I am not 100% sure.
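If we return 'Country' at the end it should work: as written, the last expression in the function is an assignment, so lapply gets back the assigned value rather than the updated master data frame. The second line should also subset df, not Trade. Also, it is better to pass the other data as an argument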
DataMerge <- function(Country, df) {
rownames(df) <- df$Country
df <- df[, 2:length(df)]
Country[row.names(df), colnames(df)] <- df
Country
}
then, we call the function as
DataMergeDF <- lapply(TradeMatrixDF, DataMerge, Country = Country)
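As a quick check of this pattern, here is a sketch on made-up data. The three country codes, the 3x3 master and the two small trade tables are hypothetical stand-ins for the 189x189 sheet and the real list:

```r
# DataMerge as given above.
DataMerge <- function(Country, df) {
  rownames(df) <- df$Country
  df <- df[, 2:length(df)]
  Country[row.names(df), colnames(df)] <- df
  Country
}

# Hypothetical blank master sheet with matching row and column names.
codes <- c("AAA", "BBB", "CCC")
master <- as.data.frame(matrix(NA_real_, 3, 3, dimnames = list(codes, codes)))

# Hypothetical list of trade tables with different dimensions.
trade_list <- list(
  y1 = data.frame(Country = c("AAA", "BBB"), AAA = c(0, 1), BBB = c(2, 0)),
  y2 = data.frame(Country = "CCC", AAA = 5, CCC = 0)
)

# Each element is merged onto its own copy of the master sheet.
merged <- lapply(trade_list, DataMerge, Country = master)
print(merged$y1)
```

Every element of merged keeps the square dimensions, and master itself is unchanged, because the function works on a copy and returns it.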
I want to create a function which loops through a large number of files, calculates the number of complete cases for each file and then appends a new row to an existing data frame with the "ID" number of the file and its corresponding number of complete cases.
Below is my code, which only returns the last row of the data frame. I believe my function only returns the last row because R overwrites my data frame on every iteration, but I am not sure. I have done a lot of research online on how to solve this, but I could not find an easy solution (I am very, very new to R).
Below you can see my code and the output I get:
complete <- function(directory = "specdata", id = 1:332) {
files_list <- list.files("specdata", full.names = T) # creates a list of files
dat <- data.frame() # creates an empty data frame
for (i in id) {
data <- read.csv(files_list[i]) # reads the file "i" in the id vector
nobs <- sum(complete.cases(data)) # counts the number of complete cases in that file
data_frame <- data.frame("ID" = i, nobs) # here I want to store the number of complete cases in a data frame
output <- rbind(dat, data_frame) # here the data_frame should be added to an existing data frame
}
print(output)
}
When I run complete( , 3:5), I get the following result:
ID nobs
1 5 402
Thanks for your help! :)
As Maxim.K said, there are better ways to do this but the actual problem here is that your output variable gets overwritten at each iteration in the for loop.
Try :
dat <- rbind(dat, data_frame)
and print dat.
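A toy sketch of the difference, with made-up counts in place of the real files (the names overwritten and new_row are illustrative):

```r
# Overwriting: the empty 'dat' is never updated, so only the last row survives.
dat <- data.frame()
for (i in 3:5) {
  new_row <- data.frame(ID = i, nobs = i * 10)  # made-up counts
  overwritten <- rbind(dat, new_row)
}

# Accumulating: assign the result back to 'dat' on every iteration.
dat <- data.frame()
for (i in 3:5) {
  new_row <- data.frame(ID = i, nobs = i * 10)
  dat <- rbind(dat, new_row)
}

print(nrow(overwritten))
print(dat)
```

The first loop ends with a single row (ID 5), the second with one row per id.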
Instead of for (i in id) {, try for (i in 1:332) { or for (i in 1:length(id)) { at the beginning of your loop
I'm taking an introductory R programming course on Coursera. The first assignment has us evaluating a list of hundreds of csv files in a specified directory ("./specdata/"). Each csv file, in turn, contains hundreds of records of sample pollutant data in the atmosphere: a date, a sulfate sample, a nitrate sample, and an ID that identifies the sampling location.
The assignment asks us to create a function that takes the pollutant and an id or range of ids for the sampling location and returns a sample mean, given the supplied arguments.
My code (below) uses a for loop with the id argument to read only the files of interest (this seems more efficient than reading in all 322 files before doing any processing). That works great.
Within the loop, I assign the contents of the csv file to a variable. I then make that variable a data frame and use na.omit to remove the rows with missing values. Then I use rbind to append the result of each iteration of the loop to the variable. When I print the data frame variable within the loop, I can see the entire list, subgrouped by id. But when I print the variable outside the loop, I only see the data for the last element in the id vector.
I would like to create a consolidated list of all records matching the id argument within the loop, then pass the consolidated list outside the loop for further processing. I can't get this to work. My code is shown below.
Is this the wrong approach? It seems like it could work. Any help would be most appreciated. I searched StackOverflow and couldn't find anything that quite addresses what I'm trying to do.
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
x <- list.files(path=directory, pattern="*.csv")
x <- paste(directory, x, sep="")
id1 <- id[1]
id2 <- id[length(id)]
for (i in id1:id2) {
df <- read.csv(x[i], header = TRUE)
df <- data.frame(df)
df <- na.omit(df)
df <- rbind(df)
print(df)
}
# would like a consolidated list of records here to do more stuff, e.g. filter on pollutant and calculate mean
}
You can just define the data frame outside the for loop and append to it. Also you can skip some steps in between... There are more ways to improve here... :-)
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
x <- list.files(path=directory, pattern="*.csv")
x <- paste(directory, x, sep="")
df_final <- data.frame()
for (i in id) {
df <- read.csv(x[i], header = TRUE)
df <- data.frame(df)
df <- na.omit(df)
df_final <- rbind(df_final, df)
print(df)
}
# would like a consolidated list of records here to do more stuff, e.g. filter on pollutant and calculate mean
return(df_final)
}
By only calling df <- rbind(df) you are effectively overwriting df every time. You can fix this by doing something like this:
df = data.frame() # empty data frame
for(i in 1:10) { # for all your csv files
x <- mean(rnorm(10)) # some new information
df <- rbind(df, x) # bind old dataframe and new value
}
By the way, if you know how big df will be beforehand, growing it with rbind inside the loop is not the proper way to do it; preallocate it instead.
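A minimal sketch of the preallocated version of the same toy loop, assuming the number of iterations n is known in advance (the names res, run, value and vals are illustrative):

```r
n <- 10
set.seed(1)  # only so the toy numbers are reproducible

# Allocate all rows up front, then fill them in place.
res <- data.frame(run = seq_len(n), value = numeric(n))
for (i in seq_len(n)) {
  res$value[i] <- mean(rnorm(10))  # some new information
}

# The same thing without an explicit loop.
vals <- vapply(seq_len(n), function(i) mean(rnorm(10)), numeric(1))

print(dim(res))
```

Filling a preallocated object avoids copying the whole data frame on every iteration, which is what repeated rbind calls do.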
The function below goes through several CSV files and returns a data frame with file names and the number of complete rows (no missing values) in each file. Although I assign column names to complete_rows in the beginning (id and nobs), the data frame that gets returned doesn't have the same names. Why does this happen?
complete <- function(directory, id = 1:332) {
#navigate to directory
setwd(directory)
#keep track of row name and number of completed rows
complete_rows <- data.frame(id=numeric(0), nobs=numeric(0))
#csv names
myfiles <- list.files(pattern = "csv")
#loop through files
for(i in id) {
#read each file
current_dataset <- read.csv(myfiles[i])
#include only files with complete datasets
good_rows <- current_dataset[complete.cases(current_dataset),]
#push id and number of good rows to data frame
complete_rows <- rbind(complete_rows, c(i, nrow(good_rows)))
#increment loop
i <- i + 1
}
#return data frame
complete_rows
}
Use rbind on two data.frames with identical names:
complete_rows <- rbind(complete_rows, data.frame(id=i, nobs=nrow(good_rows)))
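A small sketch with made-up counts showing that the column names survive when both arguments to rbind are data frames with the same names:

```r
# Start from the empty, named data frame used in the question.
complete_rows <- data.frame(id = numeric(0), nobs = numeric(0))

for (i in 1:3) {
  # Made-up counts stand in for nrow(good_rows).
  complete_rows <- rbind(complete_rows, data.frame(id = i, nobs = i * 100))
}

print(names(complete_rows))
print(complete_rows)
```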
Your code is not very idiomatic to R as beginneR has covered.
I'm not sure why exactly you are experiencing that behavior, but I would propose some adjustments to your code as follows:
complete <- function(directory, id = 1:332) {
#navigate to directory
setwd(directory)
#keep track of row name and number of completed rows
complete_rows <- data.frame(id=numeric(length(id)), nobs=numeric(length(id)))
#csv names
myfiles <- list.files(pattern = "csv")
#loop through files
for(i in seq_along(id)) { # i is the row position, id[i] the file number
#read each file
current_dataset <- read.csv(myfiles[id[i]])
# write id
complete_rows$id[i] <- id[i]
# write nobs
complete_rows$nobs[i] <- sum(complete.cases(current_dataset))
}
#return data frame
return(complete_rows)
}
If you only want the id and number of observations, you don't need rbind. To return something from a function you either call return() or simply end with the value, since a function returns its last evaluated expression. You can initialize complete_rows with the number of rows you need, since you already know that in advance. You also don't need to manually increment i in your for loop; for already advances it on each iteration.
Does this work for you?
Edit/note:
It would probably be even better to read all files at once into a list and then operate on them.
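One way that idea might look, as a sketch: the files are read into a list with lapply and the counts are computed from the list. The temporary directory and its three small files are made up for the demonstration:

```r
# Sketch: read the requested files into a list, then count complete rows.
complete <- function(directory, id = 1:332) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  datasets <- lapply(files[id], read.csv)
  nobs <- vapply(datasets, function(d) sum(complete.cases(d)), integer(1))
  data.frame(id = id, nobs = nobs)
}

# Made-up demo files in a temporary directory.
demo_dir <- file.path(tempdir(), "complete_demo_list")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 3), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(NA, NA), nitrate = c(1, 2), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, 2), nitrate = c(3, 4), ID = 3),
          file.path(demo_dir, "003.csv"), row.names = FALSE)

res <- complete(demo_dir, id = 2:3)
print(res)
```

Because id is used directly for the first column, the result has one row per requested file, whatever positions are asked for.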
Here's the directory with CSV files I'm using:
https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
here's my code:
complete<- function(directory, id=1:332){
data<-NULL
for (i in 1:length(id)) {
data[[i]]<- c(paste(directory, "/",formatC(id[i], width=3, flag=0),".csv",sep=""))
}
cases<-NULL
for (d in 1:length(data)) {
cases[[d]]<-c(read.csv(data[d]))
}
df<-NULL
for (c in 1:length(cases)){
df[[c]]<-(data.frame(cases[c]))
}
dt<-do.call(rbind, df)
ok<-(complete.cases(dt))
finally<-as.data.frame(table(dt[ok, "ID"]))
colnames(finally)<-c('id', 'nobs')
replace(finally,is.na(finally),0)
return(finally)
}
when I enter:
complete('specdata')
I get a data frame with the number of complete cases in each csv file but the CSV files with no complete cases are omitted completely. I need the csv files with 0 complete cases to show up in the data frame with a nobs value of 0. I tried using replace in the code but it doesn't seem to change my data frame at all.
Given a data frame df, complete.cases(df) returns a vector of true or false values. You can use this vector as an index of df to extract a subset of it that has complete cases, like this:
df[complete.cases(df),]
The number of complete cases, or nobs value as you write in your text, is the number of rows in this resulting smaller data frame. You can use the nrow function to get that count:
nrow(df[complete.cases(df),])
This will return 0 for a data frame that has no complete cases.
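A short sketch on made-up data frames, showing both the nrow form above and the equivalent sum over the logical vector (the names ok, n_rows, n_sum, no_complete and n_zero are illustrative):

```r
# Made-up data: only the first row has no missing values.
df <- data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 5, NA))

ok <- complete.cases(df)   # one logical value per row
n_rows <- nrow(df[ok, ])   # count via the subset
n_sum <- sum(ok)           # same count, since TRUE counts as 1

# A data frame in which every row has a missing value.
no_complete <- data.frame(sulfate = c(NA, 1), nitrate = c(2, NA))
n_zero <- nrow(no_complete[complete.cases(no_complete), ])

print(c(n_rows = n_rows, n_sum = n_sum, n_zero = n_zero))
```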
To solve the exercise, you need to build a data frame with two vectors: id and nobs, where nobs is the number of complete cases of the data frame indicated by the corresponding id. For getting the nobs value from the id, it makes sense to introduce a helper function:
get.nobs <- function(id) {
df <- getmonitor(id, directory)
nrow(df[complete.cases(df),])
}
getmonitor is a function to read the data frame from a csv file. After you have the data frame for this id, you can return the row count of the complete cases in it.
You can use this function to get the count for each id. Instead of a loop, this is a perfect use case for sapply.
Putting it all together (spoiler alert!):
complete <- function(directory, id = 1:332) {
get.nobs <- function(id) {
df <- getmonitor(id, directory)
nrow(df[complete.cases(df),])
}
data.frame(id, nobs=sapply(id, get.nobs))
}
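getmonitor is not defined in this thread, so to try the function out, here is a hypothetical stand-in that assumes the files are named with zero-padded ids (001.csv, 002.csv, ...), as the formatC call in the question suggests. The demo directory and its contents are made up:

```r
# Hypothetical stand-in for getmonitor: reads "<directory>/<3-digit id>.csv".
getmonitor <- function(id, directory) {
  read.csv(file.path(directory, sprintf("%03d.csv", id)))
}

complete <- function(directory, id = 1:332) {
  get.nobs <- function(id) {
    df <- getmonitor(id, directory)
    nrow(df[complete.cases(df), ])
  }
  data.frame(id, nobs = sapply(id, get.nobs))
}

# Made-up demo files: 001.csv has one complete row, 002.csv has none.
demo_dir <- file.path(tempdir(), "complete_demo_sapply")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 3), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(NA, NA), nitrate = c(1, 2), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

res <- complete(demo_dir, id = 1:2)
print(res)
```

Since the count is computed per file rather than by tabulating the surviving rows, a file with no complete cases still appears, with nobs equal to 0.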