I'm taking an introductory R programming course on Coursera. The first assignment has us evaluating hundreds of csv files in a specified directory ("./specdata/"). Each csv file, in turn, contains hundreds of records of atmospheric pollutant samples - a date, a sulfate sample, a nitrate sample, and an ID that identifies the sampling location.
The assignment asks us to create a function that takes the pollutant and an id or range of ids for the sampling location, and returns the sample mean given the supplied arguments.
My code (below) uses a for loop driven by the id argument to read only the files of interest (this seems more efficient than reading in all 322 files before doing any processing). That works great.
Within the loop, I assign the contents of the csv file to a variable, make that variable a data frame, and use na.omit to remove the missing values. Then I use rbind to append the result of each iteration of the loop to the variable. When I print the data frame variable within the loop, I can see the entire full list, subgrouped by id. But when I print the variable outside the loop, I only see the last element in the id vector.
I would like to create a consolidated list of all records matching the id argument within the loop, then pass the consolidated list outside the loop for further processing. I can't get this to work. My code is shown below.
Is this the wrong approach? Seems like it could work. Any help would be most appreciated. I searched StackOverflow and couldn't find anything that quite addresses what I'm trying to do.
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
  x <- list.files(path = directory, pattern = "*.csv")
  x <- paste(directory, x, sep = "")
  id1 <- id[1]
  id2 <- id[length(id)]
  for (i in id1:id2) {
    df <- read.csv(x[i], header = TRUE)
    df <- data.frame(df)
    df <- na.omit(df)
    df <- rbind(df)
    print(df)
  }
  # would like a consolidated list of records here to do more stuff,
  # e.g. filter on pollutant and calculate the mean
}
You can just define the data frame outside the for loop and append to it. You can also skip some steps in between... There are more ways to improve here... :-)
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
  x <- list.files(path = directory, pattern = "*.csv")
  x <- paste(directory, x, sep = "")
  df_final <- data.frame()
  for (i in id) {
    df <- read.csv(x[i], header = TRUE)  # read.csv already returns a data frame
    df <- na.omit(df)
    df_final <- rbind(df_final, df)      # append this file's rows to the running result
    print(df)
  }
  # consolidated records are now in df_final; filter on pollutant and calculate the mean here
  return(df_final)
}
By only calling df <- rbind(df) you are effectively overwriting df every time. You can fix this by doing something like this:
df <- data.frame()         # empty data frame
for (i in 1:10) {          # for all your csv files
  x <- mean(rnorm(10))     # some new information
  df <- rbind(df, x)       # bind old data frame and new value
}
By the way, if you know how big df will be beforehand, then growing it with rbind like this is not the proper way to do it; preallocate instead.
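A minimal sketch of the preallocation alternative (toy values standing in for the real per-file results):

```r
n <- 10                        # number of results, known in advance
results <- numeric(n)          # preallocate instead of growing
for (i in seq_len(n)) {
  results[i] <- mean(rnorm(10))   # fill in place
}

# or, for data frames: collect the pieces in a list and bind once at the end
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- data.frame(id = i, value = mean(rnorm(10)))
}
df <- do.call(rbind, pieces)   # one rbind instead of n
```

Both versions avoid copying the whole accumulated object on every iteration.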
I have a list of dataframes with varying dimensions filled with data and row/col names of Countries. I also have a "master" dataframe outside of this list that is blank with square dimensions of 189x189.
I wish to merge each dataframe inside the list individually on top of the "master" sheet, preserving the square matrix dimensions. I have been able to achieve this for an individual dataframe using this code:
rownames(Trade) <- Trade$X
Trade <- Trade[, 2:length(Trade)]
Full[row.names(Trade), colnames(Trade)] <- Trade
With "Full" being my master sheet and "Trade" being an individual df.
I have attempted to create a function to apply this process to a list of dataframes but am unable to properly do this.
Function and code in question:
DataMerge <- function(df) {
  rownames(df) <- df$Country
  Trade <- Trade[, 2:length(Trade)]
  Country[row.names(df), colnames(df)] <- df
}
Applied using:
DataMergeDF <- lapply(TradeMatrixDF, DataMerge)
filenames <- paste0("Merged",names(DataMergeDF), ".csv")
mapply(write.csv, DataMergeDF, filenames)
Country <- read.csv("FullCountry.csv")
However what ends up happening is that the data does not end up merging properly / the dimensions are not preserved.
I asked a question pertaining to this issue a few days ago (CSV generated from not matching to what I have in R) , but I have a suspicion that I am running into this issue due to my use of "lapply". However, I am not 100% sure.
If we return the 'Country' at the end, it should work. Also, it is better to pass the other data as an argument:
DataMerge <- function(Country, df) {
  rownames(df) <- df$Country
  df <- df[, 2:length(df)]
  Country[row.names(df), colnames(df)] <- df
  Country
}
Then we call the function as:
DataMergeDF <- lapply(TradeMatrixDF, DataMerge, Country = Country)
My goal is to create a function that reads specified .csv files (all of which have the same format) from the working directory, binds them into one data frame, and then returns the mean of a specified column ("nitrate" or "sulfate") of that data frame. The current problem is that no matter how many files I choose to read or how many rows the mean is calculated on, the function always returns 0. I'm not quite sure how to fix this; any help appreciated.
pollutantmean <- function(pollutant, id = 1:332,
                          directory = "/Users/marsh/datasciencecoursera/specdata/") {
  setwd(directory)
  list <- list.files()
  df <- data.frame()
  for (i in id) {
    x <- read.csv(list[i])
    df <- rbind(df, x)
  }
  mean(!is.na(df["pollutant",]))
}
If you want the mean and there are NA's present in your data, use: mean(df[[pollutant]], na.rm = TRUE)
You are calculating the proportion of values that are not NA. If you get 0 back, it means that you have only NA's, so maybe there is something else wrong as well. Maybe you can use dput() on the dataframe so we can have a look.
The syntax is wrong. It should be
mean(!is.na(df[[pollutant]]))
We don't need quotes around pollutant; it should be the same as the input argument. Secondly, the trailing , implies that we are selecting by row names, as the general format for indexing is [row, column], where either can be numeric or a character string. In this case, we need to compute over a specific column, so [[ will extract the column. Alternatively, we can do
mean(!is.na(df[,pollutant]))
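A small illustration of the difference, using a made-up two-column data frame:

```r
df <- data.frame(sulfate = c(1, NA, 3), nitrate = c(NA, 2, 4))
pollutant <- "sulfate"

df["pollutant", ]              # row lookup: no row is named "pollutant", so all NA
df[[pollutant]]                # column lookup via the variable's value: 1 NA 3
mean(!is.na(df[[pollutant]]))  # proportion of non-missing values: 2/3
```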
The whole function should now be
pollutantmean <- function(pollutant, id = 1:332,
                          directory = "/Users/marsh/datasciencecoursera/specdata/") {
  setwd(directory)
  list <- list.files()
  df <- data.frame()
  for (i in id) {
    x <- read.csv(list[i])
    df <- rbind(df, x)
  }
  mean(!is.na(df[[pollutant]]))
}
This can also be optimized using data.table
library(data.table)
pollutantmean <- function(pollutant, id = 1:332,
                          directory = "/Users/marsh/datasciencecoursera/specdata/") {
  setwd(directory)
  lst <- list.files()
  df <- rbindlist(lapply(lst, fread))
  mean(!is.na(df[[pollutant]]))
}
Just a guess, because there is no data to confirm this, but it looks like you are asking for the mean of the rows labeled pollutant, not the columns.
Typically a variable is saved in a column and individual observations are saved in rows. So moving that comma will help get the right data into your calculation, giving you all rows (observations) of the pollutant column.
# how the data frame is indexed: df[rows, columns]
By indexing the way that you did, you got all of the observations that do not have an NA in that row, but you took the mean of the entire data frame.
pollutantmean <- function(pollutant, id = 1:332,
                          directory = "/Users/marsh/datasciencecoursera/specdata/") {
  setwd(directory)
  list <- list.files()
  df <- data.frame()
  for (i in id) {
    x <- read.csv(list[i])
    df <- rbind(df, x)
  }
  mean(df[, pollutant], na.rm = TRUE)  # note the argument is na.rm, not rm.na
}
This takes the mean of all observations in column pollutant of data frame df that are not NA, which should give you what you want.
All the above answers helped me fix it.
mean(df[[pollutant]], na.rm = TRUE)
ended up returning the correct answers. Thanks!
I want to create a function which loops through a large number of files, calculates the number of complete cases for each file and then appends a new row to an existing data frame with the "ID" number of the file and its corresponding number of complete cases.
Below I have created code which only returns the last row of the data frame. I believe my function only returns the last row because R overwrites my data frame in every loop, but I am not sure. I have done a lot of research online on how to solve this, but I could not find an easy solution (I am very, very new to R).
Below you can see my code and the output I get:
complete <- function(directory = "specdata", id = 1:332) {
  files_list <- list.files("specdata", full.names = T)  # creates a list of files
  dat <- data.frame()                                   # creates an empty data frame
  for (i in id) {
    data <- read.csv(files_list[i])         # reads file "i" in the id vector
    nobs <- sum(complete.cases(data))       # counts the number of complete cases in that file
    data_frame <- data.frame("ID" = i, nobs)  # here I want to store the count in a data frame
    output <- rbind(dat, data_frame)        # here the data_frame should be added to an existing data frame
  }
  print(output)
}
When I run complete( , 3:5), I get the following result:
ID nobs
1 5 402
Thanks for your help! :)
As Maxim.K said, there are better ways to do this but the actual problem here is that your output variable gets overwritten at each iteration in the for loop.
Try:
dat <- rbind(dat, data_frame)
and print dat.
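Putting that fix into the whole function, a sketch (same logic as the original, with the accumulator corrected and the directory argument actually used):

```r
complete <- function(directory = "specdata", id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in id) {
    data <- read.csv(files_list[i])
    nobs <- sum(complete.cases(data))
    dat <- rbind(dat, data.frame(ID = i, nobs = nobs))  # accumulate, don't overwrite
  }
  dat
}
```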
Instead of for (i in id) {, try for (i in 1:332) { or for (i in 1:length(id)) { at the beginning of your loop
The below is driving me a little crazy and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
  df <- read.xlsx(file, 1, startRow = 2, header = TRUE)
  # *perform calcs*
  universe_list[[count]] <- df
  count <- count + 1
}
I now have a problem where some of the new operations I want to perform would involve data from two or more excel files. So for example, I would need to import the Jan-16 and the Jan-15 excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be of fixed length apart (like one year etc)
I can't seem to figure out the code for how to do this. From a process perspective, I'm thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for steps 1-5!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is actually reminiscent of for loop operations in other languages; R has many vectorized approaches to iterate over lists. Below assumes both the 15 and 16 year lists of files are the same length, with corresponding months in both, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "-15\\.xls$")
files16list <- list.files(path, pattern = "-16\\.xls$")
dfprocess <- function(x, y) {
  df1 <- read.xlsx(x, 1, startRow = 2, header = TRUE)
  names(df1) <- paste0(names(df1), "1")  # SUFFIX COLS WITH 1
  df2 <- read.xlsx(y, 1, startRow = 2, header = TRUE)
  names(df2) <- paste0(names(df2), "2")  # SUFFIX COLS WITH 2
  df <- cbind(df1, df2)                  # CBIND DFs
  # ... perform calcs ...
  return(df)
}

wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
                    function(i) wide_list[, i])  # ALTERNATE OUTPUT
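One wrinkle: mapply() simplifies its result by default, which is why the wide matrix then needs reshaping. Passing SIMPLIFY = FALSE returns a plain list of data frames directly (toy vectors below stand in for the two file lists):

```r
# stand-ins for the two file lists; real code would pass file paths
xs <- c("a", "b", "c")
ys <- c("A", "B", "C")

paired <- mapply(function(x, y) data.frame(v1 = x, v2 = y),
                 xs, ys, SIMPLIFY = FALSE)  # keep a list, one data frame per pair
length(paired)  # 3
```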
First sort your filelist so that the two files on which you want to do your calculations are consecutive. After that, try this:
for (count in seq(1, length(filelist), 2)) {
  df  <- read.xlsx(filelist[count], 1, startRow = 2, header = TRUE)
  df1 <- read.xlsx(filelist[count + 1], 1, startRow = 2, header = TRUE)
  # change column names and apply merge or append depending on requirement
  # perform calcs
  # save
}
I am trying to read over 200 CSV files, each with multiple rows and columns of numbers. It makes most sense to read each one as a separate data frame.
Ideally, I'd like to give them meaningful names. So the data frames for store 1 would be named store.1.room.1, store.1.room.2, and so on, all the way up to store.100.room.1, store.100.room.2, etc.
I can read each file into a specified data frame. For example:
store.1.room.1 <- read.csv(filepath,...)
But how do I create a dynamically created data frame name using a For loop?
For example:
for (i in 1:100) {
  for (j in 1:2) {
    store.i.room.j <- read.csv(filepath...)
  }
}
Alternatively, is there another approach that I should consider instead of having each csv file as a separate data frame?
Thanks
You can create your dataframes using read.csv as you have above, but store them in a list. Then give names to each item (i.e. dataframe) in the list:
# initialize an empty list
my_list <- list()

for (i in 1:100) {
  for (j in 1:2) {
    df <- read.csv(filename...)
    df_name <- paste("store", i, "room", j, sep = ".")  # e.g. "store.1.room.1"
    my_list[[df_name]] <- df
  }
}

# now you can access any data frame you wish by using my_list$store.i.room.j,
# e.g. my_list$store.1.room.2 or my_list[["store.1.room.2"]]
I'm not sure whether I am answering your question, but you would never want to store those CSV files in separate data frames. What I would do in your case is this:
set <- data.frame()
for (i in 1:100) {
  ## calculate filename here
  current.csv <- read.csv(filename)
  current.csv <- cbind(current.csv, index = i)
  set <- rbind(set, current.csv)
}
The additional column identifies which csv file each measurement came from.
EDIT:
This makes it easy to apply tapply to particular vectors of your data.frame. Also, in case you'd like to keep the measurements of only one csv (say, the one indexed by 5), you can enter
single.data.frame <- set[set$index == 5, ]
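For instance, with the index column in place, per-file summaries come straight out of tapply() (toy data with a hypothetical value column standing in for a real measurement):

```r
set <- data.frame(index = rep(1:3, each = 2),
                  value = c(1, 3, 5, 7, 9, 11))

tapply(set$value, set$index, mean)  # mean of value within each csv index
#  1  2  3
#  2  6 10
```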