Loop Aggregate with Weighted Mean in R

Apologies in advance for the wording; English is not my native language and this is my first post. I have been able to aggregate my data to this point, but I am having trouble condensing it further. I am trying to get the weighted average depth by biomass for several species.
My data currently has columns (station, time, layer, depth, biomass_X, biomass_Y, biomass_Z, ...) and I want to condense it to (station, time, weighted_depth_X, weighted_depth_Y, weighted_depth_Z, ...).
I got this code to work, but is there a way to loop it so that it runs over all of my columns?
library(plyr)
newData <- ddply(data, ~station + time, summarize, weighted.mean(data[, 6], w = depth))

There is certainly a nicer way, but this should work:
# data: data frame containing the columns to be averaged
# weights: vector containing the corresponding weights
weighted_mean_all_cols <- function(data, weights) {
  res <- do.call(cbind, llply(colnames(data), function(col) {
    weighted.mean(data[, col], w = weights)
  }))
  colnames(res) <- colnames(data)
  res
}
# collect the indices of the target columns to average
targetCols <- grep("^biomass", colnames(data))
# apply the weighted average by group, for every target column
newData <- ddply(data, c("station", "time"), function(groupDF) {
  weighted_mean_all_cols(groupDF[, targetCols], groupDF$depth)
})
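As a side note, the same result can be had without writing the helper at all. A minimal dplyr sketch (not part of the original plyr answer; assumes dplyr >= 1.0 for across(), and the same biomass_* and depth columns):
library(dplyr)
# weighted mean of every biomass column, per station/time group
newData <- data %>%
  group_by(station, time) %>%
  summarise(across(starts_with("biomass"),
                   ~ weighted.mean(.x, w = depth)),
            .groups = "drop")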

Related

Running Pearson correlation test on multiple groups in a data frame

I would like to calculate the Pearson correlation for multiple groups of a data frame, by year, in R. There is a column in the data frame with years 1962-2007, and I would like to run the Pearson correlation between CO2 emissions and GDP (two other columns in the data frame) for each year. Thank you for any advice!
Something like this, I think, can get you started:
library(dplyr)
library(corrr)
str(dataset)  # use this to verify that you are only getting numeric or integer values
dataset <- dataset %>%
  select(x, y, z)  # where x, y, z are the names of the numeric or integer columns
options(digits = 2)
correlation_table <- corrr::correlate(dataset, method = "pearson")
correlation_table
All this will get you something called a correlation matrix. This is really just a partial answer, though; you should read the SO documentation on this subject.
You should provide sample data when you post a question so the answer can be tailored to your specific needs. Assuming you have a data frame called df and variables called Year, CO2, and GDP:
df.split <- split(df[, c("CO2", "GDP")], df$Year)
df.corrs <- lapply(df.split, cor)
will return a list called df.corrs with correlation matrices for each year. To get the correlation for the first year only, use df.corrs[[1]] or df.corrs[["1962"]].
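If you only need the single CO2-GDP correlation per year rather than the full matrix, a short sketch building on the same split:
# one correlation value per year, as a named numeric vector
year_corrs <- sapply(df.split, function(d) cor(d$CO2, d$GDP))
year_corrs["1962"]  # the correlation for 1962 only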

Subset DF with for-loop to run each subset through a function

I have a data set of plant demographics from 5 years across 10 sites with a total of 37 transects within the sites. Below is a link to a GoogleDoc with some of the data:
https://docs.google.com/spreadsheets/d/1VT-dDrTwG8wHBNx7eW4BtXH5wqesnIDwKTdK61xsD0U/edit?usp=sharing
In total, I have 101 unique combinations.
I need to subset each unique set of data, so that I can run each through some code. This code will give me one column of output that I need to add back to the original data frame so that I can run LMs on the entire data set. I had hoped to write a for-loop where I could subset each unique combination, run the code on each, and then append the output for each model back onto the original dataset. My attempts at writing a subset loop have all failed to produce even a simple output.
I created a column, "SiteTY", with unique Site, Transect, Year combinations. So "PWR 832015" is site PWR Transect 83 Year 2015. I tried to use that to loop through and fill an empty matrix, as proof of concept.
transect <- unique(dat$SiteTY)
ntrans <- length(transect)
tmpout <- matrix(NA, nrow = ntrans, ncol = 2)
for (i in 1:ntrans) {
  df <- subset(dat, SiteTY == i)
  tmpout[i, ] <- unique(df$SiteTY)
}
When I do this, I notice that df has no observations. If I replace "i" with a known value (like "PWR 832015") and run each line of the for-loop individually, it populates correctly. If I use is.factor() on i or on "PWR 832015", both return FALSE.
This particular code also gives me the error:
Error in `[<-`(`*tmp*`, , i, value = mean(df$Year)) : subscript out of bounds
I can only assume this happens because the data frame is empty.
I've read enough SO posts to know that for-loops are tricky, but over the last 3 years I've tried more iterations than I can remember, to no avail.
Any tips on loops or ways to avoid them while getting the output I need would be appreciated.
Per your needs, "I need to subset each unique set of data, run a function, take the output and calculate a new value," consider two routes:
Using ave if your function expects and returns a single numeric column.
Using by if your function expects a data frame and returns anything.
ave
Returns a grouped inline aggregate column, with the value repeated for every member of the group (see the toy sketch after these examples). Below, with() is used to avoid repeated dat$ references.
# BY SITE GROUPING
dat$New_Column <- with(dat, ave(Numeric_Column, Site, FUN=myfunction))
# BY SITE AND TRANSECT GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, FUN=myfunction))
# BY SITE AND TRANSECT AND YEAR GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, Year, FUN=myfunction))
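For instance, a minimal toy demonstration (hypothetical data, not from the question) of that repeated-per-row behavior:
toy <- data.frame(Site = c("A", "A", "B", "B"),
                  Height = c(10, 20, 30, 50))
# each row receives its group's mean
toy$Site_Mean <- with(toy, ave(Height, Site, FUN = mean))
toy$Site_Mean
# [1] 15 15 40 40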
by
Returns a named list of objects, or whatever your function returns for each grouping. For more than one grouping, tryCatch is used because by() builds every possible combination of groups, some of which can be empty data frames on which your myfunction may raise an error.
# BY SITE GROUPING
obj_list <- by(dat, dat$Site, function(sub) {
  myfunction(sub)  # RUN ANY OPERATION ON sub DATA FRAME
})

# BY SITE AND TRANSECT GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
  tryCatch(myfunction(sub),
           error = function(e) NULL)
})

# BY SITE AND TRANSECT AND YEAR GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect", "Year")], function(sub) {
  tryCatch(myfunction(sub),
           error = function(e) NULL)
})

# FILTER OUT ALL NULLs (I.E., ZERO-LENGTH RESULTS)
obj_list <- Filter(length, obj_list)

# BUILD A SINGLE OUTPUT IF MATRIX OR DATA FRAME
final_obj <- do.call(rbind, obj_list)
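Since your end goal is to add the output back onto the original data frame, the combined result can then be merged back on. A minimal sketch, assuming myfunction keeps the grouping columns in its output so final_obj contains Site, Transect, and Year alongside the new value:
# attach the per-group results back onto the original rows
dat_final <- merge(dat, final_obj, by = c("Site", "Transect", "Year"))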
Here's another approach using the dplyr library, in which I'm creating a data.frame of summary statistics for each group and then just joining it back on:
library(dplyr)
# Group by species (site, transect, etc) and summarise
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            mean.Sepal.Width = mean(Sepal.Width))
# A data.frame with one row per species, one column per statistic
species_summary
# Join the summary stats back onto the original data
iris_plus <- iris %>% left_join(species_summary, by = "Species")
head(iris_plus)

Extract a new vector of multiple mean values from a data frame

I have a large data frame with multiple columns.
Two of my columns look like this:
day_of_year <- c(123,312,23,123,322,1,23,321,124,192, ...)
group <- c(1,1,1,1,3,3,3,2,2,2, ...)
I want to create a new vector with the mean value of "day_of_year" for each group separately. That means my output vector should contain as many (mean) values as there are distinct groups in "group". Please note that some groups have more values than others!
I hope you can help me with this one!
That's a case for tapply:
day_of_year <- c(123,312,23,123,322,1,23,321,124,192)
group <- c(1,1,1,1,3,3,3,2,2,2)
tapply(day_of_year, group, mean)
#        1        2        3
# 145.2500 212.3333 115.3333
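tapply returns a named numeric array; if you need a plain vector or a small data frame instead, for example:
means <- tapply(day_of_year, group, mean)
as.vector(means)  # plain numeric vector, names dropped
data.frame(group = names(means), mean_day_of_year = as.vector(means))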
It would be really helpful if you could post the end result you are looking for. However, as per my understanding, if you are looking for a mean value per group, the following will work (install the dplyr package first):
install.packages('dplyr')
library('dplyr')
New.Data.Frame <- Your.Data.Frame %>%
  group_by(group) %>%
  summarise(Mean_Day_of_Year = mean(day_of_year))

Applying a function to increasingly larger subsets of a data frame

I want to apply a statistical function to increasingly larger subsets of a data frame, starting at row 1 and incrementing by, say, 10 rows each time. So the first subset is rows 1-10, the second rows 1-20, and the final subset is rows 1-nrows. Can this be done without a for loop? And if so, how?
Here is one solution:
# some sample data
df <- data.frame(x = sample(1:105, 105))
# the endpoints of the sequences you wanted (10, 20, ..., plus the final row)
row_seq <- unique(c(seq(10, nrow(df), 10), nrow(df)))
# the data subsets, filtering df from row 1 to each endpoint
data.subsets <- lapply(row_seq, function(n) df[1:n, , drop = FALSE])
# apply the mean function to each subset
# just replace mean by whatever function you want to use
sapply(data.subsets, function(d) mean(d$x))
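As an aside, for the running mean specifically you can skip building the subsets entirely with cumulative sums (a sketch reusing row_seq from above); for arbitrary statistics, the lapply approach above is the general tool:
# running mean at each endpoint via cumulative sums
cumsum(df$x)[row_seq] / row_seq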

R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing a R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to values for the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.
The data frames are iteratively read into the workspace from file based on a class variable, using the 'by' function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a concatenation of user-provided statistical functions to each metric within a class that corresponds to a given value for the year, month, and day (i.e., the mean [function] low temperature [class] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]. I want the end result to be new data frames containing values for every date within a region and a year range for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of the aggregate function, which is currently outputting NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean","sd","scale")            # Statistical functions to apply
x <- "C:/Data/"                        # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp")  # Metrics to subset the data
z <- c(1970:2000)                      # Year range to subset the data

CSVstnClass <- data.frame(CSVstations, CSVclasses)

by(CSVstnClass, CSVstnClass[,2], function(a){ # Station list by class
  suppressWarnings(assign(paste(a[,2]), paste(a[,1]), envir=.GlobalEnv))
  apply(a, 1, function(b){ # Data frame list, row-wise
    classData <- data.frame()
    sapply(y, function(d){ # Element list
      # Read in CSV files as data frames
      CSV_DF <- read.csv(paste(x, b[2], "/", b[1], ".csv", sep=""))
      CSV_DF1 <- CSV_DF[!is.na("Value")]
      CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d),]
      assign(paste(b[2], "_", d, sep=""), CSV_DF2, envir=.GlobalEnv)
      if(nrow(CSV_DF2) > 0){ # Remove empty data frames
        # Bind all data frames by row for a class and element
        classData <<- rbind(classData, CSV_DF2)
        assign(paste(b[2], "_", d, "_bound", sep=""), classData, envir=.GlobalEnv)
        sapply(w, function(g){ # Function list
          # Aggregate results of the bound data frame for each unique date
          dataFunc <- aggregate(Value~Year+Month+Day+Element, data=classData,
                                FUN=g, na.action=na.pass)
          assign(paste(b[2], "_", d, "_", g, sep=""), dataFunc, envir=.GlobalEnv)
        })
      }
    })
  })
})
I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)

# Read in the CSV files as a list of data frames
raw <- lapply(paths, read.csv)

# Extract the needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset,
                   !is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)

# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]

# Compute the aggregates
my_aggregate <- function(df, fun) {
  aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
            na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
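From here, if you want one data frame per statistic across all stations rather than a list, a last combining step might look like this (a sketch; it assumes the per-file aggregates all share the same columns, which they do by construction):
# stack the per-file aggregates into a single data frame per statistic
all_means <- do.call(rbind, means)
all_sds   <- do.call(rbind, sds)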
