Creating a summary table of maximum values in R

I have a matrix of daily average flow data and want to make a summary matrix that shows the maximum (peak) flow for each year. Here's a small sample of what my data looks like:
x <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100)
flow <- matrix(c(rep(1990, 365), rep(1991, 365), rep(1992, 365), sample(x, 365*3, replace = TRUE)), nrow = 365*3, ncol = 2)
I'd like the summary matrix to be formatted with the year in column 1 and the peak flow event from that year in column 2. Here's an example of how I would like the summary matrix formatted.
summary <- matrix(NA, ncol = 2, nrow = 3)
summary[, 1] <- c(1990, 1991, 1992)

This should be close:
DF <- as.data.frame(flow)
names(DF) <- c("year", "flow")
DF$year <- as.factor(DF$year)
res <- aggregate(flow ~ year, data = DF, FUN = max)
And gives:
year flow
1 1990 100
2 1991 100
3 1992 100
in the form of a data frame.
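The aggregate() result stores year as a factor. If you specifically want the two-column numeric matrix layout from the question rather than a data frame, one possible conversion (a sketch) is:
# Convert back to a numeric matrix: year in column 1, peak flow in column 2.
# year is a factor, so it has to go through character before as.numeric.
res_mat <- cbind(year = as.numeric(as.character(res$year)), flow = res$flow)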

And with the dplyr family of functions (building on @Bryan's answer):
library(dplyr)
DF <- as.data.frame(flow)
names(DF) <- c("year", "flow")
group_by(DF, year) %>% summarize(flow = max(flow))
Gives:
Source: local data frame [3 x 2]
year flow
1 1990 100
2 1991 100
3 1992 100

Related

Better way to operate on a value in a column based on the value in another column of that row?

I have a simple data frame composed of Year, Month, and a measured value. I wanted to create a "Normalized" column equal to each timestamp's value divided by the average of all measurements belonging to that same month. I ended up with this loop, but I'm sure there's a cleaner way to do it with something from the tidyverse:
for (i in 1:nrow(my_data)){
  my_data[i, "Normalized"] <- my_data[i, "MERRA2"] / mean_monthly[[my_data[i, "Month"]]]
}
where mean_monthly is
mean_monthly <- apply(merra2_data[, 2:13], 2, mean, na.rm = TRUE)
and the head of the data frame looks like:
Year Month MERRA2 Normalized
1 2000 1 7.217474 1.0267520
2 2000 2 7.700417 1.0625818
3 2000 3 8.004980 1.0532328
4 2000 4 7.994653 0.9930986
5 2000 5 8.317802 1.1213321
6 2000 6 6.734449 0.9501416
Using dplyr: I created a sample dataset and calculated the required output with the dplyr library.
library(dplyr)
df <- data.frame(Year = rep(2000:2004, each = 60),
                 Month = rep(1:12, 5, each = 5),
                 MERRA2 = 1:300)
df1 <- df %>% group_by(Year, Month) %>% mutate(Normalised = MERRA2/mean(MERRA2))
You could try dplyr:
library(dplyr)
my_data <- my_data %>%
  group_by(Year, Month) %>%
  mutate(Normalized = MERRA2/mean(MERRA2))
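Both answers group by Year and Month. The original loop instead divides by mean_monthly, one mean per calendar month computed across all years; if that is the intent, a vectorised base R sketch with ave(), grouping on Month only, would be (an approximation: the loop's mean_monthly comes from a separate wide table, while this computes the monthly means from my_data itself):
# Sketch: normalise each value by the mean of its calendar month,
# computed across all years.
my_data$Normalized <- my_data$MERRA2 /
  ave(my_data$MERRA2, my_data$Month, FUN = function(v) mean(v, na.rm = TRUE))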

Prophet forecasting by id and populating a data frame with one month ahead forecasts

I have a dataframe containing multiple (thousands of) unequal-length monthly time series separated by a non-sequential id variable. The data set looks like this:
id1 <- rep(12, 60)
ds1 <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
value1 <- sample(60)
id2 <- rep(132, 48)
ds2 <- seq(as.Date("2015-01-01"), as.Date("2018-12-31"), by = "month")
value2 <- sample(48)
id3 <- rep(210, 72)
ds3 <- seq(as.Date("2013-01-01"), as.Date("2018-12-31"), by = "month")
value3 <- sample(72)
id <- c(id1, id2, id3)
ds <- c(ds1, ds2, ds3)
y <- c(value1, value2, value3)
df <- data.frame(id, ds, y)
> head(df)
id ds y
1 12 2014-01-01 51
2 12 2014-02-01 22
3 12 2014-03-01 34
4 12 2014-04-01 53
5 12 2014-05-01 26
6 12 2014-06-01 56
I want to run the prophet forecasting model on every time series separated by id and generate a data frame with the one-month-ahead forecast plus one or two diagnostic statistics. The rows of that data frame should start with the id variable, i.e., the first column should be id.
For a single id, the procedure looks like this:
library(prophet)
set.seed(1234)
id <- rep(23, 60)
ds <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
y <- sample(60)
df <- data.frame(ds, y)
m <- prophet(df, seasonality.mode = 'multiplicative')
future <- make_future_dataframe(m, periods = 1)
fcst <- predict(m, future)
last_fcst <- fcst[61,]
mse <- mean((df$y - fcst$yhat[c(1:60)])^2)
mae <- mean(abs((df$y - fcst$yhat[c(1:60)])))
final <- cbind(last_fcst, mse, mae)
final
> final
ds trend multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yearly
61 2018-12-02 27.19465 -0.1401155 -0.1401155 -0.1401155 -0.1401155
yearly_lower yearly_upper additive_terms additive_terms_lower additive_terms_upper yhat_lower yhat_upper
61 -0.1401155 -0.1401155 0 0 0 3.689257 42.66293
trend_lower trend_upper yhat mse mae
61 27.19465 27.19465 23.38425 242.4414 12.80532
I want to repeat this procedure and create a dataset with each one-month forecast with their corresponding row ids. Any idea what's the best way to do that?
As I said in the comments, it's best to split() by id into a list(). That way you can use lapply() (or purrr::map()) to make predictions and calculate the metrics for each id.
library(prophet)
library(dplyr) # for data wrangling
library(purrr) # for map/map2, equivalents are lapply/mapply from baseR
# preparations
l_df <- df %>%
  split(.$id)
m_list <- map(l_df, prophet) # prophet call
future_list <- map(m_list, make_future_dataframe, periods = 1) # makes future obs
forecast_list <- map2(m_list, future_list, predict) # map2 because we have two inputs
So forecast_list will contain the output from predict(), again divided by id.
You can "merge" them back into a data.frame with bind_rows(forecast_list), as long as each data frame has the same structure.
For the metrics I'd follow the same principle:
# to evaluate the model: create a new list
eval_list <- map2(forecast_list, l_df, function(x, z) {
  # x is the single data frame of predictions
  # z is the original data frame with actuals
  x <- x[1:(nrow(x) - 1), ] # drop the last row (the true one-month-ahead forecast)
  x <- x %>% mutate(y_true = (z %>% select(y) %>% pull())) # add the column of actual values
})
# metrics evaluation:
eval_list <- map(eval_list, function(x) {
  x %>%
    summarise(mse = mean((y_true - yhat)^2)) # add more scores here
})
# $`12`
# mse
# 1 199.1829
#
# $`132`
# mse
# 1 156.6394
#
# $`210`
# mse
# 1 415.9659
You can use map2() like I did for eval_list to bind the true forecast with the metrics if you want.
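As a sketch of that final step, under the structures built above: pair the last row of each forecast (the true one-month-ahead prediction) with its per-id metrics, then stack everything into one data frame keyed by id.
# Sketch: combine the one-month-ahead forecast with the per-id metrics.
final_list <- map2(forecast_list, eval_list, function(fc, ev) {
  cbind(fc[nrow(fc), c("ds", "yhat", "yhat_lower", "yhat_upper")], ev)
})
final_df <- bind_rows(final_list, .id = "id") # first column is id, as requested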

Vectorising linear interpolation function for use with mutate

I have a data frame that looks like this:
library(dplyr)
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
                     region = sample(c("Here", "There", "Everywhere")),
                     dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just look up the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
  # Filter to day and region
  df_tmp <- df_ref %>% filter(region == re, day == da)
  # Approximate answer from distance
  approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives:
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
However, when I apply the function inside mutate I get different answers:
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
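A minimal sketch of one fix, assuming the diagnosis above is right: inside mutate, lookup receives the full length-3 columns, so filter(region == re, day == da) recycles the vectors instead of selecting one observation at a time. Wrapping lookup with base::Vectorize() (which loops over the arguments via mapply()) applies it element-wise:
# Sketch: apply lookup one row at a time.
lookup_v <- Vectorize(lookup)
df_obs %>% mutate(ref = lookup_v(region, dist, day))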

R for loop for calculating sums based on a data frame's different columns

My current data frame looks like this:
# Create sample data
my_df <- data.frame(seq(1, 100),
                    rep(c("ind_1", "", "", ""), times = 25),
                    rep(c("", "ind_2", "", ""), times = 25),
                    rep(c("", "", "ind_3", ""), times = 25),
                    rep(c("", "", "", "ind_4"), times = 25))
# Rename columns
names(my_df)[names(my_df)=="seq.1..100."] <- "value"
names(my_df)[names(my_df)=="rep.c..ind_1................times...25."] <- "ind_1"
names(my_df)[names(my_df)=="rep.c......ind_2............times...25."] <- "ind_2"
names(my_df)[names(my_df)=="rep.c..........ind_3........times...25."] <- "ind_3"
names(my_df)[names(my_df)=="rep.c..............ind_4....times...25."] <- "ind_4"
# Replace empty elements with NA
my_df[my_df==''] = NA
What I want to script is a rather simple for loop that calculates the sum of the value column for each of the four ind_* columns and prints the result.
So far my very meagre attempt has been:
# Create a vector with all individuals
individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")
# Calculate aggregates for each individual
for (i in individuals){
  ind <- 1
  sum_i <- aggregate(value ~ ind_1, data = my_df, sum)
  print(paste("Individual", i, "possesses an aggregated value of", sum_i$value))
  ind <- ind + 1
}
As you can see, I currently struggle to include the correct command to calculate the sum based on one column after another; the current version, naturally, only calculates the result for ind_1. What needs to be changed in the aggregate command to achieve the desired result? (I'm a total beginner, but I thought of using indices to proceed from one column to another.)
Assuming you'd want to calculate the sum where the ind_* column matches an entry in your individuals vector:
individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")
for (i in 1:(ncol(my_df)-1)){
  print(sum(my_df$value[which(my_df[, individuals[i]] == individuals[i])]))
}
Why do you want to use print() instead of storing the results in a separate vector?
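For instance (a sketch), collecting the sums into a named vector instead of printing:
# Sketch: store each individual's sum in a named vector.
sums <- setNames(numeric(length(individuals)), individuals)
for (i in individuals){
  sums[i] <- sum(my_df$value[which(my_df[[i]] == i)])
}
sums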
You can try the tidyverse as well:
library(dplyr)
library(tidyr)
my_df %>%
  gather(key, Inds, -value) %>%
  filter(!is.na(Inds)) %>%
  group_by(key) %>%
  summarise(Sum = sum(value))
# A tibble: 4 x 2
key Sum
<chr> <int>
1 ind_1 1225
2 ind_2 1250
3 ind_3 1275
4 ind_4 1300
The idea is to make the data long using gather, filter the NAs out, then group by key and summarise the values.
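In current tidyr, pivot_longer() supersedes gather(); a sketch of the equivalent pipeline:
library(dplyr)
library(tidyr)
# Sketch: pivot_longer equivalent of the gather() pipeline above.
my_df %>%
  pivot_longer(-value, names_to = "key", values_to = "Inds") %>%
  filter(!is.na(Inds)) %>%
  group_by(key) %>%
  summarise(Sum = sum(value))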
A solution closer to base R, using reshape2 only for the reshaping:
library(reshape2)
my_df_long <- melt(my_df, id.vars = "value", value.name = "ID")
aggregate(value ~ ID, my_df_long, sum, na.rm = TRUE)
ID value
1 ind_1 1225
2 ind_2 1250
3 ind_3 1275
4 ind_4 1300

Maintain data frame rows after subset

I am trying to calculate a % yield of some data based on a subset:
library(plyr) # for ddply below
library(ggplot2) # for the plot below
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Subset data based on measurement range
pass <- subset(df, Measurement > 6 & Measurement < 7)
# Calculate number of rows in data frame (by Batch then ID)
ac <- ddply(df, c("Batch", "ID"), nrow)
colnames(ac) <- c("Batch", "ID", "Total")
# Calculate number of rows in subset (by Batch then ID)
bc <- ddply(pass, c("Batch", "ID"), nrow)
colnames(bc) <- c("Batch", "ID", "Pass")
# Calculate yield
bc$Yield <- (bc$Pass / ac$Total) * 100
# plot yield
ggplot(bc, aes(ID, Yield, colour=Batch)) + geom_point()
My problem is that, due to my filter range (between 6 and 7), my subset (pass) has fewer rows than my data frame (df):
nrow(ac)
[1] 100
nrow(bc)
[1] 83
Therefore I cannot use
bc$Yield <- (bc$Pass / ac$Total) * 100
Or I get the error
replacement has 100 rows, data has 83
The reason I am trying to keep it generic is that my real data has varying Batch and ID counts (otherwise I could just divide by a constant in my yield calculation). Can anyone tell me how to put a 0 in my subset when the data fall outside the limits (6 to 7 in this case), or point out a more elegant way of calculating yield? Thank you.
Update:
str(df)
'data.frame': 1000 obs. of 3 variables:
$ Batch : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 100 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Measurement: num 5.04 4.63 2.26 3.8 5.59 ...
I think this is what you want. I've done it using dplyr's group_by and summarize here.
For each Batch/ID it calculates the number of observations, the number of observations where the measurement is between 6 and 7, and the ratio of the two.
library(dplyr)
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Function counting measurements in the passing range (between 6 and 7)
countFunc <- function(x) sum((x > 6) & (x < 7))
# Calculate number of rows, rows that meet criteria, and yield.
totals <- df %>%
  group_by(Batch, ID) %>%
  summarize(total = length(Measurement), x = countFunc(Measurement)) %>%
  mutate(yield = x/total) %>%
  as.data.frame()
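Because every Batch/ID combination in df is kept (groups with no passing rows simply get x = 0), the original plot works directly off totals; a sketch, multiplying by 100 since yield here is a proportion:
library(ggplot2)
# Plot percentage yield per ID, coloured by Batch, as in the question.
ggplot(totals, aes(ID, yield * 100, colour = Batch)) + geom_point()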
