R: how to average every 7th row

I want to take the average of each column (except the date) after every seven rows. I tried the approach below, but I was getting incorrect values. This method also seems really long. Is there a way to shorten it?
bankamerica = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/bankamerica.csv')
library(tidyverse)
GroupLabels <- 0:(nrow(bankamerica) - 1)%/% 7
bankamerica$Group <- GroupLabels
Avgs <- bankamerica %>%
group_by(bankamerica$Group) %>%
summarize(Avg = mean(bankamerica$tr))
EDITED: Just realized this code provides the incorrect values

I think you're on the right path.
bankamerica %>%
mutate(group = cumsum(row_number() %% 7 == 1)) %>%
group_by(group) %>%
summarise(caldate = first(caldate), across(-caldate, mean)) %>%
select(-group)
## A tibble: 144 × 3
# caldate tr var
# <chr> <dbl> <dbl>
# 1 1/2/01 28.9 -50.6
# 2 1/11/01 23.6 -45.4
# 3 1/23/01 20.9 -45
# 4 2/1/01 17.4 -48
# 5 2/12/01 14.4 -48
# 6 2/21/01 17 -48.9
# 7 3/2/01 19.1 -56
# 8 3/13/01 19.4 -56.9
# 9 3/22/01 23.3 -55.7
#10 4/2/01 7.71 -58.3
Note that this averages every 7 rows, not every 7 days, because there are missing days in the data.
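For what it's worth, the group labels in the question were already fine. The wrong values most likely come from writing `bankamerica$tr` inside summarize(): that hands mean() the whole column, so the grouping is ignored and every group gets the overall mean. A minimal sketch with made-up data (the data frame `toy` and its values are assumptions, not the real CSV):

```r
library(dplyr)

# Made-up stand-in for the real data: 14 rows, so two groups of 7
toy <- data.frame(
  caldate = as.character(1:14),
  tr      = 1:14,
  var     = (1:14) * 10
)

avgs <- toy %>%
  mutate(Group = (row_number() - 1) %/% 7) %>%   # 0 for rows 1-7, 1 for rows 8-14
  group_by(Group) %>%
  # bare column names are evaluated per group; toy$tr would not be
  summarise(across(c(tr, var), mean), .groups = "drop")

avgs$tr   # 4 11
```

The only real change from the question's code is using `tr` instead of `bankamerica$tr` (and `Group` instead of `bankamerica$Group`).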

Related

Is there any function that gives the changes between columns?

I have a df that looks like this.
head(dfhigh)
rownames 2015Y 2016Y 2017Y 2018Y 2019Y 2020Y 2021Y
1 Australia 29583.7403 48397.383 45220.323 68461.941 39218.044 20140.351 29773.188
2 Austria* 1294.5092 -8400.973 14926.164 5511.625 2912.795 -14962.963 5855.014
3 Belgium* -24013.3111 68177.596 -3057.153 27119.084 -9208.553 13881.481 22955.298
4 Canada 43852.7732 36061.859 22764.156 37653.521 50141.784 23174.006 59693.992
5 Chile* 20507.8407 12249.294 6128.716 7735.778 12499.238 8385.907 15251.538
6 Czech Republic 465.2137 9814.496 9517.948 11010.423 10108.914 9410.576 5805.084
I want to calculate the changes between years, so instead of the values, the table has the percentage of change (obviously deleting 2015Y).
Try this using (current - previous)/ previous *100
lst <- list()
nm <- names(dfhigh)[-1]
for(i in 1:(length(nm) - 1)){
lst[[i]] <- (dfhigh[[nm[i+1]]] - dfhigh[[nm[i]]]) / dfhigh[[nm[i]]] * 100
}
ans <- do.call(cbind , lst)
colnames(ans) <- paste("ch_of" , nm[-1])
ans
You can change the formula to calculate the percentage however you want.
You could also use a tidyverse solution.
library(tidyverse)
dfhigh %>%
pivot_longer(!rownames) %>%
group_by(rownames) %>%
mutate(value = 100*value/lag(value)-100) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
# # A tibble: 6 × 8
# rownames `2015Y` `2016Y` `2017Y` `2018Y` `2019Y` `2020Y` `2021Y`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Australia NA 63.6 -6.56 51.4 -42.7 -48.6 47.8
# 2 Austria* NA -749. -278. -63.1 -47.2 -614. -139.
# 3 Belgium* NA -384. -104. -987. -134. -251. 65.4
# 4 Canada NA -17.8 -36.9 65.4 33.2 -53.8 158.
# 5 Chile* NA -40.3 -50.0 26.2 61.6 -32.9 81.9
# 6 CzechRepublic NA 2010. -3.02 15.7 -8.19 -6.91 -38.3
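The loop can also be avoided by subtracting two shifted column blocks of a matrix. A small sketch with invented numbers (`toy` and its values are placeholders, not the real dfhigh):

```r
# Invented stand-in for dfhigh: one label column plus yearly values
toy <- data.frame(
  rownames = c("A", "B"),
  `2015Y`  = c(100, 50),
  `2016Y`  = c(110, 25),
  `2017Y`  = c(99, 50),
  check.names = FALSE
)

m    <- as.matrix(toy[-1])            # numeric part only
prev <- m[, -ncol(m), drop = FALSE]   # first year .. second-to-last year
curr <- m[, -1, drop = FALSE]         # second year .. last year

pct <- (curr - prev) / prev * 100     # (current - previous) / previous * 100
pct_change <- data.frame(rownames = toy$rownames, pct, check.names = FALSE)
pct_change
```

The result keeps the column names of `curr`, so the first year drops out automatically. One thing to watch: when the previous value is negative, as in several rows of the question's data, the sign of the percentage flips, so dividing by abs(prev) may be closer to what you want.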

How do I find which country has the lowest average time to finish a marathon?

I have a data set with a column of countries and a column of the time it took each runner to finish a marathon. I want to find out which 5 countries completed it in the shortest time on average. I am new to R, so I only have basic knowledge. The time column is in hours. Example of the data: marathon$Countries is a column with the nationality of each runner, and marathon$OverallHrs is the overall time each runner took to complete the marathon.
I have tried
tapply(marathon$OverallHrs, marathon$Country, mean)
It hasn't worked the way I want it to.
I am assuming that countries can appear more than once in your "Country" column (otherwise the problem is trivial). If you are a beginner in R, I would strongly encourage you to start learning with the "tidyverse" package.
Below is the solution, where you can have repeated countries for the column "Country"
library(tidyverse)
set.seed(123)
# Generate 10 Countries, each one 5 times
A = sample(rep(1:10,5))
# Generate 50 random timing from (5-20)
B = round(runif(50)*15 + 5)
#Create a dataframe with columns (Country, Timing), rows = 50
df = data.frame("Country" = paste0("Country",A),
"Timing" = B)
#Dataframe will look like this
# Country Timing
# 1 Country5 15
# 2 Country4 17
# 3 Country4 5
# 4 Country3 12
# 5 Country5 16
# Calculate average marathon timing
df_mean <- df %>%
group_by(Country) %>% #Group
summarise(Mean_Timing = mean(Timing), .groups = 'drop') %>% #Calculate Mean_Timing
arrange(Mean_Timing) # Arrange by fastest timing first
#Dataframe = df_mean
# A tibble: 10 x 2
# Country Mean_Timing
# <chr> <dbl>
# 1 Country9 10.6
# 2 Country1 11.4
# 3 Country3 11.4
# 4 Country4 11.4
# 5 Country2 12.2
# 6 Country10 12.6
# 7 Country8 13.2
# 8 Country7 13.6
# 9 Country5 15
# 10 Country6 15.2
# To get the 5 fastest countries:
df_mean$Country[1:5]
# "Country9" "Country1" "Country3" "Country4" "Country2"
There is always the aggregate function in base R for calculating a mean per group. It takes less code, but I still prefer the tidyverse method, as it becomes intuitive after a while and can be tweaked slightly to solve most data frame questions.
Anyway, here is the solution using aggregate.
df_mean2 <- aggregate(df[, 2], list(df$Country), mean) # Calculate Mean
df_mean2[order(df_mean2$x), ] # Sort by ascending
Group.1 x
10 Country9 10.6
1 Country1 11.4
4 Country3 11.4
5 Country4 11.4
3 Country2 12.2
2 Country10 12.6
9 Country8 13.2
8 Country7 13.6
6 Country5 15.0
7 Country6 15.2
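The tapply() call from the question is also close: it returns a named vector of means, so it only needs sorting and truncating. A sketch with a made-up `marathon` data frame (the values are invented; the column names follow the question's tapply() call):

```r
# Invented data; Country / OverallHrs mirror the names used in the question
marathon <- data.frame(
  Country    = c("KEN", "KEN", "ETH", "ETH", "USA", "USA",
                 "GBR", "JPN", "JPN", "FRA"),
  OverallHrs = c(2.1, 2.3, 2.2, 2.4, 3.5, 4.5, 4.2, 3.0, 3.2, 3.9)
)

# Named vector: one mean per country
avg_by_country <- tapply(marathon$OverallHrs, marathon$Country, mean)

# sort() is ascending by default, so the fastest averages come first
fastest5 <- head(sort(avg_by_country), 5)
names(fastest5)   # "KEN" "ETH" "JPN" "FRA" "USA"
```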

Mean Temperature by group month in R

I am trying to calculate the mean temperature per month from daily records between 1988 and 2020 using the following code:
(Temperature_year_month <- (na.omit(database_PE_na) %>% group_by(month) %>% summarise(mean_temp_monthYear = mean(Air.Temp.Mean))))
I got the following results, which I checked in Excel and which seem correct:
# A tibble: 12 x 2
month mean_temp_monthYear
<dbl> <dbl>
1 1 11.4
2 2 13.5
3 3 17.2
4 4 21.2
5 5 26.0
6 6 31.0
7 7 33.3
8 8 32.5
9 9 29.1
10 10 22.4
11 11 15.4
12 12 10.7
However, when I do this only for the month of July (month = 7), I get a different result:
(Temperature_year_month <- (na.omit(database_PE_na) %>% group_by(month=7) %>% summarise(mean_temp_monthYear = mean(Air.Temp.Mean))))
month mean_temp_monthYear
<dbl> <dbl>
1 7 22.0
Could someone explain to me why this happens?
We can use data.table methods
library(data.table)
setDT(database_PE_na)[month == 7,
.(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))]
For comparison, use == and not =.
If you want the mean of a single month, put the condition in filter() instead of group_by().
mean() has an na.rm argument that can be set to TRUE to ignore NA values, instead of using na.omit() and removing the complete row.
Use:
library(dplyr)
Temperature_year_month <- database_PE_na %>%
filter(month==7) %>%
summarise(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))
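To spell out why the July result differed: inside group_by(), `month = 7` is not a comparison but an assignment. It overwrites `month` with the constant 7, which puts every row into that single group, so the 22.0 is the mean over all months. A sketch with invented data (`temps` is a placeholder for database_PE_na):

```r
library(dplyr)

# Invented stand-in for database_PE_na
temps <- data.frame(
  month         = c(1, 1, 7, 7),
  Air.Temp.Mean = c(10, 12, 30, 34)
)

# `month = 7` creates a constant column, so all four rows form one group
overwritten <- temps %>%
  group_by(month = 7) %>%
  summarise(mean_temp = mean(Air.Temp.Mean))
overwritten$mean_temp   # 21.5, the mean of every row

# `month == 7` inside filter() keeps only the July rows
july <- temps %>%
  filter(month == 7) %>%
  summarise(mean_temp = mean(Air.Temp.Mean))
july$mean_temp          # 32
```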

rolling 30-day geometric mean with variable width

The solution to this question by @ShirinYavari was almost what I needed, except that it uses a static averaging window width of 2. I have a dataset with random samples from multiple stations, for which I want to calculate a rolling 30-day geomean. All samples within the 30 days up to a given sample should be averaged, so the width changes depending on how close together the preceding samples are: you would average 2, 3, or more samples if 1, 2, or more preceding samples fall within 30 days of a given sample.
Here is some example data, plus my code attempt:
RESULT = c(50,900,25,25,125,50,25,25,2000,25,25,
25,25,25,25,25,25,325,25,300,475,25)
DATE = as.Date(c("2018-05-23","2018-06-05","2018-06-17",
"2018-08-20","2018-10-05","2016-05-22",
"2016-06-20","2016-07-25","2016-08-11",
"2017-07-21","2017-08-08","2017-09-18",
"2017-10-12","2011-04-19","2011-06-29",
"2011-08-24","2011-10-23","2012-06-28",
"2012-07-16","2012-08-14","2012-09-29",
"2012-10-24"))
FINAL_SITEID = c(rep("A", 5), rep("B", 8), rep("C", 9))
df=data.frame(FINAL_SITEID,DATE,RESULT)
data_roll <- df %>%
group_by(FINAL_SITEID) %>%
arrange(DATE) %>%
mutate(day=DATE-dplyr::lag(DATE, n=1),
day=replace_na(day, 1),
rnk=cumsum(c(TRUE, day > 30))) %>%
group_by(FINAL_SITEID, rnk) %>%
mutate(count=rowid(rnk)) %>%
mutate(GM30=rollapply(RESULT, width=count, geometric.mean, fill=RESULT, align="right"))
I get this error message, which seems like it should be an easy fix, but I can't figure it out:
Error: Column `rnk` must be length 5 (the group size) or one, not 6
The easiest way to compute rolling statistics over datetime windows is the runner package. You don't have to hack around to get exact 30-day windows: the runner function lets you apply any R function over a rolling window. Below is an example of a 30-day geometric.mean within each FINAL_SITEID group:
library(dplyr)  # provides %>%, group_by(), arrange(), mutate()
library(psych)  # provides geometric.mean()
library(runner)
df %>%
group_by(FINAL_SITEID) %>%
arrange(DATE) %>%
mutate(GM30 = runner(RESULT, k = 30, idx = DATE, f = geometric.mean))
# FINAL_SITEID DATE RESULT GM30
# <fct> <date> <dbl> <dbl>
# 1 C 2011-04-19 25 25.0
# 2 C 2011-06-29 25 25.0
# 3 C 2011-08-24 25 25.0
# 4 C 2011-10-23 25 25.0
# 5 C 2012-06-28 325 325.
# 6 C 2012-07-16 25 90.1
# 7 C 2012-08-14 300 86.6
# 8 C 2012-09-29 475 475.
# 9 C 2012-10-24 25 109.
# 10 B 2016-05-22 50 50.0
The width argument of rollapply can be a vector of widths which can be set using findInterval. An example of this is shown in the Examples section of the rollapply help file and we use that below.
library(dplyr)
library(psych)
library(zoo)
data_roll <- df %>%
arrange(FINAL_SITEID, DATE) %>%
group_by(FINAL_SITEID) %>%
mutate(GM30 = rollapplyr(RESULT, 1:n() - findInterval(DATE - 30, DATE),
geometric.mean, fill = NA)) %>%
ungroup
giving:
# A tibble: 22 x 4
FINAL_SITEID DATE RESULT GM30
<fct> <date> <dbl> <dbl>
1 A 2018-05-23 50 50.0
2 A 2018-06-05 900 212.
3 A 2018-06-17 25 104.
4 A 2018-08-20 25 25.0
5 A 2018-10-05 125 125.
6 B 2016-05-22 50 50.0
7 B 2016-06-20 25 35.4
8 B 2016-07-25 25 25.0
9 B 2016-08-11 2000 224.
10 B 2017-07-21 25 25.0
# ... with 12 more rows
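If you would rather avoid extra packages, the same window can be sketched in base R: for each sample, take every sample dated within the preceding 30 days and compute exp(mean(log(x))). The helper name `gm_window` is made up for this illustration, and the window boundary (strictly later than `date - 30`) is an assumption chosen to match the output above. It also assumes all values are positive.

```r
# Geometric mean of `x` over a trailing window of `days` days, per element.
# gm_window is a hypothetical helper, not part of any package.
gm_window <- function(x, dates, days = 30) {
  vapply(seq_along(x), function(i) {
    # samples later than (date_i - days) and no later than date_i
    in_window <- dates > dates[i] - days & dates <= dates[i]
    exp(mean(log(x[in_window])))
  }, numeric(1))
}

# Site A from the question
site_a <- data.frame(
  DATE   = as.Date(c("2018-05-23", "2018-06-05", "2018-06-17",
                     "2018-08-20", "2018-10-05")),
  RESULT = c(50, 900, 25, 25, 125)
)
site_a$GM30 <- gm_window(site_a$RESULT, site_a$DATE)
round(site_a$GM30, 1)   # 50.0 212.1 104.0 25.0 125.0
```

For several stations, the helper would be applied per site, for example inside group_by(FINAL_SITEID) followed by mutate().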

Subsetting data set to only retain the mean

Please see attached image of dataset.
What are the different ways to only retain a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
library(tidyverse)
library(lubridate)
#example data from airquality:
aq <- as_tibble(airquality)  # as_data_frame() is deprecated in favour of as_tibble()
aq$mydate<-lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE))
summarize() can return several summary statistics at once:
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE),
"Num" = n(),
"SD" = sd(Temp, na.rm=TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
See also the lubridate cheatsheet.
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
We can group by the index of the dataset and use that in aggregate (from base R) to get the mean:
aggregate(dat, index(dat), FUN = mean)
NB: Here, we assumed that the dataset is in xts or zoo format. If the dataset has a month column, then use
aggregate(dat, list(dat$Month), FUN = mean)
