Averaging my data into quarterly means (nfrequency error message) in R

I'm trying to average my data into quarterly means, but when I use the following code I get this error.
Code:
quarterly = aggregate(overturning_ts, nfrequency = 4, mean)
Error message:
Error in aggregate.ts(overturning_ts, nfrequency = 4, mean) :
cannot change frequency from 1 to 4
Data snippet:
overturning_ts
year month day hour Quarter Days_since_start Overturning_Strength
[1,] 2004 4 2 0 2 1.0 9.689933
[2,] 2004 4 2 12 2 1.5 10.193495
[3,] 2004 4 3 0 2 2.0 10.660849
[4,] 2004 4 3 12 2 2.5 11.077229
[5,] 2004 4 4 0 2 3.0 11.432414
[6,] 2004 4 4 12 2 3.5 11.721769
All data is available here; after downloading, I just converted it to a time series to get overturning_ts: https://drive.google.com/file/d/1NV3aKsvpPkGatLnuUMbvLpxhcYs_gdM-/view?usp=sharing
The outcome I am looking for is like this:
Qtr1 Qtr2 Qtr3 Qtr4
1960 160.1 129.7 84.8 120.1
1961 160.1 124.9 84.8 116.9
1962 169.7 140.9 89.7 123.3

Like this?
library(tidyverse)
df %>%
  group_by(year, Quarter) %>%
  summarise(avg_overturning = mean(Overturning_Strength, na.rm = TRUE)) %>%
  pivot_wider(names_from = Quarter,
              values_from = avg_overturning, names_sort = TRUE)
# A tibble: 11 x 5
# Groups: year [11]
year `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2004 NA 15.3 23.7 17.7
2 2005 14.0 18.7 21.7 22.5
3 2006 17.1 17.7 20.5 20.8
4 2007 18.9 15.5 17.9 20.0
5 2008 18.5 15.5 16.1 20.2
6 2009 16.3 14.9 15.3 12.2
7 2010 8.89 16.2 19.7 15.1
8 2011 15.1 16.0 17.8 18.4
9 2012 15.8 11.9 16.4 16.5
10 2013 11.9 17.1 17.6 18.8
11 2014 15.1 NA NA NA

We can use base R:
with(df1, tapply(Overturning_Strength, list(year, Quarter),
                 FUN = mean, na.rm = TRUE))
1 2 3 4
2004 NA 15.34713 23.74958 17.65220
2005 13.950342 18.66797 21.73983 22.49755
2006 17.116492 17.71430 20.50190 20.84159
2007 18.918347 15.46002 17.87220 20.01701
2008 18.508666 15.53064 16.06696 20.21658
2009 16.255357 14.85671 15.28269 12.16084
2010 8.889602 16.18042 19.74318 15.05649
2011 15.130970 15.96652 17.79070 18.35192
2012 15.793286 11.90334 16.37805 16.45706
2013 11.867353 17.07688 17.60640 18.81432
2014 15.119643 NA NA NA
Or with xtabs from base R; here xtabs() sums Overturning_Strength per year-quarter cell and table() counts the rows per cell, so their ratio is the cell mean:
xtabs(Overturning_Strength ~ year + Quarter,
      df1) / table(df1[c("year", "Quarter")])
      Quarter
year           1         2         3         4
  2004       NaN 15.347126 23.749583 17.652204
  2005 13.950342 18.667970 21.739828 22.497550
  2006 17.116492 17.714298 20.501897 20.841587
  2007 18.918347 15.460020 17.872199 20.017007
  2008 18.508666 15.530639 16.066960 20.216581
  2009 16.255357 14.856708 15.282690 12.160845
  2010  8.889602 16.180422 19.743183 15.056486
  2011 15.130970 15.966518 17.790699 18.351916
  2012 15.793286 11.903337 16.378045 16.457062
  2013 11.867353 17.076883 17.606403 18.814323
  2014 15.119643       NaN       NaN       NaN

As your data already seems to be structured with quarters as a column, a possible solution could be to use dplyr directly, without making the data a time series object with ts(). We group_by every year-quarter pair, summarise the strength value, and change to a wide format for the desired output with pivot_wider.
library(dplyr)
library(tidyr)
overturning |>
  select(year, Quarter, Overturning_Strength) |>
  group_by(year, Quarter) |>
  summarise(value = mean(Overturning_Strength)) |>
  ungroup() |>
  pivot_wider(id_cols = year, names_from = Quarter, values_from = value,
              names_prefix = "Qtr", names_sort = TRUE)
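As an aside on the original error: aggregate.ts() can only aggregate to a new frequency that is a divisor of the frequency the series was built with, and ts() defaults to frequency = 1, so a series created that way can never be split into 4 quarters. A minimal sketch reproducing both cases:
x <- ts(1:8, start = 2000, frequency = 4)  # quarterly series
aggregate(x, nfrequency = 2, mean)         # works: 2 divides 4
y <- ts(1:8, start = 2000)                 # default frequency = 1
# aggregate(y, nfrequency = 4, mean)       # fails: cannot change frequency from 1 to 4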

Related

Creating averages across time periods

I'm a beginner in R, but I have the data frame below (with more observations), in which each 'id' appears for at most three years: 1991, 1999, and 2007.
I want to create a variable avg_ln_rd by 'id' that takes the average of ln_rd and the 1991 ln_rd if the first ln_rd observation is from 1999, and the average with the 1999 value if the first ln_rd observation is from 2007.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I also already dropped any observations of 'id' that only exist for one of the three years.
My first thought was to create a standalone variable for ln_rd for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
It's still a bit unclear what to do if all years are present in a group, but this might help. (Edited to show the desired output.)
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
         avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
         avg91 = ifelse(year == 1991, avg91, NA),
         avg99 = ifelse(year == 2007, avg99, NA)) %>%
  ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
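The "shifted by one row" idea from the question's edit also works once rows are sorted within id; a sketch with dplyr::lag() (assuming the same df), where each row averages its own ln_rd with the previous observation's:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(avg_ln_rd = (ln_rd + lag(ln_rd)) / 2) %>%  # NA on each id's first year
  ungroup()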

How can I divide the summarize() output into columns with tidyverse?

I am struggling with the tidyverse package. I'm using the mpg dataset from R to illustrate the issue I'm facing (ignore whether the relationships are relevant; it is just for the sake of explaining my problem).
What I'm trying to do is to obtain the average displ grouped by manufacturer and year AND, at the same time (and this is what I can't figure out), have a separate column for each of the fuel types (i.e.: a column for the mean of diesel, a column for the mean of petrol, etc.).
This is the first part of the code, and I'm new to R, so I really don't know what I need to add...
mpg %>%
  group_by(manufacturer, year) %>%
  summarize(Mean. = mean(displ))
# A tibble: 30 × 3
# Groups: manufacturer [15]
manufacturer year Mean.
<chr> <int> <dbl>
1 audi 1999 2.36
2 audi 2008 2.73
3 chevrolet 1999 4.97
4 chevrolet 2008 5.12
5 dodge 1999 4.32
6 dodge 2008 4.42
7 ford 1999 4.45
8 ford 2008 4.66
9 honda 1999 1.6
10 honda 2008 1.85
# … with 20 more rows
Any help is appreciated, thank you.
Perhaps we need to reshape into 'wide' format:
library(dplyr)
library(tidyr)
mpg %>%
  select(manufacturer, year, fl, displ) %>%
  pivot_wider(names_from = fl, values_from = displ, values_fn = mean)
Output:
# A tibble: 30 x 7
manufacturer year p r e d c
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 2.36 NA NA NA NA
2 audi 2008 2.73 NA NA NA NA
3 chevrolet 2008 6.47 4.49 5.3 NA NA
4 chevrolet 1999 5.7 4.22 NA 6.5 NA
5 dodge 1999 NA 4.32 NA NA NA
6 dodge 2008 NA 4.42 4.42 NA NA
7 ford 1999 NA 4.45 NA NA NA
8 ford 2008 5.4 4.58 NA NA NA
9 honda 1999 1.6 1.6 NA NA NA
10 honda 2008 2 1.8 NA NA 1.8
# … with 20 more rows
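An equivalent route keeps the asker's group_by/summarize step and only adds the reshaping; a sketch (mpg ships with ggplot2, and .groups = "drop" merely silences the grouping message):
library(ggplot2)  # for the mpg dataset
library(dplyr)
library(tidyr)
mpg %>%
  group_by(manufacturer, year, fl) %>%
  summarize(mean_displ = mean(displ), .groups = "drop") %>%
  pivot_wider(names_from = fl, values_from = mean_displ)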

Filter a dataframe by keeping rows with dates at least three days in a row, preferably with dplyr

I would like to filter a dataframe based on its date column. I would like to keep the rows where there are at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so a vectorized approach would be good.
I tried to take inspiration from the following link, but it didn't really go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop. I managed to put an indicator on the dates that are not consecutive, but it didn't give me the desired result, because it keeps all dates that are in a row even if they are fewer than 3 in a row.
tf is my dataframe:
library(lubridate)
for(i in 2:(nrow(tf) - 1)){
  if(tf$Date[i] != tf$Date[i + 1] %m+% days(-1)){
    if(tf$Date[i] != tf$Date[i - 1] %m+% days(1)){
      tf$Date[i] = as.Date(0)
    }
  }
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
         diff = c(0, diff(Date))) %>%
  group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
  filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
# Give each run of consecutive days its own id, then take run lengths with rle()
rles <- rle(cumsum(c(1, diff(DF$Date) != 1)))
# A run qualifies only if it is at least 3 days long
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
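A quick illustration of the rle() trick with made-up dates: cumsum(diff(Date) != 1) gives each run of consecutive days its own id, rle() measures each run's length, and inverse.rle() expands the length test back into one logical per row:
d <- as.Date(c("2023-01-01", "2023-01-02", "2023-01-03",  # run of 3 -> keep
               "2023-01-05", "2023-01-06"))               # run of 2 -> drop
grp <- cumsum(c(1, diff(d) != 1))  # 1 1 1 2 2
r <- rle(grp)
r$values <- r$lengths >= 3
inverse.rle(r)  # TRUE TRUE TRUE FALSE FALSE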
A similar approach in dplyr:
library(dplyr)
DF %>%
  mutate(Date = as.Date(Date)) %>%
  add_count(IDs = cumsum(c(1, diff(Date) != 1))) %>%
  filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3

Rescale data frame columns as percentages of baseline entry with dplyr

I often need to rescale time series relative to their value at a certain baseline time (usually as a percent of the baseline). Here's an example.
> library(dplyr)
> library(magrittr)
> library(tibble)
> library(tidyr)
# [messages from package imports snipped]
> set.seed(42)
> mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
> usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
> table <- rbind(mexico, usa)
> table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 11.4 19.9
2 2001 Mexico 10.4 22.5
3 2002 Mexico 12.4 21.9
4 2003 Mexico 13.6 25.0
5 2004 Mexico 14.4 23.9
6 2000 USA 31.3 40.6
7 2001 USA 33.3 40.7
8 2002 USA 30.6 39.3
9 2003 USA 32.7 40.6
10 2004 USA 33.9 45.3
I want to scale A and B to express each value as a percent of the country-specific 2001 value (i.e., the A and B entries in rows 2 and 7 should be 100). My way of doing this is somewhat roundabout and awkward: extract the baseline values into a separate table, merge them back into a separate column in the main table, and then compute scaled values, with annoying intermediate gathering and spreading to avoid specifying the column names of each time series (real data sets can have far more than two value columns). Is there a better way to do this, ideally with a single short pipeline?
> long_table <- table %>% gather(variable, value, -Year, -Country)
> long_table
# A tibble: 20 x 4
Year Country variable value
<int> <chr> <chr> <dbl>
1 2000 Mexico A 11.4
2 2001 Mexico A 10.4
#[remaining tibble printout snipped]
> baseline_table <- long_table %>%
filter(Year == 2001) %>%
select(-Year) %>%
rename(baseline=value)
> baseline_table
# A tibble: 4 x 3
Country variable baseline
<chr> <chr> <dbl>
1 Mexico A 10.4
2 USA A 33.3
3 Mexico B 22.5
4 USA B 40.7
> normalized_table <- long_table %>%
inner_join(baseline_table) %>%
mutate(value=100*value/baseline) %>%
select(-baseline) %>%
spread(variable, value) %>%
arrange(Country, Year)
Joining, by = c("Country", "variable")
> normalized_table
# A tibble: 10 x 4
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2001 Mexico 100. 100
3 2002 Mexico 118. 97.3
4 2003 Mexico 131. 111.
5 2004 Mexico 138. 106.
6 2000 USA 94.0 99.8
7 2001 USA 100 100
8 2002 USA 92.0 96.6
9 2003 USA 98.3 99.6
10 2004 USA 102. 111.
My second attempt was to use transform, but this failed because transform doesn't seem to recognize dplyr groups, and it would be suboptimal even if it worked because it requires me to know that 2001 is the second year in the time series.
> table %>%
arrange(Country, Year) %>%
gather(variable, value, -Year, -Country) %>%
group_by(Country, variable) %>%
transform(norm=value*100/value[2])
Year Country variable value norm
1 2000 Mexico A 11.37096 108.9663
2 2001 Mexico A 10.43530 100.0000
3 2002 Mexico A 12.36313 118.4741
4 2003 Mexico A 13.63286 130.6418
5 2004 Mexico A 14.40427 138.0340
6 2000 USA A 31.30487 299.9901
7 2001 USA A 33.28665 318.9811
8 2002 USA A 30.61114 293.3422
9 2003 USA A 32.72121 313.5627
10 2004 USA A 33.86668 324.5395
11 2000 Mexico B 19.89388 190.6402
12 2001 Mexico B 22.51152 215.7247
13 2002 Mexico B 21.90534 209.9157
14 2003 Mexico B 25.01842 239.7480
15 2004 Mexico B 23.93729 229.3876
16 2000 USA B 40.63595 389.4085
17 2001 USA B 40.71575 390.1732
18 2002 USA B 39.34354 377.0235
19 2003 USA B 40.55953 388.6762
20 2004 USA B 45.32011 434.2961
It would be nice for this to be more scalable, but here's a simple solution. You can refer to A[Year == 2001] inside mutate, much as you might do table$A[table$Year == 2001] in base R. This lets you scale against your baseline of 2001 or whatever other year you might need.
Edit: I was missing a group_by to ensure that values are only being scaled against other values in their own group. The "sanity check" (that I clearly didn't do) is that values for Mexico in 2001 should have a scaled value of 1, and same for USA and any other countries.
library(tidyverse)
set.seed(42)
mexico <- tibble(Year=2000:2004, Country='Mexico', A=10:14+rnorm(5), B=20:24+rnorm(5))
usa <- tibble(Year=2000:2004, Country='USA', A=30:34+rnorm(5), B=40:44+rnorm(5))
table <- rbind(mexico, usa)
table %>%
group_by(Country) %>%
mutate(A_base2001 = A / A[Year == 2001], B_base2001 = B / B[Year == 2001])
#> # A tibble: 10 x 6
#> # Groups: Country [2]
#> Year Country A B A_base2001 B_base2001
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 Mexico 11.4 19.9 1.09 0.884
#> 2 2001 Mexico 10.4 22.5 1 1
#> 3 2002 Mexico 12.4 21.9 1.18 0.973
#> 4 2003 Mexico 13.6 25.0 1.31 1.11
#> 5 2004 Mexico 14.4 23.9 1.38 1.06
#> 6 2000 USA 31.3 40.6 0.940 0.998
#> 7 2001 USA 33.3 40.7 1 1
#> 8 2002 USA 30.6 39.3 0.920 0.966
#> 9 2003 USA 32.7 40.6 0.983 0.996
#> 10 2004 USA 33.9 45.3 1.02 1.11
Created on 2018-05-23 by the reprex package (v0.2.0).
Inspired by Camille's answer, I found one simple approach that scales well:
table %>%
  gather(variable, value, -Year, -Country) %>%
  group_by(Country, variable) %>%
  mutate(value = 100 * value / value[Year == 2001]) %>%
  spread(variable, value)
# A tibble: 10 x 4
# Groups:   Country [2]
Year Country A B
<int> <chr> <dbl> <dbl>
1 2000 Mexico 109. 88.4
2 2000 USA 94.0 99.8
3 2001 Mexico 100. 100
4 2001 USA 100 100
5 2002 Mexico 118. 97.3
6 2002 USA 92.0 96.6
7 2003 Mexico 131. 111.
8 2003 USA 98.3 99.6
9 2004 Mexico 138. 106.
10 2004 USA 102. 111.
Preserving the original values alongside the scaled ones takes more work. Here are two approaches. One uses an extra gather call to produce two variable-name columns (one indicating the series name, the other marking original or scaled), then unifies them into one column and reformats.
table %>%
  gather(variable, original, -Year, -Country) %>%
  group_by(Country, variable) %>%
  mutate(scaled = 100 * original / original[Year == 2001]) %>%
  gather(scaled, value, -Year, -Country, -variable) %>%
  unite(variable_scaled, variable, scaled, sep = '_') %>%
  mutate(variable_scaled = gsub("_original", "", variable_scaled)) %>%
  spread(variable_scaled, value)
# A tibble: 10 x 6
# Groups:   Country [2]
Year Country A A_scaled B B_scaled
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 11.4 109. 19.9 88.4
2 2000 USA 31.3 94.0 40.6 99.8
3 2001 Mexico 10.4 100. 22.5 100
4 2001 USA 33.3 100 40.7 100
5 2002 Mexico 12.4 118. 21.9 97.3
6 2002 USA 30.6 92.0 39.3 96.6
7 2003 Mexico 13.6 131. 25.0 111.
8 2003 USA 32.7 98.3 40.6 99.6
9 2004 Mexico 14.4 138. 23.9 106.
10 2004 USA 33.9 102. 45.3 111.
A second, equivalent approach creates a new table with the columns scaled "in place" and then merges it back with the original one.
table %>%
  gather(variable, value, -Year, -Country) %>%
  group_by(Country, variable) %>%
  mutate(value = 100 * value / value[Year == 2001]) %>%
  ungroup() %>%
  mutate(variable = paste(variable, 'scaled', sep = '_')) %>%
  spread(variable, value) %>%
  inner_join(table)
Joining, by = c("Year", "Country")
# A tibble: 10 x 6
Year Country A_scaled B_scaled A B
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Mexico 109. 88.4 11.4 19.9
2 2000 USA 94.0 99.8 31.3 40.6
3 2001 Mexico 100. 100 10.4 22.5
4 2001 USA 100 100 33.3 40.7
5 2002 Mexico 118. 97.3 12.4 21.9
6 2002 USA 92.0 96.6 30.6 39.3
7 2003 Mexico 131. 111. 13.6 25.0
8 2003 USA 98.3 99.6 32.7 40.6
9 2004 Mexico 138. 106. 14.4 23.9
10 2004 USA 102. 111. 33.9 45.3
It's possible to replace the final inner_join with arrange(Country, Year) %>% select(-Country, -Year) %>% bind_cols(table), which may perform better for some data sets, though it orders the columns suboptimally.
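With more recent dplyr (>= 1.0), mutate() with across() handles many value columns at once with no reshaping at all; a sketch under that assumption, adding new *_scaled columns:
library(dplyr)
table %>%
  group_by(Country) %>%
  mutate(across(c(A, B), ~ 100 * .x / .x[Year == 2001],
                .names = "{.col}_scaled")) %>%
  ungroup()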

Calculate the percent occurrence of a variable in multiple groups

Sample data
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))
The data frame has 1000 locations × 35 years of data for a variable called month.id, which is basically the month of a year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980:
month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1 2 3 4 8 9 10 12
106 132 116 122 114 130 141 139
To calculate the percent occurrence of months:
table(month.vec$month.id)/length(month.vec$month.id) * 100
1 2 3 4 8 9 10 12
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9
I want to have a table something like this:
year month percent
1980 1 10.6
1980 2 13.2
1980 3 11.6
1980 4 12.2
1980 5 NA
1980 6 NA
1980 7 NA
1980 8 11.4
1980 9 13
1980 10 14.1
1980 11 NA
1980 12 13.9
Since months 5, 6, 7, and 11 are missing, I just want to add additional rows with NAs for those months. If possible, I would like a dplyr solution, something like this:
library(dplyr)
df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)
Solution using dplyr and tidyr:
# To get month as an integer, use (or add as.integer inside mutate):
# df$month.id <- as.integer(df$month.id)
library(dplyr)
library(tidyr)
df %>%
  group_by(year, month.id) %>%
  # Count occurrences per year & month
  summarise(n = n()) %>%
  # Get percent per month (the yearly total is sum(n))
  mutate(percent = n / sum(n) * 100) %>%
  # Fill in missing months
  complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
  select(year, month.id, percent)
year month.id percent
<int> <dbl> <dbl>
1 1980 1.00 10.6
2 1980 2.00 13.2
3 1980 3.00 11.6
4 1980 4.00 12.2
5 1980 5.00 0
6 1980 6.00 0
7 1980 7.00 0
8 1980 8.00 11.4
9 1980 9.00 13.0
10 1980 10.0 14.1
11 1980 11.0 0
12 1980 12.0 13.9
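The desired output above uses NA rather than 0 for the missing months; dropping the fill argument gives exactly that, since complete() inserts NA by default. A minimal variant of the pipeline above:
library(dplyr)
library(tidyr)
df %>%
  group_by(year, month.id) %>%
  summarise(n = n()) %>%
  mutate(percent = n / sum(n) * 100) %>%
  complete(year, month.id = 1:12) %>%  # missing months get NA for percent
  select(year, month.id, percent)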
A base R solution:
tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)
which gives:
> dfnew
Var1 Var2 Freq
1 1980 1 10.6
2 1980 2 13.2
3 1980 3 11.6
4 1980 4 12.2
5 1980 5 0.0
6 1980 6 0.0
7 1980 7 0.0
8 1980 8 11.4
9 1980 9 13.0
10 1980 10 14.1
11 1980 11 0.0
12 1980 12 13.9
Or with data.table:
library(data.table)
setDT(month.vec)[, .N, by = .(year, month.id)
][.(year = 1980, month.id = 1:12), on = .(year, month.id)
][, N := 100 * N/sum(N, na.rm = TRUE)][]
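The chain above fills in the 12 months for 1980 only; to cover every year, the join table can be built with CJ() and the percentage computed per year. A sketch along the same lines, using the sample df from the question:
library(data.table)
dt <- as.data.table(df)
dt[, .N, by = .(year, month.id)
   ][CJ(year = unique(dt$year), month.id = 1:12), on = .(year, month.id)
   ][, percent := 100 * N / sum(N, na.rm = TRUE), by = year][]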
