Annual moving window over a data frame - r

I have a data frame of discharge data. Below is a reproducible example:
library(lubridate)
Date <- sample(seq(as.Date('1981/01/01'), as.Date('1982/12/31'), by="day"), 24)
Date <- sort(Date, decreasing = F)
Station <- rep(as.character("A"), 24)
Discharge <- rnorm(n = 24, mean = 1, 1)
df <- cbind.data.frame(Station, Date, Discharge)
df$Year <- year(df$Date)
df$Month <- month(df$Date)
df$Day <- day(df$Date)
The output:
> df
Station Date Discharge Year Month Day
1 A 1981-01-23 0.75514968 1981 1 23
2 A 1981-02-17 -0.08552776 1981 2 17
3 A 1981-03-20 1.47586712 1981 3 20
4 A 1981-04-26 3.64823544 1981 4 26
5 A 1981-05-22 1.21880453 1981 5 22
6 A 1981-05-23 2.19482857 1981 5 23
7 A 1981-07-02 -0.13598754 1981 7 2
8 A 1981-07-23 0.12365626 1981 7 23
9 A 1981-07-24 2.12557882 1981 7 24
10 A 1981-09-02 2.79879494 1981 9 2
11 A 1981-09-04 1.67926948 1981 9 4
12 A 1981-11-06 0.49720784 1981 11 6
13 A 1981-12-21 -0.25272271 1981 12 21
14 A 1982-04-08 1.39706157 1982 4 8
15 A 1982-04-19 -0.13965981 1982 4 19
16 A 1982-05-26 0.55238425 1982 5 26
17 A 1982-06-23 3.94639154 1982 6 23
18 A 1982-06-25 -0.03415929 1982 6 25
19 A 1982-07-15 1.00996167 1982 7 15
20 A 1982-09-11 3.18225186 1982 9 11
21 A 1982-10-17 0.30875497 1982 10 17
22 A 1982-10-30 2.26209011 1982 10 30
23 A 1982-11-06 0.34430489 1982 11 6
24 A 1982-11-19 2.28251458 1982 11 19
What I need is a moving-window function written in base R. I have tried the runner package, but it is not flexible enough. The window (say, of width 3) should take 3 rows at a time and calculate the mean discharge, continuing until the last date of 1981. A new window should then start with 1982 and do the same. How should I approach this?

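Using base R only:
w <- 3  # window width (number of rows)
df$DischargeM <- sapply(seq_len(nrow(df)), function(x) {
  tmp <- NA
  if (x >= w) {
    idx <- (x - w + 1):x
    # only average when the whole window falls within a single year
    if (length(unique(df$Year[idx])) == 1) {
      tmp <- mean(df$Discharge[idx])
    }
  }
  tmp
})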
Station Date Discharge Year Month Day DischargeM
1 A 1981-01-21 2.0009355 1981 1 21 NA
2 A 1981-02-11 0.5948567 1981 2 11 NA
3 A 1981-04-17 0.2637090 1981 4 17 0.95316705
4 A 1981-04-18 3.9180253 1981 4 18 1.59219699
5 A 1981-05-09 -0.2589129 1981 5 9 1.30760712
6 A 1981-07-05 1.1055913 1981 7 5 1.58823456
7 A 1981-07-11 0.7561600 1981 7 11 0.53427946
8 A 1981-07-22 0.0978999 1981 7 22 0.65321706
9 A 1981-08-04 0.5410163 1981 8 4 0.46502541
10 A 1981-08-13 -0.5044425 1981 8 13 0.04482458
11 A 1981-10-06 1.5954315 1981 10 6 0.54400178
12 A 1981-11-08 -0.5757041 1981 11 8 0.17176164
13 A 1981-12-24 1.3892440 1981 12 24 0.80299047
14 A 1982-01-07 1.9363874 1982 1 7 NA
15 A 1982-02-20 1.4340554 1982 2 20 NA
16 A 1982-05-29 0.4536461 1982 5 29 1.27469632
17 A 1982-06-10 2.9776761 1982 6 10 1.62179253
18 A 1982-06-17 1.6371733 1982 6 17 1.68949847
19 A 1982-06-28 1.7585579 1982 6 28 2.12446908
20 A 1982-08-17 0.8297518 1982 8 17 1.40849432
21 A 1982-09-21 1.6853808 1982 9 21 1.42456348
22 A 1982-11-13 0.6066167 1982 11 13 1.04058309
23 A 1982-11-16 1.4989263 1982 11 16 1.26364126
24 A 1982-11-28 0.2273658 1982 11 28 0.77763625
Make sure df is sorted by Date first, because the window is defined over consecutive rows.
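As a more compact base R variant of the loop above, the following sketch combines ave() (to split by year) with stats::filter() (for the trailing mean). The helper name roll_mean_right and the small toy data frame are assumptions made for illustration; they are not part of the original answer.

```r
# Toy data (hypothetical values, chosen so the means are easy to check)
df <- data.frame(
  Year      = c(1981, 1981, 1981, 1981, 1982, 1982, 1982),
  Discharge = c(1, 2, 3, 4, 10, 20, 30)
)

# Trailing (right-aligned) mean over w values; NA until the window is full.
# roll_mean_right is a hypothetical helper name.
roll_mean_right <- function(x, w) {
  # stats::filter() errors if the filter is longer than the series
  if (length(x) < w) return(rep(NA_real_, length(x)))
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
}

# ave() applies the helper separately within each year, so no window
# ever mixes rows from two different years
df$DischargeM <- ave(df$Discharge, df$Year,
                     FUN = function(x) roll_mean_right(x, 3))
df$DischargeM
```

The first two rows of each year come out as NA, as in the loop version, because the window is not yet full.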

You can do this by using dplyr and the rollmean or rollmeanr function from zoo.
You group the data by year, and apply the rollmeanr in a mutate function.
library(dplyr)
df %>%
group_by(Year) %>%
mutate(avg = zoo::rollmeanr(Discharge, k = 3, fill = NA))
# A tibble: 24 x 7
# Groups: Year [2]
Station Date Discharge Year Month Day avg
<chr> <date> <dbl> <dbl> <dbl> <int> <dbl>
1 A 1981-01-04 1.00 1981 1 4 NA
2 A 1981-03-26 0.0468 1981 3 26 NA
3 A 1981-03-28 0.431 1981 3 28 0.494
4 A 1981-05-04 1.30 1981 5 4 0.593
5 A 1981-08-26 2.06 1981 8 26 1.26
6 A 1981-10-14 1.09 1981 10 14 1.48
7 A 1981-12-10 1.28 1981 12 10 1.48
8 A 1981-12-23 0.668 1981 12 23 1.01
9 A 1982-01-02 -0.333 1982 1 2 NA
10 A 1982-04-13 0.800 1982 4 13 NA
# ... with 14 more rows

Kindly let me know if this is what you were anticipating
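Base version (the grouping uses base R's ave(), but rollapply() comes from the zoo package):
library(zoo)
result <- transform(df,
  Discharge_mean = ave(Discharge, Year,
    FUN = function(x) rollapply(x, width = 3, mean, align = 'right', fill = NA))
)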
dplyr version:
library(dplyr)
result <- df %>%
  group_by(Year) %>%
  mutate(Discharge_mean = zoo::rollapply(Discharge, 3, mean, align = 'right', fill = NA))
Output:
> result
Station Date Discharge Year Month Day Discharge_mean
1 A 1981-01-09 0.560448487 1981 1 9 NA
2 A 1981-01-17 0.006777809 1981 1 17 NA
3 A 1981-02-08 2.008959399 1981 2 8 0.8587286
4 A 1981-02-21 1.166452993 1981 2 21 1.0607301
5 A 1981-04-12 3.120080595 1981 4 12 2.0984977
6 A 1981-04-24 2.647325960 1981 4 24 2.3112865
7 A 1981-05-01 0.764980310 1981 5 1 2.1774623
8 A 1981-05-20 2.203700845 1981 5 20 1.8720024
9 A 1981-06-19 0.519390897 1981 6 19 1.1626907
10 A 1981-07-06 1.704146872 1981 7 6 1.4757462
# 14 more rows

Related

Calculating the compound annual growth rate

I'm trying to calculate the compound annual growth rate of my data (snippet shown below). Does anyone know the best way to do this, or whether there is a function that does part of the job?
Data (only the preds column matters here; the others can be ignored):
year month timestep ymin ymax preds date
1 1998 1 1 17.84037 18.58553 18.21295 1998-01-01
2 1998 2 2 17.05009 17.70642 17.37826 1998-02-01
3 1998 3 3 16.97067 17.61320 17.29193 1998-03-01
4 1998 4 4 18.38551 19.00838 18.69695 1998-04-01
5 1998 5 5 21.39082 21.97338 21.68210 1998-05-01
6 1998 6 6 24.77679 25.35464 25.06571 1998-06-01
7 1998 7 7 27.27057 27.82818 27.54938 1998-07-01
8 1998 8 8 28.24703 28.76702 28.50702 1998-08-01
9 1998 9 9 27.72370 28.24619 27.98494 1998-09-01
10 1998 10 10 25.83783 26.33969 26.08876 1998-10-01
11 1998 11 11 22.94968 23.42268 23.18618 1998-11-01
12 1998 12 12 19.50499 20.05466 19.77982 1998-12-01
13 1999 1 13 17.98323 18.50530 18.24426 1999-01-01
14 1999 2 14 17.20124 17.61746 17.40935 1999-02-01
15 1999 3 15 17.11064 17.53492 17.32278 1999-03-01
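No answer is recorded here, so the following is only a sketch of the usual definition, CAGR = (end / start)^(1 / years) - 1, applied to two preds values taken from the rows shown. The function name cagr is hypothetical, and comparing January 1998 with January 1999 (one year apart) is an assumption about what the asker wants.

```r
# Compound annual growth rate between a start and an end value
# separated by `years` years (cagr is a hypothetical helper name)
cagr <- function(start, end, years) {
  (end / start)^(1 / years) - 1
}

# preds for 1998-01-01 and 1999-01-01, copied from the rows shown above
start_value <- 18.21295
end_value   <- 18.24426
growth <- cagr(start_value, end_value, years = 1)
growth
```

With monthly data that has a strong seasonal cycle, comparing the same calendar month in the first and last year is probably safer than comparing the first and last rows.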

Is it possible to make groups based on an ID of a person in R?

I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the persons by family in a new column, so that persons 27101 and 27102 (siblings) are family 1, 42101 and 42102 are family 2, 117103, 117104 and 117105 are family 3, and so on.
Person "4102" has no siblings and should be NA in the new column.
Two or more persons are always siblings if their IDs are no more than 6 apart.
I have a far larger dataset with over 3000 rows. How could I do it the most efficient way?
You can use round() with digits = -1 to round each ID to the nearest ten (use digits = -2 if the IDs within a family can span more than ten values). If you want the group IDs to be consecutive integers starting from 1, you can use cur_group_id():
library(dplyr)
data %>%
group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 4
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the thousands digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
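Neither snippet above returns NA for people without siblings, which the question asks for. A base R sketch of that rule follows, assuming that IDs belong to the same family when consecutive sorted IDs are at most 6 apart; max_gap and the column name family are illustrative choices, not part of either answer.

```r
data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102,
              56102, 73102, 74103, 103104, 117103, 117104, 117105),
  birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000,
                1994, 1999, 1978, 1986, 1998, 1999)
)

max_gap <- 6  # assumed maximum ID distance between neighbouring siblings

# Work on sorted IDs; start a new run whenever the gap to the
# previous ID exceeds max_gap
ord <- order(data$id_pers)
run_sorted <- cumsum(c(TRUE, diff(data$id_pers[ord]) > max_gap))

# Map the run labels back to the original row order
run <- integer(nrow(data))
run[ord] <- run_sorted

# Runs of length 1 are people without siblings: label them NA
run_size <- ave(run, run, FUN = length)
multi <- ifelse(run_size > 1, run, NA)

# Renumber the remaining runs as 1, 2, 3, ... in order of appearance
data$family <- match(multi, unique(multi[!is.na(multi)]))
data
```

Because the rule is applied to neighbouring IDs, a long chain of IDs that are each within 6 of the next ends up in one family, even if its first and last IDs are further apart.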

How to calculate the number of months from the initial date for each individual

This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
For each individual, I want to calculate the number of months elapsed since their initial date, using two variables: year and month.
First, I order by year and month:
mydata2<-mydata%>%group_by(ID,year)%>%arrange(year,month,.by_group=T)
Then I created a date variable, assuming the day is always 01:
mydata2$date<-paste("01",mydata2$month,mydata2$year,sep = "-")
Then I used lubridate to convert this variable to Date format:
mydata2$date<-dmy(mydata2$date)
After this, I don't know how to get a dataset like the one below (preferably using dplyr):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
One way could be:
mydata %>%
group_by(ID) %>%
mutate(date = as.Date(sprintf('%d-%d-01',year, month)),
diff = as.numeric(round((date - date[1])/365*12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
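Dividing a day difference by 365 only approximates months, and date[1] is the earliest date only if the rows are already sorted within each ID. A base R sketch that avoids both issues counts months directly from year and month; the name month_index is an assumption for illustration.

```r
ID <- c(rep(1, 10), rep(2, 8))
year <- c(2007, 2007, 2007, 2008, 2008, 2009, 2010, 2009, 2010, 2011,
          2008, 2008, 2009, 2010, 2009, 2010, 2011, 2011)
month <- c(2, 7, 12, 4, 11, 6, 11, 1, 9, 4, 3, 6, 7, 4, 9, 11, 2, 8)
mydata <- data.frame(ID, year, month)

# Sort chronologically within each individual
mydata <- mydata[order(mydata$ID, mydata$year, mydata$month), ]

# Months on a single continuous scale: 12 per year plus the month number
month_index <- mydata$year * 12 + mydata$month

# Subtract each individual's earliest month index
mydata$dif_from_init <- month_index - ave(month_index, mydata$ID, FUN = min)
mydata
```

Taking the minimum per ID rather than the first row means the result does not depend on the row order.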

R - calculate annual population conditional on survival in every year

I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate total alive male and female population for every year in a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
A person is alive in year x if death_year > x & birth_year <= x
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>% mutate(year = map2(1950,1980, seq)) %>% unnest(year) %>%
mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1, TRUE ~ 0)) %>%
group_by(year, gender) %>% summarise(alive = sum(isalive)) %>%
pivot_wider(names_from = gender, values_from = alive) %>% print( n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector will get you your count of alive or dead because TRUE is 1 and FALSE is 0.
number_alive <- function(range, df){
sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}
output <- data.frame('year' = 1950:1980,
'female' = number_alive(1950:1980, df[df$gender == 'female', ]),
'male' = number_alive(1950:1980, df[df$gender == 'male', ]))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
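The same counting idea can be written without naming each gender by splitting the data frame first. This is a sketch under the question's rule (alive in year x if death_year > x and birth_year <= x); the variable names years and counts are illustrative.

```r
df <- read.table(text = "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female", header = TRUE)

years <- 1950:1980

# One column per gender: for each year, sum the logical "alive" vector
counts <- sapply(split(df, df$gender), function(g) {
  sapply(years, function(x) sum(g$death_year > x & g$birth_year <= x))
})

output <- data.frame(year = years, counts)
head(output)
```

The columns are named after the values found in gender, so additional categories would get their own column automatically.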
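This approach uses ifelse to flag each person-year as alive (1) or dead (0). Note that its condition, year <= death_year, counts a person as alive in their death year, so the 1956 male count below is 3 rather than the 2 given by the question's rule death_year > x.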
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year <= death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 3 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a running day count (I'm not sure whether "lump sum" is the right term) in a new column "date", so that the new column counts the days cumulatively across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote the code below, but the result is wrong and the code is too long. It doesn't handle February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
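If the starting point should follow the data instead of being hard-coded, the same Date arithmetic can be wrapped in a small function. This is a sketch: day_number is a hypothetical helper name, and counting from January 1 of the earliest year in the input is an assumption about the desired origin.

```r
# Running day number, where day 1 is January 1 of the earliest year given
day_number <- function(year, month, day) {
  date <- as.Date(sprintf("%04d-%02d-%02d",
                          as.integer(year), as.integer(month), as.integer(day)))
  # The day before January 1 of the first year, so that January 1 maps to 1
  origin <- as.Date(sprintf("%04d-01-01", min(as.integer(year)))) - 1
  as.integer(date - origin)
}

# Three rows from the data above: 2011-01-05, 2012-02-24, 2013-06-24
day_number(c(2011, 2012, 2013), c(1, 2, 6), c(5, 24, 24))
```

Leap years need no special handling here, because the subtraction is done on Date objects.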
