I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate total alive male and female population for every year in a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
A person is alive in year x if death_year > x & birth_year <= x.
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>% mutate(year = map2(1950,1980, seq)) %>% unnest(year) %>%
mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1, TRUE ~ 0)) %>%
group_by(year, gender) %>% summarise(alive = sum(isalive)) %>%
pivot_wider(names_from = gender, values_from = alive) %>% print( n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector will get you your count of alive or dead because TRUE is 1 and FALSE is 0.
number_alive <- function(range, df){
sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}
output <- data.frame('year' = 1950:1980,
'female' = number_alive(1950:1980, df[df$gender == 'female', ]),
'male' = number_alive(1950:1980, df[df$gender == 'male', ]))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
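To see the logical-sum idea on a single year, here is a minimal sketch using the nine sample rows from the question. The names alive_1956, total_1956 and by_gender_1956 are only illustrative.

```r
# Sample data from the question
df <- data.frame(
  birth_year = c(1934, 1922, 1890, 1901, 1946, 1909, 1899, 1887, 1902),
  death_year = c(1988, 1993, 1966, 1956, 2009, 1976, 1945, 1949, 1984),
  gender = c("male", "female", "male", "male", "female",
             "female", "male", "male", "female")
)

# TRUE where the person is alive in 1956 (born by then, not yet dead)
alive_1956 <- df$birth_year <= 1956 & df$death_year > 1956

# sum() treats TRUE as 1 and FALSE as 0, so this counts the living
total_1956 <- sum(alive_1956)

# The same count split by gender
by_gender_1956 <- tapply(alive_1956, df$gender, sum)
```

For 1956 this should agree with row 7 of the output above (4 female, 2 male).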
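This approach uses ifelse to flag each person as alive (1) or dead (0) in each year. Note that the condition in the code uses year <= death_year, so the year of death still counts as alive; use year < death_year to match the definition in the question.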
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year <= death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 3 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)
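The choice between < and <= only matters in a person's year of death. A small sketch of the difference, using three of the males from the sample data; count_alive is a hypothetical helper, not part of any package.

```r
birth_year <- c(1934, 1890, 1901)
death_year <- c(1988, 1966, 1956)

# Hypothetical helper: count people alive in `year`.
# inclusive = TRUE treats the year of death as still alive.
count_alive <- function(year, inclusive) {
  not_yet_dead <- if (inclusive) year <= death_year else year < death_year
  sum(birth_year <= year & not_yet_dead)
}

strict_1956    <- count_alive(1956, inclusive = FALSE)  # definition in the question
inclusive_1956 <- count_alive(1956, inclusive = TRUE)   # year <= death_year
```

The person who died in 1956 is counted only by the inclusive version, which is why the male count for 1956 differs between the answers.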
So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
|id | year | profit
|:----| :----| :----|
|1 |2001 |-0.4
|1 |2002 |-0.89
|2 |2001 |1.89
|2 |2002 |2.79
Each firm appears twice: one line holds the 2001 data and the second the 2002 data (the "id" value is the same on both lines because it is the same firm one year apart).
How can I calculate the annual rate of change for each firm ("id") between 2001 and 2002?
I'm really new to R and I don't see where to start. Should I separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))
years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want the rate of change (ROC) of the profit values between 2001 and 2002:
ROC=(current_value/previous_value − 1) ∗ 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
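As a cross-check of the ROC formula, here is a base R sketch on the four rows from the question. It assumes the data frame is called group (as in the question's own attempt); wide and evol_rate are illustrative names.

```r
# The four rows shown in the question
group <- data.frame(
  id     = c(1, 1, 2, 2),
  year   = c(2001, 2002, 2001, 2002),
  profit = c(-0.4, -0.89, 1.89, 2.79)
)

# Put 2001 and 2002 side by side, matched on id
wide <- merge(
  group[group$year == 2001, c("id", "profit")],
  group[group$year == 2002, c("id", "profit")],
  by = "id", suffixes = c("_2001", "_2002")
)

# ROC = (current_value / previous_value - 1) * 100
wide$evol_rate <- (wide$profit_2002 / wide$profit_2001 - 1) * 100
```

Firm 2 comes out at roughly 47.6. Firm 1 comes out at 122.5 even though its profit fell from -0.4 to -0.89: a percentage change is hard to interpret when the base value is negative, so such rows may need separate handling.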
If you want to do it without reshaping your data into a wide format you can use
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag() to compare each value with the one in the previous row, so the rows must be sorted by year within each id.
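The same grouped lag can be sketched in base R with ave(). The rows here are deliberately shuffled to show why sorting matters; the values are the first two firms from the output above.

```r
# Two firms, rows intentionally out of order
df <- data.frame(
  id    = c(2, 1, 1, 2),
  year  = c(2002, 2002, 2001, 2001),
  value = c(1091, 1648, 1173, 1560)
)

# Sort by id, then year, so "previous row" means "previous year"
df <- df[order(df$id, df$year), ]

# Within each id, divide each value by the one before it
df$ROC <- ave(df$value, df$id,
              FUN = function(v) (v / c(NA, head(v, -1)) - 1) * 100)
```

Without the order() step, the "previous" value could belong to a later year and the sign of the result would flip.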
I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the persons by family in a new column, so that persons 27101 and 27102 (siblings) are group/family 1, 42101 and 42102 are group 2, 117103, 117104 and 117105 are group 3, and so on.
Person "4102" has no siblings and should get NA in the new column.
Two or more persons are always siblings if their IDs are no more than 6 apart.
I have a far larger dataset with over 3000 rows. What is the most efficient way to do this?
You can use round with digits = -1 (or -2 if a family can have more than 10 members). If you want the group ids to be integers starting from 1, you can use cur_group_id:
library(dplyr)
data %>%
group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 4
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
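Neither answer leaves persons without siblings as NA, which the question asks for. A base R sketch that applies the "no more than 6 apart" rule directly; it assumes the rule can be checked between neighbouring IDs after sorting, and run_by_row and family are illustrative names.

```r
data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102,
              56102, 73102, 74103, 103104, 117103, 117104, 117105),
  birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001,
                2000, 1994, 1999, 1978, 1986, 1998, 1999)
)

ids <- data$id_pers
o <- order(ids)

# Start a new run whenever the gap to the previous sorted ID exceeds 6
run <- cumsum(c(TRUE, diff(ids[o]) > 6))

# Map the run numbers back to the original row order
run_by_row <- integer(length(ids))
run_by_row[o] <- run

# Keep only runs with at least two members and renumber them from 1
size <- ave(run_by_row, run_by_row, FUN = length)
keep <- size > 1
data$family <- NA_integer_
data$family[keep] <- match(run_by_row[keep], unique(run_by_row[keep]))
```

With the sample data, 28101 and 28102 also qualify as siblings under the stated rule, so they form a family of their own and the later families are numbered accordingly.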
I have a data frame of discharge data. Below is a reproducible example:
library(lubridate)
Date <- sample(seq(as.Date('1981/01/01'), as.Date('1982/12/31'), by="day"), 24)
Date <- sort(Date, decreasing = F)
Station <- rep(as.character("A"), 24)
Discharge <- rnorm(n = 24, mean = 1, 1)
df <- cbind.data.frame(Station, Date, Discharge)
df$Year <- year(df$Date)
df$Month <- month(df$Date)
df$Day <- day(df$Date)
The output:
> df
Station Date Discharge Year Month Day
1 A 1981-01-23 0.75514968 1981 1 23
2 A 1981-02-17 -0.08552776 1981 2 17
3 A 1981-03-20 1.47586712 1981 3 20
4 A 1981-04-26 3.64823544 1981 4 26
5 A 1981-05-22 1.21880453 1981 5 22
6 A 1981-05-23 2.19482857 1981 5 23
7 A 1981-07-02 -0.13598754 1981 7 2
8 A 1981-07-23 0.12365626 1981 7 23
9 A 1981-07-24 2.12557882 1981 7 24
10 A 1981-09-02 2.79879494 1981 9 2
11 A 1981-09-04 1.67926948 1981 9 4
12 A 1981-11-06 0.49720784 1981 11 6
13 A 1981-12-21 -0.25272271 1981 12 21
14 A 1982-04-08 1.39706157 1982 4 8
15 A 1982-04-19 -0.13965981 1982 4 19
16 A 1982-05-26 0.55238425 1982 5 26
17 A 1982-06-23 3.94639154 1982 6 23
18 A 1982-06-25 -0.03415929 1982 6 25
19 A 1982-07-15 1.00996167 1982 7 15
20 A 1982-09-11 3.18225186 1982 9 11
21 A 1982-10-17 0.30875497 1982 10 17
22 A 1982-10-30 2.26209011 1982 10 30
23 A 1982-11-06 0.34430489 1982 11 6
24 A 1982-11-19 2.28251458 1982 11 19
I need a moving-window function in base R. I tried the runner package, but it is not flexible enough. The window (say, of size 3) should take 3 rows at a time and calculate the mean discharge, continuing until the last date of 1981. A new window should then start in 1982 and do the same. How should I approach this?
Using base R only
w=3
df$DischargeM=sapply(1:nrow(df),function(x){
tmp=NA
if (x>=w) {
if (length(unique(df$Year[(x-w+1):x]))==1) {
tmp=mean(df$Discharge[(x-w+1):x])
}
}
tmp
})
Station Date Discharge Year Month Day DischargeM
1 A 1981-01-21 2.0009355 1981 1 21 NA
2 A 1981-02-11 0.5948567 1981 2 11 NA
3 A 1981-04-17 0.2637090 1981 4 17 0.95316705
4 A 1981-04-18 3.9180253 1981 4 18 1.59219699
5 A 1981-05-09 -0.2589129 1981 5 9 1.30760712
6 A 1981-07-05 1.1055913 1981 7 5 1.58823456
7 A 1981-07-11 0.7561600 1981 7 11 0.53427946
8 A 1981-07-22 0.0978999 1981 7 22 0.65321706
9 A 1981-08-04 0.5410163 1981 8 4 0.46502541
10 A 1981-08-13 -0.5044425 1981 8 13 0.04482458
11 A 1981-10-06 1.5954315 1981 10 6 0.54400178
12 A 1981-11-08 -0.5757041 1981 11 8 0.17176164
13 A 1981-12-24 1.3892440 1981 12 24 0.80299047
14 A 1982-01-07 1.9363874 1982 1 7 NA
15 A 1982-02-20 1.4340554 1982 2 20 NA
16 A 1982-05-29 0.4536461 1982 5 29 1.27469632
17 A 1982-06-10 2.9776761 1982 6 10 1.62179253
18 A 1982-06-17 1.6371733 1982 6 17 1.68949847
19 A 1982-06-28 1.7585579 1982 6 28 2.12446908
20 A 1982-08-17 0.8297518 1982 8 17 1.40849432
21 A 1982-09-21 1.6853808 1982 9 21 1.42456348
22 A 1982-11-13 0.6066167 1982 11 13 1.04058309
23 A 1982-11-16 1.4989263 1982 11 16 1.26364126
24 A 1982-11-28 0.2273658 1982 11 28 0.77763625
(make sure your df is ordered).
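A loop-free variant of the same idea, as a sketch: ave() splits by year and stats::filter() computes the trailing mean. The small data frame below is made up for illustration, and the approach assumes every year has at least w rows (stats::filter() stops with an error otherwise).

```r
w <- 3

# Toy data, already ordered by date within year
df <- data.frame(
  Year      = c(1981, 1981, 1981, 1981, 1982, 1982, 1982),
  Discharge = c(1, 2, 3, 4, 10, 20, 30)
)

# Trailing mean of the last w values; sides = 1 looks backwards only,
# so the first w - 1 rows of each year come out as NA
trailing_mean <- function(x) {
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
}

df$DischargeM <- ave(df$Discharge, df$Year, FUN = trailing_mean)
```

Because the filter runs separately within each year, no window ever mixes 1981 and 1982 values.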
You can do this by using dplyr and the rollmean or rollmeanr function from zoo.
You group the data by year, and apply the rollmeanr in a mutate function.
library(dplyr)
df %>%
group_by(Year) %>%
mutate(avg = zoo::rollmeanr(Discharge, k = 3, fill = NA))
# A tibble: 24 x 7
# Groups: Year [2]
Station Date Discharge Year Month Day avg
<chr> <date> <dbl> <dbl> <dbl> <int> <dbl>
1 A 1981-01-04 1.00 1981 1 4 NA
2 A 1981-03-26 0.0468 1981 3 26 NA
3 A 1981-03-28 0.431 1981 3 28 0.494
4 A 1981-05-04 1.30 1981 5 4 0.593
5 A 1981-08-26 2.06 1981 8 26 1.26
6 A 1981-10-14 1.09 1981 10 14 1.48
7 A 1981-12-10 1.28 1981 12 10 1.48
8 A 1981-12-23 0.668 1981 12 23 1.01
9 A 1982-01-02 -0.333 1982 1 2 NA
10 A 1982-04-13 0.800 1982 4 13 NA
# ... with 14 more rows
Kindly let me know if this is what you were anticipating
Base version (rollapply comes from the zoo package):
library(zoo)
result <- transform(df,
Discharge_mean = ave(Discharge, Year,
FUN = function(x) rollapply(x, width = 3, mean, align = 'right', fill = NA))
)
dplyr version:
library(dplyr)
library(zoo)
result <- df %>%
group_by(Year) %>%
mutate(Discharge_mean = rollapply(Discharge, 3, mean, align = 'right', fill = NA))
Output:
> result
Station Date Discharge Year Month Day Discharge_mean
1 A 1981-01-09 0.560448487 1981 1 9 NA
2 A 1981-01-17 0.006777809 1981 1 17 NA
3 A 1981-02-08 2.008959399 1981 2 8 0.8587286
4 A 1981-02-21 1.166452993 1981 2 21 1.0607301
5 A 1981-04-12 3.120080595 1981 4 12 2.0984977
6 A 1981-04-24 2.647325960 1981 4 24 2.3112865
7 A 1981-05-01 0.764980310 1981 5 1 2.1774623
8 A 1981-05-20 2.203700845 1981 5 20 1.8720024
9 A 1981-06-19 0.519390897 1981 6 19 1.1626907
10 A 1981-07-06 1.704146872 1981 7 6 1.4757462
# 14 more rows
I have the following dataset:
ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931
I need to summarise the data by decade (1920s and 1930s): one total per country (ireland, england and france) for 1920-1922, and another total per country for 1930-1931.
Any ideas? I have tried but failed.
Dataset:
x <- read.table(text = "ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931", header = T)
How about dividing the years by 10 and then summarizing?
library(dplyr)
x %>% mutate(decade = floor(year/10)*10) %>%
group_by(decade) %>%
summarize_all(sum) %>%
select(-year)
# A tibble: 2 x 4
# decade ireland england france
# <dbl> <int> <int> <int>
# 1 1920 15 8 7
# 2 1930 5 6 7
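The decade trick can also be checked without dplyr. A minimal base R sketch using integer division and rowsum(); decade and totals are illustrative names.

```r
x <- data.frame(
  ireland = c(5, 4, 6, 3, 2),
  england = c(3, 3, 2, 1, 5),
  france  = c(2, 4, 1, 5, 2),
  year    = c(1920, 1921, 1922, 1930, 1931)
)

# 1921 %/% 10 is 192, times 10 gives the decade 1920
decade <- x$year %/% 10 * 10

# Sum each country column within each decade;
# the decades end up as the row names of the result
totals <- rowsum(x[c("ireland", "england", "france")], group = decade)
```

Integer division gives the same result as floor(year / 10) for positive years.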
An R base solution
As A5C1D2H2I1M1N2O1R2T1 mentioned, you can use findInterval() to assign the corresponding decade to each year, and then aggregate() to group by decade.
txt <-
"ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931"
df <- read.table(text=txt, header=T)
decades <- c(1920, 1930, 1940)
df$decade<- decades[findInterval(df$year, decades)]
aggregate(cbind(ireland,england,france) ~ decade , data = df, sum)
Output:
decade ireland england france
1 1920 15 8 7
2 1930 5 6 7
I'm trying to sum columns 4 (child) ,5 (adult) and 6 (elderly) and return values for each country by year disregarding column 3 (sex). Reading through various forums I cannot combine these:
country year sex child adult elderly
1 Afghanistan 1995 male -1 -1 -1
2 Afghanistan 1996 female -1 -1 -1
3 Afghanistan 1996 male -1 -1 -1
4 Afghanistan 1997 female 5 96 1
5 Afghanistan 1997 male 0 26 0
6 Afghanistan 1998 female 45 1142 20
I was able to sum the 3 columns by row and create a separate column with the following but still need to combine the male and female rows for each country:
tuberculosiscases <-tuberculosis$child + tuberculosis$adult + tuberculosis$elderly
names(tuberculosiscases) <- c("tuberculosiscases")
tuberculosis <- data.frame(tuberculosis,tuberculosiscases)
head(tuberculosis)
country year sex child adult elderly tuberculosiscases
1 Afghanistan 1995 male -1 -1 -1 -3
2 Afghanistan 1996 female -1 -1 -1 -3
3 Afghanistan 1996 male -1 -1 -1 -3
4 Afghanistan 1997 female 5 96 1 102
5 Afghanistan 1997 male 0 26 0 26
6 Afghanistan 1998 female 45 1142 20 1207
If you want to add the sum to your dataframe, you have several options:
# with base R (1)
transform(dat, tuber.sum = ave(tuberculosiscases, country, year, FUN = sum))
# with base R (2)
dat$tuber.sum <- ave(dat$tuberculosiscases, dat$country, dat$year, FUN = sum)
# with the data.table package
library(data.table)
setDT(dat)[, tuber.sum:=sum(tuberculosiscases), by= .(country, year)]
# with the plyr package
library(plyr)
dat <- ddply(dat, .(country, year), transform, tuber.sum=sum(tuberculosiscases))
# with the dplyr package
library(dplyr)
dat <- dat %>%
group_by(country, year) %>%
mutate(tuber.sum=sum(tuberculosiscases))
all give:
> dat
country year sex child adult elderly tuberculosiscases tuber.sum
1: Afghanistan 1995 male -1 -1 -1 -3 -3
2: Afghanistan 1996 female -1 -1 -1 -3 -6
3: Afghanistan 1996 male -1 -1 -1 -3 -6
4: Afghanistan 1997 female 5 96 1 102 128
5: Afghanistan 1997 male 0 26 0 26 128
6: Afghanistan 1998 female 45 1142 20 1207 1207
If I understand your question correctly, and assuming that the initial data.frame is named my_df, I would use aggregate:
aggdata <-aggregate(my_df[,c("child", "adult", "elderly")],
by=list(my_df$country,my_df$year), FUN=sum, na.rm=TRUE)
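To get one row per country and year with a single total, the aggregate step can be followed by rowSums(). A sketch on the six rows shown in the question; by_year and total are illustrative names.

```r
# The six sample rows from the question
tuberculosis <- data.frame(
  country = "Afghanistan",
  year    = c(1995, 1996, 1996, 1997, 1997, 1998),
  sex     = c("male", "female", "male", "female", "male", "female"),
  child   = c(-1, -1, -1, 5, 0, 45),
  adult   = c(-1, -1, -1, 96, 26, 1142),
  elderly = c(-1, -1, -1, 1, 0, 20)
)

# Collapse male and female rows: one row per country and year
by_year <- aggregate(cbind(child, adult, elderly) ~ country + year,
                     data = tuberculosis, FUN = sum)

# Add the three age groups into a single total per row
by_year$total <- rowSums(by_year[c("child", "adult", "elderly")])
```

The totals should match the tuber.sum column in the earlier output (-3, -6, 128, 1207), but with one row per country and year instead of one per sex.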