Creating New Age group - r

I have data by single year of age (12 to 54) with related values for each (here year and ASFR). The years run from 1933 to 1987.
The data are structured like this:
year  age  Asfr
1933  12   .00004
1933  13   .00044
1933  14   .00177
1933  15   .00672
1933  16   .01875
1933  17   .03846
1933  18   .06586
1933  19   .08719
...   ...  ...
1933  49   .00037
1933  50   .00009
1933  51   .00003
1933  52   .00003
1933  53   .00003
1933  54   .00002
Now I need code that turns this data into age groups with the following structure:
"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"
The 15-19 age group should be the sum of ages 12, 13, 14, 15, 16, 17, 18, 19.
The 20-24 age group should be the sum of ages 20, 21, 22, 23, 24.
Finally, the last age group should be the sum of ages 45, 46, 47, 48, 49, 50, 51, 52, 53, 54.
I would really appreciate it if someone could help me. Thank you so much in advance.

You can use case_when from dplyr:
library(dplyr)
df %>%
  mutate(age_group = case_when(age %in% c(12:19) ~ "15-19",
                               age %in% c(20:24) ~ "20-24",
                               age %in% c(25:29) ~ "25-29",
                               age %in% c(30:34) ~ "30-34",
                               age %in% c(35:39) ~ "35-39",
                               age %in% c(40:44) ~ "40-44",
                               age %in% c(45:49) ~ "45-49",
                               age > 49 ~ "50+")) %>%
  group_by(age_group, year) %>%
  summarize(total_asfr = sum(Asfr),
            age_group_n = n()) %>%
  ungroup()
This gives us:
# A tibble: 5 × 4
  age_group  year total_asfr age_group_n
  <chr>     <int>      <dbl>       <int>
1 15-19      1933    0.0385            2
2 20-24      1933    0.00044           1
3 30-34      1933    0.00177           1
4 45-49      1933    0.00672           1
5 50+        1933    0.0188            1
Using sample data:
df <- structure(list(year = c(1933L, 1933L, 1933L, 1933L, 1933L, 1933L
), age = c(12L, 23L, 34L, 45L, 56L, 17L), Asfr = c(4e-05, 0.00044,
0.00177, 0.00672, 0.01875, 0.03846)),
row.names = c(NA, -6L),
class = "data.frame")

Here's a possible solution:
# Load the tidyverse (dplyr alone is enough)
library(tidyverse)
# Create the age groups, then group by year and age group
df %>%
  mutate(age_groups = cut(age,
                          breaks = c(12, 20, 25, 30, 35, 40, 45, 55),
                          right = FALSE)) %>%
  group_by(year, age_groups) %>%
  summarise(asfr_total = sum(Asfr))
You should see something like this:
year age_groups asfr_total
<dbl> <fct> <dbl>
1 1933 [12,20) 4.32
2 1933 [20,25) 2.33
3 1933 [25,30) 2.68
4 1933 [30,35) 2.89
5 1933 [35,40) 2.23
6 1933 [40,45) 2.85
7 1933 [45,55) 6.05
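The interval labels above can be swapped for the group names requested in the question by passing labels to cut(). The following is a minimal sketch, not a definitive implementation: the toy data (two years, a constant rate of 0.001) is invented for illustration, and the breaks assume ages run from 12 to 54 as the question states.

```r
library(dplyr)

# Toy data, invented for illustration: ages 12 to 54 for two years,
# with a constant rate so that group sums are easy to check by hand
df <- expand.grid(age = 12:54, year = c(1933L, 1934L))
df$Asfr <- 0.001

grouped <- df %>%
  mutate(age_group = cut(age,
                         # left-closed intervals: [12,20), [20,25), ..., [45,55)
                         breaks = c(12, 20, 25, 30, 35, 40, 45, 55),
                         labels = c("15-19", "20-24", "25-29", "30-34",
                                    "35-39", "40-44", "45-49"),
                         right = FALSE)) %>%
  group_by(year, age_group) %>%
  summarise(asfr_total = sum(Asfr), .groups = "drop")

grouped
```

With these breaks, ages 12 to 14 fall into "15-19" and ages 50 to 54 fall into "45-49", which is what the question asks for. Ages outside 12 to 54 would get NA as their group.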

Related

Transform data to long with grouped columns

For this week's TidyTuesday challenge, I am unable to group the column names in R, which I previously did with the pivot_longer function from tidyr. Here is my code; I do not understand why it throws an error and does not give what I want.
library(tidyverse)
tuesdata <- tidytuesdayR::tt_load(2023, week = 7)
age_gaps <- tuesdata$age_gaps
df_long <- age_gaps %>%
  pivot_longer(cols = actor_1_name:actor_2_name, names_to = "actornumber", values_to = "actorname") %>%
  pivot_longer(cols = character_1_gender:character_2_gender, names_to = "gendernumber", values_to = "gender") %>%
  pivot_longer(cols = actor_1_age:actor_2_age, names_to = "agenumber", values_to = "age") %>%
  select(movie_name, release_year, director, age_difference, actorname, gender, age)
The initial data has 1155 rows, so after this wrangling I expect 1155 x 2 = 2310 rows, with one row per actor holding the actor's name and related information such as age and birthdate. The code does not give this outcome. Why, and how can I solve the problem? Thank you in advance.
Example data (first 6 rows)
age_gaps <- structure(list(movie_name = c("Harold and Maude", "Venus", "The Quiet American",
"The Big Lebowski", "Beginners", "Poison Ivy"), release_year = c(1971,
2006, 2002, 1998, 2010, 1992), director = c("Hal Ashby", "Roger Michell",
"Phillip Noyce", "Joel Coen", "Mike Mills", "Katt Shea"), age_difference = c(52,
50, 49, 45, 43, 42), couple_number = c(1, 1, 1, 1, 1, 1), actor_1_name = c("Ruth Gordon",
"Peter O'Toole", "Michael Caine", "David Huddleston", "Christopher Plummer",
"Tom Skerritt"), actor_2_name = c("Bud Cort", "Jodie Whittaker",
"Do Thi Hai Yen", "Tara Reid", "Goran Visnjic", "Drew Barrymore"
), character_1_gender = c("woman", "man", "man", "man", "man",
"man"), character_2_gender = c("man", "woman", "woman", "woman",
"man", "woman"), actor_1_birthdate = structure(c(-26725, -13666,
-13442, -14351, -14629, -13278), class = "Date"), actor_2_birthdate = structure(c(-7948,
4536, 4656, 2137, 982, 1878), class = "Date"), actor_1_age = c(75,
74, 69, 68, 81, 59), actor_2_age = c(23, 24, 20, 23, 38, 17)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
age_gaps %>%
  pivot_longer(actor_1_name:actor_2_age,
               names_prefix = "(actor|character)_",
               names_to = c("actor", ".value"),
               names_sep = '_')
# A tibble: 12 × 10
movie_name release_year director age_difference couple_number actor name gender birthdate age
<chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <date> <dbl>
1 Harold and Maude 1971 Hal Ashby 52 1 1 Ruth Gordon woman 1896-10-30 75
2 Harold and Maude 1971 Hal Ashby 52 1 2 Bud Cort man 1948-03-29 23
3 Venus 2006 Roger Michell 50 1 1 Peter O'Toole man 1932-08-02 74
4 Venus 2006 Roger Michell 50 1 2 Jodie Whittaker woman 1982-06-03 24
5 The Quiet American 2002 Phillip Noyce 49 1 1 Michael Caine man 1933-03-14 69
6 The Quiet American 2002 Phillip Noyce 49 1 2 Do Thi Hai Yen woman 1982-10-01 20
7 The Big Lebowski 1998 Joel Coen 45 1 1 David Huddleston man 1930-09-17 68
8 The Big Lebowski 1998 Joel Coen 45 1 2 Tara Reid woman 1975-11-08 23
9 Beginners 2010 Mike Mills 43 1 1 Christopher Plummer man 1929-12-13 81
10 Beginners 2010 Mike Mills 43 1 2 Goran Visnjic man 1972-09-09 38
11 Poison Ivy 1992 Katt Shea 42 1 1 Tom Skerritt man 1933-08-25 59
12 Poison Ivy 1992 Katt Shea 42 1 2 Drew Barrymore woman 1975-02-22 17
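As for why the chained calls in the question do not give 2310 rows: each pivot_longer() over a pair of columns doubles the row count and pairs every value with every value from the other pivots, so three chained calls multiply the rows by eight. A minimal sketch on invented toy data (the `toy` data frame and its values are assumptions for illustration, with only two column pairs, so the chained version yields four times the rows):

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration: two movies, two actors each
toy <- data.frame(
  movie        = c("A", "B"),
  actor_1_name = c("Ann", "Cat"),
  actor_2_name = c("Bob", "Dan"),
  actor_1_age  = c(30, 40),
  actor_2_age  = c(25, 45)
)

# Chained pivots: every name is crossed with every age of the same movie
chained <- toy %>%
  pivot_longer(actor_1_name:actor_2_name,
               names_to = "actornumber", values_to = "actorname") %>%
  pivot_longer(actor_1_age:actor_2_age,
               names_to = "agenumber", values_to = "age")

# Single pivot with ".value": name and age stay paired per actor
paired <- toy %>%
  pivot_longer(actor_1_name:actor_2_age,
               names_prefix = "actor_",
               names_to = c("actor", ".value"),
               names_sep = "_")

nrow(chained)  # 2 movies x 2 names x 2 ages
nrow(paired)   # 2 movies x 2 actors
```

In the chained result, "Ann" appears once with her own age and once with Bob's, which is why the row count grows beyond one row per actor.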

Compute a variable divided by the year-over-year difference of another variable in R

In the data below I want to compute the following ratio: tr(year) / (op(year) - op(year-1)). I would appreciate an answer with dplyr.
year op tr cp
<chr> <dbl> <dbl> <dbl>
1 1984 10 39.1 38.3
2 1985 55 132. 77.1
3 1986 79 69.3 78.7
4 1987 78 47.7 74.1
5 1988 109 77.0 86.4
This is the expected output:
year2 ratio
1 1985 2.933333
2 1986 2.887500
3 1987 -47.700000
4 1988 2.483871
I have not managed to get any result.
Use lag:
library(dplyr)
df %>%
  mutate(year = year,
         ratio = tr / (op - lag(op)),
         .keep = "none") %>%
  tidyr::drop_na()
# year ratio
#2 1985 2.933333
#3 1986 2.887500
#4 1987 -47.700000
#5 1988 2.483871
We may use
library(dplyr)
df1 %>%
  reframe(year = year[-1], ratio = tr[-1] / diff(op))
Output:
year ratio
1 1985 2.933333
2 1986 2.887500
3 1987 -47.700000
4 1988 2.483871
Data:
df1 <- structure(list(year = 1984:1988, op = c(10L, 55L, 79L, 78L, 109L
), tr = c(39.1, 132, 69.3, 47.7, 77), cp = c(38.3, 77.1, 78.7,
74.1, 86.4)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5"))
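Both answers compute the same quantity: lag(op) shifts op down by one row, so op - lag(op) equals diff(op) with an NA in front. A small sketch that checks this on the data above (the names `with_lag` and `with_diff` are only illustrative, and the cp column is left out because it is not needed):

```r
library(dplyr)

df1 <- data.frame(year = 1984:1988,
                  op   = c(10, 55, 79, 78, 109),
                  tr   = c(39.1, 132, 69.3, 47.7, 77))

# Approach 1: lag() inside mutate(), then drop the first row (NA ratio)
with_lag <- df1 %>%
  mutate(ratio = tr / (op - lag(op))) %>%
  filter(!is.na(ratio))

# Approach 2: diff() on the vector, dropping the first tr value to align
with_diff <- df1$tr[-1] / diff(df1$op)

# Worked by hand for 1985: 132 / (55 - 10) = 2.933333
all.equal(with_lag$ratio, with_diff)
```

For 1988 the ratio is 77 / (109 - 78), which is positive because op rose from 78 to 109.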

Classify table based on value 'moving window' range and proportions?

I have a dataset of forest stands, each containing several tree layers of different age and volume.
I want to classify the stands as even- or uneven-aged, combining volume and age data. A forest is considered even-aged if more than 80% of the volume is allocated to age classes within 20 years of each other. How can I implement the 'within 20 years apart' condition? I can easily calculate the total volume and its share for individual tree layers (strat). But how do I check how many years apart the layers are? Is it some sort of moving window?
Dummy example:
# investigate volume by age classes?
library(dplyr)
df <- data.frame(stand = c("id1", "id1", "id1", "id1",
'id2', 'id2', 'id2'),
strat = c(1,2,3,4,
1,2,3),
v = c(4,10,15,20,
11,15,18),
age = c(5,10,65,80,
10,15,20))
# even-aged = more than 80% of the volume is allocated to layers within a 20-year range
df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v)) %>%
  mutate(V_share = v/V_tot*100)
Expected outcome:
stand strat v age V_tot V_share quality
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id1 1 4 5 49 8.16 uneven-aged
2 id1 2 10 10 49 20.4 uneven-aged
3 id1 3 15 65 49 30.6 uneven-aged
4 id1 4 20 80 49 40.8 uneven-aged #* because age classes 65 and 80, although less than 20 years apart, hold only about 71% of the total volume
5 id2 1 11 10 44 25 even-aged
6 id2 2 15 15 44 34.1 even-aged
7 id2 3 18 20 44 40.9 even-aged
Another tidyverse solution implementing a moving-window sum:
library(tidyverse)
df <- structure(list(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"), strat = c(1, 2, 3, 4, 1, 2, 3), v = c(4, 10, 15, 20, 11, 15, 18), age = c(5, 10, 65, 80, 10, 15, 20), V_tot = c(49, 49, 49, 49, 44, 44, 44), V_share = c(8.16326530612245, 20.4081632653061, 30.6122448979592, 40.8163265306122, 25, 34.0909090909091, 40.9090909090909)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))
df %>%
  group_by(stand) %>%
  mutate(range20 = map_dbl(age, ~ sum(V_share[which(abs(age - .x) <= 20)])),
         quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))
#> # A tibble: 7 × 8
#> # Groups: stand [2]
#> stand strat v age V_tot V_share range20 quality
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 id1 1 4 5 49 8.16 28.6 uneven-aged
#> 2 id1 2 10 10 49 20.4 28.6 uneven-aged
#> 3 id1 3 15 65 49 30.6 71.4 uneven-aged
#> 4 id1 4 20 80 49 40.8 71.4 uneven-aged
#> 5 id2 1 11 10 44 25 100 even-aged
#> 6 id2 2 15 15 44 34.1 100 even-aged
#> 7 id2 3 18 20 44 40.9 100 even-aged
Created on 2021-09-08 by the reprex package (v2.0.1)
Interesting issue. I think I have a solution using the runner package:
library(dplyr)
library(runner)
df %>%
  group_by(stand) %>%
  mutate(
    V_tot = sum(v),
    V_share = v/V_tot*100,
    # running sum of V_share over a window of 20 age units, indexed by age
    test = sum_run(
      V_share,
      k = 20L,
      idx = age,
      na_rm = TRUE,
      na_pad = FALSE
    ),
    quality = if_else(any(test >= 80), 'even-aged', 'uneven-aged')
  ) %>%
  select(-test)
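One thing to watch with a centered window such as abs(age - .x) <= 20: it can count layers that are up to 40 years apart (for example ages 40, 60 and 80 around the layer aged 60). If 'within 20 years apart' is meant as a span of at most 20 years between the youngest and oldest layer counted, a one-sided window starting at each layer's age may be closer to the intent. A minimal sketch under that assumption (the column name `window20` is hypothetical):

```r
library(dplyr)
library(purrr)

df <- data.frame(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"),
                 v     = c(4, 10, 15, 20, 11, 15, 18),
                 age   = c(5, 10, 65, 80, 10, 15, 20))

classified <- df %>%
  group_by(stand) %>%
  mutate(V_share = v / sum(v) * 100,
         # volume share of layers aged from this layer's age up to 20 years older
         window20 = map_dbl(age, ~ sum(V_share[age >= .x & age <= .x + 20])),
         # even-aged if any 20-year window holds more than 80% of the volume
         quality = if_else(any(window20 > 80), "even-aged", "uneven-aged")) %>%
  ungroup()

classified
```

On the dummy data this gives the same classification as the expected outcome: the best window for id1 (ages 65 and 80) holds about 71% of the volume, while the window starting at age 10 in id2 holds all of it.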

Summarise across each column by grouping their names

I want to calculate the weighted variance using the weights provided in the dataset, grouping by country and city. However, the function returns NAs:
library(Hmisc) # for the 'wtd.var' function
weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(across(starts_with("winter"), wtd.var))
The provided output from the console (when in long format):
# A tibble: 35 x 3
# Groups: country [35]
country capital_city winter
<chr> <chr> <dbl>
1 ALBANIA Tirane NA
2 AUSTRIA Vienna NA
3 BELGIUM Brussels NA
4 BULGARIA Sofia NA
5 CROATIA Zagreb NA
6 CYPRUS Nicosia NA
7 CZECHIA Prague NA
8 DENMARK Copenhagen NA
9 ESTONIA Tallinn NA
10 FINLAND Helsinki NA
# … with 25 more rows
This is the code that I used to get the data from a wide format into a long format:
weather_winter <- weather_winter %>% pivot_longer(-c(31:33))
weather_winter$name <- NULL
names(weather_winter)[4] <- "winter"
Some example data:
structure(list(`dec-wet_2011` = c(12.6199998855591, 12.6099996566772,
14.75, 11.6899995803833, 18.2899990081787), `dec-wet_2012` = c(13.6300001144409,
14.2199993133545, 14.2299995422363, 16.1000003814697, 18.0299987792969
), `dec-wet_2013` = c(4.67999982833862, 5.17000007629395, 4.86999988555908,
7.56999969482422, 5.96000003814697), `dec-wet_2014` = c(14.2999992370605,
14.4799995422363, 13.9799995422363, 15.1499996185303, 16.1599998474121
), `dec-wet_2015` = c(0.429999977350235, 0.329999983310699, 1.92999994754791,
3.30999994277954, 7.42999982833862), `dec-wet_2016` = c(1.75,
1.29999995231628, 3.25999999046326, 6.60999965667725, 8.67999935150146
), `dec-wet_2017` = c(13.3400001525879, 13.3499994277954, 15.960000038147,
10.6599998474121, 14.4699993133545), `dec-wet_2018` = c(12.210000038147,
12.4399995803833, 11.1799993515015, 10.75, 18.6299991607666),
`dec-wet_2019` = c(12.7199993133545, 13.3800001144409, 13.9899997711182,
10.5299997329712, 12.3099994659424), `dec-wet_2020` = c(15.539999961853,
16.5200004577637, 11.1799993515015, 14.7299995422363, 13.5499992370605
), `jan-wet_2011` = c(8.01999950408936, 7.83999967575073,
10.2199993133545, 13.8899993896484, 14.5299997329712), `jan-wet_2012` = c(11.5999994277954,
11.1300001144409, 12.5500001907349, 10.1700000762939, 22.6199989318848
), `jan-wet_2013` = c(17.5, 17.4099998474121, 15.5599994659424,
13.3199996948242, 20.9099998474121), `jan-wet_2014` = c(12.5099992752075,
12.2299995422363, 15.210000038147, 9.73999977111816, 9.63000011444092
), `jan-wet_2015` = c(17.6900005340576, 16.9799995422363,
11.75, 9.9399995803833, 19), `jan-wet_2016` = c(15.6099996566772,
15.5, 14.5099992752075, 10.3899993896484, 18.4499988555908
), `jan-wet_2017` = c(9.17000007629395, 9.61999988555908,
9.30999946594238, 15.8499994277954, 11.210000038147), `jan-wet_2018` = c(8.55999946594238,
9.10999965667725, 13.2599992752075, 9.85999965667725, 15.8899993896484
), `jan-wet_2019` = c(17.0699996948242, 16.8699989318848,
14.5699996948242, 19.0100002288818, 19.4699993133545), `jan-wet_2020` = c(6.75999975204468,
6.25999975204468, 6.00999975204468, 5.35999965667725, 8.15999984741211
), `feb-wet_2011` = c(9.1899995803833, 8.63999938964844,
6.21999979019165, 9.82999992370605, 4.67999982833862), `feb-wet_2012` = c(12.2699995040894,
11.6899995803833, 8.27999973297119, 14.9399995803833, 13.0499992370605
), `feb-wet_2013` = c(15.3599996566772, 15.9099998474121,
17.0599994659424, 13.3599996566772, 16.75), `feb-wet_2014` = c(10.1999998092651,
11.1399993896484, 13.8599996566772, 10.7399997711182, 7.35999965667725
), `feb-wet_2015` = c(11.9200000762939, 12.2699995040894,
8.01000022888184, 14.5299997329712, 5.71999979019165), `feb-wet_2016` = c(14.6999998092651,
14.7799997329712, 16.7899990081787, 4.90000009536743, 19.3500003814697
), `feb-wet_2017` = c(8.98999977111816, 9.17999935150146,
11.7699995040894, 6.3899998664856, 13.9899997711182), `feb-wet_2018` = c(16.75,
16.8599987030029, 12.0599994659424, 16.1900005340576, 8.51000022888184
), `feb-wet_2019` = c(7.58999967575073, 7.26999998092651,
8.21000003814697, 7.57999992370605, 8.81999969482422), `feb-wet_2020` = c(10.6399993896484,
10.4399995803833, 13.4399995803833, 8.53999996185303, 19.939998626709
), country = c("SERBIA", "SERBIA", "SLOVENIA", "GREECE",
"CZECHIA"), capital_city = c("Belgrade", "Belgrade", "Ljubljana",
"Athens", "Prague"), weight = c(20.25, 19.75, 14.25, 23.75,
14.25)), row.names = c(76L, 75L, 83L, 16L, 5L), class = "data.frame")
Your code seems to give the right answer now that there is more data:
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 27.2
2 GREECE Athens 14.6
3 SERBIA Belgrade 19.1
4 SLOVENIA Ljubljana 16.3
Is this what you were looking for?
I took the liberty of streamlining your code:
weather_winter <- weather_winter %>%
  pivot_longer(-c(31:33), values_to = "winter") %>%
  select(-name)
weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(winter = wtd.var(winter))
With only one "winter" column, there's no need for the across().
Finally, you are not using the weights. If these are needed, then change the last line to:
summarise(winter = wtd.var(winter, weights = weight))
To give:
# A tibble: 4 x 3
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 26.3
2 GREECE Athens 14.2
3 SERBIA Belgrade 18.8
4 SLOVENIA Ljubljana 15.8
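To make the effect of the weights concrete, here is a base R sketch of a weighted variance in which the weights act like frequency counts. The helper `weighted_var` is hypothetical and written only for illustration; to my understanding it mirrors what wtd.var computes with its default settings, but that is an assumption, so check the Hmisc documentation before relying on it.

```r
# Hypothetical helper: weighted variance with frequency-style weights
weighted_var <- function(x, w) {
  xbar <- sum(w * x) / sum(w)            # weighted mean
  sum(w * (x - xbar)^2) / (sum(w) - 1)   # weighted sum of squares / (total weight - 1)
}

# A weight of 2 on the first value behaves like observing it twice
weighted_var(c(1, 4), w = c(2, 1))
var(c(1, 1, 4))

# With all weights equal to 1 it reduces to the ordinary sample variance
weighted_var(c(1, 2, 3), w = c(1, 1, 1))
var(c(1, 2, 3))
```

This is why the weighted and unweighted results in the answer differ slightly: rows with larger weights pull the mean and the spread toward their values.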

calculating medians per year per ID in R and plotting the outcome

Dataset:
structure(list(ID = c(1234, 1234, 1234, 1234, 1234, 1234, 1234,
1234, 8769, 8769, 8769, 8769, 8769, 7457, 7457, 7457, 7457, 7457,
7457, 55667, 55667, 55667, 55667, 55667, 55667, 55667, 3789,
3789, 3789, 3789, 3789, 3789), date_of_bloods = structure(c(978307200,
981072000, 1173052800, 1175731200, 1367798400, 1465171200, 1467936000,
1659916800, 1072915200, 1075680000, 1173052800, 1175731200, 1367798400,
978307200, 981072000, 1173052800, 1175731200, 1367798400, 1465171200,
978307200, 981072000, 1173052800, 1270425600, 1273104000, 1465171200,
1467936000, 1270425600, 1367798400, 1465171200, 1465257600, 1465344000,
1465430400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
result = c(90, 80, 60, 40, 25, 22, 22, 21, 70, 65, 43, 23,
22, 90, 90, 88, 86, 76, 74, 58, 46, 35, 34, 33, 30, 24, 76,
67, 56, 34, 33, 23), `mutation type` = c(1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -32L), class = "data.frame")
I would like the median of results per year per ID in a format where the year is just 0,1,2,3 etc for uniformity across cohorts and then to plot these lines with some indication of their mutation category.
I have done:
filtered$date_of_bloods <-format(filtered$date_of_bloods,format="%Y")
#split into individual ID groups
a <- with(filtered, split(filtered, list(ID)))
#aggregate median results per year
medianfunc <- function(y) {aggregate(results ~ date_of_bloods, data = y, median)}
medians <- sapply(a, medianfunc)
# do lm per ID cohort and get slope of lines
g<- as.data.frame(medians)
coefLM <- function(x) {coef(lm(date_of_bloods ~ results, data = x))}
coefs<- sapply(g, coefLM)
The actual years don't matter and for uniformity I would like them to be 0,1,2,3,4 etc per ID. I am not sure how to do that? I would then want to plot this data (median yearly bloods per ID) with some form of idea as to which mutational category they belong.
I hope this isn't too broad a question.
Many thanks
You can try this (filtered is the dput() you included). I hope this helps:
library(dplyr)
library(lubridate)
library(ggplot2)
library(broom)
# Data: median result per ID, year and mutation type
filtered %>%
  mutate(year = year(date_of_bloods)) %>%
  group_by(ID, year, `mutation type`) %>%
  summarise(med = median(result)) -> df1
# Variables: convert to factors for plotting
df1 %>%
  ungroup() %>%
  mutate(ID = as.factor(ID),
         year = as.factor(year),
         `mutation type` = as.factor(`mutation type`)) -> df1
# Plot
ggplot(df1, aes(x = ID, y = med, fill = `mutation type`, color = year, group = year)) +
  geom_line()
And for models:
#Models
fits <- df1 %>%group_by(ID) %>%
do(fitmodel = lm(med ~ year, data = .))
#Coefs
dfCoef = tidy(fits, fitmodel)
# A tibble: 10 x 6
# Groups: ID [5]
ID term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1234 (Intercept) 6329. 1546. 4.09 0.0264
2 1234 year -3.13 0.769 -4.07 0.0268
3 3789 (Intercept) 14318. 4746. 3.02 0.204
4 3789 year -7.08 2.36 -3.00 0.205
5 7457 (Intercept) 2409. 403. 5.98 0.0269
6 7457 year -1.16 0.201 -5.78 0.0287
7 8769 (Intercept) 9268. 4803. 1.93 0.304
8 8769 year -4.60 2.39 -1.92 0.306
9 55667 (Intercept) 3294. 759. 4.34 0.0492
10 55667 year -1.62 0.378 -4.29 0.0503
Code for required plot:
# Plot 2
# Data modifications: recode year as 0, 1, 2, ... following the factor level order
df1 %>% mutate(year2 = as.numeric(year) - 1) -> df2
df2 %>% mutate(year2 = factor(year2, levels = sort(unique(year2)))) -> df2
# Plot 2
ggplot(df2, aes(x = year2, y = med, color = ID, group = ID)) +
  facet_wrap(. ~ `mutation type`) +
  geom_line()
Your naming is unclear. If the data you provided is called df, then you can do:
df$year <-format(df$date_of_bloods,format="%Y")
aggregate(result ~ year + ID, data = df, median)
year ID result
1 2001 1234 85.0
2 2007 1234 50.0
3 2013 1234 25.0
4 2016 1234 22.0
5 2022 1234 21.0
6 2010 3789 76.0
7 2013 3789 67.0
8 2016 3789 33.5
9 2001 7457 90.0
10 2007 7457 87.0
11 2013 7457 76.0
12 2016 7457 74.0
13 2004 8769 67.5
14 2007 8769 33.0
15 2013 8769 22.0
16 2001 55667 52.0
17 2007 55667 35.0
18 2010 55667 33.5
19 2016 55667 27.0
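If "year as 0, 1, 2, 3" is meant per ID (each ID's first sampled year becomes 0, the next becomes 1, and so on), one possible sketch uses dense_rank() within each ID. This is an assumption about the intent, not the only reading: `year - min(year)` would instead keep the real gaps between years. The toy data and the names `bloods`, `year_index` and `slopes` are invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration
bloods <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2),
  year   = c(2001, 2001, 2007, 2013, 2004, 2007, 2007),
  result = c(90, 80, 50, 25, 70, 43, 23)
)

# Median result per ID and year, then a per-ID year index starting at 0
medians <- bloods %>%
  group_by(ID, year) %>%
  summarise(med = median(result), .groups = "drop") %>%
  group_by(ID) %>%
  mutate(year_index = dense_rank(year) - 1) %>%
  ungroup()

# Slope of med against year_index per ID; cov/var is the
# least-squares slope of a simple linear regression
slopes <- medians %>%
  group_by(ID) %>%
  summarise(slope = cov(year_index, med) / var(year_index), .groups = "drop")

medians
slopes
```

The `medians` table can then be plotted with year_index on the x axis and one line per ID, colored or faceted by mutation type once that column is carried through the grouping.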
