Let me illustrate my question with an example:
Sample data:
df<-data.frame(BirthYear = c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005), Number= c(1,1,1,1,1,1,1,1,1,1,1), Group = c("g", "g", "g", "g", "g", "g","t","t","t","t","t"))
df
BirthYear Number Group
1 1995 1 g
2 1996 1 g
3 1997 1 g
4 1998 1 g
5 1999 1 g
6 2000 1 g
7 2001 1 t
8 2002 1 t
9 2003 1 t
10 2004 1 t
11 2005 1 t
and
df1<- structure(list(Year = c(2015, 2016, 2017, 2018, 2019, 2020)), class = "data.frame", row.names = c(NA,
-6L))
df1
Year
1 2015
2 2016
3 2017
4 2018
5 2019
6 2020
Now I want to add four new columns to df1: g1, g2, t1 and t2.
g1 and t1 are the sums of df$Number within each group (g or t in df) over the rows where df1$Year - df$BirthYear is greater than 18 and lower than 21, i.e. where the person is 19 or 20 years old.
g2 and t2 are the sums of df$Number within each group over the rows where the difference in years is lower than 19.
I want to end up with the following:
df1
Year g1 g2 t1 t2
1 2015 2 4 0 5
2 2016 2 3 0 5
3 2017 2 2 0 5
4 2018 2 1 0 5
5 2019 2 0 0 5
6 2020 1 0 1 4
I know I could write a for-loop over df1 to create the new columns, but I don't know how to specify the condition to get the correct group sums for each year.
I hope this example makes clear what I'm trying to achieve.
I'd be very grateful for any help because I'm really stuck at this point.
If what you want to do is just to calculate year differences across 2015:2020 and BirthYear, then you don't have to create a separate dataframe. Perhaps just
library(tidyr)
library(dplyr)
df %>%
  expand(Year = 2015:2020, nesting(BirthYear, Number, Group)) %>%
  group_by(Year, Group) %>%
  summarise(
    `1` = sum(between(Year - BirthYear, 19, 20) * Number),
    `2` = sum((Year - BirthYear < 19) * Number)
  ) %>%
  pivot_wider(names_from = "Group", values_from = c("1", "2"), names_glue = "{Group}{.value}")
Output
`summarise()` regrouping output by 'Year' (override with `.groups` argument)
# A tibble: 6 x 5
# Groups: Year [6]
Year g1 t1 g2 t2
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 2 0 4 5
2 2016 2 0 3 5
3 2017 2 0 2 5
4 2018 2 0 1 5
5 2019 2 0 0 5
6 2020 1 1 0 4
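If you would rather keep df1 and add the columns to it directly, the same sums can be sketched in base R. This is only a minimal sketch: age_sum() is a helper name I made up for illustration, and it assumes "lower than 19" means an age of 18 or less.

```r
df <- data.frame(BirthYear = 1995:2005,
                 Number = rep(1, 11),
                 Group = rep(c("g", "t"), times = c(6, 5)))
df1 <- data.frame(Year = 2015:2020)

# Hypothetical helper: sum Number for one group where the age in `year`
# falls in the closed interval [lo, hi]
age_sum <- function(year, group, lo, hi) {
  age <- year - df$BirthYear
  sum(df$Number[df$Group == group & age >= lo & age <= hi])
}

for (grp in c("g", "t")) {
  # ages 19 and 20
  df1[[paste0(grp, 1)]] <- sapply(df1$Year, age_sum, group = grp, lo = 19, hi = 20)
  # ages below 19
  df1[[paste0(grp, 2)]] <- sapply(df1$Year, age_sum, group = grp, lo = -Inf, hi = 18)
}
df1
```

The columns come out in the order Year, g1, g2, t1, t2, which is the layout asked for in the question.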
I'm fairly new to R and ended up in the following situation: I want to create a summary row for each group in the dataframe (grouped by Year and Model), where the value of the new row is computed by subtracting the values of some Variables from another one within the group.
df <- data.frame(Model = c(1,1,1,2,2,2,2,2,2,2,2,2,2),
Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2030, 2030, 2030, 2040, 2040, 2040),
Variable = c("A", "B", "C", "A", "B", "C", "D", "A", "C", "E", "A", "C", "D"),
value = c(15, 2, 5, 25, 6, 4, 4, 41, 24,1, 15, 3, 2))
I have managed to create a new row for each group, so it already has a Year and a Variable name that I manually specified using:
df <- df %>% group_by(Model, Year) %>% group_modify(~ add_row(., Variable = "New", .before=0))
However, I am struggling to write the expression that calculates the value.
What I want instead of the NAs: the value of A - B - D in each group.
I would appreciate any help. This is my first thread here, so apologies for any inconvenience.
You could pivot wide and then back; this would add rows with zeros where missing:
library(dplyr); library(tidyr)
df %>%
  pivot_wider(names_from = Variable, values_from = value, values_fill = 0) %>%
  mutate(new = A - B - D) %>%
  pivot_longer(-c(Model, Year), names_to = "Variable")
# A tibble: 24 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 D 0
5 1 2020 E 0
6 1 2020 new 13 # 15 - 2 - 0 = 13
7 2 2020 A 25
8 2 2020 B 6
9 2 2020 C 4
10 2 2020 D 4
# … with 14 more rows
EDIT - variation where we leave the missing values and use coalesce(x, 0) to allow subtraction to treat NA's as zeroes. The pivot_wider creates NA's in the missing spots, but we can exclude these in the pivot_longer using values_drop_na = TRUE.
df %>%
  pivot_wider(names_from = Variable, values_from = value) %>%
  mutate(new = A - coalesce(B,0) - coalesce(D,0)) %>%
  pivot_longer(-c(Model, Year), names_to = "Variable", values_drop_na = TRUE)
# A tibble: 17 × 4
Model Year Variable value
<dbl> <dbl> <chr> <dbl>
1 1 2020 A 15
2 1 2020 B 2
3 1 2020 C 5
4 1 2020 new 13
5 2 2020 A 25
6 2 2020 B 6
7 2 2020 C 4
8 2 2020 D 4
9 2 2020 new 15
10 2 2030 A 41
11 2 2030 C 24
12 2 2030 E 1
13 2 2030 new 41
14 2 2040 A 15
15 2 2040 C 3
16 2 2040 D 2
17 2 2040 new 13
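For comparison, here is a rough base R sketch that avoids the round trip through wide format by giving each Variable a sign and summing within each Model/Year group. The names signs, signed and new_rows are made up for this example. It treats a missing B or D as zero, like the coalesce() variant, but it would also treat a missing A as zero, which may not be what you want.

```r
df <- data.frame(
  Model = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 2030, 2030, 2030, 2040, 2040, 2040),
  Variable = c("A", "B", "C", "A", "B", "C", "D", "A", "C", "E", "A", "C", "D"),
  value = c(15, 2, 5, 25, 6, 4, 4, 41, 24, 1, 15, 3, 2),
  stringsAsFactors = FALSE
)

# A counts positively, B and D negatively, everything else not at all
signs <- c(A = 1, B = -1, D = -1)
w <- unname(signs[df$Variable])  # NA for variables not listed in signs
w[is.na(w)] <- 0
df$signed <- df$value * w

# One summary row per Model/Year group holding A - B - D
new_rows <- aggregate(signed ~ Model + Year, data = df, FUN = sum)
names(new_rows)[names(new_rows) == "signed"] <- "value"
new_rows$Variable <- "New"

# Append the summary rows and keep each group together
cols <- c("Model", "Year", "Variable", "value")
out <- rbind(df[, cols], new_rows[, cols])
out <- out[order(out$Model, out$Year), ]
out
```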
I am looking to get the standard deviation grouped by year. None of the examples I have seen involve an aggregated count column.
I want to use the count column as a frequency weight in the standard deviation calculation, with its sum as the total number of observations.
year count age
2018 2 0
2018 3 1
2018 4 2
2017 1 0
2017 4 1
2017 2 2
The expected answer for the above would be:
Year 2018 = 0.78567420131839
Year 2017 = 0.63887656499994
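The following should do the trick. Note that sd() uses the sample (n - 1) denominator, so the values it returns (0.833 and 0.690) are slightly larger than the population figures quoted in the question.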
library(dplyr)
library(purrr)
data <- tibble(year = c(2018, 2018, 2018, 2017, 2017, 2017),
count = c(2, 3, 4, 1, 4, 2),
age = c(0, 1, 2, 0, 1, 2))
data %>%
  mutate(vec = map2(age, count, ~ rep(.x, .y))) %>%
  group_by(year) %>%
  mutate(concs = list(unlist(vec))) %>%
  ungroup() %>%
  mutate(age_sd = map_dbl(concs, sd)) %>%
  select(-vec, -concs)
# year count age age_sd
# <dbl> <dbl> <dbl> <dbl>
# 1 2018 2 0 0.833
# 2 2018 3 1 0.833
# 3 2018 4 2 0.833
# 4 2017 1 0 0.690
# 5 2017 4 1 0.690
# 6 2017 2 2 0.690
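The expected values in the question look like population standard deviations (dividing by the total count rather than by the total count minus one). A minimal base R sketch of that variant, using the counts as frequency weights, could look like this. weighted_pop_sd() is a name invented for this example, not an existing function.

```r
data <- data.frame(year = c(2018, 2018, 2018, 2017, 2017, 2017),
                   count = c(2, 3, 4, 1, 4, 2),
                   age = c(0, 1, 2, 0, 1, 2))

# Hypothetical helper: population SD of x where each value occurs w times
weighted_pop_sd <- function(x, w) {
  mu <- sum(w * x) / sum(w)           # weighted mean
  sqrt(sum(w * (x - mu)^2) / sum(w))  # divide by N, not N - 1
}

# One value per year, named by year
res <- sapply(split(data, data$year),
              function(d) weighted_pop_sd(d$age, d$count))
res
```

This avoids expanding the data with rep(), which may matter if the counts are large.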
For my dataset I want one row per year for each ID, and I want to determine whether they lived in an urban area or not (0/1). Because some IDs moved within a year and therefore have two rows for that year, I want to identify the IDs that have two rows for a specific year, which means they lived in both an urban and a non-urban area in that year (so I can manually determine in Excel where they belong).
I’ve already excluded the exact double rows (so they moved in a certain year, but the urbanisation didn’t change).
df <- df %>% distinct(ID, YEAR, URBAN, .keep_all = TRUE)
structure(t2A)
# A tibble: 3,177,783 x 4
ID ZIPCODE YEAR URBAN
<dbl> <chr> <chr> <dbl>
1 1 1234AB 2013 0
2 1 1234AB 2014 0
3 1 1234AB 2015 0
4 1 1234AB 2016 0
5 1 1234AB 2017 0
6 1 1234AB 2018 0
7 2 5678CD 2013 0
8 2 5678CD 2014 0
9 2 5678CD 2015 0
10 2 5678CD 2016 0
# ... with 3,177,773 more rows
structure(list(ID= c(1, 1, 1, 1
), YEAR = c("2013", "2014", "2015", "2016"), URBAN = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
Can you help me identify IDs that have two rows for a specific year, i.e. a 0 and a 1 in the same year?
Edit: the example doesn't show any IDs with urbanisation 1, but they do exist, and not all IDs are present in every year :)
Below might be useful:
df <- df %>%
  dplyr::group_by(ID, YEAR) %>%
  dplyr::mutate(nIds = dplyr::n(),          # number of rows for each unique ID and year combination
                URBAN_Flag = sum(URBAN),    # greater than 0 if any row in that year is urban
                moved = dplyr::if_else(nIds > 1, 1, 0)) %>%
  dplyr::select(-c(nIds))
You can deselect any of the added columns if you don't need them.
First, we create some dummy data
library(tidyverse)
db <- tibble(
id = c(1, 1, 1, 2, 2, 2),
year = c(2000, 2000, 2001, 2001, 2002, 2003),
urban = c(0, 1, 0, 0, 0, 0)
)
We see that person one moved in 2000.
id year urban
<dbl> <dbl> <dbl>
1 1 2000 0
2 1 2000 1
3 1 2001 0
4 2 2001 0
5 2 2002 0
6 2 2003 0
Now, we can group by id and year and count the number of rows. We can use the count value to create a dummy whether or not they moved in a given year.
db %>%
  group_by(id, year) %>%
  summarize(rows = n()) %>%
  mutate(
    moved = ifelse(rows == 2, 1, 0)
  )
Which gives the result:
id year rows moved
<dbl> <dbl> <int> <dbl>
1 1 2000 2 1
2 1 2001 1 0
3 2 2001 1 0
4 2 2002 1 0
5 2 2003 1 0
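If exact duplicate rows have not been removed beforehand, counting rows can overstate moves. A slightly more direct check, sketched here in base R on the same dummy data, is to count the distinct urban values per id and year. The names n_values and mixed are just placeholders for this example.

```r
db <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2000, 2001, 2001, 2002, 2003),
  urban = c(0, 1, 0, 0, 0, 0)
)

# Number of distinct urban values for each id/year combination
n_values <- aggregate(urban ~ id + year, data = db,
                      FUN = function(u) length(unique(u)))

# id/year combinations that contain both a 0 and a 1
mixed <- n_values[n_values$urban > 1, c("id", "year")]
mixed
```

Here only id 1 in year 2000 is returned, which gives a short list to review by hand.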
I'm challenged with this problem. I have these types of data:
df <- data.frame(
ID = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3),
Pr = c(0, 1, 0, 999, -1, 1, 999, 1, 0, 0, 1, 0, 1, 0, 0),
Yrs = c(2010,2011,2012,2013,2014,2015, 2010, 2011, 2012, 2013, 2014, 2012, 2013, 2014, 2015)
)
ID Pr Yrs
1 0 2010
1 1 2011
1 0 2012
1 999 2013
1 -1 2014
1 1 2015
2 999 2010
2 1 2011
2 0 2012
2 0 2013
2 1 2014
3 0 2012
3 1 2013
3 0 2014
3 0 2015
I would like to get:
a) the number of (unique) IDs having "1" just once;
b) the distance (years) between the first occurrence of "1" and the following occurrence of "1", per group (ID).
Thank you for your help.
Here's one way to get at the problem:
library(tidyverse)
df %>% group_by(ID) %>% filter(sum(Pr==1)==1)
# A tibble: 4 x 3
# Groups: ID [1]
# ID Pr Yrs
# <dbl> <dbl> <dbl>
#1 3 0 2012
#2 3 1 2013
#3 3 0 2014
#4 3 0 2015
df %>%
  group_by(ID) %>%
  filter(Pr==1) %>%
  filter(n()>1) %>%
  summarise(dist=diff(Yrs))
# A tibble: 2 x 2
# ID dist
# <dbl> <dbl>
#1 1 4
#2 2 3
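Note that the first snippet returns the rows of the matching IDs rather than a count. If a single number is wanted for (a), one possible base R sketch is below. It assumes the rows are already sorted by year within each ID, and the names ones_by_id, n_once and gap are made up for illustration.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  Pr = c(0, 1, 0, 999, -1, 1, 999, 1, 0, 0, 1, 0, 1, 0, 0),
  Yrs = c(2010, 2011, 2012, 2013, 2014, 2015, 2010, 2011, 2012, 2013, 2014,
          2012, 2013, 2014, 2015)
)

# Years in which Pr == 1, collected per ID
is_one <- df$Pr == 1
ones_by_id <- split(df$Yrs[is_one], df$ID[is_one])

# (a) number of IDs with exactly one "1"
n_once <- sum(lengths(ones_by_id) == 1)

# (b) years between the first and second "1"; NA when there is no second one
gap <- sapply(ones_by_id,
              function(y) if (length(y) >= 2) y[2] - y[1] else NA_real_)

n_once
gap
```

IDs that never have a "1" drop out of ones_by_id entirely, so they do not affect the count in (a).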
With a summary data frame as
library(data.table)
setDT(df)
df_summ <-
  df[, {one <- which(Pr == 1);
        .(num_ones = length(one), gap = diff(Yrs[one[1:2]]))}
     , by = ID]
We can see
a) the number of (unique) IDs having "1" just once;
df_summ[, sum(num_ones == 1)]
# [1] 1
b) the distance (years) between the first occurrence of "1" and the following occurrence of "1", per group (ID)
See the gap column:
df_summ
# ID num_ones gap
# 1: 1 2 4
# 2: 2 2 3
# 3: 3 1 NA
Sorry if this post is not well organized, this is my first time on Stack Overflow...
I am trying to create a column that gives an order within each ID, but the twist is that if there is a gap year, the order needs to start again from 1.
Please check example and expected result below.
I wasn't able to find appropriate code for it and I cannot think of anything :( Please help me! I'd appreciate it a lot!
One option is to create a new group variable when difference between the year is greater than 1 and create a sequence in each group using row_number().
library(dplyr)
df %>%
  group_by(ID, group = cumsum(c(1, diff(Year) > 1))) %>%
  mutate(order = row_number()) %>%
  ungroup() %>%
  select(-group)
# ID Year order
# <fct> <int> <int>
# 1 A 2007 1
# 2 A 2008 2
# 3 A 2009 3
# 4 A 2013 1
# 5 A 2014 2
# 6 A 2015 3
# 7 A 2016 4
# 8 B 2010 1
# 9 B 2012 1
#10 B 2013 2
Using base R ave that would be
as.integer(with(df, ave(ID, ID, cumsum(c(1, diff(Year) > 1)), FUN = seq_along)))
#[1] 1 2 3 1 2 3 4 1 1 2
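To see why the cumulative sum works as a grouping key, it may help to look at the intermediate vectors for a single ID. This is just an illustrative sketch using the years of ID A. The names new_run, run_id and ord are made up for this example.

```r
Year <- c(2007, 2008, 2009, 2013, 2014, 2015, 2016)

# TRUE wherever a new run starts: the first row, or a jump of more than one year
new_run <- c(TRUE, diff(Year) > 1)

# Each TRUE bumps the counter, so consecutive years share one run id
run_id <- cumsum(new_run)

# Restart the numbering inside each run
ord <- ave(seq_along(Year), run_id, FUN = seq_along)

data.frame(Year, new_run, run_id, ord)
```

The gap between 2009 and 2013 is the only place where new_run is TRUE after the first row, so run_id switches from 1 to 2 there and the order restarts.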
data
df <- data.frame(ID = c(rep("A", 7), rep("B", 3)),
Year = c(2007:2009, 2013:2016, 2010, 2012, 2013), stringsAsFactors = FALSE)
A data.table option:
library(data.table)
setDT(df)
df[, jump := Year - shift(Year) - 1, by = ID
][is.na(jump), jump := 0
][, order := seq_len(.N), by = .(ID, cumsum(jump))]
# ID Year jump order
# 1: A 2007 0 1
# 2: A 2008 0 2
# 3: A 2009 0 3
# 4: A 2013 3 1
# 5: A 2014 0 2
# 6: A 2015 0 3
# 7: A 2016 0 4
# 8: B 2010 0 1
# 9: B 2012 1 1
# 10: B 2013 0 2
Or using data.table::nafill() available in data.table v1.12.3 (still in development):
df[, jump := nafill(Year - shift(Year) - 1, fill = 0), by = ID
][, order := seq_len(.N), by = .(ID, cumsum(jump))]
We can take the difference of 'Year' and the lag of 'Year', get the cumulative sum, use that in the group_by along with 'ID' and create the order as row_number()
library(dplyr)
df %>%
  group_by(ID, grp = cumsum(Year - lag(Year, default = Year[1]) > 1)) %>%
  mutate(order = row_number()) %>%
  ungroup %>%
  select(-grp)
# A tibble: 10 x 3
# ID Year order
# <chr> <dbl> <int>
# 1 A 2007 1
# 2 A 2008 2
# 3 A 2009 3
# 4 A 2013 1
# 5 A 2014 2
# 6 A 2015 3
# 7 A 2016 4
# 8 B 2010 1
# 9 B 2012 1
#10 B 2013 2
data
df <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "B",
"B", "B"), Year = c(2007, 2008, 2009, 2013, 2014, 2015, 2016,
2010, 2012, 2013)), class = "data.frame", row.names = c(NA, -10L
))