I'm stuck on the following problem. I have data like this:
df <- data.frame(
ID = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3),
Pr = c(0, 1, 0, 999, -1, 1, 999, 1, 0, 0, 1, 0, 1, 0, 0),
Yrs = c(2010,2011,2012,2013,2014,2015, 2010, 2011, 2012, 2013, 2014, 2012, 2013, 2014, 2015)
)
ID Pr Yrs
1 0 2010
1 1 2011
1 0 2012
1 999 2013
1 -1 2014
1 1 2015
2 999 2010
2 1 2011
2 0 2012
2 0 2013
2 1 2014
3 0 2012
3 1 2013
3 0 2014
3 0 2015
I would like to get:
a) the number of (unique) IDs having "1" just once;
b) the distance (in years) between the first occurrence of "1" and the following occurrence of "1", per group (ID).
Thank you for your help.
Here's one way to get at the problem:
library(tidyverse)
df %>% group_by(ID) %>% filter(sum(Pr==1)==1)
# A tibble: 4 x 3
# Groups: ID [1]
# ID Pr Yrs
# <dbl> <dbl> <dbl>
#1 3 0 2012
#2 3 1 2013
#3 3 0 2014
#4 3 0 2015
df %>%
group_by(ID) %>%
filter(Pr==1) %>%
filter(n()>1) %>%
summarise(dist=diff(Yrs))
# A tibble: 2 x 2
# ID dist
# <dbl> <dbl>
#1 1 4
#2 2 3
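As a cross-check, here is a minimal base R sketch of the same two quantities. It assumes that "distance" means the gap between the first and the second 1 only, so an ID with three or more 1s still yields a single value (the diff() call above would return one row per consecutive pair in that case). The names one_years, n_single and first_gap are made up for this illustration.

```r
# Example data from the question
df <- data.frame(
  ID  = c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3),
  Pr  = c(0, 1, 0, 999, -1, 1, 999, 1, 0, 0, 1, 0, 1, 0, 0),
  Yrs = c(2010,2011,2012,2013,2014,2015, 2010, 2011, 2012, 2013, 2014,
          2012, 2013, 2014, 2015)
)

# For every ID, the sorted years in which Pr equals 1
# (IDs without any 1 are kept, with an empty vector)
one_years <- lapply(split(df, df$ID), function(d) sort(d$Yrs[d$Pr == 1]))

# (a) number of IDs with exactly one 1
n_single <- sum(lengths(one_years) == 1)

# (b) gap between the first and the second 1; NA when there is no second 1
first_gap <- sapply(one_years, function(y) {
  if (length(y) >= 2) y[2] - y[1] else NA_real_
})

n_single
first_gap
```

For the example data this should agree with the results above: one ID with a single 1, and gaps of 4 and 3 years for IDs 1 and 2.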
Using data.table, first build a summary data frame:
library(data.table)
setDT(df)
df_summ <-
df[, {one <- which(Pr == 1);
.(num_ones = length(one), gap = diff(Yrs[one[1:2]]))}
, by = ID]
We can see
a) the number of (unique) IDs having "1" just once;
df_summ[, sum(num_ones == 1)]
# [1] 1
b) The distance (in years) between the first occurrence of "1" and the
following occurrence of "1", per group (ID)
See the gap column:
df_summ
# ID num_ones gap
# 1: 1 2 4
# 2: 2 2 3
# 3: 3 1 NA
Let me illustrate my question with an example:
Sample data:
df<-data.frame(BirthYear = c(1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005), Number= c(1,1,1,1,1,1,1,1,1,1,1), Group = c("g", "g", "g", "g", "g", "g","t","t","t","t","t"))
df
BirthYear Number Group
1 1995 1 g
2 1996 1 g
3 1997 1 g
4 1998 1 g
5 1999 1 g
6 2000 1 g
7 2001 1 t
8 2002 1 t
9 2003 1 t
10 2004 1 t
11 2005 1 t
and
df1<- structure(list(Year = c(2015, 2016, 2017, 2018, 2019, 2020)), class = "data.frame", row.names = c(NA,
-6L))
df1
Year
1 2015
2 2016
3 2017
4 2018
5 2019
6 2020
Now I want to add new columns to df1: g1, g2, t1 and t2.
g1 and t1 are the sum of df$Number over all rows of the respective group (g or t in df) where df1$Year - df$BirthYear is greater than 18 and lower than 21, i.e. where the person is 19 or 20 years old.
g2 and t2 are the sum of df$Number over all rows of the respective group where the difference in years is lower than 19.
I want to end up with the following:
df1
Year g1 g2 t1 t2
1 2015 2 4 0 5
2 2016 2 3 0 5
3 2017 2 2 0 5
4 2018 2 1 0 5
5 2019 2 0 0 5
6 2020 1 0 1 4
I know I could make a for-loop over df1 to create the new columns but I don't know how to specify the condition to get the correct group sums for each year.
I hope this example makes clear what I'm trying to achieve.
I'd be very grateful for any help because I'm really stuck at this point.
If what you want to do is just to calculate year differences across 2015:2020 and BirthYear, then you don't have to create a separate dataframe. Perhaps just
library(tidyr)
library(dplyr)
df %>%
expand(Year = 2015:2020, nesting(BirthYear, Number, Group)) %>%
group_by(Year, Group) %>%
summarise(
`1` = sum(between(Year - BirthYear, 19, 20) * Number),
`2` = sum((Year - BirthYear < 19) * Number)
) %>%
pivot_wider(names_from = "Group", values_from = c("1", "2"), names_glue = "{Group}{.value}")
Output
`summarise()` regrouping output by 'Year' (override with `.groups` argument)
# A tibble: 6 x 5
# Groups: Year [6]
Year g1 t1 g2 t2
<int> <dbl> <dbl> <dbl> <dbl>
1 2015 2 0 4 5
2 2016 2 0 3 5
3 2017 2 0 2 5
4 2018 2 0 1 5
5 2019 2 0 0 5
6 2020 1 1 0 4
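If the columns should be added to the existing df1 in the order asked for (g1, g2, t1, t2), a base R sketch along these lines may be an alternative. The helper age_sum() and its lo/hi arguments are hypothetical names introduced here, and "lower than 19" is taken to mean an age of at most 18 because the ages are whole years.

```r
df <- data.frame(
  BirthYear = 1995:2005,
  Number    = rep(1, 11),
  Group     = c(rep("g", 6), rep("t", 5)),
  stringsAsFactors = FALSE
)
df1 <- data.frame(Year = 2015:2020)

# Sum of Number within one group for ages between lo and hi (inclusive)
# in a given year
age_sum <- function(year, group, lo, hi) {
  age <- year - df$BirthYear
  sum(df$Number[df$Group == group & age >= lo & age <= hi])
}

# One pair of columns per group: <group>1 for ages 19-20, <group>2 for ages below 19
for (g in unique(df$Group)) {
  df1[[paste0(g, "1")]] <- sapply(df1$Year, age_sum, group = g, lo = 19, hi = 20)
  df1[[paste0(g, "2")]] <- sapply(df1$Year, age_sum, group = g, lo = -Inf, hi = 18)
}

df1
```

This keeps df1 as the driving table, which may be convenient when the years of interest are not a contiguous range.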
I am looking to get the standard deviation grouped by year. All the examples I have seen do not involve an aggregated count column.
I want to use the sum of the count column as part of the standard deviation calculation.
year count age
2018 2 0
2018 3 1
2018 4 2
2017 1 0
2017 4 1
2017 2 2
The expected answer for the above would be:
Year 2018 = 0.78567420131839
Year 2017 = 0.63887656499994
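The following should do the trick: it repeats each age count times and applies sd() per year. Note that sd() divides by n - 1 (sample standard deviation), which is why the result is 0.833 and 0.690 rather than the population values 0.786 and 0.639 expected in the question.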
library(dplyr)
library(purrr)
data <- tibble(year = c(2018, 2018, 2018, 2017, 2017, 2017),
count = c(2, 3, 4, 1, 4, 2),
age = c(0, 1, 2, 0, 1, 2))
data %>%
mutate(vec = map2(age, count, ~ rep(.x, .y))) %>%
group_by(year) %>%
mutate(concs = list(unlist(vec))) %>%
ungroup() %>%
mutate(age_sd = map_dbl(concs, sd)) %>%
select(-vec, -concs)
# year count age age_sd
# <dbl> <dbl> <dbl> <dbl>
# 1 2018 2 0 0.833
# 2 2018 3 1 0.833
# 3 2018 4 2 0.833
# 4 2017 1 0 0.690
# 5 2017 4 1 0.690
# 6 2017 2 2 0.690
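The expected values in the question correspond to the population standard deviation (denominator n, with count used as a frequency weight), whereas sd() uses n - 1. A minimal base R sketch of the weighted population version follows; weighted_pop_sd() is a name made up for this example, not a function from any package.

```r
data <- data.frame(
  year  = c(2018, 2018, 2018, 2017, 2017, 2017),
  count = c(2, 3, 4, 1, 4, 2),
  age   = c(0, 1, 2, 0, 1, 2)
)

# Population standard deviation of x, where w says how often each x occurs
weighted_pop_sd <- function(x, w) {
  mu <- sum(w * x) / sum(w)          # weighted mean
  sqrt(sum(w * (x - mu)^2) / sum(w)) # divide by the total count, not n - 1
}

# One value per year
sd_by_year <- sapply(split(data, data$year),
                     function(d) weighted_pop_sd(d$age, d$count))
sd_by_year
```

This avoids expanding the data with rep(), which can matter when the counts are large.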
For my dataset I want one row per year for each ID, and I want to determine whether they lived in an urban area or not (0/1). Because some IDs moved within a year and therefore have two rows for that year, I want to identify the IDs that have two rows for a specific year, which means they lived in both an urban and a non-urban area in that year (so I can manually determine in Excel where they belong).
I’ve already excluded the exact double rows (so they moved in a certain year, but the urbanisation didn’t change).
df <- df %>% distinct(ID, YEAR, URBAN, .keep_all = TRUE)
structure(t2A)
# A tibble: 3,177,783 x 4
ID ZIPCODE YEAR URBAN
<dbl> <chr> <chr> <dbl>
1 1 1234AB 2013 0
2 1 1234AB 2014 0
3 1 1234AB 2015 0
4 1 1234AB 2016 0
5 1 1234AB 2017 0
6 1 1234AB 2018 0
7 2 5678CD 2013 0
8 2 5678CD 2014 0
9 2 5678CD 2015 0
10 2 5678CD 2016 0
# ... with 3,177,773 more rows
structure(list(ID= c(1, 1, 1, 1
), YEAR = c("2013", "2014", "2015", "2016"), URBAN = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
Can you help me identify IDs that have two rows for a specific year, i.e. that have both a 0 and a 1 in the same year?
Edit: the example doesn't show any IDs with urbanisation 1, but there are some, and not all IDs are present in all years :)
Below might be useful:
df <- df %>%
dplyr::group_by(ID, YEAR) %>%
dplyr::mutate(nIds=dplyr::n(), # number of rows for this ID and year combination
URBAN_Flag=sum(URBAN), # number of urban rows within this ID and year
moved=dplyr::if_else(nIds>1,1,0)) %>%
dplyr::select(-c(nIds))
You can deselect the columns if not needed
First, we create some dummy data
library(tidyverse)
db <- tibble(
id = c(1, 1, 1, 2, 2, 2),
year = c(2000, 2000, 2001, 2001, 2002, 2003),
urban = c(0, 1, 0, 0, 0, 0)
)
We see that person one moved in 2000.
id year urban
<dbl> <dbl> <dbl>
1 1 2000 0
2 1 2000 1
3 1 2001 0
4 2 2001 0
5 2 2002 0
6 2 2003 0
Now, we can group by id and year and count the number of rows. We can use the count value to create a dummy whether or not they moved in a given year.
db %>%
group_by(id, year) %>%
summarize(rows = n()) %>%
mutate(
moved = ifelse(rows == 2, 1, 0)
)
Which gives the result:
id year rows moved
<dbl> <dbl> <int> <dbl>
1 1 2000 2 1
2 1 2001 1 0
3 2 2001 1 0
4 2 2002 1 0
5 2 2003 1 0
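Counting rows works here because exact duplicates were already removed. If that cannot be relied on, counting the distinct urban values per id and year may be a safer check. The following base R sketch assumes the dummy data above; n_urban_values and flagged are names chosen for this illustration.

```r
db <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  year  = c(2000, 2000, 2001, 2001, 2002, 2003),
  urban = c(0, 1, 0, 0, 0, 0)
)

# Number of distinct urban values for every id-year combination
mixed <- aggregate(urban ~ id + year, data = db,
                   FUN = function(u) length(unique(u)))
names(mixed)[3] <- "n_urban_values"

# Keep only the id-year pairs that contain both a 0 and a 1
flagged <- mixed[mixed$n_urban_values > 1, ]
flagged
```

The flagged rows are the ones to review manually; every other id-year pair has a single, unambiguous urban value.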
Sorry if this post is not well organized, this is my first time on Stack Overflow...
I am trying to create a column that gives an order within each ID, but the twist is that if there is a gap year, the order needs to start again from the beginning.
Please check example and expected result below.
I wasn't able to find appropriate code for it and cannot think of anything :( Please help me! I'd appreciate it a lot!
One option is to create a new group variable when difference between the year is greater than 1 and create a sequence in each group using row_number().
library(dplyr)
df %>%
group_by(ID, group = cumsum(c(1, diff(Year) > 1))) %>%
mutate(order = row_number()) %>%
ungroup() %>%
select(-group)
# ID Year order
# <fct> <int> <int>
# 1 A 2007 1
# 2 A 2008 2
# 3 A 2009 3
# 4 A 2013 1
# 5 A 2014 2
# 6 A 2015 3
# 7 A 2016 4
# 8 B 2010 1
# 9 B 2012 1
#10 B 2013 2
Using base R ave that would be
as.integer(with(df, ave(ID, ID, cumsum(c(1, diff(Year) > 1)), FUN = seq_along)))
#[1] 1 2 3 1 2 3 4 1 1 2
data
df <- data.frame(ID = c(rep("A", 7), rep("B", 3)),
Year = c(2007:2009, 2013:2016, 2010, 2012, 2013), stringsAsFactors = FALSE)
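To see why the cumsum() trick works, it can help to run it step by step on the years of a single ID. This is only an illustrative sketch; new_run, run_id and run_order are names made up here.

```r
# Years for one ID, with a gap between 2009 and 2013
years <- c(2007, 2008, 2009, 2013, 2014, 2015, 2016)

# TRUE for the first row and for every row that follows a gap of more than one year
new_run <- c(TRUE, diff(years) > 1)

# Running count of the TRUEs: each consecutive block of years gets its own id
run_id <- cumsum(new_run)

# Restart the counter inside every block
run_order <- ave(years, run_id, FUN = seq_along)

data.frame(years, new_run, run_id, run_order)
```

Grouping by ID as well, as the answers here do, makes the counter also restart whenever the ID changes.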
A data.table option:
library(data.table)
setDT(df)
df[, jump := Year - shift(Year) - 1, by = ID
][is.na(jump), jump := 0
][, order := seq_len(.N), by = .(ID, cumsum(jump))]
# ID Year jump order
# 1: A 2007 0 1
# 2: A 2008 0 2
# 3: A 2009 0 3
# 4: A 2013 3 1
# 5: A 2014 0 2
# 6: A 2015 0 3
# 7: A 2016 0 4
# 8: B 2010 0 1
# 9: B 2012 1 1
# 10: B 2013 0 2
Or using data.table::nafill(), which was introduced during the v1.12.3 development cycle and therefore needs a newer data.table release:
df[, jump := nafill(Year - shift(Year) - 1, fill = 0), by = ID
][, order := seq_len(.N), by = .(ID, cumsum(jump))]
We can take the difference of 'Year' and the lag of 'Year', get the cumulative sum, use that in the group_by along with 'ID' and create the order as row_number()
library(dplyr)
df %>%
group_by(ID, grp = cumsum(Year - lag(Year, default = Year[1]) > 1)) %>%
mutate(order = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 10 x 3
# ID Year order
# <chr> <dbl> <int>
# 1 A 2007 1
# 2 A 2008 2
# 3 A 2009 3
# 4 A 2013 1
# 5 A 2014 2
# 6 A 2015 3
# 7 A 2016 4
# 8 B 2010 1
# 9 B 2012 1
#10 B 2013 2
data
df <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "B",
"B", "B"), Year = c(2007, 2008, 2009, 2013, 2014, 2015, 2016,
2010, 2012, 2013)), class = "data.frame", row.names = c(NA, -10L
))
I'm having some trouble with advanced operations in dplyr with grouped data. I'm not sure how to specify if I want to refer to an observation-level value, and when I can specifically refer to the entire vector.
Sample data frame:
df <- as.data.frame(
rbind(
c(11990, 2011, 1, 1, 2010),
c(11990, 2015, 1, 0, NA),
c(11990, 2017, 2, 1, NA),
c(11990, 2018, 2, 1, 2016),
c(11990, 2019, 2, 1, 2019),
c(11990, 2020, 1, 0, NA),
c(22880, 2013, 1, 1, NA),
c(22880, 2014, 1, 0, 2011),
c(22880, 2015, 1, 1, NA),
c(22880, 2018, 2, 0, 2014),
c(22880, 2020, 2, 0, 1979)))
names(df) <- c("id", "year", "house_apt", "moved", "year_moved")
# > df
# id year house_apt moved year_moved
# 1 11990 2011 1 1 2010
# 2 11990 2015 1 0 NA
# 3 11990 2017 2 1 NA
# 4 11990 2018 2 1 2016
# 5 11990 2019 2 1 2019
# 6 11990 2020 1 0 NA
# 7 22880 2013 1 1 NA
# 8 22880 2014 1 0 2011
# 9 22880 2015 1 1 NA
# 10 22880 2018 2 0 2014
# 11 22880 2020 2 0 1979
If I do simple mutate operations:
library(dplyr)
df %>% mutate(year+2)
df %>% group_by(id) %>% mutate(year+2)
It's pretty obvious that "year" here refers to each individual row value. This is the case even if I were to (for some reason) do it with a grouping. However, if I were to do the following two operations which involve a vector operation:
df %>% mutate(sum(year))
df %>% group_by(id) %>% mutate(sum(year))
dplyr understands "year" as the entire vector of year values for that whole group.
However, I am now having a lot of trouble with an operation where it is ambiguous whether I want mutate to use the row value or the entire vector. With my data frame, I want to create a variable holding a guessed moving year for individuals who moved but didn't record the moving date until a later survey instance. Note the data is extremely messy, with some nonsensical moving dates that we want to ignore.
Therefore, I want to create a "guess" value for each row where a person moved but no year_moved is recorded. I want the operation to look through the entire vector of moving dates for each individual, subset to include only the ones earlier than the current year, and pick out the one that is the closest to the year for the current row.
Granular example: If we look at row #3, the individual moved in that year, but there is no move date. Therefore we want to look at the entire year_moved vector for that person (2010, NA, NA, 2016, 2019, NA) and choose the one that is the closest to and preferably earlier than the row #3 value of year (2017). The guess value, therefore, would be 2016.
Getting the value we want with a given year and vector of values is simple:
year <- 2017
year_moved <- c(2010, 2016, 2017)
year_moved[which.min(year-(year_moved[year_moved<year & !is.na(year_moved)]))]
# [1] 2016
rm(year, year_moved)
However, when I try this within a mutate function, it doesn't give me the same result.
df %>%
group_by(id) %>%
mutate(
year_guess = ifelse(moved==1 & is.na(year_moved),
year_moved[which.min(year-(year_moved[year_moved<year]))],
NA))
# # A tibble: 11 x 6
# # Groups: id [2]
#       id  year house_apt moved year_moved year_guess
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 11990 2011 1 1 2010 NA
# 2 11990 2015 1 0 NA NA
# 3 11990 2017 2 1 NA NA
# 4 11990 2018 2 1 2016 NA
# 5 11990 2019 2 1 2019 NA
# 6 11990 2020 1 0 NA NA
# 7 22880 2013 1 1 NA 2011
# 8 22880 2014 1 0 2011 NA
# 9 22880 2015 1 1 NA 2011
# 10 22880 2018 2 0 2014 NA
# 11 22880 2020 2 0 1979 NA
# Warning message:
# In year - (year_moved[year_moved < year & !is.na(year_moved)]) :
# longer object length is not a multiple of shorter object length
(Row 3 should be 2016 and Row 9 should be 2014.) I think part of it is my inability to specify whether I am interested in a row-value or a vector. Note that the first time I refer to "year_moved" (is.na(year_moved)), I am referring to the value in that row. When I refer to it within the which.min, I am trying to refer to the groupwise vector. When I refer to "year", I'm trying to refer to the value of the individual row I'm working in. Clearly things are a little muddled, and it's a broader problem I've been running into with many different applications. Can anyone provide guidance?
I've been writing my whole project using tidyverse so would like to continue if possible.
I think the most straightforward way to modify your current attempt to get the right results is to wrap the guessing operation in sapply so that a guess is separately calculated for each year:
df %>%
group_by(id) %>%
mutate(
year_guess = ifelse(
moved==1 & is.na(year_moved),
sapply(year, function(x) year_moved[which.min(x-(year_moved[year_moved < x]))]),
NA)
)
I haven't been able to fully unpack the logic of how this works but I think as written your guessing procedure is a little bit complex to be easily vectorized (although it probably can be if you approach it in a slightly different way).
Output:
# A tibble: 11 x 6
# Groups: id [2]
id year house_apt moved year_moved year_guess
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11990 2011 1 1 2010 NA
2 11990 2015 1 0 NA NA
3 11990 2017 2 1 NA 2016
4 11990 2018 2 1 2016 NA
5 11990 2019 2 1 2019 NA
6 11990 2020 1 0 NA NA
7 22880 2013 1 1 NA 2011
8 22880 2014 1 0 2011 NA
9 22880 2015 1 1 NA 2014
10 22880 2018 2 0 2014 NA
11 22880 2020 2 0 1979 NA
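One caveat with the expression above: which.min() returns a position within the filtered vector year_moved[year_moved < x], but that position is then used to index the unfiltered year_moved. The two only line up when no later year precedes the match, so the result can be wrong for other data. A sketch that avoids the index mismatch is to take the maximum of the earlier, non-missing years directly. closest_earlier() is a hypothetical helper written for this example, and the loop below uses base R rather than dplyr to keep the sketch self-contained.

```r
df <- data.frame(
  id         = c(rep(11990, 6), rep(22880, 5)),
  year       = c(2011, 2015, 2017, 2018, 2019, 2020, 2013, 2014, 2015, 2018, 2020),
  house_apt  = c(1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2),
  moved      = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0),
  year_moved = c(2010, NA, NA, 2016, 2019, NA, NA, 2011, NA, 2014, 1979)
)

# Latest recorded move year strictly before x, or NA if there is none
closest_earlier <- function(x, candidates) {
  earlier <- candidates[!is.na(candidates) & candidates < x]
  if (length(earlier) == 0) NA_real_ else max(earlier)
}

# Fill the guess only for rows where a move happened but no year was recorded
df$year_guess <- NA_real_
for (i in which(df$moved == 1 & is.na(df$year_moved))) {
  same_id <- df$id == df$id[i]
  df$year_guess[i] <- closest_earlier(df$year[i], df$year_moved[same_id])
}

df
```

Inside the grouped mutate() shown above, the same helper could replace the anonymous function, for example as sapply(year, closest_earlier, candidates = year_moved).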