Using groupby and n_distinct to count NEW unique ids - r

The following code counts the number of unique IDs per year. My question is: how to count the number of new unique IDs, i.e., IDs that did not appear in previous years?
df %>%
  group_by(Year) %>%
  summarize(count = n_distinct(ID))
For example, I need to create the variable wanted_count below
Year ID count wanted_count
2000 1 3 3
2000 2 3 3
2000 3 3 3
2001 2 2 0
2001 3 2 0
2002 3 2 1
2002 4 2 1
2003 4 2 1
2003 7 2 1
2003 4 2 1
See data below:
df <- structure(list(Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L,
2002L, 2003L, 2003L, 2003L), ID = c(1L, 2L, 3L, 2L, 3L, 3L, 4L,
4L, 7L, 4L), count = c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), wanted_count = c(3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))

library(dplyr)
df %>%
  mutate(cum_new = cumsum(!duplicated(ID))) %>%   # running count of distinct IDs seen so far
  group_by(Year) %>%
  summarize(total = max(cum_new), .groups = "drop") %>%
  mutate(
    result = c(first(total), diff(total)),        # new IDs per year = increase in the running count
    total = NULL
  ) %>%
  left_join(df, by = "Year")
# # A tibble: 10 × 5
# Year result ID count wanted_count
# <int> <int> <int> <int> <int>
# 1 2000 3 1 3 3
# 2 2000 3 2 3 3
# 3 2000 3 3 3 3
# 4 2001 0 2 2 0
# 5 2001 0 3 2 0
# 6 2002 1 3 2 1
# 7 2002 1 4 2 1
# 8 2003 1 4 2 1
# 9 2003 1 7 2 1
# 10 2003 1 4 2 1
Using this data:
df = read.table(text = 'Year ID count wanted_count
2000 1 3 3
2000 2 3 3
2000 3 3 3
2001 2 2 0
2001 3 2 0
2002 3 2 1
2002 4 2 1
2003 4 2 1
2003 7 2 1
2003 4 2 1', header = TRUE)

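You could create a logical column that flags the first occurrence of each ID, and then sum those flags within each Year group.
library(dplyr)
df %>%
  mutate(new = !duplicated(ID)) %>%   # TRUE the first time an ID appears
  add_count(Year, wt = new) %>%       # per-Year sum of the TRUE flags, stored in n
  select(-new)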
# Year ID n
# 1 2000 1 3
# 2 2000 2 3
# 3 2000 3 3
# 4 2001 2 0
# 5 2001 3 0
# 6 2002 3 1
# 7 2002 4 1
# 8 2003 4 1
# 9 2003 7 1
# 10 2003 4 1
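Both answers rely on duplicated(), which flags an ID as new only the first time it appears in row order, so they implicitly assume the rows are already sorted by Year. A minimal base R sketch of the same idea, with an explicit sort added as a precaution (the column name new_ids is only an illustrative choice, not something from the question):

```r
# Sample data from the question (Year and ID only)
df <- data.frame(
  Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L),
  ID   = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)
)

# Sort by Year so that "first appearance" means "earliest year"
df <- df[order(df$Year), ]

# 1 the first time an ID is seen, 0 afterwards
is_new <- as.integer(!duplicated(df$ID))

# Sum the flags within each Year and repeat the total on every row of that Year
df$new_ids <- ave(is_new, df$Year, FUN = sum)
```

If the data arrive unsorted, skipping the sort step would attribute an ID to whichever year happens to come first in the file.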

Related

How to add new rows conditionally on R

I have a df with
v1 t1 c1 o1
1 1 9 1
1 1 12 2
1 2 2 1
1 2 7 2
2 1 3 1
2 1 6 2
2 2 3 1
2 2 12 2
And I would like to add 2 rows each time that v1 changes its value, in order to get this:
v1 t1 c1 o1
1 1 1 1
1 1 1 2
1 2 9 1
1 2 12 2
1 3 2 1
1 3 7 2
2 1 1 1
2 1 1 2
2 2 3 1
2 2 6 2
2 3 3 1
2 3 12 2
So what I'm doing is that every time v1 changes its value I'm adding 2 rows of ones and adding a 1 to the values of t1. This is kind of tricky. I've been able to do it in Excel but I would like to scale to big files in R.
We may do the expansion in group_modify
library(dplyr)
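df1 %>%
  group_by(v1) %>%
  group_modify(~ .x %>%
    slice_head(n = 2) %>%                      # template for the two new rows
    mutate(across(-o1, ~ 1)) %>%               # set every column except o1 to 1
    bind_rows(.x) %>%                          # put the new rows ahead of the originals
    mutate(t1 = as.integer(gl(n(), 2, n())))   # renumber t1 in blocks of two rows
  ) %>%
  ungroup()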
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
Or do a group by summarise
df1 %>%
  group_by(v1) %>%
  summarise(t1 = as.integer(gl(n() + 2, 2, n() + 2)),
            c1 = c(1, 1, c1), o1 = rep(1:2, length.out = n() + 2),
            .groups = 'drop')
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
data
df1 <- structure(list(v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), t1 = c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), c1 = c(9L, 12L, 2L, 7L, 3L, 6L,
3L, 12L), o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
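In dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). A sketch of the second approach rewritten that way, assuming dplyr >= 1.1.0 is installed and that every v1 group has an even number of rows (as in the sample data):

```r
library(dplyr)

df1 <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  t1 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
  c1 = c(9L, 12L, 2L, 7L, 3L, 6L, 3L, 12L),
  o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)
)

out <- df1 %>%
  group_by(v1) %>%
  reframe(
    # one extra pair of rows per group, so t1 runs over n()/2 + 1 blocks of two
    t1 = rep(seq_len(n() / 2 + 1), each = 2),
    # two rows of ones in front of the original c1 values
    c1 = c(1, 1, c1),
    # o1 keeps alternating 1, 2
    o1 = c(1L, 2L, o1)
  )
```

reframe() always returns an ungrouped result, so no .groups argument or ungroup() call is needed.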

In R: subset so that I only have the observations 3 years prior to and after an event

I found the following link of an answer that I should be able to apply, but it didn't seem to work:
https://stackoverflow.com/a/66485141/15388602
The following is a sample from my dataset:
companyID year status
1 2000 1
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
2 2012 1
2 2013 0
2 2014 2
2 2015 2
2 2016 2
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
3 2017 2
I would like to get the following observations so that I now only have the observations concerning 3 years before the event, the year of the event (where status is 0), and the 3 years after the event:
companyID year status
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
Would it be easier if I supplied the variable showing the event date? The variable would show a date in the same observation (year) that the status is 0.
Thank you in advance for any help!
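This could be achieved with group_by, arrange, and filter:
library(dplyr)
df %>%
  group_by(companyID) %>%
  arrange(status, year, .by_group = TRUE) %>%                     # status 0 (the event) sorts first in each group
  filter(year >= first(year) - 3 & year <= first(year) + 3) %>%   # keep the event year +/- 3
  filter(n() >= 7) %>%                                            # drop companies without a full 7-year window
  arrange(year)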
Output:
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
Try this with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  group_by(companyID, year) %>%
  mutate(ref_yr = case_when(status == 0 ~ year,
                            TRUE ~ NA_integer_)) %>%
  ungroup() %>%
  group_by(companyID) %>%
  fill(ref_yr, .direction = "downup") %>%
  mutate(yr_diff = abs(ref_yr - year)) %>%
  filter(yr_diff <= 3) %>%
  select(-c(ref_yr, yr_diff))
#> # A tibble: 19 x 3
#> # Groups: companyID [3]
#> companyID year status
#> <int> <int> <int>
#> 1 1 2001 1
#> 2 1 2002 1
#> 3 1 2003 1
#> 4 1 2004 0
#> 5 1 2005 2
#> 6 1 2006 2
#> 7 1 2007 2
#> 8 2 2012 1
#> 9 2 2013 0
#> 10 2 2014 2
#> 11 2 2015 2
#> 12 2 2016 2
#> 13 3 2010 1
#> 14 3 2011 1
#> 15 3 2012 1
#> 16 3 2013 0
#> 17 3 2014 2
#> 18 3 2015 2
#> 19 3 2016 2
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2012L, 2013L, 2014L, 2015L, 2016L, 2008L, 2009L, 2010L,
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L), status = c(1L,
1L, 1L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 0L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-23L))
Created on 2021-04-25 by the reprex package (v2.0.0)
Does this work:
library(dplyr)
df %>%
  group_by(companyID) %>%
  mutate(flag1 = year[status == 0] - min(year),
         flag2 = max(year) - year[status == 0]) %>%
  filter(flag1 > 2 & flag2 > 2 &
           between(year, year[status == 0] - 3, year[status == 0] + 3)) %>%
  select(-flag1, -flag2)
# A tibble: 14 x 3
# Groups: companyID [2]
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
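If your data frame is df, already sorted by company and year with one row per year, you can select rows by position (note that this does not check company boundaries, so a window near the start or end of a company can pick up rows from its neighbour):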
zeros <- which(df$status == 0)
calcrows <- sapply(zeros, function(x) (x-3):(x+3))
df2 <- df[calcrows, ]
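The row-offset approach selects by position rather than by year. A base R sketch that works on the year values instead and keeps only companies with a complete seven-year window; it assumes exactly one status == 0 row per company and one row per year, and the names event_year, win, and res are illustrative:

```r
df <- data.frame(
  companyID = c(rep(1L, 8), rep(2L, 5), rep(3L, 10)),
  year      = c(2000:2007, 2012:2016, 2008:2017),
  status    = c(1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L,
                1L, 0L, 2L, 2L, 2L,
                1L, 1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L, 2L)
)

# Event year of each company, repeated on every row of that company
event_year <- ave(ifelse(df$status == 0, df$year, NA), df$companyID,
                  FUN = function(y) y[!is.na(y)][1])

# Rows within 3 years of the event
win <- df[abs(df$year - event_year) <= 3, ]

# Keep only companies whose window has all 7 years
n_rows <- ave(win$year, win$companyID, FUN = length)
res <- win[n_rows == 7, ]
```

With the sample data, company 2 is dropped because only one year precedes its event.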

Ifelse with dplyr in R

I would like to use dplyr to replace the NA values in the DV column of each ID with the DV value at a specific time point for that same individual:
I want to replace NA (DV column) at time 2 of each ID with the DV value at time 4 of that specific ID.
I want to replace NA (DV column) at time 4 of each ID with the DV value at time 0 of that specific ID.
I cannot figure out how to do it with dplyr.
Here is my dataset:
ID TIME DV
1 0 5
1 2 NA
1 4 4
2 0 3
2 2 3
2 4 NA
3 0 7
3 2 NA
3 4 9
Expected output:
ID TIME DV
1 0 5
1 2 4
1 4 4
2 0 3
2 2 3
2 4 3
3 0 7
3 2 9
3 4 9
Any suggestions are appreciated.
I agree with @akrun that perhaps fill is a good fit in general, but your rules suggest handling things a little differently (since "updown" does not strictly follow your rules: it would fill an NA at time 4 from time 2 rather than from time 0).
library(dplyr)
# library(tidyr)
dat %>%
  tidyr::pivot_wider(id_cols = "ID", names_from = "TIME", values_from = "DV") %>%
  mutate(
    `2` = if_else(is.na(`2`), `4`, `2`),
    `4` = if_else(is.na(`4`), `0`, `4`)
  ) %>%
  tidyr::pivot_longer(-ID, names_to = "TIME", values_to = "DV")
# # A tibble: 9 x 3
# ID TIME DV
# <int> <chr> <int>
# 1 1 0 5
# 2 1 2 4
# 3 1 4 4
# 4 2 0 3
# 5 2 2 3
# 6 2 4 3
# 7 3 0 7
# 8 3 2 9
# 9 3 4 9
It might help to visualize what this is doing by looking mid-pipe:
dat %>%
tidyr::pivot_wider(id_cols = "ID", names_from = "TIME", values_from = "DV")
# # A tibble: 3 x 4
# ID `0` `2` `4`
# <int> <int> <int> <int>
# 1 1 5 NA 4
# 2 2 3 3 NA
# 3 3 7 NA 9
dat %>%
tidyr::pivot_wider(id_cols = "ID", names_from = "TIME", values_from = "DV") %>%
mutate(
`2` = if_else(is.na(`2`), `4`, `2`),
`4` = if_else(is.na(`4`), `0`, `4`)
)
# # A tibble: 3 x 4
# ID `0` `2` `4`
# <int> <int> <int> <int>
# 1 1 5 4 4
# 2 2 3 3 3
# 3 3 7 9 9
We could use fill after grouping by 'ID'
library(dplyr)
library(tidyr)
df1 %>%
  arrange(ID, TIME) %>%
  # or as @r2evans mentioned
  # arrange(ID, factor(TIME, levels = c(0, 2, 4))) %>%
  group_by(ID) %>%
  fill(DV, .direction = 'updown')   # 'updown' fills upward first, which matches the output shown
# A tibble: 9 x 3
# Groups: ID [3]
# ID TIME DV
# <int> <int> <int>
#1 1 0 5
#2 1 2 4
#3 1 4 4
#4 2 0 3
#5 2 2 3
#6 2 4 3
#7 3 0 7
#8 3 2 9
#9 3 4 9
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), TIME = c(0L,
2L, 4L, 0L, 2L, 4L, 0L, 2L, 4L), DV = c(5L, NA, 4L, 3L, 3L, NA,
7L, NA, 9L)), class = "data.frame", row.names = c(NA, -9L))
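The two replacement rules can also be written directly, without reshaping. A minimal sketch using a grouped case_when(), assuming each ID has at most one row per TIME value (the object name out is illustrative):

```r
library(dplyr)

dat <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  TIME = c(0L, 2L, 4L, 0L, 2L, 4L, 0L, 2L, 4L),
  DV   = c(5L, NA, 4L, 3L, 3L, NA, 7L, NA, 9L)
)

out <- dat %>%
  group_by(ID) %>%
  mutate(DV = case_when(
    is.na(DV) & TIME == 2 ~ DV[TIME == 4][1],  # NA at time 2: take this ID's time-4 value
    is.na(DV) & TIME == 4 ~ DV[TIME == 0][1],  # NA at time 4: take this ID's time-0 value
    TRUE ~ DV                                  # everything else is left alone
  )) %>%
  ungroup()
```

The [1] makes the lookup return NA rather than a zero-length vector if an ID has no row at the referenced time.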

Reshaping data.frame from wide to long creating more than one column from groups of variables

I have tried to adapt what I know about Reshape() to my needs, but I cannot get it to work.
My data.frame has two sets of columns (a and b), which I want to reshape to the long format separately.
It also has variables I want to keep unmodified. Like this:
id 2010a 2011a 2012a char 2010b 2011b 2012b
1 1 2 3 x 5 6 7
2 1 2 3 y 5 6 7
3 1 2 3 z 5 6 7
4 1 2 3 x 5 6 7
To this long format
id year a b char
1 2010 1 5 x
2 2010 1 5 y
3 2010 1 5 z
4 2010 1 5 x
1 2011 2 6 x
2 2011 2 6 y
3 2011 2 6 z
4 2011 2 6 x
1 2012 3 7 x
2 2012 3 7 y
3 2012 3 7 z
4 2012 3 7 x
Thank you!
A solution with tidyr:
library(tidyr)
library(dplyr)
# exclude char as well as id, so that char is carried along unchanged
dt_final <- gather(dt_initial, key = year, value = value, -c(id, char)) %>%
  separate(col = year, into = c("year", "name"), sep = -1) %>%
  spread(key = name, value = value) %>%
  arrange(id, year)
What about this?
library(data.table)
data2 <- melt(setDT(data), id.vars = "id", variable.name = "year")
data2[, l := substr(year, 6,6)][, year := gsub("[a-zA-Z]", "", year)]
dcast(data2, id + year ~ l, value.var = "value")[order(year, id)]
id year a b
1: 1 2010 1 5
2: 2 2010 1 5
3: 3 2010 1 5
4: 4 2010 1 5
5: 1 2011 2 6
6: 2 2011 2 6
7: 3 2011 2 6
8: 4 2011 2 6
9: 1 2012 3 7
10: 2 2012 3 7
11: 3 2012 3 7
12: 4 2012 3 7
Data:
data <- data.frame(
id = 1:4,
`2010a` = c(1L, 1L, 1L, 1L),
`2011a` = c(2L, 2L, 2L, 2L),
`2012a` = c(3L, 3L, 3L, 3L),
`2010b` = c(5L, 5L, 5L, 5L),
`2011b` = c(6L, 6L, 6L, 6L),
`2012b` = c(7L, 7L, 7L, 7L)
)
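With current tidyr, the same reshape can be sketched in one pivot_longer() call using the special ".value" name, which tells tidyr that the second part of each column name (a or b) is itself an output column. This assumes tidyr >= 1.0.0 and column names of the form four digits plus a or b, built here with check.names = FALSE so they are not prefixed with X:

```r
library(tidyr)

wide <- data.frame(
  id = 1:4,
  `2010a` = 1L, `2011a` = 2L, `2012a` = 3L,
  char = c("x", "y", "z", "x"),
  `2010b` = 5L, `2011b` = 6L, `2012b` = 7L,
  check.names = FALSE
)

long <- pivot_longer(
  wide,
  cols = -c(id, char),                  # id and char are kept as they are
  names_to = c("year", ".value"),       # first capture group -> year, second -> column name
  names_pattern = "([0-9]{4})([ab])"
)

# Order by year, then id, as in the requested layout
long <- long[order(long$year, long$id), ]
```

The year column comes out as character; convert it with as.integer() if a numeric year is needed.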

Create a conditional count variable in R

I want to create a count variable with the number of people with Z==0 in each of the given years, as illustrated below:
PersonID Year Z Count*
1 1990 0 1
2 1990 1 1
3 1990 1 1
4 1990 2 1
5 1990 1 1
1 1991 1 3
2 1991 0 3
3 1991 1 3
4 1991 0 3
5 1991 0 3
1 1992 NA 1
2 1992 2 1
3 1992 2 1
4 1992 0 1
5 1993 1 0
1 1993 1 0
2 1993 2 0
3 1993 NA 0
4 1993 1 0
5 1994 0 5
1 1994 0 5
2 1994 0 5
3 1994 0 5
4 1994 0 5
I looked at my previous R-scripts and found this
library(dplyr)
sum_data <- data %>% group_by(PersonID) %>% summarise(Count = sum(Z, na.rm=T))
Can someone help me get this right? The count variable should basically count a total number of persons with Z==0, in the same format as I illustrated above. Thanks!!
dput(data)
structure(list(PersonID = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L),
Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L,
1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1992L, 1993L, 1993L,
1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 1994L, 1994L),
Z = c(0L, 1L, 1L, 2L, 1L, 1L, 0L, 1L, 0L, 0L, NA, 2L, 2L,
0L, 1L, 1L, 2L, NA, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("PersonID",
"Year", "Z"), class = "data.frame", row.names = c(NA, -24L))
Here's a simple solution :
library(dplyr)
sum_data <- df %>%
mutate(Z=replace(Z, is.na(Z), 1)) %>%
mutate(temp = ifelse(Z == 0, 1, 0)) %>%
group_by(Year) %>%
summarize(count = sum(temp))
Basically, this is what the code is doing:
mutate(Z = replace(Z, is.na(Z), 1)): replaces the NAs with 1 (optional)
mutate(temp = ifelse(Z == 0, 1, 0)): creates a temporary indicator that is 1 if Z == 0 and 0 otherwise
group_by(Year): groups the data frame by Year
summarize(count = sum(temp)): creates a count variable holding the per-year sum of temp
results :
Year count
<int> <dbl>
1 1990 1
2 1991 3
3 1992 1
4 1993 0
5 1994 5
and if you want to join this data to the original data frame just use join :
left_join(df, sum_data)
Joining, by = "Year"
PersonID Year Z count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
Try this:
library(dplyr)
df <- left_join(
  data,
  data %>% filter(Z == 0) %>% group_by(Year) %>% summarise(Count = n()),
  by = "Year"
)
# years with no Z == 0 rows come back as NA from the join; set them to 0
df$Count[is.na(df$Count)] <- 0
PersonID Year Z Count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
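Both answers build the per-year count separately and join it back. A shorter sketch of the same result, assuming dplyr is available, uses a grouped mutate() so the count is attached to every row in one step:

```r
library(dplyr)

data <- data.frame(
  PersonID = c(1:5, 1:5, 1:5, 1:5, 1:4),
  Year     = rep(1990:1994, times = c(5, 5, 4, 5, 5)),
  Z        = c(0L, 1L, 1L, 2L, 1L, 1L, 0L, 1L, 0L, 0L, NA, 2L, 2L,
               0L, 1L, 1L, 2L, NA, 1L, 0L, 0L, 0L, 0L, 0L)
)

out <- data %>%
  group_by(Year) %>%
  # Z == 0 is TRUE/FALSE/NA; summing counts the TRUEs, and na.rm skips missing Z
  mutate(Count = sum(Z == 0, na.rm = TRUE)) %>%
  ungroup()
```

Because na.rm = TRUE ignores missing values, the NA entries in Z do not need to be replaced first.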
