Create a conditional count variable in R

I want to create a count variable holding the number of people with Z == 0 in each of the given years, as illustrated below:
PersonID Year Z Count*
1 1990 0 1
2 1990 1 1
3 1990 1 1
4 1990 2 1
5 1990 1 1
1 1991 1 3
2 1991 0 3
3 1991 1 3
4 1991 0 3
5 1991 0 3
1 1992 NA 1
2 1992 2 1
3 1992 2 1
4 1992 0 1
5 1993 1 0
1 1993 1 0
2 1993 2 0
3 1993 NA 0
4 1993 1 0
5 1994 0 5
1 1994 0 5
2 1994 0 5
3 1994 0 5
4 1994 0 5
I looked at my previous R scripts and found this:
library(dplyr)
sum_data <- data %>% group_by(PersonID) %>% summarise(Count = sum(Z, na.rm=T))
Can someone help me get this right? The count variable should basically count a total number of persons with Z==0, in the same format as I illustrated above. Thanks!!
dput(data)
structure(list(PersonID = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L),
Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L,
1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1992L, 1993L, 1993L,
1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 1994L, 1994L),
Z = c(0L, 1L, 1L, 2L, 1L, 1L, 0L, 1L, 0L, 0L, NA, 2L, 2L,
0L, 1L, 1L, 2L, NA, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("PersonID",
"Year", "Z"), class = "data.frame", row.names = c(NA, -24L))

Here's a simple solution (df is the question's data frame, which is called data there):
library(dplyr)
sum_data <- df %>%
mutate(Z=replace(Z, is.na(Z), 1)) %>%
mutate(temp = ifelse(Z == 0, 1, 0)) %>%
group_by(Year) %>%
summarize(count = sum(temp))
Basically, this is what the code does:
mutate(Z = replace(Z, is.na(Z), 1)) replaces each NA with 1 so that missing values are not counted as zeros. Without this step, temp contains NA and sum(temp) returns NA unless you pass na.rm = TRUE.
mutate(temp = ifelse(Z == 0, 1, 0)) creates a helper variable temp that is 1 if Z == 0 and 0 otherwise.
group_by(Year) groups the data frame by Year.
summarize(count = sum(temp)) creates the count variable as the sum of temp within each year.
Results:
Year count
<int> <dbl>
1 1990 1
2 1991 3
3 1992 1
4 1993 0
5 1994 5
If you want to attach these counts to the original data frame, use left_join:
left_join(df, sum_data)
Joining, by = "Year"
PersonID Year Z count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
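The replace(Z, is.na(Z), 1) step in the pipeline above matters because comparing NA with 0 yields NA, and a single NA makes sum() return NA. A small base R sketch of that behaviour, using a toy vector z that is made up for illustration and is not part of the question's data:

```r
# Toy vector with one missing value (illustrative, not from the question)
z <- c(0, NA, 1, 0)

# NA == 0 evaluates to NA, so temp is c(1, NA, 0, 1)
temp <- ifelse(z == 0, 1, 0)

total_default <- sum(temp)                  # NA: the missing value propagates
total_na_rm   <- sum(temp, na.rm = TRUE)    # missing values are skipped
total_direct  <- sum(z == 0, na.rm = TRUE)  # same count without a helper column
```

If recoding NA feels too invasive, passing na.rm = TRUE to sum() leaves Z untouched and gives the same count.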

Try this:
library(dplyr)
df <- left_join(data, data %>% filter(Z==0) %>% group_by(Year) %>% summarise(Count = n()))
df[is.na(df$Count),]$Count <- 0
PersonID Year Z Count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
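Both answers above compute the per-year count in a separate table and then join it back. A minimal sketch of a one-step alternative, assuming dplyr is installed, is to use mutate() on the grouped data so the count is written to every row of its year. The name result is introduced here for illustration, and the data frame is rebuilt from the question's dput() output:

```r
library(dplyr)

# Same values as the dput() output in the question
data <- data.frame(
  PersonID = rep_len(1:5, 24),
  Year = rep(1990:1994, times = c(5, 5, 4, 5, 5)),
  Z = c(0, 1, 1, 2, 1, 1, 0, 1, 0, 0, NA, 2, 2, 0,
        1, 1, 2, NA, 1, 0, 0, 0, 0, 0)
)

result <- data %>%
  group_by(Year) %>%
  # Z == 0 is TRUE/FALSE/NA; summing counts the TRUEs and na.rm drops the NAs
  mutate(Count = sum(Z == 0, na.rm = TRUE)) %>%
  ungroup()
```

Because mutate() keeps all rows, no join and no NA-to-zero repair step is needed: a year without any Z == 0 simply gets a count of 0.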

Related

Using groupby and n_distinct to count NEW unique ids

The following code counts the number of unique IDs per year. My question is: how to count the number of new unique IDs, i.e., IDs that did not appear in previous years?
df %>%
group_by(year) %>%
summarize(count=n_distinct(ID))
For example, I need to create the variable wanted_count below:
Year ID count wanted_count
2000 1 3 3
2000 2 3 3
2000 3 3 3
2001 2 2 0
2001 3 2 0
2002 3 2 1
2002 4 2 1
2003 4 2 1
2003 7 2 1
2003 4 2 1
See data below:
df <- structure(list(Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L,
2002L, 2003L, 2003L, 2003L), ID = c(1L, 2L, 3L, 2L, 3L, 3L, 4L,
4L, 7L, 4L), count = c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), wanted_count = c(3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
library(dplyr)
df %>%
mutate(cum_new = cumsum(!duplicated(ID))) %>%
group_by(Year) %>%
summarize(total = max(cum_new), .groups = "drop") %>%
mutate(
result = c(first(total), diff(total)),
total = NULL
) %>%
left_join(df, by = "Year")
# # A tibble: 10 × 5
# Year result ID count wanted_count
# <int> <int> <int> <int> <int>
# 1 2000 3 1 3 3
# 2 2000 3 2 3 3
# 3 2000 3 3 3 3
# 4 2001 0 2 2 0
# 5 2001 0 3 2 0
# 6 2002 1 3 2 1
# 7 2002 1 4 2 1
# 8 2003 1 4 2 1
# 9 2003 1 7 2 1
# 10 2003 1 4 2 1
Using this data:
df = read.table(text = 'Year ID count wanted_count
2000 1 3 3
2000 2 3 3
2000 3 3 3
2001 2 2 0
2001 3 2 0
2002 3 2 1
2002 4 2 1
2003 4 2 1
2003 7 2 1
2003 4 2 1', header = T)
You could create a logical column indicating the first occurrence of each ID, and sum it up within each Year group.
library(dplyr)
df %>%
mutate(new = !duplicated(ID)) %>%
add_count(Year, wt = new) %>%
select(-new)
# Year ID n
# 1 2000 1 3
# 2 2000 2 3
# 3 2000 3 3
# 4 2001 2 0
# 5 2001 3 0
# 6 2002 3 1
# 7 2002 4 1
# 8 2003 4 1
# 9 2003 7 1
# 10 2003 4 1
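The same idea (flag the first occurrence of each ID, then total the flags per year) can also be sketched in base R without dplyr. This assumes the rows are sorted by Year, which the code enforces with order(). The column name new_ids is made up for this example:

```r
# Year and ID columns from the question's data
df <- data.frame(
  Year = c(2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003),
  ID   = c(1, 2, 3, 2, 3, 3, 4, 4, 7, 4)
)

# duplicated() scans top to bottom, so earlier years must come first
df <- df[order(df$Year), ]

# TRUE only the first time an ID is seen
is_new <- !duplicated(df$ID)

# Sum the flags within each year and write the total to every row of that year
df$new_ids <- ave(as.integer(is_new), df$Year, FUN = sum)
```

Here new_ids reproduces the wanted_count column from the question.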

In R: subset so that I only have the observations 3 years prior to and after an event

I found the following link of an answer that I should be able to apply, but it didn't seem to work:
https://stackoverflow.com/a/66485141/15388602
The following is a sample from my dataset:
companyID year status
1 2000 1
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
2 2012 1
2 2013 0
2 2014 2
2 2015 2
2 2016 2
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
3 2017 2
I would like to get the following observations so that I now only have the observations concerning 3 years before the event, the year of the event (where status is 0), and the 3 years after the event:
companyID year status
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
Would it be easier if I supplied the variable showing the event date? The variable would show a date in the same observation (year) that the status is 0.
Thank you in advance for any help!
This could be achieved with group_by, arrange and filter:
library(dplyr)
df %>% group_by(companyID) %>%
arrange(status, year, .by_group = TRUE) %>%
filter(year >= first(year)- 3 & year <= first(year)+ 3) %>%
filter(n() >=7) %>%
arrange(year)
Output:
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
Try this with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
group_by(companyID, year) %>%
mutate(ref_yr = case_when(status == 0 ~ year,
TRUE ~ NA_integer_)) %>%
ungroup() %>%
group_by(companyID) %>%
fill(ref_yr, .direction = "downup") %>%
mutate(yr_diff = abs(ref_yr - year))%>%
filter(yr_diff <= 3) %>%
select(-c(ref_yr, yr_diff))
#> # A tibble: 19 x 3
#> # Groups: companyID [3]
#> companyID year status
#> <int> <int> <int>
#> 1 1 2001 1
#> 2 1 2002 1
#> 3 1 2003 1
#> 4 1 2004 0
#> 5 1 2005 2
#> 6 1 2006 2
#> 7 1 2007 2
#> 8 2 2012 1
#> 9 2 2013 0
#> 10 2 2014 2
#> 11 2 2015 2
#> 12 2 2016 2
#> 13 3 2010 1
#> 14 3 2011 1
#> 15 3 2012 1
#> 16 3 2013 0
#> 17 3 2014 2
#> 18 3 2015 2
#> 19 3 2016 2
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2012L, 2013L, 2014L, 2015L, 2016L, 2008L, 2009L, 2010L,
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L), status = c(1L,
1L, 1L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 0L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-23L))
Created on 2021-04-25 by the reprex package (v2.0.0)
Does this work?
library(dplyr)
df %>% group_by(companyID) %>%
mutate(flag1 = year[status == 0] - min(year), flag2 = max(year) - year[status == 0]) %>%
filter(flag1 > 2 & flag2 > 2 & between(year,year[status == 0] - 3, year[status == 0] + 3)) %>% select(-flag1, -flag2)
# A tibble: 14 x 3
# Groups: companyID [2]
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
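If your data frame is df and its rows are sorted by companyID and year, you can select rows by position. Note that this window ignores company boundaries, so it can pick up rows from a neighbouring company when fewer than three years are available on either side of the event:
zeros <- which(df$status == 0)
# as.vector() flattens the matrix of row positions that sapply() returns
calcrows <- as.vector(sapply(zeros, function(x) (x-3):(x+3)))
df2 <- df[calcrows, ]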
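The positional window in the snippet above works on row numbers and knows nothing about companies. A base R sketch that computes the window per company and keeps only companies with a complete seven-year window might look as follows. It assumes one event (status == 0) per company and one row per company and year. The names event_year, in_window and n_in_window are introduced here for illustration:

```r
# Same values as the question's sample data
df <- data.frame(
  companyID = rep(1:3, times = c(8, 5, 10)),
  year = c(2000:2007, 2012:2016, 2008:2017),
  status = c(1, 1, 1, 1, 0, 2, 2, 2,
             1, 0, 2, 2, 2,
             1, 1, 1, 1, 1, 0, 2, 2, 2, 2)
)

# Event year of each company, repeated on every row of that company
event_year <- ave(ifelse(df$status == 0, df$year, NA), df$companyID,
                  FUN = function(y) y[!is.na(y)][1])

# Rows within three years of the event (FALSE for companies without an event)
in_window <- !is.na(event_year) & abs(df$year - event_year) <= 3

# Number of window rows per company; 7 means the window is complete
n_in_window <- ave(as.integer(in_window), df$companyID, FUN = sum)

df2 <- df[in_window & n_in_window == 7, ]
```

With the sample data this keeps companies 1 and 3 and drops company 2, which has only one year before its event.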

How can I make a counter conditionally on the rows? R

I have a dataset of firms in which I have "completed the panel", so every firm has a row for every year. Whenever the quantitative variables (sales, wages) are 0, the firm is closed. NA marks rows that were added when completing the panel, meaning the firm did not exist yet (or no longer existed) in that year.
I want to make a counter that starts at the first closure of the firm.
So my data looks like this:
Year Firm sales wages
2014 A 12 4
2015 A 8 3
2016 A 0 0
2017 A NA NA
2018 A NA NA
2014 B NA NA
2015 B 8 3
2016 B 4 2
2017 B 9 5
2018 B 8 6
2014 C 9 5
2015 C 7 6
2016 C 0 0
2017 C 0 0
2018 C 0 0
And the desired result looks like this:
Year Firm sales wages Closure
2014 A 12 4 0
2015 A 8 3 0
2016 A 0 0 1
2017 A NA NA 2 # After the closure in 2016 it doesn't appear on the original dataset anymore
2018 A NA NA 3 # Same here
2014 B NA NA 0 # Here the firm has not been created yet
2015 B NA NA 0 # Here too
2016 B 4 2 0
2017 B 9 5 0
2018 B 8 6 0
2014 C 9 5 0
2015 C 7 6 0
2016 C 0 0 1
2017 C 0 0 2 # After the closure it keeps appearing because the firm has some debts or other pending obligations
2018 C 0 0 3 # Same here: it still appears because it still has obligations
How can I accomplish this?
Thanks In Advance.
Perhaps this helps
library(dplyr)
library(tidyr)
df1 %>%
group_by(Firm) %>%
mutate(Closure = replace_na(cumsum(lead(is.na(sales) &
is.na(wages), default = TRUE)|(sales == 0 & wages == 0)), 0)) %>%
ungroup
Output:
# A tibble: 15 x 5
# Year Firm sales wages Closure
# <int> <chr> <int> <int> <dbl>
# 1 2014 A 12 4 0
# 2 2015 A 8 3 0
# 3 2016 A 0 0 1
# 4 2017 A NA NA 2
# 5 2018 A NA NA 3
# 6 2014 B NA NA 0
# 7 2015 B 8 3 0
# 8 2016 B 4 2 0
# 9 2017 B 9 5 0
#10 2018 B 8 6 0
#11 2014 C 9 5 0
#12 2015 C 7 6 0
#13 2016 C 0 0 1
#14 2017 C 0 0 2
#15 2018 C 0 0 3
data
df1 <- structure(list(Year = c(2014L, 2015L, 2016L, 2017L, 2018L, 2014L,
2015L, 2016L, 2017L, 2018L, 2014L, 2015L, 2016L, 2017L, 2018L
), Firm = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"), sales = c(12L, 8L, 0L, NA, NA, NA,
8L, 4L, 9L, 8L, 9L, 7L, 0L, 0L, 0L), wages = c(4L, 3L, 0L, NA,
NA, NA, 3L, 2L, 5L, 6L, 5L, 6L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-15L))
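The answer above relies on lead() with default = TRUE, which would also start the counter on the last row of a firm that never closes and has no NA rows. A base R sketch of a more explicit variant flags rows where both sales and wages are exactly 0, and then counts every year from the first such row onward. It assumes the rows are sorted by Year within each Firm. The helper name zero is made up for this example:

```r
df1 <- data.frame(
  Year  = rep(2014:2018, times = 3),
  Firm  = rep(c("A", "B", "C"), each = 5),
  sales = c(12, 8, 0, NA, NA,  NA, 8, 4, 9, 8,  9, 7, 0, 0, 0),
  wages = c(4, 3, 0, NA, NA,   NA, 3, 2, 5, 6,  5, 6, 0, 0, 0)
)

# TRUE only when both variables are observed and equal to 0
zero <- !is.na(df1$sales) & df1$sales == 0 &
        !is.na(df1$wages) & df1$wages == 0

# Within each firm: cumsum(z) > 0 is TRUE from the first closure row onward,
# and the outer cumsum() turns that into a running count 1, 2, 3, ...
df1$Closure <- ave(as.integer(zero), df1$Firm,
                   FUN = function(z) cumsum(cumsum(z) > 0))
```

Rows that are NA after a closure keep counting (firm A), while NA rows before the firm exists stay at 0 (firm B).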

Creating dummy variable based on group properties

My data looks something like this:
ID CSEX MID CMOB CYRB 1ST 2ND
1 1 1 1 1991 0 1
2 1 1 7 1989 1 0
3 2 2 1 1985 1 0
4 2 2 11 1985 0 1
5 1 2 9 1994 0 0
6 2 3 4 1992 1 0
7 2 4 2 1992 0 1
8 1 4 10 1983 1 0
With ID = child ID, CSEX = child sex, MID = mother ID, CMOB = month of birth and CYRB = year of birth, 1st = first born dummy, 2nd = second born dummy.
And I'm trying to make a dummy variable that takes the value 1 if the first two children born into a family (i.e. with the same MID) are the same sex.
I tried
Identifiers_age <- Identifiers_age %>% group_by(MPUBID) %>%
mutate(samesex =
as.numeric(((first == 1 & CSEX == 1) & (second == 1 & CSEX == 1))
| ((first == 1 & CSEX == 2) & (second == 1 & CSEX == 2))))
But clearly this still only checks the condition within each individual row rather than across the MID group, so it returns a dummy that is always 0.
Thanks
Edit for expected output:
ID CSEX MID CMOB CYRB 1ST 2ND SAMESEX
1 1 1 1 1991 0 1 1
2 1 1 7 1989 1 0 1
3 2 2 1 1985 1 0 1
4 2 2 11 1985 0 1 1
5 1 2 9 1994 0 0 1
6 2 3 4 1992 1 0 0
7 2 4 2 1992 0 1 0
8 1 4 10 1983 1 0 0
i.e. for any individual that is in a family where the first two children born are of the same sex, the dummy SAMESEX = 1
Edit2 (What I showed before was just an example I made, for the true dataset calling structure gives):
CPUBID MPUBID CSEX CMOB CYRB first second
<int> <int> <int> <int> <int> <dbl> <dbl>
1 201 2 2 3 1993 1 0
2 202 2 2 11 1994 0 1
3 301 3 2 6 1981 1 0
4 302 3 2 10 1983 0 1
5 303 3 2 4 1986 0 0
6 401 4 1 8 1980 1 0
7 403 4 2 3 1997 0 1
8 801 8 2 3 1976 1 0
9 802 8 1 5 1979 0 1
10 803 8 2 9 1982 0 0
and str:
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 11512 obs. of 7 variables:
$ CPUBID : int 201 202 301 302 303 401 403 801 802 803 ...
$ MPUBID : int 2 2 3 3 3 4 4 8 8 8 ...
$ CSEX : int 2 2 2 2 2 1 2 2 1 2 ...
$ CMOB : int 3 11 6 10 4 8 3 3 5 9 ...
$ CYRB : int 1993 1994 1981 1983 1986 1980 1997 1976 1979 1982 ...
$ first : num 1 0 1 0 0 1 0 1 0 0 ...
$ second : num 0 1 0 1 0 0 1 0 1 0 ...
Maybe this helps:
library(dplyr)
Identifiers_age %>%
group_by(MID) %>%
mutate(ind1 = CSEX *`1ST`,
ind2 = CSEX *`2ND`,
SAMESEX = as.integer(n_distinct(c(ind1[ind1!=0],
ind2[ind2!=0]))==1 & sum(ind1) >0 & sum(ind2) > 0)) %>%
select(-ind1, -ind2)
# ID CSEX MID CMOB CYRB 1ST 2ND SAMESEX
# <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 1 1 1991 0 1 1
#2 2 1 1 7 1989 1 0 1
#3 3 2 2 1 1985 1 0 1
#4 4 2 2 11 1985 0 1 1
#5 5 1 2 9 1994 0 0 1
#6 6 2 3 4 1992 1 0 0
#7 7 2 4 2 1992 0 1 0
#8 8 1 4 10 1983 1 0 0
Or it can be made slightly more compact with:
Identifiers_age %>%
group_by(MID) %>%
mutate(SAMESEX = as.integer(n_distinct(c(CSEX * NA^!`1ST`, CSEX * NA^!`2ND`),
na.rm = TRUE)==1 & sum(`1ST`) > 0 & sum(`2ND`) > 0))
data
Identifiers_age <- structure(list(ID = 1:8, CSEX = c(1L, 1L, 2L, 2L, 1L,
2L, 2L,
1L), MID = c(1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), CMOB = c(1L, 7L,
1L, 11L, 9L, 4L, 2L, 10L), CYRB = c(1991L, 1989L, 1985L, 1985L,
1994L, 1992L, 1992L, 1983L), `1ST` = c(0L, 1L, 1L, 0L, 0L, 1L,
0L, 1L), `2ND` = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L)), .Names = c("ID",
"CSEX", "MID", "CMOB", "CYRB", "1ST", "2ND"), class = "data.frame",
row.names = c(NA, -8L))
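For readers who find the NA^!x trick hard to follow, here is a base R sketch of the same logic written as a lookup: find the sex of each mother's first-born and second-born child, then compare the two. It assumes at most one first-born and one second-born per mother. The names d, first_sex, second_sex, key and same are introduced for illustration:

```r
# Example data from the question (CMOB and CYRB omitted, they are not needed)
d <- data.frame(
  ID    = 1:8,
  CSEX  = c(1, 1, 2, 2, 1, 2, 2, 1),
  MID   = c(1, 1, 2, 2, 2, 3, 4, 4),
  `1ST` = c(0, 1, 1, 0, 0, 1, 0, 1),
  `2ND` = c(1, 0, 0, 1, 0, 0, 1, 0),
  check.names = FALSE   # keep the non-syntactic column names 1ST and 2ND
)

firsts  <- d[d$`1ST` == 1, ]
seconds <- d[d$`2ND` == 1, ]

# Named vectors: mother ID -> sex of her first / second child
first_sex  <- setNames(firsts$CSEX, firsts$MID)
second_sex <- setNames(seconds$CSEX, seconds$MID)

# Look up both values for every row; a missing child gives NA
key  <- as.character(d$MID)
same <- first_sex[key] == second_sex[key]

# NA (no second child recorded) counts as "not same sex"
d$SAMESEX <- as.integer(!is.na(same) & same)
```

On the example data this reproduces the SAMESEX column from the expected output, including 0 for mother 3, who has no second child recorded.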

How to generate a "range" variable in R? [duplicate]

I have a dataset that looks something like this:
Subject Year X
A 1990 1
A 1991 1
A 1992 2
A 1993 3
A 1994 4
A 1995 4
B 1990 0
B 1991 1
B 1992 1
B 1993 2
C 1991 1
C 1992 2
C 1993 3
C 1994 3
D 1991 1
D 1992 2
D 1993 3
D 1994 4
D 1995 5
D 1996 5
D 1997 6
I want to generate a binary (0/1) variable (let's say variable A) that indicates whether the X variable has reached 3 (or 1-3), for each Subject. If the X variable has reached 4 or more, A should not capture it.
It should look like this:
Subject Year X A
A 1990 1 0
A 1991 1 0
A 1992 2 0
A 1993 3 0
A 1994 4 0
A 1995 4 0
B 1990 0 0
B 1991 1 0
B 1992 1 0
B 1993 2 0
C 1991 1 1
C 1992 2 1
C 1993 3 1
C 1994 3 1
D 1991 1 0
D 1992 2 0
D 1993 3 0
D 1994 4 0
D 1995 5 0
D 1996 5 0
D 1997 6 0
I tried the following: mydata$A <- as.numeric(mydata$X %in% 1:3), but it doesn't control for the continuation.
A reproducible sample:
> dput(mydata)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor"), Year = c(1990L, 1991L, 1992L,
1993L, 1994L, 1995L, 1990L, 1991L, 1992L, 1993L, 1991L, 1992L,
1993L, 1994L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L
), X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 1L, 2L, 3L,
3L, 1L, 2L, 3L, 4L, 5L, 5L, 6L)), .Names = c("Subject", "Year",
"X"), class = "data.frame", row.names = c(NA, -21L))
All suggestions are welcome – thanks!
Here's a base R one-liner using ave (with the data stored as df):
df$A <- ave(df$X, df$Subject, FUN = function(x) if (max(x) == 3) 1 else 0)
> df
Subject Year X A
1 A 1990 1 0
2 A 1991 1 0
3 A 1992 2 0
4 A 1993 3 0
5 A 1994 4 0
6 A 1995 4 0
7 B 1990 0 0
8 B 1991 1 0
9 B 1992 1 0
10 B 1993 2 0
11 C 1991 1 1
12 C 1992 2 1
13 C 1993 3 1
14 C 1994 3 1
15 D 1991 1 0
16 D 1992 2 0
17 D 1993 3 0
18 D 1994 4 0
19 D 1995 5 0
20 D 1996 5 0
21 D 1997 6 0
If you only want to capture increases, the shift function from the data.table package gives you access to the previous row. This works, but the first value is NA because it has nothing to compare with:
library(data.table)
mydata$A <- ifelse(mydata$X > shift(mydata$X, 1L, type="lag"), 1,0)
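The lag comparison above runs over the whole column, so the first row of each Subject is compared with the last row of the previous Subject. A base R sketch that compares only within a Subject, and needs no extra package, could look like this. The column name increase is made up for this example:

```r
# Same values as the question's dput() output
mydata <- data.frame(
  Subject = rep(c("A", "B", "C", "D"), times = c(6, 4, 4, 7)),
  Year = c(1990:1995, 1990:1993, 1991:1994, 1991:1997),
  X = c(1, 1, 2, 3, 4, 4,  0, 1, 1, 2,  1, 2, 3, 3,  1, 2, 3, 4, 5, 5, 6)
)

# c(NA, head(x, -1)) is x shifted down by one row within the subject;
# the first row of each subject has no predecessor and therefore gets NA
mydata$increase <- ave(mydata$X, mydata$Subject,
                       FUN = function(x) as.numeric(x > c(NA, head(x, -1))))
```

Each Subject now starts with NA instead of inheriting a comparison from the Subject above it.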
