I have a dataset that looks something like this:
Subject Year X
A 1990 1
A 1991 1
A 1992 2
A 1993 3
A 1994 4
A 1995 4
B 1990 0
B 1991 1
B 1992 1
B 1993 2
C 1991 1
C 1992 2
C 1993 3
C 1994 3
D 1991 1
D 1992 2
D 1993 3
D 1994 4
D 1995 5
D 1996 5
D 1997 6
I want to generate a binary (0/1) variable (let's say variable A) that indicates whether the X variable has reached 3 (or 1-3) for each Subject. If the X variable has reached 4 or more, A should not capture it.
It should look like this:
Subject Year X A
A 1990 1 0
A 1991 1 0
A 1992 2 0
A 1993 3 0
A 1994 4 0
A 1995 4 0
B 1990 0 0
B 1991 1 0
B 1992 1 0
B 1993 2 0
C 1991 1 1
C 1992 2 1
C 1993 3 1
C 1994 3 1
D 1991 1 0
D 1992 2 0
D 1993 3 0
D 1994 4 0
D 1995 5 0
D 1996 5 0
D 1997 6 0
I tried the following: mydata$A <- as.numeric(mydata$X %in% 1:3) but it doesn't control for the continuation.
A reproducible sample:
> dput(mydata)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor"), Year = c(1990L, 1991L, 1992L,
1993L, 1994L, 1995L, 1990L, 1991L, 1992L, 1993L, 1991L, 1992L,
1993L, 1994L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L
), X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 1L, 2L, 3L,
3L, 1L, 2L, 3L, 4L, 5L, 5L, 6L)), .Names = c("Subject", "Year",
"X"), class = "data.frame", row.names = c(NA, -21L))
All suggestions are welcome – thanks!
Here's a base R one-liner using ave:
df$A <- ave(df$X, df$Subject, FUN = function(x) if (max(x) == 3) 1 else 0)
> df
Subject Year X A
1 A 1990 1 0
2 A 1991 1 0
3 A 1992 2 0
4 A 1993 3 0
5 A 1994 4 0
6 A 1995 4 0
7 B 1990 0 0
8 B 1991 1 0
9 B 1992 1 0
10 B 1993 2 0
11 C 1991 1 1
12 C 1992 2 1
13 C 1993 3 1
14 C 1994 3 1
15 D 1991 1 0
16 D 1992 2 0
17 D 1993 3 0
18 D 1994 4 0
19 D 1995 5 0
20 D 1996 5 0
21 D 1997 6 0
Then, if you only want to capture increases, the shift function (from data.table) lets you access other rows. This solution works, but the first value is NA because it has nothing to compare with:
library(data.table)
mydata$A <- ifelse(mydata$X > shift(mydata$X, 1L, type="lag"), 1, 0)
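To illustrate the point about capturing increases, here is a minimal base R sketch. It assumes the comparison should restart within each Subject and that the first row of each subject should be 0 rather than NA. The toy data frame and the column name increase are made up for this example.

```r
# Toy data (hypothetical), already sorted by year within Subject
mydata <- data.frame(
  Subject = rep(c("A", "B"), each = 3),
  X = c(1L, 1L, 2L, 0L, 1L, 1L)
)

# Within each Subject, flag rows where X is larger than in the previous row.
# diff() returns one element fewer than x, so the first row is set to 0.
mydata$increase <- ave(
  mydata$X, mydata$Subject,
  FUN = function(x) c(0L, as.integer(diff(x) > 0))
)

print(mydata)
```

Because the work is done per group by ave, the last row of one subject is never compared with the first row of the next.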
I found the following answer, which I should be able to apply, but it didn't seem to work:
https://stackoverflow.com/a/66485141/15388602
The following is a sample from my dataset:
companyID year status
1 2000 1
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
2 2012 1
2 2013 0
2 2014 2
2 2015 2
2 2016 2
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
3 2017 2
I would like to get the following observations so that I now only have the observations concerning 3 years before the event, the year of the event (where status is 0), and the 3 years after the event:
companyID year status
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
Would it be easier if I supplied the variable showing the event date? The variable would show a date in the same observation (year) that the status is 0.
Thank you in advance for any help!
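This could be achieved with group_by, arrange and filter:
library(dplyr)
df %>%
  group_by(companyID) %>%
  # status 0 sorts first, so first(year) is the event year within each company
  arrange(status, year, .by_group = TRUE) %>%
  filter(year >= first(year) - 3 & year <= first(year) + 3) %>%
  # keep only companies with the full 7-year window
  filter(n() >= 7) %>%
  arrange(year)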
Output:
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
Try this with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
group_by(companyID, year) %>%
mutate(ref_yr = case_when(status == 0 ~ year,
TRUE ~ NA_integer_)) %>%
ungroup() %>%
group_by(companyID) %>%
fill(ref_yr, .direction = "downup") %>%
mutate(yr_diff = abs(ref_yr - year))%>%
filter(yr_diff <= 3) %>%
select(-c(ref_yr, yr_diff))
#> # A tibble: 19 x 3
#> # Groups: companyID [3]
#> companyID year status
#> <int> <int> <int>
#> 1 1 2001 1
#> 2 1 2002 1
#> 3 1 2003 1
#> 4 1 2004 0
#> 5 1 2005 2
#> 6 1 2006 2
#> 7 1 2007 2
#> 8 2 2012 1
#> 9 2 2013 0
#> 10 2 2014 2
#> 11 2 2015 2
#> 12 2 2016 2
#> 13 3 2010 1
#> 14 3 2011 1
#> 15 3 2012 1
#> 16 3 2013 0
#> 17 3 2014 2
#> 18 3 2015 2
#> 19 3 2016 2
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2012L, 2013L, 2014L, 2015L, 2016L, 2008L, 2009L, 2010L,
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L), status = c(1L,
1L, 1L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 0L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-23L))
Created on 2021-04-25 by the reprex package (v2.0.0)
Does this work:
library(dplyr)
df %>%
  group_by(companyID) %>%
  mutate(flag1 = year[status == 0] - min(year),
         flag2 = max(year) - year[status == 0]) %>%
  filter(flag1 > 2 & flag2 > 2 &
           between(year, year[status == 0] - 3, year[status == 0] + 3)) %>%
  select(-flag1, -flag2)
# A tibble: 14 x 3
# Groups: companyID [2]
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
If your dataframe is df:
zeros <- which(df$status == 0)
calcrows <- sapply(zeros, function(x) (x-3):(x+3))
df2 <- df[calcrows, ]
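Selecting by row position assumes that the three rows on either side of each event belong to the same company. A hedged base R sketch of a per-company variant follows; it assumes exactly one status == 0 row per company and drops companies without the full 7-year window. The helper names event_year, in_window and n_in_window are made up for illustration.

```r
# Sample data from the question, written compactly
df <- data.frame(
  companyID = rep(1:3, times = c(8, 5, 10)),
  year      = c(2000:2007, 2012:2016, 2008:2017),
  status    = c(1, 1, 1, 1, 0, 2, 2, 2,
                1, 0, 2, 2, 2,
                1, 1, 1, 1, 1, 0, 2, 2, 2, 2)
)

# Event year per company (assumes one status == 0 row per company)
event_year <- ave(
  ifelse(df$status == 0, df$year, NA), df$companyID,
  FUN = function(y) y[!is.na(y)][1]
)

# Rows within 3 years of the event, and how many such rows each company has
in_window   <- abs(df$year - event_year) <= 3
n_in_window <- ave(as.integer(in_window), df$companyID, FUN = sum)

# Keep only complete windows (3 before + event + 3 after = 7 rows)
df2 <- df[in_window & n_in_window == 7, ]
print(df2)
```

With the sample data, company 2 is dropped because it has only one year before its event.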
I have a dataset of firms in which I have "completed the panel", meaning all firms have the same years. Whenever the quantitative variables (sales, wages) are 0, the firm is closed. An NA means that the firm did not exist before (or after) that year.
I want to make a counter for the first closure of the firm.
So my data looks like this:
Year Firm sales wages
2014 A 12 4
2015 A 8 3
2016 A 0 0
2017 A NA NA
2018 A NA NA
2014 B NA NA
2015 B 8 3
2016 B 4 2
2017 B 9 5
2018 B 8 6
2014 C 9 5
2015 C 7 6
2016 C 0 0
2017 C 0 0
2018 C 0 0
And the desired result looks like this:
Year Firm sales wages Closure
2014 A 12 4 0
2015 A 8 3 0
2016 A 0 0 1
2017 A NA NA 2 # After the closure in 2016 it doesn't appear on the original dataset anymore
2018 A NA NA 3 # Same here
2014 B NA NA 0 # Here the firm has not been created yet
2015 B NA NA 0 # Here too
2016 B 4 2 0
2017 B 9 5 0
2018 B 8 6 0
2014 C 9 5 0
2015 C 7 6 0
2016 C 0 0 1
2017 C 0 0 2 # After the closure it continues to appear because the firm has some debts or pending obligations
2018 C 0 0 3 # Same here: it still appears because it still has obligations
How can I accomplish this?
Thanks In Advance.
Perhaps this helps
library(dplyr)
library(tidyr)
df1 %>%
group_by(Firm) %>%
mutate(Closure = replace_na(cumsum(lead(is.na(sales) &
is.na(wages), default = TRUE)|(sales == 0 & wages == 0)), 0)) %>%
ungroup
-output
# A tibble: 15 x 5
# Year Firm sales wages Closure
# <int> <chr> <int> <int> <dbl>
# 1 2014 A 12 4 0
# 2 2015 A 8 3 0
# 3 2016 A 0 0 1
# 4 2017 A NA NA 2
# 5 2018 A NA NA 3
# 6 2014 B NA NA 0
# 7 2015 B 8 3 0
# 8 2016 B 4 2 0
# 9 2017 B 9 5 0
#10 2018 B 8 6 0
#11 2014 C 9 5 0
#12 2015 C 7 6 0
#13 2016 C 0 0 1
#14 2017 C 0 0 2
#15 2018 C 0 0 3
data
df1 <- structure(list(Year = c(2014L, 2015L, 2016L, 2017L, 2018L, 2014L,
2015L, 2016L, 2017L, 2018L, 2014L, 2015L, 2016L, 2017L, 2018L
), Firm = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"), sales = c(12L, 8L, 0L, NA, NA, NA,
8L, 4L, 9L, 8L, 9L, 7L, 0L, 0L, 0L), wages = c(4L, 3L, 0L, NA,
NA, NA, 3L, 2L, 5L, 6L, 5L, 6L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-15L))
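As a more explicit alternative, here is a minimal base R sketch of the same counter. It assumes the rows are sorted by Year within Firm, that a closure is a year where both sales and wages are exactly 0, and that the counter keeps running once a firm has closed (a firm that reopens is not handled). The helper name is_zero is made up for illustration.

```r
# Sample data from the question, written compactly
df1 <- data.frame(
  Year  = rep(2014:2018, times = 3),
  Firm  = rep(c("A", "B", "C"), each = 5),
  sales = c(12, 8, 0, NA, NA,  NA, 8, 4, 9, 8,  9, 7, 0, 0, 0),
  wages = c(4, 3, 0, NA, NA,   NA, 3, 2, 5, 6,  5, 6, 0, 0, 0)
)

# TRUE only where both variables are observed and equal to 0
is_zero <- !is.na(df1$sales) & !is.na(df1$wages) &
  df1$sales == 0 & df1$wages == 0

# Inner cumsum: marks every year from the first closure onward.
# Outer cumsum: counts those years 1, 2, 3, ...
df1$Closure <- ave(
  as.integer(is_zero), df1$Firm,
  FUN = function(z) cumsum(cumsum(z) > 0)
)

print(df1)
```

Under these assumptions a firm that never reports zeros stays at 0 for every year, including the last one.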
How can I produce the imputed variable below? For example, id 1's 2001 value is filled with the mean of 2000 and 2002.
id Year A imputed
1 2000 6 6
1 2001 NA 7
1 2002 8 8
1 2003 10 10
2 2000 2 2
2 2001 NA 5
2 2002 8 8
2 2003 5 5
3 2000 9 9
3 2001 10 10
3 2002 NA 10.5
3 2003 11 12
library(dplyr)
df %>%
arrange(id,Year) %>%
mutate(Imputed = ifelse(is.na(A), (lag(A)+lead(A))/2, A))
Output is:
id Year A Imputed
1 1 2000 6 6.0
2 1 2001 NA 7.0
3 1 2002 8 8.0
4 1 2003 10 10.0
5 2 2000 2 2.0
6 2 2001 NA 5.0
7 2 2002 8 8.0
8 2 2003 5 5.0
9 3 2000 9 9.0
10 3 2001 10 10.0
11 3 2002 NA 10.5
12 3 2003 11 11.0
#sample data
> dput(df)
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), Year = c(2000L, 2001L, 2002L, 2003L, 2000L, 2001L, 2002L,
2003L, 2000L, 2001L, 2002L, 2003L), A = c(6L, NA, 8L, 10L, 2L,
NA, 8L, 5L, 9L, 10L, NA, 11L)), .Names = c("id", "Year", "A"), class = "data.frame", row.names = c(NA,
-12L))
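Note that lag and lead without grouping would also look across id boundaries if an NA sat in the first or last year of an id. A hedged base R sketch that computes the neighbours within each id is shown below; it assumes a missing value at the edge of an id should simply stay NA.

```r
# Sample data from the question, written compactly
df <- data.frame(
  id   = rep(1:3, each = 4),
  Year = rep(2000:2003, times = 3),
  A    = c(6, NA, 8, 10,  2, NA, 8, 5,  9, 10, NA, 11)
)

# Sort so that "previous" and "next" mean adjacent years within an id
df <- df[order(df$id, df$Year), ]

# Replace each NA by the mean of its neighbours within the same id
df$Imputed <- ave(df$A, df$id, FUN = function(a) {
  prev <- c(NA, head(a, -1))   # value from the previous year
  nxt  <- c(tail(a, -1), NA)   # value from the next year
  ifelse(is.na(a), (prev + nxt) / 2, a)
})

print(df)
```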
Complicating a previous question, let's say I have the following sock data.
>socks
year drawer week sock_total
1990 1 1 3
1990 1 2 4
1990 1 3 3
1990 1 4 2
1990 1 5 4
1990 2 1 1
1990 2 2 1
1990 2 3 1
1990 2 4 1
1990 2 5 2
1990 3 1 3
1990 3 2 4
1990 3 3 4
1990 3 4 4
1990 3 5 4
1991 1 1 4
1991 1 2 3
1991 1 3 2
1991 1 4 2
1991 1 5 3
1991 2 1 1
1991 2 2 3
1991 2 3 4
1991 2 4 4
1991 2 5 3
1991 3 1 2
1991 3 2 3
1991 3 3 3
1991 3 4 2
1991 3 5 3
How can I use summarise in dplyr to create a new variable
growth which equals 1 if there was an increase in each week between the first year and the second year, else 0? The data should look like this:
>socks
drawer week growth
1 1 1
1 2 0
1 3 0
1 4 0
1 5 0
2 1 0
2 2 1
2 3 1
2 4 1
2 5 1
3 1 0
3 2 0
3 3 0
3 4 0
3 5 0
Also, how would you handle data where a drawer did not have a corresponding week in one of the years, i.e. add NA if a week was missing?
The answer would be very similar to the previous one, but grouped by drawer and week; the comment by @eipi10 is also a great option. You can handle a missing year for a specific drawer and week by indexing with [1] after the subset, which turns a length-zero object into NA.
For instance:
df %>%
group_by(drawer, week) %>%
summarise(growth = +(sock_total[year==1991][1] - sock_total[year==1990][1] > 0))
# ^^^ ^^^
# A tibble: 15 x 3
# Groups: drawer [?]
# drawer week growth
# <int> <int> <int>
# 1 1 1 1
# 2 1 2 0
# 3 1 3 0
# 4 1 4 0
# 5 1 5 0
# 6 2 1 0
# 7 2 2 1
# 8 2 3 1
# 9 2 4 1
#10 2 5 1
#11 3 1 0
#12 3 2 0
#13 3 3 0
#14 3 4 0
#15 3 5 NA
The data has left out the year 1991 for drawer 3 and week 5:
structure(list(year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L,
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L,
1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L, 1991L, 1991L, 1991L), drawer = c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), week = c(1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L), sock_total = c(3L, 4L, 3L,
2L, 4L, 1L, 1L, 1L, 1L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 2L, 2L,
3L, 1L, 3L, 4L, 4L, 3L, 2L, 3L, 3L, 2L)), .Names = c("year",
"drawer", "week", "sock_total"), class = "data.frame", row.names = c(NA,
-29L))
Or you can try this without complete():
df %>%
  group_by(drawer, week) %>%
  summarise(growth = ifelse(n() <= 1, 0, ifelse((sock_total[1] - sock_total[2]) >= 0, 0, 1)))
# A tibble: 15 x 3
# Groups: drawer [?]
drawer week growth
<int> <int> <dbl>
1 1 1 1
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 2 1 0
7 2 2 1
8 2 3 1
9 2 4 1
10 2 5 1
11 3 1 0
12 3 2 0
13 3 3 0
14 3 4 0
15 3 5 0
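Positional indexing such as sock_total[1] and sock_total[2] relies on the 1990 row coming before the 1991 row in every group. As a hedged alternative that does not depend on row order, the two years can be placed side by side with base R's merge. The toy data and the name wide are made up for this example, and a week missing in either year is assumed to give NA.

```r
# Toy data (hypothetical): week 3 is missing in 1991
socks <- data.frame(
  year       = c(1990, 1990, 1990, 1991, 1991),
  drawer     = c(1, 1, 1, 1, 1),
  week       = c(1, 2, 3, 1, 2),
  sock_total = c(3, 4, 3, 4, 3)
)

cols <- c("drawer", "week", "sock_total")
y1 <- socks[socks$year == 1990, cols]
y2 <- socks[socks$year == 1991, cols]

# Full outer join on drawer and week; unmatched rows get NA totals
wide <- merge(y1, y2, by = c("drawer", "week"), all = TRUE,
              suffixes = c("_1990", "_1991"))

# 1 if the total increased, 0 otherwise, NA if either year is missing
wide$growth <- as.integer(wide$sock_total_1991 > wide$sock_total_1990)

print(wide[, c("drawer", "week", "growth")])
```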
I want to create a count variable with the number of people with Z==0 in each of the given years, as illustrated below:
PersonID Year Z Count*
1 1990 0 1
2 1990 1 1
3 1990 1 1
4 1990 2 1
5 1990 1 1
1 1991 1 3
2 1991 0 3
3 1991 1 3
4 1991 0 3
5 1991 0 3
1 1992 NA 1
2 1992 2 1
3 1992 2 1
4 1992 0 1
5 1993 1 0
1 1993 1 0
2 1993 2 0
3 1993 NA 0
4 1993 1 0
5 1994 0 5
1 1994 0 5
2 1994 0 5
3 1994 0 5
4 1994 0 5
I looked at my previous R-scripts and found this
library(dplyr)
sum_data <- data %>% group_by(PersonID) %>% summarise(Count = sum(Z, na.rm=T))
Can someone help me get this right? The count variable should basically count a total number of persons with Z==0, in the same format as I illustrated above. Thanks!!
dput(data)
structure(list(PersonID = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L),
Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L,
1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1992L, 1993L, 1993L,
1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 1994L, 1994L),
Z = c(0L, 1L, 1L, 2L, 1L, 1L, 0L, 1L, 0L, 0L, NA, 2L, 2L,
0L, 1L, 1L, 2L, NA, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("PersonID",
"Year", "Z"), class = "data.frame", row.names = c(NA, -24L))
Here's a simple solution:
library(dplyr)
sum_data <- df %>%
mutate(Z=replace(Z, is.na(Z), 1)) %>%
mutate(temp = ifelse(Z == 0, 1, 0)) %>%
group_by(Year) %>%
summarize(count = sum(temp))
Basically, this is what the code is doing:
mutate(Z=replace(Z, is.na(Z), 1)): replaces the NA values with 1 so they are not counted (without this step, sum(temp) would be NA for years containing an NA, unless na.rm = TRUE is used)
mutate(temp = ifelse(Z == 0, 1, 0)): creates a conditional temp variable, which is 1 if Z == 0 and 0 otherwise
group_by(Year): groups the data frame by Year
summarize(count = sum(temp)): creates a count variable as the sum of the temp variable generated earlier
Results:
Year count
<int> <dbl>
1 1990 1
2 1991 3
3 1992 1
4 1993 0
5 1994 5
And if you want to join this to the original data frame, just use left_join:
left_join(df, sum_data)
Joining, by = "Year"
PersonID Year Z count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
Try this:
library(dplyr)
df <- left_join(data, data %>% filter(Z==0) %>% group_by(Year) %>% summarise(Count = n()))
df[is.na(df$Count),]$Count <- 0
PersonID Year Z Count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
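For comparison, the per-year count can also be attached directly to the original rows, without a separate summary and join. This is a minimal base R sketch using ave, assuming that rows with a missing Z should simply not be counted.

```r
# Sample data from the question, written compactly
data <- data.frame(
  PersonID = c(1:5, 1:5, 1:5, 1:5, 1:4),
  Year     = rep(1990:1994, times = c(5, 5, 4, 5, 5)),
  Z        = c(0, 1, 1, 2, 1,
               1, 0, 1, 0, 0,
               NA, 2, 2, 0,
               1, 1, 2, NA, 1,
               0, 0, 0, 0, 0)
)

# 1 where Z is observed and equal to 0, otherwise 0 (NA counts as "not zero")
is_zero <- as.integer(!is.na(data$Z) & data$Z == 0)

# Sum the indicator within each Year and repeat the total on every row
data$Count <- ave(is_zero, data$Year, FUN = sum)

print(data)
```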