How can I make a counter conditional on the rows in R?

I have a dataset of firms in which I have "completed the panel", so every firm has a row for every year. When the quantitative variables (sales, wages) are 0, the firm is closed. An NA marks a row that was added when completing the panel, meaning the firm did not exist yet (or no longer existed) in that year.
I want to make a counter that starts at the first closure of the firm.
My data looks like this:
Year Firm sales wages
2014 A 12 4
2015 A 8 3
2016 A 0 0
2017 A NA NA
2018 A NA NA
2014 B NA NA
2015 B 8 3
2016 B 4 2
2017 B 9 5
2018 B 8 6
2014 C 9 5
2015 C 7 6
2016 C 0 0
2017 C 0 0
2018 C 0 0
And the desired result looks like this:
Year Firm sales wages Closure
2014 A 12 4 0
2015 A 8 3 0
2016 A 0 0 1
2017 A NA NA 2 # After the closure in 2016 the firm no longer appears in the original dataset
2018 A NA NA 3 # Same here
2014 B NA NA 0 # The firm has not been created yet
2015 B 8 3 0
2016 B 4 2 0
2017 B 9 5 0
2018 B 8 6 0
2014 C 9 5 0
2015 C 7 6 0
2016 C 0 0 1
2017 C 0 0 2 # After the closure the firm still appears because it has debts or other pending matters
2018 C 0 0 3 # Same here, it still appears because it still has obligations
How can I accomplish this?
Thanks In Advance.

Perhaps this helps
library(dplyr)
library(tidyr)
df1 %>%
  group_by(Firm) %>%
  # Flag rows that are all zero or whose next row is all NA, then count the flags cumulatively
  mutate(Closure = replace_na(cumsum(lead(is.na(sales) &
      is.na(wages), default = TRUE) | (sales == 0 & wages == 0)), 0)) %>%
  ungroup()
Output:
# A tibble: 15 x 5
# Year Firm sales wages Closure
# <int> <chr> <int> <int> <dbl>
# 1 2014 A 12 4 0
# 2 2015 A 8 3 0
# 3 2016 A 0 0 1
# 4 2017 A NA NA 2
# 5 2018 A NA NA 3
# 6 2014 B NA NA 0
# 7 2015 B 8 3 0
# 8 2016 B 4 2 0
# 9 2017 B 9 5 0
#10 2018 B 8 6 0
#11 2014 C 9 5 0
#12 2015 C 7 6 0
#13 2016 C 0 0 1
#14 2017 C 0 0 2
#15 2018 C 0 0 3
Data:
df1 <- structure(list(Year = c(2014L, 2015L, 2016L, 2017L, 2018L, 2014L,
2015L, 2016L, 2017L, 2018L, 2014L, 2015L, 2016L, 2017L, 2018L
), Firm = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"), sales = c(12L, 8L, 0L, NA, NA, NA,
8L, 4L, 9L, 8L, 9L, 7L, 0L, 0L, 0L), wages = c(4L, 3L, 0L, NA,
NA, NA, 3L, 2L, 5L, 6L, 5L, 6L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-15L))
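The pipeline above reproduces the sample, but it leans on the `default = TRUE` passed to lead() and on cumsum() propagating a leading NA, so a firm with neither NAs nor zeros would get a 1 in its last year. As a hedged alternative, here is a minimal base R sketch that flags the first all-zero year per firm and counts from there. It assumes rows are already sorted by Year within Firm; `closed` is a helper name introduced only for this illustration.

```r
# Sample data from the question
df1 <- data.frame(
  Year  = rep(2014:2018, times = 3),
  Firm  = rep(c("A", "B", "C"), each = 5),
  sales = c(12, 8, 0, NA, NA,  NA, 8, 4, 9, 8,  9, 7, 0, 0, 0),
  wages = c(4, 3, 0, NA, NA,  NA, 3, 2, 5, 6,  5, 6, 0, 0, 0)
)

# 1 when both variables are observed and equal to zero, otherwise 0
closed <- as.integer(!is.na(df1$sales) & !is.na(df1$wages) &
                       df1$sales == 0 & df1$wages == 0)

# Within each firm: cummax() switches to 1 at the first closure and stays there,
# cumsum() then counts the years since (and including) that closure
df1$Closure <- ave(closed, df1$Firm, FUN = function(z) cumsum(cummax(z)))
df1
```

Because the counter only starts once an explicit zero row is seen, trailing NA rows after a closure keep counting, while NA rows before a firm exists stay at 0.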

Related

In R: subset so that I only have the observations 3 years prior to and after an event

I found the following link of an answer that I should be able to apply, but it didn't seem to work:
https://stackoverflow.com/a/66485141/15388602
The following is a sample from my dataset:
companyID year status
1 2000 1
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
2 2012 1
2 2013 0
2 2014 2
2 2015 2
2 2016 2
3 2008 1
3 2009 1
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
3 2017 2
I would like to get the following observations so that I now only have the observations concerning 3 years before the event, the year of the event (where status is 0), and the 3 years after the event:
companyID year status
1 2001 1
1 2002 1
1 2003 1
1 2004 0
1 2005 2
1 2006 2
1 2007 2
3 2010 1
3 2011 1
3 2012 1
3 2013 0
3 2014 2
3 2015 2
3 2016 2
Would it be easier if I supplied the variable showing the event date? The variable would show a date in the same observation (year) that the status is 0.
Thank you in advance for any help!
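This can be achieved with group_by(), arrange() and filter():
library(dplyr)
df %>%
  group_by(companyID) %>%
  # Sort by status so the event row (status == 0) comes first within each company
  arrange(status, year, .by_group = TRUE) %>%
  filter(year >= first(year) - 3 & year <= first(year) + 3) %>%
  # Keep only companies that have the full 7-year window
  filter(n() >= 7) %>%
  arrange(year)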
Output:
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
Try this with dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
group_by(companyID, year) %>%
mutate(ref_yr = case_when(status == 0 ~ year,
TRUE ~ NA_integer_)) %>%
ungroup() %>%
group_by(companyID) %>%
fill(ref_yr, .direction = "downup") %>%
mutate(yr_diff = abs(ref_yr - year))%>%
filter(yr_diff <= 3) %>%
select(-c(ref_yr, yr_diff))
#> # A tibble: 19 x 3
#> # Groups: companyID [3]
#> companyID year status
#> <int> <int> <int>
#> 1 1 2001 1
#> 2 1 2002 1
#> 3 1 2003 1
#> 4 1 2004 0
#> 5 1 2005 2
#> 6 1 2006 2
#> 7 1 2007 2
#> 8 2 2012 1
#> 9 2 2013 0
#> 10 2 2014 2
#> 11 2 2015 2
#> 12 2 2016 2
#> 13 3 2010 1
#> 14 3 2011 1
#> 15 3 2012 1
#> 16 3 2013 0
#> 17 3 2014 2
#> 18 3 2015 2
#> 19 3 2016 2
Data:
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2012L, 2013L, 2014L, 2015L, 2016L, 2008L, 2009L, 2010L,
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L), status = c(1L,
1L, 1L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 0L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-23L))
Created on 2021-04-25 by the reprex package (v2.0.0)
Does this work:
library(dplyr)
df %>% group_by(companyID) %>%
mutate(flag1 = year[status == 0] - min(year), flag2 = max(year) - year[status == 0]) %>%
filter(flag1 > 2 & flag2 > 2 & between(year,year[status == 0] - 3, year[status == 0] + 3)) %>% select(-flag1, -flag2)
# A tibble: 14 x 3
# Groups: companyID [2]
companyID year status
<int> <int> <int>
1 1 2001 1
2 1 2002 1
3 1 2003 1
4 1 2004 0
5 1 2005 2
6 1 2006 2
7 1 2007 2
8 3 2010 1
9 3 2011 1
10 3 2012 1
11 3 2013 0
12 3 2014 2
13 3 2015 2
14 3 2016 2
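If your data frame is df, and each company has one row per consecutive year:
zeros <- which(df$status == 0)
# Rows within 3 positions of each event row, kept only if they belong to the same company
calcrows <- unlist(lapply(zeros, function(x) {
  idx <- max(1, x - 3):min(nrow(df), x + 3)
  idx[df$companyID[idx] == df$companyID[x]] }))
df2 <- df[calcrows, ]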
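For completeness, here is a hedged base R sketch that works on the year values rather than on row positions and keeps only companies with the full seven-year window, as in the desired output. It assumes at most one event (status == 0) per company and one row per company and year; `event_year`, `in_window` and `n_window` are names introduced for this illustration.

```r
# Sample data from the question
df <- data.frame(
  companyID = c(rep(1L, 8), rep(2L, 5), rep(3L, 10)),
  year      = c(2000:2007, 2012:2016, 2008:2017),
  status    = c(1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L,
                1L, 0L, 2L, 2L, 2L,
                1L, 1L, 1L, 1L, 1L, 0L, 2L, 2L, 2L, 2L)
)

# Event year per company, repeated on every row of that company
event_year <- ave(ifelse(df$status == 0, df$year, NA), df$companyID,
                  FUN = function(y) y[!is.na(y)][1])

# Rows within 3 years of the event (companies without an event get FALSE)
in_window <- !is.na(event_year) & abs(df$year - event_year) <= 3

# Number of window rows per company; 7 means the window is complete
n_window <- ave(as.integer(in_window), df$companyID, FUN = sum)

res <- df[in_window & n_window == 7, ]
res
```

Dropping the `n_window == 7` condition would keep partial windows too, which is what the fill()-based answer above returns for company 2.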

How to count the number of columns with an entry in each row?

I would like to find out how many columns have an entry in each row:
For example:
Date A B
1990 NA NA
1991 1 NA
1992 2 2
1993 3 3
1994 4 NA
1995 5 3
1996 NA NA
1997 7 8
1998 8 2
1999 NA NA
2000 8 4
Column C here is the result I want.
Date A B C
1990 NA NA 0
1991 1 NA 1
1992 2 2 2
1993 3 3 2
1994 4 NA 1
1995 5 3 2
1996 NA NA 0
1997 7 8 2
1998 8 2 2
1999 NA NA 0
2000 8 4 2
Many Thanks
Try this (you can choose the columns in df, here I excluded first column):
df$C <- apply(df[,-1],1,function(x) length(which(!is.na(x))))
df
Date A B C
1 1990 NA NA 0
2 1991 1 NA 1
3 1992 2 2 2
4 1993 3 3 2
5 1994 4 NA 1
6 1995 5 3 2
7 1996 NA NA 0
8 1997 7 8 2
9 1998 8 2 2
10 1999 NA NA 0
11 2000 8 4 2
Some data:
df <- structure(list(Date = 1990:2000, A = c(NA, 1L, 2L, 3L, 4L, 5L,
NA, 7L, 8L, NA, 8L), B = c(NA, NA, 2L, 3L, NA, 3L, NA, 8L, 2L,
NA, 4L)), row.names = c(NA, -11L), class = "data.frame")
A tidyverse approach:
library(tidyverse)
df %>%
rowwise() %>%
mutate(C = sum(!is.na(across(A:B)))) %>%
ungroup
# A tibble: 11 x 4
Date A B C
<int> <int> <int> <dbl>
1 1990 NA NA 0
2 1991 1 NA 1
3 1992 2 2 2
4 1993 3 3 2
5 1994 4 NA 1
6 1995 5 3 2
7 1996 NA NA 0
8 1997 7 8 2
9 1998 8 2 2
10 1999 NA NA 0
11 2000 8 4 2
Or simply combine mutate with akrun's rowSums approach, excluding the first column:
df %>%
mutate(C = rowSums(!is.na(across(-1))))
You can try c_across()
df <- data.frame(obs = 1:5, COL_A = 6:10, COL_B = 11:15, COL_C = c(10, NA, 21, NA, 7))
df2 <- df %>%
rowwise() %>%
mutate(TOTAL = sum(c_across(COL_A:COL_C), na.rm = TRUE))
# A tibble: 5 x 5
# Rowwise:
# obs COL_A COL_B COL_C TOTAL
# <int> <int> <int> <dbl> <dbl>
# 1 1 6 11 10 27
# 2 2 7 12 NA 19
# 3 3 8 13 21 42
# 4 4 9 14 NA 23
# 5 5 10 15 7 32
We can use rowSums on a logical matrix (created with is.na)
df1$C <- rowSums(!is.na(df1[c('A', 'B')]))
NOTE: Added the rowSums approach first here
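To make the rowSums idea concrete, here is a small base R sketch on a shortened version of the question's data. The intermediate name `present` is introduced only for illustration.

```r
# First four rows of the question's data
df <- data.frame(Date = 1990:1993,
                 A = c(NA, 1L, 2L, 3L),
                 B = c(NA, NA, 2L, 3L))

# is.na() on a data frame returns a logical matrix; negating it marks filled cells
present <- !is.na(df[c("A", "B")])

# TRUE counts as 1, so the row sums are the number of non-missing columns per row
df$C <- rowSums(present)
df
```

Since the whole matrix is processed in one vectorised call, this usually scales better than a row-by-row apply() or rowwise() on large data.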

Finding Cumulative Sum In R Using Conditions

I need to create a new variable with the sum of the past three years' amounts for each ID.
If there are not three years' worth of data, there should be an 'NA'.
As an example:
ID YEAR AMOUNT
1 2010 5
1 2011 2
1 2012 4
1 2013 1
1 2014 3
2 2013 4
2 2014 6
2 2015 9
3 2012 4
3 2013 7
3 2014 2
3 2015 3
Here's what the result should be:
ID YEAR AMOUNT THREE_YR
1 2010 5 NA
1 2011 2 NA
1 2012 4 11
1 2013 1 7
1 2014 3 8
2 2013 4 NA
2 2014 6 NA
2 2015 9 19
3 2012 4 NA
3 2013 7 NA
3 2014 2 13
3 2015 3 12
How would I do this? Thanks!
We can use functions from dplyr and zoo. dt2 is the final output.
# Create example data frame
dt <- read.table(text = "ID YEAR AMOUNT
1 2010 5
1 2011 2
1 2012 4
1 2013 1
1 2014 3
2 2013 4
2 2014 6
2 2015 9
3 2012 4
3 2013 7
3 2014 2
3 2015 3",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(zoo)
# Process the data
dt2 <- dt %>%
group_by(ID) %>%
mutate(THREE_YR = rollsum(AMOUNT, k = 3, fill = NA, align = "right"))
Update: ID groups with fewer than 3 records.
The OP asked what to do if there are IDs with only one or two rows. Honestly, I did not find a good way to solve this. The only thing I can think of is splitting the original data frame into two parts, applying rollsum only to the IDs with three or more records, and then combining the parts again.
# Create example data frame
dt <- read.table(text = "ID YEAR AMOUNT
1 2010 5
1 2011 2
1 2012 4
1 2013 1
1 2014 3
2 2013 4
3 2012 4
3 2013 7
3 2014 2
3 2015 3",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(zoo)
# Process the data
dt2 <- dt %>%
group_by(ID) %>%
filter(n() >= 3) %>%
mutate(THREE_YR = rollsum(AMOUNT, k = 3, fill = NA, align = "right")) %>%
bind_rows(dt %>% group_by(ID) %>% filter(n() < 3)) %>%
arrange(ID, YEAR)
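If splitting and recombining feels heavy, one possible alternative is a small helper that returns all NA for short groups, applied per ID with ave(). This is only a sketch in base R, not part of the zoo API; `roll3` is a hypothetical helper name, and it assumes rows are sorted by YEAR within ID with no gaps between years.

```r
# Data from the update: ID 2 has a single record
dt <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3),
  YEAR   = c(2010:2014, 2013, 2012:2015),
  AMOUNT = c(5, 2, 4, 1, 3, 4, 4, 7, 2, 3)
)

# Trailing three-row sum; groups with fewer than three rows are all NA
roll3 <- function(v) {
  n <- length(v)
  if (n < 3) return(rep(NA_real_, n))
  cs <- cumsum(v)
  # Sum of rows i-2..i equals cs[i] - cs[i-3], with cs[0] taken as 0
  c(NA, NA, cs[3:n] - c(0, cs[seq_len(n - 3)]))
}

dt$THREE_YR <- ave(dt$AMOUNT, dt$ID, FUN = roll3)
dt
```

The length check inside the helper is what removes the need to treat short groups separately.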
With data.table:
library(data.table)
setDT(dt)
setorder(dt,YEAR)
dt[,.(YEAR,AMOUNT,THREE_YR=AMOUNT+shift(AMOUNT,1)+shift(AMOUNT,2)),by=.(ID)]
#ID YEAR AMOUNT THREE_YR
# 1: 1 2010 5 NA
# 2: 1 2011 2 NA
# 3: 1 2012 4 11
# 4: 1 2013 1 7
# 5: 1 2014 3 8
# 6: 3 2012 4 NA
# 7: 3 2013 7 NA
# 8: 3 2014 2 13
# 9: 3 2015 3 12
#10: 2 2013 4 NA
#11: 2 2014 6 NA
#12: 2 2015 9 19
Using zoo::rollapplyr() and aggregate()
This will return NA if there are fewer than three members in a group.
x <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), YEAR = c(2010L, 2011L, 2012L, 2013L, 2014L, 2013L, 2014L,
2015L, 2012L, 2013L, 2014L, 2015L), AMOUNT = c(5L, 2L, 4L, 1L,
3L, 4L, 6L, 9L, 4L, 7L, 2L, 3L)), .Names = c("ID", "YEAR", "AMOUNT"
), class = "data.frame", row.names = c(NA, -12L))
library(zoo)
rsum <- aggregate(AMOUNT ~ ID, data=x,
FUN=function(x) rollapplyr(x, 3, fill=NA, partial=TRUE,
FUN=function(y) if (length(y) >= 3) sum(y) else NA))
x$rsum <- do.call(c, rsum$AMOUNT)
x
# ID YEAR AMOUNT rsum
# 1 1 2010 5 NA
# 2 1 2011 2 NA
# 3 1 2012 4 11
# 4 1 2013 1 7
# 5 1 2014 3 8
# 6 2 2013 4 NA
# 7 2 2014 6 NA
# 8 2 2015 9 19
# 9 3 2012 4 NA
# 10 3 2013 7 NA
# 11 3 2014 2 13
# 12 3 2015 3 12
# remove one of the rows for ID 2
x <- x[-6, ]
rsum <- aggregate(AMOUNT ~ ID, data=x,
FUN=function(x) rollapplyr(x, 3, fill=NA, partial=TRUE,
FUN=function(y) if (length(y) >= 3) sum(y) else NA))
x$rsum <- do.call(c, rsum$AMOUNT)
x
# ID YEAR AMOUNT rsum
# 1 1 2010 5 NA
# 2 1 2011 2 NA
# 3 1 2012 4 11
# 4 1 2013 1 7
# 5 1 2014 3 8
# 7 2 2014 6 NA
# 8 2 2015 9 NA
# 9 3 2012 4 NA
# 10 3 2013 7 NA
# 11 3 2014 2 13
# 12 3 2015 3 12

Create a conditional count variable in R

I want to create a count variable with the number of people with Z==0 in each of the given years, as illustrated below:
PersonID Year Z Count*
1 1990 0 1
2 1990 1 1
3 1990 1 1
4 1990 2 1
5 1990 1 1
1 1991 1 3
2 1991 0 3
3 1991 1 3
4 1991 0 3
5 1991 0 3
1 1992 NA 1
2 1992 2 1
3 1992 2 1
4 1992 0 1
5 1993 1 0
1 1993 1 0
2 1993 2 0
3 1993 NA 0
4 1993 1 0
5 1994 0 5
1 1994 0 5
2 1994 0 5
3 1994 0 5
4 1994 0 5
I looked at my previous R-scripts and found this
library(dplyr)
sum_data <- data %>% group_by(PersonID) %>% summarise(Count = sum(Z, na.rm=T))
Can someone help me get this right? The count variable should basically count a total number of persons with Z==0, in the same format as I illustrated above. Thanks!!
dput(data)
structure(list(PersonID = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L),
Year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L,
1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1992L, 1993L, 1993L,
1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 1994L, 1994L),
Z = c(0L, 1L, 1L, 2L, 1L, 1L, 0L, 1L, 0L, 0L, NA, 2L, 2L,
0L, 1L, 1L, 2L, NA, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("PersonID",
"Year", "Z"), class = "data.frame", row.names = c(NA, -24L))
Here's a simple solution (df is the data frame from the question):
library(dplyr)
sum_data <- df %>%
  mutate(Z = replace(Z, is.na(Z), 1)) %>%
  mutate(temp = ifelse(Z == 0, 1, 0)) %>%
  group_by(Year) %>%
  summarize(count = sum(temp))
This is what the code does:
mutate(Z = replace(Z, is.na(Z), 1)) replaces each NA with 1 (optional)
mutate(temp = ifelse(Z == 0, 1, 0)) creates a conditional temp variable: 1 if Z == 0, else 0
group_by(Year) groups the data frame by Year
summarize(count = sum(temp)) creates a count variable holding the per-year sum of temp
Results:
Year count
<int> <dbl>
1 1990 1
2 1991 3
3 1992 1
4 1993 0
5 1994 5
and if you want to join this data to the original data frame just use join :
left_join(df, sum_data)
Joining, by = "Year"
PersonID Year Z count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
Try this:
library(dplyr)
df <- left_join(data, data %>% filter(Z==0) %>% group_by(Year) %>% summarise(Count = n()))
df[is.na(df$Count),]$Count <- 0
PersonID Year Z Count
1 1 1990 0 1
2 2 1990 1 1
3 3 1990 1 1
4 4 1990 2 1
5 5 1990 1 1
6 1 1991 1 3
7 2 1991 0 3
8 3 1991 1 3
9 4 1991 0 3
10 5 1991 0 3
11 1 1992 NA 1
12 2 1992 2 1
13 3 1992 2 1
14 4 1992 0 1
15 5 1993 1 0
16 1 1993 1 0
17 2 1993 2 0
18 3 1993 NA 0
19 4 1993 1 0
20 5 1994 0 5
21 1 1994 0 5
22 2 1994 0 5
23 3 1994 0 5
24 4 1994 0 5
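The same per-year count can also be attached directly to the original rows without a join. The following is a minimal base R sketch, assuming that an NA in Z should simply not be counted; `is_zero` is a name introduced for this illustration.

```r
# Data from the question
data <- data.frame(
  PersonID = c(rep(1:5, 4), 1:4),
  Year     = c(rep(1990, 5), rep(1991, 5), rep(1992, 4), rep(1993, 5), rep(1994, 5)),
  Z        = c(0, 1, 1, 2, 1,  1, 0, 1, 0, 0,  NA, 2, 2, 0,
               1, 1, 2, NA, 1,  0, 0, 0, 0, 0)
)

# 1 for rows where Z is observed and equal to 0, otherwise 0
is_zero <- as.integer(!is.na(data$Z) & data$Z == 0)

# Sum the indicator within each year and repeat the total on every row of that year
data$Count <- ave(is_zero, data$Year, FUN = sum)
data
```

Years without any Z == 0 row get 0 directly, so no NA replacement step is needed afterwards.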

How to generate a "range" variable in R? [duplicate]

I have a dataset that looks something like this:
Subject Year X
A 1990 1
A 1991 1
A 1992 2
A 1993 3
A 1994 4
A 1995 4
B 1990 0
B 1991 1
B 1992 1
B 1993 2
C 1991 1
C 1992 2
C 1993 3
C 1994 3
D 1991 1
D 1992 2
D 1993 3
D 1994 4
D 1995 5
D 1996 5
D 1997 6
I want to generate a binary (0/1) variable (let's say variable A) that indicates whether the X variable has reached 3 (or 1-3) for each Subject. If the X variable has reached 4 or more, A should not capture it.
It should look like this:
Subject Year X A
A 1990 1 0
A 1991 1 0
A 1992 2 0
A 1993 3 0
A 1994 4 0
A 1995 4 0
B 1990 0 0
B 1991 1 0
B 1992 1 0
B 1993 2 0
C 1991 1 1
C 1992 2 1
C 1993 3 1
C 1994 3 1
D 1991 1 0
D 1992 2 0
D 1993 3 0
D 1994 4 0
D 1995 5 0
D 1996 5 0
D 1997 6 0
I tried the following: mydata$A <- as.numeric(mydata$X %in% 1:3), but it doesn't control for the continuation.
A reproducible sample:
> dput(mydata)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor"), Year = c(1990L, 1991L, 1992L,
1993L, 1994L, 1995L, 1990L, 1991L, 1992L, 1993L, 1991L, 1992L,
1993L, 1994L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L
), X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 1L, 2L, 3L,
3L, 1L, 2L, 3L, 4L, 5L, 5L, 6L)), .Names = c("Subject", "Year",
"X"), class = "data.frame", row.names = c(NA, -21L))
All suggestions are welcome – thanks!
Here's a base R one-liner using ave:
df$A <- ave(df$X, df$Subject, FUN = function(x) if (max(x) == 3) 1 else 0)
> df
Subject Year X A
1 A 1990 1 0
2 A 1991 1 0
3 A 1992 2 0
4 A 1993 3 0
5 A 1994 4 0
6 A 1995 4 0
7 B 1990 0 0
8 B 1991 1 0
9 B 1992 1 0
10 B 1993 2 0
11 C 1991 1 1
12 C 1992 2 1
13 C 1993 3 1
14 C 1994 3 1
15 D 1991 1 0
16 D 1992 2 0
17 D 1993 3 0
18 D 1994 4 0
19 D 1995 5 0
20 D 1996 5 0
21 D 1997 6 0
If you only want to capture increases, the shift function from data.table gives you access to other rows. This solution works, but the first value is NA because it has nothing to compare with:
library(data.table)
mydata$A <- ifelse(mydata$X > shift(mydata$X, 1L, type="lag"), 1,0)
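Note that the comparison above runs over the whole column, so the first row of each Subject is compared with the last row of the previous Subject. If increases should be detected within each Subject only, one option is a grouped version in base R. This is a sketch on a shortened, made-up subset of the data, and it assumes rows are sorted by Year within Subject.

```r
# Shortened illustrative data (two subjects, three years each)
mydata <- data.frame(Subject = c("A", "A", "A", "B", "B", "B"),
                     X       = c(1, 1, 2, 0, 1, 1))

# Per subject: NA for the first row, then 1 if X rose versus the previous row, else 0
mydata$A <- ave(mydata$X, mydata$Subject,
                FUN = function(x) c(NA, as.integer(diff(x) > 0)))
mydata
```

Each Subject starts with NA because its first row has no earlier row of its own to compare with.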

Resources