Count observations per group satisfying a different condition for each row - r

I have a dataframe that looks like this one
state start end date treat
1 1999 2000 2001 1
1 1998 2000 2001 1
1 2000 2003 NA 0
2 2001 2002 NA 0
2 2002 2004 2003 1
2 2003 2004 2005 1
3 2002 2004 2006 1
3 2003 2004 NA 0
3 2005 2007 NA 0
I want to group it by state identifier and, for each row, compute the number of treated observations (treat == 1) in the same state whose date lies between that row's start and end.
In other words I want to get the following
state start end date treat result
1 1999 2000 2001 1 0
1 1998 2000 2001 1 0
1 2000 2003 NA 0 2
2 2001 2002 NA 0 0
2 2002 2004 2003 1 1
2 2003 2004 2005 1 0
3 2002 2004 2006 1 0
3 2003 2004 NA 0 0
3 2005 2007 NA 0 1
For instance, result in the first row is 0 because within state 1 no date lies between 1999 and 2000. The result in the last row, on the other hand, is 1 because state 3 has one treated unit whose date lies between 2005 and 2007 (date = 2006 in the 7th row).
Thank you very much for your help.

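You can split by state, combine two outer calls with & to test whether each date lies strictly between start and end, and then sum treat over the matching dates.
x$result <- unlist(lapply(split(x, x$state), function(y) {
  # tt[i, j] is TRUE when date j lies strictly between start i and end i
  tt <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
  # comparisons against an NA date can give NA; in this data those rows
  # have treat == 0, so marking them TRUE adds nothing to the sum
  tt[is.na(tt)] <- TRUE
  # the positional assignment to x$result assumes x is sorted by state
  apply(tt, 1, function(z) sum(y$treat[z]))
}))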
x
# state start end date treat result
#1 1 1999 2000 2001 1 0
#2 1 1998 2000 2001 1 0
#3 1 2000 2003 NA 0 2
#4 2 2001 2002 NA 0 0
#5 2 2002 2004 2003 1 1
#6 2 2003 2004 2005 1 0
#7 3 2002 2004 2006 1 0
#8 3 2003 2004 NA 0 0
#9 3 2005 2007 NA 0 1
Alternatively, aggregate treat per state and date, merge that with the state, start and end columns, and sum the treat values whose date falls strictly inside each interval.
tt <- aggregate(treat ~ state + date, x[,c("state", "date", "treat")], sum)
tt <- merge(x[,c("state", "start", "end")], tt)
tt$treat[tt$start >= tt$date | tt$end <= tt$date] <- 0
aggregate(treat ~ start + end + state, tt, sum)
# start end state treat
#1 1998 2000 1 0
#2 1999 2000 1 0
#3 2000 2003 1 2
#4 2001 2002 2 0
#5 2002 2004 2 1
#6 2003 2004 2 0
#7 2002 2004 3 0
#8 2003 2004 3 0
#9 2005 2007 3 1
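For comparison, the per-row rule can also be written out explicitly, which makes it easy to check against the expected result. This is only a sketch in base R: the helper name count_in_window is made up for illustration, x is rebuilt from the question's table, and "between" is assumed to be strict on both sides, as in the outer() version above.

```r
# Rebuild the question's data (values copied from the table in the question)
x <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)

# Hypothetical helper: for row i, sum treat over rows of the same state
# whose (non-missing) date lies strictly inside row i's start-end window
count_in_window <- function(i) {
  same_state <- x$state == x$state[i]
  inside <- !is.na(x$date) & x$date > x$start[i] & x$date < x$end[i]
  sum(x$treat[same_state & inside])
}

x$result <- vapply(seq_len(nrow(x)), count_in_window, numeric(1))
x$result
```

Unlike the split/unlist version, this does not depend on the rows being sorted by state, but it compares every row against the whole data frame, so it is likely to be slower on large data.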

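This counts, per state, the treated observations whose date falls within the state's overall range (earliest start to latest end) and repeats that count on every row; it does not apply each row's own start and end:
library(tidyverse)
df %>%
  group_by(state) %>%
  mutate(result = sum(treat == 1 & date >= min(start, na.rm = TRUE) & date <= max(end, na.rm = TRUE), na.rm = TRUE))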
#> # A tibble: 9 x 6
#> # Groups: state [3]
#> state start end date treat result
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1999 2000 2001 1 2
#> 2 1 1998 2000 2001 1 2
#> 3 1 2000 2003 NA 0 2
#> 4 2 2001 2002 NA 0 1
#> 5 2 2002 2004 2003 1 1
#> 6 2 2003 2004 2005 1 1
#> 7 3 2002 2004 2006 1 1
#> 8 3 2003 2004 NA 0 1
#> 9 3 2005 2007 NA 0 1
If you just want one number per group, summarize might be a better option:
df %>%
  group_by(state) %>%
  summarize(result = sum(treat == 1 & date >= min(start, na.rm = TRUE) & date <= max(end, na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 3 x 2
#> state result
#> <dbl> <int>
#> 1 1 2
#> 2 2 1
#> 3 3 1

Related

Create sequence by condition in the case when condition changes

the data looks like:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
I want to add a column seq that counts consecutive non-zero Treat values within each Grp, keeping the order of Year. The hard part is that seq should be 0 (or any placeholder) whenever Treat is 0, and the count should restart at 1 when Treat becomes non-zero again. An example of the final dataframe looks like this:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
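With rleid from data.table, you can do:
library(dplyr)
df %>%
  # rleid() gives each run of identical consecutive Treat values its own id
  group_by(Grp, grp = data.table::rleid(Treat)) %>%
  # position within the run, zeroed out where Treat is "0"
  mutate(seq = row_number() * as.integer(Treat)) %>%
  ungroup() %>%
  select(-grp)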
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
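If data.table is not available, the same run-based idea can be sketched in base R. This is an illustrative alternative, not part of the answer above: the names new_run and run_id are made up here, and the data frame is rebuilt from the question with stringsAsFactors = FALSE so that Treat stays a character column.

```r
# Rebuild the question's data; Treat is kept as character ("0"/"1")
df <- data.frame(
  Grp   = rep(c("A", "B"), each = 10),
  Year  = rep(2001:2010, 2),
  Treat = as.character(c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
                         1, 1, 1, 0, 0, 0, 1, 1, 1, 0)),
  stringsAsFactors = FALSE
)

n <- nrow(df)
# A new run starts on the first row and whenever Treat or Grp differs
# from the previous row
new_run <- c(TRUE, df$Treat[-1] != df$Treat[-n] | df$Grp[-1] != df$Grp[-n])
run_id <- cumsum(new_run)

# Position within each run, set to 0 for runs where Treat is "0"
df$seq <- ave(seq_len(n), run_id, FUN = seq_along) * as.integer(df$Treat)
df$seq
```

Including the Grp comparison in new_run matters at the boundary between groups: row 10 (A) and row 11 (B) both have Treat equal to 1, and without it the count would carry over from A into B.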

Extracting the change of the mean per group over time

I have a data table from which I calculated the mean sales as follows:
library(data.table)
DT <- fread(
"ID country year sales industry size cat4
1 NLD 2000 4 A 1 0
2 NLD 2000 4 B 1 1
3 NLD 2006 2 A 1 1
4 NLD 2002 4 A 1 0
5 NLD 2002 4 B 1 1
6 NLD 2006 2 A 1 1
7 NLD 2006 2 B 2 0
8 NLD 2006 1 A 1 4
9 GBR 2001 2 B 3 5
10 GBR 2001 1 B 2 5
11 GBR 2002 1 A 1 11
12 GBR 2006 1 A 1 2
13 GBR 2006 1 B 3 12
14 GBR 2006 1 A 1 2
15 GBR 2006 1 B 3 12",
header = TRUE)
setDT(DT)[,Mean_Sales:= mean(sales, na.rm=TRUE), by=c("country", "industry", "size")]
However, I am now interested in how Mean_Sales changes over time, per group: by=c("country", "industry", "size").
I would like to take the mean of the absolute differences, divided by the years they are apart.
As an example, for a company in NLD with industry A and size 1, taking ID=1 and ID=8, I want the absolute difference (|1-4| = 3) divided by the number of years apart (2006-2000 = 6), giving a year-to-year change of 3/6 = 0.5.
I just cannot figure out how to get it into R code. Any help would be greatly appreciated.
Desired output:
library(data.table)
DT <- fread(
"ID country year sales industry size cat4 delta
1 NLD 2000 4 A 1 0 0.5
2 NLD 2000 4 B 1 1 0.33
3 NLD 2006 2 A 1 1
4 NLD 2002 4 A 1 0
5 NLD 2002 4 B 1 1
6 NLD 2006 2 A 1 1
7 NLD 2006 2 B 1 0 0.33
8 NLD 2006 1 A 1 4 0.5
9 GBR 2001 2 B 3 5
10 GBR 2001 1 B 2 5
11 GBR 2002 1 A 1 11
12 GBR 2006 1 A 1 2
13 GBR 2006 1 B 3 12
14 GBR 2006 1 A 1 2
15 GBR 2006 1 B 3 12",
header = TRUE)
You could order by year, take the absolute difference between the last and first sales values, and divide it by the difference in years.
library(data.table)
DT[order(year),
   delta := abs(last(sales) - first(sales)) / (max(year) - min(year)),
   .(country, industry, size)]
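To see what this computes, here is the arithmetic for a single group (NLD, industry A, size 1) worked through in base R. It is a sketch only: the data frame grp is a hand-copied subset of the question's table, and the result depends on how ties within the same year are ordered.

```r
# Rows of the question's table with country NLD, industry A, size 1
grp <- data.frame(
  ID    = c(1, 3, 4, 6, 8),
  year  = c(2000, 2006, 2002, 2006, 2006),
  sales = c(4, 2, 4, 2, 1)
)

# order() is stable, so rows sharing a year keep their original order
grp <- grp[order(grp$year), ]

first_sales <- grp$sales[1]          # 4 (ID 1, year 2000)
last_sales  <- grp$sales[nrow(grp)]  # 1 (ID 8, the last 2006 row)
delta <- abs(last_sales - first_sales) / (max(grp$year) - min(grp$year))
delta
```

Two things to keep in mind: when several rows share the earliest or latest year, which one counts as "first" or "last" follows the original row order, and a group observed in only one year divides by zero, which yields NaN or Inf rather than an error.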

Selecting observations for which two years are available by country

I have a dataset as follows:
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
I would like to keep only the countries for which I have observations in at least two different years. So BEL will drop out, because it only has observations in 2002.
I would like to do something like DT[,if(unique(year)>1) .SD, by=country] but that does not do anything. I also tried DT[unique(year)>1, .SD, by=country] but this gives the error:
Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) :
i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Desired output:
DT <- fread(
"ID country year Event_A Event_B
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
You can use uniqueN to get count of unique values and select rows using .SD.
library(data.table)
DT[, .SD[uniqueN(year) > 1], country]
# country ID year Event_A Event_B
#1: NLD 6 2002 1 1
#2: NLD 7 2006 1 0
#3: NLD 8 2006 1 1
#4: GBR 9 2001 0 1
#5: GBR 10 2001 0 0
#6: GBR 11 2001 0 1
#7: GBR 12 2007 1 1
#8: GBR 13 2007 1 1
Or in dplyr we can do the same with n_distinct and filter
library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)
In the same spirit as @user2474226, if you're open to other packages, a simple dplyr solution:
library(data.table)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
# Countries observed in more than one distinct year
sel_cnt <-
  DT %>%
  count(country, year) %>%
  count(country) %>%
  filter(n > 1)
DT %>%
  semi_join(sel_cnt, by = "country")
#> ID country year Event_A Event_B
#> 1 6 NLD 2002 1 1
#> 2 7 NLD 2006 1 0
#> 3 8 NLD 2006 1 1
#> 4 9 GBR 2001 0 1
#> 5 10 GBR 2001 0 0
#> 6 11 GBR 2001 0 1
#> 7 12 GBR 2007 1 1
#> 8 13 GBR 2007 1 1
Here is a base R solution by using ave() and subset()
DTout <- subset(DT, as.logical(ave(DT$year,DT$country, FUN = function(x) length(unique(x))>=2)))
such that
> DTout
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
If it's not necessary to do it in data.table, you can count the number of distinct years by country via base R:
country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1

apply lag or lead in increasing order for the dataframe

df1 <- read.csv("C:/Users/uni/DS-project/df1.csv")
df1
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
6 2000 1
7 2001 2
8 2002 3
9 2003 4
10 2004 5
11 2000 1
12 2001 2
13 2002 3
14 2003 4
15 2004 5
16 2000 1
17 2001 2
18 2002 3
19 2003 4
20 2004 5
I want to apply lead so that I get the output shown below.
We have a set of 5 observations, one per year, repeated n times. The first set is kept whole. From the next set we need to remove 2000 and its value; from the one after that, 2000 and 2001 and their values; from the next, 2000, 2001 and 2002 and their values; and so on.
The output should then look like this.
output:
year value
2000 1
2001 2
2002 3
2003 4
2004 5
2001 2
2002 3
2003 4
2004 5
2002 3
2003 4
2004 5
2003 4
2004 5
please help.
Just for fun, here is a vectorized solution using matrix subsetting:
m <- matrix(rep(TRUE, nrow(df)), 5)
m[upper.tri(m)] <- FALSE
df[m,]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 7 2001 2
# 8 2002 3
# 9 2003 4
# 10 2004 5
# 13 2002 3
# 14 2003 4
# 15 2004 5
# 19 2003 4
# 20 2004 5
Below grp is 1 for each row of the first group, 2 for the second and so on. Seq is 1, 2, 3, ... for the successive rows of each grp. Now just pick out those rows for which Seq is at least as large as grp. This has the effect of removing the first i-1 rows from the ith group for i = 1, 2, ... .
grp <- cumsum(df1$year == 2000)
Seq <- ave(grp, grp, FUN = seq_along)
subset(df1, Seq >= grp)
We could alternately write this in the less general form:
subset(df1, 1:5 >= rep(1:4, each = 5))
In any case the output from either subset statement is:
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
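The intermediate vectors make the rule easier to follow. Below is a small sketch that rebuilds the question's data in place of the CSV file (the data frame construction and the name keep are assumptions made for illustration) and prints grp, Seq and the resulting row selection.

```r
# Stand-in for the CSV in the question: years 2000-2004 repeated 4 times
df1 <- data.frame(year = rep(2000:2004, 4), value = rep(1:5, 4))

grp <- cumsum(df1$year == 2000)        # 1 1 1 1 1 2 2 2 2 2 3 ... 4
Seq <- ave(grp, grp, FUN = seq_along)  # 1 2 3 4 5 1 2 3 4 5 1 ... 5

# Keep a row when its position within the group is at least the group number
keep <- Seq >= grp
which(keep)

result <- subset(df1, keep)
nrow(result)
```

Because grp is derived from the rows where year equals 2000, this relies on every repetition starting with 2000; it does not assume that each repetition has exactly 5 rows.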
library(dplyr)
df %>%
  group_by(g = cumsum(year == 2000)) %>%
  filter(row_number() >= g) %>%
  ungroup() %>%
  select(-g)
# # A tibble: 14 x 2
# year value
# <int> <int>
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 6 2001 2
# 7 2002 3
# 8 2003 4
# 9 2004 5
# 10 2002 3
# 11 2003 4
# 12 2004 5
# 13 2003 4
# 14 2004 5
Using lapply():
to <- nrow(df) / 5 - 1
df[-unlist(lapply(1:to, function(x) seq(1:x) + 5*x)), ]
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
Where unlist(lapply(1:to, function(x) seq(1:x) + 5*x)) are the indices to skip:
[1] 6 11 12 16 17 18
Using sequence:
df[5-rev(sequence(2:5)-1),]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 2.1 2001 2
# 3.1 2002 3
# 4.1 2003 4
# 5.1 2004 5
# 3.2 2002 3
# 4.2 2003 4
# 5.2 2004 5
# 4.3 2003 4
# 5.3 2004 5
how it works:
5-rev(sequence(2:5)-1)
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5
rev(sequence(2:5)-1)
# [1] 4 3 2 1 0 3 2 1 0 2 1 0 1 0
sequence(2:5)-1
# [1] 0 1 0 1 2 0 1 2 3 0 1 2 3 4
sequence(2:5)
# [1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5

Adding new rows, filling those with a consecutive year and 0s in the variable column

I've got a three-column table (year, ID, variable). For every individual ID I need to add a new row with the following year and a 0 in the variable column. The difficulty for me is that every ID has a different number of rows (years).
This is my original table:
ID year var
1 1998 2
1 1999 5
1 2000 6
1 2001 6
1 2002 6
2 1998 12
2 1999 12
3 1998 5
3 1999 5
3 2000 4
... and this is what I need:
ID year var
1 1998 2
1 1999 5
1 2000 6
1 2001 6
1 2002 6
1 2003 0
2 1998 12
2 1999 12
2 2000 0
3 1998 5
3 1999 5
3 2000 4
3 2001 0
Any help will be appreciated.
Cheers
A data.table solution (for syntax more than memory efficiency)
library(data.table)
# assuming your data is in the data.frame dd
DT <- data.table(dd)
DT[,list(year = c(year,max(year)+1), var = c(var,0)),by = ID]
Get the data:
test <- read.table(textConnection("ID year var
1 1998 2
1 1999 5
1 2000 6
1 2001 6
1 2002 6
2 1998 12
2 1999 12
3 1998 5
3 1999 5
3 2000 4"),header=TRUE)
Add the rows in:
do.call(rbind,by(test,test$ID,function(x) rbind(x,c(x$ID[1],max(x$year)+1,0))))
The result:
ID year var
1.1 1 1998 2
1.2 1 1999 5
1.3 1 2000 6
1.4 1 2001 6
1.5 1 2002 6
1.6 1 2003 0
2.6 2 1998 12
2.7 2 1999 12
2.3 2 2000 0
3.8 3 1998 5
3.9 3 1999 5
3.10 3 2000 4
3.4 3 2001 0
Here's another solution using just the base package
DF <- read.table(textConnection("ID year var
1 1998 2
1 1999 5
1 2000 6
1 2001 6
1 2002 6
2 1998 12
2 1999 12
3 1998 5
3 1999 5
3 2000 4"), header=TRUE)
foo <- split(DF, DF["ID"])
addone <- function(x) {
  last <- tail(x, 1)
  last$year <- last$year + 1
  last$var <- 0
  rbind(x, last)
}
do.call(rbind, lapply(foo, addone))
... and the output:
ID year var
1.1 1 1998 2
1.2 1 1999 5
1.3 1 2000 6
1.4 1 2001 6
1.5 1 2002 6
1.51 1 2003 0
2.6 2 1998 12
2.7 2 1999 12
2.71 2 2000 0
3.8 3 1998 5
3.9 3 1999 5
3.10 3 2000 4
3.101 3 2001 0
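Another way to get the same rows without splitting the data is to build all the extra rows in one step and sort afterwards. This is a sketch in base R under the assumption that the data frame is called dd, as in the data.table answer above; the names extra and out are made up for illustration.

```r
# The question's data
dd <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  year = c(1998:2002, 1998:1999, 1998:2000),
  var  = c(2, 5, 6, 6, 6, 12, 12, 5, 5, 4)
)

# One extra row per ID: the year after that ID's last year, with var = 0
extra <- aggregate(year ~ ID, dd, max)
extra$year <- extra$year + 1
extra$var <- 0

# Append the extra rows and restore the ID/year ordering
out <- rbind(dd, extra)
out <- out[order(out$ID, out$year), ]
rownames(out) <- NULL
out
```

Resetting the row names at the end avoids the compound labels (1.51, 2.71, ...) that the do.call(rbind, ...) versions produce.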

Resources