Extracting the change of the mean per group over time in R

I have a data table from which I calculated the mean sales as follows:
library(data.table)
DT <- fread(
"ID country year sales industry size cat4
1 NLD 2000 4 A 1 0
2 NLD 2000 4 B 1 1
3 NLD 2006 2 A 1 1
4 NLD 2002 4 A 1 0
5 NLD 2002 4 B 1 1
6 NLD 2006 2 A 1 1
7 NLD 2006 2 B 2 0
8 NLD 2006 1 A 1 4
9 GBR 2001 2 B 3 5
10 GBR 2001 1 B 2 5
11 GBR 2002 1 A 1 11
12 GBR 2006 1 A 1 2
13 GBR 2006 1 B 3 12
14 GBR 2006 1 A 1 2
15 GBR 2006 1 B 3 12",
header = TRUE)
DT[, Mean_Sales := mean(sales, na.rm = TRUE), by = c("country", "industry", "size")]
However, I am now interested in how Mean_Sales changes over time, per group: by = c("country", "industry", "size").
I would like to take the mean of the absolute differences, divided by the number of years they are apart.
As an example, for a company in NLD of industry A and size 1, corresponding to ID = 1 and ID = 8, I want the absolute difference in sales (|1 - 4| = 3) divided by the years apart (2006 - 2000 = 6), giving a year-to-year change of the mean of 3/6 = 0.5.
I just cannot figure out how to get this into R code. Any help would be greatly appreciated.
Desired output:
library(data.table)
DT <- fread(
"ID country year sales industry size cat4 delta
1 NLD 2000 4 A 1 0 0.5
2 NLD 2000 4 B 1 1 0.33
3 NLD 2006 2 A 1 1
4 NLD 2002 4 A 1 0
5 NLD 2002 4 B 1 1
6 NLD 2006 2 A 1 1
7 NLD 2006 2 B 1 0 0.33
8 NLD 2006 1 A 1 4 0.5
9 GBR 2001 2 B 3 5
10 GBR 2001 1 B 2 5
11 GBR 2002 1 A 1 11
12 GBR 2006 1 A 1 2
13 GBR 2006 1 B 3 12
14 GBR 2006 1 A 1 2
15 GBR 2006 1 B 3 12",
header = TRUE)

You could order by year, take the absolute difference between the last and first sales values, and divide it by the difference in years.
library(data.table)
DT[order(year), delta := abs(last(sales) - first(sales)) / (max(year) - min(year)),
   by = .(country, industry, size)]
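
For comparison, here is a dplyr sketch of the same idea (an addition, not part of the answer above); note that groups observed in only a single year have a zero year span, so their delta comes out as NaN rather than blank:
library(dplyr)
DT %>%
  group_by(country, industry, size) %>%
  arrange(year, .by_group = TRUE) %>%
  # |last - first| sales divided by the year span of the group
  mutate(delta = abs(last(sales) - first(sales)) / (max(year) - min(year))) %>%
  ungroup()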

Related

Create sequence by condition in the case when condition changes

The data looks like:
df <- data.frame("Grp"=c(rep("A",10),rep("B",10)),
"Year"=c(seq(2001,2010,1),seq(2001,2010,1)),
"Treat"=c(as.character(c(0,0,1,1,1,1,0,0,1,1)),
as.character(c(1,1,1,0,0,0,1,1,1,0))))
df
Grp Year Treat
1 A 2001 0
2 A 2002 0
3 A 2003 1
4 A 2004 1
5 A 2005 1
6 A 2006 1
7 A 2007 0
8 A 2008 0
9 A 2009 1
10 A 2010 1
11 B 2001 1
12 B 2002 1
13 B 2003 1
14 B 2004 0
15 B 2005 0
16 B 2006 0
17 B 2007 1
18 B 2008 1
19 B 2009 1
20 B 2010 0
All I want is to generate another column seq that counts the sequence of Treat within each Grp, preserving the order of Year. The hard part is that when Treat turns to 0, seq should be 0 (or anything), and the counting should restart when Treat turns back to non-zero. An example of the final data frame is below:
Grp Year Treat seq
1 A 2001 0 0
2 A 2002 0 0
3 A 2003 1 1
4 A 2004 1 2
5 A 2005 1 3
6 A 2006 1 4
7 A 2007 0 0
8 A 2008 0 0
9 A 2009 1 1
10 A 2010 1 2
11 B 2001 1 1
12 B 2002 1 2
13 B 2003 1 3
14 B 2004 0 0
15 B 2005 0 0
16 B 2006 0 0
17 B 2007 1 1
18 B 2008 1 2
19 B 2009 1 3
20 B 2010 0 0
Any suggestions would be much appreciated!
With data.table's rleid, you can do:
library(dplyr)
df %>%
  group_by(Grp, grp = data.table::rleid(Treat)) %>%
  mutate(seq = row_number() * as.integer(Treat)) %>%
  ungroup() %>%
  select(-grp)
# Grp Year Treat seq
# <chr> <dbl> <chr> <int>
# 1 A 2001 0 0
# 2 A 2002 0 0
# 3 A 2003 1 1
# 4 A 2004 1 2
# 5 A 2005 1 3
# 6 A 2006 1 4
# 7 A 2007 0 0
# 8 A 2008 0 0
# 9 A 2009 1 1
#10 A 2010 1 2
#11 B 2001 1 1
#12 B 2002 1 2
#13 B 2003 1 3
#14 B 2004 0 0
#15 B 2005 0 0
#16 B 2006 0 0
#17 B 2007 1 1
#18 B 2008 1 2
#19 B 2009 1 3
#20 B 2010 0 0
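
A data.table-only sketch of the same idea (an addition, not part of the answer above): group by Grp together with the runs that rleid identifies, and scale the within-run counter by Treat.
library(data.table)
setDT(df)[, seq := seq_len(.N) * as.integer(Treat), by = .(Grp, rleid(Treat))][]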

Retaining column information when melting multiple columns into one

I have a data.table which I have melted follows:
library(data.table)
DT <- fread(
"ID country year Event_A Event_B
4 NLD 2002 0 1
5 NLD 2002 0 1
6 NLD 2006 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2002 0 1
10 GBR 2002 0 0
11 GBR 2002 0 1
12 GBR 2006 1 1
13 GBR 2006 1 1",
header = TRUE)
melt(DT, id.var = setdiff(names(DT), c("Event_A", "Event_B")),
value.name = 'Event')[, variable := NULL][order(ID)]
# ID country year Event
# 1: 4 NLD 2002 0
# 2: 4 NLD 2002 1
# 3: 5 NLD 2002 0
# 4: 5 NLD 2002 1
# 5: 6 NLD 2006 1
# 6: 6 NLD 2006 1
# 7: 7 NLD 2006 1
# 8: 7 NLD 2006 0
# 9: 8 NLD 2006 1
#10: 8 NLD 2006 1
#11: 9 GBR 2002 0
#12: 9 GBR 2002 1
#13: 10 GBR 2002 0
#14: 10 GBR 2002 0
#15: 11 GBR 2002 0
#16: 11 GBR 2002 1
#17: 12 GBR 2006 1
#18: 12 GBR 2006 1
#19: 13 GBR 2006 1
#20: 13 GBR 2006 1
However, in hindsight, I want to have the Event category in the melted data set. How do I make sure that this information is retained in the melted data?
EDIT (Due to oversimplification in original post):
DT <- fread(
"ID country year Event_A Event_B Choice_A Choice_B
4 NLD 2002 0 1 0 1
5 NLD 2002 0 1 1 1
6 NLD 2006 1 1 0 1
7 NLD 2006 1 0 1 1
8 NLD 2006 1 1 1 1
9 GBR 2002 0 1 1 0
10 GBR 2002 0 0 1 1
11 GBR 2002 0 1 0 1
12 GBR 2006 1 1 1 1
13 GBR 2006 1 1 0 0",
header = TRUE)
DT <- melt(DT, measure = patterns("^Event_", "^Choice_"),
           value.name = c("Event", "Choice"))[, variable := NULL][order(ID)]
Desired output:
# ID country year Event Event_Cat Choice Choice_Cat
# 1: 4 NLD 2002 0 A 0 A
# 2: 4 NLD 2002 1 B 1 B
# 3: 5 NLD 2002 0 A
# 4: 5 NLD 2002 1 B
# 5: 6 NLD 2006 1 A
# 6: 6 NLD 2006 1 B
# 7: 7 NLD 2006 1
# 8: 7 NLD 2006 0
# 9: 8 NLD 2006 1
#10: 8 NLD 2006 1
#11: 9 GBR 2002 0
#12: 9 GBR 2002 1
#13: 10 GBR 2002 0
#14: 10 GBR 2002 0
#15: 11 GBR 2002 0
#16: 11 GBR 2002 1
#17: 12 GBR 2006 1
#18: 12 GBR 2006 1
#19: 13 GBR 2006 1
#20: 13 GBR 2006 1
Don't NULLify the variable.name:
setnames(
melt(DT, id.var = setdiff(names(DT), c("Event_A", "Event_B")), value.name = 'Event')[
, variable:=sub("Event_", "", variable)][order(ID)],
old="variable", new="Event_Cat")
ID country year Event_Cat Event
1: 4 NLD 2002 A 0
2: 4 NLD 2002 B 1
3: 5 NLD 2002 A 0
4: 5 NLD 2002 B 1
5: 6 NLD 2006 A 1
6: 6 NLD 2006 B 1 ...
Edit, based on new information provided (melting multiple columns).
DT2 <- setnames(
  melt(DT, measure = patterns("^Event_", "^Choice_"),
       value.name = c("Event", "Choice"))[
    , variable := forcats::lvls_revalue(variable, c("A", "B"))][order(ID)],
  old = "variable", new = "Category")
DT2
ID country year Category Event Choice
1: 4 NLD 2002 A 0 0
2: 4 NLD 2002 B 1 1
3: 5 NLD 2002 A 0 1
4: 5 NLD 2002 B 1 1
5: 6 NLD 2006 A 1 0
6: 6 NLD 2006 B 1 1 ...
You could use pivot_longer from tidyr:
tidyr::pivot_longer(DT, cols = starts_with('Event'),
names_to = c('.value', 'Event_Cat'),
names_sep = '_')
# ID country year Event_Cat Event
# <int> <chr> <int> <chr> <int>
# 1 4 NLD 2002 A 0
# 2 4 NLD 2002 B 1
# 3 5 NLD 2002 A 0
# 4 5 NLD 2002 B 1
# 5 6 NLD 2006 A 1
# 6 6 NLD 2006 B 1
# 7 7 NLD 2006 A 1
# 8 7 NLD 2006 B 0
# 9 8 NLD 2006 A 1
#10 8 NLD 2006 B 1
#11 9 GBR 2002 A 0
#12 9 GBR 2002 B 1
#13 10 GBR 2002 A 0
#14 10 GBR 2002 B 0
#15 11 GBR 2002 A 0
#16 11 GBR 2002 B 1
#17 12 GBR 2006 A 1
#18 12 GBR 2006 B 1
#19 13 GBR 2006 A 1
#20 13 GBR 2006 B 1
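
For the edited data with both Event_* and Choice_* columns, the same pivot_longer pattern extends; a sketch (assuming DT still holds the wide columns from the edit, and that a single shared category column is acceptable since Event and Choice share the A/B suffixes):
tidyr::pivot_longer(DT, cols = -c(ID, country, year),
                    names_to = c(".value", "Category"),
                    names_sep = "_")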

Count observations per group satisfying a different condition for each row

I have a dataframe that looks like this one
state start end date treat
1 1999 2000 2001 1
1 1998 2000 2001 1
1 2000 2003 NA 0
2 2001 2002 NA 0
2 2002 2004 2003 1
2 2003 2004 2005 1
3 2002 2004 2006 1
3 2003 2004 NA 0
3 2005 2007 NA 0
I want to group it by the state identifier and, for each state, compute the number of treated observations (treat) whose date lies between start and end.
In other words I want to get the following
state start end date treat result
1 1999 2000 2001 1 0
1 1998 2000 2001 1 0
1 2000 2003 NA 0 2
2 2001 2002 NA 0 0
2 2002 2004 2003 1 1
2 2003 2004 2005 1 0
3 2002 2004 2006 1 0
3 2003 2004 NA 0 0
3 2005 2007 NA 0 1
For instance, result in the first row is equal to 0 because within state = 1 there is no date between 1999 and 2000. On the other hand, result in the last row is equal to 1 because within state 3 there is one treated observation whose date lies between 2005 and 2007 (namely date = 2006 in the 7th row).
Thank you very much for your help.
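The question does not include code that builds the data, so here is a minimal construction transcribed from the table above (the answers below refer to it as x and as df):
x <- df <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)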
You can split by state, combine two outer calls with & to test whether each date lies between each row's start and end, and then sum treat for the matching dates.
x$result <- unlist(lapply(split(x, x$state), function(y) {
  # TRUE where a date falls strictly between a row's start and end
  tt <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
  # NA dates give NA comparisons; set them TRUE (harmless here, as those rows have treat = 0)
  tt[is.na(tt)] <- TRUE
  # for each row, sum treat over the dates that fall inside its interval
  apply(tt, 1, function(z) sum(y$treat[z]))
}))
x
# state start end date treat result
#1 1 1999 2000 2001 1 0
#2 1 1998 2000 2001 1 0
#3 1 2000 2003 NA 0 2
#4 2 2001 2002 NA 0 0
#5 2 2002 2004 2003 1 1
#6 2 2003 2004 2005 1 0
#7 3 2002 2004 2006 1 0
#8 3 2003 2004 NA 0 0
#9 3 2005 2007 NA 0 1
Alternatively, aggregate treat per state and date, merge that with the state, start and end columns, and sum the matching treat.
tt <- aggregate(treat ~ state + date, x[,c("state", "date", "treat")], sum)
tt <- merge(x[,c("state", "start", "end")], tt)
tt$treat[tt$start >= tt$date | tt$end <= tt$date] <- 0
aggregate(treat ~ start + end + state, tt, sum)
# start end state treat
#1 1998 2000 1 0
#2 1999 2000 1 0
#3 2000 2003 1 2
#4 2001 2002 2 0
#5 2002 2004 2 1
#6 2003 2004 2 0
#7 2002 2004 3 0
#8 2003 2004 3 0
#9 2005 2007 3 1
This gives the count per state (the total number of treated observations whose date falls within the state's overall start-end range), though it repeats that count on every row:
library(tidyverse)
df %>% group_by(state) %>%
mutate(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 9 x 6
#> # Groups: state [3]
#> state start end date treat result
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1999 2000 2001 1 2
#> 2 1 1998 2000 2001 1 2
#> 3 1 2000 2003 NA 0 2
#> 4 2 2001 2002 NA 0 1
#> 5 2 2002 2004 2003 1 1
#> 6 2 2003 2004 2005 1 1
#> 7 3 2002 2004 2006 1 1
#> 8 3 2003 2004 NA 0 1
#> 9 3 2005 2007 NA 0 1
If you just want one number per group, summarize might be a better option:
df %>% group_by(state) %>%
summarize(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 3 x 2
#> state result
#> <dbl> <int>
#> 1 1 2
#> 2 2 1
#> 3 3 1
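
If the per-row counts from the desired output are what is needed, here is a per-row dplyr sketch (an assumption: a date only counts when it lies strictly between a row's start and end, matching the strict inequalities of the first answer):
library(dplyr)
df %>%
  group_by(state) %>%
  mutate(result = sapply(seq_len(n()), function(i)
    # count treated rows in the same state whose date falls inside row i's interval
    sum(treat == 1 & !is.na(date) & date > start[i] & date < end[i]))) %>%
  ungroup()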

Melting/Splitting a row into two rows, using two column values in the original row, leaving the rest intact

I have a data.table as follows:
DT <- fread(
"ID country year Event_A Event_B
4 NLD 2002 0 1
5 NLD 2002 0 1
6 NLD 2006 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2002 0 1
10 GBR 2002 0 0
11 GBR 2002 0 1
12 GBR 2006 1 1
13 GBR 2006 1 1",
header = TRUE)
I want to melt the event columns into separate rows, one per event, without summing them. I tried:
meltedsessions <- melt(Exp, id.vars = -c("Event_A", "Event_B"), measure.vars = c("Event_A", "Event_B"))
I need to specify id.vars as a negative selection because the actual dataset has another 240 variables that need to stay intact. However, if I do this I get the error:
Error in melt.data.table(Exp, id.vars = c("ID", "country", "year"), measure.vars = c("Event_A", :
One or more values in 'id.vars' is invalid.
How should I solve this?
Desired output:
DT <- fread(
"NewID ID country year Event
1 4 NLD 2002 0
2 4 NLD 2002 1
3 5 NLD 2002 0
4 5 NLD 2002 1
5 6 NLD 2006 1
6 6 NLD 2006 1
7 7 NLD 2006 1
8 7 NLD 2006 0
9 8 NLD 2006 1
10 8 NLD 2006 0
11 9 GBR 2002 1
12 9 GBR 2002 1
13 10 GBR 2002 0
14 10 GBR 2002 0
15 11 GBR 2002 0
16 12 GBR 2002 1
17 13 GBR 2006 1
18 14 GBR 2006 1
19 15 GBR 2006 1
20 16 GBR 2006 1",
header = TRUE)
Instead of a negative selection in id.var, you can use setdiff:
library(data.table)
melt(DT, id.var = setdiff(names(DT), c("Event_A", "Event_B")),
value.name = 'Event')[, variable := NULL][order(ID)]
# ID country year Event
# 1: 4 NLD 2002 0
# 2: 4 NLD 2002 1
# 3: 5 NLD 2002 0
# 4: 5 NLD 2002 1
# 5: 6 NLD 2006 1
# 6: 6 NLD 2006 1
# 7: 7 NLD 2006 1
# 8: 7 NLD 2006 0
# 9: 8 NLD 2006 1
#10: 8 NLD 2006 1
#11: 9 GBR 2002 0
#12: 9 GBR 2002 1
#13: 10 GBR 2002 0
#14: 10 GBR 2002 0
#15: 11 GBR 2002 0
#16: 11 GBR 2002 1
#17: 12 GBR 2006 1
#18: 12 GBR 2006 1
#19: 13 GBR 2006 1
#20: 13 GBR 2006 1
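
The desired output also shows a NewID running index, which the melt call above does not create. One way to add it (a sketch, not part of the original answer) is a plain row counter after ordering:
library(data.table)
res <- melt(DT, id.var = setdiff(names(DT), c("Event_A", "Event_B")),
            value.name = "Event")[, variable := NULL][order(ID)]
res[, NewID := .I]  # running row index after ordering by ID
setcolorder(res, c("NewID", setdiff(names(res), "NewID")))
res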

Selecting observations for which two years are available by country

I have a dataset as follows:
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
I would like to keep only observations from countries for which I have observations in at least two different years. So BEL will drop out because it only has observations in 2002.
I would like to do something like DT[,if(unique(year)>1) .SD, by=country] but that does not do anything. I also tried DT[unique(year)>1, .SD, by=country] but this gives the error:
Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) :
i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Desired output:
DT <- fread(
"ID country year Event_A Event_B
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
You can use uniqueN to get the count of unique years and select rows using .SD.
library(data.table)
DT[, .SD[uniqueN(year) > 1], country]
# country ID year Event_A Event_B
#1: NLD 6 2002 1 1
#2: NLD 7 2006 1 0
#3: NLD 8 2006 1 1
#4: GBR 9 2001 0 1
#5: GBR 10 2001 0 0
#6: GBR 11 2001 0 1
#7: GBR 12 2007 1 1
#8: GBR 13 2007 1 1
Or, in dplyr, we can do the same with n_distinct and filter:
library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)
In the same spirit as #user2474226's answer, if you're open to other packages, here is a simple dplyr solution:
library(data.table)
library(dplyr)
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
# Countries with observations in more than one year
sel_cnt <-
DT %>%
count(country, year) %>%
count(country) %>%
filter(n > 1)
DT %>%
semi_join(sel_cnt, by = "country")
#> ID country year Event_A Event_B
#> 1 6 NLD 2002 1 1
#> 2 7 NLD 2006 1 0
#> 3 8 NLD 2006 1 1
#> 4 9 GBR 2001 0 1
#> 5 10 GBR 2001 0 0
#> 6 11 GBR 2001 0 1
#> 7 12 GBR 2007 1 1
#> 8 13 GBR 2007 1 1
Here is a base R solution using ave() and subset():
DTout <- subset(DT, as.logical(ave(DT$year, DT$country,
                                   FUN = function(x) length(unique(x)) >= 2)))
such that
> DTout
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
If it's not necessary to do it in data.table, you can count the number of distinct years by country via base R:
country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
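
Yet another data.table route (a sketch, not one of the answers above): count the distinct country-year pairs first, then keep only countries that appear with more than one year.
library(data.table)
keep <- unique(DT[, .(country, year)])[, .N, by = country][N > 1, country]
DT[country %in% keep]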
