r collapse by year by ID


I have a dataset with multiple rows per ID like this
ID From To State
1 2004 2005 MD
1 2005 2005 MD
1 2005 2012 DC
1 2012 2015 DC
1 2015 2020 DC
1 2012 2013 MD
1 2013 2016 MD
1 2016 2019 MD
1 2019 2020 MD
2 2003 2004 OR
2 2004 2008 OR
2 2008 2013 AZ
2 2013 2015 AZ
My goal is to collapse the multiple From and To columns to create a smooth timeline like
ID From To State
1 2004 2005 MD
1 2005 2020 DC
1 2012 2020 MD
2 2003 2008 OR
2 2008 2015 AZ
I am not sure how to accomplish this. Any help is much appreciated. Thanks.

Group by 'ID', 'State' and the run-length-id of 'State', get the first of 'From' and last of 'To'
library(dplyr)
library(data.table)
df1 %>%
group_by(ID, State, grp = rleid(State)) %>%
summarise(From = first(From), To = last(To), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 5 × 4
ID State From To
<int> <chr> <int> <int>
1 1 DC 2005 2020
2 1 MD 2004 2005
3 1 MD 2012 2020
4 2 AZ 2008 2015
5 2 OR 2003 2008
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), From = c(2004L, 2005L, 2005L, 2012L, 2015L, 2012L,
2013L, 2016L, 2019L, 2003L, 2004L, 2008L, 2013L), To = c(2005L,
2005L, 2012L, 2015L, 2020L, 2013L, 2016L, 2019L, 2020L, 2004L,
2008L, 2013L, 2015L), State = c("MD", "MD", "DC", "DC", "DC",
"MD", "MD", "MD", "MD", "OR", "OR", "AZ", "AZ")),
class = "data.frame", row.names = c(NA,
-13L))
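The answer above returns the rows sorted by ID and State, which differs from the timeline order shown in the question. A minimal sketch of one way to keep the original order, assuming the rows are already sorted chronologically within each ID, is to group on the run id alone and carry State through summarise. The name `timeline` is only illustrative, and `data.table::rleid` is called with its namespace so that dplyr's `first` and `last` are not masked.

```r
library(dplyr)

# Same data as df1 above, written out as a plain data frame
df1 <- data.frame(
  ID    = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  From  = c(2004L, 2005L, 2005L, 2012L, 2015L, 2012L, 2013L, 2016L,
            2019L, 2003L, 2004L, 2008L, 2013L),
  To    = c(2005L, 2005L, 2012L, 2015L, 2020L, 2013L, 2016L, 2019L,
            2020L, 2004L, 2008L, 2013L, 2015L),
  State = c("MD", "MD", "DC", "DC", "DC", "MD", "MD", "MD", "MD",
            "OR", "OR", "AZ", "AZ")
)

timeline <- df1 %>%
  # one run id per consecutive block of identical (ID, State) pairs
  group_by(ID, grp = data.table::rleid(ID, State)) %>%
  summarise(From = first(From), To = last(To), State = first(State),
            .groups = "drop") %>%
  select(ID, From, To, State)

print(timeline)
```

Because the groups are ordered by the run id rather than by State, the collapsed rows stay in the order in which the spells appear in the input.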

Related

Create a group variable first.treat indicating the first year when each unit becomes treated

I am trying to reproduce on my dataframe the DiD analysis performed by Callaway and Sant'Anna (2021). Because treatment timing varies, I need to define a variable first.treat reporting, for each ID, the year when the unit first became treated (Treatment = 0 if not treated, 1 otherwise). If a unit is never treated, the value of first.treat should be zero.
I report below a simplified dataframe: I have the variables ID, Year, and Treatment. I need to create the variable first.treat as follows.
ID Year Treatment first.treat
a 2016 0 2017
a 2017 1 2017
a 2018 1 2017
b 2016 1 2016
b 2017 1 2016
b 2018 1 2016
c 2016 0 2018
c 2017 0 2018
c 2018 1 2018
d 2016 0 0
d 2017 0 0
d 2018 0 0
How can I do it with R?
Thank you

Please make sure to provide data in a more R-friendly format next time. E.g.
df <-
tibble::tribble(
~ID, ~Year, ~Treatment, ~first.treat,
"a", 2016L, 0L, 2017L,
"a", 2017L, 1L, 2017L,
"a", 2018L, 1L, 2017L,
"b", 2016L, 1L, 2016L,
"b", 2017L, 1L, 2016L,
"b", 2018L, 1L, 2016L,
"c", 2016L, 0L, 2018L,
"c", 2017L, 0L, 2018L,
"c", 2018L, 1L, 2018L,
"d", 2016L, 0L, 0L,
"d", 2017L, 0L, 0L,
"d", 2018L, 0L, 0L
)
Luckily there is a datapasta package which allowed me to easily convert your table to the code above. But it might not be so widely known.
Here's a solution to your problem:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(first.treat = min(
if_else(Treatment == 1, Year, NA_integer_),
na.rm = TRUE
)) %>%
ungroup()
#> # A tibble: 12 x 4
#> ID Year Treatment first.treat
#> <chr> <int> <int> <dbl>
#> 1 a 2016 0 2017
#> 2 a 2017 1 2017
#> 3 a 2018 1 2017
#> 4 b 2016 1 2016
#> 5 b 2017 1 2016
#> 6 b 2018 1 2016
#> 7 c 2016 0 2018
#> 8 c 2017 0 2018
#> 9 c 2018 1 2018
#> 10 d 2016 0 Inf
#> 11 d 2017 0 Inf
#> 12 d 2018 0 Inf
Created on 2022-01-04 by the reprex package (v2.0.1)
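Here we calculate min values in groups by ID for a modified Year variable: when Treatment is not 1, Year is set to NA. Note that for a never-treated unit (ID d) every value is NA, so min() with na.rm = TRUE returns Inf (with a warning) rather than the 0 requested in the question; those values still need to be replaced by 0.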
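A minimal sketch of one way to get 0 for never-treated units directly, assuming the same ID, Year and Treatment columns as in the question. The `if`/`else` is evaluated once per group, so `min()` is only called when the group contains at least one treated year; the name `res` is only illustrative.

```r
library(dplyr)

# Input columns from the question, without the target column
df <- data.frame(
  ID        = rep(c("a", "b", "c", "d"), each = 3),
  Year      = rep(2016:2018, times = 4),
  Treatment = c(0L, 1L, 1L,  1L, 1L, 1L,  0L, 0L, 1L,  0L, 0L, 0L)
)

res <- df %>%
  group_by(ID) %>%
  # first treated year per unit, or 0 if the unit is never treated
  mutate(first.treat = if (any(Treatment == 1)) min(Year[Treatment == 1]) else 0L) %>%
  ungroup()

print(res)
```

This keeps first.treat as an integer column and avoids the Inf values and the warning.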

Why geom_line is not displaying correctly?

I am running an analysis on the Bike Sharing (Kaggle) dataset. Here is a sample:
Head
yr mnth Ano cnt
<int> <int> <chr> <int>
1 0 1 2011 985
2 0 1 2011 801
3 0 1 2011 1349
4 0 1 2011 1562
5 0 1 2011 1600
Tail
yr mnth Ano cnt
<int> <int> <chr> <int>
1 1 12 2012 2114
2 1 12 2012 3095
3 1 12 2012 1341
4 1 12 2012 1796
5 1 12 2012 2729
Where "cnt" means the number of bikes for each day. Every line is a day from 01/01/2011 to 12/12/2012
My goal was to analyse cnt for each month of both 2011 and 2012; however, I keep getting this weird output:
my code:
k<- bike_new %>%
ggplot(aes(x=mnth,y=cnt))+ geom_line();k
What am I doing wrong here?

As @AllanCameron advised, add the group aesthetic as a factor, and since you have two years, map color to the year as well. Here is the code using simulated data:
library(ggplot2)
library(dplyr)
#Code
bike_new %>%
ggplot(aes(x=factor(mnth),y=cnt,group=factor(Ano),color=factor(Ano)))+
geom_line()+
xlab('month')+
labs(color='Ano')
Output:
Some data used:
#Data
bike_new <- structure(list(yr = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L), mnth = c(1, 1, 1, 1, 1, 12, 12, 12, 12,
12, 2, 2, 2, 2, 2), Ano = c(2011L, 2011L, 2011L, 2011L, 2011L,
2012L, 2012L, 2012L, 2012L, 2012L, 2011L, 2011L, 2011L, 2011L,
2011L), cnt = c(985, 801, 1349, 1562, 1600, 2114, 3095, 1341,
1796, 2729, 1085, 901, 1449, 1662, 1700)), row.names = c(NA,
-15L), class = "data.frame")
If you want to see only one line per year, one strategy is the one explained by @Phil, using another variable such as the day. Alternatively, you can aggregate the values as follows:
#Code 2
bike_new %>%
group_by(Ano,mnth) %>%
summarise(cnt=sum(cnt,na.rm=T)) %>%
ggplot(aes(x=factor(mnth),y=cnt,group=factor(Ano),color=factor(Ano)))+
geom_line()+
geom_point()+
xlab('month')+
labs(color='Ano')
Output:
The sum is used for the aggregation because you are analyzing the number of bikes.

Calculate Percentage Change in R using dplyr

I want to calculate the percentage of Profit by YEAR which is a fairly simple task but somehow I am getting NA. I have checked same questions asked before but I'm not able to understand why I am getting NA. The data is as follows:
> df_vertical_growth
YEAR VERTICAL Profit pct_change
1 2017 AGRICULTURE 0 NA
2 2016 AGRICULTURE 2053358 NA
3 2015 AGRICULTURE 0 NA
4 2014 AGRICULTURE 2370747 NA
5 2013 AGRICULTURE 4066693 NA
6 2017 COMMUNICATION 0 NA
7 2016 COMMUNICATION 1680074 NA
8 2015 COMMUNICATION 1322470 NA
9 2014 COMMUNICATION 1460133 NA
10 2013 COMMUNICATION 1529863 NA
11 2017 CONSTRUCTION 0 NA
12 2016 CONSTRUCTION 0 NA
13 2015 CONSTRUCTION 0 NA
14 2014 CONSTRUCTION 8250149 NA
15 2013 CONSTRUCTION 0 NA
16 2017 EDUCATION 0 NA
17 2016 EDUCATION 12497015 NA
18 2015 EDUCATION 13437356 NA
19 2014 EDUCATION 10856685 NA
20 2013 EDUCATION 13881127 NA
21 2017 FINANCE, INSURANCE, REAL ESTATE 0 NA
22 2016 FINANCE, INSURANCE, REAL ESTATE 0 NA
23 2015 FINANCE, INSURANCE, REAL ESTATE 0 NA
24 2014 FINANCE, INSURANCE, REAL ESTATE 0 NA
25 2013 FINANCE, INSURANCE, REAL ESTATE 5008436 NA
26 2017 HEALTHCARE 0 NA
27 2016 HEALTHCARE 0 NA
28 2015 HEALTHCARE 0 NA
29 2014 HEALTHCARE 4554364 NA
30 2013 HEALTHCARE 5078130 NA
31 2017 HOSPITALITY 0 NA
32 2016 HOSPITALITY 4445512 NA
33 2015 HOSPITALITY 5499419 NA
34 2014 HOSPITALITY 9060639 NA
35 2013 HOSPITALITY 4391522 NA
36 2017 MANUFACTURING 0 NA
37 2016 MANUFACTURING 0 NA
38 2015 MANUFACTURING 0 NA
39 2014 MANUFACTURING 0 NA
40 2013 MANUFACTURING 27466974 NA
41 2017 MINING 0 NA
42 2016 MINING 4359251 NA
43 2015 MINING 4163201 NA
44 2014 MINING 6272530 NA
45 2013 MINING 6668191 NA
46 2017 OTHER 0 NA
47 2016 OTHER 0 NA
48 2015 OTHER 0 NA
49 2014 OTHER 5935199 NA
50 2013 OTHER 3585969 NA
51 2017 PUBLIC ADMIN 0 NA
52 2016 PUBLIC ADMIN 0 NA
53 2015 PUBLIC ADMIN 0 NA
54 2014 PUBLIC ADMIN 0 NA
55 2013 PUBLIC ADMIN 0 NA
56 2017 RETAIL TRADE 0 NA
57 2016 RETAIL TRADE 0 NA
58 2015 RETAIL TRADE 0 NA
59 2014 RETAIL TRADE 0 NA
60 2013 RETAIL TRADE 0 NA
61 2017 SERVICE 0 NA
62 2016 SERVICE 0 NA
63 2015 SERVICE 0 NA
64 2014 SERVICE 0 NA
65 2013 SERVICE 28018522 NA
66 2017 TRANSPORTATION 0 NA
67 2016 TRANSPORTATION 0 NA
68 2015 TRANSPORTATION 0 NA
69 2014 TRANSPORTATION 0 NA
70 2013 TRANSPORTATION 8430244 NA
71 2017 UTILITY 0 NA
72 2016 UTILITY 3551989 NA
73 2015 UTILITY 6535248 NA
74 2014 UTILITY 3995486 NA
75 2013 UTILITY 4477617 NA
76 2017 WHOLESALE TRADE 0 NA
77 2016 WHOLESALE TRADE 6898041 NA
78 2015 WHOLESALE TRADE 7120828 NA
79 2014 WHOLESALE TRADE 0 NA
80 2013 WHOLESALE TRADE 0 NA
My Code:
df_vertical_growth %>% group_by(YEAR, VERTICAL) %>%
mutate(pct_change = ((Profit/lag(Profit) - 1) * 100))
Now, based on the answers provided here How can I calculate the percentage change within a group for multiple columns in R?, also tried doing the following:
pct <- function(x) {x / lag(x) - 1}
df_vertical_growth %>% group_by(YEAR, VERTICAL) %>% mutate_at(funs=pct,Profit)
But I am getting following error:
Error in check_dot_cols(.vars, .cols) : object 'Profit' not found
Can someone please tell me, what am I doing wrong? Thanks a lot in advance.

The problem is that grouping by both YEAR and VERTICAL leaves one observation per group (one unique year per vertical), and the lag of a single observation is NA. Additionally, since the years are in descending order, I trust you need lead rather than lag.
library(tidyverse)
z %>%
group_by(VERTICAL) %>%
mutate(pct_change = (Profit/lead(Profit) - 1) * 100)
#output
YEAR VERTICAL Profit pct_change
<int> <fctr> <int> <dbl>
1 2017 AGRICULTURE 0 -100
2 2016 AGRICULTURE 2053358 Inf
3 2015 AGRICULTURE 0 -100
4 2014 AGRICULTURE 2370747 - 41.7
5 2013 AGRICULTURE 4066693 NA
6 2017 COMMUNICATION 0 -100
7 2016 COMMUNICATION 1680074 27.0
8 2015 COMMUNICATION 1322470 - 9.43
9 2014 COMMUNICATION 1460133 - 4.56
10 2013 COMMUNICATION 1529863 NA
This solution assumes the years are arranged in the correct order, to make sure:
z %>%
group_by(VERTICAL) %>%
arrange(YEAR, .by_group = TRUE) %>%
mutate(pct_change = (Profit/lag(Profit) - 1) * 100)
#output
YEAR VERTICAL Profit pct_change
<int> <fctr> <int> <dbl>
1 2013 AGRICULTURE 4066693 NA
2 2014 AGRICULTURE 2370747 - 41.7
3 2015 AGRICULTURE 0 -100
4 2016 AGRICULTURE 2053358 Inf
5 2017 AGRICULTURE 0 -100
6 2013 COMMUNICATION 1529863 NA
7 2014 COMMUNICATION 1460133 - 4.56
8 2015 COMMUNICATION 1322470 - 9.43
9 2016 COMMUNICATION 1680074 27.0
10 2017 COMMUNICATION 0 -100
or use
arrange(desc(YEAR), .by_group = TRUE)
and lead
z is:
structure(list(YEAR = c(2017L, 2016L, 2015L, 2014L, 2013L, 2017L,
2016L, 2015L, 2014L, 2013L, 2017L, 2016L, 2015L, 2014L, 2013L,
2017L, 2016L, 2015L, 2014L, 2013L, 2017L, 2016L, 2015L, 2014L,
2013L, 2017L, 2016L, 2015L, 2014L, 2013L, 2017L, 2016L, 2015L,
2014L, 2013L, 2017L, 2016L, 2015L, 2014L, 2013L, 2017L, 2016L,
2015L, 2014L, 2013L, 2017L, 2016L, 2015L, 2014L, 2013L, 2017L,
2016L, 2015L, 2014L, 2013L, 2017L, 2016L, 2015L, 2014L, 2013L
), VERTICAL = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L,
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 12L, 12L, 12L, 12L, 12L), .Label = c("AGRICULTURE", "COMMUNICATION",
"CONSTRUCTION", "EDUCATION", "HEALTHCARE", "HOSPITALITY", "MANUFACTURING",
"MINING", "OTHER", "SERVICE", "TRANSPORTATION", "UTILITY"), class = "factor"),
Profit = c(0L, 2053358L, 0L, 2370747L, 4066693L, 0L, 1680074L,
1322470L, 1460133L, 1529863L, 0L, 0L, 0L, 8250149L, 0L, 0L,
12497015L, 13437356L, 10856685L, 13881127L, 0L, 0L, 0L, 4554364L,
5078130L, 0L, 4445512L, 5499419L, 9060639L, 4391522L, 0L,
0L, 0L, 0L, 27466974L, 0L, 4359251L, 4163201L, 6272530L,
6668191L, 0L, 0L, 0L, 5935199L, 3585969L, 0L, 0L, 0L, 0L,
28018522L, 0L, 0L, 0L, 0L, 8430244L, 0L, 3551989L, 6535248L,
3995486L, 4477617L)), .Names = c("YEAR", "VERTICAL", "Profit"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36",
"37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47",
"48", "49", "50", "61", "62", "63", "64", "65", "66", "67", "68",
"69", "70", "71", "72", "73", "74", "75"), class = "data.frame")

Assuming that your Profit column represents the profit in a given year, this function will calculate the difference between year n and year n-1, divide by the value of year n-1, and multiply by 100 to get a percentage. If the value in year n-1 was zero, there is no valid percent change. It is important that you group the data only by VERTICAL and not by YEAR as well.
profit_pct_change <- function(x) {
x <- x[order(x$YEAR, decreasing = TRUE), ] # Confirms ordered by decreasing year
pct_change <- -diff(x$Profit)/x$Profit[-1] * 100 # Gets percent change in profit from preceding year
data.frame(year = x$YEAR[-length(x$YEAR)], pct_change = pct_change) # Returns data frame
}
df_vertical_growth %>%
group_by(VERTICAL) %>%
do(profit_pct_change(.))
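The answers above return Inf where the previous year's profit was zero (and would return NaN for 0/0). A minimal sketch of one way to report NA in those cases instead, shown on the AGRICULTURE rows only; the helper column `prev` and the name `res` are illustrative and not part of the original answers.

```r
library(dplyr)

# AGRICULTURE rows from the question
toy <- data.frame(
  YEAR     = c(2017L, 2016L, 2015L, 2014L, 2013L),
  VERTICAL = "AGRICULTURE",
  Profit   = c(0, 2053358, 0, 2370747, 4066693)
)

res <- toy %>%
  group_by(VERTICAL) %>%                  # group by VERTICAL only, not YEAR
  arrange(YEAR, .by_group = TRUE) %>%     # ascending years, so lag() is the previous year
  mutate(
    prev = lag(Profit),
    # no valid percent change when there is no previous year or it was zero
    pct_change = if_else(is.na(prev) | prev == 0,
                         NA_real_,
                         (Profit / prev - 1) * 100)
  ) %>%
  ungroup() %>%
  select(-prev)

print(res)
```

Whether NA or Inf is the better marker for growth from a zero base depends on the later analysis, so treat this as one option rather than a correction.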

Create a constant variable link to year and quarter

Hi, I am a Stata user and I am trying to port my code to R. I have panel data as shown below, and I am looking for a command that creates a constant variable according to the year and quarter of each row. In Stata this would be done with gen new_variable = yq(year, quarter)
My dataframe looks like this:
id year quarter
1 2007 1
1 2007 2
1 2007 3
1 2007 4
1 2008 1
1 2008 2
1 2008 3
1 2008 4
1 2009 1
1 2009 2
1 2009 3
1 2009 4
2 2007 1
2 2007 2
2 2007 3
2 2007 4
2 2008 1
2 2008 2
2 2008 3
2 2008 4
3 2009 2
3 2009 3
3 2010 2
3 2010 3
My expected output should look like this (the values inside new_variable are arbitrary; I am just looking for a constant value that is always the same for each year and quarter):
id year quarter new_variable
1 2007 1 220
1 2007 2 221
1 2007 3 222
1 2007 4 223
1 2008 1 224
1 2008 2 225
1 2008 3 226
1 2008 4 227
1 2009 1 228
1 2009 2 229
1 2009 3 230
1 2009 4 231
2 2007 1 220
2 2007 2 221
2 2007 3 222
2 2007 4 223
2 2008 1 224
2 2008 2 225
2 2008 3 226
2 2008 4 227
3 2009 2 229
3 2009 3 230
3 2010 2 233
3 2010 3 234

Any of these will work:
# basic: just concatenate year and quarter
df$new_variable = paste(df$year, df$quarter)
# made for this, has additional options around
# ordering of the categories and including unobserved combos
df$new_variable = interaction(df$year, df$quarter)
# for an integer value, 1 to the number of combos
df$new_variable = as.integer(factor(paste(df$year, df$quarter)))

Here are two options:
library(dplyr) # with dplyr; cur_group_id() replaces group_indices(), deprecated since dplyr 1.0.0
df %>% group_by(year, quarter) %>% mutate(new_variable = cur_group_id()) %>% ungroup()
library(data.table) # with data.table
setDT(df)[, new_variable := .GRP, .(year, quarter)]
Data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), year = c(2007L,
2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2009L, 2009L,
2009L, 2009L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2009L, 2009L, 2010L, 2010L), quarter = c(1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
2L, 3L, 2L, 3L)), .Names = c("id", "year", "quarter"), class = "data.frame", row.names = c(NA,
-24L))

1) yearqtr The yearqtr class in the zoo package does this. yearqtr objects have a type of double with the value year + 0 for Q1, year + 1/4 for Q2, etc. When displayed they are shown in a meaningful way; however, they can still be manipulated as if they were plain numbers, e.g. if yq is yearqtr variable then yq + 1 is the same quarter in the next year.
library(zoo)
transform(df, new_variable = as.yearqtr(year + (quarter - 1)/4))
1a) or
transform(df, new_variable = as.yearqtr(paste(year, quarter, sep = "-")))
Either of these give:
id year quarter new_variable
1 1 2007 1 2007 Q1
2 1 2007 2 2007 Q2
3 1 2007 3 2007 Q3
4 1 2007 4 2007 Q4
5 1 2008 1 2008 Q1
... etc ...
2) 220 If you specifically wanted to assign 220 to the first date and have each subsequent quarter increment by 1 then:
transform(df, new_variable = as.numeric(factor(4 * year + quarter)) + 220 - 1)
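The factor-based index in 2) numbers only the quarters that occur in the data, so a gap such as the missing 2010 Q1 is not counted, whereas the expected output in the question jumps from 231 to 233. A minimal sketch of a purely arithmetic index that counts every calendar quarter follows. The first column assumes Stata's convention that yq() counts quarters since 1960q1 (with 1960q1 equal to 0); the base year 2007 and the offset 220 are taken from the question's example, and the column name `stata_yq` is only illustrative.

```r
# A few year/quarter combinations, including the gap at 2010 Q1
df <- data.frame(
  year    = c(2007L, 2007L, 2009L, 2010L, 2010L),
  quarter = c(1L, 2L, 4L, 2L, 3L)
)

# Quarters elapsed since 1960 Q1 (assumed to match Stata's yq())
df$stata_yq <- (df$year - 1960L) * 4L + (df$quarter - 1L)

# Same idea, shifted so that 2007 Q1 maps to 220 as in the question
df$new_variable <- (df$year - 2007L) * 4L + (df$quarter - 1L) + 220L

print(df)
```

Since the value depends only on year and quarter, it is the same for every id and does not change when rows are added or removed.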

Grouping data and then assigning values to variable names stored in strings - R

I am trying to migrate this activity from excel/SQL to R and I am stuck - any help is very much appreciated. Thanks !
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer id - get one row of output. Use variable names stored in column and create columns - for each column assign sum of amount. Create a similar column and assign as 1 or 0 depending on presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))

One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 65539 4000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1","_P",names(result))
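A minimal sketch of the same xtabs idea run on the dput data from the question, naming the presence columns with the _P suffix up front so that no renaming step is needed. The names `sums` and `flags` are illustrative, and the total Amount column is left out here.

```r
# Relevant columns of the dput data from the question
dat <- data.frame(
  Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009", "D_2009",
                    "A_2011", "B_2011", "C_2010", "D_2010", "A_2013",
                    "B_2013", "C_2007", "D_2007"),
  Amount = c(2000L, 100L, 300L, 20L, 299090L, 89778456L, 884L, 34894L,
             389849L, 742L, 25661L, 393L, 23L)
)

# One row per customer, one column per Variable_Name, cells hold summed Amount
sums <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1/0 presence indicators with a _P suffix
flags <- (sums > 0) + 0
colnames(flags) <- paste0(colnames(sums), "_P")

result <- data.frame(Cust_ID = as.integer(rownames(sums)), sums, flags,
                     row.names = NULL)

print(result[, c("Cust_ID", "A_2009", "B_2009", "A_2009_P", "B_2009_P")])
```

Combinations a customer never bought show up as 0 in both the amount and the indicator columns.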
