Propensity Score Matching with panel data - r

I am trying to use MatchIt to perform propensity score matching (PSM) on my panel data, which contains multi-year observations from the same group of companies.
The data describes a list of bonds and the financial data of their issuers, along with the bond terms, such as issue date, coupon rate, maturity, and bond type. For instance:
Firmnames       Year  ROA  Bond_type
AAPL US Equity  2015  0.3  0
AAPL US Equity  2015  0.3  1
AAPL US Equity  2016  0.3  0
AAPL US Equity  2017  0.3  0
C US Equity     2015  0.3  0
C US Equity     2016  0.3  0
C US Equity     2017  0.3  0
...
I already know how to match observations on the criteria I want, and I use exact = "Year" to make sure I match observations from the same year. The problem I am facing now is that observations from the same company get matched together, which is not what I want. The code I used:
matchit(Bond_type ~ Year + Amount_Issued + Cpn + Total_Assets_bf + AssetsEquityRatio_bf + Asset_Turnover_bf, data = rdata, method = "nearest", distance = "glm", exact = "Year")
However, as you can see in the second row of my sample, there can be two observations from the same company in one year due to the nature of my study (a company can issue bonds more than once a year). The only difference between them is the Bond_type. The matchit() function will therefore, of course, treat these two observations as the best treatment-control pair and match them together, since they share the same ROA and all the other matching factors in that year.
I can see two ways to solve this:
Remove the duplicate observations from the same year and company; however, removing observations might bias the results and ruin the study.
Prevent the matchit() function from matching observations from the same company (i.e., with the same Firmnames).
The second approach would be better since it does not introduce bias, but I don't know whether it is possible within the matchit() function. I hope someone can give me advice on this, or if there is a better solution to this problem, please be so kind as to share it. Thanks in advance!
Note: if there is any further information I should provide, please just let me know. This is my first time asking a question here!

This is not possible with MatchIt at the moment (though it's an interesting idea and not hard to implement, so I may add it as a feature).
In the optmatch package, which performs optimal pair and full matching, there is a constraint that can be added called "anti-exact matching", which sounds like exactly what you want: units with the same value of the anti-exact matching variable will not be matched to each other. This can be implemented using optmatch::antiExactMatch().
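A rough, untested sketch of how this might look with your variables (how the distance pieces combine can vary across optmatch versions, so check the documentation):
library(optmatch)

# Propensity score distance from a logistic model, restricted to
# pairs from the same Year (analogous to exact = "Year" in MatchIt)
ps_dist <- match_on(
  Bond_type ~ Amount_Issued + Cpn + Total_Assets_bf +
    AssetsEquityRatio_bf + Asset_Turnover_bf,
  data = rdata,
  within = exactMatch(Bond_type ~ Year, data = rdata)
)

# Anti-exact constraint: pairs from the same firm get an infinite
# distance and therefore can never be matched to each other
dist <- ps_dist + antiExactMatch(rdata$Firmnames, z = rdata$Bond_type)

pairs <- pairmatch(dist, data = rdata)
summary(pairs)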
In the Matching package, which performs nearest neighbor and genetic matching, the restrict argument can be supplied to the matching function to forbid certain matches. You could manually create the restriction matrix by flagging all pairs of observations from the same company and then supply the matrix to Match().
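A sketch of constructing such a matrix with your variable names; if I recall the Matching documentation correctly, each row of restrict is (observation index, observation index, flag), with a negative flag forbidding that pair:
library(Matching)

# Estimate the propensity score with a logistic regression
ps <- glm(Bond_type ~ Amount_Issued + Cpn + Total_Assets_bf +
            AssetsEquityRatio_bf + Asset_Turnover_bf,
          family = binomial, data = rdata)$fitted.values

# Enumerate all treated/control pairs from the same firm and flag
# them with -1 to forbid those matches
treated <- which(rdata$Bond_type == 1)
control <- which(rdata$Bond_type == 0)
grid <- expand.grid(tr = treated, co = control)
same_firm <- rdata$Firmnames[grid$tr] == rdata$Firmnames[grid$co]
restrict_mat <- cbind(grid$tr[same_firm], grid$co[same_firm], -1)

# Nearest neighbor matching on the propensity score, exact on Year
m <- Match(Tr = rdata$Bond_type, X = cbind(ps, rdata$Year),
           exact = c(FALSE, TRUE), restrict = restrict_mat)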

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset contains an id variable that identifies a person and the date when his or her unemployment benefits start.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is represented by a dummy variable, which is equal to one if someone built up unemployment benefit rights in a particular year (i.e. if someone worked), and zero otherwise.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information in the two datasets: person R005 worked in 2010 and 2011. In 2012 this person filed for unemployment insurance, and thereafter worked again in 2013 and 2014 (we see this information in dataset df2). When the unemployment spell started in 2012, the entitlement was based on the work history before becoming unemployed; hence, the work history equals 2. In a similar vein, the employment histories for R006 and R007 are 3 and 5, respectively (for R007 we assume he worked throughout 2014, as he only filed for unemployment benefits in December of that year; therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets efficiently so that I get the following table:
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using aggregate, but in that case I also include work history after the year someone filed for unemployment benefits, which is something I do not want. Does anyone have an efficient way to combine the information from the two datasets above and calculate the employment history?
I appreciate any help.
base R
You can use Reduce with accumulate = TRUE to flag every year from the first gap onwards and count the years before it:
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
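To see why this works, here is the inner expression evaluated for R005, whose work pattern across 2010-2014 is 1, 1, 0, 1, 1:
x <- c(1, 1, 0, 1, 1)
Reduce(any, x == 0, accumulate = TRUE)        # FALSE FALSE TRUE TRUE TRUE
sum(!Reduce(any, x == 0, accumulate = TRUE))  # 2 = years worked before the first gap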
dplyr
Or reshape with tidyr::pivot_longer and use the built-in dplyr::cumany function:
library(dplyr)
library(tidyr)

df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Using calendar adjustment when forecasting

I am reading the online textbook "Forecasting: Principles and Practice" by Rob J. Hyndman and George Athanasopoulos, which has examples in R code.
A section on calendar adjustments explains that if we "look at average daily production instead of average monthly production, we effectively remove the variation due to the different month lengths. Simpler patterns are usually easier to model and lead to more accurate forecasts."
The example converts the monthly series to average daily values, with before-and-after plots (monthly production vs. average daily production).
I don't understand the second line of the given example code:
monthdays <- rep(c(31,28,31,30,31,30,31,31,30,31,30,31),14)
monthdays[26 + (4*12)*(0:2)] <- 29
par(mfrow=c(2,1))
plot(milk, main="Monthly milk production per cow",
     ylab="Pounds", xlab="Years")
plot(milk/monthdays, main="Average milk production per cow per day",
     ylab="Pounds", xlab="Years")
I understand that the first line creates a vector of the number of days in each month, repeated 14 times because the data set covers 14 years. But I have no idea what the second line is doing, or where those numbers and calculations come from.
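For reference, the index expression evaluates as follows; assuming the 14-year series starts in January 1962 (as the book's milk data does), positions 26, 74 and 122 are the Februaries of the leap years 1964, 1968 and 1972, which are reset to 29 days:
26 + (4*12)*(0:2)
# [1]  26  74 122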

Table of average score of peer per percentile

I'm quite a newbie in R, so I'm interested in the optimality of my solution. Even though it works, it might be a bit long, and I'd like your advice on whether the way I solved it is the best one; it could help me learn new techniques and functions in R.
I have a dataset of students identified by their id, along with the school they are matched to and the score they obtained on a specific test (so, in short, 3 variables: id, match and score).
I need to construct the following table: for students between two percentiles of score, I need to calculate the average (across students) of the average score of the school they are matched to. That is, for each school I take the average score of the students matched to it, and then I average these school averages within percentile classes (yes, a school's average can appear twice in this calculation). In plain English, it answers: "A student belonging to the x-th percentile in terms of score will on average be matched to a school of this average quality."
Here is an example with five students (scores 18, 4, 15, 8, 24) matched to schools a, b, a, b, c:
So in that case, if I split at the median (15) rather than at percentiles, I would like to obtain:
[0,15]: 9.5
(15,24]: 20.25
That is, for students with a score between 0 and 15, I take the average of the average scores of the schools they are matched to (note that b's average appears twice, but that's fine).
Here is how I did it:
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
scoreQuant <- cut(score, quantile(score, probs = seq(0,1,0.1), na.rm = TRUE))
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgScore <- 0
for (i in 1:length(score)) {
  AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is the for loop; maybe apply() would be better, but I'm not sure how to use it here (I tried to write my own function, but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in other languages. The tapply function returns a vector named by the levels of the factor you grouped by; we are using those names, via match, to subset AvgeSchScore.
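To see the named subsetting in action with the example data:
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)
AvgeSchScore
#    a    b    c
# 16.5  6.0 24.0

# Indexing the named vector by the match vector expands the school
# means back to one value per student:
AvgeSchScore[match]
#    a    b    a    b    c
# 16.5  6.0 16.5  6.0 24.0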
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the NA row bothers you (the minimum score falls outside the lowest break because cut() intervals are open on the left), you can delete it afterwards.

HoltWinters initial values not matching with Rob Hyndman's theory

I am following this tutorial by Rob Hyndman for initialization (additive).
The tutorial specifies the steps to calculate the initial values. I ran those steps manually (with pen and paper) on the data set provided in Rob Hyndman's free online textbook. However, when I used the same data set in R, the seasonal values in R's output were drastically different from the values I obtained after the first two steps.
Not sure what I am doing wrong. Any help would be appreciated.
Another interesting thing I have observed just now: the initial level in the textbook is 33.8, but in the R output it is 48.24, which proves that I am missing something in my manual calculation.
EDIT:
Here is how I am calculating the moving-average smooth (based on the formula used in Section 2 of this link).
After calculating it, I de-trended the series, i.e. original value minus smoothed value.
Then the seasonal values, which are:
S1 = average of the detrended values for Q1
S2 = average of the detrended values for Q2
...
The first two values of your moving average are incorrect. You have assumed that the values prior to the first observation are zero. They are not zero, they are missing, which is quite different. It is impossible to compute the moving average for the first two observations for this reason.
The third and subsequent values of your moving average are only approximately correct because you have rounded the data to one decimal place instead of using the data as provided in the fpp package in R.
The values obtained following this procedure are used as initial values in the optimization within ets(). So the output from ets() will not contain the initial values but the optimized values. The table in the book gives the optimized values. You will not be able to reproduce them using a simple procedure.
However, you can reproduce what is provided by HoltWinters because it does not do any optimization of initial values. Using HoltWinters, the initial seasonal values are given as:
> HoltWinters(y)$fitted[1:4,]
xhat level trend season
[1,] 43.73934 33.21330 1.207739 9.318302
[2,] 28.25863 35.65614 1.376490 -8.774002
[3,] 36.86581 37.57569 1.450688 -2.160566
[4,] 41.87604 38.83521 1.424568 1.616267
(The output in coefficients gives the final states not the initial states.)
The seasonal indices in the last column can be computed as follows:
y MAsmooth detrend detrend.adj
41.72746 NA NA NA
24.04185 NA NA NA
32.32810 34.41724 -2.089139 -2.160566
37.32871 35.64101 1.687695 1.616267
46.21315 36.82342 9.389730 9.318302
29.34633 38.04890 -8.702575 -8.774002
36.48291 NA NA NA
42.97772 NA NA NA
The last column is the adjusted detrended data (so they add to zero).
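For completeness, here is a short sketch that reproduces the table, assuming quarterly data and the usual centered 2x4 moving average for the smooth (which matches the MAsmooth column above):
y <- c(41.72746, 24.04185, 32.32810, 37.32871,
       46.21315, 29.34633, 36.48291, 42.97772)

# Centered 2x4 moving average; the ends are NA because the window
# extends past the available data (missing, not zero)
MAsmooth <- stats::filter(y, c(1, 2, 2, 2, 1) / 8, sides = 2)

# Detrend, then adjust so the seasonal indices add to zero
detrend <- y - MAsmooth
detrend.adj <- detrend - mean(detrend, na.rm = TRUE)
round(cbind(y, MAsmooth, detrend, detrend.adj), 5)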

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today? Should it be the tenure up until today, or should it be NA?
First I shall say I disagree with the previous answer: for a subscription still active today, the time should not be treated simply as the tenure up until today, nor as NA. What do we know exactly about those subscriptions? We know they have survived up until today; although we don't know exactly how long their lifetimes will be, we do know they are longer than the tenure observed so far.
This is a situation known as right-censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id  t1  t2  status (3 = interval censored)
1    2   2  3
2    3  NA  3
3    1   1  3
Then you will need to change your Surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, type = "interval2")
(For type = "interval2", the censoring pattern is inferred from the interval itself, with NA in t2 meaning right-censored, so no event argument is needed.)
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of computational details can be found: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
From the Surv documentation: "Interval censored data can be represented in two ways. For the first use type = "interval" and the codes shown above. In that usage the value of the time2 argument is ignored unless event = 3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t, t) for exact and (t1, t2) for an interval. This is the approach used for type = "interval2", with NA taking the place of infinity. It has proven to be the more useful."
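A minimal sketch of this encoding for the three subscriptions above (the 3 months of tenure for the still-active subscription is a hypothetical value, measured up to the data-collection date):
library(survival)

t1 <- c(2, 3, 1)   # lower bound of the interval (months)
t2 <- c(2, NA, 1)  # upper bound; NA = right censored (infinity)

# (t, t) encodes an exact event; (t, NA) a right-censored observation
Surv(t1, t2, type = "interval2")
# [1] 2  3+ 1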
If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring date.
NA won't work in the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about survival.
SQL code to get the time until the event (use in the SELECT part of the query):
DATEDIFF(m, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis; it does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct: the status of 0 for id 2 indicates it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
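Putting that together, a minimal sketch of the right-censoring approach (the data-collection date of 2013-10-01 is a hypothetical placeholder):
library(survival)

subscriptions <- data.frame(
  id = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)
collect_date <- as.Date("2013-10-01")

# Status: 1 = cancelled (event observed), 0 = still active (censored)
subscriptions$status <- ifelse(is.na(subscriptions$end_date), 0, 1)

# For active subscriptions, measure tenure up to the collection date
stop_date <- subscriptions$end_date
stop_date[is.na(stop_date)] <- collect_date
subscriptions$tenure_in_days <- as.numeric(stop_date - subscriptions$start_date)

fit <- survfit(Surv(tenure_in_days, status) ~ 1, data = subscriptions)
plot(fit)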
