I'm trying to fit a Cox Proportional Hazard model to analyze the impact of the number of protest events on the survival rates of different political regimes in different countries.
My dataset looks similar to this:
Country year sdate edate time evercollapsed protest GDPgrowth
Country A 2003 1996-11-24 2012-12-31 5881 0 78 14.78
Country A 2004 NA NA NA 0 99 8.56
Country A 2005 NA NA NA 0 25 3.56
Country B 2003 2000-10-26 2011-05-21 3859 1 13 2.33
Country B 2004 NA NA NA 1 28 5.43
Country B 2005 NA NA NA 1 7 1.89
So, basically, my dataset provides yearly information on a number of variables, but the start and end dates of the regime and the survival time (measured in days) are only provided in the first row of each political regime.
My data includes information for 48 different political regimes and 15 of them collapse in the time span I am looking at.
I fitted a Cox PH model with the survival package:
myCPH <- coxph(Surv(time, evercollapsed) ~ protest + GDPgrowth, data = mydata)
This gives me the following result:
Call:
coxph(formula = Surv(time, evercollapsed) ~ protest + GDPgrowth,
data = mydata)
coef exp(coef) se(coef) z p
protest 0.01630 1.01644 0.00722 2.26 0.024
GDPgrowth -0.03447 0.96612 0.01523 -2.26 0.024
Likelihood ratio test=9.26 on 2 df, p=0.00977
n= 48, number of events= 15
(556 observations deleted due to missingness)
So, these results imply that I'm losing 556 country years, because the rows in my data frame do not include the information on the survival time of the regime.
My question now is: how do I include the country-years that do not provide the information on sdate, edate and time in the analysis?
I assume that if I just copied the information into each country-year, this would inflate my number of regime collapses?
I assume I have to give a unique ID to every political regime so that R can distinguish the different cases. Then, how do I fit a Cox PH model that includes the information from the different country-years in the analysis?
Many thanks in advance!
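One common way to use every country-year in a Cox model is the counting-process layout, where each row covers an interval (tstart, tstop] and the event indicator is 1 only in the interval in which the regime actually collapses. The following is a minimal sketch on simulated data, not the original dataset. The names regime_id, tstart, tstop and event are hypothetical, and time is counted in years since the first observed row. For regimes that started before the observation window, tstart would need to be measured from sdate instead.

```r
library(survival)

# Simulated country-year panel (hypothetical values, not the original data).
set.seed(42)
panel <- do.call(rbind, lapply(1:30, function(i) {
  n_years <- sample(3:10, 1)
  data.frame(
    regime_id     = i,                        # hypothetical unique regime ID
    year          = 2002 + seq_len(n_years),
    evercollapsed = as.integer(i %% 3 == 0),  # every third regime collapses
    protest       = rpois(n_years, 30),
    GDPgrowth     = round(rnorm(n_years, 3, 2), 2)
  )
}))

# One interval (tstart, tstop] per regime-year, in years since first observation.
panel <- panel[order(panel$regime_id, panel$year), ]
panel$tstop  <- ave(panel$year, panel$regime_id, FUN = seq_along)
panel$tstart <- panel$tstop - 1

# The event is recorded only in the last interval of a regime that collapsed,
# so copying evercollapsed to every row does not inflate the event count.
is_last <- !duplicated(panel$regime_id, fromLast = TRUE)
panel$event <- as.integer(is_last & panel$evercollapsed == 1)

fit <- coxph(Surv(tstart, tstop, event) ~ protest + GDPgrowth, data = panel)
summary(fit)
```

With this layout the number of events stays equal to the number of collapsed regimes, while protest and GDPgrowth are allowed to change from year to year.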
Related
I have a data set of survey data over 30 years. Each participant in the data frame is given a personal ID. I have two questions regarding this data set.
Firstly, how can I print the number of different personal IDs, i.e., how many different persons are observed?
Secondly, how can I make sure I include every person only once, when making a regression analysis over all observations?
Until now, I have only included one survey year, but this severely limits my sample size. Thus, I want to include observations from a time frame of 5 years.
Here is some basic information and a minimal example for working with panel data in R.
Question 1: Number of different personal IDs (units, people, firms, countries)
# Packages
library(plm)
library(tidyverse)
# Data
data("Produc", package = "plm")
## (1) Tidyverse logic
Produc %>%
  distinct(state) %>%   # one row per unit
  nrow()
[1] 48
## (2) panelr package
library(panelr)
Produc <- panel_data(Produc, id = state, wave = year)
Produc
# Panel data: 816 × 11
# entities: state [48]
# wave variable: year [1970, 1971, 1972, ... (17 waves)]
state year region pcap hwy water util pc gsp emp unemp
<fct> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 ALABAMA 1970 6 15033. 7326. 1656. 6051. 35794. 28418 1010. 4.7
2 ALABAMA 1971 6 15502. 7526. 1721. 6255. 37300. 29375 1022. 5.2
3 ALABAMA 1972 6 15972. 7765. 1765. 6442. 38670. 31303 1072. 4.7
4 ALABAMA 1973 6 16406. 7908. 1742. 6756. 40084. 33430 1136. 3.9
5 ALABAMA 1974 6 16763. 8026. 1735. 7002. 42057. 33749 1170. 5.5
6 ALABAMA 1975 6 17316. 8158. 1752. 7406. 43972. 33604 1155. 7.7
7 ALABAMA 1976 6 17733. 8228. 1800. 7705. 50222. 35764 1207 6.8
8 ALABAMA 1977 6 18112. 8366. 1845. 7901. 51085. 37463 1269. 7.4
9 ALABAMA 1978 6 18480. 8511. 1961. 8009. 52604. 39964 1336. 6.3
10 ALABAMA 1979 6 18881. 8641. 2082. 8159. 54526. 40979 1362 7.1
# … with 806 more rows
Question 2: How can I make sure I include every person only once, when making a regression analysis over all observations?
Either you include all observations, in which case you have a pooled model, or you run a cross-section by subsetting/filtering for a particular point in time.
# pooling means all observations, all units in all time points
model_pooling <- plm(pcap ~ unemp + water + util + pc + gsp,
data=Produc,
effect = "individual", model = "pooling",
index=c("state", "year"))
# cross-section
model_1970 <- plm(pcap ~ unemp + water + util + pc + gsp,
data=Produc %>% filter(year == 1970),
effect = "individual", model = "pooling",
index=c("state", "year"))
library(stargazer)
stargazer(model_pooling, model_1970, type="text",
column.labels = c("All", "Sample 1970"), model.numbers = FALSE)
==================================================================
Dependent variable:
-----------------------------------------------------
pcap
All Sample 1970
------------------------------------------------------------------
unemp 26.116 -223.077
(35.381) (177.538)
water 1.421*** 1.880***
(0.085) (0.198)
util 1.118*** 1.045***
(0.019) (0.068)
pc 0.030*** 0.026*
(0.005) (0.015)
gsp 0.056*** 0.073***
(0.008) (0.024)
Constant 2,057.508*** 2,569.209***
(240.335) (882.888)
------------------------------------------------------------------
Observations 816 48
R2 0.994 0.998
Adjusted R2 0.994 0.997
F Statistic 27,696.970*** (df = 5; 810) 3,696.803*** (df = 5; 42)
==================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
When you have a panel, I strongly recommend using the additional variation that it provides. Please specify your regression model if I misunderstood you. Alternatively, you can filter for the first observation per person (unit), ignoring the time dimension.
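Counting the units and keeping only the first observation per person can also be done in base R. This is a minimal sketch on a made-up survey data frame; the column names pid, year and income are hypothetical.

```r
# Toy survey panel (hypothetical values): several rows per person.
survey <- data.frame(
  pid    = c(1, 1, 2, 3, 3, 3),
  year   = c(2002, 2001, 2001, 2002, 2000, 2001),
  income = c(31, 30, 45, 52, 50, 51)
)

# Number of distinct persons.
n_persons <- length(unique(survey$pid))

# Sort by person and year, then keep the first (earliest) row of each person.
survey    <- survey[order(survey$pid, survey$year), ]
first_obs <- survey[!duplicated(survey$pid), ]
first_obs
```

Note that this throws away all later observations of each person, which is why a panel model is usually the better use of the data.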
I have a dataset (DF) for patients seen at the emergency department of a hospital who were all admitted for heart attacks from the years of 2010-2015 (simplified example of data is below, each row is a new patient and the actual dataset is over 1000 patients)
Patient ID age smoker Overweight YearHeartAttack
0001 34 Y N 2015
0002 44 Y Y 2014
0003 67 N N 2015
0004 75 Y Y 2011
0005 23 N Y 2015
0006 45 Y N 2010
0007 55 Y Y 2013
0008 64 N Y 2012
0009 27 Y N 2012
0010 48 Y Y 2014
0011 65 N N 2010
I'd like to fit a Poisson regression for the number of patients who have had heart attacks in each year using the glm function in R. However, the only way I have found to do this is to use a summary function to count the patients in each year, create a new dataset such as the one below, and then use the glm function:
Count Year
2 2010
1 2011
2 2012
1 2013
2 2014
3 2015
HeartAttackfit <- glm(Count ~ Year, data = CountDF, family = poisson) #poisson model
This method works for a simple Poisson model. However, I plan to take this model a lot further, for example by applying generalized estimating equations with the geeglm function from the geepack package, and that has several issues with data fed in this simplified Count/Year form. Is there any way to create the Poisson model for the number of patients who have had heart attacks each year directly from the DF dataset with the glm function, without summarizing the data to the Count/Year form? Thank you very much in advance.
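As far as I can tell, a Poisson model of yearly counts needs one row per year as its response, so some aggregation is unavoidable, but it can be done in one step from the patient-level data. A minimal sketch using the eleven example patients above (the PatientID column name is written without the space, which is an assumption about how the data were read in):

```r
# Patient-level data from the example (one row per patient).
DF <- data.frame(
  PatientID       = sprintf("%04d", 1:11),
  YearHeartAttack = c(2015, 2014, 2015, 2011, 2015, 2010,
                      2013, 2012, 2012, 2014, 2010)
)

# Count patients per year. Fixing the factor levels keeps years with zero
# patients in the table instead of silently dropping them.
CountDF <- as.data.frame(
  table(Year = factor(DF$YearHeartAttack, levels = 2010:2015)),
  responseName = "Count",
  stringsAsFactors = FALSE
)
CountDF$Year <- as.integer(CountDF$Year)

# Poisson regression of the yearly counts on year.
HeartAttackfit <- glm(Count ~ Year, data = CountDF, family = poisson)
summary(HeartAttackfit)
```

The same table() call accepts further grouping variables (for example smoker), which gives one count per year and group if the later model needs covariates.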
I have to select the countries whose number of data points is in the top 25% of the distribution of the number of data points, using subset() and quantile() together with the %in% operator.
My dataset has this form
head(drugs1)
LOCATION TIME PC_HEALTHXP PC_GDP USD_CAP TOTAL_SPEND
1 AUS 1971 15.992 0.727 35.720 462.11
2 AUS 1972 15.091 0.686 36.056 475.11
3 AUS 1973 15.117 0.681 39.871 533.47
4 AUS 1974 14.771 0.755 47.559 652.65
5 AUS 1975 11.849 0.682 47.561 660.76
6 AUS 1976 10.920 0.630 46.908 658.26
where the first column is the country and the second is the year, so each row is one data point for a country in a given year.
I tried to apply the command
a<-subset(drugs1, quantile(drugs1$TIME, 0.25),1)
but the results are NULL.
Can you help me with this?
Start by figuring out the number of datapoints for each country using table().
n <- table(drugs1$LOCATION)
Find the 75th percentile of the number of datapoints (the cutoff for the top 25%).
q <- quantile(n, .75)
Find the countries that have more than q datapoints.
countries <- names(n)[n > q]
Subset the original data to only include countries in countries.
drugs2 <- subset(drugs1, LOCATION %in% countries)
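To check the steps end to end, here is a small self-contained run on a made-up drugs1 with four countries and unequal numbers of rows. The data are hypothetical; only the column names follow the question.

```r
# Toy data: AUS has 6 rows, BEL 3, CAN 4, DNK 2 (hypothetical).
drugs1 <- data.frame(
  LOCATION = rep(c("AUS", "BEL", "CAN", "DNK"), times = c(6, 3, 4, 2)),
  TIME     = c(1971:1976, 1971:1973, 1971:1974, 1971:1972)
)

n <- table(drugs1$LOCATION)                    # datapoints per country
q <- unname(quantile(as.numeric(n), 0.75))     # 75th percentile of the counts
countries <- names(n)[as.numeric(n) > q]       # countries above the cutoff
drugs2 <- subset(drugs1, LOCATION %in% countries)

q          # 4.5
countries  # "AUS"
```

Using >= instead of > would also keep countries that sit exactly on the cutoff, which matters when many countries share the same number of rows.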
I have a very unbalanced dataset: thousands of healthy participants and 21 patients (16 male and 5 female). I want to use bootstrapping to define a new sample of healthy participants, but controlling for age and gender.
This is the method I'm using:
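library(boot)

# samplemean and samplesd are boot() statistic functions defined elsewhere (not shown here)
parametric_bootstrap_boot <- function(x) {
  # Perform bootstrap using the boot package
  # Estimate mean ($t0 is the statistic evaluated on the original data)
  mu <- boot(x, samplemean, R = 1000)$t0
  # Estimate sd
  sd <- boot(x, samplesd, R = 1000)$t0
  # Sample 21 observations
  set.seed(1)
  samples <- rnorm(21, mu, sd)
  return(samples)
}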
How can I control for age and gender when resampling the healthy participants?
My data looks like this:
Patient ID Age Mean_RR SDNN RMSSD nn50 pnn50 SEX Year of birth.0.0 Date of all cause dementia report.0.0 Source of all cause dementia report.0.0 Date of alzheimer's disease report.0.0 Source of alzheimer's disease report.0.0 Date of vascular dementia report.0.0 Source of vascular dementia report.0.0
1.53E+09 56 1257 397.34 468 2 33.33 Female 1961 NA NA NA NA NA NA
1.53E+09 56 1257 397.34 468 2 33.33 Female 1961 NA NA NA NA NA NA
This is how I call the function:
control_BPM <- abs(parametric_bootstrap_boot(control_BPM))
control_SDNN <- abs(parametric_bootstrap_boot(control_SDNN))
control_RMSSD <- abs(parametric_bootstrap_boot(control_RMSSD))
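One possible way to control for age and gender is to replace the parametric draw with matched resampling: for each patient, draw one healthy participant of the same sex whose age lies within a tolerance. This is a different technique from the rnorm-based method above, shown only as a sketch on simulated data. The function match_controls, the argument age_tol and the id column are hypothetical.

```r
set.seed(1)

# Simulated healthy pool and patient group (hypothetical values).
healthy <- data.frame(
  id   = 1:2000,
  Age  = sample(40:70, 2000, replace = TRUE),
  SEX  = sample(c("Male", "Female"), 2000, replace = TRUE),
  SDNN = rnorm(2000, mean = 50, sd = 15)
)
patients <- data.frame(
  Age = sample(50:70, 21, replace = TRUE),
  SEX = rep(c("Male", "Female"), times = c(16, 5))
)

# For each patient, draw one unused healthy participant with the same sex
# and an age within +/- age_tol years.
match_controls <- function(patients, healthy, age_tol = 2) {
  picked <- integer(0)
  for (i in seq_len(nrow(patients))) {
    pool <- healthy$id[healthy$SEX == patients$SEX[i] &
                         abs(healthy$Age - patients$Age[i]) <= age_tol &
                         !(healthy$id %in% picked)]
    if (length(pool) == 0) stop("no eligible control for patient ", i)
    picked <- c(picked, pool[sample.int(length(pool), 1)])
  }
  healthy[match(picked, healthy$id), ]
}

controls <- match_controls(patients, healthy)
table(controls$SEX)
```

Calling match_controls repeatedly with different seeds gives a set of age- and sex-matched control samples, which can then be compared with the 21 patients on SDNN, RMSSD and so on.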
I have a dataset containing the daily rate of return for every industry (in total 10 industries) per country (in total 16 countries) from 1975 to 2018. Now I need to run cross sectional regressions per day and per week and save the coefficients in a separate dataset.
I tried the following code. But the estimates are the same for every day.
fitted_models = Data %>%
group_by(Data$Date) %>%
do(model = lm(Data$RoR ~ Data$Country + Data$Industry, data=Data))
fitted_models$model
I need to include the following contrasts:
contrasts(All0$Country) <- contr.sum(16, contrasts=TRUE)
contrasts(All0$Industry) <- contr.sum(10, contrasts=TRUE)
but I get the following error message then
Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning messages:
1: contrasts dropped from factor Country due to missing levels
2: contrasts dropped from factor Industry due to missing levels
This is a sample of my data. The earliest dates are NA, but later dates have values for RoR.
Country Date Industry RoR
<chr> <date> <chr> <dbl>
1 Finland 1975-01-01 Basic Mats NA
2 Austria 1975-01-01 Basic Mats NA
3 Spain 1975-01-01 Basic Mats NA
4 United Kingdom 1975-01-01 Basic Mats NA
5 Norway 1975-01-01 Basic Mats NA
6 Germany 1975-01-01 Basic Mats NA
7 France 1975-01-01 Basic Mats NA
8 Italy 1975-01-01 Basic Mats NA
9 Portugal 1975-01-01 Basic Mats NA
10 Switzerland 1975-01-01 Basic Mats NA
Using the data.table package to do group-wise operations might be a good way to approach this. I'm using mtcars as an example data set since you haven't provided one, but the approach would be the same with your data. Here, I use cyl as the grouping column, but in your case it would be Date.
library(data.table)
DT <- as.data.table(mtcars)
DT[,as.list(lm(mpg ~ wt+qsec)$coefficients), by = .(cyl)]
# cyl (Intercept) wt qsec
# 1: 6 25.46173 -5.201906 0.5838640
# 2: 4 24.88427 -7.513576 0.9903892
# 3: 8 14.02093 -2.813754 0.7352592
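The same idea can be written in base R for the Date / Country / Industry layout from the question. The estimates in the question were identical for every day most likely because Data$RoR inside lm() always refers to the full data frame rather than the current group. The sketch below uses made-up data; fit_one_day is a hypothetical helper, and skipping days with fewer than two observed countries or industries is an assumption about how the contrasts error should be handled.

```r
# Toy panel: 3 countries x 2 industries x 4 days (hypothetical values).
grid <- expand.grid(
  Country  = c("Finland", "Austria", "Spain"),
  Industry = c("Basic Mats", "Financials"),
  day      = 0:3
)
Data <- data.frame(
  Country  = factor(grid$Country),
  Industry = factor(grid$Industry),
  Date     = as.Date("2018-01-01") + grid$day
)
set.seed(1)
Data$RoR <- rnorm(nrow(Data))
Data$RoR[Data$Date == as.Date("2018-01-01")] <- NA  # a day without returns

# Sum-to-zero contrasts, set once on the full factors.
contrasts(Data$Country)  <- contr.sum(nlevels(Data$Country))
contrasts(Data$Industry) <- contr.sum(nlevels(Data$Industry))

# Fit one cross-sectional regression on the rows of a single day.
fit_one_day <- function(d) {
  d <- d[!is.na(d$RoR), ]
  # Skip days on which a factor has fewer than 2 observed levels.
  if (length(unique(d$Country)) < 2 || length(unique(d$Industry)) < 2) {
    return(NULL)
  }
  coef(lm(RoR ~ Country + Industry, data = d))
}

coef_mat <- do.call(rbind, lapply(split(Data, Data$Date), fit_one_day))
coefs <- data.frame(Date = as.Date(rownames(coef_mat)), coef_mat)
rownames(coefs) <- NULL
coefs
```

For weekly regressions the same pattern should work with a week key, such as format(Data$Date, "%Y-%U"), as the splitting factor instead of Data$Date.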