Panel Data including Subgroups or Pooled OLS - r

I am analyzing a dataset that is separated by country, and also by age group and gender cohort; five annual periods are included. An intervention took place between the years.
As the data is sparse, I want to compare the effects for each subgroup so that I may reach statistical significance after all (e.g. females aged 10-20 in both countries; I will call this subgroup variable ID, and each ID value exists only once per country).
I have tried a panel analysis with the plm package, indexing by country, year, and ID, but this does not work because the index is not unique.
Is it even possible to include country effects but have subgroups of the country? (see code below)
I have also tried difference-in-differences, using lmList and saving the coefficients, fitted separately for each subgroup as defined by the IDs. (see code below)
This has worked, but because of the limited number of periods no statistical significance is reached, even though the coefficients all point in the same direction. So I wonder whether it is possible to combine those models again and thereby reach reliable results.
1. fixed <- plm(FE ~ x , data=df, index=c("ID","country", "year"), model="within")
2. list <- coef(lmList(y~ treated + time + did | ID, data=df))
Error from 1.
duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
duplicate couples (id-time) in resulting pdata.frame
to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
2: In is.pbalanced.default(index[[1]], index[[2]]) :
duplicate couples (id-time)
For 2.
I do get a data frame that contains all the coefficients, but are there any ideas on how I could properly summarize or display them? Just taking the mean of a coefficient seems a bit crude.
Any help is highly appreciated.

I address the first issue (just coding). plm requires a panel structure of index=c("individual", "time"). But you can define a new ID for whatever your unit of observation is. Here you can combine the ID and country variables from your identifier with base R's transform() (dplyr is not needed):
df <- transform(df, GID = paste(ID, country, sep = "_"))  # the separator avoids ambiguous IDs such as 1+12 vs 11+2
library(plm)
summary(plm(y ~ x, index = c("GID", "year"), data = df, model = "within"))
In general, you can define any other kind of observational group in the same way. Is your "ID" numeric or a string? You should add a more detailed data description or give some example data.
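Before calling plm, it may help to confirm that the unit identifier and the year really do identify each row uniquely. The following is a minimal base R sketch on made-up data; the toy values and the GID name are illustrative assumptions, not the asker's actual data.

```r
# Toy data (hypothetical): 2 countries x 2 subgroup IDs x 5 years
df <- expand.grid(country = c("A", "B"), ID = 1:2, year = 2001:2005,
                  stringsAsFactors = FALSE)

# ID alone is not unique per year, because each ID occurs in both countries
dup_before <- anyDuplicated(df[, c("ID", "year")]) > 0

# Combine ID and country into a single unit identifier
df$GID <- paste(df$ID, df$country, sep = "_")

# Check again: is any (GID, year) pair repeated?
dup_after <- anyDuplicated(df[, c("GID", "year")]) > 0

print(c(before = dup_before, after = dup_after))
```

If the second check still reports duplicates on real data, some other grouping variable is probably missing from the identifier.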

Related

How does the stratum function work in the clusrank package in R?

I'm working with the clusrank package in R to analyse insect abundance data, by using the clusWilcox.test function for clustered data. As far as I understand, this package allows you to add both a 'cluster' and a 'stratum' function when using the rgl method to cluster by multiple factors.
When I add a single factor as either only a cluster or only a stratum term in my code, the Z- and p-values are the same for both versions, which seems to indicate that the stratum function works. However, when I take the first factor as a cluster and add a second, different one as a stratum, the output is still identical to the cluster-only model. This makes me think only the cluster is taken into account, and the stratum function is ignored.
This problem should be reproducible by making a random test dataset (in this example called df) with four columns: the dependent variable (in my case 'abundance'), the grouping factor of which I want to know the effect (in my case 'treatment'), and two factors to add as cluster/stratum, let's call them 'factorA' and 'factorB'. In my own test dataset the factors have 2 levels each, in my real dataset 6 levels each, and the problem arises in both datasets.
My code is then as follows:
clusWilcox.test(abundance ~ treatment + cluster(factorA), data = df, method = "rgl")
This gives the same Z- and p-values as adding factorA as a stratum; the only difference is that the number of clusters is now the number of rows in the test dataset instead of the number of factor levels.
clusWilcox.test(abundance ~ treatment + stratum(factorA), data = df, method = "rgl")
And both give exactly the same Z- and p-values as:
clusWilcox.test(abundance ~ treatment + cluster(factorA) + stratum(factorB), data = df, method = "rgl")
Which makes me think that the stratum function is ignored in this third line of code. If you switch factorA and factorB, the same problem arises, though with different output values, as the calculation is now based on factorB instead of factorA.
Does anyone know what happens here? Is my code wrong, or is the stratum function indeed not taken into account?
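A test dataset of the kind described above could be generated roughly as follows. This is only a sketch in base R: the sample size, the Poisson counts, the 0/1 coding of treatment and the fully crossed layout of factorA and factorB are all assumptions, since the question does not specify them.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
n <- 40

df <- data.frame(
  abundance = rpois(n, lambda = 10),               # made-up counts
  treatment = rep(c(0, 1), each = n / 2),          # two treatment groups
  factorA   = rep(1:2, times = n / 2),             # first 2-level factor
  factorB   = rep(rep(1:2, each = 2), times = n / 4)  # second 2-level factor
)

# Cross-tabulate the two factors to confirm the layout is balanced
print(table(factorA = df$factorA, factorB = df$factorB))
```

The three clusWilcox.test calls from the question can then be run on this df to compare their output.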

Problems in pairwise comparisons in nested factors

This is a dataset where week is nested in period, which gets problematic when I want to see pairwise comparisons between Diet and week. What does the error "Try taking nested factors out of 'by'." mean?
form <- as.formula(paste(colnames(df)[8],'~ Diet + period +week*Diet +(1|id)')) #get data for interactions
dflmer <- lmer(form, data=df)
a <- Anova(dflmer, type=3)
library(emmeans)
emm <- emmeans(dflmer, pairwise ~ Diet | week)
NOTE: A nesting structure was detected in the fitted model:
week %in% period
Note: Grouping factor(s) for 'week' have been added to the 'by' list.
Error in .nested_contrast(rgobj = object, method = method, by = by, adjust = adjust, :
There are no factor levels left to contrast. Try taking nested factors out of 'by'.
Since week is nested in period, you can't condition on week without also conditioning on period. Try
emmeans(dflmer , pairwise ~ Diet | period:week)
The very latest version of emmeans, 1.4.6, fixes this; older versions did not consider the possibility of nesting in by variables.
Addendum
I think I'm remembering some details wrong. The code that generates this error message was misplaced in versions <= 1.4.5. I think you may need to install version 1.4.6 to get this to work. See the related issue report
Addendum 2
I constructed a similar example, and I got errors from this model still. The problem is that week is nested in period, and the model has Diet crossed with week but not with period, which doesn't make sense. I was able to get results after I fitted the model with fixed-effect terms Diet*(period + week)
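To see the nesting that emmeans detects, one can cross-tabulate the two factors. Below is a small base R sketch with a hypothetical layout (weeks 1-3 in period 1, weeks 4-6 in period 2); the real data may be arranged differently.

```r
# Hypothetical layout: 4 subjects, 6 weeks, weeks 1-3 in period 1 and 4-6 in period 2
df <- expand.grid(id = 1:4, week = 1:6)
df$period <- ifelse(df$week <= 3, 1, 2)

# In a nested design every week occurs in exactly one period,
# so each week column of the table has a single non-zero cell
tab <- table(period = df$period, week = df$week)
print(tab)

week_is_nested <- all(colSums(tab > 0) == 1)
print(week_is_nested)
```

When this check is TRUE, conditioning on week implicitly fixes period as well, which is why the by list has to carry both.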

Count-process datasets for Non-proportional Hazard (Cox) models with interaction variables

I am trying to run a non-proportional Cox regression model featuring an interaction-with-time variable, as described in Chapter 15 (section 15.3) of Applied Longitudinal Data Analysis by Singer and Willett. However, I cannot seem to get answers that agree with the book.
The data used in this book and the source code are supplied at this fantastic website. Unfortunately no R code is supplied for the final chapter, and the R dataset supplied for the example discussed in-text is incomplete and gives incorrect answers for the simplest model (which I do know how to run). Instead, to obtain the complete dataset for this example, one must click the 'Download' link in the 'SAS' column (which has the correct dataset) and then, after installing the haven package (which allows one to read in foreign data formats), read in the dataset in question via:
los <- haven::read_sas("alda/lengthofstay.sas7bdat")
This dataset indicates participants' (variable ID) length of stay (variable DAYS) in inpatient treatment in a hospital. The censoring variable is CENSOR. The researchers hypothesised that two different types of treatment (binary variable TREAT) would predict differential values of hazard of checking out of treatment. In addition they anticipated that the between-group difference in hazard would not be constant over time, therefore requiring the creation of an interaction term. I can get the simple main effect model to work, returning the same hazard coefficients reported in the book (which is how I eventually found out the .csv file supplied with the R code was incomplete).
summary(modA <- coxph(Surv(DAYS,1-CENSOR) ~ TREAT, data = los))
        coef exp(coef) se(coef)     z Pr(>|z|)
TREAT 0.1457    1.1568   0.1541 0.945    0.345
I tried to follow the procedure laid out here, and here, and the sources listed therein (e.g. Therneau vignette on time-varying covariates in the survival package), and, of course, when I am copy-pasting someone else's code and running that it all works fine. But I am trying to do this for myself from scratch with a dataset whose results I can compare against mine. And I just can't make it work.
First I created an EVENT variable:
los$EVENT <- 1 - los$CENSOR
There is a duplicate ID number in the dataset that causes issues, so we have to change it to a new ID number:
los$ID[which(duplicated(los$ID))] <- 842
Now, based on what I read here and here the dataframe needs to be split so that, for every participant, there is one row indicating the EVENT status at every point prior to their event (or censorship) time when any other participant experienced an event. Therefore we need to create a vector of all the unique event times, then split the dataset on those event times
cutPoints <- sort(unique(los$DAYS[los$EVENT == 1]))
# now split the dataset
longLOS <- survSplit(Surv(DAYS,EVENT)~ ., data = los, cut = cutPoints)
# and (just because I'm anal) rename the interval upper bound column (formerly "DAYS")
names(longLOS)[5] <- "tstop"
When I looked at this dataset it appeared to be what I was after, with (1) as many rows for each participant as there are intervals prior to their event time when anyone else in the dataset experienced an event, (2) two columns indicating the lower and upper bounds of each interval, and (3) an event column with a 0 for all rows in which the respondent did not experience the event, and a 1 in the final row if they did experience the event (it stays 0 if they were censored).
Next I created the interaction-with-time variable, subtracting 1 from the 'interval upper bound' column so that main effect of TREAT represents the treatment effect on the first day of hospitalisation.
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
And ran the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
But it doesn't work! I got the (fairly unhelpful) error message
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
routine failed due to numeric overflow.This should never happen. Please contact the author.
What am I doing wrong? I have been slowly working through Singer and Willett for almost three years (I started while still a grad student), and now the final chapter is proving to be by far my greatest challenge. I have thirty pages to go; any help would be incredibly appreciated.
I figured out what I was doing wrong: a stupid error when I created the interaction variable TREATINT. I multiplied by EVENT instead of TREAT. Instead of
longLOS$TREATINT <- longLOS$EVENT*(longLOS$tstop - 1)
it should have been
longLOS$TREATINT <- longLOS$TREAT*(longLOS$tstop - 1)
Now when you run the model
summary(modB <- coxph(Surv(tstart, tstop, EVENT) ~ TREAT + TREATINT, data = longLOS))
Not only does it work, it yields coefficients that match those reported in the Singer and Willett book.
              coef exp(coef)  se(coef)      z Pr(>|z|)
TREAT     0.706411  2.026705  0.292404  2.416   0.0157
TREATINT -0.020833  0.979383  0.009207 -2.263   0.0237
Given how dumb my mistake was I was tempted to just delete this whole post but I think I'll leave it up for others like me who want to know how to do interaction with time Cox models in R.
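For readers without the book's dataset, the same workflow can be sketched on the veteran data that ships with the survival package. This is only an illustration under assumptions: the veteran data stand in for the length-of-stay data, and the names vet, treat, treat_time and long_vet are made up here.

```r
library(survival)  # recommended package that ships with R

# veteran: time = survival time, status = 1 for death, trt = 1 (standard) or 2 (test)
vet <- veteran[, c("time", "status", "trt")]
vet$treat <- as.numeric(vet$trt == 2)  # 0/1 treatment indicator

# Split each subject's follow-up at every observed event time
cut_points <- sort(unique(vet$time[vet$status == 1]))
long_vet <- survSplit(Surv(time, status) ~ ., data = vet, cut = cut_points)

# Interaction of the treatment indicator (not the event indicator) with time;
# subtracting 1 makes the main effect refer to day 1
long_vet$treat_time <- long_vet$treat * (long_vet$time - 1)

# Counting-process Cox model: (tstart, time] intervals with the event flag
fit <- coxph(Surv(tstart, time, status) ~ treat + treat_time, data = long_vet)
print(coef(fit))
```

Here survSplit adds the tstart column and keeps the original time column as the upper bound of each interval, so no renaming is needed.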

residuals() function error: replacement has x rows and data has y rows

I have a data set that has reading time for each word that numerous individuals read.
I am trying to calculate reading time residuals for each individual in my data. Word lengths and the order of presentation (of a particular word) are factors in calculating a regression for each individual.
The reading time was log-transformed (logRT) and word lengths were calculated by nchar(). The order of presentation is also log-transformed.
model1<-lmer(logRT~wlen+log(order)+(1|subject), data=mydata)
Then, I try to get a residual column for every data point by doing the following,
mydata$logResid<-residuals(model1)
Then, I get this error.
Error in `$<-.data.frame`(`*tmp*`, "LogResid", value = c(0.145113408056189, :
replacement has 30509 rows, data has 30800
Does anyone have any advice? I am totally confused, all the more so because this is an analysis I've been doing every day with no such error so far.
I would say you should try
model1 <- lmer(logRT~wlen+log(order)+(1|subject), data=mydata,
na.action=na.exclude)
and see if that helps; it should fill in NA values in the appropriate places.
From ?na.exclude:
... when ‘na.exclude’ is used the residuals and
predictions are padded to the correct length by inserting ‘NA’s
for cases omitted by ‘na.exclude’.
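The padding behaviour can be seen in a tiny example. The sketch below uses lm from base R instead of lmer so that it runs without extra packages (that substitution is an assumption made for illustration), and the data are made up.

```r
# Made-up data with one missing response value
d <- data.frame(y = c(1.2, 2.1, NA, 4.3, 5.1), x = 1:5)

# Default handling (na.omit): the incomplete row is dropped before fitting
fit_omit <- lm(y ~ x, data = d)

# na.exclude: the row is still left out of the fit, but residuals are padded with NA
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

n_omit <- length(residuals(fit_omit))
n_excl <- length(residuals(fit_excl))
print(c(omit = n_omit, exclude = n_excl, rows = nrow(d)))

# The padded residuals can be assigned back as a column without a length error
d$resid <- residuals(fit_excl)
```

The difference between the two lengths corresponds to the rows that contain missing values, which is what the "replacement has 30509 rows, data has 30800" message points to.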

Model with Matched pairs and repeated measures

I will delete if this is too loosely programming but my search has turned up NULL so I'm hoping someone can help.
I have a case/control matched-pairs design with repeated measurements, and I am looking for a model/function/package in R.
I have 2 measures at time=1 and 2 measures at time=2. I have Case/Control status as Group (2 levels) and the matched-pairs ID as match_id, and I want to estimate the effect of Group, time and their interaction on speed, a continuous variable.
I wanted to do something like this:
(reg_id is the actual participant ID)
speed_model <- geese(speed ~ time*Group, id = c(reg_id,match_id),
data=dataforGEE, corstr="exchangeable", family=gaussian)
Where I want to model the autocorrelation within a person via reg_id, but also within the matched pairs via match_id
But I get:
Error in model.frame.default(formula = speed ~ time * Group, data = dataFullGEE, :
variable lengths differ (found for '(id)')
Can geese or GEE in general not handle clustering around 2 sets of id? Is there a way to even do this? I'm sure there is.
Thank you for any help you can provide.
This is definitely a better question for Cross Validated, but since you have exactly 2 observations per subject, I would consider the ANCOVA model:
geese(speed_at_time_2 ~ speed_at_time_1*Group, id = c(match_id),
data=dataforGEE, corstr="exchangeable", family=gaussian)
Regarding the use of ANCOVA, you might find this reference useful.
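The ANCOVA formulation needs one row per participant, with the two time points as separate columns. A base R sketch of that reshaping step follows; the toy values are invented, and it assumes a single speed value per participant per time point (for example after averaging repeated measures).

```r
# Hypothetical long data: one speed value per participant per time point
long <- data.frame(
  reg_id   = rep(1:4, each = 2),
  match_id = rep(c(1, 1, 2, 2), each = 2),
  Group    = rep(c("case", "control", "case", "control"), each = 2),
  time     = rep(1:2, times = 4),
  speed    = c(1.1, 1.3, 1.0, 1.1, 0.9, 1.2, 1.0, 1.0)
)

# Long to wide: creates the columns speed_at_time_1 and speed_at_time_2
wide <- reshape(long, direction = "wide",
                idvar = c("reg_id", "match_id", "Group"),
                timevar = "time", v.names = "speed", sep = "_at_time_")

print(names(wide))
```

The resulting wide data frame has the column names used in the geese call above, with match_id available as the clustering variable.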
