Visualising variance from random effects in a mixed model by group

I have run a linear mixed model in R using lmer. I am attempting to visualise the random effect structure. To produce a graph I have used print(dotplot(ranef(RT.model.4, condVar=T))[['part_no']]), where part_no is the random-effects grouping factor from the model. This produces a dotplot of the conditional modes, one point (with an interval) per participant.
This is great. However, I want to be able to tell my two groups of participants (the random-effect levels being plotted) apart visually in the graph. I have group A and group B: my dataset has a column for participant type, which gives a value of A or B for each row.
I would like either to colour-code the graph to distinguish participants from groups A and B or, perhaps better, to create two separate panels, one for each group.
Any suggestions on how to do this would be very much appreciated.

Here is a way using ggplot rather than lattice (just because I am more familiar with it), adapting code from the examples in ?dotplot.ranef.mer. You need to match the treatment group in your data to the random-effects grouping variable returned by ranef; I don't see how this can be done automatically within dotplot.ranef.mer.
First create a small example where each subject is assigned to one of two treatment groups:
library(lme4)
library(ggplot2)
# Assign each subject to one of two treatment groups
sleepstudy$trt = as.integer(sleepstudy$Subject %in% 308:340)
m = lmer(Reaction ~ trt + (1|Subject), sleepstudy)
Convert the random effects to a data frame and match in the treatment groups:
# dd has columns grpvar, term, grp, condval, condsd
dd = as.data.frame(ranef(m, condVar=TRUE))
# Look up each subject's treatment group in the original data
dd$trt = with(sleepstudy, trt[match(dd$grp, Subject)])
You can then plot it however you want, say by using facets or by assigning a colour to each group:
ggplot(dd, aes(y = grp, x = condval, colour = factor(trt))) +
  geom_point() +
  facet_wrap(~term, scales = "free_x") +
  geom_errorbarh(aes(xmin = condval - 2*condsd,
                     xmax = condval + 2*condsd), height = 0)
ggplot(dd, aes(y = grp, x = condval)) +
  geom_point() +
  geom_errorbarh(aes(xmin = condval - 2*condsd,
                     xmax = condval + 2*condsd), height = 0) +
  facet_wrap(~trt)

You should be able to use the groups= option in dotplot(). Assuming your data are in a data frame called df, with the group variable in a column named group, you could use
print(dotplot(ranef(RT.model.4, condVar=T), groups=df$group)[['part_no']])

Related

How does the stratum function work in the clusrank package in R?

I'm working with the clusrank package in R to analyse insect abundance data, using the clusWilcox.test function for clustered data. As far as I understand, this package allows you to add both a cluster() and a stratum() term when using the "rgl" method, in order to cluster by multiple factors.
When I add a single factor to my code as either only a cluster or only a stratum, the Z- and p-values are the same for both calls, which seems to indicate that the stratum term works. However, when I take the first factor as a cluster and add a second, different one as a stratum, the output is still identical to the cluster-only model. This makes me think only the cluster is taken into account and the stratum term is ignored.
This problem should be reproducible by making a random test dataset (in this example called df) with four columns: the dependent variable (in my case 'abundance'), the grouping factor whose effect I want to know (in my case 'treatment'), and two factors to add as cluster/stratum, say 'factorA' and 'factorB'. In my own test dataset the factors have 2 levels each, in my real dataset 6 levels each, and the problem arises in both.
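For concreteness, a minimal sketch of such a test dataset (the values and factor codes are arbitrary; only the structure matters):
library(clusrank)
set.seed(1)
# Dependent variable, treatment group, and two 2-level factors
# (coded 1/2) to try as cluster/stratum
df <- data.frame(
  abundance = rpois(40, lambda = 10),
  treatment = rep(c("control", "treated"), each = 20),
  factorA   = rep(1:2, times = 20),
  factorB   = rep(rep(1:2, each = 2), times = 10)
)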
My code is then as follows:
clusWilcox.test(abundance ~ treatment + cluster(factorA), data = df, method = "rgl")
This gives the same Z- and p-values as adding factorA as a stratum, the only difference being that the reported number of clusters is now the number of rows in the test dataset instead of the number of factor levels.
clusWilcox.test(abundance ~ treatment + stratum(factorA), data = df, method = "rgl")
Both give exactly the same Z- and p-values as:
clusWilcox.test(abundance ~ treatment + cluster(factorA) + stratum(factorB), data = df, method = "rgl")
This makes me think that the stratum term is ignored in this third call. If you switch factorA and factorB, the same problem arises, though with different output values, as the calculation is now based on factorB instead of factorA.
Does anyone know what is happening here? Is my code wrong, or is the stratum term indeed not taken into account?

Scale LDA decision boundary

I have a rather unconventional problem and am having a hard time finding a solution to it. I would really appreciate your help.
I have 4 genes (features) and my classification here is binary (0 and 1). After a lot of back and forth, I have settled on using LDA for the classification. I have different studies, each comparing the same two classes, and I trained a model using these 4 genes on each of these studies.
I want to visualize the LDA scores in the form of a points plot, where each section represents a different study/dataset, the samples of that dataset are on the X axis, and the Y axis shows the LD1 value I get using:
lda_model = lda(formula = class ~ ., data = train)
predict(lda_model, train)
Since I trained a different model on each dataset, we can clearly see that the decision boundary (which I assume is the black line) for each dataset is different and on a different scale. However, I want to scale the values on the Y axis in such a way that all my datasets are on the same scale, so that I can represent the plot with a single decision boundary (again, something I can clearly draw on the plot, like the red line).
The LD1 values here are a*GeneA + b*GeneB + c*GeneC + d*GeneD - mean(a*GeneA + b*GeneB + c*GeneC + d*GeneD), computed for each dataset individually. However, this is not exactly equal to a*GeneA + b*GeneB + c*GeneC + d*GeneD + intercept, which we could get using logistic regression. I am trying to find that value, or some method that can scale my Y axis across all the datasets using LDA.
Thanks for your help!
I did a min-max scaling and that seemed to work: it put all my data points across all datasets on the same scale, with the decision boundary at zero.
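For reference, a minimal sketch of that kind of per-dataset scaling, reusing the lda_model and train objects from the question (scale_minmax is a hypothetical helper; the exact variant used, e.g. any re-centring so the boundary lands at zero, was not shown):
library(MASS)  # provides lda() and predict.lda()
# LD1 scores for one dataset: first column of the predicted scores
ld1 <- predict(lda_model, train)$x[, 1]
# Min-max scale to [0, 1]; applied per dataset, this puts every
# study's LD1 scores on a common scale
scale_minmax <- function(x) (x - min(x)) / (max(x) - min(x))
ld1_scaled <- scale_minmax(ld1)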

How can I visualize an interaction in cox model in r?

I fitted a model and found a significant interaction effect. How can I plot this in a graph?
Here is a toy example (for illustration purposes only):
library(survival)
# includes bladder data set
library(survminer)
fit2 <- coxph(Surv(stop, event) ~ rx*enum, data = bladder)
# It plots only a single curve
ggadjustedcurves(fit2, data = bladder, variable = "rx")
I would like something like these:
ggadjustedcurves(fit2, data = bladder, variable = "rx") +
facet_wrap(~enum)
ggadjustedcurves(fit2, data = bladder, variable = c("enum","rx"))
It would be nice if the answer worked both for a categorical × categorical interaction and for a categorical × continuous interaction.
Categorical x categorical
If you consider your variables categorical, then in variable "rx" you have 2 groups and in variable "enum" 4 groups, which gives you a total of 8 curves.
(1) One way to visualize them would be to plot all curves on the same graph:
# Combine the two factors into a single grouping variable
bladder$rx_enum <- paste(as.character(bladder$rx), as.character(bladder$enum), sep="_")
ggadjustedcurves(fit2, data = bladder, method='average', variable = "rx_enum")
This is probably not the most elegant way, and you would also have to adjust the colours/line types to look nicer. I would probably try to set the line type according to "rx" and the colour according to "enum" in this case. Modifying colour is relatively easy with the palette argument:
ggadjustedcurves(fit2, data = bladder, method='average', palette = c(1,2,3,4,1,2,3,4), variable = "rx_enum")
...while modifying line type is probably more tricky.
(2) Obviously, you can also make separate panels for the different levels of either variable. With the "rx" variable you'll have one panel for the data subset where rx == 1 and another where rx == 2. I probably wouldn't use separate panels/graphs, because you can visually represent all of the information in one plot - unless it is necessary/justified by your narrative. But if you want to go that way, let me know.
Categorical x Continuous
The same approach will work with a continuous variable as well, if you categorize it first. I am not sure how one could make a KM plot for a continuous variable while keeping it continuous.
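For example, reusing the toy data above and binning a covariate with cut() (a sketch: enum is treated here as if it were continuous, purely for illustration):
# Bin the continuous covariate, then proceed as in the
# categorical x categorical case
bladder$enum_cat <- cut(bladder$enum, breaks = 2, labels = c("low", "high"))
bladder$rx_enum_cat <- paste(bladder$rx, bladder$enum_cat, sep = "_")
fit3 <- coxph(Surv(stop, event) ~ rx * enum_cat, data = bladder)
ggadjustedcurves(fit3, data = bladder, method = 'average', variable = "rx_enum_cat")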
NB: This answer considered only KM plots, which are the most common for survival analysis, but there are probably other options as well.

Plot linear and multiple linear reg on the same graph (ggplot)

I have, for instance, this data frame:
data <- data.frame(
  x = 1:12,
  case = c(3, 5, 1, 8, 2, 4, 5, 0, 8, 2, 3, 5),
  rain = c(1, 8, 2, 1, 4, 5, 3, 0, 8, 2, 3, 4),
  country = c("A","A","A","A","B","B","B","B","C","C","C","C"),
  year = rep(seq(2000, 2003, 1), 3)
)
I would like to perform 2 linear regressions and plot them on one graph.
In a nutshell, I would like to compare the crude trend of cases over time (a simple lm) with the same trend adjusted for rainfall over the years 2000 to 2003, on one and the same graph.
model <- lm(case ~ year, data = data)
The second one would be a multiple linear regression. I used this code for that purpose, but I am not sure it is ideal:
modelrain <- lm(case ~ I(year + rain), data = data)
I did it with a base plot and abline, but I don't know how to do it with ggplot. I've created a new data frame, but it doesn't seem to work perfectly (so I won't put the rest of my code here).
Thank you very much
Building off the suggestions in the comments, there are three valid regression models:
model1 <- lm(case ~ year, data = data)
summary(model1)
model2 <- lm(case ~ year + rain, data = data)
summary(model2)
model3 <- lm(case ~ year * rain, data = data)
summary(model3)
With the limited data we have, there doesn't seem to be a lot going on.
The first question, how to plot the regression line for model1 using ggplot, is just:
ggplot(data, aes(x = year, y = case)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
As others have noted, it is unclear what user3355655 means by "adjusted" for rain (since rain and year can't truly exist on the same x axis), but if we're willing to take the simplest course and simply treat rain as a factor, then:
ggplot(data, aes(x = year, y = case, colour = factor(rain))) +
  geom_point() +
  geom_smooth(method = "lm", fill = NA) +
  scale_y_continuous(limits = c(-1, 10))
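If you instead want the crude and rain-adjusted trends overlaid on one graph, one option is to plot the fitted values from model1 and model2 directly (a sketch: holding rain at its mean is just one reasonable definition of "adjusted"):
# Fitted values: the crude trend, and the trend with rain held at its mean
data$crude <- predict(model1)
data$adjusted <- predict(model2, newdata = transform(data, rain = mean(data$rain)))
ggplot(data, aes(x = year, y = case)) +
  geom_point() +
  geom_line(aes(y = crude, colour = "crude")) +
  geom_line(aes(y = adjusted, colour = "adjusted for rain"))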

Using combinations of principal components in a regression model

I have a group of 51 variables to which I have applied Principal Component Analysis, and I have selected six factors based on the Kaiser-Guttman criterion. I'm using R for my analysis and did this with the following code:
prca.searchwords <- prcomp(searchwords.ts, scale=TRUE)
summary(prca.searchwords)
prca.searchwords$sdev^2  # eigenvalues; Kaiser-Guttman keeps components with eigenvalue > 1
Next I would like to use these six extracted factors as explanatory variables in a dynamic linear regression model, in groups of one, two, three and four, and choose the regression model that explains most of the variation in the dependent variable. The six variables are prca.searchwords$x[,1], prca.searchwords$x[,2], prca.searchwords$x[,3], prca.searchwords$x[,4], prca.searchwords$x[,5] and prca.searchwords$x[,6].
I convert these to time series before using them in a regression:
prca.searchwords.1.ts <- ts(data=prca.searchwords$x[,1], freq=12, start=c(2004, 1))
prca.searchwords.2.ts <- ts(data=prca.searchwords$x[,2], freq=12, start=c(2004, 1))
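All six series can be created in one loop rather than line by line (a sketch using assign(); the names match those used in the answer below):
# Create prca.searchwords.1.ts through prca.searchwords.6.ts
for (i in 1:6) {
  assign(paste0("prca.searchwords.", i, ".ts"),
         ts(data = prca.searchwords$x[, i], freq = 12, start = c(2004, 1)))
}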
I'm using the dynlm package in R for this (I chose to use dynamic regression because other regressions that I perform require lagged values of the independent variables).
For example, with the first two factors it would look like this:
private.consumption.searchwords.dynlm <- dynlm(monthly.privateconsumption.ts ~ prca.searchwords.1.ts + prca.searchwords.2.ts)
summary(private.consumption.searchwords.dynlm)
The problem I'm facing is that I would like to do this for all possible combinations of one, two, three and four of those six factors. This would mean six regressions for one-variable groups, 15 for two variables, 20 for three variables and 15 for four variables. I would like to do this as efficiently as possible, without having to type 56 different regressions manually.
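As a quick check of that count in R:
sum(choose(6, 1:4))  # 6 + 15 + 20 + 15 = 56 models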
I'm a relatively new R user, and I still struggle with the general coding tricks that radically speed up this kind of analysis. Could someone please point me in the right direction?
Thank you!
You could build all the formulas you are interested in running using string manipulation, then convert those to proper formulas and fit a model for each. For example:
# Names of the six component time series
vars <- paste0("prca.searchwords.", 1:6, ".ts")
# All combinations of one to four of them, pasted into right-hand sides
rhs <- unlist(lapply(1:4, function(i) apply(combn(vars, i), 2, paste, collapse = " + ")))
# Fit a dynlm model for every candidate formula
result <- lapply(rhs, function(r) {
  do.call("dynlm", list(as.formula(paste0("monthly.privateconsumption.ts ~ ", r))))
})
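Since each element of result is an ordinary fitted model, you can then rank the candidates, for example by adjusted R-squared (a sketch of one reasonable selection criterion; the question only asks for the model that explains the most variation):
names(result) <- rhs
# Adjusted R-squared of every fitted model, largest first
adj.r2 <- sapply(result, function(m) summary(m)$adj.r.squared)
head(sort(adj.r2, decreasing = TRUE))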
