How to organize data for a multivariate probit model in R?

I've conducted a psychometric test on some subjects, and I'm trying to create a multivariate probit model.
The test was conducted as follows:
Subject 1 was given a certain stimulus under 11 different conditions, 10 times for each condition. Answers (correct = 1, incorrect = 0) were recorded.
So for subject 1, I have the following table of results:
# Subj 1
       correct
cnt     1    0
1       0   10
2       0   10
3       1    9
4       5    5
5       7    3
6      10    0
7      10    0
8      10    0
9       9    1
10     10    0
11     10    0
This means that Subj 1 answered incorrectly 10 times under conditions 1 and 2, and correctly 10 times under conditions 10 and 11. For the other conditions, the number of correct responses increases from condition 3 to condition 9.
I hope I was clear.
I usually analyze the data using the following code:
prob.glm <- glm(resp.mat1 ~ cnt, family = binomial(link = "probit"))
Here resp.mat1 is the table of responses (a two-column matrix of correct/incorrect counts), while cnt is the vector of conditions (1 to 11). So I'm able to draw the sigmoid curve for subject 1 using the predict() function.
Now suppose I've conducted the same test on 20 subjects. I have now 20 tables, organized like the first one.
What I want to do is compare subgroups, for example male vs. female, young vs. old, and so on. But I want to keep the inter-individual variability, so simply "adding" the 20 tables together would be wrong.
How can I organize the data in order to use the glm() function?
I want to be able to write a command like:
prob.glm <- glm(resp.matTOT ~ cnt + sex, family = binomial(link = "probit"))
And then graphing the curve for sex=M, and sex=F.
I tried using the rbind() function to create a single table, then adding columns for Subj (1 to 20), Sex, and Age. But it looks like a clumsy solution to me, so any alternative solution would be really appreciated.

Looks like you are using the wrong function for the job. Check the first example of glmer in package lme4; it comes quite close to what you want. herd should be replaced by the subject number, but make sure that you do something like
mydata$subject = as.factor(mydata$subject)
when you have numerical subject numbers.
# Stolen from lme4
library(lattice)
library(lme4)
xyplot(incidence/size ~ period | herd, cbpp, type = c('g', 'p', 'l'),
       layout = c(3, 5), index.cond = function(x, y) max(y))
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
              data = cbpp, family = binomial))
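As a rough sketch of how this maps onto the psychometric data, assuming the 20 per-subject tables have been stacked into one long-format data frame (here called resp.all, with columns correct and incorrect holding the counts out of 10, plus cnt, sex and subject; all of these names are assumptions):

library(lme4)
# per-subject random intercept keeps the inter-individual variability,
# while cnt and sex enter as fixed effects
resp.all$subject <- as.factor(resp.all$subject)
prob.glmm <- glmer(cbind(correct, incorrect) ~ cnt + sex + (1 | subject),
                   data = resp.all, family = binomial(link = "probit"))
# curves for sex = M and sex = F can then be drawn from
# predict(prob.glmm, newdata, re.form = NA, type = "response")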

There's a multivariate probit command in the mlogit library of all things. You can see an example of the data structure required here:
https://stats.stackexchange.com/questions/28776/multinomial-probit-for-varying-choice-set


inputs of 2 separate predict() return the same set of fitted values

Confession: I attempted to ask this question yesterday, but used a sample, congruent dataset which resembles my "real" data in hopes this would be more convenient for readers here. One issue was resolved, but another remains that appears immutable.
My objective is to create a linear model with two predicted vectors, yC.hat and yT.hat, which are meant to project average effects for unique observed values of pri2000v as a function of the average poverty level I(avgpoverty^2) under the control (treatment = 0) and treatment (treatment = 1) conditions.
While I appear to have no issues running the regression itself, the inputs of my data argument have no effect on predict(), and only the object itself affects the output. As a result, treatment = 0 and treatment = 1 in the data argument give the same fitted values. In fact, I can plug ANY value into the data argument and it makes no difference. So I suspect my failure to understand the issue starts here.
Here is my code:
q6rega <- lm(pri2000v ~ treatment + I(log(pobtot1994)) + I(avgpoverty^2)
             # interactions
             + treatment:avgpoverty + treatment:I(avgpoverty^2),
             data = pga)
## predicted PRI support under the Treatment condition
q6.yT.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 1))
## predicted PRI support rate under the Control condition
q6.yC.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 0))
q6.yC.hat == q6.yT.hat
# TRUE for all 417 fitted values
dput(pga) has been posted on my GitHub, if needed.
EDIT: There were a few things wrong with my code above, but not specifying pobtot1994 somehow resulted in R treating it as if newdata had been omitted. Since I'm fairly new to statistics, I confused fitted values with the prediction output I was actually trying to get. I would have expected an unexpected input to produce an error instead.
I'm surprised you were able to run a prediction at all when the new data frame is missing a variable (pobtot1994) required by your model.
Anyway, you need to create a new data frame containing the three variables used in the model, in untransformed form. Since you are interested in comparing the fitted values for avgpoverty 3 to 5 under treatment 1 and 0, you need to hold the third variable, pobtot1994, constant. I use the mean of pobtot1994 here for simplicity.
newdat <- expand.grid(avgpoverty=3:5, treatment=factor(c(0,1)), pobtot1994=mean(pga$pobtot1994))
  avgpoverty treatment pobtot1994
1          3         0   2037.384
2          4         0   2037.384
3          5         0   2037.384
4          3         1   2037.384
5          4         1   2037.384
6          5         1   2037.384
The prediction will show you the different values for the two conditions.
newdat$fitted <- predict(q6rega, newdata=newdat)
  avgpoverty treatment pobtot1994   fitted
1          3         0   2037.384 38.86817
2          4         0   2037.384 50.77476
3          5         0   2037.384 55.67832
4          3         1   2037.384 51.55077
5          4         1   2037.384 49.03148
6          5         1   2037.384 59.73910
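For what it's worth, the silent failure in the original calls comes from the argument name: predict() for lm objects takes newdata, not data, so anything passed as data is swallowed by ... and ignored, and the fitted values of the original fit are returned unchanged. A minimal illustration, reusing the newdat built above:

# `data` is not an argument of predict.lm(), so it is silently ignored
all.equal(predict(q6rega, data = newdat), predict(q6rega))  # TRUE
# `newdata` is the argument that actually feeds new values into the prediction
predict(q6rega, newdata = newdat)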

Adjusted survival curve based on weighted Cox regression

I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted Cox regression performed on a case-cohort data set, so I would generally also be interested in hearing whether there are alternatives to adjusted survival curves.
It is throwing errors for two reasons:
1. The ggadjustedcurves() function is not being given a coxph object, which its help page indicates is the expected first argument.
2. The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not answer all of your needs, but it does answer the "why the error" part of your question. You might want to review the methods used by Therneau, Cynthia S Crowson, and Elizabeth J Atkinson in their paper on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
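As a rough illustration of the direction that vignette takes, one simple way to show the effect of a covariate from the unweighted coxph fit above is to predict survival curves over a small grid of its values while holding the other covariates fixed (this sketch assumes stage and histol are kept numeric, as in the data printed above):

# one predicted curve per stage, other covariates held at typical values
nd <- data.frame(stage = 1:4, histol = 1, age = mean(ccoh.data$age))
plot(survfit(fit.ccSP, newdata = nd), col = 1:4,
     xlab = "Days since diagnosis", ylab = "Predicted survival")
legend("bottomleft", legend = paste("stage", 1:4), col = 1:4, lty = 1)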

posterior_survfit() is not using the new data frame (nd)

I am having trouble generating posterior predictions using posterior_survfit(). I am trying to use a new data frame, but it is not using the new data frame and is instead using values from the dataset I used to fit the model. The fitted variables in the model are New.Treatment (6 treatments, categorical), Openness (a continuous light index; min = 2.22, mean = 6.903221, max = 10.54), subplot_by_site (categorical, 720 sites), and New.Species.name (categorical, 165 species). My new data frame has 94 rows, but posterior_survfit() is giving me 3017800 rows. Help, please!
head(nd)
New.Treatment Openness
1 BE 5
2 BE 6
3 BE 7
4 BE 8
5 BE 9
6 BE 10
fit <- stan_surv(formula = Surv(days, Status_surv) ~ New.Treatment * Openness +
                   (1 | subplot_by_site) + (1 | New.Species.name),
                 data = dataset,
                 basehaz = "weibull",
                 chains = 4,
                 iter = 2000,
                 cores = 4)
Post <- posterior_survfit(fit, type = "surv",
                          newdata = nd5)
head(Post)
id cond_time time median ci_lb ci_ub
1 1 NA 62.0000 0.9626 0.9623 1.0000
2 1 NA 69.1313 0.9603 0.9600 0.9997
3 1 NA 76.2626 0.9581 0.9579 0.9696
4 1 NA 83.3939 0.9561 0.9557 0.9665
5 1 NA 90.5253 0.9541 0.9537 0.9545
6 1 NA 97.6566 0.9522 0.9517 0.9526
## Here is some reproducible code to explain my problem:
library(rstanarm)
data_NHN<- expand.grid(New.Treatment = c("A","B","C"), Openness = c(seq(2, 11, by=0.15)))
data_NHN$subplot_by_site=c(rep("P1",63),rep("P2",60),rep("P3",60))
data_NHN$Status_surv=sample(0:1,183, replace=TRUE)
data_NHN$New.Species.name=c(rep("sp1",10),rep("sp2",40),rep("sp1",80),rep("sp2",20),rep("sp1",33))
data_NHN$days=sample(10, size = nrow(data_NHN), replace = TRUE)
nd_t<- expand.grid(New.Treatment = c("A","B","C"), Openness = c(seq(2, 11, by=1)))
mod <- stan_surv(formula = Surv(days, Status_surv) ~ New.Treatment + Openness +
                   (1 | subplot_by_site) + (1 | New.Species.name),
                 data = data_NHN,
                 basehaz = "weibull",
                 chains = 4,
                 iter = 30,
                 cores = 4)
summary(mod)
pos <- posterior_survfit(mod, type = "surv",
                         newdataEvent = nd_t,
                         times = 0)
head(pos)
# I am interested in predicting values for specific Openness values
# (nd_t = 20 rows) but instead I am getting values for each point in time
# (pos = 18300 rows)
Operating System: Mac OS Catalina 10.15.6
R version: 4.0
rstan version: 2.21.2
rstanarm Version: rstanarm_2.21.2
Any suggestions on why it is not working? It's not clear how to produce some sort of plot of the effect of one variable in the interaction as the other changes, with the associated uncertainty (i.e. a marginal effects plot). In my example, I am interested in getting the values at specific Openness values rather than at each time point, as appears in the posterior results. TIA.
I can only give you a partial answer. You should be aware that this is pretty bleeding-edge; the arXiv paper (dated Feb 2020) says
"Hopefully by the time you are reading this, the functionality will be available in the stable release on the Comprehensive R Archive Network (CRAN)"
but so far that's not the case; it's not even in the master GitHub branch, so I used remotes::install_github("stan-dev/rstanarm#feature/survival") to install it from source.
The proximal problem is that you should specify the new data frame as newdata, not newdataEvent. There seem to be a lot of mismatches between the master and this branch, and between the docs and the code ... newdataEvent is used in the older method for stanjm models, but not for stansurv models. You can look at the code here, or use formals(rstanarm:::posterior_survfit.stansurv). Unfortunately, because this method has an (unused, unchecked) ... argument, that means that any misnamed arguments will be silently ignored.
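In this example that presumably means calling it as follows (although, as described next, it then fails for a different reason):

# pass the prediction grid as `newdata`, not `newdataEvent`
pos <- posterior_survfit(mod, type = "surv", newdata = nd_t, times = 0)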
The next problem is that if you specify the new data in your example as newdata you'll get
Error: The following variables are missing from the data: subplot_by_site, New.Species.name
That is, there doesn't seem to be an obvious way to generate a population-level posterior prediction. (Setting the random effects grouping variables to NA isn't allowed.) If you want to do this, you could either:
expand your newdata to include all combinations of the grouping variables in your data set, and average the results across levels yourself (a rough sketch of this is shown below);
post an issue on GitHub or contact the maintainers ...
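A rough sketch of the first option, assuming posterior_survfit() returns a data frame whose id column indexes the rows of the new data (as in the output shown above):

# expand the grid over the grouping factors seen in the fitting data
nd_full <- expand.grid(New.Treatment = c("A", "B", "C"),
                       Openness = seq(2, 11, by = 1),
                       subplot_by_site = unique(data_NHN$subplot_by_site),
                       New.Species.name = unique(data_NHN$New.Species.name))
pos_full <- posterior_survfit(mod, type = "surv", newdata = nd_full, times = 0)
# merge the predictions back onto the grid and average over the grouping
# variables for each Treatment x Openness combination
nd_full$id <- seq_len(nrow(nd_full))
merged <- merge(as.data.frame(pos_full), nd_full, by = "id")
aggregate(median ~ New.Treatment + Openness + time, data = merged, FUN = mean)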

GAM model error

My data frame looks like:
head(bush_status)
distance status count
0 endemic 844
1 exotic 8
5 native 3
10 endemic 5
15 endemic 4
20 endemic 3
The count data are non-normally distributed. I'm trying to fit a generalized additive model to my data in two ways so I can use anova() to see if the p-value supports m2.
library(mgcv)
m1 <- gam(count ~ s(distance) + status, data = bush_status, family = "nb")
m2 <- gam(count ~ s(distance, by = status) + status, data = bush_status, family = "nb")
m1 works fine, but m2 sends the error message:
"Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons,
scale.penalty = scale.penalty, :
Can't find by variable"
This is pretty beyond me so if anyone could offer any advice that would be much appreciated!
From your comments it became clear that you passed a character variable to by in the smoother. You must pass a factor variable there. This has been a frequent gotcha for me too and I consider it a design flaw (because base R regression functions deal with character variables just fine).
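A minimal sketch of the fix, assuming the same object and column names as in the question:

library(mgcv)
# by= smooths require a factor, so convert the character column first
bush_status$status <- factor(bush_status$status)
m2 <- gam(count ~ s(distance, by = status) + status,
          data = bush_status, family = "nb")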

R decision tree using all the variables

I would like to perform a decision tree analysis. I want the decision tree to use all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
This is a sample of my dataset
> head(d)
TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1 2 2 4 2 0 0 0
2 2 2 4 3 1 0 0
3 2 2 5 1 0 0 0
4 2 2 4 2 1 0 0
5 2 3 3 1 0 0 0
6 2 3 3 2 0 0 0
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
EDIT:
My problem is that some variables do not appear in the final decision tree.
The depth of the tree should be defined by a penalty parameter alpha, and I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
As mentioned above, if you want to run the tree on all the variables you should write it as
ctree(wheeze3 ~ ., d)
The penalty you mentioned is set via ctree_control(). You can set the p-value threshold there, as well as the minimum split and bucket sizes. So, to maximize the chance that all the variables will be included, you should do something like this:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
The problem is that you run the risk of overfitting.
The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether they are significant, the decision tree does not return the insignificant variables, i.e. it doesn't split on them.
For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
The easiest way is to use the rpart package, which ships with R.
library(rpart)
model <- rpart( wheeze3 ~ ., data=d )
summary(model)
plot(model)
text(model)
The . in the formula argument means use all the other variables as independent variables.
library(party)
plot(ctree(myFormula, data = d))
