Stata twoway graph of means with confidence intervals - graph

Using this example dataset:
clear
input score group test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
I want to scatter plot mean score by group for each test (on the same graph) with confidence intervals (the real data has thousands of observations). The resulting graph would have two sets of two dots: one set for test=="A" (group==0 vs group==1) and one set for test=="B" (group==0 vs group==1).
My current approach works, but it is laborious. I compute all of the needed statistics using egen (the mean, number of observations, standard deviation, and so on) for each group by test, then collapse the data and plot.
There has to be another way, no?
I assumed that Stata would be able to take the score, group, and test variables as input and then compute and present this fairly standard graph.
After spending a lot of time on Google, I had to ask.

Although there are user-written programs, I lean towards statsby as a basic approach here. Discussion is accessible in this paper.
This example builds on your data example, which was almost executable code (test is a string variable, so it needs str1 in the input statement). Some choices depend on the large confidence intervals implied. Note that if your version of Stata is not up to date, the syntax of ci will be different. (Just omit means.)
clear
input score group str1 test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
save cj12, replace
* test A
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "A"
gen test = "A"
save cj12results, replace
* test B
use cj12
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "B"
gen test = "B"
append using cj12results
* graph; show sample sizes too, but where to show them is empirical
set scheme s1color
gen where = -20
scatter mean group, ms(O) mcolor(blue) || ///
rcap ub lb group, lcolor(blue) ///
by(test, note("95% confidence intervals") legend(off)) ///
subtitle(, fcolor(ltblue*0.2)) ///
ytitle(score) xla(0 1) xsc(r(-0.25 1.25)) yla(-10(10)10, ang(h)) || ///
scatter where group, ms(none) mla(N) mlabpos(12) mlabsize(*1.5)
We can't comment on your complete code or your graph, because you show neither.

Related

Two separate predict() calls return the same set of fitted values

Confession: I attempted to ask this question yesterday, but used a sample, congruent dataset that resembles my "real" data, in the hope that this would be more convenient for readers here. One issue was resolved, but another remains that appears immutable.
My objective is to create two predicted vectors from a linear model, "yC.hat" and "yT.hat", which are meant to project average effects on pri2000v for the observed values of average poverty (entered as I(avgpoverty^2)) under the control (treatment = 0) and treatment (treatment = 1) conditions.
While I appear to have no issues running the regression itself, the inputs of my data argument have no effect on predict(), and only the model object itself affects the output. As a result, treatment = 0 and treatment = 1 in the data argument result in the same fitted values. In fact, I can plug ANY value into the data argument and it makes no difference. So I suspect my failure to understand the issue starts here.
Here is my code:
q6rega <- lm(pri2000v ~ treatment + I(log(pobtot1994)) + I(avgpoverty^2)
#interactions
+ treatment:avgpoverty + treatment:I(avgpoverty^2), data = pga)
## predicted PRI support under the Treatment condition
q6.yT.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 1))
## predicted PRI support rate under the Control condition
q6.yC.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 0))
q6.yC.hat == q6.yT.hat
# returns TRUE for all 417 elements
dput(pga) has been posted on my GitHub, if needed.
EDIT: There were a few things wrong with my code above, but not specifying pobtot1994 somehow resulted in R treating it as if newdata had been omitted. Since I'm fairly new to statistics, I confused fitted values with the prediction output that I was actually trying to obtain. I would have expected an unexpected input to produce an error instead.
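For context, here is a minimal sketch (toy data and names, not the poster's dataset) of the data= versus newdata= pitfall described in the edit: predict.lm() only recognises a newdata argument, so an unmatched data argument falls into ... and is silently ignored, and the original fitted values come back unchanged.
set.seed(1)
toy <- data.frame(x = 1:10, y = 1:10 + rnorm(10))
fit <- lm(y ~ x, data = toy)
# 'data' is not an argument of predict.lm(); it is silently ignored,
# so this returns the 10 fitted values for the original rows
p_wrong <- predict(fit, data = data.frame(x = 100:105))
length(p_wrong)   # 10, identical to fitted(fit)
# 'newdata' is the argument predict.lm() actually uses
p_right <- predict(fit, newdata = data.frame(x = 100:105))
length(p_right)   # 6 genuine predictions for the new x values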
I'm surprised you are able to run a prediction when the new data frame is missing a variable (pobtot1994) that your model requires.
Anyway, you need to create a new data frame containing the three variables used in the model, in untransformed form. Since you are interested in comparing the fitted values at avgpoverty 3 to 5 for treatment 1 and 0, you need to hold the third variable, pobtot1994, constant. I use the mean of pobtot1994 here for simplicity.
newdat <- expand.grid(avgpoverty=3:5, treatment=factor(c(0,1)), pobtot1994=mean(pga$pobtot1994))
avgpoverty treatment pobtot1994
1 3 0 2037.384
2 4 0 2037.384
3 5 0 2037.384
4 3 1 2037.384
5 4 1 2037.384
6 5 1 2037.384
The prediction will show you the different values for the two conditions.
newdat$fitted <- predict(q6rega, newdata=newdat)
avgpoverty treatment pobtot1994 fitted
1 3 0 2037.384 38.86817
2 4 0 2037.384 50.77476
3 5 0 2037.384 55.67832
4 3 1 2037.384 51.55077
5 4 1 2037.384 49.03148
6 5 1 2037.384 59.73910

Errors using powerSim and powerCurve for a clmm in R

I'm new to clmm and have run into the following problem:
I want to obtain the optimal sample size for my study with R using powerSim and powerCurve. Because my data are ordinal, I'm using a clmm. Study participants (VPN) evaluate three sentence types (SH1, SM1, and SP1) on a 5-point Likert scale (evaluation.likert). I need to account for my participants as a random factor, while the sentence types and the evaluation are my fixed factors.
Here's a glimpse of my data (count of VPN goes up to 40 for each of the parameters, I just shortened it here):
VPN parameter evaluation.likert
1 1 SH1 2
2 2 SH1 4
3 3 SH1 5
4 4 SH1 3
...
5 1 SM1 4
6 2 SM1 2
7 3 SM1 2
8 4 SM1 5
...
9 1 SP1 1
10 2 SP1 1
11 3 SP1 3
12 4 SP1 5
...
Now, with some help I created the following model:
power <- clmm(likert ~ parameter + (1|VPN), data = dfdata)
With this model, I'm doing the simulation:
ps1 <- powerSim(power, test=fixed("likert:parameter", "anova"), nsim=40)
Warning:
In observedPowerWarning(sim) :
This appears to be an "observed power" calculation
print(ps1)
Power for predictor 'likert:parameter', (95% confidence interval):
0.00% ( 0.00, 8.81)
Test: Type-I F-test
Based on 40 simulations, (0 warnings, 40 errors)
alpha = 0.05, nrow = NA
Time elapsed: 0 h 0 m 0 s
nb: result might be an observed power calculation
In the above example I tried it with 40 participants, but I also already ran a simulation with 2,000,000 participants to check whether I just need a huge number of people. The results were the same, though: 0.0%.
lastResult()$errors tells me that I'm using a method which is not applicable for clmm:
not applicable method for 'simulate' on object of class "clmm"
But besides the anova I'm doing here, I've also already tried z, t, f, chisq, lr, sa, kr, pb. (And instead of test=fixed, I've also already tried test=compare, test=fcompare, test=rcompare, and even test=random())
So I guess there must be something wrong with my model? Or are really none of these methods applicable to clmms?
Many thanks in advance, your help is already very much appreciated!
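The error from lastResult()$errors points at the underlying issue: powerSim() works by calling simulate() on the fitted model, and there is no simulate() method for ordinal::clmm objects. As a hedged sketch only (none of this is from the thread; the effect size, thresholds, and test are illustrative assumptions), a hand-rolled simulation loop can estimate power for a clmm directly:
library(ordinal)
sim_once <- function(n_vpn = 40, effect = 0.5) {
  dat <- expand.grid(VPN = factor(1:n_vpn),
                     parameter = factor(c("SH1", "SM1", "SP1")))
  # latent scale: a random intercept per participant plus an assumed effect of SP1
  lp <- rnorm(n_vpn)[dat$VPN] + effect * (dat$parameter == "SP1")
  dat$likert <- cut(lp + rlogis(nrow(dat)),
                    breaks = c(-Inf, -1, 0, 1, 2, Inf),
                    labels = 1:5, ordered_result = TRUE)
  fit  <- clmm(likert ~ parameter + (1 | VPN), data = dat)
  fit0 <- clmm(likert ~ 1 + (1 | VPN), data = dat)
  # manual likelihood-ratio test of the fixed effect of parameter (2 df)
  lr <- 2 * as.numeric(logLik(fit) - logLik(fit0))
  pchisq(lr, df = 2, lower.tail = FALSE) < 0.05
}
mean(replicate(100, sim_once()))   # proportion of significant runs = estimated power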

Adjusted survival curve based on weighted cox regression

I'm trying to make an adjusted survival curve based on a weighted cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
To illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
                data = ccoh.data, subcoh = ~subcohort, id = ~seqno, cohort.size = 4028, method = "LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted cox regression performed on a case-cohort data set, so I would also generally be interested in hearing whether there are alternatives to adjusted survival curves.
There are two reasons it is throwing errors.
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the intended first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154 instead.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
                  data = ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not give you everything you want, but it does answer the "why the error" part of your question. You might want to review the methods used by Therneau, Cynthia S. Crowson, and Elizabeth J. Atkinson in their paper on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
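As a loosely related sketch (not part of the answer above; the fixed covariate values are illustrative choices, and it assumes stage, histol, and age are stored numerically, as in the printout): survfit() on the coxph refit with a small newdata frame is another way to draw one curve per stage at representative values of the other covariates.
library(survival)
nd <- data.frame(stage  = sort(unique(ccoh.data$stage)),
                 histol = median(ccoh.data$histol),   # held constant (illustrative)
                 age    = mean(ccoh.data$age))        # held constant (illustrative)
sf <- survfit(fit.ccSP, newdata = nd)                 # fit.ccSP is the coxph fit above
plot(sf, col = seq_len(nrow(nd)), xlab = "Days", ylab = "Survival probability")
legend("bottomleft", legend = paste("stage", nd$stage),
       col = seq_len(nrow(nd)), lty = 1)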

R decision tree using all the variables

I would like to perform a decision tree analysis. I want the decision tree to use all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
This is a sample of my dataset
> head(d)
TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1 2 2 4 2 0 0 0
2 2 2 4 3 1 0 0
3 2 2 5 1 0 0 0
4 2 2 4 2 1 0 0
5 2 3 3 1 0 0 0
6 2 3 3 2 0 0 0
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
EDIT:
My problem is that some variables do not appear in the final decision tree.
The depth of the tree should be defined by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
As mentioned above, if you want to run the tree on all the variables you should write it as
ctree(wheeze3 ~ ., d)
The penalty you mentioned is set in ctree_control(). You can set the p-value threshold there, as well as the minimum split and bucket sizes. So to maximize the chance that all the variables will be included, you should do something like this:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
The problem is that you run the risk of overfitting.
The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they do not have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether they are significant, the decision tree does not return the insignificant variables, i.e. it doesn't split on them.
For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
The easiest way is to use the rpart package, which is included in the standard R distribution.
library(rpart)
model <- rpart( wheeze3 ~ ., data=d )
summary(model)
plot(model)
text(model)
The . in the formula argument means use all the other variables as independent variables.
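If the literal goal is the tree with minimum training error, here is a hedged sketch along the same lines (the control values below are illustrative assumptions, not from the answers above): in rpart the cost-complexity penalty, the "alpha" mentioned in the question, is exposed as the cp control parameter, and cp = 0 grows the largest tree rpart can build.
library(rpart)
d$wheeze3 <- factor(d$wheeze3)   # stored as 0/1 but categorical, per the question
# cp = 0 plus tiny split/bucket sizes means no complexity penalty at all:
# minimum training error, maximum risk of overfitting
full_tree <- rpart(wheeze3 ~ ., data = d,
                   control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
printcp(full_tree)        # cp table with cross-validated error, useful for pruning later
plot(full_tree)
text(full_tree, use.n = TRUE)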
plot(ctree(myFormula, data = d))

Cutting dendrogram into n trees with minimum cluster size in R

I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned to aggregate them into 100-member groups, which can be very time consuming.
I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit as the way to designate the number of groupings, but following the documentation, this limits the maximum number to 4. For the exercise below, all I'm looking to do is to get the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size limitation on my own, but if you want to try to tackle this too, it would be helpful!).
Here's my example, using the Orange dataset.
library(dynamicTreeCut)
library(reshape2)
##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)
####casting the data to make a correlation matrix, and then running
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))
###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)
dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As you can see, it only designates two clusters. For my exercise, I want to shy away from using an explicit height to cut the tree; I want a set number of clusters, k, instead.
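As a brief aside (not part of the answer that follows): if the requirement were only a fixed number of groups k, base R's cutree() accepts k directly, though it offers no control over minimum or maximum cluster size.
cutree(orangeClust, k = 5)   # exactly 5 groups, sizes unconstrained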
1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward", "single", "complete", "average", "mcquitty", "median", or "centroid") based on the nature of your data and the objective(s) of clustering. See ?dist and ?hclust for more details.
2- Plot the dendrogram tree before starting the cutting step. See ?hclust for more details.
3- Use the hybrid adaptive tree cut method in dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap / maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).
For your example,
1- Change "euclidean" and/or "complete" methods as appropriate,
orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")
2- Plot the dendrogram,
plot(orangeClust)
3- Use the hybrid tree cut method and tune shape parameters,
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As a guide for tuning the shape parameters, the default values are
deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4
As you can see, both maxCoreScatter and minGap should be between 0 and 1, and increasing maxCoreScatter (decreasing minGap) increases the number of clusters (with smaller sizes). The meaning of these parameters is described in Langfelder et al. 2009.
For example, to get more, smaller clusters:
maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0
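A small added check (not from the original answer): once a cut has been produced, the size constraints from the question can be verified directly; note that label 0 marks unassigned objects.
sizes <- table(dynamicCut)
max(sizes) <= 100                           # no cluster larger than 100 members
max(sizes) / length(dynamicCut) <= 0.40     # no cluster holds more than 40% of the total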
Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the generated clusters should agree with the data. This leads you to the important step of cluster validation and interpretation.
Good Luck!
