Getting percentile values from gamlss centile curves - r

This question is related to: Selecting Percentile curves using gamlss::lms in R
I can get centile curve from following data and code:
age = sample(5:15, 500, replace=T)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 12 13.12974
2 14 -18.97290
3 10 42.11045
4 12 27.89088
5 11 48.03861
6 5 24.68591
h = lms(yvar, age , data=mydata, n.cyc=30)
centiles(h,xvar=mydata$age, cent=c(90), points=FALSE)
How can I now get the yvar value on the curve for each x value (5:15), i.e. the 90th percentile of the data after smoothing?
I read the help pages and found fitted(h) and fv(h), which give fitted values for the entire data set. But how do I get the value at each age level on the 90th centile curve? Thanks for your help.
Edit: The following figure shows what I need:
I tried the following, but it is not correct since the values are wrong:
mydata$fitted = fitted(h)
aggregate(fitted~age, mydata, function(x) quantile(x,.9))
age fitted
1 5 6.459680
2 6 6.280579
3 7 6.290599
4 8 6.556999
5 9 7.048602
6 10 7.817276
7 11 8.931219
8 12 10.388048
9 13 12.138104
10 14 14.106250
11 15 16.125688
The values are very different from the 90th quantile computed directly from the data:
> aggregate(yvar~age, mydata, function(x) quantile(x,.9))
age yvar
1 5 39.22938
2 6 35.69294
3 7 25.40390
4 8 26.20388
5 9 29.07670
6 10 32.43151
7 11 24.96861
8 12 37.98292
9 13 28.28686
10 14 43.33678
11 15 44.46269

See if this makes sense. The 90th percentile of a normal distribution with mean 'smn' and sd 'ssd' is qnorm(.9, smn, ssd). So this seems to deliver (somewhat) sensible results, albeit not the full hack of centiles that I suggested:
plot(h$xvar, qnorm(.9, fitted(h), h$sigma.fv))
(Note the massive overplotting from only a few distinct xvars but 500 points. And you may want to set the ylim so that the full range can be appreciated.)
The caveat here is that you need to check the other parts of the model to see if it is really just an ordinary Normal model. In this case it seems to be:
> h$mu.formula
y ~ pb(x)
<environment: 0x10275cfb8>
> h$sigma.formula
~1
<environment: 0x10275cfb8>
> h$nu.formula
NULL
> h$tau.formula
NULL
So the model is just mean-estimate with a fixed-variance (the ~1) across the range of the xvar, and there are no complications from higher order parameters like a Box-Cox model. (And I'm unable to explain why this is not the same as the plotted centiles. For that you probably need to correspond with the package authors.)
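To get one value per age level rather than one per observation, the same qnorm() idea can be evaluated on a grid of ages. The following is only a sketch: it uses a plain lm() fit as a stand-in for the gamlss object (an assumption made so the example runs with base R alone), and the names fit, ages, mu, sigma and p90 are illustrative. With the real model, mu and sigma would come from the fitted gamlss parameters instead.

```r
# Simulated data in the same shape as the question (seed added for repeatability).
set.seed(1)
age    <- sample(5:15, 500, replace = TRUE)
yvar   <- rnorm(500, age, 20)
mydata <- data.frame(age, yvar)

# Stand-in for the gamlss fit: a normal model with a single residual SD
# (assumption: mirrors mu ~ smooth(age), sigma ~ 1 only loosely).
fit   <- lm(yvar ~ age, data = mydata)
sigma <- summary(fit)$sigma

# One fitted mean per distinct age level.
ages <- 5:15
mu   <- predict(fit, newdata = data.frame(age = ages))

# 90th centile of Normal(mu, sigma) at each age.
p90 <- data.frame(age = ages, p90 = qnorm(0.9, mean = mu, sd = sigma))
p90
```

Because sigma is constant here, the 90th centile sits a fixed distance above the fitted mean at every age.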

Related

R - polynomial regression issue - model limited to finite number of output values

I'm trying to calculate point slopes from a series of x,y data. Because some of the x data repeats (...8, 12, 12, 16...) there will be a division by zero issue when using slope = (y2-y1)/(x2-x1).
My solution is to create a polynomial regression equation of the data, then plug a new set of x values (xx) into the equation that monotonically increase between the limits of x. This eliminates the problem of equal x data points. As a result, (x) and (xx) have the same limits, but (xx) is always longer in length.
The problem I am having is that the fitted values for xx are limited to the length of x. When I try to use the polynomial equation with (xx) that is 20 in length, the fitted yy results provide data for the first 10 points then gives NA for the next 10 points. What is wrong here?
x <- c(1,2,2,5,8,12,12,16,17,20)
y <- c(2,4,5,6,8,11,12,15,16,20)
df <- data.frame(x,y)
my_mod <- lm(y ~ poly(x,2,raw=T), data=df) # This creates the polynomial equation
xx <- x[1]:x[length(x)] # Creates monotonically increasing x using boundaries of original x
yy <- fitted(my_mod)[order(xx)]
plot(x,y)
lines(xx,yy)
If you look at
fitted(my_mod)
It outputs:
# 1 2 3 4 5 6 7 8 9 10
#3.241032 3.846112 3.846112 5.831986 8.073808 11.461047 11.461047 15.303305 16.334967 19.600584
Meaning the name of the output matches the position of x, not the value of x, so fitted(my_mod)[order(xx)] doesn't quite make sense.
You want to use predict here:
yy <- predict(my_mod, newdata = data.frame(x = xx))
plot(xx, yy)
# 1 2 3 4 5 6 7 8 9 10
# 3.241032 3.846112 4.479631 5.141589 5.831986 6.550821 7.298095 8.073808 8.877959 9.710550
# 11 12 13 14 15 16 17 18 19 20
# 10.571579 11.461047 12.378953 13.325299 14.300083 15.303305 16.334967 17.395067 18.483606 19.600584
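With predictions on a strictly increasing grid, the point slopes the question was originally after can be taken with diff(), since no two xx values coincide. A minimal sketch, reusing the data from the question (the name slopes is just for illustration):

```r
x  <- c(1, 2, 2, 5, 8, 12, 12, 16, 17, 20)
y  <- c(2, 4, 5, 6, 8, 11, 12, 15, 16, 20)
df <- data.frame(x, y)

my_mod <- lm(y ~ poly(x, 2, raw = TRUE), data = df)

# Strictly increasing grid between the limits of x.
xx <- seq(min(x), max(x))
yy <- predict(my_mod, newdata = data.frame(x = xx))

# Slope between consecutive grid points; diff(xx) is never zero here.
slopes <- diff(yy) / diff(xx)
slopes
```

The result has one slope per interval, so it is one element shorter than xx.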

Adjusted survival curve based on weighted cox regression

I'm trying to make an adjusted survival curve based on a weighted cox regression performed on a case cohort data set in R, but unfortunately, I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted cox regression performed on a case cohort data set, so I would, generally, also be interested in hearing if there are other alternatives than the adjusted survival curves?
There are two reasons it is throwing errors.
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the intended first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is with a length-1 character vector that matches one of the names in the formula. You gave it the column itself, a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not answer your desires, but it does answer the "why-error" part of your question. You might want to review the methods used by Therneau, Cynthia S Crowson, and Elizabeth J Atkinson in their paper on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
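As one possible alternative that needs only the survival package, curves for chosen covariate profiles can be obtained by passing newdata to survfit(). This is a sketch under clear assumptions: it uses the same unweighted coxph fit as above (so it ignores the case-cohort weighting), and the profile in nd (histol fixed at 1, age at its median) is an arbitrary illustrative choice, not a recommendation.

```r
library(survival)

# Rebuild the case-cohort subset as in the question.
subcoh    <- nwtco$in.subcohort
selccoh   <- with(nwtco, rel == 1 | subcoh == 1)
ccoh.data <- nwtco[selccoh, ]
ccoh.data$age <- ccoh.data$age / 12  # age in years

# Unweighted Cox model (assumption: same simplification as the answer above).
fit <- coxph(Surv(edrel, rel) ~ stage + histol + age, data = ccoh.data)

# One hypothetical profile per stage, other covariates held fixed.
nd <- data.frame(stage = 1:4, histol = 1, age = median(ccoh.data$age))

# Predicted survival curve for each row of nd.
sf <- survfit(fit, newdata = nd)
plot(sf, col = 1:4, xlab = "edrel", ylab = "Predicted survival")
legend("bottomleft", legend = paste("stage", 1:4), col = 1:4, lty = 1)
```

These are curves for specific covariate values rather than population-averaged adjusted curves; the vignette linked above discusses the difference.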

Fixing an error in R- "Incorrect number of dimensions" in the Dunn Test

I am trying to use the Dunn test for a comparison, but I am getting an error: "Error in Psort[1, i] : incorrect number of dimensions"
The data I am trying to use look like this (but the sample size is bigger):
Frequency Height
1 10
2 11
1 9
1 8
2 15
1 9
2 11
2 13
The code I used was:
dunnTest(Height ~ Frequency,
data=Data,
method="bh")
Is my problem that my frequency is only split into two groups? For another factor my frequency had three groups and it worked fine. If this is the problem, is there another test I can use that performs a similar or the same function?
Thanks!
For two groups, the Dunn test is equivalent to the Wilcoxon rank sum test (wilcox.test) if you adjust the input parameters (disable the exact calculation of the p-value and disable the continuity correction). For your data, one obtains:
> wilcox.test(df$Frequency, df$Height, correct = FALSE, exact = FALSE)
Wilcoxon rank sum test
data: df$Frequency and df$Height
W = 0, p-value = 0.0006346
alternative hypothesis: true location shift is not equal to 0
I think you are using the dunnTest function from the FSA package. This function fails for two groups.
Data
df <- read.table(text="Frequency Height
1 10
2 11
1 9
1 8
2 15
1 9
2 11
2 13", header=TRUE)
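Note that the call above passes Frequency and Height as two separate samples. If the aim is to compare Height between the two Frequency groups, the formula interface may be closer to what is wanted. A minimal base-R sketch on the eight rows shown above (the name res is only for illustration):

```r
df <- read.table(text = "Frequency Height
1 10
2 11
1 9
1 8
2 15
1 9
2 11
2 13", header = TRUE)

# Compare Height between the two Frequency groups.
# exact = FALSE and correct = FALSE request the normal approximation
# without continuity correction, as described above.
res <- wilcox.test(Height ~ factor(Frequency), data = df,
                   correct = FALSE, exact = FALSE)
res
```

With only eight observations and tied values, the p-value from the normal approximation should be read with caution.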

Machine learning in R is slow with decisiontree

I'm trying to predict the type of a vehicle (model) based on the vehicle identification number (VIN). The first 10 positions of the VIN says something about the type, so I use them as variables. See an example of the data below:
positie_1_tm_3 positie_4 positie_5 positie_6 positie_7 positie_8 positie_9 positie_10 MODEL
MBL B 7 5 L 7 A 6 SKODA YETI
JNF A A E 1 1 U 2 NISSAN NOTE
VWZ Z Z 5 Z Z 9 4 VOLKSWAGEN FOX
F1D Z 0 V 0 6 4 2 RENAULT MEGANE
NAK U 8 1 1 C A 5 KIA SORENTO
F1B R 1 J 0 H 4 1 RENAULT CLIO
I used this R code for it:
#make stratified train and test set:
library(caret)
train.index <- createDataPartition(VIN1$MODEL, p = .6, list = FALSE)
train <- VIN1[ train.index,]
overige_data <- VIN1[-train.index,]
test.index<-createDataPartition(overige_data$MODEL, p = .5, list = FALSE)
test<-overige_data[test.index,]
testset2<-overige_data[-test.index,]
#make decision tree:
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
tree<- rpart(MODEL ~., train, method="class")
But the last step, making the tree, has been running for more than 2 weeks already.
The dataset is around 3 million rows, so the training set is around 1.8 million rows. Is it running so long because it's too much data for rpart or is there another problem?
No, something is obviously wrong. It may take long, but not 2 weeks.
The question is: how many labels (classes) are there? Decision trees tend to be slow when the number of classes is large (by large I mean more than 50).
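A quick way to answer that question is to count the classes in the response and, since the predictors here are categorical too, the levels of each predictor. The sketch below uses only the six example rows from the question (and just two of the predictor columns) as a stand-in for the full VIN1 data; the names n_classes and n_levels are illustrative.

```r
# Miniature stand-in for VIN1, built from the rows shown in the question.
VIN1 <- data.frame(
  positie_1_tm_3 = c("MBL", "JNF", "VWZ", "F1D", "NAK", "F1B"),
  positie_4      = c("B", "A", "Z", "Z", "U", "R"),
  MODEL          = c("SKODA YETI", "NISSAN NOTE", "VOLKSWAGEN FOX",
                     "RENAULT MEGANE", "KIA SORENTO", "RENAULT CLIO"),
  stringsAsFactors = TRUE
)

# Number of classes the tree has to separate.
n_classes <- nlevels(VIN1$MODEL)

# Number of levels in each categorical predictor.
predictors <- setdiff(names(VIN1), "MODEL")
n_levels   <- sapply(VIN1[predictors], nlevels)

n_classes
n_levels
```

On the real data, a response with hundreds of models combined with a many-level predictor such as positie_1_tm_3 would be a plausible explanation for the run time.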

How to organize data for a multivariate probit model?

I've conducted a psychometric test on some subjects, and I'm trying to create a multivariate probit model.
The test was conducted as follows:
Subject 1 was given a certain stimulus under 11 different conditions, 10 times for each condition. Answers (correct=1, incorrect=0) were recorded.
So for subject 1, I have the following results' table:
# Subj 1
correct
cnt 1 0
1 0 10
2 0 10
3 1 9
4 5 5
5 7 3
6 10 0
7 10 0
8 10 0
9 9 1
10 10 0
11 10 0
This means that Subj1 answered incorrectly 10 times under conditions 1 and 2, and answered correctly 10 times under conditions 10 and 11. For the other conditions, the number of correct responses increased from condition 3 to condition 9.
I hope I was clear.
I usually analyze the data using the following code:
prob.glm <- glm(resp.mat1 ~ cnt, family = binomial(link = "probit"))
Here resp.mat1 is the response table, while cnt is the contrast c(1,11). So I'm able to draw the sigmoid curve using the predict() function. The graph for subject 1 is the following.
Now suppose I've conducted the same test on 20 subjects. I have now 20 tables, organized like the first one.
What I want to do is to compare subgroups, for example: male vs. female; young vs. older and so on. But I want to keep the inter-individual variability, so simply "adding" the 20 tables will be wrong.
How can I organize the data in order to use the glm() function?
I want to be able to write a command like:
prob.glm <- glm(resp.matTOT ~ cnt + sex, family = binomial(link = "probit"))
And then graphing the curve for sex=M, and sex=F.
I tried using the rbind() function to create a single table, then adding columns for Subj (1 to 20), Sex, Age. But it looks like a bad solution to me, so any alternative solutions will be really appreciated.
Looks like you are using the wrong function for the job. Check the first example of glmer in package lme4; it comes quite close to what you want. herd should be replaced by the subject number, but make sure that you do something like
mydata$subject = as.factor(mydata$subject)
when you have numerical subject numbers.
# Stolen from lme4
library(lattice)
library(lme4)
xyplot(incidence/size ~ period|herd, cbpp, type=c('g','p','l'),
layout=c(3,5), index.cond = function(x,y)max(y))
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial))
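For the data layout itself, one row per subject and condition with the counts of correct and incorrect answers is the shape that both glm() and glmer() accept. A minimal sketch, assuming 10 trials per condition: subject 1 uses the counts from the question, while subject 2 and both sex labels are invented purely for illustration.

```r
# Subject 1: counts of correct answers from the table in the question.
s1 <- data.frame(subject = 1, sex = "M", cnt = 1:11,
                 n_correct = c(0, 0, 1, 5, 7, 10, 10, 10, 9, 10, 10))

# Subject 2: hypothetical counts, for illustration only.
s2 <- data.frame(subject = 2, sex = "F", cnt = 1:11,
                 n_correct = c(0, 1, 2, 4, 8, 9, 10, 10, 10, 10, 10))

# Stack into one long table: one row per subject x condition.
long <- rbind(s1, s2)
long$n_incorrect <- 10 - long$n_correct
long$subject <- factor(long$subject)
long$sex     <- factor(long$sex)

# Two-column response: successes and failures per row.
prob.glm <- glm(cbind(n_correct, n_incorrect) ~ cnt + sex,
                data = long, family = binomial(link = "probit"))

# With lme4 installed, a per-subject random intercept could be added:
# glmer(cbind(n_correct, n_incorrect) ~ cnt + sex + (1 | subject),
#       data = long, family = binomial(link = "probit"))
```

The rbind() approach mentioned in the question is therefore reasonable; the subject column is what lets a mixed model keep the inter-individual variability.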
There's a multivariate probit command in the mlogit library of all things. You can see an example of the data structure required here:
https://stats.stackexchange.com/questions/28776/multinomial-probit-for-varying-choice-set
