comp() returns ranks instead of p-values - r

I was given an example where comp() should return p-values, but it ends up returning ranks, so let me ask:
Why is the comp() function from the survMisc package returning ranks instead of p-values?
Is there a way to change it?
library(survival)
library(survMisc)
test_drug <- survfit(Surv(N_Days, Cens) ~ Drug, data = df)
comp(ten(test_drug), p = c(0, 1, 1, 0.5, 0.5), q = c(1, 0, 1, 0.5, 2))
output:
Q Var Z pNorm
1 3.3457e+00 2.7643e+01 0.63634 4
n 3.2000e+02 1.0304e+06 0.31524 10
sqrtN 3.4634e+01 4.8218e+03 0.49877 9
S1 2.1524e+00 1.6867e+01 0.52410 7
S2 2.1294e+00 1.6650e+01 0.52185 8
FH_p=0_q=1 1.1647e+00 2.2356e+00 0.77898 3
FH_p=1_q=0 2.1809e+00 1.7056e+01 0.52809 6
FH_p=1_q=1 8.4412e-01 7.9005e-01 0.94968 1
FH_p=0.5_q=0.5 1.6895e+00 4.1759e+00 0.82678 2
FH_p=0.5_q=2 2.7491e-01 2.2027e-01 0.58575 5
maxAbsZ Var Q pSupBr
1 5.8550e+00 2.7643e+01 1.11361 5
n 9.7000e+02 1.0304e+06 0.95556 6
sqrtN 6.3636e+01 4.8218e+03 0.91643 7
S1 3.5891e+00 1.6867e+01 0.87391 9
S2 3.5737e+00 1.6650e+01 0.87581 8
FH_p=0_q=1 2.2539e+00 2.2356e+00 1.50743 2
FH_p=1_q=0 3.6025e+00 1.7056e+01 0.87230 10
FH_p=1_q=1 1.4726e+00 7.9005e-01 1.65678 1
FH_p=0.5_q=0.5 2.9457e+00 4.1759e+00 1.44148 3
FH_p=0.5_q=2 6.3430e-01 2.2027e-01 1.35150 4

So, according to the topic here:
https://github.com/dardisco/survMisc/issues/21
and to information I got from the professor/lecturer who solved the problem earlier, this is an issue with the R version, and an update to the function itself by the authors or contributors is required.
It can be worked around using attr() with the "tft" attribute, which stands for test for trend. Code example here:
test_bilirubin <- survfit(Surv(N_Days, Cens) ~ Bilirubin_cat, data = df)
b <- ten(test_bilirubin)
comp(b, p = c(0, 1, 1, 0.5, 0.5), q = c(1, 0, 1, 0.5, 2))
d <- attr(b, "tft")
# "lrt" - the log-rank family of tests
# "sup" - Renyi test
# "tft" - test for trend
cbind(d$tft$W, round(d$tft$pChisq, 4))
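The same trick applies to the other families shown in the printout; a minimal sketch (the attribute names "lrt" and "sup" are the ones listed in the comments above, and I would inspect their structure before extracting specific columns):
lrt <- attr(b, "lrt")   # weighted log-rank family of tests
sup <- attr(b, "sup")   # Renyi (supremum) versions
str(lrt)                # check which columns hold the test statistics and p-values
str(sup)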

Related

lcmm::predictClass with l-spline link function

I am getting an error message when trying to predict class membership with lcmm::predictClass(). This seems to be due to using a spline-based link function, as exemplified below. The lcmm::predictClass() function works fine with the default link function.
The following shows 1) a reproducible example giving the error message, and 2) a working example with the same broad approach.
## define initialisation values for quick result here
BB <- c(-19.064,21.718,-1.192,-1.295,-1.205,-0.281,0.110,
-0.232, 1.339,-1.007, 1.019,-9.395, 1.702,2.030,
2.089, 1.352,-9.369, 1.220, 1.532, 2.481,1.223)
library(lcmm)
m2c <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="3-quant-splines",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=BB)
## converges in 3 iterations
## define the prediction cases
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>% ## 10 random IDs
select(ID,Ydep1,Ydep2,Time,X1,X2)
## find predicted class memberships
predictClass(m2c, newdata=X)
## Error in multlcmm(fixed = Ydep1 + Ydep2 ~ 1 + Time * X2, mixture = ~1 + :
## Length of vector range is not correct.
On the other hand, a similar approach with a linear link function gives the following. Note that these models are based on the example in the ?multlcmm help section.
library(lcmm)
m2 <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="linear",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=c(18,-20.77,1.16,-1.41,-1.39,-0.32,0.16,
-0.26,1.69,1.12,1.1,10.8,1.24,24.88,1.89))
## converges in 2 iterations
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>%
select(ID,Ydep1,Ydep2,Time,X1,X2)
predictClass(m2, newdata=X)
## ID class prob1 prob2
## 1 21 2 0.031948951 9.680510e-01
## 2 25 2 0.042938984 9.570610e-01
## 3 33 2 0.026053178 9.739468e-01
## 4 46 1 0.999999964 3.597409e-08
## 5 50 2 0.066291287 9.337087e-01
## 6 74 2 0.005630593 9.943694e-01
## 7 120 2 0.024787290 9.752127e-01
## 8 171 2 0.053499974 9.465000e-01
## 9 229 1 0.999999996 4.368222e-09
##10 235 2 0.008173507 9.918265e-01
## ...or similar
The other prediction functions, predictL() and predictY(), seem to work fine. predictRE() throws the same error message.
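A quick way to confirm that the failure is tied to the link function rather than to the newdata (a minimal sketch using only base R and the two models fitted above):
# the linear-link model should return class probabilities,
# while the spline-link model should reproduce the error message
lapply(list(linear = m2, splines = m2c), function(mod) {
  tryCatch(predictClass(mod, newdata = X),
           error = function(e) conditionMessage(e))
})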
I will also email the package maintainer.

Can we extract an output text from a function in R?

I have simulation data repeated 100 times, and I applied mclustBIC() to each sample. I would then like to access the top result of this function, but I could not work out how to extract it.
Here is an example using this function:
library(mclust)
mclustBIC(iris[,-5])
The output is:
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE VEE EVE VVE EEV
1 -1804.0854 -1804.0854 -1522.1202 -1522.1202 -1522.1202 -1522.1202 -829.9782 -829.9782 -829.9782 -829.9782 -829.9782
2 -1123.4117 -1012.2352 -1042.9679 -956.2823 -1007.3082 -857.5515 -688.0972 -656.3270 -657.2263 -605.1841 -644.5997
3 -878.7650 -853.8144 -813.0504 -779.1566 -797.8342 -744.6382 -632.9647 -605.3982 -666.5491 -636.4259 -644.7810
4 -893.6140 -812.6048 -827.4036 -748.4529 -837.5452 -751.0198 -646.0258 -604.8371 -705.5435 -639.7078 -699.8684
5 -782.6441 -742.6083 -741.9185 -688.3463 -766.8158 -711.4502 -604.8131 NA -723.7199 -632.2056 -652.2959
6 -715.7136 -705.7811 -693.7908 -676.1697 -774.0673 -707.2901 -609.8543 -609.5584 -661.9497 -664.8224 -664.4537
7 -731.8821 -698.5413 -713.1823 -680.7377 -813.5220 -766.6500 -632.4947 NA -699.5102 -690.6108 -709.9530
8 -725.0805 -701.4806 -691.4133 -679.4640 -740.4068 -764.1969 -639.2640 -654.8237 -700.4277 -709.9392 -735.4463
9 -694.5205 -700.0276 -696.2607 -702.0143 -767.8044 -755.8290 -653.0878 NA -729.6651 -734.2997 -758.9348
VEV EVV VVV
1 -829.9782 -829.9782 -829.9782
2 -561.7285 -658.3306 -574.0178
3 -562.5522 -656.0359 -580.8396
4 -602.0104 -725.2925 -630.6000
5 -634.2890 NA -676.6061
6 -679.5116 NA -754.7938
7 -704.7699 -809.8276 -806.9277
8 -712.8788 -831.7520 -830.6373
9 -748.8237 -882.4391 -883.6931
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
I want to access the last line and extract values from it (is that possible?)
Top 3 models based on the BIC criterion:
VEV,2 VEV,3 VVV,2
-561.7285 -562.5522 -574.0178
Update: using summary() helps to get to this value, but not to extract from it.
I tried to approach this another way, by first extracting only the values, such that:
res <- mclustBIC(iris[,-5])
res1 <- as.data.frame(res[,1:14])
res2 <- max(res1[[1]])
However, res2 only provides me with the maximum value for a specific model. In addition, I need to know the number of clusters (from 1 to 9). I would like to have it like this:
"EII, 9, -694.5205" ## the last line of EII.
A possible solution:
library(mclust)
m <- mclustBIC(iris[,-5])
BIC <- as.numeric(summary(m))
names(BIC) <- names(summary(m))
BIC
#> VEV,2 VEV,3 VVV,2
#> -561.7285 -562.5522 -574.0178
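If you instead want the per-model maximum together with its cluster count (the "EII, 9, -694.5205" format asked for above), a minimal sketch building on the res1 data frame from the question:
library(mclust)
res  <- mclustBIC(iris[, -5])
res1 <- as.data.frame(res[, 1:14])   # BIC matrix: rows = G (1..9), columns = models
best <- which.max(res1[["EII"]])     # row of the largest (least negative) EII value; NAs are skipped
c(model = "EII", G = rownames(res1)[best], BIC = res1[["EII"]][best])
#> roughly: "EII" "9" "-694.52..."   (the last line of EII)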

How do I generate the archetypes of a new dataset from the GLRM predict function

I have used these sites as references and, though they have been helpful, I am unable to regenerate the reduced dimensions of new datasets via the GLRM predict function:
https://bradleyboehmke.github.io/HOML/GLRM.html
https://github.com/h2oai/h2o-tutorials/blob/master/best-practices/glrm/GLRM-BestPractices.Rmd
I work in the sparklyr environment with H2O. I'm keen to use the GLRM function to reduce dimensions for clustering. From the model I am able to access the PCAs or Arch columns, but I would like to generate the Archs from the GLRM predict function on new datasets.
Appreciate your help.
Here is the training of the GLRM model on the training dataset
glrm_model <-
h2o.glrm(
training_frame = train,
cols = glrm_cols,
loss = "Absolute",
model_id = "rank2",
seed = 1234,
k = 5,
transform = "STANDARDIZE",
loss_by_col_idx = losses$index,
loss_by_col = losses$loss
)
# Decompose training frame into XY
X <- h2o.getFrame(glrm_model@model$representation_name)  # as an H2O frame
The Arch Types from the training dataset:
X
Arch1 Arch2 Arch3 Arch4 Arch5
1 0.10141381 0.10958071 0.26773514 0.11584502 0.02865024
2 0.11471676 0.06489475 0.01407714 0.24536782 0.10223535
3 0.08848878 0.26742082 0.04915022 0.11693702 0.03530641
4 -0.03062604 0.29793032 -0.07003814 0.01927975 0.52451867
5 0.09497268 0.12915324 0.21392107 0.08574152 0.03750636
6 0.05857743 0.18863508 0.14570623 0.08695144 0.03448957
But when I use the trained GLRM model on a new dataset to regenerate these arch types, I get the full reconstructed dimensions instead of the Arch columns as in X above.
I'm using these Arch columns as features for clustering purposes.
# Generate predictions on a validation set (if necessary):
glrm_pred <- h2o.predict(glrm_model, newdata = test)
glrm_pred
reconstr_price reconstr_bedrooms reconstr_bathrooms reconstr_sqft_living reconstr_sqft_lot reconstr_floors reconstr_waterfront reconstr_view reconstr_condition reconstr_grade reconstr_sqft_above reconstr_sqft_basement reconstr_yr_built reconstr_yr_renovated
1 -0.8562455 -1.03334892 -1.9903167 -1.3950774 -0.2025564 -1.6537486 0 4 5 13 -1.20187061 -0.6584413 -1.25146116 -0.3042907
2 -0.7940549 -0.29723926 -0.7863867 -0.4364751 -0.1666500 -0.8527297 0 4 5 13 -0.13831432 -0.6545514 0.54821146 -0.3622902
3 -0.7499614 -0.18296317 0.1970824 -0.3989486 -0.1532677 0.4914559 0 4 5 13 -0.09100889 -0.6614534 1.38779632 -0.1844416
4 -1.0941432 0.08954988 0.7872987 -0.2087964 -0.1599888 0.8254916 0 4 5 13 0.11973488 -0.6623575 2.70176558 -0.2363486
5 0.3727360 0.82848389 0.4965246 1.1134378 -0.9013011 -1.3388791 0 4 5 13 0.08427185 2.1354440 -0.07213625 -1.2213866
6 -0.4042458 -0.59876839 -0.9685556 -0.7093578 -0.1745297 -0.5061798 0 4 5 13 -0.43503836 -0.6628391 -0.55165408 -0.2207544
reconstr_lat reconstr_long reconstr_sqft_living15 reconstr_sqft_lot15
1 -0.07307503 -0.4258959 -1.0132778 -0.1964471
2 -0.52124543 0.7283153 0.1242903 -0.1295341
3 -0.56113519 0.6011221 -0.1616215 -0.1624136
4 -0.99759090 1.3032420 0.1556193 -0.1569607
5 0.70028433 -0.6436112 1.1400189 -0.9272790
6 -0.02222403 -0.2257382 -0.4859787 -0.1817499
[6416 rows x 18 columns]
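One thing that may be worth trying (an assumption on my part: recent H2O releases document an h2o.transform_frame() helper that returns the X factor of a fitted GLRM for a new frame, so check whether your H2O build has it):
# sketch: project the test frame onto the trained archetypes to obtain
# an X-like frame with k = 5 columns for the new rows
glrm_test_x <- h2o.transform_frame(glrm_model, test)
head(glrm_test_x)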
thank you

LPA - model selection based on BIC with function prior=priorControl()

I'm trying to fit models for latent profile analysis (packages: tidyLPA and mclust). For model VVI (varying variances, covariances fixed to zero), I get many NA values for BIC when n_profiles > 5. I figured out that the argument prior = priorControl() can possibly fix that, and indeed it does! But now I get a huge bump in BIC when n_profiles > 5, which indicates better model fit.
Does anyone have a guess what is going on here, or a recommendation on how to deal with it? Any thoughts are appreciated. I hope the code below and the attached plots illustrate the issue.
Many thanks!!
###not run
library(mclust)
library(tidyLPA)
library(dplyr)
# cluster_1 is an imputed subset of 9 variables
cluster_1 <- subset %>%
single_imputation() %>%
mutate_all(list(scale))
# mclustBIC without priorControl
set.seed(0408)
BIC_m12a <- mclustBIC(cluster_1, modelNames = c("EEI", "VVI"))
BIC_m12a
Bayesian Information Criterion (BIC):
EEI VVI
1 -127005.8 -127005.8
2 -122298.6 -121027.1
3 -120579.4 -119558.0
4 -119883.4 -118765.7
5 -119244.2 -118293.6
6 -119064.4 NA
7 -118771.5 NA
8 -118681.0 NA
9 -118440.2 NA
# mclustBIC with priorControl
set.seed(0408)
BIC_m12b <- mclustBIC(cluster_1, modelNames = c("EEI", "VVI"), prior=priorControl())
BIC_m12b
Bayesian Information Criterion (BIC):
EEI VVI
1 -127006.0 -127006.0
2 -122299.5 -121028.2
3 -120587.9 -119560.0
4 -119884.4 -118771.7
5 -119235.4 -118296.9
6 -118933.9 -112761.8
7 -118776.2 -112579.7
8 -118586.2 -112353.3
9 -118472.0 -112460.2
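To see why those cells come back as NA in the run without the prior (a sketch; it assumes the mclustBIC object carries a "returnCodes" attribute, which current mclust versions document):
# non-zero codes mark model/G combinations whose EM fit failed,
# which is what prints as NA in the first BIC table
attr(BIC_m12a, "returnCodes")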
[Plots of BIC against the number of profiles, without and with priorControl(), were attached here.]

Compile all data produced by rolling regression into one

I am doing a rolling regression with a huge database. The reference column used for rolling is called "Q", with values from 5 to 45 for each data block. At first I tried simple code, step by step, and it works very well:
fit <- as.formula(EB~EB1+EB2+EB3+EB4)
#use the 20 Quarters data to do regression
model<-lm(fit,data=datapool[(which(datapool$Q>=5&datapool$Q<=24)),])
#use the model to forecast the value of next quarter
pre<-predict(model,newdata=datapool[which(datapool$Q==25),])
#get the forecast error
error<-datapool[which(datapool$Q==25),]$EB -pre
The result of the code above is:
> head(t(t(error)))
[,1]
21 0.006202145
62 -0.003005097
103 -0.019273856
144 -0.016053012
185 -0.025608022
226 -0.004548264
The datapool has the structure below:
> head(datapool)
X Q Firm EB EB1 EB2 EB3
1 1 5 CMCSA US Equity 0.02118966 0.08608825 0.01688180 0.01826571
2 2 6 CMCSA US Equity 0.02331379 0.10506550 0.02118966 0.01688180
3 3 7 CMCSA US Equity 0.01844747 0.12961955 0.02331379 0.02118966
4 4 8 CMCSA US Equity NA NA 0.01844747 0.02331379
5 5 9 CMCSA US Equity 0.01262287 0.05622834 NA 0.01844747
6 6 10 CMCSA US Equity 0.01495291 0.06059339 0.01262287 NA
...
Firm B(also from Q5 to Q45)
...
Firm C(also from Q5 to Q45)
The errors produced above are all tagged with the "X" value from datapool, so I can tell which firm each error comes from.
Since I need to run the regression 21 times (quarters 5-24, 6-25, ..., 25-44), I do not want to do it manually, so I have come up with the following code:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
for (i in 0:20){
  model <- lm(fit, data = datapool[which(datapool$Q >= 5 + i & datapool$Q <= 24 + i), ])
  pre   <- predict(model, newdata = datapool[which(datapool$Q == 25 + i), ])
  error <- datapool[which(datapool$Q == 25 + i), ]$EB - pre
}
The code above works and no errors come out, but I do not know how to compile all the errors produced by each regression into one data set automatically. Can anyone help me with that?
(I say again: it is a really bad idea to use the name 'error' for a vector. It is the name of a core function.) This is how I would have attempted that task, using the subset argument and indexing rather than the tortured which() statements:
fit <- as.formula(EB ~ EB1 + EB2 + EB3 + EB4)
errset <- vector("list", 21)
for (i in 0:20){
  model  <- lm(fit, data = datapool, subset = Q >= 5 + i & Q <= 24 + i)
  newdat <- datapool[datapool[["Q"]] == 25 + i, ]
  errset[[i + 1]] <- newdat$EB - predict(model, newdata = newdat)
}
errset
No guarantees this won't error out by running out of data at the beginning or end, since you have not offered either data or a comprehensive description of the data object.
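To get everything into one object with the firm identifier attached (a sketch that assumes the list-based loop above and the X column shown in head(datapool)):
# bind the 21 error vectors into one data frame, tagged with firm id and forecast quarter
err_all <- do.call(rbind, lapply(0:20, function(i) {
  newdat <- datapool[datapool$Q == 25 + i, ]
  data.frame(X = newdat$X, Q = 25 + i, err = errset[[i + 1]])
}))
head(err_all)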

Resources