Getting more precision in the p-value from survdiff - r

I am running survdiff from the survival package and the p-value is 0.02. I would like to see it with more precision (i.e. 0.02xxxx). Is there an argument I can pass to specify the number of digits of the p-value? I read the documentation for the survival package and did not find any mention of how to specify it.
survdiff(surv_object~access_sam2$Area_mTLSHL)

Credits.
The computation of the p-value for objects of class "survdiff" is not completely obvious. I had to look at the print method for objects of that class to see how the degrees of freedom are computed.
The code below is a simplification of the code of print.survdiff, and therefore the credits go to
citation("survival")
#
#Therneau T (2015). _A Package for Survival Analysis
#in S_. version 2.38, <URL:
#https://CRAN.R-project.org/package=survival>.
#
#Terry M. Therneau, Patricia M. Grambsch (2000).
#_Modeling Survival Data: Extending the Cox Model_.
#Springer, New York. ISBN 0-387-98784-3.
#
#To see these entries in BibTeX format, use
#'print(<citation>, bibtex=TRUE)', 'toBibtex(.)', or
#set 'options(citation.bibtex.max=999)'.
The code itself can be seen in the sources or by running
getAnywhere("print.survdiff")
Now for the question's problem.
I have written a generic pvalue function to make it easier to call a method for objects of the class returned by survdiff. The example is taken from the help page of that function.
The return value is a named list with 3 members; the names are self-explanatory. One of them, chisq, is a repetition of a value returned by survdiff; I have included it for the sake of completeness.
pvalue <- function(x, ...) UseMethod("pvalue")

pvalue.survdiff <- function (x, ...)
{
    if (length(x$n) == 1) {
        # one-sample test: always 1 degree of freedom
        df <- 1
        pval <- pchisq(x$chisq, 1, lower.tail = FALSE)
    } else {
        if (is.matrix(x$obs)) {
            otmp <- rowSums(x$obs)
            etmp <- rowSums(x$exp)
        } else {
            otmp <- x$obs
            etmp <- x$exp
        }
        # degrees of freedom: groups with positive expected counts, minus 1
        # (otmp is kept only for parity with print.survdiff)
        df <- sum(etmp > 0) - 1
        pval <- pchisq(x$chisq, df, lower.tail = FALSE)
    }
    list(chisq = x$chisq, p.value = pval, df = df)
}
srv <- survdiff(Surv(futime, fustat) ~ rx, data = ovarian)
pvalue(srv)
#$chisq
#[1] 1.06274
#
#$p.value
#[1] 0.3025911
#
#$df
#[1] 1
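If you only need more digits on display, formatting the returned p-value is also enough. A minimal sketch using base R formatting helpers (the variable p is mine):
p <- pvalue(srv)$p.value
format(p, digits = 10) # more significant digits
sprintf("%.6f", p)     # a fixed number of decimal places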

I am not sure about the survival package and you did not provide a reproducible example (please do so next time). But in general, if you want to see more digits, what you need to do is
print(value, digits = n)
# n is the number of digits you want to see
In your case it is
print(survdiff(surv_object~access_sam2$Area_mTLSHL), 6)
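Alternatively, a small sketch of a base-R route (this is not a survdiff argument): extract the chi-square statistic from the fit and compute the p-value at full precision yourself. The df below assumes every group has a positive expected count, matching the logic in print.survdiff.
fit <- survdiff(surv_object ~ access_sam2$Area_mTLSHL)
pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)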

Related

Apply logistic regression in a function in R

I want to run logistic regression for multiple parameters and store the different metrics, i.e. AUC.
I wrote the function below, but I get an error when I call it: Error in eval(predvars, data, env) : object 'X0' not found, even though the variable exists in both my training and testing datasets. Any idea?
new.function <- function(a) {
    model <- glm(extry ~ a, family = binomial("logit"), data = train_df)
    pred.prob <- predict(model, test_df, type = 'response')
    predictFull <- prediction(pred.prob, test_df$extry)
    auc_ROCR <- performance(predictFull, measure = "auc")
    my_list <- list("AUC" = auc_ROCR)
    return(my_list)
}
# Call the function new.function, supplying the column X0 as an argument.
les <- new.function(X0)
The main reason your function didn't work is that you are trying to pass an object into a formula. You can fix it by building the formula with paste and as.formula, but that approach is ultimately quite limiting.
I suggest instead that you consider using update. This allows you more flexibility to switch between multiple variable combinations, or to change the training dataset, without breaking the function.
# fit a base model on a real variable first (a literal `a` would not be found)
model <- glm(extry ~ X1, family = binomial("logit"), data = train_df)
new.model <- update(model, . ~ X0)

new.function <- function(model){
    pred.prob <- predict(model, test_df, type = 'response')
    predictFull <- prediction(pred.prob, test_df$extry)
    auc_ROCR <- performance(predictFull, measure = "auc")
    my_list <- list("AUC" = auc_ROCR)
    return(my_list)
}
les <- new.function(new.model)
The function can be further improved by passing test_df as a separate argument, so that you can evaluate the model on alternative testing data.
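For instance, a minimal sketch of that refinement (the name auc.function is mine; it assumes the ROCR package and extracts the numeric AUC from the @y.values slot):
auc.function <- function(model, test_data) {
    pred.prob <- predict(model, test_data, type = 'response')
    predictFull <- ROCR::prediction(pred.prob, test_data$extry)
    auc_ROCR <- ROCR::performance(predictFull, measure = "auc")
    list("AUC" = auc_ROCR@y.values[[1]])
}
les <- auc.function(new.model, test_df)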
To run the function in the way you intended, you would need to use non-standard evaluation to capture the symbol and insert it in a formula. This can be done using match.call and as.formula. Here's a fully reproducible example using dummy data:
new.function <- function(a) {
    # Convert symbol to character
    a <- as.character(match.call()$a)
    # Build formula from character strings
    form <- as.formula(paste("extry", a, sep = "~"))
    model <- glm(form, family = binomial("logit"), data = train_df)
    pred.prob <- predict(model, test_df, type = 'response')
    predictFull <- ROCR::prediction(pred.prob, test_df$extry)
    auc_ROCR <- ROCR::performance(predictFull, "auc")
    list("AUC" = auc_ROCR)
}
Now we can call the function in the way you intended:
new.function(X0)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
new.function(X1)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
If you want to see the actual area under the curve you would need to do:
new.function(X0)$AUC@y.values[[1]]
#> [1] 0.6599759
So you may wish to modify your function so that the list contains auc_ROCR@y.values[[1]] rather than auc_ROCR.
Data used
set.seed(1)
train_df <- data.frame(X0 = sample(100), X1 = sample(100))
train_df$extry <- rbinom(100, 1, (train_df$X0 + train_df$X1)/200)
test_df <- data.frame(X0 = sample(100), X1 = sample(100))
test_df$extry <- rbinom(100, 1, (test_df$X0 + test_df$X1)/200)
Created on 2022-06-29 by the reprex package (v2.0.1)

I try to replicate the results of the multinom() function with the optim() function in R, but it does not yield the same results. What is wrong?

I want to replicate the results of the multinom() function with the optim() function in R, but it does not yield the same results. What is wrong?
First, I imported a public dataset as "ml".
require(foreign)
ml <- read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
The code to get the summary statistics of the "ml" data, and the results, are below:
with(ml, table(ses,prog))
with(ml, do.call(rbind,tapply(write, prog, function(x) c(M = mean(x), SD = sd(x)))))
        prog
ses      general academic vocation
  low         16       19       12
  middle      20       44       31
  high         9       42        7

                M       SD
general  51.33333 9.397775
academic 56.25714 7.943343
vocation 46.76000 9.318754
The code to get the results from the multinom() function, which fits a multinomial logistic regression, and the resulting output are below:
library(nnet)
ml$prog2 <- relevel(ml$prog, ref = "academic")
ml_pckg <- multinom(prog2 ~ write + ses, data = ml)
summary(ml_pckg)
Call:
multinom(formula = prog2 ~ write + ses, data = ml)
Coefficients:
         (Intercept)      write  sesmiddle    seshigh
general     2.852198 -0.0579287 -0.5332810 -1.1628226
vocation    5.218260 -0.1136037  0.2913859 -0.9826649

Std. Errors:
         (Intercept)      write sesmiddle   seshigh
general     1.166441 0.02141097 0.4437323 0.5142196
vocation    1.163552 0.02221996 0.4763739 0.5955665
Residual Deviance: 359.9635
AIC: 375.9635
The code to get the z statistics and the results are below:
z <- summary(ml_pckg)$coefficients/summary(ml_pckg)$standard.errors
z
         (Intercept)     write  sesmiddle   seshigh
general     2.445214 -2.705562 -1.2018081 -2.261334
vocation    4.484769 -5.112689  0.6116747 -1.649967
Next, I wrote code to replicate the results above.
I generated dummy variables for the categorical dependent/independent variables as below:
ml$prog_academic <- ifelse(ml$prog == "academic", 1, 0)
ml$prog_general <- ifelse(ml$prog == "general", 1, 0)
ml$prog_vocational <- ifelse(ml$prog == "vocational", 1, 0)
ml$ses_low <- ifelse(ml$ses == "low", 1, 0)
ml$ses_middle <- ifelse(ml$ses == "middle", 1, 0)
ml$ses_high <- ifelse(ml$ses == "high", 1, 0)
I generated a vector of ones to multiply with the intercept and subsetted write, ses_middle, and ses_high as the explanatory variables; ses_low is the baseline here. I assigned these covariates to a new data frame named "X".
one <- as.data.frame(rep(1, 200))
covar <- ml[, c(7, 19, 20)]
X <- data.frame(one, covar) # 200*4
Next, I created another data frame for the dependent variables, named "Y", that consists of prog_general and prog_vocational. Here, prog_academic is the baseline.
Y <- ml[,16:17] #200*2
I set the initial values of the parameters close to the results of the multinom() function so that the optimization converges.
B_0 <- c(3, -0.1, -0.5, -1, 5, -0.1, 0.2, -1) #8*1 #initial value as vector
Here, I refer to a document to find the likelihood of the multinomial logistic regression. The likelihood is in equation 31 on page 12. I found out that the second part of the equation should be summed with respect to i as well.
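Written out in my notation (with academic as the baseline and j running over general and vocation), the log-likelihood I am trying to maximize, reconstructed to match the code below, is

$$\ell(B) = \sum_{i=1}^{200}\sum_{j=1}^{2} y_{ij}\,x_i^{\top}B_j \;-\; \sum_{i=1}^{200}\log\Bigl(1 + \sum_{j=1}^{2}\exp\bigl(x_i^{\top}B_j\bigr)\Bigr)$$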
I generated a blank matrix "xb" to hold the linear-predictor part x_i'B_j:
xb <- matrix(0, nrow=200, ncol=2) #200*2
I ran the code below all at once to get the results of the optimization.
mlogit <- function(B){
    B <- matrix(B, nrow=2, ncol=4, byrow=T)
    for (i in 1:nrow(xb)){      # i indexes individuals: 200
        for (j in 1:ncol(xb)){  # j indexes the non-baseline outcome categories: 2
            xb[i,j] <- sum(X[i,]*B[j,]) #200*2
        }
    }
    exp <- exp(xb) #200*2
    sumexp <- rowSums(exp) #200*1
    sumexp <- as.numeric(sumexp)
    yxb <- Y*xb #200*2
    sumyxb <- sum(yxb)
    ll <- sumyxb - sum(log(1+sumexp))
    -ll
}
mlogit_result <- optim(par = B_0, fn = mlogit)
mlogit_result
The results are below:
$par
[1] 0.05325004 -0.01417267 -0.64375499 -0.96137147 6.33471560 -0.86154161 0.92387035 -0.65728823
$value
[1] 103.7692
$counts
function gradient
353 NA
$convergence
[1] 0
$message
NULL
If the results corresponded with those of the multinom() function, $par should be as below:
2.852198 -0.0579287 -0.5332810 -1.1628226 5.218260 -0.1136037 0.2913859 -0.9826649
I reviewed my code and the likelihood function again and again but could not find anything wrong. I think maybe the initial parameters are set wrongly, or the function I created has some problem.
Could anyone please give me any suggestions to deal with this problem?
Thank you!

Looping statistical tests in R

I would like to apply a t-test in R within a loop:
Groups      Length        Size       Diet       place
A        2.4048381   0.7474989  1.6573392 334.3273456
A       2.72500485  0.86392165  1.8610832 452.5593152
A      1.396782867 0.533330367  0.8634525 225.5998728
B        1.3888505  0.46478175 0.92406875 189.9576476
B       1.38594795  0.60068945  0.7852585 298.3744962
B       2.53491245  0.95608005  1.5788324 303.9052525
I tried this code with a loop, but it is not working:
for (i in 2:4){
    t.test(table[,c(i)] ~ table$Groups, conf.level = 0.95)
}
Can anyone help me with this?
Thanks!
Your code computes a t-test in each iteration, but the results are lost because you don't do anything with them. Note also that 2:4 covers only Length, Size, and Diet; use 2:5 to include place. Try the following:
info <- read.table(header=TRUE, text="Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525")

results <- list()
for (i in 2:5){ # columns 2:5 hold the four measurements
    results[[ names(info)[i] ]] <- t.test(info[[i]] ~ info$Groups, conf.level = 0.95)
}
print(results)
When you interact with the REPL/console, a call like t.test at the top level computes a result, returns it, and the console auto-prints everything that is returned. Inside a for loop, or in a script that you source, returned values are not auto-printed. This is why I put the results into a list and printed the list afterwards.
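If you prefer to see each result as the loop runs, an equivalent sketch is to print explicitly inside the loop:
for (i in 2:5) {
    print(t.test(info[[i]] ~ info$Groups, conf.level = 0.95))
}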
Btw, I stored your data as info, not as table. R copes fine with variable names that are also function names, but every now and then you will have a hard time reading error messages, so avoid naming variables table or matrix or c or df.
Using apply functions you could also do:
res <- cbind(
    do.call(rbind, apply(info[,-1], 2, function(cv)
        t.test(cv ~ info$Groups, conf.level = 0.95)[c("statistic","parameter","p.value")])),
    t(apply(info[,-1], 2, function(cv)
        unlist(t.test(cv ~ info$Groups, conf.level = 0.95)[c("conf.int","estimate")])))
)
res
> res
       statistic parameter   p.value  conf.int1 conf.int2 estimate.mean in group A estimate.mean in group B
Length 0.7327329  3.991849 0.5044236   -1.13263  1.943907                 2.175542                1.769904
Size   0.2339013  3.467515 0.8282072   -0.47739 0.5595231                 0.714917                0.6738504
Diet   0.9336103  3.823748 0.4056203 -0.7396173  1.468761                 1.460625                1.096053
place  0.9748978  3.162223  0.398155  -159.4359  306.2686                 337.4955                 264.0791

R: Cross validation on a dataset with factors

Often, I want to run cross validation on a dataset which contains some factor variables, and after running for a while the cross validation routine fails with the error: factor x has new levels Y.
For example, using package boot:
library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor x has new levels B
Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.
The createDataPartition function from the caret package does stratified sampling on the outcome variable and correctly warns:
Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
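For illustration, here is a quick sketch of that stratified split on the toy data from above (my own example, not from the caret docs):
library(caret)
set.seed(1)
train_idx <- createDataPartition(d$x, p = 0.5, list = FALSE)
table(d$x[train_idx]) # each level contributes about half its rows to the training set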
There are two solutions which spring to mind:
First, create a subset of the data by selecting one random sample of each factor level, starting from the rarest class (by frequency) and then greedily satisfying the next rarest class, and so on. Then use createDataPartition on the rest of the dataset and merge the results to create a new train dataset which contains all the levels.
Second, use createDataPartition and do rejection sampling.
So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand-rolled one.
Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
Please leave a comment if I should ask this question on stats.stackexchange.com.
Update:
This is what my hand-rolled solution (2) looks like:
get.cv.idx <- function(train.data, folds, factor.cols = NA) {
    if (is.na(factor.cols)) {
        all.cols <- colnames(train.data)
        # llply/laply come from the plyr package
        factor.cols <- all.cols[laply(llply(train.data[1, ], class), function (x) 'factor' %in% x)]
    }
    n <- nrow(train.data)
    test.n <- floor(1 / folds * n)
    cond.met <- FALSE
    n.tries <- 0
    while (!cond.met) {
        n.tries <- n.tries + 1
        test.idx <- sample(nrow(train.data), test.n)
        train.idx <- setdiff(1:nrow(train.data), test.idx)
        cond.met <- TRUE
        for (factor.col in factor.cols) {
            train.levels <- train.data[ train.idx, factor.col ]
            test.levels <- train.data[ test.idx , factor.col ]
            if (length(unique(train.levels)) < length(unique(test.levels))) {
                cat('Factor level: ', factor.col, ' violated constraint, retrying.\n')
                cond.met <- FALSE
            }
        }
    }
    cat('Done in ', n.tries, ' tries.\n')
    list( train.idx = train.idx
        , test.idx = test.idx
        )
}
Everyone agrees that there surely is an optimal solution. But personally, I would just try the cv.glm call until it works, using while:
m.cv <- try(cv.glm(d, m, K=2)) # First try
class(m.cv) # Sometimes error, sometimes list
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K=2))
}
class(m.cv) # always list
I've tried it with 100,000 rows in the data.frame and it only takes a few seconds.
library(boot)
n <- 100000
d <- data.frame(x=c(rep('A', n), rep('B', n), 'C', 'C'), y=1:(n*2+2))
m <- glm(y ~ x, data=d)

m.cv <- try(cv.glm(d, m, K=2))
class(m.cv) # Sometimes error, sometimes list
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K=2))
}
class(m.cv) # always list
When I call traceback I get this:
> traceback()
9: stop(sprintf(ngettext(length(m), "factor %s has new level %s",
       "factor %s has new levels %s"), nm, paste(nxl[m], collapse = ", ")),
       domain = NA)
8: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
7: model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
6: predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==
       "link", "response", type), terms = terms, na.action = na.action)
5: predict.glm(d.glm, data[j.out, , drop = FALSE], type = "response")
4: predict(d.glm, data[j.out, , drop = FALSE], type = "response")
3: mean((y - yhat)^2)
2: cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE],
       type = "response"))
1: cv.glm(d, m, K = 2)
And looking at the cv.glm function gives:
> cv.glm
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
    K = n)
{
    call <- match.call()
    if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
        runif(1)
    seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    n <- nrow(data)
    out <- NULL
    if ((K > n) || (K <= 1))
        stop("'K' outside allowable range")
    K.o <- K
    K <- round(K)
    kvals <- unique(round(n/(1L:floor(n/2))))
    temp <- abs(kvals - K)
    if (!any(temp == 0))
        K <- kvals[temp == min(temp)][1L]
    if (K != K.o)
        warning(gettextf("'K' has been set to %f", K), domain = NA)
    f <- ceiling(n/K)
    s <- sample0(rep(1L:K, f), n)
    n.s <- table(s)
    glm.y <- glmfit$y
    cost.0 <- cost(glm.y, fitted(glmfit))
    ms <- max(s)
    CV <- 0
    Call <- glmfit$call
    for (i in seq_len(ms)) {
        j.out <- seq_len(n)[(s == i)]
        j.in <- seq_len(n)[(s != i)]
        Call$data <- data[j.in, , drop = FALSE]
        d.glm <- eval.parent(Call)
        p.alpha <- n.s[i]/n
        cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out,
            , drop = FALSE], type = "response"))
        CV <- CV + p.alpha * cost.i
        cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm,
            data, type = "response"))
    }
    list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)),
        seed = seed)
}
It seems the problem has to do with your extremely small sample size and the categorical effect (with values "A", "B", and "C"). You are fitting a glm with two dummy effects: "B vs. A" and "C vs. A". In each CV iteration, cv.glm holds out one fold and refits the model d.glm on the remaining rows (see Call$data <- data[j.in, , drop = FALSE] above). Given the size, some random fold assignments are bound to put both "C" rows into the held-out fold; the error then arises when computing fitted probabilities for validation data containing the level "C" that was never observed in the training data.
Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modeling Strategies that one ought to avoid split-sample validation when the sample size is small and/or some cell counts are small in categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.
Given the small sample size here, you should consider alternatives to split-sample cross validation, such as a permutation test or a parametric bootstrap. Another important consideration is exactly why you feel model-based inference isn't correct. As Tukey said of the bootstrap, he'd like to call it a shotgun: it will blow the head off of any problem, as long as you're willing to reassemble the pieces.
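That said, if you do want K-fold splits that are guaranteed to keep every level in the training data, a minimal sketch of my own (not from boot or caret) is to assign folds separately within each level of the factor; a level with at least two observations then spans at least two folds, so every training set (the complement of a fold) contains it:
stratified_folds <- function(f, K) {
    # assign each row to one of K folds, balanced within each level of f
    fold <- ave(seq_along(f), f, FUN = function(i) sample(rep_len(1:K, length(i))))
    split(seq_along(f), fold)
}
set.seed(42)
folds <- stratified_folds(d$x, K = 2)
lapply(folds, function(test) table(d$x[-test])) # every training set keeps all levels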
There don't seem to be many simple solutions around the web, so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and caret, but you could get away with just base R if you really wanted.
To use cross-validation when you have multiple factors, follow a two-step process: convert the factors to numerics and multiply them together, then use this new variable as the target variable in a stratified sampling function. Be sure to remove it or keep it out of your training set after creating your folds.
If y is your DV and x is a factor then:
#Simulated factors (which are conveniently distributed for the example)
dataset <- data.frame(x = as.factor(rep(c(1, 10), 1000)),
                      y = as.factor(rep(c(1, 2, 3, 4), 250)[sample(1000)]))

#Convert the factors to numerics and multiply together in new variable
dataset$cv.variable <- as.numeric(levels(dataset$x))[dataset$x] *
    as.numeric(levels(dataset$y))[dataset$y]

prop.table(table(dataset$y)) #One way to view distribution of levels
ftable(dataset$x, dataset$y) #A full table of all x and y combinations

folds <- caret::createFolds(dataset$cv.variable, k = 10)
testIndexes <- folds[[1]] #pick one fold (e.g. the first) as the test set
testData <- as.data.frame(dataset[testIndexes, ])
trainData <- as.data.frame(dataset[-testIndexes, ])

prop.table(table(testData$y))
ftable(testData$x, testData$y) #evaluate distribution
This should produce a result that is close to balanced.
Note: In real life, if your sample lacks the requisite unique combinations of factors, then your problem is harder to overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.
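As a small illustration of the first option (hypothetical code, building on the dataset above), you could drop combinations that occur only once before creating the folds:
counts <- table(dataset$cv.variable)
rare <- as.numeric(names(counts[counts < 2]))
dataset <- droplevels(dataset[!dataset$cv.variable %in% rare, ])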

Error with multiscale hierarchical clustering in R

I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why am I getting these errors from pvclust at 3 dimensions and below, but not from hclust? Furthermore, why do the errors disappear for data sets of 4 or more dimensions?
At the end of the pvclust function we see the line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
    nboot = nboot, method.dist = method.dist, use.cor = use.cor,
    method.hclust = method.hclust, store = store, weight = weight)
then digging deeper we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
    nboot, store, weight = F)
{
    n <- nrow(data)
    size <- round(n * r, digits = 0)
    ....
    smpl <- sample(1:n, size, replace = TRUE)
    suppressWarnings(distance <- dist.pvclust(data[smpl, ],
        method = method.dist, use.cor = use.cor))
    ....
}
Also note that the default value of the parameter r for pvclust is r=seq(.5,1.4,by=.1), and, as we can see in the output, this value gets adjusted along the way:
Bootstrap (r = 0.33)...
So what we get is size <- round(3 * 0.33, digits = 0), which is 1; data[smpl, ] then has only 1 row, fewer than the 2 objects hclust requires. After correcting r, the call still emits a warning, which is probably harmless, and output is produced too:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
Let me know if the results are satisfactory.
