How do I change confidence interval calculation to log-log in R?

If I want to calculate confidence intervals using the coxph and confint functions, how do I change the confidence interval calculation to log-log? My understanding is that log is the default.
I tried conf.type="log-log", but it did not work; I just got an error message:
library(survival)
coxph(formula = Surv(futime, fustat) ~ tx, data = tki, conf.type = "log-log")
fit <- coxph(formula = Surv(futime, fustat) ~ tx, data = tki)
summary(fit)
# output provides HR CIs
confint(fit)
# coefficient CIs (log-HR scale)
exp(confint(fit))
# same CIs on the HR scale
> dput(tki)
structure(list(futime = c(9.26, 11.06, 2.35, 3.75, 12.4, 10.3, 8.11, 7.29, 6.75, 6.56, 0.26, 1.9, 0.34, 1.63, 1.55, 1.6, 4.78, 2.65, 1.72, 3.63), fustat = c(1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1), tx = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L))
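No answer was recorded here, but one likely reason for the error (my reading, not confirmed in the post) is that conf.type is an argument of survfit(), not of coxph(). A minimal sketch under that assumption:
library(survival)
fit <- coxph(Surv(futime, fustat) ~ tx, data = tki)
# conf.type belongs to survfit(); "log-log" intervals apply to the
# survival curve estimated from the fitted Cox model
sf <- survfit(fit, conf.type = "log-log")
summary(sf)
As far as I know, confint(fit) itself always returns Wald intervals for the coefficients; the conf.type setting only changes the confidence intervals around the survival curve.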

Related

How do I deal with this error when using the intEff() function in R on a probit model?

probit5 <- glm(shot_success ~ age + risk + loc + self_est + compet + regfoc + self_eff +
                 first_mover + gender + ten_throws + incentive + score_diff + first_mover*gender,
               family = binomial(link = "probit"),
               data = data)
intEff(probit5, vars = c("first_mover", "gender"), data = data)
Error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In cbind(deriv1, deriv2, deriv3, nn, deriv0) :
number of rows of result is not a multiple of vector length (arg 4)
Reproducible example:
example <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), id_opp = c(2,
2, 2, 2, 2, 2, 2, 2, 2, 2), shot_success = c(0, 1, 0, 0, 0, 0,
0, 0, 1, 0), tiebreak = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), score_diff = c(0,
0, 1, 0, 0, 0, 0, 0, 0, -2), first_mover = c(1, 0, 1, 0, 1, 0,
1, 0, 1, 0), incentive = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ten_throws = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1), pre_training = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), gender = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), age = c(18,
18, 18, 18, 18, 18, 18, 18, 18, 18), self_est = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), risk = c(3.6667, 3.6667, 3.6667, 3.6667, 3.6667,
3.6667, 3.6667, 3.6667, 3.6667, 3.6667), loc = c(4.8, 4.8, 4.8,
4.8, 4.8, 4.8, 4.8, 4.8, 4.8, 4.8), self_eff = c(2.8, 2.8, 2.8,
2.8, 2.8, 2.8, 2.8, 2.8, 2.8, 2.8), compet = c(4.54, 4.54, 4.54,
4.54, 4.54, 4.54, 4.54, 4.54, 4.54, 4.54), regfoc = c(3.6364,
3.6364, 3.6364, 3.6364, 3.6364, 3.6364, 3.6364, 3.6364, 3.6364,
3.6364)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
probit_example <- glm(shot_success ~ gender + age + risk + loc + self_est + compet + regfoc +
                        self_eff + first_mover + ten_throws + incentive + score_diff + first_mover*gender,
                      family = binomial(link = "probit"),
                      data = example)
intEff(probit_example, vars = c("first_mover", "gender"), data = example)
Using only these few rows returns a different error. When using all 1040 rows of my data, it returns the error mentioned above, but including all rows here would make the post too long...
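A side observation on the reproducible example (mine, not part of the original question): these ten rows come from a single thrower, so nearly every predictor in the model (gender, age, risk, loc, self_est, compet, regfoc, self_eff, ten_throws, incentive) is constant, glm() returns NA coefficients for the inestimable terms, and intEff() plausibly fails differently on that than on the full data. A quick check:
# Hypothetical sanity check, not from the original post: count distinct
# values per column, then look for NA coefficients from inestimable terms
sapply(example, function(x) length(unique(x)))
coef(probit_example)  # NA entries flag terms glm() could not estimate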

Can't use survfit on some data.frames

I have a dataset I'm going to use for survival analysis, and it seems to work fine when I use the whole set. However, once I slice it into smaller data frames using data[which(data$variable1 == "somevalue"), ], things seem to break down.
Most of the resulting smaller dataframes work fine, but some are a problem. In the problematic ones, I can use summary(survfit(Surv(time, status)~variable2, data=smalldataframe))$surv without a problem, but when I try summary(survfit(Surv(time, status)~variable2, data=smalldataframe), time=5)$surv, it throws Error in array(xx, dim = dd) : negative length vectors are not allowed.
I've tried looking at the data to see if I have any weird values, like negative times, but there aren't any. Besides, if there were a problem with that, the full data frame should throw an error too, but it doesn't. All the smaller data frames are created with the same line of code, so I also don't understand why they behave differently. And mostly, I don't understand why summary(survfit(...))$surv works fine, as does plot(survfit(...)), but when I want the survival at a specific time, it suddenly doesn't like the data anymore.
Here's one of the offending data frames:
test <-
structure(list(time2 = c(0.15, 2.08, 2.06, 0.32, 39.45, 39.09,
2.57, 3.64, 13.57, 36.57, 36.26, 0.78, 0.1, 33.94, 3.1, NA, 1.77,
28.38, 1.24, NA, 1.87, 25.83, 2.62, 1.57, 1.6, 22.74, 21.03,
20.54, 20.03, 0.97, 19.35, 18.09, 2.61, 17.68, NA, 3.85, 3.52,
11.22, 11.52, 11.04, 10.51, 1.68, 10.4, 10.61, 9.01, 9.05, 7.8,
0.11, 4.83), status = c(1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
0, 1, NA, 1, 1, 1, NA, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1,
0, NA, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0), cas_dg = c(1,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8,
8, 9, 9, 9, 9, 9)), .Names = c("time2", "status", "cas_dg"), row.names = c(NA, -49L), class = "data.frame")
The call that is giving me trouble is summary(survfit(Surv(time2, status) ~ cas_dg, data = test), time = 5)$surv, and that only with some of the smaller data frames.
You need to use the argument extend = TRUE in summary(); according to ?summary.survfit:
extend: logical value: if TRUE, prints information for all specified
‘times’, even if there are no subjects left at the end of the
specified ‘times’. This is only valid if the ‘times’
argument is present.
So for your sample data, you can do:
fit <- survfit(Surv(time2, status) ~ cas_dg, data = test)
summary(fit, time = 5, extend = TRUE)$surv
#[1] 0.0000000 0.0000000 0.5555556 0.5000000 0.3333333 0.5714286 0.6000000
#[8] 0.6666667 0.8000000
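Why this fails only for some subsets (my interpretation, not part of the original answer): any stratum whose last observed time falls before the requested time has nothing to report at time = 5 unless extend = TRUE. The affected strata are easy to spot:
# Strata whose maximum observed time is below 5 are the ones that break
# summary(fit, time = 5) when extend = TRUE is omitted:
tapply(test$time2, test$cas_dg, max, na.rm = TRUE)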

Somers' D differences between R and SAS and within R

I am new to both R and SAS. I want to calculate Somers' D following a logistic regression. My data frame (vac1) is a combination of Titer and Protection.
> vac1 = structure(list(Titer = c(0.9, 0.9, 0.9, 1.51, 0.9, 0.9, 2.86, 1.95, 2.71, 2.56, 2.71, 3.01, 2.71, 2.41, 2.11, 1.95, 2.26, 2.71, 2.56, 2.41, 2.56, 1.95, 1.81, 2.26, 2.11, 1.81, 1.95, 1.95, 1.34, 2.56, 2.26, 2.26, 2.11, 2.41, 2.71, 2.56, 1.65, 1.95, 1.51, 1.95, 1.81, 1.81, 1.81, 1.95, 2.11, 2.86, 2.41, 1.95, 2.56, 2.71, 2.71, 2.41, 1.81, 2.41, 1.65, 1.81, 2.11, 2.11, 1.81, 1.81, 2.26, 2.41, 1.65, 2.56, 2.71, 2.11, 1.81), Protection = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0)), .Names = c("Titer", "Protection"), row.names = c(NA, -67L), class = "data.frame")
My logistic regression call is:
> logit = glm(Protection ~ Titer, data = vac1, family = binomial(link = "logit"))
The predicted probabilities from the logit model are combined with the original Protection data from the vac1 data frame to create the vac4 data frame. (Note: cbind() on two vectors returns a matrix, on which the later $ indexing would fail, so a data frame is built here instead.)
> vac4 = data.frame(Protection = vac1$Protection, PredictedProb = fitted(logit))
I calculated Somers' D in two ways.
1. Using the InformationValue package:
> library(InformationValue)
> somersD(actuals = vac4$Protection, predictedScores = vac4$PredictedProb)
I got the value 0.733.
2. Using a function copied from this link:
http://shashiasrblog.blogspot.in/2014/02/binary-logistic-regression-fast.html
OptimisedConc = function(logit)
{
  Data = vac4  # uses vac4 from the global environment
  ones = Data[Data[, 1] == 1, ]   # observations with Protection == 1
  zeros = Data[Data[, 1] == 0, ]  # observations with Protection == 0
  conc = matrix(0, dim(zeros)[1], dim(ones)[1])
  disc = matrix(0, dim(zeros)[1], dim(ones)[1])
  ties = matrix(0, dim(zeros)[1], dim(ones)[1])
  # compare the predicted probability of every 0/1 pair
  for (j in 1:dim(zeros)[1])
  {
    for (i in 1:dim(ones)[1])
    {
      if (ones[i, 2] > zeros[j, 2])
      {conc[j, i] = 1}
      else if (ones[i, 2] < zeros[j, 2])
      {disc[j, i] = 1}
      else if (ones[i, 2] == zeros[j, 2])
      {ties[j, i] = 1}
    }
  }
  Pairs = dim(zeros)[1] * dim(ones)[1]
  PercentConcordance = (sum(conc) / Pairs) * 100
  PercentDiscordance = (sum(disc) / Pairs) * 100
  PercentTied = (sum(ties) / Pairs) * 100
  N <- length(logit$fit)
  gamma <- (sum(conc) - sum(disc)) / Pairs
  Somers_D <- (sum(conc) - sum(disc)) / (Pairs - sum(ties))
  k_tau_a <- 2 * (sum(conc) - sum(disc)) / (N * (N - 1))
  return(list("Percent Concordance" = PercentConcordance,
              "Percent Discordance" = PercentDiscordance,
              "Percent Tied" = PercentTied,
              "Pairs" = Pairs,
              "Gamma" = gamma,
              "Somers D" = Somers_D,
              "Kendall's Tau A" = k_tau_a))
}
OptimisedConc(logit)
Here I get the gamma and Somers' D values, but they are reversed compared to what I got in SAS, and the Somers' D value from this second method differs both from SAS and from the value given by R's InformationValue package. Similarly, Kendall's tau comes out infinite in R, while in SAS it is 0.38.
Can anyone see where I am making a mistake? Thank you.
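No answer was recorded here, but two hedged observations: first, in the copied function the gamma and Somers' D formulas appear to be swapped relative to the usual definitions (gamma = (C − D)/(C + D), i.e. dividing by Pairs − ties, while Somers' D for a binary outcome divides by all Pairs), which would explain the reversal against SAS. Second, Hmisc::somers2() gives an independent cross-check:
# Independent cross-check (my suggestion, not from the original post):
# Hmisc::somers2() returns the concordance C and Somers' Dxy = 2 * (C - 0.5)
library(Hmisc)
somers2(fitted(logit), vac1$Protection)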

Plotting function goes wrong when adding new parameter

I'm a beginneR using R Studio with R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet" in Windows 7.
Data I'm using...
> dput(head(data,20))
structure(list(case = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1), age = c(37, 42, 44, 40, 26, 29, 42, 26,
18, 56, 29, 66, 71, 26, 30, 48, 39, 65, 65, 48), bmi = c(25.95,
29.07, 27.63, 27.4, 25.34, 31.38, 25.08, 28.01, 24.69, 25.06,
27.68, 23.51, 29.86, 21.72, 25.95, 22.86, 23.53, 21.3, 33.2,
29.39), ord.bmi = c(3, 3, 3, 3, 3, 4, 3, 3, 2, 3, 3, 2, 3, 2,
3, 2, 2, 2, 4, 3), alcohol = c(2, 2, 1, 1, 2, 1, 1, 1, 1, 1,
2, 1, 1, 1, 1, 1, 2, 2, 1, 1), tobacco = c(1, 1, 1, 2, 2, 1,
2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1), dent.amalgam = c(1,
2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1), exp.years = c(7,
9, 9, 5, 2, 10, 15, 5, 1, 40, 10, 50, 50, 1, 12, 22, 22, 30,
40, 30), mn = c(0, 0, 0, 1.5, 1.5, 1, 0, 0, 0.5, 0.5, 1, 1, 0,
0, 0, 0.5, 0, 0.5, 2, 1), bn = c(2.5, 5, 2.5, 2, 1.5, 4, 2, 1.5,
4.5, 4.5, 2.5, 2, 6, 2, 5, 4, 1, 1.5, 7, 1.5), ln = c(0.5, 1.5,
0, 2, 1.5, 1.5, 1, 0.5, 2, 2, 1, 1, 4.5, 0, 2, 1, 3, 2, 3, 3),
pn = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5,
0.5, 0.5, 0, 0), cc = c(0, 1, 0, 2, 2, 4, 1, 1.5, 4.5, 2,
0, 3.5, 2, 1.5, 2, 1.5, 0.5, 1, 2, 1.5), kr = c(0, 0, 0,
0, 0, 0, 0.5, 0, 0.5, 1, 0, 0.5, 1.5, 0.5, 0.5, 0.5, 0, 0.5,
0, 0), kl = c(0.5, 2, 0, 1.5, 1.5, 0, 2, 0, 2, 2, 0, 1.5,
1.5, 1, 4, 3, 2, 3.5, 4.5, 2)), .Names = c("case", "age",
"bmi", "ord.bmi", "alcohol", "tobacco", "dent.amalgam", "exp.years",
"mn", "bn", "ln", "pn", "cc", "kr", "kl"), row.names = c(NA,
20L), class = "data.frame")
I'm plotting two different densities (which I get using density.a <- lapply(data[which(data$case == 0),], density) and density.b <- lapply(data[which(data$case == 1),], density)), and everything seems to work fine:
plot.densities <- function(sample.a, sample.b){ # declaring the function arguments
  for(i in seq(length(sample.a))){ # for every element in the first argument (expected equal lengths)
    # set up the plot region from the combined ranges (note: the two range
    # points are drawn as dots; type = "n" would suppress them)
    plot(range(sample.a[[i]]$x, sample.b[[i]]$x),
         range(sample.a[[i]]$y, sample.b[[i]]$y),
         xlab = names(sample.a[i]), ylab = "Density", main = paste(names(sample.a[i]), "density plot"))
    lines(sample.a[[i]], col = "red")   # red lines
    lines(sample.b[[i]], col = "green") # green lines
  }
}
When I call the function, I get plots like this (image not reproduced here).
Then, when I want to fill the area between the two curves, I add a polygon() call, so it looks like this:
filled.plot <- function(sample.a, sample.b){ # declaring the function arguments
  for(i in seq(length(sample.a))){ # for every element in the first argument (expected equal lengths)
    plot(range(sample.a[[i]]$x, sample.b[[i]]$x), # generate a plot
         range(sample.a[[i]]$y, sample.b[[i]]$y),
         xlab = names(sample.a[i]), ylab = "Density",
         main = paste(names(sample.a[i])))
    lines(sample.a[[i]], col = "red")   # red lines
    lines(sample.b[[i]], col = "green") # green lines
    polygon(x = c(range(sample.a[[i]]$x, sample.b[[i]]$x),
                  rev(range(sample.a[[i]]$x, sample.b[[i]]$x))),
            y = c(range(sample.a[[i]]$y, sample.b[[i]]$y),
                  rev(range(sample.a[[i]]$x, sample.b[[i]]$x))),
            col = "skyblue")
  }
}
But when I call the filled.plot function, the plots come out wrong (image not reproduced here).
I'm stuck, and some help would be just fine!
Thanks in advance.
Try it with ggplot2 (I have changed the case value of rows 11:20 to 2):
library(ggplot2)
ggplot() +
  geom_density(data = testdf[testdf$case == 1, ], aes(age), fill = 'red', alpha = 0.5) +
  geom_density(data = testdf[testdf$case == 2, ], aes(age), fill = 'green', alpha = 0.5)
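As an aside on the original base-R attempt (my diagnosis, not part of the answer above): the polygon() call passes x ranges where y values belong, and range() only yields the two extremes rather than the full curves. A sketch of a corrected shading step, assuming the two densities should be interpolated onto a common grid:
# Sketch of a base-R fix: interpolate both densities onto a shared grid,
# then close the polygon by walking one curve forward and the other back
shade.between <- function(da, db) {
  xs <- seq(max(min(da$x), min(db$x)), min(max(da$x), max(db$x)), length.out = 512)
  ya <- approx(da$x, da$y, xout = xs)$y
  yb <- approx(db$x, db$y, xout = xs)$y
  plot(range(xs), range(ya, yb), type = "n", xlab = "x", ylab = "Density")
  polygon(c(xs, rev(xs)), c(ya, rev(yb)), col = "skyblue", border = NA)
  lines(da, col = "red")
  lines(db, col = "green")
}
# e.g. shade.between(density.a$age, density.b$age)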

Multiple stat_function on grouped data with ggplot2

I am studying a data set with multiple observations of a parameter over time. The data look like this:
test<-data.frame(t = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.33, 0.67, 0.67, 0.67, 0.67, 0.67, 0.67, 0.67, 0.67, 0.67, 0.67, 0.67, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.33, 1.67, 1.67, 1.67, 1.67, 1.67, 1.67, 1.67, 1.67, 1.67, 1.67, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), int = c(76.44609375, 94.6619686800895, 112.148907103825, 75.1003097802036, 74.1037037037037, 76.7526662128432, 74.0734830988873, 87.9052100068855, 81.0525931336742, 92.1907873244038, 84.0708929788684, 88.8232221775814, 98.1323678006063, 115.175322139789, 91.2653104925053, 76.3661620658949, 152.637799717913, 107.054702135631, 83.4693197755961, 91.658991910392, 81.3991787335206, 106.153762268266, 100.919789842382, 67.2119436084271, 137.558914728682, 89.1182608695652, 156.10352233677, 108.180911207183, 87.9794680354643, 77.7501400560224, 80.7675382653061, 95.6662793399954, 92.5649630541872, 88.3301402668491, 84.3891875746714, 76.4318673395818, 111.413893510815, 82.4753828420879, 119.099190283401, 192.539417212559, 208.49203187251, 106.919937512205, 105.370936371214, 180.028767711464, 130.29369773608, 170.193357597816, 172.703180212014, 178.061569518042, 182.097607918614, 227.066976984743, 153.856101031661, 432.991580916745, 299.143735224586, 144.118156808803, 396.36644895153, 334.538796516231, 350.186359610275, 200.781101530882, 279.866079790223, 122.542700519331, 235.199555308505, 204.924140655867, 229.181848967152, 225.542753383955, 468.308974987739, 269.306058221873, 229.969282013323, 255.553846153846, 621.021220159151, 255.017211703959, 396.658265826583, 273.300663227708, 232.449965010497, 303.343894502483, 276.952483801296, 327.419805194805, 241.136864249474, 457.961489497136, 498.901714285714, 280.9558101473, 322.089588377724, 386.754533152909, 364.356809338521, 340.416035518412, 428.482916666667, 668.447197400487, 387.671341748481, 471.049545829893, 255.8802020688, 361.979536152797, 192.224629418472, 284.088954468803, 170.763997760358, 237.869065100343, 365.08237271854, 294.266488413547, 718.279750479846, 211.599427030671, 294.045375597047, 207.099267015707, 194.209973045822, 251.306358381503, 190.786794766966, 400.396083385976, 183.133240482823, 130.442107867392, 167.231452991453, 345.110896351776, 299.304645622394, 192.078204692282, 121.273544841369, 153.996295438759, 97.6034616378197, 362.80049522462, 130.498551774077, 106.031656035908, 117.682936668011, 90.1247837370242, 140.855475040258, 169.050049067713, 244.290241606527, 120.603356419819, 173.413333333333, 125.896389002872, 206.543873212215, 186.668320340184, 85.0988108720272, 106.57849117175, 102.867232728676, 216.232957110609, 86.6538461538462, 149.459777852575, 212.498573059361, 93.3816390633923, 105.567730417318, 120.095470383275, 137.205696941396, 141.156985871272, 90.578857338351, 84.8457760314342, 127.092660685395, 136.859870967742, 188.406440382942, 86.0879705400982))
class(test)
I managed to plot the density for each time point using:
ggplot(test, aes(int, group = as.factor(t),colour=t))+ geom_density()
But I would like to make the same graph plotting a log-normal fit of each density instead.
I know how to plot the log-normal fit for a single time point, using fitdistr and passing its parameters to stat_function with this code:
library(MASS)
fit <- fitdistr(subset(test, t == 0, select = 'int')$int, "lognormal")
# 'mean' and 'sd' partially match dlnorm's meanlog and sdlog arguments
ggplot(data = subset(test, t == 0, select = 'int'), aes(x = int)) +
  stat_function(fun = dlnorm, args = list(mean = fit$estimate[1], sd = fit$estimate[2]))
But how can I do this for all t, with the colour of each line given by the value of t? Is it possible to provide a function in the args list?
I thought of another naive solution: Predicting the values of every dlnorm().
## Split up the data according to t
tt <- split(test, test$t)
## Fit a lognormal to every dataset
fits <- lapply(tt, function(x) fitdistr(x$int, "lognormal"))
## Predict values
fitted <- lapply(fits, function(x) dlnorm(x = 1:max(test$int),
                                          mean = x$estimate[1], sd = x$estimate[2]))
## Wrap everything into a data.frame ggplot can handle
plot.data <- data.frame(y = unlist(fitted), int = 1:max(test$int),
                        t = rep(unique(test$t),
                                each = length(unlist(fitted)) / length(unique(test$t))))
## Plot
ggplot(test, aes(int, group = as.factor(t), colour = t)) +
  #geom_density() +
  geom_line(data = plot.data, aes(y = y), lwd = 1)
What about a naive solution, adding stat_function() layers iteratively?
library(RColorBrewer)
cols <- brewer.pal(length(unique(test$t)), "Set1")
g <- ggplot(data = subset(test, t == 0, select = 'int'), aes(x = int))
n <- 1
for(i in unique(test$t)){
  fit <- fitdistr(subset(test, t == i, select = 'int')$int, "lognormal")
  g <- g + stat_function(fun = dlnorm,
                         args = list(mean = fit$estimate[1], sd = fit$estimate[2]),
                         col = cols[n])
  n <- n + 1
}
g
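One caveat with this loop (my note, not from the original answer): Set1 in RColorBrewer tops out at 9 colours, while test$t has 11 unique values, so brewer.pal() will warn and return fewer colours than needed. A colour ramp avoids that:
# Interpolate the 9 Set1 colours up to one colour per unique t value
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(9, "Set1"))(length(unique(test$t)))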
