How to predict survival probabilities in R?

I have the veteran data set from R's survival package. I created a survival model and now wish to predict survival probabilities. For example, what is the probability that a patient with karno = 80, diagtime = 10, age = 65, prior = 10 and trt = 2 lives longer than 100 days?
In this case the design vector is x = (1,0,1,0,80,10,65,10,2).
Here is my code:
library(survival)
attach(veteran)
weibull <- survreg(Surv(time,status)~celltype + karno+diagtime+age+prior+trt ,dist="w")
and here is the output:
Any idea how to predict the survival probabilities?

You can get predict.survreg to produce predicted survival times for an individual case (whose covariate values you supply via newdata) at varying quantiles:
casedat <- list(celltype="smallcell", karno =80, diagtime=10, age= 65 , prior=10 , trt = 2)
predict(weibull, newdata=casedat, type="quantile", p=(1:98)/100)
[1] 1.996036 3.815924 5.585873 7.330350 9.060716 10.783617
[7] 12.503458 14.223414 15.945909 17.672884 19.405946 21.146470
[13] 22.895661 24.654597 26.424264 28.205575 29.999388 31.806521
[19] 33.627761 35.463874 37.315609 39.183706 41.068901 42.971927
[25] 44.893525 46.834438 48.795420 50.777240 52.780679 54.806537
[31] 56.855637 58.928822 61.026962 63.150956 65.301733 67.480255
[37] 69.687524 71.924578 74.192502 76.492423 78.825521 81.193029
[43] 83.596238 86.036503 88.515246 91.033959 93.594216 96.197674
[49] 98.846083 **101.541291** 104.285254 107.080043 109.927857 112.831032
[55] 115.792052 118.813566 121.898401 125.049578 128.270334 131.564138
[61] 134.934720 138.386096 141.922598 145.548909 149.270101 153.091684
[67] 157.019655 161.060555 165.221547 169.510488 173.936025 178.507710
[73] 183.236126 188.133044 193.211610 198.486566 203.974520 209.694281
[79] 215.667262 221.917991 228.474741 235.370342 242.643219 250.338740
[85] 258.511005 267.225246 276.561118 286.617303 297.518110 309.423232
[91] 322.542621 337.160149 353.673075 372.662027 395.025122 422.263020
[97] 457.180183 506.048094
#asterisks added
You can then find the first predicted time that exceeds the specified 100 days; it falls at roughly the 50th percentile, just as one might expect from a homework question.
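For example, a small sketch (reusing the fitted weibull model and casedat from above) that locates the first quantile whose predicted time exceeds 100 days:
q_times <- predict(weibull, newdata=casedat, type="quantile", p=(1:98)/100)
min(which(q_times > 100)) / 100   # 0.50 here, so P(T > 100) is roughly 0.5 for this patient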
png()
plot(x = predict(weibull, newdata = casedat, type = "quantile", p = (1:98)/100),
     y = (1:98)/100, type = "l")
dev.off()
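If you want the survival probability itself rather than quantiles, here is a minimal sketch (assuming the weibull fit and casedat above): survreg's Weibull parameterization has shape 1/weibull$scale and scale exp(linear predictor), so P(T > 100) can be read off pweibull directly:
lp <- predict(weibull, newdata=casedat, type="lp")   # linear predictor for this patient
pweibull(100, shape = 1/weibull$scale, scale = exp(lp), lower.tail = FALSE)   # P(T > 100), roughly 0.5, consistent with the quantiles above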

Related

ANOVA with multiple factors not working

I am trying to analyze my data using the two factors depth and date of sampling. The code used and the levels produced are listed below.
date <- factor(Amount_allSampling_Depth$date)
depth <- factor(Amount_allSampling_Depth$depth)
depth_date <- date:depth
levels(depth_date)
[1] "2016-08-08:5" "2016-08-08:10" "2016-08-08:15" "2016-08-08:20" "2016-08-15:5"
[6] "2016-08-15:10" "2016-08-15:15" "2016-08-15:20" "2016-08-22:5" "2016-08-22:10"
[11] "2016-08-22:15" "2016-08-22:20" "2016-08-29:5" "2016-08-29:10" "2016-08-29:15"
[16] "2016-08-29:20" "2016-09-05:5" "2016-09-05:10" "2016-09-05:15" "2016-09-05:20"
[21] "2016-09-12:5" "2016-09-12:10" "2016-09-12:15" "2016-09-12:20" "2016-09-19:5"
[26] "2016-09-19:10" "2016-09-19:15" "2016-09-19:20" "2016-10-03:5" "2016-10-03:10"
[31] "2016-10-03:15" "2016-10-03:20" "2016-10-10:5" "2016-10-10:10" "2016-10-10:15"
[36] "2016-10-10:20" "2016-10-17:5" "2016-10-17:10" "2016-10-17:15" "2016-10-17:20"
[41] "2016-10-24:5" "2016-10-24:10" "2016-10-24:15" "2016-10-24:20"
When I try to do an ANOVA on one of my classes of data, e.g. Dinoflagellates:
Dinoflagellates <- Amount_allSampling_Depth$Dinoflagellates
anova(lm(Dinoflagellates~depth_date))
I got the warning message:
ANOVA F-tests on an essentially perfect fit are unreliable.
Could someone help me figure out how to make this work so I can run the multiple-factor analysis?
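One possible direction, sketched here rather than taken from the original thread: if (as the warning suggests) there is only one observation per date:depth combination, the one-way model on the interaction factor fits every cell exactly and leaves no residual degrees of freedom. Fitting the additive two-factor model instead leaves residual degrees of freedom for the F-tests:
# uses the date, depth and Dinoflagellates objects created above
fit <- lm(Dinoflagellates ~ date + depth)
anova(fit)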

Creating a loop with character variables on R - for two-sample t.test

I am looking to do multiple two-sample t-tests in R.
I want to test 50 indicators against a grouping factor f that has two levels. At first I used:
t.test(m~f)
Welch Two Sample t-test
data: m by f
t = 2.5733, df = 174.416, p-value = 0.01091
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.05787966 0.43891600
sample estimates:
mean in group FSS mean in group NON-FSS
0.8344209 0.5860231
Here m corresponds to the first indicator I want to test, m = Debt.to.equity.ratio.
Here is a list of all the indicators I need to test:
print (indicators)
[1] "Debt.to.equity.ratio" "Deposits.to.loans"
[3] "Deposits.to.total.assets" "Gross.loan.portfolio.to.total.assets"
[5] "Number.of.active.borrowers" "Percent.of.women.borrowers"
[7] "Number.of.loans.outstanding" "Gross.loan.portfolio"
[9] "Average.loan.balance.per.borrower" "Average.loan.balance.per.borrower...GNI.per.capita"
[11] "Average.outstanding.balance" "Average.outstanding.balance...GNI.per.capita"
[13] "Number.of.depositors" "Number.of.deposit.accounts"
[15] "Deposits" "Average.deposit.balance.per.depositor"
[17] "Average.deposit.balance.per.depositor...GNI.per.capita" "Average.deposit.account.balance"
[19] "Average.deposit.account.balance...GNI.per.capita" "Return.on.assets"
[21] "Return.on.equity" "Operational.self.sufficiency"
[23] "FSS" "Financial.revenue..assets"
[25] "Profit.margin" "Yield.on.gross.portfolio..nominal."
[27] "Yield.on.gross.portfolio..real." "Total.expense..assets"
[29] "Financial.expense..assets" "Provision.for.loan.impairment..assets"
[31] "Operating.expense..assets" "Personnel.expense..assets"
[33] "Administrative.expense..assets" "Operating.expense..loan.portfolio"
[35] "Personnel.expense..loan.portfolio" "Average.salary..GNI.per.capita"
[37] "Cost.per.borrower" "Cost.per.loan"
[39] "Borrowers.per.staff.member" "Loans.per.staff.member"
[41] "Borrowers.per.loan.officer" "Loans.per.loan.officer"
[43] "Depositors.per.staff.member" "Deposit.accounts.per.staff.member"
[45] "Personnel.allocation.ratio" "Portfolio.at.risk...30.days"
[47] "Portfolio.at.risk...90.days" "Write.off.ratio"
[49] "Loan.loss.rate" "Risk.coverage"
Instead of changing the indicator name in the t-test each time, I would like to create a loop that does it automatically and collects the p-values. I've tried writing a loop but can't make it work because the indicator names are character strings.
I would really appreciate any tips on how to go forward!
Thank you very much!
Best
Morgan
I am assuming you are testing each indicator against the same grouping factor f.
In that case, you can try something like:
p_vals <- NULL
for (this_indicator in indicators) {
  # build a formula such as Debt.to.equity.ratio ~ f
  this_formula <- as.formula(paste(this_indicator, "f", sep = " ~ "))
  res <- t.test(this_formula)
  p_vals <- c(p_vals, res$p.value)
}
One comment, however: are you doing any multiplicity adjustment for these p-values? Given the large number of tests you are doing, there is a good chance you will be showered with false positives.
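If you do decide to adjust them, a short follow-up sketch on the p_vals collected above:
# adjust the collected p-values for multiple testing, e.g. Benjamini-Hochberg FDR
p_adj <- p.adjust(p_vals, method = "BH")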

Dividing components of a vector into several data points in R

I am trying to take a vector of length n (say, 14) and turn it into a vector of length N (say, 90). For example, my vector is
x <- c(5,3,7,11,12,19,40,2,22,6,10,12,12,4)
and I want to turn it into a vector of length 90 by creating 90 equally "spaced" points along this vector (think of x as a function). Is there any way to do that in R?
Something like this?
> x<-c(5,3,7,11,12,19,40,2,22,6,10,12,12,4)
> seq(min(x),max(x),length=90)
[1] 2.000000 2.426966 2.853933 3.280899 3.707865 4.134831 4.561798
[8] 4.988764 5.415730 5.842697 6.269663 6.696629 7.123596 7.550562
[15] 7.977528 8.404494 8.831461 9.258427 9.685393 10.112360 10.539326
[22] 10.966292 11.393258 11.820225 12.247191 12.674157 13.101124 13.528090
[29] 13.955056 14.382022 14.808989 15.235955 15.662921 16.089888 16.516854
[36] 16.943820 17.370787 17.797753 18.224719 18.651685 19.078652 19.505618
[43] 19.932584 20.359551 20.786517 21.213483 21.640449 22.067416 22.494382
[50] 22.921348 23.348315 23.775281 24.202247 24.629213 25.056180 25.483146
[57] 25.910112 26.337079 26.764045 27.191011 27.617978 28.044944 28.471910
[64] 28.898876 29.325843 29.752809 30.179775 30.606742 31.033708 31.460674
[71] 31.887640 32.314607 32.741573 33.168539 33.595506 34.022472 34.449438
[78] 34.876404 35.303371 35.730337 36.157303 36.584270 37.011236 37.438202
[85] 37.865169 38.292135 38.719101 39.146067 39.573034 40.000000
Try this:
# data
x <- c(5,3,7,11,12,19,40,2,22,6,10,12,12,4)
# expected new length
N <- 90
# number of points between consecutive values (including both endpoints)
my.length.out <- round((N - length(x)) / (length(x) - 1)) + 1
# new data (note: shared endpoints are repeated, so length(x1) is only
# approximately N; here it is 91)
x1 <- unlist(lapply(1:(length(x) - 1), function(i)
  seq(x[i], x[i + 1], length.out = my.length.out)))
#plot
par(mfrow=c(2,1))
plot(x)
plot(x1)
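Another option, not from the original answers but worth sketching: approx() treats x as a function of its index and linearly interpolates it at exactly N equally spaced positions, without repeating segment endpoints:
x <- c(5,3,7,11,12,19,40,2,22,6,10,12,12,4)
x2 <- approx(x, n = 90)$y   # linear interpolation at 90 equally spaced positions
length(x2)   # 90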

R: Using for loop on data frame

I have a data frame, deflator.
I want to get a new data frame inflation which can be calculated by:
inflation[i] = 100 * (deflator[i] - deflator[i-4]) / deflator[i-4]
The data frame deflator has 71 numbers:
> deflator
[1] 0.9628929 0.9596746 0.9747274 0.9832532 0.9851884
[6] 0.9797770 0.9913502 1.0100561 1.0176906 1.0092516
[11] 1.0185932 1.0241043 1.0197975 1.0174097 1.0297328
[16] 1.0297071 1.0313232 1.0244618 1.0347808 1.0480411
[21] 1.0322142 1.0351968 1.0403264 1.0447121 1.0504402
[26] 1.0487097 1.0664664 1.0935239 1.0965951 1.1141851
[31] 1.1033155 1.1234482 1.1333870 1.1188136 1.1336276
[36] 1.1096461 1.1226584 1.1287245 1.1529588 1.1582911
[41] 1.1691221 1.1782178 1.1946234 1.1963453 1.1939922
[46] 1.2118189 1.2227960 1.2140535 1.2228828 1.2314258
[51] 1.2570788 1.2572214 1.2607763 1.2744415 1.2982076
[56] 1.3318808 1.3394186 1.3525902 1.3352815 1.3492751
[61] 1.3593859 1.3368135 1.3642940 1.3538567 1.3658135
[66] 1.3710932 1.3888638 1.4262185 1.4309707 1.4328823
[71] 1.4497201
This is a very tricky question for me.
I tried to do this using a for loop:
> d <- data.frame(deflator)
> for (i in 1:71) {d <-rbind(d,c(delfaotr ))}
I think I might be doing it wrong.
Why use data frames? This is a straightforward vector operation.
inflation <- 100 * (deflator[-(1:4)] - deflator[1:67]) / deflator[1:67]
I agree with @Fhnuzoag that your example suggests calculations on a numeric vector, not a data frame. Here's an additional way to do your calculations, taking advantage of the lag argument of the diff function (with indexes that match those in your question):
lagBy <- 4                                               # the number of indexes by which to lag
laggedDiff <- diff(deflator, lag = lagBy)                # the numerator above
theDenom <- deflator[seq_len(length(deflator) - lagBy)]  # the denominator above
inflation <- laggedDiff / theDenom                       # multiply by 100 for the percentage version
The first few results are:
head(inflation)
# [1] 0.02315470 0.02094710 0.01705379 0.02725941 0.03299085 0.03008297

Range standardization (0 to 1) in R [duplicate]

Possible Duplicate:
scale a series between two points in R
Does anyone know of an R function to perform range standardization on a vector? I'm looking to transform variables to a scale between 0 and 1, while retaining rank order and the relative size of separation between values.
Just to be clear, I'm not looking to standardize variables by mean centering and scaling by the SD, as is done in the function scale().
I tried the functions mmnorm() and rangenorm() in the package 'dprep', but these don't seem to do the job.
s = sort(rexp(100))
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
range01(s)
[1] 0.000000000 0.003338782 0.007572326 0.012192201 0.016055006 0.017161145
[7] 0.019949532 0.023839810 0.024421602 0.027197168 0.029889484 0.033039408
[13] 0.033783376 0.038051265 0.045183382 0.049560233 0.056941611 0.057552543
[19] 0.062674982 0.066001242 0.066420884 0.067689067 0.069247825 0.069432174
[25] 0.070136067 0.076340460 0.078709590 0.080393512 0.085591881 0.087540132
[31] 0.090517295 0.091026499 0.091251213 0.099218526 0.103236344 0.105724733
[37] 0.107495340 0.113332392 0.116103438 0.124050331 0.125596034 0.126599323
[43] 0.127154661 0.133392300 0.134258532 0.138253452 0.141933433 0.146748798
[49] 0.147490227 0.149960293 0.153126478 0.154275371 0.167701855 0.170160948
[55] 0.180313542 0.181834891 0.182554291 0.189188137 0.193807559 0.195903010
[61] 0.208902645 0.211308713 0.232942314 0.236135220 0.251950116 0.260816843
[67] 0.284090255 0.284150541 0.288498370 0.295515143 0.299408623 0.301264703
[73] 0.306817872 0.307853369 0.324882091 0.353241217 0.366800517 0.389474449
[79] 0.398838576 0.404266315 0.408936260 0.409198619 0.415165553 0.433960390
[85] 0.440690262 0.458692639 0.464027428 0.474214070 0.517224262 0.538532221
[91] 0.544911543 0.559945121 0.585390414 0.647030109 0.694095422 0.708385079
[97] 0.736486707 0.787250428 0.870874773 1.000000000
Adding ... will allow you to pass through na.rm = T if you want to omit missing values from the calculation (they will still be present in the results):
range01 <- function(x, ...){(x - min(x, ...)) / (max(x, ...) - min(x, ...))}
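A quick usage sketch of that variant (s is the sorted vector from above, with an NA appended for illustration):
s_na <- c(s, NA)
tail(range01(s_na, na.rm = TRUE))   # the appended NA stays NA; the other values are scaled as before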
