survfit for stratified cox-model - r

I have a stratified cox-model and want predicted survival-curves for certain profiles, based on that model.
Now, because I'm working with a large dataset with a lot of strata, I want predictions for very specific strata only, to save time and memory.
The help-page of survfit.coxph states: ... If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.
When I run the code below, where newdata does contain the stratum variable, I still get predictions for both strata, which contradicts the help page:
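library(survival)
# simulated data with two strata (A and B)
df <- data.frame(X1 = runif(200),
                 X2 = sample(c("A", "B"), 200, replace = TRUE),
                 Ev = sample(c(0, 1), 200, replace = TRUE),
                 Time = rexp(200))
# Cox model stratified on X2
testfit <- coxph(Surv(Time, Ev) ~ X1 + strata(X2), df)
# request a curve for a single profile in stratum A
out <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))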
Is there anything I fail to see or understand here?

I'm not sure if this is a bug or a feature in survival:::survfit.coxph. It looks like the intended behaviour in the code is that only requested strata are returned. In the function:
strata(X2) is evaluated in an environment containing newdata and the result, A is returned.
The full curve is then created.
There is then some logic to split the curve into strata, but only if result$surv is a matrix.
In your example it is not a matrix. I can't find any documentation on the expected usage of this if it's not a bug. Perhaps it would be worth dropping the author/maintainer a note.
maintainer("survival")
# [1] "Terry M Therneau <xxxxxxxx.xxxxx#xxxx.xxx>"
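A minimal sketch for inspecting what survfit actually returns, reusing the simulated data from the question. The seed and the two-row newdata are assumptions added for comparison, and the printed values may differ between versions of the survival package, so no expected output is shown.

```r
library(survival)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(X1 = runif(200),
                 X2 = sample(c("A", "B"), 200, replace = TRUE),
                 Ev = sample(c(0, 1), 200, replace = TRUE),
                 Time = rexp(200))
testfit <- coxph(Surv(Time, Ev) ~ X1 + strata(X2), data = df)

# One row of newdata: the case from the question.
out1 <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))

# Two rows of newdata, both in stratum A, for comparison.
out2 <- survfit(testfit,
                newdata = data.frame(X1 = c(0.6, 0.3), X2 = c("A", "A")))

# Inspect the structure of both results.
is.matrix(out1$surv)
out1$strata
is.matrix(out2$surv)
out2$strata
```

Comparing the strata component of the two results shows which strata each call returned.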

Some comments that may be helpful:
My example was not big enough (and I seem not to have read the related GitHub post very well, though that was after I posted my question here): if newdata has at least two rows (and, of course, the strata variable), predictions are returned only for the requested strata.
There is an inefficiency inside survfit.coxph: the baseline hazard is calculated for every stratum in the original dataset, not only for the requested strata (see my contribution to the same GitHub post). However, that doesn't seem to be a big issue: a test on a dataset with roughly half a million observations, 50% events and 1000 strata takes less than a minute.
The real problem is memory allocation somewhere during the calculations: in the above example, things collapse once I want predictions for 100 observations (one stratum each), while the final output of predictions for 80 is only a few MB.
My work-around:
Select all observations you want predictions for
use lp <- predict(..., type='lp') to get the linear predictor for all these observations
use survfit only on the first observation: survfit(fit, newdata = expand_grid(newdf, strat = strata_list))
Store the resulting survival estimates in a data.frame (or not, that's up to you)
To calculate predicted survival for other observations, use the PH-assumption (see formula below). This invokes the overhead of survfit.coxph only once and if you focus on survival on only a few times (e.g. 5- and 10-year survival), you can reduce the computer time even more
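The proportional-hazards relation used in the last step can be written as follows. This is the standard identity, not necessarily the author's exact notation: $H_{0s}(t)$ denotes the baseline cumulative hazard of stratum $s$ and $lp_i$ the linear predictor of observation $i$ (both symbols are assumptions introduced here).

```latex
S(t \mid x_i) = \exp\!\left\{-H_{0s}(t)\, e^{lp_i}\right\}
\quad\Longrightarrow\quad
S(t \mid x_i) = S(t \mid x_1)^{\exp(lp_i - lp_1)}
```

The relation only holds for observations in the same stratum as the reference observation 1, since the baseline hazard differs between strata. Because only the difference $lp_i - lp_1$ enters, any centering applied by predict(..., type='lp') cancels as long as both linear predictors use the same reference.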

Related

Cross-Validation in R using vtreat Package

Currently learning about cross validation through a course on DataCamp. They start the process by creating an n-fold cross validation plan. This is done with the kWayCrossValidation() function from the vtreat package. They call it as follows:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
Then, they suggest running a for loop as follows:
dframe$pred.cv <- 0
# k is the number of folds
# splitPlan is the cross validation plan
for(i in 1:k) {
# Get the ith split
split <- splitPlan[[i]]
# Build a model on the training data
# from this split
# (lm, in this case)
model <- lm(fmla, data = dframe[split$train,])
# make predictions on the
# application data from this split
dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
This results in a new column in the data frame with the predictions, per the last line of the above chunk of code.
My doubt is thus whether the predicted values on the data frame will be in fact averages of the 3 folds or if they will just be those of the 3rd run of the for loop?
Am I missing a detail here, or is this exactly what this code is doing, which would then defeat the purpose of the 3-fold cross validation or any-fold cross validation for that matter, as it will simply output the results of the last iteration? Shouldn't we be looking to output the average of all the folds, as laid out in the splitPlan?
Thank you.
I see there is confusion about the scope of K-fold cross-validation. The idea is not to average predictions over different folds, rather to average some measure of the prediction error, so to estimate test errors.
First of all, as you are new on SO, note that you should always provide some data to work with. As your question does not depend on specific data, I just simulated some. Still, it is good practice, and it helps us help you.
Check the following code, which slightly modifies what you have provided in the post:
library(vtreat)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y, pred.cv = NA)
# Folds.
nRows = dim(dta)[1]
nSplits = 3
splitPlan = kWayCrossValidation(nRows, nSplits)
# Fitting model on all folds but i-th.
for(i in 1:nSplits)
{
# Get the i-th split.
split = splitPlan[[i]]
# Build a model on the training data from this split.
model = lm(y ~ ., data = dta[split$train, -4])
# Make predictions on the application data from this split.
dta$pred.cv[split$app] = predict(model, newdata = dta[split$app, -4])
}
# Now compute an estimate of the test error using pred.cv.
mean((dta$y - dta$pred.cv)^2)
What the for loop does is fit a linear model on all folds but the i-th (i.e., on dta[split$train, -4]) and then use the fitted model to make predictions on the i-th fold (i.e., dta[split$app, -4]). At least, I am assuming that split$train and split$app serve such roles, as the documentation is really lacking (which usually is a bad sign). Notice I am removing the 4th column (dta$pred.cv), as it only pre-allocates space to store the predictions (it is not a feature!).
At each iteration, we are not filling the whole dta$pred.cv, but only a subset of it (the rows of the i-th fold, stored each time in split$app). Thus, at the end, that column holds one prediction per row, collected over the K iterations; nothing is overwritten by the last one.
The real rationale for cross-validation comes in here. Let me introduce the concepts of training, validation, and test set. In data analysis, the ideal is to have a data set so large that we can divide it into three subsamples. The first could then be used to train the algorithms (fitting models), the second to validate the models (tuning the models), and the third to choose the best model in terms of some performance measure (usually the mean squared error, or MSE, for regression).
However, we often do not have all these data points (especially if you are an economist). Thus, we seek an estimator for the test MSE, so that the need for splitting the data disappears. This is what K-fold cross-validation does: in turn, each fold is treated as the test set, and the union of all the others as the training set. Then, we make predictions as in your code (in the loop), and save them. What you are missing is the last line in the code I provided: the average of the MSE across folds. That provides us with an estimate of the test MSE, and we choose the model yielding the lowest value.
That being said, I had never heard of the vtreat package before. If you are into data analysis, I suggest having a look at the tidyverse and caret packages. As far as I know (and from what I see here on SO), they are widely used and very well documented. They may be worth learning.
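To check this without relying on vtreat, here is a minimal base-R sketch. The fold assignment via sample() and the names fold, times_filled and fold_mse are illustrative assumptions, not part of the course code.

```r
set.seed(1)
n <- 90
k <- 3

# Assign every row to exactly one of k equally sized folds.
fold <- sample(rep(seq_len(k), length.out = n))

d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.1)
d$pred.cv <- NA_real_

# Count how often each row receives a prediction.
times_filled <- integer(n)

for (i in seq_len(k)) {
  app   <- which(fold == i)   # rows held out in this iteration
  train <- which(fold != i)   # rows used for fitting
  model <- lm(y ~ x, data = d[train, ])
  d$pred.cv[app] <- predict(model, newdata = d[app, ])
  times_filled[app] <- times_filled[app] + 1L
}

# Every row was predicted exactly once, by a model that never saw it.
all(times_filled == 1L)

# Per-fold MSE and its average, the cross-validated error estimate.
fold_mse <- tapply((d$y - d$pred.cv)^2, fold, mean)
mean(fold_mse)
```

Because the folds have equal size here, the average of the per-fold MSEs equals the MSE computed over the whole pred.cv column.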

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various stack-related websites, none of them were able to provide an answer. Also, I have noticed a lot of confusion regarding what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing this function.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent makes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine having panel data recording these choices. The data would look like this:
set.seed(123)
# number of observations
n <- 100
# number of possible choice
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
x1 = x1, x2 = x2,
individual_fixed_effect = as.character(rep(1:n, years)),
time_fixed_effect = as.character(rep(1:years, each = n)),
stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# convert the data to the format expected by mlogit (shape = "wide" describes the input)
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
group.var = "time_fixed_effect",
choice = "choice",
shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see for additional details this tutorial). However, I am afraid that it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something that I am missing on the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems to be able to explicitly deal with panel structures using appropriate estimation strategy. For this reason, I have decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which individuals belong (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables correlated with a specific choice, and which are randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help of the function mlogit, it is specified that one can use the argument panel to use panel techniques. To set panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables do not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly interested in using mlogit, any efficient way to perform this task would be fine for me (e.g., I am ok with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

Limiting Number of Predictions - R

I'm trying to create a model that predicts whether a given team makes the playoffs in the NHL, based on a variety of available team stats. However, I'm running into a problem.
I'm using R, specifically the caret package, and I'm having fairly good success so far, with one issue: I can't limit the number of teams that are predicted to make the playoffs.
I'm using a categorical variable as the prediction -- Y or N.
For example, using the random forest method from the caret package,
rf_fit <- train(playoff ~ ., data = train_set, method = "rf")
rf_predict <- predict(rf_fit,newdata = test_data_playoffs)
mean(rf_predict == test_data_playoffs$playoff)
gives an accuracy of approximately 90% for my test set, but that's because it's overpredicting. In the NHL, 16 teams make the playoffs, but this predicts 19 teams to make the playoffs. So I want to limit the number of "Y" predictions to 16.
Is there a way to limit the number of possible responses for a categorical variable? I'm sure there is, but google searching has given me limited success so far.
EDIT: Provide sample data which can be created with the following code:
set.seed(100) # For reproducibility
data <- data.frame(Y = sample(1:10,32,replace = T)/10, N = rep(NA,32))
data$N <- 1-data$Y
This creates a data frame similar to what you get by using the "prob" option, where you have a list of probabilities for Y and N
pred <- predict(fit,newdata = test_data_playoffs, "prob")
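One possible way to cap the number of "Y" predictions, offered as a sketch rather than as a caret feature, is to rank the teams by their predicted probability of "Y" and label only the top 16. It uses the sample data above; the names n_playoff, top and pred_capped are illustrative assumptions.

```r
set.seed(100)  # for reproducibility
data <- data.frame(Y = sample(1:10, 32, replace = TRUE) / 10)
data$N <- 1 - data$Y

n_playoff <- 16  # number of teams allowed to be predicted "Y"

# Row indices of the n_playoff highest "Y" probabilities.
# Ties are broken by row order, which is arbitrary.
top <- order(data$Y, decreasing = TRUE)[seq_len(n_playoff)]

data$pred_capped <- "N"
data$pred_capped[top] <- "Y"

table(data$pred_capped)
```

With real output from predict(..., "prob"), the Y column would take the place of the simulated probabilities, and a tie-breaking rule might be worth choosing deliberately.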

SVM in R (e1071): Give more recent data higher influence (weights for support vector machine?)

I'm working with Support Vector Machines from the e1071 package in R. This is my first project using SVM.
I have a dataset containing order histories of ~1k customers over 1 year and I want to predict customer purchases. For every customer I have the information whether a certain item (out of ~50) was bought or not in a certain week (for 52 weeks, i.e. 1 year).
My goal is to predict next month's purchases for every single customer.
I believe that a purchase let's say 1 month ago is more meaningful for my prediction than a purchase 10 months ago.
My question is now how I can give more recent data a higher impact? There is a 'weight' option in the svm-function but I'm not sure how to use it.
Anyone who can give me a hint? Would be much appreciated!
That's my code
# Fit model using Support Vector Machines
# install.packages("e1071")
library(e1071)
response <- train[,5]; # purchases
formula <- response ~ .;
tuned.svm <- tune.svm(train, response, probability=TRUE,
gamma=10^(-6:-3), cost=10^(1:2));
gamma.k <- tuned.svm$best.parameter[[1]];
cost.k <- tuned.svm$best.parameter[[2]];
svm.model <- svm(formula, data = train,
type='eps-regression', probability=TRUE,
gamma=gamma.k, cost=cost.k);
svm.pred <- predict(svm.model, test, probability=TRUE);
Side notes: I'm fitting a model for every single customer. Also, since I'm interested in the probability, that customer i buys item j in week k, I put
probability=TRUE
The weights option of svm() in e1071 is aimed at the problem of imbalanced classes: its class.weights parameter assigns weights to the different classes (1/0) in a biased dataset.
To answer your question: in the absence of built-in observation-level weights, a simple trick for giving recent data more weight in an SVM is to repeat the recent rows (i.e. create duplicate rows for recent data), which indirectly gives them a higher weight.
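A minimal sketch of the row-duplication trick in base R. The toy data frame, the 48-week cutoff and the factor of 3 are all assumptions chosen for illustration.

```r
set.seed(1)
# Toy purchase history: one row per week, 1 = item bought.
train <- data.frame(week = 1:52, bought = rbinom(52, 1, 0.3))

# Number of copies per row: the last 4 weeks count three times.
reps <- ifelse(train$week > 48, 3L, 1L)

# Repeat each row index according to its number of copies.
train_weighted <- train[rep(seq_len(nrow(train)), times = reps), ]

nrow(train_weighted)  # 48 older rows + 4 recent rows x 3 copies = 60
```

The duplicated data frame can then be passed to svm() in place of the original training set. Note that duplicates also end up in cross-validation folds during tuning, which can make the tuning error look optimistic.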
Try this package: https://CRAN.R-project.org/package=WeightSVM
It uses a modified version of 'libsvm' and is able to deal with instance weighting. You can assign higher weights to recent data.
For example, suppose you have simulated data (x, y):
library(WeightSVM)
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
This is an unweighted SVM (all 99 weights equal to 1):
model1 <- wsvm(x, y, weight = rep(1,99))
The blue dots are the unweighted SVM fit, which does not fit the first instances well. We want to put more weight on the first several instances.
So we can use a weighted SVM:
model2 <- wsvm(x, y, weight = seq(99,1,length.out = 99))
The green dots are the weighted SVM fit, which fits the first instances better.

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
mating=sample(0:1, 200, replace=T),
pop=sample(letters[1:4], 200, replace=T),
behv = scale(rpois(200,10)),
condition = scale(rnorm(200,5))
)
data1<-data[1:150,] #for model fitting
data2<-data[51:200,-1] #for predicting
Then this will fit the model using data1 and predict into data2
model<-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1,
family=binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now, to make predictions, you don't have to use a subset from the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this:
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that one population, and since I have to supply condition as well, I fix that at the mean of the original data. (I could have just put in 0, since the data is centered and scaled.) Now I specify a range of behv values I'm interested in. Here I just took the range of the original data and split it into 100 points. This will give me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
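The same idea extends to all populations at once. The sketch below builds the newdata grid with expand.grid() and draws one curve per population. The simulated data mirrors the sample data above but stores the scaled variables as plain numeric vectors, and the names dat, grid and probs are illustrative assumptions.

```r
set.seed(16)
n <- 200
dat <- data.frame(
  mating    = sample(0:1, n, replace = TRUE),
  pop       = factor(sample(letters[1:4], n, replace = TRUE)),
  behv      = as.numeric(scale(rpois(n, 10))),
  condition = as.numeric(scale(rnorm(n, 5)))
)

model <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
             data = dat, family = binomial(logit))

# Every combination of population and behv value; condition is fixed
# at 0, the mean of a centered variable.
grid <- expand.grid(
  pop       = levels(dat$pop),
  behv      = seq(min(dat$behv), max(dat$behv), length.out = 100),
  condition = 0
)
grid$prob <- predict(model, newdata = grid, type = "response")

# expand.grid varies pop fastest, so filling by row gives
# one row per behv value and one column per population.
probs <- matrix(grid$prob, ncol = nlevels(dat$pop), byrow = TRUE)

matplot(unique(grid$behv), probs, type = "l", lty = 1, col = 1:4,
        xlab = "behv", ylab = "Predicted probability of mating")
legend("topright", legend = levels(dat$pop), col = 1:4, lty = 1)
```

As with the single-population example, the only requirement on the grid is that its column names match the predictors used in the formula.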
