Recursive Indexing failing / Bootstrapping parameter estimates [duplicate] - r

I have performed a bootstrapping with 2.000 resamples of the Lee Carter model for mortality projection.
The question is not specific for mortality studies, but on more general dimensions in R.
After performing the bootstrapping I get a list with 2000 elements, each for every of 2.000 re-estimations of the model. For each model, there are estimates of my 3 variables: a_x, b_x and k_t.
Both a_x and b_x are age-specific, so the "x" denotes an age in the interval [0:95].
I would now like to plot a histogram of all the b_x values for age x = 70.
### Performing the bootstrap:
JA_lc_fitM_boot1 <- bootstrap(LCfit_JA_M, nBoot = 2000, type = "semiparametric")
### Plotting the histogram with all b_x for x = 70:
JA_lc_fitM_boot1[["bootParameters"]][1:2000][["bx"]][[70]]
I have tried multiple options, but I cannot make it work.
The thing that triggers me, is that I am working with a double within a list of a list.
I have added a picture of the data below:
Does anybody have a solution to this?

It looks like you need the apply family of functions. Your data is not reproducible, so I can't confirm this will work, but if you do:
result <- sapply(JA_lc_fitM_boot1[["bootParameters"]], function(var) var[["bx"]][[70]])
You should get what you're looking for.

You may want to have a look at the purrr package and the family of map functions, or tidyr and the hoist funtion.
(If you want code that works, you indeed need to provide some data!)

Related

Problems with accessing double elements in a list in R

I have performed a bootstrapping with 2.000 resamples of the Lee Carter model for mortality projection.
The question is not specific for mortality studies, but on more general dimensions in R.
After performing the bootstrapping I get a list with 2000 elements, each for every of 2.000 re-estimations of the model. For each model, there are estimates of my 3 variables: a_x, b_x and k_t.
Both a_x and b_x are age-specific, so the "x" denotes an age in the interval [0:95].
I would now like to plot a histogram of all the b_x values for age x = 70.
### Performing the bootstrap:
JA_lc_fitM_boot1 <- bootstrap(LCfit_JA_M, nBoot = 2000, type = "semiparametric")
### Plotting the histogram with all b_x for x = 70:
JA_lc_fitM_boot1[["bootParameters"]][1:2000][["bx"]][[70]]
I have tried multiple options, but I cannot make it work.
The thing that triggers me, is that I am working with a double within a list of a list.
I have added a picture of the data below:
Does anybody have a solution to this?
It looks like you need the apply family of functions. Your data is not reproducible, so I can't confirm this will work, but if you do:
result <- sapply(JA_lc_fitM_boot1[["bootParameters"]], function(var) var[["bx"]][[70]])
You should get what you're looking for.
You may want to have a look at the purrr package and the family of map functions, or tidyr and the hoist funtion.
(If you want code that works, you indeed need to provide some data!)

Vegan RDA and biplot, remove values contributing >10% of variance

I am using the vegan package to do RDA and want to plot the data using biplot. In my data I have hundreds of values. What I would like to do is limit the variance explained to a set limit so in the example below to 0.1. So instead of having 44 of arrows I might only have say 8
library (vegan) # Load library
library(MASS) # load library
data(varespec) # Dummy data
vare.pca <- rda(varespec, scale = TRUE) # RDA anaylsis
biplot(vare.pca, scaling = 3,display = "species") # Plot data but includes all
## extracts the percentage##
x =(sort(round(100*scores(vare.pca, display = "sp", scaling = 0)[,1]^2, 3), decreasing = TRUE))
## Plot percentage
plot(length(x):1,sort(x)) # plot rank on value of y
Any help would be appreciated :)
Depending on the size of the data-set it would be possible to use either ordistep or ordiR2step to reducing the amount of "unimportant" variables in your plot (see https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/ordistep). However, these functions use step-wise selection, which need to be used cautiously. Step-wise selection can select your included parameters based on AIC values, R2 values or p-values. It does not not select values based on the importance of these for the purpose of your question. It also does not mean that these variables have any meaning towards organisms or biochemical interactions. Nevertheless, step-wise selection can be helpful giving an idea on which parameters might be of strong influence on the overall variation in the data-set. Simple example below.
rda0 <- rda(varespec ~1, varespec)
rda1 <- rda(varespec ~., varespec)
rdaplotp <- ordistep(rda0, scope = formula(rda1))
plot(rdaplotp, display = "species", type = "n")
text(rdaplotp, display="bp")
Thus, by using the ordistep function the number of species displayed in the plot has been greatly reduced (see Fig 1 below). If you want to remove more variables (which I do not suggest) an option could be to look at the output of the biplot and throw out the variables which have the least amount of correlation with the principle components (see below), but I would advise against it.
sumrda <- summary(rdaplotp)
sumrda$biplot
What would be wise, is to first check which question you want to answer and see if any of the included variables could be left out on forehand. This would already reduce the amount. Minor edit: I am also a bit confused why you want to remove parameters strongly contributing to your captured variation.

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival to this purpose, but I am cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various stack-related websites, none of them were able to provide an answer. Also, I have noticed a lot of confusion regarding what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing this function.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent take one choice.
Respondents are asked the same question every year. There is no apriori on the extent to which choice at time t is affected by the choice at t-1.
Now imagine to have a panel data recording these choices. The data, would look like this:
set.seed(123)
# number of observations
n <- 100
# number of possible choice
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
x1 = x1, x2 = x2,
individual_fixed_effect = as.character(rep(1:n, years)),
time_fixed_effect = as.character(rep(1:years, each = n)),
stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# create wide dataset
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
group.var = "time_fixed_effect",
choice = "choice",
shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see for additional details this tutorial). However, I am afraid that it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something that I am missing on the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems to be able to explicitly deal with panel structures using appropriate estimation strategy. For this reason, I have decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = individual_fixed_effect) and the group to which individuals belongs to (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables correlated with a specific choice, and which are randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help of the function mlogit, it is specified that one can use the argument panel to use panel techniques. To set panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables does not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly interested in using mlogit, any efficient way to perform this task would be fine for me (e.g., I am ok with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

PLS in R: Extracting PRESS statistic values

I'm relatively new to R and am currently in the process of constructing a PLS model using the pls package. I have two independent datasets of equal size, the first is used here for calibrating the model. The dataset comprises of multiple response variables (y) and 101 explanatory variables (x), for 28 observations. The response variables, however, will each be included seperately in a PLS model. The code current looks as follows:
# load data
data <- read.table("....txt", header=TRUE)
data <- as.data.frame(data)
# define response variables (y)
HEIGHT <- as.numeric(unlist(data[2]))
FBM <- as.numeric(unlist(data[3]))
N <- as.numeric(unlist(data[4]))
C <- as.numeric(unlist(data[5]))
CHL <- as.numeric(unlist(data[6]))
# generate matrix containing the explanatory (x) variables only
spectra <-(data[8:ncol(data)])
# calibrate PLS model using LOO and 20 components
library(pls)
refl.pls <- plsr(N ~ as.matrix(spectra), ncomp=20, validation = "LOO", jackknife = TRUE)
# visualize RMSEP -vs- number of components
plot(RMSEP(refl.pls), legendpos = "topright")
# calculate explained variance for x & y variables
summary(refl.pls)
I have currently arrived at the point at which I need to decide, for each response variable, the optimal number of components to include in my PLS model. The RMSEP values already provide a decent indication. However, I would also like to base my decision on the PRESS (Predicted Residual Sum of Squares) statistic, in accordance various studies comparable to the one I am conducting. So in short, I would like to extract the PRESS statistic for each PLS model with n components.
I have browsed through the pls package documentation and across the web, but unfortunately have been unable to find an answer. If there is anyone out here that could help me get in the right direction that would be greatly appreciated!
You can find the PRESS values in the mvr object.
refl.pls$validation$PRESS
You can see this either by exploring the object directly with str or by perusing the documentation more thoroughly. You will notice if you look at ?mvr you will see the following:
validation if validation was requested, the results of the
cross-validation. See mvrCv for details.
Validation was indeed requested so we follow this to ?mvrCv where you will find:
PRESS a matrix of PRESS values for models with 1, ...,
ncomp components. Each row corresponds to one response variable.

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!

Resources