I have performed bootstrapping with 2,000 resamples of the Lee-Carter model for mortality projection.
The question is not specific to mortality studies, but is about handling nested list structures in R more generally.
After performing the bootstrapping I get a list with 2,000 elements, one for each of the 2,000 re-estimations of the model. For each model, there are estimates of my three variables: a_x, b_x and k_t.
Both a_x and b_x are age-specific, so the "x" denotes an age in the interval [0:95].
I would now like to plot a histogram of all the b_x values for age x = 70.
### Performing the bootstrap:
JA_lc_fitM_boot1 <- bootstrap(LCfit_JA_M, nBoot = 2000, type = "semiparametric")
### Plotting the histogram with all b_x for x = 70:
JA_lc_fitM_boot1[["bootParameters"]][1:2000][["bx"]][[70]]
I have tried multiple options, but I cannot make it work.
What trips me up is that I am working with a double nested inside a list of lists.
I have added a picture of the data below:
Does anybody have a solution to this?
It looks like you need the apply family of functions. Your data is not reproducible, so I can't confirm this will work, but if you do:
result <- sapply(JA_lc_fitM_boot1[["bootParameters"]], function(var) var[["bx"]][[70]])
You should get what you're looking for.
You may want to have a look at the purrr package and its family of map functions, or at tidyr and the hoist function.
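Assuming the sapply call above works, the histogram is then just a hist() call on result; with purrr the same extraction might look like the sketch below (equally unverified without the data; also note that if ages run 0 to 95, position 70 of bx corresponds to age 69, so adjust the index if you mean age 70 exactly):
# Base R histogram of the 2,000 bootstrapped b_x values
hist(result, main = "Bootstrapped b_x at age 70", xlab = "b_x")

# purrr alternative: pull the 70th bx element out of every bootstrap re-fit
library(purrr)
result2 <- map_dbl(JA_lc_fitM_boot1[["bootParameters"]], ~ .x[["bx"]][[70]])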
(If you want code that works, you indeed need to provide some data!)
I am using the vegan package to do RDA and want to plot the data using biplot. My data contain hundreds of values. What I would like to do is limit the arrows shown to those above a set variance-explained threshold, 0.1 in the example below, so that instead of 44 arrows I might only have, say, 8.
library(vegan)  # load library
library(MASS)   # load library
data(varespec)  # dummy data
vare.pca <- rda(varespec, scale = TRUE)             # RDA analysis
biplot(vare.pca, scaling = 3, display = "species")  # plot data, but includes all arrows

## Extract the percentage (squared species scores on the first axis)
x <- sort(round(100 * scores(vare.pca, display = "sp", scaling = 0)[, 1]^2, 3), decreasing = TRUE)

## Plot percentage
plot(length(x):1, sort(x))  # plot value against rank
Any help would be appreciated :)
Depending on the size of the data-set, it would be possible to use either ordistep or ordiR2step to reduce the number of "unimportant" variables in your plot (see https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/ordistep). However, these functions use step-wise selection, which needs to be used cautiously. Step-wise selection chooses the included parameters based on AIC values, R2 values or p-values; it does not select variables based on their importance for the question you are asking, nor does inclusion mean that these variables have any meaning for organisms or biochemical interactions. Nevertheless, step-wise selection can be helpful for getting an idea of which parameters might strongly influence the overall variation in the data-set. A simple example is below.
rda0 <- rda(varespec ~ 1, varespec)                 # null model (intercept only)
rda1 <- rda(varespec ~ ., varespec)                 # full model with all variables
rdaplotp <- ordistep(rda0, scope = formula(rda1))   # step-wise selection between the two
plot(rdaplotp, display = "species", type = "n")
text(rdaplotp, display = "bp")
Thus, by using the ordistep function the number of species displayed in the plot has been greatly reduced (see Fig 1 below). If you want to remove even more variables, an option could be to look at the output of the biplot and discard the variables that have the least correlation with the principal components (see below), although I would advise against it.
sumrda <- summary(rdaplotp)
sumrda$biplot
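As a rough sketch of that last idea, applied directly to the vare.pca fit from the question (the 0.1 cut-off is just an example threshold, and this relies on the select argument of vegan's text/points methods for ordination objects):
## Keep only species whose squared score on the first axis exceeds a chosen cut-off
sp_scores <- scores(vare.pca, display = "species", scaling = 0)
keep <- sp_scores[, 1]^2 > 0.1                                 # logical selector; 0.1 is arbitrary
plot(vare.pca, display = "species", type = "n", scaling = 3)   # empty ordination frame
text(vare.pca, display = "species", select = keep, scaling = 3)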
What would be wise is to first check which question you want to answer and see whether any of the included variables could be left out beforehand; this would already reduce the number. Minor edit: I am also a bit confused about why you would want to remove parameters that contribute strongly to the captured variation.
I'm relatively new to R and am currently constructing a PLS model using the pls package. I have two independent datasets of equal size; the first is used here for calibrating the model. The dataset comprises multiple response variables (y) and 101 explanatory variables (x) for 28 observations. The response variables, however, will each be included separately in a PLS model. The code currently looks as follows:
# load data
data <- read.table("....txt", header=TRUE)
data <- as.data.frame(data)
# define response variables (y)
HEIGHT <- as.numeric(unlist(data[2]))
FBM <- as.numeric(unlist(data[3]))
N <- as.numeric(unlist(data[4]))
C <- as.numeric(unlist(data[5]))
CHL <- as.numeric(unlist(data[6]))
# generate matrix containing the explanatory (x) variables only
spectra <- data[8:ncol(data)]
# calibrate PLS model using LOO and 20 components
library(pls)
refl.pls <- plsr(N ~ as.matrix(spectra), ncomp=20, validation = "LOO", jackknife = TRUE)
# visualize RMSEP -vs- number of components
plot(RMSEP(refl.pls), legendpos = "topright")
# calculate explained variance for x & y variables
summary(refl.pls)
I have now arrived at the point at which I need to decide, for each response variable, the optimal number of components to include in my PLS model. The RMSEP values already provide a decent indication. However, I would also like to base my decision on the PRESS (Predicted Residual Sum of Squares) statistic, in accordance with various studies comparable to the one I am conducting. In short, I would like to extract the PRESS statistic for each PLS model with n components.
I have browsed through the pls package documentation and across the web, but unfortunately have been unable to find an answer. If anyone out there could point me in the right direction, that would be greatly appreciated!
You can find the PRESS values in the mvr object.
refl.pls$validation$PRESS
You can see this either by exploring the object directly with str or by reading the documentation more thoroughly. If you look at ?mvr you will see the following:
validation if validation was requested, the results of the
cross-validation. See mvrCv for details.
Validation was indeed requested so we follow this to ?mvrCv where you will find:
PRESS a matrix of PRESS values for models with 1, ...,
ncomp components. Each row corresponds to one response variable.
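So, continuing from the refl.pls fit in the question, a minimal sketch for inspecting PRESS and choosing a component count (the row index 1 assumes the single-response model fitted above):
press <- refl.pls$validation$PRESS   # one row per response, one column per number of components
plot(seq_len(ncol(press)), press[1, ], type = "b",
     xlab = "Number of components", ylab = "PRESS")
which.min(press[1, ])                # component count with the smallest PRESS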
I am new to R and am trying to produce a vast number of diagnostic plots for linear models for a huge data set.
I discovered the lmList function from the nlme package.
This works a treat, but what I now need is a way to pass a fraction of the data to the plot function so that the resulting plots are not minute and unreadable.
In the example below 27 plots are nicely displayed. I want to produce diagnostics for much more data.
Is it necessary to subset the data first (presumably with loops), or is it possible to subset within the plotting function, rather than create 270 data frames and pass them all in separately?
I'm sorry to say that my R is so basic that I do not even know how to build variable names and values together in for loops (I tried using the paste function, but it failed).
The data and function for the example are below. I would be picking values of Subject by their row numbers within the data frame. I grant that the 27 plots here display nicely, but for the sake of example it would be nice to split them into, say, 3 sets of 9.
fm1 <- lmList(distance ~ age | Subject, Orthodont)
# observed versus fitted values by Subject
plot(fm1, distance ~ fitted(.) | Subject, abline = c(0,1))
Examples from:
https://stat.ethz.ch/R-manual/R-devel/library/nlme/html/plot.lmList.html
I would be most grateful for help and hope that my question isn't insulting to anyone's intelligence or otherwise annoying.
I can't see how to pass a subset to the plot.lmList function, but here is a way to do it using the standard split-apply-combine strategy. The Subjects are simply split into three arbitrary groups of 9, and lmList is applied to each group.
## Make 3 lmLists, one per group of 9 Subjects
fits <- lapply(split(unique(Orthodont$Subject), rep(1:3, each = 9)), function(x) {
  eval(substitute(
    lmList(distance ~ age | Subject,                       # fit the model to the subset
           data = Orthodont[Orthodont$Subject %in% x, ]),  # use only this group's Subjects
    list(x = x)))  # substitute the actual x values so the proper call gets stored
})
## Make plots
for (i in seq_along(fits)) {
  dev.new()
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0, 1)))
}
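If the plots are still too small to read on screen, a further option (plain base R, not part of the original approach) is to write each group's plot to its own image file:
## Sketch: save each group's diagnostic plot to a separate PNG file
for (i in seq_along(fits)) {
  png(sprintf("lmList_diagnostics_group_%d.png", i), width = 900, height = 900)
  print(plot(fits[[i]], distance ~ fitted(.) | Subject, abline = c(0, 1)))
  dev.off()
}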
I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is about step 3. What I would like to do is use the formula stored in the lung.survfit object to feed the calculation of the differences in survival curves, i.e. the call to survdiff. This is where my programming skills hit a wall. Below is my current attempt; I'd appreciate any help you can give! Once I get this sorted out I should be able to wrap the solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
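For completeness, here is one way the wrapper mentioned in the question might look (a minimal sketch; the function name fit_and_test is just illustrative, not part of the survival package). Fitting and testing from a single formula/data pair means the two calls can never disagree:
fit_and_test <- function(formula, data) {
  fit  <- survfit(formula, data = data)    # Kaplan-Meier fit for plotting / median survival
  diff <- survdiff(formula, data = data)   # log-rank test on the same formula
  list(fit = fit, logrank = diff)
}

lung.res <- fit_and_test(Surv(time, status) ~ sex, data = lung)
print(lung.res$fit)
print(lung.res$logrank)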