Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.
The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:
data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) + (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")
The rs$which command shows the output of the regsubsets function and lets you select the model once you have plotted the AIC and found the number of parameters that minimizes it. But here's the problem: while the typed-up example works fine, I run into a mismatch in the number of elements when I adapt this code to other data. For example:
require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp+gas+pack+
I(temp^2)+I(gas^2)+I(pack^2)+
I(temp*gas)+I(temp*pack)+I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) + (2:10)*2
produces a warning message:
Warning message:
In 50 * log(rs$rss/50) + (2:10) * 2 :
longer object length is not a multiple of shorter object length
Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn't work with the original example!
So here's my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.
Ok, so you know the help page for regsubsets says
regsubsets returns an object of class "regsubsets" containing no
user-serviceable parts. It is designed to be processed by
summary.regsubsets.
You're about to find out why.
The code in regsubsets calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps (and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets mostly followed the structure of the Fortran driver program that came with the code.
The rss field of the regsubsets object has that name because it stores a variable called RSS in the Fortran code. This variable is not the residual sum of squares of the best model. RSS is computed in the setup phase, before any subset selection is done, by the subroutine SSLEAPS, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS describes the RSS of the models with no selection, fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran, so it's not documented.
The code in summary.regsubsets extracts the residual sum of squares in the output from the $ress component of the object, which comes from the RESS variable in the Fortran code. This is an array whose [i,j] element is the residual sum of squares of the j-th best model of size i.
All the model criteria are computed from $ress in the same loop of summary.regsubsets, which can be edited down to this:
for (i in ll$first:min(ll$last, ll$nvmax)) {
    for (j in 1:nshow) {
        vr <- ll$ress[i, j] / ll$nullrss
        rssvec   <- c(rssvec, ll$ress[i, j])
        rsqvec   <- c(rsqvec, 1 - vr)
        adjr2vec <- c(adjr2vec, 1 - vr * n1 / (n1 + ll$intercept - i))
        cpvec    <- c(cpvec, ll$ress[i, j] / sigma2 - (n1 + ll$intercept - 2 * i))
        bicvec   <- c(bicvec, (n1 + ll$intercept) * log(vr) + i * log(n1 + ll$intercept))
    }
}
cpvec gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.
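For example, mirroring the bicvec line (a rough, untested sketch that relies on the same internal objects ll, nshow, and n1 that summary.regsubsets sets up), an AIC vector could be collected like this; it differs from n*log(RSS/n) + 2p only by a constant, so it picks the same model:
aicvec <- NULL
for (i in ll$first:min(ll$last, ll$nvmax)) {
    for (j in 1:nshow) {
        vr <- ll$ress[i, j] / ll$nullrss
        # analogue of the BIC line above: n * log(RSS/nullRSS) + 2 * (number of terms)
        aicvec <- c(aicvec, (n1 + ll$intercept) * log(vr) + 2 * i)
    }
}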
regsubsets has an nvmax argument controlling the "maximum size of subsets to examine". By default this is 8; if you increase it to 9 or higher, your code works.
Please note, though, that the 50 in your AIC formula is the sample size (i.e. the 50 states in statedata). For your second example it should be nrow(odor), i.e. 15.
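Put together, a corrected version of your second example might look like this (a sketch, not run against your data):
b  <- regsubsets(odor ~ temp + gas + pack +
                     I(temp^2) + I(gas^2) + I(pack^2) +
                     I(temp*gas) + I(temp*pack) + I(gas*pack),
                 data = odor, nvmax = 9)
rs <- summary(b)
n  <- nrow(odor)                        # 15 observations, not 50
AIC <- n * log(rs$rss / n) + (2:10) * 2
plot(AIC ~ I(1:9), ylab = "AIC", xlab = "Number of Predictors")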
I am having an issue implementing recency weighting for xgboost training in R (i.e. passing a weight vector to xgb.DMatrix): although the weighting affects the learning-curve readout for the training set, it does not appear to have any impact at all on the actual model produced; performance on the test set is identical.
I can't seem to get to the bottom of this issue or generate a reproducible example. So instead I would like to pass the Date column of the features to a custom loss function, something like:
custom_loss <- function(preds, dat) {
    labels <- getinfo(dat, "label")
    dates  <- ...  # a vector corresponding to the dates associated with each prediction
    # f is an increasing function of the values in dates, so later samples matter more when training
    grad <- f(dates) * -2 * (labels - preds)
    hess <- f(dates) * 2
    return(list(grad = grad, hess = hess))
}
But I can't seem to figure out how to do this, any suggestions?
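One way this is sometimes handled (a sketch only, not a tested solution; make_loss, features, target, and dates are hypothetical names standing in for your own objects) is to avoid passing the dates through the DMatrix at all and instead build the custom objective as a closure over a precomputed weight vector:
library(xgboost)

## Hypothetical setup: 'features' is a numeric matrix, 'target' the response,
## and 'dates' a numeric vector (e.g. as.numeric of the Date column) aligned
## with the rows used to build the DMatrix.
dtrain <- xgb.DMatrix(data = features, label = target)

make_loss <- function(dates) {
    w <- as.numeric(dates - min(dates) + 1)  # f(dates): any increasing function will do
    w <- w / mean(w)                         # keep the overall scale sensible
    function(preds, dat) {
        labels <- getinfo(dat, "label")
        grad <- w * -2 * (labels - preds)    # weighted squared-error gradient
        hess <- w * 2                        # weighted squared-error hessian
        list(grad = grad, hess = hess)
    }
}

fit <- xgb.train(params = list(max_depth = 4, eta = 0.1),
                 data = dtrain, nrounds = 100,
                 obj = make_loss(dates))
The closure approach works as long as the rows of dtrain keep the same order as the dates vector, so w stays aligned with preds and labels.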
First of all, thank you all in advance for reading this.
I am trying to fit a standardized Student's t distribution (i.e. a Student's t with standard deviation = 1) to a series of data; that is, I want to estimate the degrees of freedom via maximum likelihood estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula corresponding to the calculation of the loglikelihood function for the Standardized T Student Distribution. The formula was extracted from a Finance book (Elements of Financial Risk Management - by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 = fitdist(z1, "t", method = "mle", start = 10)
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. I don't have enough context to see exactly why the author defines the t distribution that way, though presumably the n-2 comes from rescaling: a t variable with n df has variance n/(n-2), so dividing by sqrt(n/(n-2)) gives the unit-variance ("standardized") t described in the question.
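If that book parameterisation is what you actually want to fit, its log-likelihood can be maximised directly; a minimal sketch (assuming z1 is the data read in above, and noting that this density requires df > 2):
## Negative log-likelihood of the spreadsheet/book ("standardized t") density
nll_std_t <- function(nu, x = z1) {
    -sum(lgamma((nu + 1)/2) - lgamma(nu/2) - 0.5*log(pi) -
             0.5*log(nu - 2) - 0.5*(nu + 1)*log(1 + x^2/(nu - 2)))
}
opt <- optimize(nll_std_t, interval = c(2.01, 100))
opt$minimum      # df estimate under this parameterisation
-opt$objective   # the corresponding log-likelihood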
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer is agreeing with the direct calculation. (It would be very surprising if there were a bug in the dt() function, R's distribution functions are very broadly used and thoroughly tested.)
LL <- function(p, data = z1) {
    -sum(dt(data, df = p, log = TRUE))
}
pvec <- seq(6, 20, by = 0.05)
Lvec <- sapply(pvec, LL)
par(las = 1, bty = "l")
plot(pvec, Lvec, type = "l",
     xlab = "df parameter", ylab = "negative log-likelihood")
## superimpose fitdist results ...
abline(v = coef(ft1), lty = 2)
abline(h = -logLik(ft1), lty = 2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
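For reference, the centring-and-scaling check mentioned in the parenthesis could be done roughly like this (a sketch; z1s is just a name used here, and fitdist expects start as a named list):
c(mean(z1), sd(z1))                   # close to, but not exactly, 0 and 1
z1s <- drop(scale(z1))                # centred and scaled copy of the data
ft2 <- fitdist(z1s, "t", method = "mle", start = list(df = 10))
ft2$estimate                          # df estimate on the rescaled data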
I'm relatively new to R and am currently in the process of constructing a PLS model using the pls package. I have two independent datasets of equal size; the first is used here for calibrating the model. The dataset comprises multiple response variables (y) and 101 explanatory variables (x) for 28 observations. The response variables, however, will each be included separately in a PLS model. The code currently looks as follows:
# load data
data <- read.table("....txt", header=TRUE)
data <- as.data.frame(data)
# define response variables (y)
HEIGHT <- as.numeric(unlist(data[2]))
FBM <- as.numeric(unlist(data[3]))
N <- as.numeric(unlist(data[4]))
C <- as.numeric(unlist(data[5]))
CHL <- as.numeric(unlist(data[6]))
# generate matrix containing the explanatory (x) variables only
spectra <- data[8:ncol(data)]
# calibrate PLS model using LOO and 20 components
library(pls)
refl.pls <- plsr(N ~ as.matrix(spectra), ncomp=20, validation = "LOO", jackknife = TRUE)
# visualize RMSEP -vs- number of components
plot(RMSEP(refl.pls), legendpos = "topright")
# calculate explained variance for x & y variables
summary(refl.pls)
I have currently arrived at the point at which I need to decide, for each response variable, the optimal number of components to include in my PLS model. The RMSEP values already provide a decent indication. However, I would also like to base my decision on the PRESS (Predicted Residual Sum of Squares) statistic, in accordance with various studies comparable to the one I am conducting. So, in short, I would like to extract the PRESS statistic for each PLS model with n components.
I have browsed through the pls package documentation and across the web, but unfortunately have been unable to find an answer. If there is anyone out here that could help me get in the right direction that would be greatly appreciated!
You can find the PRESS values in the mvr object.
refl.pls$validation$PRESS
You can see this either by exploring the object directly with str or by perusing the documentation more thoroughly. If you look at ?mvr you will see the following:
validation if validation was requested, the results of the
cross-validation. See mvrCv for details.
Validation was indeed requested so we follow this to ?mvrCv where you will find:
PRESS a matrix of PRESS values for models with 1, ...,
ncomp components. Each row corresponds to one response variable.
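Applied to the model fitted above, that boils down to something like this (a sketch):
press <- refl.pls$validation$PRESS    # one row per response, one column per number of components
press
which.min(press[1, ])                 # number of components with the smallest PRESS for this response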
I'm trying to analyze repairable systems reliability using growth models.
I have already fitted a Crow-AMSAA model, but I wonder if there is any package or code for fitting a Generalized Renewal Process (Kijima Model I or Model II) in R and finding its parameters beta, lambda (or alpha), and q (or some other model for the mean cumulative function, MCF).
Equation 15 of this article gives an expression for the log-likelihood.
I tried to create the function like this:
likelihood.G1 = function(theta, x){
    # x is a vector of failure times, theta a vector of parameters
    a = theta[1]  # alpha
    b = theta[2]  # beta
    q = theta[3]  # q
    logl2 = log(b/a)  # first part of the equation
    for (i in 1:length(x)){
        logl2 = logl2 + (b-1)*log(x[i]/(a*(1+q)^(i-1))) - (x[i]/(a*(1+q)^(i-1)))^b
    }
    return(-logl2)  # negative of the log-likelihood
}
And then use some routine to minimize the negative log-likelihood:
theta=c(0.5,1.2,0.8) #Start parameters (lambda,beta,q)
nlm(likelihood.G1,theta, x=Data)
Or also
optim(theta,likelihood.G1,method="BFGS",x=Data)
However, there seems to be some mistake, since the parameters it returns make no sense.
Any ideas of what I'm doing wrong?
Thanks
Looking at equation (16) of the paper you reference and comparing it with your code, it looks like you are missing one term in the for loop. It seems that each data point contributes three terms to the log-likelihood, but in your code (inside the loop) you only have two terms (not counting the updating term).
Specifically, your code does not include the 4th term in equation (16), nor the 7th term, and so on. That is at least one error in the code. An extra consideration is that α and β are constrained to be greater than zero; I am not sure whether the solver you are using takes this constraint into account.
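On that last point, one option is a bounded optimiser; a sketch, assuming the missing terms have first been added to likelihood.G1 (the lower bound of 0 for q is an assumption on my part):
theta0 <- c(0.5, 1.2, 0.8)                          # alpha, beta, q
fit <- optim(theta0, likelihood.G1, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6, 0), x = Data)    # keep alpha and beta strictly positive
fit$par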
I have created a loop to fit a non-linear model to six data points per participant. The first model is a one-parameter model; here is the code for that model, which works great. The time variable is defined, the Participant variable is the id variable, and the data are in long form (one row for each data point of each participant).
Here is the loop code with 1 parameter that works:
# dlply() comes from the plyr package; wrapnls() and nlxb() from the nlmrt package
one_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {
    wrapnls(indiff ~ 1/(1 + k*time), data = discounting_long, start = c(k = 0))
})
However, when I try to fit a two-parameter model while still using the wrapnls function, I get the error "Error: singular gradient matrix at initial parameter estimates". I realize that the model is likely over-parameterized; that is why I am trying to use wrapnls instead of just nls (or nlsList). Some in my field insist on seeing both model fits. I thought that wrapnls avoids the problem of zero or near-zero residuals. Here is my code that does not work; the start values and limits are standard in the field for this model.
two_p_model <- dlply(discounting_long, .(Participant), function(discounting_long) {
    nlxb(indiff ~ 1/(1 + k*time^s), data = discounting_long,
         lower = c(s = 0), start = c(k = 0, s = 0.99), upper = c(s = 1))
})
I realize that I could use nlxb (which does give me the correct parameter values for each participant), but that function does not give predicted values or residuals for each data point (at least I don't think it does), which I would like in order to compute AIC values.
I am also open to other solutions for running a loop through the data by participants.
You mention at the end that 'nlxb doesn't give you residuals', but it does. If the result of your call to nlxb is called fit, then the residuals are in fit$resid, so you can get the fitted values just by adding the residuals to the original data. Honestly, I don't know why nlxb hasn't been made to work with the predict() function, but at least there is a way to get the predicted values.
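As a sketch of what that looks like in practice (the participant id "1" is hypothetical, and the sign convention assumes fit$resid = fitted - observed, as described above), a per-participant least-squares AIC could then be computed along these lines:
fit  <- two_p_model[["1"]]                 # one participant's nlxb fit from the dlply list
d    <- subset(discounting_long, Participant == "1")
pred <- d$indiff + fit$resid               # fitted values; subtract instead if the sign is reversed
rss  <- sum(fit$resid^2)
n    <- nrow(d)
aic  <- n * log(rss / n) + 2 * (2 + 1)     # k and s, plus the error variance, in the least-squares form of AIC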