I am new to the glmnet package in R, and want to supply a lambda sequence to the cv.glmnet function, based on a suggestion in a published research paper. The documentation says we can supply a decreasing sequence of lambdas as a parameter, but it gives no examples of how to do this.
I would be very grateful if someone could suggest how to go about doing this. Do I pass a vector of 100-odd values (the default value of nlambda) to the function? What restrictions are there on the minimum and maximum values of this vector, if any? Also, are there things to keep in mind regarding nvars, nobs, etc. while specifying the vector?
Thanks in advance.
You can define a grid like this:
grid <- 10^seq(10, -2, length = 100)   ## decreasing lambda sequence, from 1e10 down to 1e-2
ridge_mod <- glmnet(x, y, alpha = 0, lambda = grid)
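You can pass the same decreasing sequence to cv.glmnet directly (a minimal sketch, assuming x is your design matrix and y your response):
library(glmnet)
grid <- 10^seq(10, -2, length = 100)    # from very large (null model) down to very small
cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = grid)
cv_fit$lambda.min                       # lambda with the minimum CV error
The sequence just needs to run from a value large enough to shrink all coefficients to (essentially) zero down to a small value, which is what this grid does.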
This is fairly easy, though it's not well explained in the original documentation ;)
In the following I've used the Cox family, but you can change it based on your needs.
my_cvglmnet_fit <- cv.glmnet(x = regression_data, y = glmnet_response, family = "cox", maxit = 100000)
Then you can plot the fitted object created by cv.glmnet; in the plot you can easily see where lambda is at its minimum. One of the dotted vertical lines marks the minimum lambda and the other marks the 1se value.
plot(my_cvglmnet_fit)
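Both values marked by the dotted lines are also stored on the fitted object, so you can read them off directly:
my_cvglmnet_fit$lambda.min   # lambda giving the minimum mean cross-validated error
my_cvglmnet_fit$lambda.1se   # largest lambda within one standard error of that minimum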
The following lines help you see the non-zero coefficients and their corresponding values:
cf <- coef(my_cvglmnet_fit, s = "lambda.min")
cf[which(cf != 0)]                           # the non-zero coefficients
colnames(regression_data)[which(cf != 0)]    # the features that are selected
Here are some links that may help:
http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
http://blog.revolutionanalytics.com/2013/05/hastie-glmnet.html
function(x) var(sum(((x - mean(x))^2)/(n - 1)))
This is my function for the variance of the sample variance, but it does not seem to work.
Are we allowed to use var inside the function?
Unless I'm mistaken, this is significantly harder than you think; you can't compute it directly the way you're trying to. (Hint: when you take the sum() you get a single value, i.e. a vector of length 1, so the sample variance of that vector is NA.) This post on Mathematics Stack Exchange derives the variance of the sample variance as mu_4/n - sigma^4*(n-3)/(n*(n-1)), where mu_4 is the fourth central moment; you could also see this post on CrossValidated or this Wikipedia page (different derivations present the result in terms of the fourth central moment, the kurtosis, or the excess kurtosis). So:
set.seed(101)
r <- rnorm(100)
mu_4 <- mean((r - mean(r))^4)            # fourth central moment
sigma_4 <- var(r)^2                      # sigma^4 (squared sample variance)
n <- length(r)
mu_4/n - sigma_4*(n - 3)/(n*(n - 1))     ## 0.01122
Be careful:
var(replicate(100000, var(rnorm(100))))
gives about 0.0202. At least it's not the wrong order of magnitude, but it's a factor of 2 away from the example above. My guess is that the estimate is itself highly variable (which wouldn't be surprising, since it depends on the fourth moment); I tried the estimation method above a few times and it does indeed seem highly variable.
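For what it's worth, the normal-theory value supports that guess: for Gaussian data mu_4 = 3*sigma^4, so the formula above reduces to 2*sigma^4/(n-1), which for sigma = 1 and n = 100 is 2/99 ≈ 0.0202, matching the simulation. A quick sketch comparing the theoretical value with the spread of the plug-in estimate:
n <- 100
2/(n - 1)                       ## 0.0202: theoretical Var(S^2) for N(0,1) samples
## the single-sample plug-in estimate is itself quite noisy:
est <- replicate(1000, {
  r <- rnorm(n)
  mu_4 <- mean((r - mean(r))^4)
  mu_4/n - var(r)^2*(n - 3)/(n*(n - 1))
})
sd(est)                         ## sizeable relative to mean(est)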
I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a Google search and came up with a dazzling range of packages. Some apply to 1D signals (mostly time series, e.g. How do I run a high pass or low pass filter on data points in R?); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go that way, as they typically involve (i) installation of more or less complex/heavy dependencies (ImageMagick...) and/or (ii) conversion from matrix to image.
Here is sample data:
r <- (0:360)/360*(2*pi)   # angles from 0 to 2*pi (note: seq(0:360) would actually give 1:361)
x <- cos(r)
y <- sin(r)
z <- outer(x, y, "*")     # clean signal
noise <- 0.3*matrix(runif(length(x)*length(y)), nrow = length(x))
zz <- z + noise           # noisy version
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience in signal processing!
Thanks.
One option may be using a non-linear prediction method and getting the fitted values from the model.
For example, fitting a polynomial regression to a single column lets us predict the smooth underlying signal from the noisy data, as in the sketch below.
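A minimal single-column sketch (column 1 chosen arbitrarily; the purple line is the fitted curve):
i <- 1
fit <- lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))
plot(zz[, i], col = "grey", pch = 20)          # noisy column
lines(fitted(fit), col = "purple", lwd = 2)    # smoothed fit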
By following the same logic, you can do the same thing to all columns of the zz matrix as,
predictions <- matrix(NA_real_, nrow = nrow(zz), ncol = ncol(zz))   # preallocate
for (i in 1:ncol(zz)) {
  predictions[, i] <- fitted(lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE)))
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns. Or you could use some other, more powerful non-linear prediction method (SVM, ANN, etc.) to get a more accurate model.
I want to use my Bayesian network as a classifier, first on complete evidence data (predict), but also on incomplete data (bnlearn::cpquery). But it seems that, even when working with the same evidence, the functions give different results (not just slight deviations due to sampling).
With complete data, one can easily use R's predict function:
predict(object = BN,
node = "TargetVar",
data = FullEvidence ,
method = "bayes-lw",
prob = TRUE)
By analyzing the prob attribute, I understood that the predict function simply chooses the factor level with the highest assigned probability.
When it comes to incomplete evidence (only outcomes of some nodes are known), predict doesn't work anymore:
Error in check.fit.vs.data(fitted = fitted, data = data,
    subset = setdiff(names(fitted), :
  required variables [.....] are not present in the data.
So, I want to use bnlearn::cpquery with a list of known evidence:
cpquery(fitted = BN,
event = TargetVar == "TRUE",
evidence = evidenceList,
method = "lw",
n = 100000)
Again, I simply want to use the factor level with the highest probability as the prediction; so if the result of cpquery is higher than 0.5, I set the prediction to TRUE, else to FALSE.
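A minimal sketch of that decision rule (evidenceList is assumed to be a named list of observed node values, as method = "lw" requires):
p_true <- cpquery(fitted = BN, event = (TargetVar == "TRUE"),
                  evidence = evidenceList, method = "lw", n = 100000)
prediction <- if (p_true > 0.5) "TRUE" else "FALSE"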
I tried to monitor the process by giving the same (complete) data to both functions, but they don't give me back the same results. There are large differences; e.g., predict's prob attribute gives me p(FALSE) = 27%, whereas cpquery gives me p(FALSE) = 2.2%.
What is the "right" way of doing this? Using only cpquery, also for complete data? Why are there large differences?
Thanks for your help!
As user20650 put it, increasing the number of samples in the predict call was the solution to getting very similar results, so just provide the argument n = ... in your function call.
Of course that makes sense; I just didn't know about that argument in the predict() function.
There's no documentation for it in the bn.fit utilities, and also none in the quite generic documentation of predict.
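For completeness, a sketch of the adjusted call (n = 100000 chosen here only to match the cpquery call above):
predict(object = BN,
        node = "TargetVar",
        data = FullEvidence,
        method = "bayes-lw",
        prob = TRUE,
        n = 100000)   # number of particles for the likelihood-weighting approximation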
I am new to R so please keep that in mind :)
I am currently using the package ‘ycinterextra’ and interpolating a yield curve with several methods. For example:
maturity <- c(1, 2, 5, 10)
yield <- c(0.39, 0.61, 1.66, 2.58)
t <- seq(from = min(maturity), to = max(maturity), by = 0.01)
yc <- ycinter(yM = yield, matsin = maturity, matsout = t, method = "SW", typeres = "rates")
fitted(yc)
I know how to get fitted(yc), but I don't know how to get a single value for a specific maturity, for example the 4-year or 1.5-year yield. What I need is just the one value that corresponds to a specific t (any).
Thanks in advance!
Not sure if I understood correctly, and this is a very old question, but here is what I think you should do: simply match your value in t and use it to index the fitted values.
as.numeric(fitted(yc))[match(4.5, t)]
[1] 1.460163
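One caveat (my addition): match() needs an exact floating-point match, which can silently return NA for values produced by seq(). A more robust sketch looks up the nearest grid point instead:
## nearest-grid-point lookup; works even when the target maturity
## doesn't exactly equal an element of t
yield_at <- function(m) as.numeric(fitted(yc))[which.min(abs(t - m))]
yield_at(4)     # the 4-year yield
yield_at(1.5)   # the 1.5-year yield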
I am trying to use the outer function with predict in some classification code in R. For ease, we will assume in this post that we have two vectors, named alpha and beta, each containing ONLY 0 and 1. I am looking for a simple yet efficient way to pass all combinations of alpha and beta to predict.
I have constructed the code below to mimic the lda function from the MASS library, so rather than "lda", I am using "classifier". It is important to note that the prediction method within predict depends on an (alpha, beta) pair.
Of course, I could use a nested for loop to do this, but I am trying to avoid this method.
Here is what I would like to do ideally:
alpha <- seq(0, 1)
beta <- seq(0, 1)
classifier.out <- classifier(training.data, labels)
outer(X=alpha, Y=beta, FUN="predict", classifier.out, validation.data)
This is a problem because alpha and beta are not the first two parameters in predict.
So, in order to get around this, I changed the last line to
outer(X=alpha, Y=beta, FUN="predict", object=classifier.out, data=validation.data)
Note that my validation data has 40 observations, and also that there are 4 possible pairs of alpha and beta. I get an error though saying
dims [product 4] do not match the length of object [40]
I have tried a few other things, some of which work but are far from simple. Any suggestions?
The problem is that outer expects its function to be vectorized (i.e., it will call predict ONCE, with vectors holding all the argument combinations it wants evaluated). So predict is called once and returns its result (which here has length 40, one value per validation observation), and outer complains because that doesn't equal the expected 4 (one result per (alpha, beta) pair).
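A toy illustration of the same failure mode (a deliberately non-conforming function, so the returned length doesn't match the 2 x 2 grid):
f <- function(a, b) sum(a, b)   # collapses everything to a single value
outer(1:2, 1:2, f)
## Error in outer(1:2, 1:2, f) :
##   dims [product 4] do not match the length of object [1]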
One way to fix this is to use Vectorize. Untested code:
outer(X = alpha, Y = beta,
      FUN = Vectorize(predict, vectorize.args = c("alpha", "beta")),
      object = classifier.out, data = validation.data)
I figured out one decent way to do this. Here it is:
pairs <- expand.grid(alpha, beta)
names(pairs) <- c("alpha", "beta")
mapply(predict, pairs$alpha, pairs$beta,
MoreArgs=list(object=classifier.out, data=validation.data))
Anyone have something simpler and more efficient? I am very eager to know because I spent a little too long on this problem. :(