I'm running a LASSO regression but I'm having problems with my y term.
I know my x has to be a matrix and my y has to be numeric.
This is the case in my data set, but I feel my model does not run properly. I'll first show what I did and then what I think should be done (though I have no idea how to do it).
I used the nuclear dataset from R for this example.
library(boot)
data("nuclear")
attach(nuclear)
nuclear <- as.matrix(nuclear)
So I converted it to a matrix and then used that matrix for both x and y.
CV = cv.glmnet(x=nuclear,y=nuclear, family="multinomial", type.measure = "class", alpha = 1, nlambda = 100)
However, I feel my y is not correct. Somehow my dependent variable should be there, but how do I get it there? Assume that nuclear$pt is my dependent variable. Putting nuclear$pt in for y does not work.
plot(CV)
fit = glmnet(x=nuclear, y=nuclear, family = "multinomial" , alpha=1, lambda=CV$lambda.1se)
If I then run this, it feels like my model didn't run at all. Probably something is amiss with my y, but I can't put my finger on it.
You used the same matrix for x and y. You have to separate the independent and dependent variables somehow. For example, you can use indices to select the variables:
cv.glmnet(x=nuclear[, 1:10],y=nuclear[, 11], family="binomial",
type.measure = "class", alpha = 1, nlambda = 100)
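This uses the first 10 columns of nuclear as independent variables and the 11th column (pt) as the dependent variable. Because pt is binary, family = "binomial" is the appropriate choice. Note that after as.matrix(), nuclear$pt no longer works because $ is not defined for matrices; index by column instead, e.g. nuclear[, "pt"] or nuclear[, 11].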
When invoking boot.rq like this
b_10 = boot.rq(x, y, tau = .1, bsmethod = "xy", cov = TRUE, R = reps, mofn = mofn)
what does the B matrix (size R x p) in b_10 contain: bootstrapped coefficient estimates or bootstrapped standard errors?
The Value section in the documentation says:
A list consisting of two elements: A matrix B of dimension R by p is returned with the R resampled estimates of the vector of quantile regression parameters. [...]
So, it seems to be the coefficient estimates. But Description section says:
These functions can be used to construct standard errors, confidence intervals and tests of hypotheses regarding quantile regression models.
So it seems to be bootstrapped standard errors.
What is it really?
Edit:
I also wonder what difference the option cov = TRUE makes. Thanks!
It stores the bootstrap coefficients. Each row of B is a sample of coefficients, and you have R rows.
These samples are the basis of further inference. We can compute various statistics from them. For example, to compute bootstrap mean and standard error, we can do:
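B <- b_10$B       # R x p matrix of bootstrapped coefficient estimates
colMeans(B)       # bootstrap mean of each coefficient
apply(B, 2, sd)   # bootstrap standard error of each coefficient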
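For readers who think in NumPy, the same summaries can be sketched as below. This is only an illustration: B here is a simulated stand-in for b_10$B (the names R, p and true_coef are made up for the example), and the percentile interval is an extra summary that the answer above does not mention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for b_10$B: R = 500 bootstrap draws of p = 3 coefficients.
R, p = 500, 3
true_coef = np.array([1.0, -2.0, 0.5])
B = true_coef + rng.normal(scale=[0.1, 0.2, 0.05], size=(R, p))

boot_mean = B.mean(axis=0)        # analogue of colMeans(B)
boot_se = B.std(axis=0, ddof=1)   # analogue of apply(B, 2, sd)

# 95% percentile interval per coefficient, computed column by column.
ci_lower, ci_upper = np.percentile(B, [2.5, 97.5], axis=0)

print(boot_mean)
print(boot_se)
print(ci_lower, ci_upper)
```

Each row of B is one resampled coefficient vector, so every summary is taken down the columns.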
Do you also happen to know what difference the option cov = TRUE makes?
Are you sure that cov = TRUE works? First of all, boot.rq itself has no such argument. It may be passed in via .... However, ... is forwarded to boot.rq.pxy (if bsmethod = "pxy") or boot.rq.pwxy (if bsmethod = "pwxy"), neither of which deals with a cov argument. Furthermore, you use bsmethod = "xy", so ... will be silently ignored. As far as I could see, cov = TRUE has no effect at all.
It works in the sense that R doesn't throw me an error.
That is what "silently ignored" means. You can pass whatever into .... They are just ignored.
The bootstrapped values are different depending on whether I use cov = TRUE or not. The code was written by someone else so I'm not sure why that option was put there.
Random sampling won't give identical results on different runs. I suggest you fix a random seed then do testing:
set.seed(0); ans1 <- boot.rq(..., cov = FALSE)
set.seed(0); ans2 <- boot.rq(..., cov = TRUE)
all.equal(ans1$B, ans2$B)
If you don't get TRUE, come back to me.
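The two points in this exchange (an extra argument being silently swallowed, and results differing only because of the seed) can be reproduced with a small Python sketch. boot_means is a hypothetical helper written for this illustration; it bootstraps a sample mean and is not a port of boot.rq.

```python
import numpy as np

def boot_means(x, n_boot=200, seed=None, **ignored):
    """Bootstrap the sample mean; extra keyword arguments are accepted but never used."""
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample (with replacement) of the positions in x.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    return x[idx].mean(axis=1)

x = np.arange(20, dtype=float)

same_seed_a = boot_means(x, seed=0, cov=False)
same_seed_b = boot_means(x, seed=0, cov=True)   # 'cov' is swallowed by **ignored
unseeded_a = boot_means(x, cov=True)
unseeded_b = boot_means(x, cov=True)

# Fixed seed: identical draws, whatever value 'cov' takes.
print(np.array_equal(same_seed_a, same_seed_b))
# No seed: each call draws fresh randomness, so the results differ.
print(np.array_equal(unseeded_a, unseeded_b))
```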
You're right. It's just because of the different seeds. Thanks!!
I am new to coding, so I still struggle with simple things such as loops, subsetting, and data frame vs. matrix.
I am trying to fit a ridge regression for a multivariable X (X1 = Marker 1, X2 = Marker 2, X3 = Marker 3, ..., X1333 = Marker 1333), shown in the first image, as a predictor of Y, shown in the second image.
I want to compute the sum of the squared errors (SSE) for varying tuning parameter λ (between 1 and 20). My code is the following:
#install.packages("MASS")
library(MASS)
fitridge <- function(x, y) {
  fridge = lm.ridge(y ~ x, lambda = seq(0, 20, 2))  # fit a ridge regression for varying λ values
  sum(residuals(fridge)^2)                          # intended to give the SSE
}
all_gcv = apply(as.matrix(genmark_new), 2, fitridge, y = as.matrix(coleslev_new))
However, it returns this error, and I don't know what to do anymore. I have tried converting the data set into a matrix, a data frame, changing the order of rows and columns...
Error in colMeans(X[, -Inter]) : 'x' must be an array of at least two dimensions.
I would just like to take each marker value from a single row (first picture), pass it into my fitridge function, which fits a ridge regression against the Y from the second data set (second picture), and then extract the SSE and the corresponding lambda values.
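You cannot fit a ridge regression with only one independent variable; it is not meant for this. apply() hands your function one column at a time, so lm.ridge sees a single predictor, and that is what triggers the colMeans error. In your case, you most likely want to fit all markers at once: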
# simulate example data: 100 samples x 1333 binary markers
genmark_new = data.frame(matrix(sample(0:1, 1333*100, replace=TRUE), ncol=1333))
colnames(genmark_new) = paste0("Marker_", 1:ncol(genmark_new))
coleslev_new = data.frame(NormalizedCholesterol=rnorm(100))
Y = coleslev_new$NormalizedCholesterol
library(MASS)
# one ridge fit on all markers, for lambda = 0, 2, ..., 20
fit = lm.ridge(y ~ ., data=data.frame(genmark_new, y=Y), lambda = seq(0, 20, 2))
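Then calculate the sum of squared residuals for each lambda (note that fit$coef holds the coefficients for the scaled predictors, without an intercept; coef(fit) returns them on the original scale):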
apply(fit$coef,2,function(i)sum((Y-as.matrix(genmark_new) %*% i)^2))
       0        2        4        6        8       10       12       14
26.41866 27.88029 27.96360 28.04675 28.12975 28.21260 28.29530 28.37785
      16       18       20
28.46025 28.54250 28.62459
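To make the "SSE per lambda" computation explicit, here is a minimal NumPy sketch using the textbook closed form for ridge on centered data. It is not what lm.ridge does internally (lm.ridge also scales the predictors, so its lambda is on a different scale), and the data are simulated with only 20 hypothetical markers so that lambda = 0 stays well defined.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 samples, 20 binary markers (kept small so lambda = 0 is solvable).
n, p = 100, 20
X = rng.integers(0, 2, size=(n, p)).astype(float)
y = rng.normal(size=n)

# Center predictors and response so the intercept drops out of the penalized problem.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lambdas = np.arange(0, 22, 2)   # 0, 2, ..., 20
sse = {}
for lam in lambdas:
    # Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    resid = yc - Xc @ beta
    sse[int(lam)] = float(resid @ resid)

for lam, value in sse.items():
    print(lam, round(value, 4))
```

The training SSE can only grow as lambda increases, which matches the pattern in the output above.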
If you need to fit each variable separately, you can consider using a linear model:
fitlm <- function(x, y) {
  fridge = lm(y ~ x)
  sum(residuals(fridge)^2)
}
all_gcv = apply(genmark_new, 2, fitlm, y=Y)
As a suggestion, check out some notes or an introduction to ridge regression: it is meant for multiple regression, that is, for models with more than one independent variable.
I ran the following code for a binary classification task with an SVM in both R (first sample) and Python (second sample).
Given randomly generated data (X) and a response (Y), this code performs leave-group-out cross-validation 1000 times. Each entry of ans is therefore the mean of the predictions for that observation across CV iterations.
Computing the area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see: the area under the curve is frequently well above 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))
library(e1071)
library(pROC)
colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
  #get train
  train=sample(seq(length(Y)),0.5*length(Y))
  if(min(table(Y[train]))==0)
    next
  #test from train
  test=seq(length(Y))[-train]
  #train model
  XX=X[train,]
  YY=Y[train]
  mod=svm(XX,YY,probability=FALSE)
  XXX=X[test,]
  predVec=predict(mod,XXX)
  RFans=attr(predVec,'decision.values')
  ansMat[test,i]=as.numeric(predVec)
}
ans=rowMeans(ansMat,na.rm=TRUE)
r=roc(Y,ans)$auc
print(r)
Similarly, when I implement the same thing in Python I get similar results.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # skip splits whose training set contains only one class
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).
So what you should actually do is calculate the AUC for each CV iteration and then take the mean of those AUCs.
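A rough sketch of that per-iteration scheme in Python is shown below. It reuses the simulated setup from the question with fewer iterations. Two choices here are my own assumptions rather than part of the question: scoring the test rows with decision_function instead of hard labels, and skipping splits where either side lacks a class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
Y = np.array([1, 2] * 14)
X = rng.uniform(size=(len(Y), 100))

n_iter = 200
aucs = []
for _ in range(n_iter):
    train = rng.choice(len(Y), size=len(Y) // 2, replace=False)
    test = np.setdiff1d(np.arange(len(Y)), train)
    # AUC is undefined unless both classes appear; the model also needs both to train.
    if len(np.unique(Y[train])) < 2 or len(np.unique(Y[test])) < 2:
        continue
    mod = SVC().fit(X[train], Y[train])
    scores = mod.decision_function(X[test])   # continuous score; larger favours class 2
    aucs.append(roc_auc_score(Y[test] == 2, scores))

mean_auc = float(np.mean(aucs))
print(f"mean AUC over {len(aucs)} splits: {mean_auc:.3f}")
```

Each split is scored on its own held-out rows, so predictions from different models are never pooled before the AUC is computed.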
I ran an rfe model with around 400 variables and found that the optimal model uses 40 variables. However, plotting the standard deviations of the error based on cross-validation shows that the 40-variable model performs only slightly better than a model with only 10 variables. That's why I'd like to go for the model with 10 variables: I would like to take the 10 variables that perform best for a ten-variable model and train the model again.
How can I get the 10 variables which lead to the model performance shown in the rfe object?
Since I use rerank=TRUE, I cannot just pick the 10 highest ranked variables from varImp(rfeModel$fit) right? (Would this work if I was not using "rerank" ?)
I'm also struggling with the differences between the output from varImp(rfeModel$fit), varImp(rfeModel), pickVars(rfeModel$variables,40).
What is the right way to get the best performing variables with regard to the size of interest?
The following example can be used:
library(caret)
data(BloodBrain)
x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)
set.seed(1)
rfProfile <- rfe(x, logBBB,
                 sizes = c(2, 5, 10, 20),
                 method = "nnet",
                 maxit = 10,
                 rfeControl = rfeControl(functions = caretFuncs,
                                         returnResamp = "all",
                                         method = "cv",
                                         rerank = TRUE))
print(rfProfile)
varImp(rfProfile$fit)
varImp(rfProfile)
pickVars(rfProfile$variables, rfProfile$optsize)
The simplest thing to do is to use the update function:
new_profile <- update(rfProfile, x = x, y = logBBB, size = 10)
Internally, it uses this code:
selectedVars <- rfProfile$variables
bestVar <- rfProfile$control$functions$selectVar(selectedVars, 10)
Max
I'm looking to calculate some form of correlation coefficient in R (or any common stats package actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data, but actually want the correlation to be reduced based on the number of incomplete cases included in some systematic fashion. The data are a series of time points generated by different individuals and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual...
Again, not sure if there is any standard procedure for dealing with such a situation.
One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. glm in R can be used for logistic regression.
Also on a different note, see the use="pairwise.complete.obs" argument to cor which may or may not apply to you.
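As a rough illustration of the missingness check, the sketch below regresses an "is missing" indicator on the time index. Everything in it is assumed for the example: the data are simulated with values missing completely at random, and scikit-learn's LogisticRegression applies an L2 penalty by default and reports no p-values. For actual inference, glm with family = binomial in R is the better tool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical series: 200 time points, about 20% of values missing completely at random.
time = np.arange(200, dtype=float)
values = rng.normal(size=200)
values[rng.random(200) < 0.2] = np.nan

# Response: 1 where the observation is missing, 0 where it is present.
is_missing = np.isnan(values).astype(int)

# Standardise time so the slope is on a comparable scale.
t_std = (time - time.mean()) / time.std()

model = LogisticRegression().fit(t_std.reshape(-1, 1), is_missing)
slope = float(model.coef_[0, 0])
print(f"log-odds slope of missingness on time: {slope:.3f}")
```

A slope near zero is consistent with missingness being unrelated to time, whereas a clearly non-zero slope would suggest the gaps are informative.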
EDIT: I have revised this answer based on rereading the question.
My feeling is that when a data pair has NA in one of the time series, that pair cannot be used for calculating a correlation, as there is no information at that point. Without that information, there is no way to know how it would influence the correlation. Specifying that an NA reduces the correlation seems tricky: if an observation were present at that point, it could just as easily have improved the correlation.
The default behavior of cor in R is to return NA for the correlation if an NA is present. This behavior can be changed using the use argument; see the documentation of cor for more details.
As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.
myfun = function(x1, x2, idx_rm) {
  temp = x2
  # remove 'idx_rm' points from x2
  temp[idx_rm] = NA
  # calculate correlations
  r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
  r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
  # penalty: proportion of observations whose NA status matches
  penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
                     (!is.na(temp) & is.na(x1)))
  r_pen = round(r_NA * penalty, 2)
  # plot
  plot(x1, temp, main = paste('r_full =', r_full,
                              '; r_NA =', r_NA,
                              '; r_pen =', r_pen),
       xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
  points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)
  regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
  regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
  abline(regr_full[1], regr_full[2])
  abline(regr_NA[1], regr_NA[2], lty = 2)
}
Run a simple simulation to illustrate the possible effects of missing values and penalization:
set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artificially inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))
# A case when NA's DEflate the correlation,
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))
# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)
# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))
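Stripped of the plotting, the penalty itself is only a few lines. The following is a NumPy sketch of the same idea: penalized_cor, coder_a and coder_b are illustrative names and toy data, not part of any package.

```python
import numpy as np

def penalized_cor(a, b):
    """Complete-case Pearson correlation, scaled by the share of positions
    whose NA status agrees between the two series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na_a, na_b = np.isnan(a), np.isnan(b)
    complete = ~na_a & ~na_b
    r = np.corrcoef(a[complete], b[complete])[0, 1]
    # XOR marks positions where exactly one of the two values is missing.
    penalty = 1.0 - np.mean(na_a ^ na_b)
    return r * penalty

coder_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan, 10.0])
coder_b = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, np.nan, np.nan, np.nan])

result = penalized_cor(coder_a, coder_b)
# r = 1 on the seven complete pairs; 2 of 10 positions mismatch, so the penalty is 0.8.
print(round(result, 2))
```

Position 9 is missing for both coders, so it counts as a match and is not penalized; only the two positions where just one coder has NA reduce the score.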