I'm trying to perform multiple LASSO regressions in R using the following code:
library(readxl)
data <- read_excel("data.xlsx") # 20x20 matrix
library(glmnet)
library(coefplot)
A <- as.matrix(data)
results <- lapply(seq_len(ncol(A)), function(i) {
  list(
    fit_lasso = glmnet(A[, -i], A[, i], standardize = TRUE, alpha = 1),
    cvfit = cv.glmnet(A[, -i], A[, i], standardize = TRUE,
                      type.measure = "mse", nfolds = 10, alpha = 1)
  )
})
coefficients <- lapply(results, function(x, fun) fun(coef(x$cvfit, s = "lambda.min")),
                       function(x) x[x[, 1L] != 0L, 1L, drop = FALSE])
My results object is a large list (20 elements, ~1 MB) holding the same kind of LASSO output for each of the 20 variables, and coefficients contains only the variables with non-zero coefficients in each case.
I notice that for the same dataset the results are not always the same, presumably because lambda.min changes between runs of the cross-validation. I want my code to find the same lambda.min values and give the same results every time I run it on a given dataset. I believe set.seed() should manage this, but I can't figure out where to include it.
How can I make it always produce the same output for a specific dataset?
I got it to produce the same lambda.min values from run to run simply by calling set.seed() before the lapply() call. That sets the seed for the random fold assignments of the cross-validation runs.
library(readxl)
data <- read_excel("data.xlsx") # 20x20 matrix
library(glmnet)
library(coefplot)
A <- as.matrix(data)
set.seed(54234)
results <- lapply(seq_len(ncol(A)), function(i) {
  list(
    fit_lasso = glmnet(A[, -i], A[, i], standardize = TRUE, alpha = 1),
    cvfit = cv.glmnet(A[, -i], A[, i], standardize = TRUE,
                      type.measure = "mse", nfolds = 10, alpha = 1)
  )
})
coefficients <- lapply(results, function(x, fun) fun(coef(x$cvfit, s = "lambda.min")),
                       function(x) x[x[, 1L] != 0L, 1L, drop = FALSE])
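As a more robust alternative (a sketch, not from the original answer): cv.glmnet() also accepts a foldid argument, so you can fix the fold assignments once and reuse them for every response column; reproducibility then no longer depends on where the seed sits relative to the loop. The foldid construction here is an illustrative assumption:
set.seed(54234)
# one fixed fold assignment, reused for every response column:
# nrow(A) observations distributed over 10 folds
foldid <- sample(rep(seq_len(10), length.out = nrow(A)))
results <- lapply(seq_len(ncol(A)), function(i) {
  list(
    fit_lasso = glmnet(A[, -i], A[, i], standardize = TRUE, alpha = 1),
    cvfit = cv.glmnet(A[, -i], A[, i], standardize = TRUE,
                      type.measure = "mse", foldid = foldid, alpha = 1)
  )
})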
I just want to conduct a kNN classification with k = 3. I would like to predict the dependent variable “diabetes” in the validation set using the training set, and calculate the accuracy.
But I ran into this error message:
Error in knn(train = TrainXNormDF, test = ValidXNormDF, cl = MLdata2[, : 'train' and 'class' have different lengths
I tried to work around it by padding MLValidY with the following approach, but it did not solve the problem:
for(i in ((length(MLValidY) + 1):length(TrainXNormDF)))+(MLValidY = c(MLValidY, 0))
What can I do about it? Please help.
My code is below:
install.packages("mlbench")
install.packages("gbm")
library(mlbench)
library(gbm)
data("PimaIndiansDiabetes2")
head(PimaIndiansDiabetes2)
MLdata <- as.data.frame(PimaIndiansDiabetes2)
head(MLdata)
str(MLdata)
View(MLdata)
any(is.na(MLdata))
sum(is.na(MLdata))
MLdata2 <- na.omit(MLdata)
any(is.na(MLdata2))
sum(is.na(MLdata2))
View(MLdata2)
MLIdx <- sample(1:3, size = nrow(MLdata2), prob = c(0.6, 0.2, 0.2), replace = TRUE)
MLTrain <- MLdata2[MLIdx == 1,]
MLValid <- MLdata2[MLIdx == 2,]
MLTest <- MLdata2[MLIdx == 3,]
head(MLTrain)
head(MLValid)
head(MLTest)
str(MLTrain)
str(MLValid)
str(MLTest)
MLTrainX <- MLTrain[ , -9]
MLValidX <- MLValid[ , -9]
MLTestX <- MLTest[ , -9]
MLTrainY <- as.data.frame(MLTrain[ , 9])
MLValidY <- as.data.frame(MLValid[ , 9])
MLTestY <- as.data.frame(MLTest[ , 9])
View(MLTrainX)
View(MLTrainY)
library(caret)
NormValues <- preProcess(MLTrainX, method = c("center", "scale"))
TrainXNormDF <- predict(NormValues, MLTrainX)
ValidXNormDF <- predict(NormValues, MLValidX)
TestXNormDF <- predict(NormValues, MLTestX)
head(TrainXNormDF)
head(ValidXNormDF)
head(TestXNormDF)
install.packages('FNN')
library(FNN)
library(class)
NN <- knn(train = TrainXNormDF,
test = ValidXNormDF,
cl = MLValidY,
k = 3)
Thank you
Your cl variable is not the same length as your train variable. MLValidY only has 74 observations, while TrainXNormDF has 224.
cl should provide the true classification for every row in your training set.
Furthermore, cl is a data.frame instead of a vector.
Try the following:
NN <- knn(train = TrainXNormDF,
test = ValidXNormDF,
cl = MLTrainY$`MLTrain[, 9]`,
k = 3)
As @rw2 stated, it's the length of cl: I think you meant to use MLTrainY, not MLValidY. Even then, a single-column data frame can still cause shape problems, since knn() expects cl to be a vector. You could walk back to the original data to make sure you use the right content here, like so:
NN <- knn(train = TrainXNormDF,
test = ValidXNormDF,
cl = MLdata2[MLIdx == 1,]$diabetes, # shape no longer an issue
k = 3)
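To then calculate the accuracy the question asks for, a minimal sketch (assuming the validation labels are pulled from MLdata2 the same way as above) is to compare the predictions against the true validation labels:
# proportion of correct predictions on the validation set
ValidTruth <- MLdata2[MLIdx == 2, ]$diabetes
accuracy <- mean(NN == ValidTruth)
accuracy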
How can I repeat this code for each subject (xxx), so that the results are added to the data.frame (centralities)?
fullDataDetrend_xxx <- subset(fullDataDetrend, subjno == xxx,
                              select = c(subjno, depressed, sad, tired, interest, happy,
                                         neg_thoughts, concentration_probl, ruminating,
                                         activity, datevar, timestamp, dayno, beepno))
model_xxx <- var1(fullDataDetrend_xxx)
model_xxx_omega <- getmatrix(model_xxx, "omega_zeta")
centrality_model_xxx_omega <- centrality(model_xxx_omega)
centralities[nrow(centralities) + 1, ] <- c("xxx", centrality_model_xxx_omega$InExpectedInfluence)
Did as suggested:
fullDataDetrend_split <- split(fullDataDetrend, fullDataDetrend$subjno)
Then, to estimate the network, pull centrality estimates, and write them to centralities in the global environment:
analyze_one <- function(dataframe){
  # estimate the network model for one subject
  network_model <- var1(
    dataframe,
    vars = useVars,
    contemporaneous = "ggm",
    dayvar = "dayno",
    beepvar = "beepno",
    estimator = "FIML",
    verbose = TRUE,
    omega_zeta = "full")
  # pull the matrices and their centrality estimates
  model_omega <- getmatrix(network_model, "omega_zeta")
  centrality_omega <- centrality(model_omega)
  model_beta <- getmatrix(network_model, "beta")
  centrality_beta <- centrality(model_beta)
  subjno <- as.list(dataframe[1, 2])
  # append one row and write back to the global environment
  centralities[nrow(centralities) + 1, ] <- c(subjno,
                                              centrality_omega$InExpectedInfluence,
                                              centrality_beta$InExpectedInfluence,
                                              centrality_beta$OutExpectedInfluence)
  assign('centralities', centralities, envir = .GlobalEnv)
}
Then run the code with lapply over all the per-subject data frames, ignoring errors:
lapply_with_error <- function(X, FUN, ...){
  lapply(X, function(x, ...) tryCatch(FUN(x, ...),
                                      error = function(e) NULL))
}
lapply_with_error(fullDataDetrend_split, FUN = analyze_one)
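A side note on the design: instead of growing centralities in the global environment with assign(), analyze_one could return one row per subject and the rows could be bound together afterwards. A minimal sketch under that assumption (the helper name analyze_one_row is hypothetical, and the row mirrors the column order used above):
# hypothetical variant: return a row instead of assigning globally
analyze_one_row <- function(dataframe){
  network_model <- var1(dataframe, vars = useVars, contemporaneous = "ggm",
                        dayvar = "dayno", beepvar = "beepno",
                        estimator = "FIML", omega_zeta = "full")
  centrality_omega <- centrality(getmatrix(network_model, "omega_zeta"))
  centrality_beta  <- centrality(getmatrix(network_model, "beta"))
  # one row in the same column order as `centralities`
  c(dataframe[1, 2],
    centrality_omega$InExpectedInfluence,
    centrality_beta$InExpectedInfluence,
    centrality_beta$OutExpectedInfluence)
}
rows <- lapply_with_error(fullDataDetrend_split, FUN = analyze_one_row)
centralities <- as.data.frame(do.call(rbind, rows)) # NULLs from failed fits drop out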
These are the problem instructions I was given:
Build a K-NN classifier, use 5-fold cross-validation to evaluate its performance based on average accuracy.
Report accuracy measure for k = 2, ..., 10
Write your code below (Hint: you need a loop within a loop, with the outer loop going through each value of k and inner loop going through each fold):
You can manually try k=2,...,10, but try to use an outer loop through each value of k.
I was given two for loops: one for creating folds and the other for calculating k = 1:10, both listed below.
# Given data
library(datasets)
data(iris)
library(dplyr)
library(caret) # createFolds(), confusionMatrix()
library(rpart) # rpart()
normalize = function(x){
  return((x - min(x))/(max(x) - min(x)))}
# normalize
Iris_normalized = IrisData %>% mutate_at(1:4, normalize)
# Create folds
cv = createFolds(y = IrisData$class, k = 5)
accuracy = c()
for (test_rows in cv) {
  IrisData_train = IrisData[-test_rows,]
  IrisData_test = IrisData[test_rows,]
  tree = rpart(class ~ ., data = IrisData_train,
               method = "class", parms = list(split = "information"))
  pred_tree = predict(tree, IrisData_test, type = "class")
  cm = confusionMatrix(pred_tree, IrisData_test[,5])
  accuracy = c(accuracy, cm$overall[1])
}
print(mean(accuracy))
# Manual K validation
SSE_curve <- c()
for (k in 1:10) {
  print(k)
  kcluster = kmeans(utility_normalized, centers = k)
  sse = kcluster$tot.withinss
  print(sse)
  SSE_curve[k] = sse
}
So, if I am understanding the instructions correctly, I need to:
Create 5 folds from the normalized data with a for loop and set.seed().
Use a for loop to find the accuracy for k = 1:10 in each fold.
I am not sure how these two for loops combine to give the result described in the instructions.
I imagine the code you provided is just an example, and this question sounds a lot like a student homework problem, so you should at least show your effort so far.
However, here are two possible solutions:
1) Two nested for loops:
library(class)
library(caret) # createFolds(), confusionMatrix()
library(dplyr)
data("iris")
normalize = function(x){
  return((x - min(x))/(max(x) - min(x)))}
# normalize
Iris_normalized = iris %>% mutate_at(1:4, normalize)
av_accuracy <- as.list(2:10)
for (k in 2:10) {
  set.seed(4)
  cv <- createFolds(y = Iris_normalized$Species, k = 5)
  accuracy <- c()
  for (i in cv) {
    IrisData_train = Iris_normalized[-i,]
    IrisData_test = Iris_normalized[i,]
    tree <- knn(IrisData_train[,-5], IrisData_test[,-5], cl = IrisData_train$Species, k = k)
    cm <- confusionMatrix(tree, IrisData_test[,5])
    accuracy <- c(accuracy, cm$overall[1])
  }
  av_accuracy[[k-1]] <- mean(accuracy)
}
results <- data.frame(k=2:10,mean.accuracy=unlist(av_accuracy))
2) Using the caret framework, which is built exactly for this kind of task:
control <- trainControl(method = "cv", number = 5, returnResamp = "all")
grid <- expand.grid(k=2:10)
fit <-
train(
Species ~ .,
data = Iris_normalized,
trControl = control,
tuneGrid = grid,
method = "knn"
)
fit$results
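As a small usage note (not part of the original answer), fit$bestTune then reports the k with the best average accuracy, and plot(fit) visualizes accuracy across the tuning grid:
fit$bestTune # best k found by cross-validation
plot(fit)    # accuracy as a function of k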
I am using recursive feature elimination from the R package 'caret'
Linear regression works fine for my problem, therefore I am using functions = lmFuncs inside my control function.
But I would like to test this setup again without an intercept. Is this possible?
My current code:
control <- rfeControl(functions = lmFuncs,
                      verbose = FALSE)
results <- rfe(df_train,
               df_train,
               rfeControl = control)
I would also go for a custom function, but I do not know how.
Many thanks in advance.
Edit:
I found the answer after having a deeper look into the caret package.
lmFuncs without Intercept:
lmFuncs_wo_intercept <- list(
  summary = defaultSummary,
  fit = function(x, y, first, last, ...) {
    tmp <- if (is.data.frame(x)) x else as.data.frame(x, stringsAsFactors = TRUE)
    tmp$y <- y
    #lm(y~., data = tmp) # old
    lm(y~0+., data = tmp) # new: no intercept
  },
  pred = function(object, x) {
    if (!is.data.frame(x)) x <- as.data.frame(x, stringsAsFactors = TRUE)
    predict(object, x)
  },
  rank = function(object, x, y) {
    coefs <- abs(coef(object))
    #coefs <- coefs[names(coefs) != "(Intercept)"] # old: not needed without intercept
    coefs[is.na(coefs)] <- 0
    vimp <- data.frame(Overall = unname(coefs),
                       var = names(coefs))
    rownames(vimp) <- names(coefs)
    vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
    vimp
  },
  selectSize = pickSizeBest,
  selectVar = pickVars
)
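To plug this in, a sketch along the lines of the original call; df_train_x and df_train_y are placeholders here, assuming x holds the predictor columns and y the outcome:
# hypothetical: x = predictor columns, y = outcome vector from the training data
control_wo <- rfeControl(functions = lmFuncs_wo_intercept, verbose = FALSE)
results_wo <- rfe(x = df_train_x, y = df_train_y, rfeControl = control_wo)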
Is lmFuncs your linear regression? In that case you could try to fit a second linear regression without the intercept and then apply the feature elimination function.
I'm tuning parameters with a custom summaryFunction in caret.
I originally thought that with K-fold cross-validation and N input data points, performance would be measured on N/K data points.
However, that apparently is not correct: when I inspect data$pred with browser() (data being what is handed to the summary function), it only has 10 entries.
Since the input (df) has over 500 data points, this is far smaller than I expected.
Why does it only have 10 entries? Is there any way to increase this (i.e., to test performance on more data points)?
Any kind of help is appreciated. Thank you.
sigma.list <- seq(1, 5, 1)
c.list <- seq(1, 10, 1)
met <- "FValue"
# define evaluation function (note: the name `eval` shadows base::eval)
eval <- function(data, lev = NULL, model = NULL){
  mat <- table(data$pred, data$obs)
  pre <- mat[1,1]/sum(mat[1,]) # precision
  rec <- mat[1,1]/sum(mat[,1]) # recall
  res <- c("Precision" = pre, "Recall" = rec, "FValue" = 2*pre*rec/(pre+rec))
  browser()
  res
}
# define train control (the original had a trailing comma, which errors in R)
tc <- trainControl(method = "cv",
                   number = 5,
                   summaryFunction = eval,
                   classProbs = TRUE)
# tune with caret
svm.tune <- train(Flag ~ .,
                  data = df,
                  method = "svmRadial",
                  tuneGrid = expand.grid(C = c.list, sigma = sigma.list),
                  trControl = tc,
                  metric = met)
After tracking this down, it appears this is normal caret behavior.
I think caret is essentially verifying that your summaryFunction works properly by passing fake data (of length 10) to it. The function inside caret that does this is evalSummaryFunction.
I'm not quite sure what I'm doing in RStudio's debugger, but this code in train.default:
testSummary <- evalSummaryFunction(y, wts = weights,
ctrl = trControl, lev = classLevels, metric = metric,
method = method)
perfNames <- names(testSummary)
calls evalSummaryFunction which looks like:
function (y, wts = NULL, perf = NULL, ctrl, lev, metric, method)
{
n <- if (class(y)[1] == "Surv")
nrow(y)
else length(y)
if (class(y)[1] != "Surv") {
if (is.factor(y)) {
values <- rep_len(levels(y), min(10, n))
pred_samp <- factor(sample(values), levels = lev)
obs_samp <- factor(sample(values), levels = lev)
}
else {
pred_samp <- sample(y, min(10, n))
obs_samp <- sample(y, min(10, n))
}
}
else {
pred_samp <- y[sample(1:n, min(10, n)), "time"]
obs_samp <- y[sample(1:n, min(10, n)), ]
}
testOutput <- data.frame(pred = pred_samp, obs = obs_samp)
if (!is.null(perf)) {
if (is.vector(perf))
stop("`perf` should be a data frame", call. = FALSE)
perf <- perf[sample(1:nrow(perf), nrow(testOutput)),
, drop = FALSE]
testOutput <- cbind(testOutput, perf)
}
if (ctrl$classProbs) {
for (i in seq(along = lev)) testOutput[, lev[i]] <- runif(nrow(testOutput))
testOutput[, lev] <- t(apply(testOutput[, lev], 1, function(x) x/sum(x)))
}
else {
if (metric == "ROC" & !ctrl$classProbs)
stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
}
if (!is.null(wts))
testOutput$weights <- sample(wts, min(10, length(wts)))
testOutput$rowIndex <- sample(1:n, size = nrow(testOutput))
ctrl$summaryFunction(testOutput, lev, method)
}
It appears that 10 is the length of the fake data that caret passes to your summary function to check that it works properly.
If anyone can verify/explain better that this is what caret is actually doing, please post.
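One way to check this yourself (a small sketch, not from the original post): log the number of rows each call receives. The first, sanity-check call should report 10, and each subsequent resampling call should report roughly N/K rows, e.g. ~100 for 500 points with 5-fold CV:
eval_logged <- function(data, lev = NULL, model = NULL){
  # print the size of the data handed to the summary function
  cat("summaryFunction called with", nrow(data), "rows\n")
  mat <- table(data$pred, data$obs)
  pre <- mat[1,1]/sum(mat[1,]) # precision
  rec <- mat[1,1]/sum(mat[,1]) # recall
  c("Precision" = pre, "Recall" = rec, "FValue" = 2*pre*rec/(pre+rec))
}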