Lambda Issue, or cross validation - r

I am doing double cross-validation with LASSO from the glmnet package. However, when I plot the results I get lambda values from 0 to about 150,000, which is unrealistic in my case. I am not sure what I am doing wrong. Can someone point me in the right direction? Thanks in advance!
calcium = read.csv("calciumgood.csv", header=TRUE)
dim(calcium)
n = dim(calcium)[1]
calcium = na.omit(calcium)
names(calcium)
library(glmnet) # use LASSO model from package glmnet
lambdalist = exp((-1200:1200)/100) # defines models to consider
fulldata.in = calcium
x.in = model.matrix(CAMMOL~. - CAMLEVEL - AGE,data=fulldata.in)
y.in = fulldata.in[,2]
k.in = 10
n.in = dim(fulldata.in)[1]
groups.in = c(rep(1:k.in,floor(n.in/k.in)),seq_len(n.in%%k.in)) # seq_len() avoids a stray fold 0 when n.in is a multiple of k.in
set.seed(8)
cvgroups.in = sample(groups.in,n.in) #orders randomly, with seed (8)
#LASSO cross-validation
cvLASSOglm.in = cv.glmnet(x.in, y.in, lambda=lambdalist, alpha = 1, nfolds=k.in, foldid=cvgroups.in)
plot(cvLASSOglm.in$lambda,cvLASSOglm.in$cvm,type="l",lwd=2,col="red",xlab="lambda",ylab="CV(10)")
whichlowestcvLASSO.in = order(cvLASSOglm.in$cvm)[1]; min(cvLASSOglm.in$cvm)
bestlambdaLASSO = (cvLASSOglm.in$lambda)[whichlowestcvLASSO.in]; bestlambdaLASSO
abline(v=bestlambdaLASSO)
bestlambdaLASSO # this is the lambda for the best LASSO model
LASSOfit.in = glmnet(x.in, y.in, alpha = 1,lambda=lambdalist) # fit the model across possible lambda
LASSObestcoef = coef(LASSOfit.in, s = bestlambdaLASSO); LASSObestcoef # coefficients for the best model fit

I found the dataset you are referring to:
Calcium, inorganic phosphorus and alkaline phosphatase levels in elderly patients.
Basically, the data are "dirty", which is a possible reason why the algorithm does not converge properly. For example, there are patients recorded as 771 years old, and besides 1 and 2 for male and female, there is a 22 in the sex encoding.
In your case, you removed only the NAs.
You also need to check the column types of the imported data.frame. For example, categorical variables (SEX, Lab and Age group) could be imported as integers instead of factors, which will affect the model.
I think you need to:
1) cleanse the data;
2) if that does not work, submit the *.csv file.
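The checks described above could look roughly like the sketch below. It uses a small made-up data frame (`calcium_toy`) with column names borrowed from the question; the values, the age cut-off of 120 and the factor labels are assumptions for illustration only. The last two lines are a separate sanity check: they print the range of the user-supplied lambda grid, which by itself spans up to exp(12), so an x-axis running to roughly 150,000 follows from the grid and not necessarily from the fit.

```r
# Toy stand-in for the imported CSV; column names follow the question, values are made up
calcium_toy <- data.frame(
  AGE    = c(78, 771, 69, 85),
  SEX    = c(1, 2, 22, 1),
  CAMMOL = c(2.45, 2.30, 2.68, 2.53)
)

# Inspect the imported types: categorical codes often arrive as numeric
str(calcium_toy)

# Keep only rows with plausible ages and valid sex codes (thresholds are assumptions)
ok <- calcium_toy$AGE <= 120 & calcium_toy$SEX %in% c(1, 2)
calcium_clean <- calcium_toy[ok, ]

# Turn the categorical code into a factor so model.matrix() builds dummy columns
calcium_clean$SEX <- factor(calcium_clean$SEX, levels = c(1, 2),
                            labels = c("male", "female"))

# Separate check: the grid from the question runs from exp(-12) to exp(12)
lambdalist <- exp((-1200:1200) / 100)
range(lambdalist)
```

Plotting the cross-validation curve against log(lambda), as the default plot method for cv.glmnet objects does, may make the region around the minimum easier to read.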

Related

Using predict in metafor when each author has multiple rows in the data

I'm running a meta-analysis where I'm interested in the effect of X on the effect of age on habitat use (raw mean values and variances) using the metafor package.
An example of one of my models is:
mod6 <-
  rma.mv(
    yi = Used_value,
    V = Used_variance,
    slab = Citation,
    mods = ~ Age + poly(Slope, degrees = 2),
    random = ~ 1 | Region,
    data = vel.focal,
    method = "ML"
  )
My justification for not using Citation as a random effect is that using only Region accounts for more of the heterogeneity than when random = list( ~ 1 | Citation/ID, ~ 1 | Region) or when Citation/ID is used by itself.
What I need for output is the prediction for each age by region, but the predict() function for the model and the associated forest plot spits out the prediction for each row, as it assumes each row in the data is a unique study. In my case it is not as I have my input values separated by age and season.
predict(mod6)
                                    pred     se  ci.lb   ci.ub   pi.lb   pi.ub
Riehle and Griffith 1993.1        9.3437 2.3588 4.7205 13.9668  0.2362 18.4511
Riehle and Griffith 1993.2        9.3437 2.3588 4.7205 13.9668  0.2362 18.4511
Riehle and Griffith 1993.3        9.3437 2.3588 4.7205 13.9668  0.2362 18.4511
Spina 2000.1                      8.7706 2.7386 3.4030 14.1382 -0.7364 18.2776
Spina 2000.2                      8.5407 2.7339 3.1824 13.8991 -0.9611 18.0426
Spina 2000.3                      8.5584 2.7406 3.1868 13.9299 -0.9509 18.0676
Vondracek and Longanecker 1993.1 12.6116 2.5138 7.6847 17.5385  3.3462 21.8769
Vondracek and Longanecker 1993.2 12.6116 2.5138 7.6847 17.5385  3.3462 21.8769
Vondracek and Longanecker 1993.3 12.3817 2.5327 7.4176 17.3458  3.0965 21.6669
Vondracek and Longanecker 1993.4 12.3817 2.5327 7.4176 17.3458  3.0965 21.6669
Does anybody know a way to modify the arguments inside predict() to tell it how you want your predictions output or to tell it that there are multiple rows per slab?
You need to use the newmods argument to specify the values for Age for which you want predicted values. You will have to plug in something for the linear and quadratic terms for the Slope variable as well (e.g., holding Slope constant at its mean and hence the quadratic term will just be the mean squared). Region is not a fixed effect, so it is not relevant if you want to compute predicted values based on the fixed effects. If you want to compute BLUPs for those random effects, you can do so with ranef(). One can then combine the predictions based on the fixed effects with the BLUPs. That would be the general idea, but implementing this will require a bit of programming.
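A minimal sketch of that idea, not using the poster's data: it fits a model of the same shape on the `dat.konstantopoulos2011` example data that comes with metafor, with a centred `year` variable (`yc`, a made-up name) standing in for the moderators and `district` standing in for Region. Raw polynomial terms via `I(yc^2)` are used here so that the quadratic column of `newmods` is simply the square; with orthogonal `poly()` terms the new values would have to go through the same transformation first.

```r
library(metafor)

# Example data shipped with metafor (not the poster's data)
dat <- dat.konstantopoulos2011
dat$yc <- dat$year - 1990   # centred moderator, stand-in for Age / Slope

# Quadratic moderator plus a single random intercept, analogous to ~ 1 | Region
res <- rma.mv(yi, vi, mods = ~ yc + I(yc^2), random = ~ 1 | district,
              data = dat, method = "ML")

# One row per desired prediction; columns in the same order as the moderators
yc_new  <- c(-10, 0, 10)
newmods <- cbind(yc_new, yc_new^2)

# Predictions based on the fixed effects only
fixed_pred <- predict(res, newmods = newmods)
fixed_pred

# BLUPs of the random intercepts, one row per district
blups <- ranef(res)$district
head(blups)
```

Adding a district's BLUP to a fixed-effects prediction gives a district-specific prediction; getting an interval for that sum is the part that needs the extra programming mentioned above.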

Creating Inverse Probability of Attrition Weights in R

Weuve et al. (2012) wrote a great paper about implementing Inverse Probability of Attrition Weighting (IPAW), a weighting method used to account for bias introduced by attrition during the course of a longitudinal study. Here is a link to said article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3237815/#R30
I am working on a project where I am trying to implement this IPAW method and there isn't much out there on how to implement and code this method, so I'm looking for some help just to make sure I'm doing everything correctly.
The data I am working with involve older individuals who may have dementia, so it makes sense to use IPAW because those with dementia are more likely to leave the study. Each individual has at least a baseline visit and then up to 12 follow-up visits (the average number of visits per person is around 3). My understanding is that I should create weights for each round of follow-up visits, so I start by subsetting the data to a single visit, creating a variable for whether or not somebody drops out immediately following that visit, and then I proceed to creating the models and weights.
Below is the R code I have been using to generate the weights:
Creating weights for the first follow-up visit (visit == 1)
for-loop to create a variable for attrition
(For ease, I am just calling the last observation "x")
data$attrition <- NA
data$attrition[x] <- 1
for (i in 1:(x - 1)) { # stop at x - 1 so that visit[i + 1] always exists
  if (data$visit[i + 1] == 0) {
    data$attrition[i] <- 1
  } else {
    data$attrition[i] <- 0
  }
}
subsetting to only get the data for the first follow-up visit
data_visit1 <- subset(data, visit == 1)
creating stepwise model for the likelihood of attriting
# specifying null model
null_visit1 <- glm(attrition ~ 1, family = binomial, data = data_visit1)
# specifying full model --
full_visit1 <- glm(attrition ~
    predictor1 +
    predictor2 +
    ...,
  family = binomial, data = data_visit1)
# running combined selection
stepmodel_visit1 <- step(null_visit1, scope = list(lower = null_visit1, upper = full_visit1), direction = "forward", k = 2)
Creating weights
# re-naming model for denominator
denom.model <- stepmodel_visit1
# creating the predicted categorizations
pd_visit1 <- predict(denom.model, type = "response")
## estimation of numerator of ip weights using stabilizer instead of just 1
numer.model <- glm(attrition ~ 1, family = binomial(), data = data_visit1)
# predicting the numerator values
pn_visit1 <- predict(numer.model, type = "response")
# Putting together the actual weights (data_visit1 is the data set the models were fitted on)
data_visit1$weight <- ifelse(data_visit1$attrition == 1, pn_visit1 / pd_visit1, (1 - pn_visit1) / (1 - pd_visit1))
Following this, I rejoin the weights back to the full dataset and then repeat the process for each round of follow up visits. So my question is, does this all look good? I would love any and all feedback on my approach. Thanks so much!
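As a side note on the attrition flag described above, the same rule (a row counts as the last observed visit when the next row is a baseline visit, or when there is no next row) can be written without a loop. This is only a sketch on made-up data; the `toy` data frame and its `id` column are hypothetical, and it assumes the rows are sorted by participant and then by visit.

```r
# Toy data: two participants, rows sorted by id and then by visit (hypothetical)
toy <- data.frame(id    = c(1, 1, 1, 2, 2),
                  visit = c(0, 1, 2, 0, 1))

# Visit number of the following row; the end of the data is treated like a new baseline
next_visit <- c(toy$visit[-1], 0)

# 1 if this row is the participant's last observed visit, 0 otherwise
toy$attrition <- as.integer(next_visit == 0)
toy
```

Under this rule, everyone's final row is flagged, including people who completed all scheduled visits, so it may be worth deciding separately how study completers should be coded.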

Random Forest - Caret - Time Series

I have a time series (Apple stock closing prices) that I turned into a data frame to fit a random forest using caret. I lagged it by 1 day, 2 days and 6 days. I want to predict the next 2 days, i.e. a two-step-ahead forecast. But caret uses the predict function, which does not accept the arguments that the forecast function does. I have seen some people try to pass the argument n.ahead, but that is not working for me. Any advice? See the code:
library(caret)
df <- data.frame(APPL)
df$f1 <- lag(df$APPL, 1)
df$f2 <- lag(df$APPL, 2)
df$f3 <- lag(df$APPL, 6)
# change column names
colnames(df) <- c("price", "price_1", "price_2", "price_6")
# remove rows (days) with NA.
df <- df[complete.cases(df), ]
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = FALSE,
  verboseIter = TRUE,
  preProcOptions = list(thresh = 0.95, na.remove = TRUE, verbose = TRUE))
set.seed(1234)
rf_grid <- expand.grid(mtry = c(1:3))
fit <- train(price ~ .,
             data = df,
             method = "rf",
             preProcess = c("center", "scale"),
             tuneGrid = rf_grid,
             trControl = fitControl,
             ntree = 200,
             metric = "RMSE")
nextday <- predict(fit,`WHAT GOES HERE?`)
If I just call predict(fit), it uses the whole dataset as newdata, which I think is wrong. The other thing I was thinking about is to do a loop: predict 1 step ahead, since I have the data from 1, 2 and 6 days ago, and then, for the 2-step-ahead forecast, fill the 1-day-ago "cell" with the forecast I just made.
Right now, you can't pass other options to the underlying predict method. There is a proposed change that might enable this though.
In your case, you should give the predict function a data frame that has the appropriate predictors for the next few observations.
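A rough sketch of building such a data frame step by step, feeding each forecast back in as the newest lag. To keep it self-contained, it uses a synthetic price series and a plain `lm()` fit as a stand-in for the caret model; the helper names `make_lags` and `forecast_recursive` are made up for this example. The `predict(fit, newdata = ...)` call is used the same way with an object returned by `train()`.

```r
set.seed(1)
price <- cumsum(rnorm(200)) + 100   # synthetic stand-in for the closing prices

# Row t holds price[t] and its values 1, 2 and 6 days earlier
make_lags <- function(p) {
  idx <- 7:length(p)
  data.frame(price   = p[idx],
             price_1 = p[idx - 1],
             price_2 = p[idx - 2],
             price_6 = p[idx - 6])
}

df  <- make_lags(price)
fit <- lm(price ~ ., data = df)     # stand-in for the caret::train() fit

# Forecast h steps ahead, appending each forecast to the series before the next step
forecast_recursive <- function(fit, p, h = 2) {
  out <- numeric(h)
  for (step in seq_len(h)) {
    n <- length(p)
    newrow <- data.frame(price_1 = p[n],       # 1 day before the target day
                         price_2 = p[n - 1],   # 2 days before
                         price_6 = p[n - 5])   # 6 days before
    out[step] <- as.numeric(predict(fit, newdata = newrow))
    p <- c(p, out[step])
  }
  out
}

fc <- forecast_recursive(fit, price, h = 2)
fc
```

The second forecast depends on the first, so its error is likely to be larger than a one-step-ahead error estimated by cross-validation.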
1: colnames(df) <- c("price", "price_1", "price_2", "price_6") (note the closing " after price_6)
2: predict {stats} is a generic function for predictions from the results of various model fitting functions:
predict(model_object, dataframe)
There are 3 cases for the data frame:
case 1: training data, on which the model is fitted: in-sample prediction
case 2: test data: out-of-sample prediction
case 3: forecasted data, i.e. forecasted values of the independent variables: we get the forecasted values of the dependent variable according to the model
The column names in cases 2 and 3 should be the same as the column names of the training data.

poLCA - Latent Class how to do the adjusted Lo-Mendell-Rubin (LMR) test with R

Good afternoon,
I am trying to perform the adjusted test of Lo, Mendell and Rubin (2001) (LMR) in order to decide on the optimal number of classes in LCA. I fitted the models with poLCA, but I could not find any command to perform the test.
Is there someone that can help me?
Thank you very much!
Here is an example of an (ad-hoc adjusted) LMR test comparing an LCA with 3 groups (alternative model) against one with 2 groups (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
                  data = election, nclass = 2, verbose = FALSE)
# store values baseline model
n <- mod_null$Nobs # number of observations (should be equal in both models)
null_ll <- mod_null$llik # log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
                 data = election, nclass = 3, verbose = FALSE)
# Store values alternative model
alt_ll <- mod_alt$llik # log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
Wow, really late to the game, but as I'm looking at similar things I'll leave this for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-squared test to determine whether K classes are a better fit than K-1 classes... basically.
However, there is reasonable research out there suggesting that a better measure of this is the bootstrap likelihood ratio.
The former is still in common use among Mplus users; the latter is far more common in LCA packages in R, e.g. mclust. Dunno about poLCA though...

How do I predict new data's cluster after clustering training data?

I have already trained my clustering model using hclust:
model=hclust(distances,method="ward")
And the result looks good:
Now I get some new data records, I want to predict which cluster every one of them belongs to. How do I get it done ?
Clustering is not supposed to "classify" new data, as the name suggests - it is the core concept of classification.
Some of the clustering algorithms (like those centroid based - kmeans, kmedians etc.) can "label" new instance based on the model created. Unfortunately hierarchical clustering is not one of them - it does not partition the input space, it just "connects" some of the objects given during clustering, so you cannot assign the new point to this model.
The only "solution" to use the hclust in order to "classify" is to create another classifier on top of the labeled data given by hclust. For example you can now train knn (even with k=1) on the data with labels from hclust and use it to assign labels to new points.
As already mentioned, you can use a classifier such as class::knn to determine which cluster a new individual belongs to.
The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of some kind of similarity measure, typically a distance measure such as the Euclidean distance.
Below is some example code using the iris data.
library(scorecard)
library(factoextra)
library(class)
df_iris <- split_df(iris, ratio = 0.75, seed = 123)
d_iris <- dist(scale(df_iris$train[,-5]))
hc_iris <- hclust(d_iris, method = "ward.D2")
fviz_dend(hc_iris, k = 3, cex = 0.5, k_colors = c("#00AFBB","#E7B800","#FC4E07"),
          color_labels_by_k = TRUE, ggtheme = theme_minimal())
groups <- cutree(hc_iris, k = 3)
table(groups)
Predict new data
knnClust <- knn(train = df_iris$train[,-5], test = df_iris$test[,-5] , k = 1, cl = groups)
knnClust
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 2 2 2 2 2 3 3 2 2 3 2 2 2 2 2 2 2 2 2
Levels: 1 2 3
# p1 <- fviz_cluster(list(data = df_iris$train[,-5], cluster = groups), stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("train")
# p2 <- fviz_cluster(list(data = df_iris$test[,-5], cluster = knnClust),stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("test")
# gridExtra::grid.arrange(p1,p2,nrow = 2)
pca1 <- data.frame(prcomp(df_iris$train[,-5], scale. = T)$x[,1:2], cluster = as.factor(groups), factor = "train")
pca2 <- data.frame(prcomp(df_iris$test[,-5], scale. = T)$x[,1:2], cluster = as.factor(knnClust), factor = "test")
pca <- as.data.frame(rbind(pca1,pca2))
Plot train and test data
ggplot(pca, aes(x = PC1, y = PC2, color = cluster, size = 1, alpha = factor)) +
geom_point(shape = 19) + theme_bw()
You can use this classification and then use LDA to predict which class the new point should fall into.
I faced a similar problem and worked out a temporary solution.
In my R environment, the function hclust gives the labels for the training data.
We can use a supervised learning model to reconnect labels and features.
Then we apply the same data processing as we would for any supervised learning model.
If it is a binary classification model, we can use the KS value, the AUC value and so on to assess the performance of this clustering.
Similarly, we can apply PCA to the features and extract PC1 as a label.
By binning this label, we get a new label suitable for classification.
We then proceed in the same way as with any classification model.
In R, I find that PCA runs much faster than hclust (Mayank 2016).
In practice, I find that this approach makes the model easy to deploy.
But I am not sure whether this temporary solution introduces bias into the predictions.
Ref
Mayank. 2016. “Hclust() in R on Large Datasets.” Stack Overflow.
Why not compute the centroid of the points for each hclust cluster, then assign a new point to the nearest using the same distance function ?
knn in class will only look at nearest n and only allows Euclidean distance.
There's no need to run a classifier.
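A minimal sketch of the centroid idea on the iris data, using only base R. The train/new split, the choice of k = 3 and the helper name `assign_cluster` are assumptions for illustration, and squared Euclidean distance is used because the clustering itself was built on Euclidean distances of the scaled features. New points are scaled with the training means and standard deviations so that they live in the same space as the centroids.

```r
# Odd rows act as "training" data, even rows as new records (arbitrary split)
train_idx <- seq(1, nrow(iris), by = 2)
x_train   <- scale(iris[train_idx, 1:4])

hc     <- hclust(dist(x_train), method = "ward.D2")
groups <- cutree(hc, k = 3)

# Cluster centroids in the scaled space: one row per cluster, one column per feature
centroids <- apply(x_train, 2, function(col) tapply(col, groups, mean))

# Scale the new records with the training centre and scale
x_new <- scale(iris[-train_idx, 1:4],
               center = attr(x_train, "scaled:center"),
               scale  = attr(x_train, "scaled:scale"))

# Assign each row of x to the cluster with the nearest centroid
assign_cluster <- function(x, centroids) {
  d <- apply(centroids, 1, function(ctr) rowSums(sweep(x, 2, ctr)^2))
  max.col(-d, ties.method = "first")   # column index of the smallest distance
}

new_cluster <- assign_cluster(x_new, centroids)
table(new_cluster)
```

Nearest-centroid assignment does not necessarily reproduce where a point would have landed had it been part of the original dendrogram, so it may be worth checking how often it agrees with the hclust labels on the training rows.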

Resources