R caret nnet package - r

I have two R objects as below.
matrix "datamatrix" - 200 rows and 494 columns: these are my x variables
dataframe Y. Y$V1 is my Y variable. I have converted column V1 to a factor because I am building a classification model.
I want to build a neural network, and I ran the command below.
model <- train(Y$V1 ~ datamatrix, method='nnet', linout=TRUE, trace = FALSE,
#Grid of tuning parameters to try:
tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
I got an error - " argument "data" is missing, with no default"
Is there a way for the caret package to understand that I have my X variables in one R object and my Y variable in another? I don't want to combine the two data objects and then write a formula, as the formula would be too long:
Y~x1+x2+x3.................x199+x200....x493+x494

The "argument "data" is missing" error occurs because the formula interface of train expects a data argument that holds every variable named in the formula. The way I would do it is to put the predictors and the response into one data frame, something like:
datafr <- as.data.frame(datamatrix)
# add the response from Y under a name that does not clash with the
# predictor columns (these default to V1, V2, ... if dimnames aren't specified)
datafr$response <- as.factor(Y$V1)
model <- train(response ~ ., data = datafr, method='nnet',
linout=TRUE, trace = FALSE,
tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
Now the response sits in the same data frame as the predictors, and you don't have to write the long formula out.
The . identifier includes every other column of datafr as a predictor (see ?formula for details).
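As an alternative that keeps the two objects separate, train also has a non-formula interface that takes the predictors as x and the outcome as y. The following is only a minimal sketch on made-up data: the small datamatrix and Y below are stand-ins, not the asker's objects, and linout = TRUE is left out on the assumption that the goal is classification (that switch selects linear output units).

```r
library(caret)  # the nnet package must be installed as well

set.seed(1)
# Stand-ins for the question's objects (made-up data, far smaller)
datamatrix <- matrix(rnorm(60 * 4), nrow = 60,
                     dimnames = list(NULL, paste0("x", 1:4)))
Y <- data.frame(V1 = factor(rep(c("a", "b"), each = 30)))

# x/y interface: no formula and no combined data frame needed
model <- train(x = datamatrix, y = Y$V1,
               method = "nnet", trace = FALSE,
               tuneGrid = expand.grid(size = c(1, 5), decay = c(0, 0.1)))
model$bestTune
```

With the real 494-column matrix and larger size values, nnet may also need a higher MaxNWts, which can be passed through train in the same way as trace.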

Related

How can I include both my categorical and numeric predictors in my elastic net model? r

As a note beforehand, I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput; it would be illegal to do so. That is why I made a fake dataset and explained my process to help reproduce the error.
I have been trying to estimate an elastic net model in R using glmnet. However, I keep getting an error, and I am not sure what is causing it. The error happens when I train the model. It sounds like it has something to do with the data type and the matrix conversion.
I have provided a sample dataset. First I set the outcomes and certain predictors to be factors and label their levels. Next, I create an object, pred.names.min, with the column names of the predictors I want to use. Then I partition the data into training and test data frames: 65% in training, 35% in test. With the trainControl function, I specify a few things I want to happen with the model: random search over the parameters lambda and alpha, leave-one-out cross-validation, and class probabilities, since it is a classification model (categorical outcome). In the last step, I specify the training model, telling it to use all of the predictor variables in pred.names.min from the trainingset data frame.
library(dplyr)
library(tidyverse)
library(glmnet)
library(caret)
#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
"age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),
"L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
"Hydroxymethyl_5_furancarboxylicacidArea_2"=
c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
"Anhydro_1.5_D_glucitolArea"=
c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
"LevoglucosanArea"=
c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
"HexadecanolArea_1"=
c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
"EthanolamineArea"=
c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
"OxoglutaricacidArea_2"=
c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
"AminopentanedioicacidArea_3"=
c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
"XylitolArea"=
c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
"DL_XyloseArea"=
c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
"ErythritolArea"=
c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
"hpresponse1"=
c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
"hpresponse2"=
c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))
#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<- as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<-
as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)
#labeling factor levels
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels =c ("No",
"Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <-
factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels =c ("No",
"Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50",
">=50"))
#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]
#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]
#specifying that I want to use the leave-one-out cross-validation
#method and "random" as the search for elastic net
tcontrol <- trainControl(method = "LOOCV",
search="random",
classProbs = TRUE)
#training model
elastic_model1 <- train(as.matrix(trainingset[,
pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
After I run the last chunk of code, I end up with this error:
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a
method for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
I tried removing the "as.matrix" call:
elastic_model1 <- train((trainingset[, pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
It still produces a similar error.
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a method
for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
When I tried to make none of the predictors factors (but keep outcome as factor), this is the error I get:
Error: At least one of the class levels is not a valid R variable name; This
will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels that
can be used as valid R variable names (see ?make.names for help).
How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?
glmnet does not handle factors well. The current recommendation is to dummy-code the factors and recode to numeric where possible; see the related question "Using LASSO in R with categorical variables".
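To make the dummy-coding step concrete, here is a small base-R sketch. The toy data frame and its column names are made up for illustration and are not the asker's data. It shows why as.matrix() on mixed columns produces the character matrix the error complains about, how model.matrix() gives a numeric matrix instead, and why outcome levels such as "0" and "1" trigger the make.names complaint.

```r
# Toy data (hypothetical): one factor predictor, one numeric predictor
toy <- data.frame(
  bmi = factor(c("low", "mid", "high", "mid", "high", "low", "mid", "low")),
  age = c(0, 4, 8, 1, 2, 7, 4, 9)
)

# as.matrix() on a data frame with a factor column falls back to character
char_mat <- as.matrix(toy[, c("bmi", "age")])
is.character(char_mat)

# model.matrix() dummy-codes the factor; [, -1] drops the intercept column
x <- model.matrix(~ bmi + age, data = toy)[, -1]
is.numeric(x)
colnames(x)

# Levels "0" and "1" are not valid R names, hence the X0/X1 message
new_levels <- make.names(c("0", "1"))
new_levels
```

A numeric matrix like x can then be passed as the x argument of train(..., method = "glmnet"), with the labelled factor outcome as y and without a data argument.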

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
#  argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
#  'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
#  'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had their own errors and it was difficult troubleshooting them because I couldn't find answers or examples online that were of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
by = cbind(d0_10, d0_100, d0_1000, d0_10000),
k = 3),
data = Data %>%
mutate(d0_10_bin = 10,
d0_100_bin = 100,
d0_1000_bin = 1000,
d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
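One possible way to avoid the hand-written columns is to create the constant bin columns in a loop and assemble the formula from text. This is only a sketch under assumptions: the 5-row Data is a stand-in for the question's data, the names cols, bin_cols and signal_formula are hypothetical, and the final brm() call is shown as a comment because it was not run here.

```r
# Small stand-in for the question's Data (5 rows instead of 100)
set.seed(1)
cols <- c("d0_10", "d0_100", "d0_1000", "d0_10000")
Data <- as.data.frame(matrix(runif(5 * 4, 1, 10), ncol = 4,
                             dimnames = list(NULL, cols)))
Data$Density <- drop(as.matrix(Data[cols]) %*% c(-1, 10, 5, 1))

# Evaluation point for each column, read off the column name (10, 100, ...)
bin_values <- as.numeric(sub("^d0_", "", cols))
bin_cols <- paste0(cols, "_bin")
for (i in seq_along(cols)) {
  Data[[bin_cols[i]]] <- bin_values[i]  # constant column, recycled to all rows
}

# Assemble the smooth term as text, then convert it to a formula
rhs <- sprintf("s(cbind(%s), by = cbind(%s), k = 3)",
               paste(bin_cols, collapse = ", "),
               paste(cols, collapse = ", "))
signal_formula <- as.formula(paste("Density ~", rhs))
signal_formula

# Assumption, not run here: brm(signal_formula, data = Data)
```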
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

Insert multiple variables in the lda function from a list R

I have 6 variables and I want to test which combination of them is best for a linear discriminant analysis (lda).
I created a list with all the combinations.
I would like to loop through this list and run an lda for each combination.
The lda formula wants column names to be specified with a + as follows:
lda(classification~ variable1+variable2, data=mydata)
However if I insert the value of my list in the lda function I get an error
unlist(mylist[i])
"variable1" "variable2"
Error in model.frame.default(formula = mylist ~ unlist(mylist[i]), :
variable lengths differ
Reproducible example (variables are constant for illustrative purposes):
library(MASS) # provides lda()
classification<-c("a","b","c","d","e","f")
variable1<-c(1,1,1,1,1,1)
variable2<-c(1,1,1,1,1,1)
variable3<-c(1,1,1,1,1,1)
variable4<-c(1,1,1,1,1,1)
variable5<-c(1,1,1,1,1,1)
variable6<-c(1,1,1,1,1,1)
mydata<-data.frame(classification,variable1,variable2,variable3,variable4,variable5,variable6)
para_combo1<-combn(mydata[2:7],1, simplify = FALSE)
para_combo2<-combn(mydata[2:7],2, simplify = FALSE)
para_combo3<-combn(mydata[2:7],3, simplify = FALSE)
para_combo4<-combn(mydata[2:7],4, simplify = FALSE)
para_combo5<-combn(mydata[2:7],5, simplify = FALSE)
para_combo6<-combn(mydata[2:7],6, simplify = FALSE)
para_combo<-c(para_combo1,para_combo2, para_combo3,
para_combo4,para_combo5, para_combo6)
#manual example
lda_table<-lda(classification~ variable1+variable2, data= mydata)
#example I would loop
lda_table<-lda(classification~ para_combo[7] , data= mydata)
I do not know how I could code my combination in the format lda requires
Apart from providing a formula, you can alternatively provide the features and the classes in the parameters x and grouping, respectively:
# columns 2 and 3 hold variable1 and variable2 (column 1 is the class label)
lda.result <- lda(x=mydata[,c(2,3)], grouping=mydata$classification)
# or simply:
lda.result <- lda(mydata[,c(2,3)], mydata$classification)
Note that the function lda in R actually does not only work with two variables, but with an arbitrary number of variables (sometimes called "multiple discriminant analysis"). There is thus no need to try out all pairs of variable combinations, but you can let lda figure it out for itself.
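If you do still want to loop over combinations with the formula interface, one option is to build the combinations from the column names and turn each into a formula with reformulate(). The sketch below uses the built-in iris data as a stand-in for mydata, since the constant variables in the question cannot be fitted; the names predictors, combos and fits are just illustrative.

```r
library(MASS)  # provides lda()

# iris stands in for `mydata`; Species plays the role of `classification`
predictors <- names(iris)[1:4]

# All combinations of predictor *names*, from size 1 up to all four
combos <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# reformulate() turns a character vector into Species ~ a + b + ...
fits <- lapply(combos, function(vars) {
  lda(reformulate(vars, response = "Species"), data = iris)
})
length(fits)
```

Each element of fits is an ordinary lda object, so the results can be compared with whatever criterion you prefer.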

PGLS returns an error when referring to variables by their column position in a caper object

I am carrying out PGLS between a trait and 21 environmental variables for a clade of plant species. I am using a loop to do this 21 times, once for each of the environmental variables, and extract the p-values and some other values into a results matrix.
When running each PGLS individually, I normally refer to the variables by their column names, for example:
pgls(trait1 ~ meanrainfall, data=caperobject)
But in order to loop this process for multiple environmental variables, I am referring to the variables by their column position in the data frame (which is in the form of the caper object for PGLS) instead of their column name:
pgls(caperobject[,2] ~ caperobject[,5], data=caperobject)
This returns the error:
Error in model.frame.default(formula, data$data, na.action = na.pass) :
invalid type (list) for variable 'caperobject[, 2]'
This is not a problem when running a linear regression using the original data frame; referring to the variables by their column position only produces this error when the caper object is used as the data in PGLS. Does this way of referring to the columns not work for caper objects? Is there another way I could refer to the columns so I can incorporate them into a PGLS loop?
Your solution is to use caperobject$data[,2] ~ caperobject$data[,5], because comparative.data class is a list with the trait values located in the list data. Here is an example:
library(ape)
library(caper)
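# generate random data; set.seed() fixes the random draws (the estimates
# quoted after this example came from the original, unseeded run)
set.seed(245937)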
tr <- rtree(10)
dat <- data.frame(taxa = tr$tip.label,
trait1 = rTraitCont(tr, root.value = 3),
meanrainfall = rnorm(10, 50, 10))
# prepare a comparative.data structure
caperobject <- comparative.data(tr, dat, taxa, vcv = TRUE, vcv.dim = 3)
# run PGLS
pgls(trait1 ~ meanrainfall, data = caperobject)
pgls(caperobject$data[, 1] ~ caperobject$data[, 2], data = caperobject)
Both options return identical values for the intercept = 3.13 and slope = -0.003.
A good practice when facing data-format problems is to check how the data are stored with str(caperobject).
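For the loop itself, another option is to convert column positions to names and build each formula with reformulate(). The sketch below is an illustration only: lm() on the built-in mtcars data stands in for pgls() on the caper object, and swapping in pgls(f, data = caperobject) with names taken from caperobject$data is an assumption that was not run here.

```r
# Column positions converted to names (mtcars columns 3, 4, 6: disp, hp, wt)
response <- "mpg"
env_vars <- names(mtcars)[c(3, 4, 6)]

# Results matrix: one row per environmental variable
results <- matrix(NA_real_, nrow = length(env_vars), ncol = 2,
                  dimnames = list(env_vars, c("slope", "p_value")))

for (v in env_vars) {
  f <- reformulate(v, response = response)  # e.g. mpg ~ disp
  fit <- lm(f, data = mtcars)               # stand-in for pgls(f, data = caperobject)
  coefs <- summary(fit)$coefficients
  results[v, ] <- coefs[v, c("Estimate", "Pr(>|t|)")]
}
results
```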

library(e1071), tune Variable lengths differ

I have been attempting to utilize the iris dataset and although I've gotten svm to work from the e1071 library, I keep getting a 'variable lengths differ' error when I attempt to make tune work:
library(e1071)
data <- data.frame(iris$Sepal.Width,iris$Petal.Length,iris$Species)
svm_tr <- data[sample(nrow(data), 100), ] #sample 100 random rows
tuned <- tune(svm, svm_tr$iris.Species~.,
data = svm_tr[1:2],
kernel = "linear",
ranges = list(cost=c(.001,.01,.1,1,10,100)))
I have checked the lengths of each of the columns in svm_tr[1:2] and they are the same length. I know the function doesn't take a dataframe directly but maybe I'm missing something?
I can get it to work with:
tune(svm, iris.Species ~ ., data = svm_tr[1:3],
kernel = "linear", ranges = list(cost=c(.001,.01,.1,1,10,100)))
If it's a formula interface you shouldn't be referring to a variable by using $ as all the required variables are sourced from the object specified by the data= argument. Note that I've also made data=svm_tr[1:3] instead of 1:2 so that the iris.Species column is included.
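The mechanism can be reproduced in base R without e1071. This is a sketch of the general idea rather than of tune's exact internals: when a resampling helper evaluates the formula against a subset of data, a response written with $ still has all 100 values, while the predictors come from the smaller subset.

```r
set.seed(1)
dat <- data.frame(iris$Sepal.Width, iris$Petal.Length, iris$Species)
svm_tr <- dat[sample(nrow(dat), 100), ]

# Mimic one resampling split: only part of the data is passed along
fold <- svm_tr[1:90, ]

# `$` response: 100 values, while `.` expands to the 90-row columns
bad <- tryCatch(
  model.frame(svm_tr$iris.Species ~ ., data = fold[1:2]),
  error = function(e) conditionMessage(e)
)
bad

# Bare column name: response and predictors both come from `fold`
good <- model.frame(iris.Species ~ ., data = fold)
nrow(good)
```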
