How to call regression model by variable in R
I'm trying to evaluate a series of one-variable regression models using an R script. My data is formatted in a .csv file where the first 8 columns represent dependent variables that we would like to predict and the next 52 columns represent independent variables that might be used to fit any one of the 8 dependent variables.
I've read the data into the script successfully. I've also created vectors holding the headers of the dependent and independent variables. So my script looks like this:
#... do some stuff to get data above
var_dep<-c("dep1","dep2",...)
var_indep<-c("indep1","indep2",...)
for(dep in var_dep){
  for(indep in var_indep){
    lm1<-lm(dep~indep, data=mydat)
  }
}
I get this error message when I run
Rscript R_ScriptV2.R XLK_friendly.csv
in the terminal:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
Calls: lm ... model.matrix -> model.matrix.default -> contrasts<-
In addition: Warning message:
In model.response(mf, "numeric") : NAs introduced by coercion
Execution halted
So how can I specify the dependent and independent variables in my regression using variables?
This might be a hacky solution, but you can use as.formula in conjunction with paste0 to get this to work:
for (dep in var_dep){
  for (indep in var_indep){
    f <- as.formula(paste0(dep, " ~ ", indep))
    lm1 <- lm(f, data = mydat)
  }
}
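Note that, as written, the loop overwrites lm1 on each pass, so only the final model survives. A minimal sketch of one way to keep every fit (the results list and its naming scheme are illustrative additions, not from the original answer):
results <- list()
for (dep in var_dep){
  for (indep in var_indep){
    f <- as.formula(paste0(dep, " ~ ", indep))
    # store each model under a "dep_indep" key so it can be retrieved later
    results[[paste(dep, indep, sep = "_")]] <- lm(f, data = mydat)
  }
}
summary(results[["dep1_indep1"]])  # inspect one of the fits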
Related
How can I include both my categorical and numeric predictors in my elastic net model?
As a note beforehand, I think I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput; it would be illegal to do so. That is why I made a fake dataset and explained my processes to help reproduce the error.
I have been trying to estimate an elastic net model in R using glmnet, but I keep getting an error and I am not sure what is causing it. The error happens when I go to train the data, and it seems to have something to do with the data type and matrix.
I have provided a sample dataset below. First I set the outcomes and certain predictors to be factors, and after setting them as factors I label them. Next, I create an object, pred.names.min, with the column names of the predictors I want to use. Then I partition the data into training and test data frames (65% training, 35% test). With the trainControl function, I specify what I want to have happen with the model: random search for the lambda and alpha parameters, the leave-one-out cross-validation method, and that it is a classification model (categorical outcome). In the last step, I specify the training model, telling it to use all of the predictor variables in the pred.names.min object from the trainingset data frame.
library(dplyr)
library(tidyverse)
library(glmnet)
library(caret)

#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
               "age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),
               "L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
               "Hydroxymethyl_5_furancarboxylicacidArea_2"=
                 c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
               "Anhydro_1.5_D_glucitolArea"=
                 c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
               "LevoglucosanArea"=
                 c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
               "HexadecanolArea_1"=
                 c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
               "EthanolamineArea"=
                 c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
               "OxoglutaricacidArea_2"=
                 c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
               "AminopentanedioicacidArea_3"=
                 c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
               "XylitolArea"=
                 c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
               "DL_XyloseArea"=
                 c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
               "ErythritolArea"=
                 c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
               "hpresponse1"=
                 c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
               "hpresponse2"=
                 c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))

#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<-as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<-
  as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)

#labeling factor levels
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels = c("No", "Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <-
  factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels = c("No", "Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50", ">=50"))

#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]

#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]

#specifying that I want to use the leave one out cross-
#validation method and use "random" as search for elasticnet
tcontrol <- trainControl(method = "LOOCV", search="random", classProbs = TRUE)

#training model
elastic_model1 <- train(as.matrix(trainingset[, pred.names.min]),
                        trainingset$hpresponse1, data = trainingset,
                        method = "glmnet", trControl = tcontrol)
After I run the last chunk of code, I end up with this error:
Error in { : task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'as.matrix': object of invalid type "character" in 'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried removing the as.matrix argument:
elastic_model1 <- train((trainingset[, pred.names.min]),
                        trainingset$hpresponse1, data = trainingset,
                        method = "glmnet", trControl = tcontrol)
It still produces a similar error:
Error in { : task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'as.matrix': object of invalid type "character" in 'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
When I tried to make none of the predictors factors (but kept the outcome as a factor), this is the error I get:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?
glmnet does not handle factors well. The current recommendation is to dummy code the factors and re-code to numeric where possible: Using LASSO in R with categorical variables
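A minimal sketch of that dummy coding, reusing the asker's trainingset, pred.names.min, and tcontrol objects (the x, y, and elastic_model2 names are illustrative): model.matrix() expands each factor into 0/1 indicator columns, yielding the all-numeric matrix glmnet expects.
x <- model.matrix(~ . - 1, data = trainingset[, pred.names.min])  # dummy-code factors, no intercept column
y <- trainingset$hpresponse1
elastic_model2 <- train(x, y, method = "glmnet", trControl = tcontrol)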
R: DALEX explain Fails to Read In Target Variable Data
I'm running a simple lm model in R and I am trying to analyze the results using the DALEX package's explain object. My model is as follows:
lm_model <- lm(DV ~ x + z, data = datax)
If it matters, x and z are factors and DV is numeric. The lm runs with no errors, and everything looks fine via summary(lm_model). When I try to create the explain object in DALEX like so:
lm_exp <- DALEX::explain(lm_model, label = "lm", data = datax, y = datax$DV)
it gives me the following:
Preparation of a new explainer is initiated
-> model label : lm
-> data : 15375 rows 49 cols
-> data : tibble converted into a data.frame
-> target variable : 15375 values
Error in if (is_y_in_data(data, y)) { : missing value where TRUE/FALSE needed
Before the lm is run, datax is filtered for values between .2 and 1 using the subset command. Looking at summary(datax$DV) and sum(is.na(datax$DV)), everything looks fine. I also checked for blanks and errors using a filter in Excel. For those reasons, I do not believe there are any blanks in the DV column of datax, so I am unsure why I am receiving "Error in if (is_y_in_data(data, y)) { : missing value where TRUE/FALSE needed". I have scoured the internet for this error when using DALEX explain, but I have not found any results. Thanks for any help that can be provided.
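A hedged workaround sketch, assuming only that DALEX::explain accepts the predictors and the target separately (it is not a confirmed fix for this thread): coerce the tibble to a plain data.frame, pass data without the target column, and supply y as a bare numeric vector, so the explainer's internal data/y comparison never has to evaluate to NA.
datax_df <- as.data.frame(datax)  # drop the tibble class before handing off
lm_exp <- DALEX::explain(lm_model, label = "lm",
                         data = datax_df[, setdiff(names(datax_df), "DV")],
                         y = as.numeric(datax_df$DV))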
R mlogit model, missing value where TRUE/FALSE needed, 20 invalid factor level warnings
I'm trying to run a multinomial logistic regression using the mlogit package in R. I've uploaded the data here: https://drive.google.com/file/d/0B_o3xTWAYdbuRGw0dzNFRzd2NEk/view?usp=sharing. The data contains two different choice variables which I want to run the same model on. I run the first model like so:
lfsm1 <- mlogit.data(lfs.models, shape="wide", choice="PWK")
f1 <- mFormula(PWK~1 | MIGGRP+SEX+AGE+EDU)
m1 <- mlogit(f1, lfsm1, weights=PWT14)
summary(m1)
This model runs without issues. Then I run the same exact model on the other choice variable:
lfsm2 <- mlogit.data(lfs.models, shape="wide", choice="multi")
f2 <- mFormula(multi~1 | MIGGRP+SEX+AGE+EDU)
m2 <- mlogit(f1, lfsm2, weights=PWT14)
I get the following errors:
Error in if (is.null(initial.value) || lnl <= initial.value) break :
  missing value where TRUE/FALSE needed
In addition: There were 20 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, is.na(x), value = FALSE) :
  invalid factor level, NA generated
And that warning message repeats 20 times. I'm not sure what either of these errors means in the context of my model. A previous post (mlogit: missing value where TRUE/FALSE needed) suggests that my first error occurs because my data are not in wide format, or because there are some individuals who do not select any of the alternatives. In my case neither of these explanations can be right. What I've seen about the warning messages suggests mlogit is reacting badly to variables being factors or numeric. But I don't quite understand why this would matter in a multinomial regression context, or how the problem occurred only twenty times in such a large dataset. Any suggestions would be most appreciated!
Try
m2 <- mlogit(f2, lfsm2, weights=PWT14)
Note the f2 in the call to mlogit. In your second call to mlogit.data, you specified that multi is the choice variable, and the data are prepared accordingly. Yet in the formula you are using, f1, the dependent variable is specified as PWK, so mlogit is expecting a dataframe with one row for each alternative as defined by PWK, not multi.
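Putting the pieces together, the second model pairs the multi-prepared data with the multi formula (this is just the asker's code with the corrected formula object):
lfsm2 <- mlogit.data(lfs.models, shape="wide", choice="multi")
f2 <- mFormula(multi~1 | MIGGRP+SEX+AGE+EDU)
m2 <- mlogit(f2, lfsm2, weights=PWT14)  # f2 matches the choice variable used above
summary(m2)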
R caret nnet package
I have two R objects as below:
matrix "datamatrix": 200 rows and 494 columns; these are my x variables
dataframe Y: Y$V1 is my y variable, and I have converted column V1 to a factor
I am building a classification model. I want to build a neural network, and I ran the command below:
model <- train(Y$V1 ~ datamatrix, method='nnet', linout=TRUE, trace = FALSE,
               #Grid of tuning parameters to try:
               tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
I got an error: "argument "data" is missing, with no default".
Is there a way for the caret package to understand that I have my x variables in one R object and my y variable in another? I don't want to combine the two data objects and then write a formula, as the formula would be far too long: Y~x1+x2+x3+...+x493+x494.
The argument "data" is missing error is addressed by adding a data = datamatrix argument to the train call. The way I would do it would be something like: datafr <- as.data.frame(datamatrix) # V1 is the first column name if dimnames aren't specified datafr$V1 <- as.factor(datafr$V1) model <- train(V1 ~ ., data = datafr, method='nnet', linout=TRUE, trace = FALSE, tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1))) Now you don't have to pull your response variable out separately. The . identifier allows inclusion of all variables from datafr (see here for details).
Extracting predictions from a GAM model with splines and lagged predictors
I have some data and am trying to teach myself how to utilize lagged predictors within regression models. I'm currently trying to generate predictions from a generalized additive model that uses splines to smooth the data and contains lags. Let's say I have the following data and have split it into training and test samples:
head(mtcars)
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
Great, let's train the gam model on the training set:
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(lag(disp, 1), bs="cr"), data=mtcars[Train,])
summary(f_gam)
When I go to predict on the holdout sample, I get an error message:
f_gam.pred <- predict(f_gam, mtcars[-Train,]); f_gam.pred
Error in ExtractData(object, data, NULL) :
  'names' attribute [1] must be the same length as the vector [0]
Calls: predict ... predict.gam -> PredictMat -> Predict.matrix3 -> ExtractData
Can anyone help diagnose the issue and help with a solution? I get that lag(__, 1) leaves a data point as NA and that is likely the reason for the lengths being different. However, I don't have a solution to the problem.
I'm going to assume you're using gam() from the mgcv library. It appears that gam() doesn't like functions that are not defined in "base" inside the s() terms. You can get around this by adding a column which contains the transformed variable and then modeling using that variable. For example:
tmtcars <- transform(mtcars, ldisp=lag(disp,1))
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)

f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data=tmtcars[Train,])
summary(f_gam)

predict(f_gam, tmtcars[-Train,])
works without error.
The problem appears to be coming from the mgcv:::get.var function. It tries to decode the terms with something like
eval(parse(text = txt), data, enclos = NULL)
and because they explicitly set the enclosure to NULL, variable and function names outside of base cannot be resolved. So because mean() is in the base package, this works:
eval(parse(text="mean(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 2.5
but because var() is defined in stats, this does not:
eval(parse(text="var(x)"), data.frame(x=1:4), enclos=NULL)
# Error in eval(expr, envir, enclos) : could not find function "var"
and lag(), like var(), is defined in the stats package.
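One hedged caveat to add: the NA the asker mentions is the behavior of dplyr::lag (stats::lag only shifts the time base of a ts object and leaves a plain vector's values unchanged), so the lag that produced it was presumably dplyr's. If so, the first ldisp value is NA, and dropping the incomplete row before fitting keeps the data rows and the model's fitted values aligned:
library(mgcv)
library(dplyr)

# dplyr::lag shifts the vector and puts NA in position 1;
# na.omit drops that row before the model ever sees it
tmtcars <- na.omit(transform(mtcars, ldisp = dplyr::lag(disp, 1)))
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data = tmtcars)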