R: DALEX explain Fails to Read In Target Variable Data

I'm running a simple lm model in R and I am trying to analyze the results using the DALEX package explain object.
My model is as follows: lm_model <- lm (DV ~ x + z, data = datax)
If it matters, x and z are factors and DV is numeric. The lm runs with no errors, and everything looks fine via summary(lm_model).
When I try to create the explain object in DALEX like so:
lm_exp <- DALEX::explain(lm_model, label = "lm", data = datax, y = datax$DV)
It gives me the following:
Preparation of a new explainer is initiated
-> model label : lm
-> data : 15375 rows 49 cols
-> data : tibbble converted into a data.frame
-> target variable : 15375 values
Error in if (is_y_in_data(data, y)) { :
missing value where TRUE/FALSE needed
Before the lm is run, datax is filtered for values between .2 and 1 using the subset command. Looking at summary(datax$DV) and sum(is.na(datax$DV)), everything looks fine. I also checked for blanks / errors using a filter in Excel. For those reasons, I do not believe there are any blanks in the DV column of datax, so I am unsure why I am receiving "Error in if (is_y_in_data(data, y)) { : missing value where TRUE/FALSE needed".
I have scoured the internet for this error when using DALEX explain, but I have not found any results. Thanks for any help that can be provided.
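The message "missing value where TRUE/FALSE needed" is what R raises whenever the condition of an if() evaluates to NA. The post only checks DV for missing values, so one thing that may be worth ruling out is an NA in any of the other 48 columns. This is a guess at the cause, not something the post confirms. The sketch below uses a small made-up data frame (the name datax is reused only for illustration) to show how an NA condition produces this error and how to list the columns that contain NAs.

```r
# Hypothetical stand-in for datax: DV is complete, but x has a missing value
datax <- data.frame(
  DV = c(0.3, 0.5, 0.9),
  x  = factor(c("a", NA, "b")),
  z  = factor(c("u", "v", "u"))
)

# Count missing values in every column, not only in DV
na_per_column <- colSums(is.na(datax))
cols_with_na  <- names(na_per_column)[na_per_column > 0]
print(cols_with_na)

# all() returns NA when the only non-TRUE values are NA ...
cond <- all(c(TRUE, NA))

# ... and if() cannot branch on NA, which triggers the error from the post
res <- tryCatch(
  if (cond) "yes" else "no",
  error = function(e) conditionMessage(e)
)
print(res)
```

If cols_with_na is non-empty on the real data, dropping or imputing those rows before calling explain may be worth trying.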

Related

Can't complete regression loop (invalid term in model formula)

I have applied the below code, and it was working fine until I got an error message that I don't know how to solve.
respvars <- names(QBB_clean[1653:2592])
predvars <- c("bmi","Age", "sex","lpa2c", "smoking", "CholesterolTotal")
results <- list()
for (v in respvars) {
form <- reformulate(predvars, response = v)
results[[v]] <- lm(form, data = QBB_clean)
}
Error message:
Error in terms.formula(formula, data = data) :
invalid term in model formula
The error message "invalid term in model formula" indicates a problem with the way the formula is being constructed.
One possible cause is that a variable in the formula is not present in the dataset or has a different name. To check this, print the variable names used in the formula and compare them to the variable names in the dataset.
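A minimal sketch of that comparison, using a small made-up data frame in place of QBB_clean (the column names here are hypothetical). It also flags column names that are not syntactically valid R names, which is a further assumption about what might upset formula construction, not something stated in the post.

```r
# Hypothetical stand-in for QBB_clean
df <- data.frame(
  bmi      = c(22, 27, 31),
  Age      = c(40, 52, 61),
  `resp 1` = c(1.1, 2.3, 3.2),
  check.names = FALSE
)
predvars <- c("bmi", "Age", "sex")

# Predictors named in the formula but absent from the data
missing_vars <- setdiff(predvars, names(df))
print(missing_vars)

# Column names that make.names() would have to change (spaces, leading digits, ...)
nonsyntactic <- names(df)[names(df) != make.names(names(df))]
print(nonsyntactic)
```

Running the same two checks on the real predvars and respvars should narrow down which variable triggers the error.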

LME error in model.frame.default ... variable lengths differ

I am trying to run a random effects model with LME. It is part of a larger function and I want it to be flexible so that I can pass the fixed (and ideally random) effects variable names to the lme function as variables. get() worked great for this where I started with lm, but it only seems to throw the ambiguous "Error in model.frame.default(formula = ~var1 + var2 + ID, data = list( : variable lengths differ (found for 'ID')." I'm stumped, the data are the same lengths, there are no NAs in this data or the real data, ...
set.seed(12345) #because I got scolded for not doing this previously
var1="x"
var2="y"
exdat<-data.frame(ID=c(rep("a",10),rep("b",10),rep("c",10)),
x = rnorm(30,100,1),
y = rnorm(30,100,2))
#exdat<-as.data.table(exdat) #because the data are actually in a dt, but that doesn't seem to be the issue
Works great
lm(log(get(var1))~log(get(var2)),data=exdat)
lme(log(y)~log(x),random=(~1|ID), data=exdat)
Does not work
lme(log(get(var1,pos=exdat))~log(get(var2)),random=(~1|ID), data=exdat)
Does not work, but throws a new error code: "Error in model.frame.default(formula = ~var1 + var2 + rfac + exdat, data = list( : invalid type (list) for variable 'exdat'"
rfac="ID"
lme(log(get(var1))~log(get(var2)),random=~1|get(rfac,pos=exdat), data=exdat)
Part of the problem seems to be with the nlme package. If you can consider using lme4, the desired results can be obtained with:
lme4::lmer(log(get(var1)) ~ log(get(var2)) + (1 | ID),
data = exdat)
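If staying with nlme is preferred, another option may be to skip get() and build the formulas from the variable names as strings. This is a different technique from the one in the answer above, sketched under the assumption that nlme is installed. The names resp, pred, fixed_form and random_form are introduced here for illustration only.

```r
set.seed(12345)
exdat <- data.frame(
  ID = rep(c("a", "b", "c"), each = 10),
  x  = rnorm(30, 100, 1),
  y  = rnorm(30, 100, 2)
)

# Variable names held as strings, as in the question
resp <- "y"
pred <- "x"
rfac <- "ID"

# Build the fixed and random parts as real formula objects
fixed_form  <- as.formula(paste0("log(", resp, ") ~ log(", pred, ")"))
random_form <- as.formula(paste0("~ 1 | ", rfac))

# Fit only if nlme is available; equivalent to lme(log(y) ~ log(x), random = ~ 1 | ID)
if (requireNamespace("nlme", quietly = TRUE)) {
  fit <- nlme::lme(fixed_form, random = random_form, data = exdat)
  print(nlme::fixef(fit))
}
```

Because the formulas contain the actual column names, model.frame no longer has to resolve get() calls against the data.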

adonis: Error right-hand-side of formula has no usable terms

I have this chao distance matrix based on all fungi abundances:
CR10 CR11 CR13 CR14 CR17 CR18 CR19
CR11 0.4531840
CR13 0.4288178 0.4624915
CR14 0.5903908 0.5466617 0.4942469
CR17 0.4784990 0.3387325 0.6136265 0.5779121
CR18 0.7649840 0.7537409 0.7526077 0.5632825 0.4153391
CR19 0.3772907 0.4579895 0.3208187 0.3706775 0.5644193 0.7380274
CR20 0.4598706 0.5529427 0.6424340 0.6690386 0.3855154 0.5509150 0.6406800
and the table with 33 environmental variables for the same plots.
when I run:
fungAbundAdonis <- lapply(colnames(env2), function(x) {
form <- as.formula(paste("OTU.table2", x, sep="~"))
z <- adonis(form, data = env2, permutations=999)
return(data.frame(env = rownames(z$aov.tab), Rsq = z$aov.tab$R2,P = z$aov.tab$P))}
)
I get this error:
Error in adonis(form, data = env2, permutations = 999) :
right-hand-side of formula has no usable terms.
I don't understand why, because when I use the same script with the distance matrix of plots 1 to 9, 12, 15 and 16 and the environmental table for these plots, it works fine. Does anybody know what the source of the error could be?
thanks!
Your question has no reproducible example, and I have to guess. However, I can reproduce your error message if the variable is constant in the right-hand-side. This may happen when you subset env2 and in that selected subset a variable has only one value. (This only concerns vegan 2.5-x or release version: vegan 2.6-0 will not give an error message.)
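A quick way to test that guess is to look for columns with a single distinct value before looping over them. The sketch below uses a made-up three-column table in place of env2, and the names n_unique, constant_vars and usable_vars are hypothetical.

```r
# Hypothetical stand-in for env2 after subsetting: site is constant
env2 <- data.frame(
  pH    = c(5.1, 6.2, 5.8),
  site  = c("A", "A", "A"),
  depth = c(10, 20, 15)
)

# Number of distinct values per column
n_unique <- vapply(env2, function(col) length(unique(col)), integer(1))

# Columns that cannot serve as a right-hand-side term
constant_vars <- names(n_unique)[n_unique < 2]
print(constant_vars)

# Columns that are safe to pass to the lapply() loop
usable_vars <- setdiff(names(env2), constant_vars)
print(usable_vars)
```

Replacing colnames(env2) with usable_vars in the lapply() call would skip the constant variables.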

R caret nnet package

I have two R objects as below.
matrix "datamatrix" - 200 rows and 494 columns: these are my x variables
dataframe Y. Y$V1 is my Y variable. I have converted column V1 to a factor because I am building a classification model.
I want to build a neural network, so I ran the command below.
model <- train(Y$V1 ~ datamatrix, method='nnet', linout=TRUE, trace = FALSE,
#Grid of tuning parameters to try:
tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
I got an error - " argument "data" is missing, with no default"
Is there a way for the caret package to understand that I have my X variables in one R object and my Y variable in another? I don't want to combine the two data objects and then write a formula, as the formula would be too long:
Y~x1+x2+x3.................x199+x200....x493+x494
The "argument "data" is missing" error is addressed by passing a data argument to the train call. The way I would do it would be something like:
datafr <- as.data.frame(datamatrix)
# Add the response from Y under a name that cannot clash with the predictors
# (as.data.frame() names the columns V1, V2, ... if dimnames aren't specified)
datafr$response <- as.factor(Y$V1)
model <- train(response ~ ., data = datafr, method='nnet',
linout=TRUE, trace = FALSE,
tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
Now you don't have to pull your response variable out separately.
The . identifier includes all other variables from datafr as predictors.

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through the confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double and triple checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the
confusionMatrix predict. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
Which specifies y = training2$hold1yes0no and x = trainPC2.
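Separately, it may be worth checking whether the missing rows have NA values in the predictors. As far as I know, predict() on a caret train object takes an na.action argument that defaults to na.omit, so incomplete rows would be dropped from the predictions. The post does not say whether the test set contains NAs, so this is only a guess. The sketch uses a made-up data frame and plain base R to count such rows.

```r
# Hypothetical test set: two of the four rows have a missing predictor
testing <- data.frame(
  outcome = factor(c("yes", "no", "yes", "no")),
  x1      = c(1.2, NA, 0.7, 2.1),
  x2      = c(5, 3, NA, 4)
)
predictors <- testing[, c("x1", "x2")]

# Rows with at least one missing predictor value
incomplete   <- !complete.cases(predictors)
n_incomplete <- sum(incomplete)
print(which(incomplete))

# Rows that survive na.omit, i.e. rows that would still get a prediction
rows_kept <- nrow(na.omit(predictors))
print(c(total = nrow(testing), kept = rows_kept))
```

If the count of incomplete rows in the real test predictors equals the 5 missing rows, that would point to NA handling and not to the model specification.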
