GLM model runs in interactive code but not when I use knitr

I'm having issues with knitr.
Specifically, I have a model which runs absolutely fine in the console, but when I try to knit the document, R throws an error.
Load the dataset (available here to facilitate replication):
```{r}
scabies <- read.csv(file = "S1-Dataset_CSV.csv", header = TRUE, sep = ",")
scabies$agegroups <- as.factor(cut(scabies$age, c(0,10,20,Inf), labels = c("0-10","11-20","21+"), include.lowest = TRUE))
scabies$agegroups <- relevel(scabies$agegroups, ref = "21+")
scabies$house_cat <- as.factor(cut(scabies$house_inhabitants, c(0,5,10,Inf), labels = c("0-5","6-10","10+"), include.lowest = TRUE))
scabies$house_cat <- relevel(scabies$house_cat, ref = "0-5")
scabies <- scabies %>% mutate(scabies = case_when(scabies_infestation=="yes"~1,
                                                  scabies_infestation=="no"~0)) %>%
  mutate(impetigo = case_when(impetigo_active=="yes" ~1,
                              impetigo_active=="no" ~0))
```
Fit the model:
```{r}
scabiesrisk <- glm(scabies~agegroups+gender+house_cat, data=scabies, family=binomial())
scabiesrisk_OR <- exp(cbind(OR = coef(scabiesrisk), confint(scabiesrisk)))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```
This code runs absolutely fine in the console. But when I try knitr I get:
```
Error in model.frame.default(formula = scabies ~ agegroups + gender +  :
  invalid type (list) for variable 'scabies'
Calls: ... glm -> eval -> eval -> <Anonymous> -> model.frame.default
```

I was able to reproduce the problem you describe, but haven't yet fully understood what happens under the hood.
This R Markdown chunk is interesting:
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```
If I manually quickly execute the lines in the chunk one after another (Ctrl+Enter x 4), sometimes I get two profiling messages:
```
Waiting for profiling to be done...
Waiting for profiling to be done...
```
In this case, summary(scabiesrisk) returns a matrix:
```
> class(scabiesrisk_summary)
[1] "matrix" "array"
```
If I manually slowly execute the lines in the chunk, I get only one profiling message:
```
Waiting for profiling to be done...
```
and summary(scabiesrisk) returns a summary.glm:
```
> class(scabiesrisk_summary)
[1] "summary.glm"
```
It looks like profiling is launched on a separate thread, and the summary function behaves differently depending on whether profiling has finished: if it has, summary returns the expected summary.glm object; if it hasn't, it launches another profiling run and returns a matrix.
In particular, with a matrix, scabiesrisk_summary$coefficients isn't available, and in this situation I get the following error message:
```
Error in scabiesrisk_summary$coefficients :
  $ operator is invalid for atomic vectors
```
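The error itself is easy to reproduce in isolation, since $ never works on a matrix:
```{r}
m <- cbind(a = 1:2, b = 3:4)  # a plain matrix, as returned by cbind()
m$coefficients
# Error in m$coefficients : $ operator is invalid for atomic vectors
```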
This could possibly also happen while knitting: does the knitting overhead make profiling slower, so that the problem occurs?
With the workaround found here (use confint.default instead of confint), I wasn't able to reproduce the above problem:
```{r}
scabiesrisk_OR <- exp(cbind(OR = coef(scabiesrisk), confint.default(scabiesrisk)))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_summary <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_summary
```
```
                       OR      2.5 %    97.5 %   Estimate Std. Error
(Intercept)    0.09357141 0.06984512 0.1253575 -2.3690303  0.1492092
agegroups0-10  2.20016940 1.60953741 3.0075383  0.7885344  0.1594864
agegroups11-20 2.53291768 1.79985894 3.5645415  0.9293719  0.1743214
gendermale     1.44749159 1.13922803 1.8391682  0.3698321  0.1221866
house_cat6-10  1.30521927 1.02586104 1.6606512  0.2663710  0.1228792
house_cat10+   1.17003712 0.67405594 2.0309692  0.1570355  0.2813713
                   z value     Pr(>|z|)
(Intercept)    -15.8772359 9.110557e-57
agegroups0-10    4.9442116 7.645264e-07
agegroups11-20   5.3313714 9.747386e-08
gendermale       3.0267824 2.471718e-03
house_cat6-10    2.1677478 3.017788e-02
house_cat10+     0.5581076 5.767709e-01
```
So you could also probably try this in your case.
Contrary to confint.default, which is a directly readable R function, confint is an S3 dispatch method (thanks @Ben Bolker for the internal references in comments), and I haven't yet investigated further what could explain this surprising behaviour.
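For the curious, the dispatch is easy to inspect (a quick sketch; in recent R versions confint.glm lives in stats, while historically it came from MASS):
```{r}
confint                            # the generic: just UseMethod("confint")
getS3method("confint", "glm")      # the profiling-based method used for glm fits
getS3method("confint", "default")  # the Wald-based method, confint.default
```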
Another option seems to be to save the result of the cbind in another variable. I tried hard, but was never able to reproduce the problem after doing so:
```{r}
scabiesrisk_OR <- exp(cbind(OR= coef(scabiesrisk), confint((scabiesrisk))))
scabiesrisk_summary <- summary(scabiesrisk)
scabiesrisk_final <- cbind(scabiesrisk_OR, scabiesrisk_summary$coefficients)
scabiesrisk_final
```

I strongly suspect that you forgot to include library(tidyverse) in your script. If tidyverse is loaded, then your code works fine. If it's not:
- the step where you try to mutate() (and use %>%) fails, so the scabies variable is never created within the scabies data set;
- glm(scabies ~ ...) then interprets the response variable scabies as being the whole data set, and complains that the response variable is "invalid type (list)".

For this reason it's good practice to avoid having variables within data frames that have the same name as the data frames themselves ...
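Here is a minimal, self-contained illustration of that failure mode (the data frame d is hypothetical):
```{r}
d <- data.frame(x = 1:10, y = rbinom(10, 1, 0.5))
# 'd' has no column named 'd', so the response falls back to the data frame
# itself, found in the global environment:
glm(d ~ x, data = d, family = binomial())
# Error in model.frame.default(...) : invalid type (list) for variable 'd'
```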
Your data transformation steps can also be cleaned up a little bit: as.factor() is redundant; you can do all of the transformations as steps within a single mutate() call; and as.numeric(x=="yes") is a shorter way to turn a string into a 0/1 variable. If I were going to do a lot more of this, I would write a custom mycut() function that took breakpoints and a desired reference level as input arguments, constructed custom labels, and did the releveling (see the sketch after the code below).
```{r}
library(tidyverse)
scabies <- (read.csv(file = "S1-Dataset_CSV.csv") %>%
    mutate(agegroups = cut(age, c(0,10,20,Inf),
                           labels = c("0-10","11-20","21+"),
                           include.lowest = TRUE),
           agegroups = relevel(agegroups, ref = "21+"),
           house_cat = cut(house_inhabitants, c(0,5,10,Inf),
                           labels = c("0-5","6-10","10+"),
                           include.lowest = TRUE),
           house_cat = relevel(house_cat, ref = "0-5"),
           scabies = as.numeric(scabies_infestation=="yes"),
           impetigo = as.numeric(impetigo_active=="yes"))
)
```
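For completeness, a minimal sketch of such a mycut() helper (the function name and its label-building convention are my own, not from any package; it assumes integer breakpoints):
```{r}
mycut <- function(x, breaks, ref_label) {
  n <- length(breaks) - 1
  labels <- character(n)
  for (i in seq_len(n)) {
    lo <- breaks[i]; hi <- breaks[i + 1]
    labels[i] <- if (is.infinite(hi)) paste0(lo + 1, "+")
                 else if (i == 1) paste0(lo, "-", hi)
                 else paste0(lo + 1, "-", hi)
  }
  # build the factor and set the reference level in one step
  relevel(cut(x, breaks, labels = labels, include.lowest = TRUE), ref = ref_label)
}
# e.g. mycut(scabies$age, c(0, 10, 20, Inf), ref_label = "21+")
```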

Related

What does "invalid type (closure) for variable 'variable1'" mean and how do I fix it?

I am trying to write a function in R which contains a function from another package. The code works perfectly outside a function.
I am guessing it might have something to do with the package I am using (survey).
A self-contained code example:
```{r}
# activating the packages
library(survey)
library(foreign)  # read.spss() comes from here
# getting the dataset into R
tm <- read.spss("tm.sav", to.data.frame = T, max.value.labels = 5)
# creating the svydesign object (it basically contains the weights to adjust
# the variables; ~persgew is a column variable contained in the tm dataset)
tm_w <- svydesign(ids = ~0, weights = ~persgew, data = tm)
# getting an overview of the welle variable
# (this variable is part of the tm dataset; it is needed for the following steps)
table(tm$welle)
# data manipulation: taking the v12d_gr variable, the welle variable and the
# svydesign object to create a longitudinal variable, which is transformed
# into a data frame that can be passed to ggplot
t <- svytable(~v12d_gr+welle, tm_w)
tt <- round(prop.table(t,2)*100, digits=0)
v12d <- tt[2,]
v12d <- as.data.frame(v12d)
```
This is the code outside the function, working perfectly. Since I have to transform quite a few variables in exactly the same way, I aim to create a function to save some time.
The following function is supposed to take a variable that will be transformed as an argument (v12sd2_gr).
```{r}
# making sure the survey object is loaded
tm_w <- svydesign(ids = ~0, weights = ~persgew, data = data)
# trying to write a function containing the code from above
ltd_zsw <- function(variable1){
  t <- svytable(~variable1+welle, tm_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  var_ltd_zsw <- tt[2,]
  var_ltd_zsw <- as.data.frame(var_ltd_zsw)
  return(var_ltd_zsw)
}
```
Calling the function:
```{r}
# as v12d has been altered already, I am trying to transform another variable, v12sd2_gr
v12sd2 <- ltd_zsw(v12sd2_gr)
```
Console output:
```
Error in model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design)) :
  invalid type (closure) for variable 'variable1'
Called from: model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design))
```
How do I fix it? And what does it mean to dynamically build a formula, e.g. with reformulate()?
PS: I hope this is the appropriate way to respond to the feedback in the comments.
Update: I think I was able to trace the problem back to the argument I am passing (variable1), and I am guessing it has something to do with the fact that I try to call a formula within the function. But when I try to call svytable with as.formula(svytable(~variable1+welle, tm_w)) it still doesn't work.
What to do?
I have found a solution to the problem.
Here is the tested and working function:
```{r}
ltd_test <- function(var, x, string1 = "con", string2 = "pro") {
  print(table(var))
  x$w12d_gr <- ifelse(as.numeric(var) > 2, 1, 0)
  x$w12d_gr <- factor(x$w12d_gr, levels = c(0,1), labels = c(string1, string2))
  print(table(x$w12d_gr))
  x_w <- svydesign(ids = ~0, weights = ~persgew, data = x)
  t <- svytable(~w12d_gr+welle, x_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  w12d <- tt[2,]
  w12d <- as.data.frame(w12d)
}
```
The problem appeared to be caused by the svydesign() function. Its output is an object which is then used by the formula for svytable(). That's why it is imperative to first create the x_w object with svydesign() and only then use svytable() to create the t object.
Within the code snippet I posted originally in the question, the tm_w object had been created and stored globally.
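As an aside, if you also want to keep the analysis variable dynamic rather than hard-coding w12d_gr, the "dynamically build a formula" suggestion from the comments can be sketched like this (untested against the tm data, which I don't have):
```{r}
ltd_zsw2 <- function(varname, design) {
  f <- reformulate(c(varname, "welle"))  # builds ~ <varname> + welle from strings
  t <- svytable(f, design)
  tt <- round(prop.table(t, 2) * 100, digits = 0)
  as.data.frame(tt[2, ])
}
# v12sd2 <- ltd_zsw2("v12sd2_gr", tm_w)
```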
Thanks to everyone for the help. I hope this is gonna be of use to someone one day!

R Stargazer - Manually Specify R^2 And Write to .tex

I am using Stargazer to report the results from some models where I use robust standard errors. The process of calculating these and then feeding the models to Stargazer strips out data such as R^2 and so I need to add it in manually. Doing so, however, is causing me problems. Below is the basic stargazer() call that I'm trying to run. Following it and some discussion is the code needed to generate the data going into the stargazer() call:
```{r}
stargazer(fit1_robust, fit2_robust,
          keep.stat = c("n", "adj.rsq"), # doesn't actually keep the stats, but included to demonstrate the attempt
          add.lines = list(c("Adjusted $R^2$", fit1_r2, fit2_r2)),
          out = "~/Test.tex"
)
```
When I call this, I get the following error:
```
Error in if (nchar(text.matrix[r, c]) > max.length[real.c]) { :
  missing value where TRUE/FALSE needed
```
There are some interesting aspects to this:
- The error does not occur if I omit the ^ and instead just use "Adjusted $R2$".
- The error does not occur if I don't use the out argument to specify a .tex file to export to.
Addressing either of these bullets "solves" the error, but at the expense of my code not really doing what I want it to. How can I manually add the adjusted R^2 in the way I've done here (and more generally, add notes involving ^)?
(Note: I also tried escaping the ^ character, replacing it with \^. That gave an error. A double escape, \\^, prevents the error, but then a single backslash shows up in the generated .tex file, and that's not what I want.)
Here is the rest of the code to get to the point of having all objects needed for the above stargazer() call:
```{r}
library(stargazer)
library(lmtest)
library(sandwich)

#################
# Simulate Data #
#################
N <- 100
A <- rnorm(N)
B <- rnorm(N)
Y <- 2*A + B + rnorm(N)
Data <- data.frame(Y, A, B)

#####################################
# Fit Models and Find Robust Errors #
#####################################
fit1 <- lm(Y ~ A)
fit2 <- lm(Y ~ A + B)
fit1_robust <- coeftest(fit1, vcov = sandwich)
fit2_robust <- coeftest(fit2, vcov = sandwich)
fit1_r2 <- round(summary(fit1)$adj.r.squared, 4)
fit2_r2 <- round(summary(fit2)$adj.r.squared, 4)
```
One workaround is to save the output of stargazer in a variable, then write it to file afterwards:
```{r}
star <- stargazer(fit1_robust, fit2_robust,
                  add.lines = list(c("Adjusted $R^2$", fit1_r2, fit2_r2)))
cat(star, sep = '\n', file = 'Test.tex')
```
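If you need this more than once, the workaround wraps naturally into a small helper (write_stargazer() is my own name, not part of the package):
```{r}
write_stargazer <- function(..., file) {
  star <- stargazer(...)  # stargazer returns the table's lines as a character vector
  cat(star, sep = "\n", file = file)
}

write_stargazer(fit1_robust, fit2_robust,
                add.lines = list(c("Adjusted $R^2$", fit1_r2, fit2_r2)),
                file = "Test.tex")
```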

H2O: Deep learning object not found in function 'predict' for argument 'model'

I'm just testing out h2o, in particular its deep learning capabilities, since I've heard great things about it. So far I've been using the following code:
```{r}
library(h2o)
library(caret)
data("iris")

# Initiate H2O --------------------
h2o.removeAll() # Clean up. Just in case H2O was already running
h2o.init(nthreads = -1, max_mem_size = "22G") # Start an H2O cluster with all threads available

# Get training and tournament data -------------------
a <- createDataPartition(iris$Species, list = FALSE)
training <- iris[a,]
test <- iris[-a,]

# Convert target to factor -------------------
target <- as.factor(iris$Species)
feature_names <- names(train)[1:(ncol(train)-1)]
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
prob <- test[, "id", drop = FALSE]
model_dl <- h2o.deeplearning(x = feature_names, y = "target", training_frame = train_h2o, stopping_metric = "logloss")
h2o.logloss(model_dl)
pred_dl <- predict(model_dl, newdata = tourn_h2o)
prob <- cbind(prob, as.data.frame(pred_dl$p1, col.names = "dl"))
write.table(prob[, c("id", "dl")], paste0(model_dl@model_id, ".csv"), sep = ",", row.names = FALSE, col.names = c("id", "probability"))
```
The relevant part is really that last line, where I got the following error:
```
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
  ERROR MESSAGE:
  Object 'DeepLearning_model_R_1494350691427_70' not found in function: predict for argument: model
```
Has anyone come across this before? Are there any easy solutions to this that I might be missing? Thanks in advance.
EDIT: With the updated code I get the error:
```
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
  ERROR MESSAGE:
  Illegal argument(s) for DeepLearning model: DeepLearning_model_R_1494428751150_1. Details: ERRR on field: _train: Training data must have at least 2 features (incl. response).
  ERRR on field: _stopping_metric: Stopping metric cannot be logloss for regression.
```
I assume this has to do with the the way the Iris dataset is being read in.
Answer to first question: Your original error message sounds like one you can get when things get out of sync. E.g. maybe you had two sessions running at once, and removed the model in one session; the other session wouldn't know its variables are now out of date. H2O allows multiple connections, but they have to be co-operative. (Flow - see the next paragraph - counts as a second session.)
Unless you can make a reproducible example, shrug, put it down to gremlins, and start a new session. Or go and look at the data/models in Flow (a web server always running on 127.0.0.1:54321), and see if something is no longer there.
For your EDIT question: your code is building a regression model, but you are trying to use logloss, so H2O thought you were doing classification. This is caused by not having set the target variable to be a factor. Your current as.factor() line is on the wrong data, in the wrong place. It should go after your as.h2o() lines:
```{r}
train_h2o <- as.h2o(training) # typo fix
test_h2o <- as.h2o(test)
feature_names <- names(training)[1:(ncol(training)-1)] # typo fix
y <- "Species" # the column we want to predict
train_h2o[,y] <- as.factor(train_h2o[,y])
test_h2o[,y] <- as.factor(test_h2o[,y])
```
And then make the model with:
```{r}
model_dl <- h2o.deeplearning(x = feature_names, y = y, training_frame = train_h2o, stopping_metric = "logloss")
```
Get predictions with:
```{r}
pred_dl <- predict(model_dl, newdata = test_h2o) # typo fix
```
And compare the predictions with the correct answers using:
```{r}
cbind(test[, y], as.data.frame(pred_dl$predict))
```
(BTW, H2O always detects the Iris data set columns as numeric vs. factor perfectly, so the above as.factor() lines are not needed; your error message must've been on your original data.)
StackOverflow advice: test your reproducible example, in full, and copy and paste in that exact code, with the exact error message that code is giving you. Your code had numerous small typos. E.g. train in places, training in others. createDataPartition() was not given; I assumed a = sample(nrow(iris), 0.8*nrow(iris)). test has no "id" column.
Other H2O advice:
- Run h2o.removeAll() after h2o.init(). It was giving you an error message if run before. (Personally I avoid that function; it is the kind of thing that gets left in a production script by mistake...)
- Consider importing your data into H2O earlier, and using h2o.splitFrame() to split it (see the sketch below). I.e. avoid doing things in R that H2O can easily handle.
- Avoid having your data in R at all, if you can. Prefer importFile() over as.h2o().

The thinking behind the last two points is that H2O will scale beyond the memory of one machine, while R won't. It is also less confusing than trying to keep track of the same thing in two places.
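A sketch of that pattern (the CSV path is illustrative; as.h2o(iris) would work as a fallback):
```{r}
# Import once into H2O, then split there instead of in R:
iris_h2o <- h2o.importFile("path/to/iris.csv")
splits <- h2o.splitFrame(iris_h2o, ratios = 0.8, seed = 2382)
train_h2o <- splits[[1]]
test_h2o <- splits[[2]]
```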
I had the same issue, but could resolve it quite easily.
My error occurred because I read in an h2o object before initialising the h2o cluster: I had trained an h2o model, saved it, shut down the cluster, loaded the model back in, and only then initialized the cluster again.
Before reading in the h2o object, you should already have initialized the cluster (h2o.init()).
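In other words, the order that works is (the model path is illustrative):
```{r}
library(h2o)
h2o.init()                                     # start the cluster first
model <- h2o.loadModel("path/to/saved_model")  # only then read the saved model back in
```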

package rgp in r gives "NaNs produced" output that cannot be suppressed

I've been using package rgp (genetic programming) in r to predict survival on the Titanic. The input data frame, train, is the training data from Kaggle with the Sex variable changed to 0 for females and 1 for males. Given this, the code is:
```{r}
library(rgp)
fs <- functionSet("+", "-", "*", "exp", "sin", "cos")
ivs <- inputVariableSet("Sex")
cfs <- constantFactorySet(function() rnorm(1))
ff <- function(f) {
  err <- rmse(as.numeric(f(train$Sex)), as.numeric(train$Survived))
  if (is.na(err)) return(1e12) else return(err)
}
set.seed(42)
gp <- geneticProgramming(functionSet = fs,
                         inputVariables = ivs,
                         constantSet = cfs,
                         fitnessFunction = ff,
                         stopCondition = makeTimeStopCondition(60),
                         verbose = FALSE)
```
The issue is that the geneticProgramming function outputs "NaNs produced" frequently. In fact, the output is enough to hang R and RStudio if I run it long enough (e.g. overnight). (The actual code I want to run is more involved and "NaNs produced" prints much more frequently, but this shortened version suffices to show the issue.)
I would like to suppress all output from geneticProgramming. The verbose = FALSE option doesn't address the "NaNs produced" output.
I have tried using capture.output, sink, and invisible. I've tried echo = FALSE in an R markdown cell. All to no avail. I would have thought sink("/dev/null") would solve the problem, but it doesn't.
Where is this output coming from, and how can I suppress it if the usual methods don't work?
Thanks very much.
In your functionSet, remove "sin" and "cos". When these are removed, the NaNs are not produced.
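Alternatively, note that "NaNs produced" is a warning, not standard output, which is why sink() and capture.output() (which divert stdout) don't catch it. Wrapping the call in suppressWarnings() should silence it; a sketch using the objects defined in the question:
```{r}
gp <- suppressWarnings(
  geneticProgramming(functionSet = fs,
                     inputVariables = ivs,
                     constantSet = cfs,
                     fitnessFunction = ff,
                     stopCondition = makeTimeStopCondition(60),
                     verbose = FALSE)
)
```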

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a general linear model on some data and, when I run it through the confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the confusionMatrix predict step. I discovered this by doing the following:
```{r}
a <- table(testing2$hold1yes0no)
a[1] + a[2]
# 1543
b <- table(predict(modelFit, trainTR2))
dim(b)
# [1] 1538
```
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
```{r}
set.seed(2382)
inTrain2 <- createDataPartition(y = HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method = "BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method = "glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit, trainTR2))
```
I'm not sure, as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
```{r}
modelFit <- train(trainPC2, training2$hold1yes0no, method = "glm")
```
which specifies x = trainPC2 and y = training2$hold1yes0no.
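As for the five missing rows in your diagnosis: a common cause is rows with NA values in the predictors being silently dropped by predict(). A quick check worth running (a sketch using the question's objects):
```{r}
sum(!complete.cases(trainTR2))  # if this is 5, NA rows would explain the mismatch
```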
