Error in Logistic Regression for Factors in R - r

I am trying to do logistic regression by using the code:
model <- glm (Participation ~ Gender + Race + Ethnicity + Education + Comorbidities + WLProgram + LoseWeight + EverLoseWeight + PastYearLW + Age + BMI, data = LogisticData, family = binomial)
summary(model)
I keep getting the error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
After searching the forums, I checked which variables were factors:
str(LogisticData)
'data.frame': 994 obs. of 13 variables:
$ outcome : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 2 1 2 2 ...
$ Gender : Factor w/ 3 levels "Male","Female",..: 1 2 2 1 2 1 1 1 1
$ Race : Factor w/ 3 levels "White","Black",..: 1 1 1 3 1 1 1 1 1 1
$ Ethnicity : Factor w/ 2 levels "Hispanic/Latino",..: 2 2 2 2 2 2 2 2 2
$ Education : Factor w/ 2 levels "Below Bachelors",..: 1 1 1 2 1 1 1 2 1
$ Comorbidities : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 2 1 1 ...
$ WLProgram : Factor w/ 2 levels "No","Yes": NA 1 2 2 1 1 1 NA 1 1 ...
$ LoseWeight : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 1 1 ...
$ PastYearLW : Factor w/ 2 levels "Yes","No": NA 2 1 1 1 2 1 NA 1 1 ...
$ EverLoseWeight: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 1 1 ...
$ Age : int 29 35 69 32 21 45 40 62 59 58 ...
$ Participation : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 2 1 2 ...
$ BMI : num 25.7 33.8 26.4 32.3 27.5 ...
All factors appear to have 2 or more levels.
I also tried omitting the NAs, but I still got this error.
I want all variables in the regression, and can't figure out why it won't run.
When I run:
newdata <- droplevels(na.omit(LogisticData))
> str(newdata)
'data.frame': 840 obs. of 13 variables:
$ outcome : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 2 2 ...
$ Gender : Factor w/ 3 levels "Male","Female",..: 2 2 1 2 1 1 1 2 1
$ Race : Factor w/ 3 levels "White","Black",..: 1 1 3 1 1 1 1 1 3
$ Ethnicity : Factor w/ 2 levels "Hispanic/Latino",..: 2 2 2 2 2 2 2 2
$ Education : Factor w/ 2 levels "Below Bachelors",..: 1 1 2 1 1 1 1 1
$ Comorbidities : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
$ WLProgram : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 1 1 1 ...
$ LoseWeight : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ PastYearLW : Factor w/ 2 levels "Yes","No": 2 1 1 1 2 1 1 1 1 2 ...
$ EverLoseWeight: Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : int 35 69 32 21 45 40 59 58 23 32 ...
$ Participation : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 2 1 ...
$ BMI : num 33.8 26.4 32.3 27.5 45.4 ...
- attr(*, "na.action")=Class 'omit' Named int [1:154] 1 8 13 14 21 24 25 46 55 58 ...
.. ..- attr(*, "names")= chr [1:154] "1" "8" "13" "14" ...
This doesn't make sense to me: the first str(LogisticData) output clearly shows two levels in EverLoseWeight (both "Yes" and "No", coded 1 and 2). How do I fix this anomaly?

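Given your update, na.omit() appears to remove every row where LoseWeight or EverLoseWeight is "No": in the output shown, the rows with NA in WLProgram and PastYearLW (e.g. rows 1 and 8) are the ones coded "No", so those two factors are left with a single level. It looks like you have at least two possibilities.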
1: Remove the factors that are left with only a single level after removing the NAs (i.e. LoseWeight and EverLoseWeight).
2: Treat the NAs as an extra level. Something along the lines of
a = as.factor(c(1,1,NA,2))
b = as.factor(c(1,1,2,1))
# 0 is an unused factor level for a
x = data.frame(a, b)
levels(x$a) = c(levels(x$a), 0)
x$a[is.na(x$a)] = 0
But this might not deal with any singularity issues that also resulted in having single-level factors.
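To see which factors collapse before fitting, one could count the levels that survive listwise deletion. The following is a minimal sketch on made-up data: the data frame `d` and its values are hypothetical stand-ins for LogisticData, built on the assumption that WLProgram is NA exactly when LoseWeight is "No".

```r
# Toy data (hypothetical): WLProgram is only recorded when LoseWeight == "Yes"
d <- data.frame(
  Participation = factor(c("Yes", "No", "Yes", "No", "Yes", "No")),
  LoseWeight    = factor(c("Yes", "Yes", "No", "Yes", "No", "Yes")),
  WLProgram     = factor(c("No", "Yes", NA, "No", NA, "Yes"))
)

# glm() drops incomplete rows by default (na.action = na.omit),
# so inspect the levels that remain in the complete cases
complete <- droplevels(na.omit(d))
n_levels <- sapply(Filter(is.factor, complete), nlevels)
single   <- names(n_levels)[n_levels < 2]
single

# Option 1 from above: leave the single-level factors out of the formula
keep <- setdiff(names(complete), c("Participation", single))
form <- reformulate(keep, response = "Participation")
form
```

Building the formula from `keep` avoids retyping the predictor list, but whether dropping those predictors is acceptable depends on the analysis.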

Try running summary() on your raw data and make sure that every level actually has observations.
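As a rough illustration of that check, a cross-tabulation that keeps NA as its own column can show whether the missing values line up with one level of another factor. The data frame `d` below is hypothetical toy data, not the poster's.

```r
# Toy data (hypothetical): NA in WLProgram whenever LoseWeight is "No"
d <- data.frame(
  LoseWeight = factor(c("Yes", "Yes", "No", "Yes", "No")),
  WLProgram  = factor(c("No", "Yes", NA, "No", NA))
)

# Per-column overview, including NA counts
summary(d)

# Cross-tabulate, showing NA as a separate column when present
tab <- table(LoseWeight = d$LoseWeight, WLProgram = d$WLProgram,
             useNA = "ifany")
tab
```

If the "No" row has counts only in the NA column, removing incomplete rows will remove every "No" observation.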

Related

Weird error: not found #dependent variable in eval(predvars, data, env)

I am getting a weird error when trying to make predictions from my model.
My original dataset is a discrete choice experiment where doctors evaluate a set of patients with different characteristics and make a treatment choice.
The structure of my data is like this:
>str(dcefull2)
'data.frame': 350 obs. of 28 variables:
$ id : Factor w/ 25 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ choice : Factor w/ 3 levels "stop","half",..: 2 3 2 2 2 2 2 2 2 3 ...
$ response : Factor w/ 2 levels "Good response",..: 1 2 2 2 2 1 2 2 1 1 ...
$ resist_profile : Factor w/ 3 levels "MDR-TB","MDR-TB + PZA + EMB resis",..: 3 2 2 3 3 2 1 2 3 3 ...
$ ambu_regimen : Factor w/ 2 levels "BPaLM","Standard": 2 2 2 2 1 2 2 2 2 1 ...
$ expo_his : Factor w/ 2 levels "No exposure",..: 2 1 1 1 2 2 2 1 2 1 ...
$ resist_prob : num 2.8 0.8 1.8 1.8 1.8 1.8 2.8 0.8 2.8 0.8 ...
$ cred_interval : Factor w/ 2 levels "Wide","Narrow": 2 2 2 1 2 2 2 1 1 1 ...
I fitted an ordered logistic model with the choice (ordinal variable with 3 categories) as a function of patient characteristics.
random_model_off3 <- clmm2(choice ~ response + resist_prob*resist_profile + resist_prob*ambu_regimen + resist_prob*expo_his + resist_prob*cred_interval, random = id, data=dcefull2, Hess = TRUE, nAGQ = 10)
I then created a new dataset containing all the independent variables from the original dataset. I varied the values of 'resist_prob' and 'cred_interval' and held each of the other variables at one fixed value.
The structure of my new data is like this:
>str(newdat2)
'data.frame': 2002 obs. of 6 variables:
$ response : Factor w/ 1 level "Good response": 1 1 1 1 1 1 1 1 1 1 ...
$ resist_profile: Factor w/ 1 level "MDR-TB": 1 1 1 1 1 1 1 1 1 1 ...
$ cred_interval : Factor w/ 2 levels "Wide","Narrow": 1 2 1 2 1 2 1 2 1 2 ...
$ ambu_regimen : Factor w/ 1 level "BPaLM": 1 1 1 1 1 1 1 1 1 1 ...
$ expo_his : Factor w/ 1 level "No exposure": 1 1 1 1 1 1 1 1 1 1 ...
$ resist_prob : num 0 0 0.004 0.004 0.008 0.008 0.012 0.012 0.016 0.016 ...
I used predict() to get the predicted probability of each treatment choice for each row.
predict(random_model_off3, newdata = newdat2)
Then I received this error:
Error in eval(predvars, data, env) : object 'choice' not found
I find this very weird because "choice" is my dependent variable. I cannot find any similar issues or solutions on the internet.
I would very much appreciate your help!

R, aggregate function apparently causes loss of column levels?

I just encountered a weird situation in RGui...I used the same script as always to get my data.frame into the right shape for ggplot2. So my data looks like the following:
time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
'data.frame': 185648 obs. of 10 variables:
$ time : int 5 5 5 5 5 5 6 6 6 6 ...
$ days : int 62 62 62 62 62 62 69 69 69 69 ...
$ treatment : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
$ parallel : int 1 2 3 1 2 3 1 2 3 1 ...
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
$ habitat : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
$ cellcounts : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 0 0 0 0 0 0 0 0 0 ...
I wanted aggregate() to calculate the mean value across my (up to 3) parallels:
df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)
Afterwards, the level "biofilm" in column "habitat" is lost:
df_mean<-droplevels(df_mean)
str(df_mean)
'data.frame': 44608 obs. of 9 variables:
$ time : int 1 2 1 2 1 2 1 2 1 2 ...
$ days : int 2 22 2 22 2 22 2 22 2 22 ...
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
$ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
$ habitat : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
$ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
$ variable : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
$ cellcounts : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 0.00359 0 0 0 ...
So I spent a lot of time looking into this (I actually just realised that many other issues I had now all seem to be aggregate-related). When I removed the column "cellcounts", it worked. Interestingly, for "biofilm" the columns "cellcounts" and "habitat" always carry the same, and therefore redundant, information ("biofilm" always goes with NA). Is this the cause? It always worked before, so I can't get my head around this. Was there a change to base::aggregate or something like that? Do you have an explanation for me? I'm using R 3.4.0; the other packages used are reshape, reshape2 and ggplot2.
Thx a lot, a confused crazysantaclaus
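The issue comes from the NA values: the formula method of aggregate() uses na.action = na.omit by default, so every row with an NA in cellcounts (here, all the "biofilm" rows) is dropped before the means are computed. Maybe your file was loaded differently in the past and these were stored as strings instead of NA values? Here's a way to solve it by setting them to a "NA" string: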
levels(df$cellcounts) <- c(levels(df$cellcounts),"NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
df_mean <- aggregate(value ~ time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean,na.rm=TRUE)
df_mean<-droplevels(df_mean)
str(df_mean)
'data.frame': 4 obs. of 9 variables:
$ time : int 1 2 1 2
$ days : int 2 22 2 22
$ treatment : Factor w/ 2 levels "control","treated": 1 1 2 2
$ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
$ habitat : Factor w/ 2 levels "biofilm","water": 2 2 2 1
$ disturbance : Factor w/ 1 level "none": 1 1 1 1
$ variable : Factor w/ 1 level "Proteobacteria": 1 1 1 1
$ cellcounts : Factor w/ 2 levels "batch","NA": 1 1 1 2
$ value : num 0 0.00359 0 0
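The row loss itself can be reproduced on a few rows. This is a small sketch with a hypothetical data frame `toy` and an arbitrary placeholder label "missing"; it only contrasts the default behaviour of the formula interface with the recoding idea used above.

```r
# Toy data (hypothetical): the only "biofilm" row has NA in cellcounts
toy <- data.frame(
  habitat    = factor(c("water", "water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", "batch", NA)),
  value      = c(0, 0.004, 0, 0)
)

# Default: rows with NA in any formula variable are removed first
res_default <- aggregate(value ~ habitat + cellcounts, data = toy, FUN = mean)
res_default

# Recode NA to an explicit level so the row takes part in the grouping
toy2 <- toy
levels(toy2$cellcounts) <- c(levels(toy2$cellcounts), "missing")
toy2$cellcounts[is.na(toy2$cellcounts)] <- "missing"
res_recode <- aggregate(value ~ habitat + cellcounts, data = toy2, FUN = mean)
res_recode
```

With the default call the "biofilm" group never reaches the result; after recoding it appears as its own row.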
data
df <- read.table(text=" time days treatment nucleic_acid habitat parallel disturbance variable cellcounts value
1 1 2 control dna water 1 none Proteobacteria batch 0.000000000
2 2 22 control dna water 1 none Proteobacteria batch 0.003586543
3 1 2 treated dna water 1 none Proteobacteria batch 0.000000000
4 2 22 treated dna biofilm 1 none Proteobacteria NA 0.000000000
",header = TRUE, stringsAsFactors = TRUE)  # factors are needed for levels(); not the default since R 4.0

Merging in R: 1 row missing after merge

I have a dataframe movielens:
str(u.data)
'data.frame': 100000 obs. of 4 variables:
$ userID : int 196 186 22 244 166 298 115 253 305 6 ...
$ movieID : int 242 302 377 51 346 474 265 465 451 86 ...
$ rating : int 3 3 1 2 1 4 2 5 3 3 ...
$ timestamp: int 881250949 891717742 878887116 880606923 886397596 884182806 881171488 891628467 886324817 883603013 ...
and
str(u.item)
'data.frame': 1681 obs. of 20 variables:
$ unknown : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Action : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 1 1 ...
$ Adventure : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ Animation : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
$ Childrens : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 1 ...
$ Comedy : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
$ Crime : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
$ Documentary: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Drama : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
$ Fantasy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Film-Noir : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Horror : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Musical : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Mystery : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Romance : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Sci-Fi : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ Thriller : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
$ War : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ Western : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ movieID : int 1 2 3 4 5 6 7 8 9 10 ...
u.data has 100,000 rows:
nrow(u.data)
[1] 100000
and u.item has 1,681 rows:
nrow(u.item)
[1] 1681
Then, I want to merge them:
all_data = u.data
all_data = merge(all_data, u.item, by = "movieID")
But the merged data has only 99,999 rows:
nrow(all_data)
[1] 99999
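Did I do something wrong while merging these two data frames?
This happens whenever u.data contains a movieID that has no match in u.item, because merge() keeps only matching rows by default. That is the case, for instance, if min(u.data$movieID) < min(u.item$movieID) or if max(u.data$movieID) > max(u.item$movieID), but a gap inside the range has the same effect. Example with one unmatched movieID (10 is in u.data but not in u.item):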
# max(u.data$movieID) = 10
u.data <- data.frame(movieID = 1:10, NAME = LETTERS[1:10])
dim(u.data)
# [1] 10 2
# max(u.item$movieID) = 11
u.item <- data.frame(movieID = c(1:9,11), name = letters[c(1:9,11)])
dim(u.item)
# [1] 10 2
out <- merge(u.data, u.item, by = "movieID")
dim(out)
# [1] 9 3
# check whether each element of u.data$movieID exists in u.item$movieID
is.element(u.data$movieID, u.item$movieID)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
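Suggested by Batanichek: use all.x = TRUE to keep the unmatched rows of u.data (the columns coming from u.item are filled with NA for them):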
out <- merge(u.data, u.item, by = "movieID", all.x = TRUE)
dim(out)
# [1] 10 3
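To find out which IDs cause the loss, one could compare the key columns directly before merging. A short sketch on the same kind of toy data (the `rating` and `title` columns are made up for illustration):

```r
# Toy data (hypothetical): movieID 10 is rated but has no entry in u.item
u.data <- data.frame(movieID = 1:10,
                     rating  = c(3, 3, 1, 2, 1, 4, 2, 5, 3, 3))
u.item <- data.frame(movieID = c(1:9, 11),
                     title   = letters[c(1:9, 11)])

# IDs that appear in u.data but have no match in u.item
missing_ids <- setdiff(u.data$movieID, u.item$movieID)
missing_ids

# The rows an inner merge would silently drop
lost_rows <- u.data[!(u.data$movieID %in% u.item$movieID), ]
nrow(lost_rows)
```

The number of rows in `lost_rows` should equal the difference between nrow(u.data) and the row count of the merged result, provided movieID is unique in u.item.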

Find the average of one variable in multiple year classes in R

I have 50 year-classes, and age and length data on individuals within each year class.
I'm trying to plot the average age for each year class without creating a separate data set per year class,
that is, year class along the x axis and average age (or length) for each year class along the y axis.
This is my data frame:
'data.frame': 236628 obs. of 7 variables:
$ maturity: Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ year : Factor w/ 50 levels "1966","1967",..: 1 1 1 1 1 1 1 1 1 1 ...
$ quarter : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ area : Factor w/ 10 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
$ lngth : int 145 145 145 150 150 150 150 150 155 155 ...
$ age : int 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Ord.factor w/ 2 levels "0"<"1": 1 2 2 2 2 2 2 2 1 2 ...
Cheers
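One possible approach, sketched here on a small made-up data frame `fish` that reuses the column names year, age and lngth from the structure above, is to compute the per-year means with aggregate() and plot the result. This is an illustration under those assumptions, not a tested solution for the full data set.

```r
# Toy data (hypothetical values), same column names as in the question
fish <- data.frame(
  year  = factor(rep(c("1966", "1967", "1968"), each = 4)),
  age   = c(1, 2, 3, 4, 2, 2, 3, 3, 1, 1, 2, 2),
  lngth = c(145, 150, 155, 160, 150, 150, 155, 160, 140, 145, 150, 150)
)

# Mean age and mean length per year class in one call
means <- aggregate(cbind(age, lngth) ~ year, data = fish, FUN = mean)

# year is a factor, so convert via character to get the actual year numbers
means$year_num <- as.numeric(as.character(means$year))

plot(means$year_num, means$age, type = "b",
     xlab = "Year class", ylab = "Mean age")
```

Swapping `means$age` for `means$lngth` in the plot call gives the corresponding plot for mean length.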

Error in scale.default: length of 'center' must equal the number of columns of 'x'

I am using the mboost package to do some classification. Here is the code:
library('mboost')
load('so-data.rdata')
model <- glmboost(is_exciting~., data=training, family=Binomial())
pred <- predict(model, newdata=test, type="response")
But R complains when doing prediction that
Error in scale.default(X, center = cm, scale = FALSE) :
length of 'center' must equal the number of columns of 'x'
The data (training and test) can be downloaded here (7z, zip).
What is the reason for the error and how can I get rid of it? Thank you.
UPDATE:
> str(training)
'data.frame': 439599 obs. of 24 variables:
$ is_exciting : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_state : Factor w/ 52 levels "AK","AL","AR",..: 15 5 5 23 47 5 44 42 42 5 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 1 1 2 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 5 5 3 5 6 5 6 6 5 6 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 19 17 18 18 10 4 17 17 18 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 6 5 5 5 5 4 5 5 5 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 28 18 17 19 26 18 18 28 24 25 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 7 5 5 6 8 5 5 7 7 4 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 4 4 2 5 5 2 2 5 5 5 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 2 2 4 2 1 2 2 1 2 1 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 5 5 2 5 5 2 3 2 4 2 ...
$ fulfillment_labor_materials : num 30 35 35 30 30 35 30 35 35 35 ...
$ total_price_excluding_optional_support: num 1274 477 892 548 385 ...
$ total_price_including_optional_support: num 1499 562 1050 645 453 ...
$ students_reached : int 31 20 250 36 19 28 90 21 60 56 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 2 1 2 1 2 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 2 1 1 ...
$ essay_length : int 236 285 194 351 383 273 385 437 476 159 ...
> str(test)
'data.frame': 44772 obs. of 23 variables:
$ school_state : Factor w/ 51 levels "AK","AL","AR",..: 22 35 11 46 5 35 11 28 28 10 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 3 5 6 6 3 5 5 5 3 5 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 5 16 17 17 18 11 16 17 2 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 2 4 5 5 5 2 4 5 6 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 25 1 19 1 17 9 17 11 1 1 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 4 1 6 1 5 6 5 2 1 1 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 5 5 5 2 5 6 4 5 5 4 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 1 2 4 4 1 2 2 2 1 2 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 4 3 3 5 4 5 5 4 3 5 ...
$ fulfillment_labor_materials : num 30 30 30 30 30 30 30 30 30 30 ...
$ total_price_excluding_optional_support: num 2185 149 1017 156 860 ...
$ total_price_including_optional_support: num 2571 175 1197 183 1012 ...
$ students_reached : int 200 110 10 22 180 51 30 15 260 20 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 2 1 ...
$ essay_length : int 221 137 313 243 373 344 304 431 231 173 ...
> summary(model)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = is_exciting ~ ., data = training, family = Binomial())
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.197806
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept) school_stateDC
-0.5250166130 0.0426909965
school_stateIL school_chartert
0.0084191638 0.0729272310
teacher_prefixMrs. teacher_prefixMs.
-0.0181489492 0.0438425925
teacher_teach_for_americat resource_typeBooks
0.2593005345 0.0046126706
resource_typeTechnology fulfillment_labor_materials
-0.0313904871 0.0120086140
eligible_double_your_impact_matcht eligible_almost_home_matcht
-0.0316376431 -0.0522717398
essay_length
0.0004993224
attr(,"offset")
[1] -1.197806
Selection frequencies:
fulfillment_labor_materials teacher_teach_for_americat
0.24 0.15
essay_length school_chartert
0.15 0.09
teacher_prefixMs. resource_typeTechnology
0.08 0.07
eligible_double_your_impact_matcht eligible_almost_home_matcht
0.07 0.07
teacher_prefixMrs. school_stateDC
0.04 0.02
school_stateIL resource_typeBooks
0.01 0.01
I also tried glm but it said
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor teacher_prefix has new levels
But I don't see any new levels in the teacher_prefix variable:
> levels(training$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
> levels(test$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
Actually, the problems with glmboost and glm are related: both come from your teacher_prefix variable.
As the glm error points out, there are levels in test that are effectively not in training. While both factors have the same levels(), the training set has no observations where teacher_prefix == "" but test does, and glm drops unused levels when fitting, so that level counts as new at prediction time. Compare
table(test$teacher_prefix)
table(training$teacher_prefix)
So glm is actually giving the more accurate, helpful error message. The problem is the same with glmboost, although it isn't as direct about saying so.
Doing this seemed to "fix" it:
test2 <- subset(test, teacher_prefix %in% c("Dr.","Mr.","Mrs.","Ms."))
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
pred <- predict(model, newdata=test2, type="response")
We keep only the test rows whose teacher_prefix is one of the listed values, drop the now-unused levels, and then do the standard prediction.
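Rather than hard-coding the prefixes, the values that occur in test but never in training could be found programmatically. A minimal sketch with toy data; the helper `unseen_levels` is a hypothetical name introduced here, not part of mboost or base R.

```r
# Toy data (hypothetical): same declared levels, but "" is never observed in training
lv <- c("", "Mr.", "Mrs.", "Ms.")
training <- data.frame(teacher_prefix = factor(c("Mr.", "Mrs.", "Ms.", "Mrs."), levels = lv))
test     <- data.frame(teacher_prefix = factor(c("", "Mr.", "Ms."), levels = lv))

# Hypothetical helper: values observed in test_col but never in train_col
unseen_levels <- function(train_col, test_col) {
  setdiff(unique(as.character(test_col)), unique(as.character(train_col)))
}

# Apply the check to every factor column of test
factor_cols <- names(test)[sapply(test, is.factor)]
unseen <- lapply(setNames(factor_cols, factor_cols),
                 function(nm) unseen_levels(training[[nm]], test[[nm]]))
unseen$teacher_prefix

# Keep only the rows that can be scored, then drop the unused levels
ok    <- !(test$teacher_prefix %in% unseen$teacher_prefix)
test2 <- droplevels(test[ok, , drop = FALSE])
nrow(test2)
```

Rows removed this way receive no prediction at all, so whether to drop them or to recode the offending value is a modelling decision.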
