binomial regression model produces glm.fit error - r

I have data like the following:
'data.frame': 1460 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
$ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
$ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
$ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
$ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
$ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
$ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
$ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
$ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
$ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
$ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
$ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
$ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
$ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
$ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
$ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
$ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
$ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
$ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
$ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
$ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
$ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
$ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
$ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
$ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
$ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
$ BsmtQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
$ BsmtCond : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
$ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
$ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
$ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
$ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
$ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
$ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
$ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
$ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
$ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
$ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
$ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
$ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
$ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
$ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
$ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
$ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
$ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
$ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
$ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
$ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
$ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
$ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
$ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
$ FireplaceQu : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
$ GarageType : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
$ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
$ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
$ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
$ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
$ GarageQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
$ GarageCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
$ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
$ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
$ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
$ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
$ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
$ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
$ PoolQC : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
$ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
$ MiscFeature : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
$ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
$ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
$ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
$ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
$ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
$ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
I would like to fit a GLM that predicts SalePrice from all of the other variables.
When I run this:
cena_nieruchomości.lm <- glm(SalePrice~.,
data=nieruchimości,family=binomial(logit))
I am getting an error:
contrasts can be applied only to factors with 2 or more levels.
I have read that it might occur because of NA values in my data. So I tried:
cena_nieruchomości.lm <- glm(SalePrice~.,
data=nieruchimości,family=binomial("logit"), na.action=na.pass)
Then I get a different error:
Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
Could someone please tell me what I'm doing wrong and how to avoid this error? Could it be because SalePrice is an int (should it be a factor)?

SalePrice is a continuous variable. family=binomial('logit') in your glm() call fits a logistic regression, which assumes a dependent variable that takes only two values.
Given your dependent variable, logistic regression is not the right choice. A linear model fitted with lm() is a better starting point:
cena_nieruchomości.lm <- lm(SalePrice~.,
data=nieruchimości)
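The first error in the question ("contrasts can be applied only to factors with 2 or more levels") is not specific to the binomial family, so it may also appear with lm(). A minimal sketch of how to locate the columns that usually trigger it, on a small made-up data frame (the values and the reduced column set are assumptions, not the real housing data):

```r
# Toy stand-in for the housing data (hypothetical values, not the real set).
toy <- data.frame(
  SalePrice  = c(208500, 181500, 223500, 140000, 250000, 143000),
  LotArea    = c(8450, 9600, 11250, 9550, 14260, 14115),
  Street     = factor(c("Pave", "Pave", "Pave", "Pave", "Pave", "Pave"),
                      levels = c("Grvl", "Pave")),
  Alley      = factor(c(NA, NA, "Grvl", NA, NA, NA), levels = c("Grvl", "Pave")),
  HouseStyle = factor(c("2Story", "1Story", "2Story", "2Story", "1Story", "1.5Fin"))
)

# Count the distinct non-missing values in every factor column.
is_fac <- vapply(toy, is.factor, logical(1))
n_observed <- vapply(toy[is_fac],
                     function(x) length(unique(x[!is.na(x)])),
                     integer(1))

# Factors with fewer than two observed values cannot be given contrasts.
bad <- names(n_observed)[n_observed < 2]

# Drop them and fit an ordinary linear model on what remains.
keep <- setdiff(names(toy), bad)
fit <- lm(SalePrice ~ ., data = toy[keep])
```

On the real data, columns such as Alley or PoolQC that are almost entirely NA are likely candidates, but that should be checked rather than assumed.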

Related

group by and sum not working as expected in R

Hi, I have a simple data frame with this structure:
> str(allvalues)
'data.frame': 150 obs. of 8 variables:
$ seriesId : Factor w/ 1 level "2021-02-28T00:00:00Z": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastPoint : Factor w/ 30 levels "790","791","792",..: 1 2 3 4 5 6 7 8 9 10 ...
$ rowId : Factor w/ 30 levels "2021-03-01T00:00:00.000000Z",..: 1 2 3 4 5 6 7 8 9 10 ...
$ timestamp : Factor w/ 65 levels "1842.6640625",..: 7 8 9 11 14 4 1 16 12 18 ...
$ predictionValues: Factor w/ 1 level "total_visits (actual)": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastDistance: Factor w/ 30 levels "1","10","11",..: 1 12 23 25 26 27 28 29 30 2 ...
$ prediction : num 2111 2130 2258 2276 2298 ...
$ scenario : Factor w/ 5 levels "0 0 10 10 10",..: 4 4 4 4 4 4 4 4 4 4 ...
I want to group by "scenario" and sum "prediction",
but when I use
> allvalues %>% group_by(scenario) %>% summarise(cond_disp = sum(prediction))
cond_disp
1 351940.8
it does not group by scenario. There should be 5 rows, one per scenario, each with its sum.
Any help on what I am doing wrong?
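The question does not show which packages are attached. A single-row result like this is often seen when another package's summarise() (for example plyr's) masks the dplyr one, but that is an assumption here. As a package-free cross-check, the grouped sum can be sketched in base R on made-up data that reuses the question's column names:

```r
# Toy data with the same column names as in the question (values are made up).
allvalues <- data.frame(
  scenario   = factor(c("0 0 10 10 10", "0 0 10 10 10", "5 5 5 5 5", "5 5 5 5 5")),
  prediction = c(2111, 2130, 2258, 2276)
)

# One row per scenario with the summed prediction (base R, no packages).
totals <- aggregate(prediction ~ scenario, data = allvalues, FUN = sum)
names(totals)[2] <- "cond_disp"

# With dplyr attached, explicit namespaces avoid a masked summarise():
# allvalues %>% dplyr::group_by(scenario) %>%
#   dplyr::summarise(cond_disp = sum(prediction))
```

If the base R result has one row per scenario while the pipeline still returns a single row, masking is the likely culprit.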

Why do the levels of just one variable change after correctly combining two dataframes, and how should I deal with it?

I have two dataframes. My first dataframe contains 16 different Lines (genotypes). Because each line has a different number of plants in my experiment, str() shows 145 observations and 16 levels for my Line variable, as you can see here:
'data.frame': 145 obs. of 15 variables:
$ Plate.NO. : int 1 1 1 1 1 1 1 1 1 1 ...
$ Line : Factor w/ 16 levels "L000049","L000154",..: 15 15 15 15 15 7 7 7 7 7 ...
$ Strain : Factor w/ 2 levels "AF1","V31-2": 1 1 1 1 1 1 1 1 1 1 ...
$ Plant.number: int 1 2 3 4 5 1 2 3 4 5 ...
$ X0DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X7DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X10DPI : num 0 0 0 0 1 0 0 0 2 0 ...
$ X12DPI : num 0.5 0 0 0 2 3 2.5 2.5 2 3 ...
$ X14DPI : num 2.5 1 0 0 2 3 2.5 2.5 2.5 3 ...
$ X17DPI : num 4 1 1 0 3 4 2.5 4 3 3 ...
$ X19DPI : num 4 1 1 1 4 4 2.5 4 3 4 ...
$ X21DPI : num 4 1.5 2 1 4 4 3.5 4 4 4 ...
$ X24DPI : num 4 3 2 1 4 4 4 4 4 4 ...
$ X26DPI : num 4 3 2 1 4 4 4 4 4 4 ...
$ X28DPI : num 4 3.5 2.5 1.5 4 4 4 4 4 4 ...
I also have a second dataframe, which contains complementary information for 252 Lines. Here is the str() result for my second dataframe:
'data.frame': 252 obs. of 7 variables:
$ ID : Factor w/ 252 levels "HM001 ","HM002 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Line : Factor w/ 252 levels "A10","A20","CADL",..: 31 38 175 207 206 169 197 ...
$ Population.of.Origin: Factor w/ 252 levels "A10 ","A17_Varma ",..: 157 167 55 53 51 110 ...
$ Country.of.Origin : Factor w/ 19 levels "Algeria ","Cyprus ",..: 16 2 14 1 1 3 3 1 5 8 ...
$ Category : Factor w/ 16 levels "alfalfa ","CC144 ",..: 7 7 7 7 7 7 7 7 3 3 ...
$ Seeds.From : Factor w/ 6 levels "Charlie_Brummer,UGA ",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Status : Factor w/ 2 levels "Failed.QA","Processed": 2 2 2 2 2 2 2 2 2 2 ...
Of these 252 Lines I used only 16 for part of my experiment, and I want to combine the two dataframes.
The first dataframe object is "Rep1" (the one with only 16 Lines) and the second is called "hap" (the one with 252 Lines).
I used this series of commands:
inner <- inner_join(Rep1, hap, by = "Line")
left <- left_join(Rep1, hap, "Line")
right <- right_join(hap, Rep1, "Line")
The join works without any problem and I get only the rows for my 16 Lines. Surprisingly, though, the str() output shows 252 levels for Line instead of 16, while the number of observations is correct.
Here is the str() output of my dataframe after the join:
'data.frame': 145 obs. of 21 variables:
$ ID : Factor w/ 252 levels "HM001 ","HM002 ",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Line : Factor w/ 252 levels "A10","A20","CADL",..: 31 31 31 31 31 31 38 38 38 38 ...
$ Population.of.Origin: Factor w/ 252 levels "A10 ","A17_Varma ",..: 157 157 157 157 157 157 167 167 167 167 ...
$ Country.of.Origin : Factor w/ 19 levels "Algeria ","Cyprus ",..: 16 16 16 16 16 16 2 2 2 2 ...
$ Category : Factor w/ 16 levels "alfalfa ","CC144 ",..: 7 7 7 7 7 7 7 7 7 7 ...
$ Seeds.From : Factor w/ 6 levels "Charlie_Brummer,UGA ",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Status : Factor w/ 2 levels "Failed.QA","Processed": 2 2 2 2 2 2 2 2 2 2 ...
$ Plate.NO. : int 2 2 2 5 5 5 1 1 1 1 ...
$ Strain : Factor w/ 2 levels "AF1","V31-2": 1 1 1 2 2 2 1 1 1 1 ...
$ Plant.number : int 1 2 3 1 2 3 1 2 3 4 ...
$ X0DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X7DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X10DPI : num 0 0.5 3 3 2 1 1 0.5 0 0 ...
$ X12DPI : num 0 1.5 3 3 3 3 1 3 0 0 ...
$ X14DPI : num 0.5 3 4 3 3.5 4 2.5 3 1 0 ...
$ X17DPI : num 1 4 4 3 4 4 3 4 1.5 0 ...
$ X19DPI : num 1.5 4 4 4 4 4 4 4 1.5 0 ...
$ X21DPI : num 2 4 4 4 4 4 4 4 1.5 0 ...
$ X24DPI : num 2 4 4 4 4 4 4 4 1.5 1 ...
$ X26DPI : num 3 4 4 4 4 4 4 4 1.5 1 ...
$ X28DPI : num 3.5 4 4 4 4 4 4 4 2 1 ...
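A factor stores its set of levels separately from the values that actually occur, so columns taken from hap can keep hap's full level set even when only a few values remain after the join. A minimal base R sketch of that behaviour and of droplevels(), using a made-up factor in place of the real Line column:

```r
# Toy factor standing in for hap$Line (hypothetical values).
line <- factor(c("A10", "A20", "CADL", "L000049"))

# Subsetting keeps every original level, even the ones no longer present.
kept <- line[line %in% c("A10", "L000049")]
n_before <- nlevels(kept)

# droplevels() discards the unused levels; it also works on whole data frames.
kept <- droplevels(kept)
n_after <- nlevels(kept)
```

Applied to the joined result, something like right <- droplevels(right) would be the analogous call, assuming the extra levels are simply unused.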

Moving the last column to a nth place in R [duplicate]

Good Day
I am trying to move the last column of a dataset to be the third column in a dataframe in R and was wondering what would be the most efficient way to do this.
My DataFrame structure is as follows:
str(HR)
'data.frame': 2940 obs. of 36 variables:
$ EmployeeNumber : int 1 2 3 4 5 6 7 8 9 10 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ Age : int 41 49 37 33 27 32 59 30 38 36 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3
$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
$ Education : int 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
$ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
$ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
$ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
$ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
$ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
$ AttritionB : num 1 0 1 0 0 0 0 0 0 0 ...
and I am trying to have AttritionB come after Attrition.
I have tried this code, but it drops the rest of the columns:
HRCorForm = HR[,c(1,2,36:35)]
Kind Regards
Rehaan
36:35 expands to 36, 35, so only four columns are selected. This keeps all your columns, with column 36 in third position:
HRCorForm = HR[,c(1,2,36,3:35)]
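The same idea can be written without hard-coded column counts. A minimal sketch on a small made-up frame (the five columns and their values are assumptions chosen for illustration), using append() to splice the last index into the target position:

```r
# Toy frame (hypothetical); the last column should end up in position 3.
HR <- data.frame(EmployeeNumber = 1:3,
                 Attrition = c("Yes", "No", "Yes"),
                 Age = c(41, 49, 37),
                 DailyRate = c(1102, 279, 1373),
                 AttritionB = c(1, 0, 1))

n <- ncol(HR)   # index of the last column
pos <- 3        # target position for the last column

# Columns before the target, then the last column, then everything else.
idx <- append(seq_len(n - 1), n, after = pos - 1)
HRCorForm <- HR[, idx]
```

For the 36-column frame in the question this produces the same ordering as c(1,2,36,3:35).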

C5.0 decision tree - input string 1 is invalid in this locale

I have read the related questions, but still cannot solve my problem. My training data has no missing values, so I don't know what went wrong.
Another problem is that the tree size is 1 and all predicted results are 0 (the label is 0 or 1). I know this is an extremely unbalanced case (label 0 makes up 98%); how do I solve this problem?
model_boost<-C5.0(train,train_label)
Error:
c50 code called exit with value 1
Warning message:
In strsplit(Z$output, "\n"): input string 1 is invalid in this locale
training data:
str(train)
'data.frame': 7500 obs. of 148 variables:
$ CI_CUSTYPE : Factor w/ 4 levels "个人","家庭",..: 2 2 2 2 2 2 2 2 1 2 ...
$ CI_COUNTRY_FLAG : Factor w/ 3 levels "1","2","3": 3 2 3 2 2 2 2 2 2 1 ...
$ CI_AGE : int -1 44 31 53 58 -1 -1 46 43 61 ...
$ CI_GENDER : Factor w/ 3 levels "男","女","未知": 3 1 1 2 2 3 3 2 2 1 ...
$ CI_CITY : Factor w/ 21 levels "阿坝","巴中",..: 16 18 9 3 3 4 5 1 3 19 ...
$ CI_TENURE : int 4 44 205 92 92 26 9 110 24 48 ...
$ IS_DUAL_MODE : Factor w/ 4 levels "0","1","2","3": 2 2 2 1 2 1 4 4 4 2 ...
$ PD_CDMA_PAYMODE : Factor w/ 2 levels "1","2": 2 1 2 2 2 1 1 2 1 1 ...
$ PD_CDMA_TENURE : int 49 43 64 39 19 36 8 52 15 47 ...
$ VO_MOU_TOTAL_AVG : int 9520 344 2287 253 460 249 3 885 623 457 ...
train_label
str(train_label)
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
print(head(train_label))
[1] 0 0 0 0 0 0
Levels: 0 1

Error in scale.default: length of 'center' must equal the number of columns of 'x'

I am using mboost package to do some classification. Here is the code
library('mboost')
load('so-data.rdata')
model <- glmboost(is_exciting~., data=training, family=Binomial())
pred <- predict(model, newdata=test, type="response")
But R complains during prediction:
Error in scale.default(X, center = cm, scale = FALSE) :
length of 'center' must equal the number of columns of 'x'
The data (training and test) can be downloaded here (7z, zip).
What is the reason for the error, and how can I get rid of it? Thank you.
UPDATE:
> str(training)
'data.frame': 439599 obs. of 24 variables:
$ is_exciting : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_state : Factor w/ 52 levels "AK","AL","AR",..: 15 5 5 23 47 5 44 42 42 5 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 1 1 2 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 5 5 3 5 6 5 6 6 5 6 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 19 17 18 18 10 4 17 17 18 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 6 5 5 5 5 4 5 5 5 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 28 18 17 19 26 18 18 28 24 25 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 7 5 5 6 8 5 5 7 7 4 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 4 4 2 5 5 2 2 5 5 5 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 2 2 4 2 1 2 2 1 2 1 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 5 5 2 5 5 2 3 2 4 2 ...
$ fulfillment_labor_materials : num 30 35 35 30 30 35 30 35 35 35 ...
$ total_price_excluding_optional_support: num 1274 477 892 548 385 ...
$ total_price_including_optional_support: num 1499 562 1050 645 453 ...
$ students_reached : int 31 20 250 36 19 28 90 21 60 56 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 2 1 2 1 2 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 2 1 1 ...
$ essay_length : int 236 285 194 351 383 273 385 437 476 159 ...
> str(test)
'data.frame': 44772 obs. of 23 variables:
$ school_state : Factor w/ 51 levels "AK","AL","AR",..: 22 35 11 46 5 35 11 28 28 10 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 3 5 6 6 3 5 5 5 3 5 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 5 16 17 17 18 11 16 17 2 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 2 4 5 5 5 2 4 5 6 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 25 1 19 1 17 9 17 11 1 1 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 4 1 6 1 5 6 5 2 1 1 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 5 5 5 2 5 6 4 5 5 4 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 1 2 4 4 1 2 2 2 1 2 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 4 3 3 5 4 5 5 4 3 5 ...
$ fulfillment_labor_materials : num 30 30 30 30 30 30 30 30 30 30 ...
$ total_price_excluding_optional_support: num 2185 149 1017 156 860 ...
$ total_price_including_optional_support: num 2571 175 1197 183 1012 ...
$ students_reached : int 200 110 10 22 180 51 30 15 260 20 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 2 1 ...
$ essay_length : int 221 137 313 243 373 344 304 431 231 173 ...
> summary(model)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = is_exciting ~ ., data = training, family = Binomial())
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.197806
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept) school_stateDC
-0.5250166130 0.0426909965
school_stateIL school_chartert
0.0084191638 0.0729272310
teacher_prefixMrs. teacher_prefixMs.
-0.0181489492 0.0438425925
teacher_teach_for_americat resource_typeBooks
0.2593005345 0.0046126706
resource_typeTechnology fulfillment_labor_materials
-0.0313904871 0.0120086140
eligible_double_your_impact_matcht eligible_almost_home_matcht
-0.0316376431 -0.0522717398
essay_length
0.0004993224
attr(,"offset")
[1] -1.197806
Selection frequencies:
fulfillment_labor_materials teacher_teach_for_americat
0.24 0.15
essay_length school_chartert
0.15 0.09
teacher_prefixMs. resource_typeTechnology
0.08 0.07
eligible_double_your_impact_matcht eligible_almost_home_matcht
0.07 0.07
teacher_prefixMrs. school_stateDC
0.04 0.02
school_stateIL resource_typeBooks
0.01 0.01
I also tried glm but it said
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor teacher_prefix has new levels
But I don't see any new levels in the teacher_prefix variable:
> levels(training$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
> levels(test$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
The problems with glmboost and glm are related: both come from your teacher_prefix variable.
As the glm error points out, test contains a level that training does not (in effect). While both factors report the same levels(), the training set has no observations where teacher_prefix=="" but test does. Compare
table(test$teacher_prefix)
table(training$teacher_prefix)
So glm is giving the more accurate, helpful error message. The problem is the same with glmboost, although its message is less direct.
Doing this seemed to "fix" it:
test2 <- subset(test, teacher_prefix %in% c("Dr.","Mr.","Mrs.","Ms."))
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
pred <- predict(model, newdata=test2, type="response")
We drop the test rows whose teacher_prefix never occurs in training, remove the now-unused levels, and then do the standard prediction.
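Rather than listing the allowed prefixes by hand, the values that occur only in test can be found programmatically. A minimal sketch on made-up stand-ins for training and test (the rows and the shortened level set are assumptions):

```r
# Toy stand-ins for the two data sets (hypothetical values).
lv <- c("", "Dr.", "Mr.", "Mrs.", "Ms.")
training <- data.frame(
  teacher_prefix = factor(c("Mr.", "Mrs.", "Ms.", "Dr."), levels = lv))
test <- data.frame(
  teacher_prefix = factor(c("", "Mrs.", "Ms."), levels = lv))

# levels() are identical, but the values actually observed are not.
seen_train <- unique(as.character(training$teacher_prefix))
seen_test  <- unique(as.character(test$teacher_prefix))
unseen <- setdiff(seen_test, seen_train)

# Keep only test rows whose value occurred in training, then drop unused levels.
test2 <- test[test$teacher_prefix %in% seen_train, , drop = FALSE]
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
```

Here unseen holds the empty-string prefix, which mirrors the situation described above. Whether to drop such rows or recode them is a modelling decision this sketch does not settle.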
