rpart warning message and no proper rpart plot formed

I tried rpart and got the following warning, and the rpart plot showed only 0:
Warning messages: error was: Setting row names on a tibble is deprecated.
No proper rpart plot was formed. This is the code I ran, followed by the structure of the training data:
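library(caret)    # assumed: train() with method = "rpart" is caret's
library(doSNOW)   # registerDoSNOW(); also attaches snow for makeSOCKcluster()
rpart.oc <- function(seed, training, labels, otrl) {
  # start 6 socket workers and register them as the parallel backend
  ol <- makeSOCKcluster(6, type = "SOCK")
  registerDoSNOW(ol)
  set.seed(seed)
  rpart.oc <- train(x = training, y = labels, method = "rpart", tuneLength = 30, trControl = otrl)
  stopCluster(ol)
  return(rpart.oc)
}
# the original call had an unbalanced extra closing parenthesis
rpart.1.cv.1o <- rpart.oc(94622, rpart.train.1o, of.label, otrl.3)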
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1898 obs. of 6 variables:
$ so : Factor w/ 10 levels "",..: 7 7 7 7 7 7 7 7 7 7 ...
$ a : num 63 7 3 45 2 4 69 0 7 0 ...
$ n : Factor w/ 5 levels "","",..: 3 2 2 1 2 3 1 1 2 2 ...
$ s: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ d : Factor w/ 7 levels "Friday","Monday",..: 6 6 7 7 5 5 1 1 1 1 ...
$ c: Factor w/ 7 levels "A","C",..: 6 2 4 2 2 4 1 2 7 2
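One thing that may be worth trying, assuming train() here is caret's and that the warning comes from the predictors being a tibble, is to convert them to a plain data frame before the call. The sketch below uses a small made-up object (toy) that is given the same class vector as in the str() output. Whether this also fixes the root-only tree is not something the post confirms.

```r
# Toy stand-in for rpart.train.1o (hypothetical data), given the same
# class vector that str() reports for the real object.
toy <- data.frame(a = c(63, 7, 3, 45),
                  s = factor(c("0", "1", "0", "1")))
class(toy) <- c("tbl_df", "tbl", "data.frame")

# as.data.frame() strips the tibble classes and leaves a plain data frame.
plain <- as.data.frame(toy)
print(class(plain))

# The original call would then become (not run here):
# rpart.1.cv.1o <- rpart.oc(94622, as.data.frame(rpart.train.1o), of.label, otrl.3)
```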

group by and sum not working as expected in R

Hi, I have a simple data frame with this structure:
> str(allvalues)
'data.frame': 150 obs. of 8 variables:
$ seriesId : Factor w/ 1 level "2021-02-28T00:00:00Z": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastPoint : Factor w/ 30 levels "790","791","792",..: 1 2 3 4 5 6 7 8 9 10 ...
$ rowId : Factor w/ 30 levels "2021-03-01T00:00:00.000000Z",..: 1 2 3 4 5 6 7 8 9 10 ...
$ timestamp : Factor w/ 65 levels "1842.6640625",..: 7 8 9 11 14 4 1 16 12 18 ...
$ predictionValues: Factor w/ 1 level "total_visits (actual)": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastDistance: Factor w/ 30 levels "1","10","11",..: 1 12 23 25 26 27 28 29 30 2 ...
$ prediction : num 2111 2130 2258 2276 2298 ...
$ scenario : Factor w/ 5 levels "0 0 10 10 10",..: 4 4 4 4 4 4 4 4 4 4 ...
I want to group by "scenario" and sum "prediction",
but when I use
> allvalues %>% group_by(scenario) %>% summarise(cond_disp = sum(prediction))
cond_disp
1 351940.8
it is not grouping by scenario. There should be 5 rows, one per scenario with its sum.
Any help on what I am doing wrong?
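A single-row result like this is what you would see if a different summarise() were being called, for example plyr's version masking dplyr's because plyr was loaded later. That is an assumption, since the post does not show which packages are attached. Calling dplyr::summarise() explicitly would rule it out. As a cross-check that needs no packages, base R's aggregate() gives the per-scenario sums; the data below are made up.

```r
# Hypothetical miniature of allvalues: two scenarios, numeric prediction.
allvalues <- data.frame(
  scenario   = factor(c("0 0 10 10 10", "0 0 10 10 10", "5 5 5 5 5")),
  prediction = c(2111, 2130, 2258)
)

# One row per scenario with the summed prediction.
sums <- aggregate(prediction ~ scenario, data = allvalues, FUN = sum)
print(sums)

# With dplyr attached, the namespaced form avoids any masking (not run here):
# allvalues %>% dplyr::group_by(scenario) %>% dplyr::summarise(cond_disp = sum(prediction))
```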

How to combine Columns in R

http://imgur.com/a/q4IdW "table"
Hi, I have a file of coded complaints (see the link above), and I need to find a way to combine the 4 issue columns (primary issue, secondary issue, etc.) so that I can then sum up all the issues together. A complaint can have multiple issues, which is why it is broken down like this, but for analysis purposes I want to treat all the issue columns as the same. I am very new to R, so please try to use terms I'll be able to understand or can google fairly quickly.
> str(mydata)
'data.frame': 136 obs. of 25 variables:
$ ï..Issue.ID : Factor w/ 136 levels "CAO-2017-01",..: 20 21 22 23 24 25 26 27 28 29 ...
$ Reviewer.ID : Factor w/ 1 level "Vinokurov, A": 1 1 1 1 1 1 1 1 1 1 ...
$ Review.Date : Factor w/ 3 levels "6/30/2017","7/14/2017",..: 1 1 1 1 1 2 2 2 2 2 ...
$ CBA.ZIP.CODE : Factor w/ 61 levels "Allentown-Bethlehem-Easton, PA",..: 29 13 24 10 29 13 10 9 47 39 ...
$ Source.of.complaint : Factor w/ 7 levels "Advocate","Beneficiary",..: 7 7 3 7 6 7 2 3 3 3 ...
$ Primary.Issue.Category : Factor w/ 10 levels "Billing, coverage, coordination of benefits",..: 3 8 4 4 4 4 7 4 4 4 ...
$ Secondary.Issue.Category : Factor w/ 15 levels "","ABN issues ",..: 4 1 15 1 15 3 3 15 15 15 ...
$ Third.Issue.Category : Factor w/ 12 levels "","- Error -",..: 1 1 1 1 1 5 1 10 1 1 ...
$ Fourth.Issue.Category : Factor w/ 2 levels "","Low quantity/quality": 1 1 1 1 1 1 1 1 1 1 ...
$ Reviewer.Issue.Notes : logi NA NA NA NA NA NA ...
$ Primary.Equipment.Category : Factor w/ 13 levels "Commode chairs",..: 9 7 2 8 2 9 10 10 2 2 ...
$ Secondary.Equipment.Category : Factor w/ 7 levels "","- Error -",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Third.Equipment.Category : Factor w/ 10 levels "","- Error -",..: 1 1 1 1 1 2 1 2 1 1 ...
$ Fourth.Equipment.Category : logi NA NA NA NA NA NA ...
$ Reviewer.Equipment.Notes : logi NA NA NA NA NA NA ...
$ Primary.Resolution.Category : Factor w/ 16 levels "Beneficiary educated about DMEPOS\n",..: 9 12 15 12 5 14 13 9 5 10 ...
$ Secondary.Resolution.Category: Factor w/ 18 levels "","- Error -",..: 1 1 3 1 4 7 7 17 15 1 ...
$ Third.Resolution.Category : Factor w/ 8 levels "","Beneficiary educated about inquiry ",..: 1 1 1 1 3 1 1 1 1 1 ...
$ Fourth.Resolution.Category : logi NA NA NA NA NA NA ...
$ Reviewer.Resolution.Notes : logi NA NA NA NA NA NA ...
$ Future.Action : Factor w/ 4 levels "no","No","yes",..: 4 4 2 2 2 2 3 4 1 1 ...
$ Coder.1 : Factor w/ 2 levels "Briskin-Limehouse, A",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Coder.1.Coded.Date : Factor w/ 4 levels "6/30/2017","7/13/2017",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Coder.2 : Factor w/ 1 level "Aliu, F": 1 1 1 1 1 1 1 1 1 1 ...
$ Coder.2.Coded.Date : Factor w/ 7 levels "6/30/2017","7/12/2017",..: 1 1 1 1 1 2 3 3 3 3 ...
What I understood is that you have something like this:
issue_1 issue_2 issue_3 issue_4
person1 0 0 0 1
person2 1 1 0 1
person3 1 0 1 1
where 1 marks the presence of an issue and 0 its absence, taken from some survey.
Would you like to show something like this?
issue_1 appeared 2x
issue_2 appeared 1x
issue_3 appeared 1x
issue_4 appeared 3x
Could you check and answer again, please?
Please use str(your_data) too, since you can't link the data for us.
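If the goal is to count every issue regardless of which of the four columns it was coded in, one possible approach is to stack the issue columns into a single vector and tabulate it. This is a sketch on made-up data: only the column names come from the str() output, and blanks are assumed to mean "no further issue".

```r
# Hypothetical stand-in for mydata with the four issue columns.
mydata <- data.frame(
  Primary.Issue.Category   = factor(c("Billing", "Delivery", "Billing")),
  Secondary.Issue.Category = factor(c("", "Billing", "ABN issues ")),
  Third.Issue.Category     = factor(c("", "", "Delivery")),
  Fourth.Issue.Category    = factor(c("", "", ""))
)

# Pick the issue columns by name pattern.
issue_cols <- grep("Issue\\.Category$", names(mydata), value = TRUE)

# Stack them into one character vector, trim stray spaces, drop blanks.
all_issues <- unlist(lapply(mydata[issue_cols], as.character), use.names = FALSE)
all_issues <- trimws(all_issues)
all_issues <- all_issues[!is.na(all_issues) & all_issues != ""]

# Count how often each issue appears across all four columns.
issue_counts <- sort(table(all_issues), decreasing = TRUE)
print(issue_counts)
```

The same pattern with "Equipment\\.Category$" or "Resolution\\.Category$" would cover the other column groups.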

c50 code called exit with value 1 on Mushroom Data set [duplicate]

This question already has answers here:
C5.0 decision tree - c50 code called exit with value 1
(6 answers)
Closed 6 years ago.
I'm getting an error while working with C5.0 on the Mushroom data set. I've made the target class a factor and there are no missing values.
library(C50)  # provides C5.0()
f <- file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open = "r")
data <- read.table(f, sep = ",", header = FALSE)
str(data)
gives
'data.frame': 8124 obs. of 23 variables:
$ V1 : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
$ V2 : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ V3 : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ V4 : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ V5 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
$ V6 : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ V7 : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ V8 : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ V9 : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ V10: Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ V11: Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ V12: Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ V13: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V14: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V15: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ V16: Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ V17: Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ V18: Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ V19: Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ V20: Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
$ V21: Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
$ V22: Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
$ V23: Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
and when I run
C5.model <- C5.0(data[1:4000,-1],data[1:4000,1],trials = 3)
gives
c50 code called exit with value 1
I have no clue how to track this down. Any ideas on debugging are appreciated.
Edit 1: The error is the same, but the solution is different.
Note: when I changed the data set, it worked.
f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", open="r")
data <- read.table(f, sep=",", header=F)
str(data)
pacman::p_load(C50)
C5.model <- C5.0(data[1:10000,c(2:16,18:23)],data[1:10000,1],trials = 3,na.action = na.pass)
Column 17 (V17) was the cause of this problem: it has only one level ("p"), so it carries no variation.
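A quick way to spot such columns before fitting is to count the distinct values per column. A minimal sketch on made-up data (the column names mirror the V-names above, the values are invented):

```r
# Hypothetical data with one constant column, like V17 in the mushroom data.
data <- data.frame(
  V1  = factor(c("e", "p", "e", "p")),
  V2  = factor(c("b", "c", "f", "b")),
  V17 = factor(c("p", "p", "p", "p"))
)

# Number of distinct values per column; 1 means no variation.
n_distinct <- sapply(data, function(col) length(unique(col)))
constant_cols <- names(n_distinct)[n_distinct < 2]
print(constant_cols)

# Drop the constant columns before passing predictors to the model.
data_clean <- data[, !(names(data) %in% constant_cols), drop = FALSE]
```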

Error in scale.default: length of 'center' must equal the number of columns of 'x'

I am using the mboost package to do some classification. Here is the code:
library('mboost')
load('so-data.rdata')
model <- glmboost(is_exciting~., data=training, family=Binomial())
pred <- predict(model, newdata=test, type="response")
But R complains during prediction:
Error in scale.default(X, center = cm, scale = FALSE) :
length of 'center' must equal the number of columns of 'x'
The data (training and test) can be downloaded here (7z, zip).
What is the reason for the error, and how do I get rid of it? Thank you.
UPDATE:
> str(training)
'data.frame': 439599 obs. of 24 variables:
$ is_exciting : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_state : Factor w/ 52 levels "AK","AL","AR",..: 15 5 5 23 47 5 44 42 42 5 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 1 1 2 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 5 5 3 5 6 5 6 6 5 6 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 19 17 18 18 10 4 17 17 18 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 6 5 5 5 5 4 5 5 5 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 28 18 17 19 26 18 18 28 24 25 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 7 5 5 6 8 5 5 7 7 4 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 4 4 2 5 5 2 2 5 5 5 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 2 2 4 2 1 2 2 1 2 1 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 5 5 2 5 5 2 3 2 4 2 ...
$ fulfillment_labor_materials : num 30 35 35 30 30 35 30 35 35 35 ...
$ total_price_excluding_optional_support: num 1274 477 892 548 385 ...
$ total_price_including_optional_support: num 1499 562 1050 645 453 ...
$ students_reached : int 31 20 250 36 19 28 90 21 60 56 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 2 1 2 1 2 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 2 1 1 ...
$ essay_length : int 236 285 194 351 383 273 385 437 476 159 ...
> str(test)
'data.frame': 44772 obs. of 23 variables:
$ school_state : Factor w/ 51 levels "AK","AL","AR",..: 22 35 11 46 5 35 11 28 28 10 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 3 5 6 6 3 5 5 5 3 5 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 5 16 17 17 18 11 16 17 2 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 2 4 5 5 5 2 4 5 6 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 25 1 19 1 17 9 17 11 1 1 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 4 1 6 1 5 6 5 2 1 1 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 5 5 5 2 5 6 4 5 5 4 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 1 2 4 4 1 2 2 2 1 2 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 4 3 3 5 4 5 5 4 3 5 ...
$ fulfillment_labor_materials : num 30 30 30 30 30 30 30 30 30 30 ...
$ total_price_excluding_optional_support: num 2185 149 1017 156 860 ...
$ total_price_including_optional_support: num 2571 175 1197 183 1012 ...
$ students_reached : int 200 110 10 22 180 51 30 15 260 20 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 2 1 ...
$ essay_length : int 221 137 313 243 373 344 304 431 231 173 ...
> summary(model)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = is_exciting ~ ., data = training, family = Binomial())
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.197806
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept) school_stateDC
-0.5250166130 0.0426909965
school_stateIL school_chartert
0.0084191638 0.0729272310
teacher_prefixMrs. teacher_prefixMs.
-0.0181489492 0.0438425925
teacher_teach_for_americat resource_typeBooks
0.2593005345 0.0046126706
resource_typeTechnology fulfillment_labor_materials
-0.0313904871 0.0120086140
eligible_double_your_impact_matcht eligible_almost_home_matcht
-0.0316376431 -0.0522717398
essay_length
0.0004993224
attr(,"offset")
[1] -1.197806
Selection frequencies:
fulfillment_labor_materials teacher_teach_for_americat
0.24 0.15
essay_length school_chartert
0.15 0.09
teacher_prefixMs. resource_typeTechnology
0.08 0.07
eligible_double_your_impact_matcht eligible_almost_home_matcht
0.07 0.07
teacher_prefixMrs. school_stateDC
0.04 0.02
school_stateIL resource_typeBooks
0.01 0.01
I also tried glm, but it said:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor teacher_prefix has new levels
But I don't see any new levels in the teacher_prefix variable:
> levels(training$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
> levels(test$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
Actually, the problems with glmboost and glm are related. There are problems with your teacher_prefix variable.
As the glm example points out, there are levels that are in test that are not in training (kind of). While both factors have the same levels(), the training set has no observations where teacher_prefix=="" but test does. Compare
table(test$teacher_prefix)
table(training$teacher_prefix)
So glm is actually giving the more accurate, helpful error message. The problem is the same with glmboost although it isn't as direct about saying it.
Doing this seemed to "fix" it:
test2 <- subset(test, teacher_prefix %in% c("Dr.","Mr.","Mrs.","Ms."))
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
pred <- predict(model, newdata=test2, type="response")
We keep only the test rows with the listed prefixes, drop the levels that are then unused, and do the standard prediction.
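The same check can be done generically by comparing the values that actually occur, rather than levels(), for every factor. A sketch with made-up data; unseen_levels is a hypothetical helper name, not part of mboost:

```r
# Values of each factor that occur in newdata but never in the training data.
unseen_levels <- function(training, newdata) {
  shared <- intersect(names(training), names(newdata))
  factor_cols <- shared[vapply(training[shared], is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(newdata[[col]])),
            unique(as.character(training[[col]])))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]   # keep only the factors with a mismatch
}

# Same levels() on both sides, but "" is only observed in test.
lv <- c("", "Mr.", "Mrs.")
training <- data.frame(teacher_prefix = factor(c("Mr.", "Mrs."), levels = lv))
test     <- data.frame(teacher_prefix = factor(c("", "Mr."), levels = lv))
res <- unseen_levels(training, test)
print(res)
```

An empty list would mean every factor value in the new data was also observed during training.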

Impute missing data

I have the following dataset:
> str(train)
'data.frame': 4619 obs. of 110 variables:
$ UserID : int 1 2 5 6 7 8 9 11 12 13 ...
$ YOB : int 1938 1985 1963 1997 1996 1991 1995 1983 1984 1997 ...
$ Gender : Factor w/ 3 levels "","Female","Male": 3 2 3 3 3 2 3 3 2 2 ...
$ Income : Factor w/ 7 levels "","$100,001 - $150,000",..: 1 3 6 5 4 7 5 2 4 6 ...
$ HouseholdStatus: Factor w/ 7 levels "","Domestic Partners (no kids)",..: 5 6 5 6 6 6 6 5 5 6 ...
$ EducationLevel : Factor w/ 8 levels "","Associate's Degree",..: 1 8 1 7 4 5 4 3 7 4 ...
$ Party : Factor w/ 6 levels "","Democrat",..: 3 2 1 6 1 1 6 3 6 2 ...
$ Happy : int 1 1 0 1 1 1 1 1 0 0 ...
$ Q124742 : Factor w/ 3 levels "","No","Yes": 2 1 2 1 2 3 1 2 2 1 ...
$ Q124122 : Factor w/ 3 levels "","No","Yes": 1 3 3 3 2 3 1 3 3 1 ...
$ Q123464 : Factor w/ 3 levels "","No","Yes": 2 2 2 3 2 2 1 2 2 1 ...
$ Q123621 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 2 1 1 3 2 1 ...
$ Q122769 : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 1 1 2 2 2 ...
$ Q122770 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 1 1 2 3 3 ...
$ Q122771 : Factor w/ 3 levels "","Private","Public": 3 3 2 2 3 3 1 3 3 3 ...
$ Q122120 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 1 2 2 2 ...
$ Q121699 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 2 3 2 3 3 2 ...
$ Q121700 : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 3 2 2 2 2 ...
$ Q120978 : Factor w/ 3 levels "","No","Yes": 1 3 2 3 3 2 2 3 3 3 ...
$ Q121011 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 3 2 3 2 ...
$ Q120379 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 2 2 2 3 ...
$ Q120650 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 3 ...
$ Q120472 : Factor w/ 3 levels "","Art","Science": 1 3 3 3 3 2 3 3 2 3 ...
$ Q120194 : Factor w/ 3 levels "","Study first",..: 3 2 3 2 2 3 3 3 3 3 ...
$ Q120012 : Factor w/ 3 levels "","No","Yes": 2 3 3 1 2 3 2 2 3 3 ...
$ Q120014 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 3 1 3 3 2 3 ...
$ Q119334 : Factor w/ 3 levels "","No","Yes": 1 3 2 2 2 3 2 3 2 2 ...
$ Q119851 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 2 2 3 2 2 3 ...
$ Q119650 : Factor w/ 3 levels "","Giving","Receiving": 1 2 2 3 2 1 2 2 2 3 ...
$ Q118892 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 2 1 3 2 2 ...
$ Q118117 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 3 1 2 2 2 ...
$ Q118232 : Factor w/ 3 levels "","Idealist",..: 2 2 3 3 3 1 1 2 2 3 ...
$ Q118233 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 3 2 ...
$ Q118237 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 2 ...
$ Q117186 : Factor w/ 3 levels "","Cool headed",..: 1 2 2 2 1 3 1 2 3 1 ...
$ Q117193 : Factor w/ 3 levels "","Odd hours",..: 1 2 3 2 3 3 1 3 3 3 ...
$ Q116797 : Factor w/ 3 levels "","No","Yes": 3 3 2 2 2 1 1 2 2 1 ...
$ Q116881 : Factor w/ 3 levels "","Happy","Right": 2 2 3 3 2 2 1 2 2 1 ...
$ Q116953 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 1 3 3 3 3 1 ...
$ Q116601 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 3 1 3 3 1 ...
$ Q116441 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
$ Q116448 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 1 ...
$ Q116197 : Factor w/ 3 levels "","A.M.","P.M.": 3 2 2 2 2 3 1 2 3 1 ...
$ Q115602 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 1 3 2 1 ...
$ Q115777 : Factor w/ 3 levels "","End","Start": 3 2 3 3 3 3 1 3 2 1 ...
$ Q115610 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 1 1 3 2 1 ...
$ Q115611 : Factor w/ 3 levels "","No","Yes": 2 2 3 3 2 2 1 2 2 1 ...
$ Q115899 : Factor w/ 3 levels "","Circumstances",..: 2 3 3 2 2 3 1 2 3 1 ...
$ Q115390 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 1 2 3 3 2 1 ...
$ Q114961 : Factor w/ 3 levels "","No","Yes": 3 3 2 3 2 3 2 2 3 1 ...
$ Q114748 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 3 3 2 3 1 ...
$ Q115195 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 1 ...
$ Q114517 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 2 2 2 2 3 1 ...
$ Q114386 : Factor w/ 3 levels "","Mysterious",..: 1 3 3 2 2 3 3 3 3 1 ...
$ Q113992 : Factor w/ 3 levels "","No","Yes": 3 1 3 2 2 2 2 2 3 1 ...
$ Q114152 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 2 2 2 2 1 ...
$ Q113583 : Factor w/ 3 levels "","Talk","Tunes": 2 3 2 3 3 3 3 2 3 1 ...
$ Q113584 : Factor w/ 3 levels "","People","Technology": 3 2 2 3 2 1 3 2 2 1 ...
$ Q113181 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 2 2 1 ...
[list output truncated]
As you can see, I have 110 variables. I am trying to build a predictive model to predict happiness using these variables. If I leave them in factor form, CART models, randomForest etc. struggle, so I'm trying to convert them into a vectorised or numeric type (to make the algorithm's life a bit easier)...
Currently I am doing it one by one e.g.:
> table(train_new$Q117193)
Odd hours Standard hours
1410 1299 1910
> train_new$Q117193 = as.integer(train_new$Q117193)
> table(train_new$Q117193)
1 2 3
1410 1299 1910
You can notice that almost all the factor variables have missing values denoted by "".
I have converted this dataset to numeric using:
train_numeric$Gender = as.integer(train_numeric$Gender)
train_numeric[,grep(pattern="^Q1",colnames(train_numeric))] = lapply(train_numeric[,grep(pattern="^Q1",colnames(train_numeric))],as.integer)
I am using the mice package to impute this dataset... I am lost to be honest. Any ideas how I could fill these missing values please?
It seems that you are converting factor variables (like Gender) to integers. as.integer() on a factor only returns the internal level codes, so the blank level "" becomes 1 instead of being treated as missing.
To replace all blank values ("") with NA in your data frame train, you could do something like
train[train == ""] <- NA
You can also handle this while importing the file. Assuming you imported a CSV file, the code for that would be
dataset <- read.csv(file = "file location", sep = ",", header = TRUE, na.strings = c("", "NA"))
This reads blank entries in categorical variables as NA.
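For data that is already loaded, the blank level can be turned into a real NA and removed from the factor levels in one pass. A sketch on made-up data (blank_to_na is a hypothetical helper name); after this step the columns hold proper missing values, which is what an imputation package such as mice looks for.

```r
# Turn "" into NA in every factor column and drop the unused "" level.
blank_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) {
      col[which(col == "")] <- NA   # blank entries become missing
      col <- droplevels(col)        # remove the now-unused "" level
    }
    col
  })
  df
}

# Hypothetical miniature of train.
train <- data.frame(
  YOB    = c(1938L, 1985L, 1963L),
  Gender = factor(c("", "Female", "Male")),
  Party  = factor(c("Democrat", "", "Democrat"))
)
train_na <- blank_to_na(train)

# Missing values per column after the conversion.
print(colSums(is.na(train_na)))
```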
