C5.0 decision tree - input string 1 is invalid in this locale - r

I have read the related questions, but I still cannot solve my problem. My training data has no missing values, so I don't know what went wrong.
Another problem is that the tree size is 1 and all predicted results are 0 (the label is 0 or 1). I know this is an extremely unbalanced case (the 0 label makes up 98%). How do I solve this?
model_boost<-C5.0(train,train_label)
Error:
c50 code called exit with value 1
Warning message:
In strsplit(Z$output, "\n"): input string 1 is invalid in this locale
training data:
str(train)
'data.frame': 7500 obs. of 148 variables:
$ CI_CUSTYPE : Factor w/ 4 levels "个人","家庭",..: 2 2 2 2 2 2 2 2 1 2 ...
$ CI_COUNTRY_FLAG : Factor w/ 3 levels "1","2","3": 3 2 3 2 2 2 2 2 2 1 ...
$ CI_AGE : int -1 44 31 53 58 -1 -1 46 43 61 ...
$ CI_GENDER : Factor w/ 3 levels "男","女","未知": 3 1 1 2 2 3 3 2 2 1 ...
$ CI_CITY : Factor w/ 21 levels "阿坝","巴中",..: 16 18 9 3 3 4 5 1 3 19 ...
$ CI_TENURE : int 4 44 205 92 92 26 9 110 24 48 ...
$ IS_DUAL_MODE : Factor w/ 4 levels "0","1","2","3": 2 2 2 1 2 1 4 4 4 2 ...
$ PD_CDMA_PAYMODE : Factor w/ 2 levels "1","2": 2 1 2 2 2 1 1 2 1 1 ...
$ PD_CDMA_TENURE : int 49 43 64 39 19 36 8 52 15 47 ...
$ VO_MOU_TOTAL_AVG : int 9520 344 2287 253 460 249 3 885 623 457 ...
train_label
str(train_label)
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
print(head(train_label))
[1] 0 0 0 0 0 0
Levels: 0 1
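
One possible cause, offered as a guess rather than a diagnosis: the strsplit() warning suggests that the text C5.0 prints back contains bytes R cannot interpret in the current locale, and several factors here (CI_CUSTYPE, CI_GENDER, CI_CITY) have Chinese level names. Below is a minimal base R sketch that swaps non-ASCII levels for ASCII placeholders before fitting and keeps a lookup table so the labels can be mapped back. The helper names has_non_ascii and ascii_levels and the toy data are hypothetical.

```r
# Hypothetical helper: TRUE for strings containing any non-ASCII character
has_non_ascii <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))      # code points; NA if the string is invalid
    any(is.na(cp) | cp > 127L)
  }, logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: replace non-ASCII factor levels with ASCII placeholders
ascii_levels <- function(df) {
  lookup <- list()
  for (nm in names(df)) {
    if (!is.factor(df[[nm]])) next
    old <- levels(df[[nm]])
    bad <- has_non_ascii(old)
    if (any(bad)) {
      new <- old
      new[bad] <- paste0(nm, "_", which(bad))
      # remember which placeholder stands for which original label
      lookup[[nm]] <- data.frame(placeholder = new[bad], original = old[bad],
                                 stringsAsFactors = FALSE)
      levels(df[[nm]]) <- new
    }
  }
  list(data = df, lookup = lookup)
}

# Toy data mimicking two columns of `train` (the escapes spell Chinese labels)
toy <- data.frame(
  CI_CUSTYPE      = factor(c("\u4e2a\u4eba", "\u5bb6\u5ead", "\u5bb6\u5ead")),
  CI_COUNTRY_FLAG = factor(c("1", "2", "3"))
)
res <- ascii_levels(toy)
levels(res$data$CI_CUSTYPE)
```

If this is the cause, C5.0(res$data, train_label) should then run without the locale warning. For the 98/2 class imbalance, C5.0() also accepts costs and weights arguments that can penalise errors on the rare class; whether that is enough to grow a tree beyond size 1 is something to test, not a given.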

Related

group by and sum not working as expected in R

Hi, I have a simple data frame with this structure:
> str(allvalues)
'data.frame': 150 obs. of 8 variables:
$ seriesId : Factor w/ 1 level "2021-02-28T00:00:00Z": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastPoint : Factor w/ 30 levels "790","791","792",..: 1 2 3 4 5 6 7 8 9 10 ...
$ rowId : Factor w/ 30 levels "2021-03-01T00:00:00.000000Z",..: 1 2 3 4 5 6 7 8 9 10 ...
$ timestamp : Factor w/ 65 levels "1842.6640625",..: 7 8 9 11 14 4 1 16 12 18 ...
$ predictionValues: Factor w/ 1 level "total_visits (actual)": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastDistance: Factor w/ 30 levels "1","10","11",..: 1 12 23 25 26 27 28 29 30 2 ...
$ prediction : num 2111 2130 2258 2276 2298 ...
$ scenario : Factor w/ 5 levels "0 0 10 10 10",..: 4 4 4 4 4 4 4 4 4 4 ...
and I want to group by "scenario" and sum "prediction"
but when I use
> allvalues %>% group_by(scenario) %>% summarise(cond_disp = sum(prediction))
cond_disp
1 351940.8
It is not grouping by scenario. There should be 5 rows, one per scenario with its sum.
Any help on what I am doing wrong?
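
A single-row result like this usually means a different summarise() is being called, most often plyr::summarise() masking the dplyr version because plyr was loaded after dplyr. That is an assumption about the session, since the question does not show which packages are loaded. The base R sketch below, on made-up toy data, gives a package-free cross-check of what the grouped sums should look like; the commented line shows the namespace-qualified dplyr call.

```r
# Toy stand-in for `allvalues` (values are made up for illustration)
toy <- data.frame(
  scenario   = factor(c("0 0 10 10 10", "0 0 10 10 10", "5 5 5 5 5", "5 5 5 5 5")),
  prediction = c(1, 2, 3, 4)
)

# Base R equivalent of group_by(scenario) followed by summarise(sum(prediction))
sums <- aggregate(prediction ~ scenario, data = toy, FUN = sum)
names(sums)[2] <- "cond_disp"
sums

# With dplyr, qualifying the calls avoids any masking by plyr:
# allvalues %>% dplyr::group_by(scenario) %>% dplyr::summarise(cond_disp = sum(prediction))
```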

Correlation with discrete and categoric variables in R

I am analyzing this dataset, which has numeric and factor variables. I would like to know the correlations so I can choose the best variables.
str(data)
$ Ag : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
$ Ay : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ Bu : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ Di : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
$ Ed : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
$ Ep : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
$ Em : num [1:1470] 1 2 4 5 7 8 10 11 12 13 ...
$ Ge : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ Ho : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
$ J1 : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
$ J2 : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
When I execute this (although I want correlations for all the data, not only the numeric columns):
cor(data[sapply(data, is.numeric)])
I get this message:
Warning message:
In cor(data[sapply(data, is.numeric)]) :
the standard deviation is zero
It just politely lets you know that you set out to calculate a correlation where one of the variables is constant, which is often pointless.
Just filter those out as well:
x1 <- data[sapply(data,is.numeric)]
x2 <- x1[sapply(x1,sd)!=0]
cor(x2)
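
The same filter on a small made-up data frame shows what it removes. The column names mirror the question (Ep is constant there too), but the values are invented for illustration.

```r
toy <- data.frame(
  Ag = c(41, 49, 37, 33),
  Ep = c(1, 1, 1, 1),                      # constant column: sd is 0
  Ho = c(94, 61, 92, 56),
  Ge = factor(c("Female", "Male", "Male", "Female"))
)

x1 <- toy[sapply(toy, is.numeric)]         # keep numeric columns only
x2 <- x1[sapply(x1, sd) != 0]              # drop zero-variance columns
names(x2)                                  # "Ag" "Ho"
cm <- cor(x2)                              # no zero-sd warning now
```

If a column contains NA, sd() returns NA and the comparison breaks, so sapply(x1, sd, na.rm = TRUE) is the safer form on real data.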

Why do the levels of just one variable change after properly combining two dataframes, and how should I deal with it?

I have two dataframes. My first dataframe contains 16 different Lines (genotypes). Because the number of plants differs per line in my experiment, the str() command shows 145 observations and 16 levels for my Line variable, as you can see here:
'data.frame': 145 obs. of 15 variables:
$ Plate.NO. : int 1 1 1 1 1 1 1 1 1 1 ...
$ Line : Factor w/ 16 levels "L000049","L000154",..: 15 15 15 15 15 7 7 7 7 7 ...
$ Strain : Factor w/ 2 levels "AF1","V31-2": 1 1 1 1 1 1 1 1 1 1 …
$ Plant.number: int 1 2 3 4 5 1 2 3 4 5 ...
$ X0DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X7DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X10DPI : num 0 0 0 0 1 0 0 0 2 0 ...
$ X12DPI : num 0.5 0 0 0 2 3 2.5 2.5 2 3 ...
$ X14DPI : num 2.5 1 0 0 2 3 2.5 2.5 2.5 3 ...
$ X17DPI : num 4 1 1 0 3 4 2.5 4 3 3 ...
$ X19DPI : num 4 1 1 1 4 4 2.5 4 3 4 ...
$ X21DPI : num 4 1.5 2 1 4 4 3.5 4 4 4 ...
$ X24DPI : num 4 3 2 1 4 4 4 4 4 4 ...
$ X26DPI : num 4 3 2 1 4 4 4 4 4 4 ...
$ X28DPI : num 4 3.5 2.5 1.5 4 4 4 4 4 4 ...
I also have a second dataframe, which contains complementary information for 252 Lines. Here is the str() result for my second dataframe:
'data.frame': 252 obs. of 7 variables:
$ ID : Factor w/ 252 levels "HM001 ","HM002 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Line : Factor w/ 252 levels "A10","A20","CADL",..: 31 38 175 207 206 169 197 ...
$ Population.of.Origin: Factor w/ 252 levels "A10 ","A17_Varma ",..: 157 167 55 53 51 110 ...
$ Country.of.Origin : Factor w/ 19 levels "Algeria ","Cyprus ",..: 16 2 14 1 1 3 3 1 5 8 ...
$ Category : Factor w/ 16 levels "alfalfa ","CC144 ",..: 7 7 7 7 7 7 7 7 3 3 ...
$ Seeds.From : Factor w/ 6 levels "Charlie_Brummer,UGA ",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Status : Factor w/ 2 levels "Failed.QA","Processed": 2 2 2 2 2 2 2 2 2 2 …
Out of these 252 Lines I used only 16 for part of my experiment, and I want to combine the two dataframes.
The first dataframe object is "Rep1" (the one with only 16 Lines) and the second is called "hap" (the one with 252 Lines).
I used this series of commands:
inner <- inner_join(Rep1, hap, by = "Line")
left <- left_join(Rep1, hap, "Line")
right <- right_join(hap, Rep1, "Line")
The combination takes place without any problem and I get just the rows for my 16 Lines, but surprisingly the str() output shows 252 levels for Line instead of 16, while the number of observations is correct.
Here is the str() output of my dataframe after the combination:
'data.frame': 145 obs. of 21 variables:
$ ID : Factor w/ 252 levels "HM001 ","HM002 ",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Line : Factor w/ 252 levels "A10","A20","CADL",..: 31 31 31 31 31 31 38 38 38 38 ...
$ Population.of.Origin: Factor w/ 252 levels "A10 ","A17_Varma ",..: 157 157 157 157 157 157 167 167 167 167 ...
$ Country.of.Origin : Factor w/ 19 levels "Algeria ","Cyprus ",..: 16 16 16 16 16 16 2 2 2 2 ...
$ Category : Factor w/ 16 levels "alfalfa ","CC144 ",..: 7 7 7 7 7 7 7 7 7 7 ...
$ Seeds.From : Factor w/ 6 levels "Charlie_Brummer,UGA ",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Status : Factor w/ 2 levels "Failed.QA","Processed": 2 2 2 2 2 2 2 2 2 2 ...
$ Plate.NO. : int 2 2 2 5 5 5 1 1 1 1 ...
$ Strain : Factor w/ 2 levels "AF1","V31-2": 1 1 1 2 2 2 1 1 1 1 ...
$ Plant.number : int 1 2 3 1 2 3 1 2 3 4 ...
$ X0DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X7DPI : num 0 0 0 0 0 0 0 0 0 0 ...
$ X10DPI : num 0 0.5 3 3 2 1 1 0.5 0 0 ...
$ X12DPI : num 0 1.5 3 3 3 3 1 3 0 0 ...
$ X14DPI : num 0.5 3 4 3 3.5 4 2.5 3 1 0 ...
$ X17DPI : num 1 4 4 3 4 4 3 4 1.5 0 ...
$ X19DPI : num 1.5 4 4 4 4 4 4 4 1.5 0 ...
$ X21DPI : num 2 4 4 4 4 4 4 4 1.5 0 ...
$ X24DPI : num 2 4 4 4 4 4 4 4 1.5 1 ...
$ X26DPI : num 3 4 4 4 4 4 4 4 1.5 1 ...
$ X28DPI : num 3.5 4 4 4 4 4 4 4 2 1 ...
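
What the output shows is consistent with how factors work in R: a factor keeps its full set of levels even when rows are removed, so the combined Line column can still carry the 252 levels from hap although only 16 of them occur. Here is a minimal sketch on toy data (the values are invented), using base R subsetting in place of the join and droplevels() to discard the unused levels:

```r
# Toy stand-in for `hap` with 4 Lines, of which only 2 are used
hap_toy <- data.frame(
  Line   = factor(c("A10", "A20", "CADL", "L000049")),
  Status = factor(c("Processed", "Processed", "Failed.QA", "Processed"))
)
used <- c("A20", "L000049")

combined <- hap_toy[hap_toy$Line %in% used, ]
nlevels(combined$Line)            # still 4: unused levels are kept

combined <- droplevels(combined)  # drop unused levels in every factor column
nlevels(combined$Line)            # 2
```

On the real data this would be something like right <- droplevels(right), or right$Line <- droplevels(right$Line) to touch only that column.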

Moving the last column to a nth place in R [duplicate]

This question already has answers here:
How does one reorder columns in a data frame?
(12 answers)
Closed 2 years ago.
Good Day
I am trying to move the last column of a dataset to be the third column in a dataframe in R and was wondering what would be the most efficient way to do this.
My DataFrame structure is as follows:
str(HR)
'data.frame': 2940 obs. of 36 variables:
$ EmployeeNumber : int 1 2 3 4 5 6 7 8 9 10 ...
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ Age : int 41 49 37 33 27 32 59 30 38 36 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3
$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
$ Education : int 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
$ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
$ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
$ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
$ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
$ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
$ AttritionB : num 1 0 1 0 0 0 0 0 0 0 ...
I am trying to have AttritionB come after Attrition.
I have tried this code, but it drops the rest of the columns:
HRCorForm = HR[,c(1,2,36:35)]
Kind Regards
Rehaan
This will get all your columns:
HRCorForm = HR[,c(1,2,36,3:35)]
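
To avoid hard-coding 36 and 35, the same reordering can be written against ncol(). A minimal sketch follows; the helper name move_last_to is hypothetical and the toy data frame only stands in for HR.

```r
# Hypothetical helper: move the last column of `df` to position `pos`
move_last_to <- function(df, pos) {
  n <- ncol(df)
  stopifnot(pos >= 1, pos <= n)
  # insert index n into 1..(n-1) so that it lands at position `pos`
  idx <- append(seq_len(n - 1), n, after = pos - 1)
  df[, idx, drop = FALSE]
}

toy <- data.frame(EmployeeNumber = 1:2,
                  Attrition      = c("Yes", "No"),
                  Age            = c(41, 49),
                  AttritionB     = c(1, 0))
names(move_last_to(toy, 3))
```

On the question's data this would be HRCorForm <- move_last_to(HR, 3). If dplyr (1.0.0 or later) is available, relocate(HR, AttritionB, .after = Attrition) expresses the same thing by name.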

Error in scale.default: length of 'center' must equal the number of columns of 'x'

I am using mboost package to do some classification. Here is the code
library('mboost')
load('so-data.rdata')
model <- glmboost(is_exciting~., data=training, family=Binomial())
pred <- predict(model, newdata=test, type="response")
But R complains when doing prediction that
Error in scale.default(X, center = cm, scale = FALSE) :
length of 'center' must equal the number of columns of 'x'
The data (training and test) can be downloaded here (7z, zip).
What is the reason for the error and how do I get rid of it? Thank you.
UPDATE:
> str(training)
'data.frame': 439599 obs. of 24 variables:
$ is_exciting : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_state : Factor w/ 52 levels "AK","AL","AR",..: 15 5 5 23 47 5 44 42 42 5 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 1 1 2 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 5 5 3 5 6 5 6 6 5 6 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 19 17 18 18 10 4 17 17 18 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 6 5 5 5 5 4 5 5 5 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 28 18 17 19 26 18 18 28 24 25 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 7 5 5 6 8 5 5 7 7 4 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 4 4 2 5 5 2 2 5 5 5 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 2 2 4 2 1 2 2 1 2 1 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 5 5 2 5 5 2 3 2 4 2 ...
$ fulfillment_labor_materials : num 30 35 35 30 30 35 30 35 35 35 ...
$ total_price_excluding_optional_support: num 1274 477 892 548 385 ...
$ total_price_including_optional_support: num 1499 562 1050 645 453 ...
$ students_reached : int 31 20 250 36 19 28 90 21 60 56 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 2 1 2 1 2 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 2 1 1 ...
$ essay_length : int 236 285 194 351 383 273 385 437 476 159 ...
> str(test)
'data.frame': 44772 obs. of 23 variables:
$ school_state : Factor w/ 51 levels "AK","AL","AR",..: 22 35 11 46 5 35 11 28 28 10 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 3 5 6 6 3 5 5 5 3 5 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 5 16 17 17 18 11 16 17 2 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 2 4 5 5 5 2 4 5 6 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 25 1 19 1 17 9 17 11 1 1 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 4 1 6 1 5 6 5 2 1 1 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 5 5 5 2 5 6 4 5 5 4 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 1 2 4 4 1 2 2 2 1 2 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 4 3 3 5 4 5 5 4 3 5 ...
$ fulfillment_labor_materials : num 30 30 30 30 30 30 30 30 30 30 ...
$ total_price_excluding_optional_support: num 2185 149 1017 156 860 ...
$ total_price_including_optional_support: num 2571 175 1197 183 1012 ...
$ students_reached : int 200 110 10 22 180 51 30 15 260 20 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 2 1 ...
$ essay_length : int 221 137 313 243 373 344 304 431 231 173 ...
> summary(model)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = is_exciting ~ ., data = training, family = Binomial())
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.197806
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept) school_stateDC
-0.5250166130 0.0426909965
school_stateIL school_chartert
0.0084191638 0.0729272310
teacher_prefixMrs. teacher_prefixMs.
-0.0181489492 0.0438425925
teacher_teach_for_americat resource_typeBooks
0.2593005345 0.0046126706
resource_typeTechnology fulfillment_labor_materials
-0.0313904871 0.0120086140
eligible_double_your_impact_matcht eligible_almost_home_matcht
-0.0316376431 -0.0522717398
essay_length
0.0004993224
attr(,"offset")
[1] -1.197806
Selection frequencies:
fulfillment_labor_materials teacher_teach_for_americat
0.24 0.15
essay_length school_chartert
0.15 0.09
teacher_prefixMs. resource_typeTechnology
0.08 0.07
eligible_double_your_impact_matcht eligible_almost_home_matcht
0.07 0.07
teacher_prefixMrs. school_stateDC
0.04 0.02
school_stateIL resource_typeBooks
0.01 0.01
I also tried glm but it said
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor teacher_prefix has new levels
But I don't see any new levels in the teacher_prefix variable:
> levels(training$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
> levels(test$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
Actually, the problems with glmboost and glm are related. There are problems with your teacher_prefix variable.
As the glm example points out, there are levels that are in test that are not in training (kind of). While both factors have the same levels(), the training set has no observations where teacher_prefix=="" but test does. Compare
table(test$teacher_prefix)
table(training$teacher_prefix)
So glm is actually giving the more accurate, helpful error message. The problem is the same with glmboost although it isn't as direct about saying it.
Doing this seemed to "fix" it
test2 <- subset(test, teacher_prefix %in% c("Dr.","Mr.","Mrs.","Ms."))
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
pred <- predict(model, newdata=test2, type="response")
We just get rid of the unused levels and then do the standard prediction.
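
Rather than comparing table() output column by column, the check can be automated. Below is a minimal base R sketch that lists, for each factor column present in both data frames, the values observed in test that never occur in training. The helper name unseen_levels is hypothetical and the toy data is invented.

```r
# Hypothetical helper: factor values seen in `test` but never in `train`
unseen_levels <- function(train, test) {
  shared <- intersect(names(train), names(test))
  is_fac <- vapply(shared, function(nm) {
    is.factor(train[[nm]]) && is.factor(test[[nm]])
  }, logical(1))
  fac <- shared[is_fac]
  out <- lapply(fac, function(nm) {
    # compare observed values, not levels(), which may list unused levels
    setdiff(unique(as.character(test[[nm]])),
            unique(as.character(train[[nm]])))
  })
  names(out) <- fac
  Filter(length, out)   # keep only columns with at least one unseen value
}

lv <- c("", "Mr.", "Mrs.", "Ms.")
train_toy <- data.frame(teacher_prefix = factor(c("Mr.", "Mrs.", "Ms."), levels = lv))
test_toy  <- data.frame(teacher_prefix = factor(c("", "Mr."), levels = lv))
res <- unseen_levels(train_toy, test_toy)
res
```

On the data from the question, unseen_levels(training, test) would be expected to flag teacher_prefix and possibly other columns; that has not been verified against the real files.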
