group by and sum not working as expected in R

Hi, I have a simple data frame with this structure:
> str(allvalues)
'data.frame': 150 obs. of 8 variables:
$ seriesId : Factor w/ 1 level "2021-02-28T00:00:00Z": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastPoint : Factor w/ 30 levels "790","791","792",..: 1 2 3 4 5 6 7 8 9 10 ...
$ rowId : Factor w/ 30 levels "2021-03-01T00:00:00.000000Z",..: 1 2 3 4 5 6 7 8 9 10 ...
$ timestamp : Factor w/ 65 levels "1842.6640625",..: 7 8 9 11 14 4 1 16 12 18 ...
$ predictionValues: Factor w/ 1 level "total_visits (actual)": 1 1 1 1 1 1 1 1 1 1 ...
$ forecastDistance: Factor w/ 30 levels "1","10","11",..: 1 12 23 25 26 27 28 29 30 2 ...
$ prediction : num 2111 2130 2258 2276 2298 ...
$ scenario : Factor w/ 5 levels "0 0 10 10 10",..: 4 4 4 4 4 4 4 4 4 4 ...
I want to group by "scenario" and sum "prediction",
but when I use
> allvalues %>% group_by(scenario) %>% summarise(cond_disp = sum(prediction))
cond_disp
1 351940.8
the result is not grouped by scenario. There should be 5 rows, one per scenario with its sum.
Any help on what I am doing wrong?
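A single-row result like this is a common symptom of plyr being loaded after dplyr, so that plyr::summarise masks dplyr::summarise and ignores the grouping. The post does not confirm that this is the cause here, so treat the following as a sketch under that assumption. The small `allvalues` data frame below is a made-up stand-in for the real one.

```r
library(dplyr)

# Hypothetical stand-in for the real allvalues data frame
allvalues <- data.frame(
  scenario   = factor(rep(c("a", "b"), each = 3)),
  prediction = c(1, 2, 3, 10, 20, 30)
)

# Namespace-qualified calls cannot be masked by another package's summarise()
result <- allvalues %>%
  dplyr::group_by(scenario) %>%
  dplyr::summarise(cond_disp = sum(prediction))

# Package-free alternative in base R: one row per scenario with the summed prediction
result_base <- aggregate(prediction ~ scenario, data = allvalues, FUN = sum)
```

If the qualified call returns one row per scenario while the unqualified one does not, checking `conflicts()` or the package load order would be the next step.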

Related

Correlation with discrete and categoric variables in R

I am analyzing this dataset, which has numeric and factor variables. I would like to know the correlations so I can choose the best variables.
str(data)
$ Ag : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
$ Ay : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
$ Bu : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ Di : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
$ Ed : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
$ Ep : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
$ Em : num [1:1470] 1 2 4 5 7 8 10 11 12 13 ...
$ Ge : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ Ho : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
$ J1 : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
$ J2 : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
When I execute this (although I want correlations for all the data, not only the numeric columns):
cor(data[sapply(data, is.numeric)])
I get this message:
Warning message:
In cor(data[sapply(data, is.numeric)]) :
the standard deviation is zero
It just politely lets you know that you set out to calculate a correlation where one of the variables is constant. This is often pointless.
Just filter those columns out as well:
x1 <- data[sapply(data,is.numeric)]
x2 <- x1[sapply(x1,sd)!=0]
cor(x2)
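To see the filter at work, here is a small self-contained sketch. The data frame is a made-up stand-in that reuses a few column names from the question (Ep is constant, as in the str output). The `na.rm` and `use` arguments are an addition of mine for columns with missing values, not part of the answer above.

```r
# Hypothetical stand-in for `data`; Ep is constant, so its standard deviation is zero
data <- data.frame(
  Ag = c(41, 49, 37, 33, 27),
  Ep = c(1, 1, 1, 1, 1),
  Ho = c(94, 61, 92, 56, 40),
  Ge = factor(c("Female", "Male", "Male", "Female", "Male"))
)

# Keep numeric columns only
x1 <- data[sapply(data, is.numeric)]

# Drop columns whose standard deviation is zero (or undefined)
col_sd <- sapply(x1, sd, na.rm = TRUE)
x2 <- x1[!is.na(col_sd) & col_sd != 0]

# Correlation matrix of the remaining columns, using pairwise complete observations
cor_matrix <- cor(x2, use = "pairwise.complete.obs")
```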

rpart warning message and then no proper rpart plot formed

I tried rpart and got the following warning, and the rpart plot only showed 0:
Warning messages: error was: Setting row names on a tibble is
deprecated.
No proper rpart plot was formed (the generated plot showed only 0). The code and the structure of the training data:
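library(caret)   # train()
library(doSNOW)  # registerDoSNOW(); also attaches snow for makeSOCKcluster()
rpart.oc <- function(seed, training, labels, otrl) {
  # Start a 6-worker socket cluster for the resampling done by train()
  ol <- makeSOCKcluster(6, type = "SOCK")
  registerDoSNOW(ol)
  set.seed(seed)
  fit <- train(x = training, y = labels, method = "rpart", tuneLength = 30, trControl = otrl)
  stopCluster(ol)
  return(fit)
}
# The original call had an unbalanced extra closing parenthesis
rpart.1.cv.1o <- rpart.oc(94622, rpart.train.1o, of.label, otrl.3)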
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1898 obs. of 6 variables:
$ so : Factor w/ 10 levels "",..: 7 7 7 7 7 7 7 7 7 7 ...
$ a : num 63 7 3 45 2 4 69 0 7 0 ...
$ n : Factor w/ 5 levels "","",..: 3 2 2 1 2 3 1 1 2 2 ...
$ s: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ d : Factor w/ 7 levels "Friday","Monday",..: 6 6 7 7 5 5 1 1 1 1 ...
$ c: Factor w/ 7 levels "A","C",..: 6 2 4 2 2 4 1 2 7 2
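The warning mentions row names on a tibble, and the str output shows the training data has class tbl_df. One workaround worth trying is to convert the data to a plain data frame before passing it to train(). This is an assumption on my part, not a confirmed fix from this thread. The sketch below fakes the tibble classes on a tiny made-up data frame so it runs without extra packages.

```r
# Hypothetical stand-in for rpart.train.1o: a data frame carrying the tibble classes
train_tbl <- structure(
  data.frame(a = c(63, 7, 3), s = factor(c("0", "0", "1"))),
  class = c("tbl_df", "tbl", "data.frame")
)

# as.data.frame() strips the tibble classes and leaves a plain data.frame
train_df <- as.data.frame(train_tbl)
class(train_df)
```

With the real data this would be `rpart.train.1o <- as.data.frame(rpart.train.1o)` before calling `rpart.oc()`.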

How to combine Columns in R

http://imgur.com/a/q4IdW "table"
Hi, I have a file of coded complaints (see the link above), and I need to find a way to combine the 4 issue columns (primary issue, secondary issue, etc.) so that I can then sum up all the issues together. A complaint can have multiple issues, which is why it is broken down like this, but for analysis purposes I want to treat all the issue columns as the same. I am very new to R, so please try to speak in terms I'll be able to understand or can google fairly quickly.
> str(mydata)
'data.frame': 136 obs. of 25 variables:
$ ï..Issue.ID : Factor w/ 136 levels "CAO-2017-01",..: 20 21 22 23 24 25 26 27 28 29 ...
$ Reviewer.ID : Factor w/ 1 level "Vinokurov, A": 1 1 1 1 1 1 1 1 1 1 ...
$ Review.Date : Factor w/ 3 levels "6/30/2017","7/14/2017",..: 1 1 1 1 1 2 2 2 2 2 ...
$ CBA.ZIP.CODE : Factor w/ 61 levels "Allentown-Bethlehem-Easton, PA",..: 29 13 24 10 29 13 10 9 47 39 ...
$ Source.of.complaint : Factor w/ 7 levels "Advocate","Beneficiary",..: 7 7 3 7 6 7 2 3 3 3 ...
$ Primary.Issue.Category : Factor w/ 10 levels "Billing, coverage, coordination of benefits",..: 3 8 4 4 4 4 7 4 4 4 ...
$ Secondary.Issue.Category : Factor w/ 15 levels "","ABN issues ",..: 4 1 15 1 15 3 3 15 15 15 ...
$ Third.Issue.Category : Factor w/ 12 levels "","- Error -",..: 1 1 1 1 1 5 1 10 1 1 ...
$ Fourth.Issue.Category : Factor w/ 2 levels "","Low quantity/quality": 1 1 1 1 1 1 1 1 1 1 ...
$ Reviewer.Issue.Notes : logi NA NA NA NA NA NA ...
$ Primary.Equipment.Category : Factor w/ 13 levels "Commode chairs",..: 9 7 2 8 2 9 10 10 2 2 ...
$ Secondary.Equipment.Category : Factor w/ 7 levels "","- Error -",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Third.Equipment.Category : Factor w/ 10 levels "","- Error -",..: 1 1 1 1 1 2 1 2 1 1 ...
$ Fourth.Equipment.Category : logi NA NA NA NA NA NA ...
$ Reviewer.Equipment.Notes : logi NA NA NA NA NA NA ...
$ Primary.Resolution.Category : Factor w/ 16 levels "Beneficiary educated about DMEPOS\n",..: 9 12 15 12 5 14 13 9 5 10 ...
$ Secondary.Resolution.Category: Factor w/ 18 levels "","- Error -",..: 1 1 3 1 4 7 7 17 15 1 ...
$ Third.Resolution.Category : Factor w/ 8 levels "","Beneficiary educated about inquiry ",..: 1 1 1 1 3 1 1 1 1 1 ...
$ Fourth.Resolution.Category : logi NA NA NA NA NA NA ...
$ Reviewer.Resolution.Notes : logi NA NA NA NA NA NA ...
$ Future.Action : Factor w/ 4 levels "no","No","yes",..: 4 4 2 2 2 2 3 4 1 1 ...
$ Coder.1 : Factor w/ 2 levels "Briskin-Limehouse, A",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Coder.1.Coded.Date : Factor w/ 4 levels "6/30/2017","7/13/2017",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Coder.2 : Factor w/ 1 level "Aliu, F": 1 1 1 1 1 1 1 1 1 1 ...
$ Coder.2.Coded.Date : Factor w/ 7 levels "6/30/2017","7/12/2017",..: 1 1 1 1 1 2 3 3 3 3 ...
What I understood is that you have something like this:
issue_1 issue_2 issue_3 issue_4
person1 0 0 0 1
person2 1 1 0 1
person3 1 0 1 1
where 1 means the issue is present and 0 means it is absent, taken from some survey.
Would you like to show something like this?
issue_1 appeared 2x
issue_2 appeared 1x
issue_3 appeared 1x
issue_4 appeared 3x
Could you check and answer again, please?
Please include the output of str(your_data) too, since you can't link the data for us.
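Going by the str(mydata) output in the question, the four issue columns are factors where an empty string means "no issue". A minimal sketch of one way to pool them and count each issue, in base R, could look like this. The column names come from the question; the rows are made up for illustration.

```r
# Hypothetical rows; only the column names are taken from the question
mydata <- data.frame(
  Primary.Issue.Category   = factor(c("Billing", "Delivery", "Delivery")),
  Secondary.Issue.Category = factor(c("", "Billing", "")),
  Third.Issue.Category     = factor(c("", "", "Low quantity/quality")),
  Fourth.Issue.Category    = factor(c("", "", ""))
)

issue_cols <- c("Primary.Issue.Category", "Secondary.Issue.Category",
                "Third.Issue.Category", "Fourth.Issue.Category")

# Stack the four columns into one character vector (factors must be converted first)
all_issues <- unlist(lapply(mydata[issue_cols], as.character), use.names = FALSE)

# Drop the blanks and NAs that mean "no issue in this slot"
all_issues <- all_issues[!is.na(all_issues) & all_issues != ""]

# Count how often each issue appears across all four columns
issue_counts <- sort(table(all_issues), decreasing = TRUE)
issue_counts
```

Converting with as.character() first matters because the columns have different level sets, so combining the factors directly would not give meaningful labels.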

C5.0 decision tree - input string 1 is invalid in this locale

I have read the related questions, but I still cannot solve my problem. My training data does not have missing values, so I don't know what went wrong.
Another problem is that the tree size is 1 and all predicted results are 0 (the label is 0 or 1). I know this is an extremely unbalanced case (the 0 label takes up 98%). How do I solve this?
model_boost<-C5.0(train,train_label)
Error:
c50 code called exit with value 1
Warning message:
In strsplit(Z$output, "\n"): input string 1 is invalid in this locale
training data:
str(train)
'data.frame': 7500 obs. of 148 variables:
$ CI_CUSTYPE : Factor w/ 4 levels "个人","家庭",..: 2 2 2 2 2 2 2 2 1 2 ...
$ CI_COUNTRY_FLAG : Factor w/ 3 levels "1","2","3": 3 2 3 2 2 2 2 2 2 1 ...
$ CI_AGE : int -1 44 31 53 58 -1 -1 46 43 61 ...
$ CI_GENDER : Factor w/ 3 levels "男","女","未知": 3 1 1 2 2 3 3 2 2 1 ...
$ CI_CITY : Factor w/ 21 levels "阿坝","巴中",..: 16 18 9 3 3 4 5 1 3 19 ...
$ CI_TENURE : int 4 44 205 92 92 26 9 110 24 48 ...
$ IS_DUAL_MODE : Factor w/ 4 levels "0","1","2","3": 2 2 2 1 2 1 4 4 4 2 ...
$ PD_CDMA_PAYMODE : Factor w/ 2 levels "1","2": 2 1 2 2 2 1 1 2 1 1 ...
$ PD_CDMA_TENURE : int 49 43 64 39 19 36 8 52 15 47 ...
$ VO_MOU_TOTAL_AVG : int 9520 344 2287 253 460 249 3 885 623 457 ...
train_label
str(train_label)
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 .
print(head(train_label))
[1] 0 0 0 0 0 0
Levels: 0 1
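The warning comes from strsplit() on the text returned by the C5.0 code, and several factors in train have non-ASCII (Chinese) levels. One plausible cause, not confirmed anywhere in this thread, is that those levels end up in the output and cannot be handled in the current locale. A sketch of a workaround under that assumption is to replace non-ASCII levels with ASCII placeholders before fitting. `has_non_ascii` and `recode_levels_ascii` are hypothetical helper names, and the tiny `train` below is a stand-in.

```r
# TRUE for strings that contain any byte outside the ASCII range
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: give factors with non-ASCII levels ASCII placeholder levels
recode_levels_ascii <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]]) && any(has_non_ascii(levels(df[[col]])))) {
      levels(df[[col]]) <- paste0(col, "_", seq_along(levels(df[[col]])))
    }
  }
  df
}

# Hypothetical stand-in for train, written with \u escapes to stay encoding-safe
train <- data.frame(
  CI_GENDER = factor(c("\u7537", "\u5973", "\u7537")),
  CI_AGE    = c(44L, 31L, 53L)
)

train_ascii <- recode_levels_ascii(train)
levels(train_ascii$CI_GENDER)
```

Keep a copy of the original levels if you need to map the placeholders back when reading the tree.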

Error in scale.default: length of 'center' must equal the number of columns of 'x'

I am using the mboost package to do some classification. Here is the code:
library('mboost')
load('so-data.rdata')
model <- glmboost(is_exciting~., data=training, family=Binomial())
pred <- predict(model, newdata=test, type="response")
But when predicting, R complains:
Error in scale.default(X, center = cm, scale = FALSE) :
length of 'center' must equal the number of columns of 'x'
The data (training and test) can be downloaded here (7z, zip).
What is the reason for the error, and how do I get rid of it? Thank you.
UPDATE:
> str(training)
'data.frame': 439599 obs. of 24 variables:
$ is_exciting : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_state : Factor w/ 52 levels "AK","AL","AR",..: 15 5 5 23 47 5 44 42 42 5 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 2 1 1 1 2 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 5 5 3 5 6 5 6 6 5 6 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 19 17 18 18 10 4 17 17 18 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 6 5 5 5 5 4 5 5 5 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 28 18 17 19 26 18 18 28 24 25 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 7 5 5 6 8 5 5 7 7 4 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 4 4 2 5 5 2 2 5 5 5 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 2 2 4 2 1 2 2 1 2 1 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 5 5 2 5 5 2 3 2 4 2 ...
$ fulfillment_labor_materials : num 30 35 35 30 30 35 30 35 35 35 ...
$ total_price_excluding_optional_support: num 1274 477 892 548 385 ...
$ total_price_including_optional_support: num 1499 562 1050 645 453 ...
$ students_reached : int 31 20 250 36 19 28 90 21 60 56 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 2 1 2 1 2 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 2 1 1 ...
$ essay_length : int 236 285 194 351 383 273 385 437 476 159 ...
> str(test)
'data.frame': 44772 obs. of 23 variables:
$ school_state : Factor w/ 51 levels "AK","AL","AR",..: 22 35 11 46 5 35 11 28 28 10 ...
$ school_charter : Factor w/ 2 levels "f","t": 1 1 1 1 2 1 1 1 1 1 ...
$ school_magnet : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_year_round : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_nlns : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ school_charter_ready_promise : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_prefix : Factor w/ 6 levels "","Dr.","Mr.",..: 3 5 6 6 3 5 5 5 3 5 ...
$ teacher_teach_for_america : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ teacher_ny_teaching_fellow : Factor w/ 2 levels "f","t": 1 2 1 1 1 1 1 1 1 1 ...
$ primary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 5 16 17 17 18 11 16 17 2 17 ...
$ primary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 2 4 5 5 5 2 4 5 6 5 ...
$ secondary_focus_subject : Factor w/ 28 levels "","Applied Sciences",..: 25 1 19 1 17 9 17 11 1 1 ...
$ secondary_focus_area : Factor w/ 8 levels "","Applied Learning",..: 4 1 6 1 5 6 5 2 1 1 ...
$ resource_type : Factor w/ 7 levels "","Books","Other",..: 5 5 5 2 5 6 4 5 5 4 ...
$ poverty_level : Factor w/ 4 levels "high poverty",..: 1 2 4 4 1 2 2 2 1 2 ...
$ grade_level : Factor w/ 5 levels "","Grades 3-5",..: 4 3 3 5 4 5 5 4 3 5 ...
$ fulfillment_labor_materials : num 30 30 30 30 30 30 30 30 30 30 ...
$ total_price_excluding_optional_support: num 2185 149 1017 156 860 ...
$ total_price_including_optional_support: num 2571 175 1197 183 1012 ...
$ students_reached : int 200 110 10 22 180 51 30 15 260 20 ...
$ eligible_double_your_impact_match : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ eligible_almost_home_match : Factor w/ 2 levels "f","t": 2 1 1 1 1 1 1 1 2 1 ...
$ essay_length : int 221 137 313 243 373 344 304 431 231 173 ...
> summary(model)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = is_exciting ~ ., data = training, family = Binomial())
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.197806
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept) school_stateDC
-0.5250166130 0.0426909965
school_stateIL school_chartert
0.0084191638 0.0729272310
teacher_prefixMrs. teacher_prefixMs.
-0.0181489492 0.0438425925
teacher_teach_for_americat resource_typeBooks
0.2593005345 0.0046126706
resource_typeTechnology fulfillment_labor_materials
-0.0313904871 0.0120086140
eligible_double_your_impact_matcht eligible_almost_home_matcht
-0.0316376431 -0.0522717398
essay_length
0.0004993224
attr(,"offset")
[1] -1.197806
Selection frequencies:
fulfillment_labor_materials teacher_teach_for_americat
0.24 0.15
essay_length school_chartert
0.15 0.09
teacher_prefixMs. resource_typeTechnology
0.08 0.07
eligible_double_your_impact_matcht eligible_almost_home_matcht
0.07 0.07
teacher_prefixMrs. school_stateDC
0.04 0.02
school_stateIL resource_typeBooks
0.01 0.01
I also tried glm, but it said:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor teacher_prefix has new levels
But I don't see any new levels in the teacher_prefix variable:
> levels(training$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
> levels(test$teacher_prefix)
[1] "" "Dr." "Mr." "Mr. & Mrs." "Mrs." "Ms."
Actually, the problems with glmboost and glm are related. Both come from your teacher_prefix variable.
As the glm error points out, there are levels in test that are not in training (kind of). While both factors have the same levels(), the training set has no observations where teacher_prefix=="" but the test set does. Compare
table(test$teacher_prefix)
table(training$teacher_prefix)
So glm is actually giving the more accurate, helpful error message. The problem is the same with glmboost, although it isn't as direct about saying it.
Doing this seemed to "fix" it:
test2 <- subset(test, teacher_prefix %in% c("Dr.","Mr.","Mrs.","Ms."))
test2$teacher_prefix <- droplevels(test2$teacher_prefix)
pred <- predict(model, newdata=test2, type="response")
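We just drop the test rows whose prefix is outside those four values, get rid of the unused levels, and then do the standard prediction. Note that no predictions are produced for the dropped rows.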
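Because levels() can match even when the observed values do not, it can help to compare what actually occurs in each set across all factor columns, not just teacher_prefix. A minimal sketch follows. `unseen_values` is a hypothetical helper name, and the two small data frames are made-up stand-ins for training and test.

```r
# Hypothetical helper: for each factor column shared by both sets, list the
# values observed in test that are never observed in training
unseen_values <- function(training, test) {
  shared <- intersect(names(training), names(test))
  is_fac <- vapply(shared, function(n) {
    is.factor(training[[n]]) && is.factor(test[[n]])
  }, logical(1))
  factor_cols <- shared[is_fac]
  out <- lapply(factor_cols, function(n) {
    setdiff(unique(as.character(test[[n]])), unique(as.character(training[[n]])))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]
}

# Stand-in data: identical levels(), but "" is only observed in test
prefix_levels <- c("", "Mr.", "Mrs.", "Ms.")
training <- data.frame(
  teacher_prefix = factor(c("Mr.", "Mrs.", "Ms."), levels = prefix_levels),
  essay_length   = c(236L, 285L, 194L)
)
test <- data.frame(
  teacher_prefix = factor(c("", "Mr.", "Ms."), levels = prefix_levels),
  essay_length   = c(221L, 137L, 313L)
)

unseen <- unseen_values(training, test)
unseen
```

Any column that shows up in the result is a candidate for the same kind of subsetting and droplevels() treatment before calling predict().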

Resources