polr function in Ordinal Regression - r

I am working on a dataset where Class is my target variable with 3 levels (HIGH, MEDIUM, LOW). I am trying to apply ordinal regression. I split my data into training and testing sets, then ran this command:
polr(Class~., data= training, Hess = TRUE) -> reg
It keeps running and never finishes. I left it for over an hour and the command was still running. I cannot run any other command until it is done, so ultimately I have to terminate my R session.
Why is this command not stopping? Am I missing something? Is there a condition that has to be fulfilled before ordinal regression can be applied?
ind <- sample(2, nrow(realdata), replace = TRUE, prob = c(0.7,0.3))
training <- realdata[ind==1,]
testing <- realdata[ind==2,]
library(MASS)
model <- polr(Class~., data= training, Hess = TRUE)
This command should run so I can get a summary and continue with other commands but I am stuck here.
Here is the structure of my data that I'm working on:
str(realdata)
'data.frame': 4999 obs. of 19 variables:
$ Customer : Factor w/ 4137 levels " Abeera Bajwa",..: 782 3756 3756 3521 2531 2749 782 2260 3386 4048 ...
$ Customer.No : Factor w/ 4294 levels "001-000161-01",..: 1074 1118 1118 1080 1102 1119 1074 1087 1099 1135 ...
$ Shop : Factor w/ 71 levels "Abbotabad","Atriium Perfume Kiosk",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Invoice : int 29810 29824 29829 29846 29800 29802 29808 29809 29826 29837 ...
$ Quantity : int 1 2 1 1 7 2 7 2 4 2 ...
$ Sales : Factor w/ 707 levels "-100","1,000",..: 707 306 394 306 491 306 500 479 403 320 ...
$ Cash.Amt : int 910 2200 2950 2205 4740 2205 4925 4610 3210 2580 ...
$ Credit.Card.Amt : int 0 0 0 0 0 0 0 0 0 0 ...
$ Net.Sales : Factor w/ 1215 levels "1,000","1,003",..: 1212 396 476 397 712 397 734 702 540 436 ...
$ Mens.Wear : Factor w/ 17 levels "0","0\\","1",..: 3 3 1 1 3 1 9 1 1 1 ...
$ Womens.Wear : int 0 1 1 0 2 1 1 1 1 2 ...
$ Kids.Wear : int 0 0 0 1 2 1 3 1 1 0 ...
$ Foot.Wear : int 0 0 0 0 1 0 0 0 1 0 ...
$ Fragrant : int 0 0 0 0 1 0 1 0 1 0 ...
$ Class : Factor w/ 3 levels "H","L","M": 2 3 3 3 1 3 1 1 3 3 ...
$ Date : Factor w/ 36 levels "1/4/2016","1/4/2017",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Month : Factor w/ 12 levels "April","August",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Customer.Address: Factor w/ 1 level "##_#/#,#####################################": 1 1 1 1 1 1 1 1 1 1 ...
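A likely cause, judging only from the str() output above (the asker never confirmed it): Class ~ . pulls in every column, including Customer (4137 levels), Customer.No (4294 levels), Net.Sales (1215 levels) and Sales (707 levels). Each factor expands to one dummy column per level, so polr is asked to optimise over thousands of coefficients. Class is also an unordered factor whose levels sort alphabetically (H, L, M), which is not the intended order. Below is a minimal sketch on simulated data, assuming only a few numeric predictors are wanted; the data frame toy and the rescaled column Cash.k are hypothetical names, not part of the original data.

```r
library(MASS)  # polr(); MASS ships with standard R installations

set.seed(1)
n <- 300

# Hypothetical stand-in for the real data: two numeric predictors and a
# three-level class derived from a noisy latent score.
toy <- data.frame(Quantity = rpois(n, 3),
                  Cash.Amt = round(runif(n, 500, 5000)))
toy$Cash.k <- toy$Cash.Amt / 1000   # rescale so the optimiser sees similar magnitudes
latent <- 0.4 * toy$Quantity + toy$Cash.k + rlogis(n)

# The response must be an ordered factor with levels in their real order.
toy$Class <- cut(latent,
                 breaks = quantile(latent, c(0, 1/3, 2/3, 1)),
                 labels = c("L", "M", "H"),
                 include.lowest = TRUE, ordered_result = TRUE)

# Name the predictors explicitly instead of using Class ~ .
model <- polr(Class ~ Quantity + Cash.k, data = toy, Hess = TRUE)
print(summary(model))
```

On the real data the equivalent first steps would be factor(training$Class, levels = c("L", "M", "H"), ordered = TRUE) and dropping the identifier columns. Sales and Net.Sales appear to be factors only because of the thousands separators ("1,000"), so stripping the commas and converting them to numeric is probably needed as well.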

Related

Access frequencies of an atomic vector in a tibble data frame

I am doing Exploratory Data Analysis on a tibble data frame. I've never used tibbles, so I'm experiencing some difficulties.
My tibble data frame has this structure:
spec_tbl_df [7,397 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X1 : num [1:7397] 9617 12179 9905 5745 10067 ...
$ Administrative : num [1:7397] 5 26 4 3 7 16 4 3 2 0 ...
$ Administrative_Duration: num [1:7397] 408 1562 58 103 165 ...
$ Informational : num [1:7397] 2 9 2 0 1 3 4 5 0 0 ...
$ Informational_Duration : num [1:7397] 47.5 503.7 28.5 0 28.5 ...
$ ProductRelated : num [1:7397] 54 183 82 25 115 86 75 23 27 33 ...
$ ProductRelated_Duration: num [1:7397] 1547 9676 4729 1109 3428 ...
$ BounceRates : num [1:7397] 0 0.0111 0 0 0 ...
$ ExitRates : num [1:7397] 0.01733 0.0142 0.01454 0.00167 0.01629 ...
$ PageValues : num [1:7397] 0 19.57 9.06 61.3 4.97 ...
$ SpecialDay : num [1:7397] 0 0 0 0 0 0 0 0 0 0 ...
$ Month : Factor w/ 10 levels "Aug","Dec","Feb",..: 8 8 8 1 8 4 8 7 8 8 ...
$ OperatingSystems : Factor w/ 8 levels "1","2","3","4",..: 2 3 2 2 2 3 3 4 8 2 ...
$ Browser : Factor w/ 13 levels "1","2","3","4",..: 2 2 2 2 2 2 2 1 2 5 ...
$ Region : Factor w/ 9 levels "1","2","3","4",..: 3 2 1 6 4 8 1 1 7 3 ...
$ TrafficType : Factor w/ 19 levels "1","2","3","4",..: 2 12 2 5 10 4 2 4 2 1 ...
$ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 1 3 3 3 3 1 3 ...
$ Weekend : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 1 1 ...
$ Revenue : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
If I use plot_bar (from the DataExplorer package) to plot the categorical data, I have no problem. I would like, for example, to create a boxplot for the categorical variable "Month", with one boxplot per month showing how the values are distributed. The problem is that I can't find a way to access the frequencies. If I do the following:
boxplot(Month)
It creates a single boxplot for all the data (all the months), which is not helpful at all.
I would like the months on the x axis, the frequencies on the y axis, and a boxplot for each month.
I've tried to "extract" the Month feature, transform it to a matrix and repeat the process, but it does not work.
Here is the Month variable taken alone:
> summary(x_Month)
Aug Dec Feb Jul June Mar May Nov Oct Sep
258 1034 123 259 166 1125 2014 1814 327 277
What am I missing?
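Something like this would probably work to create a bar plot of the frequencies of Month:
library(ggplot2)
library(dplyr)  # provides the %>% pipe
# spec_tbl_df stands in for your tibble; substitute its actual name
spec_tbl_df %>%
  ggplot(aes(x = Month)) +
  geom_bar()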
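The same frequencies can be reached in base R with table(), which returns one count per factor level. The sketch below rebuilds a Month factor from the counts printed by summary(x_Month) in the question, so it is an illustration rather than the asker's actual data. Note that a boxplot summarises a numeric variable, so one box per month only makes sense for a numeric column split by Month, for example boxplot(PageValues ~ Month, data = df), where df is a hypothetical name for the tibble.

```r
# Rebuild a Month factor with the counts reported by summary(x_Month).
counts <- c(Aug = 258, Dec = 1034, Feb = 123, Jul = 259, June = 166,
            Mar = 1125, May = 2014, Nov = 1814, Oct = 327, Sep = 277)
Month <- factor(rep(names(counts), times = counts), levels = names(counts))

# table() gives the frequency of each level; barplot() draws one bar per level.
freq <- table(Month)
print(freq)
barplot(freq, xlab = "Month", ylab = "Frequency")
```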

How to get F1,Precision and Recall for a Cross Validated Data Set in R

I have two data sets.
train <- read.csv("train.csv")
test <- read.csv("test.csv")
The data in train set look as below.
> str(train)
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The data in test set look as below.
> str(test)
'data.frame': 418 obs. of 11 variables:
$ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
$ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
$ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
I am using a decision tree as my classifier. I want to use 10-fold cross validation to train and evaluate on the train set.
For that I am using the caret package.
library(caret)
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(.cp=0.2)
(train.rpart <- train( Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
+ Embarked,
data=train,
method="rpart",
trControl=tc,
na.action = na.omit,
tuneGrid=rpart.grid))
From here, I am able to get a value for the accuracy of the cross validation.
712 samples
7 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 641, 641, 640, 640, 641, 641, ...
Resampling results:
Accuracy Kappa
0.7794601 0.5334528
Tuning parameter 'cp' was held constant at a value of 0.2
My question is how to find precision, recall and F1 for the 10-fold cross validated data set in a similar manner?
A plain read.csv reads the survival outcome as an integer, which leads rpart to perform regression rather than classification, so recode it as a factor first.
Evaluation metrics such as precision, recall, and F1 are available from caret's confusionMatrix function when it is called with mode = "everything".
library(caret)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(.cp=0.2)
# Convert variable interpreted as integer to factor
train$Survived <- as.factor(train$Survived)
(train.rpart <- train( Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
+ Embarked,
data=train,
method="rpart",
trControl=tc,
na.action = na.omit,
tuneGrid=rpart.grid))
# Predict
pred <- predict(train.rpart, train)
# Produce confusion matrix from prediction and data used for training
cf <- confusionMatrix(pred, train.rpart$trainingData$.outcome, mode = "everything")
print(cf)
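Note that the confusion matrix above is built from predictions on the training rows themselves, not on the held-out folds. For a cross-validated estimate, caret can keep the held-out predictions if trainControl is given savePredictions = "final"; they are then stored in train.rpart$pred with columns pred and obs, which can be passed to confusionMatrix in the same way. The metrics themselves are simple to compute by hand. A minimal base R sketch, where prf is a hypothetical helper and the label vectors are made up:

```r
# Precision, recall and F1 for one class, from observed and predicted labels.
prf <- function(obs, pred, positive) {
  tp <- sum(pred == positive & obs == positive)  # true positives
  fp <- sum(pred == positive & obs != positive)  # false positives
  fn <- sum(pred != positive & obs == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(precision = precision,
    recall    = recall,
    F1        = 2 * precision * recall / (precision + recall))
}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
obs  <- factor(c(1, 1, 1, 0, 0, 0, 1, 0))
pred <- factor(c(1, 1, 0, 0, 0, 1, 1, 0))
res <- prf(obs, pred, positive = "1")
print(res)
```

confusionMatrix treats the first factor level as the positive class unless positive is set, so for Survived it is worth passing positive = "1" explicitly if survival is the class of interest.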

multinom (nnet) invalid type (closure) for variable '(weights)'

I am trying to run multinomial logistic regression on my data of supermarket monthly store data. The data looks like this
'data.frame': 833233 obs. of 22 variables:
$ ProductId : num 105422 105422 143863 170645 397474 ...
$ Brand : num NA NA NA NA NA NA NA NA NA NA ...
$ Supplier : Factor w/ 788 levels "[00000] 武商量贩",..: 1 113 265 154 99 99 99 99 99 99 ...
$ Mode.of.operations : Factor w/ 3 levels "[1] Distribution",..: 1 1 1 3 2 2 2 2 2 2 ...
$ Category : Factor w/ 27 levels "[01] Fuits and Vegetables",..: 5 5 9 1 22 22 22 22 22 22 ...
$ Name : chr "土腊肉" "土腊肉" "佳品红金龙" "野山笋" ...
$ Packaging : Factor w/ 108 levels "1","2","3","4",..: 1 1 96 1 1 1 1 1 1 1 ...
$ Specs : Factor w/ 3477 levels "(1*2)","(16+5)ml",..: 3466 3466 2678 3466 92 92 92 92 92 92 ...
$ Unit : Factor w/ 72 levels "1*1","kg","Kg",..: 2 2 57 18 8 8 8 8 8 8 ...
$ Origin : Factor w/ 370 levels "409","China",..: 15 15 15 15 15 15 15 15 15 15 ...
$ Price : num 73.5 73.5 4.4 0 6.64 ...
$ Sale.quantity : num 0 0 464 0 1 0 6 0 0 0 ...
$ Sale.revenue : num 0 0 2784 0 8 ...
$ Sale.revenue.wo.tax : num 0 0 2141.54 0 5.68 ...
$ Profit.margin : num 0 0 237.95 0 1.16 ...
$ Profit.margin.percentage : num 0 0 0.1 0 0.17 ...
$ Inventory.turnover.days : num 0 0 30.2 0 1007 ...
$ Purchase.amount.wo.tax : num 0 0 0 0 0 0 0 0 0 0 ...
$ Inventory.leftover.value.wo.tax: num 111.14 0.22 1066.15 0 181.61 ...
$ Month : Factor w/ 23 levels "1","2","3","4",..: 17 17 17 17 17 17 17 17 17 17 ...
$ Adjusted.price : num 0 0 6 0 8 0 15.9 0 0 0 ...
$ Wuhan : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
I tried to fit a multinomial logistic regression with
model1 = multinom(Mode.of.operations ~ Category+Wuhan+Inventory.turnover.days+Adjusted.price, data = wushang, na.omit)
but I ended up with the following error:
Error in model.frame.default(formula = Mode.of.operations ~ Category + :
invalid type (closure) for variable '(weights)'
I tried to search for the reason why it's happening but couldn't find anything.
Could someone please help me figure it out?
Thanks
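The error message points at the cause: the third formal argument of multinom is weights, so an unnamed na.omit in third position is taken as the weights, and a function (a closure) is not a valid weight vector. Naming the argument as na.action = na.omit avoids this. A minimal sketch that reproduces and fixes the problem, using the built-in iris data as a stand-in for wushang:

```r
library(nnet)  # multinom(); nnet ships with standard R installations

# An unnamed third argument is matched to `weights`, which triggers the error.
err <- tryCatch(
  multinom(Species ~ Sepal.Length, iris, na.omit, trace = FALSE),
  error = function(e) e
)
print(conditionMessage(err))

# Naming the argument sends na.omit to na.action instead.
fit <- multinom(Species ~ Sepal.Length, data = iris,
                na.action = na.omit, trace = FALSE)
print(coef(fit))
```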
All right, I got your error solved, but then I ran into another one.
You have to pass the arguments as follows:
multinom.glmulti <- function(formula, data,...)
multinom(formula, data=data, maxit=10000,...)
I'm using my own formula with my own terms:
formula_autom = reformulate(variables_autom_0,"clase")
And my own data.
Now we are supposed to get the models:
res <- glmulti(formula_autom, data=clase_training,
level=4, fitfunction=multinom.glmulti, crit="aicc", confsetsize=100,na.action=na.omit)
But I got an error:
Error in nobs.default(object) : no 'nobs' method is available

Error in `row.names<-.data.frame using mlogit in R language

Here are the steps I'm following to fit a multinomial logit regression.
> z<-read.table("2008 Racedata.txt", header=TRUE, sep="\t", row.names=NULL)
> head(z)
datekey raceno horseno place winner draw winodds log_odds jwt hwt
1 2008091501 1 8 1 1 2 12.0 2.484907 128 1170
2 2008091501 1 11 2 0 3 8.6 2.151762 123 1135
3 2008091501 1 6 3 0 5 7.0 1.945910 127 1114
4 2008091501 1 12 4 0 10 23.0 3.135494 123 1018
5 2008091501 1 14 5 0 4 11.0 2.397895 113 1027
6 2008091501 1 5 6 0 14 50.0 3.912023 131 972
> x<-mlogit.data(z,choice="winner",shape="long",id.var="datekey",alt.var="horseno")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.8", "1.11", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10.2’, ‘10.4’, ‘10.8’,
‘100.7’, ‘101.12’, ‘102.1’, ‘102.3’, ‘103.2’, ‘103.4’,
‘103.6’, ‘104.12’, ‘104.3’, ‘104.9’, ‘105.1’, ‘105.5’,
‘105.6’, ‘105.8’, ‘106.11’, ‘106.12’, ‘106.13’, ‘106.7’,
‘107.10’, ‘107.14’, ‘107.3’, ‘108.12’, ‘108.2’, ‘108.6’,
‘108.9’, ‘109.1’, ‘109.14’, ‘109.7’, ‘11.12’, ‘11.5’,
‘11.9’, ‘110.2’, ‘110.3’, ‘110.4’, ‘110.9’, ‘111.1’,
‘111.7’, ‘112.12’, ‘112.3’, ‘112.6’, ‘112.8’, ‘113.10’,
‘113.13’, ‘113.7’, ‘114.12’, ‘114.2’, ‘114.9’, ‘115.10’,
‘115.13’, ‘115.5’, ‘116.11’, ‘116.6’, ‘117.14’, ‘117.3’,
‘117.7’, ‘118.1’, ‘118.13’, ‘118.2’, ‘118.9’, ‘119.10’,
‘119.5’, ‘119.6’, ‘119.8’, ‘12.1’, ‘12.10’, ‘12.3’,
‚Äò12.6‚Äô, ‚Äò120.2‚Äô, ‚Äò120.4‚Äô, ‚Äò120.7‚ [... truncated]
>
What step am I missing here? Why the duplicates in row.names?
Thanks,
Walt
Two problems.
You seem to have an encoding problem, since we are seeing lots of umlauts and accent marks in that error message. Furthermore, I am wondering whether the datekey column got converted to a factor.
In this case the error refers to the construction of the row.names attribute of the new object, x. If you do:
with( z, table( datekey, horseno) )
... you may see a horse with multiple entries on the same day.
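With several hundred datekey values the table is hard to read by eye, so the same check can be done programmatically. A minimal sketch, assuming a small made-up data frame z_toy in place of z:

```r
# Hypothetical stand-in for z: horse 8 appears twice under the first datekey.
z_toy <- data.frame(
  datekey = c(2008091501, 2008091501, 2008091501, 2008091502),
  horseno = c(8, 11, 8, 8)
)

# TRUE for every row whose (datekey, horseno) pair already occurred earlier.
is_dup <- duplicated(z_toy[c("datekey", "horseno")])
print(z_toy[is_dup, ])

# Same check via the contingency table: any cell above 1 is a repeated pair.
has_dup <- any(with(z_toy, table(datekey, horseno)) > 1)
print(has_dup)
```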
Actually there were no duplicate datekey x horseno combos. Converting horseno and datekey to character and then switching the shape argument from "long" to "wide" runs without error and gives this result:
z$datekey <- as.character(z$datekey)
z$horseno <- as.character(z$horseno)
x<-mlogit.data(z,choice="winner",shape="wide",id.var="datekey",alt.var="horseno")
str(x)
#----------
Classes ‘mlogit.data’ and 'data.frame': 18312 obs. of 11 variables:
$ datekey : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
$ raceno : int 1 1 1 1 1 1 1 1 1 1 ...
$ horseno : chr "0" "1" "0" "1" ...
$ place : int 1 1 2 2 3 3 4 4 5 5 ...
$ winner : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ draw : int 2 2 3 3 5 5 10 10 4 4 ...
$ winodds : num 12 12 8.6 8.6 7 7 23 23 11 11 ...
$ log_odds: num 2.48 2.48 2.15 2.15 1.95 ...
$ jwt : int 128 128 123 123 127 127 123 123 113 113 ...
$ hwt : int 1170 1170 1135 1135 1114 1114 1018 1018 1027 1027 ...
$ chid : num 1 1 2 2 3 3 4 4 5 5 ...
- attr(*, "index")='data.frame': 18312 obs. of 3 variables:
..$ chid: Factor w/ 9156 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
..$ alt : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
..$ id : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "choice")= chr "winner"

sum by factor conditional on another factor

I'm working with a data frame of stock information, here is what it looks like:
> str(test)
'data.frame': 211717 obs. of 19 variables:
$ Symbol : Factor w/ 3378 levels "AACC","AACE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ MktCategory : Factor w/ 3 levels "","NNM","SCM": 2 2 2 2 2 2 2 2 2 2 ...
$ TSO : num 37205115 37205115 37205115 37205115 37205115 ...
$ TSO_Date : Factor w/ 200 levels "","1/1/2006",..: 137 137 137 137 137 137 137 137 137 137 ...
$ X.OfMP : int 56 56 56 56 56 56 56 56 56 56 ...
$ MPID : Factor w/ 670 levels "","ABLE","ABNA",..: 608 459 533 618 550 635 307 146 387 482 ...
$ MP_type : Factor w/ 4 levels "","C","M","NR": 2 3 4 3 3 3 3 4 3 4 ...
$ Total_Vol : int 32900 0 2949 758522 41316 706131 29300 16898 362569 1490 ...
$ Total_Rank : int 18 0 35 2 17 3 21 26 5 40 ...
$ Total_Pct : int 0 0 0 14 0 13 0 0 7 0 ...
$ Block_Vol : int 0 0 0 60800 20000 34900 19200 16600 0 0 ...
$ Block_Rank : int 0 0 0 2 6 4 7 9 0 0 ...
$ Block_Pct : int 0 0 0 15 5 9 5 4 0 0 ...
$ YTD_Total_Vol : num 81615 2929 10684 1949230 190874 ...
$ YTD_Total_Rank: int 28 59 44 3 17 5 30 27 12 67 ...
$ YTD_Total_Pct : int 0 0 0 9 0 7 0 0 2 0 ...
$ YTD_Block_Vol : int 0 0 0 197420 80000 390600 60900 73787 55994 0 ...
$ YTD_Block_Rank: int 0 0 0 5 13 3 16 14 17 0 ...
$ YTD_Block_Pct : int 0 0 0 6 3 12 2 2 2 0 ...
So I know how to sum total volume(Total_Vol) by Symbol with the aggregate function:
volbystock<-aggregate(test$Total_Vol,by=list(test$Symbol),FUN=sum)
but I am trying to analyze volume of only a few MPID values. I want to only add the Total_Vol of a Symbol when the MPID is one of the MPIDs in another list. In other words, I only want the Total_Vol of a certain Symbol added if the corresponding MPID is one of the following:
> use_MPID<-c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
Using dplyr you can do something like this:
# load dplyr
library(dplyr)
# create a vector of the MPIDs you are interested in
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
# create a fake dataset just for illustration
# (building it with data.frame() directly keeps TotalVol numeric)
test <- data.frame(Symbol = c("ci", "di", "bi", "bi"),
                   MPID = c("GSCO","LATS","TACT","INCA"),
                   TotalVol = c(35, 110, 201, 435))
# keep only the wanted MPIDs, then sum the volume per Symbol
volbystock <- test %>%
  filter(MPID %in% use_MPID) %>%
  group_by(Symbol) %>%
  summarise(TotalVol = sum(TotalVol))
It looks like you could just subset your data.frame, by using:
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
relevant.symbols <- which(test$MPID %in% use_MPID)
volbystock <- aggregate(test$Total_Vol[relevant.symbols],
by=list(test$Symbol[relevant.symbols]),
FUN=sum)
Does this solve your problem?
edit
Even better, you could use the formula interface of aggregate together with its optional subset argument:
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
volbystock <- aggregate(Total_Vol ~ Symbol, data = test,
                        subset = MPID %in% use_MPID,
                        FUN = sum)
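As a quick sanity check, the index-based and formula-based approaches can be compared on a small made-up data frame. The name toy and its values are hypothetical; only the column names follow the question.

```r
# Made-up data: two symbols, some rows with MPIDs outside the wanted set.
toy <- data.frame(
  Symbol    = c("AACC", "AACC", "AACE", "AACE", "AACE"),
  MPID      = c("GSCO", "ABLE", "LATS", "TACT", "ABNA"),
  Total_Vol = c(100, 50, 10, 20, 999),
  stringsAsFactors = FALSE
)
use_MPID <- c("GSCO", "LATS", "TACT", "INCA", "LQNT", "ITGI")

# Approach 1: pick the matching rows by index, then aggregate.
keep <- which(toy$MPID %in% use_MPID)
by_index <- aggregate(toy$Total_Vol[keep],
                      by = list(toy$Symbol[keep]),
                      FUN = sum)

# Approach 2: formula interface with the subset argument.
by_formula <- aggregate(Total_Vol ~ Symbol, data = toy,
                        subset = MPID %in% use_MPID,
                        FUN = sum)

print(by_formula)
# Both approaches should give the same per-symbol sums.
print(all.equal(by_index$x, by_formula$Total_Vol))
```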
