One of the variables, 'Cabin', has a hefty amount of NAs. I am trying to use a decision tree (rpart) to predict the Cabin deck of passengers whose Cabin is not available.
Currently, this is the structure of my data table, which is an rbind of the training and test sets.
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
$ FamilyID : Factor w/ 8 levels "11","3","4","5",..: 8 8 8 8 8 8 8 4 2 8 ...
$ FamilyID2 : Factor w/ 7 levels "11","4","5","6",..: 7 7 7 7 7 7 7 3 7 7 ...
$ Title : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ Surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ Cabin2 : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
Please note that I have used strsplit to create 'Cabin2', which holds the leading letter of the 'Cabin' variable; to my understanding, that letter corresponds to the deck on the Titanic. This reduced the number of levels I was fighting with from 187 in 'Cabin' to 8 in 'Cabin2'.
I am trying to use the following code to predict the cabin deck:
cabinFit <- rpart(Cabin2 ~ Age + Sex + Fare + Embarked + SibSp + Parch + Title + FamilySize + FamilyID,
combi$Cabin2[is.na(combi$Cabin2)] <- predict(cabinFit, combi[is.na(combi$Cabin2),])
R gives me the following output:
Warning messages:
1: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L, :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L, :
number of items to replace is not a multiple of replacement length
I am desperately trying to make sense of this as I continue fiddling with these data, but I am coming up short as to why this bit of code doesn't do the trick for me.
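One possible cause, offered as a guess because the rpart call above is cut off: if cabinFit is a classification tree, predict() without a type argument returns a matrix of class probabilities (one column per level) rather than one label per row, which would explain both the length mismatch and the invalid factor levels. The sketch below uses the built-in iris data as a stand-in; the names toy, known, probs and labels are made up for illustration.

```r
library(rpart)

# Toy stand-in for the combined data: a factor target with some NAs
set.seed(1)
toy <- iris
toy$Species[sample(nrow(toy), 30)] <- NA

known <- !is.na(toy$Species)
fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             data = toy[known, ], method = "class")

# Default predict() on a classification tree: a matrix of class probabilities
probs <- predict(fit, toy[!known, ])

# type = "class" returns one factor label per row, which can be assigned back
labels <- predict(fit, toy[!known, ], type = "class")
toy$Species[!known] <- labels
```

Applied to the question, this would mean adding type = "class" to the predict() call, assuming the tree was fitted with Cabin2 as a factor response.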
I am doing Exploratory Data Analysis on a tibble data frame. I've never used a tibble, so I'm experiencing some difficulties.
My tibble data frame has this structure:
spec_tbl_df [7,397 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X1 : num [1:7397] 9617 12179 9905 5745 10067 ...
$ Administrative : num [1:7397] 5 26 4 3 7 16 4 3 2 0 ...
$ Administrative_Duration: num [1:7397] 408 1562 58 103 165 ...
$ Informational : num [1:7397] 2 9 2 0 1 3 4 5 0 0 ...
$ Informational_Duration : num [1:7397] 47.5 503.7 28.5 0 28.5 ...
$ ProductRelated : num [1:7397] 54 183 82 25 115 86 75 23 27 33 ...
$ ProductRelated_Duration: num [1:7397] 1547 9676 4729 1109 3428 ...
$ BounceRates : num [1:7397] 0 0.0111 0 0 0 ...
$ ExitRates : num [1:7397] 0.01733 0.0142 0.01454 0.00167 0.01629 ...
$ PageValues : num [1:7397] 0 19.57 9.06 61.3 4.97 ...
$ SpecialDay : num [1:7397] 0 0 0 0 0 0 0 0 0 0 ...
$ Month : Factor w/ 10 levels "Aug","Dec","Feb",..: 8 8 8 1 8 4 8 7 8 8 ...
$ OperatingSystems : Factor w/ 8 levels "1","2","3","4",..: 2 3 2 2 2 3 3 4 8 2 ...
$ Browser : Factor w/ 13 levels "1","2","3","4",..: 2 2 2 2 2 2 2 1 2 5 ...
$ Region : Factor w/ 9 levels "1","2","3","4",..: 3 2 1 6 4 8 1 1 7 3 ...
$ TrafficType : Factor w/ 19 levels "1","2","3","4",..: 2 12 2 5 10 4 2 4 2 1 ...
$ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 1 3 3 3 3 1 3 ...
$ Weekend : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 1 1 ...
$ Revenue : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
Now if I use plot_bar to plot the categorical data (using the DataExplorer package) I have no problem. I would like, for example, to create a boxplot for the categorical variable "Month" where for each month I have a boxplot showing how values are distributed. The problem is that I can't find a way to access the frequencies. If I do the following:
boxplot(Month)
It creates a single boxplot for all the data (all the months), which is not helpful at all.
I would like the months on the x axis and the frequencies on the y axis and a boxplot for each month.
I've tried to "extract" the feature month, transform it to a matrix and repeat the process but it does not work.
Here is the variable Month taken alone:
> summary(x_Month)
Aug Dec Feb Jul June Mar May Nov Oct Sep
258 1034 123 259 166 1125 2014 1814 327 277
What am I missing?
Something like this would probably work to create barplots for the frequencies of Month:
library(ggplot2)
# spec_tbl_df is a placeholder here; substitute the name of your tibble
ggplot(spec_tbl_df, aes(x = Month)) +
  geom_bar()
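On accessing the frequencies themselves: a minimal base R sketch, assuming a small made-up factor in place of the real Month column. table() returns one count per level, which is also why a boxplot of the counts has little to show (a box needs several numeric values per group).

```r
# Toy stand-in for the Month column (the real data is not reproduced here)
month <- factor(c("May", "Nov", "May", "Mar", "Nov", "May", "Dec"))

# table() is where the frequencies live: one count per level
counts <- table(month)

# The same counts as a two-column data frame (level, n)
counts_df <- as.data.frame(counts, responseName = "n")
print(counts_df)
```

barplot(counts) would draw these counts directly. If the goal is instead the distribution of a numeric column within each month, the formula interface of boxplot(), for example boxplot(PageValues ~ Month, data = ...), draws one box per month.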
I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), it does give me a data frame, but one that still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not even need completed at all.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
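For a quick check that no NAs remain after imputation, base R is enough. This is a minimal sketch on a made-up data frame (df and its columns are hypothetical stand-ins for mov.miss or the imputed result):

```r
# Hypothetical small data frame with gaps
df <- data.frame(
  age     = c(30, NA, 35, 30),
  balance = c(NA, 4789, 1350, 1476),
  loan    = factor(c("no", "yes", NA, "yes"))
)

na_per_column <- colSums(is.na(df))        # number of NAs in each column
print(na_per_column)

any_missing   <- any(is.na(df))            # TRUE while any NA remains
complete_rows <- sum(complete.cases(df))   # rows with no NA at all
```

Running the same three lines on the imputed data frame should give all zeros and FALSE if the imputation filled every gap.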
I have those data: http://www.unige.ch/ses/spo/static/simonhug/madi/Mitchell_et_al_1984.csv
> str(dataset)
'data.frame': 135 obs. of 13 variables:
$ CCode : int 2 20 40 41 42 51 52 70 90 91 ...
$ StateAbb : Factor w/ 130 levels "AFG","ALB","ALG",..: 124 19 28 52 33 62 117 75 49 53 ...
$ StateNme : Factor w/ 130 levels "Afghanistan",..: 122 20 27 51 33 62 116 76 47 52 ...
$ prison_score : Factor w/ 5 levels "never","often",..: 1 1 2 4 5 1 NA 4 5 4 ...
$ torture_score : Factor w/ 5 levels "never","often",..: 1 3 1 4 2 1 NA 2 5 2 ...
$ ht_colonial : Factor w/ 10 levels "0. Never colonized by a Western overseas colonial power",..: 1 1 4 8 4 7 7 4 4 4 ...
$ british : int NA NA 0 0 0 1 1 0 0 0 ...
$ british_colony : Factor w/ 2 levels "no","yes": NA NA 1 1 1 2 2 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
$ region_wb : Factor w/ 19 levels "Australia and New Zealand",..: 10 10 2 2 2 2 2 3 3 3 ...
$ gdppc_l1 : num 25839 23550 10095 1846 4758 ...
$ colonialExperience: chr NA NA "Other Colonial Background" "Other Colonial Background" ...
I have to create a similar result.
With this code
# Copy the torture_score in a new col
dataset$torture_score_new = dataset$torture_score
# Add a level to the factor torture_score_new so we can t
levels(dataset$torture_score_new) = c(levels(dataset$torture_score_new), "rarely or never")
### Recode variables
# Torture score
dataset$torture_score_new[dataset$torture_score == "rarely"] = "rarely or never"
dataset$torture_score_new[dataset$torture_score == "never"] = "rarely or never"
dataset$torture_score_new = droplevels(dataset$torture_score_new)
dataset$torture_score_new = ordered(dataset$torture_score_new, levels =c("rarely or never", "somtimes", "often", "very often"))
### Text
dataset$colonialExperience = ifelse(dataset$british_colony == "yes",
"Former British Colony",
"Other Colonial Background")
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
addmargins(round(prop.table(useOfTortureByColonialExperience)*100,2),1)
and get this result
Former British Colony Other Colonial Background
rarely or never 9.76 20.73
somtimes 10.98 15.85
often 6.10 18.29
very often 10.98 7.32
Sum 37.82 62.19
But I don't understand how to get conditional probabilities and the chi-square statistic.
(I'm a programmer, but a total newbie to R)
OK, this is what I ended up doing.
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
# Get the number of observation
addmargins(useOfTortureByColonialExperience,1);
# Contingency table with conditional probability
useOfTortureByColonialExperienceProp = prop.table(useOfTortureByColonialExperience,2)
print(addmargins(useOfTortureByColonialExperienceProp*100,1),3)
## Chisq
chisq.test(useOfTortureByColonialExperience)
# cramersV() is not part of base R; this assumes it comes from the lsr package
library(lsr)
cramersV(useOfTortureByColonialExperience)
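The difference between the first attempt and this one is the margin argument of prop.table(). A minimal sketch with made-up counts (not the Mitchell et al. data) shows the two variants side by side:

```r
# Made-up counts for illustration only
counts <- matrix(c(10, 20, 30, 40,
                   40, 30, 20, 10),
                 ncol = 2,
                 dimnames = list(score  = c("a", "b", "c", "d"),
                                 colony = c("British", "Other")))
tab <- as.table(counts)

joint  <- prop.table(tab)      # each cell divided by the grand total
by_col <- prop.table(tab, 2)   # each cell divided by its column total

# Chi-square test of independence on the raw counts, not on proportions
test <- chisq.test(tab)
print(round(by_col * 100, 1))
```

With margin = 2 every column sums to 100%, so each column reads as the distribution of the row variable conditional on that column's group.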
This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome. I am trying to use cross-validation in glmnet (i.e. cv.glmnet) for a binomial target variable. glmnet works fine, but cv.glmnet throws an error. Here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Code used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
The credit for the answer goes to @user20650.
I suspect you need to add the family argument to cv.glmnet as well. An example:
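library(glmnet)
x <- model.matrix(am ~ 0 + . , data=mtcars)
# Without family, cv.glmnet uses the gaussian default and fails on a factor response
cv.glmnet(x, y=factor(mtcars$am), alpha=1)
# With family="binomial" the factor response is accepted
cv.glmnet(x, y=factor(mtcars$am), alpha=1, family="binomial")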
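For readers unsure what the model.matrix() step in the question produces, here is a small base R sketch on made-up data (the data frame d and its column names are hypothetical). It only builds the inputs; it does not call glmnet.

```r
# Toy data frame; column names are made up for illustration
d <- data.frame(
  status = factor(c("0", "1", "1", "0")),
  area   = factor(c("Rural", "Urban", "Semiurban", "Rural")),
  income = c(9328, 3333, 14683, 7667)
)

# model.matrix() expands each factor into 0/1 dummy columns;
# [, -1] drops the intercept column
x <- model.matrix(status ~ area + income, data = d)[, -1]

# The response stays a two-level factor, to be paired with family = "binomial"
y <- d$status

print(colnames(x))
```

The predictors end up as a purely numeric matrix, while the response stays a factor; that is why the family has to be stated in both the glmnet and the cv.glmnet call.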
Here are the steps I'm following to do a multinomial logit regression.
> z<-read.table("2008 Racedata.txt", header=TRUE, sep="\t", row.names=NULL)
> head(z)
datekey raceno horseno place winner draw winodds log_odds jwt hwt
1 2008091501 1 8 1 1 2 12.0 2.484907 128 1170
2 2008091501 1 11 2 0 3 8.6 2.151762 123 1135
3 2008091501 1 6 3 0 5 7.0 1.945910 127 1114
4 2008091501 1 12 4 0 10 23.0 3.135494 123 1018
5 2008091501 1 14 5 0 4 11.0 2.397895 113 1027
6 2008091501 1 5 6 0 14 50.0 3.912023 131 972
> x<-mlogit.data(z,choice="winner",shape="long",id.var="datekey",alt.var="horseno")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.8", "1.11", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10.2’, ‘10.4’, ‘10.8’,
‘100.7’, ‘101.12’, ‘102.1’, ‘102.3’, ‘103.2’, ‘103.4’,
‘103.6’, ‘104.12’, ‘104.3’, ‘104.9’, ‘105.1’, ‘105.5’,
‘105.6’, ‘105.8’, ‘106.11’, ‘106.12’, ‘106.13’, ‘106.7’,
‘107.10’, ‘107.14’, ‘107.3’, ‘108.12’, ‘108.2’, ‘108.6’,
‘108.9’, ‘109.1’, ‘109.14’, ‘109.7’, ‘11.12’, ‘11.5’,
‘11.9’, ‘110.2’, ‘110.3’, ‘110.4’, ‘110.9’, ‘111.1’,
‘111.7’, ‘112.12’, ‘112.3’, ‘112.6’, ‘112.8’, ‘113.10’,
‘113.13’, ‘113.7’, ‘114.12’, ‘114.2’, ‘114.9’, ‘115.10’,
‘115.13’, ‘115.5’, ‘116.11’, ‘116.6’, ‘117.14’, ‘117.3’,
‘117.7’, ‘118.1’, ‘118.13’, ‘118.2’, ‘118.9’, ‘119.10’,
‘119.5’, ‘119.6’, ‘119.8’, ‘12.1’, ‘12.10’, ‘12.3’,
‚Äò12.6‚Äô, ‚Äò120.2‚Äô, ‚Äò120.4‚Äô, ‚Äò120.7‚ [... truncated]
>
What step am I missing here? Why the duplicates in row.names?
Thanks,
Walt
Two problems.
You seem to have some problem with encoding, since we are seeing lots of umlauts and accent marks in that error message. Furthermore, I am wondering whether the datekey column got converted into a factor.
In this case it is referring to an error in the construction of the row.names attribute of the new object, x. If you do:
with( z, table( datekey, horseno) )
... you may see a horse with multiple entries on the same day.
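With many races the full cross-table is hard to scan, so it can help to list only the repeated combinations. A minimal sketch on made-up rows (z_toy is hypothetical; the real z is not reproduced here):

```r
# Toy long-format data; rows 3 and 4 repeat the same (datekey, horseno) pair
z_toy <- data.frame(
  datekey = c(2008091501, 2008091501, 2008091502, 2008091502),
  horseno = c(8, 11, 6, 6),
  winner  = c(1, 0, 1, 0)
)

# Row index of the first repeated key pair, or 0 if every pair is unique
first_dup <- anyDuplicated(z_toy[c("datekey", "horseno")])

# Cross-tabulate and keep only the cells that occur more than once
tab  <- with(z_toy, table(datekey, horseno))
dups <- as.data.frame(tab)
dups <- dups[dups$Freq > 1, ]
print(dups)
```

A result of 0 from anyDuplicated() on the real data would rule out repeated datekey and horseno pairs as the source of the duplicate row names.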
Actually, there were no duplicate datekey x horseno combinations. Converting horseno and datekey to character and then switching the shape argument from "long" to "wide" runs without error and gives this result:
z$datekey <- as.character(z$datekey)
z$horseno <- as.character(z$horseno)
x<-mlogit.data(z,choice="winner",shape="wide",id.var="datekey",alt.var="horseno")
str(x)
#----------
Classes ‘mlogit.data’ and 'data.frame': 18312 obs. of 11 variables:
$ datekey : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
$ raceno : int 1 1 1 1 1 1 1 1 1 1 ...
$ horseno : chr "0" "1" "0" "1" ...
$ place : int 1 1 2 2 3 3 4 4 5 5 ...
$ winner : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ draw : int 2 2 3 3 5 5 10 10 4 4 ...
$ winodds : num 12 12 8.6 8.6 7 7 23 23 11 11 ...
$ log_odds: num 2.48 2.48 2.15 2.15 1.95 ...
$ jwt : int 128 128 123 123 127 127 123 123 113 113 ...
$ hwt : int 1170 1170 1135 1135 1114 1114 1018 1018 1027 1027 ...
$ chid : num 1 1 2 2 3 3 4 4 5 5 ...
- attr(*, "index")='data.frame': 18312 obs. of 3 variables:
..$ chid: Factor w/ 9156 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
..$ alt : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
..$ id : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "choice")= chr "winner"