I've used aregImpute to impute the missing values, then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y) it gives me a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not even need the completed object.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
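A quick way to confirm that the result really is complete is to count the remaining NAs per column. The sketch below uses base R only; the small data frame `toy` is a hypothetical stand-in for the imputed data frame, not the poster's data.

```r
# Hypothetical stand-in for the imputed data frame
toy <- data.frame(
  age = c(30, NA, 35),
  job = factor(c("admin.", "services", NA))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with no missing value at all
complete_rows <- sum(complete.cases(toy))
complete_rows
```

If every entry of `na_per_column` is zero for the real imputed data frame, the imputation filled all gaps.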
I have a data frame with this structure :
'data.frame': 1000 obs. of 10 variables:
$ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
$ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
$ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
$ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
$ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
$ Children : int 0 0 0 1 0 0 0 0 3 0 ...
$ History : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
$ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
$ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
and want to make a bar plot with geom_bar() for Age:
Age :
Middle:508
Old :205
Young :287
when I run this code below:
age_plt <- ggplot(data = df, aes(x = Age))
age_plt + geom_bar()
I want ggplot to draw the bars in increasing order of count (first Old, then Young, and last Middle).
How can I add this to my code? (Preferably without creating any other variables, because in the next steps I want to add new features to the same plot, for example grouping by the Gender column.)
Change the factor order for Age before ggplot
library(tidyverse)
df %>%
  mutate(Age = fct_relevel(Age, "Old", "Young")) %>%
  ggplot(aes(x = Age)) +
  geom_bar()
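If you would rather not hard-code the level names, the levels can be ordered by their counts instead. This is a minimal base-R sketch; the data frame below is a hypothetical stand-in built from the counts given in the question.

```r
# Hypothetical stand-in for the Age column, using the counts from the question
df <- data.frame(
  Age = factor(rep(c("Middle", "Old", "Young"), times = c(508, 205, 287)))
)

# sort(table(...)) orders the counts from smallest to largest;
# its names then give the levels in increasing order of frequency
df$Age <- factor(df$Age, levels = names(sort(table(df$Age))))

levels(df$Age)
```

Because only the level order of Age changes, the original ggplot(data = df, aes(x = Age)) + geom_bar() call can stay as it is, and adding a fill aesthetic for Gender later should not disturb the ordering.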
I've read through others who have had a similar issue, but my situation doesn't seem to be the same as the fixes that have been proposed for those other issues. I'm trying to recode a variable using a conditional statement. I want to take a character string & turn it into a numeric so I can subset those observations out into a new data frame. Here's what I have, so far:
blad_mor <- read.csv("blad_mor.csv", header = T)
str(blad_mor)
blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
I get this output for the str() command:
> str(blad_mor)
'data.frame': 127073 obs. of 12 variables:
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
$ sex : Factor w/ 4 levels "1","2","F","M": 1 1 1 2 1 2 2 2 2 2 ...
$ race : Factor w/ 17 levels "America","Asian &",..: 4 4 4 4 4 4 4 4 4 4 ...
$ county : Factor w/ 79 levels "COUNTY1","COUNTY2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cod : Factor w/ 327 levels "C001","C005",..: 89 108 108 294 63 42 172 74 85 269 ...
$ fips : int 1 1 1 1 1 1 1 1 1 1 ...
$ state : int 5 5 5 5 5 5 5 5 5 5 ...
$ race_code : int 2 2 2 2 2 2 2 2 2 2 ...
$ ethnicity : Factor w/ 4 levels "","Hispanic",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ethnicity_code: int NA NA NA NA NA NA NA NA NA NA ...
But when I try the blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod) code I get this error:
> blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
Error in gsub(C670:C679, 29010, blad_mor$cod) : object 'C670' not found
So, I verify that there actually is that object by table(blad_mor$cod) with this being some of the output:
C578 C579 C58 C60 C601 C609 C61 C629 C631 C639 C64 C65 C66 C670 C672 C674 C675 C676
2 43 4 1 1 53 6162 62 1 14 2911 30 47 1 4 1 1 2
C677 C678 C679 C680 C689 C690 C692 C693 C694 C695 C696 C699 C700 C701 C709 C71 C710 C711
1 4 2776 35 77 1 4 5 1 1 8 45 7 3 11 1 29 34
The object 'C670' has one instance as per this output, yet R is telling me it is not there & doesn't run the command. What am I missing here? Should I change the class type from factor to something else? I'm quite confused.
Edit: I have tried quotes around the character strings (e.g. blad_mor_recode <- gsub('C670:C679', '29010', blad_mor$cod)) as well as ifelse(). I still get the same error message.
If you want to change all strings from C70 to C79, you have to use a regex. Something like the following would work:
blad_mor_recode <- gsub("C7[0-9]", "29010", blad_mor$cod)
A simple example:
gsub("C7[0-9]","",c("C60","C70","C78"))
[1] "C60" "" ""
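The error in the question comes from writing C670:C679 without quotes, so R looks for variables with those names rather than matching text. For the codes actually asked about (C670 to C679), the pattern needs to be anchored so that it does not touch other codes. The sketch below assumes the goal is to recode or subset exactly those ten codes; the vector `cod` is a hypothetical stand-in for blad_mor$cod.

```r
# Hypothetical stand-in for blad_mor$cod
cod <- factor(c("C66", "C670", "C672", "C679", "C680", "C700"))

# ^ and $ anchor the pattern to the whole string, so only C670..C679 match
pattern <- "^C67[0-9]$"

# Recode the matching codes; everything else is left unchanged
recoded <- sub(pattern, "29010", as.character(cod))

# Alternatively, subset the matching observations without recoding
bladder <- cod[grepl(pattern, cod)]
```

Converting the factor with as.character() first makes it explicit that the result is a character vector; wrap it in as.numeric() only if every value has been turned into digits.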
I have those data: http://www.unige.ch/ses/spo/static/simonhug/madi/Mitchell_et_al_1984.csv
> str(dataset)
'data.frame': 135 obs. of 13 variables:
$ CCode : int 2 20 40 41 42 51 52 70 90 91 ...
$ StateAbb : Factor w/ 130 levels "AFG","ALB","ALG",..: 124 19 28 52 33 62 117 75 49 53 ...
$ StateNme : Factor w/ 130 levels "Afghanistan",..: 122 20 27 51 33 62 116 76 47 52 ...
$ prison_score : Factor w/ 5 levels "never","often",..: 1 1 2 4 5 1 NA 4 5 4 ...
$ torture_score : Factor w/ 5 levels "never","often",..: 1 3 1 4 2 1 NA 2 5 2 ...
$ ht_colonial : Factor w/ 10 levels "0. Never colonized by a Western overseas colonial power",..: 1 1 4 8 4 7 7 4 4 4 ...
$ british : int NA NA 0 0 0 1 1 0 0 0 ...
$ british_colony : Factor w/ 2 levels "no","yes": NA NA 1 1 1 2 2 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
$ region_wb : Factor w/ 19 levels "Australia and New Zealand",..: 10 10 2 2 2 2 2 3 3 3 ...
$ gdppc_l1 : num 25839 23550 10095 1846 4758 ...
$ colonialExperience: chr NA NA "Other Colonial Background" "Other Colonial Background" ...
And have to create a similar result
With this code
# Copy the torture_score in a new col
dataset$torture_score_new = dataset$torture_score
# Add a level to the factor torture_score_new so we can t
levels(dataset$torture_score_new) = c(levels(dataset$torture_score_new), "rarely or never")
### Recode variables
# Torture score
dataset$torture_score_new[dataset$torture_score == "rarely"] = "rarely or never"
dataset$torture_score_new[dataset$torture_score == "never"] = "rarely or never"
dataset$torture_score_new = droplevels(dataset$torture_score_new)
dataset$torture_score_new = ordered(dataset$torture_score_new, levels =c("rarely or never", "somtimes", "often", "very often"))
### Text
dataset$colonialExperience = ifelse(dataset$british_colony == "yes",
"Former British Colony",
"Other Colonial Background")
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
addmargins(round(prop.table(useOfTortureByColonialExperience)*100,2),1)
and get this result
Former British Colony Other Colonial Background
rarely or never 9.76 20.73
somtimes 10.98 15.85
often 6.10 18.29
very often 10.98 7.32
Sum 37.82 62.19
But I don't understand how to get the conditional proportions and the chi-square statistic.
(I'm a programmer, but a total newbie to R.)
OK, this is what I ended up doing.
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
# Get the number of observation
addmargins(useOfTortureByColonialExperience,1);
# Contingency table with conditional probability
useOfTortureByColonialExperienceProp = prop.table(useOfTortureByColonialExperience,2)
print(addmargins(useOfTortureByColonialExperienceProp*100,1),3)
## Chisq
chisq.test(useOfTortureByColonialExperience)
cramersV(useOfTortureByColonialExperience)
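For readers without the package that provides cramersV, the same quantities can be sketched in base R. The counts below are illustrative only: they were chosen to be consistent with the percentages printed earlier under the assumption of 82 complete cases, and `tab`, `cond` and `v` are hypothetical names.

```r
# Illustrative counts (assumption: 82 complete cases)
tab <- as.table(matrix(
  c(8, 9, 5, 9,      # Former British Colony
    17, 13, 15, 6),  # Other Colonial Background
  nrow = 4,
  dimnames = list(
    torture = c("rarely or never", "somtimes", "often", "very often"),
    colony  = c("Former British Colony", "Other Colonial Background")
  )
))

# Conditional proportions: each column sums to 1
cond <- prop.table(tab, margin = 2)
round(addmargins(cond * 100, margin = 1), 1)

# Pearson chi-square test of independence
test <- chisq.test(tab)

# Cramer's V computed from the chi-square statistic
n <- sum(tab)
v <- sqrt(as.numeric(test$statistic) / (n * (min(dim(tab)) - 1)))
```

Using margin = 2 conditions on the colonial background, which matches the prop.table(..., 2) call above.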
This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome. I am trying to use cross-validation in glmnet (i.e. cv.glmnet) for a binomial target variable. glmnet works fine, but cv.glmnet throws an error. Here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Codes Used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
The credit for the answer goes to #user20650.
Suspect you need to add the family to cv.glmnet as well. An example:
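library(glmnet)
x <- model.matrix(am ~ 0 + . , data=mtcars)
# Without family, cv.glmnet uses the default (gaussian) and fails on a factor response
cv.glmnet(x, y=factor(mtcars$am), alpha=1)
# With family="binomial" the factor response is accepted
cv.glmnet(x, y=factor(mtcars$am), alpha=1, family="binomial")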
One of the variables, 'Cabin', has a hefty amount of NAs. I am trying to use a decision tree (rpart) to predict the Cabin deck of passengers whose Cabin is not available.
Currently, this is the structure of my data table, which is a rbind of the training and test sets.
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
$ FamilyID : Factor w/ 8 levels "11","3","4","5",..: 8 8 8 8 8 8 8 4 2 8 ...
$ FamilyID2 : Factor w/ 7 levels "11","4","5","6",..: 7 7 7 7 7 7 7 3 7 7 ...
$ Title : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ Surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ Cabin2 : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
Please note that I have used strsplit to create 'Cabin2' which has extracted the letter of the 'Cabin' variable, which corresponds to the deck on the Titanic to my understanding. This significantly reduced the number of levels that I was fighting with from 187 with 'Cabin' to 8 with 'Cabin2.'
I am trying to use the following code to predict the cabin deck:
cabinFit <- rpart(Cabin2 ~ Age + Sex + Fare + Embarked + SibSp + Parch + Title + FamilySize + FamilyID,
combi$Cabin2[is.na(combi$Cabin2)] <- predict(cabinFit, combi[is.na(combi$Cabin2),])
The output that I am being thrown by R is as follows:
Warning messages:
1: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L, :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L, :
number of items to replace is not a multiple of replacement length
I am desperately trying to make sense of this as I continue fiddling with these data, however I am coming up short as to why this bit of code doesn't do the trick for me.
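One thing worth checking, offered as a possible cause rather than a confirmed diagnosis: for a classification tree, predict() on an rpart fit returns a matrix of class probabilities by default, not a vector of factor levels, so it cannot be assigned straight into a factor column. Passing type = "class" returns the predicted level for each row. The sketch below uses the built-in iris data as a hypothetical stand-in for combi, with Species playing the role of Cabin2.

```r
library(rpart)

# Hypothetical stand-in: blank out some labels to mimic the missing Cabin2 values
toy <- iris
set.seed(1)
missing_rows <- sample(nrow(toy), 30)
toy$Species[missing_rows] <- NA

# Fit only on the rows where the label is known
known <- !is.na(toy$Species)
fit <- rpart(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = toy[known, ],
  method = "class"
)

# type = "class" gives one predicted level per row instead of a probability matrix
pred <- predict(fit, newdata = toy[!known, ], type = "class")

# Assign as character so the values are matched against the existing levels
toy$Species[!known] <- as.character(pred)
```

The replacement now has exactly one value per missing row, and every value is an existing level of the factor, which are the two conditions the warnings complain about.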