Here is my data
> str(myData)
'data.frame': 500 obs. of 12 variables:
$ PassengerId: int 1 2 5 6 7 8 9 10 11 12 ...
$ Survived : int 0 1 0 0 0 0 1 1 1 1 ...
$ Pclass : int 3 1 3 3 1 3 3 2 3 1 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 16 559 520 629 417 581 732 96 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 1 1 1 1 ...
$ Age : num 22 38 35 NA 54 2 27 14 4 58 ...
$ SibSp : int 1 1 0 0 0 3 0 1 1 0 ...
$ Parch : int 0 0 0 0 0 1 2 0 1 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 473 276 86 396 345 133 617 39 ...
$ Fare : num 7.25 71.28 8.05 8.46 51.86 ...
$ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA NA 130 NA NA NA 146 50 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 2 3 3 3 1 3 3 ...
I have to generate two results:
1. A table of passenger counts grouped by title and Pclass, like this
2. A table of missing Age counts grouped by title and Pclass, like this
But with what I know so far, both attempts only gave me overall totals, as shown below:
> myData$Name = as.character(myData$Name)
> table_words = table(unlist(strsplit(myData$Name, "\\s+")))
> sort(table_words [grep('\\.',names(table_words))], decreasing=TRUE)
Mr. Miss. Mrs. Master. Dr. Rev. Col. Capt. Countess. Don.
289 99 76 20 5 3 2 1 1 1
L. Mlle. Mme. Sir.
1 1 1 1
> library(stringr)
> tb = cbind(myData$Age, str_match(myData$Name, "[a-zA-Z]+\\."))
> table(tb[is.na(tb[,1]),2])
Dr. Master. Miss. Mr. Mrs.
1 3 18 62 7
Basically, I have to return tables that do not show overall totals as above, but instead show three rows, one per Pclass value (1, 2, and 3 in 'myData'), so that the three rows for a title still add up to its overall total.
For example, the result in image 1 would mean that Capt. occurs only once, under Pclass 1.
How should I break the totals down by Pclass 1, 2, and 3?
It is hard to tell with no data provided (though I think it comes from the Titanic dataset on Kaggle).
I think the first thing to do is to create a new Title factor, since you want to base your analysis on it. I'd do something like:
# Extract title from name and make it a factor
dat$Title <- gsub(".* (.*)\\. .*$", "\\1", as.character(dat$Name))
dat$Title <- factor(dat$Title)
You'll need to check that it works with your data.
Once you have the Title factor you can use ddply from the plyr library and make the first table (grouped by Title and Pclass of each passenger):
library(plyr)
# Number of occurrences
classTitle <- ddply(dat, c('Pclass', 'Title'), summarise,
count=length(Name))
# Convert to wide format
classTitle <- reshape(classTitle, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
classTitle[is.na(classTitle)] <- 0
Almost the same thing for your second requirement (display table of missing age counts grouped by Title and Pclass):
# Number of NA in Age
countNA <- ddply(dat, c('Pclass', 'Title'), summarise,
na=sum(is.na(Age)))
# Convert to wide format
countNA <- reshape(countNA, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
countNA[is.na(countNA)] <- 0
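For comparison, the same two tables can be sketched in base R without plyr. This is only an illustration under stated assumptions: the toy data frame below is made up, and the title regex assumes every name has the form "Surname, Title. Given names", which should be checked against the real data.

```r
# Made-up stand-in for the passenger data (names and values are hypothetical)
dat <- data.frame(
  Name   = c("Smith, Mr. John", "Jones, Mrs. Anna", "Brown, Miss. Eva",
             "Green, Mr. Paul"),
  Pclass = c(3L, 1L, 3L, 3L),
  Age    = c(22, 38, NA, NA),
  stringsAsFactors = FALSE
)

# Assumed name layout: "Surname, Title. Given names"
dat$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", dat$Name)

# Table 1: passenger counts by Title (rows) and Pclass (columns)
title_by_class <- table(dat$Title, dat$Pclass)

# Table 2: number of missing ages by Title and Pclass
# (xtabs sums the left-hand side within each cell; empty cells become 0)
missing_age <- xtabs(as.integer(is.na(Age)) ~ Title + Pclass, data = dat)

print(title_by_class)
print(missing_age)
```

Both results are already in wide format with zeros in empty cells, so no reshape step is needed.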
Related
I have a data frame with this structure :
'data.frame': 1000 obs. of 10 variables:
$ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
$ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
$ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
$ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
$ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
$ Children : int 0 0 0 1 0 0 0 0 3 0 ...
$ History : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
$ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
$ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
and want to make a bar plot with geom_bar() for Age:
Age :
Middle:508
Old :205
Young :287
This is the code I run:
age_plt <- ggplot(data = df, aes(x = Age))
age_plt + geom_bar()
I want ggplot to draw the bars in increasing order of count (first Old, then Young, and Middle last).
How can I add this to my code? Preferably without introducing any other variables, because in the next steps I want to add new features to the same plot (for example, grouping the bars by the Gender column).
Change the level order of the Age factor before passing the data to ggplot:
library(tidyverse)
df %>%
  mutate(Age = fct_relevel(Age, "Old", "Young")) %>%
  ggplot(aes(x = Age)) +
  geom_bar()
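If the level order should follow the counts automatically instead of being typed by hand, one option is to derive the levels from a sorted frequency table. The sketch below uses base R only and rebuilds a toy df from the counts given in the question; it is an illustration, not the only way to do this.

```r
# Toy data reproducing the counts from the question
df <- data.frame(
  Age = factor(rep(c("Middle", "Old", "Young"), times = c(508, 205, 287)))
)

# Sort the level counts in increasing order and reuse that order as the levels
counts <- sort(table(df$Age))
df$Age <- factor(df$Age, levels = names(counts))

levels(df$Age)
```

The releveled column can then go into the same ggplot(aes(x = Age)) + geom_bar() call, and later additions such as fill = Gender keep working because no extra variable is introduced.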
I have created a contingency table with several columns and 510 row categories (factor levels). I want the rows ordered in descending order of their sum across all columns.
I tried converting the table back to a data frame and using rowSums, but with no luck.
Is it possible to sort while using the table function?
DF structure
'data.frame': 2210 obs. of 7 variables:
$ Paddock_ID: num 1 1 1 1 1 1 1 1 1 1 ...
$ Year : num 2010 2011 2011 2012 2012 ...
$ LandUse : chr "Wheat" "Wheat" "Wheat" "Wheat" ...
$ LUT : chr "Cer" "Cer" "Cer" "Cer" ...
$ LUG : chr "Wheat" "Wheat" "Wheat" "Wheat" ...
$ Tmix : Factor w/ 6 levels "6","5","4","3",..: 6 5 6 4 6 5 4 5 6 6 ...
$ combo : Factor w/ 510 levels "","GLYPHOSATE",..: 416 6 59 119 30 22 510 2 2 509
my table
a <- table(DF$"combo", DF$"LUG")
The table comes out fine, but I would like its rows ordered by their sum across all columns, i.e. GLYPHOSATE = 124, then CLETHODIM = 59, then PARAQUAT = 53, and so on, descending for all 510 categories (rows).
                                    Barley Canola Lupin Other Pasture Wheat
GLYPHOSATE                               4     46     6     5      23    40
TRALKOXYDIM                              0      0     0     0       0     8
MCPA; GLYPHOSATE; METSULFURON            0      0     0     0       0     1
METSULFURON                              1      0     0     0       0     1
BUTROXYDIM; METSULFURON                  1      0     0     0       0     0
GLYPHOSATE; METSULFURON; PYRAFLUFEN      0      0     0     0       0     1
PARAQUAT                                 2      7     7     2      28     7
CLETHODIM                                0     41    15     3       0     0
Using an example dataset:
grades <- c(1,1,1,2,2,1,1,2,1,1,1,2,3)
credits <- c(4,4,4,8,4,4,8,4,4,4,8,4,4)
df <- cbind(grades, credits)
You can find the row sums using rowSums().
One possible solution is to add a column of row sums and then sort by it with decreasing = TRUE.
df <- as.data.frame(df)
df$sum <- rowSums(df)
df <- df[order(df$sum, decreasing = TRUE), ]
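The same idea can be applied to the contingency table directly, without converting it to a data frame first. A minimal sketch, using a small made-up DF in place of the real 2210-row one:

```r
# Made-up stand-in for the real data (only the two columns used by table())
DF <- data.frame(
  combo = c("GLYPHOSATE", "GLYPHOSATE", "GLYPHOSATE",
            "PARAQUAT", "PARAQUAT", "CLETHODIM"),
  LUG   = c("Wheat", "Canola", "Wheat", "Pasture", "Wheat", "Canola")
)

a <- table(DF$combo, DF$LUG)

# Reorder the rows of the table by their totals, largest first
a_sorted <- a[order(rowSums(a), decreasing = TRUE), ]

print(a_sorted)
```

Indexing with order() only rearranges the rows, so the cell counts and column layout stay as table() produced them.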
I have two data sets.
train <- read.csv("train.csv")
test <- read.csv("test.csv")
The data in train set look as below.
> str(train)
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The data in test set look as below.
> str(test)
'data.frame': 418 obs. of 11 variables:
$ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
$ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
$ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
I am using a decision tree as my classifier. I want to use 10-fold cross-validation to train and evaluate on the train set.
For that I am using the caret package.
library(caret)
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(.cp=0.2)
(train.rpart <- train( Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
+ Embarked,
data=train,
method="rpart",
trControl=tc,
na.action = na.omit,
tuneGrid=rpart.grid))
From here, I am able to get a value for the accuracy of the cross validation.
712 samples
7 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 641, 641, 640, 640, 641, 641, ...
Resampling results:
Accuracy Kappa
0.7794601 0.5334528
Tuning parameter 'cp' was held constant at a value of 0.2
My question is: how can I get precision, recall, and F1 for the 10-fold cross-validation in a similar manner?
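If the survival outcome is read as an integer (which is what read.csv does with a 0/1 column), rpart performs regression rather than classification, so it is better to recode it as a factor.
Evaluation metrics such as precision, recall, and F1 are available via caret's confusionMatrix function with mode = "everything". Note that the code below scores predictions on the training data itself, not on the held-out folds.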
library(caret)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(.cp=0.2)
# Convert variable interpreted as integer to factor
train$Survived <- as.factor(train$Survived)
(train.rpart <- train( Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
+ Embarked,
data=train,
method="rpart",
trControl=tc,
na.action = na.omit,
tuneGrid=rpart.grid))
# Predict
pred <- predict(train.rpart, train)
# Produce confusion matrix from prediction and data used for training
cf <- confusionMatrix(pred, train.rpart$trainingData$.outcome, mode = "everything")
print(cf)
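To make clear what these three numbers mean, here is a small base R sketch that computes them by hand from a 2x2 confusion table. The obs and pred vectors are made up, and "1" (survived) is treated as the positive class, which is an assumption of this example. With caret, the held-out predictions of each fold can, as far as I know, be kept by passing savePredictions = "final" to trainControl and then read from train.rpart$pred, so the same arithmetic would apply to them.

```r
# Made-up observed and predicted labels; "1" is assumed to be the positive class
obs  <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
pred <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

tab <- table(Predicted = pred, Observed = obs)

tp <- tab["1", "1"]  # predicted 1, observed 1
fp <- tab["1", "0"]  # predicted 1, observed 0
fn <- tab["0", "1"]  # predicted 0, observed 1

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(precision = precision, recall = recall, F1 = f1))
```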
I've used aregImpute to impute the missing values, and then I used the impute.transcan function to try to get a complete dataset, using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), it does give me a data frame, but one that still contains NAs, as if nothing had been imputed. My question is: how do I get a complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line (completed is not defined anywhere in the code shown). Besides, you do not need completed at all.
If you want to get a data.frame from the impute.transcan function, wrap the call in as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.
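A quick base R check can also confirm whether any NAs remain, before or after imputation. The sketch below runs on a tiny made-up data frame (demo is a hypothetical name); on the real data the same calls would be applied to mov.miss or to the imputed result.

```r
# Tiny made-up data frame with one missing value per row
demo <- data.frame(
  age     = c(30, NA, 35),
  balance = c(NA, 4789, 1350),
  job     = c("services", "management", NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(demo))

# Number of rows with no missing value at all
complete_rows <- sum(complete.cases(demo))

print(na_per_column)
print(complete_rows)
```

After a successful imputation, every entry of na_per_column should be 0 and complete_rows should equal the number of rows.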
I have this data: http://www.unige.ch/ses/spo/static/simonhug/madi/Mitchell_et_al_1984.csv
> str(dataset)
'data.frame': 135 obs. of 13 variables:
$ CCode : int 2 20 40 41 42 51 52 70 90 91 ...
$ StateAbb : Factor w/ 130 levels "AFG","ALB","ALG",..: 124 19 28 52 33 62 117 75 49 53 ...
$ StateNme : Factor w/ 130 levels "Afghanistan",..: 122 20 27 51 33 62 116 76 47 52 ...
$ prison_score : Factor w/ 5 levels "never","often",..: 1 1 2 4 5 1 NA 4 5 4 ...
$ torture_score : Factor w/ 5 levels "never","often",..: 1 3 1 4 2 1 NA 2 5 2 ...
$ ht_colonial : Factor w/ 10 levels "0. Never colonized by a Western overseas colonial power",..: 1 1 4 8 4 7 7 4 4 4 ...
$ british : int NA NA 0 0 0 1 1 0 0 0 ...
$ british_colony : Factor w/ 2 levels "no","yes": NA NA 1 1 1 2 2 1 1 1 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
$ region_wb : Factor w/ 19 levels "Australia and New Zealand",..: 10 10 2 2 2 2 2 3 3 3 ...
$ gdppc_l1 : num 25839 23550 10095 1846 4758 ...
$ colonialExperience: chr NA NA "Other Colonial Background" "Other Colonial Background" ...
I have to produce a result similar to this one.
With this code:
# Copy the torture_score in a new col
dataset$torture_score_new = dataset$torture_score
# Add a level to the factor torture_score_new so we can t
levels(dataset$torture_score_new) = c(levels(dataset$torture_score_new), "rarely or never")
### Recode variables
# Torture score
dataset$torture_score_new[dataset$torture_score == "rarely"] = "rarely or never"
dataset$torture_score_new[dataset$torture_score == "never"] = "rarely or never"
dataset$torture_score_new = droplevels(dataset$torture_score_new)
dataset$torture_score_new = ordered(dataset$torture_score_new, levels =c("rarely or never", "somtimes", "often", "very often"))
### Text
dataset$colonialExperience = ifelse(dataset$british_colony == "yes",
"Former British Colony",
"Other Colonial Background")
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
addmargins(round(prop.table(useOfTortureByColonialExperience)*100,2),1)
and get this result
                Former British Colony Other Colonial Background
rarely or never                  9.76                     20.73
somtimes                        10.98                     15.85
often                            6.10                     18.29
very often                      10.98                      7.32
Sum                             37.82                     62.19
But I don't understand how to compute the conditional percentages and get the chi-squared statistic.
(I'm a programmer, but a total newbie to R.)
OK, this is what I ended up doing:
useOfTortureByColonialExperience = table(dataset$torture_score_new, dataset$colonialExperience)
# Get the number of observation
addmargins(useOfTortureByColonialExperience,1);
# Contingency table with conditional probability
useOfTortureByColonialExperienceProp = prop.table(useOfTortureByColonialExperience,2)
print(addmargins(useOfTortureByColonialExperienceProp*100,1),3)
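## Chi-squared test and Cramér's V
# cramersV() is not in base R; it is provided by the lsr package
library(lsr)
chisq.test(useOfTortureByColonialExperience)
cramersV(useOfTortureByColonialExperience)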
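For readers without the lsr package, the whole computation can be sketched in base R. The counts below are an assumption: they were back-calculated from the percentage table in the question, assuming 82 complete observations, and should be replaced by the real table. Cramér's V is computed from its usual definition, sqrt(chi-squared / (n * (min(rows, cols) - 1))).

```r
# Assumed counts, back-calculated from the percentages in the question (n = 82)
counts <- matrix(
  c(8, 9, 5, 9,      # Former British Colony
    17, 13, 15, 6),  # Other Colonial Background
  nrow = 4,
  dimnames = list(
    torture  = c("rarely or never", "sometimes", "often", "very often"),
    colonial = c("Former British Colony", "Other Colonial Background")
  )
)
tab <- as.table(counts)

# Conditional percentages: each column sums to 100
col_pct <- prop.table(tab, margin = 2) * 100
print(round(addmargins(col_pct, 1), 2))

# Chi-squared test of independence
chi <- chisq.test(tab)
print(chi)

# Cramér's V computed from the test statistic
n <- sum(tab)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))
print(cramers_v)
```

Using margin = 2 conditions on colonial background, so the two columns can be compared directly even though the groups differ in size.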