R: why does gbm give NA values on Titanic data?

I have the classic titanic data. Here is the description of the cleaned data.
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
I first split the data.
smp_size <- floor(0.8 * nrow(titanic))  # 80% of rows for training
set.seed(123)
train_ind <- sample(seq_len(nrow(titanic)), size = smp_size)
train <- titanic[train_ind, ]
test <- titanic[-train_ind, ]
Then I changed Survived column to 0 and 1.
train$Survived <- as.factor(ifelse(train$Survived == 'Yes', 1, 0))
test$Survived <- as.factor(ifelse(test$Survived == 'Yes', 1, 0))
Finally, I ran gradient boosting algorithm.
dt_gb <- gbm(Survived ~ ., data = train)
Here are the results.
> print(dt_gb)
gbm(formula = Survived ~ ., data = train)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 0 had non-zero influence.
Since 0 predictors have non-zero influence, the predictions are NA. Why is this the case? Is anything wrong with my code?

Do not convert Survived to a 0/1 factor in the training and test data. Instead, make the Survived column a numeric 0/1 vector.
# e.g. like this
titanic$Survived <- as.numeric(titanic$Survived) - 1
# data should look like this
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
Then fit the model with Bernoulli loss.
dt_gb <- gbm::gbm(formula = Survived ~ ., data = titanic,
distribution = "bernoulli")
> print(dt_gb)
gbm::gbm(formula = Survived ~ ., distribution = "bernoulli",
data = titanic)
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 6 predictors of which 6 had non-zero influence.
Obtain predicted survival probabilities for the first few passengers:
> head(predict(dt_gb, type = "response"))
[1] 0.1200703 0.9024225 0.5875393 0.9271306 0.1200703 0.1200703
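The conversion step is easy to get wrong because as.numeric() on a factor returns the internal level codes (1 and 2), not the labels. A minimal base R sketch on a made-up toy vector (not the actual Titanic data) shows two ways to arrive at a numeric 0/1 response:

```r
# Toy stand-in for the Survived column (hypothetical values, not the real data)
survived <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# as.numeric() on a factor returns the level codes 1 and 2 ...
codes <- as.numeric(survived)

# ... so subtracting 1 maps "No" -> 0 and "Yes" -> 1
y_from_codes <- codes - 1

# Equivalent, and independent of the level order: compare against the label
y_from_label <- as.numeric(survived == "Yes")

print(codes)         # 1 2 2 1
print(y_from_codes)  # 0 1 1 0
print(y_from_label)  # 0 1 1 0
```

Comparing against the label is the more defensive of the two, since it does not depend on which level happens to come first.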

Related

R: adaboost (JOUSBoost package) giving 'Not compatible with requested type'

I have the classic titanic data. Here is the description of the cleaned data.
> str(titanic)
'data.frame': 887 obs. of 7 variables:
$ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 27 54 2 27 14 ...
$ Siblings.Spouses.Aboard: int 1 1 0 1 0 0 0 3 0 1 ...
$ Parents.Children.Aboard: int 0 0 0 0 0 0 0 1 2 0 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
I first split the data.
smp_size <- floor(0.8 * nrow(titanic))
set.seed(123)
train_ind <- sample(seq_len(nrow(titanic)), size = smp_size)
train <- titanic[train_ind, ]
test <- titanic[-train_ind, ]
Then I changed Survived column to -1 and 1.
train$Survived_ab <- ifelse(train$Survived == 'No', -1, 1)
test$Survived_ab <- ifelse(test$Survived == 'No', -1, 1)
Finally, I ran adaboost algorithm using JOUSBoost package.
dt_ab <- adaboost(X = data.matrix(train[,2:7]), y = train$Survived_ab)
Here are the results.
> print(dt_ab)
NULL
NULL
Dependent Variable:
No of trees:100
The weights of the trees are:
Tried to predict using the model.
predict_ab <- predict(dt_ab, data.matrix(test[,2:7]), type = 'response')
And got the error below.
Error in predict_adaboost_(tree_list, coeff_vector, newdata, num_examples, :
Not compatible with requested type: [type=NULL; target=double].
Why is this?

Adding a linear model from another dataset in ggplot

I have a dataset that contains time series information regarding soil elevation from several sampling stations. I have modeled the change in soil elevation over time for each station using ggplot. Now I would like to add a line to my graph that depicts a linear model fit to other geological data over time from a different dataset but I have been unable to do so. I know that I can add the slope and the intercept to my functions manually but I would rather not.
My data is as follows..
str(SETdata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1620 obs. of 6 variables:
$ Observation : num 1 2 3 4 5 6 7 8 9 10 ...
$ Plot_Name : Factor w/ 3 levels "1900-01-01","1900-01-02",..: 1 1 1
1 1 1 1 1 1 1 ...
$ PipeDirectionCode: chr "001°" "001°" "001°" "001°" ...
$ Pin : num 1 2 3 4 5 6 7 8 9 1 ...
$ EventDate : num 0 0 0 0 0 0 0 0 0 0 ...
$ PinHeight_mm : num 221 207 192 220 212 212 206 209 203 222 ...
str(FeldsparData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 540 obs. of 4 variables:
$ Benchmark : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
$ Plot : Factor w/ 12 levels "1a","1b","1c",..: 1 1 1 2 2 2 3 3 3 5
...
$ TotalChange: num 0 0 0 0 0 0 0 0 0 0 ...
$ Day : num 0 0 0 0 0 0 0 0 0 0 ...
The graph I have is
SETdata %>%
ggplot()+
aes(x = EventDate, y = PinHeight_mm, color = Plot_Name, group = Plot_Name)+
stat_summary(fun.y = mean, geom = "point")+
stat_summary(fun.y = mean, geom = "line")
And I would like it to include this line
reg <- lm(TotalChange ~ Day, data = FeldsparData)
My attempts seem to have been thwarted because R does not like that I am using two different datasets.
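One possible approach, sketched here on made-up stand-in data: fit the model, predict over the observed range of Day, and hand the predictions to a layer that has its own `data` argument. The toy data frames, the `line_df` name, and the assumption that `Day` and `EventDate` share the same time scale (and that both series make sense on one y axis) are illustrative assumptions, not taken from the question. The sketch uses `fun = mean`, which assumes ggplot2 3.3.0 or later; older versions use `fun.y` as in the code above.

```r
# Toy stand-ins for the two datasets (hypothetical values)
set.seed(1)
SETdata <- data.frame(
  EventDate    = rep(c(0, 100, 200), each = 6),
  Plot_Name    = rep(c("A", "B"), times = 9),
  PinHeight_mm = rnorm(18, mean = 210, sd = 5)
)
FeldsparData <- data.frame(
  Day         = rep(c(0, 100, 200), each = 4),
  TotalChange = rep(c(0, 2, 4), each = 4) + rnorm(12, sd = 0.2)
)

# Fit the linear model on the second dataset
reg <- lm(TotalChange ~ Day, data = FeldsparData)

# Predict along the observed range of Day, so no coefficient is typed by hand
line_df <- data.frame(Day = range(FeldsparData$Day))
line_df$TotalChange <- predict(reg, newdata = line_df)

# Each ggplot2 layer can take its own data; inherit.aes = FALSE stops this
# layer from looking for Plot_Name and PinHeight_mm inside line_df
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(SETdata,
              aes(x = EventDate, y = PinHeight_mm,
                  color = Plot_Name, group = Plot_Name)) +
    stat_summary(fun = mean, geom = "point") +
    stat_summary(fun = mean, geom = "line") +
    geom_line(data = line_df, aes(x = Day, y = TotalChange),
              inherit.aes = FALSE)
}
```

Calling `print(p)` draws the plot. Because the fitted line comes from `predict()`, refitting `reg` on new data updates the line without further edits.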

How to get F1,Precision and Recall for a Cross Validated Data Set in R

I have two data sets.
train <- read.csv("train.csv")
test <- read.csv("test.csv")
The data in train set look as below.
> str(train)
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358
277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The data in test set look as below.
> str(test)
'data.frame': 418 obs. of 11 variables:
$ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
$ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
$ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210
409 273 414 182 370 85 58 5 104 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
$ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
$ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
$ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
$ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148
139 262 159 85 101 270 ...
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
I am using a decision tree as my classifier. I want to use 10-fold cross validation to train and evaluate the train set.
For that I am using the caret package.
library(caret)
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(.cp=0.2)
(train.rpart <- train( Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
+ Embarked,
data=train,
method="rpart",
trControl=tc,
na.action = na.omit,
tuneGrid=rpart.grid))
From here, I am able to get a value for the accuracy of the cross validation.
712 samples
7 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 641, 641, 640, 640, 641, 641, ...
Resampling results:
Accuracy Kappa
0.7794601 0.5334528
Tuning parameter 'cp' was held constant at a value of 0.2
My question is how to find precision, recall and F1 for the 10-fold cross validated data set in a similar manner?
The current approach reads the survival outcome as an integer, which leads rpart to perform regression rather than classification. It is better to recode it as a factor.
Evaluation metrics such as precision, recall, and F1 are available via the wonderful confusionMatrix function.
library(caret)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(.cp=0.2)
# Convert variable interpreted as integer to factor
train$Survived <- as.factor(train$Survived)
(train.rpart <- train( Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
+ Embarked,
data=train,
method="rpart",
trControl=tc,
na.action = na.omit,
tuneGrid=rpart.grid))
# Predict
pred <- predict(train.rpart, train)
# Produce confusion matrix from prediction and data used for training
cf <- confusionMatrix(pred, train.rpart$trainingData$.outcome, mode = "everything")
print(cf)
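For reference, the three metrics reduce to simple ratios of confusion-matrix counts. A minimal base R sketch on made-up labels, treating "1" as the positive class (the toy vectors and that choice of positive class are assumptions for illustration; with two levels, confusionMatrix uses the first factor level as positive unless its `positive` argument is set):

```r
# Toy predicted and observed labels (hypothetical values)
observed  <- factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1))

# Cross-tabulate predictions (rows) against observations (columns)
tab <- table(predicted, observed)

tp <- tab["1", "1"]  # predicted 1, observed 1
fp <- tab["1", "0"]  # predicted 1, observed 0
fn <- tab["0", "1"]  # predicted 0, observed 1

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(precision = precision, recall = recall, f1 = f1))
```

To base the metrics on the held-out folds rather than on predictions for the training rows, caret can keep the resampled predictions via `trainControl(method = "cv", number = 10, savePredictions = "final")`; they are then stored in `train.rpart$pred`, whose `pred` and `obs` columns can be tabulated the same way.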

how to generate grouped result in R?

Here is my data
> str(myData)
'data.frame': 500 obs. of 12 variables:
$ PassengerId: int 1 2 5 6 7 8 9 10 11 12 ...
$ Survived : int 0 1 0 0 0 0 1 1 1 1 ...
$ Pclass : int 3 1 3 3 1 3 3 2 3 1 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 16 559 520 629 417 581 732 96 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 1 1 1 1 ...
$ Age : num 22 38 35 NA 54 2 27 14 4 58 ...
$ SibSp : int 1 1 0 0 0 3 0 1 1 0 ...
$ Parch : int 0 0 0 0 0 1 2 0 1 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 473 276 86 396 345 133 617 39 ...
$ Fare : num 7.25 71.28 8.05 8.46 51.86 ...
$ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA NA 130 NA NA NA 146 50 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 2 3 3 3 1 3 3 ...
I have to generate 2 results:
1. A table of passenger counts grouped by title and Pclass (image 1)
2. A table of missing Age counts grouped by title and Pclass (image 2)
But when I used what I know, both came out as overall totals, like below:
> myData$Name = as.character(myData$Name)
> table_words = table(unlist(strsplit(myData$Name, "\\s+")))
> sort(table_words [grep('\\.',names(table_words))], decreasing=TRUE)
Mr. Miss. Mrs. Master. Dr. Rev. Col. Capt. Countess. Don.
289 99 76 20 5 3 2 1 1 1
L. Mlle. Mme. Sir.
1 1 1 1
> library(stringr)
> tb = cbind(myData$Age, str_match(myData$Name, "[a-zA-Z]+\\."))
> table(tb[is.na(tb[,1]),2])
Dr. Master. Miss. Mr. Mrs.
1 3 18 62 7
Basically, I need the tables to show the counts broken down by Pclass (1, 2 and 3) rather than the overall totals I got above, so that the three per-class counts for each title still add up to the overall total.
For example, in image 1, Capt. appears only once, under Pclass 1.
How should I split the totals by Pclass 1, 2, and 3?
It is hard to tell with no data provided (though I think that it comes from the Titanic dataset on Kaggle).
I think the first thing to do is to create a new Title factor, since you want to group by it. I'd do something like:
# Extract title from name and make it a factor
dat$Title <- gsub(".* (.*)\\. .*$", "\\1", as.character(dat$Name))
dat$Title <- factor(dat$Title)
You'll need to check that it works with your data.
Once you have the Title factor you can use ddply from the plyr library and make the first table (grouped by Title and Pclass of each passenger):
library(plyr)
# Number of occurrences
classTitle <- ddply(dat, c('Pclass', 'Title'), summarise,
count=length(Name))
# Convert to wide format
classTitle <- reshape(classTitle, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
classTitle[is.na(classTitle)] <- 0
Almost the same thing for your second requirement (display table of missing age counts grouped by Title and Pclass):
# Number of NA in Age
countNA <- ddply(dat, c('Pclass', 'Title'), summarise,
na=sum(is.na(Age)))
# Convert to wide format
countNA <- reshape(countNA, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
countNA[is.na(countNA)] <- 0
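The same two tables can also be sketched in base R with `table()` and `xtabs()`, which avoids the reshape step. The toy data frame below is a made-up stand-in for the passenger data, so the counts are illustrative only, and the title regex (text between the comma and the first period) is an alternative to the one above that likewise needs checking against the real names.

```r
# Toy stand-in for the passenger data (hypothetical rows)
dat <- data.frame(
  Name   = c("Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina", "Allen, Mr. William Henry",
             "Moran, Mr. James", "Palsson, Master. Gosta Leonard"),
  Pclass = c(3, 1, 3, 3, 3, 3),
  Age    = c(22, 38, 26, 35, NA, 2),
  stringsAsFactors = FALSE
)

# Extract the title: the text between the comma and the first period
dat$Title <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", dat$Name)

# Table 1: passenger counts by Title (rows) and Pclass (columns)
classTitle <- table(dat$Title, dat$Pclass)

# Table 2: missing Age counts by Title and Pclass; the 0/1 indicator is
# summed within each Title/Pclass cell, and empty cells come out as 0
countNA <- xtabs(as.integer(is.na(Age)) ~ Title + Pclass, data = dat)

print(classTitle)
print(countNA)
```

Both results are already in wide format, with one row per title and one column per class.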

RMSLE validation between rpart model and test set showing NA and zero

I am seeking a little help as I have hit a wall.
I have trained a model (CART) with a train dataset and am looking to validate the accuracy of the model with RMSLE on a test set.
I have the following:
'data.frame': 5463 obs. of 15 variables:
$ Start_date: chr "2011-01-20 02:00:00" "2011-01-20 05:00:00" "2011-01-20 06:00:00"
$ Season : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1
$ Holiday : Factor w/ 2 levels "Not","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
$ Workingday: int 1 1 1 1 1 1 1 1 1 1 ...
$ Weather : Factor w/ 4 levels "Clear","Cloudy",..: 1 1 1 1 1 2 1 2 2 2..
$ Temp : num 10.66 9.84 9.02 9.02 9.02 ...
$ Humidity : int 56 60 60 55 55 52 48 45 42 45 ...
$ Windspeed : num 0 15 15 15 19 ...
$ Count : num 1 1 1 8 18 6 3 4 5 3 ...
$ Date : chr "2011-01-20" "2011-01-20" "2011-01-20" "2011-01-20" ...
$ Hour : Factor w/ 24 levels "00","01","02",..: 3 6 7 8 9 10 11 12 .
$ Year : chr "2011" "2011" "2011" "2011" ...
$ Month : chr "01" "01" "01" "01" ...
$ Weekday : Factor w/ 7 levels "Friday","Monday",..: 5 5 5 5 5 5 5 5 5 5
$ Hour_Bin : num 0 0 0 0 0 0 0 0 0 0 ...
$ temp_Bin : num 1 1 1 2 2 2 2 2 2 2 ...
$ year_Bin : num 1 1 1 1 1 1 1 1 1 1 ...
The predicted values are a named numeric vector:
Named num [1:5463] 9 9 9 9 9 9 9 9 9 9 ...
- attr(*, "names")= chr [1:5463] "9266" "9267" "9268" "9269" ...
I have used the function:
Evaluate_Model <- function (test, pred) {
return(sqrt(1/nrow(test)*sum((log(pred+1)-log(test$Count+1))^2)))
}
and also tried the Metrics package
library('Metrics')
rmsle(test$Count, pred)
When I try to get the Root Mean Squared Logarithmic Error, I get either 0 or NA.
I have gone through the process of converting the Count variable to different data types, and also tried putting the predictions into a data frame and evaluating them from there.
I have also trained a model with a single attribute and tried to evaluate it, but I am still getting the same result.
My target variable (Count) and the other attributes contain zeros, but these are real values, not NAs.
Is it the training of the algorithm, or the data types?
Any help would be appreciated.
A sample of the model code:
model3 <- rpart(Count~Month+Temp, data = train)
# round prediction
pred <- round(predict(model3, newdata = test))
Evaluate_Model(test, pred)
Thanks in advance.
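A minimal sketch of the RMSLE calculation on made-up vectors, which may help to check the formula in isolation from the model. The `rmsle_check` name and the toy values are assumptions for illustration, not part of the code above. It also shows one generic way an NA result can arise (a missing value in either input) and two quick checks to run on the real vectors.

```r
# RMSLE written with log1p(), which computes log(x + 1)
rmsle_check <- function(actual, pred) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

actual <- c(1, 1, 8, 18, 6)  # toy counts
pred   <- c(2, 1, 9, 15, 6)  # toy predictions

ok <- rmsle_check(actual, pred)

# A single NA in either vector makes the whole result NA
with_na <- rmsle_check(c(actual, NA), c(pred, 3))

# Quick diagnostics worth running on the real vectors
n_missing <- sum(is.na(actual)) + sum(is.na(pred))
same_len  <- length(actual) == length(pred)

print(ok)
print(with_na)  # NA
```

If the toy version behaves but the real call does not, comparing `n_missing` and `same_len` for `test$Count` and `pred` is a cheap first step.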
