Build Decision Tree Classification - r

I have two datasets, partb_data1 and partb_data2. They contain a sample of a bank's customers, with the characteristics of the clients and whether the bank continues to work with them or not (churn). Exited is the churn indicator (1 if the customer has left the bank, 0 if they continue to work with it). I'm using partb_data1 as the training set and partb_data2 as the test set.
Here is my data:
> dput(head(partb_data1))
structure(list(RowNumber = 1:6, CustomerId = c(15634602L, 15647311L,
15619304L, 15701354L, 15737888L, 15574012L), Surname = c("Hargrave",
"Hill", "Onio", "Boni", "Mitchell", "Chu"), CreditScore = c(619L,
608L, 502L, 699L, 850L, 645L), Geography = c("France", "Spain",
"France", "France", "Spain", "Spain"), Gender = c("Female", "Female",
"Female", "Female", "Female", "Male"), Age = c(42L, 41L, 42L,
39L, 43L, 44L), Tenure = c(2L, 1L, 8L, 1L, 2L, 8L), Balance = c(0,
83807.86, 159660.8, 0, 125510.82, 113755.78), NumOfProducts = c(1L,
1L, 3L, 2L, 1L, 2L), HasCrCard = c(1L, 0L, 1L, 0L, 1L, 1L), IsActiveMember = c(1L,
1L, 0L, 0L, 1L, 0L), EstimatedSalary = c(101348.88, 112542.58,
113931.57, 93826.63, 79084.1, 149756.71), Exited = c(1L, 0L,
1L, 0L, 0L, 1L)), row.names = c(NA, 6L), class = "data.frame")
> dput(head(partb_data2))
structure(list(RowNumber = 8001:8006, CustomerId = c(15629002L,
15798053L, 15753895L, 15595426L, 15645815L, 15632848L), Surname = c("Hamilton",
"Nnachetam", "Blue", "Madukwe", "Mills", "Ferrari"), CreditScore = c(747L,
707L, 590L, 603L, 615L, 634L), Geography = c("Germany", "Spain",
"Spain", "Spain", "France", "France"), Gender = c("Male", "Male",
"Male", "Male", "Male", "Female"), Age = c(36L, 32L, 37L, 57L,
45L, 36L), Tenure = c(8L, 9L, 1L, 6L, 5L, 1L), Balance = c(102603.3,
0, 0, 105000.85, 0, 69518.95), NumOfProducts = c(2L, 2L, 2L,
2L, 2L, 1L), HasCrCard = c(1L, 1L, 0L, 1L, 1L, 1L), IsActiveMember = c(1L,
0L, 0L, 1L, 1L, 0L), EstimatedSalary = c(180693.61, 126475.79,
133535.99, 87412.24, 164886.64, 116238.39), Exited = c(0L, 0L,
0L, 1L, 0L, 0L)), row.names = c(NA, 6L), class = "data.frame")
I have created classification trees in order to predict churn. Here is the code:
library(tidyverse)
library(caret)
library(rpart)
library(rpart.plot)
# Split the data into training and test set
train.data <- head(partb_data1, 500)
test.data <- tail(partb_data2, 150)
# Build the model
modelb <- rpart(Exited ~., data = train.data, method = "class")
# Visualize the decision tree with rpart.plot
rpart.plot(modelb)
# Make predictions on the test data
predicted.classes <- modelb %>%
predict(test.data, type = "class")
head(predicted.classes)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$Exited)
### Pruning the tree :
# Fit the model on the training set
modelb2 <- train(
Exited ~., data = train.data, method = "rpart",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Plot model accuracy vs different values of
# cp (complexity parameter)
plot(modelb2)
# Print the best tuning parameter cp that
# maximizes the model accuracy
modelb2$bestTune
# Plot the final tree model
plot(modelb2$finalModel)
# Make predictions on the test data
predicted.classes <- modelb2 %>% predict(test.data)
# Compute model accuracy rate on test data
mean(predicted.classes == test.data$Exited)
Note: I have made the test set from partb_data2.
Is the procedure I follow right? Must I make any changes in order to accomplish my goal, which is building classification trees? Your help would be truly welcome!
EDITED !!!

Your head(partb_data1$Exited, 500) isn't a data.frame. Because of the $ you extract a single column of partb_data1. The result is only an integer vector, so that can't work.
class(head(partb_data1$Exited, 500))
[1] "integer"

There are always several possible procedures.
But you're right to separate your data into a training set and a test set. It's also possible to use cross-validation instead. You're using cross-validation on your training set; that's normally not necessary, but it is also possible.
I think using your complete data for the cross-validation should also work, but what you're doing isn't wrong.
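As a rough illustration of the train/test procedure discussed above, here is a minimal sketch using rpart only. The data are randomly generated stand-ins that borrow a few column names from the question (make_data and prep are hypothetical helpers, not part of the original code). Two assumptions are made that the question does not state: Exited is converted to a factor so the tree is treated as a classification problem, and identifier-like columns (RowNumber, Surname) are dropped before fitting.

```r
library(rpart)

set.seed(42)

# Hypothetical stand-in for partb_data1 / partb_data2 (random values only)
make_data <- function(n) {
  age <- sample(18:80, n, replace = TRUE)
  data.frame(
    RowNumber   = seq_len(n),
    Surname     = sample(c("Hill", "Chu", "Boni"), n, replace = TRUE),
    CreditScore = sample(350:850, n, replace = TRUE),
    Geography   = sample(c("France", "Spain", "Germany"), n, replace = TRUE),
    Age         = age,
    # churn is made more likely for older customers so the tree has a signal
    Exited      = rbinom(n, 1, ifelse(age > 50, 0.8, 0.1))
  )
}

# Drop identifier columns and make the outcome a factor (assumption, see above)
prep <- function(d) {
  d <- d[, !(names(d) %in% c("RowNumber", "Surname"))]
  d$Exited    <- factor(d$Exited, levels = c(0, 1))
  d$Geography <- factor(d$Geography, levels = c("France", "Germany", "Spain"))
  d
}

train.data <- prep(make_data(500))
test.data  <- prep(make_data(150))

# Fit a classification tree on the training set
fit <- rpart(Exited ~ ., data = train.data, method = "class")

# Prune at the cp value with the lowest cross-validated error
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)

# Evaluate on the held-out test set
pred     <- predict(pruned, test.data, type = "class")
accuracy <- mean(pred == test.data$Exited)
print(table(predicted = pred, actual = test.data$Exited))
print(accuracy)
```

With the real data, the same prep step would be applied to both partb_data1 and partb_data2 before fitting and predicting.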

Related

position=dodge in geom_col in barplot

Here is a dataset of soccer players. I need to visualise the total number of yellow cards received next to the number of games played per country in one bar plot. So I need to calculate the total number of yellow cards and the total number of games per league country and bring the data into long format.
dput(head(new_soccer_referee))
structure(list(playerShort = c("lucas-wilchez", "john-utaka",
"abdon-prats", "pablo-mari", "ruben-pena", "aaron-hughes"), player = c("Lucas Wilchez",
"John Utaka", " Abdón Prats", " Pablo Marí", " Rubén Peña", "Aaron Hughes"
), club = c("Real Zaragoza", "Montpellier HSC", "RCD Mallorca",
"RCD Mallorca", "Real Valladolid", "Fulham FC"), leagueCountry = c("Spain",
"France", "Spain", "Spain", "Spain", "England"), birthday = structure(c(4990,
4390, 8386, 8643, 7868, 3598), class = "Date"), height = c(177L,
179L, 181L, 191L, 172L, 182L), weight = c(72L, 82L, 79L, 87L,
70L, 71L), position = c("Attacking Midfielder", "Right Winger",
NA, "Center Back", "Right Midfielder", "Center Back"), games = c(1L,
1L, 1L, 1L, 1L, 1L), victories = c(0L, 0L, 0L, 1L, 1L, 0L), ties = c(0L,
0L, 1L, 0L, 0L, 0L), defeats = c(1L, 1L, 0L, 0L, 0L, 1L), goals = c(0L,
0L, 0L, 0L, 0L, 0L), yellowCards = c(0L, 1L, 1L, 0L, 0L, 0L),
yellowReds = c(0L, 0L, 0L, 0L, 0L, 0L), redCards = c(0L,
0L, 0L, 0L, 0L, 0L), photoID = c("95212.jpg", "1663.jpg",
NA, NA, NA, "3868.jpg"), rater1 = c(0.25, 0.75, NA, NA, NA,
0.25), rater2 = c(0.5, 0.75, NA, NA, NA, 0), refNum = c(1L,
2L, 3L, 3L, 3L, 4L), refCountry = c(1L, 2L, 3L, 3L, 3L, 4L
), Alpha_3 = c("GRC", "ZMB", "ESP", "ESP", "ESP", "LUX"),
meanIAT = c(0.326391469021736, 0.203374724564378, 0.369893594187172,
0.369893594187172, 0.369893594187172, 0.325185154120009),
nIAT = c(712L, 40L, 1785L, 1785L, 1785L, 127L), seIAT = c(0.000564112354334542,
0.0108748941063986, 0.000229489640866464, 0.000229489640866464,
0.000229489640866464, 0.00329680952361961), meanExp = c(0.396,
-0.204081632653061, 0.588297311544544, 0.588297311544544,
0.588297311544544, 0.538461538461538), nExp = c(750L, 49L,
1897L, 1897L, 1897L, 130L), seExp = c(0.0026964901062936,
0.0615044043187379, 0.00100164730649311, 0.00100164730649311,
0.00100164730649311, 0.013752210497518), BMI = c(22.98190175237,
25.5922099809619, 24.1140380330271, 23.8480304816206, 23.6614386154678,
21.4346093466973), position_new = c("Offense", "Offense",
"Goalkeeper", "Defense", "Midfield", "Defense"), rater_mean = c(0.375,
0.75, NA, NA, NA, 0.125), ageinyear = c(28, 30, 19, 18, 20,
32), ageinyears = c(28, 30, 19, 18, 20, 32)), row.names = c(NA,
6L), class = "data.frame")
Use the data to draw a bar plot with the following characteristics:
– The x-axis displays the league country while the y-axis displays the number of games and the number of cards
– For each country there are two bars next to each other: one for the games played and one for the cards received
barplot <- ggplot(new_soccer_referee,aes(x=leagueCountry,y=number))
barplot +
geom_bar(fill=c("games","yellowCards")) +
geom_col(Position="dodge") +
labels(x="leagueCountry", y="number")
ggplot
I know it is pretty messy but I am really confused how to build up the layers with ggplot and how to work out the long format, can anyone help?
One option would be to first aggregate your data to compute the number of yellowCards and games by leagueCountry. Afterwards you could convert the result to long format, which makes it easy to plot via ggplot2.
Using some fake random example data to mimic your real data:
set.seed(123)
new_soccer_referee <- data.frame(
player = sample(letters, 20),
leagueCountry = sample(c("Spain", "France", "England", "Italy"), 20, replace = TRUE),
yellowCards = sample(1:5, 20, replace = TRUE),
games = sample(1:20, 20, replace = TRUE)
)
library(dplyr)
library(tidyr)
library(ggplot2)
new_soccer_referee_long <- new_soccer_referee %>%
group_by(leagueCountry) %>%
summarise(across(c(yellowCards, games), sum)) %>%
pivot_longer(-leagueCountry, names_to = "variable", values_to = "number")
ggplot(new_soccer_referee_long, aes(leagueCountry, number, fill = variable)) +
geom_col(position = "dodge")
Something like this:
library(tidyverse)
new_soccer_referee %>%
select(leagueCountry, games, yellowCards) %>%
group_by(leagueCountry) %>%
summarise(games = sum(games),
yellowCards = sum(yellowCards)
) %>%
pivot_longer(-leagueCountry) %>%
ggplot(aes(x=leagueCountry, fill=name, y=value)) +
geom_col(position = position_dodge())
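For comparison with the attempt in the question, here is a small sketch of the same aggregate-then-reshape idea using base R for the data step and ggplot2 for the plot. The toy data frame is made up for illustration. It also shows two details the original attempt got wrong: the argument is position (lowercase), and axis labels are set with labs(), not labels().

```r
library(ggplot2)

set.seed(1)

# Made-up example data with the three columns that matter here
toy <- data.frame(
  leagueCountry = sample(c("Spain", "France", "England"), 30, replace = TRUE),
  games         = sample(1:20, 30, replace = TRUE),
  yellowCards   = sample(0:5, 30, replace = TRUE)
)

# Step 1: totals per league country
totals <- aggregate(cbind(games, yellowCards) ~ leagueCountry,
                    data = toy, FUN = sum)

# Step 2: long format, one row per country and variable
long <- data.frame(
  leagueCountry = rep(totals$leagueCountry, times = 2),
  variable      = rep(c("games", "yellowCards"), each = nrow(totals)),
  number        = c(totals$games, totals$yellowCards)
)

# Step 3: dodged bars, with axis labels set via labs()
p <- ggplot(long, aes(x = leagueCountry, y = number, fill = variable)) +
  geom_col(position = "dodge") +
  labs(x = "League country", y = "Total", fill = NULL)
print(p)
```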

What is the best way to use agricolae to do ANOVAs on a split plot design?

I'm trying to run some ANOVAs on data from a split plot experiment, ideally using the agricolae package. It's been a while since I've taken a stats class and I wanted to be sure I'm analyzing this data correctly, so I did some searching online and couldn't really find consistency in the way people were analyzing their split plot experiments. What is the best way for me to do this?
Here's the head of my data:
dput(head(rawData))
structure(list(ï..Plot = 2111:2116, Variety = structure(c(5L,
4L, 3L, 6L, 1L, 2L), .Label = c("Burbank", "Hodag", "Lamoka",
"Norkotah", "Silverton", "Snowden"), class = "factor"), Rate = c(4L,
4L, 4L, 4L, 4L, 4L), Rep = c(1L, 1L, 1L, 1L, 1L, 1L), totalTubers = c(594L,
605L, 656L, 729L, 694L, 548L), totalOzNoCulls = c(2544.18, 2382.07,
2140.69, 2401.56, 2440.56, 2503.5), totalCWTacNoCulls = c(461.76867,
432.345705, 388.535235, 435.88314, 442.96164, 454.38525), avgLWratio = c(1.260615419,
1.287949374, 1.111981583, 1.08647584, 1.350686661, 1.107173509
), Hollow = c(14L, 15L, 22L, 25L, 14L, 13L), Double = c(10L,
13L, 15L, 22L, 11L, 9L), Knob = c(86L, 80L, 139L, 156L, 77L,
126L), Researcher = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Wang", class = "factor"),
CullsPounds = c(1.75, 1.15, 4.7, 1.85, 0.8, 5.55), CullsOz = c(28,
18.4, 75.2, 29.6, 12.8, 88.8), totalOz = c(2572.18, 2400.47,
2215.89, 2431.16, 2453.36, 2592.3), totalCWTacCulls = c(466.85067,
435.685305, 402.184035, 441.25554, 445.28484, 470.50245)), row.names = c(NA,
6L), class = "data.frame")
For these data, the whole plot is Rate, the split plot is Variety, the block is Rep, and for discussion's sake here, we can look at totalCWTacNoCulls as the response.
Any help would be very much appreciated! I am still getting the hang of Stack Overflow, so if I have made any mistakes or shared my data wrong, please let me know and I'll change it. Thank you!
You can do this using the agricolae package as follows:
library(agricolae)
attach(rawData)
Rate = factor(Rate)
Variety = factor(Variety)
Rep = factor(Rep)
sp.plot(Rep, Rate, Variety, totalCWTacNoCulls)
The usage according to the agricolae package is
sp.plot(block, pplot, splot, Y)
where block is the replication (block) factor, pplot is the main-plot factor, splot is the sub-plot factor, and Y is the response variable.
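As a cross-check that does not depend on agricolae, the same split-plot structure can be sketched with base R's aov() and an Error() term. This is an alternative formulation, not what sp.plot does internally. The data below are simulated (the toy data frame and its yield column are hypothetical), with Rep as block, Rate as whole-plot factor and Variety as split-plot factor.

```r
set.seed(1)

# Simulated balanced split-plot layout: 4 blocks x 3 rates x 3 varieties
toy <- expand.grid(
  Rep     = factor(1:4),
  Rate    = factor(c(2, 4, 6)),
  Variety = factor(c("Burbank", "Hodag", "Lamoka"))
)
toy$yield <- rnorm(nrow(toy), mean = 430, sd = 20)

# Rate is tested against the Rep:Rate (whole-plot) error,
# Variety and Rate:Variety against the within (split-plot) error
fit <- aov(yield ~ Rate * Variety + Error(Rep / Rate), data = toy)
print(summary(fit))
```

If the whole-plot and split-plot F tests from this model agree with the sp.plot output on the real data, that is a reasonable sign the factors were passed in the right order.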

Plotting multiple effect plots from logistic regression

I have a number of logistic regression models with different response variables but the same predictor variables. I want to use grid.arrange (or anything else) to make a single figure with all these effect plots that were made with the effects package. I followed the advice here to make such a graph: grid.arrange with John Fox's effects plots
library(effects)
library(gridExtra)
data <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L,1L, 1L, 2L, 2L, 2L), .Label = c("group1", "group2"), class = "factor"),obs = c(1L, 1L, 4L, 4L, 6L, 12L, 26L, 1L, 10L, 6L),responseA = c(1L, 1L, 2L, 0L, 1L, 10L, 20L, 0L, 3L, 2L), responseB = c(0L, 0L, 2L, 4L, 6L, 4L, 8L, 1L, 8L, 5L)), .Names = c("group", "obs", "responseA","responseB"), row.names = c(53L, 54L, 55L, 56L, 57L, 58L,59L, 115L, 116L, 117L), class = "data.frame")
model1<-glm(cbind(responseA,(obs-responseA))~group,family=binomial, data=data)
model2<-glm(cbind(responseB,(obs-responseB))~group,family=binomial, data=data)
ef1 <-allEffects(model1)[[1]]
ef2 <- allEffects(model2)[[1]]
elist <- list( ef1,ef2)
class(elist) <- "efflist"
plot(elist, col=2)
The problem is that the models use the response variable in the form cbind(response A, no response A), but for the figure I would like to change it to something cleaner (like Response A). I tried changing the y labels by passing both labels at once, but got a warning, and it turned both labels into "Response A".
plot(elist, ylab=c("response A","response B"),col=2)
Then I tried the second suggested method, changing the class to trellis, but got an error, so grid.arrange didn't work either.
p1<-plot(allEffects(model1),ylab="Response A")
p2<-plot(allEffects(model2),ylab="Response B")
class(p1) <- class(p2) <- "trellis"
grid.arrange(p1, p2, ncol=2)
Can anyone provide a method to change each y-axis label separately?
With the ef1 and ef2 variables you created, you can plot each effect separately with its own label and then arrange the plots:
plot1 <- plot(ef1, ylab = "Response A")
plot2 <- plot(ef2, ylab = "Response B")
grid.arrange(plot1, plot2, ncol=2)

Function predict() for object of segmented.lme()

I have previously run mixed model analyses using glmer() in package lme4. I applied functions dredge() and get.models() in package MuMIn to quantify the top.models. I then used a model.avg() approach in package MuMIn to create a fitted object for function predict(). Finally, I created a newdata object called newdat, i.e. a new object for each predictor.
I then used newdatfinal <- predict(avModX, newdata = newdat, se.fit=TRUE, re.form=NA), where avModX presents the fitted model derived from subset.top.models <- c(top.models[[1]],top.models[[1]]) and avModX <- model.avg(subset.top.models). This all works fine.
I now need to use predict() on a segmented.lme() object. The code for function segmented.lme() can be found here: https://www.researchgate.net/publication/292986444_segmented_mixed_models_in_R_code_and_data. A reference working paper is available here: https://www.researchgate.net/publication/292629179_Segmented_mixed_models_with_random_changepoints_in_R. This function allows for detection of differences in slope and provides changepoint estimates, i.e. a test for breakpoint(s) in the data.
I first used the function
global.model.lme <- lme(response ~ predictor1*predictor2*predictor3*
predictor4 + covariate1 + covariate2 + covariate3,
data = mydat,
random = list(block = pdDiag(~ 1 + predictor1),
transect = pdDiag(~ 1 + predictor1)),
na.action="na.fail")
and followed by function
global.model.seg <- segmented.lme(global.model.lme,
Z = predictor1,
random = list(block = pdDiag(~ 1 + predictor1 + U + G0),
transect = pdDiag(~ 1 + predictor1 + U + G0)),
psi.link = "identity")
Z = the 'segmented' covariate having a segmented relationship with the response, U = slope difference, G0 = the formula of random effects for changepoints (changepoint estimate)
I would now like to use the segmented.lme() object in function predict(), i.e. something like newdatfinal <- predict(global.model.seg, newdata = newdat, se.fit=TRUE, re.form=NA)
I currently get the error message:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "segmented.lme"
This is a reproducible subset of the original data:
structure(list(block = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8"), class = "factor"), transect = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("B1L", "B1M", "B1S", "B2L", "B2M", "B2S", "B3L", "B3M", "B3S", "B4L", "B4M", "B4S", "B5L", "B5M", "B5S", "B6L", "B6M", "B6S", "B7L", "B7M", "B7S", "B8L", "B8M", "B8S"), class = "factor"), predictor1 = c(28.63734661, 31.70995133, 27.40407982, 25.48842992, 21.81094637, 24.02032756), predictor2 = c(5.002945364, 6.85567854, 0, 22.470422, 0, 0), predictor3 = c(3.72, 3.55, 3.66, 3.65, 3.53, 3.66), predictor4 = c(504.8, 547.6, 499.7, 497.8, 473.8, 467.5), covariate1 = c(391L, 394L, 351L, 336L, 304L, 335L), covariate2 = c(0.96671086, 2.81939707, 0.899512367, 1.024730094, 1.641161861, 1.419433714), covariate3 = c(0.787505444, 0.641693911, 0.115804751, -0.041146951, 1.983567486, -0.451039179), response = c(0.81257636, 0.622662116, 0.490330786, 0.709929461, -0.156398286, -1.185175095)), .Names = c("block", "transect", "predictor1", "predictor2", "predictor3", "predictor4", "covariate1", "covariate2", "covariate3", "response"), row.names = c(NA, 6L), class = "data.frame")
and a reproducible subset of the newdat data:
structure(list(predictor1 = c(-0.441935, -0.433467318435754, -0.424999636871508, -0.416531955307263, -0.408064273743017, -0.399596592178771), covariate1 = c(0L, 0L, 0L, 0L, 0L, 0L), covariate2 = c(0L, 0L, 0L, 0L, 0L, 0L), covariate3 = c(0L, 0L, 0L, 0L, 0L, 0L),
predictor2 = c(0L, 0L, 0L, 0L, 0L, 0L), predictor3 = c(0L,
0L, 0L, 0L, 0L, 0L), predictor4 = c(0L, 0L, 0L, 0L, 0L, 0L
)), .Names = c("predictor1", "covariate1", "covariate2", "covariate3", "predictor2", "predictor3", "predictor4"), row.names = c(NA, 6L), class = "data.frame")
Many thanks in advance for any advice.
segmented.lme is at a preliminary stage, so there is currently no predict method for it. However, since the algorithm relies on a working linear model, you could use the last one (at convergence) to make predictions:
predict(global.model.seg[[2]], ..)
Results should be carefully checked.
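To illustrate why the error appears and why extracting an inner model avoids it, here is a toy sketch in base R. It does not use segmented.lme at all: the class name toy.segmented and the list layout are invented for the example, and an ordinary lm fit stands in for the working model.

```r
# An ordinary fitted model, which does have a predict method
inner <- lm(dist ~ speed, data = cars)

# A wrapper object with a custom class and no predict method of its own
# (hypothetical structure, only loosely mimicking a fitted-object list)
wrapper <- structure(
  list(info = "toy example", working.model = inner),
  class = "toy.segmented"
)

# S3 dispatch finds no predict.toy.segmented, so predict() fails
res <- tryCatch(predict(wrapper), error = function(e) conditionMessage(e))
print(res)

# Extracting the stored model and predicting from it works,
# because dispatch now happens on class "lm"
pred <- predict(wrapper[[2]], newdata = data.frame(speed = c(10, 20)))
print(pred)
```

Whether the second element of a real segmented.lme object is the right model to predict from depends on that function's return value, so it is worth inspecting the object with str() first.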

melting multiple spans of variables

(still) new to r, and very confused as to how I should accomplish multiple melts of my data. Here is a subset:
df <- structure(list(Subject = c(101L, 101L, 101L, 102L, 102L, 102L
), Condition = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("apass",
"vpas"), class = "factor"), FreqCode = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("LessVerbal", "MoreVerbal"), class = "factor"),
Item = c(1L, 4L, 7L, 1L, 4L, 7L), Len = c(80L, 68L, 85L,
68L, 85L, 79L), R1_1.RT = c(237L, 203L, 207L, 336L, 487L,
340L), R1_2.RT = c(177L, 225L, 162L, 634L, 590L, 347L), R1_3.RT = c(200L,
226L, 212L, 707L, 653L, 379L), R1.RT = c(614L, 654L, 581L,
1677L, 1730L, 1066L), R1_1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "The", class = "factor"), R1_2 = structure(c(3L,
1L, 2L, 1L, 2L, 4L), .Label = c("antique", "course", "new",
"road"), class = "factor"), R1_3 = structure(c(4L, 1L, 2L,
1L, 2L, 3L), .Label = c("car", "materials", "surfaces", "technology"
), class = "factor"), R1 = structure(c(3L, 1L, 2L, 1L, 2L,
4L), .Label = c("The antique car", "The course materials",
"The new technology", "The road surfaces"), class = "factor")), .Names = c("Subject",
"Condition", "FreqCode", "Item", "Len", "R1_1.RT", "R1_2.RT",
"R1_3.RT", "R1.RT", "R1_1", "R1_2", "R1_3", "R1"), class = "data.frame", row.names =
c(NA,
-6L))
My goal is to get output that (in part) looks like this:
Region RT WordRegion Word
R1_1.RT 237 R1_1 the
...
R1_2.RT 177 R1_2 new
...
EDIT: The variable ending with ".RT" (e.g., R1_1.RT) are Region names and will be melted into a Region column. The variables ending in numbers (e.g., R1_1) correspond exactly to the Region names and their associated values. I want them to be melted alongside the Region names so that I can analyze them in relation to the Region column
In the first part of the code, I melt all of the values into a Region column and change the value to RT. This seems to work fine:
#long transform (with individual regions at end)
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
#change newly created column name to "RT" (note:you have to change the number in [] to match your data)
colnames(SmallMelt1)[11 ] <- "RT"
But I don't get how to simultaneously melt another span of variables such that they will line up vertically with the first span. I want to do something like this, after the first melt, but it does not work:
#Second Melt for region names (doesn't work)
SmallMelt2 = melt(SmallMelt1, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
#Change name to Word
colnames(SmallMelt2)[9] <- "Word" #add col number for "value" here
Please let me know if you need any clarification. I hope someone can help... thanks in advance - DT
So, after consulting with someone off-list, I found the solution. My mistake was that I was trying to run the second step on the output of the first step. By running the two steps independently on the original data and then concatenating, I get the right result.
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
SmallMelt2 = melt(df, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
SmallMelt3=cbind(SmallMelt1,SmallMelt2[,11])
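Because cbind relies on both melts producing rows in exactly the same order, a variant that pairs each .RT column with its word column by name may be easier to verify. The following is a base R sketch on a cut-down, made-up version of the data (two subjects, two regions). The regions vector and the column pairing rule (word column name plus ".RT") are assumptions taken from the naming pattern in the question.

```r
# Cut-down example with the same naming pattern as the question's data
df <- data.frame(
  Subject = c(101L, 102L),
  R1_1.RT = c(237L, 336L),
  R1_2.RT = c(177L, 634L),
  R1_1    = c("The", "The"),
  R1_2    = c("new", "antique"),
  stringsAsFactors = FALSE
)

# Each word column "R1_x" is assumed to pair with the RT column "R1_x.RT"
regions <- c("R1_1", "R1_2")

# Build one block of rows per region and stack the blocks
long <- do.call(rbind, lapply(regions, function(r) {
  data.frame(
    Subject    = df$Subject,
    Region     = paste0(r, ".RT"),
    RT         = df[[paste0(r, ".RT")]],
    WordRegion = r,
    Word       = df[[r]],
    stringsAsFactors = FALSE
  )
}))
print(long)
```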
