iris <- read.csv("iris.csv") #iris data available in R
library(rpart)
iris.rpart <- rpart(Species~Sepal.length+Sepal.width+Petal.width+Petal.length,
data=iris)
plotcp(iris.rpart)
printcp(iris.rpart)
iris.rpart1 <- prune(iris.rpart, cp=0.047)
plot(iris.rpart1,uniform=TRUE)
text(iris.rpart1, use.n=TRUE, cex=0.6)
I have fitted an rpart tree to the iris data. Is there a function in R that returns the rules rpart applied in building this tree, so that we know how new points added to the data set will be classified?
The rpart.plot package has a function rpart.rules for generating a set of rules for a tree. For example
library(rpart.plot)
iris.rpart <- rpart(Species~., data=iris)
rpart.rules(iris.rpart)
gives
Species seto vers virg
setosa [1.00 .00 .00] when Petal.Length < 2.5
versicolor [ .00 .91 .09] when Petal.Length >= 2.5 & Petal.Width < 1.8
virginica [ .00 .02 .98] when Petal.Length >= 2.5 & Petal.Width >= 1.8
And
options(width=1000)
rpart.predict(iris.rpart, newdata=iris[50:52,], rules=TRUE)
gives you the rule used to make each prediction:
setosa versicolor virginica
50 1 0.00000 0.000000 because Petal.Length < 2.5
51 0 0.90741 0.092593 because Petal.Length >= 2.5 & Petal.Width < 1.8
52 0 0.90741 0.092593 because Petal.Length >= 2.5 & Petal.Width < 1.8
For more examples see Chapter 4 of the rpart.plot vignette.
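If adding the rpart.plot dependency is not an option, the rpart package itself provides path.rpart, which lists the split conditions leading to given nodes. The following is a minimal sketch, assuming the built-in iris data and default rpart settings; the names leaf_nodes and rules are illustrative only.

```r
library(rpart)

data(iris)
fit <- rpart(Species ~ ., data = iris)

# Node numbers are the row names of fit$frame; leaves are marked "<leaf>"
leaf_nodes <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# One character vector of split conditions per leaf, each starting at "root"
rules <- path.rpart(fit, nodes = leaf_nodes, print.it = FALSE)
print(rules)
```

Each element of rules is the chain of conditions a new observation must satisfy to land in that leaf.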
I have a model, called predictive_fit <- fit(workflow, training) that classifies the Iris dataset species using xgboost. The data are pivoted wide such that each species is a dummied column represented by a 0 or 1. Here, I am trying to predict Virginica based on the Sepal and Petal columns.
Currently, I have the following code, which takes the test dataset after the model has been fit and checks whether it can accurately predict the virginica species of iris. (Snippet below)
testing_data <-
test %>%
bind_cols(
predict(predictive_fit, test)
)
I cannot, however, figure out how to scale this up with simulation. If I have another dataset with exactly the same structure, I would like to predict whether it is Virginica 100 times. (Snippet below)
new_iris_data <-
new_iris_data %>%
bind_cols(
replicate(n = 100, predict(predictive_fit, new_iris_data))
)
However, it looks as if when I run the new data the same predictions are just being copied 100 times. What is the appropriate way to repeatedly predict the classification? I wouldn't expect that all 100 times the model would predict exactly the same thing, but I'd like some way to have the predictions run n number of times so each and every row of new data can have its own proportion calculated.
I have already tried using the replicate() function to try this. However, it appears as if it copies the same exact results 100 times. I considered having a for loop that iterated through a different seed and then ran the predictions, but I was hoping for a more performant solution out there.
You are replicating the prediction of your model, not the data.frame you call new_iris_data, so the result is exactly that: the same predictions copied 100 times. To replicate a (random) part of the iris dataset, try this:
> data("iris")
>
> sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
>
> train <- iris[sample,]
> test <- iris[-sample,]
>
> new_test <- replicate(100, test, simplify = FALSE)
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> nrow(new_test)
[1] 7500
Then you can use new_test in any prediction, independent of the model.
If you want 100 different random parts of the data set, you need to drop the replicate function and do something like:
> new_test <- lapply(1:100, function(x) {
+ sample <- sample(nrow(iris), floor(nrow(iris) * 0.5))
+ iris[-sample,]
+ })
>
> new_test <- Reduce(rbind.data.frame, new_test)
>
> head(new_test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
18 5.1 3.5 1.4 0.3 setosa
> nrow(new_test)
[1] 7500
>
Hope it helps.
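As for why the 100 predictions in the question come out identical: predict on an already fitted model is generally deterministic for a given input, so repeating the call cannot produce a proportion. One way to get predictions that vary is to refit the model on bootstrap resamples of the training data and count how often each new row is assigned to virginica. The sketch below assumes rpart as a stand-in for the xgboost workflow in the question; new_rows, n_rep, and prop_virginica are illustrative names, not part of the original code.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the sketch is reproducible
data(iris)

# Hold out a few rows to play the role of "new" data
held_out <- c(1, 51, 101)
new_rows <- iris[held_out, ]
train    <- iris[-held_out, ]

# A fitted model is deterministic: repeated predictions are identical
fit <- rpart(Species ~ ., data = train)
p1 <- predict(fit, new_rows, type = "class")
p2 <- predict(fit, new_rows, type = "class")
print(identical(p1, p2))

# Refit on bootstrap resamples so that the predictions can vary
n_rep <- 100
votes <- replicate(n_rep, {
  boot     <- train[sample(nrow(train), replace = TRUE), ]
  boot_fit <- rpart(Species ~ ., data = boot)
  as.character(predict(boot_fit, new_rows, type = "class")) == "virginica"
})

# One proportion per new row: share of refits that predicted virginica
prop_virginica <- rowMeans(votes)
print(prop_virginica)
```

Whether bootstrap refitting is the right source of randomness depends on what the proportion is meant to express; if the model's own class probability is enough, a single predict call that returns probabilities may already answer the question.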
This is my first time using the pipe function and my professor has not reviewed how to use it, so I am a little lost.
I have trouble with the last question: I keep getting an error, most likely because the last assignment contradicts my filter <=2. Thank you in advance. The following is my code:
L.W<- iris %>%
select(Petal.Length,Petal.Width) %>% head()
print(L.W)
#b
S.L <- iris%>%
arrange(Sepal.Length)%>%
head()
print(S.L)
#c
iris%>%
arrange(Sepal.Length)%>%
select(Species,Petal.Length,Petal.Width)%>%
head()
#Switch order
iris%>%
select(Species,Petal.Length,Petal.Width)%>%
head()
#there are two different data sets
#d
iris%>%
filter(Petal.Length<=2 & Petal.Width< mean(Petal.Width))%>%
mutate(Petal.Length)%>%
huge<-assign(Petal.Length>6)%>%
big<-assign(Petal.Length>5)%>%
medium<-assign(Petal.Length>4)%>%
small<-assign(Petal.Length<=4)%>%
head()
Explanation:
The solution for task #d:
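Using filter, select the observations where Petal.Length is not <= 2 (!Petal.Length <= 2) and Petal.Width is not below its mean (!Petal.Width < mean(Petal.Width)).
Then we use mutate with case_when, which assigns the label of the first condition that matches: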
#d
iris %>%
filter(!Petal.Length <= 2 & !Petal.Width < mean(Petal.Width)) %>%
mutate(new_col = case_when(Petal.Length > 6 ~ "huge",
Petal.Length > 5 ~ "big",
Petal.Length > 4 ~ "medium",
Petal.Length <= 4 ~ "small")) %>%
head()
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_col
1 7.0 3.2 4.7 1.4 versicolor medium
2 6.4 3.2 4.5 1.5 versicolor medium
3 6.9 3.1 4.9 1.5 versicolor medium
4 5.5 2.3 4.0 1.3 versicolor small
5 6.5 2.8 4.6 1.5 versicolor medium
6 5.7 2.8 4.5 1.3 versicolor medium
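The same binning can also be sketched in base R with cut, which may help to see what the case_when ladder does. This assumes the built-in iris data; kept and size are illustrative names, and the breaks are chosen so that the intervals match the conditions above (small up to and including 4, then medium, big, huge).

```r
data(iris)

# Same rows as the filter() call: Petal.Length > 2 and Petal.Width at or above its mean
kept <- iris[iris$Petal.Length > 2 & iris$Petal.Width >= mean(iris$Petal.Width), ]

# Right-closed intervals: (-Inf, 4], (4, 5], (5, 6], (6, Inf)
kept$size <- cut(kept$Petal.Length,
                 breaks = c(-Inf, 4, 5, 6, Inf),
                 labels = c("small", "medium", "big", "huge"))

print(head(kept))
```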
New here and not very experienced, and I'm trying to get a project in R shinyapp to work.
I have a list of data frames which have a column labeled 'Gender' containing all/M/F. I want to filter all data frames based on the input, so that if the input is male, only rows containing M or all are kept.
list_tables <- list(adverb,adjective,simplenoun,verber,thingnoun,
personnoun,name_firstpart,name_secondpart)
input$gender <- "male"
if(input$gender == "male"){
for (i in list_tables){
list_tables$i <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
}
Problem is, if I check the list afterwards, nothing has changed. If I do the same, but instead of using a for loop to cycle through the dataframes, I perform the same actions on only one dataframe, it does work. Theoretically, I could make a line of code for each dataframe separately, but it doesn't seem very neat and I have the feeling that the for loop should work but I'm just missing something. Would love to hear tips if anyone has them!
i is not a named-entry within list_tables, so list_tables$i doesn't work. Inside that loop, i is the data.frame you're trying to modify, but you don't update it.
Try either:
for (ind in seq_along(list_tables)) {
i <- list_tables[[ind]] # feels a little sloppy, but it's compact ...
list_tables[[ind]] <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
or even better
list_tables <- lapply(list_tables, function(i) i[which((i$Gender=="M")|(i$Gender=="all")),])
You could use lapply with subset:
example:
list_tables <- replicate(2,iris[c(1,51,101),],F)
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
solution:
lapply(list_tables,subset,Species %in% c("setosa","virginica"))
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
In your case that would be:
lapply(list_tables,subset,Gender %in% c("M","all"))
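To make this concrete for the Gender column, here is a small self-contained sketch. The two toy tables and their contents are made up for illustration (the real adverb, adjective, and other tables are not shown in the question), and naming the list elements is an optional convenience so that the filtered tables can be retrieved by name.

```r
# Toy tables standing in for the real ones (hypothetical data)
list_tables <- list(
  adverb = data.frame(word = c("quickly", "softly", "boldly"),
                      Gender = c("M", "F", "all")),
  name_firstpart = data.frame(word = c("Anna", "Ben"),
                              Gender = c("F", "M"))
)

keep <- c("M", "all")

# lapply returns a new list; assign it back to replace the old one
list_tables <- lapply(list_tables, function(tbl) {
  tbl[tbl$Gender %in% keep, , drop = FALSE]
})

print(sapply(list_tables, nrow))
```

Using %in% rather than == also means that rows with a missing Gender are dropped instead of producing NA rows.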
I need to run a simple regression using lm() in R. It's simple because I have only one independent variable. The catch is that I need to test this independent variable against a number of dependent variables, which are columns in a data frame.
So basically I have one common X and numerous Y's, for which I need to extract the intercept and slope and store them all in a data frame.
In Excel this is possible with the INTERCEPT and SLOPE functions and then dragging across columns. I need something in R that would basically do the same. I could of course run separate regressions, but the requirement is that I run all of them in one loop and store the estimates of intercept and slope together for each.
I'm still learning R and any help on this would be great. Thanks :)
The lmList function in package nlme was designed for this.
Let's use the iris dataset as an example:
DF <- iris[, 1:4]
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
#...
First we have to reshape it. We want Sepal.Length as the dependent and the other columns as predictors in this example.
library(reshape2)
DF <- melt(DF, id.vars = "Sepal.Length")
# Sepal.Length variable value
#1 5.1 Sepal.Width 3.5
#2 4.9 Sepal.Width 3.0
#3 4.7 Sepal.Width 3.2
#4 4.6 Sepal.Width 3.1
#5 5.0 Sepal.Width 3.6
#6 5.4 Sepal.Width 3.9
#...
Now we can do the fits.
library(nlme)
mods <- lmList(Sepal.Length ~ value | variable,
data = DF, pool = FALSE)
We can now extract intercept and slope for each model.
coef(mods)
# (Intercept) value
#Sepal.Width 6.526223 -0.2233611
#Petal.Length 4.306603 0.4089223
#Petal.Width 4.777629 0.8885803
And get the usual t-table:
summary(mods)
# Call:
# Model: Sepal.Length ~ value | variable
# Data: DF
#
# Coefficients:
# (Intercept)
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width 6.526223 0.47889634 13.62763 6.469702e-28
# Petal.Length 4.306603 0.07838896 54.93890 2.426713e-100
# Petal.Width 4.777629 0.07293476 65.50552 3.340431e-111
# value
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width -0.2233611 0.15508093 -1.440287 1.518983e-01
# Petal.Length 0.4089223 0.01889134 21.646019 1.038667e-47
# Petal.Width 0.8885803 0.05137355 17.296454 2.325498e-37
Or the R-squared values:
summary(mods)$r.squared
#[1] 0.01382265 0.75995465 0.66902769
However, if you need something more efficient, you can use package data.table together with lm's workhorse lm.fit:
library(data.table)
setDT(DF)
DF[, setNames(as.list(lm.fit(cbind(1, value),
Sepal.Length)[["coefficients"]]),
c("intercept", "slope")), by = variable]
# variable intercept slope
#1: Sepal.Width 6.526223 -0.2233611
#2: Petal.Length 4.306603 0.4089223
#3: Petal.Width 4.777629 0.8885803
And of course the R.squared values of these models are just the squared Pearson correlation coefficients:
DF[, .(r.sq = cor(Sepal.Length, value)^2), by = variable]
# variable r.sq
#1: Sepal.Width 0.01382265
#2: Petal.Length 0.75995465
#3: Petal.Width 0.66902769
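For the layout described in the question (one common X, several Y columns), base R's lm also accepts a matrix on the left-hand side and fits one regression per column. The sketch below assumes the built-in iris data, with Sepal.Length playing the role of X and the remaining numeric columns as the Y's; Y and est are illustrative names.

```r
data(iris)

# Each column of Y is treated as a separate dependent variable
Y <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])

# A matrix response yields one set of coefficients per column of Y
fit <- lm(Y ~ Sepal.Length, data = iris)

# coef() is a 2 x 3 matrix; transpose to get one row per dependent variable
est <- as.data.frame(t(coef(fit)))
names(est) <- c("intercept", "slope")
print(est)
```

Note that the direction of the regression here is the reverse of the lmList example above, so the numbers are not expected to match that output.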
I am using R to classify a data-frame called 'd' containing data structured like below:
The data has 576666 rows and the column "classLabel" is a factor with 3 levels: ONE, TWO, THREE.
I am making a decision tree using rpart:
fitTree = rpart(d$classLabel ~ d$tripduration + d$from_station_id + d$gender + d$birthday)
And I want to predict the values for the "classLabel" for newdata:
newdata = data.frame( tripduration=c(345,244,543,311),
from_station_id=c(60,28,100,56),
gender=c("Male","Female","Male","Male"),
birthday=c(1972,1955,1964,1967) )
p <- predict(fitTree, newdata)
I expect my result to be a matrix of 4 rows each with a probability of the three possible values for "classLabel" of newdata. But what I get as the result in p, is a dataframe of 576666 rows like below:
I also get the following warning when running the predict function:
Warning message:
'newdata' had 4 rows but variables found have 576666 rows
What am I doing wrong?
I think the problem is that you should add type="class" to the prediction call:
predict(fitTree,newdata,type="class")
Try the following code. I use the iris dataset in this example.
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# model fitting
> library(rpart)
> fitTree<-rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,iris)
#prediction-one row data
> newdata<-data.frame(Sepal.Length=7,Sepal.Width=4,Petal.Length=6,Petal.Width=2)
> newdata
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6 2
# perform prediction
> predict(fitTree, newdata,type="class")
1
virginica
Levels: setosa versicolor virginica
#prediction-multiple-row data
> newdata2<-data.frame(Sepal.Length=c(7,8,6,5),
+ Sepal.Width=c(4,3,2,4),
+ Petal.Length=c(6,3.4,5.6,6.3),
+ Petal.Width=c(2,3,4,2.3))
> newdata2
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6.0 2.0
2 8 3 3.4 3.0
3 6 2 5.6 4.0
4 5 4 6.3 2.3
# perform prediction
> predict(fitTree,newdata2,type="class")
1 2 3 4
virginica virginica virginica virginica
Levels: setosa versicolor virginica
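Note that the iris example above passes the data frame to rpart and uses bare column names in the formula. That matters for the warning in the question: when the formula is written with d$tripduration and similar terms, predict looks those vectors up in d directly and ignores the columns of newdata, which is why all 576666 rows come back. The sketch below assumes the built-in iris data and shows that, with bare names plus data =, the default predict output is the per-row probability matrix the question expected; fit_ok, probs, and labels are illustrative names.

```r
library(rpart)
data(iris)

# Bare column names plus data = let predict() match the columns of newdata
fit_ok <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

newdata <- data.frame(Petal.Length = c(1.4, 4.5, 6.0),
                      Petal.Width  = c(0.2, 1.5, 2.2))

# Default for a classification tree: one row of class probabilities per new row
probs <- predict(fit_ok, newdata)
print(probs)

# type = "class" returns the predicted label instead
labels <- predict(fit_ok, newdata, type = "class")
print(labels)
```

For the data in the question, the analogous call would presumably be rpart(classLabel ~ tripduration + from_station_id + gender + birthday, data = d), with newdata using the same column names and types.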