Data Prediction using Decision Tree of rpart - r

I am using R to classify a data-frame called 'd' containing data structured like below:
The data has 576666 rows and the column "classLabel" has a factor of 3 levels: ONE, TWO, THREE.
I am making a decision tree using rpart:
fitTree = rpart(d$classLabel ~ d$tripduration + d$from_station_id + d$gender + d$birthday)
And I want to predict the values for the "classLabel" for newdata:
newdata = data.frame( tripduration=c(345,244,543,311),
from_station_id=c(60,28,100,56),
gender=c("Male","Female","Male","Male"),
birthday=c(1972,1955,1964,1967) )
p <- predict(fitTree, newdata)
I expect my result to be a matrix of 4 rows each with a probability of the three possible values for "classLabel" of newdata. But what I get as the result in p is a data frame of 576666 rows like below:
I also get the following warning when running the predict function:
Warning message:
'newdata' had 4 rows but variables found have 576666 rows
Where am I going wrong?

I think the problem is that you should add type = "class" in the prediction call:
predict(fitTree, newdata, type = "class")
Try the following code. I use the "iris" dataset in this example.
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# model fitting (load rpart first)
> library(rpart)
> fitTree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
#prediction-one row data
> newdata<-data.frame(Sepal.Length=7,Sepal.Width=4,Petal.Length=6,Petal.Width=2)
> newdata
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6 2
# perform prediction
> predict(fitTree, newdata,type="class")
1
virginica
Levels: setosa versicolor virginica
#prediction-multiple-row data
> newdata2<-data.frame(Sepal.Length=c(7,8,6,5),
+ Sepal.Width=c(4,3,2,4),
+ Petal.Length=c(6,3.4,5.6,6.3),
+ Petal.Width=c(2,3,4,2.3))
> newdata2
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 7 4 6.0 2.0
2 8 3 3.4 3.0
3 6 2 5.6 4.0
4 5 4 6.3 2.3
# perform prediction
> predict(fitTree,newdata2,type="class")
1 2 3 4
virginica virginica virginica virginica
Levels: setosa versicolor virginica
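Applied to the question's data frame d (using the column names given in the question), the same pattern would look like the sketch below. Fitting with bare column names and data = d is also what removes the warning: with d$... terms in the formula, predict() keeps reading the 576666-row training vectors instead of newdata.
library(rpart)
# fit with column names and a data argument, not d$... vectors
fitTree <- rpart(classLabel ~ tripduration + from_station_id + gender + birthday,
                 data = d)
predict(fitTree, newdata, type = "class")  # one predicted class per row of newdata
predict(fitTree, newdata, type = "prob")   # 4 x 3 matrix of class probabilities, as originally expected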

Related

R Defined Function to review numeric column and calculate log

I have a data frame with 10 variables. Three are factors and seven are numeric. I want to write a defined function that looks through each column, determines whether it is numeric, and, if it is, calculates the log.
Here's one simple way with the dplyr package -
your_df %>%
mutate_if(is.numeric, log)
As per comment, if you want to keep the original variables and add the logs as new variables -
your_df %>%
mutate_if(is.numeric, list(LG = ~ log(.x)))
Example -
head(iris) %>%
mutate_if(is.numeric, list(LG = ~ log(.x)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_LG Sepal.Width_LG Petal.Length_LG Petal.Width_LG
1 5.1 3.5 1.4 0.2 setosa 1.629241 1.252763 0.3364722 -1.6094379
2 4.9 3.0 1.4 0.2 setosa 1.589235 1.098612 0.3364722 -1.6094379
3 4.7 3.2 1.3 0.2 setosa 1.547563 1.163151 0.2623643 -1.6094379
4 4.6 3.1 1.5 0.2 setosa 1.526056 1.131402 0.4054651 -1.6094379
5 5.0 3.6 1.4 0.2 setosa 1.609438 1.280934 0.3364722 -1.6094379
6 5.4 3.9 1.7 0.4 setosa 1.686399 1.360977 0.5306283 -0.9162907
Using "dplyr" package you can select only numeric columns and calculate log. In my example I used "iris" dataset:
iris_1 <- as.data.frame(lapply(iris %>% select_if(is.numeric), log))
> head(iris_1)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 1.629241 1.252763 0.3364722 -1.6094379
2 1.589235 1.098612 0.3364722 -1.6094379
3 1.547563 1.163151 0.2623643 -1.6094379
4 1.526056 1.131402 0.4054651 -1.6094379
5 1.609438 1.280934 0.3364722 -1.6094379
6 1.686399 1.360977 0.5306283 -0.9162907
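If you want this wrapped up as a reusable, defined function that inspects each column, as the question asks, here is a minimal base R sketch (the name log_numeric_cols is just an illustration):
# hypothetical helper: take the log of every numeric column, leave the rest alone
log_numeric_cols <- function(df) {
  num <- vapply(df, is.numeric, logical(1))   # which columns are numeric?
  df[num] <- lapply(df[num], log)             # apply log only to those columns
  df
}
head(log_numeric_cols(iris))   # factor columns such as Species pass through unchanged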

Cycling through a list of dataframes with a for loop

I'm new here and not very experienced, and I'm trying to get a project in an R Shiny app to work.
I have a list of data frames which have a column labeled 'Gender' containing all/M/F. I want to filter all data frames based on the input, so that if the input is male, only rows containing M or all are kept.
list_tables <- list(adverb,adjective,simplenoun,verber,thingnoun,
personnoun,name_firstpart,name_secondpart)
input$gender <- "male
if(input$gender == "male"){
for (i in list_tables){
list_tables$i <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
}
Problem is, if I check the list afterwards, nothing has changed. If I do the same, but instead of using a for loop to cycle through the dataframes, I perform the same actions on only one dataframe, it does work. Theoretically, I could make a line of code for each dataframe separately, but it doesn't seem very neat and I have the feeling that the for loop should work but I'm just missing something. Would love to hear tips if anyone has them!
i is not a named entry within list_tables, so list_tables$i doesn't work (it looks for an element literally named "i"). Inside that loop, i is the data.frame you're trying to modify, but you never assign the result back into the list.
Try either:
for (ind in seq_along(list_tables)) {
i <- list_tables[[ind]] # feels a little sloppy, but it's compact ...
list_tables[[ind]] <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
or even better
list_tables <- lapply(list_tables, function(i) i[which((i$Gender=="M")|(i$Gender=="all")),])
You could use lapply with subset:
example:
list_tables <- replicate(2,iris[c(1,51,101),],F)
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
solution:
lapply(list_tables,subset,Species %in% c("setosa","virginica"))
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
In your case that would be:
lapply(list_tables,subset,Gender %in% c("M","all"))
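If you also want to handle the other possible input values, the same lapply idea extends naturally. A minimal sketch, assuming input$gender is one of "male", "female", or "all" (the mapping below is my assumption, not from the question):
# map the input choice to the Gender codes that should be kept
keep <- switch(input$gender,
               male   = c("M", "all"),
               female = c("F", "all"),
               all    = c("M", "F", "all"))
# plain indexing inside an anonymous function avoids subset()'s scoping quirks
list_tables <- lapply(list_tables, function(df) df[df$Gender %in% keep, ])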

biglm finds the wrong data.frame to take the data from

I am trying to create chunks of my dataset to run biglm (with fastLm I would need 350 GB of RAM).
My complete dataset is called res. As an experiment I drastically reduced the size to 10,000 rows. I want to create chunks to use with biglm.
library(biglm)
formula <- iris$Sepal.Length ~ iris$Sepal.Width
test <- iris[1:10,]
biglm(formula, test)
And somehow, I get the following output:
> test <- iris[1:10,]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Above you can see that test contains 10 rows. Yet when running biglm, it reports a sample size of 150:
> biglm(formula, test)
Large data regression model: biglm(formula, test)
Sample size = 150
It looks like it uses iris instead of test. How is this possible, and how do I get biglm to use chunk1 the way I intend it to?
I suspect the following line is to blame:
formula <- iris$Sepal.Length ~ iris$Sepal.Width
where in the formula you explicitly reference the iris dataset. This causes R to go and find the iris dataset when biglm is called, which it finds in the global environment (because of R's scoping rules).
In a formula you normally do not use vectors, but simply the column names:
formula <- Sepal.Length ~ Sepal.Width
This ensures that the formula contains only the column (or variable) names, which will then be looked up in the data that biglm is passed. So, biglm will use test instead of iris.
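For the question's reproducible example, the fix would look like the sketch below; feeding the real res data in chunks would then go through biglm's update() method (this usage is my sketch, not code from the original post):
library(biglm)
# fit on the first chunk, referencing columns by name and passing the chunk as data
formula <- Sepal.Length ~ Sepal.Width
test <- iris[1:10, ]
fit <- biglm(formula, data = test)
fit                          # now reports Sample size = 10
# feed a further chunk with biglm's update() method
fit <- update(fit, iris[11:20, ])
fit                          # Sample size = 20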

save residuals with `dplyr`

I want to use dplyr to group a data.frame, fit linear regressions and save the residuals as a column in the original, ungrouped data.frame.
Here's an example
> iris %>%
select(Species, Sepal.Length, Sepal.Width) %>%
group_by(Species) %>%
do(mod = lm(Sepal.Length ~ Sepal.Width, data = .))
Returns:
Species mod
1 setosa <S3:lm>
2 versicolor <S3:lm>
3 virginica <S3:lm>
Instead, I would like the original data.frame with a new column containing residuals.
For example,
Sepal.Length Sepal.Width resid
1 5.1 3.5 0.04428474
2 4.9 3.0 0.18952960
3 4.7 3.2 -0.14856834
4 4.6 3.1 -0.17951937
5 5.0 3.6 -0.12476423
6 5.4 3.9 0.06808885
I adapted an example from http://jimhester.github.io/plyrToDplyr/.
r <- iris %>%
group_by(Species) %>%
do(model = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
do((function(mod) {
data.frame(resid = residuals(mod$model))
})(.))
corrected <- cbind(iris, r)
Update: another method is to use the augment function from the broom package:
r <- iris %>%
group_by(Species) %>%
do(augment(lm(Sepal.Length ~ Sepal.Width, data = .)))
Which returns:
Source: local data frame [150 x 10]
Groups: Species
Species Sepal.Length Sepal.Width .fitted .se.fit .resid .hat
1 setosa 5.1 3.5 5.055715 0.03435031 0.04428474 0.02073628
2 setosa 4.9 3.0 4.710470 0.05117134 0.18952960 0.04601750
3 setosa 4.7 3.2 4.848568 0.03947370 -0.14856834 0.02738325
4 setosa 4.6 3.1 4.779519 0.04480537 -0.17951937 0.03528008
5 setosa 5.0 3.6 5.124764 0.03710984 -0.12476423 0.02420180
...
A solution that seems easier than the ones proposed so far, and closer to the code in the original question, is:
iris %>%
group_by(Species) %>%
do(data.frame(., resid = residuals(lm(Sepal.Length ~ Sepal.Width, data=.))))
Result :
# A tibble: 150 x 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species resid
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 0.0443
2 4.9 3 1.4 0.2 setosa 0.190
3 4.7 3.2 1.3 0.2 setosa -0.149
4 4.6 3.1 1.5 0.2 setosa -0.180
5 5 3.6 1.4 0.2 setosa -0.125
6 5.4 3.9 1.7 0.4 setosa 0.0681
7 4.6 3.4 1.4 0.3 setosa -0.387
8 5 3.4 1.5 0.2 setosa 0.0133
9 4.4 2.9 1.4 0.2 setosa -0.241
10 4.9 3.1 1.5 0.1 setosa 0.120
Since you are running the exact same regression for each group, you might find it simpler to define your regression model as a function beforehand and then apply it to each group using mutate().
model <- function(y, x){
  # y + x is NA wherever either input is NA, so this counts complete pairs
  a <- y + x
  if (length(which(!is.na(a))) <= 2) {
    # too few non-missing observations to fit: return all-NA residuals
    return(rep(NA, length(a)))
  } else {
    m <- lm(y ~ x, na.action = na.exclude)
    return(residuals(m))
  }
}
Note that the first part of this function is there to guard against error messages popping up when the regression is run on a group with too few non-missing observations (i.e. essentially no residual degrees of freedom). This can happen if you have a data frame with several grouping variables with many levels, or numerous independent variables in your regression (for example lm(y ~ x1 + x2)), and you can't afford to inspect each group for sufficient non-NA observations.
So your example can be rewritten as follows:
iris %>% group_by(Species) %>%
mutate(resid = model(Sepal.Length,Sepal.Width) ) %>%
select(Sepal.Length,Sepal.Width,resid)
Which should yield:
Species Sepal.Length Sepal.Width resid
<fctr> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 0.04428474
2 setosa 4.9 3.0 0.18952960
3 setosa 4.7 3.2 -0.14856834
4 setosa 4.6 3.1 -0.17951937
5 setosa 5.0 3.6 -0.12476423
6 setosa 5.4 3.9 0.06808885
This method should not be computationally much different from the one using augment(). (I've had to use both methods on data sets containing several hundred million observations, and I believe there was no significant difference in speed compared to using the do() function.)
Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will exclude the rows with NAs (dropped prior to estimation) from the output vector of residuals. That vector will then be shorter than the data set it has to be merged with, and an error message may appear.
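A toy illustration of the na.action point (this example is mine, not from the original answer):
d <- data.frame(y = c(1, 2, NA, 4), x = c(1, 2, 3, 4))
m1 <- lm(y ~ x, data = d)                           # default: na.omit
length(residuals(m1))                               # 3 -- too short to put back into d
m2 <- lm(y ~ x, data = d, na.action = na.exclude)
length(residuals(m2))                               # 4 -- the dropped row comes back as NA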

Why is using update on a lm inside a grouped data.table losing its model data?

Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone can explain why this is happening - what is update doing exactly?
I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects each for different groupings, and then update those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm and update.
Concrete example:
Starting with the iris data, first make the three species have different sample sizes, then fit an lm model to each species, then update those models:
set.seed(3)
DT = data.table(iris)
DT = DT[rnorm(150) < 0.9]
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
by = Species]
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]
The original data table has different numbers of each species
DT[,.N, by = Species]
# Species N
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
And the first fit confirms this:
fit[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
But the updated second fit shows 42 for all models:
fit2[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 42
# 2: versicolor 42
# 3: virginica 42
We can also look at the model component, which contains the data used for fitting, and see that all the models are indeed using the final group's data. The question is: how has this happened?
head(fit$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
head(fit$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
This is not an answer, but it is too long for a comment.
The .Environment attribute of the terms component is identical for each resulting model:
e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE
It appears that data.table is reusing the same bit of memory (my non-technical term) for each evaluation of j by group (which is efficient). However, when update is called it refits the model in that shared environment, which by then contains the values from the last group.
So, if you fudge this, it will work:
fit = DT[, {
  # give each model its own copy of the group's data as the formula environment
  xx <- list2env(copy(.SD))
  mymodel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length)
  attr(mymodel$terms, '.Environment') <- xx
  list(list(mymodel))
}, by = 'Species']
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
V1 V2 V3
1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
Species V1
1: setosa 41
2: versicolor 39
3: virginica 42
Not a long-term solution, but at least a workaround.
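Another possible workaround, which is not from the original thread and is offered only as an untested sketch: supply each stored model's own model frame explicitly when updating, so the refit never consults the shared environment at all.
# sketch: refit from the stored model frame rather than from the shared j environment
fit3 <- fit[, list(list(update(V1[[1]], ~ . - Sepal.Width, data = V1[[1]]$model))),
            by = Species]
fit3[, nobs(V1[[1]]), by = Species]   # should report 41, 39, 42 again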
