rpart: How to get the "where" vector for validation dataset? - r

when fitting with rpart, it returns the "where" vector which tells which leave each record in the training dataset is on the tree. Is there a function which return something similar to this "where" vector for a test dataset?

I think the partykit package does what you want
library('rpart')
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit
rpart.plot::rpart.plot(fit)
Check with same data
set.seed(1)
idx <- sample(nrow(kyphosis), 5L)
fit$where[idx]
# 22 30 46 71 16
# 9 3 7 7 3
library('partykit')
fit <- as.party(fit)
predict(fit, kyphosis[idx, ], type = 'node')
# 22 30 46 71 16
# 9 3 7 7 3
Check with new data
dd <- kyphosis[idx, ]
set.seed(1)
dd[] <- lapply(dd, sample)
predict(fit, dd, type = 'node')
# 22 30 46 71 16
# 5 3 7 9 3
## so #46 should meet criteria for the 7th leaf:
with(kyphosis[46, ],
Start >= 8.5 & # node 1
Start < 14.5 & # node 2
Age >= 55 & # node 4
Age >= 111 # node 6
)
# [1] TRUE

As you mention, the function predict.rpart in the rpart package
doesn't have a where option (to show the leaf node number associated
with a prediction).
However, the rpart.predict function in the rpart.plot package
will do this. For example
> library(rpart.plot)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> rpart.predict(fit, newdata=kyphosis[1:3,], nn=TRUE)
gives (note the node number nn column):
absent present nn
1 0.42105 0.57895 3
2 0.85714 0.14286 22
3 0.42105 0.57895 3
And
> rpart.predict(fit, newdata=kyphosis[1:3,], nn=TRUE)$nn
gives just the where node numbers:
[1] 3 22 3
To show the rule for each prediction use
> rpart.predict(fit, newdata=kyphosis[1:5,], rules=TRUE)
which gives
absent present
1 0.42105 0.57895 because Start < 9
2 0.85714 0.14286 because Start is 9 to 15 & Age >= 111
3 0.42105 0.57895 because Start < 9

Related

When a variable switches from 1 to 2, delete some data from the other variables and average what's left?

I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write a code that does: when the "switch" column changes its value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e. leaving the 5 last data points for all the 4 variables), average the rest of the data points for O2 and CO2, and put them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or excel, but my real dataset would comprise thousands of data points and I would like to use R to do it automatically for me. So anyone has any ideas that could help me?
Please find my edits which should work for both regular and irregular
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
# First, we can properly turn your switch column into a grouping column (1,2,1,2)->(1,2,3,4)
grouplength <- rle(Data[,"switch"])$lengths
# mapply lets us input vector arguments into typically one/first-element only argument functions.
# In this case we create a sequence of lengths (output is a list/vector)
grouping <- mapply(seq, grouplength)
# Here we want it to become a single vector representing groups
groups <- mapply(rep, 1:length(grouplength), each = grouplength)
# if frequency was irregular, it will be a list, if regular it will be a matrix
# convert either into a vector by doing as follows:
if(class(grouping) == "list"){
groups <- unlist(groups)
} else {
groups <- as.vector(groups)
}
Data$group <- groups
#
# vector of the first row of each new switch (except the starting 0)
switchRow <- c(0,which(abs(diff(SwitchData)) == 1))+1
# I use "as.vector" to turn the matrix output of mapply into a sequence of numbers.
# "ToRemove" will have all the row numbers to get rid of from your original data, except for what happens before (in this case) row 10
ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow+(UnreliableRows)-1)))
# I concatenate the missing beginning (1,2,3,4,5) and theToRemove them with c() and then remove them from n with "-"
Keep <- seq(nrow(Data))[-c(1:UnreliableRows,ToRemove)]
# Create the new data, (in case you don't know: data[<ROW>,<COLUMN>])
newdat <- Data[-ToRemove,]
# print the results
newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try removing a vector with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to select always the last five rows from the group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for the situation of arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
if (nrow(x) <= 5) return(NULL)
x <- x[-(1:5),]
d.avg <- aggregate(cbind(O2, CO2) ~ 1, data=x, FUN=mean)
rbind(x, cbind(date=NA, switch=-1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe

Get 2D table (6x6) for dataframe containing two continuous variables, by binning

I am trying to partition observations in a data frame into 36 groups, based on two continuous variables. More specifically, I am trying to cut each of the two variables into six groups, and then group the observations in one of the 36 different possible groups.
My attempt is below, which works. But is there a faster way to do this that avoids the double for loops?
Also, this isn't necessary, but how could I visualize the total number of observations in each group in a 6 by 6 grid? I know table() would produce a list of the 36 possible groups and their totals, but not in grid format.
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1,x2)
labs1 <- levels(cut(x1, 6))
ints1 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs1)),
upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs1)))
labs2 <- levels(cut(x2, 6))
ints2 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs2)),
upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs2)))
tmp <- expand.grid(labs1, labs2)
groups <- cbind(lower1 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,1])),
upper1 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,1])),
lower2 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,2])),
upper2 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,2])))
for (i in 1:1000){
for (j in 1:36){
if (x1[i] >= groups[j,1] & x1[i] <= groups[j,2] &
x2[i] >= groups[j,3] & x2[i] <= groups[j,4]){
data$group[i] <- j
}
}
}
You can use a mix of apply() that will iterate thru your data.frame and which() that will iterate thru your groups array:
data$group <- apply(data, 1, FUN=function(dataRow)
which(
dataRow[1] >= groups[,1] &
dataRow[1] <= groups[,2] &
dataRow[2] >= groups[,3] &
dataRow[2] <= groups[,4]))
You're overthinking things. Getting your 6x6 tables is a one-liner with table(). (Directly use the helpful factor variable created by cut(..., 6), don't just throw away the factor then manually reapply its levels and bin your variables) :
with(data, table(cut(x1, 6), cut(x2, 6)))
(-3.05,-1.97] (-1.97,-0.902] (-0.902,0.171] (0.171,1.24] (1.24,2.32] (2.32,3.4]
(-2.82,-1.8] 2 10 11 7 3 0
(-1.8,-0.793] 1 26 67 49 19 3
(-0.793,0.216] 12 57 140 146 31 3
(0.216,1.22] 11 49 109 95 36 6
(1.22,2.23] 0 10 31 34 15 0
(2.23,3.25] 0 3 5 6 2 1
# and to get the wide lines, you may need...
options('width'=199)
# or if you want more compact labels to keep it all narrow, use `cut(..., dig.lab)`
with(data, table(cut(x1, 6, dig.lab=2), cut(x2, 6, dig.lab=2)))
(-3.1,-2] (-2,-0.9] (-0.9,0.17] (0.17,1.2] (1.2,2.3] (2.3,3.4]
(-2.8,-1.8] 2 10 11 7 3 0
(-1.8,-0.79] 1 26 67 49 19 3
(-0.79,0.22] 12 57 140 146 31 3
(0.22,1.2] 11 49 109 95 36 6
(1.2,2.2] 0 10 31 34 15 0
(2.2,3.2] 0 3 5 6 2 1
Admittedly the doc for both table() and cut() do not say so directly, and could use a 2D example like this. => Doc/Enhance-bug

bootstrap issue using R

Does it have any methods to print all selected data when bootstrap is complete?
Once I use the following code
library(boot)
set.seed(1234)
rsq = function(data,indices) {
d = data[indices,]
fit = lm(formula=mpg~wt+disp,data=d)
return(summary(fit)$r.square)
}
results = boot(data = mtcars, statistic = rsq, R=1000)
print(results)
plot(results)
boot.ci(results,conf=0.95,type=c('perc','bca'))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, conf = 0.95, type = c("perc", "bca"))
Intervals :
Level Percentile BCa
95% ( 0.6838, 0.8833 ) ( 0.6344, 0.8549 )
to get confidence interval.I want to print all select obs which selected by bootstrap method.
Thanks.
You could do something like this:
ind <- list()
rsq <- function(data,indices) {
d <- data[indices,]
ind[[length(ind)+1]] <<- indices
fit <- lm(formula=mpg~wt+disp,data=d)
return(summary(fit)$r.square)
}
Then all your 1000 sets of indices would be in the list ind.
Then maybe use unique to see which unique observations were sampled:
lapply(ind, unique)[1:2]
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[[2]]
[1] 29 4 3 15 16 7 32 5 31 17 28 20 26 19 10 18 1 6 24 8

Apply LR models to another dataframe

I searched SO, but I could not seem to find the right code that is applicable to my question. It is similar to this question: Linear Regression calculation several times in one dataframe
I got a dataframe of LR coefficients following Andrie's code:
Cddply <- ddply(test, .(sumtest), function(test)coef(lm(Area~Conc, data=test)))
sumtest (Intercept) Conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617
My question is how to apply each of these LR models (1-10) to specific row intervals in another dataframe in order to get x, the independent variable, into a 3rd column. For example, I would like to apply sumtest1 to Samples 6:29, sumtest2 to samples 35:50, sumtest3 to samples 56:79, etc.. in intervals of 24 and 16 samples. The sample numbers repeats after 200, so sumtest9 will be for Samples 6:29 again.
Sample Area
6 236211
7 724919
8 1259814
9 1574722
10 268836
11 863818
12 1261768
13 1591845
14 220322
15 608396
16 980182
17 1415859
18 276276
19 724532
20 1130024
21 1147840
22 252051
23 544870
24 832512
25 899457
26 285093
27 4291007
28 825922
29 865491
35 246707
36 538092
37 767269
38 852410
39 269152
40 971471
41 1573989
42 1897208
43 261321
44 481486
45 598617
46 769240
47 229695
48 782691
49 1380597
50 1725419
The resulting dataframe would look like this:
Sample Area Calc
6 236211 407.5312917
7 724919 985.1525288
8 1259814 1617.363812
9 1574722 1989.564693
10 268836 446.0919309
...
35 246707 365.2452551
36 538092 724.3591324
37 767269 1006.805521
38 852410 1111.736505
39 269152 392.9073207
Thank you for your assistance.
Is this what you want? I made up a slightly larger dummy data set of 'area' to make it easier to see how the code worked when I tried it out.
# create 400 rows of area data
set.seed(123)
df <- data.frame(area = round(rnorm(400, mean = 1000000, sd = 100000)))
# "sample numbers repeats after 200" -> add a sample nr 1-200, 1-200
df$sample_nr <- 1:200
# create a factor which cuts the vector of sample_nr into pieces of length 16, 24, 16, 24...
# repeat to a total length of the pieces is 200
# i.e. 5 repeats of (16, 24)
grp <- cut(df$sample_nr, breaks = c(-Inf, cumsum(rep(c(16, 24), 5))))
# add a numeric version of the chunks to data frame
# this number indicates the model from which coefficients will be used
# row 1-16 (16 rows): model 1; row 17-40 (24 rows): model 2;
# row 41-56 (16 rows): model 3; and so on.
df$mod <- as.numeric(grp)
# read coefficients
coefs <- read.table(text = "intercept beta_conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617", header = TRUE)
# add model number
coefs$mod <- rownames(coefs)
head(df)
head(coefs)
# join area data and coefficients by model number
# (use 'join' instead of merge to avoid sorting)
library(plyr)
df2 <- join(df, coefs)
# calculate conc from area and model coefficients
# area = intercept + beta_conc * conc
# conc = (area - intercept) / beta_conc
df2$conc <- (df2$area - df2$intercept) / df2$beta_conc
head(df2, 41)

using predict with a list of lm() objects

I have data which I regularly run regressions on. Each "chunk" of data gets fit a different regression. Each state, for example, might have a different function that explains the dependent value. This seems like a typical "split-apply-combine" type of problem so I'm using the plyr package. I can easily create a list of lm() objects which works well. However I can't quite wrap my head around how I use those objects later to predict values in a separate data.frame.
Here's a totally contrived example illustrating what I'm trying to do:
# setting up some fake data
set.seed(1)
funct <- function(myState, myYear){
rnorm(1, 100, 500) + myState + (100 * myYear)
}
state <- 50:60
year <- 10:40
myData <- expand.grid( year, state)
names(myData) <- c("year","state")
myData$value <- apply(myData, 1, function(x) funct(x[2], x[1]))
## ok, done with the fake data generation.
require(plyr)
modelList <- dlply(myData, "state", function(x) lm(value ~ year, data=x))
## if you want to see the summaries of the lm() do this:
# lapply(modelList, summary)
state <- 50:60
year <- 50:60
newData <- expand.grid( year, state)
names(newData) <- c("year","state")
## now how do I predict the values for newData$value
# using the regressions in modelList?
So how do I use the lm() objects contained in modelList to predict values using the year and state independent values from newData?
Here's my attempt:
predNaughty <- ddply(newData, "state", transform,
value=predict(modelList[[paste(piece$state[1])]], newdata=piece))
head(predNaughty)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
predDiggsApproved <- ddply(newData, "state", function(x)
transform(x, value=predict(modelList[[paste(x$state[1])]], newdata=x)))
head(predDiggsApproved)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
JD Long edit
I was inspired enough to work out an adply() option:
pred3 <- adply(newData, 1, function(x)
predict(modelList[[paste(x$state)]], newdata=x))
head(pred3)
# year state 1
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
You need to use mdply to supply both the model and the data to each function call:
dataList <- dlply(newData, "state")
preds <- mdply(cbind(mod = modelList, df = dataList), function(mod, df) {
mutate(df, pred = predict(mod, newdata = df))
})
A solution with just base R. The format of the output is different, but all the values are right there.
models <- lapply(split(myData, myData$state), 'lm', formula = value ~ year)
pred4 <- mapply('predict', models, split(newData, newData$state))
What is wrong with
lapply(modelList, predict, newData)
?
EDIT:
Thanks for explaining what is wrong with that. How about:
newData <- data.frame(year)
ldply(modelList, function(model) {
data.frame(newData, predict=predict(model, newData))
})
Iterate over the models, and apply the new data (which is the same for each state since you just did an expand.grid to create it).
EDIT 2:
If newData does not have the same values for year for every state as in the example, a more general approach can be used. Note that this uses the original definition of newData, not the one in the first edit.
ldply(state, function(s) {
nd <- newData[newData$state==s,]
data.frame(nd, predict=predict(modelList[[as.character(s)]], nd))
})
First 15 lines of this output:
year state predict
1 50 50 5176.326
2 51 50 5274.907
3 52 50 5373.487
4 53 50 5472.068
5 54 50 5570.649
6 55 50 5669.229
7 56 50 5767.810
8 57 50 5866.390
9 58 50 5964.971
10 59 50 6063.551
11 60 50 6162.132
12 50 51 5514.825
13 51 51 5626.160
14 52 51 5737.496
15 53 51 5848.832
I take it the hard part is matching each state in newData to the corresponding model.
Something like this perhaps?
predList <- dlply(newData, "state", function(x) {
predict(modelList[[as.character(min(x$state))]], x)
})
Here I used a "hacky" way of extracting the corresponding state model: as.character(min(x$state))
...There is probably a better way?
Output:
> predList[1:2]
$`50`
1 2 3 4 5 6 7 8 9 10 11
5176.326 5274.907 5373.487 5472.068 5570.649 5669.229 5767.810 5866.390 5964.971 6063.551 6162.132
$`51`
12 13 14 15 16 17 18 19 20 21 22
5514.825 5626.160 5737.496 5848.832 5960.167 6071.503 6182.838 6294.174 6405.510 6516.845 6628.181
Or, if you want a data.frame as output:
predData <- ddply(newData, "state", function(x) {
y <-predict(modelList[[as.character(min(x$state))]], x)
data.frame(id=names(y), value=c(y))
})
Output:
head(predData)
state id value
1 50 1 5176.326
2 50 2 5274.907
3 50 3 5373.487
4 50 4 5472.068
5 50 5 5570.649
6 50 6 5669.229
Maybe I'm missing something, but I believe lmList is the ideal tool here,
library(nlme)
ll = lmList(value ~ year | state, data=myData)
predict(ll, newData)
## Or, to show that it produces the same results as the other proposed methods...
newData[["value"]] <- predict(ll, newData)
head(newData)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229

Resources