Print dates without scientific notation in rpart classification tree - r

When I create an rpart tree that uses a date cutoff at a node, the print methods I use - both rpart.plot and fancyRpartPlot - print the dates in scientific notation, which makes it hard to interpret the result. Here's the fancyRpartPlot:
Is there a way to print this tree with more interpretable date values? This tree plot is meaningless as all those dates look the same.
Here's my code for creating the tree and plotting two ways:
library(rpart) ; library(rpart.plot) ; library(rattle)
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree)
fancyRpartPlot(my_tree)
Using this data:
# define a random date/time selection function
generate_days <- function(N, st="2012/01/01", et="2012/12/31") {
st = as.POSIXct(as.Date(st))
et = as.POSIXct(as.Date(et))
dt = as.numeric(difftime(et,st,units="secs"))
ev = runif(N, 0, dt)
rt = st + ev
rt
}
set.seed(1)
dat <- data.frame(
a = runif(1:100),
b = rpois(100, 5),
c = sample(c("hi","med","lo"), 100, TRUE),
d = generate_days(100)
)

From a practical standpoint, perhaps you'd like to just use days from the start of the data:
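# numeric days since the first day of the data (units made explicit)
dat$d <- as.numeric(difftime(dat$d, as.POSIXct(as.Date("2012/01/01")), units = "days"))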
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree,branch=1,extra=101,type=1,nn=TRUE)
This reduces the number to something manageable and meaningful (though perhaps not as meaningful as a specific date). You may even want to round it to the nearest day or week. (I can't install GTK+ on my computer, so I can't use fancyRpartPlot.)
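If the date column is stored as a number of days since the first day of the data, a split value reported by the tree can be mapped back to a calendar date by adding it to the start date. A minimal sketch, assuming a start date of 2012-01-01 in UTC; the two example timestamps and the cutoff of 60 days are made up for illustration:

```r
# Start of the data (assumed here to be 2012-01-01, UTC)
origin <- as.POSIXct("2012-01-01", tz = "UTC")

# Two illustrative timestamps (not taken from the question's data)
d <- as.POSIXct(c("2012-01-02 12:00:00", "2012-03-01 00:00:00"), tz = "UTC")

# Numeric days since the start; this is what the tree would split on
days_since_start <- as.numeric(difftime(d, origin, units = "days"))

# Map a hypothetical split value (in days) back to a calendar date
day_cutoff <- 60
cutoff_date <- as.Date(origin) + day_cutoff
format(cutoff_date)
```

Rounding `days_since_start` before fitting keeps the split values on whole days, at the cost of some resolution.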

One possible way is to use the digits option in print to examine the tree, and then as.POSIXlt to convert a cutoff to a date:
> print(my_tree,digits=100)
n= 100
node), split, n, deviance, yval
* denotes terminal node
1) root 100 7.0885590 0.5178471
2) d>=1346478795.049611568450927734375 33 1.7406368 0.4136051
4) b>=4.5 23 1.0294497 0.3654257 *
5) b< 4.5 10 0.5350040 0.5244177 *
3) d< 1346478795.049611568450927734375 67 4.8127122 0.5691901
6) d< 1340921905.3460228443145751953125 55 4.1140164 0.5368048
12) c=hi 28 1.8580913 0.4779574
24) d< 1335890083.3241622447967529296875 18 0.7796261 0.3806526 *
25) d>=1335890083.3241622447967529296875 10 0.6012662 0.6531062 *
13) c=lo,med 27 2.0584052 0.5978317
26) d>=1337494347.697483539581298828125 8 0.4785274 0.3843749 *
27) d< 1337494347.697483539581298828125 19 1.0618892 0.6877082 *
7) d>=1340921905.3460228443145751953125 12 0.3766236 0.7176229 *
## Get date on first node
> as.POSIXlt(1346478795.049611568450927734375,origin="1970-01-01")
[1] "2012-08-31 22:53:15 PDT"
The digits option is also available in rpart.plot and fancyRpartPlot:
rpart.plot(my_tree,digits=10)
fancyRpartPlot(my_tree, digits=10)
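The conversion shown above for the first node can be wrapped in a small helper and applied to every cutoff at once. This is only a sketch: `cutoff_to_date` is a hypothetical name, the cutoffs are the values printed above rounded to two decimals, and the time zone is assumed to be UTC (the output above was printed in PDT, so the clock times differ by seven hours).

```r
# Hypothetical helper: seconds since 1970-01-01 -> formatted date string
cutoff_to_date <- function(secs, tz = "UTC", fmt = "%Y-%m-%d %H:%M") {
  format(as.POSIXct(secs, origin = "1970-01-01", tz = tz), format = fmt)
}

# Split values copied (rounded) from the printed tree
cutoffs <- c(1346478795.05, 1340921905.35, 1335890083.32)

cutoff_to_date(cutoffs)                    # date and time, in UTC
cutoff_to_date(cutoffs, fmt = "%Y-%m-%d")  # date only
```

Pass the time zone the data was recorded in as `tz` if the day boundary matters for the interpretation.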

I don't know how important the specific chronological date is in your classification, but an alternative is to break your dates down by their characteristics. In other words, create indicator (1/0) bins for the year (2012, 2013, 2014, ...), the day of the week (Mon, Tue, Wed, Thu, Fri, ...), and maybe even the day of the month (1, 2, 3, ..., 31). This adds many more categories to classify by, but it avoids having to work with a fully formatted date.
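One way to derive such features in base R is through the fields of a POSIXlt object. The sketch below uses two made-up timestamps and keeps the features as factors rather than explicit 1/0 columns, since rpart splits on factor levels directly; the column names are illustrative.

```r
# Two illustrative timestamps (not the question's data)
d <- as.POSIXct(c("2012-08-31 22:53:15", "2012-01-01 00:00:00"), tz = "UTC")

# POSIXlt exposes the calendar components as list fields
lt <- as.POSIXlt(d)

features <- data.frame(
  year         = factor(lt$year + 1900),             # years are stored since 1900
  month        = factor(lt$mon + 1, levels = 1:12),  # months are stored as 0-11
  day_of_month = lt$mday,
  day_of_week  = factor(lt$wday, levels = 0:6,       # 0 = Sunday
                        labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
)
features
```

These columns can then replace (or accompany) the raw date column in the model formula.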

Related

How to find root with more than one unknown

fff5=function(x)x*31*24 * (1/(31*24))*0.30 + 400*31*24 * (1/(31*24))*0.025 + ( (10 * 31 * 24 - 100*31*24/20 )/(31*24) * 6 ) - 200
The function fff5 describes the cost of Amazon Elastic File System, where x is the GB of storage in the Standard plan (24 hours per day for 31 days), 400 is the GB of storage in EFS Infrequent Access (24 hours per day for 31 days), 10 is the throughput in MB/s (24 hours per day for 31 days), and 200 is the maximum budget.
When I run:
uniroot(fff5, lower=0, upper=1, extendInt = "yes",maxiter = 10000)$root
[1] 533.3333
This gives the largest number of GB that can be stored in the Standard plan (24 hours a day for 31 days), on top of the cost of 400 GB in Infrequent Access and 10 MB/s of throughput, within a maximum budget of 200:
> fff5(533.3333)
[1] -0.00001
> fff5(533.3334)
[1] 0.00002
How can I do the same for the other two unknowns (y, z)? That is, how do I find a root with more than one unknown, and how do I find all the combinations of x, y and z that make this function positive?
fff6=function(x,y,z)x*31*24 * (1/(31*24))*0.30 + y*31*24 * (1/(31*24))*0.025 + ( (z* 31 * 24 - 100*31*24/20 )/(31*24) * 6 ) - 200
The equation you propose is of the type
ax + by + cz + d = 0
which is the equation of a plane. This means that there are infinitely many solutions: every point on the plane defined by the equation is one.
Since there are infinitely many solutions, the only thing you can do is narrow the region in which you look for them as much as possible.
You can choose one unknown (for example x) and treat the other two as parameters.
Then assign reasonable values to y and z. Unfortunately I don't know what those variables represent, but I assume they have the same order of magnitude as the x found in the question (~ 500):
yy <- seq(400, 600, 10)
zz <- seq(400, 600, 10)
These two variables must be recombined in order to obtain a grid:
df_grid <- expand.grid(y = yy, z = zz)
ATTENTION: the longer the vectors, the heavier the calculation will be.
Now you can find the x solutions via uniroot (passing the y and z as numbers) and the solutions of your problem (within the chosen range) will be all triples x, y, z
fff6=function(x,y,z) { x*31*24 * (1/(31*24))*0.30 +
y*31*24 * (1/(31*24))*0.025 +
( (z* 31 * 24 - 100*31*24/20 )/(31*24) * 6 ) - 200
}
x_sol <- NULL
for (i in 1:nrow(df_grid)) {
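# extendInt = "yes" lets uniroot widen the interval when the root lies outside it
xs <- uniroot(fff6, c(-10000, 10000), y = df_grid$y[i], z = df_grid$z[i], extendInt = "yes")$root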
x_sol <- c(x_sol, xs)
}
df_grid$x <- x_sol
NOTE1: There are more elegant ways to avoid writing the previous for loop. For example:
x_sol <- mapply(function(y, z) uniroot(fff6, interval = c(-10000, 10000),
y = y, z = z, extendInt = "yes")$root, df_grid$y, df_grid$z)
df_grid$x <- x_sol
NOTE2: The range I have chosen shows negative solutions (which I suspect are not useful). A possible choice for obtaining positive solutions is:
yy <- seq(100, 300, 10)
zz <- seq(10, 30, 1)
Choose to search for solutions in an appropriate range!
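Because the equation is linear in x, the root for each (y, z) pair can also be written in closed form, with no numerical search. The sketch below is an alternative to the uniroot approach, not a replacement for it; `solve_x` is a hypothetical helper, obtained by simplifying fff6 to 0.30x + 0.025y + 6z - 230 = 0.

```r
# Cost function from the question (cost minus the budget of 200)
fff6 <- function(x, y, z) {
  x * 31 * 24 * (1 / (31 * 24)) * 0.30 +
    y * 31 * 24 * (1 / (31 * 24)) * 0.025 +
    ((z * 31 * 24 - 100 * 31 * 24 / 20) / (31 * 24) * 6) - 200
}

# fff6 simplifies to 0.30*x + 0.025*y + 6*z - 230, so solving for x gives:
solve_x <- function(y, z) (230 - 0.025 * y - 6 * z) / 0.30

# Grid of candidate y and z values (the ranges from NOTE2)
df_grid <- expand.grid(y = seq(100, 300, 10), z = seq(10, 30, 1))
df_grid$x <- solve_x(df_grid$y, df_grid$z)

# Residuals should be zero up to floating-point error
max(abs(fff6(df_grid$x, df_grid$y, df_grid$z)))
```

With y = 400 and z = 10 this gives the same x (about 533.33) as the one-variable function in the question.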

How to code the permutation equivalent of Mood's Median Test in R? (get the p values using permutation)

I can do it for the two-sample t test, but not for the median test, the Wilcoxon test or the Hodges-Lehmann test:
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143)) / length(null_dist))
I think this is what you're trying to do. I altered your code as little as possible. There are packages like infer that will do this for you, and a for loop is not the most efficient approach, but it's plenty good enough and may help you learn. Since we're looping anyway, I computed the mean and the median differences at the same time, because all other parts of the code are identical. ifelse is a nice, easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
shuffled_data <- sample(combined_data)
shuffled_2000 <- shuffled_data[1:7]
shuffled_2019 <- shuffled_data[8:14]
mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
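The same idea can be wrapped in a small reusable function that takes the statistic as an argument, so the mean, the median or any other location statistic goes through one code path. This is only a sketch: `perm_p_value` is a hypothetical name, and it counts permuted differences that are at least as extreme as the observed one (using >=), so its p-values can differ slightly from the strict comparison used above.

```r
# Two-sided permutation p-value for the difference stat(y) - stat(x)
perm_p_value <- function(x, y, stat = median, trials = 10000) {
  observed <- stat(y) - stat(x)
  pooled <- c(x, y)
  idx_x <- seq_len(length(x))
  perm_diffs <- replicate(trials, {
    shuffled <- sample(pooled)                    # relabel the observations
    stat(shuffled[-idx_x]) - stat(shuffled[idx_x])
  })
  # share of permutations at least as extreme as the observed difference
  mean(abs(perm_diffs) >= abs(observed))
}

data_2000 <- c(500, 450, 600, 700, 550, 551, 552)
data_2019 <- c(560, 460, 620, 720, 540, 600, 750)

set.seed(123)
p_median <- perm_p_value(data_2000, data_2019, stat = median)
p_mean   <- perm_p_value(data_2000, data_2019, stat = mean)
```

Any function that maps a numeric vector to a single number can be passed as `stat`.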
Following up on your question about the HL test, quoting Wikipedia:
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code.
Do NOT run it 100,000 times: the answer is the same every time, because you're already making all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
I'll update this when I figure out how to do the other tests.

Can I write my equation more efficiently in R?

I'm quite new to coding, so I don't know what the limits are for what I can do in R, and I haven't been able to find an answer for this particular kind of problem yet, although it probably has quite a simple solution.
For equation 2, A.1 is the starting value, but in each subsequent equation I need to use the previous answer (i.e. for A.3 I need A.2, for A.4 I need A.3, etc.).
A.1 <- start.x*(1-rate[1])+start.x*rate[1]
A.[2:n] <- A.[n-1]*(1-rate[2:n])+x*rate[2:n]
How do I set A.1 as the initial value, and is there a better way of writing equation 2 than to copy and paste the equation 58 times?
I've included the variables I have below:
A.1<- -13.2 # which is the same as start.x
x<- -10.18947 # x[2:n]
n<- 58
Age<-c(23:80)
rate <- function(Age){
Turnover<-(1/(1.0355*Age-3.9585))
return(Turnover)
}
I need to find the age at which A can be rounded to -11.3. I expect to see it from ages 56 to 60.
Using the new information, try this:
x<- -10.18947
n<- 58
Age <- 23:80
rate <- (1 / (1.0355 * Age - 3.9585))
A <- vector("numeric", 58)
A[1] <- -13.2
for (i in 2:n) {
A[i] <- A[i-1] * (1 - rate[i]) + x * rate[i]
}
Age[which.min(abs(A + 11.3))]
# [1] 58
plot(Age, A, type="l")
abline(h=-11.3, v=58, lty=3)
So the age at which A is closest to -11.3 is 58.
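The recurrence can also be written without an explicit loop by using Reduce with accumulate = TRUE, which carries the previous value of A forward at each step. A sketch using the same variables as above; `A_loop` and `A_reduce` are illustrative names, and the loop is kept only to compare the two results.

```r
x    <- -10.18947
n    <- 58
Age  <- 23:80
rate <- 1 / (1.0355 * Age - 3.9585)

# Reference: the explicit loop
A_loop <- numeric(n)
A_loop[1] <- -13.2
for (i in 2:n) {
  A_loop[i] <- A_loop[i - 1] * (1 - rate[i]) + x * rate[i]
}

# Same recurrence as a left fold over the indices 2..n
A_reduce <- Reduce(
  function(prev, i) prev * (1 - rate[i]) + x * rate[i],
  2:n,
  accumulate = TRUE,
  -13.2          # initial value A[1]
)

all.equal(A_loop, A_reduce)
```

For 58 values the loop is just as fast; the Reduce form mainly makes the dependence on the previous value explicit.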

Error in my math formula for implementing CUSUM in R

I'm trying to implement a check for decreasing values of avg temperatures to see when the temperature starts falling. See the chart of temperatures here:
Here is the formula I'm trying to implement:
Here is my code to implement that formula:
temps <- read.delim("temps.txt")
date_avgs <- rowMeans(temps[2:length(temps)], dims=1, na.rm=T)
mu <- 87
threshold <- 86
constant <- 3
date_avgs
S <- 0 * date_avgs
for (i in 2:length(date_avgs)) {
value <- S[i-1] + (mu - date_avgs[i] - constant)
cat("\nvalue", value, "si", date_avgs[i], i)
S[i] <- max(0, value)
if(S[i] >= threshold){
#Once I hit this for the first time, that indicates at this index the temp is decreasing
cat("\nDecreased past my threshold!!!", S[i] ,i)
}
}
But I'm not able to detect the change as I expect. My formula doesn't get over the threshold until index 108, when it should get there around index 60.
Here is the plot of my S (or CUSUM) values:
Any ideas what I'm doing wrong in my formula?
I think the problem is that mu <- mean(date_avgs) is the mean of all the observations, whereas mu should be the "mean of X if no change". So mu should be about 87, but according to your code and the plotted data it seems to be 80 or less.
# simulated data
set.seed(4422)
date_avgs <- c(runif(60, 84, 92), 88-(1:50)-rnorm(50,0,4))
plot(date_avgs)
# setting constants
mu <- 87
threshold <- 86
constant <- 3
# after running the for loop from the question
Index <- match(S[S >= threshold][1], S)
Index
[1] 75
# for data
> date_avgs[74]
[1] 73.41981
# Considering a lower threshold
# (as maximum allowable difference to detect trend 2 * C)
mu <- 87
threshold <- 6 # arbitrary
constant <- 3
# after running the for loop from the question
Index <- match(S[S >= threshold][1], S)
Index
[1] 66
So I think the code is fine; it may be the interpretation of mu and the threshold that is off.
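To make the "after running the for loop" step concrete, the loop from the question can be wrapped in a function that returns both the CUSUM values and the first index at which the threshold is reached. This is a sketch: `cusum_decrease` is a hypothetical name and the short input series is made up so that the result is easy to check by hand.

```r
# CUSUM for detecting a decrease: S[i] = max(0, S[i-1] + (mu - x[i] - C))
cusum_decrease <- function(x, mu, C, threshold) {
  S <- numeric(length(x))            # S[1] stays 0, as in the question's code
  for (i in seq_along(x)[-1]) {
    S[i] <- max(0, S[i - 1] + (mu - x[i] - C))
  }
  # first index where S reaches the threshold (NA if it never does)
  list(S = S, first_index = which(S >= threshold)[1])
}

# Illustrative series: stable at 87, then dropping to 80
x <- c(rep(87, 5), 80, 80, 80)
res <- cusum_decrease(x, mu = 87, C = 3, threshold = 6)
res$S
res$first_index
```

Here each observation of 80 adds 87 - 80 - 3 = 4 to S, so a threshold of 6 is crossed on the second low value.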

In the as.party function how can I clarify which are the indices for the different nodes?

After creating my CART with rpart, I convert it to a party object with the as.party function from the partykit package. The following error appears:
as.party(tree.hunterpb1)
Error in partysplit(varid = which(rownames(obj$split)[j] == names(mf)), :
‘index’ has less than two elements
I can only assume that it refers to the partitioning made by factor variables since, as I understand from the literature, the index applies to factors. My tree looks like this:
tree.hunterpb1
n= 354
node), split, n, deviance, yval
* denotes terminal node
1) root 354 244402.100 75.45134
2) hr.11a14>=49.2125 19 3378.322 33.44274 *
3) hr.11a14< 49.2125 335 205592.400 77.83391
6) month=April,February,June,March,May 141 58656.390 68.57493 *
7) month=August,December,January,July,November,October,September 194 126062.800 84.56338
14) presion.11a14>=800.925 91 74199.080 81.32755
28) month=January,November,October 16 9747.934 63.13394 *
29) month=August,December,July,September 75 58025.190 85.20885 *
15) presion.11a14< 800.925 103 50069.100 87.42223 *
The traceback shows that the first partition is converted to the party class correctly, but the second one, which is based on a factor variable, fails and produces the error above.
This error has not appeared before when working with similar data. I can only assume that the as.party function isn't finding the indices. Any advice on how to solve this would be appreciated.
Possibly, the problem is caused by the following situation. (Thanks to Yan Tabachek for e-mailing me a similar example.) If one of the partitioning variables passed on to rpart() is a character variable, then it is processed as if it were a factor by rpart() but not by the conversion in as.party(). As a simple example consider this small data set:
d <- data.frame(y = c(1:10, 101:110))
d$x <- rep(c("a", "b"), each = 10)
Fitting the rpart() tree treats the character variable x as a factor:
library("rpart")
(rp <- rpart(y ~ x, data = d))
## n= 20
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20 50165.0 55.5
## 2) x=a 10 82.5 5.5 *
## 3) x=b 10 82.5 105.5 *
However, the as.party() conversion does not work:
library("partykit")
as.party(rp)
## Error in partysplit(varid = which(rownames(obj$split)[j] == names(mf)), :
## 'index' has less than two elements
The best fix is to transform x to a factor variable and re-fit the tree. Then the conversion also works smoothly:
d$x <- factor(d$x)
rp <- rpart(y ~ x, data = d)
as.party(rp)
## Model formula:
## y ~ x
##
## Fitted party:
## [1] root
## | [2] x in a: 5.500 (n = 10, err = 82.5)
## | [3] x in b: 105.500 (n = 10, err = 82.5)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
I also added a fix in the development version of partykit on R-Forge to avoid the problem in the first place. It will be included in the next CRAN release (probably 1.0-1 for which a release date has not yet been scheduled).
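When a data set has several character columns, they can all be converted to factors in one step before fitting the tree. A minimal sketch, assuming the small example data from above; `chars_to_factors` is a hypothetical helper, and the rpart/partykit part only runs if both packages are installed.

```r
# Hypothetical helper: turn every character column of a data frame into a factor
chars_to_factors <- function(df) {
  df[] <- lapply(df, function(col) if (is.character(col)) factor(col) else col)
  df
}

d <- data.frame(y = c(1:10, 101:110),
                x = rep(c("a", "b"), each = 10),
                stringsAsFactors = FALSE)
d <- chars_to_factors(d)
str(d)

# Re-fit and convert only if both packages are available
if (requireNamespace("rpart", quietly = TRUE) &&
    requireNamespace("partykit", quietly = TRUE)) {
  rp <- rpart::rpart(y ~ x, data = d)
  print(partykit::as.party(rp))
}
```

Numeric and already-factor columns pass through unchanged, so the helper is safe to apply to the whole data frame.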
