Effects of Observations on Decision Tree Prediction using rpart (R package)

I'm very new to machine learning so I apologize if the answer to this is very obvious.
I'm using a decision tree, built with the rpart package, to try to predict when a structure fire may result in a fatality, using a variety of variables related to that fire such as the cause, the extent of damage, etc.
The chance of a fatality resulting from a structure fire is about 1 in 100.
In short, I have about 154,000 observations in my training set. I have noticed that when I use the full training set, the complexity parameter cp has to be reduced all the way down to .0003:
> rpart(Fatality~.,data=train_val,method="class", control=rpart.control(minsplit=50,minbucket = 1, cp=0.00035))
n= 154181
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 154181 1881 0 (0.987800053 0.012199947)
2) losscat=Minor_Loss,Med_Loss 105538 567 0 (0.994627528 0.005372472) *
3) losscat=Major_Loss,Total_Loss 48643 1314 0 (0.972986863 0.027013137)
6) HUM_FAC_1=3,6,N, 46102 1070 0 (0.976790595 0.023209405) *
7) HUM_FAC_1=1,2,4,5,7 2541 244 0 (0.903974813 0.096025187)
14) AREA_ORIG=21,24,26,47,72,74,75,76,Other 1846 126 0 (0.931744312 0.068255688)
28) CAUSE_CODE=1,2,5,6,7,8,9,10,12,14,15 1105 45 0 (0.959276018 0.040723982) *
29) CAUSE_CODE=3,4,11,13,16 741 81 0 (0.890688259 0.109311741)
58) FIRST_IGN=10,12,15,17,18,Other,UU 690 68 0 (0.901449275 0.098550725) *
59) FIRST_IGN=00,21,76,81 51 13 0 (0.745098039 0.254901961)
118) INC_TYPE=111,121 48 10 0 (0.791666667 0.208333333) *
119) INC_TYPE=112,120 3 0 1 (0.000000000 1.000000000) *
15) AREA_ORIG=14,UU 695 118 0 (0.830215827 0.169784173)
30) CAUSE_CODE=1,2,4,7,8,10,11,12,13,14,15,16 607 86 0 (0.858319605 0.141680395) *
31) CAUSE_CODE=3,5,6,9 88 32 0 (0.636363636 0.363636364)
62) HUM_FAC_1=1,2 77 24 0 (0.688311688 0.311688312) *
63) HUM_FAC_1=4,5,7 11 3 1 (0.272727273 0.727272727) *
However, when I just grab the first 10,000 observations (in no meaningful order), I can run with a cp of .01:
> rpart(Fatality~., data = test, method = "class",
+ control=rpart.control(minsplit=10,minbucket = 1, cp=0.01))
n= 10000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 10000 112 0 (0.988800000 0.011200000)
2) losscat=Minor_Loss,Med_Loss 6889 26 0 (0.996225867 0.003774133) *
3) losscat=Major_Loss,Total_Loss 3111 86 0 (0.972356156 0.027643844)
6) HUM_FAC_1=3,7,N 2860 66 0 (0.976923077 0.023076923) *
7) HUM_FAC_1=1,2,4,5,6 251 20 0 (0.920318725 0.079681275)
14) CAUSE_CODE=1,3,4,6,7,8,9,10,11,14,15 146 3 0 (0.979452055 0.020547945) *
15) CAUSE_CODE=5,13,16 105 17 0 (0.838095238 0.161904762)
30) weekday=Friday,Monday,Saturday,Tuesday,Wednesday 73 6 0 (0.917808219 0.082191781) *
31) weekday=Sunday,Thursday 32 11 0 (0.656250000 0.343750000)
62) AREA_ORIG=21,26,47,Other 17 2 0 (0.882352941 0.117647059) *
63) AREA_ORIG=14,24,UU 15 6 1 (0.400000000 0.600000000)
126) month=2,6,7,9 7 1 0 (0.857142857 0.142857143) *
127) month=1,4,10,12 8 0 1 (0.000000000 1.000000000) *
Why does a greater number of observations result in my having to reduce the complexity parameter? Intuitively I would think it should be the opposite.
Is having to reduce cp to .0003 "bad"?
Generally, is there any other advice for improving the effectiveness of a decision tree, especially when predicting something that has such a low probability in the first place?

cp, from what I read, is a parameter that is used to decide when to stop adding more leaves to the tree (for a node to be considered for another split, the improvement in the relative error from allowing the new split must be more than that cp threshold). Thus, the lower the number, the more leaves it can add. More observations implies that there is an opportunity to lower the threshold; I'm not sure I understand that you "have to" reduce cp... but I could be wrong. If this is a very rare event and your data doesn't lend itself to showing significant improvement in the early stages of the model, it may require that you "increase the sensitivity" by lowering the cp... but you probably know your data better than me.
If you're modeling a rare event, no. If it's not a rare event, the lower your cp the more likely you are to overfit to the bias of your sample. I don't think that minbucket=1 ever leads to a model that is interpretable, either... for similar reasons.
Decision Trees, to me, don't make very much sense beyond 3-4 levels unless you really believe that these hard cuts truly create criteria that justify a final "bucket"/node or a prediction (e.g. if I wanted to bucket you into something financial like a loan or insurance product that fits your risk profile, and my actuaries made hard cuts to split the prospects). After you've split your data 3-4 times, producing a minimum of 8-16 nodes at the bottom of your tree, you've essentially built a model that could be thought of as 3rd or 4th order interactions of independent categorical variables. If you put 20 statisticians (not econo-missed's) in a room and ask them about the number of times they've seen significant 3rd or 4th order interactions in a model, they'd probably scratch their heads. Have you tried any other methods? Or started with dimension reduction? More importantly, what inferences are you trying to make about the data?
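A hedged sketch of how one might act on that advice (variable names follow the question; the specific loss values are illustrative assumptions, not recommendations):
library(rpart)
# Grow a deliberately deep tree, then inspect the cross-validated error (xerror)
# across cp values instead of choosing cp by eye.
fit <- rpart(Fatality ~ ., data = train_val, method = "class",
             control = rpart.control(minsplit = 50, cp = 0.0001))
printcp(fit)   # table of cp versus cross-validated error
plotcp(fit)    # pick the cp with the lowest xerror (or within one SE of it)
pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
# For a rare outcome, an asymmetric loss matrix makes a missed fatality cost more
# than a false alarm (the 20 below is purely illustrative):
fit_loss <- rpart(Fatality ~ ., data = train_val, method = "class",
                  parms = list(loss = matrix(c(0, 20, 1, 0), nrow = 2)),
                  control = rpart.control(minsplit = 50, cp = 0.001))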

Related

How would I build a selfStart with custom formula or insert my formula into nls()?

Please bear with me, as this is my first post in my first month of starting with R. I have some biphasic decay data, an example of which is included below:
 N       Time    Signal
 1  0.0001101  2.462455
 2  0.0002230  2.362082
 3  0.0003505  2.265309
 4  0.0004946  2.180061
 5  0.0006573  2.136348
 6  0.0008411  2.071639
 7  0.0010487  2.087519
 8  0.0012832  1.971550
 9  0.0015481  2.005190
10  0.0018473  1.969274
11  0.0021852  1.915299
12  0.0025669  1.893703
13  0.0029981  1.905901
14  0.0034851  1.839294
15  0.0040352  1.819827
16  0.0046565  1.756207
17  0.0053583  1.704472
18  0.0061510  1.630652
19  0.0070464  1.584315
20  0.0080578  1.574424
21  0.0092002  1.493813
22  0.0104905  1.349054
23  0.0119480  1.318979
24  0.0135942  1.242094
25  0.0154536  1.115491
26  0.0175539  1.065381
27  0.0199262  0.968143
28  0.0226057  0.846351
29  0.0256323  0.765699
30  0.0290509  0.736105
31  0.0329122  0.588751
32  0.0372736  0.539969
33  0.0421999  0.467340
34  0.0477642  0.389153
35  0.0540492  0.308323
36  0.0611482  0.250392
37  0.0691666  0.247006
38  0.0782235  0.177039
39  0.0884534  0.174750
40  0.1000082  0.191918
I have multiple curves to fit with a double falling exponential of the general form
Signal = A1*exp(-k1*T) + A2*exp(-k2*T)
where some fraction of the particle A (A1) decays fast (described by k1) and the remaining fraction (A2) decays slowly (described by k2); A is the particle fraction, k1 is the fast rate, k2 is the slow rate, and T is time. I believe this should be entered as
DFE <- y ~ (A*exp(-c*t)) + ((A-b)*exp(-d*t))
I would like to create a selfStart code to apply to over 40 sets of data without having to guess the start values each time. I found some R documentation for this, but can't figure out where to go from here.
The problem is that I am very new to R (and programming in general) and really don't know how to do this. I have had success (meaning convergence was achieved) with
nls(Signal~ SSasymp(Time, yf, y0, log_alpha), data = DecayData)
which is a close estimate but not a truly good model. I was hoping I could somehow alter the SSasymp code to work with my equation, but I think that I am perhaps too naive to know even where to begin.
I would like to compare the asymptotic model with my double falling exponential, but the double falling exponential model never seems to reach convergence despite many, many, many trials and permutations. At this point, I am not even sure if I have entered the formula correctly anymore. So, I am wondering how to write a selfStart that would ideally give me extractable coefficients/half-times.
Thanks so much!
Edit:
As per Chris's suggestion in the comments, I have tried to insert the formula itself into the nls() command like so:
DFEm = nls("Signal" ~ (A*exp(-c*Time)) + ((A-b)*exp(-d*Time)), data = "Signal", trace= TRUE)
which returns
"Error in nls(Signal ~ (A * exp(-c * Time)) + ((A - b) * exp(-d * : 'data' must be a list or an environment"
So I am unsure of how to proceed, as I've checked spelling and capitalization. Is there something silly that I am missing?
Thanks in advance!
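A hedged sketch (added here, not from the original post): the error above comes from passing the string "Signal" to data, which must be the data frame itself. Beyond that, base R already ships a selfStart biexponential model, stats::SSbiexp, which matches a double falling exponential and estimates its own starting values. Assuming the data frame from the table above is called DecayData:
# SSbiexp fits Signal = A1*exp(-exp(lrc1)*Time) + A2*exp(-exp(lrc2)*Time),
# i.e. the sum of two decaying exponentials with log rate constants lrc1 and lrc2.
fit <- nls(Signal ~ SSbiexp(Time, A1, lrc1, A2, lrc2), data = DecayData)
summary(fit)
exp(coef(fit)[c("lrc1", "lrc2")])            # fast and slow rate constants (k1, k2)
log(2) / exp(coef(fit)[c("lrc1", "lrc2")])   # corresponding half-lives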

Error when specifying correct model with svydesign, R survey package

I am sampling from a dataset I created myself. It is a two-stage cluster sample. However, I cannot seem to specify my design (the way I would want to) without getting an error.
I have created a database based on information I have from census EA data from Zanzibar.
The data contain 2 districts. District 1 has 32 subunits (called Shehias) and District 2 has 29. In turn, each of the 61 Shehias has between 2 and 19 Enumeration Areas (EAs). EAs themselves contain between 51 and 129 households.
The data selection process is the following: all (2) districts and all (61) Shehias are included. In each Shehia, 2 EAs are selected at random. In each selected EA, 22 or 26 households (depending on the district) are selected. All household members should be selected.
Hence this is a two-stage clustering process. The Primary Sampling Units (PSUs) are the EAs, the Secondary Sampling Units (SSUs) are the households. Both selections are at random.
These are the first six rows of the selected data called strategy_2:
District_C Shehia_Code EA_Code HH_Number District_Numb District_Shehias Shehia_EAs HH_in_EA Prev_U3R3
1 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
2 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
3 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
4 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
5 2 2_11 510201107001_1 510201107001_1_1165 1 29 19 115 0
6 2 2_11 510201107001_1 510201107001_1_1173 1 29 19 115 1
If I spell out the whole process (including things as clusters that actually are not), then my design ought to be:
strategy_2_Design <- svydesign(id = ~ District_C + Shehia_Code + EA_Code + HH_Number,
fpc = ~ District_Numb + District_Shehias + Shehia_EAs + HH_in_EA,
data = strategy_2)
Here I define the district and the number of districts in the survey as well as the same for Shehias. In both cases sample pop = population pop so the weight contribution is 1 at each stage. The third and fourth element are the actual sampling units.
This design gives me a correct estimate (the weights are correct), but the model has only one design degree of freedom (2 districts - 1). Hence, when I try to calculate values for subunits of Shehias through svyby, it can calculate means, but if I use svyciprop as FUN the confidence interval is NA because the degrees of freedom of the subset are 0.
Trying to reduce the model down to the two stages I truly am using does not work. Namely
strategy_2_Alt_1 <- svydesign(id = ~ EA_Code + HH_Number,
fpc = ~ Shehia_EAs + HH_in_EA,
data = strategy_2)
yields:
record 1 stage 1 : popsize= 19 sampsize= 122
Error in as.fpc(fpc, strata, ids, pps = pps) :
FPC implies >100% sampling in some strata
Note that 19 is the number of subunits (EAs) in that (first) PSU, and 122 is the number of EAs in the whole sample (2 for each of the 61 Shehias, thus 122).
One way around this could be to claim that the EAs were stratified by Shehia. This would be:
strategy_2_Alt_2 <- svydesign(id = ~ EA_Code + HH_Number,
fpc = ~ Shehia_EAs + HH_in_EA,
strata = ~ Shehias_Cat + NULL,
data = strategy_2)
Shehias_Cat simply contains the name of the Shehia each EA is in. This gives a stratified 2-level cluster sampling design with (122, 2916) clusters.
The weights here are the same as in the first design (strategy_2_Design):
> identical(weights(strategy_2_Design),weights(strategy_2_Alt_2))
[1] TRUE
Hence if I calculate the mean using the weights by hand I get the same result. However, if I try to use svymean to do this calculation, I get an error:
> svymean(~Prev_U3R3, strategy_2_Alt_2)
Error in v.sub[[i]] : subscript out of bounds
In addition: Warning message:
In by.default(1:n, list(as.numeric(clusters[, 1])), function(index) { :
NAs introduced by coercion
So my questions are: 1) where do these errors come from, and 2) how do I define my model correctly? I have been trying to think about this in many ways but do not seem to get it right.
The data and my code to reproduce this issue are available at https://www.dropbox.com/sh/u1ajzxaxgue57r8/AAAkCfPC2YrwhEq6gbLsQmGQa?dl=0.
I think you want
strategy_2_SHORT_Design <- svydesign(id = ~ factor(EA_Code) + HH_Number,
fpc = ~ Shehia_EAs + HH_in_EA,
strata = ~ Shehias_Cat,
data = strategy_2)
The design has households sampled within EA, within strata defined by shehias, and the population size in EAs is given by Shehia_EAs and then the size in households is given by HH_in_EA. In your data, EA_Code was a character variable, but it has to be numeric or factor.
The documentation for svydesign should make this clear, but doesn't, presumably because of the default conversion of strings to factors back in primitive times when the function was written.
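A hedged usage sketch (assuming strategy_2 is loaded as in the question): converting EA_Code up front is equivalent to wrapping it in factor(), after which the calls that previously errored should run.
library(survey)
strategy_2$EA_Code <- factor(strategy_2$EA_Code)   # id variables must be numeric or factor
strategy_2_SHORT_Design <- svydesign(id     = ~ EA_Code + HH_Number,
                                     fpc    = ~ Shehia_EAs + HH_in_EA,
                                     strata = ~ Shehias_Cat,
                                     data   = strategy_2)
svymean(~Prev_U3R3, strategy_2_SHORT_Design)     # design-based mean and SE
svyciprop(~Prev_U3R3, strategy_2_SHORT_Design)   # proportion with a confidence interval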

R - How to Speed Up Recursion and Double Summation

Since this is essentially a question about how to efficiently perform a computation in R, I will start with the equation and then provide an explanation for the problem after the code for those who would find it useful or interesting.
I have written a script in R to generate values using the following function:
P(n, t) = sum_{i = t..n} C(n, i) (1/3)^i (2/3)^(n-i)
          + sum_{k = 1..t-1} sum_{j = 0..k-1} C(n, n-k) (1/6)^j (1/6)^(k-j) (2/3)^(n-k) P(k-j, t-k)
The function, as you can see, is recursive and involves a double summation. It works well for small numbers around 15 or lower, but the execution time gets prohibitively long at higher values of n and t. I need to be able to perform the calculation for every n and t pair from 1 to 30. Is there a way to write a script that won't take months to execute?
My current script is:
explProb <- function(n, t) {
  prob <- 0
  #################################
  # FIRST PART - SINGLE SUMMATION
  #################################
  i <- 0
  if (t <= n) {
    i <- c(t:n)
  }
  prob = sum(choose(n, i[i > 0]) * ((1/3)^(i[i > 0])) * ((2/3)^(n - i[i > 0])))
  #################################
  # SECOND PART - DOUBLE SUMMATION
  #################################
  if (t >= 2) {
    for (k in 1:(t - 1)) {
      j <- c(0:(k - 1))
      prob = prob + sum(choose(n, n - k) * ((1/6)^j) * ((1/6)^(k - j)) * ((2/3)^(n - k)) * explProb(k - j, t - k))
    }
  }
  return(prob)
}
MAX_DICE = 30
MAX_THRESHOLD = 30
probabilities = matrix(0, MAX_DICE, MAX_THRESHOLD)
for (dice in 1:MAX_DICE) {
  for (threshold in 1:MAX_THRESHOLD) {
    # print(sprintf("DICE = %d : THRESH = %d", dice, threshold))
    probabilities[dice, threshold] = explProb(dice, threshold)
  }
}
I am trying to write a script to generate a set of probabilities for a particular type of dice roll in a tabletop roleplaying game (Shadowrun 5th Edition, to be specific). The type of dice roll is called an "Exploding Dice Roll". In case you are not familiar with how these rolls work in this game, let me briefly explain.
Whenever you try to accomplish a task you make a test by rolling a number of six-sided dice. Your goal is to get a predetermined number "hits" when rolling those dice. A "hit" is defined as a 5 or 6 on a six-sided die. So, for example, if you have a dice pool of 5 dice, and you roll: 1, 3, 3, 5, 6 then you have gotten 2 hits.
In some cases you are allowed to re-roll all of the 6's that were rolled in order to try and get MORE hits. This is called an "exploding" roll. The 6's count as hits, but can be re-rolled to "explode" into even more hits. For clarification I'll give a quick example...
If you roll 10 dice and get a result of 1, 2, 2, 4, 5, 5, 6, 6, 6, 6 then you have gotten 6 hits on the first roll... However, the 4 dice that rolled 6's can be re-rolled again. If you roll those dice and get 3, 5, 6, 6 then you have 3 more hits, for a total of 9 hits. But you can now re-roll the two sixes you got... etc... You keep re-rolling the sixes, adding the 5's and 6's to your total hits, until you get a roll with no sixes.
The function listed above generates these probabilities taking an input of "# of dice" and "number of hits" (called a "threshold" here).
n = # of Dice being rolled
t = Threshold number of "hits" to be reached
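For instance, a single entry of that matrix corresponds to one call of the function above (an added illustration):
explProb(10, 4)   # probability of reaching a threshold of t = 4 hits with n = 10 exploding dice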
Calculation with Transition Matrix
If we have n=10 dice, then the probability of 0 to 10 occurrences of an event with prob=2/6 may be efficiently calculated in R as
dbinom(0:10,10,2/6)
Since you are allowed to keep rolling until failure, any number of ultimate hits is possible (the support of the distribution is [0,Inf)), albeit with geometrically diminishing probabilities. A recursive numeric solution is feasible due to the need to establish a cutoff for machine precision and the presence of a threshold to censor.
Since rerolls are with a smaller number of dice, it makes sense to precalculate all transition probabilities.
X<-outer(0:10,0:10,function(x,size) dbinom(x,size,2/6))
Where the entry in the i-th row and j-th column gives the probability of (i-1) successes (hits) with (j-1) trials (dice rolled). For example, the probability of exactly 1 success with 6 trials is located at X[2,7].
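As a quick sanity check of that indexing convention (an added sketch, not from the original answer):
X[2, 7] == dbinom(1, 6, 2/6)   # exactly 1 success in 6 trials
#> [1] TRUE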
Now if you start out with 10 dice, we can represent this as the vector
d<-c(rep(0,10),1)
Showing that with probability 1 we have 10 dice with 0 probability everywhere else.
After a single roll, the probabilities of the number of live dice is X %*% d.
After two rolls, the probabilities are X %*% X %*% d. We can calculate the live dice state probabilities after any number of rolls by iterating.
T<-Reduce(function(dn,n) X %*% dn,1:11,d,accumulate=TRUE)
Where T[1] gives the probabilities of live dice before the first roll and T[11] gives the probabilities of live dice before the 11th (after the 10th).
This is sufficient to calculate expected values, but for the distribution of cumulative sums, we'll need to track additional information in the state. The following function reshapes a state matrix at each step so that the i-th row and j-th column has the probability of (i-1) live dice with a current cumulative total of j-1.
step<-function(m) {
idx<-arrayInd(seq_along(m),dim(m))
idx[,2]<-rowSums(idx)-1
i<-idx[nrow(idx),]
m2<-matrix(0,i[1],i[2])
m2[idx]<-m
return(m2)
}
In order to recover the probabilities for cumulative totals, we use the following convenience function to sum across anti-diagonals
conv<-function(m)
tapply(c(m),c(row(m)+col(m)-2),FUN=sum)
The probabilities of continuing to roll rapidly diminish, so I've cut off at 40, and shown up to 20, rounded to 4 places
round(conv(Reduce(function(mn,n) X %*% step(mn), 1:40, X %*% d))[1:21],4)
#> 0 1 2 3 4 5 6 7 8 9
#> 0.0173 0.0578 0.1060 0.1413 0.1531 0.1429 0.1191 0.0907 0.0643 0.0428
#>
#> 10 11 12 13 14 15 16 17 18 19
#> 0.0271 0.0164 0.0096 0.0054 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001
Calculation with Simulation
This can also be calculated in reasonable time with reasonable precision using simple simulation.
We simulate a roll of n 6-sided dice with sample(1:6,n,replace=TRUE), calculate the number to re-roll, and iterate until none are available, counting "hits" along the way.
sim<-function(n) {
k<-0
while(n>0) {
roll<-sample(1:6,n,replace=TRUE)
n<-sum(roll>=5)
k<-k+n
}
return(k)
}
Now we can simply replicate a large number of trials and tabulate
prop.table(table(replicate(100000,sim(10))))
#> 0 1 2 3 4 5 6 7 8 9
#> 0.0170 0.0588 0.1053 0.1431 0.1518 0.1433 0.1187 0.0909 0.0657 0.0421
#>
#> 10 11 12 13 14 15 16 17 18 19
#> 0.0252 0.0161 0.0102 0.0056 0.0030 0.0015 0.0008 0.0004 0.0002 0.0001
This is quite feasible even with 30 dice (a few seconds even with 100,000 replications).
Efficient Calculation Using Probability Distributions
The approach in the question and in my other answer uses sums over transitions of dependent binomial distributions. The dependency, arising from the carry-over of previous successes (hits) to subsequent trials (rolls), complicates the calculations.
An alternative approach is to view each die separately: roll a single die as long as it keeps turning up as a hit. Each die is independent of the others, so the random variables may be summed efficiently through convolution. However, the distribution for each die is a geometric distribution, and the sum of independent geometric distributions gives rise to a negative binomial distribution.
R provides the negative binomial distribution, so the results obtained in my other answer may be had all at once by
round(dnbinom(0:19,10,prob=2/3),4)
[1] 0.0173 0.0578 0.1060 0.1413 0.1531 0.1429 0.1191 0.0907 0.0643 0.0428
[11] 0.0271 0.0164 0.0096 0.0054 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001
The probability matrix in the question, with MAX_DICE=MAX_THRESHOLD=10, has first column equal to
1-dnbinom(0,1:10,prob=2/3)
So, you might be looking for the cumulative distribution function. I have not been able to figure out your intentions with the subsequent columns, but perhaps the goal was
outer(1:10,0:10,function(size,x) 1-dnbinom(x,size,prob=2/3))
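A hedged follow-up sketch (not part of the original answer): if the matrix is meant to hold the probability of reaching at least t hits with n dice, then under the same reroll-on-any-hit model used in this answer the upper tail of the negative binomial gives the whole 30 x 30 matrix at once, generalizing the first-column check above:
MAX_DICE <- 30
MAX_THRESHOLD <- 30
# P(total hits >= t) with n dice = upper tail of the negative binomial
probabilities <- outer(1:MAX_DICE, 1:MAX_THRESHOLD,
                       function(n, t) 1 - pnbinom(t - 1, size = n, prob = 2/3))
round(probabilities[1:5, 1:5], 4)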

Working with Self Organizing Maps - How do I interpret the results?

I have this data set that I thought would be a good candidate for making a SOM.
So, I converted it to text thusly:
10
12 1 0 0
13 3 0 0
14 21 0 0
19 1983 15 0
24 5329 48 0
29 4543 50 0
34 3164 32 0
39 1668 22 1
44 459 4 0
49 17 0 0
I'm using Octave, so I transformed the data with these commands:
dataIn = fopen('data.txt','r');
n = fscanf(dataIn,'%d',1);
D = fscanf(dataIn,'%f'); % D is a column vector of the remaining values
D = D'; % transpose the data; D is now a row vector
D = reshape(D, 4, []); % give D the shape of a 4 x n/4 matrix
D = D(2:4, :); % the dimensions to be used for the SOM will come from the bottom three rows
Now, I'm applying an SOM script to produce a map using D.
The script is here
and it's using findBMU defined as:
%finds best matching unit in SOM O
function [r, c] = findBMU(iv, O)
  dist = zeros(size(O));
  for i = 1:3
    dist(:,:,i) = O(:,:,i) - iv(i);
  end
  dist = sum(dist.^2, 3);
  [v, r] = min(min(dist, [], 2));
  [v, c] = min(min(dist, [], 1));
end
In the end, it starts with a random map and, after training, becomes an organized one (the before-and-after map images are omitted here).
The thing is, I don't know what my SOM is saying. How do I read it?
Firstly, you should be aware that Octave provides at best an approximation to the SOM methodology. The main methodological advantage of the SOM is the potentially transparent access to (all of) the implied parameters, and those cannot be accessed in Octave any more.
Secondly, considering your data, it does not make much sense to first destroy information by summarizing it and then feed the summary to a SOM. Basically you have four variables in your table shown above: age, total N, single N, twin N. What you have destroyed is the information about the region.
As it stands, you put three distributions into the SOM. The only thing you could expect is clusters. Yet the SOM is not built for building clusters. Instead, the SOM is used for diagnostic and predictive modeling, in order to find the most accurate model and the most relevant variables. Note the term "best matching unit"!
In your example, however, you find just a distribution in the SOM. Basically, there is no interpretation, as there are neither meaningful variables nor a predictive/diagnostic purpose.
You could build a model, for instance one determining the similarity of distributions. Yet for that you should use a goodness-of-fit test (non-parametric, Kolmogorov-Smirnov), not the SOM.
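A minimal illustration of that suggestion in R, with synthetic data purely to show the call (the original post used Octave):
# Two-sample Kolmogorov-Smirnov test comparing two empirical distributions.
set.seed(42)
group_a <- rnorm(200, mean = 30, sd = 5)   # hypothetical sample A
group_b <- rnorm(200, mean = 32, sd = 6)   # hypothetical sample B
ks.test(group_a, group_b)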

Cluster center mean of DBSCAN in R?

Using dbscan in package fpc I am able to get an output of:
dbscan Pts=322 MinPts=20 eps=0.005
0 1
seed 0 233
border 87 2
total 87 235
but I need to find the cluster center (mean of cluster with most seeds). Can anyone show me how to proceed with this?
You need to understand that as DBSCAN looks for arbitrarily shaped clusters, the mean can be well outside of the cluster. Looking at means of DBSCAN clusters therefore is not really sensible.
Just index back into the original data using the cluster ID of your choice. Then you can easily do whatever further processing you want to the subset. Here is an example:
library(fpc)
n = 100
set.seed(12345)
data = matrix(rnorm(n*3), nrow=n)
data.ds = dbscan(data, 0.5)
> data.ds
dbscan Pts=100 MinPts=5 eps=0.5
0 1 2 3
seed 0 1 3 1
border 83 4 4 4
total 83 5 7 5
> colMeans(data[data.ds$cluster==0, ])
[1] 0.28521404 -0.02804152 -0.06836167
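A hedged extension of the example (not in the original answer): cluster 0 in fpc::dbscan is the noise set, so the "cluster with the most seeds" asked about would normally be the largest non-zero cluster:
cl <- data.ds$cluster
largest <- as.integer(names(which.max(table(cl[cl != 0]))))   # biggest non-noise cluster id
colMeans(data[cl == largest, ])                               # its per-dimension mean (centroid)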
