Partition dataset using CART regression by leaf node - r

I'm currently trying to modify an existing Stata model in R, and I'm running into problems with a specific step in the process.
I need to use a CART regression to divide my dataset up into individual clusters based on their leaf node, such that each leaf node becomes a new dataset.
For example, let's say that my regression results in a tree as follows:
        Root
       /    \
   ALeft    ARight
   /   \
BLeft  BRight
       /    \
   CLeft    CRight
I would like to then take my dataset, and for each instance determine (analogous to the typical predict method) which leaf node it belongs to, of the set (ARight,BLeft,CLeft,CRight).
Are there any existing packages, or methods for the rpart/tree CART models, which would allow me to output the leaf node?

You'd find the rpart package useful, particularly the where element of the fitted object.
where: an integer vector of the same length as the number of observations in the root node, containing the row number of frame corresponding to the leaf node that each observation falls into.
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$where
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 9  7  9  9  3  3  3  3  3  8  8  3  9  5  3  3  3  7  3  5  3
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
 9  8  9  9  5  9  8  3  3  3  7  7  3  7  3  5  9  5  8  9  5
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
 9  9  3  7  3  7  9  7  8  3  9  3  3  3  5  9  5  8  9  9  9
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
 3  3  5  3  7  5  3  7  7  3  7  3  3  7  5  7  9  5
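To go from fit$where to one dataset per leaf, a minimal sketch along these lines may work. The names leaf_id, leaf_node and leaf_sets are only illustrative and are not part of rpart; the sketch relies on the row names of fit$frame being the node numbers, and it covers the training data only, not new observations.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# For each training row, the row index in fit$frame of the leaf it falls into
leaf_id <- fit$where

# Translate frame row indices into node numbers (the row names of fit$frame)
leaf_node <- as.integer(rownames(fit$frame))[leaf_id]

# One data frame per leaf node, named by node number
leaf_sets <- split(kyphosis, leaf_node)

# Number of observations that ended up in each leaf
sapply(leaf_sets, nrow)
```

Each element of leaf_sets can then be treated as its own dataset, for example leaf_sets[[1]].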

Related

Creating a table viable for t.test function

I am given a data frame with multiple variables, but I am only interested in 2 of them, and I need to split the observations into 2 groups (group 1: mean age at childbirth with 10+ years of education; group 2: mean age at childbirth with less than 10 years of education). I am trying to put this into a table, but I can't work out how to group the rows based on years of education. I currently have a table produced by the following code:
'''
means<-table(bfeed_df$ybirth,bfeed_df$yschool)
'''
giving me:
'''
    3  6  7  8  9 10 11 12 13 14 15 16 17 18 19
78  0  0  2  2  5  8  8 26  1  2  1  0  0  0  0
79  1  2  2  3  6 12 12 38 10  5  0  0  0  0  0
80  0  0  0  5 10 11 13 38 10  5  2  0  0  0  0
.
.
'''
I want:
   <10years +10years
78        9       46
79       14       77
80       15       88
.         .        .
.         .        .
# Let's generate some fake data that matches your input
temp = matrix(sample(60,60), ncol = 15)
colnames(temp) = c(3,6,7,8,9,10,11,12,13,14,15,16,17,18,19)
rownames(temp) = c(78, 79, 80, 81)
# 3 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# 78 5 4 21 13 18 17 34 43 19 41 55 36 12 52 15
# 79 56 14 38 28 30 25 8 44 35 59 39 49 20 2 58
# 80 22 27 3 9 33 54 26 50 53 45 10 40 48 7 6
# 81 42 46 23 1 60 57 47 16 24 51 37 32 11 29 31
Now we can create the sums using apply. Columns 1 to 5 (3, 6, 7, 8 and 9 years) are under 10 years; columns 6 to 15 are 10 years or more.
sums = t(apply(temp, 1, function(x) c(sum(x[1:5]), sum(x[6:15]))))
colnames(sums) = c("<10y","+10y")
sums
#    <10y +10y
# 78   61  324
# 79  166  339
# 80   94  339
# 81  172  335
Is this what you are looking for?
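You can use cut to divide yschool into two categories and use it in table. Pass right = FALSE so that exactly 10 years lands in the 10+ group; by default cut uses right-closed intervals, which would put 10 in the lower group.
means <- table(bfeed_df$ybirth, cut(bfeed_df$yschool, c(-Inf, 10, Inf), right = FALSE))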
colnames(means) <- c('<10years', '10+years')
means
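Where the value 10 itself ends up depends on the right argument of cut. A small sketch on made-up values (the yschool vector here is hypothetical, not the asker's data) shows the difference:

```r
# Hypothetical years of schooling, including the boundary value 10
yschool <- c(3, 9, 10, 10, 12, 19)

# Default: intervals are (-Inf, 10] and (10, Inf], so 10 goes in the lower group
default_cut <- cut(yschool, c(-Inf, 10, Inf))

# right = FALSE: intervals are [-Inf, 10) and [10, Inf), so 10 goes in the upper group
left_closed <- cut(yschool, c(-Inf, 10, Inf), right = FALSE,
                   labels = c("<10years", "10+years"))

table(default_cut)
table(left_closed)
```

With right = FALSE the two values equal to 10 are counted as 10+ years, which matches the grouping described in the question.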

boot.ci error, due to too little variation in data?

I have a dataset of 79 values ranging from 0-22 (and not a lot of variation; median 5). Using the boot.ci function, I am trying to bootstrap to calculate confidence intervals. I receive the error: [1] "All values of t are equal to 5 \n Cannot calculate confidence intervals"
I assumed this is because my number of replications wasn't high enough (as there is little variation in the data), but increasing to even 1M doesn't help.
Does anyone have any tips?
library(boot)
Boot_symp_pres = boot(data$Delay,
                      function(x, i) median(x[i], na.rm = TRUE),
                      R = 1000000)
boot.ci(Boot_symp_pres,
        conf = 0.95,
        type = c("norm", "basic", "perc", "bca"))
Data
Delay
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 2
10 2
11 2
12 2
13 3
14 3
15 3
16 4
17 5
18 5
19 5
20 5
21 5
22 5
23 5
24 5
25 5
26 5
27 5
28 5
29 5
30 5
31 5
32 5
33 5
34 5
35 5
36 5
37 5
38 5
39 5
40 5
41 5
42 5
43 5
44 5
45 5
46 5
47 5
48 5
49 5
50 5
51 5
52 5
53 5
54 5
55 5
56 5
57 5
58 5
59 5
60 5
61 5
62 5
63 5
64 5
65 5
66 5
67 5
68 5
69 5
70 6
71 6
72 6
73 6
74 8
75 9
76 10
77 11
78 13
79 22
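One way to check whether more replicates could ever help is to ask how likely a resample is to have a median other than 5. With 79 values the median is the 40th ordered value, so a resample's median differs from 5 only if at least 40 of its 79 draws are below 5, or at least 40 are above 5. The sketch below rebuilds the Delay column from the listing above and computes both binomial tail probabilities; the variable names are illustrative.

```r
# Delay values as listed above: 16 values below 5, 53 equal to 5, 10 above 5
delay <- c(rep(0, 5), rep(1, 3), rep(2, 4), rep(3, 3), 4,
           rep(5, 53), rep(6, 4), 8, 9, 10, 11, 13, 22)

n <- length(delay)      # 79 observations
k <- (n + 1) / 2        # position of the median among the ordered values

p_below <- mean(delay < 5)   # chance that a single draw is below 5
p_above <- mean(delay > 5)   # chance that a single draw is above 5

# P(at least k of n draws are below 5) and the same for above 5
prob_median_lt5 <- pbinom(k - 1, n, p_below, lower.tail = FALSE)
prob_median_gt5 <- pbinom(k - 1, n, p_above, lower.tail = FALSE)

# Probability that a single bootstrap median differs from 5
prob_median_lt5 + prob_median_gt5
```

If that probability is tiny compared with 1/R, every replicate will almost certainly equal 5 however large R is. That would point to the data, where 53 of the 79 values are exactly 5, not to the number of replications.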

Understanding Transaction Class Output in R?

Would someone shed some light on the output I get when I run summary on an object of class transactions, in the context of association rule mining? For example:
transactions as itemMatrix in sparse format with
510 rows (elements/itemsets/transactions) and
8361 columns (items) and a density of 0.003117649
most frequent items:
PREPACK CARROTS 1EACH HASS AVOCADO 1EACH. 2 AT 3.00 EACH. AVOCADOS 2 FOR 5
54 51
COLES EGGS FR RANGE 700GRAM COLES DAIRY MILK 2LITRE
41 31
DAIRY FARMERS DAIRY 2LITRE (Other)
29 13088
element (itemset/transaction) length distribution:
sizes
2 3 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
1 1 5 8 9 9 13 11 12 16 17 12 13 14 15 19 15 17 18 26 18 18 11 8 16 10 8 10 14 11 6 9 8 8 9 6
39 40 41 42 43 44 45 47 48 49 50 51 52 53 54 56 57 58 59 62 63 64 68 69 72 73 74 78 81 86 87
7 6 5 2 8 4 5 4 2 2 4 5 1 3 2 4 6 4 2 2 1 1 1 1 1 1 1 1 1 1 1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   16.00   23.00   26.07   34.00   87.00
includes extended item information - examples:
labels
1 1
2 10
3 100
The above is the output generated from my transaction data. I've read the R documentation, but what is the element (itemset/transaction) length distribution? And what are the labels?
Thanks!
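As a rough illustration of the length distribution, here is a base R sketch on toy data. It does not use the package that produced the output above, and the baskets are made up. The idea is that "sizes" counts how many transactions contain a given number of items, and the density is the share of filled cells in the transactions-by-items matrix.

```r
# Toy transactions with hypothetical items; each list element is one basket
baskets <- list(
  c("milk", "eggs"),
  c("milk", "carrots", "avocado"),
  c("eggs", "milk"),
  c("carrots", "milk", "eggs", "avocado", "bread")
)

# Number of items in each transaction
sizes <- lengths(baskets)

# Length distribution: how many transactions have 2, 3, 5 items
table(sizes)
summary(sizes)

# Density: item occurrences divided by (transactions x distinct items)
items <- sort(unique(unlist(baskets)))
density <- sum(sizes) / (length(baskets) * length(items))
density
```

The same arithmetic seems to hold for the output in the question: 510 x 8361 x 0.003117649 is about 13294, which equals the sum of the item counts shown (54 + 51 + 41 + 31 + 29 + 13088), and 13294 / 510 is about 26.07, the reported mean length. The labels section appears to list only the first few item labels as examples.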

Distance Matrix from table in R

Good evening,
I need to solve a location problem in R and I'm stuck in one of the first steps.
From a .txt file I need to create a distance matrix using the euclidean method.
datos <- file.choose()
servidores <- read.table(datos)
servidores
From which I obtain the following information:
X50 shows the total number of servers.
X5 shows the number of hubs required.
X120 shows the total capacity.
The first column shows the distance of x.
The second column shows the distance of y.
The third column shows the requirements of the node.
X50 X5 X120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I tried to use the dist() function:
distance_matrix <- dist(servidores, method = "euclidean", diag = TRUE, upper = TRUE)
but since x and y are in different columns, I am not sure what to do to get a 50x50 matrix with all the distances.
Does anybody know how I could create such a matrix?
Many thanks in advance.
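One possible approach, sketched here on the first four rows of the table above, is to pass only the two coordinate columns to dist() and convert the result with as.matrix(). This assumes the first two columns are the x and y positions and that the third column (the requirement) should not enter the distance.

```r
# First four rows of the table above, with the column names read.table produced
servidores <- data.frame(
  X50  = c(2, 80, 36, 57),
  X5   = c(62, 25, 88, 23),
  X120 = c(3, 14, 1, 14)
)

# Keep only the x and y columns; dist() treats each row as one point
coords <- servidores[, 1:2]

# Full square matrix of pairwise Euclidean distances
distance_matrix <- as.matrix(dist(coords, method = "euclidean"))
distance_matrix
```

Applied to all 50 rows, the same two lines would give a 50x50 matrix with zeros on the diagonal.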

Generate population data with specific distribution in R

I have a distribution of ages in a population.
For instance, you can imagine something like this:
Ages <24: 15%
Ages 25-49: 40%
Ages 50-60: 20%
Ages >60: 25%
I don't have the mean and standard deviation for each stratum/age group in the data. I am trying to generate a sample population of 1000 individuals where the generated data matches the distribution of ages shown above.
Let's put this data in a more friendly format:
(dat <- data.frame(min=c(0, 25, 50, 60), max=c(25, 50, 60, 100), prop=c(0.15, 0.40, 0.20, 0.25)))
# min max prop
# 1 0 25 0.15
# 2 25 50 0.40
# 3 50 60 0.20
# 4 60 100 0.25
We can easily sample 1000 rows of the table using the sample function:
set.seed(144) # For reproducibility
rows <- sample(nrow(dat), 1000, replace=TRUE, prob=dat$prop)
table(rows)
# rows
# 1 2 3 4
# 139 425 198 238
To sample actual ages you will need to define a distribution over the ages represented by each row. A simple one would be uniformly distributed ages:
age <- round(dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows]))
table(age)
# age
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
# 2 5 5 3 7 7 9 6 7 6 1 7 7 5 5 6 2 4 6 7 4 11 8 2 3 10 11 13
# 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
# 19 16 20 16 18 21 16 19 14 20 15 13 18 15 24 20 16 16 29 16 11 12 18 17 17 26 27 21
# 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
# 17 26 11 13 20 3 8 9 6 4 3 3 5 4 3 3 5 8 3 13 5 6 4 7 9 9 6 4
# 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
# 5 5 9 9 5 6 8 9 5 4 6 5 9 6 8 4 1
Of course, if uniformly sampling the ages in each range is inappropriate in your application, then you would need to pick some other function to get ages from buckets.
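As a variation on the uniform approach, the sketch below uses floor() so that every sampled age stays inside its own bracket, then checks the result with cut(). The bracket limits (including the upper limit of 100) are the same assumed ones as in dat above, and the names n and bracket are illustrative.

```r
set.seed(144)  # for reproducibility

dat <- data.frame(min  = c(0, 25, 50, 60),
                  max  = c(25, 50, 60, 100),
                  prop = c(0.15, 0.40, 0.20, 0.25))

n <- 1000

# Pick an age bracket for each individual according to the target proportions
rows <- sample(nrow(dat), n, replace = TRUE, prob = dat$prop)

# Draw uniformly within the bracket; floor() keeps each age in [min, max)
age <- floor(runif(n, min = dat$min[rows], max = dat$max[rows]))

# Map ages back to brackets and compare with the target proportions
bracket <- cut(age, breaks = c(dat$min, max(dat$max)), right = FALSE)
prop.table(table(bracket))
```

Because of sampling noise the observed shares will be close to, but not exactly, 15%, 40%, 20% and 25%.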
This doesn't do exactly what you were looking for, but does help with the cut-offs. Hope it helps!
install.packages("truncnorm")
library(truncnorm)
set.seed(123)
pop <- 1000
ages <- rtruncnorm(n=pop, a=0, b=100, mean=40, sd=25) # ---> You can set your own mean and sd
summary(ages)
