I don't know how to create this tree in R

I would like to maximize revenue by applying the better campaign at each hour.
I would like to create a tree that would help me choose the better campaign.
In the data below there is one revenue record per campaign per hour.
Looking at the data, I can see that campaign A is better between hours 1-12 and that campaign B is better between hours 13-24.
How do I create, in R, the tree that would tell me that?
hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5

I'm not sure what kind of tree you are looking for exactly, but a linear model tree for revenue with regressor campaign and partitioning variable hour might be useful. Using lmtree() in package partykit you can fit a tree that starts out by fitting a linear model with two coefficients (intercept and campaign B effect) and then splits the data as long as there are significant instabilities in at least one of the coefficients:
library("partykit")
(tr <- lmtree(revenue ~ campaign | hour, data = d))
## Linear model tree
##
## Model formula:
## revenue ~ campaign | hour
##
## Fitted party:
## [1] root
## | [2] hour <= 12: n = 24
## | (Intercept) campaignB
## | 20.583333 -1.166667
## | [3] hour > 12: n = 24
## | (Intercept) campaignB
## | 4.666667 1.666667
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
## Number of parameters per node: 2
## Objective function (residual sum of squares): 341.1667
In this (presumably artificial) data, this selects a single split at 12 hours and then has two terminal nodes: one with a negative campaign B effect (i.e., A is better) and one with a positive campaign B effect (i.e., B is better). The resulting plot(tr) visualizes this split (figure not reproduced here).
The plot also brings out that the split is driven by the change in overall revenue level and not only by the differing campaign effects (which are fairly small).
The underlying tree algorithm is called "Model-Based Recursive Partitioning" (MOB) and is also applicable to models other than linear regression. See the references in the manual and vignette for more details.
Another algorithm that might be interesting here is QUINT (qualitative interaction trees) by Dusseldorp & Van Mechelen, available in the quint package.
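To turn the fitted tree into an hour-by-hour recommendation, one option is to predict the expected revenue for both campaigns at every hour and keep the larger one. A small sketch building on tr from above (assuming predict() on the lmtree object returns fitted responses, its default):
## predicted revenue for every hour/campaign combination
newdat <- expand.grid(hour = 1:24, campaign = c("A", "B"),
                      stringsAsFactors = FALSE)
newdat$pred <- predict(tr, newdata = newdat)
## for each hour, pick the campaign with the higher prediction
best <- tapply(newdat$pred, newdat$hour,
               function(p) c("A", "B")[which.max(p)])
best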
For convenient replication of the example above: The d data frame can be recreated by
d <- read.table(textConnection("hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5"), header = TRUE)

Would something like this work?
## create a sequence of hours from column 1 of the data
hr <- as.numeric(unique(data[, 1]))
## Set up vectors to hold the A and B campaign "best" hours
A.hours <- NULL
B.hours <- NULL
## start at the lowest hour
i <- 1
while (i <= max(hr)) {
  ## subset the data for the current hour
  sub.data <- data[data[, 1] == hr[i], ]
  ## find the campaign with the highest revenue
  best.camp <- sub.data[which.max(sub.data[, 3]), 2]
  if (best.camp == "A") {
    A.hours <- c(A.hours, hr[i])
  }
  if (best.camp == "B") {
    B.hours <- c(B.hours, hr[i])
  }
  i <- i + 1
}
The code indicates that during A.hours (1, 3, 5, 7, 9, 11, 12, 15, 17, 19, 20, 22, 23) campaign A is more profitable,
while during B.hours (2, 4, 6, 8, 10, 13, 14, 16, 18, 21, 24) campaign B is more profitable.
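For comparison, a more compact way to get the same hour lists without the explicit loop (a sketch using the d data frame from the answer above):
## for each hour keep the row with the highest revenue, then
## split the winning hours by campaign
best <- do.call(rbind, lapply(split(d, d$hour),
                              function(s) s[which.max(s$revenue), ]))
split(best$hour, best$campaign)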

Related

R: How to apply a sliding conditional branch to consecutive values in sequential data

I want to apply a conditional statement to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A", "B"), times = c(24, 12)),
                   time = c(1:24, 1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
and I got the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on consecutive values of "visit".
If "visit" stays below 10 for 6 consecutive records, I'd like to flag the ID as "empty", and as "busy" otherwise.
In the data above, "A" is continuously below 10 from rows 1 to 6, so it is "empty". "B", on the other hand, never has 6 consecutive values below 10, so it is "busy".
If the condition isn't fulfilled in one segment of 6 values, I want to apply it to the next segment of 6.
I'd like to achieve this in R. Any advice will be appreciated.
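A minimal sketch of one possible approach, assuming the segments of 6 are non-overlapping and an ID is flagged "empty" as soon as any complete segment stays below 10 (flag_id is a hypothetical helper, not an existing function):
## flag one ID's visit vector: "empty" if any complete block of 6
## consecutive visits is entirely below 10, "busy" otherwise
flag_id <- function(visit, width = 6, cutoff = 10) {
  segment <- (seq_along(visit) - 1) %/% width   # 0,0,...,1,1,... block ids
  hit <- tapply(visit < cutoff, segment,
                function(z) length(z) == width && all(z))
  if (any(hit)) "empty" else "busy"
}
flags <- tapply(data$visit, data$ID, flag_id)
data$flag <- flags[data$ID]
With the data shown above this flags A as "empty" (rows 1 to 6 are all below 10) and B as "busy".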

How to randomly split a data frame into halves that are balanced on subject and item

The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
There are 40 subjects and 24 items; each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(df$subject, p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
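If exact per-subject counts matter, a hedged alternative sketch (not using caret) is to draw exactly half of each subject's rows by hand; item balance is still not guaranteed and has to be checked afterwards:
set.seed(1)  # for reproducibility
## sample exactly half of each subject's row indices
idx <- unlist(lapply(split(seq_len(nrow(df)), df$subject),
                     function(i) sample(i, length(i) / 2)))
df_1 <- df[idx, ]
df_2 <- df[-idx, ]
table(df_1$subject)  # exactly 6 rows per subject (each subject has 12)
table(df_1$item)     # item counts may still be uneven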

Creating Groups by Matching Values of Different Columns

I would like to create groups from a base by matching values.
I have the following data table:
now<-c(1,2,3,4,24,25,26,5,6,21,22,23)
before<-c(0,1,2,3,23,24,25,4,5,0,21,22)
after<-c(2,3,4,5,25,26,0,6,0,22,23,24)
df<-as.data.frame(cbind(now,before,after))
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to reach the answer without using a "for" loop because the real data is very large.
Any help you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df))  # assign each row to its own group
stop <- FALSE                  # indicates convergence
while (!stop) {
  pre <- df$group              # group column at start of loop
  for (i in seq_len(nrow(df))) {
    matched <- which(df$before == df$now[i] | df$after == df$now[i])  # matches in before and after columns
    group <- min(df$group[i], df$group[matched])  # smallest group no. among matching rows
    df$group[i] <- group        # set to smallest group
    df$group[matched] <- group  # set to smallest group
  }
  if (identical(df$group, pre)) stop <- TRUE  # stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))]  # convert groups to letters
# (just use match(...) to keep them as integers - e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now), ]  # reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B
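A loop-free alternative that might be worth trying on large data (a sketch under the assumption that the igraph package is acceptable): treat the values as nodes of a graph, connect each now value to its nonzero before/after neighbours, and read the groups off as connected components.
library(igraph)
## edge list from the nonzero before/after links
edges <- rbind(
  with(df[df$before != 0, ], data.frame(from = now, to = before)),
  with(df[df$after != 0, ], data.frame(from = now, to = after))
)
g <- graph_from_data_frame(data.frame(lapply(edges, as.character)),
                           directed = FALSE)
## connected components are the groups
comp <- components(g)$membership
df$group <- LETTERS[comp[as.character(df$now)]]
## (component numbering follows vertex order; relabel via match() if
## stable letters are required)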

How to extract a sample of pairs in grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
To do so, I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (cases falsely grouped or falsely not grouped). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n = 6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
    function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
# (note: sample(k) permutes 1:k, so this takes the first nrow(df.grouped)
# rows of df in random order; for a draw from all rows one could use
# df[sample(nrow(df), nrow(df.grouped)), ])
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) rows from each subset x containing more than one row; rbinding the pieces gives the grouped df.grouped. We then take nrow(df.grouped) rows from df (see the note in the code about how sample() indexes them) to produce the ungrouped df.ungrouped.
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
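If you literally need pairs of rows rather than single rows, a further hedged sketch: enumerate all row pairs, mark whether the two rows share a y value, and sample n pairs from each side.
set.seed(2017)
n <- 3
pairs <- t(combn(nrow(df), 2))                 # all pairs of row indices
same <- df$y[pairs[, 1]] == df$y[pairs[, 2]]   # TRUE if grouped together
grouped <- pairs[sample(which(same), n), ]     # n grouped pairs
ungrouped <- pairs[sample(which(!same), n), ]  # n not-grouped pairs
df[grouped[1, ], ]                             # inspect the first grouped pair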

Computing values for R data frame cells without using for loops

I have an R data frame with the following:
Serial N year current Average
B 10 14 15
B 10 16 15
C 12 13 12
D 40 20 20
B 11 15 15
C 12 11 12
I would like to add a new column based on the average for each unique serial number, something like:
Serial N year current Average temp
B 10 14 15 (15+12+20)/15
B 10 16 15 (15+12+20)/15
C 12 13 12 (15+12+20)/12
D 40 20 20 (15+12+20)/20
B 11 15 15 (15+12+20)/15
C 12 11 12 (15+12+20)/12
The temp column is the sum of the average values of the unique Serial N values (B, C and D) divided by the Average value of that row. How can I compute it without using for loops, given that rows 1, 2 and 5 (Serial N: B) are identical in terms of the Average and temp columns? I started with this:
for (i in unique(df$Serial_N))
{
.........
}
but I got stuck as I also need the averages of the other Serial N values. How can I do this?
For example, you can try something like the following (assuming your computation matches):
df$temp <- sum(tapply(df$Average, df$SerialN, mean)) / df$Average
Resulting output:
SerialN year current Average temp
1 B 10 14 15 3.133333
2 B 10 16 15 3.133333
3 C 12 13 12 3.916667
4 D 40 20 20 2.350000
5 B 11 15 15 3.133333
6 C 12 11 12 3.916667
Using unique.data.frame() avoids collapsing equal Average values that belong to different Serial_N groups:
df$temp <- sum((unique.data.frame(df[c("Serial_N","Average")]))$Average) / df$Average
In base R, you can use either
df <- transform(df, temp = sum(tapply(df$Average, df$Serial_N, unique))/df$Average)
or
df$temp <- sum(tapply(df$Average, df$Serial_N, unique))/df$Average
both of which will give you
df
# Serial_N year current Average temp
# 1 B 10 14 15 3.133333
# 2 B 10 16 15 3.133333
# 3 C 12 13 12 3.916667
# 4 D 40 20 20 2.350000
# 5 B 11 15 15 3.133333
# 6 C 12 11 12 3.916667
tapply splits df$Average by the levels of df$Serial_N and calls unique on each piece, which gives you a single average per group; you can then sum these and divide by each row's Average. transform adds the column (it is roughly equivalent to dplyr::mutate).
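For comparison, a dplyr-flavoured sketch of the same computation (hedged; like the tapply/unique approach it assumes a single Average value per Serial_N):
library(dplyr)
## sum one Average per Serial_N, then divide by each row's Average
group_sum <- sum(distinct(df, Serial_N, Average)$Average)
df <- mutate(df, temp = group_sum / Average)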
