Creating a contingency table with fixed margins - r

I am trying to create a table with random entries drawn from a central hypergeometric distribution, where the column and row totals are fixed.
However, I can only get the column sums to be fixed and equal, not the row sums. I have read other answers, but none seem to explain specifically how to do this. My R knowledge is pretty basic, so I could do with some help or a pointer in the right direction.
To get the values from a central hypergeometric distribution I am using the BiasedUrn package.
For example:
library(BiasedUrn)
N <- 50             # total number of balls in the urn
K <- 5              # number of colours, i.e. rows of the table
rand <- 10          # number of random draws, i.e. columns of the table
n1 <- 25            # balls taken per draw, i.e. each column sum
odds0 <- rep(1, K)  # equal odds give the central distribution
m0 <- rep(N / K, K) # equal numbers of balls of each colour
i <- as.table(rMFNCHypergeo(nran = rand, n = n1, m = m0, odds = odds0))
addmargins(i)
addmargins(i)
      A  B  C  D  E  F  G  H  I  J Sum
A     5  3  5  7  5  5  6  6  5  5  52
B     8  7  4  5  5  6  3  4  5  4  51
C     3  6  4  4  4  5  6  8  5  4  49
D     4  4  6  3  6  4  5  3  3  5  43
E     5  5  6  6  5  5  5  4  7  7  55
Sum  25 25 25 25 25 25 25 25 25 25 250
Here I'm looking to keep all the column sums equal to 25 and make all the row sums equal to another number that I can choose, such as 50.

Are you looking for the r2dtable function from base R?
set.seed(101)
tt <- r2dtable(n=1,c=rep(25,6),r=rep(50,3))
addmargins(as.table(tt[[1]]))
## A B C D E F Sum
## A 7 9 7 11 9 7 50
## B 10 7 10 6 7 10 50
## C 8 9 8 8 9 8 50
## Sum 25 25 25 25 25 25 150
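Applied to the dimensions in the question (10 columns summing to 25 and 5 rows summing to 50, so both margins total 250), the same call would be:
tt <- r2dtable(n = 1, r = rep(50, 5), c = rep(25, 10))
addmargins(as.table(tt[[1]]))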

Related

How to randomly split a data frame into halves that are balanced on subject and item

The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))        # 480 rows: 40 subjects x 12 items
first <- sample(r, 240)       # half of the rows
second <- r[!r %in% first]    # the other half
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(df$subject, p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing on one variable and checking whether you're happy with the resulting partition of the second.
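If exact balance on one variable is enough, here is a minimal base-R sketch. It assumes every subject contributes an even number of rows (here 12) and samples half of each subject's rows:
set.seed(1)
## pick half of each subject's row indices at random
idx <- unlist(lapply(split(seq_len(nrow(df)), df$subject),
                     function(i) sample(i, length(i) / 2)))
df_1 <- df[idx, ]
df_2 <- df[-idx, ]
table(df_1$subject)  # exactly 6 rows per subject
table(df_1$item)     # item counts may still be unbalanced
Balancing subject and item at the same time means choosing 6 of each subject's 12 items so that every item is picked exactly 10 times overall, which is a combinatorial design problem rather than simple sampling; a pragmatic workaround is to redraw until the item margin looks acceptable.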

Creating Groups by Matching Values of Different Columns

I would like to create groups in a data frame by matching values across columns.
I have the following data table:
now <- c(1, 2, 3, 4, 24, 25, 26, 5, 6, 21, 22, 23)
before <- c(0, 1, 2, 3, 23, 24, 25, 4, 5, 0, 21, 22)
after <- c(2, 3, 4, 5, 25, 26, 0, 6, 0, 22, 23, 24)
df <- data.frame(now, before, after)
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to reach the answer without using a "for" loop because the real data set is very large.
Any help you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df))  # assign each row to its own group
stop <- FALSE                  # indicates convergence
while (!stop) {
  pre <- df$group  # group column at start of loop
  for (i in seq_len(nrow(df))) {
    matched <- which(df$before == df$now[i] | df$after == df$now[i])  # rows linked to row i
    group <- min(df$group[i], df$group[matched])  # smallest group number among matching rows
    df$group[i] <- group        # set row i to smallest group
    df$group[matched] <- group  # set matching rows to smallest group
  }
  if (identical(df$group, pre)) stop <- TRUE  # stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))]  # convert groups to letters
# (use match(...) alone to keep them as integers, e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now), ]  # reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B
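For large data you can also avoid the repeated passes by viewing the values as nodes of a graph and letting a connected-components routine assign the groups. Here is a sketch using the igraph package (an extra dependency, and it assumes 0 is a filler value rather than a real id):
library(igraph)
## rows whose now/before/after values chain together end up in one
## connected component; drop the 0 filler before building the edges
edges <- rbind(cbind(df$now, df$before), cbind(df$now, df$after))
edges <- edges[edges[, 1] != 0 & edges[, 2] != 0, ]
g <- graph_from_edgelist(apply(edges, 2, as.character), directed = FALSE)
comp <- components(g)$membership
df$group <- LETTERS[comp[as.character(df$now)]]
## component numbers follow vertex order, so relabel if the letters must
## be sorted by the smallest "now" value in each group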

How to extract a sample of pairs in grouping variable

My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable, and I would like to see how well this grouping went. To that end I want to extract a sample of n pairs of cases that are grouped together by variable y and n pairs of cases that are not grouped together by variable y, so that I can count the false positives and false negatives (pairs falsely grouped or falsely separated). How do I extract a sample of grouped pairs and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you would like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...).
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
    function(x) if (nrow(x) > 1) x[sample(nrow(x), min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples: draw as many rows at random from the full data
# frame as df.grouped contains
df.ungrouped <- df[sample(nrow(df), nrow(df.grouped)), ];
df.ungrouped;
# (a random selection of 10 of the 20 rows of df)
Explanation: split df on y, then from every subset x with more than one row draw min(n, nrow(x)) rows; rbind-ing the pieces gives the grouped df.grouped. We then draw nrow(df.grouped) rows from the full df to produce the ungrouped df.ungrouped.
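If you literally need pairs of cases, one possible sketch (the objects same, diff, grouped.pairs and ungrouped.pairs are made-up names) is to enumerate all row pairs, split them by whether the two y values agree, and sample from each set:
set.seed(2017);
n <- 6;
agree <- outer(df$y, df$y, "==");           # TRUE where two rows share y
ut <- upper.tri(agree);                     # count each unordered pair once
same <- which(agree & ut, arr.ind = TRUE);
diff <- which(!agree & ut, arr.ind = TRUE);
grouped.pairs <- same[sample(nrow(same), min(n, nrow(same))), ];
ungrouped.pairs <- diff[sample(nrow(diff), n), ];
Each row of grouped.pairs and ungrouped.pairs holds the row indices of one sampled pair; note this data contains only five grouped pairs, hence the min().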
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)

I don't know how to create this tree in R

I would like to maximize revenue by applying the better campaign at each hour.
I would like to create a tree that would help me choose the better campaign.
At the data below there's a record with the revenue per campaign per hour.
Looking at the data, I can see that campaign A is better between hours 1-12, and that campaign B is better between hours 13-24.
How do I create in R the tree that would tell me that?
hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5
I'm not sure what kind of tree you are looking for exactly, but a linear model tree for revenue with regressor campaign and partitioning variable hour might be useful. Using lmtree() in package partykit you can fit a tree that starts out by fitting a linear model with two coefficients (intercept and campaign B effect) and then splits the data as long as there are significant instabilities in at least one of the coefficients:
library("partykit")
(tr <- lmtree(revenue ~ campaign | hour, data = d))
## Linear model tree
##
## Model formula:
## revenue ~ campaign | hour
##
## Fitted party:
## [1] root
## | [2] hour <= 12: n = 24
## | (Intercept) campaignB
## | 20.583333 -1.166667
## | [3] hour > 12: n = 24
## | (Intercept) campaignB
## | 4.666667 1.666667
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
## Number of parameters per node: 2
## Objective function (residual sum of squares): 341.1667
In this (presumably artificial) data, this selects a single split at hour 12 and then has two terminal nodes: one with a negative campaign B effect (i.e., A is better) and one with a positive campaign B effect (i.e., B is better). The fitted tree can be displayed with plot(tr).
This also brings out that the split is driven by the change in revenue level and not only by the differing campaign effects (which are fairly small).
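If you want the split to be driven only by which campaign is better, a possible variation (not part of the original answer) is to partition the per-hour revenue difference with a plain regression tree; the reshape step and the small minsplit below are assumptions to suit the 24-hour data:
library(rpart)
## one row per hour, with revenue.A and revenue.B side by side
w <- reshape(d, idvar = "hour", timevar = "campaign", direction = "wide")
w$diff <- w$revenue.B - w$revenue.A   # positive where B is better
## minsplit lowered because there are only 24 hours
rpart(diff ~ hour, data = w, control = rpart.control(minsplit = 4))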
The underlying tree algorithm is called "Model-Based Recursive Partitioning" (MOB) and is also applicable to models other than linear regression. See the references in the manual and vignette for more details.
Another algorithm that might potentially be interesting is the QUINT (qualitative interaction trees) by Dusseldorp & Van Mechelen, available in the quint package.
For convenient replication of the example above: The d data frame can be recreated by
d <- read.table(textConnection("hour campaign revenue
1 A 23
1 B 20
2 A 21
2 B 22
3 A 23
3 B 20
4 A 21
4 B 22
5 A 23
5 B 20
6 A 21
6 B 22
7 A 20
7 B 17
8 A 18
8 B 19
9 A 20
9 B 17
10 A 18
10 B 19
11 A 20
11 B 17
12 A 19
12 B 18
13 A 8
13 B 9
14 A 6
14 B 11
15 A 9
15 B 8
16 A 6
16 B 11
17 A 9
17 B 8
18 A 6
18 B 11
19 A 3
19 B 2
20 A 3
20 B 2
21 A 0
21 B 5
22 A 3
22 B 2
23 A 3
23 B 2
24 A 0
24 B 5"), header = TRUE)
Would something like this work?
## assumes the data frame d recreated above
hr <- unique(d$hour)
## vectors to hold the hours in which each campaign wins
A.hours <- NULL
B.hours <- NULL
for (h in hr) {
  ## subset of the data for the current hour
  sub.data <- d[d$hour == h, ]
  ## campaign with the highest revenue (which.max takes the first on ties)
  best.camp <- sub.data$campaign[which.max(sub.data$revenue)]
  if (best.camp == "A") A.hours <- c(A.hours, h)
  if (best.camp == "B") B.hours <- c(B.hours, h)
}
The code indicates that during A.hours (1, 3, 5, 7, 9, 11, 12, 15, 17, 19, 20, 22, 23) campaign A is more profitable, and that during B.hours (2, 4, 6, 8, 10, 13, 14, 16, 18, 21, 24) campaign B is more profitable.
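For larger data the loop can also be avoided entirely; a short vectorized sketch (ties are credited to A via >=):
w <- reshape(d, idvar = "hour", timevar = "campaign", direction = "wide")
A.hours <- w$hour[w$revenue.A >= w$revenue.B]
B.hours <- w$hour[w$revenue.B > w$revenue.A]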

handling high dimension tables

I have a three-dimensional table that I routinely compute with R. I would like to add some subtotal tables among the (here 5) marginal tables. What I usually do is something like:
A <- sample(LETTERS[1:5], 100, replace = TRUE)
b <- sample(letters[1:2], 100, replace = TRUE)
numbers <- sample(1:3, 100, replace = TRUE)
( tab <- table(A, b, numbers) )
( tab1 <- ftable(addmargins(tab)) )
I would like to add the sum of the first few marginal tables, then the sum of the remaining tables at the bottom, then the grand total. In the resulting ftable I would want the A, B, and C tables, then the sum of those three, then the D and E tables and the sum of those two, and finally the sum of all of the tables, like:
numbers 1 2 3 Sum
A b
A a 1 5 0 6
b 4 2 2 8
Sum 5 7 2 14
B a 2 6 6 14
b 5 4 5 14
Sum 7 10 11 28
C a 3 2 5 10
b 1 2 2 5
Sum 4 4 7 15
sumac a 6 13 11 30 #### how do i add these three lines
b ....
sum ....
D a 2 1 1 4
b 4 5 3 12
Sum 6 6 4 16
E a 5 3 4 12
b 4 3 8 15
Sum 9 6 12 27
sumde a 7 4 5 16 #### and these
b ....
sum ....
sumae a 13 17 16 46
b 18 16 20 54
Sum 31 33 36 100
As usual I think the solution is probably many fewer lines than the question. Thanks
Seth Latimer
You could do something like this:
isABC <- ifelse(A %in% c("A", "B", "C"), "ABC", "DE")
( tab <- table(isABC, A, b, numbers) )
( tab1 <- ftable(addmargins(tab)) )
Now you have a larger table which holds even more rows than you want, but those should be easy to drop...
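A minimal sketch of that drop step, assuming the tab from above (within the "ABC" slice the D and E counts are all zero, so its Sum rows are exactly the ABC subtotals, and likewise for "DE"):
tab2 <- addmargins(tab)
ftable(tab2["ABC", c("A", "B", "C", "Sum"), , ])  # A, B, C and their subtotal
ftable(tab2["DE", c("D", "E", "Sum"), , ])        # D, E and their subtotal
ftable(tab2["Sum", "Sum", , ])                    # grand totals over all letters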
