Sampling a contingency table in R - r

I'd like to know if anyone can suggest an efficient method to sample a contingency table such that both the total number of observations and the column totals remain the same.
For example, in the following table, where the rows are cases and the columns are variables, I'd like to "scramble" the observations such that (a) the total number of observations is 54, and (b) the total number of observations in a variable (e.g., A) is 18, the same as the original column total for A.
x<-matrix(c(
4,6,0,0,8,0,0,
1,1,1,1,4,0,0,
3,0,1,1,6,0,1,
2,1,0,0,1,0,0,
1,1,0,1,0,1,1,
2,0,0,2,1,2,0),
ncol=6,byrow=F)
colnames(x)<-c("A","B","C","D","E","F")
I've seen a discussion of contingency table sampling in which the cell frequencies are the source of the sampling probabilities for a sample(...) call. This won't work for my purposes because, among other reasons, the column totals do not remain equal to the original column totals.
Any help would be greatly appreciated,
Patrick
EDIT
If there isn't an easy solution to this problem, perhaps someone can help me with my overly complicated (and failed) attempt. I first create a vector composed of the number of observations of each variable, e.g.,
v <- character(0)
for (h in seq_len(ncol(x))) {
  # repeat each column name once per observation in that column
  m <- rep(colnames(x)[h], sum(x[, h]))
  v <- c(v, m)
}
I then sample it to randomly shuffle the observations, and bind it to a sample of values equal to the number of cases
v<-sample(v,length(v))
p<-sample(seq(1:nrow(x)),length(v),T)
n<-as.data.frame(cbind(v,p))
t(table(n))
v
p A B C D E F
1 3 1 3 1 1 1
2 1 1 0 0 0 0
3 3 0 3 0 2 1
4 3 2 1 2 1 2
5 2 1 0 0 0 1
6 3 2 3 1 1 1
7 3 1 2 0 0 1
colSums(t(table(n)))
A B C D E F
18 8 12 4 5 7
This works great except when the sample p fails to contain one of the values in the sequence (i.e., a "case" is missing), which as I've learned happens quite frequently, particularly when there are many iterations of the sample (e.g., 1000).
Thanks again,
Patrick
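One possible way around the missing-case problem described in the EDIT is to turn p into a factor with explicit levels, so that table() keeps every case even when sample() never draws it. This is only a sketch: it assumes a case with no observations should appear as a row of zeros, and the seed is arbitrary and only there for reproducibility.

```r
x <- matrix(c(4,6,0,0,8,0,0,
              1,1,1,1,4,0,0,
              3,0,1,1,6,0,1,
              2,1,0,0,1,0,0,
              1,1,0,1,0,1,1,
              2,0,0,2,1,2,0),
            ncol = 6, dimnames = list(NULL, c("A","B","C","D","E","F")))

set.seed(1)  # arbitrary seed, for reproducibility only

# one label per observation, e.g. 18 "A"s, 8 "B"s, ...
v <- rep(colnames(x), times = colSums(x))

# assign each observation to a random case
p <- sample(seq_len(nrow(x)), length(v), replace = TRUE)

# explicit factor levels keep cases (and variables) that were never drawn
tab <- table(case = factor(p, levels = seq_len(nrow(x))),
             variable = factor(v, levels = colnames(x)))

dim(tab)       # always 7 x 6, even if a case received no observations
colSums(tab)   # same column totals as colSums(x)
```

Because the levels are fixed up front, the table has the same shape on every iteration, which matters when the sampling is repeated many times.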

Another way would be:
indx <- cbind(c(replicate(ncol(x), sample(1:nrow(x)))), c(col(x)))
x1 <- x
x1[] <- x[indx]
colSums(x1)
# A B C D E F
#18 8 12 4 5 7
colSums(x)
#A B C D E F
#18 8 12 4 5 7
sum(x1)
#[1] 54
Update
Based on the new info, which is confusing, maybe this helps:
cSum <- colSums(x)
ind1 <- vector("list", length=ncol(x))
for(i in seq_along(cSum)){
repeat{ind1[[i]] <- sample(0:cSum[i], nrow(x)-1, replace=TRUE)
if(sum(ind1[[i]]) <=cSum[i]) break
}
}
x1 <- do.call(cbind, ind1)
x2 <- rbind(x1,cSum-colSums(x1))
colSums(x2)
# A B C D E F
#18 8 12 4 5 7
sum(colSums(x2))
#[1] 54
x2
# A B C D E F
#[1,] 0 0 0 0 0 0
#[2,] 9 5 1 2 0 1
#[3,] 0 1 1 1 0 2
#[4,] 0 0 4 0 0 1
#[5,] 8 0 5 0 4 2
#[6,] 0 0 1 0 1 1
#[7,] 1 2 0 1 0 0
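As an alternative to the repeat loop (a different technique, not the one used above), each column total could be spread over the rows with a multinomial draw. This is a sketch that assumes every row is equally likely to receive an observation; the name x.multi is made up for illustration.

```r
x <- matrix(c(4,6,0,0,8,0,0,
              1,1,1,1,4,0,0,
              3,0,1,1,6,0,1,
              2,1,0,0,1,0,0,
              1,1,0,1,0,1,1,
              2,0,0,2,1,2,0),
            ncol = 6, dimnames = list(NULL, c("A","B","C","D","E","F")))

set.seed(2)  # arbitrary seed, for reproducibility only

n <- nrow(x)

# for each column total s, draw one multinomial allocation over the n rows
x.multi <- vapply(colSums(x),
                  function(s) rmultinom(1, size = s, prob = rep(1, n))[, 1],
                  integer(n))

colSums(x.multi)  # equals colSums(x) by construction
sum(x.multi)      # 54
```

No rejection step is needed, since a multinomial draw of size s always sums to s.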

You can use
x.swapped <- apply(x, MARGIN=2, FUN=sample)
apply applies the function passed in the parameter FUN to the columns (if MARGIN is 2, rows when it is 1) of the matrix x.
In this case we apply the sample function.
When called without extra parameters, sample just reorders the elements in the vector (see ?sample for more help).
We can check that the totals in each column remain the same.
colSums(x)
A B C D E F
18 8 12 4 5 7
colSums(x.swapped)
A B C D E F
18 8 12 4 5 7
And obviously
sum(x)
[1] 54
sum(x.swapped)
[1] 54
An example of the output may be (note that, unless you fix the RNG seed using set.seed, the result from sample will differ each time):
x
A B C D E F
[1,] 4 1 3 2 1 2
[2,] 6 1 0 1 1 0
[3,] 0 1 1 0 0 0
[4,] 0 1 1 0 1 2
[5,] 8 4 6 1 0 1
[6,] 0 0 0 0 1 2
[7,] 0 0 1 0 1 0
x.swapped
A B C D E F
[1,] 6 4 1 0 1 0
[2,] 0 1 3 2 0 0
[3,] 0 0 1 1 1 2
[4,] 0 0 0 0 1 2
[5,] 8 1 1 1 1 2
[6,] 4 1 0 0 1 0
[7,] 0 1 6 0 0 1
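Since the question mentions running many iterations (e.g., 1000), here is a minimal sketch of repeating the column shuffle with replicate and checking the totals on every draw. The seed and the name shuffles are arbitrary choices for the example.

```r
x <- matrix(c(4,6,0,0,8,0,0,
              1,1,1,1,4,0,0,
              3,0,1,1,6,0,1,
              2,1,0,0,1,0,0,
              1,1,0,1,0,1,1,
              2,0,0,2,1,2,0),
            ncol = 6, dimnames = list(NULL, c("A","B","C","D","E","F")))

set.seed(42)  # arbitrary seed, for reproducibility only

# a list of 1000 matrices, each with every column independently shuffled
shuffles <- replicate(1000, apply(x, MARGIN = 2, FUN = sample),
                      simplify = FALSE)

# every shuffled matrix keeps the original column totals
ok <- vapply(shuffles,
             function(m) all(colSums(m) == colSums(x)),
             logical(1))
all(ok)
```

Note that this only permutes the existing cell values within each column, so the set of values in a column never changes; whether that is random enough depends on the application.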

Related

R- Include starting point in cumsum function

I have this data.frame:
a b
[1,] 1 0
[2,] 2 0
[3,] 3 0
[4,] 4 0
[5,] 5 0
[6,] 6 1
[7,] 7 2
[8,] 8 3
[9,] 9 4
[10,] 10 5
I want to apply cumsum on column a only when its corresponding value on column b is different from 0.
I tried this below but it doesn't include a starting condition on the cumsum:
A <- cbind(a = 1:10, b = c(0,0,0,0,0,1,2,3,4,5))
df_cumsum <- data.frame(A, c = ave(A[, 1], A[, 2] != 0, FUN = cumsum))
Unfortunately, the rows where b is 0 also get a cumulative sum:
a b c
1 1 0 1
2 2 0 3
3 3 0 6
4 4 0 10
5 5 0 15
6 6 1 6
7 7 2 13
8 8 3 21
9 9 4 30
10 10 5 40
I would like to obtain:
a b c
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 1 6
7 7 2 13
8 8 3 21
9 9 4 30
10 10 5 40
Thanks for help!
Assuming the input is df as shown reproducibly in the Note at the end, try this. It zeros out any a value for which b is 0.
transform(df, cum = cumsum((b > 0) * a))
giving:
a b cum
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 1 6
7 7 2 13
8 8 3 21
9 9 4 30
10 10 5 40
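To make the masking step easier to follow, the same computation can be sketched in stages. The intermediate names keep and masked are only illustrative.

```r
df <- data.frame(a = 1:10, b = c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5))

keep   <- df$b > 0       # TRUE where b is non-zero
masked <- keep * df$a    # TRUE/FALSE act as 1/0, so a is zeroed where b == 0
masked
# [1]  0  0  0  0  0  6  7  8  9 10

df$cum <- cumsum(masked) # running total only starts once b > 0
df$cum
# [1]  0  0  0  0  0  6 13 21 30 40
```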
Note
We assume this input shown in reproducible form:
Lines <- "
a b
1 0
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
10 5"
df <- read.table(text = Lines, header = TRUE)
Update
a and b had been reversed. Have fixed.
It would be better to create an index and update only those rows (initialising c to 0 so the remaining rows are not NA)
df1$c <- 0
i1 <- df1$b > 0
df1$c[i1] <- with(df1, cumsum(a[i1]))
Or in a single line
df1$c <- with(df1, cumsum(a * (b > 0)))
df1$c
#[1] 0 0 0 0 0 6 13 21 30 40
I really like how clean the other answers are using a * (b > 0), but that can sometimes be a bit confusing for newer programmers. As an alternative to this syntax, you can use the vectorized ifelse function.
df <- data.frame(a=c(1:10), b=c(0,0,0,0,0,1,2,3,4,5))
# One way
df$c <- cumsum(ifelse(df$b>0,df$a,0))
# Another way
df$d <- with(df,cumsum(ifelse(b>0,a,0)))

n number of column wise subtraction

I need help with column-wise subtraction across N columns. Below are the columns in an input dataframe.
input dataframe:
A B C D
1 4 6 2
3 3 3 4
1 2 2 2
4 4 4 4
5 2 3 2
Expected Output:
A B-A C-B D-C
1 3 2 -4
3 0 0 1
1 1 0 0
4 0 0 0
5 -3 1 -1
Similarly, there will be more columns, up to 10.
I am able to write the code for 2 columns:
Code:
df$`B-A` <- df$B - df$A
df$`C-B` <- df$C - df$B
and so on, but this should be done in a loop, as there are almost 10 to 12 columns. Please help me.
Here is a vectorized way to do this,
cbind.data.frame(df[1], df[-1] - df[-ncol(df)])
which gives,
A B C D
1 1 3 2 -4
2 3 0 0 1
3 1 1 0 0
4 4 0 0 0
5 5 -3 1 -1
Here is the instructive/pedagogic straightforward solution:
df <- data.frame(A=c(1,3,1,4,5), B=c(4,3,2,4,2), C=c(6,3,2,4,3), D=c(2,4,2,4,2))
df
Get the pattern:
cbind(df[1], df[2] - df[1], df[3] - df[2], df[4] - df[3]) # solved
Now generalize the pattern with sapply (for any number of columns):
cbind(df[1], sapply(1:(ncol(df)-1), function(i) df[i+1] - df[i]))
Output:
A B C D
1 1 3 2 -4
2 3 0 0 1
3 1 1 0 0
4 4 0 0 0
5 5 -3 1 -1
Using apply() you can also try this
cbind(df[1], t(apply(df, 1, diff)))
Output:
A B C D
1 1 3 2 -4
2 3 0 0 1
3 1 1 0 0
4 4 0 0 0
5 5 -3 1 -1
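The answers above keep the original column names (B, C, D). If the headers should match the expected output in the question (B-A, C-B, D-C), a small sketch building on the vectorized approach could rename them with paste. The names d and out are just for illustration.

```r
df <- data.frame(A = c(1, 3, 1, 4, 5),
                 B = c(4, 3, 2, 4, 2),
                 C = c(6, 3, 2, 4, 3),
                 D = c(2, 4, 2, 4, 2))

# each column minus the column to its left
d <- df[-1] - df[-ncol(df)]

# label each difference as "<right>-<left>", e.g. "B-A"
names(d) <- paste(names(df)[-1], names(df)[-ncol(df)], sep = "-")

out <- cbind(df[1], d)
names(out)
# [1] "A"   "B-A" "C-B" "D-C"
```

Columns with such names have to be accessed with backticks or double brackets, e.g. out$`B-A` or out[["B-A"]].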

R: Update adjacency matrix/data frame using pairwise combinations

Question
Let's say I have this dataframe:
# mock data set
df.size = 10
cluster.id<- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)
Will be something like:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Now I want to group these per cluster.id and see what kind of letters I can find within a cluster, so for example cluster 3 contains the letters A,E,D,C. Then I want to get all unique pairwise combinations (but not combinations with itself so no A,A e.g.): A,E ; A,D, A,C etc. Then I want to update the pairwise distance for these combination in an adjacency matrix/data frame.
Idea
# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update adjacency for each pairwise combinations
What I tried
# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible
# what I tried
update.adj <- function( data ) {
for (comb in combn(data$letters,2)) {
# stuck
}
}
test.set %>% group_by(cluster.id) %>% update.adj(.)
Probably there is an easy way to do this because I see adjacency matrices all the time, but I'm not able to figure it out.. Please let me know if it's not clear
Answer to comment
Answer to @Manuel Bickel:
For the data I gave as example (the table under "will be something like"):
This matrix will be A-->Z for the full dataset, keep that in mind.
A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
I will explain what I did:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Only the clusters containing more than one unique letter are relevant (because we don't want combinations with itself, e.g. cluster 4 contains only the letter B, so it would result in the combination B,B and is therefore not relevant):
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
Now I look for each cluster what pairwise combinations I can make:
cluster 3:
A,E
A,D
A,C
E,D
E,C
D,C
Update these combination in the adjacency matrix:
A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0
Then go to the next cluster
cluster 2
A,E
Update the adjacency matrix again:
A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
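The manual procedure above (group by cluster, take unique letters, enumerate pairs, increment the matrix) can be sketched as an explicit loop with split and combn. This is only an illustration of the steps as described; the names adj, u and pairs are made up, and the matrix is restricted to A-E here rather than A-Z.

```r
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

possible <- LETTERS[1:5]
adj <- matrix(0, nrow = length(possible), ncol = length(possible),
              dimnames = list(possible, possible))

# one character vector of letters per cluster
for (ltrs in split(test.set$letters, test.set$cluster.id)) {
  u <- unique(ltrs)
  if (length(u) < 2) next          # no pair possible, e.g. cluster 4 (B, B)
  pairs <- combn(u, 2)             # 2 x k matrix, one pair per column
  for (j in seq_len(ncol(pairs))) {
    a <- pairs[1, j]
    b <- pairs[2, j]
    adj[a, b] <- adj[a, b] + 1     # update both halves to keep it symmetric
    adj[b, a] <- adj[b, a] + 1
  }
}

adj["A", "E"]  # 2: A and E co-occur in clusters 2 and 3
```

A loop like this is easy to follow but will be slow on large data; it mainly serves to pin down what the adjacency counts mean.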
As reaction to the huge dataset
library(reshape2)
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x1 <- reshape2::dcast(test.set, cluster.id ~ letters)
x1
#cluster.id A B C D E
#1 1 1 0 0 0 0
#2 2 1 0 0 0 1
#3 3 1 0 1 1 1
#4 4 0 2 0 0 0
#5 5 1 0 0 0 0
x2 <- table(test.set)
x2
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
x1.c <- crossprod(x1)
#Error in crossprod(x, y) :
# requires numeric/complex matrix/vector arguments
x2.c <- crossprod(x2)
#works fine
Following above comment, here the code of Tyler Rinker used with your data. I hope this is what you want.
UPDATE: Following below comments, I added a solution using the package reshape2 in order to be able to handle larger amounts of data.
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x <- table(test.set)
x
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
#base approach, based on answer by Tyler Rinker
x <- crossprod(x)
diag(x) <- 0 #this is to set matches such as AA, BB, etc. to zero
x
# letters
# letters
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
#reshape2 approach
library(reshape2)
x <- acast(test.set, cluster.id ~ letters)
x <- crossprod(x)
diag(x) <- 0
x
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
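One caveat worth checking with crossprod: if a letter occurs more than once within a cluster, the counts are multiplied. If each cluster should contribute at most 1 per pair (as in the question's description of unique pairs), the table could be reduced to presence/absence first. A small sketch with hypothetical data (dup.set is invented for this example):

```r
# hypothetical data: cluster 1 contains A twice and E once
dup.set <- data.frame(cluster.id = c(1, 1, 1, 2),
                      letters    = c("A", "A", "E", "A"),
                      stringsAsFactors = FALSE)

counts <- unclass(table(dup.set))   # plain integer matrix, clusters x letters

raw <- crossprod(counts)            # products of raw counts
diag(raw) <- 0

present <- (counts > 0) * 1         # 1 if the letter occurs in the cluster
bin <- crossprod(present)           # number of clusters sharing each pair
diag(bin) <- 0

raw["A", "E"]  # 2, because A appears twice in cluster 1
bin["A", "E"]  # 1, one cluster contains both A and E
```

For the example data in the question the two versions agree, because no cluster that spans several letters contains a repeated letter.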

building matrix out of a vector with the difference of each value

dataset2 <- data.frame(bird=c("A","B","C","D","E","F"), rank=c(1:6))
I have this example dataset and now I want to build a 6*6 matrix with the rank difference between each pair of birds. How can I do this?
Is this what you want?
m <- with(dataset2, outer(rank, rank, '-'))
rownames(m) <- colnames(m) <- dataset2$bird
# A B C D E F
# A 0 -1 -2 -3 -4 -5
# B 1 0 -1 -2 -3 -4
# C 2 1 0 -1 -2 -3
# D 3 2 1 0 -1 -2
# E 4 3 2 1 0 -1
# F 5 4 3 2 1 0
You might also want to do this afterwards:
m[upper.tri(m)] <- 0
tail(m[,-ncol(m)],-1)
To get:
# A B C D E
#B 1 0 0 0 0
#C 2 1 0 0 0
#D 3 2 1 0 0
#E 4 3 2 1 0
#F 5 4 3 2 1
This is kind of the definition of the distance matrix, no?
dist(dataset2$rank, method="maximum")
####
1
2 1
3 2 1
4 3 2 1
5 4 3 2 1
With the distinction that it returns positive distances only, so maybe it doesn't suit the OP.
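To see how the two answers relate, here is a short sketch comparing the signed outer() matrix with the dist() result on the rank column. It assumes only the numeric rank column is passed to dist; the names m and d are illustrative.

```r
dataset2 <- data.frame(bird = c("A", "B", "C", "D", "E", "F"), rank = 1:6)

# signed differences: row rank minus column rank
m <- with(dataset2, outer(rank, rank, "-"))
dimnames(m) <- list(as.character(dataset2$bird), as.character(dataset2$bird))

# dist() gives unsigned distances; expand it to a full square matrix
d <- as.matrix(dist(dataset2$rank))

# the absolute values of the signed matrix match the distance matrix
all(abs(unname(m)) == unname(d))

m["F", "A"]  # 5: F is ranked 5 places after A
m["A", "F"]  # -5
```

So outer() keeps the direction of the difference, while dist() is the better fit when only the magnitude matters.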

Creating a matrix with all combinations within a budget

I am attempting to create a matrix that includes all combinations of numbers within a range such that the row sums to a specific value. I am not sure if there is a function for this or if I would need to create the function manually. I have tried combn function but it does not constrain to the sum and so the matrix gets large very quickly.
example: 3 numbers per row that sum to 5
5,0,0
4,1,0
4,0,1
3,2,0
3,0,2
3,1,1
2,3,0
2,0,3
2,2,1
2,1,2
etc..
These combinatorial objects are called partitions, and their computation is implemented by the partitions package.
Depending on what you really want, use one of the following:
library(partitions)
## The first argument says you want to enumerate all partitions in which the
## second argument (5) is broken into three summands, each of which can take a
## maximum value of 5.
blockparts(rep(5,3),5) ## Equiv: blockparts(c(5,5,5), 5)
#
# [1,] 5 4 3 2 1 0 4 3 2 1 0 3 2 1 0 2 1 0 1 0 0
# [2,] 0 1 2 3 4 5 0 1 2 3 4 0 1 2 3 0 1 2 0 1 0
# [3,] 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
restrictedparts(5,3)
#
# [1,] 5 4 3 3 2
# [2,] 0 1 2 1 2
# [3,] 0 0 0 1 1
Perhaps this does what you want:
x <- expand.grid(replicate(3, list(0:5)))
x[rowSums(x) == 5, ]
# Var1 Var2 Var3
# 6 5 0 0
# 11 4 1 0
# 16 3 2 0
# 21 2 3 0
# 26 1 4 0
# 31 0 5 0
# 41 4 0 1
# 46 3 1 1
# 51 2 2 1
# 56 1 3 1
# 61 0 4 1
# 76 3 0 2
# 81 2 1 2
# 86 1 2 2
# 91 0 3 2
# 111 2 0 3
# 116 1 1 3
# 121 0 2 3
# 146 1 0 4
# 151 0 1 4
# 181 0 0 5
expand.grid and combn are somewhat related, but I find expand.grid to be more applicable to these types of problems.
There is also the permutations function from the "gtools" package:
library(gtools)
x <- permutations(6, 3, v = 0:5, set = FALSE, repeats.allowed=TRUE)
x[rowSums(x) == 5, ]
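The expand.grid approach can be wrapped into a small helper for any number of columns and any budget. This is a sketch only; budget_grid is a made-up name, and the full grid has (total + 1)^k rows before filtering, so it is only practical for small inputs.

```r
# all rows of k non-negative integers that sum to `total`
budget_grid <- function(k, total) {
  g <- expand.grid(replicate(k, 0:total, simplify = FALSE))
  g[rowSums(g) == total, , drop = FALSE]
}

res <- budget_grid(3, 5)
nrow(res)
# [1] 21

# sanity check against the stars-and-bars count choose(total + k - 1, k - 1)
nrow(res) == choose(5 + 3 - 1, 3 - 1)
# [1] TRUE
```

The 21 rows match the 21 columns returned by blockparts(rep(5,3),5) above, just transposed and in a different order.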

Resources