I have a contingency table (ct) like this:
ct <- read.table(text = "
1 2 3 4 5 6
1 0 0 1 0 2 0
2 0 0 2 0 0 0
70 0 0 3 0 0 0
76 15 13 19 2 9 8
85 0 0 2 0 0 0
109 0 0 0 0 1 2
479 0 0 0 0 2 0
491 2 0 0 0 0 0
1127 0 1 0 1 6 0
1131 0 1 1 1 2 0
1206 1 3 1 0 0 1
1208 1 0 1 0 0 1
1210 0 1 0 0 0 1
1225 2 0 1 0 0 0
1232 0 0 0 0 1 1
1242 0 0 0 1 0 1
1243 1 0 0 0 1 1
1251 0 0 2 0 1 2
1267 0 2 1 0 0 0
4415 0 2 0 0 0 0
4431 0 0 0 2 0 0
4808 0 0 0 0 2 0
4823 0 2 0 0 0 0
", header = TRUE, check.names = FALSE)
Where rows represent clusters, columns represent hospitals, and the numbers in the table are counts of isolates.
For example: cluster 1 has 3 isolates, 1 in hospital 3 and 2 in hospital 5.
I now want to check whether clusters and hospitals are dependent on each other or not. For that, I would like to create 1000 randomly distributed tables, where every isolate in a cluster has the chance to fall into any hospital.
For example: the 3 isolates in cluster 1 might then be distributed over 3 hospitals, so that I get the values: 0 1 1 1 0 0.
Combinations can occur multiple times.
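For concreteness, here is a minimal sketch (my illustration, not code from the question) of drawing one such random table with rmultinom, assuming each of the 6 hospitals is equally likely:

# redistribute each cluster's isolates over the hospitals at random;
# row sums (isolates per cluster) are preserved by construction
set.seed(42)
random_table <- t(sapply(rowSums(ct), function(n)
  rmultinom(1, size = n, prob = rep(1, 6))))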
I tried this:
replicates <- 1000
permutations <- lapply(seq(replicates), function(i, ct){
  list <- lapply(apply(ct, 1, list), unlist)
  list <- lapply(list, function(x) as.numeric(x))
  z <- as.data.frame(do.call(rbind, lapply(list, function(x) sample(x))))
}, ct = ct)
But that only shuffles the existing values within each row to other positions; it does not redistribute the isolates over the hospitals.
Can someone help me with that?
I concur with Maurits Evers' answer: at full rank you get a combinatorial number of possibilities per row; n variables mean 2^n combinations, and if you add m-1 columns this yields 2^(n+m) possibilities.
Here's an alternative using partitions::compositions.
library(partitions)
# smaller toy data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# calculate row sums
rs <- rowSums(d)
# for each unique row sum, partition the value with order m = number of columns
# this avoids repeating calculation of partitions on duplicate row sums
l <- lapply(unique(rs), compositions, m = ncol(d))
# name list elements with row sums
names(l) <- unique(rs)
# set number of samples
n <- 4
# to reproduce sample in this example
set.seed(1)
# loop over rows in data frame
lapply(1:nrow(d), function(i){
  # index list of partitions using row sums
  m <- l[[as.character(rs[i])]]
  # number of columns to sample from
  nc <- ncol(m)
  # select columns from matrix using a sample of n column indexes
  m[ , sample(nc, n, replace = TRUE)]
})
The result is a list where each element is a matrix for each row of the original data. Each matrix column is one (sampled) partition.
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 1 0
# [2,] 1 2 0 0
# [3,] 0 0 1 2
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 2
# [2,] 3 1 0 0
# [3,] 0 3 4 2
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 1
# [2,] 0 0 1 1
# [3,] 1 0 0 0
I tried to partition the largest row sum in your example data (66), and it runs pretty quickly. Thus, if your row sums are not very much larger and the number of columns is small (like here), the code above may be a viable option.
system.time(p <- compositions(66, 6))
# user system elapsed
# 1.53 0.16 1.68
str(p)
# 'partition' int [1:6, 1:13019909] 66 0 0 0 0 0 65 1 0 0 ...
Note that it 'explodes' rapidly if the number of columns increases:
system.time(p <- compositions(66, 7))
# user system elapsed
# 14.11 1.61 15.72
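The growth is predictable in advance: the number of compositions of n into m non-negative parts is choose(n + m - 1, m - 1) (stars and bars), so you can estimate the size before calling compositions. A quick check (my addition, using only base R's choose):

n_comp <- function(n, m) choose(n + m - 1, m - 1)
n_comp(66, 6)
# [1] 13019909   (matches the 13019909 columns in str(p) above)
n_comp(66, 7)
# [1] 156238908  (roughly 12 times larger, which explains the slowdown)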
Sorry @Henrik for the late response. Your code worked quite well for me! However, with the help of a colleague of mine, I figured out the following code (shown here using your sample data):
# data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# number of replicates I want
replicates <- 1000
# number of columns in the table
k <- 3
# unlist the data frame into a list of numeric row vectors
rows <- lapply(apply(d, 1, list), unlist)
# calculate replicates of the data frame, where counts are randomly
# redistributed within rows
permutations <- lapply(seq(replicates), function(j){
  l_sampled <- lapply(rows, function(x){
    # draw a random column for each count in the row
    pos.random <- sample(k, sum(x), replace = TRUE)
    # tally how many counts landed in each column
    x.random <- rep(0, k)
    for (i in 1:k){
      x.random[i] <- sum(pos.random == i)
    }
    x.random
  })
  data.frame(matrix(unlist(l_sampled), nrow = length(l_sampled), byrow = TRUE))
})
# Example results:
> permutations[[8]]
X1 X2 X3
1 2 0 0
2 1 2 1
3 1 0 1
> permutations[[10]]
X1 X2 X3
1 0 1 1
2 2 0 2
3 0 2 0
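To actually finish the dependence check, one option (a sketch of my own, not part of the colleague's code) is to compare the observed chi-squared statistic against the permutation null distribution:

# empirical p-value: how often is a permuted table at least as
# "dependent" as the observed one? assumes d and permutations from above
stat <- function(tab) suppressWarnings(chisq.test(tab)$statistic)
obs  <- stat(d)
null <- sapply(permutations, stat)
mean(null >= obs)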
I would like to know if there is a way to insert a column into a matrix, such that...
p1 <- c("a","b","c","e","d","a","c")
p2 <- c("a","b","c","e","e","a","c")
p1mat <- model.matrix(~p1 + 0)
p2mat <- model.matrix(~p2 + 0)
colnames(p1mat) <- gsub("p1","",colnames(p1mat))
colnames(p2mat) <- gsub("p2","",colnames(p2mat))
This would give me for p1mat:
a b c d e
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
and for p2mat
a b c e
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
5 0 0 0 1
6 1 0 0 0
7 0 0 1 0
My question is: is there a way to sneak a column vector d consisting of only zeros into the matrix p2mat, such that
d
0
0
0
0
0
0
0
and the vector is automatically ordered and placed between columns c and e, resulting in the following matrix for p2mat:
a b c d e
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
Basically I want matrix p2mat to mirror every column in p1mat, creating an identically sized matrix, so I can keep track of the data via dummy matrices.
Thank you.
You can factor both your inputs, making sure they both have the same levels. Then model.matrix should work as you expected.
Example:
p1 <- c("a","b","c","e","d","a","c")
p2 <- c("a","b","c","e","e","a","c")
levs <- sort(unique(c(p1, p2)))
f1 <- factor(p1, levs)
f2 <- factor(p2, levs)
model.matrix(~f1 + 0)
# f1a f1b f1c f1d f1e
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 0 1
# 5 0 0 0 1 0
# 6 1 0 0 0 0
# 7 0 0 1 0 0
# attr(,"assign")
# [1] 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$f1
# [1] "contr.treatment"
model.matrix(~f2 + 0)
# f2a f2b f2c f2d f2e
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 0 1
# 5 0 0 0 0 1
# 6 1 0 0 0 0
# 7 0 0 1 0 0
# attr(,"assign")
# [1] 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$f2
# [1] "contr.treatment"
If you're really looking to write a function, you might want to look at something like the following:
myfun <- function(..., overwrite = FALSE) {
  # capture the inputs and name the list after the supplied objects
  l <- setNames(list(...), sapply(substitute(list(...))[-1], deparse))
  # sorted union of all column names
  cols <- sort(unique(unlist(lapply(l, colnames), use.names = FALSE)))
  out <- lapply(l, function(x) {
    # existing columns first, then the missing ones
    cols_x <- c(colnames(x), setdiff(cols, colnames(x)))
    # missing columns come out as NA columns, get named, and are reordered
    temp <- `colnames<-`(x[, match(cols_x, colnames(x))], cols_x)[, cols]
    replace(temp, is.na(temp), 0)
  })
  if (isTRUE(overwrite)) list2env(out, envir = .GlobalEnv)
  out
}
This will take any number of items as inputs, compare the columns in all of them, and add missing columns where necessary. The output is stored as a list, which is a convenient structure to keep if you want to continue doing similar operations on all of the matrices. If you want to overwrite the original object, then you can change the "overwrite" argument to TRUE.
Here's some more sample data to work with.
set.seed(1)
p1 <- c("a","b","c","e","d","a","c"); p2 <- c("a","b","x","e","e","a","x")
cols <- sort(unique(c(p1, p2)))  # defined here; the original snippet used it without defining it
p3 <- sample(c(cols, "z"), 7, TRUE)
p1mat <- model.matrix(~p1 + 0)
p2mat <- model.matrix(~p2 + 0)
p3mat <- model.matrix(~p3 + 0)
colnames(p1mat) <- gsub("p1", "", colnames(p1mat))
colnames(p2mat) <- gsub("p2", "", colnames(p2mat))
colnames(p3mat) <- gsub("p3", "", colnames(p3mat))
Try the function out:
myfun(p1mat, p2mat)
myfun(p2mat, p1mat)
myfun(p3mat, p1mat)
myfun(p3mat, p1mat, p2mat)
This function takes two matrices and compares their dimensions. If they differ, it inserts a new column of zeros into the matrix with fewer columns, at exactly the column position that is missing. It thus produces a new matrix with the same dimensions as the other.
match_matrices <- function(matrix1, matrix2) {
  if (ncol(matrix1) != ncol(matrix2)) {
    # split matrix2 into a list of its columns
    get_cols <- function(x) lapply(seq_len(ncol(x)), function(i) as.numeric(x[, i]))
    k <- get_cols(matrix2)
    # find the column name present in matrix1 but missing from matrix2
    odd_one_out <- setdiff(colnames(matrix1), colnames(matrix2))
    insert_at <- which(colnames(matrix1) == odd_one_out)
    # splice a zero column in at that position and reassemble the matrix
    res <- t(do.call('rbind', append(k, list(rep(0, nrow(matrix2))), insert_at - 1)))
    colnames(res) <- colnames(matrix1)
  } else {
    res <- matrix2
  }
  return(res)
}
Using your matrices:
match_matrices(p1mat, p2mat)
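If I have traced the function correctly, this returns p2mat with the zero d column spliced in at position 4, i.e. the target matrix from the question:

#      a b c d e
# [1,] 1 0 0 0 0
# [2,] 0 1 0 0 0
# [3,] 0 0 1 0 0
# [4,] 0 0 0 0 1
# [5,] 0 0 0 0 1
# [6,] 1 0 0 0 0
# [7,] 0 0 1 0 0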
If I have a data.frame
df <- data.frame(DEP=letters[1:5], ARR=letters[11:15], NO=1:5+5)
DEP ARR NO
1 a k 6
2 b l 7
3 c m 8
4 d n 9
5 e o 10
I want to create a matrix with DEP as the row ID and ARR as the column ID, filled with the matching NO values...
e.g.
k l m n o
a 6 7 8 9 10 ...etc
Each combination is unique.
DEP and ARR are the same vector of names. I have chosen two different sample ones here for clarity.
I am struggling to use match to sort them and fill them into the matrix template I created below:
mat <- matrix(0, nrow(df), nrow(df))
colnames(mat) <- df$ARR
rownames(mat) <- df$DEP
k l m n o
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 0 0 0 0 0
e 0 0 0 0 0
Is there an efficient way of doing this? Many thanks for all advice!
?xtabs:
xtabs(NO ~ ., data=df)
# ARR
#DEP k l m n o
# a 6 0 0 0 0
# b 0 7 0 0 0
# c 0 0 8 0 0
# d 0 0 0 9 0
# e 0 0 0 0 10
If I understood your question correctly, you could use a sparse matrix definition:
library(Matrix)
mat <- spMatrix(length(df$DEP), length(df$ARR),
                seq(df$DEP), seq(df$ARR), as.numeric(as.character(df$NO)))
rownames(mat) <- df$DEP
colnames(mat) <- df$ARR
#> as.matrix(mat)
# k l m n o
#a 6 0 0 0 0
#b 0 7 0 0 0
#c 0 0 8 0 0
#d 0 0 0 9 0
#e 0 0 0 0 10
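For completeness, the zero template from the question can also be filled directly with base R's two-column character-matrix indexing (a sketch, not part of either answer):

mat <- matrix(0, nrow(df), nrow(df),
              dimnames = list(as.character(df$DEP), as.character(df$ARR)))
# each (DEP, ARR) name pair addresses exactly one cell
mat[cbind(as.character(df$DEP), as.character(df$ARR))] <- df$NO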
I have data from a barter economy. I am trying to create a matrix that counts how frequently items act as counterparties with other items.
As an example:
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4))
)
TradeID Origin ItemID
1 1 1 1
2 1 0 2
3 1 0 3
4 2 1 4
5 2 1 5
6 2 0 1
7 3 1 1
8 3 0 6
9 4 1 7
10 4 0 1
11 5 1 1
12 5 0 8
13 6 1 7
14 6 0 5
15 7 1 1
16 7 0 1
17 8 1 2
18 8 0 3
19 8 0 4
20 9 1 1
21 9 0 8
Where TradeID indicates a specific transaction. ItemID indicates an item, and Origin indicates which direction the item went.
For example, given my data the matrix I'd create would look like this (rows and columns are ItemIDs):

  1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
For example, the value 2 at [1,8] indicates that item 1 & 8 were counterparties in two trades. (Note that it's a symmetric matrix, and so [8,1] also has the value 2).
While the value of 1 at [1,2] indicates that item 1 and 2 were counterparties in only one trade (all the other 1s throughout the matrix indicate the same)
As an odd example, note at [1,1], the value of 1 indicates that item 1 was a counterparty to itself once (trade number 7)
A little extra insight into my motivation, note in my simple example that item 1 tends to act as counterparty with many different items. In a barter economy (one without explicit money) we might expect a commodity currency to be a counterparty relatively more frequently than non-commodity-currencies. A matrix like this would be the first step at one way of discovering which item was a commodity currency.
I've been struggling with this for a while. But I think I'm nearly done with an overly complicated solution, which I'll post shortly.
I'm curious if y'all might offer a bit of help also.
Alright, I think I've got this figured out. The short answer is:
Reduce("+", by(myDat, myDat$TradeID, function(x)
  pmin(table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]) +
       table(x$ItemID[x$Origin == 1]) %o% table(x$ItemID[x$Origin == 0]), 1)))
Which gives the following matrix, matching the desired result:
1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
Here's the long answer. You can get a list of matrices for each TradeID using the by and outer (%o%) and table functions. But this double-counts trade 7, where item 1 is traded for item 1, so I use the pmin function to fix this. Then I sum across the list by using the Reduce function.
And here's the steps to get there. Note the addition of TradeID # 9, which was left out of the question's code.
# Data
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,9,9)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4,1,8))
)
# Sum in one direction
by(myDat, myDat$TradeID, function(x)
  table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]))
# Sum in both directions
by(myDat, myDat$TradeID, function(x)
  table(x$ItemID[x$Origin == 1]) %o% table(x$ItemID[x$Origin == 0]) +
  table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]))
# Remove the double-count in trade 7 by capping each cell at 1
by(myDat, myDat$TradeID, function(x)
  pmin(table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]) +
       table(x$ItemID[x$Origin == 1]) %o% table(x$ItemID[x$Origin == 0]), 1))
# Sum across the list
Reduce("+", by(myDat, myDat$TradeID, function(x)
  pmin(table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1]) +
       table(x$ItemID[x$Origin == 1]) %o% table(x$ItemID[x$Origin == 0]), 1)))
One way to speed this up would be to sum in only 1 direction (taking advantage of symmetry) and then clean up the results.
result = Reduce("+", by(myDat, myDat$TradeID, function(x)
  table(x$ItemID[x$Origin == 0]) %o% table(x$ItemID[x$Origin == 1])))
result2 = result + t(result)
diag(result2) = diag(result)
result2
1 2 3 4 5 6 7 8
1 1 1 1 1 1 1 1 2
2 1 0 1 1 0 0 0 0
3 1 1 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 0
6 1 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 0
8 2 0 0 0 0 0 0 0
This appears to run nearly twice as fast.
> microbenchmark(Reduce("+",by(myDat, myDat$TradeID, function(x) pmin(table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]) + table(x$ItemID[x$Origin==1]) %o% table(x$ItemID[x$Origin==0]),1))))
Unit: milliseconds
min lq median uq max neval
7.489092 7.733382 7.955861 8.536359 9.83216 100
> microbenchmark(Reduce("+",by(myDat, myDat$TradeID, function(x) table(x$ItemID[x$Origin==0]) %o% table(x$ItemID[x$Origin==1]))))
Unit: milliseconds
min lq median uq max neval
4.023964 4.18819 4.277767 4.452824 5.801171 100
This will give you the number of observations per TradeID and ItemID
myDat <- data.frame(
  TradeID = as.factor(c(1,1,1,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8)),
  Origin  = as.factor(c(1,0,0,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0)),
  ItemID  = as.factor(c(1,2,3,4,5,1,1,6,7,1,1,8,7,5,1,1,2,3,4))
)
result = tapply(myDat$Origin, list(myDat$ItemID,myDat$TradeID), length)
result[is.na(result)] = 0
result["1","7"]
result will then be:
> result
1 2 3 4 5 6 7 8
1 1 1 1 1 1 0 2 0
2 1 0 0 0 0 0 0 1
3 1 0 0 0 0 0 0 1
4 0 1 0 0 0 0 0 1
5 0 1 0 0 0 1 0 0
6 0 0 1 0 0 0 0 0
7 0 0 0 1 0 1 0 0
8 0 0 0 0 1 0 0 0
This will give you the proportion of Origin == 1 observations per TradeID and ItemID:
result = tapply(myDat$Origin, list(myDat$ItemID, myDat$TradeID),
                function(x) sum(as.numeric(as.character(x))) / length(x))
You can set the NA values in the last matrix to 0 using result[is.na(result)] = 0, but that would conflate having no observations with having nothing but 0-Origin trades.
This will give you the number of observations per consecutive ItemIDs:
idxList <- with(myDat, tapply(ItemID, TradeID, FUN = function(items)
  lapply(seq(length(items) - 1),
         function(i) sort(c(items[i], items[i + 1])))))
# indices of observations
idx <- do.call(rbind, unlist(idxList, recursive = FALSE))
# create a matrix
ids <- unique(myDat$ItemID)
mat <- matrix(0, length(ids), length(ids))
# place values in matrix
for (i in seq(nrow(idx))) {
  mat[idx[i, , drop = FALSE]] <- mat[idx[i, , drop = FALSE]] + 1
}
# create symmetric matrix
mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 1 0 0 1 1 1 1
[2,] 1 0 2 0 0 0 0 0
[3,] 0 2 0 1 0 0 0 0
[4,] 0 0 1 0 1 0 0 0
[5,] 1 0 0 1 0 0 1 0
[6,] 1 0 0 0 0 0 0 0
[7,] 1 0 0 0 1 0 0 0
[8,] 1 0 0 0 0 0 0 0
Using R, what is the best way to aggregate rows on a condition that spans multiple rows?
For example, to aggregate any run of rows where z = 0 occurs n or more times.
Here is what this would look like when run on the following sample table with n = 3.
Sample Table x:
x y z
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4
Results Table:
x y z
0 0 6
49 7 0 <- Above three rows got aggregated
10 0 1
0 0 2
11 11 0 <- Above four rows got aggregated
0 0 2
18 0 4
Since it seems like you're still in the "leaRning phase", I thought an example using the plyr package would be helpful. plyr is an extremely handy library which allows you to slice/dice datasets and summarize their subgroups in a flexible (and terse -- as you'll see below) manner, so it would likely be worth your time to get to know. If you find yourself needing to do similar operations on extremely large data sets, you might also consider looking into the data.table package.
I'm assuming you've done Roman's textConnection trick to get your data into a data.frame named mmf.
I'm adding an idx column to mmf so you can subset it and process the results group by group:
library(plyr)
# mmf <- read.table(textConnection( ...
rle.idx <- rle(mmf$z)
mmf$idx <- rep(seq(rle.idx$lengths), rle.idx$lengths)
ans <- ddply(mmf, .(idx), colwise(sum))
And ans looks like:
 x  y z idx
 0  0 6   1
49  7 0   2
10  0 1   3
 0  0 2   4
11 11 0   5
 0  0 2   6
18  0 4   7
Just remove the idx column and you're done, e.g.:
ans <- ans[, -4]
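Since data.table was mentioned above for larger data, here is a minimal sketch of the same run-grouping idea with data.table's rleid (my addition; it assumes the x/y/z columns of mmf as read in below):

library(data.table)
dt <- as.data.table(mmf)
# rleid(z) gives one group id per run of identical z values
dt[, lapply(.SD, sum), by = .(grp = rleid(z)), .SDcols = c("x", "y", "z")][, grp := NULL][]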
This is the code I used to produce your result. If you have any questions, fire away.
mmf <- read.table(textConnection("x y z # read in your example data
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4"), header = TRUE)
# see where there are runs of zeros in the z column
mmf.rle <- rle(mmf$z)
mmf.rle <- data.frame(lengths = mmf.rle$lengths, values = mmf.rle$values)
merge.rows <- 3
# select rows that have more or equal to three zeros
mmf.zero <- which(mmf.rle$values == 0 & mmf.rle$lengths >= merge.rows)
for (i in mmf.zero) {
  # find which positions are zero, calculate sums and insert the result into a
  # data.frame where the rows in question were turned to NA
  m.mmf <- mmf.rle$lengths[1:i] # run lengths from the start up to this zero run
  select.rows <- (sum(m.mmf[1:length(m.mmf) - 1]) + 1):sum(m.mmf) # rows spanned by this run
  mmf.sum <- colSums(mmf[select.rows, ]) # sum values column-wise for rows that have at least three zeros in z
  mmf[select.rows, ] <- NA # now that we have a sum by columns, we turn those rows into NAs...
  mmf[select.rows[1], ] <- mmf.sum # ... and insert the summed result into the first NA row
}
# remove any left over NA rows
mmf <- mmf[complete.cases(mmf),]
DATA
mmf <- read.table(textConnection("x y z # read in your example data
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4"), header = TRUE)
CODE
agg_n <- function(dat = mmf, coln = "z", n = 3){
  agg <- function(.x) {
    # sum values if the first n records in column coln are 0
    if(all(.x[[coln]][seq(n)] == 0)) {
      y <- rbind(colSums(.x[seq(n), ]), .x[-1 * seq(n), ])
    } else y <- .x
    return(y)
  }
  # groups of records starting with a 0 in column coln
  G <- cumsum(diff(c(0L, dat[[coln]] == 0)) == 1)
  new_dat <- do.call(rbind, lapply(split(dat, G), agg))
  return(new_dat)
}
OUTPUT
> agg_n()
x y z
0 0 0 6
1.1 49 7 0
1.5 10 0 1
1.6 0 0 2
2.1 11 11 0
2.10 0 0 0
2.11 0 0 2
2.12 18 0 4