I have a contingency table (ct) like this:
ct <- read.table(text = "
      1  2  3 4 5 6
1     0  0  1 0 2 0
2     0  0  2 0 0 0
70    0  0  3 0 0 0
76   15 13 19 2 9 8
85    0  0  2 0 0 0
109   0  0  0 0 1 2
479   0  0  0 0 2 0
491   2  0  0 0 0 0
1127  0  1  0 1 6 0
1131  0  1  1 1 2 0
1206  1  3  1 0 0 1
1208  1  0  1 0 0 1
1210  0  1  0 0 0 1
1225  2  0  1 0 0 0
1232  0  0  0 0 1 1
1242  0  0  0 1 0 1
1243  1  0  0 0 1 1
1251  0  0  2 0 1 2
1267  0  2  1 0 0 0
4415  0  2  0 0 0 0
4431  0  0  0 2 0 0
4808  0  0  0 0 2 0
4823  0  2  0 0 0 0
", header = TRUE, check.names = FALSE)
Where rows represent clusters, columns represent hospitals, and the numbers in the table are counts of isolates.
For example: cluster 1 has 3 isolates, 1 in hospital 3 and 2 in hospital 5.
I now want to check whether clusters and hospitals are dependent on each other or not. For that, I would like to create 1000 randomly distributed tables, where every isolate in a cluster has an equal chance of falling into any hospital.
For example: the 3 isolates in cluster 1 might then be distributed over 3 hospitals, so that I get the values 0 1 1 1 0 0.
Combinations can occur multiple times.
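To make the target concrete: for a single cluster, this redistribution amounts to a multinomial draw with equal hospital probabilities. A minimal sketch of the goal (rmultinom normalises the prob weights internally):
# redistribute the 3 isolates of cluster 1 uniformly over the 6 hospitals
row1 <- c(0, 0, 1, 0, 2, 0)
set.seed(1)
rmultinom(1, size = sum(row1), prob = rep(1, 6))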
I tried this:
replicates <- 1000
permutations <- lapply(seq(replicates), function(i, ct){
  list <- lapply(apply(ct, 1, list), unlist)
  list <- lapply(list, function(x) as.numeric(x))
  z <- as.data.frame(do.call(rbind, lapply(list, function(x) sample(x))))
}, ct = ct)
But that only shuffles the existing values within each row to other positions; it never redistributes the isolates themselves.
Can someone help me with that?
I concur with Maurits Evers' answer: at full rank you get a binomial number of combinations per line; n variables means 2^n combinations, and if you add m-1 columns this yields 2^(n+m) possibilities.
Here's an alternative using partitions::compositions.
library(partitions)
# smaller toy data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# calculate row sums
rs <- rowSums(d)
# for each unique row sum, partition the value with order m = number of columns
# this avoids repeating calculation of partitions on duplicate row sums
l <- lapply(unique(rs), compositions, m = ncol(d))
# name list elements with row sums
names(l) <- unique(rs)
# set number of samples
n <- 4
# to reproduce sample in this example
set.seed(1)
# loop over rows in data frame
lapply(1:nrow(d), function(i){
  # index list of partitions using row sums
  m <- l[[as.character(rs[i])]]
  # number of columns to sample from
  nc <- ncol(m)
  # select columns from matrix using a sample of n column indexes
  m[, sample(nc, n, replace = TRUE)]
})
The result is a list with one element per row of the original data; each element is a matrix whose columns are the (sampled) compositions.
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 1 0
# [2,] 1 2 0 0
# [3,] 0 0 1 2
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 2
# [2,] 3 1 0 0
# [3,] 0 3 4 2
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 1
# [2,] 0 0 1 1
# [3,] 1 0 0 0
I tried to partition the largest row sum in your example data (66), and it runs pretty quickly. Thus, if your row sums are not much larger and the number of columns is small (as here), the code above may be a viable option.
system.time(p <- compositions(66, 6))
# user system elapsed
# 1.53 0.16 1.68
str(p)
# 'partition' int [1:6, 1:13019909] 66 0 0 0 0 0 65 1 0 0 ...
Note that it 'explodes' rapidly if the number of columns increases:
system.time(p <- compositions(66, 7))
# user system elapsed
# 14.11 1.61 15.72
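The growth is expected: the number of compositions of n into m non-negative parts is choose(n + m - 1, m - 1), which you can check against the dimensions reported above:
choose(66 + 6 - 1, 6 - 1) # compositions of 66 into 6 parts
# [1] 13019909
choose(66 + 7 - 1, 7 - 1) # adding a seventh column
# [1] 156238908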
Sorry @Henrik for the late response. Your code worked out quite well for me! However, with the help of a colleague of mine, I figured out the following code (shown here using your sample data):
# data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# number of replicates I want
replicates <- 1000
# number of columns in the table
k <- 3
# split the data frame into a list of numeric row vectors
rows <- lapply(apply(d, 1, list), unlist)
# calculate replicates of the data frame, where the counts are
# redistributed randomly over the columns within each row
permutations <- lapply(seq(replicates), function(j){
  l_sampled <- lapply(rows, function(x){
    # assign each isolate of the row to a random column
    pos.random <- sample(k, sum(x), replace = TRUE)
    # count how many isolates landed in each column
    x.random <- rep(0, k)
    for (i in 1:k){
      x.random[i] <- sum(pos.random == i)
    }
    x.random
  })
  data.frame(matrix(unlist(l_sampled), nrow = length(l_sampled), byrow = TRUE))
})
#Example for results:
> permutations[[8]]
X1 X2 X3
1 2 0 0
2 1 2 1
3 1 0 1
> permutations[[10]]
X1 X2 X3
1 0 1 1
2 2 0 2
3 0 2 0
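For the actual dependence check, the observed table's chi-squared statistic can then be compared against the replicates. A rough sketch (assuming permutations was built from the real ct rather than the toy d; degenerate tables such as those with an all-zero column yield NaN and are dropped via na.rm):
obs <- suppressWarnings(chisq.test(ct)$statistic)
null <- sapply(permutations, function(tab)
  suppressWarnings(chisq.test(tab)$statistic))
# empirical p-value
mean(null >= obs, na.rm = TRUE)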
Related
I have a data frame object where combinations of variables are represented by 1, but which is sparsely populated in that I do not have all combinations mapped out.
e.g.
A B C Outcome
1 0 0 700
0 1 0 900
0 0 1 450
1 1 0 280
0 1 1 100
... which is missing the potential combinations [101] and [111]
From this, I'd like to expand out all combinations of A, B, and C, taking the outcome value where the combination exists, and where not, populate Outcome with a zero.
e.g.
A B C Outcome
1 0 0 700
1 1 0 280
1 0 1 0 <- new row
1 1 1 0 <- new row
0 1 0 900
0 1 1 100
0 0 1 450
I'm afraid I don't really have any idea how to do this functionally. I've had a look at expand.grid(); for example, the following (which also uses the plyr package):
expand.grid(rlply(n, c(0,1)))
which for n=3 gives
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
which pretty much gives me the grid I'm after, but I'm not clear on how to join my "Outcome" values to this grid, particularly where n is large (say 60 or 70 variables).
Any help gratefully received!
df <- read.table(text =
"A B C Outcome
1 0 0 700
0 1 0 900
0 0 1 450
1 1 0 280
0 1 1 100",
header = TRUE)
res <-
merge(
x = do.call(what = "expand.grid", lapply(head(as.list(df), - 1), unique)),
y = df,
all.x = TRUE
)
res$Outcome[is.na(res$Outcome)] <- 0
res
# A B C Outcome
# 1 0 0 0 0
# 2 0 0 1 450
# 3 0 1 0 900
# 4 0 1 1 100
# 5 1 0 0 700
# 6 1 0 1 0
# 7 1 1 0 280
# 8 1 1 1 0
Edit:
Not sure whether it should go in a separate answer, but here is a more elegant way with the tidyr package:
library(tidyr)
complete(df, A, B, C, fill = list(Outcome = 0))
If you want to avoid typing all 60 or 70 column names:
complete_(df, cols = setdiff(names(df), "Outcome"), fill = list(Outcome = 0))
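Note: complete_() has since been deprecated in tidyr. With current versions, one way to get the same effect without typing the names is to splice them as symbols (a sketch using rlang::syms):
library(tidyr)
library(rlang)
complete(df, !!!syms(setdiff(names(df), "Outcome")), fill = list(Outcome = 0))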
Imagine that I have 9 sampling records for three species distributed as such:
sp1 sp2 sp3
3 1 5
What I want to obtain is a records x species matrix, and fill it with 1s and 0s as such:
sp1 sp2 sp3
1 0 0
1 0 0
1 0 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
The number of columns matches with the number of species and the number of rows with the number of records. Note that each row represents a unique record for one species.
Use rep to produce the 1s and dcast from reshape2 to shape and fill.
library(reshape2)
x <- list(sp1 = 3, sp2 = 1, sp3 = 5)
d <- melt(lapply(x, function(i) rep(1, i)))
dcast(d, 1:nrow(d) ~ L1, fill = 0)[-1]
sp1 sp2 sp3
1 1 0 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 0 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
Another option would be to create row/column indexes and use sparseMatrix from library(Matrix) to build a sparse matrix, which can be converted back to a matrix of 0s and 1s with as.matrix.
It is not clear whether the initial dataset is a matrix or not. Assuming it is a matrix with 3 columns and 1 row, we get the column index by replicating the sequence of columns according to the elements of 'm1'. This also works if it is a vector. For a data.frame, we would use rep(seq_along(df1), unlist(df1)) instead. Then we create the sparseMatrix, specifying the row index as the sequence along 'cI', the column index ('cI'), and the value 'x' as 1.
library(Matrix)
cI <- rep(seq_along(m1), m1)
m2 <- as.matrix(sparseMatrix(seq_along(cI), cI, x=1))
colnames(m2) <- colnames(m1)
m2
# sp1 sp2 sp3
# [1,] 1 0 0
# [2,] 1 0 0
# [3,] 1 0 0
# [4,] 0 1 0
# [5,] 0 0 1
# [6,] 0 0 1
# [7,] 0 0 1
# [8,] 0 0 1
# [9,] 0 0 1
A base R approach would be to create a matrix of 0s and then replace the elements that correspond to the row/column indexes with 1.
m2 <- matrix(0, nrow=length(cI), ncol=ncol(m1),
dimnames=list(NULL, colnames(m1)))
m2[cbind(seq_along(cI), cI)] <- 1
m2
# sp1 sp2 sp3
# [1,] 1 0 0
# [2,] 1 0 0
# [3,] 1 0 0
# [4,] 0 1 0
# [5,] 0 0 1
# [6,] 0 0 1
# [7,] 0 0 1
# [8,] 0 0 1
# [9,] 0 0 1
data
m1 <- structure(c(3L, 1L, 5L), .Dim = c(1L, 3L), .Dimnames = list(NULL,
c("sp1", "sp2", "sp3")))
Here is another option using stack and spread
library(tidyr)
stackedList = stack(apply(df, 2, function(x) rep(1, x)))
out = spread(stackedList, ind, values, fill = 0)
#> out
# sp1 sp2 sp3
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 0 1 0
#5 0 0 1
#6 0 0 1
#7 0 0 1
#8 0 0 1
#9 0 0 1
data
df = data.frame(sp1 = 3, sp2 = 1, sp3 = 5)
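Yet another base R sketch along the same lines: repeat the rows of an identity matrix according to the counts (assuming the counts are available as a named vector):
counts <- c(sp1 = 3, sp2 = 1, sp3 = 5)
# row i of diag() is the 0/1 indicator for species i; repeat each row count times
m <- diag(length(counts))[rep(seq_along(counts), counts), ]
colnames(m) <- names(counts)
m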
This probably has been asked already, but I could not find it. I have a data set, where column names are numbers, and row names are sample names (see below).
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0
I want to bin the columns, say every 4 columns, by summation, and then name each new column with the mean of the names of the binned columns. For the above table I would end up with:
"599.785" "599.816"
"A" 0 7
"B" 0 4
"C" 0 4
"D" 3 4
The new column names, 599.785 and 599.816, are the averages of the column names that were binned. I think something like cut would work for a vector of numbers, but I am not sure how to implement it for large data frames. Thanks for any help!
colnames <- c("599.773", "599.781", "599.789", "599.797",
"599.804", "599.812" ,"599.82" ,"599.828" )
mat <- matrix(scan(), nrow=4, byrow=TRUE)
0 0 0 0 0 2 1 4
0 0 0 0 0 1 0 3
0 0 0 0 2 1 0 1
3 0 0 0 3 1 0 0
colnames(mat)=colnames
rownames(mat) = LETTERS[1:4]
sRows <- function(mat, cols) rowSums(mat[, cols])
sapply(1:(dim(mat)[2]/4), function(base) sRows(mat, base:(base+4)) )
[,1] [,2]
A 0 2
B 0 1
C 2 3
D 6 4
accum <- sapply(1:(dim(mat)[2]/4), function(base)
sRows(mat, base:(base+4)) )
colnames(accum) <- sapply(1:(dim(mat)[2]/4),
function(base)
mean(as.numeric(colnames(mat)[ base:(base+4)] )) )
accum
#-------
599.7888 599.7966
A 0 2
B 0 1
C 2 3
D 6 4
First of all, using numeric values as column names is not a good/standard habit.
Even so, here is a solution that gives the output the OP desires.
## read data without checking names
dt <- read.table(text='
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0',header=TRUE, check.names =FALSE)
cols <- as.numeric(colnames(dt))
## create a factor to group columns
ff <- rep(c(TRUE,FALSE),each=length(cols)/2)
## using tapply to group operations by ff
vals <- do.call(cbind, tapply(cols, ff,
                function(x) rowSums(dt[, paste0(x)])))
nn <- tapply(cols,ff,mean)
## name the columns with the group means
colnames(vals) <- nn[colnames(vals)]
vals
599.816 599.785
A 7 0
B 4 0
C 4 0
D 4 3
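If there are many columns, the two-level TRUE/FALSE grouping above generalises to an arbitrary number of bins; a sketch assuming ncol(dt) is an exact multiple of the bin width:
bin <- 4
grp <- rep(seq_len(ncol(dt) / bin), each = bin)
vals <- sapply(split(seq_len(ncol(dt)), grp), function(j) rowSums(dt[, j]))
## name the bins with the mean of the binned column names
colnames(vals) <- tapply(as.numeric(colnames(dt)), grp, mean)
vals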
I have a dataframe of 9 columns consisting of an inventory of factors. Each row can have all 9 columns filled (as in that row is holding 9 "things"), but most don't (most have between 3-4). The columns aren't specific either, as in if item 200 shows up in columns 1 and 3, it's the same thing. I'd like to create a matrix that is binary for each row that includes all factors.
Ex (shortened to 4 columns just to get point across)
R1 3 4 5 8
R2 4 6 7 NA
R3 1 5 NA NA
R4 2 6 8 9
Should turn into
1 2 3 4 5 6 7 8 9
r1 0 0 1 1 1 0 0 1 0
r2 0 0 0 1 0 1 1 0 0
r3 1 0 0 0 1 0 0 0 0
r4 0 1 0 0 0 1 0 1 1
I've looked into writeBin/readBin, K-clustering (which is something I'd like to do, but I need to get rid of the NAs first), fuzzy clustering, tag clustering. Just kinda lost about what direction to go.
I've tried writing two for loops that pull the data from the matrix by column/row and then save 0s and 1s respectively in a new matrix, but I think there were scope issues.
You guys are the best. Thanks!
Here's a base R solution:
# Read in the data, and convert to matrix form
df <- read.table(text = "
3 4 5 8
4 6 7 NA
1 5 NA NA
2 6 8 9", header = FALSE)
m <- as.matrix(df)
# Create a two column matrix containing row/column indices of cells to be filled
# with 'one's
id <- cbind(rowid = as.vector(t(row(m))),
colid = as.vector(t(m)))
id <- id[complete.cases(id), ]
# Create output matrix
out <- matrix(0, nrow = nrow(m), ncol = max(m, na.rm = TRUE))
out[id] <- 1
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 0 0 1 1 1 0 0 1 0
# [2,] 0 0 0 1 0 1 1 0 0
# [3,] 1 0 0 0 1 0 0 0 0
# [4,] 0 1 0 0 0 1 0 1 1
This should do the trick:
# The Incantation
options(stringsAsFactors = FALSE)
library(reshape2)
# Your example data
dat <- data.frame(id = c("R1", "R2", "R3", "R4"),
col1 = c(3, 4, 1, 2),
col2 = c(4, 6, 5, 6),
col3 = c(5, 7, NA, 7),
col4 = c(8, NA, NA, 9)
)
# Melt it down
dat.melt <- melt(dat, id.var = "id")
# Cast it back out, with the row IDs remaining the row IDs
# and the values of the columns becoming the columns themselves.
# dcast() will default to length to aggregate records - which means
# that the values in this data.frame are a count of how many times
# each value occurs in each row's columns (which, based on this data,
# seems to be capped at just once).
dat.cast <- dcast(dat.melt, id ~ value)
The result:
dat.cast
id 1 2 3 4 5 6 7 8 9 NA
1 R1 0 0 1 1 1 0 0 1 0 0
2 R2 0 0 0 1 0 1 1 0 0 1
3 R3 1 0 0 0 1 0 0 0 0 2
4 R4 0 1 0 0 0 1 1 0 1 0
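If the NA counts aren't wanted, that column (literally named "NA" in the output above) can simply be dropped:
dat.cast <- dat.cast[, colnames(dat.cast) != "NA"]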
These are all great answers. Thought I'd contribute the original solution I wrote that a friend of mine modified to actually work.
# assumes x is the input matrix and y is pre-allocated as a zero matrix,
# e.g. y <- matrix(0, nrow = nrow(x), ncol = max(x, na.rm = TRUE))
for(i in seq(nrow(x)))
  for(j in seq(ncol(x)))
    if(!is.na(x[i,j])) { y[i, x[i,j]] <- 1 }
Two for loops works after setting some earlier parameters, but it's incredibly slow. Looks like these other solutions work much faster!
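For completeness: base table() on the row indices and the cell values produces the same incidence matrix in one vectorised call (a sketch reusing the matrix m from the base R answer above):
keep <- !is.na(m)
table(row(m)[keep], m[keep])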
Using R, what is the best way to aggregate rows based on a condition that spans multiple rows?
For example, to aggregate any run of rows where z = 0 occurs n or more times.
This is what that would look like run on the following sample table with n = 3.
Sample Table x:
x y z
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4
Results Table:
x y z
0 0 6
49 7 0 <- Above two rows got aggregated
10 0 1
0 0 2
11 11 0 <- Above three rows got aggregated
0 0 2
18 0 4
Since it seems like you're still in the "leaRning phase", I thought an example using the plyr package would be helpful. plyr is an extremely handy library which allows you to slice/dice datasets and summarize their subgroups in a flexible (and terse -- as you'll see below) manner, so it would likely be worth your time to get to know. If you find yourself needing to do similar operations on extremely large data sets, you might also consider looking into the data.table package.
I'm assuming you've done Roman's textConnection trick to get your data into a data.frame named mmf.
I'm adding an idx column to mmf so you can subset it and process the results group by group:
library(plyr)
# mmf <- read.table(textConnection( ...
rle.idx <- rle(mmf$z)
mmf$idx <- rep(seq_along(rle.idx$lengths), rle.idx$lengths)
ans <- ddply(mmf, .(idx), colwise(sum))
And ans looks like this (note that this collapses every run of identical z values; it matches the desired output here because the only runs longer than one row are the zero runs):
 x  y z idx
 0  0 6   1
49  7 0   2
10  0 1   3
 0  0 2   4
11 11 0   5
 0  0 2   6
18  0 4   7
Just remove the idx column and you're done, eg:
ans <- ans[, -4]
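For reference, the same group-and-sum can also be done in base R with rowsum(), reusing the idx column built above (a sketch, no plyr required):
ans2 <- rowsum(mmf[, c("x", "y", "z")], mmf$idx)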
This is the code I used to produce your result. If you have any questions, fire away.
mmf <- read.table(textConnection("x y z # read in your example data
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4"), header = TRUE)
# see where there are zeros in the z column
mmf.rle <- rle(mmf$z)
mmf.rle <- data.frame(lengths = mmf.rle$lengths, values = mmf.rle$values)
merge.rows <- 3
# select runs with three or more consecutive zeros
mmf.zero <- which(mmf.rle$values == 0 & mmf.rle$lengths >= merge.rows)
for (i in mmf.zero) {
  # for each qualifying run: locate its rows, sum them column-wise, blank the
  # originals with NA and write the sums into the first row of the run
  m.mmf <- mmf.rle$lengths[1:i] # run lengths from the start up to this run
  select.rows <- (sum(head(m.mmf, -1)) + 1):sum(m.mmf) # row positions of this run
  mmf.sum <- colSums(mmf[select.rows, ]) # sum values column-wise for the run
  mmf[select.rows, ] <- NA # turn the original rows into NAs...
  mmf[select.rows[1], ] <- mmf.sum # ... and insert the summed result into the first NA row
}
# remove any left over NA rows
mmf <- mmf[complete.cases(mmf),]
DATA
mmf <- read.table(textConnection("x y z # read in your example data
0 0 6
5 5 0
40 2 0
4 0 0
10 0 1
0 0 2
11 7 0
0 4 0
0 0 0
0 0 0
0 0 2
18 0 4"), header = TRUE)
CODE
agg_n <- function(dat = mmf, coln = "z", n = 3){
  agg <- function(.x) {
    # sum the first n records if the group is long enough and their
    # values in column coln are all 0
    if(nrow(.x) >= n && all(.x[[coln]][seq(n)] == 0)) {
      y <- rbind(colSums(.x[seq(n), ]), .x[-1 * seq(n), ])
    } else y <- .x
    return(y)
  }
  # groups of records starting with a 0 in column coln
  G <- cumsum(diff(c(0L, dat[[coln]] == 0)) == 1)
  new_dat <- do.call(rbind, lapply(split(dat, G), agg))
  return(new_dat)
}
OUTPUT
> agg_n()
x y z
0 0 0 6
1.1 49 7 0
1.5 10 0 1
1.6 0 0 2
2.1 11 11 0
2.10 0 0 0
2.11 0 0 2
2.12 18 0 4