I have a dataframe of 9 columns holding an inventory of factors. Each row can have all 9 columns filled (that row is holding 9 "things"), but most don't (most hold 3-4). The columns also aren't positional: if item 200 shows up in column 1 of one row and column 3 of another, it's the same item. I'd like to create a binary matrix with one row per original row and one column per factor.
Example (shortened to 4 columns just to get the point across):
R1 3 4 5 8
R2 4 6 7 NA
R3 1 5 NA NA
R4 2 6 8 9
Should turn into
1 2 3 4 5 6 7 8 9
r1 0 0 1 1 1 0 0 1 0
r2 0 0 0 1 0 1 1 0 0
r3 1 0 0 0 1 0 0 0 0
r4 0 1 0 0 0 1 0 1 1
I've looked into writeBin/readBin, K-clustering (which is something I'd like to do, but I need to get rid of the NAs first), fuzzy clustering, tag clustering. Just kinda lost about what direction to go.
I've tried writing two for loops that pull the data from the matrix by column/row and then save 0s and 1s respectively in a new matrix, but I think there were scope issues.
You guys are the best. Thanks!
Here's a base R solution:
# Read in the data, and convert to matrix form
df <- read.table(text = "
3 4 5 8
4 6 7 NA
1 5 NA NA
2 6 8 9", header = FALSE)
m <- as.matrix(df)
# Create a two-column matrix containing the row/column indices of the cells
# to be filled with ones
id <- cbind(rowid = as.vector(t(row(m))),
colid = as.vector(t(m)))
id <- id[complete.cases(id), ]
# Create output matrix
out <- matrix(0, nrow = nrow(m), ncol = max(m, na.rm = TRUE))
out[id] <- 1
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 0 0 1 1 1 0 0 1 0
# [2,] 0 0 0 1 0 1 1 0 0
# [3,] 1 0 0 0 1 0 0 0 0
# [4,] 0 1 0 0 0 1 0 1 1
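The trick that makes this work is that indexing a matrix with a two-column matrix of (row, column) positions addresses exactly those cells. A minimal illustration of the idiom:
z <- matrix(0, nrow = 2, ncol = 3)
z[cbind(c(1, 2), c(3, 1))] <- 1  # set cells (1,3) and (2,1) in one step
z
#      [,1] [,2] [,3]
# [1,]    0    0    1
# [2,]    1    0    0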
This should do the trick:
# The Incantation
options(stringsAsFactors = FALSE)
library(reshape2)
# Your example data
dat <- data.frame(id = c("R1", "R2", "R3", "R4"),
col1 = c(3, 4, 1, 2),
col2 = c(4, 6, 5, 6),
col3 = c(5, 7, NA, 8),
col4 = c(8, NA, NA, 9)
)
# Melt it down
dat.melt <- melt(dat, id.var = "id")
# Cast it back out, with the row IDs remaining the row IDs
# and the values of the columns becoming the columns themselves.
# dcast() will default to length to aggregate records - which means
# that the values in this data.frame are a count of how many times
# each value occurs in each row's columns (which, based on this data,
# seems to be capped at just once).
dat.cast <- dcast(dat.melt, id ~ value)
The result:
dat.cast
id 1 2 3 4 5 6 7 8 9 NA
1 R1 0 0 1 1 1 0 0 1 0 0
2 R2 0 0 0 1 0 1 1 0 0 1
3 R3 1 0 0 0 1 0 0 0 0 2
4 R4 0 1 0 0 0 1 0 1 1 0
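If you want a strictly binary result without the NA column, a small follow-up sketch (working on the dat.cast above):
# drop the NA column and cap any counts at 1
dat.cast <- dat.cast[, colnames(dat.cast) != "NA"]
dat.cast[-1] <- 1 * (dat.cast[-1] > 0)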
These are all great answers. Thought I'd contribute the original solution I wrote that a friend of mine modified to actually work.
# y must be initialized first - a zero matrix wide enough for the largest value
y <- matrix(0, nrow = nrow(x), ncol = max(x, na.rm = TRUE))
for (i in seq(nrow(x)))
  for (j in seq(ncol(x)))
    if (!is.na(x[i, j])) y[i, x[i, j]] <- 1
Two for loops work after setting up the output matrix first, but it's incredibly slow. Looks like these other solutions work much faster!
Related
My task is to randomly generate an 8-row by 12-column matrix whose values are random combinations of 0s and 1s, such that each row sum equals 6 and each column sum equals 4.
So I create all possible combinations of 0 and 1 within 12 variables:
df <- expand.grid(0:1, 0:1, 0:1, 0:1, 0:1, 0:1,
0:1, 0:1, 0:1, 0:1, 0:1, 0:1)
Then I restrict the possible combinations to those whose row sums equal 6:
df <- df[rowSums(df)==6,]
Then I shuffle it:
shuffled <- df[sample(nrow(df)),]
and finally I'd like to pick 8 rows from the shuffled data. Together these 8 rows must have column sums that all equal 4 (each row sum is already 6):
colSums(picked_shuffled)
[1] 4 4 4 4 4 4 4 4 4 4 4 4
rowSums(picked_shuffled)
[1] 6 6 6 6 6 6 6 6
How to do it?
Doing it by trial and error will take you a very long time! An alternative is to construct a matrix that works and then shuffle it...
rows <- rep(1:8, 6) #48 row positions for the 1s - 6 of each
columns <- rep(1:12, each = 4) #48 column positions for the 1s - 4 of each
mat <- matrix(0, nrow = 8, ncol = 12) #blank matrix of 0s
mat[cbind(rows, columns)] <- 1 #set selected values to 1
mat <- mat[sample(1:8), sample(1:12)] #shuffle rows and columns
mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 0 0 0 1 1 0 0 1 1 0 1
[2,] 0 1 1 1 0 0 1 1 0 0 1 0
[3,] 0 1 1 1 0 0 1 1 0 0 1 0
[4,] 1 0 0 0 1 1 0 0 1 1 0 1
[5,] 1 0 0 0 1 1 0 0 1 1 0 1
[6,] 0 1 1 1 0 0 1 1 0 0 1 0
[7,] 1 0 0 0 1 1 0 0 1 1 0 1
[8,] 0 1 1 1 0 0 1 1 0 0 1 0
I don't know if it is possible to produce a more "random" distribution than this - there are still only two types of column and two types of row however you shuffle it!
By the way, these operations are usually much faster on matrices than on data frames - you can always convert back at the end.
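For instance (df_out is just an illustrative name), the conversion is a single call once the matrix work is done:
df_out <- as.data.frame(mat)  # convert to a data frame only at the end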
A more random solution...
After a bit of thought, it is possible to get a more "random" solution with the method above, by shuffling columns until you get no duplicated row-column pairs (which seems to be quite fast). So a modified version...
rows <- rep(1:8, 6)
columns <- sample(rep(1:12, 4))
while(any(duplicated(cbind(rows, columns)))){
columns <- sample(columns)
}
mat <- matrix(0, nrow = 8, ncol = 12)
mat[cbind(rows, columns)] <- 1
mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 0 0 1 0 1 1 0 1 0 1 1 0
[2,] 1 1 0 1 0 1 1 0 0 0 1 0
[3,] 1 1 1 0 0 0 1 0 0 1 0 1
[4,] 0 1 0 1 0 1 0 1 1 1 0 0
[5,] 0 1 0 0 1 0 1 0 1 0 1 1
[6,] 0 0 1 0 1 0 1 1 1 0 0 1
[7,] 1 0 1 1 1 1 0 0 0 1 0 0
[8,] 1 0 0 1 0 0 0 1 1 0 1 1
rowSums(mat)
[1] 6 6 6 6 6 6 6 6
colSums(mat)
[1] 4 4 4 4 4 4 4 4 4 4 4 4
I have got a less clean but more random solution to the problem than Andrew's. It randomly shoots 1s into an initially empty grid until the conditions are satisfied. Sometimes it removes 20% of the previous hits to prevent getting stuck. When it gets stuck because of too many iterations, it resets.
I simulated it and it usually takes about 40-80 iterations to fill the grid according to your specifications. In rare cases, it takes up to 160.
grid <- matrix(0, nrow = 8, ncol = 12)
finished <- FALSE
count <- 0
while (!finished) {
  # rows and columns that can still take another 1
  openrows <- c(1:8)[rowSums(grid) < 6]
  opencols <- c(1:12)[colSums(grid) < 4]
  if (length(openrows) > 0 & length(opencols) > 0) {
    # stuck on a single, already-filled cell: knock out ~20% of the 1s and retry
    if (length(openrows) == 1 & length(opencols) == 1 & grid[openrows[1], opencols[1]] == 1) {
      grid[grid == 1 & runif(length(grid), 0, 1) > 0.8] <- 0
    }
    # shoot a 1 at a random open row/column
    i <- as.integer(runif(1, 0, length(openrows))) + 1
    j <- as.integer(runif(1, 0, length(opencols))) + 1
    grid[openrows[i], opencols[j]] <- 1
  } else {
    finished <- TRUE
  }
  count <- count + 1
  # too many iterations: reset the grid and start over
  if (count > 500) {
    grid <- matrix(0, nrow = 8, ncol = 12)
    count <- 0
  }
}
It's not very efficient (for large tables) but it works and gives you random data.
That was quite the brain teaser, tbh.
I have a contingency table (ct) like this:
ct <- read.table(text = "
     1  2  3  4  5  6
1    0  0  1  0  2  0
2    0  0  2  0  0  0
70   0  0  3  0  0  0
76   15 13 19 2  9  8
85   0  0  2  0  0  0
109  0  0  0  0  1  2
479  0  0  0  0  2  0
491  2  0  0  0  0  0
1127 0  1  0  1  6  0
1131 0  1  1  1  2  0
1206 1  3  1  0  0  1
1208 1  0  1  0  0  1
1210 0  1  0  0  0  1
1225 2  0  1  0  0  0
1232 0  0  0  0  1  1
1242 0  0  0  1  0  1
1243 1  0  0  0  1  1
1251 0  0  2  0  1  2
1267 0  2  1  0  0  0
4415 0  2  0  0  0  0
4431 0  0  0  2  0  0
4808 0  0  0  0  2  0
4823 0  2  0  0  0  0
", header = TRUE, check.names = FALSE)
Rows represent clusters, columns represent hospitals, and the numbers in the table are counts of isolates.
For example: cluster 1 has 3 isolates, 1 in hospital 3 and 2 in hospital 5.
I now want to check whether clusters and hospitals are dependent on each other. For that, I would like to create 1000 randomly distributed tables, where every isolate in a cluster has the chance to fall into any hospital.
For example: the 3 isolates in cluster 1 might then be distributed over 3 hospitals, so that I get the values: 0 1 1 1 0 0.
Combinations can occur multiple times.
I tried this:
replicates <- 1000
permutations <- lapply(seq(replicates), function(i, ct){
  rows <- lapply(apply(ct, 1, list), unlist)
  rows <- lapply(rows, as.numeric)
  as.data.frame(do.call(rbind, lapply(rows, sample)))
}, ct = ct)
But this only shuffles the existing values to other positions within each row; it doesn't redistribute the isolates themselves across hospitals.
Can someone help me with that?
I concur with Maurits Evers' answer: at full rank you get binomial combinations per line - n variables means 2^n combinations... and if you add m-1 columns this yields 2^(n+m) possibilities.
Here's an alternative using partitions::compositions.
library(partitions)
# smaller toy data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
# calculate row sums
rs <- rowSums(d)
# for each unique row sum, partition the value with order m = number of columns
# this avoids repeating calculation of partitions on duplicate row sums
l <- lapply(unique(rs), compositions, m = ncol(d))
# name list elements with row sums
names(l) <- unique(rs)
# set number of samples
n <- 4
# to reproduce sample in this example
set.seed(1)
# loop over rows in data frame
lapply(1:nrow(d), function(i){
# index list of partitions using row sums
m <- l[[as.character(rs[i])]]
# number of columns to sample from
nc <- ncol(m)
# select columns from matrix using a sample of n column indexes
m[ , sample(nc, n, replace = TRUE)]
})
The result is a list where each element is a matrix for each row of the original data. Each matrix column is one (sampled) partition.
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 1 0
# [2,] 1 2 0 0
# [3,] 0 0 1 2
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 2
# [2,] 3 1 0 0
# [3,] 0 3 4 2
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] 1 2 1 1
# [2,] 0 0 1 1
# [3,] 1 0 0 0
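As a quick sanity check (assuming the lapply result above is assigned to a hypothetical name res): every column of each sampled matrix is a composition of that row's total, so the column sums must match rs.
# res = the list returned by the lapply above (assign it first)
stopifnot(all(sapply(seq_along(res), function(i)
  all(colSums(res[[i]]) == rs[i]))))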
I tried to partition the largest row sum in your example data (66), and it runs pretty quickly. Thus, if your row sums are not very much larger and the number of columns is small (like here), the code above may be a viable option.
system.time(p <- compositions(66, 6))
# user system elapsed
# 1.53 0.16 1.68
str(p)
# 'partition' int [1:6, 1:13019909] 66 0 0 0 0 0 65 1 0 0 ...
Note that it 'explodes' rapidly if the number of columns increases:
system.time(p <- compositions(66, 7))
# user system elapsed
# 14.11 1.61 15.72
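The blow-up is predictable: the number of compositions of n into m non-negative parts is choose(n + m - 1, m - 1), so you can estimate the cost before calling compositions():
choose(66 + 6 - 1, 6 - 1)  # 13019909, matching the dimensions in str(p) above
choose(66 + 7 - 1, 7 - 1)  # 156238908, about 12 times larger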
Sorry @Henrik for the late response. Your code worked out quite well for me! However, with the help of a colleague of mine, I figured out this code (I'll just show it using your sample data):
#data
d <- data.frame(x1 = c(0, 1, 1), x2 = c(2, 2, 0), x3 = c(0, 1, 1))
#Number of replicates I want
replicates <- 1000
#Number of columns in the table
k <- 3
l <- NULL
#unlist the dataframe
list <- lapply(apply(d, 1, list), unlist)
#Calculate replicates of the dataframe, where numbers are permuted within rows
permutations <- lapply(seq(replicates), function(j){
  l_sampled <- lapply(list, function(x){
    pos.random <- sample(k, sum(x), replace = TRUE)  # drop each isolate into a random column
    x.random <- rep(0, k)
    for (i in 1:k){
      x.random[i] <- sum(pos.random == i)            # count the hits per column
    }
    l = rbind(l, data.frame(x.random))
  })
  df <- data.frame(matrix(unlist(l_sampled), nrow = length(l_sampled), byrow = TRUE))
})
#Example for results:
> permutations[[8]]
X1 X2 X3
1 2 0 0
2 1 2 1
3 1 0 1
> permutations[[10]]
X1 X2 X3
1 0 1 1
2 2 0 2
3 0 2 0
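As a side note, the inner for loop that counts how many isolates landed in each column can be replaced by tabulate(), which does exactly that count in one call (a sketch, same behaviour):
x.random <- tabulate(pos.random, nbins = k)  # counts of 1..k in pos.random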
(For the following, I could use either an R data.frame or an R data.table; both are OK.)
I have the following data.table:
library(data.table)
dt = data.table(V1=c("dog", "dog", "cat", "cat", "cat", "bird","bird","bird","bird"),
V2=rep(42, 9), V3=c(1, 2, 4, 5, 7, 1, 2, 5, 8))
> print(dt)
V1 V2 V3
1: dog 42 1
2: dog 42 2
3: cat 42 4
4: cat 42 5
5: cat 42 7
6: bird 42 1
7: bird 42 2
8: bird 42 5
9: bird 42 8
Column V3 contains integers from 1 to 8. My goal is to populate an 8-by-8 zero matrix with the count of each combination "pair" within each unique category in column V1.
So, the combination pairs for dog, cat, and bird are:
dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)
For each pair, I add +1 to the corresponding entry in the zero matrix. The matrix is symmetric, so entry (n, m) equals entry (m, n). The matrix given dt would be:
1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0
Note that (1,2)=(2,1) has a count 2, from the dog combination and the bird combination.
(1) Is there a method to calculate the combinations of values in an R data.table/data.frame column, given the unique value in another column?
Perhaps it would make sense to output an R list, with vector "pairs", e.g.
list(c(1, 2), c(2, 1), c(4, 5), c(4, 7), c(5, 7), c(5, 4), c(7, 4), c(7, 5),
c(1, 2), c(1, 5), c(1, 8), c(2, 5), c(2, 8), c(5, 8), c(2, 1), c(5, 1),
c(8, 1), c(5, 2), c(8, 2), c(8, 5))
However, I'm not sure how I would use this to populate a matrix...
(2) Given the input data.table/data.frame, what would be the most efficient data structure to use to write out a matrix, as shown above?
Here's a data.table solution that seems to be efficient. We are basically doing a self join in order to create the combinations, then counting them. Then, similar to what @coldspeed did with NumPy, we just update a zero matrix at those locations with the counts.
# a self join
tmp <- dt[dt,
.(V1, id = x.V3, id2 = V3),
on = .(V1, V3 < V3),
nomatch = 0L,
allow.cartesian = TRUE
][, .N, by = .(id, id2)]
## Create a zero matrix and update by locations
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$id, tmp$id2)] <- tmp$N
m + t(m)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
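A quick way to check the counts (a small sketch): each category with k values contributes choose(k, 2) unordered pairs, and each pair adds two symmetric entries, so:
sum(m + t(m)) == 2 * sum(choose(dt[, .N, by = V1]$N, 2))
# [1] TRUE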
Alternatively, we could create tmp using data.table::CJ, but that could be (potentially - thanks to @Frank for the tip) less memory efficient as it will create all possible combinations first, e.g.
tmp <- dt[, CJ(V3, V3)[V1 < V2], by = .(g = V1)][, .N, by = .(V1, V2)]
## Then, as previously
m <- array(0L, rep(max(dt$V3), 2L))
m[cbind(tmp$V1, tmp$V2)] <- tmp$N
m + t(m)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] 0 2 0 0 1 0 0 1
# [2,] 2 0 0 0 1 0 0 1
# [3,] 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 1 0 1 0
# [5,] 1 1 0 1 0 0 1 1
# [6,] 0 0 0 0 0 0 0 0
# [7,] 0 0 0 1 1 0 0 0
# [8,] 1 1 0 0 1 0 0 0
Not sure this is the most elegant approach, but it works:
myfun <- function(x, matsize=8) {
# get all (i,j) pairs but in an unfortunate text format
pairs_all <- outer(x, x, paste)
# "drop" all self-pairs like (1,1)
diag(pairs_all) <- "0 0"
# convert these text-pairs into numeric pairs and store in matrix
ij <- do.call(rbind, lapply(strsplit(pairs_all, " "), as.numeric))
# create "empty" matrix of zeros
mat <- matrix(0, nrow=matsize, ncol=matsize)
# replace each spot of empty matrix with a 1 if that pair exists
mat[ij] <- 1
# return 0/1 matrix
return(mat)
}
# split your data by group
# lapply the custom function to each group
# add each group's 0/1 matrix together for final result
Reduce('+', lapply(split(dt$V3, dt$V1), myfun))
If anyone has a more direct way to implement the first 3 (non-comment) lines of myfun, I would happily incorporate them.
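One possibility, offered as a hedged sketch (it assumes each group holds at least two distinct values, as in dt): combn() generates the (i, j) pairs directly, skipping the paste/strsplit round trip.
myfun2 <- function(x, matsize = 8) {
  ij <- t(combn(x, 2))               # all unordered pairs as a 2-column index matrix
  mat <- matrix(0, nrow = matsize, ncol = matsize)
  mat[ij] <- 1                       # mark each pair...
  mat[ij[, 2:1, drop = FALSE]] <- 1  # ...and its mirror image
  mat
}
Reduce('+', lapply(split(dt$V3, dt$V1), myfun2))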
How do I convert df into df2, where df is given by:
> df
ID VALUES
1 1 a,b,c,d
2 2 a
3 3 c,d,f,g
and df2 should look something like:
> df2
ID a b c d f g
1 1 1 1 1 1 0 0
2 2 1 0 0 0 0 0
3 3 0 0 1 1 1 1
where the values from df have been broken out into separate columns and 1s and 0s reflect whether or not the ID was associated with that value (from df).
Is there a specific function for this? I thought this is what table() did but if that's the case I can't figure it out.
Here's a method that uses no extra packages:
# the distinct values actually present (a, b, c, d, f, g)
vals <- sort(unique(unlist(strsplit(df[['VALUES']], ","))))
0 + t( sapply(df[['VALUES']], function(x) {
  vals %in% scan(text = x, what = "", sep = ",") }))
Read 4 items
Read 1 item
Read 4 items
        [,1] [,2] [,3] [,4] [,5] [,6]
a,b,c,d    1    1    1    1    0    0
a          1    0    0    0    0    0
c,d,f,g    0    0    1    1    1    1
It does return a matrix, and it does depend on the VALUES column being character rather than factor. If you want to suppress the informational messages from scan, there is a parameter for that (quiet = TRUE, used below). You could cbind this with the ID column:
cbind( df["ID"], 0+ t( sapply(df[['VALUES']], function(x) {letters[1:6] %in% scan(text=x, what="", sep="," , quiet=TRUE) })) )
ID 1 2 3 4 5 6
a,b,c,d 1 1 1 1 1 0 0
a 2 1 0 0 0 0 0
c,d,f,g 3 0 0 1 1 0 1
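For comparison, a hedged alternative sketch that avoids scan() entirely: split the strings, build a long two-column data frame, and let table() do the counting.
s <- strsplit(df$VALUES, ",")
long <- data.frame(ID = rep(df$ID, lengths(s)), value = unlist(s))
table(long)
#    value
# ID  a b c d f g
#   1 1 1 1 1 0 0
#   2 1 0 0 0 0 0
#   3 0 0 1 1 1 1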
I have a data set with this structure:
  region1 region2 region3
1      10       5       5
2       8      10       8
3      13      15      12
4       3      17      11
5      17               9
6      12              15
7       4
8      18
9       1
I need:
item region1 region2 region3
1 1 1 0 0
2 3 1 0 0
3 4 1 0 0
4 5 0 1 1
5 8 1 0 1
6 9 0 0 1
7 10 1 1 0
8 11 0 0 1
9 12 1 0 1
10 13 1 0 0
11 15 0 1 1
12 17 1 1 0
13 18 1 0 0
The plan was to get a distinct list of items, left join each of the regions as its own column, and replace matches with 1s and missings with 0s; but I must be missing a key point of R's merge, since it drops the main column of interest. Any advice is greatly appreciated! I'd prefer an R solution, but my next step would be to look into the sqldf package.
#read in data
regions <- read.csv("c:/data/regions.csv")
#get unique list of items from all regions
items <- na.omit(unique(stack(regions)[1]))
#merge distinct items with each region, replace matches with 1, missings with 0
merge.test <- merge(items,regions,by.x="values", by.y=c("region1"), all=TRUE)
It helps to provide a reproducible example (i.e. give us an easy copy-paste command to construct your sample data).
You didn't say, so I'll guess your data is in a list, perhaps?
dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
region2=c(5,10,15,17),
region3=c(5,8,12,11,9,15))
First find all the items (perhaps no need to sort, I just did it because yours is sorted)
ids <- sort(unique(unlist(dat)))
Then for each region, just see if the list of unique IDs is in that region, coercing the logical TRUE/FALSE to 0 and 1 (you could leave as T/F if that would do for you)
data.frame(ids,
region1=as.integer(ids %in% dat$region1),
region2=as.integer(ids %in% dat$region2),
region3=as.integer(ids %in% dat$region3))
If you have just 3 regions that's OK, if you have more you might want to automate the typing:
cols <- lapply(dat, function (region) as.integer(ids %in% region))
cols$id <- ids
df <- do.call(data.frame, cols)
where do.call calls the data.frame function with the list cols as its (named) arguments, i.e. it just does
data.frame(id=..., region1=..., region2=..., region3=...)
If your original dat was a CSV and each column has NA values you might want to insert na.omit as appropriate.
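For instance (a sketch), the NA handling could slot in when collecting the unique IDs:
ids <- sort(unique(na.omit(unlist(dat))))  # drop NAs introduced by ragged CSV columns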
The existing answers are fine, but they seem too complicated. Just try stack + table instead:
table(stack(dat))
# ind
# values region1 region2 region3
# 1 1 0 0
# 3 1 0 0
# 4 1 0 0
# 5 0 1 1
# 8 1 0 1
# 9 0 0 1
# 10 1 1 0
# 11 0 0 1
# 12 1 0 1
# 15 0 1 1
# 17 1 1 0
# 18 1 0 0
I'm also going to go out on a limb and say that, considering your current approach, you actually have a data.frame, not a list:
DAT <- dat
Len <- max(sapply(DAT, length))
DAT <- data.frame(lapply(DAT, function(x) { length(x) <- Len; x }))
In that case, the solution is no different:
table(stack(DAT))
# ind
# values region1 region2 region3
# 1 1 0 0
# 3 1 0 0
# 4 1 0 0
# 5 0 1 1
# 8 1 0 1
# 9 0 0 1
# 10 1 1 0
# 11 0 0 1
# 12 1 0 1
# 15 0 1 1
# 17 1 1 0
# 18 1 0 0
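One caveat, since table() counts occurrences: if a value could appear more than once within a region, cap the entries at 1 to keep the output binary (a small sketch):
1 * (table(stack(DAT)) > 0)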
Using @mathematical.coffee's example and qdap:
dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
region2=c(5,10,15,17),
region3=c(5,8,12,11,9,15))
library(qdap)
matrix2df(t(mtabulate(dat)), "item")
If the same item can occur more than once in a vector, you may need to expand with:
FUN <- function(x) as.numeric(x > 0)
matrix2df(apply(t(mtabulate(dat)), 2, FUN), "item")