creating a matrix of indicator variables - r

I would like to create a matrix of indicator variables. My initial thought was to use model.matrix, which was also suggested here: Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
However, model.matrix does not seem to work if a factor has only one level.
Here is an example data set with three levels to the factor 'region':
dat = read.table(text = "
reg1 reg2 reg3
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1
", sep = "", header = TRUE)
# model.matrix works if there are multiple regions:
region <- c(1,1,1,1,1,1,2,2,2,3,3,3,3)
df.region <- as.data.frame(region)
df.region$region <- as.factor(df.region$region)
my.matrix <- as.data.frame(model.matrix(~ -1 + df.region$region, df.region))
my.matrix
# The following for-loop works even if there is only one level to the factor
# (one region):
# region <- c(1,1,1,1,1,1,1,1,1,1,1,1,1)
my.matrix <- matrix(0, nrow=length(region), ncol=length(unique(region)))
for(i in 1:length(region)) {my.matrix[i,region[i]]=1}
my.matrix
The for-loop is effective and seems simple enough. However, I have been struggling to come up with a solution that does not involve loops. I can use the loop above, but have been trying hard to wean myself off of them. Is there a better way?

I would use matrix indexing. From ?"[":
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector.
Making use of that nice feature:
my.matrix <- matrix(0, nrow=length(region), ncol=length(unique(region)))
my.matrix[cbind(seq_along(region), region)] <- 1
# [,1] [,2] [,3]
# [1,] 1 0 0
# [2,] 1 0 0
# [3,] 1 0 0
# [4,] 1 0 0
# [5,] 1 0 0
# [6,] 1 0 0
# [7,] 0 1 0
# [8,] 0 1 0
# [9,] 0 1 0
# [10,] 0 0 1
# [11,] 0 0 1
# [12,] 0 0 1
# [13,] 0 0 1
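The same indexing also covers the single-region case from the question. Below is a minimal sketch, assuming the region codes may be arbitrary labels rather than 1..k, so they are first mapped to column positions with match(). The names lev, col, ind, one and ind1 are introduced here for illustration only.

```r
# Region labels need not be 1..k; map each label to a column position first.
region <- c(10, 10, 30, 20, 10)
lev <- sort(unique(region))          # one column per distinct label
col <- match(region, lev)            # column position for every row

ind <- matrix(0, nrow = length(region), ncol = length(lev),
              dimnames = list(NULL, paste0("reg", lev)))
ind[cbind(seq_along(region), col)] <- 1

# Single-level case: the result is still a one-column matrix of ones.
one <- rep(1, 13)
ind1 <- matrix(0, nrow = length(one), ncol = length(unique(one)))
ind1[cbind(seq_along(one), match(one, unique(one)))] <- 1
```

Going through match() means a label such as 30 does not have to be a valid column number, which the direct my.matrix[cbind(seq_along(region), region)] form requires.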

I came up with this solution by modifying an answer to a similar question here:
Reshaping a column from a data frame into several columns using R
region <- c(1,1,1,1,1,1,2,2,2,3,3,3,3)
site <- seq_along(region)
df <- cbind(site, region)
ind <- xtabs( ~ site + region, df)
ind
region <- c(1,1,1,1,1,1,1,1,1,1,1,1,1)
site <- seq_along(region)
df <- cbind(site, region)
ind <- xtabs( ~ site + region, df)
ind
EDIT:
The line below will extract the data frame of indicator variables from ind:
ind.matrix <- as.data.frame.matrix(ind)

Related

efficient way to store lists within a dataframe

I need to be able to compute pairwise intersection of lists, close to 40k.
Specifically, I want to know if I can store a vector id in column 1 and a list of its values in column 2. I should be able to process column 2, i.e. find overlaps/intersections between two rows.
column 1 column 2
idA 1,2,5,9,10
idB 5,9,25
idC 2,25,67
I want to be able to get the pairwise intersection values, and this should also work if the values in column 2 are not already sorted.
What is the best data structure to use if I go ahead with R?
My data originally looks like this:
column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0
edited to include more clarity as per the suggestions below.
I'd keep the data in a logical matrix:
DF <- read.table(text = "column1 1 2 3 9 10 25 67 5
idA 1 1 0 1 1 0 0 1
idB 0 0 0 1 0 1 0 1
idC 0 1 0 0 0 1 1 0", header = TRUE, check.names = FALSE)
#turn into logical matrix
m <- as.matrix(DF[-1])
rownames(m) <- DF[[1]]
mode(m) <- "logical"
#if you can, create your data as a sparse matrix to save memory
#if you already have a dense data matrix, keep it that way
library(Matrix)
M <- as(m, "lMatrix")
#calculate intersections
#does each comparison twice
intersections <- simplify2array(
lapply(seq_len(nrow(M)), function(x)
lapply(seq_len(nrow(M)), function(x, y) colnames(M)[M[x,] & (M[x,] == M[y,])], x = x)
)
)
This double loop could be optimized. I'd do it in Rcpp and create a long format data.frame instead of a list matrix. I'd also do each comparison only once (e.g., only the upper triangle).
colnames(intersections) <- rownames(intersections) <- rownames(M)
# idA idB idC
#idA Character,5 Character,2 "2"
#idB Character,2 Character,3 "25"
#idC "2" "25" Character,3
intersections["idA", "idB"]
#[[1]]
#[1] "9" "5"
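As a base R alternative to the Rcpp route suggested above, here is a rough sketch that visits each pair only once using combn(). It rebuilds the toy data as a plain logical matrix. The names pairs, common and sizes are made up for this example, and it has not been tuned for 40k rows.

```r
# Same toy data as above, kept as a plain logical matrix.
m <- matrix(c(1,1,0,1,1,0,0,1,
              0,0,0,1,0,1,0,1,
              0,1,0,0,0,1,1,0) == 1,
            nrow = 3, byrow = TRUE,
            dimnames = list(c("idA", "idB", "idC"),
                            c("1", "2", "3", "9", "10", "25", "67", "5")))

# All unordered pairs of row names, each visited once (upper triangle).
pairs <- combn(rownames(m), 2, simplify = FALSE)
common <- lapply(pairs, function(p) colnames(m)[m[p[1], ] & m[p[2], ]])
names(common) <- vapply(pairs, paste, character(1), collapse = "-")

# If only the sizes of the intersections are needed, one matrix product does it.
sizes <- tcrossprod(m * 1)   # m * 1 turns TRUE/FALSE into 1/0
```

Here common[["idA-idB"]] holds the shared column names for that pair, and sizes["idA", "idB"] is the number of shared values. The diagonal of sizes is simply the size of each set.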

r - Make adjacency matrix with tcrossprod capturing only positive values

I need to create an adjacency matrix from a dataframe using tcrossprod, but the resulting matrix needs to obey a restriction that I will explain below. Consider the following dataframe:
z <- data.frame(Person = c("a","b","c","d"), Man_United = c(1,0,1,0))
z
Person Man_United
1 a 1
2 b 0
3 c 1
4 d 0
I make an adjacency matrix from z using tcrossprod.
x <- tcrossprod(table(z))
diag(x) <- 0
x
Person
Person a b c d
a 0 0 1 0
b 0 0 0 1
c 1 0 0 0
d 0 1 0 0
I need the resulting adjacency matrix to indicate a tie (here signaled with the number 1), only when both persons have value 1 in the original dataframe (i.e. are fans of Manchester United, in this example). For example, persons "a" and "c" of dataframe z are fans, so in the resulting adjacency matrix I want their intersecting cell to be valued 1. That works fine here. However, persons "b" and "d" are not fans, and the fact that both have value 0 in the original dataframe does not mean that they are connected in any meaningful way. tcrossprod, however, produces a matrix that suggests that they are in fact connected.
How can I use tcrossprod so that it captures only the positive values of the dataframe when producing the adjacency matrix?
We can restrict attention to the table column for ones with
tcrossprod(table(z)[, "1"])
# [,1] [,2] [,3] [,4]
# [1,] 1 0 1 0
# [2,] 0 0 0 0
# [3,] 1 0 1 0
# [4,] 0 0 0 0
or, if you want to preserve the names,
tcrossprod(table(z)[, "1", drop = FALSE])
# Person
# Person a b c d
# a 1 0 1 0
# b 0 0 0 0
# c 1 0 1 0
# d 0 0 0 0
If there can be several distinct nonzero values, you may replace "1" with -1 so as to drop the column for zeros and keep all the others.
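To illustrate the -1 variant, here is a small sketch with made-up data (z2 and its Club column are hypothetical), assuming 0 is the smallest value so that its column comes first in the table:

```r
# Hypothetical data: 0 = not a fan, 1 and 2 = two different clubs.
z2 <- data.frame(Person = c("a", "b", "c", "d"),
                 Club   = c(1, 0, 1, 2))

# Drop the first table column (the one for 0) and keep all the others.
adj <- tcrossprod(table(z2)[, -1, drop = FALSE])
diag(adj) <- 0
adj
```

With this construction a tie appears only between persons who share the same nonzero value, so "a" and "c" are connected while "d" (the only one with value 2) is not connected to anyone.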

transition matrix force ncol to equal nrows

I have created a transition matrix as a 'from cluster' (rows) 'to cluster' (columns) frequency. Think Markov chain.
Assume I have 5 'from' clusters but only 3 'to' clusters; then I get a 5*3 transition matrix. How do I force it to be a 5*5 transition matrix? Effectively, how do I show the all-zero columns?
I'm after an elegant solution, as this will be applied to a much larger problem involving hundreds of clusters. I am quite unfamiliar with R matrices, and I don't know of an elegant way to force the number of columns to equal the number of rows and fill in zeros where there is no match, other than a for loop, which my hunch says is not the best solution.
Example code:
# example data
cluster_before <- c(1,2,3,4,5)
cluster_after <- c(1,2,4,4,1)
# Table output
table(cluster_before,cluster_after)
# ncol does not = nrows. I want to rectify that
# I want output to look like this:
what_I_want <- matrix(
c(1,0,0,0,0,
0,1,0,0,0,
0,0,0,1,0,
0,0,0,1,0,
1,0,0,0,0),
byrow=TRUE,ncol=5
)
# Possible solution. But for loop can't be best solution?
empty_mat <- matrix(0,ncol=5,nrow=5)
matrix_to_update <- empty_mat
for (i in 1:length(cluster_before)) {
val_before <- cluster_before[i]
val_after <- cluster_after[i]
matrix_to_update[val_before,val_after] <- matrix_to_update[val_before,val_after]+1
}
matrix_to_update
# What's the more elegant solution?
Thanks in advance for your help. It's much appreciated.
Make them factors and then table:
levs <- union(cluster_before, cluster_after)
table(factor(cluster_before,levs), factor(cluster_after,levs))
# 1 2 3 4 5
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 0 1 0
# 4 0 0 0 1 0
# 5 1 0 0 0 0
Another solution is to use matrix indices:
what_I_want <- matrix(0,ncol=5,nrow=5)
what_I_want[cbind(cluster_before,cluster_after)] <- 1
print(what_I_want)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 1 0 0 0 0
##[2,] 0 1 0 0 0
##[3,] 0 0 0 1 0
##[4,] 0 0 0 1 0
##[5,] 1 0 0 0 0
The second line sets the elements corresponding to the row (cluster_before) and column (cluster_after) indices to 1.
Hope this helps.
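One caveat with the index-assignment version: it only marks a transition as present, so a (before, after) pair that occurs several times still ends up as 1, whereas the for loop and table() count it. A minimal sketch of a counting variant, using made-up inputs and assuming the cluster labels are the integers 1..n:

```r
cluster_before <- c(1, 1, 2, 3)
cluster_after  <- c(2, 2, 1, 3)
n <- 3

# Index assignment only marks presence: the repeated pair (1, 2) becomes 1.
present <- matrix(0, nrow = n, ncol = n)
present[cbind(cluster_before, cluster_after)] <- 1

# Counting instead: R stores matrices column by column, so cell (i, j)
# has linear position (j - 1) * n + i, and tabulate() counts each position.
counts <- matrix(tabulate((cluster_after - 1) * n + cluster_before,
                          nbins = n * n),
                 nrow = n, ncol = n)
```

For real frequency tables the factor/table approach above is probably the simpler choice. This is just to show what the indexing version does and does not do.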

Creating matrix with moving group identifier

I'm trying to create a matrix with 180*12 rows and 12 columns in R. I'm not sure what the specific R code is to create something like this.
Column 1: 1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,..................0
Column 2: 0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,..................0
Column 3: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,..................0
Etc., with the same pattern until Column 12. Can someone help me out? Thanks in advance.
apply(diag(12), 2, rep, each=12)
A shorter example:
apply(diag(3), 2, rep, each=2)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 0 1 0
## [4,] 0 1 0
## [5,] 0 0 1
## [6,] 0 0 1
Another very similar solution, without an explicit apply:
matrix(rep(diag(12), each=12), ncol=12)
This works because as.vector(diag(N)) is a vector with N 1's, each separated by N 0's. An example with diag(3), each=2, ncol=3 is identical to the example above.
Just for laughs, here is a model.matrix version of @MatthewLundberg's answer:
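A quick check of that claim on a small case, as a sketch. The kronecker() line is a third formulation added here for comparison and is not part of the answers above. N, k, a, b and kr are names chosen for this example.

```r
N <- 3   # number of groups (columns)
k <- 2   # rows per group

v <- as.vector(diag(N))   # 1 0 0 0 1 0 0 0 1

a  <- apply(diag(N), 2, rep, each = k)              # column-by-column repeat
b  <- matrix(rep(diag(N), each = k), ncol = N)      # repeat, then reshape
kr <- kronecker(diag(N), matrix(1, nrow = k, ncol = 1))  # block form
```

All three give the same N*k by N indicator matrix. Setting N <- 12 and k <- 12 reproduces the size used in the answer.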
model.matrix( ~ rep(factor(1:3),each=2) - 1)
a <- rep(factor(1:3),each=2)
model.matrix( ~ a - 1)
a1 a2 a3
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 0 1
6 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.treatment"
Or all in one line:
model.matrix( ~ rep(factor(1:3),each=2) - 1)
And the class.ind approach from nnet:
library(nnet)
class.ind(rep(factor(1:3),each=2))

How to randomize (or permute) a dataframe rowwise and columnwise?

I have a dataframe (df1) like this.
f1 f2 f3 f4 f5
d1 1 0 1 1 1
d2 1 0 0 1 0
d3 0 0 0 1 1
d4 0 1 0 0 1
The d1...d4 column is the rowname, the f1...f5 row is the columnname.
When I do sample(df1), I get a new dataframe with the same count of 1s as df1. So the count of 1s is conserved for the whole dataframe, but not for each row or each column.
Is it possible to do the randomization row-wise or column-wise?
I want to randomize df1 column-wise for each column, i.e. the number of 1s in each column remains the same, and each column needs to be changed at least once. For example, I may have a randomized df2 like this (note that the count of 1s in each column remains the same but the count of 1s in each row is different):
f1 f2 f3 f4 f5
d1 1 0 0 0 1
d2 0 1 0 1 1
d3 1 0 0 1 1
d4 0 0 1 1 0
Likewise, I also want to randomize df1 row-wise for each row, i.e. the number of 1s in each row remains the same, and each row needs to be changed (but the number of changed entries could be different). For example, a randomized df3 could be something like this:
f1 f2 f3 f4 f5
d1 0 1 1 1 1 <- two entries are different
d2 0 0 1 0 1 <- four entries are different
d3 1 0 0 0 1 <- two entries are different
d4 0 0 1 0 1 <- two entries are different
PS. Many thanks for the help from Gavin Simpson, Joris Meys and Chase for the previous answers to my previous question on randomizing two columns.
Given the R data.frame:
> df1
a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
Shuffle row-wise:
> df2 <- df1[sample(nrow(df1)),]
> df2
a b c
3 0 1 0
4 0 0 0
2 1 0 0
1 1 1 0
By default sample() randomly reorders the elements passed as the first argument. This means that the default size is the size of the passed array. Passing parameter replace=FALSE (the default) to sample(...) ensures that sampling is done without replacement which accomplishes a row wise shuffle.
Shuffle column-wise:
> df3 <- df1[,sample(ncol(df1))]
> df3
c a b
1 0 1 1
2 0 1 0
3 0 0 1
4 0 0 0
This is another way to shuffle the data.frame using package dplyr:
row-wise:
df2 <- slice(df1, sample(1:n()))
or
df2 <- sample_frac(df1, 1L)
column-wise:
df2 <- select(df1, one_of(sample(names(df1))))
Take a look at permatswap() in the vegan package. Here is an example maintaining both row and column totals, but you can relax that and fix only one of the row or column sums.
library(vegan)
mat <- matrix(c(1,1,0,0,0,0,0,1,1,0,0,0,1,1,1,0,1,0,1,1), ncol = 5)
set.seed(4)
out <- permatswap(mat, times = 99, burnin = 20000, thin = 500, mtype = "prab")
This gives:
R> out$perm[[1]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 1
[2,] 0 1 0 1 0
[3,] 0 0 0 1 1
[4,] 1 0 0 0 1
R> out$perm[[2]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 0 1 1
[2,] 0 0 0 1 1
[3,] 1 0 0 1 0
[4,] 0 0 1 0 1
To explain the call:
out <- permatswap(mat, times = 99, burnin = 20000, thin = 500, mtype = "prab")
times is the number of randomised matrices you want, here 99
burnin is the number of swaps made before we start taking random samples. This allows the matrix from which we sample to be quite random before we start taking each of our randomised matrices
thin says only take a random draw every thin swaps
mtype = "prab" says treat the matrix as presence/absence, i.e. binary 0/1 data.
A couple of things to note: this doesn't guarantee that any column or row has been randomised, but if burnin is long enough there should be a good chance of that having happened. Also, you could draw more random matrices than you need and discard the ones that don't match all your requirements.
Your requirement to have different numbers of changes per row isn't covered here either. Again, you could sample more matrices than you want and then discard the ones that don't meet this requirement.
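The "draw more and discard" idea could look roughly like the sketch below. It uses base R only. all_rows_changed is a hypothetical helper, and the candidates list is a stand-in built by hand, where the list of permuted matrices in out$perm would normally go.

```r
# Hypothetical helper: TRUE when every row of `perm` differs from `orig`
# in at least one position.
all_rows_changed <- function(orig, perm) {
  all(rowSums(orig != perm) > 0)
}

mat <- matrix(c(1,1,0,0,0,0,0,1,1,0,0,0,1,1,1,0,1,0,1,1), ncol = 5)

# Stand-in candidates: the original itself and a copy with columns reversed.
candidates <- list(mat, mat[, ncol(mat):1])

# Keep only the candidates that satisfy the requirement.
kept <- Filter(function(p) all_rows_changed(mat, p), candidates)
```

Only the second candidate survives here, since the first is identical to mat. The same pattern works for a column-wise check by swapping rowSums for colSums.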
You can also use the randomizeMatrix function in the R package picante.
Example:
test <- matrix(c(1,1,0,1,0,1,0,0,1,0,0,1,0,1,0,0),nrow=4,ncol=4)
> test
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 1 1 0 1
[3,] 0 0 0 0
[4,] 1 0 1 0
randomizeMatrix(test,null.model = "frequency",iterations = 1000)
[,1] [,2] [,3] [,4]
[1,] 0 1 0 1
[2,] 1 0 0 0
[3,] 1 0 1 0
[4,] 1 0 1 0
randomizeMatrix(test,null.model = "richness",iterations = 1000)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 1
[2,] 1 1 0 1
[3,] 0 0 0 0
[4,] 1 0 1 0
>
The option null.model = "frequency" maintains column sums, and "richness" maintains row sums.
Though mainly used for randomizing species presence/absence datasets in community ecology, it works well here.
This function has other null model options as well; see the picante documentation (page 36) for more details.
Of course you can sample each row:
sapply (1:4, function (row) df1[row,]<<-sample(df1[row,]))
This shuffles the entries within each row, so the number of 1's in each row doesn't change. With small changes it also works for columns, but that is an exercise for the reader :-P
If the goal is to randomly shuffle each column, some of the above answers don't work since the columns are shuffled jointly (this preserves inter-column correlations). Others require installing a package. Yet a one-liner exists:
# lapply() alone returns a list, so wrap it to get a data.frame back
df2 <- as.data.frame(lapply(df1, function(x) { sample(x) }))
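For completeness, here is a sketch of both independent shuffles on the df1 from the question, with a check that the sums that should be preserved are preserved. The construction of df1 and the names m3 and df3 are assumptions made for this example. Note that neither version guarantees that every row or column actually ends up different from the original.

```r
df1 <- data.frame(f1 = c(1, 1, 0, 0), f2 = c(0, 0, 0, 1), f3 = c(1, 0, 0, 0),
                  f4 = c(1, 1, 1, 0), f5 = c(1, 0, 1, 1),
                  row.names = c("d1", "d2", "d3", "d4"))
set.seed(1)

# Column-wise: permute the entries of every column independently.
# Assigning into df2[] keeps the data.frame shape and the row names.
df2 <- df1
df2[] <- lapply(df1, sample)

# Row-wise: permute the entries of every row independently.
m3 <- t(apply(as.matrix(df1), 1, sample))
colnames(m3) <- names(df1)   # sample() scrambles the names too; reset them
df3 <- as.data.frame(m3)

colSums(df2) == colSums(df1)   # column totals unchanged
rowSums(df3) == rowSums(df1)   # row totals unchanged
```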
You can also "sample" the same number of items in your data frame with something like this:
nr<-dim(M)[1]
random_M = M[sample.int(nr),]
Random samples and permutations in a data frame:
If the data is in matrix form, convert it to a data.frame.
Then use the sample function from the base package:
indexes = sample(1:nrow(df1), size=1*nrow(df1))
Random Samples and Permutations
Here is a data.table option using .N with sample like this:
library(data.table)
setDT(df)
df[sample(.N)]
#> a b c
#> 1: 0 1 0
#> 2: 1 1 0
#> 3: 1 0 0
#> 4: 0 0 0
Created on 2023-01-28 with reprex v2.0.2
Data:
df <- read.table(text = " a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0", header = TRUE)
