Two-way match to matrix in R

If I have a data.frame
df <- data.frame(DEP=letters[1:5], ARR=letters[11:15], NO=1:5+5)
  DEP ARR NO
1   a   k  6
2   b   l  7
3   c   m  8
4   d   n  9
5   e   o 10
I want to create a matrix with DEP as the row names and ARR as the column names, and fill it with the matching NO values...
e.g.
k l m n o
a 6 7 8 9 10 ...etc
Each DEP-ARR combination is unique.
In my real data DEP and ARR come from the same set of names; I have used two different sample sets here for clarity.
I am struggling to use match to look the values up and fill them into the matrix template I created below:
mat <- matrix(0,nrow(df),nrow(df)); colnames(mat) <- df$ARR; rownames(mat) <- df$DEP;
k l m n o
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 0 0 0 0 0
e 0 0 0 0 0
Is there an efficient way of doing this? Many thanks for all advice!

See ?xtabs:
xtabs(NO ~ ., data=df)
# ARR
#DEP k l m n o
# a 6 0 0 0 0
# b 0 7 0 0 0
# c 0 0 8 0 0
# d 0 0 0 9 0
# e 0 0 0 0 10
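If a plain numeric matrix is needed rather than the xtabs/table object it returns, one way (a minimal sketch) is to rebuild it from the table:
tab <- xtabs(NO ~ ., data=df)
mat <- matrix(tab, nrow = nrow(tab), dimnames = dimnames(tab))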

If I understood your question correctly, you could use a sparse matrix definition:
library(Matrix)
mat <- spMatrix(length(df$DEP), length(df$ARR),
                seq(df$DEP), seq(df$ARR), as.numeric(as.character(df$NO)))
rownames(mat) <- df$DEP
colnames(mat) <- df$ARR
#> as.matrix(mat)
# k l m n o
#a 6 0 0 0 0
#b 0 7 0 0 0
#c 0 0 8 0 0
#d 0 0 0 9 0
#e 0 0 0 0 10
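For completeness, since the question specifically asks about match: the zero template can also be filled directly with match() plus matrix indexing. A minimal sketch, reusing df and the template construction from the question:
mat <- matrix(0, nrow(df), nrow(df))
rownames(mat) <- df$DEP
colnames(mat) <- df$ARR
# match() maps each DEP/ARR pair to its row/column position; the two-column
# cbind() index fills exactly those cells with the corresponding NO
mat[cbind(match(df$DEP, rownames(mat)), match(df$ARR, colnames(mat)))] <- df$NO
mat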

Related

How to quantify observation pairs in individuals

I'm looking for a way to quantify observation pairs in individuals (patients). In this example each patient has two different diseases. The disease pair (that is, within the same individual) "a" and "b" occurs 4 times, namely in patients "G", "H", "I" and "J", and the pair "k" and "o" occurs twice (patient "D" has diseases "k" and "o", and patient "E" also has these two diseases).
Patient_ID <- c("A","A","B","B","C","C","D","D","E","E","F","F",
                "G","G","H","H","I","I","J","J")
Disease <- c("v","s","s","v","s","v","k","o","k","o","o","s","a","b",
             "a","b","b","a","b","a")
DATA <- data.frame(Patient_ID, Disease)
print(DATA)
Patient_ID Disease
1 A v
2 A s
3 B s
4 B v
5 C s
6 C v
7 D k
8 D o
9 E k
10 E o
11 F o
12 F s
13 G a
14 G b
15 H a
16 H b
17 I b
18 I a
19 J b
20 J a
From these data I would like to generate a table like the one below.
a b k o v s
a 0 4 0 0 0 0
b 4 0 0 0 0 0
k 0 0 0 2 0 0
o 0 0 2 0 0 1
v 0 0 0 0 0 3
s 0 0 0 1 3 0
Then I would like a table containing only the levels whose counts are above a certain threshold (for example 2), like the second table below.
a b v s
a 0 4 0 0
b 4 0 0 0
v 0 0 0 3
s 0 0 3 0
Here is a base R option using table+crossprod, i.e.,
res <- `diag<-`(crossprod(table(DATA)),0)
which gives
> res
Disease
Disease a b k o s v
a 0 4 0 0 0 0
b 4 0 0 0 0 0
k 0 0 0 2 0 0
o 0 0 2 0 1 0
s 0 0 0 1 0 3
v 0 0 0 0 3 0
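To see why this works: table(DATA) is the patient-by-disease incidence table (reconstructed below from the sample data); crossprod() of it counts, for every pair of diseases, how many patients have both, and the diag<- call then zeroes the diagonal.
table(DATA)
#           Disease
# Patient_ID a b k o s v
#          A 0 0 0 0 1 1
#          B 0 0 0 0 1 1
#          C 0 0 0 0 1 1
#          D 0 0 1 1 0 0
#          E 0 0 1 1 0 0
#          F 0 0 0 1 1 0
#          G 1 1 0 0 0 0
#          H 1 1 0 0 0 0
#          I 1 1 0 0 0 0
#          J 1 1 0 0 0 0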
To subset by a given threshold, you can use
th <- 2
inds <- rowSums(res > th)>0
subset_res <- subset(res,inds,inds)
which gives
> subset_res
Disease
Disease a b s v
a 0 4 0 0
b 4 0 0 0
s 0 0 0 3
v 0 0 3 0
First, use unstack() to reshape Disease into a data frame with 2 columns (one row per patient). Remember to give both columns the same factor levels; this prevents levels from being dropped in the following step. Then pass the data frame to table() to create a contingency table. In this table, "a & b" and "b & a" are counted separately, so to compute the total counts you need tab + t(tab).
pair <- data.frame(t(unstack(DATA, Disease ~ Patient_ID)))
# use an explicit common level set so table() keeps every disease in both dimensions
pair[] <- lapply(pair, factor, levels = sort(unique(DATA$Disease)))
tab <- table(pair)
tab + t(tab)
# X2
# X1 a b k o s v
# a 0 4 0 0 0 0
# b 4 0 0 0 0 0
# k 0 0 0 2 0 0
# o 0 0 2 0 1 0
# s 0 0 0 1 0 3
# v 0 0 0 0 3 0
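The same threshold filtering as in the first answer can be applied to this symmetric result as well; a small sketch, reusing the example threshold of 2:
res2 <- tab + t(tab)
keep <- rowSums(res2 > 2) > 0   # diseases involved in at least one pair counted more than twice
res2[keep, keep]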

Create an adjacency matrix from unbalanced trade flow data in R

I have a dataset of bilateral trade flows of dimension 84x244.
How can I balance the dataset into a 244x244 matrix while keeping the same order and names as the columns?
The matrix is non-symmetric. For example, it resembles:
A B C D
B 0 0 0 1
D 2 0 0 0
and it should look like
A B C D
A 0 0 0 0
B 0 0 0 1
C 0 0 0 0
D 2 0 0 0
With A B C D as row and column names
Here are two methods that ensure the column names and row names are effectively the same, using a default value of 0 for missing rows/columns. These do not assume that the columns are always full; if this is guaranteed, then you can ignore the column-adding portions.
Both start with:
m <- as.matrix(read.table(header=TRUE, text="
A B C D
B 0 0 0 1
D 2 0 0 0"))
First
needrows <- setdiff(colnames(m), rownames(m))
m <- rbind(m, matrix(0, nrow=length(needrows), ncol=ncol(m), dimnames=list(needrows, colnames(m))))
needcols <- setdiff(rownames(m), colnames(m))
m <- cbind(m, matrix(0, nrow=nrow(m), ncol=length(needcols), dimnames=list(rownames(m), needcols)))
m
# A B C D
# B 0 0 0 1
# D 2 0 0 0
# A 0 0 0 0
# C 0 0 0 0
And to order the rows the same as the columns ... note that any row names not present in the column names will be dropped by this step, though you can include them with another setdiff if needed.
m[colnames(m),]
# A B C D
# A 0 0 0 0
# B 0 0 0 1
# C 0 0 0 0
# D 2 0 0 0
Second
allnames <- sort(unique(unlist(dimnames(m))))
m2 <- matrix(0, nrow=length(allnames), ncol=length(allnames),
             dimnames=list(allnames, allnames))
m2[intersect(rownames(m), allnames), colnames(m)] <-
  m[intersect(rownames(m), allnames), colnames(m)]
m2[rownames(m), intersect(colnames(m), allnames)] <-
  m[rownames(m), intersect(colnames(m), allnames)]
m2
# A B C D
# A 0 0 0 0
# B 0 0 0 1
# C 0 0 0 0
# D 2 0 0 0
Here is a base R solution. The basic idea is that you first construct a square all-zero matrix whose row names equal its column names, and then assign values to the rows according to the row names, i.e.,
M <- `dimnames<-`(matrix(0, nrow = ncol(m), ncol = ncol(m)),
                  replicate(2, list(colnames(m))))
M[rownames(m),] <- m
such that
> M
A B C D
A 0 0 0 0
B 0 0 0 1
C 0 0 0 0
D 2 0 0 0
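If the row names could also contain entries that do not appear among the column names, a small variation of the same idea (a sketch, still using 0 as the fill value) builds the template from the union of both name sets:
allnames <- sort(union(rownames(m), colnames(m)))
M2 <- matrix(0, length(allnames), length(allnames),
             dimnames = list(allnames, allnames))
M2[rownames(m), colnames(m)] <- m
M2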

Count all 26 letters of a character variable in a data frame

I have a dataframe with a few columns like this:
Attr Description
60 asdfg asdg dfs
50 smlefekl dewld ewf
35 kojewdfhef e
All I need is to create 26 extra columns with the count of each letter in each row. I know I can use:
table(unlist(strsplit(mydata, ""), use.names=FALSE))
for a single vector, but how can I adapt it for a data frame?
If we use strsplit, then we need to create a factor with levels specified as letters:
d1 <- stack(setNames(strsplit(df1$Description, ""), seq_len(nrow(df1))))
d2 <- subset(d1, values != " ")
d2$values <- factor(d2$values, levels = letters)
t(table(d2))
# values
# ind a b c d e f g h i j k l m n o p q r s t u v w x y z
# 1 2 0 0 3 0 2 2 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0
# 2 0 0 0 2 4 2 0 0 0 0 1 3 1 0 0 0 0 0 1 0 0 0 2 0 0 0
# 3 0 0 0 1 3 2 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
Or, as shown in the comments, use str_count from stringr, looping through letters to get the count of each letter for every row of Description:
library(stringr)
t(sapply(letters, function(x) str_count(df1$Description, x)))
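To get the 26 extra columns the question asks for (one row per Description, one column per letter), the sapply() result can be bound back onto the data frame; without the outer t() the counts already come out with the letters as columns. A minimal sketch, assuming the data frame is called df1:
library(stringr)
counts <- sapply(letters, function(x) str_count(df1$Description, x))
# wrap Description in tolower() if upper-case letters should be counted as well
df1_with_counts <- cbind(df1, counts)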

How could I calculate the sparsity of a data.frame in R?

I have a data.frame structured like this:
A B C D E
F 1 0 7 0 0
G 0 0 0 1 1
H 1 1 0 0 0
I 1 2 1 0 0
L 1 0 0 0 0
and I want to calculate the sparsity (i.e. the percentage of 0 values) of this data.frame.
How can I do this?
sum(df == 0)/(dim(df)[1]*dim(df)[2])
[1] 0.6
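Equivalently, since df == 0 yields a logical matrix, taking its mean gives the same proportion:
mean(df == 0)
# [1] 0.6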

faster alternative to compute colCumsums of a band matrix

I am new to R and stats. In the domain I am currently working in, I am required to compute cumulative column sums in a particular way.
Initially a square band matrix of width b and n rows is provided. For example, for n = 8 and b = 3:
0 1 2 7 0 0 0 0
0 0 3 6 7 0 0 0
0 0 0 3 1 7 0 0
0 0 0 0 4 4 7 0
0 0 0 0 0 5 8 7
0 0 0 0 0 0 1 8
0 0 0 0 0 0 0 4
0 0 0 0 0 0 0 0
Then the matrix is to be transformed into an n x b matrix whose columns are the superdiagonals. For the given example:
1 2 7
3 6 7
3 1 7
4 4 7
5 8 7
1 8 0
4 0 0
0 0 0
I am currently using the following function to perform this operation.
packedband <- function(x, n, b) {
  mat <- sapply(0:(b-1), function(i)
    diag(x[-(n:(n-i)), -(1:(1+i))])[1:n])
  mat[is.na(mat)] <- 0
  return(mat)
}
And then I apply the colCumsums function from the matrixStats package to obtain the desired output matrix. For the given example:
1 2 7
4 8 14
7 9 21
11 13 28
16 21 35
17 29 35
21 29 35
21 29 35
What I am looking for is a faster way to compute these operations, since in this domain the number of columns (or rows) can be > 10^5. The packedband step can probably be skipped, since the end goal is the cumulative column sums.
Thanks in advance.
After messing about with sparse matrices, I think a for loop may work well here.
Try on original data
d = as.matrix(read.table(text="0 1 2 7 0 0 0 0
0 0 3 6 7 0 0 0
0 0 0 3 1 7 0 0
0 0 0 0 4 4 7 0
0 0 0 0 0 5 8 7
0 0 0 0 0 0 1 8
0 0 0 0 0 0 0 4
0 0 0 0 0 0 0 0 "))
colnames(d) <- NULL
Functions
packedband <- function(x, b=3) {
  n <- nrow(x)
  mat <- sapply(0:(b-1), function(i)
    diag(x[-(n:(n-i)), -(1:(1+i))])[1:n])
  mat[is.na(mat)] <- 0
  matrixStats::colCumsums(mat)
}
forloop <- function(d, b=3){
  n <- nrow(d)
  m <- matrix(0, n, b)
  for(i in 1:b) {
    # pull the i-th superdiagonal via matrix indexing, pad to length n, then cumsum
    ro <- 1:(n-i)
    co <- (1+i):n
    vec <- `length<-`(d[cbind(ro, co)], n)
    vec[is.na(vec)] <- 0
    m[, i] <- cumsum(vec)
  }
  m
}
# create the initial sparse matrix up front just to omit the conversion time;
# if this is faster, it may be worth storing your band matrices in sparse format anyway
library(Matrix)
m <- as(d, "TsparseMatrix")
spm <- function(m, b=3){
  # m is a TsparseMatrix; the 0-based slots @i/@j give the row and column indices
  x <- sparseMatrix(i = m@i + 1,
                    j = m@j - m@i,
                    x = m@x,
                    dims = c(nrow(m), b))
  matrixStats::colCumsums(as.matrix(x))
}
all.equal(forloop(d), packedband(d))
all.equal(spm(m), packedband(d))
Try with bigger data
d = matrix(0, 5e3, 5e3)
d[(col(d) - row(d)) == 1] <- 1
d[(col(d) - row(d)) == 2] <- 1
d[ (col(d) - row(d)) == 3] <- 1
m <- as(d, "TsparseMatrix")
all.equal(forloop(d), packedband(d))
all.equal(spm(m), packedband(d))
microbenchmark::microbenchmark(packedband(d), forloop(d), spm(m), times=50)
# Unit: microseconds
#           expr         min          lq        mean      median          uq         max neval cld
#  packedband(d) 1348240.520 1724714.293 1740634.707 1733305.192 1763377.869 1960353.263    50   b
#     forloop(d)     720.344     973.658    1054.461    1026.807    1174.731    1565.912    50  a
#         spm(m)    2145.875    2437.321    2586.503    2480.133    2749.019    3766.051    50  a
