How might I generalize this code to fit larger sets of data? - r

I want to extend the usability of the code i've written. even better, I would like to generalize it for future use.
I am using Rstudio. I have recoded a 100-dimensional vector. Values 1-10 have been converted to identity vectors. For instance, all values of 1 are now vectors that read 1 0 0 0 0 0 0 0 0 0, and all values of 2 now read 0 1 0 0 0 0 0 0 0 0, and so on. Here is the code:
tens <- seq(from=1, to=10, by=1)
y <- sample(tens, size=100, replace=TRUE)
y
num.its <- 100
Y <- rep(0,num.its*10)
dim(Y) <- c(num.its,10)
I <- diag(10)
for(i in 1:100){
if(y[i]==1){
Y[i,] <- I[1,]
} else if (y[i]==2){
Y[i,] <- I[2,]
} else if (y[i]==3){
Y[i,] <- I[3,]
} else if (y[i]==4){
Y[i,] <- I[4,]
} else if (y[i]==5){
Y[i,] <- I[5,]
} else if (y[i]==6){
Y[i,] <- I[6,]
} else if (y[i]==7){
Y[i,] <- I[7,]
} else if (y[i]==8){
Y[i,] <- I[8,]
} else if (y[i]==9){
Y[i,] <- I[9,]
} else {
Y[i,] <- I[10,]
}
}
The code works as planned. however, if I had to recode values of 1-2000, then I would rather not write 2000 else if statements. Any help would be appreciated. thank you!

A pretty decent one-liner is the following:
# sample data
set.seed(1234)
x <- c(1:5, sample(10L, 6))
Our vector is
x
[1] 1 2 3 4 5 10 6 5 4 1 8
Then, convert x to a factor variable, specifying the desired levels, and use model.matrix to get a matrix of your desired vectors.
model.matrix(~ . + 0, data.frame(x=factor(x, levels=1:10)))
This returns
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 1
7 0 0 0 0 0 1 0 0 0 0
8 0 0 0 0 1 0 0 0 0 0
9 0 0 0 1 0 0 0 0 0 0
10 1 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 1 0 0
attr(,"assign")
[1] 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$x
[1] "contr.treatment"
Here, the rows represent what you want. You can use t to convert this to columns if desired. Note also that even though 7 is missing in x, that column is present in the matrix.

You could use the function dummy from the package dummy
dummy::dummy(data.frame(x=factor(x)))
x_1 x_2 x_3 x_4 x_5 x_6 x_8 x_9
1 1 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0
4 0 0 0 1 0 0 0 0
5 0 0 0 0 1 0 0 0
6 0 1 0 0 0 0 0 0
7 0 0 0 0 0 1 0 0
8 0 0 0 0 1 0 0 0
9 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 1
11 0 0 0 1 0 0 0 0

Related

Best way to covert List to Matrix or Tibble format?

I'm am seeking a decent way to convert output from a function as a list into a matrix or tibble format.
The following tibble feeds into a function. The function returns a list. In this simple example, the returned list happens to contain the same values as the function input tibble.
# # A tibble: 6 x 15
# rev CoS gm sga ebitda bd ebit ie ii gain ebt chg_DTL current tax ni
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This is the list that is returned from the function.
> ni_out
$rev
[1] 0 0 0 0 0 0
$CoS
[1] 0 0 0 0 0 0
$gm
[1] 0 0 0 0 0 0
$sga
[1] 0 0 0 0 0 0
$ebitda
[1] 0 0 0 0 0 0
$bd
[1] 0 0 0 0 0 0
$ebit
[1] 0 0 0 0 0 0
$ie
[1] 0 0 0 0 0 0
$ii
[1] 0 0 0 0 0 0
$gain
[1] 0 0 0 0 0 0
$ebt
[1] 0 0 0 0 0 0
$chg_DTL_net
[1] 0 0 0 0 0 0
$current
[1] 0 0 0 0 0 0
$tax
[1] 0 0 0 0 0 0
$ni
[1] 0 0 0 0 0 0
I desire to convert that back into something more pleasing to look at such as the original tibble format or a matrix.
I obtain the dimensions of the list output .
lengths(ni_out)[[1]]
# [1] 6
> length(ni_out)
# [1] 15
However, my unsuccessful attempt at a matrix appears as the following.
as.matrix(unlist(ni_out), nrow = lengths(ni_out)[[1]], ncol = length(ni_out))
# [,1]
# rev1 0
# rev2 0
# rev3 0
# rev4 0
# rev5 0
# rev6 0
# CoS1 0
# CoS2 0
# CoS3 0
# CoS4 0
# CoS5 0
# CoS6 0
# gm1 0
# gm2 0
# gm3 0
# gm4 0
# gm5 0
# gm6 0
# sga1 0
# sga2 0
# sga3 0
# sga4 0
# sga5 0
# sga6 0
# ebitda1 0
# ebitda2 0
# etc.
Thoughts for a matrix or tibble format ??
Next time please provide a reproducible example.
If your list is called mylist I would try data.table::rbindlist(mylist)
Please see an example below including the conversion of vectors to data.frames.
dat <- 0:5
mylist <- list(dat, dat, dat)
mylist <- lapply(mylist, function(x) data.frame(t(x)))
data.table::rbindlist(mylist)
> data.table::rbindlist(mylist)
X1 X2 X3 X4 X5 X6
1: 0 1 2 3 4 5
2: 0 1 2 3 4 5
3: 0 1 2 3 4 5
EDIT: it seems you want to cbind instead of rbind, so I would use the below in that case.
dat <- 0:5
mylist <- list(dat, dat, dat)
mylist <- lapply(mylist, function(x) data.frame(x))
dplyr::bind_cols(mylist)
x...1 x...2 x...3
1 0 0 0
2 1 1 1
3 2 2 2
4 3 3 3
5 4 4 4
6 5 5 5
As you can see the answer is different depending on what you want and therefore it's important to provide an example.
You can use do.call funtion like this:
a <- list(data.frame(x=1:5),data.frame(y=1:5))
do.call("cbind",a)
Check cbindlist function too.
simply call data.frame or as_tibble on the list:
l <- list(x=rep(0,6),y=rep(0,6), z=rep(0,6), t=rep(0,6))
data.frame(l)
x y z t
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
as_tibble(l)
# A tibble: 6 x 4
x y z t
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
as for matrix transform it first to a data.frame then to a matrix
as.matrix(data.frame(l))
x y z t
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
[4,] 0 0 0 0
[5,] 0 0 0 0
[6,] 0 0 0 0
Another option with as.data.table
library(data.table)
as.data.table(l)
data
l <- list(x=rep(0,6),y=rep(0,6), z=rep(0,6), t=rep(0,6))

Filling a table with additional columns if they don't exist

I've the following difficult problem. Here short example of my data. Assume that I've two data sets (my real example has something about 20). The data frames result as a list computed by a self written function with lapply. So, I put the data frames in my example in a list, too. Then I "rbind" them to compute a frequency table.
df1 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df1) <- c("k", "a")
df2 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df2) <- c("k", "a")
list_df <- list(df1,df2)
df_combine<- plyr::ldply(list_df, rbind)
freq_foo <- table(df_combine$k,df_combine$a)
I get a frequency table of the following form.
a=0 a=11 a=12 a=2 a=5 a=6 a=7 a=8 a=3 a=9
1 1 0 0 0 0 0 0 1 0 0
2 1 0 0 0 0 0 0 0 0 1
3 1 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 0 0 0
5 0 0 0 1 1 0 0 0 0 0
6 0 0 0 0 0 0 1 0 0 1
7 0 1 1 0 0 0 0 0 0 0
8 1 0 0 0 0 1 0 0 0 0
9 0 0 0 0 0 0 2 0 0 0
10 0 0 1 0 1 0 0 0 0 0
11 1 1 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 1 0 1 0
13 1 0 1 0 0 0 0 0 0 0
I want to extend and manipulate my table in the following way:
First the table should go over a range of a=0 to a=15. So if there is a missing column, it should be added. And 2nd) I want to order the columns from 0 to 15.
For the first problem I tried
if(freq_foo$paste0("a=",0:15) == F){freq_foo$paste("a=",0:15) <- 0}
but this should work only for data frames and not for tables. Also. i've no idea how to order the columns with an ascending order. The data type isnt important to me because I just want to use the output for further calculations. So, it can also be a data frame instead of a table.
#convert freq_foo table to dataframe
df <- as.data.frame.matrix(freq_foo)
#add all zeros column for missing column name in 0:15 series
df[, paste0("a=", c(0:15)[!(c(0:15) %in% as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))])] <- 0
#order columns from 0 to 15
df <- df[, order(as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))]
Output is:
a=0 a=1 a=2 a=3 a=4 a=5 a=6 a=7 a=8 a=9 a=10 a=11 a=12 a=13 a=14 a=15
1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0
5 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
7 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
8 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
10 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
11 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
12 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
13 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
(Edit: Updated code after getting a requirement clarification from OP)

Creating an occurrence matrix for each char from a large number of equal length strings

I am trying to create a matrix which gives me the occurrence of each element at each position, based on a large number of strings in a vector.
I have the following pet example and potential solution:
set.seed(42)
seqs <- sapply(1:10, FUN = function(x) { paste(sample(LETTERS, size = 11, replace = T), collapse = "") })
test <- lapply(seqs, FUN = function(s) {
do.call(cbind, lapply(LETTERS, FUN = function(ch) {
grepl(ch, unlist(strsplit(s, split="")))
}))
})
testR <- Reduce("+", test)
seqs
# [1] "XYHVQNTDRSL" "SYGMYZDMOXD" "ZYCNKXLVTVK" "RAVAFXPJLAZ" "LYXQZQIJKUB" "TREGNRZTOWE" "HVSGBDFMFSA" "JNAPEJQUOGC" "CHRAFYYTINT"
#[10] "QQFFKYZTTNA"
testR
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
[1,] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0
[2,] 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0
[3,] 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0
[4,] 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0
[5,] 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0
[6,] 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0
[7,] 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0
[8,] 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0
[9,] 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0
[10,] 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1
[11,] 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
[,24] [,25] [,26]
[1,] 1 0 1
[2,] 0 4 0
[3,] 1 0 0
[4,] 0 0 0
[5,] 0 1 1
[6,] 2 2 1
[7,] 0 1 2
[8,] 0 0 0
[9,] 0 0 0
[10,] 1 0 0
[11,] 0 0 1
I am trying to force myself to not use loops but instead use the vectorised functions but I am not sure if my solution is actually a good (efficient) solution or if I have gotten confused somewhere. It is also fairly difficult to debug if real life data messes up somehow (which sadly is the case).
So my question, what is a good way to solve this problem?
EDIT: Following 989's track, I have done a benchmark test of the proposed solutions here, with a data size more representative of the problem at hand.
library(microbenchmark)
set.seed(42)
seqs <- sapply(1:10000, FUN = function(x) { paste(sample(LETTERS, size = 31, replace = T), collapse = "") })
f.posdef=function(){
test <- lapply(seqs, FUN = function(s) {
do.call(cbind, lapply(LETTERS, FUN = function(ch) {
grepl(ch, unlist(strsplit(s, split="")))
}))
})
(testR <- Reduce("+", test))
}
f.989=function() {
l <- lapply(seqs, function(x) {
m <- matrix(0, nchar(x), 26)
replace(m, cbind(seq(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
})
Reduce("+",l)
}
f.docendo1=function()
t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
table(factor(xx, levels = LETTERS), 1:31))))
f.docendo2=function()
t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:31, 10000)))
f.akrun=function(){
strsplit(seqs, "") %>%
transpose %>%
map(unlist) %>%
setNames(seq_len(nchar(seqs[1]))) %>%
stack %>%
select(2:1) %>%
table
}
r <- f.posdef()
Note that the main difference between this benchmark and 989's is the sample size.
> all(r==f.989())
[1] TRUE
> all(r==f.docendo1())
[1] TRUE
> all(r==f.docendo2())
[1] TRUE
> all(r==f.akrun())
[1] FALSE
> res <- microbenchmark(f.posdef(), f.989(), f.docendo1(), f.docendo2(), f.akrun())
> autoplot(res)
As the plot shows, akrun's solution is blazing fast but seemingly inaccurate. Thus the gold medal goes to docendo's second solution. However it's probably worth noting that both solutions by docendo as well as the suggestion by 989 has assumptions regarding the length/number of the sample strings or the alphabet size in m <- matrix(0, nchar(x), 26)
In the case of size/length of sample strings (i.e. seqs) it would be an additional call to nchar which should not impact the runtime much. I am not sure how one would avoid making the assumption alphabet size, if this is not known a priori.
Here's another approach in base R which requires less looping than the approach by OP:
t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
table(factor(xx, levels = LETTERS), 1:11))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
Or, possibly more efficient:
t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:nchar(seqs[1]), length(seqs))))
We can also use table once
library(tidyverse)
strsplit(seqs, "") %>%
transpose %>%
map(unlist) %>%
setNames(seq_len(nchar(seqs[1]))) %>%
stack %>%
select(2:1) %>%
table
# values
#ind A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
Or slightly more compact is by using mtabulate from qdapTools
library(qdapTools)
strsplit(seqs, "") %>%
transpose %>%
map(unlist) %>%
mtabulate
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
You could go for match in base R:
l <- lapply(seqs, function(x) {
m <- matrix(0, nchar(x), 26)
replace(m, cbind(seq(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
})
all(Reduce("+",l)==testR)
#[1] TRUE
BENCHMARKING (I did not include #akrun's answers as I do not want to install the required packages)
library(microbenchmark)
set.seed(42)
seqs <- sapply(1:10, FUN = function(x) { paste(sample(LETTERS, size = 11, replace = T), collapse = "") })
fOP=function(){
test <- lapply(seqs, FUN = function(s) {
do.call(cbind, lapply(LETTERS, FUN = function(ch) {
grepl(ch, unlist(strsplit(s, split="")))
}))
})
(testR <- Reduce("+", test))
}
f989=function() {
l <- lapply(seqs, function(x) {
m <- matrix(0, nchar(x), 26)
replace(m, cbind(seq(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
})
Reduce("+",l)
}
fdocendo.discimus=function()
t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
table(factor(xx, levels = LETTERS), 1:11))))
fdocendo.discimus1=function()
t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:11, 10)))
r <- fOP()
all(r==f989())
# [1] TRUE
all(r==fdocendo.discimus())
# [1] TRUE
all(r==fdocendo.discimus1())
# [1] TRUE
res <- microbenchmark(fOP(), f989(), fdocendo.discimus(), fdocendo.discimus1())
print(res, order="mean")
# Unit: microseconds
# expr min lq mean median uq max neval
# f989() 135.813 150.8360 205.3294 154.1415 159.700 4968.565 100
# fdocendo.discimus1() 391.813 405.1845 447.6911 418.2545 445.146 2418.480 100
# fdocendo.discimus() 943.775 990.9495 1090.9905 1015.5880 1062.311 3996.245 100
# fOP() 1486.725 1521.4280 1643.1604 1548.9215 1602.104 5782.838 100

Building a symmetric binary matrix

I have a matrix that is for example like this:
rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3
I want to make a Symmetric binary matrix that it's dimnames of that is the same as rownames of above matrix. I want to fill these matrix by 1 & 0 in such a way that 1 indicated placing variables that has the same number in front of it and 0 for the opposite situation.This matrix would be like
dimnames
a c b d y q i j r
a 1 0 0 0 0 0 1 1 0
c 0 1 0 0 0 0 0 0 1
b 0 0 1 0 1 0 0 0 0
d 0 0 0 1 0 1 0 0 0
y 0 0 1 0 1 0 0 0 0
q 0 0 0 1 0 1 0 0 0
i 1 0 0 0 0 0 1 1 0
j 1 0 0 0 0 0 1 1 0
r 0 1 0 0 0 0 0 0 1
Anybody know how can I do that?
Use dist:
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE)
res <- as.matrix(dist(DF$V1)) == 0L
#alternatively:
#res <- !as.matrix(dist(DF$V1))
#diag(res) <- 0L #for the first version of the question, i.e. a zero diagonal
res <- +(res) #for the second version, i.e. to coerce to an integer matrix
dimnames(res) <- list(DF$rownames, DF$rownames)
# 1 2 3 4 5 6 7 8 9
#1 1 0 0 0 0 0 1 1 0
#2 0 1 0 0 0 0 0 0 1
#3 0 0 1 0 1 0 0 0 0
#4 0 0 0 1 0 1 0 0 0
#5 0 0 1 0 1 0 0 0 0
#6 0 0 0 1 0 1 0 0 0
#7 1 0 0 0 0 0 1 1 0
#8 1 0 0 0 0 0 1 1 0
#9 0 1 0 0 0 0 0 0 1
You can do this using table and crossprod.
tcrossprod(table(DF))
# rownames
# rownames a b c d i j q r y
# a 1 0 0 0 1 1 0 0 0
# b 0 1 0 0 0 0 0 0 1
# c 0 0 1 0 0 0 0 1 0
# d 0 0 0 1 0 0 1 0 0
# i 1 0 0 0 1 1 0 0 0
# j 1 0 0 0 1 1 0 0 0
# q 0 0 0 1 0 0 1 0 0
# r 0 0 1 0 0 0 0 1 0
# y 0 1 0 0 0 0 0 0 1
If you want the row and column order as they are found in the data, rather than alphanumerically, you can subset
tcrossprod(table(DF))[DF$rownames, DF$rownames]
or use factor
tcrossprod(table(factor(DF$rownames, levels=unique(DF$rownames)), DF$V1))
If your data is large or sparse, you can use the sparse matrix algebra in xtabs, with similar ways to change the order of the resulting table as before.
Matrix::tcrossprod(xtabs(data=DF, ~ rownames + V1, sparse=TRUE))

Label start of a sequence with NA's

This is trivial question, however I don't seem to find neat solution for this. (without excluding NA's first and including them back again). So I'm looking for some ideas without the need of NA's exclusion.
I would like to label the start of a 0 and 1 sequence with 2 and 1 respectively and replace NA's with 0 as well as the remaining sequence of 0's and 1's.
Is the rle function useful here? Base R solution would be welcomed.
Example:
x <- c(rep(NA,10),rep(1,5),rep(NA,5),rep(1,10),rep(NA,3),rep(0,7),rep(NA,15),rep(1,9))
r <- c(0,diff(x)); r[r %in% -1] <- 2
From this sample data:
x
[1] NA NA NA NA NA NA NA NA NA NA 1 1 1 1 1 NA NA NA NA NA 1 1 1 1 1 1 1 1 1 1 NA NA NA 0 0 0 0 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 1 1 1 1 1 1 1 1
Desired output:
[1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
We could use rle to create a grouping variable ('gr') to split the 'x' into a list. Replace the first element that is 0 or 1 with 2 or 1 using match, concatenate with 0s, unlist and then replace the NA elements with 0.
xN <- x
xN[is.na(xN)] <- -999
v1 <- rle(xN)$lengths
gr <- rep(seq_along(v1), v1)
x1 <- unlist(lapply(split(x, gr), function(x)
c(match(x[1],1:0),rep(0,length(x)-1)) ), use.names=FALSE)
x1[is.na(x1)] <- 0
x1
#[1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
#[39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Or instead of split, we can use which and diff to replace the values.
x1 <- (!x)+2*(!is.na(x))-1
ind <- which(!is.na(x))
x1[c(ind[c(FALSE,diff(ind)==1)], which(is.na(x)))] <- 0
x1
#[1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
#[39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Or we can use rleid from the devel version of data.table as grouping variable. Replace the first element of 0's and 1's with 2 and 1 using match and change the NA values to 0.
library(data.table)#v1.9.5+
DT <- setDT(list(x))
DT[, c(match(V1[1], 1:0), rep(0, .N-1)) ,rleid(V1)][is.na(V1), V1:=0]$V1
#[1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
#[39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

Resources