Vector of comma separated strings to matrix - r

I have been working on this for an hour and I feel like I have hit a wall: I want to transform a vector of comma-separated strings into a matrix.
I have a vector like:
'ABC,DFGH,IJ'
'KLMN,OP,DFGH,QR'
'ST,ABC'
I want to get a matrix like
ABC DFGH IJ KLMN OP QR ST
1 1 1 0 0 0 0
0 1 0 1 1 1 0
1 0 0 0 0 0 1
Sample data:
myvec<-c('ABC,DFGH,IJ','KLMN,OP,DFGH,QR','ST,ABC')
Base R answers are welcome as well. I might need this trick for some bigger datasets again.

Another base R solution:
> myvec<-c('ABC,DFGH,IJ','KLMN,OP,DFGH,QR','ST,ABC')
> mv <- strsplit(myvec,",")
> u <- unique(unlist(mv))
> t(sapply(mv, function(x) u %in% x)*1)
# output without colnames
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 1 1 0 0 0 0
[2,] 0 1 0 1 1 1 0
[3,] 1 0 0 0 0 0 1
> r <- t(sapply(mv, function(x) u %in% x)*1)
# adding colnames
> colnames(r) <- u
> r
ABC DFGH IJ KLMN OP QR ST
[1,] 1 1 1 0 0 0 0
[2,] 0 1 0 1 1 1 0
[3,] 1 0 0 0 0 0 1

library(tidyverse)
myvec<-c('ABC,DFGH,IJ','KLMN,OP,DFGH,QR','ST,ABC')
data.frame(myvec) %>% # create a data frame
  mutate(id = row_number(), # create row id (helpful in order to reshape)
         value = 1) %>% # create value = 1 (helpful in order to reshape)
  separate_rows(myvec) %>% # separate values (using the commas; automatically done by this function)
  spread(myvec, value, fill = 0) %>% # reshape dataset
  select(-id) # remove row id column
# ABC DFGH IJ KLMN OP QR ST
# 1 1 1 1 0 0 0 0
# 2 0 1 0 1 1 1 0
# 3 1 0 0 0 0 0 1

You can try this with BASE R:
Data:
myvec<-c('ABC,DFGH,IJ','KLMN,OP,DFGH,QR','ST,ABC')
Solution:
unq <- unique(strsplit(paste0(myvec,collapse=","),",")[[1]])
sapply(unq, function(x)grepl(x,strsplit(myvec,","))+0)
Output:
> sapply(unq, function(x)grepl(x,strsplit(myvec,","))+0)
ABC DFGH IJ KLMN OP QR ST
[1,] 1 1 1 0 0 0 0
[2,] 0 1 0 1 1 1 0
[3,] 1 0 0 0 0 0 1
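One caveat with the grepl() approach: grepl() does pattern matching, so a token that is a substring of another token (say "AB" and "ABC") would be counted as present in both. The sample data has no such tokens, so the output above is unaffected. A minimal sketch of an exact-match variant follows, assuming a made-up vector myvec2 that is not part of the question:

```r
# Hypothetical input: the token "AB" is a substring of the token "ABC"
myvec2 <- c("ABC,IJ", "AB,KL")
parts <- strsplit(myvec2, ",", fixed = TRUE)
unq <- unique(unlist(parts))

# grepl() matches substrings, so "AB" is also found in the first string
substring_hits <- grepl("AB", myvec2, fixed = TRUE)

# Exact matching: test each token for membership in each split element
exact <- sapply(unq, function(tok) {
  vapply(parts, function(p) tok %in% p, logical(1)) + 0
})
exact
```

Here substring_hits is TRUE for both strings, while the "AB" column of exact is 1 only in the second row.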


How can I create dummy variables from a numeric variable in R?

I want to create N dummy variables such that the numeric value gives the number of leading zeros in each row, counting from the first column. For example, with N = 6, this:
x
a 5
b 2
c 4
d 1
e 9
It must become:
1 2 3 4 5 6
a 0 0 0 0 0 1
b 0 0 1 1 1 1
c 0 0 0 0 1 1
d 0 1 1 1 1 1
e 0 0 0 0 0 0
Thank you!
Here's a hacky solution for you
x = c(5,2,4,1,9)
N = 6
out = matrix(1, length(x), N)
for (i in seq_along(x))
  out[i, seq_len(min(x[i], N))] = 0
> out
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 1
[2,] 0 0 1 1 1 1
[3,] 0 0 0 0 1 1
[4,] 0 1 1 1 1 1
[5,] 0 0 0 0 0 0
We could do this in a vectorized manner by building a row/column index and setting the indexed cells of a pre-filled matrix of 1s to 0:
m1 <- matrix(1, ncol = N, nrow = length(x),
             dimnames = list(letters[seq_along(x)], seq_len(N)))
x1 <- pmin(x, ncol(m1))
m1[cbind(rep(seq_len(nrow(m1)), x1), sequence(x1))] <- 0
m1
# 1 2 3 4 5 6
#a 0 0 0 0 0 1
#b 0 0 1 1 1 1
#c 0 0 0 0 1 1
#d 0 1 1 1 1 1
#e 0 0 0 0 0 0
data
x <- c(5,2,4,1,9)
N <- 6
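The target pattern can also be read as "cell (i, j) is 1 exactly when j is greater than x[i]", which suggests a comparison over all row/column pairs. A minimal sketch using outer(), reusing x and N from the answers above (the name m2 is only illustrative):

```r
x <- c(5, 2, 4, 1, 9)
N <- 6
# outer() compares every x[i] with every column index j;
# adding 0 turns the resulting logical matrix into 0/1
m2 <- outer(x, seq_len(N), "<") + 0
dimnames(m2) <- list(letters[seq_along(x)], seq_len(N))
m2
```

Values of x larger than N simply give an all-zero row, so no pmin() step is needed in this variant.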

How can I create this special sequence?

I would like to create the following vector sequence.
0 1 0 0 2 0 0 0 3 0 0 0 0 4
My thought was to create the 0s first with rep(), but I am not sure how to add the 1:4.
Create a diagonal matrix, take the upper triangle, and remove the first element:
d <- diag(0:4)
d[upper.tri(d, TRUE)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
If you prefer a one-liner that makes no global assignments, wrap it up in a function:
(function() { d <- diag(0:4); d[upper.tri(d, TRUE)][-1L] })()
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
And for code golf purposes, here's another variation using d from above:
d[!lower.tri(d)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
rep and rbind up to their old tricks:
rep(rbind(0,1:4),rbind(1:4,1))
#[1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
This essentially creates 2 matrices, one for the value, and one for how many times the value is repeated. rep does not care if an input is a matrix, as it will just flatten it back to a vector going down each column in order.
rbind(0,1:4)
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 1 2 3 4
rbind(1:4,1)
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 1 1 1 1
You can use rep() to create a sequence that has n + 1 of each value:
n <- 4
myseq <- rep(seq_len(n), seq_len(n) + 1)
# [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 4
Then you can use diff() to find the elements you want. You need to append a 1 to the end of the diff() output, since you always want the last value.
c(diff(myseq), 1)
# [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
Then you just need to multiply the original sequence with the diff() output.
myseq <- myseq * c(diff(myseq), 1)
myseq
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
unlist(lapply(1:4, function(i) c(rep(0,i),i)))
# the sequence
s = 1:4
# create zeros vector
vec = rep(0, sum(s+1))
# assign the sequence to the corresponding position in the zeros vector
vec[cumsum(s+1)] <- s
vec
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
Or to be more succinct, use replace:
replace(rep(0, sum(s+1)), cumsum(s+1), s)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
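If the sequence is needed for lengths other than 4, the replace() idea can be wrapped in a small helper. This is only a sketch; the function name zero_padded_seq is made up for illustration:

```r
# Hypothetical helper: for each k in 1..n, emit k zeros followed by k
zero_padded_seq <- function(n) {
  s <- seq_len(n)
  # each block has length k + 1, so block k ends at position cumsum(s + 1)[k]
  replace(rep(0, sum(s + 1)), cumsum(s + 1), s)
}

zero_padded_seq(4)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
```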

R: Matrix counting matches when 2 teams interacted from schedule with 3 participants per match

I'd like to make some calculations on FIRST robotics teams and need to build, for lack of a better word, a binary interaction matrix that records when two teams were on the same alliance. Each alliance has three teams, so each match adds 9 values to the matrix when considering (i,j), (j,i), and (i,i).
The full data I'm using is here: http://frc-events.firstinspires.org/2016/MOKC/qualifications
But for simplicity, here is an example of 9 teams playing 1 match each.
> data.frame(Team.1=1:3,Team.2=4:6,Team.3=7:9)
Team.1 Team.2 Team.3
1 1 4 7
2 2 5 8
3 3 6 9
The matrix should count each binary interaction, (1,4),(4,7),(3,6),(6,3),(9,9), etc, and will be an N x N matrix, where in the above example N=9. Here's the matrix that represents the above lists:
> matrix(data=c(1,0,0,1,0,0,1,0,0,
+ 0,1,0,0,1,0,0,1,0,
+ 0,0,1,0,0,1,0,0,1,
+ 1,0,0,1,0,0,1,0,0,
+ 0,1,0,0,1,0,0,1,0,
+ 0,0,1,0,0,1,0,0,1,
+ 1,0,0,1,0,0,1,0,0,
+ 0,1,0,0,1,0,0,1,0,
+ 0,0,1,0,0,1,0,0,1),9,9)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 0 0 1 0 0 1 0 0
[2,] 0 1 0 0 1 0 0 1 0
[3,] 0 0 1 0 0 1 0 0 1
[4,] 1 0 0 1 0 0 1 0 0
[5,] 0 1 0 0 1 0 0 1 0
[6,] 0 0 1 0 0 1 0 0 1
[7,] 1 0 0 1 0 0 1 0 0
[8,] 0 1 0 0 1 0 0 1 0
[9,] 0 0 1 0 0 1 0 0 1
In the real data, the team numbers are not sequential and would look more like 5732, 1345, 3451, etc., and there are more matches per team, meaning the matrix values would range between 0 and the maximum number of matches any team played. This can be seen in the real data.
Thanks to anyone that can help.
There is probably a more elegant approach, but here is one using data.table.
library(data.table)
dat <- data.table(Team.1=1:3,Team.2=4:6,Team.3=7:9)
#add match ID
dat[,match:=1:.N]
#turn to long
mdat <- melt(dat,id="match",value.name="team")[,variable:=NULL]
#merge with itself
dat2 <- merge(mdat, mdat, by=c("match"),all=TRUE, allow.cartesian = TRUE)
# reshape
dcast(dat2, team.x~team.y, fun.agg=length)
team.x 1 2 3 4 5 6 7 8 9
1: 1 1 0 0 1 0 0 1 0 0
2: 2 0 1 0 0 1 0 0 1 0
3: 3 0 0 1 0 0 1 0 0 1
4: 4 1 0 0 1 0 0 1 0 0
5: 5 0 1 0 0 1 0 0 1 0
6: 6 0 0 1 0 0 1 0 0 1
7: 7 1 0 0 1 0 0 1 0 0
8: 8 0 1 0 0 1 0 0 1 0
9: 9 0 0 1 0 0 1 0 0 1
And, because I can, one in base-R. A case where I think the use of a for-loop is justified (as you keep modifying the same object).
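#original data as a plain data.frame (without the match column added above)
dat <- data.frame(Team.1=1:3,Team.2=4:6,Team.3=7:9)
#make matrix to put results in
nteams = length(unique(unlist(dat)))
res <- matrix(0,nrow=nteams, ncol=nteams)
#split data by row, generate combinations for each row and add to matrix
#(team numbers are used directly as row/column indices, so they must run from 1 to nteams)
for(i in seq_len(nrow(dat))){
  x=unlist(dat[i,])
  coords=as.matrix(expand.grid(x,x))
  res[coords] <- res[coords]+1
}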
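The question notes that real team numbers are not sequential, so they cannot serve directly as row and column indices. One way around this, sketched below with a made-up schedule (sched and its team numbers are assumptions, not the real event data), is to build a match-by-team incidence matrix and take its cross product, which counts shared matches for every pair of teams:

```r
# Hypothetical schedule with non-sequential team numbers
sched <- data.frame(Team.1 = c(5732, 1345, 5732),
                    Team.2 = c(1345, 3451, 3451),
                    Team.3 = c(3451, 8888, 9999))

teams <- sort(unique(unlist(sched)))

# Incidence matrix: one row per match, one column per team (1 = team played)
inc <- t(apply(sched, 1, function(r) as.integer(teams %in% r)))
colnames(inc) <- teams

# crossprod(inc) is t(inc) %*% inc: entry (i, j) counts matches shared by i and j
res <- crossprod(inc)
res
```

The diagonal then holds the number of matches each team played, and the row and column names carry the actual team numbers.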
Here is my suggestion with base functions. I tried to create a matrix. My approach was to look for the position indexes for 1.
library(magrittr)
mydf <- data.frame(Team.1 = 1:3, Team.2 = 4:6,Team.3 = 7:9)
### Create a matrix with position indexes
lapply(1:nrow(mydf), function(x){
  a <- t(combn(mydf[x, ], 2)) # Get some combination
  b <- a[, 2:1] # Get other combination by reversing columns
  foo <- rbind(a, b)
  foo
}) %>%
  do.call(rbind, .) -> ana
ana <- matrix(unlist(ana), nrow = nrow(ana))
### Another set: Get indexes for self (e.g., (1,1), (2,2), (3,3))
foo <- rep(1:max(mydf), times = 2)
matrix(foo, nrow = length(foo) / 2) -> bob
### A matrix with all position indexes
cammy <- rbind(ana, bob)
### Create a plain matrix
mat <- matrix(0, nrow = max(mydf), ncol = max(mydf))
### Fill in the matrix with 1
mat[cammy] <- 1
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 1 0 0 1 0 0 1 0 0
# [2,] 0 1 0 0 1 0 0 1 0
# [3,] 0 0 1 0 0 1 0 0 1
# [4,] 1 0 0 1 0 0 1 0 0
# [5,] 0 1 0 0 1 0 0 1 0
# [6,] 0 0 1 0 0 1 0 0 1
# [7,] 1 0 0 1 0 0 1 0 0
# [8,] 0 1 0 0 1 0 0 1 0
# [9,] 0 0 1 0 0 1 0 0 1
EDIT
Here is a revised version based on the previous idea. It is not as concise as Heroka's idea with base functions. In my modified data, teams 1 and 4 played two matches together. The idea is to count how many times each pair appears in the data set; the dplyr part does that. In the for loop, I fill in the matrix mat by going through each row of cammy.
mydf <- data.frame(Team.1=c(1:3,1),Team.2=c(4:6,4),Team.3=c(7:9,5))
# Team.1 Team.2 Team.3
#1 1 4 7
#2 2 5 8
#3 3 6 9
#4 1 4 5
library(dplyr)
lapply(1:nrow(mydf), function(x){
  a <- t(combn(mydf[x, ], 2)) # Get some combination
  b <- a[, 2:1] # Get other combination by reversing columns
  foo <- rbind(a, b)
  foo
}) %>%
  do.call(rbind, .) -> ana
ana <- data.frame(matrix(unlist(ana), nrow = nrow(ana)))
### Another set: Get indexes for self (e.g., (1,1), (2,2), (3,3))
foo <- rep(1:max(mydf), times = 2)
data.frame(matrix(foo, nrow = length(foo) / 2)) -> bob
cammy <- bind_rows(ana, bob) %>%
  group_by(X1, X2) %>%
  mutate(total = n()) %>%
  as.matrix
### Create a plain matrix
mat <- matrix(0, nrow = max(mydf), ncol = max(mydf))
for(i in 1:nrow(cammy)){
  mat[cammy[i, 1], cammy[i, 2]] <- cammy[i, 3]
}
print(mat)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 1 0 0 2 1 0 1 0 0
# [2,] 0 1 0 0 1 0 0 1 0
# [3,] 0 0 1 0 0 1 0 0 1
# [4,] 2 0 0 1 1 0 1 0 0
# [5,] 1 1 0 1 1 0 0 1 0
# [6,] 0 0 1 0 0 1 0 0 1
# [7,] 1 0 0 1 0 0 1 0 0
# [8,] 0 1 0 0 1 0 0 1 0
# [9,] 0 0 1 0 0 1 0 0 1

Converting to the right format and counting items in a data frame

How do I convert df into df2, where df is given by:
> df
ID VALUES
1 1 a,b,c,d
2 2 a
3 3 c,d,f,g
and df2 should look something like:
> df2
ID a b c d f g
1 1 1 1 1 1 0 0
2 2 1 0 0 0 0 0
3 3 0 0 1 1 1 1
where the values from df have been broken out into separate columns and 1s and 0s reflect whether or not the ID was associated with that value (from df).
Is there a specific function for this? I thought this is what table() did but if that's the case I can't figure it out.
Here's a method that uses no extra packages:
0 + t( sapply(df[['VALUES']], function(x) {
letters[1:6] %in% scan(text=x, what="", sep=",") }))
Read 4 items
Read 1 item
Read 4 items
[,1] [,2] [,3] [,4] [,5] [,6]
a,b,c,d 1 1 1 1 0 0
a 1 0 0 0 0 0
c,d,f,g 0 0 1 1 0 1
It does return a matrix, and it does depend on the VALUES column being character rather than factor. If you want to suppress the information messages from scan, there is a parameter for that (quiet=TRUE). You could cbind this with the ID column:
cbind( df["ID"], 0+ t( sapply(df[['VALUES']], function(x) {letters[1:6] %in% scan(text=x, what="", sep="," , quiet=TRUE) })) )
ID 1 2 3 4 5 6
a,b,c,d 1 1 1 1 1 0 0
a 2 1 0 0 0 0 0
c,d,f,g 3 0 0 1 1 0 1
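The answer above hard-codes letters[1:6] as the columns, which includes "e" and leaves out "g". A minimal sketch that derives the columns from the data instead is shown below; it rebuilds df from the question, and the names parts, lev and ind are only illustrative:

```r
df <- data.frame(ID = 1:3,
                 VALUES = c("a,b,c,d", "a", "c,d,f,g"),
                 stringsAsFactors = FALSE)

# Split each VALUES string into its individual values
parts <- strsplit(df$VALUES, ",", fixed = TRUE)
# Columns are the distinct values that actually occur
lev <- sort(unique(unlist(parts)))

# One row per ID, one 0/1 column per distinct value
ind <- t(vapply(parts, function(p) as.integer(lev %in% p), integer(length(lev))))
colnames(ind) <- lev

df2 <- cbind(df["ID"], ind)
df2
```

With this data the resulting columns are ID, a, b, c, d, f and g, matching the layout asked for in the question.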

R data manipulation matrix

I have a column as below. For the non-null elements only, I want to get a matrix
such as the one below. The 6th column represents the actual value.
1 0 0 0 0 1
0 1 0 0 0 2
0 0 0 1 0 5
Any hint on an efficient way to do this? Which commands should I use? I am thinking of writing an if statement within a for loop, but I don't think it will be very efficient :(
abc=c('1','2','null','5','null')
Assuming there is an error in your example, this is just a dummy variable coding essentially:
abc <- c('1','2','null','5','null')
abc <- factor(abc,levels=1:5)
cbind(model.matrix(~abc+0),orig=na.omit(abc))
# abc1 abc2 abc3 abc4 abc5 orig
#1 1 0 0 0 0 1
#2 0 1 0 0 0 2
#4 0 0 0 0 1 5
If you want to automatically calculate the range of possible factors, try:
abc <- c('1','2','null','5','null')
rng <- range(as.numeric(abc),na.rm=TRUE)
abc <- factor(abc,levels=seq(rng[1],rng[2]))
cbind(model.matrix(~abc+0),orig=na.omit(abc))
# abc1 abc2 abc3 abc4 abc5 orig
#1 1 0 0 0 0 1
#2 0 1 0 0 0 2
#4 0 0 0 0 1 5
It's not clear why that matrix is six elements wide but if it is length(abc)+1 then just substitute that expression for my use of 6.
> abcn <- as.numeric(abc)
> zero <- matrix(0,nrow=length(abcn[!is.na(abcn)]), ncol=6)
> zero[ cbind(1:3, which( !is.na(abcn)) ) ] <- 1
> zero[ , 6] <- abcn[!is.na(abcn)]
> zero
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 0 0 0 0 1
[2,] 0 1 0 0 0 2
[3,] 0 0 0 1 0 5
You can index the [<- function for matrices with a two-column matrix, and that's what I'm doing in the third line. The rest of it is ordinary matrix indexing.
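As a standalone illustration of that indexing form (the names m and idx are made up here): each row of a two-column index matrix is read as one (row, column) pair, so a single assignment sets all of the listed cells at once.

```r
m <- matrix(0, nrow = 3, ncol = 4)
# Each row of idx is one (row, column) position in m
idx <- cbind(row = c(1, 2, 3), col = c(1, 2, 4))
m[idx] <- 1
m
```

Only cells (1,1), (2,2) and (3,4) are changed; everything else stays 0.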
