One-hot encode each column in an integer matrix in R

I am having trouble translating a matrix into a one-hot encoding in R. I implemented this in Matlab, but I have difficulty handling the object in R. Here I have an object of type 'matrix'.
I would like to apply one-hot encoding to this matrix. I have a problem with the column names.
Here is an example:
> set.seed(4)
> t <- matrix(floor(runif(10, 1,9)),5,5)
[,1] [,2] [,3] [,4] [,5]
[1,] 5 3 5 3 5
[2,] 1 6 1 6 1
[3,] 3 8 3 8 3
[4,] 3 8 3 8 3
[5,] 7 1 7 1 7
> class(t)
[1] "matrix"
Expecting:
1_1 1_3 1_5 1_7 2_1 2_3 2_6 2_8 ...
[1,] 0 0 1 0 0 1 0 0 ...
[2,] 1 0 0 0 0 0 1 0 ...
[3,] 0 1 0 0 0 0 0 1 ...
[4,] 0 1 0 0 0 0 0 1 ...
[5,] 0 0 0 1 1 0 0 0 ...
I tried the following, but the matrix remains the same.
library(data.table)
library(mltools)
test_table <- one_hot(as.data.table(t))
Any suggestions would be very much appreciated.

Your data table must contain some columns (variables) that have class "factor". Try this:
> t <- data.table(t)
> t[,V1:=factor(V1)]
> one_hot(t)
V1_1 V1_3 V1_5 V1_7 V2 V3 V4 V5
1: 0 0 1 0 3 5 3 5
2: 1 0 0 0 6 1 6 1
3: 0 1 0 0 8 3 8 3
4: 0 1 0 0 8 3 8 3
5: 0 0 0 1 1 7 1 7
I have also read that the dummyVars function from the caret package is quicker if your matrix is large.
Edit: Forgot to set the seed. :P
And a quick way to factor all variables in a data table:
t.f <- t[, lapply(.SD, as.factor)]
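The same factor-based idea can also be sketched in base R with model.matrix, for readers who would rather not add a package. This is only an illustration under stated assumptions: the small matrix x is made up from the first two columns of the example, and the resulting column names are formed by pasting the column name and the level, so they differ from the X_Y names the question asks for.

```r
# Illustrative 5 x 2 matrix (values taken from the first two example columns)
x <- cbind(c(5, 1, 3, 3, 7), c(3, 6, 8, 8, 1))

# Every column must be a factor, otherwise it is treated as a numeric variable
df <- as.data.frame(x)
df[] <- lapply(df, factor)

# contrasts = FALSE requests a full set of indicator columns for each factor;
# "- 1" in the formula drops the intercept column
full_contrasts <- lapply(df, contrasts, contrasts = FALSE)
hot <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)

hot
```

Each row of hot contains exactly one 1 per original column.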

There are probably more concise ways to do this but this should work (and is at least easy to read and understand ;)
Suggested solution using base R and double loop:
set.seed(4)
t <- matrix(floor(runif(10, 1,9)),5,5)
# initialize result object
#
t_hot <- NULL
# for each column in original matrix
#
for (col in seq_along(t[1,])) {
# for each unique value in this column (sorted so the resulting
# columns appear in order)
#
for (val in sort(unique(t[, col]))) {
t_hot <- cbind(t_hot, ifelse(t[, col] == val, 1, 0))
# make name for this column
#
colnames(t_hot)[ncol(t_hot)] <- paste0(col, "_", val)
}
}
This returns:
1_1 1_3 1_5 1_7 2_1 2_3 2_6 2_8 3_1 3_3 3_5 3_7 4_1 4_3 4_6 4_8 5_1 5_3 5_5 5_7
[1,] 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0
[2,] 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0
[3,] 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0
[4,] 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0
[5,] 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1
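The double loop above grows t_hot one column at a time. A possible variation, shown here only as a sketch, builds one block per input column with outer() and binds the blocks once at the end. The function name one_hot_matrix and the small matrix x are hypothetical and used for illustration only.

```r
# Hypothetical helper: one-hot encode every column of a numeric matrix
one_hot_matrix <- function(x) {
  blocks <- lapply(seq_len(ncol(x)), function(j) {
    vals <- sort(unique(x[, j]))
    # outer() compares every row value with every unique value at once
    block <- outer(x[, j], vals, "==") * 1L
    colnames(block) <- paste0(j, "_", vals)
    block
  })
  do.call(cbind, blocks)
}

# Illustrative input: the first two columns of the example matrix
x <- cbind(c(5, 1, 3, 3, 7), c(3, 6, 8, 8, 1))
res <- one_hot_matrix(x)
res
```

The column names follow the same col_value pattern as the loop version.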

Related

Intersection of two integer matrices by position R

I would like to know which positions of one matrix intersect with another matrix and which values, for example
lab <- as.matrix(read.table(text="[1,] 0 0 0 0 0 0 0 0 0 1
[2,] 2 0 2 2 2 2 2 2 2 0
[3,] 2 0 2 0 0 0 0 0 2 2
[4,] 2 2 2 0 0 0 0 0 2 2
[5,] 2 0 2 0 0 0 0 0 0 0
[6,] 2 0 2 0 0 0 0 0 0 0
[7,] 2 0 2 0 0 0 0 0 0 0
[8,] 2 0 2 0 0 0 0 3 3 3
[9,] 2 0 2 0 0 0 0 0 3 3
[10,] 2 0 2 0 0 0 0 0 0 3")[,-1])
str(lab)
la1 <- as.matrix(read.table(text="[1,] 0 1 0 0 0 0 0 0 0 2
[2,] 3 0 4 4 4 4 4 4 4 0
[3,] 3 0 4 0 0 0 0 0 4 4
[4,] 3 0 4 0 5 5 0 0 4 4
[5,] 3 0 4 0 5 5 0 0 0 0
[6,] 3 0 4 0 0 0 0 0 0 0
[7,] 3 0 4 0 0 0 0 0 0 0
[8,] 3 0 4 0 0 0 0 6 6 6
[9,] 3 0 4 0 0 0 0 6 6 6
[10,] 3 0 4 0 0 0 0 0 0 6")[,-1])
These numbers represent patches: patch 2 of lab intersects patches 3 and 4 of la1, patch 1 of la1 intersects 0 (no other patch), and patch 3 of lab intersects patch 6 of la1. I am using the following code:
require(dplyr)
tuples <- tibble()
dx <- dim(lab)[1]
for( i in seq_len(dx))
for( j in seq_len(dx))
{
ii <- tibble(l0=lab[i,j],l1=la1[i,j])
tuples <- bind_rows(tuples,ii)
}
tuples %>% distinct()
Since I will be using large 3000x3000 matrices, I am wondering whether there is a faster way of doing this, perhaps with Rcpp or raster.
Without a double for loop, we can flatten the transposed matrices into a two-column tibble and get the distinct rows:
out <- tibble(l0 = c(t(lab)), l1 = c(t(la1))) %>%
distinct
-checking with OP's output
out_old <- tuples %>%
distinct()
all.equal(out, out_old, check.attributes = FALSE)
#[1] TRUE
Benchmarks
lab2 <- matrix(sample(0:9, size = 3000 * 3000, replace = TRUE), 3000, 3000)
la2 <- matrix(sample(0:9, size = 3000 * 3000, replace = TRUE), 3000, 3000)
system.time({out2 <- tibble(l0 = c(t(lab2)), l1 = c(t(la2))) %>%
distinct})
# user system elapsed
# 0.398 0.042 0.440
If you just want to speed up, you can try unique over data.table, e.g.,
library(data.table)
unique(data.table(c(lab), c(la1)))
Here comes a base R solution.
as.vector might be faster than c.
unique(cbind(as.vector(lab), as.vector(la1)))
# [,1] [,2]
# [1,] 0 0
# [2,] 2 3
# [3,] 0 1
# [4,] 2 0
# [5,] 2 4
# [6,] 0 5
# [7,] 3 6
# [8,] 0 6
# [9,] 1 2
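Another base R possibility, sketched here under the assumption that all patch labels are non-negative integers, is to pack each pair of labels into a single number so that unique() only has to work on one vector. The tiny matrices below are made up for illustration and are not the lab and la1 from the question.

```r
# Illustrative 2 x 3 label matrices (assumed non-negative integers)
lab <- matrix(c(0, 2, 2, 0, 0, 3), 2, 3)
la1 <- matrix(c(1, 3, 3, 0, 0, 6), 2, 3)

# Encode each (lab, la1) pair as one number: lab * base + la1
base <- max(la1) + 1
key <- unique(as.vector(lab) * base + as.vector(la1))

# Decode the unique keys back into the two labels
pairs <- cbind(l0 = key %/% base, l1 = key %% base)
pairs
```

Whether this is actually faster than the approaches above on 3000x3000 input would need to be benchmarked.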

adding data frame of counts to template data frame in R

I have data.frames of counts such as:
a <- data.frame(id=1:10,
"1"=c(rep(1,3),rep(0,7)),
"3"=c(rep(0,4),rep(1,6)))
names(a)[2:3] <- c("1","3")
a
> a
id 1 3
1 1 1 0
2 2 1 0
3 3 1 0
4 4 0 0
5 5 0 1
6 6 0 1
7 7 0 1
8 8 0 1
9 9 0 1
10 10 0 1
and a template data.frame such as
m <- data.frame(id=1:10,
"1"= rep(0,10),
"2"= rep(0,10),
"3"= rep(0,10),
"4"= rep(0,10))
names(m)[-1] <- 1:4
m
> m
id 1 2 3 4
1 1 0 0 0 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 0 0 0 0
5 5 0 0 0 0
6 6 0 0 0 0
7 7 0 0 0 0
8 8 0 0 0 0
9 9 0 0 0 0
10 10 0 0 0 0
and I want to add the values of a into the template m
in the appropriate columns, leaving the rest as 0.
This is working but I would like to know
if there is a more elegant way, perhaps using plyr or data.table:
library(plyr)  # provides rbind.fill
provi <- rbind.fill(a,m)
provi[is.na(provi)] <- 0
mnew <- aggregate(provi[,-1],by=list(provi$id),FUN=sum)
names(mnew)[1] <- "id"
mnew <- mnew[c(1,order(names(mnew)[-1])+1)]
mnew
> mnew
id 1 2 3 4
1 1 1 0 0 0
2 2 1 0 0 0
3 3 1 0 0 0
4 4 0 0 0 0
5 5 0 0 1 0
6 6 0 0 1 0
7 7 0 0 1 0
8 8 0 0 1 0
9 9 0 0 1 0
10 10 0 0 1 0
I guess the concise option would be:
m[names(a)] <- a
Or we match the column names ('i1'), use that to create the column index with max.col, cbind with the row index ('i2'), and a similar step can be done to create 'i3'. We change the values in 'm' corresponding to 'i2' with the 'a' values based on 'i3'.
i1 <- match(names(a)[-1], names(m)[-1])
i2 <- cbind(m$id, i1[max.col(a[-1], 'first')]+1L)
i3 <- cbind(a$id, max.col(a[-1], 'first')+1L)
m[i2] <- a[i3]
m
# id 1 2 3 4
#1 1 1 0 0 0
#2 2 1 0 0 0
#3 3 1 0 0 0
#4 4 0 0 0 0
#5 5 0 0 1 0
#6 6 0 0 1 0
#7 7 0 0 1 0
#8 8 0 0 1 0
#9 9 0 0 1 0
#10 10 0 0 1 0
A data.table option would be melt/dcast
library(data.table)
dcast(melt(setDT(a), id.var='id')[,
variable:= factor(variable, levels=1:4)],
id~variable, value.var='value', drop=FALSE, fill=0)
# id 1 2 3 4
# 1: 1 1 0 0 0
# 2: 2 1 0 0 0
# 3: 3 1 0 0 0
# 4: 4 0 0 0 0
# 5: 5 0 0 1 0
# 6: 6 0 0 1 0
# 7: 7 0 0 1 0
# 8: 8 0 0 1 0
# 9: 9 0 0 1 0
#10: 10 0 0 1 0
A similar dplyr/tidyr option would be
library(dplyr)
library(tidyr)
gather(a, Var, Val, -id) %>%
mutate(Var=factor(Var, levels=1:4)) %>%
spread(Var, Val, drop=FALSE, fill=0)
You could use merge, too:
res <- suppressWarnings(merge(a, m, by="id", suffixes = c("", "")))
(res[, which(!duplicated(names(res)))][, names(m)])
# id 1 2 3 4
# 1 1 1 0 0 0
# 2 2 1 0 0 0
# 3 3 1 0 0 0
# 4 4 0 0 0 0
# 5 5 0 0 1 0
# 6 6 0 0 1 0
# 7 7 0 0 1 0
# 8 8 0 0 1 0
# 9 9 0 0 1 0
# 10 10 0 0 1 0
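The answers above rely on a and m having the same ids in the same order. If a only covers some of the ids, one possible base R sketch is to look up the target rows with match() first. The small data frames below are made up for illustration, and the sketch assumes every id in a also occurs in m.

```r
# Illustrative data: a holds counts for only two of the five ids in m
a <- data.frame(id = c(2, 4), "1" = c(1, 0), "3" = c(0, 1), check.names = FALSE)
m <- data.frame(id = 1:5, "1" = 0, "2" = 0, "3" = 0, "4" = 0, check.names = FALSE)

# Row positions in m for each id in a (assumes no id is missing from m)
rows <- match(a$id, m$id)

# Overwrite only the matching rows and the columns that a actually has
m[rows, names(a)[-1]] <- a[-1]
m
```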

list all permutations of k numbers, taken from 0:k, that sums to k

This question is closely related to another question, R:sample(). I want to find a way in R to list all the permutations of k numbers that sum to k, where each number is chosen from 0:k. If k=7, I can choose 7 numbers from 0,1,...,7. One feasible solution is 0,1,2,3,1,0,0; another is 1,1,1,1,1,1,1. I don't want to generate all permutations first, since the count explodes once k is even slightly larger than 7.
Of course in the k=7 example I could use the following:
perms7<-matrix(numeric(7*1716),ncol=7)
count=0
for(i in 0:7)
for(j in 0:(7-i))
for(k in 0:(7-i-j))
for(l in 0:(7-i-j-k))
for(n in 0:(7-i-j-k-l))
for(m in 0:(7-i-j-k-l-n)){
res<-7-i-j-k-l-n-m
count<-count+1
perms7[count,]<-c(i,j,k,l,n,m,res)
}
head(perms7,10)
But how can I generalize this approach to account for any k without having to write (k-1) loops?
I tried to come up with a recursive scheme:
perms7<-matrix(numeric(7*1716),ncol=7) #store solutions (adjustable size later)
k<-7 #size of interest
d<-0 #depth
count=0 #count of permutations
rec<-function(j,d,a){
a<-a-j #max loop
d<-d+1 #depth (position)
for(i in 0:a ) {
if(d<(k-1)) rec(i,d,a)
count<<-count+1
perms7[count,d]<<-i
perms7[count,k]<<-k-sum(perms7[count,-k])
}
}
rec(0,0,k)
But I got stuck, and I'm not quite sure this is the right way to go. I wonder whether there is a "magic" R function that neatly handles this (admittedly very specific) problem, or at least part of it.
In the k=7 case, all the 2.097.152 permutations and the 1.716 that sum to k=7 can be found by:
library(gtools)
k=7
perms <- permutations(k+1, k, 0:k, repeats.allowed=TRUE) #all permutations
perms.k <- perms[rowSums(perms) == k,] #permutations which sums to k
For k=8 there are 43.046.721 permutations, but I only want to list the 6.435 that sum to k.
Any help is greatly appreciated!
There's a package for that...
require( partitions )
parts(7)
#[1,] 7 6 5 5 4 4 4 3 3 3 3 2 2 2 1
#[2,] 0 1 2 1 3 2 1 3 2 2 1 2 2 1 1
#[3,] 0 0 0 1 0 1 1 1 2 1 1 2 1 1 1
#[4,] 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1
#[5,] 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1
#[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
#[7,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
You appear to be looking for compositions(). e.g. for k=4:
parts(4)
#[1,] 4 3 2 2 1
#[2,] 0 1 2 1 1
#[3,] 0 0 0 1 1
#[4,] 0 0 0 0 1
compositions(4,4)
#[1,] 4 3 2 1 0 3 2 1 0 2 1 0 1 0 0 3 2 1 0 2 1 0 1 0 0 2 1 0 1 0 0 1 0 0 0
#[2,] 0 1 2 3 4 0 1 2 3 0 1 2 0 1 0 0 1 2 3 0 1 2 0 1 0 0 1 2 0 1 0 0 1 0 0
#[3,] 0 0 0 0 0 1 1 1 1 2 2 2 3 3 4 0 0 0 0 1 1 1 2 2 3 0 0 0 1 1 2 0 0 1 0
#[4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 4
And just to check your math... :-)
ncol(compositions(8,8))
#[1] 6435
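For readers who want to see how the nested loops in the question generalize without a package, here is a minimal recursive sketch in base R. The function name compositions_rec is hypothetical. Unlike compositions() from the partitions package, it returns one composition per row, like perms7 in the question.

```r
# Hypothetical helper: all ways to write `total` as an ordered sum of
# `parts` non-negative integers, one composition per row
compositions_rec <- function(total, parts) {
  # With one part left, it has to take whatever remains
  if (parts == 1) return(matrix(total, nrow = 1))
  blocks <- lapply(0:total, function(first) {
    # Fix the first entry, then distribute the remainder over the other parts
    cbind(first, compositions_rec(total - first, parts - 1))
  })
  unname(do.call(rbind, blocks))
}

res <- compositions_rec(7, 7)
nrow(res)
```

The row count for k=7 can be compared with the 1716 rows allocated for perms7 in the question.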

Experimental design table in R [duplicate]

This question already has answers here:
How to create design matrix in r
(6 answers)
Closed 8 years ago.
How can I generate the following experimental design table in R?
Looks like you want every combination except 0 0 0 0.
> # create all combinations of 4 0s/1s
> design <- expand.grid(0:1, 0:1, 0:1, 0:1)
> design
Var1 Var2 Var3 Var4
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 1 0 0
5 0 0 1 0
6 1 0 1 0
7 0 1 1 0
8 1 1 1 0
9 0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
> # remove the single run you don't want
> design[-1,]
Var1 Var2 Var3 Var4
2 1 0 0 0
3 0 1 0 0
4 1 1 0 0
5 0 0 1 0
6 1 0 1 0
7 0 1 1 0
8 1 1 1 0
9 0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
You may make use of a nice trick connected with binary representations of consecutive integers (I assume you do not wish to generate a row with zeros only):
n <- 4
M <- matrix(NA_integer_, nrow=2^n-1, ncol=n)
for (i in 1:(2^n-1))
M[i, ] <- as.integer(intToBits(i)[1:n])
print(M)
which gives for n==4:
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 1 1 0 0
[4,] 0 0 1 0
[5,] 1 0 1 0
[6,] 0 1 1 0
[7,] 1 1 1 0
[8,] 0 0 0 1
[9,] 1 0 0 1
[10,] 0 1 0 1
[11,] 1 1 0 1
[12,] 0 0 1 1
[13,] 1 0 1 1
[14,] 0 1 1 1
[15,] 1 1 1 1
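The same binary-representation trick can be written without an explicit loop. The sketch below uses integer division and the remainder mod 2 to read each bit, with outer() applying this to every run and bit position. The names runs and bits are illustrative only.

```r
n <- 4
runs <- seq_len(2^n - 1)   # 1..15, which skips the all-zero run
bits <- 0:(n - 1)          # bit positions, least significant first

# Integer-divide by 2^b, then take the remainder mod 2 to read bit b
M <- outer(runs, bits, function(i, b) (i %/% 2^b) %% 2)
print(M)
```

The rows come out in the same order as in the loop version, because column 1 holds the least significant bit in both.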
If you're going to analyze factorial designs in R, you're better off using one of the many DoE packages. For instance, the DoE.base package has a function, fac.design(...) which does essentially what you want:
library(DoE.base)
df <- fac.design(nlevels=2,nfactors=4,randomize=F,
factor.names=list(0:1,0:1,0:1,0:1))
As pointed out in another answer, your design is a full factorial, except that it is missing two of the combinations (which makes me wonder if it's a factorial design at all...).

How to transform an item set matrix in R

How to transform a matrix like
A 1 2 3
B 3 6 9
C 5 6 9
D 1 2 4
into form like:
1 2 3 4 5 6 7 8 9
1 0 2 1 1 0 0 0 0 0
2 0 0 1 1 0 0 0 0 0
3 0 0 0 0 0 1 0 0 1
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 1 0 0 1
6 0 0 0 0 0 0 0 0 2
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
I have an implementation of this, but it uses a for loop.
I wonder whether there is a built-in function in R for this (for example "apply").
Added:
Sorry for the confusion. The first matrix just lists item sets; every set of items is expanded into pairs. For example, the first set "1 2 3" becomes (1,2), (1,3), (2,3), and each pair is counted in the corresponding cell of the second matrix.
Another question:
If the matrix is very large (10000000*10000000) and sparse,
should I use a sparse matrix or big.matrix?
Thanks!
Removing the row names from M gives this:
m <- matrix(c(1,3,5,1,2,6,6,2,3,9,9,4), nrow=4)
> m
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 3 6 9
## [3,] 5 6 9
## [4,] 1 2 4
# The indices that you want to increment in x, but some are repeated
# combn() is used to compute the combinations of columns
indices <- matrix(t(m[,combn(1:3,2)]),,2,byrow=TRUE)
# Count repeated rows
ones <- rep(1,nrow(indices))
cnt <- aggregate(ones, by=as.data.frame(indices), FUN=sum)
# Set each value to the appropriate count
x <- matrix(0, 9, 9)
x[as.matrix(cnt[,1:2])] <- cnt[,3]
x
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 0 2 1 1 0 0 0 0 0
## [2,] 0 0 1 1 0 0 0 0 0
## [3,] 0 0 0 0 0 1 0 0 1
## [4,] 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 1 0 0 1
## [6,] 0 0 0 0 0 0 0 0 2
## [7,] 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0
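The counting step can also be sketched with table() instead of aggregate(). This version enumerates the pairs row by row with combn() and cross-tabulates them. It assumes, as in the example, that the items are the integers 1 to 9 and that each row is sorted in increasing order.

```r
m <- matrix(c(1, 3, 5, 1, 2, 6, 6, 2, 3, 9, 9, 4), nrow = 4)

# All unordered pairs within each row, stacked into a two-column matrix
pairs <- do.call(rbind, lapply(seq_len(nrow(m)), function(r) t(combn(m[r, ], 2))))

# Cross-tabulate the pairs; fixing the factor levels keeps empty rows/columns
counts <- table(factor(pairs[, 1], levels = 1:9),
                factor(pairs[, 2], levels = 1:9))

# Drop the table class to get a plain 9 x 9 matrix
x <- matrix(counts, nrow = 9)
x
```

This dense version only makes sense for a small item range; it does not address the very large sparse case raised in the question.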
