Clustering of Count data - r

I am currently trying to find clusters in a data set that looks like this:
Dienstag 19 Mittwoch 20 Donnerstag 21 Freitag 22 Montag 25 Dienstag 26 Donnerstag 28
[1,] 0 0 0 0 0 0 NA
[2,] 0 0 0 0 0 0 NA
[3,] 0 0 0 0 0 0 NA
[4,] 0 0 0 0 1 0 NA
[5,] 1 0 1 1 1 1 NA
[6,] 0 0 0 0 0 0 NA
[7,] 4 0 1 0 2 1 NA
[8,] 0 1 2 1 0 2 NA
[9,] 0 0 1 0 0 0 NA
[10,] 1 0 0 0 0 1 0
[11,] 2 0 1 0 0 5 0
[12,] 1 0 0 0 0 1 1
[13,] 0 1 0 0 0 0 0
[14,] 0 0 1 0 4 1 0
It contains the number of times a user used an application, broken down by day (columns) and hour (rows).
I want to find patterns/clusters that relate usage to the hour, but I don't know how to go about it. It would be really helpful if you could suggest some methods.

There are statistical approaches to clustering as well, but here's a visual one. I was lazy and used libraries I am familiar with; it can likely be done more efficiently with some base tools.
dat <- read.table(text=" Dienstag.19 Mittwoch.20 Donnerstag.21 Freitag.22 Montag.25 Dienstag.26 Donnerstag.28
[1,] 0 0 0 0 0 0 NA
[2,] 0 0 0 0 0 0 NA
[3,] 0 0 0 0 0 0 NA
[4,] 0 0 0 0 1 0 NA
[5,] 1 0 1 1 1 1 NA
[6,] 0 0 0 0 0 0 NA
[7,] 4 0 1 0 2 1 NA
[8,] 0 1 2 1 0 2 NA
[9,] 0 0 1 0 0 0 NA
[10,] 1 0 0 0 0 1 0
[11,] 2 0 1 0 0 5 0
[12,] 1 0 0 0 0 1 1
[13,] 0 1 0 0 0 0 0
[14,] 0 0 1 0 4 1 0", header=TRUE)
dat$hour <- factor(1:nrow(dat))
library(reshape2); library(qdap); library(ggplot2); library(plyr)
dat2 <- melt(dat)
dat2[, 2] <- beg2char(dat2[, 2], ".")
dat2 <- ddply(dat2, .(variable), transform,
              rescale = scale(value))
# draw the heat map first, then save it (ggsave() writes the last plot drawn)
ggplot(dat2, aes(variable, hour)) + geom_tile(aes(fill=rescale)) +
  scale_fill_gradient(low = "white", high = "red")
ggsave("heat.png")
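For readers who want to avoid the extra packages, here is a rough base R sketch of the same idea. It is only an illustration: it drops the last, partly missing day for simplicity, and the column labels and colour ramp are my own choices, not part of the answer above.

```r
# Counts from the question: hours in rows, the first six days in columns
m <- matrix(c(
  0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 1, 0,
  1, 0, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0,
  4, 0, 1, 0, 2, 1,
  0, 1, 2, 1, 0, 2,
  0, 0, 1, 0, 0, 0,
  1, 0, 0, 0, 0, 1,
  2, 0, 1, 0, 0, 5,
  1, 0, 0, 0, 0, 1,
  0, 1, 0, 0, 0, 0,
  0, 0, 1, 0, 4, 1
), ncol = 6, byrow = TRUE,
dimnames = list(NULL, c("Di.19", "Mi.20", "Do.21", "Fr.22", "Mo.25", "Di.26")))

# Centre and scale each day column, like the rescale step above
scaled <- scale(m)

# image() expects z with one row per x value, hence the transpose
image(x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = t(scaled),
      col = colorRampPalette(c("white", "red"))(20),
      axes = FALSE, xlab = "day", ylab = "hour")
axis(1, at = seq_len(ncol(m)), labels = colnames(m))
axis(2, at = seq_len(nrow(m)))
```

Darker tiles mark hours that are unusually busy for that day, which is the same reading as the ggplot2 heat map.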

Most clustering algorithms assume continuous data. You can of course "cast" the integers to double values, but the results will not be as meaningful as they would be for truly continuous values.
I like Tyler's visual approach. If there is a meaningful pattern, your brain's visual cortex is probably the best tool to discover it.
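If you still want to try an algorithm, one possible starting point is hierarchical clustering of the hours. The sketch below is not a recommendation from the answers above; the square-root transform (a common variance-stabilising step for counts), average linkage and k = 3 are all assumptions chosen for illustration, and the partly missing last day is left out.

```r
# Counts from the question: hours in rows, the first six days in columns
counts <- matrix(c(
  0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 1, 0,
  1, 0, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0,
  4, 0, 1, 0, 2, 1,
  0, 1, 2, 1, 0, 2,
  0, 0, 1, 0, 0, 0,
  1, 0, 0, 0, 0, 1,
  2, 0, 1, 0, 0, 5,
  1, 0, 0, 0, 0, 1,
  0, 1, 0, 0, 0, 0,
  0, 0, 1, 0, 4, 1
), ncol = 6, byrow = TRUE)

# Assumption: damp large counts with a square root before measuring distance
d  <- dist(sqrt(counts))
hc <- hclust(d, method = "average")

# Assumption: cut the tree into three groups of hours
groups <- cutree(hc, k = 3)
split(seq_len(nrow(counts)), groups)
```

With so few, mostly zero observations the groups should be read with caution; plotting `hc` and comparing it with the heat map is a reasonable sanity check.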

Related

Intersection of two integer matrices by position R

I would like to know which positions of one matrix intersect with another matrix and which values, for example
lab <- as.matrix(read.table(text="[1,] 0 0 0 0 0 0 0 0 0 1
[2,] 2 0 2 2 2 2 2 2 2 0
[3,] 2 0 2 0 0 0 0 0 2 2
[4,] 2 2 2 0 0 0 0 0 2 2
[5,] 2 0 2 0 0 0 0 0 0 0
[6,] 2 0 2 0 0 0 0 0 0 0
[7,] 2 0 2 0 0 0 0 0 0 0
[8,] 2 0 2 0 0 0 0 3 3 3
[9,] 2 0 2 0 0 0 0 0 3 3
[10,] 2 0 2 0 0 0 0 0 0 3")[,-1])
str(lab)
la1 <- as.matrix(read.table(text="[1,] 0 1 0 0 0 0 0 0 0 2
[2,] 3 0 4 4 4 4 4 4 4 0
[3,] 3 0 4 0 0 0 0 0 4 4
[4,] 3 0 4 0 5 5 0 0 4 4
[5,] 3 0 4 0 5 5 0 0 0 0
[6,] 3 0 4 0 0 0 0 0 0 0
[7,] 3 0 4 0 0 0 0 0 0 0
[8,] 3 0 4 0 0 0 0 6 6 6
[9,] 3 0 4 0 0 0 0 6 6 6
[10,] 3 0 4 0 0 0 0 0 0 6")[,-1])
These numbers represent patches: patch 2 of lab intersects patches 3 and 4 of la1, patch 1 of lab intersects patch 2 of la1, and patch 3 of lab intersects patch 6 of la1. I am using the following code
require(dplyr)
tuples <- tibble()
dx <- dim(lab)[1]
for( i in seq_len(dx))
for( j in seq_len(dx))
{
ii <- tibble(l0=lab[i,j],l1=la1[i,j])
tuples <- bind_rows(tuples,ii)
}
tuples %>% distinct()
Since I will be using large 3000x3000 matrices, I am wondering whether there is a faster way of doing this, perhaps with Rcpp or raster.
Without a double for loop, we can flatten the transposed matrices into a two-column tibble and keep the distinct rows
out <- tibble(l0 = c(t(lab)), l1 = c(t(la1))) %>%
distinct
Checking against the OP's output:
out_old <- tuples %>%
distinct()
all.equal(out, out_old, check.attributes = FALSE)
#[1] TRUE
Benchmarks
lab2 <- matrix(sample(0:9, size = 3000 * 3000, replace = TRUE), 3000, 3000)
la2 <- matrix(sample(0:9, size = 3000 * 3000, replace = TRUE), 3000, 3000)
system.time({out2 <- tibble(l0 = c(t(lab2)), l1 = c(t(la2))) %>%
distinct})
# user system elapsed
# 0.398 0.042 0.440
If you just want a speed-up, you can try unique over a data.table, e.g.,
library(data.table)
unique(data.table(c(lab), c(la1)))
Here comes a base R solution.
as.vector might be faster than c.
unique(cbind(as.vector(lab), as.vector(la1)))
# [,1] [,2]
# [1,] 0 0
# [2,] 2 3
# [3,] 0 1
# [4,] 2 0
# [5,] 2 4
# [6,] 0 5
# [7,] 3 6
# [8,] 0 6
# [9,] 1 2
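Another variation on the same idea, offered only as a sketch: fold each pair of labels into a single number so that unique() works on one plain vector instead of a two-column object. This assumes the patch labels are non-negative integers; the small matrices below are made-up toy data, not the ones from the question.

```r
# Toy patch matrices (hypothetical values, smaller than those in the question)
lab <- matrix(c(0, 2, 2,
                0, 2, 0,
                1, 0, 3), nrow = 3, byrow = TRUE)
la1 <- matrix(c(0, 3, 4,
                1, 3, 0,
                2, 0, 6), nrow = 3, byrow = TRUE)

# Encode each (lab, la1) pair as one number; 'base' must exceed every la1 label
base <- max(la1) + 1
key  <- unique(as.vector(lab) * base + as.vector(la1))

# Decode the unique keys back into the two label columns
pairs <- cbind(l0 = key %/% base, l1 = key %% base)
pairs
```

The rows come out in the same first-occurrence order as unique(cbind(...)), so the two results can be compared directly. Whether this is actually faster on 3000x3000 matrices would need to be benchmarked.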

One hot encode each column in an Int matrix in R

I am having trouble translating a matrix into a one-hot encoding in R. I implemented this in Matlab, but I have difficulty handling the object in R. Here I have an object of type 'matrix'.
I would like to apply one-hot encoding to this matrix, and I have a problem with the column names.
Here is an example:
> set.seed(4)
> t <- matrix(floor(runif(10, 1,9)),5,5)
[,1] [,2] [,3] [,4] [,5]
[1,] 5 3 5 3 5
[2,] 1 6 1 6 1
[3,] 3 8 3 8 3
[4,] 3 8 3 8 3
[5,] 7 1 7 1 7
> class(t)
[1] "matrix"
Expecting:
1_1 1_3 1_5 1_7 2_1 2_3 2_6 2_8 ...
[1,] 0 0 1 0 0 1 0 0 ...
[2,] 1 0 0 0 0 0 1 0 ...
[3,] 0 1 0 0 0 0 0 1 ...
[4,] 0 1 0 0 0 0 0 1 ...
[5,] 0 0 0 1 1 0 0 0 ...
I tried the following, but the matrix remains the same.
library(data.table)
library(mltools)
test_table <- one_hot(as.data.table(t))
Any suggestions would be very much appreciated.
Your data table must contain some columns (variables) that have class "factor". Try this:
> t <- data.table(t)
> t[,V1:=factor(V1)]
> one_hot(t)
V1_1 V1_3 V1_5 V1_7 V2 V3 V4 V5
1: 0 0 1 0 3 5 3 5
2: 1 0 0 0 6 1 6 1
3: 0 1 0 0 8 3 8 3
4: 0 1 0 0 8 3 8 3
5: 0 0 0 1 1 7 1 7
I have read, though, that the dummyVars function from the caret package is quicker if your matrix is large.
Edit: Forgot to set the seed. :P
And a quick way to factor all variables in a data table:
t.f <- t[, lapply(.SD, as.factor)]
There are probably more concise ways to do this but this should work (and is at least easy to read and understand ;)
Suggested solution using base R and double loop:
set.seed(4)
t <- matrix(floor(runif(10, 1,9)),5,5)
# initialize result object
#
t_hot <- NULL
# for each column in original matrix
#
for (col in seq_along(t[1,])) {
# for each unique value in this column (sorted so the resulting
# columns appear in order)
#
for (val in sort(unique(t[, col]))) {
t_hot <- cbind(t_hot, ifelse(t[, col] == val, 1, 0))
# make name for this column
#
colnames(t_hot)[ncol(t_hot)] <- paste0(col, "_", val)
}
}
This returns:
1_1 1_3 1_5 1_7 2_1 2_3 2_6 2_8 3_1 3_3 3_5 3_7 4_1 4_3 4_6 4_8 5_1 5_3 5_5 5_7
[1,] 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0
[2,] 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0
[3,] 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0
[4,] 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0
[5,] 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1
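The double loop above grows t_hot one column at a time. A possible variation, shown here only as a sketch, builds one block per input column with outer() and binds the blocks once at the end. The helper name one_hot_matrix is made up for this example, and the input is just the first two columns of the example matrix typed in by hand.

```r
# Hypothetical helper: one-hot encode every column of a numeric matrix
one_hot_matrix <- function(m) {
  blocks <- lapply(seq_len(ncol(m)), function(j) {
    vals  <- sort(unique(m[, j]))
    # outer() compares every entry with every unique value; * 1L turns
    # the logical result into 0/1 integers
    block <- outer(m[, j], vals, "==") * 1L
    colnames(block) <- paste0(j, "_", vals)
    block
  })
  do.call(cbind, blocks)
}

# First two columns of the example matrix from the question
m <- cbind(c(5, 1, 3, 3, 7),
           c(3, 6, 8, 8, 1))
h <- one_hot_matrix(m)
h
```

Each row of h contains exactly one 1 per original column, and the column names follow the same column_value pattern as in the expected output.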

Specific sequence creation in R

I want to create the following sequences in a smart way instead of hard-coding them:
'0-0-0-0-0-0'
'0-1-0-0-0-0'
'0-0-1-0-0-0'
'0-0-0-1-0-0'
'0-0-0-0-1-0'
'0-0-0-0-0-1'
'1-0-0-0-0-0'
'1-1-0-0-0-0'
'1-0-1-0-0-0'
'1-0-0-1-0-0'
'1-0-0-0-1-0'
'1-0-0-0-0-1'
'1-1-1-1-1-1'
'2-0-0-0-0-0'
'2-1-0-0-0-0'
'2-0-1-0-0-0'
'2-0-0-1-0-0'
'2-0-0-0-1-0'
'2-0-0-0-0-1'
'3-0-0-0-0-0'
'3-1-0-0-0-0'
'3-0-1-0-0-0'
'3-0-0-1-0-0'
'3-0-0-0-1-0'
'3-0-0-0-0-1'
'0-2-0-0-0-0'
'0-0-2-0-0-0'
'0-0-0-2-0-0'
'0-0-0-0-2-0'
'0-0-0-0-0-2'
and so on...
To elaborate on the pattern: I have 4 states {0,1,2,3} and I want all possible sequences of length 6 that start with any of the states and have a non-zero state in at most one of the following positions.
Here's one method. I generate a simple description of each sequence, then build the sequences (and de-duplicate, which is needed because of the all-intermediate-0 items).
dd = expand.grid(first = 0:3, inter_value = 0:3, inter_position = 2:6)
result = t(apply(dd, 1, function(x) {
z = c(x["first"], rep(0L, 5))
z[x["inter_position"]] = x["inter_value"]
z
}))
result = result[!duplicated(result), ]
dim(result)
# [1] 64 6
head(result, 10)
# first
# [1,] 0 0 0 0 0 0
# [2,] 1 0 0 0 0 0
# [3,] 2 0 0 0 0 0
# [4,] 3 0 0 0 0 0
# [5,] 0 1 0 0 0 0
# [6,] 1 1 0 0 0 0
# [7,] 2 1 0 0 0 0
# [8,] 3 1 0 0 0 0
# [9,] 0 2 0 0 0 0
# [10,] 1 2 0 0 0 0
Getting the dashes:
apply(result, 1, paste, collapse = "-")
# [1] "0-0-0-0-0-0" "1-0-0-0-0-0" "2-0-0-0-0-0" "3-0-0-0-0-0" "0-1-0-0-0-0" "1-1-0-0-0-0" "2-1-0-0-0-0"
# [8] "3-1-0-0-0-0" "0-2-0-0-0-0" "1-2-0-0-0-0" "2-2-0-0-0-0" "3-2-0-0-0-0" "0-3-0-0-0-0" "1-3-0-0-0-0"
# [15] "2-3-0-0-0-0" "3-3-0-0-0-0" "0-0-1-0-0-0" "1-0-1-0-0-0" "2-0-1-0-0-0" "3-0-1-0-0-0" "0-0-2-0-0-0"
# [22] "1-0-2-0-0-0" "2-0-2-0-0-0" "3-0-2-0-0-0" "0-0-3-0-0-0" "1-0-3-0-0-0" "2-0-3-0-0-0" "3-0-3-0-0-0"
# [29] "0-0-0-1-0-0" "1-0-0-1-0-0" "2-0-0-1-0-0" "3-0-0-1-0-0" "0-0-0-2-0-0" "1-0-0-2-0-0" "2-0-0-2-0-0"
# [36] "3-0-0-2-0-0" "0-0-0-3-0-0" "1-0-0-3-0-0" "2-0-0-3-0-0" "3-0-0-3-0-0" "0-0-0-0-1-0" "1-0-0-0-1-0"
# [43] "2-0-0-0-1-0" "3-0-0-0-1-0" "0-0-0-0-2-0" "1-0-0-0-2-0" "2-0-0-0-2-0" "3-0-0-0-2-0" "0-0-0-0-3-0"
# [50] "1-0-0-0-3-0" "2-0-0-0-3-0" "3-0-0-0-3-0" "0-0-0-0-0-1" "1-0-0-0-0-1" "2-0-0-0-0-1" "3-0-0-0-0-1"
# [57] "0-0-0-0-0-2" "1-0-0-0-0-2" "2-0-0-0-0-2" "3-0-0-0-0-2" "0-0-0-0-0-3" "1-0-0-0-0-3" "2-0-0-0-0-3"
# [64] "3-0-0-0-0-3"
Here's a general nested for-loop solution. Not the most efficient in the world, but gets the desired result (Note: You can change states and/or sequence_len and the sequences will be generated automatically):
states <- 0:3
states_len <- length(states)
sequence_len <- 6
sequence_mat <- matrix(0, states_len*((states_len-1)*(sequence_len-1)+1), sequence_len)
rw <- 1
for(ii in states){
for(jj in states){
for(kk in 2:sequence_len){
if(jj != 0){
rw = rw + 1
}
sequence_mat[rw, 1] <- ii
sequence_mat[rw, kk] <- jj
if(jj == rev(states)[1] && kk == sequence_len){
rw = rw + 1
}
}
}
}
Output:
> head(sequence_mat, 20)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 0
[2,] 0 1 0 0 0 0
[3,] 0 0 1 0 0 0
[4,] 0 0 0 1 0 0
[5,] 0 0 0 0 1 0
[6,] 0 0 0 0 0 1
[7,] 0 2 0 0 0 0
[8,] 0 0 2 0 0 0
[9,] 0 0 0 2 0 0
[10,] 0 0 0 0 2 0
[11,] 0 0 0 0 0 2
[12,] 0 3 0 0 0 0
[13,] 0 0 3 0 0 0
[14,] 0 0 0 3 0 0
[15,] 0 0 0 0 3 0
[16,] 0 0 0 0 0 3
[17,] 1 0 0 0 0 0
[18,] 1 1 0 0 0 0
[19,] 1 0 1 0 0 0
[20,] 1 0 0 1 0 0
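For comparison, the same 64 sequences can be built without a row counter and without de-duplication. This is only a sketch under the assumption, taken from the examples in the question, that 0 is the "empty" state: first list every admissible tail (positions 2 to 6), then pair each tail with each first state.

```r
states <- 0:3
sequence_len <- 6

# All admissible tails: the all-zero tail plus every non-zero state
# placed at exactly one of the remaining positions
nonzero <- states[states != 0]
tails <- rbind(
  rep(0, sequence_len - 1),
  do.call(rbind, lapply(nonzero, function(v) v * diag(sequence_len - 1)))
)

# Pair every first state with every tail (the tail index varies fastest)
grid <- expand.grid(tail = seq_len(nrow(tails)), first = states)
sequence_mat <- cbind(grid$first, tails[grid$tail, ])

seqs <- apply(sequence_mat, 1, paste, collapse = "-")
length(seqs)
head(seqs)
```

The count follows directly from the construction: 1 + 3 * 5 = 16 tails for each of the 4 first states gives 64 rows, in the same order as the nested loop.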

R: when both events (columns) are true, refer to one of them to decide the value

I want to combine two columns in R so that, when both events are true, one of the columns is used to decide the value. For example:
a<- c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
b<- c(0,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0)
When a and b are both true (here at positions 9 and 10), refer to b to decide the value of another column c in the following rows. Then, once b is FALSE at some row (here row 17), check again whether both a and b are true. The desired output looks like this:
c<- c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0)
data <- cbind(a,b,c)
data
a b c
[1,] 0 0 0
[2,] 0 1 0
[3,] 0 1 0
[4,] 0 0 0
[5,] 0 0 0
[6,] 1 0 0
[7,] 1 0 0
[8,] 1 0 0
[9,] 1 1 1
[10,] 1 1 1
[11,] 0 1 1
[12,] 0 1 1
[13,] 0 1 1
[14,] 0 1 1
[15,] 0 1 1
[16,] 0 1 1
[17,] 0 0 0
[18,] 0 0 0
As the data has many rows, I would prefer a vectorized method such as ifelse() to handle this.
Many thanks to anyone who can help me with this.
I'm not 100% sure I understand the issue but here is a tidyverse solution that reproduces the output:
a<- c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
b<- c(0,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0)
df <- data.frame(a,b)
library(tidyverse)
df %>% mutate(c=case_when(
a == 1 & b == 1 ~ 1,
a == 0 & b == 0 ~ 0,
TRUE ~ NA_real_
)) %>% fill(c)
# a b c
# 1 0 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 0
# 5 0 0 0
# 6 1 0 0
# 7 1 0 0
# 8 1 0 0
# 9 1 1 1
# 10 1 1 1
# 11 0 1 1
# 12 0 1 1
# 13 0 1 1
# 14 0 1 1
# 15 0 1 1
# 16 0 1 1
# 17 0 0 0
# 18 0 0 0
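A base R sketch of the same carry-forward idea follows. It rests on my reading of the question, which may not be what the asker intended: b == 0 always resets c to 0, a and b both 1 switches c on, and every other row inherits the previous decision. Note that this treats rows with a == 1 and b == 0 as a reset, which differs slightly from the case_when version above but gives the same result on this data.

```r
a <- c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
b <- c(0,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0)

# Rows where the value is decided outright; NA means "keep the previous value"
state <- ifelse(b == 0, 0, ifelse(a == 1, 1, NA))

# Index of the most recent decided row at each position (0 if none yet)
last_known <- cummax(ifelse(is.na(state), 0L, seq_along(state)))

# Look the decided value up; the leading NA covers positions with no decision yet
c_new <- c(NA, state)[last_known + 1]
c_new
```

Everything here is vectorized (ifelse, cummax and indexing), so it should scale to long vectors without an explicit loop.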

Count the number of appearances in a list, trick?

I have a matrix of 1s and 0s. The rules concerning this table are as follows.
I would like to count the number of times a series of 1,1 appears (where the 1s are not separated by a 0) and do the same for a series of 1,1,1. I have tried colSums, but it does not seem appropriate.
The matrix final is:
t1 t2 t3 t4 t5 t6 t7
[1,] 0 0 0 0 1 1 0
[2,] 0 0 1 1 0 0 1
[3,] 1 1 0 0 0 0 0
[4,] 0 0 1 1 1 0 0
[5,] 0 0 1 1 0 0 0
[6,] 1 1 0 0 0 0 0
[7,] 0 0 0 0 0 0 1
[8,] 0 0 0 0 1 1 0
[9,] 1 1 0 0 1 1 0
[10,] 0 0 0 0 0 1 1
[11,] 1 1 0 0 0 0 0
[12,] 0 0 1 1 0 0 0
[13,] 0 0 0 0 0 0 0
[14,] 0 0 0 0 0 0 1
[15,] 0 0 0 0 0 0 0
Therefore, for the first row I would like to get one occurrence of the series 1,1 and zero occurrences of the series 1,1,1. For row 4 I would like zero occurrences of 1,1 but one occurrence of 1,1,1.
Can anyone tell me what's wrong with the following code for the series 1,1?
occ <- matrix()
occ_temp <- matrix
for (j in 1:nrow(final)){
for (i in 2:7){
if (sum(final[j,i-1:i])==2){occ_temp[j,i-1]=1}
}
occ[j] <- sum(occ_temp)
}
We can loop over the rows with apply, compute the run-length encoding of each row with rle, extract the lengths of the runs whose value is 1, check which of them equal 'n1' and 'n2', and sum the matches. Here m1 is the matrix called final in the question.
n1 <- 2
n2 <- 3
res <- t(apply(m1, 1, FUN=function(x) {
x1 <- with(rle(x), lengths[!!values])
c(sum(x1==n1), sum(x1==n2))
}))
colnames(res) <- paste0("count", c(11, 111))
res
# count11 count111
# [1,] 1 0
# [2,] 1 0
# [3,] 1 0
# [4,] 0 1
# [5,] 1 0
# [6,] 1 0
# [7,] 0 0
# [8,] 1 0
# [9,] 2 0
#[10,] 1 0
#[11,] 1 0
#[12,] 1 0
#[13,] 0 0
#[14,] 0 0
#[15,] 0 0
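To see what rle contributes, here is a small stand-alone sketch on three rows copied from the question. The helper name count_runs is made up for this example. The last two lines illustrate one likely problem in the question's loop: `:` binds more tightly than `-`, so `i-1:i` is not the index range it appears to be.

```r
# Hypothetical helper: number of runs of 1s in x that have exactly length n
count_runs <- function(x, n) {
  r <- rle(x)                         # run lengths and the value of each run
  sum(r$lengths[r$values == 1] == n)  # keep runs of 1s, count those of length n
}

row1 <- c(0, 0, 0, 0, 1, 1, 0)
row4 <- c(0, 0, 1, 1, 1, 0, 0)
row9 <- c(1, 1, 0, 0, 1, 1, 0)

c(count_runs(row1, 2), count_runs(row1, 3))  # 1 0
c(count_runs(row4, 2), count_runs(row4, 3))  # 0 1
c(count_runs(row9, 2), count_runs(row9, 3))  # 2 0

# Operator precedence: i-1:i means i - (1:i), not (i-1):i
i <- 3
i - 1:i     # 2 1 0
(i - 1):i   # 2 3
```

Because a run of three 1s is counted only as 1,1,1 and not also as 1,1, row 4 yields 0 and 1, matching the expected result stated in the question.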
