Transform categorical attribute vector into similarity matrix

I need to transform a categorical attribute vector into a "same attribute matrix" using R.
For example, I have a vector that reports the gender of N people (male = 1, female = 0). I need to convert this vector into an NxN matrix named A (with people's names on rows and columns), where each cell Aij has the value 1 if persons i and j have the same gender and 0 otherwise.
Here is an example with 3 persons (first male, second female, third male), which produces this vector:
c(1, 0, 1)
I want to transform it into this matrix:
A = matrix( c(1, 0, 1, 0, 1, 0, 1, 0, 1), nrow=3, ncol=3, byrow = TRUE)

As lmo said in a comment, it's impossible to know the structure of your dataset, so what follows is just an example of how it could be done.
First, make up some data.
set.seed(3488) # make the results reproducible
x <- LETTERS[1:5]
y <- sample(0:1, 5, TRUE)
df <- data.frame(x, y)
Now tabulate it according to your needs
A <- outer(df$y, df$y, function(a, b) as.integer(a == b))
dimnames(A) <- list(df$x, df$x)
A
#  A B C D E
#A 1 1 1 0 0
#B 1 1 1 0 0
#C 1 1 1 0 0
#D 0 0 0 1 1
#E 0 0 0 1 1
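Applied to the three-person vector from the question, the same `outer()` call can be sketched as follows. The names `Adam`, `Beth` and `Carl` are made up for illustration, since the question does not give any.

```r
# Gender vector from the question: male = 1, female = 0
gender <- c(1, 0, 1)
# Hypothetical names; the question does not list any
people <- c("Adam", "Beth", "Carl")

# Compare every pair of entries; TRUE/FALSE is converted to 1/0
A <- outer(gender, gender, function(a, b) as.integer(a == b))
dimnames(A) <- list(people, people)
A
```

Because the comparison is just `a == b`, the same call should also work for a character vector such as `c("m", "f", "m")`.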

Related

Creating a table with proportional values

I have got a data set that looks like this:
COMPANY DATABREACH CYBERBACKGROUND
A 1 2
B 0 2
C 0 1
D 0 2
E 1 1
F 1 2
G 0 2
H 0 2
I 0 2
J 0 2
Now I want to create the following: in 40% of the cases where the column DATABREACH has the value 1, I want CYBERBACKGROUND to take the value 2. I figure there must be some function to do this, but I cannot find it.

ind <- which(df$DATABREACH == 1)
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]
df$CYBERBACKGROUND[ind] <- 2
The above is a bit more efficient in that it draws random numbers only for as many rows as strictly required. If you aren't concerned about that (11000 doesn't seem too high), you can reduce it to
df$CYBERBACKGROUND <-
  ifelse(df$DATABREACH == 1 & rbinom(nrow(df), 1, prob = 0.4) > 0,
         2, df$CYBERBACKGROUND)

We may use
library(dplyr)
# sample 40% (rounded up) of the rows where DATABREACH is 1 and set them to 2
df1 <- df1 %>%
  mutate(CYBERBACKGROUND = replace(CYBERBACKGROUND,
    sample(which(DATABREACH == 1), ceiling(sum(DATABREACH == 1) * 0.4)), 2))
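If exactly 40% of the DATABREACH == 1 rows should be changed (rather than each row independently with probability 0.4, as with `rbinom`), one option is to sample a fixed number of row indices in base R. This is a minimal sketch that rebuilds the question's table as `df`; rounding up with `ceiling()` and the seed value are assumptions, since the question does not say how to round.

```r
# Example data copied from the question
df <- data.frame(
  COMPANY = LETTERS[1:10],
  DATABREACH = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  CYBERBACKGROUND = c(2, 2, 1, 2, 1, 2, 2, 2, 2, 2)
)

set.seed(1)  # arbitrary seed, only to make the draw reproducible

ones <- which(df$DATABREACH == 1)   # candidate rows
k <- ceiling(0.4 * length(ones))    # how many of them to change (assumed rounding)

# Index through sample.int() so that a single candidate row is handled correctly
picked <- ones[sample.int(length(ones), k)]
df$CYBERBACKGROUND[picked] <- 2
```

Going through `sample.int()` avoids the case where `sample()` receives a single number `n` and samples from `1:n` instead.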

Cluster groups based on pairwise distances

I have an n x n matrix with pairwise distances as entries. The matrix looks for example like this:
m = matrix(c(0, 0, 1, 1, 1, 1,
             0, 0, 1, 1, 0, 1,
             1, 1, 0, 1, 1, 0,
             1, 1, 1, 0, 1, 1,
             1, 0, 1, 1, 0, 1,
             1, 1, 0, 1, 1, 0), ncol=6, byrow=TRUE)
colnames(m) <- c("A","B","C","D","E","F")
rownames(m) <- c("A","B","C","D","E","F")
Now I want to put every letter in the same cluster if the distance to any other letter is 0. For the example above, I should get three clusters consisting of:
(A,B,E)
(C,F)
(D)
I would be interested in the number of entries in each cluster. At the end, I want to have a vector like:
clustersizes = c(3,2,1)
I assume this is possible with the hclust function, but I'm not able to extract the three clusters. I also tried the cutree function, but how should I use it if I know neither the number of clusters nor the cutoff height in advance?
This is what I tried:
h <- hclust(dist(m),method="single")
plot(h)
Thanks!

Welcome to SO.
There are several ways to handle this but an easy choice is to use the igraph package.
First we convert your matrix m to an adjacency matrix, in which 1 marks a connection between two nodes and 0 means no connection. Since your matrix holds distances where 0 means "same cluster", we subtract it from 1 to get that:
mm <- 1 - m
diag(mm) <- 0 # We don't allow loops
This gives
> mm
  A B C D E F
A 0 1 0 0 0 0
B 1 0 0 0 1 0
C 0 0 0 0 0 1
D 0 0 0 0 0 0
E 0 1 0 0 0 0
F 0 0 1 0 0 0
Then we just need to feed it to igraph to compute communities
library("igraph")
fastgreedy.community(as.undirected(graph.adjacency(mm)))
which produces
IGRAPH clustering fast greedy, groups: 3, mod: 0.44
+ groups:
$`1`
[1] "A" "B" "E"
$`2`
[1] "C" "F"
$`3`
[1] "D"
Now if you save that result you can get the community sizes right away
res <- fastgreedy.community(as.undirected(graph.adjacency(mm)))
sizes(res)
which yields
Community sizes
1 2 3
3 2 1
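Since the goal is to group letters that are linked by a chain of zero distances, a base-R alternative is single-linkage clustering on the distance matrix itself, cut at height 0. This is a sketch of a different technique from the igraph answer above, not a replacement for it. Note that it passes `m` through `as.dist()` rather than `dist(m)`, because `dist(m)` would compute new distances between the rows of `m`.

```r
m <- matrix(c(0, 0, 1, 1, 1, 1,
              0, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0,
              1, 1, 1, 0, 1, 1,
              1, 0, 1, 1, 0, 1,
              1, 1, 0, 1, 1, 0), ncol = 6, byrow = TRUE)
dimnames(m) <- list(LETTERS[1:6], LETTERS[1:6])

# Treat m as the distance matrix (do not recompute distances between rows)
h <- hclust(as.dist(m), method = "single")

# Cutting at height 0 keeps only the merges made at distance 0
groups <- cutree(h, h = 0)

# Cluster sizes, largest first
clustersizes <- as.vector(sort(table(groups), decreasing = TRUE))
clustersizes
```

No number of clusters has to be chosen in advance here; the cutoff `h = 0` follows directly from the rule "same cluster if the distance is 0".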

R: print certain values of a matrix to a csv-file

I have a matrix containing only 1s and 0s. Now I want to create a csv-file with the following syntax, in which only the entries equal to 1 are written:
j1.i1, 1
j1.i2, 1
j2.i2, 1
...
j1 should be the name of the row 1
i1 should be the name of column 1
and so on...
Edit:
M1 = matrix(c(1, 0, 1, 0, 1, 0), nrow=2, ncol=3, byrow = TRUE)
row.names(M1) <- c(100, 101)
colnames(M1) <- c("A", "B", "C")
M1
    A B C
100 1 0 1
101 0 1 0
If we take this simple example, the solution I'm looking for is:
100.A, 1
100.C, 1
101.B, 1
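One way to produce that output, sketched here under a few assumptions: `which(..., arr.ind = TRUE)` finds the positions of the 1s, and `write.table()` writes them with `", "` as separator. The file name `ones.csv`, the temporary directory and the column names `key` and `value` are hypothetical and chosen only for illustration.

```r
M1 <- matrix(c(1, 0, 1, 0, 1, 0), nrow = 2, ncol = 3, byrow = TRUE)
rownames(M1) <- c(100, 101)
colnames(M1) <- c("A", "B", "C")

# Row/column positions of all entries equal to 1
idx <- which(M1 == 1, arr.ind = TRUE)
# which() walks the matrix column by column; reorder by row, then column
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# One line per entry: "<row name>.<column name>" plus the value 1
out <- data.frame(
  key = paste(rownames(M1)[idx[, "row"]], colnames(M1)[idx[, "col"]], sep = "."),
  value = 1,
  stringsAsFactors = FALSE
)

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "ones.csv")
write.table(out, out_file, sep = ", ", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```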

How to use a positional vector to create a binary string?

I have a vector c(2, 5) and I want to turn this into a vector of n elements where positions 2 and 5 are equal to 1 and all remaining positions are equal to 0.
For example, to create a vector of length 6, I would want to use the vector c(2, 5) to generate the following vector:
c(0, 1, 0, 0, 1, 0)

How about
x <- c(2, 5)
n <- 6
replace(integer(n), x, 1L)
# [1] 0 1 0 0 1 0
And another option is
as.integer(1:n %in% x)
# [1] 0 1 0 0 1 0

Imputing labels based on a comparison of columns

I don't think this question has been asked on this board before. I have two columns of 1s and 0s in a dataframe. Let's call these columns X and Y, respectively. In a comparison of X and Y for any row, one of four combinations is obviously possible:
A: 1, 0
B: 0, 1
C: 1, 1
D: 0, 0
Imagine the dataframe has m columns total, but we're interested only in X and Y. I'd like to write a function that compares only X and Y and then characterizes the particular combination with the corresponding labels A, B, C, or D in a new column (let's call it Z).
So say the data looks like:
X Y
1 1
0 1
0 0
1 1
The function will output:
X Y Z
1 1 C
0 1 B
0 0 D
1 1 C
I imagine this would be trivial but I'm an R newbie. Thanks for any guidance!

We create a key/value dataset with one row per unique combination of 'X' and 'Y' (keyDat, defined under "data" below) and then merge it with the input dataset on the 'X' and 'Y' columns
merge(df1, keyDat, by = c("X", "Y"), all.x=TRUE)
# X Y Z
#1 0 0 D
#2 0 1 B
#3 1 1 C
#4 1 1 C
Or to get the output in the same order, use left_join
library(dplyr)
left_join(df1, keyDat)
#Joining by: c("X", "Y")
# X Y Z
#1 1 1 C
#2 0 1 B
#3 0 0 D
#4 1 1 C
data
keyDat <- data.frame(X = c(1, 0, 1, 0), Y = c(0, 1, 1, 0),
                     Z = c("A", "B", "C", "D"), stringsAsFactors=FALSE)
df1 <- data.frame(X= c(1, 0, 0, 1), Y=c(1, 1, 0, 1))
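For only four combinations, a join is not strictly needed. As a minimal sketch of an alternative, the labels can be looked up in a named vector whose names are the pasted "X.Y" keys; the name `lab` is made up for this example.

```r
df1 <- data.frame(X = c(1, 0, 0, 1), Y = c(1, 1, 0, 1))

# Named lookup vector: names are "X.Y" keys, values are the labels
lab <- c("1.0" = "A", "0.1" = "B", "1.1" = "C", "0.0" = "D")

# Build each row's key and look up its label; the row order is untouched
df1$Z <- unname(lab[paste(df1$X, df1$Y, sep = ".")])
df1
```

Any row whose X/Y pair is not among the four keys would get `NA` in Z.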
