How to create dichotomous variables based on some factors in r? - r

The initial dataframe is:
Factor1 Factor2 Factor3
A B C
B C NA
A NA NA
B C D
E NA NA
I want to create 5 dichotomous variables based on the above factor variables. The rule is: the new variable A gets 1 if any of Factor1, Factor2 or Factor3 contains an A, and 0 otherwise; the same applies to B, C, D and E. The newly created variables should look like:
A B C D E
1 1 1 0 0
0 1 1 0 0
1 0 0 0 0
0 1 1 1 0
0 0 0 0 1

We can use table to do this. We repeat the row indices once per column so that they line up with the unlisted values (unlist goes column by column), and then cross-tabulate row index against value.
table(rep(1:nrow(df1), ncol(df1)), unlist(df1))
# A B C D E
# 1 1 1 1 0 0
# 2 0 1 1 0 0
# 3 1 0 0 0 0
# 4 0 1 1 1 0
# 5 0 0 0 0 1
If a value can appear more than once in the same row, the counts can exceed 1. In that case, convert to logical and then back to binary.
+(!!table(rep(1:nrow(df1), ncol(df1)), unlist(df1)))
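As a cross-check on the table approach, here is a minimal sketch of a more explicit alternative that tests each value against every column. It is not part of the original answer; the names vals and dummies are illustrative, and the data frame is rebuilt inside the block only so that it runs on its own.

```r
# Same data as df1 in the answer, rebuilt so the block is self-contained
df1 <- data.frame(
  Factor1 = c("A", "B", "A", "B", "E"),
  Factor2 = c("B", "C", NA, "C", NA),
  Factor3 = c("C", NA, NA, "D", NA),
  stringsAsFactors = FALSE
)

# 'vals' (illustrative name): the distinct non-missing values, sorted
vals <- sort(unique(unlist(df1)))

# For each value, flag the rows in which any column equals it;
# na.rm = TRUE makes comparisons against NA count as "no match"
dummies <- sapply(vals, function(v) +(rowSums(df1 == v, na.rm = TRUE) > 0))
dummies
```

Because the result is thresholded with > 0, a value that occurs twice in one row still yields 1, so no separate logical conversion is needed.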
data
df1 <- structure(list(Factor1 = c("A", "B", "A", "B", "E"),
                      Factor2 = c("B", "C", NA, "C", NA),
                      Factor3 = c("C", NA, NA, "D", NA)),
                 .Names = c("Factor1", "Factor2", "Factor3"),
                 class = "data.frame", row.names = c(NA, -5L))

Related

Crosstab of two identical variables in R - reflect in diagonal

I've got a dataset where I'm interested in the frequencies of different pairs emerging, but it doesn't matter which order the elements occur. For example:
library(janitor)
set.seed(24601)
options <- c("a", "b", "c", "d", "e", "f")
data.frame(x = sample(options, 20, replace = TRUE),
y = sample(options, 20, replace = TRUE)) %>%
tabyl(x, y)
provides me with the output
x a b c d e f
a 1 0 1 0 1 0
b 0 2 0 1 0 0
c 2 0 1 0 0 0
d 0 0 0 0 1 0
e 1 1 2 0 0 3
f 0 0 1 1 0 1
I'd ideally have the top right or bottom left of this table, where the combination of values a and c would be a total of 3. This is the sum of 1 (in the top right) and 2 (in the middle left). And so on for each other pair of values.
I'm sure there must be a simple way to do this, but I can't figure out what it is...
Edited to add (thanks @Akrun for the request): ideally I'd like the following output
x a b c d e f
a 1 0 3 0 2 0
b   2 0 1 1 0
c     1 0 2 1
d       0 1 1
e         0 3
f           1
We could add the table to its transpose (leaving out the first column, which holds the labels), copy the upper-triangle elements of that sum into the upper triangle of 'out' (upper.tri returns a logical matrix that is used for subsetting), and set the lower-triangle elements to NA.
out2 <- out[-1] + t(out[-1])
out[-1][upper.tri(out[-1])] <- out2[upper.tri(out2)]
out[-1][lower.tri(out[-1])] <- NA
-output
out
# x a b c d e f
# a 1 0 3 0 2 0
# b NA 2 0 1 1 0
# c NA NA 1 0 2 1
# d NA NA NA 0 1 1
# e NA NA NA NA 0 3
# f NA NA NA NA NA 1
data
library(janitor) # provides tabyl() and re-exports %>%
set.seed(24601)
options <- c("a", "b", "c", "d", "e", "f")
out <- data.frame(x = sample(options, 20, replace = TRUE),
y = sample(options, 20, replace = TRUE)) %>%
tabyl(x, y)
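The same folding idea can be sketched on a plain base-R table, without janitor. This is an illustration on a small made-up sample (the vectors x and y and the names lv, m and sym are assumptions, not the question's data). It restores the diagonal explicitly, because adding a matrix to its transpose doubles the diagonal.

```r
# Small hand-made sample (hypothetical, chosen so the counts are easy to verify)
x  <- c("a", "c", "c", "b", "a")
y  <- c("c", "a", "a", "b", "a")
lv <- c("a", "b", "c")

# Square cross-tabulation with identical row and column levels
m <- unclass(table(factor(x, lv), factor(y, lv)))

# Fold the lower triangle onto the upper one
sym <- m + t(m)            # off-diagonal cells now hold (i,j) + (j,i)
diag(sym) <- diag(m)       # the diagonal was doubled by the sum, so restore it
sym[lower.tri(sym)] <- NA  # blank out the now-redundant lower triangle
sym
```

Here the pair (a, c) occurs once and (c, a) twice, so the folded cell for a and c holds 3.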
Here is another option, using igraph
library(igraph)
out[-1] <- get.adjacency(
graph_from_data_frame(
get.data.frame(
graph_from_adjacency_matrix(
as.matrix(out[-1]), "directed"
)
), FALSE
),
type = "upper",
sparse = FALSE
)
which gives
> out
x a b c d e f
a 1 0 3 0 2 0
b 0 2 0 1 1 0
c 0 0 1 0 2 1
d 0 0 0 0 1 1
e 0 0 0 0 0 3
f 0 0 0 0 0 1

Adjacency Matrix from source target dataset

I have a dataset as follows
Var1 Var2 Count
A B 3
A C 4
A D 10
A L 6
I need to create an adjacency matrix for usage downstream in creating a chord diagram. I am looking for an efficient way to get it.
A B C D L
A 0 3 4 10 6
B 3 0 0 0 0
C 4 0 0 0 0
D 10 0 0 0 0
L 6 0 0 0 0
I am looking for a visualization as follows
[image: chord diagram]
Assuming you're talking about just the symmetric matrix generation:
dat <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Var1 Var2 Count
A B 3
A C 4
A D 10
A L 6')
vars <- sort(unique(unlist(dat[c("Var1","Var2")])))
m <- matrix(0, nrow=length(vars), ncol=length(vars), dimnames=list(vars,vars))
m[as.matrix(dat[c("Var1","Var2")])] <- m[as.matrix(dat[c("Var2","Var1")])] <- dat$Count
m
# A B C D L
# A 0 3 4 10 6
# B 3 0 0 0 0
# C 4 0 0 0 0
# D 10 0 0 0 0
# L 6 0 0 0 0
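The assignment above relies on indexing a matrix with a two-column character matrix, where each row of the index names one (row, column) cell. A minimal sketch of that mechanism on a toy 3 x 3 matrix (the names m and idx and the values are illustrative only):

```r
# Toy matrix with matching row and column names
m <- matrix(0, nrow = 3, ncol = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Each row of 'idx' addresses one cell by (row name, column name)
idx <- cbind(c("A", "A"), c("B", "C"))

m[idx] <- c(3, 4)           # fills m["A","B"] and m["A","C"]
m[idx[, 2:1]] <- c(3, 4)    # swapped columns fill the mirrored cells
m
```

Swapping the two index columns is what makes the result symmetric, which is the role of the second assignment in the one-liner.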
Here is an option using xtabs. Convert the first two columns to factors with the levels specified in the order we want in the output. Then use xtabs to get a matrix-like table, and add its transpose to it to get the expected symmetric output.
dat[1:2] <- lapply(dat[1:2], factor, levels = c("A", "B", "C", "D", "L"))
out <- xtabs(Count ~ Var1 + Var2, dat)
out + t(out)
# Var2
#Var1 A B C D L
# A 0 3 4 10 6
# B 3 0 0 0 0
# C 4 0 0 0 0
# D 10 0 0 0 0
# L 6 0 0 0 0
data
dat <- structure(list(Var1 = c("A", "A", "A", "A"), Var2 = c("B", "C",
"D", "L"), Count = c(3L, 4L, 10L, 6L)), class = "data.frame",
row.names = c(NA, -4L))

Loop through a dataframe: counting each pairwise combination of a value for each unique variable.

I have a dataframe called "df" like this:
ID Value
1 a
1 b
1 c
1 d
3 a
3 b
3 e
3 f
. .
. .
. .
I have a matrix filled with zeros like this:
a b c d e f
a x 0 0 0 0 0
b 0 x 0 0 0 0
c 0 0 x 0 0 0
d 0 0 0 x 0 0
e 0 0 0 0 x 0
f 0 0 0 0 0 x
I would then like to loop through the dataframe something like this:
for each ID, for each value i, for each value j != i, matrix[i,j] += 1
So for each ID, for each combination of values, I would like to raise the value in the matrix by 1, resulting in:
a b c d e f
a x 2 1 1 1 1
b 2 x 1 1 1 1
c 1 1 x 1 0 0
d 1 1 1 x 0 0
e 1 1 0 0 x 1
f 1 1 0 0 1 x
So for example, [a,b] = 2, because this combination of values occurs for two different IDs, while [a,c] = 1, because this combination of values only occurs when ID = 1 and not when ID = 3.
How can I achieve this? I already made a vector containing the unique IDs.
Thanks in advance.
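The easiest approach is to tabulate the data with table and then take the crossprod. Assuming each value appears at most once per ID, this counts, for each pair of values, the number of IDs that contain both.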
out <- crossprod(table(df))
diag(out) <- NA #replace the diagonals with NA
names(dimnames(out)) <- NULL #set the names of the dimnames as NULL
out
# a b c d e f
#a NA 2 1 1 1 1
#b 2 NA 1 1 1 1
#c 1 1 NA 1 0 0
#d 1 1 1 NA 0 0
#e 1 1 0 0 NA 1
#f 1 1 0 0 1 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L), Value = c("a",
"b", "c", "d", "a", "b", "e", "f")), .Names = c("ID", "Value"
), class = "data.frame", row.names = c(NA, -8L))
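To make the crossprod step less opaque, here is a sketch that spells it out as a matrix product of the ID-by-Value incidence table with itself. The names inc and co are illustrative, and the thresholding line is an added safeguard (not in the original answer) for data where a value might repeat within an ID.

```r
# Same data as df in the answer, rebuilt so the block is self-contained
df <- data.frame(ID = c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L),
                 Value = c("a", "b", "c", "d", "a", "b", "e", "f"))

inc <- table(df)       # rows: IDs, columns: values, cells: occurrence counts
inc <- +(inc > 0)      # safeguard: cap counts at 1 in case of duplicates

co <- crossprod(inc)   # same as t(inc) %*% inc
co
```

Cell [i, j] of co sums, over IDs, the product of the two indicator columns, which is 1 exactly when both values occur for that ID. The diagonal holds the number of IDs containing each value, which the answer replaces with NA.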

Reshape a data frame into a wide shape

The data contains two variables: id and grade. Each id can have multiple records
for each grade.
dat <- data.frame(id = c(1,1,1,2,2,2,2,3,3,4,5,5,5),
grade = c("a", "b", "c", "a", "a", "b", "b", "d", "f", "c", "a", "e", "f"))
I want to reshape the data into a wide shape such that each id has only one record
and each unique grade becomes a single column. The value of each column is either 0 or 1,
depending on the grades for each id.
The final data set looks like:
id a b c d e f
1 1 1 1 0 0 0
2 1 1 0 0 0 0
3 0 0 0 1 0 1
4 0 0 1 0 0 0
5 1 0 0 0 1 1
I tried this, but no luck.
n.dat <- reshape(dat, timevar = "grade",idvar = c("id"),direction = "wide")
You could simply tabulate the values with table, convert to logical with a > 0 condition, and then convert back to numeric using the unary + operator (or, less tersely, by adding 0).
+(table(dat) > 0)
# grade
# id a b c d e f
# 1 1 1 1 0 0 0
# 2 1 1 0 0 0 0
# 3 0 0 0 1 0 1
# 4 0 0 1 0 0 0
# 5 1 0 0 0 1 1
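The result above is a table, not a data frame. If a data frame with an explicit id column is wanted, as in the question's expected output, one possible follow-up is sketched below; the name wide is illustrative and this conversion step is an addition, not part of the original answer.

```r
dat <- data.frame(id = c(1,1,1,2,2,2,2,3,3,4,5,5,5),
                  grade = c("a", "b", "c", "a", "a", "b", "b",
                            "d", "f", "c", "a", "e", "f"))

# Presence/absence table, then one data-frame column per grade
wide <- as.data.frame.matrix(+(table(dat) > 0))

# Move the ids from the row names into a proper column
wide <- cbind(id = as.numeric(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```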

Assigning values to an empty adjacency matrix based on matching column values

I have an n x n dataset, say a 5 x 5 one.
ALPHA BETA GAMMA DELTA EPSILON
A B A X 1
B C 3 X 3
C D E Z 4
D A D X 5
E A 2 Z 2
I use column “ALPHA” to create an empty adjacency matrix (Aij),
A B C D E
A 0 0 0 0 0
B 0 0 0 0 0
C 0 0 0 0 0
D 0 0 0 0 0
E 0 0 0 0 0
I want to set the adjacency matrix values to 1 or 0 based on column “DELTA”: Aij = 1 if rows i and j have the same “DELTA” value (and i ≠ j), and 0 otherwise. That is, we will have a new adjacency matrix that looks like the following,
A B C D E
A 0 1 0 1 0
B 1 0 0 1 0
C 0 0 0 0 1
D 1 1 0 0 0
E 0 0 1 0 0
What loop command or matching technique can I use to assign the new values?
Thanks.
Phil
A loop could work. Your example has 0 on the diagonal (i = j), so I subtract an identity matrix at the end.
DELTA<-c("X","X","Z","X","Z")
Adj<-mat.or.vec(nr=length(DELTA), nc=length(DELTA))
for (i in 1:length(DELTA)){
Adj[i,DELTA==DELTA[i]]<-1
}
Adj<-Adj-diag(length(DELTA))
You could use outer
res <- +(outer(df1$DELTA, df1$DELTA, FUN='=='))*!diag(dim(df1)[1])
dimnames(res) <- rep(list(df1$ALPHA),2)
res
# A B C D E
#A 0 1 0 1 0
#B 1 0 0 1 0
#C 0 0 0 0 1
#D 1 1 0 0 0
#E 0 0 1 0 0
Or
sapply(df1$DELTA, `==`, df1$DELTA) - diag(dim(df1)[1])
data
df1 <- structure(list(ALPHA = c("A", "B", "C", "D", "E"), BETA = c("B",
"C", "D", "A", "A"), GAMMA = c("A", "3", "E", "D", "2"), DELTA = c("X",
"X", "Z", "X", "Z"), EPSILON = c(1L, 3L, 4L, 5L, 2L)), .Names = c("ALPHA",
"BETA", "GAMMA", "DELTA", "EPSILON"), class = "data.frame",
row.names = c(NA, -5L))
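For readers who find the one-liners dense, the outer approach can be unpacked step by step on plain vectors. This is only a sketch: the vectors DELTA and ALPHA are copied from the question's data, and the name adj is illustrative.

```r
DELTA <- c("X", "X", "Z", "X", "Z")
ALPHA <- c("A", "B", "C", "D", "E")

# Logical n x n matrix: TRUE where rows i and j share a DELTA value
same <- outer(DELTA, DELTA, "==")

adj <- same * 1L                     # TRUE/FALSE -> 1/0
diag(adj) <- 0                       # a node is not adjacent to itself
dimnames(adj) <- list(ALPHA, ALPHA)  # label rows and columns by ALPHA
adj
```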
