How to convert two character columns to a binary matrix in R?

Here is an example:
df <- data.frame(x=c("A", "A", "B", "C", "C", "C"), y=c("m", "n", "o", "p", "q", "r"))
df
x y
1 A m
2 A n
3 B o
4 C p
5 C q
6 C r
What I want is to convert it to a binary matrix, using y as the row names and unique(x) as the column names:
A B C
m 1 0 0
n 1 0 0
o 0 1 0
p 0 0 1
q 0 0 1
r 0 0 1
My first thought was to use tidyr::spread(), but it doesn't seem to work properly.

You can use:
library(tidyverse)
df %>%
  pivot_wider(id_cols = y,
              names_from = x,
              values_from = x,
              values_fn = list(x = length),
              values_fill = list(x = 0))
y A B C
<chr> <int> <int> <int>
1 m 1 0 0
2 n 1 0 0
3 o 0 1 0
4 p 0 0 1
5 q 0 0 1
6 r 0 0 1
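If you need an actual matrix with y as the row names (as in the desired output), the tibble can be converted afterwards. A minimal sketch, assuming the pivoted result above is saved in an object called res (a name chosen here just for illustration):
res %>%
  as.data.frame() %>%            # plain data frame, so row names are allowed
  column_to_rownames("y") %>%    # column_to_rownames() comes from tibble
  as.matrix()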

Does this work?
library(tidyverse)
df <- data.frame(x = c("A", "A", "B", "C", "C", "C"),
                 y = c("m", "n", "o", "p", "q", "r"))
df %>%
  mutate(x = str_split(x, ",")) %>%
  unnest(cols = x) %>%
  mutate(dummy = 1) %>%
  spread(x, dummy, fill = 0)
Output:
  y A B C
1 m 1 0 0
2 n 1 0 0
3 o 0 1 0
4 p 0 0 1
5 q 0 0 1
6 r 0 0 1
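Note that spread() has been superseded by pivot_wider(); the same dummy-column idea in current tidyr would look roughly like this (a sketch, not part of the original answer):
df %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = x, values_from = dummy, values_fill = 0)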

A base R solution.
cbind(df[-1], +sapply(unique(df$x), `==`, df$x))
# y A B C
# 1 m 1 0 0
# 2 n 1 0 0
# 3 o 0 1 0
# 4 p 0 0 1
# 5 q 0 0 1
# 6 r 0 0 1
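To get exactly the matrix asked for (y as row names, no y column), the same comparison trick can be kept as a matrix instead of binding it onto the data frame; a small sketch:
m <- +sapply(unique(df$x), `==`, df$x)  # 0/1 matrix with one column per unique x
rownames(m) <- df$y                     # use y for the row names
m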

We can use table in base R
+(t(table(df)) > 0)
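Broken into steps, with the same df:
tab <- table(df)      # contingency table: x values on the rows, y values on the columns
m   <- +(t(tab) > 0)  # transpose so y is on the rows, then turn the logical matrix into 0/1
m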

Related

adding 1/0 columns from list all at once

I have a dataframe with identifiers and storm categories. Right now, the categories are in one column, but I want to add columns for each category with a 1 or 0 value. I don't think I want to reshape the data as wide, because in the actual dataset there are a number of long format variables I want to keep. I am using a series of ifelse statements currently, but it feels like there is probably a much better way:
library(dplyr)
library(tidyr)
df <- data.frame(
ID = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
cat = c("TS", NA, NA, "TS", "1", "1", NA, NA, "2", NA, NA, NA)
)
df$cat_TS <- ifelse(df$cat == "TS", 1, 0) %>% replace_na(., 0)
df$cat_1 <- ifelse(df$cat == "1", 1, 0) %>% replace_na(., 0)
df$cat_2 <- ifelse(df$cat == "2", 1, 0) %>% replace_na(., 0)
We may use pivot_wider: create a sequence column 'rn' and a copy of 'cat' ('cat1'), then reshape to wide with values_fn = length and values_fill = 0.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  mutate(rn = row_number(), cat1 = cat) %>%
  pivot_wider(names_from = cat1, values_from = cat1,
              values_fn = length, values_fill = 0, names_prefix = "cat_") %>%
  select(-cat_NA, -rn)
-output
# A tibble: 12 × 5
ID cat cat_TS cat_1 cat_2
<chr> <chr> <int> <int> <int>
1 A TS 1 0 0
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 1 0 0
5 A 1 0 1 0
6 B 1 0 1 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 0 1
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
or use fastDummies
library(fastDummies)
df %>%
  dummy_cols("cat", remove_selected_columns = FALSE, ignore_na = TRUE) %>%
  mutate(across(starts_with('cat_'), ~ replace_na(.x, 0)))
-output
ID cat cat_1 cat_2 cat_TS
1 A TS 0 0 1
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 0 0 1
5 A 1 1 0 0
6 B 1 1 0 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 1 0
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
An idea using base R
First, get all unique category names
cats <- unique(df$cat[!is.na(df$cat)])
cats
[1] "TS" "1" "2"
Then look for matches in column cat for each entry in cats. PS: I left the cat column in to show the matching is right; it can be removed by using df$ID (or df["ID"]) instead of df as the first argument in cbind, as sketched after the output below.
cbind(df, setNames(data.frame(sapply(seq_along(cats), function(x)
df$cat %in% cats[x]) * 1), cats))
ID cat TS 1 2
1 A TS 1 0 0
2 B <NA> 0 0 0
3 C <NA> 0 0 0
4 D TS 1 0 0
5 A 1 0 1 0
6 B 1 0 1 0
7 C <NA> 0 0 0
8 D <NA> 0 0 0
9 A 2 0 0 1
10 B <NA> 0 0 0
11 C <NA> 0 0 0
12 D <NA> 0 0 0
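The variant without the helper cat column, binding only the ID column (df["ID"] rather than df$ID, so the column keeps its name):
cbind(df["ID"], setNames(data.frame(sapply(seq_along(cats), function(x)
  df$cat %in% cats[x]) * 1), cats))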

Crosstab of two identical variables in R - reflect in diagonal

I've got a dataset where I'm interested in the frequencies of different pairs emerging, but it doesn't matter which order the elements occur. For example:
library(janitor)
set.seed(24601)
options <- c("a", "b", "c", "d", "e", "f")
data.frame(x = sample(options, 20, replace = TRUE),
           y = sample(options, 20, replace = TRUE)) %>%
  tabyl(x, y)
provides me with the output
x a b c d e f
a 1 0 1 0 1 0
b 0 2 0 1 0 0
c 2 0 1 0 0 0
d 0 0 0 0 1 0
e 1 1 2 0 0 3
f 0 0 1 1 0 1
I'd ideally have the top right or bottom left of this table, where the combination of values a and c would be a total of 3. This is the sum of 1 (in the top right) and 2 (in the middle left). And so on for each other pair of values.
I'm sure there must be a simple way to do this, but I can't figure out what it is...
Edited to add (thanks @akrun for the request): ideally I'd like the following output
x  a  b  c  d  e  f
a  1  0  3  0  2  0
b     2  0  1  1  0
c        1  0  2  1
d           0  1  1
e              0  3
f                 1
We could add the output to its transpose (excluding the first column), then replace the upper-triangle values of 'out' (selected with upper.tri(), which returns a logical matrix) with the corresponding elements of that sum, and set the lower-triangle elements to NA.
out2 <- out[-1] + t(out[-1])
out[-1][upper.tri(out[-1])] <- out2[upper.tri(out2)]
out[-1][lower.tri(out[-1])] <- NA
-output
out
# x a b c d e f
# a 1 0 3 0 2 0
# b NA 2 0 1 1 0
# c NA NA 1 0 2 1
# d NA NA NA 0 1 1
# e NA NA NA NA 0 3
# f NA NA NA NA NA 1
data
set.seed(24601)
options <- c("a", "b", "c", "d", "e", "f")
out <- data.frame(x = sample(options, 20, replace = TRUE),
                  y = sample(options, 20, replace = TRUE)) %>%
  tabyl(x, y)
Here is another option, using igraph:
library(igraph)
out[-1] <- get.adjacency(
  graph_from_data_frame(
    get.data.frame(
      # each count in the matrix becomes that many parallel edges of a directed graph
      graph_from_adjacency_matrix(
        as.matrix(out[-1]), "directed"
      )
    ),
    FALSE  # rebuild as undirected, so x->y and y->x collapse into one pair
  ),
  type = "upper",  # write all pair counts into the upper triangle
  sparse = FALSE
)
which gives
> out
x a b c d e f
a 1 0 3 0 2 0
b 0 2 0 1 1 0
c 0 0 1 0 2 1
d 0 0 0 0 1 1
e 0 0 0 0 0 3
f 0 0 0 0 0 1

Adjacency Matrix from source target dataset

I have a dataset as follows
Var1 Var2 Count
A B 3
A C 4
A D 10
A L 6
I need to create an adjacency matrix for usage downstream in creating a chord diagram. I am looking for an efficient way to get it.
A B C D L
A 0 3 4 10 6
B 3 0 0 0 0
C 4 0 0 0 0
D 10 0 0 0 0
L 6 0 0 0 0
I am looking for a chord-diagram visualization of this matrix (the image from the original post is not reproduced here).
Assuming you're talking about just the symmetric matrix generation:
dat <- read.table(header=TRUE, stringsAsFactors=FALSE, text='
Var1 Var2 Count
A B 3
A C 4
A D 10
A L 6')
vars <- sort(unique(unlist(dat[c("Var1","Var2")])))
m <- matrix(0, nrow=length(vars), ncol=length(vars), dimnames=list(vars,vars))
m[as.matrix(dat[c("Var1","Var2")])] <- m[as.matrix(dat[c("Var2","Var1")])] <- dat$Count
m
# A B C D L
# A 0 3 4 10 6
# B 3 0 0 0 0
# C 4 0 0 0 0
# D 10 0 0 0 0
# L 6 0 0 0 0
Here is an option using xtabs. Convert the first two columns to factors with the levels specified in the order we want in the output. Then use xtabs to get a matrix-style output, transpose it, and add it to the original to get the expected result.
dat[1:2] <- lapply(dat[1:2], factor, levels = c("A", "B", "C", "D", "L"))
out <- xtabs(Count ~ Var1 + Var2, dat)
out + t(out)
# Var2
#Var1 A B C D L
# A 0 3 4 10 6
# B 3 0 0 0 0
# C 4 0 0 0 0
# D 10 0 0 0 0
# L 6 0 0 0 0
data
dat <- structure(list(Var1 = c("A", "A", "A", "A"), Var2 = c("B", "C",
"D", "L"), Count = c(3L, 4L, 10L, 6L)), class = "data.frame",
row.names = c(NA, -4L))
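For the downstream chord diagram, the symmetric matrix produced by either answer can be handed to a plotting function, for example circlize::chordDiagram() (an assumption; the question does not name a package):
library(circlize)
chordDiagram(m)  # m, or equivalently out + t(out), is the symmetric matrix built above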

Calculate cumsum from the end towards the beginning

I'm trying to calculate the cumsum starting from the last row towards the first for each group.
Sample data:
t1 <- data.frame(var = "a", val = c(0,0,0,0,1,0,0,0,0,1,0,0,0,0,0))
t2 <- data.frame(var = "b", val = c(0,0,0,0,1,0,0,1,0,0,0,0,0,0,0))
ts <- rbind(t1, t2)
Desired format (grouped by var):
ts <- data.frame(var = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"),
val = c(2,2,2,2,2,1,1,1,1,1,0,0,0,0,0,2,2,2,2,2,1,1,1,0,0,0,0,0,0,0))
Promoting my comment to an answer; using:
ts$val2 <- ave(ts$val, ts$var, FUN = function(x) rev(cumsum(rev(x))))
gives:
> ts
var val val2
1 a 0 2
2 a 0 2
3 a 0 2
4 a 0 2
5 a 1 2
6 a 0 1
7 a 0 1
8 a 0 1
9 a 0 1
10 a 1 1
11 a 0 0
12 a 0 0
13 a 0 0
14 a 0 0
15 a 0 0
16 b 0 2
17 b 0 2
18 b 0 2
19 b 0 2
20 b 1 2
21 b 0 1
22 b 0 1
23 b 1 1
24 b 0 0
25 b 0 0
26 b 0 0
27 b 0 0
28 b 0 0
29 b 0 0
30 b 0 0
Or with dplyr or data.table:
library(dplyr)
ts %>%
  group_by(var) %>%
  mutate(val2 = rev(cumsum(rev(val))))
library(data.table)
setDT(ts)[, val2 := rev(cumsum(rev(val))), by = var]
An option without explicitly reversing the vector:
ave(ts$val, ts$var, FUN = function(x) Reduce(sum, x, right = TRUE, accumulate = TRUE))
[1] 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 2 2 2 2 2 1 1 1 0 0 0 0 0 0 0
Or the same approach with dplyr:
ts %>%
  group_by(var) %>%
  mutate(val = Reduce(sum, val, right = TRUE, accumulate = TRUE))
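To see what the Reduce() call is doing: right = TRUE folds from the last element towards the first, and accumulate = TRUE keeps every partial result, which is exactly a reverse cumulative sum. A tiny illustration:
Reduce(sum, c(1, 0, 2), right = TRUE, accumulate = TRUE)
# [1] 3 2 2   (i.e. 1+0+2, 0+2, 2)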

Reshape a data frame into a wide shape

The data contains two variables: id and grade. Each id can have multiple records
for each grade.
dat <- data.frame(id = c(1,1,1,2,2,2,2,3,3,4,5,5,5),
grade = c("a", "b", "c", "a", "a", "b", "b", "d", "f", "c", "a", "e", "f"))
I want to reshape the data into a wide shape such that each id has only one record
and each unique grade becomes a single column. The value of each column is either 0 or 1,
depending on the grades for each id.
The final data set looks like:
id a b c d e f
1 1 1 1 0 0 0
2 1 1 0 0 0 0
3 0 0 0 1 0 1
4 0 0 1 0 0 0
5 1 0 0 0 1 1
I tried this, but no luck.
n.dat <- reshape(dat, timevar = "grade",idvar = c("id"),direction = "wide")
You could simply table the values, convert to logical with the > 0 condition, and then convert back to numeric with the unary + operator (or, if you want it less golfed, simply by adding 0):
+(table(dat) > 0)
# grade
# id a b c d e f
# 1 1 1 1 0 0 0
# 2 1 1 0 0 0 0
# 3 0 0 0 1 0 1
# 4 0 0 1 0 0 0
# 5 1 0 0 0 1 1
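If the result is needed as a regular data frame with id as a column (as in the desired output) rather than a matrix with row names, one possible follow-up in base R (the name res is just for illustration):
res <- as.data.frame.matrix(+(table(dat) > 0))    # keep the wide layout
res <- cbind(id = as.numeric(rownames(res)), res) # turn the row names back into an id column
res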
