One hot encode list of vectors - r

Is there a quick way to one-hot encode lists of vectors (with different lengths) in R, preferably using tidyverse?
For example:
vals <- list(a=c(1), b=c(2,3), c=c(1,2))
The wanted result is a wide dataframe:
1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
Thanks!

We can enframe the list, unnest it into separate rows, add a dummy column, and convert the data to wide format with pivot_wider.
library(tidyverse)
enframe(vals) %>%
unnest(value) %>%
mutate(temp = 1) %>%
pivot_wider(names_from = value, values_from = temp, values_fill = list(temp = 0))
# name `1` `2` `3`
# <chr> <dbl> <dbl> <dbl>
#1 a 1 0 0
#2 b 0 1 1
#3 c 1 1 0

One base R option could be:
t(table(stack(vals)))
values
ind 1 2 3
a 1 0 0
b 0 1 1
c 1 1 0

A base R approach,
do.call(rbind, lapply(vals, function(i) as.integer(!is.na(match(unique(unlist(vals)), i)))))
# [,1] [,2] [,3]
#a 1 0 0
#b 0 1 1
#c 1 1 0
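The matrix above has no column names, and its column order follows the order in which values first appear in the list. A minimal sketch of a variant that labels the columns, assuming sorted distinct values are the wanted column order (the name levs is only illustrative):

```r
vals <- list(a = c(1), b = c(2, 3), c = c(1, 2))

# Sorted distinct values define the columns (an assumption; the answer above
# uses order of first appearance instead)
levs <- sort(unique(unlist(vals)))

# One row per list element: 1 if the value occurs in that element, else 0
onehot <- do.call(rbind, lapply(vals, function(v) as.integer(levs %in% v)))
colnames(onehot) <- levs
onehot
```

Wrapping the result in as.data.frame() gives a data frame if a matrix is not what you need.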

Related

How to run Excel-like formulas using dplyr?

In the reproducible R code below, I'd like to add a column "adjust" that results from a series of calculations that, in Excel, would use cumulative COUNTIFS, MAX, and MATCH formulas, as shown in the illustration below. (To be complete, the adjust column should have used the MATCH formula, since there could be more than one element in the list starting in row 15, but I think it's clear what I'm doing without it.) The yellow shading shows what the reproducible code generates, and the blue shading shows the series of Excel calculations that derive the desired values in the "adjust" column. Any suggestions for doing this, in dplyr if possible?
I am a long-time Excel user trying to migrate all of my work to R.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","B","B"),
Group = c(0,1,1,1,2,2,3,3)
)
myDataGroups <- myData %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(ElementCnt = row_number()) %>%
ungroup() %>%
mutate(Group = factor(Group, unique(Group))) %>%
arrange(Group) %>%
mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
as.data.frame()
myDataGroups
We can use rowid() from data.table to number the rows within each 'Element' and 'Group' combination and overwrite 'Group' with that sequence. A logical test on 'Group' then gives the binary 'excessOver2' column, and 'adjust' is the lag of its cumulative sum.
library(dplyr)
library(data.table)
myDataGroups %>%
mutate(Group = rowid(Element, Group),
excessOver2 = +(Group > 2), adjust = lag(cumsum(excessOver2),
default = 0))
-output
Element Group origOrder ElementCnt groupCt excessOver2 adjust
1 A 1 1 1 -1 0 0
2 B 1 2 1 0 0 0
3 B 2 3 2 0 0 0
4 B 3 4 3 0 1 0
5 B 1 5 4 1 0 1
6 B 2 6 5 1 0 1
7 B 1 7 6 2 0 1
8 B 2 8 7 2 0 1
Another option, using only dplyr:
library(dplyr)
myData %>%
group_by(Element, Group) %>%
summarize(ElementCnt = row_number(), over2 = 1 * (ElementCnt > 2),
.groups = "drop_last") %>%
mutate(adjust = cumsum(lag(over2, default = 0))) %>%
ungroup()
Result
# A tibble: 8 × 5
Element Group ElementCnt over2 adjust
<chr> <dbl> <int> <dbl> <dbl>
1 A 0 1 0 0
2 B 1 1 0 0
3 B 1 2 0 0
4 B 1 3 1 0
5 B 2 1 0 1
6 B 2 2 0 1
7 B 3 1 0 1
8 B 3 2 0 1
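In dplyr 1.1.0 and later, a summarize() call that returns more than one row per group raises a deprecation warning. A minimal sketch of the same idea written with mutate() instead, assuming the running count should restart for every Element and Group pair and the adjustment should accumulate within each Element:

```r
library(dplyr)

myData <- data.frame(
  Element = c("A", "B", "B", "B", "B", "B", "B", "B"),
  Group   = c(0, 1, 1, 1, 2, 2, 3, 3)
)

res <- myData %>%
  group_by(Element, Group) %>%
  # Running count within each Element/Group pair, like a cumulative COUNTIFS
  mutate(ElementCnt = row_number(),
         over2 = as.integer(ElementCnt > 2)) %>%
  group_by(Element) %>%
  # Carry each excess forward to the rows that follow it
  mutate(adjust = cumsum(lag(over2, default = 0L))) %>%
  ungroup()

res$adjust
# 0 0 0 0 1 1 1 1
```

Because mutate() keeps every row in place, the original row order is preserved without a separate origOrder column.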

Loop over left joins

I've been trying to loop over left joins (using R). I need to create a table with columns representing samples from a larger table. Each column of the new table should represent each of these samples.
library(tidyr)
largetable <- data.frame(PlotCode=c(rep("Plot1",20),rep("Plot2",20)),
Category=c(rep("A",8),rep("B",8),rep("C",4),rep("A",12),rep("B",4),rep("C",4)))
a <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
Category=c("A","B","A","B"))
##example of code to loop over 100 left joins derived from samples of two elements from a large table. It fails to create the columns.
for (i in 1:100){
count <- largetable %>% group_by(PlotCode) %>% sample_n(2, replace = TRUE)%>%
count(PlotCode,Category)
colnames(count)[3] <- paste0("n",i)
b <- left_join(a, count, by = c("PlotCode","Category"))
}
##example of desired output table. Columns n1 to n100 should change depending of samples.
b <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
Category=c("A","B","A","B"),
n1=c(2,1,0,1),
n2=c(1,1,1,1),
n3=c(2,0,1,2))
How can I loop over left joins so each column corresponds to a different sample?
Instead of a for loop, we can use rerun() or replicate() to repeat a process n times.
In each iteration we randomly select 2 rows from each PlotCode and count their Category values, which gives a list of n data frames. These can be joined together with reduce(); we then rename the count columns as you like and replace NA with 0.
library(dplyr)
library(purrr)
n <- 10
rerun(n, largetable %>%
group_by(PlotCode) %>%
slice_sample(n = 2, replace = TRUE) %>%
count(PlotCode,Category)) %>%
reduce(full_join, by = c('PlotCode', 'Category')) %>%
rename_with(~paste0('n', seq_along(.)), starts_with('n')) %>%
mutate(across(starts_with('n'), tidyr::replace_na, 0))
# PlotCode Category n1 n2 n3 n4 n5 n6 n7 n8 n9 n10
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Plot1 A 1 0 2 2 0 1 0 1 2 2
#2 Plot1 B 1 0 0 0 1 1 2 1 0 0
#3 Plot2 B 1 0 0 0 1 0 0 0 0 0
#4 Plot2 C 1 2 0 0 0 0 1 1 0 0
#5 Plot1 C 0 2 0 0 1 0 0 0 0 0
#6 Plot2 A 0 0 2 2 1 2 1 1 2 2
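If you would rather keep the for loop and the left join onto a, the loop in the question mainly needs to accumulate into b instead of overwriting it on every pass. A minimal sketch under that assumption (three iterations and a fixed seed are used here only for illustration, and smp is a made-up name):

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

largetable <- data.frame(
  PlotCode = c(rep("Plot1", 20), rep("Plot2", 20)),
  Category = c(rep("A", 8), rep("B", 8), rep("C", 4),
               rep("A", 12), rep("B", 4), rep("C", 4))
)
a <- data.frame(PlotCode = c("Plot1", "Plot1", "Plot2", "Plot2"),
                Category = c("A", "B", "A", "B"))

b <- a  # start from the template and add one column per sample
for (i in 1:3) {
  smp <- largetable %>%
    group_by(PlotCode) %>%
    slice_sample(n = 2, replace = TRUE) %>%
    ungroup() %>%
    count(PlotCode, Category, name = paste0("n", i))  # column named n1, n2, ...
  b <- left_join(b, smp, by = c("PlotCode", "Category"))
}
b[is.na(b)] <- 0  # combinations that were not sampled get a count of 0
b
```

Unlike the full_join version above, this keeps only the PlotCode and Category combinations listed in a, so sampled rows with Category C are dropped.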

Separate a row of data on different columns with the count of each item

I have a dataset with two columns where I want to separate the second one (delimited by |) into many columns where each column has the name of the item and the observation has the count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have columns with the name of each item and its count. For example, the result would be as follows:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 1 0 2 0 0
How do I get to separate this data such that the values are distributed in columns as such?
A tidyverse approach, assuming data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
separate_rows(column, sep = "\\|") %>%
count(id, column) %>%
spread(column, n) %>%
replace(., is.na(.), 0) # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0
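For comparison, a minimal base R sketch of the same reshaping, assuming mydata looks like the sample in the question (the data frame is rebuilt here so the example runs on its own; parts, long and counts are illustrative names):

```r
mydata <- data.frame(id = 1:3,
                     column = c("a|b|a", "a|b|c|d|e", "a|c|c"))

# Split each string on the literal "|" (fixed = TRUE avoids regex escaping)
parts <- strsplit(mydata$column, "|", fixed = TRUE)

# Long form: one row per (id, item) occurrence
long <- data.frame(id = rep(mydata$id, lengths(parts)),
                   item = unlist(parts))

# Cross-tabulate ids against items; absent combinations become 0
counts <- as.data.frame.matrix(table(long$id, long$item))
counts
```

The ids end up as row names here rather than as a column.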

I am trying to identify patterns of missing values in rows of a dataset

I am trying to find patterns in missing values in rows.
For example if I have this data set:
a b c d
1 0.1 NA NA
2 NA 3 4
5 NA 6 NA
I expect the output to be:
n a b c d m
1 0 0 1 1 2
1 0 1 0 0 1
1 0 1 0 1 2
where column n shows the number of rows that share a pattern, column m shows the number of missing values in that pattern, and 1s in the remaining columns mark the missing values. That is, the first row of the output reads: 1 row is missing 2 values, in variables c and d. The second row reads: 1 row is missing 1 value, in variable b, and so on.
I have tried using the subtable() function in the extracat package (archived version), but I can't find the locations of the missing values in each variable. I can only find frequencies.
rowmiss<-rowSums(is.na(dat1[1:ncol(dat1)]))
r1<-matrix(rowmiss, nrow=nrow(dat1))
subtable(rowmiss,1)
I expect the output to be as shown above. What I am finding so far is the frequency of missing values in rows but I expect patterns and positions of missing values.
Here's a tidyverse approach. The n column seems redundant, should it be doing something else?
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
mutate(val = is.na(val) * 1) %>%
group_by(rowid) %>% mutate(m = sum(val)) %>% ungroup() %>%
spread(col, val) %>%
mutate(n = 1) %>%
select(n, a:d, m)
# A tibble: 3 x 6
n a b c d m
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 1 1 2
2 1 0 1 0 0 1
3 1 0 1 0 1 2
An alternative way of doing this with tidyverse:
library(tidyverse)
df %>%
mutate_all(~ is.na(.) %>% as.numeric()) %>%
mutate(m = rowSums(.)) %>%
group_by_all() %>%
count()
Output (you may also want to ungroup() if doing anything further with the df):
# A tibble: 3 x 6
# Groups: a, b, c, d, m [3]
a b c d m n
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 0 1 1 2 1
2 0 1 0 0 1 1
3 0 1 0 1 2 1
mice::md.pattern() also does basically what you want, but returns a matrix with some of the useful info in the rownames, so it would require a bit of processing to turn into a dataframe.
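A base R sketch of the same pattern count, assuming the data frame is called df and matches the example in the question (it is rebuilt here so the code runs on its own; miss and patterns are illustrative names):

```r
df <- data.frame(a = c(1, 2, 5),
                 b = c(0.1, NA, NA),
                 c = c(NA, 3, 6),
                 d = c(NA, 4, NA))

# 1 where a value is missing, 0 otherwise
miss <- as.data.frame(+is.na(df))

# m = number of missing values in each row
miss$m <- rowSums(miss)

# n = number of rows sharing each distinct pattern
patterns <- aggregate(n ~ ., data = cbind(miss, n = 1L), FUN = sum)
patterns
```

With the sample data every pattern occurs once, so n is 1 throughout; rows with identical patterns would be collapsed and counted.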

How do I sum recurring values according to a level in a column and output a table of counts?

I'm new to R and I have data that looks something like this:
categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark","shark","tiger","tiger","whale","whale","worm")
dat <- cbind(categories,animals)
Some animals repeat according to the category. For example, "cat" appears in all three categories A, B, and C.
I would like my new dataframe output to look something like this:
A B C count
1 1 1 1
1 1 0 2
1 0 1 0
0 1 1 2
1 0 0 2
0 1 0 0
0 0 1 2
0 0 0 0
The number 1 under A, B, and C means that the animal appears in that category, 0 means the animal does not appear in that category. For example, the first line has 1s in all three categories. The count is 1 for the first line because "cat" is the only animal that repeats itself in each category.
Is there a function in R that will help me achieve this? Thank you in advance.
We can use table to create a cross-tabulation of categories and animals, transpose, convert to data.frame, group_by all categories and count the frequency per combination:
library(dplyr)
library(tidyr)
as.data.frame.matrix(t(table(dat))) %>%
group_by_all() %>%
summarize(Count = n())
Result:
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<int> <int> <int> <int>
1 0 0 1 2
2 0 1 1 2
3 1 0 0 2
4 1 1 0 2
5 1 1 1 1
Edit (thanks to @C. Braun). Here is how to also include the zero A, B, C combinations:
as.data.frame.matrix(t(table(dat))) %>%
bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1))) %>%
group_by_all() %>%
summarize(Count = n()-1)
or with complete, as suggested by @Ryan:
as.data.frame.matrix(t(table(dat))) %>%
mutate(non_missing = 1) %>%
complete(A, B, C) %>%
group_by(A, B, C) %>%
summarize(Count = sum(ifelse(is.na(non_missing), 0, 1)))
Result:
# A tibble: 8 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 1 2
3 0 1 0 0
4 0 1 1 2
5 1 0 0 2
6 1 0 1 0
7 1 1 0 2
8 1 1 1 1
We have
xxtabs <- function(df, formula) {
xt <- xtabs(formula, df)
xxt <- xtabs( ~ . , as.data.frame.matrix(xt))
as.data.frame(xxt)
}
and
> xxtabs(dat, ~ animals + categories)
A B C Freq
1 0 0 0 0
2 1 0 0 2
3 0 1 0 0
4 1 1 0 2
5 0 0 1 2
6 1 0 1 0
7 0 1 1 2
8 1 1 1 1
(dat should really be constructed as data.frame(animals, categories)). This base approach uses xtabs() to form the first cross-tabulation
xt <- xtabs(~ animals + categories, dat)
then coerces using as.data.frame.matrix() to a second data.frame, and uses a second cross-tabulation of all columns of the computed data.frame
xxt <- xtabs(~ ., as.data.frame.matrix(xt))
coerced to the desired form
as.data.frame(xxt)
I originally said this approach was 'arcane', because it relies on knowledge of the difference between as.data.frame() and as.data.frame.matrix(); I think of xtabs() as a tool that users of base R should know. I see though that the other solutions also require this arcane knowledge, as well as knowledge of more obscure (e.g., complete(), group_by_all(), funs()) parts of the tidyverse. Also, the other answers are not easily generalizable (or at least are not written in a way that allows it); xxtabs() does not actually know anything about the structure of the incoming data.frame, whereas implicit knowledge of the incoming data is present throughout the other answers.
One 'lesson learned' from the tidy approach is to place the data argument first, allowing piping
dat %>% xxtabs(~ animals + categories)
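Put together as one runnable sketch, assuming dat is rebuilt as a data.frame as suggested above (the question builds it with cbind(), which gives a character matrix instead):

```r
categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark",
             "shark","tiger","tiger","whale","whale","worm")
dat <- data.frame(animals, categories)

xxtabs <- function(df, formula) {
  # First cross-tabulation: one row per animal, one column per category
  xt <- xtabs(formula, df)
  # Second cross-tabulation: count animals per A/B/C combination
  xxt <- xtabs(~ ., as.data.frame.matrix(xt))
  as.data.frame(xxt)
}

res <- xxtabs(dat, ~ animals + categories)
res
```

The A, B and C columns of the result are factors with levels "0" and "1", so convert them if numeric indicators are needed.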
If I understood you correctly, this should do the trick.
require(tidyverse)
dat %>%
mutate(value = 1) %>%
spread(categories, value) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
mutate(count = rowSums(data.frame(A, B, C), na.rm = TRUE)) %>%
group_by(A, B, C) %>%
summarize(Count = n())
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <int>
1 0. 0. 1. 2
2 0. 1. 1. 2
3 1. 0. 0. 2
4 1. 1. 0. 2
5 1. 1. 1. 1
Adding a data.table solution. First, pivot animals against categories in dat with dcast. Then, create all combinations of A, B, C using CJ. Join those combinations with the pivoted table and count the number of occurrences for each combination.
library(data.table)
dcast(as.data.table(dat), animals ~ categories, length)[
CJ(A=0:1, B=0:1, C=0:1), .(count=.N), on=c("A","B","C"), by=.EACHI]

Resources