One-hot encoding when data is stored across multiple columns - r

Say I have a dataframe:
primary_color secondary_color tertiary_color
red           blue            green
yellow        red             NA
I want to encode this by checking whether each color appears in any of the three columns (1) or in none of them (0). So it should yield:
red blue green yellow
1   1    1     0
1   0    0     1
I'm working in R. I know I could do this by writing out a bunch of ifelse statements for each color, but my actual problem has a lot more colors. Is there a more concise way to do this?

You can add a row-number column to track each row, reshape the data to long format, and then reshape it back to wide while counting the occurrences of each color.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  pivot_wider(names_from = value, values_from = value, id_cols = row,
              values_fn = length, values_fill = 0) %>%
  select(-row)
#    red  blue green yellow
#  <int> <int> <int>  <int>
#1     1     1     1      0
#2     1     1     0      1
data
df <- structure(list(primary_color = c("red", "yellow"), secondary_color =
c("blue", "red"), tertiary_color = c("green", "blue")), row.names = c(NA,
-2L), class = "data.frame")
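The question's data contains an NA in tertiary_color, which the sample data above does not. A minimal sketch of how the same reshape could handle that case, assuming it is acceptable to drop NA cells with values_drop_na = TRUE and that a color repeated within a row should still yield 1. The column name present is introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

# The question's data, including the NA in the last column
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(row = row_number()) %>%
  # drop NA cells so that no "NA" column is created later
  pivot_longer(cols = -row, values_drop_na = TRUE) %>%
  # 'present' is a hypothetical helper column: 1 for every observed color
  mutate(present = 1L) %>%
  # keep one record per row/color pair so repeated colors do not count twice
  distinct(row, value, present) %>%
  pivot_wider(id_cols = row, names_from = value,
              values_from = present, values_fill = 0L) %>%
  select(-row)

res
```

Because duplicates are removed before widening, every cell is either 0 or 1 rather than a count.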

In base R you could use sapply with a function that checks the vector of desired names:
nnames <- c("red", "blue", "green", "yellow")
new_df <- t(sapply(seq_len(nrow(df)),
function(x)(nnames %in% df[x, ]) * 1))
colnames(new_df) <- nnames
# red blue green yellow
#1 1 1 1 0
#2 1 0 0 1
Note: if you don't care about the order of the columns in the second table, you can generalize nnames to nnames <- unique(unlist(df[!is.na(df)]))
Data
df <- read.table(text = "primary_color secondary_color tertiary_color
red blue green
yellow red NA", h = TRUE)
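As an alternative to looping over rows, the same check can be done one color at a time over the whole table. This is only a sketch under the assumption that all columns are character (not factor); the names vals, m and onehot are introduced here for illustration.

```r
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA),
  stringsAsFactors = FALSE
)

# All distinct non-missing colors, in order of first appearance (column-wise)
vals   <- unlist(df, use.names = FALSE)
nnames <- unique(vals[!is.na(vals)])

m <- as.matrix(df)

# For each color: does it occur anywhere in the row? NA cells are ignored.
onehot <- sapply(nnames, function(col) {
  as.integer(rowSums(m == col, na.rm = TRUE) > 0)
})

onehot
```

Note that sapply() returns a matrix here only because df has more than one row; with a single row it would return a named vector instead.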

Using outer.
uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
# red blue green yellow
# [1,] 1 1 1 0
# [2,] 1 0 0 1
Data:
dat <- structure(list(primary_color = c("red", "yellow"), secondary_color = c("blue",
"red"), tertiary_color = c("green", NA)), class = "data.frame", row.names = c(NA,
-2L))

In base R:
table(row(df), as.matrix(df))
    blue green red yellow
  1    1     1   1      0
  2    0     0   1      1
If you want it as a data.frame:
as.data.frame.matrix(table(row(df), as.matrix(df)))
  blue green red yellow
1    1     1   1      0
2    0     0   1      1
If the same color can appear in more than one column of a row, cap the counts at 1:
+(table(row(df), as.matrix(df))>0)
    blue green red yellow
  1    1     1   1      0
  2    0     0   1      1
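To see why this works, it may help to look at the intermediate pieces. The sketch below uses the question's data; the names m, idx, tab and onehot are introduced here for illustration.

```r
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA),
  stringsAsFactors = FALSE
)

m   <- as.matrix(df)
idx <- row(m)          # same shape as m; each cell holds its own row number

# Cross-tabulate "which row" against "which color".
# Pairs where the color is NA are dropped by table()'s default useNA = "no".
tab <- table(idx, m)

# Turn counts into 0/1 indicators
onehot <- +(tab > 0)
onehot
```

The columns come out in alphabetical order because table() sorts the distinct values.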

Using mtabulate:
library(qdapTools)
mtabulate(as.data.frame(t(df1)))
   blue green red yellow
V1    1     1   1      0
V2    1     0   1      1
Or with base R:
table(c(row(df1)), unlist(df1))
    blue green red yellow
  1    1     1   1      0
  2    1     0   1      1

Related

Transform the cell values of a column into new binary / dummy variables

I've been looking through similar posts here for guidance but they all seem to focus on splitting strings within cells into distinct columns, whereas in my data all my strings are already split into individual cells.
In other words, my data look like this:
ID word
1  blue
1  red
1  green
1  yellow
2  blue
2  purple
2  orange
2  green
But I want them to look like this:
ID blue red green yellow purple orange
1  1    1   1     1      0      0
2  1    0   1     0      1      1
I've tried using base R's table() to do this, but I get the error message "Error in table(df) : attempt to make a table with >= 2^31 elements", which is strange because there are nowhere near that many elements in my dataset.
I've also tried doing this with pivot_wider(),
df %>%
pivot_wider(ID, names_from = word, values_from = word,
values_fn = length, values_fill = 0)
but running the code above gives me an error message saying "Column 2894 must be named. Use .name_repair to specify repair." I don't think pivot_wider accepts .name_repair as an argument, though. I'm also not sure I understand why it would be necessary here, since I'm just trying to create a new binary column for every unique cell value.
Any help at all is appreciated! Thanks!
Using data.table:
ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L)
word = c('blue', 'red', 'green', 'yellow', 'blue', 'purple', 'orange', 'green')
library(data.table)
DT = data.table(ID, word)
DT
dcast(DT, ID ~ word, fill = 0L, fun.agg = \(x) ifelse(x == 0, 0, 1))
Output:
   ID blue green orange purple red yellow
1:  1    1     1      0      0   1      1
2:  2    1     1      1      1   0      0
Here is a fix using mutate() from dplyr.
library(dplyr)
library(tidyr)
ID <- c(1, 1, 1, 1, 2, 2, 2, 2)
word <- c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green")
dd1 <- data.frame(ID, word)
dd1 %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = word, values_from = value, values_fill = 0)
# A tibble: 2 x 7
     ID  blue   red green yellow purple orange
  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     1     1     1      1      0      0
2     2     1     0     1      0      1      1
This approach adds a dummy column, value, equal to 1 for every observed ID/word pair; values_fill = 0 then fills in the combinations that never occur.
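Neither answer says why the original attempts failed. One likely reason for the table() error is that table(df) cross-tabulates every column of df, so naming the two columns explicitly keeps the table two-dimensional. A possible cause of the pivot_wider() error is a missing or empty value in word. Both explanations are assumptions, since the real data is not shown. A minimal base R sketch along those lines (dd, tab and wide are names introduced here):

```r
dd <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  word = c("blue", "red", "green", "yellow",
           "blue", "purple", "orange", "green"),
  stringsAsFactors = FALSE
)

# Assumption: rows with a missing or empty word should be ignored
dd <- dd[!is.na(dd$word) & dd$word != "", ]

# Cross-tabulate only the two columns of interest, then cap counts at 1
tab  <- table(dd$ID, dd$word)
wide <- as.data.frame.matrix(+(tab > 0))

# Bring the ID back as a regular column
wide <- cbind(ID = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

The word columns appear in alphabetical order, as in the data.table output.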

Counting new levels of a factor per site (R)

I need to generate a table counting new levels of a factor per site.
My code is like this
# Data creation
f = c("red", "green", "blue", "orange", "yellow")
f = factor(f)
d = data.frame(
  site = 1:10,
  color1 = c(
    "red", "red", "green", "green", "green",
    "blue", "green", "blue", "orange", "yellow"
  ),
  color2 = c(
    "green", "green", "green", "blue", "green",
    "blue", "orange", "yellow", "red", "red"
  )
)
d$color1 = factor(d$color1, levels = levels(f))
d$color2 = factor(d$color2, levels = levels(f))
d
It shows me this table
I need to count how many new colors appear at each new site, counting only the first appearance of a color and not its repeats. The result should be a table like this one.
The count of non-duplicated colors per site is shown in this figure.
Is there a dplyr way to find this output?
You can do:
library(tidyverse)
d %>%
  pivot_longer(cols = -site) %>%
  mutate(newColors = duplicated(value)) %>%
  group_by(site) %>%
  mutate(newColors = sum(!newColors)) %>%
  ungroup() %>%
  pivot_wider()
which gives:
# A tibble: 10 x 4
    site newColors color1 color2
   <int>     <int> <fct>  <fct>
 1     1         2 red    green
 2     2         0 red    green
 3     3         0 green  green
 4     4         1 green  blue
 5     5         0 green  green
 6     6         0 blue   blue
 7     7         1 green  orange
 8     8         1 blue   yellow
 9     9         0 orange red
10    10         0 yellow red
Note that this differs from your expected output in row 9, where you have a 1 even though both colors (orange and red) already appeared in previous rows.
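The core of this answer is duplicated() applied to the colors in site order. A small base R sketch of the same logic, without the reshaping, may make the counts easier to verify by hand; it assumes the two color columns are read row by row, and vals and first_seen are names introduced here.

```r
d <- data.frame(
  site   = 1:10,
  color1 = c("red", "red", "green", "green", "green",
             "blue", "green", "blue", "orange", "yellow"),
  color2 = c("green", "green", "green", "blue", "green",
             "blue", "orange", "yellow", "red", "red"),
  stringsAsFactors = FALSE
)

# Flatten row by row: site 1 color1, site 1 color2, site 2 color1, ...
vals <- as.vector(t(as.matrix(d[, c("color1", "color2")])))

# TRUE the first time a color is seen, FALSE for every later occurrence
first_seen <- !duplicated(vals)

# Two values per site, so fold back into a 10 x 2 matrix and sum each row
d$newColors <- as.integer(rowSums(matrix(first_seen, ncol = 2, byrow = TRUE)))
d
```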

add columns by count of each group [duplicate]

I would like to add new columns containing the count of each group in type. My dataframe looks like this:
# color type
# black chair
# black chair
# black sofa
# pink  plate
# pink  chair
# red   sofa
# red   plate
I am looking for something like:
# color chair sofa plate
# black 2     1    0
# pink  1     0    1
# red   0     1    1
I used table(df$color, df$type), but the result has no name for the column 'color'.
We may use table from base R
table(df)
Or with pivot_wider
library(tidyr)
pivot_wider(df, names_from = type, values_from = type,
            values_fn = length, values_fill = 0)
# A tibble: 3 × 4
  color chair  sofa plate
  <chr> <int> <int> <int>
1 black     2     1     0
2 pink      1     0     1
3 red       0     1     1
Or with dcast
library(data.table)
dcast(setDT(df), color ~ type, value.var = 'type', length, fill = 0)
data
df <- structure(list(color = c("black", "black", "black", "pink", "pink",
"red", "red"), type = c("chair", "chair", "sofa", "plate", "chair",
"sofa", "plate")), class = "data.frame", row.names = c(NA, -7L
))
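If the goal is a plain data frame in which color is an ordinary column (the issue raised in the question), the base R table can be converted as follows. This is a sketch; tab and wide are names introduced here.

```r
df <- data.frame(
  color = c("black", "black", "black", "pink", "pink", "red", "red"),
  type  = c("chair", "chair", "sofa", "plate", "chair", "sofa", "plate"),
  stringsAsFactors = FALSE
)

tab  <- table(df$color, df$type)

# One column per type, with the colors as row names
wide <- as.data.frame.matrix(tab)

# Move the row names into a proper 'color' column
wide <- cbind(color = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

The type columns are ordered alphabetically (chair, plate, sofa) rather than by first appearance.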

tidyr join an ID table with main table across multiple columns

This seems like a very basic operation, but my searches are not finding a simple solution.
As an example of what I am trying to do, consider the following two data frames from a database.
First an ID table that assigns an index to a color name:
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
ColorID
# A tibble: 4 x 2
     ID Name
  <int> <chr>
1     1 Red
2     2 Green
3     3 Blue
4     4 Black
Next some table that points to these color indexes (instead of storing text strings):
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
                  Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
Widgets
# A tibble: 6 x 4
  Front  Back   Top Bottom
  <dbl> <dbl> <dbl>  <dbl>
1     1     4     4      1
2     3     4     3      2
3     4     3     2      3
4     2     3     1      4
5     1     1     2      3
6     1     2     3      2
Now I just want to join the two tables to substitute the index values with the actual color names, so what I want is:
Joined <- tibble(Front = c("Red", "Blue", "Black", "Green", "Red", "Red"),
                 Back = c("Black", "Black", "Blue", "Blue", "Red", "Green"),
                 Top = c("Black", "Blue", "Green", "Red", "Green", "Blue"),
                 Bottom = c("Red", "Green", "Blue", "Black", "Blue", "Green"))
Joined
# A tibble: 6 x 4
  Front Back  Top   Bottom
  <chr> <chr> <chr> <chr>
1 Red   Black Black Red
2 Blue  Black Blue  Green
3 Black Blue  Green Blue
4 Green Blue  Red   Black
5 Red   Red   Green Blue
6 Red   Green Blue  Green
I've tried many iterations with no success. What I thought would work is something like:
J <- Widgets %>% inner_join(ColorID, by = c(. = "ID"))
I can tackle this column by column, using one variable at a time, e.g.
J <- Widgets %>% inner_join(ColorID, by = c("Front" = "ID"))
but that doesn't replace "Front"; it creates a new "Name" column instead. It seems like there has to be a simple solution to this, though. Thanks.
There is no need for join functions:
library(dplyr)
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
# reorder so that row number and ID are different
ColorID <- ColorID[c(2, 1, 4, 3), ]
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
                  Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
check_id <- function(col){
  ColorID$Name[match(col, ColorID$ID)]
}
Widgets %>%
  mutate(across(everything(), check_id))
# A tibble: 6 x 4
  Front Back  Top   Bottom
  <chr> <chr> <chr> <chr>
1 Red   Black Black Red
2 Blue  Black Blue  Green
3 Black Blue  Green Blue
4 Green Blue  Red   Black
5 Red   Red   Green Blue
6 Red   Green Blue  Green
What I'm doing with dplyr's mutate() and across() is applying check_id() to every column of Widgets. Inside it, match() looks up each number from Widgets in the ColorID$ID column, which gives the row of the ColorID data frame needed to extract the name.
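To make the match() step concrete, here is a small base R sketch on the Front and Back columns only, using the reordered lookup table from the answer. The names pos, front_names and lookup are introduced here for illustration, and plain data frames are used instead of tibbles.

```r
# Lookup table in the reordered form used above (row number != ID)
ColorID <- data.frame(
  ID   = c(2L, 1L, 4L, 3L),
  Name = c("Green", "Red", "Black", "Blue"),
  stringsAsFactors = FALSE
)

front <- c(1, 3, 4, 2, 1, 1)

# For each value in 'front', the position of the matching ID in ColorID$ID
pos <- match(front, ColorID$ID)

# Use those positions to pull out the names
front_names <- ColorID$Name[pos]
front_names

# The same lookup applied to every column of a data frame, without dplyr
lookup  <- function(col) ColorID$Name[match(col, ColorID$ID)]
Widgets <- data.frame(Front = front, Back = c(4, 4, 3, 3, 1, 2))
Widgets[] <- lapply(Widgets, lookup)
Widgets
```

Values with no matching ID would come back as NA, since match() returns NA for them.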
Does this work:
library(dplyr)
library(tidyr)
Widgets %>%
  pivot_longer(everything()) %>%
  inner_join(ColorID, by = c('value' = 'ID')) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = Name) %>%
  unnest(everything())
# A tibble: 6 x 4
  Front Back  Top   Bottom
  <chr> <chr> <chr> <chr>
1 Red   Black Black Red
2 Blue  Black Blue  Green
3 Black Blue  Green Blue
4 Green Blue  Red   Black
5 Red   Red   Green Blue
6 Red   Green Blue  Green

Take the first value in a variable as a variable name in R

In R, I'm trying to take the first value of a character variable and use it to rename the same variable or even to assign a name to another new variable, but I haven't figured out how to do this.
Example:
PR <- data.frame("Variable1" = c("Red", "Blue", "Green", "Yellow"),
"Variable2" = seq(1:4))
PR
  Variable1 Variable2
1       Red         1
2      Blue         2
3     Green         3
4    Yellow         4
I know one could just use "PR %>% rename(Red = Variable1)", but I want R to take this name from the variable directly. The outcome should be:
     Red Variable2
1    Red         1
2   Blue         2
3  Green         3
4 Yellow         4
I've tried to use the "rename()" function from dplyr to do this, but it didn't work:
PR <- PR %>% rename(as.name(Variable1)[1] = Variable1)
Error: unexpected '=' in "PR <- PR %>% rename(as.name(Variable1)[1] ="
How could I do this using dplyr, or even in the context of creating a new variable with the "mutate()" command (for example if I want to create a new variable which name is the first value of "Variable1")?
Does this work:
> library(dplyr)
> PR
  Variable1 Variable2
1       Red         1
2      Blue         2
3     Green         3
4    Yellow         4
> name <- PR$Variable1[1]
> PR %>% rename(!!sym(name) := Variable1)
     Red Variable2
1    Red         1
2   Blue         2
3  Green         3
4 Yellow         4
You need to unquote the new name with !! and assign it with := instead of =.
library(tidyverse)
PR <- data.frame("Variable1" = c("Red", "Blue", "Green", "Yellow"),
"Variable2" = seq(1:4))
#Note the sequence of commands
PR %>%
  mutate(Variable3 = PR$Variable1[1]) %>%
  rename(!!PR$Variable1[1] := Variable1)
#      Red Variable2 Variable3
# 1    Red         1       Red
# 2   Blue         2       Red
# 3  Green         3       Red
# 4 Yellow         4       Red
We can use rename_at:
library(dplyr)
PR %>%
  rename_at(vars(Variable1), ~ PR$Variable1[1])
#     Red Variable2
#1    Red         1
#2   Blue         2
#3  Green         3
#4 Yellow         4
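rename_at() has been superseded since dplyr 1.0.0 in favor of rename_with(). A minimal sketch of the equivalent call, assuming a recent dplyr; new_name and res are names introduced here, and as.character() is added only as a precaution in case Variable1 is a factor.

```r
library(dplyr)

PR <- data.frame(
  Variable1 = c("Red", "Blue", "Green", "Yellow"),
  Variable2 = 1:4,
  stringsAsFactors = FALSE
)

# Take the new column name from the first value of the column itself
new_name <- as.character(PR$Variable1[1])

# .fn receives the selected column names and returns their replacements
res <- PR %>%
  rename_with(~ new_name, .cols = Variable1)

names(res)
```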

Resources