Counting new levels of a factor per site (R)

I need to generate a table counting new levels of a factor per site.
My code is like this
# Data creation
f = c("red", "green", "blue", "orange", "yellow")
f = factor(f)
d = data.frame(
site = 1:10,
color1= c(
"red", "red", "green", "green", "green",
"blue","green", "blue", "orange", "yellow"
),
color2= c(
"green", "green", "green", "blue","green",
"blue", "orange", "yellow","red", "red"
)
)
d$color1 = factor( d$color1 , levels = levels(f) )
d$color2 = factor( d$color2 , levels = levels(f) )
d
It prints this table.
I need to count how many new colors appear at each site. A color counts only the first time it appears; later occurrences are duplicates and are not counted. The result should be a table like this one.
The count of non-duplicated colors per site is shown in this figure.
Is there a dplyr way to find this output?

You can do:
library(tidyverse)
d %>%
  pivot_longer(cols = -site) %>%
  mutate(newColors = duplicated(value)) %>%
  group_by(site) %>%
  mutate(newColors = sum(!newColors)) %>%
  ungroup() %>%
  pivot_wider()
which gives:
# A tibble: 10 x 4
site newColors color1 color2
<int> <int> <fct> <fct>
1 1 2 red green
2 2 0 red green
3 3 0 green green
4 4 1 green blue
5 5 0 green green
6 6 0 blue blue
7 7 1 green orange
8 8 1 blue yellow
9 9 0 orange red
10 10 0 yellow red
Note that this differs for row 9 where you have a 1, but both colors (orange and red) already appeared in previous rows.
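For comparison, the same counts can be reproduced in base R. This is only a minimal sketch and not part of the answer above: the names colors_by_site, first_seen and new_colors are made up for illustration, and it assumes the colors are scanned site by site, with color1 before color2.

```r
# Same data as in the question (plain character columns are enough here).
d <- data.frame(
  site = 1:10,
  color1 = c("red", "red", "green", "green", "green",
             "blue", "green", "blue", "orange", "yellow"),
  color2 = c("green", "green", "green", "blue", "green",
             "blue", "orange", "yellow", "red", "red")
)

# Interleave the two columns row by row:
# site 1 color1, site 1 color2, site 2 color1, ...
colors_by_site <- as.vector(t(as.matrix(d[c("color1", "color2")])))

# TRUE the first time a color shows up, FALSE for every later occurrence.
first_seen <- !duplicated(colors_by_site)

# Add up the first appearances within each site.
new_colors <- tapply(first_seen, rep(d$site, each = 2), sum)
d$newColors <- as.vector(new_colors)
d
```

Under these assumptions site 9 again gets 0, because orange and red were both seen at earlier sites.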

Related

One-hot encoding when data is stored across multiple columns

Say I have a dataframe
primary_color  secondary_color  tertiary_color
red            blue             green
yellow         red              NA
and I want to encode this by checking whether each color appears in any of the three columns (1) or in none of them (0). So, it should yield
red  blue  green  yellow
1    1     1      0
1    0     0      1
I'm working in R. I know I could do this by writing out a bunch of ifelse statements for each color, but my actual problem has a lot more colors. Is there a more concise way to do this?
You may create a new column with row number to track each row, get the data in long format and bring it back to wide by counting occurrence of each color.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
pivot_wider(names_from = value, values_from = value, id_cols = row,
values_fn = length, values_fill = 0) %>%
select(-row)
# red blue green yellow
# <int> <int> <int> <int>
#1 1 1 1 0
#2 1 1 0 1
data
df <- structure(list(primary_color = c("red", "yellow"), secondary_color =
c("blue", "red"), tertiary_color = c("green", "blue")), row.names = c(NA,
-2L), class = "data.frame")
In base R you could use sapply with a function that checks the vector of desired names:
nnames <- c("red", "blue", "green", "yellow")
new_df <- t(sapply(seq_len(nrow(df)),
function(x)(nnames %in% df[x, ]) * 1))
colnames(new_df) <- nnames
# red blue green yellow
#1 1 1 1 0
#2 1 0 0 1
Note that if you don't care about the order of the columns in the second table, you can generalize nnames to nnames <- unique(unlist(df[!is.na(df)]))
Data
df <- read.table(text = "primary_color secondary_color tertiary_color
red blue green
yellow red NA", h = TRUE)
Using outer.
uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
# red blue green yellow
# [1,] 1 1 1 0
# [2,] 1 0 0 1
Data:
dat <- structure(list(primary_color = c("red", "yellow"), secondary_color = c("blue",
"red"), tertiary_color = c("green", NA)), class = "data.frame", row.names = c(NA,
-2L))
in base R:
table(row(df), as.matrix(df))
blue green red yellow
1 1 1 1 0
2 0 0 1 1
If you want it as a data.frame:
as.data.frame.matrix(table(row(df), as.matrix(df)))
blue green red yellow
1 1 1 1 0
2 0 0 1 1
If there is one color in many columns of the same row:
+(table(row(df), as.matrix(df))>0)
blue green red yellow
1 1 1 1 0
2 0 0 1 1
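To see why the > 0 step matters, here is a small sketch with a color repeated within one row. The data frame df_dup is hypothetical and was made up only for this illustration.

```r
# Hypothetical data: "red" appears twice in row 1.
df_dup <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("red", "red"),
  tertiary_color  = c("green", NA)
)

# Plain counts: row 1 gets a 2 under "red". NA is dropped by table().
counts <- table(row(df_dup), as.matrix(df_dup))
counts

# Turn the counts into 0/1 indicators.
indicator <- +(counts > 0)
indicator
```

The unary + converts the logical result of counts > 0 back to integers.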
Using mtabulate
library(qdapTools)
mtabulate(as.data.frame(t(df1)))
blue green red yellow
V1 1 1 1 0
V2 1 0 1 1
Or with base R
table(c(row(df1)), unlist(df1))
blue green red yellow
1 1 1 1 0
2 1 0 1 1

add columns by count of each group [duplicate]

I would like to add new columns holding the count of each group in type. My dataframe looks like this:
# color type
# black chair
# black chair
# black sofa
# pink plate
# pink chair
# red sofa
# red plate
I am looking for something like:
# color chair sofa plate
# black 2 1 0
# pink 1 0 1
# red 0 1 1
I used table(df$color, df$type), but the result has no name for column 'color'
We may use table from base R
table(df)
Or with pivot_wider
library(tidyr)
pivot_wider(df, names_from = type, values_from = type,
values_fn = length, values_fill = 0)
# A tibble: 3 × 4
color chair sofa plate
<chr> <int> <int> <int>
1 black 2 1 0
2 pink 1 0 1
3 red 0 1 1
Or with dcast
library(data.table)
dcast(setDT(df), color ~ type, value.var = 'type', length, fill = 0)
data
df <- structure(list(color = c("black", "black", "black", "pink", "pink",
"red", "red"), type = c("chair", "chair", "sofa", "plate", "chair",
"sofa", "plate")), class = "data.frame", row.names = c(NA, -7L
))
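The question mentions that table(df$color, df$type) leaves the color column without a name. One possible way around that in base R is sketched below; the object names tab and wide are assumptions for this example, and the columns come out in alphabetical order rather than the order shown in the question.

```r
df <- data.frame(
  color = c("black", "black", "black", "pink", "pink", "red", "red"),
  type  = c("chair", "chair", "sofa", "plate", "chair", "sofa", "plate")
)

# Cross-tabulate, then move the row names into a proper "color" column.
tab <- table(df$color, df$type)
wide <- cbind(color = rownames(tab), as.data.frame.matrix(tab))
rownames(wide) <- NULL
wide
```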

tidyr join an ID table with main table across multiple columns

This seems like a very basic operation, but my searches are not finding a simple solution.
As an example of what I am trying to do, consider the following two data frames from a database.
First an ID table that assigns an index to a color name:
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
ColorID
# A tibble: 4 x 2
ID Name
<int> <chr>
1 1 Red
2 2 Green
3 3 Blue
4 4 Black
Next some table that points to these color indexes (instead of storing text strings):
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
Widgets
# A tibble: 6 x 4
Front Back Top Bottom
<dbl> <dbl> <dbl> <dbl>
1 1 4 4 1
2 3 4 3 2
3 4 3 2 3
4 2 3 1 4
5 1 1 2 3
6 1 2 3 2
Now I just want to join the two tables to substitute the index values with the actual color names, so what I want is:
Joined <- tibble(Front = c("Red", "Blue", "Black", "Green", "Red","Red"),
Back = c("Black", "Black", "Blue","Blue", "Red", "Green"),
Top = c("Black","Blue", "Green", "Red", "Green", "Blue"),
Bottom = c("Red", "Green", "Blue", "Black", "Blue","Green"))
Joined
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
I've tried many iterations with no success, what I thought would work is something like:
J <- Widgets %>% inner_join(ColorID, by = c(. = "ID"))
I can tackle this column by column by using one variable at a time, e.g.
J <- Widgets %>% inner_join(ColorID, by = c("Front" = "ID"))
This doesn't replace "Front"; it creates a new "Name" column instead. It seems like there has to be a simpler solution, though. Thanks.
There is no need for join functions:
library(dplyr)
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
# reorder so that row number and ID are different
ColorID <- ColorID[c(2, 1, 4, 3), ]
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
check_id <- function(col){
ColorID$Name[match(col, ColorID$ID)]
}
Widgets %>%
mutate(across(everything(), check_id))
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
What mutate and across do here is match the numbers in each Widgets column against the ColorID$ID column. match returns the row position in ColorID for each number, and that position is used to extract the name.
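The lookup can be checked on a single column. This is a small sketch using the reordered ColorID from the answer and the Front values; the variable names front and pos are only for illustration.

```r
# ColorID after the reordering used in the answer.
ColorID <- data.frame(
  ID   = c(2, 1, 4, 3),
  Name = c("Green", "Red", "Black", "Blue")
)
front <- c(1, 3, 4, 2, 1, 1)

# match() gives, for each value in front, its row position in ColorID$ID.
pos <- match(front, ColorID$ID)
pos
#> [1] 2 4 3 1 2 2

# Indexing Name by those positions gives the color names.
ColorID$Name[pos]
#> [1] "Red"   "Blue"  "Black" "Green" "Red"   "Red"
```

Because the positions come from match rather than from the ID values themselves, the lookup still works when the rows of ColorID are not in ID order.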
Does this work:
library(dplyr)
library(tidyr)
Widgets %>% pivot_longer(everything()) %>%
inner_join(ColorID, by = c('value' = 'ID')) %>% select(-value) %>%
pivot_wider(names_from = name, values_from = Name) %>% unnest(everything())
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green

Assign unique ID based on values in EITHER of two columns

This is not a duplicate of this question. Please read questions entirely before labeling duplicates.
I have a data.frame like so:
library(tidyverse)
tibble(
color = c("blue", "blue", "red", "green", "purple"),
shape = c("triangle", "square", "circle", "hexagon", "hexagon")
)
color shape
<chr> <chr>
1 blue triangle
2 blue square
3 red circle
4 green hexagon
5 purple hexagon
I'd like to add a group_id column like this:
color shape group_id
<chr> <chr> <dbl>
1 blue triangle 1
2 blue square 1
3 red circle 2
4 green hexagon 3
5 purple hexagon 3
The difficulty is that I want to group by unique values of color or shape. I suspect the solution might be to use list-columns, but I can't figure out how.
We can use duplicated in base R
df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))
-output
df1
# A tibble: 5 x 3
# color shape group_id
# <chr> <chr> <int>
#1 blue triangle 1
#2 blue square 1
#3 red circle 2
#4 green hexagon 3
#5 purple hexagon 3
Or using tidyverse
library(dplyr)
library(purrr)
df1 %>%
mutate(group_id = map(., duplicated) %>%
reduce(`|`) %>%
`!` %>%
cumsum)
data
df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
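To make the one-liner easier to follow, here is the same computation split into its intermediate steps on the example data. The names dup_color, dup_shape and seen_before are introduced only for this sketch.

```r
df1 <- data.frame(
  color = c("blue", "blue", "red", "green", "purple"),
  shape = c("triangle", "square", "circle", "hexagon", "hexagon")
)

dup_color <- duplicated(df1$color)  # FALSE  TRUE FALSE FALSE FALSE
dup_shape <- duplicated(df1$shape)  # FALSE FALSE FALSE FALSE  TRUE

# A row continues a group if either value was already seen in an earlier row.
seen_before <- dup_color | dup_shape

# Each row with nothing seen before starts a new group.
group_id <- cumsum(!seen_before)
group_id
#> [1] 1 1 2 3 3
```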

Using ifelse within mutate and handling NA's

thanks for your time.
I have a question about using ifelse within the mutate function. ifelse is from base R, while mutate is from the dplyr package.
My question is about how ifelse handles NA values.
I have two character vectors:
example_character_vector contains some words and occasional NA values while the other vector, color_indicator, contains only the words Green, Yellow, and Red.
I want to mutate my dataframe example_data_frame to create a new override_color_indicator variable that converts some of the yellows to greens depending on a condition in the example_character_vector.
Example data:
example_character_vector <- c("Basic", NA, "Full", "None", NA, "None",
NA)
color_indicator <- c("Green", "Green", "Yellow", "Yellow", "Yellow",
"Red", "Red")
example_data_frame <- data.frame(example_character_vector,
color_indicator)
This example_data_frame looks like so:
example_character_vector color_indicator
1 Basic Green
2 <NA> Green
3 Full Yellow
4 None Yellow
5 <NA> Yellow
6 None Red
7 <NA> Red
I am using nested ifelse statements within mutate to create a new column called override_color_indicator.
If color_indicator is yellow and the example_character_vector contains the word "Full", I want the override_color_indicator to be Green (this is a special case within my data). Otherwise, I would like the override_color_indicator to be exactly the same as the color_indicator.
Here is my mutate:
library(dplyr)
library(stringr)
example_data_frame <- example_data_frame %>%
  mutate(override_color_indicator =
    ifelse(color_indicator == "Green",
      "Green",
      ifelse(color_indicator == "Yellow" &
          str_detect(example_character_vector, "Full"),
        "Green",
        ifelse(color_indicator == "Yellow" &
            !str_detect(example_character_vector, "Full") |
            color_indicator == "Yellow" &
            is.na(example_character_vector),
          "Yellow",
          "Red"))))
(Apologies for the formatting - I tried to format this the best I could for Stack Overflow.)
This above code produces this dataframe:
example_character_vector color_indicator override_color_indicator
1 Basic Green Green
2 <NA> Green Green
3 Full Yellow Green
4 None Yellow Yellow
5 <NA> Yellow <NA>
6 None Red Red
7 <NA> Red Red
My problem here is that in row 5, an NA is introduced in the override_color_indicator column. Instead of an NA, I would like it to be "Yellow".
For clarity, this is my desired dataframe:
example_character_vector color_indicator override_color_indicator
1 Basic Green Green
2 <NA> Green Green
3 Full Yellow Green
4 None Yellow Yellow
5 <NA> Yellow Yellow
6 None Red Red
7 <NA> Red Red
I've looked quite a bit for an answer, and couldn't find one anywhere. I could just create a workaround and go back and manually assign the entries to Yellow, but I don't love that option from a programmatic standpoint.
Also, I'm just kind of curious as to why this behavior happens. I've run into this problem a few times now.
Thanks for your time!
You should use case_when here, but the reason you are getting NA is the second ifelse. NA propagates through logical operators in R according to this rule (from the docs): "the result will be NA if the outcome is ambiguous". The following is FALSE whatever the missing value might have been, so the result is FALSE:
NA & FALSE
#> [1] FALSE
but here the outcome depends on the missing value, so it is ambiguous and the NA propagates:
NA & TRUE
#> [1] NA
Row 5 has TRUE for Yellow but str_detect will return NA, so ifelse returns NA. You can get around this by adding & !is.na(example_character_vector) in that line:
library(tidyverse)
example_data_frame <- structure(list(example_character_vector = c("Basic", NA, "Full", "None", NA, "None", NA), color_indicator = c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")), class = "data.frame", row.names = c(NA, -7L))
example_data_frame %>%
mutate(
override_color_indicator =
ifelse(
color_indicator == "Green",
"Green",
ifelse(
color_indicator == "Yellow" &
str_detect(example_character_vector, "Full") & !is.na(example_character_vector),
"Green",
ifelse(
color_indicator == "Yellow" &
(!str_detect(example_character_vector, "Full") | is.na(example_character_vector)),
"Yellow",
"Red"
)
)
)
)
#> example_character_vector color_indicator override_color_indicator
#> 1 Basic Green Green
#> 2 <NA> Green Green
#> 3 Full Yellow Green
#> 4 None Yellow Yellow
#> 5 <NA> Yellow Yellow
#> 6 None Red Red
#> 7 <NA> Red Red
But definitely use case_when!
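The NA behaviour described in this answer can be checked in isolation. The sketch below uses a plain == comparison as a stand-in for str_detect (both return NA for an NA input); the vectors color and text are made up for illustration.

```r
color <- c("Yellow", "Yellow", "Yellow")
text  <- c("Full", NA, "None")

is_full <- text == "Full"   # TRUE NA FALSE

# Without a guard, TRUE & NA is NA, and ifelse() returns NA for an NA test.
unguarded <- ifelse(color == "Yellow" & is_full, "Green", "Yellow")
unguarded
#> [1] "Green"  NA       "Yellow"

# With the guard, NA & FALSE is FALSE, so the "no" branch is used instead.
guarded <- ifelse(color == "Yellow" & is_full & !is.na(text), "Green", "Yellow")
guarded
#> [1] "Green"  "Yellow" "Yellow"
```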
Try this instead. case_when is a more flexible vectorised if, and it lets you use TRUE to say "else, use the value in color_indicator".
example_data_frame <- structure(list(example_character_vector = c("Basic", NA, "Full", "None", NA, "None", NA), color_indicator = c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")), class = "data.frame", row.names = c(NA, -7L))
library(dplyr)
example_data_frame %>%
mutate(x = case_when(color_indicator == "Yellow" &
example_character_vector == "Full" ~ "Green",
TRUE ~ color_indicator))
#> example_character_vector color_indicator x
#> 1 Basic Green Green
#> 2 <NA> Green Green
#> 3 Full Yellow Green
#> 4 None Yellow Yellow
#> 5 <NA> Yellow Yellow
#> 6 None Red Red
#> 7 <NA> Red Red
