Counting new levels of a factor per site (R)


I need to generate a table counting new levels of a factor per site.
My code is like this
# Data creation
f = c("red", "green", "blue", "orange", "yellow")
f = factor(f)
d = data.frame(
  site = 1:10,
  color1 = c(
    "red", "red", "green", "green", "green",
    "blue", "green", "blue", "orange", "yellow"
  ),
  color2 = c(
    "green", "green", "green", "blue", "green",
    "blue", "orange", "yellow", "red", "red"
  )
)
d$color1 = factor(d$color1, levels = levels(f))
d$color2 = factor(d$color2, levels = levels(f))
d
It gives me this table.
I need to count how many new colors appear at each site. A color counts only the first time it appears; later repeats are not counted. The result should be a table like this one.
The count of non-duplicated colors per site is shown in this figure.
Is there a dplyr way to get this output?

You can do:
library(tidyverse)
d %>%
  pivot_longer(cols = -site) %>%
  mutate(newColors = duplicated(value)) %>%
  group_by(site) %>%
  mutate(newColors = sum(!newColors)) %>%
  ungroup() %>%
  pivot_wider()
which gives:
# A tibble: 10 x 4
site newColors color1 color2
<int> <int> <fct> <fct>
1 1 2 red green
2 2 0 red green
3 3 0 green green
4 4 1 green blue
5 5 0 green green
6 6 0 blue blue
7 7 1 green orange
8 8 1 blue yellow
9 9 0 orange red
10 10 0 yellow red
Note that this differs for row 9 where you have a 1, but both colors (orange and red) already appeared in previous rows.
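For comparison, the same count can be sketched in base R without reshaping. This is a minimal sketch rather than the answer's method: it assumes the colors are stored as plain character columns and that "first appearance" is judged site by site, with color1 before color2. The names values, sites, and is_new are illustrative.

```r
# Data from the question, with character columns for brevity (an assumption)
d <- data.frame(
  site = 1:10,
  color1 = c("red", "red", "green", "green", "green",
             "blue", "green", "blue", "orange", "yellow"),
  color2 = c("green", "green", "green", "blue", "green",
             "blue", "orange", "yellow", "red", "red")
)

# Interleave the two columns so the values are ordered site by site:
# color1[1], color2[1], color1[2], color2[2], ...
values <- as.vector(t(as.matrix(d[c("color1", "color2")])))
sites  <- rep(d$site, each = 2)

# A value is new the first time it appears anywhere in that ordering
is_new <- !duplicated(values)

# Sum the first appearances within each site
d$newColors <- as.vector(tapply(is_new, sites, sum))
d$newColors
# 2 0 0 1 0 0 1 1 0 0
```

The result agrees with the newColors column above, including the 0 for site 9.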

Related

One-hot encoding when data is stored across multiple columns

Say I have a dataframe
primary_color secondary_color tertiary_color
red blue green
yellow red NA
and I want to one-hot encode it: for each row, a color gets 1 if it appears in any of the three columns and 0 if it appears in none of them. So it should yield
red blue green yellow
1 1 1 0
1 0 0 1
I'm working in R. I know I could do this by writing out a bunch of ifelse statements for each color, but my actual problem has a lot more colors. Is there a more concise way to do this?

You may create a new column with row number to track each row, get the data in long format and bring it back to wide by counting occurrence of each color.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
pivot_wider(names_from = value, values_from = value, id_cols = row,
values_fn = length, values_fill = 0) %>%
select(-row)
# red blue green yellow
# <int> <int> <int> <int>
#1 1 1 1 0
#2 1 1 0 1
data
df <- structure(list(primary_color = c("red", "yellow"), secondary_color =
c("blue", "red"), tertiary_color = c("green", "blue")), row.names = c(NA,
-2L), class = "data.frame")

In base R you could use sapply with a function that checks the vector of desired names:
nnames <- c("red", "blue", "green", "yellow")
new_df <- t(sapply(seq_len(nrow(df)),
function(x)(nnames %in% df[x, ]) * 1))
colnames(new_df) <- nnames
# red blue green yellow
#1 1 1 1 0
#2 1 0 0 1
Note: if you didn't care about the order of the columns in the second table, you could generalize nnames to nnames <- unique(unlist(df[!is.na(df)]))
Data
df <- read.table(text = "primary_color secondary_color tertiary_color
red blue green
yellow red NA", h = TRUE)

Using outer.
uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
# red blue green yellow
# [1,] 1 1 1 0
# [2,] 1 0 0 1
Data:
dat <- structure(list(primary_color = c("red", "yellow"), secondary_color = c("blue",
"red"), tertiary_color = c("green", NA)), class = "data.frame", row.names = c(NA,
-2L))

in base R:
table(row(df), as.matrix(df))
blue green red yellow
1 1 1 1 0
2 0 0 1 1
If you want it as a data.frame:
as.data.frame.matrix(table(row(df), as.matrix(df)))
blue green red yellow
1 1 1 1 0
2 0 0 1 1
If the same color can appear in several columns of the same row, convert the counts to 0/1:
+(table(row(df), as.matrix(df))>0)
blue green red yellow
1 1 1 1 0
2 0 0 1 1
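The difference between the raw counts and the 0/1 version only shows up when a row repeats a color. A minimal sketch with a hypothetical data frame df_rep (not from the question), in which row 1 contains red twice:

```r
# Hypothetical data: "red" appears in two columns of row 1
df_rep <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("red", "blue"),
  tertiary_color  = c("green", NA)
)

# Raw counts per row and color; NA is dropped by table() by default
counts <- table(row(df_rep), as.matrix(df_rep))
counts["1", "red"]
# 2

# Collapse counts to presence/absence
onehot <- +(counts > 0)
onehot["1", "red"]
# 1
```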

Using mtabulate
library(qdapTools)
mtabulate(as.data.frame(t(df1)))
blue green red yellow
V1 1 1 1 0
V2 1 0 1 1
Or with base R
table(c(row(df1)), unlist(df1))
blue green red yellow
1 1 1 1 0
2 1 0 1 1

add columns by count of each group [duplicate]

I would like to add new columns holding the count of each group in type. My dataframe is like this:
# color type
# black chair
# black chair
# black sofa
# pink plate
# pink chair
# red sofa
# red plate
I am looking for something like:
# color chair sofa plate
# black 2 1 0
# pink 1 0 1
# red 0 1 1
I used table(df$color, df$type), but the result has no name for the 'color' column.

We may use table from base R
table(df)
Or with pivot_wider
library(tidyr)
pivot_wider(df, names_from = type, values_from = type,
values_fn = length, values_fill = 0)
# A tibble: 3 × 4
color chair sofa plate
<chr> <int> <int> <int>
1 black 2 1 0
2 pink 1 0 1
3 red 0 1 1
Or with dcast
library(data.table)
dcast(setDT(df), color ~ type, value.var = 'type', length, fill = 0)
data
df <- structure(list(color = c("black", "black", "black", "pink", "pink",
"red", "red"), type = c("chair", "chair", "sofa", "plate", "chair",
"sofa", "plate")), class = "data.frame", row.names = c(NA, -7L
))
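If the base R table route is preferred but a named color column is needed, one possible sketch is to convert the table to a data frame and move the row names into a column. The names counts and wide are illustrative, and the type columns come out in alphabetical order rather than the order shown in the question.

```r
df <- data.frame(
  color = c("black", "black", "black", "pink", "pink", "red", "red"),
  type  = c("chair", "chair", "sofa", "plate", "chair", "sofa", "plate")
)

# Cross-tabulate, then keep the table layout when converting to a data frame
counts <- as.data.frame.matrix(table(df$color, df$type))

# Move the row names (the colors) into a proper column
wide <- cbind(color = rownames(counts), counts)
rownames(wide) <- NULL
wide
#   color chair plate sofa
# 1 black     2     0    1
# 2  pink     1     1    0
# 3   red     0     1    1
```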

tidyr join an ID table with main table across multiple columns

This seems like a very basic operation, but my searches are not finding a simple solution.
As an example of what I am trying to do, consider the following two data frames from a database.
First an ID table that assigns an index to a color name:
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
ColorID
# A tibble: 4 x 2
ID Name
<int> <chr>
1 1 Red
2 2 Green
3 3 Blue
4 4 Black
Next some table that points to these color indexes (instead of storing text strings):
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
Widgets
# A tibble: 6 x 4
Front Back Top Bottom
<dbl> <dbl> <dbl> <dbl>
1 1 4 4 1
2 3 4 3 2
3 4 3 2 3
4 2 3 1 4
5 1 1 2 3
6 1 2 3 2
Now I just want to join the two tables to substitute the index values with the actual color names, so what I want is:
Joined <- tibble(Front = c("Red", "Blue", "Black", "Green", "Red","Red"),
Back = c("Black", "Black", "Blue","Blue", "Red", "Green"),
Top = c("Black","Blue", "Green", "Red", "Green", "Blue"),
Bottom = c("Red", "Green", "Blue", "Black", "Blue","Green"))
Joined
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
I've tried many iterations with no success. What I thought would work is something like:
J <- Widgets %>% inner_join(ColorID, by = c(. = "ID"))
I can tackle this column by column by using one variable at a time, e.g.
J <- Widgets %>% inner_join(ColorID, by = c("Front" = "ID"))
Which doesn't replace "Front", but instead creates a new "Name" column. Seems like there has to be a simple solution to this though. Thanks.

There is no need for join functions:
library(dplyr)
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
# reorder so that row number and ID are different
ColorID <- ColorID[c(2, 1, 4, 3), ]
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
check_id <- function(col) {
  ColorID$Name[match(col, ColorID$ID)]
}
Widgets %>%
  mutate(across(everything(), check_id))
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
(Edited) mutate with across applies check_id to every column of Widgets. Inside check_id, match looks up each number from Widgets in the ColorID$ID column and returns the row of ColorID where it was found; that row index then extracts the matching name from ColorID$Name.
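The lookup can be traced on a single column. This is a small base R sketch using the reordered ColorID from the answer and the Front values from the question; pos is an illustrative name.

```r
# ColorID after the reordering used in the answer
ColorID <- data.frame(
  ID   = c(2, 1, 4, 3),
  Name = c("Green", "Red", "Black", "Blue")
)
front <- c(1, 3, 4, 2, 1, 1)

# For each value in front, the row of ColorID whose ID equals it
pos <- match(front, ColorID$ID)
pos
# 2 4 3 1 2 2

# Use those rows to pull out the names
ColorID$Name[pos]
# "Red" "Blue" "Black" "Green" "Red" "Red"
```

Because match works on the ID values rather than on row positions, the result does not depend on how ColorID is sorted.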

Does this work:
library(dplyr)
library(tidyr)
Widgets %>% pivot_longer(everything()) %>%
inner_join(ColorID, by = c('value' = 'ID')) %>% select(-value) %>%
pivot_wider(names_from = name, values_from = Name) %>% unnest(everything())
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green

Assign unique ID based on values in EITHER of two columns

This is not a duplicate of this question. Please read questions entirely before labeling duplicates.
I have a data.frame like so:
library(tidyverse)
tibble(
color = c("blue", "blue", "red", "green", "purple"),
shape = c("triangle", "square", "circle", "hexagon", "hexagon")
)
color shape
<chr> <chr>
1 blue triangle
2 blue square
3 red circle
4 green hexagon
5 purple hexagon
I'd like to add a group_id column like this:
color shape group_id
<chr> <chr> <dbl>
1 blue triangle 1
2 blue square 1
3 red circle 2
4 green hexagon 3
5 purple hexagon 3
The difficulty is that I want to group by unique values of color or shape. I suspect the solution might be to use list-columns, but I can't figure out how.

We can use duplicated in base R
df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))
-output
df1
# A tibble: 5 x 3
# color shape group_id
# <chr> <chr> <int>
#1 blue triangle 1
#2 blue square 1
#3 red circle 2
#4 green hexagon 3
#5 purple hexagon 3
Or using tidyverse
library(dplyr)
library(purrr)
df1 %>%
mutate(group_id = map(., duplicated) %>%
reduce(`|`) %>%
`!` %>%
cumsum)
data
df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
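To see why the one-liner works, the steps can be unrolled as below. This is a sketch with illustrative intermediate names (dup_color, dup_shape, seen_before). It also shows a limitation: the running count assumes that rows sharing a value are adjacent, as they are in the question's data.

```r
df1 <- data.frame(
  color = c("blue", "blue", "red", "green", "purple"),
  shape = c("triangle", "square", "circle", "hexagon", "hexagon")
)

dup_color <- duplicated(df1$color)   # FALSE TRUE FALSE FALSE FALSE
dup_shape <- duplicated(df1$shape)   # FALSE FALSE FALSE FALSE TRUE

# TRUE when either value was already seen in an earlier row
seen_before <- dup_color | dup_shape

# Start a new group at every row where nothing was seen before
group_id <- cumsum(!seen_before)
group_id
# 1 1 2 3 3

# Hypothetical counter-example: the second "blue" row is not adjacent to the first
shuffled <- data.frame(
  color = c("blue", "red", "blue"),
  shape = c("triangle", "circle", "square")
)
shuffled_id <- cumsum(!(duplicated(shuffled$color) | duplicated(shuffled$shape)))
shuffled_id
# 1 2 2  (row 3 joins the "red" group, not the "blue" one)
```

If linked rows can be scattered, sorting first or a graph-based grouping would be needed; that is outside what the answer covers.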

Using ifelse within mutate and handling NA's

thanks for your time.
I have a question about using ifelse within the mutate function. ifelse is from base R, while mutate is from the dplyr package.
My question is about how ifelse handles NA values.
I have two character vectors:
example_character_vector contains some words and occasional NA values while the other vector, color_indicator, contains only the words Green, Yellow, and Red.
I want to mutate my dataframe example_data_frame to create a new override_color_indicator variable that converts some of the yellows to greens depending on a condition in the example_character_vector.
Example data:
example_character_vector <- c("Basic", NA, "Full", "None", NA, "None",
NA)
color_indicator <- c("Green", "Green", "Yellow", "Yellow", "Yellow",
"Red", "Red")
example_data_frame <- data.frame(example_character_vector,
color_indicator)
This example_data_frame looks like so:
example_character_vector color_indicator
1 Basic Green
2 <NA> Green
3 Full Yellow
4 None Yellow
5 <NA> Yellow
6 None Red
7 <NA> Red
I am using nested ifelse statements within mutate to create a new column called override_color_indicator.
If color_indicator is yellow and the example_character_vector contains the word "Full", I want the override_color_indicator to be Green (this is a special case within my data). Otherwise, I would like the override_color_indicator to be exactly the same as the color_indicator.
Here is my mutate:
example_data_frame <- example_data_frame %>%
  mutate(override_color_indicator =
    ifelse(color_indicator == "Green",
      "Green",
      ifelse(color_indicator == "Yellow" &
          str_detect(example_character_vector, "Full"),
        "Green",
        ifelse(color_indicator == "Yellow" &
            !str_detect(example_character_vector, "Full") |
            color_indicator == "Yellow" &
            is.na(example_character_vector),
          "Yellow",
          "Red"))))
(Apologies for the formatting - I tried to format this the best I could for Stack Overflow.)
This above code produces this dataframe:
example_character_vector color_indicator override_color_indicator
1 Basic Green Green
2 <NA> Green Green
3 Full Yellow Green
4 None Yellow Yellow
5 <NA> Yellow <NA>
6 None Red Red
7 <NA> Red Red
My problem here is that in row 5, an NA is introduced in the override_color_indicator column. Instead of an NA, I would like it to be "Yellow".
For clarity, this is my desired dataframe:
example_character_vector color_indicator override_color_indicator
1 Basic Green Green
2 <NA> Green Green
3 Full Yellow Green
4 None Yellow Yellow
5 <NA> Yellow Yellow
6 None Red Red
7 <NA> Red Red
I've looked quite a bit for an answer, and couldn't find one anywhere. I could just create a workaround and go back and manually assign the entries to Yellow, but I don't love that option from a programmatic standpoint.
Also, I'm just kind of curious as to why this behavior happens. I've run into this problem a few times now.
Thanks for your time!

You should use case_when here, but the reason you are getting NA is the second ifelse. NA propagates through logical operators in R only when, as the docs put it, "the result will be NA if the outcome is ambiguous". The expression below is FALSE whatever the missing value would have been, so there is no ambiguity:
NA & FALSE
#> [1] FALSE
but here the result could be TRUE or FALSE depending on the missing value, so the NA propagates:
NA & TRUE
#> [1] NA
Row 5 has TRUE for Yellow but str_detect will return NA, so ifelse returns NA. You can get around this by adding & !is.na(example_character_vector) in that line:
library(tidyverse)
example_data_frame <- structure(list(example_character_vector = c("Basic", NA, "Full", "None", NA, "None", NA), color_indicator = c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")), class = "data.frame", row.names = c(NA, -7L))
example_data_frame %>%
  mutate(
    override_color_indicator =
      ifelse(
        color_indicator == "Green",
        "Green",
        ifelse(
          color_indicator == "Yellow" &
            str_detect(example_character_vector, "Full") & !is.na(example_character_vector),
          "Green",
          ifelse(
            color_indicator == "Yellow" &
              (!str_detect(example_character_vector, "Full") | is.na(example_character_vector)),
            "Yellow",
            "Red"
          )
        )
      )
  )
#> example_character_vector color_indicator override_color_indicator
#> 1 Basic Green Green
#> 2 <NA> Green Green
#> 3 Full Yellow Green
#> 4 None Yellow Yellow
#> 5 <NA> Yellow Yellow
#> 6 None Red Red
#> 7 <NA> Red Red
But definitely use case_when!
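The NA behavior can also be sidestepped in base R. The sketch below is an alternative to the answer's fix, not part of it: it assumes an exact match on "Full" is acceptable (instead of str_detect's substring match) and uses %in%, which returns FALSE rather than NA for missing values. The names cond_eq, cond_in, and override are illustrative.

```r
color_indicator <- c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")
example_character_vector <- c("Basic", NA, "Full", "None", NA, "None", NA)

# == propagates NA: row 5 is TRUE & NA, which is NA
cond_eq <- color_indicator == "Yellow" & example_character_vector == "Full"
cond_eq
# FALSE FALSE TRUE FALSE NA FALSE FALSE

# %in% never returns NA: row 5 is TRUE & FALSE, which is FALSE
cond_in <- color_indicator == "Yellow" & example_character_vector %in% "Full"
cond_in
# FALSE FALSE TRUE FALSE FALSE FALSE FALSE

# With an NA-free condition, a single ifelse is enough
override <- ifelse(cond_in, "Green", color_indicator)
override
# "Green" "Green" "Green" "Yellow" "Yellow" "Red" "Red"
```

Rows 2 and 7 are FALSE even with ==, because FALSE & NA is unambiguous; only row 5 pairs a TRUE with an NA.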

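Try this instead. case_when is a more flexible vectorised if, and it lets you use TRUE to say "else, use the value in color_indicator". A condition that evaluates to NA, as in row 5, is treated as not matching, so that row falls through to the TRUE branch.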
example_data_frame <- structure(list(example_character_vector = c("Basic", NA, "Full", "None", NA, "None", NA), color_indicator = c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")), class = "data.frame", row.names = c(NA, -7L))
library(dplyr)
example_data_frame %>%
mutate(x = case_when(color_indicator == "Yellow" &
example_character_vector == "Full" ~ "Green",
TRUE ~ color_indicator))
#> example_character_vector color_indicator x
#> 1 Basic Green Green
#> 2 <NA> Green Green
#> 3 Full Yellow Green
#> 4 None Yellow Yellow
#> 5 <NA> Yellow Yellow
#> 6 None Red Red
#> 7 <NA> Red Red
