dplyr split by semi colon in case_when - r

Suppose I have a dataframe df
library(dplyr)
df <- data.frame(ID = c(1:10), Type = c('a', 'a;b','b','a','b','b','c','a;c','b;c','c'))
And I want to add a column called color, based on the values that appear in Type. (This is just an example, in my code there are many more variations of Type, i.e. d;f, e;q,a;z etc)
df %>%
mutate(color = case_when(
Type == 'a' ~ 'red',
Type == 'b' ~ 'blue',
Type == 'c' ~ 'green',
TRUE ~ as.character(Type)
))
As this stands, it returns
ID Type color
1 1 a red
2 2 a;b a;b
3 3 b blue
4 4 a red
5 5 b blue
6 6 b blue
7 7 c green
8 8 a;c a;c
9 9 b;c b;c
10 10 c green
I am curious if there is a way to split by semicolon within the case_when(), in order to produce the output
ID Type color
1 1 a red
2 2 a;b red;blue
3 3 b blue
4 4 a red
5 5 b blue
6 6 b blue
7 7 c green
8 8 a;c red;green
9 9 b;c blue;green
10 10 c green

You can split the Type column into separate rows, map it to colors and then paste them together:
library(dplyr); library(tidyr);
df %>%
separate_rows(Type) %>%
mutate(color = case_when(
Type == 'a' ~ 'red',
Type == 'b' ~ 'blue',
Type == 'c' ~ 'green',
TRUE ~ as.character(Type)
)) %>%
group_by(ID) %>%
summarise(across(everything(), ~ paste0(.x, collapse = ";")))
# A tibble: 10 x 3
# ID Type color
# <int> <chr> <chr>
# 1 1 a red
# 2 2 a;b red;blue
# 3 3 b blue
# 4 4 a red
# 5 5 b blue
# 6 6 b blue
# 7 7 c green
# 8 8 a;c red;green
# 9 9 b;c blue;green
#10 10 c green
Instead of case_when, you can also store the character-to-color mapping in a named vector and look the colors up by name (types missing from the vector come back as NA):
map <- c(a = 'red', b = 'blue', c = 'green')
df %>%
separate_rows(Type) %>%
mutate(color = map[Type]) %>%
group_by(ID) %>%
summarise(across(everything(), ~ paste0(.x, collapse = ";")))
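If reshaping into one row per token is not wanted, the same lookup can be sketched in base R by splitting each string, translating the pieces, and pasting them back together. The helper name map_tokens and the vector name color_map are made up for this illustration; tokens that are not in the lookup are left unchanged, mirroring the TRUE ~ as.character(Type) fallback in the question.

```r
# Hypothetical lookup table: names are Type tokens, values are colors.
color_map <- c(a = "red", b = "blue", c = "green")

# Hypothetical helper: translate each semicolon-separated string token by token.
map_tokens <- function(x, lookup, sep = ";") {
  vapply(strsplit(x, sep, fixed = TRUE), function(tokens) {
    mapped <- unname(lookup[tokens])
    # Tokens without an entry in the lookup come back as NA; keep them as-is.
    missing <- is.na(mapped)
    mapped[missing] <- tokens[missing]
    paste(mapped, collapse = sep)
  }, character(1))
}

df <- data.frame(ID = 1:10,
                 Type = c("a", "a;b", "b", "a", "b", "b", "c", "a;c", "b;c", "c"),
                 stringsAsFactors = FALSE)
df$color <- map_tokens(df$Type, color_map)
print(df)
```

Because the function is vectorised over x, it can also be called inside mutate(), for example mutate(color = map_tokens(Type, color_map)).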

Related

Adding values to one column based on conditions

I would like to update one column based on 2 columns
My example dataframe contains 3 columns
df <- data.frame(n1 = c(1,2,1,2,5,6),
n2 = c("a", "a", "a", NA, "b", "c"),
n3 = c("red", "red", NA, NA, NA, NA))
df
n1 n2 n3
1 1 a red
2 2 a red
3 1 a <NA>
4 2 <NA> <NA>
5 5 b <NA>
6 6 c <NA>
I would like to fill in red in rows 3 and 4. The rule is: if a value of n1 (here 1 and 2) occurs together with n2 == "a", then every row with that n1 value should get red in n3, even when n2 itself is missing in that row (as in row 4).
In other words, find the n1 values associated with "a", and set n3 to red in all rows that have those n1 values.
My desired output
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
Any suggestions for this case? I hope my explanation is clear enough. Since my data is very long, I am trying to find a good way to handle it.
In base R, create a logical vector that selects the rows of 'df' whose 'n1' value occurs anywhere together with 'n2' equal to "a", then assign the first non-NA element of 'n3' among those rows to all of them:
i1 <- with(df, n1 %in% unique(n1[n2 %in% 'a']))
df$n3[i1] <- na.omit(df$n3[i1])[1]
-output
> df
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
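The assignment above writes one single value into every selected row. If different n1 values could carry different n3 values, a per-n1 fill may be safer. The following is only a sketch under that assumption; first_non_na and fill_linked are hypothetical helper names, not part of the answer.

```r
df <- data.frame(n1 = c(1, 2, 1, 2, 5, 6),
                 n2 = c("a", "a", "a", NA, "b", "c"),
                 n3 = c("red", "red", NA, NA, NA, NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: replace NAs with the first known value, if there is one.
first_non_na <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) == 0) return(x)
  x[is.na(x)] <- known[1]
  x
}

# Hypothetical helper: fill n3 within each n1 value, but only for n1 values
# that occur together with the given n2 value somewhere in the data.
fill_linked <- function(data, key_value = "a") {
  linked <- data$n1 %in% unique(data$n1[data$n2 %in% key_value])
  data$n3[linked] <- ave(data$n3[linked], data$n1[linked], FUN = first_non_na)
  data
}

result <- fill_linked(df)
print(result)
```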
Update:
df %>%
mutate(group = rep(row_number(), each=2, length.out = n())) %>%
group_by(group) %>%
mutate(n3 = ifelse(n1 %in% c(1,2) & any(n2 %in% "a", na.rm = TRUE), "red", n3)) %>%
ungroup() %>%
select(-group)
We could use an ifelse statement with conditions defined using any.
library(dplyr)
df %>%
mutate(n3 = ifelse((n1 == 1 | n1 == 2) & any(n2[3:4] %in% "a"), "red", n3))
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
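A side note on combining tests inside ifelse(): in R, & binds more tightly than |, so a | b & c is evaluated as a | (b & c). With the sample data the grouping does not change the result because the any() test is TRUE, but it matters as soon as that test is FALSE. A small sketch with made-up values (flag stands in for the any() test):

```r
n1 <- c(1, 2, 5)
flag <- FALSE  # stands in for any(n2 %in% "a") when no "a" is present

# Without parentheses this is parsed as n1 == 1 | (n1 == 2 & flag).
loose <- n1 == 1 | n1 == 2 & flag

# With parentheses both n1 tests are gated by the flag.
strict <- (n1 == 1 | n1 == 2) & flag

print(loose)
print(strict)
```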
library(dplyr)
library(tidyr)
df %>%
group_by(n1) %>%
fill(n3) %>%
group_by(n2) %>%
fill(n3)
# # A tibble: 6 × 3
# # Groups: n2 [4]
# n1 n2 n3
# <dbl> <chr> <chr>
# 1 1 a red
# 2 2 a red
# 3 1 a red
# 4 2 NA red
# 5 5 b NA
# 6 6 c NA

Sort/arrange within group for only chosen groups

I would like to sort/arrange data by group. That's easy enough. However, I only want to sort values within specific groups, not all groups.
I found one possible instance of a similar question at the link. But I found it to be confusing due to the framing of the question by the OP.
Arrange values within a specific group
Sample data:
df <- data.frame(var = c("apple", "banana", "eggplant", "carrot", "dill", "fava", "garlic"),
grp = c("A", "A", "B", "B", "B", "C", "C"),
val = c(4, 2, 1, 3, 7, 6, 2))
df
# var grp val
# 1 apple A 4
# 2 banana A 2
# 3 eggplant B 1
# 4 carrot B 3
# 5 dill B 7
# 6 fava C 6
# 7 garlic C 2
Desired output:
# var grp val
# 1 apple A 4
# 2 banana A 2
# 3 eggplant B 1
# 4 carrot B 3
# 5 dill B 7
# 6 garlic C 2
# 7 fava C 6
Partial solution:
library(dplyr)
df %>%
group_by(grp) %>%
arrange(val, .by_group = TRUE)
This of course sorts for all groups. How do I get it to sort for only the groups I would like sorted, which are "B" and "C"? I would like a tidyverse solution but feel free to post a base solution as well.
We can flip the sign of the elements in 'val' that belong to group "A", so that this group is ordered in the opposite direction from the 'val' elements in the other groups
library(dplyr)
df %>%
arrange(grp, val * c(1, -1)[(grp == 'A') + 1])
-output
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
Or, if the values for 'A' should be kept in their original order, multiply by 0 so that every 'A' value gets the same sort key
df %>%
arrange(grp, val * c(1, 0)[(grp == 'A') + 1])
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
NOTE: This is done without any group_by attribute
If we want to use the OP's way, i.e. using group_by
df %>%
group_by(grp) %>%
arrange(case_when(grp == 'A' ~ -1 * val, TRUE ~ val),
.by_group = TRUE) %>%
ungroup
-output
# A tibble: 7 x 3
var grp val
<chr> <chr> <dbl>
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
If the values in 'val' for grp 'A' only happen to be in descending order by coincidence, create a row-number column before grouping and use it as the sort key for 'A'
df %>%
mutate(rn = row_number()) %>%
group_by(grp) %>%
arrange(case_when(grp == 'A' ~ as.numeric(rn), TRUE ~ val),
.by_group = TRUE) %>%
ungroup %>%
dplyr::select(-rn)
-output
# A tibble: 7 x 3
var grp val
<chr> <chr> <dbl>
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
Or using base R
df[with(df, order(grp, c(1, 0)[(grp == 'A') + 1] * val)),]
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
7 garlic C 2
6 fava C 6
You can filter the groups you want to arrange, sort them and bind to the remaining data.
library(dplyr)
order_groups <- c('B', 'C')
df %>%
filter(grp %in% order_groups) %>%
arrange(grp, val) %>%
bind_rows(df %>%
filter(!grp %in% order_groups)) %>%
arrange(grp)
#  var grp val
#1 apple A 4
#2 banana A 2
#3 eggplant B 1
#4 carrot B 3
#5 dill B 7
#6 garlic C 2
#7 fava C 6
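The answers above share one idea: rows in the chosen groups are ordered by val, while rows in all other groups keep their original relative order. A base R sketch of that idea as a reusable function follows; sort_within and its argument names are hypothetical. Like arrange(grp, ...), it also puts the groups themselves in order of their labels.

```r
df <- data.frame(var = c("apple", "banana", "eggplant", "carrot", "dill", "fava", "garlic"),
                 grp = c("A", "A", "B", "B", "B", "C", "C"),
                 val = c(4, 2, 1, 3, 7, 6, 2),
                 stringsAsFactors = FALSE)

# Hypothetical helper: sort value_col only inside the groups listed in groups_to_sort.
sort_within <- function(data, group_col, value_col, groups_to_sort) {
  pos <- seq_len(nrow(data))
  # Sorted groups use the value as key; other groups use the row position,
  # which preserves their original order.
  key <- ifelse(data[[group_col]] %in% groups_to_sort, data[[value_col]], pos)
  # pos as last key breaks ties by original position.
  data[order(data[[group_col]], key, pos), , drop = FALSE]
}

sorted <- sort_within(df, "grp", "val", c("B", "C"))
print(sorted)
```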

How to build a summary data frame

I have a data set that looks like this:
and I would like to get a summary data set that looks like this:
What should I do? Thanks. The sample data can be built with the following code:
ID<- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18")
Group<-c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color<-c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow","Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love<-c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love)
You can use dplyr and group by the following columns:
library(dplyr)
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n())
# Group Color Realy_Love Obs
# <chr> <chr> <chr> <int>
# 1 A Green Y 2
# 2 A Yellow N 1
# 3 A Yellow Y 1
# 4 B Green Y 2
# 5 B Red N 2
# 6 B Red Y 1
# 7 B Yellow N 1
# 8 C Green N 1
# 9 C Red N 1
# 10 C Red Y 1
# 11 D Red N 1
# 12 D Red Y 1
# 13 D Yellow N 1
# 14 D Yellow Y 2
Use dplyr from the Tidyverse to get a summary. You can then use arrange() to sort by Color or another variable.
Sample.data %>%
group_by(Group, Color, Realy_Love) %>%
summarise(Obs = n()) %>%
arrange(Color)
With dplyr, you don't even need to group the columns; the count() function does it in one step:
Sample.data %>%
count(Group, Color, Realy_Love, sort = TRUE)
The optional sort = TRUE argument sorts the result in descending order of frequency:
Group Color Realy_Love n
1 A Green Y 2
2 B Green Y 2
3 B Red N 2
4 D Yellow Y 2
5 A Yellow N 1
6 A Yellow Y 1
7 B Red Y 1
8 B Yellow N 1
9 C Green N 1
10 C Red N 1
11 C Red Y 1
12 D Red N 1
13 D Red Y 1
14 D Yellow N 1
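As a cross-check that needs no packages, the same counts can be sketched in base R with aggregate(). The helper column Obs (a constant 1 per row) is added here only for the illustration; the row order of the result may differ from the dplyr output.

```r
ID <- as.character(1:18)
Group <- c("A","B","C","D","D","D","A","B","D","C","B","D","A","A","C","B","B","B")
Color <- c("Green","Yellow","Red","Red","Red","Yellow","Green","Green","Yellow",
           "Red","Red","Yellow","Yellow","Yellow","Green","Red","Red","Green")
Realy_Love <- c("Y","N","Y","Y","N","N","Y","Y","Y","N","N","Y","N","Y","N","Y","N","Y")
Sample.data <- data.frame(ID, Group, Color, Realy_Love, stringsAsFactors = FALSE)

# Give every row a weight of 1, then sum the weights per combination.
counts <- aggregate(Obs ~ Group + Color + Realy_Love,
                    data = transform(Sample.data, Obs = 1L),
                    FUN = sum)
print(counts)
```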

R variable number of string concatenations within group_by

Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
1. group_by(Group)
2. count rows (I assume with length(unique(ID)))
3. mutate or summarize into a new column with a count of each color in the group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach: count the frequency of each 'Group'/'Color' pair with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', build the 'Summary' column by pasting the group size (n()) together with the unique elements of 'nColor'
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
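Both answers reduce to the same step: for each group, count every distinct color and paste the counts after the group size. A dependency-free sketch of that step, assuming Color is stored as character; summarise_colors is a hypothetical helper name.

```r
df <- data.frame(ID = 1:6,
                 Group = c("a", "a", "a", "b", "b", "b"),
                 Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: build one summary string for a vector of colors.
summarise_colors <- function(colors) {
  # Fix the level order so colors are listed in order of first appearance.
  counts <- table(factor(colors, levels = unique(colors)))
  parts <- paste(as.vector(counts), names(counts))
  paste0(length(colors), " houses, ", paste(parts, collapse = ", "))
}

# ave() recycles the single string across every row of its group.
df$Summary <- ave(df$Color, df$Group, FUN = summarise_colors)
print(df)
```

No combination of colors has to be listed by hand, so the same helper should work for any number of distinct colors per group.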

Select the n most frequent values in a variable

I would like to find the most common values in a column in a data frame. I assume using table would be the best way to do this? I then want to filter/subset my data frame to only include these top-n values.
An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.
ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange
I therefore want to output the following:
Top 2 values of ID are:
A and C
I will then select the rows corresponding to ID A and C:
ID col
A blue
A purple
A green
C red
C blue
C yellow
C orange
You can try a tidyverse approach. Add the count of each ID, then filter for the top two (using < 3) or top ten (using < 11):
library(tidyverse)
d %>%
add_count(ID) %>%
filter(dense_rank(-n) < 3)
# A tibble: 7 x 3
ID col n
<fct> <fct> <int>
1 A blue 3
2 A purple 3
3 A green 3
4 C red 4
5 C blue 4
6 C yellow 4
7 C orange 4
Data
d <- read.table(text="ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header=T)
We can count the number of values using table, sort them in decreasing order and select first 2 (or 10) values, get the corresponding ID's and subset those ID's from the data frame.
d[d$ID %in% names(sort(table(d$ID), decreasing = TRUE)[1:2]), ]
# ID col
#1 A blue
#2 A purple
#3 A green
#6 C red
#7 C blue
#8 C yellow
#9 C orange
With the tidyverse and its top_n (superseded by slice_max() since dplyr 1.0.0, but still available):
library(tidyverse)
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2)
Selecting by n()
# A tibble: 2 x 2
ID `n()`
<fct> <int>
1 A 3
2 C 4
To finish with the subset:
d %>%
group_by(ID) %>%
summarise(n()) %>%
top_n(2) %>%
{ filter(d, ID %in% .$ID) }
Selecting by n()
ID col
1 A blue
2 A purple
3 A green
4 C red
5 C blue
6 C yellow
7 C orange
(we use the braces because we don't feed the left hand side result as the first argument of the filter)
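The table-based approach can also be wrapped in a small function so that n becomes a parameter. This is only a sketch; keep_top_ids is a hypothetical name. It caps n at the number of distinct values, and ties at the cut-off are resolved by whatever order sort() returns, so they are not handled specially.

```r
d <- read.table(text = "ID col
A blue
A purple
A green
B green
B red
C red
C blue
C yellow
C orange", header = TRUE)

# Hypothetical helper: keep only rows whose value in `col` is among the n most frequent.
keep_top_ids <- function(data, col, n) {
  freq <- sort(table(data[[col]]), decreasing = TRUE)
  top <- names(freq)[seq_len(min(n, length(freq)))]
  data[data[[col]] %in% top, , drop = FALSE]
}

top2 <- keep_top_ids(d, "ID", 2)
print(top2)
```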
