Select groups with row containing specific value (with dplyr and pipes) - r

I'm trying to select groups in a grouped df that contain a specific string on a specific row within each group.
Consider the following df:
df <- data.frame(id = c(rep("id_1", 4),
rep("id_2", 4),
rep("id_3", 4)),
string = c("here",
"is",
"some",
"text",
"here",
"is",
"other",
"text",
"there",
"are",
"final",
"texts"))
I want to create a dataframe that contains just the groups that have the word "is" on the second row.
Here is some incorrect code:
desired_df <- df %>% group_by(id) %>%
filter(slice(select(., string), 2) %in% "is")
Here is the desired output:
desired_df <- data.frame(id = c(rep("id_1", 4),
rep("id_2", 4)),
string = c("here",
"is",
"some",
"text",
"here",
"is",
"other",
"text"))
I've looked here but this doesn't solve my issue because this finds groups with any occurrence of the specified string.
I could also do some sort of separate code where I identify the ids and then use that to subset the original df, like so:
ids <- df %>% group_by(id) %>% slice(2) %>% filter(string %in% "is") %>% select(id)
desired_df <- df %>% filter(id %in% ids$id)
But I'm wondering if I can do something simpler within a single pipe series.
Help appreciated!

After grouping by 'id', take the second element of 'string' and test it with %in%, keeping "is" on the left-hand side so that the test returns a single TRUE or FALSE per group. filter() then keeps or drops each group as a whole.
library(dplyr)
df %>%
group_by(id) %>%
filter('is' %in% string[2]) %>%
ungroup
-output
# A tibble: 8 x 2
# id string
# <chr> <chr>
#1 id_1 here
#2 id_1 is
#3 id_1 some
#4 id_1 text
#5 id_2 here
#6 id_2 is
#7 id_2 other
#8 id_2 text
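For readers without dplyr, the same "look at row 2 of each group" test can be sketched in base R. The small data frame below is made up for illustration and includes a one-row group to show what happens when there is no second row; this is only a sketch of the idea, not part of the original answer.

```r
# Hypothetical example data: group "b" has only one row
df <- data.frame(id = c("a", "a", "a", "b", "c", "c"),
                 string = c("x", "is", "y", "is", "is", "z"),
                 stringsAsFactors = FALSE)

# For every row, look up the second string of that row's group
# (NA when the group has fewer than two rows)
second <- ave(df$string, df$id, FUN = function(s) s[2])

# %in% treats NA as "no match", so short groups are dropped
kept <- df[second %in% "is", ]
kept
```

Only group "a" survives here: group "b" has "is" on its first row but no second row, and group "c" has "z" on its second row.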

Related

Filter based on different conditions at different positions in a string in R

The middle part of the string is the ID, and I want only one occurrence of each ID. If there is more than one observation with the same six middle letters, I need to keep the one that says "07" rather than "08", or "A" rather than "B". I want to completely exclude if the number is "02". Other than that, if there is only one occurrence of the ID, I want to keep it. So if I had:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-XDUMNG-08B-01
ID-1-LOFBUM-02A-01
ID-1-ABYEMJ-08A-01
ID-1-ABYEMJ-08B-01
Then I would want:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-ABYEMJ-08A-01
I am thinking maybe I can use group_by to specify the 6 letter ID, and then some kind of if_else statement? But I can't figure out how to specify the positions of the characters in the string. Any help is greatly appreciated!
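Using extract and some dplyr wrangling (df1 is the data frame defined under "data" below):
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, "ID-\\d-(.*)-(\\d*)(A|B)-01",
into = c("ID", "number", "letter"),
remove = FALSE, convert = TRUE) %>%
group_by(ID) %>%
filter(number != 2) %>%
# sort so the lowest number, then the lowest letter, comes first
arrange(number, letter) %>%
# keep the first row of each ID
slice_head(n = 1) %>%
ungroup() %>%
select(col1)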
# col1
#1 ID-1-ABYEMJ-08A-01
#2 ID-1-AMBCFG-07A-01
#3 ID-1-CGUMBD-08A-01
#4 ID-1-XDUMNG-07B-01
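An option with str_detect. Note that row_number() == 1 keeps the first row of each ID as it appears in the data, so this relies on the rows already being ordered with the preferred code first.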
library(stringr)
library(dplyr)
df1 %>%
group_by(ID = str_extract(col1, "ID-\\d+-\\w+")) %>%
filter(str_detect(col1, "02", negate = TRUE), row_number() == 1) %>%
ungroup %>%
select(-ID)
-output
# A tibble: 4 × 1
col1
<chr>
1 ID-1-AMBCFG-07A-01
2 ID-1-CGUMBD-08A-01
3 ID-1-XDUMNG-07B-01
4 ID-1-ABYEMJ-08A-01
data
df1 <- structure(list(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
"ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01",
"ID-1-ABYEMJ-08A-01", "ID-1-ABYEMJ-08B-01")), class = "data.frame",
row.names = c(NA,
-7L))
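The question asks how to address positions inside the string. As a rough base R sketch (not taken from either answer above), the fields can be split on "-" and the two-digit number read with substr(). The names parts, d and result are made up for this illustration, and it assumes every value has the same five dash-separated fields.

```r
col1 <- c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01", "ID-1-XDUMNG-07B-01",
          "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01", "ID-1-ABYEMJ-08A-01",
          "ID-1-ABYEMJ-08B-01")

# One row per string, one column per dash-separated field
parts <- do.call(rbind, strsplit(col1, "-", fixed = TRUE))
d <- data.frame(col1 = col1,
                id = parts[, 3],    # six-letter ID
                code = parts[, 4],  # e.g. "07A"
                stringsAsFactors = FALSE)

# Drop codes whose first two characters are "02"
d <- d[substr(d$code, 1, 2) != "02", ]

# Sort so "07" precedes "08" and "A" precedes "B" within each ID,
# then keep the first row per ID
d <- d[order(d$id, d$code), ]
result <- d[!duplicated(d$id), "col1"]
result
```

Because the rows are sorted explicitly, this sketch does not depend on the original row order.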

Replace values in dataframe based on other dataframe with column name and value

Let's say I have a dataframe of scores
library(dplyr)
id <- c(1 , 2)
name <- c('John', 'Ninaa')
score1 <- c(8, 6)
score2 <- c(NA, 7)
df <- data.frame(id, name, score1, score2)
Some mistakes have been made so I want to correct them. My corrections are in a different dataframe.
id <- c(2,1)
column <- c('name', 'score2')
new_value <- c('Nina', 9)
corrections <- data.frame(id, column, new_value)
I want to search the dataframe for the correct id and column and change the value.
I have tried something with match but I don't know how mutate the correct column.
df %>% mutate(corrections$column = replace(corrections$column, match(corrections$id, id), corrections$new_value))
We could join by 'id', then mutate across the columns named in 'column', replacing the element in the row whose 'column' value matches the current column name (cur_column()) with the corresponding 'new_value'
library(dplyr)
df %>%
left_join(corrections) %>%
mutate(across(all_of(column), ~ replace(.x, match(cur_column(),
column), new_value[match(cur_column(), column)]))) %>%
select(names(df))
-output
id name score1 score2
1 1 John 8 9
2 2 Nina 6 7
This is one way to implement the idea with dplyr::rows_update, although it pulls in functions from several packages. In practice I would prefer a more parsimonious approach.
library(tidyverse)
corrections %>%
group_by(id) %>%
group_map(
~ pivot_wider(.x, names_from = column, values_from = new_value) %>% type_convert,
.keep = TRUE) %>%
reduce(rows_update, by = 'id', .init = df)
# id name score1 score2
# 1 1 John 8 9
# 2 2 Nina 6 7
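For comparison, the "find the id and column, then change the value" step can also be written as a plain base R loop. This is only a sketch under the assumption that each correction targets an existing id and column; fixed, row and col are names introduced here for illustration.

```r
df <- data.frame(id = c(1, 2), name = c("John", "Ninaa"),
                 score1 = c(8, 6), score2 = c(NA, 7),
                 stringsAsFactors = FALSE)
# new_value is character because c('Nina', 9) coerces 9 to "9"
corrections <- data.frame(id = c(2, 1), column = c("name", "score2"),
                          new_value = c("Nina", "9"),
                          stringsAsFactors = FALSE)

fixed <- df
for (i in seq_len(nrow(corrections))) {
  row <- match(corrections$id[i], fixed$id)  # row holding this id
  col <- corrections$column[i]               # column to correct
  # Convert the replacement to the class of the target column
  # so that score2 stays numeric
  fixed[row, col] <- methods::as(corrections$new_value[i], class(fixed[[col]]))
}
fixed
```

Converting each value to the target column's class keeps score2 numeric, whereas assigning the character "9" directly would turn the whole column into character.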

Rowwise find most frequent term in dataframe column and count occurrences

I am trying to find the most frequent category within every row of a dataframe. A category can consist of multiple words separated by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
"apple,NA,apple,chocolate",
"shoes/socks,NA,NA,NA",
"apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
rowwise() %>%
mutate(winner = names(which.max(table(categories %>% str_split(",")))),
winner_count = which.max(table(categories %>% str_split(",")))[[1]])
Also tried to follow this approach, however it also does not give me the required results
How to find the most repeated word in a vector with R
trial2 <- df %>%
mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.
Split the categories on commas into separate rows, count the occurrences of each category per id, drop the NA values, and select the top-occurring row for each id
library(dplyr)
library(tidyr)
df %>%
separate_rows(categories, sep = ',') %>%
count(id, categories, name = 'winner_count') %>%
filter(categories != 'NA') %>%
group_by(id) %>%
slice_max(winner_count, n = 1, with_ties = FALSE) %>%
ungroup %>%
rename(winner = categories) %>%
left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate
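The same counting logic can be sketched row by row in base R with table(). The helper top_category() below is a hypothetical name, and the sketch assumes every row has at least one non-NA category; ties go to the alphabetically first category, since table() sorts its names.

```r
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate",
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")

# Hypothetical helper: most frequent non-NA category in one string
top_category <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # split on commas only
  counts <- table(parts[parts != "NA"])         # drop the literal "NA" entries
  best <- which.max(counts)                     # first maximum wins ties
  data.frame(winner = names(counts)[best],
             winner_count = as.integer(counts[best]),
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, lapply(categories, top_category))
res
```

Splitting only on "," is what keeps multi-word categories such as "shoes/socks" intact.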

How to get unique occurrences of these character strings separated by ";"?

So I have a column with values in this structure:
library(tibble)
df <- tribble(
~col,
"AA_BB;AA_AA;AA_BB",
"BB_BB;AA_AA",
"AA_BB",
"BB_AA;BB_AA;AA_AA;BB_AA"
)
So each row has items separated by a ";". The first row has items AA_BB, AA_AA and AA_BB. I want the first row to be transformed to "AA_BB;AA_AA" and the last row to be transformed to "BB_AA;AA_AA".
I thought about using separate but the result didn't really help me (especially since I don't know how many columns there can be at most).
df %>%
separate(col, into = c("A", "B", "C", "D"), sep = ";")
Any tips on how to do this?
We can split the column, get the unique elements and paste
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(col = map_chr(strsplit(col, ";"), ~ str_c(unique(.x), collapse=";")))
-output
# A tibble: 4 x 1
# col
# <chr>
#1 AA_BB;AA_AA
#2 BB_BB;AA_AA
#3 AA_BB
#4 BB_AA;AA_AA
Or split with separate_rows, keep the distinct rows, then paste the values back together for each original row
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(col, sep=";") %>%
distinct %>%
group_by(rn) %>%
summarise(col = str_c(col, collapse=";"), .groups = 'drop') %>%
select(col)
In base R, you can split the string on semi-colon, keep only unique strings and paste them together.
df$col1 <- sapply(strsplit(df$col, ';'), function(x)
paste0(unique(x), collapse = ';'))
df
# A tibble: 4 x 2
# col col1
# <chr> <chr>
#1 AA_BB;AA_AA;AA_BB AA_BB;AA_AA
#2 BB_BB;AA_AA BB_BB;AA_AA
#3 AA_BB AA_BB
#4 BB_AA;BB_AA;AA_AA;BB_AA BB_AA;AA_AA

Add a grouping variable based on ranked data

Consider the following dataframe:
name <- c("Sally", "Dave", "Aaron", "Jane", "Michael")
rank <- c(1,2,1,2,3)
df <- data.frame(name, rank, stringsAsFactors = FALSE)
I'd like to create a grouping variable (event) based on the rank column, as such:
event <- c("Hurdles", "Hurdles", "Long Jump", "Long Jump", "Long Jump")
df_desired <- data.frame(name, rank, event, stringsAsFactors = FALSE)
There are lots of examples of going the other way (making a ranking variable based on a group) but I can't seem to find one doing what I'd like.
It's possible to use filter, full_join and then fill as shown below, but is there a simpler way?
library(tidyverse)
df <- df %>%
mutate(order = row_number())
df_1 <- df %>%
filter(rank == 1)
df_1$event <- c("Hurdles", "Long Jump")
df %>%
filter(rank != 1) %>%
mutate(event = as.character(NA)) %>%
full_join(df_1, by = c("order", "name", "rank", "event")) %>%
arrange(order) %>%
fill(event) %>%
select(-order)
We can use cumsum to create the index
library(dplyr)
df %>%
mutate(event = c("Hurdles", "Long Jump")[cumsum(rank == 1)])
# name rank event
#1 Sally 1 Hurdles
#2 Dave 2 Hurdles
#3 Aaron 1 Long Jump
#4 Jane 2 Long Jump
#5 Michael 3 Long Jump
Or in base R (just in case)
df$event <- c("Hurdles", "Long Jump")[cumsum(df$rank == 1)]
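To see why cumsum works here, it may help to look at the intermediate index. In the sketch below, group_index and the generated "event_" labels are made up for illustration; the labels show one way to proceed when the event names are not known in advance.

```r
rank <- c(1, 2, 1, 2, 3)

# rank == 1 marks the start of each event; the running total of
# those marks numbers the events in order of appearance
group_index <- cumsum(rank == 1)
group_index

# Index into known labels, or build generic ones from the index
event <- c("Hurdles", "Long Jump")[group_index]
generic_event <- paste0("event_", group_index)
```

This assumes the data are ordered so that every event begins with its rank 1 row.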
