Select groups with row containing specific value (with dplyr and pipes)

Select groups with row containing specific value (with dplyr and pipes) - r

I'm trying to select groups in a grouped df that contain a specific string on a specific row within each group.
Consider the following df:
df <- data.frame(id = c(rep("id_1", 4),
rep("id_2", 4),
rep("id_3", 4)),
string = c("here",
"is",
"some",
"text",
"here",
"is",
"other",
"text",
"there",
"are",
"final",
"texts"))
I want to create a dataframe that contains just the groups that have the word "is" on the second row.
Here is some incorrect code:
desired_df <- df %>% group_by(id) %>%
filter(slice(select(., string), 2) %in% "is")
Here is the desired output:
desired_df <- data.frame(id = c(rep("id_1", 4),
rep("id_2", 4)),
string = c("here",
"is",
"some",
"text",
"here",
"is",
"other",
"text"))
I've looked here but this doesn't solve my issue because this finds groups with any occurrence of the specified string.
I could also do some sort of separate code where I identify the ids and then use that to subset the original df, like so:
ids <- df %>% group_by(id) %>% slice(2) %>% filter(string %in% "is") %>% select(id)
desired_df <- df %>% filter(id %in% ids$id)
But I'm wondering if I can do something simpler within a single pipe series.
Help appreciated!

After grouping by 'id', subset the 'string' for the second element and apply %in% with "is" on the lhs of %in% to return a single TRUE per group
library(dplyr)
df %>%
group_by(id) %>%
filter('is' %in% string[2]) %>%
ungroup
-output
# A tibble: 8 x 2
# id string
# <chr> <chr>
#1 id_1 here
#2 id_1 is
#3 id_1 some
#4 id_1 text
#5 id_2 here
#6 id_2 is
#7 id_2 other
#8 id_2 text

Related

Filter based on different conditions at different positions in a string in R

The middle part of the string is the ID, and I want only one occurrence of each ID. If there is more than one observation with the same six middle letters, I need to keep the one that says "07" rather than "08", or "A" rather than "B". I want to completely exclude if the number is "02". Other than that, if there is only one occurrence of the ID, I want to keep it. So if I had:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-XDUMNG-08B-01
ID-1-LOFBUM-02A-01
ID-1-ABYEMJ-08A-01
ID-1-ABYEMJ-08B-01
Then I would want:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-ABYEMJ-08A-01
I am thinking maybe I can use group_by to specify the 6 letter ID, and then some kind of if_else statement? But I can't figure out how to specify the positions of the characters in the string. Any help is greatly appreciated!

Using extract and some dplyr wrangling:
library(tidyr)
library(dplyr)
df %>%
extract(col1, "ID-\\d-(.*)-(\\d*)(A|B)-01",
into = c("ID", "number", "letter"),
remove = FALSE, convert = TRUE) %>%
group_by(ID) %>%
filter(number != 2) %>%
slice_min(n = 1, order(number, letter)) %>%
ungroup() %>%
select(col1)
# col1
#1 ID-1-ABYEMJ-08A-01
#2 ID-1-AMBCFG-07A-01
#3 ID-1-CGUMBD-08A-01
#4 ID-1-XDUMNG-07B-01

An option with str_detect
library(stringr)
library(dplyr)
df1 %>%
group_by(ID = str_extract(col1, "ID-\\d+-\\w+")) %>%
filter(str_detect(col1, "02", negate = TRUE), row_number() == 1) %>%
ungroup %>%
select(-ID)
-output
# A tibble: 4 × 1
col1
<chr>
1 ID-1-AMBCFG-07A-01
2 ID-1-CGUMBD-08A-01
3 ID-1-XDUMNG-07B-01
4 ID-1-ABYEMJ-08A-01
data
df1 <- structure(list(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
"ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01",
"ID-1-ABYEMJ-08A-01", "ID-1-ABYEMJ-08B-01")), class = "data.frame",
row.names = c(NA,
-7L))

Replace values in dataframe based on other dataframe with column name and value

Let's say I have a dataframe of scores
library(dplyr)
id <- c(1 , 2)
name <- c('John', 'Ninaa')
score1 <- c(8, 6)
score2 <- c(NA, 7)
df <- data.frame(id, name, score1, score2)
Some mistakes have been made so I want to correct them. My corrections are in a different dataframe.
id <- c(2,1)
column <- c('name', 'score2')
new_value <- c('Nina', 9)
corrections <- data.frame(id, column, new_value)
I want to search the dataframe for the correct id and column and change the value.
I have tried something with match but I don't know how mutate the correct column.
df %>% mutate(corrections$column = replace(corrections$column, match(corrections$id, id), corrections$new_value))

We could join by 'id', then mutate across the columns specified in the column and replace the elements based on the matching the corresponding column name (cur_column()) with the column
library(dplyr)
df %>%
left_join(corrections) %>%
mutate(across(all_of(column), ~ replace(.x, match(cur_column(),
column), new_value[match(cur_column(), column)]))) %>%
select(names(df))
-output
id name score1 score2
1 1 John 8 9
2 2 Nina 6 7

It's an implementation of a feasible idea with dplyr::rows_update, though it involves functions of multiple packages. In practice I prefer a moderately parsimonious approach.
library(tidyverse)
corrections %>%
group_by(id) %>%
group_map(
~ pivot_wider(.x, names_from = column, values_from = new_value) %>% type_convert,
.keep = TRUE) %>%
reduce(rows_update, by = 'id', .init = df)
# id name score1 score2
# 1 1 John 8 9
# 2 2 Nina 6 7

Rowwise find most frequent term in dataframe column and count occurrences

I try to find the most frequent category within every row of a dataframe. A category can consist of multiple words split by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
"apple,NA,apple,chocolate",
"shoes/socks,NA,NA,NA",
"apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
rowwise() %>%
mutate(winner = names(which.max(table(categories %>% str_split(",")))),
winner_count = which.max(table(categories %>% str_split(",")))[[1]])
Also tried to follow this approach, however it also does not give me the required results
How to find the most repeated word in a vector with R
trial2 <- df %>%
mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.

split the categories on comma in separate rows, count their occurrence for each id, drop the NA values and select the top occurring row for each id
library(dplyr)
library(tidyr)
df %>%
separate_rows(categories, sep = ',') %>%
count(id, categories, name = 'winner_count') %>%
filter(categories != 'NA') %>%
group_by(id) %>%
slice_max(winner_count, n = 1, with_ties = FALSE) %>%
ungroup %>%
rename(winner = categories) %>%
left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate

How to get unique occurrences of these character strings separated by ";"?

So I have a column with values in this structure:
tribble(
~col,
"AA_BB;AA_AA;AA_BB",
"BB_BB;AA_AA",
"AA_BB",
"BB_AA;BB_AA;AA_AA;BB_AA")
)
So each row has items separated by a ";". The first for has items AA_BB, AA_AA and AA_BB. I want the first row to be transformed to "AA_BB;AA_AA" and the last row to be transformed to "BB_AA;AA_AA".
I thought about using separate but I the result didn't really help me (especially since I don't know how many columns there can be at most).
df %>%
separate(col, into = c("A", "B", "C", "D"), sep = ";")
Any tips on how to do this?

We can split the column, get the unique elements and paste
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(col = map_chr(strsplit(col, ";"), ~ str_c(unique(.x), collapse=";")))
-output
# A tibble: 4 x 1
# col
# <chr>
#1 AA_BB;AA_AA
#2 BB_BB;AA_AA
#3 AA_BB
#4 BB_AA;AA_AA
Or split with separate_rows, then do a group by paste after getting the distinct rows
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(col, sep=";") %>%
distinct %>%
group_by(rn) %>%
summarise(col = str_c(col, collapse=";"), .groups = 'drop') %>%
select(col)

In base R, you can split the string on semi-colon, keep only unique strings and paste them together.
df$col1 <- sapply(strsplit(df$col, ';'), function(x)
paste0(unique(x), collapse = ';'))
df
# A tibble: 4 x 2
# col col1
# <chr> <chr>
#1 AA_BB;AA_AA;AA_BB AA_BB;AA_AA
#2 BB_BB;AA_AA BB_BB;AA_AA
#3 AA_BB AA_BB
#4 BB_AA;BB_AA;AA_AA;BB_AA BB_AA;AA_AA

Add a grouping variable based on ranked data

Consider the following dataframe:
name <- c("Sally", "Dave", "Aaron", "Jane", "Michael")
rank <- c(1,2,1,2,3)
df <- data.frame(name, rank, stringsAsFactors = FALSE)
I'd like to create a grouping variable (event) based on the rank column, as such:
event <- c("Hurdles", "Hurdles", "Long Jump", "Long Jump", "Long Jump")
df_desired <- data.frame(name, rank, event, stringsAsFactors = FALSE)
There are lots of examples of going the other way (making a ranking variable based on a group) but I can't seem to find one doing what I'd like.
It's possible to use filter, full_join and then fill as shown below, but is there a simpler way?
library(tidyverse)
df <- df %>%
mutate(order = row_number())
df_1 <- df %>%
filter(rank == 1)
df_1$event <- c("Hurdles", "Long Jump")
df %>%
filter(rank != 1) %>%
mutate(event = as.character(NA)) %>%
full_join(df_1, by = c("order", "name", "rank", "event")) %>%
arrange(order) %>%
fill(event) %>%
select(-order)

We can use cumsum to create the index
library(dplyr)
df %>%
mutate(event = c("Hurdles", "Long Jump")[cumsum(rank == 1)])
# name rank event
#1 Sally 1 Hurdles
#2 Dave 2 Hurdles
#3 Aaron 1 Long Jump
#4 Jane 2 Long Jump
#5 Michael 3 Long Jump
Or in base R (just in case)
df$event <- c("Hurdles", "Long Jump")[cumsum(df$rank == 1)])

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Select groups with row containing specific value (with dplyr and pipes) - r

Related

Filter based on different conditions at different positions in a string in R

Replace values in dataframe based on other dataframe with column name and value

Rowwise find most frequent term in dataframe column and count occurrences

How to get unique occurrences of these character strings separated by ";"?

Add a grouping variable based on ranked data

Categories

Resources