In R dplyr, gsub() in mutate() using column as the pattern

zed = data.frame(name = c('Tom', 'Joe', 'Nick', 'Bill'), names = c('TomRyanTim', 'RobJoeMike', 'SteveKevinNick', 'EvanPacJimmy'), stringsAsFactors = FALSE)
> zed
name names
1 Tom TomRyanTim
2 Joe RobJoeMike
3 Nick SteveKevinNick
4 Bill EvanPacJimmy
> zed %>% dplyr::mutate(names = gsub(name, '', names))
name names
1 Tom RyanTim
2 Joe RobJoeMike
3 Nick SteveKevinNick
4 Bill EvanPacJimmy
Warning message:
Problem with `mutate()` column `names`.
ℹ `names = gsub(name, "", names)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
In the example above, mutate(gsub()) applies only the first pattern, Tom, to every row, whereas I'd like each row to gsub() the value in its own name column. We are looking for the following output:
output$names = c('RyanTim', 'RobMike', 'SteveKevin', 'EvanPacJimmy')
Is it possible to update the mutate + gsub code to operate this way?

Use rowwise:
zed %>%
rowwise() %>%
mutate(names = gsub(name, '', names)) %>%
ungroup()
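To avoid using rowwise, you can use stringr::str_replace_all or stringr::str_remove_all, both of which are vectorised over the pattern. Either line inside mutate() is enough on its own:
library(dplyr)
library(stringr)
zed %>%
mutate(names = str_replace_all(names, name, ""), # replace every match with ""
names = str_remove_all(names, name)) # shorthand for the line above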
name names
<chr> <chr>
1 Tom RyanTim
2 Joe RobMike
3 Nick SteveKevin
4 Bill EvanPacJimmy

Or group_by:
library(dplyr)
zed |>
group_by(name, names) |>
mutate(names = gsub(name, "", names)) |>
ungroup()
Output:
# A tibble: 4 × 2
name names
<chr> <chr>
1 Tom RyanTim
2 Joe RobMike
3 Nick SteveKevin
4 Bill EvanPacJimmy

Another way is to loop through your zed data frame with sapply, and use gsub within that.
library(dplyr)
zed %>%
mutate(names = sapply(1:nrow(.), \(x) gsub(.[x, 1], "", .[x, 2])))
name names
1 Tom RyanTim
2 Joe RobMike
3 Nick SteveKevin
4 Bill EvanPacJimmy
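A base R sketch of the same per-row idea, without dplyr. mapply() pairs the i-th pattern with the i-th string, which is what a single gsub() call cannot do. Using fixed = TRUE is my assumption here (it treats each name as literal text rather than a regular expression); drop it to keep regex matching.

```r
zed <- data.frame(
  name  = c("Tom", "Joe", "Nick", "Bill"),
  names = c("TomRyanTim", "RobJoeMike", "SteveKevinNick", "EvanPacJimmy"),
  stringsAsFactors = FALSE
)

# gsub() only uses one pattern per call, so call it once per row.
# fixed = TRUE is an assumption: match the name literally, not as a regex.
zed$names <- mapply(
  function(pattern, x) gsub(pattern, "", x, fixed = TRUE),
  zed$name, zed$names,
  USE.NAMES = FALSE
)

zed$names
```

Literal matching only matters if a name could contain characters such as "." or "(", but it is a cheap safeguard when the patterns come from data.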

Related

Counting number of strings despite multiple elements in one cell

I have a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of every name.
I have already tried this code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code is only able to create output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, even though two names are present in one cell. My preferred result would be:
Tom: 1 Jerry: 1 Lisa: 1 Marc:1
The split should be ; followed by zero or more spaces (\\s*)
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry Lisa Marc Tom
1 1 1 1
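To see why the \\s* part matters, here is a small sketch comparing the two split patterns on the same vector. The variable names split_plain, split_clean and counts are only for illustration.

```r
A <- c("Tom; Jerry", "Lisa; Marc")

# Splitting on ";" alone leaves the leading space on the second name.
split_plain <- strsplit(A, ";")[[1]]

# Consuming the optional whitespace after ";" gives clean names.
split_clean <- strsplit(A, ";\\s*")[[1]]

# One count per name across all cells.
counts <- table(unlist(strsplit(A, ";\\s*")))
```

With the plain split, " Jerry" and "Jerry" would be counted as different names if both forms occurred in the data.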
Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
separate_rows(A, sep = "; ") %>%
group_by(A) %>%
summarise(N = n())
# A tibble: 4 × 2
A N
<chr> <int>
1 Jerry 1
2 Lisa 1
3 Marc 1
4 Tom 1

Turning vectors of strings in a dataframe into categorical variables in R

I'm fairly new to R and am sure there's a way to do the following without using loops, which I'm more familiar with.
Take the following example where you have a bunch of names and fruits each person likes:
name <- c("Alice", "Bob")
preference <- list(c("apple", "pear"), c("banana", "apple"))
df <- as.data.frame(cbind(name, preference))
How do I convert it to the following?
apple <- c(1, 1)
pear <- c(1, 0)
banana <- c(0, 1)
df2 <- data.frame(name, apple, pear, banana)
My basic instinct is to first extract all the fruits then do a loop to check if each fruit is in each row's preference:
fruits <- unique(unlist(df$preference))
for (fruit in fruits) {
df <- df %>% rowwise %>% mutate("{fruit}" := fruit %in% preference)
}
This seems to work, but I'm pretty sure there's a better way to do this.
df %>%
unnest(everything()) %>%
xtabs(~., .) %>%
as.data.frame.matrix() %>%
rownames_to_column('name')
name apple banana pear
1 Alice 1 0 1
2 Bob 1 1 0
In tidyverse (assuming the 'preference' is a list column), unnest the 'preference' and then use pivot_wider to reshape back to 'wide' format with values_fn as length
library(dplyr)
library(tidyr)
df %>%
unnest_longer(preference) %>%
pivot_wider(names_from = preference, values_from = preference,
values_fn = length, values_fill = 0)
-output
# A tibble: 2 × 4
name apple pear banana
<chr> <int> <int> <int>
1 Alice 1 1 0
2 Bob 1 0 1
data
df <- data.frame(name, preference = I(preference))
Another possible solution, based on tidyr::separate_rows and janitor::tabyl:
library(tidyverse)
df %>%
separate_rows(everything(), sep="(?<=\\w), (?=\\w)") %>%
janitor::tabyl(name, preference)
#> name apple banana pear
#> Alice 1 0 1
#> Bob 1 1 0
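For comparison, a base R sketch of the same indicator columns, assuming preference is kept as a plain list alongside name (as in the question's first two lines) rather than a data frame column. It replaces the explicit loop with nested sapply() calls.

```r
name <- c("Alice", "Bob")
preference <- list(c("apple", "pear"), c("banana", "apple"))

# All distinct fruits, in order of first appearance.
fruits <- unique(unlist(preference))

# For each fruit, check membership in every person's preference vector.
# sapply() over a character vector names the resulting matrix columns.
indicators <- sapply(fruits, function(f) {
  as.integer(sapply(preference, function(p) f %in% p))
})

df2 <- data.frame(name, indicators)
```

Each column of indicators is a 0/1 vector with one entry per person, so df2 has the columns name, apple, pear and banana.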

str_detect on multiple columns in the same row

I have two datasets, one with full names and one with first and last names.
library(tidyverse)
(x = tibble(fullname = c("Michael Smith",
"Elisabeth Brown",
"John-Henry Albert")))
#> # A tibble: 3 x 1
#> fullname
#> <chr>
#> 1 Michael Smith
#> 2 Elisabeth Brown
#> 3 John-Henry Albert
(y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown"))
#> # A tibble: 3 x 2
#> first last
#> <chr> <chr>
#> 1 Elisabeth Smith
#> 2 John Albert
#> 3 Roland Brown
I'd like to make a single boolean column that is true only if both the first and last values from the same row of y appear within the fullname column.
In essence, I'm looking for something like:
x %>%
mutate(fname_match = str_detect(fullname, paste0(y$first, collapse = "|")), ## correct
lname_match = str_detect(fullname, paste0(y$last, collapse = "|"))) ## correct
#> # A tibble: 3 x 3
#> fullname fname_match lname_match
#> <chr> <lgl> <lgl>
#> 1 Michael Smith FALSE TRUE
#> 2 Elisabeth Brown TRUE TRUE
#> 3 John-Henry Albert TRUE TRUE
But if I took the rows with two TRUEs, Elisabeth Brown would be a false positive, because the matching first name and last name are not in the same row of y.
My best idea so far is to combine the first and last columns and search for the result, but this creates a false negative for John-Henry:
y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown") %>%
rowwise() %>%
mutate(longname = paste(first, last, sep = "&"))
x %>%
mutate(full_match = str_detect(fullname, paste0(y$longname, collapse = "|")))
#> # A tibble: 3 x 2
#> fullname full_match
#> <chr> <lgl>
#> 1 Michael Smith FALSE
#> 2 Elisabeth Brown FALSE
#> 3 John-Henry Albert FALSE
I think this does what you want, using purrr::map2 to iterate over the tuples of first and last.
library(dplyr)
library(purrr)
y %>%
mutate(
name_match = map2_lgl(
first, last,
.f = ~any(grepl(paste0(.x, '.*', .y), x$fullname, ignore.case = TRUE))
)
)
Note that paste0(.x, '.*', .y) combines them into a regex that only matches when the last name appears in full after the first name. That seemed reasonable to do (otherwise, first name "Elisabeth" with last name "Abe" would still be TRUE, which I assume you would not want).
Also, the above is case insensitive.
// UPDATE:
I forgot; inversely, if you want to check the fullname values in x, then you can run this:
x %>%
rowwise() %>%
mutate(
name_match = any(map2_lgl(
y$first, y$last,
.f = ~grepl(paste0('\\b', .x, '\\b.*\\b', .y, '\\b'), fullname, ignore.case = TRUE)
))
)
Depending on how important this check is for you and how many assumptions you want to make, it might make sense to tweak the above regex a little further:
ensure that the first name and last name stand as isolated words in the fullname
-> paste0('\\b', .x, '\\b.*\\b', .y, '\\b')
test that the first name comes right at the beginning
-> paste0('^', .x, '\\b.*\\b', .y, '\\b')
test that the fullname ends after the last name
-> paste0('\\b', .x, '\\b.*\\b', .y, '$')
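The word-boundary variant can be checked in isolation with a base R sketch. It rebuilds the example data as plain vectors; the helper matches_any is hypothetical and only for illustration.

```r
fullname <- c("Michael Smith", "Elisabeth Brown", "John-Henry Albert")
first <- c("Elisabeth", "John", "Roland")
last  <- c("Smith", "Albert", "Brown")

# One regex per row of y: first name and last name as whole words, in that order.
patterns <- paste0("\\b", first, "\\b.*\\b", last, "\\b")

# TRUE if a full name matches at least one first/last pair.
matches_any <- function(fn) {
  any(vapply(patterns, function(p) grepl(p, fn, ignore.case = TRUE), logical(1)))
}

name_match <- vapply(fullname, matches_any, logical(1), USE.NAMES = FALSE)
```

Only "John-Henry Albert" matches here: the hyphen after "John" counts as a word boundary, while "Elisabeth Brown" fails because no single row of y pairs those two names.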

Summarize with the latest record for each group [duplicate]

I have a dataframe:
df <- data.frame(Xdate = c("21-jul-2020", "29-jul-2020", "20-jul-2020", "13-may-2020" ),
names = c("peter", "lisa","peter", "lisa"),
score = c(1,3,5,7))
What is the most elegant way of getting the latest score for each name:
df_result <- data.frame(names = c("peter", "lisa"),
score = c(1, 3))
The latest score for peter is 1, achieved on 21-jul-2020, and the latest score for lisa is 3, achieved on 29-jul-2020.
You can use slice_max() in dplyr, which supersedes top_n() as of version 1.0.0, to select the most recent date.
library(dplyr)
df %>%
mutate(Xdate = as.Date(Xdate, "%d-%b-%Y")) %>%
group_by(names) %>%
slice_max(Xdate, n = 1) %>%
ungroup()
# # A tibble: 2 x 3
# Xdate names score
# <date> <chr> <dbl>
# 1 2020-07-29 lisa 3
# 2 2020-07-21 peter 1
Here is a dplyr solution.
library(dplyr)
df %>%
mutate(Xdate = as.Date(Xdate, "%d-%b-%Y")) %>%
group_by(names) %>%
arrange(Xdate) %>%
summarise_all(last)
## A tibble: 2 x 3
# names Xdate score
# <chr> <date> <dbl>
#1 lisa 2020-07-29 3
#2 peter 2020-07-21 1
A base R one-liner could be
aggregate(score ~ names, data = df[order(as.Date(df$Xdate, "%d-%b-%Y")),], function(x) x[length(x)])
# names score
#1 lisa 3
#2 peter 1
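One caveat when ordering by Xdate: while it is still stored as text in the "%d-%b-%Y" layout, sorting compares characters rather than calendar dates. A minimal sketch with two made-up dates (not from the question's data) shows the difference. It assumes an English locale, since "%b" month names are locale-dependent.

```r
# Hypothetical dates chosen so that text order and date order disagree.
x <- c("05-aug-2020", "21-jul-2020")

# Text comparison looks at characters, so "21-..." sorts after "05-...".
latest_as_text <- x[order(x)][length(x)]

# Parsing first compares real calendar dates.
# Note: "%b" month names are locale-dependent; an English locale is assumed.
parsed <- as.Date(x, format = "%d-%b-%Y")
latest_as_date <- x[order(parsed)][length(x)]
```

The question's four dates happen to sort the same way as text and as dates, which can hide the problem until the data changes.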
Here is an alternative using the dplyr package
library(dplyr)
df$Xdate <- as.Date(df$Xdate, format = "%d-%b-%Y")
df %>%
group_by(names) %>%
arrange(desc(Xdate)) %>%
mutate(names = first(names),
score = first(score)) %>%
select(!Xdate) %>%
distinct(names, score)%>%
ungroup()
# names score
# <fct> <dbl>
#1 lisa 3
#2 peter 1
or
df %>% group_by(names) %>% arrange(desc(Xdate)) %>% filter(row_number() == 1)
or
df %>% group_by(names) %>% arrange(desc(Xdate)) %>% top_n(n = 1, wt = Xdate)
Using ave in base R :
subset(transform(df, Xdate = as.Date(Xdate, "%d-%b-%Y")),
Xdate == ave(Xdate, names, FUN = max))
# Xdate names score
#1 2020-07-21 peter 1
#2 2020-07-29 lisa 3
With transform we first convert Xdate to a Date; using ave we get the max date for each value of names, and subset keeps the rows whose date equals that maximum.

Looping and concatenating based on a condition in R

I'm new to R and still struggling with loops.
I'm trying to create a loop where, based on a condition (variable_4 == 1), it will concatenate the content of variable_5, separated by comma.
data1 <- data.frame(
ID = c(123:127),
agent_1 = c('James', 'Lucas','Yousef', 'Kyle', 'Marisa'),
agent_2 = c('Sophie', 'Danielle', 'Noah', 'Alex', 'Marcus'),
agent_3 = c('Justine', 'Adrienne', 'Olivia', 'Janice', 'Josephine'),
Flag_1 = c(1,0,1,0,1),
Flag_2 = c(0,1,0,0,1),
Flag_3 = c(1,0,1,0,1)
)
data1$new_var<- ""
for(i in 2:10){
variable_4 <- paste0("flag_", i)
variable_5 <- paste0("agent_", i)
data1 <- data1 %>%
mutate(!! new_var = case_when(variable_4 == 1,paste(new_var, variable_5, sep=",")))
}
I created new_var in a previous step because the code was giving me an error that the variable was not found. Ideally, the loop will accumulate the contents of variable_5 only if variable_4 is equal to 1, and the result would be one big string, separated by commas.
The loop should paste into new_var only the names of the agents whose flags are equal to 1. If Flag_1 = 1, then paste the name of the agent into new_var; if not, ignore it. If Flag_2 = 1, then concatenate the name of the agent to new_var, separated by a comma; if not, ignore it, and so on.
You shouldn't need to use a loop for this. The data is in wide format which makes it harder, but if we convert to long format, we can easily find a vectorized solution rather than using a loop.
The pivot_longer function is useful here; it requires tidyr version >= 1.0.0.
library(tidyr)
library(dplyr)
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_") %>%
group_by(ID) %>%
mutate(new_var = paste0(agent[Flag==1], collapse = ',')) %>%
pivot_wider(names_from = c("group"),
values_from = c('agent', 'Flag'),
names_sep = '_') %>%
ungroup() %>%
select(ID, starts_with('agent'), starts_with('Flag'), new_var)
## A tibble: 5 x 8
# ID agent_1 agent_2 agent_3 Flag_1 Flag_2 Flag_3 new_var
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 123 James Sophie Justine 1 0 1 James,Justine
#2 124 Lucas Danielle Adrienne 0 1 0 Danielle
#3 125 Yousef Noah Olivia 1 0 1 Yousef,Olivia
#4 126 Kyle Alex Janice 0 0 0 ""
#5 127 Marisa Marcus Josephine 1 1 1 Marisa,Marcus,Josephine
Details:
pivot_longer puts our data into a more natural format where each row represents one observation of the variables agent and flag, rather than several:
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_")
## A tibble: 15 x 4
# ID group agent Flag
# <int> <chr> <chr> <chr>
# 1 123 1 James 1
# 2 123 2 Sophie 0
# 3 123 3 Justine 1
# 4 124 1 Lucas 0
# 5 124 2 Danielle 1
# 6 124 3 Adrienne 0
# ...
For each ID, we can then paste together the agents which have flag values of 1. This is easy now that our variables are contained in single columns.
Lastly, we revert back to the wide format with pivot_wider. We also ungroup the data we previously grouped, and re-order the columns to the desired format.
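If reshaping feels heavy for this task, the same new_var can be built directly from the wide data. This is a base R sketch, assuming exactly three agent_/Flag_ pairs as in the example data; the matrices agents and flags are illustrative names.

```r
data1 <- data.frame(
  ID = 123:127,
  agent_1 = c("James", "Lucas", "Yousef", "Kyle", "Marisa"),
  agent_2 = c("Sophie", "Danielle", "Noah", "Alex", "Marcus"),
  agent_3 = c("Justine", "Adrienne", "Olivia", "Janice", "Josephine"),
  Flag_1 = c(1, 0, 1, 0, 1),
  Flag_2 = c(0, 1, 0, 0, 1),
  Flag_3 = c(1, 0, 1, 0, 1)
)

# Same-shaped matrices: agent names and a logical mask of raised flags.
agents <- as.matrix(data1[paste0("agent_", 1:3)])
flags  <- as.matrix(data1[paste0("Flag_", 1:3)]) == 1

# For each row, keep the agents whose flag is 1 and join them with commas.
# A row with no raised flags yields the empty string.
data1$new_var <- vapply(
  seq_len(nrow(data1)),
  function(i) paste(agents[i, flags[i, ]], collapse = ","),
  character(1)
)
```

This relies on agent_i and Flag_i sharing the same column position in the two matrices, so the column names must be generated in matching order.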
There are a few different ways to do this in base R, the tidyverse, or a combination of both. If you stick to the tidyverse, consider the following.
I have used mtcars in place of your data frame:
#load dplyr or tidyverse
library(tidyverse)
# create data as mtcars
df <- mtcars
# create two new columns flag and agent as rownumbers
df <- df %>%
mutate(flag = paste0("flag", row_number())) %>%
mutate(agent = paste0("agent", row_number()))
# using case when in mutate statement
df2 <- df %>%
mutate(new_column = ifelse(flag == "flag1", yes = paste0(agent, " this is a new variable"), no = flag))
print(df2)
An ifelse statement might be more appropriate if you have one case, but if you have many, use case_when instead.
