tidyverse: filter with str_detect - r

I want to use filter command from dplyr along with str_detect.
library(tidyverse)
dt1 <-
tibble(
No = c(1, 2, 3, 4)
, Text = c("I have a pen.", "I have a book.", "I have a pencile.", "I have a pen and a book.")
)
dt1
# A tibble: 4 x 2
No Text
<dbl> <chr>
1 1 I have a pen.
2 2 I have a book.
3 3 I have a pencile.
4 4 I have a pen and a book.
MatchText <- c("Pen", "Book")
dt1 %>%
filter(str_detect(Text, regex(paste0(MatchText, collapse = '|'), ignore_case = TRUE)))
# A tibble: 4 x 2
No Text
<dbl> <chr>
1 1 I have a pen.
2 2 I have a book.
3 3 I have a pencile.
4 4 I have a pen and a book.
Required Output
I want the following output in more efficient way (since in my original problem there would be many unknown element of MatchText).
dt1 %>%
filter(str_detect(Text, regex("Pen", ignore_case = TRUE))) %>%
select(-Text) %>%
mutate(MatchText = "Pen") %>%
bind_rows(
dt1 %>%
filter(str_detect(Text, regex("Book", ignore_case = TRUE))) %>%
select(-Text) %>%
mutate(MatchText = "Book")
)
# A tibble: 5 x 2
No MatchText
<dbl> <chr>
1 1 Pen
2 3 Pen
3 4 Pen
4 2 Book
5 4 Book
Any hint to accomplish the above task more efficiently.

library(tidyverse)
dt1 %>%
mutate(
result = str_extract_all(Text, regex(paste0("\\b", MatchText, "\\b", collapse = '|'),ignore_case = TRUE))
) %>%
unnest(result) %>%
select(-Text)
# # A tibble: 4 x 2
# No result
# <dbl> <chr>
# 1 1 pen
# 2 2 book
# 3 4 pen
# 4 4 book
I'm not sure what happened to the "whole words" part of your question after edits - I left in the word boundaries to match whole words, but since "pen" isn't a whole word match for "pencile", my result doesn't match yours. Get rid of the \\b if you want partial word matches.

str_extract_all() gives multiple matches which you can unnest into separate rows to get your desired output. If you want you can still use the paste+collapse method to generate the pattern from a vector.
library(stringr)
dt1 %>%
mutate(match = str_extract_all(tolower(Text), "pen|book")) %>%
unnest(match) %>%
select(-Text)

Related

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')`
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way on achieving this instead of using rep() on first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

Row mean of two matching columns with same name but differ by: '_1' and '_2'

Lets say I have the dataframe:
z = data.frame(col_1 = c(1,2,3,4), col_2 = c(3,4,5,6))
col_1 col_2
1 1 3
2 2 4
3 3 5
4 4 6
I want to take columns with the same name that only differ by the number e.g. '_1' and '_2' and take the pairwise mean. In reality I have a big dataframe with many pairs and they are not in a nice order, therefore looking for a clever solution that can be applied to this.
So the output should look like this:
col
1 2
2 3
3 4
4 5
With the column name given as the same as the column pair but with the additional label removed.
Any help would be great thanks.
Here is a base R option using list2DF + split.default + rowMeans
list2DF(lapply(split.default(z,gsub("_\\d+","",names(z))),rowMeans))
which gives
col
1 2
2 3
3 4
4 5
Try this tidyverse approach. By using separate() you can extract the name and then with reshaping you can reach the desired output. Here the code:
library(dplyr)
library(tidyr)
#Data
z = data.frame(col_1 = c(1,2,3,4), col_2 = c(3,4,5,6))
#Code
z1 <- z %>% mutate(id=1:n()) %>%
pivot_longer(-id) %>%
separate(name,c('var1','var2'),sep='_') %>%
group_by(id,var1) %>% summarise(Mean=mean(value)) %>%
pivot_wider(names_from = var1,values_from=Mean) %>% ungroup() %>% select(-id)
Output:
# A tibble: 4 x 1
col
<dbl>
1 2
2 3
3 4
4 5
Here is a purrr oriented solution:
library(purrr)
library(stringr)
split.default(z, str_remove(names(z), "[:digit:]+$")) %>% map_dfc(rowMeans)
#> # A tibble: 4 x 1
#> col_
#> <dbl>
#> 1 2
#> 2 3
#> 3 4
#> 4 5
It works even if z is:
z <- data.frame(col_1 = c(1,2,3,4),
col_2 = c(3,4,5,6),
anothercol_1 = c(1,2,3,4),
anothercol_2 = c(3,4,5,6))

map over columns and apply custom function

Missing something small here and struggling to pass columns to function. I just want to map (or lapply) over columns and perform a custom function on each of the columns. Minimal example here:
library(tidyverse)
set.seed(10)
df <- data.frame(id = c(1,1,1,2,3,3,3,3),
r_r1 = sample(c(0,1), 8, replace = T),
r_r2 = sample(c(0,1), 8, replace = T),
r_r3 = sample(c(0,1), 8, replace = T))
df
# id r_r1 r_r2 r_r3
# 1 1 0 0 1
# 2 1 0 0 1
# 3 1 1 0 1
# 4 2 1 1 0
# 5 3 1 0 0
# 6 3 0 0 1
# 7 3 1 1 1
# 8 3 1 0 0
a function just to filter and counts unique ids remaining in the dataset:
cnt_un <- function(var) {
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
it works outside of map
cnt_un(r_r1)
# A tibble: 1 x 2
r_r1 n_uniq
<dbl> <int>
1 1 3
I want to apply the function over all r_r columns to get something like:
df2
# y n_uniq
# 1 r_r1 3
# 2 r_r2 2
# 3 r_r3 2
I thought the following would work but doesnt
map(dplyr::select(df, matches("r_r")), ~ cnt_un(.x))
any suggestions? thanks
I'm not sure if there's a direct tidyeval way to do this with something like map. The issue you're running into is that in calling map(df, *whatever_function*), the function is being called on each column of df as a vector, whereas your function expects a bare column name in the tidyeval style. To verify that:
map(df, class)
will return "numeric" for each column.
An alternative is to iterate over column names as strings, and convert those to symbols; this takes just one additional line in the function.
library(dplyr)
library(tidyr)
library(purrr)
cnt_un_name <- function(varname) {
var <- ensym(varname)
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
Calling the function is a little awkward because it keeps only the relevant column names (calling on "r_r1" gets columns "r_r1" and "n_uniq", etc). One way is to get the vector of column names you want, name it so you can add an ID column in map_dfr, and drop the extra columns, since they'll be mostly NA.
grep("^r_r\\d+", names(df), value = TRUE) %>%
set_names() %>%
map_dfr(cnt_un_name, .id = "y") %>%
select(y, n_uniq)
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 3
#> 2 r_r2 2
#> 3 r_r3 2
A better way is to call the function, then bind after reshaping.
grep("^r_r\\d+", names(df), value = TRUE) %>%
map(cnt_un_name) %>%
map_dfr(pivot_longer, 1, names_to = "y") %>%
select(y, n_uniq)
# same output as above
Alternatively (and maybe better/more scaleable) would be to do the column renaming inside the function definition.
Here's a base R solution that uses lapply. The tricky bit is that your function isn't actually running on single columns; it's using id, too, so you can't use canned functions that iterate column-wise.
do.call(rbind, lapply(grep("r_r", colnames(df), value = TRUE), function(i) {
X <- subset(df, df[,i] == 1)
row <- data.frame(y = i, n_uniq = length(unique(X$id)), stringsAsFactors = FALSE)
}))
y n_uniq
1 r_r1 2
2 r_r2 3
3 r_r3 2
Here is another solution. I changed the syntax of your function. Now you supply the pattern of the columns you want to select.
cnt_un <- function(var_pattern) {
df %>%
pivot_longer(cols = contains(var_pattern), values_to = "vals", names_to = "y") %>%
filter(vals == 1) %>%
group_by(y) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
cnt_un("r_r")
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 2
#> 2 r_r2 3
#> 3 r_r3 2

Rename a dataframe Column with text from within the column itself

Given a (simplified) dataframe with format
df <- data.frame(a = c(1,2,3,4),
b = c(4,3,2,1),
temp1 = c("-","-","-","foo: 3"),
temp2 = c("-","bar: 10","-","bar: 4")
)
a b temp1 temp2
1 4 - -
2 3 - bar: 10
3 2 - -
4 1 foo: 3 bar: 4
I need to rename all temp columns with the names contained within the column, My end goal is to end up with this:
a b foo bar
1 4 - -
2 3 - 10
3 2 - -
4 1 3 4
the df column names and the data contained within them will be unknown, however the columns that need changing will contain temp and the delimiter will always be a ":"
As such I can easily remove the name from within the columns using dplyr like this:
df <- df %>%
mutate_at(vars(contains("temp")), ~(substr(., str_locate(., ":")+1,str_length(.))))
but first I need to rename the columns based on some function method, that scans the column and returns the value(s) within it, ie.
rename_at(vars(contains("temp")), ~(...some function.....))
As per the example given there's no guarantee that specific rows will have data so I can't simply grab value from row 1
Any ideas welcome.
Thanks in advance
One possibility involving dplyr and tidyr could be:
df %>%
pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
mutate(values = replace(values, values == "-", NA_character_)) %>%
separate(values, into = c("variables2", "values"), sep = ": ") %>%
group_by(variables) %>%
fill(variables2, .direction = "downup") %>%
ungroup() %>%
select(-variables) %>%
pivot_wider(names_from = "variables2", values_from = "values")
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 <NA> <NA>
2 2 3 <NA> 10
3 3 2 <NA> <NA>
4 4 1 3 4
If you want to further replace the NAs with -:
df %>%
pivot_longer(names_to = "variables", values_to = "values", -c(a:b)) %>%
mutate(values = replace(values, values == "-", NA_character_)) %>%
separate(values, into = c("variables2", "values"), sep = ": ") %>%
group_by(variables) %>%
fill(variables2, .direction = "downup") %>%
ungroup() %>%
select(-variables) %>%
pivot_wider(names_from = "variables2", values_from = "values") %>%
mutate_at(vars(-a, -b), ~ replace_na(., "-"))
a b foo bar
<dbl> <dbl> <chr> <chr>
1 1 4 - -
2 2 3 - 10
3 3 2 - -
4 4 1 3 4
This will do the job:
colnames(df)[which(grepl("temp", colnames(df)))] <- unique(unlist(sapply(df[,grepl("temp", colnames(df))],
function(x){gsub("[:].*",
"",
grep("\\w+",
x,
value = TRUE))})))

Create new columns to indicate column name's position inside another string vector (with dplyr, purrr, and stringr)

Given this example data:
require(stringr)
require(tidyverse)
labels <- c("foo", "bar", "baz")
n_rows <- 4
df <- 1:n_rows %>%
map(~ data.frame(
block_order=paste(sample(labels, size=length(labels), replace=FALSE),
collapse="|"))) %>%
bind_rows()
df
block_order
1 foo|bar|baz
2 baz|bar|foo
3 foo|baz|bar
4 foo|bar|baz
I want to generate a column for each string in labels, which takes the value of the position of that string in the |-separated sequence in each row.
Desired output:
block_order foo bar baz
1 foo|bar|baz 1 2 3
2 baz|bar|foo 3 2 1
3 foo|baz|bar 1 3 2
4 foo|bar|baz 1 2 3
I've been trying different variations in a dplyr/purrr setup, like this example, where I map in each value of label, and then attempt to get its position in block_order using match on str_split:
labels %>%
map(~ df %>%
transmute(!!.x := match(!!.x, str_split(block_order,
"\\|",
simplify=TRUE)))) %>%
bind_cols(df, .)
But that produces unexpected output:
block_order foo bar baz
1 foo|bar|baz 1 5 2
2 baz|bar|foo 1 5 2
3 foo|baz|bar 1 5 2
4 foo|bar|baz 1 5 2
I'm not really sure what these numbers represent, or why they're all the same.
If anyone can help me figure out (a) how to achieve my desired output in a dplyr/purrr framework and (b) why the proposed solution here gives the output it does, I'd be very appreciative.
We can split the 'block_order' by |, loop through the list of vectors using lapply, get the index with match, rbind the vectors and assign it to create new columns
labels <- c("foo", "bar", "baz")
df[labels] <- do.call(rbind, lapply(strsplit(df$block_order, "|",
fixed = TRUE), match, table = labels))
Or similar idea with tidyverse
library(tidyverse)
str_split(df$block_order, "[|]") %>%
map(~ .x %>%
match(table= labels)) %>%
do.call(rbind, .) %>%
as_tibble %>%
set_names(labels) %>%
bind_cols(df, .)
# block_order foo bar baz
#1 foo|bar|baz 1 2 3
#2 baz|bar|foo 3 2 1
#3 foo|baz|bar 1 3 2
#4 foo|bar|baz 1 2 3
Another option would be to use separate_rows, reshape it to 'long' format and spread it back
rownames_to_column(df, 'rn') %>%
separate_rows(block_order) %>%
group_by(rn) %>%
mutate(ind = match(block_order, labels), labels = factor(labels, levels = labels)) %>%
select(-block_order) %>%
spread(labels, ind) %>%
ungroup %>%
select(-rn) %>%
bind_cols(df, .)
Unless you need to for other reasons, you don't have to fully split the string if you just identify the location of the first match for each value of labels, which regexpr will give you. mapping over labels will give a list with one element for each string in labels (so it's a quick iteration), which you can then pmap rank over to get indices. Using the *_dfr version to simplify the results to a data frame and cbinding to the original,
library(tidyverse)
set.seed(47)
labels <- c("foo", "bar", "baz")
df <- data_frame(block_order = replicate(10, paste(sample(labels), collapse = "|")))
labels %>%
map(~regexpr(.x, df$block_order)) %>%
pmap_dfr(~set_names(as.list(rank(c(...))), labels)) %>%
bind_cols(df, .)
#> # A tibble: 10 x 4
#> block_order foo bar baz
#> <chr> <dbl> <dbl> <dbl>
#> 1 baz|foo|bar 2. 3. 1.
#> 2 baz|bar|foo 3. 2. 1.
#> 3 bar|foo|baz 2. 1. 3.
#> 4 baz|foo|bar 2. 3. 1.
#> 5 foo|bar|baz 1. 2. 3.
#> 6 baz|foo|bar 2. 3. 1.
#> 7 foo|baz|bar 1. 3. 2.
#> 8 bar|baz|foo 3. 1. 2.
#> 9 baz|foo|bar 2. 3. 1.
#> 10 foo|bar|baz 1. 2. 3.
If you prefer stringr/stringi to base regex, you could to the same thing by changing the regexpr call to str_locate(df$block_order, .x)[, "start"] or stringi::stri_locate_first_fixed in the same arrangement.
I think this might work:
library(tidyr)
library(purrr)
position_counter <- function(...) {
row = list(...)
row %>% map(~which(row == .)) %>% setNames(row)
}
df %>%
separate(block_order, labels) %>%
pmap_df(position_counter)

Resources