Find a specific string with grepl across all columns in R dplyr

In a huge data.frame, I am trying to search all columns for a string using dplyr in R.
I am unsure what I am doing wrong, but here is an example of what I am trying.
Let's say that I am searching mpg for audi, audi may appear in any of several columns, and I want to extract only the rows that contain audi.
The code below does not work; it returns zero rows.
Any ideas?
library(tidyverse)
head(mpg)
#> # A tibble: 6 × 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
#> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
#> 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
#> 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
#> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
#> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
mpg |>
filter(if_all(.cols = everything(), ~grepl("audi",.)))
#> # A tibble: 0 × 11
#> # … with 11 variables: manufacturer <chr>, model <chr>, displ <dbl>,
#> # year <int>, cyl <int>, trans <chr>, drv <chr>, cty <int>, hwy <int>,
#> # fl <chr>, class <chr>
Created on 2022-09-09 with reprex v2.0.2

Here is a base R option:
library(ggplot2) # Load for mpg dataset
mpg[Reduce(`|`, lapply(mpg, grepl, pattern = "audi")),]
#> # A tibble: 18 × 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
#> 11 audi a4 quattro 2 2008 4 auto… 4 19 27 p comp…
#> 12 audi a4 quattro 2.8 1999 6 auto… 4 15 25 p comp…
#> 13 audi a4 quattro 2.8 1999 6 manu… 4 17 25 p comp…
#> 14 audi a4 quattro 3.1 2008 6 auto… 4 17 25 p comp…
#> 15 audi a4 quattro 3.1 2008 6 manu… 4 15 25 p comp…
#> 16 audi a6 quattro 2.8 1999 6 auto… 4 15 24 p mids…
#> 17 audi a6 quattro 3.1 2008 6 auto… 4 17 25 p mids…
#> 18 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
Created on 2022-09-09 with reprex v2.0.2

Use if_any to match a row if any of the columns (i.e. at least one among all) matches the pattern. With if_all, every column would have to match the pattern.
mpg |>
filter(if_any(.cols = everything(), ~ grepl("audi", .)))
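The difference between if_any and if_all is easier to see on a tiny table. The sketch below uses a made-up tibble called cars (hypothetical, not part of the original question) in which "audi" appears in different columns on different rows:

```r
library(dplyr)

# Hypothetical toy data: "audi" appears in a different column on each row
cars <- tibble(
  make  = c("audi", "ford", "bmw"),
  note  = c("sold by audi", "former audi dealer", "none"),
  price = c(1, 2, 3)
)

# Keep a row when at least one column matches
any_match <- cars |> filter(if_any(everything(), ~ grepl("audi", .x)))

# Keep a row only when every column matches (price never does)
all_match <- cars |> filter(if_all(everything(), ~ grepl("audi", .x)))

# Restricting the search to character columns avoids testing numeric ones
chr_match <- cars |> filter(if_all(where(is.character), ~ grepl("audi", .x)))

nrow(any_match)
nrow(all_match)
nrow(chr_match)
```

Limiting the selection with where(is.character) is optional, but on a huge data.frame it may save grepl from coercing numeric columns to text.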

Related

I want to use a previously created function in the mutate() function. Yet R doesn't seem to want to let me [duplicate]

I am looking at population data and want to make sure I have enough observations to do county-level analysis. Therefore I would like to generate a variable that assigns each observation the number of observations with the same value in the "county" column.
I want to assign each row in my data frame ("cps") a new variable ("freq") which represents the frequency of its specific value in one specific variable ("county").
I used
f <- function(x)sum(with(cps, county==x))
to generate a function that tells me how often a given county x appears in the data.
Now I want to use
cps <- mutate(cps, freq=f(county))
to assign each row the number of times its county value appears in the data frame.
However, it assigns each row the overall number of observations.
You can get what you want using dplyr::add_count():
library(dplyr)
mpg %>% add_count(cyl, name = "freq")
# A tibble: 234 × 12
manufacturer model displ year cyl trans drv cty hwy fl class freq
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 81
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 81
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact 81
4 audi a4 2 2008 4 auto(av) f 21 30 p compact 81
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 79
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 79
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact 79
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact 81
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 81
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact 81
# … with 224 more rows
But if you wanted to use your function, you'd need to wrap it in sapply() (or purrr::map_int()) to compare each element of x against every element of the column:
f <- function(x) sapply(x, \(x) sum(with(mpg, cyl == x)))
You can also generalize it to work with any column:
f2 <- function(x) sapply(x, \(x_i) sum(x == x_i))
mutate(mpg, freq=f2(drv))
# A tibble: 234 × 12
manufacturer model displ year cyl trans drv cty hwy fl class freq
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 106
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 106
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact 106
4 audi a4 2 2008 4 auto(av) f 21 30 p compact 106
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 106
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 106
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact 106
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact 103
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 103
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact 103
# … with 224 more rows
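To see why the original function returns the total row count, it may help to run it on a few rows. The sketch below uses a made-up cps data frame (hypothetical, standing in for the asker's data) and also shows a base R lookup through table() as one more way to get per-row frequencies:

```r
# Hypothetical stand-in for the asker's `cps` data
cps <- data.frame(county = c("A", "B", "A", "C", "A"))

# Original attempt: county == x compares the column with itself position by
# position, so every comparison is TRUE and the sum is the number of rows
f <- function(x) sum(with(cps, county == x))
wrong <- f(cps$county)

# Base R alternative: count each county once, then look the counts up by name
counts <- table(cps$county)
cps$freq <- as.integer(counts[as.character(cps$county)])

wrong
cps$freq
```

The table() lookup counts each distinct value only once, which may matter when the data has many rows.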

Dynamically selecting multiple columns for group_by

Data masking for group_by does not work when there is more than one grouping variable.
Pasting code below
grpByCols <- "model"
mpg%>%
group_by(.data[[grpByCols]])
grpByCols <- c("model", "manufacturer")
mpg%>%
group_by(.data[[grpByCols]])
The first group_by works, the second one fails.
Pasting the run output below
> grpByCols <- "model"
>
> mpg%>%
+ group_by(.data[[grpByCols]])
# A tibble: 234 x 11
# Groups: model [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# … with 224 more rows
>
> grpByCols <- c("model", "manufacturer")
>
> mpg%>%
+ group_by(.data[[grpByCols]])
Error: Problem with `mutate()` input `..1`.
x Must subset the data pronoun with a string.
ℹ Input `..1` is `<unknown>`.
Run `rlang::last_error()` to see where the error occurred.
>
Please let me know if you have any ideas to make this work
A simple way is to use the across() function from dplyr.
mpg %>% group_by(across(all_of(grpByCols)))
# A tibble: 234 × 11
# Groups: model, manufacturer [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
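The across(all_of()) form works for one column name or several, so it can be wrapped in a function. The sketch below is only an illustration: count_by and the toy tibble are hypothetical names, not from the original thread.

```r
library(dplyr)

# Hypothetical helper: group by the columns named in a character vector
# and count the rows in each group
count_by <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    summarise(n = n(), .groups = "drop")
}

# Hypothetical toy data
toy <- tibble(
  model        = c("a4", "a4", "a6", "civic"),
  manufacturer = c("audi", "audi", "audi", "honda")
)

one <- count_by(toy, "model")                     # a single grouping column
two <- count_by(toy, c("model", "manufacturer"))  # several grouping columns
```

all_of() raises an error if any name in cols is missing from the data, which is usually preferable to silently dropping a grouping column.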
We could unquote the symbol with !!
grpByCols <- "model"
mpg%>%
group_by(!!sym(grpByCols))
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# ... with 224 more rows
You can use the following solution. rlang::syms() takes strings as input and turns them into symbols. Since the output is a list of length 2 (matching the length of the input), we use the big bang operator !!! to splice the elements of the list, so that each becomes a separate argument:
library(rlang)
grpByCols <- c("model", "manufacturer")
mpg %>%
group_by(!!!syms(grpByCols))
# A tibble: 234 x 11
# Groups: model, manufacturer [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# ... with 224 more rows
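To make the splicing step concrete, the sketch below inspects what syms() returns before passing it to group_by(). The toy tibble is hypothetical; group_vars() is used only to read the grouping columns back.

```r
library(dplyr)
library(rlang)

# Hypothetical toy data
toy <- tibble(
  model        = c("a4", "a4", "civic"),
  manufacturer = c("audi", "audi", "honda")
)

grpByCols <- c("model", "manufacturer")

# syms() returns a list with one symbol per input string
symbols <- syms(grpByCols)
length(symbols)
is.symbol(symbols[[1]])

# !!! splices the list, as if group_by(model, manufacturer) had been typed
vars <- toy %>% group_by(!!!symbols) %>% group_vars()
vars
```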
Using cur_data() (deprecated since dplyr 1.1.0 in favor of pick())
library(dplyr)
mpg %>%
group_by(cur_data()[grpByCols])
-output
# A tibble: 234 x 11
# Groups: model, manufacturer [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# … with 224 more rows

Sample n rows from a data frame by group using another data frame

Looking to randomly sample n rows from a dataframe by group based on the criteria of another data frame.
Example:
Randomly sample rows from the ggplot2::mpg dataframe based on the manufacturer and year grouping, where n = the pick column of the pick_df data frame.
i.e. randomly sample 3 rows from ggplot2::mpg that are hondas made in 2008, 10 volkswagens made in 1999, 6 audis made in 1999, etc.
manufacturer year pick
<chr> <int> <int>
1 honda 2008 3
2 volkswagen 1999 10
3 audi 1999 6
4 land rover 2008 2
5 subaru 1999 6
Expected output:
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
2 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
3 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
4 volkswagen gti 2.8 1999 6 manual(m5) f 17 24 r compact
5 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
6 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
7 volkswagen new beetle 2 1999 4 auto(l4) f 19 26 r subcompact
8 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
9 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
10 volkswagen jetta 2.8 1999 6 auto(l4) f 16 23 r compact
11 volkswagen new beetle 2 1999 4 manual(m5) f 21 29 r subcompact
12 volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
13 volkswagen gti 2 1999 4 auto(l4) f 19 26 r compact
...27 rows total...
Header of the mpg data frame from which to sample:
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
Data sources for reprex:
Source for picking data frame pick_df:
structure(list(manufacturer = c("honda", "volkswagen", "audi",
"land rover", "subaru"), year = c(2008L, 1999L, 1999L, 2008L,
1999L), pick = c(3L, 10L, 6L, 2L, 6L)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L))
mpg Data frame to sample:
ggplot2::mpg
Tried so far
I can use filter or likely slice, but the coding is all manual. The real use case has thousands of rows and hundreds of groups.
filter(mpg, manufacturer=='honda', year==2008) %>% sample_n(3)
filter(mpg, manufacturer=='volkswagen', year==1999) %>% sample_n(10)
etc...
edit:
Can filter in a loop, but kinda ugly:
df <- mpg[0,]
for(i in 1:nrow(pick_df)){
temp <- filter(mpg, manufacturer==pick_df$manufacturer[i], year==pick_df$year[i]) %>% sample_n(pick_df$pick[i])
df <- rbind(temp,df)
}
We can do an inner_join with 'pick_df', group by 'manufacturer' and 'year', and then call sample_n with the first value of 'pick' in each group:
library(dplyr)
library(ggplot2) # for the mpg dataset
mpg %>%
inner_join(pick_df, by = c("manufacturer", "year")) %>%
group_by(manufacturer, year) %>%
sample_n(first(pick))
# A tibble: 27 x 12
# Groups: manufacturer, year [5]
# manufacturer model displ year cyl trans drv cty hwy fl class pick
# <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
# 1 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 6
# 2 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize 6
# 3 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 6
# 4 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact 6
# 5 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 6
# 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 6
# 7 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact 3
# 8 honda civic 2 2008 4 manual(m6) f 21 29 p subcompact 3
# 9 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact 3
#10 land rover range rover 4.2 2008 8 auto(s6) 4 12 18 r suv 2
# … with 17 more rows
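The same join-then-sample idea can also be sketched in base R. This is an alternative illustration rather than the answer's method, and pool and picks are hypothetical toy tables standing in for mpg and pick_df:

```r
set.seed(1)  # only to make the illustration reproducible

# Hypothetical data to sample from: three groups of different sizes
pool <- data.frame(grp = rep(c("a", "b", "c"), times = c(5, 4, 3)), id = 1:12)

# Hypothetical pick table: how many rows to draw from each listed group
picks <- data.frame(grp = c("a", "b"), pick = c(2L, 3L))

# Inner join: group "c" is dropped because it has no entry in picks
joined <- merge(pool, picks, by = "grp")

# Split by group and draw pick[1] rows from each piece without replacement
pieces <- lapply(split(joined, joined$grp), function(d) {
  d[sample(nrow(d), d$pick[1]), ]
})
result <- do.call(rbind, pieces)

table(result$grp)
```

As written, sample() stops with an error when a group has fewer rows than its pick value, so a real pipeline would probably need to cap the size or sample with replacement.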

Boxplot by group and then column in r

How do I make a boxplot such that each group of boxes contains columns of variables from a dataframe?
For example using the mpg dataset:
mpg
# A tibble: 234 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# ... with 224 more rows
So within each cyl group (4, 5, 6, 8), I want to have boxplots for each variable/column: cty, hwy, and displ.
Usually, one will set the fill in ggplot to be a factor variable but in this case, I have 3 variables.
It should look something like this:
You need to transform your data to long format on your three variables. Here is an example with data.table and its melt function, but you can easily do the same with tidyr:
library(ggplot2)
library(data.table)
mpg <- setDT(copy(mpg))
mpg_plot <- melt(mpg,measure.vars = c("cty","hwy","displ"),value.name = "val",variable.name = "var")
ggplot(mpg_plot, aes(x = as.factor(cyl),y = val,fill = var))+
geom_boxplot()+
theme_light()
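For readers who prefer tidyr, the reshaping step might look like the sketch below. It assumes tidyr's pivot_longer() and reuses the var and val column names from the data.table version; mpg_long is a hypothetical name.

```r
library(ggplot2)
library(tidyr)

# Stack cty, hwy and displ into one value column, with the original
# column name stored in "var"
mpg_long <- pivot_longer(
  ggplot2::mpg,
  cols      = c(cty, hwy, displ),
  names_to  = "var",
  values_to = "val"
)

# One cluster of boxes per cyl value, one box per variable within it
p <- ggplot(mpg_long, aes(x = factor(cyl), y = val, fill = var)) +
  geom_boxplot() +
  theme_light()
```

Printing p draws the plot. Note that displ is on a different scale from cty and hwy, so its boxes will appear compressed next to the other two.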

Unable to select

I want to select variables which are character and integer type using dplyr's select_if function. But the code below throws an error.
mpg %>% select_if(is.character | is.integer)
How do I solve this?
mpg %>% select_if(is.character) alone works well, how do I apply multiple conditions?
We could use the ~ as well
library(dplyr)
mpg %>%
select_if(~ is.character(.x)|is.integer(.x))
Or with inherits
mpg %>%
select_if(~ inherits(.x, c("character", "integer")))
One way would be to use an anonymous function
library(dplyr)
mpg %>% select_if(function(x) is.character(x) | is.integer(x))
# manufacturer model year cyl trans drv cty hwy fl class
# <chr> <chr> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
# 1 audi a4 1999 4 auto(l5) f 18 29 p compact
# 2 audi a4 1999 4 manual(m5) f 21 29 p compact
# 3 audi a4 2008 4 manual(m6) f 20 31 p compact
# 4 audi a4 2008 4 auto(av) f 21 30 p compact
# 5 audi a4 1999 6 auto(l5) f 16 26 p compact
# 6 audi a4 1999 6 manual(m5) f 18 26 p compact
# 7 audi a4 2008 6 auto(av) f 18 27 p compact
# 8 audi a4 quattro 1999 4 manual(m5) 4 18 26 p compact
# 9 audi a4 quattro 1999 4 auto(l5) 4 16 25 p compact
#10 audi a4 quattro 2008 4 manual(m6) 4 20 28 p compact
# … with 224 more rows
Or using funs() (deprecated since dplyr 0.8.0):
mpg %>% select_if(funs(is.character(.) | is.integer(.)))
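The original attempt fails because | cannot combine two functions; it only works on values such as logical vectors. In current dplyr, select_if is superseded, and the same selection can be sketched with select() and where(). The toy tibble below is hypothetical:

```r
library(dplyr)

# Hypothetical toy data with one character, one integer and one double column
toy <- tibble(
  name  = c("a", "b"),
  count = c(1L, 2L),
  ratio = c(0.5, 1.5)
)

# Each where() call is a selection, and selections can be combined with |
picked <- toy %>% select(where(is.character) | where(is.integer))

# Equivalent form with a single predicate function
picked2 <- toy %>% select(where(~ is.character(.x) || is.integer(.x)))

names(picked)
```

Both forms keep name and count and drop ratio.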

Resources