Boxplot by group and then column in R

How do I make a boxplot such that each group of boxes contains one box per column (variable) of a dataframe?
For example using the mpg dataset:
head(mpg)
# A tibble: 234 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# ... with 224 more rows
So within each cyl group (4,5,6,8), I want to have boxplots for each variable/column cty,hwy, and displ.
Usually, one will set the fill in ggplot to be a factor variable but in this case, I have 3 variables.
It should look something like this:

You need to transform your data to long format on your three variables. Here is an example with data.table and its melt function, but you can easily do the same with tidyr:
library(ggplot2)
library(data.table)
mpg <- setDT(copy(mpg))
mpg_plot <- melt(mpg,measure.vars = c("cty","hwy","displ"),value.name = "val",variable.name = "var")
ggplot(mpg_plot, aes(x = as.factor(cyl),y = val,fill = var))+
geom_boxplot()+
theme_light()
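The tidyr route mentioned in this answer might look like the sketch below. It assumes the tidyr package is installed; the names mpg_long and p are illustrative and not part of the original answer.

```r
library(ggplot2)
library(tidyr)

# Reshape the three measurement columns into long format:
# one row per (car, variable) pair.
mpg_long <- pivot_longer(
  mpg,
  cols = c(cty, hwy, displ),
  names_to = "var",
  values_to = "val"
)

# One group of boxes per cyl value, one box per variable within each group.
p <- ggplot(mpg_long, aes(x = factor(cyl), y = val, fill = var)) +
  geom_boxplot() +
  theme_light()
p
```

Because pivot_longer stores the variable names as character, the boxes within each group are ordered alphabetically unless var is converted to a factor with explicit levels.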

Related

I want to use a previously created function in the mutate() function. Yet R doesn't seem to want to let me [duplicate]

I am looking at population data and want to make sure I have enough observations to do county-level analysis. Therefore I would like to generate a variable that assigns each observation the number of observations with the same value in the "county" column.
I want to assign each row in my data frame ("cps") a new variable ("freq") which represents the frequency of its specific value in one specific variable ("county").
I used
f <- function(x)sum(with(cps, county==x))
to generate a function that tells me how often a given county x appears in the data.
Now I want to use
cps <- mutate(cps, freq=f(county))
to assign each row the number of times its county value appears in the data frame.
However, it assigns each row with the overall number of observations.
You can get what you want using dplyr::add_count():
library(dplyr)
mpg %>% add_count(cyl, name = "freq")
# A tibble: 234 × 12
manufacturer model displ year cyl trans drv cty hwy fl class freq
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 81
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 81
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact 81
4 audi a4 2 2008 4 auto(av) f 21 30 p compact 81
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 79
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 79
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact 79
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact 81
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 81
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact 81
# … with 224 more rows
But if you wanted to use your function, you'd need to wrap it in sapply() (or purrr::map_int()) to compare each element of x against every element:
f <- function(x) sapply(x, \(x) sum(with(mpg, cyl == x)))
You can also generalize it to work with any column:
f2 <- function(x) sapply(x, \(x_i) sum(x == x_i))
mutate(mpg, freq=f2(drv))
# A tibble: 234 × 12
manufacturer model displ year cyl trans drv cty hwy fl class freq
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 106
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 106
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact 106
4 audi a4 2 2008 4 auto(av) f 21 30 p compact 106
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 106
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 106
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact 106
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact 103
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 103
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact 103
# … with 224 more rows
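To see why the original function returned the overall number of observations, the following sketch reproduces the behaviour on mpg (standing in for the cps data, which is not available here) and shows a grouped alternative. The names f_whole, total and counted are illustrative.

```r
library(dplyr)
library(ggplot2)  # only for the mpg example data

# The original style of function: `==` is vectorised, so passing the whole
# column compares it with itself element by element and every result is TRUE.
f_whole <- function(x) sum(mpg$cyl == x)
total <- f_whole(mpg$cyl)  # equals nrow(mpg), not a per-value count

# A grouped alternative: count the rows within each cyl value.
counted <- mpg %>%
  group_by(cyl) %>%
  mutate(freq = n()) %>%
  ungroup()
```

Inside mutate() the function receives the entire column at once, which is why a single total is recycled across all rows.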

Dynamically selecting multiple columns for group_by

Data masking for group_by does not work when there is more than one grouping variable.
Pasting code below
grpByCols <- "model"
mpg%>%
group_by(.data[[grpByCols]])
grpByCols <- c("model", "manufacturer")
mpg%>%
group_by(.data[[grpByCols]])
The first group_by works, the second one fails.
Pasting the run output below
> grpByCols <- "model"
>
> mpg%>%
+ group_by(.data[[grpByCols]])
# A tibble: 234 x 11
# Groups: model [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# … with 224 more rows
>
> grpByCols <- c("model", "manufacturer")
>
> mpg%>%
+ group_by(.data[[grpByCols]])
Error: Problem with `mutate()` input `..1`.
x Must subset the data pronoun with a string.
ℹ Input `..1` is `<unknown>`.
Run `rlang::last_error()` to see where the error occurred.
>
Please let me know if you have any ideas to make this work
A simple way is to use the across() function from dplyr.
mpg %>% group_by(across(all_of(grpByCols)))
# A tibble: 234 × 11
# Groups: model, manufacturer [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
We could unquote the symbol with !!
grpByCols <- "model"
mpg%>%
group_by(!!sym(grpByCols))
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# ... with 224 more rows
You can use rlang::syms(), which takes strings as input and turns them into symbols. Since the output is a list of length 2 (matching the length of the input), we use the big bang operator !!! to splice the elements of the list, so that each one becomes a separate argument to group_by():
library(rlang)
grpByCols <- c("model", "manufacturer")
mpg %>%
group_by(!!!syms(grpByCols))
# A tibble: 234 x 11
# Groups: model, manufacturer [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# ... with 224 more rows
Using cur_data() (deprecated as of dplyr 1.1.0 in favor of pick()):
library(dplyr)
mpg %>%
group_by(cur_data()[grpByCols])
-output
# A tibble: 234 x 11
# Groups: model, manufacturer [38]
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
# … with 224 more rows
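The .data[[...]] pronoun accepts only a single string, which is why the two-column version fails. One way to package the across(all_of(...)) approach for reuse is sketched below; count_by is a hypothetical helper name, not a dplyr function.

```r
library(dplyr)
library(ggplot2)  # only for the mpg example data

# Hypothetical helper: group by any number of columns given as a
# character vector, then count the rows in each group.
count_by <- function(df, cols) {
  df %>%
    group_by(across(all_of(cols))) %>%
    summarise(n = n(), .groups = "drop")
}

one_col  <- count_by(mpg, "model")
two_cols <- count_by(mpg, c("model", "manufacturer"))
```

Using all_of() makes the call fail loudly if a requested column is missing, rather than silently grouping by fewer columns.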

Sample n rows from a data frame by group using another data frame

Looking to randomly sample n rows from a dataframe by group based on the criteria of another data frame.
Example:
Randomly sample rows from the ggplot2::mpg dataframe based on the manufacturer and year grouping, where n = the pick column of the pick_df data frame.
i.e. randomly sample 3 rows from ggplot2::mpg that are hondas made in 2008, 10 volkswagens made in 1999, 6 audis made in 1999, etc.
manufacturer year pick
<chr> <int> <int>
1 honda 2008 3
2 volkswagen 1999 10
3 audi 1999 6
4 land rover 2008 2
5 subaru 1999 6
Expected output:
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
2 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
3 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
4 volkswagen gti 2.8 1999 6 manual(m5) f 17 24 r compact
5 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
6 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
7 volkswagen new beetle 2 1999 4 auto(l4) f 19 26 r subcompact
8 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
9 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
10 volkswagen jetta 2.8 1999 6 auto(l4) f 16 23 r compact
11 volkswagen new beetle 2 1999 4 manual(m5) f 21 29 r subcompact
12 volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
13 volkswagen gti 2 1999 4 auto(l4) f 19 26 r compact
...27 rows total...
Header of the mpg data frame from which to sample:
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
Data sources for reprex:
Source for picking data frame pick_df:
structure(list(manufacturer = c("honda", "volkswagen", "audi",
"land rover", "subaru"), year = c(2008L, 1999L, 1999L, 2008L,
1999L), pick = c(3L, 10L, 6L, 2L, 6L)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L))
mpg Data frame to sample:
ggplot2::mpg
Tried so far
I can use filter or likely slice, but the coding is all manual. The real use case has thousands of rows and hundreds of groups.
filter(mpg, manufacturer=='honda', year==2008) %>% sample_n(3)
filter(mpg, manufacturer=='volkswagen', year==1999) %>% sample_n(10)
etc...
edit:
Can filter in a loop, but kinda ugly:
df <- mpg[0,]
for(i in 1:nrow(pick_df)){
temp <- filter(mpg, manufacturer==pick_df$manufacturer[i], year==pick_df$year[i]) %>% sample_n(pick_df$pick[i])
df <- rbind(temp,df)
}
We can inner_join mpg with pick_df, group by manufacturer and year, and then call sample_n with the first value of pick in each group:
library(dplyr)
library(ggplot2)
mpg %>%
inner_join(pick_df) %>%
group_by(manufacturer, year) %>%
sample_n(first(pick))
# A tibble: 27 x 12
# Groups: manufacturer, year [5]
# manufacturer model displ year cyl trans drv cty hwy fl class pick
# <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <int>
# 1 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact 6
# 2 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize 6
# 3 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 6
# 4 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact 6
# 5 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 6
# 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact 6
# 7 honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact 3
# 8 honda civic 2 2008 4 manual(m6) f 21 29 p subcompact 3
# 9 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact 3
#10 land rover range rover 4.2 2008 8 auto(s6) 4 12 18 r suv 2
# … with 17 more rows
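A variant of the same join-then-sample idea is sketched below using group_modify(), with a guard for groups that contain fewer rows than requested. The guard and the name sampled are additions for illustration; the original answer does not include them.

```r
library(dplyr)
library(ggplot2)  # only for the mpg example data

pick_df <- data.frame(
  manufacturer = c("honda", "volkswagen", "audi", "land rover", "subaru"),
  year = c(2008L, 1999L, 1999L, 2008L, 1999L),
  pick = c(3L, 10L, 6L, 2L, 6L)
)

set.seed(1)  # only to make the random draw repeatable
sampled <- mpg %>%
  inner_join(pick_df, by = c("manufacturer", "year")) %>%
  group_by(manufacturer, year) %>%
  group_modify(function(rows, key) {
    # Never ask for more rows than the group actually has.
    k <- min(rows$pick[1], nrow(rows))
    rows[sample.int(nrow(rows), k), ]
  }) %>%
  ungroup() %>%
  select(-pick)
```

Naming the join columns explicitly in by avoids the message that inner_join() prints when it has to guess them.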

How to filter a variable inside a function in R?

I need help with something really simple in R. I defined a function to perform few operations and I'm unable to select the variable while calling the function using input parameters.
Eg: Using the mpg dataset just for reference, I need to keep only the rows where disp > 2.0
mpg
#Defining a simple function called select_fun
select_fun <- function(x)
{
a <- mpg %>% filter(x > 2)
return(a)
}
select_fun("disp")
Output:
<chr> model disp year cyl trans drv cty hwy class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
The output is incorrect as the filtered values are still present.
Maybe I'm missing out on something really simple and dumb!!
Any help would be really appreciated
Thanks!!
There are various ways in which you can solve this problem :
library(dplyr)
library(rlang)
1) Use filter_at :
select_fun1 <- function(df, x) {
a <- df %>% filter_at(vars(x), any_vars(. > 2))
return(a)
}
2) Use base R subsetting
select_fun2 <- function(df, x) {
a <- df[df[[x]] > 2,]
return(a)
}
3) Use non-standard evaluation
select_fun3 <- function(df, x) {
a <- df %>% filter(!!sym(x) > 2)
return(a)
}
Check that the results of all three are the same.
identical(select_fun1(mpg, 'displ'), select_fun2(mpg, 'displ'))
#[1] TRUE
identical(select_fun1(mpg, 'displ'), select_fun3(mpg, 'displ'))
#[1] TRUE
One more, almost identical to Ronak Shah's select_fun3 but a bit shorter (thanks to the curly-curly operator), and you don't need to quote the variable name with it:
select_fun4 <- function(df, x) {
df %>% filter({{x}} > 2)
}
select_fun4(mpg, displ)
# A tibble: 191 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
2 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
3 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
4 audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact
5 audi a4 quattro 2.8 1999 6 manual(m5) 4 17 25 p compact
6 audi a4 quattro 3.1 2008 6 auto(s6) 4 17 25 p compact
7 audi a4 quattro 3.1 2008 6 manual(m6) 4 15 25 p compact
8 audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize
9 audi a6 quattro 3.1 2008 6 auto(s6) 4 17 25 p midsize
10 audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize
# ... with 181 more rows
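When the column name arrives as a string, another option is the .data pronoun, sketched below. The function name select_fun5 and the threshold argument are illustrative additions. In the question's version, filter(x > 2) compares the string itself with 2 rather than the column it names, so the condition has the same value for every row and nothing is filtered.

```r
library(dplyr)
library(ggplot2)  # only for the mpg example data

# x is a column name given as a string; .data[[x]] looks that column up
# in the data frame being filtered.
select_fun5 <- function(df, x, threshold = 2) {
  df %>% filter(.data[[x]] > threshold)
}

res <- select_fun5(mpg, "displ")
```

This keeps the call site the same as in the question (a quoted column name) while making the comparison apply to the column's values.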

Unable to select

I want to select variables which are character and integer type using dplyr's select_if function. But the code below throws an error.
mpg %>% select_if(is.character | is.integer)
How do I solve this?
mpg %>% select_if(is.character) alone works well, how do I apply multiple conditions?
We could use the ~ as well
library(dplyr)
mpg %>%
select_if(~ is.character(.x)|is.integer(.x))
Or with inherits
mpg %>%
select_if(~ inherits(.x, c("character", "integer")))
One way would be to use an anonymous function
library(dplyr)
mpg %>% select_if(function(x) is.character(x) | is.integer(x))
# manufacturer model year cyl trans drv cty hwy fl class
# <chr> <chr> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
# 1 audi a4 1999 4 auto(l5) f 18 29 p compact
# 2 audi a4 1999 4 manual(m5) f 21 29 p compact
# 3 audi a4 2008 4 manual(m6) f 20 31 p compact
# 4 audi a4 2008 4 auto(av) f 21 30 p compact
# 5 audi a4 1999 6 auto(l5) f 16 26 p compact
# 6 audi a4 1999 6 manual(m5) f 18 26 p compact
# 7 audi a4 2008 6 auto(av) f 18 27 p compact
# 8 audi a4 quattro 1999 4 manual(m5) 4 18 26 p compact
# 9 audi a4 quattro 1999 4 auto(l5) 4 16 25 p compact
#10 audi a4 quattro 2008 4 manual(m6) 4 20 28 p compact
# … with 224 more rows
Or using funs() (deprecated since dplyr 0.8.0):
mpg %>% select_if(funs(is.character(.) | is.integer(.)))
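The original attempt fails because | cannot combine two functions; the predicate has to be a single function that returns TRUE or FALSE for a column. With more recent dplyr (1.0.0 or later is assumed here), the same selection can be sketched with select() and where(); the name chr_or_int is illustrative.

```r
library(dplyr)
library(ggplot2)  # only for the mpg example data

# Keep every column that is either character or integer,
# preserving the original column order.
chr_or_int <- mpg %>%
  select(where(function(x) is.character(x) || is.integer(x)))
```

In mpg this drops only displ, the single double column.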
