Subset by multiple single-variable conditions in one step? - r

I hope I can explain what I'm trying to do sufficiently. I'm working in R and for a dataset I'm trying to keep only observations where for one variable, another variable satisfies two conditions.
Specifically, I want to keep only rows where for a particular "cyl", there is at least one mpg value >20, and at least one <20. Here is some example data from mtcars similar to what I'm working with.
mpg cyl
1 21.0 6
2 21.0 6
3 22.8 4
4 21.4 6
5 18.7 8
6 18.1 6
7 14.3 8
8 24.4 4
9 22.8 4
10 19.2 6
11 17.8 6
12 16.4 8
13 17.3 8
14 15.2 8
15 10.4 8
16 10.4 8
17 14.7 8
18 32.4 4
19 30.4 4
20 33.9 4
Ideally, my output for the above example would be what's below.
mpg cyl
1 21.0 6
2 21.0 6
4 21.4 6
6 18.1 6
10 19.2 6
11 17.8 6
Thanks in advance!

Assuming your dataframe input is DF, try this:
library(dplyr)
DF %>%
group_by(cyl) %>%
filter(sum(mpg > 20) > 1 & sum(mpg < 20) > 1)
# A tibble: 7 x 2
# Groups: cyl [1]
# mpg cyl
# <dbl> <dbl>
# 1 21 6
# 2 21 6
# 3 21.4 6
# 4 18.1 6
# 5 19.2 6
# 6 17.8 6
# 7 19.7 6
data
DF <- mtcars[,1:2]

Related

Using everything() from tidyselect in dplyr to select a variable last

I would like to programmatically tell what column I'd like to display LAST. The function everything() from tidyr is great when it is used as the last argument --you specify everything before it and let it do the rest
But what if I want to do it the other way-- say what variable I want last, and then have everything() (or another function) provide all variables not selected before it.
Suppose in the example below I'd like to select cyl as the last variable to display...
library(tidyverse)
data = mtcars %>% as_tibble() %>% select(1:4)
data %>% select( cyl,everything()) #works -- cyl at beginning
#> # A tibble: 32 × 4
#> cyl mpg disp hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 21 160 110
#> 2 6 21 160 110
#> 3 4 22.8 108 93
#> 4 6 21.4 258 110
#> 5 8 18.7 360 175
#> 6 6 18.1 225 105
#> 7 8 14.3 360 245
#> 8 4 24.4 147. 62
#> 9 4 22.8 141. 95
#> 10 6 19.2 168. 123
#> # … with 22 more rows
data %>% select(everything(), cyl) #does not work! -- cyl not at end!
#> # A tibble: 32 × 4
#> mpg cyl disp hp
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110
#> 2 21 6 160 110
#> 3 22.8 4 108 93
#> 4 21.4 6 258 110
#> 5 18.7 8 360 175
#> 6 18.1 6 225 105
#> 7 14.3 8 360 245
#> 8 24.4 4 147. 62
#> 9 22.8 4 141. 95
#> 10 19.2 6 168. 123
#> # … with 22 more rows
Created on 2022-07-19 by the reprex package (v2.0.1)
I would want it to look like this
data %>% some_fxn()
# A tibble: 32 × 4
mpg disp hp cyl
<dbl> <dbl> <dbl> <dbl>
1 21 160 110 6
2 21 160 110 6
3 22.8 108 93 4
4 21.4 258 110 6
5 18.7 360 175 8
6 18.1 225 105 6
7 14.3 360 245 8
8 24.4 147. 62 4
9 22.8 141. 95 4
10 19.2 168. 123 6
# … with 22 more rows
Is there another function or an option to everything() to allow it to do this?
You can use the relocate() verb
data %>%
relocate(cyl, .after = last_col())
If you are going to often move things to the end of the data.frame, you can write your own helper
relocate_end <- function(..., .after=NULL) {
relocate(..., .after=last_col())
}
and this will work with multiple columns as well
data %>% relocate_end(cyl, mpg)
# disp hp cyl mpg
# <dbl> <dbl> <dbl> <dbl>
# 1 160 110 6 21
# 2 160 110 6 21
# 3 108 93 4 22.8
# 4 258 110 6 21.4
# 5 360 175 8 18.7
# 6 225 105 6 18.1
# 7 360 245 8 14.3
# 8 147. 62 4 24.4
# 9 141. 95 4 22.8
# 10 168. 123 6 19.2
To get everything except cyl, just use -cyl.
data %>% select(-cyl, cyl)
#> # A tibble: 32 x 4
#> mpg disp hp cyl
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 160 110 6
#> 2 21 160 110 6
#> 3 22.8 108 93 4
#> 4 21.4 258 110 6
#> 5 18.7 360 175 8
#> 6 18.1 225 105 6
#> 7 14.3 360 245 8
#> 8 24.4 147. 62 4
#> 9 22.8 141. 95 4
#> 10 19.2 168. 123 6
#> # ... with 22 more rows
You can try
data %>% select(setdiff(colnames(data) , "cyl") , cyl)

pass a function a vector or undefined number of arguments

I want to be able to pass a function an undefined number of arguments via ... but also to be able to pass it a vector. Here is a silly example:
library(tidyverse)
df <- data.frame(gear = as.character(unique(mtcars$gear)),
id = 1:3)
myfun <- function(...) {
ids_lst <- lst(...)
df2 <- bind_rows(map(ids_lst, function(x)
mtcars %>%
filter(gear == x) %>%
select(mpg)), .id = "gear") %>%
left_join(df)
df2
}
#these all work:
myfun(3)
myfun(3, 4)
myfun(3, 4, 5)
Passing it a vector doesn't work though:
myvector <- unique(mtcars$gear)
myfun(myvector)
The problem is because of the way the function collects the arguments and how it returns them:
myfun_lst <- function(...) {
ids_lst <- lst(...)
ids_lst
}
myfun_lst(3, 4, 5)
# $`3`
# [1] 3
# $`4`
# [1] 4
# $`5`
# [1] 5
myfun_lst(myvector)
# $myvector
# [1] 4 3 5
I thought a fix would be to test if the input is a vector, something like:
myfun_final <- function(...) {
if(is.vector(...) & !is.list(...)) {
ids_lst <- as.list(...)
names(ids_lst) <- (...)
} else {
ids_lst <- lst(...)
}
df2 <- bind_rows(map(ids_lst, function(x)
mtcars %>%
filter(gear == x) %>%
select(mpg)), .id = "gear") %>%
left_join(df)
df2
}
Now, passing the function a vector works but collecting the arguments doesn't:
myfun_final(3, 4, 5)
myfun_final(myvector)
What is a good way to solve this?
Thanks
Of course you can change your function so that it will work with both, regular arguments myfun(3, 4, 5) and a vector myfun(myvector), as shown in the answer above.
Another option is that you make use of argument splicing by unquoting with the bang bang bang operator !!!. This operator is only supported in certain {rlang} and {tidyverse} functions. In your example you evaluate the dots ... inside purrr::map which supports argument splicing. Therefore there might not be the need to rewrite your function:
library(tidyverse)
# your original function:
myfun <- function(...) {
ids_lst <- lst(...)
df2 <- bind_rows(map(ids_lst, function(x)
mtcars %>%
filter(gear == x) %>%
select(mpg)), .id = "gear") %>%
left_join(df)
df2
}
myvector <- unique(mtcars$gear)
myfun(!!! myvector) # works
#> Joining, by = "gear"
#> gear mpg id
#> 1 4 21.0 1
#> 2 4 21.0 1
#> 3 4 22.8 1
#> 4 4 24.4 1
#> 5 4 22.8 1
#> 6 4 19.2 1
#> 7 4 17.8 1
#> 8 4 32.4 1
#> 9 4 30.4 1
#> 10 4 33.9 1
#> ...
myfun(3, 4, 5) # works
#> Joining, by = "gear"
#> gear mpg id
#> 1 3 21.4 2
#> 2 3 18.7 2
#> 3 3 18.1 2
#> 4 3 14.3 2
#> 5 3 16.4 2
#> 6 3 17.3 2
#> 7 3 15.2 2
#> 8 3 10.4 2
#> 9 3 10.4 2
#> 10 3 14.7 2
#> ...
Created on 2021-12-30 by the reprex package (v0.3.0)
You can read more about unquoting with the bang bang bang operator here.
Finally, you should think about the users of your function. If you are the only user then choose whatever suits you. In case there are other users you should think about how they expect the function to work. Probably users don't expect a function to work with several arguments and at the same time, alternatively, by providing those arguments in a vector. In the tidyverse argument splicing with !!! is a well established concept. In base R we would usually use do.call("myfun", as.list(myvector)) to achieve something similar.
To add another option:
The purrr package has a family of lift functions which can be used to alter the kind of arguments a function takes. The most prominent is lift_dl which transforms a function that takes dots as argument to a function that takes a list or vector as argument. This could also be used to solve the problem without the need to rewrite the function:
lift_dl(myfun)(myvector)
#> Joining, by = "gear"
#> gear mpg id
#> 1 4 21.0 1
#> 2 4 21.0 1
#> 3 4 22.8 1
#> 4 4 24.4 1
#> 5 4 22.8 1
#> 6 4 19.2 1
#> 7 4 17.8 1
#> 8 4 32.4 1
#> 9 4 30.4 1
#> 10 4 33.9 1
#> ...
Created on 2022-01-01 by the reprex package (v0.3.0)
How about testing if ... is of length 1 and if the only argument passed through is a vector? If not so, then consider ... a list of scalers and capture them with lst(...).
myfun_final <- function(...) {
if (...length() == 1L && is.vector(..1))
ids_lst <- `names<-`(..1, ..1)
else
ids_lst <- lst(...)
df2 <- bind_rows(map(ids_lst, function(x)
mtcars %>%
filter(gear == x) %>%
select(mpg)), .id = "gear") %>%
left_join(df)
df2
}
Test
> myfun_final(3)
Joining, by = "gear"
gear mpg id
1 3 21.4 2
2 3 18.7 2
3 3 18.1 2
4 3 14.3 2
5 3 16.4 2
6 3 17.3 2
7 3 15.2 2
8 3 10.4 2
9 3 10.4 2
10 3 14.7 2
11 3 21.5 2
12 3 15.5 2
13 3 15.2 2
14 3 13.3 2
15 3 19.2 2
> myfun_final(3,4,5)
Joining, by = "gear"
gear mpg id
1 3 21.4 2
2 3 18.7 2
3 3 18.1 2
4 3 14.3 2
5 3 16.4 2
6 3 17.3 2
7 3 15.2 2
8 3 10.4 2
9 3 10.4 2
10 3 14.7 2
11 3 21.5 2
12 3 15.5 2
13 3 15.2 2
14 3 13.3 2
15 3 19.2 2
16 4 21.0 1
17 4 21.0 1
18 4 22.8 1
19 4 24.4 1
20 4 22.8 1
21 4 19.2 1
22 4 17.8 1
23 4 32.4 1
24 4 30.4 1
25 4 33.9 1
26 4 27.3 1
27 4 21.4 1
28 5 26.0 3
29 5 30.4 3
30 5 15.8 3
31 5 19.7 3
32 5 15.0 3
> myfun_final(c(3,4,5))
Joining, by = "gear"
gear mpg id
1 3 21.4 2
2 3 18.7 2
3 3 18.1 2
4 3 14.3 2
5 3 16.4 2
6 3 17.3 2
7 3 15.2 2
8 3 10.4 2
9 3 10.4 2
10 3 14.7 2
11 3 21.5 2
12 3 15.5 2
13 3 15.2 2
14 3 13.3 2
15 3 19.2 2
16 4 21.0 1
17 4 21.0 1
18 4 22.8 1
19 4 24.4 1
20 4 22.8 1
21 4 19.2 1
22 4 17.8 1
23 4 32.4 1
24 4 30.4 1
25 4 33.9 1
26 4 27.3 1
27 4 21.4 1
28 5 26.0 3
29 5 30.4 3
30 5 15.8 3
31 5 19.7 3
32 5 15.0 3

How to eliminate schools with less than 20 students?

I have a dataset, espana2015, of a country with schools, students…. I want to eliminate schools with less than 20 students.
The variable of the schools is CNTSCHID
dim(espana2015)
[1] 6736 106
The only way, long, manual and not very efficient, is to write one by one the schools.
Here are only 13 schools with less than 20 students, but what if there are many more, e.g. more than 100 schools?
espana2015 %>% group_by(CNTSCHID) %>% summarise(students=n())%>%
filter(students < 20) %>% select (CNTSCHID) ->removeSch
removeSch
# A tibble: 13 x 1
CNTSCHID
<dbl>
1 72400046
2 72400113
3 72400261
4 72400314
5 72400396
6 72400472
7 72400641
8 72400700
9 72400711
10 72400736
11 72400909
12 72400927
13 72400979
espana2015 %>% subset(!CNTSCHID %in% c(72400046,72400113,72400261,
72400314,72400396,72400472,
72400641,72400700,72400711,
72400736,72400909,72400927,
72400979)) -> new_espana2015
Please help me to do it better
Walter
Lacking sample data, I'll demonstrate on mtcars, where my cyl is your CNTSHID.
library(dplyr)
table(mtcars$cyl)
# 4 6 8
# 11 7 14
mtcars %>%
group_by(cyl) %>%
filter(n() > 10) %>%
ungroup()
# # A tibble: 25 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 2 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
# 6 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
# 7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
# 8 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
# 9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
# 10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
# # ... with 15 more rows
This works because the conditional in filter resolves to a single logical, and that length-1 true/false is then recycled for all rows in that group. That is, for cyl == 4, (n() > 10) --> (11 > 10) --> TRUE, so the filter is %>% filter(TRUE); the dplyr::filter function does "safe recycling" in a sense, where the conditional must be the same length as the number of rows, or length 1. When it is length 1, it is essentially saying "all or nothing".

Can I group_by columns with starts_with?

I'm dealing with a big dataframe that has a number of columns I want to group by. I'd like to do something like this:
output <- df %>%
group_by(starts_with("GEN", ignore.case=TRUE),x,y) %>%
summarize(total=n()) %>%
arrange(desc(total))
is there a way to do this? Maybe with group_by_at or some other similar function?
To use starts_with() in group_by(), you need to wrap it in across(). Here is an example using some built data.
library(dplyr)
mtcars %>%
group_by(across(starts_with("c"))) %>%
summarize(total = n()) %>%
arrange(-total)
# A tibble: 9 x 3
# Groups: cyl [3]
cyl carb total
<dbl> <dbl> <int>
1 4 2 6
2 8 4 6
3 4 1 5
4 6 4 4
5 8 2 4
6 8 3 3
7 6 1 2
8 6 6 1
9 8 8 1
Yes, there is. You could use the group_by_at function:
mtcars %>% group_by_at(vars(starts_with("c"), gear))
Group by all columns whose name starts with "c" and by the column gear
Output
# A tibble: 32 x 11
# Groups: cyl, carb, gear [12]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows

map a function with multiple arguments to multiple columns of dataframe

I have a function like this
adstock <- function(x, rate) {
adstocked_variable <- x
for (i in 1:(length(x)-1) ) {
adstocked_variable[i+1] <- x[i + 1] + (rate * adstocked_variable[i])
}
adstocked_variable
}
And a vector of rates like so:
rates <- seq(0.00, 1, 0.25)
How can I map the adstock function to every column in mtcars, once for every number in the rates vector?
I want the output to look something like this:
$mpg
mpg_0.00 mpg_0.25 mpg_0.5
1 21.0 21.00000 21.00000
2 21.0 26.25000 31.50000
3 22.8 29.36250 38.55000
4 21.4 28.74062 40.67500
5 18.7 25.88516 39.03750
6 18.1 24.57129 37.61875
7 14.3 20.44282 33.10938
8 24.4 29.51071 40.95469
$cyl
cyl_0.00 cyl_0.25 cyl_0.5
1 6 6.000000 6.000000
2 6 7.500000 9.000000
3 4 5.875000 8.500000
4 6 7.468750 10.250000
5 8 9.867188 13.125000
6 6 8.466797 12.562500
7 8 10.116699 14.281250
8 4 6.529175 11.140625
I have tried variations of the following with no sucess
library(tidyverse)
mtcars %>%
map(~adstock(., rates))
Here is one idea.
library(purrr)
imap(mtcars, ~map_dfc(rates, function(x){
dat <- data.frame(adstock(.x, x))
}) %>%
set_names(paste(.y, rates, sep = "_"))
)
# $mpg
# mpg_0 mpg_0.25 mpg_0.5 mpg_0.75 mpg_1
# 1 21.0 21.00000 21.00000 21.00000 21.0
# 2 21.0 26.25000 31.50000 36.75000 42.0
# 3 22.8 29.36250 38.55000 50.36250 64.8
# 4 21.4 28.74062 40.67500 59.17187 86.2
# 5 18.7 25.88516 39.03750 63.07891 104.9
# 6 18.1 24.57129 37.61875 65.40918 123.0
# 7 14.3 20.44282 33.10938 63.35688 137.3
# 8 24.4 29.51071 40.95469 71.91766 161.7
# 9 22.8 30.17768 43.27734 76.73825 184.5
# 10 19.2 26.74442 40.83867 76.75369 203.7
# 11 17.8 24.48610 38.21934 75.36526 221.5
# 12 16.4 22.52153 35.50967 72.92395 237.9
# 13 17.3 22.93038 35.05483 71.99296 255.2
# 14 15.2 20.93260 32.72742 69.19472 270.4
# 15 10.4 15.63315 26.76371 62.29604 280.8
# 16 10.4 14.30829 23.78185 57.12203 291.2
# 17 14.7 18.27707 26.59093 57.54152 305.9
# 18 32.4 36.96927 45.69546 75.55614 338.3
# 19 30.4 39.64232 53.24773 87.06711 368.7
# 20 33.9 43.81058 60.52387 99.20033 402.6
# 21 21.5 32.45264 51.76193 95.90025 424.1
# 22 15.5 23.61316 41.38097 87.42519 439.6
# 23 15.2 21.10329 35.89048 80.76889 454.8
# 24 13.3 18.57582 31.24524 73.87667 468.1
# 25 19.2 23.84396 34.82262 74.60750 487.3
# 26 27.3 33.26099 44.71131 83.25563 514.6
# 27 26.0 34.31525 48.35566 88.44172 540.6
# 28 30.4 38.97881 54.57783 96.73129 571.0
# 29 15.8 25.54470 43.08891 88.34847 586.8
# 30 19.7 26.08618 41.24446 85.96135 606.5
# 31 15.0 21.52154 35.62223 79.47101 621.5
# 32 21.4 26.78039 39.21111 81.00326 642.9
# ...

Resources