I'm doing
df_sliced <- df %>% group_by(group) %>% slice_max(n=0, order_by=n, with_ties = FALSE)
but the n = 0 is simply ignored: df_sliced is identical to df.
The problem appears to be related to passing n inside slice_max().
For example
mtcars %>% group_by(cyl) %>% slice_max(n=0, order_by=n, with_ties=FALSE)
Error in `slice_max()`:
! Problem while computing indices.
ℹ The error occurred in group 1: cyl = 4.
Caused by error:
! `order_by` must be a vector, not a function.
Run `rlang::last_error()` to see where the error occurred.
and
mtcars %>% group_by(cyl) %>% add_column(n=0) %>% slice_max(n)
# A tibble: 32 × 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb n
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 0
3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 0
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 0
5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 0
6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 0
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 0
8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 0
9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 0
10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 0
# … with 22 more rows
but
mtcars %>% group_by(cyl) %>% add_column(n=0) %>% slice_max(n, with_ties=FALSE)
# A tibble: 3 × 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb n
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0
2 21 6 160 110 3.9 2.62 16.5 0 1 4 4 0
3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0
suggesting that
mtcars %>% group_by(cyl) %>% add_column(n=0) %>% head(0)
# A tibble: 0 × 12
# Groups: cyl [0]
# … with 12 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>,
# gear <dbl>, carb <dbl>, n <dbl>
(or mtcars %>% add_column(n=0) %>% head(0))
is a potential solution.
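On the error in the first mtcars call: mtcars has no column called n, so order_by = n ends up pointing at the function dplyr::n(), which is why the message says "not a function". If the goal is just an empty data frame with the same columns, here is a minimal sketch, assuming slice() or filter() are acceptable substitutes for slice_max(n = 0):

```r
library(dplyr)

# mtcars has no column named n, so the bare name `n` resolves to dplyr::n
has_n_column  <- "n" %in% names(mtcars)
n_is_function <- is.function(n)

# Two ways to keep zero rows while preserving the columns
empty_by_slice  <- mtcars %>% group_by(cyl) %>% slice(0)
empty_by_filter <- mtcars %>% group_by(cyl) %>% filter(FALSE)
```

Both calls avoid the order_by argument entirely, so the name clash with n() does not come into play.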
Related
I'd like to add a row for each group, where the entry for a particular column is the mean of the values of that column for that group. It's easy to add a constant value:
library(dplyr)
mtcars %>% group_by(cyl) %>% group_modify(~add_row(.x, .before=0, carb=2))
# A tibble: 35 x 11
# Groups: cyl [3]
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 NA NA NA NA NA NA NA NA NA 2
2 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1
3 4 24.4 147. 62 3.69 3.19 20 1 0 4 2
4 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
But when I try to add a computed value, e.g. the mean of carb for that group, carb isn't recognised as a column:
mtcars %>% group_by(cyl) %>% group_modify(~add_row(.x, .before=0, carb=mean(carb)))
Error in mean(carb) : object 'carb' not found
Alternatively:
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  summarise(carb = mean(carb)) %>%
  bind_rows(mtcars) %>%
  arrange(cyl)
#> # A tibble: 35 x 11
#> cyl carb mpg disp hp drat wt qsec vs am gear
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1.55 NA NA NA NA NA NA NA NA NA
#> 2 4 1 22.8 108 93 3.85 2.32 18.6 1 1 4
#> 3 4 2 24.4 147. 62 3.69 3.19 20 1 0 4
#> 4 4 2 22.8 141. 95 3.92 3.15 22.9 1 0 4
#> 5 4 1 32.4 78.7 66 4.08 2.2 19.5 1 1 4
#> 6 4 2 30.4 75.7 52 4.93 1.62 18.5 1 1 4
#> 7 4 1 33.9 71.1 65 4.22 1.84 19.9 1 1 4
#> 8 4 1 21.5 120. 97 3.7 2.46 20.0 1 0 3
#> 9 4 1 27.3 79 66 4.08 1.94 18.9 1 1 4
#> 10 4 2 26 120. 91 4.43 2.14 16.7 0 1 5
#> # ... with 25 more rows
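As for why the original attempt fails: inside the group_modify() lambda, carb is not looked up in the group's data, so mean(carb) finds nothing. A minimal sketch, assuming it is fine to reach the column through .x (the current group's rows):

```r
library(dplyr)

# Inside the lambda, columns are reached through .x, the current group's rows
with_means <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ tibble::add_row(.x, carb = mean(.x$carb), .before = 1))
```

This keeps the group_modify() shape of the question, and the added row comes first within each group.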
I do the following:
mtcars %>%
  group_by(cyl) %>%
  summarise(
    model = list(lm(disp ~ mpg, data = cur_data())),
    data = list(dat = cur_data())
  ) -> df
However, when I want to access the list-column data, it gives me this error:
> df$data
$dat
Error: Can't subset elements that don't exist.
x Locations 2, 3, 4, 5, 6, etc. don't exist.
ℹ There are only 1 element.
While the actual glimpse looks like this:
> glimpse(df)
Rows: 3
Columns: 3
$ cyl <dbl> 4, 6, 8
$ model <list> [<233.067448, -4.797961, -15.673940, 30.702798, 17.126060, 1.086485, -11.509437, 0.68342…
$ data <named list> [<tbl_df[11 x 11]>, <tbl_df[7 x 11]>, <tbl_df[14 x 11]>]
Not really sure what is going wrong here...
Change the order of the operations so that data is created before model.
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(
    data = list(dat = cur_data()),
    model = list(lm(disp ~ mpg, data = cur_data())),
  ) -> df
df$data
#$dat
# A tibble: 11 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 108 93 3.85 2.32 18.6 1 1 4 1
# 2 24.4 147. 62 3.69 3.19 20 1 0 4 2
# 3 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
# 4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 5 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
#...
#...
#$dat
# A tibble: 7 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 21 160 110 3.9 2.62 16.5 0 1 4 4
#2 21 160 110 3.9 2.88 17.0 0 1 4 4
#3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
#4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
#5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
#6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
#7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
#$dat
# A tibble: 14 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 18.7 360 175 3.15 3.44 17.0 0 0 3 2
# 2 14.3 360 245 3.21 3.57 15.8 0 0 3 4
# 3 16.4 276. 180 3.07 4.07 17.4 0 0 3 3
# 4 17.3 276. 180 3.07 3.73 17.6 0 0 3 3
# 5 15.2 276. 180 3.07 3.78 18 0 0 3 3
#...
I don't know the exact reason for this error, but my guess is that once model = list(lm(disp ~ mpg, data = cur_data())) has been evaluated, cur_data() contains the model column as well as the current group's data, and that causes problems when the data is stored.
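Another way to sidestep the ordering issue is to build the data list-column first and fit the models from it. This is only a sketch, assuming tidyr is available and that one row per cyl with a data and a model column is the desired shape:

```r
library(dplyr)
library(tidyr)

df <- mtcars %>%
  nest(data = -cyl) %>%   # one row per cyl; data is a list-column of tibbles
  mutate(model = lapply(data, function(d) lm(disp ~ mpg, data = d)))
```

Here each model is fitted on the already stored data for its row, so neither column depends on what cur_data() returns at a given point.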
I am trying to add a column with a summary statistic calculated by group using group_by() and mutate(). I've done it many times before without any issues, but that is no longer the case. Now the new column contains the summary of the entire data set rather than of each group:
> mtcars %>% group_by(cyl) %>% mutate(mean = mean(disp))
# A tibble: 32 x 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 231.
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 231.
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 231.
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 231.
...
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 231.
The 231 in the last column is the overall mean of disp. I'd expect the last column to contain group means, e.g. 183 for cyl=6 and 105 for cyl=4, etc. What am I doing wrong here?
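One possible cause, and this is an assumption about the session rather than something visible in the output above, is that plyr was attached after dplyr, so plyr::mutate() masks dplyr::mutate() and ignores the grouping. A minimal sketch that rules this out by naming the package explicitly:

```r
library(dplyr)

by_group <- mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::mutate(mean = mean(disp)) %>%
  dplyr::ungroup()

# One distinct mean per cyl group
group_means <- dplyr::distinct(by_group, cyl, mean)
```

If this gives three different means while the unqualified call does not, masking is the likely culprit.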
I would like to take a feature and spread its values into columns, with 1/0 for true/false, e.g.
mtcars %>%
pivot_wider(names_from = cyl,
values_from = 1)
This appears to have done something: cyl has now been spread into columns, but the values are things like 21, 21.4 or NA.
> mtcars %>%
+ pivot_wider(names_from = cyl,
+ values_from = 1)
# A tibble: 32 x 12
disp hp drat wt qsec vs am gear carb `6` `4` `8`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 160 110 3.9 2.62 16.5 0 1 4 4 21 NA NA
2 160 110 3.9 2.88 17.0 0 1 4 4 21 NA NA
3 108 93 3.85 2.32 18.6 1 1 4 1 NA 22.8 NA
4 258 110 3.08 3.22 19.4 1 0 3 1 21.4 NA NA
5 360 175 3.15 3.44 17.0 0 0 3 2 NA NA 18.7
6 225 105 2.76 3.46 20.2 1 0 3 1 18.1 NA NA
7 360 245 3.21 3.57 15.8 0 0 3 4 NA NA 14.3
8 147. 62 3.69 3.19 20 1 0 4 2 NA 24.4 NA
9 141. 95 3.92 3.15 22.9 1 0 4 2 NA 22.8 NA
10 168. 123 3.92 3.44 18.3 1 0 4 4 19.2 NA NA
I tried using values_fill like so:
> mtcars %>%
+ pivot_wider(names_from = cyl,
+ values_from = 1,
+ values_fill = list(1 = 0))
Error: unexpected '=' in:
" values_from = 1,
values_fill = list(1 ="
How can I spread cyl across columns with binary 1 or 0 values depending on whether or not the cyl is 4, 6 or 8?
Is pivot_wider() what I want?
Set mpg to 1 and set the fill for mpg to 0 like this:
mtcars %>%
mutate(mpg = 1) %>%
pivot_wider(names_from = cyl, values_from = mpg, values_fill = list(mpg = 0))
## # A tibble: 32 x 12
## disp hp drat wt qsec vs am gear carb `6` `4` `8`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 160 110 3.9 2.62 16.5 0 1 4 4 1 0 0
## 2 160 110 3.9 2.88 17.0 0 1 4 4 1 0 0
## 3 108 93 3.85 2.32 18.6 1 1 4 1 0 1 0
## ... etc ...
Or, given that pivot_wider currently puts the new columns in order of first appearance (6, 4, 8) rather than sorting them, you may prefer the older spread:
mtcars %>%
mutate(mpg = 1) %>%
spread(cyl, mpg, fill = 0)
## disp hp drat wt qsec vs am gear carb 4 6 8
## 1 71.1 65 4.22 1.835 19.90 1 1 4 1 1 0 0
## 2 75.7 52 4.93 1.615 18.52 1 1 4 2 1 0 0
## 3 78.7 66 4.08 2.200 19.47 1 1 4 1 1 0 0
## ... etc ...
Alternatively, specify values_fn like this:
mtcars %>%
pivot_wider(names_from = cyl, values_from = mpg,
values_fn = list(mpg = ~ 1), values_fill = list(mpg = 0))
## # A tibble: 32 x 12
## disp hp drat wt qsec vs am gear carb `6` `4` `8`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 160 110 3.9 2.62 16.5 0 1 4 4 1 0 0
## 2 160 110 3.9 2.88 17.0 0 1 4 4 1 0 0
## 3 108 93 3.85 2.32 18.6 1 1 4 1 0 1 0
## ...etc...
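Both variants above use up mpg as the value column. If mpg should be kept, a sketch along the same lines, assuming a throwaway helper column (called value here, a name chosen only for illustration), could look like this:

```r
library(dplyr)
library(tidyr)

wide <- mtcars %>%
  mutate(value = 1) %>%                  # helper column: 1 marks the row's own cyl
  pivot_wider(names_from = cyl, values_from = value,
              values_fill = list(value = 0))
```

Each row then has a 1 in the column for its own cyl and a 0 in the other two, and mpg stays in the result.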
An option would be to use cyl for both the names and the values, and then recode the result based on is.na:
mtcars %>%
pivot_wider(names_from = cyl,
values_from = cyl) %>%
mutate_at(vars(!!!syms(as.character(unique(mtcars$cyl)))), ~if_else(is.na(.), 0, 1))
# A tibble: 32 x 13
# mpg disp hp drat wt qsec vs am gear carb `6` `4` `8`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 160 110 3.9 2.62 16.5 0 1 4 4 1 0 0
# 2 21 160 110 3.9 2.88 17.0 0 1 4 4 1 0 0
# 3 22.8 108 93 3.85 2.32 18.6 1 1 4 1 0 1 0
# 4 21.4 258 110 3.08 3.22 19.4 1 0 3 1 1 0 0
# 5 18.7 360 175 3.15 3.44 17.0 0 0 3 2 0 0 1
# 6 18.1 225 105 2.76 3.46 20.2 1 0 3 1 1 0 0
# 7 14.3 360 245 3.21 3.57 15.8 0 0 3 4 0 0 1
# 8 24.4 147. 62 3.69 3.19 20 1 0 4 2 0 1 0
# 9 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 0 1 0
#10 19.2 168. 123 3.92 3.44 18.3 1 0 4 4 1 0 0
Is there an easy way to filter my data frame so that the first row that meets some condition, and every row after it, is filtered out? The issue is that I want it to be robust enough to handle the case where the condition is never met, in which case the whole data frame should be returned. Check out my examples below if that sounds confusing:
library(dplyr)
## Works
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mpg == 17.8)))
#> # A tibble: 11 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 11 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## Doesn't work
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mpg == 30.5)))
#> Error in filter_impl(.data, quo): Evaluation error: Expecting a single value: [extent=0]..
Created on 2018-08-12 by the reprex package (v0.2.0).
You could include an ifelse statement to check whether the value is present in the data frame. You also need to select the first row where the condition is met, to account for cases where the value occurs more than once (21.0 in your example):
library(dplyr)
mtcars %>%
  as_tibble() %>%
  filter(between(row_number(), 1, ifelse(!any(mpg == 30), n(), which(mpg == 30)[1] - 1)))
## returns the whole tibble
mtcars %>%
  as_tibble() %>%
  filter(between(row_number(), 1, ifelse(!any(mpg == 21), n(), which(mpg == 21)[1] - 1)))
## Returns a tibble with 0 rows
mtcars %>%
  as_tibble() %>%
  filter(between(row_number(), 1, ifelse(!any(mpg == 21.4), n(), which(mpg == 21.4)[1] - 1)))
## returns:
# A tibble: 3 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
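The same "keep everything before the first match" behaviour can also be sketched with dplyr's cumany(), which avoids the explicit check for a missing value. The helper name keep_before is hypothetical and only serves the illustration:

```r
library(dplyr)

# Keep rows strictly before the first row where mpg equals `value`;
# if there is no such row, cumany() is FALSE throughout and all rows are kept
keep_before <- function(data, value) {
  data %>% filter(!cumany(mpg == value))
}

cars <- as_tibble(mtcars)
up_to_21_4 <- keep_before(cars, 21.4)   # rows before the first 21.4
no_match   <- keep_before(cars, 30.5)   # 30.5 never occurs
```

Because the cumulative flag switches on at the first match and stays on, repeated values such as 21.0 need no special handling.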
I think your specific example does not work because no mpg equals 30.5. However, you get the same error with mpg equal to 21.0, because there are two rows with that value. You will need to choose whether you want the first or the last instance of that condition:
library(tidyverse)
#max row
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mtcars$mpg == 21.0)[length(which(mtcars$mpg == 21.0))]))
#> # A tibble: 2 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
or
#min row
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mtcars$mpg == 21.0)[1]))
#> # A tibble: 1 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
The example I chose just happened to be rows 1 and 2, but it illustrates the idea.
EDIT
The other answer by Lamia is much more elegant, and I probably thought about this too hard, but I felt like I needed to come up with something:
library(dplyr)
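filter_if_condition <- function(.data, condition, yes) {
  # Capture both expressions so they can be evaluated inside filter()
  test_cond <- enquo(condition)
  yes_filter <- enquo(yes)
  # Apply the filter only when at least one row satisfies the test condition
  if (.data %>% filter(!!test_cond) %>% nrow() > 0) {
    .data %>% filter(!!yes_filter)
  } else {
    .data
  }
}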
mtcars %>%
as_tibble() %>%
filter_if_condition(366.0 %in% mpg, between(row_number(), 1, which(mpg == 366)[1]))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
mtcars %>%
as_tibble() %>%
filter_if_condition(18.1 %in% mpg, between(row_number(), 1, which(mpg == 18.1)[1]))
#> # A tibble: 6 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1