In this dataframe:
df <- data.frame(
comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
hbr = c(101:110)
)
Let's say I need to compute the mean of hbr restricted to the first stretch where comp == "story". How can I do that more efficiently than the approach below, which seems bulky and long-winded and requires me to specify manually the grp I want the mean for?
library(dplyr)
library(data.table)
df %>%
mutate(grp = rleid(comp)) %>%
summarise(M = mean(hbr[grp==2]))
M
1 103.5
I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':
df %>%
mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
filter(grp == min(grp, na.rm = TRUE)) %>%
summarise(M = mean(hbr))
#> M
#> 1 103.5
In base R, you can select the desired rows using cumsum and diff, and then choosing which group you need (here it's the first, so 1), and then compute the mean on those rows. With this option, you don't need to get the group you need manually and you don't require any additional packages.
idx <- which(df$comp == "story")
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]
#[1] 2 3 4 5
mean(df$hbr[first])
#[1] 103.5
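For comparison, here is a minimal sketch of the same idea built on base R's rle(). The helper name first_run_mean is made up for illustration, and the sketch assumes the key column contains no NA values.

```r
df <- data.frame(
  comp = c("pre", rep("story", 4), rep("x", 2), rep("story", 3)),
  hbr = 101:110
)

# Hypothetical helper: mean of x over the first run where key == target.
first_run_mean <- function(x, key, target) {
  runs <- rle(key == target)          # consecutive runs of TRUE/FALSE
  ends <- cumsum(runs$lengths)        # last row of each run
  starts <- ends - runs$lengths + 1   # first row of each run
  i <- which(runs$values)[1]          # first run that matches the target
  if (is.na(i)) return(NA_real_)      # target never occurs
  mean(x[starts[i]:ends[i]])
}

first_run_mean(df$hbr, df$comp, "story")
#[1] 103.5
```

Changing `[1]` to another index would pick a later run, so nothing about the group number has to be looked up by hand.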
Note that case_when() does not evaluate the right-hand side of a formula only on the rows where the condition is met; it evaluates it on the whole tibble. An example:
picks = c(1:3)
a = tibble(id = c(1:4),
k = NA)
a %>% mutate(
k = case_when(id %in% picks~runif(length(picks)))
)
This is clear in the error:
Error: Problem with `mutate()` column `k`. i `k = case_when(id %in% picks ~ runif(length(picks)))`. x `id %in% picks ~ runif(length(picks))` must be length 4 or one, not 3.
An alternative would be rowwise() or group_by(id), but that would still be highly inefficient. I would probably still opt for rowwise(), but since I only have to perform operations on 1% of the tibble, I want a mutate within that 1% and everything else left untouched. Any suggestion to make R perform the minimal number of evaluations?
I thought about a combination of filter and join, but that would not work for a tidygraph object, for example: filtering the nodes would filter out edges too, so local_members would no longer work properly.
EDIT:
Also, in my experience, it seems that base::ifelse is faster than dplyr::case_when; is that expected?
dplyr::if_else is faster than base::ifelse. You also need T ~ NA_real_ in the case_when together with n():
picks = c(1:3)
a = tibble(id = c(1:4),
k = NA)
a %>% mutate(
k = case_when(
id %in% picks~runif(n()),
T ~ NA_real_
)
)
# A tibble: 4 x 2
id k
<int> <dbl>
1 1 0.0757
2 2 0.708
3 3 0.255
4 4 NA
# Would be faster with if_else:
a %>% mutate(
k = if_else(id %in% picks,
runif(n()),
NA_real_
)
)
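Note that runif(n()) above still draws one number per row and then discards the ones that are not picked. If the goal is to evaluate the expensive expression only for the selected rows, one possible sketch (not the only way) is to compute the values for the subset and write them back with replace(). The idx column here is a helper introduced for illustration.

```r
library(dplyr)

picks <- 1:3
a <- tibble(id = 1:4, k = NA_real_)

res <- a %>%
  mutate(
    idx = id %in% picks,                   # rows to touch
    k = replace(k, idx, runif(sum(idx)))   # runif() called for 3 values only
  ) %>%
  select(-idx)
```

Rows outside picks keep their original k, and the right-hand side is only as long as the subset.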
Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error Column out.stats must be length 1 (a summary value), not 4, why does this not work? How do you apply functions like this inside a pipe?
The following answers both the question and the last comment on it, where the OP asks for the row numbers of the outliers.
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b
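If the row numbers themselves are wanted, a sketch that avoids joining on a floating-point column is to record the positions before filtering. The column names row and row_in_b are assumptions made for this example. The two filter() calls are kept separate on purpose, so that boxplot.stats() only sees the values of group b.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

outliers <- data %>%
  mutate(row = row_number()) %>%        # position in the full data
  filter(group == "b") %>%
  mutate(row_in_b = row_number()) %>%   # position within group b
  filter(value %in% boxplot.stats(value)$out)

outliers
```

Here row_in_b matches what which(b$value %in% boxplot.stats(b$value)$out) returns for the filtered data frame b.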
Since dplyr 1.0.0, summarise() can return more than one row per group, so the original call works:
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914
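The original error comes from boxplot.stats() returning a list with four elements (stats, n, conf, out). A sketch that works regardless of how summarise() treats multi-row results is to wrap the result in list(), which keeps it in a single list-column cell:

```r
library(dplyr)

set.seed(1234)
data <- data.frame(group = rep(c("a", "b"), each = 100),
                   value = rnorm(200))

res <- data %>%
  filter(group == "b") %>%
  summarise(out.stats = list(boxplot.stats(value)))  # one row, one list cell

# Individual pieces can then be pulled out of the stored list
res$out.stats[[1]]$out
```

As far as I know, newer dplyr releases (1.1.0 and later) steer multi-row results towards reframe() instead of summarise(), so the list() wrapper is the more version-proof choice.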
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct(val)) # n_distinct() needs a column
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summary using mutate. It's computationally wasteful, since it computes the same summary once for every row in each group and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
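Another option that avoids the wasted per-row work is to collect the requested summaries as unevaluated expressions and splice them into a single summarise() call with !!!. This is only a sketch run on a local data frame; the list name summaries is made up for the example, and I have not checked the dbplyr translation, although the spliced call should look to dbplyr like one written by hand.

```r
library(dplyr)
library(rlang)

summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2))

# Each condition independently adds one named expression
summaries <- list()
if (summarize_num)          summaries$n        <- expr(n())
if (summarize_num_distinct) summaries$n_unique <- expr(n_distinct(val))

summ <- data %>% summarise(!!!summaries)
summ
#   n
# 1 3
```

Each extra condition costs one extra line instead of doubling the number of branches.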
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
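Since dplyr 1.0.0 the scoped verbs such as summarise_at() are superseded by across(). A sketch of the same dynamic selection with across() follows; the keep vector is an illustrative stand-in for the two flags, and .names = "{.fn}" is used so the output columns are named after the functions only.

```r
library(dplyr)

data <- data.frame(val = c(1, 2, 2))
fcts <- list(n_unique = n_distinct, n = length)

summarize_num_distinct <- FALSE
summarize_num <- TRUE
keep <- c(summarize_num_distinct, summarize_num)

res <- data %>%
  summarise(across(val, fcts[keep], .names = "{.fn}"))
res
#   n
# 1 3
```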
I'm having some issues with a seemingly simple task: to remove all rows where all variables are NA using dplyr. I know it can be done using base R (Remove rows in R matrix where all data is NA and Removing empty rows of a data file in R), but I'm curious to know if there is a simple way of doing it using dplyr.
Example:
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
filter(dat, !is.na(a) | !is.na(b) | !is.na(c))
The filter call above does what I want but it's infeasible in the situation I'm facing (as there is a large number of variables). I guess one could do it by using filter_ and first creating a string with the (long) logical statement, but it seems like there should be a simpler way.
Another way is to use rowwise() and do():
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
but that does not look too nice, although it gets the job done. Other ideas?
Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all() with any_vars() you can easily keep the rows with at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using @hejseb's benchmarking code, it appears that this solution is as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the above scoped verbs are superseded. Instead, the across() function family was introduced, which applies a function to multiple (or all) columns. With if_any(), added in dplyr 1.0.4, filtering rows with at least one non-NA column now looks like this:
# dplyr >= 1.0.4
dat %>% filter(if_any(everything(), ~ !is.na(.)))
I would suggest to use the wonderful janitor package here. Janitor is very user-friendly:
janitor::remove_empty(dat, which = "rows")
Benchmarking
@DavidArenburg suggested a number of alternatives. Here's a simple benchmark of them.
library(tidyverse)
library(microbenchmark)
n <- 100
dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))
f1 <- function(dat) {
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
}
f2 <- function(dat) {
dat %>% filter(rowSums(is.na(.)) != ncol(.))
}
f3 <- function(dat) {
dat %>% filter(rowMeans(is.na(.)) < 1)
}
f4 <- function(dat) {
dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
}
f5 <- function(dat) {
dat %>% mutate(indx = row_number()) %>% gather(var, val, -indx) %>% group_by(indx) %>% filter(sum(is.na(val)) != n()) %>% spread(var, val)
}
# f1 is too slow to be included!
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Using Reduce and lapply appears to be the fastest:
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Unit: microseconds
expr min lq mean median uq max neval
f2 909.495 986.4680 2948.913 1154.4510 1434.725 131159.384 100
f3 946.321 1036.2745 1908.857 1221.1615 1805.405 7604.069 100
f4 706.647 809.2785 1318.694 960.0555 1089.099 13819.295 100
f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187 100
Using a larger data set 107,880 x 40:
dat <- diamonds
# Let every third row be NA
dat[seq(1, nrow(diamonds), 3), ] <- NA
# Add some extra NA to first column so na.omit() wouldn't work
dat[seq(2, nrow(diamonds), 3), 1] <- NA
# Increase size
dat <- dat %>%
bind_rows(., .) %>%
bind_cols(., .) %>%
bind_cols(., .)
# Make names unique
names(dat) <- 1:ncol(dat)
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
f5 is too slow so it is also excluded. f4 seems to do relatively better than before.
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
Unit: milliseconds
expr min lq mean median uq max neval
f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218 100
f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315 100
f4 27.68628 31.80557 73.63191 35.36144 137.2445 152.4686 100
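Before comparing timings it is worth confirming that the candidates return the same rows. A small sketch of such a check, using a reduced version of the test data and the f2 to f4 definitions from above:

```r
library(dplyr)

dat <- tibble(a = rep(c(1, 2, NA), 5), b = rep(c(1, 1, NA), 5))

f2 <- function(dat) dat %>% filter(rowSums(is.na(.)) != ncol(.))
f3 <- function(dat) dat %>% filter(rowMeans(is.na(.)) < 1)
f4 <- function(dat) dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))

# All three should drop the same 5 all-NA rows and keep the other 10
stopifnot(
  isTRUE(all.equal(f2(dat), f3(dat))),
  isTRUE(all.equal(f2(dat), f4(dat))),
  nrow(f2(dat)) == 10
)
```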
Starting with dplyr 1.0, the colwise vignette gives a similar case as an example:
filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA
We can see it uses the same implicit "& logic" filter uses with multiple expressions. So the following minor adjustment selects all NA rows:
filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA
But the question asks for the inverse set: Remove rows with all NA.
We can do a simple setdiff using the previous, or we can use the fact that across returns a logical tibble and filter effectively does a row-wise all() (i.e. &).
E.g.:
library(magrittr) # provides the %<>% assignment pipe used below
rowAny = function(x) apply(x, 1, any)
anyVar = function(fcn) rowAny(across(everything(), fcn)) #make it readable
df %<>% filter(anyVar(~ !is.na(.x))) #Remove rows with *all* NA
Or:
filterout = function(df, ...) setdiff(df, filter(df, ...))
df %<>% filterout(across(everything(), is.na)) #Remove rows with *all* NA
Or even combine the above two to express the first example more directly:
df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA
In my opinion, the tidyverse filter function would benefit from a parameter describing the 'aggregation logic'. It could default to "all" and preserve behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.
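For what it's worth, later dplyr releases (1.0.4 and up) cover this "aggregation logic" with if_all() and if_any(), so the helpers above are no longer needed. A sketch of the four combinations on a small made-up tibble:

```r
library(dplyr)

dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA))

no_na      <- filter(dat, if_all(everything(), ~ !is.na(.x)))  # remove rows with *any* NA
all_na     <- filter(dat, if_all(everything(), is.na))         # keep only the all-NA rows
not_all_na <- filter(dat, if_any(everything(), ~ !is.na(.x)))  # remove rows with *all* NA
some_na    <- filter(dat, if_any(everything(), is.na))         # keep rows with *any* NA

c(nrow(no_na), nrow(all_na), nrow(not_all_na), nrow(some_na))
#[1] 1 1 2 2
```

if_all() combines the per-column results with &, if_any() with |.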
The solution using dplyr 1.0 is simple and does not require helper functions, you just need to add a negation in the right place.
dat %>% filter(!across(everything(), is.na))
dplyr 1.0.4 introduced the if_any() and if_all() functions:
dat %>% filter(if_any(everything(), ~!is.na(.)))
or, more verbose:
dat %>% filter(if_any(everything(), purrr::negate(is.na)))
"Take dat and keep all rows where any entry is non-NA"
Here's another solution that uses purrr::map_lgl() and tidyr::nest():
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
any_not_na <- function(x) {
!all(map_lgl(x, is.na))
}
dat_cleaned <- dat %>%
rownames_to_column("ID") %>%
group_by(ID) %>%
nest() %>%
filter(map_lgl(data, any_not_na)) %>%
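unnest(data) %>% # name the list-column explicitly
ungroup() %>% # drop the ID grouping so select(-ID) can remove it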
select(-ID)
dat_cleaned
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1. 1. 2.
## 2 2. NA NA
I doubt this approach will be able to compete with the benchmarks in @hejseb's answer, but I think it does a pretty good job of showing how the nest %>% map %>% unnest pattern works, and users can run through it line by line to figure out what's going on.
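You can use the function complete.cases() (from base R's stats package, not dplyr) inside filter(), using the dot (.) to refer to the data frame coming down the chain. Note that this drops every row containing any NA, not only the rows where all values are NA.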
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
x3 = c(NA,2,3,5)
)
df %>%
filter(complete.cases(.))
x1 x2 x3
1 2 2 2
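To make the difference concrete, here is a base R sketch on the same data comparing "no NA at all" with "not entirely NA". The variable names are arbitrary.

```r
df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  x3 = c(NA, 2, 3, 5)
)

complete_rows  <- df[complete.cases(df), ]        # rows without any NA
non_empty_rows <- df[rowSums(!is.na(df)) > 0, ]   # rows with at least one value

nrow(complete_rows)
#[1] 1
nrow(non_empty_rows)
#[1] 4
```

No row in this example is entirely NA, so the second filter keeps all four rows while complete.cases() keeps only one.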
A neat solution that works in dplyr 1.0.1 is to use rowwise():
dat %>%
rowwise() %>%
filter(!all(is.na(across(everything())))) %>%
ungroup()
This is very similar to @Callum Savage's comment on the top post, which I missed on the first pass, but without the sum().
(tidyverse 1.3.1)
data%>%rowwise()%>%
filter(!all(is.na(c_across(where(is.numeric))))) # predicates go inside where()
data%>%rowwise()%>%
filter(!all(is.na(c_across(starts_with("***")))))
I am struggling a little with dplyr because I want to do two things at once and wonder if it is possible.
I want to calculate the mean of values and, at the same time, the mean of the values that have a specific value in another column.
library(dplyr)
set.seed(1234)
df <- data.frame(id=rep(1:10, each=14),
tp=letters[1:14],
value_type=sample(LETTERS[1:3], 140, replace=TRUE),
values=runif(140))
df %>%
group_by(id, tp) %>%
summarise(
all_mean=mean(values),
A_mean=mean(values), # Only the values with value_type A
value_count=sum(value_type == 'A')
)
So the A_mean column should calculate the mean of values where value_type == 'A'.
I would normally do two separate commands and merge the results later, but I guess there is a more handy way and I just don't get it.
Thanks in advance.
We can try
df %>%
group_by(id, tp) %>%
summarise(all_mean = mean(values),
A_mean = mean(values[value_type=="A"]),
value_count=sum(value_type == 'A'))
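One thing to be aware of with this subsetting approach: a group with no "A" rows gets NaN, because mean() of an empty vector is NaN. A small sketch on made-up data, grouped by id only to keep it short:

```r
library(dplyr)

toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  value_type = c("A", "B", "A", "B", "C"),
  values = c(1, 2, 3, 4, 5)
)

res <- toy %>%
  group_by(id) %>%
  summarise(all_mean = mean(values),
            A_mean = mean(values[value_type == "A"]),  # NaN if the group has no "A"
            value_count = sum(value_type == "A"))
res
# id 1: all_mean 2,   A_mean 2,   value_count 2
# id 2: all_mean 4.5, A_mean NaN, value_count 0
```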
You can do this with two summary steps:
df %>%
group_by(id, tp, value_type) %>%
summarise(A_mean = mean(values)) %>%
summarise(all_mean = mean(A_mean),
A_mean = sum(A_mean * (value_type == "A")),
value_count = sum(value_type == "A"))
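The first summary calculates the means per value_type and the second "sums" only the mean of value_type == "A". Note that all_mean here is a mean of per-type means, which equals the overall mean only when every value_type has the same number of rows within the group.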
You can also give the following function a try:
?summarise_if
(the function family is summarise_all)
Example
The dplyr documentation provides quite a good example of this, I think:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns. Here we apply mean() to the numeric columns:
starwars %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 1 x 3
#> height mass birth_year
#> <dbl> <dbl> <dbl>
#> 1 174. 97.3 87.6
The interesting thing here is the predicate function. It is the rule that selects which columns will be summarized.
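In dplyr 1.0.0 and later the scoped variants are superseded, and the same predicate-based selection is written with across() and where(). A sketch of the equivalent call:

```r
library(dplyr)

# where(is.numeric) plays the role of the predicate in summarise_if()
res <- starwars %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

names(res)
#[1] "height"     "mass"       "birth_year"
```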