dplyr equivalent to DF[DF==X] <- Y - r

I'm wondering if there's a dplyr equivalent to
df <- data.frame(A=1:5,B=2:6,C=-1:3)
df[df==2] <- 10
I'm looking for
df %>% <??>
That is, a statement that is chainable with other dplyr verbs

1) replace Try this. It only requires magrittr although dplyr imports the relevant part of magrittr so it will work with dplyr too:
df %>% replace(. == 2, 10)
giving:
A B C
1 1 10 -1
2 10 3 0
3 3 4 1
4 4 5 10
5 5 6 3
1a) overwriting Note that the above is non-destructive so if you want to update df then you will need to assign it back:
df <- df %>% replace(. == 2, 10)
or
df %>% replace(. == 2, 10) -> df
or use the magrittr %<>% operator which eliminates referencing df twice:
df %<>% replace(. == 2, 10)
2) arithmetic This would also work:
df %>% { 10 * (. == 2) + . * (. != 2) }
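A more recent dplyr idiom (my addition, assuming dplyr >= 1.0; across() is not part of the original answer) performs the same replacement inside mutate(), which also keeps it chainable:
library(dplyr)
# Replace every 2 with 10 across all columns
df %>% mutate(across(everything(), ~ replace(.x, .x == 2, 10)))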

The OP's question is about how to replace values using dplyr, and it has been resolved thanks to G. Grothendieck. But I was curious how the performance differs between approaches based on dplyr, data.table, and base R, so I designed and ran the following benchmark.
# Load package
library(dplyr)
library(data.table)
library(microbenchmark)
# Create example data frame
df <- data.frame(A = 1:5, B = 2:6, C = -1:3)
# Convert to data.table
dt <- as.data.table(df)
# Method 1: Use mutate_all and ifelse
F1 = function(df){df %>% mutate_all(funs(ifelse(. == 2, 10, .)))}
# Method 2: Use mutate_all and replace
F2 = function(df){df %>% mutate_all(funs(replace(., . == 2, 10)))}
# Method 3: Use replace
F3 = function(df){df %>% replace(. == 2, 10)}
# Method 4: Base R data frame assignment
F4 = function(df){
  df[df == 2] <- 10
  return(df)
}
# Benchmarking
microbenchmark(
M1 = F1(df),
M2 = F2(df),
M3 = F3(df),
M4 = F4(df),
# Same as M4, but use data.table object as input
M5 = F4(dt)
)
Unit: microseconds
expr min lq mean median uq max neval
M1 8634.974 13028.7975 17224.4669 14907.3735 19496.5275 79750.182 100
M2 8925.565 12626.2675 16698.7412 15551.7410 18658.1125 35468.760 100
M3 282.252 391.6240 591.2534 553.5980 647.8965 3290.797 100
M4 163.578 252.1025 423.7627 349.6080 420.8125 5415.382 100
M5 228.367 333.2495 596.1735 440.3775 555.5230 7506.609 100
The results show that mutate_all with ifelse (M1) or with replace (M2) is much slower than the other approaches. Using replace with the pipe (M3) is fast, but still a little slower than base R (M4). Converting the data.frame to a data.table and then applying the assignment replacement (M5) is not faster than M4.
So, I think in this case there is no particular need to use dplyr functions, because they are not faster than the base R method (M4). There is also no need to convert the data.frame to a data.table if a pipe-based workflow is what you want: we can pipe into replace (M3), or define a function such as F4 and use it in the pipe, as shown below.
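For example, F4 drops straight into a chain (a minimal usage sketch; the filter step is just an arbitrary follow-up):
df %>% F4() %>% filter(A > 1)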

Related

Split-apply-combine with function that returns multiple variables

I need to apply myfun to subsets of a dataframe and include the results as new columns in the dataframe returned. In the old days, I used ddply. But in dplyr, I believe summarise is used for that, like this:
myfun <- function(x, y) {
  df <- data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
  return(df)
}
mtcars %>%
group_by(cyl) %>%
summarise(a = myfun(cyl,disp)$a, b = myfun(cyl,disp)$b)
The above code works, but the myfun I'll be using is computationally very expensive, so I want it to be called only once rather than separately for the a and b columns. Is there a way to do that in dplyr?
Since your function returns a data frame, you can call it within group_by %>% do, which applies the function to each individual group and rbinds the returned data frames together:
mtcars %>% group_by(cyl) %>% do(myfun(.$cyl, .$disp))
# A tibble: 3 x 3
# Groups: cyl [3]
# cyl a b
# <dbl> <dbl> <dbl>
#1 4 420.5455 -101.1364
#2 6 1099.8857 -177.3143
#3 8 2824.8000 -345.1000
do is not necessarily going to improve the speed. In this post, I introduce a way to design a function that performs the same task, and then run a benchmark to compare the performance of each method.
Here is an alternative way to define the function.
myfun2 <- function(dt, x, y){
  x <- enquo(x)
  y <- enquo(y)
  dt2 <- dt %>%
    summarise(a = mean(!!x) * mean(!!y), b = mean(!!x) - mean(!!y))
  return(dt2)
}
Notice that the first argument of myfun2 is dt, the input data frame. This lets myfun2 slot directly into a pipe:
mtcars %>%
group_by(cyl) %>%
myfun2(x = cyl, y = disp)
# A tibble: 3 x 3
cyl a b
<dbl> <dbl> <dbl>
1 4 420.5455 -101.1364
2 6 1099.8857 -177.3143
3 8 2824.8000 -345.1000
By doing this, we don't have to call myfun once for each new column, so this method is probably more efficient than the original myfun approach.
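As a side note (my addition, assuming rlang >= 0.4; not part of the original answer): the curly-curly operator {{ }} can replace the enquo()/!! pair, so a function equivalent to myfun2 can be written more compactly:
myfun3 <- function(dt, x, y){
  dt %>%
    summarise(a = mean({{ x }}) * mean({{ y }}),
              b = mean({{ x }}) - mean({{ y }}))
}
mtcars %>% group_by(cyl) %>% myfun3(x = cyl, y = disp)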
Here is a comparison of the performance using the microbenchmark. The methods I compared are listed as follows. I ran the simulation 1000 times.
m1: OP's original way to apply `myfun`
m2: Psidom's method, using `do` to apply `myfun`
m3: My approach, using `myfun2`
m4: Using `do` to apply `myfun2`
m5: Z.Lin's suggestion, directly calculating the values without defining a function
m6: akrun's `data.table` approach with `myfun`
Here is the code for benchmarking.
microbenchmark(m1 = (mtcars %>%
group_by(cyl) %>%
summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)),
m2 = (mtcars %>%
group_by(cyl) %>%
do(myfun(.$cyl, .$disp))),
m3 = (mtcars %>%
group_by(cyl) %>%
myfun2(x = cyl, y = disp)),
m4 = (mtcars %>%
group_by(cyl) %>%
do(myfun2(., x = cyl, y = disp))),
m5 = (mtcars %>%
group_by(cyl) %>%
summarise(a = mean(cyl) * mean(disp), b = mean(cyl) - mean(disp))),
m6 = (as.data.table(mtcars)[, myfun(cyl, disp), cyl]),
times = 1000)
And here is the result of benchmarking.
Unit: milliseconds
expr min lq mean median uq max neval
m1 7.058227 7.692654 9.429765 8.375190 10.570663 28.730059 1000
m2 8.559296 9.381996 11.643645 10.500100 13.229285 27.585654 1000
m3 6.817031 7.445683 9.423832 8.085241 10.415104 193.878337 1000
m4 21.787298 23.995279 28.920262 26.922683 31.673820 177.004151 1000
m5 5.337132 5.785528 7.120589 6.223339 7.810686 23.231274 1000
m6 1.320812 1.540199 1.919222 1.640270 1.935352 7.622732 1000
The results show that the do methods (m2 and m4) are actually slower than their counterparts (m1 and m3). In this situation, applying myfun (m1) and myfun2 (m3) is faster than using do, and myfun2 (m3) is slightly faster than myfun (m1). However, not defining any function at all (m5) is faster than all of the function-based methods (m1 to m4), suggesting that for this particular case there is no real need to define a function. Finally, if there is no need to stay in the tidyverse, or the dataset is enormous, consider the data.table approach (m6), which is a lot faster than all the tidyverse solutions listed here.
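Another aside (my addition, not part of the original answers): in later dplyr versions (>= 0.8) do() is superseded, and group_modify() covers the same per-group pattern. With .keep = TRUE the grouping column stays in the data passed to the function, so myfun can be reused unchanged; this is a sketch under that assumption:
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ myfun(.x$cyl, .x$disp), .keep = TRUE)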
We can use data.table
library(data.table)
setDT(mtcars)[, myfun(cyl, disp), cyl]
# cyl a b
#1: 6 1099.8857 -177.3143
#2: 4 420.5455 -101.1364
#3: 8 2824.8000 -345.1000

Remove rows where all variables are NA using dplyr

I'm having some issues with a seemingly simple task: to remove all rows where all variables are NA using dplyr. I know it can be done using base R (Remove rows in R matrix where all data is NA and Removing empty rows of a data file in R), but I'm curious to know if there is a simple way of doing it using dplyr.
Example:
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
filter(dat, !is.na(a) | !is.na(b) | !is.na(c))
The filter call above does what I want but it's infeasible in the situation I'm facing (as there is a large number of variables). I guess one could do it by using filter_ and first creating a string with the (long) logical statement, but it seems like there should be a simpler way.
Another way is to use rowwise() and do():
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
but that does not look too nice, although it gets the job done. Other ideas?
Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all together with any_vars you can easily keep rows that have at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using @hejseb's benchmarking code, this solution appears to be about as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the scoped verbs above are superseded. Instead, the across() function family was introduced, which allows applying a function to multiple (or all) columns. Filtering rows with at least one non-NA column now looks like this:
# dplyr 1.0.0
dat %>% filter(if_any(everything(), ~ !is.na(.)))
I would suggest using the wonderful janitor package here; it is very user-friendly:
janitor::remove_empty(dat, which = "rows")
Benchmarking
@DavidArenburg suggested a number of alternatives. Here's a simple benchmark of them.
library(tidyverse)
library(microbenchmark)
n <- 100
dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))
f1 <- function(dat) {
  na <- dat %>%
    rowwise() %>%
    do(tibble(na = !all(is.na(.)))) %>%
    .$na
  filter(dat, na)
}
f2 <- function(dat) {
  dat %>% filter(rowSums(is.na(.)) != ncol(.))
}
f3 <- function(dat) {
  dat %>% filter(rowMeans(is.na(.)) < 1)
}
f4 <- function(dat) {
  dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
}
f5 <- function(dat) {
  dat %>%
    mutate(indx = row_number()) %>%
    gather(var, val, -indx) %>%
    group_by(indx) %>%
    filter(sum(is.na(val)) != n()) %>%
    spread(var, val)
}
# f1 is too slow to be included!
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Using Reduce and lapply appears to be the fastest:
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Unit: microseconds
expr min lq mean median uq max neval
f2 909.495 986.4680 2948.913 1154.4510 1434.725 131159.384 100
f3 946.321 1036.2745 1908.857 1221.1615 1805.405 7604.069 100
f4 706.647 809.2785 1318.694 960.0555 1089.099 13819.295 100
f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187 100
Using a larger data set 107,880 x 40:
dat <- diamonds
# Let every third row be NA
dat[seq(1, nrow(diamonds), 3), ] <- NA
# Add some extra NA to first column so na.omit() wouldn't work
dat[seq(2, nrow(diamonds), 3), 1] <- NA
# Increase size
dat <- dat %>%
bind_rows(., .) %>%
bind_cols(., .) %>%
bind_cols(., .)
# Make names unique
names(dat) <- 1:ncol(dat)
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
f5 is too slow so it is also excluded. f4 seems to do relatively better than before.
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
Unit: milliseconds
expr min lq mean median uq max neval
f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218 100
f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315 100
f4 27.68628 31.80557 73.63191 35.36144 137.2445 152.4686 100
Starting with dplyr 1.0, the colwise vignette gives a similar case as an example:
filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA
We can see it uses the same implicit "&" logic that filter applies to multiple expressions. So the following minor adjustment selects the all-NA rows:
filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA
But the question asks for the inverse set: remove rows with all NA.
We can do a simple setdiff using the previous result, or
we can use the fact that across returns a logical tibble and filter effectively does a row-wise all() (i.e. &).
Eg:
rowAny = function(x) apply(x, 1, any)
anyVar = function(fcn) rowAny(across(everything(), fcn)) #make it readable
df %<>% filter(anyVar(~ !is.na(.x))) #Remove rows with *all* NA
Or:
filterout = function(df, ...) setdiff(df, filter(df, ...))
df %<>% filterout(across(everything(), is.na)) #Remove rows with *all* NA
Or even combining the above two to express the first example more directly:
df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA
In my opinion, the tidyverse filter function would benefit from a parameter describing the 'aggregation logic'. It could default to "all" and preserve behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.
The solution using dplyr 1.0 is simple and does not require helper functions; you just need to add a negation in the right place.
dat %>% filter(!across(everything(), is.na))
dplyr 1.0.4 introduced the if_any() and if_all() functions:
dat %>% filter(if_any(everything(), ~!is.na(.)))
or, more verbose:
dat %>% filter(if_any(everything(), purrr::negate(is.na)))
"Take dat and keep all rows where any entry is non-NA"
Here's another solution that uses purrr::map_lgl() and tidyr::nest():
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
any_not_na <- function(x) {
  !all(map_lgl(x, is.na))
}
dat_cleaned <- dat %>%
  rownames_to_column("ID") %>%
  group_by(ID) %>%
  nest() %>%
  filter(map_lgl(data, any_not_na)) %>%
  unnest() %>%
  select(-ID)
## Warning: package 'bindrcpp' was built under R version 3.4.2
dat_cleaned
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1. 1. 2.
## 2 2. NA NA
I doubt this approach will be able to compete with the benchmarks in #hejseb's answer, but I think it does a pretty good job at showing how the nest %>% map %>% unnest pattern works and users can run through it line-by-line to figure out what's going on.
You can use the base function complete.cases(), using the dot (.) to refer to the data frame earlier in the chain.
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
x3 = c(NA,2,3,5)
)
df %>%
filter(complete.cases(.))
x1 x2 x3
1 2 2 2
A neat solution that works in dplyr 1.0.1 is to use rowwise():
dat %>%
rowwise() %>%
filter(!all(is.na(across(everything())))) %>%
ungroup()
Very similar to @Callum Savage's comment on the top post (which I missed on the first pass), but without the sum():
(tidyverse 1.3.1)
data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(is.numeric))))
data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(starts_with("***")))))
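One caveat (my note, not part of the original answer): in recent dplyr/tidyselect versions, a bare predicate such as is.numeric inside a selection is deprecated and should be wrapped in where():
data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(where(is.numeric))))) %>%
  ungroup()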

dplyr summarise when function return is vector-valued?

The dplyr::summarize() function can apply arbitrary functions over the data, but it seems that function must return a scalar value. I'm curious if there is a reasonable way to handle functions that return a vector value without making multiple calls to the function.
Here's a somewhat silly minimal example. Consider a function that gives multiple values, such as:
f <- function(x, y){
  coef(lm(x ~ y, data.frame(x = x, y = y)))
}
and data that looks like:
df <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'), x=rnorm(12,1,1), y=rnorm(12,1,1))
I'd like to do something like:
df %>%
group_by(group) %>%
summarise(f(x,y))
and get back a table that has 2 columns added for each of the returned values instead of the usual 1 column. Instead, this errors with: Expecting single value
Of course we can get multiple values from dplyr::summarise() by calling the function once per value:
f1 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[1]]
f2 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[2]]
df %>%
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))
This gives the desired output:
group a b
1 A 1.7957245 -0.339992915
2 B 0.5283379 -0.004325209
3 C 1.0797647 -0.074393457
but coding in this way is ridiculously crude and ugly.
data.table handles this case more succinctly:
dt <- as.data.table(df)
dt[, f(x,y), by="group"]
but it creates output that extends the table with additional rows instead of additional columns, which is both confusing and harder to work with:
group V1
1: A 1.795724536
2: A -0.339992915
3: B 0.528337890
4: B -0.004325209
5: C 1.079764710
6: C -0.074393457
Of course there are more classic apply strategies we could use here,
sapply(levels(df$group), function(x) coef(lm(x~y, df[df$group == x, ])))
A B C
(Intercept) 1.7957245 0.528337890 1.07976471
y -0.3399929 -0.004325209 -0.07439346
but this sacrifices both the elegance and I suspect the speed of the grouping. In particular, note that we cannot use our pre-defined function f in this case, but have to hard code the grouping into the function definition.
Is there a dplyr function for handling this case? If not, is there a more elegant way to handle this process of evaluating vector-valued functions over a data.frame by group?
You could try do
library(dplyr)
df %>%
group_by(group) %>%
do(setNames(data.frame(t(f(.$x, .$y))), letters[1:2]))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
The output based on f1 and f2 are
df %>%
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
Update
If you are using data.table, the option to get a similar result is
library(data.table)
setnames(setDT(df)[, as.list(f(x,y)) , group], 2:3, c('a', 'b'))[]
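For completeness (a sketch based on my reading of dplyr >= 1.0, not part of the original answers): summarise() now accepts an expression that returns a one-row data frame, whose columns are unpacked into the result, so the expensive function is called only once per group:
df %>%
  group_by(group) %>%
  summarise(setNames(as.data.frame(t(f(x, y))), c("a", "b")))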
This is why I still love plyr::ddply():
library(plyr)
f <- function(z) setNames(coef(lm(x ~ y, z)), c("a", "b"))
ddply(df, ~ group, f)
# group a b
# 1 A 0.5213133 0.04624656
# 2 B 0.3020656 0.01450137
# 3 C 0.2189537 0.22998823

dplyr r, removing all na values from dataframe [duplicate]

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5)
)
df %.%
filter(complete.cases(x1,x2))
Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that of the solutions above na.omit is much slower but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action attribute whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
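As a small illustration of that attribute (my addition), the indices of the dropped rows can be recovered afterwards:
na.action(df %>% na.omit)   # returns the indices (3 and 4) of the omitted rows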
ADDED Have updated to reflect latest version of dplyr and comments.
ADDED Have updated to reflect latest version of tidyr and comments.
This works for me:
df %>%
filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
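For instance (a minimal sketch of that point; the derived column x3 is made up for illustration):
df %>%
  mutate(x3 = x1 * x2) %>%
  filter(complete.cases(.))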
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233 20
#filter.anonymous 1.149305 1.022891 1.013779 0.9948659 4.668691 20
# rowSums 2.281002 2.377807 2.420615 2.3467519 5.223077 20
# filter 1.000000 1.000000 1.000000 1.0000000 1.000000 20
Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this, maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)
# test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %.% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %.% na.omit() 50 109.618 20.217
This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):
drop_na <- function(data, ...){
  if (missing(...)){
    f <- complete.cases(data)
  } else {
    f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
  }
  filter(data, f)
}
[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]
Examples:
library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
Try this:
df[complete.cases(df),] #output to console
Or even this:
df.complete <- df[complete.cases(df),] #assign to a new data.frame
The above commands check for completeness across all the columns (variables) in your data.frame.
Just for the sake of completeness, dplyr::filter can be avoided altogether while still composing chains, by using magrittr::extract (an alias of [):
library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))
df %>%
extract(complete.cases(.), )
An additional bonus is speed: this is the fastest method among the filter and na.omit variants (tested using @Miha Trošt's microbenchmarks).
dplyr >= 1.0.4
if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This can be useful if you have other variables in your data frame that are not part of what you consider a complete case. For example, if you only want non-missing rows in columns that start with "x":
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
y = c(NA, "A", "B", "C")
)
df %>%
dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))
x1 x2 y
1 1 1 <NA>
2 2 2 A
For more information on these functions see this link.

