dplyr r, removing all na values from dataframe [duplicate] - r

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5)
)
df %.%
filter(complete.cases(x1,x2))

Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that of the solutions above na.omit is much slower but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action attribute whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
ADDED Have updated to reflect latest version of dplyr and comments.
ADDED Have updated to reflect latest version of tidyr and comments.

This works for me:
df %>%
filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233 20
#filter.anonymous 1.149305 1.022891 1.013779 0.9948659 4.668691 20
# rowSums 2.281002 2.377807 2.420615 2.3467519 5.223077 20
# filter 1.000000 1.000000 1.000000 1.0000000 1.000000 20

Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)
# test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %.% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %.% na.omit() 50 109.618 20.217

This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):
drop_na <- function(data, ...){
if (missing(...)){
f = complete.cases(data)
} else {
f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
}
filter(data, f)
}
[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]
Examples:
library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs

try this
df[complete.cases(df),] #output to console
OR even this
df.complete <- df[complete.cases(df),] #assign to a new data.frame
The above commands take care of checking for completeness for all the columns (variable)
in your data.frame.

Just for the sake of completeness, dplyr::filter can be avoided altogether but still be able to compose chains just by using magrittr:extract (an alias of [):
library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))
df %>%
extract(complete.cases(.), )
The additional bonus is speed, this is the fastest method among the filter and na.omit variants (tested using #Miha Trošt microbenchmarks).

dplyr >= 1.0.4
if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This could be useful if you had other variables in your dataframe that were not part of what you considered complete case. For example, if you only wanted non-missing rows in columns that start with "x":
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
y = c(NA, "A", "B", "C")
)
df %>%
dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))
x1 x2 y
1 1 1 <NA>
2 2 2 A
For more information on these functions see this link.

Related

More efficient way to compute mean for subset

In this dataframe:
df <- data.frame(
comp = c("pre",rep("story",4), rep("x",2), rep("story",3)),
hbr = c(101:110)
)
let's say I need to compute the mean for hbr subsetted to the first stretch where comp=="story", how would I do that more efficiently than this way, which seems bulky and longwinded and requires that I specify the grpI want to compute the mean for manually:
library(dplyr)
library(data.table)
df %>%
mutate(grp = rleid(comp)) %>%
summarise(M = mean(hbr[grp==2]))
M
1 103.5
I'm not sure if this is any better, but at least you only need to specify that you want the first run of 'story':
df %>%
mutate(grp = ifelse(comp == 'story', rleid(comp), NA)) %>%
filter(grp == min(grp, na.rm = TRUE)) %>%
summarise(M = mean(hbr))
#> M
#> 1 103.5
In base R, you can select the desired rows using cumsum and diff, and then choosing which group you need (here it's the first, so 1), and then compute the mean on those rows. With this option, you don't need to get the group you need manually and you don't require any additional packages.
idx <- which(df$comp == "story")
first <- idx[cumsum(c(1, diff(idx) != 1)) == 1]
#[1] 2 3 4 5
mean(df$hbr[first])
#[1] 103.5

How to use apply functions correctly when there are NA values

I'd like to calculate a function on multiple columns of a dataframe with random NA values. I have two questions:
How to deal with NAs? The code runs when I try it on non-NA columns, but returns NA when there are NAs even though I remove them.
How to print the results in a dataframe format instead of multiple arrays? I used mapply but it doesn't seem to do the calculations correctly.
Here is my code:
#create a data frame with random NAs
df<-data.frame(category1 = sample(c(1:10),100,replace=TRUE),
category2 = sample(c(1:10),100,replace=TRUE)
)
insert_nas <- function(x) {
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df <- sapply(df, insert_nas) %>% as.data.frame()
df$type <- sample(c("A", "B", "C"),100,replace=TRUE)
#using apply:
library(NPS)
apply(df[,c('category1', 'category2')], 2,
function(x) df %>% filter(!is.na(x)) %>% group_by(type) %>%
transmute(nps(x)) %>% unique()
)
#results:
$category1
# A tibble: 3 x 2
# Groups: type [3]
type `nps(x)`
<chr> <dbl>
1 B NA
2 A NA
3 C NA
...
#using mapply
mapply(function(x) df %>% filter(!is.na(x)) %>% group_by(type) %>%
transmute(nps(x)) %>% unique(), df[,c('category1', 'category2')])
#results:
category1 category2
type Character,3 Character,3
nps(x) Numeric,3 Numeric,3
Regarding the function I use, it doesn't have a built in way to deal with NAs, so I remove NAs prior to calling it.
I still used the !is.na part of your code because it seems that nps can't deal with NA, even though the documentation said it should (possible bug). I changed your apply to lapply and passed the variables as the list. Then I used get to identify the variable name that appears in quotes as a variable in your df.
df<-data.frame(category1 = sample(c(1:10),100,replace=TRUE),
category2 = sample(c(1:10),100,replace=TRUE)
)
insert_nas <- function(x) {
len <- length(x)
n <- sample(1:floor(0.2*len), 1)
i <- sample(1:len, n)
x[i] <- NA
x
}
df <- sapply(df, insert_nas) %>% as.data.frame()
df$type <- sample(c("A", "B", "C"),100,replace=TRUE)
#using apply:
library(NPS)
df2 <- as.data.frame(lapply(c('category1', 'category2'),
function(x) df %>% filter(!is.na(get(x))) %>% group_by(type) %>%
transmute(nps(get(x))) %>% unique()
),stringsAsFactors = FALSE)
colnames(df2) <- c("type", "nps_cat1","type2","nps_cat2")
#type2 is redundant
df2 <- select(df2, -type2)

Mutating column in `dplyr` using `rowSums`

Recently I stumbled uppon a strange behaviour of dplyr and I would be happy if somebody would provide some insights.
Assuming I have a data of which com columns contain some numerical values. In an easy scenario I would like to compute rowSums. Although there are many ways to do it, here are two examples:
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
# works
dplyr::select(df, - ids) %>% {rowSums(.)}
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(df, - ids) %>% {rowSums(.)})
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(., - ids) %>% {rowSums(.)})
# workaround:
tmp <- dplyr::select(df, - ids) %>% {rowSums(.)}
df %>%
dplyr::mutate(blubb = tmp)
# works
rowSums(dplyr::select(df, - ids))
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = rowSums(dplyr::select(df, - ids)))
# workaround
tmp <- rowSums(dplyr::select(df, - ids))
df %>%
dplyr::mutate(blubb = tmp)
First, I don't really understand what is causing the error and second I would like to know how to actually achieve a tidy computation of some (viable) columns in a tidy way.
edit
The question mutate and rowSums exclude columns , although related, focuses on using rowSums for computation. Here I'm eager to understand why the upper examples do not work. It is not so much about how to solve (see the workarounds) but to understand what happens when the naive approach is applied.
The examples do not work because you are nesting select in mutate and using bare variable names. In this case, select is trying to do something like
> -df$ids
Error in -df$ids : invalid argument to unary operator
which fails because you can't negate a character string (i.e. -"i1" or -"i2" makes no sense). Either of the formulations below works:
df %>% mutate(blubb = rowSums(select_(., "X1", "X2")))
df %>% mutate(blubb = rowSums(select(., -3)))
or
df %>% mutate(blubb = rowSums(select_(., "-ids")))
as suggested by #Haboryme.
select_ is deprecated. You can use:
library(dplyr)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>%
mutate(blubb = rowSums(select(., .dots = c("X1", "X2"))))
# Or more generally:
desired_columns <- c("X1", "X2")
df %>%
mutate(blubb = rowSums(select(., .dots = all_of(desired_columns))))
select can now accept bare column names so no need to use .dots or select_ which has been deprecated.
Here are few of the approaches that can work now.
library(dplyr)
#sum all the columns except `id`.
df %>% mutate(blubb = rowSums(select(., -ids), na.rm = TRUE))
#sum X1 and X2 columns
df %>% mutate(blubb = rowSums(select(., X1, X2), na.rm = TRUE))
#sum all the columns that start with 'X'
df %>% mutate(blubb = rowSums(select(., starts_with('X')), na.rm = TRUE))
#sum all the numeric columns
df %>% mutate(blubb = rowSums(select(., where(is.numeric))))
Adding to this old thread because I searched on this question then realized I was asking the wrong question. Also, I detect some yearning in this and related questions for the proper pipe steps way to do this.
The answers here are somewhat non-intuitive because they are trying to use the dplyr vernacular with non-"tidy" data. IF you want to do it the dplyr way, make the data tidy first, using gather(), and then use summarise()
library(tidyverse)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>% gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
#> # A tibble: 20 x 2
#> ids rowsum
#> <chr> <dbl>
#> 1 i1 0.942
#> 2 i10 -0.330
#> 3 i11 0.942
#> 4 i12 -0.721
#> 5 i13 2.50
#> 6 i14 -0.611
#> 7 i15 -0.799
#> 8 i16 1.84
#> 9 i17 -0.629
#> 10 i18 -1.39
#> 11 i19 1.44
#> 12 i2 -0.721
#> 13 i20 -0.330
#> 14 i3 2.50
#> 15 i4 -0.611
#> 16 i5 -0.799
#> 17 i6 1.84
#> 18 i7 -0.629
#> 19 i8 -1.39
#> 20 i9 1.44
If you care about the order of the ids when they are not sortable using arrange(), make that column a factor first.
df %>%
mutate(ids=as_factor(ids)) %>%
gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
Why do you want to use the pipe operator? Just write an expression such as:
rowSums(df[,sapply(df, is.numeric)])
i.e. calculate the rowsums on all the numeric columns, with the advantage of not needing to specify ids.
If you want to save your results as a column within data, you can use data.table syntax like this:
dt <- as.data.table(df)
dt[, x3 := rowSums(.SD, na.rm=T), .SDcols = which(sapply(dt, is.numeric))]

Remove rows where all variables are NA using dplyr

I'm having some issues with a seemingly simple task: to remove all rows where all variables are NA using dplyr. I know it can be done using base R (Remove rows in R matrix where all data is NA and Removing empty rows of a data file in R), but I'm curious to know if there is a simple way of doing it using dplyr.
Example:
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
filter(dat, !is.na(a) | !is.na(b) | !is.na(c))
The filter call above does what I want but it's infeasible in the situation I'm facing (as there is a large number of variables). I guess one could do it by using filter_ and first creating a string with the (long) logical statement, but it seems like there should be a simpler way.
Another way is to use rowwise() and do():
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
but that does not look too nice, although it gets the job done. Other ideas?
Since dplyr 0.7.0 new, scoped filtering verbs exists. Using filter_any you can easily filter rows with at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using #hejseb benchmarking algorithm it appears that this solution is as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the above scoped verbs are superseded. Instead the across function family was introduced, which allows to perform a function on multiple (or all) columns. Filtering rows with at least one column being not NA looks now like this:
# dplyr 1.0.0
dat %>% filter(if_any(everything(), ~ !is.na(.)))
I would suggest to use the wonderful janitor package here. Janitor is very user-friendly:
janitor::remove_empty(dat, which = "rows")
Benchmarking
#DavidArenburg suggested a number of alternatives. Here's a simple benchmarking of them.
library(tidyverse)
library(microbenchmark)
n <- 100
dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))
f1 <- function(dat) {
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
}
f2 <- function(dat) {
dat %>% filter(rowSums(is.na(.)) != ncol(.))
}
f3 <- function(dat) {
dat %>% filter(rowMeans(is.na(.)) < 1)
}
f4 <- function(dat) {
dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
}
f5 <- function(dat) {
dat %>% mutate(indx = row_number()) %>% gather(var, val, -indx) %>% group_by(indx) %>% filter(sum(is.na(val)) != n()) %>% spread(var, val)
}
# f1 is too slow to be included!
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Using Reduce and lapply appears to be the fastest:
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Unit: microseconds
expr min lq mean median uq max neval
f2 909.495 986.4680 2948.913 1154.4510 1434.725 131159.384 100
f3 946.321 1036.2745 1908.857 1221.1615 1805.405 7604.069 100
f4 706.647 809.2785 1318.694 960.0555 1089.099 13819.295 100
f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187 100
Using a larger data set 107,880 x 40:
dat <- diamonds
# Let every third row be NA
dat[seq(1, nrow(diamonds), 3), ] <- NA
# Add some extra NA to first column so na.omit() wouldn't work
dat[seq(2, nrow(diamonds), 3), 1] <- NA
# Increase size
dat <- dat %>%
bind_rows(., .) %>%
bind_cols(., .) %>%
bind_cols(., .)
# Make names unique
names(dat) <- 1:ncol(dat)
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
f5 is too slow so it is also excluded. f4 seems to do relatively better than before.
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
Unit: milliseconds
expr min lq mean median uq max neval
f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218 100
f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315 100
f4 27.68628 31.80557 73.63191 35.36144 137.2445 152.4686 100
Starting with dyplr 1.0, the colwise vignette gives a similar case as an example:
filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA
We can see it uses the same implicit "& logic" filter uses with multiple expressions. So the following minor adjustment selects all NA rows:
filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA
But the question asks for the inverse set: Remove rows with all NA.
We can do a simple setdiff using the previous, or
we can use the fact that across returns a logical tibble and filter effectively does a row-wise all() (i.e. &).
Eg:
rowAny = function(x) apply(x, 1, any)
anyVar = function(fcn) rowAny(across(everything(), fcn)) #make it readable
df %<>% filter(anyVar(~ !is.na(.x))) #Remove rows with *all* NA
Or:
filterout = function(df, ...) setdiff(df, filter(df, ...))
df %<>% filterout(across(everything(), is.na)) #Remove rows with *all* NA
Or even combinine the above 2 to express the first example more directly:
df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA
In my opinion, the tidyverse filter function would benefit from a parameter describing the 'aggregation logic'. It could default to "all" and preserve behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.
The solution using dplyr 1.0 is simple and does not require helper functions, you just need to add a negation in the right place.
dat %>% filter(!across(everything(), is.na))
dplyr 1.0.4 introduced the if_any() and if_all() functions:
dat %>% filter(if_any(everything(), ~!is.na(.)))
or, more verbose:
dat %>% filter(if_any(everything(), purrr::negate(is.na)))
"Take dat and keep all rows where any entry is non-NA"
Here's another solution that uses purrr::map_lgl() and tidyr::nest():
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
any_not_na <- function(x) {
!all(map_lgl(x, is.na))
}
dat_cleaned <- dat %>%
rownames_to_column("ID") %>%
group_by(ID) %>%
nest() %>%
filter(map_lgl(data, any_not_na)) %>%
unnest() %>%
select(-ID)
## Warning: package 'bindrcpp' was built under R version 3.4.2
dat_cleaned
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1. 1. 2.
## 2 2. NA NA
I doubt this approach will be able to compete with the benchmarks in #hejseb's answer, but I think it does a pretty good job at showing how the nest %>% map %>% unnest pattern works and users can run through it line-by-line to figure out what's going on.
You can use the function complete.cases from dplyr
using the dot (.) for specify the previous dataframe
on the chain.
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
x3 = c(NA,2,3,5)
)
df %>%
filter(complete.cases(.))
x1 x2 x3
1 2 2 2
I a neat solution what works in dplyr 1.0.1 is to use rowwise()
dat %>%
rowwise() %>%
filter(!all(is.na(across(everything())))) %>%
ungroup()
very similar to #Callum Savage 's comment on the top post but I missed it on the first pass, and without the sum()
(tidyverse 1.3.1)
data%>%rowwise()%>%
filter(!all(is.na(c_across(is.numeric))))
data%>%rowwise()%>%
filter(!all(is.na(c_across(starts_with("***")))))

filter for complete cases in data.frame using dplyr (case-wise deletion)

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5)
)
df %.%
filter(complete.cases(x1,x2))
Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that of the solutions above na.omit is much slower but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action attribute whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
ADDED Have updated to reflect latest version of dplyr and comments.
ADDED Have updated to reflect latest version of tidyr and comments.
This works for me:
df %>%
filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233 20
#filter.anonymous 1.149305 1.022891 1.013779 0.9948659 4.668691 20
# rowSums 2.281002 2.377807 2.420615 2.3467519 5.223077 20
# filter 1.000000 1.000000 1.000000 1.0000000 1.000000 20
Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)
# test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %.% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %.% na.omit() 50 109.618 20.217
This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):
drop_na <- function(data, ...){
if (missing(...)){
f = complete.cases(data)
} else {
f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
}
filter(data, f)
}
[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]
Examples:
library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
try this
df[complete.cases(df),] #output to console
OR even this
df.complete <- df[complete.cases(df),] #assign to a new data.frame
The above commands take care of checking for completeness for all the columns (variable)
in your data.frame.
Just for the sake of completeness, dplyr::filter can be avoided altogether but still be able to compose chains just by using magrittr:extract (an alias of [):
library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))
df %>%
extract(complete.cases(.), )
The additional bonus is speed, this is the fastest method among the filter and na.omit variants (tested using #Miha Trošt microbenchmarks).
dplyr >= 1.0.4
if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This could be useful if you had other variables in your dataframe that were not part of what you considered complete case. For example, if you only wanted non-missing rows in columns that start with "x":
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
y = c(NA, "A", "B", "C")
)
df %>%
dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))
x1 x2 y
1 1 1 <NA>
2 2 2 A
For more information on these functions see this link.

Resources