Recently I stumbled upon some strange behaviour of dplyr and I would be happy if somebody could provide some insights.
Assume I have a data frame of which some columns contain numerical values. In a simple scenario I would like to compute rowSums of those columns. Although there are many ways to do it, here are two examples:
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
# works
dplyr::select(df, - ids) %>% {rowSums(.)}
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(df, - ids) %>% {rowSums(.)})
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(., - ids) %>% {rowSums(.)})
# workaround:
tmp <- dplyr::select(df, - ids) %>% {rowSums(.)}
df %>%
dplyr::mutate(blubb = tmp)
# works
rowSums(dplyr::select(df, - ids))
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = rowSums(dplyr::select(df, - ids)))
# workaround
tmp <- rowSums(dplyr::select(df, - ids))
df %>%
dplyr::mutate(blubb = tmp)
First, I don't really understand what is causing the error, and second, I would like to know how to actually achieve such a computation over some selected columns in a tidy way.
edit
The question mutate and rowSums exclude columns, although related, focuses on using rowSums for the computation. Here I'm eager to understand why the examples above do not work. It is not so much about how to solve it (see the workarounds) but about understanding what happens when the naive approach is applied.
The examples do not work because you are nesting select in mutate and using bare variable names. In this case, select is trying to do something like
> -df$ids
Error in -df$ids : invalid argument to unary operator
which fails because you can't negate a character string (i.e. -"i1" or -"i2" makes no sense). Either of the formulations below works:
df %>% mutate(blubb = rowSums(select_(., "X1", "X2")))
df %>% mutate(blubb = rowSums(select(., -3)))
or
df %>% mutate(blubb = rowSums(select_(., "-ids")))
as suggested by #Haboryme.
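With current dplyr (1.0 or later) the same idea is usually written with across() rather than select_(); here is a minimal sketch, assuming the numeric columns are still called X1 and X2:
library(dplyr)
# sum X1 and X2 row-wise without nesting select() inside mutate()
df %>% mutate(blubb = rowSums(across(c(X1, X2))))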
select_ is deprecated. You can use:
library(dplyr)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>%
mutate(blubb = rowSums(select(., all_of(c("X1", "X2")))))
# Or more generally:
desired_columns <- c("X1", "X2")
df %>%
mutate(blubb = rowSums(select(., all_of(desired_columns))))
select can now accept bare column names, so there is no need to use .dots or select_, which has been deprecated.
Here are a few approaches that work now.
library(dplyr)
#sum all the columns except `id`.
df %>% mutate(blubb = rowSums(select(., -ids), na.rm = TRUE))
#sum X1 and X2 columns
df %>% mutate(blubb = rowSums(select(., X1, X2), na.rm = TRUE))
#sum all the columns that start with 'X'
df %>% mutate(blubb = rowSums(select(., starts_with('X')), na.rm = TRUE))
#sum all the numeric columns
df %>% mutate(blubb = rowSums(select(., where(is.numeric))))
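For completeness, a rowwise() + c_across() sketch is another current idiom; it tends to read well for more complex row-wise logic, though it is slower than rowSums() on large data:
df %>%
  rowwise() %>%
  mutate(blubb = sum(c_across(where(is.numeric)), na.rm = TRUE)) %>%
  ungroup()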
Adding to this old thread because I searched for this question and then realized I was asking the wrong question. Also, I detect some yearning in this and related questions for the proper pipe-steps way to do this.
The answers here are somewhat non-intuitive because they are trying to use the dplyr vernacular with non-"tidy" data. If you want to do it the dplyr way, make the data tidy first using gather(), and then use summarise():
library(tidyverse)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>% gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
#> # A tibble: 20 x 2
#> ids rowsum
#> <chr> <dbl>
#> 1 i1 0.942
#> 2 i10 -0.330
#> 3 i11 0.942
#> 4 i12 -0.721
#> 5 i13 2.50
#> 6 i14 -0.611
#> 7 i15 -0.799
#> 8 i16 1.84
#> 9 i17 -0.629
#> 10 i18 -1.39
#> 11 i19 1.44
#> 12 i2 -0.721
#> 13 i20 -0.330
#> 14 i3 2.50
#> 15 i4 -0.611
#> 16 i5 -0.799
#> 17 i6 1.84
#> 18 i7 -0.629
#> 19 i8 -1.39
#> 20 i9 1.44
If you care about the order of the ids when they are not sortable using arrange(), make that column a factor first.
df %>%
mutate(ids=as_factor(ids)) %>%
gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
Why do you want to use the pipe operator? Just write an expression such as:
rowSums(df[,sapply(df, is.numeric)])
i.e. calculate the rowsums on all the numeric columns, with the advantage of not needing to specify ids.
If you want to save the results as a column in your data, you can use data.table syntax like this:
library(data.table)
dt <- as.data.table(df)
dt[, x3 := rowSums(.SD, na.rm = TRUE), .SDcols = which(sapply(dt, is.numeric))]
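If I remember correctly, data.table 1.12.0 and later also accept a predicate function in .SDcols, which shortens this a little:
dt[, x3 := rowSums(.SD, na.rm = TRUE), .SDcols = is.numeric]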
Related
I have already read a variety of threads on dynamically named variables, but I couldn't quite find an answer.
I have two dataframes.
df <- data.frame(qno=c(1,2,3,4))
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
Now I want to create another column 'average' in the df dataframe which gives me the average of each column in ref.
Intended output:
df <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
Here is what I have tried:
df <- df %>%
  mutate(average := mean(!!as.name(paste0("ref$Q", qno))))
I have also tried a version with a for loop, but that didn't work either.
for (i in 1:length(df$qno)){
df$average[i] <- mean(as.name(paste0("ref$Q",df$qno[i])))
}
df <- df %>%
mutate(average = mean(as.name(paste0("ref$Q", qno))))
Here it is with mutate:
df %>% mutate(average = t(ref %>% summarise(across(everything(), ~mean(.x, na.rm = TRUE)))))
qno average
1 1 10.5
2 2 30.5
3 3 50.5
4 4 70.5
But you can use it without mutate entirely if you want the names from ref:
t(ref %>% summarise(across(everything(), list(mean), .names = "{.col}"))) %>%
data.frame() %>%
rename(average = 1)
average
Q1 10.5
Q2 30.5
Q3 50.5
Q4 70.5
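A simpler variant of the same idea, assuming qno lines up with the column order of ref, is to take colMeans() of ref directly:
df %>% mutate(average = colMeans(ref, na.rm = TRUE))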
Does this solve your problem?
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
out <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
df <- data.frame(qno=c(1:length(ref)))
for (i in seq_along(ref)) {
df$average[i] <- mean(ref[[i]], na.rm = T)
}
I was not really sure if you want to name the rows like the variables, so you could just add this when you create the df object:
df <- data.frame(qno = paste0("Q", c(1:length(ref))))
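Another possible sketch that keeps the lookup dynamic, closer to what the question attempted with paste0(): build each column name from qno and index ref with [[, here via purrr::map_dbl():
library(dplyr)
library(purrr)
df <- data.frame(qno = c(1, 2, 3, 4))
# for each qno, look up the matching "Q<qno>" column of ref and take its mean
df %>% mutate(average = map_dbl(qno, ~ mean(ref[[paste0("Q", .x)]], na.rm = TRUE)))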
I'm an R newbie, so my apologies if this is a simple question.
I use Excel a lot to create "dual entry" tables. The name 'dual table' is probably not the most accurate, but I wouldn't know how to describe it otherwise.
I basically start from big tables and then create a new one where I average the data, grouping by two columns, and then display it as a matrix.
I will share a perfectly functional R example I coded myself.
My question is: is there an easier / better way to do it?
This is my working code:
require(dplyr)
df <- mtcars
output_var <- 'disp'
rows_var <- 'cyl'
col_var <- 'am'
output_name <- paste0("Avg. ",output_var)
one_way_table <- df %>%
group_by(eval(parse(text=rows_var)), eval(parse(text=col_var)) ) %>%
summarise(output=mean( eval(parse(text=output_var)) ))
one_way_table <- data.frame(one_way_table, check.rows = F, check.names = F, stringsAsFactors = F)
colnames(one_way_table) <- c(rows_var, col_var, output_name)
unique_row_items <- unique(one_way_table[,rows_var])
unique_col_items <- unique(one_way_table[,col_var])
x_rows <- rep(unique_row_items, length(unique_col_items))
y_cols <- rep(unique_col_items, length(unique_row_items))
new_df <- data.frame(x = x_rows, y = y_cols, check.rows = F, check.names = F, stringsAsFactors = F)
colnames(new_df) <- c(rows_var, col_var)
new_df <- base::merge(new_df, one_way_table, by = c(rows_var, col_var), all.x=T)
m <- matrix(new_df[, output_name], ncol= length(unique(new_df[,col_var])) )
df_matrix <- data.frame(m, check.rows = F, check.names = F, stringsAsFactors = F)
Perhaps there's a far more efficient way to do it.
Notice how, since this will be coded inside a function, I had to use variable names to define which columns to use for the analysis.
Thanks
A possible solution for your issue can come from the tidyverse. Here is an example reshaping your data and aggregating with mean:
library(tidyverse)
#Data
df <- mtcars
#Code
df %>% pivot_longer(cols = -c(cyl,am)) %>% filter(name=='disp') %>%
group_by(cyl,am) %>% summarise(Mean=mean(value)) %>%
pivot_wider(names_from = am,values_from=Mean)
Output:
# A tibble: 3 x 3
# Groups: cyl [3]
cyl `0` `1`
<dbl> <dbl> <dbl>
1 4 136. 93.6
2 6 205. 155
3 8 358. 326
Which is close to df_matrix, the final output of your code.
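If you also want the row names of df_matrix rather than a cyl column, one possible final step (assuming the pivoted result above is stored in a hypothetical object res) is:
res %>% ungroup() %>% as.data.frame() %>% tibble::column_to_rownames("cyl")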
If we need to pivot, this can be done in a simpler way. We select the columns of interest and use pivot_wider with values_fn = mean, which is applied to the column selected in values_from:
library(dplyr)
library(tidyr)
mtcars %>%
select(cyl, am, disp) %>%
pivot_wider(names_from = am, values_from = disp, values_fn = mean)
# A tibble: 3 x 3
# cyl `1` `0`
# <dbl> <dbl> <dbl>
#1 6 155 205.
#2 4 93.6 136.
#3 8 326 358.
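The rows follow the first appearance of each cyl value in mtcars; appending arrange(cyl) reproduces the 4/6/8 ordering of the previous answer:
mtcars %>%
  select(cyl, am, disp) %>%
  pivot_wider(names_from = am, values_from = disp, values_fn = mean) %>%
  arrange(cyl)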
I'm having some issues with a seemingly simple task: to remove all rows where all variables are NA using dplyr. I know it can be done using base R (Remove rows in R matrix where all data is NA and Removing empty rows of a data file in R), but I'm curious to know if there is a simple way of doing it using dplyr.
Example:
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
filter(dat, !is.na(a) | !is.na(b) | !is.na(c))
The filter call above does what I want but it's infeasible in the situation I'm facing (as there is a large number of variables). I guess one could do it by using filter_ and first creating a string with the (long) logical statement, but it seems like there should be a simpler way.
Another way is to use rowwise() and do():
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
but that does not look too nice, although it gets the job done. Other ideas?
Since dplyr 0.7.0, new scoped filtering verbs exist. Using filter_all() you can easily filter rows with at least one non-missing column:
# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))
Using #hejseb's benchmarking algorithm, it appears that this solution is as efficient as f4.
UPDATE:
Since dplyr 1.0.0 the scoped verbs above are superseded. Instead, the across() function family was introduced, which allows applying a function to multiple (or all) columns. Filtering rows with at least one non-NA column now looks like this:
# dplyr 1.0.0
dat %>% filter(if_any(everything(), ~ !is.na(.)))
I would suggest using the wonderful janitor package here. janitor is very user-friendly:
janitor::remove_empty(dat, which = "rows")
Benchmarking
#DavidArenburg suggested a number of alternatives. Here's a simple benchmarking of them.
library(tidyverse)
library(microbenchmark)
n <- 100
dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))
f1 <- function(dat) {
na <- dat %>%
rowwise() %>%
do(tibble(na = !all(is.na(.)))) %>%
.$na
filter(dat, na)
}
f2 <- function(dat) {
dat %>% filter(rowSums(is.na(.)) != ncol(.))
}
f3 <- function(dat) {
dat %>% filter(rowMeans(is.na(.)) < 1)
}
f4 <- function(dat) {
dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
}
f5 <- function(dat) {
dat %>% mutate(indx = row_number()) %>% gather(var, val, -indx) %>% group_by(indx) %>% filter(sum(is.na(val)) != n()) %>% spread(var, val)
}
# f1 is too slow to be included!
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Using Reduce and lapply appears to be the fastest:
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Unit: microseconds
expr min lq mean median uq max neval
f2 909.495 986.4680 2948.913 1154.4510 1434.725 131159.384 100
f3 946.321 1036.2745 1908.857 1221.1615 1805.405 7604.069 100
f4 706.647 809.2785 1318.694 960.0555 1089.099 13819.295 100
f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187 100
Using a larger data set 107,880 x 40:
dat <- diamonds
# Let every third row be NA
dat[seq(1, nrow(diamonds), 3), ] <- NA
# Add some extra NA to first column so na.omit() wouldn't work
dat[seq(2, nrow(diamonds), 3), 1] <- NA
# Increase size
dat <- dat %>%
bind_rows(., .) %>%
bind_cols(., .) %>%
bind_cols(., .)
# Make names unique
names(dat) <- 1:ncol(dat)
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
f5 is too slow so it is also excluded. f4 seems to do relatively better than before.
> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
Unit: milliseconds
expr min lq mean median uq max neval
f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218 100
f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315 100
f4 27.68628 31.80557 73.63191 35.36144 137.2445 152.4686 100
Starting with dplyr 1.0, the colwise vignette gives a similar case as an example:
filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA
We can see it uses the same implicit "& logic" filter uses with multiple expressions. So the following minor adjustment selects all NA rows:
filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA
But the question asks for the inverse set: Remove rows with all NA.
We can do a simple setdiff using the previous, or
we can use the fact that across returns a logical tibble and filter effectively does a row-wise all() (i.e. &).
Eg:
library(magrittr) # for the %<>% assignment pipe used below
rowAny = function(x) apply(x, 1, any)
anyVar = function(fcn) rowAny(across(everything(), fcn)) #make it readable
df %<>% filter(anyVar(~ !is.na(.x))) #Remove rows with *all* NA
Or:
filterout = function(df, ...) setdiff(df, filter(df, ...))
df %<>% filterout(across(everything(), is.na)) #Remove rows with *all* NA
Or even combining the two approaches above to express the first example more directly:
df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA
In my opinion, the tidyverse filter function would benefit from a parameter describing the 'aggregation logic'. It could default to "all" and preserve behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.
The solution using dplyr 1.0 is simple and does not require helper functions; you just need to add a negation in the right place.
dat %>% filter(!across(everything(), is.na))
dplyr 1.0.4 introduced the if_any() and if_all() functions:
dat %>% filter(if_any(everything(), ~!is.na(.)))
or, more verbose:
dat %>% filter(if_any(everything(), purrr::negate(is.na)))
"Take dat and keep all rows where any entry is non-NA"
Here's another solution that uses purrr::map_lgl() and tidyr::nest():
library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
any_not_na <- function(x) {
!all(map_lgl(x, is.na))
}
dat_cleaned <- dat %>%
rownames_to_column("ID") %>%
group_by(ID) %>%
nest() %>%
filter(map_lgl(data, any_not_na)) %>%
unnest() %>%
select(-ID)
## Warning: package 'bindrcpp' was built under R version 3.4.2
dat_cleaned
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1. 1. 2.
## 2 2. NA NA
I doubt this approach will be able to compete with the benchmarks in #hejseb's answer, but I think it does a pretty good job at showing how the nest %>% map %>% unnest pattern works and users can run through it line-by-line to figure out what's going on.
You can use the function complete.cases() (from base R, not dplyr itself), using the dot (.) to refer to the data frame at that point in the chain.
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
x3 = c(NA,2,3,5)
)
df %>%
filter(complete.cases(.))
x1 x2 x3
1 2 2 2
A neat solution that works in dplyr 1.0.1 is to use rowwise():
dat %>%
rowwise() %>%
filter(!all(is.na(across(everything())))) %>%
ungroup()
Very similar to #Callum Savage's comment on the top post (I missed it on the first pass), but without the sum():
(tidyverse 1.3.1)
data %>% rowwise() %>%
  filter(!all(is.na(c_across(where(is.numeric)))))
data %>% rowwise() %>%
  filter(!all(is.na(c_across(starts_with("***")))))
Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5)
)
df %.%
filter(complete.cases(x1,x2))
Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that, of the solutions above, na.omit is much slower; but that has to be balanced against the fact that it returns the row indices of the omitted rows in the na.action attribute, whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
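Those dropped-row indices can be retrieved afterwards with the na.action() accessor, for example:
omitted <- df %>% na.omit
na.action(omitted) # indices of the rows that were removed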
ADDED Have updated to reflect latest version of dplyr and comments.
ADDED Have updated to reflect latest version of tidyr and comments.
This works for me:
df %>%
filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
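For example, with a hypothetical derived column x3 added mid-chain, complete.cases(.) then sees the modified data:
df %>%
  mutate(x3 = x1 * 2) %>%      # hypothetical derived column
  filter(complete.cases(.))    # complete cases of the modified data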
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233 20
#filter.anonymous 1.149305 1.022891 1.013779 0.9948659 4.668691 20
# rowSums 2.281002 2.377807 2.420615 2.3467519 5.223077 20
# filter 1.000000 1.000000 1.000000 1.0000000 1.000000 20
Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this, maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)
# test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %.% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %.% na.omit() 50 109.618 20.217
This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):
drop_na <- function(data, ...){
if (missing(...)){
f = complete.cases(data)
} else {
f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
}
filter(data, f)
}
[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]
Examples:
library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
try this
df[complete.cases(df),] #output to console
OR even this
df.complete <- df[complete.cases(df),] #assign to a new data.frame
The above commands take care of checking for completeness for all the columns (variables) in your data.frame.
Just for the sake of completeness, dplyr::filter can be avoided altogether while still composing chains, just by using magrittr::extract (an alias of `[`):
library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))
df %>%
extract(complete.cases(.), )
An additional bonus is speed: this is the fastest method among the filter and na.omit variants (tested using #Miha Trošt's microbenchmarks).
dplyr >= 1.0.4
if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This could be useful if you had other variables in your dataframe that were not part of what you considered complete case. For example, if you only wanted non-missing rows in columns that start with "x":
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
y = c(NA, "A", "B", "C")
)
df %>%
dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))
x1 x2 y
1 1 1 <NA>
2 2 2 A
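The if_any() counterpart keeps rows where at least one of the "x" columns is non-missing, which here retains every row:
df %>%
  dplyr::filter(if_any(starts_with("x"), ~!is.na(.)))
  x1 x2    y
1  1  1 <NA>
2  2  2    A
3  3 NA    B
4 NA  5    C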
For more information on these functions, see the dplyr documentation for if_any() and if_all().