na.omit not removing NA (data is numerical)

I have just merged two data frames and am trying to remove rows with an NA in the value column. So far I have:
t.dup.rights <- merge(dataframe1, dataframe2, by = "ID", all = TRUE)
na.omit(t.dup.rights$value)
The value column contains either a 1 or NA. I used
is.numeric(t.dup.rights$value)
to double-check that R doesn't think it's a factor. After I run the na.omit function the data frame appears to remain unchanged. I am working with a particularly large data set (200K obs.) and am also using the dplyr package.

You can use tidyr::drop_na:
library(dplyr)
library(tidyr)
df <- tibble(
  x = sample(c(NA, 1, 2), 10, replace = TRUE),
  y = sample(c(NA, 1, 2), 10, replace = TRUE)
)
df
#> # A tibble: 10 x 2
#> x y
#> <dbl> <dbl>
#> 1 NA NA
#> 2 1 2
#> 3 2 2
#> 4 NA 1
#> 5 2 NA
#> 6 1 NA
#> 7 NA NA
#> 8 2 2
#> 9 1 NA
#> 10 NA 2
drop_na(df, x)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#> 3 2 NA
#> 4 1 NA
#> 5 2 2
#> 6 1 NA
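Two notes on the original attempt: merge(..., all = TRUE) is a full outer join, which is exactly what introduces NA rows for unmatched IDs, and na.omit() returns a new object rather than modifying anything in place. Since the result of na.omit(t.dup.rights$value) was never assigned (and it was called on the extracted column, not the data frame), the data frame was bound to appear unchanged. Applied to the question's data, a sketch reusing the object names from the question:
library(tidyr)
t.dup.rights <- merge(dataframe1, dataframe2, by = "ID", all = TRUE)
t.dup.rights <- drop_na(t.dup.rights, value)  # keep only rows with a non-NA value
# base R equivalent:
# t.dup.rights <- t.dup.rights[!is.na(t.dup.rights$value), ]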

Related

Is there a way to omit NAs when using the dplyr package Lag function in R?

I have two columns of data and I want to subtract the previous row's entry of one column from the other. The lag() function in dplyr is perfect for this, but unfortunately I have NAs in the second column. This means that whenever the previous row has an NA in the second column, the calculated value is also NA (see the table below):
Value 1  Value 2  New Value
NA       2        NA
13       3        11
6        NA       3
6        4        NA
5        4        1
Is there a way to omit the NAs and have it use the last actual value in the second column instead? Like an na.rm = TRUE that can be added? The result would look like:
Value 1  Value 2  New Value
NA       2        NA
13       3        11
6        NA       3
6        4        3
5        4        1
One option would be to mutate Value 2 beforehand and fill the NAs with the last non-NA value:
df %>%
  tidyr::fill(`Value 2`, .direction = "down")
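A fuller sketch of that approach, assuming New Value is Value 1 minus the lagged Value 2 (inferred from the example tables):
library(dplyr)
library(tidyr)
df %>%
  fill(`Value 2`, .direction = "down") %>%          # carry the last non-NA Value 2 forward
  mutate(`New Value` = `Value 1` - lag(`Value 2`))  # subtract the previous row's Value 2
Note that this overwrites Value 2 with its filled version; fill into a temporary column first if the original NAs need to be kept.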
Alternatively, after computing New.Value, you can fill its NAs with the previous value using fill() from the tidyr package.
library(dplyr)
library(tidyr)
df <- tibble::tribble(
  ~Value.1, ~Value.2, ~New.Value,
  NA,       2,        NA,
  13,       3,        11,
  6,        NA,       3,
  6,        4,        NA,
  5,        4,        1
)
df
#> # A tibble: 5 × 3
#> Value.1 Value.2 New.Value
#> <dbl> <dbl> <dbl>
#> 1 NA 2 NA
#> 2 13 3 11
#> 3 6 NA 3
#> 4 6 4 NA
#> 5 5 4 1
df %>%
  fill(New.Value, .direction = "down")
#> # A tibble: 5 × 3
#> Value.1 Value.2 New.Value
#> <dbl> <dbl> <dbl>
#> 1 NA 2 NA
#> 2 13 3 11
#> 3 6 NA 3
#> 4 6 4 3
#> 5 5 4 1
Note that this fills New.Value from the previously computed value rather than recomputing it from a filled Value 2; the two coincide for this example but need not in general.
Created on 2022-07-08 by the reprex package (v2.0.1)

Counting the number of observations row-wise using dplyr

I have a dataset that looks like this:
sample <- tibble(x = c(1, 2, 3, NA), y = c(5, NA, 2, NA))
sample
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 1 5
2 2 NA
3 3 2
4 NA NA
Now I want to create a new variable Z, which will count how many non-missing observations are in each row. For example, for the sample dataset above, the first value of the new variable Z should be 2 because both x and y have values. Similarly, for the 2nd row the value of Z is 1, as there is one missing value, and for the 4th row the value is 0, as there are no observations in the row.
The expected dataset looks like this:
x y z
<dbl> <dbl> <dbl>
1 1 5 2
2 2 NA 1
3 3 2 2
4 NA NA 0
I want to do this for a select few variables, not the whole dataset.
Using base R. The first line checks all columns, the second checks columns by name, and the third might not scale as well if the number of columns is substantial (note that is.finite() is also FALSE for NaN and Inf, not just NA).
sample$z1 <- rowSums(!is.na(sample))
sample$z2 <- rowSums(!is.na(sample[c("x", "y")]))
sample$z3 <- is.finite(sample$x) + is.finite(sample$y)
> sample
# A tibble: 4 x 5
x y z1 z2 z3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 5 2 2 2
2 2 NA 1 1 1
3 3 2 2 2 2
4 NA NA 0 0 0
We can use
library(dplyr)
sample %>%
  rowwise() %>%
  mutate(z = sum(!is.na(cur_data()))) %>%
  ungroup()
Output:
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <int>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
If it is only for selected columns:
sample %>%
  rowwise() %>%
  mutate(z = sum(!is.na(select(cur_data(), x:y))))
Or with rowSums on a logical matrix:
sample %>%
  mutate(z = rowSums(!is.na(cur_data())))
Output:
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
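For what it's worth, cur_data() is superseded by pick() as of dplyr 1.1.0, so on newer dplyr the rowSums variant restricted to chosen columns could be written as the following sketch (assumes dplyr >= 1.1.0):
library(dplyr)
sample %>%
  mutate(z = rowSums(!is.na(pick(x, y))))  # pick() returns the chosen columns as a tibble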
An example using apply() with selected columns:
set.seed(7)
vals <- sample(c(1:20, NA, NA), 20)
sample <- matrix(vals, ncol = 5)
# Select columns 1, 3, 4
cols <- c(1, 3, 4)
rowcnts <- apply(sample[ , cols], 1, function(x) length(x[!is.na(x)]))
sample <- cbind(sample, rowcnts)
> sample
                    rowcnts
[1,] 10 15 16 NA 12       2
[2,] 19  8 14 18  9       3
[3,]  7 17  6  4  1       3
[4,]  2  3 13 NA  5       2

Iterate over data frame by elements of a list

I am trying to merge two data frames based on a subset on common IDs. Let me demonstrate:
library(tidyverse)
set.seed(42)
df = list(id = c(1, 2, 3, 4, 1, 2, 2, 2, 1, 1),
          group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
          val = round(rnorm(10, 6, 6), 0)) %>%
  as_tibble()
df_na = list(id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5),
             group = rep(NA, 10),
             val = rep(NA, 10)) %>%
  as_tibble()
df contains data and ids, while df_na only contains ids and NAs. I would like to create a combined data frame that contains all the information of df and add the NAs by group and id, i.e. for each group in df find which ids are present in both df and df_na and merge.
If I was doing this manually, i.e. group for group, I would use something like this:
A_dist = df %>%
  filter(group == "A") %>%
  distinct(id) %>%
  pull()
df_A_comb = df_na %>%
  filter(id %in% A_dist) %>%
  bind_rows(filter(df, group == "A"))
# A tibble: 11 x 3
id group val
<dbl> <chr> <dbl>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 2 NA NA
5 3 NA NA
6 3 NA NA
7 4 NA NA
8 1 A 14
9 2 A 3
10 3 A 8
11 4 A 10
But obviously, I would rather automate this. As an emerging fan of the tidyverse, I'm trying to get my head around purrr::map. I can create a vector of ids for each group.
df_dist = df %>%
  split(.$group) %>%
  map(distinct, id) %>%
  map("id")
> df_dist
$A
[1] 1 2 3 4
$B
[1] 1 2
$C
[1] 1
But translating my dplyr approach is more complicated and produces an error message earlier on.
### this approach doesn't work...
df_comb = df_na %>%
  map(filter, id %in% df_dist) # %>%
...
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
Any help will be much appreciated!
The tidyverse is great, but sometimes you can overlook the power of simple indexing and the really useful base R tools of the apply family. (Your map() attempt fails because mapping over a data frame iterates over its columns, so filter() is handed a bare numeric vector.) This single lapply call will give you a list containing all the data frames you want in the specified format:
lapply(unique(df$group), function(x) {
  rbind(df_na[df_na$id %in% df$id[df$group == x], ], df[df$group == x, ])
})
Result:
#> [[1]]
#> # A tibble: 11 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 2 <NA> NA
#> 5 3 <NA> NA
#> 6 3 <NA> NA
#> 7 4 <NA> NA
#> 8 1 A 14
#> 9 2 A 3
#> 10 3 A 8
#> 11 4 A 10
#>
#> [[2]]
#> # A tibble: 8 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 2 <NA> NA
#> 5 1 B 8
#> 6 2 B 5
#> 7 2 B 15
#> 8 2 B 5
#>
#> [[3]]
#> # A tibble: 5 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 1 C 18
#> 5 1 C 6
If you want to join them into a single data frame, store the result (say as x) and do this:
do.call(rbind, x)
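Since the question was reaching for purrr, here is a roughly equivalent sketch with purrr::map_dfr(), which maps and row-binds in one step (the logic mirrors the lapply answer above):
library(purrr)
library(dplyr)
df_comb <- map_dfr(unique(df$group), function(g) {
  ids <- df$id[df$group == g]    # ids present in this group
  bind_rows(
    filter(df_na, id %in% ids),  # the NA rows for those ids
    filter(df, group == g)       # the group's actual data
  )
})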

R - Replace missing values with highest of 4 previous values

This is a variation of the last-observation-carried-forward problem in a vector with some missing values. Instead of filling in NA values with the last non-NA observation, I would like to fill in NA values with the highest value among the 4 observations preceding it. If all 4 preceding observations are also NA, the NA should be retained. I would also appreciate it if this can be done by groups in a data frame/data table.
Example:
Original DF:
ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA
Output DF:
ID Week Value
a 1 5
a 2 1
a 3 5
a 4 5
a 5 3
a 6 4
a 7 4
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 1
lag() shifts the column by n steps and lets you peek at previous values. pmax() is the element-wise maximum and lets you pick the highest value across each set/row of observations.
To abstract away the notion of 4 and maintain vectorized performance, you can use quasiquotation from rlang: http://dplyr.tidyverse.org/articles/programming.html#quasiquotation
It can look a little cryptic at first but is very precise and expressive.
df <- readr::read_table(
" ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA")
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df %>%
  group_by(ID) %>%
  mutate(
    Value = if_else(
      is.na(Value),
      pmax(lag(Value, 1), lag(Value, 2), lag(Value, 3), lag(Value, 4), na.rm = TRUE),
      Value
    )
  )
#> # A tibble: 14 x 3
#> # Groups: ID [2]
#> ID Week Value
#> <chr> <int> <int>
#> 1 a 1 5
#> 2 a 2 1
#> 3 a 3 5
#> 4 a 4 5
#> 5 a 5 3
#> 6 a 6 4
#> 7 a 7 4
#> 8 b 1 NA
#> 9 b 2 NA
#> 10 b 3 NA
#> 11 b 4 NA
#> 12 b 5 NA
#> 13 b 6 1
#> 14 b 7 1
# or if you are an rlang ninja
library(purrr)
pmax_lag_n <- function(column, n) {
  column <- enquo(column)
  1:n %>%
    map(~ quo(lag(!!column, !!.x))) %>%
    { quo(pmax(!!!., na.rm = TRUE)) }
}
df %>%
  group_by(ID) %>%
  mutate(Value = if_else(is.na(Value), !!pmax_lag_n(Value, 4), Value))
#> # A tibble: 14 x 3
#> # Groups: ID [2]
#> ID Week Value
#> <chr> <int> <int>
#> 1 a 1 5
#> 2 a 2 1
#> 3 a 3 5
#> 4 a 4 5
#> 5 a 5 3
#> 6 a 6 4
#> 7 a 7 4
#> 8 b 1 NA
#> 9 b 2 NA
#> 10 b 3 NA
#> 11 b 4 NA
#> 12 b 5 NA
#> 13 b 6 1
#> 14 b 7 1
Define a function Max which accepts a vector x and returns NA if all its elements are NA. Otherwise, if the last value is NA it returns the maximum of all non-NA elements, and if the last value is not NA it returns that value.
Also define na.max, which runs Max on a rolling window of length n (given by the second argument to na.max; default 5, i.e. the current value plus the 4 preceding ones).
Finally, apply na.max to Value by ID using ave:
library(zoo)
Max <- function(x) {
  last <- tail(x, 1)
  if (all(is.na(x))) NA
  else if (is.na(last)) max(x, na.rm = TRUE)
  else last
}
na.max <- function(x, n = 5) rollapplyr(x, n, Max, partial = TRUE)
transform(DF, Value = ave(Value, ID, FUN = na.max))
giving:
ID Week Value
1 a 1 5
2 a 2 1
3 a 3 5
4 a 4 5
5 a 5 3
6 a 6 4
7 a 7 4
8 b 1 NA
9 b 2 NA
10 b 3 NA
11 b 4 NA
12 b 5 NA
13 b 6 1
14 b 7 1
Note: Input DF in reproducible form:
Lines <- "
ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA"
DF <- read.table(text = Lines, header = TRUE)
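For the data frame/data table part of the question, a data.table sketch of the same logic (shift() returns a list of lagged vectors when given several lag values, so do.call feeds them all to pmax):
library(data.table)
DT <- as.data.table(DF)
DT[, Value := fifelse(
  is.na(Value),
  do.call(pmax, c(shift(Value, 1:4), na.rm = TRUE)),  # highest of the 4 preceding values
  Value
), by = ID]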

Why does dplyr's filter drop NA values from a factor variable?

When I use filter from the dplyr package to drop a level of a factor variable, filter also drops the NA values. Here's an example:
library(dplyr)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
# var1
# 1 <NA>
# 2 3
# 3 3
# 4 1
# 5 1
# 6 <NA>
# 7 2
# 8 2
# 9 <NA>
# 10 1
filter(dat, var1 != 1)
# var1
# 1 3
# 2 3
# 3 2
# 4 2
This does not seem ideal -- I only wanted to drop rows where var1 == 1.
It looks like this is occurring because any comparison with NA returns NA, which filter then drops. So, for example, filter(dat, !(var1 %in% 1)) produces the correct results. But is there a way to tell filter not to drop the NA values?
You could use this:
filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>
This keeps the NA rows while still dropping the rows where var1 == 1.
Also, just for completeness: dropping NAs is the intended behavior of filter, as you can see from the following test taken from dplyr's test suite on GitHub:
test_that("filter discards NA", {
  temp <- data.frame(
    i = 1:5,
    x = c(NA, 1L, 1L, 0L, 0L)
  )
  res <- filter(temp, x == 1)
  expect_equal(nrow(res), 2L)
})
The answers previously given are good, but when your filter statement involves a function of many fields, the workaround might not be so great. Also, who wants to mapply the non-vectorized identical()? Here is a somewhat simpler solution using coalesce():
filter(dat, coalesce(var1 != 1, TRUE))
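Why this works: the comparison var1 != 1 yields NA for the NA rows, and coalesce() replaces those NAs with TRUE, so those rows are kept. A quick illustration:
library(dplyr)
coalesce(c(TRUE, NA, FALSE), TRUE)  # the scalar TRUE fills in wherever the vector is NA
#> [1]  TRUE  TRUE FALSE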
I often map identical with mapply...
(note: I believe because of changes in R 3.6.0, set.seed and sample end up with different test data)
library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
#> var1
#> 1 3
#> 2 1
#> 3 <NA>
#> 4 3
#> 5 1
#> 6 3
#> 7 2
#> 8 3
#> 9 2
#> 10 1
filter(dat, var1 != 1)
#> var1
#> 1 3
#> 2 3
#> 3 3
#> 4 2
#> 5 3
#> 6 2
filter(dat, !mapply(identical, as.numeric(var1), 1))
#> var1
#> 1 3
#> 2 <NA>
#> 3 3
#> 4 3
#> 5 2
#> 6 3
#> 7 2
It works for numerics and strings as well (probably a more common use case)...
library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = sample(c(1:3, NA), size = 10, replace = T),
                   var2 = letters[sample(c(1:3, NA), size = 10, replace = T)],
                   stringsAsFactors = FALSE))
#> var1 var2
#> 1 3 <NA>
#> 2 1 a
#> 3 NA a
#> 4 3 b
#> 5 1 b
#> 6 3 <NA>
#> 7 2 a
#> 8 3 c
#> 9 2 <NA>
#> 10 1 b
filter(dat, !mapply(identical, var1, 1L))
#> var1 var2
#> 1 3 <NA>
#> 2 NA a
#> 3 3 b
#> 4 3 <NA>
#> 5 2 a
#> 6 3 c
#> 7 2 <NA>
filter(dat, !mapply(identical, var2, 'a'))
#> var1 var2
#> 1 3 <NA>
#> 2 3 b
#> 3 1 b
#> 4 3 <NA>
#> 5 3 c
#> 6 2 <NA>
#> 7 1 b
