I am trying to merge two data frames based on a subset of common IDs. Let me demonstrate:
library(tidyverse)
set.seed(42)
df <- list(id = c(1, 2, 3, 4, 1, 2, 2, 2, 1, 1),
           group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
           val = round(rnorm(10, 6, 6), 0)) %>%
  as_tibble()
df_na <- list(id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5),
              group = rep(NA, 10),
              val = rep(NA, 10)) %>%
  as_tibble()
df contains data and ids, while df_na contains only ids and NAs. I would like to create a combined data frame that contains all the information of df plus the NA rows, matched by group and id; i.e., for each group in df, find which ids are present in both df and df_na and merge them.
If I were doing this manually, i.e. group by group, I would use something like this:
A_dist = df %>% filter(group=="A") %>%
distinct(id) %>%
pull()
df_A_comb = df_na %>%
filter(id %in% A_dist) %>%
bind_rows(filter(df, group=="A"))
# A tibble: 11 x 3
id group val
<dbl> <chr> <dbl>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 2 NA NA
5 3 NA NA
6 3 NA NA
7 4 NA NA
8 1 A 14
9 2 A 3
10 3 A 8
11 4 A 10
But obviously, I would rather automate this. As an emerging fan of the tidyverse, I'm trying to get my head around purrr::map. I can create a vector of ids for each group.
df_dist = df %>%
split(.$group) %>%
map(distinct, id) %>%
map("id")
> df_dist
$A
[1] 1 2 3 4
$B
[1] 1 2
$C
[1] 1
But translating my dplyr approach is more complicated and fails with an error early on.
###this approach doesn't work...
df_comb = df_na %>%
map(filter, id %in% df_dist)# %>%
...
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
Any help will be much appreciated!
The tidyverse is great, but sometimes you can overlook the power of simple indexing and the really useful base R tools of the apply family. (Incidentally, your map() attempt fails because piping a data frame into map() iterates over its columns, so filter() receives a bare numeric vector rather than a data frame.) This single lapply call will give you a list containing all the data frames you want in the specified format:
lapply(unique(df$group), function(x) {
  rbind(df_na[df_na$id %in% df$id[df$group == x], ],
        df[df$group == x, ])
})
Result:
#> [[1]]
#> # A tibble: 11 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 2 <NA> NA
#> 5 3 <NA> NA
#> 6 3 <NA> NA
#> 7 4 <NA> NA
#> 8 1 A 14
#> 9 2 A 3
#> 10 3 A 8
#> 11 4 A 10
#>
#> [[2]]
#> # A tibble: 8 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 2 <NA> NA
#> 5 1 B 8
#> 6 2 B 5
#> 7 2 B 15
#> 8 2 B 5
#>
#> [[3]]
#> # A tibble: 5 x 3
#> id group val
#> <dbl> <chr> <dbl>
#> 1 1 <NA> NA
#> 2 1 <NA> NA
#> 3 1 <NA> NA
#> 4 1 C 18
#> 5 1 C 6
If you want to join them into a single data frame, store the result (say as x) and do this:
do.call(rbind, x)
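Since you asked about purrr, here is a map() sketch of the same idea (assuming df and df_na as defined above); bind_rows() then collapses the list into one data frame:
library(purrr)
df_comb <- df %>%
  split(.$group) %>%                                  # one tibble per group
  map(~ bind_rows(filter(df_na, id %in% .x$id), .x))  # NA rows first, then the data
bind_rows(df_comb)  # single combined data frame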
Let's say I have a table like so:
df <- data.frame("Group" = c("A","A","A","B","B","B","C","C","C"),
"Num" = c(1,2,3,1,2,NA,NA,NA,NA))
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
7 C NA
8 C NA
9 C NA
In this case, because group C has Num as NA for all entries, I would like to remove rows in group C from the table. Any help is appreciated!
You could group_by your Group column and filter out the groups in which all values are NA. You can use the following code:
library(dplyr)
df %>%
group_by(Group) %>%
filter(!all(is.na(Num)))
#> # A tibble: 6 × 2
#> # Groups: Group [2]
#> Group Num
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B NA
Created on 2023-01-18 with reprex v2.0.2
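Note that the result stays grouped (# Groups: Group [2] above). With dplyr 1.1.0 or later you could instead use the per-operation .by argument, which keeps the same rows without the persistent grouping (a sketch on the same data):
df %>%
  filter(!all(is.na(Num)), .by = Group)  # .by groups only for this call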
In base R you could index based on all the groups that have at least one non-NA value:
idx <- df$Group %in% unique(df[!is.na(df$Num),"Group"])
idx
df[idx,]
# or in one line
df[df$Group %in% unique(df[!is.na(df$Num),"Group"]),]
Output:
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
Using ave to flag, per group, whether every Num is NA, and negating that flag to index:
df[with(df, !ave(Num, Group, FUN=\(x) all(is.na(x)))), ]
# Group Num
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 1
# 5 B 2
# 6 B NA
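For intuition, the inner ave() call returns one value per row, recycled from its group: all(is.na(x)) is TRUE only for group C, and because Num is numeric the logical gets coerced to 0/1 (a sketch of the intermediate result, given df as above):
with(df, ave(Num, Group, FUN = \(x) all(is.na(x))))
#> [1] 0 0 0 0 0 0 1 1 1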
I have two columns of data and I want to subtract one from the previous row's entry of the other. The lag() function in dplyr is perfect for this, but unfortunately I have NAs in the second column of data. This means that any row whose previous row has an NA in the second column ends up with an NA in the new value (see the table below):
Value 1  Value 2  New Value
     NA        2         NA
     13        3         11
      6       NA          3
      6        4         NA
      5        4          1
Is there a way to omit the NAs and have it use the last actual value in the second column instead? Something like an na.rm = TRUE that can be added? The result would look like:
Value 1  Value 2  New Value
     NA        2         NA
     13        3         11
      6       NA          3
      6        4          3
      5        4          1
One option would be to mutate Value 2 beforehand and fill the NA with the last non-NA value:
df %>%
tidyr::fill(`Value 2`, .direction = "down")
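Putting it together as a full pipeline (a minimal sketch; it assumes the columns are named as in the table, and that New Value is Value 1 minus the previous row's Value 2, as the table implies):
library(dplyr)
library(tidyr)
df %>%
  fill(`Value 2`, .direction = "down") %>%           # carry the last non-NA value forward
  mutate(`New Value` = `Value 1` - lag(`Value 2`))   # subtract the previous row's (filled) value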
Alternatively, after computing New.Value you can fill its NAs with the previous value, using fill() from the tidyr package.
library(dplyr)
library(tidyr)
df <- tibble::tribble(
  ~Value.1, ~Value.2, ~New.Value,
        NA,        2,         NA,
        13,        3,         11,
         6,       NA,          3,
         6,        4,         NA,
         5,        4,          1,
)
df
df
#> # A tibble: 5 × 3
#> Value.1 Value.2 New.Value
#> <dbl> <dbl> <dbl>
#> 1 NA 2 NA
#> 2 13 3 11
#> 3 6 NA 3
#> 4 6 4 NA
#> 5 5 4 1
df %>%
fill(New.Value, .direction = "down")
#> # A tibble: 5 × 3
#> Value.1 Value.2 New.Value
#> <dbl> <dbl> <dbl>
#> 1 NA 2 NA
#> 2 13 3 11
#> 3 6 NA 3
#> 4 6 4 3
#> 5 5 4 1
Created on 2022-07-08 by the reprex package (v2.0.1)
This is a variation of the last-observation-carried-forward problem in a vector with some missing values. Instead of filling in NA values with the last non-NA observation, I would like to fill in NA values with the highest value among the 4 observations preceding it. If all 4 preceding observations are also NA, the NA should be retained. I would also appreciate it if this can be done by group in a data frame/data table.
Example:
Original DF:
ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA
Output DF:
ID Week Value
a 1 5
a 2 1
a 3 5
a 4 5
a 5 3
a 6 4
a 7 4
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 1
lag shifts the column by n steps and lets you peek at previous values. pmax is the element-wise maximum and lets you pick the highest value for each row across a set of vectors.
To abstract away the notion of 4 and maintain vectorized performance, you may use quasiquotation from rlang: http://dplyr.tidyverse.org/articles/programming.html#quasiquotation
It can look a little cryptic at first, but it is very precise and expressive.
df <- readr::read_table(
" ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA")
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df %>%
group_by(ID) %>%
mutate(
Value = if_else(is.na(Value), pmax(lag(Value, 1), lag(Value, 2), lag(Value, 3), lag(Value, 4), na.rm = TRUE), Value)
)
#> # A tibble: 14 x 3
#> # Groups: ID [2]
#> ID Week Value
#> <chr> <int> <int>
#> 1 a 1 5
#> 2 a 2 1
#> 3 a 3 5
#> 4 a 4 5
#> 5 a 5 3
#> 6 a 6 4
#> 7 a 7 4
#> 8 b 1 NA
#> 9 b 2 NA
#> 10 b 3 NA
#> 11 b 4 NA
#> 12 b 5 NA
#> 13 b 6 1
#> 14 b 7 1
# or if you are an rlang ninja
library(purrr)
pmax_lag_n <- function(column, n) {
  column <- enquo(column)
  # build the expressions lag(column, 1), ..., lag(column, n),
  # then splice them all into a single pmax() call
  1:n %>%
    map(~quo(lag(!!column, !!.x))) %>%
    { quo(pmax(!!!., na.rm = TRUE)) }
}
df %>%
group_by(ID) %>%
mutate(Value = if_else(is.na(Value), !!pmax_lag_n(Value, 4), Value))
#> # A tibble: 14 x 3
#> # Groups: ID [2]
#> ID Week Value
#> <chr> <int> <int>
#> 1 a 1 5
#> 2 a 2 1
#> 3 a 3 5
#> 4 a 4 5
#> 5 a 5 3
#> 6 a 6 4
#> 7 a 7 4
#> 8 b 1 NA
#> 9 b 2 NA
#> 10 b 3 NA
#> 11 b 4 NA
#> 12 b 5 NA
#> 13 b 6 1
#> 14 b 7 1
Define function Max which accepts a vector x and returns NA if all its elements are NA. Otherwise, if the last value is NA it returns the maximum of all non-NA elements and if the last value is not NA then it returns it.
Also define na.max which runs Max on a rolling window of length n (given by the second argument to na.max -- default 5).
Finally apply na.max to Value by ID using ave.
library(zoo)
Max <- function(x) {
last <- tail(x, 1)
if (all(is.na(x))) NA
else if (is.na(last)) max(x, na.rm = TRUE)
else last
}
na.max <- function(x, n = 5) rollapplyr(x, n, Max, partial = TRUE)
transform(DF, Value = ave(Value, ID, FUN = na.max))
giving:
ID Week Value
1 a 1 5
2 a 2 1
3 a 3 5
4 a 4 5
5 a 5 3
6 a 6 4
7 a 7 4
8 b 1 NA
9 b 2 NA
10 b 3 NA
11 b 4 NA
12 b 5 NA
13 b 6 1
14 b 7 1
Note: Input DF in reproducible form:
Lines <- "
ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA"
DF <- read.table(text = Lines, header = TRUE)
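The same rolling helper also drops into a grouped dplyr pipeline, if you prefer that over ave (a sketch assuming Max and na.max as defined above):
library(dplyr)
DF %>%
  group_by(ID) %>%
  mutate(Value = na.max(Value)) %>%
  ungroup()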
I have just merged two data frames and am trying to remove rows with an NA in the value column. As of now I have:
t.dup.rights <- merge(dataframe1, dataframe2, by = "ID", all = T)
na.omit(t.dup.rights$value)
The value column has either a 1 or NA. I used
is.numeric(t.dup.rights$value)
to double-check that R doesn't think it's a factor. After I run the na.omit function, the data frame appears to remain unchanged. I am working with a particularly large data set (200K obs). I am also using the dplyr package.
You can use tidyr::drop_na
library(dplyr)
library(tidyr)
df <- tibble(
  x = sample(c(NA, 1, 2), 10, replace = TRUE),
  y = sample(c(NA, 1, 2), 10, replace = TRUE)
)
df
#> # A tibble: 10 x 2
#> x y
#> <dbl> <dbl>
#> 1 NA NA
#> 2 1 2
#> 3 2 2
#> 4 NA 1
#> 5 2 NA
#> 6 1 NA
#> 7 NA NA
#> 8 2 2
#> 9 1 NA
#> 10 NA 2
drop_na(df, x)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#> 3 2 NA
#> 4 1 NA
#> 5 2 2
#> 6 1 NA
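For completeness, the reason your original attempt appeared to do nothing: na.omit() returns a new, filtered object rather than modifying its argument in place, and calling it on the column alone never touches the data frame. A base R sketch using the names from your question:
# assign the filtered result back; na.omit() does not modify in place
t.dup.rights <- t.dup.rights[!is.na(t.dup.rights$value), ]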
When I use filter from the dplyr package to drop a level of a factor variable, filter also drops the NA values. Here's an example:
library(dplyr)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
# var1
# 1 <NA>
# 2 3
# 3 3
# 4 1
# 5 1
# 6 <NA>
# 7 2
# 8 2
# 9 <NA>
# 10 1
filter(dat, var1 != 1)
# var1
# 1 3
# 2 3
# 3 2
# 4 2
This does not seem ideal -- I only wanted to drop rows where var1 == 1.
It looks like this is occurring because any comparison with NA returns NA, which filter then drops. So, for example, filter(dat, !(var1 %in% 1)) produces the correct results. But is there a way to tell filter not to drop the NA values?
You could use this:
filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>
And it won't drop the NA rows.
Also, just for completeness: dropping NAs is the intended behavior of filter, as you can see from the following:
test_that("filter discards NA", {
temp <- data.frame(
i = 1:5,
x = c(NA, 1L, 1L, 0L, 0L)
)
res <- filter(temp, x == 1)
expect_equal(nrow(res), 2L)
})
The test above was taken from the tests for filter on GitHub.
The answers given previously are good, but when your filter statement involves a function of many fields, the workaround might not be so great. Also, who wants to mapply the non-vectorized identical()? Here is another, somewhat simpler solution using coalesce:
filter(dat, coalesce( var1 != 1, TRUE))
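Why this works: the comparison var1 != 1 yields NA for NA inputs, and coalesce() replaces those NAs with TRUE, so the NA rows survive the filter. A quick illustration on a bare vector:
library(dplyr)
x <- c(NA, 3, 1)
x != 1
#> [1]    NA  TRUE FALSE
coalesce(x != 1, TRUE)
#> [1]  TRUE  TRUE FALSE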
I often map identical with mapply...
(Note: I believe that because of changes to sample() in R 3.6.0, set.seed and sample end up producing different test data here than in the original question.)
library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
#> var1
#> 1 3
#> 2 1
#> 3 <NA>
#> 4 3
#> 5 1
#> 6 3
#> 7 2
#> 8 3
#> 9 2
#> 10 1
filter(dat, var1 != 1)
#> var1
#> 1 3
#> 2 3
#> 3 3
#> 4 2
#> 5 3
#> 6 2
filter(dat, !mapply(identical, as.numeric(var1), 1))
#> var1
#> 1 3
#> 2 <NA>
#> 3 3
#> 4 3
#> 5 2
#> 6 3
#> 7 2
It works for numerics and strings as well (probably the more common use case)...
library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = sample(c(1:3, NA), size = 10, replace = T),
var2 = letters[sample(c(1:3, NA), size = 10, replace = T)],
stringsAsFactors = FALSE))
#> var1 var2
#> 1 3 <NA>
#> 2 1 a
#> 3 NA a
#> 4 3 b
#> 5 1 b
#> 6 3 <NA>
#> 7 2 a
#> 8 3 c
#> 9 2 <NA>
#> 10 1 b
filter(dat, !mapply(identical, var1, 1L))
#> var1 var2
#> 1 3 <NA>
#> 2 NA a
#> 3 3 b
#> 4 3 <NA>
#> 5 2 a
#> 6 3 c
#> 7 2 <NA>
filter(dat, !mapply(identical, var2, 'a'))
#> var1 var2
#> 1 3 <NA>
#> 2 3 b
#> 3 1 b
#> 4 3 <NA>
#> 5 3 c
#> 6 2 <NA>
#> 7 1 b