Remove NA by id on stacked paired observations - r

I have a df with stacked paired (time 1, time 2) observations (subject = id) of variables (v1,v2)
df <- tribble(
~id, ~time, ~v1, ~v2,
1, 1, NA, 7,
2, 1, 3, 7,
3, 1, 2, 6,
1, 2, 4, 5,
2, 2, 3, NA,
3, 2, 7, 6
)
For the paired analysis, I need to drop all ids that have NA in either time. In the example above I would be left with only id "3". How can I achieve this? (dplyr if possible.)
Thanks

Another possible solution:
library(dplyr)
df %>% group_by(id) %>% filter(!any(is.na(v1+v2))) %>% ungroup
#> # A tibble: 2 × 4
#> id time v1 v2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 1 2 6
#> 2 3 2 7 6

We can use complete.cases and all to return only the groups that contain no NAs in v1 or v2.
library(dplyr)
df %>%
group_by(id) %>%
filter(all(complete.cases(v1, v2)))
Output
id time v1 v2
<dbl> <dbl> <dbl> <dbl>
1 3 1 2 6
2 3 2 7 6
If you have a lot more columns that start with v, then we could use c_across to specify the columns in starts_with.
df %>%
group_by(id) %>%
filter(all(complete.cases(c_across(starts_with("v")))))

Use subset from base R: get the ids where 'v1' is NA (id[is.na(v1)]), test the original 'id' column against them (id %in% ..), and negate (!) to keep only the ids that have no NA in 'v1'
subset(df, !id %in% id[is.na(v1)])
Or with filter from dplyr
library(dplyr)
filter(df, !id %in% id[is.na(v1)])
# A tibble: 2 × 3
id time v1
<dbl> <dbl> <dbl>
1 3 1 2
2 3 2 7
Update
Based on the new data, we can use if_any
df %>%
group_by(id) %>%
filter(all(!if_any(v1:v2, is.na))) %>%
ungroup
Output
# A tibble: 2 × 4
id time v1 v2
<dbl> <dbl> <dbl> <dbl>
1 3 1 2 6
2 3 2 7 6
Or with if_all
df %>%
group_by(id) %>%
filter(all(if_all(v1:v2, complete.cases))) %>%
ungroup
# A tibble: 2 × 4
id time v1 v2
<dbl> <dbl> <dbl> <dbl>
1 3 1 2 6
2 3 2 7 6
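The v1-only subset idea above can also be extended to both columns in base R. This is a minimal sketch, assuming the example data from the question rebuilt as a plain data.frame; bad_ids and result are illustrative names that do not appear in the original answers:

```r
# Example data from the question, rebuilt as a base data.frame
df <- data.frame(
  id   = c(1, 2, 3, 1, 2, 3),
  time = c(1, 1, 1, 2, 2, 2),
  v1   = c(NA, 3, 2, 4, 3, 7),
  v2   = c(7, 7, 6, 5, NA, 6)
)

# ids that have at least one incomplete row in v1 or v2 (hypothetical name)
bad_ids <- unique(df$id[!complete.cases(df[c("v1", "v2")])])

# keep only the rows whose id never shows up among the incomplete ones
result <- df[!df$id %in% bad_ids, ]
result
```

With this data only the two rows for id 3 should remain, matching the dplyr answers.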

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4)
)
I want to find the groups in which every element of x (and only x) is NA, then remove those groups. "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but that seems to drop a group as soon as it finds a single NA; all is not the right condition here, and I need the correct one.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
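To see why the condition from the question is too strict, it may help to compare the candidate conditions group by group. This is a small base R sketch using the x and A values from the question; the names no_na, some_val and some_val2 are made up for illustration:

```r
x <- c(NA, NA, NA, 1)
A <- c("A", "A", "B", "B")

# TRUE only when a group contains no NA at all (the condition tried in the question)
no_na <- tapply(x, A, function(v) all(!is.na(v)))

# TRUE when a group is not entirely NA (the condition the question asks for)
some_val <- tapply(x, A, function(v) !all(is.na(v)))

# Equivalent formulation: at least one value is not NA
some_val2 <- tapply(x, A, function(v) any(!is.na(v)))

no_na      # A: FALSE, B: FALSE (B is dropped because of its single NA)
some_val   # A: FALSE, B: TRUE
```

The last two conditions are the same test written two ways, so either can go inside filter.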
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr 1.1.0 and later (the devel version when this was written), we can use .by in filter, so we don't need group_by/ungroup
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4

How to transform a tibble from one column to two columns with repeated observations

I tried to transform df into df2. I did it in a very patchy way using df3. Is there a simpler and more elegant way of doing it?
library(tidyverse)
# I want to transform df
df <- tibble(id = c(1, 2, 1, 2, 1, 2),
time = c('t1', 't1', 't2', 't2', 't3', 't3'),
value = c(2, 3, 6, 4, 5, 7))
df
#> # A tibble: 6 x 3
#> id time value
#> <dbl> <chr> <dbl>
#> 1 1 t1 2
#> 2 2 t1 3
#> 3 1 t2 6
#> 4 2 t2 4
#> 5 1 t3 5
#> 6 2 t3 7
# into df2
df2 <- tibble(id = c(1, 2, 1, 2),
t = c(2, 3, 6, 4),
r = c(6, 4, 5, 7))
df2
#> # A tibble: 4 x 3
#> id t r
#> <dbl> <dbl> <dbl>
#> 1 1 2 6
#> 2 2 3 4
#> 3 1 6 5
#> 4 2 4 7
# This is how I did it, but I think there should be a better way
df3 <- df %>% pivot_wider(names_from = time, values_from = value)
b <- tibble(id = numeric(), t = numeric(), r = numeric())
for (i in 2:3){
a <- df3[,c(1,i,i+1)]
colnames(a) <- c('id', 't', 'r')
b <- bind_rows(a, b)
}
b
#> # A tibble: 4 x 3
#> id t r
#> <dbl> <dbl> <dbl>
#> 1 1 6 5
#> 2 2 4 7
#> 3 1 2 6
#> 4 2 3 4
Created on 2020-11-25 by the reprex package (v0.3.0)
For each id, you can use lead to take the next value as the r column, then drop the rows where it is NA.
library(dplyr)
df %>%
group_by(id) %>%
mutate(t = value,
r = lead(value)) %>%
na.omit() %>%
select(id, t, r)
# id t r
# <dbl> <dbl> <dbl>
#1 1 2 6
#2 2 3 4
#3 1 6 5
#4 2 4 7
We can use summarise from dplyr >= 1.0. Earlier versions required exactly one row per group; from 1.0 on, a summary can return any number of rows, so the result can be shorter or longer than the input. (dplyr 1.1.0 later deprecated multi-row results from summarise in favour of reframe.)
library(dplyr)
df %>%
group_by(id) %>%
summarise(t = value[-n()], r = value[-1], .groups = 'drop')
Output
# A tibble: 4 x 3
# id t r
# <dbl> <dbl> <dbl>
#1 1 2 6
#2 1 6 5
#3 2 3 4
#4 2 4 7
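On newer dplyr the same pairing can probably be written with reframe, which is meant for results with any number of rows per group. This is a sketch, assuming dplyr >= 1.1.0 is installed; out is an illustrative name:

```r
library(dplyr)

df <- tibble(id = c(1, 2, 1, 2, 1, 2),
             time = c('t1', 't1', 't2', 't2', 't3', 't3'),
             value = c(2, 3, 6, 4, 5, 7))

# For each id, pair every value (except the last) with the value that follows it
out <- df %>%
  reframe(t = value[-n()],   # all but the last value
          r = value[-1],     # all but the first value
          .by = id)
out
```

reframe always returns an ungrouped result, so no .groups argument is needed.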

Omitting columns instead of dropping them in purrr

I need to calculate an index for multiple lists. However, I can only do this if I drop some columns (here represented by "w" and "x"). For ex.
library(tidyverse)
lists<- list(
l1=tribble(
~w, ~x, ~y, ~z,
#--|--|--|----
12, "a", 2, 1,
12, "a",5, 3,
12, "a",6, 2),
l2=tribble(
~w, ~x, ~y, ~z,
#--|--|--|----
13,"b", 5, 7,
13,"b", 4, 6,
13,"b", 3, 2))
lists %>%
map(~ .x %>%
#group_by(w,x) %>%
select(-w,-x) %>%
mutate(row_sums = rowSums(.)))
Instead of dropping those columns I would like to keep/omit them and calculate the index only for "y" and "z".
I manage to do this by first extracting those columns and binding them again afterward. For ex.
select.col<-lists %>%
map_dfr(~ .x %>%
select(w,x))
lists %>%
map_dfr(~ .x %>%
select(-w,-x) %>%
mutate(row_sums = rowSums(.))) %>%
bind_cols(select.col)
However, this is not so elegant, and I had to bind the lists together (map_dfr), whereas I would like to keep them as a list.
Probably, another approach would be to use select_if(., is.numeric), but as I have some numeric columns I need to omit, I'm not sure whether this is the best option.
I'm certain there is a simple solution to this problem. Can anyone take a look at it?
Instead of dropping the columns, you can select the columns for which you want to take the sum.
You can select by name
library(dplyr)
library(purrr)
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[c("y", "z")])))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or also by position of columns
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[3:4])))
Here is a tidyverse approach to get the row sums
library(tidyverse)
lists %>%
map(~ .x %>%
mutate(row_sums = select(., y:z) %>%
reduce(`+`)))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or using base R
lapply(lists, transform, row_sums = y + z)
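Another option is to name the columns inside across, which leaves the other numeric column (w) out of the sum without dropping it. This is a sketch, assuming dplyr >= 1.0; the data is the question's, written with tibble() instead of tribble(), and out is an illustrative name:

```r
library(dplyr)

lists <- list(
  l1 = tibble(w = 12, x = "a", y = c(2, 5, 6), z = c(1, 3, 2)),
  l2 = tibble(w = 13, x = "b", y = c(5, 4, 3), z = c(7, 6, 2))
)

# Sum only y and z; w and x stay in each tibble untouched,
# and the result is still a list
out <- lapply(lists, function(d) {
  mutate(d, row_sums = rowSums(across(c(y, z))))
})
out
```

Because the columns are listed explicitly, this avoids the select_if(., is.numeric) problem of picking up numeric columns that should be left out.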

loop to multiply across columns

I have a data frame with columns labeled sales1, sales2, price1, price2 and I want to calculate revenues by multiplying sales1 * price1 and so-on across each number in an iterative fashion.
data <- tibble::tibble( # data_frame() is deprecated in favour of tibble()
"sales1" = c(1, 2, 3),
"sales2" = c(2, 3, 4),
"price1" = c(3, 2, 2),
"price2" = c(3, 3, 5))
data
# A tibble: 3 x 4
# sales1 sales2 price1 price2
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 3
#2 2 3 2 3
#3 3 4 2 5
Why doesn't the following code work?
data %>%
mutate (
for (i in seq_along(1:2)) {
paste0("revenue",i) = paste0("sales",i) * paste0("price",i)
}
)
Assuming your columns are already ordered (sales1, sales2, price1, price2), we can split the data frame into two parts and multiply them
data[grep("sales", names(data))] * data[grep("price", names(data))]
# sales1 sales2
#1 3 6
#2 4 9
#3 6 20
If the columns are not already sorted by name, we can sort them first with order and then use the command above.
data <- data[order(names(data))]
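As for why the loop in the question fails: paste0("sales", i) only produces a character string such as "sales1", not the column, so the multiplication is between two strings, and a paste0() call cannot be the target of an assignment inside mutate. If an explicit loop is still wanted, one possible sketch looks the columns up by name with [[ ]] (the data is the question's, stored here as a plain data.frame):

```r
data <- data.frame(
  sales1 = c(1, 2, 3),
  sales2 = c(2, 3, 4),
  price1 = c(3, 2, 2),
  price2 = c(3, 3, 5)
)

for (i in 1:2) {
  # Build the column names as strings, then use [[ ]] to read and write the columns
  data[[paste0("revenue", i)]] <-
    data[[paste0("sales", i)]] * data[[paste0("price", i)]]
}
data
```

This adds revenue1 and revenue2 next to the original columns instead of returning only the products.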
This answer is not brief. For that, @RonakShah's existing answer is the one to look at!
My response is intended to address a broader concern regarding the difficulty of trying to do this in the tidyverse. My understanding is this is difficult because the data is not currently in a "tidy" format. Instead, you can create a tidy data frame like so:
library(tidyverse)
tidy_df <- data %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
extract(key, c("variable", "id"), "([a-z]+)([0-9]+)") %>%
spread(variable, value)
Which then makes the final calculation straightforward
tidy_df %>% mutate(revenue = sales * price)
#> # A tibble: 6 x 5
#> rowname id price sales revenue
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 3 1 3
#> 2 1 2 3 2 6
#> 3 2 1 2 2 4
#> 4 2 2 3 3 9
#> 5 3 1 2 3 6
#> 6 3 2 5 4 20
If you need to get the data back into the original format you can, although this feels clunky to me (I'm sure it can be improved in some way).
tidy_df %>% mutate(revenue = sales * price) %>%
gather(key, value, -c(rowname, id)) %>%
unite(key, key, id, sep = "") %>%
spread(key, value) %>%
select(starts_with("price"),
starts_with("sales"),
starts_with("revenue"))
#> # A tibble: 3 x 6
#> price1 price2 sales1 sales2 revenue1 revenue2
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 3 1 2 3 6
#> 2 2 3 2 3 4 9
#> 3 2 5 3 4 6 20
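With a newer tidyr, the gather/extract/spread steps can likely be collapsed into one pivot_longer call. This is a sketch, assuming tidyr >= 1.0; the row column is a helper introduced here to keep track of the original rows:

```r
library(dplyr)
library(tidyr)

data <- tibble(
  sales1 = c(1, 2, 3),
  sales2 = c(2, 3, 4),
  price1 = c(3, 2, 2),
  price2 = c(3, 3, 5)
)

tidy_df <- data %>%
  mutate(row = row_number()) %>%               # remember the original row
  pivot_longer(-row,
               names_to = c(".value", "id"),   # ".value" keeps sales/price as columns
               names_pattern = "([a-z]+)([0-9]+)") %>%
  mutate(revenue = sales * price)
tidy_df
```

The special ".value" name tells pivot_longer that the first captured part of each column name (sales or price) should become a column of its own, while the trailing number goes into id.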

Why do group_by and group_by_ give different answers when summarizing by two variables?

In the following example, I want to create a summary statistic by two variables. When I do it with dplyr::group_by, I get the correct answer, but when I do it with dplyr::group_by_, it summarizes one level more than I want it to.
library(dplyr)
set.seed(919)
df <- data.frame(
a = c(1, 1, 1, 2, 2, 2),
b = c(3, 3, 4, 4, 5, 5),
x = runif(6)
)
# Gives correct answer
df %>%
group_by(a, b) %>%
summarize(total = sum(x))
# Source: local data frame [4 x 3]
# Groups: a [?]
#
# a b total
# <dbl> <dbl> <dbl>
# 1 1 3 1.5214746
# 2 1 4 0.7150204
# 3 2 4 0.1234555
# 4 2 5 0.8208454
# Wrong answer -- too many levels summarized
df %>%
group_by_(c("a", "b")) %>%
summarize(total = sum(x))
# # A tibble: 2 × 2
# a total
# <dbl> <dbl>
# 1 1 2.2364950
# 2 2 0.9443009
What's going on?
If you want to use a vector of variable names, you can pass it to .dots parameter as:
df %>%
group_by_(.dots = c("a", "b")) %>%
summarize(total = sum(x))
#Source: local data frame [4 x 3]
#Groups: a [?]
# a b total
# <dbl> <dbl> <dbl>
#1 1 3 1.5214746
#2 1 4 0.7150204
#3 2 4 0.1234555
#4 2 5 0.8208454
Or you can pass each name as a separate argument, just as you would list the columns in the NSE version:
df %>%
group_by_("a", "b") %>%
summarize(total = sum(x))
#Source: local data frame [4 x 3]
#Groups: a [?]
# a b total
# <dbl> <dbl> <dbl>
#1 1 3 1.5214746
#2 1 4 0.7150204
#3 2 4 0.1234555
#4 2 5 0.8208454
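The underscore verbs such as group_by_ have been deprecated since dplyr 0.7. In current dplyr, a character vector of names can be passed with across and all_of instead. This is a sketch, assuming dplyr >= 1.0; the x values are fixed numbers rather than runif so the totals are reproducible, and group_vars and out are illustrative names:

```r
library(dplyr)

df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = c(1, 2, 3, 4, 5, 6)   # fixed values instead of runif(6), for a stable result
)

group_vars <- c("a", "b")   # grouping columns stored as strings

out <- df %>%
  group_by(across(all_of(group_vars))) %>%
  summarise(total = sum(x), .groups = "drop")
out
```

This groups by both a and b, giving four rows as in the plain group_by(a, b) call.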
