Remove a group if any member contains NA in R

How can I remove an entire group if one of its values is NA? For example, remove category B because it contains NA.
library(dplyr)
tbl = tibble(category = c("A", "A", "B", "B"),
values = c(2, 3, 1, NA))

We can use filter after grouping by 'category'
library(dplyr)
tbl %>%
group_by(category) %>%
filter(!any(is.na(values))) %>%
ungroup
-output
# A tibble: 2 x 2
category values
<chr> <dbl>
1 A 2
2 A 3

tbl %>%
filter(!category %in% category[is.na(values)])
Output
category values
<chr> <dbl>
1 A 2
2 A 3

tbl %>%
group_by(category) %>%
filter(all(!is.na(values)))
category values
<chr> <dbl>
1 A 2
2 A 3

You can get the categories that have at least one NA value and exclude them.
subset(tbl, !category %in% unique(category[is.na(values)]))
# category values
# <chr> <dbl>
#1 A 2
#2 A 3
If you prefer dplyr::filter:
library(dplyr)
tbl %>% filter(!category %in% unique(category[is.na(values)]))
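For readers without dplyr, the same "find the groups with an NA, then exclude them" idea can be sketched in base R. This is a minimal illustration, not part of the original answers; the names has_na, bad and clean are made up for the example.

```r
# Same toy data as in the question, as a plain data frame
tbl <- data.frame(category = c("A", "A", "B", "B"),
                  values = c(2, 3, 1, NA))

# One logical per category: TRUE if that category contains any NA
has_na <- tapply(is.na(tbl$values), tbl$category, any)

# Names of the categories to drop
bad <- names(has_na)[has_na]

# Keep only rows whose category is not in the "bad" set
clean <- tbl[!tbl$category %in% bad, ]
print(clean)
```

The tapply() step plays the role of group_by() plus any(), and the final subsetting step mirrors the %in% filters shown above.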

Related

How to sum n highest values by row using dplyr without reshaping?

I would like to create a new column based on the n highest values per row of a data frame.
Take the following example:
library(tibble)
df <- tribble(~name, ~q_1, ~q_2, ~q_3, ~sum_top_2,
"a", 4, 1, 5, 9,
"b", 2, 8, 9, 17)
Here, the sum_top_2 column sums the 2 highest values of columns prefixed with "q_". I would like to generalize to the n highest values by row. How can I do this using dplyr without reshaping?
One option is pmap from purrr, which loops over the rows of the columns selected with starts_with('q_'). For each row, sort the values in decreasing order, take the first n elements with head, and sum them.
library(dplyr)
library(purrr)
library(stringr)
n <- 2
df %>%
mutate(!! str_c("sum_top_", n) := pmap_dbl(select(cur_data(),
starts_with('q_')),
~ sum(head(sort(c(...), decreasing = TRUE), n))))
-output
# A tibble: 2 x 5
name q_1 q_2 q_3 sum_top_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 1 5 9
2 b 2 8 9 17
Or use rowwise from dplyr.
df %>%
rowwise %>%
mutate(!! str_c("sum_top_", n) := sum(head(sort(c_across(starts_with("q_")),
decreasing = TRUE), n))) %>%
ungroup
# A tibble: 2 x 5
name q_1 q_2 q_3 sum_top_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 1 5 9
2 b 2 8 9 17
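As a cross-check on the two tidyverse answers, the same row-wise "sort, take the top n, sum" logic can be sketched with base R's apply(). This is an assumption-laden illustration rather than one of the original answers: q_cols is a name introduced here, and the data frame is rebuilt without the precomputed sum_top_2 column.

```r
# Input data without the target column
df <- data.frame(name = c("a", "b"),
                 q_1 = c(4, 2),
                 q_2 = c(1, 8),
                 q_3 = c(5, 9))
n <- 2

# Columns whose names start with "q_"
q_cols <- grep("^q_", names(df), value = TRUE)

# For each row: sort descending, keep the n largest values, sum them
df[[paste0("sum_top_", n)]] <- apply(
  df[q_cols], 1,
  function(row) sum(head(sort(row, decreasing = TRUE), n))
)
print(df)
```

Changing n changes both the number of values summed and the name of the new column, as in the answers above.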

Using mutate with row and column indexing and group by

I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1)))
I want to create a variable 'baseline' that, within each group, takes the content of variable 'value' where time == 1. The desired output would be
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1),
baseline = c(1,1,3,3)))
Tried to run the following code with indexing but am clearly going wrong somewhere
x <- data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1)))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
If the data is already ordered by 'time', simply use first
df1 %>%
group_by(group) %>%
mutate(baseline = first(value))
data
df1 <- data.frame(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
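Both answers above rely on time == 1 being the earliest time point in each group. If you would rather match the condition in the question literally, a minimal sketch is to index value by time == 1 inside the grouped mutate. This assumes at most one row per group has time == 1; the trailing [1] makes the result NA for a group that has no such row.

```r
library(dplyr)

df1 <- data.frame(group = c("a", "a", "b", "b"),
                  time = c(1, 2, 1, 2),
                  value = seq(1, 4, 1))

res <- df1 %>%
  group_by(group) %>%
  # Take the value on the row where time == 1 within each group;
  # [1] yields NA if the group has no row with time == 1
  mutate(baseline = value[time == 1][1]) %>%
  ungroup()
print(res)
```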

R code to identify the same characters in a name and combine the column values

I have a Candidate column, and there are various instances where the Candidate names match but are in different rows, as they differ by lower/upper case, space or symbols. I want to look for all these duplicated rows with the same candidate name, and make it one row. For example: here A.s. Hopingson and A.S. Hopingson are in two different rows, so from the A.S. Hopingson I want to copy three columns to A.s. Hopingson and remove the row with the candidate A.S. Hopingson.
I think that the main issue is to decide which value to keep in case you have different values in separate rows for the same candidate (although this is not the case in your example). Both my suggestions use tidyverse.
The first solution that comes to mind is to select the row with the fewest missing values as your valid entry:
library(tidyverse)
(data <- tibble(name = c("A.S. Hopingson", "a.S. Hopingson", "A.S.Hopingson"),
var_1 = c(1, 2, NA),
var_2 = c(1, NA, NA)))
# A tibble: 3 x 3
name var_1 var_2
<chr> <dbl> <dbl>
1 A.S. Hopingson 1 1
2 a.S. Hopingson 2 NA
3 A.S.Hopingson NA NA
data %>%
mutate(name_new = str_replace_all(str_to_lower(name), "[^[:alpha:]]", ""), # create name that consists only of letters
missing = rowSums(is.na(.))) %>% # count missing values per row
arrange(name_new, missing) %>% # arrange by missing values
group_by(name_new) %>%
filter(row_number() == 1) %>% # filter for first row (i.e. the one with the lowest number of missing values)
ungroup() %>%
select(-name_new, -missing)
# A tibble: 1 x 3
name var_1 var_2
<chr> <dbl> <dbl>
1 A.S. Hopingson 1 1
As you can see, the value 2 for var_1[2] is dropped.
Alternatively, you can select the first non-missing value for each variable and combine these rows:
(data <- tibble(name = c("A.S. Hopingson", "a.S. Hopingson", "A.S.Hopingson"),
var_1 = c(1, 1, NA),
var_2 = c(NA, 2, NA)))
# A tibble: 3 x 3
name var_1 var_2
<chr> <dbl> <dbl>
1 A.S. Hopingson 1 NA
2 a.S. Hopingson 1 2
3 A.S.Hopingson NA NA
data %>%
mutate(name_new = str_replace_all(str_to_lower(name), "[^[:alpha:]]", "")) %>%
select(-name) %>%
gather(key, value, -name_new) %>%
unique() %>%
filter(!is.na(value)) %>%
group_by(name_new, key) %>%
filter(row_number() == 1) %>%
ungroup() %>%
spread(key, value)
# A tibble: 1 x 3
name_new var_1 var_2
<chr> <dbl> <dbl>
1 ashopingson 1 2
In case you need entries from both rows (e.g. because they are both correct), you would have to nest your data or combine the values into a single row.
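The "first non-missing value per variable" idea can also be sketched without reshaping, by summarising each variable within the normalized name. This is a hedged alternative to the gather/spread version above, not the original author's code; it assumes the variables to merge all start with "var_", and it returns NA when every value for a candidate is missing.

```r
library(dplyr)

data <- tibble(name = c("A.S. Hopingson", "a.S. Hopingson", "A.S.Hopingson"),
               var_1 = c(1, 1, NA),
               var_2 = c(NA, 2, NA))

res <- data %>%
  # Normalized key: lower case, letters only
  mutate(name_new = gsub("[^[:alpha:]]", "", tolower(name))) %>%
  group_by(name_new) %>%
  # For each variable, keep the first non-missing value in the group
  summarise(across(starts_with("var_"), ~ .x[!is.na(.x)][1]))
print(res)
```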

Using any() or all() with is.na() over multiple columns

I'd like to drop rows from my dataset that are all NAs (AKA keep rows with any non-NAs) for a list of columns. How could I update this code so that x & y are supplied as a vector? This would enable me to flexibly add and drop columns for inspection.
library(dplyr)
ds <-
tibble(
id = c(1:4),
x = c(NA, 1, NA, 4),
y = c(NA, NA , 3, 4)
)
ds %>%
rowwise() %>%
filter(
any(
!is.na(x),
!is.na(y)
)
) %>%
ungroup()
I'm trying to write something like any(!is.na(c(x,y))) but I'm not sure how to supply multiple arguments to is.na().
We can use filter_at with any_vars
ds %>%
filter_at(vars(x:y), any_vars(!is.na(.)))
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 2 1 NA
#2 3 NA 3
#3 4 4 4
-Update - Feb 7 2022
In newer versions of dplyr (as @GitHunter0 suggested), we can use if_all/if_any or across
ds %>%
filter(if_any(x:y, complete.cases))
# A tibble: 3 × 3
id x y
<int> <dbl> <dbl>
1 2 1 NA
2 3 NA 3
3 4 4 4
You can also use ds %>% filter(!if_all(x:y, is.na)).
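Since the question asks for the columns to be supplied as a vector, here is a minimal sketch that combines if_any with all_of and a character vector. The variable name cols is introduced only for this example.

```r
library(dplyr)

ds <- tibble(id = c(1:4),
             x = c(NA, 1, NA, 4),
             y = c(NA, NA, 3, 4))

# Columns to inspect, supplied as a character vector
cols <- c("x", "y")

# Keep rows where at least one of the listed columns is not NA
kept <- ds %>%
  filter(if_any(all_of(cols), ~ !is.na(.x)))
print(kept)
```

Adding or dropping a column from the check then only requires editing cols.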

Using tidyr complete() with column names specified in variables

I am having trouble using the tidyr::complete() function with column names as variables.
The built-in example works as expected:
df <- data_frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name))
However, when I try to provide the column names as character strings, it produces an error.
gr="group"
id="item_id"
name="item_name"
df %>% complete_(gr, nesting_(id, name),fill = list(NA))
Even more simply, df %>% complete(!!!syms(gr), nesting(!!!syms(id), !!!syms(name))) now works as of tidyr 1.0.2.
I think it's a bug that complete_ can't work with data.frames or list columns like complete can, but here's a workaround using unite_ and separate to simulate nesting:
df %>% unite_('id_name', c(id, name)) %>%
complete_(c(gr, 'id_name')) %>%
separate(id_name, c(id, name))
## # A tibble: 4 × 5
## group item_id item_name value1 value2
## * <dbl> <chr> <chr> <int> <int>
## 1 1 1 a 1 4
## 2 1 2 b 3 6
## 3 2 1 a NA NA
## 4 2 2 b 2 5
Now that tidyr has adopted tidy evaluation, the underscore variants (i.e. complete_) have been deprecated since their behavior can be handled by the standard variants (complete).
However, complete, crossing and nesting use data-masking, so the way to convert variables into names is via the .data[[var]] pronoun (per the docs), so your case becomes:
suppressPackageStartupMessages(
library(tidyr)
)
df <- data.frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
gr <- "group"
id <- "item_id"
name <- "item_name"
df %>% complete(
.data[[gr]],
nesting(.data[[id]],
.data[[name]])
)
#> # A tibble: 4 x 5
#> group item_id item_name value1 value2
#> <dbl> <dbl> <fct> <int> <int>
#> 1 1 1 a 1 4
#> 2 1 2 b 3 6
#> 3 2 1 a NA NA
#> 4 2 2 b 2 5
Created on 2020-02-28 by the reprex package (v0.3.0)
Not very elegant, but it gets the job done.
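The sym()/syms() unquoting mentioned earlier can be wrapped in a small helper so that the column names are passed as strings. This is only a sketch under the assumption that tidyr and rlang are installed; complete_by_name is a hypothetical helper written for this example, not a tidyr function.

```r
library(tidyr)
library(rlang)

df <- data.frame(group = c(1:2, 1),
                 item_id = c(1:2, 2),
                 item_name = c("a", "b", "b"),
                 value1 = 1:3,
                 value2 = 4:6)

# Hypothetical helper: group_col is a single column name,
# nest_cols is a character vector of columns to nest together
complete_by_name <- function(data, group_col, nest_cols) {
  complete(data, !!sym(group_col), nesting(!!!syms(nest_cols)))
}

out <- complete_by_name(df, "group", c("item_id", "item_name"))
print(out)
```

Because the strings are converted to symbols before complete() sees them, the output columns keep their original names.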
