How to drop rows by condition in R? - r

From this data frame I need to drop all the rows that have TRUE in every column. However, since I need to automate the process, I can't drop them by column name or column index. I need something else.
df1 <- c(TRUE,TRUE,FALSE,TRUE,TRUE)
df2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
df3 <- c(FALSE,TRUE,TRUE,TRUE,TRUE)
df <- data.frame(df1,df2,df3)
df1 df2 df3
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE FALSE TRUE
4 TRUE TRUE TRUE
5 TRUE TRUE TRUE

This should be the fastest solution:
df[!do.call(pmin, df), ]
# df1 df2 df3
# 1 TRUE TRUE FALSE
# 2 TRUE FALSE TRUE
# 3 FALSE FALSE TRUE
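A minimal sketch of why the pmin approach works, assuming every column of the data frame is logical: the parallel minimum of a row is "true" only when every value in that row is TRUE. The name `all_true` is introduced here purely for illustration.

```r
# Rebuild the example data frame from the question
df <- data.frame(
  df1 = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  df2 = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  df3 = c(FALSE, TRUE, TRUE, TRUE, TRUE)
)

# do.call() passes each column to pmin() as a separate argument, so the
# result holds one minimum per row; as.logical() makes the flag explicit
all_true <- as.logical(do.call(pmin, df))

# Keep only the rows that are not TRUE in every column
result <- df[!all_true, ]
```

Because no column names or indexes appear in the call, the same line should work unchanged when columns are added or removed.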

base R:
df[!apply(df, 1, all), ]
# df1 df2 df3
#1 TRUE TRUE FALSE
#2 TRUE FALSE TRUE
#3 FALSE FALSE TRUE
tidyverse:
library(dplyr)
filter(df, !if_all(everything()))
# df1 df2 df3
#1 TRUE TRUE FALSE
#2 TRUE FALSE TRUE
#3 FALSE FALSE TRUE

We can use the rowwise() function from the dplyr package:
library(dplyr)
df |> rowwise() |> filter(!all(c_across(everything()) == TRUE))
output
# A tibble: 3 × 3
# Rowwise:
df1 df2 df3
<lgl> <lgl> <lgl>
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE FALSE TRUE

Related

Tabulating the results of %in%

I have a list of IDs such as:
ids1 <- c(0, 2, 3, 4, 8)
Then I have another list of IDs, such as
ids2 <- c(2, 4, 5, 7, 11)
I would like to produce a data.frame as follows:
ID in out
0 FALSE TRUE
2 TRUE FALSE
3 FALSE TRUE
4 TRUE FALSE
8 FALSE TRUE
That is, for each element in ids1 I would like a row in the output along with 2 columns that indicate whether or not the element in ids1 exists in ids2.
I know I can do things like
ids1[ids1 %in% ids2]
and
ids1[!(ids1 %in% ids2)]
which gives me the TRUE values for each column, but I can't figure out how to make the data.frame from it.
Please note that a base R or tidyverse solution is OK, but not data.table please
Thanks !
Use data.frame itself to construct it. The output of %in% is a logical vector. When we subset with [, it returns only the values at the positions where that vector is TRUE, so to get one flag per ID, store the logical vector directly as a column:
data.frame(ID = ids1, `in` = ids1 %in% ids2,
out = !ids1 %in% ids2, check.names = FALSE)
-output
ID in out
1 0 FALSE TRUE
2 2 TRUE FALSE
3 3 FALSE TRUE
4 4 TRUE FALSE
5 8 FALSE TRUE
Or in tibble
library(tibble)
tibble(ID = ids1, `in` = ids1 %in% ids2, out = !`in`)
# A tibble: 5 x 3
ID `in` out
<dbl> <lgl> <lgl>
1 0 FALSE TRUE
2 2 TRUE FALSE
3 3 FALSE TRUE
4 4 TRUE FALSE
5 8 FALSE TRUE
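The construction above can be wrapped in a small function when it is needed for several pairs of vectors. This is only a sketch: `tabulate_membership` is a hypothetical helper name, not part of base R or the tidyverse.

```r
# Hypothetical helper: one row per element of `x`, flagging whether it
# occurs in `y`
tabulate_membership <- function(x, y) {
  found <- x %in% y  # logical vector, same length as x
  # check.names = FALSE keeps the reserved word `in` as a column name
  data.frame(ID = x, `in` = found, out = !found, check.names = FALSE)
}

ids1 <- c(0, 2, 3, 4, 8)
ids2 <- c(2, 4, 5, 7, 11)
res <- tabulate_membership(ids1, ids2)
```

Because `in` is a reserved word, the column has to be accessed as res[["in"]] or with backticks rather than with a plain res$in.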

How to create a new column and add column names by selected row in R

a<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
b<-c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)
c<-c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
costumer<-c("one","two","three","four","five","six")
df<-data.frame(costumer,a,b,c)
That's some example code. Printed, it looks like this:
costumer a b c
1 one TRUE TRUE TRUE
2 two FALSE FALSE TRUE
3 three TRUE TRUE TRUE
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE
6 six FALSE FALSE FALSE
I want to create a new column df$items that contains only the column names that are TRUE for each row in the data. Something like this:
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE a,c
6 six FALSE FALSE FALSE
I thought of using apply function or use which for selecting indexes, but couldn't figure it out. Can anyone help me?
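# Use only the logical columns: apply() converts a mixed-type data frame to a character matrix first
df$items <- apply(df[c("a", "b", "c")], 1, function(x) paste0(names(x)[x], collapse = ","))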
df
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE a,c
6 six FALSE FALSE FALSE
df$items = apply(df[2:4], 1, function(x) toString(names(df[2:4])[x]))
df
# costumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
You could use
df$items <- apply(df[c("a", "b", "c")], 1, function(x) toString(names(x)[which(x)]))
Output
# costumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
We can use pivot_longer to reshape to 'long' format and then do a group by paste
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = a:c) %>%
group_by(costumer) %>%
summarise(items = toString(name[value])) %>%
left_join(df)
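When the flag columns are not known in advance, they can be picked by type instead of by name or position. The sketch below assumes every logical column should contribute to items; `flag_cols` is a name introduced here for illustration. Restricting apply() to the logical columns also avoids relying on how TRUE and FALSE are rendered once a mixed-type data frame is converted to a character matrix.

```r
df <- data.frame(
  costumer = c("one", "two", "three", "four", "five", "six"),
  a = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  b = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  c = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Find the logical columns by type rather than by position or name
flag_cols <- names(df)[vapply(df, is.logical, logical(1))]

# With only logical columns, apply() works on a logical matrix, so each
# row `x` can index the column names directly
df$items <- apply(df[flag_cols], 1, function(x) toString(flag_cols[x]))
```

Rows with no TRUE values get an empty string, because toString() of an empty character vector is "".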

Using `:=` from rlang to assign column names using lapply function inputs

I am trying to loop across a vector of patterns/strings to both match strings within another column and assign the results to a column of the same name as the pattern being searched. A simple example is below.
I am aware that this example is trivial but it captures the minimal case that produces the error that I cannot resolve.
> library(rlang)
> library(stringr)
> library(dplyr)
> set.seed(5)
> df <- data.frame(
+ groupA = sample(x = LETTERS[1:6], size = 20, replace = TRUE),
+ id_col = 1:20
+ )
>
> mycols <- c('A','C','D')
>
> dfmatches <-
+ lapply(mycols, function(icol) {
+ data.frame(!!icol := grepl(pattern = icol, x = df$groupA))
+ }) %>%
+ cbind.data.frame()
This gives me the error:
Error: `:=` can only be used within a quasiquoted argument
The desired output would be a data.frame like the below:
> dfmatches
A C D
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE FALSE
5 TRUE FALSE FALSE
6 FALSE FALSE FALSE
7 FALSE FALSE TRUE
8 FALSE FALSE FALSE
9 FALSE FALSE FALSE
10 TRUE FALSE FALSE
11 FALSE FALSE FALSE
12 FALSE TRUE FALSE
13 FALSE FALSE FALSE
14 FALSE FALSE TRUE
15 FALSE FALSE FALSE
16 FALSE FALSE FALSE
17 FALSE TRUE FALSE
18 FALSE FALSE FALSE
19 FALSE FALSE TRUE
20 FALSE FALSE FALSE
I've tried multiple variations using {{}} or !! rlang::sym() etc but cannot quite figure out the right syntax for this.
One option is to use map_dfc from purrr. I also think you don't need grepl here, since we are looking for an exact match, not a partial one.
library(dplyr)
library(purrr)
map_dfc(mycols, ~df %>% transmute(!!.x := groupA == .x))
In base R, we can do
setNames(do.call(cbind.data.frame, lapply(mycols,
function(x) df$groupA == x)), mycols)
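The original error arises because data.frame() is a base R function and does not process `!!` or `:=`. A slightly shorter base R sketch relies on sapply() naming its result after the elements of a character vector, so no separate setNames() step is needed; this is an alternative under that assumption, not a change to the answers above.

```r
set.seed(5)
df <- data.frame(
  groupA = sample(x = LETTERS[1:6], size = 20, replace = TRUE),
  id_col = 1:20
)
mycols <- c("A", "C", "D")

# sapply() returns a logical matrix with one column per pattern; because
# `mycols` is a character vector, the columns are named after its elements
dfmatches <- as.data.frame(sapply(mycols, function(x) df$groupA == x))
```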

R - Collapse Data by Grouped Row Observations

I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.
Here's a sample data frame:
id <- c("123","456","789","101","123","587","456","789")
HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE)
DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
df <- data.frame(id,HTN,DM2,TIA)
df
Which comes out to:
> df
id HTN DM2 TIA
1 123 TRUE FALSE TRUE
2 456 FALSE FALSE TRUE
3 789 FALSE TRUE TRUE
4 101 FALSE TRUE TRUE
5 123 FALSE FALSE FALSE
6 587 TRUE TRUE TRUE
7 456 FALSE FALSE TRUE
8 789 FALSE TRUE TRUE
I'd like my output to look like this:
id <- c("101","123","456","587","789")
HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
df2 <- data.frame(id,HTN,DM2,TIA)
df2
id HTN DM2 TIA
1 101 FALSE TRUE TRUE
2 123 TRUE FALSE TRUE
3 456 FALSE FALSE TRUE
4 587 TRUE TRUE TRUE
5 789 FALSE TRUE TRUE
So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problems I'll need to collapse in this way, though, and that much repetitive code just seems like a recipe for an occult error.
df3 <- df %>%
arrange(id) %>%
group_by(id)
Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?
We may use
df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
# id HTN DM2 TIA
# <fct> <lgl> <lgl> <lgl>
# 1 101 FALSE TRUE TRUE
# 2 123 TRUE FALSE TRUE
# 3 456 FALSE FALSE TRUE
# 4 587 TRUE TRUE TRUE
# 5 789 FALSE TRUE TRUE
Here we first group by id, as you suggested. Then we summarize every remaining variable with the function any: it takes a logical vector (e.g., HTN for patient 101) and returns TRUE if any of the rows is TRUE, and FALSE otherwise.
A base R option would be
aggregate(.~ id, df, any)
# id HTN DM2 TIA
#1 101 FALSE TRUE TRUE
#2 123 TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587 TRUE TRUE TRUE
#5 789 FALSE TRUE TRUE
Or with rowsum
rowsum(+(df[-1]), group = df$id) > 0
If we prefer data.table we might use:
setDT(df)[, lapply(.SD, any), keyby = id]
id HTN DM2 TIA
1: 101 FALSE TRUE TRUE
2: 123 TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587 TRUE TRUE TRUE
5: 789 FALSE TRUE TRUE
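In current dplyr the scoped verbs such as summarize_all() are superseded by across(). A sketch of the same collapse in that style, assuming dplyr 1.0.0 or later; the result name `collapsed` is introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  id  = c("123", "456", "789", "101", "123", "587", "456", "789"),
  HTN = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  DM2 = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  TIA = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# across(everything(), any) applies any() to every non-grouping column,
# so all problem columns are handled without naming them individually
collapsed <- df %>%
  group_by(id) %>%
  summarise(across(everything(), any), .groups = "drop")
```

With about 30 problem columns the call stays the same, since the grouping column is excluded from everything() automatically.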

Using any() vs | in dplyr::mutate

Why should I use | vs any() when I'm comparing columns in dplyr::mutate()?
And why do they return different answers?
For example:
library(tidyverse)
df <- tibble(x = rep(c(T,F,T), 4), y = rep(c(T,F,T, F), 3), allF = F, allT = T)
df %>%
mutate(
withpipe = x | y # returns expected results by row
, usingany = any(c(x,y)) # returns TRUE for every row
)
What's going on here and why should I use one way of comparing values over another?
The difference between the two is how the answer is calculated:
For |, elements are compared row-wise and boolean logic is used to return the proper value. In the example above each x and y pair is compared and a logical value is returned for each pair, resulting in 12 different answers, one for each row of the data frame.
any(), on the other hand, looks at the entire vector and returns a single value. In the above example, the mutate line that calculates the new usingany column is basically doing this: any(c(df$x, df$y)), which will return TRUE because there's at least one TRUE value in either df$x or df$y. That single value is then assigned to every row of the data frame.
You can see this in action using the other columns in your data frame:
df %>%
mutate(
usingany = any(c(x,y)) # returns all TRUE
, allfany = any(allF) # returns all FALSE because every value in df$allF is FALSE
)
To answer when you should use which: use | when you want to compare elements row-wise. Use any() when you want a single answer about entire columns of the data frame.
TLDR, when using dplyr::mutate(), you're usually going to want to use |.
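The difference can be seen outside of mutate() with plain vectors. This is a small base R sketch using the same x and y as the question; the names `elementwise`, `collapsed` and `recycled` are introduced here for illustration.

```r
x <- rep(c(TRUE, FALSE, TRUE), 4)
y <- rep(c(TRUE, FALSE, TRUE, FALSE), 3)

# | is vectorized: one result per position, so 12 values
elementwise <- x | y

# any() collapses all 24 values into a single TRUE or FALSE
collapsed <- any(c(x, y))

# Inside mutate(), that single value is recycled to fill the whole column
recycled <- rep(collapsed, length(x))
```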
You can also use rowwise().
df <- tibble(x = rep(c(T,F,T), 4), y = rep(c(T,F,T, F), 3), allF = F, allT = T)
df %>%
rowwise() %>%
mutate(x_or_y = any(x,y))
Output:
# A tibble: 12 x 5
x y allF allT x_or_y
<lgl> <lgl> <lgl> <lgl> <lgl>
1 TRUE TRUE FALSE TRUE TRUE
2 FALSE FALSE FALSE TRUE FALSE
3 TRUE TRUE FALSE TRUE TRUE
4 TRUE FALSE FALSE TRUE TRUE
5 FALSE TRUE FALSE TRUE TRUE
6 TRUE FALSE FALSE TRUE TRUE
7 TRUE TRUE FALSE TRUE TRUE
8 FALSE FALSE FALSE TRUE FALSE
9 TRUE TRUE FALSE TRUE TRUE
10 TRUE FALSE FALSE TRUE TRUE
11 FALSE TRUE FALSE TRUE TRUE
12 TRUE FALSE FALSE TRUE TRUE
TL;DR (update):
if_any is the cleanest replacement for any() in rowwise operations with dplyr. See below.
You can use either the OR operator | or any().
The same applies when comparing & and all().
As suggested, you must take into account that | is vectorized, while any() is not.
In order to use any() the same way, you must group the data rowwise, so you can call an equivalent of any(current_row). This can be done with purrr::pmap or dplyr::rowwise.
But dplyr::if_any looks a lot cleaner.
See the code below for a comparison of all methods:
df %>%
  mutate(
    row_OR = x | y,
    row_pmap_any = pmap_lgl(select(., c(x, y)), any),
    with_if_any = if_any(c(x, y))
  ) %>%
  rowwise() %>%
  mutate(row_rowwise_any = any(c_across(c(x, y))))
# A tibble: 12 × 8
# Rowwise:
x y allF allT row_OR row_pmap_any with_if_any row_rowwise_any
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
2 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
3 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
4 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
5 FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
6 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
7 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
8 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
9 TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
10 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
11 FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
12 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
All methods work, and I did not find much difference in performance.
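For completeness, a base R sketch of the same row-wise OR over an arbitrary set of columns, using Reduce() to fold the vectorized operator across them. The names `cols`, `row_or` and `row_and` are introduced here for illustration, and this is an additional option rather than one of the methods compared above.

```r
df <- data.frame(
  x = rep(c(TRUE, FALSE, TRUE), 4),
  y = rep(c(TRUE, FALSE, TRUE, FALSE), 3),
  allF = FALSE,
  allT = TRUE
)

# Columns to combine; any number of logical columns can be listed here
cols <- c("x", "y")

# Fold the vectorized operators across the selected columns:
# Reduce(`|`, ...) is the row-wise any, Reduce(`&`, ...) the row-wise all
row_or  <- Reduce(`|`, df[cols])
row_and <- Reduce(`&`, df[cols])
```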

Resources