How to remove unique rows from dataframe using tidyverse - r

I want to remove unique rows based on a variable:
Letters Val
A 1
A 1
B 1
B 3
In this case, the entries with A are removed because they all share a single Val value, resulting in:
Letters Val
B 1
B 3
I have tried to use count and then filter out n > 1; however, Val is lost in the process.
In essence, how do I filter(count(letters) > 1)?

md <- tibble::tribble(
~Letters, ~Val,
"A", 1,
"A", 1,
"B", 1,
"B", 3
)
library(dplyr)
md |>
group_by(Letters, Val) |>
filter(n() == 1)
#> # A tibble: 2 × 2
#> # Groups: Letters, Val [2]
#> Letters Val
#> <chr> <dbl>
#> 1 B 1
#> 2 B 3
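Since the question mentions that count drops columns, here is a minimal sketch of the same filter using add_count, which keeps every original column and does not leave the tibble grouped. The name res is only for illustration.

```r
library(dplyr)

md <- tibble::tribble(
  ~Letters, ~Val,
  "A", 1,
  "A", 1,
  "B", 1,
  "B", 3
)

# add_count() appends a column n with the size of each (Letters, Val) group
res <- md |>
  add_count(Letters, Val) |>
  filter(n == 1) |>   # keep rows whose (Letters, Val) combination occurs once
  select(-n)          # drop the helper column again
```

Here res should contain only the two B rows, as in the output above.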

Related

How to rename column names containing "(N)"?

I'd like to remove the "(N)" from the column names.
Example data:
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
This is what I have so far, but I can't work out the rest of the regex:
df %>%
rename_with(stringr::str_replace,
pattern = "[//(],N//)]", replacement = "")
But the "N" from "Number (N)" is gone:
name id N) umber (N)
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
One liner: rename_with(df, ~str_remove_all(., ' \\(N\\)'))
or dplyr only: rename_with(df, ~sub(' \\(N\\)', '', .))
We could use the rename_with function from the dplyr package and apply a function to the column names (in this case str_remove_all from the stringr package).
Then use \\ to escape the parentheses:
library(dplyr)
library(stringr)
df %>%
rename_with(~str_remove_all(., ' \\(N\\)'))
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
A possible solution:
library(tidyverse)
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
df %>% names %>% str_remove("\\s*\\(N\\)\\s*") %>% set_names(df,.)
#> # A tibble: 4 × 3
#> name id Number
#> <chr> <dbl> <dbl>
#> 1 A 1 3
#> 2 B 2 1
#> 3 C 3 2
#> 4 D 4 8
Perhaps you can try
setNames(df, gsub("\\s\\(.*\\)", "", names(df)))
which gives
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
A simple solution is
colnames(df) <- gsub(" \\(N\\)", "", colnames(df))
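If the escaping itself is the sticking point, matching the text literally avoids it. A minimal sketch in base R, assuming the suffix is always exactly " (N)"; the vectors nms and clean are illustrative names only.

```r
# Column names from the example data
nms <- c("name", "id (N)", "Number (N)")

# fixed = TRUE treats the pattern as a literal string,
# so ( and ) need no backslash escaping
clean <- sub(" (N)", "", nms, fixed = TRUE)
```

The same call should also work inside rename_with, e.g. rename_with(df, ~ sub(" (N)", "", .x, fixed = TRUE)).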

Remove sequence of rows conditional on value in single cell in group-first position

In this type of data:
df <- data.frame(
Sequ = c(1,1,2,2,2,3,3,3),
G = c("A", "B", "*", "B", "A", "A", "*", "B")
)
I need to filter out rows grouped by Sequ iff the Sequ-first value is *. I can do it like so, but was wondering if there's a more direct and more elegant way in dplyr:
library(dplyr)
df %>%
group_by(Sequ) %>%
mutate(check = ifelse(first(G)=="*", 1, 0)) %>%
filter(check != 1)
# A tibble: 5 × 3
# Groups: Sequ [2]
Sequ G check
<dbl> <chr> <dbl>
1 1 A 0
2 1 B 0
3 3 A 0
4 3 * 0
5 3 B 0
We can try the following base R code using subset + ave
subset(
df,
!ave(G == "*", Sequ, FUN = function(x) head(x, 1))
)
which gives
Sequ G
1 1 A
2 1 B
6 3 A
7 3 *
8 3 B
Here is a direct dplyr way:
library(dplyr)
df %>%
group_by(Sequ) %>%
filter(!first(G == "*"))
Sequ G
<dbl> <chr>
1 1 A
2 1 B
3 3 A
4 3 *
5 3 B
Another base R option with duplicated
subset(df, !Sequ %in% Sequ[G == "*" & !duplicated(Sequ)])
Sequ G
1 1 A
2 1 B
6 3 A
7 3 *
8 3 B

Add column to grouped data that assigns 1 to individuals and randomly assigns 1 or 0 to pairs

I have a dataframe...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families where there is only one member (e.g. c, d and e) and randomly assigns 0 or 1 to families containing 2 members (a and b in the example).
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e"),
random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
group_by(family) %>%
mutate(
random = if_else(
condition = n() == 1,
true = 1,
false = as.double(sample(0:1,1,replace = T))
)
)
You could sample along the sequence length of the family group and take the answer modulo 2:
df %>%
group_by(family) %>%
mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups: family [5]
#> id family random
#> <int> <chr> <dbl>
#> 1 1 a 0
#> 2 2 a 1
#> 3 3 b 0
#> 4 4 b 1
#> 5 5 c 1
#> 6 6 d 1
#> 7 7 e 1
We can use if/else
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
#1 1 a 0
#2 2 a 1
#3 3 b 1
#4 4 b 0
#5 5 c 1
#6 6 d 1
#7 7 e 1
Another option
df %>%
group_by(family) %>%
mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
# 1 1 a 1
# 2 2 a 0
# 3 3 b 1
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 1
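To make the random assignment reproducible and to check it, one could fix the seed and confirm that every family sums to exactly 1 (a single 1 for individuals, one 0 and one 1 for pairs). A minimal sketch, reusing the modulo approach above; the seed value and the names res and check are arbitrary choices for illustration.

```r
library(dplyr)

df <- tibble::tibble(
  id = 1:7,
  family = c("a", "a", "b", "b", "c", "d", "e")
)

set.seed(42)  # arbitrary seed, only so the result can be reproduced

res <- df %>%
  group_by(family) %>%
  mutate(random = sample(seq(n())) %% 2) %>%  # 1 for singles, a shuffled 0/1 for pairs
  ungroup()

# Per-family summary: total should be 1 for every family
check <- res %>%
  group_by(family) %>%
  summarise(size = n(), total = sum(random))
```

This relies on the stated assumption that families have at most two members.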

Filter by values that have the exact names given in a list (dplyr)

I have the following data.
> dat
# A tibble: 12 x 2
id name
<chr> <chr>
1 1 a
2 1 b
3 1 a
4 2 a
5 2 b
6 2 c
7 2 b
8 3 a
9 3 b
10 3 c
11 3 d
12 3 d
I would like to filter only by the following list
set <- NULL
set$names <- c("a","b","c")
The ids selected are those that contain exactly the names in the list.
So the result would be only the 2s selected as follows:
> dat
# A tibble: 4 x 2
id name
<chr> <chr>
4 2 a
5 2 b
6 2 c
7 2 b
Here is the data for easy replication:
dat <- tribble(
~id, ~name,
1, "a",
1, "b",
1, "a",
2, "a",
2, "b",
2, "c",
2, "b",
3, "a",
3, "b",
3, "c",
3, "d",
3, "d"
)
I would like to have the following result.
How about:
group_by(dat, id) %>% filter(setequal(name, set$names))
This filters out all groups where the name column and set$names do not contain the same elements, but allows duplicates.
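To see what "allows duplicates" means here, a small sketch comparing each id's names against the target set. The vector wanted stands in for set$names, and kept is an illustrative name.

```r
library(dplyr)

dat <- tibble::tribble(
  ~id, ~name,
  1, "a", 1, "b", 1, "a",
  2, "a", 2, "b", 2, "c", 2, "b",
  3, "a", 3, "b", 3, "c", 3, "d", 3, "d"
)

wanted <- c("a", "b", "c")  # stands in for set$names

# setequal() compares the distinct elements only, so repeats are ignored
id1 <- setequal(c("a", "b", "a"), wanted)            # "c" is missing
id2 <- setequal(c("a", "b", "c", "b"), wanted)       # same elements, "b" repeated
id3 <- setequal(c("a", "b", "c", "d", "d"), wanted)  # extra element "d"

kept <- dat %>%
  group_by(id) %>%
  filter(setequal(name, wanted)) %>%
  ungroup()
```

Only id 2 should remain in kept, with all four of its rows, including the repeated "b".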
I am not sure it is what you want
dat %>%
group_by(id) %>%
filter(all(set$names %in% name) & all(name %in% set$names))
# A tibble: 4 x 2
id name
<dbl> <chr>
1 2 a
2 2 b
3 2 c
4 2 b

How do I remove offsetting rows in a tibble?

I am trying to remove rows that have offsetting values.
library(dplyr)
a <- c(1, 1, 1, 1, 2, 2, 2, 2,2,2)
b <- c("a", "b", "b", "b", "c", "c","c", "d", "d", "d")
d <- c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30)
o <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df <- tibble(ID = a, SEQ = b, VALUE = d, OTHER = o)
This generates the following table, which is ordered and grouped by ID and SEQ.
> df
# A tibble: 10 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 B
3 1 b -10 C
4 1 b 10 D
5 2 c 20 E
6 2 c -20 F
7 2 c 20 G
8 2 d 30 H
9 2 d -30 I
10 2 d 30 J
I want to drop the row pairs (2,3), (5,6), (8,9) because VALUE negates the VALUE in the matching previous row.
I want the resulting table to be
> df2
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 D
3 2 c 20 G
4 2 d 30 J
I know that I can't use group_by %>% summarize, because I need to keep the value that is in OTHER. I've looked at the dplyr::lag() function but I don't see how that can help. I believe that I could loop through the table with some type of for each loop and generate a logical vector that can be used to drop the rows, but I was hoping for a more elegant solution.
What about:
vec <- cbind(
c(head(df$VALUE,-1) + df$VALUE[-1], 9999) ,
df$VALUE + c(9999, head(df$VALUE,-1))
)
vec <- apply(vec,1,prod)
vec <- vec!=0
df[vec,]
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
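The idea is to add your VALUE field to shifted copies of itself (the previous row and the next row). When either sum is 0, you remove the line. Note that the output above comes from data in which rows D, G and J hold 50, 60 and 70; with VALUE exactly as posted in the question (10, 20, 30), those rows also sum to zero with the negative row before them and would be removed as well.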
Here's another solution with dplyr. Not sure about the edge case you mentioned in the comments, but feel free to test it with my solution:
library(dplyr)
df %>%
group_by(ID, SEQ) %>%
mutate(diff = VALUE + lag(VALUE),
diff2 = VALUE + lead(VALUE)) %>%
mutate(across(c(diff, diff2), ~ coalesce(.x, 1))) %>%
filter((diff != 0 & diff2 != 0)) %>%
select(-diff, -diff2)
Result:
# A tibble: 4 x 4
# Groups: ID, SEQ [4]
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
Note:
This solution first creates two diff columns, one adding the lag and the other adding the lead of VALUE to each VALUE. Only the offsetting rows have a zero in either diff or diff2, so I filtered out those rows, resulting in the desired output.
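With VALUE exactly as posted in the question (10, -10, 10 within a group), the row after each negative entry also sums to zero with it, so a test on neighbour sums alone can drop that row too. A hedged sketch of one way around this, under the assumption that the offsetting row is always the negative one and directly follows the row it cancels; the helper columns reversal and reversed are names made up for this example.

```r
library(dplyr)

df <- tibble::tibble(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  SEQ   = c("a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
  VALUE = c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30),
  OTHER = LETTERS[1:10]
)

df2 <- df %>%
  group_by(ID, SEQ) %>%
  mutate(
    # a negative row that exactly negates the row before it (assumption: offsets are negative)
    reversal = coalesce(VALUE < 0 & VALUE == -lag(VALUE), FALSE),
    # the row immediately before such a reversal
    reversed = lead(reversal, default = FALSE)
  ) %>%
  filter(!reversal, !reversed) %>%
  ungroup() %>%
  select(-reversal, -reversed)
```

On the posted data this should keep rows A, D, G and J, matching the requested df2. It does not attempt to handle offsets that are positive or that are separated from their match by other rows.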
