Identifying last occurring duplicates in a vector in R

I would like to identify all unique values and the last occurring instances of repeated values in a vector. For example, I would like to identify the positions
c(2,3,4,6,7)
in the vector:
v <- c("m", "m", "k", "r", "l", "o", "l")
I see that
(duplicated(v) | duplicated(v, fromLast = T))
identifies all duplicated values, yet I would like to only identify the last occurring instances of duplicated elements.
How to achieve this without a loop?

Do you need:
duplicated(v)
[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE
# and for index
which(duplicated(v))
[1] 2 7
or as akrun suggests:
which(!duplicated(v, fromLast = TRUE))
[1] 2 3 4 6 7
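To see why the fromLast = TRUE variant returns the wanted positions, here is a small base R sketch. The names dup_from_end and last_pos are only illustrative. With fromLast = TRUE, duplicated() scans from the end, so it flags every element that appears again later in the vector; negating it keeps values that occur once plus the last occurrence of each repeated value.

```r
v <- c("m", "m", "k", "r", "l", "o", "l")

# TRUE for elements that occur again later in the vector
dup_from_end <- duplicated(v, fromLast = TRUE)

# Negate: keeps singletons and the last occurrence of each repeated value
last_pos <- which(!dup_from_end)

last_pos     # positions 2 3 4 6 7
v[last_pos]  # the values found at those positions
```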

You could do something like:
library(dplyr)
v %>%
as_tibble() %>%
mutate(index = row_number()) %>%
group_by(value) %>%
mutate(id=row_number()) %>%
filter(id == max(id))
Which gives us:
# A tibble: 5 × 3
# Groups: value [5]
value index id
<chr> <int> <int>
1 m 2 2
2 k 3 1
3 r 4 1
4 o 6 1
5 l 7 2
Additionally, if you just want the index, you can do:
v %>%
as_tibble() %>%
mutate(index = row_number()) %>%
group_by(value) %>%
mutate(id=row_number()) %>%
filter(id == max(id)) %>%
pull(index)
...to get:
[1] 2 3 4 6 7

We can try
> sort(tapply(seq_along(v), v, max))
m k r o l
2 3 4 6 7
or
> unique(ave(seq_along(v), v, FUN = max))
[1] 2 3 4 7 6
or
> rev(length(v) - which(!duplicated(rev(v))) + 1)
[1] 2 3 4 6 7

Related

Aggregate string variable using summarise and across function

df_input is the input file, and the ideal output file is df_output.
df_input <- data.frame(id = c(1,2,3,4,4,5,5,5,6,7,8,9,10),
party = c("A","B","C","D","E","F","G","H","I","J","K","L","M"),
winner= c(1,1,1,1,1,1,1,1,1,1,1,1,1))
df_output <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
party = c("A","B","C","D,E","F,G,H","I","J","K","L","M"),
winner_sum = c(1,1,1,2,3,1,1,1,1,1))
Previously the code worked using the "summarise_at" function as follows:
df_output <- df_input %>%
dplyr::group_by_at(.vars = vars(id)) %>%
{left_join(
dplyr::summarise_at(., vars(party), ~ str_c(., collapse = ",")),
dplyr::summarise_at(., vars(winner), funs(sum))
)}
But it no longer works, as it seems both "summarise_at" and "funs" have been deprecated.
I am trying to replicate using across with dplyr (1.0.10), but I am getting an error. Here is my attempt:
df_output <- df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=T)) %>%
summarise(across(party, str_c(., collapse = ",")))
I have multiple numeric and character variables, not just one as in the example. Thanks a lot.
We don't need across if we apply different functions to single columns:
library(dplyr)
library(stringr)
df_input %>%
group_by(id) %>%
summarise(party = str_c(party, collapse = ","),
winner_sum = sum(winner))
-output
# A tibble: 10 × 3
id party winner_sum
<dbl> <chr> <dbl>
1 1 A 1
2 2 B 1
3 3 C 1
4 4 D,E 2
5 5 F,G,H 3
6 6 I 1
7 7 J 1
8 8 K 1
9 9 L 1
10 10 M 1
If there are multiple 'party' and 'winner' columns, loop across them in a single summarise call, because after the first summarise only the summarised columns and the grouping column remain:
df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=TRUE),
across(party, ~ str_c(.x, collapse = ",")), .groups = "drop")
-output
# A tibble: 10 × 3
id winner party
<dbl> <dbl> <chr>
1 1 1 A
2 2 1 B
3 3 1 C
4 4 2 D,E
5 5 3 F,G,H
6 6 1 I
7 7 1 J
8 8 1 K
9 9 1 L
10 10 1 M
NOTE: If the columns share a similar prefix, use starts_with to select all of them, i.e. across(starts_with("party"), ...). If the column names differ, list them: across(c(party, othercol), ...). If the function to apply depends on the column type, select by type: across(where(is.numeric), sum, na.rm = TRUE)
df_input %>%
group_by(id) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE),
across(where(is.character), str_c, collapse = ","),
.groups = 'drop')
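As far as I know, dplyr 1.1.0 deprecated passing extra arguments through the ... of across(), so on newer versions the calls above may emit a deprecation warning. A sketch of the same idea with anonymous functions follows. It assumes R 4.0 or later (so party stays a character column) and uses base paste() instead of str_c() only to avoid the extra dependency; the name res is illustrative.

```r
library(dplyr)

df_input <- data.frame(
  id     = c(1, 2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 9, 10),
  party  = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"),
  winner = rep(1, 13)
)

res <- df_input %>%
  group_by(id) %>%
  summarise(
    # the grouping column id is not touched by across()
    across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
    across(where(is.character), ~ paste(.x, collapse = ",")),
    .groups = "drop"
  )

res
```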

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way on achieving this instead of using rep() on first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can try the following base R option:
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # for str_c
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

How to use condition statement in pipe in r

I am trying to use the condition statement in the pipe but failed.
The data like this:
group = rep(letters[1:3], each = 3)
status = c(T,T,T, T,T,F, F,F,F)
value = c(1:9)
df = data.frame(group = group, status = status, value = value)
> df
group status value
1 a TRUE 1
2 a TRUE 2
3 a TRUE 3
4 b TRUE 4
5 b TRUE 5
6 b FALSE 6
7 c FALSE 7
8 c FALSE 8
9 c FALSE 9
I want the row with the maximum value in each group, subject to a condition: if any status in the group is TRUE, apply filter(status == T) %>% slice_max(value); otherwise apply slice_max(value) alone.
What I have tried is this:
# way 1
df %>%
group_by(group) %>%
if(any(status) == T) {
filter(status == T) %>% slice_max(value)
} else {
slice_max(value)
}
# way 2
df %>%
group_by(group) %>%
when(any(status) == T,
filter(status == T) %>% slice_max(value),
slice_max(value))
What I expected output should like this:
> expected_df
group status value
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
Any help will be highly appreciated!
Try arranging the data by status and then by value, and take the first row of each group:
df %>%
group_by(group) %>%
arrange(!status, desc(value)) %>%
slice(1)
Since we arrange by status, a row with a TRUE status comes first if the group has one; if not, you just get the largest value. It is generally a bit awkward to combine pipes and if statements. If that is something you want to look into, it is covered in an existing question, but a plain if statement does not work per group after group_by.
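The same ordering trick can be sketched in base R, which makes the intermediate sort visible. This is only an illustration of the idea; ord, sorted, and picked are made-up names.

```r
df <- data.frame(
  group  = rep(letters[1:3], each = 3),
  status = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  value  = 1:9
)

# Within each group: TRUE status first (!TRUE is FALSE, which sorts first),
# then the largest value first
ord <- order(df$group, !df$status, -df$value)
sorted <- df[ord, ]

# Keep only the first row of each group
picked <- sorted[!duplicated(sorted$group), ]
picked
```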
A bit more verbose:
library(dplyr)
df %>%
group_by(group) %>%
filter(if (any(status)) value == max(value[status]) else value == max(value)) %>%
ungroup
# group status value
# <chr> <lgl> <int>
#1 a TRUE 3
#2 b TRUE 5
#3 c FALSE 9
df %>%
group_by(group) %>%
slice(which.max(value*(all(!status)|status)))
# A tibble: 3 x 3
# Groups: group [3]
group status value
<chr> <lgl> <int>
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
That said, arranging the data first is the cleanest approach.

how do I find differences between similar strings?

I have a vector of strings (file names to be exact).
pav <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_50_130_02_30cm_101mm_")
I'm looking for a simple way to find difference between these strings.
> char_position_fun(pav) # gives unique character position
[1] 9 12 13
> char_diff_fun(pav) # removes matching components (position and value)
[1] 3_4_5 4_4_5 4_5_0
Here is my attempt. I decided to split all letters and create a data frame for each string containing position and letter information. Then, for each position, I checked if there is one unique letter or not. If FALSE, that suggests that not all letters are identical. Finally, subset the data frame with a logical condition. In this way, you can see position and letter information together.
library(tidyverse)
strsplit(mytext, split = "") %>%
map_dfr(.x = .,
.f = function(x) enframe(x, name = "position", value = "word"),
.id = "id") %>%
group_by(position) %>%
mutate(check = n_distinct(word) == 1) %>%
filter(check == FALSE)
id position word check
<chr> <int> <chr> <lgl>
1 1 9 3 FALSE
2 1 12 4 FALSE
3 1 13 5 FALSE
4 2 9 4 FALSE
5 2 12 4 FALSE
6 2 13 5 FALSE
7 3 9 4 FALSE
8 3 12 5 FALSE
9 3 13 0 FALSE
If you want to have the outcome as you described, you can add a bit more operation.
strsplit(mytext, split = "") %>%
map_dfr(.x = .,
.f = function(x) enframe(x, name = "position", value = "word"),
.id = "id") %>%
group_by(position) %>%
mutate(check = n_distinct(word) == 1) %>%
filter(check == FALSE) %>%
group_by(id) %>%
summarize_at(vars(position:word),
.funs = list(~paste0(., collapse = "_")))
id position word
<chr> <chr> <chr>
1 1 9_12_13 3_4_5
2 2 9_12_13 4_4_5
3 3 9_12_13 4_5_0
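The same position-by-position check can be sketched in base R with a character matrix, one row per string and one column per position. This assumes all strings have the same number of characters, as in the question; chars, differs, and diff_chars are illustrative names.

```r
pav <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_50_130_02_30cm_101mm_")

# Assumption: every string has the same length
stopifnot(length(unique(nchar(pav))) == 1)

# One row per string, one column per character position
chars <- do.call(rbind, strsplit(pav, ""))

# A position differs if its column holds more than one distinct character
differs <- apply(chars, 2, function(col) length(unique(col)) > 1)

pos <- which(differs)
diff_chars <- apply(chars[, differs, drop = FALSE], 1, paste, collapse = "_")

pos         # 9 12 13
diff_chars  # "3_4_5" "4_4_5" "4_5_0"
```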
DATA
mytext <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_", "Sn_4Khz_4W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_50_130_02_30cm_101mm_")
Here is a base R solution.
First, we can convert the strings from UTF-8 to integer code points, i.e.,
z <- Map(utf8ToInt,v)
the positions of differences
pos <- unique(unlist(outer(z,z,FUN = Vectorize(function(x,y) which(x!=y)))))
> pos
[1] 9 12 13
the chars that are different:
word <- Map(function(x) paste(intToUtf8(x[pos],multiple = T),collapse = "_"),z)
> word
$Sn_4Khz_3W_45_130_02_30cm_101mm_
[1] "3_4_5"
$Sn_4Khz_4W_45_130_02_30cm_101mm_
[1] "4_4_5"
$Sn_4Khz_4W_50_130_02_30cm_101mm_
[1] "4_5_0"
DATA
v <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_", "Sn_4Khz_4W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_50_130_02_30cm_101mm_")

Use numeric value of `colnames` to recalculate columns in a data.frame using `dplyr`'s `mutate_at` or an alternative

I have a df where I want to recalculate some columns based on the numeric value of each column's name:
library(dplyr)
df <- data.frame(year = 1:3, "10" = 0:2, "20" = 3:5)
colnames(df)[2:3] <- c("10", "20")
df
year 10 20
1 1 0 3
2 2 1 4
3 3 2 5
The expected output is col_name - col_values. I can generate the expected output with:
df %>% mutate(`10` = 10 - `10`) %>% mutate(`20` = 20 - `20`)
year 10 20
1 1 10 17
2 2 9 16
3 3 8 15
How can I generate the same output without explicitly typing the respective column name values?
I tried the following code (which works):
df %>% mutate(`10` = as.numeric(colnames(.)[2]) - `10`) %>% mutate(`20` = as.numeric(colnames(.)[3]) - `20`)
So I tried to further reduce this but could only think of:
df %>% mutate_at(vars(-year), ~ as.numeric(colnames(.)[.]))
which obviously cannot work, since the . has two meanings.
How can I achieve my expected output using mutate_at or an alternative?
Reshape, do stuff, then reshape again:
library(tidyr) # gather() and spread()
gather(df, key = "k", value = "v", -year) %>%
mutate(v = as.numeric(k) - v) %>%
spread(key = "k", value = "v")
# year 10 20
# 1 1 10 17
# 2 2 9 16
# 3 3 8 15
In base R, we can use lapply
df[-1] <- lapply(names(df[-1]), function(x) as.numeric(x) - df[,x])
# year 10 20
#1 1 10 17
#2 2 9 16
#3 3 8 15
Or mapply
df[-1] <- mapply(`-`, as.numeric(names(df[-1])), df[-1])
Here is one option with mutate_at
library(rlang)
library(tidyverse)
df %>%
mutate_at(2:3, list(~ as.numeric(as_name(quo(.)))- .))
# year 10 20
#1 1 10 17
#2 2 9 16
#3 3 8 15
Or this can also be done with deparse(substitute(.)):
df %>%
mutate_at(2:3, list(~ as.numeric(deparse(substitute(.))) - .))
Or using an option with map
map_dfc(names(df)[2:3], ~
df %>%
select(.x) %>%
mutate(!! .x := as.numeric(.x) - !! sym(.x))) %>%
bind_cols(df %>%
select(year), .)
Or with imap
df[-1] <- imap(df[-1], ~ as.numeric(.y) - .x)
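On dplyr 1.0.0 or later, another option is across() together with cur_column(), which returns the name of the column currently being processed. The sketch below is not taken from the answers above; it builds the data with check.names = FALSE so the numeric column names are kept as-is, and res is an illustrative name.

```r
library(dplyr)

# check.names = FALSE keeps "10" and "20" as column names
df <- data.frame(year = 1:3, "10" = 0:2, "20" = 3:5, check.names = FALSE)

res <- df %>%
  # cur_column() gives the current column's name as a string
  mutate(across(-year, ~ as.numeric(cur_column()) - .x))

res
```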
