How to return the index of certain duplicate strings in a character vector ignoring the index of the first occurrence of the duplicate string? - r

I have a vector of strings and I want to return the index of the duplicate values, except for the index of the first occurrence of a duplicate value, given another vector with matches. For example:
x <- c("a", "b", "c", "b", "a", "a", "c", "c")
matching_values <- c("a", "b")
So I would like to have an integer vector returned with the indexes 4, 5, 6. So the first duplicate of a occurs at position 5 and the second duplicate at position 6. The first duplicate for b occurs at index 4 and because I did not specify to match c, there will be no index returned. Thank you!

You could use:
which(duplicated(x) & x %in% matching_values)
#[1] 4 5 6

We can use duplicated with %in%:
which(x %in% matching_values & duplicated(x))
#[1] 4 5 6
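
To see why both answers give the same result, it can help to look at the two logical vectors separately. A minimal sketch using the vectors from the question (the intermediate names is_dup, is_match and idx are only for illustration):

```r
x <- c("a", "b", "c", "b", "a", "a", "c", "c")
matching_values <- c("a", "b")

# TRUE for every element that has already appeared earlier in x
is_dup <- duplicated(x)
# TRUE for every element that is one of the values of interest
is_match <- x %in% matching_values

# positions where both conditions hold
idx <- which(is_dup & is_match)
idx
#[1] 4 5 6
```

Because & is commutative, the order of the two conditions does not matter.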


R - Data.table fast binary search based subset with multiple values in second key

I have come across this vignette at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point.
My data looks like this:
ID TYPE MEASURE_1 MEASURE_2
1 A 3 3
1 B 4 4
1 C 5 5
1 Mean 4 4
2 A 10 1
2 B 20 2
2 C 30 3
2 Mean 20 2
When I do this ... all works as expected.
setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C
Whenever I try something like this, where I want to base the keyed subset on multiple values for the second key, I only get the combinations of the unique values in key 1 with the first value of the vector c() given for the second key. All following values in that vector are ignored.
# extract SD of all IDs with one of the 3 types A/B/C
dt[.(unique(ID), c("A", "B", "C"))]
# previous output is equivalent to
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
# I want/expect
dt[TYPE %in% c("A", "B", "C")]
What am I missing here, or is this something I cannot do with keyed subsets?
To clarify: As I cannot leave out the key 1 in keyed subsets, the vignette calls for inclusion of the first key with unique(key1)
Supplying multiple values for key 1 also works as expected.
dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE
In the data.table documentation (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments), it is mentioned:
character, list and data.frame input to i is converted into a data.table internally using as.data.table.
So, the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:
as.data.table(list(unique(ID), c("A", "B", "C")))
and since the length of the longest list element (length of c("A", "B", "C")) is not a multiple of the shorter one (length of unique(ID)), you will get an error.
If you want each value in unique(ID) combined with each element in c("A", "B", "C"), you should use CJ(unique(ID), c("A", "B", "C")) instead.
So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].
Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key and this gets recycled to match the length of unique(ID).
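
As a rough sketch of the CJ() approach, assuming the data.table package is installed and rebuilding the example table from the question (the names ids, types, lookup and res are made up for this illustration):

```r
library(data.table)

# example data from the question
dt <- data.table(
  ID = rep(1:2, each = 4),
  TYPE = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)
setkey(dt, ID, TYPE)

ids <- unique(dt$ID)
types <- c("A", "B", "C")

# CJ() builds every combination of its arguments: 2 IDs x 3 types = 6 rows
lookup <- CJ(ids, types)

# keyed join: one row of dt per row of lookup
res <- dt[lookup]
res
```

By default a combination that does not exist in dt comes back as a row of NA measures; passing nomatch = NULL in the join should drop such rows in recent data.table versions.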

Return index numbers in R where object lengths are not multiples [duplicate]

I have a vector like so:
foo = c("A", "B", "C", "D")
And I want a vector of selected index numbers, which I imagined I could do like so:
which(foo == c("A", "B", "D"))
But apparently this only works if the lengths of the two vectors are multiples, as otherwise you get an incomplete result followed by a warning message:
"longer object length is not a multiple of shorter object length".
So how do I get what I'm after, which is "1 2 4"?
Use match:
match(c('A', 'B', 'D'), foo)
Using %in% is one option here:
foo <- c("A", "B", "C", "D")
x <- c("A", "B", "D")
c(1:4)[foo %in% x] # [1] 1 2 4
The quantity foo %in% x returns a logical vector which can then be used to subset the indices you want to see.
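
The difference between ==, match and %in% can be sketched as follows. The variable names (wanted, elementwise, by_match, by_in) are only illustrative:

```r
foo <- c("A", "B", "C", "D")
wanted <- c("A", "B", "D")

# == compares position by position and recycles the shorter vector,
# so this checks A==A, B==B, C==D, D==A and warns about the lengths
elementwise <- suppressWarnings(foo == wanted)

# match() gives, for each element of wanted, its first position in foo
by_match <- match(wanted, foo)

# which() on %in% gives the positions in foo whose value occurs in wanted
by_in <- which(foo %in% wanted)

# the two approaches differ once the lookup values are not in foo's order
match(c("D", "A"), foo)      # follows the order of the lookup values
which(foo %in% c("D", "A"))  # follows the order of foo
```

Note also that match returns NA for a lookup value that is absent from foo, whereas which(foo %in% ...) simply omits it.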

Find vector of strings in list (R)

I have a list, in which each element is a vector of strings, as:
l <- list(c("a", "b"), c("c", "d"))
I want to find the index of the element in l that contains a specific vector of strings, as c("a", "b"). How do I do that? I thought which(l %in% c("a", "b")) should work, but it returns integer(0) instead of 1.
%in% checks presence of elements of the LHS among elements of the RHS. To treat c("a", "b") as a single element of the RHS, it needs to be in a list:
which(l %in% list(c("a", "b")))
Other possibilities are to go element-by-element through l with sapply, such as
which(sapply(l,function(x) all(c("a","b") %in% x)))
# order doesn't matter, other elements allowed
which(sapply(l, identical, c("a", "b"))) # exact match, in order
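
The variants above answer slightly different questions. A small sketch on a longer, made-up list shows where they diverge (the list contents and the names target, exact, same_set and contains are assumptions for illustration; vapply is used here instead of sapply only so that the result type is fixed):

```r
l <- list(c("a", "b"), c("b", "a"), c("a", "b", "c"))
target <- c("a", "b")

# exact match: same elements in the same order
exact <- which(vapply(l, function(v) identical(v, target), logical(1)))

# same set of elements, order ignored
same_set <- which(vapply(l, function(v) setequal(v, target), logical(1)))

# all target elements present, extra elements allowed
contains <- which(vapply(l, function(v) all(target %in% v), logical(1)))

exact     # 1
same_set  # 1 2
contains  # 1 2 3
```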

Is there a way in R to rank categorical variable (of characters) into ranked ordinal data?

I have a list of character strings, say
alphabets = c(a, b, c, d,..., z) and I would like to get the index of this list as a new column in a data.frame.
e.g. (b, a, c, d, e, g) would yield (2, 1, 3, 4, 5, 7).
The solution you need is to convert the character vector to a factor:
alphabets = c("b", "a", "c", "d", "e", "g")
#convert to class factor with the order defined by the levels argument
alphabets<-factor(alphabets, levels=letters)
#display the values
as.numeric(alphabets)
#[1] 2 1 3 4 5 7
This is a case for match
x <- c("b", "a", "c", "d", "e", "g")
match(x, letters)
#[1] 2 1 3 4 5 7
Or sapply with grep returning a named int vector
sapply(x, grep, letters)
#b a c d e g
#2 1 3 4 5 7
Two comments:
"I have a list of character strings" Be precise with class names of objects! alphabets = c("a", "b", "c", "d") is a character vector, not a list.
letters is a built-in constant that holds the 26 lower-case letters (of the Roman alphabet) as a character vector. See ?letters for details.
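
Since the question asks for the index as a new column in a data.frame, here is a minimal sketch of that step using match. The data frame df and the column names letter and position are made up for the example:

```r
df <- data.frame(letter = c("b", "a", "c", "d", "e", "g"))

# position of each value within the reference vector letters
df$position <- match(df$letter, letters)
df
```

Values that do not occur in the reference vector (for example an upper-case "B") would get NA in the new column.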

For loop with factor data

I have two vectors of factor data with equal length. Just for example's sake:
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
Ultimately, I am trying to generate a classification matrix showing the number of times each factor is correctly predicted. This would look like the following for the example:
name T F
a 1 2
b 1 1
c 1 1
Note that the table() command doesn't work here because I have 11 different factors, and the output would be 11x11 instead of 11x2. My plan is to create three vectors, and combine them into a data frame.
First, a vector of the unique factor values in the existing vectors. This is simple enough,
names=unique(observed)
Next, a vector of values showing the number of correct predictions. This is where I am running into trouble. I can get the number of correct predictions for an individual factor like so:
correct.a=sum(predicted[which(observed == "a")] == "a")
But this is cumbersome to repeat time and time again, and then combine into a vector like
correct=c(correct.a, correct.b, correct.c)
Is there a way to use a loop (or other strategy that you can think of) to improve this process?
Also note that the final vector I would create would be something like this:
incorrect.a=sum(observed == "a")-correct.a
t(sapply(split(predicted == observed, observed), table))
# FALSE TRUE
#a 2 1
#b 1 1
#c 1 1
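
One caveat with the sapply/split approach is that it returns a matrix only when every group contains both TRUE and FALSE. A possible base R variant that always yields two columns is to cross-tabulate against a factor with fixed levels. This is a sketch, and the names correct, tab and res are illustrative:

```r
observed  <- c("a", "b", "c", "a", "b", "c", "a")
predicted <- c("a", "a", "b", "b", "b", "c", "c")

# fixing the levels keeps both columns even if a class is
# always (or never) predicted correctly
correct <- factor(predicted == observed, levels = c(FALSE, TRUE))

# one row per observed class, one column per outcome
tab <- table(observed, correct)

# reshape into the layout asked for in the question
res <- data.frame(
  name = rownames(tab),
  correct = as.vector(tab[, "TRUE"]),
  incorrect = as.vector(tab[, "FALSE"])
)
res
```

With 11 classes this still gives an 11x2 table, because the second dimension is the TRUE/FALSE outcome rather than the predicted class.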
I would suggest you use data.table for an explicit, clean way to define your results:
library(data.table)
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
dt <- data.table(observed, predicted)
res <- dt[, .(
T = sum(observed == predicted),
F = sum(observed != predicted)),
observed
]
res
# observed T F
# 1: a 1 2
# 2: b 1 1
# 3: c 1 1
