Identifying which values are duplicates in R [duplicate]

I would like to identify which observations are duplicates based on the values of one variable. However, I would like every observation involved in a duplicate to be flagged, not just the second and later occurrences. For example:
x <- c(1,2,3,4,5,7,5,7)
duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
Rather than flagging only the last two elements, I would like the last four elements to be flagged, along with which elements match each other (e.g. elements 5 and 7, and elements 6 and 8). Thanks.

You can use duplicated twice:
duplicated(x) | duplicated(x, fromLast = TRUE)
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
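The question also asks which elements match each other. One possible sketch (not taken from the answers here) groups the positions by value with split and keeps only the groups that hold more than one position. The name dup_groups is purely illustrative.

```r
x <- c(1, 2, 3, 4, 5, 7, 5, 7)

# Group the positions of x by the value found at each position
groups <- split(seq_along(x), x)

# Keep only the values that occur more than once
dup_groups <- groups[lengths(groups) > 1]
dup_groups
# $`5`
# [1] 5 7
#
# $`7`
# [1] 6 8
```

Each list element is named after the duplicated value and holds the positions where it occurs, so elements 5 and 7 are paired, as are elements 6 and 8.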

You could try a table:
x <- c(1,2,3,4,5,7,5,7)
tab <- table(x) > 1
x[x %in% names(which(tab))]
# [1] 5 7 5 7
Another method, inspired by @rawr's comment, is:
x %in% x[duplicated(x)]
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[ x %in% x[duplicated(x)] ]
# [1] 5 7 5 7
which(x %in% x[duplicated(x)])
# [1] 5 6 7 8
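Since the question talks about observations and a variable, the same idea can be applied to a column of a data frame. This is only a sketch: the data frame df and the column names id, value and is_dup are made up for illustration.

```r
# Hypothetical data frame; only the value column comes from the question
df <- data.frame(id = 1:8, value = c(1, 2, 3, 4, 5, 7, 5, 7))

# Flag every row whose value occurs more than once, scanning from both ends
df$is_dup <- duplicated(df$value) | duplicated(df$value, fromLast = TRUE)

# Rows involved in a duplicate
df[df$is_dup, ]
```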

Related

How to match multiple columns without merge?

I have these two data frames:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to find exact matches. An exact match means that ID and Date are the same in both data frames. For example, "TRZ00897" "2022-08-21" is an exact match because it is present in both data frames.
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both data frames. It is FALSE because the date "2022-09-22" already occurs in Data2 at index position 1, and match returns only the first matching position:
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So for the last row, index positions 6 and 1 are compared, and they are not equal, hence FALSE.
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output, use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
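One caveat with paste: with the default single-space separator, two different rows can collapse to the same string if the values themselves contain spaces. A minimal sketch, assuming a separator character that never occurs in the data. The helper row_key and the "\x1f" separator are illustrative choices, not part of the answer above.

```r
Data1 <- data.frame(
  ID1   = c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF"),
  Date1 = c("2022-08-21", "2022-03-22", "2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
)
Data2 <- data.frame(
  ID   = c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897", "ERD895655", "FFF", "IUHG0"),
  Date = c("2022-09-22", "2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22")
)

# With the default separator, distinct value pairs can produce the same key
collision <- paste("a b", "c") == paste("a", "b c")
collision
# [1] TRUE

# Hypothetical helper: build one key per row using an unlikely separator
row_key <- function(df, sep = "\x1f") {
  do.call(paste, c(unname(as.list(df)), sep = sep))
}

idx <- match(row_key(Data1), row_key(Data2))
idx
# [1]  4  3 NA NA  1  6
```

Because the helper pastes every column it is given, passing a subset such as Data1[c("ID1", "Date1")] limits the comparison to those columns.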
Try match with asplit. Because the two data frames have different column names, the names have to be removed with unname first; this step can be skipped if both data frames use the same column names.
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another option, though a memory-expensive one, is interaction:
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
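With mapply and %in% (note that this tests each column independently, so it is only reliable when an ID and its date cannot be found in different rows of Data2):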
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
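The column-wise approach checks each column on its own, so it can report a match when the ID and the date are found in different rows. The small data frames A and B below are made up to show that case. They are not from the question.

```r
# Hypothetical data: every id and every date occurs in both data frames,
# but never in the same combination
A <- data.frame(id = c("a", "b"), date = c("d1", "d2"))
B <- data.frame(id = c("a", "b"), date = c("d2", "d1"))

# Column-wise membership: reports a match for both rows
colwise <- apply(mapply(`%in%`, A, B), 1, all)
colwise
# [1] TRUE TRUE

# Row-wise comparison on combined keys: no row of A occurs in B
rowwise <- paste(A$id, A$date) %in% paste(B$id, B$date)
rowwise
# [1] FALSE FALSE
```

For the data in the question the two approaches happen to agree, but only the row-wise comparison tests for the exact match the question describes.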

Split dataframe character into columns based on value of character [duplicate]

Is there a way to split a data frame column into multiple columns based on the values in a character string? For example, I start with this data frame:
initialData = data.frame(attr = c('a','b','c','d'), type=c('1,2','2','3','2,3'))
And the endData is something like this:
attr Conditions Cond1 Cond2 Cond3
1 a 1,2 TRUE TRUE FALSE
2 b 2 FALSE TRUE FALSE
3 c 3 FALSE FALSE TRUE
4 d 2,3 FALSE TRUE TRUE
I've written a function that takes a character string, runs a regexp on it to check whether a condition is met, and returns TRUE or FALSE, but I'm not sure how to go through each row of the data frame and fill in the correct column.
We can use mtabulate from qdapTools after splitting the 'type' column with strsplit, then cbind the result with the original dataset:
library(qdapTools)
out <- cbind(initialData,
             mtabulate(strsplit(as.character(initialData$type), ",")) > 0)
names(out)[3:5] <- paste0("Cond", names(out)[3:5])
out
# attr type Cond1 Cond2 Cond3
#1 a 1,2 TRUE TRUE FALSE
#2 b 2 FALSE TRUE FALSE
#3 c 3 FALSE FALSE TRUE
#4 d 2,3 FALSE TRUE TRUE
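If installing qdapTools is not an option, the same table can be built in base R. This is a sketch rather than the answer's method: it splits the strings, collects the distinct values, and tests each row against them. The names parts, lev and flags are illustrative, and the code assumes there are at least two distinct values.

```r
initialData <- data.frame(attr = c("a", "b", "c", "d"),
                          type = c("1,2", "2", "3", "2,3"))

# Split each comma-separated string into its values
parts <- strsplit(as.character(initialData$type), ",", fixed = TRUE)

# All distinct values, in sorted order
lev <- sort(unique(unlist(parts)))

# One row per observation, one logical column per distinct value
flags <- t(vapply(parts, function(p) lev %in% p, logical(length(lev))))
colnames(flags) <- paste0("Cond", lev)

out <- cbind(initialData, flags)
out
#   attr type Cond1 Cond2 Cond3
# 1    a  1,2  TRUE  TRUE FALSE
# 2    b    2 FALSE  TRUE FALSE
# 3    c    3 FALSE FALSE  TRUE
# 4    d  2,3 FALSE  TRUE  TRUE
```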

Finding All Positions for Multiple Elements in a Vector

Suppose I have the following vector:
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
How can I find which elements are either 8 or 9?
This is one way to do it. First, get the indices at which x is either 8 or 9. Then verify that x is indeed 8 or 9 at those indices.
> inds <- which(x %in% c(8,9))
> inds
[1] 1 3 4 12 15 19
> x[inds]
[1] 8 9 9 8 9 8
You could try the | operator for short conditions
which(x == 8 | x == 9)
In this specific case you could also use grep:
# option 1
grep('[89]',x)
# option 2
grep('8|9',x)
which both give:
[1] 1 3 4 12 15 19
When you also want to detect numbers with more than one digit, the second option is preferred:
> grep('10|8',x)
[1] 1 12 18 19
However, I put the emphasis on this specific case at the start of my answer for a reason. As @DavidArenburg mentioned, this could lead to unintended results. For example, grep('1|8',x) will detect both 1 and 10:
> grep('1|8',x)
[1] 1 10 12 18 19
To avoid that side effect, you have to wrap the numbers to be detected in word boundaries:
> grep('\\b1\\b|8',x)
[1] 1 10 12 19
Now, the 10 isn't detected.
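An alternative to word boundaries is to anchor the whole pattern so that only complete values match. The sketch below builds such a pattern from a vector of targets. The names targets and pattern are illustrative, and the result is compared with %in%, which avoids text matching altogether.

```r
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
targets <- c(1, 8)

# Anchor at both ends so that "10" cannot match the alternative "1"
pattern <- paste0("^(", paste(targets, collapse = "|"), ")$")
pattern
# [1] "^(1|8)$"

grep(pattern, x)
# [1]  1 10 12 19

# Same positions without regular expressions
which(x %in% targets)
# [1]  1 10 12 19
```

Regular expressions work here only because x is converted to character strings first, so for numeric data the %in% form is usually the more direct choice.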
Here is a generalized solution to find the locations of all target values (only works for vectors and 1-dimensional arrays).
locate <- function(x, targets) {
  results <- lapply(targets, function(target) which(x == target))
  names(results) <- targets
  results
}
This function returns a list because each target may have any number of matches, including zero. The list is sorted (and named) in the original order of the targets.
Here is an example in use:
sequence <- c(1:10, 1:10)
locate(sequence, c(2,9))
$`2`
[1] 2 12
$`9`
[1] 9 19
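To illustrate the two properties mentioned above (results follow the order of the targets, and a target may have zero matches), here is a short usage sketch. The function is repeated so the block runs on its own. The target 11 is chosen only because it does not occur in the vector.

```r
locate <- function(x, targets) {
  results <- lapply(targets, function(target) which(x == target))
  names(results) <- targets
  results
}

res <- locate(c(1:10, 1:10), c(9, 2, 11))

# Names follow the order in which the targets were given
names(res)
# [1] "9"  "2"  "11"

# A target without matches yields an empty integer vector
res[["11"]]
# integer(0)

# Number of matches per target
lengths(res)
#  9  2 11
#  2  2  0
```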
Alternatively, if you do not need to use the indices but just the elements you can do
> x <- sample(1:10,20,replace=TRUE)
> x
[1] 6 4 7 2 9 3 3 5 4 7 2 1 4 9 1 6 10 4 3 10
> x[8<=x & x<=9]
[1] 9 9
If you want to find the answer using loops, then the following script will do the job:
> req_nos <- c(8, 9)
> pos <- list()
> for (i in seq_along(req_nos)) {
+   pos[[i]] <- which(x == req_nos[i])
+ }
The output will look like this:
> pos
[[1]]
[1] 1 12 19
[[2]]
[1] 3 4 15
Here, pos[[1]] contains the positions of 8 and pos[[2]] contains the positions of 9. With the %in% method, changing the order of the targets, i.e., c(9,8) instead of c(8,9), gives the same output, so you cannot tell which position belongs to which value. The loop avoids that problem.
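The point about ordering can be checked directly. In this sketch, which uses the vector from the question, the %in% result is the same for either target order, while a per-target lookup keeps one set of positions for each value. The variable names are illustrative.

```r
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)

# %in% ignores the order of the targets
same <- identical(which(x %in% c(8, 9)), which(x %in% c(9, 8)))
same
# [1] TRUE

# A per-target lookup returns positions in the order the targets were given
pos <- lapply(c(9, 8), function(v) which(x == v))
pos[[1]]   # positions of 9
# [1]  3  4 15
pos[[2]]   # positions of 8
# [1]  1 12 19
```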
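grepl may be a useful function. Note that grepl is available in R 2.9.0 and later. What's handy about grepl is that it returns a logical vector of the same length as x. (The output below comes from a different x than the one in the question, which has 8 at positions 1, 12 and 19.)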
grepl(8, x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
grepl(9, x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
To arrive at your answer, you could do the following
grepl(8,x) | grepl(9,x)
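Keep in mind that grepl matches substrings, so it also flags numbers that merely contain the digit. The vector y below is made up to show the difference from an exact comparison with %in%, which likewise returns a logical vector of the same length.

```r
# Hypothetical vector with values that contain, but do not equal, 8 or 9
y <- c(8, 18, 9, 90, 5)

# Substring matching: 18 and 90 are flagged as well
by_grepl <- grepl(8, y) | grepl(9, y)
by_grepl
# [1]  TRUE  TRUE  TRUE  TRUE FALSE

# Exact matching: only the values 8 and 9 are flagged
by_in <- y %in% c(8, 9)
by_in
# [1]  TRUE FALSE  TRUE FALSE FALSE
```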
