R - How to intersect() and include duplicates?

I have the following character fields I am trying to intersect. These should be equal.
> char.y[[892]]
[1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
> char.x[[892]]
[1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
> intersect(char.x[[892]], char.y[[892]])
[1] "E" "d" "w" "a" "r" "s" " " "L" "i" "f" "e" "c" "n"
Expected result:
"E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"

intersect() returns the common elements but drops duplicates. For example, "s" appears 3 times in each vector but only once in the intersection.
If you want to keep the original layout, with only the non-intersecting values removed, you can use the following:
a <- c("E", "d", "w", "a", "r", "d", "s", " ", "L", "i", "f", "e", "s", "c", "i", "e", "n", "c", "e", "s")
b <- c("E", "d", "w", "a", "r", "d", "s", " ", "L", "i", "f", "e", "s", "c", "i", "e", "n", "c", "e", "s")
a[a %in% intersect(a, b)]
# [1] "E" "d" "w" "a" "r" "d" "s" " " "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e" "s"
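Note that this filter keeps every copy in a whose value occurs anywhere in b; it does not cap the count at what b contains. A small sketch with made-up vectors (a2 and b2 are illustrative and not from the question) shows the difference:

```r
# Illustrative vectors, not from the question
a2 <- c("s", "s", "s", "d")
b2 <- c("s", "d", "d")

# All three "s" survive although b2 holds only one
kept_a <- a2[a2 %in% intersect(a2, b2)]
kept_a
# [1] "s" "s" "s" "d"

# The result also depends on which vector is being filtered
kept_b <- b2[b2 %in% intersect(a2, b2)]
kept_b
# [1] "s" "d" "d"
```

If the counts have to match on both sides, a multiset-style intersection is needed instead.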

This would depend entirely on the vectors you are comparing (and in which order), but would this be sufficient?
b <- a <- c('E', 'd', 'w', 'a', 'r', 'd', 's', '', 'L', 'i', 'f', 'e', 's', 'c', 'i', 'e', 'n', 'c', 'e')
c <- letters[sample(1:26, 100, replace = TRUE)]
a[is.element(a,b)]
# [1] "E" "d" "w" "a" "r" "d" "s" "" "L" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"
a[is.element(a,c)]
# [1] "d" "w" "a" "r" "d" "s" "i" "f" "e" "s" "c" "i" "e" "n" "c" "e"

I had the exact same problem and didn't find a solution, so I created my own little function "intersectdup":
intersectdup <- function(vektor1, vektor2) {
  result <- c()
  for (i in seq_along(vektor2)) {
    # match() returns the first position of the element in vektor1, or NA
    foundAt <- match(vektor2[i], vektor1)
    if (!is.na(foundAt)) {
      result <- c(result, vektor2[i])
      # drop the matched element so it cannot be matched a second time
      vektor1 <- vektor1[-foundAt]
    }
  }
  return(result)
}
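When an element is dropped by position inside a loop like this, operator precedence in the index expression matters: in R, `:` binds more tightly than `+` and `-`. The following sketch (v and k are illustrative names) shows what that means and why a negative index is the safer way to remove one position:

```r
v <- c("a", "b", "c", "d")
k <- 2

idx_before <- 1:k - 1          # parsed as (1:k) - 1
idx_before
# [1] 0 1

idx_after <- k + 1:length(v)   # parsed as k + (1:length(v))
idx_after
# [1] 3 4 5 6

past_end <- v[idx_after]       # positions past the end come back as NA
past_end
# [1] "c" "d" NA  NA

dropped <- v[-k]               # a negative index drops position k directly
dropped
# [1] "a" "c" "d"
```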

Picking up on Clemens's answer, here is a simple function written in a C-like style:
intersectMe = function(x, y, duplicates=TRUE)
  {
  xyi = intersect(x,y);
  if(!duplicates) { return (xyi); }
  res = c();
  for(xy in xyi)
    {
    # count how often the shared value occurs in each vector
    y.xy = which(y == xy); ny.xy = length(y.xy);
    x.xy = which(x == xy); nx.xy = length(x.xy);
    # keep as many copies as both vectors can supply
    min.xy = min(ny.xy, nx.xy);
    res = c(res, rep(xy, min.xy) );
    }
  res;
  }
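The same idea can also be sketched without an explicit loop. This is only one possible approach, under the assumption that the result should follow the order of the first vector: number each occurrence of a value (first "s", second "s", and so on) and keep an element of x only if y has an occurrence with the same value and number. The name intersect_multiset is hypothetical.

```r
# Hypothetical helper: multiset intersection that preserves the order of x
intersect_multiset <- function(x, y) {
  # occurrence counter per value: first "s" -> 1, second "s" -> 2, ...
  occ_x <- ave(seq_along(x), x, FUN = seq_along)
  occ_y <- ave(seq_along(y), y, FUN = seq_along)
  # keep x[i] when y contains the same value with the same occurrence number
  keep <- paste(x, occ_x) %in% paste(y, occ_y)
  x[keep]
}

x <- c("s", "s", "s", "d")
y <- c("s", "d", "d", "s")
res <- intersect_multiset(x, y)
res
# [1] "s" "s" "d"
```

Unlike intersectMe, which groups equal values together, this version returns the elements in the order in which they appear in x.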

The vecsets library also helps (using the example created by Eric):
vecsets::vintersect(a, b)
[1] "E" "d" "d" "w" "a" "r" "s" "s" "s" " " "L" "i" "i" "f" "e" "e" "e" "c" "c" "n"

Related

Complementary sequence using gsub

I'm trying to build the complementary sequence of a DNA chain stored in a vector.
It should swap "A" with "T" and "C" with "G" (and vice versa). I need this applied to the first vector, with the complementary sequence printed correctly. This is what I tried before getting stuck:
pilot_sequence <- c("C","G","A","T","C","C","T","A","T")
complement_sequence_display <- function(pilot_sequence){
complement_chain_Incom <- gsub("A", "T", pilot_sequence)
complement_chain <- paste(complement_chain_Incom, collapse = "")
cat("Complement sequence: ", complement_chain, "\n")
}
complement_chain_Incom <- gsub("A","T", pilot_sequence)
complement_chain <- paste(complement_chain_Incom, collapse= "")
complement_sequence_display(pilot_sequence)
The answer I got was CGTTCCTTT. Just the second and penultimate T are correct; how do I handle the rest of the letters?
The pilot_sequence vector is of character type, and the function produces no execution errors.

This is an ideal use case for the chartr function:
chartr("ATGC","TACG",pilot_sequence)
output:
[1] "G" "C" "T" "A" "G" "G" "A" "T" "A"
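To see why a single-pass translation is needed here, the sketch below compares chained gsub() calls with chartr() on the collapsed sequence (pilot_string, chained and complement are illustrative names). With chained calls, the second substitution undoes the first:

```r
pilot_sequence <- c("C", "G", "A", "T", "C", "C", "T", "A", "T")
pilot_string <- paste(pilot_sequence, collapse = "")

# Chained gsub(): A -> T first, then every T (old and new) -> A
chained <- gsub("T", "A", gsub("A", "T", pilot_string))
chained
# [1] "CGAACCAAA"

# chartr() translates all four letters in a single pass
complement <- chartr("ATGC", "TACG", pilot_string)
complement
# [1] "GCTAGGATA"
```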

You can do this with purrr::map_chr and dplyr::case_when:
pilot_sequence |> purrr::map_chr(~dplyr::case_when(
.x == "T" ~ "A",
.x == "G" ~ "C",
.x == "A" ~ "T",
.x == "C" ~ "G"
))
#> [1] "G" "C" "T" "A" "G" "G" "A" "T" "A"

You can use recode from dplyr
library(dplyr)
recode(pilot_sequence, "C" = "G", "G" = "C", "A" = "T", "T" = "A")
Or, in base R, create a named vector, use match to find the position of each value in it, and then call names to get the corresponding names:
pilot_sequence <- c("C","G","A","T","C","C","T","A","T")
values = c("G" = "C", "C" = "G", "A" = "T", "T" = "A")
names(values[match(pilot_sequence, values)])
# [1] "G" "C" "T" "A" "G" "G" "A" "T" "A"

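A named vector can also be indexed by name directly, which avoids the match/names round trip. This is a minimal sketch; complement_of is an illustrative name, with each base as the name and its complement as the value:

```r
pilot_sequence <- c("C", "G", "A", "T", "C", "C", "T", "A", "T")

# name = original base, value = its complement
complement_of <- c(A = "T", T = "A", C = "G", G = "C")

# character indexing looks each base up by name; unname() drops the labels
complement <- unname(complement_of[pilot_sequence])
complement
# [1] "G" "C" "T" "A" "G" "G" "A" "T" "A"
```
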
replace duplicate characters from strings

I am trying to remove duplicate characters from strings.
dput(test)
c("APAAAAAAAAAAAPAAPPAPAPAAAAAAAAAAAAAAAAAAAAAAAAPPAPAAAAAAPPAPAAAPAPAAAAP",
"AAA", "P", "P", "A", "P", "P", "APPPPPA", "A", "P", "AA", "PP",
"PPA", "P", "P", "A", "P", "APAP", "P", "PA")
I created a function to sort the characters of each string:
strSort <- function(x)
sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
Then I use gsub to remove consecutive repeated characters:
gsub("(.)\\1{2,}", "\\1", strSort(test))
This gives the following output:
[1] "AP" "A" "P" "P" "A" "P" "P" "AAP" "A" "P" "AA" "PP" "APP" "P" "P" "A" "P" "AAPP" "P" "AP"
The output should contain only one A and/or one P.

Using a regex, you can do:
gsub('(?:(.)(?=(.*)\\1))', '', test, perl = TRUE)
#[1] "AP" "A" "P" "P" "A" "P" "P" "PA" "A" "P" "A" "P" "PA"
#[14] "P" "P" "A" "P" "AP" "P" "PA"
The regex has been taken from here.
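The lookahead removes every character that occurs again later in the string, so the last occurrence of each character is the one that survives. That is why some results read "PA" rather than "AP". A short sketch on one element of test (x, last_kept and first_kept are illustrative names):

```r
x <- "APPPPPA"

# drop each character that still appears further to the right
last_kept <- gsub('(?:(.)(?=(.*)\\1))', '', x, perl = TRUE)
last_kept
# [1] "PA"

# for comparison: unique() keeps the first occurrence instead
first_kept <- paste(unique(strsplit(x, "")[[1]]), collapse = "")
first_kept
# [1] "AP"
```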

In the strsplit output, we need to use unique on the sorted elements
sapply(strsplit(test, ""), function(x)
paste(unique(sort(x)), collapse=""))
#[1] "AP" "A" "P" "P" "A" "P" "P" "AP" "A" "P" "A" "P" "AP" "P" "P" "A" "P" "AP" "P" "AP"

Here is another option using utf8ToInt + intToUtf8
> sapply(test, function(x) intToUtf8(sort(unique(utf8ToInt(x)))), USE.NAMES = FALSE)
[1] "AP" "A" "P" "P" "A" "P" "P" "AP" "A" "P" "A" "P" "AP" "P" "P"
[16] "A" "P" "AP" "P" "AP"

Print vowels from the vector

I need to extract the vowels from R's built-in LETTERS vector:
"A", "E", etc.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
[25] "Y" "Z"
Maybe someone knows how to do it with if() or other functions. Thank you in advance.

It looks like you need to extract the vowels. Does this work?
> vowels <- c('A','E','I','O','U')
> LETTERS[sapply(vowels, function(ch) grep(ch, LETTERS))]
[1] "A" "E" "I" "O" "U"
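A more direct way to express the same selection, offered as a sketch, is a logical index built with %in% or grepl (by_membership and by_pattern are illustrative names):

```r
vowels <- c("A", "E", "I", "O", "U")

# membership test: TRUE where a letter is one of the vowels
by_membership <- LETTERS[LETTERS %in% vowels]
by_membership
# [1] "A" "E" "I" "O" "U"

# pattern test: a character class matches any single vowel
by_pattern <- LETTERS[grepl("[AEIOU]", LETTERS)]
by_pattern
# [1] "A" "E" "I" "O" "U"
```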

Expand a vector by semicolon in some elements in R

Suppose I have a vector in R:
x<-c("a", "b", "c;d", "e", "f;g;h;i;j")
My question is how to expand x by the separator ";". The desired output would be:
x
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

With strsplit:
unlist(strsplit(x, split = ";"))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
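If the separator may be surrounded by spaces, the pieces can be trimmed after splitting. The input x2 below is a made-up variation on the question's vector and the cleanup step is an assumption about what such data would need:

```r
# Made-up input with stray spaces around the separator
x2 <- c("a", "b", "c; d", "e ;f")

# fixed = TRUE treats ";" as a literal string rather than a regex
pieces <- unlist(strsplit(x2, ";", fixed = TRUE))

# trimws() removes leading and trailing whitespace from each piece
expanded <- trimws(pieces)
expanded
# [1] "a" "b" "c" "d" "e" "f"
```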

how to repeat column names with a specific frequency in R

I have a 10x5 matrix. Each of the five columns is named.
I need to create a vector like this:
c( rep(colnames(mymatrix)[1], dim(mymatrix)[1]),
rep(colnames(mymatrix)[2], dim(mymatrix)[1]),
...
rep(colnames(mymatrix)[5], dim(mymatrix)[1]))
However, what if I have a varying number of columns? How do I automate this without using a for loop?
Thanks!

You can do this with the each argument to rep:
rep(colnames(mymatrix), each=dim(mymatrix)[1])
To see how this works, you can try:
v = c("h", "e", "l", "l", "o")
rep(v, each=5)
# [1] "h" "h" "h" "h" "h" "e" "e" "e" "e" "e" "l" "l" "l" "l" "l" "l" "l" "l" "l"
# [20] "l" "o" "o" "o" "o" "o"
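Applied to a small matrix (a made-up 3x2 example, with nrow() used as shorthand for dim(...)[1]), the difference between each and times looks like this:

```r
# Made-up matrix with two named columns
mymatrix <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y")))

# each: repeat every column name nrow() times before moving to the next
by_each <- rep(colnames(mymatrix), each = nrow(mymatrix))
by_each
# [1] "x" "x" "x" "y" "y" "y"

# times: repeat the whole vector of names nrow() times
by_times <- rep(colnames(mymatrix), times = nrow(mymatrix))
by_times
# [1] "x" "y" "x" "y" "x" "y"
```

The each form lines up with as.vector(mymatrix), because R stores matrices column by column.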
