Finding matching values in two vectors of different lengths in R

I have two vectors of species names that follow two different naming methods. Some names are the same, others differ, and the two vectors are sorted differently. An example:
List 1: c(Homo sapiens sapiens, Homo sapiens neanderthalensis, Homo erectus,...,n)
List 2: c(Homo erectus, Homo sapiens, Homo neanderthalensis,...,n+1)
I write n and n+1 to denote that these lists have different lengths.
I would like to create a new list that holds one of two values at each position: where there is a match between the two vectors (e.g. Homo erectus), the name from List 2 at the position the name has in List 1; where there is no match, a "0" at that position in List 1. So in this case the new list would be newlist: c(0, 0, Homo erectus,...)
For this I have written the following code, but it does not work.
data<-read.table("species.txt",sep="\t",header=TRUE)
list1<-as.vector(data$Species1)
list2<-as.vector(data$Species2)
newlist<-as.character(rep(0,length(list1)))
for (i in 1:length(list1)){
for (j in 1:length(list2)){
if(list1[i] == list2[j]){newlist[i]<- list2[j]}else {newlist[i]= 0}
}
}
I hope this is clear.
Thanks for any help!

Take this reproducible example:
set.seed(1)
list1 <- letters[1:10]
list2 <- letters[sample(1:10, 10)]
If the two vectors have the same length and you only want to compare them position by position, you can avoid a loop using ifelse:
newlist <- ifelse(list1==list2, list2, 0)
The issue in your code is that newname was never declared; did you mean newlist?
If you want to use a loop, one loop is enough instead of two, because here length(list1) == length(list2):
for (i in 1:length(list1)){
if(list1[i] == list2[i]){newlist[i]<- list2[i]}else {newlist[i]= 0}
}
In general if you want to match elements in vectors you can use match like this:
> list1
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
> list2
[1] "c" "d" "e" "g" "b" "h" "i" "f" "j" "a"
> match(list1, list2)
[1] 10 5 1 2 3 8 4 6 7 9
As you can see, match returns the indexes in list2 of the elements that are equal to the elements of list1. This is useful when you have another table data2 and want to fetch, for each element of data$list1, the value of a column in data2 from the row where data2$list3 matches. You would use:
data <- data.frame(list1, list2)
list3 <- list2
columntoget <- 1:length(list2)
data2 <- data.frame(list3, columntoget)
data$mynewcolumn <- data2$columntoget[match(data$list1, data2$list3)]
> data$mynewcolumn
[1] 10 5 1 2 3 8 4 6 7 9
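For the original question, where the two vectors have different lengths, match can be combined with ifelse. The following is a minimal sketch and not the asker's data: the fourth name in list2 ("Homo habilis") is made up here only so that the lengths differ.

```r
list1 <- c("Homo sapiens sapiens", "Homo sapiens neanderthalensis", "Homo erectus")
# "Homo habilis" is a hypothetical extra entry so that list2 is longer than list1
list2 <- c("Homo erectus", "Homo sapiens", "Homo neanderthalensis", "Homo habilis")

# position in list2 of each element of list1, NA where there is no match
idx <- match(list1, list2)

# name taken from list2 where matched, "0" otherwise, in the order of list1
newlist <- ifelse(is.na(idx), "0", list2[idx])
newlist
# [1] "0"            "0"            "Homo erectus"
```

Because match works on the whole vector at once, no loop is needed and the lengths of the two vectors do not have to agree.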

I'm not completely certain that I understand what you're trying to achieve, but I think this does what you're after.
list1 <- c("Homo sapiens sapiens","Homo sapiens neanderthalensis","Homo erectus")
list2 <- c("Homo erectus","Homo sapiens","Homo neanderthalensis")
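# look each name up in list2 itself; indexing list2 by the position in list1 would return the wrong element
sapply(list1, function(x) { ifelse(x %in% list2, list2[match(x, list2)], 0) } )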

The inner for loop uses newname[i] where it should be newlist[i].
Using your code, you overwrite the newlist[i] entries j times with either 0 or a species name. This is probably not what you want.
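To illustrate the overwriting problem: with the inner loop as written, newlist[i] ends up reflecting only the comparison with the last element of list2. A sketch of a corrected nested loop, using made-up vectors, starts from "0" everywhere and only assigns on a match:

```r
list1 <- c("a", "b", "c")
list2 <- c("c", "x", "a", "y")   # hypothetical data, longer than list1

newlist <- rep("0", length(list1))   # default: no match found
for (i in seq_along(list1)) {
  for (j in seq_along(list2)) {
    if (list1[i] == list2[j]) {
      newlist[i] <- list2[j]   # keep the match ...
      break                    # ... and stop scanning list2
    }
  }
}
newlist
# [1] "a" "0" "c"
```

There is no else branch, so a later non-matching element of list2 can no longer reset an earlier match to 0.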

Related

How to use grep to search for patterns matches within a list of data frames using a second list of character vectors in R

I have two lists in R. One is a list of data frames with rows that contain strings (List 1). The other is a list (of the same length) of characters (List 2). I would like to go through the lists in a parallel fashion taking the character string from List 2 and searching for it to get its position (using grep) in the data frame at the corresponding element in List 1. Here is a toy example to show what my lists look like:
List1 <- list(data.frame(a = c("other","other","dog")),
data.frame(a = c("cat","other","other")),
data.frame(a = c("other","other","bird")))
List2 <- list("a" = c("dog|xxx|xxx"),
"a" = c("cat|xxx|xxx"),
"a" = c("bird|xxx|xxx"))
The output I would like to get would be a list of the position in each data frame in List 1 of the pattern match i.e. in this example the positions would be 3, 1 & 3. So the list would be:
[[1]]
[1] 3
[[2]]
[1] 1
[[3]]
[1] 3
I cannot seem to figure out how to do this.
I tried lapply:
NewList1 <- lapply(1:length(List1),
function(x) grep(List2[[x]]))
But that does not work. I also tried purrr::map2:
NewList2<-map2(List2, List1, grep(List2$A, List1))
This also does not work. I would be very grateful for any suggestions anyone may have as to how to fix this. Many thanks to anyone willing to wade in!
Try Map + unlist
> Map(grep, List2, unlist(List1, recursive = FALSE))
$a
[1] 3
$a
[1] 1
$a
[1] 3
Using Map you can do -
Map(function(x, y) grep(y, x$a), List1, List2)
#[[1]]
#[1] 3
#[[2]]
#[1] 1
#[[3]]
#[1] 3
The map2 attempt was close, but you need to refer to the list elements as .x and .y inside the function.
purrr::map2(List2, List1, ~grep(.x, .y$a))
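The lapply attempt from the question can be repaired in the same spirit: there, grep was given only the pattern and not the vector to search. A sketch, assuming the column is called a in every data frame and using stringsAsFactors = FALSE so that the columns are plain character vectors:

```r
List1 <- list(
  data.frame(a = c("other", "other", "dog"),  stringsAsFactors = FALSE),
  data.frame(a = c("cat", "other", "other"),  stringsAsFactors = FALSE),
  data.frame(a = c("other", "other", "bird"), stringsAsFactors = FALSE)
)
List2 <- list(a = "dog|xxx|xxx", a = "cat|xxx|xxx", a = "bird|xxx|xxx")

# walk both lists by position; pass pattern and search vector to grep
NewList1 <- lapply(seq_along(List1),
                   function(i) grep(List2[[i]], List1[[i]]$a))
NewList1
# [[1]]
# [1] 3
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 3
```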

Function to look for different patterns in specific positions in a string in R

I am trying to create a function in R that searches strings for a specific letter at a specific position; if the letter is present at the given position, I want to count it.
Example dataset:
library(dplyr)
mutations <- tibble(
"position" = c(9,10),
"AA" = c("G","G"))
strings <- c("EVQLVESGGGLAKPG",
"VQLVESGGGLAKPGGS",
"EVQLVESGGALAKPGGSLRLSCAAS")
So, in this case, I want to look at positions 9 and 10; if there is a letter "G" there, I want to count it.
Expected dataframe or tibble output:
| string | mut_counts |
|________|____________|
| 1 | 2 |
|________|____________|
| 2 | 1 |
|________|____________|
| 3 | 1 |
|________|____________|
In this example, all strings have a "G" at position 9, so they would all get 1, and only one of the three sequences has a "G" at position 10, so this sequence will have 2.
I am trying to use str_locate_all() from the stringr package to be able to locate the positions and then compare with my dataframe to count but I am failing to get what I wanted.
library(stringr)
.class_mutations <- function(sequences, mutations){
.count_pattern <- function(x){
df <- sum(as.integer(locating_all_patterns[[x]][,"start"] == mutations$position[mut]))
}
for(mut in nrow(mutations)){
locating_all_patterns <- str_locate_all(pattern = mutations$AA[mut], sequences)
counting_patterns <- lapply(locating_all_patterns, .count_pattern)
}
return(counting_patterns)
}
.class_mutations(strings, mutations)
I am getting this error: Error in locating_all_patterns[[x]] : no such index at level 1. Besides, if you have a better/faster way to do this, I would also appreciate it. This is going to be applied to thousands of strings, so I should avoid slow functions.
Thank you
base R
rowSums(outer(strings, seq_len(nrow(mutations)),
function(st, i) {
substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
}))
# [1] 2 1 1
Walk-through:
outer effectively just produces two vectors, an expansion of the cartesian product of the two arguments. If we insert a browser() as the first line of the inner anon-func, we'd see
data.frame(st, i)
# st i
# 1 EVQLVESGGGLAKPG 1
# 2 VQLVESGGGLAKPGGS 1
# 3 EVQLVESGGALAKPGGSLRLSCAAS 1
# 4 EVQLVESGGGLAKPG 2
# 5 VQLVESGGGLAKPGGS 2
# 6 EVQLVESGGALAKPGGSLRLSCAAS 2
(Shown as a frame only for a columnar presentation. Both st and i are simple vectors.)
From here, knowing that substr is vectorized across all arguments, then a single call to substr will find the ith character in each of the strings.
The result of the substr is a vector of letters. Continuing the same browser() session from above,
substr(st, mutations$position[i], mutations$position[i])
# [1] "G" "G" "G" "G" "L" "A"
mutations$AA[i]
# [1] "G" "G" "G" "G" "G" "G"
substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
# [1] TRUE TRUE TRUE TRUE FALSE FALSE
The mutations$AA[i] shows us what we're looking for. A nice thing about the vectorized method here is that mutations$AA[i] will always be the same length and in the expected order of letters retrieved by substr(.).
The outer itself returns a matrix, with length(X) rows and length(Y) columns (X and Y are the first and second args to outer, respectively).
outer(strings, seq_len(nrow(mutations)),
function(st, i) {
substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
})
# [,1] [,2]
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] TRUE FALSE
The number of correct mutations found in each string is just a sum of each row. (Ergo rowSums.)
If you're concerned about a large number of mutations and strings, you can replace the outer and iterate over each row of mutations instead:
rowSums(sapply(seq_len(nrow(mutations)), function(i) substr(strings, mutations$position[i], mutations$position[i]) == mutations$AA[i]))
# [1] 2 1 1
This calls substr once for each mutations row, so if the outer-explosion is too much, this might reduce the memory footprint.
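To get the table shape shown in the question, the per-mutation loop can be wrapped in a small helper. This is only a sketch: count_mutations is a made-up name, mutations is built as a plain data.frame rather than a tibble to avoid extra packages, and vapply is swapped in for sapply so the result keeps its matrix shape even for a single string.

```r
mutations <- data.frame(position = c(9, 10), AA = c("G", "G"),
                        stringsAsFactors = FALSE)
strings <- c("EVQLVESGGGLAKPG",
             "VQLVESGGGLAKPGGS",
             "EVQLVESGGALAKPGGSLRLSCAAS")

# hypothetical helper: one row per string, counting mutations found at their positions
count_mutations <- function(strings, mutations) {
  hits <- vapply(seq_len(nrow(mutations)), function(i) {
    substr(strings, mutations$position[i], mutations$position[i]) == mutations$AA[i]
  }, logical(length(strings)))
  # force a strings-by-mutations matrix, also when there is only one string
  hits <- matrix(hits, nrow = length(strings))
  data.frame(string = seq_along(strings), mut_counts = rowSums(hits))
}

res <- count_mutations(strings, mutations)
res
#   string mut_counts
# 1      1          2
# 2      2          1
# 3      3          1
```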
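For a base R option, we can make use of the string functions. This approach compares the length of each substring before and after removing the target character. Note that it relies on the positions being adjacent (9 and 10) and on both expecting the same letter: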
nchar(substr(strings, 9, 10)) -
nchar(gsub("G", "", substr(strings, 9, 10), fixed=TRUE))
[1] 2 1 1
Data:
strings <- c("EVQLVESGGGLAKPG",
"VQLVESGGGLAKPGGS",
"EVQLVESGGALAKPGGSLRLSCAAS")

Using R to compare sub-elements within a string ... and summarize it so that there are no duplicate sub-elements

I have a character vector like this:
data <- c("A:B:C", "A:B", "E:F:G", "H:I:J", "B:C:D")
I want to convert this to a vector of:
c("A:B:C:D", "E:F:G", "H:I:J")
The idea is that each element inside the string is another string of sub-elements (e.g. A, B, C) that have been pasted together (with sep=":"). Each element within the string is compared with all other elements to look for common sub-elements, and elements with common sub-elements are combined.
I don't care about the order of the string (or order of the sub-elements) FWIW.
Thanks for any help offered!
--
Answers so far...
I liked d.b's suggestion - not the least because it stayed in base R. However, with a more complicated larger set, it wasn't working perfectly until everything was run again. With an even more complicated dataset, re-running everything more than twice might be needed.
I had more difficulty with thelatemail's suggestion. I had to upgrade R to use lengths, and I then had to figure out how to get to the end point because the answer was incomplete. In any case, this was how I got to the end (I suspect there is a better way). This worked with a larger set without a hitch.
library(igraph)
spl <- strsplit(data,":")
combspl <- data.frame(
grp = rep(seq_along(spl),lengths(spl)),
val = unlist(spl)
)
cl <- clusters(graph.data.frame(combspl))$membership[-(1:length(spl))]
dat <- data.frame(cl) # after getting nowhere working with the list as formatted
dat[,2] <- row.names(dat)
a <- character(0)
for (i in 1:max(cl)) {
a[i] <- paste(paste0(dat[(dat[,1] == i),][,2]), collapse=":")
}
a
#[1] "A:B:C:D" "E:F:G" "H:I:J"
I'm going to leave this for now as is.
A possible application for the igraph library, if you think of your values as an edgelist of paired groups:
library(igraph)
spl <- strsplit(data,":")
combspl <- data.frame(
grp = rep(seq_along(spl),lengths(spl)),
val = unlist(spl)
)
cl <- clusters(graph.data.frame(combspl))$membership[-(1:length(spl))]
#A B C E F G H I J D
#1 1 1 2 2 2 3 3 3 1
split(names(cl),cl)
#$`1`
#[1] "A" "B" "C" "D"
#
#$`2`
#[1] "E" "F" "G"
#
#$`3`
#[1] "H" "I" "J"
Or as collapsed text:
sapply(split(names(cl),cl), paste, collapse=":")
# 1 2 3
#"A:B:C:D" "E:F:G" "H:I:J"
a = character(0)
for (i in 1:length(data)){
a[i] = paste(unique(unlist(strsplit(data[sapply(1:length(data), function(j)
any(unlist(strsplit(data[i],":")) %in% unlist(strsplit(data[j],":"))))],":"))), collapse = ":")
}
unique(a)
#[1] "A:B:C:D" "E:F:G" "H:I:J"
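The question notes that the loop above may need to be re-run on more complicated data, because groups that are only linked through a chain of shared sub-elements are not merged in one pass. A base R sketch that avoids the re-running keeps a list of disjoint groups and, for each new element, merges every existing group it overlaps with. merge_groups is a made-up name, and this is one possible approach rather than the method of either answer.

```r
data <- c("A:B:C", "A:B", "E:F:G", "H:I:J", "B:C:D")

# hypothetical helper: merge elements that share any sub-element
merge_groups <- function(x, sep = ":") {
  groups <- lapply(strsplit(x, sep, fixed = TRUE), unique)
  merged <- list()   # invariant: groups in 'merged' never share a sub-element
  for (g in groups) {
    overlaps <- vapply(merged, function(m) any(g %in% m), logical(1))
    if (any(overlaps)) {
      # absorb every group that shares a sub-element with g
      g <- unique(c(unlist(merged[overlaps]), g))
      merged <- merged[!overlaps]
    }
    merged[[length(merged) + 1]] <- g
  }
  vapply(merged, paste, character(1), collapse = sep)
}

out <- merge_groups(data)
out
# [1] "E:F:G"   "H:I:J"   "A:B:C:D"
```

The order of the result differs from the desired output in the question, which the asker said does not matter.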

How to concatenate two DNAStringSet sequences per sample in R?

I have two large DNAStringSet objects. Each contains 2805 entries, and each entry has a length of 201. I want to combine them entry by entry, so that the result is a single object that still has 2805 entries, each being the combination of the corresponding entries from both.
I tried to do this
s12 <- c(unlist(s1), unlist(s2))
But that created a single large DNAString object with 1127610 elements, which is not what I want. I simply want to combine them per sample.
EDIT:
Each entry in my DNAStringSet objects, named s1 and s2, has a format similar to this:
width seq
[1] 201 CCATCCCAGGGGTGATGCCAAGTGATTCCA...CTAACTCTGGGGTAATGTCCTGCAGCCGG
You can convert each DNAStringSet to a character vector. For example:
library(Biostrings)
set1 <- DNAStringSet(c("GCT", "GTA", "ACGT"))
set2 <- DNAStringSet(c("GTC", "ACGT", "GTA"))
as.character(set1)
as.character(set2)
Then paste them together into a DNAStringSet:
DNAStringSet(paste0(as.character(set1), as.character(set2)))
Since you're using DNAStringSet, which is in the Biostrings package, I recommend using the package's own functions for dealing with XStringSets. Using base R functions would take more time because they need unnecessary conversions.
So you can use the Biostrings xscat function. For example:
library(Biostrings)
set1 <- DNAStringSet(c("GCT", "GTA", "ACGT"))
set2 <- DNAStringSet(c("GTC", "ACGT", "GTA"))
xscat(set1, set2)
The result would be:
DNAStringSet object of length 3:
width seq
[1] 6 GCTGTC
[2] 7 GTAACGT
[3] 7 ACGTGTA
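As a side note on the 1127610 elements reported in the question: that number is 2 × 2805 × 201, i.e. every letter of both sets flattened into one sequence. The sketch below reproduces the arithmetic with plain character vectors as stand-ins; s1_chr and s2_chr are made-up placeholders and no Biostrings function is involved.

```r
n <- 2805   # entries per set, as stated in the question
w <- 201    # width of each entry, as stated in the question

# placeholder sequences, only the sizes matter here
s1_chr <- rep(strrep("A", w), n)
s2_chr <- rep(strrep("C", w), n)

# flattening everything gives one long sequence
flattened <- paste0(c(s1_chr, s2_chr), collapse = "")
nchar(flattened)
# [1] 1127610

# element-wise concatenation keeps one entry per sample
per_sample <- paste0(s1_chr, s2_chr)
length(per_sample)
# [1] 2805
unique(nchar(per_sample))
# [1] 402
```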
If your goal is to return a list where each element is the concatenation of the corresponding elements from the original lists, resulting in a list of length 2805 where each element has a length of 402, you can achieve this with Map. Here is an example with a smaller pair of lists.
# set up the lists
set.seed(1234)
list.a <- list(a=1:5, b=letters[1:5], c=rnorm(5))
list.b <- list(a=6:10, b=letters[6:10], c=rnorm(5))
Each list contains 3 elements, which are vectors of length 5. Now, concatenate the lists by list position with Map and c:
Map(c, list.a, list.b)
$a
[1] 1 2 3 4 5 6 7 8 9 10
$b
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$c
[1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559
-0.5747400 -0.5466319 -0.5644520 -0.8900378
For your problem as you have described it, you would use
s12 <- Map(c, s1, s2)
The first argument of Map is a function that tells Map what to do with the list items that you have given it. Above, those lists are list.a and list.b; in your example, they are s1 and s2.

R - preserve order when using matching operators (%in%)

I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces > [1] 1 2 3 which is the order "a" "b" "c" appears in the data frame. However, I would like to generate >[1] 2 1 3 which is the order they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
Be aware that if you have duplicate values in either vec or rownames(df), match may not behave as expected.
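A short sketch of the duplicate caveat: match only returns the position of the first match, so later duplicates in the lookup vector are silently ignored. Data frame row names cannot be duplicated, so the example uses a made-up key vector (keys) with a parallel value vector (vals) instead.

```r
keys <- c("a", "b", "a", "c")   # hypothetical lookup keys with a duplicated "a"
vals <- 1:4
vec  <- c("b", "a")

# match: first hit only, the second "a" (value 3) is not returned
first_only <- vals[match(vec, keys)]
first_only
# [1] 2 1

# all hits, still in the order given by vec
all_hits <- unlist(lapply(vec, function(v) vals[keys == v]))
all_hits
# [1] 2 1 3
```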
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
Use match, and drop the NA values produced for elements that have no match in the other vector:
Filter(function(x) !is.na(x), match(rownames(df), vec))
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
xSeq <- seq(along = x)
names(xSeq) <- x
Out <- xSeq[as.character(table)]
Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, assigning names() converts x to character automatically and table is converted with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
Following %in%, missing values are removed:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subset; this is not the case with %ino%:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A
