Group a DNA sequence in codons - r

I have generated a random DNA sequence
base <- c("A","G","U")
seq <- sample(base, 15, replace = T)
[1] "A" "G" "A" "U" "A" "G" "U" "A" "U" "A" "G" "U" "G" "U" "G"
How can I group the resulting sequence into codons (sets of three nucleotides) so that I can look for stop codons?
I need something like this:
new_seq <- c("AGA","UAG", "UAU", "AGU", "GUG")

Convert to 3 column matrix, then paste:
base <- c("A","G","U")
set.seed(1); x <- sample(base, 15, replace = T)
x
# [1] "A" "U" "A" "G" "A" "U" "U" "G" "G" "U" "U" "A" "A" "A" "G"
do.call(paste0, as.data.frame(matrix(x, ncol = 3, byrow = TRUE)))
# [1] "AUA" "GAU" "UGG" "UUA" "AAG"
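Since the goal in the question is to look for stop codons, here is a minimal sketch of that next step. It assumes the sequence printed in the question (typed in by hand, not re-sampled) and the three standard RNA stop codons; the names stop_codons and stop_pos are only illustrative.

```r
# Sequence copied from the question's printed output
x <- c("A", "G", "A", "U", "A", "G", "U", "A", "U", "A", "G", "U", "G", "U", "G")

# Group into codons with the matrix approach (length must be a multiple of 3)
codons <- do.call(paste0, as.data.frame(matrix(x, ncol = 3, byrow = TRUE)))

# Standard RNA stop codons (assumed to be the ones of interest)
stop_codons <- c("UAA", "UAG", "UGA")

# Positions of the codons that are stop codons
stop_pos <- which(codons %in% stop_codons)

codons
# [1] "AGA" "UAG" "UAU" "AGU" "GUG"
stop_pos
# [1] 2
```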

We can use gl to create the grouping index, and tapply to paste the elements of each group together:
unname(tapply(seq, as.integer(gl(length(seq), 3,
length(seq))), FUN = paste, collapse=""))
#[1] "GAU" "UUG" "AAG" "GGU" "AGA"
NOTE: This also works when the length is not a multiple of 3.
Another option is to paste everything into a single string and then split it:
strsplit(paste(seq, collapse=""), "(?<=...)", perl = TRUE)[[1]]
#[1] "GAU" "UUG" "AAG" "GGU" "AGA"
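To illustrate what happens when the length is not a multiple of 3, here is a small sketch with a 7-base sequence. It swaps in ceiling(seq_along(x) / 3) as the grouping index and split/vapply for the pasting; that is an alternative chosen for the illustration, not the code from the answers above.

```r
# 7 bases: two complete codons plus one leftover base
x <- c("A", "G", "A", "U", "A", "G", "U")

# Group index: 1 1 1 2 2 2 3
grp <- ceiling(seq_along(x) / 3)

# Paste each group into one string; the last group is incomplete
codons <- unname(vapply(split(x, grp), paste, character(1), collapse = ""))
codons
# [1] "AGA" "UAG" "U"

# Keep only complete codons before looking for stop codons
complete <- codons[nchar(codons) == 3]
complete
# [1] "AGA" "UAG"
```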

Related

How to get an output vector of unique elements satisfying two conditions?

I'm trying to get a list of unique elements based on conditions on two columns in R.
For example, I have 4 groups and I want to get the unique names of the participants who are in group 1.
This requires specifying the two conditions in the code:
Unique(df$participants XXX_group_XXX).
How do I code this condition so that the output vector satisfies both conditions?
A simple solution using only base R:
set.seed(7*11*13)
name <- sample(LETTERS[1:10], 100, replace=TRUE)
G <- sample(1:5, 100, replace=TRUE)
U <- tapply(name, G, unique)
> U
$`1`
[1] "G" "F" "D" "B" "J" "A" "E" "H" "C"
$`2`
[1] "C" "J" "D" "B" "F" "G"
$`3`
[1] "C" "G" "H" "D" "F" "E" "I" "B" "J"
$`4`
[1] "F" "B" "G" "E" "I" "C" "H" "D" "J"
$`5`
[1] "G" "D" "A" "H" "F" "E" "B" "J" "C"
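To pull out a single group from the list that tapply returns, you can index it by the group label. A minimal sketch with a tiny hand-made data set (not the random data above, so the output is predictable):

```r
# Small hand-made example data
name <- c("A", "B", "A", "C", "B", "D")
G    <- c(1, 1, 1, 2, 2, 1)

# Unique names per group, as in the answer above
U <- tapply(name, G, unique)

# Group labels are stored as character names, so index with "1"
group1 <- U[["1"]]
group1
# [1] "A" "B" "D"

# Equivalent one-liner that skips the intermediate list
unique(name[G == 1])
# [1] "A" "B" "D"
```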
Would this work for you? I create a data frame first, then filter for the group you wish to see and get the unique values for that group.
library(dplyr)
set.seed(123) # make the random data reproducible
# create some data
data <- data.frame(
name = sample(LETTERS, size = 100, replace = TRUE),
group = sample(c(1, 2, 3, 4), size = 100, replace = TRUE)
)
# base R
unique(data[data$group == 1, 1])
# or:
unique(data[data$group == 1, "name"])
# tidyverse
data %>%
filter(group == 1) %>%
distinct(name) %>%
pull() # if you want a vector to be returned
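If you need this for several groups or columns, the base R version can be wrapped in a small helper. This is only a sketch: unique_in_group and the column names participants and group are made up for the example, and rows with a missing group are dropped on purpose.

```r
# Hypothetical helper: unique values of one column within one group
unique_in_group <- function(df, value_col, group_col, group) {
  # Keep rows where the group is known and equals the requested group
  keep <- !is.na(df[[group_col]]) & df[[group_col]] == group
  unique(df[[value_col]][keep])
}

# Small hand-made example data (assumed column names)
df <- data.frame(
  participants = c("Ann", "Bob", "Ann", "Cid", "Bob"),
  group        = c(1, 1, 1, 2, NA),
  stringsAsFactors = FALSE
)

res1 <- unique_in_group(df, "participants", "group", 1)
res1
# [1] "Ann" "Bob"
res2 <- unique_in_group(df, "participants", "group", 2)
res2
# [1] "Cid"
```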

Show adjacent members in a list

I want to inspect adjacent elements in a list based on a match. For example, in a list of randomly ordered letters, I want to know what the neighboring letters of m are. My current solution is:
library(stringr)
ltrs <- sample(letters)
ltrs[(str_which(ltrs,'m')-2):(str_which(ltrs,'m')+2)]
[1] "j" "f" "m" "q" "a"
To me, the repetition of str_which() feels unnecessary. Is there a simpler way to achieve the same result?
First, I regenerate random data with a seed for reproducibility:
set.seed(42)
ltrs <- sample(letters)
ltrs
# [1] "q" "e" "a" "j" "d" "r" "z" "o" "g" "v" "i" "y" "n" "t" "w" "b" "c" "p" "x"
# [20] "l" "m" "s" "u" "h" "f" "k"
Use -2:2 and then, as a precaution, remove indices below 1 or above the length of the vector:
ind <- -2:2 + which(ltrs == "m")
ind <- ind[0 < ind & ind <= length(ltrs)]
ltrs[ind]
# [1] "x" "l" "m" "s" "u"
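The bounds check matters when the match sits near either end of the vector. Here is a sketch of the same idea wrapped in a function, using the built-in letters in alphabetical order so the results are predictable. The name neighbors and the argument k are made up for the example, and only the first match is used.

```r
# Hypothetical helper: the element matching `target` plus up to k neighbors
# on each side, clamped to the bounds of the vector
neighbors <- function(x, target, k = 2) {
  pos <- which(x == target)[1]     # first match only
  if (is.na(pos)) return(x[0])     # no match: empty vector of the same type
  idx <- seq(max(1, pos - k), min(length(x), pos + k))
  x[idx]
}

neighbors(letters, "m")
# [1] "k" "l" "m" "n" "o"
neighbors(letters, "y")   # window is cut off at the end
# [1] "w" "x" "y" "z"
neighbors(letters, "a")   # window is cut off at the start
# [1] "a" "b" "c"
```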
If your target is more than one (not just "m"), then we can use a different approach.
ind <- which(ltrs %in% c("m", "f"))
ind <- lapply(ind, function(z) { z <- z + -2:2; z[0 < z & z <= length(ltrs)]; })
ind
# [[1]]
# [1] 19 20 21 22 23
# [[2]]
# [1] 23 24 25 26
lapply(ind, function(z) ltrs[z])
# [[1]]
# [1] "x" "l" "m" "s" "u"
# [[2]]
# [1] "u" "h" "f" "k"
Or, if you don't care about keeping them grouped, we can try this:
ind <- which(ltrs %in% c("m", "f"))
ind <- unique(sort(outer(-2:2, ind, `+`)))
ind <- ind[0 < ind & ind <= length(ltrs)]
ltrs[ind]
# [1] "x" "l" "m" "s" "u" "h" "f" "k"
If you don't have duplicates, you can try the code below:
ltrs[seq_along(ltrs) %in% (which(ltrs == "m") + (-2:2))]
Otherwise:
ltrs[seq_along(ltrs) %in% c(outer(which(ltrs == "m"), -2:2, `+`))]
You can also use the slider::slide function (using data provided by @r2evans):
slider::slide(ltrs, ~ .x, .before = 2, .after = 2)[[which(ltrs == "m")]]
# [1] "x" "l" "m" "s" "u"
slider::slide(ltrs, ~ .x, .before = 2, .after = 2)[which(ltrs %in% c("m","f"))]
# [[1]]
# [1] "x" "l" "m" "s" "u"
#
# [[2]]
# [1] "u" "h" "f" "k"

multiple data.table columns to one column of vectors

I have a data.table like this:
tab = data.table(V1 = c('a', 'b', 'c'),
V2 = c('d', 'e', 'f'),
V3 = c('g', 'h', 'i'),
id = c(1,2,3))
From the columns V1,V2,V3 of this table, I'd like to get for row i a vector of c(V1[i],V2[i], V3[i])
I can get a list of the desired vectors like this:
lapply(1:tab[, .N], function(x) tab[x, c(V1, V2, V3)])
Which returns:
[[1]]
[1] "a" "d" "g"
[[2]]
[1] "b" "e" "h"
[[3]]
[1] "c" "f" "i"
But I think this is probably slow and not very data.table-like.
Also, I'd like to generalize it, so that I don't have to explicitly type V1, V2, V3, but can instead pass a vector of column names to be processed this way.
Try this?
> asplit(unname(tab[, V1:V3]), 1)
[[1]]
"a" "d" "g"
[[2]]
"b" "e" "h"
[[3]]
"c" "f" "i"
Using split
split(as.matrix(tab[, V1:V3]), tab$id)
$`1`
[1] "a" "d" "g"
$`2`
[1] "b" "e" "h"
$`3`
[1] "c" "f" "i"
as.list(transpose(tab[, .(V1, V2, V3)]))
Or as a function
tdt <- function(DT, cols) as.list(transpose(DT[, .SD, .SDcols = cols]))
tdt(tab, c('V1', 'V2', 'V3'))
# $V1
# [1] "a" "d" "g"
#
# $V2
# [1] "b" "e" "h"
#
# $V3
# [1] "c" "f" "i"
tab[, 1:3] |> transpose() |> as.list()
$V1
[1] "a" "d" "g"
$V2
[1] "b" "e" "h"
$V3
[1] "c" "f" "i"
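For the "pass a vector of column names" part, here is a sketch that needs no extra packages. It treats the table as a list of columns and zips the selected columns together with Map. The function name row_vectors is made up, and the example uses a plain data.frame; the same call should also work on a data.table, since as.list gives its columns as well, but that is an assumption I have not benchmarked or compared against transpose.

```r
# Hypothetical helper: one vector per row, built from the columns in `cols`
row_vectors <- function(d, cols) {
  columns <- unname(as.list(d)[cols])            # selected columns as a list
  unname(do.call(Map, c(list(f = c), columns)))  # combine i-th elements row by row
}

# Same values as the table in the question, as a plain data.frame
tab_df <- data.frame(V1 = c("a", "b", "c"),
                     V2 = c("d", "e", "f"),
                     V3 = c("g", "h", "i"),
                     id = c(1, 2, 3),
                     stringsAsFactors = FALSE)

res <- row_vectors(tab_df, c("V1", "V2", "V3"))
res
# [[1]]
# [1] "a" "d" "g"
#
# [[2]]
# [1] "b" "e" "h"
#
# [[3]]
# [1] "c" "f" "i"
```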

Pulling elements of list based on information in vectors of another dataframe

I have a list (my.list) and a data frame (my.dataframe). The names of each element within my.list are of a sequence and are of the same type as the elements within two variables in my.dataframe. I want to pull out elements of the list whose names fall at, within, or just outside the range of the elements of two columns in my.dataframe.
RNGkind('Mersenne-Twister')
set.seed(1)
#Create my.dataframe
my.letters <- sample(x = sample(LETTERS[1:20],
size = 13,
replace = FALSE),
size = 100,
replace = TRUE)
my.other.letters <- LETTERS[match(my.letters, LETTERS) +
sample(x = 0:5,
size = 100,
replace = TRUE)]
my.dataframe <- data.frame(col1 = my.letters,
col2 = my.other.letters)
head(my.dataframe)
col1 col2
1 D F
2 C C
3 O O
4 A E
5 T T
6 D F
#So here, I'd want to pull out elements within my.list whose names would fall within D
#and F for the first row, C for the second row, O for the third, A and E for the fourth,
#and so on.
#Create my.list
temp.data <- data.frame(a = rnorm(13*20, 10, 1),
b = rep(LETTERS[sample(1:length(LETTERS),
size = 13,
replace = FALSE)],
each = 20))
my.list <- split(x = temp.data$a, f = factor(temp.data$b))
I've used mapply() to try and do this:
mega.list <- mapply(function(f, s)my.list[which(LETTERS == f):which(LETTERS == s)], f = my.dataframe$col1, s = my.dataframe$col2)
But it only works if col1, col2, and the names of the elements in my.list contain every letter of the alphabet, which they don't. If you look at mega.list[[98]], you get an empty list because it's looking for names within my.list that fall between T and Y (my.dataframe[98,]). Seeing as there isn't a list element whose name is T, you get nothing.
sort(unique(as.character(my.dataframe$col1))); sort(unique(as.character(my.dataframe$col2))); sort(unique(names(my.list)))
[1] "A" "B" "C" "D" "F" "H" "I" "K" "N" "O" "P" "S" "T"
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "X" "Y"
[1] "A" "B" "D" "E" "F" "G" "H" "J" "K" "R" "S" "W" "Z"
Question: If the exact letter name isn't available within my.list, is there a way to select the next closest letter before or after the letters in col1 or col2, respectively? For example, if it tries to look for a letter N from col1, how could I get it to select K instead? Likewise, if it's trying to find U from col2, how can I get it to look for W instead?
I figured it out. I had to make an amendment to the mapply function: the first which call finds all names at or before f and takes the last one (with the tail function), and the second which call finds all names at or after s and takes the first one (done with [1]).
mega.list <- mapply(function(f, s)my.list[tail(which(names(my.list) <= f), n = 1):which(names(my.list) >= s)[1]], f = as.character(my.dataframe$col1), s = as.character(my.dataframe$col2))
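To see the two lookups in isolation, here is a small sketch using the names printed for my.list in the question (typed in by hand). The helper names lower_idx and upper_idx are made up. It assumes the names are sorted and that single uppercase letters compare alphabetically in your locale.

```r
# Names of my.list as printed in the question (sorted)
nm <- c("A", "B", "D", "E", "F", "G", "H", "J", "K", "R", "S", "W", "Z")

# Index of the last name at or before f
lower_idx <- function(nm, f) tail(which(nm <= f), n = 1)
# Index of the first name at or after s
upper_idx <- function(nm, s) which(nm >= s)[1]

nm[lower_idx(nm, "N")]   # N is missing, so fall back to K
# [1] "K"
nm[upper_idx(nm, "U")]   # U is missing, so move forward to W
# [1] "W"

# Names selected for a row with col1 = "N" and col2 = "U"
picked <- nm[lower_idx(nm, "N"):upper_idx(nm, "U")]
picked
# [1] "K" "R" "S" "W"
```

Note that if no name lies at or before f, or none at or after s, the `:` step will fail, so those rows would need separate handling.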

Finding Elements of Lists in R

Right now I'm working with a character vector in R, which I split word by word with strsplit. I'm wondering if there's a function that I can use to check the whole list, and see if a specific word is in the list, and (if possible) say which elements of the list it is in.
ex.
a = c("a","b","c")
b= c("b","d","e")
c = c("a","e","f")
If z=list(a,b,c), then f("a",z) would optimally yield [1] 1 3, and f("b",z) would optimally yield [1] 1 2
Any assistance would be wonderful.
As alexwhan says, grep is the function to use. However, be careful about using it with a list. It isn't doing what you might think it's doing. For example:
grep("c", z)
[1] 1 2 3 # ?
grep(",", z)
[1] 1 2 3 # ???
What's happening behind the scenes is that grep coerces its 2nd argument to character, using as.character. When applied to a list, what as.character returns is the character representation of that list as obtained by deparsing it. (Modulo an unlist.)
as.character(z)
[1] "c(\"a\", \"b\", \"c\")" "c(\"b\", \"d\", \"e\")" "c(\"a\", \"e\", \"f\")"
cat(as.character(z))
c("a", "b", "c") c("b", "d", "e") c("a", "e", "f")
This is what grep is working on.
If you want to run grep on a list, a safer method is to use lapply. This returns another list, which you can operate on to extract what you're interested in.
res <- lapply(z, function(ch) grep("a", ch))
res
[[1]]
[1] 1
[[2]]
integer(0)
[[3]]
[1] 1
# which vectors contain a search term
sapply(res, function(x) length(x) > 0)
[1] TRUE FALSE TRUE
Much faster than grep is %in%, applied to each element of the list (here z, with the search term "a"):
sapply(z, function(y) "a" %in% y)
and if you want the index, just use which():
which(sapply(z, function(y) "a" %in% y))
Evidence!
x = setNames(replicate(26, list(sample(LETTERS, 10, rep=T))), sapply(LETTERS, list))
head(x)
$A
[1] "A" "M" "B" "X" "B" "J" "P" "L" "M" "L"
$B
[1] "H" "G" "F" "R" "B" "E" "D" "I" "L" "R"
$C
[1] "P" "R" "C" "N" "K" "E" "R" "S" "N" "P"
$D
[1] "F" "B" "B" "Z" "E" "Y" "J" "R" "H" "P"
$E
[1] "O" "P" "E" "X" "S" "Q" "S" "A" "H" "B"
$F
[1] "Y" "P" "T" "T" "P" "N" "K" "P" "G" "P"
system.time(replicate(1000, grep("A", x)))
user system elapsed
0.11 0.00 0.11
system.time(replicate(1000, sapply(x, function(y) "A" %in% y)))
user system elapsed
0.05 0.00 0.05
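Putting this together, here is a sketch of the f("a", z) function the question asks for. The name find_in_list is made up; it uses vapply instead of sapply so the result is always a logical vector, and %in% so that only exact matches count (grep would also match substrings).

```r
# Hypothetical helper: indices of the list elements that contain `term`
find_in_list <- function(term, lst) {
  which(vapply(lst, function(v) term %in% v, logical(1)))
}

# List from the question
z <- list(c("a", "b", "c"), c("b", "d", "e"), c("a", "e", "f"))

find_in_list("a", z)
# [1] 1 3
find_in_list("b", z)
# [1] 1 2
find_in_list("x", z)   # not present anywhere
# integer(0)
```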
You're looking for grep():
grep("a", z)
#[1] 1 3
grep("b", z)
#[1] 1 2
