List of string to list of vectors of characters - r

After defining
> Seq.genes <- as.list(c("ATGCCCAAATTTGATTT","AGAGTTCCCACCAACG"))
I have a list of strings :
> Seq.genes[1:2]
[[1]]
[1] "ATGCCCAAATTTGATTT"
[[2]]
[1] "AGAGTTCCCACCAACG"
I would like to convert it in a list of vectors :
>Seq.genes[1:2]
[[1]]
[1]"A" "T" "G" "C" "C" "C" "A" "A" "A" "T" "T" "T" "G" "A" "T" "T" "T"
[[2]]
[1] "A" "G" "A" "G" "T" "T" "C" "C" "C" "A" "C" "C" "A" "A" "C" "G"
I tried something like :
for (i in length(Seq.genes)){
x <- Seq.genes[i]
Seq.genes[i] <- substring(x, seq(1,nchar(x),2), seq(1,nchar(x),2))
}

It may be better to have the strings in a vector rather than in a list. So, we could unlist, then do an strsplit
strsplit(unlist(Seq.genes), "")

sapply(Seq.genes, strsplit, split = '')
or
lapply(Seq.genes, strsplit, split = '')

Related

Replacing a nucleotide in a FASTA fie

I'm trying to make fasta files for each variation of a gene using a CSV file extracted from gnoMAD. In this function,x is a list with coordinates for each variation, Y is a fasta file opened using the read.fasta function from the seqinr library and data is the file I downloaded from gnomAD. I'm having trouble with the last if statement,supposed to manage SNVs. For some reason,instead of inserting the nucleotide at the position specified, the value is concatenated at the end of the fasta file.
I've read the documentation for the library but haven't found anything about the internal representation for the fasta files.
Example of output:
t" "t" "g" "c" "t" "c" "a" "c" "a" "g" "t" "g" "t" "t" "t" "g"
"a" "g" "c" "a" "g" "t" "g" "c" "t" "g" "a" "g" "c" "a" "c" "a" "a" "a" "g" "c"
"a" "g" "a" "c" "a" "c" "t" "c" "a" "a" "t" "a" "a" "a" "t" "g" "c" "t" "a" "g"
9
"a" "t" "t" "t" "a" "c" "a" "c" "a" "c" "t" "c" "C"
The C with a 9 index should be in the ninth position of the sequence
files<-function(x,y,data){
test<-str_detect(data[ ,"Consequence"],"[del]")
names<-paste(data[ ,"Chromosome"],data[ ,"Position"],data[ ,"Reference"],data[ ,"Alternate"],"ACE2",sep="-")
for (j in 1:length(x)){
copy<-y
if(length(x[[j]])!=1 && test[j]==TRUE){
for(i in x[[j]][1]:x[[j]][2]){
copy[[1]][i]<-NA
}
copy<-copy[[1]][!is.na(copy[[1]])]
}
if(length(x[[j]])==1 && test[j]==TRUE){
copy[[1]][x[[j]][1]]<-NA
copy<-copy[[1]][!is.na(copy[[1]])]
}
if(test[j]==FALSE){
n<-x[[j]][1]
copy[[1]][n]<-complementary(data[j,"Alternate"])
print(copy[[1]][n])
}
putz<-paste(names[j],"fasta",sep=".")
write.fasta(copy,names[j],putz)
}
}

Expand a vector by semicolon in some elements in R

Suppose I have a vector in R:
x<-c("a", "b", "c;d", "e", "f;g;h;i;j")
My question is how to expand x by the seperator ";", namely a desired output would be:
x
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
With strsplit:
unlist(strsplit(x, split = ";"))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

Matching between a vector and multiple vectors in a list in R

I have a list of vectors such as:
>list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "a" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s" "i" "d"
I want to find the matches between each of them and the remaining (i.e. between the 1st and the other 2, the 2nd and the other 2, and so on) and keep the couple with the highest number of matches. I could do it with a "for" loop and intersect by couples. For example
for (i in 2:3) { intersect(list[[1]],list[[i]]) }
and then save the output into a vector or some other structure. However, this seems so inefficient to me (given than rather than 3 I have thousands) and I am wondering if R has some built-in function to do that in a clever way.
So the question would be:
Is there a way to look for matches of one vector to a list of vectors without the explicit use of a "for" loop?
I don't believe there is a built-in function for this. The best you could try is something like:
lsts <- lapply(1:5, function(x) sample(letters, 10)) # make some data (see below)
maxcomb <- which.max(apply(combs <- combn(length(lsts), 2), 2,
function(ix) length(intersect(lsts[[ix[1]]], lsts[[ix[2]]]))))
lsts <- lsts[combs[, maxcomb]]
# [[1]]
# [1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
# [[2]]
# [1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
A dump of the original:
[[1]]
[1] "z" "r" "j" "h" "e" "m" "w" "u" "q" "f"
[[2]]
[1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
[[3]]
[1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
[[4]]
[1] "c" "o" "t" "j" "d" "g" "u" "k" "w" "h"
[[5]]
[1] "f" "g" "q" "y" "d" "e" "n" "s" "w" "i"
datal <- list (a=c(2,2,1,2),
b=c(2,2,2,4,3),
c=c(1,2,3,4))
# all possible combinations
combs <- combn(length(datal), 2)
# split into list
combs <- split(combs, rep(1:ncol(combs), each = nrow(combs)))
# calculate length of intersection for every combination
intersections_length <- sapply(combs, function(y) {
length(intersect(datal[[y[1]]],datal[[y[2]]]))
}
)
# What lists have biggest intersection
combs[which(intersections_length == max(intersections_length))]

Complement a DNA sequence

Suppose I have a DNA sequence. I want to get the complement of it. I used the following code but I am not getting it. What am I doing wrong ?
s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {
if (p[b]=="A") h[b]="T"
if (p[b]=="T") h[b]="A"
if (p[b]=="G") h[b]="C"
if (p[b]=="C") h[b]="G"
}
Use chartr which is built for this purpose:
> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
Just give it two equal-length character strings and your string. Also vectorised over the argument for translation:
> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA"
Note I'm doing the replacement on the string representation of the DNA rather than the vector. To convert the vector I'd create a lookup-map as a named vector and index that:
> p
[1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once:
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
then use
library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)
To complement, in both upper and lower case, you can use chartr():
n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"
To take it a step further and reverse complement the nucleotide sequence, you can use the following function:
library(stringi)
rc <- function(nucSeq)
return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))
rc("AcACGTgtT")
# [1] "AacACGTgT"
There is also a package seqinr
library(seqinr)
comp(seq) # gives complement
rev(comp(seq)) # gives the reverse complement
Biostrings has a much smaller memory profile, but seqinr is nice also because you can choose the case of the bases (including mixed) and change them to anything you want, for example if you want a mix of T and U in the same sequence. Biostrings forces you to have either T or U.
sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G")
A T C T C G G C G C G C A T C G C G T
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
A C G C T A C T A G C
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
If you do not want the complementary names, you can always strip them with unname.
unname(sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") )
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
>
Here a answer using base r. Written with a horrible formatting to make things clear and to keep it as a one-liner. It supports upper and lower cases.
revc = function(s){
paste0(
rev(
unlist(
strsplit(
chartr("ATGCatgc","TACGtacg",s)
, "") # from strsplit
) # from unlist
) # from rev
, collapse='') # from paste0
}
I've generalised the solution rev(comp(seq)) with the seqinr package:
install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)
This version is compatible with string inputs and is vectorised to handle list or vector input of multiple strings. The output class should match the input, including cases and types. This also support inputs containing "U" for RNA and RNA output sequences.
> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"
> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
TATAAT TTTCGC atgcat
"ATTATA" "GCGAAA" "atgcat"
See the manual or the TomKellyGenetics/tktools github package repository.

R - shuffle a list preserving element sizes

In R, I need an efficient solution to shuffle the elements contained within a list, preserving the total number of elements, and the local element sizes (in this case, each element of the list is a vector)
a<-LETTERS[1:6]
b<-LETTERS[6:10]
c<-LETTERS[c(9:15)]
l=list(a,b,c)
> l
[[1]]
[1] "A" "B" "C" "D" "E" "F"
[[2]]
[1] "F" "G" "H" "I" "J"
[[3]]
[1] "I" "J" "K" "L" "M" "N" "O"
The shuffling should randomly select the letters of the list (without replacement) and put them in a random position of any vector within the list.
I hope I have been clear! Thanks :-)
you may try recreating a second list with the skeleton of the first, and fill it with all the elements of the first list, like this:
u<-unlist(l)
l2<-relist(u[sample(length(u))],skeleton=l)
> l2
[[1]]
[1] "F" "A" "O" "I" "S" "Q"
[[2]]
[1] "R" "P" "K" "F" "G"
[[3]]
[1] "A" "N" "M" "J" "H" "G" "E" "B" "T" "C" "D" "L"
Hope this helps!
Like this...?
> set.seed(1)
> lapply(l, sample)
[[1]]
[1] "B" "F" "C" "D" "A" "E"
[[2]]
[1] "J" "H" "G" "F" "I"
[[3]]
[1] "J" "M" "O" "L" "N" "K" "I"

Resources