Complement a DNA sequence - r

Suppose I have a DNA sequence and want to get its complement. I used the following code, but it does not work. What am I doing wrong?
s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {
if (p[b]=="A") h[b]="T"
if (p[b]=="T") h[b]="A"
if (p[b]=="G") h[b]="C"
if (p[b]=="C") h[b]="G"
}

Use chartr which is built for this purpose:
> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
Just give it two equal-length character strings and your string. It is also vectorised over the argument being translated:
> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA"
Note I'm doing the replacement on the string representation of the DNA rather than the vector. To convert the vector I'd create a lookup-map as a named vector and index that:
> p
[1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
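As for the loop in the question: R's syntax is for (b in ...) rather than for b in (...), the lapply( call is never closed, and assigning to h inside an anonymous function would only change a local copy. A minimal sketch of a working loop, keeping the question's variable names (the "N" fallback for unexpected characters is an assumption, not something the question specifies):

```r
s <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
p <- unlist(strsplit(s, ""))   # one element per base
h <- rep("N", length(p))       # "N" marks positions not yet translated

for (b in seq_along(p)) {
  # switch() picks the complement; the last, unnamed value is the fallback
  h[b] <- switch(p[b], A = "T", T = "A", G = "C", C = "G", "N")
}

complement <- paste(h, collapse = "")
print(complement)
```

The loop is shown only to explain the error; chartr or the named-vector lookup above does the same job without an explicit loop.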

The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings")
then use
library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)

To complement, in both upper and lower case, you can use chartr():
n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"
To take it a step further and reverse complement the nucleotide sequence, you can use the following function:
library(stringi)
rc <- function(nucSeq)
return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))
rc("AcACGTgtT")
# [1] "AacACGTgT"

There is also the seqinr package. Here seq is a vector of single characters:
library(seqinr)
comp(seq) # gives the complement
rev(comp(seq)) # gives the reverse complement
Biostrings has a much smaller memory profile, but seqinr is nice also because you can choose the case of the bases (including mixed) and change them to anything you want, for example if you want a mix of T and U in the same sequence. Biostrings forces you to have either T or U.

sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G")
A T C T C G G C G C G C A T C G C G T
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
A C G C T A C T A G C
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
If you do not want the names (which are the original bases), you can always strip them with unname.
unname(sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") )
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
>
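One caveat with the switch approach: for a character that matches none of the alternatives (for example an ambiguity code such as N), switch returns NULL, so sapply returns a list rather than a character vector. A small sketch with an explicit fallback; the helper name comp_base and the choice of "N" as the fallback are illustrative assumptions:

```r
# Hypothetical helper: complement a single base, falling back to "N"
comp_base <- function(b) {
  switch(b, A = "T", T = "A", G = "C", C = "G", "N")
}

p2 <- c("A", "C", "N", "T")
# vapply() enforces that every result is a single string
out <- vapply(p2, comp_base, character(1), USE.NAMES = FALSE)
print(out)
```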

Here is an answer using base R. It is formatted verbosely, one call per line, to make the nesting clear while keeping it a single expression. It supports upper and lower case.
revc = function(s){
paste0(
rev(
unlist(
strsplit(
chartr("ATGCatgc","TACGtacg",s)
, "") # from strsplit
) # from unlist
) # from rev
, collapse='') # from paste0
}
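Note that revc as written handles one string at a time: given a vector of several strings, unlist would merge all their characters before reversing. A sketch of a vectorised variant in base R (the name revc_vec is made up for illustration):

```r
revc_vec <- function(s) {
  comp <- chartr("ATGCatgc", "TACGtacg", s)   # complement, case preserved
  chars <- strsplit(comp, "")                 # one character vector per input
  # reverse each sequence separately and glue it back together
  vapply(chars, function(ch) paste(rev(ch), collapse = ""), character(1))
}

res <- revc_vec(c("AcACGTgtT", "TTTCGC"))
print(res)
```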

I've generalised the solution rev(comp(seq)) with the seqinr package:
install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)
This version is compatible with string inputs and is vectorised to handle list or vector input of multiple strings. The output class should match the input, including cases and types. It also supports inputs containing "U" for RNA, and RNA output sequences.
> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"
> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
TATAAT TTTCGC atgcat
"ATTATA" "GCGAAA" "atgcat"
See the manual or the TomKellyGenetics/tktools github package repository.

Related

List of string to list of vectors of characters

After defining
> Seq.genes <- as.list(c("ATGCCCAAATTTGATTT","AGAGTTCCCACCAACG"))
I have a list of strings :
> Seq.genes[1:2]
[[1]]
[1] "ATGCCCAAATTTGATTT"
[[2]]
[1] "AGAGTTCCCACCAACG"
I would like to convert it into a list of vectors :
> Seq.genes[1:2]
[[1]]
[1] "A" "T" "G" "C" "C" "C" "A" "A" "A" "T" "T" "T" "G" "A" "T" "T" "T"
[[2]]
[1] "A" "G" "A" "G" "T" "T" "C" "C" "C" "A" "C" "C" "A" "A" "C" "G"
I tried something like :
for (i in length(Seq.genes)){
x <- Seq.genes[i]
Seq.genes[i] <- substring(x, seq(1,nchar(x),2), seq(1,nchar(x),2))
}
It may be better to keep the strings in a vector rather than in a list. So we could unlist and then call strsplit:
strsplit(unlist(Seq.genes), "")
Alternatively:
sapply(Seq.genes, strsplit, split = '')
or
lapply(Seq.genes, strsplit, split = '')
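These variants differ in how deeply the result is nested. A quick check, using the Seq.genes list from the question (the names flat, nested and unwrapped are only for illustration):

```r
Seq.genes <- as.list(c("ATGCCCAAATTTGATTT", "AGAGTTCCCACCAACG"))

# strsplit() on the unlisted vector: a list of character vectors
flat <- strsplit(unlist(Seq.genes), "")
str(flat)

# lapply() calls strsplit() once per element, and each call itself returns
# a list, so the character vectors end up one level deeper
nested <- lapply(Seq.genes, strsplit, split = "")
str(nested)

# unwrapping the inner lists gives the same result as 'flat'
unwrapped <- lapply(nested, `[[`, 1)
```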

Convert vector with sets of values preceded by "headers", to separate vectors

I have a vector with several sets of elements. Each set is preceded by a certain name, given by "A", "B" and "C" as an example over here:
v1 <- c("A", letters[1:5], "B", letters[6:7], "C", letters[8:12])
v1
# [1] "A" "a" "b" "c" "d" "e" "B" "f" "g" "C" "h" "i" "j" "k" "l"
The position of the "headers" can be obtained by grep:
start <- grep("[ABC]", v1)
# [1] 1 7 10
How do I proceed from here to extract the three sets of elements as separate vectors with the preceding "headers" as their name?
"A" <- letters[1:5]
"B" <- letters[6:7]
"C" <- letters[8:12]
A
# [1] "a" "b" "c" "d" "e"
B
# [1] "f" "g"
C
# [1] "h" "i" "j" "k" "l"
SOLUTION
I hope the kind soul who answered this question (his ID eludes me), but later deleted the answer and all of his comments, can be contacted and the answer reinstated, so that he can be duly rewarded with upvotes.
Contrary to my initial claim, which was caused by a misunderstanding, his answer DID provide a viable solution.
Here's the gist of it, from what I can recall:
library(purrr) # provides map2(), set_names() and %>%
end <- start-1
end <- end[-1]
end[length(end)+1] <- length(v1)
end
# [1] 6 9 15
map2(start+1, end, ~v1[.x:.y]) %>% set_names(v1[start])
$A
[1] "a" "b" "c" "d" "e"
$B
[1] "f" "g"
$C
[1] "h" "i" "j" "k" "l"
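A similar result can be had in base R without purrr. A sketch that uses cumsum to label each element with the number of headers seen so far; it assumes every header is followed by at least one element, and the names is_header, group and sets are illustrative:

```r
v1 <- c("A", letters[1:5], "B", letters[6:7], "C", letters[8:12])

is_header <- grepl("[ABC]", v1)   # TRUE at positions 1, 7 and 10
group <- cumsum(is_header)        # 1 for the "A" block, 2 for "B", 3 for "C"

# drop the headers themselves, then split the rest by block number
sets <- split(v1[!is_header], group[!is_header])
names(sets) <- v1[is_header]
print(sets)
```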

How to replace values in a data frame with another value

I have a huge data set. The columns contain values like A,B,C,D,E,F,G,H and I need to replace them with 1,2,3,4...
[1] "C" "C" "C" "C" "C" "A" "H" "G" "G" "G" "G" "G" "G" "G" "C" "C" "C" "C" "C"
[20] "C" "B" "B" "B" "H" "H" "H" "H" "H" "H" "G" "C" "A" "A" "A" "A" "A" "A" "A"
[30]----
Another, similar problem is that one column contains more than 1000 values, and I need to replace them with unique numbers.
Try replace. Its second argument selects the elements to change, so in your case, for example:
replace(df, df == "A", 1)
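For mapping all the letters at once, a lookup with match may be more convenient than one replacement per letter. A minimal sketch, assuming a hypothetical data frame whose columns hold single uppercase letters (the real data is not shown in the question):

```r
# Hypothetical example data; the real df is not shown in the question
df <- data.frame(x = c("C", "A", "H"),
                 y = c("B", "G", "A"),
                 stringsAsFactors = FALSE)

# match() returns the position of each value in LETTERS: A -> 1, B -> 2, ...
df_num <- df
df_num[] <- lapply(df, function(col) match(col, LETTERS))
print(df_num)

# For a column with many distinct values, factor codes give one integer
# per distinct value (numbered in sorted order of the values)
ids <- as.integer(factor(c("foo", "bar", "foo")))
print(ids)
```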

How do you retrieve individual DNA sequences after importing an alignment into R?

I imported an alignment in FASTA format into R
read.dna(file.choose(),format="fasta",skip=0)
My alignment looks something like this
Seq1 ATGCGGGAATGGACTCATGCATCG
Seq2 ATTCGATCTTGCTAGCTAGCTCGT
Seq3 ATATCGATGTCGATCGATCGACGA
If I want to call individual sequences from within this alignment (say Seq2 for example), what do I need to do ?
I don't know where read.dna() comes from (there are >6000 CRAN packages, and almost 1000 Bioconductor packages). You could use the Biostrings package and
library(Biostrings)
dna = readDNAStringSet("path/to.fasta")
and do many useful things, including those described in the quick reference. If at the end you want a single character vector, then
as.character(dna[1])
or
as.character(dna[names(dna) == "Seq3"])
I am guessing that you are using the ape package. Using the example in ?read.dna
library(ape)
cat(">No305",
"NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
">No304",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
">No306",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna4 <- read.dna("exdna.txt", format = "fasta")
ex.dna4[dimnames(ex.dna4)[[1]]=='No304',]
#1 DNA sequences in binary format stored in a matrix.
#All sequences of same length: 40
#Labels: No304
#Base composition:
# a c g t
#0.475 0.300 0.025 0.200
as.character(ex.dna4[dimnames(ex.dna4)[[1]]=='No304'])
#[1] "a" "t" "t" "c" "g" "a" "a" "a" "a" "a" "c" "a" "c" "a" "c" "c" "c" "a" "c"
#[20] "t" "a" "c" "t" "a" "a" "a" "a" "a" "t" "t" "a" "t" "c" "a" "a" "c" "c" "a"
#[39] "c" "t"

Generate a sequence of characters from 'A'-'Z'

I can make a sequence of numbers like this:
s = seq(from=1, to=10, by=1)
How do I make a sequence of characters from A-Z? This doesn't work:
seq(from=1, to=10)
Use LETTERS and letters (for uppercase and lowercase respectively).
Use the code you have with letters and/or LETTERS:
> LETTERS[seq( from = 1, to = 10 )]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
> letters[seq( from = 1, to = 10 )]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
Just use the predefined variables letters and LETTERS.
And for completeness, here is something using seq:
R> rawToChar(as.raw(seq(as.numeric(charToRaw('a')), as.numeric(charToRaw('z')))))
[1] "abcdefghijklmnopqrstuvwxyz"
R>
The R.oo package has an intToChar function that uses ASCII values, if LETTERS and letters aren't any good. A is 65 in ASCII:
> require(R.oo)
> intToChar(65:79)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
or you can use the fact that the lowest Unicode code points are ASCII, and hence use intToUtf8 from base R like this:
> intToUtf8(65:78,multiple=TRUE)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N"
or faff around with rawToChar:
> rawToChar(as.raw(65:78))
[1] "ABCDEFGHIJKLMN"
LETTERS contains A-Z. To generate A-E, for instance:
Uppercase:
> LETTERS[1:5]
Lowercase:
> letters[1:5]
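To get the letters between two given endpoints rather than by index, match can look up the positions first. A small sketch; the helper name letter_seq is hypothetical, and it assumes both endpoints are single uppercase letters:

```r
# Hypothetical helper: uppercase letters from 'from' to 'to', inclusive
letter_seq <- function(from, to) {
  idx <- match(c(from, to), LETTERS)   # positions of the two endpoints
  stopifnot(!anyNA(idx))               # both must be uppercase letters
  LETTERS[idx[1]:idx[2]]
}

ch <- letter_seq("C", "H")
print(ch)
```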

Resources