what is a vector of single characters

I have the following function (from package seqinr):
translate(seq, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
In short, it translates DNA sequences into protein sequences. I am having trouble supplying the seq argument. The documentation says:
seq = the sequence to translate as a vector of single characters in lower case letters
I store the DNA sequence in a data.frame (named here seq):
seq <- data.frame(geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA", stringsAsFactors=FALSE)
Every time I try to use the translate function, it returns the error:
Error in s2n(seq, levels = s2c("tcag")) :
sequence is not a vector of chars
I have tried the following, all give the above error:
trans<- seqinr::translate(tolower(seq[1,1]))
trans<- seqinr::translate(stringr::str_split(tolower(seq[1,1]), pattern=""))
trans<- seqinr::translate(as.character(stringr::str_split(tolower(seq[1,1]), pattern="")))
How can I transform my DNA sequence into a vector of single characters?

You could use strsplit:
strsplit("ABCD", "")
# [[1]]
# [1] "A" "B" "C" "D"
## your example:
seqinr::translate(strsplit(seq[1,1], "")[[1]])
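To see why the attempts in the question failed, it may help to inspect the intermediate objects. The sketch below uses base R only and a short made-up sequence (`dna` is a hypothetical stand-in, not the real gene). A single string is not split at all, a split result is a list rather than a vector, and `as.character()` on that list deparses it into one string.

```r
# Hypothetical short sequence standing in for the real gene.
dna <- "ATGTGTTGGTAA"

# strsplit() returns a list with one element per input string ...
pieces <- strsplit(tolower(dna), "")
is.list(pieces)               # TRUE

# ... so take the first element to get a plain character vector.
chars <- pieces[[1]]
is.character(chars)           # TRUE

# as.character() on the list does not unpack it: the element is
# deparsed into a single string, which is still not a vector of chars.
length(as.character(pieces))  # 1
```

The `[[1]]` step is the part the original attempts were missing; `chars` is the kind of object the seq argument is documented to expect.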

The first example in ?translate pretty much gives you your answer.
You don't need a data frame and can use s2c whose sole purpose in life is for "conversion of a string into a vector of chars":
geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA"
library(seqinr)
print(translate(s2c(geneSeq), frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE))
## [1] "M" "C" "W" "A" "A" "A" "I" "P" "I" "A" "I" "S" "G" "A" "Q" "A" "I" "S" "G" "Q" "N" "T" "Q" "A" "K"
## [26] "M" "I" "A" "V" "Q" "T" "A" "A" "G" "R" "R" "Q" "A" "M" "E" "I" "M" "R" "Q" "T" "N" "I" "Q" "N" "A"
## [51] "D" "L" "S" "L" "Q" "A" "R" "S" "N" "L" "E" "K" "A" "S" "A" "E" "L" "T" "S" "Q" "N" "M" "X" "K" "V"
## [76] "Q" "A" "I" "G" "S" "I" "R" "A" "A" "I" "G" "E" "S" "M" "L" "E" "G" "S" "S" "M" "D" "R" "I" "K" "R"
## [101] "V" "T" "E" "G" "Q" "F" "I" "R" "E" "A" "N" "M" "V" "T" "E" "N" "Y" "R" "R" "D" "Y" "Q" "A" "I" "F"
## [126] "V" "Q" "Q" "L" "G" "G" "T" "Q" "S" "A" "A" "S" "Q" "I" "D" "E" "I" "Y" "K" "S" "E" "Q" "K" "Q" "K"
## [151] "S" "K" "L" "Q" "M" "V" "L" "D" "P" "L" "A" "I" "M" "G" "S" "S" "A" "A" "S" "A" "Y" "A" "S" "D" "A"
## [176] "F" "D" "S" "K" "F" "T" "T" "K" "A" "P" "I" "V" "A" "A" "K" "G" "T" "K" "T" "G" "R" "*"

Related

Print vowels from the vector

I need to extract the vowels from R's built-in LETTERS vector:
"A", "E", etc.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
[25] "Y" "Z"
Maybe, someone knows how to do it with if() or other functions. Thank you in advance.

It looks like you need to extract the vowels. Does this work?
> vowels <- c('A','E','I','O','U')
> LETTERS[sapply(vowels, function(ch) grep(ch, LETTERS))]
[1] "A" "E" "I" "O" "U"
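A shorter route, offered as an alternative sketch rather than a correction, is to subset with `%in%`. Since the question mentions if(), the vectorised `ifelse()` is shown as well; the labels "vowel" and "consonant" are made up for illustration.

```r
vowels <- c("A", "E", "I", "O", "U")

# %in% returns a logical vector: TRUE where an element of LETTERS is a vowel.
found <- LETTERS[LETTERS %in% vowels]

# ifelse() is the vectorised counterpart of if(): it labels every letter.
labels <- ifelse(LETTERS %in% vowels, "vowel", "consonant")
```

Here `found` holds the vowels in alphabetical order, and `labels` has one entry per letter of `LETTERS`.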

R: sample function repeats same results

I ran into something very weird with sample(). If I run the following line 5 times at the start of a session (in either RStudio or R), I would get the following results.
sample(letters,5,replace=TRUE)
[1] "b" "y" "d" "p" "n"
[1] "v" "n" "i" "s" "s"
[1] "d" "q" "a" "m" "x"
[1] "w" "s" "u" "h" "e"
[1] "b" "y" "g" "s" "e"
But if I restart the console and run it 5 times at the beginning of a new session, I would somehow get the same results -- every time. Is sample() (which I believe uses Mersenne Twister by default) supposed to do this? What should I do instead to get results that don't actually repeat?

> set.seed(123)
> sample(letters,5,replace=TRUE)
[1] "h" "u" "k" "w" "y"
> sample(letters,5,replace=TRUE)
[1] "b" "n" "x" "o" "l"
> sample(letters,5,replace=TRUE)
[1] "y" "l" "r" "o" "c"
> sample(letters,5,replace=TRUE)
[1] "x" "g" "b" "i" "y"
> sample(letters,5,replace=TRUE)
[1] "x" "s" "q" "z" "r"
If you start a new session and change the set.seed value, you will get different results.
> set.seed(456)
> sample(letters,5,replace=TRUE)
[1] "c" "f" "t" "w" "u"
> sample(letters,5,replace=TRUE)
[1] "i" "c" "h" "g" "k"
> sample(letters,5,replace=TRUE)
[1] "j" "f" "t" "v" "p"
> sample(letters,5,replace=TRUE)
[1] "q" "v" "l" "s" "h"
> sample(letters,5,replace=TRUE)
[1] "e" "s" "x" "l" "v"
Hope that helps.
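The role of the seed can be checked directly. The sketch below shows that resetting the same seed reproduces the same draw, and that `set.seed(NULL)` re-initialises the generator as if no seed had been set. Whether a restored workspace carrying a saved `.Random.seed` is what causes the repeats in the question is an assumption; the thread does not confirm it.

```r
# Same seed, same draws: this is what makes a result reproducible.
set.seed(123)
first <- sample(letters, 5, replace = TRUE)
set.seed(123)
second <- sample(letters, 5, replace = TRUE)
identical(first, second)   # TRUE

# set.seed(NULL) re-initialises the generator as if no seed had been set,
# so the next draw no longer depends on any earlier seed.
set.seed(NULL)
fresh <- sample(letters, 5, replace = TRUE)
```

If the repeats do come from a saved workspace, calling `set.seed(NULL)` at the start of a session should be enough to get draws that differ between sessions.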

Replace values in data frame from column of indexes

I have a matrix of data that looks like the following:
> taxmat = matrix(sample(letters, 70, replace = TRUE), nrow = 10, ncol = 7)
> rownames(taxmat) <- paste0("OTU", 1:nrow(taxmat))
> taxmat<-cbind(taxmat,c("Genus","Genus","Genus","Family","Family","Order","Genus","Species","Genus","Species"))
> colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Lowest")
> taxmat
Domain Phylum Class Order Family Genus Species Lowest
OTU1 "h" "c" "q" "e" "q" "w" "v" "Genus"
OTU2 "f" "y" "q" "z" "p" "w" "v" "Genus"
OTU3 "w" "q" "i" "i" "z" "j" "f" "Genus"
OTU4 "c" "e" "f" "n" "z" "b" "d" "Family"
OTU5 "g" "w" "q" "k" "e" "x" "k" "Family"
OTU6 "x" "j" "l" "w" "z" "o" "q" "Order"
OTU7 "k" "s" "j" "y" "t" "a" "t" "Genus"
OTU8 "w" "u" "s" "w" "g" "y" "n" "Species"
OTU9 "t" "r" "t" "o" "i" "l" "z" "Genus"
OTU10 "x" "p" "j" "f" "k" "q" "w" "Species"
The column "Lowest" gives the lowest rank at which I have confidence in the data for that row. For each row, I would like to replace the value(s) in the column(s) following the column indicated by "Lowest" with "unknown."
Expected output for this example would be:
Domain Phylum Class Order Family Genus Species Lowest
OTU1 "b" "b" "v" "v" "l" "n" "unknown" "Genus"
OTU2 "l" "m" "w" "b" "f" "y" "unknown" "Genus"
OTU3 "h" "w" "n" "y" "k" "f" "unknown" "Genus"
OTU4 "u" "m" "p" "n" "t" "unknown" "unknown" "Family"
OTU5 "o" "b" "q" "w" "a" "unknown" "unknown" "Family"
OTU6 "s" "j" "l" "d" "unknown" "unknown" "unknown" "Order"
OTU7 "v" "y" "t" "p" "s" "v" "unknown" "Genus"
OTU8 "b" "r" "k" "d" "q" "c" "q" "Species"
OTU9 "k" "h" "b" "w" "h" "x" "unknown" "Genus"
OTU10 "o" "p" "b" "n" "k" "d" "q" "Species"
I can get all the indexes to replace as a vector with
idx <- lapply(taxmat[, "Lowest"], grep, colnames(taxmat))
idx <- as.numeric(unlist(idx))+1
But I'm not sure how to replace those values. Thanks for your help!

We can loop through the rows with apply and build an index by matching the last element of each row (the 'Lowest' entry) against the column names, then replace every later element of the row with 'unknown'. Here 'm1' is the matrix from the question:
t(apply(m1, 1, function(x) {
i1 <- match( x[8], names(x)[-8])+1
i1[i1>7] <- 0
i1 <- if(i1!=0) i1:7 else i1
c(replace(x[-8], i1, "unknown"), x[8])}))
# Domain Phylum Class Order Family Genus Species Lowest
#OTU1 "b" "b" "v" "v" "l" "n" "unknown" "Genus"
#OTU2 "l" "m" "w" "b" "f" "y" "unknown" "Genus"
#OTU3 "h" "w" "n" "y" "k" "f" "unknown" "Genus"
#OTU4 "u" "m" "p" "n" "t" "unknown" "unknown" "Family"
#OTU5 "o" "b" "q" "w" "a" "unknown" "unknown" "Family"
#OTU6 "s" "j" "l" "d" "unknown" "unknown" "unknown" "Order"
#OTU7 "v" "y" "t" "p" "s" "v" "unknown" "Genus"
#OTU8 "b" "r" "k" "d" "q" "c" "q" "Species"
#OTU9 "k" "h" "b" "w" "h" "x" "unknown" "Genus"
#OTU10 "o" "p" "b" "n" "k" "d" "q" "Species"
Or another option is to create a row/column index based on the match of column names with the last column of 'm1' and the sequence of rows and then cbind the indexes and assign the values in 'm1' to 'unknown'
lst <- Map(function(x, y) if(x >y) 0 else x:y, match(m1[,8], colnames(m1)[-8])+1, 7)
m1[cbind(rep(seq_len(nrow(m1)), lengths(lst)), unlist(lst))] <- "unknown"
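The same idea can also be written without a loop by comparing each cell's column number with the row's lowest trusted column. This is a sketch of an alternative, not the answerer's method. It rebuilds the matrix as in the question, with an arbitrary seed so the letters differ from those shown above; `ranks`, `lowest_col`, `mask` and `result` are names introduced here.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
taxmat <- matrix(sample(letters, 70, replace = TRUE), nrow = 10, ncol = 7)
rownames(taxmat) <- paste0("OTU", 1:nrow(taxmat))
taxmat <- cbind(taxmat, c("Genus", "Genus", "Genus", "Family", "Family",
                          "Order", "Genus", "Species", "Genus", "Species"))
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family",
                      "Genus", "Species", "Lowest")

ranks <- colnames(taxmat)[1:7]

# Column number of the lowest trusted rank, one value per row.
lowest_col <- match(taxmat[, "Lowest"], ranks)

# col() gives each cell's column number; the length-10 vector is recycled
# down the columns, so cell [i, j] is compared with lowest_col[i].
mask <- col(taxmat[, ranks]) > lowest_col

result <- taxmat
result[, ranks][mask] <- "unknown"
```

Rows whose lowest rank is "Species" have no cells to the right, so they are left untouched without needing a special case.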

How to search and isolate attributes of FASTA formatted text in R

I have a FASTA formatted file, which is essentially a special text file, containing many entries. One of them, which I have assigned to the name "FASTA" in R, looks like the one below. The original file was read and formatted as shown using the seqinr package in R.
FASTA<- structure(list(`tr|A1Z6G9|A1Z6G9_DROME` = structure("MSISASHPCGLNADGTATQYKESTATIQTSGLQSSPRSFLPEREDTLEYFIKFPKPSSKNEFVLAKDHDGEDSHVPIVMLLGWAGCQDRYLMKYSKIYEERGLITVRYTAPVDSLFWKRSEMIPIGEKILKLIQDMNFDAHPLIFHIFSNGGAYLYQHINLAVIKHKSPLQVRGVIFDSAPGERRIISLYRAITAIYGREKRCNCLAALVITITLSIMWFVEESISALKSLFVPSSPVRPSPFCDLKNEANRYPQLFLYSKGDIVIPYRDVEKFIRLRRDQGIQVSSVCFEDAEHVKIYTKYPKQYVQCVCNFIRNCMTIPPLKEAVNSEPSESVSRVNLKYD", name = "tr|A1Z6G9|A1Z6G9_DROME", Annot = ">tr|A1Z6G9|A1Z6G9_DROME CG8245 OS=Drosophila melanogaster GN=CG8245-RA PE=2 SV=1", class = "SeqFastaAA")))
This format lets me find the index of an entry by name when I search with grep, as shown below:
grep("A1Z6G9_DROME", names(FASTA))
or isolate its name using
as.vector(sapply(names(attributes(FASTA)), function(x) attr(FASTA, x)))
However, I can neither grep/regexpr the text stored in the attributes nor isolate any individual attribute, such as the text following name= or Annot=. Can anyone help me with this?
As far as I could gather, when googling read.fasta in R, the manual relating to the seqinr package states something along the lines of annotations/attributes being ignored (I think) but these attribute sections hold important information regarding the identity of the entry, which I desperately need! I have tried the unlist or collapse with the paste function but they remove all the attributes that I need!

There are a lot of get* functions in the seqinr package (see http://www.rdocumentation.org/packages/seqinr). These functions are designed to access different attributes, e.g.:
getAnnot(FASTA)
#[[1]]
#[1] ">tr|A1Z6G9|A1Z6G9_DROME CG8245 OS=Drosophila melanogaster GN=CG8245-RA PE=2 SV=1"
getSequence(FASTA)
#[[1]]
# [1] "M" "S" "I" "S" "A" "S" "H" "P" "C" "G" "L" "N" "A" "D" "G" "T" "A" "T" "Q" "Y" "K" "E" "S" "T" "A" "T" "I" "Q" "T" "S" "G" "L" "Q" "S" "S" "P" "R" "S" "F" "L" "P" "E" "R" "E" "D" "T" "L" "E" "Y" "F" "I" "K" "F" "P" "K" "P" "S" "S" "K"
# [60] "N" "E" "F" "V" "L" "A" "K" "D" "H" "D" "G" "E" "D" "S" "H" "V" "P" "I" "V" "M" "L" "L" "G" "W" "A" "G" "C" "Q" "D" "R" "Y" "L" "M" "K" "Y" "S" "K" "I" "Y" "E" "E" "R" "G" "L" "I" "T" "V" "R" "Y" "T" "A" "P" "V" "D" "S" "L" "F" "W" "K"
#[119] "R" "S" "E" "M" "I" "P" "I" "G" "E" "K" "I" "L" "K" "L" "I" "Q" "D" "M" "N" "F" "D" "A" "H" "P" "L" "I" "F" "H" "I" "F" "S" "N" "G" "G" "A" "Y" "L" "Y" "Q" "H" "I" "N" "L" "A" "V" "I" "K" "H" "K" "S" "P" "L" "Q" "V" "R" "G" "V" "I" "F"
#[178] "D" "S" "A" "P" "G" "E" "R" "R" "I" "I" "S" "L" "Y" "R" "A" "I" "T" "A" "I" "Y" "G" "R" "E" "K" "R" "C" "N" "C" "L" "A" "A" "L" "V" "I" "T" "I" "T" "L" "S" "I" "M" "W" "F" "V" "E" "E" "S" "I" "S" "A" "L" "K" "S" "L" "F" "V" "P" "S" "S"
#[237] "P" "V" "R" "P" "S" "P" "F" "C" "D" "L" "K" "N" "E" "A" "N" "R" "Y" "P" "Q" "L" "F" "L" "Y" "S" "K" "G" "D" "I" "V" "I" "P" "Y" "R" "D" "V" "E" "K" "F" "I" "R" "L" "R" "R" "D" "Q" "G" "I" "Q" "V" "S" "S" "V" "C" "F" "E" "D" "A" "E" "H"
#[296] "V" "K" "I" "Y" "T" "K" "Y" "P" "K" "Q" "Y" "V" "Q" "C" "V" "C" "N" "F" "I" "R" "N" "C" "M" "T" "I" "P" "P" "L" "K" "E" "A" "V" "N" "S" "E" "P" "S" "E" "S" "V" "S" "R" "V" "N" "L" "K" "Y" "D"
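Because the annotation is stored as an ordinary R attribute on each list element, base R's `attr()` can also reach it, and the result can then be searched with the usual regular-expression functions. The sketch below rebuilds one entry by hand with a shortened sequence so that it runs without seqinr. The `GN=` pattern is an assumption about this particular header layout.

```r
# One entry rebuilt by hand (sequence shortened) to mimic the question's object.
entry <- structure(
  "MSISASHPCG",
  name  = "tr|A1Z6G9|A1Z6G9_DROME",
  Annot = ">tr|A1Z6G9|A1Z6G9_DROME CG8245 OS=Drosophila melanogaster GN=CG8245-RA PE=2 SV=1",
  class = "SeqFastaAA"
)
FASTA <- list("tr|A1Z6G9|A1Z6G9_DROME" = entry)

# attr() reads one attribute from one entry; vapply() repeats that per entry.
annot <- vapply(FASTA, attr, character(1), which = "Annot")

# The annotation is now a plain character vector, so grepl() and sub() work.
is_fly <- grepl("Drosophila", annot)
gene   <- sub(".*GN=([^ ]+).*", "\\1", annot)
```

The attributes sit on the individual entries rather than on the list itself, which is why the entries have to be visited one at a time.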

Letter "y" comes after "i" when sorting alphabetically

When using the function sort(x), where x is a character vector, the letter "y" jumps into the middle, right after the letter "i":
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u" "v" "w" "x" "y" "z"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[21] "t" "u" "v" "w" "x" "z"
The reason may be that I am located in Lithuania, and this is the Lithuanian sorting order for letters, but I need the standard order. How do I change the sorting method back inside R code?
I'm using R 2.15.2 on Win7.

You need to change the locale that R is running in. Either do that for your entire Windows install (which seems suboptimal) or within the R sessions via:
Sys.setlocale("LC_COLLATE", "C")
You can use any other valid locale string in place of "C" there, but that should get you back to the sort order for letters you want.
Read ?locales for more.
I suppose it is worth noting the sister function Sys.getlocale(), which queries the current setting of a locale parameter. Hence you could do
(locCol <- Sys.getlocale("LC_COLLATE"))
Sys.setlocale("LC_COLLATE", "lt_LT")
sort(letters)
Sys.setlocale("LC_COLLATE", locCol)
sort(letters)
Sys.getlocale("LC_COLLATE")
## giving:
> (locCol <- Sys.getlocale("LC_COLLATE"))
[1] "en_GB.UTF-8"
> Sys.setlocale("LC_COLLATE", "lt_LT")
[1] "lt_LT"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n"
[16] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "z"
> Sys.setlocale("LC_COLLATE", locCol)
[1] "en_GB.UTF-8"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
[16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> Sys.getlocale("LC_COLLATE")
[1] "en_GB.UTF-8"
which of course is what @Hadley's answer shows with_collate() doing somewhat more succinctly once you have devtools installed.

If you want to do this temporarily, devtools provides the with_collate function (the same function is also available from the withr package):
library(devtools)
with_collate("C", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
# [20] "t" "u" "v" "w" "x" "y" "z"
with_collate("lt_LT", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r"
# [20] "s" "t" "u" "v" "w" "x" "z"
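Without any extra package, the save, set and restore pattern from the first answer can be wrapped in a small function. This is a minimal sketch in base R; `sort_c` is a hypothetical helper name, and it uses the "C" locale because that one should be available on any system.

```r
# Hypothetical helper: sort in the "C" collation order, then put the
# session's original collation locale back, even if sort() fails.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

before <- Sys.getlocale("LC_COLLATE")
sorted <- sort_c(c("y", "i", "j", "a"))
after  <- Sys.getlocale("LC_COLLATE")
```

Using `on.exit()` means the caller's locale is restored even when an error interrupts the function. Note that in the "C" locale upper-case letters sort before lower-case ones.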