Letter "y" comes after "i" when sorting alphabetically

Letter "y" comes after "i" when sorting alphabetically - r

When using function sort(x), where x is a character, the letter "y" jumps into the middle, right after letter "i":
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u" "v" "w" "x" "y" "z"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[21] "t" "u" "v" "w" "x" "z"
The reason may be that I am located in Lithuania, and this is "lithuanian-like" sorting of letters, but I need normal sorting. How do I change the sorting method back to normal inside R code?
I'm using R 2.15.2 on Win7.

You need to change the locale that R is running in. Either do that for your entire Windows install (which seems suboptimal) or within the R sessions via:
Sys.setlocale("LC_COLLATE", "C")
You can use any other valid locale string in place of "C" there, but that should get you back to the sort order for letters you want.
Read ?locales for more.
I suppose it is worth noting the sister function Sys.getlocale(), which queries the current setting of a locale parameter. Hence you could do
(locCol <- Sys.getlocale("LC_COLLATE"))
Sys.setlocale("LC_COLLATE", "lt_LT")
sort(letters)
Sys.setlocale("LC_COLLATE", locCol)
sort(letters)
Sys.getlocale("LC_COLLATE")
## giving:
> (locCol <- Sys.getlocale("LC_COLLATE"))
[1] "en_GB.UTF-8"
> Sys.setlocale("LC_COLLATE", "lt_LT")
[1] "lt_LT"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n"
[16] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "z"
> Sys.setlocale("LC_COLLATE", locCol)
[1] "en_GB.UTF-8"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
[16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> Sys.getlocale("LC_COLLATE")
[1] "en_GB.UTF-8"
which of course is what #Hadley's Answer shows with_collate() doing somewhat more succinctly once you have devtools installed.

If you want to do this temporarily, devtools provides the with_collate function:
library(devtools)
with_collate("C", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
# [20] "t" "u" "v" "w" "x" "y" "z"
with_collate("lt_LT", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r"
# [20] "s" "t" "u" "v" "w" "x" "z"

Related

Print vowels from the vector

I need to execute the vowels from the LETTERS R build-in vector
"A", "E", etc.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U"
"V" "W" "X"
[25] "Y" "Z"
Maybe, someone knows how to do it with if() or other functions. Thank you in advance.

Looks like you need extract vowels, does this work:
> vowels <- c('A','E','I','O','U')
> LETTERS[sapply(vowels, function(ch) grep(ch, LETTERS))]
[1] "A" "E" "I" "O" "U"
>

R: sample function repeats same results

I ran into something very weird with sample(). If I run the following line 5 times at the start of a session (in either RStudio or R), I would get the following results.
sample(letters,5,replace=TRUE)
[1] "b" "y" "d" "p" "n"
[1] "v" "n" "i" "s" "s"
[1] "d" "q" "a" "m" "x"
[1] "w" "s" "u" "h" "e"
[1] "b" "y" "g" "s" "e"
But if I restart the console and run it 5 times at the beginning of a new session, I would somehow get the same results -- every time. Is sample() (which I believe uses Mersenne Twister by default) supposed to do this? What should I do instead to get results that don't actually repeat?

set.seed(123)
> sample(letters,5,replace=TRUE)
[1] "h" "u" "k" "w" "y"
> sample(letters,5,replace=TRUE)
[1] "b" "n" "x" "o" "l"
> sample(letters,5,replace=TRUE)
[1] "y" "l" "r" "o" "c"
> sample(letters,5,replace=TRUE)
[1] "x" "g" "b" "i" "y"
> sample(letters,5,replace=TRUE)
[1] "x" "s" "q" "z" "r"
If you start a new session and change the set.seed value, you will get different results.
> set.seed(456)
> sample(letters,5,replace=TRUE)
[1] "c" "f" "t" "w" "u"
> sample(letters,5,replace=TRUE)
[1] "i" "c" "h" "g" "k"
> sample(letters,5,replace=TRUE)
[1] "j" "f" "t" "v" "p"
> sample(letters,5,replace=TRUE)
[1] "q" "v" "l" "s" "h"
> sample(letters,5,replace=TRUE)
[1] "e" "s" "x" "l" "v"
Hope that helps.

what is a vector of single characters

I have the following function (from package seqinr):
translate(seq, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
Shortly, it translates DNA sequences into protein sequences. I have problems with giving the seq argument. The documentation says:
seq = the sequence to translate as a vector of single characters in lower case letters
I store the DNA sequence in a data.frame (named here seq):
seq <- data.frame(geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA", stringsAsFactors=FALSE)
Every time I try to use the translate function, it returns the error:
Error in s2n(seq, levels = s2c("tcag")) :
sequence is not a vector of chars
I have tried the following, all give the above error:
trans<- seqinr::translate(tolower(seq[1,1]))
trans<- seqinr::translate(stringr::str_split(tolower(seq[1,1]), pattern=""))
trans<- seqinr::translate(as.character(stringr::str_split(tolower(seq[1,1]), pattern="")))
How can I transform my DNA sequence in a vector of single characters?

You could use strsplit:
strsplit("ABCD", "")
# [[1]]
# [1] "A" "B" "C" "D"
## your example:
seqinr::translate(strsplit(seq[1,1], "")[[1]])

The first example in ?translate pretty much gives you your answer.
You don't need a data frame and can use s2c whose sole purpose in life is for "conversion of a string into a vector of chars":
geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA"
print(translate(s2c(geneSeq), frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
## [1] "M" "C" "W" "A" "A" "A" "I" "P" "I" "A" "I" "S" "G" "A" "Q" "A" "I" "S" "G" "Q" "N" "T" "Q" "A" "K"
## [26] "M" "I" "A" "V" "Q" "T" "A" "A" "G" "R" "R" "Q" "A" "M" "E" "I" "M" "R" "Q" "T" "N" "I" "Q" "N" "A"
## [51] "D" "L" "S" "L" "Q" "A" "R" "S" "N" "L" "E" "K" "A" "S" "A" "E" "L" "T" "S" "Q" "N" "M" "X" "K" "V"
## [76] "Q" "A" "I" "G" "S" "I" "R" "A" "A" "I" "G" "E" "S" "M" "L" "E" "G" "S" "S" "M" "D" "R" "I" "K" "R"
## [101] "V" "T" "E" "G" "Q" "F" "I" "R" "E" "A" "N" "M" "V" "T" "E" "N" "Y" "R" "R" "D" "Y" "Q" "A" "I" "F"
## [126] "V" "Q" "Q" "L" "G" "G" "T" "Q" "S" "A" "A" "S" "Q" "I" "D" "E" "I" "Y" "K" "S" "E" "Q" "K" "Q" "K"
## [151] "S" "K" "L" "Q" "M" "V" "L" "D" "P" "L" "A" "I" "M" "G" "S" "S" "A" "A" "S" "A" "Y" "A" "S" "D" "A"
## [176] "F" "D" "S" "K" "F" "T" "T" "K" "A" "P" "I" "V" "A" "A" "K" "G" "T" "K" "T" "G" "R" "*"

How to search and isolate attributes of FASTA formatted text in R

I have a FASTA formatted file, which is essentially a special text file, containing many entries, one of which looks like below, which I have assigned by the name "FASTA" in R. The original file was red and formated as seen below using seqinr package in R.
FASTA<- structure(list(`tr|A1Z6G9|A1Z6G9_DROME` = structure("MSISASHPCGLNADGTATQYKESTATIQTSGLQSSPRSFLPEREDTLEYFIKFPKPSSKNEFVLAKDHDGEDSHVPIVMLLGWAGCQDRYLMKYSKIYEERGLITVRYTAPVDSLFWKRSEMIPIGEKILKLIQDMNFDAHPLIFHIFSNGGAYLYQHINLAVIKHKSPLQVRGVIFDSAPGERRIISLYRAITAIYGREKRCNCLAALVITITLSIMWFVEESISALKSLFVPSSPVRPSPFCDLKNEANRYPQLFLYSKGDIVIPYRDVEKFIRLRRDQGIQVSSVCFEDAEHVKIYTKYPKQYVQCVCNFIRNCMTIPPLKEAVNSEPSESVSRVNLKYD", name = "tr|A1Z6G9|A1Z6G9_DROME", Annot = ">tr|A1Z6G9|A1Z6G9_DROME CG8245 OS=Drosophila melanogaster GN=CG8245-RA PE=2 SV=1", class = "SeqFastaAA")))
Now although this format allows me to get the name indices of the entry/entries, when I search for it using grep, as seen below
grep("A1Z6G9_DROME", names(FASTA))
or isolate its name using
as.vector(sapply(names(attributes(FASTA)), function(x) attr(FASTA, x)))
However I can not either grep/regexpr any of the text/information in the attributes sections or isolate any of the attributes, such as the text following name= or Annot= section. Can anyone help me with this?
As far as I could gather, when googling read.fasta in R, the manual relating to the seqinr package states something along the lines of annotations/attributes being ignored (I think) but these attribute sections hold important information regarding the identity of the entry, which I desperately need! I have tried the unlist or collapse with the paste function but they remove all the attributes that I need!

There are a lot of get* functions in the seqinr package (see http://www.rdocumentation.org/packages/seqinr). These functions are designed to access different attributes, e.g.:
getAnnot(FASTA)
#[[1]]
#[1] ">tr|A1Z6G9|A1Z6G9_DROME CG8245 OS=Drosophila melanogaster GN=CG8245-RA PE=2 SV=1"
getSequence(FASTA)
#[[1]]
# [1] "M" "S" "I" "S" "A" "S" "H" "P" "C" "G" "L" "N" "A" "D" "G" "T" "A" "T" "Q" "Y" "K" "E" "S" "T" "A" "T" "I" "Q" "T" "S" "G" "L" "Q" "S" "S" "P" "R" "S" "F" "L" "P" "E" "R" "E" "D" "T" "L" "E" "Y" "F" "I" "K" "F" "P" "K" "P" "S" "S" "K"
# [60] "N" "E" "F" "V" "L" "A" "K" "D" "H" "D" "G" "E" "D" "S" "H" "V" "P" "I" "V" "M" "L" "L" "G" "W" "A" "G" "C" "Q" "D" "R" "Y" "L" "M" "K" "Y" "S" "K" "I" "Y" "E" "E" "R" "G" "L" "I" "T" "V" "R" "Y" "T" "A" "P" "V" "D" "S" "L" "F" "W" "K"
#[119] "R" "S" "E" "M" "I" "P" "I" "G" "E" "K" "I" "L" "K" "L" "I" "Q" "D" "M" "N" "F" "D" "A" "H" "P" "L" "I" "F" "H" "I" "F" "S" "N" "G" "G" "A" "Y" "L" "Y" "Q" "H" "I" "N" "L" "A" "V" "I" "K" "H" "K" "S" "P" "L" "Q" "V" "R" "G" "V" "I" "F"
#[178] "D" "S" "A" "P" "G" "E" "R" "R" "I" "I" "S" "L" "Y" "R" "A" "I" "T" "A" "I" "Y" "G" "R" "E" "K" "R" "C" "N" "C" "L" "A" "A" "L" "V" "I" "T" "I" "T" "L" "S" "I" "M" "W" "F" "V" "E" "E" "S" "I" "S" "A" "L" "K" "S" "L" "F" "V" "P" "S" "S"
#[237] "P" "V" "R" "P" "S" "P" "F" "C" "D" "L" "K" "N" "E" "A" "N" "R" "Y" "P" "Q" "L" "F" "L" "Y" "S" "K" "G" "D" "I" "V" "I" "P" "Y" "R" "D" "V" "E" "K" "F" "I" "R" "L" "R" "R" "D" "Q" "G" "I" "Q" "V" "S" "S" "V" "C" "F" "E" "D" "A" "E" "H"
#[296] "V" "K" "I" "Y" "T" "K" "Y" "P" "K" "Q" "Y" "V" "Q" "C" "V" "C" "N" "F" "I" "R" "N" "C" "M" "T" "I" "P" "P" "L" "K" "E" "A" "V" "N" "S" "E" "P" "S" "E" "S" "V" "S" "R" "V" "N" "L" "K" "Y" "D"

R - shuffle a list preserving element sizes

In R, I need an efficient solution to shuffle the elements contained within a list, preserving the total number of elements, and the local element sizes (in this case, each element of the list is a vector)
a<-LETTERS[1:6]
b<-LETTERS[6:10]
c<-LETTERS[c(9:15)]
l=list(a,b,c)
> l
[[1]]
[1] "A" "B" "C" "D" "E" "F"
[[2]]
[1] "F" "G" "H" "I" "J"
[[3]]
[1] "I" "J" "K" "L" "M" "N" "O"
The shuffling should randomly select the letters of the list (without replacement) and put them in a random position of any vector within the list.
I hope I have been clear! Thanks :-)

you may try recreating a second list with the skeleton of the first, and fill it with all the elements of the first list, like this:
u<-unlist(l)
l2<-relist(u[sample(length(u))],skeleton=l)
> l2
[[1]]
[1] "F" "A" "O" "I" "S" "Q"
[[2]]
[1] "R" "P" "K" "F" "G"
[[3]]
[1] "A" "N" "M" "J" "H" "G" "E" "B" "T" "C" "D" "L"
Hope this helps!

Like this...?
> set.seed(1)
> lapply(l, sample)
[[1]]
[1] "B" "F" "C" "D" "A" "E"
[[2]]
[1] "J" "H" "G" "F" "I"
[[3]]
[1] "J" "M" "O" "L" "N" "K" "I"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Letter "y" comes after "i" when sorting alphabetically - r

Related

Print vowels from the vector

R: sample function repeats same results

what is a vector of single characters

How to search and isolate attributes of FASTA formatted text in R

R - shuffle a list preserving element sizes

Categories

Resources