R: sample function repeats same results

I ran into something very weird with sample(). If I run the following line 5 times at the start of a session (in either RStudio or R), I would get the following results.
sample(letters,5,replace=TRUE)
[1] "b" "y" "d" "p" "n"
[1] "v" "n" "i" "s" "s"
[1] "d" "q" "a" "m" "x"
[1] "w" "s" "u" "h" "e"
[1] "b" "y" "g" "s" "e"
But if I restart the console and run it 5 times at the beginning of a new session, I would somehow get the same results -- every time. Is sample() (which I believe uses Mersenne Twister by default) supposed to do this? What should I do instead to get results that don't actually repeat?

> set.seed(123)
> sample(letters,5,replace=TRUE)
[1] "h" "u" "k" "w" "y"
> sample(letters,5,replace=TRUE)
[1] "b" "n" "x" "o" "l"
> sample(letters,5,replace=TRUE)
[1] "y" "l" "r" "o" "c"
> sample(letters,5,replace=TRUE)
[1] "x" "g" "b" "i" "y"
> sample(letters,5,replace=TRUE)
[1] "x" "s" "q" "z" "r"
If you start a new session and change the set.seed value, you will get different results.
> set.seed(456)
> sample(letters,5,replace=TRUE)
[1] "c" "f" "t" "w" "u"
> sample(letters,5,replace=TRUE)
[1] "i" "c" "h" "g" "k"
> sample(letters,5,replace=TRUE)
[1] "j" "f" "t" "v" "p"
> sample(letters,5,replace=TRUE)
[1] "q" "v" "l" "s" "h"
> sample(letters,5,replace=TRUE)
[1] "e" "s" "x" "l" "v"
Hope that helps.
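A minimal sketch of why the draws repeat and how to stop them repeating. One plausible cause, which is an assumption because the question does not say how the session is started, is that a saved workspace (.RData) is restored at startup: a saved workspace includes the hidden .Random.seed object, so every new session resumes from the same generator state. Calling set.seed(NULL) asks R to re-initialise the generator as if no seed had been set.

```r
# Same seed, same draws: this is what a restored .Random.seed does implicitly.
set.seed(123)
first <- sample(letters, 5, replace = TRUE)
set.seed(123)
second <- sample(letters, 5, replace = TRUE)
print(identical(first, second))

# The generator state lives in a hidden object in the global environment.
print(exists(".Random.seed", envir = globalenv()))

# Re-initialise the generator so later draws no longer follow the saved state.
set.seed(NULL)
print(sample(letters, 5, replace = TRUE))
```

If a restored workspace is indeed the cause, starting R with --no-restore (or not saving the workspace on exit) avoids the problem without any code changes.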

Related

Print vowels from the vector

I need to extract the vowels from R's built-in LETTERS vector:
"A", "E", etc.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
[25] "Y" "Z"
Maybe, someone knows how to do it with if() or other functions. Thank you in advance.
Looks like you need to extract the vowels. Does this work?
> vowels <- c('A','E','I','O','U')
> LETTERS[sapply(vowels, function(ch) grep(ch, LETTERS))]
[1] "A" "E" "I" "O" "U"
>
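A shorter alternative, as a sketch: since the vowels are already known, a plain membership test with %in% gives the same result without grep or sapply. The loop at the end is only there because the question asked about if(); the variable name found is mine.

```r
vowels <- c("A", "E", "I", "O", "U")

# Keep the elements of LETTERS that appear in vowels, in alphabet order.
found <- LETTERS[LETTERS %in% vowels]
print(found)

# The same idea written with an explicit if(), one letter at a time.
for (ch in LETTERS) {
  if (ch %in% vowels) print(ch)
}
```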

Efficient way of running multiple successive for loops in R?

I am trying to run several for loops in succession in R. I hope this simplified example of the kind of thing I am trying to do provides enough information, and that the question is relevant/interesting enough to a general audience.
Essentially, I have a pool of individuals (here represented by the 26 LETTERS and saved in a vector called 'ids'). I start with 2 of them randomly selected (called 'ids1') and run a for loop (here 5 times as defined by 'runs'). Those letters not picked get put into another vector called 'ids.left1'.
The first thing going on in the for loop in this example is that I am just randomly picking one of the letters five times. I am storing the result of this in another vector called result1. In this example I'm also storing those letters not used in another vector called 'otherresult1'. (My real-world reason for doing this would be using loops containing several different processes, not just these two).
set.seed(123)
#Initializing
ids<-LETTERS[1:26]
runs<-5
#1st time
result1 <- vector("list",runs)
otherresult1 <- vector("list",runs)
ids1<-sample(ids,2)
ids.left1<-setdiff(ids,ids1)
for (i in 1:runs) {
picked1<-sample(ids1, 1)
result1[[i]] <- picked1
otherresult1[[i]] <- setdiff(ids1,picked1)
}
result1x<-unlist(result1) #[1] "H" "T" "T" "H" "T"
The above is trivial. What I am trying to do next is to add an extra letter (randomly selected) to the pool (so we now have 3) and run the for loop again for the same number of times (5). I also want to store the now 23 letters not being used in a vector (ids.left2) and also store the results of this loop in result2. Those not selected get stored in otherresult2.
#2nd time
result2 <- vector("list",runs)
otherresult2 <- vector("list",runs)
ids2<-c(ids1, sample(ids.left1,1))
ids.left2<-setdiff(ids,ids2)
for (i in 1:runs) {
picked2<- sample(ids2, 1)
result2[[i]] <- picked2
otherresult2[[i]] <- setdiff(ids2,picked2)
}
result2x<-unlist(result2) #[1] "T" "T" "X" "T" "X"
This is repeated again. Another letter is added (so we now have 4), and the same for loop is run 5 times, and the results stored again in another vector. Those not used again get stored in otherresult3.
#3rd time
result3 <- vector("list",runs)
otherresult3 <- vector("list",runs)
ids3<-c(ids2, sample(ids.left2,1))
for (i in 1:runs) {
picked3 <- sample(ids3, 1)
result3[[i]] <- picked3
otherresult3[[i]] <- setdiff(ids3,picked3)
}
result3x<-unlist(result3) #[1] "H" "O" "H" "H" "T"
This is just putting the results all together.
#putting results together
results.final <- c(result1x,result2x,result3x)
results.final #[1] "H" "T" "T" "H" "T" "T" "T" "X" "T" "X" "H" "O" "H" "H" "T"
unlist(otherresult1) #[1] "T" "H" "H" "T" "H"
unlist(otherresult2) #[1] "H" "X" "H" "X" "H" "T" "H" "X" "H" "T"
unlist(otherresult3) #[1] "T" "X" "O" "H" "T" "X" "T" "X" "O" "T" "X" "O" "H" "X" "O"
This is all pretty easy when I am only running the for loop 3 times. However, if I wanted to do the same thing (adding in one individual into a pool of individuals) 1000 times, it would be crazy to manually write the code. (Obviously, I wouldn't be using letters if I ran it 1000 times but some other identifier).
My question is therefore, is it possible to more efficiently code these successive for loops?
EDIT: I added in another process in the for-loop (the result being stored in 'otherresult' vector) to try and make this more realistic.
A perfect time to use recursion
recCount <- 1 #which recursive iteration we are in
allLetters <- LETTERS[1:26]
endPoint <- 6 #after how many recursions do we stop
runs <- 5
recEx <- function(resultList,otherResultList,
inLetters,outLetters,
recCount)
{
newLetter <- sample(inLetters,ifelse(recCount==1,2,1)) #pick a letter, 2 if this is the first run
outLetters <- c(outLetters,newLetter) #add this letter to our pool of usable letters
inLetters <- setdiff(inLetters, newLetter) #subtract the chosen letter(s) from the total pool; setdiff also handles the two letters drawn on the first run, where != would recycle
excludedList <- includedList <- list() #initialize the lists we will add to
for (i in 1:runs) {
picked1<-sample(outLetters, 1)
includedList[[i]] <- picked1
excludedList[[i]] <- setdiff(outLetters,picked1)
}
if(recCount == endPoint) return(list(c(resultList,list(includedList)), #if we're done
c(otherResultList,list(excludedList)))) else
return(recEx(c(resultList,list(includedList)), #pass in our results so far, and add the "included" list onto the end
c(otherResultList,list(excludedList)), #same with the "excluded" list
inLetters,outLetters,recCount+1))
}
finalResult <- recEx(list(),list(),allLetters,NULL,1)
> finalResult
[[1]]#1 is for your final results, #2 is for the excluded results
[[1]][[1]]# 1 through 6 are your 6 iterations, with 2 through 7 letters in each iteration
[[1]][[1]][[1]] #1 through 5 are your 5 runs
[1] "H"
[[1]][[1]][[2]]
[1] "T"
[[1]][[1]][[3]]
[1] "T"
[[1]][[1]][[4]]
[1] "H"
[[1]][[1]][[5]]
[1] "T"
[[1]][[2]]
[[1]][[2]][[1]]
[1] "T"
[[1]][[2]][[2]]
[1] "T"
[[1]][[2]][[3]]
[1] "X"
[[1]][[2]][[4]]
[1] "T"
[[1]][[2]][[5]]
[1] "X"
[[1]][[3]]
[[1]][[3]][[1]]
[1] "H"
[[1]][[3]][[2]]
[1] "N"
[[1]][[3]][[3]]
[1] "H"
[[1]][[3]][[4]]
[1] "H"
[[1]][[3]][[5]]
[1] "T"
[[1]][[4]]
[[1]][[4]][[1]]
[1] "Y"
[[1]][[4]][[2]]
[1] "N"
[[1]][[4]][[3]]
[1] "N"
[[1]][[4]][[4]]
[1] "Y"
[[1]][[4]][[5]]
[1] "N"
[[1]][[5]]
[[1]][[5]][[1]]
[1] "N"
[[1]][[5]][[2]]
[1] "N"
[[1]][[5]][[3]]
[1] "T"
[[1]][[5]][[4]]
[1] "H"
[[1]][[5]][[5]]
[1] "Q"
[[1]][[6]]
[[1]][[6]][[1]]
[1] "Y"
[[1]][[6]][[2]]
[1] "Q"
[[1]][[6]][[3]]
[1] "H"
[[1]][[6]][[4]]
[1] "N"
[[1]][[6]][[5]]
[1] "Q"
[[2]] #your excluded letters
[[2]][[1]]
[[2]][[1]][[1]]
[1] "T"
[[2]][[1]][[2]]
[1] "H"
[[2]][[1]][[3]]
[1] "H"
[[2]][[1]][[4]]
[1] "T"
[[2]][[1]][[5]]
[1] "H"
[[2]][[2]]
[[2]][[2]][[1]]
[1] "H" "X"
[[2]][[2]][[2]]
[1] "H" "X"
[[2]][[2]][[3]]
[1] "H" "T"
[[2]][[2]][[4]]
[1] "H" "X"
[[2]][[2]][[5]]
[1] "H" "T"
[[2]][[3]]
[[2]][[3]][[1]]
[1] "T" "X" "N"
[[2]][[3]][[2]]
[1] "H" "T" "X"
[[2]][[3]][[3]]
[1] "T" "X" "N"
[[2]][[3]][[4]]
[1] "T" "X" "N"
[[2]][[3]][[5]]
[1] "H" "X" "N"
[[2]][[4]]
[[2]][[4]][[1]]
[1] "H" "T" "X" "N"
[[2]][[4]][[2]]
[1] "H" "T" "X" "Y"
[[2]][[4]][[3]]
[1] "H" "T" "X" "Y"
[[2]][[4]][[4]]
[1] "H" "T" "X" "N"
[[2]][[4]][[5]]
[1] "H" "T" "X" "Y"
[[2]][[5]]
[[2]][[5]][[1]]
[1] "H" "T" "X" "Y" "Q"
[[2]][[5]][[2]]
[1] "H" "T" "X" "Y" "Q"
[[2]][[5]][[3]]
[1] "H" "X" "N" "Y" "Q"
[[2]][[5]][[4]]
[1] "T" "X" "N" "Y" "Q"
[[2]][[5]][[5]]
[1] "H" "T" "X" "N" "Y"
[[2]][[6]]
[[2]][[6]][[1]]
[1] "H" "T" "X" "N" "Q" "V"
[[2]][[6]][[2]]
[1] "H" "T" "X" "N" "Y" "V"
[[2]][[6]][[3]]
[1] "T" "X" "N" "Y" "Q" "V"
[[2]][[6]][[4]]
[1] "H" "T" "X" "Y" "Q" "V"
[[2]][[6]][[5]]
[1] "H" "T" "X" "N" "Y" "V"
This isn't the best structure for the results, in my opinion, but it is what you specified. Unpacking these lists is trivial, though.
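As a sketch of that unpacking, using a small hypothetical nested list (picked) shaped like finalResult[[1]], i.e. one inner list per iteration with one letter per run:

```r
# Hypothetical stand-in for finalResult[[1]]: 2 iterations of 3 runs each.
picked <- list(
  list("H", "T", "T"),
  list("T", "X", "T")
)

# One character vector per iteration.
per_iteration <- lapply(picked, unlist)
print(per_iteration)

# Everything flattened into a single character vector.
all_picks <- unlist(picked)
print(all_picks)
```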
How about this?
set.seed(123)
#Initializing
ids =LETTERS[1:26]
runs=5
result1 = list()
temp = sample(ids,2)
j=1
results = c()
while(j<6) {
ids.left = ids[!(ids%in%temp)]
for(i in 1:runs){
result1[[i]] = sample(temp,1)
}
temp = c(temp, sample(ids.left,1))
j=j+1
results = c(results, unlist(result1))
}
results # [1] "H" "T" "T" "H" "T" "T" "T" "X" "T" "X" "H" "O" "H" "H" "T" "Y" "O" "O" "Y" "O" "O" "O" "T" "H" "Q"
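Here is another sketch that keeps both outputs the question asks for (the picked letter and the letters not picked) and works for any number of rounds. The function name grow_and_sample and its arguments are made up for illustration, and the draws will not match the outputs shown above, since those depend on the seed and the R version.

```r
# Grow the pool by one id per round; draw `runs` times from it in each round.
grow_and_sample <- function(ids, n_rounds, runs, start_size = 2) {
  pool <- sample(ids, start_size)
  picked <- vector("list", n_rounds)    # preallocate one slot per round
  unpicked <- vector("list", n_rounds)
  for (r in seq_len(n_rounds)) {
    draws <- sample(pool, runs, replace = TRUE)
    picked[[r]] <- draws
    # For each draw, record the pool members that were not chosen.
    unpicked[[r]] <- lapply(draws, function(p) setdiff(pool, p))
    # Add one new id for the next round, if any are left.
    remaining <- setdiff(ids, pool)
    if (length(remaining) > 0) {
      pool <- c(pool, remaining[sample.int(length(remaining), 1)])
    }
  }
  list(picked = picked, unpicked = unpicked)
}

set.seed(123)
res <- grow_and_sample(LETTERS, n_rounds = 3, runs = 5)
print(unlist(res$picked))
```

Preallocating the two lists avoids growing a vector with c() on every pass, which matters more at 1000 rounds than at 3.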

What is a vector of single characters?

I have the following function (from package seqinr):
translate(seq, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
In short, it translates DNA sequences into protein sequences. I have problems supplying the seq argument. The documentation says:
seq = the sequence to translate as a vector of single characters in lower case letters
I store the DNA sequence in a data.frame (named here seq):
seq <- data.frame(geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA", stringsAsFactors=FALSE)
Every time I try to use the translate function, it returns the error:
Error in s2n(seq, levels = s2c("tcag")) :
sequence is not a vector of chars
I have tried the following, all give the above error:
trans<- seqinr::translate(tolower(seq[1,1]))
trans<- seqinr::translate(stringr::str_split(tolower(seq[1,1]), pattern=""))
trans<- seqinr::translate(as.character(stringr::str_split(tolower(seq[1,1]), pattern="")))
How can I transform my DNA sequence into a vector of single characters?
You could use strsplit:
strsplit("ABCD", "")
# [[1]]
# [1] "A" "B" "C" "D"
## your example:
seqinr::translate(strsplit(seq[1,1], "")[[1]])
The first example in ?translate pretty much gives you your answer.
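To see why the attempts in the question fail, here is a base R sketch on a short made-up sequence (dna is hypothetical, not the gene above). Splitting functions return a list with one element per input string, so the character vector has to be pulled out with [[1]]; calling as.character() on the list instead collapses it into a single deparsed string.

```r
dna <- "ATGTGTTGG"  # short hypothetical sequence

# strsplit() returns a list, which is not a vector of single characters.
pieces <- strsplit(tolower(dna), "")
print(class(pieces))

# Extracting the first element gives the character vector itself.
chars <- pieces[[1]]
print(chars)

# as.character() on the list yields one long string, not single characters.
collapsed <- as.character(pieces)
print(length(collapsed))
```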
You don't need a data frame and can use s2c whose sole purpose in life is for "conversion of a string into a vector of chars":
geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA"
library(seqinr) # provides translate() and s2c()
print(translate(s2c(geneSeq), frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE))
## [1] "M" "C" "W" "A" "A" "A" "I" "P" "I" "A" "I" "S" "G" "A" "Q" "A" "I" "S" "G" "Q" "N" "T" "Q" "A" "K"
## [26] "M" "I" "A" "V" "Q" "T" "A" "A" "G" "R" "R" "Q" "A" "M" "E" "I" "M" "R" "Q" "T" "N" "I" "Q" "N" "A"
## [51] "D" "L" "S" "L" "Q" "A" "R" "S" "N" "L" "E" "K" "A" "S" "A" "E" "L" "T" "S" "Q" "N" "M" "X" "K" "V"
## [76] "Q" "A" "I" "G" "S" "I" "R" "A" "A" "I" "G" "E" "S" "M" "L" "E" "G" "S" "S" "M" "D" "R" "I" "K" "R"
## [101] "V" "T" "E" "G" "Q" "F" "I" "R" "E" "A" "N" "M" "V" "T" "E" "N" "Y" "R" "R" "D" "Y" "Q" "A" "I" "F"
## [126] "V" "Q" "Q" "L" "G" "G" "T" "Q" "S" "A" "A" "S" "Q" "I" "D" "E" "I" "Y" "K" "S" "E" "Q" "K" "Q" "K"
## [151] "S" "K" "L" "Q" "M" "V" "L" "D" "P" "L" "A" "I" "M" "G" "S" "S" "A" "A" "S" "A" "Y" "A" "S" "D" "A"
## [176] "F" "D" "S" "K" "F" "T" "T" "K" "A" "P" "I" "V" "A" "A" "K" "G" "T" "K" "T" "G" "R" "*"

R - shuffle a list preserving element sizes

In R, I need an efficient solution to shuffle the elements contained within a list, preserving the total number of elements, and the local element sizes (in this case, each element of the list is a vector)
a<-LETTERS[1:6]
b<-LETTERS[6:10]
c<-LETTERS[c(9:15)]
l=list(a,b,c)
> l
[[1]]
[1] "A" "B" "C" "D" "E" "F"
[[2]]
[1] "F" "G" "H" "I" "J"
[[3]]
[1] "I" "J" "K" "L" "M" "N" "O"
The shuffling should randomly select the letters of the list (without replacement) and put them in a random position of any vector within the list.
I hope I have been clear! Thanks :-)
You may try recreating a second list with the skeleton of the first, and filling it with all the elements of the first list, like this:
u<-unlist(l)
l2<-relist(u[sample(length(u))],skeleton=l)
> l2
[[1]]
[1] "F" "A" "O" "I" "S" "Q"
[[2]]
[1] "R" "P" "K" "F" "G"
[[3]]
[1] "A" "N" "M" "J" "H" "G" "E" "B" "T" "C" "D" "L"
Hope this helps!
Like this...?
> set.seed(1)
> lapply(l, sample)
[[1]]
[1] "B" "F" "C" "D" "A" "E"
[[2]]
[1] "J" "H" "G" "F" "I"
[[3]]
[1] "J" "M" "O" "L" "N" "K" "I"
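The two answers do different things, which a quick check makes visible. This is a sketch on the list from the question: lapply(l, sample) only reorders letters inside each vector, while pooling with unlist() and rebuilding with relist() lets letters move between vectors. The names within and across are mine.

```r
l <- list(LETTERS[1:6], LETTERS[6:10], LETTERS[9:15])
set.seed(1)

# Shuffle inside each vector: every letter stays in its original vector.
within <- lapply(l, sample)

# Shuffle across vectors: pool everything, permute, then restore the shape.
pooled <- unlist(l)
across <- utils::relist(sample(pooled), skeleton = l)

print(lengths(within))
print(lengths(across))
print(across)
```

Both keep the element sizes (6, 5 and 7) and the total number of letters; only the second matches the requirement that a letter may land in any vector of the list.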

Letter "y" comes after "i" when sorting alphabetically

When using the function sort(x), where x is a character vector, the letter "y" jumps into the middle, right after the letter "i":
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u" "v" "w" "x" "y" "z"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[21] "t" "u" "v" "w" "x" "z"
The reason may be that I am located in Lithuania and this is Lithuanian-style sorting of letters, but I need the plain English alphabetical order. How do I change the sorting method back from inside R code?
I'm using R 2.15.2 on Win7.
You need to change the locale that R is running in. Either do that for your entire Windows install (which seems suboptimal) or within the R sessions via:
Sys.setlocale("LC_COLLATE", "C")
You can use any other valid locale string in place of "C" there, but that should get you back to the sort order for letters you want.
Read ?locales for more.
I suppose it is worth noting the sister function Sys.getlocale(), which queries the current setting of a locale parameter. Hence you could do
(locCol <- Sys.getlocale("LC_COLLATE"))
Sys.setlocale("LC_COLLATE", "lt_LT")
sort(letters)
Sys.setlocale("LC_COLLATE", locCol)
sort(letters)
Sys.getlocale("LC_COLLATE")
## giving:
> (locCol <- Sys.getlocale("LC_COLLATE"))
[1] "en_GB.UTF-8"
> Sys.setlocale("LC_COLLATE", "lt_LT")
[1] "lt_LT"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n"
[16] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "z"
> Sys.setlocale("LC_COLLATE", locCol)
[1] "en_GB.UTF-8"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
[16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> Sys.getlocale("LC_COLLATE")
[1] "en_GB.UTF-8"
which of course is what @Hadley's answer shows with_collate() doing somewhat more succinctly, once you have devtools installed.
If you want to do this temporarily, devtools provides the with_collate function:
library(devtools)
with_collate("C", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
# [20] "t" "u" "v" "w" "x" "y" "z"
with_collate("lt_LT", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r"
# [20] "s" "t" "u" "v" "w" "x" "z"
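If you would rather not depend on an extra package, the same temporary switch can be sketched in base R. The helper name sort_c is made up; on.exit() restores the previous collation even if sort() fails. As far as I know, sort(x, method = "radix") also uses C-locale ordering for character vectors in reasonably recent R versions, which avoids touching the locale at all.

```r
# Sort in the C locale, then put the previous collation back.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

before <- Sys.getlocale("LC_COLLATE")
sorted <- sort_c(rev(letters))
print(sorted)

# The session's collation setting is unchanged afterwards.
print(identical(Sys.getlocale("LC_COLLATE"), before))

# Radix sorting of character vectors does not depend on the locale.
print(sort(rev(letters), method = "radix"))
```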
