How to repeat column names with a specific frequency in R

I have a 10x5 matrix. Each of the five columns is named.
I need to create a vector like this:
c( rep(colnames(mymatrix)[1], dim(mymatrix)[1]),
rep(colnames(mymatrix)[2], dim(mymatrix)[1]),
...
rep(colnames(mymatrix)[5], dim(mymatrix)[1]))
However, what if I have a varying number of columns? How do I automate this without using a for loop?
Thanks!

You can do this with the each argument to rep:
rep(colnames(mymatrix), each=dim(mymatrix)[1])
To see how this works, you can try:
v = c("h", "e", "l", "l", "o")
rep(v, each=5)
# [1] "h" "h" "h" "h" "h" "e" "e" "e" "e" "e" "l" "l" "l" "l" "l" "l" "l" "l" "l"
# [20] "l" "o" "o" "o" "o" "o"
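Applied to a matrix, the same idea can be sketched as follows. The small matrix and its column names are made up for illustration; nrow(mymatrix) is equivalent to dim(mymatrix)[1].

```r
# Hypothetical 3x2 matrix with named columns (stand-in for the 10x5 one)
mymatrix <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# Repeat each column name once per row; works for any number of columns
labels <- rep(colnames(mymatrix), each = nrow(mymatrix))
labels
# [1] "a" "a" "a" "b" "b" "b"
```

The result has one label per matrix cell, in column-major order, so it lines up with as.vector(mymatrix).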

Related

Print vowels from the vector

I need to extract the vowels from R's built-in LETTERS vector:
"A", "E", etc.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
[25] "Y" "Z"
Maybe someone knows how to do it with if() or other functions. Thank you in advance.
It looks like you need to extract the vowels. Does this work?
> vowels <- c('A','E','I','O','U')
> LETTERS[sapply(vowels, function(ch) grep(ch, LETTERS))]
[1] "A" "E" "I" "O" "U"
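A shorter alternative, offered as a sketch rather than as the answerer's method, is to subset with %in% (or use intersect), which avoids calling grep once per vowel:

```r
vowels <- c("A", "E", "I", "O", "U")

# Keep the elements of LETTERS that appear in vowels
found <- LETTERS[LETTERS %in% vowels]
found
# [1] "A" "E" "I" "O" "U"

# intersect() gives the same result here
found2 <- intersect(LETTERS, vowels)
```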

Matching between a vector and multiple vectors in a list in R

I have a list of vectors such as:
>list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "a" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s" "i" "d"
I want to find the matches between each vector and the remaining ones (i.e. between the 1st and the other two, the 2nd and the other two, and so on) and keep the pair with the highest number of matches. I could do it with a "for" loop, intersecting pair by pair. For example:
for (i in 2:3) { intersect(list[[1]],list[[i]]) }
and then save the output into a vector or some other structure. However, this seems inefficient to me (given that I have thousands of vectors rather than 3), and I am wondering whether R has a built-in function that does this in a clever way.
So the question would be:
Is there a way to look for matches of one vector to a list of vectors without the explicit use of a "for" loop?
I don't believe there is a built-in function for this. The best you could try is something like:
lsts <- lapply(1:5, function(x) sample(letters, 10)) # make some data (see below)
maxcomb <- which.max(apply(combs <- combn(length(lsts), 2), 2,
function(ix) length(intersect(lsts[[ix[1]]], lsts[[ix[2]]]))))
lsts <- lsts[combs[, maxcomb]]
# [[1]]
# [1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
# [[2]]
# [1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
A dump of the original:
[[1]]
[1] "z" "r" "j" "h" "e" "m" "w" "u" "q" "f"
[[2]]
[1] "m" "v" "x" "d" "a" "g" "r" "b" "s" "t"
[[3]]
[1] "w" "v" "t" "i" "d" "p" "l" "e" "s" "x"
[[4]]
[1] "c" "o" "t" "j" "d" "g" "u" "k" "w" "h"
[[5]]
[1] "f" "g" "q" "y" "d" "e" "n" "s" "w" "i"
datal <- list (a=c(2,2,1,2),
b=c(2,2,2,4,3),
c=c(1,2,3,4))
# all possible combinations
combs <- combn(length(datal), 2)
# split into list
combs <- split(combs, rep(1:ncol(combs), each = nrow(combs)))
# calculate length of intersection for every combination
intersections_length <- sapply(combs, function(y) {
  length(intersect(datal[[y[1]]], datal[[y[2]]]))
})
# What lists have biggest intersection
combs[which(intersections_length == max(intersections_length))]
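For reference, here is a minimal sketch of both ideas applied to the three vectors from the question. The names vecs, one_vs_rest, pairs, overlap and best_pair are made up for this example. Note that sapply and apply still loop internally; they only hide the explicit "for".

```r
# The three vectors from the question
vecs <- list(
  c("a", "m", "l", "s", "t", "o"),
  c("a", "y", "o", "t", "e"),
  c("n", "a", "s", "i", "d")
)

# Matches of the first vector against each of the others
one_vs_rest <- sapply(vecs[-1], function(v) length(intersect(vecs[[1]], v)))

# All pairs of indices (one pair per column), and the overlap size of each pair
pairs <- combn(length(vecs), 2)
overlap <- apply(pairs, 2, function(ix) {
  length(intersect(vecs[[ix[1]]], vecs[[ix[2]]]))
})

# Indices of the pair sharing the most elements
best_pair <- pairs[, which.max(overlap)]
best_pair
# [1] 1 2
```

With n vectors there are n * (n - 1) / 2 pairs, so for thousands of vectors the number of intersect calls grows quickly regardless of how the loop is written.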

what is a vector of single characters

I have the following function (from package seqinr):
translate(seq, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
In short, it translates DNA sequences into protein sequences. I am having trouble supplying the seq argument. The documentation says:
seq = the sequence to translate as a vector of single characters in lower case letters
I store the DNA sequence in a data.frame (named here seq):
seq <- data.frame(geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA", stringsAsFactors=FALSE)
Every time I try to use the translate function, it returns the error:
Error in s2n(seq, levels = s2c("tcag")) :
sequence is not a vector of chars
I have tried the following, all give the above error:
trans<- seqinr::translate(tolower(seq[1,1]))
trans<- seqinr::translate(stringr::str_split(tolower(seq[1,1]), pattern=""))
trans<- seqinr::translate(as.character(stringr::str_split(tolower(seq[1,1]), pattern="")))
How can I transform my DNA sequence into a vector of single characters?
You could use strsplit:
strsplit("ABCD", "")
# [[1]]
# [1] "A" "B" "C" "D"
## your example:
seqinr::translate(strsplit(seq[1,1], "")[[1]])
The first example in ?translate pretty much gives you your answer.
You don't need a data frame and can use s2c whose sole purpose in life is for "conversion of a string into a vector of chars":
geneSeq="ATGTGTTGGGCAGCCGCAATACCTATCGCTATATCTGGCGCTCAGGCTATCAGTGGTCAGAACACTCAAGCCAAAATGATTGCCGTTCAGACCGCTGCTGGTCGTCGTCAAGCTATGGAAATCATGAGGCAGACGAACATCCAGAATGCTGACCTATCGTTGCAAGCTCGAAGTAACCTTGAGAAAGCGTCCGCCGAGTTGACCTCACAGAACATGCAKAAGGTCCAAGCTATTGGGTCTATCCGAGCGGCTATCGGAGAAAGTATGCTTGAAGGTTCCTCAATGGACCGTATTAAGCGAGTCACAGAAGGACAGTTCATTCGGGAAGCCAATATGGTAACTGAGAACTATCGCCGTGACTACCAAGCAATCTTCGTACAGCAACTTGGTGGTACTCAAAGTGCTGCAAGTCAGATTGACGAAATCTATAAGAGCGAACAGAAACAGAAGAGTAAGCTACAGATGGTTCTGGACCCACTGGCTATCATGGGGTCTTCCGCTGCGAGTGCTTACGCATCCGATGCGTTCGACTCTAAGTTCACAACTAAGGCACCTATTGTTGCCGCTAAAGGAACCAAGACGGGGAGGTAA"
library(seqinr)
print(translate(s2c(geneSeq), frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE))
## [1] "M" "C" "W" "A" "A" "A" "I" "P" "I" "A" "I" "S" "G" "A" "Q" "A" "I" "S" "G" "Q" "N" "T" "Q" "A" "K"
## [26] "M" "I" "A" "V" "Q" "T" "A" "A" "G" "R" "R" "Q" "A" "M" "E" "I" "M" "R" "Q" "T" "N" "I" "Q" "N" "A"
## [51] "D" "L" "S" "L" "Q" "A" "R" "S" "N" "L" "E" "K" "A" "S" "A" "E" "L" "T" "S" "Q" "N" "M" "X" "K" "V"
## [76] "Q" "A" "I" "G" "S" "I" "R" "A" "A" "I" "G" "E" "S" "M" "L" "E" "G" "S" "S" "M" "D" "R" "I" "K" "R"
## [101] "V" "T" "E" "G" "Q" "F" "I" "R" "E" "A" "N" "M" "V" "T" "E" "N" "Y" "R" "R" "D" "Y" "Q" "A" "I" "F"
## [126] "V" "Q" "Q" "L" "G" "G" "T" "Q" "S" "A" "A" "S" "Q" "I" "D" "E" "I" "Y" "K" "S" "E" "Q" "K" "Q" "K"
## [151] "S" "K" "L" "Q" "M" "V" "L" "D" "P" "L" "A" "I" "M" "G" "S" "S" "A" "A" "S" "A" "Y" "A" "S" "D" "A"
## [176] "F" "D" "S" "K" "F" "T" "T" "K" "A" "P" "I" "V" "A" "A" "K" "G" "T" "K" "T" "G" "R" "*"
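To see why the attempts in the question produce "sequence is not a vector of chars", it may help to inspect what each one builds. The sketch below uses base R only, with a short made-up sequence and strsplit standing in for stringr::str_split (both return a list); it does not call seqinr.

```r
dna <- "ATGTGT"  # short stand-in for the full sequence

# Attempt 1: still one long string, not one element per base
one_string <- tolower(dna)
length(one_string)        # 1

# Attempt 2: splitting returns a list holding the character vector
as_list <- strsplit(tolower(dna), "")
is.list(as_list)          # TRUE

# Attempt 3: as.character() on that list gives a single deparsed string
collapsed <- as.character(as_list)
length(collapsed)         # 1

# What is wanted: the first list element, one character per base
chars <- as_list[[1]]
chars
# [1] "a" "t" "g" "t" "g" "t"
```

The [[1]] step is what turns the list returned by the split into the plain character vector that the documentation describes.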

Letter "y" comes after "i" when sorting alphabetically

When using the function sort(x), where x is a character vector, the letter "y" jumps into the middle, right after the letter "i":
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u" "v" "w" "x" "y" "z"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[21] "t" "u" "v" "w" "x" "z"
The reason may be that I am located in Lithuania, and this is Lithuanian-style sorting of letters, but I need normal sorting. How do I change the sorting method back to normal inside R code?
I'm using R 2.15.2 on Win7.
You need to change the locale that R is running in. Either do that for your entire Windows install (which seems suboptimal) or within the R sessions via:
Sys.setlocale("LC_COLLATE", "C")
You can use any other valid locale string in place of "C" there, but that should get you back to the sort order for letters you want.
Read ?locales for more.
I suppose it is worth noting the sister function Sys.getlocale(), which queries the current setting of a locale parameter. Hence you could do
(locCol <- Sys.getlocale("LC_COLLATE"))
Sys.setlocale("LC_COLLATE", "lt_LT")
sort(letters)
Sys.setlocale("LC_COLLATE", locCol)
sort(letters)
Sys.getlocale("LC_COLLATE")
## giving:
> (locCol <- Sys.getlocale("LC_COLLATE"))
[1] "en_GB.UTF-8"
> Sys.setlocale("LC_COLLATE", "lt_LT")
[1] "lt_LT"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n"
[16] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "z"
> Sys.setlocale("LC_COLLATE", locCol)
[1] "en_GB.UTF-8"
> sort(letters)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
[16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> Sys.getlocale("LC_COLLATE")
[1] "en_GB.UTF-8"
which of course is what @Hadley's answer shows with_collate() doing somewhat more succinctly once you have devtools installed.
If you want to do this temporarily, devtools provides the with_collate function:
library(devtools)
with_collate("C", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
# [20] "t" "u" "v" "w" "x" "y" "z"
with_collate("lt_LT", sort(letters))
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "y" "j" "k" "l" "m" "n" "o" "p" "q" "r"
# [20] "s" "t" "u" "v" "w" "x" "z"

ggplot2 heatmap : how to preserve the label order?

I'm trying to plot a heatmap in ggplot2 using CSV data, following casbon's solution in
http://biostar.stackexchange.com/questions/921/how-to-draw-a-csv-data-file-as-a-heatmap-using-numpy-and-matplotlib
The problem is that the x-axis labels re-sort themselves. For example, if I swap the labels COG0002 and COG0001 in that example data, the x-axis labels still come out in sorted order (cog0001, cog0002, cog0003, ..., cog0008).
Is there any way to prevent this? I want them to be ordered as in the CSV file.
thanks
pp
If I recall, when calling factor(x) with the default levels argument, the levels are set as levels = sort(unique(x)).
You can override this action by setting levels = unique(x).
For example:
set.seed(1)
x = sample(letters, 100, replace = TRUE)
head(x, 5)
[1] "g" "j" "o" "x" "f"
levels(factor(x))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
levels(factor(x, levels = unique(x)))
[1] "g" "j" "o" "x" "f" "y" "r" "q" "b" "e" "u" "m" "s" "z" "d" "k" "a" "w" "i"
[20] "p" "v" "c" "n" "t" "l" "h"
You can see that setting levels = unique(x) preserves the order of occurrence in the data.
If you want to keep the order directly from the CSV file:
foomelt$COG <- factor(foomelt$COG, levels = unique(as.character(foo[[1]])))
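As a self-contained illustration of that line, here is a sketch with a made-up data frame in place of foomelt (the column names COG and value are assumptions for the example). ggplot2 orders a discrete axis by factor levels, so fixing the levels in order of appearance is what keeps the file order.

```r
# Hypothetical stand-in for the melted CSV data; COG0002 appears first
df <- data.frame(
  COG = c("COG0002", "COG0001", "COG0003", "COG0002"),
  value = c(0.4, 0.1, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# Default factor(): levels are sorted alphabetically
default_levels <- levels(factor(df$COG))

# levels = unique(...): levels follow order of first appearance
df$COG <- factor(df$COG, levels = unique(df$COG))
levels(df$COG)
# [1] "COG0002" "COG0001" "COG0003"
```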
Did you try reordering factor levels before plotting?
e.g.
foomelt$COG = factor(foomelt$COG,levels(foomelt$COG)[c(2,1,3:8)])
(I can't try it right now, so I can't be sure that it works)
