How to expand loop n times in R programming?

I have a vector of characters 'A', 'B', 'C', 'D' and would like to loop n times to get all possible combinations (4^n) of the characters. How do I write a function that will perform this given input n?
For example, if n=2, my loop will look something like this:
string <- c('A','B','C','D')
combination = c()
count = 1
for (j in string) {
  for (k in string) {
    combination[count] <- paste0(j,k)
    count = count + 1
  }
}
which will yield:
> combination
[1] "AA" "AB" "AC" "AD" "BA" "BB" "BC" "BD" "CA" "CB" "CC" "CD" "DA" "DB" "DC" "DD"
and if n=3, the code will be like this:
combination = c()
count = 1
for (j in string) {
  for (k in string) {
    for (l in string) {
      combination[count] <- paste0(j,k,l)
      count = count + 1
    }
  }
}
which yields
> combination
[1] "AAA" "AAB" "AAC" "AAD" "ABA" "ABB" "ABC" "ABD" "ACA" "ACB" "ACC" "ACD" "ADA" "ADB" "ADC" "ADD" "BAA" "BAB" "BAC"
[20] "BAD" "BBA" "BBB" "BBC" "BBD" "BCA" "BCB" "BCC" "BCD" "BDA" "BDB" "BDC" "BDD" "CAA" "CAB" "CAC" "CAD" "CBA" "CBB"
[39] "CBC" "CBD" "CCA" "CCB" "CCC" "CCD" "CDA" "CDB" "CDC" "CDD" "DAA" "DAB" "DAC" "DAD" "DBA" "DBB" "DBC" "DBD" "DCA"
[58] "DCB" "DCC" "DCD" "DDA" "DDB" "DDC" "DDD"

In a word, recursion:
#' n: length of each combination string
#' basis: the starting vector of characters to combine
#' extras: the vector of characters to be combined with basis. Defaults to basis
#' i: The current depth of recursion. Users should generally not need to access
#' this parameter.
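combine <- function(n, basis=c('A','B','C','D'), extras=basis, i=1) {
  # Guard: for n = 1 there is nothing to combine (without it the recursion never ends)
  if (n <= 1) return(basis)
  x <- expand.grid(basis, extras)
  y <- paste0(x$Var1, x$Var2)
  if (i == n-1) {
    return(y)
  } else {
    return(combine(n, y, extras, i+1))
  }
}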
Giving, for example,
> combine(2)
[1] "AA" "BA" "CA" "DA" "AB" "BB" "CB" "DB" "AC" "BC" "CC" "DC" "AD" "BD" "CD" "DD"
and
> combine(3)
[1] "AAA" "BAA" "CAA" "DAA" "ABA" "BBA" "CBA" "DBA" "ACA" "BCA" "CCA" "DCA" "ADA" "BDA" "CDA" "DDA" "AAB" "BAB" "CAB" "DAB" "ABB" "BBB" "CBB" "DBB" "ACB" "BCB" "CCB"
[28] "DCB" "ADB" "BDB" "CDB" "DDB" "AAC" "BAC" "CAC" "DAC" "ABC" "BBC" "CBC" "DBC" "ACC" "BCC" "CCC" "DCC" "ADC" "BDC" "CDC" "DDC" "AAD" "BAD" "CAD" "DAD" "ABD" "BBD"
[55] "CBD" "DBD" "ACD" "BCD" "CCD" "DCD" "ADD" "BDD" "CDD" "DDD"
etc.
Feel free to sort the output if another order is more desirable.
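If the order produced by the nested loops matters, a non-recursive variant is also possible. The following is only a minimal sketch, not part of the answer above; the function name all_strings and its argument names are made up for illustration. It builds all n positions with a single expand.grid call and reverses the columns so that the first character varies slowest, as in the loops:

```r
# Hypothetical helper: all length-n strings over `chars`, in nested-loop order
all_strings <- function(n, chars = c("A", "B", "C", "D")) {
  stopifnot(n >= 1)
  # One grid column per position; expand.grid varies the first column fastest
  grid <- expand.grid(rep(list(chars), n), stringsAsFactors = FALSE)
  # Reverse the columns so the leftmost character changes slowest
  cols <- unname(as.list(grid[, rev(seq_len(n)), drop = FALSE]))
  do.call(paste0, cols)
}

all_strings(2)
```

For n = 2 this should start with "AA" "AB" "AC" "AD" "BA", matching the loop version in the question.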

Related

Creating a list of 6-character strings that have a distance/ difference of at least 3 per string? (for a DNA/ oligo-related problem)

I want to create a list of strings composed only of A's, G's, C's and T's that have a pairwise difference/distance of at least 3 (two such strings/oligos could be, e.g., ATCTGA and TAGTGC). I can create all the possible combinations of a 6 nt string, but I can't work out how to select only a subset of 3-distant oligos. I know that there will be more than one such list, but any list would do.
I have not really done much DNA data manipulation, so I am unsure how to approach this and would appreciate suggestions of any tools out there.
Thanks
Given a reference oligo x of nonzero length, this function returns a character vector listing all oligos of equal length whose Hamming distance from x is at least mindist.
oligo1 <- function(x, mindist = 0L) {
  acgt <- c("A", "C", "G", "T")
  x <- match(strsplit(x, "")[[1L]], acgt)
  if ((n <- length(x)) == 0L || anyNA(x)) {
    stop("'x' is not a valid oligo.")
  }
  if (mindist > n) {
    return(character(0L))
  }
  P <- gtools::permutations(4L, n, repeats.allowed = TRUE)
  if (mindist > 0L) {
    P <- P[rowSums(P != rep.int(x, rep.int(4^n, n))) >= mindist, , drop = FALSE]
  }
  m <- nrow(P)
  do.call(paste0, split(acgt[P], gl(n, m)))
}
oligo1("AA", 0L)
## [1] "AA" "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC"
## [11] "GG" "GT" "TA" "TC" "TG" "TT"
oligo1("AA", 1L)
## [1] "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC" "GG"
## [11] "GT" "TA" "TC" "TG" "TT"
oligo1("AA", 2L)
## [1] "CC" "CG" "CT" "GC" "GG" "GT" "TC" "TG" "TT"
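Employing the above repeatedly, you can greedily build a set containing x whose elements mutually satisfy the condition on Hamming distance. More precisely, you can construct a vector y such that x %in% y and the Hamming distance from y[i] to y[j] is at least mindist for all i != j. The result is maximal in the sense that no further oligo can be added to it, although a greedy search like this is not guaranteed to find the largest possible set.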
oligo2 <- function(x, mindist = 0L) {
  y <- c(x, oligo1(x, mindist))
  n <- length(y)
  pos <- 2L
  while (pos < n) {
    y <- c(y[1:pos], intersect(y[(pos+1L):n], oligo1(y[pos], mindist)))
    n <- length(y)
    pos <- pos + 1L
  }
  y
}
oligo2("AA", 0L)
## [1] "AA" "AA" "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA"
## [11] "GC" "GG" "GT" "TA" "TC" "TG" "TT"
oligo2("AA", 1L)
## [1] "AA" "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC"
## [11] "GG" "GT" "TA" "TC" "TG" "TT"
oligo2("AA", 2L)
## [1] "AA" "CC" "GG" "TT"
Hence one possible answer to your question would be:
oligo2("AAAAAA", 3L)
## [1] "AAAAAA" "AAACCC" "AAAGGG" "AAATTT" "AACACG"
## [6] "AACCAT" "AACGTA" "AACTGC" "AAGAGT" "AAGCTG"
## [11] "AAGGAC" "AAGTCA" "AATATC" "AATCGA" "AATGCT"
## [16] "AATTAG" "ACAACT" "ACACAG" "ACAGTC" "ACATGA"
## [21] "ACCAAC" "ACCCCA" "ACCGGT" "ACCTTG" "ACGATA"
## [26] "ACGCGC" "ACGGCG" "ACGTAT" "ACTAGG" "ACTCTT"
## [31] "ACTGAA" "ACTTCC" "AGAAGC" "AGACTA" "AGAGAT"
## [36] "AGATCG" "AGCATT" "AGCCGG" "AGCGCC" "AGCTAA"
## [41] "AGGAAG" "AGGCCT" "AGGGGA" "AGGTTC" "AGTACA"
## [46] "AGTCAC" "AGTGTG" "AGTTGT" "ATAATG" "ATACGT"
## [51] "ATAGCA" "ATATAC" "ATCAGA" "ATCCTC" "ATCGAG"
## [56] "ATCTCT" "ATGACC" "ATGCAA" "ATGGTT" "ATGTGG"
## [61] "ATTAAT" "ATTCCG" "ATTGGC" "ATTTTA"
The length-6 oligos in this list are mutually at least 3-distant.
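To double-check a candidate list, the smallest pairwise Hamming distance can be computed directly. This is a minimal sketch in base R; the helper name min_pairwise_hamming is hypothetical, and it assumes at least two strings, all of the same length:

```r
# Hypothetical helper: smallest Hamming distance over all pairs in `y`
# (assumes length(y) >= 2 and that all strings have the same length)
min_pairwise_hamming <- function(y) {
  stopifnot(length(y) >= 2, length(unique(nchar(y))) == 1)
  # One row per oligo, one column per position
  chars <- do.call(rbind, strsplit(y, ""))
  # All index pairs (i, j) with i < j, one pair per column
  pairs <- combn(length(y), 2)
  # Count mismatching positions for every pair, then take the minimum
  mismatches <- chars[pairs[1, ], , drop = FALSE] != chars[pairs[2, ], , drop = FALSE]
  min(rowSums(mismatches))
}

min_pairwise_hamming(c("ATCTGA", "TAGTGC"))
```

A list satisfies the condition in the question when the returned value is at least 3.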

R - how to sort named list of character vectors

I'm new to R and looking for the following:
My input:
v = list(bob=c("aa", "cc"), cas=c("tt", "ff"), john=c("aa", "bb"))
v
$bob
[1] "aa" "cc"
$cas
[1] "tt" "ff"
$john
[1] "aa" "bb"
I want to sort based on the character vectors inside it, the desired output I'm looking for :
sorted_v
$john
[1] "aa" "bb"
$bob
[1] "aa" "cc"
$cas
[1] "tt" "ff"
How to obtain sorted_v?
We can paste the elements of each vector together into a single string, sort those strings, and use the resulting names to reorder the list.
sorted_v <- v[names(sort(sapply(v, paste0, collapse = "")))]
sorted_v
#$john
#[1] "aa" "bb"
#$bob
#[1] "aa" "cc"
#$cas
#[1] "tt" "ff"
OR
as @ycw mentioned in the comments, we can also use toString instead of the paste0/collapse combination:
sorted_v <- v[names(sort(sapply(v, toString)))]
Also, using @A5C1D2H2I1M1N2O1R2T1's and @ycw's inputs, we can reduce it to
v[order(sapply(v, toString))]
#$john
#[1] "aa" "bb"
#$bob
#[1] "aa" "cc"
#$cas
#[1] "tt" "ff"
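One caveat with pasted keys is that different vectors can collapse to the same string (for example c("a", "bc") and c("ab", "c") with collapse = ""). A possible alternative, sketched here under the assumption that every vector in the list has the same length, is to sort by the first element, then the second, and so on. The function name sort_by_elements is made up for illustration:

```r
# Hypothetical helper: order a named list of equal-length character vectors
# by their first elements, breaking ties with the second elements, etc.
sort_by_elements <- function(v) {
  keys <- do.call(rbind, v)  # one row per list element
  cols <- lapply(seq_len(ncol(keys)), function(j) keys[, j])
  v[do.call(order, cols)]    # order() uses later columns as tie-breakers
}

v <- list(bob = c("aa", "cc"), cas = c("tt", "ff"), john = c("aa", "bb"))
names(sort_by_elements(v))
```

For the input from the question this should give the names in the order john, bob, cas.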

R, split string to pairs of character

How can I split a string in R in the following way? Please look at the example:
example:
c("ex", "xa", "am", "mp", "pl", "le") ?
x = "example"
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
# [1] "ex" "xa" "am" "mp" "pl" "le"
You could, of course, wrap it into a function, maybe omit non-letters (I don't know if the colon was supposed to be part of your string or not), etc.
To do this to a vector of strings, you can use it as an anonymous function with lapply:
lapply(month.name, function(x) substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x)))
# [[1]]
# [1] "Ja" "an" "nu" "ua" "ar" "ry"
#
# [[2]]
# [1] "Fe" "eb" "br" "ru" "ua" "ar" "ry"
#
# [[3]]
# [1] "Ma" "ar" "rc" "ch"
# ...
Or make it into a named function and use it by name. This would make sense if you'll use it somewhat frequently.
str_split_pairs = function(x) {
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
}
lapply(month.name, str_split_pairs)
## same result as above
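One edge case worth keeping in mind: for a string with fewer than two characters, 1:(nchar(x) - 1) counts backwards (1:0 is c(1, 0)), so the result is not an empty vector. A minimal sketch of a guarded variant follows; the name str_split_pairs_safe is hypothetical:

```r
# Hypothetical variant that returns character(0) for strings shorter than 2
str_split_pairs_safe <- function(x) {
  n <- nchar(x)
  if (n < 2) return(character(0))
  starts <- seq_len(n - 1)  # 1, 2, ..., n - 1
  substring(x, first = starts, last = starts + 1)
}

str_split_pairs_safe("example")
str_split_pairs_safe("a")
```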
Here's another option, using lead() from dplyr (though it's slower than @Gregor's answer):
library(dplyr)  # provides lead()
x=c("example", "stackoverflow", "programming")
lapply(x, function(i) {
  i = unlist(strsplit(i,""))
  paste0(i, lead(i))[-length(i)]
})
[[1]]
[1] "ex" "xa" "am" "mp" "pl" "le"
[[2]]
[1] "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow"
[[3]]
[1] "pr" "ro" "og" "gr" "ra" "am" "mm" "mi" "in" "ng"

R strsplit, nested lists blues

I am facing this issue in R in which I want to split the strings on comma and then further split on semicolon, but only keep the first item before the semicolon i.e. ee and jj below. I have tried a bunch of things but nested lists seem too convoluted!
Here's what I am doing:
d <- c("aa,bb,cc,dd,ee;e,ff",
"gg,hh,ii,jj;j")
e=strsplit(d,",")
myfun2 <- function(x,arg1) {
strsplit(x,";")
}
f=lapply(e,myfun2)
f=
[[1]]
[[1]][[1]]
[1] "aa"
[[1]][[2]]
[1] "bb"
[[1]][[3]]
[1] "cc"
[[1]][[4]]
[1] "dd"
[[1]][[5]]
[1] "ee" "e"
[[1]][[6]]
[1] "ff"
[[2]]
[[2]][[1]]
[1] "gg"
[[2]][[2]]
[1] "hh"
[[2]][[3]]
[1] "ii"
[[2]][[4]]
[1] "jj" "j"
Here's the output that I want
Correct output=
[[1]]
[1] "aa" "bb" "cc" "dd" "ee" "ff"
[[2]]
[1] "gg" "hh" "ii" "jj"
I have tried a bunch of things using lapply to the nested list "f" and used "[[" and "[" but with no success.
Any help is greatly appreciated. (I know that I am missing something silly, but just can't figure it out right now!)
This is your code
d <- c("aa,bb,cc,dd,ee;e,ff", "gg,hh,ii,jj;j")
e <- strsplit(d,",")
myfun2 <- function(x,arg1) { strsplit(x,";") }
f <- lapply(e,myfun2)
If we start from your f, then the next step would be
lapply(f,function(x) mapply(`[`,x,1))
[[1]]
[1] "aa" "bb" "cc" "dd" "ee" "ff"
[[2]]
[1] "gg" "hh" "ii" "jj"
Basically, you need an inner and outer type apply function to go down the two levels of nesting.
We can use gsub to match the pattern ; followed by one or more lowercase letters, replace it with '', and then split (strsplit) on ,.
strsplit(gsub(';[a-z]+', '', d), ',')
#[[1]]
#[1] "aa" "bb" "cc" "dd" "ee" "ff"
#[[2]]
#[1] "gg" "hh" "ii" "jj"
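The two ideas above can also be combined: split on the comma first, then drop everything from the semicolon onwards in each piece. This is only a sketch; the function name keep_before_semicolon is made up, and it assumes that whatever follows a semicolon within a piece should be discarded, whether or not it consists of lowercase letters:

```r
# Hypothetical helper: split on "," and keep only the text before any ";"
keep_before_semicolon <- function(d) {
  pieces <- strsplit(d, ",", fixed = TRUE)
  lapply(pieces, function(p) sub(";.*$", "", p))
}

d <- c("aa,bb,cc,dd,ee;e,ff", "gg,hh,ii,jj;j")
keep_before_semicolon(d)
```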

repeating vector of letters

Is there a function to create a repeating list of letters in R?
something like
letters[1:30]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z" NA NA NA NA
but instead of NA, I would like the output to continue aa, bb, cc, dd ...
It's not too difficult to piece together a quick function to do something like this:
myLetters <- function(length.out) {
  a <- rep(letters, length.out = length.out)
  grp <- cumsum(a == "a")
  vapply(seq_along(a),
         function(x) paste(rep(a[x], grp[x]), collapse = ""),
         character(1L))
}
myLetters(60)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
# [13] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
# [25] "y" "z" "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj"
# [37] "kk" "ll" "mm" "nn" "oo" "pp" "qq" "rr" "ss" "tt" "uu" "vv"
# [49] "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd" "eee" "fff" "ggg" "hhh"
If you just want unique names, you could use
make.unique(rep(letters, length.out = 30), sep='')
# the first 26 names are unchanged; the last four become "a1" "b1" "c1" "d1"
Edit:
Here's another way to get repeating letters using Reduce.
myletters <- function(n)
unlist(Reduce(paste0,
replicate(n %/% length(letters), letters, simplify=FALSE),
init=letters,
accumulate=TRUE))[1:n]
myletters(60)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
# [13] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
# [25] "y" "z" "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj"
# [37] "kk" "ll" "mm" "nn" "oo" "pp" "qq" "rr" "ss" "tt" "uu" "vv"
# [49] "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd" "eee" "fff" "ggg" "hhh"
Working solution
A function to produce Excel-style column names, i.e.
# A, B, ..., Z, AA, AB, ..., AZ, BA, BB, ..., ..., ZZ, AAA, ...
letterwrap <- function(n, depth = 1) {
  args <- lapply(1:depth, FUN = function(x) return(LETTERS))
  x <- do.call(expand.grid, args = list(args, stringsAsFactors = FALSE))
  x <- x[, rev(names(x)), drop = FALSE]
  x <- do.call(paste0, x)
  if (n <= length(x)) return(x[1:n])
  return(c(x, letterwrap(n - length(x), depth = depth + 1)))
}
letterwrap(26^2 + 52) # through AAZ
Botched attempt
Initially I thought this would best be done cleverly by converting to base 26, but that doesn't work. The issue is that Excel column names aren't base 26, which took me a long time to realize. The catch is 0: if you try to map a letter (like A) to 0, you've got a problem when you want to distinguish between A and AA and AAA...
Another way to illustrate the problem is in "digits". In base 10, there are 10 single-digit numbers (0-9), then 90 double-digit numbers (10:99), 900 three-digit numbers... generalizing to 10^d - 10^(d - 1) numbers with d digits for d > 1. However, in Excel column names there are 26 single-letter names, 26^2 double-letter names, 26^3 triple-letter names, with no subtraction.
I'll leave this code as a warning to others:
## Converts a number to base 26, returns a vector for each "digit"
b26 <- function(n) {
stopifnot(n >= 0)
if (n <= 1) return(n)
n26 <- rep(NA, ceiling(log(n, base = 26)))
for (i in seq_along(n26)) {
n26[i] <- (n %% 26)
n <- n %/% 26
}
return(rev(n26))
}
## Returns the name of nth value in the sequence
## A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, ...
letterwrap1 <- function(n, lower = FALSE) {
let <- if (lower) letters else LETTERS
base26 <- b26(n)
base26[base26 == 0] <- 26
paste(let[base26], collapse = "")
}
## Vectorized version of letterwrap1
letter_col_names <- Vectorize(letterwrap1, vectorize.args="n")
> letter_col_names(1:4)
[1] "A" "B" "C" "D"
> letter_col_names(25:30)
[1] "Y" "Z" "AA" "AB" "AC" "AD"
# Looks pretty good
# Until we get here:
> letter_col_names(50:54)
[1] "AX" "AY" "BZ" "BA" "BB"
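The numbering scheme described above, with digits running from 1 to 26 and no zero, is sometimes called bijective base 26. A minimal sketch of a direct conversion follows; it is not part of the original answer, and the function name num_to_excel_col is made up. The trick is to subtract 1 before each division so that a remainder of 0 means A and 25 means Z:

```r
# Hypothetical helper: nth Excel-style column name via bijective base 26
num_to_excel_col <- function(n) {
  stopifnot(length(n) == 1, n >= 1)
  out <- character(0)
  while (n > 0) {
    r <- (n - 1) %% 26          # 0..25 maps to A..Z
    out <- c(LETTERS[r + 1], out)
    n <- (n - 1) %/% 26         # shift to the next, more significant letter
  }
  paste(out, collapse = "")
}

vapply(c(26, 27, 52, 53, 702, 703), num_to_excel_col, character(1))
```

For these inputs the result should be "Z" "AA" "AZ" "BA" "ZZ" "AAA".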
There is almost certainly a better way, but this is what I ended up with:
letter_wrap <- function(idx) {
  vapply(
    idx,
    function(x)
      paste0(
        rep(letters[replace(x %% 26, !x %% 26, 26)], 1 + (x - 1) %/% 26),
        collapse = ""),
    "")
}
letter_wrap(1:60)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
# [15] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "aa" "bb"
# [29] "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo" "pp"
# [43] "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd"
# [57] "eee" "fff" "ggg" "hhh"
EDIT: failed to notice Ananda's answer before I posted this one. This one is different enough that I'm leaving it. Note it takes the index vector as an input, as opposed to the number of items.
Probably not the cleanest, but easy to see what's happening:
foo<-letters[1:26]
outlen <- 73 # or whatever length you want
oof <- vector(len=26)
for ( j in 2:(outlen%/%26)) {
for (k in 1:26) oof[k] <- paste(rep(letters[k],j),sep='',collapse='')
foo<-c(foo,oof)
}
for (jj in 1:(outlen%%26) ) foo[(26*j)+jj]<-paste(rep(letters[jj],(j+1)),sep='',collapse='')
foo
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
[15] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "aa" "bb"
[29] "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo" "pp"
[43] "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd"
[57] "eee" "fff" "ggg" "hhh" "iii" "jjj" "kkk" "lll" "mmm" "nnn" "ooo" "ppp" "qqq" "rrr"
[71] "sss" "ttt" "uuu"
EDIT: Matthew wins, hands-down:
microbenchmark(anandaLetters(5000),matthewletters(5000),carlletters(5000),times=10)
Unit: milliseconds
expr min lq median uq max neval
anandaLetters(5000) 85.339200 85.567978 85.9827715 86.260298 86.612231 10
matthewletters(5000) 3.413706 3.503506 3.9067535 3.946950 4.106453 10
carlletters(5000) 94.893983 95.405418 96.4492430 97.234784 110.681780 10
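For comparison, the same repeated-letter sequence can be written without any explicit loop or apply call, because strrep in base R (available since R 3.3.0) is vectorized over both the string and the repeat count. This is a sketch only; the name rep_letters is hypothetical and the function was not part of the benchmark above:

```r
# Hypothetical helper: "a".."z", "aa".."zz", "aaa"... of length n
rep_letters <- function(n) {
  idx <- seq_len(n)
  letter <- letters[(idx - 1) %% 26 + 1]  # which letter: cycles a..z
  times  <- (idx - 1) %/% 26 + 1          # how often to repeat it
  strrep(letter, times)
}

rep_letters(60)[c(1, 26, 27, 53, 60)]
```

The selected elements should be "a" "z" "aa" "aaa" "hhh", in line with the outputs shown earlier.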
Here is a small correction for the "AY" "BZ" sequence: when a digit is 0 and gets mapped to Z, you have to subtract one from the preceding digit.
colExcel2num <- function(x) {
p <- seq(from = nchar(x) - 1, to = 0)
y <- utf8ToInt(x) - utf8ToInt("A") + 1L
S <- sum(y * 26^p)
return(S)
}
## Converts a number to base 26, returns a vector for each "digit"
b26 <- function(n) {
stopifnot(n >= 0)
if (n <= 1) return(n)
n26 <- rep(NA, ceiling(log(n, base = 26)))
for (i in seq_along(n26)) {
n26[i] <- (n %% 26)
n <- n %/% 26
}
return(rev(n26))
}
## Returns the Excel column name for a given column position
## A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, ...
colnum2Excel <- function(n, lower = FALSE) {
let <- if (lower) letters else LETTERS
base26 <- b26(n)
i <- base26 == 0
base26[i] <- 26
base26[lead(i, default = FALSE)] <- base26[lead(i, default = FALSE)] - 1
paste(let[base26], collapse = "")
}
## Returns the Excel column name for a df column, based on the column name
## A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, ...
## by looking up the column number in the df
varnum2Excel <- function(df, colname, lower = FALSE) {
index <- match(colname, names(df))
stopifnot(index > 0)
return(colnum2Excel(index))
}
Here is an example:
library(dplyr)  # for %>%, mutate(), n() and lead()
require(openxlsx)
table <- data.frame(milk = c(1,2,3), oranges = c(2,4,6))
table <- table %>%
mutate(
ajjhh = sprintf(paste0(
varnum2Excel(.,"milk"), "%1$s", " + ",
varnum2Excel(.,"oranges"),"%1$s"),
2:(n()+1)
)
)
class(table$ajjhh) <- c(class(table$ajjhh), "formula")
wb <- createWorkbook()
addWorksheet(wb = wb, sheetName = "Sheet1", tabColour = "chocolate4")
writeData (wb, "Sheet1", x = table)
saveWorkbook(wb, "formulashasnotgone.xlsx", overwrite = TRUE)

Resources