Here is a DNA string that I want to split into single characters and then recombine in groups of 3:
dna=c("TACACGATGACAGTCTTGACGGGTTCTCCTACT")
dna.sg = unlist(strsplit(dna, ""))
Gives
[1] "T" "A" "C" "A" "C" "G" "A" "T" "G" "A" "C" "A" "G" "T" "C" "T" "T" "G" "A" "C" "G" "G" "G" "T" "T" "C" "T" "C" "C" "T" "A" "C" "T"
But I'd like to have
"TAC" "ACG" [...]
You can split after every 3 characters by giving strsplit a lookbehind pattern:
unlist(strsplit(dna, "(?<=.{3})", perl = TRUE))
#[1] "TAC" "ACG" "ATG" "ACA" "GTC" "TTG" "ACG" "GGT" "TCT" "CCT" "ACT"
Other possibilities:
dna <- c("TACACGATGACAGTCTTGACGGGTTCTCCTACT")
regmatches(dna, gregexpr(".{3}", dna))[[1]]
# [1] "TAC" "ACG" "ATG" "ACA" "GTC" "TTG" "ACG" "GGT" "TCT" "CCT" "ACT"
sapply(seq(1, nchar(dna), 3), \(x) substr(dna, x, x+3-1))
# [1] "TAC" "ACG" "ATG" "ACA" "GTC" "TTG" "ACG" "GGT" "TCT" "CCT" "ACT"
substring(dna, seq(1, nchar(dna), by = 3), seq(3, nchar(dna), by = 3))
# [1] "TAC" "ACG" "ATG" "ACA" "GTC" "TTG" "ACG" "GGT" "TCT" "CCT" "ACT"
unlist(strsplit(gsub("(.{3})", "\\1 ", dna), split = " "))
# [1] "TAC" "ACG" "ATG" "ACA" "GTC" "TTG" "ACG" "GGT" "TCT" "CCT" "ACT"
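All of the one-liners above are shown on a string whose length is a multiple of 3. As a minimal sketch for other lengths, here is a small helper (split_every is a name I made up, not something from base R) that clamps the end positions so a shorter final chunk is kept:

```r
# Hypothetical helper: split a single string into chunks of n characters,
# keeping a shorter final chunk when nchar(x) is not a multiple of n.
split_every <- function(x, n = 3) {
  stopifnot(is.character(x), length(x) == 1, n >= 1)
  len <- nchar(x)
  if (len == 0) return(character(0))       # seq(1, 0, by = n) would error
  starts <- seq(1, len, by = n)            # 1, 4, 7, ...
  substring(x, starts, pmin(starts + n - 1, len))
}

split_every("TACACGAT")
# [1] "TAC" "ACG" "AT"
```

For the 33-character dna string this should give the same 11 codons as the solutions above.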
I felt like benchmarking all of these solutions:
dna <- c("TACACGATGACAGTCTTGACGGGTTCTCCTACT")
library(microbenchmark)
bm <- microbenchmark(
reg = regmatches(dna, gregexpr(".{3}", dna))[[1]],
substr = sapply(seq(1, nchar(dna), 3), \(x) substr(dna, x, x+3-1)),
substring = substring(dna, seq(1, nchar(dna), by = 3), seq(3, nchar(dna), by = 3)),
gsub = unlist(strsplit(gsub("(.{3})", "\\1 ", dna), split = " ")),
strsplit = unlist(strsplit(dna, "(?<=.{3})", perl = TRUE)),
times = 10L,
setup = gc(FALSE)
)
library(ggplot2)  # autoplot() for microbenchmark objects needs ggplot2
autoplot(bm)
gsub seems to be a clear winner!
I guess this is better understood with an example. I feel this should be really easy, but I cannot figure it out...
I have a list that looks like this:
[[1]] [1] "A" "B" "C" "D" "E" "F"
[[2]] [1] "A" "B" "C"
[[3]] [1] "A" "B" "C" "D"
[[4]] [1] "A" "B" "C" "D"
[[5]] [1] "A" "B" "C" "D" "E"
And I want to obtain this:
[[1]] [1] "A" "A;B" "A;B;C" "A;B;C;D" "A;B;C;D;E" "A;B;C;D;E;F"
[[2]] [1] "A" "A;B" "A;B;C"
[[3]] [1] "A" "A;B" "A;B;C" "A;B;C;D"
[[4]] [1] "A" "A;B" "A;B;C" "A;B;C;D"
[[5]] [1] "A" "A;B" "A;B;C" "A;B;C;D" "A;B;C;D;E"
So I need a function to apply in this way:
list2 <- lapply(list1,
function(x) {
#something here
})
We can loop over the list with lapply, loop over the positions of each element with sapply, and paste together the first i values each time:
lapply(list1, function(x) sapply(seq_along(x),
function(i) paste(x[seq_len(i)], collapse=";")))
#[[1]]
#[1] "A" "A;B" "A;B;C" "A;B;C;D" "A;B;C;D;E" "A;B;C;D;E;F"
#[[2]]
#[1] "A" "A;B" "A;B;C"
#[[3]]
#[1] "A" "A;B" "A;B;C" "A;B;C;D"
#[[4]]
#[1] "A" "A;B" "A;B;C" "A;B;C;D"
#[[5]]
#[1] "A" "A;B" "A;B;C" "A;B;C;D" "A;B;C;D;E"
Or another option is Reduce with accumulate = TRUE
lapply(list1, function(x) Reduce(function(...) paste(..., sep=";"), x, accumulate = TRUE))
This can be written without an anonymous function call if the sep is not important (paste then uses its default separator, a space):
lapply(list1, Reduce, f = paste, accumulate = TRUE)
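To see what accumulate = TRUE is doing, here is a small sketch on a single vector. The names acc and nxt are just labels I picked for the running result and the next element:

```r
x <- c("A", "B", "C", "D")

# Reduce() folds the vector from left to right; with accumulate = TRUE it
# keeps every intermediate result instead of only the final one.
steps <- Reduce(function(acc, nxt) paste(acc, nxt, sep = ";"), x, accumulate = TRUE)
steps
# [1] "A"       "A;B"     "A;B;C"   "A;B;C;D"
```

Because every intermediate result has length one, Reduce simplifies the result to a character vector, which is why wrapping it in lapply gives one vector per list element.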
data
list1 <- lapply(c(6, 3, 4, 4, 5), function(i) LETTERS[1:i])
I have a vector like this:
test <- c("a","b","c","d")
test
[1] "a" "b" "c" "d"
And I would like to paste a string, e.g. "_2", onto every other element of the vector, to get this:
"a" "b_2" "c" "d_2"
I tried this command:
ifelse(test %in% seq(1, length(test), 2), test, paste(test, "_2", sep=""))
but this just gives me:
"a_2" "b_2" "c_2" "d_2"
which is wrong. Any suggestions on how to properly do this? Thank you!
How about
paste0(c("a","b","c","d"), c("", "_2"))
[1] "a" "b_2" "c" "d_2"
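This works because paste0 recycles the shorter vector along the longer one. A short sketch with an odd-length input (my own example, not from the question) makes the recycling visible:

```r
test <- c("a", "b", "c", "d", "e")
suffix <- c("", "_2")

# suffix is recycled to "", "_2", "", "_2", "" to match the length of test
recycled <- paste0(test, suffix)
recycled
# [1] "a"   "b_2" "c"   "d_2" "e"

# The same thing with the recycling spelled out explicitly
explicit <- paste0(test, rep(suffix, length.out = length(test)))
```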
Another option would be,
test[c(FALSE, TRUE)] <- paste0(test[c(FALSE, TRUE)], '_2')
test
#[1] "a" "b_2" "c" "d_2"
x <- c("a","b","c","d")
x[seq(2, length(x), by=2)] <- paste0(x[seq(2, length(x), by=2)], "_2")
x
This gives:
"a" "b_2" "c" "d_2"
Is there a function to create a repeating list of letters in R?
something like
letters[1:30]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z" NA NA NA NA
but instead of NA, I would like the output to continue aa, bb, cc, dd ...
It's not too difficult to piece together a quick function to do something like this:
myLetters <- function(length.out) {
a <- rep(letters, length.out = length.out)
grp <- cumsum(a == "a")
vapply(seq_along(a),
function(x) paste(rep(a[x], grp[x]), collapse = ""),
character(1L))
}
myLetters(60)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
# [13] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
# [25] "y" "z" "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj"
# [37] "kk" "ll" "mm" "nn" "oo" "pp" "qq" "rr" "ss" "tt" "uu" "vv"
# [49] "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd" "eee" "fff" "ggg" "hhh"
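If the goal is just this a..z, aa..zz, aaa..zzz pattern, a vectorized sketch with strrep (available in base R since 3.3.0) avoids the per-element loop. The function name rep_letters is made up for this example:

```r
# Hypothetical helper: the i-th name is the ((i - 1) %% 26 + 1)-th letter,
# repeated once for i in 1..26, twice for i in 27..52, and so on.
rep_letters <- function(n) {
  idx <- seq_len(n)
  strrep(letters[(idx - 1) %% 26 + 1], (idx - 1) %/% 26 + 1)
}

rep_letters(30)[25:30]
# [1] "y"  "z"  "aa" "bb" "cc" "dd"
```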
If you just want unique names, you could use
make.unique(rep(letters, length.out = 30), sep='')
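For reference, a quick look at what that call produces: make.unique leaves the first occurrence of each name alone and appends a counter to the repeats, so the names are unique but do not follow the aa, bb, cc pattern:

```r
nm <- make.unique(rep(letters, length.out = 30), sep = "")
tail(nm, 6)
# [1] "y"  "z"  "a1" "b1" "c1" "d1"
```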
Edit:
Here's another way to get repeating letters using Reduce.
myletters <- function(n)
unlist(Reduce(paste0,
replicate(n %/% length(letters), letters, simplify=FALSE),
init=letters,
accumulate=TRUE))[1:n]
myletters(60)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
# [13] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
# [25] "y" "z" "aa" "bb" "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj"
# [37] "kk" "ll" "mm" "nn" "oo" "pp" "qq" "rr" "ss" "tt" "uu" "vv"
# [49] "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd" "eee" "fff" "ggg" "hhh"
Working solution
A function to produce Excel-style column names, i.e.
# A, B, ..., Z, AA, AB, ..., AZ, BA, BB, ..., ..., ZZ, AAA, ...
letterwrap <- function(n, depth = 1) {
args <- lapply(1:depth, FUN = function(x) return(LETTERS))
x <- do.call(expand.grid, args = list(args, stringsAsFactors = F))
x <- x[, rev(names(x)), drop = F]
x <- do.call(paste0, x)
if (n <= length(x)) return(x[1:n])
return(c(x, letterwrap(n - length(x), depth = depth + 1)))
}
letterwrap(26^2 + 52) # through AAZ
Botched attempt
Initially I thought this would best be done cleverly by converting to base 26, but that doesn't work. The issue is that Excel column names aren't base 26, which took me a long time to realize. The catch is 0: if you try to map a letter (like A) to 0, you've got a problem when you want to distinguish between A and AA and AAA...
Another way to illustrate the problem is in terms of "digits". In base 10 there are 10 single-digit numbers (0-9), then 90 double-digit numbers (10-99), 900 three-digit numbers, and so on, generalizing to 10^d - 10^(d - 1) numbers with d digits for d > 1. In Excel column names, however, there are 26 single-letter names, 26^2 double-letter names, 26^3 triple-letter names, with no subtraction.
I'll leave this code as a warning to others:
## Converts a number to base 26, returns a vector for each "digit"
b26 <- function(n) {
stopifnot(n >= 0)
if (n <= 1) return(n)
n26 <- rep(NA, ceiling(log(n, base = 26)))
for (i in seq_along(n26)) {
n26[i] <- (n %% 26)
n <- n %/% 26
}
return(rev(n26))
}
## Returns the name of nth value in the sequence
## A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, ...
letterwrap1 <- function(n, lower = FALSE) {
let <- if (lower) letters else LETTERS
base26 <- b26(n)
base26[base26 == 0] <- 26
paste(let[base26], collapse = "")
}
## Vectorized version of letterwrap1
letter_col_names <- Vectorize(letterwrap1, vectorize.args="n")
> letter_col_names(1:4)
[1] "A" "B" "C" "D"
> letter_col_names(25:30)
[1] "Y" "Z" "AA" "AB" "AC" "AD"
# Looks pretty good
# Until we get here:
> letter_col_names(50:54)
[1] "AX" "AY" "BZ" "BA" "BB"
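For comparison, the "no zero digit" counting described above is what is sometimes called bijective base 26, and it can be converted directly by subtracting 1 before each division. This is only a sketch under that assumption; num2col is a name I made up and it handles one positive integer at a time:

```r
# Hypothetical helper: n-th name in A, B, ..., Z, AA, AB, ..., ZZ, AAA, ...
num2col <- function(n) {
  stopifnot(length(n) == 1, n >= 1, n == floor(n))
  out <- character(0)
  while (n > 0) {
    r <- (n - 1) %% 26              # 0..25 maps to A..Z, so no letter means 0
    out <- c(LETTERS[r + 1], out)   # prepend the least significant letter
    n <- (n - 1) %/% 26
  }
  paste(out, collapse = "")
}

vapply(c(1, 26, 27, 52, 53, 702, 703), num2col, character(1))
# [1] "A"   "Z"   "AA"  "AZ"  "BA"  "ZZ"  "AAA"
```

Shifting by one at every step is what keeps 52 at "AZ" rather than "BZ".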
There is almost certainly a better way, but this is what I ended up with:
letter_wrap <- function(idx) {
  vapply(
    idx,
    function(x) {
      pos <- replace(x %% 26, !x %% 26, 26)  # position in the alphabet, 1..26
      paste0(rep(letters[pos], 1 + (x - 1) %/% 26), collapse = "")
    },
    ""
  )
}
letter_wrap(1:60)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
# [15] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "aa" "bb"
# [29] "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo" "pp"
# [43] "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd"
# [57] "eee" "fff" "ggg" "hhh"
EDIT: failed to notice Ananda's answer before I posted this one. This one is different enough that I'm leaving it. Note it takes the index vector as an input, as opposed to the number of items.
Probably not the cleanest, but easy to see what's happening:
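foo <- letters[1:26]
outlen <- 73  # or whatever length you want (this loop assumes at least 52 and not a multiple of 26)
oof <- character(26)
# append the full blocks: "aa".."zz", then "aaa".."zzz", ...
for (j in 2:(outlen %/% 26)) {
  for (k in 1:26) oof[k] <- paste(rep(letters[k], j), collapse = "")
  foo <- c(foo, oof)
}
# fill the final, partial block
for (jj in 1:(outlen %% 26)) foo[(26 * j) + jj] <- paste(rep(letters[jj], j + 1), collapse = "")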
foo
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
[15] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "aa" "bb"
[29] "cc" "dd" "ee" "ff" "gg" "hh" "ii" "jj" "kk" "ll" "mm" "nn" "oo" "pp"
[43] "qq" "rr" "ss" "tt" "uu" "vv" "ww" "xx" "yy" "zz" "aaa" "bbb" "ccc" "ddd"
[57] "eee" "fff" "ggg" "hhh" "iii" "jjj" "kkk" "lll" "mmm" "nnn" "ooo" "ppp" "qqq" "rrr"
[71] "sss" "ttt" "uuu"
EDIT: Matthew wins, hands-down:
microbenchmark(anandaLetters(5000),matthewletters(5000),carlletters(5000),times=10)
Unit: milliseconds
                 expr       min        lq     median        uq        max neval
  anandaLetters(5000) 85.339200 85.567978 85.9827715 86.260298  86.612231    10
 matthewletters(5000)  3.413706  3.503506  3.9067535  3.946950   4.106453    10
    carlletters(5000) 94.893983 95.405418 96.4492430 97.234784 110.681780    10
Let me make a small correction for the "AY" "BZ" sequence problem. Whenever a digit is 0 (and so becomes Z), you have to subtract one from the preceding digit.
colExcel2num <- function(x) {
p <- seq(from = nchar(x) - 1, to = 0)
y <- utf8ToInt(x) - utf8ToInt("A") + 1L
S <- sum(y * 26^p)
return(S)
}
## Converts a number to base 26, returns a vector for each "digit"
b26 <- function(n) {
stopifnot(n >= 0)
if (n <= 1) return(n)
n26 <- rep(NA, ceiling(log(n, base = 26)))
for (i in seq_along(n26)) {
n26[i] <- (n %% 26)
n <- n %/% 26
}
return(rev(n26))
}
## Returns the Excel column name for a given column position
## A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, ...
colnum2Excel <- function(n, lower = FALSE) {
let <- if (lower) letters else LETTERS
base26 <- b26(n)
i <- base26 == 0
base26[i] <- 26
# lead() comes from dplyr; flag the digit just before each zero digit
prev <- dplyr::lead(i, default = FALSE)
base26[prev] <- base26[prev] - 1
paste(let[base26], collapse = "")
}
## Return df's column index based on column name
## A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, ...
## by looking up the column's position in the df
varnum2Excel <- function(df, colname, lower = FALSE) {
index <- match(colname, names(df))
stopifnot(!is.na(index))
return(colnum2Excel(index, lower = lower))
}
Here is an example:
library(dplyr)     # %>%, mutate(), n(), and the lead() used in colnum2Excel()
library(openxlsx)
table <- data.frame(milk = c(1,2,3), oranges = c(2,4,6))
table <- table %>%
mutate(
ajjhh = sprintf(paste0(
varnum2Excel(.,"milk"), "%1$s", " + ",
varnum2Excel(.,"oranges"),"%1$s"),
2:(n()+1)
)
)
class(table$ajjhh) <- c(class(table$ajjhh), "formula")
wb <- createWorkbook()
addWorksheet(wb = wb, sheetName = "Sheet1", tabColour = "chocolate4")
writeData(wb, "Sheet1", x = table)
saveWorkbook(wb, "formulashasnotgone.xlsx", overwrite = TRUE)
Given the following data (myinput.txt):
A q,y,h
B y,f,g
C n,r,q
### more rows
How can I convert it into such data structure in R?
$A
[1] "q" "y" "h"
$B
[1] "y" "f" "g"
$C
[1] "n" "r" "q"
I've assumed this as your data:
dat <- read.table(text="q,y,h
y,f,g
n,r,q", header=FALSE, sep=",", row.names=c("A", "B", "C"))
If you want an automatic method:
as.list(as.data.frame((t(dat)), stringsAsFactors=FALSE))
## $A
## [1] "q" "y" "h"
##
## $B
## [1] "y" "f" "g"
##
## $C
## [1] "n" "r" "q"
Another couple of methods which work are:
lapply(apply(dat, 1, list), "[[", 1)
unlist(apply(dat, 1, list), recursive=FALSE)
Using a bit of readLines, strsplit and regex to break the names off the start of each line:
dat <- readLines(textConnection("A q,y,h
B y,f,g
C n,r,q"))
# split on the whitespace after the name, or on commas, then drop the name
result <- lapply(strsplit(dat, "\\s+|,"), function(x) x[-1])
names(result) <- sub("\\s+.*$", "", dat)
> result
$A
[1] "q" "y" "h"
$B
[1] "y" "f" "g"
$C
[1] "n" "r" "q"
or with less regex and more steps:
result <- strsplit(dat, "\\s+|,")
names(result) <- sapply(result, "[", 1)
result <- lapply(result, function(x) x[-1])
> result
$A
[1] "q" "y" "h"
$B
[1] "y" "f" "g"
$C
[1] "n" "r" "q"
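Another possible route, sketched here on the assumption that each line is a name, some whitespace, then a comma-separated field: let read.table separate the two columns and only use strsplit for the commas. The column names "name" and "values" are placeholders I chose:

```r
lines <- c("A q,y,h", "B y,f,g", "C n,r,q")

# read.table splits on whitespace by default, giving one column of names
# and one column holding the comma-separated values as a single string
parts <- read.table(text = lines, header = FALSE,
                    col.names = c("name", "values"),
                    colClasses = "character")

result <- setNames(strsplit(parts$values, ",", fixed = TRUE), parts$name)
result$A
# [1] "q" "y" "h"
```

With a real file, text = lines would be replaced by the file name (myinput.txt in the question).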
I have a very simple question: how can I split the following text into groups of 3 characters in a single call?
mycodes <- c("ATTTGGGCTAATTTTGTTTCTTTCTGGGTCTCTC")
strsplit(mycodes, split = character(3), fixed = T, perl = FALSE, useBytes = FALSE)
[[1]]
[1] "A" "T" "T" "T" "G" "G" "G" "C" "T" "A" "A" "T" "T" "T" "T" "G" "T" "T" "T" "C"
[21] "T" "T" "T" "C" "T" "G" "G" "G" "T" "C" "T" "C" "T" "C"
This is not what I want; I want three letters at a time:
[1] "ATT" "TGG" "GCT" ... and so on; the final element may have one, two or three letters, depending on how many are left.
Thanks;
I assume you want to work with codons. If that's the case, you might want to look at the Biostrings package from Bioconductor. It provides a variety of tools for working with biological sequence data.
library(Biostrings)
?codons
You can achieve what you want, with a little bit of clumsy coercion:
as.character(codons(DNAString(mycodes)))
Here is one approach using stringr package
require(stringr)
start = seq(1, nchar(mycodes), 3)
stop = pmin(start + 2, nchar(mycodes))
str_sub(mycodes, start, stop)
Output is
[1] "ATT" "TGG" "GCT" "AAT" "TTT" "GTT" "TCT" "TTC" "TGG"
[10] "GTC" "TCT" "C"
You can also use:
strsplit(mycodes, '(?<=.{3})', perl=TRUE)
[[1]]
[1] "ATT" "TGG" "GCT" "AAT" "TTT" "GTT" "TCT" "TTC" "TGG" "GTC" "TCT" "C"
or
library(stringi)
stri_extract_all_regex(mycodes, '.{1,3}')
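The same '.{1,3}' pattern also works in base R without extra packages. A small sketch using regmatches and gregexpr, where the greedy quantifier takes three characters whenever it can and whatever is left at the end:

```r
mycodes <- "ATTTGGGCTAATTTTGTTTCTTTCTGGGTCTCTC"

# gregexpr() finds every non-overlapping match; regmatches() extracts them
chunks <- regmatches(mycodes, gregexpr(".{1,3}", mycodes))[[1]]
chunks
# [1] "ATT" "TGG" "GCT" "AAT" "TTT" "GTT" "TCT" "TTC" "TGG" "GTC" "TCT" "C"
```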