I want to find how many nucleotide combinations occur in a sequence. I mean 2-element combinations such as AA, AT, AG, AC, ... (16 of them), or 3-element combinations such as ATG, ACG, ... (64 of them). I know how to do this with a package, and I will write that down here, but I want to create my own code to perform it.
The seqinr package is perfect for this job. This is the code that I used:
install.packages('seqinr')
library(seqinr)
m = read.fasta(file='sequence.fasta')
mseq = m[[1]]
count(mseq, 2) # how many 2-element combinations are found in the sequence
count(mseq, 3) # how many 3-element combinations are found in the sequence
This is a slow way to do it; I am certain it is faster with the Bioconductor packages (a sketch of that route follows the function below).
# some practice data
mseq = paste(sample(c("A", "C", "G", "T"), 1000, rep=T), collapse="")
# define a function called count
count = function(mseq, n) {
  # split the sequence into every possible (overlapping) subsequence of length n
  x = sapply(1:(nchar(mseq) - n + 1), function(i) substr(mseq, i, i + n - 1))
  # how many unique subsequences of length n are there?
  length(table(x))
}
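For the Bioconductor route mentioned above, here is a hedged sketch using the Biostrings package (an assumption on my part that this is the package meant; its oligonucleotideFrequency() counts overlapping k-mers):
# assumes Biostrings is installed, e.g. via BiocManager::install("Biostrings")
library(Biostrings)
dna <- readDNAStringSet("sequence.fasta")[[1]] # first record in the FASTA file
oligonucleotideFrequency(dna, width = 2) # counts for all 16 dinucleotides
oligonucleotideFrequency(dna, width = 3) # counts for all 64 trinucleotides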
Actually, I just checked, and this is pretty much how seqinr did it:
function (seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"),
          frame = start)
{
    if (!missing(frame))
        start = frame
    istarts <- seq(from = 1 + start, to = length(seq), by = by)
    oligos <- seq[istarts]
    oligos.levels <- levels(as.factor(words(wordsize, alphabet = alphabet)))
    if (wordsize >= 2) {
        for (i in 2:wordsize) {
            oligos <- paste(oligos, seq[istarts + i - 1], sep = "")
        }
    }
    counts <- table(factor(oligos, levels = oligos.levels))
    if (freq == TRUE)
        counts <- counts/sum(counts)
    return(counts)
}
If you want to see the source code of a function, use getAnywhere():
getAnywhere(count)
The simple thing to do is just something like this:
# Generate a test sequence
set.seed(1234)
testSeq <- paste(sample(LETTERS[1:3], 100, replace = T), collapse = "")
# Split string into chunks of size 2 and then count occurrences
testBigram <- substring(testSeq, seq(1, nchar(testSeq), 2), seq(2, nchar(testSeq), 2))
table(testBigram)
AA AB AC BA BB BC CA CB CC
10 10 14  3  3  2  2  5  1
Here is a way using a "function factory" (https://adv-r.hadley.nz/function-factories.html).
The 2-element and 3-element combinations are n-grams of size 2 and 3. So we make this n-gram function factory.
# Generate a function to create a function
ngram <- function(size) {
  function(myvector) {
    substring(myvector, seq(1, nchar(myvector), size), seq(size, nchar(myvector), size))
  }
}
# Assign the functions names (optional)
bigram <- ngram(2)
trigram <- ngram(3)
# 2 element combinations
table(bigram(testSeq))
AA AB AC BA BB BC CA CB CC
10 10 14  3  3  2  2  5  1
# count of 2 element combinations
length(unique(bigram(testSeq)))
[1] 9
# counting function
count <- function(mseq, n) length(unique(ngram(n)(mseq)))
count(testSeq, 2)
[1] 9
# and if we wanted to do this with 3-element combinations
table(trigram(testSeq))
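Note that substring() above splits the sequence into non-overlapping chunks, whereas seqinr's count() slides one position at a time. If you want the overlapping behaviour instead, a sketch of the same factory with a sliding window:
ngram_overlapping <- function(size) {
  function(myvector) {
    starts <- seq_len(nchar(myvector) - size + 1) # every start position
    substring(myvector, starts, starts + size - 1)
  }
}
table(ngram_overlapping(2)(testSeq)) # overlapping bigram counts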
I have 100,000 individuals.
Using a combination of upper-case letters, lower-case letters, and numbers, I want to create a five-character ID for each individual. I should not have any duplicates.
How can I do this? I have tried the code below, but I get 4 duplicates.
What is the number of possible unique combinations to create a 5 character ID with "letters", "LETTERS" and "0:9"?
set.seed(0)
mydata <- data.frame(
  ID = rep(NA, 10^5),
  Poids = rnorm(n = 10^5, mean = 65, sd = 5)
)
for (i in 1:nrow(mydata)) {
  mydata$ID[i] <- c(
    paste(sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1), sep = "")
  )
}
table(duplicated(mydata$ID))
FALSE TRUE
99996 4
(length(letters) + length(LETTERS) + length(0:9))^5 is 916,132,832, so there is plenty of space to avoid clashes.
In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its unique string of characters using a bit of modular arithmetic and indexing. This can all be done in a single pass:
space <- c(LETTERS, letters, 0:9)
set.seed(0)
samps <- sample(length(space)^5, 10^5)
m <- matrix("", nrow = 10^5, ncol = 5)
for (i in seq(ncol(m))) {
  m[, i] <- space[(samps %% length(space)) + 1] # i-th base-62 digit, mapped to a character
  samps <- samps %/% length(space)              # shift to the next digit
}
ID <- apply(m, 1, paste, collapse = "")
We can see this fulfils our requirements:
head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"
length(ID)
#> [1] 100000
length(unique(ID))
#> [1] 100000
The whole thing takes less than a second on my modest machine:
user system elapsed
0.72 0.00 0.74
Update
It occurs to me that it is possible to give 100,000 people a unique ID using an alphabet of only 16 characters, i.e. 0-9 and a-f, with code that is much quicker and simpler than the above:
set.seed(0)
ID <- as.hexmode(sample(16^5, 10^5))
head(ID)
#> [1] "d43f9" "392a7" "033a2" "cf1d7" "aa10e" "134bb"
length(unique(ID))
#> [1] 100000
Which takes less than 10 milliseconds.
You can try the code below (given N <- 1e5 and k <- 5):
n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)
where
n is the smallest alphabet size with n^k >= N, so a subset of the whole space is enough to support N unique combinations, e.g., N <- 100000
S denotes the sub-space of characters from which we draw the letters and digits
expand.grid generates all combinations of k characters drawn from S
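A quick sanity check: every row of expand.grid is distinct, so the pasted IDs are too.
length(ID) # 100000
length(unique(ID)) # 100000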
If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:
library(arrangements)
x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)
ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))
rbind(head(mydata), tail(mydata))
# ID Poids
# 1 abcde 64.46278
# 2 abcdf 62.00053
# 3 abcdg 75.71787
# 4 abcdh 67.73765
# 5 abcdi 66.45402
# 6 abcdj 66.85561
# 99995 abFpe 56.20545
# 99996 abFpf 64.14443
# 99997 abFpg 70.70191
# 99998 abFph 66.83226
# 99999 abFpi 65.22835
# 100000 abFpj 56.28880
This is quite fast:
user system elapsed
0.194 0.001 0.203
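Since these IDs come out in lexicographic order, a final shuffle restores a random-looking assignment if that matters:
mydata$ID <- sample(mydata$ID) # random permutation; uniqueness is preserved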
Suppose I have the following data.table, for arbitrary J:
library(data.table)
set.seed(1)
J = 2
n = 100
BB = data.table(r = 1:n)
BB[, (paste0("a", seq(J))) := rnorm(n, 1, 7)]
So the output is...
> BB
r a1 a2
1: 1 -3.38517668 -3.38517668
2: 2 2.28550327 2.28550327
3: 3 -4.84940029 -4.84940029
...
How come the two columns are identical rather than different rnorm draws?
The right-hand side rnorm(n, 1, 7) is evaluated once and the resulting vector is recycled across all J columns. To draw a fresh vector per column, you can use the super-fast for-set combination:
for(i in seq(J))
set(x = BB, j = paste0('a',i), value = rnorm(n, 1, 7))
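An alternative that stays in a single := call is to supply a list with one freshly drawn vector per column; a sketch using replicate() with simplify = FALSE to build that list:
BB[, (paste0("a", seq(J))) := replicate(J, rnorm(n, 1, 7), simplify = FALSE)]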
I am simulating some random DNA strings (sequences) in R. These sequences are composed of the letters A, C, G and T.
In a dataset of such sequences, some may be identical (duplicates).
Suppose I have 100 such sequences, which can be grouped in 5 unique sets.
Working code that I now have (shown below) will output something like
Unique groups: 1 2 3 4 5
Number of Sequences: 96 1 1 1 1
However, I would like the distribution to be something like
Unique groups: 1 2 3 4 5
Number of Sequences: 56 24 10 5 5
This can be accomplished through randomly allocating some sequences from group 1 (96 sequences) to all other groups containing only a single sequence.
My sequences are stored in a variable called 'res'.
The code below doesn't quite do what I would like:
res <- res[sample(unique(nrow(res)), size = nrow(res), replace = TRUE), ]
In the above code, 'size' is the number of DNA sequences in the dataset.
How can the above code be altered to accomplish what I am needing?
Note: one does not know the number of unique sets that will be generated a priori, so using the 'probs' argument in sample will not suffice.
Reproducible example:
library(pegas)
num.seqs <- 100 # number of DNA sequences to generate
length.seqs <- 500 # length of DNA sequences
mu.rate <- 1e-4 # mutation rate
set.seed(1234) # for reproducibility
nucl <- as.DNAbin(c('a','c','g','t')) # DNA alphabet
res <- sample(nucl, size = length.seqs, replace = TRUE, prob = rep(0.25, 4)) # generate a random DNA sequence
mu.set <- list('a' = as.DNAbin('c'),
               'a' = as.DNAbin('g'),
               'a' = as.DNAbin('t'),
               'c' = as.DNAbin('a'),
               'c' = as.DNAbin('g'),
               'c' = as.DNAbin('t'),
               'g' = as.DNAbin('a'),
               'g' = as.DNAbin('c'),
               'g' = as.DNAbin('t'),
               't' = as.DNAbin('a'),
               't' = as.DNAbin('c'),
               't' = as.DNAbin('g'))
muts <- function(res) {
  unlist(mu.set[as.character(res)])
}
duplicate.seq <- function(res) { # duplicate a sequence
  num.muts <- rbinom(n = 1, size = length.seqs, prob = mu.rate) # add random mutations
  if (num.muts > 0) {
    idx <- sample(length.seqs, size = num.muts, replace = FALSE)
    res[idx] <- muts(res[idx])
  }
  res
}
res <- matrix(replicate(num.seqs, duplicate.seq(res)), byrow = TRUE, nrow = num.seqs) # generate num.seqs sequences
class(res) <- "DNAbin"
h <- sort(haplotype(res), decreasing = TRUE, what = "frequencies") # distribution of unique groups
rownames(h) <- 1:nrow(h)
h # print distribution
A simpler example (sans R package)
num.seqs <- 10
length.seqs <- 10
lets <- letters[c(1, 3, 7, 20)] # DNA alphabet
lets
res <- t(replicate(num.seqs, sample(lets, length.seqs, replace = TRUE))) # generate sequences
res <- res[sample(nrow(res), size = nrow(res), replace = TRUE), ] # duplicate some sequences
I am trying to calculate the combinations of elements of a matrix, where each element should appear only once.
The (real) matrix is symmetric and can have more than 5 elements (up to ~2000):
o <- matrix(runif(25), ncol = 5, nrow = 5)
dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
# A B C D E
# A 0.4400317 0.1715681 0.7319108946 0.3994685 0.4466997
# B 0.5190471 0.1666164 0.3430245044 0.3837903 0.9322599
# C 0.3249180 0.6122229 0.6312876740 0.8017402 0.0141673
# D 0.1641411 0.1581701 0.0001703419 0.7379847 0.8347536
# E 0.4853255 0.5865909 0.6096330935 0.8749807 0.7230507
I want to calculate the product over all combinations of pairs (if possible, all elements should appear: AB, CD, EF for a matrix of 6 elements), where for each pair one letter indexes the column and the other the row. Here are some combinations:
AB, CD, E
AC, BD, E
AD, BC, E
AE, BC, D
AE, BD, C
The value of the unpaired single element is taken to be just 1.
Combinations not desired:
AB, BC: Element B appears twice
AB, AC: Element A appears twice
Things I tried:
I thought about removing the unwanted part of the matrix:
out <- which(upper.tri(o), arr.ind = TRUE)
out <- cbind.data.frame(out, value = o[upper.tri(o)])
out[, 1] <- colnames(o)[out[, 1]]
out[, 2] <- colnames(o)[out[, 2]]
# row col value
# 1 A B 0.1715681
# 2 A C 0.7319109
# 3 B C 0.3430245
# 4 A D 0.3994685
# 5 B D 0.3837903
# 6 C D 0.8017402
# 7 A E 0.4466997
# 8 B E 0.9322599
# 9 C E 0.0141673
# 10 D E 0.8347536
My attempt involves the following process:
1. Make a copy of the matrix (out)
2. Store the first value of the first row.
3. Remove all the pairs that involve either element of that pair.
4. Select the next pair from the resulting matrix.
5. Repeat until all rows are removed from the matrix.
6. Repeat steps 2-5 starting from a different row.
However, this method has one big problem: it doesn't guarantee that all the combinations are stored, and it could store the same combination several times.
My expected output is a vector, where each element is the product of the values in the cells selected by the combination:
AB, CD: 0.137553
How can I extract all those combinations efficiently?
This might work. I tested this on N elements = 5 and 6.
Note that this is not optimised, and hopefully can provide a framework for you to work from. With a much larger array, I can see steps involving apply and combn being a bottleneck.
The idea here is to generate a collection of unique sets first before calculating the product of the sets from another data.frame that stores values of sets.
Unique sets are identified by counting the number of unique elements in all combination pairs. For example, if N elements = 6, we expect length(unlist(combination)) == 6. The same is true if N elements = 7 (there will only be 3 pairs plus a remainder element). In cases where N elements is odd, we can ignore the remaining, unpaired element since it is constrained by the other elements.
library(dplyr)
library(reshape2)
## some functions
unique_by_n <- function(inlist, N) {
  ## select unique combinations by count:
  ## if the combination is unique, we expect 6 distinct elements when N elements = 6
  if (N %% 2) N <- N - 1 ## for odd numbers of elements
  return(length(unique(unlist(inlist))) == N)
}
get_combs <- function(x, xall) {
  ## format the pairs and catch the remainder for a matrix with an odd number of elements
  xu <- unlist(x)
  remainder <- setdiff(xall, xu) ## catch the remainder, if any
  xset <- unlist(lapply(x, paste0, collapse = ''))
  finalset <- c(xset, remainder)
  return(finalset)
}
## make dataset
set.seed(0) ## set reproducible example
#o <- matrix(runif(25), ncol = 5, nrow = 5) ## uncomment to test 5
#dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
o <- matrix(runif(36), ncol = 6, nrow = 6)
dimnames(o) <- list(LETTERS[1:6], LETTERS[1:6])
o[lower.tri(o)] <- t(o)[lower.tri(o)] ## make matrix symmetric
n_elements = nrow(o)
#### get matrix
dat <- melt(o, varnames = c('Rw', 'Cl'), as.is = TRUE)
dat$Set <- apply(dat, 1, function(x) paste0(sort(unique(x[1:2])), collapse = ''))
## get unique sets (since your matrix is symmetric)
dat <- subset(dat, !duplicated(Set))
#### get sets
elements <- rownames(o)
allpairs <- expand.grid(Rw = elements, Cl = elements) %>%
  filter(Rw != Cl) ## get all pairs
uniqpairsgrid <- unique(t(apply(allpairs,1,sort)))
uniqpairs <- split(uniqpairsgrid, seq(nrow(uniqpairsgrid))) ## get unique pairs
allpaircombs <- combn(uniqpairs,floor(n_elements/2)) ## get combinations of pairs
uniqcombs <- allpaircombs[,apply(allpaircombs, 2, unique_by_n, N = n_elements)] ## remove pairs with repeats
finalcombs <- apply(uniqcombs, 2, get_combs, xall=elements)
#### calculate results
res <- apply(finalcombs, 2, function(x) prod(subset(dat, Set %in% x)$value)) ## calculate product
names(res) <- apply(finalcombs, 2, paste0, collapse=',') ## add names
resdf <- data.frame(Sets = names(res), Products = res, stringsAsFactors = FALSE, row.names = NULL)
print(resdf)
#> Sets Products
#> 1 AB,CD,EF 0.130063454
#> 2 AB,CE,DF 0.171200062
#> 3 AB,CF,DE 0.007212619
#> 4 AC,BD,EF 0.012494787
#> 5 AC,BE,DF 0.023285088
#> 6 AC,BF,DE 0.001139712
#> 7 AD,BC,EF 0.126900247
#> 8 AD,BE,CF 0.158919605
#> 9 AD,BF,CE 0.184631344
#> 10 AE,BC,DF 0.042572488
#> 11 AE,BD,CF 0.028608495
#> 12 AE,BF,CD 0.047056905
#> 13 AF,BC,DE 0.003131029
#> 14 AF,BD,CE 0.049941770
#> 15 AF,BE,CD 0.070707311
Maybe the following does what you want.
Note that I was more interested in being right than in performance.
Also, I have set the RNG seed, to have reproducible results.
set.seed(9840) # Make reproducible results
o <- matrix(runif(25), ncol = 5, nrow = 5)
dimnames(o) <- list(LETTERS[1:5], LETTERS[1:5])
cmb <- combn(LETTERS[1:5], 2)
n <- ncol(cmb)
res <- NULL
nms <- NULL
for (i in seq_len(n)) {
  for (j in seq_len(n)[-seq_len(i)]) {
    x <- unique(c(cmb[, i], cmb[, j]))
    if (length(x) == 4) {
      res <- c(res, o[cmb[1, i], cmb[2, i]] * o[cmb[1, j], cmb[2, j]])
      nms <- c(nms, paste0(cmb[1, i], cmb[2, i], '*', cmb[1, j], cmb[2, j]))
    }
  }
}
names(res) <- nms
res
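The named vector makes the result easy to inspect, e.g. to see the largest products first:
head(sort(res, decreasing = TRUE))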
I am trying to figure out how to analyze multiple select/multiple responses (i.e., 'select all that apply') questions in a survey I recently conducted.
SPSS has nice capabilities for analyzing online survey data and these types of questions, so I am guessing that R has that and more. Dealing with these survey answers is a bit tricky in Excel. For example: show me a histogram/distribution of everyone who likes strawberry and chocolate ice cream, by age.
How do I structure the data set and what would be the commands to perform some basic tabulations of frequency, pareto, and logical AND OR functions?
I've not found anything that is quite as convenient as the multiple response sets in SPSS. However, you can create groups relatively easily based on common column names, and then use any of the apply() family of functions to iterate through each group. Here's one approach using adply() from the plyr package:
library(plyr)
set.seed(1)
# Fake data with three "like" questions. 0 = not selected, 1 = selected
dat <- data.frame(resp = 1:10,
                  like1 = sample(0:1, 10, TRUE),
                  like2 = sample(0:1, 10, TRUE),
                  like3 = sample(0:1, 10, TRUE)
)
adply(dat[grepl("like", colnames(dat))], 2, function(x)
  data.frame(Count = as.data.frame(table(x))[2, 2],
             Perc = as.data.frame(prop.table(table(x)))[2, 2]))
#-----
X1 Count Perc
1 like1 6 0.6
2 like2 5 0.5
3 like3 3 0.3
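The same tabulation works in base R with sapply(), if you would rather avoid the plyr dependency (a sketch using the dat above):
t(sapply(dat[grepl("like", colnames(dat))], function(x)
  c(Count = sum(x == 1), Perc = mean(x == 1))))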
I recently wrote a quick function to deal with these. You can easily modify it to add proportion of total responses too.
set.seed(1)
dat <- data.frame(resp = 1:10,
                  like1 = sample(0:1, 10, TRUE),
                  like2 = sample(0:1, 10, TRUE),
                  like3 = sample(0:1, 10, TRUE))
The function:
multi.freq.table = function(data, sep = "", dropzero = FALSE, clean = TRUE) {
  # Takes boolean multiple-response data and tabulates it according
  # to the possible combinations of each variable.
  #
  # See: http://stackoverflow.com/q/11348391/1270695
  counts = data.frame(table(data))
  N = ncol(counts)
  counts$Combn = apply(counts[-N] == 1, 1,
                       function(x) paste(names(counts[-N])[x], collapse = sep))
  if (isTRUE(dropzero)) {
    counts = counts[counts$Freq != 0, ]
  }
  if (isTRUE(clean)) {
    counts = data.frame(Combn = counts$Combn, Freq = counts$Freq)
  }
  counts
}
Apply the function:
multi.freq.table(dat[-1], sep="-")
#               Combn Freq
# 1                      1
# 2             like1    2
# 3             like2    2
# 4       like1-like2    2
# 5             like3    1
# 6       like1-like3    1
# 7       like2-like3    0
# 8 like1-like2-like3    1
Hope this helps! Otherwise, show some examples of desired output or describe some features, and I'll see what can be added.
Update
After looking at the output of SPSS for this online, it seems like the following should do it for you. This is easy enough to wrap into a function if you need to use it a lot.
data.frame(Freq = colSums(dat[-1]),
           Pct.of.Resp = (colSums(dat[-1]) / sum(dat[-1])) * 100,
           Pct.of.Cases = (colSums(dat[-1]) / nrow(dat[-1])) * 100)
# Freq Pct.of.Resp Pct.of.Cases
# like1 6 42.85714 60
# like2 5 35.71429 50
# like3 3 21.42857 30
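Wrapped into a function, per the note above (a sketch; the name mr_summary is my own):
mr_summary <- function(d) {
  # d: one column per item; 1 = selected, 0 = not selected
  data.frame(Freq = colSums(d),
             Pct.of.Resp = colSums(d) / sum(d) * 100,
             Pct.of.Cases = colSums(d) / nrow(d) * 100)
}
mr_summary(dat[-1]) # reproduces the table above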
Here is another function; call it as, for example, multfreqtable(data_set, "Banner"):
multfreqtable = function(data, question.prefix) {
  z = length(question.prefix)
  temp = vector("list", z)
  for (i in 1:z) {
    a = grep(question.prefix[i], names(data))
    b = sum(data[, a] != 0)
    d = colSums(data[, a] != 0)
    e = sum(rowSums(data[, a]) != 0)
    f = as.numeric(c(d, b))
    temp[[i]] = data.frame(question = c(sub(question.prefix[i], "", names(d)), "Total"),
                           freq = f,
                           percent_response = (f / b) * 100,
                           percent_cases = (f / e) * 100)
    names(temp)[i] = question.prefix[i]
  }
  temp
}
It does a very good job of giving you counts, percentages at the number-of-cases level, and percentages at the number-of-responses level. Perfect for analyzing multiple-response questions.
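For example, with the dat from earlier (question.prefix can also be a vector of prefixes, tabulating one question group per prefix):
multfreqtable(dat, "like")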