Shuffling string (non-randomly) for maximal difference - r

After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like:
seq_3.scramble<-"CATGTGC"
where none of the sequence positions 1-7 shares a nucleobase with the original, but the overall nucleobase count is the same (A = 1, C = 2, G = 2, T = 2). Naturally it will not always be possible to get a completely different string, but those cases I would just flag in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?

Python, since I don't know R, but the basic solution is as follows:
import itertools

def calcDistance(originalString, newString):
    # count positions at which the two strings differ (Hamming distance)
    d = 0
    for i in range(len(originalString)):
        if originalString[i] != newString[i]:
            d += 1
    return d

s = "ACTG"
d_max = 0
s_final = ""
for combo in itertools.permutations(s):
    candidate = "".join(combo)  # permutations yields tuples, so re-join
    d = calcDistance(s, candidate)
    if d > d_max:
        d_max = d
        s_final = candidate

Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all permutations sorted by their string-distance score. The score is calculated using stringdist(..., ..., method="hamming"), which counts the number of substitutions required to convert string A to B.
seq_3<-"ACGTGCT"
myfun <- function(S) {
  require(combinat)
  require(dplyr)
  require(stringdist)
  # all permutations of the characters in S
  vec <- unlist(strsplit(S, ""))
  P <- sapply(permn(vec), function(i) paste(i, collapse=""))
  # Hamming distance of each permutation from the original
  Dist <- stringdist(S, P, method="hamming")
  df <- data.frame(seq = P, HD = Dist) %>%
    distinct(seq, HD) %>%
    arrange(desc(HD))
  return(df)
}
library(combinat)
library(dplyr)
library(stringdist)
head(myfun(seq_3), 10)
# seq HD
# 1 TACGTGC 7
# 2 TACGCTG 7
# 3 CACGTTG 7
# 4 GACGTTC 7
# 5 CGACTTG 7
# 6 CGTACTG 7
# 7 TGCACTG 7
# 8 GTCACTG 7
# 9 GACCTTG 7
# 10 GATCCTG 7
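The OP also wanted to flag sequences for which a fully scrambled string is impossible (e.g. when one letter dominates). A minimal check, assuming the myfun defined above: compare the best achievable Hamming distance with the sequence length.
res <- myfun("AAAT")
max(res$HD)                  # 2 -- no permutation of "AAAT" differs at all 4 positions
max(res$HD) < nchar("AAAT")  # TRUE means: flag this sequence
# [1] TRUE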

R - joining more than 2^31 rows with data.table

I have an igraph network graph with 103,887 nodes and 4,795,466 ties.
This can be structured as an edgelist in a data.table with almost 9 million rows.
I can find the common neighbors in this network, following @chinsoon12's answer here. See the example below.
This works beautifully for smaller networks, but runs into problems in my use-case because the merge results in more than 2^31 rows.
Questions:
Are there efficient alternatives on how to deal with this?
Can I split the data and do the computation in steps? The results will be used to query about common neighbors.
Example - modified from @chinsoon12's answer:
library(data.table)
library(igraph)
set.seed(1234)
g <- random.graph.game(10, p=0.10)
adjSM <- as(get.adjacency(g), "dgTMatrix")
adjDT <- data.table(V1=adjSM@i+1, V2=adjSM@j+1)
res <- adjDT[adjDT, nomatch=0, on="V2", allow.cartesian=TRUE
             ][V1 < i.V1, .(Neighbours=paste(V2, collapse=",")),
               by=c("V1","i.V1")][order(V1)]
res
   V1 i.V1 Neighbours
1:  4    5          8
2:  4   10          8
3:  5   10          8
Update
If you just want to query the common neighbors, I don't suggest you build up a huge look-up table. Instead, you can use the following code to get the result for your query:
find_common_neighbors <- function(g, Vs) {
  which(colSums(distances(g, Vs) == 1) == length(Vs))
}
such that
> find_common_neighbors(g, c(4, 8))
integer(0)
> find_common_neighbors(g, c(4, 5))
[1] 8
If you need a look-up table, an alternative is to use Neighbours as the key to search its associated node, e.g.,
res <- transform(
  data.frame(Neighbours = which(degree(g) >= 2)),
  Nodes = sapply(
    Neighbours,
    function(x) toString(neighbors(g, x))
  )
)
Previous Answer
I think you can use ego over g directly to generate res, e.g.,
setNames(
  data.frame(
    t(do.call(
      cbind,
      lapply(
        Filter(function(x) length(x) > 2, ego(g, 1)),
        function(x) {
          rbind(combn(x[-1], 2), x[1])
        }
      )
    ))
  ),
  c("V1", "V2", "Neighbours")
)
which gives
  V1 V2 Neighbours
1  4  5          8
2  4 10          8
3  5 10          8
Can I split the data and do the computation in steps?
You can split by V1 to avoid running into the big-merge issue:
neighDT = adjDT[, if (.N > 1) {
  cb = combn(V2, 2)
  .(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]
which gives
   neighbor a  b
1:        8 4  5
2:        8 4 10
3:        8 5 10
(The OP found gRbase::combnPrim to be faster than combn here.)
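If gRbase is installed, it is a drop-in replacement; a sketch of the same split-by-V1 computation with only the combn call swapped (an assumption that the speed-up carries over here, not tested in the original answer):
library(gRbase)
neighDT = adjDT[, if (.N > 1) {
  cb = combnPrim(V2, 2)   # faster C implementation of combn
  .(a = cb[1, ], b = cb[2, ])
}, by=.(neighbor = V1)]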
How can we collapse all the common neighbors (separated with a comma) for the same combination into one observation?
neighDT_agg = neighDT[order(neighbor),
  .(neighbors = toString(neighbor)),
  keyby=.(a,b)]
The order ensures that the neighbors appear in ascending order within each collapsed string. The keyby ensures that the table is sorted by the pair {a, b} and facilitates a simple, fast lookup for multiple pairs at once:
# single query
neighDT_agg[.(4,10), neighbors]
# [1] "8"
# multi query
pairs_queryDT = data.table(a = c(4,5,8), b = c(5,10,10))
neighDT_agg[pairs_queryDT, neighbors]
[1] "8" "8" NA
I have an igraph network graph with 103,887 nodes and 4,795,466 ties.
Each call to combn will be making a 2-by-choose(.N, 2) matrix. If a node is connected to all other nodes, then it is a common neighbor to all pairs of other nodes and you'll be facing choose(103887-1, 2) of these pairs. I guess this is more an issue with the way the problem is defined than with the approach to solving it.
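To put a number on that worst case (simple arithmetic, not from the original answer):
# pairs generated by a single hub node adjacent to all 103,886 others
choose(103887 - 1, 2)
# [1] 5396098555   -- well above 2^31 (~2.15e9) on its own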
The results will be used to query about common neighbors.
For the approach above, you'll need to compute the full neighbors table first.
If you just have a few ad hoc queries about intersecting neighbors:
find_neighbors <- function(a, b){
  adjDT[.(c(a, b)), on=.(V1), V2[duplicated(V2)]]
}
find_neighbors(4, 10)
# [1] 8
This can similarly be wrapped in toString to collapse the values.
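For example, a hypothetical wrapper (the name is mine, not from the answer):
find_neighbors_str <- function(a, b) toString(find_neighbors(a, b))
find_neighbors_str(4, 10)
# [1] "8"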

Generation of Unique ID

Can someone help with how to generate a unique 6-character URN in R, as I don't know how to do this, please?
Below are the rules for the URN:
It needs to be alphanumeric, starting with letters and ending with letters (e.g. AA34YB)
Use only upper-case letters
Do not use the letters O or I (that is, the letter after H and the letter before J)
Use only the digits 1-9. Exclude 0
The first two characters should be letters, followed by a 2-digit number, ending with 2 letters, e.g. "AA22DD", "EE34TY", "ER67YU"
All records must contain a number, as shown in rule 5
IT MUST BE 6 CHARACTERS PLEASE
I would love to generate up to 4 million unique records. Any R code suggestion is highly welcome. I am not an expert in R, actually new to R.
Thanks very much
Here is a function that will generate ordered unique IDs:
generateIDs <- function(n, existing=NULL){
  # Initialise a counter to produce IDs
  counter <- 0
  # Create arrays of the allowed letters and digits
  letters <- LETTERS[LETTERS %in% c("O", "I") == FALSE]
  digits <- 1:9
  # Initialise an array to store the IDs created
  ids <- c()
  # Iterate through every combination: letter, letter, digit, digit, letter, letter
  for(first in letters){
    for(second in letters){
      for(third in digits){
        for(fourth in digits){
          for(fifth in letters){
            for(sixth in letters){
              # Create the unique code
              code <- paste0(first, second, third, fourth, fifth, sixth)
              # Check if it already exists
              if(code %in% existing == FALSE){
                # Increment the counter
                counter <- counter + 1
                # Store the ID
                ids[counter] <- code
                existing[length(existing) + 1] <- code
                # Check if enough IDs have been created
                if(counter == n){
                  return(ids)
                }
                # Note progress
                if(counter %% 10000 == 0){
                  cat("\rCreated", counter, "ids!")
                }
              }
            }
          }
        }
      }
    }
  }
}
That is a horrific number of nested for loops but it avoids the inefficient random generation of IDs. You can test it using the following code:
generateIDs(10)
"AA11AA" "AA11AB" "AA11AC" "AA11AD" "AA11AE" "AA11AF" "AA11AG" "AA11AH" "AA11AJ" "AA11AK"
Note that ideally you should run this function once. Theoretically, this function could create up to 26873856 unique IDs but it doesn't scale well!
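That capacity figure is just the product of the per-position alphabet sizes; a quick check (same alphabet as the function above):
length(LETTERS[!LETTERS %in% c("O", "I")])^4 * 9^2
# [1] 26873856   -- 24 letters in four positions, 9 digits in two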
See @GKi's answer for a much better solution! :-)
You can use expand.grid to generate unique IDs.
n <- 10
t1 <- LETTERS[!LETTERS %in% c("O", "I")]
t2 <- 1:9
#t1 <- rawToChar(as.raw(c(65:72,74:78,80:90)), multiple = TRUE) # Alternative
#t2 <- rawToChar(as.raw(49:57), multiple = TRUE)
apply(expand.grid(t1, t1, t2, t2, t1, t1)[seq(n),], 1, paste, collapse = "")
# 1 2 3 4 5 6 7 8
#"AA11AA" "BA11AA" "CA11AA" "DA11AA" "EA11AA" "FA11AA" "GA11AA" "HA11AA"
# 9 10
#"JA11AA" "KA11AA"
set.seed(1) #Sample randomly
apply(expand.grid(t1, t1, t2, t2, t1, t1)[sample(length(t1)^4*length(t2)^2, n),]
, 1, paste, collapse = "")
#10938497 17633234 12201267 18120554 21612295 21509711 13901861 6841049
#"SL15UK" "BG59TR" "CU65XL" "BH54ES" "GJ13HV" "YF31FV" "EE79KN" "SV66CG"
#23945701 10770210
#"NK23KX" "TG68QK"
In case it needs too much memory, look at @Joseph-Crispell's answer.
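If even the portion you need is too large to hold as a grid, one memory-light option (my own sketch, not from either answer) is to decode the k-th ID directly from its index with mixed-radix arithmetic; note the ordering differs from expand.grid (here the last position varies fastest):
id_from_index <- function(k, t1, t2) {
  sets <- list(t1, t1, t2, t2, t1, t1)  # alphabet for each of the 6 positions
  k <- k - 1                            # switch to a zero-based index
  out <- character(6)
  for (pos in 6:1) {                    # peel off one position at a time
    m <- length(sets[[pos]])
    out[pos] <- as.character(sets[[pos]][k %% m + 1])
    k <- k %/% m
  }
  paste(out, collapse = "")
}
sapply(1:3, id_from_index, t1 = t1, t2 = t2)
# [1] "AA11AA" "AA11AB" "AA11AC"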

Find similar strings and reconcile them within one dataframe

Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
  n     s      b
1 2  ABBA   TRUE
2 3   ABA  STING
3 5 STING STRING
How can I search within this data frame for similar strings, i.e. ABBA and ABA as well as STING and STRING, and make them the same (whether ABBA or ABA is kept doesn't matter), without me having to know the variations in advance? My actual data.frame is very big, so it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
  n     s     b
1 2  ABBA  TRUE
2 3  ABBA STING
3 5 STING STING
I have looked at agrep and stringdist, but the examples I found either compare two data frames or require naming the column, which I can't do since I have many of them.
Anyone an idea? Many thanks!
Best regards,
Steffi
This worked for me, but there might be a better solution.
The idea is to use a recursive function, special, that uses agrepl, which is the logical version of approximate grep, https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep. Note that you can specify the 'error tolerance' to group similar strings with agrep. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
  if (nrow(y) < 1) { # if y is empty, return the accumulated data
    return(x)
  } else {
    similar <- agrepl(y$s[1], y$s) # find similar occurring strings
    x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
    y <- setdiff(y, y[similar,])
    special(x, y, grp+1)
  }
}
desired <- special(desired, df, grp)
To change the stringency of the string similarity, adjust max.distance, e.g. agrepl(x, y, max.distance = 0.5); a small illustration follows after the output below.
Output
  n     s      b grp
1 2  ABBA   TRUE   1
2 3  ABBA  STING   1
3 5 STING STRING   2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
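To see what max.distance changes, here is a rough illustration with toy strings (my own example, not from the answer): the default tolerance (0.1 of the pattern length, rounded up) allows one edit against "ABBA", while 0.5 allows two.
agrepl("ABBA", c("ABA", "AA", "STING"))                       # default max.distance = 0.1
# [1]  TRUE FALSE FALSE
agrepl("ABBA", c("ABA", "AA", "STING"), max.distance = 0.5)
# [1]  TRUE  TRUE FALSE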

Reverse only alphabetical patterns in a string in R

I'm trying to learn R and a sample problem is asking to only reverse part of a string that is in alphabetical order:
String: "abctextdefgtext"
StringNew: "cbatextgfedtext"
Is there a way to identify alphabetical patterns to do this?
Here is one approach with base R, based on the patterns shown in the example. We split the string into individual characters ('v1'), use match to find the position of each character in the alphabet (letters), take the difference of those indices and check whether it equals 1. Applying cumsum to that logical vector gives a grouping variable ('i1'); we then reverse (rev) 'v1' within each group using ave and finally paste the characters together to get the expected output.
v1 <- strsplit(str1, "")[[1]]
i1 <- cumsum(c(TRUE, diff(match(v1, letters)) != 1L))
paste(ave(v1, i1, FUN = rev), collapse="")
#[1] "cbatextgfedtext"
Or, as @alexislaz mentioned in the comments:
v1 = as.integer(charToRaw(str1))
rawToChar(as.raw(ave(v1, cumsum(c(TRUE, diff(v1) != 1L)), FUN = rev)))
#[1] "cbatextgfedtext"
EDIT:
1) A mistake was corrected based on @alexislaz's comments
2) Updated with another method suggested by @alexislaz in the comments
data
str1 <- "abctextdefgtext"
You could do this in base R
vec <- match(unlist(strsplit(s, "")), letters)
x <- c(0, which(diff(vec) != 1), length(vec))
newvec <- unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
paste0(letters[newvec], collapse = "")
#[1] "cbatextgfedtext"
Where s <- "abctextdefgtext"
First you find the positions of each letter in the sequence of letters ([1] 1 2 3 20 5 24 20 4 5 6 7 20 5 24 20)
Having the positions in hand, you look for consecutive numbers and, when found, reverse that sequence. ([1] 3 2 1 20 5 24 20 7 6 5 4 20 5 24 20)
Finally, you get the letters back in the last line.

Select all binary neighbors of decimal number

Let's say I have a number in decimal format: 5
its binary version is: 00101
I would like to write a function that takes the decimal number x
and returns all other decimal numbers that have a single digit difference (in their binary forms) from the original one:
so for the example above the neighbors are:
10101 01101 00111 00001 00100
and the corresponding decimals are:
21 13 7 1 4
I would like a solution that is computationally efficient and doesn't take a long time even if I have, say, a million digits.
Is this possible to do?
I've no idea how trial and error got me here, but it looks valid unless I've messed up binaries and decimals:
bin_neighs = function(x, n) bitwXor(x, (2 ^ (0:(n - 1))))
bin_neighs(5, 5)
#[1] 4 7 1 13 21
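The reason this works: XOR with 2^k flips exactly bit k and leaves all other bits unchanged, so XOR-ing x with each power of two from 2^0 to 2^(n-1) enumerates every single-bit neighbour. A quick check:
bitwXor(5L, 1L)   # 4   (00101 -> 00100, bit 0 flipped)
bitwXor(5L, 2L)   # 7   (00101 -> 00111, bit 1 flipped)
bitwXor(5L, 16L)  # 21  (00101 -> 10101, bit 4 flipped)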
I think you're asking how to take as input a number 5 and to return all neighboring binary values. To do this, you need to convert the number to a useful binary format (just the bits you want to flip), flip each bit, and return the result:
library(R.utils)
bin.neighbors <- function(x, num.neighbors=NA) {
  # Get the bits, with the appropriate amount of zero-padding
  bits <- as.numeric(unlist(strsplit(intToBin(x), "")))
  if (!is.na(num.neighbors) && num.neighbors > length(bits)) {
    bits <- c(rep(0, num.neighbors-length(bits)), bits)
  }
  # Build a matrix where each column is the bit vector of one neighbor
  mat <- matrix(bits, length(bits), length(bits))
  diag(mat) <- 1-diag(mat)
  # Return the decimal values of the neighbors using strtoi
  apply(mat, 2, function(x) strtoi(paste0(x, collapse=""), base=2))
}
bin.neighbors(5, 5)
# [1] 21 13 1 7 4
Because each number has a number of binary representations with different numbers of leading 0s (e.g. 5 can be represented as 101, 0101, 00101, 000101, 0000101, etc.), I added an argument num.neighbors to specify the length of the output vector from the function. You can pass NA to obtain an output vector equal to the number of bits in the binary representation of the input with no leading zeros.
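For example, with the function above, passing NA uses the minimal 3-bit form of 5, while an explicit width pads with leading zeros before flipping:
bin.neighbors(5, NA)
# [1] 1 7 4
bin.neighbors(5, 8)
# [1] 133  69  37  21  13   1   7   4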
Here's another way using magrittr's pipe:
library(magrittr)
binNeighbours <- function(a, numNeighbours = ceiling(log2(a))) {
  rep(a, numNeighbours) %>%
    outer(., seq(.) - 1, function(x, y) x %/% (2 ^ y) %% 2) %>%
    `diag<-`(., 1 - diag(.)) %>%
    `%*%`(2 ^ (0:(nrow(.) - 1))) %>%
    `[`(, 1)
}
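Usage, matching the other answers:
binNeighbours(5, 5)
# [1]  4  7  1 13 21
One caveat (my observation, not from the answer): the default numNeighbours = ceiling(log2(a)) undercounts for exact powers of two, e.g. a = 8 is 1000 in binary and needs 4 bits, but ceiling(log2(8)) is 3, so pass the width explicitly in that case.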
