Let's say I have a number in decimal format: 5
its binary version is: 00101
I would like to write a function that takes the decimal number x
and returns all other decimal numbers that have a single digit difference (in their binary forms) from the original one:
so for the example above the neighbors are:
10101 01101 00111 00001 00100
and the corresponding decimals are:
21 13 7 1 4
I would like a solution that is computationally efficient and doesn't take a long time even if I have say a million digits.
Is this possible to do?
I've no idea how trial and error got me here, but it looks valid unless I've messed up binaries and decimals:
bin_neighs = function(x, n) bitwXor(x, (2 ^ (0:(n - 1))))
bin_neighs(5, 5)
#[1] 4 7 1 13 21
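The one-liner works because XOR with a power of two flips exactly one bit. A minimal sketch of the same idea in Python rather than R (the name bin_neighbors simply mirrors the R function above and is not from any library):

```python
def bin_neighbors(x: int, n: int) -> list[int]:
    """Return the n integers that differ from x in exactly one of the n lowest bits."""
    # 1 << i has only bit i set, so XOR with it flips bit i of x.
    return [x ^ (1 << i) for i in range(n)]


print(bin_neighbors(5, 5))  # [4, 7, 1, 13, 21]
```

Python integers have arbitrary precision, so the same expression keeps working when n is larger than a machine word.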
I think you're asking how to take as input a number 5 and to return all neighboring binary values. To do this, you need to convert the number to a useful binary format (just the bits you want to flip), flip each bit, and return the result:
library(R.utils)
bin.neighbors <- function(x, num.neighbors=NA) {
# Get the bits with the appropriate amount of padding
bits <- as.numeric(unlist(strsplit(intToBin(x), "")))
if (!is.na(num.neighbors) & num.neighbors > length(bits)) {
bits <- c(rep(0, num.neighbors-length(bits)), bits)
}
# Build a matrix where each column is a bit vector of a neighbor
mat <- matrix(bits, length(bits), length(bits))
diag(mat) <- 1-diag(mat)
# Return the decimal values of the neighbors using strtoi
apply(mat, 2, function(x) strtoi(paste0(x, collapse=""), 2))
}
bin.neighbors(5, 5)
# [1] 21 13 1 7 4
Because each number has several binary representations with different numbers of leading 0s (e.g. 5 can be represented as 101, 0101, 00101, 000101, 0000101, etc.), I added an argument num.neighbors to specify the length of the output vector from the function. You can pass NA to obtain an output vector whose length equals the number of bits in the binary representation of the input with no leading zeros.
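Here's another way using magrittr's pipe. The default for numNeighbours is the number of bits in a, floor(log2(a)) + 1, so that exact powers of two such as 4 get all of their bits:
library(magrittr)
binNeighbours <- function(a, numNeighbours = floor(log2(a)) + 1) {
  rep(a, numNeighbours) %>%
    # Row i, column j holds bit j (lowest bit first) of a
    outer(., seq(.) - 1, function(x, y) x %/% (2 ^ y) %% 2) %>%
    # Flip one bit per row
    `diag<-`(., 1 - diag(.)) %>%
    # Convert each row of bits back to a decimal number
    `%*%`(2 ^(0:(nrow(.) - 1))) %>%
    `[`(, 1)
}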
Suppose I have a string
age<-c("7y2m4d","5m4d","7y5m6d")
I want to convert it to a numeric vector like
c(7.34, 0.43, 7.43)
How can I make the R code?
We can assume there are 365 days in a year and 365/12 days in a month.
lubridate::duration will convert your strings to (approximate) seconds.
library(lubridate)
library(magrittr)
age <- c("7y2m4d", "5m4d", "7y5m6d")
age_sec <- age %>%
duration() %>%
as.numeric()
age_sec
[1] 226508400 13494600 234570600
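Then you can approximate years as 365 * 24 * 60 * 60 seconds. Note that the second counts above correspond to 365.25-day years (7 years comes out as 7 * 365.25 days), so the results differ slightly from a direct y + m/12 + d/365 calculation: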
age_sec / (365 * 24 * 60 * 60)
[1] 7.182534 0.427911 7.438185
Another solution with base R:
age<-c("7y2m4d","5m4d","7y5m6d")
age <- gsub('y', ' + ', age)
age <- gsub('m', ' / 12 + ', age)
age <- gsub('d', ' / 365', age)
sapply(age, function(x) eval(parse(text = x)))
#7 + 2 / 12 + 4 / 365 5 / 12 + 4 / 365 7 + 5 / 12 + 6 / 365
# 7.1776256 0.4276256 7.4331050
The idea is to create the formula and then evaluate it for each element of your vector.
These solutions:
handle missing y, m and/or d and
give the same answer as in the question (except for the first element of age for which the question appears to have computed the answer incorrectly)
avoid the use of eval
only use base (except for alternative 1a)
Comparing the solutions below on the basis of simplicity, (1a) is the simplest and automatically handles all the edge cases without specific code for them, suggesting that it is the most natural; however, it does make use of a package. (1) is only slightly more complex and uses no packages. (2) is pretty short and also uses no packages, but it is not as simple as (1) or (1a).
1) Here getNum extracts and returns the number from x associated with the code (the code is "y", "m" or "d") or if the code is not present in x returns 0. We then add up the year, month/12 and day/365.
getNum <- function(code, x) {
pat <- sprintf(".*?(\\d+)%s.*", code)
as.numeric(ifelse(grepl(code, x), sub(pat, "\\1", x), 0))
}
getNum("y", age) + getNum("m", age) / 12 + getNum("d", age) / 365
## [1] 7.1776256 0.4276256 7.4331050
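For comparison, here is a rough Python sketch of the same extract-and-sum idea. The function name age_in_years and the inner helper get_num are made up for this illustration, and the 12 months and 365 days per year are the assumptions stated in the question:

```python
import re


def age_in_years(age: str) -> float:
    """Convert strings like '7y2m4d' to years, using 12 months and 365 days per year."""

    def get_num(code: str) -> int:
        # Digits immediately followed by the unit code; 0 if the unit is absent.
        match = re.search(r"(\d+)" + code, age)
        return int(match.group(1)) if match else 0

    return get_num("y") + get_num("m") / 12 + get_num("d") / 365


ages = ["7y2m4d", "5m4d", "7y5m6d"]
print([round(age_in_years(a), 7) for a in ages])
```

As in (1), a missing unit simply contributes 0, so no special handling is needed for strings such as "5m4d".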
1a) This is similar to (1) except that we use strapply in gsubfn to simplify getNum. In fact getNum reduces to a single strapply call and the regular expression it uses is also simpler.
library(gsubfn)
getNum <- function(code, x) {
strapply(x, paste0("(\\d+)", code), as.numeric, empty = 0, simplify = TRUE)
}
getNum("y", age) + getNum("m", age) / 12 + getNum("d", age) / 365
## [1] 7.1776256 0.4276256 7.4331050
2) This alternative converts each string to dcf format and uses read.dcf to create a matrix of the y, m and d numbers.
In detail, the first line of code is to handle certain edge cases which are not actually present in the sample data in the question. We first append 0d to age (from the question) if d is missing so that we can handle the case where y, m and d are all missing. We also prepend a dummy entry to ensure that y, m and d are present in at least one entry. If we knew that y, m and d were present in at least one component and there was no component in which y, m and d were all simultaneously missing then this first line of code could be omitted.
The second line of code converts each input character string to dcf form and reads it into a matrix ensuring that the columns are in a known order and deleting the dummy entry added above.
Finally we replace NAs with 0 and use matrix multiplication to add up the year, month/12 and day/365.
a0 <- c("0y0m0d", paste0(age, ifelse(grepl("d", age), "", "0d")))
m <- read.dcf(textConnection(gsub("(\\d+)(\\D)", "\\2: \\1\n", a0)))[-1, c("y", "m", "d")]
m[is.na(m)] <- 0
c(array(as.numeric(m), dim(m)) %*% c(1, 1/12, 1/365))
## [1] 7.1776256 0.4276256 7.4331050
Update: Rearranged and added (1) and (1a).
After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like:
seq_3.scramble<-"CATGTGC"
where none of the sequence positions 1-7 has the same nucleobase, but the overall nucleobase count is the same (A = 1, C = 2, G = 2, T = 2). Naturally, it would not always be possible to get a completely different string, but I would just flag those cases in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?
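Python, since I don't know R, but the basic solution is as follows:
import itertools

def calcDistance(originalString, newString):
    # Hamming distance: number of positions at which the two strings differ
    d = 0
    i = 0
    while i < len(originalString):
        if originalString[i] != newString[i]:
            d = d + 1
        i = i + 1
    return d

s = "ACTG"
d_max = 0
s_final = ""
# Brute force: try every permutation and keep the first one with the largest distance
for combo in itertools.permutations(s):
    if calcDistance(s, combo) > d_max:
        d_max = calcDistance(s, combo)
        s_final = "".join(combo)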
Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all distinct permutations sorted by their string-distance score. The score is calculated using stringdist(..., ..., method="hamming"), which counts the number of substitutions required to convert string A to string B.
seq_3<-"ACGTGCT"
myfun <- function(S) {
require(combinat)
require(dplyr)
require(stringdist)
vec <- unlist(strsplit(S, ""))
P <- sapply(permn(vec), function(i) paste(i, collapse=""))
Dist <- c(stringdist(S, P, method="hamming"))
df <- data.frame(seq = P, HD = Dist) %>%
distinct(seq, HD) %>%
arrange(desc(HD))
return(df)
}
library(combinat)
library(dplyr)
library(stringdist)
head(myfun(seq_3), 10)
# seq HD
# 1 TACGTGC 7
# 2 TACGCTG 7
# 3 CACGTTG 7
# 4 GACGTTC 7
# 5 CGACTTG 7
# 6 CGTACTG 7
# 7 TGCACTG 7
# 8 GTCACTG 7
# 9 GACCTTG 7
# 10 GATCCTG 7
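Both answers above enumerate every permutation, which is fine for 4-8 nt. If a single consistent scramble is enough, a different technique (not taken from either answer) avoids the enumeration: order the positions by the base they hold and rotate that ordering by the size of the largest group of identical bases. The sketch below is in Python; the names scramble and hamming are hypothetical helpers for this illustration.

```python
from collections import Counter


def scramble(seq: str) -> str:
    """Deterministically rearrange seq so that as many positions as possible change."""
    n = len(seq)
    if n == 0:
        return seq
    # Positions ordered by the character they hold (stable, so ties keep input order).
    order = sorted(range(n), key=lambda i: seq[i])
    # Rotate by the size of the largest group of identical characters.
    shift = max(Counter(seq).values())
    out = [""] * n
    for rank, pos in enumerate(order):
        out[pos] = seq[order[(rank + shift) % n]]
    return "".join(out)


def hamming(a: str, b: str) -> int:
    """Number of positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


for s in ["ACTG", "ATGTT", "ACGTGCT"]:
    t = scramble(s)
    print(s, t, hamming(s, t))
```

With this rotation a position keeps its base only when one base fills more than half of the sequence (as in ATGTT, where three of five bases are T), so comparing hamming(s, t) with the sequence length gives the flag asked for in the question.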
I'm trying to learn R and a sample problem is asking to only reverse part of a string that is in alphabetical order:
String: "abctextdefgtext"
StringNew: "cbatextgfedtext"
Is there a way to identify alphabetical patterns to do this?
Here is one approach with base R based on the pattern shown in the example. We split the string into individual characters ('v1'), use match to find the position of each character in the alphabet (letters), take the difference between adjacent positions and check whether it is equal to 1. A cumulative sum over the breaks gives a grouping variable ('i1'), and we reverse (rev) the characters within each group. Finally, we paste the characters together to get the expected output.
v1 <- strsplit(str1, "")[[1]]
i1 <- cumsum(c(TRUE, diff(match(v1, letters)) != 1L))
paste(ave(v1, i1, FUN = rev), collapse="")
#[1] "cbatextgfedtext"
Or as #alexislaz mentioned in the comments
v1 = as.integer(charToRaw(str1))
rawToChar(as.raw(ave(v1, cumsum(c(TRUE, diff(v1) != 1L)), FUN = rev)))
#[1] "cbatextgfedtext"
EDIT:
1) A mistake was corrected based on #alexislaz's comments
2) Updated with another method suggested by #alexislaz in the comments
data
str1 <- "abctextdefgtext"
You could do this in base R
vec <- match(unlist(strsplit(s, "")), letters)
x <- c(0, which(diff(vec) != 1), length(vec))
newvec <- unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
paste0(letters[newvec], collapse = "")
#[1] "cbatextgfedtext"
Where s <- "abctextdefgtext"
First you find the positions of each letter in the sequence of letters ([1] 1 2 3 20 5 24 20 4 5 6 7 20 5 24 20)
Having the positions in hand, you look for consecutive numbers and, when found, reverse that sequence. ([1] 3 2 1 20 5 24 20 7 6 5 4 20 5 24 20)
Finally, you get the letters back in the last line.
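The same steps can be sketched in Python as a single pass that collects runs of consecutive letters and reverses each run. The function name reverse_alphabetical_runs is made up for this example, and like the charToRaw variant it compares character codes, so it assumes plain lowercase input:

```python
def reverse_alphabetical_runs(s: str) -> str:
    """Reverse every maximal run of characters with consecutive character codes."""
    runs = []
    for ch in s:
        # Extend the current run if ch directly follows the previous character.
        if runs and ord(ch) == ord(runs[-1][-1]) + 1:
            runs[-1].append(ch)
        else:
            runs.append([ch])
    return "".join("".join(reversed(run)) for run in runs)


print(reverse_alphabetical_runs("abctextdefgtext"))  # cbatextgfedtext
```

Characters that are not part of a longer run form runs of length one, so reversing them leaves them unchanged.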
I have a column in a data.frame where each observation is a string of numbers (e.g. "1,5,6,7,0,21"). I am attempting to calculate the difference for the first instance of non-consecutive numbers. In the above example the result would be 5 - 1 = 4. However, with the code I currently have I get 6. If my input is "1,2,0,21" I get the correct result of 21 - 2 = 19 (the numbers are sorted before subtraction occurs). I thought maybe the zero was the issue, but adding one to all values did not solve the issue. Perhaps a problem with my indexing? Any suggestions?
# find distance between number in first gap of non-consecutive numbers
b <- c("1,5,6,7,0,21") # does not work as desired result is 6 instead of 4
# b <- ("1,2,0,21") # works as desired
b.Uncomma <- sort(unique(as.numeric(unlist(strsplit(b, split=","))))) # remove commas, remove duplicates, sort
#b.Uncomma <- b.Uncomma + 1 # same result
b.Gaps <- c(which(diff(b.Uncomma) != 1), length(b.Uncomma)) # find where the difference is not 1
b.FirstGap <- b.Gaps[1:2] # get the positions/index on either side of the first gap
b.Result <- b.Uncomma[(b.FirstGap[2])] - b.Uncomma[(b.FirstGap[1])] # subtract to get result
inp <- scan(text=b,sep=",")
#Read 6 items
sinp <- sort(inp)
diff(sinp)
#[1] 1 4 1 1 14
diff(sinp)[diff(sinp) != 1][1]
#[1] 4
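The problem is the indexing: b.Gaps[1:2] holds the positions of the first two gaps (here 2 and 5), so the subtraction is 7 - 1 = 6 instead of using the two values on either side of the first gap. Take only the first gap position and the element right after it. Try: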
b.Uncomma <- sort(unique(as.numeric(unlist(strsplit(b, split=","))))) # remove commas, remove duplicates, sort
b.Gaps <- c(which(diff(b.Uncomma) != 1), length(b.Uncomma)) # find where the difference is not 1
b.FirstGap <- b.Gaps[1] # get the positions/index of the first gap
b.Result <- b.Uncomma[(b.FirstGap+1)] - b.Uncomma[(b.FirstGap)] # subtract to get result
b.Result
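For reference, the same logic as a short Python sketch (first_gap is a hypothetical name; returning None when there is no gap is a choice made for this example, not something the question specifies):

```python
def first_gap(numbers: str):
    """Size of the first jump larger than 1 in the sorted, de-duplicated numbers."""
    values = sorted({int(tok) for tok in numbers.split(",")})
    # Walk over adjacent pairs and stop at the first pair that is not consecutive.
    for lo, hi in zip(values, values[1:]):
        if hi - lo != 1:
            return hi - lo
    return None  # all numbers are consecutive


print(first_gap("1,5,6,7,0,21"))  # 4
print(first_gap("1,2,0,21"))      # 19
```

Pairing each value with its immediate successor avoids the index bookkeeping that caused the original off-target subtraction.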
I am new to this forum. I guess something like this has been asked before but, I am not really sure if that is what I want.
I have a sequence like this,
1 2 3 4 5 8 9 10 12 14 15 17 18 19
So, what I wish to do is this: get all the numbers which form a series, i.e. each number in the set should have a constant difference from the previous element, and the set should contain at least 3 elements.
i.e., I can see that (1,2,3,4,5) forms one such series in which numbers appear after an interval of 1 and the total size of this set is 5 which satisfies the minimum threshold criteria.
(1,3,5) forms one such a pattern in which the numbers appear after an interval of 2.
(8,10,12,14) forms another such pattern with an interval of 2. So, as you can see, the interval of repetition can be anything.
Also, for a particular set, I want only its maximal version. I don't want (8,10,12) as the output (although it satisfies the minimum threshold of 3 and the constant difference); I only want the one of maximal length, i.e. (8,10,12,14).
Similarly, for (1,2,3,4,5), I don't want (1,2,3) or (2,3,4,5) as the output, only the one of maximal length, i.e. (1,2,3,4,5).
How can I do this in R?
Edit: That is, I want any set which forms a basic AP series with any difference; however, the series should contain at least 3 elements and it should be maximal.
Edit2: I have tried using rle and acf in R, but that doesn't entirely solve my problem.
Edit3: When I used acf, it basically gave me the maximum peak difference that I could have used. However, I want all the possible differences. Also, rle is quite different: it gave me the longest continuous sequence of identical numbers, which does not occur in my case.
If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.
In the general case of basically any subset of your data which form such a sequence, such as the 8,10,12,14 case you cite, your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j] +1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithms.
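For short inputs, the forward-looking search described above can be done by brute force. The following is only a sketch in Python: it treats the input as a set of integers (order and duplicates are ignored), and the name maximal_progressions is made up for this example.

```python
def maximal_progressions(values, min_size=3):
    """All maximal arithmetic progressions with at least min_size members."""
    present = set(values)
    ordered = sorted(present)
    found = []
    for i, start in enumerate(ordered):
        for nxt in ordered[i + 1:]:
            step = nxt - start
            # Skip starts that can be extended to the left: they are not maximal.
            if start - step in present:
                continue
            run = [start]
            # Extend to the right for as long as the next term exists.
            while run[-1] + step in present:
                run.append(run[-1] + step)
            if len(run) >= min_size:
                found.append(tuple(run))
    return found


x = [1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19]
for progression in maximal_progressions(x):
    print(progression)
```

Every (start, step) pair is examined once, so the cost grows roughly with the square of the number of values times the progression length. That is fine for a sequence like the one in the question but not for very long inputs.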
Here's a potential solution - albeit a very ugly, sloppy one:
##
arithSeq <- function(x=nSeq, minSize=4){
##
dx <- diff(x, lag=1)
Runs <- rle(dx)
##
rLens <- Runs[[1]]
rVals <- Runs[[2]]
pStart <- c(
rep(1,rLens[1]),
rep(cumsum(1+rLens[-length(rLens)]),times=rLens[-1])
)
pEnd <- pStart + c(
rep(rLens[1]-1, rLens[1]),
rep(rLens[-1],times=rLens[-1])
)
pGrp <- rep(1:length(rLens),times=rLens)
pLen <- rep(rLens, times=rLens)
dAll <- data.frame(
pStart=pStart,
pEnd=pEnd,
pGrp=pGrp,
pLen=pLen,
runVal=rep(rVals,rLens)
)
##
dSub <- subset(dAll, pLen >= minSize - 1)
##
uVals <- unique(dSub$runVal)
##
maxSub <- subset(dSub, runVal==uVals[1])
maxLen <- max(maxSub$pLen)
maxSub <- subset(maxSub, pLen==maxLen)
##
if(length(uVals) > 1){
for(i in 2:length(uVals)){
iSub <- subset(dSub, runVal==uVals[i])
iMaxLen <- max(iSub$pLen)
iSub <- subset(iSub, pLen==iMaxLen)
maxSub <- rbind(
maxSub,
iSub)
maxSub
}
##
}
##
deDup <- maxSub[!duplicated(maxSub),]
seqStarts <- as.numeric(rownames(deDup))
outList <- list(NULL); length(outList) <- nrow(deDup)
for(i in 1:nrow(deDup)){
outList[[i]] <- list(
Sequence = x[seqStarts[i]:(seqStarts[i]+deDup[i,"pLen"])],
Length=deDup[i,"pLen"]+1,
StartPosition=seqStarts[i],
EndPosition=seqStarts[i]+deDup[i,"pLen"])
outList
}
##
return(outList)
##
}
##
There are things that can definitely be improved in this function. For instance, I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence. It just so happened that the true start positions of such sequences are given by the row numbers of one of the intermediate data.frames, so I used those as a hacky workaround. Anyway, the function accepts a numeric vector x and a minimum length parameter, minSize, and returns a list containing information about the sequences meeting the criteria you outlined above.
set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
[1] 16 17 18 19 20 21 22 23 24 25 26
[[1]]$Length
[1] 11
[[1]]$StartPosition
[1] 18
[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25
[[1]]$Length
[1] 5
[[1]]$StartPosition
[1] 12760
[[1]]$EndPosition
[1] 12764
[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19
[[2]]$Length
[1] 5
[[2]]$StartPosition
[1] 37988
[[2]]$EndPosition
[1] 37992
Like I said, it's sloppy and inelegant, but it should get you started.
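The core of the function is a run-length encoding of the first differences. A compact Python sketch of that idea follows; unlike arithSeq, it returns every contiguous run of at least min_size elements rather than only the longest one per difference, and it reports 0-based start positions. The name constant_difference_runs is hypothetical.

```python
from itertools import groupby


def constant_difference_runs(x, min_size=4):
    """Contiguous stretches of x whose neighbouring elements share one difference."""
    diffs = [b - a for a, b in zip(x, x[1:])]
    runs = []
    start = 0  # index in x where the current run begins
    for step, group in groupby(diffs):
        n_diffs = len(list(group))
        # n_diffs equal differences span n_diffs + 1 elements.
        if n_diffs + 1 >= min_size:
            runs.append((start, step, x[start:start + n_diffs + 1]))
        start += n_diffs
    return runs


n_seq = list(range(1, 11)) + [12, 33] + list(range(13, 18)) + list(range(16, 27))
for start, step, values in constant_difference_runs(n_seq):
    print(start, step, values)
```

For the nSeq example this reports the runs 1 to 10, 13 to 17 and 16 to 26; the last one starts at 0-based index 17, which corresponds to StartPosition 18 in the R output above.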