Splitting a number in R

In R I have a number, say 1293828893, called x.
I wish to split this number so as to remove the middle 4 digits, 3828, and return them. Pseudocode is as follows:
splitnum <- function(number){
#check number is 10 digits
if(nchar(number) != 10){
stop("number not of right size");
}
middlebits <- middle 4 digits of number
return(middlebits);
}
This is a pretty simple question but the only solutions I have found apply to character strings, rather than numeric ones.
If of interest, I am trying to create an implementation in R of the Middle-square method, but this step is particularly tricky.

You can use substr(). See its help page ?substr. In your function I would do:
splitnum <- function(number){
#check number is 10 digits
stopifnot(nchar(number) == 10)
as.numeric(substr(number, start = 4, stop = 7))
}
which gives:
> splitnum(1293828893)
[1] 3828
Remove the as.numeric(....) wrapping on the last line if you want the digits as a string.

Just use integer division:
> x <- 1293828893
> (x %/% 1e3) %% 1e4
[1] 3828
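To see why the one-liner above works, it may help to run the two steps separately. The sketch below also wraps them in a small helper; the name middle_digits and its arguments drop_right and keep are hypothetical, introduced only for illustration. Note that the result is a number, so any leading zeros in the extracted digits are lost.

```r
x <- 1293828893

# Step 1: integer division by 10^3 drops the last 3 digits.
step1 <- x %/% 1e3     # 1293828
# Step 2: the remainder modulo 10^4 keeps the last 4 digits of what is left.
step2 <- step1 %% 1e4  # 3828

# Hypothetical generalisation: drop `drop_right` digits from the right,
# then keep the next `keep` digits.
middle_digits <- function(x, drop_right, keep) {
  (x %/% 10^drop_right) %% 10^keep
}
middle_digits(x, drop_right = 3, keep = 4)  # 3828
```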

Here's a function that completely avoids converting the number to a character string:
splitnum <- function(number){
#check number is 10 digits
if(trunc(log10(number))!=9) {
stop("number not of right size")
}
(number %/% 1e3) %% 1e4
}
splitnum(1293828893)
# [1] 3828
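Since the question mentions the Middle-square method, here is a minimal sketch of a single step built on the same integer arithmetic. It assumes a 4-digit state rather than the 10-digit number from the question, because the square of a 10-digit number exceeds 2^53 and cannot be stored exactly in a double. The function name middle_square_step is hypothetical.

```r
# One middle-square step for a 4-digit state (hypothetical helper).
middle_square_step <- function(seed) {
  stopifnot(seed >= 0, seed < 1e4, seed == floor(seed))
  sq <- seed^2          # at most 8 digits, exactly representable as a double
  (sq %/% 1e2) %% 1e4   # drop the last 2 digits, keep the next 4
}

# 1234^2 = 1522756, read as the 8-digit string 01522756; the middle 4 digits are 5227.
s1 <- middle_square_step(1234)
# 5227^2 = 27321529; the middle 4 digits are 3215.
s2 <- middle_square_step(s1)
```

For larger states, one option would be to do the squaring in a big-integer package, which is outside the scope of this sketch.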

Related

Is there a function in R which checks if two XStrings have matching substrings of some size (n) in corresponding positions

I am trying to determine if two XStrings of equal length have the same substring of some given length in the corresponding positions.
Is there a built-in function in R for this problem?
Let's say I have strings
a <- "AACCTGCCCGGAACCT"  # shared substring: GCCCGGAA
b <- "CCATCGCCCGGAACCT"
and given length is 8
I need a function fun(a,b,len=8) that would return TRUE or possibly even a position where such a substring first occurs.
Of course, real strings that I am using are much longer and the given length of substring may not be 8 all the time.
This could be done with for loops, but I would prefer not to use them.
You could do this by splitting the strings into individual characters, testing equality of the resulting vectors, and performing run-length encoding on the logical vector produced:
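f <- function(a, b, n) {
  # run-length encode the position-wise equality of the two strings
  rl <- rle(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
  # index of the first run of matches that is at least n long
  ind <- which(rl$values & rl$lengths >= n)[1]
  if (is.na(ind)) return(NA_integer_)
  # start position = total length of all preceding runs, plus 1
  sum(rl$lengths[seq_len(ind - 1)]) + 1
}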
This will tell you the first position in the strings where there are at least n parallel matching bases:
f(a, b, 8)
#> [1] 6
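To make the run-length step concrete, the sketch below prints the intermediate pieces for the two example strings (written here without any emphasis markers). The variable names same, rl and start are only for illustration.

```r
a <- "AACCTGCCCGGAACCT"
b <- "CCATCGCCCGGAACCT"

# Position-wise comparison: one logical value per character.
same <- strsplit(a, "")[[1]] == strsplit(b, "")[[1]]

# Run-length encoding collapses consecutive equal values into runs.
rl <- rle(same)
rl$values   # FALSE TRUE  (5 mismatches, then a run of matches)
rl$lengths  # 5 11

# The matching run starts right after the 5 mismatching positions.
start <- rl$lengths[1] + 1  # 6
```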
We can use rleid
library(data.table)
Map(function(u, v) {i1 <- u == v
grp <- rleid(i1); which(i1 & ave(seq_along(grp), grp,
FUN = length) >= 8)[1]},
strsplit(a, ""), strsplit(b, ""))[[1]]
#[1] 6

R or Python code for returning the largest number of consecutive 1s from a binary vector

I'm new to coding and have started with R and Python. I want to write a code to solve the following problem.
My question is: How do I write a function in R or Python which takes a vector of length L consisting only of 0s and 1s as input and returns the largest number of consecutive 1s from the vector as an integer, e.g. the function takes [1,1,0,1,1,1,0,0,1] and returns 3.
In R, you can use rle to get the max number of consecutive 1's.
consecutive_1 <- function(x) {
with(rle(x == 1), max(lengths[values]))
}
vec <- c(1,1,0,1,1,1,0,0,1)
consecutive_1(vec)
#[1] 3
vec <- c(1,1,0,1,1,1,0,1,1,1,1)
consecutive_1(vec)
#[1] 4
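As a rough illustration of what rle produces here, the sketch below shows the runs for the first example vector. It also includes a variant, under the hypothetical name longest_run_of_ones, that returns 0 instead of -Inf (with a warning) when the vector contains no 1s at all.

```r
v <- c(1, 1, 0, 1, 1, 1, 0, 0, 1)
r <- rle(v == 1)
r$lengths            # 2 1 3 2 1
r$values             # TRUE FALSE TRUE FALSE TRUE
r$lengths[r$values]  # 2 3 1  (lengths of the runs of 1s only)

# Hypothetical variant: fall back to 0 when there is no run of 1s.
longest_run_of_ones <- function(x) {
  r <- rle(x == 1)
  max(0L, r$lengths[r$values])
}
longest_run_of_ones(v)        # 3
longest_run_of_ones(c(0, 0))  # 0
```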
If we want to check for length, we can add an additional condition
consecutive_1 <- function(x, L) {
if(length(x) != L)
stop('Vector is not of length ', L)
with(rle(x == 1), max(lengths[values]))
}
vec <- c(1,1,0,1,1,1,0,0,1)
consecutive_1(vec, 9)
#[1] 3
consecutive_1(vec, 10)
Error in consecutive_1(vec, 10) : Vector is not of length 10

Weighted sum of digits in R

I am trying to figure out the most efficient way to calculate the weighted sum of digits for a numeric string (where the weight is equal to the position of the digit in the numeric string).
Example: For the number 1059, the weighted sum of digits is calculated as 1 * 1 + 0 * 2 + 5 * 3 + 9 * 4 = 52
I would like to allow for the input to be of any length, but if there are more efficient ways when there is a limit to the string length (e.g. knowing that the number is no more than 10 digits allows for a more efficient program) I am open to that too. Also, if it is preferred that the input is of type numeric rather than character, that is acceptable too.
What I have right now is an old fashioned for loop:
wsod <- function(str) {
output <- 0
for (pos in 1:nchar(str)) {
digit <- as.numeric(substr(str, pos , pos))
output <- output + pos * digit
}
output
}
A few solutions have been proposed for Python (using a numeric input) but I don't think they apply to R directly.
> number <- 1059
> x <- strsplit(as.character(number), "")[[1]]
> y <- seq_len(nchar(number))
> as.numeric(as.numeric(x) %*% y)
[1] 52
weighted.digit <- function(str) {
splitted.nums <- as.numeric(strsplit(str, '')[[1]])
return(sum(splitted.nums * 1:length(splitted.nums)))
}
weighted.digit('1059')
[1] 52
One could modify this to accept a numeric input, and then simply convert that to character as a first step.
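A minimal sketch of that modification might look like the following. The name weighted_digit_num is hypothetical, and the sketch assumes a non-negative whole number small enough to be stored exactly. It uses format(scientific = FALSE) rather than as.character(), because as.character(100000) yields "1e+05", which would break the digit split.

```r
# Hypothetical numeric-input variant of the weighted digit sum.
weighted_digit_num <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 0, n == floor(n))
  # Convert to a plain digit string, avoiding scientific notation.
  digits <- as.numeric(strsplit(format(n, scientific = FALSE), "")[[1]])
  # Weight each digit by its position, counted from the left.
  sum(digits * seq_along(digits))
}

weighted_digit_num(1059)    # 1*1 + 0*2 + 5*3 + 9*4 = 52
weighted_digit_num(100000)  # only the leading 1 contributes: 1
```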

How to find a string in a vector in r?

I have created a function that essentially creates a vector of 1000 binary values. I have been able to count the longest streak of consecutive 1s by using rle.
I was wondering how to find a specific vector (say c(1,0,0,1)) in this larger vector? I would want it to return the number of occurrences of that vector. So c(1,0,0,1,1,0,0,1) should return 2, while c(1,0,0,0,1) should return 0.
Most solutions that I have found just find whether a sequence occurs at all and return TRUE or FALSE, or they give results for the individual values, not the specific vector that is specified.
Here's my code so far:
# creates a function where a 1000 people choose either up or down.
updown <- function(){
n = 1000
X = rep(0,n)
Y = rbinom(n, 1, 1 / 2)
X[Y == 1] = "up"
X[Y == 0] = "down"
#calculate the length of the longest streak of ups:
Y1 <- rle(Y)
streaks <- Y1$lengths[Y1$values == c(1)]
max(streaks, na.rm=TRUE)
}
# repeat this process n times to find the average outcome.
longeststring <- replicate(1000, updown())
mean(longeststring)
This will also work:
library(stringr)
x <- c(1,0,0,1)
y <- c(1,0,0,1,1,0,0,1)
length(unlist(str_match_all(paste(y, collapse=''), '1001')))
[1] 2
y <- c(1,0,0,0,1)
length(unlist(str_match_all(paste(y, collapse=''), '1001')))
[1] 0
If you want to match overlapped patterns,
y <- c(1,0,0,1,0,0,1) # overlapped
length(unlist(gregexpr("(?=1001)",paste(y, collapse=''),perl=TRUE)))
[1] 2
Since Y is only 0s and 1s, we can paste it into a string and use regex, specifically gregexpr. Simplified a bit:
set.seed(47) # for reproducibility
Y <- rbinom(1000, 1, 1 / 2)
count_pattern <- function(pattern, x){
sum(gregexpr(paste(pattern, collapse = ''),
paste(x, collapse = ''))[[1]] > 0)
}
count_pattern(c(1, 0, 0, 1), Y)
## [1] 59
paste reduces the pattern and Y down to strings, e.g. "1001" for the pattern here, and a 1000-character string for Y. gregexpr searches for all occurrences of the pattern in Y and returns the indices of the matches (together with a little more information so they can be extracted, if one wanted). Because gregexpr returns -1 when there is no match, testing for numbers greater than 0 lets us simply sum the TRUE values to get the number of matches; in this case, 59.
The other sample cases mentioned:
count_pattern(c(1,0,0,1), c(1,0,0,1,1,0,0,1))
## [1] 2
count_pattern(c(1,0,0,1), c(1,0,0,0,1))
## [1] 0
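The raw gregexpr output behind these counts can be inspected directly. The short sketch below shows the match positions, the -1 sentinel for no match, and the difference between the plain pattern (non-overlapping matches) and the lookahead form used in the other answer (overlapping matches). The variable names are only for illustration.

```r
# Start positions of the matches of "1001" in "10011001".
hits <- as.integer(gregexpr("1001", "10011001")[[1]])  # 1 5

# No match: gregexpr reports a single -1.
miss <- as.integer(gregexpr("1001", "10001")[[1]])     # -1

# "1001001" contains two overlapping occurrences (at positions 1 and 4).
# The plain pattern consumes the first match and finds only one ...
n_plain <- sum(gregexpr("1001", "1001001")[[1]] > 0)
# ... while a zero-width lookahead finds both.
n_overlap <- sum(gregexpr("(?=1001)", "1001001", perl = TRUE)[[1]] > 0)
```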

how to return number of decimal places in R

I am working in R. I have a series of coordinates in decimal degrees, and I would like to sort these coordinates by how many decimal places these numbers have (i.e. I will want to discard coordinates that have too few decimal places).
Is there a function in R that can return the number of decimal places a number has, that I would be able to incorporate into function writing?
Example of input:
AniSom4 -17.23300000 -65.81700
AniSom5 -18.15000000 -63.86700
AniSom6 1.42444444 -75.86972
AniSom7 2.41700000 -76.81700
AniLac9 8.6000000 -71.15000
AniLac5 -0.4000000 -78.00000
I would ideally write a script that would discard AniLac9 and AniLac5 because those coordinates were not recorded with enough precision. I would like to discard coordinates for which both the longitude and the latitude have fewer than 3 non-zero decimal values.
You could write a small function for the task with ease, e.g.:
decimalplaces <- function(x) {
if ((x %% 1) != 0) {
nchar(strsplit(sub('0+$', '', as.character(x)), ".", fixed=TRUE)[[1]][[2]])
} else {
return(0)
}
}
And run:
> decimalplaces(23.43234525)
[1] 8
> decimalplaces(334.3410000000000000)
[1] 3
> decimalplaces(2.000)
[1] 0
Update (Apr 3, 2018) to address @owen88's report of an error caused by rounding of double-precision floating point numbers, replacing the x %% 1 check:
decimalplaces <- function(x) {
if (abs(x - round(x)) > .Machine$double.eps^0.5) {
nchar(strsplit(sub('0+$', '', as.character(x)), ".", fixed = TRUE)[[1]][[2]])
} else {
return(0)
}
}
Here is one way. It checks the first 20 places after the decimal point, but you can adjust the number 20 if you have something else in mind.
x <- pi
match(TRUE, round(x, 1:20) == x)
Here is another way.
nchar(strsplit(as.character(x), "\\.")[[1]][2])
Following up on Roman's suggestion:
num.decimals <- function(x) {
stopifnot(is.character(x))
x <- sub("0+$","",x)
x <- sub("^.+[.]","",x)
nchar(x)
}
x <- "5.2300000"
num.decimals(x)
If your data isn't guaranteed to be of the proper form, you should do more checking to ensure other characters aren't sneaking in.
Not sure why this simple approach was not used above (load the pipe from tidyverse/magrittr).
count_decimals = function(x) {
#length zero input
if (length(x) == 0) return(numeric())
#count decimals
x_nchr = x %>% abs() %>% as.character() %>% nchar() %>% as.numeric()
x_int = floor(x) %>% abs() %>% nchar()
x_nchr = x_nchr - 1 - x_int
x_nchr[x_nchr < 0] = 0
x_nchr
}
> #tests
> c(1, 1.1, 1.12, 1.123, 1.1234, 1.1, 1.10, 1.100, 1.1000) %>% count_decimals()
[1] 0 1 2 3 4 1 1 1 1
> c(1.1, 12.1, 123.1, 1234.1, 1234.12, 1234.123, 1234.1234) %>% count_decimals()
[1] 1 1 1 1 2 3 4
> seq(0, 1000, by = 100) %>% count_decimals()
[1] 0 0 0 0 0 0 0 0 0 0 0
> c(100.1234, -100.1234) %>% count_decimals()
[1] 4 4
> c() %>% count_decimals()
numeric(0)
So R does not seem internally to distinguish between getting 1.000 and 1 initially. So if one has a vector input of various decimal numbers, one can see how many digits it initially had (at least) by taking the max value of the number of decimals.
Edited: fixed bugs
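The point about 1.000 and 1 being indistinguishable can be checked directly. The sketch below also shows that trailing zeros survive only if the value is kept as text (for example, read in as character), in which case they can be counted from the string. The regular expression is an illustrative choice, not taken from the answers above.

```r
# Numeric literals with trailing zeros are the same double.
same_value <- identical(1.000, 1)  # TRUE
as_text <- as.character(1.1000)    # "1.1": the zeros are already gone

# If the value is kept as a string, the recorded decimals can still be counted:
# strip everything up to and including the decimal point, then count characters.
recorded <- nchar(sub("^[^.]*\\.?", "", "1.1000"))  # 4
```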
If someone here needs a vectorized version of the function provided by Gergely Daróczi above:
decimalplaces <- function(x) {
ifelse(abs(x - round(x)) > .Machine$double.eps^0.5,
nchar(sub('^\\d+\\.', '', sub('0+$', '', as.character(x)))),
0)
}
decimalplaces(c(234.1, 3.7500, 1.345, 3e-15))
#> 1 2 3 0
I have tested some solutions and I found this one robust to the bugs reported in the others.
countDecimalPlaces <- function(x) {
if ((x %% 1) != 0) {
strs <- strsplit(as.character(format(x, scientific = F)), "\\.")
n <- nchar(strs[[1]][2])
} else {
n <- 0
}
return(n)
}
# example to prove the function with some values
xs <- c(1000.0, 100.0, 10.0, 1.0, 0, 0.1, 0.01, 0.001, 0.0001)
sapply(xs, FUN = countDecimalPlaces)
In [R] there is no difference between 2.30000 and 2.3, both get rounded to 2.3 so the one is not more precise than the other if that is what you want to check. On the other hand if that is not what you meant: If you really want to do this you can use 1) multiply by 10, 2) use floor() function 3) divide by 10 4) check for equality with the original. (However be aware that comparing floats for equality is bad practice, make sure this is really what you want)
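The multiply, truncate, divide, compare recipe above can be sketched as follows, generalised to k decimal places. Two deviations from the description are assumptions of this sketch: it uses round() rather than floor(), since floor() can misfire when x * 10^k lands just below a whole number, and it compares with a small tolerance instead of exact equality. The name fits_in_places and the tol argument are hypothetical.

```r
# TRUE where x is fully described by at most k decimal places.
fits_in_places <- function(x, k, tol = 1e-9) {
  scaled <- round(x * 10^k) / 10^k  # x cut back to k decimal places
  abs(scaled - x) < tol             # unchanged (within tolerance)?
}

fits_in_places(c(2.3, 2.34), k = 1)        # TRUE FALSE
# "Fewer than 3 decimals" is the same as fitting in 2 places:
fits_in_places(c(8.6, 1.42444444), k = 2)  # TRUE FALSE
```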
For the common application, here's modification of daroczig's code to handle vectors:
decimalplaces <- function(x) {
y = x[!is.na(x)]
if (length(y) == 0) {
return(0)
}
if (any((y %% 1) != 0)) {
info = strsplit(sub('0+$', '', as.character(y)), ".", fixed=TRUE)
info = info[sapply(info, FUN=length) == 2]
dec = nchar(vapply(info, function(p) p[2], character(1)))
return(max(dec, na.rm=T))
} else {
return(0)
}
}
In general, there can be issues with how a floating point number is stored as binary. Try this:
> sprintf("%1.128f", 0.00000000001)
[1] "0.00000000000999999999999999939458150688409432405023835599422454833984375000000000000000000000000000000000000000000000000000000000"
How many decimals do we now have?
Interesting question. Here is another tweak on the above respondents' work, vectorized, and extended to handle the digits on the left of the decimal point. Tested against negative digits, which would give an incorrect result for the previous strsplit() approach.
If it's desired to only count the ones on the right, the trailingonly argument can be set to TRUE.
nd1 <- function(xx,places=15,trailingonly=F) {
xx<-abs(xx);
if(length(xx)>1) {
fn<-sys.function();
return(sapply(xx,fn,places=places,trailingonly=trailingonly))};
if(xx %in% 0:9) return(!trailingonly+0);
mtch0<-round(xx,nds <- 0:places);
out <- nds[match(TRUE,mtch0==xx)];
if(trailingonly) return(out);
mtch1 <- floor(xx*10^-nds);
out + nds[match(TRUE,mtch1==0)]
}
Here is the strsplit() version.
nd2 <- function(xx,trailingonly=F,...) if(length(xx)>1) {
fn<-sys.function();
return(sapply(xx,fn,trailingonly=trailingonly))
} else {
sum(c(nchar(strsplit(as.character(abs(xx)),'\\.')[[1]][ifelse(trailingonly, 2, T)]),0),na.rm=T);
}
The string version cuts off at 15 digits (actually, not sure why the other one's places argument is off by one... the reason it's exceeded, though, is that it counts digits in both directions, so it could go up to twice the size if the number is sufficiently large). There is probably some formatting option to as.character() that can give nd2() an equivalent option to the places argument of nd1().
nd1(c(1.1,-8.5,-5,145,5,10.15,pi,44532456.345243627,0));
# 2 2 1 3 1 4 16 17 1
nd2(c(1.1,-8.5,-5,145,5,10.15,pi,44532456.345243627,0));
# 2 2 1 3 1 4 15 15 1
nd1() is faster.
rowSums(replicate(10,system.time(replicate(100,nd1(c(1.1,-8.5,-5,145,5,10.15,pi,44532456.345243627,0))))));
rowSums(replicate(10,system.time(replicate(100,nd2(c(1.1,-8.5,-5,145,5,10.15,pi,44532456.345243627,0))))));
Don't mean to hijack the thread, just posting it here as it might help someone to deal with the task I tried to accomplish with the proposed code.
Unfortunately, even @daroczig's updated solution didn't work for me to check whether a number has fewer than 8 decimal digits.
@daroczig's code:
decimalplaces <- function(x) {
if (abs(x - round(x)) > .Machine$double.eps^0.5) {
nchar(strsplit(sub('0+$', '', as.character(x)), ".", fixed = TRUE)[[1]][[2]])
} else {
return(0)
}
}
In my case produced the following results
NUMBER / NUMBER OF DECIMAL DIGITS AS PRODUCED BY THE CODE ABOVE
[1] "0.0000437 7"
[1] "0.000195 6"
[1] "0.00025 20"
[1] "0.000193 6"
[1] "0.000115 6"
[1] "0.00012501 8"
[1] "0.00012701 20"
etc.
So far was able to accomplish the required tests with the following clumsy code:
if (abs(x*10^8 - floor(as.numeric(as.character(x*10^8)))) > .Machine$double.eps*10^8)
{
print("The number has more than 8 decimal digits")
}
PS: I might be missing something in regard to not taking the root of the .Machine$double.eps so please take caution
Another contribution, keeping fully as numeric representations without converting to character:
countdecimals <- function(x)
{
n <- 0
while (!isTRUE(all.equal(floor(x),x)) & n <= 1e6) { x <- x*10; n <- n+1 }
return (n)
}
Vector solution based on daroczig's function (can also deal with dirty columns containing strings and numerics):
decimalplaces_vec <- function(x) {
vector <- c()
for (i in 1:length(x)){
if(!is.na(as.numeric(x[i]))){
if ((as.numeric(x[i]) %% 1) != 0) {
vector <- c(vector, nchar(strsplit(sub('0+$', '', as.character(x[i])), ".", fixed=TRUE)[[1]][[2]]))
}else{
vector <- c(vector, 0)
}
}else{
vector <- c(vector, NA)
}
}
return(max(vector))
}
as.character uses scientific notation for numbers that are between -1e-4 and 1e-4 but not zero:
> as.character(0.0001)
[1] "1e-04"
You can use format(scientific=F) instead:
> format(0.0001,scientific=F)
[1] "0.0001"
Then do this:
nchar(sub("^-?\\d*\\.?","",format(x,scientific=F)))
Or in vectorized form:
> nplaces=function(x)sapply(x,function(y)nchar(sub("^-?\\d*\\.?","",format(y,scientific=F))))
> nplaces(c(0,-1,1.1,0.123,1e-8,-1e-8))
[1] 0 0 1 3 8 8
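As a rough sketch of how this could be applied to the original question, the example below filters a small data frame built from four of the sample rows, keeping a row unless both coordinates have fewer than 3 decimals. The data frame and its column names (id, lat, lon) are assumptions made for illustration. One caveat: format() honours the digits option (7 significant digits by default), so the count is capped for values with many decimals.

```r
nplaces <- function(x) {
  sapply(x, function(y) nchar(sub("^-?\\d*\\.?", "", format(y, scientific = FALSE))))
}

# Hypothetical layout of the coordinate table.
coords <- data.frame(
  id  = c("AniSom4", "AniSom6", "AniLac9", "AniLac5"),
  lat = c(-17.233, 1.42444444, 8.6, -0.4),
  lon = c(-65.817, -75.86972, -71.15, -78),
  stringsAsFactors = FALSE
)

# Keep a row if at least one coordinate has 3 or more decimals.
keep <- nplaces(coords$lat) >= 3 | nplaces(coords$lon) >= 3
coords[keep, ]  # AniSom4 and AniSom6 remain; AniLac9 and AniLac5 are dropped
```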

Resources