Function to count consecutive digits in a string vector - r

I would like to create a function that takes a string of at least one character, containing the digits 2 through 5, and determines whether any digit N occurs at least N times in a row, where N is the digit's own value.
If so, return the string true, otherwise return the string false.
For example:
Input: "555123"
Output: false
Because 5 is found only 3 times instead of 5.
Or:
Input: "57333"
Output: true
Because 3 is found exactly 3 times.

Try rle + strsplit if you are working with base R:
f <- function(s) {
  with(
    rle(unlist(strsplit(s, ""))),
    any(as.numeric(values) <= lengths & lengths > 1)
  )
}
and you will see
> f("555123")
[1] FALSE
> f("57333")
[1] TRUE
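The rle idea above can be extended to a whole character vector. The following is only a sketch, not part of the original answer: the helper name has_digit_run and its digits argument are hypothetical, and it assumes that only the digits 2 through 5 should count and that "at least N in a row" is the intended rule.

```r
# Hypothetical helper (not from the original answer): vectorised over `s`,
# and restricted to the allowed digits.
has_digit_run <- function(s, digits = 2:5) {
  vapply(strsplit(s, ""), function(chars) {
    r <- rle(chars)
    # Non-digit characters become NA and are ignored below
    v <- suppressWarnings(as.integer(r$values))
    any(v %in% digits & r$lengths >= v)
  }, logical(1))
}

res <- has_digit_run(c("555123", "57333", "22144", "001"))
res
# [1] FALSE  TRUE  TRUE FALSE
```

Unlike the lengths > 1 guard, the explicit digits filter also rules out runs of 0, such as "00".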

Late to the party but maybe still worth your while:
Data:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
Define vector with allowed numbers:
digits <- 2:5
Define alternation pattern with multiple backreferences (each backreference points to its own capture group, numbered by position):
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}", collapse = "|")
Input patt into str_detect:
library(stringr)
str_detect(x, patt)
[1] FALSE TRUE FALSE FALSE TRUE TRUE
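If stringr is not available, the same check can probably be done in base R. This is a sketch under the assumption that backreferences are not needed at all: since each alternative only ever repeats one fixed digit, a plain quantifier is enough. The name patt_simple is introduced here for illustration.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Each alternative has the form "digit{digit}", e.g. "3{3}" matches three 3s in a row
patt_simple <- paste0(digits, "{", digits, "}", collapse = "|")
patt_simple
# [1] "2{2}|3{3}|4{4}|5{5}"

hits <- grepl(patt_simple, x)
hits
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE
```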

You could check if the values in table correspond to the names.
x <- c('555123', '57333')
f <- \(x) {
  s <- strsplit(x, '')
  lapply(s, \(x) {
    tb <- table(x)
    names(tb) == tb
  }) |> setNames(x)
}
f(x)
# $`555123`
# x
# 1 2 3 5
# TRUE FALSE FALSE FALSE
#
# $`57333`
# x
# 3 5 7
# TRUE FALSE FALSE

Another way would be:
my_func <- function(x) {
  d <- as.numeric(unlist(strsplit(x, "")))
  f <- table(d[d %in% 2:5])
  any(names(f) == f)
}
# Input <- "555123"
# (my_func(Input))
# FALSE
# Input <- "57333"
# (my_func(Input))
# TRUE
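One thing to keep in mind with the table-based approaches: table counts all occurrences of a digit, not only consecutive ones, and it tests for exact equality. The following sketch, using a made-up input "35353", shows how that differs from a run-length check.

```r
s <- "35353"                      # three 3s, but never next to each other
chars <- strsplit(s, "")[[1]]

# Count-based check: compares each digit with its total number of occurrences
tb <- table(chars)
count_based <- any(names(tb) == tb)

# Run-based check: compares each digit with the length of its consecutive runs
r <- rle(chars)
run_based <- any(as.numeric(r$values) <= r$lengths & r$lengths > 1)

count_based
# [1] TRUE
run_based
# [1] FALSE
```

Which of the two is right depends on whether "consecutive" in the question is a hard requirement.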

Related

Is there a better way to check if all elements in a list are named?

I want to check if all elements in a list are named. I've come up with this solution, but I wanted to know if there is a more elegant way to check this.
x <- list(a = 1, b = 2)
y <- list(1, b = 2)
z <- list (1, 2)
any(stringr::str_length(methods::allNames(x)) == 0L) # FALSE, all elements are
# named.
any(stringr::str_length(methods::allNames(y)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
any(stringr::str_length(methods::allNames(z)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
I am not sure if the following base R code works for your general cases, but it seems to work for the ones in your post.
Define a function f to check the names
f <- function(lst) length(lst) == sum(names(lst) != "",na.rm = TRUE)
and you will see
> f(x)
[1] TRUE
> f(y)
[1] FALSE
> f(z)
[1] FALSE
We can create a function that checks whether the names attribute is NULL or (|) contains a blank ("") name, and then negates (!) the result
f1 <- function(lst1) is.list(lst1) && !(is.null(names(lst1))| '' %in% names(lst1))
-checking
f1(x)
#[1] TRUE
f1(y)
#[1] FALSE
f1(z)
#[1] FALSE
Or with allNames
f2 <- function(lst1) is.list(lst1) && !("" %in% allNames(lst1))
-checking
f2(x)
#[1] TRUE
f2(y)
#[1] FALSE
f2(z)
#[1] FALSE
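Since the question mentions throwing an error when an element is unnamed, the checks above could be wrapped as an assertion. This is a minimal sketch; the function name assert_all_named and the error message are made up for illustration.

```r
# Hypothetical assertion helper built on the same NULL / empty-name checks
assert_all_named <- function(lst) {
  nms <- names(lst)
  if (is.null(nms) || anyNA(nms) || !all(nzchar(nms))) {
    stop("all elements of the list must be named", call. = FALSE)
  }
  invisible(TRUE)
}

x <- list(a = 1, b = 2)
y <- list(1, b = 2)

assert_all_named(x)   # passes silently
res_y <- tryCatch(assert_all_named(y), error = function(e) conditionMessage(e))
res_y
# [1] "all elements of the list must be named"
```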

Get index of element in unfiltered vector

I have a rle object from a vector and would like to get the sum of lengths (with values == TRUE & values == FALSE) before and after the maximum of lengths (with values == TRUE)
That I can do:
se <- c(3,1,2,3,1,2,3,4,5,6,5,8,9,9,9,5,4,3,4,5,4,3,2)
obj <- rle(se > 4)
obj
Run Length Encoding
lengths: int [1:5] 8 8 3 1 3
values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
#Getting the maximum with values == True
with(obj, max(lengths[values]))
8
However, now I am not sure how to proceed. If I tried to find the maximum 8 via match(), I would first find the element with values == FALSE, and I cannot rely on there being no other lengths = 8 with values == FALSE to the right of the maximum I am looking for.
For the example shown, the expected result would be:
Sum1 = 8
Sum2 = 7
Thank you a lot for your help,
yasel
Here is an idea that uses rleid from data.table in order to create TRUE/FALSE groups, i.e.
i1 <- setNames(se > 4, data.table::rleid(se > 4))
i2 <- names(which.max(tapply(names(i1)[i1], names(i1)[i1], length)))
sum1 <- length(i1[names(i1) < i2])
sum2 <- length(i1[names(i1) > i2])
sum1
#[1] 8
sum2
#[1] 7
You can also make it a function,
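f1 <- function(x, size) {
  # Use x (not the global se) so the function works on any input
  i1 <- setNames(x > size, data.table::rleid(x > size))
  i2 <- names(which.max(tapply(names(i1)[i1], names(i1)[i1], length)))
  # Compare run ids as integers; as strings, "10" would sort before "2"
  ids <- as.integer(names(i1))
  c(sum(ids < as.integer(i2)), sum(ids > as.integer(i2)))
}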
f1(se, 4)
#[1] 8 7
Another possible approach:
with(obj, {
  # Restrict the maximum to TRUE runs and take the first one on ties
  i <- which(values & lengths == max(lengths[values]))[1]
  c(sum(lengths[seq_len(i - 1)]), sum(lengths[-seq_len(i)]))
})
output:
[1] 8 7
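The same result can probably be reached without data.table by masking out the FALSE runs before looking for the maximum. This is a sketch only: the function name sums_around_longest_true is hypothetical, and it assumes that on ties the first of the longest TRUE runs is the one of interest.

```r
# Hypothetical helper: number of elements before and after the longest TRUE run
sums_around_longest_true <- function(flags) {
  r <- rle(flags)
  if (!any(r$values)) {
    return(c(before = NA_integer_, after = NA_integer_))  # no TRUE run at all
  }
  masked <- ifelse(r$values, r$lengths, 0L)  # FALSE runs can never win
  i <- which.max(masked)                     # first longest TRUE run
  ends <- cumsum(r$lengths)                  # end position of every run
  c(before = ends[i] - r$lengths[i], after = length(flags) - ends[i])
}

se <- c(3,1,2,3,1,2,3,4,5,6,5,8,9,9,9,5,4,3,4,5,4,3,2)
res <- sums_around_longest_true(se > 4)
res
# before  after
#      8      7
```

Because the FALSE runs are set to zero before which.max is applied, a FALSE run of the same length cannot be picked up by mistake.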

Check if a number is between two others

I am looking for a function that verifies if a number is between two other numbers. I also need to control if I want a strict comparison (a
I know the function between() in dplyr. Yet, I have to know the upper and lower numbers.
MyNumber = 8
First = 2
Second = 10
# This will return TRUE
between(MyNumber, lower = First, upper = Second)
# But this will return FALSE
between(MyNumber, lower = Second, upper = First)
# This will return TRUE. I want it to return FALSE
First = 8
between(MyNumber, lower = First, upper = Second)
I need a function that returns TRUE regardless of the order in which the two bounds are given.
Something like:
between2 <- function(number,bounds) { number > min(bounds) & number < max(bounds)}
between2(8, c(2,10))
[1] TRUE
between2(8, c(10,2))
[1] TRUE
This function also deals with your added condition
between2(8,c(8,10))
[1] FALSE
You could do it with simple arithmetic:
between <- function(number, first, second) { (first - number) * (second - number) < 0 }
Here are some example outputs:
> between(8, 2, 10)
[1] TRUE
> between(8, 10, 2)
[1] TRUE
> between(8, 10, 12)
[1] FALSE
> between(8, 1, 2)
[1] FALSE
You could use %in% with the : operator, once you know first and last (this only works for whole numbers, because first:last generates an integer sequence):
first <- 2
last <- 10
number <- 8
number %in% first:last
[1] TRUE
first <- 10
last <- 2
number <- 8
number %in% first:last
[1] TRUE
first <- 10
last <- 12
number <- 8
number %in% first:last
[1] FALSE
first <- 12
last <- 10
number <- 8
number %in% first:last
[1] FALSE
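In a function, where strict lets you choose whether the comparison is strict:
my_between <- function(n, f, l, strict = FALSE) {
  lo <- min(f, l)  # sort the bounds so the order of f and l does not matter
  hi <- max(f, l)
  if (!strict) {
    n %in% lo:hi                       # if strict == FALSE (default)
  } else {
    n %in% setdiff(lo:hi, c(lo, hi))   # if strict == TRUE, drop both bounds
  }
}
my_between(8, 2, 10)
# [1] TRUE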
What's wrong with
f_between <- function (num, L, R) num>=min(L,R) & num<=max(L,R)
f_between(8, 2, 10)
#[1] TRUE
f_between(6, 6, 10)
#[1] TRUE
f_between(2, -10, -2)
#[1] FALSE
f_between(3, 5, 7)
#[1] FALSE
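Combining the ideas above, an order-independent check with an optional strict mode might look like the following sketch. The name between_any and its strict argument are illustrative, not from any package; pmin and pmax are used so that the function stays vectorised and also works for non-integer values.

```r
# Hypothetical helper: TRUE when x lies between a and b, in either order
between_any <- function(x, a, b, strict = FALSE) {
  lo <- pmin(a, b)
  hi <- pmax(a, b)
  if (strict) x > lo & x < hi else x >= lo & x <= hi
}

between_any(8, 2, 10)                  # TRUE
between_any(8, 10, 2)                  # TRUE, order does not matter
between_any(8, 8, 10)                  # TRUE, bounds included by default
between_any(8, 8, 10, strict = TRUE)   # FALSE, bounds excluded
between_any(2.5, 3, 2)                 # TRUE, works for non-integers
```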

R: Why do I get wrong results when using a "mean with condition" function in period.apply?

I want to calculate, for every year, the average of the values that are above a constant.
Here is an example:
library(xts)
library(PerformanceAnalytics)
data(edhec)
head(edhec)
edhec_4yr <- edhec["1997/2001"]
ep <- endpoints(edhec_4yr, "years")
# mean
period.apply(edhec_4yr, INDEX = ep, function(x) apply(x,2,mean))
# Length with condition: OK
period.apply(edhec_4yr,
             INDEX = ep,
             function(x) apply(x,
                               2,
                               function(y) length(which(y > 0.002))))
# But mean with condition: the results are wrong,
# they do not match the true values. Why?
period.apply(edhec_4yr,
             INDEX = ep,
             function(x) apply(x,
                               2,
                               function(y) mean(which(y > 0.002))))
View(edhec_4yr)
Thank you in advance for explaining why I do not get the correct result in the last step!
It would help you debug your issue if you create simple examples. A simple example makes the problem clear:
set.seed(21)
(x <- rnorm(10))
# [1] 0.793013171 0.522251264 1.746222241 -1.271336123 2.197389533
# [6] 0.433130777 -1.570199630 -0.934905667 0.063493345 -0.002393336
x > 0
# [1] TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
which(x > 0)
# [1] 1 2 3 5 6 9
mean(which(x > 0))
# [1] 4.333333
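which() returns the positions of the TRUE elements, so mean(which(...)) averages indices rather than the values themselves. Subset the values with the logical condition instead, so you need something like this: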
apply.yearly(edhec_4yr, function(x) apply(x, 2, function(y) mean(y[y > 0.002])))
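The difference between the two calls can be seen on a tiny vector. This is an illustrative sketch with made-up numbers, not the edhec data; it also shows what happens when no value passes the threshold.

```r
y <- c(0.001, 0.004, 0.010, -0.003)

idx_mean <- mean(which(y > 0.002))  # averages the positions 2 and 3
val_mean <- mean(y[y > 0.002])      # averages the values 0.004 and 0.010
none     <- mean(y[y > 10])         # nothing qualifies, so the result is NaN

idx_mean
# [1] 2.5
val_mean
# [1] 0.007
none
# [1] NaN
```

If a whole year has no value above the threshold, the corresponding cell would therefore be NaN rather than an error.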

Looping through 2 vectors of different dimension in R

I have two character vectors a, b with different dimensions. I have to take each element in a and compare with all elements in b and note the element if there is a close match. For matching I'm using agrepl function.
Following is the sample data
a <- c("US","Canada","United States","United States of America")
b <- c("United States","U.S","United States","Canada", "America", "Spain")
Following is the code that I'm using to match. Please help me avoid the for loop, as my real data has more than 900 and 5000 records respectively.
for (i in 1:4)
{
  for (j in 1:6)
  {
    bFlag <- agrepl(a[i], b[j], max.distance = 0.1, ignore.case = TRUE)
    if (bFlag)
    {
      #Custom logic
    }
    else
    {
      #Custom logic
    }
  }
}
You don't need a double loop, since agrepl's second argument accepts vectors of length >= 1. So you could do something like:
lapply(a, function(x) agrepl(x, b, max.distance = 0.1, ignore.case = TRUE))
# [[1]]
# [1] TRUE TRUE TRUE FALSE FALSE TRUE
#
# [[2]]
# [1] FALSE FALSE FALSE TRUE FALSE FALSE
#
# [[3]]
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
#
# [[4]]
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
You can add some custom logic inside the lapply call if needed, but that's not specified in the question so I'll just leave the output as a list of logicals.
If you want indices (of TRUEs) instead of logicals, you can use agrep instead of agrepl:
lapply(a, function(x) agrep(x, b, max.distance = 0.1,ignore.case = TRUE))
# [[1]]
# [1] 1 2 3 6
#
# [[2]]
# [1] 4
#
# [[3]]
# [1] 1 3
#
# [[4]]
# integer(0)
If you only want the first TRUE index, you can use:
sapply(a, function(x) agrep(x, b, max.distance = 0.1,ignore.case = TRUE)[1])
# US Canada United States United States of America
# 1 4 1 NA
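If the custom logic needs every matching pair rather than a list per element, the logicals can also be collected into a matrix. This is a sketch of one possible layout; the names m, pairs and matches are introduced here for illustration.

```r
a <- c("US", "Canada", "United States", "United States of America")
b <- c("United States", "U.S", "United States", "Canada", "America", "Spain")

# One column per element of a, one row per element of b
m <- sapply(a, function(x) agrepl(x, b, max.distance = 0.1, ignore.case = TRUE))
rownames(m) <- b

# Row/column positions of every TRUE cell, turned into a table of matched pairs
pairs <- which(m, arr.ind = TRUE)
matches <- data.frame(a = a[pairs[, "col"]], b = b[pairs[, "row"]])
matches
```

For 900 by 5000 records this builds a 5000 x 900 logical matrix, which should be manageable in memory, although the fuzzy matching itself is still the expensive part.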
