As the title suggests, I am looking for an elegant* way to test whether a character is in the first n positions in the alphabet.
So, for a character vector as follows:
names <- c("Brian", "Cormac", "Zachariah")
And with n <- 6
It would return:
TRUE, TRUE, FALSE
*I am aware that I can use substr(names,1,1) %in% c("A", "B", "C", "D", "E", "F"), but I was hoping for a better solution.
EDIT: What I mean by position in the alphabet is whether the first letter is among the first n letters in alphabetical order. So, "A" is in the first n >= 1 letters, "B" in the first n >= 2, "Y" in the first n >= 25, etc.
PoGibas's comment seems to be as elegant as it gets. The next step would be wrapping it in a function:
cht6_pog <- function(string) {
  x <- toupper(substring(string, 1, 1)) %in% LETTERS[1:6]
  names(x) <- string
  x
}
cht6_pog(names)
    Brian    Cormac Zachariah
     TRUE      TRUE     FALSE
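For what it's worth, here is a hedged sketch of an alternative (not from the original thread) that computes each first letter's alphabet position with match(), so n stays a parameter instead of being hardcoded:
# Sketch: position-based test via match(); assumes plain A-Z first letters
first_n_letters <- function(strings, n) {
  pos <- match(toupper(substr(strings, 1, 1)), LETTERS) # NA if not a letter
  setNames(pos <= n, strings)
}
first_n_letters(names, 6)
#     Brian    Cormac Zachariah
#      TRUE      TRUE     FALSE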
Here is my answer to your question.
# fun:
check_char <- function(string, start_n, end_n, char_pattern) {
  str_list <- strsplit(substr(string, start_n, end_n), "")
  out <- sapply(str_list, function(x) any(tolower(x) %in% tolower(char_pattern)))
  return(out)
}
# args:
str_vec <- c("Google", "Facebook", "Amazon")
str_n <- 1
end_n <- 4
char <- LETTERS[1:6]
# run:
out <- check_char(str_vec, str_n, end_n, char)
print(out)
# [1] FALSE TRUE TRUE
Related
I am working with two data.frames that use different terminology. To keep each data.frame's terminology intact, I am considering simply adding the columns of each to the other data.frame.
df_a <- data.frame(
  A = c("a", "b", "c"),
  B = c("a", "b", "c")
)
df_b <- data.frame(
  same_as_A = c("a", "b", "c"),
  same_as_B = c("a", "b", "c")
)
df_a <- cbind(df_a, df_b)
df_b <- cbind(df_b, df_a)
This will, however, become problematic as soon as I start making changes to any of these columns. I was wondering if there is instead a way, or even a trick, to refer to a column by more than one name. Obviously this does not work, but I mean something like:
df_a <- data.frame(
  A & same_as_A = c("a", "b", "c"),
  B & same_as_B = c("a", "b", "c")
)
Where df_a$same_as_A is equal to df_a$A
"a" "b" "c"
You can derive your own subclass of data.frame, wrap [ and $, and handle aliases explicitly.
aliases <- function(x, ...) {
  dots <- list(...)
  # every alias must be named: aliases(df, alias = "real_column_name")
  stopifnot(!is.null(names(dots)), all(nzchar(names(dots))))
  # merge with any existing aliases; new definitions replace old ones
  nms <- attr(x, "aliases")
  attr(x, "aliases") <- c(nms[!names(nms) %in% names(dots)], dots)
  if (class(x)[1] != "aliased_dataframe") {
    class(x) <- c("aliased_dataframe", class(x))
  }
  x
}
`[.aliased_dataframe` <- function(x, i, j, ...) {
  if (!inherits(x, "aliased_dataframe")) NextMethod()
  if (!missing(j) && length(j)) {
    aliases <- attr(x, "aliases")
    ind <- j %in% names(aliases)
    j[ind] <- unlist(aliases[match(j[ind], names(aliases))])
  }
  NextMethod(object = x)
}
`$.aliased_dataframe` <- function(x, j, ...) {
  if (!inherits(x, "aliased_dataframe")) NextMethod()
  if (!missing(j) && length(j)) {
    aliases <- attr(x, "aliases")
    ind <- j %in% names(aliases)
    j[ind] <- unlist(aliases[match(j[ind], names(aliases))])
  }
  NextMethod(object = x)
}
`$<-.aliased_dataframe` <- function(x, j, ...) {
  if (!inherits(x, "aliased_dataframe")) NextMethod()
  if (!missing(j) && length(j)) {
    aliases <- attr(x, "aliases")
    ind <- j %in% names(aliases)
    j[ind] <- unlist(aliases[match(j[ind], names(aliases))])
  }
  NextMethod(object = x)
}
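The three wrappers share the same alias-resolution body, so as a small refactoring sketch (the helper name resolve_aliases is my own, not part of the original answer), each method could reduce to j <- resolve_aliases(x, j) before NextMethod():
# Hypothetical helper to de-duplicate the alias lookup used by `[`, `$`, and `$<-`
resolve_aliases <- function(x, j) {
  aliases <- attr(x, "aliases")
  ind <- j %in% names(aliases)
  j[ind] <- unlist(aliases[match(j[ind], names(aliases))])
  j
}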
Demo:
df_b <- data.frame(
  same_as_A = c("a", "b", "c"),
  same_as_B = c("a", "b", "c")
)
df_b[, "a"]
# Error in `[.data.frame`(df_b, , "a") : undefined columns selected
df_b$a
# NULL
df_b <- aliases(df_b, a="same_as_A", b="same_as_B")
df_b[, "a"]
# [1] "a" "b" "c"
df_b$a
# [1] "a" "b" "c"
df_b$a <- c("A","B","C")
df_b
# same_as_A same_as_B
# 1 A a
# 2 B b
# 3 C c
Coincidentally, this works with tbl_df as well, but sadly not with data.table variants.
library(tibble) # or dplyr
df_b <- tibble(df_b)
df_b[, "a"]
# Error in `stop_subscript()`:
# ! Can't subset columns that don't exist.
# x Column `a` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.
df_b$a
# Warning: Unknown or uninitialised column: `a`.
# NULL
df_b <- aliases(df_b, a="same_as_A", b="same_as_B")
df_b[, "a"]
# # A tibble: 3 x 1
# same_as_A
# <chr>
# 1 a
# 2 b
# 3 c
df_b$a
# [1] "a" "b" "c"
df_b$a <- c("A","B","C")
df_b
# # A tibble: 3 x 2
# same_as_A same_as_B
# <chr> <chr>
# 1 A a
# 2 B b
# 3 C c
I should note that this accommodates explicit use of j=, as in df_b[, "a"]; the shortcut df_b["a"] technically overloads the i= argument, and while base [.data.frame correctly infers your intent, these S3 wrappers do not. It is not difficult to add that (just another conditional, perhaps starting with if (missing(j) && !missing(i) && is.character(i))), but for simplicity I'm keeping it out. Because of this, df_b["a"] fails.
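If you did want that shortcut, here is a hedged sketch of the extra conditional (lightly adapted from the `[` method above; not tested against every edge case):
`[.aliased_dataframe` <- function(x, i, j, ...) {
  if (!inherits(x, "aliased_dataframe")) NextMethod()
  aliases <- attr(x, "aliases")
  if (missing(j) && !missing(i) && is.character(i)) {
    # df_b["a"]: i is doing the work of j here
    ind <- i %in% names(aliases)
    i[ind] <- unlist(aliases[match(i[ind], names(aliases))])
  } else if (!missing(j) && length(j)) {
    ind <- j %in% names(aliases)
    j[ind] <- unlist(aliases[match(j[ind], names(aliases))])
  }
  NextMethod(object = x)
}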
Another note: I did not overload [[, so df_b[["a"]] returns NULL. If it's really important to you, one could adapt this methodology to do that as well, as sketched below.
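A hedged sketch of that adaptation, mirroring the `$` method above:
`[[.aliased_dataframe` <- function(x, j, ...) {
  if (!inherits(x, "aliased_dataframe")) NextMethod()
  if (!missing(j) && is.character(j)) {
    aliases <- attr(x, "aliases")
    ind <- j %in% names(aliases)
    j[ind] <- unlist(aliases[match(j[ind], names(aliases))])
  }
  NextMethod(object = x)
}
df_b[["a"]]
# expected: [1] "A" "B" "C"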
In my R function below, I was wondering how I could get the length of the unique elements (which is 2) of two vectors a and b?
Here is what I tried without success:
foo <- function(...){
  L <- list(...)
  lengths(unique(unlist(L)))
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo(a, b) # the function returns 1 1 instead of 2 2
Use lapply() or sapply() because your object is a list. I think you might want to check the difference between length() and lengths(): both exist, but they do different things. I provide two solutions, foo1 and foo2:
foo1 <- function(...){
  L <- list(...)
  sapply(L, function(x) length(unique(x)))
}
foo2 <- function(...){
  L <- list(...)
  lengths(lapply(L, unique))
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo1(a, b)
# [1] 2 2
foo2(a, b)
# [1] 2 2
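To see the length() vs lengths() difference concretely, and why the OP's version returned 1 1:
u <- unique(unlist(list(a, b))) # unlist merges both vectors first
u          # [1] "a" "b"
lengths(u) # [1] 1 1 -- per-element lengths of a character vector
length(u)  # [1] 2   -- one overall count, not one count per input vector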
Here is the answer.
You were using the unlist function, so the two vectors were merged into one before unique was applied, which is why you got 1 1!
Use this code instead:
foo <- function(a, b){
  L <- list(a, b)
  lengths(lapply(L, unique)) # this returns 2 2
}
a = rep(c("a", "b"), 30) # Vector `a`
b = rep(c("a", "b"), 20) # Vector `b`
foo(a, b)
# [1] 2 2
I have a situation where I would like to detect conditions between two logical, named vectors based on the TRUE / FALSE combination at each position in the vector. For example:
x <- c(TRUE, FALSE, FALSE, TRUE)
names(x) <- c("a", "b", "c", "d")
y <- c(TRUE, TRUE, FALSE, FALSE)
names(y) <- names(x)
For each element in these two vectors I want to detect 3 conditions:
x[i] is TRUE and y[i] is TRUE;
x[i] is FALSE and y[i] is TRUE;
x[i] is TRUE and y[i] is FALSE.
The length of x and y are the same but could be longer than this example. I want to retrieve the name of the element for each condition and assign the element name to a new variable. For this example:
v1 <- "a"
v2 <- "b"
v3 <- "d"
In a longer version of these two vectors I might end up with something like:
v1 <- c("a", "e")
v2 <- c("b", "f", "g")
v3 <- c("d", "i", "k", "l")
What is the best vectorized way to do this? I think it is simple, but I am unable to come up with the answer. Thanks in advance.
We can efficiently use split, but before that, we need a single grouping index. Here is a possibility:
g <- x + y + x
split(names(x), g)
To understand the above grouping index, consider this:
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x + y + x
#[1] 3 2 1 0
So you can see that 4 combinations of TRUE and FALSE are mapped to 4 integer values.
Ah, so "a" get assigned to T-T, "b" to T-F, etc. But, why the x + y + x?? I don't follow adding x twice.
If you only do x + y, the result is only 0, 1 and 2. You won't be able to differentiate T-F and F-T as they are both 1.
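With the demo vectors above:
x + y     # [1] 2 1 1 0 -- T-F and F-T both map to 1
2 * x + y # [1] 3 2 1 0 -- same as x + y + x; all four cases distinct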
@thelatemail offers a more readable way:
split(names(x), interaction(x, y, drop=TRUE))
Update
Ah... stupid me... why did I bother creating g? I suddenly remembered that we can pass a list to the f argument of split:
split(names(x), list(x, y))
Note, internally in split.default:
if (is.list(f))
  f <- interaction(f, drop = drop, sep = sep)
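Applied to the question's original vectors, a short sketch of how the groups map onto v1, v2, and v3 (the group labels come from interaction's level names):
x <- c(TRUE, FALSE, FALSE, TRUE); names(x) <- c("a", "b", "c", "d")
y <- c(TRUE, TRUE, FALSE, FALSE); names(y) <- names(x)
grp <- split(names(x), list(x, y))
v1 <- grp[["TRUE.TRUE"]]  # x TRUE,  y TRUE  -> "a"
v2 <- grp[["FALSE.TRUE"]] # x FALSE, y TRUE  -> "b"
v3 <- grp[["TRUE.FALSE"]] # x TRUE,  y FALSE -> "d"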
I have a list of vectors as follows.
data <- list(v1=c("a", "b", "c"), v2=c("g", "h", "k"),
v3=c("c", "d"), v4=c("n", "a"), v5=c("h", "i"))
I am trying to achieve the following:
Check whether any of the vectors intersect with each other
If intersecting vectors are found, get their union
So the desired output is
out <- list(v1=c("a", "b", "c", "d", "n"), v2=c("g", "h", "k", "i"))
I can get the union of a group of intersecting sets as follows.
Reduce(union, list(data[[1]], data[[3]], data[[4]]))
Reduce(union, list(data[[2]], data[[5]]))
How to first identify the intersecting vectors? Is there a way of dividing the list into lists of groups of intersecting vectors?
Update
Here is an attempt using data.table. It gets the desired results, but is still slow for large lists.
library(data.table)
data <- sapply(data, function(x) paste(x, collapse = ", "))
data <- as.data.frame(data, stringsAsFactors = FALSE)
repeat {
  M <- nrow(data)
  data <- data.table(data, key = "data")
  data <- data[, list(dataelement = unique(unlist(strsplit(data, ", ")))), by = list(data)]
  data <- data.table(data, key = "dataelement")
  data <- data[, list(data = paste0(sort(unique(unlist(strsplit(data, split = ", ")))), collapse = ", ")), by = "dataelement"]
  data$dataelement <- NULL
  data <- unique(data)
  N <- nrow(data)
  if (M == N)
    break
}
data <- strsplit(as.character(data$data), ", ")
This is kind of like a graph problem, so I like to use the igraph library for it. Using your sample data, you can do:
library(igraph)
#build edgelist
el <- do.call("rbind",lapply(data, embed, 2))
#make a graph
gg <- graph.edgelist(el, directed=F)
#partition the graph into disjoint sets
split(V(gg)$name, clusters(gg)$membership)
# $`1`
# [1] "b" "a" "c" "d" "n"
#
# $`2`
# [1] "h" "g" "k" "i"
And we can view the results with
V(gg)$color <- c("green", "purple")[clusters(gg)$membership]
plot(gg)
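If you want the components named like the desired out, a small follow-up sketch:
res <- split(V(gg)$name, clusters(gg)$membership)
names(res) <- paste0("v", seq_along(res))
res
# $v1
# [1] "b" "a" "c" "d" "n"
# $v2
# [1] "h" "g" "k" "i"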
Here's another approach using only base R.
Update
Next update, after @akrun's comment and with his sample data:
data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))
Modified function:
x <- lapply(seq_along(data), function(i) {
  if (!any(data[[i]] %in% unlist(data[-i]))) {
    data[[i]]
  } else if (any(data[[i]] %in% unlist(data[seq_len(i-1)]))) {
    NULL
  } else {
    z <- lapply(data[-seq_len(i)], intersect, data[[i]])
    z <- names(z[sapply(z, length) >= 1L])
    if (is.null(z)) NULL else union(data[[i]], unlist(data[z]))
  }
})
x[!sapply(x, is.null)]
#[[1]]
#[1] "g" "k"
#
#[[2]]
#[1] "a" "b" "c" "d"
This works well with the original sample data, MrFlick's sample data and akrun's sample data.
Efficiency be damned and do you people even sleep? Base R only and much slower than the fastest answer. Since I wrote it, might as well post it.
f.union = function(x) {
  repeat {
    n = length(x)
    # m[i, j] is TRUE when sets i and j share at least one element
    m = matrix(FALSE, nrow = n, ncol = n)
    for (i in 1:n) {
      for (j in 1:n) {
        m[i, j] = any(x[[i]] %in% x[[j]])
      }
    }
    # union everything each column touches; stop once every row is a singleton
    o = apply(m, 2, function(v) Reduce(union, x[v]))
    if (all(apply(m, 1, sum) == 1)) return(o) else x = unique(o)
  }
}
f.union(data)
[[1]]
[1] "a" "b" "c" "d" "n"
[[2]]
[1] "g" "h" "k" "i"
Because I like being slow. (loaded library outside of benchmark)
Unit: microseconds
expr min lq mean median uq max neval
vlo() 896.435 1070.6540 1315.8194 1129.4710 1328.6630 7859.999 1000
akrun() 596.263 658.6590 789.9889 694.1360 804.9035 3470.158 1000
flick() 805.854 928.8160 1160.9509 1001.8345 1172.0965 5780.824 1000
josh() 2427.752 2693.0065 3344.8671 2943.7860 3524.1550 16505.909 1000 <- deleted :-(
doc() 254.462 288.9875 354.6084 302.6415 338.9565 2734.795 1000
One option would be to use combn and then find the intersections. There may be easier options.
indx <- combn(names(data),2)
lst <- lapply(split(indx, col(indx)),
function(i) Reduce(`intersect`,data[i]))
indx1 <- names(lst[sapply(lst, length)>0])
indx2 <- indx[,as.numeric(indx1)]
indx3 <- apply(indx2,2, sort)
lapply(split(1:ncol(indx3), indx3[1,]),
function(i) unique(unlist(data[c(indx3[,i])], use.names=FALSE)))
#$v1
#[1] "a" "b" "c" "d" "n"
#$v2
#[1] "g" "h" "k" "i"
Update
You could use combnPrim from library(gRbase) to make this even faster. Using a slightly bigger dataset
library(gRbase)
set.seed(25)
data <- setNames(lapply(1:1e3,function(i)sample(letters,
sample(1:20), replace=FALSE)), paste0("v", 1:1000))
and comparing with the fastest. These are modified functions based on the OP's comments to @docendo discimus.
akrun2M <- function(){
  ind <- sapply(seq_along(data), function(i){ # copied from @docendo discimus
    !any(data[[i]] %in% unlist(data[-i]))
  })
  data1 <- data[!ind]
  indx <- combnPrim(names(data1), 2)
  lst <- lapply(split(indx, col(indx)),
                function(i) Reduce(`intersect`, data1[i]))
  indx1 <- names(lst[sapply(lst, length) > 0])
  indx2 <- indx[, as.numeric(indx1)]
  indx3 <- apply(indx2, 2, sort)
  c(data[ind], lapply(split(1:ncol(indx3), indx3[1,]),
                      function(i) unique(unlist(data[c(indx3[,i])], use.names = FALSE))))
}
doc2 <- function(){
  x <- lapply(seq_along(data), function(i) {
    if (!any(data[[i]] %in% unlist(data[-i]))) {
      data[[i]]
    } else {
      z <- unlist(data[names(unlist(lapply(data[-c(1:i)],
                                           intersect, data[[i]])))])
      if (is.null(z)) z else union(data[[i]], z)
    }
  })
  x[!sapply(x, is.null)]
}
Benchmarks
library(microbenchmark)
microbenchmark(doc2(), akrun2M(), times = 10L)
# Unit: seconds
# expr min lq mean median uq max neval cld
# doc2() 35.43687 53.76418 54.77813 54.34668 62.86665 67.76754 10 b
#akrun2M() 26.64997 28.74721 38.02259 35.35081 47.56781 49.82158 10 a
I came across a similar problem that prompted me to look everywhere for a solution. I finally found a very good one thanks to a number of great contributors here; however, when I saw this post I thought I would write my own custom function for the purpose. It's not actually elegant and it is too slow, but I think it's quite effective and can do the trick for now, until I make some improvements:
anoush <- function(x) {
  # First we check whether x is a list
  stopifnot(is.list(x))
  # Then we take every element of the input and calculate the intersect between
  # that element & the others. In case there were some, we store the indices
  # in the `vec` vector. So in the end we have a list called `ind` whose elements
  # are all the indices connected with the corresponding elements of the original
  # list. For example, the first element of `ind` is `1`, `2`, `3`, which means
  # that in the original list these elements have common values.
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for (i in 1:length(x)) {
      if (length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec
  })
  # Then we go on to again compare each element of `ind` with the other elements;
  # in case there were any intersects, we calculate the `union` of them.
  # For each element we end up with a list of accumulated values, but
  # in the end we use `Reduce` to capture only the last one. So for each
  # element of `ind` we end up having a collection of indices that also
  # result in duplicated values. For example, elements `1` through `5` of
  # `dup_ind` contain the same value because in the original list these
  # elements have common values.
  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for (i in 1:length(ind)) {
      if (length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  })
  # Here we get rid of the duplicated elements of the list by means of the
  # `relist` function, and since in this process all the duplicated elements
  # turn into `integer(0)`, I have filtered those out.
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
}
OP's Data Sample
> anoush(data)
[[1]]
[1] "a" "b" "c" "d" "n"
[[2]]
[1] "g" "h" "k" "i"
Dear @akrun's Data Sample
data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))
> anoush(data)
[[1]]
[1] "g" "k"
[[2]]
[1] "a" "b" "c" "d"
In general, you cannot do much better/faster than the Floyd-Warshall algorithm, which is as follows:
library(Rcpp)
cppFunction(
"LogicalMatrix floyd(LogicalMatrix w){
int n = w.nrow();
for( int k = 0; k < n; k++ )
for( int i = 0; i < (n-1); i++ )
for( int j = i+1; j < n; j++ )
if( w(i,k) && w(k,j) ) {
w(i,j) = true;
w(j,i) = true;
}
return w;
}")
fw.union <- function(x) {
  n <- length(x)
  # adjacency matrix: w[i, j] is TRUE when sets i and j intersect
  w <- matrix(FALSE, nrow = n, ncol = n)
  for (i in 1:n) {
    w[i, i] <- TRUE
  }
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      w[i, j] <- w[j, i] <- any(x[[i]] %in% x[[j]])
    }
  }
  # transitive closure via floyd(); unique rows give the connected groups
  apply(unique(floyd(w)), 1, function(y) Reduce(union, x[y]))
}
Running benchmarks would be interesting, though. Preliminary tests suggest that my implementation is about 2-3 times faster than Vlo's.
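For example, a minimal benchmark sketch against Vlo's f.union() on the question's data (timings will of course vary by machine):
library(microbenchmark)
data <- list(v1 = c("a", "b", "c"), v2 = c("g", "h", "k"),
             v3 = c("c", "d"), v4 = c("n", "a"), v5 = c("h", "i"))
microbenchmark(fw.union(data), f.union(data), times = 100L)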
I have a string and a character vector. I would like to find all strings in the character vector that match as many characters as possible from the beginning of the string.
For example:
s <- "abs"
vc <- c("ab","bb","abc","acbd","dert")
result <- c("ab","abc")
String s should be matched exactly up to its first K characters, and I want the match to be as long as possible (max K <= nchar(s)).
Here there is no match for "abs" (grep("abs", vc)), but for "ab" there are two matches (result <- grep("ab", vc)).
Another interpretation:
s <- "abs"
# Updated vc
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
st <- strsplit(s, "")[[1]]
mtc <- sapply(strsplit(substr(vc, 1, nchar(s)), ""),
              function(i) {
                m <- i == st[1:length(i)]
                sum(m * cumsum(m))
              })
vc[mtc == max(mtc)]
#[1] "ab" "abc" "abwabsabs"
# Another vector vc
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
....
vc[mtc == max(mtc)]
#[1] "absq"
Since we are considering only the beginnings of strings, in the first case the longest match was "ab", even though "abwabsabs" contains "abs".
Edit: Here is a "single pattern" solution, possibly it could be more concise, but here we go...
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
(auxOne <- sapply((nchar(s)-1):1, function(i) substr(s, 1, i)))
#[1] "ab" "a"
(auxTwo <- sapply(nchar(s):2, function(i) substring(s, i)))
#[1] "s" "bs"
l <- attr(regexpr(
paste0("^((",s,")|",paste0("(",auxOne,"(?!",auxTwo,"))",collapse="|"),")"),
vc, perl = TRUE), "match.length")
vc[l == max(l)]
#[1] "ab" "abc" "abwabsabs"
Here's a function that uses grep to check whether a given string s matches the beginning of any string in vc, iteratively removing a character from the end of s:
myfun <- function(s, vc) {
  notDone <- TRUE
  maxChar <- max(nchar(vc)) # EDIT: these two lines truncate s to
  s <- substr(s, 1, maxChar) # the maximum number of chars in vc
  subN <- nchar(s)
  while (notDone && subN > 0) {
    ss <- substr(s, 1, subN)
    ans <- grep(sprintf("^%s", ss), vc, value = TRUE)
    if (length(ans)) {
      notDone <- FALSE
    } else {
      subN <- subN - 1
    }
  }
  return(ans)
}
s <- "abs"
# Updated vc from #Julius's answer
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
> myfun(s, vc)
[1] "absq"
# And there's no infinite loop if there's no match
> myfun("q", "a")
character(0)
Just a note, long after the fact, that the triebeard package now exists; it's very, very efficient and user-friendly for finding longest or partial matches.
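A minimal sketch of that approach (assuming triebeard's trie()/longest_match() API; note it returns the longest vc entry that is a prefix of s, which is subtly different from returning every tied partial match):
library(triebeard)
vc <- c("ab", "bb", "abc", "acbd", "dert")
vc_trie <- trie(keys = vc, values = vc)
longest_match(vc_trie, "abs")
# expected: [1] "ab"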