I have a string and a character vector. I would like to find all strings in the character vector that match as many characters as possible from the beginning of the string.
For example:
s <- "abs"
vc <- c("ab","bb","abc","acbd","dert")
result <- c("ab","abc")
String s should be matched exactly up to its first K characters, and I want K to be as large as possible (max K <= nchar(s)).
Here there is no match for "abs" (grep("abs", vc)), but for "ab" there are two matches (result <- grep("ab", vc, value = TRUE)).
Another interpretation:
s <- "abs"
# Updated vc
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
st <- strsplit(s, "")[[1]]
mtc <- sapply(strsplit(substr(vc, 1, nchar(s)), ""),
              function(i) {
                m <- i == st[1:length(i)]
                sum(m * cumsum(m))
              })
vc[mtc == max(mtc)]
#[1] "ab" "abc" "abwabsabs"
# Another vector vc
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
....
vc[mtc == max(mtc)]
#[1] "absq"
Since we are considering only the beginnings of strings, in the first case the longest match was "ab", even though "abwabsabs" contains "abs".
Edit: Here is a "single pattern" solution, possibly it could be more concise, but here we go...
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
(auxOne <- sapply((nchar(s)-1):1, function(i) substr(s, 1, i)))
#[1] "ab" "a"
(auxTwo <- sapply(nchar(s):2, function(i) substring(s, i)))
#[1] "s" "bs"
l <- attr(regexpr(
  paste0("^((", s, ")|", paste0("(", auxOne, "(?!", auxTwo, "))", collapse = "|"), ")"),
  vc, perl = TRUE), "match.length")
vc[l == max(l)]
#[1] "ab" "abc" "abwabsabs"
Here's a function that uses grep and checks whether a given string s matches the beginning of any string in vc, iteratively removing one character from the end of s:
myfun <- function(s, vc) {
  notDone <- TRUE
  maxChar <- max(nchar(vc))  # EDIT: these two lines truncate s to
  s <- substr(s, 1, maxChar) # the maximum number of chars in vc
  subN <- nchar(s)
  while (notDone && subN > 0) {
    ss <- substr(s, 1, subN)
    ans <- grep(sprintf("^%s", ss), vc, value = TRUE)
    if (length(ans)) {
      notDone <- FALSE
    } else {
      subN <- subN - 1
    }
  }
  return(ans)
}
s <- "abs"
# Updated vc from @Julius's answer
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
> myfun(s, vc)
[1] "absq"
# And there's no infinite loop if there's no match
> myfun("q", "a")
character(0)
Just a note, long after the fact, that the triebeard package now exists; it's very, very efficient and user-friendly for finding longest or partial matches.
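For illustration, here is a minimal sketch of the trie approach, assuming triebeard's trie() and prefix_match() functions; the wrapper longest_prefix_matches() is my own naming, not part of the package:
library(triebeard)
longest_prefix_matches <- function(s, vc) {
  tr <- trie(keys = vc, values = vc)
  # try successively shorter prefixes of s until something in vc matches
  for (k in nchar(s):1) {
    hits <- prefix_match(tr, substr(s, 1, k))[[1]]
    if (length(hits) && !all(is.na(hits))) return(hits)
  }
  character(0)
}
longest_prefix_matches("abs", c("ab","bb","abc","acbd","dert"))
# expected: "ab" "abc"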
Related
I'm trying to create a function (v, n), where v is a vector of letters/words and n is an integer. This function should return the same vector, but with the first n elements in capital letters.
So far I've made this:
capital <- function(v, n){
c <- v[n]
return (toupper(c))
}
But this function only returns the nth element in capital letters, not all of the first n elements. It also doesn't return the rest of the vector.
So, if we have v <- c("alpha", "bravo", "charlie", "delta") and n <- 2, my function returns:
[1] "BRAVO"
But I would like that it returns:
[1] "ALPHA" "BRAVO" "charlie" "delta"
Any help?
An option would be to subset the vector based on the sequence of 'n' and assign back after applying toupper:
i <- seq_len(n)
v[i] <- toupper(v[i])
-output
v
#[1] "ALPHA" "BRAVO" "charlie" "delta"
In the OP's function, the capitalized values are returned directly instead of being assigned back into the vector by index:
capital <- function(v, n){
  stopifnot(n <= length(v))
  i <- seq_len(n)         # create the index sequence 1..n
  v[i] <- toupper(v[i])   # apply toupper to those elements
  return(v)               # return the updated vector
}
-output
capital(v, n)
#[1] "ALPHA" "BRAVO" "charlie" "delta"
We can try:
capital <- function(v, n){
  c1 <- v[1:n]
  c2 <- toupper(c1)
  if (n >= length(v)) return(c2) # guard: (n+1):length(v) would misindex here
  c3 <- c(c2, v[(n+1):length(v)])
  return(c3)
}
capital(v = c("alpha", "bravo", "charlie", "delta"),n=2 )
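which, for this example, should return:
#[1] "ALPHA" "BRAVO" "charlie" "delta"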
A full explanation would be:
In c1 we choose the number of words.
In c2 we capitalize them.
In c3 we merge with the original vector.
Finally, we return.
I have a list of character vectors representing words split in phonemes:
> head(words)
[[1]]
[1] "UU"
[[2]]
[1] "EY" "Z"
[[3]]
[1] "T" "R" "IH" "P" "UU" "L" "EY"
[[4]]
[1] "AA" "B" "ER" "G"
[[5]]
[1] "AA" "K" "UU" "N"
[[6]]
[1] "AA" "K" "ER"
For each word in the list, I would like to count the words that differ from it by exactly one phoneme (one phoneme added, deleted, or substituted) while keeping the remaining phonemes identical and in the same positions.
In this sense, for the word "EY" "Z" acceptable cases would be:
[1] "M" "EY" "Z"
[1] "AY" "Z"
[1] "EY" "D"
[1] "EY" "Z" "AH"
But the following cases should be rejected:
[1] "EY" "D" "Z"
[1] "Z" "EY" "D"
[1] "HH" "EY"
Basically, I would like to find differences of one element respecting the positions of the phonemes in the vectors.
At the moment, the best solution I have found is:
diffs <- c()
for (i in seq_along(words)) {
  diffs <- c(diffs, sum(sapply(words, function(y) {
    count <- 0
    elements <- list(words[[i]], y)
    len <- c(length(words[[i]]), length(y))
    if (identical(elements[which(len == max(len))][[1]][-1],
                  elements[which(len == min(len))][[1]]) == 1) {
      count + identical(elements[which(len == max(len))][[1]][-1],
                        elements[which(len == min(len))][[1]])
    } else {
      length(elements[which(len == min(len))][[1]]) <-
        length(elements[which(len == max(len))][[1]])
      elements <- rapply(elements, f = function(x) ifelse(is.na(x), "$$", x),
                         how = "replace")
      count + sum(elements[[1]] != elements[[2]])
    }
  }) == 1))
}
However, this solution takes ages because my list words has 120,000 elements (words/vectors), so I would like to ask if you know other solutions to speed up the process.
Thank you very much in advance for your answers
And a different answer, using regular Levenshtein distance (i.e. allowing insertions at any point), but this time fast: 1000 words in 15 seconds.
The trick is using one of the fast Levenshtein implementations available in R packages; in this case I'm using stringdist but any should work. The issue is that they operate on strings and characters, not multi-character phoneme representations. But there's a trivial solution for that: as there are more characters than phonemes, we can just translate the phonemes into single characters. The resulting strings are unreadable as phonemic transcriptions, but work perfectly fine as input to the neighborhood density algorithm.
library(stringdist)
phonemes <- unique(unlist(words))
# add a few buffer characters
targets <- c(letters, LETTERS, 0:9, "!", "§", "%", "&", "/", "=", "#")[1:length(phonemes)]
ptmap <- targets
names(ptmap) <- phonemes
wordsT <- sapply(words, function(i) paste0(ptmap[i], collapse=""))
wordlengths <- nchar(wordsT)
onediffs.M <- function(x) {
  lengthdiff <- abs(wordlengths - nchar(x))
  sum(stringdist(x, wordsT[lengthdiff == 0], method = "hamming") == 1) +
    sum(stringdist(x, wordsT[lengthdiff == 1], method = "lv") == 1)
}
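A small usage sketch (my addition, assuming the objects defined above): apply the helper to every encoded word. Each word is at Hamming distance 0 from itself, so it never counts itself as a neighbour.
neighbors <- vapply(wordsT, onediffs.M, numeric(1))
head(neighbors)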
So, the key here is to separate the words by length so that we can test each operation (substitution/addition/deletion) only on the relevant subset.
get_one_diff <- function(words) {
  K <- max(le <- lengths(words))
  i_chr <- as.character(seq_len(K))
  words.spl <- split(words, le)
  test_substitution <- function(i) {
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le[i]]]], function(word2) {
      sum(word1 != word2) == 1
    }))
  }
  test_addition <- function(i) {
    if ((le <- le[i]) == K) return(0)
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le + 1]]], function(word2) {
      isOneDiff(word1, word2)
    }))
  }
  test_deletion <- function(i) {
    if ((le <- le[i]) == 1) return(0)
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le - 1]]], function(word2) {
      isOneDiff(word2, word1)
    }))
  }
  sapply(seq_along(words), function(i) {
    test_substitution(i) + test_addition(i) + test_deletion(i)
  })
}
where isOneDiff is an Rcpp function:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
bool isOneDiff(const StringVector& w1,
               const StringVector& w2) {
  // Assumes w2 is exactly one element longer than w1: returns true iff
  // deleting a single element of w2 yields w1.
  int i, n = w1.size();
  for (i = 0; i < n; i++) if (w1[i] != w2[i]) break;            // scan to first mismatch
  for (     ; i < n; i++) if (w1[i] != w2[i + 1]) return false; // remainder must align shifted by one
  return true;
}
This is 20 times as fast as your version and as it is merely an sapply, it could be easily parallelized.
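As a hedged sketch of that parallelization (using the base parallel package, which is fork-based and therefore Unix-only in this form; mc.cores = 4 is an arbitrary choice), the final sapply() inside get_one_diff() could be replaced with:
library(parallel)
unlist(mclapply(seq_along(words), function(i) {
  test_substitution(i) + test_addition(i) + test_deletion(i)
}, mc.cores = 4))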
Here's a version using Levenshtein distance with the Wagner-Fischer algorithm.
vecLeven <- function(s, t) {
  d <- matrix(0, nrow = length(s) + 1, ncol = length(t) + 1)
  d[, 1] <- (1:nrow(d)) - 1
  d[1, ] <- (1:ncol(d)) - 1
  for (i in 1:length(s)) {
    for (j in 1:length(t)) {
      d[i+1, j+1] <- min(
        d[i, j+1] + 1,                        # deletion
        d[i+1, j] + 1,                        # insertion
        d[i, j] + if (s[i] == t[j]) 0 else 1  # substitution
      )
    }
  }
  d[nrow(d), ncol(d)]
}
onediff <- sapply(words[1:10], function(x) {
lengthdiff <- sapply(words, function(word) abs(length(word) - length(x)))
sum(sapply(words[lengthdiff == 0], function(word) sum(word != x) == 1)) +
sum(mapply(vecLeven, list(x), words[lengthdiff == 1]) == 1)
})
I tested both versions on the CMU dictionary, which has a similar size. It's a bit faster than your version (about 30 seconds instead of 50 for 10 words), and should parallelize well. Still, running it on the complete data set would take several days.
One large performance factor is that all pairs are computed twice, once for the first word and once for the second; doing a lookup instead would halve that. However, there are more than 7 billion pairs, so you would need a database to store them.
Given the vector:
vec <- c(LETTERS[1:10])
I would like to be able to combine it in a following manner:
resA <- c("AB", "CD", "EF", "GH", "IJ")
resB <- c("ABCDEF","GHIJ")
where elements of the vector vec are merged according to the desired size of the elements of the resulting vector: 2 in the case of resA and 5 in the case of resB.
Desired solution characteristics
The solution should allow for flexibility with respect to the element sizes, i.e. I may want to have vectors with elements of size 2 or 20
There may not be enough elements in the vector to match the desired chunk size; in that case the last element should be shortened accordingly (as shown)
This shouldn't make a difference, but the solution should work on words as well
Attempts
Initially, I was thinking of using something along the lines of:
c(
paste0(vec[1:2], collapse = ""),
paste0(vec[3:4], collapse = ""),
paste0(vec[5:6], collapse = "")
# ...
)
but this would have to be adapted to step through the remaining pairs/bigger groups of vec and handle the last group, which would often be of a smaller size.
Here is what I came up with. Using Harlan's idea from this question, you can split the vector into a chosen number of chunks. You also want to use your paste0() idea inside lapply() here. Finally, unlist() flattens the resulting list.
unlist(lapply(split(vec, ceiling(seq_along(vec)/2)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4 5
#"AB" "CD" "EF" "GH" "IJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/5)), function(x){paste0(x, collapse = "")}))
# 1 2
#"ABCDE" "FGHIJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/3)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4
#"ABC" "DEF" "GHI" "J"
vec <- c(LETTERS[1:10])
f1 <- function(x, n){
  f <- function(x) paste0(x, collapse = '')
  regmatches(f(x), gregexpr(f(rep('.', n)), f(x)))[[1]]
}
f1(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f2 <- function(x, n)
apply(matrix(x, nrow = n), 2, paste0, collapse = '')
f2(vec, 5)
# [1] "ABCDE" "FGHIJ"
or
f3 <- function(x, n) {
  f <- function(x) paste0(x, collapse = '')
  strsplit(gsub(sprintf('(%s)', f(rep('.', n))), '\\1 ', f(x)), '\\s+')[[1]]
}
f3(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
I would say the last is the best of these, since for the others n must be a divisor of the vector length or you will get warnings or recycling
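To illustrate the recycling issue, calling f2 with n = 3 on the length-10 vector should pad the last chunk by recycling (the exact warning text may vary by R version):
f2(vec, 3)
# Warning: data length is not a sub-multiple or multiple of the number of rows
# [1] "ABC" "DEF" "GHI" "JAB"   # "JAB": "J" padded with recycled "A" and "B"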
edit - more
f4 <- function(x, n) {
  f <- function(x) paste0(x, collapse = '')
  Vectorize(substring, USE.NAMES = FALSE)(f(x), which((seq_along(x) %% n) == 1),
                                          which((seq_along(x) %% n) == 0))
}
f4(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f5 <- function(x, n)
  mapply(function(x) paste0(x, collapse = ''),
         split(x, c(0, head(cumsum(rep_len(sequence(n), length(x)) %in% n), -1))),
         USE.NAMES = FALSE)
f5(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
Here is another way, working with the original array.
A side note: working with words is not straightforward, since there are at least two ways to understand the task: you can either keep each word separate or collapse them first and work on individual characters. The next function can deal with both options.
vec <- c(LETTERS[1:10])
vec2 <- c("AB","CDE","F","GHIJ")
cuts <- function(x, n, bychar = FALSE) {
  if (bychar) x <- unlist(strsplit(paste0(x, collapse = ""), ""))
  ii <- seq_along(x)
  li <- split(ii, ceiling(ii/n))
  return(sapply(li, function(y) paste0(x[y], collapse = "")))
}
cuts(vec2,2,F)
# 1 2
# "ABCDE" "FGHIJ"
cuts(vec2,2,T)
# 1 2 3 4 5
# "AB" "CD" "EF" "GH" "IJ"
I have used adist to calculate the number of characters that differ between two strings:
a <- "Happy day"
b <- "Tappy Pay"
adist(a,b) # result 2
Now I would like to extract those character that differ. In my example, I would like to get the string "Hd" (or "TP", it doesn't matter).
I tried to look in adist, agrep and stringi but found nothing.
You can use the following sequence of operations:
Split the string using strsplit().
Use setdiff() to compare the elements.
Wrap everything in a reducing function.
Try this:
Reduce(setdiff, strsplit(c(a, b), split = ""))
[1] "H" "d"
Split into letters and take the difference as sets:
> setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]])
[1] "H" "d"
Not really proud of this, but it seems to do the job:
sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8)
Results:
[1] "H" "d"
You can use one of the variables as a regex character class and gsub out from the other one:
gsub(paste0("[",a,"]"),"",b)
[1] "TP"
gsub(paste0("[",b,"]"),"",a)
[1] "Hd"
As long as a and b have the same length we can do this:
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")
giving:
[1] "Hd"
This seems straightforward in terms of clarity of the code, and it seems tied for the fastest of the solutions provided here, although I think I prefer f3:
f1 <- function(a, b)
paste(setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]]), collapse = "")
f2 <- function(a, b)
paste(sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8), collapse = "")
f3 <- function(a, b)
paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")
f4 <- function(a, b) {
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")
}
a <- "Happy day"
b <- "Tappy Pay"
library(rbenchmark)
benchmark(f1 = f1(a, b), f2 = f2(a, b), f3 = f3(a, b), f4 = f4(a, b),
          replications = 10000, order = "relative")[1:4]
giving the following on a fresh session on my laptop:
test replications elapsed relative
3 f3 10000 0.07 1.000
4 f4 10000 0.07 1.000
1 f1 10000 0.09 1.286
2 f2 10000 0.10 1.429
I have assumed that the differences must be in the corresponding character positions. You might want to clarify if that is the intention or not.
The following function could be a better option for solving problems like this.
list.string.diff <- function(a, b, exclude = c("-", "?"),
                             ignore.case = TRUE, show.excluded = FALSE)
{
  if(nchar(a) != nchar(b)) stop("Lengths of input strings differ. Please check your input.")
  if(ignore.case)
  {
    a <- toupper(a)
    b <- toupper(b)
  }
  split_seqs <- strsplit(c(a, b), split = "")
  only.diff <- (split_seqs[[1]] != split_seqs[[2]])
  only.diff[
    (split_seqs[[1]] %in% exclude) |
    (split_seqs[[2]] %in% exclude)
  ] <- NA
  diff.info <- data.frame(which(is.na(only.diff) | only.diff),
                          split_seqs[[1]][only.diff], split_seqs[[2]][only.diff])
  names(diff.info) <- c("position", "poly.seq.a", "poly.seq.b")
  if(!show.excluded) diff.info <- na.omit(diff.info)
  diff.info
}
from https://www.r-bloggers.com/extract-different-characters-between-two-strings-of-equal-length/
Then you can run
list.string.diff(a, b)
to get the difference.
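For the example strings above (a <- "Happy day", b <- "Tappy Pay"), the call should yield something like:
list.string.diff(a, b)
#  position poly.seq.a poly.seq.b
#1        1          H          T
#2        7          D          P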
I use LETTERS most of the time for my factors but today I tried to go beyond 26 characters:
LETTERS[1:32]
Expecting there to be an automatic continuation (AA, AB, AC, ...), I was disappointed. Is this simply a limitation of LETTERS, or is there a way to get what I'm looking for using another function?
Would 702 be enough?
LETTERS702 <- c(LETTERS, sapply(LETTERS, function(x) paste0(x, LETTERS)))
If not, how about 18,278?
MOAR_LETTERS <- function(n = 2) {
  n <- as.integer(n[1L])
  if (!is.finite(n) || n < 2)
    stop("'n' must be a length-1 integer >= 2")
  res <- vector("list", n)
  res[[1]] <- LETTERS
  for (i in 2:n)
    res[[i]] <- c(sapply(res[[i-1L]], function(y) paste0(y, LETTERS)))
  unlist(res)
}
ml <- MOAR_LETTERS(3)
str(ml)
# chr [1:18278] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" ...
This solution uses recursion. Usage is a bit different in the sense that MORELETTERS is not a long vector you have to store and possibly expand as your inputs get larger. Instead, it is a function that converts your numbers into the new base.
extend <- function(alphabet) function(i) {
  base10toA <- function(n, A) {
    stopifnot(n >= 0L)
    N <- length(A)
    j <- n %/% N
    if (j == 0L) A[n + 1L] else paste0(Recall(j - 1L, A), A[n %% N + 1L])
  }
  vapply(i - 1L, base10toA, character(1L), alphabet)
}
MORELETTERS <- extend(LETTERS)
MORELETTERS(1:1000)
# [1] "A" "B" ... "ALL"
MORELETTERS(c(1, 26, 27, 1000, 1e6, .Machine$integer.max))
# [1] "A" "Z" "AA" "ALL" "BDWGN" "FXSHRXW"
You can make what you want like this:
LETTERS2<-c(LETTERS[1:26], paste0("A",LETTERS[1:26]))
Another solution for Excel-style column names, generalized to any number of letters
#' Excel Style Column Names
#'
#' @param n maximum number of letters in column name
excel_style_colnames <- function(n){
  unlist(Reduce(
    function(x, y) as.vector(outer(x, y, 'paste0')),
    lapply(1:n, function(x) LETTERS),
    accumulate = TRUE
  ))
}
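As a quick sanity check (my addition), the lengths should be sums of powers of 26:
length(excel_style_colnames(2))
#[1] 702     # 26 + 26^2
length(excel_style_colnames(3))
#[1] 18278   # 26 + 26^2 + 26^3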
A variant on eipi10's method (ordered correctly) using data.table:
library(data.table)
BIG_LETTERS <- c(LETTERS,
                 do.call("paste0", CJ(LETTERS, LETTERS)),
                 do.call("paste0", CJ(LETTERS, LETTERS, LETTERS)))
Yet another option:
l2 = c(LETTERS, sort(do.call("paste0", expand.grid(LETTERS, LETTERS[1:3]))))
Adjust the two instances of LETTERS inside expand.grid to get the number of letter pairs you'd like.
A function to produce Excel-style column names, i.e.
# A, B, ..., Z, AA, AB, ..., AZ, BA, BB, ..., ..., ZZ, AAA, ...
letterwrap <- function(n, depth = 1) {
  args <- lapply(1:depth, FUN = function(x) return(LETTERS))
  x <- do.call(expand.grid, args = list(args, stringsAsFactors = F))
  x <- x[, rev(names(x)), drop = F]
  x <- do.call(paste0, x)
  if (n <= length(x)) return(x[1:n])
  return(c(x, letterwrap(n - length(x), depth = depth + 1)))
}
letterwrap(26^2 + 52) # through AAZ
## This will take a few seconds:
# x <- letterwrap(1e6)
It's probably not the fastest, but it extends indefinitely and is nicely predictable. Took about 20 seconds to produce through 1 million, BDWGN.
(For a few more details, see here: https://stackoverflow.com/a/21689613/903061)
A little late to the party, but I want to play too.
You can also use sub and sprintf in place of paste0 and get a length-702 vector.
c(LETTERS, sapply(LETTERS, sub, pattern = " ", x = sprintf("%2s", LETTERS)))
Here's another addition to the list. This seems a bit faster than Gregor's (comparison done on my computer: with length.out = 1e6, his took 12.88 seconds, mine 6.2) and can also be extended indefinitely. The flip side is that it's two functions, not just one.
make.chars <- function(length.out, case, n.char = NULL) {
  if (is.null(n.char))
    n.char <- ceiling(log(length.out, 26))
  m <- sapply(n.char:1, function(x) {
    rep(rep(1:26, each = 26^(x-1)), length.out = length.out)
  })
  m.char <- switch(case,
                   'lower' = letters[m],
                   'upper' = LETTERS[m])
  dim(m.char) <- dim(m)
  apply(m.char, 1, function(x) paste(x, collapse = ""))
}
get.letters <- function(length.out, case = 'upper'){
  max.char <- ceiling(log(length.out, 26))
  grp <- rep(1:max.char, 26^(1:max.char))[1:length.out]
  unlist(lapply(unique(grp), function(n) make.chars(length(grp[grp == n]), case = case, n.char = n)))
}
##
make.chars(5, "lower", 2)
#> [1] "aa" "ab" "ac" "ad" "ae"
make.chars(5, "lower")
#> [1] "a" "b" "c" "d" "e"
make.chars(5, "upper", 4)
#> [1] "AAAA" "AAAB" "AAAC" "AAAD" "AAAE"
tmp <- get.letters(800)
head(tmp)
#> [1] "A" "B" "C" "D" "E" "F"
tail(tmp)
#> [1] "ADO" "ADP" "ADQ" "ADR" "ADS" "ADT"
Created on 2019-03-22 by the reprex package (v0.2.1)