Fastest Way To Find Last Names From String in R

I am trying to identify the likely last name from parts of name strings in various formats in R. What is the fastest way to identify the longest string match from a dataset of last names (I'm using the wru surnames2010 dataset) for a given name string?
I need the longest possibility rather than any possibility. For example, the first string below, "scottcampbell", contains the possible surnames "scott" and "campbell". I want to return only the longest of the possible matches, in this case "campbell".
Reproduce example data:
library(wru)
data("surnames2010")
#filter out names under 4 characters
lnames <- surnames2010[nchar(as.character(surnames2010$surname))>3,]
testvec <- c("scottcampbell","mattbaker","tsmith","watkins","burnsmary","terri","frankrodriguez","neal")
Desired imagined function+result:
foo_longest_matches(testvec)
#Desired imagined result:
[1] "campbell" "baker" "smith" "watkins" "burns" "terri" "rodriguez" "neal")

You could use adist. Note that you are doing more than 1 million comparisons to obtain the longest match, so I would prefer a different method. The best one I have in mind so far is:
a <- adist(toupper(testvec), surnames2010$surname, counts = TRUE)
b <- attr(a, "trafos")
d <- array(grepl("S|I", b) + nchar(gsub("(.)\\1++", "1", b, perl = TRUE)), dim(a)) * 10 + a
as.character(surnames2010$surname[max.col(-d)])
[1] "CAMPBELL" "BAKER" "SMITH" "WATKINS" "BURNS" "TERRI" "RODRIGUEZ" "NEAL"
benchmark:
longest <- function(testvec, namevec){
  a <- adist(testvec, namevec, counts = TRUE)
  b <- attr(a, "trafos")
  d <- array(grepl("S|I", b) + nchar(gsub("(.)\\1++", "1", b, perl = TRUE)), dim(a)) * 10 + a
  as.character(namevec[max.col(-d)])
}
EDIT: I was able to obtain a faster method (not necessarily the fastest):
longest2 <- function(testvec, namevec){
  a <- stack(sapply(namevec, grep, testvec, value = TRUE, simplify = FALSE))
  tapply(as.character(a[, 2]), a[, 1], function(x) x[which.max(nchar(x))])[testvec]
}
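To see what longest2() builds internally, here is a toy run with a tiny, hand-picked lowercase dictionary (the surnames below are illustrative, not taken from surnames2010; the printed output is indicative):
namevec <- c("scott", "campbell", "baker")
a <- stack(sapply(namevec, grep, testvec, value = TRUE, simplify = FALSE))
a
#          values      ind
# 1 scottcampbell    scott
# 2 scottcampbell campbell
# 3     mattbaker    baker
# one row per (test string, matching surname) pair; tapply() then keeps the
# longest surname within each test string
tapply(as.character(a[, 2]), a[, 1], function(x) x[which.max(nchar(x))])
#     mattbaker scottcampbell
#       "baker"    "campbell"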
microbenchmark::microbenchmark(longest(testvec,lnames$surname),longest2(testvec,lnames$surname),foo_longest_matches(testvec),times = 5)
Unit: seconds
expr min lq mean median uq max neval
longest(testvec, lnames$surname) 3.316550 3.984128 5.308339 6.265192 6.396348 6.579477 5
longest2(testvec, lnames$surname) 1.817059 1.917883 2.835354 3.350068 3.538278 3.553481 5
foo_longest_matches(testvec) 10.093179 10.325489 11.610619 10.756714 10.889326 15.988384 5

Not sure about fastest but here is a method to test:
library(wru)
data("surnames2010")
lnames <- surnames2010[nchar(as.character(surnames2010$surname))>3,]
testvec <- c("scottcampbell","mattbaker","tsmith","watkins","burnsmary","terri","frankrodriguez","neal")
lnames$surname <- tolower(lnames$surname)
testvec <- tolower(testvec)
foo_longest_matches <- function(string_vector) {
  outdf <- c()
  for (name in string_vector) {
    print(name)
    ting <- lnames[sapply(lnames$surname, function(x) grepl(x, name)), ]
    # you only care about the longest; remove the next line to get all matches
    ting <- ting[which.max(nchar(ting$surname)), ]
    outdf <- rbind(outdf, ting)
  }
  return(outdf)
}
get_matches <- foo_longest_matches(testvec)
get_matches
# surname p_whi p_bla p_his p_asi p_oth
# 47 campbell 0.7366 0.2047 0.02490000 0.00530000 0.02840000
# 44 baker 0.7983 0.1444 0.02280000 0.00560000 0.02890000
# 1 smith 0.7090 0.2311 0.02400000 0.00500000 0.03080000
# 240 watkins 0.6203 0.3227 0.02090000 0.00420000 0.03200000
# 155 burns 0.8026 0.1406 0.02480000 0.00590000 0.02610000
# 110133 terri 0.7453 0.1801 0.01243333 0.01243333 0.04973333
# 9 rodriguez 0.0475 0.0054 0.93770000 0.00570000 0.00360000
# 337 neal 0.6210 0.3184 0.02160000 0.00600000 0.03290000

Related

Keeping vectors (from list of vectors) whose elements do not have a proper subset within that same list (using Rcpp)

I have asked this question previously (see here) and received a satisfactory answer using the purrr package. However, this has proved to be a bottleneck in my program, so I would like to rewrite the section using the Rcpp package.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
Previous Solution
library(purrr)
possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
     map2_lgl(.x = possibilities,
              .y = seq_along(possibilities),
              ~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The idea here is to avoid the O(N^3) approach and use a lower-order method instead. The other answer provided here will still be slow, since it is above O(N^2). Here is a solution that is below O(N^2), with a worst case of O(N^2) when all the elements are unique.
onlySet <- function(x){
  i <- 1
  repeat{
    y <- sapply(x[-1], function(el) !all(is.element(x[[1]], el)))
    if(all(y)){
      if(i == length(x)) break
      else i <- i + 1
    }
    x <- c(x[-1][y], x[1])
  }
  x
}
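For reference, applying this to the example list keeps exactly the three minimal sets, although in a rotated order (a quick sketch of the expected output):
possibilities <- list(a, b, c, d, e, f)
onlySet(possibilities)
# [[1]]
# [1] 1 3
#
# [[2]]
# [1] 2 4
#
# [[3]]
# [1] 1 2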
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
  mat <- outer(a, a, match_fun)
  a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a
Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines. Here mat[i, j] is TRUE when the i-th set is contained in the j-th set, so colSums(mat) == 1 keeps exactly those sets whose only subset in the list is themselves:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
Inspired by Onyambu's performant solution, here is another base R option using a recursive function
f_recursive <- function(x, i = 1) {
  if (i > length(x)) return(x)
  idx <- which(sapply(x[-i], function(el) all(x[[i]] %in% el))) + 1
  if (length(idx) == 0) f_recursive(x, i + 1) else f_recursive(x[-idx], i + 1)
}
f_recursive(possibilities)
The performance is on par with Onyambu's solution.
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
method1(poss),
onlySet(poss),
f_recursive(poss))
#Unit: milliseconds
# expr min lq mean median uq
# method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976
# onlySet(poss) 1.700646 1.782713 1.870972 1.819820 1.918669
# f_recursive(poss) 1.681120 1.737459 1.884685 1.806384 1.901582
# max neval
# 1200.562889 100
# 2.371646 100
# 3.217013 100
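As a sanity check, the three implementations should agree up to ordering; one way to compare them (canon() is just an illustrative helper that turns each result into a sorted set of strings) is:
canon <- function(x) sort(sapply(x, function(v) paste(sort(v), collapse = ",")))
identical(canon(method1(possibilities)), canon(onlySet(possibilities)))
# [1] TRUE
identical(canon(method1(possibilities)), canon(f_recursive(possibilities)))
# [1] TRUE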

Delete last two characters in string if they match criteria

I have 2 million names in a database. For example:
df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE A", "A SCHWARZENEGGER"))
> df
names
1 A ADAM
2 S BEAN
3 A APPLE A
4 A SCHWARZENEGGER
I want to delete ' A' (white space A) if these are the last two characters of the string.
I know that regex is our friend here. How do I efficiently apply a regex function to the last two characters of the string?
Desired output:
> output
names
1 A ADAM
2 S BEAN
3 A APPLE
4 A SCHWARZENEGGER
If you want good performance for millions of records, the stringi package is what you need. It even outperforms the base R functions:
require(stringi)
n <- 10000
x <- stri_rand_strings(n, 1:100)
ind <- sample(n, n/100)
x[ind] <- stri_paste(x[ind]," A")
baseR <- function(x){
  sub("\\sA$", "", x)
}
stri1 <- function(x){
  stri_replace_last_regex(x, "\\sA$", "")
}
stri2 <- function(x){
  ind <- stri_detect_regex(x, "\\sA$")
  x[ind] <- stri_sub(x[ind], 1, -3)
  x
}
# if we assume that there can only be a space, not any whitespace character,
# this is even faster (ca 200x)
stri3 <- function(x){
  ind <- stri_endswith_fixed(x, " A")
  x[ind] <- stri_sub(x[ind], 1, -3)
  x
}
head(stri2(x),44)
require(microbenchmark)
microbenchmark(baseR(x), stri1(x),stri2(x),stri3(x))
Unit: microseconds
expr min lq mean median uq max neval
baseR(x) 166044.032 172054.30 183919.6684 183112.1765 194586.231 219207.905 100
stri1(x) 36704.180 39015.59 41836.8612 40164.9365 43773.034 60373.866 100
stri2(x) 17736.535 18884.56 20575.3306 19818.2895 21759.489 31846.582 100
stri3(x) 491.963 802.27 918.1626 868.9935 1008.776 2489.923 100
We can use sub to match a space (\\s) followed by 'A' at the end ($) of the string and replace it with a blank (""):
df$names <- sub("\\sA$", "", df$names)
df$names
#[1] "A ADAM" "S BEAN" "A APPLE" "A SCHWARZENEGGER"
The answer from @akrun is, of course, correct, but based on the comments I will just add one more thing for when the column is a factor.
Using the example of @vincentmajor in the comments:
df <- df2 <- data.frame(names = rep(c("A ADAM", "S BEAN", "A APPLE A", "A SCHWARZENEGGER"), length.out = 2000000))
# Probably we want the column to remain factor after substitution
system.time(
df$names <- factor(sub("\\sA$", "", df$names))
)
# user system elapsed
# 0.892 0.000 0.893
# Also if there are a lot of duplicates, like in this example,
# substituting the levels is way quicker
system.time(
levels(df2$names) <- sub("\\sA$", "", levels(df2$names))
)
# user system elapsed
# 0.052 0.000 0.053
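A short sketch of why the levels() route wins: the regex only has to touch the handful of unique levels, and the assignment then propagates to all two million rows for free.
nlevels(df2$names)  # 4 -- sub() only ran on these few strings
nrow(df2)           # 2000000 -- rows that pick up the change through their level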
Maybe not the fastest solution, but this will work too:
require(stringi)
x <- stri_rand_strings(10, 1:10)
ind <- sample(10, 5)
x[ind] <- stri_paste(x[ind]," A")
x
# [1] "z A" "hX" "uv0 A" "HQtD A" "kTNZh" "4SIVBh" "v28UrqS A" "uskxxNkl A"
# [9] "dKxloBsA6" "sRkCQp7sn4"
y <- stri_sub(x, -2,-1) == " A"
x[y] <- stri_sub(x[y], 1, -3)
x
# [1] "z" "hX" "uv0" "HQtD" "kTNZh" "4SIVBh" "v28UrqS" "uskxxNkl"
# [9] "dKxloBsA6" "sRkCQp7sn4"

Speed problems when using apply in a function to translate multiple strings

I wrote a little function, dictTranslator, to translate multiple strings into abbreviations or group them into categories. I use a list as a dictionary and have a data frame column with abbreviations; I want a category column with the related category name.
dictTranslator <- function(x, dict) {
  sapply(x, function(a) {
    result <- names(which(sapply(dict, function(b) {a %in% b})))
    if(identical(result, character(0))) {
      warning(sprintf('NAs are introduced, "%s" not found!', a), call. = FALSE)
      NA
    } else {
      result
    }
  })
}
my_dictionary <- list(
  embryo = c('00h','01h','02h','e02','03h','04h','05h','06h','e06',
             '08h','10h','12h','e12','14h','16h','18h','20h','e20'),
  larvae = c('L1','L2','L3e','L3l'),
  pupae  = c('p1','p2','p3','p4','p5'),
  adult  = c('vm','m','vf','f'))
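A quick usage sketch (the "xyz" entry is made up to show the warning/NA behaviour; output is indicative):
dictTranslator(c("L1", "vm", "xyz"), my_dictionary)
#       L1       vm      xyz
# "larvae"  "adult"       NA
# Warning message:
# NAs are introduced, "xyz" not found!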
sample data:
df <- data.frame(abbreviation=rep(unlist(my_dictionary), 30000))
nrow(df)
# [1] 930000
system.time(df$category <- dictTranslator(df$abbreviation, my_dictionary))
The function works as expected, but the performance is pretty slow (about a minute). Does anyone have an idea how to speed this up, or is there maybe a better solution?
The result looks like:
> head(df,40)
abbreviation category
1 00h embryo
2 01h embryo
...
19 L1 larvae
20 L2 larvae
21 L3e larvae
22 L3l larvae
23 p1 pupae
24 p2 pupae
25 p3 pupae
26 p4 pupae
27 p5 pupae
28 vm adult
29 m adult
30 vf adult
31 f adult
32 00h embryo
33 01h embryo
34 02h embryo
35 e02 embryo
36 03h embryo
I would use match on a named vector, plus removal of the numbers generated by unlisting my_dictionary (for which I used sub). This way you can avoid costly loops.
x <- unlist(my_dictionary)
df$category <- sub('\\d+$', '', names(x)[match(df$abbreviation, x)])
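To see why the trailing digits need to be stripped: unlist() builds names by appending a position to each list-element name, so the raw lookup returns values such as "embryo1" rather than "embryo". A quick sketch:
head(names(x), 3)
# [1] "embryo1" "embryo2" "embryo3"
names(x)[match(c("02h", "L1", "vm"), x)]
# [1] "embryo3" "larvae1" "adult1"
# sub('\\d+$', '', ...) removes that positional suffix, leaving only the category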
Run time is less than a second:
df <- data.frame(abbreviation=rep(unlist(my_dictionary), 30000),
stringsAsFactors = FALSE)
system.time({df$category <- sub('\\d+$', '', names(x)[match(df$abbreviation, x)])})
# User System elapsed
# 0.634 0.003 0.639
I found this approach finished fastest on my machine. Note that the input data is character which results in speed gains when compared with factor input.
## non-factor sample data
my_dictionary <- unlist(my_dictionary)
df <- data.frame(abbreviation = rep(my_dictionary, 30000),
stringsAsFactors = FALSE)
system.time({
## names
result <- sapply(df$abbreviation, function(i) {
names(which(i == my_dictionary))
})
## discard numbers
df$category <- gsub("\\d", "", result)
})
# user system elapsed
# 3.993 0.000 3.991
As regards the code you provided (i.e., using multiple sapply loops in combination with %in%), keep in mind that %in% (or match) performs rather poorly when you are searching for a single entry (see e.g. this question).
I suggest another data format for the dictionary, using the data.table library:
library(data.table)
md <- list()
for(i in 1:length(my_dictionary)) {
  md[[i]] <- data.table(abbreviation = my_dictionary[[i]], category = names(my_dictionary[i]))
}
md <- rbindlist(md)
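As an aside, the same lookup table can be built in a single call via rbindlist()'s idcol argument, which takes the category names from the names of the list (a sketch, equivalent to the loop above):
md <- rbindlist(lapply(my_dictionary, function(v) data.table(abbreviation = v)),
                idcol = "category")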
Then you simply join:
df <- data.table(df, key = 'abbreviation')
df <- df[md]
Comparison. I ran a comparison of the three approaches and here are the results:
Data preparation:
# @docendo-discimus and @fdetsch use the same data format
md1 <- unlist(my_dictionary)
df1 <- df
# @danas.zuokas uses data.table
library(data.table)
md2 <- list()
for(i in 1:length(my_dictionary)) {
  md2[[i]] <- data.table(abbreviation = my_dictionary[[i]], category = names(my_dictionary[i]))
}
md2 <- rbindlist(md2)
df2 <- data.table(df, key = 'abbreviation')
Here are three functions to compare
f_dd <- function(x, y) { x$category <- sub('\\d+$', '', names(y)[match(x$abbreviation, y)]); x }  # @docendo-discimus
f_dz <- function(x, y) { x <- x[y]; x }  # @danas.zuokas
f_fd <- function(x, y) { x$category <- gsub('\\d', '', sapply(x$abbreviation, function(i) names(which(i == y)))); x }  # @fdetsch
And here are the results
library(microbenchmark)
microbenchmark(f_dd(df1, md1), f_fd(df1, md1), f_dz(df2, md2), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
f_dd(df1, md1) 1041.9195 1142.8361 1236.2033 1224.498 1266.9600 1469.7119 10
f_fd(df1, md1) 7106.6641 7417.5538 7924.3541 7868.716 8304.7760 8961.2615 10
f_dz(df2, md2) 35.6389 41.7524 77.2347 63.478 70.1699 183.9867 10
I will use a named vector for this:
a2c <- c("larvae", "larvae", "larvae", "pupae", "pupae", ...)
names(a2c) <- c('L1', 'L2', 'L3e', 'p1', 'p2', ...)
In your case you can easily build the vector doing:
a2c <- rep(names(my_dictionary), times = sapply(my_dictionary, length))
names(a2c) <- unlist(my_dictionary)
Make sure they match:
cbind(names(a2c), a2c)
And then:
df[, "category"] <- a2c[df$abbreviation]
In your example df$abbreviation is a factor, so you may need to turn it into character:
df[, "category"] <- a2c[as.character(df$abbreviation)]

FAST way to iterate over vertices and compute new attributes based on that of neighbors

I'm doing a simple task: iterating over all vertices and computing a new attribute based on that of their neighbors. I searched SO and so far I know there are at least two ways to do it:
Use as_adj_list to create an adjacency list and then iterate over each element;
use sapply to iterate over each vertex directly.
However, both methods take too long for the magnitude of my data (300k vertices and 8 million edges). Is there any fast way to loop over vertices? Thanks!
For benchmark, say I have the following sample data:
library(igraph)
set.seed(42)
g <- sample_gnp(10000, 0.1)
V(g)$name <- seq_len(gorder(g))  # add a name attribute for data.table merge
V(g)$attr <- rnorm(gorder(g))
V(g)$mean <- 0  # "mean" is the attribute I want to compute
The code for method 1. is that:
al <- as_adj_list(g)
attr <- V(g)$attr
V(g)$mean <- sapply(al, function(x) mean(attr[x]))
# took 28s
# most of the time is spent on creating the adj list
The code for method 2. is that:
compute_mean <- function(v){
mean(neighbors(g, v)$attr)
}
V(g)$mean <- sapply(V(g), compute_mean) # took 33s
I BELIEVE that igraph-R SHOULD NOT be so slow at iterating over vertices; otherwise, analysing large graphs with millions of vertices would be impossible, which I think is a task that should be quite common for R users!
Update
According to @MichaelChirico's comment, I have now come up with a third method: import the graph structure into a data.table and do the calculation with data.table's by syntax, as follows:
library(data.table)
library(magrittr)  # for the %>% pipe
gdt.v <- as_data_frame(g, what = "vertices") %>% setDT()  # output the vertices
gdt.e <- as_data_frame(g, what = "edges") %>% setDT()     # output the edges
gdt <- gdt.e[gdt.v, on = c(to = "name"), nomatch = 0]     # merge the vertex and edge tables
mean <- gdt[, .(mean = mean(attr)), keyby = from][, mean]
V(g)$mean <- mean
# took only 0.74s !!
The data.table way is MUCH faster. However, its result is NOT exactly identical to that of the first two methods (most likely because the join on to = "name" only sees the edges in which a vertex appears in the to column, so for an undirected graph the neighbours stored on the from side are missed). Besides, I'm very disappointed to see that I have to rely on another package to do such a simple task, which I thought should be a strength of igraph-R. Hope I'm wrong!
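If the mismatch does come from edge direction, one possible fix (a sketch reusing the objects from the code above) is to list every undirected edge in both directions before the join, so that each vertex sees all of its neighbours:
# duplicate each edge as from->to and to->from, then aggregate as before
gdt.e2 <- rbindlist(list(gdt.e, gdt.e[, .(from = to, to = from)]))
gdt2 <- gdt.e2[gdt.v, on = c(to = "name"), nomatch = 0]
V(g)$mean <- gdt2[, .(mean = mean(attr)), keyby = from][, mean]
# like the original, this still assumes every vertex has at least one neighbour
# and that keyby = from returns groups in vertex order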
I'm not sure where the actual problem is... When I re-run your code:
library(microbenchmark)
library(data.table)
library(magrittr)  # for the %>% pipe used below
library(igraph)
set.seed(42)
g <- sample_gnp(10000, 0.1)
V(g)$name <- seq_len(gorder(g)) # add a name attribute for data.table merge
V(g)$attr <- rnorm(gorder(g))
V(g)$mean <- 0 # "mean" is the attribute I want to compute
gg <- g
... and compare the two methods in expressions e1 and e2
e1 <- expression({
  al <- as_adj_list(gg)
  attr <- V(gg)$attr
  V(gg)$mean <- sapply(al, function(x) mean(attr[x]))
})
e2 <- expression({
  gdt.v <- as_data_frame(g, what = "vertices") %>% setDT()  # output the vertices
  gdt.e <- as_data_frame(g, what = "edges") %>% setDT()     # output the edges
  gdt <- gdt.e[gdt.v, on = c(to = "name"), nomatch = 0]     # merge the vertex and edge tables
  mean <- gdt[, .(mean = mean(attr)), keyby = from][, mean]
  V(g)$mean <- mean
})
The timings are:
microbenchmark(e1, e2)
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## e1 47 47 51.42 48 48 338 100 a
## e2 47 47 59.98 48 48 956 100 a
So very similar, and the results
all.equal(V(g)$mean, V(gg)$mean)
## [1] TRUE
... are the same.

R fast way for extracting elements from lists of List

Hello everyone,
I am working with a large list, which contains lists. Each of the sub-lists contains n elements, and I always want to get the 3rd one, e.g.:
l = list()
l[[1]] = list(A=runif(1), B=runif(1), C=runif(1))
l[[2]] = list(A=runif(1), B=runif(1), C=runif(1))
l[[3]] = list(A=runif(1), B=runif(1), C=runif(1))
res = sapply(l, function(x) x$C)
res = sapply(l, function(x) x[[3]]) #alternative
But my list contains several thousands of elements and I am performing this operation a lot of times. So, is there a faster way to do the operation above?
Best regards,
Mario
If you do this multiple times, then it would be better to convert your list to an easier structure like a data.table.
library(data.table)
DT <- rbindlist(l)
res <- DT$C
# or if you prefer the 3rd element, not necessarily called 'C', then:
res <- DT[[3]]  # or DT[, C], which might be faster; please check @richard-scriven's comment
Alternatively if you want to keep base R you could use rbind
res = do.call(rbind.data.frame, l)$C # or [[3]]
Would this make things easier?
UPDATE
Here are some benchmarks showing different solutions to the problem:
preparations:
library(data.table)
library(microbenchmark)
# creating a list and filling it with items
nbr <- 1e5
l <- vector("list", nbr)
for (i in 1:nbr) {
  l[[i]] <- list(A = runif(1), B = runif(1), C = runif(1))
}
# creating data.frame and data.table versions
DT <- rbindlist(l)
DF <- data.frame(rbindlist(l))
benchmarking:
# doing the benchmarking
op <-
microbenchmark(
LAPPLY.1 = lapply(l, function(x) x$C),
LAPPLY.2 = lapply(l, `[`, "C"),
LAPPLY.3 = lapply(l, `[[`, "C"),
SAPPLY.1 = sapply(l, function(x) x$C),
SAPPLY.2 = sapply(l, function(x) x[[3]]),
SAPPLY.3 = sapply(l, `[[`, 3),
DT.1 = rbindlist(l)$C,
DT.2 = DT$C,
DF.2 = DF$C,
times = 100
)
results:
op
## Unit: microseconds
## expr min lq mean median uq max neval
## LAPPLY.1 124088 142390 161672 154415 163240 396761 100
## LAPPLY.2 111397 134745 156012 150062 165229 364539 100
## LAPPLY.3 66965 71608 82975 77329 84949 323041 100
## SAPPLY.1 133220 149093 166653 159222 172495 311857 100
## SAPPLY.2 105917 119533 137990 133364 139216 346759 100
## SAPPLY.3 70391 74726 81910 80520 85792 110062 100
## DT.1 46895 48943 49113 49178 49391 51377 100
## DT.2 8 18 37 47 49 58 100
## DF.2 7 13 33 40 42 82 100
(1) In general it would be best to use a table-like structure such as a data.frame or data.table in the first place; selecting columns from those costs the least time.
(2) If this is not possible, it is better to first turn the list into a data.frame or data.table and then extract the values in one single operation.
(3) Interestingly, using sapply or lapply with the (optimized) base R [[ function results in run times that are only about twice as slow as using rbind and then extracting the values as a column.
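If converting to a table is not an option, one more base R variant worth trying (a sketch; the gains over sapply are usually modest) is vapply(), which declares the result type up front:
# type-stable extraction of the third element of every sub-list
res <- vapply(l, `[[`, numeric(1), 3)    # by position
res <- vapply(l, `[[`, numeric(1), "C")  # or by name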
