Fast count of digits in a string, in R

Is there a more efficient way to count the most frequently appearing digit in a string? My R code below calls gsub() 10 times for each string; and I have gazillions of strings to process.
> txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
> max(vapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', txt)), integer(1L)))
[1] 12
I don't care about the digit itself. I just want the count of the most frequent one.
I would prefer to use R's core packages, unless an external package is significantly faster. I use x64 R version 3.4.1 (2017-06-30) on Windows 10.
UPDATE:
Here is an apples-to-apples performance comparison of the excellent suggestions below.
> microbenchmark(
+ original = max(vapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', s)), integer(1L))),
+ strsplit = max(table(unlist(strsplit(gsub("\\D+", "", s), "")))),
+ gregexpr = max(vapply(0:9, function(d) sum(unlist(gregexpr(d, s)) > 0), integer(1L))),
+ stringi = max(vapply(0:9, function(x) stri_count_fixed(s, x), integer(1L))),
+ raw=max(vapply(0x30:0x39, function(x) sum(charToRaw(s)==x), integer(1L))),
+ tabulate = max(tabulate(as.integer(charToRaw(paste('a',s))))[48:57]),
+ times=1000L)
Unit: microseconds
expr min lq mean median uq max neval
original 476.172 536.9770 567.86559 554.8600 580.0530 8054.805 1000
strsplit 366.071 422.3660 448.69815 445.3810 469.6410 798.389 1000
gregexpr 302.622 345.2325 423.08347 360.3170 378.0455 9082.416 1000
stringi 112.589 135.2940 149.82411 144.6245 155.1990 3910.770 1000
raw 58.161 71.5340 83.57614 77.1330 82.1090 6249.642 1000
tabulate 18.039 29.8575 35.20816 36.3890 40.7430 72.779 1000
Why the weird calculation?
This odd formula helps identify some plain-looking fake identifiers entered by users. For example, some uncreative users (I'm guilty as well) fill in the same digit repeatedly for their phone numbers. In data analysis, it is often better to have no phone number at all than a fake phone number that changes from one dataset to another. Naturally, if there is a check digit, validating it would be an additional easy check.

max(table(unlist(strsplit(gsub("\\D+", "", txt), ""))))
#OR
max(sapply(0:9, function(d) sum(unlist(gregexpr(d, txt)) > 0)))
#[1] 12
Or if you do care about the digit
with(rle(sort(unlist(strsplit(gsub("\\D+", "", txt), "")))),
setNames(c(max(lengths)), values[which.max(lengths)]))
# 1
#12
library(microbenchmark)
set.seed(42)
t = paste(sample(c(letters, 0:9), 1e5, TRUE), collapse = "")
microbenchmark(original = max(sapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', t)))),
strsplit = max(table(unlist(strsplit(gsub("\\D+", "", t), "")))),
gregexpr = max(sapply(0:9, function(d) sum(unlist(gregexpr(d, t)) > 0))))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# original 215.371764 220.862807 233.368696 228.757529 239.809292 308.94393 100 c
# strsplit 11.224226 11.856327 12.956749 12.320586 12.893789 30.61072 100 b
# gregexpr 7.542871 7.958818 8.680391 8.302971 8.728735 13.79921 100 a

Using charToRaw to count digits in string:
# To count only digits, keep the ASCII codes for the characters 0 to 9, which are 48 to 57 (see https://ascii.cl/)
# Add na.rm = TRUE in case a string contains no character with a code of 57 or higher (e.g. only digits below 9): tabulate() then returns fewer than 57 bins, and [48:57] yields NAs
txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max(tabulate(as.integer(charToRaw(txt)))[48:57], na.rm = TRUE)
#[1] 12
txt='22222222222'
max(tabulate(as.integer(charToRaw(txt)))[48:57], na.rm = TRUE)
#[1] 11
# Andrew's benchmark above already shows that charToRaw is the fastest of these approaches for counting digits in a string.
If you do not need to restrict the count to digits and just want the count of the most frequent character of any kind, remove the ASCII code filter [48:57].
txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max(tabulate(as.integer(charToRaw(txt))))
#[1] 32
txt='22222222222'
max(tabulate(as.integer(charToRaw(txt))))
#[1] 11
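A possible way to avoid the NA values altogether, rather than dropping them with na.rm = TRUE, is to pass nbins to tabulate() so that the result always has exactly 57 bins. The sketch below assumes single-byte (ASCII) digits; the function names digit_counts, count_max_digit and most_frequent_digit are made up for illustration.

```r
# Hypothetical helper names, not taken from the answers above.
# nbins = 57L fixes the result length at 57, so [48:57] never
# indexes past the end; byte codes above 57 are simply ignored.
digit_counts <- function(s) {
  counts <- tabulate(as.integer(charToRaw(s)), nbins = 57L)[48:57]
  names(counts) <- 0:9
  counts
}

count_max_digit <- function(s) max(digit_counts(s))

# Also report which digit is the most frequent (first one on ties).
most_frequent_digit <- function(s) {
  counts <- digit_counts(s)
  counts[which.max(counts)]
}

txt <- 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
count_max_digit(txt)            # 12, as in the question
count_max_digit('22222222222')  # 11, without needing na.rm = TRUE
most_frequent_digit(txt)        # named value: digit "1" with count 12

# Applied to a whole character vector:
vapply(c(txt, '22222222222', 'no digits'), count_max_digit,
       integer(1L), USE.NAMES = FALSE)
```

A string without any digit gives 0 here, which may or may not be the value you want for such inputs.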

Building on Santosh's approach, this is significantly faster than the other options...
max(tabulate(as.integer(charToRaw(txt)))[48:57]) #48:57 picks out ASCII digits
library(microbenchmark)
set.seed(42)
t = paste(sample(c(letters, 0:9), 1e5, TRUE), collapse = "")
microbenchmark(original = max(sapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', t)))),
strsplit = max(table(unlist(strsplit(gsub("\\D+", "", t), "")))),
gregexpr = max(sapply(0:9, function(d) sum(unlist(gregexpr(d, t)) > 0))),
tabulate = max(tabulate(as.integer(charToRaw(t)))[48:57]))
Unit: milliseconds
expr min lq mean median uq max neval
original 807.947235 860.112901 1169.744733 935.169003 1154.057709 3513.1401 100
strsplit 34.100444 36.453163 55.457896 42.881400 58.208820 390.1453 100
gregexpr 27.205510 29.333569 42.616817 33.146572 49.840566 246.9001 100
tabulate 1.189702 1.208321 2.150022 1.226319 1.297068 37.4300 100

Related

How to speed up the grepl function?

I am trying to apply the following approach to a large number of words and phrases:
# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
"continuous improvement is an unrealistic goal",
"phrase with no match")
# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns the words that were searched
names(df) <- words
It takes too much time on large word lists and input text.
Is there any way to change it to make the process faster?
One possibility is to use grepl() with fixed = TRUE:
lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE)))
Alternatively, you can use stri_detect_fixed() from stringi:
lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word)))
A small simulation:
phrases <- rep(phrases, 100000)
library(microbenchmark)
microbenchmark(grepl = lapply(words, function(word) as.numeric(grepl(word, phrases))),
grepl_fixed = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
stri_detect_fixed = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
times = 50)
Unit: milliseconds
expr min lq mean median uq max neval
grepl 857.5839 918.3976 1007.4775 957.3126 986.9762 1631.5336 50
grepl_fixed 116.8073 130.1615 146.6852 139.1170 152.0428 278.1512 50
stri_detect_fixed 105.2338 116.9041 128.8941 126.7353 135.7818 199.4968 50
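One caveat with fixed = TRUE: the pattern is then matched as a literal string, so it is only a drop-in replacement when the search terms do not rely on regular-expression syntax. A small base-R sketch of the difference (the "a.c" strings are made up), plus a variant that returns a named 0/1 matrix via vapply() instead of a list:

```r
# With the default (regex) matching, "." matches any character.
grepl("a.c", c("abc", "a.c"))                # TRUE TRUE
# With fixed = TRUE, "." is a literal dot.
grepl("a.c", c("abc", "a.c"), fixed = TRUE)  # FALSE TRUE

words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

# One logical column per word; vapply() names the columns after `words`.
hits <- vapply(words, function(word) grepl(word, phrases, fixed = TRUE),
               logical(length(phrases)))
hits <- +hits   # unary plus turns the logical matrix into 0/1 integers
hits
```

The matrix can be passed to data.frame() if a data frame like the one in the question is needed.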
As proposed by @akrun, some performance improvement could be achieved by replacing as.numeric() with a unary +:
microbenchmark(grepl_plus = lapply(words, function(word) +grepl(word, phrases)),
grepl_fixed_plus = lapply(words, function(word) +grepl(word, phrases, fixed = TRUE)),
stri_detect_fixed_plus = lapply(words, function(word) +stri_detect_fixed(phrases, word)),
grepl_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases))),
grepl_fixed_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
stri_detect_fixed_as_numeric = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
times = 50)
Unit: milliseconds
expr min lq mean median uq max neval
grepl_plus 839.2060 889.8748 1008.0753 926.4712 1022.6071 2063.8296 50
grepl_fixed_plus 117.0043 126.4407 141.5917 136.5732 146.2262 318.7412 50
stri_detect_fixed_plus 104.4772 110.3147 126.3931 115.9223 124.4952 423.4654 50
grepl_as_numeric 851.4198 893.6703 957.4348 935.0965 1010.3131 1375.0810 50
grepl_fixed_as_numeric 121.8952 128.6741 142.4962 136.3370 145.5004 235.6042 50
stri_detect_fixed_as_numeric 106.0639 114.6759 128.0724 121.9647 135.4791 191.1315 50
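For context, the two conversions do not return the same type: unary + turns a logical vector into integers, while as.numeric() produces doubles. A quick base-R check (the flags vector is made up):

```r
flags <- c(TRUE, FALSE, TRUE)

typeof(+flags)              # "integer"
typeof(as.numeric(flags))   # "double"

# Same values either way, so the choice rarely matters for 0/1 indicators.
all(+flags == as.numeric(flags))  # TRUE
```

In the timings above the two variants differ only marginally, so the choice of matching function matters far more than the conversion.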

Faster alternative than apply for using function utf8ToInt in a matrix

I have a string matrix (my_data) of dimensions 9000000x10 with each value being a single character string. I want to transform it to a numeric matrix using the function utf8ToInt, but it takes a long time and crashes my session.
new_matrix <- apply(my_data, 1:2, "utf8ToInt")
The result is what I expect, but I need a more efficient way of doing that.
Any help is deeply appreciated.
Imagine my data is:
my_data <- matrix(c("a","b","c","d"), ncol = 2)
but it is actually 9000000x10 instead of 2x2.
stringi::stri_enc_toutf32 may be an alternative.
From ?stri_enc_toutf32:
This function is roughly equivalent to a vectorized call to utf8ToInt(enc2utf8(str))
On a 1e3 * 2 matrix, stri_enc_toutf32 is about 10 and 20 times faster than vapply / apply + utf8ToInt respectively:
library(stringi)
library(microbenchmark)
nr = 1e3
nc = 2
m = matrix(sample(letters, nr*nc, replace = TRUE), nrow = nr, ncol = nc)
microbenchmark(
f_apply = apply(m, 1:2, utf8ToInt),
f_vapply = structure(vapply(m, utf8ToInt, numeric(1)), dim=dim(m)),
f = matrix(unlist(stri_enc_toutf32(m), use.names = FALSE), nrow = nrow(m)),
times = 10L, check = "equal")
# Unit: microseconds
# expr min lq mean median uq max neval
# f_apply 2283.4 2297.2 2351.17 2325.40 2354.5 2583.6 10
# f_vapply 1276.1 1298.0 1348.88 1322.00 1353.4 1611.3 10
# f 87.6 92.3 108.53 105.15 111.0 163.8 10
Using vapply instead of apply would be almost twice as fast. Since vapply returns a vector, the matrix shape has to be restored afterwards (here with structure).
library(microbenchmark)
my_data <- matrix(sample(letters, 2*100, replace = TRUE), ncol = 2)
microbenchmark(
apply = apply(my_data, 1:2, utf8ToInt),
vapply = structure(vapply(my_data, utf8ToInt, numeric(1)), dim=dim(my_data)),
times = 500L, check = 'equal'
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> apply 199.201 208.001 224.811 213.801 220.1515 1560.400 500
#> vapply 111.000 115.501 136.343 120.401 124.9505 1525.901 500
Created on 2021-03-06 by the reprex package (v1.0.0)
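Another base-R possibility, not benchmarked here: because utf8ToInt() returns every code point of a single string, the whole matrix can be collapsed into one string and converted in a single call. This is only a sketch, and it assumes that each cell holds exactly one character (no NA, no empty strings); to_codes is a hypothetical name.

```r
# Hypothetical helper: convert a matrix of single-character strings
# to a matrix of code points with one utf8ToInt() call.
to_codes <- function(m) {
  # paste() walks the matrix in column-major order, the same order
  # that matrix() uses to fill, so the layout is preserved.
  codes <- utf8ToInt(paste(m, collapse = ""))
  stopifnot(length(codes) == length(m))  # guards the one-character assumption
  matrix(codes, nrow = nrow(m), ncol = ncol(m))
}

my_data <- matrix(c("a", "b", "c", "d"), ncol = 2)
to_codes(my_data)
#      [,1] [,2]
# [1,]   97   99
# [2,]   98  100
```

The length check fails loudly if some cell holds more or fewer than one character, which is the case this shortcut cannot handle.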

Delete characters before regular expression (R)

I have a character vector of stock tickers where the ticker name is concatenated to the country in which that ticker is based in the following form: country_name/ticker_name. I am trying to split each string and delete everything from the '/' back, returning a character vector of only the ticker names. Here is an example vector:
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
My initial thought would be to use the stringr library. I don't have really any experience with that package, but here is what I was trying:
library(stringr)
split_string <- str_split(sample_string, '/')
But I was unsure how to return only the second element of each list as a single vector.
How would I do this over a large character vector (~105 million entries)?
Here is a benchmark covering all the methods suggested by @David Arenburg, plus another method using str_extract from the stringr package.
sample_string <- rep(sample_string, 1000000)
library(data.table); library(stringr)
s1 <- function() sub(".*/(.*)", "\\1", sample_string)
s2 <- function() sub(".*/", "", sample_string)
s3 <- function() str_extract(sample_string, "(?<=/)(.*)")
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
length(sample_string)
# [1] 7000000
identical(s1(), s2())
# [1] TRUE
identical(s1(), s3())
# [1] TRUE
identical(s1(), s4())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# s1() 3.916555 3.917370 4.046708 3.923246 3.925184 4.551184 5
# s2() 3.584694 3.593755 3.726922 3.610284 3.646449 4.199426 5
# s3() 3.051398 3.062237 3.354410 3.138080 3.722347 3.797985 5
# s4() 1.908283 1.964223 2.349522 2.117521 2.760612 2.996971 5
The tstrsplit method is the fastest.
Update:
Adding another method from @Frank. This comparison is not strictly fair, because the result depends on the actual data: when there are many duplicated values, as in the sample_string produced above, the advantage is quite obvious:
s5 <- function() setDT(list(sample_string))[, v := tstrsplit(V1, "/", fixed = TRUE)[[2]], by=V1]$v
identical(s1(), s5())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), s5(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# s1() 3905.97703 3913.264 3922.8540 3913.4035 3932.2680 3949.3575 5
# s2() 3568.63504 3576.755 3713.7230 3660.5570 3740.8252 4021.8426 5
# s3() 3029.66877 3032.898 3061.0584 3052.6937 3086.9714 3103.0604 5
# s4() 1322.42430 1679.475 1985.5440 1801.9054 1857.8056 3266.1101 5
# s5() 82.71379 101.899 177.8306 121.6682 209.0579 373.8141 5
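The idea behind s5 (do the string work once per distinct value, then map the results back) can also be sketched in base R with unique() and match(). As with s5, this only pays off when the data contain many duplicates; strip_prefix_unique is a hypothetical name and no timings are claimed for it.

```r
# Hypothetical helper: strip everything up to the last "/", but only
# run the regex on the distinct values.
strip_prefix_unique <- function(x) {
  u <- unique(x)
  stripped <- sub(".*/", "", u)
  stripped[match(x, u)]   # expand back to the original length and order
}

sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
                   'US/AOL', 'US/BRCM')
x <- rep(sample_string, 3)
identical(strip_prefix_unique(x), sub(".*/", "", x))  # TRUE
```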
Some helpful notes about your question. Firstly, there is a str_split_fixed function in the stringr package which does what you want it to do by calling lapply.
library(data.table); library(stringr)
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
sample_string <- rep(sample_string, 1e5)
split_string <- str_split_fixed(sample_string, '/', 2)[,2]
It works by calling stringi::stri_split_fixed and is not dissimilar to
do.call("c", lapply(str_split(sample_string, '/'),"[[",2))
Secondly, another way to think about extracting each second element of the list is by doing exactly what tstrsplit is doing internally.
transpose(strsplit(sample_string, "/", fixed = T))[[2]]
As a side note, the above should be marginally faster than calling tstrsplit. This is probably not worth the extra typing, but it helps to know what the function does.
library(data.table); library(stringr)
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
s5 <- function() transpose(strsplit(sample_string, "/", fixed = T))[[2]]
identical(s4(), s5())
microbenchmark::microbenchmark(s4(), s5(), times = 20)
Unit: milliseconds
expr min lq mean median uq max neval
s4() 161.0744 193.3611 255.8136 234.9945 271.6811 434.7992 20
s5() 140.8569 176.5600 233.3570 194.1676 251.7921 420.3431 20
Regarding this second method, in short, transposing this list of length 7 million, each with 2 elements will convert your result to a list of length 2, each with 7 million elements. You are then extracting the second element of this list.
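To make the transposition concrete, here is a small base-R imitation of what is described above. It is not data.table's implementation, just an illustration on three made-up tickers, assuming every string splits into exactly two pieces:

```r
tickers <- c("US/SPY", "US/AOL", "US/BRCM")
parts <- strsplit(tickers, "/", fixed = TRUE)
length(parts)        # 3: one element per input string, each of length 2

# "Transpose": collect the i-th piece of every element into one vector.
transposed <- lapply(1:2, function(i) vapply(parts, `[`, character(1), i))
length(transposed)   # 2: one element per piece, each of length 3
transposed[[2]]      # "SPY" "AOL" "BRCM"
```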

Extract part of string before the first semicolon

I have a column containing values of 3 strings separated by semicolons. I need to just extract the part of the string which comes before the first semicolon.
Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
What I want is: Get the first part of the string (till the first semicolon).
Desired output : SNSR_RMIN_PSX150Y_CSH
I tried gsub without success.
You could try sub
sub(';.*$','', Type)
#[1] "SNSR_RMIN_PSX150Y_CSH"
It matches the pattern, i.e. from the first occurrence of ; to the end of the string, and replaces it with ''.
Or use
library(stringi)
stri_extract(Type, regex='[^;]*')
#[1] "SNSR_RMIN_PSX150Y_CSH"
The stringi package works very fast here:
stri_extract_first_regex(Type, "^[^;]+")
## [1] "SNSR_RMIN_PSX150Y_CSH"
I benchmarked the 3 main approaches here:
Unit: milliseconds
expr min lq mean median uq max neval
SAPPLY() 254.88442 267.79469 294.12715 277.4518 325.91576 419.6435 100
SUB() 182.64996 186.26583 192.99277 188.6128 197.17154 237.9886 100
STRINGI() 89.45826 91.05954 94.11195 91.9424 94.58421 124.4689 100
Here's the code for the Benchmarks:
library(stringi)
SAPPLY <- function() sapply(strsplit(Type, ";"), "[[", 1)
SUB <- function() sub(';.*$','', Type)
STRINGI <- function() stri_extract_first_regex(Type, "^[^;]+")
Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
Type <- rep(Type, 100000)
library(microbenchmark)
microbenchmark(
SAPPLY(),
SUB(),
STRINGI(),
times=100L)
You can also use strsplit:
strsplit(Type, ";")[[1]][1]
#[1] "SNSR_RMIN_PSX150Y_CSH"
When performance is important you can use substr in combination with regexpr from base.
substr(Type, 1, regexpr(";", Type, fixed=TRUE)-1)
#[1] "SNSR_RMIN_PSX150Y_CSH"
Timings (reusing parts of the benchmark from @tyler-rinker):
library(stringi)
SAPPLY <- function() sapply(strsplit(Type, ";"), "[[", 1)
SUB <- function() sub(';.*$','', Type)
SUB2 <- function() sub(';.*','', Type)
SUB3 <- function() sub('([^;]*).*','\\1', Type)
STRINGI <- function() stri_extract_first_regex(Type, "^[^;]+")
STRINGI2 <- function() stri_extract_first_regex(Type, "[^;]*")
SUBSTRREG <- function() substr(Type, 1, regexpr(";", Type)-1)
SUBSTRREG2 <- function() substr(Type, 1, regexpr(";", Type, fixed=TRUE)-1)
SUBSTRREG3 <- function() substr(Type, 1, regexpr(";", Type, fixed=TRUE, useBytes = TRUE)-1)
Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
Type <- rep(Type, 100000)
library(microbenchmark)
microbenchmark(SAPPLY(), SUB(), SUB2(), SUB3(), STRINGI()
, STRINGI2(), SUBSTRREG(), SUBSTRREG2(), SUBSTRREG3())
#Unit: milliseconds
# expr min lq mean median uq max neval
# SAPPLY() 382.23750 395.92841 412.82508 410.05236 427.58816 460.28508 100
# SUB() 111.92120 114.28939 116.41950 115.57371 118.15573 123.92400 100
# SUB2() 94.27831 96.50462 98.14741 97.38199 99.15260 119.51090 100
# SUB3() 167.77139 172.51271 175.07144 173.83121 176.27710 190.97815 100
# STRINGI() 38.27645 39.33428 39.94134 39.71842 40.50182 42.55838 100
# STRINGI2() 38.16736 39.19250 40.14904 39.63929 40.37686 56.03174 100
# SUBSTRREG() 45.04828 46.39867 47.13018 46.85465 47.71985 51.07955 100
# SUBSTRREG2() 10.67439 11.02963 11.29290 11.12222 11.43964 13.64643 100
# SUBSTRREG3() 10.74220 10.95139 11.39466 11.06632 11.46908 27.72654 100
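One thing to watch with the substr() + regexpr() approach: regexpr() returns -1 when there is no match, so a string without a semicolon comes back as an empty string instead of unchanged (sub() leaves such strings as they are). A possible guard, as a sketch with a hypothetical helper name; the extra bookkeeping costs some time and is not benchmarked here:

```r
# Hypothetical helper: text before the first `sep`, or the whole
# string when `sep` does not occur.
before_first <- function(x, sep = ";") {
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))  # -1 means no match
  hit <- !is.na(pos) & pos > 0L
  out <- x
  out[hit] <- substr(x[hit], 1L, pos[hit] - 1L)
  out
}

Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000", "no_semicolon")

# Unguarded: the second element becomes "".
substr(Type, 1, regexpr(";", Type, fixed = TRUE) - 1)

# Guarded: the second element is returned unchanged.
before_first(Type)
```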

Count characters in a string (excluding spaces) in R?

I want to count the number of characters in a string (excluding spaces) and I'd like to know if my approach can be improved.
Suppose I have:
x <- "hello to you"
I know nchar() will give me the number of characters in a string (including spaces):
> nchar(x)
[1] 12
But I'd like to return the following (excluding spaces):
[1] 10
To this end, I've done the following:
> nchar(gsub(" ", "",x))
[1] 10
My worry is the gsub() will take a long time over many strings. Is this the correct way to approach this, or is there a type of nchar'esque function that will return the number of characters without counting spaces?
Thanks in advance.
Building on Richard's comment, "stringi" would be a great consideration here:
The approach could be to calculate the overall string length and subtract the number of spaces.
Compare the following.
library(stringi)
library(microbenchmark)
x <- "hello to you"
x
# [1] "hello to you"
fun1 <- function(x) stri_length(x) - stri_count_fixed(x, " ")
fun2 <- function(x) nchar(gsub(" ", "",x))
y <- paste(as.vector(replicate(1000000, x, TRUE)), collapse = " ")
microbenchmark(fun1(x), fun2(x))
# Unit: microseconds
# expr min lq mean median uq max neval
# fun1(x) 5.560 5.988 8.65163 7.270 8.1255 44.047 100
# fun2(x) 9.408 9.837 12.84670 10.691 12.4020 57.732 100
microbenchmark(fun1(y), fun2(y), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1(y) 68.22904 68.50273 69.6419 68.63914 70.47284 75.17682 10
# fun2(y) 2009.14710 2011.05178 2042.8123 2030.10502 2079.87224 2090.09142 10
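The same length-minus-spaces idea can be written in base R by counting the literal matches of " " with gregexpr(). This is only a sketch (nchar_no_space is a hypothetical name) and makes no claim about speed relative to stringi:

```r
# Hypothetical helper: characters in each string, not counting spaces.
nchar_no_space <- function(x) {
  matches <- regmatches(x, gregexpr(" ", x, fixed = TRUE))
  nchar(x) - lengths(matches)   # lengths() is 0 where nothing matched
}

nchar_no_space("hello to you")                 # 10
nchar_no_space(c("hello to you", "nospace"))   # 10 7
```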
Indeed, stringi seems most appropriate here. Try this:
library(stringi)
x <- "hello to you"
stri_stats_latex(x)
Result:
CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
10 0 2 3 0 0
If you need it in a variable, you can access the individual values with regular [i] indexing, e.g.:
stri_stats_latex(x)[1]

Resources