Count sequences of consecutive blanks in a string with R

I want to retrieve the lengths of the runs of consecutive blanks in a string, like this:
mystring <- "lalalal  lalalal lalala   lalalala "
retrieve_sequence_of_consecutive_blanks(mystring)
[1] 2 1 3 1
Actually, I got a solution with this:
sequence_of_blanks <- function(vector_of_strings) {
    tokens <- strsplit(x = vector_of_strings, split = "", fixed = TRUE)
    runs <- lapply(X = tokens, FUN = rle)
    lapply(runs, function(item) item$lengths[item$values == " "])
}
My question is about performance: is there a better way to do it? What about a regex solution? What about a Python solution?

You may match all space chunks and get their lengths, e.g.
library(stringr)
nchar(unlist(str_extract_all(mystring, " +")))
Or the base R equivalent:
nchar(unlist(regmatches(mystring, gregexpr(" +", mystring))))
Both yield
[1] 2 1 3 1
In Python, you may use
import re
[x.count(" ") for x in re.findall(" +", mystring)]
If you plan to count any whitespace, replace the literal space with \s. Tweak as per your further requirements.
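A small Python sketch of that whitespace variant, an illustration rather than part of the original answer (the sample string here is hypothetical, chosen to mirror the R example):

```python
import re

# Hypothetical sample mirroring the R example: runs of 2, 1, 3 and 1 blanks
mystring = "lalalal  lalalal lalala   lalalala "

# Length of every run of whitespace; \s also matches tabs and newlines
blank_runs = [len(m) for m in re.findall(r"\s+", mystring)]
print(blank_runs)  # [2, 1, 3, 1]
```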

You could use
myrle <- rle(charToRaw(mystring) == charToRaw(" "))
myrle$lengths[myrle$values]
which is a bit faster:
microbenchmark::microbenchmark(
    OP = sequence_of_blanks(mystring),
    akrun = tabulate(cumsum(c(TRUE, diff(str_locate_all(mystring, " ")[[1]][, 2]) != 1))),
    wiktor = nchar(unlist(str_extract_all(mystring, " +"))),
    # charToRaw(mystring) == charToRaw(" "),
    fprive = { myrle <- rle(charToRaw(mystring) == charToRaw(" ")); myrle$lengths[myrle$values] }
)
Unit: microseconds
expr min lq mean median uq max neval
OP 32.826 37.680 42.97734 42.3940 46.3405 115.239 100
akrun 40.718 44.874 48.40903 48.4360 50.7050 78.991 100
wiktor 24.166 29.753 34.73199 35.0955 36.7370 129.626 100
fprive 23.258 25.877 29.50010 28.6000 31.6730 43.721 100
If you really need performance, writing a small Rcpp function for your particular use, taking charToRaw(mystring) and charToRaw(" ") as arguments, would improve performance further.
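Short of Rcpp, the rle() idea also translates directly to Python, where itertools.groupby plays the role of rle(); a sketch for comparison, not part of the answer (the sample string is hypothetical):

```python
from itertools import groupby

# Hypothetical sample mirroring the R example
mystring = "lalalal  lalalal lalala   lalalala "

# groupby() collapses consecutive identical characters, like rle() in R;
# keep the lengths of the groups that consist of spaces
blank_runs = [len(list(group)) for char, group in groupby(mystring) if char == " "]
print(blank_runs)  # [2, 1, 3, 1]
```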

If you want a bit more performance using only base R:
length_seq_blanks <- function(string) {
    x <- nchar(unlist(strsplit(string, "[a-z]+")))
    x[x > 0]
}
length_seq_blanks(mystring)
[1] 2 1 3 1
Benchmark
microbenchmark::microbenchmark(
    snoram = {
        length_seq_blanks <- function(string) {
            x <- nchar(unlist(strsplit(string, "[a-z]+")))
            x[x > 0]
        }
        length_seq_blanks(mystring)
    },
    fprive = {
        myrle <- rle(charToRaw(mystring) == charToRaw(" "))
        myrle$lengths[myrle$values]
    },
    unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval
snoram 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
fprive 1.866597 1.818247 1.734015 1.684211 1.634093 1.20812 100

Related

Count number of palindromes within a string

I have written the code below to count the number of palindromic substrings in a given string:
countPalindromes <- function(str) {
    len <- nchar(str)
    count <- 0
    for (i in 1:len) {
        for (j in i:len) {
            subs <- substr(str, i, j)
            rev <- paste(rev(substring(subs, 1:nchar(subs), 1:nchar(subs))), collapse = "")
            if (subs == rev) {
                count <- count + 1
            }
        }
    }
    count
}
This is actually working fine, but the code needs to be optimized so that it executes faster.
Please suggest some ways to optimize this piece of code.
Here's a solution that uses the wonderful stringi package - just as Andre suggested - together with a wee bit of vectorization.
cp <- function(s) {
    lenstr <- stringi::stri_length(s) # Get the length
    res <- sapply(1:lenstr, function(i) {
        # Get all substrings starting at position i
        sub_string <- stringi::stri_sub(s, i, i:lenstr)
        # Count the palindromic ones
        sum(sub_string == stringi::stri_reverse(sub_string))
    })
    sum(res)
}
This should give the same result as your function
> cp("enafdemderredmedfane")
[1] 30
> countPalindromes("enafdemderredmedfane")
[1] 30
There is not much speedup for short strings, but for longer strings you can really see a benefit:
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("howdoyoudo") 480.979 489.6180 508.9044 494.9005 511.201 662.605 100 b
cp("howdoyoudo") 156.117 163.1555 175.4785 169.5640 179.993 324.145 100 a
Compared to
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval cld
countPalindromes("enafdemderredmedfane") 2031.565 2115.0305 2475.5974 2222.354 2384.151 6696.484 100 b
cp("enafdemderredmedfane") 324.991 357.6055 430.8334 387.242 478.183 1298.390 100 a
Working with a vector makes the process faster. I am thinking about eliminating the double for loop, but I cannot find an efficient way.
countPalindromes_new <- function(str) {
    len <- nchar(str)
    strsp <- strsplit(str, "")[[1]]
    count <- 0
    for (i in 1:len) {
        for (j in i:len) {
            if (all(strsp[i:j] == strsp[j:i])) {
                count <- count + 1
            }
        }
    }
    count
}
> microbenchmark::microbenchmark(countPalindromes("howdoyoudo"), cp("howdoyoudo"), countPalindromes_new("howdoyoudo"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("howdoyoudo") 869.121 933.1215 1069.68001 963.201 1022.081 6712.751 100
cp("howdoyoudo") 192.000 202.8805 243.11972 219.308 258.987 477.441 100
countPalindromes_new("howdoyoudo") 49.068 53.3340 62.32815 57.387 63.574 116.481 100
> microbenchmark::microbenchmark(countPalindromes("enafdemderredmedfane"), cp("enafdemderredmedfane"), countPalindromes_new("enafdemderredmedfane"))
Unit: microseconds
expr min lq mean median uq max neval
countPalindromes("enafdemderredmedfane") 3578.029 3800.9620 4170.0888 3987.416 4173.6550 10205.445 100
cp("enafdemderredmedfane") 391.254 438.4010 609.8782 481.708 534.6135 6116.270 100
countPalindromes_new("enafdemderredmedfane") 200.534 214.1875 235.3501 223.148 245.5475 448.854 100
UPDATE (new version without the length-1 comparisons):
countPalindromes_new2 <- function(str) {
    len <- nchar(str)
    strsp <- strsplit(str, "")[[1]]
    count <- len
    for (i in 1:(len - 1)) {
        for (j in (i + 1):len) {
            if (all(strsp[i:j] == strsp[j:i])) {
                count <- count + 1
            }
        }
    }
    count
}
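For reference, the brute-force count is compact to express in Python as well; this sketch mirrors the algorithm above and is an illustration, not one of the benchmarked solutions:

```python
def count_palindromes(s):
    # Brute force: test every substring s[i:j] against its reverse
    n = len(s)
    return sum(1
               for i in range(n)
               for j in range(i + 1, n + 1)
               if s[i:j] == s[i:j][::-1])

result = count_palindromes("enafdemderredmedfane")
print(result)  # 30
```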
Simply: normally I'm against using new libraries everywhere. But stringi is THE library for working with strings in R.
string_vec <- c("anna","nothing","abccba")
string_rev <- stringi::stri_reverse(string_vec)
sum(string_vec == string_rev)
# evaluates to 2

Optimizing calculation of combinations into list - large data set

I wonder if someone can figure out a faster way to calculate combinations of elements in a vector. My approach works, but it is slow with about 6 million elements in the vector.
Test vector
test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")
My approach
lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
Expected output
[[1]]
[1] "335261344015" "335261537633" "344015537633"
[[2]]
[1] "22404132858"
[[3]]
[1] "254654355860" "254654488288" "355860488288"
[[4]]
[1] "219943373817"
[[5]]
[1] "331839404477"
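For reference, the task itself can be sketched in Python with itertools.combinations; this is an illustration of the expected output, not one of the R solutions:

```python
from itertools import combinations

test_vector = ["335261 344015 537633", "22404 132858"]

# For each string: split on blanks, paste every unordered pair together,
# keeping only unique results (dict.fromkeys preserves order)
pairs = [list(dict.fromkeys("".join(p) for p in combinations(s.split(), 2)))
         for s in test_vector]
print(pairs)
# [['335261344015', '335261537633', '344015537633'], ['22404132858']]
```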
Here is an answer that is over 25x faster than the OP's solution on large test cases. It doesn't rely on paste, but rather we take advantage of properties of numbers and vectorized operations. We also use comboGeneral from the RcppAlgos package (I am the author) which is much faster than combn and combnPrim from the linked answer for generating combinations of a vector. First we show the efficiency gains of comboGeneral over the other functions:
## library(gRbase)
library(RcppAlgos)
library(microbenchmark)
microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2),
comboGeneral(300, 2), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
gRbase::combnPrim(300, 2) 5.145654 5.192439 4.83561 7.167839 4.320497 3.98992 100
combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325 100
comboGeneral(300, 2) 1.000000 1.000000 1.00000 1.000000 1.000000 1.00000 100
Now, we create a function to create some random reproducible data that will be passed to our test functions:
makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, s1 <- sample(2:elementSize, 1), replace = withRep),
              collapse = " ")
    })
}
makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111"
[4] "255429 462293 940013" "117488 474997 560332"
That looks good. Now, let's see if setting fixed = TRUE gets us any gains (as suggested above by @MichaelChirico):
bigVec <- makeTestSet(10, 100000)
microbenchmark(standard = strsplit(bigVec, " "),
withFixed = strsplit(bigVec, " ", fixed = TRUE),
times = 15, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639 15
withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 15
@MichaelChirico was spot on. Putting it all together we get:
combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        unique(combs[, 1] * 10^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
    })
}
## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633
[[2]]
[1] 22404132858
[[3]]
[1] 254654355860 254654488288 355860488288
[[4]]
[1] 219943373817
[[5]]
[1] 331839404477
## OP original code
combPairOP <- function(testVec) {
lapply(strsplit(testVec, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
}
As stated in the comments by the OP, the maximum number is less than a million (600000 to be exact), which means that after we multiply one of the numbers by at most 10^6 and add it to another 6-digit number (equivalent to simply concatenating the two strings of digits), we are guaranteed to stay within the numerical precision of base R doubles (i.e. 2^53 - 1). This is good because arithmetic operations on numbers are much more efficient than string operations.
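The digit-shifting trick can be sanity-checked on its own; here is a small Python sketch, an illustration rather than part of the answer, confirming that the arithmetic reproduces string concatenation:

```python
from math import log10

def num_concat(a, b):
    # Shift a left by the number of decimal digits in b, then add b
    return a * 10 ** (int(log10(b)) + 1) + b

check = num_concat(335261, 344015)
print(check)                               # 335261344015
print(check == int("335261" + "344015"))   # True
```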
All that is left is to benchmark:
test.vector <- makeTestSet(100, 50)
microbenchmark(combPairOP(test.vector),
combPairFast(test.vector),
times = 20, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342 20
combPairFast(test.vector) 1.00000 1.0000 1.00000 1.00000 1.000 1.00000 20
And on larger vectors:
bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)
## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
any(duplicated(x))
}))
[1] TRUE
system.time(t1 <- combPairFast(bigTest.vector))
user system elapsed
0.303 0.011 0.314
system.time(t2 <- combPairOP(bigTest.vector))
user system elapsed
8.820 0.081 8.902 ### 8.902 / 0.314 ~= 28x faster
## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE

Remove duplicate pair in string

In a character vector I would like to remove both copies of every duplicated value, so that I keep only the unique strings. That is:
I have a vector
MyString <- c("aaa", "bbb", "ccc", "ddd", "aaa", "ddd")
I would like to remove both copies of each duplicate, and thus select:
[1] "bbb" "ccc"
With no luck I tried:
unique(MyString)
x <- table(MyString)
names(x[x==1])
[1] "bbb" "ccc"
also:
MyString[ !duplicated(MyString) & !duplicated(MyString,fromLast = T) ]
[1] "bbb" "ccc"
Find the set of duplicates
dups = MyString[ duplicated(MyString) ]
and drop all occurrences in the set
MyString[ !MyString %in% dups ]
Alternative:
setdiff(MyString, dups)
The table-based solution from @Moody_Mudskipper provides more flexibility, e.g., to choose strings that occur exactly twice. An alternative (analogous to, but probably faster than, the table() solutions when MyString is long): create an index into the unique strings, find the number of times each unique string is matched (tabulate() == 1), and use these to subset the unique strings:
UString = unique(MyString)
UString[ tabulate(match(MyString, UString)) == 1 ]
or, to avoid creating UString:
MyString[ which(tabulate(match(MyString, MyString)) == 1) ]
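The same count-once-then-filter idea reads naturally in Python with collections.Counter; a sketch for comparison, not one of the benchmarked functions:

```python
from collections import Counter

MyString = ["aaa", "bbb", "ccc", "ddd", "aaa", "ddd"]

# Count every value once, then keep the values seen exactly once,
# preserving input order (like the match()/tabulate() approach)
counts = Counter(MyString)
singletons = [s for s in MyString if counts[s] == 1]
print(singletons)  # ['bbb', 'ccc']
```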
Alternative: sort and then find runs of length 1.
r = rle(sort(MyString))
r$values[ r$lengths == 1 ]
For performance, here are some functions implementing the various solutions
f0 = function(x) x[ !x %in% x[duplicated(x)] ]
f1 = function(x) setdiff( x, x[duplicated(x)] )
f2 = function(x) { ux = unique(x); ux[ tabulate(match(x, ux)) == 1 ] }
f3 = function(x) x[ which( tabulate( match(x, x) ) == 1 ) ]
f4 = function(x) { r = rle(sort(x)); r$values[ r$lengths == 1] }
f5 = function(x) { x = table(x); names(x)[x==1] }
f6 = function(x) x[ !duplicated(x) & !duplicated(x, fromLast = TRUE) ]
evidence that they produce identical results
> identical(f0(x), f1(x))
[1] TRUE
> identical(f0(x), f2(x))
[1] TRUE
> identical(f0(x), f3(x))
[1] TRUE
> identical(f0(x), f4(x))
[1] TRUE
> identical(f0(x), f5(x))
[1] TRUE
> identical(f0(x), f6(x))
[1] TRUE
f5() (also the original implementation) fails for x = character(0)
> f1(character(0))
character(0)
> f5(character(0))
NULL
f4() and f5() return values in alphabetical order, whereas the others preserve the order in the input, like unique(). All methods but f5() work with vectors of other type, e.g., integer() (f5() always returns a character vector, the others return a vector with the same type as the input). f4() and f5() do not recognize unique occurrences of NA.
And timings:
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 9.195 10.9730 12.35724 11.8120 13.0580 29.100 100
f1(x) 20.471 22.6625 50.15586 24.6750 25.9915 2600.307 100
f2(x) 13.708 15.2265 58.58714 16.8180 18.4685 4180.829 100
f3(x) 7.533 8.8775 52.43730 9.9855 11.0060 4252.063 100
f4(x) 74.333 79.4305 124.26233 83.1505 87.4455 4091.371 100
f5(x) 147.744 154.3080 196.05684 158.4880 163.6625 3721.522 100
f6(x) 12.458 14.2335 58.11869 15.4805 17.0440 4250.500 100
Here's performance with 10,000 unique words
> x = readLines("/usr/share/dict/words", 10000)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
f0(x) 848.086 871.359 880.8841 873.637 899.669 916.528 10
f1(x) 1440.904 1460.704 1556.7154 1589.405 1607.048 1640.347 10
f2(x) 2143.997 2257.041 2288.1878 2288.329 2334.494 2372.639 10
f3(x) 1420.144 1548.055 1547.8093 1562.927 1596.574 1601.176 10
f4(x) 11829.680 12141.870 12369.5407 12311.334 12716.806 12952.950 10
f5(x) 15796.546 15833.650 16176.2654 15858.629 15913.465 18604.658 10
f6(x) 1219.036 1356.807 1354.3578 1363.276 1372.831 1407.077 10
And with substantial duplication
> x = sample(head(x, 1000), 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 1.914699 1.922925 1.992511 1.945807 2.030469 2.246022 100
f1(x) 1.888959 1.909469 2.097532 1.948002 2.031083 5.310342 100
f2(x) 1.396825 1.404801 1.447235 1.420777 1.479277 1.820402 100
f3(x) 1.248126 1.257283 1.295493 1.285652 1.329139 1.427220 100
f4(x) 24.075280 24.298454 24.562576 24.459281 24.700579 25.752481 100
f5(x) 4.044137 4.120369 4.307893 4.174639 4.283030 7.740830 100
f6(x) 1.221024 1.227792 1.264572 1.243201 1.295888 1.462007 100
f0() seems to be the speed winner when duplicates are rare
> x = readLines("/usr/share/dict/words", 100000)
> microbenchmark(f0(x), f1(x), f3(x), f6(x))
Unit: milliseconds
expr min lq mean median uq max neval
f0(x) 11.03298 11.17124 12.17688 11.36114 11.62769 19.83124 100
f1(x) 21.16154 21.33792 22.76237 21.67234 22.26473 31.99544 100
f3(x) 21.15801 21.49355 22.60749 21.77821 22.54203 31.17288 100
f6(x) 18.72260 18.97623 20.29060 19.46875 19.94892 28.17551 100
f3() and f6() look correct and fast; f6() is probably easier to understand (but only handles the special case of keeping words that occur exactly once).

How to efficiently read the first character from each line of a text file?

I'd like to read only the first character from each line of a text file, ignoring the rest.
Here's an example file:
x <- c(
"Afklgjsdf;bosfu09[45y94hn9igf",
"Basfgsdbsfgn",
"Cajvw58723895yubjsdw409t809t80",
"Djakfl09w50968509",
"E3434t"
)
writeLines(x, "test.txt")
I can solve the problem by reading everything with readLines and using substring to get the first character:
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
This seems inefficient though. Is there a way to persuade R to only read the first characters, rather than having to discard them?
I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).
Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
    nch,
    function(nch) {
        paste0(
            sample(letters, nch, replace = TRUE),
            collapse = ""
        )
    },
    character(1)
)
writeLines(x2, "bigtest.txt")
Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to be using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or to treat the file as binary (Martin Morgan's readBin solution).
If you allow/have access to Unix command-line tools you can use
scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE)
Obviously less portable but probably very fast.
Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:
expr min lq mean median uq
RC readLines 14.797830 17.083849 19.261917 18.103020 20.007341
RS read.fwf 125.113935 133.259220 148.122596 138.024203 150.528754
BB scan pipe cut 6.277267 7.027964 7.686314 7.337207 8.004137
RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
RS scan 13.927765 14.752597 16.634288 15.274470 16.992124
data.table::fread() seems to beat all of the solutions so far proposed, and has the great virtue of running comparably fast on both Windows and *NIX machines:
library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)
Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):
Unit: milliseconds
expr min lq mean median uq max neval
RC readLines 15.830318 16.617075 18.294723 17.116666 18.959381 27.54451 100
JOB fread 5.532777 6.013432 7.225067 6.292191 7.727054 12.79815 100
RS read.fwf 111.099578 113.803053 118.844635 116.501270 123.987873 141.14975 100
BB scan pipe cut 6.583634 8.290366 9.925221 10.115399 11.013237 15.63060 100
RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091 100
And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):
Unit: milliseconds
expr min lq mean median uq max neval cld
RC readLines 26.653266 27.493167 33.13860 28.057552 33.208309 61.72567 100 b
JOB fread 4.964205 5.343063 6.71591 5.538246 6.027024 13.54647 100 a
RS read.fwf 213.951792 217.749833 229.31050 220.793649 237.400166 287.03953 100 c
BB scan pipe cut 180.963117 263.469528 278.04720 276.138088 280.227259 387.87889 100 d
RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773 100 e
Figure out the file size, read the file in as a single binary blob, find the offsets of the characters of interest (don't count the last '\n' at the end of the file!), and coerce to the final form:
f0 <- function() {
    sz <- file.info("bigtest.txt")$size
    what <- charToRaw("\n")
    x = readBin("bigtest.txt", raw(), sz)
    idx = which(x == what)
    rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple = TRUE)
}
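The binary-blob idea, read everything and then index the byte after each newline, translates to other languages too; here is a Python sketch of the same approach, an illustration rather than a benchmarked solution:

```python
def first_chars(path):
    # Read the whole file as one binary blob, like readBin()
    with open(path, "rb") as f:
        data = f.read()
    # Offsets of the first byte and of every byte that follows a newline,
    # ignoring the offset implied by the file's trailing '\n'
    starts = [0] + [i + 1 for i, b in enumerate(data) if b == ord("\n")]
    return [chr(data[i]) for i in starts if i < len(data)]
```

On the small example file above, first_chars("test.txt") would give ['A', 'B', 'C', 'D', 'E'].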
The data.table solution (I think the fastest so far; note that the first line needs to be included as part of the data, hence header = FALSE):
library(data.table)
f1 <- function()
substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)
and in comparison
> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
expr min lq mean median uq max neval
f0() 5.144873 5.515219 5.571327 5.547899 5.623171 5.897335 100
f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261 100
Still wasteful, since the entire file is read in to memory before mostly being discarded.
01/04/2015 Edited to bring the better solution to the top.
Update 2: Changing the scan() method to run on an open connection, instead of opening and closing one on every iteration, allows reading all lines at once and eliminates the looping. The timing improved quite a bit.
## scan() on open connection
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
Here are the timings for these two methods.
## timings
library(microbenchmark)
microbenchmark(
    scan = {
        conn <- file("bigtest.txt", "rt")
        substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
        close(conn)
    },
    stringi = {
        stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
    }
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646 100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421 100
Original [slower] answer :
You could try read.fwf() (fixed-width file), setting the width to 1 to capture the first character of each line.
read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"
Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.
Update 1: read.fwf() is not very efficient, calling scan() and read.table() internally. We can skip the middlemen and try scan() directly.
lines <- count.fields("test.txt") ## length is num of lines in file
skip <- seq_along(lines) - 1 ## set up the 'skip' arg for scan()
read <- function(n) {
    ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet = TRUE)
    substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"
version$platform
# [1] "x86_64-pc-linux-gnu"
Benchmarks for each answer, under Windows.
library(microbenchmark)
microbenchmark(
    "RC readLines" = {
        lines <- readLines("test.txt")
        substring(lines, 1, 1)
    },
    "RS read.fwf" = read.fwf("test.txt", 1, stringsAsFactors = FALSE)$V1,
    "BB scan pipe cut" = scan(pipe("cut -c 1 test.txt"), what = character()),
    "RC readChar" = {
        con <- file("test.txt", "r")
        x <- readChar(con, 1)
        while (length(ch <- readChar(con, 1)) > 0) {
            if (ch == "\n") {
                x <- c(x, readChar(con, 1))
            }
        }
        close(con)
    }
)
## Unit: microseconds
## expr min lq mean median uq
## RC readLines 561.598 712.876 830.6969 753.929 884.8865
## RS read.fwf 5079.010 6429.225 6772.2883 6837.697 7153.3905
## BB scan pipe cut 308195.548 309941.510 313476.6015 310304.412 310772.0005
## RC readChar 1238.963 1549.320 1929.4165 1612.952 1740.8300
## max neval
## 2156.896 100
## 8421.090 100
## 510185.114 100
## 26437.370 100
And on the bigger dataset:
## Unit: milliseconds
## expr min lq mean median uq max neval
## RC readLines 52.212563 84.496008 96.48517 103.319789 104.124623 158.086020 20
## RS read.fwf 391.371514 660.029853 703.51134 766.867222 777.795180 799.670185 20
## BB scan pipe cut 283.442150 482.062337 516.70913 562.416766 564.680194 567.089973 20
## RC readChar 2819.343753 4338.041708 4500.98579 4743.174825 4921.148501 5089.594928 20
## RS scan 2.088749 3.643816 4.16159 4.651449 4.731706 5.375819 20
I don't find it very informative to benchmark operations on the order of micro- or milliseconds. But I understand that in some cases it can't be avoided. In those cases, I still find it essential to test data of different (increasing) sizes to get a rough measure of how well the method scales.
Here's my run of @MartinMorgan's tests using f0() and f1() on 1e4, 1e5 and 1e6 rows:
1e4
# Unit: milliseconds
# expr min lq mean median uq max neval
# f0() 4.226333 7.738857 15.47984 8.398608 8.972871 89.87805 100
# f1() 8.854873 9.204724 10.48078 9.471424 10.143601 84.33003 100
1e5
# Unit: milliseconds
# expr min lq mean median uq max neval
# f0() 71.66205 176.57649 174.9545 184.0191 187.7107 307.0470 100
# f1() 95.60237 98.82307 104.3605 100.8267 107.9830 205.8728 100
1e6
# Unit: seconds
# expr min lq mean median uq max neval
# f0() 1.443471 1.537343 1.561025 1.553624 1.558947 1.729900 10
# f1() 1.089555 1.092633 1.101437 1.095997 1.102649 1.140505 10
identical(f0(), f1()) returned TRUE on all the tests.
Update:
1e7
I also ran on 1e7 rows.
f1() (data.table) ran in 9.7 seconds, whereas f0() ran in 7.8 seconds the first time, and 9.4 and 6.6 seconds the second time.
However, f1() resulted in no noticeable change in memory while reading the entire 0.479 GB file, whereas f0() resulted in a spike of 2.4 GB.
Another observation:
set.seed(2015)
x2 <- vapply(
    1:1e5,
    function(i) {
        paste0(
            sample(letters, 100L, replace = TRUE),
            collapse = "_"
        )
    },
    character(1)
)
# 10 million rows, with 200 characters each
writeLines(unlist(lapply(1:100, function(x) x2)), "bigtest.txt")
## readBin() results in a 2 billion row vector
system.time(f0()) ## explodes on memory
Because the readBin() step results in a 2 billion length vector (~1.9GB to read the file), and which(x == what) step takes ~4.5+GB (= ~6.5GB in total) at which point I stopped the process.
fread() takes ~23 seconds in this case.
HTH

Count characters in a string (excluding spaces) in R?

I want to count the number of characters in a string (excluding spaces) and I'd like to know if my approach can be improved.
Suppose I have:
x <- "hello to you"
I know nchar() will give me the number of characters in a string (including spaces):
> nchar(x)
[1] 12
But I'd like to return the following (excluding spaces):
[1] 10
To this end, I've done the following:
> nchar(gsub(" ", "",x))
[1] 10
My worry is the gsub() will take a long time over many strings. Is this the correct way to approach this, or is there a type of nchar'esque function that will return the number of characters without counting spaces?
Thanks in advance.
Building on Richard's comment, "stringi" would be a great consideration here:
The approach could be to calculate the overall string length and subtract the number of spaces.
Compare the following.
library(stringi)
library(microbenchmark)
x <- "hello to you"
x
# [1] "hello to you"
fun1 <- function(x) stri_length(x) - stri_count_fixed(x, " ")
fun2 <- function(x) nchar(gsub(" ", "",x))
y <- paste(as.vector(replicate(1000000, x, TRUE)), collapse = " ")
microbenchmark(fun1(x), fun2(x))
# Unit: microseconds
# expr min lq mean median uq max neval
# fun1(x) 5.560 5.988 8.65163 7.270 8.1255 44.047 100
# fun2(x) 9.408 9.837 12.84670 10.691 12.4020 57.732 100
microbenchmark(fun1(y), fun2(y), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1(y) 68.22904 68.50273 69.6419 68.63914 70.47284 75.17682 10
# fun2(y) 2009.14710 2011.05178 2042.8123 2030.10502 2079.87224 2090.09142 10
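For comparison, the same length-minus-spaces trick in Python, a sketch rather than part of the answer, where str.count avoids any regex work:

```python
x = "hello to you"

# Total length minus the number of spaces
n_chars = len(x) - x.count(" ")
print(n_chars)  # 10
```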
Indeed, stringi seems most appropriate here. Try this:
library(stringi)
x <- "hello to you"
stri_stats_latex(x)
Result:
CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
10 0 2 3 0 0
If you need it in a variable, you can access the individual counts via regular indexing [i], e.g.:
stri_stats_latex(x)[1]
