Delete characters before regular expression (R)

I have a character vector of stock tickers where the ticker name is concatenated to the country in which that ticker is based in the following form: country_name/ticker_name. I am trying to split each string and delete everything from the '/' back, returning a character vector of only the ticker names. Here is an example vector:
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
My initial thought would be to use the stringr library. I don't have really any experience with that package, but here is what I was trying:
library(stringr)
split_string <- str_split(sample_string, '/')
But I was unsure how to return only the second element of each list as a single vector.
How would I do this over a large character vector (~105 million entries)?
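For reference, one straightforward way to pull the second piece out of the str_split result (a minimal base-R/vapply sketch; the answers below focus on doing this fast over a large vector):
split_string <- str_split(sample_string, '/')
vapply(split_string, `[[`, character(1), 2)
# [1] "SPY"  "AOL"  "MTC"  "PHA"  "PZI"  "AOL"  "BRCM"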

Here are some benchmarks including all the methods suggested by @David Arenburg, and another method using str_extract from the stringr package.
sample_string <- rep(sample_string, 1000000)
library(data.table); library(stringr)
s1 <- function() sub(".*/(.*)", "\\1", sample_string)
s2 <- function() sub(".*/", "", sample_string)
s3 <- function() str_extract(sample_string, "(?<=/)(.*)")
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
length(sample_string)
# [1] 7000000
identical(s1(), s2())
# [1] TRUE
identical(s1(), s3())
# [1] TRUE
identical(s1(), s4())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# s1() 3.916555 3.917370 4.046708 3.923246 3.925184 4.551184 5
# s2() 3.584694 3.593755 3.726922 3.610284 3.646449 4.199426 5
# s3() 3.051398 3.062237 3.354410 3.138080 3.722347 3.797985 5
# s4() 1.908283 1.964223 2.349522 2.117521 2.760612 2.996971 5
The tstrsplit method is the fastest.
Update:
Added another method from @Frank. This comparison is not strictly fair, since it depends on the actual data: if there are a lot of duplicated values, as in the sample_string produced above, the advantage is quite obvious:
s5 <- function() setDT(list(sample_string))[, v := tstrsplit(V1, "/", fixed = TRUE)[[2]], by=V1]$v
identical(s1(), s5())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), s5(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# s1() 3905.97703 3913.264 3922.8540 3913.4035 3932.2680 3949.3575 5
# s2() 3568.63504 3576.755 3713.7230 3660.5570 3740.8252 4021.8426 5
# s3() 3029.66877 3032.898 3061.0584 3052.6937 3086.9714 3103.0604 5
# s4() 1322.42430 1679.475 1985.5440 1801.9054 1857.8056 3266.1101 5
# s5() 82.71379 101.899 177.8306 121.6682 209.0579 373.8141 5

Some helpful notes about your question: Firstly, there is a str_split_fixed function in the stringr package which does what you want:
library(data.table); library(stringr)
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
sample_string <- rep(sample_string, 1e5)
split_string <- str_split_fixed(sample_string, '/', 2)[,2]
It works by calling stringi::stri_split_fixed and is not dissimilar to
do.call("c", lapply(str_split(sample_string, '/'),"[[",2))
Secondly, another way to think about extracting each second element of the list is by doing exactly what tstrsplit is doing internally.
transpose(strsplit(sample_string, "/", fixed = T))[[2]]
On a total side note, the above should be marginally faster than calling tstrsplit. This, of course, is probably not worth typing out at length, but it helps to know what the function does.
library(data.table); library(stringr)
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
s5 <- function() transpose(strsplit(sample_string, "/", fixed = T))[[2]]
identical(s4(), s5())
microbenchmark::microbenchmark(s4(), s5(), times = 20)
Unit: milliseconds
expr min lq mean median uq max neval
s4() 161.0744 193.3611 255.8136 234.9945 271.6811 434.7992 20
s5() 140.8569 176.5600 233.3570 194.1676 251.7921 420.3431 20
Regarding this second method, in short, transposing this list of length 7 million, each with 2 elements will convert your result to a list of length 2, each with 7 million elements. You are then extracting the second element of this list.
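As a tiny illustration of what transpose() does here (a toy example, not part of the benchmark):
library(data.table)
transpose(list(c("US", "SPY"), c("US", "AOL")))
# [[1]]
# [1] "US" "US"
#
# [[2]]
# [1] "SPY" "AOL"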

Subset list of vectors by position in a vectorized way

I have a list of vectors and I'm trying to select (for example) the 2nd and 4th element in each vector. I can do this using lapply:
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
[[1]]
[1] 2 4
[[2]]
[1] 9 7
[[3]]
[1] 2 4
[[4]]
[1] 9 7
[[5]]
[1] 2 4
But is there a way to do this in a vectorized way -- avoiding one of the apply functions? My problem is that my actual list_of_vec is fairly long, so lapply takes a while.
Solutions:
Option 1 @Athe's clever solution using do.call:
do.call(rbind, list_of_vec)[ ,c(2,4)]
Option 2 Using lapply more efficiently:
lapply(list_of_vec, `[`, c(2, 4))
Option 3 A vectorized solution (see the short illustration after the option list):
starts <- c(0, cumsum(lengths(list_of_vec))[-length(list_of_vec)])
matrix(unlist(list_of_vec)[c(starts + 2, starts + 4)], ncol = 2)
Option 4 the lapply solution you wanted to improve:
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
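To see why Option 3 works (a small illustration on the example list, not part of the original answer): unlist() lays the vectors end to end, so element j of vector i sits at position starts[i] + j.
lengths(list_of_vec)
# [1] 10 10 10 10 10
starts <- c(0, cumsum(lengths(list_of_vec))[-length(list_of_vec)])
starts
# [1]  0 10 20 30 40
unlist(list_of_vec)[starts + 2]
# [1] 2 9 2 9 2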
Data:
And a few datasets I will test them on:
# The original data
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
# A long list with short elements
list_of_vec2 <- rep(list_of_vec, 1e5)
# A long list with long elements
list_of_vec3 <- lapply(list_of_vec, rep, 1e3)
list_of_vec3 <- rep(list_of_vec3, 1e4)
Benchmarking:
Original list:
Unit: microseconds
expr min lq mean median uq max neval cld
o1 2.276 2.8450 3.00417 2.845 3.129 10.809 100 a
o2 2.845 3.1300 3.59018 3.414 3.414 23.325 100 a
o3 3.698 4.1250 4.60558 4.267 4.552 20.480 100 a
o4 5.689 5.9735 17.52222 5.974 6.258 1144.606 100 a
Longer list, short elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 146.30778 146.88037 155.04077 149.89164 159.52194 184.92028 10 b
o2 185.40526 187.85717 192.83834 188.42749 190.32103 213.79226 10 c
o3 26.55091 27.27596 28.46781 27.48915 28.84041 32.19998 10 a
o4 407.66430 411.58054 426.87020 415.82161 437.19193 473.64265 10 d
Longer list, long elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 4855.59146 4978.31167 5012.0429 5025.97619 5072.9350 5095.7566 10 c
o2 17.88133 18.60524 103.2154 21.28613 195.0087 311.4122 10 a
o3 855.63128 872.15011 953.8423 892.96193 1069.7526 1106.1980 10 b
o4 37.92927 38.87704 135.6707 124.05127 214.6217 276.5814 10 a
Summary:
Looks like the vectorized solution wins out if the list is long and the elements are short, but lapply is the clear winner for a long list with longer elements. Some of the options output a list, others a matrix. So keep in mind what you want your output to be. Good luck!!!
If your list is composed of vectors of the same length, you could first transform it into a matrix and then get the columns you want.
matrix_of_vec <- do.call(rbind,list_of_vec)
matrix_of_vec[ ,c(2,4)]
Otherwise I'm afraid you'll have to stick to the apply family. The most efficient way to do it is to use the parallel package to compute in parallel (surprisingly).
corenum <- parallel::detectCores() - 1
cl <- parallel::makeCluster(corenum)
parallel::clusterExport(cl, "list_of_vec")
parallel::parSapply(cl, list_of_vec, '[', c(2, 4))
parallel::stopCluster(cl)
In this piece of code '[' is the name of the subsetting function and c(2,4) the argument you pass to it.
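As a quick aside (not from the original answer), `[` really is an ordinary function that can be passed around, so the following two calls are equivalent:
`[`(c(10, 20, 30, 40), c(2, 4))
# [1] 20 40
c(10, 20, 30, 40)[c(2, 4)]
# [1] 20 40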

Compute digit-sums in specific columns of a data frame

I'm trying to sum the digits of integers in the last 2 columns of my data frame. I have found a function that does the summing, but I think I may have an issue with applying the function - not sure?
Dataframe
a = c("a", "b", "c")
b = c(1, 11, 2)
c = c(2, 4, 23)
data <- data.frame(a,b,c)
#Digitsum function
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
#Applying function
data[2:3] <- lapply(data[2:3], digitsum)
These are the warnings that I get:
Warning messages:
1: In 0:(nchar(as.character(x)) - 1) :
numerical expression has 3 elements: only the first used
2: In 0:(nchar(as.character(x)) - 1) :
numerical expression has 3 elements: only the first used
Your function digitsum at the moment works fine for a single scalar input, for example,
digitsum(32)
# [1] 5
But it cannot take a vector input, otherwise ":" will complain. You need to vectorize this function using Vectorize:
vec_digitsum <- Vectorize(digitsum)
Then it works for a vector input:
b = c(1, 11, 2)
vec_digitsum(b)
# [1] 1 2 2
Now you can use lapply without trouble.
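For completeness, applying the vectorized function to the example data frame would then look like this (a sketch based on the code above):
data[2:3] <- lapply(data[2:3], vec_digitsum)
data
#   a b c
# 1 a 1 2
# 2 b 2 4
# 3 c 2 5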
@Zheyuan Li's answer solved your problem of using lapply. Though I'd like to add several points:
Vectorize is just a wrapper around mapply, which doesn't give you the performance of true vectorization.
The function itself can be improved for much better readability; compare:
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
vec_digitsum <- Vectorize(digitsum)
sumdigits <- function(x){
digits <- strsplit(as.character(x), "")[[1]]
sum(as.numeric(digits))
}
vec_sumdigits <- Vectorize(sumdigits)
microbenchmark::microbenchmark(digitsum(12324255231323),
sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
digitsum(12324255231323) 12.223 12.712 14.50613 13.201 13.690 96.801 100 a
sumdigits(12324255231323) 13.689 14.667 15.32743 14.668 15.157 38.134 100 a
The performance of the two versions is similar, but the second one is much easier to understand.
Interestingly, the Vectorize wrapper adds considerable overhead for a single input:
microbenchmark::microbenchmark(vec_digitsum(12324255231323),
vec_sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
vec_digitsum(12324255231323) 92.890 96.801 267.2665 100.223 108.045 16387.07 100 a
vec_sumdigits(12324255231323) 94.357 98.757 106.2705 101.445 107.556 286.00 100 a
Another advantage of this function is that it will still work if you have really big numbers in string format (with a small modification: removing the as.character call), whereas the first version will have problems with big numbers or may introduce errors.
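For instance, the string-input variant mentioned above could look roughly like this (a sketch; sumdigits_chr is a hypothetical name, not from the answer):
sumdigits_chr <- function(x) {
  digits <- strsplit(x, "")[[1]]  # x stays a string, so no as.character() is needed
  sum(as.numeric(digits))
}
sumdigits_chr("123456789012345678901234567890")  # too large to hold exactly as a double
# [1] 135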
Note: at first my benchmark compared the vectorized version of the OP's function with the non-vectorized version of my function, which gave me the wrong impression that my function was much faster. It turned out that was caused by the Vectorize overhead.

Extract part of string before the first semicolon

I have a column containing values of 3 strings separated by semicolons. I need to just extract the part of the string which comes before the first semicolon.
Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
What I want is to get the first part of the string (up to the first semicolon).
Desired output : SNSR_RMIN_PSX150Y_CSH
I tried gsub without success.
You could try sub
sub(';.*$','', Type)
#[1] "SNSR_RMIN_PSX150Y_CSH"
It will match the pattern (i.e. from the first occurrence of ; to the end of the string) and replace it with ''.
Or use
library(stringi)
stri_extract(Type, regex='[^;]*')
#[1] "SNSR_RMIN_PSX150Y_CSH"
The stringi package works very fast here:
stri_extract_first_regex(Type, "^[^;]+")
## [1] "SNSR_RMIN_PSX150Y_CSH"
I benchmarked the 3 main approaches here:
Unit: milliseconds
expr min lq mean median uq max neval
SAPPLY() 254.88442 267.79469 294.12715 277.4518 325.91576 419.6435 100
SUB() 182.64996 186.26583 192.99277 188.6128 197.17154 237.9886 100
STRINGI() 89.45826 91.05954 94.11195 91.9424 94.58421 124.4689 100
Here's the code for the Benchmarks:
library(stringi)
SAPPLY <- function() sapply(strsplit(Type, ";"), "[[", 1)
SUB <- function() sub(';.*$','', Type)
STRINGI <- function() stri_extract_first_regex(Type, "^[^;]+")
Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
Type <- rep(Type, 100000)
library(microbenchmark)
microbenchmark(
SAPPLY(),
SUB(),
STRINGI(),
times=100L)
You can also use strsplit:
strsplit(Type, ";")[[1]][1]
[1] "SNSR_RMIN_PSX150Y_CSH"
When performance is important, you can use substr in combination with regexpr from base R.
substr(Type, 1, regexpr(";", Type, fixed=TRUE)-1)
#[1] "SNSR_RMIN_PSX150Y_CSH"
Timings (reusing the setup from @Tyler Rinker):
library(stringi)
SAPPLY <- function() sapply(strsplit(Type, ";"), "[[", 1)
SUB <- function() sub(';.*$','', Type)
SUB2 <- function() sub(';.*','', Type)
SUB3 <- function() sub('([^;]*).*','\\1', Type)
STRINGI <- function() stri_extract_first_regex(Type, "^[^;]+")
STRINGI2 <- function() stri_extract_first_regex(Type, "[^;]*")
SUBSTRREG <- function() substr(Type, 1, regexpr(";", Type)-1)
SUBSTRREG2 <- function() substr(Type, 1, regexpr(";", Type, fixed=TRUE)-1)
SUBSTRREG3 <- function() substr(Type, 1, regexpr(";", Type, fixed=TRUE, useBytes = TRUE)-1)
Type <- c("SNSR_RMIN_PSX150Y_CSH;SP_12;I0.00V50HX0HY3000")
Type <- rep(Type, 100000)
library(microbenchmark)
microbenchmark(SAPPLY(), SUB(), SUB2(), SUB3(), STRINGI()
, STRINGI2(), SUBSTRREG(), SUBSTRREG2(), SUBSTRREG3())
#Unit: milliseconds
# expr min lq mean median uq max neval
# SAPPLY() 382.23750 395.92841 412.82508 410.05236 427.58816 460.28508 100
# SUB() 111.92120 114.28939 116.41950 115.57371 118.15573 123.92400 100
# SUB2() 94.27831 96.50462 98.14741 97.38199 99.15260 119.51090 100
# SUB3() 167.77139 172.51271 175.07144 173.83121 176.27710 190.97815 100
# STRINGI() 38.27645 39.33428 39.94134 39.71842 40.50182 42.55838 100
# STRINGI2() 38.16736 39.19250 40.14904 39.63929 40.37686 56.03174 100
# SUBSTRREG() 45.04828 46.39867 47.13018 46.85465 47.71985 51.07955 100
# SUBSTRREG2() 10.67439 11.02963 11.29290 11.12222 11.43964 13.64643 100
# SUBSTRREG3() 10.74220 10.95139 11.39466 11.06632 11.46908 27.72654 100

How to efficiently read the first character from each line of a text file?

I'd like to read only the first character from each line of a text file, ignoring the rest.
Here's an example file:
x <- c(
"Afklgjsdf;bosfu09[45y94hn9igf",
"Basfgsdbsfgn",
"Cajvw58723895yubjsdw409t809t80",
"Djakfl09w50968509",
"E3434t"
)
writeLines(x, "test.txt")
I can solve the problem by reading everything with readLines and using substring to get the first character:
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
This seems inefficient though. Is there a way to persuade R to only read the first characters, rather than having to discard them?
I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).
Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
nch,
function(nch)
{
paste0(
sample(letters, nch, replace = TRUE),
collapse = ""
)
},
character(1)
)
writeLines(x2, "bigtest.txt")
Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).
If you allow/have access to Unix command-line tools you can use
scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE)
Obviously less portable but probably very fast.
Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:
expr min lq mean median uq
RC readLines 14.797830 17.083849 19.261917 18.103020 20.007341
RS read.fwf 125.113935 133.259220 148.122596 138.024203 150.528754
BB scan pipe cut 6.277267 7.027964 7.686314 7.337207 8.004137
RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
RS scan 13.927765 14.752597 16.634288 15.274470 16.992124
data.table::fread() seems to beat all of the solutions so far proposed, and has the great virtue of running comparably fast on both Windows and *NIX machines:
library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)
Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):
Unit: milliseconds
expr min lq mean median uq max neval
RC readLines 15.830318 16.617075 18.294723 17.116666 18.959381 27.54451 100
JOB fread 5.532777 6.013432 7.225067 6.292191 7.727054 12.79815 100
RS read.fwf 111.099578 113.803053 118.844635 116.501270 123.987873 141.14975 100
BB scan pipe cut 6.583634 8.290366 9.925221 10.115399 11.013237 15.63060 100
RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091 100
And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):
Unit: milliseconds
expr min lq mean median uq max neval cld
RC readLines 26.653266 27.493167 33.13860 28.057552 33.208309 61.72567 100 b
JOB fread 4.964205 5.343063 6.71591 5.538246 6.027024 13.54647 100 a
RS read.fwf 213.951792 217.749833 229.31050 220.793649 237.400166 287.03953 100 c
BB scan pipe cut 180.963117 263.469528 278.04720 276.138088 280.227259 387.87889 100 d
RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773 100 e
Figure out the file size, read the file in as a single binary blob, find the offsets of the characters of interest (don't count the last '\n' at the end of the file!), and coerce to the final form:
f0 <- function() {
sz <- file.info("bigtest.txt")$size
what <- charToRaw("\n")
x = readBin("bigtest.txt", raw(), sz)
idx = which(x == what)
rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple=TRUE)
}
The data.table solution (I think the fastest so far; note that header = FALSE is needed so the first line is included as part of the data):
library(data.table)
f1 <- function()
substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)
and in comparison
> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
expr min lq mean median uq max neval
f0() 5.144873 5.515219 5.571327 5.547899 5.623171 5.897335 100
f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261 100
Still wasteful, since the entire file is read in to memory before mostly being discarded.
01/04/2015 Edited to bring the better solution to the top.
Update 2: Changing the scan() method to run on an open connection, instead of opening and closing it on every iteration, allows reading line-by-line and eliminates the looping. The timing improved quite a bit.
## scan() on open connection
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
Here are the timings for these two methods.
## timings
library(microbenchmark)
microbenchmark(
scan = {
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
},
stringi = {
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646 100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421 100
Original [slower] answer:
You could try read.fwf() (fixed width file), setting the width to a single 1 to capture the first character on each line.
read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"
Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.
Update 1: read.fwf() is not very efficient; it calls scan() and read.table() internally. We can skip the middlemen and try scan() directly.
lines <- count.fields("test.txt") ## length is num of lines in file
skip <- seq_along(lines) - 1 ## set up the 'skip' arg for scan()
read <- function(n) {
ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"
version$platform
# [1] "x86_64-pc-linux-gnu"
Benchmarks for each answer, under Windows.
library(microbenchmark)
microbenchmark(
"RC readLines" = {
lines <- readLines("test.txt")
substring(lines, 1, 1)
},
"RS read.fwf" = read.fwf("test.txt", 1, stringsAsFactors = FALSE)$V1,
"BB scan pipe cut" = scan(pipe("cut -c 1 test.txt"),what=character()),
"RC readChar" = {
con <- file("test.txt", "r")
x <- readChar(con, 1)
while(length(ch <- readChar(con, 1)) > 0)
{
if(ch == "\n")
{
x <- c(x, readChar(con, 1))
}
}
close(con)
}
)
## Unit: microseconds
## expr min lq mean median uq
## RC readLines 561.598 712.876 830.6969 753.929 884.8865
## RS read.fwf 5079.010 6429.225 6772.2883 6837.697 7153.3905
## BB scan pipe cut 308195.548 309941.510 313476.6015 310304.412 310772.0005
## RC readChar 1238.963 1549.320 1929.4165 1612.952 1740.8300
## max neval
## 2156.896 100
## 8421.090 100
## 510185.114 100
## 26437.370 100
And on the bigger dataset:
## Unit: milliseconds
## expr min lq mean median uq max neval
## RC readLines 52.212563 84.496008 96.48517 103.319789 104.124623 158.086020 20
## RS read.fwf 391.371514 660.029853 703.51134 766.867222 777.795180 799.670185 20
## BB scan pipe cut 283.442150 482.062337 516.70913 562.416766 564.680194 567.089973 20
## RC readChar 2819.343753 4338.041708 4500.98579 4743.174825 4921.148501 5089.594928 20
## RS scan 2.088749 3.643816 4.16159 4.651449 4.731706 5.375819 20
I don't find it very informative to benchmark operations on the order of microseconds or milliseconds, but I understand that in some cases it can't be avoided. In those cases, I still find it essential to test data of different (increasing) sizes to get a rough measure of how well the method scales.
Here's my run of @MartinMorgan's tests, using f0() and f1() on 1e4, 1e5 and 1e6 rows; here are the results:
1e4
# Unit: milliseconds
# expr min lq mean median uq max neval
# f0() 4.226333 7.738857 15.47984 8.398608 8.972871 89.87805 100
# f1() 8.854873 9.204724 10.48078 9.471424 10.143601 84.33003 100
1e5
# Unit: milliseconds
# expr min lq mean median uq max neval
# f0() 71.66205 176.57649 174.9545 184.0191 187.7107 307.0470 100
# f1() 95.60237 98.82307 104.3605 100.8267 107.9830 205.8728 100
1e6
# Unit: seconds
# expr min lq mean median uq max neval
# f0() 1.443471 1.537343 1.561025 1.553624 1.558947 1.729900 10
# f1() 1.089555 1.092633 1.101437 1.095997 1.102649 1.140505 10
identical(f0(), f1()) returned TRUE on all the tests.
Update:
1e7
I also ran on 1e7 rows.
f1() (data.table) ran in 9.7 seconds, whereas f0() ran in 7.8 seconds the first time, and 9.4 and 6.6 seconds the second time.
However, f1() resulted in no noticeable change in memory while reading the entire 0.479GB file, whereas f0() resulted in a spike of 2.4GB.
Another observation:
set.seed(2015)
x2 <- vapply(
1:1e5,
function(i)
{
paste0(
sample(letters, 100L, replace = TRUE),
collapse = "_"
)
},
character(1)
)
# 10 million rows, with 200 characters each
writeLines(unlist(lapply(1:100, function(x) x2)), "bigtest.txt")
## readBin() results in a ~2-billion-element raw vector
system.time(f0()) ## explodes on memory
Because the readBin() step results in a vector of length ~2 billion (~1.9GB to read the file), and the which(x == what) step takes a further ~4.5GB (~6.5GB in total), at which point I stopped the process.
fread() takes ~23 seconds in this case.
HTH

Count characters in a string (excluding spaces) in R?

I want to count the number of characters in a string (excluding spaces) and I'd like to know if my approach can be improved.
Suppose I have:
x <- "hello to you"
I know nchar() will give me the number of characters in a string (including spaces):
> nchar(x)
[1] 12
But I'd like to return the following (excluding spaces):
[1] 10
To this end, I've done the following:
> nchar(gsub(" ", "",x))
[1] 10
My worry is that gsub() will take a long time over many strings. Is this the correct way to approach this, or is there an nchar-esque function that will return the number of characters without counting spaces?
Thanks in advance.
Building on Richard's comment, "stringi" would be a great consideration here:
The approach could be to calculate the overall string length and subtract the number of spaces.
Compare the following.
library(stringi)
library(microbenchmark)
x <- "hello to you"
x
# [1] "hello to you"
fun1 <- function(x) stri_length(x) - stri_count_fixed(x, " ")
fun2 <- function(x) nchar(gsub(" ", "",x))
y <- paste(as.vector(replicate(1000000, x, TRUE)), collapse = " ")
microbenchmark(fun1(x), fun2(x))
# Unit: microseconds
# expr min lq mean median uq max neval
# fun1(x) 5.560 5.988 8.65163 7.270 8.1255 44.047 100
# fun2(x) 9.408 9.837 12.84670 10.691 12.4020 57.732 100
microbenchmark(fun1(y), fun2(y), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1(y) 68.22904 68.50273 69.6419 68.63914 70.47284 75.17682 10
# fun2(y) 2009.14710 2011.05178 2042.8123 2030.10502 2079.87224 2090.09142 10
Indeed, stringi seems most appropriate here. Try this:
library(stringi)
x <- "hello to you"
stri_stats_latex(x)
Result:
CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
10 0 2 3 0 0
If you need it in a variable, you can access the individual values via regular [i] indexing, e.g.:
stri_stats_latex(x)[1]
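Since the result is a named vector, indexing by name may read more clearly (a small variation, assuming the same x as above):
stri_stats_latex(x)["CharsWord"]
# CharsWord
#        10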
