Difficulties with `agrep(..., fixed=F)` - r

In ?agrep (grep with fuzzy matching) it mentions that I can set the argument fixed=FALSE to let my pattern be interpreted as a regular expression.
However, I can't get it to work!
agrep('(asdf|fdsa)', 'asdf', fixed=F)
# integer(0)
The above should match as the regular expression "(asdf|fdsa)" exactly matches the test string "asdf" in this case.
To confirm:
grep('(asdf|fdsa)', 'asdf', fixed=F)
# 1 : it does match with grep
And even more confusingly, adist correctly gives the distance between the pattern and string as 0, meaning that agrep should definitely return 1 rather than integer(0) (there's no possibility that 0 is greater than the default max.distance = 0.1).
adist('(asdf|fdsa)', 'asdf', fixed=F)
# [,1]
# [1,] 0
Why is this not working? Is there something I don't understand? A workaround?
I'm happy to use adist, but am not entirely sure how to convert agrep's default max.distance=0.1 parameter to adist's corresponding parameter.
(yes, I'm stuck on an old computer that can't do better than R 2.15.2)
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i686-redhat-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_AU.utf8 LC_NUMERIC=C
[3] LC_TIME=en_AU.utf8 LC_COLLATE=en_AU.utf8
[5] LC_MONETARY=en_AU.utf8 LC_MESSAGES=en_AU.utf8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base

tl;dr: agrep(..., fixed=F) does not seem to work with the '|' character. Use aregexec.
Upon further investigation, I think this is a bug, and that agrep(..., fixed=F) does not seem to work with '|' regexes (although adist(..., fixed=F) does).
To elaborate, note that
adist('(asdf|fdsa)', 'asdf', fixed=T) # 7
nchar('(asdf|fdsa)') # 11
If 'asdf' were agrep'd to the non-regular-expression string '(asdf|fdsa)', then it would have distance 7.
On that note:
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=7) # 1
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=6) # integer(0)
These are the results I'd expect if fixed=T. If fixed=F, my regex would match 'asdf' exactly and the distance would be 0, so I'd always get a result of '1' back out of agrep.
So it looks like agrep(pattern, x, fixed=F) does not work here, i.e. it actually treats fixed as TRUE for this sort of pattern.
As @Arun mentions, it might just be '|' regexes that don't work. For example, agrep('la[sb]y', 'lazy', fixed=FALSE) does work as expected.
EDIT: Workaround (thanks @Arun)
The function aregexec appears to work.
> aregexec('(asdf|fdsa)', 'asdf', fixed=F)
[[1]]
[1] 1 1
attr(,"match.length")
[1] 4 4
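For the adist route raised in the question, one possible translation of max.distance is sketched below. It assumes, following the description in ?agrep, that a fractional max.distance is scaled by the pattern length and rounded up. agrep_via_adist is a hypothetical helper name, and its results may not agree with agrep in every case.

```r
# Hypothetical helper: emulate agrep(pattern, x, fixed = FALSE) with adist().
# Assumption: a fractional max.distance is scaled by nchar(pattern) and
# rounded up, as ?agrep describes; for a regex this counts metacharacters too.
agrep_via_adist <- function(pattern, x, max.distance = 0.1) {
  limit <- if (max.distance < 1) {
    ceiling(max.distance * nchar(pattern))
  } else {
    max.distance
  }
  # partial = TRUE asks for the best match against any substring of x,
  # which is how agrep searches.
  d <- adist(pattern, x, fixed = FALSE, partial = TRUE)
  which(as.vector(d) <= limit)
}

# 'asdf' is within the limit, 'zzzz' is not.
agrep_via_adist('(asdf|fdsa)', c('asdf', 'zzzz'))
```

Because the limit is derived from the length of the regex string, a long alternation loosens the threshold; passing an integer max.distance avoids that.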

This has (finally) been fixed in the R sources ("trunk" / R-devel) and in R-patched, which will become R 3.5.1 in early July 2018.

Related

Broken encoding UTF-8 when use Encoding() and tokens()

I've got a rather strange encoding problem. When I run Encoding(txt) <- "UTF-8", the encoding breaks and the strings look like "\xe7\xed\xe0\xfe\xf2".
txt <- c("привет", "пока")
Encoding(txt) # I get "unknown" "unknown"
Encoding(txt) <- "UTF-8"
Encoding(txt) # I get "UTF-8" "UTF-8", but strange symbols in vector
Plus, when I run l10n_info(), I get
$MBCS
[1] FALSE
$`UTF-8`
[1] FALSE
$`Latin-1`
[1] FALSE
I know that I can use enc2utf8() on strings, but I work with quanteda and get the same issue as here: https://github.com/quanteda/quanteda/issues/1387 (reinstalling the package from GitHub didn't help). I think the problem is with the encoding on the server.
P.S. A data frame loaded from Excel is displayed correctly, and when I save a tokens object into a new xlsx file, all the strings are displayed in Cyrillic.
Here is my session info:
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251 LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C LC_TIME=Russian_Russia.1251
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tsne_0.1-3 stringi_1.5.3 tm_0.7-8 NLP_0.2-1 DataCombine_0.2.21 emo_0.0.0.9000 ggplot2_3.3.3 reshape2_1.4.4
[9] topicmodels_0.2-12 ldatuning_1.0.2 writexl_1.3.1 plyr_1.8.6 quanteda_2.9.9000 stringr_1.4.0 readxl_1.3.1
Thank you in advance!
Encoding issues are tricky, especially on Windows systems. It looks like your native encoding system is Windows-1251, an 8-bit encoding for Cyrillic. So when you input your string, it is input in that encoding. You can convert it to Unicode, but it still won't necessarily display correctly if you use the print method.
Here's the result of me trying to simulate the problem on my macOS platform.
> stringi::stri_info()$Charset.native[1:2]
$Name.friendly
[1] "UTF-8"
$Name.ICU
[1] "UTF-8"
My guess is that your system will show something different, but I cannot be sure.
> # on macOS 10.15.7
> txt <- c("привет", "пока")
> txt
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> Encoding(txt)
[1] "UTF-8" "UTF-8"
So that produces the same output that you are seeing, but it's encoded as UTF-8. To simulate what that would look like if the system encoded it as Windows-1251, we can convert it:
> # convert to Windows-1251
> txt_1251 <- iconv(txt, from = "UTF-8", to = "WINDOWS-1251")
> print(txt_1251)
[1] "\xef\xf0\xe8\xe2\xe5\xf2" "\xef\xee\xea\xe0"
> cat(txt_1251)
������ ����> Encoding(txt_1251)
[1] "unknown" "unknown"
Is that what you are seeing?
You can try fixing it this way:
> txt_from1251 <- stringi::stri_conv(txt_1251, from = "windows-1251", to = "utf-8")
> print(txt_from1251)
[1] "\u043f\u0440\u0438\u0432\u0435\u0442" "\u043f\u043e\u043a\u0430"
> cat(txt_from1251)
привет пока> Encoding(txt_from1251)
[1] "UTF-8" "UTF-8"
So while print() still shows escape sequences, cat() displays the text correctly, and the string has the correct encoding mark set.
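The difference between marking and converting can be seen without depending on the locale by building the string from raw bytes. This is only a sketch: the byte values are the Windows-1251 bytes for the first word shown in the output above, and the variable names are illustrative.

```r
# "привет" as Windows-1251 bytes (the same bytes shown in the output above).
bytes_1251 <- as.raw(c(0xef, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2))
txt_1251 <- rawToChar(bytes_1251)

# Setting the mark does not touch the bytes, so the string is now
# declared UTF-8 while still holding Windows-1251 bytes.
mislabelled <- txt_1251
Encoding(mislabelled) <- "UTF-8"
validUTF8(mislabelled)   # FALSE: these bytes are not valid UTF-8

# Converting rewrites the bytes and sets the mark.
converted <- iconv(txt_1251, from = "WINDOWS-1251", to = "UTF-8")
Encoding(converted)      # "UTF-8"
validUTF8(converted)     # TRUE
```

This is why Encoding(txt) <- "UTF-8" on a Windows-1251 system produces broken strings: it relabels the bytes instead of transcoding them.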
I could be wrong about this since my understanding of Unicode and character sets in R is incomplete, and it seems to be quite platform and locale dependent. I would happily see another response that improves this answer, or to hear your success or not with trying some of the fixes suggested above.

Why is R making a copy-on-modification after using str?

I was wondering why R makes a copy on modification after using str.
I create a matrix. I can change its dim, one element, or even all elements, and no copy is made. But when I call str, R makes a copy during the next modification of the matrix. Why is this happening?
m <- matrix(1:12, 3)
tracemem(m)
#[1] "<0x559df861af28>"
dim(m) <- 4:3
m[1,1] <- 0L
m[] <- 12:1
str(m)
# int [1:4, 1:3] 12 11 10 9 8 7 6 5 4 3 ...
dim(m) <- 3:4 #Here after str a copy is made
#tracemem[0x559df861af28 -> 0x559df838e4a8]:
dim(m) <- 3:4
str(m)
# int [1:3, 1:4] 12 11 10 9 8 7 6 5 4 3 ...
dim(m) <- 3:4 #Here again after str a copy
#tracemem[0x559df838e4a8 -> 0x559df82c9d78]:
I was also wondering why a copy is made when a task callback is registered.
TCB <- addTaskCallback(function(...) TRUE)
m <- matrix(1:12, nrow = 3)
tracemem(m)
#[1] "<0x559dfa79def8>"
dim(m) <- 4:3 #Copy on modification
#tracemem[0x559dfa79def8 -> 0x559dfa8998e8]:
removeTaskCallback(TCB)
#[1] TRUE
dim(m) <- 4:3 #No copy
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_AT.UTF-8 LC_COLLATE=de_AT.UTF-8
[5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
[7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.3
This is a follow-up question to "Is there a way to prevent copy-on-modify when modifying attributes?".
I start R with R --vanilla to have a clean session.
I have asked this question on R-help as suggested by @sam-mason in the comments.
The answer from Luke Tierney solved the issue with str:
As of R 4.0.0 it is in some cases possible to reduce reference counts
internally and so avoid a copy in cases like this. It would be too
costly to try to detect all cases where a count can be dropped, but in
this case we can do better. It turns out that the internals of
pos.to.env were unnecessarily creating an extra reference to the call
environment (here in a call to exists()). This is fixed in r79528.
Thanks.
And related to Task Callback:
It turns out there were some issues with the way calls to the
callbacks were handled. This has been revised in R-devel in r79541.
This example will no longer need to duplicate in R-devel.
Thanks for the report.
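In both cases the copy was triggered by an extra, hidden reference to the matrix. The same effect can be illustrated with a visible second reference. This is only a sketch of the general copy-on-modify rule, not of the internals that were fixed; alias is an illustrative name, and tracemem output is omitted because addresses vary between sessions.

```r
m <- matrix(1:12, 3)
alias <- m        # second reference to the same underlying data

# Because two names now refer to the data, modifying m has to work on a
# copy so that alias keeps its old shape.
dim(m) <- 4:3

dim(m)      # 4 3
dim(alias)  # 3 4
```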

Different results when subsetting data.table columns with numeric indices in different ways

See the minimal example:
library(data.table)
DT <- data.table(x = 2, y = 3, z = 4)
DT[, c(1:2)] # first way
# x y
# 1: 2 3
DT[, (1:2)] # second way
# [1] 1 2
DT[, 1:2] # third way
# x y
# 1: 2 3
As described in this post, subsetting columns with numeric indices is possible now. However, I would like to know why the indices are evaluated to a vector in the second way rather than treated as column indices.
In addition, I updated data.table just now:
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.11.2
loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4 yaml_2.1.17
By looking at the source code we can simulate data.table's behaviour for different inputs:
if (!missing(j)) {
jsub = replace_dot_alias(substitute(j))
root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
if (root == ":" ||
(root %chin% c("-","!") && is.call(jsub[[2L]]) && jsub[[2L]][[1L]]=="(" && is.call(jsub[[2L]][[2L]]) && jsub[[2L]][[2L]][[1L]]==":") ||
( (!length(av<-all.vars(jsub)) || all(substring(av,1L,2L)=="..")) &&
root %chin% c("","c","paste","paste0","-","!") &&
missing(by) )) { # test 763. TODO: likely that !missing(by) iff with==TRUE (so, with can be removed)
# When no variable names (i.e. symbols) occur in j, scope doesn't matter because there are no symbols to find.
# If variable names do occur, but they are all prefixed with .., then that means look up in calling scope.
# Automatically set with=FALSE in this case so that DT[,1], DT[,2:3], DT[,"someCol"] and DT[,c("colB","colD")]
# work as expected. As before, a vector will never be returned, but a single column data.table
# for type consistency with >1 cases. To return a single vector use DT[["someCol"]] or DT[[3]].
# The root==":" is to allow DT[,colC:colH] even though that contains two variable names.
# root == "-" or "!" is for tests 1504.11 and 1504.13 (a : with a ! or - modifier root)
# We don't want to evaluate j at all in making this decision because i) evaluating could itself
# increment some variable and not intended to be evaluated a 2nd time later on and ii) we don't
# want decisions like this to depend on the data or vector lengths since that can introduce
# inconistency reminiscent of drop=TRUE in [.data.frame that we seek to avoid.
with=FALSE
Basically, "[.data.table" captures the expression passed to j and decides how to treat it based on some predefined rules. If one of the rules is satisfied, it sets with=FALSE, which means that j is treated as a selection of columns (by name or position) using standard evaluation.
The rules are (roughly) as follows:
Set with=FALSE,
1.1. if the j expression is a call and the call is :, or
1.2. if the call is - or ! applied to a parenthesised : call, or
1.3. if j contains no variable names (only values such as character, integer, numeric, etc.) or only names prefixed with .., the call is one of c("","c","paste","paste0","-","!"), and there is no by argument;
otherwise set with=TRUE
So we can convert this into a function and see whether any of the conditions is satisfied (I've skipped converting the . alias to list, as it is irrelevant here; we will just test with list directly):
is_satisfied <- function(...) {
jsub <- substitute(...)
root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
if (root == ":" ||
(root %chin% c("-","!") &&
is.call(jsub[[2L]]) &&
jsub[[2L]][[1L]]=="(" &&
is.call(jsub[[2L]][[2L]]) &&
jsub[[2L]][[2L]][[1L]]==":") ||
( (!length(av<-all.vars(jsub)) || all(substring(av,1L,2L)=="..")) &&
root %chin% c("","c","paste","paste0","-","!"))) TRUE else FALSE
}
is_satisfied("x")
# [1] TRUE
is_satisfied(c("x", "y"))
# [1] TRUE
is_satisfied(..x)
# [1] TRUE
is_satisfied(1:2)
# [1] TRUE
is_satisfied(c(1:2))
# [1] TRUE
is_satisfied((1:2))
# [1] FALSE
is_satisfied(y)
# [1] FALSE
is_satisfied(list(x, y))
# [1] FALSE
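The reason (1:2) falls through is visible from the root of the captured expression alone. A minimal base R sketch, where root_of is a hypothetical helper that mirrors the root variable in the excerpt above:

```r
# Return the function at the root of an unevaluated expression, or "" if the
# expression is not a call (mirrors the `root` variable in the excerpt above).
root_of <- function(expr) {
  e <- substitute(expr)
  if (is.call(e)) as.character(e[[1L]])[1L] else ""
}

root_of(1:2)      # ":"
root_of(c(1:2))   # "c"
root_of((1:2))    # "("
root_of("x")      # ""
```

Since "(" is not among the recognised roots, (1:2) is evaluated as an ordinary expression and its value is returned. Passing with = FALSE explicitly, as in DT[, (1:2), with = FALSE], should select the columns instead; this relies on the documented with argument rather than on anything shown in the excerpt.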

dataframe does not work inside of a function

When trying to generate a data.frame inside a function, I found that the data.frame was not generated when calling the function, even though the same code worked fine outside the function.
Could anybody tell me how this is possible?
Species=c("a","b","c")
data=data.frame(Species)
df=data.frame(matrix(nrow=length(levels(data$Species)),ncol=43))
rm(df)
f<-function(data)
{
df=data.frame(matrix(nrow=length(levels(data$Species)),ncol=43))
}
f(data)
In my Rstudio no data.frame is generated when calling the function f!
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Australia.1252
LC_CTYPE=English_Australia.1252
LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.7.1 maptools_0.8-14 lattice_0.20-0 foreign_0.8-48
geosphere_1.2-26
[6] rgdal_0.7-8 outliers_0.14 XML_3.9-4.1 PBSmapping_2.62.34
dismo_0.7-14
[11] raster_1.9-58 sp_0.9-93
loaded via a namespace (and not attached):
[1] grid_2.14.1 tools_2.14.1
This should not be surprising. The last expression in your function is an assignment, so its value is returned invisibly and nothing is printed, and the local variable df disappears when the function exits. To display the result, do what you would do at the command prompt in R: type the name of the object as the last line of the function. Alternatively, you can use return().
In other words, modify your function as follows (I've changed "df" to "mydf" and "data" to "mydata" to avoid any potential conflicts with base R functions):
f <- function(mydata)
{
mydf = data.frame(matrix(nrow=length(levels(mydata$Species)), ncol=43))
mydf
## Or, more explicitly
## return(mydf)
}
You can now run it using f(data). However, note that this will just display the output, not assign it to an object. If you wanted it assigned to an object ("mydf", for example) you need to use mydf <- f(data).
Another option is to use <<- in your function, which assigns outside the function's own environment (here, in the global environment).
f <- function(mydata)
{
mydf <<- data.frame(matrix(nrow=length(levels(mydata$Species)), ncol=43))
## uncomment the next line if you want to *display* the output too
## mydf
}
> rm(mydf)
> ls(pattern = "mydf")
character(0)
> f(data) ## No output is displayed when you run the function
> ls(pattern = "mydf")
[1] "mydf"
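Whether a call prints at the prompt can be checked with withVisible(). The sketch below uses illustrative names (f_silent, f_visible, species_data) and swaps levels() for unique(), an assumption made so that the row count does not depend on whether Species was created as a factor.

```r
# Original shape of the problem: the last expression is an assignment,
# so the value comes back invisibly and nothing is printed.
f_silent <- function(mydata) {
  mydf = data.frame(matrix(nrow = length(unique(mydata$Species)), ncol = 43))
}

# Ending with the object itself makes the call print at the prompt.
f_visible <- function(mydata) {
  mydf <- data.frame(matrix(nrow = length(unique(mydata$Species)), ncol = 43))
  mydf
}

species_data <- data.frame(Species = c("a", "b", "c"))
withVisible(f_silent(species_data))$visible   # FALSE
withVisible(f_visible(species_data))$visible  # TRUE

result <- f_visible(species_data)  # keep the result by assigning it
dim(result)                        # 3 43
```

Assigning the return value is usually preferable to <<-, since the function then has no side effects on the calling environment.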

Find whether a particular date is an Option Expiration Friday - problem with timeDate package

I am trying to write a simple function that (should) return true if the parameter date(s) is an Op-Ex Friday.
require(timeDate)
require(quantmod)
getSymbols("^GSPC", adjust=TRUE, from="1960-01-01")
assign("SPX", GSPC, envir=.GlobalEnv)
names(SPX) <- c("SPX.Open", "SPX.High", "SPX.Low", "SPX.Close",
"SPX.Volume", "SPX.Adjusted")
dates <- last(index(SPX), n=10)
from <- as.numeric(format(as.Date(min(dates)), "%Y"))
to <- as.numeric(format(as.Date(max(dates)), "%Y"))
isOpExFriday <- ifelse(
isBizday(
timeDate(as.Date(dates)),
holidayNYSE(from:to)) & (as.Date(dates) == as.Date(
format(timeNthNdayInMonth(timeFirstDayInMonth(dates), nday=5, nth=3)))
), TRUE, FALSE)
Now, the result should be [1] "2011-09-16". But instead I get [1] "2011-09-15":
dates[isOpExFriday]
[1] "2011-09-15"
Am I doing something wrong, am I expecting something that the timeDate package is not designed to do, or is there a bug in timeDate?
I am guessing it's a timezone problem. What happens if you use this:
format(dates[isOpExFriday], tz="UTC")
On second look, you probably need to put the 'tz=' argument inside the format call inside the as.Date(format(...)) call. The format function "freezes" that date's value as text.
EDIT: On testing, however, I think you are right about it being a bug. (And I sent a bug report to the maintainer with this response.) Even after trying to insert various timezone specs and setting myFinCenter in RmetricsOptions, I still get the wrong date, which stems from this error deep inside your choice of functions:
timeNthNdayInMonth(as.Date("2011-09-01"), nday=5, nth=3)
America/New_York
[1] [2011-09-15]
I suspect it is because of this code since as I understand it Julian dates are not adjusted for timezones or daylight savings times:
ct = 24 * 3600 * (as.integer(julian.POSIXt(lt)) +
(nth - 1) * 7 + (nday - lt1$wday)%%7)
class(ct) = "POSIXct"
The ct value in seconds is then converted to POSIXct from seconds since "origin" simply by coercion of class. If I change the code to:
ct=as.POSIXct(ct, origin="1970-01-01") # correct results come back
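How a day count turned into seconds can land on the previous calendar day is easy to reproduce in base R. The sketch below is not the timeDate code; it only assumes that the seconds value represents midnight UTC and shows that the displayed date depends on the time zone used for formatting.

```r
# Midnight UTC on the expected date, expressed as seconds since the epoch.
days <- as.integer(as.Date("2011-09-16"))
secs <- 24 * 3600 * days

ct <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

format(ct, "%Y-%m-%d", tz = "UTC")   # "2011-09-16"

# In a zone west of Greenwich the same instant is still the previous evening,
# so the date part comes out one day earlier (needs the tz database).
format(ct, "%Y-%m-%d %H:%M", tz = "America/New_York")
```

Any step that formats such a value in America/New_York and then reads back only the date would therefore report 15 September instead of 16 September.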
My quantmod and timeDate versions are both current per CRAN. Running Mac with R 2.13.1 in 64 bit mode with a US locale. I have not yet tried to reproduce with a minimal session so there could still be some collision or hijacking with other packages:
> sessionInfo()
R version 2.13.1 RC (2011-07-03 r56263)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] quantmod_0.3-17 TTR_0.20-3 xts_0.8-2
[4] Defaults_1.1-1 timeDate_2130.93 zoo_1.7-4
[7] gplots_2.10.1 KernSmooth_2.23-6 caTools_1.12
[10] bitops_1.0-4.1 gdata_2.8.1 gtools_2.6.2
[13] wordnet_0.1-8 ggplot2_0.8.9 proto_0.3-9.2
[16] reshape_0.8.4 plyr_1.6 rattle_2.6.10
[19] RGtk2_2.20.17 rms_3.3-1 Hmisc_3.8-3
[22] survival_2.36-9 sos_1.3-0 brew_1.0-6
[25] lattice_0.19-30

Resources