\xe8 matching \xf1 in str_detect() and str_replace_all()

I want to process text files in R that include some characters shown as hexadecimal escapes. When I tried to convert those back into more readable characters, I encountered some behaviour of the stringr functions that was unexpected to me. Specifically, \xe8 apparently matches \xf1:
> library("tidyverse")
> str <- "ni\xf1a"
> str_detect(str, "\xe8")
[1] TRUE
This is inconvenient when I want to convert \xe8 into è and \xf1 into ñ in the same files:
> str %>%
+ str_replace_all("\xe8", "è") %>%
+ str_replace_all("\xf1", "ñ")
[1] "nièa" # I expect niña
Interestingly, gsub() works as I expect:
> str %>%
+ gsub("\xe8", "è", .) %>%
+ gsub("\xf1", "ñ", .)
[1] "niña"
Why does \xe8 match \xf1 in str_detect() and str_replace_all()? Is there a way to avoid it?
Why is the behaviour different between stringr functions and gsub()?
Update
Here is part of the output of devtools::session_info():
> devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui RStudio
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2020-09-30
─ Packages ──────────────────────────────────────────────────────────────────────
package * version date lib source
...
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
...
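One way to sidestep the ambiguity, assuming the files really are Latin-1 encoded (an assumption, not something established in the question), is to convert or declare the encoding before any stringr call, so the patterns match distinct characters rather than raw bytes; a minimal sketch:
library(stringr)

# Sketch, assuming Latin-1 input: convert the bytes to UTF-8 up front so that
# \xe8 (è) and \xf1 (ñ) become distinct characters rather than ambiguous bytes.
str <- "ni\xf1a"
str_utf8 <- iconv(str, from = "latin1", to = "UTF-8")

str_utf8                   # "niña" -- no str_replace_all() needed at all
str_detect(str_utf8, "è")  # FALSE
str_detect(str_utf8, "ñ")  # TRUE
The difference between the two families of functions presumably comes from stringi normalising its inputs to UTF-8 (where \xe8 and \xf1 are both invalid byte sequences) while gsub() falls back to byte-wise matching, but converting the encoding explicitly avoids relying on either behaviour.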

Related

Polynomial Function Expansion in R

I am currently reviewing this question on SO, where the OP states that by adding more for loops you can expand the polynomials to higher orders. How exactly would you do so? I am trying to expand to polynomial order 5.
Polynomial feature expansion in R
Here is the code below:
polyexp = function(df) {
  df.polyexp = df
  colnames = colnames(df)
  for (i in 1:ncol(df)) {
    for (j in i:ncol(df)) {
      colnames = c(colnames, paste0(names(df)[i], '.', names(df)[j]))
      df.polyexp = cbind(df.polyexp, df[, i] * df[, j])
    }
  }
  names(df.polyexp) = colnames
  return(df.polyexp)
}
Ultimately, I'd like to order the matrix so that it expands in order of degree. I tried using the poly function but I'm not sure if you can order the result so that it returns a matrix that starts with degree 1 then moves to degree 2, then 3, 4, and 5.
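For the literal "more for loops" route, here is a hedged, generic sketch (the function name polyexp_n and its structure are mine, not from the linked question): for each total degree d up to the requested order, take every combination of column indices with repetition and multiply those columns together.
# Sketch only; assumes degree >= 2 and mirrors polyexp()'s naming scheme.
polyexp_n <- function(df, degree = 5) {
  result <- df
  nms <- names(df)
  for (d in 2:degree) {
    # index tuples i1 <= i2 <= ... <= id, i.e. combinations with repetition
    idx <- expand.grid(rep(list(seq_len(ncol(df))), d))
    idx <- idx[apply(idx, 1, function(r) all(diff(r) >= 0)), , drop = FALSE]
    for (k in seq_len(nrow(idx))) {
      cols <- unlist(idx[k, ], use.names = FALSE)
      result <- cbind(result, Reduce(`*`, df[cols]))
      nms <- c(nms, paste(names(df)[cols], collapse = "."))
    }
  }
  names(result) <- nms
  result
}
# e.g. polyexp_n(data.frame(x = 1:6, y = 0:5, z = -(1:6)), degree = 5)
The polym()-based approach below avoids the explicit loops altogether.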
To "sort by degree" is a little ambiguous. x^2 and x*y both have degree 2. I'll assume you want to sort by total degree, and then within each of those, by degree of the 1st column; within that, by degree of the second column, etc. (I believe the default is to ignore total degree and sort by degree of the last column, within that the second last, and so on, but this is not documented so I won't count on it.)
Here's how to use polym to do this. The columns are named things like "2.0" or "1.1". You could sort these alphabetically and it would be fine up to degree 9, but if you convert those names using as.numeric_version, there's no limit. So convert the column names to version names, get the sort order, and use that plus degree to re-order the columns of the result. For example,
df <- data.frame(x = 1:6, y = 0:5, z = -(1:6))
expanded <- polym(as.matrix(df), degree = 5)
o <- order(attr(expanded, "degree"),
           as.numeric_version(colnames(expanded)))
sorted <- expanded[,o]
# That lost the attributes, so put them back
attr(sorted, "degree") <- attr(expanded, "degree")[o]
attr(sorted, "coefs") <- attr(expanded, "coefs")
class(sorted) <- class(expanded)
# If you call predict(), it comes out in the default order,
# so will need sorting too:
predict(sorted, newdata = as.matrix(df[1,]))[, o]
#> 0.0.1 0.1.0 1.0.0 0.0.2 0.1.1 0.2.0
#> 0.59761430 -0.59761430 -0.59761430 0.54554473 -0.35714286 0.54554473
#> 1.0.1 1.1.0 2.0.0 0.0.3 0.1.2 0.2.1
#> -0.35714286 0.35714286 0.54554473 0.37267800 -0.32602533 0.32602533
#> 0.3.0 1.0.2 1.1.1 1.2.0 2.0.1 2.1.0
#> -0.37267800 -0.32602533 0.21343368 -0.32602533 0.32602533 -0.32602533
#> 3.0.0 0.0.4 0.1.3 0.2.2 0.3.1 0.4.0
#> -0.37267800 0.18898224 -0.22271770 0.29761905 -0.22271770 0.18898224
#> 1.0.3 1.1.2 1.2.1 1.3.0 2.0.2 2.1.1
#> -0.22271770 0.19483740 -0.19483740 0.22271770 0.29761905 -0.19483740
#> 2.2.0 3.0.1 3.1.0 4.0.0 0.0.5 0.1.4
#> 0.29761905 -0.22271770 0.22271770 0.18898224 0.06299408 -0.11293849
#> 0.2.3 0.3.2 0.4.1 0.5.0 1.0.4 1.1.3
#> 0.20331252 -0.20331252 0.11293849 -0.06299408 -0.11293849 0.13309928
#> 1.2.2 1.3.1 1.4.0 2.0.3 2.1.2 2.2.1
#> -0.17786140 0.13309928 -0.11293849 0.20331252 -0.17786140 0.17786140
#> 2.3.0 3.0.2 3.1.1 3.2.0 4.0.1 4.1.0
#> -0.20331252 -0.20331252 0.13309928 -0.20331252 0.11293849 -0.11293849
#> 5.0.0
#> -0.06299408
Created on 2020-03-21 by the reprex package (v0.3.0)

Crayons concat gives NULL

I am using
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3
and tidyverse_1.2.1. Using the %+% operator provided by the crayon package (loaded by tidyverse) gives NULL. Why? Is this a bug?
E.g. reproducing the example from the manual gives:
> "foo" %+% "bar" %>% print
NULL
instead of "foobar".
ggplot2 has its own version of %+%, which can mask the one from crayon. If I make sure that I load ggplot2/tidyverse first, before loading crayon, I get the expected results:
> library(tidyverse)
-- Attaching packages ---------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0 v purrr 0.2.5
v tibble 1.4.2 v dplyr 0.7.8
v tidyr 0.8.2 v stringr 1.3.1
v readr 1.2.1 v forcats 0.3.0
-- Conflicts ------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
> library(crayon)
Attaching package: ‘crayon’
The following object is masked from ‘package:ggplot2’:
%+%
> "foo" %+% "bar" %>% print
[1] "foobar"
This is indeed just because both ggplot2 and crayon define a %+% function! Which function gets used then depends on the order in which the packages are loaded, which makes your code fragile.
To make sure you avoid any conflict, you can give these operators an alias, for example (see the related Stack Overflow post):
library(tidyverse)
`%+c%` <- crayon::`%+%`
"foo" %+% "bar" %>% print
#> NULL
"foo" %+c% "bar" %>% print
#> [1] "foobar"
Created on 2021-08-13 by the reprex package (v0.3.0)

square bracket "[" operator extracting inaccurate subset

In the following code block I would expect the 5578 to be 650. I am unclear why it is not.
tmp <- tempfile(fileext = ".dat")
download.file("https://github.com/vz-risk/VCDB/raw/master/data/verisr/vcdb.dat", tmp, quiet=TRUE)
load(tmp, verbose=TRUE)
> dim(vcdb[vcdb$plus.dbir_year == 2018, ])
[1] 5578 2393
> vcdb %>% dplyr::filter(plus.dbir_year ==2018) %>% dim()
[1] 650 2393
> table(vcdb$plus.dbir_year == 2018)
FALSE TRUE
2211 650
This was tried across environment resets and in two different users' environments, and with nrow() as well as dim(). vcdb is a data.frame. This was not tested with other data frames or columns.
Version information:
version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
edit1: Updated with environment version information. Code updated to be reproducible.
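A likely explanation (shown with a made-up toy data frame rather than vcdb): if plus.dbir_year contains missing values, the test plus.dbir_year == 2018 is NA for those rows; base R's [ keeps one all-NA row for every NA in a logical index, while dplyr::filter() and table() drop them, which would be consistent with 5578 versus 650 above.
library(dplyr)

# Hypothetical stand-in for vcdb$plus.dbir_year, with some NAs
df <- data.frame(year = c(2018, 2017, NA, NA), x = 1:4)

nrow(df[df$year == 2018, ])         # 3: the TRUE row plus one all-NA row per NA
nrow(filter(df, year == 2018))      # 1: filter() drops rows where the test is NA
nrow(df[which(df$year == 2018), ])  # 1: wrapping the test in which() drops NAs too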

How should I use the uq() function inside a package?

I'm puzzled by the behaviour of the uq() function. The behaviour is not the same when I use uq() as when I use lazyeval::uq().
Here is my reproducible example:
First, I generate a fake dataset
library(tibble)
library(lazyeval)
fruits <- c("apple", "banana", "peanut")
price <- c(5,6,4)
table_fruits <- tibble(fruits, price)
Then I write a toy function, toy_function_v1, using only uq():
toy_function_v1 <- function(data, var) {
  lazyeval::f_eval(f = ~ uq(var), data = data)
}
and a second function using lazyeval::uq():
toy_function_v2 <- function(data, var) {
  lazyeval::f_eval(f = ~ lazyeval::uq(var), data = data)
}
Surprisingly, the output of v1 and v2 is not the same:
> toy_function_v1(data = table_fruits, var = ~ price)
[1] 5 6 4
> toy_function_v2(data = table_fruits, var = ~ price)
price
Is there any explanation?
I know it's good practice to use the package::function() syntax for functions used inside a new package. So what's the best solution in that case?
Here is my session_info:
> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system x86_64, linux-gnu
ui RStudio (1.0.35)
language (EN)
collate C
tz <NA>
date 2016-11-07
Packages --------------------------------------------------------------------------------------------------------------------------------------------------------
package * version date source
Rcpp 0.12.7 2016-09-05 CRAN (R 3.2.3)
assertthat 0.1 2013-12-06 CRAN (R 3.2.2)
devtools 1.12.0 2016-06-24 CRAN (R 3.2.3)
digest 0.6.10 2016-08-02 CRAN (R 3.2.3)
lazyeval * 0.2.0.9000 2016-10-14 Github (hadley/lazyeval#c155c3d)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
tibble * 1.2 2016-08-26 CRAN (R 3.2.3)
withr 1.0.2 2016-06-20 CRAN (R 3.2.3)
It's just a bug in the uq() function. The issue is open on GitHub: https://github.com/hadley/lazyeval/issues/78.
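Until that is fixed, a possible workaround for package code (not taken from the issue, so treat it as a sketch) is to import uq() into the package namespace and call it unqualified inside the formula, since the unprefixed form in toy_function_v1 behaves as expected:
# Hypothetical package function: the @importFrom tag satisfies R CMD check,
# and the unqualified uq() call behaves like toy_function_v1 above.
#' @importFrom lazyeval f_eval uq
toy_function_v3 <- function(data, var) {
  f_eval(f = ~ uq(var), data = data)
}
# toy_function_v3(data = table_fruits, var = ~ price)
# [1] 5 6 4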

Colophon for an R book

At the end of an R book, I'd like to show the versions of the main R packages used to compile the book. I'm wondering if there is anything I could do better than just use sessionInfo() in a chunk, e.g.,
\section*{Colophon}
This book was produced using \Sexpr{R.version.string},
\pkg{knitr} (\Sexpr{packageDescription("knitr")[["Version"]]})
and other package versions listed below.
<<session-info, size='footnotesize',R.options=list(width=90)>>=
print(sessionInfo(), locale = FALSE)
@
In particular, sessionInfo() lists all packages loaded indirectly as well as those loaded directly.
```{r}
library(knitr)
p = devtools::loaded_packages()
p$version = unlist(lapply(p$package, function(x) as.character(packageVersion(x))))
kable(p[order(p$package),], row.names=FALSE)
```
If you do not have devtools installed, steal the code from loaded_packages.
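If you want a rough, devtools-free approximation (a sketch; not necessarily identical to what loaded_packages() returns), you can build a similar table from loadedNamespaces() and packageVersion():
# List every loaded namespace (so dependencies are included) with its version,
# using only base/utils plus knitr for the table.
p <- data.frame(package = sort(loadedNamespaces()), stringsAsFactors = FALSE)
p$version <- vapply(p$package,
                    function(x) as.character(packageVersion(x)),
                    character(1))
knitr::kable(p, row.names = FALSE)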
This will give a comma-separated list of the packages loaded into the current session:
pkgs <- sort(sub("package:", "", grep("package:", search(), value = TRUE)));
toString(Map(function(p) sprintf("%s (%s)", p, packageVersion(p)), pkgs))
giving this string which you can insert by placing the code above in a \Sexpr:
[1] "base (3.2.0), datasets (3.2.0), graphics (3.2.0), grDevices (3.2.0), methods (3.2.0), stats (3.2.0), utils (3.2.0)"
Only core R functions are used in this code.
I don't want to list all packages loaded (base packages, dependencies) in the current session, so I came up with a better solution for my needs. Maybe this will be useful to someone else.
1. Find all packages explicitly loaded via library() in the .Rnw files for the book.
2. Use devtools:::package_info() for formatting.
For (1.), I used the following pipeline of shell commands, all standard except for my trusty tcgrep Perl script, which finds strings in files recursively:
tcgrep -E Rnw '^library(.*)' . \
| grep '/ch' \
| perl -p -e 's/^.*://; s/\s*#.*//' \
| perl -p -e 's/library\(([\w\d]+)\)/"$1"/g; s/;/, /' \
| sort -u | perl -p -e 's/\n/, /' > packages-used.R
This gave me
packages <- c(
"AER", "ca", "car", "colorspace", "corrplot", "countreg", "directlabels", "effects", "ggparallel", "ggplot2", "ggtern", "gmodels", "gnm", "gpairs", "heplots", "Lahman", "lattice", "lmtest", "logmult", "MASS", "MASS", "countreg", "mgcv", "nnet", "plyr", "pscl", "RColorBrewer", "reshape2", "rms", "rsm", "sandwich", "splines", "UBbipl", "vcd", "vcdExtra", "VGAM", "xtable")
Then for (2.),
library(devtools)
pkg_info <- devtools:::package_info(packages)
# clean up unwanted
pkg_info$source <- sub(" \\(R.*\\)", "", pkg_info$source)
pkg_info <- pkg_info[,-2]
pkg_info
I like the result because it also identifies non-CRAN (development version) packages. I could also format this with kable:
> pkg_info
package version date source
AER 1.2-3 2015-02-24 CRAN
ca 0.60 2015-03-01 R-Forge
car 2.0-25 2015-03-03 R-Forge
colorspace 1.2-6 2015-03-11 CRAN
corrplot 0.73 2013-10-15 CRAN
countreg 0.1-2 2014-10-17 R-Forge
directlabels 2013.6.15 2013-07-23 CRAN
effects 3.0-4 2015-03-22 R-Forge
ggparallel 0.1.1 2012-09-09 CRAN
ggplot2 1.0.1 2015-03-17 CRAN
ggtern 1.0.5.0 2015-04-15 CRAN
gmodels 2.15.4.1 2013-09-21 CRAN
gnm 1.0-8 2015-04-22 CRAN
gpairs 1.2 2014-03-09 CRAN
heplots 1.0-15 2015-04-18 CRAN
Lahman 3.0-1 2014-09-13 CRAN
lattice 0.20-31 2015-03-30 CRAN
lmtest 0.9-33 2014-01-23 CRAN
logmult 0.6.2 2015-04-22 CRAN
MASS 7.3-40 2015-03-21 CRAN
mgcv 1.8-6 2015-03-31 CRAN
nnet 7.3-9 2015-02-11 CRAN
plyr 1.8.2 2015-04-21 CRAN
pscl 1.4.9 2015-03-29 CRAN
RColorBrewer 1.1-2 2014-12-07 CRAN
reshape2 1.4.1 2014-12-06 CRAN
rms 4.3-1 2015-05-01 CRAN
rsm 2.7-2 2015-05-13 CRAN
sandwich 2.3-3 2015-03-26 CRAN
UBbipl 3.0.4 2013-10-13 local
vcd 1.4-0 2015-04-20 local
vcdExtra 0.6-8 2015-04-16 CRAN
VGAM 0.9-8 2015-05-11 CRAN
xtable 1.7-4 2014-09-12 CRAN
If you're using LaTeX, you could simply generate a bibliography for all the packages using:
%% begin.rcode rubber, results = 'asis', cache = FALSE
% write_bib(file = "generated.bib")
%% end.rcode
You can put this after your \end{document} and add a corresponding \bibliography{mybib,generated} entry. This way you can also reference the packages with the usual \cite{}.
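For reference, write_bib() can also be called directly with an explicit vector of package names (the names below are just examples); the keys it writes default to the form R-<package>, so the entries can be cited as \cite{R-ggplot2} and so on:
library(knitr)
# Sketch: one BibTeX entry per listed package, written to generated.bib
write_bib(c("knitr", "ggplot2", "lattice"), file = "generated.bib")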
