*_join with empty suffix - r

Fair warning: this can hang your operating system.
*_join() from dplyr fails when either the left or the right suffix is specified as empty (''), e.g.
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('', '.b'))
Whereas the following works fine:
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('.a', '.b'))
Meanwhile, the S3 generic merge() (base) has no problem with empty suffixes:
merge(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffixes=c('', '.b'))
dplyr package info:
> packageVersion('dplyr')
[1] ‘0.5.0’
R version info:
> version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational

I stumbled across this bug too. The following workaround gives the desired suffixes '' and '.b' with dplyr:
library(dplyr)
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('.a', '.b')) %>%
setNames(gsub('\\.a$', '', names(.)))
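The renaming step above can also be wrapped in a small helper. The following is a minimal sketch in base R only, assuming the goal is simply to drop one known suffix from the joined column names. strip_suffix is a hypothetical name, not part of dplyr, and merge() stands in for the join so the example runs without extra packages.

```r
# Hypothetical helper: remove a trailing suffix from column names.
# Uses endsWith()/substr() rather than a regex, so the suffix needs no escaping.
strip_suffix <- function(df, suffix) {
  nm <- names(df)
  hit <- endsWith(nm, suffix)
  nm[hit] <- substr(nm[hit], 1L, nchar(nm[hit]) - nchar(suffix))
  names(df) <- nm
  df
}

# Join with two non-empty suffixes, then strip the left one.
joined <- merge(data.frame(x = 1, y = 2),
                data.frame(x = 1, y = 3),
                by = "x",
                suffixes = c(".a", ".b"))
result <- strip_suffix(joined, ".a")
names(result)
# [1] "x"   "y"   "y.b"
```

The same helper should work on the output of inner_join(), since it only touches names().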

Related

\xe8 matching \xf1 in str_detect() and str_replace_all()

I want to process text files in R that include some characters shown as hexadecimal escapes. When I tried to convert those back into more readable characters, I encountered some unexpected (to me) behaviour of stringr functions. Specifically, \xe8 apparently matches \xf1:
> library("tidyverse")
> str <- "ni\xf1a"
> str_detect(str, "\xe8")
[1] TRUE
This is inconvenient when I want to convert \xe8 into è and \xf1 into ñ in the same files:
> str %>%
+ str_replace_all("\xe8", "è") %>%
+ str_replace_all("\xf1", "ñ")
[1] "nièa" # I expect niña
Interestingly, gsub() works as I expect:
> str %>%
+ gsub("\xe8", "è", .) %>%
+ gsub("\xf1", "ñ", .)
[1] "niña"
Why does \xe8 match \xf1 in str_detect() and str_replace_all()? Is there a way to avoid it?
Why is the behaviour different between stringr functions and gsub()?
Update
Here is part of the output of devtools::session_info():
> devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui RStudio
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2020-09-30
─ Packages ──────────────────────────────────────────────────────────────────────
package * version date lib source
...
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
...
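One way to sidestep the problem, sketched here under the assumption that the \xNN escapes are Latin-1 bytes: such strings are not valid UTF-8, and stringr (via stringi) works on Unicode text, so a likely explanation is that both invalid bytes end up treated as the same substitute character. Re-encoding the input once with base R's iconv() before any pattern matching avoids the per-character replacements altogether. The variable names are illustrative.

```r
# A Latin-1 byte string: 0xF1 is "ñ" in Latin-1 but is not valid UTF-8 on its own.
raw_text <- "ni\xf1a"
is_valid_before <- validUTF8(raw_text)

# Assumption: the source bytes are Latin-1. Re-encode the whole string
# instead of replacing each escaped byte separately.
fixed_text <- iconv(raw_text, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(fixed_text)

# Code points of the converted string: n, i, U+00F1 (ñ), a
utf8ToInt(fixed_text)
```

If the files use a different single-byte encoding, the from argument would need to change accordingly.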

Why did R's sorting change data imported with load() after an upgrade from 3.5.2 to 4.0.0?

Short version: I load() data in a package. A test in the package used to pass; now it fails because the output of sort changed.
Here is a minimal reproducible example - for details see below:
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# OLD 3.5.2 [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# NEW 4.0.0 [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# Update 4.0.2 see comment:
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From jay.sf's comment
sort.int(y, method="radix")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
sort.int(y, method="shell")
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From Henrik's comment:
data.table::fsort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
The only related reported change I found is
CHANGES IN R 4.0.0
NEW FEATURES
...
When loading data sets via read.table(), data() now uses LC_COLLATE=C to ensure locale-independent results for possible string-to-factor conversions.
But I am not even sure whether this could explain what I see.
As I want to minimize the number of imported packages and I would like to understand what's going on, I am not sure how to proceed. Am I missing something?
(A change to sort.int with method radix would do the job, but still: why did it change? Is that really better?)
I just realized (thanks to Roland) that in my case sort calls sort.int:
function (x, decreasing = FALSE, na.last = NA, ...)
{
    if (is.object(x))
        x[order(x, na.last = na.last, decreasing = decreasing)]
    else sort.int(x, na.last = na.last, decreasing = decreasing,
        ...)
}
From ?sort.int:
The "auto" method selects "radix" for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, "shell".
And according to the docs, sort.int did not change from 4.0.0 to 4.0.2.
From ?data.table::setorder
data.table always reorders in "C-locale". As a consequence, the
ordering may be different to that obtained by base::order. In English
locales, for example, sorting is case-sensitive in C-locale. Thus,
sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but
c("a", "B", "c") in base::order. Note this makes no difference in most
cases of data; both return identical results on ids where only
upper-case or lower-case letters are present ("AB123" < "AC234" is
true in both), or on country names and other proper nouns which are
consistently capitalized. For example, neither "America" < "Brazil"
nor "america" < "brazil" are affected since the first letter is
consistently capitalized.
Using C-locale makes the behaviour of sorting in data.table more
consistent across sessions and locales. The behaviour of base::order
depends on assumptions about the locale of the R session. In English
locales, "america" < "BRAZIL" is true by default but false if you
either type Sys.setlocale(locale="C") or the R session has been
started in a C locale for you – which can happen on servers/services
since the locale comes from the environment the R session was started
in. By contrast, "america" < "BRAZIL" is always FALSE in data.table
regardless of the way your R session was started.
(Related questions: "Language dependent sorting with R" and "Best practice: Should I try to change to UTF-8 as locale or is it safe to leave it as is?")
Details
R.version # old _
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.2
year 2018
month 12
day 20
svn rev 75870
language R
version.string R version 3.5.2 (2018-12-20)
nickname Eggshell Igloo
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# =======
R.version # new after upgrade
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
#[1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# ==== Test with new 4.0.2
R.version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.2
year 2020
month 06
day 22
svn rev 78730
language R
version.string R version 4.0.2 (2020-06-22)
nickname Taking Off Again
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
In summary, as @Roland figured out, it was a bug that was fixed in R version 4.0.1.
From CRAN:
In R 4.0.0, sort.list(x) when is.object(x) was true, e.g., for x <- I(letters), was accidentally using method = "radix". Consequently,
e.g., merge(<data.frame>) was much slower than previously; reported in
PR#17794.
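For a test that should not depend on the session locale or on which method sort() picks internally, one option is to request the radix method explicitly; the ?sort documentation describes it as not respecting the locale for character vectors. A minimal sketch, reusing the vector from the question:

```r
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")

# Radix sorting uses C-locale order, so uppercase letters sort before
# lowercase ones regardless of LC_COLLATE.
c_order <- sort(y, method = "radix")
c_order
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"

# The default, locale-aware result depends on this setting:
Sys.getlocale("LC_COLLATE")
```

Whether C-locale order is the right expectation for the package's test is a separate design choice; the point is only that the expected value no longer varies between machines.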

Why does rbind on a data.table with more than 254 columns reorder column names?

I am not sure of the extent of this side effect. Why is this happening? What precautions does one need to take?
library(data.table)
# Start with one column, then append 254 more (x_1 ... x_254)
dt <- data.table(
  sample = 1
)
i <- 1
while (i <= 254) {
  col <- paste("x", i, sep = "_")
  dt[[col]] <- i
  i <- i + 1
}
> combined_dt <- rbind(dt, dt)
> print(head(names(combined_dt))) # Columns get reordered
[1] "sample" "x_5" "x_6" "x_1" "x_2" "x_3"
>
> combined_dt <- rbindlist(list(dt, dt))
> print(head(names(combined_dt))) # Columns do not get reordered
[1] "sample" "x_1" "x_2" "x_3" "x_4" "x_5"
R details
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On

Same seed, different OS, different random numbers in R

I was experiencing inconsistent results between two machines and a linux server, until I realized that fixing the seed was having different effects. I am running different R versions in all of them, all above 3.3.0. Here are the examples:
Linux 1
> set.seed(10); rnorm(1)
[1] -0.4463588
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
Linux 2
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
Mac OS
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.3
year 2017
month 11
day 30
svn rev 73796
language R
version.string R version 3.4.3 (2017-11-30)
nickname Kite-Eating Tree
Windows
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
The first Linux machine generates a different random number from the same seed, so the results of a script are not fully reproducible (they may or may not agree, depending on the machine on which the script is re-run). This is annoying.
I do not know what is happening here. Particularly:
(1) Is it an issue with R's versions or something more involved?
(2) How can this inconsistent behaviour be avoided? Any help is appreciated.
EDIT prompted by @Jesse Tweedle's answer (output on Linux 1 in a new session):
> set.seed(10); rnorm(1)
[1] -0.4463588
> set.seed(10); rnorm(1)
[1] -0.4463588
> set.seed(102); rnorm(1)
[1] 0.05752965
> set.seed(10, kind = "Mersenne-Twister"); rnorm(1)
[1] 0.01874617
> set.seed(10); rnorm(1)
[1] 0.01874617
> set.seed(102); rnorm(1)
[1] 0.1805229
From the ?Random docs:
RNGversion can be used to set the random generators as they were in an earlier R version (for reproducibility).
So try this on all systems:
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion"); rnorm(1)
[1] 0.01874617
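To check what a given machine is actually using, RNGkind() reports the current generators. A small sketch, assuming the goal is to pin both the uniform and the normal generator explicitly at the top of a script rather than rely on whatever the session inherited:

```r
# Pin both generators together with the seed, then draw.
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion")
draw <- rnorm(1)

# Inspect the generators now in effect (newer R versions also report a
# third element for the sampling method, so only the first two are shown).
kinds <- RNGkind()
kinds[1:2]
# [1] "Mersenne-Twister" "Inversion"
draw
# [1] 0.01874617
```

Running RNGkind() on the machine that disagrees, before setting anything, should show which generator differs from these defaults.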

square bracket "[" operator extracting inaccurate subset

In the following code block I would expect the 5578 to be 650. I am unclear why it is not.
tmp <- tempfile(fileext = ".dat")
download.file("https://github.com/vz-risk/VCDB/raw/master/data/verisr/vcdb.dat", tmp, quiet=TRUE)
load(tmp, verbose=TRUE)
> dim(vcdb[vcdb$plus.dbir_year == 2018, ])
[1] 5578 2393
> vcdb %>% dplyr::filter(plus.dbir_year ==2018) %>% dim()
[1] 650 2393
> table(vcdb$plus.dbir_year == 2018)
FALSE TRUE
2211 650
This was tried across environment resets and in two different users' environments, and with nrow() as well as dim(). vcdb is a data.frame. This was not tested with other data frames or columns.
Version information:
version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
edit1: Updated with environment version information. Code updated to be reproducible.
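One common cause of this pattern, and one the table() output above is consistent with, is NA values in the column: a logical index containing NA makes "[" return an all-NA row for each NA, whereas dplyr::filter() and table() drop them. If that is what is happening here, 5578 - 650 = 4928 rows would have a missing plus.dbir_year. The following is a minimal sketch on a toy data frame with made-up values, not the VCDB data:

```r
# Toy data frame (hypothetical values) with missing years.
df <- data.frame(year = c(2018, NA, 2017, NA, 2018))

cond <- df$year == 2018          # TRUE NA FALSE NA TRUE

# Logical indexing keeps one all-NA row for every NA in the condition.
n_bracket <- nrow(df[cond, , drop = FALSE])        # 4

# which() returns only the TRUE positions, so the NA rows are dropped.
n_which <- nrow(df[which(cond), , drop = FALSE])   # 2

# Number of comparisons that evaluated to NA.
n_missing <- sum(is.na(cond))                      # 2
```

On the real data, sum(is.na(vcdb$plus.dbir_year)) would confirm or rule out this explanation.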
