*_join with empty suffix


Fair warning: this can hang your operating system.
*_join() from dplyr fails when either the left or the right suffix is specified as empty (''), e.g.
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('', '.b'))
Whereas the following works fine:
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('.a', '.b'))
Meanwhile, the S3 generic merge() (base) has no problem with empty suffixes:
merge(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffixes=c('', '.b'))
dplyr package info:
> packageVersion('dplyr')
[1] ‘0.5.0’
R version info:
> version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational

I had fun stumbling across this bug. The following dplyr workaround gives the effect of the suffixes '' and '.b': join with a placeholder suffix '.a', then strip it from the column names.
library(dplyr)
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('.a', '.b')) %>%
setNames(gsub('\\.a$', '', names(.)))
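The rename step in the workaround above can be wrapped in a small helper. This is only a sketch: strip_suffix is a hypothetical name, not a dplyr function, and the data frame below is a hand-built stand-in for the output of inner_join(..., suffix=c('.a', '.b')), so the example runs in base R.

```r
# Hypothetical helper (not part of dplyr): drop a literal suffix from column names.
strip_suffix <- function(df, suffix) {
  # \Q...\E quotes the suffix so that characters such as "." match literally.
  pattern <- paste0("\\Q", suffix, "\\E$")
  names(df) <- sub(pattern, "", names(df), perl = TRUE)
  df
}

# Stand-in for the result of inner_join(..., suffix = c(".a", ".b")).
joined <- data.frame(x = 1, y.a = 2, y.b = 3)
names(strip_suffix(joined, ".a"))
# [1] "x"   "y"   "y.b"
```

Note that stripping the placeholder can produce duplicate names if a column with the bare name already exists, so pick a placeholder that cannot collide.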

Related

\xe8 matching \xf1 in str_detect() and str_replace_all()

I want to process text files in R that include some characters written as hexadecimal escapes. When I tried to convert them back into more readable characters, I encountered some behaviours of stringr functions that I did not expect. Specifically, \xe8 apparently matches \xf1:
> library("tidyverse")
> str <- "ni\xf1a"
> str_detect(str, "\xe8")
[1] TRUE
This is inconvenient when I want to convert \xe8 into è and \xf1 into ñ in the same files:
> str %>%
+ str_replace_all("\xe8", "è") %>%
+ str_replace_all("\xf1", "ñ")
[1] "nièa" # I expect niña
Interestingly, gsub() works as I expect:
> str %>%
+ gsub("\xe8", "è", .) %>%
+ gsub("\xf1", "ñ", .)
[1] "niña"
Why does \xe8 match \xf1 in str_detect() and str_replace_all()? Is there a way to avoid it?
Why is the behaviour different between stringr functions and gsub()?
Update
Here is part of the output of devtools::session_info():
> devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui RStudio
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2020-09-30
─ Packages ──────────────────────────────────────────────────────────────────────
package * version date lib source
...
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
...
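One way to sidestep the problem, assuming the files are Latin-1 encoded (the question does not say so, but \xe8 and \xf1 are the Latin-1 codes for è and ñ), is to re-encode the text to UTF-8 once instead of replacing bytes one by one. A lone \xf1 or \xe8 byte is not valid UTF-8, which may be why the two are not told apart; that explanation is a guess, not something confirmed in the thread. A minimal base R sketch:

```r
# Assumption: the escaped bytes are Latin-1 (0xF1 is n-tilde, 0xE8 is e-grave).
raw_str <- "ni\xf1a"

# Re-encode to UTF-8 up front; no per-character replacement is needed afterwards.
fixed <- iconv(raw_str, from = "latin1", to = "UTF-8")

Encoding(fixed)        # "UTF-8"
fixed == "ni\u00f1a"   # TRUE
```

After this conversion the string is valid UTF-8, so pattern functions receive well-formed input.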

Why did R's sorting change data imported with load() after an upgrade from 3.5.2 to 4.0.0?

Short version: I load() data in a package. A test in the package used to pass; now it fails because the output of sort() changed.
Here is a minimal reproducible example - for details see below:
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# OLD 3.5.2 [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# NEW 4.0.0 [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# Update 4.0.2 see comment:
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From jay.sf's comment
sort.int(y, method="radix")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
sort.int(y, method="shell")
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From Henrik's comment:
data.table::fsort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
The only related reported change I found is
CHANGES IN R 4.0.0
NEW FEATURES
...
When loading data sets via read.table(), data() now uses LC_COLLATE=C to ensure locale-independent results for possible string-to-factor conversions.
But I am not even sure whether this could explain what I see.
As I want to minimize the number of imported packages and would like to understand what is going on, I am not sure how to proceed. Am I missing something?
(A change to a sort.int with method radix would do the job, but still: Why did it change? Is that really better?
I just realized (thanks to Roland) that in my case sort calls sort.int:
function (x, decreasing = FALSE, na.last = NA, ...)
{
if (is.object(x))
x[order(x, na.last = na.last, decreasing = decreasing)]
else sort.int(x, na.last = na.last, decreasing = decreasing,
...)
}
From ?sort.int:
The "auto" method selects "radix" for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, "shell".)
And according to the docs, sort.int did not change from 4.0.0 to 4.0.2.
From ?data.table::setorder
data.table always reorders in "C-locale". As a consequence, the
ordering may be different to that obtained by base::order. In English
locales, for example, sorting is case-sensitive in C-locale. Thus,
sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but
c("a", "B", "c") in base::order. Note this makes no difference in most
cases of data; both return identical results on ids where only
upper-case or lower-case letters are present ("AB123" < "AC234" is
true in both), or on country names and other proper nouns which are
consistently capitalized. For example, neither "America" < "Brazil"
nor "america" < "brazil" are affected since the first letter is
consistently capitalized.
Using C-locale makes the behaviour of sorting in data.table more
consistent across sessions and locales. The behaviour of base::order
depends on assumptions about the locale of the R session. In English
locales, "america" < "BRAZIL" is true by default but false if you
either type Sys.setlocale(locale="C") or the R session has been
started in a C locale for you – which can happen on servers/services
since the locale comes from the environment the R session was started
in. By contrast, "america" < "BRAZIL" is always FALSE in data.table
regardless of the way your R session was started.
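The locale dependence described in that help text can be reproduced in base R by switching the collation locale temporarily. The sketch below is an illustration only; with_c_collation is a hypothetical helper name, not a base R or data.table function.

```r
# Hypothetical helper: evaluate an expression under C collation, then restore
# whatever collation locale was active before.
with_c_collation <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  # Lazy evaluation: expr is first evaluated here, after the locale switch.
  force(expr)
}

with_c_collation("america" < "BRAZIL")
# [1] FALSE
with_c_collation(sort(c("c", "a", "B")))
# [1] "B" "a" "c"
```

In C collation, strings are compared by byte value, so all uppercase ASCII letters sort before all lowercase ones.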
(Related questions Language dependent sorting with R and Best practice: Should I try to change to UTF-8 as locale or is it safe to leave it as is?)
Details
R.version # old _
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.2
year 2018
month 12
day 20
svn rev 75870
language R
version.string R version 3.5.2 (2018-12-20)
nickname Eggshell Igloo
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# =======
R.version # new after upgrade
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
#[1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# ==== Test with new 4.0.2
R.version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.2
year 2020
month 06
day 22
svn rev 78730
language R
version.string R version 4.0.2 (2020-06-22)
nickname Taking Off Again
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"

In summary, it was a bug that was fixed in R 4.0.1, as @Roland figured out.
From CRAN:
In R 4.0.0, sort.list(x) when is.object(x) was true, e.g., for x <- I(letters), was accidentally using method = "radix". Consequently, e.g., merge(<data.frame>) was much slower than previously; reported in PR#17794.
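If a package test should not depend on the R version or the session locale, one option is to request the sort method explicitly. This is a sketch of that idea using the vector from the question; whether C-locale order is the order you actually want is a separate decision.

```r
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")

# method = "radix" compares strings in C-locale, regardless of the session locale.
expected <- c("SRZ", "Schaffhausen", "Schwyz", "Seespital")
stopifnot(identical(sort.int(y, method = "radix"), expected))

# order() accepts the same method argument, e.g. for reordering rows by a key.
y[order(y, method = "radix")]
# [1] "SRZ"          "Schaffhausen" "Schwyz"       "Seespital"
```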

Why does rbind on a data.table with more than 254 columns reorder column names?

I am not sure of the extent of this side effect. Why is this happening, and what precautions does one need to take?
library(data.table)

# Start with a single column, then add 254 more, one at a time
dt <- data.table(
  sample = 1
)
i <- 1
while (i <= 254) {
  col <- paste("x", i, sep = "_")
  dt[[col]] <- i
  i <- i + 1
}
> combined_dt <- rbind(dt, dt)
> print(head(names(combined_dt))) # Columns get reordered
[1] "sample" "x_5" "x_6" "x_1" "x_2" "x_3"
>
> combined_dt <- rbindlist(list(dt, dt))
> print(head(names(combined_dt))) # Columns do not get reordered
[1] "sample" "x_1" "x_2" "x_3" "x_4" "x_5"
R details
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On
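Whatever the cause turns out to be, one precaution is to check the column order after combining and fail loudly when it changes. The sketch below uses a hypothetical helper name, assert_same_column_order, and a base R data frame as stand-in data so that it runs without data.table; names() works the same way on a data.table.

```r
# Hypothetical guard (not part of data.table): stop if the column order changed.
assert_same_column_order <- function(original, combined) {
  if (!identical(names(original), names(combined))) {
    stop("column names or their order differ between input and result")
  }
  invisible(combined)
}

# Base R stand-in data; the same check applies to data.table objects.
df <- data.frame(sample = 1, x_1 = 1, x_2 = 2)
combined <- assert_same_column_order(df, rbind(df, df))
nrow(combined)
# [1] 2
```

With data.table, setcolorder(combined_dt, names(dt)) would restore the original column order in place if the check fails.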

Same seed, different OS, different random numbers in R

I was getting inconsistent results between two machines and a Linux server, until I realized that fixing the seed had different effects on them. Each runs a different R version, all above 3.3.0. Here are the examples:
Linux 1
> set.seed(10); rnorm(1)
[1] -0.4463588
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
Linux 2
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
Mac OS
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.3
year 2017
month 11
day 30
svn rev 73796
language R
version.string R version 3.4.3 (2017-11-30)
nickname Kite-Eating Tree
Windows
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
The Linux 1 machine generates a different random number from the same seed, so a script run on it is not fully reproducible: depending on the machine where it is re-run, the results will agree or not. This is annoying.
I do not know what is happening here. Particularly:
(1) Is it an issue with R's versions or something more involved?
(2) How can this inconsistent behaviour be avoided? Any help is appreciated.
EDIT prompted by @Jesse Tweedle's answer (output on Linux 1 in a new session):
> set.seed(10); rnorm(1)
[1] -0.4463588
> set.seed(10); rnorm(1)
[1] -0.4463588
> set.seed(102); rnorm(1)
[1] 0.05752965
> set.seed(10, kind = "Mersenne-Twister"); rnorm(1)
[1] 0.01874617
> set.seed(10); rnorm(1)
[1] 0.01874617
> set.seed(102); rnorm(1)
[1] 0.1805229

From the docs (?Random):
RNGversion can be used to set the random generators as they were in an earlier R version (for reproducibility).
So try this on all systems:
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion"); rnorm(1)
[1] 0.01874617
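To diagnose this kind of mismatch, it can help to print the generators in effect on each machine and to pin them explicitly whenever seeding. The following is a minimal sketch; draw is a hypothetical helper name used only for illustration.

```r
# Show the generators currently in effect; compare this output across machines.
RNGkind()

draw <- function(seed) {
  # Pin both the uniform generator and the normal-variate algorithm.
  set.seed(seed, kind = "Mersenne-Twister", normal.kind = "Inversion")
  rnorm(1)
}

draw(10)
# [1] 0.01874617
```

The EDIT in the question suggests that the Linux 1 session started with a different generator: once the kind was set to "Mersenne-Twister", a plain set.seed(10) also gave 0.01874617.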

square bracket "[" operator extracting inaccurate subset

In the following code block I would expect the 5578 to be 650. I am unclear why it is not.
tmp <- tempfile(fileext = ".dat")
download.file("https://github.com/vz-risk/VCDB/raw/master/data/verisr/vcdb.dat", tmp, quiet=TRUE)
load(tmp, verbose=TRUE)
> dim(vcdb[vcdb$plus.dbir_year == 2018, ])
[1] 5578 2393
> vcdb %>% dplyr::filter(plus.dbir_year ==2018) %>% dim()
[1] 650 2393
> table(vcdb$plus.dbir_year == 2018)
FALSE TRUE
2211 650
This was reproduced across environment resets and in two different users' environments, with both nrow() and dim(). vcdb is a data.frame. It was not tested with other data frames or columns.
Version information:
version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
edit1: Updated with environment version information. Code updated to be reproducible.
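A likely explanation, assuming plus.dbir_year contains NA values (the post does not confirm this, but the counts are consistent with it), is that a logical index containing NA makes "[" return an all-NA row for each NA, whereas dplyr::filter() drops those rows and table() ignores NA by default. A small sketch with made-up data:

```r
# Made-up data: two matching rows, one non-matching row, two rows with NA.
df <- data.frame(year = c(2018, 2017, NA, 2018, NA), id = 1:5)

# The comparison yields TRUE, FALSE, NA, TRUE, NA; each NA selects an all-NA row.
nrow(df[df$year == 2018, ])          # 4

# Three ways to keep only the rows where the comparison is TRUE.
nrow(df[which(df$year == 2018), ])   # 2
nrow(df[df$year %in% 2018, ])        # 2
nrow(subset(df, year == 2018))       # 2
```

If this is what happens with vcdb, then sum(is.na(vcdb$plus.dbir_year)) should equal 5578 - 650.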
