square bracket "[" operator extracting inaccurate subset - r

In the following code block I would expect the 5578 to be 650. I am unclear why it is not.
tmp <- tempfile(fileext = ".dat")
download.file("https://github.com/vz-risk/VCDB/raw/master/data/verisr/vcdb.dat", tmp, quiet=TRUE)
load(tmp, verbose=TRUE)
> dim(vcdb[vcdb$plus.dbir_year == 2018, ])
[1] 5578 2393
> vcdb %>% dplyr::filter(plus.dbir_year ==2018) %>% dim()
[1] 650 2393
> table(vcdb$plus.dbir_year == 2018)
FALSE TRUE
2211 650
This was tried across environment resets, two different users' environments, and tested nrow() vs dim(). 'df' is a data.frame. This was not tested with other dataframes or columns.
Version information:
version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
edit1: Updated with environment version information. Code updated to be reproducible.

Related

furrr with rTorch in multisession

I want to use rTorch in a furrr "loop". A minimal example seems to be:
library(rTorch)
torch_it <- function(i) {
#.libPaths("/User/homes/mreichstein/.R_libs_macadamia4.0/")
#require(rTorch)
cat("torch is: "); print(torch)
out <- torch$tensor(1)
out
}
library(furrr)
plan(multisession, workers=8) # this works with sequential or multicore
result <- future_map(1:5, torch_it)
I get as output:
torch is: <environment: 0x556b57f07990>
attr(,"class")
[1] "python.builtin.module" "python.builtin.object"
Error in torch$tensor(1) : attempt to apply non-function
When I use multicore or sequential I get the expected output:
torch is: Module(torch)
torch is: Module(torch)
torch is: Module(torch)
torch is: Module(torch)
torch is: Module(torch)
... and result is the expected list of tensors.
Uncommenting the first lines in the function, assuming the new sessions "need this" did not help.
Update: I added torch <- import("torch") in the torch_it function. Then it runs without error, but the result I get is a list of empty pointers (?):
> result
[[1]]
<pointer: 0x0>
[[2]]
<pointer: 0x0>
[[3]]
<pointer: 0x0>
[[4]]
<pointer: 0x0>
[[5]]
<pointer: 0x0>
So, how do I properly link rTorch to each of the multisessions?
Thanks in advance!
Info:
> R.version
_
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out

Why did R's sorting change data imported with load() after an upgrade from 3.5.2 to 4.0.0?

Short version. I load() data in a package. Previously, a test in a package passed, now it fails because the output of sort changed.
Here is a minimal reproducible example - for details see below:
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# OLD 3.5.2 [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# NEW 4.0.0 [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# Update 4.0.2 see comment:
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From jay.sf's comment
sort.int(y, method="radix")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
sort.int(y, method="shell")
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
# From Henrik's comment:
data.table::fsort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
The only related reported change I found is
CHANGES IN R 4.0.0
NEW FEATURES
...
When loading data sets via read.table(), data() now uses LC_COLLATE=C to ensure locale-independent results for possible string-to-factor conversions.
But I am even not sure, if this could explain what I see.
As I want to minimize the number of imported packages and I would like to understand what's going on, I am not sure how to proceed. Do I miss something?
(A change to a sort.int with method radix would do the job, but still: Why did it change? Is that really better?
I just realized, that (thanks to Roland) sort calls in my case sort.int:
function (x, decreasing = FALSE, na.last = NA, ...)
{
if (is.object(x))
x[order(x, na.last = na.last, decreasing = decreasing)]
else sort.int(x, na.last = na.last, decreasing = decreasing,
...)
}
From ?sort.int:
The "auto" method selects "radix" for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, "shell".)
And according to the docs, sort.int did not change from 4.0.0 to 4.0.2.
From ?data.table::setorder
data.table always reorders in "C-locale". As a consequence, the
ordering may be different to that obtained by base::order. In English
locales, for example, sorting is case-sensitive in C-locale. Thus,
sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but
c("a", "B", "c") in base::order. Note this makes no difference in most
cases of data; both return identical results on ids where only
upper-case or lower-case letters are present ("AB123" < "AC234" is
true in both), or on country names and other proper nouns which are
consistently capitalized. For example, neither "America" < "Brazil"
nor "america" < "brazil" are affected since the first letter is
consistently capitalized.
Using C-locale makes the behaviour of sorting in data.table more
consistent across sessions and locales. The behaviour of base::order
depends on assumptions about the locale of the R session. In English
locales, "america" < "BRAZIL" is true by default but false if you
either type Sys.setlocale(locale="C") or the R session has been
started in a C locale for you – which can happen on servers/services
since the locale comes from the environment the R session was started
in. By contrast, "america" < "BRAZIL" is always FALSE in data.table
regardless of the way your R session was started.
(Related questions Language dependent sorting with R and Best practice: Should I try to change to UTF-8 as locale or is it safe to leave it as is?)
Details
R.version # old _
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.2
year 2018
month 12
day 20
svn rev 75870
language R
version.string R version 3.5.2 (2018-12-20)
nickname Eggshell Igloo
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# =======
R.version # new after upgrade
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.0
year 2020
month 04
day 24
svn rev 78286
language R
version.string R version 4.0.0 (2020-04-24)
nickname Arbor Day
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
#[1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
# ==== Test with new 4.0.2
R.version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.2
year 2020
month 06
day 22
svn rev 78730
language R
version.string R version 4.0.2 (2020-06-22)
nickname Taking Off Again
y <- c("Schaffhausen", "Schwyz", "Seespital", "SRZ")
sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y)
# [1] "Schaffhausen" "Schwyz" "Seespital" "SRZ"
stringr::str_sort(y, locale = "C")
# [1] "SRZ" "Schaffhausen" "Schwyz" "Seespital"
In summary, it was a bug which has been removed in R version 4.0.1. As #Roland figured out.
From CRAN:
In R 4.0.0, sort.list(x) when is.object(x) was true, e.g., for x <-I(letters), was accidentally usingmethod = "radix". Consequently,
e.g., merge(<data.frame>) was much slower than previously; reported in
PR#17794.

Why does rbind with data.table having more than 254 columns reorders column names

I am not sure of the extent of this side effect. Why is this happening ? What caution does one need to take.
dt <- data.table(
sample = 1
)
i = 1
while(i <= 254) {
col <- paste("x", i, sep = "_")
dt[[col]] = i
i = (i + 1)
}
> combined_dt <- rbind(dt, dt)
> print(head(names(combined_dt))) # Columns get reordered
[1] "sample" "x_5" "x_6" "x_1" "x_2" "x_3"
>
> combined_dt <- rbindlist(list(dt, dt))
> print(head(names(combined_dt))) # Columns do not get reordered
[1] "sample" "x_1" "x_2" "x_3" "x_4" "x_5"
R details
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.4
year 2018
month 03
day 15
svn rev 74408
language R
version.string R version 3.4.4 (2018-03-15)
nickname Someone to Lean On

Same seed, different OS, different random numbers in R

I was experiencing inconsistent results between two machines and a linux server, until I realized that fixing the seed was having different effects. I am running different R versions in all of them, all above 3.3.0. Here are the examples:
Linux 1
> set.seed(10); rnorm(1)
[1] -0.4463588
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
Linux 2
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
Mac OS
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.3
year 2017
month 11
day 30
svn rev 73796
language R
version.string R version 3.4.3 (2017-11-30)
nickname Kite-Eating Tree
Windows
> set.seed(10); rnorm(1)
[1] 0.01874617
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Linux gives a different random number generation from the same seed, thus making the result of a script run on it not fully reproducible (depending on the OS in which they are re-run, the results will agree or not). This is annoying.
I do not know what is happening here. Particularly:
(1) Is it an issue with R's versions or something more involved?
(2) How can this inconsistent behaviour be avoided? Any help is appreciated.
EDIT originated from #Jesse Tweedle answer (output in Linux 1 in a new session):
> set.seed(10); rnorm(1)
[1] -0.4463588
> set.seed(10); rnorm(1)
[1] -0.4463588
> set.seed(102); rnorm(1)
[1] 0.05752965
> set.seed(10, kind = "Mersenne-Twister"); rnorm(1)
[1] 0.01874617
> set.seed(10); rnorm(1)
[1] 0.01874617
> set.seed(102); rnorm(1)
[1] 0.1805229
From docs:
Random docs:
RNGversion can be used to set the random generators as they were in an earlier R version (for reproducibility).
So try this on all systems:
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion"); rnorm(1)
[1] 0.01874617

*_join with empty suffix

Fair warning: this can hang your operating system.
*_join() from dplyr fails when either of the left or right suffixes are specified as empty (''), e.g.
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('', '.b'))
Whereas the following works fine:
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('.a', '.b'))
Meanwhile, the S3 generic merge() (base) has no problem with empty suffixes:
merge(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffixes=c('', '.b'))
dplyr package info:
> packageVersion('dplyr')
[1] ‘0.5.0’
R version info:
> version
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
This was fun when I stumbled across this bug. The following will accomplish the desired effect using dplyr of using suffixes '' and .b
library(dplyr)
inner_join(data.frame(x=1, y=2),
data.frame(x=1, y=3),
by='x',
suffix=c('.a', '.b')) %>%
setNames(gsub('\\.a$', '', names(.)))

Resources