Is there an argument to force UTF8 in igraph functions? - r

I am trying to use an adjacency matrix whose labels are in UTF-8. Is there a way to make sure igraph functions use UTF-8, something along the lines of encoding = "UTF-8"? This is to avoid the following result (French text on a Japanese system shows kanji instead of French diacritics). Thanks for any pointers.
> m1 <- graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)
> m1
IGRAPH 7a99453 DNW- 391 1454 --
+ attr: name (v/c), weight (e/n)
+ edges from 7a99453 (vertex names):
[1] Accept ->Accepter Acknowledge->Appr馗ier Acknowledge->Confirmer Acknowledge->Conscient
[5] Acknowledge->Consid駻er Acknowledge->Constater Acknowledge->Convenir Acknowledge->Donner
+ ... omitted several edges
As requested:
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252
Actually, this may not be an issue with igraph but with RStudio: I just realised that although my tables are displayed properly with the View() command, if I just print them in the console, the French diacritics are displayed as Japanese kanji. In any case, igraph does this too.
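Whether igraph offers an encoding argument is not established in this thread. As a minimal sketch in base R, the encoding of the labels can be inspected and normalised to UTF-8 before the graph is built. The `labels` vector below is hypothetical and stands in for the dimnames of the matrix `m`; this will not help if the console itself cannot display the characters in the current locale.

```r
# Hypothetical labels standing in for the dimnames of the adjacency matrix
labels <- c("Accepter", "Appr\u00e9cier", "Consid\u00e9rer")

# "unknown" means ASCII or native encoding; "UTF-8" means explicitly marked
print(Encoding(labels))

# enc2utf8() converts from the declared (or native) encoding to UTF-8
labels_utf8 <- enc2utf8(labels)

# If strings already hold UTF-8 bytes but are unmarked, the encoding can be
# declared without changing the bytes:
# Encoding(labels) <- "UTF-8"

m <- matrix(0, nrow = 3, ncol = 3, dimnames = list(labels_utf8, labels_utf8))

# Check that every label is valid UTF-8 before passing m on
print(all(validUTF8(rownames(m))))
```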

Related

`seq` takes a very long time with `by=1`

I noticed something strange today, in some cases adding by=1 to seq function introduces a large inefficiency.
> system.time(seq(from=936144000, to=1135987200))
user system elapsed
0 0 0
> system.time(seq(from=936144000, to=1135987200, by=1))
user system elapsed
4.42 8.39 18.20
At first glance the results are equivalent:
> all.equal(seq(from=936144000, to=1135987200),
+ seq(from=936144000, to=1135987200, by=1))
[1] TRUE
The difference seems to be that specifying by causes the result to be numeric, even if by is explicitly integer, whereas omitting it gives an integer result.
> identical(seq(from=936144000, to=1135987200),
+ seq(from=936144000, to=1135987200, by=1))
[1] FALSE
> class(seq(from=936144000, to=1135987200))
[1] "integer"
> class(seq(from=936144000, to=1135987200, by=1L))
[1] "numeric"
Also, calling seq.int directly (assuming that is what happens behind the scenes in seq) takes much longer than seq without by:
> system.time(seq.int(from=936144000, to=1135987200, by=1L))
user system elapsed
0.25 1.68 2.81
How do I properly specify by to avoid the inefficiency or to get the efficiency of omitting by?
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.0 RUnit_0.4.32 tools_4.1.0 geneorama_1.7.3 data.table_1.14.0
You don't need to assume what happens behind the scenes; you can run debug(seq) and see what the difference is. It's a generic function, and it calls seq.default.
In seq.default it turns out that if the by argument is missing (and some other conditions hold, as they do in your example), seq(from, to) does from:to. This is extremely fast, because it doesn't even allocate the full vector: in recent versions of R (3.5.0 and later, via ALTREP), it is stored in a compact format with just the limits of the range.
The other thing you can see if you look at seq.default is that the only way to get this output is to have missing(by) be TRUE. So the answer to your question is that you can't specify by to get the same speed.
@Baraliuh's advice is good: if you want seq(from, to, by=1), use from:to instead.
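A small sketch of the difference described above, using a much shorter range than the one in the question so that it runs instantly. The variable names are illustrative only.

```r
from <- 936144000
to   <- 936144010   # short range; the question used a far larger one

a <- seq(from, to)           # by is missing, so seq.default falls through to from:to
b <- seq(from, to, by = 1)   # by is supplied, so the vector is computed in full
r <- from:to                 # the recommended replacement

print(class(a))              # integer
print(class(b))              # numeric
print(identical(a, r))       # same values and same type
print(isTRUE(all.equal(a, b)))  # equal values, different storage type
```

On the full range from the question, wrapping each call in system.time() should reproduce the timing gap.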

Why is R slower on my (stronger) Desktop than on my (weaker) laptop?

I'm using a Dell Latitude E7440 Laptop with Windows 7 Enterprise OS, 8GB RAM, 64-bit OS, Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz Processor, 2701 Mhz, 2 Cores, 4 Logical Processors (that's 4 cores).
I'm using a Dell Precision Tower 7810 Desktop with Windows 7 Enterprise OS, 32GB RAM, 64-bit OS, Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz 2 Processors, 2401 Mhz, 6 Cores, 12 Logical Processors (that's 24 cores).
A good demonstration of my use of R would be running binary classification using gbm in RStudio on 100K-sized data with ~300 features. But whatever I do on my laptop R version (all other software closed, no use of parallelization), is considerably faster than on my Desktop R version. How can that be? What do I need to do to find out?
Laptop:
> sum <- 0; system.time(for (i in 1:1000000) sum <- sum + i)
user system elapsed
0.36 0.00 0.36
> memory.limit()
[1] 8097
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.1
Desktop:
> sum <- 0; system.time(for (i in 1:1000000) sum <- sum + i)
user system elapsed
0.52 0.00 0.52
> memory.limit()
[1] 32684
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.1
Dell Latitude E7440 Laptop ... i7-4600U CPU @ 2.10GHz Processor, 2701 Mhz
Dell Precision Tower 7810 Desktop ... E5-2620 v3 @ 2.40GHz 2 Processors, 2401 Mhz
That would be why. Your laptop's CPU is running at a faster physical clock speed than your desktop, hence R also runs faster.
In the absence of multithreaded BLAS or other parallel-processing tricks, having multiple cores won't affect matters. Similarly, as long as you have enough memory to hold your data, more gigabytes won't speed things up (excepting caching issues but 100K should easily fit into the cache on both machines).
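To check this on each machine, a rough sketch is to report the core counts R can see and then time the same single-threaded loop as in the question. It assumes only the parallel package that ships with R; detectCores() may return NA on some platforms, and the timing is a crude indicator rather than a proper benchmark.

```r
library(parallel)

# Logical vs physical cores visible to R (informational only)
print(detectCores(logical = TRUE))
print(detectCores(logical = FALSE))

# The loop from the question runs on a single core, so elapsed time
# mostly reflects per-core speed rather than core count or RAM size
total <- 0
timing <- system.time(for (i in 1:1000000) total <- total + i)
print(timing[["elapsed"]])
```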

Spread function fails

R newbie here, re-building Python pipeline in R.
d is a data.frame():
$ day (chr) "2016-10-13", ...
$ city_name (chr) "SF", ...
$ type (chr) "Green", ...
$ count (int) 10, ...
I'm doing a spread() on the data from tidyr package:
d %>% spread(type,count)
Works fine running locally (Mac) with:
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
tidyr_0.6.1
I run the identical command on a Linux box, on the same input d, but it returns an error:
Error in `[.data.table`(.variables, lengths != 0) : i evaluates to a logical vector length 3 but there are 930 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
On Linux, I'm running:
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)
tidyr_0.6.0
Any idea what this error means, and why it would be thrown?
Edit: fixed this after updating tidyr on Linux, and re-starting the R session.
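Since the fix was a package update, a quick version check on each machine can catch this kind of mismatch early. The helper below, `has_min_version`, is a hypothetical name introduced for illustration; it relies only on requireNamespace() and packageVersion() from base R, and the 0.6.1 threshold is simply the version that worked on the Mac above.

```r
# Hypothetical helper: TRUE if pkg is installed at min_version or newer
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Compare against the version that worked locally
if (!has_min_version("tidyr", "0.6.1")) {
  message("tidyr is missing or older than 0.6.1; consider updating it and restarting R")
}
```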

initializing parallel chains in rjags

I'm doing some ghetto parallelization in jags through rjags.
I've been using the function parallel.seeds to obtain RNG states to initialize the RNGs (example below). However, I don't understand why multiple integers are returned for each RNG. The documentation says that when you initialize, .RNG.state is supposed to be a numeric vector of length one.
Furthermore, sometimes when I try to do this R crashes with no error generated. When I give up and just let it generate the seed for the chain on its own, the model runs fine. Does this mean I am using the wrong .RNG.state? Any insight would be appreciated, as I am planning to scale up this model in the future.
> parallel.seeds("base::BaseRNG", 3)
[[1]]
[[1]]$.RNG.name
[1] "base::Wichmann-Hill"
[[1]]$.RNG.state
[1] 3891 16261 19841
[[2]]
[[2]]$.RNG.name
[1] "base::Marsaglia-Multicarry"
[[2]]$.RNG.state
[1] 408065014 1176110892
[[3]]
[[3]]$.RNG.name
[1] "base::Super-Duper"
[[3]]$.RNG.state
[1] -848274653 175424331
There is a difference between .RNG.seed (which is a vector of length one, and the thing you can specify to jags.model to e.g. ensure MCMC samples are repeatable) and .RNG.state (which is a vector of length depending on the pRNG algorithm). It is possible that these got mixed up in the docs somewhere - can you tell me where you read this so I can make sure it is fixed for JAGS/rjags 4?
Regarding the crashing - some more details would be needed to help you with that I'm afraid. I assume that it is the JAGS model that crashes, and not your R session that terminates, and after the model has been running for a while? A reproducible example would help a lot.
By the way - when you say 'scale up' - if you are planning to make use of > 4 chains I would strongly recommend you load the lecuyer module (see ?parallel.seeds examples at the bottom).
Matt
The documentation is a bit confusing; under ?jags.model we see that .RNG.seed should be a vector of length 1, but parallel.seeds() returns .RNG.state which is usually > 1. The state space for the Mersenne Twister algorithm has 624 integers, and that is the length of the vector when you do
parallel.seeds("base::BaseRNG",4)
to make sure you see all 4 types of RNG. Similarly the state space of the Wichmann-Hill generator has 3 integers, and I'm sure similar research would reveal the state spaces for the other two are longer than 1.
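The same seed-versus-state distinction can be seen in base R's own generators, which share names with the JAGS base RNGs. This is only an analogy sketched in base R, not a statement about how JAGS stores its state: set.seed() takes a single integer, while .Random.seed holds the full state, whose length depends on the algorithm. The helper name `state_length` is illustrative.

```r
old_kind <- RNGkind()[1]   # remember the current generator so it can be restored

# Length of the state vector for a given base R generator
state_length <- function(kind) {
  RNGkind(kind)
  set.seed(42)             # the seed is always a single integer
  state <- get(".Random.seed", envir = globalenv())
  length(state) - 1L       # the first element only encodes the RNG kind
}

kinds <- c("Wichmann-Hill", "Marsaglia-Multicarry", "Super-Duper", "Mersenne-Twister")
lens <- vapply(kinds, state_length, integer(1))

RNGkind(old_kind)          # restore the original generator
print(lens)
```

In base R the Mersenne Twister count includes a position index in addition to the 624 state integers.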
For my own edification I mocked up an example using the LINE data in rjags:
data(LINE)
LINE$model() ## edit and save to line.r
data = LINE$data()
line = jags.model("line.r",data=data)
line.samples <- jags.samples(line, c("alpha","beta","sigma"),n.iter=1000)
line.samples
inits = parallel.seeds("base::BaseRNG", 3) # a list of lists
inits[[1]]$tau = 1
inits[[1]]$alpha = 3
inits[[1]]$beta = 1
inits[[2]]$tau = .1
inits[[2]]$alpha = .3
inits[[2]]$beta = .1
inits[[3]]$tau = 10
inits[[3]]$alpha = 10
inits[[3]]$beta = 5
line = jags.model("line.r",data=data,inits=inits,n.chains=3)
line.samples <- jags.samples(line, c("alpha","beta","sigma"),n.iter=1000)
line2 = jags.model("line.r",data=data,inits=inits,n.chains=3)
line.samples2 <- jags.samples(line2, c("alpha","beta","sigma"),n.iter=1000)
all(abs(line.samples$alpha-line.samples2$alpha) < 0.00000001) ## TRUE
So the results are entirely repeatable, which is cool.
To understand the conditions under which R is crashing, I'd need to know the results of sessionInfo() on your computer, plus more details of the circumstances (e.g. what JAGS model are you running?). I just did:
for (i in 1:100){parallel.seeds("base::BaseRNG",4)}
and my computer didn't crash. For reference:
sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets
# [6] methods base
#
# other attached packages:
# [1] rjags_3-14 coda_0.17-1 mlogit_0.2-4
# [4] maxLik_1.2-4 miscTools_0.6-16 Formula_1.2-1
#
# loaded via a namespace (and not attached):
# [1] grid_3.1.3 lattice_0.20-30 lmtest_0.9-33
# [4] MASS_7.3-39 sandwich_2.3-3 statmod_1.4.21
# [7] tools_3.1.3 zoo_1.7-12
That shows the version of R and rjags that I'm using.

Format to find day of week not working in Windows

I'm new to R. I have a Windows PC at work and Ubuntu Linux at home. I'm trying to figure out why my code doesn't work in R on Windows.
I'm trying to find the (numeric) day of the week for a given date.
I'm using format: format(Sys.time(), "%u")
It works on Linux but not on Windows. Am I missing something? I have added simple code and session info from both PCs.
Output from my Windows 7 PC with R 3.0.1
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
format(Sys.time(), "%u")
[1] ""
Sys.time()
[1] "2013-09-22 10:34:00 CDT"
Output from my Linux PC with R 3.0.1
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
format(Sys.time(),"%u")
[1] "7"
Sys.time()
[1] "2013-09-22 10:34:15 CDT"
This is R (on Windows)'s documented behavior.
For details, see ?strptime (I know, probably not the first place you might think to look ;-) which documents the date-time conversion specifications available in R. Under Details, an initial list of specifications found on all OSes is followed by a section that reads:
Also defined in the current standards but less widely implemented
(e.g. not for output on Windows) are
‘%C’ Century (00-99): the integer part of the year divided by 100.
[ . . . many snipped lines . . .]
‘%u’ Weekday as a decimal number (1-7, Monday is 1).
[ . . . more snipped lines . . .]
The closest substitutes in format are:
%w Weekday as decimal number (0–6, Sunday is 0).
%a Abbreviated weekday name in the current locale. (Also matches full
name on input.)
%A Full weekday name in the current locale. (Also matches abbreviated
name on input.)
You may check the function ISOweekday() in package ISOweek:
"This function returns the weekday of a given date according to ISO 8601. It is an substitute for the "%u" format which is not implemented on Windows."
ISOweekday("2013-09-24 01:42:23")
# [1] 2
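If adding a package is not an option, the ISO weekday can also be derived in base R without relying on the platform's "%u" support. A minimal sketch, where `iso_weekday` is a hypothetical helper name: as.POSIXlt()$wday gives 0-6 with Sunday as 0, so Sunday is remapped to 7.

```r
# Hypothetical helper: ISO 8601 weekday (1-7, Monday is 1) using base R only
iso_weekday <- function(x) {
  wday <- as.POSIXlt(x)$wday        # 0-6, Sunday is 0
  ifelse(wday == 0L, 7L, wday)      # move Sunday from 0 to 7
}

print(iso_weekday(as.Date("2013-09-22")))  # the Sunday from the question
print(iso_weekday(as.Date("2013-09-24")))  # the Tuesday from the ISOweekday example
```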
