The following piece of simple code works perfectly on my local Windows machine:
require(roll)
x = matrix(rnorm(100),100,1)
y = matrix(rnorm(100),100,1)
roll_lm(x,y,10)
However, on a remote Debian machine, it crashes with this error message:
caught illegal operation
address 0x7f867a59ee04,
cause 'illegal operand'
Traceback:
1: .Call("roll_roll_lm", PACKAGE = "roll", x, y, as.integer(width), as.numeric(weights), as.logical(center_x), as.logical(center_y), as.logical(scale_x), as.logical(scale_y), as.integer(min_obs), as.logical(complete_obs), as.logical(na_restore), as.character(match.arg(parallel_for)))
2: roll_lm(x, y, 10)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace*
Option 1, abort (with core dump, if enabled), gives:
Illegal instruction
I am clueless about how to interpret this message.
Any help? Thanks.
Some info:
R.version _
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 2.5
year 2016
month 04
day 14
svn rev 70478
language R
version.string R version 3.2.5 (2016-04-14)
nickname Very, Very Secure Dishes
The system:
Linux machineName 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u1 x86_64 GNU/Linux
Works for me:
R> require(roll)
R> x = matrix(rnorm(100),100,1)
R> y = matrix(rnorm(100),100,1)
R> str(roll_lm(x,y,10))
List of 2
$ coefficients: num [1:100, 1:2] NA NA NA NA NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "(Intercept)" "x1"
$ r.squared : num [1:100, 1] NA NA NA NA NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "R-squared"
R>
I suggest you rebuild and reinstall the roll package.
Sometimes this happens when one component (Rcpp, RcppParallel, ...) gets updated.
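On the Debian box, something like this should do it — rebuilding from source so the compiled code matches that machine's toolchain and CPU (the dependency list here is just the usual suspects; adjust as needed):
## refresh the compiled dependencies first, then rebuild roll from source
install.packages(c("Rcpp", "RcppParallel"))
install.packages("roll", type = "source")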
UPDATED 28Jun2017, below, in response to @Michal Kurka.
UPDATED 26Jun2017, below.
I am unable to load a large GBM model that I saved in native H2O format (i.e., hex).
H2O v3.10.5.1
R v3.3.2
Linux 3.10.0-327.el7.x86_64 GNU/Linux
My goal is to eventually save this model as a MOJO.
This model was so large that I had to initialize H2O with min/max memory 100G/200G before H2O's model training would run successfully.
This is how I trained the GBM model:
localH2O <- h2o.init(ip = 'localhost', port = port, nthreads = -1,
min_mem_size = '100G', max_mem_size = '200G')
iret <- h2o.gbm(x = predictors, y = response, training_frame = train.hex,
validation_frame = holdout.hex, distribution="multinomial",
ntrees = 3000, learn_rate = 0.01, max_depth = 5, nbins = numCats,
model_id = basename_model)
gbm <- h2o.getModel(basename_model)
oPath <- h2o.saveModel(gbm, path = './', force = TRUE)
The training data contains 81,886 records with 1413 columns. Of these columns, 19 are factors. The vast majority of these columns are 0/1.
$ wc -l training/*.txt
81887 training/train.txt
27294 training/holdout.txt
This is the saved model as written to disk:
$ ls -l
total 37G
-rw-rw-r-- 1 bfo7328 37G Jun 22 19:57 my_model.hex
This is how I tried to read the model from disk using the same large memory allocation values 100G/200G:
$ R
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
> library(h2o)
> localH2O=h2o.init(ip='localhost', port=65432, nthreads=-1,
min_mem_size='100G', max_mem_size='200G')
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out
/tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.err
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 550 milliseconds
H2O cluster version: 3.10.5.1
H2O cluster version age: 13 days
H2O cluster name: H2O_started_from_R_bfo7328_kmt050
H2O cluster total nodes: 1
H2O cluster total memory: 177.78 GB
H2O cluster total cores: 64
H2O cluster allowed cores: 64
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 65432
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.3.2 (2016-10-31)
From /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out:
INFO: Processed H2O arguments: [-name, H2O_started_from_R_bfo7328_kmt050, -ip, localhost, -port, 65432, -ice_root, /tmp/RtmpVSwxXR]
INFO: Java availableProcessors: 64
INFO: Java heap totalMemory: 95.83 GB
INFO: Java heap maxMemory: 177.78 GB
INFO: Java version: Java 1.8.0_121 (from Oracle Corporation)
INFO: JVM launch parameters: [-Xms100G, -Xmx200G, -ea]
INFO: OS version: Linux 3.10.0-327.el7.x86_64 (amd64)
INFO: Machine physical memory: 1.476 TB
My call to h2o.loadModel:
if ( TRUE ) {
now <- format(Sys.time(), "%a %b %d %Y %X")
cat( sprintf( 'Begin %s\n', now ))
model_filename <- './my_model.hex'
in_model.hex <- h2o.loadModel( model_filename )
now <- format(Sys.time(), "%a %b %d %Y %X")
cat( sprintf( 'End %s\n', now ))
}
From /tmp/RtmpVSwxXR/h2o_bfo7328_started_from_r.out:
INFO: GET /, parms: {}
INFO: GET /, parms: {}
INFO: GET /, parms: {}
INFO: GET /3/InitID, parms: {}
INFO: Locking cloud to new members, because water.api.schemas3.InitIDV3
INFO: POST /99/Models.bin/, parms: {dir=./my_model.hex}
After waiting an hour, I see these "out of memory" (OOM) error messages:
INFO: POST /99/Models.bin/, parms: {dir=./my_model.hex}
#e Thread WARN: Swapping! GC CALLBACK, (K/V:24.86 GB + POJO:112.01 GB + FREE:40.90 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping! GC CALLBACK, (K/V:26.31 GB + POJO:118.41 GB + FREE:33.06 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping! GC CALLBACK, (K/V:27.36 GB + POJO:123.03 GB + FREE:27.39 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
#e Thread WARN: Swapping! GC CALLBACK, (K/V:28.21 GB + POJO:126.73 GB + FREE:22.83 GB == MEM_MAX:177.78 GB), desiredKV=22.22 GB OOM!
I would not expect to need so much memory to read the model from disk.
How can I read this model from disk into memory? And once I do, can I save it as a MOJO?
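For reference, once the model is back in memory, the export I'm aiming for is something like the following (a sketch only; depending on the H2O version the helper is h2o.download_mojo or h2o.saveMojo — check your version's docs):
## hedged sketch: export the loaded model as a MOJO zip
mojo_file <- h2o.download_mojo(in_model.hex, path = './')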
UPDATE 1: 26Jun2017
I just noticed that the disk size of a GBM model increased dramatically between versions of H2O:
H2O v3.10.2.1:
-rw-rw-r-- 1 169M Jun 19 07:23 my_model.hex
H2O v3.10.5.1:
-rw-rw-r-- 1 37G Jun 22 19:57 my_model.hex
Any ideas why? Could this be the root of the problem?
UPDATE 2: 28Jun2017 in response to comments by @Michal Kurka.
When I load the training data via fread, the class (type) of each column is correct:
* 24 columns are ‘character’;
* 1389 columns are ‘integer’ (all but one column are 0/1);
* 1413 total columns.
I then convert the R-native data frame to an H2O data frame and manually factor-ize 20 columns:
train.hex <- as.h2o(df.train, destination_frame = "train.hex")
length(factorThese)
[1] 20
train.hex[factorThese] <- as.factor(train.hex[factorThese])
str(train.hex)
A condensed version of the output from str(train.hex), showing only those 19 columns that are factors (1 factor is the response column):
- attr(*, "nrow")= int 81886
- attr(*, "ncol")= int 1413
- attr(*, "types")=List of 1413
..$ : chr "enum" : Factor w/ 72 levels
..$ : chr "enum" : Factor w/ 77 levels
..$ : chr "enum" : Factor w/ 51 levels
..$ : chr "enum" : Factor w/ 4226 levels
..$ : chr "enum" : Factor w/ 4183 levels
..$ : chr "enum" : Factor w/ 3854 levels
..$ : chr "enum" : Factor w/ 3194 levels
..$ : chr "enum" : Factor w/ 735 levels
..$ : chr "enum" : Factor w/ 133 levels
..$ : chr "enum" : Factor w/ 16 levels
..$ : chr "enum" : Factor w/ 25 levels
..$ : chr "enum" : Factor w/ 647 levels
..$ : chr "enum" : Factor w/ 715 levels
..$ : chr "enum" : Factor w/ 679 levels
..$ : chr "enum" : Factor w/ 477 levels
..$ : chr "enum" : Factor w/ 645 levels
..$ : chr "enum" : Factor w/ 719 levels
..$ : chr "enum" : Factor w/ 678 levels
..$ : chr "enum" : Factor w/ 478 levels
The above results are exactly the same between v3.10.2.1 (smaller model written to disk: 169M) and v3.10.5.1 (larger model written to disk: 37G).
The actual GBM training uses nbins <- 37:
numCats <- n_distinct(as.matrix(dplyr::select_(df.train,response)))
numCats
[1] 37
iret <- h2o.gbm(x = predictors, y = response, training_frame = train.hex,
validation_frame = holdout.hex, distribution="multinomial",
ntrees = 3000, learn_rate = 0.01, max_depth = 5, nbins = numCats,
model_id = basename_model)
The difference in size of the models (169M vs 37G) is surprising. Can you please make sure that H2O recognizes all your numeric columns as numeric and not categorical with very high cardinality?
Do you use automatic detection of column types or do you specify it manually?
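If the types were auto-detected, one way to take the parser's guess out of the equation is to pin them when importing the raw file with h2o.importFile (an illustrative sketch; factorCols stands in for an index vector of your ~20 intended factor columns):
## force everything numeric except the columns meant to be categorical
types <- rep("numeric", 1413)
types[factorCols] <- "enum"          # factorCols: hypothetical index vector
train.hex <- h2o.importFile("training/train.txt",
                            destination_frame = "train.hex",
                            col.types = types)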
Whenever I use any sort of HTTP command via the system() function in RStudio, the rainbow circle of death appears and I have to force-quit RStudio. Up until now, I've written a bunch of checks to make sure a user isn't in RStudio before using an HTTP command (which I use a ton to access data), but it's quite a pain, and it would be fantastic to get to the root of the problem.
e.g.
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
causes RStudio to crash. Oddly, on another laptop of mine, such commands don't crash RStudio but instead cause the following error: 'sh: http: command not found', even though http is installed and works fine when used from the terminal.
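One thing worth checking there: the PATH an RStudio-launched R session sees is not necessarily your terminal's PATH, and Sys.which reports what R can actually find (a quick diagnostic, not a fix):
Sys.which("http")   # "" means the http binary is not on the PATH R sees
Sys.getenv("PATH")  # the PATH this R session inherited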
Does anybody know how to fix this problem, why it happens, or whether it occurs for you too? Although I know a lot about R, I'm afraid I have no idea how to start debugging this.
Thanks!!!
Using http from the httpie package on Linux hangs RStudio (and not plain terminal R) on my Linux system (your rainbow circle implies it's a Mac?), so I'm getting the same behaviour as you.
Installing and using wget works for me:
system("wget -O /tmp/data.out http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
Or you could try R's native download.file function. There's a whole bunch of other functions for getting stuff off the web - see the Web Task View http://cran.r-project.org/web/views/WebTechnologies.html
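For example, with base R's download.file, no external binary is needed at all:
download.file("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M",
              destfile = "/tmp/data.out")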
I've not seen this http command used much, so maybe it's flaky. Or maybe it's opening stdin...
Yes... Try this:
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M >/tmp/data2.out </dev/null" )
I think http is opening stdin, the Unix standard input channel, and RStudio isn't sending anything to it. So it waits. If you explicitly assign http's stdin to /dev/null, then http completes. This works for me in RStudio.
However, I still prefer wget or curl-based solutions!
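For example, a curl equivalent of the wget call above (assuming curl is installed):
system("curl -s -o /tmp/data3.out http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")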
Without more contextual information regarding RStudio version / operating system, it is hard to do more than suggest an alternative approach that avoids the use of system().
Instead you could use RCurl and getURL:
library(RCurl)
getURL('http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M')
#[1] "{\"status\":\"REQUEST_SUCCEEDED\",\"responseTime\":129,\"message\":[],\"Results\":{\n\"series\":\n[{\"seriesID\":\"CXUALCBEVGLB0101M\",\"data\":[{\"year\":\"2013\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"445\",\"footnotes\":[{}]},{\"year\":\"2012\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"451\",\"footnotes\":[{}]},{\"year\":\"2011\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"456\",\"footnotes\":[{}]}]}]\n}}"
You could also use PUT, GET, POST, etc. directly in R, abstracted from RCurl by the httr package:
library(httr)
tmp <- GET("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
dat <- content(tmp, as="parsed")
str(dat)
## List of 4
## $ status : chr "REQUEST_SUCCEEDED"
## $ responseTime: num 27
## $ message : list()
## $ Results :List of 1
## ..$ series:'data.frame': 1 obs. of 2 variables:
## .. ..$ seriesID: chr "CXUALCBEVGLB0101M"
## .. ..$ data :List of 1
## .. .. ..$ :'data.frame': 3 obs. of 5 variables:
## .. .. .. ..$ year : chr [1:3] "2013" "2012" "2011"
## .. .. .. ..$ period : chr [1:3] "A01" "A01" "A01"
## .. .. .. ..$ periodName: chr [1:3] "Annual" "Annual" "Annual"
## .. .. .. ..$ value : chr [1:3] "445" "451" "456"
## .. .. .. ..$ footnotes :List of 3
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
R - tm package - Issue with Arabic - difference between Mac OS X and Windows OS
ON MACBOOK PRO with RSTUDIO
```{r}
sessionInfo()
```
1. R version 3.1.0 (2014-04-10)
2. Platform: x86_64-apple-darwin10.8.0 (64-bit)
3. Packages: tm_0.6, NLP_0.1-3
ON WINDOWS 8.1 with RSTUDIO
```{r}
sessionInfo()
```
1. R version 3.1.0 (2014-04-10)
2. Platform: x86_64-w64-mingw32/x64 (64-bit)
3. Packages: tm_0.6, NLP_0.1-3
Problem description
Dear all,
I have been working all weekend. I'm working on a PhD on social network analysis. At the moment, I'm using the tm package for text mining and analysis, with English and Arabic mixed in big data sets.
The data sets are collected from the Twitter API with a Java program and placed in a MongoDB database.
For test purposes, I use a small dataset of 36000 tweets.
The problem is that for computing on huge datasets (>1,000,000 rows), my MacBook Pro is not sufficient. I need to use a Windows 8.1 PC, which has more RAM and storage.
When testing my code (which works fine in RStudio on Mac OS X) on Windows 8.1 with the same test dataset, I get different results from the tm package at the corpus computation step.
Here is the beginning of the R code:
```{r}
y <<- dget("file") # get the file extracted from MongoDB with the rmongodb package
a <<- y$tweet_text # extract only the text of the tweets in the dataset
text_df <<- data.frame(a, stringsAsFactors = FALSE) # Save as a data frame
myCorpus_df <<- Corpus(DataframeSource(text_df)) # Compute a corpus from the data frame
```
When I check in R on Mac OS, all the characters, English and Arabic, are represented correctly:
```{r}
str(myCorpus_df[1:2])
```
List of 2
$ 1:List of 2
..$ content: chr "The CHRONICLE EYE Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo "
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ 2:List of 2
..$ content: chr "RT ######### جبهة النصرة مهاجرينها وأنصارها مقراتها مكان آمن لكل من يخشى على نفسه الآذى "
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "2"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
Nevertheless, when I run the same code in RStudio on Windows, all the Arabic text is wrongly decoded (I can't paste it here). The str() of the corpus shows the same parameters; only the display of the Arabic is unreadable. When checking the data frame text_df, the Arabic text is displayed correctly.
When I check the encoding of an Arabic word on both OSes (Mac and Windows), it seems to be correctly encoded:
```{r}
Encoding("لمياه_و_الإصحا")
```
[1] "UTF-8"
I've tried passing additional information when creating the corpus (with readerControl, etc.) but nothing has changed: the Arabic text is still not displayed correctly in R or RStudio on Windows with the tm package.
Has anyone encountered the same differences between Mac OS X and Windows with non-Latin-language text mining?
As far as I can tell, the Arabic characters are being encoded in some native (Windows-specific) encoding, while your R code is incorrectly decoding them as UTF-8. That's why you're getting all those annoying symbols such as "Ø".** To verify this, inspect the raw bytes of your string variables using charToRaw and then check the UTF-8 character table.
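For instance (my own sketch, using a word from the question): genuine UTF-8 Arabic shows two-byte sequences with lead bytes 0xd8/0xd9, while one byte per character points to a native code page such as CP1256:
charToRaw("لمياه")
## d9 84 d9 85 d9 8a d8 a7 d9 87 -- two bytes per Arabic letter, as UTF-8 should be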
I haven't worked with the mongodb package before, but I wonder if there is a way to force the data to be read from MongoDB in UTF-8 format, perhaps by specifying an encoding parameter of some "read" function.
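If not, converting after the fact may work; this is only a sketch, and it assumes the bytes really are CP1256 (the Windows Arabic code page), which charToRaw can confirm first:
a_fixed <- iconv(a, from = "CP1256", to = "UTF-8")  # a is the tweet-text vector from the question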
** Actually, the reason I can immediately recognize those characters is that I ran into this kind of problem while working with Arabic tweets that I had obtained using the twitteR package.
The mixOmics package is meant to analyze big data sets (e.g. from high-throughput experiments), but it seems not to be working with my big matrix.
I am having issues with both rcc (regularized canonical correlation) and tune.rcc (lambda parameter estimation for regularized canonical correlation).
> str(Y)
num [1:13, 1:17766] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:17766] ...
> str(X)
num [1:13, 1:26] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:26] ...
tune.rcc(X, Y, grid1 = seq(0.001, 1, length = 5),
grid2 = seq(0.001, 1, length = 5),
validation = "loo", plt=F)
On Mavericks: runs forever (I quit R after hours)
Since I know Mavericks is problematic, I've tried it on a Windows 8 machine and on the mixOmics web interface.
On Windows 8:
Error: cannot allocate vector of size 2.4 Gb
On the web interface, since it is not possible to estimate the lambdas (tune.rcc), I tried rcc with "some" lambdas and got:
Error: cannot allocate vector of size 2.4 Gb
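For what it's worth, that 2.4 Gb matches exactly what one dense double-precision matrix over the 17766 columns of Y would occupy (e.g. the p x p regularized correlation matrix rcc works with):
p <- ncol(Y)     # 17766
p^2 * 8 / 2^30   # size of one p x p double matrix in GiB: about 2.35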
Am I doing something obviously wrong?
Any help very much appreciated.
How does one determine which architectures are supported by an installation of R? On a standard Windows install, one may look for the existence of R_HOME/bin/*/R.exe where * is the architecture (typically i386 or x64). On a standard Mac install from CRAN, there are no subdirectories.
I can query R for the default architecture using something like:
$ R --silent -e "sessionInfo()[[1]][[2]]"
> sessionInfo()[[1]][[2]]
[1] "x86_64"
but how do I know on Mac/Linux whether any sub-architectures are installed, and if so what they are?
R.version, R.Version(), R.version.string, and version provide detailed information about the version of R running.
Update, based on a better understanding of the question. This isn't a complete solution, but it seems you can get fairly close via a combination of the following commands:
# get all the installed architectures
arch <- basename(list.dirs(R.home('bin'), recursive=FALSE))
# handle different operating systems
if (.Platform$OS.type == "unix") {
  arch <- gsub("exec", "", arch)
  if (length(arch) == 0 || all(arch == ""))  # no sub-architecture directories found
    arch <- R.version$arch
} else { # Windows
  # no special handling needed: bin/ holds one directory per installed architecture
}
Note that this won't work if you've built R from source and installed the different architectures in various different places. See 2.6 Sub-architectures of the R Installation and Administration manual for more details.
Using Sys.info() gives you a lot of information about your system. Maybe it can help here:
Sys.info()["machine"]
machine
"x86_64"
EDIT
One workaround to get all the possible architectures is to download the log files from the RStudio CRAN mirror; it's not complete, but it's a good estimate of what you need.
start <- as.Date('2012-10-01')
today <- as.Date('2013-07-01')
all_days <- seq(start, today, by = 'day')
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')
files <- file.path("/tmp", basename(urls))
# the logs must be downloaded before they can be read
mapply(download.file, urls, files)
list_data <- lapply(files, read.csv, stringsAsFactors = FALSE)  # read.csv decompresses .gz transparently
data <- do.call(rbind, list_data)
str(data)
## 'data.frame': 10694506 obs. of 10 variables:
## $ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
## $ time : chr "00:30:13" "00:30:15" "02:30:16" "02:30:16" ...
## $ size : int 35165 212967 167199 21164 11046 42294 435407 326143 119459 868695 ...
## $ r_version: chr "2.15.1" "2.15.1" "2.15.1" "2.15.1" ...
## $ r_arch : chr "i686" "i686" "x86_64" "x86_64" ...
## $ r_os : chr "linux-gnu" "linux-gnu" "linux-gnu" "linux-gnu" ...
## $ package : chr "quadprog" "lavaan" "formatR" "stringr" ...
## $ version : chr "1.5-4" "0.5-9" "0.6" "0.6.1" ...
## $ country : chr "AU" "AU" "US" "US" ...
## $ ip_id : int 1 1 2 2 2 2 2 1 1 3 ...
unique(data[["r_arch"]])
## [1] "i686" "x86_64" NA "i386" "i486"
## [6] "i586" "armv7l" "amd64" "000000" "powerpc64"
## [11] "armv6l" "sparc" "powerpc" "arm" "armv5tel"