My results of using splines::ns with a least-squares fit varied with no rhyme or reason that I could see, and I think I have traced the problem to the ns function itself.
I have reduced the problem to this:
require(splines)
N <- 0
set.seed(1)
for (i in 1:100) N <- N + identical(ns(1:10,3),ns(1:10,3))
N
My results average about 39, range 34--44 or so, but I expected 100 every time. Why should the results of ns be random? If I substitute bs for ns in both places, I get 100, as expected. My set.seed(1) hopes to demonstrate that the randomness I get is not what R intended.
In a clean session, using RStudio and R version 2.14.2 (2012-02-29), I get 39, 44, 38, etc. Everyone else seems to be getting 100.
Further info:
Substituing splines::ns for ns gives the same results. A clean vanilla session gives the same results. My computer has 8 cores.
The differences, when they happen, are generally or always 2^-54:
Max <- 0
for (i in 1:1000) Max <- max( Max, abs(ns(1:10,3)-ns(1:10,3)) )
c(Max,2^-54)
with result [1] 5.551115e-17 5.551115e-17. This variability causes me big problems down the line, because my optimize(...)$min now varies sometimes even in the first digit, making results not repeatable.
My sessionInfo with a clean vanilla session:
I created what I understand to be known as a clean vanilla session using
> .Last <- function() system("R --vanilla")
> q("no")
This blows away the session, and when I restart it, I get my clean vanilla session. Then, in answer to Ben Bolker's clarifying question, I did this at the beginning of my clean vanilla session:
> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Revobase_6.1.0 RevoMods_6.1.0 RevoScaleR_3.1-0 lattice_0.20-0
[5] rpart_3.1-51
loaded via a namespace (and not attached):
[1] codetools_0.2-8 foreach_1.4.0 grid_2.14.2 iterators_1.0.6
[5] pkgXMLBuilder_1.0 revoIpe_1.0 tools_2.14.2 XML_3.9-1.1
> require(splines)
Loading required package: splines
> N <- 0
> set.seed(1)
> for (i in 1:100) N <- N + identical(ns(1:10,3),ns(1:10,3))
> N
[1] 32
This is the answer I got from REvolution Technical Suppoort (posted here with permission):
The problem here is an issue of floating point arithmetic. Revolution
R uses the Intel mkl BLAS library for some computations, which differs
from what CRAN-R uses and uses this library for the 'ns()'
computation. In this case you will also get different results
depending on whether you are doing the computation on a
Intel-processor based machine or a machine with an AMD chipset.
We do ship the same BLAS and Lapack DLL's that are shipped with
CRAN-R, but they are not the default ones used with Revolution R.
Customers can revert the installed DLL's if they so choose and prefer,
by doing the following:
1). Renaming 'Rblas.dll' to 'Rblas.dll.bak' and 'Rlapack.dll' to
'Rlapack.dll.bak'
in the folder 'C:\Revolution\R-Enterprise-6.1\R-2.14.2\bin\x64'.
2). Rename the files 'Rblas.dll.0' and 'Rlapack.dll.0' in this folder
to Rblas.dll and Rlpack.dll respectively.
Their suggestion worked perfectly. I have renamed these files back and forth several times, using both RStudio (with Revolution R) and Revolution R's own IDE, always with the same result: The BLAS dlls give me N==40 or so, and the CRAN-R dlls give me N==100.
I will probably go back to BLAS, because in my tests, it is 8-times faster for %*% and 4-times faster for svd(). And that is just using one of my cores (verified by the CPU usage column of the Processes tab of Windows Task Manager).
I am hoping someone with better understanding can write a better answer, for I still don't really understand the full ramifications of this.
Related
When trying to run the sample code that is in the Minet paper / vignette I am experiencing a nunber of issues, e.g.
mim <- build.mim(discretize(syn.data), estimator)
Error in build.mim(dataset, estimator, disc, nbins):
could not find function "mutinformation"
I have also received other errors such as "unknown estimator" when trying methods prefixed with "mi." e.g. "mi.empirical."
I am running windows 8.1. Any help would be much appreciated!
EDIT 1: Additional Information
After playing around some more, the main problem I am having is when trying to use the discretize function like so:
> data(syn.data)
> disc <- "equalwidth"
> nbins <- sqrt(nrow(syn.data))
> ew.data <- discretize(syn.data, disc, nbins)
Error: could not find function "discretize"
This causes the same Error in all functions e.g. build.mim or minet which utilise discretize. I can run build.mim successfully without including discretize.
In addition, I am getting errors if I use minet (whilst excluding the discretize argument) with any of the mi.* estimation methods, e.g.
> res<-minet(syn.data,"mrnet","mi.empirical","equal width",10)
Error in build.mim(dataset, estimator, disc, nbins) :
could not find function "mutinformation"
However, running the same function with the "spearman" estimator works fine.
Edit 2: Output of sessionInfo()
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] minet_3.20.0
loaded via a namespace (and not attached):
[1] tools_3.1.0
discretize() is a function from the package infotheo that you can find in CRAN. There are some references to that package in the minet documentation. Maybe the authors of minet have moved some functionality to the infotheo package but since it is not a dependency it does not get installed automatically. It may be worth contacting the authors about this.
library(infotheo)
data(syn.data)
disc <- "equalwidth"
nbins <- sqrt(nrow(syn.data))
ew.data <- discretize(syn.data, disc, nbins)
The same applies for the multiinformation() function(). It is part of the infotheo package.
How do you read in / manipulate datasets in R which exceed the allotted memory limit?
EDIT:
Great help so far, thanks. Let me add an additional constraint. The server is enterprise owned and I do not have administrative access. Is there a way to read partial files using read.table or something similar (e.g., by designating nrows to only read 100,000 rows at a time)? Need a workaround which can run with current environment so cannot use fread, bigmemory, etc.
My target dataset contains approx 32 million rows with 30 columns, divided into 12 approximately equal files (some readable, some not).
The files are "|" delimited and stored on a remote serve in 12 individual files. About half of the files can be read using R, the other half exceed the allowable limit.
I'm using a simple read and rbind script:
path<-"filepath/mydata/contains 12 files.txt/"
fulldf<-data.frame()
for(i in 1:length(dir(path))){
file1<-read.table(file=paste0(path,dir(path[i]), sep="|", fill=T, quote="\"")
fulldf<-rbind(fulldf,file2)
}
I'd primarily like to be able to subset the data and write it to a .csv (e.g., read the data piece by piece, subset by location then rbind), but some of the files are simply too big to even read in.
Is there a way to read in part of a large file piece by piece, i.e., split an unreadable file into readable pieces?
System:
Microsoft Windows Server 2003 R2
Enterprise Edition
Service Pack 2
Computer:
Intel(R)Xeon(TM) MP CPU
3.66GHz
3.67 GHz, 12.0 GB RAM
Physical Address Extension
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.7.1
loaded via a namespace (and not attached):
[1] tools_2.12.1
I think iotools with the chunk.reader, read.chunk and chunk.apply may be what you are looking for.
If you cannot install packages then I gather read.table with the nrows and skip arguments should do? You can use the colClasses argument to ensure that they all have the same column class.
When I create s3 methods in R 2.14.1, and then call them, the s3 objects fail to execute the methods in cases where the method has the same name as functions already loaded into the workspace (i.e. base functions). Instead it calls the base function and returns an error. This example uses 'match'. I never had this problem before today. Since I last ran this code, I installed R 3.0.2, but kept my 2.14.1 version. I ran into some trouble (different trouble) with 3.0.2 due to certain packages not being up to date in CRAN, so I reverted RStudio to 2.14.1, and then this problem cropped up. Here's an example:
rm(list=ls())
library(R.oo)
this.df<-data.frame(letter=c("A","B","C"),number=1:3)
setConstructorS3("TestClass", function(DF) {
if (missing(DF)) {
data= NA
} else {
data=DF
}
extend(Object(), "TestClass",
.data=data
)
})
setMethodS3(name="match", class="TestClass", function(this,letter,number,...){
ret = rep(TRUE,nrow(this$.data))
if (!missing(number))
ret = ret & (this$.data$number %in% number)
if (!missing(letter)){
ret = ret & (this$.data$letter %in% letter)
}
return(ret)
})
setMethodS3("get_data", "TestClass", function(this,...) {
return(this$.data[this$match(...),])
})
hope<-TestClass(this.df)
hope$match()
Error in match(this, ...) : argument "table" is missing, with no default
hope$get_data()
Here's the sessionInfo() for clues:
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] R.oo_1.13.0 R.methodsS3_1.4.2
loaded via a namespace (and not attached):
[1] tools_2.14.1
I tried a lot of combinations of the arguments in setMethodsS3 with no luck.
How can I fix this?
First, I highly recommend calling S3 methods the regular way and not via the <object>$method(...) way, e.g. match(hope) instead of hope$match(). If you do that, everything works as expected.
Second, I can reproduce this issue with R 3.0.2 and R.oo 1.17.0. There appears to be some issues using the particular method name match() here, because if you instead use match2(), calling hope$match2() works as expected. I've seen similar problems when trying to create S3 methods named assign() and get(). The latter actually generates a error if tried, e.g. "Trying to use an unsafe generic method name (trust us, it is for a good reason): get". I'll add assign() and most likely match() to the list of no-no names. DETAILS: Those functions are very special in R so one should avoid using those names. This is because, if done, S3 generic functions are created for them and all calls will be dispatched via generic functions and that is not compatible with
Finally, you should really really update your R - it's literally ancient and few people will not bother trying to help you with issues unless you run the most recent stable R version (now R 3.0.2 soon to be R 3.1.0). At a minimum, you should make sure to run the latest versions of the package (your R.methodsS3 and R.oo versions are nearly 2 and 1 years old by now with important updates since).
Hope this helps
learning how to compute tasks in R for large data sets (more than 1 or 2 GB), I am trying to use ff package and ffdfdply function.
(See this link on how to use ffdfdply: R language: problems computing "group by" or split with ff package )
My data have the following columns:
"id" "birth_date" "diagnose" "date_diagnose"
There are several rows for each "id", and I want to extract the first date where there was a diagnose.
I would apply this :
library(ffbase)
library(plyr)
load(file=file_name); # to load my ffdf database, called data.f .
my_fun <- function(x){
ddply( x , .(id), summarize,
age = min(date_diagnose - birth_date, na.rm=TRUE))
}
result <- ffdfdply(x = data.f, split = data.f$id,
FUN = function(x) my_fun(x) , trace=TRUE) ;
result[1:10,] # to check....
It is very strange, but this command: ffdfdply(x = data.f, .... ) is making RStudio (and R) crash. Sometimes the same command will crash R and sometimes not.
For example, if I trigger again the ffdfdply line (which worked the first time), R will crash.
Also using other functions, data, etc. will have the same effect. There is no memory increase, or anything into log.txt.
Same behaviour when applying the summaryBy "technique"....
So if anybody has the same problem and found the solution, that would be very helpful.
Also ffdfdply gets very slow (slower than SAS...) , and I am thinking about making another strategy to make this kind of tasks.
Is ffdfdply taking into account that for example the data set is ordered by id? (so it does not have to look into all the data to take the same ids... ).
So, if anybody knows other approaches to this ddply problem, it would be really great for all the "large data sets in R with low RAM memory" users.
This is my sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.7.1 ffbase_0.6-1 ff_2.2-10 bit_1.1-9
I also noticed this when using the package which we uploaded to CRAN recently. It seems to be caused by overloading in package ffbase the "[.ff" and "[<-.ff" extractor and setter functions from package ff.
I will remove this feature from the package and will upload it to CRAN soon. In the mean time, you can use the version 0.7 of ffbase, which you can get here:
http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz
and install it as:
download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
shell("R CMD INSTALL ffbase_0.7.tar.gz")
Let me know if that helped.
I have tried the following code, and then R exits unexpectedly:
temp <- rep(as.Date("2009-01-01")+1:365, 365)
print(temp)
Anyone tried this before, is it a bug, or if there is anything I can do?
I have increase the memory available for R from 1024M to 2047M, but the same thing happens.
Thanks.
UPDATE #1
Here's my sessionInfo()
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-pc-mingw32
locale:
[1] LC_COLLATE=Chinese_Hong Kong S.A.R..950 LC_CTYPE=Chinese_Hong Kong S.A.R..950
[3] LC_MONETARY=Chinese_Hong Kong S.A.R..950 LC_NUMERIC=C
[5] LC_TIME=Chinese_Hong Kong S.A.R..950
attached base packages:
[1] stats graphics grDevices utils datasets methods base
And actually I am trying to do some format(foo, "%y%M"), but it also exit unexpectedly - for "unexpectedly", I mean R closes itself without any signs. Thanks again.
Looks like a known bug, fixed in 2.11.1 to me. Look at 2.11.1 changelog, section BUG FIXES, 8th item.
Works fine for me (R 2.12.0, running on Fedora Core 13)
You may check the output of getOption("max.print") and lower it a little bit (using, for instance, option(max.print=5000).
In general, however, there is no need to print out such a vector, as you will not be able to read the complete output. Functions like str(t) or head(t) are your friends!