Can't run r package mixOmics functions with large matrix - r

The mixOmics package is meant to analyze big data sets (e.g. from high-throughput experiments), but it does not seem to work with my big matrix.
I am having issues with both rcc (regularized canonical correlation) and tune.rcc (lambda parameter estimation for regularized canonical correlation).
> str(Y)
num [1:13, 1:17766] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:17766] ...
> str(X)
num [1:13, 1:26] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:26] ...
tune.rcc(X, Y, grid1 = seq(0.001, 1, length = 5),
grid2 = seq(0.001, 1, length = 5),
validation = "loo", plt=F)
On Mavericks: runs forever (I quit R after hours)
Since I know Mavericks is problematic, I tried it on a Windows 8 machine and on the mixOmics web interface.
On Windows 8:
Error: cannot allocate vector of size 2.4 Gb
On the web interface, where it is not possible to estimate lambdas (tune.rcc), I tried rcc with "some" lambdas and got:
Error: cannot allocate vector of size 2.4 Gb
Am I doing something obviously wrong?
Any help very much appreciated.
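A rough sanity check on the error message, under the assumption (not confirmed anywhere in the post) that rcc builds a square covariance matrix over the 17766 columns of Y: a dense 17766 x 17766 matrix of doubles needs about as much memory as R says it cannot allocate. The variable names below are illustrative only.

```r
# Memory needed for one dense q x q matrix of doubles (8 bytes per value).
q <- 17766                      # number of columns in Y, from str(Y)
bytes_needed <- q^2 * 8         # one double per matrix cell
gib_needed <- bytes_needed / 1024^3

round(gib_needed, 1)            # 2.4, matching "vector of size 2.4 Gb"
```

If that assumption holds, the 13 x 17766 input itself is small; it is the square intermediate matrix that is expensive.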

Related

Five figures summary with dbplyr

I have 4 years of experience using R but I am very new to the Big Data game, as I have always worked on csv files.
It is thrilling to manipulate large amounts of data from a distance, but also somewhat frustrating, as simple things you were used to have to be re-engineered.
The task I am struggling with right now is getting a basic 5 figure summary of a variable:
summary(df$X)
Some context: I am connected to Impala, and these lines of code work fine:
library(dbplyr)
localTable <- tbl(con, 'serverTable')
localTable %>% tally()
localTable %>% filter(X > 10) %>% tally()
If I just write
localTable
instead, RStudio hangs or takes a very long time, so I kill it with the task manager.
Coming back to my current question, I tried to have a 5 figure summary in these ways:
summary(localTable$X) #returns Length 0, Class NULL, Mode NULL
localTable %>% fivenum(X) #returns Error in rank(x, ties.method = "min", na.last = "keep") : unimplemented type 'list' in 'greater'
also building a custom summary() with summarise
localTable %>% summarize(Min = min(X),
Q1 = quantile(X, .25),
Avg = mean(X),
Q3 = quantile(X, .75),
Max = max(X))
returns a SYNTAX ERROR.
My guess is that there is a very trivial missing link between my code and the server, in the form of a data structure, but I can't figure out what it is.
I also tried to save localTable$X to an in-memory variable with
XL <- localTable$X
but I always get NULL.
On the graphical side, using dbplot, if I try
library(dbplot)
localTable %>% dbplot_histogram(X)
I get an empty graphic.
I thought about leveraging the 5 figure summary computed by the boxplot function, much like ggplot_build(object)$data, but with dbplot_boxplot I get the error could not find function "dbplot_boxplot".
I started using dbplyr because I am quite fluent with dplyr and I don't want to write SQL queries with DBI::dbGetQuery, but you can suggest other packages such as implyr or sparklyr, as well as tutorials on the subject at large, since the ones I found are quite basic.
EDIT:
as requested in a comment, I add the result of
str(localTable)
which is
List of 2
$ src:List of 2
..$ con :Formal class 'Impala' [package ".GlobalEnv"] with 4 slots
.. .. ..@ ptr :<externalptr>
.. .. ..@ quote : chr "`"
.. .. ..@ info :List of 15
.. .. .. ..$ dbname : chr "IMPALA"
.. .. .. ..$ dbms.name : chr "Impala"
.. .. .. ..$ db.version : chr "2.9.0-cdh5.12.1"
.. .. .. ..$ username : chr "User"
.. .. .. ..$ host : chr ""
.. .. .. ..$ port : chr ""
.. .. .. ..$ sourcename : chr "impala connector"
.. .. .. ..$ servername : chr "Impala"
.. .. .. ..$ drivername : chr "Cloudera ODBC Driver for Impala"
.. .. .. ..$ odbc.version : chr "03.80.0000"
.. .. .. ..$ driver.version : chr "2.6.11.1011"
.. .. .. ..$ odbcdriver.version : chr "03.80"
.. .. .. ..$ supports.transactions : logi FALSE
.. .. .. ..$ getdata.extensions.any_column: logi TRUE
.. .. .. ..$ getdata.extensions.any_order : logi TRUE
.. .. .. ..- attr(*, "class")= chr [1:3] "Impala" "driver_info" "list"
.. .. ..@ encoding: chr ""
..$ disco: NULL
..- attr(*, "class")= chr [1:4] "src_Impala" "src_dbi" "src_sql" "src"
$ ops:List of 2
..$ x : 'ident' chr "serverTable"
..$ vars: chr [1:157] "X" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:5] "tbl_Impala" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Not sure if I can dput my table as it is sensitive information
There are quite a few aspects to your post. I am going to try and address the main ones.
(1) What you are calling localTable is not local. What you have is a local access point to a remote table. It is a remote table because the data is stored in the database, rather than in R.
To copy a remote table into local R memory use localTable = collect(remoteTable). Use this carefully. If the table is many GB in the database this will be slow to transfer into R. Also, if you collect a database table that is bigger than the RAM available to R, you will receive an out-of-memory error.
I recommend using collect for moving summary results into R. Do the processing and summarizing in the database and just fetch the results into R. Alternatively, use remoteTable %>% head(20) %>% collect() to copy just the first 20 rows into R.
(2) The tableName$colname notation will not work for remote tables. In R the $ notation lets you access a named component of a list. Data frames are a special kind of list. If you try data(iris) followed by names(iris) you will get the column names of iris. Any of these can be accessed using iris$ followed by the column name.
However as your str(localTable) shows, localTable is a list of length 2 with the first named item src. If you call names(localTable) then you will receive two names back, the first of which is src. This means you can call localTable$src (and as localTable$src is also a list you can also call localTable$src$con).
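As a minimal base R sketch of that point, the list below imitates the two top-level components shown in str(localTable). It is a plain stand-in built for illustration, not a real dbplyr object, and its contents are made up.

```r
# A plain list shaped like the top level of the remote table object.
remote_like <- list(
  src = list(con = "connection handle"),
  ops = list(x = "serverTable", vars = c("X", "Y"))
)

names(remote_like)        # "src" "ops"

# There is no component called X, so $ returns NULL rather than a column.
col_x <- remote_like$X
is.null(col_x)            # TRUE

# This is why summary() reported Length 0, Class NULL, Mode NULL.
summary(col_x)

# The column names are stored as metadata, not as data.
remote_like$ops$vars      # "X" "Y"
```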
When working with dbplyr R translates data manipulation commands into the database language. There are translations defined for most dplyr commands, but there are not translations defined for all R commands.
So the recommended approach to access just a specific column is using select from dplyr:
local_copy_of_just_one_column = remoteTable %>%
select(required_column) %>%
collect()
(3) You have the right approach with a custom summary function. This is the best approach for producing the five figure summary without pulling the data into local memory (RAM).
One possible cause of the syntax error is that you may have used R commands that do not have a translation into your database language.
You can check whether a command has translations defined using translate_sql. I recommend you try
library(dbplyr)
translate_sql(quantile(colname, 0.25))
to see what the translation looks like.
You can view the translation of an entire table manipulation using show_query. This is my go-to approach when debugging SQL translation. Try:
localTable %>%
summarize(Min = min(X),
Q1 = quantile(X, .25),
Avg = mean(X),
Q3 = quantile(X, .75),
Max = max(X)) %>%
show_query()
If this does not produce valid SQL then executing the command will error.
One possible cause is that Min and Max have special meanings in SQL and so might produce odd behavior in your translation.
When I experimented with quantile, it looked like it might need an OVER clause in SQL. This is created using group_by. So perhaps you want something like the following:
localSummary = remoteTable %>%
# create dummy column
mutate(ones = 1) %>%
# group to satisfy over clause
group_by(ones) %>%
summarise(var_min = min(var),
var_lq = quantile(var, 0.25),
var_mean = mean(var),
var_uq = quantile(var, 0.75),
var_max = max(var)) %>%
# copy results from database into R memory
collect()
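The SQL translation can also be inspected without touching the server. The sketch below assumes dplyr and dbplyr are installed and uses lazy_frame() with simulate_impala() as a stand-in for the real connection; fake_remote and its values are made up. quantile() is deliberately left out, because whether it translates for Impala is exactly the open question here.

```r
# Runs only when dplyr and dbplyr are installed; no database is needed.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("dbplyr", quietly = TRUE)) {
  library(dplyr)
  library(dbplyr)

  # lazy_frame() creates a table that exists only for SQL generation;
  # simulate_impala() stands in for the real Impala connection.
  fake_remote <- lazy_frame(X = c(1, 5, 20), con = simulate_impala())

  query <- fake_remote %>%
    filter(X > 10) %>%
    summarise(x_min = min(X, na.rm = TRUE),
              x_avg = mean(X, na.rm = TRUE),
              x_max = max(X, na.rm = TRUE))

  # Print the SQL that would be sent to the database.
  show_query(query)

  # Keep the generated SQL as a character string for closer inspection.
  sql_text <- as.character(sql_render(query))
}
```

Adding one summary column at a time and re-running show_query() is a simple way to find which expression breaks the generated SQL.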

as.matrix(A$mat) for a given list A

I have n matrices to which I am trying to apply nearPD() from the Matrix package.
I have done this using the following code:
A<-lapply(b, nearPD)
where b is the list of n matrices.
I now would like to convert the list A into matrices. For an individual matrix I would use the following code:
A<-matrix(runif(n*n),ncol = n)
PD_mat_A<-nearPD(A)
B<-as.matrix(PD_mat_A$mat)
But I am trying to do this with a list. I have tried the following code but it doesn't seem to work:
d<-lapply(c, as.matrix($mat))
Any help would be appreciated. Thank you.
Here is some code so you can try to reproduce this:
n<-10
generate<-function (n){
matrix(runif(10*10),ncol = 10)
}
b<-lapply(1:n, generate)
Here is the simplest method using as.matrix, as noted by @nicola in the comments below and (in a version using apply) by @cimentadaj in the comments above:
d <- lapply(A, function(i) as.matrix(i$mat))
My original answer, which exploits the structure of the nearPD object with a little fiddling, was this extraction method:
d <- lapply(A, function(i) matrix(i$mat@x, ncol=i$mat@Dim[2]))
Below is some commentary on how I arrived at my answer.
This object is fairly complicated as str(A[[1]]) returns
List of 7
$ mat :Formal class 'dpoMatrix' [package "Matrix"] with 5 slots
.. ..@ x : num [1:100] 0.652 0.477 0.447 0.464 0.568 ...
.. ..@ Dim : int [1:2] 10 10
.. ..@ Dimnames:List of 2
.. .. ..$ : NULL
.. .. ..$ : NULL
.. ..@ uplo : chr "U"
.. ..@ factors : list()
$ eigenvalues: num [1:10] 4.817 0.858 0.603 0.214 0.15 ...
$ corr : logi FALSE
$ normF : num 1.63
$ iterations : num 2
$ rel.tol : num 0
$ converged : logi TRUE
- attr(*, "class")= chr "nearPD"
You are interested in the "mat" component, which is accessed by $mat. The @ symbols show that "mat" is an S4 object and its slots are accessed using @. The slots of interest are "x", the matrix content, and "Dim", the dimensions of the matrix. The code above puts this information together to extract the matrices from the list of "nearPD" objects.
Below is a brief explanation of why as.matrix works in this case. Note that the matrix inside a nearPD object is not a base R matrix:
is.matrix(A[[1]]$mat)
[1] FALSE
However, it is a "Matrix":
class(A[[1]]$mat)
[1] "dpoMatrix"
attr(,"package")
[1] "Matrix"
From the note in the help file, help("as.matrix,Matrix-method"),
Loading the Matrix namespace “overloads” as.matrix and as.array in the base namespace by the equivalent of function(x) as(x, "matrix"). Consequently, as.matrix(m) or as.array(m) will properly work when m inherits from the "Matrix" class.
So, the Matrix package is taking care of the as.matrix conversion "under the hood."
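Putting both extraction methods side by side, here is a small end-to-end sketch. It assumes the Matrix package is available; the input matrices are symmetrized random 4 x 4 matrices made up for the example, and the names b, A, d and d_slots only mirror the ones used above.

```r
library(Matrix)

set.seed(1)
# Three small symmetric input matrices (illustrative data only).
b <- lapply(1:3, function(i) {
  m <- matrix(runif(16), ncol = 4)
  (m + t(m)) / 2
})

# One nearPD object per input matrix.
A <- lapply(b, nearPD)

# Method 1: let the Matrix package convert the dpoMatrix.
d <- lapply(A, function(obj) as.matrix(obj$mat))

# Method 2: rebuild a base matrix from the S4 slots x and Dim.
d_slots <- lapply(A, function(obj) {
  matrix(obj$mat@x, ncol = obj$mat@Dim[2])
})

# Both give a list of ordinary base R matrices.
vapply(d, is.matrix, logical(1))
vapply(d_slots, is.matrix, logical(1))
```

The as.matrix version is the safer choice, since it does not depend on how the slots happen to be laid out internally.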

Block bootstrap for time series in R

I'm using the function tsbootstrap() from the package tseries to generate block bootstrap samples, and to calculate the standard errors for the estimate of the parameters of a regime-switching autoregressive model (which I can obtain using the function msmFit() from the package MSwM).
Here is the code I'm using. Firstly I define a function for the statistic I want to use:
switching.par <- function(z) {
n<-length(z)
x<-z[1:(n-1)]
y<-z[2:n]
my.xy<- data.frame(x,y)
mod<-lm(y~x,data=my.xy)
mod.mswm=msmFit(mod,k=2,sw=c(T,T,T))
b1 <- mod.mswm@Coef[1,2]
b2 <- mod.mswm@Coef[2,2]
c1 <- mod.mswm@Coef[1,1]
c2 <- mod.mswm@Coef[2,1]
del1 <- mod.mswm@std[1]
del2 <- mod.mswm@std[2]
parameters<-c(b1, b2, c1, c2, del1, del2)
names(parameters)<-c("b1", "b2", "c1", "c2", "del1", "del2")
parameters
}
And then I use the tsbootstrap() function (where x is a monthly time series of 10-year US government bonds)
use.boot <- tsbootstrap(x, nb=1000, statistic=switching.par, type="block", b=9)
But I get the following error message:
Error in solve.default(res$Hessian) :
Lapack routine dgesv: system is exactly singular: U[4,4] = 0
Any insight on how to fix this problem? I think the error comes from the function msmFit() of the package.
The error, as you correctly understood, comes from the msmFit function, which fails to converge.
I will give a bit of insight into the error and then provide a solution that worked for me:
An error from solve.default is common when an optimiser is being used. Usually the optimiser (such as optim) will try to calculate the Hessian matrix in order to 'direct' itself to the optimal solution that minimises the underlying function. At some point the Hessian matrix needs to be inverted, and if it is singular the algorithm crashes. Practically, this means that the optimiser failed to converge (i.e. couldn't find a solution).
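The same kind of message can be reproduced in base R with a toy singular matrix. H below is a made-up stand-in for a Hessian, not anything produced by msmFit.

```r
# A 2 x 2 matrix whose second row is twice the first, so it is singular.
H <- matrix(c(1, 2, 2, 4), nrow = 2)

# solve() cannot invert it; capture the error message instead of stopping.
msg <- tryCatch({
  solve(H)
  "inverted without error"
}, error = function(e) conditionMessage(e))

# The message names the same Lapack routine as the error in the question.
msg
```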
This can be because of a number of reasons:
Too few observations
Bad starting values
Bad design of the function to be optimised (or used in the function)
Low number of maximum iterations
Literally no solution for the problem
Now let's go to your problem:
It seems that the default maximum number of iterations for msmFit is 100. Try increasing it to 500, as I do below. The design of the function seems OK to me.
That leaves the low number of observations. According to the documentation, the b argument of tsbootstrap is the block length when type="block", so it controls how the resampled series passed to switching.par are assembled. With b=9, msmFit fails. I increased it to 50 (assuming that your series has at least 50 observations; anything less than that will probably fail all the time). Also, producing 1000 bootstrap series will take about a day to compute (at least on my machine).
With all the above in mind, the following seems to work on my machine (for just 10 bootstrap series) and it took ages.
switching.par <- function(z) {
n<-length(z)
x<-z[1:(n-1)]
y<-z[2:n]
my.xy<- data.frame(x,y)
mod<-lm(y~x,data=my.xy)
mod.mswm=msmFit(mod,k=2,sw=c(T,T,T), control=list(maxiter=500))
b1 <- mod.mswm@Coef[1,2]
b2 <- mod.mswm@Coef[2,2]
c1 <- mod.mswm@Coef[1,1]
c2 <- mod.mswm@Coef[2,1]
del1 <- mod.mswm@std[1]
del2 <- mod.mswm@std[2]
parameters<-c(b1, b2, c1, c2, del1, del2)
names(parameters)<-c("b1", "b2", "c1", "c2", "del1", "del2")
parameters
}
use.boot <- tsbootstrap(x, nb=10, statistic=switching.par, type="block", b=50)
Output:
> str(use.boot)
List of 5
$ statistic : num [1:10, 1:6] -0.0275 -0.1692 -0.0199 -0.0587 -0.0763 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:6] "t1" "t2" "t3" "t4" ...
$ orig.statistic: Named num [1:6] 0.0114 -0.1002 0.5155 0.5319 0.2868 ...
..- attr(*, "names")= chr [1:6] "t1" "t2" "t3" "t4" ...
$ bias : Named num [1:6] -0.2029 0.2041 0.0307 -0.0217 -0.036 ...
..- attr(*, "names")= chr [1:6] "t1" "t2" "t3" "t4" ...
$ se : Named num [1:6] 0.2076 0.1774 0.1686 0.1375 0.0533 ...
..- attr(*, "names")= chr [1:6] "t1" "t2" "t3" "t4" ...
$ call : language tsbootstrap(x = x, nb = 10, statistic = switching.par, b = 50, type = "block")
- attr(*, "class")= chr "resample.statistic"
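A different way to keep a long bootstrap run alive, not used in the answer above, is to make the statistic return NA for a replicate whose fit fails instead of stopping. The sketch below is base R only; with_na_on_error and toy_stat are hypothetical helpers written for this illustration, and toy_stat merely mimics a fit that sometimes errors.

```r
# Wrap a statistic so that an error yields a vector of NAs of length n_out.
with_na_on_error <- function(stat, n_out) {
  function(z) {
    tryCatch(stat(z), error = function(e) rep(NA_real_, n_out))
  }
}

# Toy statistic that fails on constant input, standing in for a failed fit.
toy_stat <- function(z) {
  if (var(z) == 0) stop("system is exactly singular")
  c(mean = mean(z), sd = sd(z))
}

safe_stat <- with_na_on_error(toy_stat, n_out = 2)

# Three pretend replicates; the second one fails and becomes a row of NAs.
reps <- rbind(safe_stat(c(1, 2, 3)),
              safe_stat(c(1, 1, 1)),
              safe_stat(c(2, 4, 9)))

# Drop failed replicates before computing standard errors.
ok <- complete.cases(reps)
se <- apply(reps[ok, , drop = FALSE], 2, sd)
```

If many replicates fail, the surviving ones are no longer a fair sample, so the share of NA rows is worth reporting alongside the standard errors.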

Fail to create couponbonds object in termstrc package using R

I am trying to use the R package termstrc to estimate the term structure. To do that I have to prepare the data in the couponbonds class required by the package. I used some fake data to rule out potential problems with the real data. I tried a lot, but it still didn't work.
Any idea what is going wrong?
structure of the official demo data which works
data("govbonds")
str(govbonds)
List of 3
$ GERMANY:List of 8
..$ ISIN : chr [1:52] "DE0001141414" "DE0001137131" "DE0001141422" "DE0001137149" ...
..$ MATURITYDATE: Date[1:52], format: "2008-02-15" "2008-03-14" "2008-04-11" ...
..$ ISSUEDATE : Date[1:52], format: "2002-08-14" "2006-03-08" "2003-04-11" ...
..$ COUPONRATE : num [1:52] 0.0425 0.03 0.03 0.0325 0.0413 ...
..$ PRICE : num [1:52] 100 99.9 99.8 99.8 100.1 ...
..$ ACCRUED : num [1:52] 4.09 2.66 2.43 2.07 2.39 ...
..$ CASHFLOWS :List of 3
.. ..$ ISIN: chr [1:384] "DE0001141414" "DE0001137131" "DE0001141422" "DE0001137149" ...
.. ..$ CF : num [1:384] 104 103 103 103 104 ...
.. ..$ DATE: Date[1:384], format: "2008-02-15" "2008-03-14" "2008-04-11" ...
..$ TODAY : Date[1:1], format: "2008-01-30"
#another two are omitted here
- attr(*, "class")= chr "couponbonds"
> ns_res <- estim_nss(govbonds, c("GERMANY"), method = "ns",tauconstr=list(c(0.2, 5, 0.1)))
[1] "Searching startparameters for GERMANY"
beta0 beta1 beta2 tau1
5.008476 -1.092510 -3.209695 2.400100
my code to prepare fake data
bond=list()
bond$CHINA=list()
n=30*12#suppose I have n bond
enddate=as.Date('2014/11/7')
isin=sprintf('DE%010d',1:n)#some fake ISIN
bond$CHINA$ISIN=isin
bond$CHINA$MATURITYDATE=enddate+(1:n)*30
bond$CHINA$ISSUEDATE=rep(enddate,n)
bond$CHINA$COUPONRATE=rep(5/100,n)
bond$CHINA$PRICE=rep(100,n)
bond$CHINA$ACCRUED=rep(0,n)
bond$CHINA$CASHFLOWS=list()
bond$CHINA$CASHFLOWS$ISIN=isin
bond$CHINA$CASHFLOWS$CF=100+(1:n)*5/12
bond$CHINA$CASHFLOWS$DATE=enddate+(1:n)*30
bond$CHINA$TODAY=enddate
class(bond)='couponbonds'
ns_res <- estim_nss(bond, c("CHINA"), method = "ns",tauconstr=list(c(0.2, 5, 0.1)))
the output
Error in `colnames<-`(`*tmp*`, value = c("DE0000000001", "DE0000000002", :
attempt to set 'colnames' on an object with less than two dimensions
The problem was finally solved by adding one cashflow with amount zero to CASHFLOWS$CF.
Put another way, at least one bond must have at least two cashflows.
You may then face another error caused by the uniroot function. Be sure to include only the cashflows after TODAY; termstrc does not filter the cashflows by TODAY for you.
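A minimal sketch of that filtering step in base R, using made-up cashflows laid out like the CASHFLOWS component (parallel ISIN, CF and DATE vectors). The ISINs, amounts and dates are invented for the example.

```r
today <- as.Date("2014-11-07")

# Illustrative cashflows: bond A has one past and one future payment.
cashflows <- list(
  ISIN = c("A", "A", "B"),
  CF   = c(5, 105, 102),
  DATE = as.Date(c("2014-06-01", "2015-06-01", "2015-01-01"))
)

# Keep only the cashflows dated strictly after TODAY.
keep <- cashflows$DATE > today
cashflows <- lapply(cashflows, function(v) v[keep])

# Count the remaining cashflows per bond.
table(cashflows$ISIN)
```

After filtering, it is worth checking that at least one bond still has two or more cashflows, since that was the condition behind the first error.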

Difficulty with smacofSym - multidimensional scaling

I have a question concerning the smacofSym function in the smacof package. I am using R version 3.1.0 through RStudio version 0.98.501.
I am using the following command:
MDSdata <- smacofSym(DJaccardMatrix, ndim=2, metric=FALSE, verbose=TRUE)
I've included details of the data I'm using (DJaccardMatrix) below. Every time I run smacofSym I end up with a final configuration where the points are right on top of each other. Here is a sample of the results:
MDSdata$conf
D1 D2
1 0.06259624 -0.01494732
2 0.06276541 -0.01480409
3 0.06266933 -0.01492375
4 0.06262438 -0.01496111
5 0.06243336 -0.01496193
6 0.06258047 -0.01502270
7 0.06247747 -0.01500037 .......
To check the results I ran the same matrix in XLStat and got what I was expecting, a much more distributed set of points. After looking at some of the other help requests I tried passing the data to smacofSym as both a matrix and a dist object, but neither changed the results.
Here is my info on DJaccardMatrix as a matrix:
num [1:121, 1:121] 0 0.969 0.679 0.704 0.939 ...
attr(*, "dimnames")=List of 2
..$ : chr [1:121] "1" "2" "3" "4" ...
..$ : chr [1:121] "1" "2" "3" "4" ...
Here is my info on DJaccardMatrix as a dist object:
Class 'dist' atomic [1:7260] 0.969 0.679 0.704 0.939 0.8 ...
..- attr(*, "Size")= int 121
..- attr(*, "call")= language as.dist.default(m = dissmat)
..- attr(*, "Diag")= logi FALSE
..- attr(*, "Upper")= logi FALSE
I'm thankful for any recommendations people have. I am assuming it is something very basic, but I am definitely not finding it. (On a side note - feel free to ignore this because it's concerning interpretation - what is the relation between the nonmetric stress that smacof reports and Kruskal's stress? Is there any?)
This answer concerns your side question in parentheses at the end: "what is the relation between the nonmetric stress that smacof reports and Kruskal's stress"
Kruskal's stress (or Stress-1) is the square root of the nonmetric stress (stress.nm) reported by smacof.
So, if you have a model called mod obtained by running smacofSym:
stress1 <- sqrt(mod$stress.nm)  # Stress-1
