Looking for ways to optimize R's sparse.model.matrix

I have sparse data problems that generally require computing a sparse model matrix. The matrix I end up with contains ~95% zeroes, mostly because factors get one-hot encoded into sparse indicator columns, and interactions with those columns are sparse as well.
require(Matrix)
require(data.table)
require(magrittr)
n = 500000
p = 10
x.matrix = matrix(rnorm(n*p), n, p)
colnames(x.matrix) = sprintf("n%s", 1:p)
x.categorical = data.table(
  c1 = sample(LETTERS, n, replace = TRUE),
  c2 = sample(LETTERS, n, replace = TRUE),
  c3 = sample(LETTERS, n, replace = TRUE)
)
x = cbind(data.table(x.matrix), x.categorical)  # cBind() is deprecated; base cbind() works here
myformula = "~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8 + n9 + n10 +
  c1 + c1*c2 + c3 + n1:c1"
mm = model.matrix(myformula %>% as.formula, x)
mm2 = sparse.model.matrix(myformula %>% as.formula, x)
I have found that sparse.model.matrix performs worse on a sparse problem than model.matrix (normally used for dense problems). This shows up in RStudio's profiling tools: sparse.model.matrix takes much more time than model.matrix and uses almost the same amount of memory. In some problems I have found sparse.model.matrix to be up to 10x slower than model.matrix on data that should be sparse.
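A quick way to reproduce the comparison without the profiler (using the objects defined above) is to time the two calls directly:
system.time(mm  <- model.matrix(myformula %>% as.formula, x))
system.time(mm2 <- sparse.model.matrix(myformula %>% as.formula, x))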
Are there better ways to create the sparse matrix? I have searched quite a lot and have not found any. Alternatively, I would be interested in tips on how to implement a smarter version of sparse.model.matrix from scratch, perhaps using Rcpp or data.table functions.
The source of the problem is in sparse2int, although I don't quite understand what it is for, and there are a few "FIXME"s still left in the code.
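As a possible workaround rather than a fix to sparse.model.matrix itself, the one-hot blocks can be assembled directly from the factors' integer codes with Matrix::sparseMatrix() and then bound to the numeric columns. This is only a minimal sketch using the objects from the question: the helper name is mine, it uses full dummy coding (no reference level dropped), and it omits the interaction terms.
one_hot_sparse <- function(f) {
  f <- as.factor(f)
  sparseMatrix(
    i = seq_along(f),        # one non-zero entry per row
    j = as.integer(f),       # column index = factor level
    x = 1,
    dims = c(length(f), nlevels(f)),
    dimnames = list(NULL, levels(f))
  )
}

mm.manual <- cbind(
  Matrix(x.matrix, sparse = TRUE),   # numeric block as a sparse matrix
  one_hot_sparse(x.categorical$c1),
  one_hot_sparse(x.categorical$c2),
  one_hot_sparse(x.categorical$c3)
)
Interaction columns such as n1:c1 could be produced the same way, by scaling the relevant indicator block by the numeric vector, but that is left out to keep the sketch short.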

Related

Memory problems while using lm.circular()

I am trying to run a circular regression in R using the circular package. My dataset is somewhat large, ~85000 rows and 6 variables. When I try to run the model, I get an error message reading "Error: cannot allocate vector of size 53.3 Gb." I am more of a statistician than a programmer, so I can't figure out how to fix this, other than noting that it seems odd for it to request such a large allocation when my dataset is not that large. I have attached a fictional dataset and code below. Thank you.
library(circular)
set.seed(12)
n = 80000
df <- data.frame(y  = rnorm(n, 2, .2),
                 x1 = rnorm(n, 100, 2),
                 x2 = rnorm(n, 0, 1),
                 x3 = rnorm(n, 9, .2),
                 x4 = rnorm(n, 0, 1),
                 x5 = rnorm(n, 1, .1))
y <- circular(df$y, type = "angles", units = "radians")
x <- model.matrix(y ~., data = df)
m1 <- lm.circular(y = y, x = x, type = "c-l", init = c(1,.01,.5,.5,.5,.5))
The implementation tries to set up some diagonal matrices of size n x n using
A <- diag(k * A1(k), nrow = n)
g.p <- diag(apply(x, 1, function(row, betaPrev) 2 / (1 + (t(betaPrev) %*% row)^2),
                  betaPrev = betaPrev), nrow = n)
(in circular:::LmCircularclRad) without using any sparse matrix tricks. For your example each of those matrices would be 80000 x 80000 doubles, i.e. roughly 51 GB, and that allocation fails.
I don't think there's anything you can do to avoid this, other than suggesting a more efficient way to carry out the required calculations. Usually linear algebra using diagonal matrices can be done with much less memory use, but you'll have to look closely at this code to see if that's the case here.
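To make that last point concrete, here is a tiny illustration (not a patch to circular, and with toy sizes) of why a diagonal matrix never needs to be materialised: multiplying by diag(d) is just elementwise row scaling.
n <- 5
d <- runif(n)                       # diagonal entries
x <- matrix(rnorm(n * 3), n, 3)
dense  <- diag(d, nrow = n) %*% x   # builds the full n x n matrix
scaled <- d * x                     # same result with O(n) extra memory
all.equal(dense, scaled)            # TRUE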

R - Compare performance of two types while controlling for interaction

I have been programming in R and have a dataset containing the results (success or not) of two machine learning algorithms that were tried out with different numbers of parameters. An example is provided below:
type  success  parameter_amount
a1    0        15639
a1    0        18623
a1    1        19875
a2    1        12513
a2    1        10256
a2    0        12548
I now want to compare both algorithms to see which one has the best overall performance. But there is a catch. It is known that the higher the parameter_amount, the higher the chance of success. Looking at the parameter amounts each algorithm was tested on, one can also notice that a1 was tested with higher parameter amounts than a2. This would make simply counting the number of successes of each algorithm unfair.
What would be a good approach to handle this scenario?
I will give you an answer, but without any guarantee that it is the right one; for a more precise answer you should give more information about the algorithms and the setup. I would also suggest migrating this question to Cross Validated.
Your question is really a statistical one, because in statistics we look for parsimony: we prefer a simpler model over a very complex one at a given level of performance, since we worry about overfitting: https://statisticsbyjim.com/regression/overfitting-regression-models/.
One way to do what you want is to compare performance as a function of model complexity, as in this toy example:
library(tidyverse)
library(ggplot2)
set.seed(123)

# number of estimates for each model
n <- 1000
performance_1 <- round(runif(n))
complexity_1 <- round(rnorm(n, mean = n, sd = 50))
performance_2 <- round(runif(n, min = 0, max = 0.6))
complexity_2 <- round(rnorm(n, mean = n, sd = 50))

df <- data.frame(performance = c(performance_1, performance_2),
                 complexity = c(complexity_1, complexity_2),
                 models = as.factor(c(rep(1, n), rep(2, n))))

temp <- df %>% group_by(complexity, models) %>% summarise(perf = sum(performance))

ggplot(temp, aes(x = complexity, y = perf, group = models, fill = models)) +
  geom_smooth() +
  theme_classic()
This only works if you have many data points. Complexity, in your case, is the number of parameters fitted. In this toy example the first model looks better because it outperforms the second at every level of complexity.
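Applied to the question's own columns (type, success, parameter_amount), the same idea would look roughly like this; results below is only a placeholder name for your data frame:
library(ggplot2)
ggplot(results, aes(x = parameter_amount, y = success, colour = type)) +
  geom_smooth() +   # smoothed success rate as a function of parameter amount
  theme_classic()
If one curve sits above the other across the range of parameter_amount where both algorithms were actually tested, that is a fairer comparison than raw success counts.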

Memory efficient representation of ``model.matrix``

Assume we have a large data.table object with model variables:
library(data.table)
library(magrittr)
library(pryr)
library(caret)
df <- rnorm(10000000, 0, 1) %>% matrix(., ncol = 10) %>% as.data.table
df[,factor_vars:=LETTERS[sample(1:26, 1000000, replace = T)]]
df[,factor_vars2:=LETTERS[sample(1:5, 1000000, replace = T)]]
I'm looking for an efficient way of making a model variable matrix from the data. At the moment the best way I've found is by using caret::dummyVars in the following manner:
dd_object <- dummyVars(~ -1 + V1 + V2 + V3 + V4 + V5 + V6 + V7 +
                         V8 + V9 + V10 + I(as.character(factor_vars)) +
                         I(as.character(factor_vars2)),
                       data = df)
Note that this creates a very convenient object for exporting and recreating without the original data.
object_size(dd_object)
R> 17.3 kB
On the other hand, just like base::model.matrix, it still inherits the inefficiency of a dense matrix object when most entries are zero, i.e.:
MM1 <- predict(dd_object, newdata = df)
object_size(MM1)
R> 392 MB
object_size(df)
R> 96 MB
Note that the sizes blow up very easily as more dummy variables are added; this is just for demonstration purposes.
My question: I want to use the same model-matrix object with various well-known modelling packages (glm, glmnet, xgboost, etc.). The sparse matrix representation from the Matrix package sounds nice and efficient, but not every package can work with it, and converting back with as.matrix(.) is painful in that case.
Are there any known solutions for my case? I'm looking for something with greater efficiency than the base matrix (possibly like sparse matrices) and the capability of forming a storeable model.matrix object just like caret::dummyVars is able to do.
The desired workflow could be something along the lines of
fread %>% predict(dummyVars_object, newdata =.) %>% predict(some_Model, newdata =.)
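Not a full answer, but one workable sketch for the packages that do accept sparse input: glmnet and xgboost both take a dgCMatrix directly, so the sparse representation from Matrix can be kept end to end (the response y below is only a placeholder, and the formula reuses the question's columns):
library(Matrix)
library(glmnet)

form <- ~ -1 + V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 +
  factor_vars + factor_vars2
MM.sparse <- sparse.model.matrix(form, data = df)   # dgCMatrix rather than a dense matrix

y <- rnorm(nrow(df))          # placeholder response, only for illustration
fit <- glmnet(MM.sparse, y)   # glmnet accepts the sparse matrix as-is
The formula object itself plays the role of the storeable dummyVars object here, although for plain glm you would still have to densify.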

Nonlinear model with many independent variables (fixed effects) in R

I'm trying to fit a nonlinear model with nearly 50 variables (since there are year fixed effects). The problem is I have so many variables that I cannot write the complete formula down like
nl_exp = as.formula(y ~ t1*year.matrix[,1] + t2*year.matrix[,2]
                    + ... + t45*year.matrix[,45] + g*(x^d))
nl_model = gnls(nl_exp, start = list(t = 0.5, g = 0.01, d = 0.1))
where y is the binary response variable, year.matrix is a matrix with 45 columns (indicating 45 different years) and x is the independent variable. The parameters to be estimated are t1, t2, ..., t45, g, d.
I have good starting values for t1, ..., t45, g, d. But I don't want to write a long formula for this nonlinear regression.
I know that if the model is linear, the expression can be simplified using
l_model = lm(y ~ factor(year) + ...)
I tried factor(year) in the gnls function but it does not work.
Besides, I also tried
nl_exp2 = as.formula(y ~ t*year.matrix + g*(x^d))
nl_model2 = gnls(nl_exp2, start=list(t=rep(0.2, 45), g=0.01, d=0.1))
It also returns an error message.
So, is there any easy way to write down the nonlinear formula and the starting values in R?
Since you have not provided any example data, I wrote my own. It is completely meaningless, and the model doesn't actually work because of poor data coverage, but it gets the point across:
library(nlme)  # gnls()

y <- 1:100
x <- 1:100
year.matrix <- matrix(runif(4500, 1, 10), ncol = 45)

# you could also use setNames() and do this all in one row, but that gets really messy
start.values <- c(rep(0.5, 45), 0.01, 0.1)
names(start.values) <- c(paste0("t", 1:45), "g", "d")
start.values <- as.list(start.values)

# build the long formula programmatically instead of typing it out
nl_exp2 <- as.formula(paste0("y ~ ",
                             paste(paste0("t", 1:45, "*year.matrix[,", 1:45, "]"),
                                   collapse = " + "),
                             " + g*(x^d)"))
gnls(nl_exp2, start = start.values)
This may not be the most efficient way to do it, but since you can pass a string to as.formula it's pretty easy to use paste commands to construct what you are trying to do.
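For completeness, the setNames() variant mentioned in the comment above would be the following one-liner (equivalent, just denser):
start.values <- as.list(setNames(c(rep(0.5, 45), 0.01, 0.1),
                                 c(paste0("t", 1:45), "g", "d")))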

Minimization with constraint on all parameters in R

I want to minimize a simple linear function Y = x1 + x2 + x3 + x4 + x5 using ordinary least squares with the constraint that the sum of all coefficients has to equal 5. How can I accomplish this in R? All of the packages I've seen seem to allow constraints on individual coefficients, but I can't figure out how to set a single constraint that involves all of the coefficients. I'm not tied to OLS; if this requires an iterative approach, that's fine as well.
The basic math is as follows: we start with
mu = a0 + a1*x1 + a2*x2 + a3*x3 + a4*x4
and we want to find a0-a4 to minimize the SSQ between mu and our response variable y.
If we replace the last parameter (say a4) with C - a1 - a2 - a3 to honour the constraint, we end up with a new set of linear equations
mu = a0 + a1*x1 + a2*x2 + a3*x3 + (C-a1-a2-a3)*x4
   = a0 + a1*(x1-x4) + a2*(x2-x4) + a3*(x3-x4) + C*x4
(note that a4 has disappeared ...)
Something like this (untested!) implements it in R.
Original data frame:
d <- data.frame(y  = runif(20),
                x1 = runif(20),
                x2 = runif(20),
                x3 = runif(20),
                x4 = runif(20))
Create a transformed version where all but the last column have the last column "swept out", e.g. x1 -> x1-x4; x2 -> x2-x4; ...
dtrans <- data.frame(y = d$y,
                     sweep(d[, 2:4],
                           1,
                           d[, 5],
                           "-"),
                     x4 = d$x4)
Rename to tx1, tx2, ... to minimize confusion:
names(dtrans)[2:4] <- paste("t",names(dtrans[2:4]),sep="")
Sum-of-coefficients constraint:
constr <- 5
Now fit the model with an offset:
lm(y ~ tx1 + tx2 + tx3, offset = constr * x4, data = dtrans)
It wouldn't be too hard to make this more general.
This requires a little more thought and manipulation than simply specifying a constraint to a canned optimization program. On the other hand, (1) it could easily be wrapped in a convenience function; (2) it's much more efficient than calling a general-purpose optimizer, since the problem is still linear (and in fact one dimension smaller than the one you started with). It could even be done with big data (e.g. biglm). (Actually, it occurs to me that if this is a linear model, you don't even need the offset, although using the offset means you don't have to compute a0=intercept-C*x4 after you finish.)
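As a quick sanity check on the algebra above (a sketch; fit simply stores the lm() call from the previous block):
fit <- lm(y ~ tx1 + tx2 + tx3, offset = constr * x4, data = dtrans)
a123 <- coef(fit)[c("tx1", "tx2", "tx3")]
a4 <- constr - sum(a123)   # the eliminated coefficient, a4 = C - a1 - a2 - a3
sum(a123) + a4             # recovers the constraint value, 5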
Since you said you are open to other approaches, this can also be solved as a quadratic programming (QP) problem:
Minimize a quadratic objective: the sum of the squared errors,
subject to a linear constraint: your weights must sum to 5.
Assuming X is your n-by-5 matrix and Y is a vector of length n, this would solve for your optimal weights:
library(limSolve)
lsei(A = X,
     B = Y,
     E = matrix(1, nrow = 1, ncol = 5),
     F = 5)
