Tracing unexpected changes in matrix - r

I have a large matrix (9600x9600, 703.6 Mb) that keeps changing for no apparent reason. When created it looks fine, but after being used in calculations, all of a sudden every value except for a few columns is replaced by 0. It's driving me a bit crazy since I cannot debug the problem. Is there a way to trace what is making this variable change, such as a change or access log? Alternatively, is there a way to lock the variable so that it cannot be modified?
Any help is greatly appreciated.
edit:
It seems matrix "L" is modified after applying this equation, even after it has been locked through 'lockBinding':
F.calc.E = function(M,p){
M$V1 <- paste(M$V1,M$V2,sep = ", ")
p.loc = grep(pattern = p,x = M$V1) # loc of target pressure
p.vector = as.numeric(M[p.loc,4:ncol(M),with=FALSE])
pL = mmult(L,p.vector)
return(pL)
}
The code for the mmult function is this, obtained through another SO post:
func <- 'NumericMatrix mmult( NumericMatrix m , NumericVector v , bool byrow = true ){
if( byrow );
if( ! m.nrow() == v.size() ) stop("Non-conformable arrays") ;
if( ! byrow );
if( ! m.ncol() == v.size() ) stop("Non-conformable arrays") ;
NumericMatrix out(m) ;
if( byrow ){
for (int j = 0; j < m.ncol(); j++) {
for (int i = 0; i < m.nrow(); i++) {
out(i,j) = m(i,j) * v[j];
}
}
}
if( ! byrow ){
for (int i = 0; i < m.nrow(); i++) {
for (int j = 0; j < m.ncol(); j++) {
out(i,j) = m(i,j) * v[i];
}
}
}
return out ;
}'
I am still unable to debug.

You could use lockBinding:
m <- matrix(1:4, 2)
evil.fun <- function(x) .GlobalEnv[[x]][2,2] <- 0
evil.fun("m")
m
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    0
m <- matrix(1:4, 2)
lockBinding("m", .GlobalEnv)
evil.fun("m")
#Error in .GlobalEnv[[x]][2, 2] <- 0 :
# cannot change value of locked binding for 'm'
unlockBinding("m", .GlobalEnv)
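One possible explanation for the edit in the question, offered as an assumption rather than a confirmed diagnosis: lockBinding only prevents R-level assignment to the name, while in Rcpp the statement NumericMatrix out(m) creates a second handle to the same memory instead of a copy (Rcpp provides clone() for a deep copy). Writing to out would then also overwrite L. The sketch below imitates that behaviour in standard C++ with std::shared_ptr; Handle, scale_shared and scale_cloned are hypothetical names used for illustration, not Rcpp API.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a handle type: copying it copies the pointer,
// not the numbers it points to.
struct Handle {
    std::shared_ptr<std::vector<double>> data;
    explicit Handle(std::vector<double> v)
        : data(std::make_shared<std::vector<double>>(std::move(v))) {}
    // Deep copy: new storage holding the same values.
    Handle clone() const { return Handle(*data); }
};

// Shallow copy: 'out' shares storage with 'm', so the caller's data changes.
Handle scale_shared(Handle m, double f) {
    Handle out(m);
    for (double& x : *out.data) x *= f;
    return out;
}

// Deep copy first: the caller's data is left untouched.
Handle scale_cloned(Handle m, double f) {
    Handle out = m.clone();
    for (double& x : *out.data) x *= f;
    return out;
}
```

If this is indeed the cause, replacing NumericMatrix out(m) with NumericMatrix out = clone(m) in mmult should leave L unchanged.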

Related

Interleaving results from many objects in Rcpp

I need to write matrices and sparse matrices that are stored in a list to a file, row by row, and I am doing something like this:
#include <RcppArmadillo.h>
// [[Rcpp::export]]
bool write_rows (Rcpp::List data, Rcpp::CharacterVector clss, int n) {
int len = data.length();
for(int i = 0; i<n; i++) {
for(int j=0; j<len; j++) {
if (clss[j] == "matrix") {
Rcpp::NumericMatrix x = data[j];
auto row = x.row(i);
// do something with row i
} else if (clss[j] == "dgCMatrix") {
arma::sp_mat x = data[j];
auto row = x.row(i);
// do something different with row i
}
}
}
return true;
}
This function can be called in R with:
data <- list(
x = Matrix::rsparsematrix(nrow = 1000, ncol = 1000, density = 0.3),
y = matrix(1:10000, nrow = 1000, ncol = 10)
)
clss <- c("dgCMatrix", "matrix")
write_rows(data, clss, 1000)
The function receives a list of matrices or sparse matrices with the same number of rows and writes those matrices row by row, i.e., it first writes the first row of every element in data, then the second row of every element, and so on.
My problem is that the line arma::sp_mat x = data[j]; seems to have a huge impact on performance, since I am implicitly casting the list element data[j] to an Armadillo sparse matrix n times.
My question is: is there any way I could avoid this? Is there a more efficient solution? I tried to find a solution by looking into readr's source code, since they also write list elements row by row, but they also do a cast for each row (in this line, for example). Maybe this doesn't affect performance there because they deal with SEXPs?
With the clarification, it seems that the result should interleave the rows from each matrix. You can still do this while avoiding multiple conversions.
This is the original code, modified to generate some actual output:
// [[Rcpp::export]]
arma::mat write_rows(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {
int len = data.length();
arma::mat result(nrows*len, ncols);
for (int i = 0, k = 0; i < nrows; i++) {
for (int j = 0; j < len; j++) {
arma::rowvec r;
if (clss[j] == "matrix") {
Rcpp::NumericMatrix x = data[j];
r = x.row(i);
}
else {
arma::sp_mat x = data[j];
r = x.row(i);
}
result.row(k++) = r;
}
}
return result;
}
The following code creates a vector of converted objects, and then extracts the rows from each object as required. The conversion is done only once per matrix. I use a struct containing a dense and a sparse mat because it's a lot simpler than dealing with unions, and I don't want to drag in boost::variant or require C++17. Since there are only two classes we want to deal with, the overhead is minimal.
struct Matrix_types {
arma::mat m;
arma::sp_mat M;
};
// [[Rcpp::export]]
arma::mat write_rows2(Rcpp::List data, Rcpp::CharacterVector clss, int nrows, int ncols) {
const int len = data.length();
std::vector<Matrix_types> matr(len);
std::vector<bool> is_dense(len);
arma::mat result(nrows*len, ncols);
// populate the structs
for (int j = 0; j < len; j++) {
is_dense[j] = (clss[j] == "matrix");
if (is_dense[j]) {
matr[j].m = Rcpp::as<arma::mat>(data[j]);
}
else {
matr[j].M = Rcpp::as<arma::sp_mat>(data[j]);
}
}
// populate the result
for (int i = 0, k = 0; i < nrows; i++) {
for (int j = 0; j < len; j++, k++) {
if (is_dense[j]) {
result.row(k) = matr[j].m.row(i);
}
else {
arma::rowvec r(matr[j].M.row(i));
result.row(k) = r;
}
}
}
return result;
}
Running on some test data:
data <- list(
a=Matrix(1.0, 1000, 1000, sparse=TRUE),
b=matrix(2.0, 1000, 1000),
c=Matrix(3.0, 1000, 1000, sparse=TRUE),
d=matrix(4.0, 1000, 1000)
)
system.time(z <- write_rows(data, sapply(data, class), 1000, 1000))
# user system elapsed
# 185.75 35.04 221.38
system.time(z2 <- write_rows2(data, sapply(data, class), 1000, 1000))
# user system elapsed
# 4.21 0.05 4.25
identical(z, z2)
# [1] TRUE
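The row ordering that both functions produce can be sketched without Armadillo. This is a minimal illustration in standard C++, assuming dense row-major input; Row, Mat and interleave_rows are hypothetical names. Each input is touched once per row, and row i of matrix j lands at output row i * len + j.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Row = std::vector<double>;
using Mat = std::vector<Row>;  // row-major: m[i] is row i

// Interleave rows: first row of every matrix, then second row of every
// matrix, and so on.
Mat interleave_rows(const std::vector<Mat>& mats) {
    if (mats.empty()) return {};
    const std::size_t nrows = mats[0].size();
    for (const Mat& m : mats)
        if (m.size() != nrows) throw std::invalid_argument("row counts differ");
    Mat result;
    result.reserve(nrows * mats.size());
    for (std::size_t i = 0; i < nrows; ++i)
        for (const Mat& m : mats)
            result.push_back(m[i]);
    return result;
}
```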

Apply functions instead of for loop in R

I am a novice in R. I want to know how to write the for loop below in a more efficient way. The code below gives the correct answer for a small dataset.
data <- data.frame(x1=c(rep('a',12)),
x2=c(rep('b',12)),
x3=c(rep(as.Date('2017-03-09'),4),rep(as.Date('2017-03-10'),4),rep(as.Date('2017-03-11'),4)),
value1= seq(201,212),
x4=c(as.Date('2017-03-09'),as.Date('2017-03-10'),as.Date('2017-03-11'),as.Date('2017-03-12')
,as.Date('2017-03-10'),as.Date('2017-03-11'),as.Date('2017-03-12'),as.Date('2017-03-13')
,as.Date('2017-03-11'),as.Date('2017-03-12'),as.Date('2017-03-13'),as.Date('2017-03-14')),
value2= seq(101,112), stringsAsFactors = FALSE)
The for loop script is below:
for (i in 1:length(data$x3)){
print(i)
if (!is.na(data$x4[i])){
if(data$x4[i] == data$x3[i] && data$x2[i]==data$x2[i] && data$x1[i]==data$x1[i]){
data$diff[i] <- data$value1[i] - data$value2[i]
}
else{
print("I am in else")
for (j in 1:length(data$x3)){
print(c(i,j))
# print(a$y[i])
if(data$x4[i]==data$x3[j] && data$x1[i]==data$x1[j] && data$x2[i]==data$x2[j]){
# print(a$x[j])
data$diff[i] <- data$value1[j] - data$value2[i]
break
}
}
}
}
}
If you want performance, the answer is often Rcpp.
Translating your R code in Rcpp:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector f_Rcpp(List data) {
StringVector x1 = data["x1"];
StringVector x2 = data["x2"];
NumericVector x3 = data["x3"];
NumericVector x4 = data["x4"];
NumericVector value1 = data["value1"];
NumericVector value2 = data["value2"];
int n = value1.size();
NumericVector diff(n, NA_REAL);
int i, j;
for (i = 0; i < n; i++) {
Rprintf("%d\n", i);
// NA_REAL is a NaN, so x4[i] != NA_REAL is always true; use is_na() instead
if (!NumericVector::is_na(x4[i])) {
if (x4[i] == x3[i]) {
diff[i] = value1[i] - value2[i];
} else {
Rprintf("I am in else\n");
for (j = 0; j < n; j++) {
Rprintf("%d %d\n", i, j);
if (x4[i] == x3[j] && x1[i] == x1[j] && x2[i] == x2[j]) {
diff[i] = value1[j] - value2[i];
break;
}
}
}
}
}
return diff;
}
/*** R
f_Rcpp(data)
*/
Put that in a .cpp file and source it with Rcpp::sourceCpp().
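One detail matters when translating is.na() to C++: R's missing value for doubles is a NaN, and under IEEE 754 a NaN compares unequal to everything, itself included. A check written as x != NA_REAL therefore cannot detect missing values. The sketch below shows the effect in standard C++, using std::isnan as a stand-in for Rcpp's NumericVector::is_na(); kMissing, naive_not_missing and not_missing are hypothetical names.

```cpp
#include <cmath>
#include <limits>

// Stand-in for a missing value; R's NA for doubles is likewise a NaN.
const double kMissing = std::numeric_limits<double>::quiet_NaN();

// Broken check: NaN compares unequal to everything, itself included,
// so this returns true even when x is missing.
bool naive_not_missing(double x) { return x != kMissing; }

// Working check: ask explicitly whether the value is a NaN.
bool not_missing(double x) { return !std::isnan(x); }
```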
You can do this:
data$diff <- sapply(seq_along(data$x3), function(i) {
if (!is.na(data$x4[i])){
ind <- which(data$x4[i] == data$x3 & data$x1[i] == data$x1 & data$x2[i] == data$x2)
j <- `if`(i %in% ind, i, min(ind))
data$value1[j] - data$value2[i]
} else {
NA
}
})
Beware: in your code, if the column diff doesn't exist yet, doing data$diff[1] <- 100 sets every value of the column to 100, because the length-one result is recycled over all rows.
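For large data, the inner search over all rows can be replaced by a lookup table that remembers the first row for each (x1, x2, x3) combination. The following is a sketch in standard C++ under stated assumptions: dates are passed as integer day counts, x4 contains no missing values, and Key and diff_by_lookup are hypothetical names. Rows without a match get NaN.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Key identifying a group and a date (dates as integer day counts).
using Key = std::tuple<std::string, std::string, int>;

std::vector<double> diff_by_lookup(const std::vector<std::string>& x1,
                                   const std::vector<std::string>& x2,
                                   const std::vector<int>& x3,
                                   const std::vector<int>& x4,
                                   const std::vector<double>& value1,
                                   const std::vector<double>& value2) {
    const std::size_t n = x3.size();
    // emplace() does not overwrite, so the first row for each key is kept.
    std::map<Key, std::size_t> first_row;
    for (std::size_t j = 0; j < n; ++j)
        first_row.emplace(Key(x1[j], x2[j], x3[j]), j);

    std::vector<double> diff(n, std::numeric_limits<double>::quiet_NaN());
    for (std::size_t i = 0; i < n; ++i) {
        if (x4[i] == x3[i]) {  // the row matches itself: use it directly
            diff[i] = value1[i] - value2[i];
            continue;
        }
        auto it = first_row.find(Key(x1[i], x2[i], x4[i]));
        if (it != first_row.end())  // otherwise use the first matching row
            diff[i] = value1[it->second] - value2[i];
    }
    return diff;
}
```

Building the table costs one pass over the data, so the total work grows as n log n instead of n squared.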

Rcpp using a sparse matrix in C++

I am overwhelmed getting started with Rcpp. How can I use a sparse matrix (index it, read values and assign values) as defined in the code below, the same way I can with the 'standard' matrix?
library('Matrix')
library(Rcpp)
library(inline)
r <- matrix(seq(1,9,1),ncol=3,nrow=3)
i <- Matrix(0, nrow = nrow(r) * ncol(r), ncol = nrow(r)*ncol(r), sparse=TRUE)
fx <- cxxfunction( signature( x_ = "matrix" ,y_="dsCMatrix"), '
NumericMatrix x(x_) ;
int nr = x.nrow(), nc = x.ncol() ;
for (int i = 0; i < nr; i++) {
for (int j = 1; j < nc; j++) {
x(i,j) = 1;
}
}
return wrap( x ) ;
', plugin = "Rcpp" )
fx( r,i)
Your best bet may be the posts about sparse matrices at the Rcpp Gallery, and the rcpp-devel mailing list.
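As background for those posts, it may help to see how a column-compressed sparse matrix is laid out. The sketch below is standard C++ modelled on the i, p and x slots of the Matrix package's dgCMatrix class; SparseCSC, get and set_existing are hypothetical names and not part of Rcpp or Armadillo.

```cpp
#include <vector>

// Compressed sparse column storage, modelled on the i, p and x slots of a
// dgCMatrix (all indices zero-based).
struct SparseCSC {
    int nrow, ncol;
    std::vector<int> p;     // column pointers, length ncol + 1
    std::vector<int> i;     // row index of each stored value
    std::vector<double> x;  // stored values
};

// Read element (row, col); entries that are not stored are zero.
double get(const SparseCSC& m, int row, int col) {
    for (int k = m.p[col]; k < m.p[col + 1]; ++k)
        if (m.i[k] == row) return m.x[k];
    return 0.0;
}

// Overwrite a stored element; returns false if (row, col) is not stored.
// Inserting a new entry would require shifting i, x and p, omitted here.
bool set_existing(SparseCSC& m, int row, int col, double value) {
    for (int k = m.p[col]; k < m.p[col + 1]; ++k)
        if (m.i[k] == row) { m.x[k] = value; return true; }
    return false;
}
```

Reading is cheap, but assigning to a position that is not yet stored changes the structure, which is why element-wise writes into a sparse matrix are much slower than into a dense one.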

How to write Rcpp function for simple matrix multiplication in R

I have written Rcpp code to compute element-wise matrix multiplication in R. But when I try to run this code, R stops working and exits. How can I correct this function?
Thanks in advance.
library(Rcpp)
func <- 'NumericMatrix mmult( NumericMatrix m , NumericMatrix v, bool byrow=true )
{
if( ! m.nrow() == v.nrow() ) stop("Non-conformable arrays") ;
if( ! m.ncol() == v.ncol() ) stop("Non-conformable arrays") ;
NumericMatrix out(m) ;
for (int i = 1; i <= m.nrow(); i++)
{
for (int j = 1; j <= m.ncol(); j++)
{
out(i,j)=m(i,j) * v(i,j) ;
}
}
return out ;
}'
cppFunction( func )
m1<-matrix(1:4,2,2)
m2<-m1
r1<-mmult(m1,m2)
r2<-m1*m2
The (at least to me) obvious choice is to use RcppArmadillo:
R> cppFunction("arma::mat matmult(arma::mat A, arma::mat B) { return A % B; }",
+ depends="RcppArmadillo")
R> m1 <- m2 <- matrix(1:4,2,2)
R> matmult(m1,m2)
[,1] [,2]
[1,] 1 9
[2,] 4 16
R>
as Armadillo is strongly typed, and has an element-by-element multiplication operator (%) which we use in the one-liner it takes.
You have to keep in mind that C++ uses zero-indexed arrays. (See "Why does the indexing start with zero in 'C'?" and "Why are zero-based arrays the norm?".)
So you need to define your loops to run from 0 to m.nrow() - 1 and from 0 to m.ncol() - 1.
Try this:
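func <- '
NumericMatrix mmult( NumericMatrix m , NumericMatrix v, bool byrow=true )
{
// compare with != ; "! m.nrow() == v.nrow()" negates m.nrow() first
if( m.nrow() != v.nrow() ) stop("Non-conformable arrays") ;
if( m.ncol() != v.ncol() ) stop("Non-conformable arrays") ;
// allocate a new matrix; NumericMatrix out(m) would share memory with m
NumericMatrix out(m.nrow(), m.ncol()) ;
// C++ indices run from 0 to n - 1
for (int i = 0; i < m.nrow(); i++)
{
for (int j = 0; j < m.ncol(); j++)
{
out(i,j)=m(i,j) * v(i,j) ;
}
}
return out ;
}
'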
Then I get:
> mmult(m1,m2)
[,1] [,2]
[1,] 1 9
[2,] 4 16
> m1*m2
[,1] [,2]
[1,] 1 9
[2,] 4 16
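The zero-based loop bounds can also be seen without Rcpp. The sketch below is standard C++ operating on flat column-major arrays (the layout R uses for matrices); elementwise is a hypothetical name. It writes into freshly allocated storage, so neither input is modified.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Element-wise product of two nrow x ncol matrices stored column-major.
// Indices run from 0 to n - 1, never from 1 to n.
std::vector<double> elementwise(const std::vector<double>& a,
                                const std::vector<double>& b,
                                std::size_t nrow, std::size_t ncol) {
    if (a.size() != nrow * ncol || b.size() != nrow * ncol)
        throw std::invalid_argument("Non-conformable arrays");
    std::vector<double> out(nrow * ncol);  // fresh storage, inputs untouched
    for (std::size_t j = 0; j < ncol; ++j)
        for (std::size_t i = 0; i < nrow; ++i)
            out[i + j * nrow] = a[i + j * nrow] * b[i + j * nrow];
    return out;
}
```

Starting the loops at 1 and ending at n, as in the question, reads and writes one element past the end of each dimension, which is undefined behaviour and a plausible reason for the crash.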

How to know what r is doing behind the scene

As a new R user, I am very curious about what R is doing when we call a function. For example, I am using the knn function in the class package. All I need to do is type in knn and define my train and test data sets. Then what I get is the predicted class for my test data. However, I am curious whether there is a way to see the actual equations or formulas that knn uses. I have looked through some knn references but am still curious about EXACTLY what R is doing! Is it possible to find such information?
Any help is greatly appreciated!!!
Well, the first thing you can do is simply type in the name of the function, which in many cases will give you the source right there. For example:
> knn
function (train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
{
train <- as.matrix(train)
if (is.null(dim(test)))
dim(test) <- c(1, length(test))
test <- as.matrix(test)
if (any(is.na(train)) || any(is.na(test)) || any(is.na(cl)))
stop("no missing values are allowed")
p <- ncol(train)
ntr <- nrow(train)
if (length(cl) != ntr)
stop("'train' and 'class' have different lengths")
if (ntr < k) {
warning(gettextf("k = %d exceeds number %d of patterns",
k, ntr), domain = NA)
k <- ntr
}
if (k < 1)
stop(gettextf("k = %d must be at least 1", k), domain = NA)
nte <- nrow(test)
if (ncol(test) != p)
stop("dims of 'test' and 'train' differ")
clf <- as.factor(cl)
nc <- max(unclass(clf))
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
res <- factor(Z$res, levels = seq_along(levels(clf)), labels = levels(clf))
if (prob)
attr(res, "prob") <- Z$pr
res
}
<bytecode: 0x393c650>
<environment: namespace:class>
>
In this case, you can see that the real work is being done by an external call to VR_knn. If you want to dig deeper, you can go to http://cran.r-project.org/web/packages/class/index.html, and download the source for this package. If you download and extract the source, you will find a folder called "src" that holds the C code, and you can look through that, and find the source to that function:
void
VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
double *train, Sint *class, double *test, Sint *res, double *pr,
Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
{
int i, index, j, k, k1, kinit = *kin, kn, l = *lin, mm, npat, ntie,
ntr = *pntr, nte = *pnte, extras;
int pos[MAX_TIES], nclass[MAX_TIES];
int j1, j2, needed, t;
double dist, tmp, nndist[MAX_TIES];
RANDIN;
/*
Use a 'fence' in the (k+1)st position to avoid special cases.
Simple insertion sort will suffice since k will be small.
*/
for (npat = 0; npat < nte; npat++) {
kn = kinit;
for (k = 0; k < kn; k++)
nndist[k] = 0.99 * DOUBLE_XMAX;
for (j = 0; j < ntr; j++) {
if ((*cv > 0) && (j == npat))
continue;
dist = 0.0;
for (k = 0; k < *p; k++) {
tmp = test[npat + k * nte] - train[j + k * ntr];
dist += tmp * tmp;
}
/* Use 'fuzz' since distance computed could depend on order of coordinates */
if (dist <= nndist[kinit - 1] * (1 + EPS))
for (k = 0; k <= kn; k++)
if (dist < nndist[k]) {
for (k1 = kn; k1 > k; k1--) {
nndist[k1] = nndist[k1 - 1];
pos[k1] = pos[k1 - 1];
}
nndist[k] = dist;
pos[k] = j;
/* Keep an extra distance if the largest current one ties with current kth */
if (nndist[kn] <= nndist[kinit - 1])
if (++kn == MAX_TIES - 1)
error("too many ties in knn");
break;
}
nndist[kn] = 0.99 * DOUBLE_XMAX;
}
for (j = 0; j <= *nc; j++)
votes[j] = 0;
if (*use_all) {
for (j = 0; j < kinit; j++)
votes[class[pos[j]]]++;
extras = 0;
for (j = kinit; j < kn; j++) {
if (nndist[j] > nndist[kinit - 1] * (1 + EPS))
break;
extras++;
votes[class[pos[j]]]++;
}
} else { /* break ties at random */
extras = 0;
for (j = 0; j < kinit; j++) {
if (nndist[j] >= nndist[kinit - 1] * (1 - EPS))
break;
votes[class[pos[j]]]++;
}
j1 = j;
if (j1 == kinit - 1) { /* no ties for largest */
votes[class[pos[j1]]]++;
} else {
/* Use reservoir sampling to choose amongst the tied distances */
j1 = j;
needed = kinit - j1;
for (j = 0; j < needed; j++)
nclass[j] = class[pos[j1 + j]];
t = needed;
for (j = j1 + needed; j < kn; j++) {
if (nndist[j] > nndist[kinit - 1] * (1 + EPS))
break;
if (++t * UNIF < needed) {
j2 = j1 + (int) (UNIF * needed);
nclass[j2] = class[pos[j]];
}
}
for (j = 0; j < needed; j++)
votes[nclass[j]]++;
}
}
/* Use reservoir sampling to choose amongst the tied votes */
ntie = 1;
if (l > 0)
mm = l - 1 + extras;
else
mm = 0;
index = 0;
for (i = 1; i <= *nc; i++)
if (votes[i] > mm) {
ntie = 1;
index = i;
mm = votes[i];
} else if (votes[i] == mm && votes[i] >= l) {
if (++ntie * UNIF < 1.0)
index = i;
}
res[npat] = index;
pr[npat] = (double) mm / (kinit + extras);
}
RANDOUT;
}
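The core of that function is the distance loop, which indexes the data as test[npat + k * nte] and train[j + k * ntr]. That works because R passes matrices to C as flat column-major arrays. The sketch below isolates that step in standard C++; squared_distance and its parameter names a and b are hypothetical, and the tie handling and voting of VR_knn are left out.

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distance between row `a` of `test` (nte rows) and
// row `b` of `train` (ntr rows), both stored column-major with p columns.
double squared_distance(const std::vector<double>& test, std::size_t nte,
                        const std::vector<double>& train, std::size_t ntr,
                        std::size_t p, std::size_t a, std::size_t b) {
    double dist = 0.0;
    for (std::size_t k = 0; k < p; ++k) {
        // Element (row, k) of a column-major matrix sits at row + k * nrows.
        const double tmp = test[a + k * nte] - train[b + k * ntr];
        dist += tmp * tmp;
    }
    return dist;
}
```

The C code compares squared distances directly, so no square root is needed to rank the neighbours.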
In your editor (e.g., RStudio) just type in the function name and execute the line. This shows you the source code of the function, i.e., type
knn
In RStudio you can also click on the function and hit F2. A new tab with the function source code will open.
Alternatively you could use
debug(knn)
knn(your function arguments)
and step through the function with the debugger.
When you are done use
undebug(knn)
The Help Desk article in the October 2006 R News (a newsletter that has since evolved into The R Journal) shows how to access the source of R functions covering many of the different cases that you may need to use, from just typing the name of the function, to looking in namespaces, to finding the source files for compiled code.
