I am trying to move character elements from a list to a character vector using Rcpp.
Rcpp::cppFunction('
Rcpp::StringVector foo(Rcpp::List lst) {
Rcpp::StringVector res(2);
res(0) = "aha";
Rcpp::List tmp = lst(1);
res(1) = tmp(1); // fails to compile, tried different things including tmp(1).str()
return res;
}
')
# Expected output
films = list(list("a"), list("B", "bingo"))
foo(films)
# [1] "aha" "bingo"
We are three friends trying to solve the minimum spanning tree with conflicts problem using R. For this we read files in .txt format that contain lines such as
"1 2 5
2 4 6" etc., which indicate that from node 1 to node 2 there is an edge with weight 5, and
"1 2 2 4" etc., which indicates that there is a conflict relationship between the edges 1-2 and 2-4. To continue, we have to form an n x n conflict matrix in which we store 0 if there is no conflict relation between two edges and 1 if there is. For this purpose, we wrote three nested for loops:
for(i in 1:dim(edges_read)[1]){
  for(k in 1:dim(edges_read)[1]){
    for(t in 1:dim(conflicts)[1]){
      if(all(conflicts[t,] == c(edges_read[i,1], edges_read[i,2],
                                edges_read[k,1], edges_read[k,2]))){
        conflictmatrix[i,k] <- 1
      }
    }
  }
}
However, R cannot get us to a solution and these for loops take a very long time. How can we solve this? Thanks for any further assistance.
As you have discovered, for() loops are not fast in R. There are faster approaches, but it's hard to provide examples without data. Please use something like dput(edges_read) and dput(conflicts) to provide a small example of the data.
As one example, you could implement the for loops with the Rcpp package for a speed improvement. Based on the code in your question, you could re-implement the triple loop roughly like this:
Rcpp::cppFunction('NumericMatrix MSTC_nxn_Cpp(NumericMatrix edges_read, NumericMatrix conflicts){
int n = edges_read.nrow(); //number of edges: the output matrix is n x n (adjust to what you need)
int m = conflicts.nrow(); //number of conflict pairs
NumericMatrix conflictmatrix( n , n ); //the output matrix
for(int i=0;i<n;i++){ //your i loop
  for(int k=0;k<n;k++){ //your k loop
    double te = edges_read( i, 0 ); //same as edges_read[i,1]
    double tf = edges_read( i, 1 ); //same as edges_read[i,2]
    double tg = edges_read( k, 0 ); //same as edges_read[k,1]
    double th = edges_read( k, 1 ); //same as edges_read[k,2]
    NumericVector w = NumericVector::create(te,tf,tg,th); //the pair of edges to look for
    for(int t=0;t<m;t++){ //your t loop
      NumericVector v = conflicts( t , _ ); //same as conflicts[t,]
      bool same = true; //same as all(): do all four values match?
      for(int p=0; p<4; p++){
        if(v[p] != w[p]){ same = false; break; }
      }
      if(same){ conflictmatrix(i,k) = 1; } //note (i,k), not [i,k], for matrix indexing in C++
    }
  }
}
return conflictmatrix; //your output
}')
#Then run the function
MSTC_nxn_Cpp(edges_read, conflicts )
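A vectorized base-R sketch may also avoid the explicit loops entirely (an assumption on my part: edges_read is an n x 2 matrix of node pairs and conflicts is an m x 4 matrix holding two node pairs per row):
edge_key <- paste(edges_read[, 1], edges_read[, 2])               # one key per edge
i_idx <- match(paste(conflicts[, 1], conflicts[, 2]), edge_key)   # row index of the first edge
k_idx <- match(paste(conflicts[, 3], conflicts[, 4]), edge_key)   # row index of the second edge
conflictmatrix <- matrix(0, nrow(edges_read), nrow(edges_read))
ok <- !is.na(i_idx) & !is.na(k_idx)                               # ignore conflicts that reference unknown edges
conflictmatrix[cbind(i_idx[ok], k_idx[ok])] <- 1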
Here is a silly (maybe only in my mind) way to accomplish my goal:
A <- "This is a test."
B <- "This is the answer."
swap <- function(item1, item2) {
  tmp <- item2
  item2 <- item1
  item1 <- tmp
  return(list(item1, item2))
}
AB <- swap(A,B)
A <- AB[[1]]
B <- AB[[2]]
But I'm looking for something similar to the following C code:
void swap(int *a, int *b)
{
    int iTemp;
    iTemp = *a;
    *a = *b;
    *b = iTemp;
}
My motivations:
My real data is quite large, e.g. a 5k x 5k matrix, so assigning the existing variables twice in each iteration, once inside the function and once outside it, must be a waste of time.
The closest question on SO is this one, but just like the OP in that question, my R session has lots of objects: I'm working with Rmpi, and each slave holds a great number of variables.
R is written in C, so I assumed R might have pointers like C does, but surprisingly I can't find much about this on the net.
How about this; it just assigns the swapped values back into the workspace (pos = 1 is the global environment).
A <- "This is a test."
B <- "This is the answer."
swap <- function(item1, item2) {
  tmp <- item1
  assign(deparse(substitute(item1)), item2, pos = 1)
  assign(deparse(substitute(item2)), tmp, pos = 1)
}
swap(A, B)
A
#[1] "This is the answer."
B
#[1] "This is a test.
I'm working with a large dataset x. I want to drop the rows of x that have missing values in one or more of a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
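As an aside (my addition, not benchmarked here), base R's complete.cases() expresses the same row filter:
x[complete.cases(x[, ..varcols])]   # keep only rows with no NA in any of varcols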
Here is a revised version of a C++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure someone can still improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the matrix, restricted to those columns, with all rows containing missing values removed. keep just takes one argument X and returns a logical vector.
Note about passing data.table objects: these functions do not accept a data.table as an argument. The functions would have to be modified to take a DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
// keep only the selected columns (idx is 1-based)
X = X.cols(idx - 1);
// get dimensions
int n = X.n_rows, k = X.n_cols;
// create keep vector
vec keep = ones<vec>(n);
for (int j = 0; j < k; j++)
  for (int i = 0; i < n; i++)
    if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
// alternative with view for each row (slightly slower)
/*vec keep = zeros<vec>(n);
for (int i = 0; i < n; i++) {
keep(i) = is_finite(X.row(i));
}*/
return (X.rows(find(keep==1)));
}
// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
int n = X.nrow(), k = X.ncol();
// create keep vector
LogicalVector keep(n, true);
for (int j = 0; j < k; j++)
  for (int i = 0; i < n; i++)
    if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
return (keep);
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")
idx=sample(ncol(X),90)
microbenchmark(
X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
X[rowSums(is.na(X[,idx])) == 0, idx],
cleanmat(X,idx),
X[keep(X[,idx]),idx],
times=3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed with a large number of varcols, perhaps look at iterating by column. Something like this (untested):
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which R then must loop through to find the TRUEs so it knows which elements of keep to set to FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for the result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is ok, but it creates a large copy to hold a logical matrix with length(varcols) columns, and the == 0 on the result of rowSums needs a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again, and so will &. Each of the results of is.na has to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So the very fastest (I guess) would be a C version like this (pseudo-code):
keep = rep(TRUE, nrow(x))
for (j=0; j<varcols; j++)
  for (i=0; i<nrow(x); i++)
    if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R), and then the C loop would run through the columns, updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, with the inline package, or old school. It's important that the two loops are nested that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NAs in some rows, saving it from even fetching the later column values at all after the first NA in each row. A runnable version of the same idea follows.
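A minimal runnable Rcpp sketch of that pseudo-code (my own wiring, assuming the relevant columns can be passed as a numeric matrix; it is essentially the keep() function from the Armadillo answer above):
Rcpp::cppFunction('
LogicalVector keepRows(NumericMatrix X) {
  int n = X.nrow(), k = X.ncol();
  LogicalVector keep(n, true);        // one single allocation for keep
  for (int j = 0; j < k; j++)         // columns outer,
    for (int i = 0; i < n; i++)       // rows inner, for cache efficiency
      if (keep[i] && NumericVector::is_na(X(i, j))) keep[i] = false;
  return keep;
}')
x[keepRows(as.matrix(x[, ..varcols]))]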
Two more approaches:
1. Two vector scans
x[!is.na(var1) & !is.na(var2)]
2. Join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest (a sketch for deriving them from the data follows the timings below):
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings:
x <- data.table(var1 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                var2 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04
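If the possible non-NA values are not known in advance (my assumption in this sketch), they can be derived from the data first and then used in the same join:
uv <- x[, CJ(var1 = unique(na.omit(var1)), var2 = unique(na.omit(var2)))]
x[uv, nomatch = 0]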
In Rcpp, there are various "Rcpp sugar" commands that permit nice vectorised operations in the code. In the code below I take a data frame, break it into vectors, then use the "ifelse" and "sum" sugar commands to compute the mean of v over the rows where x equals either y or y+1. All seems to work correctly.
Just wondering if there is a neater way than this - e.g. an equivalent of the "which" command that gives the indices satisfying a particular condition? There seems to be a facility available as "find" in Armadillo, but that means using incompatible object types (you can't use "find" and "ifelse" together).
On the same topic, is it possible to get "ifelse" to accept a compound logical condition? In the example below, for instance, the definition of indic is formed of two "ifelse" commands, and it would obviously be cleaner as one. Any thoughts would be much appreciated.
Look forward to hearing your responses :)
require(Rcpp)
require(inline)
set.seed(42)
df = data.frame(x = rpois(1000,3), y = rpois(1000,3), v = rnorm(1000),
stringsAsFactors=FALSE)
myfunc1 = cxxfunction(
signature(DF = "data.frame"),
plugin = "Rcpp",
body = '
using namespace Rcpp;
DataFrame df(DF);
IntegerVector x = df["x"];
IntegerVector y = df["y"];
NumericVector v = df["v"];
LogicalVector indic = ifelse(x==y,true,ifelse(x==y+1,true,false));
double subsum = sum(ifelse(indic,v,0));
int subsize = sum(indic);
double mn = ((subsize>0) ? subsum/subsize : 0.0);
return(Rcpp::List::create(_["subsize"] = subsize,
_["submean"] = mn
));
'
)
myfunc1(df)
### OUTPUT:
#
# $subsize
# [1] 300
#
# $submean
# [1] 0.1091555
#
Rcpp (>= 0.10.0) implements the | operator between two logical sugar expressions. So you can do:
require( Rcpp )
cppFunction( code = '
List subsum( IntegerVector x, IntegerVector y, NumericVector v){
using namespace Rcpp ;
LogicalVector indic = (x==y) | (x==y+1) ;
int subsize = sum(indic) ;
double submean = subsize == 0 ? 0.0 : sum(ifelse(indic,v,0)) / subsize ;
return List::create( _["subsize"] = subsize, _["submean"] = submean ) ;
}
' )
subsum( rpois(1000,3), rpois(1000,3), rnorm(1000) )
# $subsize
# [1] 320
#
# $submean
# [1] -0.05708866
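For the data frame df built in the question, the equivalent call is simply (passing the columns in the order x, y, v):
subsum(df$x, df$y, df$v)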