As a new R user, I am very curious on what R is doing when we type in a function. For example, I am using knn function in the class package. All I need to do is type in knn and define by train and test data sets. Then what I get is the predicted class for my test data. However, I am curious if there is a way to see the actual equations/formula that is in knn. I have look through some knn references but am still curious on EXACTLY what R is doing! Is it possible to find such information?
Any help is greatly appreciated!!!
Well, the first thing you can do is simply type in the name of the function, which in many cases will give you the source right there. For example:
> knn
function (train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
{
train <- as.matrix(train)
if (is.null(dim(test)))
dim(test) <- c(1, length(test))
test <- as.matrix(test)
if (any(is.na(train)) || any(is.na(test)) || any(is.na(cl)))
stop("no missing values are allowed")
p <- ncol(train)
ntr <- nrow(train)
if (length(cl) != ntr)
stop("'train' and 'class' have different lengths")
if (ntr < k) {
warning(gettextf("k = %d exceeds number %d of patterns",
k, ntr), domain = NA)
k <- ntr
}
if (k < 1)
stop(gettextf("k = %d must be at least 1", k), domain = NA)
nte <- nrow(test)
if (ncol(test) != p)
stop("dims of 'test' and 'train' differ")
clf <- as.factor(cl)
nc <- max(unclass(clf))
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
res <- factor(Z$res, levels = seq_along(levels(clf)), labels = levels(clf))
if (prob)
attr(res, "prob") <- Z$pr
res
}
<bytecode: 0x393c650>
<environment: namespace:class>
>
In this case, you can see that the real work is being done by an external call to VR_knn. If you want to dig deeper, you can go to http://cran.r-project.org/web/packages/class/index.html, and download the source for this package. If you download and extract the source, you will find a folder called "src" that holds the C code, and you can look through that, and find the source to that function:
void
VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
double *train, Sint *class, double *test, Sint *res, double *pr,
Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
{
int i, index, j, k, k1, kinit = *kin, kn, l = *lin, mm, npat, ntie,
ntr = *pntr, nte = *pnte, extras;
int pos[MAX_TIES], nclass[MAX_TIES];
int j1, j2, needed, t;
double dist, tmp, nndist[MAX_TIES];
RANDIN;
/*
Use a 'fence' in the (k+1)st position to avoid special cases.
Simple insertion sort will suffice since k will be small.
*/
for (npat = 0; npat < nte; npat++) {
kn = kinit;
for (k = 0; k < kn; k++)
nndist[k] = 0.99 * DOUBLE_XMAX;
for (j = 0; j < ntr; j++) {
if ((*cv > 0) && (j == npat))
continue;
dist = 0.0;
for (k = 0; k < *p; k++) {
tmp = test[npat + k * nte] - train[j + k * ntr];
dist += tmp * tmp;
}
/* Use 'fuzz' since distance computed could depend on order of coordinates */
if (dist <= nndist[kinit - 1] * (1 + EPS))
for (k = 0; k <= kn; k++)
if (dist < nndist[k]) {
for (k1 = kn; k1 > k; k1--) {
nndist[k1] = nndist[k1 - 1];
pos[k1] = pos[k1 - 1];
}
nndist[k] = dist;
pos[k] = j;
/* Keep an extra distance if the largest current one ties with current kth */
if (nndist[kn] <= nndist[kinit - 1])
if (++kn == MAX_TIES - 1)
error("too many ties in knn");
break;
}
nndist[kn] = 0.99 * DOUBLE_XMAX;
}
for (j = 0; j <= *nc; j++)
votes[j] = 0;
if (*use_all) {
for (j = 0; j < kinit; j++)
votes[class[pos[j]]]++;
extras = 0;
for (j = kinit; j < kn; j++) {
if (nndist[j] > nndist[kinit - 1] * (1 + EPS))
break;
extras++;
votes[class[pos[j]]]++;
}
} else { /* break ties at random */
extras = 0;
for (j = 0; j < kinit; j++) {
if (nndist[j] >= nndist[kinit - 1] * (1 - EPS))
break;
votes[class[pos[j]]]++;
}
j1 = j;
if (j1 == kinit - 1) { /* no ties for largest */
votes[class[pos[j1]]]++;
} else {
/* Use reservoir sampling to choose amongst the tied distances */
j1 = j;
needed = kinit - j1;
for (j = 0; j < needed; j++)
nclass[j] = class[pos[j1 + j]];
t = needed;
for (j = j1 + needed; j < kn; j++) {
if (nndist[j] > nndist[kinit - 1] * (1 + EPS))
break;
if (++t * UNIF < needed) {
j2 = j1 + (int) (UNIF * needed);
nclass[j2] = class[pos[j]];
}
}
for (j = 0; j < needed; j++)
votes[nclass[j]]++;
}
}
/* Use reservoir sampling to choose amongst the tied votes */
ntie = 1;
if (l > 0)
mm = l - 1 + extras;
else
mm = 0;
index = 0;
for (i = 1; i <= *nc; i++)
if (votes[i] > mm) {
ntie = 1;
index = i;
mm = votes[i];
} else if (votes[i] == mm && votes[i] >= l) {
if (++ntie * UNIF < 1.0)
index = i;
}
res[npat] = index;
pr[npat] = (double) mm / (kinit + extras);
}
RANDOUT;
}
In your editor (e.g., RStudio) just type in the function name and execute the line. This shows you the source code of the function, i.e., type
knn
In RStudio you can also click on the function and hit F2. A new tab with the function source code will open.
Alternatively you could use
debug(knn)
knn(your function arguments)
and step through the function with the debugger.
When you are done use
undebug(knn)
The Help Desk article in the October 2006 R News (a newsletter that has since evolved into The R Journal) shows how to access the source of R functions covering many of the different cases that you may need to use, from just typing the name of the function, to looking in namespaces, to finding the source files for compiled code.
Related
Suppose I have the following c++ code in a file named test.cpp
#include <Rcpp.h>
//[[Rcpp::export]]
Rcpp::NumericMatrix MyAbar (const Rcpp::NumericMatrix & x, int T){
unsigned int outrows = x.nrow(), i = 0, j = 0;
double d;
Rcpp::NumericMatrix out(outrows,outrows);
// Rcpp::LogicalVector comp;
for (i = 0; i < outrows - 1; i++){
Rcpp::NumericVector v1 = x.row(i);
Rcpp::NumericVector ans(outrows);
for (j = i + 1; j < outrows ; j ++){
d = mean(Rcpp::runif( T ) < x(i,j));
out(j,i)=d;
out(i,j)=d;
}
}
return out;
}
I know with the following command, I can have my own package
Rcpp.package.skeleton("test",cpp_files = "~/Desktop/test.cpp")
However, what if I want to combine the following R function which call the Rcpp-function into the package
random = function(A, T){
if (!is.matrix(A)){
A = Reduce("+",A)/T
}
# global constant and threshold
n = nrow(A)
B_0 = 3
w = min(sqrt(n),sqrt(T * log(n)))
q = B_0 * log(n) / (sqrt(n) * w)
A2 = MyAbar(A)
diag(A2) <- NA
K = A2 <= rowQuantiles(A2, probs=q, na.rm =TRUE)
diag(K) = FALSE
P = K %*% A * ( 1/(rowSums(K) + 1e-10))
return( (P + t(P))*0.5 )
}
How can i make it?
So you are asking how to make an R package? There are many good tutorials.
To a first approximation:
copy your file into, say, file R/random.R
deal with a help file for your function, either manually by writing man/random.Rd or by learning package roxygen2
make sure you know what NAMESPACE is for and that DESCRIPTION is right
I have a following R code which is not efficient. I would like to make this efficient using Rcpp. Particularly, I am not used to dealing with array in Rcpp. Any help would be appreciated.
myfunc <- function(n=1600,
m=400,
p = 3,
time = runif(n,min=0.05,max=4),
qi21 = rnorm(n),
s0c = rnorm(n),
zc_min_ecox_multi = array(rnorm(n*n*p),dim=c(n,n,p)),
qi=matrix(0,n,n),
qi11 = rnorm(p),
iIc_mat = matrix(rnorm(p*p),p,p)){
for (j in 1:n){
u<-time[j]
ind<-1*(u<=time)
locu<-which(time==u)
qi2<- sum(qi21*ind) /s0c[locu]
for (i in 1:n){
qi1<- qi11%*%iIc_mat%*%matrix(zc_min_ecox_multi[i,j,],p,1)
qi[i,j]<- -(qi1+qi2)/m
}
}
}
Computing time is about 7.35 secs. I need to call this function over and over again, maybe 20 times.
system.time(myfunc())
user system elapsed
7.34 0.00 7.35
First thing to do would be to profile your code: profvis::profvis({myfunc()}).
What you can do is precompute qi11 %*% iIc_mat once.
You get (with minor improvements):
precomp <- qi11 %*% iIc_mat
for (j in 1:n) {
u <- time[j]
qi2 <- sum(qi21[u <= time]) / s0c[time == u]
for (i in 1:n) {
qi1 <- precomp %*% zc_min_ecox_multi[i, j, ]
qi[i, j] <- -(qi1 + qi2) / m
}
}
that is twice as fast (8 sec -> 4 sec).
Vectorizing the i loop then seems straightforward:
q1_all_i <- tcrossprod(precomp, zc_min_ecox_multi[, j, ])
qi[, j] <- -(q1_all_i + qi2) / m
(12 times as fast now)
And if you want to try it in Rcpp, you will first need a function to multiply the matrices...
#include<Rcpp.h>
#include<numeric>
// [[Rcpp::plugins("cpp11")]]
Rcpp::NumericMatrix mult(const Rcpp::NumericMatrix& lhs,
const Rcpp::NumericMatrix& rhs)
{
if (lhs.ncol() != rhs.nrow())
Rcpp::stop ("Incompatible matrices");
Rcpp::NumericMatrix out(lhs.nrow(),rhs.ncol());
Rcpp::NumericVector rowvec, colvec;
for (int i = 0; i < lhs.nrow(); ++i)
{
rowvec = lhs(i,Rcpp::_);
for (int j = 0; j < rhs.ncol(); ++j)
{
colvec = rhs(Rcpp::_,j);
out(i, j) = std::inner_product(rowvec.begin(), rowvec.end(),
colvec.begin(), 0.);
}
}
return out;
}
Then port your function...
// [[Rcpp::export]]
Rcpp::NumericMatrix myfunc_rcpp( int n, int m, int p,
const Rcpp::NumericVector& time,
const Rcpp::NumericVector& qi21,
const Rcpp::NumericVector& s0c,
const Rcpp::NumericVector& zc_min_ecox_multi,
const Rcpp::NumericMatrix& qi11,
const Rcpp::NumericMatrix& iIc_mat)
{
Rcpp::NumericMatrix qi(n, n);
Rcpp::NumericMatrix outermat = mult(qi11, iIc_mat);
for (int j = 0; j < n; ++j)
{
double qi2 = 0;
for(int k = 0; k < n; ++k)
{
if(time[j] <= time[k]) qi2 += qi21[k];
}
qi2 /= s0c[j];
for (int i = 0; i < n; ++i)
{
Rcpp::NumericMatrix tmpmat(p, 1);
for(int z = 0; z < p; ++z)
{
tmpmat(z, 0) = zc_min_ecox_multi[i + n*j + z*n*n];
}
Rcpp::NumericMatrix qi1 = mult(outermat, tmpmat);
qi(i,j) -= (qi1(0,0) + qi2)/m;
}
}
return qi;
}
Then in R:
my_rcpp_func <- function(n=1600,
m=400,
p = 3,
time = runif(n,min=0.05,max=4),
qi21 = rnorm(n),
s0c = rnorm(n),
zc_min_ecox_multi = array(rnorm(n*n*p),dim=c(n,n,p)),
qi11 = rnorm(p),
iIc_mat = matrix(rnorm(p*p),p,p))
{
myfunc_rcpp(n, m, p, time, qi21, s0c, as.vector(zc_min_ecox_multi),
matrix(qi11,1,p), iIc_mat)
}
This is certainly faster, and gives the same results as your own function, but it's no quicker than the in-R optimizations suggested by F Privé. Maybe optimizing the C++ code could get things even faster, but ultimately you are multiplying 2 reasonably large matrices together over 2.5 million times, so it's never going to be all that fast. R is optimized pretty well for this kind of calculation after all...
I need to do it with recursion, but the problem is that function depends on only ONE parameter and inside function it depends on two ( k and n ), also how to find minimum value if it returns only one value?
The function is :
I've already tried to make random k, but I don't think that is really good idea.
F1(int n) {
Random random = new Random();
int k = random.Next(1,10);
if (1 <= k && k <= n){
return Math.Min(F1(k - 1) + F1(n - k) + n);
} else {
return 0;
}
}
You need to make a loop traversing all k values in range 1..n. Something like this:
F1(int n) {
if (n == 0)
return ???? what is starting value?
minn = F1(0) + F1(n - 1) + n
for (int k = 2; k <= n; k++)
minn = Math.Min(minn, F1(k - 1) + F1(n - k) + n);
return minn;
}
I receive this error when running the following code aimed at producing a quicksort function. It seems to be the first while call I make that causes the problem. Can someone tell me what I'm doing wrong?
library(Rcpp)
library(inline)
body_sortCpp <- '
NumericVector arr(x);
int n = arr.size();
double tmp;
double left = arr[0];
double right = arr[n-1];
int pivot = arr[n/2];
double i = left, j = right;
while (i <= j){
while (arr[i] <= pivot)
i++;
while (arr[j] > pivot)
j--;
if (i <= j) {
tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
i++;
j--;
}
}
if (left < j){
sortCpp(arr, left, j);
}
if (i < right){
sortCpp(arr, i, right);
}
return wrap( arr );
'
sortCpp <- cxxfunction( signature( x = "numeric"),
body = body_sortCpp,
include = body_sortCpp,
plugin = "Rcpp")
Your use of cxxfunction() is likely wrong:
sortCpp <- cxxfunction( signature( x = "numeric"),
body = body_sortCpp,
include = body_sortCpp,
plugin = "Rcpp")
as you supply the code twice. Remove the include= and you may be good to go.
Also consider a proper editor with indenting, and look into the Rcpp Attributes vignette as a superior alternative to cxxfunction().
I have a big matrix and am interested in computing the correlation between the rows of the matrix. Since the cor method computes correlation between the columns of a matrix, I am transposing the matrix before calling cor. But since the matrix is big, transposing it is expensive and is slowing down my program. Is there a way to compute the correlations among the rows without having to take transpose?
EDIT: thanks for the responses. thought i'd share some findings. my input matrix is 16 rows by 239766 cols and comes from a .mat file. I wrote C# code to do the same thing using the csmatio library. it looks like this:
foreach (var file in Directory.GetFiles(path, interictal_pattern))
{
var reader = new MatFileReader(file);
var mla = reader.Data[0] as MLStructure;
convert(mla.AllFields[0] as MLNumericArray<double>, data);
double sum = 0;
for (var i = 0; i < 16; i++)
{
for (var j = i + 1; j < 16; j++)
{
sum += cor(data, i, j);
}
}
var avg = sum / 120;
if (++count == 10)
{
var t2 = DateTime.Now;
var t = t2 - t1;
Console.WriteLine(t.TotalSeconds);
break;
}
}
static double[][] createArray(int rows, int cols)
{
var ans = new double[rows][];
for (var row = 0; row < rows; row++)
{
ans[row] = new double[cols];
}
return ans;
}
static void convert(MLNumericArray<double> mla, double[][] M)
{
var rows = M.Length;
var cols = M[0].Length;
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
M[i][j] = mla.Get(i, j);
}
static double cor(double[][] M, int i, int j)
{
var count = M[0].Length;
double sum1 = 0, sum2 = 0;
for (int ctr = 0; ctr < count; ctr++)
{
sum1 += M[i][ctr];
sum2 += M[j][ctr];
}
var mu1 = sum1 / count;
var mu2 = sum2 / count;
double numerator = 0, sumOfSquares1 = 0, sumOfSquares2 = 0;
for (int ctr = 0; ctr < count; ctr++)
{
var x = M[i][ctr] - mu1;
var y = M[j][ctr] - mu2;
numerator += x * y;
sumOfSquares1 += x * x;
sumOfSquares2 += y * y;
}
return numerator / Math.Sqrt(sumOfSquares1 * sumOfSquares2);
}
this gave a throughput of 22.22s for 10 files or 2.22s/file
Then I profiled my R code:
ptm=proc.time()
for(file in files)
{
i = i + 1;
mat = readMat(paste(path,file,sep=""))
a = t(mat[[1]][[1]])
C = cor(a)
correlations[i] = mean(C[lower.tri(C)])
}
print(proc.time()-ptm)
to my surprise its running faster than C# and is giving throughput of 5.7s per 10 files or 0.6s/file (an improvement of almost 4x!). The bottleneck in C# is the methods inside csmatio library to parse double values from input stream.
and if i do not convert the csmatio classes into a double[][] then the C# code runs extremely slow (order of magnitude slower ~20-30s/file).
Seeing that this problem arises from a data input issue whose details are not stated (and only hinted at in a comment), I will assume this is a comma-delimited file of unquoted numbers with the number of columns= Ncol. This does the transposition on input.
in.mat <- matrix( scan("path/to/the_file/fil.txt", what =numeric(0), sep=","),
ncol=Ncol, byrow=TRUE)
cor(in.nmat)
One dirty work-around would be to apply cor-functions row-wise and produce the correlation matrix from the results. You could try if this is any more efficient (which I doubt, though you could fine-tune it by not double computing everything or the redundant diagonal cases):
# Apply 2-fold nested row-wise functions
set.seed(1)
dat <- matrix(rnorm(1000), nrow=10)
cormat <- apply(dat, MARGIN=1, FUN=function(z) apply(dat, MARGIN=1, FUN=function(y) cor(z, y)))
cormat[1:3,1:3] # Show few first
# [,1] [,2] [,3]
#[1,] 1.000000000 0.002175792 0.1559263
#[2,] 0.002175792 1.000000000 -0.1870054
#[3,] 0.155926259 -0.187005418 1.0000000
Though, generally I would expect the transpose to have a really, really efficient implementation, so it's hard to imagine when that would be the bottle-neck. But, you could also dig through the implementation of 'cor' function and call the correlation C-function itself by first making sure your rows are suitable. Type 'cor' in the terminal to see the implementation, which is mostly a wrapper that makes input suitable for the C-function:
# Row with C-call from the implementation of 'cor':
# if (method == "pearson")
# .Call(C_cor, x, y, na.method, FALSE)
You can use outer:
outer(seq(nrow(mat)), seq(nrow(mat)),
Vectorize(function(x, y) cor(mat[x , ], mat[y , ])))
where mat is the name of your matrix.