I have an operation I am running in R, and I want to know whether there is any set of rules that can help me decide if the operation should be performed over rows or over columns, given that transposing the matrix beforehand is otherwise just a matter of programming preference.
The only general advice I have found so far is: test it on a subsample every time. Can we do better than that in any way, for example a rule of thumb like "division is best performed on a long matrix rather than a wide one"? If we can't do better than that, why not?
I have programmed my specific operation of interest as follows, but keep in mind that I am more interested in the general question than in this specific case:
support_n: the matrix I'm investigating; its dimensions are N x (K choose N), with K > 50 and N > 4.
fz(): a bland function of several variables built from polynomials, max, and min.
fz <- function(z, vec_l){
  if (z %in% vec_l) {        # if z equals any element of vec_l, return 0
    out <- 0
  } else if (z > max(vec_l)) {
    out <- z^2 * max(vec_l)^2
  } else {
    out <- z^2 + min(vec_l)^2
  }
  out
}
library(doParallel)
cl <- makeCluster(detectCores() - 1)   # cluster setup; the core count here is illustrative
registerDoParallel(cl)
system.time(
  payoff <- foreach(y = 1:nrow(support_n), .combine = 'cbind') %:%
    foreach(x = 1:ncol(support_n), .combine = 'c') %dopar% {
      fz(support_n[y, x], support_n[-y, x])
    }
)
So should I run this over y's or x's first, in general? Why?
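For reference, here is a minimal sketch of the "test it on a subsample" approach I currently use, timing both nestings on a small slice of support_n (the subsample size is an arbitrary choice; the two nestings return transposed results, but only the timings are compared here):
sub <- support_n[, 1:200, drop = FALSE]   # small, arbitrary subsample of columns

t_rows_outer <- system.time(
  foreach(y = 1:nrow(sub), .combine = 'cbind') %:%
    foreach(x = 1:ncol(sub), .combine = 'c') %dopar% {
      fz(sub[y, x], sub[-y, x])
    }
)

t_cols_outer <- system.time(
  foreach(x = 1:ncol(sub), .combine = 'cbind') %:%
    foreach(y = 1:nrow(sub), .combine = 'c') %dopar% {
      fz(sub[y, x], sub[-y, x])
    }
)

rbind(rows_outer = t_rows_outer, cols_outer = t_cols_outer)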
Related
Say I have a function func that takes two scalar numeric inputs and delivers a scalar numeric result, and I have the following code to calculate a result vector u, based on input numeric vector v and initial value u0 for the result vector:
u <- rep(u0, 1 + length(v))
for (k in 2:length(u)) {
  u[k] <- func(u[k-1], v[k-1])
}
Note how a component of the result vector depends not only on the corresponding element of the input vector but also on the immediately prior element of the result vector. I can see no obvious way to vectorise this.
It is common to do this sort of thing in financial simulations, for instance when projecting forward company accounts, rolling them up with interest or inflation and adding in operational cash flows each year.
For some specific instances, it is possible to find a case-specific, non-iterative coding, but I would like to know if there's a general solution.
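For example, here is a sketch (purely illustrative, with made-up numbers) of such a case-specific coding when func happens to be the linear roll-up u[k] = r*u[k-1] + v[k-1]:
# When func is the linear update u[k] = r * u[k-1] + v[k-1],
# the recursion has a closed form built from cumprod() and cumsum().
r  <- 1.03
u0 <- 100
v  <- c(10, -5, 20, 15)

growth   <- cumprod(rep(r, length(v)))           # r, r^2, r^3, ...
u.closed <- growth * (u0 + cumsum(v / growth))   # equals u[2:(length(v)+1)] from the loop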
The problem can also be coded by recursion, as follows:
calc.u <- function(v, u0){
  if (length(v) < 2) {
    func(u0, v[1])
  } else {
    u.prior <- calc.u(v[-length(v)], u0)   # recurse on all but the last element
    c(u.prior, func(u.prior[length(u.prior)], v[length(v)]))
  }
}
u <- calc.u(v, u0)
Is there an R tactic for doing this without using either iteration or recursion, i.e. for vectorising it?
Answered: Thank you @MrFlick for introducing me to the Reduce function, which does exactly what I was wanting. I see that
Reduce('+', v, 0, accumulate=TRUE)[-1]
gives me
cumsum(v)
and
Reduce('*', v, 1, accumulate=TRUE)[-1]
gives me
cumprod(v)
as expected, where the [-1] is to discard the initial value (0 for the sum, 1 for the product).
Very nice indeed! Thanks again.
If you have this example
u0 <- 5
v <- (1:5)*2
func <- function(u,v) {u/2+v}
u <- rep(u0, 1 + length(v))
for (k in 2:length(u)) {
  u[k] <- func(u[k-1], v[k-1])
}
this is equivalent to
w <- Reduce(func, v, u0, accumulate=TRUE)
And we can check that
all(u==w)
# [1] TRUE
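For a flavour closer to the financial use case mentioned in the question (rolling a balance forward with interest and yearly cash flows), here is a small hypothetical example; the rate, opening balance and cash flows are made up:
rate      <- 0.03
balance0  <- 1000
cashflows <- c(100, -20, 50, 75)   # operational cash flow added each year

roll <- function(balance, cf) balance * (1 + rate) + cf
Reduce(roll, cashflows, balance0, accumulate = TRUE)[-1]   # year-end balances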
I have a fairly simple computation I need to do, but I cannot figure out how to do it in a way that is even close to efficient. I have a large n x n matrix A, and I need to compute the following triple sum: the sum of A[i,j] * A[j,k] over all indices i, j, k from 1 to n.
I'm still fairly inexperienced at coding, and so the only way that comes to my mind is to do the straightforward thing and use 3 for loops to move across the indexes:
sum <- 0
for (i in 1:n) {
  for (j in 1:n) {
    for (k in 1:n) {
      sum <- sum + A[i, j] * A[j, k]
    }
  }
}
Needless to say, for any decent size matrix this takes forever to run. I know there must be a better, more efficient way to do this, but I cannot figure it out.
If you set aside the sums over i and k, you can see that the inner sum over j is just the matrix product of A with itself. That product is obtained in R with the %*% operator. After computing this matrix, you just need to sum all of its elements together:
sum(A %*% A)
should give the result you are seeking.
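As a quick sanity check (sizes and seed below are arbitrary), the triple loop and the matrix-product formulation agree up to floating-point tolerance:
set.seed(1)
n <- 40
A <- matrix(rnorm(n * n), n, n)

total <- 0
for (i in 1:n)
  for (j in 1:n)
    for (k in 1:n)
      total <- total + A[i, j] * A[j, k]

all.equal(total, sum(A %*% A))   # TRUE (up to floating-point tolerance)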
Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (string-distance based) matching. The merging job I have to do is quite big (it results in multiple string distance matrices with a little under one billion cells each). My impression is that the fuzzy_join function is not written very efficiently with regard to memory usage, and that its parallelization is implemented in a strange way: when there are multiple fuzzy variables, it parallelizes the computation of the string distance matrices rather than the computation of the string distances themselves. As with fuzzy_join, the idea is to match on exact variables where possible (to keep the matrices smaller) and then to proceed to fuzzy matching within these exactly matched groups. I think the function is self-explanatory. I am posting it here because I would like some feedback to improve it, and because I suspect I am not the only one trying to do this sort of thing in R (I admit that Python, SQL and the like would probably be more efficient in this context, but one has to stick with the tools one feels most comfortable with, and doing the data cleaning and preparation in the same language is nice for reproducibility).
merge.fuzzy = function(a,b,.exact,.fuzzy,.weights,.method,.ncores) {
require(data.table)   # a and b must be data.tables; data.table syntax is used throughout
require(stringdist)
require(matrixStats)
require(parallel)
if (length(.fuzzy)!=length(.weights)) {
stop(paste0("fuzzy and weigths must have the same length"))
}
if (!any(class(a)=="data.table")) {
stop(paste0("'a' must be of class data.table"))
}
if (!any(class(b)=="data.table")) {
stop(paste0("'b' must be of class data.table"))
}
#convert everything to lower
a[,c(.fuzzy):=lapply(.SD,tolower),.SDcols=.fuzzy]
b[,c(.fuzzy):=lapply(.SD,tolower),.SDcols=.fuzzy]
a[,c(.exact):=lapply(.SD,tolower),.SDcols=.exact]
b[,c(.exact):=lapply(.SD,tolower),.SDcols=.exact]
#create ids
a[,"id.a":=as.numeric(.I),by=c(.exact,.fuzzy)]
b[,"id.b":=as.numeric(.I),by=c(.exact,.fuzzy)]
c <- unique(rbind(a[,.exact,with=FALSE],b[,.exact,with=FALSE]))
c[,"exa.id":=.GRP,by=.exact]
a <- merge(a,c,by=.exact,all=FALSE)
b <- merge(b,c,by=.exact,all=FALSE)
##############
stringdi <- function(a,b,.weights,.by,.method,.ncores) {
sdm <- list()
if (is.null(.weights)) {.weights <- rep(1,length(.by))}
if (nrow(a) < nrow(b)) {
for (i in 1:length(.by)) {
sdm[[i]] <- stringdistmatrix(a[[.by[i]]],b[[.by[i]]],method=.method,ncores=.ncores,useNames=TRUE)
}
} else {
for (i in 1:length(.by)) { #b is no longer than a here: compute b vs a and transpose back so rows still correspond to a; putting the shorter side first enhances parallelization speed
sdm[[i]] <- t(stringdistmatrix(b[[.by[i]]],a[[.by[i]]],method=.method,ncores=.ncores,useNames=FALSE))
}
}
rsdm = nrow(sdm[[1]])
csdm = ncol(sdm[[1]])
sdm = matrix(unlist(sdm),ncol=length(.by))
#weighted mean of the distances across the fuzzy variables, ignoring NAs
sdm = rowSums(sweep(sdm,2,.weights,"*"),na.rm=TRUE)/((0 + !is.na(sdm)) %*% .weights)
sdm = matrix(sdm,nrow=rsdm,ncol=csdm)
#use ids as row/ column names
rownames(sdm) <- a$id.a
colnames(sdm) <- b$id.b
mid <- max.col(-sdm,ties.method="first")
mid <- matrix(c(1:nrow(sdm),mid),ncol=2)
bestdis <- sdm[mid]
res <- data.table(as.numeric(rownames(sdm)),as.numeric(colnames(sdm)[mid[,2]]),bestdis)
setnames(res,c("id.a","id.b","dist"))
res
}
setkey(b,exa.id)
distances = a[,stringdi(.SD,b[J(.BY[[1]])],.weights=.weights,.by=.fuzzy,.method=.method,.ncores=.ncores),by=exa.id]
a = merge(a,distances,by=c("exa.id","id.a"))
res = merge(a,b,by=c("exa.id","id.b"))
res
}
The following points would be interesting:
I am not quite sure how to code multiple exact matching variables in the data.table style I used above (which I believe is the fastest option).
Is it possible to have nested parallelization? That is, can a parallel foreach loop be run on top of the computation of the string distance matrices? (A rough sketch of what I mean follows after these points.)
I am also interested in ideas with regard to making the whole thing more efficient, i.e. to consume less memory.
Maybe you can suggest a bigger "real world" data set so that I can create a working example. Unfortunately I cannot share even small samples of my data with you.
In the future it would also be nice to do something other than a classic inner join, so ideas with regard to this topic are also very much appreciated.
All your comments are welcome!
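On the nested-parallelization point, here is a very rough, hypothetical sketch of what I have in mind (the group data and worker count are made up, and running both levels at once may oversubscribe the CPU):
library(doParallel)
library(stringdist)

cl <- makeCluster(2)                            # outer workers (arbitrary count)
registerDoParallel(cl)

groups <- split(letters, rep(1:2, each = 13))   # stand-in for the exact-match groups

res <- foreach(g = groups, .packages = "stringdist") %dopar% {
  # inner level: stringdistmatrix() can additionally use its own cores
  # (the ncores/.ncores argument in merge.fuzzy() above)
  stringdistmatrix(g, g, method = "osa", useNames = FALSE)
}
stopCluster(cl)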
I've written a short 'for' loop to find the minimum Euclidean distance between each row in a data frame and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to compute a distance matrix for a very large number of rows. However, while it uses little memory, it is very slow for large matrices (my use case of ~150K rows is still running).
I'm wondering whether anyone can advise me or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.
Thanks in advance (and for your patience).
require(proxy)
df <- data.frame(matrix(runif(10*10), nrow=10, ncol=10),
                 row.names=paste("site", 1:10))
min.dist <- function(df) {
  #data frame for results
  all.min.dist <- data.frame()
  #set up for loop
  for (k in 1:nrow(df)) {
    #calculate dissimilarity between row k and all other rows (proxy::dist)
    df.dist <- dist(df[k,], df[-k,])
    #find minimum distance
    min.dist <- min(df.dist)
    #get rowname for minimum distance (id of nearest point)
    closest.row <- row.names(df)[-k][which.min(df.dist)]
    #combine outputs
    all.min.dist <- rbind(all.min.dist,
                          data.frame(orig_row=row.names(df)[k],
                                     dist=min.dist,
                                     closest_row=closest.row))
  }
  #return results
  return(all.min.dist)
}
#example
min.dist(df)
This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.
min.dist <- function(df) {
  which.closest <- function(k, tm) {
    # tm is the transposed data matrix, so its columns are the original rows
    d <- colSums((tm[, -k] - tm[, k]) ^ 2)
    m <- which.min(d)
    data.frame(orig_row    = colnames(tm)[k],
               dist        = sqrt(d[m]),
               closest_row = colnames(tm)[-k][m])
  }
  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
If this is still too slow, you could compute the distances for a block of k points at a time instead of a single one; the size of k will need to be a compromise between speed and memory usage. A rough sketch of that idea follows the link below.
Edit: Also read https://stackoverflow.com/a/16670220/1201032
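For illustration, a rough, untested-at-scale sketch of that blocking idea (block is an arbitrary tuning value); it computes squared Euclidean distances for a whole block of rows at once via the identity |x - y|^2 = |x|^2 + |y|^2 - 2*x.y:
min.dist.block <- function(df, block = 1000) {
  m  <- as.matrix(df)
  n  <- nrow(m)
  sq <- rowSums(m ^ 2)
  out <- vector("list", ceiling(n / block))
  for (b in seq_along(out)) {
    idx <- ((b - 1) * block + 1):min(b * block, n)
    # squared distances between this block of rows and all rows
    d2 <- outer(sq[idx], sq, "+") - 2 * tcrossprod(m[idx, , drop = FALSE], m)
    d2[cbind(seq_along(idx), idx)] <- Inf        # exclude self-distances
    j  <- max.col(-d2, ties.method = "first")    # index of the closest row
    out[[b]] <- data.frame(orig_row    = rownames(m)[idx],
                           dist        = sqrt(pmax(d2[cbind(seq_along(idx), j)], 0)),
                           closest_row = rownames(m)[j])
  }
  do.call(rbind, out)
}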
Usually, built-in functions are faster than coding it yourself (because they are coded in Fortran or C/C++ and optimized).
It seems that the function dist {stats} answers your question spot on; a small sketch follows the description below:
Description
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
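A minimal sketch of using it for the nearest-row task on the small example df above, only feasible when the full n x n distance matrix fits in memory (it will not for ~150K rows):
d <- as.matrix(dist(df))            # full pairwise Euclidean distance matrix
diag(d) <- Inf                      # so a row never picks itself
closest <- apply(d, 1, which.min)
data.frame(orig_row    = rownames(df),
           dist        = d[cbind(seq_len(nrow(d)), closest)],
           closest_row = rownames(df)[closest])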
I know that I should avoid for-loops, but I'm not exactly sure how to do what I want to do with an apply function.
Here is a slightly simplified model of what I'm trying to do. So, essentially I have a big matrix of predictors and I want to run a regression using a window of 5 predictors on each side of the indexed predictor (i in the case of a for loop). With a for loop, I can just say something like:
results<-NULL
window<-5
for(i in 1:ncol(g))
{
first<-i-window #Set window boundaries
if(first<1){
1->first
}
last<-i+window-1
if(last>ncol(g)){
ncol(g)->last
}
predictors<-g[,first:last]
#Do regression stuff and return some result
results[i]<-regression stuff
}
Is there a good way to do this with an apply function? My problem is that the vector that apply would be shoving into the function really doesn't matter. All that matters is the index.
This question touches several points that are made in 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
There are some loops you should avoid, but not all of them. And using an apply function is more hiding the loop than avoiding it. This example seems like a good choice to leave in a 'for' loop.
Growing objects is generally bad form -- it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
You can create a list with the final length by:
result <- vector("list", ncol(g))
for(i in 1:ncol(g)) {
# stuff
result[[i]] <- #results
}
In some circumstances you might think the command:
window<-5
means give me a logical vector stating which values of 'window' are less than -5, that is, window < -5, even though R always parses it as the assignment window <- 5.
Spaces are good to use: mostly so as not to confuse humans, but, as the example directly above shows, also so that R does not get a meaning you did not intend.
Using an apply function to do your regression is mostly a matter of preference in this case; it can handle some of the bookkeeping for you (and so possibly prevent errors) but won't speed up the code.
I would suggest using vectorized functions to compute your first and last values, though; perhaps something like:
window <- 5
ng <- 15 #or ncol(g)
xy <- data.frame(first = pmax((1:ng) - window, 1),
                 last  = pmin((1:ng) + window, ng))
Or be even smarter with
xy <- data.frame(first = c(rep(1, window), 1:(ng - window)),
                 last  = c((window + 1):ng, rep(ng, window)))
Then you could use this in a for loop like this:
results <- list()
for(i in 1:nrow(xy)) {
results[[i]] <- xy$first[i] : xy$last[i]
}
results
or with lapply like this:
results <- lapply(1:nrow(xy), function(i) {
xy$first[i] : xy$last[i]
})
where in both cases I just return the sequence between first and last; you would substitute your actual regression code.
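As a hypothetical illustration of that substitution (g, y and the choice of returning R-squared are my own made-up stand-ins, not part of the question):
set.seed(1)
g <- matrix(rnorm(100 * ng), ncol = ng)   # made-up predictor matrix with ng columns
y <- rnorm(100)                           # made-up response

results <- lapply(1:nrow(xy), function(i) {
  predictors <- g[, xy$first[i]:xy$last[i], drop = FALSE]
  fit <- lm(y ~ predictors)
  summary(fit)$r.squared                  # return whatever statistic you need
})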