Need a more efficient threshold-matching function in R

Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R. The first is a list of unique 2-variable values (u), so it is much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. The actual data is much larger (which is the problem, see below) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 rows and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
                PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt to do this was with a for loop over all values of u. This works nicely, but it is computationally intensive and takes quite a while to process the actual data.
# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for(i in 1:nrow(u)){
    m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
  }
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
match <- function(x,...){
  which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
  m <- apply(u,1,match)
})
Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside-the-box approaches to finding these matching rows are also welcome.
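One hedged, outside-the-box sketch (an editorial addition, not from the original post): a data.table non-equi join can pull all d rows falling inside each u row's box in one call. Note it uses <=/>= where the loop above uses a strict <, and uid/did are illustrative index columns.
library(data.table)
ud <- as.data.table(u)[, `:=`(lo1 = PC1 - b, hi1 = PC1 + b,
                              lo2 = PC2 - b, hi2 = PC2 + b, uid = .I)]
dd <- as.data.table(d)[, did := .I]
# for each u row (uid), collect the d row numbers (did) inside its box
hits <- dd[ud, on = .(PC1 >= lo1, PC1 <= hi1, PC2 >= lo2, PC2 <= hi2),
           .(uid = i.uid, did = x.did), nomatch = 0L]
m_dt <- split(hits$did, hits$uid)  # like m, but u rows with no match are dropped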

Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
This should be fast once the number of rows is sufficiently large. We multiply by 100 to move the values into integer space, since IRanges works on integer ranges. Reversing the order of the arguments to findOverlaps could improve performance.
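A small follow-up sketch (my addition, assuming the IRanges code above has been run): the Hits object can be regrouped into the same one-entry-per-row-of-u shape as m using queryHits()/subjectHits(); u rows with no hits are dropped rather than kept as empty entries.
m_ir <- split(subjectHits(hits), queryHits(hits))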

Alas, this seems only slightly faster than the for loop
unlist(Map(function(x,y) {
  which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))
but at least it's something.

I have a cunning plan :-). How about just doing the calculations directly:
> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> thresh<-foo-bar
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]    1    1
[3,]   -1    1
[4,]    1   -1
[5,]    1    1
Now all you have to do is select the rows of that last matrix which are c(-1, 1), using which, and you can easily extract the desired rows from your bar matrix (a sketch follows). Repeat for each row in foo.
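A minimal sketch of that selection step, using the foo/bar/thresh objects from the transcript above:
sel <- which(sign(thresh)[, 1] == -1 & sign(thresh)[, 2] == 1)  # rows matching c(-1, 1)
bar[sel, , drop = FALSE]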

Related

Using R to count patterns in columns

I have a matrix in R containing 1000 columns and 4 rows. Each cell in the matrix contains an integer between 1 and 4. I want to know two things:
1) What is the number of columns that contain a "1", "2", "3", and "4" in any order? Ideally, I would like the code to not require that I input each possible combination of 1,2,3,4 to perform its count.
2) What is the number of columns that contain 3 of the possible integers, but not all 4?
Solution 1
The most obvious approach is to run apply() over the columns and test for the required tabulation of the column vector using tabulate(). This requires first building a factor() out of the column vector to normalize its storage representation to a 1-based integer vector. And since you don't care about order, we must run sort() before comparing the result against the expected tabulation.
For the "4 of 4" problem the expected tabulation will be four 1s, while for the "3 of 4" problem the expected tabulation will be two 1s and one 2.
## generate data
set.seed(1L); NR <- 4L; NC <- 1e3L; m <- matrix(sample(1:4,NR*NC,T),NR);
sum(apply(m,2L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x))))));
## [1] 107
sum(apply(m,2L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x))))));
## [1] 545
Solution 2
v <- c(1L,2L,4L,8L);
sum(colSums(matrix(v[m],nrow(m)))==15L);
## [1] 107
v <- c(1L,3L,9L,27L);
s3 <- c(14L,32L,38L,16L,34L,22L,58L,46L,64L,42L,48L,66L);
sum(colSums(matrix(v[m],nrow(m)))%in%s3);
## [1] 545
Here's a slightly weird solution.
I was looking into how to use colSums() or colMeans() to try to find a quick test for columns that have 4 of 4 or 3 of 4 of the possible cell values. The problem is, there are multiple combinations of the 4 values that sum to the same total. For example, 1+2+3+4 == 10, but 1+1+4+4 == 10 as well, so just getting a column sum of 10 is not enough.
I realized that one possible solution would be to change the set of values that we're summing, such that our target combinations would sum to unambiguous values. We can achieve this by spreading out the original set from 1:4 to something more diffuse. Furthermore, the original set of values of 1:4 is perfect for indexing a precomputed vector of values, so this seemed like a particularly logical approach for your problem.
I wasn't sure what degree of diffusion would be required to make the sums of the target combinations unique. Some ad hoc testing suggested that multiplication by a fixed multiplier would not be sufficient to disambiguate the sums, so I moved up to exponentiation. I wrote the following code to facilitate testing different bases and to identify the minimal base necessary for this disambiguation.
tryBaseForTabulation <- function(N,tab,base) {
    ## make destination value set, exponentiating from 0 to N-1
    x <- base^(seq_len(N)-1L);
    ## make a matrix of unique combinations of the original set
    g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
    ## get the indexes of combinations that match the required tabulation
    good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
    ## get the sums of good and bad combinations
    hs <- rowSums(g[good,,drop=F]);
    ns <- rowSums(g[-good,,drop=F]);
    ## return the number of ambiguous sums; we need to get zero!
    sum(hs%in%ns);
}; ## end tryBaseForTabulation()
The function takes the size of the set (4 for us), the required tabulation (as returned by tabulate()) in sorted order (as revealed earlier, this is four 1s for the "4 of 4" problem, two 1s and one 2 for the "3 of 4" problem), and the test base. This is the result for a base of 2 for the "4 of 4" problem:
tryBaseForTabulation(4L,rep(1L,4L),2L);
## [1] 0
So we get the result we need right away; a base of 2 is sufficient for the "4 of 4" problem. But for the "3 of 4" problem, it takes one more attempt:
tryBaseForTabulation(4L,c(1L,1L,2L),2L);
## [1] 7
tryBaseForTabulation(4L,c(1L,1L,2L),3L);
## [1] 0
So we need a base of 3 for the "3 of 4" problem.
Note that, although we are using exponentiation as the tool to diffuse the set, we don't actually need to perform any exponentiation at solution run-time, because we can simply index a precomputed vector of powers to transform the value space. Unfortunately, indexing a vector with a matrix returns a flat vector result, losing the matrix structure. But we can easily rebuild the matrix structure with a call to matrix(), thus we don't lose very much with this idiosyncrasy.
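A tiny illustration of that indexing behaviour, on toy objects of my own (not part of the original solution):
pv <- c(1L,2L,4L,8L); idx <- matrix(c(1L,3L,2L,4L),nrow=2L);
pv[idx];
## [1] 1 4 2 8
matrix(pv[idx],nrow(idx));
##      [,1] [,2]
## [1,]    1    2
## [2,]    4    8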
The last step is to derive the destination value space and the set of sums that satisfy the problem condition. The value spaces are easy; we can just compute the power sequence as done within tryBaseForTabulation():
2L^(1:4-1L);
## [1] 1 2 4 8
3L^(1:4-1L);
## [1] 1 3 9 27
The set of sums was computed as hs in the tryBaseForTabulation() function. Hence we can write a new similar function for these:
getBaseSums <- function(N,tab,base) {
    ## make destination value set, exponentiating from 0 to N-1
    x <- base^(seq_len(N)-1L);
    ## make a matrix of unique combinations of the original set
    g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
    ## get the indexes of combinations that match the required tabulation
    good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
    ## return the sums of good combinations
    rowSums(g[good,,drop=F]);
}; ## end getBaseSums()
Giving:
getBaseSums(4L,rep(1L,4L),2L);
## [1] 15
getBaseSums(4L,c(1L,1L,2L),3L);
## [1] 14 32 38 16 34 22 58 46 64 42 48 66
Now that the solution is complete, I realize that the cost of the vector index operation, rebuilding the matrix, and the %in% operation for the second problem may render it inferior to other potential solutions. But in any case, it's one possible solution, and I thought it was an interesting idea to explore.
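For what it's worth, here's a rough timing sketch along those lines (my addition; it reuses the m generated above, runs each solution 100 times, and the numbers are purely indicative):
v1 <- c(1L,2L,4L,8L);
system.time(for (i in 1:100) sum(apply(m,2L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x)))))));
system.time(for (i in 1:100) sum(colSums(matrix(v1[m],nrow(m)))==15L));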
Solution 3
Another possible solution is to precompute an N-dimensional lookup table that stores which combinations match the problem condition and which don't. The input matrix can then be used directly as an index matrix into the lookup table (well, almost directly; we'll need a single t() call, since its combinations are laid across columns instead of rows).
For a large set of values, or for long vectors, this could easily become impractical. For example, if we had 8 possible cell values with 8 rows then we would need a lookup table of size 8^8 == 16777216. But fortunately for the sizing given in the question we only need 4^4 == 256, which is completely manageable.
To facilitate the creation of the lookup table, I wrote the following function, which stands for "N-dimensional combinations":
NDcomb <- function(N,f) {
    x <- seq_len(N);
    g <- do.call(expand.grid,rep(list(x),N));
    array(apply(g,1L,f),rep(N,N));
}; ## end NDcomb()
Once the lookup table is computed, the solution is easy:
v <- NDcomb(4L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 107
v <- NDcomb(4L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 545
We can use colSums. Loop over 1:4, convert the matrix to a logical matrix, get the colSums, check whether it is not equal to 0 and sum it.
sapply(1:4, function(i) sum(colSums(m1==i)!=0))
#[1] 6 6 9 5
If we need the number of columns that contain a 3 but not a 4:
sum(colSums(m1!=4)!=0 & colSums(m1==3)!=0)
#[1] 9
data
set.seed(24)
m1 <- matrix(sample(1:4, 40, replace=TRUE), nrow=4)

join two data frames based on user defined function

I'm trying to (inner) join two data frames based on a similarity function that I have.
for example:
data1<-data.frame(a=c(1,2,3),lat=c(38.862976,37.878146,36.825658), lon=c(-99.336782,-99.326054,-98.475976))
data2<-data.frame(b=c(10,20),lat=c(38.863412,37.877333), lon=c(-99.336701,-99.325151))
and given a similarity function:
are.close(lat1,long1,lat2,long2)
something like
data3<-join(a=data1,b=data2,by=c(lat,lon),FUN=are.close(a.lat,a.lon,b.lat,b.lon))
The output I wish to receive is:
a b lat lon
1 1 10 38.862976 -99.336782
2 2 20 37.878146 -99.326054
Where the lat/lon belongs to one of the tables (it doesn't matter which, say the first).
All the join/merge methods I checked don't let you define how the join takes place; they only let you specify things like col1=col2.
Is there a way to do this efficiently (i.e. not by running two nested loops over the two sets)?
I'd suggest using outer to identify (a,b) pairs that meet the criterion:
neighbormat <- outer(
  1:nrow(data1),
  1:nrow(data2),
  function(i1,i2){
    are.close(
      data1$lat[i1],
      data1$lon[i1],
      data2$lat[i2],
      data2$lon[i2]
    )
  }
)
dimnames(neighbormat) <- list(data1$a,data2$b)
Using the names only makes sense if a and b are unique, but I'll assume they are since the OP is using them that way. For @konvas's are.close function (the example defined in the dplyr answer below), this gives
     10    20
1  TRUE  TRUE
2  TRUE  TRUE
3 FALSE FALSE
To get the (a,b) pairs that meet the criterion, use
ns <- which(neighbormat, arr.ind=TRUE, useNames=TRUE)
dimnames(ns) <- list(NULL,c("a","b"))
     a b
[1,] 1 1
[2,] 2 1
[3,] 1 2
[4,] 2 2
It's straightforward to merge these back to the original data. (Taking an arbitrary (lat,lon) as the OP does, seems like a very bad idea, though.)
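A hedged sketch of that merge-back step, using the ns index matrix above and keeping data1's lat/lon as the question suggests (ns[,"a"] and ns[,"b"] are row indices into data1 and data2 respectively):
data3 <- data.frame(a   = data1$a[ns[, "a"]],
                    b   = data2$b[ns[, "b"]],
                    lat = data1$lat[ns[, "a"]],
                    lon = data1$lon[ns[, "a"]])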
Here is an approach using dplyr. I have assumed that are.close() is vectorised and returns TRUE/FALSE; for example, this will work with a function like are.close <- function(a, b, c, d) (a-c)^2 + (b-d)^2 < 1
library(dplyr)
expand.grid(a = data1$a, b = data2$b) %>%
  left_join(data1, by = "a") %>%
  left_join(data2, by = "b") %>%
  mutate(close = are.close(lat.x, lon.x, lat.y, lon.y)) %>%
  filter(close)
I don't know of a function that does this (though there might well be one), so I would try writing some code myself. That might be difficult depending on the data, e.g. the latitude of point 1 could be closest to b 10 whereas its longitude might be closer to b 20. But assuming the pairings are really clear-cut, this might be the beginning of something to work with:
data1<-data.frame(a=c(1,2,3),lat=c(38.862976,37.878146,36.825658), lon=c(-99.336782,-99.326054,-98.475976))
data2<-data.frame(b=c(10,20),lat=c(38.863412,37.877333), lon=c(-99.336701,-99.325151))
# calculate which is the closest value
names(data1)=c("a","lat_original","lon_original")
closest=function(x,to=to) to[which.min(abs(to - x))]
data1$lat=sapply(data1$lat_original,function(x) closest(x,to=data2$lat))
data1$lon=sapply(data1$lon_original,function(x) closest(x,to=data2$lon))
# if dataframes are not equally big: remove biggest assigned "closest values" (or doubles?)
if(nrow(data1)!=nrow(data2)) {
  data1$diff=abs(data1$lat-data1$lat_original)+abs(data1$lon-data1$lon_original)
  maxN <- function(x, N=N){
    x=x[!is.na(x)]
    len=length(x)
    if(N>len){
      warning('N greater than length(x). Setting N=length(x)')
      N=length(x)
    }
    sort(x,partial=len-N+1)[as.numeric(len-N+1):len]
  }
  data1=data1[!data1$diff %in% maxN(data1$diff,N=nrow(data1)-nrow(data2)),]
}
# perhaps check if doubles (two different points of data 1 assigned to the same point in data2)
which(duplicated(paste(data1$lat,data1$lon))==T)
#merge based on those closest values
merge(data1,data2,by=c("lat","lon"))

R Matching function between two matrices

Sorry if this has been posted before. I looked for the answer both on Google and Stackoverflow and couldn't find a solution.
Right now I have two matrices of data in R. I am trying to loop through each row in the matrix, and find the row in the other matrix that is most similar by some distance metric (for now least squared). I figured out one method but it is O(n^2) which is prohibitive for my data.
I think this might be similar to some dictionary learning techniques but I couldn't find anything.
Thanks!
Both matrices are just 30 by n matrices with a number at each entry.
distance.fun=function(mat1,mat2){
  match=c()
  for (i in 1:nrow(mat1)){
    if (all(is.na(mat1[i,]))==FALSE){
      dist=c()
      for (j in 1:nrow(mat2)){
        dist[j]=sum((mat1[i,]-mat2[j,])^2)
        match[i]=which(min(dist) %in% dist)
      }
    }
  }
  return(match)
}
A better strategy would be to compute the distance matrix all at once first, then extract the mins. Here's an example using simulated data
set.seed(15)
mat1<-matrix(runif(2*25), ncol=2)
mat2<-matrix(runif(2*25), ncol=2)
and here's a helper function that calculates the distances between the rows of one matrix and those of another. It uses the built-in dist function; it does do unnecessary within-group comparisons that we eventually have to filter out, but it may still perform better overall.
distab <- function(m1, m2) {
  stopifnot(ncol(m1)==ncol(m2))
  m <- as.matrix(dist(rbind(m1, m2)))[1:nrow(m1), -(1:nrow(m1))]
  rownames(m) <- rownames(m1)
  colnames(m) <- rownames(m2)
  m
}
mydist <- distab(mat1, mat2)
Now that we have the between-group distances, we just need the minimum of each row of mydist, i.e. the closest mat2 row for each mat1 row.
best <- apply(mydist, 1, which.min)
rr <- cbind(m1.row=seq.int(nrow(mat1)), best.m2.row = best)
head(rr) #just print a few
# m1.row best.m2.row
# [1,] 1 1
# [2,] 2 14
# [3,] 3 7
# [4,] 4 3
# [5,] 5 23
# [6,] 6 15
Note that with a strategy like this (as well as with your original implementation) it is possible for multiple rows from mat1 to match the same row in mat2, and for some rows in mat2 to be left unmatched to mat1 (a quick check is sketched below).
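A quick sketch (my addition) to check that caveat on the simulated data:
any(duplicated(best))               # TRUE if several mat1 rows share a best mat2 row
setdiff(seq_len(nrow(mat2)), best)  # mat2 rows that no mat1 row matched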

append new row for each run of function

I am trying to append a new row to a matrix each time I run a function. I reckon that the first time the function is run a matrix is created, and on succeeding runs a new row of values is appended.
Here is some dummy data. Let's say x and y are sides of a rectangle and z some sort of ID. In reality, these are not known in advance but are outputted by the function. The real function takes a species directory as an argument, reads shapefiles, merges polygons and does a bunch of other things, but outputs the surface area. For each species (i.e. run of the function) I would like to store each outputted area in a matrix or a data.frame for further analysis, instead of outputting it to individual variables.
myfunc <- function(x, y, z){
  area <- x*y
  id <- z
  tmp <- cbind(area,id)
  assign(as.matrix('mtrx'), rbind(tmp), envir=.GlobalEnv)
}
The above obviously only creates the matrix and overwrites it each time the function is run.
Any pointers would be very much appreciated!
If, as in your example, you know the values for x, y and z in advance, it makes sense to say something like:
> f1 <- function(x, y, z) c(x*y, z)
> mapply(f1, x=seq(4), y=seq(4), z=seq(4))
     [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
[2,]    1    2    3    4
If the values for these variables are returned by another function, then perhaps best to store them until you're ready to run all the values through the final function (e.g. f1 above).
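A hedged sketch of that store-then-run idea; generator() here is just a dummy stand-in for whatever function actually produces x, y and z for one run:
generator <- function(i) c(x = i, y = i + 1, z = i)  # dummy producer
runs <- lapply(seq(4), generator)                    # store each run's values
vals <- do.call(rbind, runs)                         # one row per run
mapply(f1, x = vals[, "x"], y = vals[, "y"], z = vals[, "z"])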
You say
a new row with values is appended
but in RAM a new matrix is created (assigned) with the new row added each time you append. (You're in Circle 2 of The R Inferno.)
For small sized data this is not likely to be a problem in practice.
Also, using assign can make scoping awkward when calling a function within an environment (e.g. another function), so generally best to avoid if possible. There's usually a better alternative.
Here's the basic idea.
myfunc <- function(ID) {
  # do a bunch of stuff based on ID
  # calculate area
  area <- 2*ID + rnorm(1,0,10) # fake the area...
  return(c(ID=ID,area=area))
}
ID.list <- rep(1:100) # list of ID's
result <- do.call(rbind,lapply(ID.list,myfunc))
# head(result)
# ID area
# [1,] 1 -14.794850
# [2,] 2 13.777036
# [3,] 3 17.807578
# [4,] 4 21.070712
# [5,] 5 11.904047
# [6,] 6 3.735771
Return ID and area as a named vector with c(ID=ID, area=area). Do this for all ID's with the call to lapply(...). Then bind them all together using do.call(rbind,...).
I highly recommend against this method, but if you do go that route you need to use get in that last line:
assign('mtrx', rbind(get('mtrx', envir=parent.frame()), tmp), envir=.GlobalEnv)

R: a for statement wanted that allows for the use of values from each row

I'm pretty new to R.
I'm reading in a file that looks like this:
1 2 1
1 4 2
1 6 4
and storing it in a matrix:
matrix <- read.delim("filename",...)
Does anyone know how to make a for statement that adds up the first and last numbers of one row per iteration?
So the output would be:
2
3
5
Many thanks!
Edit: My bad, I should have made this more clear...
I'm actually more interested in an actual for-loop where I can use multiple values from any column on that specific row in each iteration. The adding up numbers was just an example. I'm actually planning on doing much more with those values (for more than 2 columns), and there are many rows.
So something in the lines of:
for (i in matrix_i) # where i means each row
{
  # do something with column j and column x from row i, for example add them up
}
If you want to get a vector out of this, it is simpler (and marginally computationally faster) to use apply rather than a for statement. In this case,
sums = apply(m, 1, function(x) x[1] + x[3])
Also, you shouldn't call your variables "matrix" since that is the name of a built in function.
ETA: There is an even easier and computationally faster way. R lets you pull out columns and add them together (since they are vectors, they will get added elementwise):
sums = m[, 1] + m[, 3]
m[, 1] means the first column of the data.
Something along these lines should work rather efficiently (i.e. this is a vectorised approach):
m <- matrix(c(1,1,1,2,4,6,1,2,4), 3, 3)
#      [,1] [,2] [,3]
# [1,]    1    2    1
# [2,]    1    4    2
# [3,]    1    6    4
v <- m[,1] + m[,3]
# [1] 2 3 5
You probably can use an apply function or a vectorized approach --- and if you can you really should, but you ask for how to do it in a for loop, so here's how to do that. (Let's call your matrix m.)
results <- numeric(nrow(m))
for (row in 1:nrow(m)) {
  results[row] <- m[row, 1] + m[row, 3]
}
This is probably one of those 100 ways to skin a cat questions. You are perhaps looking for the rowSums function, although you might also find many answers using the apply function.
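A minimal sketch of the rowSums idea, using the matrix m built above:
sums <- rowSums(m[, c(1, 3)])
sums
# [1] 2 3 5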
