I'm trying to (inner) join two data frames based on a similarity function that I have.
For example:
data1<-data.frame(a=c(1,2,3),lat=c(38.862976,37.878146,36.825658), lon=c(-99.336782,-99.326054,-98.475976))
data2<-data.frame(b=c(10,20),lat=c(38.863412,37.877333), lon=c(-99.336701,-99.325151))
and given a similarity function:
are.close(lat1,long1,lat2,long2)
something like
data3<-join(a=data1,b=data2,by=c(lat,lon),FUN=are.close(a.lat,a.lon,b.lat,b.lon))
The output I wish to receive is:
  a  b       lat        lon
1 1 10 38.862976 -99.336782
2 2 20 37.878146 -99.326054
Where the lat/lon belongs to one of the tables (it doesn't matter which, say the first).
All the join/merge methods I checked don't let you define how the join takes place; they only let you specify things like col1 = col2.
Is there a way to do this in a computationally efficient way (i.e. not by running two nested loops over the two sets)?
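For concreteness, a minimal are.close() might just be a vectorised tolerance check like the one below (an illustrative placeholder only, not my actual function):
# illustrative placeholder only: "close" means both coordinates agree to within ~0.005 degrees
are.close <- function(lat1, lon1, lat2, lon2, tol = 0.005) {
  abs(lat1 - lat2) < tol & abs(lon1 - lon2) < tol
}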
I'd suggest using outer to identify (a,b) pairs that meet the criterion:
neighbormat <- outer(
  1:nrow(data1),
  1:nrow(data2),
  function(i1, i2) {
    are.close(
      data1$lat[i1],
      data1$lon[i1],
      data2$lat[i2],
      data2$lon[i2]
    )
  }
)
dimnames(neighbormat) <- list(data1$a,data2$b)
Using the names only makes sense if a and b are unique, but I'll assume they are since the OP is using them that way. For #konvas's are.close function, this gives
     10    20
1  TRUE  TRUE
2  TRUE  TRUE
3 FALSE FALSE
To get the (a,b) pairs that meet the criterion, use
ns <- which(neighbormat, arr.ind = TRUE, useNames = TRUE)
dimnames(ns) <- list(NULL,c("a","b"))
     a b
[1,] 1 1
[2,] 2 1
[3,] 1 2
[4,] 2 2
It's straightforward to merge these back to the original data. (Taking an arbitrary (lat, lon), as the OP does, seems like a very bad idea, though.)
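For example, a minimal sketch of that merge step (using the ns matrix built above; the column choices are just one option):
pairs <- data.frame(a = data1$a[ns[, "a"]], b = data2$b[ns[, "b"]])
merge(pairs, data1, by = "a")  # attaches data1's lat/lon to each matched (a, b) pair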
Here is an approach using dplyr. I have assumed that are.close() is vectorised and returns TRUE/FALSE; for example, this will work with a function like are.close <- function(a, b, c, d) (a - c)^2 + (b - d)^2 < 1
library(dplyr)
expand.grid(a = data1$a, b = data2$b) %>%
  left_join(data1, by = "a") %>%
  left_join(data2, by = "b") %>%
  mutate(close = are.close(lat.x, lon.x, lat.y, lon.y)) %>%
  filter(close)
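If you also want the result shaped like the OP's desired output (keeping data1's coordinates, which the question says is fine), one way, just an assumed variant of the same pipeline, is to swap the mutate/filter pair for a filter() plus a select():
expand.grid(a = data1$a, b = data2$b) %>%
  left_join(data1, by = "a") %>%
  left_join(data2, by = "b") %>%
  filter(are.close(lat.x, lon.x, lat.y, lon.y)) %>%
  select(a, b, lat = lat.x, lon = lon.x)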
I wouldn't know of a function that does this (though of course there might be one...), so I would try writing some code myself, which might be difficult depending on the data: for example, the latitude of point 1 could be closest to b = 10, whereas its longitude might be closer to b = 20, and so on. But assuming the pairings are really clear-cut, this might be the beginning of something to work with:
data1<-data.frame(a=c(1,2,3),lat=c(38.862976,37.878146,36.825658), lon=c(-99.336782,-99.326054,-98.475976))
data2<-data.frame(b=c(10,20),lat=c(38.863412,37.877333), lon=c(-99.336701,-99.325151))
# calculate which is the closest value
names(data1)=c("a","lat_original","lon_original")
closest = function(x, to) to[which.min(abs(to - x))]
data1$lat = sapply(data1$lat_original, function(x) closest(x, to = data2$lat))
data1$lon = sapply(data1$lon_original, function(x) closest(x, to = data2$lon))
# if dataframes are not equally big: remove biggest assigned "closest values" (or doubles?)
if (nrow(data1) != nrow(data2)) {
  data1$diff = abs(data1$lat - data1$lat_original) + abs(data1$lon - data1$lon_original)
  maxN <- function(x, N) {
    x = x[!is.na(x)]
    len = length(x)
    if (N > len) {
      warning('N greater than length(x). Setting N=length(x)')
      N = length(x)
    }
    sort(x, partial = len - N + 1)[(len - N + 1):len]
  }
  data1 = data1[!data1$diff %in% maxN(data1$diff, N = nrow(data1) - nrow(data2)), ]
}
# perhaps check if doubles (two different points of data 1 assigned to the same point in data2)
which(duplicated(paste(data1$lat, data1$lon)))
#merge based on those closest values
merge(data1,data2,by=c("lat","lon"))
I don't know how to explain it briefly, but I'll try my best:
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The values in Ind correspond to the C column in Data. Now I want to start with the first Ind value and find the corresponding row in the Data data frame (column C). From that row I want to go up and down and find values in column A that are within a tolerance range of 1. I want to write these values into a result data frame, add a group id column, and delete them from the data frame Data (where I found them). Then I start with the next entry in the index data frame Ind, and so on until the data frame Data is empty. I know how to match my Ind with column C of my Data, and how to do the writing, deleting and the other stuff in a for loop, but I don't know how to do the main point, which is my question here:
when I have found my row in Data, how can I look up the fitting values of column A within the tolerance range above and below that entry to get my Group id?
what I want to get is this result:
 A   B C Group
 1 5.3 1     2
 2 9.2 2     2
 3   5 3     2
 5   8 4     3
 8  10 5     1
 9 9.5 6     1
10   4 7     4
Maybe somebody could help me with the critical point in my question, or even show how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame, and set "extracted" rows to TRUE. So like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1
# loop until they're all done.
for (idx in Ind$I) {
  # skip this iteration if idx is NA.
  if (is.na(idx)) {
    next
  }
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement.
  # specify the tolerance here.
  mytol <- 1
  # the next only works for integer compare.
  # also not covered: what if multiple values of C
  # match idx? do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # use the magic of vectorized logical compare.
  dat.rows <-
    ((Data$A - target) >= -mytol) &
    ((Data$A - target) <= mytol) &
    (!Data$extracted)
  # if dat.rows is all false, then nothing met the criteria.
  # skip the rest of the loop
  if (!any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A = Data[dat.rows, "A"],
    B = Data[dat.rows, "B"],
    C = Data[dat.rows, "C"],
    Group = group.counter  # this value will be recycled to match length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter
  group.counter <- group.counter + 1
}
# now make a data.frame from the results.
# this is the last step in how we avoid
#"growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
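To compare against the expected table in the question, you can simply reorder the result by C (just a usage note; as said above, the group numbers themselves may differ):
resData[order(resData$C), ]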
Part of a function I'm working on uses the following code to take a data frame and reorder its columns on the basis of the largest (absolute) value in each column.
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)])))
For the most part, this works fine, but with the dataset I'm working on, I occasionally get data that looks like this:
a <- rnorm(10,5,7); b <- rnorm(10,0,1); c <- rep(1,10)
dfm <- data.frame(A = a, B = b, C = c)
> dfm
             A          B C
1    0.6438373 -1.0487023 1
2   10.6882204  0.7665011 1
3  -16.9203506 -2.5047946 1
4   11.7160291 -0.1932127 1
5   13.0839793  0.2714989 1
6   11.4904625  0.5926858 1
7   -5.9559206  0.1195593 1
8    4.6305924 -0.2002087 1
9   -2.2235623 -0.2292297 1
10   8.4390810  1.1989515 1
When that happens, the above code returns a "non-numeric argument to mathematical function" error at the abs() step. (And if I get rid of the abs() step because I know, due to transformation, my data will be all positive, order() returns: "unimplemented type 'list' in 'orderVector1'".) This is because which() returns all the 1's in column C, which in turn makes apply() spit out a list, rather than a nice tidy vector.
My question is this: How can I make which() JUST return one value for column C in this case? Alternately, is there a better way to write this code to do what I want it to (reorder the columns of a matrix based on the largest value in each column, whether or not that largest value is duplicated) that won't have this problem?
If you want to select just the first element of the result, you can subset it with [1]:
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)][1])))
To order the columns by their maximum element (in absolute value), you can do
dfm[order(apply(abs(dfm),2,max))]
Your code, with #CarlosCinelli's correction, should work fine, though.
This is probably a simple one, but I somehow got stuck...
I need many nested loops to get the result for every sample in my support, like the usual stacked loops:
for (a in 1:N1) {
  for (b in 1:N2) {
    for (c in 1:N3) {
      ...
    }
  }
}
but the number of for loops needed in this messy system depends on another random variable, let's say,
for(f in 1:N.for)
so how can I write a for loop to deal with this? Or are there more elegant ways to do it?
Note the difference: the variables of the nested for loops above (a, b, c, ...) do matter in my calculations, but the variable f of the for loop that controls the number of for loops does not enter any of my calculations for my real purpose - all it does is count/ensure that the number of for loops is correct.
Did I make it clear?
So what I am actually trying to do is generate all the possible combinations of a number of people's preferences towards the others.
Let's say I have 6 people (the simplest case for my purpose): Abi, Bob, Cath, Dan, Eva, Fay.
Abi and Bob have preference lists of C D E F ( 4!=24 possible permutations for each of them);
Cath and Dan have preference lists of A B and E F, respectively (2! * 2! = 4 possible permutations for each of them);
Eva and Fay have preference lists of A B C D (4!=24 possible permutations for each of them);
So all together there should be 24*24*4*4*24*24 possible permutations of preferences when taking all six them together.
I am just wondering what is a clear, easy and systematic way to generate them all at once?
I'd want them in the format such as
c.prefs <- as.matrix(data.frame(Abi = c("Eva", "Fay", "Dan", "Cath"), Bob = c("Dan", "Eva", "Fay", "Cath")))
but any clear format is fine...
Thank you so much!!
I'll assume you have a list of each loop variable and its maximum value, ordered from the outermost to innermost variable.
loops <- list(a=2, b=3, c=2)
You could create a data frame with all the loop variable values in the correct order with:
(indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len))))
# a b c
# 1 1 1 1
# 2 1 1 2
# 3 1 2 1
# 4 1 2 2
# 5 1 3 1
# 6 1 3 2
# 7 2 1 1
# 8 2 1 2
# 9 2 2 1
# 10 2 2 2
# 11 2 3 1
# 12 2 3 2
If the code run at the innermost point of the nested loop doesn't depend on the previous iterations, you could use something like apply to process each iteration independently. Otherwise you could loop through the rows of the data frame with a single loop:
for (i in seq_len(nrow(indices))) {
# You can get "a" with indices$a[i], "b" with indices$b[i], etc.
}
As for the way of doing the calculation, one option is to use the Reduce function or some other higher-order function.
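For instance, with the loops list above, Reduce can collapse the per-variable ranges into the total number of iterations (a minimal illustration only):
Reduce(`*`, loops)  # 2 * 3 * 2 = 12, which matches nrow(indices)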
Since your data is not inherently ordered (an individual is part of a set, and its preferences are part of the set), I would keep the individuals in a factor and have, e.g., the preferences in lists named after the individuals. If you have large data you can store it in an environment.
The first block of code just makes the example reproducible; the problem domain lent itself to graph-oriented naming. You only need to change the first line, and the runif() call, to change the behavior.
#people
verts <- factor(c(LETTERS[1:10]))
#relations, disallow preferring yourself
edges <- lapply(seq_along(verts), function(ind) {
  levels(verts)[-ind]
})
names(edges) <- levels(verts)
#directions
#say you have these stored in a list or something
pool <- levels(verts)
directions <- lapply(pool, function(vert) {
  relations <- pool[unique(round(runif(5, 1, 10)))]
  setdiff(relations, vert)  # drop the person themselves from their own list
})
names(directions) = pool
num_prefs <- (lapply(directions, length))
names(num_prefs) <- names(directions)
# First take the factorial of each person's number of preferences,
# then reduce that with multiplication
combinations <- Reduce(`*`, sapply(num_prefs, factorial))
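If you also want to enumerate the actual preference profiles rather than only count them, here is a minimal sketch along the same lines (the perms() helper and the trimmed pref_lists below are illustrative assumptions, not part of the code above):
# all permutations of a character vector (assumed helper)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  unlist(lapply(seq_along(v), function(i)
    lapply(perms(v[-i]), function(p) c(v[i], p))), recursive = FALSE)
}

# trimmed example: add the remaining people's lists the same way
pref_lists <- list(Abi  = c("Cath", "Dan", "Eva", "Fay"),
                   Cath = c("Abi", "Bob"))
all_perms <- lapply(pref_lists, perms)            # permutations per person
idx <- expand.grid(lapply(all_perms, seq_along))  # one row of indices per full profile
nrow(idx)                                         # 24 * 2 = 48 profiles here

# materialise profile r as a named list of preference orderings
profile <- function(r) Map(function(p, i) p[[i]], all_perms, idx[r, ])
profile(1)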
I hope this answers your question!
Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R, the first is a list of unique 2-variable values (u), so much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt to do this was with a for loop for all values of u. This works nicely, but is computationally intensive and takes quite a while to process the actual data.
# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for (i in 1:nrow(u)) {
    m <- c(m, list(which(abs(d$PC1 - u$PC1[i]) < b & abs(d$PC2 - u$PC2[i]) < b)))
  }
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
match <- function(x, ...) {
  which(abs(d$PC1 - x[1]) < b & abs(d$PC2 - x[2]) < b)
}
# Run the function with the apply() command.
system.time({
  m <- apply(u, 1, match)
})
Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside the box approaches to finding these matching rows also welcome.
Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
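A small assumed follow-up: the resulting Hits object maps query indices to subject indices, so the matching row numbers of u and d can be pulled out like this.
matches <- data.frame(u_row = queryHits(hits), d_row = subjectHits(hits))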
Alas, this seems only slightly faster than the for loop
unlist(Map(function(x, y) {
  which(abs(d$PC1 - x) < b & abs(d$PC2 - y) < b)
}, u$PC1, u$PC2))
but at least it's something.
I have a cunning plan :-). How about just doing calculations:
> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> thresh<-foo-bar
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]    1    1
[3,]   -1    1
[4,]    1   -1
[5,]    1    1
Now all you have to do is select the rows of that last matrix which are c(-1, 1), using which(), and you can easily extract the desired rows from your bar matrix. Repeat for each row in foo.
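A minimal sketch of that selection step, with the objects defined above (the threshold direction shown is just the c(-1, 1) case):
sel <- which(sign(thresh)[, 1] == -1 & sign(thresh)[, 2] == 1)
bar[sel, , drop = FALSE]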
I need to compare the values stored in two variables. The variable sizes are different. For example
x = c(1,2,3,4,5,6,7,8,9,10)
and
y = c(2,6,11,12,13)
I need an answer saying that 2 and 6 are present in both variables. I need this to be done in R. Can anyone help, please?
The intersect function avoids the need for #mdsumner's simple indexing:
> x = c(1,2,3,4,5,6,7,8,9,10)
> y = c(2,6,11,12,13)
> intersect(x,y)
[1] 2 6
There's a whole bunch of set operators to be found at help(intersect)
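For reference, a couple of the other set operations from that help page, applied to the same x and y:
union(x, y)    # 1 2 3 4 5 6 7 8 9 10 11 12 13
setdiff(x, y)  # 1 3 4 5 7 8 9 10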
Posted after the added requirement that some sort of tolerance be allowed: you could sequentially check one set of values against all the others in the second set, or you could do it all at once with outer(). Once you have the outer() result as a logical matrix, there remains the task of referring back to the values, but expand.grid() seems capable of handling that:
expand.grid(x,y)[outer(x,y, FUN=function(x,y) abs(x-y) < 0.01), ]
#   Var1 Var2
#2     2    2
#16    6    6
After posting, it occurred to me that your values were sorted. It turns out that this extraction from expand.grid() survives passing unsorted vectors.
x[x %in% y]
[1] 2 6
Or, more explicitly:
x[match(x, y, nomatch = 0) > 0]
[1] 2 6
Note that you actually chain together the results of match() with simple indexing into the input values.
See ?match.