We are given a huge set of points in a 2D plane. We need to find, for each point, the closest point within the set. For instance, suppose the initial set is as follows:
foo <- data.frame(x=c(1,2,4,4,10),y=c(1,2,4,4,10))
The output should be like this:
ClosesPair(foo)
2
1
4
3
3 # (could be 4 also)
Any idea?
The traditional approach is to preprocess the data
and put it in a data structure, often a K-d tree,
for which the "nearest point" query is very fast.
There is an implementation in the nnclust package.
library(nnclust)
foo <- cbind(x=c(1,2,4,4,10),y=c(1,2,4,4,10))
i <- nnfind(foo)$neighbour
plot(foo)
arrows( foo[,1], foo[,2], foo[i,1], foo[i,2] )
Here is an example, all wrapped into a single function. You might want to split it up a bit for optimization.
ClosesPair <- function(foo) {
  # Euclidean distance between points (rows) i and j
  dist <- function(i, j) {
    sqrt((foo[i,1] - foo[j,1])^2 + (foo[i,2] - foo[j,2])^2)
  }
  foo <- as.matrix(foo)
  # index of the point nearest to point i, excluding i itself
  ClosestPoint <- function(i) {
    indices <- (1:nrow(foo))[-i]
    distances <- sapply(indices, dist, i = i)
    indices[which.min(distances)]
  }
  sapply(1:nrow(foo), ClosestPoint)
}
ClosesPair(foo)
# [1] 2 1 4 3 3
Of course, it does not handle ties very well.
Use the package spatstat. It has built-in functions to do this sort of thing.
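For example, here is a minimal sketch using spatstat's nnwhich(), which returns the index of each point's nearest neighbour; treat the exact interface as an assumption to check against your installed version:
library(spatstat)
foo <- data.frame(x=c(1,2,4,4,10), y=c(1,2,4,4,10))
# build a point pattern object; the window is just the bounding box of the data
pp <- ppp(foo$x, foo$y, window=owin(range(foo$x), range(foo$y)))
nnwhich(pp)
# should give 2 1 4 3 3 (up to tie-breaking)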
I have a data frame that looks like this:
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
x <- data.frame(data,utility,stage)
head(x)
       data utility stage
1 0.9148060     def   xyz
2 0.9370754     abc   wxy
3 0.2861395     def   xyz
4 0.8304476     cde   xyz
5 0.6417455     bcd   xyz
6 0.5190959     abc   xyz
and I want to generate cumulative distribution functions for the unique combinations of utility and stage. In my real application I'll end up generating about 100 cdfs but this random data will have 12 (4x3) unique combinations. But I'll be using each of those cdfs thousands of times, so I don't want to calculate the cdf on the fly each time. The ecdf() function works exactly as I'd like, except I'd need to vectorize it. The following code doesn't work, but it's the gist of what I'm trying to do:
ecdf_multiple <- function(x) {
  i <- 0
  utilities <- levels(x$utility)
  stages <- levels(x$stage)
  for (utility in utilities) {
    for (stage in stages) {
      i <- i + 1
      # calculate the ecdf for the unique utility/stage combo
      y <- ecdf(x[x$utility == utility & x$stage == stage, 1])
      # then assign it to a data element (list, data frame, json, whatever)
      # note: this doesn't actually work
      z[i] <- list(y, utility, stage)
    }
  }
  z # return value
}
so after running ecdf_multiple and assigning it to a variable, I'd reference that variable somehow by passing a value (for which I wanted the cdf), the utility, and the stage.
Is there a way to vectorize the ecdf function (or use/build another) so that I can use the output several times without needing to generate the distributions over and over?
------- Added to respond to @Pascal's excellent suggestion. -------
How might one expand this to a more general case of taking "n" dimensions of categories? This is my stab, based on Pascal's case of two dimensions. Notice how I tried to assign "y":
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
openclose <- sample(c("open","close"),1000,replace=TRUE)
x <- data.frame(data,utility,stage,openclose)
numlabels <- length(names(x))-1
y <- split(x, list(x[,2:(numlabels+1)]))
l <- lapply(y,function(x) ecdf(x[,"data"]))
#execute
utility <- "abc"
stage <- "xyz"
openclose <- "close"
comb <- paste(utility, stage, openclose, sep = ".")
# call the function
l[[comb]](.25)
During the assignment of "y" above, I get this error message:
"Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?"
The following might help:
# we create a list of criteria by excluding
# the first column of the data.frame
y <- split(x, as.list(x[,-1]))
l <- lapply(y, function(x) ecdf(x[,"data"]))
utility <- "abc"
stage <- "xyz"
comb <- paste(utility, stage, sep = ".")
l[[comb]](0.25)
# [1] 0.2613636
plot(l[[comb]])
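For the three-category case from the edit, the same pattern should extend directly; here is a sketch, assuming the x defined above with the extra openclose column:
# split on all label columns at once; the list names are dot-separated combos
y <- split(x, as.list(x[,-1]))
l <- lapply(y, function(d) ecdf(d[,"data"]))
comb <- paste("abc", "xyz", "close", sep = ".")
l[[comb]](0.25)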
I have a list of elemental compositions and I'd like to display, mapped onto the periodic table, a count of the number of times each element is included in a composition (e.g. CH4 would increase the count on H and C by one).
How can I do this with ggplot? Is there a map I can use?
With a bit of searching I found information about the periodic table in this example code project. They had an Access Database with element information. I've exported it to this gist. You can import the data using the httr library with
library(httr)
dd <- read.table(text=content(GET("https://gist.githubusercontent.com/MrFlick/c1183c911bc5398105d4/raw/715868fba2d0d17a61a8081de17c468bbc525ab1/elements.txt")), sep=",", header=TRUE)
(You should probably create your own local version for easier loading in the future.)
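For example (a small sketch; the file name is arbitrary):
# cache the element table locally so later sessions skip the network
write.csv(dd, "elements.csv", row.names=FALSE)
dd <- read.csv("elements.csv")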
Then your other challenge is decomposing something like "CH4" into the raw element counts. I've created this helper function which I think does what you need.
decompose <- function(x) {
  m <- gregexpr("([A-Z][a-z]?)(\\d*)", x, perl=TRUE)
  dx <- Map(function(x, y) {
    ElementSymbol <- gsub("\\d", "", x)    # strip digits to get the symbol
    cnt <- as.numeric(gsub("\\D", "", x))  # strip non-digits to get the count
    cnt[is.na(cnt)] <- 1                   # no trailing digits means a count of 1
    cbind(Sym=y, as.data.frame(xtabs(cnt ~ ElementSymbol)))
  }, regmatches(x, m), x)
  do.call(rbind, dx)
}
Here I test the function
test_input <- c("H2O","CH4")
decompose(test_input)
#   Sym ElementSymbol Freq
# 1 H2O             H    2
# 2 H2O             O    1
# 3 CH4             C    1
# 4 CH4             H    4
Now we can combine the data and the reference information to make a plot
library(ggplot2)
ggplot(merge(decompose("CH4"), dd), aes(Column, -Row)) +
geom_tile(data=dd, aes(fill=GroupName), color="black") +
geom_text(aes(label=Freq))
Clearly there are opportunities for improvement but this should give you a good start.
You might look for a more robust decomposition function. It looks like the CHNOSZ package has one:
library(CHNOSZ)
data(thermo)
decompose <- function(x) {
  do.call(rbind, lapply(x, function(x) {
    z <- makeup(x)  # parse the formula into named element counts
    cbind(data.frame(ElementSymbol=names(z), Freq=z), Sym=x)
  }))
}
ggplot(merge(decompose("CaAl2Si2O7(OH)2*H2O"), dd), aes(Column, -Row)) +
geom_tile(data=dd, aes(fill=GroupName), color="black") +
geom_text(aes(label=Freq))
Given a Euclidean distance function:
eDistance <- function(q, m) {
  apply(m, 1, function(x) {
    sqrt(sum((q - x)^2))
  })
}
And the following matrix:
t = cbind(c(3,4,1,6,3),c(1,9,1,2,1))
I want to find a vectorized way to do the following:
r <- rep(NA, nrow(t))
for (i in 1:nrow(t)) {
  tmp <- eDistance(t[i,], t)
  r[i] <- sum(tmp)
}
My main concern is performance.
Thank you.
Use the dist() function. You want the column sums of this...
colSums(as.matrix(dist(t)))
# 1 2 3 4 5
#13.22454 31.94863 17.64302 18.70368 13.22454
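This works because dist(t) computes all pairwise distances in one call, and as.matrix() expands the result to the full symmetric matrix whose column i holds the distances from point i to every other point. A quick sanity check against the loop above:
r2 <- colSums(as.matrix(dist(t)))
all.equal(unname(r2), r)  # should be TRUE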
I have a for loop in R in which I want to store the result of each calculation (for all the values looped through). In the for loop a function is called and the output is stored in a variable r at the moment. However, this is overwritten in each successive iteration. How could I store the result of each pass through the loop and access it afterwards?
Thanks. Example:
for (par1 in 1:n) {
  var <- myfunction(par1, par2)  # some function call (myfunction is a placeholder)
  var2 <- c(var, par1)
  print(var2)
}
So print shows every instance of var2, but var2 itself only keeps the value from the last iteration. Is there any way to get an array of the data or something?
Initialise an empty object and then assign the values by indexing:
a <- 0
for (i in 1:10) {
a[i] <- mean(rnorm(50))
}
print(a)
EDIT:
To include an example with two output variables, in the most basic case, create an empty matrix with the number of columns corresponding to your output parameters and the number of rows matching the number of iterations. Then save the output in the matrix, by indexing the row position in your for loop:
n <- 10
mat <- matrix(ncol=2, nrow=n)
for (i in 1:n) {
var1 <- function_one(i,par1)
var2 <- function_two(i,par2)
mat[i,] <- c(var1,var2)
}
print(mat)
The iteration number i corresponds to the row number in the mat object. So there is no need to explicitly keep track of it.
However, this is just to illustrate the basics. Once you understand the above, it is more efficient to use the elegant solution given by @eddi, especially if you are handling many output variables.
To get a list of results:
n = 3
lapply(1:n, function(par1) {
# your function and whatnot, e.g.
par1*par1
})
Or sapply if you want a vector instead.
A bit more complicated example:
n = 3
some_fn = function(x, y) { x + y }
par2 = 4
lapply(1:n, function(par1) {
var = some_fn(par1, par2)
return(c(var, par1)) # don't have to type return, but I chose to make it explicit here
})
#[[1]]
#[1] 5 1
#
#[[2]]
#[1] 6 2
#
#[[3]]
#[1] 7 3
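If you'd rather end up with a matrix than a list, you can bind the results afterwards; a small sketch reusing some_fn from above:
res <- lapply(1:n, function(par1) c(some_fn(par1, par2), par1))
do.call(rbind, res)  # an n x 2 matrix, one row per iteration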
I followed the discussion over HERE and am curious why using <<- is frowned upon in R. What kind of confusion will it cause?
I also would like some tips on how I can avoid <<-. I use the following quite often. For example:
### Create a dummy 10 x 10 integer matrix.
### Each cell contains a number between 1 and 6.
df <- do.call("rbind", lapply(1:10, function(i) sample(1:6, 10, replace = TRUE)))
What I want to achieve is to shift every number down by 1, i.e. all the 2s will become 1s, all the 3s will become 2s, etc. Therefore, all n would become n-1. I achieve this by the following:
df.rescaled <- df
sapply(2:6, function(i) df.rescaled[df.rescaled == i] <<- i - 1)
In this instance, how can I avoid <<-? Ideally I would want to be able to pipe the sapply results into another variable along the lines of:
df.rescaled <- sapply(...)
First point
<<- is NOT the operator to assign to a global variable. It tries to assign to the variable in the nearest enclosing (parent) environment in which it is defined. So, say, this will cause confusion:
f <- function() {
  a <- 2
  g <- function() {
    a <<- 3  # assigns to the `a` inside f, not the global one
  }
  g()
}
then,
> a <- 1
> f()
> a # the global `a` is not affected
[1] 1
Second point
You can do that by using Reduce:
Reduce(function(a, b) {a[a==b] <- a[a==b]-1; a}, 2:6, df)
or apply
apply(df, c(1, 2), function(i) if(i >= 2) {i-1} else {i})
But
simply, this is sufficient:
ifelse(df >= 2, df-1, df)
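which also gives you the assignment form you wanted, with no <<- involved:
df.rescaled <- ifelse(df >= 2, df - 1, df)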
You can think of <<- as global assignment (approximately, because as kohske points out it assigns to the global environment unless the variable name exists in a more proximal enclosing environment). Examples of why this is bad are here:
Examples of the perils of globals in R and Stata