Creating 2D bins and counting the number of elements in each bin

I have a 2D array of the form [[x1,y1],[x2,y2],...,[x1000,y1000]]. I need to divide the XY plane into 100 (10*10) bins and then count the number of elements in each bin. How can I do that? I also need to remove all empty bins.
Thank you for your help.
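A minimal sketch of one way to do this, written in R on the assumption that the points are stored as a two-column matrix (the question does not name a language). The function name count_2d_bins and the sample data are made up for illustration. cut() splits each axis into equal-width intervals, table() cross-tabulates them, and the empty bins are dropped at the end.

```r
# Hypothetical helper: count points per cell of an nbins x nbins grid
count_2d_bins <- function(pts, nbins = 10) {
  # cut() with a single number splits the observed range into equal-width bins
  bx <- cut(pts[, 1], breaks = nbins)
  by <- cut(pts[, 2], breaks = nbins)
  # Cross-tabulate the two bin labels: one row per (x_bin, y_bin) pair
  counts <- as.data.frame(table(x_bin = bx, y_bin = by))
  # Keep only the non-empty bins
  counts[counts$Freq > 0, ]
}

set.seed(42)
pts <- cbind(runif(1000), runif(1000))  # made-up stand-in for the 1000 points
bins <- count_2d_bins(pts)
```

The bins here span the observed range of the data; to use fixed limits instead, pass an explicit vector of break points to cut().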

Related

How to use pointDistance with a very large vector

I've got a big problem.
I've got a large raster (rows=180, columns=480, number of cells=86400)
First I binarized it (so that there are only 1's and 0's) and then I labelled the clusters. (Cells that are 1 and connected to each other got the same label.)
Now I need to calculate all the distances between the cells that are NOT 0.
There are quite a lot of them, and that's my big problem.
I did this to get the coordinates of the cells I'm interested in (i.e. the cell numbers of the cells that are not 0):
V=getValues(label)
Vu=c(1:max(V))
pos=which(V %in% Vu)
XY=xyFromCell(label,pos)
This works very well. So XY is a matrix, which contains all the coordinates (of cells that are not 0). But now I'm struggling. I need to calculate the distances between ALL of these coordinates. Then I have to put each one of them in one of 43 bins of distances. It's kind of like this (just an example):
0<x<0.2 bin 1
0.2<x<0.4 bin2
When I use this:
pD=pointDistance(XY,lonlat=FALSE)
R says it's not possible to allocate vector of this size. It's getting too large.
Then I thought I could do this (create an empty data frame df or something like that and let the function pointDistance run over every single value of XY):
for (i in 1:nrow(XY)) {
  pD=pointDistance(XY,XY[i,],lonlat=FALSE)
  pDbin=as.matrix(table(cut(pD,breaks=seq(0,8.6,by=0.2),labels=1:43)))
  df=cbind(df,pDbin)
  df=apply(df,1,FUN=function(x) sum(x))
}
It works when I try it with e.g. the first 50 values of XY.
But when I use it on the whole XY matrix it takes too much time. (Sometimes this XY matrix contains 10000 xy-coordinates.)
Does anyone have an idea how to do it faster?
I don't know whether this will be fast enough, but I recommend trying the following:
Let's say you have a data frame with the value 0 or 1 in each cell. To find the coordinates, all you have to do is write the code below:
cord_matrix <- which(dataframe == 1, arr.ind = TRUE)
Now you have a coordinate matrix of row and column indices.
To find the Euclidean distances, use the dist() function; have a look at its documentation. It will look like this:
dist_vector <- dist(cord_matrix)
It returns the lower triangle of the distance matrix as a dist object, which can be converted into a vector or a symmetric matrix. Now all you have to do is compute the bins according to your requirements.
Let me know if this fits within the available memory.
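If even the dist() result is too large to hold, the counts can be accumulated one row at a time so that only one vector of distances exists at any moment. The following is a rough sketch in base R, not something from the thread: the function name bin_pairwise_distances is made up, and it assumes planar coordinates (as with lonlat=FALSE) in a two-column matrix. Each pair is counted once, and the bins are (a, b] like the defaults of cut().

```r
# Hypothetical helper: histogram of all pairwise Euclidean distances,
# computed row by row so the full distance matrix is never stored.
bin_pairwise_distances <- function(xy, breaks) {
  n <- nrow(xy)
  nbins <- length(breaks) - 1
  counts <- numeric(nbins)
  if (n < 2) return(counts)
  for (i in seq_len(n - 1)) {
    j <- (i + 1):n  # only later rows, so each pair is visited once
    d <- sqrt((xy[j, 1] - xy[i, 1])^2 + (xy[j, 2] - xy[i, 2])^2)
    # Bin index of each distance; left.open = TRUE gives (a, b] intervals
    idx <- findInterval(d, breaks, left.open = TRUE)
    idx <- idx[idx >= 1 & idx <= nbins]  # drop values outside the breaks
    counts <- counts + tabulate(idx, nbins = nbins)
  }
  counts
}

set.seed(1)
xy <- cbind(runif(200), runif(200))  # made-up stand-in for the XY matrix
counts <- bin_pairwise_distances(xy, breaks = seq(0, 1.6, by = 0.2))
```

For the raster in the question the call would presumably be bin_pairwise_distances(XY, seq(0, 8.6, by = 0.2)). The loop is still quadratic in the number of points, but its memory use grows only linearly.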

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the diversity between rows and sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity between two rows (e.g. the Tanimoto coefficient or similar: http://en.wikipedia.org/wiki/Jaccard_index), take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all the dissimilarity values and select the 10 largest. But the result would be a 100 * 100 matrix, so a more efficient method may be needed when there are many rows. This is just my own thinking and may not be right, so I need help.
[update]
After looking through some literature, I found a definition of the maximum dissimilarity method.
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center)^2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
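For comparison, the maximum dissimilarity method quoted in the question's update can be sketched as a greedy loop. This is only an illustration under two assumptions that the quoted definition does not state: "maximally distant from the current points" is read as maximising the distance to the nearest point already selected, and the distance is Euclidean. The function name max_dissimilarity_rows is made up.

```r
# Hypothetical helper: greedy maximum-dissimilarity selection of k rows
max_dissimilarity_rows <- function(mat, k, first = sample(nrow(mat), 1)) {
  # Squared Euclidean distance from every row to row i
  sq_dist_to <- function(i) rowSums(sweep(mat, 2, mat[i, ])^2)
  selected <- integer(k)
  selected[1] <- first
  # For each row, the squared distance to its nearest selected row so far
  nearest <- sq_dist_to(first)
  for (s in seq_len(k)[-1]) {
    selected[s] <- which.max(nearest)  # the row farthest from the current selection
    nearest <- pmin(nearest, sq_dist_to(selected[s]))
  }
  selected
}

set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
picked <- max_dissimilarity_rows(mat, 10)
most_spread <- mat[picked, ]
```

Each step needs only the distances to one row, so the full 100 * 100 matrix is never built. The result depends on the random starting row unless first is given explicitly.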

How to select multiple cells in a matrix and perform an operation on corresponding cells in another matrix of the same size?

I am trying to write an R script to do pollution routing in world rivers, and need some help on selecting matrix cell coordinates and applying these to other matrices of the same dimension.
My data: I have several matrices corresponding to hydrological parameters of world rivers on a half degree grid (360 rows, 720 columns). These matrices represent flow accumulation (how many cells flow into this cell), flow direction (which of the 8 surrounding cells does the load of certain cell flow to) and pollutant load.
My idea: compute pollutant load in each grid cell from the start to the end of a river. I can base this on flow accumulation (low to high). However, each river basin can have multiple cells with the same flow accumulation value.
The problem: I need to select all matrix cells of each value of flow accumulation (low to high), find their coordinates (row,column), and transfer the corresponding pollutant load to the correct adjacent cell using the flow direction matrix. I have tried various ways, but I cannot get the step of selecting the coordinates of the correct cells and applying them to another matrix to work.
I will give an example of what I have tried, using two for loops on one single river basin. In this example, a flow direction value of 1 means that the pollutant load needs to be transferred to the adjacent cell to the right (row is the same, column +1):
BasinFlowAccumulation <-FlowAccumulation[Basin]
BasinFlowAccumulationMaximum <- max(BasinFlowAccumulation)
BasinFlowDirection <-FlowDirection[Basin]
BasinPollutant <-Pollutant[Basin]
b<-0
for(i in 0:BasinFlowAccumulationMaximum){
cells.index<-which(BasinFlowAccumulation[]==b, arr.ind=TRUE)
for (j in 1:length(cells.index)){
print(BasinFlowDirection[cells[j]])
Row<-BasinPollutant[cells[j[1]]]
Column<-BasinPollutant[cells[j[2]]]
ifelse(BasinFlowDirection[cells.index[j]]==1, BasinPollutant[Row,(Column+1)]<-BasinPollutant[Row,(Column+1)]+Basinpollutant[Row,Column]
}
b<-b+1
}
Any advice would be greatly appreciated!
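A rough sketch of how the routing step described above might look, assuming the three inputs are plain matrices of identical dimensions. Only direction code 1 (same row, column +1) is defined in the question, so the function takes a lookup table of offsets that has to be filled in for the other seven codes; route_load and offsets are made-up names, not an existing API. which(..., arr.ind = TRUE) returns a two-column matrix of (row, col) pairs, which is what makes it possible to index the other matrices by coordinate.

```r
# Hypothetical helper: push each cell's load to its downstream neighbour,
# visiting cells in order of increasing flow accumulation.
route_load <- function(acc, dir, load, offsets) {
  for (v in sort(unique(as.vector(acc)))) {
    cells <- which(acc == v, arr.ind = TRUE)  # (row, col) of every cell with value v
    for (k in seq_len(nrow(cells))) {
      r <- cells[k, 1]
      cl <- cells[k, 2]
      code <- as.character(dir[r, cl])
      if (!code %in% rownames(offsets)) next  # unknown code: treated as an outlet
      rr <- r + offsets[code, "drow"]
      cc <- cl + offsets[code, "dcol"]
      # Skip transfers that would leave the grid
      if (rr < 1 || rr > nrow(load) || cc < 1 || cc > ncol(load)) next
      load[rr, cc] <- load[rr, cc] + load[r, cl]
    }
  }
  load
}

# Only code 1 comes from the question; further rows would be added per code
offsets <- matrix(c(0, 1), nrow = 1, dimnames = list("1", c("drow", "dcol")))

acc  <- matrix(c(0, 1, 2), nrow = 1)  # made-up one-row river
dir  <- matrix(c(1, 1, 0), nrow = 1)
load <- matrix(c(1, 2, 3), nrow = 1)
routed <- route_load(acc, dir, load, offsets)
```

Processing by increasing flow accumulation ensures that a cell has already received everything from upstream before its own load is passed on.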

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below provides the same data:
track_1 <- data.frame(cbind(
LAT=c(32.34855,31.54764,31.38293,31.21447,30.76365,30.75872,30.60261,30.62818,31.35912,31.15218),
LON=c(-35.49264,-35.58691,-35.25243,-35.25830,-35.38881,-35.54733,-35.95472,-36.27024,-35.73573,-36.38027),
SCORE=c(80.67,18.14,46.70,22.65,11.93,22.97,35.98,31.09,14.97,37.60)))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
paste(floor(LAT*latScale), floor(LON*lonScale), sep='.')))
library(plyr)
ddply(track_1, .(cell), summarize,
LAT=mean(LAT), LON=mean(LON), SCORE=sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
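Since the question asks for the centre of each grid cell rather than the mean position of the points in it, here is a small base R variation on the same floor() idea. It is a sketch only: the function name grid_sum and the cell_size parameter (the cell edge length in degrees) are made up, and it assumes the columns are named LAT, LON and SCORE as in the question.

```r
# Hypothetical helper: sum SCORE per grid cell and report the cell centres
grid_sum <- function(track, cell_size = 0.5) {
  # Integer cell index along each axis
  track$lat_idx <- floor(track$LAT / cell_size)
  track$lon_idx <- floor(track$LON / cell_size)
  out <- aggregate(SCORE ~ lat_idx + lon_idx, data = track, FUN = sum)
  # Cell centre = lower edge of the cell plus half a cell
  out$LAT <- (out$lat_idx + 0.5) * cell_size
  out$LON <- (out$lon_idx + 0.5) * cell_size
  out[, c("LAT", "LON", "SCORE")]
}

# First four points of the track shown in the question
track <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830),
  SCORE = c(80.67, 18.14, 46.70, 22.65))
cells <- grid_sum(track, cell_size = 0.5)
```

Changing cell_size adjusts the grid resolution; with 0.5 degrees the third and fourth points fall into the same cell and their scores are added together.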

Create grid out of number of elements

OK, here's what I'm trying to accomplish. Say I have 100 items. I want to create a "grid" (each item consisting of an x, y point). I want the grid to be as close to a square as possible.
Is there any kind of math to determine the grid width and grid height I'd need from just a single number? (By grid width and height I mean the number of x items and the number of y items.)
Now that I think about it, would it be efficient to take the square root of the number, say varI=sqrt(45), remove the decimal places from varI, set X=varI, and then Y would be varI+1?
The square root is precisely what you need.
N
x=floor(sqrt(N))
y=ceil(N/x)
This gives a near-square rectangle with at least N places.
Now... if you want to find a rectangle that has exactly N places and is closest to a square...that's a different problem.
You need to find the factor x of N that is closest to sqrt(N).
You have to run through the factors of N and pick the one closest to sqrt(N). Then the rectangle is x by N/x, both integers.
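The search over factors can be sketched in a few lines. This is an illustration in R (the thread itself is language-agnostic), and grid_exact is a made-up name; it walks down from floor(sqrt(N)) until it reaches a divisor, which is the largest factor not exceeding sqrt(N).

```r
# Hypothetical helper: most square-like rectangle with exactly n places
grid_exact <- function(n) {
  x <- floor(sqrt(n))
  # Step down to the nearest divisor; x = 1 always divides n, so this terminates
  while (n %% x != 0) x <- x - 1
  c(x, n / x)
}

grid_exact(12)  # 3 4
grid_exact(10)  # 2 5
```

For a prime N the loop runs all the way down to 1 and returns a 1 by N strip, which matches the remark that primes cannot fill a near-square grid exactly.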
There are several issues to consider here. If you want your grid to be as square as possible, for many Ns it will have empty cells in it. A simple example is N=10. You can create a 3x4 grid for it, but it will have two empty cells. A 2x5 grid, on the other hand, will have no empty cells. Some Ns (prime numbers) will always have empty cells in the grid.
But if you just want the square and don't care about empty fields then generally yes, you should take the square root. Say your number is N. Then, take R = int(sqrt(N)). Next, do an integer division N/R, take the quotient and add 1 to it. This is C. The grid is RxC. Note that when N is a square (like 100), this is a special case so don't add 1 to the quotient.
Example:
N = 40
R = int(sqrt(N)) = 6
C = int(40 / 6) + 1 = 7
grid is 6x7
I was looking to solve this problem too, for a grid in HTML/CSS that had fixed dimensions and where N items would fit. I ended up creating my own script for that in JavaScript.
If you're interested in the method and maths I used, you can read http://machinesaredigging.com/2013/05/21/jgridder-how-to-fit-elements-in-a-sized-grid/, it's all documented there. I used recursion and it works really well, you can use the same method for your own language. Hope this helps.
I explored Eli's answer and found something I'd like to point out. For the sake of generality, one must add 1 to C only if R x C (with C = int(N/R)) is not exactly N. So the exception covers both perfect squares and numbers that are exactly the product of the two integers R and C.
For instance:
N = 12
R = 3
C = 4 (int(N/R))
Hope it helps.
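Put together, the rule with this correction can be written as a short function. The sketch is in R and grid_dims is a made-up name; adding 1 only when R x C falls short of N is the same as rounding N/R up.

```r
# Hypothetical helper: near-square grid with at least n cells
grid_dims <- function(n) {
  r <- floor(sqrt(n))
  cols <- n %/% r                     # integer division
  if (r * cols != n) cols <- cols + 1  # add a column only when cells are missing
  c(rows = r, cols = cols)
}

grid_dims(40)   # 6 rows, 7 cols
grid_dims(12)   # 3 rows, 4 cols
grid_dims(100)  # 10 rows, 10 cols
```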
