I have a dataset of spatial locations and want to run a point pattern analysis on it using the spatstat package in R. I want the analysis window to be a polygon that fits the data closely, rather than a bounding rectangle. The code I have is
original_data = read.csv("/home/hudamoh/PhD_Project_Moh_Huda/Dataset_files/my_coordinates.csv")
plot(original_data$row, original_data$col)
which results in a plot that looks like this
Converting the data to a point pattern object (this needs library(spatstat)):
point_pattern_data = ppp(original_data$row, original_data$col, c(0, 77), c(0, 116))
plot(point_pattern_data)
summary(point_pattern_data)
resulting in a plot that looks like this
The observed data has considerably wide white spaces, which I want to exclude to get a better analysis area. Therefore, I want the window of the point pattern to be a polygon instead of a rectangle. The vertices of the polygon are the (x, y) pairs below, chosen to avoid white space as much as possible.
x = c(3,1,1,0.5,0.5,1,2,2.5,5.5, 16,21,28,26,72,74,76,75,74,63,58,52,47,40)
y = c(116,106,82.5,64,40,35,25,17.5,5,5,5,10,8,116,100,50,30,24,17,10,15,15,8)
I found these vertices manually by reading them off the plot below (with grid lines)
plot(original_data$row,original_data$col)
grid(nx = 40, ny = 25,
lty = 2, # Grid line type
col = "gray", # Grid line color
lwd = 2) # Grid line width
So I want to use this polygon as the window of the point pattern. The code is
my_data_poly = owin(poly = list(x = c(3,1,1,0.5,0.5,1,2,2.5,5.5, 16,21,28,26,72,74,76,75,74,63,58,52,47,40), y = c(116,106,82.5,64,40,35,25,17.5,5,5,5,10,8,116,100,50,30,24,17,10,15,15,8)))
plot(my_data_poly)
but it results in an error. The error is
I worked around it by swapping the x and y vectors:
my_data_poly = owin(poly = list(x = c(116,106,82.5,64,40,35,25,17.5,5,5,5,10,8,116,100,50,30,24,17,10,15,15,8), y = c(3,1,1,0.5,0.5,1,2,2.5,5.5, 16,21,28,26,72,74,76,75,74,63,58,52,47,40)))
plot(my_data_poly)
It results in a plot
However, this is not what I want. How can I get the observed area as a polygonal window for point pattern analysis?
This should be a reasonable solution to the problem.
library(sp)
# Build an sp Polygon from the point coordinates (x = col, y = row)
poly = Polygon(
cbind(original_data$col,
original_data$row)
)
This will create a polygon from your points. You can use this document to understand the sp package better
We don’t have access to the point data you read in from file, but if you just want to fix the polygonal window that is not a problem.
You need to traverse the vertices of your polygon sequentially and anti-clockwise.
The code connects the first point you give to the next etc. Your vertices are:
library(spatstat)
x = c(3,1,1,0.5,0.5,1,2,2.5,5.5, 16,21,28,26,72,74,76,75,74,63,58,52,47,40)
y = c(116,106,82.5,64,40,35,25,17.5,5,5,5,10,8,116,100,50,30,24,17,10,15,15,8)
vert <- ppp(x, y, window = owin(c(0,80),c(0,120)))
plot.ppp(vert, main = "", show.window = FALSE, chars = NA)
text(vert)
Point number 13 is towards the bottom left and 14 in the top right, which gives the funny crossing in the polygon.
Moving the order around seems to help:
xnew <- c(x[1:11], x[13:12], x[23:14])
ynew <- c(y[1:11], y[13:12], y[23:14])
p <- owin(poly = cbind(xnew, ynew))
plot(p, main = "")
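If you want to check the orientation of a vertex list before passing it to owin, one option is the signed area from the shoelace formula. This is only a minimal base R sketch: signed_area and make_anticlockwise are hypothetical helper names, not spatstat functions, and a positive area does not rule out self-crossing edges like the one between points 13 and 14 above.

```r
# Signed area via the shoelace formula:
# positive for anti-clockwise vertex order, negative for clockwise.
signed_area <- function(x, y) {
  n <- length(x)
  nxt <- c(2:n, 1)  # index of the next vertex, wrapping around to the first
  sum(x * y[nxt] - x[nxt] * y) / 2
}

# Hypothetical helper: reverse the vertex order if it is clockwise.
make_anticlockwise <- function(x, y) {
  if (signed_area(x, y) < 0) {
    list(x = rev(x), y = rev(y))
  } else {
    list(x = x, y = y)
  }
}

# Unit square listed anti-clockwise, then the same vertices reversed.
sq_x <- c(0, 1, 1, 0)
sq_y <- c(0, 0, 1, 1)
signed_area(sq_x, sq_y)            # positive: anti-clockwise
signed_area(rev(sq_x), rev(sq_y))  # negative: clockwise
```

Calling make_anticlockwise(xnew, ynew) on the reordered vertices would then give a list that can be passed as the poly argument of owin.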
It is unclear from your plot of the data whether you really should apply point pattern analysis.
The main assumption underlying point process modelling as implemented in spatstat
is that the locations of events (points) are random and the process that
generated the random locations is of interest.
Your points seem to be on a grid and maybe you need another tool for your analysis.
Of course spatstat has a lot of functionality for simply handling and summarising data like this so you may still find useful tools in there.
I'm trying to make a flow map in R, which so far I've managed to do. However, because my map only covers a single country, the spatial lines that gcIntermediate from the geosphere package creates for me have no visible curve.
I thought maybe I could add a bezier curve to my lines, but I'm having zero luck with working out how I might do that.
library(geosphere) # provides gcIntermediate
long <- runif(10000, -6.30217, 1.373248) # Random longitudes
lat <- runif(10000, 49.92332, 55.02101) # Random latitudes
df <- as.data.frame.matrix(data.frame(Lat.1 = sample(lat, 10),
Long.1 = sample(long, 10),
Lat.2 = sample(lat, 10),
Long.2 = sample(long, 10))) # Dataframe of flow beginning to flow end
lines <- gcIntermediate(df[,c("Long.1", "Lat.1")], df[,c("Long.2", "Lat.2")], 500, addStartEnd = TRUE) #Create spatial lines with the geosphere package
plot(lines) #Some very straight lines
My problem comes when setting a start and end point for the bezier line: the function in the bezier package only seems to accept one value for the start and one for the end, and since each point needs two values (long, lat) to define it, I'm a bit stumped.
I won't bore you with all of the different things I've tried with the bezier package (none of them worked), but here are some of them:
bezier(sep(0,1,100), lines, lines$Long.1~lines$Lat.1, lines$Long.2~lines$Lat.2) # Won't accept a line object and I don't think Long.1 etc exist anymore
bezier(sep(0,1,100), df, df$Long.1~df$Lat.1, df$Long.2~df$Lat.2) #Hoped that if I used a formula syntax I could combine the long/lat of the starting and ending points respectively (I can't)
Has anyone got any insight on this? It's quite frustrating being so close and yet so far.
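For what it's worth, a quadratic Bezier curve between two (long, lat) points can be written in a few lines of base R without the bezier package. The sketch below is an assumption-laden illustration, not the bezier package's API: bezier_curve and its bend parameter are hypothetical names, the control point is simply the midpoint pushed sideways, and long/lat are treated as planar coordinates, which is only reasonable over short distances.

```r
# Quadratic Bezier curve: B(t) = (1-t)^2 * P0 + 2(1-t)t * P1 + t^2 * P2
# start, end: numeric vectors c(long, lat); bend: sideways offset of the
# control point as a fraction of the start-to-end distance (hypothetical parameter).
bezier_curve <- function(start, end, bend = 0.2, n = 100) {
  start <- unname(start)
  end <- unname(end)
  t <- seq(0, 1, length.out = n)
  delta <- end - start
  # Control point: midpoint shifted perpendicular to the straight line
  ctrl <- (start + end) / 2 + bend * c(-delta[2], delta[1])
  cbind(
    long = (1 - t)^2 * start[1] + 2 * (1 - t) * t * ctrl[1] + t^2 * end[1],
    lat  = (1 - t)^2 * start[2] + 2 * (1 - t) * t * ctrl[2] + t^2 * end[2]
  )
}

# Made-up flows, one row per flow (same column layout as df in the question)
flows <- data.frame(Long.1 = c(-3.0, -1.5), Lat.1 = c(51.0, 53.0),
                    Long.2 = c(0.5, -4.0),  Lat.2 = c(54.0, 50.5))

# One curve (a two-column matrix) per flow
curves <- lapply(seq_len(nrow(flows)), function(i) {
  bezier_curve(c(flows$Long.1[i], flows$Lat.1[i]),
               c(flows$Long.2[i], flows$Lat.2[i]))
})
```

Each element of curves can be drawn with lines() on top of an existing map; setting bend to 0 gives back a straight line.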
I am constructing a microgrid, so I have to connect 12 or so houses to a central solar power source. Wiring is a major cost here, so I'm trying to come up with a configuration that minimizes wire length. It's similar but not exactly like a traveling salesman problem, because multiple wires can come out of the same source -- if it were a single wire/path, it would be exactly like the TSP.
So my question is:
Does anyone know of an algorithm to determine the shortest way to connect all points, where there is a central point that can connect an indeterminate number of the surrounding points? The final solution should resemble a graph in which n-1 nodes have maximum two edges connecting them and one may have up to n-1 edges? Specifically, is there a way to do this in R?
EDIT TO SHOW CODE/EFFORT
I've solved it in a relatively simple way assuming a single path. And I've solved it assuming every house is connected directly to the power source. Here is that code:
############ Interested users: wire calculation, Bommekalla analysis
require(png)
require(grid)
require(arm)
require(DCluster)
require(ggplot2)
setwd("C:/Users/Lucas/Documents/India2014_2015/ADATS Docs/BoomekallaAnalysis")
data= read.csv("BommekallInterestedUsers.csv")
summary(data)
names(data)
data$ind = c(1:nrow(data))
##### Analysis
# Shortest Path
distFrame = data[,c("Lon.Deg", "Lat.Deg")]
dists= as.matrix(dist(distFrame, upper=TRUE))
diag(dists)=1000
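data.WO = data # assumption: data.WO is not defined in this snippet, so the full data set is used
current= which(data.WO$ind==11) # sushilemma
ord = rep(current,length=nrow(data.WO))
dists[,current]=1000 # mark the source as visited
for (j in c(1:(nrow(data.WO)-1))){
current= which.min(dists[current,]) # nearest unvisited house (first one on ties)
dists[,current]=1000 # mark it as visited
ord[j+1] = current
}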
# line calculation
firstHouses= data.WO[ord,]
secondHouses= data.WO[c(ord[-1],NA),]
lines = data.frame(lonA = firstHouses$Lon.Deg,
latA= firstHouses$Lat.Deg,
lonB = secondHouses$Lon.Deg,
latB = secondHouses$Lat.Deg)
lines= na.omit(lines)
# Spider web -- completely connected to source
ccLines = data.frame(latA = data$Lat.Deg[
data$Name=="Sushilemma"], latB = data$Lat.Deg,
lonA = data$Lon.Deg[data$Name=="Sushilemma"],
lonB = data$Lon.Deg)
# Haversine distance in metres (Earth radius 6371 km)
linesRads=lines*pi/180
a= with(linesRads, sin((latB-latA)/2)^2+
cos(latB)*cos(latA)*sin((lonB-lonA)/2)^2)
c= 2*asin(pmin(1,sqrt(a)))
lines$distance=6371*c*1000
totalDistance = sum(lines$distance)
totalCost = totalDistance*15
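The haversine step above can also be wrapped in a small function so that both wiring layouts (single path and spider web) go through the same calculation. This is just a sketch: haversine_m is a hypothetical name, the coordinates in the example are made up, and the 6371 km Earth radius is the same one used in the code above.

```r
# Great-circle distance in metres between points given in decimal degrees.
haversine_m <- function(lonA, latA, lonB, latB, radius_m = 6371000) {
  rad <- pi / 180
  dlat <- (latB - latA) * rad
  dlon <- (lonB - lonA) * rad
  a <- sin(dlat / 2)^2 + cos(latA * rad) * cos(latB * rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))  # pmin guards against rounding above 1
}

# Made-up example: total wire length for two segments
segs <- data.frame(lonA = c(77.600, 77.601), latA = c(13.700, 13.701),
                   lonB = c(77.601, 77.603), latB = c(13.701, 13.700))
segs$distance <- with(segs, haversine_m(lonA, latA, lonB, latB))
total_distance <- sum(segs$distance)
```

The function is vectorised, so it can be applied to the lines and ccLines data frames directly with with().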
I seem to have found a discrepancy in the results of a univariate Ripley's K point pattern analysis (Figure 1). To begin, I generated a 1x1 uniform point grid to see if my R script was producing logical results (Figure 2). The study area is 20x40 (Figure 2). Given the completely uniform data, I would not expect to see any random or clustered point patterns at any search distance (r). The attached script was used to generate these results. Under these controlled conditions, why am I seeing clustering and CSR when there should only be a uniform point pattern?
require(spatstat)
require(maptools)
require(splancs)
# Local Variables
flower = 0
year = 2013
# Read the shapefile
sdata = readShapePoints("C:/temp/sample_final.shp") #Read the shapefile
data = sdata[sdata$flow_new == flower,] # subset only flowering plants
data2 = data[data$year == year,] # subset flowering plants at year X
data.frame(data2) # Check the data
# Build the analysis window from the study area boundary points
gapdata = readShapePoints("C:/temp/study_area_boundary.shp") #Read the shapefile
whole = coordinates(gapdata) # get just the coords, excluding other data
win = convexhull.xy(whole) # Convex hull of the boundary points (ripras() would give the Ripley-Rasson estimate instead)
plot(win)
# Converting to PPP
points = coordinates(data2) # get just the coords, excluding other data
ppp = as.ppp(points, win) # Convert the points into the spatstat format
data.check = data.frame(ppp) # Check the format of the ppp data
summary(ppp) # General info about the created ppp object
plot(ppp) # Visually check the points and bounding area
# Now run the ppa
L.Env.ppp = envelope(ppp, Lest, nsim = 1000, correction = "best", rank =1)
plot(L.Env.ppp, main = "Uniform Test")
abline(v=1:12, lty="dotted")
Figure 1
Results of the analysis
Figure 2
The uniform points and the window
Those points are regularly dispersed (sometimes also called hyperdispersed). Although in a colloquial sense they appear uniform, the point process underlying them is not itself uniform: if it were, there would be some chance of point pairs less than one unit apart.
In drawing your attention to that short range deviation from uniformity, Ripley's K is performing exactly as it was designed to!
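The point about pairs less than one unit apart can be checked without spatstat by comparing nearest-neighbour distances. The sketch below uses a 20x40 unit grid like the one in the question and a made-up uniformly random pattern with the same number of points; nn_dist is a hypothetical helper name.

```r
# Distance from each point to its nearest neighbour
nn_dist <- function(xy) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf  # ignore the zero distance from a point to itself
  apply(d, 1, min)
}

# Regular 1x1 grid: 20 x 40 = 800 points
grid_pts <- as.matrix(expand.grid(x = 1:20, y = 1:40))

# Uniformly random (CSR-like) pattern with the same number of points
set.seed(42)
csr_pts <- cbind(x = runif(800, 0, 20), y = runif(800, 0, 40))

range(nn_dist(grid_pts))       # every grid point has its neighbour at distance 1
summary(nn_dist(csr_pts))      # random points: many neighbours closer than 1
mean(nn_dist(csr_pts) < 1)     # share of random points with a neighbour within 1 unit
```

On the grid no pair is ever closer than one unit, which is exactly the short-range regularity that K and L pick up.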
I wish to present a distance matrix in an article I am writing, and I am looking for good visualization for it.
So far I have come across balloon plots (I used one here, but I don't think it will work in this case), heatmaps (here is a nice example, but they don't show the numbers in the table, correct me if I am wrong; maybe half the table in colors and half with numbers would be cool) and lastly correlation ellipse plots (here is some code and an example; using a shape is cool, but I am not sure how to apply it here).
There are also various clustering methods, but they aggregate the data (which is not what I want), while I want to present all of the data.
Example data:
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist(nba[1:20, -1])
I am open for ideas.
You could also use force-directed graph drawing algorithms to visualize a distance matrix, e.g.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dist_m <- as.matrix(dist(nba[1:20, -1]))
dist_mi <- 1/dist_m # one over, as qgraph takes similarity matrices as input
library(qgraph)
jpeg('example_forcedraw.jpg', width=1000, height=1000, units='px')
qgraph(dist_mi, layout='spring', vsize=3)
dev.off()
Tal, this is a quick way to overlay text on a heatmap. Note that this relies on image rather than heatmap, as the latter offsets the plot, making it more difficult to put text in the correct position.
To be honest, I think this graph shows too much information, making it a bit difficult to read... you may want to write only specific values.
Also, the other, quicker option is to save your graph as a PDF, import it into Inkscape (or similar software) and manually add the text where needed.
Hope this helps.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
dst <- dist(nba[1:20, -1])
dst <- as.matrix(dst) # full symmetric matrix from the dist object
dim <- ncol(dst)
image(1:dim, 1:dim, dst, axes = FALSE, xlab="", ylab="")
axis(1, 1:dim, nba[1:20,1], cex.axis = 0.5, las=3)
axis(2, 1:dim, nba[1:20,1], cex.axis = 0.5, las=1)
text(expand.grid(1:dim, 1:dim), sprintf("%0.1f", dst), cex=0.6)
A Voronoi Diagram (a plot of a Voronoi Decomposition) is one way to visually represent a Distance Matrix (DM).
They are also simple to create and plot using R--you can do both in a single line of R code.
If you're not familiar with this aspect of computational geometry, the relationship between the two (VD & DM) is straightforward, though a brief summary might be helpful.
A distance matrix--i.e., a 2D matrix showing the distance between a point and every other point--is an intermediate output during kNN computation (i.e., k-nearest neighbor, a machine learning algorithm which predicts the value of a given data point based on the weighted average value of its 'k' closest neighbors, distance-wise, where 'k' is some integer, usually between 3 and 5).
kNN is conceptually very simple--each data point in your training set is in essence a 'position' in some n-dimensional space, so the next step is to calculate the distance between each point and every other point using some distance metric (e.g., Euclidean, Manhattan, etc.). While the training step--i.e., constructing the distance matrix--is straightforward, using it to predict the value of new data points is practically encumbered by the data retrieval--finding the closest 3 or 4 points from among several thousand or several million scattered in n-dimensional space.
Two data structures are commonly used to address that problem: kd-trees and Voronoi decompositions (aka "Dirichlet tessellation").
For points in the plane, a Voronoi decomposition (VD) is determined by the distance matrix up to rotation, reflection and translation; so in that sense it is a visual representation of the distance matrix, although again, that's not its purpose--its primary purpose is the efficient storage of the data used for kNN-based prediction.
Beyond that, whether it's a good idea to represent a distance matrix this way probably depends most of all on your audience. To most, the relationship between a VD and the antecedent distance matrix will not be intuitive. But that doesn't make it incorrect--if someone without any statistics training wanted to know if two populations had similar probability distributions and you showed them a Q-Q plot, they would probably think you haven't engaged their question. So for those who know what they are looking at, a VD is a compact, complete, and accurate representation of a DM.
So how do you make one?
A Voronoi decomp is constructed by selecting (usually at random) a subset of points from within the training set (this number varies by circumstances, but if we had 1,000,000 points, then 100 is a reasonable number for this subset). These 100 data points are the Voronoi centers ("VC").
The basic idea behind a Voronoi decomp is that rather than having to sift through the 1,000,000 data points to find the nearest neighbors, you only have to look at these 100; once you find the closest VC, your search for the actual nearest neighbors is restricted to just the points within that Voronoi cell. To build the cells, first calculate, for each data point in the training set, the VC it is closest to. Then, for each VC and its associated points, calculate the convex hull--conceptually, just the outer boundary formed by that VC's assigned points that are farthest from the VC. This convex hull around the Voronoi center forms a "Voronoi cell." A complete VD is the result of applying those steps to each VC in your training set. This will give you a perfect tessellation of the surface (See the diagram below).
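The steps just described (pick centers, assign every point to its closest center, take the hull of each group) can be sketched in base R as follows. This only illustrates the description above on made-up 2-D data; it is not how tripack computes a Voronoi mosaic, and the sizes (1,000 points, 10 centers) are arbitrary.

```r
set.seed(1)

# Made-up training set: 1,000 points in the unit square
pts <- cbind(x = runif(1000), y = runif(1000))

# Step 1: choose a random subset of the points as Voronoi centers (VCs)
center_idx <- sample(nrow(pts), 10)
centers <- pts[center_idx, , drop = FALSE]

# Step 2: squared distance from every point to every center (1000 x 10 matrix),
# then assign each point to its closest center
d2 <- outer(pts[, "x"], centers[, "x"], "-")^2 +
      outer(pts[, "y"], centers[, "y"], "-")^2
cell <- apply(d2, 1, which.min)

# Step 3: convex hull of the points assigned to each center
hulls <- lapply(seq_len(nrow(centers)), function(k) {
  p <- pts[cell == k, , drop = FALSE]
  p[chull(p), , drop = FALSE]
})

table(cell)  # number of points assigned to each center
```

A nearest-neighbour lookup for a new point would then first find the closest center and only search the points whose cell value matches it.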
To calculate a VD in R, use the tripack package. The key function is 'voronoi.mosaic' to which you just pass in the x and y coordinates separately--the raw data, not the DM--then you can just pass voronoi.mosaic to 'plot'.
library(tripack)
plot(voronoi.mosaic(runif(100), runif(100), duplicate="remove"))
You may want to consider looking at a 2-d projection of your matrix (Multi Dimensional Scaling). Here is a link to how to do it in R.
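As a rough idea of what that looks like, here is a minimal sketch of classical MDS with cmdscale from base R's stats package. The data are a made-up stand-in (20 items with 5 numeric columns), not the nba table, so the names and sizes are assumptions.

```r
set.seed(1)

# Made-up stand-in for the numeric columns of the data: 20 items, 5 variables
x <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(paste0("item", 1:20), NULL))
d <- dist(x)  # Euclidean distance matrix

# Classical (metric) MDS: 2-D coordinates whose distances approximate d
fit <- cmdscale(d, k = 2, eig = TRUE)
coords <- fit$points

# Plot the items by name; nearby labels correspond to small distances
plot(coords, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(coords, labels = rownames(x), cex = 0.75)

fit$GOF  # goodness of fit of the 2-D representation
```

The closer the goodness-of-fit values are to 1, the more faithfully the 2-D picture reproduces the original distances.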
Otherwise, I think you are on the right track with heatmaps. You can add in your numbers without too much difficulty. For example, building off of Learn R:
library(ggplot2)
library(plyr)
library(arm)
library(reshape2)
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba$Name <- with(nba, reorder(Name, PTS))
nba.m <- melt(nba)
nba.m <- ddply(nba.m, .(variable), transform,
rescale = rescale(value))
(p <- ggplot(nba.m, aes(variable, Name)) + geom_tile(aes(fill = rescale),
colour = "white") + scale_fill_gradient(low = "white",
high = "steelblue")+geom_text(aes(label=round(rescale,1))))
A dendrogram based on a hierarchical cluster analysis can be useful:
http://www.statmethods.net/advstats/cluster.html
A 2-D or 3-D multidimensional scaling analysis in R:
http://www.statmethods.net/advstats/mds.html
If you want to go into 3+ dimensions, you might want to explore ggobi / rggobi:
http://www.ggobi.org/rggobi/
In the book "Numerical Ecology with R" by Borcard et al. (2011), the authors use a function called coldiss().
You can find it here: http://ichthyology.usm.edu/courses/multivariate/coldiss.R
It color-codes the distances and even orders the records by dissimilarity.
Another good option would be the seriation package.
Reference:
Borcard, D., Gillet, F. & Legendre, P. (2011) Numerical Ecology with R. Springer.
A solution using Multidimensional Scaling
data = read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep = ",")
# Squared Euclidean distances from the Gram matrix: d_ij^2 = g_ii + g_jj - 2*g_ij
n = nrow(data)
dst = tcrossprod(as.matrix(data[,-1]))
dst = matrix(rep(diag(dst), n), ncol = n, byrow = TRUE) +
matrix(rep(diag(dst), n), ncol = n, byrow = FALSE) - 2*dst
library(MASS)
mds = isoMDS(dst)
#remove {type = "n"} to see dots
plot(mds$points, type = "n", pch = 20, cex = 3, col = adjustcolor("black", alpha = 0.3), xlab = "X", ylab = "Y")
text(mds$points, labels = rownames(data), cex = 0.75)