Let's say I have the output of a monte-carlo simulation of one variable over several different iterations (think millions). For each iteration, I have the values of the variable at each point in time (ranging from t=1 to t=365).
I would like to produce the following plot:
For each point in time, t, on the x axis and for each possible value "y" in a given range, set the color of x,y to "k" where "k" is a count of how many observations are within a vicinity of distance "d" to x,y.
I know you can easily make density heatmaps for 1D data, but is there a good package for doing this on 2 dimensions? Do I have to use kriging?
Edit: The data structure is currently a matrix.
data matrix
day number
[,1] [,2] [,3] [,4] [,5] ... [,365]
iteration [1,] 0.000213 0.001218 0.000151 0.000108 ... 0.000101
[2,] 0.000314 0.000281 0.000117 0.000103 ... 0.000305
[3,] 0.000314 0.000281 0.000117 0.000103 ... 0.000305
[4,] 0.000171 0.000155 0.000141 0.000219 ... 0.000201
.
.
.
[100000000,] 0.000141 0.000148 0.000144 0.000226 ... 0.000188
I want to, for each "day" have the pixels running vertically across that "day" to represent the probability density of the iteration's values for that day in color. The result should look like a heatmap.
Here is one solution to what I think you are after.
Generate data.
myData <- mapply(rnorm, 1000, 200, mean=seq(-50,50,0.5))
This is a matrix with 1000 rows (observations) and 201 time points. In each time point the mean of data there shifts gradually from -50 to 50. By 0.5 each time.
Get densities.
myDensities <- apply(myData, 2, density, from=-500, to=500)
This will give you a list of densities for each column. In order for them to be plottable side by side we specified the ranges (from -500 to 500) manually.
Obtain density values from the list.
Ys <- sapply(myDensities, "[", "y")
This is again a list. You need to get a matrix from that.
Get matrix from list.
img <- do.call(cbind, Ys)
This simply combines all Ys elements by column.
Plot.
filled.contour(x=1:ncol(img), y=myDensities[[1]]$x, t(img))
I use filled.contour for that. But you can look around for other 2-D plot functions. I also used values obtained from the densities D[[1]]$x.
And here is the result:
The shift from -50 to 50 is visible.
Not sure if this can work well with millions of time points. But plotting million probably makes little sense since you will in any case by limited by the number of pixels. Some kind of pre-processing might be necessary.
Another way to present data over time is to create a video.
The following uses the same matrix data as Karolis:
library(av)
myData <- mapply(rnorm, 1000, 200, mean=seq(-50,50,0.5))
# create function that includes a for loop, the output from
# each iteration of the for loop will become one frame in
# the animation.
make_plot <- function(myData){
xrange = range(myData)
for(i in seq_along(myData[1,])){
d <- density(myData[,i],
bandwidth = 45) # returns the density data
plot(d,
xlim=xrange,
ylim=c(0, 0.003),
main = paste("Density, day:",i))
}
}
# create video
av_capture_graphics(make_plot(myData),
output = "Density change over time.mp4",
width = 720,
height = 480,
framerate = 120)
Related
I have a dataframe of x and y geographic coordinates (30,000+ coordinates) that look like the example matrix points below. I want to take a random sample of these but such that I don't lose the pairs of x and y coordinates.
For example, I know that I can get a random sample of say 2 of the items in x and y, but how do I get a random sample so that items that go together are preserved? In other words, in my matrix of points, one actual point is a pair of an x coordinate (for example, the first item: -12.89) that goes with the first item in the y list: 18.275.
Is there a way that I could put together the items in x and y such that the order is preserved in a tuple-like object (I'm more of a python user) and then take a random sample using sample()? Thanks.
# Make some pretend data
x<-c(-12.89,-15.35,-15.46,-41.17,45.32)
y<-c(18.275,11.370,18.342,18.305,18.301)
points<-cbind(x,y)
points
# Get a random sample:
# This is wrong because the x and y need to be considered together
c(sample(x, 2),
sample(y, 2))
# This is also wrong because it treats each item in `points` separately
sample(points, size=2, replace=FALSE)
Ultimately, in this example, I would want to end up with two random pairs that go together.
For example: (-15.35,11.370) and (45.32,18.301)
You can take a sample from the row index:
set.seed(42)
points[sample(seq_len(nrow(points)), 2), ]
Gives
# x y
#[1,] -12.89 18.275
#[2,] 45.32 18.301
Another option could be:
set.seed(123)
do.call(`rbind`, sample(asplit(points, 1), 2))
x y
[1,] -15.35 11.370
[2,] -41.17 18.305
I'm trying to find sites to collect snails by using a semi-random selection method. I have set a 10km2 grid around the region I want to collect snails from, which is broken into 10,000 10m2 cells. I want to randomly this grid in R to select 200 field sites.
Randomly sampling a matrix in R is easy enough;
dat <- matrix(1:10000, nrow = 100)
sample(dat, size = 200)
However, I want to bias the sampling to pick cells closer to a single position (representing sites closer to the research station). It's easier to explain this with an image;
The yellow cell with a cross represents the position I want to sample around. The grey shading is the probability of picking a cell in the sample function, with darker cells being more likely to be sampled.
I know I can specify sampling probabilities using the prob argument in sample, but I don't know how to create a 2D probability matrix. Any help would be appreciated, I don't want to do this by hand.
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- (matrix(D, nr=nx, byrow=TRUE))
image(XY) # axes are scaled to 0-1
# Create a scaling function - scales x to lie in [0-1)
scale_prop <- function(x, m=0)
(x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities will be proportional to the distance from the station, which are scaled to lie between [0 - 1). We don't want a 1 for the maximum distance (m=1).
prob <- 1 - scale_prop(D, m=1); range (prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # Thse are your (**MY!**) 5 samples
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
sam <- sample(1:nrow(xy), size = nsamples, prob=prob)
apply(xy[sam,], 1, function(x) paste(x[1], x[2], sep=","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although the sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, then you can rescale the probabilities, which I think is ok to do, to save costs on travelling, but these weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
There may be a better way than this but a quick way to do it is to randomly sample on both x and y axis using a distribution (I used the normal - bell shaped distribution, but you can really use any). The trick is to make the mean of the distribution the position of the research station. You can change the bias towards the research station by changing the standard deviation of the distribution.
Then use the randomly selected positions as your x and y coordinates to select the positions.
dat <- matrix(1:10000, nrow = 100)
#randomly selected a position for the research station
rs <- c(80,30)
# you can change the sd to change the bias
x <- round(rnorm(400,mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))
position <- rep(NA, 200)
j = 1
i = 1
# as some of the numbers sampled can be outside of the area you want I oversampled # and then only selected the first 200 that were in the area of interest.
while (j <= 200) {
if(x[i] > 0 & x[i] < 100 & y[i] > 0 & y [i]< 100){
position[j] <- dat[x[i],y[i]]
j = j +1
}
i = i +1
}
plot the results:
plot(x,y, pch = 19)
points(x =80,y = 30, col = "red", pch = 19) # position of the station
I have a 288000x3 matrix (288000 rows, 3 columns) of x and y cartesian coordinates from -60 to 60 with decimals that trail to 8 places along with a value at those coordinates.
Example-
y.cart x.cart value
[1,] 0.001308930 0.07498858 -49.36752
[2,] 0.002617462 0.07495431 -48.33903
[3,] 0.003925197 0.07489722 -51.42450
[4,] 0.005231736 0.07481730 -51.93874
[5,] 0.006536681 0.07471460 -513.73075
[6,] 0.007839635 0.07458914 -52.45299
[7,] 0.009140201 0.07444096 -51.93874
[8,] 0.010437983 0.07427011 -48.85327
[9,] 0.011732585 0.07407663 -49.36752
[10,] 0.013023613 0.07386058 -50.91025
This is weather radar reflectivity data and I need to plot it to look like the output that filled.contour creates, but in order to use filled.contour, the values need to be in a matrix because the function uses the matrix position as the coordinates for the plot which doesn't work with the way that my data is organized. Is there a way to use a filled.contour with the data in this form or, is there another way to do this? I've been fiddling with it for two days and haven't gotten very far. Any help would be greatly appreciated.
You can try to get the value column in a Matrix. This can done in a for loop. But for this I make the assumption, that in your data the y and x values in the variables y.cart and x.cart are not unique. I did this because I think you have something like a map and on this map every point from a grid is a pair of coordinates.
Is this correct you can try this code:
# Some sample data:
y.cart <- x.cart <- seq(-60,60,length.out = 600)
# Bring it in the form like your data are:
DF <- data.frame(x.cart = sample(x = x.cart, length(x.cart)^2, replace = TRUE),
y.cart = sample(x = y.cart, length(y.cart)^2, replace = TRUE),
value = rnorm(length(y.cart)^2))
# Also works for a Matrix:
DF <- as.matrix(DF)
# Define the Matrix Z. In this Matrix are just NAs, because if a value on a
# special coordinate doesn't exist there should be nothing drawn:
Z <- matrix(rep(NA,length(DF[,1])^2), nrow = length(DF[,1]))
# Get the unique points which represent the x and y coordinate. It's important
# to use the unique points for getting the index for the Matrix out of this vectors:
x <- sort(unique(DF[,1]))
y <- sort(unique(DF[,2]))
# In this loop every row in de data.frame (or matrix) is matched with the vector
# x for the i-th row in the Matrix and with the vector y for the j-th column in
# the Matrix Z[i,j]:
for(i in seq(along = DF[,1])) {
Z[which(x == DF[i,1]),which(y == DF[i,2])] <- DF[i,3]
}
# Now you can use persp or filled.contour with the following call:
persp(x,y,Z)
filled.contour(x,y,Z)
This works for my sample data, even though it makes no sense for them. Keep in your mind that the for loop isn't very fast and with your data it could take a while. You can build in a process bar to controle the status from the loop with:
pb <- txtProgressBar(min = 1, max = length(DF[,1]), style = 3)
for(i in seq(along = DF[,1])) {
Z[which(x == DF[i,1]),which(y == DF[i,2])] <- DF[i,3]
setTxtProgressBar(pb, i)
}
Also it's necessary that x and y have the same length and the Matrix Z is a Matrix with dimensions lenght(x) and length(y).
I hope this works for you. If my thinkings about the data aren't true you can maybe give a little more details about the data. And do not forget to replace DF with the name of your matrix.
I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into an other, larger, set of 2D coordinates. Imagine for example having a Polaroid photo of a small piece of the starry sky with you out at night, and you want to hold it up in a position so they match the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# generate points take subset of coordinates to serve as points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represent the reference data. The left plot represents the points that I want to translate across the reference data in order to find a position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history) using optim() to minimize the sum of distances between points will only work in the limited circumstance where the point distribution used as reference data is in the middle of the point field. The solution that satisfies the question and seems to still be workable for a few thousand points, would be a brute-force delta and comparison algorithm that calculates the differences between each point in the field against a single point of the reference data and then determines how many of the rest of the reference data are within a minimum threshold (which is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for(i in 1:nrow(ref_coords)) {
x2 <- my_coords_start[,1]
y2 <- my_coords_start[,2]
x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])
## Calculate all pairwise distances between reference and field data:
dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")
## Only take distances for the sampled data:
dists <- as.matrix(dists)[-1*1:length(x1),]
## Calculate the number of distances within the minimum
## distance threshold minus the diagonal portion:
win_thresh <- sum(rowSums(dists < min_dist) > 1)
## If we have more "matches" than our best then calculate a new
## dx and dy:
if (win_thresh > win_thresh_old) {
win_thresh_old <- win_thresh
dx <- (x2[1] - ref_coords[i,1])
dy <- (y2[1] - ref_coords[i,2])
}
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.
What would be an easy way to generate a 3 different spatial distribution of points (N = 20 points) using R. For example, 1) random, 2) uniform, and 3) clustered on the same space (50 x 50 grid)?
1) Here's one way to get a very even spacing of 5 points in a 25 by 25 grid numbered from 1 each direction. Put points at (3,18), (8,3), (13,13), (18,23), (23,8); you should be able to generalize from there.
2) as you suggest, you could use runif ... but I'd have assumed from your question you actually wanted points on the lattice (i.e. integers), in which case you might use sample.
Are you sure you want continuous rather than discrete random variables?
3) This one is "underdetermined" - depending on how you want to define things there's a bunch of ways you might do it. e.g. if it's on a grid, you could sample points in such a way that points close to (but not exactly on) already sampled points had a much higher probability than ones further away; a similar setup works for continuous variables. Or you could generate more points than you need and eliminate the loneliest ones. Or you could start with random uniform points and them make them gravitate toward their neighbors. Or you could generate a few cluster-centers (4-10, say), and then scatter points about those centers. Or you could do any of a hundred other things.
A bit late, but the answers above do not really address the problem. Here is what you are looking for:
library(sp)
# make a grid of size 50*50
x1<-seq(1:50)-0.5
x2<-x1
grid<-expand.grid(x1,x2)
names(grid)<-c("x1","x2")
# make a grid a spatial object
coordinates(grid) <- ~x1+x2
gridded(grid) <- TRUE
First: random sampling
# random sampling
random.pt <- spsample(x = grid, n= 20, type = 'random')
Second: regular sampling
# regular sampling
regular.pt <- spsample(x = grid, n= 20, type = 'regular')
Third: clustered at a distance of 2 from a random location (can go outside the area)
# random sampling of one location
ori <- data.frame(spsample(x = grid, n= 1, type = 'random'))
# select randomly 20 distances between 0 and 2
n.point <- 20
h <- rnorm(n.point, 1:2)
# empty dataframe
dxy <- data.frame(matrix(nrow=n.point, ncol=2))
# take a random angle from the randomly selected location and make a dataframe of the new distances from the original sampling points, in a random direction
angle <- runif(n = n.point,min=0,max=2*pi)
dxy[,1]= h*sin(angle)
dxy[,2]= h*cos(angle)
cluster <- data.frame(x=rep(NA, 20), y=rep(NA, 20))
cluster$x <- ori$coords.x1 + dxy$X1
cluster$y <- ori$coords.x2 + dxy$X2
# make a spatial object and plot
coordinates(cluster)<- ~ x+y
plot(grid)
plot(cluster, add=T, col='green')
plot(random.pt, add=T, col= 'red')
plot(regular.pt, add=T, col= 'blue')