Drawing a sample that changes the shape of the mother sample - r

Background:
I'm trying to modify the shape of a histogram resulting from an "Initial" large sample obtained using Initial = rbeta(1e5, 2, 3). Specifically, I want the modified version of the Initial large sample to have 2 additional smaller (in height) "humps" (i.e., 2 more smaller-height peaks in addition to the one that exists in the Initial large sample).
Coding Question:
I'm wondering how to manipulate sample() (maybe using its prob argument) in base R so that it samples in such a way that the two additional humps appear around ".5" and ".6" on the X-Axis?
Here is my current R code:
Initial = rbeta(1e5, 2, 3) ## My initial Large Sample
hist (Initial) ## As seen, here there is only one "hump" say near
# less than ".4" on the X-Axis
Modified.Initial = sample(Initial, 1e4 ) ## This is meant to be the modified version of the
# the Initial with two additional "humps"
hist(Modified.Initial) ## Here, I need to see two additional "humps" near
# ".5" and ".6" on the X-Axis

You can adjust the density distribution by combining it with beta distributions with the desired modes for a smoothed adjustment.
set.seed(47)
Initial = rbeta(1e5, 2, 3)
d <- density(Initial)
# Generate densities of beta distribution. Parameters determine center (0.5) and spread.
# Evaluate at d$x so each weight lines up with the value it is paired with in sample()
b.5 <- dbeta(d$x, 50, 50)
b.5 <- b.5 / (max(b.5) / max(d$y)) # Scale down to max of original density
# Repeat centered at 0.6
b.6 <- dbeta(d$x, 60, 40)
b.6 <- b.6 / (max(b.6) / max(d$y))
# Collect maximum densities at each x to use as sample probability weights
p <- pmax(d$y, b.5, b.6)
plot(p, type = 'l')
# Sample from density breakpoints with new probability weights
Final <- sample(d$x, 1e4, replace = TRUE, prob = p)
Effects on histogram are subtle...
hist(Final)
...but are more obvious in the density plot.
plot(density(Final))
Obviously all adjustments are arbitrary. Please don't do terrible things with your power.
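For comparison, here is a minimal sketch of a more direct route: weighting the original draws themselves through the prob argument of sample(). The helper hump_weight, the bump width (0.01) and the bump heights (1 and 0.7) are illustrative assumptions, not part of the answer above, and the sampling is done with replacement to keep it fast.

```r
set.seed(47)
Initial <- rbeta(1e5, 2, 3)

# Hypothetical helper: a Gaussian-shaped bump of a given height around `centre`
hump_weight <- function(x, centre, width = 0.01, height = 1) {
  height * exp(-0.5 * ((x - centre) / width)^2)
}

# Baseline weight of 1 everywhere, plus extra weight near 0.5 and 0.6
w <- 1 + hump_weight(Initial, 0.5) + hump_weight(Initial, 0.6, height = 0.7)

# Values near 0.5 and 0.6 are now drawn more often than the rest
Modified.Initial <- sample(Initial, 1e4, replace = TRUE, prob = w)

if (interactive()) hist(Modified.Initial, breaks = 100)
```

Lowering the heights gives subtler humps, and widening the bumps makes them broader; a fine breaks setting in hist() helps to see them.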

Related

R: Sample a matrix for cells close to a specified position

I'm trying to find sites to collect snails by using a semi-random selection method. I have set a 10km2 grid around the region I want to collect snails from, which is broken into 10,000 10m2 cells. I want to randomly sample this grid in R to select 200 field sites.
Randomly sampling a matrix in R is easy enough;
dat <- matrix(1:10000, nrow = 100)
sample(dat, size = 200)
However, I want to bias the sampling to pick cells closer to a single position (representing sites closer to the research station). It's easier to explain this with an image;
The yellow cell with a cross represents the position I want to sample around. The grey shading is the probability of picking a cell in the sample function, with darker cells being more likely to be sampled.
I know I can specify sampling probabilities using the prob argument in sample, but I don't know how to create a 2D probability matrix. Any help would be appreciated, I don't want to do this by hand.
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- (matrix(D, nr=nx, byrow=TRUE))
image(XY) # axes are scaled to 0-1
# Create a scaling function - scales x to lie in [0-1)
scale_prop <- function(x, m=0)
(x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities decrease linearly with distance from the station. Distances are first scaled to lie in [0, 1); using m=1 keeps the scaled maximum below 1, so the farthest cell still gets a non-zero probability.
prob <- 1 - scale_prop(D, m=1); range (prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are your (**MY!**) 5 samples
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
sam <- sample(1:nrow(xy), size = nsamples, prob=prob)
apply(xy[sam,], 1, function(x) paste(x[1], x[2], sep=","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although the sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, then you can rescale the probabilities, which I think is ok to do, to save costs on travelling, but these weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
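The same idea can be sketched in base R only, building the 2D probability matrix directly from the row and column indices of the grid. This is an illustration under assumptions: the station position (60, 70), the linear decay and the cubing are carried over from the text above, and the names station_row, station_col, idx and sites are made up for the example.

```r
set.seed(1)
dat <- matrix(1:10000, nrow = 100)
station_row <- 60   # assumed station position
station_col <- 70

# Euclidean distance from every cell to the station, as a 100 x 100 matrix
d <- sqrt((row(dat) - station_row)^2 + (col(dat) - station_col)^2)

# Linear decay with distance; the +1 keeps the farthest cell above zero
prob <- 1 - d / (max(d) + 1)

# Sample 200 distinct cells (linear indices), then recover row/column positions
idx <- sample(length(dat), size = 200, prob = as.vector(prob)^3)
sites <- arrayInd(idx, dim(dat))
colnames(sites) <- c("row", "col")
head(sites)
```

Because prob has the same shape as dat, image(prob) gives a quick visual check of the weighting before any sampling is done.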
There may be a better way than this but a quick way to do it is to randomly sample on both x and y axis using a distribution (I used the normal - bell shaped distribution, but you can really use any). The trick is to make the mean of the distribution the position of the research station. You can change the bias towards the research station by changing the standard deviation of the distribution.
Then use the randomly selected positions as your x and y coordinates to select the positions.
dat <- matrix(1:10000, nrow = 100)
#randomly selected a position for the research station
rs <- c(80,30)
# you can change the sd to change the bias
x <- round(rnorm(400,mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))
position <- rep(NA, 200)
j = 1
i = 1
# Some of the sampled coordinates can fall outside the grid, so I oversampled
# and then kept only the first 200 that were inside the area of interest.
while (j <= 200) {
if(x[i] > 0 & x[i] <= 100 & y[i] > 0 & y[i] <= 100){ # dat has 100 rows and 100 columns
position[j] <- dat[x[i],y[i]]
j = j +1
}
i = i +1
}
Plot the results:
plot(x,y, pch = 19)
points(x =80,y = 30, col = "red", pch = 19) # position of the station
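A vectorised variant of the same idea is sketched below. It filters the oversampled coordinates in one step and also drops repeated cells, which the loop above does not do. The station position and sd are taken from the answer; the stopifnot() guard and the names inside, cells and sites are assumptions added for the example.

```r
set.seed(1)
rs <- c(80, 30)  # station position, as in the answer above

# Oversample candidate coordinates around the station
x <- round(rnorm(400, mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))

# Keep candidates inside the 100 x 100 grid and drop repeated cells
inside <- x >= 1 & x <= 100 & y >= 1 & y <= 100
cells <- unique(cbind(x = x[inside], y = y[inside]))

# Fail loudly if the oversampling was not enough (an assumed safeguard)
stopifnot(nrow(cells) >= 200)
sites <- cells[1:200, ]
```

If the guard ever fails, for example with a station near a corner or a small sd, increasing the number of candidates is the simplest remedy.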

Find the common area between two graphs with multiple intersection points

I have the following simulated data for 2 variables. I created the density plots as follows:
set.seed(1)
x1=density(rnorm(100,0.5,3))
x2=density(rnorm(100,1,3))
plot(x1)
lines(x2)
Is there a function I can use to find the common area of these 2 graphs in R?
Do I need to integrate between the intersection points?
Thank you
If you set the sequence both densities use for x values to be identical, you can use pmin on the y values. (Call str(x1) to see how they're stored.) For instance, to see how it works:
set.seed(1)
x1 <- density(rnorm(100,0.5,3), from = -10, to = 10, n = 501)
x2 <- density(rnorm(100,1,3), from = -10, to = 10, n = 501)
plot(x2, main = 'Density intersection')
lines(x1)
polygon(x1$x, pmin(x1$y, x2$y), density = 20, col = 'dodgerblue')
Taking the integral means just multiplying each pmin times the increment in the x sequence and summing the lot:
sum(pmin(x1$y, x2$y) * diff(x1$x[1:2]))
#> [1] 0.896468
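As a cross-check on the sum above, the same overlap can be approximated with the trapezoidal rule. This is only a sketch; the names overlap_y, rect_area and trap_area are made up here, and the two estimates should differ only slightly because the densities are close to zero at both ends of the grid.

```r
set.seed(1)
x1 <- density(rnorm(100, 0.5, 3), from = -10, to = 10, n = 501)
x2 <- density(rnorm(100, 1, 3), from = -10, to = 10, n = 501)

overlap_y <- pmin(x1$y, x2$y)   # height of the shared region at each x
dx <- diff(x1$x)                # spacing of the common x grid

# Rectangle rule, as in the answer above
rect_area <- sum(overlap_y * dx[1])

# Trapezoidal rule: average neighbouring heights over each interval
trap_area <- sum(dx * (head(overlap_y, -1) + tail(overlap_y, -1)) / 2)

c(rectangle = rect_area, trapezoid = trap_area)
```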

R radarchart: free axis to enhance records display?

I am trying to display my data using radarchart {fmsb}. The values of my records are highly variable. Therefore, low values are not visible on final plot.
Is there a way to "free" the axis for each record, to visualize data independently of their scale?
Dummy example:
df<-data.frame(n = c(100, 0,0.3,60,0.3),
j = c(100,0, 0.001, 70,7),
v = c(100,0, 0.001, 79, 3),
z = c(100,0, 0.001, 80, 99))
      n       j       v       z
1 100.0 100.000 100.000 100.000 # max
2   0.0   0.000   0.000   0.000 # min
3   0.3   0.001   0.001   0.001 # small values -> not visible on final chart!!
4  60.0  70.000  79.000  80.000
5   0.3   7.000   3.000  99.000
Create radarchart
require(fmsb)
radarchart(df, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
Result: (only rows #2 and #3 are visible, row #1 with low values is not visible !!)
How to make visible all records (rows), i.e. how to "free" axis for any of my records? Thank you a lot,
If you want to be sure to see all 4 dimensions whatever the differences, you'll need a logarithmic scale.
As by design of the radar chart we cannot have negative values we are restricted on our choice of base by the range of values and by our number of segments (axis ticks).
If we want an integer base the minimum we can choose is:
seg0 <- 5 # your initial choice, could be changed
base <- ceiling(
max(apply(df[-c(1,2),],MARGIN = 1,max) / apply(df[-c(1,2),],MARGIN = 1,min))
^(1/(seg0-1))
)
Here we have a base 5.
Let's normalize and transform our data.
First we normalize the data by setting the maximum to 1 for all series, then we apply our logarithmic transformation, which sets the maximum of each series to seg0 (n for black, z for others) and the minimum among all series between 1 and 2 (here the n value of the green series).
df_normalized <- as.data.frame(df[-c(1,2),]/apply(df[-c(1,2),],MARGIN = 1,max))
df_transformed <- rbind(rep(seg0,4),rep(0,4),log(df_normalized,base) + seg0)
radarchart(df_transformed, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = seg0, centerzero = T,maxmin=T)
If we look at the green series we see:
j and v have same order of magnitude
n is about 5^2 = 25 times smaller than j (5 is the value of the base, ^2 because 2 segments)
v is about 5^2 = 25 times (again) smaller than z
If we look at the black series we see that n is about 5^3.5 times bigger than the other dimensions.
If we look at the red series we see that the order of magnitude is the same among all dimensions.
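The numbers behind this transformation can be checked without drawing the chart. The sketch below repeats the base calculation and the log transform on a plain matrix; the names records, row_max, row_min and transformed are introduced only for this illustration.

```r
df <- data.frame(n = c(100, 0, 0.3, 60, 0.3),
                 j = c(100, 0, 0.001, 70, 7),
                 v = c(100, 0, 0.001, 79, 3),
                 z = c(100, 0, 0.001, 80, 99))

seg0 <- 5
records <- as.matrix(df[-c(1, 2), ])   # drop the max and min rows
row_max <- apply(records, 1, max)
row_min <- apply(records, 1, min)

# Smallest integer base that fits the largest within-row ratio into seg0 - 1 segments
base <- ceiling(max(row_max / row_min)^(1 / (seg0 - 1)))

# Scale each row so its maximum is 1, then take logs and shift the maximum to seg0
transformed <- log(records / row_max, base = base) + seg0

base
apply(transformed, 1, range)
```

Every row tops out at seg0, and no value drops below 1, which is what keeps all series inside the plotted range.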
Maybe a workaround for your problem:
If you would transform your data before running radarchart
(e.g. logarithm, square root ..) then you could also visualise small values.
Here an example using a cubic root transformation:
library(specmine)
df.c<-data.frame(cubic_root_transform(df)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
EDIT:
If you want to zoom the small values even more you can do that with a higher order of the root.
e.g.
t<-5 # for fifth order root
df.t <- data.frame(apply(df, 2, function(x) x^(1/t))) # transform dataset
radarchart(df.t, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
You can adjust the "zoom" as you want by changing the value of t
So you should find a visualization that is suitable for you.
Here is an example using 10-th root transformation:
library(specmine)
df.c<-data.frame((df)^(1/10)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
You can try different n-th roots to find the one that works best for you. As n grows, the n-th root of a value near zero grows faster relative to the larger values, so small values become more visible.
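To get a feel for how strongly a given root "zooms", one can compute where a small value lands relative to the axis maximum after the transformation. This is a small sketch; relative_position and the chosen roots are assumptions made for illustration, using the 0.001 and 100 from the example data.

```r
# Position of a small value (0.001) relative to the axis maximum (100)
# after taking the p-th root of both
relative_position <- function(p, small = 0.001, top = 100) {
  small^(1 / p) / top^(1 / p)
}

roots <- c(1, 3, 5, 10)
pos <- sapply(roots, relative_position)
round(setNames(pos, paste0("p=", roots)), 4)
```

With a fifth root, for instance, 0.001 sits about 10% of the way along an axis whose maximum corresponds to 100, whereas on the untransformed axis it is indistinguishable from zero.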

Spatial correlogram using the raster package

Dear Crowd
Problem
I tried to calculate a spatial correlogram with the packages ncf, pgirmess, SpatialPack and spdep. However, I had trouble defining the start and end point of the distance. I'm only interested in the spatial autocorrelation at smaller distances, but with smaller bins there. Additionally, as the raster is quite large (1.8 megapixels), I run into memory trouble with all of these packages except SpatialPack.
So I tried to produce my own code, using the function Moran from the package raster. But I must have some error, as the result for the complete dataset is somewhat different than the one from the other packages. If there is no error in my code, it might at least help others with similar problems.
Question
I'm not sure, whether my focal matrix is erroneous. Could you please tell me whether the central pixel needs to be incorporated? Using the testdata I can't show the differences between the methods, but on my complete dataset, there are differences visible, as shown in the Image below. However, the bins are not exactly the same (50m vs. 69m), so this might explain parts of the differences. However, at the first bin, this explanation seems not to be plausible to me. Or might the irregular shape of my raster, and different ways to handle NA's cause the difference?
Comparison of Own method with the one from SpatialPack
Runnable Example
Testdata
The code for calculating the testdata is taken from http://www.petrkeil.com/?p=1050#comment-416317
# packages used for the data generation
library(raster)
library(vegan) # will be used for PCNM
# empty matrix and spatial coordinates of its cells
side=30
my.mat <- matrix(NA, nrow=side, ncol=side)
x.coord <- rep(1:side, each=side)*5
y.coord <- rep(1:side, times=side)*5
xy <- data.frame(x.coord, y.coord)
# all pairwise euclidean distances between the cells
xy.dist <- dist(xy)
# PCNM axes of the dist. matrix (from 'vegan' package)
pcnm.axes <- pcnm(xy.dist)$vectors
# using 8th PCNM axis as my artificial z variable
z.value <- pcnm.axes[,8]*200 + rnorm(side*side, 0, 1)
# plotting the artificial spatial data
r <- rasterFromXYZ(xyz = cbind(xy,z.value))
plot(r, axes=F)
Own Code
library(raster)
sp.Corr <- matrix(nrow = 0,ncol = 2)
formerBreak <- 0 #for the first run important
for (i in c(seq(10,200,10))) #Calculate the Morans I for these bins
{
cat(paste0("..",i)) #print the bin, which is currently calculated
w = focalWeight(r,d = i,type = 'circle')
wTemp <- w #temporarily saves the weight matrix
if (formerBreak>0) #if it is the second run
{
midpoint <- ceiling(ncol(w)/2) # get the midpoint
w[(midpoint-formerBreak):(midpoint+formerBreak),(midpoint-formerBreak):(midpoint+formerBreak)] <- w[(midpoint-formerBreak):(midpoint+formerBreak),(midpoint-formerBreak):(midpoint+formerBreak)]*(wOld==0)#set the previous focal weights to 0
w <- w*(1/sum(w)) #normalizes the vector to sum the weights to 1
}
wOld <- wTemp #save this weight matrix for the next run
mor <- Moran(r,w = w)
sp.Corr <- rbind(sp.Corr,c(Moran =mor,Distance = i))
formerBreak <- i/res(r)[1]#divides the breaks by the resolution of the raster to be able to translate them to the focal window
}
plot(x=sp.Corr[,2],y = sp.Corr[,1],type = "l",ylab = "Moran's I",xlab="Upper bound of distance")
Other methods to calculate the Spatial Correlogram
library(SpatialPack)
sp.Corr <- summary(modified.ttest(z.value,z.value,coords = xy,nclass = 21))
plot(x=sp.Corr$coef[,1],y = sp.Corr$coef[,4],type = "l",ylab = "Moran's I",xlab="Upper bound of distance")
library(ncf)
ncf.cor <- correlog(x.coord, y.coord, z.value,increment=10, resamp=1)
plot(ncf.cor)
To compare the correlogram results in your case, two things should be considered. (i) Your code only works for bins that are multiples of the raster resolution. A small difference in the bins can therefore include or exclude a substantial number of pairs. (ii) The irregular shape of the raster has a strong impact on which pairs are used to compute the correlation for a given distance interval. So your code should handle both: allow any bin length and account for the irregular shape of the raster. A small modification of your code that tackles these problems is below.
# SpatialPack correlation
library(SpatialPack)
test <- modified.ttest(z.value,z.value,coords = xy,nclass = 21)
# Own correlation
bins <- test$upper.bounds
library(raster)
sp.Corr <- matrix(nrow = 0,ncol = 2)
for (i in bins) {
cat(paste0("..",i)) #print the bin, which is currently calculated
w = focalWeight(r,d = i,type = 'circle')
wTemp <- w #temporarily saves the weight matrix
if (i > bins[1]) {
midpoint <- ceiling(dim(w)/2) # get the midpoint
half_range <- floor(dim(wOld)/2)
w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
(midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])] <-
w[(midpoint[1] - half_range[1]):(midpoint[1] + half_range[1]),
(midpoint[2] - half_range[2]):(midpoint[2] + half_range[2])]*(wOld==0)
w <- w * (1/sum(w)) #normalizes the vector to sum the weights to 1
}
wOld <- wTemp #save this weight matrix for the next run
mor <- Moran(r,w=w)
sp.Corr <- rbind(sp.Corr,c(Moran =mor,Distance = i))
}
# Comparing
plot(x=test$upper.bounds, test$imoran[,1], col = 2,type = "b",ylab = "Moran's I",xlab="Upper bound of distance", lwd = 2)
lines(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
points(x=sp.Corr[,2],y = sp.Corr[,1], col = 3)
legend('topright', legend = c('SpatialPack', 'Own code'), col = 2:3, lty = 1, lwd = 2:1)
The image shows that the results of using the SpatialPack package and the own code are the same.
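To make explicit what each point of such a correlogram measures, here is a base R sketch that computes Moran's I from pairwise distances for a series of distance bands. It is not the raster or SpatialPack implementation: it assumes binary weights within each band, gives self-pairs a weight of zero, and uses a small made-up grid and variable (side, z, moran_band are illustrative names) so that the full distance matrix fits in memory.

```r
set.seed(1)
side <- 15
xy <- expand.grid(x = (1:side) * 5, y = (1:side) * 5)   # 5-unit cell spacing
z <- sin(xy$x / 10) + rnorm(nrow(xy), sd = 0.2)        # made-up variable with a spatial trend

d <- as.matrix(dist(xy))   # all pairwise distances

# Moran's I using only pairs whose distance falls in (lower, upper]
moran_band <- function(z, d, lower, upper) {
  w <- (d > lower & d <= upper) * 1   # binary weights; self-pairs (d = 0) get weight 0
  zc <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(zc, zc)) / sum(zc^2)
}

upper <- seq(5, 50, by = 5)
lower <- upper - 5
correlogram <- mapply(function(lo, up) moran_band(z, d, lo, up), lower, upper)

if (interactive()) {
  plot(upper, correlogram, type = "b",
       xlab = "Upper bound of distance", ylab = "Moran's I")
}
```

Because the bands are defined on distances rather than on a focal window, any bin width can be used. The price is the full distance matrix, which is why this only suits small examples.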

spatial distribution of points, R

What would be an easy way to generate three different spatial distributions of points (N = 20 points) using R? For example, 1) random, 2) uniform, and 3) clustered, all on the same space (50 x 50 grid).
1) Here's one way to get a very even spacing of 5 points in a 25 by 25 grid numbered from 1 each direction. Put points at (3,18), (8,3), (13,13), (18,23), (23,8); you should be able to generalize from there.
2) as you suggest, you could use runif ... but I'd have assumed from your question you actually wanted points on the lattice (i.e. integers), in which case you might use sample.
Are you sure you want continuous rather than discrete random variables?
3) This one is "underdetermined" - depending on how you want to define things there's a bunch of ways you might do it. e.g. if it's on a grid, you could sample points in such a way that points close to (but not exactly on) already sampled points had a much higher probability than ones further away; a similar setup works for continuous variables. Or you could generate more points than you need and eliminate the loneliest ones. Or you could start with random uniform points and then make them gravitate toward their neighbors. Or you could generate a few cluster-centers (4-10, say), and then scatter points about those centers. Or you could do any of a hundred other things.
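The three options can be sketched in base R as follows, using continuous coordinates on a 50 x 50 square. Everything specific here is an assumption chosen for illustration: the 5 x 4 lattice for the regular pattern, 4 cluster centres with 5 points each, a scatter of sd = 2, and clipping to the square.

```r
set.seed(42)
n <- 20
side <- 50

# 1) Random: independent uniform coordinates over the square
random_pts <- data.frame(x = runif(n, 0, side), y = runif(n, 0, side))

# 2) Regular: a 5 x 4 lattice, each point centred in its own rectangle
regular_pts <- expand.grid(x = (1:5 - 0.5) * side / 5,
                           y = (1:4 - 0.5) * side / 4)

# 3) Clustered: 4 random centres with 5 points scattered around each
#    (sd = 2 is an arbitrary choice), clipped to stay inside the square
centres <- data.frame(x = runif(4, 0, side), y = runif(4, 0, side))
id <- rep(1:4, each = 5)
clustered_pts <- data.frame(
  x = pmin(pmax(rnorm(n, centres$x[id], 2), 0), side),
  y = pmin(pmax(rnorm(n, centres$y[id], 2), 0), side)
)
```

For points on the integer lattice instead, the runif() calls could be swapped for sample(1:50, n, replace = TRUE), as suggested in 2).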
A bit late, but the answers above do not really address the problem. Here is what you are looking for:
library(sp)
# make a grid of size 50*50
x1<-seq(1:50)-0.5
x2<-x1
grid<-expand.grid(x1,x2)
names(grid)<-c("x1","x2")
# make a grid a spatial object
coordinates(grid) <- ~x1+x2
gridded(grid) <- TRUE
First: random sampling
# random sampling
random.pt <- spsample(x = grid, n= 20, type = 'random')
Second: regular sampling
# regular sampling
regular.pt <- spsample(x = grid, n= 20, type = 'regular')
Third: clustered at a distance of 2 from a random location (can go outside the area)
# random sampling of one location
ori <- data.frame(spsample(x = grid, n= 1, type = 'random'))
# draw 20 distances from normal distributions (sd = 1) whose means alternate between 1 and 2
n.point <- 20
h <- rnorm(n.point, 1:2)
# empty dataframe
dxy <- data.frame(matrix(nrow=n.point, ncol=2))
# take a random angle from the randomly selected location and make a dataframe of the new distances from the original sampling points, in a random direction
angle <- runif(n = n.point,min=0,max=2*pi)
dxy[,1]= h*sin(angle)
dxy[,2]= h*cos(angle)
cluster <- data.frame(x=rep(NA, 20), y=rep(NA, 20))
cluster$x <- ori$coords.x1 + dxy$X1
cluster$y <- ori$coords.x2 + dxy$X2
# make a spatial object and plot
coordinates(cluster)<- ~ x+y
plot(grid)
plot(cluster, add=T, col='green')
plot(random.pt, add=T, col= 'red')
plot(regular.pt, add=T, col= 'blue')
