R: calculate possible values of two variables

I'm trying to calculate all the possible grid sizes (x by y) that lead to the same number of cells; for example, a 2x2 grid has 4 cells. I want y to be half of x, and the total to be, for example, 4000. So I want R to calculate all the possible positive integer values of x and y where
function(total) {
  x*y = total
  x/y = 2
  x != total
  y != total
}
I suppose one way to get positive integers and to allow different solutions would be to let the total be up to 10% larger than its original value (but not smaller, since I need the grid to be at least as big as the total I give). In that case the function could have two arguments, tot (e.g. 4000) and tolerance (e.g. 10%), and total (as used in the sketch above) then has to lie between tot and tot + tolerance*tot.
I have several cell counts, so 4000 is only one example. I'm trying to build a quick function which returns positive integers only, as a matrix of Xs and Ys.
Any ideas?
Many thanks

What about this:
possible.sizes <- function(total, tolerance) {
  min.total <- total
  max.total <- total * (1 + tolerance)
  min.y <- ceiling(sqrt(min.total / 2))
  max.y <- floor(sqrt(max.total / 2))
  if (max.y < min.y)
    return(data.frame(x = numeric(0), y = numeric(0)))
  y <- seq(min.y, max.y)
  x <- 2 * y
  return(data.frame(x = x, y = y))
}
possible.sizes(4000, 0.1)
# x y
# 1 90 45
# 2 92 46
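Why the square roots: with x = 2*y the total is x*y = 2*y^2, so y must satisfy total <= 2*y^2 <= total * (1 + tolerance), which gives the ceiling/floor bounds above. A quick check of the returned sizes:
sizes <- possible.sizes(4000, 0.1)
sizes$x * sizes$y            # 4050 4232, both within [4000, 4400]
all(sizes$x == 2 * sizes$y)  # TRUE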


Find local minimum in a vector with r

Taking ideas from the following links:
the local minimum between the two peaks
How to explain ...
I am looking for the local minimum (or minima), avoiding the use of functions already built for this purpose (local or global max/min).
My progress so far:
#DATA
simulate <- function(lambda = 0.3, mu = c(0, 4), sd = c(1, 1), n.obs = 10^5) {
  x1 <- rnorm(n.obs, mu[1], sd[1])
  x2 <- rnorm(n.obs, mu[2], sd[2])
  return(ifelse(runif(n.obs) < lambda, x1, x2))
}
data <- simulate()
hist(data)
d <- density(data)
#
#https://stackoverflow.com/a/25276661/8409550
##Since the x-values are equally spaced, we can estimate dy using diff(d$y)
d$x[which.min(abs(diff(d$y)))]
#With our data we did not obtain the expected value
#
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
#we check
#1
optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
optimize(approxfun(d$x,d$y),interval=c(0,4),maximum = TRUE)$maximum
#2
tp <- pastecs::turnpoints(d$y)
summary(tp)
ind <- (1:length(d$y))[pastecs::extract(tp, no.tp = FALSE, peak = TRUE, pit = TRUE)]
d$x[ind[2]]  # pit
d$x[ind[1]]  # first peak
d$x[ind[3]]  # second peak
My questions and request for help:
Why does this line fail to find the local minimum:
d$x[which.min(abs(diff(d$y)))]
Is it possible to eliminate the need to add one to the index in these lines:
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
How can I get the optimize function to return the two expected maxima?
Question 1
The answer to the first question is straightforward. The line d$x[which.min(abs(diff(d$y)))] asks for the x value at which there was the smallest change in y between two consecutive points. The answer is that this happened at the extreme right of the plot, where the density curve is essentially flat:
which.min(abs(diff(d$y)))
#> [1] 511
length(abs(diff(d$y)))
#> [1] 511
This is not only smaller than the difference at your local maxima/minima points; it is orders of magnitude smaller. Let's zoom in on the peak value of d$y, including only the peak and the point on each side:
which.max(d$y)
#> [1] 324
plot(d$x[323:325], d$y[323:325])
We can see that the smallest difference between two consecutive points here is around 0.00005 (5e-5). Now look at the end of the plot, where it is flattest:
plot(d$x[510:512], d$y[510:512])
The difference there is about 1e-7, which is why this is the flattest point.
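As a rough check (the exact values depend on the simulated data, since no seed was set), you can print the differences at both places directly:
diff(d$y)[323:324]  # changes on either side of the peak at index 324
diff(d$y)[510:511]  # changes at the flat right-hand end; orders of magnitude smaller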
Question 2
The answer to your second question is "no, not really". You are taking a double diff of d$y, which is two elements shorter than d$y itself; if d$y is n elements long, the double diff corresponds to elements 2 to (n - 1) of d$y (and of d$x). You can remove the +1 from the index, but then you will have an off-by-one error. If you really wanted to, you could concatenate dummy zeros at each stage of the diff, like this:
d$x[which(c(0, diff(sign(diff(c(d$y, 0))))) > 0)]
which gives the same result, but this is longer, harder to read and harder to justify, so why would you?
Question 3
The answer to the third question is that you could use the "pit" as the dividing point between the minimum and maximum value of d$x to find the two "peaks". If you really want a single call to get both at once, you could do it inside an sapply:
pit <- optimize(approxfun(d$x, d$y), interval = c(0, 4))$minimum
peaks <- sapply(1:2, function(i) {
  optimize(approxfun(d$x, d$y),
           interval = c(min(d$x), pit, max(d$x))[i:(i + 1)],
           maximum = TRUE)$maximum
})
pit
#> [1] 1.691798
peaks
#> [1] -0.02249845 3.99552521
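As a visual check (positions depend on the simulated data, since no seed was set), the pit and peaks can be overlaid on the density:
plot(d, main = "Density with detected peaks and pit")
abline(v = peaks, col = "blue", lty = 2)  # the two peaks, near 0 and 4
abline(v = pit, col = "red", lty = 2)     # the local minimum between them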

R: Sample a matrix for cells close to a specified position

I'm trying to find sites to collect snails using a semi-random selection method. I have set a 10km2 grid around the region I want to collect snails from, which is broken into 10,000 10m2 cells. I want to randomly sample this grid in R to select 200 field sites.
Randomly sampling a matrix in R is easy enough:
dat <- matrix(1:10000, nrow = 100)
sample(dat, size = 200)
However, I want to bias the sampling to pick cells closer to a single position (representing sites closer to the research station). It's easier to explain this with an image:
The yellow cell with a cross represents the position I want to sample around. The grey shading is the probability of picking a cell in the sample function, with darker cells being more likely to be sampled.
I know I can specify sampling probabilities using the prob argument in sample, but I don't know how to create a 2D probability matrix. Any help would be appreciated, I don't want to do this by hand.
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- matrix(D, nrow = nx, byrow = TRUE)
image(XY) # axes are scaled to 0-1
# Create a scaling function that scales x to lie in [0, 1)
scale_prop <- function(x, m = 0)
  (x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities decrease with distance from the station. Distances are
# scaled to lie in [0, 1); m = 1 keeps the scaled value below 1 so that the
# farthest cell still gets a non-zero probability.
prob <- 1 - scale_prop(D, m = 1); range(prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are the 5 sampled cells (without a seed, yours will differ)
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
  sam <- sample(1:nrow(xy), size = nsamples, prob = prob)
  apply(xy[sam, ], 1, function(x) paste(x[1], x[2], sep = ","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although the sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, you can rescale the probabilities; I think that is fine to do to save travel costs, but these weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
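The figures for that larger case are not reproduced here, but a minimal sketch of the setup described above (assuming a 100 x 100 grid, the station at (60, 70), 200 samples with the cubed weights, and dist2 and scale_prop as defined earlier) would be:
nx <- 100; ny <- 100
xy <- cbind(x = rep(1:nx, each = ny), y = rep(1:ny, nx))
D <- dist2(xy, rbind(c(x = 60, y = 70)))
prob <- 1 - scale_prop(D, m = 1)
sam <- sample(1:nrow(xy), size = 200, prob = prob^3)   # cubed weights, as above
image(matrix(D, nrow = nx, byrow = TRUE))              # distance image, axes scaled to 0-1
points(scale_prop(xy[, 1])[sam], scale_prop(xy[, 2])[sam], pch = 19, cex = 0.5)  # sampled cells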
There may be a better way, but a quick approach is to randomly sample the x and y coordinates from a distribution (I used the normal, bell-shaped, distribution, but you can really use any). The trick is to centre the distribution on the position of the research station; you can change the bias towards the station by changing the standard deviation of the distribution.
Then use the randomly drawn values as the x and y indices of the cells to select.
dat <- matrix(1:10000, nrow = 100)
# randomly selected a position for the research station
rs <- c(80, 30)
# you can change the sd to change the bias
x <- round(rnorm(400, mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))
position <- rep(NA, 200)
j <- 1
i <- 1
# Some of the sampled numbers can fall outside the area of interest, so I
# oversampled and then kept only the first 200 that were inside it.
while (j <= 200) {
  if (x[i] >= 1 & x[i] <= 100 & y[i] >= 1 & y[i] <= 100) {
    position[j] <- dat[x[i], y[i]]
    j <- j + 1
  }
  i <- i + 1
}
plot the results:
plot(x, y, pch = 19)
points(x = 80, y = 30, col = "red", pch = 19)  # position of the station
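If you prefer to avoid the while loop, a vectorized sketch of the same idea (using the oversampled x and y above, and assuming at least 200 draws fall inside the grid) is:
inside <- which(x >= 1 & x <= 100 & y >= 1 & y <= 100)[1:200]  # first 200 in-range draws
position <- dat[cbind(x[inside], y[inside])]                   # index the matrix by (row, column) pairs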

Data frames using conditional probabilities to extract a certain range of values

I would like some help answering the following question:
Dr Barchan makes 600 independent recordings of Eric’s coordinates (X, Y, Z), selects the cases where X ∈ (0.45, 0.55), and draws a histogram of the Y values for these cases.
By construction, these values of Y follow the conditional distribution of Y given X ∈ (0.45,0.55). Use your function sample3d to mimic this process and draw the resulting histogram. How many samples of Y are displayed in this histogram?
We can argue that the conditional distribution of Y given X ∈ (0.45, 0.55) approximates the conditional distribution of Y given X = 0.5 — and this approximation is improved if we make the interval of X values smaller.
Repeat the above simulations selecting cases where X ∈ (0.5 − δ, 0.5 + δ), using a suitably chosen δ and a large enough sample size to give a reliable picture of the conditional distribution of Y given X = 0.5.
I know that for the first paragraph we want the x, y, z values generated by sample3d(600), restricted to x in the range (0.45, 0.55). Is there a way to code this (maybe with an if statement) that keeps the values of x in this range and discards the rest of the 600 generated? Also, does anyone have any hints for the conditional probability part in the third paragraph?
sample3d <- function(n)
{
  df <- data.frame()
  while (n > 0)
  {
    X <- runif(1, -1, 1)
    Y <- runif(1, -1, 1)
    Z <- runif(1, -1, 1)
    a <- X^2 + Y^2 + Z^2
    if (a < 1)
    {
      b <- sqrt(a)
      vector <- data.frame(X = X/b, Y = Y/b, Z = Z/b)
      df <- rbind(vector, df)
      n <- n - 1
    }
  }
  df
}
sample3d(600)
Any help would be appreciated, thank you.
Your function produces a data frame. The part of the question that asks you to find the values in a data frame that lie in a given range can be solved by filtering the data frame. Notice that you're looking for an open interval (the endpoints are not included).
df <- sample3d(600)
df[df$X > 0.45 & df$X < 0.55,]
Pay attention to the comma.
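Putting the first part together (a sketch only; the count changes from run to run because sample3d is random):
kept <- df[df$X > 0.45 & df$X < 0.55, ]
hist(kept$Y, main = "Y given X in (0.45, 0.55)")
nrow(kept)  # how many Y values the histogram displays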
You can use a dplyr solution as well, but don't use the helper between(), since it checks a closed interval (endpoints included) and you need an open one.
library(dplyr)
filter(df, X > 0.45 & X < 0.55)
For the remainder of your assignment, see what you can figure out, and if you run into a specific problem, Stack Overflow can help you.

(Simple) Generating Multiple (1000) random variables in R

I'm attempting to create 1000 samples of a certain variable Z, where I first generate 12 uniform RVs Ui and then take Z = ∑ (Ui-6) for i = 1 to 12. I can generate one Z from
u <- runif(12)
Z <- sum(u-6)
However, I am not sure how to repeat that 1000 times. In the end I want to plot a histogram of the Z values, which should ideally resemble the normal curve. Sorry, I am clearly as much of a beginner as you can get in this realm. Thank you!
If I understand the question, this is a pretty straightforward way to do it -- use replicate() to perform the calculation as many times as you want.
# number of values to draw per iteration
n_samples <- 12
# number of iterations
n_iters <- 1000
# get samples, subtract 6 from each element, sum them (1000x)
Zs <- replicate(n_iters, sum(runif(n_samples) - 6))
# print a histogram
hist(Zs)
Is this what you're after?
set.seed(2017);
n <- 1000;
u <- matrix(runif(12 * n), ncol = 12);
z <- apply(u, 1, function(x) sum(x - 6));
# Density plot
require(ggplot2);
ggplot(data.frame(z = z), aes(x = z)) + geom_density();
Explanation: draw 12 * 1000 uniform samples in one go, store them in a 1000 x 12 matrix, and then take the row sums of the entries minus 6.
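The apply() step can equivalently be written with rowSums(); this is just a vectorized rewrite, not part of the answer above:
z2 <- rowSums(u - 6);  # same as apply(u, 1, function(x) sum(x - 6))
all.equal(z, z2);      # TRUE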

Vectorize calculation of cell number in a hexagonal grid

I know the use of for-loops in R is often unnecessary, because R supports vectorization. I want to program as efficiently as possible, hence my question about the following example code.
I have a hexagonal grid, and I am calculating the number of each cell; in my example the numbers run from 1 to 225, starting in the lower-left corner and going to the right. Cell 16 sits slightly offset, directly above cell 1.
see snapshot:
Therefore, if I have the Y coordinate, the X coordinate has to be either rounded or taken as the ceiling. In my application the user clicks on cells; I save these and loop over them to determine which cells were chosen, as follows, with toy input values for Xcells and Ycells as the user might have chosen them:
gridsize <- 15
Xcells <- c(0.8066765, 1.8209879, 3.0526517, 0.5893240)
Ycells <- c(0.4577802, 0.4577802, 0.5302311, 1.5445425)
clicks <- length(Xcells)
cells <- vector('list', clicks)
This corresponds to cells 1, 2, 3 and 16: four clicks. Now to determine the cell numbers:
Y <- ceiling(Ycells)
X <- numeric(clicks)  # X must exist before assigning to X[i]
for (i in 1:clicks) {
  if (Y[i] %% 2 == 1) {
    X[i] <- round(Xcells[i])
  } else {
    X[i] <- ceiling(Xcells[i])
  }
  # determine the cell number and store it in the predefined list
  cells[[i]] <- (Y[i] - 1) * gridsize + X[i]
}
So if Y is odd the X has to be rounded, and if Y is even it has to be the ceiling value.
Is there a way to do this without the for loop, by using the vectorization?
You can vectorize this as follows
(Y - 1) * gridsize + ifelse(Y %% 2 == 1, round(Xcells), ceiling(Xcells))
# [1] 1 2 3 16
(I'm not sure whether pre-calculating round(Xcells) and ceiling(Xcells) would improve this much; you could try.)
Another option (if you want to avoid ifelse) could be
(Y - 1) * gridsize + cbind(ceiling(Xcells), round(Xcells))[cbind(1:length(Xcells), Y %% 2 + 1)]
# [1] 1 2 3 16
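A quick check that the vectorized expression reproduces the loop result (assuming the loop above has been run so that cells exists):
cells_vec <- (Y - 1) * gridsize + ifelse(Y %% 2 == 1, round(Xcells), ceiling(Xcells))
all(cells_vec == unlist(cells))  # TRUE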
