Random sampling based on a vector of probability weights in R

I have the vector d <- 1:100.
I want to sample k = 3 times from this vector without replacement. I would like elements that are at a distance of length(d)/k from the first sampled element to have a higher probability of being sampled; I am not yet sure how much higher. I know that sample has a prob= argument, but I can't seem to find a way to have the prob= vector recalculated from the location of the initial sample.
Any ideas?
Example:
d <- 1:100. Let's say the first trial samples d[30] = 30. Then the elements of d that are near 0, 60 and 90 should have a higher probability of being sampled. So after the initial sample, the sampling probabilities of the remaining elements of d should peak around those positions and fall off with distance from each peak.

I think:
samp <- sample(1:100, 1)
prob <- rep(1, 100)
prob[samp] <- 0  # the first draw gets zero weight in later draws
EDIT: now this will make the probability shape you asked for.
peke <- c(2, 5, 7, 10, 7, 5, 2)  # your 'triangle' probability
for (jj in 1:3) {
  center <- (samp + jj * 30 - 1) %% 100 + 1         # peaks 30 apart, wrapping past 100
  prob[(center + (-3:3) - 1) %% 100 + 1] <- peke    # the 7 positions centred on each peak
}
newsamp <- sample(1:100, 1, prob = prob)
You may want to add a slight offset if that doesn't place the probability peaks where you wanted them.
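In case it helps, here is a sketch of the full k = 3 draw that rebuilds the weight vector after each draw, keeping the triangle weights and 30-step spacing above (the spacing and the wrap-around are assumptions to be tuned):
d <- 1:100
k <- 3
peke <- c(2, 5, 7, 10, 7, 5, 2)
taken <- sample(d, 1)  # first draw, uniform
for (draw in 2:k) {
  prob <- rep(1, length(d))
  for (jj in 1:3) {    # boost positions ~30, 60, 90 past the first draw
    center <- (taken[1] + jj * 30 - 1) %% 100 + 1
    prob[(center + (-3:3) - 1) %% 100 + 1] <- peke
  }
  prob[taken] <- 0     # without replacement
  taken <- c(taken, sample(d, 1, prob = prob))
}
taken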

Related

Preferentially Sampling Based upon Value Size

So, this is something I think I'm complicating far too much, but it has some of my colleagues stumped as well.
I've got a set of areas represented by polygons, and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right-skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely proportional to their area. Rescaling the values to between zero and one (using the (x - min(x)) / (max(x) - min(x)) method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest area is almost always the one sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure how to do this while taking the area values into account. I don't think stratifying them is what I am looking for either, as that would introduce arbitrary bounds on the probability allocations.
Reproducible code below with the item of interest (the vector of probabilities) given by prob_vector. That is, how to generate prob_vector given the above scenario and desired outcomes?
# Data
n <- 500
df <- data.frame("ID" = 1:n, "AREA" = replicate(n, sum(rexp(n = 8, rate = 0.1))))
# Generate the sampling probabilities somehow from the AREA values,
# with smaller areas having higher sampling probability:
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size = 1, prob = prob_vector)
There is no single best solution for this question, as a wide range of probability vectors is possible. You can add any kind of curvature and slope.
In this small script, I simulated an extremely right-skewed distribution of areas (0-100 units), and you can define and directly visualize any probability vector you want.
area.dist = rgamma(1000, 1, 3) * 40
area.dist[area.dist > 100] = 100
hist(area.dist, main = "Probability functions")
area = seq(0, 100, 0.1)
prob_vector1 = 1 - (area - min(area)) / (max(area) - min(area))            ## linear
prob_vector2 = .8 - (.6 * (area - min(area)) / (max(area) - min(area)))    ## low slope
prob_vector3 = 1 / (1 + ((area - min(area)) / (max(area) - min(area))))^4  ## strong curve
prob_vector4 = .4 / (.4 + ((area - min(area)) / (max(area) - min(area))))  ## low curve
lines(area, prob_vector1 * 500, col = "red")    # curves scaled by 500 to overlay the histogram counts
lines(area, prob_vector2 * 500, col = "green")
lines(area, prob_vector3 * 500, col = "blue")
lines(area, prob_vector4 * 500, col = "orange")
legend("topright", c("linear", "low slope", "strong curve", "low curve"),
       col = c("red", "green", "blue", "orange"), lwd = 1)
The output is a plot of the four probability curves overlaid on the area histogram. The red line is your original solution; the other curves are adjustments that weaken it. Just change the numbers in the probability function until you get one that fits your expectations.
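To answer the question as posed, you could then build prob_vector directly from df$AREA. A minimal sketch using, e.g., the "strong curve" shape (the exponent 4 is just a tuning knob, as above):
scaled <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))
prob_vector <- 1 / (1 + scaled)^4  # smaller areas get higher weight
s <- sample(df$ID, size = 1, prob = prob_vector)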

Number from sample to be drawn from a Poisson distribution with upper/lower bounds

Working in R, I need to create a vector of length n with the values randomly drawn from a Poisson distribution with lambda=1, but with a lower bound of 2 and upper bound of 6 (i.e. all numbers will be either 2,3,4,5, or 6).
I am unsure how to do this. I tried creating a for loop that would replace any values outside that range with values inside the range:
set.seed(123)
n <- 25  # example length
example <- rpois(n, 1)
test <- example  # redundant - only duplicating to compare with original *example* values
for (i in seq_along(test)){
  if (test[i] < 2 || test[i] > 6){
    test[i] <- rpois(1, 1)
  }
}
But this didn't seem to work (I still get 0s and 1s, etc., in test). Any ideas would be greatly appreciated!
Here is one way to generate n numbers with a Poisson distribution and replace all the numbers outside the range with random numbers inside the range. (Note that in your loop a redrawn value can itself fall outside [2, 6], which is why out-of-range values survive.)
n<-25 #example length
example<-rpois(n,1)
inds <- example < 2 | example > 6
example[inds] <- sample(2:6, sum(inds), replace = TRUE)
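One caveat: the replaced values above are uniform on 2:6 rather than Poisson-shaped. If instead you want every value to follow the Poisson(1) distribution truncated to [2, 6], one option (a sketch, relying on the fact that sample renormalizes its weights) is to weight each allowed value by its Poisson probability:
support <- 2:6
example <- sample(support, n, replace = TRUE, prob = dpois(support, lambda = 1))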

How does distance weighting work in KNN?

I'm writing a KNN classifier in R. I want to add a weighting scheme, e.g. inverse distances 1/d. As it is, for the Iris dataset I get almost exactly 66% accuracy (no matter the metric used), since class no. 3 ("virginica") almost never shows up, and I want to make it better with weighting. My question is: what exactly do I weight, and how? I've read that I should weight the classes of the K nearest neighbours with those distances.
I've tried creating vectors of classes and distances to K nearest neighbours and then taking weighted mean from it:
inverted <- function(vals, distances)
{
  inv_distances <- 1 / distances
  # cap the inverse distances (note: a zero distance gives Inf, which this cap does not catch)
  inv_distances <- ifelse(inv_distances < 0.01, 0.01, inv_distances)
  weighted.mean(vals, inv_distances)
}
My results are weird: for correct vectors of vals (classes) and distances I sometimes get NaN (Not a Number) or NA values. Also, my weights don't sum to 1, and... they probably should? I'm not sure. I just need someone to clarify this weighting scheme for me.
EDIT:
I've debugged the above code: it applied the cap after inverting, so a distance of 0 was never eliminated and produced the NaNs. I've also changed it to harmonic-series weights that ignore the distances (the first neighbour has weight 1, the second 1/2, the third 1/3, etc.). I still don't know exactly how the weighting should work and what other weights may be used.
inverted <- function(vals)
{
  weights <- 1 / seq_along(vals)  # harmonic weights: 1, 1/2, 1/3, ...
  weighted.mean(vals, weights)
}
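For what it's worth, a common scheme in KNN classification is to sum the inverse-distance weights per class and predict the class with the largest total, rather than taking a weighted mean of the class labels (averaging label codes mixes unrelated classes). A minimal sketch, assuming classes holds the K nearest neighbours' labels and distances their distances (the function name and eps value are my own choices):
knn_vote <- function(classes, distances, eps = 0.01) {
  w <- 1 / pmax(distances, eps)      # clamp distances first so 1/0 cannot occur
  scores <- tapply(w, classes, sum)  # total weight per class
  names(which.max(scores))           # class with the largest weighted vote
}
knn_vote(c("setosa", "virginica", "virginica"), c(0.5, 0.2, 0.25))  # "virginica"
Note that the weights never need to sum to 1 here: rescaling all of them by a constant does not change which class wins.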

Random Walks and Gaussian (Normal) Distribution in R

I'm very new to coding in R (and to coding in general). I've created a distribution using a random walk within the following code:
set.seed(124)
norm <- rnorm(1000)
mean(norm)
mean(norm)^2
sd(norm)
d <- density(norm)
plot(d)
Now I want to create a function of n steps using the above distribution. The function calculates the expected values based on the probability of moving n steps to the left or right from the center. I have no idea where to begin.
Any direction would be greatly appreciated.
Thanks
If each normally distributed variate is your step size (positive moves right and negative moves left), then the cumulative sum of your random draws represents your current position. You can compute that with the cumsum function in R:
set.seed(144)
pos <- cumsum(rnorm(1000))
plot(seq_along(pos), pos, xlab="Step Number", ylab="Current Position")
Using replicate and logical operations, you can simulate any number of different questions about random walks. For instance, "with what probability does the value of the random walk exceed 100 within the first 1000 steps?" could be simulated with:
set.seed(144)
exceed.100 <- replicate(100000, any(cumsum(rnorm(1000)) >= 100))
mean(exceed.100)
# [1] 0.00173
From these 100k replicates, it looks like the probability is around 0.17% that the random walk will exceed 100 during the first 1000 steps.
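In the same spirit (an aside beyond the original question), theory says the position after n steps of this walk has mean 0 and standard deviation sqrt(n), which the same replicate pattern can confirm:
set.seed(144)
n <- 1000
final.pos <- replicate(10000, sum(rnorm(n)))
mean(final.pos)          # close to 0
sd(final.pos) / sqrt(n)  # close to 1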

Creating random binary matrices with different distributions

I was recently helped in getting a function to generate a random binary matrix, with the condition that the diagonal is all 0s.
fun <- function(n){
  vals <- sample(0:1, n * (n - 1) / 2, replace = TRUE)
  mat <- matrix(0, n, n)
  mat[upper.tri(mat)] <- vals
  mat[lower.tri(mat)] <- vals
  mat
}
Here I am entering values from the sample into the upper and lower triangles separately. I would like to keep this in any updated function, because sometimes I may wish to enter transpositions of each triangle into the other.
What I would like assistance with is how to change the frequency of 1s in the random matrix. This already varies around, I believe, a normal distribution: e.g. in a 9x9 matrix there are 81 - 9 = 72 cells to fill, and the average number of 1s is 36.
However, if I wanted to create matrices with a probability of e.g. p = 0.9 of there being a 1, or e.g. p = 0.2 of there being a 1, how is this done?
I tried some ways of changing the sample(0:1,) part of the code by adding in probability functions but I only got errors.
Thanks
You should look into the help page of the sample function.
?sample shows:
Usage
sample(x, size, replace = FALSE, prob = NULL)
where
prob
A vector of probability weights for obtaining the elements of
the vector being sampled.
and further below in Details you will see
The optional prob argument can be used to give a vector of weights for
obtaining the elements of the vector being sampled. They need not sum
to one, but they should be non-negative and not all zero.
So to answer your question: apart from reading the manual, use prob = c(0.1, 0.9) if you want a probability of 0.1 for the first element of x and 0.9 for the second.
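Applied to your function, a minimal sketch with a tunable probability p of drawing a 1 (p = 0.5 reproduces the original behaviour; the argument name p is my own choice):
fun <- function(n, p = 0.5){
  vals <- sample(0:1, n * (n - 1) / 2, replace = TRUE, prob = c(1 - p, p))
  mat <- matrix(0, n, n)
  mat[upper.tri(mat)] <- vals
  mat[lower.tri(mat)] <- vals
  mat
}
m <- fun(9, p = 0.9)
mean(m[upper.tri(m)])  # roughly 0.9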
