I'm very new to coding in R (and to coding in general). I've created a distribution for a random walk using the following code:
set.seed(124)
norm <- rnorm(1000)
mean(norm)
mean(norm)^2
sd(norm)
d <- density(norm)
plot(d)
Now I want to create a function of n steps using the above distribution. The function should calculate the expected value based on the probability of moving n steps to the left or right from the center. I have no idea where to begin.
Any direction would be greatly appreciated.
Thanks
If each normally distributed variate is your step size (positive moves right and negative moves left), then the cumulative sum of your random draws represents your current position. You can compute that with the cumsum function in R:
set.seed(144)
pos <- cumsum(rnorm(1000))
plot(seq_along(pos), pos, xlab="Step Number", ylab="Current Position")
Using replicate and logical operations, you can answer any number of different questions about random walks by simulation. For instance, "with what probability does the value of the random walk exceed 100 within the first 1000 steps?" could be estimated with:
set.seed(144)
exceed.100 <- replicate(100000, any(cumsum(rnorm(1000)) >= 100))
mean(exceed.100)
# [1] 0.00173
From these 100k replicates, it looks like the probability is around 0.17% that the random walk will exceed 100 during the first 1000 steps.
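If you want to wrap this up as the n-step function mentioned in the question, a minimal sketch could look like the following (the function name and the Monte Carlo approach are my own choices, not part of the answer above):

# Estimate the expected final position after n steps by simulation.
walk_expectation <- function(n, reps = 10000) {
  finals <- replicate(reps, sum(rnorm(n)))  # final position of each simulated walk
  mean(finals)                              # Monte Carlo estimate of E[position]
}
walk_expectation(1000)   # close to 0, since each step has mean 0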
I'm new to statistics and received the question below, which needs to be answered in R:
Simulate an i.i.d. process {X_t}, t = 1, ..., n, following the standard normal X_t ~ Normal(0, 1), with sample size n = 1000 and simulation time N = 500. Compute the sample means X̄(1), ..., X̄(N), where X̄(i) is the sample mean from the i-th simulation. Plot the histogram of X̄(1), ..., X̄(N).
My thought is:
since the sample size is n = 1000, I should start with
set.seed(1) # Setting a seed
X1 <- rnorm(1000) # Simulating X1
then compute the sample mean of X1, ..., XN:
result.mean <- mean(X1)
and plot the histogram of the sample means:
plot(result.mean, type = 'h')
However, I'm not sure what to do with the simulation time N = 500. The plot I generated is just a one-bar histogram, so I'm pretty sure the simulation time should be used somewhere.
What is the purpose of the simulation here, and is my thinking correct for the i.i.d. case? Thank you
To get random numbers from a normal distribution, the base R (stats) function is rnorm, which defaults to a mean of 0 and a standard deviation of 1. Each simulation draws 1000 of these numbers and takes the mean of that vector. We repeat that 500 times with replicate and throw the results into a histogram.
hist(replicate(500, mean(rnorm(1000)), simplify = "vector"))
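The same computation spelled out step by step, if that helps (just a sketch of the one-liner above):

set.seed(1)
N <- 500                              # number of simulations
n <- 1000                             # sample size per simulation
xbar <- replicate(N, mean(rnorm(n)))  # one sample mean per simulation
hist(xbar, main = "Histogram of sample means", xlab = "sample mean")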
So, this is something I think I'm complicating far too much, but it also has some of my colleagues stumped.
I've got a set of areas represented by polygons, and I've got a column in the dataframe holding their areas. The distribution of areas is heavily right-skewed. Essentially I want to randomly sample them based upon a distribution of sampling probabilities that is inversely related to their area. Rescaling the values to between zero and one (using the (x - min(x))/(max(x) - min(x)) method) and subtracting them from 1 would seem to be the intuitive approach, but this would simply mean that the smallest areas are almost always the ones sampled.
I'd like a flatter (but not uniform!) right-skewed distribution of sampling probabilities across the values, but I am unsure how to do this while taking the area values into account. I don't think stratifying them is what I'm looking for either, as that would introduce arbitrary bounds on the probability allocations.
Reproducible code is below, with the item of interest (the vector of probabilities) given by prob_vector. That is, how do I generate prob_vector given the above scenario and desired outcome?
# Data
n <- 500
df <- data.frame("ID" = 1:n, "AREA" = replicate(n, sum(rexp(n = 8, rate = 0.1))))
# Generate the sampling probability somehow based upon the AREA values, with smaller areas having higher sampling probability:
prob_vector <- ??????
# Sampling:
s <- sample(df$ID, size = 1, prob = prob_vector)
There is no single best solution to this question, as a wide range of probability vectors is possible; you can add any kind of curvature and slope.
In the small script below, I simulate an extremely right-skewed distribution of areas (0-100 units), and you can define and directly visualize any probability vector you want.
area.dist <- rgamma(1000, 1, 3) * 40
area.dist[area.dist > 100] <- 100
hist(area.dist, main = "Probability functions")
area <- seq(0, 100, 0.1)
prob_vector1 <- 1 - (area - min(area)) / (max(area) - min(area))           ## linear
prob_vector2 <- 0.8 - 0.6 * (area - min(area)) / (max(area) - min(area))   ## low slope
prob_vector3 <- 1 / (1 + (area - min(area)) / (max(area) - min(area)))^4   ## strong curve
prob_vector4 <- 0.4 / (0.4 + (area - min(area)) / (max(area) - min(area))) ## low curve
lines(area, prob_vector1 * 500, col = "red")
lines(area, prob_vector2 * 500, col = "green")
lines(area, prob_vector3 * 500, col = "blue")
lines(area, prob_vector4 * 500, col = "orange")
legend("topright", c("linear", "low slope", "strong curve", "low curve"),
       col = c("red", "green", "blue", "orange"), lwd = 1)
The output (plot not shown here) is the histogram of the simulated areas with the four probability curves overlaid. The red line is your proposed solution; the other curves are adjustments that make the weighting weaker. Just change the numbers in the probability functions until you get one that fits your expectations.
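For instance, to build prob_vector for the data frame in the original question, one of the shapes above could be applied directly to the AREA column. A sketch using the "low curve" shape (the constant 0.4 is just a tuning choice):

a01 <- (df$AREA - min(df$AREA)) / (max(df$AREA) - min(df$AREA))  # rescale areas to [0, 1]
prob_vector <- 0.4 / (0.4 + a01)                                 # smaller areas get higher weight
s <- sample(df$ID, size = 1, prob = prob_vector)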
I can generate numbers with a uniform distribution using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (Aka an "upside-down bell curve".)
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the Beta distribution and map it onto the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As #Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, [0...1], so it can easily be mapped onto any bounded interval just by scaling and shifting. It has two parameters and is symmetric when they are equal. Parameters above 1 give a bell-like shape, but parameters below 1 turn it into an inverse bell, which is what you're looking for. You can play with them, putting in different values instead of 0.5, to see how it goes. Parameters equal to 1 make it uniform.
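For example, a quick way to see the effect of the parameters (a small sketch):

# Shape of the Beta distribution for different (equal) parameter values.
par(mfrow = c(1, 3))
hist(rbeta(10000, 0.5, 0.5), main = "0.5, 0.5: inverse bell")
hist(rbeta(10000, 1, 1),     main = "1, 1: uniform")
hist(rbeta(10000, 5, 5),     main = "5, 5: bell")
par(mfrow = c(1, 1))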
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
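A minimal sketch of that idea in R, assuming the [10, 20] interval from the question and mixing the minimum and maximum with equal probability:

n <- 5   # number of uniforms per draw; larger n pushes more mass to the ends
draws <- replicate(10000, {
  u <- runif(n, min = 10, max = 20)
  if (runif(1) < 0.5) max(u) else min(u)   # keep the max or the min at random
})
hist(draws)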
A web search for "order statistics" will turn up some resources.
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), count the number of times each rounded value appears (n), and then subtract each n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse counts to build a new vector that you can sample from, i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%
  mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01()*(x1-x0) + x0
    # Choose a random y-coordinate
    y = RNDU01()*ymax
    # Return x if y falls within the PDF
    if y < 1 - exp(-(x*x)): return x
end
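For reference, here is a direct R translation of the pseudocode (a sketch; the function name and the example bounds x0 = -3, x1 = 3 are my own choices):

# Rejection sampler for the truncated upside-down bell on [x0, x1].
rupsidedown <- function(nsim, x0 = -3, x1 = 3) {
  pdf <- function(x) 1 - exp(-x^2)            # unnormalised density
  ymax <- max(pdf(x0), pdf(x1))               # the density peaks at the interval ends
  out <- numeric(nsim)
  for (i in seq_len(nsim)) {
    repeat {
      x <- runif(1, x0, x1)                   # random x-coordinate
      y <- runif(1, 0, ymax)                  # random y-coordinate
      if (y < pdf(x)) { out[i] <- x; break }  # accept x if y falls under the PDF
    }
  }
  out
}
hist(rupsidedown(10000))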
I have a dataset with values from 100 to 200, but there are a few spikes in the data.
I don't want to smooth the whole dataset with rollmean or rollapply.
I want to handle it this way:
find the spikes with a condition (value > 300)
replace these too-large values with the mean/median calculated from the 10 nearest neighbouring values.
Example in pseudo-code:
# if data[n] is a spike:
data[n] <- mean(data[(n-5):(n+5)])
It's like applying a window function not to the whole dataset, but only at certain points in the data.
Thank you in advance
I like this question. This is a typical moving-average / k-nearest-neighbour estimation, i.e. a nonparametric approach. The following should work.
foo <- function(x, thresh = 300, h = 5, window.fun = mean) {
  spikes.loc <- which(x > thresh)  # indices of the spikes
  low.bound <- spikes.loc - h      # left edge of each window
  up.bound <- spikes.loc + h       # right edge of each window
  N <- length(spikes.loc)
  x.hat <- x
  for (i in seq_len(N)) x.hat[spikes.loc[i]] <- window.fun(x[low.bound[i]:up.bound[i]])
  return(x.hat)
}
This function takes your original observation vector x, a threshold, a window size h (a smoothing parameter), as well as a user-specified window function. The returned value is the vector of smoothed data; it only differs from the original data at the spike points. A common choice of window function is a density (kernel) function, in which case you end up with a weighted average of all neighbouring data.
Please note, I am assuming your data are evenly spaced, so a simple index window x[(i-h):(i+h)] gives a reasonable neighbourhood. In a more general setting, the window is based on Euclidean distance, but that naively costs O(N^2), where N is the number of observations, which is expensive.
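A quick usage sketch on simulated data (the values and spike positions below are made up):

set.seed(1)
dat <- rnorm(200, mean = 150, sd = 10)  # values roughly between 100 and 200
dat[c(50, 120)] <- c(400, 500)          # inject two spikes
smoothed <- foo(dat, thresh = 300, h = 5)
plot(dat, type = "l"); lines(smoothed, col = "red")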
In R, there are built-in nonparametric estimation / smoothing tools. The most basic is kernel smoothing, a generalization of the moving average; fast implementations compute it at O(N log N) cost via FFT-based binning. Please see ?ksmooth. More advanced tools are in the KernSmooth and sm packages.
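A minimal illustration of ksmooth on simulated data (the bandwidth here is an arbitrary choice):

y <- sin(seq(0, 10, length.out = 200)) + rnorm(200, sd = 0.2)  # noisy signal
sm <- ksmooth(seq_along(y), y, kernel = "normal", bandwidth = 15)
plot(y); lines(sm, col = "red")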
I have the vector d <- 1:100.
I want to sample k = 3 times from this vector without replacement. I would like elements that are at a distance of length(d)/k from the first sampled element to have a higher probability of being sampled. I am not yet sure how much higher. I know that sample has a prob= argument; however, I can't seem to find a way for the prob= vector to be recalculated based on the location of the initial sample.
Any ideas?
Example:
d <- 1:100. Let's say the first trial samples d[30] = 30. Then the elements of d that are near 0, 60 and 90 should have a higher probability of being sampled. So after the initial sample, the distribution of the sampling probabilities over the rest of the elements of d is as in the image:
I think:
samp <- sample(1:100,1)
prob <- rep(1,100)
prob[samp]=0
MORE EDIT: I'm an idiot today. Now this will make the probability shape you asked for.
peke <- c(2, 5, 7, 10, 7, 5, 2)  # your 'triangle' probability
step <- round(length(prob) / 3)  # distance length(d)/k between the peaks
for (jj in 1:3) {
  idx <- samp + jj * step + (-3:3)        # 7 positions centred on each peak
  keep <- idx >= 1 & idx <= length(prob)  # drop positions that fall outside the vector
  prob[idx[keep]] <- peke[keep]
}
newsamp <- sample(1:100, 1, prob = prob)
You may want to add a slight offset if that doesn't place the probability peaks where you wanted them.
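If you need all k = 3 draws without replacement, a minimal continuation (a sketch; it simply zeroes out the positions already drawn):

prob[newsamp] <- 0                          # exclude the element just drawn
thirdsamp <- sample(1:100, 1, prob = prob)
c(samp, newsamp, thirdsamp)                 # the three sampled elements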