I'm attempting to create 1000 samples of a variable Z, where I first generate 12 uniform RVs Ui and then set Z = ∑ (Ui - 6) for i = 1 to 12. I can generate one Z from
u <- runif(12)
Z <- sum(u-6)
However, I am not sure how to go about repeating that 1000 times. In the end, the goal is to plot a histogram of the Z's and ideally have it resemble the normal curve. Sorry, clearly I am as much of a beginner as you can get in this realm. Thank you!
If I understand the question, this is a pretty straightforward way to do it: use replicate() to perform the calculation as many times as you want.
# number of values to draw per iteration
n_samples <- 12
# number of iterations
n_iters <- 1000
# get samples, subtract 6 from each element, sum them (1000x)
Zs <- replicate(n_iters, sum(runif(n_samples) - 6))
# print a histogram
hist(Zs)
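If you want to check the resemblance to the normal curve, you can overlay the theoretical density. Each Ui has mean 1/2 and variance 1/12, so Z = ∑ (Ui - 6) has mean 12 * (1/2 - 6) = -66 and sd 1 (the classic "sum of 12 uniforms" generator subtracts 6 from the total instead, which centers Z at 0):
# histogram on the density scale, with the theoretical normal curve overlaid
hist(Zs, freq = FALSE)
curve(dnorm(x, mean = -66, sd = 1), add = TRUE, col = "red")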
Is this what you're after?
set.seed(2017)
n <- 1000
u <- matrix(runif(12 * n), ncol = 12)
z <- apply(u, 1, function(x) sum(x - 6))
# Density plot
require(ggplot2)
ggplot(data.frame(z = z), aes(x = z)) + geom_density()
Explanation: Draw 12 * 1000 uniform samples in one go, store them in a 1000 x 12 matrix, and then take the row-wise sums of x - 6.
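As an aside, the apply() call can be replaced by the vectorized rowSums(), which computes the same thing faster (a minimal sketch using the objects defined above):
z2 <- rowSums(u - 6)
all.equal(z, z2)  # TRUE: identical to the apply() version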
I'm trying to find sites to collect snails by using a semi-random selection method. I have set a 10km2 grid around the region I want to collect snails from, which is broken into 10,000 10m2 cells. I want to randomly sample this grid in R to select 200 field sites.
Randomly sampling a matrix in R is easy enough:
dat <- matrix(1:10000, nrow = 100)
sample(dat, size = 200)
However, I want to bias the sampling to pick cells closer to a single position (representing sites closer to the research station). It's easier to explain this with an image:
The yellow cell with a cross represents the position I want to sample around. The grey shading is the probability of picking a cell in the sample function, with darker cells being more likely to be sampled.
I know I can specify sampling probabilities using the prob argument in sample, but I don't know how to create a 2D probability matrix. Any help would be appreciated, I don't want to do this by hand.
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station?
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2:
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
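If you'd rather not install SpatialTools, the same distances can be computed in base R (a minimal sketch; with a single station this is just the Euclidean distance formula):
# Euclidean distance from every grid cell to the station, base R only
D_base <- sqrt((xy[, 1] - Station[1, 1])^2 + (xy[, 2] - Station[1, 2])^2)
all.equal(as.vector(D), D_base)  # should be TRUE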
We can visualize this using the image function.
XY <- matrix(D, nrow = nx, byrow = TRUE)
image(XY) # axes are scaled to 0-1
# Create a scaling function - maps x into [0, 1] (or [0, 1) when m > 0)
scale_prop <- function(x, m=0)
  (x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities should decrease with distance from the station.
# Scale the distances to [0, 1) (m = 1 keeps the maximum distance below 1), then subtract from 1.
prob <- 1 - scale_prop(D, m=1); range(prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are the 5 sampled cells
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
  sam <- sample(1:nrow(xy), size = nsamples, prob = prob)
  apply(xy[sam, ], 1, function(x) paste(x[1], x[2], sep = ","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, you can rescale the probabilities. I think that is fine to do to save travelling costs, but the weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
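If you want the strength of the bias to be tunable, you can wrap the exponent in a small helper (biased_sample() is just an illustrative name here, not a package function):
# k = 1 reproduces the original weights; larger k concentrates
# the sample more tightly around the station
biased_sample <- function(xy, prob, size, k = 1) {
  sam <- sample(1:nrow(xy), size = size, prob = prob^k)
  xy[sam, ]
}
biased_sample(xy, prob, size = 5, k = 3)
Note that sample() normalizes prob internally, so prob^k does not need to sum to 1.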
There may be a better way than this, but a quick way to do it is to sample the x and y coordinates independently from a distribution centered on the research station (I used the normal, i.e. bell-shaped, distribution, but you could really use any). The trick is to make the mean of the distribution the position of the research station; you can change the bias towards the station by changing the standard deviation of the distribution.
Then use the randomly selected positions as your x and y coordinates to select the positions.
dat <- matrix(1:10000, nrow = 100)
#randomly selected a position for the research station
rs <- c(80,30)
# you can change the sd to change the bias
x <- round(rnorm(400,mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))
position <- rep(NA, 200)
j <- 1
i <- 1
# Some draws can fall outside the area of interest, so I oversampled
# and then kept only the first 200 that land inside it.
while (j <= 200) {
  if (x[i] >= 1 & x[i] <= 100 & y[i] >= 1 & y[i] <= 100) {
    position[j] <- dat[x[i], y[i]]
    j <- j + 1
  }
  i <- i + 1
}
Plot the results:
plot(x, y, pch = 19)
points(x = 80, y = 30, col = "red", pch = 19) # position of the station
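One caveat with this approach: because the draws are independent, the same cell can be selected more than once. If you need 200 distinct sites, check for repeats (a minimal sketch using the objects above):
# how many of the 200 selected cells are repeats?
sum(duplicated(position))
# keep only the distinct cells (you may then need to top up the sample)
position <- unique(position)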
I would like some help answering the following question:
Dr Barchan makes 600 independent recordings of Eric’s coordinates (X, Y, Z), selects the cases where X ∈ (0.45, 0.55), and draws a histogram of the Y values for these cases.
By construction, these values of Y follow the conditional distribution of Y given X ∈ (0.45,0.55). Use your function sample3d to mimic this process and draw the resulting histogram. How many samples of Y are displayed in this histogram?
We can argue that the conditional distribution of Y given X ∈ (0.45, 0.55) approximates the conditional distribution of Y given X = 0.5 — and this approximation is improved if we make the interval of X values smaller.
Repeat the above simulations selecting cases where X ∈ (0.5 − δ, 0.5 + δ), using a suitably chosen δ and a large enough sample size to give a reliable picture of the conditional distribution of Y given X = 0.5.
I know that for the first paragraph we want to take the values of X, Y, Z generated by sample3d(600) and then restrict the X's to the range 0.45-0.55. Is there a way to code this (maybe with an if statement) that keeps the X values in this range but discards all the other X's from the 600 generated? Also, does anyone have any hints for the conditional-probability part in the third paragraph?
sample3d <- function(n) {
  df <- data.frame()
  while (n > 0) {
    # draw a candidate point uniformly in the cube [-1, 1]^3
    X <- runif(1, -1, 1)
    Y <- runif(1, -1, 1)
    Z <- runif(1, -1, 1)
    a <- X^2 + Y^2 + Z^2
    # accept only points inside the unit ball, then project onto the sphere
    if (a < 1) {
      b <- sqrt(a)
      df <- rbind(data.frame(X = X/b, Y = Y/b, Z = Z/b), df)
      n <- n - 1
    }
  }
  df
}
sample3d(600)
Any help would be appreciated, thank you.
Your function produces a data frame. The part of the question that asks you to find the values in a data frame that lie in a given range can be solved by filtering the data frame. Notice that you're looking for an open interval (the endpoint values aren't included).
df <- sample3d(600)
df[df$X > 0.45 & df$X < 0.55,]
Pay attention to the comma.
You can use a dplyr solution as well, but don't use the helper between(), since it checks a closed interval (endpoints included) and you need an open one.
library(dplyr)
filter(df, X > 0.45 & X < 0.55)
For the remainder of your assignment, see what you can figure out and if you run into a specific problem, stack overflow can help you.
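If you get stuck on the δ part, here is a minimal sketch of the idea (the choices delta = 0.05 and n = 10000 are arbitrary; a smaller delta needs a larger n for a stable histogram):
df2 <- sample3d(10000)
delta <- 0.05
y_cond <- df2$Y[df2$X > 0.5 - delta & df2$X < 0.5 + delta]
hist(y_cond)   # approximates the conditional distribution of Y given X = 0.5
length(y_cond) # number of Y values displayed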
I am a non-expert in Fourier analysis and don't quite get what R's fft() function does. Even after a lot of cross-reading I couldn't figure it out.
I built an example.
require(ggplot2)
freq <- 200 #sample frequency in Hz
duration <- 3 # length of signal in seconds
#arbitrary sine wave
x <- seq(-4*pi,4*pi, length.out = freq*duration)
y <- sin(0.25*x) + sin(0.5*x) + sin(x)
which looks like:
fourier <- fft(y)
#frequency "amounts" and associated frequencies
amo <- Mod(fft(y))
freqvec <- 1:length(amo)
I ASSUME that fft expects a vector recorded over a timespan of 1 second, so I divide by the timespan
freqvec <- freqvec/duration
#and put this into a data.frame
df <- data.frame(freq = freqvec, amount = amo)
Now I PRESUMABLY can/have to omit the second half of the data.frame, since the frequency "amounts" are only meaningful up to half of the sampling rate, due to Nyquist.
df <- df[(1:as.integer(0.5*freq*duration)),]
For plotting I discretize a bit:
df.disc <- data.frame(freq = 1:100)
cum.amo <- numeric(100)
for (i in 1:100) {
  cum.amo[i] <- sum(df$amount[c(3*i - 2, 3*i - 1, 3*i)])
}
df.disc$amount <- cum.amo
The plot for the first 20 frequencies:
df.disc$freq <- as.factor(df.disc$freq)
ggplot(df.disc[1:20,], aes(x = freq, y = amount)) + geom_bar(stat = "identity")
The result:
Is this really a correct spectrogram of the above function? Are my two assumptions correct? Where is my mistake? If there is no, what does this plot now tell me?
EDIT:
Here is a picture without discretization:
THANKS to all of you,
Micha.
Okay, okay. Given the rather basic nature of my mistake, the solution is quite trivial.
I wrote freq = 200 and duration = 3. But the real duration runs from -4*pi to 4*pi, hence 8*pi, resulting in a "real" sample frequency of 1/((8*pi)/600) = 23.87324, which does not equal 200.
Replacing the respective lines in the example code by
freq <- 200 #sample frequency in Hz
duration <- 6 # length of signal in seconds
x <- seq(0,duration, length.out = freq*duration)
y <- sin(4*pi*x) + sin(6*pi*x) + sin(8*pi*x)
(with a more illustrative function) yields the correct frequencies as demonstrated by the following plot (restricted to the important part of the frequency domain):
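For completeness, a minimal sketch of the corrected amplitude spectrum using the variables above (note that bin k of fft(y) corresponds to frequency (k - 1)/duration Hz, so the DC component sits in bin 1):
amo <- Mod(fft(y))
freqvec <- (seq_along(amo) - 1) / duration  # bin index -> Hz
plot(freqvec[1:40], amo[1:40], type = "h",
     xlab = "Frequency [Hz]", ylab = "Amplitude")
# peaks appear at 2, 3 and 4 Hz, matching sin(4*pi*x), sin(6*pi*x) and sin(8*pi*x)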
I would like to build a hexbin plot where, for every bin, the ratio between the class 1 and class 2 points falling into that bin is plotted (either on a log scale or not).
library(hexbin)
x <- rnorm(10000)
y <- rnorm(10000)
h <- hexbin(x, y)
plot(h)
l <- as.factor(c( rep(1,2000), rep(2,8000) ))
Any suggestions on how to implement this? Is there a way to apply a function to every bin, based on the bin statistics?
@cryo111's answer has the most important ingredient - IDs = TRUE. After that it's just a matter of figuring out what you want to do with Inf's and how much you need to scale the ratios by to get integers that will produce a pretty plot.
library(hexbin)
library(data.table)
set.seed(1)
x = rnorm(10000)
y = rnorm(10000)
h = hexbin(x, y, IDs = TRUE)
# put all the relevant data in a data.table
dt = data.table(x, y, l = rep(1:2, c(2000, 8000)), cID = h@cID)
# group by cID and calculate whatever statistic you like
# in this case, ratio of 1's to 2's,
# and then Inf's are set to be equal to the largest ratio
dt[, list(ratio = sum(l == 1)/sum(l == 2)), keyby = cID][,
   ratio := ifelse(ratio == Inf, max(ratio[is.finite(ratio)]), ratio)][,
   # scale up (I chose the scaling manually to get a prettier graph),
   # convert to integer, and overwrite the counts in h
   as.integer(ratio*10)] -> h@count
plot(h)
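If the classes are very imbalanced, plotting the log-ratio is often easier to read. The same trick works (a sketch; the add-one smoothing and the scaling by 100 are arbitrary choices to keep h@count a vector of positive integers):
lr <- dt[, list(lr = log((sum(l == 1) + 1) / (sum(l == 2) + 1))), keyby = cID]
h@count <- as.integer((lr$lr - min(lr$lr)) * 100) + 1L
plot(h)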
You can determine the number of class 1 and class 2 points in each bin as follows:
library(hexbin)
library(plyr)
x <- rnorm(10000)
y <- rnorm(10000)
# generate a hexbin object with IDs = TRUE;
# the object then includes a slot with a vector cID that
# maps point (x[i], y[i]) to cell number cID[i]
HexObj <- hexbin(x, y, IDs = TRUE)
# count statistics for the first 2000 points (class 1) and the rest (class 2)
CountDF <- merge(count(HexObj@cID[1:2000]),
                 count(HexObj@cID[2001:length(x)]),
                 by = "x",
                 all = TRUE)
# replace NAs by 0
CountDF[is.na(CountDF)] <- 0
#check if all points are included
sum(CountDF$freq.x)+sum(CountDF$freq.y)
But plotting them is another story. For instance, what if there are no class 2 points in a bin? The fraction is not defined then.
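A simple workaround for the undefined fractions is additive smoothing (a judgment call on my part, not something hexbin provides):
# add-one smoothing keeps the ratio finite when a class is absent from a bin
CountDF$ratio <- (CountDF$freq.x + 1) / (CountDF$freq.y + 1)
head(CountDF)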
In addition, as far as I understand, hexbin is just a two-dimensional histogram. As such, it counts the number of points that fall into a given bin, and I do not think it can handle non-integer "counts" such as your ratios.
I would like to generate a random sequence of 3000 points that follows the normal distribution, with mean c and standard deviation d, but I would like these 3000 points to lie in the range [a, b].
Can you tell me how to do it in R?
If I would like to plot this sequence with the Y-axis showing the generated 3000 points, how should I generate the points corresponding to the X-axis?
You can do this using standard R functions like this:
c <- 1
d <- 2
a <- -2
b <- 3.5
ll <- pnorm(a, c, d)
ul <- pnorm(b, c, d)
x <- qnorm( runif(3000, ll, ul), c, d )
hist(x)
range(x)
mean(x)
sd(x)
plot(x, type='l')
The pnorm function is used to find the limits for the uniform distribution; data is then generated from a uniform and transformed back to the normal via qnorm.
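Wrapped up as a reusable helper (rtnorm is just an illustrative name here, not a function from any package):
rtnorm <- function(n, mean, sd, lower, upper) {
  # inverse-CDF method: draw uniforms on [F(lower), F(upper)],
  # then map them back through the normal quantile function
  qnorm(runif(n, pnorm(lower, mean, sd), pnorm(upper, mean, sd)), mean, sd)
}
x <- rtnorm(3000, mean = 1, sd = 2, lower = -2, upper = 3.5)
range(x)  # all values lie within [-2, 3.5]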
This is even simpler using the distr package:
library(distr)
N <- Norm(c,d)
N2 <- Truncate(N, lower=a, upper=b)
plot(N2)
x <- r(N2)(3000)
hist(x)
range(x)
mean(x)
sd(x)
plot(x, type='l')
Note that in both cases the mean is not c and the sd is not d. If you want the mean and sd of the resulting truncated data to be c and d, then the parent distribution (before truncating) needs different values: a higher sd, and a mean that depends on the truncation limits. Finding those values would be a good homework problem for a math/stat theory course. If that is what you really need, then add a comment or edit the question to say so specifically.
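To see exactly how far off they are, the first two moments of a truncated normal have closed forms (standard formulas; trunc_moments is just an illustrative helper):
# mean and sd of a Normal(mu, sigma) truncated to [a, b]
trunc_moments <- function(mu, sigma, a, b) {
  al <- (a - mu) / sigma
  be <- (b - mu) / sigma
  Z <- pnorm(be) - pnorm(al)
  m <- mu + sigma * (dnorm(al) - dnorm(be)) / Z
  v <- sigma^2 * (1 + (al * dnorm(al) - be * dnorm(be)) / Z -
                    ((dnorm(al) - dnorm(be)) / Z)^2)
  c(mean = m, sd = sqrt(v))
}
trunc_moments(1, 2, -2, 3.5)  # roughly mean 0.87, sd 1.39
These values should match mean(x) and sd(x) above up to sampling noise.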
If you want to generate the data from the untruncated normal, but only plot the data within the range [a,b] then just use the ylim argument to plot:
plot( rnorm(3000, c, d), ylim=c(a,b) )
Generating a random sequence of numbers from any probability distribution is very easy in R. To do this for the normal distribution specifically:
c = 1
d = 2
x <- rnorm(3000, c, d)
Clipping the values in x so that they're only within a given range is kind of a strange thing to want to do with a sample from the normal distribution. Maybe what you really want to do is sample a uniform distribution.
a = 0
b = 3
x2 <- runif(3000, a, b)
As for how to plot the distribution, I'm not sure I follow your question. You can plot a density estimate for the sample with this code:
plot(density(x))
But, if you want to plot this data as a scatter plot of some sort, you actually need to generate a second sample of numbers.
If I would like to plot this sequence with the Y-axis showing the generated 3000 points, how should I generate the points corresponding to the X-axis?
If you just generate your points, as JoFrhwld said, with
y <- rnorm(3000, 1, 2)
Then
plot(y)
will automatically plot them, using the vector indices as the x-axis.
a = -2; b = 3
plot(dnorm, xlim = c(a, b))