I have this histogram:
I am wondering what it means to find the empirical relative frequency of, say, salaries < 85000. I understand that the empirical frequency is the number of occurrences in the data set.
Is this correct:
x = our_sample[our_sample$Salaries < 85000, ] # x has 292 rows
result <- 292 / length(our_sample) # our_sample has 500 samples
result
I am not sure which functions to use to find this.
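One point worth checking in the snippet above: if our_sample is a data frame (which the `$Salaries` indexing suggests), `length(our_sample)` returns the number of columns, not the number of rows, so `nrow()` is the safer denominator. A minimal sketch, assuming a hypothetical stand-in data frame because the real our_sample is not shown:

```r
# Hypothetical stand-in for our_sample: 500 rows with a Salaries column.
# The real data are not shown, so these values are an assumption.
set.seed(1)
our_sample <- data.frame(Salaries = round(rnorm(500, mean = 82000, sd = 15000)))

# Logical vector: TRUE where the salary is below the threshold
below <- our_sample$Salaries < 85000

# Empirical relative frequency = share of observations meeting the condition
rel_freq  <- mean(below)                       # mean of TRUE/FALSE is a proportion
rel_freq2 <- sum(below) / nrow(our_sample)     # same value, written as count / n

rel_freq
```

Taking the mean of a logical vector avoids hard-coding the count (292) and the sample size (500).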
I know the basic loop format, but I'm unsure how to incorporate 'population' into the loop to find the probability of collecting a sample with a mean of 42 or larger.
Use a loop to find out the probability of collecting a sample (n=10) with a mean of 42 (or larger) from the dataset produced by the following code:
set.seed(1)
population<-rnorm(n=500,mean=35,sd=10)
One approach to this problem is to repeatedly sample from population and compute the frequency that the mean of these samples is greater than or equal to 42.
set.seed(1)
population <- rnorm(n=500, mean=35, sd=10)
nsim <- 100000 # the number of times we will draw a sample
vec_mean <- numeric(nsim) # a vector to hold the sample means
for (i in 1:nsim) {
samp <- sample(population, size = 10, replace = TRUE)
vec_mean[i] <- mean(samp)
}
sum(vec_mean >= 42) / nsim
# [1] 0.01727
This can be interpreted as the (frequentist) probability of collecting a sample of size 10 from this population with a mean of 42 or larger.
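The same resampling idea can also be written without an explicit loop. This is only a sketch of an alternative, using `replicate()` and a smaller, arbitrarily chosen number of simulations; the standard-error line is an addition to show how precise the estimate is:

```r
set.seed(1)
population <- rnorm(n = 500, mean = 35, sd = 10)

nsim <- 10000  # fewer simulations than above, chosen here only for speed

# Draw nsim samples of size 10 (with replacement) and keep each sample mean
sample_means <- replicate(nsim, mean(sample(population, size = 10, replace = TRUE)))

# Estimated probability that a sample mean is 42 or larger
p_hat <- mean(sample_means >= 42)

# Approximate Monte Carlo standard error of that estimate
se_hat <- sqrt(p_hat * (1 - p_hat) / nsim)

c(estimate = p_hat, std_error = se_hat)
```

Because the random draws are consumed in a different order, the estimate will not match the loop's output digit for digit, but it should agree within a few standard errors.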
I'm generating small samples (e.g. 24 obs) of a normally distributed variable in R. It seems that the resulting variable has a systematically negative autocorrelation.
The code below generates 1000 samples of 24 observations of x and calculates the first three partial autocorrelations. These are not huge on average (-0.075 to -0.045), but the averages are always negative. Increasing the sample size (N) moves the autocorrelation towards zero. However, my question is: why are the random numbers in a small sample negatively autocorrelated?
K <- 1000
N <- 24
ac <- NULL
for (k in 1:K) {
x <- rnorm(n=N)
ac <- rbind(ac, pacf(x, plot=F)$acf[1:3,1,1])
}
apply(ac, 2, mean)
[1] -0.04925651 -0.07523400 -0.04542514
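A likely explanation is that the numbers themselves are fine and the estimator is biased: the sample autocorrelation centers the series on its own sample mean, which for independent data gives it an expectation of roughly -1/N rather than 0. The sketch below is an illustration under that assumption, not a proof. It uses `acf()` at lag 1 (which coincides with the partial autocorrelation at lag 1) and compares two sample sizes; the helper name `lag1` and the value of K are choices made here.

```r
set.seed(1)
K <- 2000          # number of simulated series per sample size
N_small <- 24
N_large <- 240

# Lag-1 sample autocorrelation of one white-noise series of length n
lag1 <- function(n) acf(rnorm(n), lag.max = 1, plot = FALSE)$acf[2]

mean_small <- mean(replicate(K, lag1(N_small)))
mean_large <- mean(replicate(K, lag1(N_large)))

# Compare the simulated averages with the rough -1/N reference value
rbind(simulated = c(N24 = mean_small, N240 = mean_large),
      reference = c(N24 = -1 / N_small, N240 = -1 / N_large))
```

If the bias explanation holds, both averages should be negative and shrink towards zero as N grows.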
I have 16068 datapoints with values that range between 150 and 54850 (mean = 3034.22). What would the R code be to generate a set of random numbers that grow in frequency exponentially between 54850 and 150?
I've tried using the rexp() function in R, but can't figure out how to set the range to between 150 and 54850. In my actual data population, the lambda value is 25.
set.seed(123)
myrange <- c(54850, 150)
rexp(16068, 1/25, myrange)
The call produces an error.
Error in rexp(16068, 1/25, myrange) : unused argument (myrange)
The hypothesized population should increase exponentially the closer the data values are to 150. I have 25 data points with a value of 150 and only one with a value of 54850. The simulated population should fall in this range.
This is really more of a question for math.stackexchange, but out of curiosity I provide this solution. Maybe it is sufficient for your needs.
First, ?rexp tells us that it takes only two arguments (n and rate), so we generate exponentially distributed random values of the desired length.
set.seed(42) # for sake of reproducibility
n <- 16068
mr <- c(54850, 150) # your 'myrange' with less typing
y0 <- rexp(n, 1/25) # simulate exp. dist.
y <- sort(y0, decreasing = TRUE) # sort in decreasing order
Now we need a mathematical approach to rescale the distribution.
# f(x) = (b-a)(x - min(x))/(max(x)-min(x)) + a
y.scaled <- (mr[1] - mr[2]) * (y - min(y)) / (max(y) - min(y)) + mr[2]
Proof:
> range(y.scaled)
[1] 150.312 54850.312
That's not too bad.
Plot:
plot(y.scaled, type="l")
Note: There might be some mathematical issues with this rescaling; see, for example, this answer.
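A different technique from the rescaling above is to sample directly from an exponential distribution truncated to the target range by inverting its CDF. The following is a sketch under stated assumptions: the helper `rtrunc_exp` is hypothetical, and the rate 1 / (3034.22 - 150) is a guess derived from the reported mean and minimum, not a value taken from the question.

```r
# Hypothetical helper: draw n values from an exponential distribution
# shifted to start at `lower` and truncated at `upper`, via inverse CDF.
rtrunc_exp <- function(n, rate, lower, upper) {
  u <- runif(n)
  # Probability mass of the untruncated (shifted) exponential inside the range
  mass <- 1 - exp(-rate * (upper - lower))
  lower - log(1 - u * mass) / rate
}

set.seed(42)
rate_guess <- 1 / (3034.22 - 150)   # assumption: mean above the minimum is 1/rate
sim <- rtrunc_exp(16068, rate = rate_guess, lower = 150, upper = 54850)

range(sim)
mean(sim)
```

Values simulated this way are densest near 150 and thin out towards 54850, and every draw stays inside the range without a separate rescaling step.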
I want to calculate the error rate by interval, where 0 is good and 1 is bad. Suppose I have a sample of 100 observations whose levels are divided into intervals as follows:
X <- 10; q<-sample(c(0,1), replace=TRUE, size=X)
l <- sample(c(1:100),replace=T,size=10)
bornes<-seq(min(l),max(l),5)
v <- cut(l,breaks=bornes,include.lowest=T)
table(v)
How can I get a table or function that calculates the default rate by each interval, the number of bad observations divided by the total number of observations?
tx_erreur<-function(x){
t<-table(x,q)
return(sum(t[,2])/sum(t))
}
I already tried this code above and tapply.
Thank you!
I think you want this:
tapply(q,# the variable to be summarized
v,# the variable that defines the bins
function(x) # the function to calculate the summary statistics within each bin
sum(x)/length(x))
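Since q only contains 0 and 1, `sum(x)/length(x)` is the same as `mean(x)`. Below is a self-contained sketch of the whole calculation; the sample size of 100 and the breaks from 0 to 100 are assumptions chosen so that every observation falls into some interval.

```r
set.seed(1)
n_obs <- 100
q <- sample(c(0, 1), size = n_obs, replace = TRUE)   # 0 = good, 1 = bad
l <- sample(1:100, size = n_obs, replace = TRUE)     # level of each observation

# Breaks spanning the full possible range (an assumption), so no level is dropped
bornes <- seq(0, 100, by = 5)
v <- cut(l, breaks = bornes, include.lowest = TRUE)

rate   <- tapply(q, v, mean)   # share of bad observations in each interval
counts <- table(v)             # number of observations in each interval

# One row per interval; empty intervals get an error rate of NA
data.frame(interval   = names(rate),
           n          = as.vector(counts),
           error_rate = as.vector(rate))
```

Multiplying each rate by its count recovers the number of bad observations per interval, which is a quick way to sanity-check the table.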
I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector cdf (an increasing vector that starts at 0 and ends at 1)
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
Example
First generate a CDF:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
If you have a finite number of possible values, then you can use findInterval or cut or, better, sample, as mentioned by @Hemmo.
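For the findInterval route, a minimal sketch might look like the following. It assumes cdf is sorted and ends at 1, as in the question; the smaller sizes are chosen here only so that the cross-check against the question's `colSums` approach fits in memory.

```r
set.seed(1)
cdf <- cumsum(runif(100, 0, 0.1))
cdf <- cdf / max(cdf)          # increasing, last element is exactly 1

N <- 1000
u <- runif(N)

# findInterval(u, cdf) counts how many cdf entries are <= u,
# so adding 1 gives the first index k with u < cdf[k]
idx <- findInterval(u, cdf) + 1

# Cross-check against the question's approach (needs a length(cdf) x N matrix)
idx_slow <- colSums(sapply(u, ">", cdf)) + 1
all(idx == idx_slow)
```

findInterval relies on a binary search over the sorted breakpoints and never builds the large logical matrix, so memory use stays proportional to N.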
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.), then here is an algorithm that will work (it also works with a finite number of values, if wanted):
Start with your vector of uniform values and loop through the distribution's probabilities, subtracting each one from the vector of uniforms. The random value for each element is the iteration at which that element becomes non-positive. This is easier to see with an example. The code below generates values from a Poisson distribution with mean 5 (replace the dpois call with your calculated values) and compares the result with the inverse CDF (which is more efficient in cases like this one, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )
while( any(tmp > 0) ) {
tmp <- tmp - dpois(i, 5)
randvals <- randvals + (tmp > 0)
i <- i + 1
}
randvals2 <- qpois( tmp2, 5 )
all.equal(randvals, randvals2)