How to generate a sample of artificial data with a particular variance? - r

I am trying to generate a data set with the following specification:
sample size:200
variance = 2
mean = 20
I have tried generating it with the rnorm() function, but it only takes the standard deviation as an argument. I have also tried taking the square root of the standard deviation to get the desired variance, but that doesn't work either.
How can I generate a data set with that mean and variance in RStudio?
Thank you.

x = rnorm(200, 20, sd=sqrt(2))
c(mean(x), var(x))
[1] 20.064919 1.981597
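rnorm() is parameterised by the standard deviation, so a variance of 2 means sd = sqrt(2), and the sample mean and variance will only be close to the targets, as the output above shows. If the sample statistics have to hit the targets exactly (an assumption about what is wanted, not something the question states), one possible sketch is to standardise the draw and rescale it. The helper name exact_sample is hypothetical.

```r
# Hypothetical helper: draw n normal values, then rescale them so that
# the sample mean and sample variance match the targets exactly.
exact_sample <- function(n, target_mean, target_var) {
  z <- rnorm(n)
  z <- (z - mean(z)) / sd(z)          # sample mean 0, sample sd 1
  target_mean + sqrt(target_var) * z  # shift and scale to the targets
}

set.seed(42)  # arbitrary seed, only for reproducibility
x <- exact_sample(200, target_mean = 20, target_var = 2)
c(mean(x), var(x))  # 20 and 2, up to floating-point rounding
```

The rescaled values are no longer an independent draw from N(20, 2), so this only makes sense when matching the sample statistics matters more than the sampling model.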

Related

How to simulate a iid process in R language?

I'm new to statistics and received the question below, which needs to be answered in R:
Simulate an i.i.d. process {X_t}, t = 1, ..., n, following the standard normal distribution, X_t ~ Normal(0, 1), with
sample size n = 1000 and simulation time N = 500. Compute the sample means X̄(1), ..., X̄(N),
where X̄(i) is the sample mean from the i-th simulation. Plot the histogram of X̄(1), ..., X̄(N).
my thought is:
as sample size n=1000, then I should
set.seed(1) # Setting a seed
X1 <- rnorm(1000) # Simulating X1
to compute the sample mean of X1-XN
result.mean <- mean(X1)
plot the histogram for mean X1-XN
plot(result.mean, type = 'h')
However, I'm not sure what to do with the simulation time N = 500. The plot I generated is a histogram with just one bar, so I'm pretty sure the simulation time should be used somewhere.
What is the purpose of the simulation here, and is my approach correct for the i.i.d. case? Thank you.
In base R (the stats package), rnorm() draws random numbers from a normal distribution; its defaults are a mean of 0 and a standard deviation of 1. Draw 1000 values and take the mean of that vector. Then repeat this 500 times with replicate() and pass the 500 sample means to hist().
hist(replicate(500, mean(rnorm(1000)), simplify = "vector"))
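The one-liner above can be unpacked into named steps, which also makes it easy to check the result. This is only a sketch: the seed and the variable names are arbitrary choices, not part of the original answer.

```r
set.seed(1)   # arbitrary seed so the run is reproducible
n <- 1000     # sample size of each simulated sample
N <- 500      # number of simulations

# Each replicate draws n standard normal values and returns their mean.
sample_means <- replicate(N, mean(rnorm(n)))

length(sample_means)   # N means, one per simulation
sd(sample_means)       # should be close to 1 / sqrt(n), about 0.0316
hist(sample_means, main = "Sample means of N(0, 1), n = 1000")
```

The purpose of the N repetitions is to see the sampling distribution of the mean: a single sample gives one mean (hence the one-bar plot), while 500 means give a histogram centred near 0.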

R how to calculate confidence interval based on proportion

I'm new to R and trying to learn stats..
Here is one practice question that I'm trying to figure out
How should I use R code to create a function based on this math equation?
I have a dataframe like this
The "exposed" column of the df contains two groups: one is called "Test Group (Exposed)" and the other is called "Control Group". The math function refers to these two groups.
In another practice I have these codes here to calculate the confidence interval
# sample size
# OK for non normal data if n > 30
n <- 150
# calculate the mean & standard deviation
will_mean <- mean(will_sample)
will_s <- sd(will_sample)
# normal quantile function, assuming mean has a normal distribution:
qnorm(p=0.975, mean=0, sd=1) # 97.5th percentile for a N(0,1) distribution
# a.k.a. Z = 1.96 from the standard normal distribution
# calculate the margin of error
# confidence interval = mean +/- critical value x (s/sqrt(n))
# "q" functions in r give the value of the statistic at a given quantile
critical_value <- qt(p=0.975, df=n-1)
error <- critical_value * will_s/sqrt(n)
# confidence interval
will_mean - error
will_mean + error
but I'm not sure how to do this for the two groups in the exposed column.
Don't worry, it's quite easy: if you have experience in at least one programming language, R is fairly straightforward.
The main difference between R and most other programming languages is that R was developed for statistical purposes.
You can compute the quantile for a given significance level α (remember to divide α by 2 for your formula) using the function qnorm(). By default it uses the standard normal distribution, as in your case; for more details see the documentation, which you can open with the command ?qnorm.
In the exercise you are not actually required to compute it, since you pass it in as an argument, but in practice you would need to.
The code should be something like:
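conf <- function(p1, p2, n1, n2, z) {
  # half-width of the interval: z times the standard error of p1 - p2
  part <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # lower and upper bounds of the confidence interval
  c(p1 - p2 - part, p1 - p2 + part)
}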
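To apply this to the two groups in the data frame, the proportions and group sizes first have to be computed from the "exposed" column. The sketch below assumes a 0/1 outcome column named converted, which is hypothetical (the question's data frame is not shown), and uses made-up data. The function name prop_diff_ci is also hypothetical; it computes the quantile with qnorm() instead of taking z as an argument.

```r
# Two-sided confidence interval for a difference of proportions,
# same formula as conf() above but with the quantile computed inside.
prop_diff_ci <- function(p1, p2, n1, n2, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  half_width <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(lower = p1 - p2 - half_width, upper = p1 - p2 + half_width)
}

# Made-up data: "exposed" matches the question, "converted" is hypothetical.
df <- data.frame(
  exposed = rep(c("Test Group (Exposed)", "Control Group"), times = c(120, 100)),
  converted = c(rep(c(1, 0), times = c(30, 90)), rep(c(1, 0), times = c(15, 85)))
)

is_test <- df$exposed == "Test Group (Exposed)"
p1 <- mean(df$converted[is_test]);  n1 <- sum(is_test)
p2 <- mean(df$converted[!is_test]); n2 <- sum(!is_test)
ci <- prop_diff_ci(p1, p2, n1, n2)
ci
```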

Different moments given by R using the same library

I'm using R along with library moments to generate a small dataset and compute the four initial moments of my data:
Mean
Variance
Skewness
Kurtosis
The code is shown below. I set a random seed for my PRNG and generate 1000 data points from a normal distribution.
Then I print the four moments in two ways. First, I print them individually. Then I print them using the function all.moments.
library(moments)
set.seed(123)
x = rnorm(1000, sd = 0.02)
print(mean(x));
print(var(x));
print(skewness(x))
print(kurtosis(x))
print(moments::all.moments(x, order.max = 4))
The outputs are shown below.
print(mean(x));
0.0003225573
print(var(x));
0.0003933836
print(skewness(x));
0.06529391
print(kurtosis(x));
2.925747
print(moments::all.moments(x, order.max = 4));
1.000000e+00 3.225573e-04 3.930942e-04 8.889998e-07 4.527577e-07
Note that the skewness and kurtosis differ between the two methods.
My question is: why do they give different results, and which result is the right one?
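Note that the third and fourth moments are NOT the skewness and kurtosis. By default, all.moments() returns the raw moments mean(x^k) for k = 0, ..., 4 (which is why the first entry is 1), whereas skewness() and kurtosis() are standardized central moments: the third and fourth central moments divided by the second central moment raised to the power 3/2 and 2, respectively. Both results are right; the skewness and kurtosis have to be calculated from the moments afterwards. The second raw moment (3.930942e-04) also differs slightly from var(x) because var() subtracts the mean and divides by n - 1, while the raw moment does neither.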
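As a sketch of that relationship, the skewness and kurtosis can be rebuilt from central moments in base R. The helper central_moment is hypothetical. The results should agree with moments::skewness() and moments::kurtosis(), which use the same divide-by-n definitions, and central moments are also available from the package via all.moments(x, order.max = 4, central = TRUE).

```r
set.seed(123)
x <- rnorm(1000, sd = 0.02)

# Hypothetical helper: k-th central moment, dividing by n (not n - 1).
central_moment <- function(x, k) mean((x - mean(x))^k)

m2 <- central_moment(x, 2)
m3 <- central_moment(x, 3)
m4 <- central_moment(x, 4)

skew <- m3 / m2^1.5   # standardized third central moment
kurt <- m4 / m2^2     # standardized fourth central moment (3 for a normal)
c(skewness = skew, kurtosis = kurt)

# Raw moments of order 0 to 4, i.e. mean(x^k), for comparison
sapply(0:4, function(k) mean(x^k))
```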

Finding Mean Squared Error?

I have produced a linear data set and have used lm() to fit a model to that dataset. I am now trying to find the MSE using mse()
I know the formula for MSE but I'm trying to use this function. What would be the proper way to do so? I have looked at the documentation, but I'm either dumb or it's just worded for people who actually know what they're doing.
library(hydroGOF)
x.linear <- seq(0, 200, by=1) # x data
error.linear <- rnorm(n=length(x.linear), mean=0, sd=1) # Error (0, 1)
y.linear <- x.linear + error.linear # y data
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(training.data)
training.mse <- mse(training.model, training.data)
plot(training.data)
mse() needs two data frames. I'm not sure how to get a data frame out of lm(). Am I even on the right track to finding a proper MSE for my data?
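Try this. Note that lm(training.data) on a bare data frame treats the first column (x.linear) as the response, so fit the model with an explicit formula first:
training.model <- lm(y.linear ~ x.linear, data = training.data)
mean((training.data$y.linear - predict(training.model))^2)
# the exact value depends on the random errors drawn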
You can also use the Metrics package, which gives a very clean way to get the mean squared error:
install.packages("Metrics")
library(Metrics)
mse(actual, predicted)
The first argument is the vector of actual values, here the response column of training.data.
The second argument is the vector of predicted values, which you get from the model:
pd <- predict(training.model, training.data)
Metrics::mse(training.data$y.linear, pd) # assumes y.linear is the model's response; Metrics:: avoids the clash with hydroGOF::mse
It seems you have not done the prediction yet, so first predict from your model and then calculate the MSE.
You can use the residual component from lm model output to find mse in this manner :
mse = mean(training.model$residuals^2)
Note: other programs (like SAS) compute the MSE as the residual sum of squares divided by the residual degrees of freedom rather than by n. I recommend doing the same if you want an unbiased estimate of the error variance.
mse = sum(training.model$residuals^2)/training.model$df.residual
I found this while trying to figure out why mean(my_model$residuals^2) was different in R than the MSE in SAS.
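The approaches above can be compared side by side. This is a minimal sketch on freshly simulated data; the seed and the short variable names are arbitrary, and the model is fit with an explicit formula so that y is unambiguously the response.

```r
set.seed(1)  # arbitrary seed for reproducibility
x <- seq(0, 200, by = 1)
y <- x + rnorm(length(x), mean = 0, sd = 1)
dat <- data.frame(x, y)

fit <- lm(y ~ x, data = dat)   # explicit formula: y is the response

mse_manual   <- mean((dat$y - predict(fit))^2)            # observed minus fitted
mse_resid    <- mean(residuals(fit)^2)                    # same thing via residuals
mse_unbiased <- sum(residuals(fit)^2) / df.residual(fit)  # divides by n - 2

c(mse_manual, mse_resid, mse_unbiased)
all.equal(mse_unbiased, summary(fit)$sigma^2)  # TRUE
```

The first two values are identical; the third is slightly larger and matches the squared residual standard error reported by summary().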

Generating multiple confidence intervals from samples of a normal distribution in R

I am a statistics student and R beginner (understatement of the year) trying to generate multiple confidence intervals for randomly generated samples of a normal distribution as part of an assignment.
I used the function
data <- replicate(25, rnorm(20, 50, 6))
to generate 25 samples of size n=20 from a N(50, 6^2) distribution (in a double matrix).
My question is, how do I find a 95% confidence interval for each sample of this distribution? I know that I can use colMeans(data) and sd(data) to find the sample mean and sample standard deviation for each sample, but I am having a brain fart trying to think of a function that can generate the confidence intervals for all columns in the double matrix (data).
As of now, my (extremely crude) solution consists of creating the functions
left <- function (x,y){x-(qnorm(0.975)*y/sqrt(20))}
right <- function (x,y){x+(qnorm(0.975)*y/sqrt(20))}
left(colMeans(data), sd(data))
right(colMeans(data), sd(data))
to generate 2 vectors of left and right bounds. Please let me know if there is a better way I can do this.
I suppose you could use the t.test() function. It returns the mean and the 95% confidence interval for a given vector of numbers.
# Create your data
data <- replicate(25, rnorm(20, 50, 6))
data <- as.data.frame(data)
After you make your data, you could apply the t.test() function to all columns using the lapply() function.
# Apply the t.test function and save the results
results <- lapply(data, t.test)
If you only want to see the confidence interval or mean returned, you can call them using the dollar sign operator. For example, for column one of your original data frame, you could type the following:
# Check 95% CI for sample one
results[[1]]$conf.int[1:2]
You could come up with a more elegant way of saving these data to a results data frame. Remember, you can always see what individual bits of information you can yank from an object by using the str() command. For example:
# Example
example <- t.test(data[,1])
str(example)
Hope this helps. Try this link for more information: Using R to find Confidence Intervals
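One possible way to collect all 25 intervals in a single table is sketched below, working directly on the matrix. It also rebuilds the intervals by hand for comparison. Note that sd(data) on a matrix returns one standard deviation over all 500 values, so the per-sample version uses apply(data, 2, sd). The seed and the object names ci_t and ci_manual are arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility
data <- replicate(25, rnorm(20, 50, 6))   # 20 x 25 matrix, one sample per column

# One t.test() per column; keep only the interval bounds.
ci_t <- t(apply(data, 2, function(col) t.test(col)$conf.int))
colnames(ci_t) <- c("lower", "upper")

# The same intervals by hand: mean +/- t quantile * s / sqrt(n).
n <- nrow(data)
means <- colMeans(data)
sds <- apply(data, 2, sd)                 # per-column sd; sd(data) pools everything
half_width <- qt(0.975, df = n - 1) * sds / sqrt(n)
ci_manual <- cbind(lower = means - half_width, upper = means + half_width)

all.equal(ci_t, ci_manual, check.attributes = FALSE)  # TRUE
head(ci_t, 3)
```

t.test() uses the t quantile with n - 1 degrees of freedom rather than qnorm(0.975), which is the usual choice when the standard deviation is estimated from a sample of size 20.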
