Creating and Visualizing samples in R

I want to create 100 samples per class from a normal distribution. For the first class, the mean is (0,0) and the covariance matrix is [(1,0),(0,1)]. For the second class, the mean is (5,0) and the covariance matrix is the same as for the first class. Finally, I would like to visualize all 200 instances in a single plot, with a different color for each class.
My problem is: when I generate this plot, I am unsure whether it actually contains all 200 samples.
My approach:
a1 <- c(1,0)
a2 <- c(0,1)
M <- cbind(a1, a2)
x <- cov(M)
dev <- sd(x, na.rm = FALSE)
C0 <- sample(rnorm(100, mean=0, sd=dev), size=100, replace=T)
C1 <- sample(rnorm(100, mean=5, sd=dev), size=100, replace=T)
plot(C0,C1, col=c("red","blue"), main = '200 samples, with mean 0 and 5 and S.D=0.5')
legend("topright", 95, legend=c("C0", "C1"),
col=c("red", "blue"), lty=1:2, cex=0.8)
I would like to know what needs to be corrected in my code.

Aside from the plotting issue mentioned in the other answer, your description suggests that you want to sample from two 2D multivariate normal distributions with different means.
If so, you can use the rmvnorm function from the mvtnorm package, which samples from a multivariate normal distribution.
library(mvtnorm)
C0 <- rmvnorm(100, c(0,0), M) # 100 samples, means (0, 0), covariance mtx M
C1 <- rmvnorm(100, c(5,0), M)
Right now, you take the covariance of the covariance matrix you have by typing x <- cov(M). This doesn't make much sense unless I'm misunderstanding what you're trying to accomplish.
EDIT: This is the full code for what I think you're trying to accomplish:
a1 <- c(1, 0)
a2 <- c(0, 1)
M <- cbind(a1, a2)
C0 <- rmvnorm(100, c(0, 0), M)
C1 <- rmvnorm(100, c(5, 0), M)
plot(C0, col = "red", xlim = c(-5, 10), ylim = c(-5, 5), xlab = "X", ylab = "Y")
points(C1, col = "blue")
legend("topright", inset = .05, c("Class 1", "Class 2"), fill = c("red", "blue"))
which outputs the plot
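To check the sample count directly rather than reading it off the plot, one option is to stack both classes into a single data frame and count its rows. The following is a minimal sketch, not the answer's own code: because the covariance matrix is the identity, the two coordinates are independent, so it draws each coordinate with base rnorm instead of mvtnorm. The seed and the names n, samples and class are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only for reproducibility
n <- 100     # samples per class

# Identity covariance means independent coordinates with unit variance,
# so each class is an n x 2 matrix built column by column.
C0 <- cbind(x = rnorm(n, mean = 0), y = rnorm(n, mean = 0))
C1 <- cbind(x = rnorm(n, mean = 5), y = rnorm(n, mean = 0))

# Stack both classes and keep a label for each row.
samples <- data.frame(rbind(C0, C1), class = rep(c("C0", "C1"), each = n))

nrow(samples)         # total number of points that will be drawn
table(samples$class)  # number of points per class

plot(samples$x, samples$y,
     col = ifelse(samples$class == "C0", "red", "blue"),
     xlab = "X", ylab = "Y")
legend("topright", legend = c("C0", "C1"), fill = c("red", "blue"))
```

Keeping the class label next to the coordinates makes the count and the colors come from the same object, so they cannot drift apart.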

Your x and y axes demonstrate that you're plotting C1 against C0. That's why your y-axis has its midpoint at 5 and the x-axis has it at 0. What you've done is plot 100 points with their x-coordinate from C0 and y-coordinate from C1.
Short of counting them, proving that you have 100 points on the screen is difficult. I know of no way to access the data that R has used to display your plot. However, one trick is to call text(C0,C1,label=1:150) after your code. This adds the numbers 1:150 to your plot, each placed at one of the points. If you had 200 points, this would be a tidy plot. However, since you have only 100, the coordinates are recycled and many points are labelled twice, making the plot unreadable.
If we make a new plot and use text(C0,C1,label=1:100) instead, things are much more clear:
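One way to count the points that a call like plot(C0, C1) draws, without reading the plot, is to pass the same arguments to xy.coords, the grDevices helper that plot.default uses to interpret its x and y arguments. This is a sketch with stand-in data (the seed and the default standard deviation are assumptions), not the asker's exact vectors.

```r
set.seed(1)  # assumed seed, only for reproducibility
C0 <- rnorm(100, mean = 0)
C1 <- rnorm(100, mean = 5)

# xy.coords pairs the two vectors the same way plot(C0, C1) does.
pts <- xy.coords(C0, C1)
n_points <- length(pts$x)
n_points  # 100: one point per (C0[i], C1[i]) pair, not 200

# A two-element col vector is recycled point by point, so the colours
# alternate red, blue, red, ... instead of marking two classes.
point_cols <- rep_len(c("red", "blue"), n_points)
plot(C0, C1, col = point_cols)
```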

Related

Overlay a Normal curve to Histogram

I repeat 50 times a rnorm with n=100, mean=100 and sd=25. Then I plot the histogram of all the sample means, but now I need to overlay a normal curve over the histogram.
x <- replicate(50, rnorm(100, 100, 25), simplify = FALSE)
x
sapply(x, mean)
sapply(x, sd)
hist(sapply(x, mean))
Do you know how to overlay a normal curve over the histogram of the means?
Thanks
When we plot the density rather than the frequency histogram by setting freq=FALSE, we can overlay the curve of a normal distribution with the mean and the standard deviation of the sample means. For the xlim of the curve we use the range of the means.
means <- sapply(x, mean)
mean.of.means <- mean(means)
sd.of.means <- sd(means) # close to the theoretical 25/sqrt(100) = 2.5
r <- range(means)
v <- hist(means, freq=FALSE, xlim=r, ylim=c(0, .25))
curve(dnorm(x, mean=mean.of.means, sd=sd.of.means), r[1], r[2], add=TRUE, col="red")
Also possible is to draw a sufficiently large sample from a normal distribution and overlay the histogram with the line of its estimated density.
lines(density(rnorm(1e6, mean.of.means, sd.of.means)))
Note that I have used 500 mean values in my answer, since the comparison with a normal distribution may become meaningless with too few values. You can also experiment with the breaks= option of the hist function.
Data
set.seed(42)
x <- replicate(500, rnorm(100, 100, 25), simplify = FALSE)
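As a rough check on the overlay, the spread of the sample means can be compared with theory: the mean of n independent draws with standard deviation sigma has standard deviation sigma/sqrt(n), here 25/sqrt(100) = 2.5. The sketch below recomputes this from the data above; the names means and theo.sd are introduced only for illustration.

```r
set.seed(42)
x <- replicate(500, rnorm(100, 100, 25), simplify = FALSE)

means <- sapply(x, mean)    # one mean per replicate
theo.sd <- 25 / sqrt(100)   # theoretical sd of a mean of 100 draws

# Empirical and theoretical spread should be close to each other.
c(empirical = sd(means), theoretical = theo.sd)

# Density-scaled histogram with the theoretical normal curve on top.
hist(means, freq = FALSE)
curve(dnorm(x, mean = 100, sd = theo.sd), add = TRUE, col = "red")
```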

How to add value of mean in number in violin plot using base R

I have data
Name V1
M1 50
M2 10
M1 30
M1 45
M2 5
M2 7
With my code, I was able to produce a violin plot, but I don't know how to print the numeric value of the mean on each violin using base R (not ggplot).
Here is an example of my code.
with(Data, vioplot(V1[Name=="M1"], V1[Name=="M2"], names=c("M1", "M2"),
plotCentre="line", rectCol="white", col="gray", ylab="",
ylim=c(0,80)))
title(ylab="A($m)", xlab="Name", main="AA")
Thanks a lot
You could use the following code:
Your data:
Data <- read.table(header = TRUE,
text = "Name V1
M1 50
M2 10
M1 30
M1 45
M2 5
M2 7")
Code
library(vioplot)
library(dplyr)
## calculate the mean per group
DataMean <- Data %>%
group_by(Name) %>%
summarize(mean = mean(V1))
## your plotting code
with(Data, vioplot(V1[Name=="M1"], V1[Name=="M2"], names=c("M1", "M2"),
plotCentre="line", rectCol="white", col="gray", ylab="",
ylim=c(0,80)))
title(ylab="A($m)", xlab="Name", main="AA")
## use base r 'text' to add the calculated means
## at position 1 and 2 on the X-axis, and
## at height of the Y-axis of 60 (2 times)
text(x = c(1, 2), y = c(60,60), labels = round(DataMean$mean, 2))
yielding the following plot:
Of course, we can play around with the position of the text. If we want the means to appear inside the violin plots, we use the mean values as Y-coordinates, and change the color to make it more visible (and shift the X-coordinates a little to the right, in combination with a lighter grey).
### playing with position and color
with(Data, vioplot(V1[Name=="M1"], V1[Name=="M2"], names=c("M1", "M2"),
plotCentre="line", rectCol="white", col="lightgray", ylab="",
ylim=c(0,80)))
title(ylab="A($m)", xlab="Name", main="AA")
text(x = c(1.1, 2.1), y = DataMean$mean, labels = round(DataMean$mean, 2), col = "blue")
yielding this plot:
Please, let me know whether this is what you want.
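If the goal is to stay entirely in base R, the group means can also be computed without dplyr. A minimal sketch using tapply follows; the name group.means is an assumption, and the plotting part only runs if the vioplot package is installed.

```r
Data <- read.table(header = TRUE, text = "Name V1
M1 50
M2 10
M1 30
M1 45
M2 5
M2 7")

# Mean of V1 within each level of Name, in sorted order of the names.
group.means <- tapply(Data$V1, Data$Name, mean)
group.means

# Draw the violins and print the means only when vioplot is available.
if (requireNamespace("vioplot", quietly = TRUE)) {
  with(Data, vioplot::vioplot(V1[Name == "M1"], V1[Name == "M2"],
                              names = c("M1", "M2"), col = "gray",
                              ylim = c(0, 80)))
  text(x = seq_along(group.means), y = 60, labels = round(group.means, 2))
}
```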

How to plot a normal distribution by labeling specific parts of the x-axis?

I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?
The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.
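The same axis() approach carries over to a normal distribution with any mean and standard deviation by placing the ticks at mu + k*sigma for k from -3 to 3. A sketch, where mu = 100 and sigma = 15 are arbitrary values chosen for illustration:

```r
mu <- 100     # illustrative mean
sigma <- 15   # illustrative standard deviation

x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)
y <- dnorm(x, mean = mu, sd = sigma)

# Tick positions at the mean and at 1, 2 and 3 sd on either side.
ticks <- mu + (-3:3) * sigma

plot(x, y, type = "l", lwd = 2, xaxt = "n")  # suppress the default x-axis
axis(1, at = ticks, labels = ticks)          # label the chosen positions
```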
If you prefer the hard way, without R's built-in density function, or you want to do this outside R, you can use the formula of the normal density directly.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
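A quick sanity check for the hand-written formula is to compare it with dnorm on the same grid. A sketch; the name y.builtin is introduced here for illustration:

```r
x <- seq(-4, 4, length = 200)
s <- 1
mu <- 0

# Normal density written out by hand.
y <- (1 / (s * sqrt(2 * pi))) * exp(-((x - mu)^2) / (2 * s^2))

# Same density from the built-in function.
y.builtin <- dnorm(x, mean = mu, sd = s)

# TRUE when both agree up to floating-point tolerance.
all.equal(y, y.builtin)
```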
An extremely inefficient and unusual, but beautiful, solution based on the ideas of Monte Carlo simulation is this:
1) Simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes arguments (A,B,C) and returns a vector of A samples from a normal distribution centered at B, with standard deviation C.
2) Plot the estimated density of these draws.
Thus, to take a sample of size 50,000 from a standard normal (i.e., a normal with mean 0 and standard deviation 1) and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity this will converge in distribution to the normal. To illustrate this, see the image below which shows from left to right and top to bottom 5000,50000,500000, and 5 million samples.
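The convergence can also be checked numerically rather than by eye. The sketch below measures the largest gap between the estimated density and dnorm for growing sample sizes; the seed, the sample sizes and the helper name max.gap are assumptions made for illustration.

```r
set.seed(123)  # assumed seed, only for reproducibility

# Largest absolute difference between the kernel density estimate of
# n standard normal draws and the true density on the estimate's grid.
max.gap <- function(n) {
  d <- density(rnorm(n, 0, 1))
  max(abs(d$y - dnorm(d$x)))
}

sizes <- c(5e3, 5e4, 5e5)
gaps <- sapply(sizes, max.gap)
data.frame(n = sizes, max.gap = gaps)
```

The gap should generally shrink as n grows, though individual runs can fluctuate.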
In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)
I particularly love lattice for this goal. It makes it easy to show graphical information such as specific areas under a curve, which you usually need when dealing with probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters according to your problem.
Axis labels are automatic.
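The shaded area corresponds to P(1 < Z < 1.5), which can be computed with pnorm or, as a cross-check, by numerical integration of dnorm. A short sketch; the names p.exact and p.numeric are introduced here for illustration:

```r
# Probability from the normal cumulative distribution function.
p.exact <- pnorm(1.5) - pnorm(1)

# Same area obtained by integrating the density numerically.
p.numeric <- integrate(dnorm, lower = 1, upper = 1.5)$value

round(c(pnorm = p.exact, integrate = p.numeric), 4)
```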
This is how to write it as a function:
normalCriticalTest <- function(mu, s) {
x <- seq(-4, 4, length=200) # x extends from -4 to 4
# y follows the formula of the normal distribution: f(Y)
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, xlim = c(-3.5,3.5))
# draw the graph, with 2.5% of the area in each tail
abline(v = c(-1.96, 1.96), col="red")
}
normalCriticalTest(0, 1) # draw a normal distribution with vertical lines.
Final result:
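The function above hard-codes the cut-offs -1.96 and 1.96 and the x range, which fit the standard normal. As a sketch of a more general variant (the name normalCriticalSketch and the alpha argument are assumptions, not part of the answer), the cut-offs and the x range can be derived from mu and s with qnorm:

```r
normalCriticalSketch <- function(mu, s, alpha = 0.05) {
  # x covers four standard deviations on either side of the mean
  x <- seq(mu - 4 * s, mu + 4 * s, length.out = 200)
  # two-sided cut-offs leaving alpha/2 of the area in each tail
  crit <- qnorm(c(alpha / 2, 1 - alpha / 2), mean = mu, sd = s)
  plot(x, dnorm(x, mu, s), type = "l", lwd = 2, ylab = "density")
  abline(v = crit, col = "red")
  invisible(crit)  # return the cut-offs without printing them
}

crit <- normalCriticalSketch(100, 15)
crit
```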

Plot weighted frequency matrix

This question is related to two different questions I have asked previously:
1) Reproduce frequency matrix plot
2) Add 95% confidence limits to cumulative plot
I wish to reproduce this plot in R:
I have got this far, using the code beneath the graphic:
#Set the number of bets and number of trials and % lines
numbet <- 36
numtri <- 1000
#Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow=numbet, ncol=numtri)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(5/6,1/6), replace = TRUE)
xcum[,i] <- cumsum(x)/(1:numbet)
}
#Plot the trials as transparent lines so you can see the build up
matplot(xcum, type="l", xlab="Number of Trials", ylab="Relative Frequency", main="", col=rgb(0.01, 0.01, 0.01, 0.02), las=1)
My question is: How can I reproduce the top plot in one pass, without plotting multiple samples?
Thanks.
You can produce this plot by using this code:
boring <- function(x, occ) occ/x
boring_seq <- function(occ, length.out){
x <- seq(occ, length.out=length.out)
data.frame(x = x, y = boring(x, occ))
}
numbet <- 31
odds <- 6
plot(1, 0, type="n",
xlim=c(1, numbet + odds), ylim=c(0, 1),
yaxp=c(0,1,2),
main="Frequency matrix",
xlab="Successive occasions",
ylab="Relative frequency"
)
axis(2, at=c(0, 0.5, 1))
for(i in 1:odds){
xy <- boring_seq(i, numbet+1)
lines(xy$x, xy$y, type="o", cex=0.5)
}
for(i in 1:numbet){
xy <- boring_seq(i, odds+1)
lines(xy$x, 1-xy$y, type="o", cex=0.5)
}
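The points in this plot are values a relative frequency can take: after n bets with s wins it equals s/n. As a sketch of that idea (the names grid and freq are assumptions), the full lattice of attainable values for up to 36 bets can be enumerated and plotted in one pass, without simulating any trials:

```r
numbet <- 36

# For every number of bets n, list all possible win counts s = 0..n
# and the corresponding relative frequency s/n.
grid <- do.call(rbind, lapply(1:numbet, function(n) {
  data.frame(n = n, s = 0:n, freq = (0:n) / n)
}))

plot(grid$n, grid$freq, pch = 16, cex = 0.4, las = 1,
     xlab = "Successive occasions", ylab = "Relative frequency")
```

Unlike the code above, this draws every attainable point rather than only the curves for a few fixed counts, so it is a starting point, not a replacement.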
You can also use Koshke's method, limiting the combinations of values to those with s<6. At Andrie's request, I added the condition on the difference of ps$n and ps$s to get a "pointed" configuration.
library(plyr) # provides ldply
ps <- ldply(0:35, function(i)data.frame(s=0:i, n=i))
plot.new()
plot.window(c(0,36), c(0,1))
apply(ps[ps$s<6 & ps$n - ps$s < 30, ], 1, function(x){
s<-x[1]; n<-x[2];
lines(c(n, n+1, n, n+1), c(s/n, s/(n+1), s/n, (s+1)/(n+1)), type="o")})
axis(1)
axis(2)
lines(6:36, 6/(6:36), type="o")
# need to fill in the unconnected points on the upper frontier
A weighted frequency matrix is also called a position weight matrix (in bioinformatics).
It can be represented in the form of a sequence logo.
This is at least how I plot a weighted frequency matrix.
library(cosmo)
data(motifPWM); attributes(motifPWM) # Loads a sample position weight matrix (PWM) containing 8 positions.
plot(motifPWM) # Plots the PWM as sequence logo.

Reverse Statistics with R

What I want to do sounds simple. I want to plot a normal IQ curve with R with a mean of 100 and a standard deviation of 15. Then, I'd like to be able to overlay a scatter plot of data on top of it.
Anybody know how to do this?
I'm guessing what you want to do is this: you want to plot the model normal density with mean 100 and sd = 15, and you want to overlay on top of that the empirical density of some set of observations that purportedly follow the model normal density, so that you can visualize how well the model density fits the empirical density. The code below should do this (here, x would be the vector of actual observations but for illustration purposes I'm generating it with a mixed normal distribution N(100,15) + 15*N(0,1), i.e. the purported N(100,15) distribution plus noise).
require(ggplot2)
x <- round( rnorm( 1000, 100, 15 )) + rnorm(1000)*15
dens.x <- density(x)
empir.df <- data.frame( type = 'empir', x = dens.x$x, density = dens.x$y )
norm.df <- data.frame( type = 'normal', x = 50:150, density = dnorm(50:150,100,15))
df <- rbind(empir.df, norm.df)
m <- ggplot(data = df, aes(x,density))
m + geom_line( aes(linetype = type, colour = type))
Well, it's more like a histogram, since I think you are expecting these to be more like an integer rounded process:
x<-round(rnorm(1000, 100, 15))
y<-table(x)
plot(y)
par(new=TRUE)
plot(density(x), yaxt="n", ylab="", xlab="", xaxt="n")
If you want the theoretical values of dnorm superimposed, then use one of these:
lines(sort(x), dnorm(sort(x), 100, 15), col="red")
or
points(x, dnorm(x, 100, 15))
You can plot the PDF of IQ scores with:
curve(dnorm(x, 100, 15), 50, 150)
But why would you like to overlay scatter over density curve? IMHO, that's very unusual...
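If the aim is to see individual observations against the model curve, one base R option is to draw a density-scaled histogram of the data, add the model curve, and mark the raw values with a rug. This is a sketch with simulated scores; the seed and the name obs are assumptions made for illustration.

```r
set.seed(7)  # assumed seed, only for reproducibility
obs <- round(rnorm(1000, 100, 15))  # stand-in for real IQ observations

# Histogram on the density scale so it is comparable with dnorm.
hist(obs, freq = FALSE, breaks = 30, xlim = c(40, 160),
     main = "", xlab = "IQ")

# Model density N(100, 15) on top of the histogram.
curve(dnorm(x, 100, 15), add = TRUE, col = "red", lwd = 2)

# One tick per observation along the x-axis.
rug(obs)
```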
In addition to the other good answers, you might be interested in plotting a number of panels, each with its own graph. Something like this.
