How to get a scatter plot of mixture data with a different shape and colour for each distribution?

I am running a simulation with mixture data. My actual density is more complicated than a Gaussian, so I have simplified the question to the Gaussian case. Suppose I simulate mixture data like this:
N=2000
U=runif(N, min=0,max=1)
X = matrix(NA, nrow=N, ncol=2)
for (i in 1:N){
if(U[i] < 0.7){
X[i,] <- rnorm(1,0.5,1)
} else {
X[i,] <- rnorm(1,3,5)
}
}
How can I get a scatter plot with a different colour and shape (plotting symbol) for each cluster or distribution? I would like to set these manually, since my real function is complex. I tried plot(X[,1], X[,2], col=c("red","blue")) but it does not work.

I think this is what you want. Note that I had to do a bit of guesswork here to figure out what was going on, because your example code seems to have an error in it: you weren't generating different X1 and X2 values in each row:
N=2000
U=runif(N, min=0,max=1)
X = matrix(NA, nrow = N, ncol=2)
for (i in 1:N){
if(U[i] < 0.7){
# You had rnorm(1, ...), which returns a single value that is recycled
# into both columns; rnorm(2, ...) gives distinct X1 and X2 values
X[i,] <- rnorm(2, 0.5, 1)
} else {
X[i,] <- rnorm(2, 3, 5)
}
}
df = data.frame(
source = ifelse(U < 0.7, "dist1", "dist2"),
x = X[, 1],
y = X[, 2]
)
library(ggplot2)
ggplot(df, aes(x = x, y = y, colour = source, shape = source)) +
geom_point()
Result: a scatter plot with each distribution drawn in its own colour and shape.
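If base graphics are preferred, a minimal sketch of the same idea: index the colour and plotting-symbol vectors by the mixture component, rather than passing a two-colour vector that gets recycled over points (which is why the plot() attempt in the question did not work):
grp <- ifelse(U < 0.7, 1, 2) # component indicator, as in the data frame above
plot(X[, 1], X[, 2], col = c("red", "blue")[grp], pch = c(16, 17)[grp])
legend("topright", legend = c("dist1", "dist2"), col = c("red", "blue"), pch = c(16, 17))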

Here's what I got, but I'm not sure if this is what you are looking for - the locations of the observations for both clusters are exactly the same.
library(tidyverse)
df <- data.frame(X = X, U = U)
df <- gather(df, key = cluster, value = X, -U)
ggplot(df, aes(x = X, y = U, colour = cluster)) + geom_point() + facet_wrap(~cluster)
EDIT: I don't seem to be understanding what you want to map onto a scatter plot, so I'll show how to shape your data in order to create a chart like the one above with the proper X and Y coordinates:
head(df)
U cluster X
1 0.98345408 X.1 2.3296047
2 0.33939935 X.1 -0.6042917
3 0.66715421 X.1 -2.2673422
4 0.06093674 X.1 2.4007376
5 0.48162959 X.1 -2.3118850
6 0.50780007 X.1 -0.7307929
So you want one variable for the Y coordinate (I'm using U here), one variable for the X coordinate (X here), and a third variable that indicates whether the observation belongs to cluster 1 or cluster 2 (cluster here).
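Side note: tidyr's gather() used above is superseded; a minimal sketch of the same reshape with pivot_longer():
library(tidyr)
df <- data.frame(X = X, U = U) # columns X.1, X.2, U
df <- pivot_longer(df, cols = -U, names_to = "cluster", values_to = "X")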

Related

How to simulate 10,000 datasets in R with 30 observations each, create a histogram of the F-statistics, and superimpose the F-distribution?

What am I doing wrong in the following code?
My simulation study module asks me to use simple linear regression, i.e., p = 2. I'm supposed to generate B = 10,000 independent simulations from a simple linear regression with N = 30 observations and B_0 = B_1 = 0. For each simulation, one creates a dataset and extracts the F-statistic for the global F test, then verifies that the histogram of the F-statistics resembles an F(1, N-2) distribution. I am confused whether the problem is my loop, my ggplot code, or a mix of the two.
Here is my current code:
n=30
F1 = array(NA,dim=Nsim)
for(i in 1:Nsim){
X=rnorm(n,0,sd=sigmax) # generate x
res=rnorm(n,0,sd=sigma) # generate residual errors
Y=b0+b1*X+res # generate Y
mod = lm(Y~X)
res = summary(mod)
F1[i]=res$fstatistic[1] # F statistic
}
df<-tibble(F1=F1)
x=seq(1,10,1)
y=df(x,df1 = 1, df2 = n-2)
df2 = tibble(x=x,y=y)
ggplot() + geom_histogram(data=df, aes(x=F1,y=..density..), binwidth=0.1,color="black", fill="white")+
xlab("F") +
xlim(c(NA,10))+
ggtitle("n=30") +
geom_line(data = df2, aes(x = x, y = y), color = "red")
There are a number of problems with your code. You've left out several variable declarations, so the code doesn't run as-is; below I've added them to make it run, but you should check that what I've done is consistent with your intentions.
For the plot, the histogram is fine, but the line is only defined for x in the range [1, 10] in integer steps, while your histogram runs from 0 to 10 in steps of 0.1. If you change the range of x for the line to match the histogram, the curve covers the full range and has a point at each histogram bar.
library(tibble)
library(ggplot2)
n=30
Nsim = 10000 ### added
sigmax = 1.0 ### added
sigma = 1.0 ### added
b0 = 0 ### added
b1 = 0 ### added
F1 = array(NA,dim=Nsim)
for(i in 1:Nsim){
X=rnorm(n,0,sd=sigmax) # generate x
res=rnorm(n,0,sd=sigma) # generate residual errors
Y=b0+b1*X+res # generate Y
mod = lm(Y~X)
res = summary(mod)
F1[i]=res$fstatistic[1] # F statistic
}
df<-tibble(F1=F1)
x=seq(0,10,.10) ### changed to range from 0 to 10 in steps of 0.1
y=df(x,df1 = 1, df2 = n-2)
df2 = tibble(x=x,y=y)
ggplot() + geom_histogram(data=df, aes(x=F1,y=..density..), binwidth=0.1,color="black", fill="white")+
xlab("F") +
xlim(c(NA,10))+
ggtitle("n=30") +
geom_line(data = df2, aes(x = x, y = y), color = "red")
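An alternative sketch that avoids building the df2 grid by hand: let ggplot2 evaluate the F density directly with stat_function(). Note fun = stats::df, because df is the name of the tibble at this point:
ggplot(df, aes(x = F1)) +
geom_histogram(aes(y = ..density..), binwidth = 0.1, color = "black", fill = "white") +
stat_function(fun = stats::df, args = list(df1 = 1, df2 = n - 2), color = "red") +
xlab("F") +
xlim(c(NA, 10)) +
ggtitle("n=30")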

Directly Plotting Mathematical Functions in R

I am working with the R programming language.
In a previous question that I asked (Understanding 3D traces in Plotly and R: Evaluating and Plotting Functions Over a Grid), I learned how to plot mathematical functions by first evaluating the mathematical function at different points, then by plotting these points on a 3D grid, and finally "interpolating" a 3D surface over these points:
# set seed for reproducibility
set.seed(123)
# load libraries
library(dplyr)
library(plotly)
#create more data
n <- 50
my_grid <- expand.grid(i1 = 1:n, i2 = 1:n)
my_grid$final_value = with(my_grid, sin(i1) + cos(i2) )
#make plot
plot_ly(data = my_grid, x=~i1, y=~i2, z=~final_value, type='mesh3d', intensity = ~final_value, colors = colorRamp(c("blue", "grey", "red")))
I am trying to use this same approach to plot the following function, the Rosenbrock function f(x, y) = (1 - x)^2 + 100(y - x^2)^2 (https://en.wikipedia.org/w/index.php?title=Test_functions_for_optimization&oldid=1030693803, https://en.wikipedia.org/w/index.php?title=Test_functions_for_optimization&oldid=1030693803#/media/File:ConstrTestFunc03.png):
I first defined the function:
my_function <- function(x, y) {
  (1 - x)^2 + 100 * ((y - x^2)^2)
}
Then, I defined the "grid":
input_1 <- seq(-1.5, 1.5,0.1)
input_2 <- seq(-1.5, 1.5,0.1)
my_grid <- data.frame(input_1, input_2)
my_grid$final_value = (1 - input_1)^2 + 100*((input_2 - input_1^2)^2)
Then, I tried to plot this function:
x <- my_grid$input_1
y <- my_grid$input_2
z <- matrix(my_grid$final_value, nrow = length(x), ncol = length(y)) # proper matrix & dimensions
plot_ly(x = x, y = y, z = z) %>% add_surface()
My Problem: The final result does not look similar to the result from the Wikipedia page:
Can someone please show me what I am doing wrong? Is there an easier way to do this?
Thanks!
Your problem is that you are not actually creating a grid: you are creating a single set of (x, y) pairs with x equal to y and evaluating your formula on those, so your matrix is wrong (every column is the same, because the length-31 result vector is recycled to fill the 31x31 matrix). The easiest fix is to use outer() to evaluate your function at every pair of input_1 and input_2 values:
z <- outer(input_1, input_2, my_function)
plot_ly(x = input_1, y = input_2, z = z) %>% add_surface()
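Equivalently, a sketch that reuses the expand.grid/mesh3d approach from the earlier example, building the full grid so the matrix never has to be assembled by hand:
my_grid <- expand.grid(input_1 = input_1, input_2 = input_2)
my_grid$final_value <- with(my_grid, my_function(input_1, input_2))
plot_ly(data = my_grid, x = ~input_1, y = ~input_2, z = ~final_value, type = 'mesh3d', intensity = ~final_value, colors = colorRamp(c("blue", "grey", "red")))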

How to delete outliers from a QQ-plot graph made with ggplot()?

I have a two dimensional dataset (say columns x and y). I use the following function to plot a QQ-plot of this data.
# Create toy data for illustration
df = cbind(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
# Plotting the QQ-plot
library(ggplot2)
df_qq = as.data.frame(qqplot(df[,1], df[,2], plot.it=FALSE))
ggplot(df_qq) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
This is the resulting graph:
My question is: how can I avoid plotting the last point, (12, 32)? I would rather not delete it manually, because I have several of these data pairs and there are similar outliers in each of them. What I would like is code that identifies points that are too far from the 45-degree line and eliminates them from df_qq (for instance, a point could be dropped if it is 5 times further away than the average distance to the 45-degree line). My main objective is to make the graph easier to read: when outliers are not eliminated, the regular part of the QQ-plot occupies too small a part of the graph, which prevents me from visually assessing how similar the two vectors are apart from the outliers.
I would appreciate any help.
There is a CRAN package, referenceIntervals, that uses Cook's distance to detect outliers. Applying it to the values of df_qq$y gives an index into df_qq of the rows to remove.
library(referenceIntervals)
out <- cook.outliers(df_qq$y)$outliers
i <- which(df_qq$y %in% out)
ggplot(df_qq[-i, ]) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
Edit.
Following the OP's comment ("But as far as I understand this function does not look at the relation between x & y"), maybe the following function is what is needed: it removes points only if they are outliers in one of the vectors but not in both.
cookOut <- function(X){
out1 <- cook.outliers(X[[1]])$outliers
out2 <- cook.outliers(X[[2]])$outliers
i <- X[[1]] %in% out1
j <- X[[2]] %in% out2
w <- which((!i & j) | (i & !j))
if(length(w)) X[-w, ] else X
}
Test with the second data set, the one in the comment.
The extra vector, id, is just to make faceting easier.
df1 <- data.frame(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
df2 <- data.frame(x = c(1,5,8,2,9,6,1,7,32), y = c(1,4,10,1,6,5,2,1,32))
df_qq1 = as.data.frame(qqplot(df1[,1], df1[,2], plot.it=FALSE))
df_qq2 = as.data.frame(qqplot(df2[,1], df2[,2], plot.it=FALSE))
df_qq_out1 <- cookOut(df_qq1)
df_qq_out2 <- cookOut(df_qq2)
df_qq_out1$id <- "A"
df_qq_out2$id <- "B"
df_qq_out <- rbind(df_qq_out1, df_qq_out2)
ggplot(df_qq_out) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1) +
facet_wrap(~ id)
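For completeness, the rule sketched in the question itself (drop points more than 5 times the average distance from the 45-degree line) is also easy to implement directly; a minimal sketch, using the perpendicular distance |y - x|/sqrt(2):
d <- abs(df_qq$y - df_qq$x) / sqrt(2) # distance of each point to the line y = x
keep <- d <= 5 * mean(d) # the factor 5 comes from the question
ggplot(df_qq[keep, ]) +
geom_point(aes(x = x, y = y), size = 2) +
geom_abline(intercept = 0, slope = 1)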

Difference between two geom_smooth() lines

I made a plot of my data, and now I would like to have the difference in y for every x that was estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how do I get the differences for the following plot (data below)?
EDIT
Two suggestions were made but I still don't know how to calculate the differences.
The first suggestion was to access the computed data from the ggplot object, which I did with:
pb <- ggplot_build(p)
pb[["data"]][[1]]
That approach kind of works, but the two groups don't share the same x values. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 in the second group; hence, I cannot calculate the difference in y at -3.21318853 between the two groups.
The second suggestion was to look at what formula geom_smooth() uses. The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, so gam is used by default. I am not familiar with gam; can anyone give a short answer on how to calculate the difference between the two lines in this situation?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi, and welcome to Stack Overflow.
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx() function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
Output: a plot of the absolute difference dy between the two smooths as a function of x.
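A rough extension of the same idea: the computed data from ggplot_build() also carries an se column, so an approximate envelope on the difference can be sketched by interpolating the standard errors as well and combining them as if the two smooths were independent (an assumption; the model-based answer below gives a proper interval):
se.g1 <- approx(x = data.of.g1$x, y = data.of.g1$se, xout = xseq)$y
se.g2 <- approx(x = data.of.g2$x, y = data.of.g2$se, xout = xseq)$y
difference$se <- sqrt(se.g1^2 + se.g2^2) # assumes independent smooths
ggplot(difference, aes(x = x, y = dy)) +
geom_ribbon(aes(ymin = dy - 2 * se, ymax = dy + 2 * se), alpha = 0.2) +
geom_line()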
As I mentioned in the comments above, you really are better off doing this outside of ggplot, and instead doing it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit). For this model we also need to include g as a parametric effect: the group-specific smooths are both centred about 0, so the group means have to enter through the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
g = c(0,1)))
pdat <- transform(pdat, g = factor(g))
then we use this prediction data to generate the Xp matrix, which is a matrix that maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next, some code to identify which rows and columns of xp belong to the smooths for the respective levels of g. As there are only two levels and only a single smooth term in the model, this is entirely trivial, but for more complex models it is needed, and it is important to get the smooth component names right for the grep() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing:
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for other groups
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in, but to get this in terms of fitted values of the response y we need to multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that with the vcov() call here we're using the Bayesian covariance matrix, not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via the argument unconditional = TRUE.
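For example, a one-line sketch of the corrected standard error (the model was fitted with method = "REML", so the corrected covariance matrix is available):
## se of difference, accounting for smoothness-selection uncertainty
se.corrected <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))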
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
geom_line()
This produces
This is consistent with an assessment showing that the model with group-level smooths doesn't provide a substantially better fit than a model with different group means but a single common smoother in x:
> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
> AIC(m0, m)
df AIC
m0 9.68355 30277.93
m 14.70675 30285.02
> anova(m0, m, test = 'F')
Analysis of Deviance Table
Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 4990.1 124372
2 4983.9 124298 6.1762 73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
unconditional = FALSE) {
xp <- predict(model, newdata = newdata, type = 'lpmatrix')
c1 <- grepl(f1, colnames(xp))
c2 <- grepl(f2, colnames(xp))
r1 <- newdata[[var]] == f1
r2 <- newdata[[var]] == f2
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
## zero out cols of X related to splines for other groups
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
dif <- X %*% coef(model)
se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
data.frame(pair = paste(f1, f2, sep = '-'),
diff = dif,
se = se,
upper = upr,
lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
out)
ggplot(out, aes(x = x, y = diff)) +
geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

Plotting several random trials in R using ggplot2

I am currently running a simulation using a normal distribution; it simulates the times between events and is based on an analysis of given data (not relevant to the problem). The simulation is created like this:
SimProcess <- function(mu, sigma, T) {
ctimes <- c() # Array of arrival times, initially empty
t <- rnorm(1,mu, sqrt(sigma)) # Time of next arrival
while(t < T) {
ctimes <- c(ctimes, t)
dt = rnorm(1, mu, sqrt(sigma))
if (dt<0){dt = 0}
t <- t + dt # sampling from the dataset
}
return(ctimes)
}
# Create a sample path of one run
T <- 10
# arrival times
arrivals <- SimProcess(mu_t, var_t, T)
Now I would like to run several of these random trials and plot them in one figure, so we can compare them to the given data; 10 trials would be ideal. I tried plotting it like this, but unfortunately it doesn't work. I am afraid I'll have to use reshape2 to melt the data of the 10 trials, because the vectors are not all the same length. This is what I use to try to plot all the lines; it clearly doesn't work the way it should:
x <- c(0, arrivals, T,rep(0,500-length(arrivals)))
y <- c(0:length(arrivals), length(arrivals),rep(0,500-length(arrivals)))
plotdataNT = data.frame(x,y)
p = ggplot(plotdataNT,aes(x,y))
plot(x,y,type = 's')
j = 1
for (j in 10){
arrivals <- SimProcess(mu_t,var_t,T)
x <- c(0, arrivals, T,rep(0,500-length(arrivals)))
y <- c(0:length(arrivals), length(arrivals),rep(0,500-length(arrivals)))
p = p + geom_step(mapping = aes (x,y))
}
Edit:
In the end I figured it out. Because I used 10 instead of 1:10 the loop would not run properly, and I also had some more tiny mistakes. This ended up being the solution:
correction = 500 # not defined in the original post; assumed here to match the 500 used earlier
arrivals <- SimProcess(mu_t,var_t,T)
NT <- length(arrivals)
x <- c(0, arrivals, T,rep(0,correction-length(arrivals)))
y <- c(0:length(arrivals), length(arrivals),rep(0,correction-length(arrivals)))
plotdataNT = data.frame(x,y)
p = ggplot(plotdataNT,aes(x,y)) + geom_step(mapping = aes (x,y))
jk = 1
runs = 25
colourvec = rainbow(runs)
for (jk in 1:runs){
arrivals <- SimProcess(mu_t,var_t,T)
x <- c(0, arrivals, T,rep(0,correction-length(arrivals)))
y <- c(0:length(arrivals), length(arrivals),rep(0,correction-length(arrivals)))
newdata = data.frame(x,y)
p = p + geom_step(mapping = aes (x,y),newdata,colour = colourvec[jk])
}
p = p + scale_x_continuous(name = "Time in days") + scale_y_continuous(name = "Amount of claims")
p
This results in 26 random samples plotted in one graph in several colours; it represents a process with random time steps according to a gamma, normal, or lognormal distribution. The answer below is a cleaner example of what I meant. If anyone knows how to do this with reshape2 in a more efficient way, I'd also be glad to know.
Two solutions. The first gives each layer its own data frame; the second captures the current values of x and y via aes_string(), which is deprecated in current ggplot2:
for (j in 1:10) {
arrivals <- SimProcess(mu_t,var_t,T)
x <- c(0, arrivals, T,rep(0,500-length(arrivals)))
y <- c(0:length(arrivals), length(arrivals),rep(0,500-length(arrivals)))
xy <- data.frame(x,y)
p = p + geom_step(data=xy, mapping=aes(x,y))
}
print(p)
for (j in 1:10) {
arrivals <- SimProcess(mu_t,var_t,T)
x <- c(0, arrivals, T,rep(0,500-length(arrivals)))
y <- c(0:length(arrivals), length(arrivals),rep(0,500-length(arrivals)))
xy <- data.frame(x,y)
p = p + geom_step(mapping=aes_string(x,y))
}
print(p)
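As a follow-up to the reshape2 question in the edit above: a sketch that avoids both the zero-padding and the layer-per-run loop by stacking the runs into one long data frame with a run id (unequal run lengths are no problem in long format; mu_t, var_t, and T are assumed to be defined as in the question):
library(ggplot2)
# one data frame per run, each tagged with a run id
runs <- lapply(1:10, function(j) {
  arrivals <- SimProcess(mu_t, var_t, T)
  data.frame(run = factor(j), x = c(0, arrivals, T), y = c(0:length(arrivals), length(arrivals)))
})
plotdata <- do.call(rbind, runs) # stack into one long data frame
ggplot(plotdata, aes(x, y, colour = run)) +
geom_step() +
scale_x_continuous(name = "Time in days") +
scale_y_continuous(name = "Amount of claims")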
