Generate random numbers of a chi squared distribution in R - r

I want to generate a chi squared distribution with 100,000 random numbers with degrees of freedom 3.
This is what I have tried.
df3=data.frame(X=dchisq(1:100000, df=3))
But the output is not what I expected. I used the code below to visualize it.
ggplot(df3,aes(x=X,y=..density..)) + geom_density(fill='blue')
Then the pdf looks abnormal. Please help

Use rchisq to sample from the distribution:
df3=data.frame(X=rchisq(100000, df=3))
ggplot(df3,aes(x=X,y=..density..)) + geom_density(fill='blue')
If your goal is to plot a density function, do this:
ggplot(data.frame(x = seq(0, 25, by = 0.01)), aes(x = x)) +
stat_function(fun = dchisq, args = list(df = 3), fill = "blue", geom = "density")
The latter has the advantage of the plot being fully deterministic.
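To see why the original attempt looked wrong, here is a minimal base-R sketch (the seed and variable names are my own choices, not from the question). dchisq evaluates the density at the points you pass in, whereas rchisq draws random samples:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# dchisq() returns density values at the points 1, 2, ..., 100000.
# Almost all of them are essentially zero, so they are not samples.
dens_vals <- dchisq(1:100000, df = 3)
share_near_zero <- mean(dens_vals < 1e-6)

# rchisq() draws random samples; its first argument is the sample size.
samples <- rchisq(100000, df = 3)

# A chi-squared distribution with k degrees of freedom has mean k and variance 2k,
# so these should come out close to 3 and 6.
sample_mean <- mean(samples)
sample_var  <- var(samples)
```

Plotting a density estimate of dens_vals therefore shows a spike at zero, while samples gives the expected right-skewed shape.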

Use rchisq() to create a distribution of 100,000 observations randomly drawn from a chi square distribution with 3 degrees of freedom.
df3=data.frame(X=rchisq(100000, df=3))
hist(df3$X)
The ggplot version looks like this:
library(ggplot2)
ggplot(data = df3, aes(X)) + geom_histogram()

You may use rchisq to make random draws from a χ² distribution, as shown in the other answers.
dchisq is the density function, which you might still find useful, since you want to plot:
curve(dchisq(x, 3), xlim=0:1*15)

Related

Can the Pearson's correlation coefficient be calculated on a linear-log ggplot without being impacted by the log scale?

Hello and welcome to my question.
What I want is to plot data from two columns as a linear-log scatterplot and then calculate the Pearson's correlation coefficient using the ggpubr::stat_cor() function. The coefficient needs to be calculated on the plot, since the analysis will involve a lot of exploration; calculating separately would be laborious.
Example Dataset
library(tidyverse)
library(ggpubr)
df <- tibble(A = c(1:10), B = c(1:10))
ggplot(data = df, aes(x = A, y = B)) +
geom_point() +
stat_cor(aes(label = ..r.label..), method = "pearson") +
scale_x_continuous(trans = "log10")
The issue is that, while I can run this code without the scale_x_continuous() and get what I expect (R = 1), I need the x-axis to be in log scale in my actual code. When I add the log scale, stat_cor() calculates the coefficient with the log values, resulting in an unexpected correlation (R = 0.95).
The bottom line is that I'd like a solution that allows me to plot the x-axis on a log scale without impacting the stat_cor() function, so that the function calculates the coefficient on the original (linear) values.
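One possible workaround, sketched under my own assumptions: since ggplot2 applies scale transformations before stats are computed, the coefficient can be calculated outside the stat on the untransformed columns and placed on the plot with annotate(). The label text and its position are illustrative choices, and this replaces stat_cor() rather than changing how it works.

```r
df <- data.frame(A = 1:10, B = 1:10)

# Pearson's r on the original (linear) values, computed outside ggplot
r_linear <- cor(df$A, df$B, method = "pearson")
# For comparison: the value obtained after a log10 transform of x
r_log <- cor(log10(df$A), df$B, method = "pearson")

r_label <- sprintf("R = %.2f", r_linear)

# Guard so the numeric part still runs where ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = A, y = B)) +
    geom_point() +
    # annotate() positions are given in data units and follow the scale
    annotate("text", x = min(df$A), y = max(df$B), label = r_label, hjust = 0) +
    scale_x_continuous(trans = "log10")
  print(p)
}
```

For many exploratory plots, the cor() call and the label could be wrapped in a small helper function so the coefficient does not have to be computed by hand each time.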

Trying to Plot Standard Normal Distribution and t-distribution on same graph

dat <- data.frame(dens = c(rnorm(1000000), rt(1000000, 4)), lines = rep(c("a", "b"), each = 100000))
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)
This is my code. I'm trying to plot the two distributions on the same graph. I only ended up with the t distribution.
Any feedback would be appreciated. Thank you.
As one of the comments says, this is basically a typo: you repeat the labels a and b only 100,000 (a hundred thousand) times each, so the labels are recycled across the 2,000,000 rows and the normally distributed and t-distributed numbers end up mixed within both groups. You need to set each = 1000000 (a million). To avoid miscounting zeros, just write 1e6, which means 1 * 10^6.
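A minimal sketch of the corrected construction, using a single variable n for both the sample size and each so they cannot drift apart (n is reduced here only so the example runs quickly):

```r
n <- 1e4  # smaller than the question's 1e6 so the sketch runs quickly

dat <- data.frame(
  dens  = c(rnorm(n), rt(n, df = 4)),
  # each = n labels the first n rows "a" (normal) and the next n rows "b" (t)
  lines = rep(c("a", "b"), each = n)
)

# Each label should now cover exactly one block of n rows
label_counts <- table(dat$lines)
first_block_labels <- unique(dat$lines[1:n])
```

The ggplot call from the question can then be used unchanged on dat.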

add exponential function given mean and intercept to cdf plot

Considering the following random data:
set.seed(123456)
# generate random normal data
x <- rnorm(100, mean = 20, sd = 5)
weights <- 1:100
df1 <- data.frame(x, weights)
#
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
We can create a general cumulative distribution plot.
But, I want to compare my curve to that from data used 20 years ago. From the paper, I only know that the data is "best modeled by a shifted exponential distribution with an x intercept of 1.1 and a mean of 18"
How can I add such a function to my plot?
+ stat_function(fun=dexp, geom = "line", size=2, col="red", args = (mean=18.1))
but I am not sure how to deal with the shift (x intercept)
I think scenarios like this are best handled by making your function first outside of the ggplot call.
dexp doesn't take a mean parameter; it uses rate (lambda) instead, and for an exponential distribution rate = 1/mean, so a mean of 18.1 corresponds to rate = 1/18.1. Also, I don't think dexp makes much sense here, since it gives the density; to compare against an ECDF you want the cumulative probability, which is pexp.
Your code could look something like this:
library(ggplot2)
test <- function(x) {pexp(x, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red")
You could shift your pexp distribution like this (a shift of 10 is used here for illustration):
test <- function(x) {pexp(x-10, rate = 1/18.1)}
ggplot(df1, aes(x)) + stat_ecdf() +
stat_function(fun=test, size=2, col="red") +
xlim(10,45)
Just for fun, this is what using dexp produces: [plot not shown]
I am not entirely sure I understand the concept of a mean for an exponential function. In general, though, when you pass a function as an argument (fun = dexp in your case), you can pass your own modified function instead, for example fun = function(x) dexp(x) + 1.1.
Maybe experimenting with this feature will get you to the solution.
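Here is one hedged reading of the paper's description, which differs from the numbers used in the answers above. It assumes that "x intercept of 1.1" is the shift and that "mean of 18" is the mean of the shifted variable, so the rate is 1 / (18 - 1.1). If the paper meant something else, only the two constants change.

```r
set.seed(123456)
df1 <- data.frame(x = rnorm(100, mean = 20, sd = 5))

shift <- 1.1          # x intercept quoted from the paper
overall_mean <- 18    # mean quoted from the paper
# Assumption: 18 is the mean after shifting, so the unshifted mean is 18 - 1.1
rate <- 1 / (overall_mean - shift)

# CDF of the shifted exponential; pexp() returns 0 for negative arguments,
# so the curve is 0 to the left of the shift
shifted_pexp <- function(x) pexp(x - shift, rate = rate)

# Guard so the numeric part still runs where ggplot2 is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df1, aes(x)) +
    stat_ecdf() +
    stat_function(fun = shifted_pexp, colour = "red")
  print(p)
}
```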

Displaying smoothed (convolved) densities with ggplot2

I'm trying to display some frequencies convolved with a Gaussian kernel in ggplot2. I tried smoothing the lines with:
+ stat_smooth(se = F,method = "lm", formula = y ~ poly(x, 24))
Without success.
I read an article suggesting the frequencies should be convolved with a Gaussian kernel, which ggplot2's stat_density function (http://docs.ggplot2.org/current/stat_density.html) seems to be able to do.
However, I can't seem to replace my geometry with stat_density. Is there anything wrong with my code?
require(reshape2)
library(ggplot2)
library(RColorBrewer)
fileName = "/1.csv" # downloadable there: https://www.dropbox.com/s/l5j7ckmm5s9lo8j/1.csv?dl=0
mydata = read.csv(fileName,sep=",", header=TRUE)
dataM = melt(mydata,c("bins"))
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data=dataM,
aes(x=bins, y=value, colour=variable)) +
geom_line() + scale_x_continuous(limits = c(0, 2))
This code produces the following plot:
I'm looking at smoothing the lines a little bit, so they look more like this:
(from http://journal.frontiersin.org/Journal/10.3389/fncom.2013.00189/full)
Since my comments solved your problem, I'll convert them to an answer:
The density function takes individual measurements and calculates a kernel density distribution by convolution (gaussian is the default kernel). For example, plot(density(rnorm(1000))). You can control the smoothness with the bw (bandwidth) parameter. For example, plot(density(rnorm(1000), bw=0.01)).
But your data frame is already a density distribution (analogous to the output of the density function). To generate a smoother density estimate, you need to start with the underlying data and run density on it, adjusting bw to get the smoothness where you want it.
If you don't have access to the underlying data, you can smooth out your existing density distributions as follows:
ggplot(data=dataM, aes(x=bins, y=value, colour=variable)) +
geom_smooth(se=FALSE, span=0.3) +
scale_x_continuous(limits = c(0, 2))
Play around with the span parameter to get the smoothness you want.
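For the case where the underlying measurements are available, the effect of bw can be sketched in base R as follows. The data are simulated and the roughness measure is just an illustrative helper of mine, not part of density():

```r
set.seed(1)
x <- rnorm(1000)  # stand-in for the raw measurements

d_default <- density(x)             # Gaussian kernel, automatic bandwidth
d_narrow  <- density(x, bw = 0.01)  # much smaller bandwidth, much wigglier curve

# Total up-and-down movement of each curve as a crude roughness measure
roughness <- function(d) sum(abs(diff(d$y)))
rough_default <- roughness(d_default)
rough_narrow  <- roughness(d_narrow)

plot(d_default, main = "Effect of bw on density()")
lines(d_narrow, col = "red")
```

A larger bw gives a smoother curve at the cost of blurring real features; a smaller one follows the data more closely but gets noisy.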

How to convert a bar histogram into a line histogram in R

I've seen many examples of a density plot, but the density plot's y-axis is the probability density. What I am looking for is a line plot (like a density plot) whose y-axis contains counts (like a histogram).
I can do this in excel where I manually make the bins and the frequencies and make a bar histogram and then I can change the chart type to a line - but can't find anything similar in R.
I've checked out both base and ggplot2; yet can't seem to find an answer. I understand that histograms are meant to be bars but I think representing them as a continuous line makes more visual sense.
Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data=rnorm(1000)
# Get the density estimate
dens=density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x,length(data)*dens$y,type="l",xlab="Value",ylab="Count estimate")
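If you want the actual bin counts rather than a scaled density estimate, a base-R sketch along the same lines is to let hist() do the binning without plotting and then draw the counts as a line (the number of breaks is an arbitrary choice):

```r
set.seed(1)
data <- rnorm(1000)

# Bin the data without drawing anything
h <- hist(data, breaks = 30, plot = FALSE)

# Draw the bin counts as a line through the bin midpoints
plot(h$mids, h$counts, type = "l", xlab = "Value", ylab = "Count")
```

This mirrors the Excel workflow from the question: bins and frequencies first, then a line chart.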
This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
In ggplot2, you can plot a histogram and display the count with bars (here data is a data frame and value is a placeholder for the column you want to bin):
ggplot(data, aes(x = value)) +
geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
ggplot(data, aes(x = value)) +
geom_freqpoly()
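To adapt the example on the ?stat_density help page (in current ggplot2 the movies data set lives in the separate ggplot2movies package):
library(ggplot2movies)
m <- ggplot(movies, aes(x = rating))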
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts.
m + geom_density(aes(y = ..count..))
Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, and you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data and the line of the probability density of the ideal distribution on top of it.
noise <- 2
#
# the noise is tagged onto the end using runif
# just to demo issues with real data and fitting
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
meanlog = 0.25,
sdlog = 1) +
(noise * runif(10000) - noise / 10)
#
# using package fitdistrplus
#
library(fitdistrplus)
# subset is used to remove the negative values
# as the lognormal distribution needs positive only
#
fitlnorm <- fitdist(subset(noisylognorm,
noisylognorm > 0),
"lnorm")
fitlnorm_density <- density(rlnorm(10000,
meanlog = fitlnorm$estimate["meanlog"],
sdlog = fitlnorm$estimate["sdlog"]))
hist(subset(noisylognorm,
noisylognorm < 25),
breaks = seq(-1, 25, 0.5),
col = "lightblue",
xlim = c(0, 25),
xlab = "value",
ylab = "frequency",
main = paste0("Log Normal Distribution\n",
"noise = ", noise))
lines(fitlnorm_density$x,
10000 * fitlnorm_density$y * 0.5,
type="l",
col = "red")
Note the * 0.5 in the lines() call. It accounts for the width of the hist() bars: the expected count in a bin is roughly (number of points) × density × bin width, and the breaks here are 0.5 apart.
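The bin-width scaling can be checked with a small base-R sketch. It uses simulated standard normal data rather than the lognormal fit above, purely to keep the example short:

```r
set.seed(1)
n <- 10000
bin_width <- 0.5
x <- rnorm(n)

# Equal-width breaks that are guaranteed to cover the whole sample
breaks <- seq(floor(min(x)), ceiling(max(x)), by = bin_width)
h <- hist(x, breaks = breaks, plot = FALSE)

# hist() stores density = count / (n * bin width), so this recovers the counts
counts_from_density <- h$density * n * bin_width

# The same scaling turns a theoretical density into expected counts per bin
expected_counts <- n * dnorm(h$mids) * bin_width
```

With a different bin width, the multiplier changes accordingly, which is why a hard-coded 0.5 only works for breaks that are 0.5 apart.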
There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot
