I'm trying to multiply some probability density functions in order to update a probability given certain factors. I've tried several things using the pdqr and bayesmeta packages, but none of them work the way I intend. What am I missing?
Here is a reproducible example with two different distributions, a and b, which I want to multiply. Note that b has no measurements at the low values, so its probability there is 0. This should be reflected in the updated distribution.
library(tidyverse)
library(pdqr)
library(bayesmeta)
#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)
#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)
The resulting PDFs are plotted like this:
bayesmeta::convolve() does the same as summing two pdqr PDFs, and it seems to oddly shift the distribution to the right and make it lower than it should be.
Ordinary multiplication of the pdqr PDFs leaves a very low probability overall.
Using pdqr::form_mix() seems to even the PDFs out in between, but leaves probabilities above 0 for the lower x-values.
So, I tried to gain some insight in what I wanted to do, by using the PDF's for a and b to generate probabilities for each x value and multiply that:
#multiply distributions manually
x <- c(1:10)
manual <- data.frame(x) %>%
mutate(a = distr_a(x),
b = distr_b(x),
multiplied = a*b)
This indeed gives a resulting shape I am after, it however (logically) has too low probabilities:
I would like to multiply (multiple) PDFs. What am I doing wrong? Are my statistics wrong, or am I missing a useful function?
UPDATE:
It seems I am a stats noob on this subject, but I would like to achieve something like the distribution below. Given that both situation a and b are true, I would expect the distribution to be something like the dotted line. Is that possible?
multiplied is the correct one. One can check with log-normal distributions. The product of two independent log-normal random variables is log-normal with µ = µ_a + µ_b and σ² = σ²_a + σ²_b.
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
distr_ab <- form_trans(
list(distr_a, distr_b), trans = function(x, y) x*y
)
# or: distr_ab <- distr_a * distr_b
plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")
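The same claim can be checked numerically without any plotting. This is only a minimal sketch in base R; the seed and the variable names are illustrative assumptions. If the product is log-normal with the stated parameters, the log of the product should have a mean close to 1 and a standard deviation close to sqrt(2).

```r
set.seed(2022)  # arbitrary seed, only for reproducibility

a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)

# log(a * b) = log(a) + log(b), a sum of two independent normals
log_product <- log(a * b)

check <- c(mean = mean(log_product),
           sd = sd(log_product),
           expected_sd = sqrt(2))
print(round(check, 3))
```

Note that this describes the distribution of the product of two random variables, which is what `distr_a * distr_b` computes in pdqr.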
As demonstrated here:
https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/
# Example distributions
probs <- seq(0,1,length.out= 100)
prior <- dbinom(x = 8, prob = probs, size = 10)
lik <- dnorm(x = probs, mean = .5, sd = .1)
# Multiply distributions
unstdPost <- lik * prior
# If you wanted to get an actual posterior, it must be a probability
# distribution (integrate to 1), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)
# Plot
plot(probs, prior, col = "black",
     type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, lik / 15, col = "red") # rescaled
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
# Legend order and colours match the order of the curves drawn above
legend("topleft", legend = c("Prior", "Lik", "Unstd Post", "Post"),
       text.col = c("black", "red", "green", "blue"), bty = "n")
Created on 2022-08-06 by the reprex package (v2.0.1)
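The same multiply-then-normalise idea can be applied to the measurements a and b from the question. The following is a rough sketch under some assumptions: it uses base R `density()` rather than pdqr, the grid range of -5 to 15 is an arbitrary choice that comfortably covers both samples, and the normalisation is a simple Riemann sum. A kernel estimate never reaches exactly 0, so the combined density at low x becomes very small rather than exactly zero.

```r
# Measurements from the question
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)

# Common evaluation grid (assumed range, wide enough for both samples)
grid <- seq(-5, 15, length.out = 1001)
step <- grid[2] - grid[1]

# Kernel density estimates evaluated on the same grid
dens_a <- density(a, from = min(grid), to = max(grid), n = length(grid))$y
dens_b <- density(b, from = min(grid), to = max(grid), n = length(grid))$y

# Pointwise product, rescaled so that it integrates to 1 over the grid
unnormalised <- dens_a * dens_b
combined <- unnormalised / sum(unnormalised * step)

# Location of the peak of the combined density
print(grid[which.max(combined)])
```

Calling `plot(grid, combined, type = "l")` and adding `lines(grid, dens_a)` and `lines(grid, dens_b)` shows the combined curve next to the two inputs.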
Related
I am trying to generate a similar plot as below to show the change in R-hat over iterations:
I have tried the following options :
summary(fit1)$summary : gives R-hat with all chains merged
summary(fit1)$c_summary : gives the summary for each chain individually
Can you please help me to get R-hat for each iteration for a given parameter?
rstan provides the Rhat() function, which takes a matrix of iterations x chains and returns R-hat. We can extract this matrix from the fitted model and apply Rhat() cumulatively over it. The code below uses the 8 schools model as an example (copied from the getting started guide).
library(tidyverse)
library(purrr)
library(rstan)
theme_set(theme_bw())
# Fit the 8 schools model.
schools_dat <- list(J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18))
fit <- stan(file = 'schools.stan', data = schools_dat)
# Extract draws for mu as a matrix; columns are chains and rows are iterations.
mu_draws = as.array(fit)[,,"mu"]
# Get the cumulative R-hat as of each iteration.
mu_rhat = map_dfr(
  seq_len(nrow(mu_draws)),
  function(i) {
    # drop = FALSE keeps the iterations x chains matrix shape even when i = 1
    return(data.frame(iteration = i,
                      rhat = Rhat(mu_draws[1:i, , drop = FALSE])))
  }
)
# Plot iteration against R-hat.
mu_rhat %>%
ggplot(aes(x = iteration, y = rhat)) +
geom_line() +
labs(x = "Iteration", y = expression(hat(R)))
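To see how the cumulative calculation behaves without fitting a Stan model, here is a small base R sketch on simulated draws. Two things are assumptions for illustration only: the draws are independent normal values standing in for `mu_draws`, and `classic_rhat` is a hypothetical helper that implements the textbook (non-split) Gelman-Rubin formula, which is not the same computation as rstan's `Rhat()`.

```r
# Textbook (non-split) Gelman-Rubin R-hat for an iterations x chains matrix.
# Hypothetical helper for illustration; rstan::Rhat() uses a different,
# rank-normalised split-chain version.
classic_rhat <- function(draws) {
  n <- nrow(draws)
  B <- n * var(colMeans(draws))        # between-chain variance
  W <- mean(apply(draws, 2, var))      # mean within-chain variance
  var_plus <- (n - 1) / n * W + B / n  # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(42)  # arbitrary seed
draws <- matrix(rnorm(1000 * 4), nrow = 1000, ncol = 4)  # stand-in for mu_draws

# Cumulative R-hat every 10 iterations, starting at 10 so variances exist
iterations <- seq(10, nrow(draws), by = 10)
rhat <- vapply(iterations,
               function(i) classic_rhat(draws[1:i, , drop = FALSE]),
               numeric(1))
cumulative <- data.frame(iteration = iterations, rhat = rhat)
print(tail(cumulative, 3))
```

Evaluating only every tenth iteration also keeps the loop cheap; the same thinning could be used with `Rhat()` on real draws.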
I am looking for a function similar to the logistic function, but instead of bounding the values between 0 and 1, I want it to transform the values to the range of -1 to 1.
I have some data that ranges from -1 to 1. I then fit a model and, based on the estimated coefficients and variance, I simulate some data from a normal distribution. But some values are outside the range of -1 to 1. I was wondering if there is a function to convert all values to a range of -1 to 1.
Thank you
2*atan(x)/pi or 2*F(x)-1 for any suitable cumulative distribution function F will do.
curve(2 * atan(x) / pi, -5, 5, col = 1)
curve(2*pnorm(x)-1, -5, 5, col = 2, add = TRUE)
curve(2*pt(x, 5)-1, -5, 5, col = 3, add = TRUE)
curve(2*plogis(x)-1, -5, 5, col = 4, add = TRUE)
legend("topleft", c("2*atan/pi", "2*pnorm-1", "2*pt-1", "2*plogis-1"), lty = 1, col = 1:4)
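As a small usage sketch (the simulated vector and the helper names `squash_atan` and `unsquash_atan` are illustrative assumptions), the arctangent version can be applied to unbounded values and inverted again. The last check also shows that `2*plogis(x)-1` is the same curve as `tanh(x/2)`.

```r
set.seed(1)  # arbitrary seed
simulated <- rnorm(1000, mean = 0, sd = 2)  # stand-in for the simulated values

# Map the real line into (-1, 1) and back again
squash_atan <- function(x) 2 * atan(x) / pi
unsquash_atan <- function(y) tan(pi * y / 2)

bounded <- squash_atan(simulated)
print(range(bounded))

# The transform is invertible, so no information is lost
round_trip <- all.equal(unsquash_atan(bounded), simulated)

# 2 * plogis(x) - 1 is identical to tanh(x / 2)
same_curve <- all.equal(2 * plogis(simulated) - 1, tanh(simulated / 2))
print(c(round_trip = isTRUE(round_trip), same_curve = isTRUE(same_curve)))
```

All of these maps are monotone, so the ordering of the values is preserved; they differ only in how quickly they approach -1 and 1.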
I am trying to create a data frame in R, with a set of variables that are normally distributed. Firstly, we only create the data frame with the following variables:
RootCause <- rnorm(500, 0, 9)
OtherThing <- rnorm(500, 0, 9)
Errors <- rnorm(500, 0, 4)
df <- data.frame(RootCause, OtherThing, Errors)
In the second part, we're asked to redo the above, but with a defined correlation between RootCause and OtherThing of 0.5. I have tried reading through a couple of pages and articles explaining correlation commands in R, but I am afraid I am struggling with comprehending it.
Easy answer
Draw another random variable OmittedVar and add it to the other variables:
n <- 1000
OmittedVar <- rnorm(n, 0, 9)
RootCause <- rnorm(n, 0, 9) + OmittedVar
OtherThing <- rnorm(n, 0, 9) + OmittedVar
Errors <- rnorm(n, 0, 4)
cor(RootCause, OtherThing)
[1] 0.4942716
Other answer: use multivariate normal function from MASS package:
But you have to define the variance/covariance matrix that gives you the correlation you like (the Sigma argument here):
d <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(9, 4.5, 4.5, 9), nrow = 2, ncol = 2), tol = 1e-6, empirical = FALSE, EISPACK = FALSE)
cor(d[,1], d[,2])
[1] 0.5114698
Note:
The correlation of 0.5 follows from the setup in each case. To get a different value, change the details: the weight on OmittedVar in the first strategy (currently 1 * OmittedVar) or the covariance entries of Sigma in the second. You'll have to look up the variance rules of the normal distribution to work out the values.
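For the first strategy the weight can be worked out directly. If the independent parts and OmittedVar all have the same standard deviation and OmittedVar is added with weight w, the covariance is w^2 * sd^2 and each variance is (1 + w^2) * sd^2, so the correlation is w^2 / (1 + w^2). Solving for a target rho between 0 and 1 gives w = sqrt(rho / (1 - rho)), which is 1 for rho = 0.5. A sketch with an assumed target of 0.3 and a large n so the sample correlation is stable:

```r
set.seed(123)   # arbitrary seed
n <- 100000     # large sample so the estimate is close to the target
rho <- 0.3      # assumed target correlation, must satisfy 0 <= rho < 1

# Weight on the shared component that yields correlation rho
w <- sqrt(rho / (1 - rho))

OmittedVar <- rnorm(n, 0, 9)
RootCause  <- rnorm(n, 0, 9) + w * OmittedVar
OtherThing <- rnorm(n, 0, 9) + w * OmittedVar

achieved <- cor(RootCause, OtherThing)
print(round(achieved, 3))
```

As in the original answer, adding the shared component also inflates the standard deviation of both variables, here to 9 * sqrt(1 + w^2).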
An assignment has tasked us with creating a series of variables: normal1, normal2, normal3, chiSquared1 and 2, t, and F. They are defined as follows:
library(tibble)
Normal.Frame <- data_frame(normal1 = rnorm(5000, 0, 1),
normal2 = rnorm(5000, 0, 1),
normal3 = rnorm(5000, 0, 1),
chiSquared1 = normal1^2,
chiSquared2 = normal2^2,
F = sum(chiSquared1/chiSquared2),
t = sum(normal3/sqrt(chiSquared1 )))
We then have to make histograms of the distributions for normal1, chiSquared1 and 2, t, and F, which is simple enough for normal1 and the chiSquared variables, but when I try to plot F and t, the plot space is blank.
Our lecturer recommended limiting the range of F to 0-10, and t to -5 to 5. To do this, I use:
HistT <- hist(Normal.Frame$t, xlim = c(-5, 5))
HistF <- hist(Normal.Frame$F, xlim = c(0, 10))
Like I mentioned, this yields blank plots.
Your t and F are defined as sums; they will be single values. If those values are outside your range, the histogram will be empty. If you remove the sum() function you should get the desired results.
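A sketch of the corrected version, written with a base R `data.frame` instead of tibble so that it is self-contained (the seed is arbitrary). Without `sum()`, F and t hold one value per row. Because both have heavy tails, this sketch also drops the values outside the recommended ranges before binning, which is an extra step not mentioned above; otherwise a few extreme values can stretch the default bins far beyond the visible window.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 5000

normal1 <- rnorm(n)
normal2 <- rnorm(n)
normal3 <- rnorm(n)
chiSquared1 <- normal1^2
chiSquared2 <- normal2^2

Normal.Frame <- data.frame(
  normal1, normal2, normal3, chiSquared1, chiSquared2,
  F = chiSquared1 / chiSquared2,   # one F value per row, no sum()
  t = normal3 / sqrt(chiSquared1)  # one t value per row, no sum()
)

# Keep only the values inside the ranges suggested by the lecturer
t_in_range <- Normal.Frame$t[abs(Normal.Frame$t) <= 5]
F_in_range <- Normal.Frame$F[Normal.Frame$F <= 10]

HistT <- hist(t_in_range, breaks = 50, xlim = c(-5, 5), main = "t")
HistF <- hist(F_in_range, breaks = 50, xlim = c(0, 10), main = "F")
```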
I have a discrete data set with multiple peaks. I am trying to generate an automatic method for fitting a Gaussian curve to an unknown number of data points. The ultimate goal is to provide a measure of uncertainty on the location (x-axis) of the peak in the y-axis, using the sigma value of a best-fit Gaussian curve. The full data set has a half dozen or so unique peaks of various shapes.
Here is a sample data set.
# Only 30 likelihood values are listed, so age runs from 1 to 30 here
working <- data.frame(age = seq(1, 30),
                      likelihood = c(10, 10, 10, 10, 10, 12, 14, 16, 17, 18,
                                     19, 20, 19, 18, 17, 16, 14, 12, 11, 10,
                                     10, 9, 8, 8, 8, 8, 7, 6, 6, 6))
Here is the Gaussian fitting procedure. I found it on SO, but I can't find the page I took it from again, so please forgive the lack of link and citation.
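fitG <- function(x, y, mu, sig, scale) {
  # Sum of squared errors between a scaled Gaussian and the data
  f <- function(p) {
    d <- p[3] * dnorm(x, mean = p[1], sd = p[2])
    sum((d - y)^2)
  }
  # Minimise over (mean, sd, scale), starting from the supplied guesses
  optim(c(mu, sig, scale), f)
}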
This works well if I pre-define the area to fit. For instance taking only the area around the peak and using input mean = 10, sigma = 5, and scale = 1:
work2 <- working[5:20, ]
fit1 <- fitG(work2$age, work2$likelihood, 10, 5, 1)
fitpar1 <- fit1$par
plot(work2$age, work2$likelihood, pch = 20)
lines(work2$age, fitpar1[3]*dnorm(work2$age, fitpar1[1], fitpar1[2]))
However, I am interested in automating the procedure in some way, where I define the peak centers for the whole data set using peakwindow from the cardidates package. The ideal function would then iterate the number of data points used in the fit around a given peak in order to optimize the Gaussian parameters. Here is my attempt:
fitG.2 <- function (x, y) {
g <- function (z) {
newdata <- x[(y - 1 - z) : (y + 1 + z), ]
newfit <- fitG( newdata$age, newdata$likelihood, 10, 5, 1)
}
optimize( f = g, interval = c(seq(1, 100)))
}
However, I can't get this type of function to actually work (an error I can't solve). I have also tried creating a function with a for loop and setting break parameters but this method does not work consistently for peaks with widely varying shape parameters. There are likely many other R functions unknown to me that do exactly this.
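One possible shape for such a procedure, as a rough sketch only: loop over a set of candidate window half-widths around a given peak index, fit the Gaussian in each window, and keep the fit with the lowest mean squared error. The helper names `fit_gauss` and `fit_peak`, the half-width range 3 to 10, the starting values, and the mean-squared-error criterion are all assumptions made for illustration; the peak index is taken with `which.max()` here rather than from peakwindow.

```r
# Sample data from the question (30 likelihood values)
likelihood <- c(10, 10, 10, 10, 10, 12, 14, 16, 17, 18,
                19, 20, 19, 18, 17, 16, 14, 12, 11, 10,
                10, 9, 8, 8, 8, 8, 7, 6, 6, 6)
working <- data.frame(age = seq_along(likelihood), likelihood = likelihood)

# Least-squares fit of a scaled Gaussian, in the spirit of fitG
fit_gauss <- function(x, y, mu, sig, scale) {
  sse <- function(p) {
    if (p[2] <= 0) return(Inf)  # reject non-positive standard deviations
    sum((p[3] * dnorm(x, mean = p[1], sd = p[2]) - y)^2)
  }
  optim(c(mu, sig, scale), sse)
}

# Hypothetical helper: try several half-widths around a peak index and
# keep the fit with the lowest mean squared error (an assumed criterion)
fit_peak <- function(data, centre, half_widths = 3:10) {
  fits <- lapply(half_widths, function(h) {
    idx <- max(1, centre - h):min(nrow(data), centre + h)
    win <- data[idx, ]
    fit <- fit_gauss(win$age, win$likelihood,
                     mu = data$age[centre], sig = 5,
                     scale = sum(win$likelihood))
    data.frame(half_width = h, mu = fit$par[1], sigma = fit$par[2],
               scale = fit$par[3], mse = fit$value / length(idx))
  })
  fits <- do.call(rbind, fits)
  fits[which.min(fits$mse), ]
}

best <- fit_peak(working, centre = which.max(working$likelihood))
print(best)
```

A plain mean-squared-error criterion tends to favour narrow windows, since fewer points are easier to fit, so a penalty on small windows or a minimum half-width may be needed for peaks of widely varying shape.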