Density distributions in R

An assignment has tasked us with creating a series of variables: normal1, normal2, normal3, chiSquared1, chiSquared2, t, and F. They are defined as follows:
library(tibble)
Normal.Frame <- data_frame(normal1 = rnorm(5000, 0, 1),
                           normal2 = rnorm(5000, 0, 1),
                           normal3 = rnorm(5000, 0, 1),
                           chiSquared1 = normal1^2,
                           chiSquared2 = normal2^2,
                           F = sum(chiSquared1/chiSquared2),
                           t = sum(normal3/sqrt(chiSquared1)))
We then have to make histograms of the distributions of normal1, chiSquared1, chiSquared2, t, and F. This is simple enough for normal1 and the chiSquared variables, but when I try to plot F and t, the plot space is blank.
Our lecturer recommended limiting the range of F to 0-10, and t to -5 to 5. To do this, I use:
HistT <- hist(Normal.Frame$t, xlim = c(-5, 5))
HistF <- hist(Normal.Frame$F, xlim = c(0, 10))
As I mentioned, this yields blank plots.

Your t and F are defined as sums, so each will be a single value. If that value falls outside your plotting range, the histogram will be empty. If you remove the sum() calls, you should get the desired results.
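A minimal sketch of the corrected definitions, keeping the asker's variable names (each column is now a vector of 5000 values, so the histograms have data to show):
library(tibble)
Normal.Frame <- data_frame(normal1 = rnorm(5000, 0, 1),
                           normal2 = rnorm(5000, 0, 1),
                           normal3 = rnorm(5000, 0, 1),
                           chiSquared1 = normal1^2,
                           chiSquared2 = normal2^2,
                           F = chiSquared1/chiSquared2,    # elementwise ratio, no sum()
                           t = normal3/sqrt(chiSquared1))  # elementwise ratio, no sum()
hist(Normal.Frame$t, xlim = c(-5, 5))
hist(Normal.Frame$F, xlim = c(0, 10))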

Multiply probability distributions in R

I'm trying to multiply some probability density functions so as to update a probability given certain factors. I've tried several things using the pdqr and bayesmeta packages, but none of them works out the way I intend. What am I missing?
Here is a reproducible example with two different distributions, a and b, which I want to multiply. As you can see, b has no measurements at the low values, so the probability there is 0; this should be reflected in the updated distribution.
library(tidyverse)
library(pdqr)
library(bayesmeta)
#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)
#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)
The resulting PDFs are plotted like this (plots omitted):
bayesmeta::convolve() does the same as summing the two pdqr PDFs; both oddly shift the distribution to the right and make it less tall than it is supposed to be.
Plainly multiplying the pdqr PDFs leaves a very low probability overall.
pdqr::form_mix() seems to even the PDFs out in between, but leaves probabilities above 0 for the lower x-values.
So, to gain some insight into what I want to do, I used the PDFs for a and b to generate probabilities for each x value and multiplied those:
# Multiply distributions manually
x <- 1:10
manual <- data.frame(x) %>%
  mutate(a = distr_a(x),
         b = distr_b(x),
         multiplied = a * b)
This indeed gives the shape I am after; however, it (logically) has probabilities that are too low:
I would like to multiply (multiple) PDFs. What am I doing wrong? Are my statistics wrong, or am I missing a useful function?
UPDATE:
It seems I am a stats noob on this subject, but I would like to achieve something like the distribution below. Given that both situations a and b are true, I would expect the distribution to be something like the dotted line. Is that possible?
multiplied is the correct one. One can check with log-normal distributions: the product of two independent log-normal random variables is log-normal with µ = µ_a + µ_b and σ² = σ²_a + σ²_b.
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
distr_ab <- form_trans(
  list(distr_a, distr_b), trans = function(x, y) x * y
)
# or: distr_ab <- distr_a * distr_b
plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")
As demonstrated here:
https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/
# Example distributions
probs <- seq(0, 1, length.out = 100)
prior <- dbinom(x = 8, prob = probs, size = 10)
lik <- dnorm(x = probs, mean = .5, sd = .1)
# Multiply distributions
unstdPost <- lik * prior
# If you wanted to get an actual posterior, it must be a probability
# distribution (integrate to 1), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)
# Plot
plot(probs, prior, col = "black",
     type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, lik / 15, col = "red") # rescaled
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
legend("topleft", legend = c("Prior", "Lik", "Unstd Post", "Post"),
       text.col = 1:4, bty = "n")
Created on 2022-08-06 by the reprex package (v2.0.1)
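Tying this back to the question's data: the asker's manual pointwise product only lacks this normalization step. A minimal sketch, assuming distr_a and distr_b are the pdqr functions from the question (the grid bounds and resolution are arbitrary choices):
# Evaluate both densities on a fine grid, multiply pointwise, and
# renormalize so the result integrates to 1 (rectangle rule)
x <- seq(0, 12, length.out = 500)
step <- x[2] - x[1]
prod_unnorm <- distr_a(x) * distr_b(x)
prod_norm <- prod_unnorm / (sum(prod_unnorm) * step)
plot(x, prod_norm, type = "l")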

How do I graph a Bayesian Network with instantiated nodes using bnlearn and graphviz?

I am trying to graph a Bayesian Network (BN) with instantiated nodes using the libraries bnlearn and Rgraphviz. My workflow is as follows:
After creating a data frame with random data (the data I am actually using is obviously not random), I discretise the data, learn the structure of the directed acyclic graph (DAG), fit the data to the DAG, and then plot the DAG. I also plot a DAG which shows the posterior probabilities of each of the nodes.
# rm(list = ls())
library(bnlearn)
library(Rgraphviz)
# Generating a random data frame
data_clean <- data.frame(a = runif(min = 0, max = 100, n = 1000),
                         b = runif(min = 0, max = 100, n = 1000),
                         c = runif(min = 0, max = 100, n = 1000),
                         d = runif(min = 0, max = 100, n = 1000),
                         e = runif(min = 0, max = 100, n = 1000))
# Discretising the data into 3 bins
bins <- 3
data_discrete <- discretize(data_clean, breaks = bins)
# Creating factors for each bin in the data
lv <- c("low", "med", "high")
for (i in names(data_discrete)) {
  levels(data_discrete[, i]) <- lv
}
# Structure learning the DAG from the training set
whitelist <- matrix(c("a", "b",
                      "b", "c",
                      "c", "e",
                      "a", "d",
                      "d", "e"),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("from", "to")))
bn.hc <- hc(data_discrete, whitelist = whitelist)
# Plotting the DAG
dag.hc <- graphviz.plot(bn.hc, layout = "dot")
# Fitting the data to the structure
fitted <- bn.fit(bn.hc, data = data_discrete, method = "bayes")
# Plotting the DAG with posteriors
graphviz.chart(fitted, type = "barprob", layout = "dot")
The next thing I do is to manually change the distributions in the bn.fit object, assigned to fitted, and then plot a DAG that shows the instantiated nodes and the updated posterior probability of the response variable e.
# Manually instantiating
fitted_evidence <- fitted
cpt.a <- matrix(c(1, 0, 0), ncol = 3, dimnames = list(NULL, lv))
cpt.c <- c(1, 0, 0,
           0, 1, 0,
           0, 0, 1)
dim(cpt.c) <- c(3, 3)
dimnames(cpt.c) <- list("c" = lv, "b" = lv)
cpt.b <- c(1, 0, 0,
           0, 1, 0,
           0, 0, 1)
dim(cpt.b) <- c(3, 3)
dimnames(cpt.b) <- list("b" = lv, "a" = lv)
cpt.d <- c(0, 0, 1,
           0, 1, 0,
           1, 0, 0)
dim(cpt.d) <- c(3, 3)
dimnames(cpt.d) <- list("d" = lv, "a" = lv)
fitted_evidence$a <- cpt.a
fitted_evidence$b <- cpt.b
fitted_evidence$c <- cpt.c
fitted_evidence$d <- cpt.d
# Plotting the DAG with instantiation and posterior for response
graphviz.chart(fitted_evidence, type = "barprob", layout = "dot")
This is the result I get, but my actual BN is much larger with many more arcs, and it would be impractical to change the bn.fit object manually.
I would like to find out if there is a way to plot a DAG with instantiation without changing the bn.fit object manually. Is there a workaround or a function that I am missing?
I think/hope I have read the documentation for bnlearn thoroughly. I appreciate any feedback and would be happy to change anything in the question if I have not conveyed my thoughts clearly enough.
Thank you.
How about using cpdist to draw samples from the posterior given the evidence? You can then estimate the updated parameters by running bn.fit on the cpdist samples, and plot as before.
An example:
set.seed(69184390) # for sampling
# Your evidence vector
ev <- list(a = "low", b = "low", c = "low", d = "high")
# Draw samples
updated_dat <- cpdist(fitted, nodes = bnlearn::nodes(fitted),
                      evidence = ev, method = "lw", n = 1e6)
# Refit: you'll get warnings over missing levels
updated_fit <- bn.fit(bn.hc, data = updated_dat)
# Plot
par(mar = rep(0, 4))
graphviz.chart(updated_fit, type = "barprob", layout = "dot")
Note I used bnlearn::nodes as nodes is masked by a dependency of Rgraphviz. I tend to load bnlearn last.

3D trajectory visualization with path in R

I'm looking for an efficient way to plot time, x, y, z with different colors for different objects, in order to view the proximity of the objects over time.
plot3D::lines3D works with add = TRUE, but it is not very elegant. Here's sample code that works:
data$object_id <- factor(data$object_id)
library(plot3D)
for (tr in unique(data$object_id)) {
  lines3D(data$x[data$object_id == tr],
          data$y[data$object_id == tr],
          data$z[data$object_id == tr],
          add = TRUE, col = data$object_id[data$object_id == tr])
}
Example data:
data <- data.frame(object_id = c(1, 1, 2, 2), t = c(0, 1, 0, 1), x = c(0, 1, 1, 0), y = c(0, 1, 1, 0), z = c(0, 1, 1, 0))
Desired result: the path traced by each object up to a given time, along with an arrow indicating its current heading (determined by joining the last 2 known positions).
At time t = 0, this should yield nothing, or just points. At t = 1, it should yield 2 lines (one over the other) of different colors: one color for each object.
The 2D equivalent is ggplot2::geom_path, which does all the heavy lifting via its group parameter, joining the paths by the grouping variable.
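A hedged sketch of one alternative, assuming the example data above: plotly's scatter3d traces group lines by color in much the same way geom_path groups paths by its group parameter. The heading arrow would still need separate handling (for instance, a cone trace built from the last two positions of each object).
library(plotly)
data$object_id <- factor(data$object_id)
# One colored 3D line (plus markers) per object; rotate interactively
# to inspect proximity over time
plot_ly(data, x = ~x, y = ~y, z = ~z, color = ~object_id,
        type = "scatter3d", mode = "lines+markers")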

R hierarchical clustering visualizing classifications without clustering on them

I've been studying some data sets using hierarchical clustering. In these data sets, there are a certain number of variables that I want to use to cluster the data, and there are other classification variables that I do not want to cluster on but still want to visualize.
What I want to do is find a way to "add a tier" to the heat map generated by the clustering algorithm where I can view binary classifications (colored red for 1, blue for 0) without actually clustering on them. That way, I can evaluate how well my classification responses are grouped together by the clustering.
Here is a simplified example:
library("gplots")
set.seed(1)
## creating random data to input into hierarchial clustering algorithm
data <- matrix(rexp(100, rate = 0.1), ncol = 10)
colnames(data) <- c("var1", "var2", "var3", "var4", "var5", "var6",
"var7", "var8", "var9", "var10")
# these are the two classification labels for each data point
classification1 <- c(1, 1, 0, 1, 1, 0, 0, 0, 1, 1)
# I want to visualize how well the clustering algorithm groups
# the data correlates with the classifications without
# clustering on these classifications
classification2 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
par(mar = c(1, 4.5, 0.1, 0.1))
matrix = rbind(c(1, 2), c(3, 4), c(5, 6))
wid = c(1, 1)
hei = c(0.5, 10)
hclustfunc <- function(x) hclust(x, method = "complete")
distfunc <- function(x) dist(x, method = "euclidean")
my_palette <- colorRampPalette(c("yellow", "orange", "darkorange",
"red", "darkred"))(n = 1000)
heatmap.2(as.matrix(data), dendrogram = "row", trace = "none",
margin = c(8, 9), hclust = hclustfunc, distfun = distfunc,
col = my_palette, key = FALSE, key.xlab = "", key.title = "Clustering Algorithm",
key.ylab = "", keysize = 1.25, density.info = "density",
lhei = hei)
This generates a heat map that has given me a lot of information. What I would now like to do is append two more columns to the right of the heat map that the clustering algorithm does not use for clustering.
These two columns would be binary labels for "classification 1" and "classification 2" (with a red cell for 1 and a blue cell for 0). I just want a visualization of how well these classification responses are grouped together in the dendrogram.
If you have only one classification to add, you can just use heatmap.2 with the RowSideColors option. But if you have multiple classifications to add, you'll need heatmap.plus. Its options are slightly different from those of heatmap and heatmap.2, but the important difference for your question is that its RowSideColors option takes a matrix.
library(heatmap.plus)
# Red for 1, blue for 0, as requested in the question
class1_cols <- c("blue", "red")[classification1 + 1]
class2_cols <- c("blue", "red")[classification2 + 1]
anno <- data.frame(class1 = class1_cols, class2 = class2_cols)
heatmap.plus(as.matrix(data), col = my_palette,
             RowSideColors = as.matrix(anno))
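To keep the rows clustered exactly as in the question's heatmap.2 call, the same distance and clustering functions can be passed through. A hedged sketch, assuming heatmap.plus mirrors the distfun/hclustfun arguments of stats::heatmap:
heatmap.plus(as.matrix(data), col = my_palette,
             distfun = distfunc, hclustfun = hclustfunc,
             RowSideColors = as.matrix(anno))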

Mean aggregation in R (Polygon in Polygon)

I have a set of polygons that represent the unit of analysis (gadmpolys).
In addition I have a set of polygons with levels of various variables (r3mergepolys).
What I want to accomplish is to aggregate the mean of one or more variables from the polygons (from r3mergepolys) that intersect with the unit of analysis polygons (gadmpolys).
I believe the over and/or aggregate functions are my friends, but I cannot seem to figure out how to write the code.
# gadmpolys is the spdf containing my units of analysis
# r3mergepoly is the spdf with many smaller polygons which I want to aggregate from
r3mergepoly <- SpatialPolygonsDataFrame(Sr=r3polys, data=r3merge, match.ID=TRUE)
# Overlay GADMpolys and Afrobarometer-GADM matched polygons. Aggregate survey results for intersecting polygons
gadmpoly_r3 <- over(gadmpoly, r3mergepoly[17:21], fn=mean)
Quick and ugly centroid-based workaround:
library(sp)
library(rgeos)
# Turn the smaller polygons into points at their centroids, keeping the data
B <- SpatialPointsDataFrame(gCentroid(poly.pr, byid = TRUE),
                            poly.pr@data, match.ID = FALSE)
plot(A)
points(B)
# Overlay points and extract just the code column:
a.data <- over(A, B[, "code"])
# Add that data back to A:
A$bcode <- a.data$code
The sf package's implementation of aggregate also provides a working example of its use:
library(sf)
m1 = cbind(c(0, 0, 1, 0), c(0, 1, 1, 0))
m2 = cbind(c(0, 1, 1, 0), c(0, 0, 1, 0))
pol = st_sfc(st_polygon(list(m1)), st_polygon(list(m2)))
set.seed(1985)
d = data.frame(matrix(runif(15), ncol = 3))
p = st_as_sf(x = d, coords = 1:2)
plot(pol)
plot(p, add = TRUE)
(p_ag1 = aggregate(p, pol, mean))
plot(p_ag1) # geometry same as pol
# works when x overlaps multiple objects in 'by':
p_buff = st_buffer(p, 0.2)
plot(p_buff, add = TRUE)
(p_ag2 = aggregate(p_buff, pol, mean)) # increased mean of second
# with non-matching features
m3 = cbind(c(0, 0, -0.1, 0), c(0, 0.1, 0.1, 0))
pol = st_sfc(st_polygon(list(m3)), st_polygon(list(m1)), st_polygon(list(m2)))
(p_ag3 = aggregate(p, pol, mean))
plot(p_ag3)
# In case we need to pass an argument to the join function:
(p_ag4 = aggregate(p, pol, mean,
                   join = function(x, y) st_is_within_distance(x, y, dist = 0.3)))
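Applied to the question's objects, a hedged sketch (assuming gadmpoly and r3mergepoly are the sp objects defined above, with the variables of interest in columns 17 to 21 as in the over() call):
# Convert the sp objects to sf, then take the mean of each variable
# over the polygons intersecting each unit-of-analysis polygon
gadm_sf <- st_as_sf(gadmpoly)
r3_sf <- st_as_sf(r3mergepoly)
gadm_agg <- aggregate(r3_sf[17:21], gadm_sf, mean)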
