R hierarchical clustering: visualizing classifications without clustering on them

I've been studying some data sets using hierarchical clustering. In these data sets, there are a certain number of variables that I want to use to cluster data, and then there are other classification variables that I do not want to cluster on but still want to visualize.
What I want to do is find a way to "add a tier" to the heat map generated by the clustering algorithm where I can view binary classifications (colored red for 1, blue for 0), without actually clustering on this data. That way, I can evaluate how well my classification responses are grouped together by clustering.
Here is a simplified example:
library("gplots")
set.seed(1)
## creating random data to input into the hierarchical clustering algorithm
data <- matrix(rexp(100, rate = 0.1), ncol = 10)
colnames(data) <- paste0("var", 1:10)
# these are the two classification labels for each data point
classification1 <- c(1, 1, 0, 1, 1, 0, 0, 0, 1, 1)
# I want to visualize how well the grouping produced by the clustering
# algorithm correlates with the classifications, without clustering
# on these classifications
classification2 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
par(mar = c(1, 4.5, 0.1, 0.1))
hei <- c(0.5, 10)  # relative heights of the key row and the heatmap row
hclustfunc <- function(x) hclust(x, method = "complete")
distfunc <- function(x) dist(x, method = "euclidean")
my_palette <- colorRampPalette(c("yellow", "orange", "darkorange",
                                 "red", "darkred"))(n = 1000)
heatmap.2(as.matrix(data), dendrogram = "row", trace = "none",
          margins = c(8, 9), hclustfun = hclustfunc, distfun = distfunc,
          col = my_palette, key = FALSE, key.xlab = "",
          key.title = "Clustering Algorithm", key.ylab = "",
          keysize = 1.25, density.info = "density",
          lhei = hei)
This generates a heat map that has given me a lot of information. What I would now like to do is append two more columns to the right of the heat map that the clustering algorithm does not use for clustering.
These two columns would be binary labels for "classification 1" and "classification 2" (with a red cell for 1 and a blue cell for 0). I just want a visualization of how well these classification responses are grouped together in the dendrogram.

If you have only one classification to add, you can just use heatmap.2 with the RowSideColors option. But if you have multiple classifications to add, you'll want heatmap.plus. Its options differ slightly from those of heatmap and heatmap.2, but the one that matters for your question is RowSideColors, which here takes a matrix.
library(heatmap.plus)
# index with classification + 1: 0 -> "blue", 1 -> "red",
# matching the red-for-1, blue-for-0 scheme in the question
class1_cols <- c("blue", "red")[classification1 + 1]
class2_cols <- c("blue", "red")[classification2 + 1]
anno <- data.frame(class1 = class1_cols, class2 = class2_cols)
heatmap.plus(as.matrix(data), col = my_palette,
             RowSideColors = as.matrix(anno))
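For the single-classification case mentioned above, heatmap.2 alone is enough; a minimal sketch, reusing the objects defined in the question:
# heatmap.2 accepts a plain character vector for RowSideColors
heatmap.2(as.matrix(data), dendrogram = "row", trace = "none",
          hclustfun = hclustfunc, distfun = distfunc,
          col = my_palette, key = FALSE,
          RowSideColors = c("blue", "red")[classification1 + 1])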

Related

How do I graph a Bayesian Network with instantiated nodes using bnlearn and graphviz?

I am trying to graph a Bayesian Network (BN) with instantiated nodes using the libraries bnlearn and Rgraphviz. My workflow is as follows:
After creating a data frame with random data (the data I am actually using is obviously not random), I discretise the data, learn the structure of the directed acyclic graph (DAG), fit the data to the DAG, and plot the DAG. I also plot a DAG showing the posterior probabilities of each of the nodes.
library(bnlearn)
library(Rgraphviz)
# Generating random dataframe
data_clean <- data.frame(a = runif(min = 0, max = 100, n = 1000),
                         b = runif(min = 0, max = 100, n = 1000),
                         c = runif(min = 0, max = 100, n = 1000),
                         d = runif(min = 0, max = 100, n = 1000),
                         e = runif(min = 0, max = 100, n = 1000))
# Discretising the data into 3 bins
bins <- 3
data_discrete <- discretize(data_clean, breaks = bins)
# Creating factors for each bin in the data
lv <- c("low", "med", "high")
for (i in names(data_discrete)) {
  levels(data_discrete[, i]) <- lv
}
# Structure learning the DAG from the training set
whitelist <- matrix(c("a", "b",
                      "b", "c",
                      "c", "e",
                      "a", "d",
                      "d", "e"),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("from", "to")))
bn.hc <- hc(data_discrete, whitelist = whitelist)
# Plotting the DAG
dag.hc <- graphviz.plot(bn.hc, layout = "dot")
# Fitting the data to the structure
fitted <- bn.fit(bn.hc, data = data_discrete, method = "bayes")
# Plotting the DAG with posteriors
graphviz.chart(fitted, type = "barprob", layout = "dot")
The next thing I do is to manually change the distributions in the bn.fit object, assigned to fitted, and then plot a DAG that shows the instantiated nodes and the updated posterior probability of the response variable e.
# Manually instantiating
fitted_evidence <- fitted
cpt.a <- matrix(c(1, 0, 0), ncol = 3, dimnames = list(NULL, lv))
cpt.c <- c(1, 0, 0,
           0, 1, 0,
           0, 0, 1)
dim(cpt.c) <- c(3, 3)
dimnames(cpt.c) <- list("c" = lv, "b" = lv)
cpt.b <- c(1, 0, 0,
           0, 1, 0,
           0, 0, 1)
dim(cpt.b) <- c(3, 3)
dimnames(cpt.b) <- list("b" = lv, "a" = lv)
cpt.d <- c(0, 0, 1,
           0, 1, 0,
           1, 0, 0)
dim(cpt.d) <- c(3, 3)
dimnames(cpt.d) <- list("d" = lv, "a" = lv)
fitted_evidence$a <- cpt.a
fitted_evidence$b <- cpt.b
fitted_evidence$c <- cpt.c
fitted_evidence$d <- cpt.d
# Plotting the DAG with instantiation and posterior for response
graphviz.chart(fitted_evidence, type = "barprob", layout = "dot")
This is the result I get, but my actual BN is much larger, with many more arcs, and it would be impractical to change the bn.fit object manually.
Is there a way to plot a DAG with instantiation without changing the bn.fit object by hand? Is there a workaround or a function that I am missing?
I think/hope I have read the documentation for bnlearn thoroughly. I appreciate any feedback and would be happy to change anything in the question if I have not conveyed my thoughts clearly enough.
Thank you.
How about using cpdist to draw samples from the posterior given the evidence? You can then estimate the updated parameters by running bn.fit on the cpdist samples, and plot as before.
An example:
set.seed(69184390)  # for sampling
# Your evidence vector
ev <- list(a = "low", b = "low", c = "low", d = "high")
# draw samples from the posterior given the evidence
updated_dat <- cpdist(fitted, nodes = bnlearn::nodes(fitted),
                      evidence = ev, method = "lw", n = 1e6)
# refit: you'll get warnings about missing levels
updated_fit <- bn.fit(bn.hc, data = updated_dat)
# plot
par(mar = rep(0, 4))
graphviz.chart(updated_fit, type = "barprob", layout = "dot")
Note that I used bnlearn::nodes, since nodes is masked by a dependency of Rgraphviz. I tend to load bnlearn last.
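If you only need the updated distribution of the response e rather than the full chart, the refitted object can be inspected directly with standard bn.fit accessors:
# conditional probability table of e after updating
updated_fit$e
coef(updated_fit$e)  # just the probability table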

Radarchart Using fmsb in R

I was trying to make a radar chart using fmsb, but I'm having some trouble with my script. If anyone can help, I would really appreciate it.
My radar chart axis must run from 0 to 2 in steps of 0.2, but when I try it, the graph is out of bounds.
My script:
library("fmsb")
data:
df <- data.frame("hours" = 0:23,
                 "max_value" = 2, "min_value" = 0,
                 "CallsProportion" = c(1.583333333, 1.291666667, 1.166666667, 1,
                                       0.041666667, 0.833333333, 0.625, 0.791666667,
                                       0.75, 0.458333333, 0.833333333, 0.625,
                                       0.708333333, 0.458333333, 0.291666667,
                                       0.416666667, 0.625, 0.458333333, 1.208333333,
                                       2.375, 2.166666667, 2.208333333, 1.625,
                                       1.541666667))
radarchart(df, axistype = 1,
           caxislabels = seq(0, 2, 0.2),
           cglcol = "grey",
           vlcex = 0.8, cglty = 1,
           cglwd = 0.5, axislabcol = "grey",
           pcol = "black", plwd = 1.8, plty = 1)
This is a data format problem. Function fmsb::radarchart needs an object of class "data.frame" as its first argument; a "matrix" won't do. From help("radarchart"):
The data frame to be used to draw radarchart. If maxmin is TRUE,
this must include maximum values as row 1 and minimum values as row 2
for each variables, and actual data should be given as row 3 and lower
rows. The number of columns (variables) must be more than 2.
Note that maxmin = TRUE is the default setting.
So follow these steps.
1. Get the range of the values, i.e. their minimum and maximum.
2. Create a matrix with 2 rows and as many columns as there are rows in the original data frame (one per hour).
3. Bind the matrix with the data row.
4. Set the column names from the first column of df, column hours.
In this case there is just one vector to plot, vector CallsProportion.
r <- range(df$CallsProportion)
# row 1 = maximum, row 2 = minimum, recycled across the 24 columns
m <- matrix(r[2:1], nrow = 2, ncol = nrow(df))
df2 <- rbind.data.frame(m, df$CallsProportion)
names(df2) <- df$hours
radarchart(df2, axistype = 1,
           caxislabels = seq(0, 2, 0.2),
           cglcol = "grey",
           vlcex = 0.8, cglty = 1,
           cglwd = 0.5, axislabcol = "grey",
           pcol = "black", plwd = 1.8, plty = 1)

MXNet: sequence length in LSTM in a non-sequence data (R)

My data are not time series, but they have sequential properties.
Consider one sample:
data1 = matrix(rnorm(10, 0, 1), nrow = 1)
label1 = rnorm(1, 0, 1)
label1 is a function of data1, but the data matrix is not a time series. I suppose the label is a function not just of one data sample, but also of older samples, which are naturally ordered in time (not sampled randomly); in other words, the data samples are dependent on one another.
I have a batch of examples, say, 16.
With that, I want to understand how I can design an RNN/LSTM model that will memorize all 16 examples from the batch to construct the internal state. I am especially confused by the seq_len parameter, which as I understand it refers specifically to the length of the time series used as input to the network, which is not the case here.
This piece of code (taken from a time-series example) only confuses me, because I don't see how my task fits in.
symbol <- rnn.graph.unroll(seq_len = 5,
                           num_rnn_layer = 1,
                           num_hidden = 50,
                           input_size = NULL,
                           num_embed = NULL,
                           num_decode = 1,
                           masking = FALSE,
                           loss_output = "linear",
                           dropout = 0.2,
                           ignore_label = -1,
                           cell_type = "lstm",
                           output_last_state = FALSE,
                           config = "seq-to-one")
graph.viz(symbol, type = "graph", direction = "LR",
          graph.height.px = 600, graph.width.px = 800)
train.data <- mx.io.arrayiter(
  data = matrix(rnorm(100, 0, 1), ncol = 20),
  label = rnorm(20, 0, 1),
  batch.size = 20,
  shuffle = FALSE
)
Sure, you can treat them as time steps and apply an LSTM. Also check out this example, as it might be relevant for your case: https://github.com/apache/incubator-mxnet/tree/master/example/multivariate_time_series
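That suggestion amounts to reshaping the data so that consecutive samples become the time steps of one sequence. A minimal sketch in base R, assuming 10 features per sample and a window of 16 samples per sequence; the (features, seq_len, sequences) layout handed to mx.io.arrayiter is an assumption here, so check the shape conventions of your mxnet version:
n_features <- 10   # columns of each sample, as in data1 above
seq_len    <- 16   # treat 16 consecutive samples as one sequence
n_samples  <- 160  # must be a multiple of seq_len in this simple sketch
# samples as columns: n_features x n_samples
flat <- matrix(rnorm(n_samples * n_features), nrow = n_features)
# column-major reshape: samples 1..16 become the time steps of sequence 1, etc.
seq_data <- array(flat, dim = c(n_features, seq_len, n_samples / seq_len))
# seq-to-one: keep one label per sequence (here, the label of its last sample)
labels_flat <- rnorm(n_samples)
seq_label <- labels_flat[seq(seq_len, n_samples, by = seq_len)]
# assumed layout; verify against your mxnet build's documentation
train.data <- mx.io.arrayiter(data = seq_data, label = seq_label,
                              batch.size = 5, shuffle = FALSE)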

Density distributions in R

An assignment has tasked us with creating a series of variables: normal1, normal2, normal3, chiSquared1 and 2, t, and F. They are defined as follows:
library(tibble)
Normal.Frame <- data_frame(normal1 = rnorm(5000, 0, 1),
                           normal2 = rnorm(5000, 0, 1),
                           normal3 = rnorm(5000, 0, 1),
                           chiSquared1 = normal1^2,
                           chiSquared2 = normal2^2,
                           F = sum(chiSquared1/chiSquared2),
                           t = sum(normal3/sqrt(chiSquared1)))
We then have to make histograms of the distributions for normal1, chiSquared1 and 2, t, and F, which is simple enough for normal1 and the chiSquared variables, but when I try to plot F and t, the plot space is blank.
Our lecturer recommended limiting the range of F to 0-10, and t to -5 to 5. To do this, I use:
HistT <- hist(Normal.Frame$t, xlim = c(-5, 5))
HistF <- hist(Normal.Frame$F, xlim = c(0, 10))
Like I mentioned, this yields blank plots.
Your t and F are defined as sums, so each is a single value, recycled across all 5000 rows of the tibble. If that value falls outside your xlim range, the plot looks blank. Remove the sum() calls and you should get the desired results.
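Concretely, dropping sum() makes F and t element-wise ratios, giving 5000 draws of each; a corrected sketch of the same tibble:
library(tibble)
Normal.Frame <- data_frame(normal1 = rnorm(5000, 0, 1),
                           normal2 = rnorm(5000, 0, 1),
                           normal3 = rnorm(5000, 0, 1),
                           chiSquared1 = normal1^2,
                           chiSquared2 = normal2^2,
                           F = chiSquared1 / chiSquared2,   # element-wise ratio
                           t = normal3 / sqrt(chiSquared1)) # element-wise ratio
# histograms as in the question
HistT <- hist(Normal.Frame$t, xlim = c(-5, 5))
HistF <- hist(Normal.Frame$F, xlim = c(0, 10))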

Mean aggregation in R (Polygon in Polygon)

I have a set of polygons that represent the unit of analysis (gadmpolys).
In addition I have a set of polygons with levels of various variables (r3mergepolys).
What I want to accomplish is to aggregate the mean of one or more variables from the polygons (from r3mergepolys) that intersect with the unit-of-analysis polygons (gadmpolys).
I believe over and/or aggregate are my friends, but I cannot seem to figure out how to write the code.
# gadmpolys is the spdf containing my units of analysis
# r3mergepoly is the spdf with many smaller polygons which I want to aggregate from
r3mergepoly <- SpatialPolygonsDataFrame(Sr=r3polys, data=r3merge, match.ID=TRUE)
# Overlay GADMpolys and Afrobarometer-GADM matched polygons. Aggregate survey results for intersecting polygons
gadmpoly_r3 <- over(gadmpoly, r3mergepoly[17:21], fn=mean)
A quick and ugly centroid-based workaround:
library(rgeos)  # for gCentroid
# B: the smaller polygons reduced to centroid points, keeping their attributes
B <- SpatialPointsDataFrame(gCentroid(poly.pr, byid = TRUE),
                            poly.pr@data, match.ID = FALSE)
plot(A)
points(B)
# Overlay points and extract just the code column:
a.data <- over(A, B[, "code"])
# Add that data back to A:
A$bcode <- a.data$code
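To get at the question's actual goal, the mean of several variables, the same point-in-polygon overlay can aggregate with fn = mean. A sketch; the column indices 17:21 are taken from the question's own attempt and stand in for your variable columns:
# mean of the variable columns over each unit-of-analysis polygon
A_means <- over(A, B[, 17:21], fn = mean)
A@data <- cbind(A@data, A_means)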
The sf package's implementation of aggregate also provides a working example of this kind of aggregation:
library(sf)
m1 = cbind(c(0, 0, 1, 0), c(0, 1, 1, 0))
m2 = cbind(c(0, 1, 1, 0), c(0, 0, 1, 0))
pol = st_sfc(st_polygon(list(m1)), st_polygon(list(m2)))
set.seed(1985)
d = data.frame(matrix(runif(15), ncol = 3))
p = st_as_sf(x = d, coords = 1:2)
plot(pol)
plot(p, add = TRUE)
(p_ag1 = aggregate(p, pol, mean))
plot(p_ag1) # geometry same as pol
# works when x overlaps multiple objects in 'by':
p_buff = st_buffer(p, 0.2)
plot(p_buff, add = TRUE)
(p_ag2 = aggregate(p_buff, pol, mean)) # increased mean of second
# with non-matching features
m3 = cbind(c(0, 0, -0.1, 0), c(0, 0.1, 0.1, 0))
pol = st_sfc(st_polygon(list(m3)), st_polygon(list(m1)), st_polygon(list(m2)))
(p_ag3 = aggregate(p, pol, mean))
plot(p_ag3)
# In case we need to pass an argument to the join function:
(p_ag4 = aggregate(p, pol, mean,
                   join = function(x, y) st_is_within_distance(x, y, dist = 0.3)))
