Radarchart Using fmsb in R

I am trying to make a radar chart with fmsb, but I'm having some trouble with my script.
I would really appreciate any help.
The axis of my radar chart must run from 0 to 2 in steps of 0.2.
But when I try it, the graph goes out of bounds.
My script:
library("fmsb")
data:
df <- data.frame("hours" = 0:23,
"max_value" = 2,"min_value" = 0,"CallsProportion" = (c(1.583333333
,1.291666667,1.166666667,1,0.041666667,0.833333333,0.625
,0.791666667,0.75,0.458333333,0.833333333,0.625,0.708333333,0.458333333,0.291666667,0.416666667,0.625,0.458333333,1.208333333,2.375,2.166666667,2.208333333,1.625,1.541666667)))
radarchart(df, axistype = 1,
caxislabels = seq(0,2,0.2),
cglcol = "grey",
vlcex = 0.8, cglty = 1,
cglwd = 0.5,axislabcol = "grey",
pcol = "black",plwd = 1.8, plty = 1)

This is a data format problem. The function fmsb::radarchart needs an object of class "data.frame" as its first argument; a "matrix" won't do. From help("radarchart"):
The data frame to be used to draw radarchart. If maxmin is TRUE,
this must include maximum values as row 1 and minimum values as row 2
for each variables, and actual data should be given as row 3 and lower
rows. The number of columns (variables) must be more than 2.
Note that maxmin = TRUE is the default setting.
So follow these steps.
Get the range of the values, that is, their minimum and maximum.
Create a matrix with 2 rows (maximum first, minimum second) and as many columns as there are rows in the original data frame, one per hour.
Bind the matrix with the data row.
Set the column names to the values of the first column of df, column hours.
In this case there is just one vector to plot, vector CallsProportion.
r <- range(df$CallsProportion)
m <- matrix(r[2:1], nrow = 2, ncol = nrow(df))
df2 <- rbind.data.frame(m, df$CallsProportion)
names(df2) <- df$hours
radarchart(df2, axistype = 1,
caxislabels = seq(0, 2, 0.2),
cglcol = "grey",
vlcex = 0.8, cglty = 1,
cglwd = 0.5, axislabcol = "grey",
pcol = "black", plwd = 1.8, plty = 1)
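The layout that radarchart expects can also be built with fixed axis limits rather than the observed range. The sketch below is only an illustration, not the answer's method: make_radar_frame is a hypothetical helper, and the upper limit of 2.4 is an assumption, chosen because the largest value in the data (2.375) exceeds the requested maximum of 2. It also assumes that seg sets the number of axis segments, so the 13 labels from seq(0, 2.4, 0.2) are paired with seg = 12.

```r
# Call proportions for hours 0-23, copied from the question.
calls <- c(1.583333333, 1.291666667, 1.166666667, 1, 0.041666667, 0.833333333,
           0.625, 0.791666667, 0.75, 0.458333333, 0.833333333, 0.625,
           0.708333333, 0.458333333, 0.291666667, 0.416666667, 0.625,
           0.458333333, 1.208333333, 2.375, 2.166666667, 2.208333333,
           1.625, 1.541666667)

# Hypothetical helper: row 1 = maximum, row 2 = minimum, row 3 = data.
make_radar_frame <- function(values, labels, lower, upper) {
  n <- length(values)
  m <- matrix(c(rep(upper, n), rep(lower, n), values), nrow = 3, byrow = TRUE)
  out <- as.data.frame(m)
  names(out) <- as.character(labels)
  out
}

# Assumed limits: 0 to 2.4, so that the peak of 2.375 stays inside the chart.
radar_df <- make_radar_frame(calls, 0:23, lower = 0, upper = 2.4)

# Draw only if fmsb is installed; seg = 12 is assumed to match the 13 labels.
if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(radar_df, axistype = 1, seg = 12,
                   caxislabels = seq(0, 2.4, 0.2),
                   cglcol = "grey", cglty = 1, cglwd = 0.5,
                   axislabcol = "grey", vlcex = 0.8,
                   pcol = "black", plwd = 1.8, plty = 1)
}
```

If the axis really has to stop at 2, passing upper = 2 builds the same layout, but hours 19 to 21 would then hold values above the stated maximum.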

Related

Prp plot - Coloring positive and negative values differently

I am fitting regression trees via the function rpart(). Given my data, I am going to have both positive and negative estimates in nodes. Is there a way to color them differently?
In particular, what I would like to have is a tree whose nodes are shaded in blue for negative values and in red for positive values, where darker colors signal stronger absolute values.
I attach a minimal reproducible example.
library(rpart)
library(rpart.plot)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y)
# Fitting regression tree.
my.tree = rpart(y ~ X1 + X2, data = dta, method = "anova", maxdepth = 3)
# Plotting.
prp(my.tree,
type = 2,
clip.right.labs = FALSE,
extra = 101,
under = FALSE,
under.cex = 1,
fallen.leaves = TRUE,
box.palette = "BuRd",
branch = 1,
round = 0,
leaf.round = 0,
prefix = "" ,
main = "",
cex.main = 1.5,
branch.col = "gray",
branch.lwd = 3)
# Repeating, with median(y) != 0.
X = matrix(rnorm(2000, 5, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y)
my.tree = rpart(y ~ X1 + X2, data = dta, method = "anova", maxdepth = 3)
# HERE I NEED HELP!
prp(my.tree,
type = 2,
clip.right.labs = FALSE,
extra = 101,
under = FALSE,
under.cex = 1,
fallen.leaves = TRUE,
box.palette = "BuRd",
branch = 1,
round = 0,
leaf.round = 0,
prefix = "" ,
main = "",
cex.main = 1.5,
branch.col = "gray",
branch.lwd = 3)
As far as I understand, thanks to the box.palette option I get the result I need in the first setting because median(y) is close to zero.
In the second setting the result is not what I want: I get blue shades for values below median(y) and red shades for values above it. How can I impose zero as the threshold between the two colors?
To be more specific, I would like a command that automatically enforces this two-color scheme in any tree.
OK, I answered my own question. The solution is actually quite simple: if the box.palette option is a two-color diverging palette (as in my example), we can use pal.thresh to set the threshold we want. In my case:
prp(my.tree,
type = 2,
clip.right.labs = FALSE,
extra = 101,
under = FALSE,
under.cex = 1,
fallen.leaves = TRUE,
box.palette = "BuRd",
branch = 1,
round = 0,
leaf.round = 0,
prefix = "" ,
main = "",
cex.main = 1.5,
branch.col = "gray",
branch.lwd = 3,
pal.thresh = 0) # HERE THE SOLUTION!
Even if this may reflect badly on me, I will leave the answer here for future users and close the question rather than delete it.
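To make the idea of a colour threshold concrete, here is a small base R sketch. It does not reproduce how rpart.plot computes its box colours; shade_by_threshold is a hypothetical helper that picks a blue shade for values below a threshold and a red shade for values above it, with darker shades for values farther from the threshold.

```r
# Hypothetical helper (not part of rpart.plot): shade values by their
# position relative to `thresh`, using n shades per side.
shade_by_threshold <- function(values, thresh = 0, n = 9) {
  blues <- colorRampPalette(c("white", "blue"))(n)
  reds  <- colorRampPalette(c("white", "red"))(n)
  dist  <- abs(values - thresh)
  top   <- max(dist)
  # Map each distance to a shade index between 1 (lightest) and n (darkest).
  idx <- if (top > 0) pmax(1, ceiling(dist / top * n)) else rep(1, length(values))
  ifelse(values < thresh, blues[idx], reds[idx])
}

# Values on both sides of zero get both colour families.
mixed <- shade_by_threshold(c(-2, -1, 1, 2), thresh = 0)

# Values that are all positive get red shades only when the threshold is 0,
# but are split into blue and red when the threshold is their median.
positive   <- c(8, 9.5, 10.5, 12)
at_zero    <- shade_by_threshold(positive, thresh = 0)
at_median  <- shade_by_threshold(positive, thresh = median(positive))
```

This mirrors the behaviour described in the question: with a threshold at the median, a tree whose estimates are all positive still shows blue boxes, while a threshold of zero keeps them all red.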

Adding significance bars within and between groups in dodged ggplot2 boxplots

I have some data that I would like to 1) plot as grouped boxplots, and 2) add significance bars A) between boxplots within each group and B) between specific boxplots of different groups. My data looks something like this:
library("ggplot2")
library("ggsignif") # provides geom_signif()
df <- data.frame(enzyme = c(rep("A", 9), rep("B", 9), rep("C", 9)),
substrate = c(rep("1", 3), rep("2", 3), rep("3", 3),
rep("1", 3), rep("4", 3), rep("5", 3),
rep("1", 3), rep("4", 3), rep("5", 3)),
AUC = c(6.64, 6.56, 6.21, 5.96, 6.12, 6.24, 6.02, 6.32, 6.12,
0, 0, 0, 5.99, 6.26, 5.94, 0, 0, 0,
0, 0, 0, 5.99, 6.11, 6.13, 0, 0, 0))
q <- ggplot(df, aes(x = enzyme, y = AUC, color = substrate)) +
geom_boxplot(show.legend = F,
position = position_dodge2(width = 0.75, preserve = "single")) +
geom_point(show.legend = F, size = 2, position = position_dodge2(width = 0.75, preserve = "single"))
plot(q)
I know that I can add significance bars between groups with the following:
q + geom_signif(comparisons = list(c("A", "B"), c("A", "C"), c("B", "C")),
test = "t.test", map_signif_level = T)
However, these comparisons are not meaningful for my data.
Instead, I would like to A) add significance bars between boxplots of the same group. I thought I could follow Simon's suggestion to add the bars manually by defining p-values, labels, and y coordinates for them (How to add significance bar between subgroups of box plot), though for my dataset this will be more difficult because I have three subgroups per group rather than two.
Ultimately, I would also like to B) add significance bars comparing two specific subgroups from different groups.
My question is, is there any easy way to do this using existing functions/packages? If I have to do this manually, can anyone suggest a good strategy? I would appreciate it!
I thought about this for a bit and figured out a lengthy solution. If anyone has a more succinct way of doing this, please let me know!
## significance bars within and between subgroups
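# rearrange df, one unique sample per column, rows are replicates
# (outer lapply keeps one matrix per enzyme so that cbind can join them;
# drop = TRUE skips unused substrate levels if substrate is a factor)
df.split <- do.call(cbind, lapply(split(df, df$enzyme), function(x) {
sapply(split(x, x$substrate, drop = TRUE), function(x) {x$AUC}) }) )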
# keep track of sample names
sample.names <- do.call(c, lapply(split(df, df$enzyme), function(x) {
unique(paste0(x$enzyme, ".", x$substrate)) }) )
colnames(df.split) <- sample.names
# perform statistical tests between every pairwise combination of
# samples/columns in df.split
df.tests <- apply(combn(seq_along(sample.names), 2), 2,
function(x) {
t.test(df.split[ ,x[1]], df.split[ ,x[2]])$p.value })
# keep track of sample pairs
sample.pairs <- apply(combn(seq_along(sample.names), 2), 2,
function(x) {
paste0(colnames(df.split)[x[1]], "X",
colnames(df.split)[x[2]]) })
names(df.tests) <- sample.pairs
# think about how the significance bars will be laid out: because there are
# three subgroups per enzyme, the bars for the three pairwise comparisons on
# the same plot would overlap. This needs to be done in layers
# select tests of interest for each layer
within.tests.1 <- c("A.1XA.2", "A.2XA.3",
"B.1XB.4", "B.4XB.5",
"C.1XC.4", "C.4XC.5")
within.tests.2 <- c("A.1XA.3", "B.1XB.5","C.1XC.5")
between.tests.1 <- c("A.1XB.4", "B.4XC.4")
between.tests.2 <- c("A.1XC.4")
p.values.1 <- df.tests[which(names(df.tests) %in% within.tests.1)]
p.values.2 <- df.tests[which(names(df.tests) %in% within.tests.2)]
p.values.3 <- df.tests[which(names(df.tests) %in% between.tests.1)]
p.values.4 <- df.tests[which(names(df.tests) %in% between.tests.2)]
# convert p-values into easily read labels; NaN p-values (from comparing two
# all-zero samples) are set to 1 so that they get an empty label
p.values.1 <- replace(p.values.1, is.na(p.values.1), 1)
p.values.2 <- replace(p.values.2, is.na(p.values.2), 1)
p.values.3 <- replace(p.values.3, is.na(p.values.3), 1)
p.values.4 <- replace(p.values.4, is.na(p.values.4), 1)
labels.1 <- symnum(p.values.1, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
labels.2 <- symnum(p.values.2, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
labels.3 <- symnum(p.values.3, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
labels.4 <- symnum(p.values.4, corr = FALSE, cutpoints = c(0, .001,.01,.05, 1),
symbols = c("***","**","*",""))
# determine coordinates for significance bars
# y values for layer 1 should all be just above the highest data point of all
# samples being compared
y.values.1 <- rep(do.call(max, lapply(unlist(strsplit(names(labels.1), "X")),
function(x) {
df.split[, which(colnames(df.split) %in% x)] }) ) + 0.3,
times = length(labels.1))
# y values for layer 2 should be higher than those of layer 1
y.values.2 <- y.values.1[c(1, 3, 5)] + 0.4
# y values for layer 3 should all be above the highest data point of all
# samples being compared, and higher than layer 2
y.values.3 <- rep(do.call(max, lapply(unlist(strsplit(names(labels.3), "X")),
function(x) {
df.split[, which(colnames(df.split) %in% x)] }) ) + 1.2,
times = length(labels.3))
# y values for layer 4 should be higher than those of layer 3
y.values.4 <- y.values.3[1] + 0.5
# for x values, first boxplot is always at x = 1
# since there are three groups per x = 1 and preserve = "single", the width of
# each subgroup boxplot is 0.25
x.min.values.1 <- c(0.75, 1, 1.75, 2, 2.75, 3)
x.max.values.1 <- x.min.values.1 + 0.25
x.min.values.2 <- c(0.75, 1.75, 2.75)
x.max.values.2 <- x.min.values.2 + 0.50
x.min.values.3 <- c(0.75, 2)
x.max.values.3 <- c(2, 3)
x.min.values.4 <- c(0.75)
x.max.values.4 <- c(3)
# finally, plot the significance bars for each layer, one on top of the other
q + geom_signif(y_position = y.values.1,
xmin = x.min.values.1,
xmax = x.max.values.1,
annotations = labels.1,
tip_length = rep(0.02, length(labels.1)),
vjust = 0.5 ) +
geom_signif(y_position = y.values.2,
xmin = x.min.values.2,
xmax = x.max.values.2,
annotations = labels.2,
tip_length = rep(0.04, length(labels.2)),
vjust = 0.5 ) +
geom_signif(y_position = y.values.3,
xmin = x.min.values.3,
xmax = x.max.values.3,
annotations = labels.3,
tip_length = rep(0.04, length(labels.3)),
vjust = 0.5 ) +
geom_signif(y_position = y.values.4,
xmin = x.min.values.4,
xmax = x.max.values.4,
annotations = labels.4,
tip_length = rep(0.06, length(labels.4)),
vjust = 0.5 )
The output looks like this:
[Image: boxplot with significance bars within and between groups]
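The pairwise testing and labelling steps can also be written more compactly in base R. The following is a sketch of that part only, not a replacement for the plotting code; safe_p is a hypothetical wrapper that returns NA if t.test fails, and the pair names use the same "X" separator as above.

```r
df <- data.frame(enzyme = rep(c("A", "B", "C"), each = 9),
                 substrate = c(rep(c("1", "2", "3"), each = 3),
                               rep(c("1", "4", "5"), each = 3),
                               rep(c("1", "4", "5"), each = 3)),
                 AUC = c(6.64, 6.56, 6.21, 5.96, 6.12, 6.24, 6.02, 6.32, 6.12,
                         0, 0, 0, 5.99, 6.26, 5.94, 0, 0, 0,
                         0, 0, 0, 5.99, 6.11, 6.13, 0, 0, 0))

# One numeric vector of replicates per enzyme/substrate sample, e.g. "A.1".
samples <- split(df$AUC, paste(df$enzyme, df$substrate, sep = "."))

# Hypothetical wrapper: return NA instead of stopping if the test fails.
safe_p <- function(a, b) {
  tryCatch(t.test(a, b)$p.value, error = function(e) NA_real_)
}

# Test every pair of samples and name each result after its pair.
pairs <- combn(names(samples), 2)
p.values <- apply(pairs, 2, function(pr) safe_p(samples[[pr[1]]], samples[[pr[2]]]))
names(p.values) <- paste(pairs[1, ], pairs[2, ], sep = "X")

# Undefined p-values (two all-zero samples) count as not significant.
p.values[is.na(p.values)] <- 1

# Convert p-values to star labels with the same cutpoints as above.
stars <- as.character(cut(p.values, breaks = c(0, 0.001, 0.01, 0.05, 1),
                          labels = c("***", "**", "*", ""),
                          include.lowest = TRUE))
names(stars) <- names(p.values)
```

Individual comparisons can then be looked up by name, for example stars[c("A.1XA.2", "A.1XB.4")], before passing them to geom_signif as annotations.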

R. lapply multinomial test to list of dataframes

I have a data frame A, which I split into a list of 100 data frames, each having 3 rows (In my real data each data frame has 500 rows). Here I show A with 2 elements of the list (row1-row3; row4-row6):
A <- data.frame(n = c(0, 1, 2, 0, 1, 2),
prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
count = c(24878, 33605, 12100 , 25899, 34777, 13765))
# This is the list:
nest <- split(A, rep(1:2, each = 3))
I want to apply the multinomial test to each of these data frames and extract the p-value of each test. So far I have done this:
library(EMT)
fun <- function(x){
multinomial.test(x$count,
prob=x$prob,
useChisq = FALSE, MonteCarlo = TRUE,
ntrial = 100, # n of withdrawals accomplished
atOnce=100)
}
lapply(nest, fun)
However, I get:
"Error in multinomial.test(x$counts_set, prob = x$norm_genome, useChisq = F, :
Observations have to be stored in a vector, e.g. 'observed <- c(5,2,1)'"
Does anyone have a smarter way of doing this?
The results of split are created with names 1, 2 and so on. That's why x$count in fun cannot access it. To make it simpler, you can combine your split elements using the list function and then use lapply:
n <- c(0,1,2,0,1,2)
prob <- c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1)
count <- c(24878, 33605, 12100 , 25899, 34777, 13765)
A <- cbind.data.frame(n, prob, count)
nest = split(A,rep(1:2,each=3))
fun <- function(x){
multinomial.test(x$count,
prob=x$prob,
useChisq = F, MonteCarlo = TRUE,
ntrial = 100, # n of withdrawals accomplished
atOnce=100)
}
# Create a list of split elements
new_list <- list(nest$`1`, nest$`2`)
lapply(new_list, fun)
A solution with dplyr.
A = data.frame(n = c(0,1,2,0,1,2),
prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
count = c(43, 42, 9, 74, 82, 9))
library(EMT) # provides multinomial.test()
library(dplyr)
nest <- A %>%
mutate(pattern = rep(1:2,each=3)) %>%
group_by(pattern) %>%
dplyr::summarize(mn_pvals = multinomial.test(count, prob)$p.value)
nest
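For counts as large as those in the question, the per-group pattern can be tried out quickly with base R. The sketch below swaps in chisq.test (a chi-squared goodness-of-fit test) for EMT's multinomial.test; that substitution is an assumption made only so the example runs without extra packages, and the two tests are not equivalent. It shows that split() returns a named list of data frames and that sapply() passes each data frame to the function in turn.

```r
A <- data.frame(n = c(0, 1, 2, 0, 1, 2),
                prob = c(0.4, 0.5, 0.1, 0.4, 0.5, 0.1),
                count = c(24878, 33605, 12100, 25899, 34777, 13765))

# One data frame of 3 rows per group, named "1" and "2".
nest <- split(A, rep(1:2, each = 3))

# Stand-in test (assumption): chi-squared goodness of fit on each group,
# keeping only the p-value.
p.values <- sapply(nest, function(x) chisq.test(x$count, p = x$prob)$p.value)
```

The result is a numeric vector with one p-value per group, named after the list elements.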

MXNet: sequence length in LSTM in a non-sequence data (R)

My data are not a time series, but they have sequential properties.
Consider one sample:
data1 = matrix(rnorm(10, 0, 1), nrow = 1)
label1 = rnorm(1, 0, 1)
label1 is a function of data1, but the data matrix is not a time series. I suppose that the label is a function not of just one data sample but also of older samples, which are naturally ordered in time (not sampled randomly); in other words, the data samples depend on one another.
I have a batch of examples, say, 16.
With that, I want to understand how I can design an RNN/LSTM model that will memorize all 16 examples from the batch to construct the internal state. I am especially confused by the seq_len parameter, which as I understand it is specifically the length of the time series used as input to the network, which is not the case here.
Now this piece of code (taken from a timeseries example) only confuses me because I don't see how my task fits in.
rm(symbol)
symbol <- rnn.graph.unroll(seq_len = 5,
num_rnn_layer = 1,
num_hidden = 50,
input_size = NULL,
num_embed = NULL,
num_decode = 1,
masking = F,
loss_output = "linear",
dropout = 0.2,
ignore_label = -1,
cell_type = "lstm",
output_last_state = F,
config = "seq-to-one")
graph.viz(symbol, type = "graph", direction = "LR",
graph.height.px = 600, graph.width.px = 800)
train.data <- mx.io.arrayiter(
data = matrix(rnorm(100, 0, 1), ncol = 20)
, label = rnorm(20, 0, 1)
, batch.size = 20
, shuffle = F
)
Sure, you can treat them as time steps, and apply LSTM. Also check out this example: https://github.com/apache/incubator-mxnet/tree/master/example/multivariate_time_series as it might be relevant for your case.
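One way to treat ordered samples as time steps is to stack each sample with its predecessors into a short sequence. The sketch below uses only base R. make_windows and the window length of 5 are hypothetical, and the (features, time steps, sequences) layout is an assumption that should be checked against what the iterator and rnn.graph.unroll actually expect.

```r
# Hypothetical helper: turn time-ordered samples into overlapping sequences.
# x: matrix with one column per sample, columns in time order
# y: one label per sample
# window: number of consecutive samples per sequence (plays the role of seq_len)
make_windows <- function(x, y, window) {
  n_seq <- ncol(x) - window + 1
  data <- array(0, dim = c(nrow(x), window, n_seq))
  for (i in seq_len(n_seq)) {
    data[, , i] <- x[, i:(i + window - 1), drop = FALSE]
  }
  # Each sequence is labelled with the label of its last (newest) sample.
  list(data = data, label = y[window:ncol(x)])
}

set.seed(1)
x <- matrix(rnorm(10 * 16), nrow = 10)  # 10 features, 16 ordered samples
y <- rnorm(16)
w <- make_windows(x, y, window = 5)
```

With 16 samples and a window of 5 this yields 12 sequences, each holding a sample and its 4 predecessors, so the label of a sequence depends on older samples as described in the question.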

R hierarchical clustering visualizing classifications without clustering on them

I've been studying some data sets using hierarchical clustering. In these data sets, there are a certain number of variables that I want to use to cluster data, and then there are other classification variables that I do not want to cluster on but still want to visualize.
What I want to do is find a way to "add a tier" to the heat map generated by the clustering algorithm where I can view binary classifications (colored red for 1, blue for 0), without actually clustering on this data. That way, I can evaluate how well my classification responses are grouped together by clustering.
Here is a simplified example:
library("gplots")
set.seed(1)
## creating random data to input into hierarchical clustering algorithm
data <- matrix(rexp(100, rate = 0.1), ncol = 10)
colnames(data) <- c("var1", "var2", "var3", "var4", "var5", "var6",
"var7", "var8", "var9", "var10")
# these are the two classification labels for each data point
classification1 <- c(1, 1, 0, 1, 1, 0, 0, 0, 1, 1)
# I want to visualize how well the groups found by the clustering algorithm
# agree with the classifications, without
# clustering on these classifications
classification2 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
par(mar = c(1, 4.5, 0.1, 0.1))
matrix = rbind(c(1, 2), c(3, 4), c(5, 6))
wid = c(1, 1)
hei = c(0.5, 10)
hclustfunc <- function(x) hclust(x, method = "complete")
distfunc <- function(x) dist(x, method = "euclidean")
my_palette <- colorRampPalette(c("yellow", "orange", "darkorange",
"red", "darkred"))(n = 1000)
heatmap.2(as.matrix(data), dendrogram = "row", trace = "none",
margins = c(8, 9), hclustfun = hclustfunc, distfun = distfunc,
col = my_palette, key = FALSE, key.xlab = "", key.title = "Clustering Algorithm",
key.ylab = "", keysize = 1.25, density.info = "density",
lhei = hei)
This generates the heat map that has given me a lot of information. What I would now like to do is append two more columns to the right of the heat map that the clustering algorithm does not use for clustering.
These two columns would be binary labels for "classification 1" and "classification 2" (and have a red cell for 1, a blue cell for 0). I just want a visualization of how well these classification responses are grouped together in the dendrogram.
If you have only one classification to add, you can just use heatmap.2 with the RowSideColors option. If you have multiple classifications to add, you can use heatmap.plus. Its options differ slightly from those of heatmap and heatmap.2, but the important one for your question is that its RowSideColors option takes a matrix.
library(heatmap.plus)
# index 1 (label 0) is blue, index 2 (label 1) is red
class1_cols <- c('blue', 'red')[classification1 + 1]
class2_cols <- c('blue', 'red')[classification2 + 1]
anno <- data.frame(class1 = class1_cols, class2 = class2_cols)
heatmap.plus(as.matrix(data), col = my_palette,
RowSideColors = as.matrix(anno))
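The colour matrix can be checked on its own before it is handed to the heat map. This is a minimal base R sketch; to_colour is a hypothetical helper that spells out the mapping asked for in the question (red for 1, blue for 0).

```r
classification1 <- c(1, 1, 0, 1, 1, 0, 0, 0, 1, 1)
classification2 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0)

# Hypothetical helper: red for label 1, blue for label 0.
to_colour <- function(labels) ifelse(labels == 1, "red", "blue")

# Character matrix with one row per data point and one column per classification.
anno <- cbind(class1 = to_colour(classification1),
              class2 = to_colour(classification2))
```

The matrix must have as many rows as the data matrix being clustered, in the same row order, since the colours are reordered together with the rows.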
