Percentage stacked barplot in Julia

I would like to create a percentage stacked barplot in Julia. In R we may do the following:
set.seed(7)
data <- matrix(sample(1:30,6), nrow=3)
colnames(data) <- c("A","B")
rownames(data) <- c("V1","V2","V3")
library(RColorBrewer)
cols <- brewer.pal(3, "Pastel1")
df_percentage <- apply(data, 2, function(x){x*100/sum(x,na.rm=T)})
barplot(df_percentage, col=cols, border="white", xlab="group")
Created on 2022-12-29 with reprex v2.0.2
I can already label the axis in percentages, but I cannot get the bars stacked so that each one sums to 100%, with a percentage for each segment, as in the plot above. Here is some reproducible code:
using StatsPlots
measles = [38556, 24472]
mumps = [20178, 23536]
chickenPox = [37140, 32169]
ticklabel = ["A", "B"]
foo = @. measles + mumps + chickenPox
my_range = LinRange(0, maximum(foo), 11)
groupedbar(
[measles mumps chickenPox],
bar_position = :stack,
bar_width=0.7,
xticks=(1:2, ticklabel),
yticks=(my_range, 0:10:100),
label=["measles" "mumps" "chickenPox"]
)
Output:
This is almost what I want. So I was wondering if anyone knows how to make a stacked percentage barplot like above in Julia?

You just need to change the maximum threshold of the LinRange to be fitted to the maximum value of bars (which is 1 in this case), and change the input data for plotting to be the proportion of each segment:
my_range = LinRange(0, 1, 11)
foo = @. measles + mumps + chickenPox
groupedbar(
[measles./foo mumps./foo chickenPox./foo],
bar_position = :stack,
bar_width=0.7,
xticks=(1:2, ["A", "B"]),
yticks=(my_range, 0:10:100),
label=["measles" "mumps" "chickenPox"],
legend=:outerright
)
If you want to have the percentages on each segment, then you can use the following function:
function percentages_on_segments(data)
first_phase = permutedims(data)[end:-1:1, :]
a = [0 0;first_phase]
b = accumulate(+, 0.5*(a[1:end-1, :] + a[2:end, :]), dims=1)
c = vec(b)
annotate!(
repeat(1:size(data, 1), inner=size(data, 2)),
c,
["$(round(100*item, digits=1))%" for item=vec(first_phase)],
:white
)
end
percentages_on_segments([measles./foo mumps./foo chickenPox./foo])
Note that [measles./foo mumps./foo chickenPox./foo] is the same data that I passed to the groupedbar function.

Related

How to use a for loop over negative (and positive) values in R?

I am trying to use a for-loop over a range of positive and negative values and then plot the results. However, I'm having trouble getting R to plot the correct values, since the negative values seem to screw up the index.
More precisely, the code I am running is:
# Setup objects
R = (1:20)
rejection = rep(NA, 20)
t = seq(from = -10, to = 10, by = 1)
avg_rej_freq = rep(NA, 21)
# Test a hypothesis for each possible value of x and each replication
for (x in t) {
for (r in R) {
# Generate 1 observation from N(x,1)
y = rnorm(1, x, 1)
# Take the average of this observation
avg_y = mean(y)
# Test this observation using the test we found in part a
if (avg_y >= 1 + pnorm(.95))
{rejection[r] = 1}
if (y < 1 + pnorm(.95))
{rejection[r] = 0}
}
# Calculate the average rejection frequency across the 20 samples
avg_rej_freq[x] = mean(rejection)
}
# Plot the different values of x against the average rejection frequency
plot(t, avg_rej_freq)
The resulting graph should look something like this
# Define the rejection probability for n=1
rej_prob = function(x)(1-pnorm(1-x+qnorm(0.95)))
# Plot it
curve(rej_prob,from = -10, to = 10, xlab = expression(theta),
ylab = "Rejection probability")
...but there's clearly something wrong with my code that is shifting the positive values on the graph over to the left.
Any help on how to fix this would be much appreciated!

Yep, as you suspected, the negative indices are causing the problem. In R a negative index means "every element except that position", and index 0 selects nothing, so avg_rej_freq[x] = ... overwrites almost the whole vector when x is negative and does nothing when x is 0. Instead, try using seq_along to produce a vector of all positive indices and looping over those:
# Setup objects
R = (1:20)
rejection = rep(NA, 20)
t = seq(from = -10, to = 10, by = 1)
avg_rej_freq = rep(NA, 21)
# Test a hypothesis for each possible value of x and each replication
for (x in seq_along(t)) {
for (r in R) {
# Generate 1 observation from N(x,1)
# Now we ask for the value of t at index x rather than t directly
y = rnorm(1, t[x], 1)
# Take the average of this observation
avg_y = mean(y)
# Test this observation using the test we found in part a
if (avg_y >= 1 + pnorm(.95))
{rejection[r] = 1}
if (y < 1 + pnorm(.95))
{rejection[r] = 0}
}
# Calculate the average rejection frequency across the 20 samples
avg_rej_freq[x] = mean(rejection)
}
# Plot the different values of x against the average rejection frequency
plot(t, avg_rej_freq)
which now plots each average rejection frequency against the correct value of t.
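To see why the original loop misplaced the values, it may help to run R's indexing rules on a small vector. This is only a minimal sketch in base R; the names v, t_vals and res are made up for illustration and are not part of the question's code.

```r
# Hypothetical 5-element vector, used only to illustrate R's indexing rules
v <- rep(NA_real_, 5)

# A negative index means "every position except this one",
# so this assignment overwrites positions 1, 2, 4 and 5
v[-3] <- 9
print(v)

# Index 0 selects nothing, so this assignment is silently ignored
v[0] <- 100
print(v)

# Positive positions from seq_along() address exactly one slot each,
# even when the values being looped over are negative or zero
t_vals <- c(-2, 0, 2)
res <- numeric(length(t_vals))
for (i in seq_along(t_vals)) {
  res[i] <- t_vals[i]^2
}
print(res)
```

The same pattern applies to the loop above: the loop variable is a position in t, and t[x] recovers the actual value.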

Not sure why you want to simulate the vectorized function pnorm() using for loops, but here is your code with the mistakes corrected (check the comments):
# Test a hypothesis for each possible value of x and each replication
for (x in t) {
for (r in R) {
# Generate 1 observation from N(x,1)
y = rnorm(1, x, 1)
# no need to take average since you have a single observation
# Test this observation using the test we found in part a
rejection[r] = ifelse(y >= 1 + pnorm(.95), 1, 0)
}
# Calculate the average rejection frequency across the 20 samples
# `R` vector index starts from 1, transform your x values s.t., negative values become positive
avg_rej_freq[x-min(t)+1] = mean(rejection)
}
# Define the rejection probability for n=1
rej_prob = function(x)(1-pnorm(1-x+qnorm(0.95)))
# Plot it
curve(rej_prob,from = -10, to = 10, xlab = expression(theta),
ylab = "Rejection probability")
# plot your points
points(t, avg_rej_freq, pch=19, col='red')

Not sure why you need the for loops; what you are doing can be collapsed into one line. The rest of the code is taken from @Sandipan Dey:
R <- 20
t <- seq(from = -10, to = 10, by = 1)
#All the for-loops collapsed into this one line:
avg_rej_freq <- rowMeans(matrix(rnorm(R * length(t), t), 21) >= 1 + pnorm(.95))
rej_prob <- function(x) 1 - pnorm(1 - x + qnorm(0.95))
curve(rej_prob,from = -10, to = 10, xlab = expression(theta),
ylab = "Rejection probability")
# plot your points
points(t, avg_rej_freq, pch=19, col='red')

paired data for a facet_wrap

Imagine I have data foo below. Each row contains a measurement (y) on a species and each species is paired with another (species.pair). So in the example below, species a is paired with e, b with f, and so on. The number of observations for each species varies. I'd like to plot the density of each species's distribution along with its partner's distribution in its own facet. Below I hand coded this with the column sppPairs. The species are all unique and each has a match in species.pair. I'm unsure of how to make the grouping column sppPairs below. I'm sure there is some clever way to do this with {dplyr} but I can't figure out what to do. Some kind of pasting species to species.pair I imagine? Any help much appreciated.
foo <- data.frame(species = rep(letters[1:8],each=10),
species.pair = rep(letters[c(5:8,1:4)],each=10),
y=rnorm(80))
# species and species pair match exactly
all(unique(foo$species) %in% unique(foo$species.pair))
# what I want
foo$sppPairs <- c(rep("a:e",10),
rep("b:f",10),
rep("c:g",10),
rep("d:h",10),
rep("a:e",10),
rep("b:f",10),
rep("c:g",10),
rep("d:h",10))
p1 <- ggplot(foo,aes(y,fill=species))
p1 <- p1 + geom_density(alpha=0.5)
p1 <- p1 + facet_wrap(~sppPairs)
p1

Yes, you can use apply on the appropriate columns to paste the sorted elements together in the correct order (otherwise a:e is different from e:a and so on, and you end up with 8 groups instead of 4):
library(ggplot2)
foo <- data.frame(species = rep(letters[1:8], each = 10),
species.pair = rep(letters[c(5:8, 1:4)], each = 10),
y = rnorm(80))
foo$sppPairs <- apply(foo[c("species", "species.pair")], 1,
function(x) paste(sort(x), collapse = ":"))
ggplot(foo, aes(y, fill = species)) +
geom_density(alpha = 0.5) +
facet_wrap(~sppPairs)
Created on 2020-10-05 by the reprex package (v0.3.0)
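For larger data frames, a vectorized variant of the same idea might be preferable to a row-wise apply. The sketch below assumes both columns can be treated as character vectors and uses pmin and pmax to put the alphabetically smaller name first; the helper names sp and pr are introduced here only for illustration.

```r
foo <- data.frame(species = rep(letters[1:8], each = 10),
                  species.pair = rep(letters[c(5:8, 1:4)], each = 10),
                  y = rnorm(80))

# Work on character vectors so the comparison is alphabetical
sp <- as.character(foo$species)
pr <- as.character(foo$species.pair)

# pmin/pmax pick the element-wise smaller/larger name,
# so "a","e" and "e","a" both give the key "a:e"
foo$sppPairs <- paste(pmin(sp, pr), pmax(sp, pr), sep = ":")

# Each of the 4 pair keys should cover 20 rows
print(table(foo$sppPairs))
```

The resulting sppPairs column can be passed to facet_wrap(~sppPairs) exactly as in the answer above.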

R print groups of data points in different colors

I'm doing some basic statistics in R and I'm trying to have a different color for each iteration of the loop. So all the data points for i=1 should have the same color, all the data points for i=2 should have the same color, etc. Ideally the colors would range from yellow to blue as i varies. (I already tried colorRamp etc. but didn't manage to get it working.)
Thanks for your help.
library(ggplot2)
#dput(thedata[,2])
#c(1.28994585412464, 1.1317747077577, 1.28029504741834, 1.41172820353708,
#1.13172920065253, 1.40276516298315, 1.43679599499374, 1.90618019359643,
#2.33626745030772, 1.98362330686504, 2.22606615548188, 2.40238822720322)
#dput(thedata[,4])
#c(NA, -1.7394747097211, 2.93081902519318, -0.33212717268786,
#-1.78796119503752, -0.5080871442002, -0.10110379236627, 0.18977632798691,
#1.7514277696687, 1.50275797771879, -0.74632159611221, 0.0978774103243802)
#OR
#dput(thedata[,c(2,4)])
#structure(list(LRUN74TTFRA156N = c(1.28994585412464, 1.1317747077577,
#1.28029504741834, 1.41172820353708, 1.13172920065253, 1.40276516298315,
#1.43679599499374, 1.90618019359643, 2.33626745030772, 1.98362330686504,
#2.22606615548188, 2.40238822720322), SELF = c(NA, -1.7394747097211,
#2.93081902519318, -0.33212717268786, -1.78796119503752, -0.5080871442002,
#-0.10110379236627, 0.18977632798691, 1.7514277696687, 1.50275797771879,
#-0.74632159611221, 0.0978774103243802)), row.names = c(NA, 12L
#), class = "data.frame")
x1=1
xn=x1+3
plot(0,0,col="white",xlim=c(0,12),ylim=c(-5,7.5))
for(i in 1:3){
y=thedata[x1:xn,4]
x=thedata[x1:xn,2]
reg<-lm(y~x)
points(x,y,col=colors()[i])
abline(reg,col=colors()[i])
x1=x1+4
xn=x1+3
}

The basic idea of colorRamp and colorRampPalette is that they are function factories: they are functions that return functions.
From the help page:
colorRampPalette returns a function that takes an integer argument (the required number of colors) and returns a character vector of colors (see rgb) interpolating the given sequence (similar to heat.colors or terrain.colors).
So, we'll get a yellow-to-blue palette function from colorRampPalette, and then we'll give it the number of colors we want along that ramp to actually get the colors:
# create the palette function
my_palette = colorRampPalette(colors = c("yellow", "blue"))
# test it out, see how it works
my_palette(3)
# [1] "#FFFF00" "#7F7F7F" "#0000FF"
my_palette(5)
# [1] "#FFFF00" "#BFBF3F" "#7F7F7F" "#3F3FBF" "#0000FF"
# Now on with our plot
x1 = 1
xn = x1 + 3
# Set the number of iterations (number of colors needed) as a variable:
nn = 3
# Get the colors from our palette function
my_cols = my_palette(nn)
# type = 'n' means nothing will be plotted, no points, no lines
plot(0, 0, type = 'n',
xlim = c(0, 12),
ylim = c(-5, 7.5))
# plot
for (i in 1:nn) {
y = thedata[x1:xn, 2]
x = thedata[x1:xn, 1]
reg <- lm(y ~ x)
# use the ith color
points(x, y, col = my_cols[i])
abline(reg, col = my_cols[i])
x1 = x1 + 4
xn = x1 + 3
}
You can play with just visualizing the palette---try out the following code for different n values. You can also try out different options, maybe different starting colors. I like the results better with the space = "Lab" argument for the palette.
n = 10
my_palette = colorRampPalette(colors = c("yellow", "blue"), space = "Lab")
n_palette = my_palette(n)
plot(1:n, rep(1, n), col = n_palette, pch = 15, cex = 4)
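If a loop is not otherwise needed, one possible shortcut is to index the palette by a group number, which yields one color per data point in a single step. This is a sketch under the assumption that the 12 points fall into 3 consecutive blocks of 4, as in the question; the names group and point_cols are hypothetical.

```r
# Palette function interpolating from yellow to blue
my_palette <- colorRampPalette(colors = c("yellow", "blue"))

# Assumed grouping: 12 points in 3 consecutive blocks of 4
group <- rep(1:3, each = 4)

# Indexing the 3 palette colors by group gives one color per point
point_cols <- my_palette(max(group))[group]
print(point_cols)
```

A vector like point_cols can be passed directly as the col argument of points() or plot(), since col accepts one color per point.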

Besides lacking a reproducible example, you seem to have some misconceptions.
First, the function colors doesn't take a numeric argument, see ?colors. So if you want to fetch a different color in each iteration, you need to call it like colors()[i]. The code should look something similar to this (in absence of a reproducible example):
for (i in 20:30){
plot(1:10, 1:10, col = colors()[i])
}
Please bear in mind that using x1 and xn in the first two lines inside the for loop before they are defined will also cause an error.

R - Add trace in Plotly if condition is met

I am creating a scatter plot graph and I would like the points connected by a trace if a condition is met. My data is separated into sequences, if an X and Y coordinate is part of the same sequence, I would like there to be a trace. My sample data and code snippet is below.
Sample Data:
X Y Seq
1 3 1
2 5 1
1 4 1
3 1 2
4 5 2
6 3 3
3 4 3
In this example I would like points (1, 3), (2, 5), (1, 4) traced, points (3, 1), (4, 5) traced, and points (6, 3), (3, 4) traced. There should be a break in the trace if a new sequence starts.
Code:
plot_ly (data, x = data$X , y = data$Y,
type = "scatter",
mode="markers")%>%
add_trace(data$Seq==shift(data$Seq, type="lag"), mode="lines")
Here is an image of the plot that my actual data is giving me. You can see the points are being plotted but there is no break.

The problem lies in your use of add_trace. You're passing what I assume is a subset of your data to the first argument of add_trace when this argument expects an existing plot/trace. The problem is, since you're piping in with %>% the function is inheriting the original data and ignoring your subset.
Note that the below will give the same plot even though my variable NO has nothing to do with the plot:
library(plotly)
X=c(1,2,1,3,4,6,3)
Y=c(3,5,4,1,5,3,4)
seq=c(1,1,1,2,2,3,3)
dataX <- data.frame(X,Y,seq)
NO <- "this won't work"
plot_ly(dataX, x = dataX$X , y = dataX$Y,
type = "scatter",
mode="markers") %>%
add_trace(NO, mode="lines")
You can fix this with inherit=F, but then it won't work because add_trace is trying to add something to the plot NO which isn't a plot (and your subset wouldn't work either)
plot_ly (dataX, x = dataX$X , y = dataX$Y,
type = "scatter",
mode="markers") %>%
add_trace(NO, mode="lines", inherit=FALSE)
## No trace type specified:
When you add traces you want to be explicit in the x= and y=. Then you can allow it to automatically inherit the previous plot/trace, or specify one. As for what you're trying to do, you could build it up with a loop:
#make the plot
p <- plot_ly (dataX, x = dataX$X , y = dataX$Y,
type = "scatter",
mode="markers")
#build it up
for(i in levels(factor(dataX$seq))){
#subset data
dataFilt <- dataX[dataX$seq==i,]
#add it
p <- add_trace(p, x=dataFilt$X, y=dataFilt$Y,mode="lines",color ='yellow')
}
p
This makes a new series each time so it's a bit of a work around. You can hide the legend and it looks correct:
p %>%
layout(showlegend = FALSE)
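Another approach that may avoid one trace per sequence is to insert a row of missing values after each sequence, so that a single line breaks wherever a new sequence starts. The sketch below only prepares the data in base R; it assumes the plotting layer leaves a gap at NA, which base R's lines() does and which, to my understanding, plotly's line traces also do unless connectgaps is enabled. The names parts and with_gaps are hypothetical.

```r
dataX <- data.frame(X = c(1, 2, 1, 3, 4, 6, 3),
                    Y = c(3, 5, 4, 1, 5, 3, 4),
                    seq = c(1, 1, 1, 2, 2, 3, 3))

# Split the rows by sequence id
parts <- split(dataX, dataX$seq)

# Append one all-NA coordinate row after each sequence to act as a line break
with_gaps <- do.call(rbind, lapply(parts, function(d) {
  rbind(d, data.frame(X = NA, Y = NA, seq = d$seq[1]))
}))

print(with_gaps)
```

Drawing with_gaps$X against with_gaps$Y as one line then gives three separate segments, one per sequence, while the markers can still be drawn from the original dataX.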

Generating "2D" histogram in R

I am new to R and I would like to know how to generate histograms for the following situation :
I initially have a regular frequency table with 2 columns : Column A is the category (or bin) and Column B is the number of cases that fall in that category
Col A Col B
1-10 7
11-20 4
21-30 5
From this initial frequency table, I create a table with 3 columns : Col A is again the category (or bin), but now Col B is the "fraction of total cases", so for the category 1-10, column B will have the value 7/(7+4+5) = 7/16 . Now there is also a third column, Col C which is "fraction of total cases falling between the categories 1-20", so for 1-10, the value for Col C would be 7/(7+4) = 7/11. The complete table would look like below :
Col A Col B Col C
1-10 7/16 7/11
11-20 4/16 4/11
21-30 5/16 0
How do I generate a histogram from this 3-column table above ? My X axis should be the bin (1-10, 11-20 etc.) and my Y axis should be the fraction, however for every bin I have two fractions (Col B and Col C), so there will be two fraction "bars" for every bin in the histogram.
Any help would be greatly appreciated.

The data:
dat <- data.frame(A = c("1-10", "11-20", "21-30"), B = c(7, 4, 5))
Now, calculate the proportions and create a new object:
dat2 <- rbind(B = dat$B/sum(dat$B), C = c(dat$B[1:2]/sum(dat$B[1:2]), 0))
colnames(dat2) <- dat$A
Plot:
barplot(dat2, beside = TRUE, legend = rownames(dat2))

Your title should be changed to "Dodged Bar Chart" instead of 2D histogram, because histograms, unlike bar charts, have a continuous scale on the x axis and are basically used for comparing the distributions of univariate data, or the distributions of univariate data modeled on a dependent factor. You are trying to compare colB vs colC, which can be effectively visualized using a 2D scatter plot but not with a bar chart. A better way to compare the distributions of colB and colC using histograms would be to plot two histograms separately and check the change in location of the data points.
If you want to compare the distributions of colB and colC, try the following code. I rounded the values to get reasonable data matching your description. Note that the code samples at random, so every time you run it there will be a slight change in the distribution, but that will not affect the comparison between colB and colC.
library("ggplot2")
# 44 datapoints between 1-10
a <- rep(1:10, 4)
a <- c(a, sample(a, size=4, replace=FALSE))
# 25 datapoints between 11-20
b <- rep(11:20, 2)
b <- c(b, sample(b, size=5, replace=FALSE))
# 31 datapoints between 21-30
c <- rep(21:30, 3)
c <- c(c, sample(c, size=1, replace=FALSE))
colB <- c(a, b, c)
# 64 datapoints between 1-10
a <- rep(1:10, 6)
a <- c(a, sample(a, size=4, replace=FALSE))
# 36 datapoints between 11-20
b <- rep(11:20, 3)
b <- c(b, sample(b, size=6, replace=FALSE))
colC <- c(a, b)
df <- data.frame(cbind(colB, colC=colC))
write.table(df, file = "data")
data <- read.table("data", header=TRUE)
data
ggplot(data=data, aes(x=colB, xmin=1, xmax=30)) + stat_bin(binwidth = 1)
ggplot(data=data, aes(x=colC, xmin=1, xmax=30)) + stat_bin(binwidth = 1)
# if you want density distribution, then you can try something like this:
ggplot(data=data, aes(x=colB, y = ..density.., xmin=1, xmax=30)) + stat_bin(binwidth = 1)
ggplot(data=data, aes(x=colC, y = ..density.., xmin=1, xmax=30)) + stat_bin(binwidth = 1)
HTH
-Sathish