How can I plot the difference between two histograms? (Julia)

Some time ago I asked the same question here and someone answered with exactly what I wanted.
Using fit(Histogram, ...) and the resulting weights you can do it (just like the picture below).
julia> using StatsBase, Random; Random.seed!(0);
julia> x1, x2 = rand(100), rand(100);
julia> h1 = fit(Histogram, x1, 0:0.1:1);
julia> h2 = fit(Histogram, x2, 0:0.1:1);
julia> using Plots
julia> p1 = plot(h1, α=0.5, lab="x1") ; plot!(p1, h2, α=0.5, lab="x2")
julia> p2 = bar(0:0.1:1, h2.weights - h1.weights, lab="diff")
julia> plot(p1, p2)
The problem is that I can't use fit; I need to use Histogram(...), and that one doesn't have .weights.
How can I do this using Histogram?
This is what I'm using:
using Plots
using StatsBase
h1 = histogram(Group1, bins=B, normalize=:probability, labels="Group 1")
h2 = histogram(Group2, bins=B, normalize=:probability, labels="Group 2")

Technically there is no Histogram function in any common Julia package; perhaps you mean either the Histogram (capital h) type provided by StatsBase, or the histogram (lowercase h) function provided by Plots.jl? In either case though, the answer is "you can't".
If you mean histogram from Plots.jl there is unfortunately no practical way to access that underlying data. If you mean Histogram from StatsBase on the other hand, that only works with fit (it's a type, not a function that can be used on its own).
There are other histogram packages though if for any reason you cannot or do not want to use StatsBase and fit, including FastHistograms.jl and NaNStatistics.jl, both of which are additionally somewhat faster than StatsBase for simple cases. So, for example
using NaNStatistics, Plots
a,b = rand(100), rand(100)
dx = 0.1
binedges = 0:dx:1
aw = histcounts(a, binedges)
bw = histcounts(b, binedges)
bar(binedges, aw-bw, label="difference", bar_width=dx)
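Since the question uses normalize=:probability, here is a minimal sketch of the same idea with probability-normalized weights, assuming StatsBase and fit are acceptable after all. The names group1, group2 and edges are hypothetical stand-ins for the question's Group1, Group2 and B.

```julia
using StatsBase, Plots

# Hypothetical stand-ins for the question's Group1, Group2 and B
group1, group2 = rand(100), rand(100)
edges = 0:0.1:1

# Bin both samples on the same edges so the bins line up
h1 = fit(Histogram, group1, edges)
h2 = fit(Histogram, group2, edges)

# Convert raw counts to probabilities (each set of weights sums to 1)
p1 = h1.weights ./ sum(h1.weights)
p2 = h2.weights ./ sum(h2.weights)

# One bar per bin, centered on the bin midpoints
centers = (edges[1:end-1] .+ edges[2:end]) ./ 2
bar(centers, p2 .- p1, bar_width=step(edges), label="difference")
```

Dividing by the sum of the weights is only one way to normalize; it should correspond to what normalize=:probability does in Plots, but that is worth checking against your own data.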

Related

Cannot convert Array{Any,2} to series data for plotting

I am learning Julia from a Coursera course.
using DelimitedFiles
EVDdata = DelimitedFiles.readdlm("wikipediaEVDdatesconverted.csv", ',')
# extract the data
epidays = EVDdata[:,1]
EVDcasesbycountry = EVDdata[:, [4, 6, 8]]
# load Plots and plot them
using Plots
gr()
plot(epidays, EVDcasesbycountry)
I am getting the error message Cannot convert Array{Any,2} to series data for plotting,
but in the course the lecturer plots the data successfully. Where am I going wrong?
I searched for the error and ended up at something about parsing strings into integers, since the data set may contain string values.
Or am I missing something else?
I found this to be working for me:
# extract the data
epidays = Array{Integer}(EVDdata[:,1])
EVDcasesbycountry = Array{Integer}(EVDdata[:, [4, 6, 8]])
# load Plots and plot them
using Plots
gr()
plot(epidays, EVDcasesbycountry)
It's a bit hard to tell what's going on in Coursera, as it's not clear what versions of Plots and DataFrames the video is using.
The error you're seeing however is telling you that a 2-dimensional Array (i.e. a matrix) can't be converted to a single series for plotting. This is because plot is supposed to be called with two vectors, one for x and one for y values:
plot(epidays, EVDdata[:, 4])
You can plot multiple columns in a loop:
p = plot()
for c in eachcol(EVDdata[:, [4, 6, 8]])
    plot!(p, epidays, c)
end
display(p)
There is also StatsPlots.jl, which extends the standard Plots.jl package with frequently needed "data science-y" plotting functions. In this case you could use the @df macro for plotting DataFrames; this is one of the examples from the Readme, with the StatsPlots import added:
using StatsPlots
using DataFrames, IndexedTables
df = DataFrame(a = 1:10, b = 10 .* rand(10), c = 10 .* rand(10))
@df df plot(:a, [:b :c], colour = [:red :blue])
Finally, there are some more grammar-of-graphics inspired plotting packages in Julia which are focused on plotting DataFrames, e.g. the pure-Julia Gadfly.jl, or the VegaLite wrapper VegaLite.jl
You can also try this
using StatsPlots
gr()
using DataFrames, IndexedTables
df = DataFrame(EVDdata)
@df df plot(:x1, [:x4 :x6 :x8], marker = ([:octagon :star7 :square], 9), title = "EVD in West Africa, epidemic segregated by country", xlabel = "Days since 22 March 2014", ylabel = "Number of cases to date", line = (:scatter), colour = [:red :blue :black])
On the other hand, this tutorial does (apparently) the same thing as the coursera plot and it works.
https://docs.juliaplots.org/latest/tutorial/#Basic-Plotting:-Line-Plots
x = 1:10; y = rand(10, 2) # 2 columns means two lines
plot(x, y)
And I haven't figured out why either...
Update: The staff answer is that maybe "Julia no longer supports plot 'Array{Any,2}'", and a simple workaround is to convert the EVDcasesbycountry data to Int like this:
epidays = EVDdata[:,1]
EVDcasesbycountry = convert.(Int, EVDdata[:, [4, 6, 8]])
It worked for me and is fairly consistent with my first answer: when I checked the element types of x and y in the tutorial example, they weren't Any, unlike the data in epidays and EVDcasesbycountry.
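To see the element-type issue in isolation, here is a small sketch that needs no plotting package. The matrix raw is a hypothetical stand-in for what readdlm returns on a file with mixed content; the real EVD data is not reproduced here.

```julia
# Hypothetical stand-in for a readdlm result: a Matrix{Any} holding numbers
raw = Any[0 10 20; 1 15 25; 2 30 45]

# The container's element type is Any even though every entry is an Int
@show eltype(raw)

# Broadcasting a concrete type over the data yields concretely typed arrays
days  = Int.(raw[:, 1])          # Vector{Int}
cases = Float64.(raw[:, 2:3])    # Matrix{Float64}, one column per series

@show eltype(days) eltype(cases)
```

If a column still contains strings (for example empty cells), the conversion will throw, and those entries need to be cleaned or parsed first.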
https://docs.juliaplots.org/latest/generated/gr/
This contains some nice examples
As for the problem itself, you can pass a vector of vectors instead of the matrix for plotting:
using Plots
gr()
y = Vector[EVDdata[:,4], EVDdata[:,6], EVDdata[:,8]]
plot(
    epidays, y,
    color = [:black :orange :red],
    line = (:scatter),
    marker = ([:hex :d :star4], 5)
)

Multiple histograms in Julia using Plots.jl

I am working with a large number of observations and to really get to know it I want to do histograms using Plots.jl
My question is how I can do multiple histograms in one plot as this would be really handy. I have tried multiple things already, but I am a bit confused with the different plotting sources in julia (plots.jl, pyplot, gadfly,...).
I don't know if it would help for me to post some of my code, as this is a more general question. But I am happy to post it, if needed.
There is an example that does just this:
using Plots
pyplot()
n = 100
x1, x2 = rand(n), 3rand(n)
# see issue #186... this is the standard histogram call
# our goal is to use the same edges for both series
histogram(Any[x1, x2], line=(3,0.2,:green), fillcolor=[:red :black], fillalpha=0.2)
I looked for "histograms" in the Plots.jl repo, found this related issue and followed the links to the example.
With Plots, there are two possibilities to show multiple series in one plot:
First, you can use a matrix, where each column constitutes a separate series:
a, b, c = randn(100), randn(100), randn(100)
histogram([a b c])
Here, hcat is used to concatenate the vectors (note the spaces instead of commas).
This is equivalent to
histogram(randn(100,3))
You can apply options to the individual series using a row matrix:
histogram([a b c], label = ["a" "b" "c"])
(Again, note the spaces instead of commas)
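The difference between spaces and commas can be checked directly in the REPL. This is just a small sketch of the array shapes involved, using only base Julia:

```julia
a, b, c = randn(100), randn(100), randn(100)

# Spaces concatenate horizontally: one 100×3 matrix, one column per series
m = [a b c]

# Commas build a vector whose three elements are themselves vectors
v = [a, b, c]

# Spaces between strings give a 1×3 row matrix, one label per series
labels = ["a" "b" "c"]

@show size(m) size(v) size(labels)
```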
Second, you can use plot! and its variants to update a previous plot:
histogram(a) # creates a new plot
histogram!(b) # updates the previous plot
histogram!(c) # updates the previous plot
Alternatively, you can specify which plot to update:
p = histogram(a) # creates a new plot p
histogram(b) # creates an independent new plot
histogram!(p, c) # updates plot p
This is useful if you have several subplots.
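When overlaying histograms this way, it can help to give every call the same bin edges so the bars line up. A minimal sketch, assuming hypothetical edges chosen by hand to cover both samples:

```julia
using Plots

a = randn(100)
b = randn(100) .+ 1

# Hypothetical common edges: 30 equal-width bins from -4 to 5
edges = range(-4, 5; length=31)

p = histogram(a; bins=edges, alpha=0.5, label="a")   # creates plot p
histogram!(p, b; bins=edges, alpha=0.5, label="b")   # adds to p with the same bins
```

Values outside the chosen range are not shown, so the edges should be picked from the data (for example via extrema) in real use.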
Edit:
Following Felipe Lema's links, you can implement a recipe for histograms that share the edges:
using StatsBase
using PlotRecipes
function calcbins(a, bins::Integer)
    lo, hi = extrema(a)
    StatsBase.histrange(lo, hi, bins) # nice edges
end
calcbins(a, bins::AbstractVector) = bins
@userplot GroupHist
@recipe function f(h::GroupHist; bins = 30)
    args = h.args
    length(args) == 1 || error("GroupHist should be given one argument")
    bins = calcbins(args[1], bins)
    seriestype := :bar
    bins, mapslices(col -> fit(Histogram, col, bins).weights, args[1], 1)
end
grouphist(randn(100,3))
Edit 2:
Because it is faster, I changed the recipe to use StatsBase.fit for creating the histogram.

How do you color (x,y) scatter plots according to values in z using Plots.jl?

Using the Plots.jl package in Julia, I am able to use various backends to make a scatter plot based on two vectors x and y
k = 100
x = rand(k)
y = rand(k)
scatter(x, y)
I am unable to find information about how to color them according to some length k vector z. How do you do that?
The following method will be much better than jverzani's (you don't want to create a new series for every data point). Plots could use some additional love for manually defining color vectors, but right now gradients are pretty well supported, so you can take advantage of that.
using Plots
pyplot(size=(400,200), legend=false) # set backend and set some session defaults
scatter(rand(30),
m = ColorGradient([:red, :green, :blue]), # colors are defined by a gradient
zcolor = repeat( [0,0.5,1], 10) # sample from the gradient, cycling through: 0, 0.5, 1
)
I would have thought if you defined k as a vector of color symbols this would work: scatter(x, y, markercolors=k), but it doesn't seem to. However, adding them one at a time will, as this example shows:
using Plots
xs = rand(10)
ys = rand(10)
ks = rand(Bool, 10) .+ 1 # 1 or 2 (randbool was removed from Julia)
mcols = [:red, :blue] # together, mcols[ks] is the `k` in the question
p = scatter(xs[ks .== 1], ys[ks .== 1], markercolor=mcols[1])
for k = 2:length(mcols)
    scatter!(xs[ks .== k], ys[ks .== k], markercolor=mcols[k])
end
p
If the elements in vector z are categorical rather than continuous values, you might want to consider using the group parameter to the plotting call as follows:
using Plots
# visualize x and y colouring points based on category z
scatter(x, y, group=z)
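Both cases can be sketched together. This assumes a recent Plots.jl, where marker_z is the attribute that maps a numeric vector onto a color gradient; the z values and category labels below are made up for illustration.

```julia
using Plots

k = 100
x, y = rand(k), rand(k)

# Continuous z: each marker is colored by its z value along a gradient
z = x .+ y
p1 = scatter(x, y; marker_z=z, color=:viridis, label="", colorbar_title="z")

# Categorical z: one series (and one legend entry) per distinct value
cats = ifelse.(z .> 1, "high", "low")
p2 = scatter(x, y; group=cats)

plot(p1, p2)
```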

ggplot2: easy way to plot integral over independent variable?

I'm integrating a function f(t) = 2t (just an example) and would like to plot the integral as a function of time t using
awesome_thing <- function(t) {2*t}
integrate(awesome_thing, lower=0, upper=10)
However, I would like to plot the integral as a function of time in ggplot2, so for this example the plotted points would be (1,1), (2,4), (3,9), ..., (10,100).
Is there an easy way to do this in ggplot (e.g., something similar to how functions are plotted)? I understand I can "manually" evaluate and plot the data for each t, but I thought i'd see if anyone could recommend a simpler way.
Here is a ggplot2 solution using stat_function:
# create a function that is vectorized over the "upper" limit of your
# integral
int_f <- Vectorize(function(f = awesome_thing, lower=0,upper,...){
integrate(f,lower,upper,...)[['value']] },'upper')
ggplot(data.frame(x = c(0,10)),aes(x=x)) +
stat_function(fun = int_f, args = list(f = awesome_thing, lower=0))
Not ggplot2, but it shouldn't be difficult to adapt by creating a data frame to pass to that paradigm:
plot(x=seq(0.1,10, by=0.1),
y= sapply(seq(0.1,10, by=0.1) ,
function(x) integrate(awesome_thing, lower=0, upper=x)$value ) ,
type="l")
The trick with the integrate function is that it returns a list, and you need to extract the 'value' element for each choice of the upper limit.

Graphing a polynomial output of poly.calc

I apologize first for bringing what I imagine to be a ridiculously simple problem here, but I have been unable to glean from the help file for package 'polynom' how to solve this problem. For one out of several years, I have two vectors of x (d for day of year) and y (e for an index of egg production) data:
d=c(169,176,183,190,197,204,211,218,225,232,239,246)
e=c(0,0,0.006839425,0.027323127,0.024666883,0.005603878,0.016599262,0.002810977,0.005603878,0,0.002810977,0.002810977)
I want to, for each year, use the poly.calc function to create a polynomial function that I can use to interpolate the timing of maximum egg production. I want then to superimpose the function on a plot of the data. To begin, I have no problem with the poly.calc function:
egg1996<-poly.calc(d,e)
egg1996
3216904000 - 173356400*x + 4239900*x^2 - 62124.17*x^3 + 605.9178*x^4 - 4.13053*x^5 +
0.02008226*x^6 - 6.963636e-05*x^7 + 1.687736e-07*x^8
I can then simply
plot(d,e)
But when I try to use the lines function to superimpose the function on the plot, I get confused. The help file states that the output of poly.calc is an object of class polynomial, and so I assume that "egg1996" will be the "x" in:
lines(x, len = 100, xlim = NULL, ylim = NULL, ...)
But I cannot seem to, based on the example listed:
lines (poly.calc( 2:4), lty = 2)
Or based on the arguments:
x an object of class "polynomial".
len size of vector at which evaluations are to be made.
xlim, ylim the range of x and y values with sensible defaults
Come up with a command that successfully graphs the polynomial "egg1996" onto the raw data.
I understand that this question is beneath you folks, but I would be very grateful for a little help. Many thanks.
I don't work with the polynom package, but the resultant data set is on a completely different scale (both X & Y axes) than the first plot() call. If you don't mind having it in two separate panels, this provides both plots for comparison:
library(polynom)
d <- c(169,176,183,190,197,204,211,218,225,232,239,246)
e <- c(0,0,0.006839425,0.027323127,0.024666883,0.005603878,
0.016599262,0.002810977,0.005603878,0,0.002810977,0.002810977)
egg1996 <- poly.calc(d,e)
par(mfrow=c(1,2))
plot(d, e)
plot(egg1996)
