How to create a histogram with varying bin widths - plot

I have been unsuccessful with other approaches using a hist plot.
A simple problem would be using the following data:
age range - frequency - central band width - bin width - height (respectively)
1-4 - 30 - 2.5 - 3 - 10
5-6 - 20 - 5.5 - 1 - 20
7-17 - 30 - 12 - 10 - 3
Age runs along the X axis on a linear scale, so the bin for 1-4 has width 3 and height 10, the bin for 5-6 has width 1 and height 20, and the bin for 7-17 has width 10 and height 3.
How can I place these data into a Word/notepad document .dat file?
And how can I then use them to set up a histogram in gnuplot?

I would use the following data file format (use only white spaces to delimit fields):
"age range" "frequency" "central band width" "bin width" "height"
1-4 30 2.5 3 10
5-6 20 5.5 1 20
7-17 30 12 10 3
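For reference, the last three columns follow from the age ranges and frequencies: the bin width is the upper limit minus the lower limit, the centre is the midpoint, and the height is the frequency divided by the bin width. Below is a minimal Python sketch that rebuilds the rows above, assuming the ranges given in the question. The helper name build_rows is made up for illustration.

```python
# Derive centre, width and height for each bin and print rows in the
# whitespace-delimited layout shown above.
bins = [((1, 4), 30), ((5, 6), 20), ((7, 17), 30)]  # ((low, high), frequency)

def build_rows(bins):
    rows = []
    for (low, high), freq in bins:
        width = high - low          # bin width on the age axis
        centre = (low + high) / 2   # x position of the box centre
        height = freq / width       # frequency per unit of age
        rows.append((f"{low}-{high}", freq, centre, width, height))
    return rows

rows = build_rows(bins)
lines = [f"{label} {freq} {centre:g} {width} {height:g}"
         for label, freq, centre, width, height in rows]

print('"age range" "frequency" "central band width" "bin width" "height"')
for line in lines:
    print(line)
```

Redirecting the printed output to a plain text file gives a data file in the format above.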
To plot with variable boxwidth, use the boxes plotting style. That allows you to use the value from a column as width.
With xtic(1) you use the entry in the first column as xticlabel.
So a rather simple plotting script looks as follows:
set style fill solid noborder
set yrange [0:*]
set offset 1,1,1,0
plot 'file.txt' using 3:5:4:xtic(1) with boxes notitle
The result with version 4.6.3 and the pngcairo terminal is:

I managed a fairly nice example of variable width boxes last night. I was plotting latency histogram data produced by the FIO storage performance test package. With my compile options I have 1856 bins, that go as follows:
1 ns wide from 0-128 ns (128 bins)
2 ns wide from 128-256 ns (64 bins)
4 ns wide from 256-512 ns (64 bins)
8 ns wide from 512-1024 ns (64 bins)
etc...
My latency values at plot time are in microseconds (FIO provides nanoseconds, but I wanted microseconds for historical reasons). I did not have the opportunity to include the bin widths in my data. So I did this:
f(x) = (2**(int(log(x*1000)/log(2))-6))/1000
plot "temp" u 1:2:(f(column(1))) with boxes fs transparent solid 0.7 noborder title "$legend"$base_plot
The f(x) definition returns the box width for a given latency - it works as follows:
First, x*1000 gets me back to nanoseconds.
log(x*1000)/log(2) takes the base 2 logarithm of the nanosecond count.
The int() just gives me the integer part of that. Note that now for, say, 128 ns, I'd have 7.
The -6 gets me to the base 2 log of the bin width.
The 2 ** gets me to the bin width.
The /1000 returns me from nanoseconds to microseconds.
Then I just use f(latency) in the plot command as the box width.
This works - it seems to work perfectly as far as I can tell. It would not give the right result for x < 64 ns, but I don't have any data that small, so it works out. A conditional expression could be used to patch it up for that part of the range.
I think the key observations here are that a) you don't have to have the width as literal data - if you can calculate it from the data you do have, you're golden, and b) column(n) is an alternative to $n as a way of expressing column values in the plot command. In my case I have all this in a bash script, and bash intercepted the $1.
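To sanity-check the width formula outside gnuplot, here is a rough Python sketch of the same calculation. It is not the gnuplot code: it works on integer nanoseconds and uses the integer bit length instead of a floating-point logarithm, and it includes the clamp for latencies below 64 ns mentioned above. The function names are hypothetical.

```python
def box_width_ns(latency_ns: int) -> int:
    """Bin width in ns for a latency in ns, following the bin layout above."""
    # bit_length() - 1 equals floor(log2(n)) for n >= 1; subtracting 6 more
    # gives the exponent of the bin width. max(..., 0) clamps the range
    # below 64 ns, where the bins are 1 ns wide.
    exponent = max(latency_ns.bit_length() - 7, 0)
    return 2 ** exponent

def box_width_us(latency_us: float) -> float:
    """Same width, taking and returning microseconds."""
    latency_ns = round(latency_us * 1000)
    return box_width_ns(latency_ns) / 1000

for ns in (10, 64, 127, 128, 255, 256, 512, 1023):
    print(ns, box_width_ns(ns))
```

Each printed width should match the bin table: 1 ns up to 127 ns, 2 ns from 128 to 255 ns, 4 ns from 256 ns, and so on.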

Related

How to plot two files with error bars in gnuplot side by side?

I have two files with three columns. The first column is the X, the second column is the Y - Mean, and the third is the error.
I need to plot these two files to compare the error between them. I can plot but the error bars overlap. I need them to stand side by side.
Archive 1
10 0.15127 0.0986
30 0.14606 0.10022
60 0.16739 0.10298
Archive 2
10 0.19177 0.10253
30 0.17864 0.12178
60 0.18111 0.11272
What I can plot
What I need
I need the two categories to be side by side with the bar showing the error for plus and minus and midpoint.
We will use the line number (column 0, shorthand $0) for the x coordinate and offset the second set of values by 1/10 unit on x
set offset 0.5, 0.5 # put whitespace on both sides of the data
set yrange [0:1]
plot 'ar1' using ($0):2:3:xtic(1) with yerrorbars, \
'ar2' using ($0+0.1):2:3 with yerrorbars
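To make the positions explicit, the following Python sketch computes where each point and error bar ends up under this scheme, using the two archives from the question. It only illustrates the arithmetic; the function name dodged is hypothetical.

```python
# Rows are (x label, mean, error), copied from the question.
ar1 = [(10, 0.15127, 0.0986), (30, 0.14606, 0.10022), (60, 0.16739, 0.10298)]
ar2 = [(10, 0.19177, 0.10253), (30, 0.17864, 0.12178), (60, 0.18111, 0.11272)]

def dodged(rows, shift=0.0):
    """Return (x position, lower end, mean, upper end) for each row.

    The x position is the row index (gnuplot's column 0) plus a shift,
    so the second data set sits slightly to the right of the first.
    """
    return [(i + shift, mean - err, mean, mean + err)
            for i, (_label, mean, err) in enumerate(rows)]

for row in dodged(ar1) + dodged(ar2, shift=0.1):
    print(row)
```

Because the x positions are row indices rather than the values 10, 30 and 60, the tic labels have to come from the first column, which is what xtic(1) does in the plot command.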

Simulation of random variables using Polar method

I have the following algorithm
Step 1. Generate u1 and u2~U(0,1)
Step 2. Define v1=2u1-1, v2=2u2-1 and s=v1^2+v2^2
Step 3. If s>1, come back to Step 1.
Step 4. If s<=1, x=v1(-2logs/s)^(1/2) and y=v2(-2logs/s)^(1/2)
Here is my approach to implement this algorithm in R:
PolarMethod1<-function(N)
{
x<-numeric(N)
y<-numeric(N)
z<-numeric(N)
i<-1
while(i<=N)
{u1<-runif(1)
u2<-runif(1)
v1<-(2*u1)-1
v2<-(2*u2)-1
s<-(v1^2)+(v2^2)
if(s<=1)
{
x[i]<-((-2*log(s)/s)^(1/2))*v1
y[i]<-((-2*log(s)/s)^(1/2))*v2
z[i]<-(x[i]+y[i])/sqrt(2) #standardization
i<-i+1
}
# s > 1: reject this pair and draw again; i stays unchanged (Step 3)
}
return(z)
}
z<-PolarMethod1(10000)
hist(z,freq=F,nclass=10,ylab="Density",col="purple",xlab=" z values")
curve(dnorm(x),from=-3,to=3,add=TRUE)
The code runs without errors and works quite well with N=1000, but when I change to N=10000, instead of approximating the curve more closely, it displays:
In contrast, with N=1000 it displays:
Why is that?
Is there something wrong with my code? The fit should improve as N increases.
Note:I added the z in the code to include both variables in the output.
Why is there a difference between 1000 and 100000 runs?
When you run 1000 simulations the z values usually go from -3.2 to 3.2. But if you increase the runs to 100k you will obtain more extreme values, z will go from -4 to 4.
The histogram is binning the z results into 10 bins. A higher range in z will result in wider bins, and wider bins usually adjust worse to the probability density.
Your bin width for 1000 runs is approximately 0.5, but for 100k is 1.
You ask for 10 bins when you draw the histogram, but that's only a suggestion. You actually got 8, because to cover the range from -4 to 4 there is no division into 10 bins that ends up on nice round numbers, whereas 8 bins have very nice boundaries.
If you want more bins, then don't specify nclass. The default gave me 20 bins. Or specify breaks = "Scott", which uses a different rule to select bins. I saw about 80 bins using this option.
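To see the effect of sample size on the range of z independently of R's histogram rules, here is a minimal Python sketch of the polar method from the question (commonly known as the Marsaglia polar method). It is a sketch under assumptions, not a port of the R code: the function name and the seed parameter are made up, and on rejection the loop simply draws again, as Step 3 describes.

```python
import math
import random

def polar_method(n, seed=None):
    """Return n standard normal draws using the polar method."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        v1 = 2 * rng.random() - 1
        v2 = 2 * rng.random() - 1
        s = v1 * v1 + v2 * v2
        if s > 1 or s == 0:
            # Reject: point is outside the unit circle (or log(0) would fail).
            continue
        factor = math.sqrt(-2 * math.log(s) / s)
        x, y = v1 * factor, v2 * factor
        # Combine both variables as in the question: (x + y) / sqrt(2).
        out.append((x + y) / math.sqrt(2))
    return out

for n in (1000, 10000, 100000):
    z = polar_method(n, seed=1)
    print(n, round(min(z), 2), round(max(z), 2))
```

The printed minimum and maximum typically move further from zero as n grows, which is what pushes a fixed number of bins toward wider intervals.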

Gaussian peaks not overlapping in Gnuplot

I’m trying to plot multiple Gaussian functions on the same graph with Gnuplot, which is quite a simple thing. The problem is that the peaks do not overlap and I get the following result that looks like they have different peaks, which they don’t. How can I fix this?
First, it helps to understand how gnuplot generates plots of functions (or really how any computer program must do it). It must convert a continuous function into some kind of discrete representation. The mathematical function to be plotted is evaluated at various points along the independent (x) axis. This creates a set of (x,y) points. A line is then drawn between these points (think "connect the dots"). As you might imagine, the number of discrete samples used affects how accurately the curve is represented, and how smooth it looks.
The problem you have noticed is that the default sample size in gnuplot is a bit too low. The default (I believe) is 100 samples across the visible x-axis. You can adjust the number of samples (to 1000, for example) with
set samples 1000
I have made some example plots of gaussians to illustrate this point. (I made a rough estimate of your gaussian parameters.) Each plot has a different number of samples:
Notice how the lines get too jagged if the sample size is too low. Even the default value of 100 is too low. Setting to 1000 makes it plenty smooth. This is probably more than it needs to be, but it works. If you're using a terminal that generates a bitmap image (e.g. PNG), then you shouldn't need more samples than you have width in pixels used for the x-axis plot area. If you're generating vector based output, then just pick something that "looks right" for whatever you are using it in.
See the question Gnuplot x-axis resolution for more.
By the way, the code to generate the above examples is:
set terminal pngcairo size 640,480 enhanced
# Line styles
set style line 1 lw 2 lc rgb "blue"
set style line 2 lw 2 lc rgb "red"
set style line 3 lw 2 lc rgb "yellow"
# Gaussian function stuff
set yrange [0:1.1]
set xrange [-20:20]
gauss(x,a) = exp(-(x/a)**2)
eqn(a) = sprintf("y = e^{-(x/%d)^2}", a)
# First example (default)
set output "example1.png"
set title "100 samples (default)"
plot gauss(x,8) ls 1 title eqn(8), \
gauss(x,2) ls 2 title eqn(2), \
gauss(x,1) ls 3 title eqn(1)
# Second example (too low)
set output "example2.png"
set title "20 samples (too low)"
set samples 20
replot
# Third example (plenty high)
set output "example3.png"
set title "1000 samples (plenty high)"
set samples 1000
replot
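The reason the peaks appear to have different heights can be checked numerically. The Python sketch below evaluates the same gauss function on evenly spaced points and reports the largest sampled value for each width. It assumes the samples are spread evenly over the x range including both end points, which is my understanding of how gnuplot samples functions; the helper name sampled_peak is hypothetical.

```python
import math

def gauss(x, a):
    return math.exp(-(x / a) ** 2)

def sampled_peak(a, samples, xmin=-20.0, xmax=20.0):
    """Largest value of gauss(x, a) over evenly spaced sample points."""
    step = (xmax - xmin) / (samples - 1)
    return max(gauss(xmin + i * step, a) for i in range(samples))

for samples in (20, 100, 1000):
    peaks = [round(sampled_peak(a, samples), 4) for a in (8, 2, 1)]
    print(samples, peaks)
```

With an even number of samples on a symmetric range, no sample lands exactly on x = 0, so the narrowest curve loses the most height. More samples shrink the gap around zero and the three peaks converge toward 1.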

gnuplot - adding label for the median to the right-hand side of the diagram

I'm using gnuplot's STATS to plot a median to some data.
On the right-hand side of the graph, where the median intersects the diagram's border, I should like to place a label stating the median's value - or its percentage of a pre-determined maximum.
How do I best achieve that?
You can use a label. The syntax is
set label id "text" at x-coordinate,y-coordinate
where the id is optional (but useful if you need to change the label or remove it later). See help label for all of the options (including alignment and font options).
Note also that gnuplot has several coordinate systems. See help coordinates for information about these.
To place a label with the median on the right side at y-coordinate 10 (for example), you can use a command like:
set label 1 sprintf("%f",STATS_median) at graph 1, first 10
where we use sprintf to turn our numerical value into a string for the label. We specify the graph coordinate system for the x-coordinate. This system runs from 0 (left) to 1 (right), and similarly for the y-values from 0 (bottom) to 1 (top). It is useful when we need to address relative positions without knowing exact coordinates. We specify the first coordinate system for the y-coordinate. This system corresponds to the system used by the x1 and y1 axes.
Note that when placing a label outside the graph (which we have done here), it is sometimes necessary to increase margins. See help margins for full details. A command like set rmargin 15 will give enough space for a 15 character string on the right.
For an example, suppose that we have data that looks like
8
9
15
3
6
Then we can plot this, drawing a line at the median and labeling it with
stats datafile nooutput
set arrow 1 from graph 0, first STATS_median to graph 1, first STATS_median nohead
set label 1 sprintf("%0.2f",STATS_median) at graph 1, first STATS_median offset char 1,0
set rmargin 5
plot datafile w points pt 7
This produces the following
Note that we specified an offset on the label in the character coordinate system, which allows us to move it one character width to the right.
In this example, an alternative could have been achieved by using the y2 axis and specifying the tic marks literally with
set link y2
set y2tics ("%0.2f" STATS_median)
plot datafile w points pt 7
The advantage here is that the margin is automatically computed. This is not possible, however, when you need the label elsewhere.
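As a quick check of what the label should read for the sample data, this small Python sketch computes the median and formats it with the same "%0.2f" pattern. It only mirrors the arithmetic and is independent of gnuplot's stats command.

```python
import statistics

data = [8, 9, 15, 3, 6]             # sample data from above
median = statistics.median(data)    # middle value of the sorted data
label = "%0.2f" % median            # same format string as in the script
print(label)                        # prints 8.00
```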

rrdtool y-axis values "200m" instead of "0.2"

I've got an rrd which contains mostly values from 0 to 1 (Linux load average).
Sometimes the graph displays at the y-axis => "0.1 0.2 ... 0.9". That's the way I want it.
But other times, I see the following "100m 200m ...".
Is there a way to force displaying as "0.1 etc." values?
-X 0 did the trick.
[-X|--units-exponent value]
This sets the 10**exponent scaling of the y-axis values. Normally, values will be scaled to the appropriate units (k, M, etc.). However, you may wish to display units always in k (Kilo, 10e3) even if the data is in the M (Mega, 10e6) range, for instance. Value should be an integer which is a multiple of 3 between -18 and 18 inclusively. It is the exponent on the units you wish to use. For example, use 3 to display the y-axis values in k (Kilo, 10e3, thousands), use -6 to display the y-axis values in u (Micro, 10e-6, millionths). Use a value of 0 to prevent any scaling of the y-axis values.
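To illustrate why 0.2 can appear as "200m", here is a rough Python sketch of SI-prefix scaling with an optional fixed exponent. It is an illustration of the idea only, not rrdtool's actual code: the function format_value and its automatic exponent rule are assumptions made for this example.

```python
import math

SI_PREFIXES = {-18: "a", -15: "f", -12: "p", -9: "n", -6: "u", -3: "m",
               0: "", 3: "k", 6: "M", 9: "G", 12: "T", 15: "P", 18: "E"}

def format_value(value, exponent=None):
    """Format value with an SI prefix; a fixed exponent disables auto-scaling."""
    if exponent is None:
        if value == 0:
            exponent = 0
        else:
            # Pick the multiple of 3 at or below the value's order of magnitude.
            exponent = 3 * math.floor(math.log10(abs(value)) / 3)
            exponent = max(-18, min(18, exponent))
    scaled = value / 10 ** exponent
    return f"{scaled:g}{SI_PREFIXES[exponent]}"

print(format_value(0.2))              # auto-scaled to milli
print(format_value(0.2, exponent=0))  # exponent forced to 0, no prefix
```

Forcing the exponent to 0 in this sketch plays the role that -X 0 plays on the rrdtool command line.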

Resources