I am looking to present a variable as a bar plot, with the caveat that the groups I am trying to plot (the size of an object) vary over several orders of magnitude. The other complication of the data is that the variable y also varies over several orders of magnitude when positive, as well as taking negative values. I usually think in pictures, so I have sketched something along the lines of what I am looking for below (the colour would simply be a function of the distance from zero, i.e. white at zero, dark blue for very negative, dark red for very positive, etc.):
Here is a real case of the data if required:
x <- c(1.100e-08, 1.200e-08, 1.300e-08, 1.400e-08, 1.600e-08, 1.700e-08, 1.900e-08, 2.100e-08, 2.300e-08, 2.600e-08, 3.100e-08, 3.500e-08, 4.200e-08, 4.700e-08, 5.200e-08, 5.800e-08, 6.400e-08, 7.100e-08, 7.900e-08, 8.800e-08, 9.800e-08, 1.100e-07, 1.230e-07, 1.380e-07, 1.550e-07, 1.760e-07, 3.250e-07, 3.750e-07, 4.250e-07, 4.750e-07, 5.400e-07, 6.150e-07, 6.750e-07, 7.500e-07, 9.000e-07, 1.150e-06, 1.450e-06, 1.800e-06, 2.250e-06, 2.750e-06, 3.250e-06, 3.750e-06, 4.500e-06, 5.750e-06, 7.000e-06, 8.000e-06, 9.250e-06, 1.125e-05, 1.375e-05, 1.625e-05, 1.875e-05, 2.250e-05, 2.750e-05, 3.100e-05)
y <-c(1.592140e+01, -1.493541e+01, -6.255603e+00, -2.191637e+00, -1.274086e+00, -1.343391e+00, -8.869018e-01, -7.717447e-01, -6.140710e-01, -5.637220e-01, -5.404424e-01, -3.473077e-01, -2.279666e-01, -1.945254e-01, -2.485636e-01, -2.363181e-01, -2.197054e-01, -2.119314e-01, -1.897220e-01, -1.656779e-01, -1.478176e-01, -1.364191e-01, -1.297830e-01, -1.408082e-01, -1.514742e-01, -1.311300e-01, -1.358422e-01, -2.718636e+00, -2.231532e+00, -3.479395e+00, -3.572720e+00, -2.297957e+00, -3.265428e+00, -5.449620e+00, -7.741435e+00, -1.172256e+01, 9.368365e+00, 1.078983e+02, 9.542029e+01, 1.484089e+02, 2.293383e+02, 3.678836e+02, 7.965286e+02, 1.349151e+03, 1.577808e+04, 4.554271e+05, 1.821730e+06, 8.092310e+04, 1.015619e+06, 2.113788e+06, 5.208331e+06, 4.534863e+06, 8.086026e+06, 1.577413e+07)
I could also plot this as a scatterplot with a broken axis, but I am currently looking for a nice approach to displaying such data. What is important to me is highlighting the approximate value of x at which y changes sign, as well as the variability and magnitude of both the positive and negative values. Any tips or advice you have on plotting such data would be great.
Edit based upon comments
I realise that on my graph x and y are the wrong way around; apologies for that. Parameter x should indeed be on the x-axis and parameter y on the y-axis.
Taking your suggestions on board, it would be better to plot this data as a scatterplot. I still need to break my axis at a relevant value of y (not x, as shown in the figure), with a log scale above this value and a linear scale below it. Somewhere below the smallest positive value of y seems a sensible place for the break. Can this be done using base R?
I guess something like this, but with the split on the y-axis rather than the x-axis, and in R of course.
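To make the idea concrete, here is a rough base R sketch of the kind of split I have in mind, using the x and y vectors above (the break at 1, just below the smallest positive y, and the tick positions are only guesses to be tuned):
ybreak <- 1                               # assumed break, just below the smallest positive y
split_scale <- function(v) {
  out <- v - ybreak                       # linear piece, shifted so the break maps to 0
  pos <- v > ybreak
  out[pos] <- log10(v[pos] / ybreak)      # log10 piece above the break
  out
}
plot(x, split_scale(y), log = "x", yaxt = "n", pch = 19,
     col = ifelse(y < 0, "darkblue", "darkred"),
     xlab = "x (log scale)", ylab = "y (linear below break, log10 above)")
abline(h = 0, lty = 2)                    # marks the scale break
ticks <- c(-15, -10, -5, 0, 1e2, 1e4, 1e6)
axis(2, at = split_scale(ticks), labels = c("-15", "-10", "-5", "0", "100", "1e4", "1e6"))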
Related
I have a large dataset which I need to plot on a log-log scale in Gnuplot, like this:
set log xy
plot 'A_1D_l0.25_L1024_r0.dat' u 1:($2-512)
LogLogPlot of my datapoints
Text file with the datapoints
The data points on the x-axis are equally spaced, but because of the log scale they get very dense on the right part of the graph, and as a result the output file (I eventually export it to .tex) gets very large.
On a linear scale, I would simply use the every option to reduce the number of points that get plotted. Is there a similar option for a log-log scale, such that the plotted points appear equally spaced?
I am aware of a similar question which was raised a few years ago, but in my opinion the solution is unsatisfactory: the plotted points are not equally spaced along the x-axis. I think this is a really straightforward problem which deserves a cleaner solution.
As I understand it, you don't want to plot the actual data points; you just want to plot a line through them. But you want to keep the appearance of points rather than a line. Is that right?
set log xy
plot 'A_1D_l0.25_L1024_r0.dat' u 1:($2-512) with lines dashtype '.' lw 2
Amended answer
If it is important to present outliers/errors in the data set, then you must not use every or any other technique that simply discards or skips most of the data points. In that case I would prefer the plot with points that you show in the original question, perhaps modified to represent each point as a dot rather than a cross. I will simulate this by modifying a single point in your 500000-point data set (first figure below). But I would also suggest that the presence of outliers is even more apparent if you plot with lines (second figure below).
Showing error bounds is another alternative for noisy data, but the options depend on what you have to work with in your data set. If you want to pursue that, please ask a separate question.
If you really want to reduce the number of data to be plotted, you might consider the following script.
s = 0.1 ### sampling interval in log scale
### (try 0.05 for more detail)
c = log10(0.01) ### a parameter used in sampler(x)
### which should be initialized by
### smaller value than any x in log scale
sampler(x) = (x>0 && log10(x)>=c) ? (c=ceil(log10(x)/s+0.5)*s, x) : NaN
set log xy
set grid xtics
plot 'A_1D_l0.25_L1024_r0.dat' using (sampler($1)):($2-512) with points pt 7 lt 1 notitle , \
'A_1D_l0.25_L1024_r0.dat' using 1:($2-512) with lines lt 1 notitle
This script samples the data in increments of roughly 0.1 along the x-axis in log scale. It makes use of the property that points whose x value evaluates to NaN in the using expression are not drawn.
I am wondering if there is a way to account for an outlier in a histogram plot. I want to plot the frequencies of a random variable which is very small and distributed around zero. However, in most of the cases I am considering I also have an outlier that complicates things. Is there a way to adjust the scale of the x-axis in R/Matlab so that I can capture the distribution of the random variable and also show the outlier? The usual ways of producing the plot result in a scale on which all the values appear to be zero, whereas I want to show how they are distributed around zero. So ideally I would like the scale around zero to account for very small numbers, and then, after a gap (which does not necessarily have to be proportional to the actual distance from zero), a bin to indicate the value of the outlier. And I do not want to remove the outlier from the sample.
Is such a thing possible in R/Matlab? Any other suggestions would be welcome.
Edit: The problem is not in identifying the outliers and using a different color for them. The problem is in adjusting the scales on the x-axis so I can observe the distribution of the variable as well as have the outlier included in the plot.
The following code will do the job, but you need to change the XTickLabel entries of the axes in order to make them show the real value of the outliers.
A=rand(1000,1)*0.1;
A(1:10)=10;
% modify the data for plotting purposes: pull the outliers closer
expected_maximum_value=1; % you could compute this using 3*sigma, maybe?
distance_to_outliers=0.5;
outlier_mean=mean(A(A>expected_maximum_value));
A(A>expected_maximum_value)=A(A>expected_maximum_value)-outlier_mean+distance_to_outliers;
% plot
h=histogram(A,'BinWidth',0.01)
%% trick the X axis
ax=gca;
ax.XTickLabel{end-1}=[ax.XTickLabel{end-1} '//'];
ax.XTickLabel{end}=['//' num2str(outlier_mean)];
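Since the question also mentions R, here is a hedged sketch of the same trick in base R (hypothetical data; the cut-off, the position of the compressed outliers, and the relabelled tick are all assumptions to tune):
A <- c(runif(1000) * 0.1, rep(10, 10))       # bulk near zero plus a few outliers at 10
expected_max <- 1                            # values above this are treated as outliers
gap_position <- 0.5                          # where the compressed outliers are drawn
outlier_mean <- mean(A[A > expected_max])    # remember the real outlier value
A[A > expected_max] <- gap_position          # pull the outliers in next to the bulk
hist(A, breaks = seq(0, gap_position + 0.05, by = 0.01),
     axes = FALSE, xlab = "value", main = "")
axis(2)
axis(1, at = c(seq(0, 0.1, by = 0.02), gap_position),
     labels = c(seq(0, 0.1, by = 0.02), paste0("//", round(outlier_mean, 1))))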
I am able to create a 2-D plot using two parameters in IDL, i.e., star formation rate (y-axis) vs. time (x-axis).
But I would like to include the redshift (another variable) corresponding to each data point, say, as the top x-axis. It didn't work when I tried adding the third variable to the PLOT procedure, and I have not been able to find any discussion online on how to accomplish this. Any help is appreciated.
First run PLOT.PRO with the NODATA keyword set and XSTYLE=4 and YSTYLE=4 to suppress both axes. Then you can use the AXIS.PRO program to define each axis. Then you can use OPLOT.PRO to draw the points of Z vs. X and Z vs. Y, where Z = star formation rate, X = time, and Y = redshift. Look up the details of the [XYZ]AXIS keywords to determine which axis to draw each time. You can even color each axis using the COLOR keyword with the AXIS.PRO program.
The only trick is that you will have to scale the Y data points to the X-axis scale prior to plotting, because you will explicitly define the [XYZ]RANGE when calling PLOT.PRO (you could do the converse and scale to Y and redefine X; it's your choice). You need to do this scaling because OPLOT.PRO and, say, PLOTS.PRO use the original [XYZ]RANGE defined when calling PLOT.PRO to convert data coordinates to device coordinates.
Does that make sense?
First call PLOT, TIME, SFR with XSTYLE=9 to force an exact range and suppress the top x-axis.
Then use the AXIS procedure to create the top x-axis.
Be careful with the ticks of that axis: you want them to correspond to a REDSHIFT that you compute from the TIME variable.
Example with a bottom x-axis in velocity and a top x-axis in frequency:
> plot, vel, spec, xsty=9, xtick_get=xtick, xtit='Velocity (km/s)', ytit='Antenna Temperature (K)'
> axis, !x.crange[0], !y.crange[1], xaxis=1, xtickv=((ref_freq - ref_freq/299792.458*xtick)), xtickformat='(F8.3)', xticks=n_elements(xtick)-1, xrange=(ref_freq - ref_freq/299792.458*minmax(!x.crange)), chars=1.5
You could always map the third variable onto another plot dimension (i.e. color or size).
I have two data sets that I am comparing using a kde2d contour plot on a log10 scale.
Here I will use the following example data sets:
library(MASS)   # for kde2d
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))
density <- kde2d(a, b, n = 100)
filled.contour(density, color.palette = colorRampPalette(c('white', 'blue', 'yellow', 'red', 'darkred')))
This produces the following plot:
Now my question is: what do the z values on the legend actually mean? I know they represent where most of the data lies, but 0-15 confuses me. I thought it could be a percentage, but without the log10 scale I have values ranging from 0 to 1? And I have also produced plots with scales of 1-1.2 and 1-2 using my real data.
The colors represent the values of the estimated density function, apparently ranging from 0 to 15. Just like with your other question about the odd-looking linear regression, I can relate to your confusion.
You just have to understand that a density's integral over the full domain has to be 1, so you can use it to calculate the probability of an observation falling into a specific region.
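To see this concretely, you can check numerically that the kde2d output integrates to roughly 1 (a small sketch reusing the density object from your code):
dx <- diff(density$x[1:2])    # grid spacing in x
dy <- diff(density$y[1:2])    # grid spacing in y
sum(density$z) * dx * dy      # should come out close to 1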
I have
probability values: 0.06, 0.06, 0.1, 0.08, 0.12, 0.16, 0.14, 0.14, 0.08, 0.02, 0.04, summing up to 1
and the corresponding interval boundaries between which a stochastic variable takes its value with the corresponding probability from the list above:
126, 162, 233, 304, 375, 446, 517, 588, 659, 730, 801, 839
How can I plot the probability distribution?
On the x-axis the interval boundaries, and between each pair of boundaries a histogram bar with the corresponding probability value?
Thanks.
How about
x <- c(126,162,233,304,375,446,517,588,659,730,801,839)
p <- c(0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04)
plot(x,c(p,0),type="s")
lines(x,c(0,p),type="S")
rect(x[-1],0,x[-length(x)],p,col="lightblue")
for a quick answer? (With the rect included you might not need the lines call, and you might be able to change the plot call to plot(x, c(p, 0), type="n"). As usual I would recommend par(bty="l", lty=1) for my preferred graphical defaults ...)
(Explanation: "s" and "S" are two different stair-step types (see Details in ?plot): I used them both to get both the left and right boundaries of the distribution.)
edit: In your comments you say "(it) doesn't look like a histogram". It's not quite clear what you want. I added rectangles in the example above -- maybe that does it? Or you could do
b <- barplot(p,width=diff(x),space=0)
but getting the x-axis labels right is a pain.
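For what it's worth, here is a sketch of one way to get sensible labels on that barplot: with space = 0 the bar edges sit at the cumulative widths starting from 0, so the original interval boundaries can be recovered and drawn by hand.
b <- barplot(p, width = diff(x), space = 0)        # bars start at 0 when space = 0
axis(1, at = c(0, cumsum(diff(x))), labels = x)    # put the interval boundaries back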