how to de-clutter graph created using proc gchart? - graph

I utilized proc gchart in SAS and the following code to generate the graph displayed here.
proc gchart data=combined;
vbar distrct / discrete type=sum sumvar=PERCENT
subgroup= population coutline=gray width=6;
run;
However, as you can see it seems that individual variable bars are stacked extremely close together and is difficult to comprehend. I have 110 variable bars representing densities of ethnic groups
My question is
is there a way to make this graph look less cluttered (I tried reducing the width but it does not seem to work)?
Should I be using a different procedure than the g chart procedure?

2 is easier to answer; proc gchart is mostly replaced by proc sgplot nowadays. It's still maintained, but I don't think much new work is being done in gchart or the other sas/graph procedures.
As for how to make it better; there are some answers, definitely, for how to improve it, but ultimately trying to show 110 bars each split by four ethnicities, means your'e showing 440 data points on one graph. That's going to be a tough lift no matter what.
The first thing I'd consider is switching to horizontal. Horizontal may allow you to have a larger graph, allowing for more spacing, and often times readers have an easier time reading horizontal charts when combining that with stacked bar charts. Scrolling is also easier up-down for most people (a mouse wheel), so if it's okay that they not see it on one screen this may be better. It also allows the bar titles to be presented in the usual left-to-right manner.
Second, consider if your bars can be grouped together. Do you have regions or such that allow you to group bars together, with a bit more spacing between the region? Or more importantly, are there bars that you'd like the readers to be comparing visually to each other? Right now it looks like it's sorted alphabetically, but that is probably not the right way to sort it if there's any sort of relationship between the bars. For example, does the area have sub-areas that are ethnically related? Maybe group those together; or by just geographies (here is the north-east section, here's the east, here's the south-west, etc.) Any time you can group like-things together it makes it easier for the reader to understand what they're looking at and draw sensible conclusions.
You could also sort them by a particular racial makeup - say, in descending order of "color" which seems the dominant population group - which is often an effective way to present data that's this cluttered, as a reader can both see the trend and can find, say, their neighborhood and see where it falls in the order just by looking.
Best overall though might be to group the district up and then display that, so you have many fewer bars. If there's a sensible way to do that, that'll get your idea across more effectively.

Related

Faceting in ggplot() in R

I am trying to build a plot for a numeric variable rider_count vs a categorical variable weekdays("Mon", "Tue"....), and this plot is required to be a faceting plot with 55 categories,
I tried to use
ggplot(aes(x=wday, y=rider_count_sum)) +
geom_bar(stat = "identity") +
facet_wrap(~counter_edited, scales="free")
However, the output of it is twisted very hard due to the scale does not fit.
Are there any ways to make it scale normally?
The issue you here is your faceting. It produces a grid of 8 x 7 cells. The plot displays on my monitor at about 18cm x 11cm in size. That means each cell is approximately 2.25cm x 1.5cm. Is a cell of that size large enough to provide meaningful information in the form of a plot? I would say "no".
So, you have two options: increase the size of the graphic or reduce the size of the grid.
Is increasing the size of the plot an option? Well, how big would each cell have to be to be meaningful? I don't know: you'd have to experiment: it would depend on the viewing distance and the level of information you'd want to convey. As a thought experiment, let's say you need each cell to be 8cm x 8cm to be interpretable. That means the graphic would need to be at least 64cm x 56cm. That would require an A1/ANSI D sheet of paper. That's heading to paper size. Unless you're talking posters, that's not reasonable. Even as a poster, a reader would have to stand so close that they wouldn't get the message of the whole graphic. On a digital display, you'd again be talking about a wall mounted unit. Standing close enough to look at a cell, pixel resulution would be an issue. Scrolling on a smaller unit would destroy the whole purpose of using a facted display.
Pagination would also destroy the benefit of faceting: you wouldn't be able to see all the data at the same time.
So, whilst increasing the size of your plot might be technical possible, I don't think it would be practically useful.
What about reducing the number of cells? That to me would be the way to go. Simplify your presentation to allow your message to come across. For example, summairse weekdays vs weekends in one graphic, differences between weekdays in another. That reduces one dimension from 7 to either 2 or 5. I don't know how you construct counter_edited, so I don't know what the columns of your facet represent, but could you perhaps reduce the number of categories to 3 or 4? Combined with my weekday/weekend suggestion, would give you grids of between 4x5 and 2x3. Much more managable (though even 4x5 may be too complex).
In short: even if making you current graphic look better is technically possible, I doubt it will ever be practically useful. I suggest adopting a different approach. The question I would ask is deeper than the simple technical one of improving your graphic: what is your underlying purpose? Once you know that, adapt your presentation to best address your objective.

Are dual-axes time series plots more acceptable if the data for each axis is from the exact same time/location?

I know dual-axes time series plots can be misleading. They're difficult to make in ggplot because Hadley Wickham believes they are fundamentally flawed. Others have concluded that they are ok sometimes, when axes are chosen so that the lines look as though they had been converted to indices first (even if they are given in their actual units). I'm wondering if this example is one in which dual-axes are justifiable.
This online tool is an example similar to what I want to create: https://carve.ornl.gov/visualize/
Measurements taken at the same point in time, from the same flight, are plotted over time. The user can select any two measurements to overlay, and the time matches up with a map showing flight coordinates. I think this is an elegant way for users to interact with the data, and I can't really imagine an alternative that would convey the same information.
That being said, I am interested to hear other opinions. Will this type of plot draw vitriol from other data scientists?! Do you have other ideas? And, if you have recommendations for what R tools I should turn to (since ggplot might be off the table...), I would love to hear them (I will be using Shiny). Thanks!
The debate on multiple axis on a same cartesian plane is indeed a hot one. It reminds me of the endless debates around social scienceĀ“s approaches.
If you follow the orthodoxy of the Grammar of GraphicsG Gospel, then the graph you linked is flawed. To come back into the herd, you could simply map either the CO2 or the Altitude to a different plotted symbology, like the size of the dots or color. Or simply plot two different panels, aligned by the X scale.
Now, the Grammar of Graphic people have much fewer problems with multiple scales on the plotted scales than on the cartesian scales.
Yet, I think that methodological opportunism is preferable to methodological orthodoxy. Do whatever is easier for you to communicate the idea to the public.

Align bars in ciplot

I'm working with the ciplot graphing module for Stata and am encountering a problem with the alignment of bars when I use the by() option. Here's a trivial example demonstrating the issue:
webuse citytemp, clear
ciplot heatdd cooldd, by(region) horizontal recast(conn)
So, the graph shows means and confidence intervals for two variables across categories of the region variable. The bars for the different variables do not align horizontally, though. For each region, the point and bar for heatdd is one line above, and the point and bar for cooldd is one line below, the category label. I would like these to be on the same line, but I can't figure out how to achieve it.
I'm open to solutions that do not involve ciplot, but I have found it to be useful for the specific task I'm working on.
This is my program (in Stata terms, downloadable via ssc install ciplot) so I can speak confidently. (On Statalist, it's expected that you explain the exact provenance of user-written programs; that would be good practice here too.)
It's not a bug; it's a feature (supposedly).
The offsets are entirely deliberate, to avoid messes when two or more intervals would just overlap and occlude each other, which is entirely likely when groups or comparable variables have similar values, which in turn is common when you do this. Even in your example, intervals for heating and cooling degree-days for the South would overlap otherwise, so the graph makes the point for me.
I can see that it's not what you want, but
There is no option in ciplot to remove the offset. I can see a case for one, but
My advice is now to use statsby to get a reduced dataset containing the confidence interval information, and then the graphics are typically a couple of command lines and you get to choose what you want. This approach is documented in a paper easily accessible from the Stata Journal.
You are always welcome to clone the program and modify the code using a different program name, with notional mention of the original.

Is it possible to create a pie in pie chart in SPSS or R?

I know it is possible to create such double pie charts in excel like this:
http://chandoo.org/wp/2009/12/02/group-small-slices-in-pie-charts/
but can SPSS or R do this also?
In relation to R:
The answer to the title question is "yes" ... see ?pie
As for the second question, the one in the body - it would be possible but would involve some coding. You'd have to draw two pie charts side by side (which could be managed with two calls to pie) and use segments or arrows (and text if necessary) to do the additional components of the plot.
Here's a rough example:
That required the fig argument of par to get them side-by-side.
(That example required a little fiddling to get right, but it would be possible to write a function to automate the details.)
The main issue I can see would be 'why on earth would you do it?' -- pie charts are a poor way of conveying information of this form. There are alternatives that result in much better ability to distinguish values, and less bias (such as what you get when comparing nearly horizontal vs nearly vertical slices).

Irregular scaling of axis in R

I have computed values for several categories for three networks. I'd like to create a bar plot in R to show the differences between these parameters for the networks. So far I plotted this with the barplot R function with the categories on the x-axis, their values on the y-axis and to each category three bars (one for each network).
But now I have one value which is much higher than all the others. Therefore the differences for the rest cannot be seen since they're represented only by a thin line because of that one large bar which almost fills the whole plot.
My idea was now to plot the values on the y-axis on an irregular scale, meaning for example, that one half represents the values from 0 to 300, and the other half from 300 to 3000. Is there any way to do this? Or a good alternative approach to handle this problem? I also thought of plotting the logarithm but unfortunatly I have also negative values.
I would suggest that an irregular scale isn't a good plan - I think it confuses viewers of the chart. Instead, you could use the layout() function to plot three separate barplots in a horizontal layout. Thus, each category could have it's own plot, with it's own scale.
If, however, you still have a single bar at 3000, while everything else is at 300, that won't really help. In that case, you could manually set your y-axis limits with ylim=c(min,max). To keep the bar from stretching off the screen, you can just use simple logic to define anything > 300 as 300, or something similar. Then, put a text point there stating the actual value (using text, maybe with arrow).
With those ideas out there, I would suggest that a graph where one value is 10x the other values might not really be worth presenting, or if it is, the main takeaway from it isn't going to be "how do values 2 and 3 compare to each other", it's going to be "holy moley look how much bigger 1 is than 2 and 3". So, it might not be a big deal if one bar is giant and two are small, as long as you aren't doing all 9 on a single plot (which would screw up other, relevant comparisons). So, if you split them using layout(), then it wouldn't be as big of a deal.

Resources