Plotting only the highest and lowest values using geom_bar - r

I'm trying to plot the GDP for every country on a bar chart. The country name goes on the x-axis and the GDP value on the y. However, there are a lot of countries, and I'd like the bar chart to just show the top 3 GDPs, and the bottom 3 GDPs and in between I want perhaps some dots or something to indicate that there are other countries in between. How do I go about this?

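Building on @steveLangsford's solution, doing things in a (possibly) slightly more principled way. This assumes toydata is the full, unfiltered data frame from that answer, with hm_rows and hm_selected defined as there.
Find breakpoints for the GDP categories (there might be a more "tidy" way to do this part):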
GDP_sorted <- sort(toydata$GDP)
GDP_breaks <- c(-Inf,GDP_sorted[hm_selected],
GDP_sorted[hm_rows-hm_selected],
Inf)
Use cut() to define GDP categories, and order countries by GDP:
toydata <- toydata %>%
mutate(GDP_cat=cut(GDP,breaks=GDP_breaks,labels=
c("Lowest","Mid","Highest")),
country=reorder(factor(country),GDP)) %>%
filter(GDP_cat != "Mid") %>%
droplevels()
Plot with facets (add a bit of extra space between the panels to emphasize the axis break):
ggplot(toydata,aes(x=country,y=GDP,fill=GDP_cat))+
geom_bar(stat="identity")+
theme_bw()+
theme(legend.position="none",
panel.spacing.x=grid::unit(5,"lines")
)+xlab("")+
scale_fill_brewer(palette="Dark2")+
facet_wrap(~GDP_cat,scales="free_x")
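On the "more tidy" question raised above: one possibility, sketched here under assumptions, is to pick the rows directly with dplyr's slice_min() and slice_max() instead of computing breakpoints. The toy data, the seed, and the names toy, n_keep and extremes are made up for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(1)                                   # reproducible made-up data
toy <- data.frame(country = paste("Country", 1:50),
                  GDP = runif(50, 0, 5000))
n_keep <- 3                                   # how many countries to keep at each end

# Take the n_keep lowest and highest rows and label where each came from
extremes <- bind_rows(
  Lowest  = slice_min(toy, GDP, n = n_keep),
  Highest = slice_max(toy, GDP, n = n_keep),
  .id = "GDP_cat"
) %>%
  mutate(GDP_cat = factor(GDP_cat, levels = c("Lowest", "Highest")),
         country = reorder(factor(country), GDP))

# Same faceted plot as above, one panel per category
p <- ggplot(extremes, aes(x = country, y = GDP, fill = GDP_cat)) +
  geom_col() +
  facet_wrap(~GDP_cat, scales = "free_x") +
  theme_bw() +
  theme(legend.position = "none") +
  xlab("")
```

Note that slice_min() and slice_max() keep ties by default, so with tied GDP values more than n_keep rows can come back.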

1) You will get faster, better answers if you provide a toy data set
2) Putting "dots or something" on your plot is likely to make data visualization people wince a little. You're basically suggesting a discontinuity in the x-axis, which is a common enough thing to do, but specifically excluded from ggplot
(see here:
Using ggplot2, can I insert a break in the axis?
and here:
https://groups.google.com/forum/#!topic/ggplot2/jSrL_FnS8kc)
But, this same discussion suggests facets as a solution to your problem. One way to do that might be something like this:
library(tidyverse)
library(patchwork)
hm_rows <- 50
hm_selected <- 3
toydata <- data.frame(country=paste("Country",1:hm_rows) ,GDP=runif(hm_rows,0,5000))%>%
arrange(desc(GDP))%>%
filter(row_number()<=hm_selected | row_number()>(hm_rows-hm_selected))%>%droplevels
toydata$status <- rep(c("Highest","Lowest"),each=hm_selected)
ggplot(toydata%>%filter(status=="Highest"),aes(x=country,y=GDP))+
geom_bar(stat="identity")+
ggtitle("Highest")+
ylim(c(0,max(toydata$GDP)))+
ggplot(toydata%>%filter(status=="Lowest"),aes(x=country,y=GDP))+
geom_bar(stat="identity")+
ggtitle("Lowest")+
ylim(c(0,max(toydata$GDP)))+
theme(#possibly questionable, but tweaks the results closer to the single-graph requested:
axis.text.y=element_blank(),
axis.ticks=element_blank()
)+ylab("")

Related

R - Bar Plot with transparency based on values?

I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this over any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid the problem of overplotting similar to those found here, here, and here.
ggplot(myData, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, if I wanted to "match" the intensity I normalized the data (myDataNorm) and could make the same plot. In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9,1,0)), color='blue3', size=14) +
theme(legend.position='none')
Better, but I did not like the faint-colored areas that were left. The final code is what gave me something that perfectly captured what I was looking for. I simply moved the ifelse() statement to apply to the x aesthetic, so the parts of the segment drawn were only those with high enough y values. Note my data "starts" at x=290 here. Probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(
x=ifelse(y>0.9,x,290), xend=ifelse(y>0.9,x-1,290),
y=Sample, yend=Sample), color='blue3', size=14) +
xlim(290,400) # needed to show entire scale
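The normalisation step that produces myDataNorm is not shown above. A minimal sketch, assuming each Sample is scaled by its own maximum so that every curve peaks at 1; the two Gaussian "hills" are made-up stand-in data, and pivot_longer() is used here in place of gather():

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the real measurements: two "hills" of different height
myData <- data.frame(x = 290:450,
                     X52241 = 3 * dnorm(290:450, mean = 340, sd = 15),
                     X75123 = 7 * dnorm(290:450, mean = 380, sd = 20))

# Long format, then divide each sample by its own maximum
myDataNorm <- myData %>%
  pivot_longer(-x, names_to = "Sample", values_to = "y") %>%
  group_by(Sample) %>%
  mutate(y = y / max(y)) %>%
  ungroup()
```

With this scaling, a threshold such as y > 0.9 means "within 10% of that sample's own peak", which is what makes the hard-edged bars comparable across samples.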

How can I wrap a ggplot column n times after hitting threshold number

I have a barplot where I have one entry that is so much larger than my other entries that it makes it difficult to do interesting analysis on the other smaller valued data-points.
plt <- ggplot(dffd[dffd$Month==i & dffd$UniqueCarrier!="AA",],aes(x=UniqueCarrier,y=1,fill=DepDelay))+
geom_col()+
coord_flip()+
scale_fill_gradientn(breaks=late_breaks,labels=late_breaks,limits=c(0,150),colours=c('black','yellow','orange','red','darkred'))
When I remove it I get back to an interesting degree of interpretation but now I'm tossing out upwards of half the data and arguably the most important one to explore.
I was wondering if there is a way to set an interval on my bar plot, say 500 in this case, after which another column starts for the same entry right under it and the bar continues building up. In this example, WN would split into 3 bars of length 500, 500 and ~400, stacked one below the other, all under the one WN label (ideally with a single tick for all three). Since I have a couple of other disproportionately large entries, doing this as a layer during plotting is of great interest to me.
Typically, when you have such disproportionate values in your data set, you should either put your values on a log scale (or use some other transformation) or zoom in on the plot using coord_cartesian. I think you probably could hack your way around and create the desired plot, but it's going to be quite misleading in terms of visualisation and analysis.
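A rough sketch of those two alternatives, using made-up counts (the data frame counts and its flights column are hypothetical, not the asker's data):

```r
library(ggplot2)

# Made-up totals: one carrier dwarfs the others
counts <- data.frame(UniqueCarrier = c("WN", "DL", "UA", "B6"),
                     flights = c(1400, 120, 90, 40))

# Option 1: log scale. Points are used because bars have no natural
# baseline on a log axis (there is no zero to start from).
p_log <- ggplot(counts, aes(x = UniqueCarrier, y = flights)) +
  geom_point(size = 3) +
  scale_y_log10()

# Option 2: zoom in without dropping data. coord_cartesian() only changes
# the visible window, so the WN bar simply runs off the top of the panel.
p_zoom <- ggplot(counts, aes(x = UniqueCarrier, y = flights)) +
  geom_col() +
  coord_cartesian(ylim = c(0, 200))
```

Setting limits with ylim() instead would remove the out-of-range bar entirely, which is why coord_cartesian() is the usual choice for zooming.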
EDIT:
Based on your comments, I have a rather hacky solution. The data you've pasted was not directly usable (a part of dput was missing + there's no DepDelay column, so I improvised).
The idea is to create an extra tag column based on the UniqueCarrier column and the max amount you want.
df2 <- df %>%
filter(UniqueCarrier != "AA" & Month == i) %>%
group_by(UniqueCarrier) %>%
mutate(tag = paste(UniqueCarrier, rep(seq(1, n()%/%500+1), each=500), sep="_")[1:n()])
This adds a tag column that basically says how many columns you'll have in each category.
plt <- ggplot(df2, aes(x=tag, y=1, fill=DepDelay)) +
geom_col() +
coord_flip() +
scale_fill_gradientn(breaks=late_breaks, labels=late_breaks,
limits=c(0,150),
colours=c('black','yellow','orange','red','darkred')) +
scale_x_discrete(labels=str_replace(sort(unique(df2$tag)), "_[:digit:]", ""))
plt
In the image above, I've used CarrierDelay with a break interval of 100. You can see that the WN label then repeats; there are ways to remove the extra ones (some more creative replacements in the scale_x_discrete labels).
If you want the columns to be ordered differently, just replace seq(1, n()%/%500+1) with seq(n()%/%500+1, 1).
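To see what the tag column ends up containing, here is a small sketch on made-up data (1400 rows for WN, 300 for DL). It computes the chunk index from row_number() rather than with rep(), which is a swapped-in but equivalent way to number the chunks; chunk is a hypothetical name for the wrap interval.

```r
library(dplyr)

chunk <- 500   # maximum bar length before wrapping to a new bar

# Made-up data: one row per flight
df <- data.frame(UniqueCarrier = rep(c("WN", "DL"), times = c(1400, 300)))

# Rows 1-500 of each carrier get suffix _1, rows 501-1000 get _2, and so on
df2 <- df %>%
  group_by(UniqueCarrier) %>%
  mutate(tag = paste(UniqueCarrier, (row_number() - 1) %/% chunk + 1, sep = "_")) %>%
  ungroup()

count(df2, tag)
# DL_1 300, WN_1 500, WN_2 500, WN_3 400
```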

selecting only some facets to print in facet_wrap, ggplot2

My question is very simple, but I have failed to solve it after many attempts. I just want to print some facets of a faceted plot (made with facet_wrap in ggplot2), and remove the ones I am not interested in.
I have facet_wrap with ggplot2, as follows:
#anomalies linear trends
an.trends <- ggplot()+
geom_smooth(method="lm", data=tndvilong.anomalies, aes(x=year, y=NDVIan, colour=TenureZone,
group=TenureZone))+
scale_color_manual(values=miscol) +
ggtitle("anomalies' trends")
#anomalies linear trends by VEG
an.trendsVEG <- an.trends + facet_wrap(~VEG,ncol=2)
print(an.trendsVEG)
And I get the plot as I expected (you can see it in the link below):
anomalies' trends by VEG
The question is: how do I print only the facets I am interested in?
I only want to print "CenKal_ShWoodl", "HlShl_ShDens", "NKal_ShWoodl", and "ThShl_ShDens"
Thanks
I suggest the easiest way to do that is to simply give ggplot() an appropriate subset. In this case:
facets <- c("CenKal_ShWoodl", "HlShl_ShDens", "NKal_ShWoodl", "ThShl_ShDens")
an.trends.sub <- ggplot(tndvilong.anomalies[tndvilong.anomalies$VEG %in% facets,])+
geom_smooth(method="lm", aes(x=year, y=NDVIan, colour=TenureZone,
group=TenureZone))+
scale_color_manual(values=miscol) +
ggtitle("anomalies' trends") +
facet_wrap(~VEG,ncol=2)
Obviously without your data I can't be sure this will give you what you want, but based on your description, it should work. I find that with ggplot, it is generally best to pass it the data you want plotted, rather than finding ways of changing the plot itself.
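Since the original data is not available, here is the same idea on a built-in dataset, as a sketch only: ggplot2's mpg stands in for tndvilong.anomalies and its class column for VEG.

```r
library(ggplot2)

# Facets to keep (stand-ins for the VEG values of interest)
keep <- c("compact", "suv")

# Subset first, then plot: only the kept classes can become panels
mpg_sub <- subset(mpg, class %in% keep)

p <- ggplot(mpg_sub, aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm") +
  facet_wrap(~class, ncol = 2)
```

Because the subsetting happens before ggplot() sees the data, the unwanted panels never exist, and nothing has to be removed from the plot afterwards.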

varying axis values in facet_wrap

I am working with a Danish dataset on immigrants by country of origin and age group. I transformed the data so I can see the top countries of origin for each age group.
I am plotting it using facet_wrap. What I would like to do is, since different age groups come from quite different areas, to show a different set of values for one axis in each facet. For example, those that are between 0 and 10 years old come from countries x,y and z, while those 10-20 years of age come from countries q, r, z and so on.
In my current version, it shows the entire set of values, including countries that are not in the top 10. I would like to show just the top ten countries of origin for each facet, in effect having different axis labels for each. (And, if it is possible, sorting by high to low for each facet).
Here is what I have so far:
library(ggplot2)
library(reshape)
###load and inspect data
load(url('http://dl.dropbox.com/u/7446674/dk_census.rda'))
head(dk_census)
###reshape for plotting--keep just a few age groups
dk_census.m <- melt(dk_census[dk_census$Age %in% c('0-9 år', '10-19 år','20-29 år','30-39 år'),c(1,2,4)])
###get top 10 observations for each age group, store in data frame
top10 <- by(dk_census.m[order(dk_census.m$Age,-dk_census.m$value),], dk_census.m$Age, head, n=10)
top10.df<-do.call("rbind", as.list(top10))
top10.df
###plot
ggplot(data=top10.df, aes(x=as.factor(Country), y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
One option (that I actually strongly suspect you won't be happy with) is this:
p <- ggplot(data=top10.df, aes(x=Country, y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
library(plyr) # provides dlply()
pp <- dlply(.data=top10.df,.(Age),function(x) {x$Country <- reorder(x$Country,x$value); p %+% x})
library(gridExtra)
do.call(grid.arrange,pp)
(Edited to sort each graph.)
Keep in mind that the only reason faceting exists is to plot multiple panels that share a common scale. So when you start asking to facet on some variable, but have the scales be different (oh, and also sort them separately on each panel as well) what you're doing is really no longer faceting. It's just making four different plots and arranging them together.
Using lattice (here I use latticeExtra for the ggplot2 theme), you can set relation='free' between panels. Here I am using abbreviate=TRUE to shorten long labels.
library(latticeExtra)
barchart(value~ Country|Age,data=top10.df,layout=c(2,2),
horizontal=T,
par.strip.text =list(cex=2),
scales=list(y=list(relation='free',cex=1.5,abbreviate=T,
labels=levels(factor(top10.df$Country)))),
# ,cex=1.5,abbreviate=F),
par.settings = ggplot2like(),axis=axis.grid,
main="Immigrants By Country by Age",
ylab="Country of Origin",
xlab="Population")

Ordering the bars of a stacked bar graph in ggplot from least to greatest

Is there a way to specify that I want the bars of a stacked bar graph with ggplot ordered in terms of the total of the four factors from least to greatest? (So in the code below, I want to order by the total of all of the variables.) I have the total for each x value in a dataframe that I melted to create the dataframe from which I formed the graph.
The code that I am using to graph is:
ggplot(md, aes(x=factor(fullname), fill=factor(variable))) + geom_bar()
My current graph looks like this:
http://i.minus.com/i5lvxGAH0hZxE.png
The end result is I want to have a graph that looks a bit like this:
http://i.minus.com/kXpqozXuV0x6m.jpg
My data looks like this:
(source: minus.com)
and I melt it to this form where each student has a value for each category:
melted data http://i.minus.com/i1rf5HSfcpzri.png
before using the following line to graph it
ggplot(data=md, aes(x=fullname, y=value, fill=variable), ordered=TRUE) + geom_bar()+ opts(axis.text.x=theme_text(angle=90))
Now, I'm not really sure that I understand the way Chi does the ordering and if I can apply that to the data from either of the frames that I have. Maybe it's helpful that the data is ordered in the original data frame that I have, the one that I show first.
UPDATE: We figured it out. See this thread for the answer:
Order Stacked Bar Graph in ggplot
I'm not sure about the way your data were generated (i.e., whether you use a combination of cast/melt from the reshape package, which is what I suspect given the default name of your variables), but here is a toy example where sorting is done outside the call to ggplot. There might be far better ways to do that; browse SO as suggested by @Andy.
v1 <- sample(c("I","S","D","C"), 200, rep=T)
v2 <- sample(LETTERS[1:24], 200, rep=T)
my.df <- data.frame(v1, v2)
idx <- order(apply(table(v1, v2), 2, sum))
library(ggplot2)
ggplot(my.df, aes(x=factor(v2, levels=LETTERS[1:24][idx], ordered=TRUE),
fill=v1)) + geom_bar() + opts(axis.text.x=theme_text(angle=90)) +
labs(x="fullname")
To sort in the reverse direction, add decreasing=TRUE to the order() call. Also, as suggested by @Andy, you might overcome the problem of overlapping x-labels by adding + coord_flip() instead of the opts() option.
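The opts() and theme_text() calls used above were later replaced in ggplot2 by theme() and element_text(). A sketch of the same toy example in that newer syntax, with the seed and the intermediate name totals added for illustration:

```r
library(ggplot2)

set.seed(42)                                   # reproducible toy data
v1 <- sample(c("I", "S", "D", "C"), 200, replace = TRUE)
v2 <- sample(LETTERS[1:24], 200, replace = TRUE)
my.df <- data.frame(v1, v2)

# Total count per x value, sorted from least to greatest
totals <- sort(table(my.df$v2))

# Use that order as the factor levels so the bars follow it
my.df$v2 <- factor(my.df$v2, levels = names(totals))

p <- ggplot(my.df, aes(x = v2, fill = v1)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "fullname")
```

Passing decreasing = TRUE to sort() reverses the order, in the same way as for order() above.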

Resources