I have a data frame containing order data for each of 20+ products from each of 20+ countries. I have put it in a highlight table using ggplot2 with code similar to this:
require(ggplot2)
require(reshape)
require(scales)
mydf <- data.frame(industry = c('all industries','steel','cars'),
'all regions' = c(250,150,100), americas = c(150,90,60),
europe = c(150,60,40), check.names = FALSE)
mydf
mymelt <- melt(mydf, id.var = c('industry'))
mymelt
ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
geom_tile() + geom_text(aes(label = value))
Which produces a plot like this:
In the real plot, the 450 cell table very nicely shows the 'hotspots' where orders are concentrated. The last refinement I want to implement is to arrange the items on both the x-axis and y-axis in alphabetical order. So in the plot above, the y-axis (variable) would be ordered as all regions, americas, then europe and the x-axis (industry) would be ordered all industries, cars and steel. In fact the x-axis is already ordered alphabetically, but I wouldn't know how to achieve that if it were not already the case.
I feel somewhat embarrassed about having to ask this question, as I know there are many similar ones on SO, but sorting and ordering in R remains my personal bugbear and I cannot get this to work. Although I do try, in all except the simplest cases I get lost in a welter of calls to factor, levels, sort, order and with.
Q. How can I arrange the above highlight table so that both y-axis and x-axis are ordered alphabetically?
EDIT: The answers from smillig and joran below do resolve the question with the test data but with the real data the problem remains: I can't get an alphabetical sort. This leaves me scratching my head as the basic structure of the data frame looks the same. Clearly I have omitted something, but what??
> str(mymelt)
'data.frame': 340 obs. of 3 variables:
$ Industry: chr "Animal and vegetable products" "Food and beverages" "Chemicals" "Plastic and rubber goods" ...
$ variable: Factor w/ 17 levels "Other areas",..: 17 17 17 17 17 17 17 17 17 17 ...
$ value : num 0.000904 0.000515 0.007189 0.007721 0.000274 ...
However, applying the with statement doesn't result in levels with an alphabetical sort.
> with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
[1] USA USA USA
[4] USA USA USA
[7] USA USA USA
[10] USA USA USA
[13] USA USA USA
[16] USA USA USA
[19] USA USA Canada
[22] Canada Canada Canada
[25] Canada Canada Canada
[28] Canada Canada Canada
All the way down to:
[334] Other areas Other areas Other areas
[337] Other areas Other areas Other areas
[340] Other areas
And if you do a levels() it seems to show the same thing:
[1] "Other areas" "Oceania" "Africa"
[4] "Other Non-Eurozone" "UK" "Other Eurozone"
[7] "Holland" "Germany" "Other Asia"
[10] "Middle East" "ASEAN-5" "Singapore"
[13] "HK/China" "Japan" "South Central America"
[16] "Canada" "USA"
That is, the non-reversed version of the above.
The following shot shows what the plot of the real data looks like. As you can see, the x-axis is sorted and the y-axis is not. I'm perplexed. I'm missing something but can't see what it is.
The y-axis on your chart is also already ordered alphabetically, but from the origin. I think you can achieve the order of the axes that you want by using xlim and ylim. For example:
# assumes industry and variable are factors, so that levels() returns their labels
ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
geom_tile() + geom_text(aes(label = value)) +
ylim(rev(levels(mymelt$variable))) + xlim(levels(mymelt$industry))
will order the y-axis from all regions at the top, followed by americas, and then europe at the bottom (which is reverse alphabetical order, technically). The x-axis is alphabetically ordered from all industries to steel with cars in between.
As smillig says, the default is already to order the axes alphabetically, but the y axis will be ordered from the lower left corner up.
The basic rule with ggplot2 that applies to almost anything that you want in a specific order is:
If you want something to appear in a particular order, you must make the corresponding variable a factor, with the levels sorted in your desired order.
In this case, all you should need to do is this:
mymelt$variable <- with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
which should work regardless of whether you're running R with stringsAsFactors = TRUE or FALSE.
This principle applies to ordering axis labels, ordering bars, ordering segments within bars, ordering facets, etc.
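As a minimal sketch of that rule, using only base R and made-up labels (the vector regions is hypothetical, not the asker's data): it is the order of levels(), not the order of the rows, that a discrete ggplot2 axis follows.

```r
# Hypothetical labels, deliberately not in alphabetical order
regions <- c("europe", "all regions", "americas", "europe")

# Default: factor() sorts the unique character values alphabetically
default_f <- factor(regions)
levels(default_f)

# Reverse the level order so that the alphabetically first label ends up
# at the top of a y-axis (the first level is drawn nearest the origin)
reversed_f <- factor(regions, levels = rev(sort(unique(regions))))
levels(reversed_f)

# The values themselves are unchanged; only the level order differs
same_values <- identical(as.character(default_f), as.character(reversed_f))
same_values
```

Because regions is a character vector here, sort() compares the labels themselves, which is what makes the result alphabetical.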
For continuous variables there is a convenient scale_*_reverse() but apparently not for discrete variables, which would be a nice addition, I think.
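For what it is worth, more recent ggplot2 releases (3.3.0 and later, as far as I know) accept a function for the limits of a discrete scale, which gives much the same effect as a reverse scale. The sketch below rebuilds the toy data directly in long form; treat the version requirement as an assumption and check it against your installed ggplot2.

```r
library(ggplot2)

# Toy data in long form, mirroring the example from the question
mymelt <- data.frame(
  industry = rep(c("all industries", "steel", "cars"), times = 3),
  variable = factor(rep(c("all regions", "americas", "europe"), each = 3)),
  value    = c(250, 150, 100, 150, 90, 60, 150, 60, 40)
)

# limits = rev reverses whatever order the factor levels already have,
# so "all regions" is drawn at the top of the y-axis
p <- ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value)) +
  scale_y_discrete(limits = rev)

# Building the plot forces the scales to be computed
built <- ggplot_build(p)
```

This leaves the factor itself untouched, which can be convenient when the same column is used in other plots.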
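Another possibility is to use fct_reorder from the forcats package. Note that this orders the y-axis by value rather than alphabetically.
library(forcats) # fct_reorder
library(dplyr) # mutate, %>%
library(tidyr) # pivot_longer
library(ggplot2)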
mydf %>%
pivot_longer(cols=c('all regions', 'americas', 'europe')) %>%
mutate(name1=fct_reorder(name, value, .desc=FALSE)) %>%
ggplot( aes(x = industry, y = name1, fill = value)) +
geom_tile() + geom_text(aes( label = value))
Maybe a little bit late, but
with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
does not sort alphabetically, because variable is already a factor: sort() orders a factor by its existing level order, not by its labels.
You should first convert the variable to character with the as.character function, like so:
with(mymelt,factor(variable,levels = rev(sort(unique(as.character(variable))))))
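The difference can be seen in a small base R sketch. The factor region below is made up for illustration; its levels are deliberately stored in a non-alphabetical order, as in the str() output from the question.

```r
# Hypothetical factor whose levels are NOT in alphabetical order
region <- factor(c("USA", "Canada", "Japan", "UK"),
                 levels = c("UK", "Japan", "Canada", "USA"))

# sort() on a factor follows the existing level order, so this only
# reverses the current levels instead of sorting them alphabetically
by_levels <- levels(factor(region, levels = rev(sort(unique(region)))))
by_levels

# Converting to character first makes sort() compare the labels themselves
by_labels <- levels(factor(region,
                           levels = rev(sort(unique(as.character(region))))))
by_labels
```

Only the second version gives reverse alphabetical levels, which is what puts the alphabetically first label at the top of the y-axis.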
maybe this StackOverflow question can help:
Order data inside a geom_tile
specifically the first answer by Brandon Bertelsen:
"Note it's not an ordered factor, it's a factor in the right order"
It helped me to get the right order of the y-axis in a ggplot2 geom_tile plot.
Related
I am trying to make a plot of GDP vs CO2 emissions globally. I have found that I have two countries that have data that is a lot larger than the rest of the data so I am trying to separate it with facet_wrap so I have one graph of the two outlier countries and one graph with the rest of the data.
My code thus far is
ggplot(CO2_GDP, aes(x= GDP, y=value)) +
geom_point(size=1)+
labs(title = "GDP and CO2 Emissions", y= "CO2 Emissions in Tons", x= "GDP in Billions of USD") +
facet_wrap(~country_name==c("China", "United States"))
This gives me one graph with all of the countries including China and the United States and another graph of just China and United States. I need to find a way to remove China and United States from the first graph but have just that data on the second graph.
I thought that adding the comma between China and United States in the last line would remove them from the first graph and show them only on the second, but that's not the case: as you can see in this image, the data on the "True" graph is still on the "False" graph, where it's not supposed to be.
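A likely cause is that == compares element by element and recycles the two-element vector along the rows, so a row only matches when its country happens to line up with the recycled position. A minimal sketch with made-up numbers follows; the data frame is a hypothetical stand-in for CO2_GDP, and is_outlier is a helper column introduced here for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for CO2_GDP: two rows per country, made-up values
CO2_GDP <- data.frame(
  country_name = rep(c("China", "United States", "India"), each = 2),
  GDP   = c(14000, 15000, 21000, 22000, 2700, 2900),
  value = c(10000, 10500, 5000, 4900, 2400, 2500),
  stringsAsFactors = FALSE
)

outliers <- c("China", "United States")

# == recycles `outliers` as China, United States, China, ... down the rows,
# so only rows that happen to line up are TRUE
by_equals <- CO2_GDP$country_name == outliers
by_equals

# %in% tests every row against the whole set of names
CO2_GDP$is_outlier <- CO2_GDP$country_name %in% outliers
CO2_GDP$is_outlier

# Facet on the helper column; free scales let each panel zoom to its own data
p <- ggplot(CO2_GDP, aes(x = GDP, y = value)) +
  geom_point(size = 1) +
  facet_wrap(~ is_outlier, scales = "free")
```

With %in%, each country falls in exactly one panel.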
I have extracted seeds from the seed bank, both under the tree crown and 3 m away from the crown. I have these data for three study sites in two countries, South Australia and Sri Lanka (a part of the data is attached). The script I used to produce a box-and-whisker plot using lattice is given below. In fact, I have prepared two separate plots here, one for each country. I want to combine them into a single graph: data for one country (South Australia) on one side of the plot (beneath the crown and 3 m away, in two colors) and the other country (Sri Lanka) on the other side, with the same two colors for beneath the crown and 3 m away.
setwd("E:/Research/Fieldwork SL-data/Seed bank/analysis")
seed.bank <- read.csv(file="seedbank_rev.csv", header=TRUE, sep=',')
attach(seed.bank)
names(seed.bank)
## [1] "seed.no." "location" "study.site" "country"
seed.bank1 <- seed.bank[seed.bank$country != "Sri Lanka", ]       # South Australia rows
seed.bank2 <- seed.bank[seed.bank$country != "South Australia", ] # Sri Lanka rows
library("lattice")
bwplot(log(seed.no.) ~ study.site | location, data=seed.bank1, xlab="Study Sites in South Australia", ylab="log(seed number)")
bwplot(log(seed.no.) ~ study.site | location, data=seed.bank2, xlab="Study Sites in Sri Lanka", ylab="log(seed number)")
A simple base R solution to what seems to be your problem is a barplot (since you have categorical variables):
# define dataframe:
df <- data.frame(
location = c(rep("beneath",1), rep("at_distance",4),rep("beneath",4), rep("at_distance",3)),
country = c(rep("SA",6),rep("SL",6))
)
# check structure:
str(df)
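This step shows how the columns are stored (factors by default in R versions before 4.0, character strings from 4.0 on). table() counts either type, so the conversion below is optional; it is written with df[] so that df stays a data frame:
# convert factors to characters (optional):
df[] <- lapply(df, as.character)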
# make frequency table:
freq_seeds <- table(df$country, df$location)
Now you are ready to plot the data; the key argument to place the corresponding bars side by side is beside=TRUE:
# define plotting region:
par(mfrow=c(1,1), mar=c(4,4,4,4))
# barplot:
barplot(freq_seeds, beside = TRUE, main = "Seeds", col = c("blue", "green"))
# draw legend:
legend("topright", c("South Australia", "Sri Lanka"), fill=c("blue", "green"), col=c("blue", "green"))
I have a Count for each Site (which corresponds with a country), and each Site belongs to a Region. The data looks like this:
> summary_data
Site Count Region
1 Chad 5 Africa
2 Angola 1 Africa
3 France 10 Europe
4 USA 6 Americas
5 Bolivia 3 Americas
6 Chile 4 Americas
I would like to generate a bar graph that:
Has a bar per country
The bars for a region are all next to each other in the bar graph
Per region, the bars appear in descending order
The bars are all the same width, but the heights are all on the same scale
Can be generalized (in particular: arbitrary regions, arbitrary countries per region)
I do not want to use fill color to represent the region (I want to use color to represent another characteristic eventually)
I want to have some visual representation to group the columns (for instance, a gray background behind all the columns for the Americas region, a blue background behind all the columns for the Africa region, etc.). I would actually be open to other approaches (perhaps a line at the top spanning all of Africa with "Africa" as a label, or something similar).
Obviously each region can have a different number of country sites, and no country site spans two regions (I tried using facets but quickly realized that was not the right route). I also tried looping through all the regions to generate separate graphs per region and then put them together but that didn't quite seem the right approach either.
I have generated a graph like this (Closest I have gotten):
Using this code:
library("dplyr")
library(ggplot2)
sorted <- arrange(summary_data,Region,-Count)
sorted$Site <- factor(sorted$Site, levels = sorted$Site)
bar = ggplot(sorted,
aes(
x = Site,
y = Count,
fill = Region
)) +
geom_col()
print(bar)
But this does not meet the last two requirements I set above (I specifically do not want to use fill to represent region). I started down the path of geom_rect() but did not understand the coordinate system for discrete x values rather than continuous (I did find Stackoverflow questions / answers on continuous but didn't see how to translate to this). I think having shaded rectangles behind the columns is probably the best approach, but I would appreciate any input in general approach as well as how to pull it off.
You could consider defining a new panel for each region to separate them using facet_grid. If you want the colors to be the same, just remove the aes(fill = Site) argument inside geom_bar.
The argument space = "free_x" ensures that all bars have the same width, and with scales = "free" only the axis values belonging to each region are shown.
ggplot(sorted, aes(x = Site, y = Count)) +
geom_bar(position = "dodge", stat = "identity", aes(fill = Site)) +
facet_grid(. ~ Region, scales = "free", space = "free_x")
I have produced a line graph that looks something like this
I have a data set of 50 countries and their GDP for the last 10 years.
Sample data:
Country variable value
China Y2007 3.55218e+12
USA Y2007 1.45000e+13
Japan Y2007 4.51526e+12
UK Y2007 3.06301e+12
Russia Y2007 1.29971e+12
Canada Y2007 1.46498e+12
Germany Y2007 3.43995e+12
India Y2007 1.20107e+12
France Y2007 2.66311e+12
SKorea Y2007 1.12268e+12
I generated the line graph using the code
GDP_lineplot = ggplot(data=GDP_linechart, aes(x=variable,y=value)) +
geom_line() +
scale_y_continuous(name = "GDP(USD in Trillions)",
breaks = c(0.0e+00,5.0e+12,1.0e+13,1.5e+13),
labels = c(0,5,10,15)) +
scale_x_discrete(name = "Years", labels = c(2007,"",2009,"",2011,"",2013,"",2015))
The idea is to make the graph look like this.
I tried adding
group=country, color = country
It outputs coloring all the countries.
How can I color the top 4 countries differently from the rest?
PS: I am still naive with R.
By plotting subsets, the other groups aren't included in the colour legend on the right. The alternative approach below manipulates factor levels and uses a customized color scale to overcome this.
Preparing data
It is assumed that GDP_long contains the data in long format. This is in line with the data shown by the OP (GDP_linechart, but see the Data section below for differences). To manipulate factor levels, the forcats package is used (and data.table).
library(data.table)
library(forcats)
# coerce to data.table, reorder factor levels by the value in the last (most recent) year
setDT(GDP_long)[, Country := fct_reorder(Country, -value, last)]
# create new factor which collapses all countries to "Other" except the top 4 countries
GDP_long[, top_country := fct_other(Country, keep = head(levels(Country), 4))]
Create plot
library(ggplot2)
ggplot(GDP_long, aes(Year, value/1e12, group = Country, colour = top_country)) +
geom_point() + geom_line(size = 1) + theme_bw() + ylab("GDP(USD in Trillions)") +
scale_colour_manual(name = "Country",
values = c("green3", "orange", "blue", "red", "grey"))
The chart is now quite similar to the expected result. The lines of the top 4 countries are displayed in different colours while the other countries are displayed in grey but do appear in the colour legend to the right.
Note that the group aesthetic is still needed so that a single line is plotted for each country while colour is controlled by the levels of top_country.
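For readers without forcats or data.table, the same collapsing idea can be sketched in base R. This swaps in plain indexing for fct_reorder and fct_other, and the data frame below uses made-up values for six hypothetical countries, so treat it as an illustration of the idea rather than the code used for the chart above.

```r
# Made-up long-format data: 6 countries, 2 years (illustrative values only)
GDP_long <- data.frame(
  Country = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  Year    = rep(c(2014L, 2015L), each = 6),
  value   = c(5, 9, 2, 7, 1, 3,
              6, 8, 2, 9, 1, 4),
  stringsAsFactors = FALSE
)

# Rank countries by their value in the most recent year and keep the top 4
latest <- GDP_long[GDP_long$Year == max(GDP_long$Year), ]
top4   <- latest$Country[order(latest$value, decreasing = TRUE)][1:4]

# Collapse every other country into "Other"; the level order (top 4 first,
# "Other" last) is the order in which a manual colour scale is matched
GDP_long$top_country <- factor(
  ifelse(GDP_long$Country %in% top4, GDP_long$Country, "Other"),
  levels = c(top4, "Other")
)

table(GDP_long$top_country)
```

Country stays a separate column, so it can still serve as the group aesthetic while top_country drives the colour.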
Data
The data set is too large to be reproduced here (even with dput()). The structure
str(GDP_long)
'data.frame': 1763 obs. of 3 variables:
$ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ value : num 9.84e+09 1.07e+10 1.35e+11 4.01e+09 6.04e+10 ...
is similar to the OP's data, except that the variable column has already been converted to an integer column Year. This gives a nicely formatted x-axis without additional effort.
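One way to do that conversion, assuming the labels always have the form "Y" followed by a four-digit year as in the OP's sample, is sketched below in base R. The vector variable stands in for the OP's column.

```r
# Labels in the form shown in the OP's sample data
variable <- c("Y2007", "Y2008", "Y2015")

# Strip the leading "Y" and convert the remainder to integer
Year <- as.integer(sub("^Y", "", variable))
Year
```

With an integer Year on the x-axis, ggplot2 uses a continuous scale, so the manual scale_x_discrete labels from the question are no longer needed.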
My apologies I missed the part about only coloring a subset of the countries... in the geom_line calls you can add the subsetting that suits your needs.
df <- data.frame(Country=rep(LETTERS[1:10], each=5),
Year=rep(2007:2011, length.out=10),
value=rnorm(50))
ggplot(df) +
geom_line(data=df[21:50, ], aes(x=Year, y=value, group=Country), color="#999999") +
geom_line(data=df[1:20, ], aes(Year, y=value, color=Country))
I am working with a Danish dataset on immigrants by country of origin and age group. I transformed the data so I can see the top countries of origin for each age group.
I am plotting it using facet_wrap. What I would like to do is, since different age groups come from quite different areas, to show a different set of values for one axis in each facet. For example, those that are between 0 and 10 years old come from countries x,y and z, while those 10-20 years of age come from countries q, r, z and so on.
In my current version, it shows the entire set of values, including countries that are not in the top 10. I would like to show just the top ten countries of origin for each facet, in effect having different axis labels for each. (And, if it is possible, sorting by high to low for each facet).
Here is what I have so far:
library(ggplot2)
library(reshape)
###load and inspect data
load(url('http://dl.dropbox.com/u/7446674/dk_census.rda'))
head(dk_census)
###reshape for plotting--keep just a few age groups
dk_census.m <- melt(dk_census[dk_census$Age %in% c('0-9 år', '10-19 år','20-29 år','30-39 år'),c(1,2,4)])
###get top 10 observations for each age group, store in data frame
top10 <- by(dk_census.m[order(dk_census.m$Age,-dk_census.m$value),], dk_census.m$Age, head, n=10)
top10.df<-do.call("rbind", as.list(top10))
top10.df
###plot
ggplot(data=top10.df, aes(x=as.factor(Country), y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
One option (that I actually strongly suspect you won't be happy with) is this:
p <- ggplot(data=top10.df, aes(x=Country, y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
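library(plyr) # provides dlply()
library(gridExtra) # provides grid.arrange()
# one plot per age group, each with Country reordered by value
pp <- dlply(.data=top10.df,.(Age),function(x) {x$Country <- reorder(x$Country,x$value); p %+% x})
do.call(grid.arrange,pp)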
(Edited to sort each graph.)
Keep in mind that the only reason faceting exists is to plot multiple panels that share a common scale. So when you start asking to facet on some variable, but have the scales be different (oh, and also sort them separately on each panel as well) what you're doing is really no longer faceting. It's just making four different plots and arranging them together.
Using lattice (here I use latticeExtra for the ggplot2 theme), you can set relation = 'free' between panels. Here I am using abbreviate = TRUE to shorten long labels.
library(latticeExtra)
barchart(value~ Country|Age,data=top10.df,layout=c(2,2),
horizontal=T,
par.strip.text =list(cex=2),
scales=list(y=list(relation='free',cex=1.5,abbreviate=T,
labels=levels(factor(top10.df$Country)))),
# ,cex=1.5,abbreviate=F),
par.settings = ggplot2like(),axis=axis.grid,
main="Immigrants By Country by Age",
ylab="Country of Origin",
xlab="Population")