How can I color a line graph by grouping the variables in R? - r

I have produced a line graph something that looks like this
I have the data set of 50 countries and its GDP for last 10 years.
Sample data:
Country variable value
China Y2007 3.55218e+12
USA Y2007 1.45000e+13
Japan Y2007 4.51526e+12
UK Y2007 3.06301e+12
Russia Y2007 1.29971e+12
Canada Y2007 1.46498e+12
Germany Y2007 3.43995e+12
India Y2007 1.20107e+12
France Y2007 2.66311e+12
SKorea Y2007 1.12268e+12
I generated the line graph using the code
GDP_lineplot = ggplot(data=GDP_linechart, aes(x=variable,y=value)) +
geom_line() +
scale_y_continuous(name = "GDP(USD in Trillions)",
breaks = c(0.0e+00,5.0e+12,1.0e+13,1.5e+13),
labels = c(0,5,10,15)) +
scale_x_discrete(name = "Years", labels = c(2007,"",2009,"",2011,"",2013,"",2015))
The idea is to make the graph look like this.
I tried adding
group=country, color = country
It outputs coloring all the countries.
How can I color the countries with top 4 and the rest?
PS: I am still naive with R.

By plotting subsets, the other groups aren't included in the colour legend on the right. The alternative approach below manipulates factor levels and uses a customized color scale to overcome this.
Preparing data
It is assumed that GDP_long contains the data in long format. This is in line with the data shown by the OP (GDP_lineplot, but see Data section below for differences). To manipulate factor levels, the forcatspackage is used (and data.table).
library(data.table)
library(forcats)
# coerce to data.table, reorder factors by values in last = most actual year
setDT(GDP_long)[, Country := fct_reorder(Country, -value, last)]
# create new factor which collapses all countries to "Other" except the top 4 countries
GDP_long[, top_country := fct_other(Country, keep = head(levels(Country), 4))]
Create plot
library(ggplot2)
ggplot(GDP_long, aes(Year, value/1e12, group = Country, colour = top_country)) +
geom_point() + geom_line(size = 1) + theme_bw() + ylab("GDP(USD in Trillions)") +
scale_colour_manual(name = "Country",
values = c("green3", "orange", "blue", "red", "grey"))
The chart is now quite similar to the expected result. The lines of the top 4 countries are displayed in different colours while the other countries are displayed in grey but do appear in the colour legend to the right.
Note that the groupaesthetic is still needed so that a single line is plotted for each country while colour is controlled by the levels of top_country.
Data
The data set is too large to be reproduced here (even with dput()). The structure
str(GDP_long)
'data.frame': 1763 obs. of 3 variables:
$ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ value : num 9.84e+09 1.07e+10 1.35e+11 4.01e+09 6.04e+10 ...
is similar to OP's data with the exception that the variable column already is converted to an integer column year. This will give a nicely formatted x-axis without additional effort.

My apologies I missed the part about only coloring a subset of the countries... in the geom_line calls you can add the subsetting that suits your needs.
df <- data.frame(Country=rep(LETTERS[1:10], each=5),
Year=rep(2007:2011, length.out=10),
value=rnorm(50))
ggplot(df) +
geom_line(data=df[21:50, ], aes(x=Year, y=value, group=Country), color="#999999") +
geom_line(data=df[1:20, ], aes(Year, y=value, color=Country))

Related

ggplot2 - How to plot length of time using geom_bar?

I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there's a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with a geom_bar or geom_col. A more general approach would be to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so we have one row per rectangle that we want to draw and we've also make Crop a factor. We can then plot it like this
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here that we are using a discrete scale for y but geom_rect prefers numeric values, so since the values are factors now, we use the numeric values for the factors to create ymin and ymax positions. Then we need to replace the y axis with the names of the levels of the factor.
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to you plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))

How to plot top 5 most frequent variables by region in R

I am looking to do a plot to look into the most common occuring FINAL_CALL_TYPE in my dataset by BOROUGH in NYC. I have a dataset with over 3 million obs. I broke this down into a sample of 2000, but have refined it even more to just the incident type and the borough it occured in.
Essentially, I want to create a plot that will visualize to the 5 most common call types in each borough, with the count of how many of each call types there was in each borough.
Below is a brief look of how my data looks with just Call Type and Borough
> head(df)
FINAL_CALL_TYPE BOROUGH
1804978 INJURY BRONX
1613888 INJMAJ BROOKLYN
294874 INJURY BROOKLYN
1028374 DRUG BROOKLYN
1974030 INJURY MANHATTAN
795815 CVAC BRONX
This shows how many unique values there are
> str(df)
'data.frame': 2000 obs. of 2 variables:
$ FINAL_CALL_TYPE: Factor w/ 139 levels "ABDPFC","ABDPFT",..: 50 48 50 34 50 25 17 138 28 28 ...
$ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 1 2 2 2 3 1 4 2 4 4 ...
This is the code that I have tried
> ggplot(df, aes(x=BOROUGH, y=FINAL_CALL_TYPE)) +
+ geom_bar(stat = 'identity') +
+ facet_grid(~BOROUGH)
and below is the result
I have tried a few suggestions accross this community, but I have not found any that shows how to perform the action with 2 columns.
It would be much appreciated if there is someone who know a solution for this.
Thanks!
If I understand correctly, you can use tidyverse to doo something like:
df <- df %>%
group_by(BOROUGH, FINAL_CALL) %>%
summarise(count = n()) %>%
top_n(n = 5, wt = count)
then plot
ggplot(df, aes(x = FINAL_CALL, y = count) +
geom_col() +
facet(~BOROUGH, scales = "free")
creating the barplot
The first part of your problem is to create the barplot. With geom_bar you only need to supply the x variable, as the y-axis is the count of observations of that variable. You can then use the facet option to separate that count into different panels for another grouping variable.
library(ggplot2)
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_grid(.~cut)
filtering to top 5 observations
The second part of your problem, limiting the data to only the top five in each group is slightly more complex. An easy way to do this is to first tally the data which will create a column n that has the count of observations. By adding the sort option we can filter the data to the first five rows in each group. tally, like summarize, automatically removes the last group.
In the ggplot call I now use geom_col instead of geom_bar and I explicitly specify that the y-variable is n (n is created by tally).
geom_bar plots the count of observations per x-variable, geom_col plots a y-variable value for each value of the x-variable.
scales = "free_x" removes values from the x-axis that are present in one cut panel but not another.
library(tidyverse)
df <- diamonds %>%
group_by(cut, color) %>%
tally(sort = TRUE) %>%
filter(row_number() <= 5)
ggplot(data = df, aes(x = color, y = n)) +
geom_col() +
facet_grid(.~cut, scales = "free_x")

Using ggplot to create a professional looking graph

I have tried to use ggplot2 to create a professional-looking graph, but I have having some trouble with a lot of things. I would like to add color to the data points, add dates on the x-axis, and create a line of best fit or something similar if possible. I have been searching on Stack Exchange and Google in general to try and solve this problem but to no avail. I am using the "Civilian Labor Force Participation Rate: 20 years and over, Black or African American Men" from the Federal Reserve Bank of St. Louis (FRED).
I am using RStudio, and I imported the data from LNS11300031 and then used the read.csv() function to read it into RStudio. I initially used the plot() function to plot the data, but I want to use the ggplot() function to create a better looking graph, but when I create the graph the data points look very opaque, blurry, and cloudy, and there is no labels on the x-axis. I would like to add color and a line of best fit, but I do not know how to do that.
This is the code I used to create the graph with no x-axis labels:
ggplot(data = labor, mapping = aes(x = labor$DATE, y = labor$LNS11300031)) + geom_point(alpha = 0.1)
This is the graph that my code produced:
Here is some sample data (labor is the variable I used to store the data from the FRED site):
head(labor) DATE LNS11300031
1 1972-01-01 77.6
2 1972-02-01 78.3
3 1972-03-01 78.7
4 1972-04-01 78.6
5 1972-05-01 78.7
6 1972-06-01 79.4
I would like to change the variable name LNS11300031 to Labor Force Participation Rate
Additional information about the data:
str(labor)
'data.frame': 566 obs. of 2 variables:
$ DATE : Factor w/ 566 levels "1972-01-01","1972-02-01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ LNS11300031: num 77.6 78.3 78.7 78.6 78.7 79.4 78.8 78.7 78.6 78.1 ...
I would like the code to create much clearer data points with color and a trend line, and be able to have an x-axis with the corresponding dates.
Here's a basic attempt to cover all 3 of your desired improvements:
Clearer points: don't set the alpha too low! A bit of alpha is good for overlapping points, but alpha = 0.1 makes them too blurry.
Colour: R understands simple colour names like "red", but also hex colour codes. Pick any colours you want.
Trend line: easy to add with stat_smooth(). I've used method='lm' which gives a straight linear regression line but there are more flexible alternatives.
Date labels on the x-axis: Make sure your DATE column is correctly set as a Date type, and use scale_x_date() to tweak the labels.
quantmod::getSymbols("LNS11300031", src="FRED")
# Your data is available from the quantmod package
labor = LNS11300031 %>%
as.data.frame() %>%
rownames_to_column(var = "DATE") %>%
# Make sure DATE is a Date column
mutate(DATE = as.Date(DATE))
# Generally, you don't use data$column syntax within ggplot,
# just give the column name
ggplot(data = labor, mapping = aes(x = DATE, y = LNS11300031)) +
geom_point(alpha = 0.7, colour = "#B07AA1") +
stat_smooth(method = "lm", colour = "#E15759", se = FALSE) +
scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
theme_minimal()
Output:

Plot line graph in R with ggplot2 from dataset

I have a dataset of countries health expenditure and life expectancy and wish to plot these visually.
I currently have the code:
dd = data.frame(Series_Name = "Health expenditure per capita (current US$) Australia",
Year = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014),
Value = c(1665.200,1883.316,2370.881,2933.229,3214.031,3421.908,4077.852,4410.438,4256.641,5324.517,6368.424,6543.524,6258.467,6031.107))
Which I am then plotting with:
require(ggplot2)
##The values Year, Value, School_ID are
##inherited by the geoms
ggplot(dd, aes(Year, Value,colour=Series_Name)) +
geom_line() +
geom_point()
This displays the graph how I would like, although the issue is that I would to be able to specify which series of data should be placed within the value variable to avoid inputting it manually, the year does not need to be changed and can stay how it is.
The data has been read in from a csv file and saved to the variable 'statistics'. The data looks like this:
Series Name 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Health expenditure per capita (current US$) Australia 1665.200 1883.316 2370.881 2933.229 3214.031 3421.908 4077.852 4410.438 4256.641 5324.517 6368.424 6543.524 6258.467 6031.107
If I wished to change data from Australia to Japan, how would I go about doing so, the Series name is set out the same with the exception of the country name.
Thanks for your help!
EDIT: Thought it may beneficial to add an image of the data layout.
The statistics.csv file - https://ufile.io/ocynw
You could use the following approach. If your data frame is called dd:
names(dd) <- c("Series_Name", seq(2001,2014,1))
library(reshape2)
library(tidyverse)
library(stringr)
We first convert your data frame from wide to long format:
dd2 <- melt(dd, id.vars=c("Series_Name"), value.name = c("value"))
Selecting the variables 'Health expenditure per capita' only
dd2 <- dd2[startsWith(as.character(dd2$Series_Name), prefix = "Health expenditure per capita"), ]
Creating a column with the name of the country that will appear in the legend:
dd2$country <- as.factor(word(dd2$Series_Name,-1) )
Sorting your data:
dd2 <- arrange(dd2, country)
and plotting all the countries:
ggplot(dd2, aes(x = variable, y = value, group=country, color=country)) + geom_line() +
geom_point()
If you want just Japan:
filter(dd2, country == "Japan") %>%
ggplot(aes(x = variable, y = value, group=country, color=country)) +
geom_line() + geom_point()

Ordering the axis labels in geom_tile

I have a data frame containing order data for each of 20+ products from each of 20+ countries. I have put it in a highlight table using ggplot2 with code similar to this:
require(ggplot2)
require(reshape)
require(scales)
mydf <- data.frame(industry = c('all industries','steel','cars'),
'all regions' = c(250,150,100), americas = c(150,90,60),
europe = c(150,60,40), check.names = FALSE)
mydf
mymelt <- melt(mydf, id.var = c('industry'))
mymelt
ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
geom_tile() + geom_text(aes(fill = mymelt$value, label = mymelt$value))
Which produces a plot like this:
In the real plot, the 450 cell table very nicely shows the 'hotspots' where orders are concentrated. The last refinement I want to implement is to arrange the items on both the x-axis and y-axis in alphabetical order. So in the plot above, the y-axis (variable) would be ordered as all regions, americas, then europe and the x-axis (industry) would be ordered all industries, cars and steel. In fact the x-axis is already ordered alphabetically, but I wouldn't know how to achieve that if it were not already the case.
I feel somewhat embarrassed about having to ask this question as I know there are many similar on SO, but sorting and ordering in R remains my personal bugbear and I cannot get this to work. Although I do try, in all except the simplest cases I got lost in a welter of calls to factor, levels, sort, order and with.
Q. How can I arrange the above highlight table so that both y-axis and x-axis are ordered alphabetically?
EDIT: The answers from smillig and joran below do resolve the question with the test data but with the real data the problem remains: I can't get an alphabetical sort. This leaves me scratching my head as the basic structure of the data frame looks the same. Clearly I have omitted something, but what??
> str(mymelt)
'data.frame': 340 obs. of 3 variables:
$ Industry: chr "Animal and vegetable products" "Food and beverages" "Chemicals" "Plastic and rubber goods" ...
$ variable: Factor w/ 17 levels "Other areas",..: 17 17 17 17 17 17 17 17 17 17 ...
$ value : num 0.000904 0.000515 0.007189 0.007721 0.000274 ...
However, applying the with statement doesn't result in levels with an alphabetical sort.
> with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
[1] USA USA USA
[4] USA USA USA
[7] USA USA USA
[10] USA USA USA
[13] USA USA USA
[16] USA USA USA
[19] USA USA Canada
[22] Canada Canada Canada
[25] Canada Canada Canada
[28] Canada Canada Canada
All the way down to:
[334] Other areas Other areas Other areas
[337] Other areas Other areas Other areas
[340] Other areas
And if you do a levels() it seems to show the same thing:
[1] "Other areas" "Oceania" "Africa"
[4] "Other Non-Eurozone" "UK" "Other Eurozone"
[7] "Holland" "Germany" "Other Asia"
[10] "Middle East" "ASEAN-5" "Singapore"
[13] "HK/China" "Japan" "South Central America"
[16] "Canada" "USA"
That is, the non-reversed version of the above.
The following shot shows what the plot of the real data looks like. As you can see, the x-axis is sorted and the y-axis is not. I'm perplexed. I'm missing something but can't see what it is.
The y-axis on your chart is also already ordered alphabetically, but from the origin. I think you can achieve the order of the axes that you want by using xlim and ylim. For example:
ggplot(mymelt, aes(x = industry, y = variable, fill = value)) +
geom_tile() + geom_text(aes(fill = mymelt$value, label = mymelt$value)) +
ylim(rev(levels(mymelt$variable))) + xlim(levels(mymelt$industry))
will order the y-axis from all regions at the top, followed by americas, and then europe at the bottom (which is reverse alphabetical order, technically). The x-axis is alphabetically ordered from all industries to steel with cars in between.
As smillig says, the default is already to order the axes alphabetically, but the y axis will be ordered from the lower left corner up.
The basic rule with ggplot2 that applies to almost anything that you want in a specific order is:
If you want something to appear in a particular order, you must make the corresponding variable a factor, with the levels sorted in your desired order.
In this case, all you should need to do it this:
mymelt$variable <- with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
which should work regardless of whether you're running R with stringsAsFactors = TRUE or FALSE.
This principle applies to ordering axis labels, ordering bars, ordering segments within bars, ordering facets, etc.
For continuous variables there is a convenient scale_*_reverse() but apparently not for discrete variables, which would be a nice addition, I think.
Another possibility is to use fct_reorder from forecast library.
library(forecast)
mydf %>%
pivot_longer(cols=c('all regions', 'americas', 'europe')) %>%
mutate(name1=fct_reorder(name, value, .desc=FALSE)) %>%
ggplot( aes(x = industry, y = name1, fill = value)) +
geom_tile() + geom_text(aes( label = value))
Maybe a little bit late,
with(mymelt,factor(variable,levels = rev(sort(unique(variable)))))
this function doesn't order, because you are ordering "variable" that has no order (it's an unordered factor).
You should transform first the variable to a character, with the as.character function, like so:
with(mymelt,factor(variable,levels = rev(sort(unique(as.character(variable))))))
maybe this StackOverflow question can help:
Order data inside a geom_tile
specifically the first answer by Brandon Bertelsen:
"Note it's not an ordered factor, it's a factor in the right order"
It helped me to get the right order of the y-axis in a ggplot2 geom_tile plot.

Resources