How to merge estimation and projection graphs in one plot? - r

I came across this graph showing estimated numbers of diabetes cases, together with the future projections published alongside each estimate, at roughly two-year intervals from 2000. The graph is factually incorrect because the points on the lines do not line up with the scale on the left. I am trying to replot it in ggplot2 or ggplotly. In the new plot I intend to draw two lines in a single plot: one for the estimates over the last few years, and one for the future projections made in those years for the next 20-25 years, along with the year in which each projection was made. Any help is highly appreciated.
Here is the data that was used to plot the graph. Estimated numbers by year are shown in blue, while projected numbers for future years are shown by the red line. Since there are multiple projected numbers for some years, I intend to keep the highest number on the line graph.
EstimationYear  Estimates (in millions)  Projections (in millions)  Projection Year
2000            151                      333                        2025
2003            194                      380                        2025
2006            246                      438                        2030
2009            285                      552                        2030
2011            366                      578                        2030
2013            382                      592                        2035
2015            415                      642                        2040
2017            425                      629                        2045
2019            463                      700                        2045

Your question is more about the data wrangling than the actual plotting with ggplot. Once you have the data in the right shape, the plotting command is just a few lines.
Prepare the data for the estimation (blue) points and add a column type set to "estimation".
Prepare the data for the projected (red) points and add a column type set to "projection".
Use bind_rows to combine both tables.
In the aesthetics of ggplot, use color=type.
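The steps above could be sketched roughly as follows. This is only one possible reading of them: the column names (est_year, estimate, projection, proj_year) are made up for illustration, and keeping the maximum per projection year is taken from the question's wish to show the highest number when a year has several projections.

```r
library(dplyr)
library(ggplot2)

# Data copied from the question; column names are hypothetical
raw <- data.frame(
  est_year   = c(2000, 2003, 2006, 2009, 2011, 2013, 2015, 2017, 2019),
  estimate   = c(151, 194, 246, 285, 366, 382, 415, 425, 463),
  projection = c(333, 380, 438, 552, 578, 592, 642, 629, 700),
  proj_year  = c(2025, 2025, 2030, 2030, 2030, 2035, 2040, 2045, 2045)
)

# Step 1: estimation (blue) points
estimates <- raw %>%
  transmute(year = est_year, value = estimate, type = "estimation")

# Step 2: projected (red) points, keeping the highest value per projection year
projections <- raw %>%
  group_by(year = proj_year) %>%
  summarise(value = max(projection), .groups = "drop") %>%
  mutate(type = "projection")

# Step 3: stack both tables
combined <- bind_rows(estimates, projections)

# Step 4: map type to colour
p <- ggplot(combined, aes(year, value, color = type)) +
  geom_line() +
  geom_point() +
  labs(y = "Millions")
print(p)
```

Note that this variant drops the lower projections entirely, whereas the answer's code further down still draws them as points.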

Here is a start on how you can recreate the plot from the data. I haven't put any effort into recreating the balloons, setting the theme to something more elegant, and that kind of thing.
library(ggplot2)
txt <- "2000 151 333 2025
2003 194 380 2025
2006 246 438 2030
2009 285 552 2030
2011 366 578 2030
2013 382 592 2035
2015 415 642 2040
2017 425 629 2045
2019 463 700 2045"
df <- read.table(text = txt)
# Putting years and values in the same columns
# Probably some tidyverse function can do this more elegantly
df <- rbind(cbind(unname(df[1:2]), type = "Estimates"),
            cbind(unname(df[4:3]), type = "Projection"))
colnames(df) <- c("year", "value", "type")
# We're reordering on value, because the red line does not touch year-duplicates
df <- df[order(df$value, decreasing = TRUE), ]
ggplot(df, aes(year, value, colour = type)) +
  # Formula notation to filter out data for the line
  geom_line(data = ~ subset(., !duplicated(year))) +
  geom_point() +
  scale_colour_manual(
    values = c("Estimates" = "dodgerblue",
               "Projection" = "tomato")
  ) +
  scale_y_continuous(limits = c(0, NA),
                     name = "Millions")
Created on 2021-01-06 by the reprex package (v0.3.0)

Related

R GAM visualisation, geom_smooth not fit to all observed data

I've made a GAM model in R using the following code:
mod_gam1 <-gam(y ~ s(ï..x), data=Bird.data, method = "REML")
plot(mod_gam1)
coef(mod_gam1)
plot(mod_gam1, residuals = TRUE, pch = 1)
coef(mod_gam1)
mod_gam1$fitted.values
result <- data.frame(data = c(mod_gam1$fitted.values, Bird.data$y), Year = rep(1991:2019, times = 2),
'source' = c(rep('Modelled', times = 29), rep('Observed', times = 29)))
ggplot(result, aes(x = Year, y = data, colour = source))+ geom_point()+ geom_smooth(span= 0.8)+labs(x="Year", y = "Bird Island Total Debris Count")+ scale_y_continuous(limits = c(0,1000))
and the output looks ok but the shaded area of the geom_smooth error doesn't extend to the whole of my dataset (stops short of my first two datapoints) and I am not sure why.
Any help would be appreciated!
I can't upload a picture as I am new to the site, but yeah basically I have two datasets (observed and GAM modelled values) which both have their SE confidence ribbon, but these start two datapoints in to my datasets not at the first points.
These are my datapoints:
Bird.data
ï..x  y
1991  17
1992  76
1993  328
1994  131
1995  425
1996  892
1997  501
1998  419
1999  297
2000  277
2001  310
2002  282
2003  189
2004  278
2005  322
2006  444
2007  412
2008  241
2009  242
2010  255
2011  289
2012  335
2013  279
2014  628
2015  500
2016  174
2017  636
2018  420
2019  447
Fitted Values
[1] 95.56189 177.01468 255.17074 324.97532 380.28813 415.71334 428.67793 420.86624 398.18522 369.06325
[11] 341.72715 321.65585 310.33971 305.81158 304.53360 303.60521 302.21413 301.75501 304.77184 313.43400
[21] 328.37279 348.39076 371.04203 393.66222 414.29754 432.15104 447.48020 461.14595 474.09266
Negative Binomial
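It is because of the limits you have set with scale_y_continuous. Values that fall outside scale limits are dropped, so wherever the lower edge of the confidence band goes below 0 the ribbon cannot be drawn. If you remove that line (or lower the minimum so that it covers the smallest y value of the smooth), you will see the ribbon filled in completely.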
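If the 0 to 1000 view should be kept, one option is to zoom with coord_cartesian() instead of setting scale limits, since it changes the visible window without discarding any values. A minimal sketch on the observed counts only; the column name Year is an assumption made for readability (the original data frame calls it ï..x), and method = "loess" is spelled out here rather than left to the default:

```r
library(ggplot2)

# Observed counts copied from the question.
# ASSUMPTION: the year column is renamed to `Year` for this sketch.
Bird.data <- data.frame(
  Year = 1991:2019,
  y = c(17, 76, 328, 131, 425, 892, 501, 419, 297, 277,
        310, 282, 189, 278, 322, 444, 412, 241, 242, 255,
        289, 335, 279, 628, 500, 174, 636, 420, 447)
)

p <- ggplot(Bird.data, aes(x = Year, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.8) +
  # Zoom only: values outside 0..1000 are clipped from view, not removed
  coord_cartesian(ylim = c(0, 1000)) +
  labs(x = "Year", y = "Bird Island Total Debris Count")
print(p)
```

The part of the ribbon that dips below 0 is simply cut off at the panel edge rather than removed from the plot.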
However, you have a larger problem here. You are not actually showing the gam model in the smooth (only the gam point predictions). There are a couple of ways to do this. The easiest might be to feed Bird.data directly to the ggplot function, and use the method and formula params of geom_smooth() to directly request the gam smooth:
ggplot(Bird.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x)) +
  labs(x = "Year", y = "Bird Island Total Debris Count")
The problem with this approach is that you don't get the prediction points as well. This can be fixed with the following approach.
Add the standard errors directly to the result data frame (the observed rows get NA):
result$se <- c(predict(mod_gam1, se.fit = TRUE)$se.fit, rep(NA, 29))
Use ggplot as before, but use geom_ribbon, setting ymin and ymax directly:
ggplot(result, aes(x = Year, y = data, colour = source, fill = source)) +
  geom_point() +
  geom_ribbon(aes(ymin = data - 1.96 * se, ymax = data + 1.96 * se), alpha = 0.2) +
  labs(x = "Year", y = "Bird Island Total Debris Count") +
  scale_y_continuous(limits = c(-200, 1000))

R - coerce geom_density() in ggplot2 to accept a df column as the frequency (y-variable)

I am attempting to make a smoothed histogram using geom_density in ggplot2. The problem is, technically what I am making is not exactly a histogram, so I am running into trouble. Specifically, along the x-axis of the desired plot is genomic position, but the values can start and end anywhere. Moreover, the y-axis is not exactly counts, but rather is contents of a numeric vector in the df, locus_df_trim$V7, which for my data is an intensity value ranging between 0 and 266.
The "bins" corresponding to each observation may be different numbers of base pairs long, so there are no uniform bin sizes, and there also may be breaks between bins.
I have not been able to get ggplot to accept anything resembling locus_df_trim$V7 as a y-value. Also, if I set the y-value to ..scaled.., the result is not correct, because the intensity values are not proportional to the number of observations (which is what drives the ..scaled.. and ..count.. variables if I specify them).
So, at this point my only idea is to recreate the dataframe itself so that there are uniform bin sizes, and each observation for each cell type that does not have an intensity receives a 0 in the df. However, I am wondering if there is a way to produce the desired plot using a df of the current form, which is:
> head(locus_df_trim)
V2 V3 V5 V7 annot_width
1 CD4+_CD25-_IL17-_PMA- H3K27ac 204738970 103 1042
2 CD4+_CD25-_IL17-_PMA- H3K27ac 204738517 40 250
3 CD4+_CD25-_IL17-_PMA- H3K36me3 204738136 158 515
4 CD4+_CD25-_IL17-_PMA- H3K36me3 204738702 104 709
5 CD4+_CD25-_IL17+ H3K4me1 204738665 226 1246
6 CD4+_CD25-_IL17+ H3K4me1 204741441 73 397
...
43 Tmem_Primary_Cells H3K27ac 204738908 34 390
44 Tmem_Primary_Cells H3K27me3 204738382 28 194
45 Tmem_Primary_Cells H3K4me1 204738766 124 424
46 Tmem_Primary_Cells H3K4me1 204741433 48 423
47 Tmem_Primary_Cells H3K4me1 204739411 40 215
48 Tmem_Primary_Cells H3K4me1 204737304 33 210
I am able to produce a plot that is close to what is desired, but trying to specify the y-value to be proportional to V7 (as below) results in the error:
Error in eval(substitute(list(...)), `_data`, parent.frame()) :
object 'y' not found
To be clear, I want to make a smoothed histogram with the following features:
the bin for a given observation begins at the x-coordinate given by a value in one column, and ends at the value found in another column
each cell type (in column locus_df_trim$V2) has a different color
the height of each peak (in the y-direction) is proportional to the value in a column of the df (locus_df_trim$V7).
So far I have the following code:
library("ggplot2")
base_plot<-ggplot(data=locus_df_trim, aes(x=V5, y=V7, fill=V2, size=V7)) + geom_density(alpha=0.3, adjust=0.4, kernel="gaussian")
base_plot<-base_plot + geom_vline(xintercept = 204738919, size = 1, colour = "#FF3721", linetype = "dashed")
base_plot<-base_plot + theme_classic(); base_plot
by contrast, the following code does not error and produces a plot similar to what is desired, but suffers from miscalibrated bin placement and incorrect height for the peaks in the y-direction:
base_plot<-ggplot(data=locus_df_trim, aes(x=V5, fill=V2)) + geom_density(alpha=0.3, adjust=0.4, kernel="gaussian")
base_plot<-base_plot + geom_vline(xintercept = 204738919, size = 1, colour = "#FF3721", linetype = "dashed")
base_plot<-base_plot + theme_classic(); base_plot
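For what it's worth, one way to get bins with explicit start and end positions and heights taken from V7 is to draw rectangles rather than a density. This is only a rough sketch and not a smoothed curve: it uses the first six rows shown above, and it assumes that each bin starts at V5 and spans annot_width base pairs, which the question does not actually state.

```r
library(ggplot2)

# First six rows of the data shown in the question
locus_df_trim <- data.frame(
  V2 = c(rep("CD4+_CD25-_IL17-_PMA-", 4), rep("CD4+_CD25-_IL17+", 2)),
  V3 = c("H3K27ac", "H3K27ac", "H3K36me3", "H3K36me3", "H3K4me1", "H3K4me1"),
  V5 = c(204738970, 204738517, 204738136, 204738702, 204738665, 204741441),
  V7 = c(103, 40, 158, 104, 226, 73),
  annot_width = c(1042, 250, 515, 709, 1246, 397)
)

# ASSUMPTION: a bin runs from V5 to V5 + annot_width (bin_end is a made-up helper column)
locus_df_trim$bin_end <- locus_df_trim$V5 + locus_df_trim$annot_width

p <- ggplot(locus_df_trim) +
  # One rectangle per observation: width = bin extent, height = intensity
  geom_rect(aes(xmin = V5, xmax = bin_end, ymin = 0, ymax = V7, fill = V2),
            alpha = 0.3) +
  geom_vline(xintercept = 204738919, colour = "#FF3721", linetype = "dashed") +
  labs(x = "Genomic position", y = "Intensity (V7)", fill = "Cell type") +
  theme_classic()
print(p)
```

Gaps between bins and unequal bin widths are shown as they are, so no padding of the data frame with zeros is needed for this view.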

R ggplot2 the names of the biggest values on x axis

Hi there, can anybody help me? I have a big data frame with two columns, country_dest and SumTotal (the value), and I am trying to use the qplot function:
qplot(country_dest, SumTotal, data=Africa)
Brunei 58
Aruba 73
Cuba 95
Nicaragua 97
Turkmenistan 99
Saint Lucia 102
Honduras 153
Barbados 161
Haiti 165
Montenegro 175
I would like to draw a plot whose x axis shows only the names of the countries with the highest values of SumTotal (for example, the top 6 or 7). Is it possible to do this?
Thank you in advance!
Using ggplot, just reorder by SumTotal:
ggplot(data = Africa, aes(x = reorder(country_dest, -SumTotal), y = SumTotal)) + geom_bar(stat = "identity")
If you just wanna take, say, the top 5, use arrange and then subset:
library(dplyr)
Africa.ordered <- arrange(Africa, desc(SumTotal))
Africa.top5 <- Africa.ordered[1:5, ]
and then draw your plot from Africa.top5.
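Put together, the whole thing might look like the sketch below. It rebuilds a small Africa data frame from the ten rows shown in the question; top_n_countries and Africa.top are names made up for this example, and geom_col() is used as shorthand for geom_bar(stat = "identity").

```r
library(dplyr)
library(ggplot2)

# The ten rows shown in the question
Africa <- data.frame(
  country_dest = c("Brunei", "Aruba", "Cuba", "Nicaragua", "Turkmenistan",
                   "Saint Lucia", "Honduras", "Barbados", "Haiti", "Montenegro"),
  SumTotal = c(58, 73, 95, 97, 99, 102, 153, 161, 165, 175)
)

top_n_countries <- 5  # hypothetical setting: how many countries to keep

# Sort by decreasing SumTotal and keep the first rows
Africa.top <- Africa %>%
  arrange(desc(SumTotal)) %>%
  head(top_n_countries)

# Bars ordered from highest to lowest SumTotal
p <- ggplot(Africa.top, aes(x = reorder(country_dest, -SumTotal), y = SumTotal)) +
  geom_col() +
  labs(x = "Country", y = "SumTotal")
print(p)
```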

Several or multiple timeseries plot outputs from a single data frame

Hello,
I have been struggling with this problem for a while now and anyone who can help me out I would greatly appreciate it.
First off, I am working with time series data in a single data frame containing multiple time series, too many to output individually into graphs. I have tried passing qplot() through ddply(), however R tells me that qplot is not a function and therefore it will not work.
the structure of my data is like this...
goodlocs <-
Loc Year dir
Artesia 1983 1490
Artesia 1984 1575
Artesia 1986 1567
Artesia 1987 1630
Artesia 1990 1680
Bogota 1983 1525
Bogota 1984 1610
Bogota 1985 1602
Bogota 1986 1665
Bogota 1990 1715
Carlsbad 1983 1560
Carlsbad 1985 1645
Carlsbad 1986 1637
Carlsbad 1987 1700
Carlsbad 1990 1750
Carlsbad 1992 1595
Datil 1987 1680
Datil 1990 1672
Datil 1991 1735
Datil 1992 1785
I have about 250 locations (Loc) and would like to be able to go over each station's data on a graph like the following one, so I can inspect all of my data visually.
Artesia <- goodlocs[goodlocs$Loc == "Artesia",]
qplot(Year, dir, data = Artesia, geom = c("point", "line"), xlab = "Year",
      ylab = "DIR", main = "Artesia DIR Over Record Period") +
  geom_smooth(method = lm)
I understand that par() is supposed to help do this but I cannot figure it out for the life of me. Any help is greatly appreciated.
Thanks,
-Zia
edit -
As Arun pointed out, I am trying to save a .pdf of 250 different graphs of my goodlocs df split by "Loc", with point and line geometry for data review.
I also tried passing a ddply of my df to qplot as the data, but it did not work either. I was not really expecting it to, but I had to try.
How about this?
library(ggplot2)
library(plyr)
library(gridExtra)
# One plot per location, returned as a list
pl <- dlply(goodlocs, .(Loc), function(dat) {
  ggplot(data = dat, aes(x = Year, y = dir)) + geom_line() +
    geom_point() + xlab("x-label") + ylab("y-label") +
    geom_smooth(method = "lm")
})
# Current gridExtra versions take the list of plots through the grobs argument
ml <- marrangeGrob(grobs = pl, nrow = 2, ncol = 2)
ggsave("my_plots.pdf", ml, height = 7, width = 13, units = "in")
The idea: first split the data by Loc and create the plot for each subset. The splitting part is done with the plyr function dlply, which takes a data.frame as input and returns a list as output. Each element of the list holds the plot for the corresponding subset. Then we use the gridExtra package's marrangeGrob function to arrange multiple plots per page (its nrow and ncol arguments set the layout of each page). Finally, you can save the result using ggsave from ggplot2.
I'll leave you to any additional tweaks you may require.
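If plyr and gridExtra are not available, a similar result (one plot per page rather than a grid) can be had with base R's split() and lapply() plus the pdf() device. This is a different technique from the one above, sketched here on the Artesia and Bogota rows only; out_file is a made-up name and points at a temporary file.

```r
library(ggplot2)

# Two of the locations from the question
goodlocs <- data.frame(
  Loc  = rep(c("Artesia", "Bogota"), each = 5),
  Year = c(1983, 1984, 1986, 1987, 1990, 1983, 1984, 1985, 1986, 1990),
  dir  = c(1490, 1575, 1567, 1630, 1680, 1525, 1610, 1602, 1665, 1715)
)

# One ggplot object per location, named by location
plots <- lapply(split(goodlocs, goodlocs$Loc), function(dat) {
  ggplot(dat, aes(x = Year, y = dir)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(x = "Year", y = "DIR",
         title = paste(dat$Loc[1], "DIR Over Record Period"))
})

# Write every plot to its own page of a single PDF
out_file <- tempfile(fileext = ".pdf")  # hypothetical output path
pdf(out_file, width = 7, height = 5)
for (p in plots) print(p)
dev.off()
```

With around 250 locations this gives a 250-page PDF that can be paged through for review.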

Grouped bar chart with ggplot2 and already tabulated data

I fit a count model to a vector of actual data and would now like to plot the actual and the predicted as a grouped (dodged) bar chart. Since this is a count model, the data are discrete (X=x from 0 to 317). Since I am fitting a model, I only have already-tabulated data for the predicted values.
Here is how my original data frame looks:
actual predicted
1 3236 3570.4995
2 1968 1137.1202
3 707 641.8186
4 302 414.8763
5 185 285.1854
6 104 203.0502
I transformed the data to be plotted with ggplot2:
melted.data <- melt(plot.data)
melted.data$realization <- c(rep(0:317, times=2))
colnames(melted.data)=c('origin','count','realization')
So that my data frame now looks like this:
head(melted.data)
origin count realization
1 actual 3236 0
2 actual 1968 1
3 actual 707 2
4 actual 302 3
5 actual 185 4
6 actual 104 5
> tail(melted.data)
origin count realization
631 predicted 1.564673e-27 312
632 predicted 1.265509e-27 313
633 predicted 1.023552e-27 314
634 predicted 8.278601e-28 315
635 predicted 6.695866e-28 316
636 predicted 5.415757e-28 317
When I try to graph it (again, I'd like to have the actual and predicted count --which is already tabulated in the data-- by discrete realization), I give this command:
ggplot(melted.data, stat="identity", aes(x=realization, fill=origin)) + geom_bar(position="dodge")
Yet it seems like the stat parameter is not liked by ggplot2, as I don't get the correct bar height (which would be those of the variable "count").
Any ideas?
Thanks,
Roberto.
If you use stat_identity, you need to map the y-values (here the column count) in aes. Try the following:
ggplot(melted.data, aes(x = realization, y = count, fill = origin)) +
  stat_identity(position = "dodge", geom = "bar")
or
ggplot(melted.data, aes(x = realization, y = count, fill = origin)) +
  geom_bar(position = "dodge", stat = "identity")
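As a small self-contained illustration, here is the same idea on just the six rows shown in the question, using geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity"). The long-format data frame is built by hand instead of with melt(), and the name long is made up for this sketch.

```r
library(ggplot2)

# First six rows of the original (wide) data from the question
plot.data <- data.frame(
  actual    = c(3236, 1968, 707, 302, 185, 104),
  predicted = c(3570.4995, 1137.1202, 641.8186, 414.8763, 285.1854, 203.0502)
)

# Long format: one row per (realization, origin) pair
n <- nrow(plot.data)
long <- data.frame(
  realization = rep(seq_len(n) - 1, times = 2),
  origin      = rep(c("actual", "predicted"), each = n),
  count       = c(plot.data$actual, plot.data$predicted)
)

# Bar heights come straight from `count`; bars are dodged by origin
p <- ggplot(long, aes(x = factor(realization), y = count, fill = origin)) +
  geom_col(position = "dodge") +
  labs(x = "realization")
print(p)
```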
