I've made a GAM model in R using the following code:
mod_gam1 <-gam(y ~ s(ï..x), data=Bird.data, method = "REML")
plot(mod_gam1)
coef(mod_gam1)
plot(mod_gam1, residuals = TRUE, pch = 1)
coef(mod_gam1)
mod_gam1$fitted.values
result <- data.frame(data = c(mod_gam1$fitted.values, Bird.data$y), Year = rep(1991:2019, times = 2),
'source' = c(rep('Modelled', times = 29), rep('Observed', times = 29)))
ggplot(result, aes(x = Year, y = data, colour = source))+ geom_point()+ geom_smooth(span= 0.8)+labs(x="Year", y = "Bird Island Total Debris Count")+ scale_y_continuous(limits = c(0,1000))
The output looks OK, but the shaded confidence band from geom_smooth doesn't extend across the whole of my dataset (it stops short of my first two data points) and I am not sure why.
Any help would be appreciated!
I can't upload a picture as I am new to the site, but basically I have two series (observed and GAM-modelled values), each with its own SE confidence ribbon, and the ribbons start two data points into the series rather than at the first point.
These are my datapoints:
Bird.data
ï..x  y
1991  17
1992  76
1993  328
1994  131
1995  425
1996  892
1997  501
1998  419
1999  297
2000  277
2001  310
2002  282
2003  189
2004  278
2005  322
2006  444
2007  412
2008  241
2009  242
2010  255
2011  289
2012  335
2013  279
2014  628
2015  500
2016  174
2017  636
2018  420
2019  447
Fitted Values
[1] 95.56189 177.01468 255.17074 324.97532 380.28813 415.71334 428.67793 420.86624 398.18522 369.06325
[11] 341.72715 321.65585 310.33971 305.81158 304.53360 303.60521 302.21413 301.75501 304.77184 313.43400
[21] 328.37279 348.39076 371.04203 393.66222 414.29754 432.15104 447.48020 461.14595 474.09266
Negative Binomial
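It is because of the limits you have set with scale_y_continuous. Values that fall outside scale limits are dropped, so the ribbon disappears wherever its lower edge falls below 0. If you remove that line (or lower the minimum so that it covers the minimum y value of the smooth), you will see the smooth fill completely.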
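If the goal is only to zoom the axis, one option is to set the limits on the coordinate system rather than on the scale. The sketch below reuses the observed counts from the question in a data frame `d` (the name is just for illustration) and a loess smooth with the question's span. coord_cartesian() crops the view without discarding any values, so the ribbon is drawn in full.

```r
library(ggplot2)

# Observed counts from the question; `d` is an illustrative name
d <- data.frame(
  Year = 1991:2019,
  data = c(17, 76, 328, 131, 425, 892, 501, 419, 297, 277,
           310, 282, 189, 278, 322, 444, 412, 241, 242, 255,
           289, 335, 279, 628, 500, 174, 636, 420, 447)
)

p <- ggplot(d, aes(Year, data)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.8)

# Scale limits: values outside 0..1000 (ribbon edges included) are dropped
p_scale <- p + scale_y_continuous(limits = c(0, 1000))

# Coordinate limits: the panel is cropped, nothing is dropped
p_coord <- p + coord_cartesian(ylim = c(0, 1000))

print(p_coord)
```

Printing `p_scale` and `p_coord` side by side should make the difference at the left-hand edge visible.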
However, you have a larger problem here. The smooth you are showing is not the GAM: geom_smooth is fitting its own default smoother to the GAM's point predictions. There are a couple of ways to fix this. The easiest might be to feed Bird.data directly to ggplot() and use the method and formula arguments of geom_smooth() to request the GAM smooth directly (this assumes the year column has been renamed to x):
ggplot(Bird.data, aes(x,y)) +
geom_point() +
geom_smooth(method="gam", formula=y~s(x)) +
labs(x="Year", y = "Bird Island Total Debris Count")
The problem with this approach is that you don't get the prediction points as well. This can be fixed with the following approach.
First, add the standard errors directly to the result data frame (NA for the observed rows):
result$se <- c(predict(mod_gam1, se.fit = TRUE)$se.fit, rep(NA, 29))
Then use ggplot as before, but with geom_ribbon, setting ymin and ymax directly:
ggplot(result, aes(x = Year, y = data, colour = source, fill=source))+
geom_point()+
geom_ribbon(aes(ymin=data-1.96*se, ymax=data+1.96*se), alpha=0.2) +
labs(x="Year", y = "Bird Island Total Debris Count")+
scale_y_continuous(limits = c(-200,1000))
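As a variation on the same idea, the fitted values and their interval can be kept in their own data frame, so that the ribbon layer never sees the NA standard errors of the observed rows. This is only a sketch: it assumes the year column has been renamed to `x`, and `fit_df`, `lower` and `upper` are names made up here. With the default Gaussian family the link is the identity, so fit plus or minus 1.96 standard errors is already on the response scale.

```r
library(mgcv)
library(ggplot2)

# Data from the question, with the year column renamed to `x` (assumption)
Bird.data <- data.frame(
  x = 1991:2019,
  y = c(17, 76, 328, 131, 425, 892, 501, 419, 297, 277,
        310, 282, 189, 278, 322, 444, 412, 241, 242, 255,
        289, 335, 279, 628, 500, 174, 636, 420, 447)
)

mod <- gam(y ~ s(x), data = Bird.data, method = "REML")

# Fitted values and standard errors at the observed years
pred <- predict(mod, se.fit = TRUE)
fit_df <- data.frame(
  Year  = Bird.data$x,
  fit   = as.numeric(pred$fit),
  lower = as.numeric(pred$fit - 1.96 * pred$se.fit),
  upper = as.numeric(pred$fit + 1.96 * pred$se.fit)
)

p <- ggplot(Bird.data, aes(x, y)) +
  geom_ribbon(data = fit_df, aes(x = Year, ymin = lower, ymax = upper),
              inherit.aes = FALSE, alpha = 0.2) +
  geom_line(data = fit_df, aes(x = Year, y = fit), inherit.aes = FALSE) +
  geom_point() +
  labs(x = "Year", y = "Bird Island Total Debris Count")

print(p)
```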
I came across this graph of the estimated number of diabetes cases and the future projections that accompanied each estimate, with data points roughly every two years from 2000. The graph is factually incorrect, as the points on the line do not match the scale on the left. I am trying to replot it in ggplot2 or ggplotly. In the replot I intend to draw two lines in a single plot: one for the estimates over the last few years, and the other for the projections made in those years for the following 20-25 years, along with the year in which each projection was made. Any help would be highly appreciated.
Here is the data that was used to plot the graph. Estimated numbers, by year, are shown in blue, while projected numbers for future years are shown by the red line. Since there are multiple projected numbers for a few years, I intend to keep only the highest number on the line graph.
EstimationYear  Estimates (in millions)  Projections (in millions)  Projection Year
2000            151                      333                        2025
2003            194                      380                        2025
2006            246                      438                        2030
2009            285                      552                        2030
2011            366                      578                        2030
2013            382                      592                        2035
2015            415                      642                        2040
2017            425                      629                        2045
2019            463                      700                        2045
Your question is more about the data wrangling than the actual plotting with ggplot. Once you have the data in the right shape, the plotting command is just a few lines.
1. Prepare the data for the estimation (blue) points. Set a column type to "estimation".
2. Prepare the data for the projected (red) points. Set a column type to "projection".
3. Use bind_rows to combine both tables.
4. In the aesthetics of ggplot, use color = type.
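The steps above could be sketched as follows with dplyr. The column names (`est_year`, `estimate`, `projection`, `proj_year`) are made up for this example. Keeping the highest projection per projection year follows the wish stated in the question.

```r
library(dplyr)
library(ggplot2)

# Values from the question's table; column names are illustrative
raw <- data.frame(
  est_year   = c(2000, 2003, 2006, 2009, 2011, 2013, 2015, 2017, 2019),
  estimate   = c(151, 194, 246, 285, 366, 382, 415, 425, 463),
  projection = c(333, 380, 438, 552, 578, 592, 642, 629, 700),
  proj_year  = c(2025, 2025, 2030, 2030, 2030, 2035, 2040, 2045, 2045)
)

# Step 1: estimation (blue) points
estimates <- raw %>%
  select(year = est_year, value = estimate) %>%
  mutate(type = "estimation")

# Step 2: projected (red) points, highest value per projection year
projections <- raw %>%
  group_by(year = proj_year) %>%
  summarise(value = max(projection), .groups = "drop") %>%
  mutate(type = "projection")

# Step 3: combine both tables
long <- bind_rows(estimates, projections)

# Step 4: colour by type
p <- ggplot(long, aes(year, value, color = type)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Millions")

print(p)
```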
Here is a start on how you can recreate the plot from the data. I haven't put any effort into recreating the balloons, setting the theme to something more elegant, and that kind of thing.
library(ggplot2)
txt <- "2000 151 333 2025
2003 194 380 2025
2006 246 438 2030
2009 285 552 2030
2011 366 578 2030
2013 382 592 2035
2015 415 642 2040
2017 425 629 2045
2019 463 700 2045"
df <- read.table(text = txt)
# Putting years and values in the same columns
# Probably some tidyverse function can do this more elegantly
df <- rbind(cbind(unname(df[1:2]), type = "Estimates"),
cbind(unname(df[4:3]), type = "Projection"))
colnames(df) <- c("year", "value", "type")
# We're reordering on value, because the red line does not touch year-duplicates
df <- df[order(df$value, decreasing = TRUE),]
ggplot(df, aes(year, value, colour = type)) +
# Formula notation to filter out data for the line
geom_line(data = ~ subset(., !duplicated(year))) +
geom_point() +
scale_colour_manual(
values = c("Estimates" = "dodgerblue",
"Projection" = "tomato")
) +
scale_y_continuous(limits = c(0, NA),
name = "Millions")
Created on 2021-01-06 by the reprex package (v0.3.0)
I would like to create a graph to represent projected vs collected revenue by person and I'm not sure how to do this. The goal would be to have the negative differences plotted as a red vertical bar and positive as black.
ggplot(appts2,
aes(Provider, Difference),
main = "Difference in Projected vs Actual Revenue") +
geom_bar(fill = ifelse(appts2$Difference < 0, "red", "black"), stat = 'identity') +
coord_flip()
works but isn't coloring things correctly.
Provider  Revenue  Visits  Ave    Total Add Ons  Total Scheduled  Total Seen  Total Not Seen  TotalBatchVisits  ProjectedRevenue  Difference  MissingRecords
Smith     40911    539     75.9   38             438              404         82              486               36887.4           -4023.6     53
Antonio   4827     63      76.62  7              88               60          35              95                7278.9            2451.9      -32
Jackson   13832    171     80.89  32             155              161         20              181               14641.09          809.09      -10
Redding   23030    278     82.84  25             164              144         34              178               14745.52          -8284.48    100
You can accomplish this by setting the "fill" aesthetic to a logical statement, such as Difference < 0. ggplot will then fill the bars depending on whether the bar is less than or greater than zero.
Avoid using the $ operator inside aes() or in a layer's fill argument (you reference appts2$Difference). Instead, map the bare column name inside aes(), which ggplot will then look up in the provided data set. ggplot may reorder the data before plotting it, so supplying an outside vector with $ can end up out of step with the order in which the bars are drawn.
library(ggplot2)
set.seed(1)
df <- data.frame(category = letters[1:10], difference = rnorm(10))
g <- ggplot(data = df, aes(y = difference, x = category, fill = difference < 0)) +
geom_col() +
coord_flip()
print(g)
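To get the red and black bars the question asks for, the logical fill can be paired with a manual scale. The sketch below uses only the Provider and Difference columns from the question's table. Hiding the legend with guide = "none" is a choice made here, not something the question asked for.

```r
library(ggplot2)

# Provider and Difference values copied from the question's table
appts2 <- data.frame(
  Provider   = c("Smith", "Antonio", "Jackson", "Redding"),
  Difference = c(-4023.6, 2451.9, 809.09, -8284.48)
)

p <- ggplot(appts2, aes(x = Provider, y = Difference, fill = Difference < 0)) +
  geom_col() +
  # TRUE means the difference is negative, so it gets red
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "black"),
                    guide = "none") +
  coord_flip() +
  ggtitle("Difference in Projected vs Actual Revenue")

print(p)
```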
I am trying to get my cumulative area plot to stack using the code below, which is based on http://dantalus.github.io/2015/08/16/step-plots/. I have added in position=stack, however the plot still overlaps.
The aim of what I am trying to achieve is to show the cumulative number of publications each year within a given period. So, as an example, in 1940 there may be one publication, the following year there may be 2 more, bringing the cumulative total to 3.
What would be the best way to get the areas to stack on top of each other?
How can the order be controlled? Would I need to use arrange() to order TERM2?
ggplot(data=working, aes(x=Year, color=TERM2, fill=TERM2)) +
stat_bin(data = subset(working, TERM2=="A"), bins=80, aes(y=cumsum(..count..)),geom="area", position="stack", alpha=0.1) +
stat_bin(data = subset(working, TERM2=="B"), bins=80, aes(y=cumsum(..count..)),geom="area", position="stack",alpha=0.1) +
stat_bin(data = subset(working, TERM2=="Both"),bins=80, aes(y=cumsum(..count..)),geom="area", position="stack", alpha=0.1) +
ylab("Total Number") + xlim(1940,2020) + ggtitle("Cumulative number by measurement method")
What I am currently getting:
Example of what I am trying to achieve:
The following chart was created in Excel using the same data which is exactly what I am looking to achieve in R.
My Data:
Example of how my data is currently structured:
Year TERM2
1944 A
1959 B
1966 A
1968 B
1968 A
1970 A
1971 B
1971 B
1971 A
1971 A
1971 Both
1971 Both
1971 Both
1972 A
1972 Both
1972 Both
1973 B
1973 A
1974 A
1974 A
'data.frame': 803 obs. of 6 variables:
$ Year : int 1944 1959 1966 1968 1968 1970 1971 1971 1971 1971 ...
$ TERM2 : Factor w/ 3 levels "B","A","Both": 2 1 2 1 2 2 1 1 2 2 ...
Changes based on user127649's suggestions
This is the plot after user127649's suggestions, which is close to what I would expect except I am looking for it to start at 0 and end at 803 (total number of publications).
ggplot(data=working, aes(x=Year, color=TERM2, fill=TERM2)) +
stat_bin(bins=80, aes(y=cumsum(..count..)), geom="area", alpha=0.1) +
ylab("Total Number") + xlim(1940,2020) + ggtitle("Cumulative number by measurement method")
I think there were two issues.
When you use stat_bin() in three separate layers, each effectively has its own independent data set. This will give the correct count, but (and this is a guess really) I think being in three separate layers means you can't stack them.
If you use stat_bin() in a single layer for all the groups, I think cumsum(..count..) is applied to the counts as a whole rather than to each group separately.
I don’t know whether this is the best approach or not, but I think it’s what you’re after.
Data
The data are grouped and cumsum() is used on each group separately.
library(tidyverse)
working <- working %>%
count(Year, TERM2) %>%
spread(TERM2, n, fill = 0) %>%
mutate_at(vars('A', 'B', 'Both'), cumsum) %>%
gather(TERM2, N, -Year, factor_key = T) #%>%
# mutate(TERM2 = ordered(TERM2, levels = rev(levels(TERM2))))
Plot
This code will produce the first plot below. If you prefer the look of the second plot, you can un-comment the last line of the data manipulation chunk.
ggplot(working, aes(Year, N, fill = TERM2)) +
geom_area(position = 'stack') +
ylab("Total Number")
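For reference, the same reshaping could probably be written without the wide/long round trip, by completing the Year and TERM2 combinations and taking a grouped cumulative sum. This is a sketch on the 20 sample rows from the question, not the full 803, and `cumulative` is a name made up here. The top of the stack in the final year should equal the total number of rows, which is the "end at 803" behaviour asked for on the full data.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# The 20 sample rows shown in the question
working <- data.frame(
  Year  = c(1944, 1959, 1966, 1968, 1968, 1970, 1971, 1971, 1971, 1971,
            1971, 1971, 1971, 1972, 1972, 1972, 1973, 1973, 1974, 1974),
  TERM2 = c("A", "B", "A", "B", "A", "A", "B", "B", "A", "A",
            "Both", "Both", "Both", "A", "Both", "Both", "B", "A", "A", "A")
)

cumulative <- working %>%
  count(Year, TERM2) %>%
  # one row per Year/TERM2 pair, with 0 where nothing was published
  complete(Year, TERM2, fill = list(n = 0)) %>%
  group_by(TERM2) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(N = cumsum(n)) %>%
  ungroup()

p <- ggplot(cumulative, aes(Year, N, fill = TERM2)) +
  geom_area(position = "stack") +
  ylab("Total Number")

print(p)
```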
Result
Hi there, can anybody help me? I have a big data frame with two columns, country_dest and SumTotal (the value), and I am trying to use the qplot function:
qplot(country_dest, SumTotal, data=Africa)
Brunei 58
Aruba 73
Cuba 95
Nicaragua 97
Turkmenistan 99
Saint Lucia 102
Honduras 153
Barbados 161
Haiti 165
Montenegro 175
I would like to draw a plot with the country names on the x axis, but only for the 6 or 7 countries with the highest SumTotal values. Is that possible?
Thank you in advance!
Using ggplot, just reorder the countries by SumTotal:
ggplot(data = Africa, aes(x= reorder(country_dest, -SumTotal), y= SumTotal)) + geom_bar(stat = "identity")
If you just want the top 5, say, use arrange and then subset:
require(dplyr)
Africa.ordered <- arrange(Africa, -SumTotal)
Africa.top5 <- Africa.ordered[1:5,]
and then draw your plot
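Putting the two pieces together might look like the sketch below. It uses the ten rows shown in the question as a stand-in for the full Africa data frame, and it uses dplyr's slice_max() in place of arrange plus subsetting, which is a substitution made here.

```r
library(dplyr)
library(ggplot2)

# Ten rows from the question, standing in for the full data frame
Africa <- data.frame(
  country_dest = c("Brunei", "Aruba", "Cuba", "Nicaragua", "Turkmenistan",
                   "Saint Lucia", "Honduras", "Barbados", "Haiti", "Montenegro"),
  SumTotal     = c(58, 73, 95, 97, 99, 102, 153, 161, 165, 175)
)

# Keep the five rows with the largest SumTotal
top5 <- Africa %>% slice_max(SumTotal, n = 5)

# Bars ordered from largest to smallest
p <- ggplot(top5, aes(x = reorder(country_dest, -SumTotal), y = SumTotal)) +
  geom_col() +
  labs(x = "Country", y = "SumTotal")

print(p)
```

Changing `n` to 6 or 7 gives the number of countries mentioned in the question.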
Hello,
I have been struggling with this problem for a while now and anyone who can help me out I would greatly appreciate it.
First off, I am working with time series data in a single data frame containing multiple time series, too many to output individually into graphs. I have tried passing qplot() through ddply(); however, R tells me that qplot is not a function and therefore it will not work.
the structure of my data is like this...
goodlocs <-
Loc Year dir
Artesia 1983 1490
Artesia 1984 1575
Artesia 1986 1567
Artesia 1987 1630
Artesia 1990 1680
Bogota 1983 1525
Bogota 1984 1610
Bogota 1985 1602
Bogota 1986 1665
Bogota 1990 1715
Carlsbad 1983 1560
Carlsbad 1985 1645
Carlsbad 1986 1637
Carlsbad 1987 1700
Carlsbad 1990 1750
Carlsbad 1992 1595
Datil 1987 1680
Datil 1990 1672
Datil 1991 1735
Datil 1992 1785
I have about 250 locations (Loc) and would like to go over each station's data on a graph like the following one, so I can inspect all of my data visually.
Artesia <- goodlocs[goodlocs$Loc == "Artesia",]
qplot(Year, dir, data = Artesia, geom = c("point", "line"), xlab = "Year",
ylab = "DIR", main = "Artesia DIR Over Record Period") +
geom_smooth(method=lm)
I understand that par() is supposed to help with this, but I cannot figure it out for the life of me. Any help is greatly appreciated.
Thanks,
-Zia
edit -
As Arun pointed out, I am trying to save a .pdf of 250 different graphs of my goodlocs df, split by "Loc", with point and line geometry for data review.
I also tried passing a ddply of my df through qplot as the data, but it did not work either. I was not really expecting it to, but I had to try.
How about this?
require(ggplot2)
require(plyr)
require(gridExtra)
# Split goodlocs by Loc and build one plot per location (returns a list)
pl <- dlply(goodlocs, .(Loc), function(dat) {
  ggplot(data = dat, aes(x = Year, y = dir)) + geom_line() +
    geom_point() + xlab("x-label") + ylab("y-label") +
    geom_smooth(method = "lm")
})
# Arrange the plots 2 x 2 per page; current gridExtra takes the list via grobs =
ml <- marrangeGrob(grobs = pl, nrow = 2, ncol = 2)
ggsave("my_plots.pdf", ml, height = 7, width = 13, units = "in")
The idea: first split the data by Loc and create the plot for each subset. The splitting is done with the plyr function dlply, which takes a data.frame as input and returns a list as output. Each element of the list holds the plot for the corresponding subset. Then we use the gridExtra package's marrangeGrob function to arrange multiple plots per page (it has the very useful nrow and ncol arguments to set the page layout). Finally, you can save the result using ggsave from ggplot2.
I'll leave you to any additional tweaks you may require.
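If plyr and gridExtra are not available, a similar result (one plot per page rather than a grid) can be had with base R's split(), lapply() and the pdf() device. This is a sketch under assumptions: only two of the question's locations are included, and the output path in tempdir() is a placeholder.

```r
library(ggplot2)

# Two locations from the question, standing in for the ~250 in goodlocs
goodlocs <- data.frame(
  Loc  = rep(c("Artesia", "Bogota"), each = 5),
  Year = c(1983, 1984, 1986, 1987, 1990,
           1983, 1984, 1985, 1986, 1990),
  dir  = c(1490, 1575, 1567, 1630, 1680,
           1525, 1610, 1602, 1665, 1715)
)

# One ggplot object per location, stored in a named list
plots <- lapply(split(goodlocs, goodlocs$Loc), function(dat) {
  ggplot(dat, aes(Year, dir)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(x = "Year", y = "DIR",
         title = paste(dat$Loc[1], "DIR Over Record Period"))
})

# Placeholder output path; each print() call starts a new PDF page
out_file <- file.path(tempdir(), "goodlocs_plots.pdf")
pdf(out_file, width = 7, height = 5)
for (p in plots) print(p)
dev.off()
```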