I'm trying to create a stacked area graph with R using the package ggplot2 with the below data:
> dput(ec.admin1.ma.tall[1:20,])
structure(list(date = structure(c(18346, 18347, 18348, 18349,
18350, 18351, 18352, 18353, 18362, 18363, 18364, 18365, 18366,
18367, 18354, 18374, 18375, 18376, 18379, 18380), class = "Date"),
locations = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("azuay_newcase_avg",
"bolivar_newcase_avg", "canar_newcase_avg", "carchi_newcase_avg",
"chimborazo_newcase_avg", "cotopaxi_newcase_avg", "eloro_newcase_avg",
"esmeraldas_newcase_avg", "galapagos_newcase_avg", "guayas_newcase_avg",
"imbabura_newcase_avg", "loja_newcase_avg", "losrios_newcase_avg",
"manabi_newcase_avg", "moronasant_newcase_avg", "napo_newcase_avg",
"orellana_newcase_avg", "pastaza_newcase_avg", "pichincha_newcase_avg",
"santaelena_newcase_avg", "stodom_newcase_avg", "sucumbios_newcase_avg",
"tungurahua_newcase_avg", "zamchin_newcase_avg"), class = "factor"),
newcases_ma = c(NA, NA, NA, 5.85714285714286, 8.14285714285714,
13.1428571428571, 12.8571428571429, 16.2857142857143, 15.2857142857143,
16.1428571428571, 14.2857142857143, 12.5714285714286, 18,
19.2857142857143, 39.2857142857143, 38.7142857142857, 53.2857142857143,
53, 52.4285714285714, 46)), row.names = c(NA, 20L), class = "data.frame")
> ec.admin1.ma.tall$locations <- factor(ec.admin1.ma.tall$locations)
> ec.admin1.ma.tall$date <- as.Date(ec.admin1.ma.tall$date, "%m/%d/%Y")
> ggplot(ec.admin1.ma.tall, aes(x = date, y = newcases_ma, fill = locations, group =
locations)) + geom_area()
The image I get from this code is: [Figure: stacked area graph plotting number of new cases by region]
However, from plotting the individual regions, I don't believe my plot is accurate. The code for this plot is below:
ggplot(ec.admin1.ma.tall, aes(x = date, y = newcases_ma, fill = locations)) +
geom_col() +
labs(title = "Moving 7-Day Average for New Cases in Admin 1 Regions - Ecuador",
x = "Date", y = "7-Day Moving Average, New Cases") +
theme(axis.text.x = element_text(angle = 90, size = rel(0.5), vjust = 0.5, hjust=1)) +
facet_wrap(~locations, nrow = 6, scales = "free")
[Figure: bar graph of new cases over time, split by individual regions]
As you can see from the y-axes of these individual regions, none of the values go above 2000 and not many go above 1000 cases. Would anyone know why there is this discrepancy between the individual regions' data and the stacked area graph?
I just took a quick look, but the plots seem reasonable to me and the code looks OK too. Check out the "guayas" small multiples plot. The peak values early on reach about 1500, which is about the vertical size of the large green section of your stacked area plot. None go over 2000, but the sum of guayas and the other regions certainly goes over 2000 at that particular point on the x-axis.
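To check this numerically, here is a minimal sketch with made-up numbers (not the Ecuador data; the region values are assumptions for illustration). It sums the per-region values for each date, since the top edge of a stacked area plot sits at that sum rather than at any single region's value.

```r
# Hypothetical toy data: two regions, three dates (not the real Ecuador figures)
toy <- data.frame(
  date = rep(as.Date("2020-04-01") + 0:2, times = 2),
  locations = rep(c("guayas", "pichincha"), each = 3),
  newcases_ma = c(1500, 1200, 900, 400, 700, 800)
)

# Height of the stack on each date = sum over all regions on that date
stack_height <- aggregate(newcases_ma ~ date, data = toy, FUN = sum)

# Largest value reached by any single region, for comparison
max_single <- max(toy$newcases_ma)

print(stack_height)
print(max_single)
```

If the per-date sums from the real data match the top of the stacked area plot, the plot is consistent with the faceted bar charts.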
I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is best suited to proportional data. But it doesn't seem to be working. The model runs but gives the above warning and outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer values without the sample totals from which the proportions were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but then you will not be able to use AIC and related tools that expect a real likelihood.
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (but technically not 0 or 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
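As a rough sketch of the weights approach described above, the snippet below uses base R's glm() rather than mgcv so that it runs without extra packages; gam() takes the same family and weights arguments. The counts are copied from the sample rows in the question, the temperature values are rounded, and the column names successes, total and x are placeholders, not the question's names.

```r
# Counts taken from the question's sample rows; column names are placeholders
dat <- data.frame(
  successes = c(18, 10, 0, 3, 5, 25),
  total     = c(23, 35, 5, 49, 23, 35),
  x         = c(7.4, 7.5, 9.3, 10.9, 7.4, 7.5)
)
dat$prop <- dat$successes / dat$total

# Option 1: proportion as the response, with the totals supplied as weights
fit_w <- glm(prop ~ x, family = binomial, weights = total, data = dat)

# Option 2: two-column response of successes and failures
fit_c <- glm(cbind(successes, total - successes) ~ x,
             family = binomial, data = dat)

# Both forms describe the same binomial likelihood, so the coefficients agree
print(all.equal(coef(fit_w), coef(fit_c), tolerance = 1e-6))
```

Either form tells the model how many trials lie behind each proportion, which is the information missing when only prop is passed in.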
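I also found that, rather than calculating a proportion, using cbind() with the two count columns removed the warning. Note that the two columns must be successes and failures (total minus successes), not successes and totals, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")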
I'm hoping to get some help with presenting regression outputs for my Masters thesis. I am assessing the impacts of elephants on woody vegetation, particularly in relation to artificial waterholes. In addition to generally declining with distance from waterholes, the impacts differ substantially between the two vegetation types involved.
I've figured out what seems to me a satisfactory way of plotting this using visreg. In the model output shown below, both distance to waterhole and veg type explained damage, hence my attempt to show both. However, the issue is that I only have samples at the furthest distances from waterholes (x-axis) for the red vegetation type. As you can see, the regression line for the blue veg type extends beyond the last points for this vegetation type. Is there any way I can get the blue line to stop at a smaller distance from the waterhole (x-axis value) than the red one to avoid this?
See code for the model and plot below the visreg plot.
Sample data and code
> dput(vegdata[21:52, c(4,7,33)])
structure(list(distance = c(207L, 202L, 501L, 502L, 1001L, 1004L,
2010L, 1997L, 4003L, 3998L, 202L, 194L, 499L, 494L, 1004L, 1000L,
2008L, 1993L, 4008L, 3998L, 493L, 992L, 1941L, 2525L, 485L, 978L,
1941L, 3024L, 495L, 978L, 1977L, 2952L), vegtype = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("teak",
"term"), class = "factor"), toedl = c(35.48031025, 47.30482718,
25.16709533, 22.29360164, 17.6546533, 12.81605101, 20.34136734,
18.45809334, 11.3578081, 3.490830751, 60.54870317, 44.9863128,
18.81010698, 20.4777188, 30.36994386, 18.7417214, 21.52247156,
18.29685939, 30.26217664, 8.945486104, 43.95749178, 43.54799495,
44.42693993, 50.06207783, 48.05538594, 35.31220933, 52.37339094,
40.51569938, 41.45677007, 58.86629306, 37.80203313, 46.35633342
)), row.names = 21:52, class = "data.frame")
m1<-lm(toedl~vegtype+distance, data=vegdata)
summary(m1)
library(visreg)
visreg(m1, 'distance', by='vegtype', overlay=TRUE, gg=TRUE, points=list(size=2.5), ylab='% old elephant damage', xlab='distance from waterhole')
Regarding the comments about a reproducible example, you can just make a small data frame with representative data, as below. As a general comment, you should also avoid naming your variables after base functions such as 'all'.
I'm not sure whether it's possible to use visreg to do what you want, but you can extract the information from your model using predict, then use ggplot to plot it, which may be preferable because ggplot is really good for customizing plots.
library(ggplot2)
library(visreg)
# Create reproducible data example
allData <- data.frame(vegtype = rep(c("t1", "t2"), each = 10),
oedl = c(seq(from = 35, to = 20, length.out = 10),
seq(from = 20, to = 5, length.out = 10)),
sexactd = c(seq(from = -1, to = 1, length.out = 10),
seq(from = -1, to = 2, length.out = 10)))
# Make linear model
oedl6 <- lm(formula = oedl ~ sexactd + vegtype, data = allData)
# Predict the data using the linear model
odelPred <- cbind(allData, predict(oedl6, interval = 'confidence'))
ggplot(odelPred, aes(sexactd, oedl, color = vegtype, fill = vegtype)) +
geom_point() + geom_line(aes(sexactd, fit)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)
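If you want a smooth line that is guaranteed to stop at each group's last observation, one option is to predict on a grid built separately for each group. The following is a sketch under the same made-up data as above; the names fit, grid and pred are illustrative, not part of the original answer.

```r
# Hypothetical data: group "t1" spans a shorter x range than "t2"
allData <- data.frame(
  vegtype = rep(c("t1", "t2"), each = 10),
  sexactd = c(seq(-1, 1, length.out = 10), seq(-1, 2, length.out = 10)),
  oedl    = c(seq(35, 20, length.out = 10), seq(20, 5, length.out = 10))
)
fit <- lm(oedl ~ sexactd + vegtype, data = allData)

# One prediction grid per group, limited to that group's observed x range
grid <- do.call(rbind, lapply(split(allData, allData$vegtype), function(d) {
  data.frame(
    vegtype = d$vegtype[1],
    sexactd = seq(min(d$sexactd), max(d$sexactd), length.out = 50)
  )
}))

# Fitted values and confidence limits on the restricted grid
pred <- cbind(grid, predict(fit, newdata = grid, interval = "confidence"))

# Largest x value predicted for each group
x_max <- tapply(pred$sexactd, pred$vegtype, max)
print(x_max)
```

The pred data frame can then be handed to geom_line(data = pred, aes(sexactd, fit)) and geom_ribbon(data = pred, aes(sexactd, ymin = lwr, ymax = upr)) so that each group's line ends where its data end.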
MR Macarthur's solution is great and deserves to be the accepted answer. Visualising a multiple regression model with several predictors in a two-dimensional graph is difficult. Basically, you are limited to one predictor on the x-axis, plus a grouping variable for the interaction (in your case: vegtype). One can simply use geom_smooth for it.
Using your data:
library(tidyverse)
ggplot(vegdata, aes(distance, toedl, color = vegtype)) +
geom_point() +
geom_smooth(method = 'lm')
Created on 2019-12-13 by the reprex package (v0.3.0)
I'm having some trouble setting readable tick marks on my axes. The problem is that my data are at different magnitudes, so I'm not really sure how to go about it.
My data include ~400 different products, with 3/4 variables each, from two machines. I've pre-processed it into a data.table and used gather to convert it to long form- that part is fine.
Overview: Data is discrete, each X_________ on the x-axis represents a separate reading, and its relative values from machine 1/2 - the idea is to compare the two. The graphical format is perfect for my needs, I would just like to set the ticks at say, every 10 products on the x-axes, and at reasonable values on the y-axis.
Y_1: from 150 to 250
Y_2: from say, 1.5* to 2.5
Y_3: from say, 0.8* to 2.3
Y_4: from say, 0.4* to 1.5
*Bottom value, rounded down
Here's the code I'm using so far
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
MProduct$Parameter <- factor(MProduct$Parameter,
labels = var.Parameter)
labels_x <- MProduct$Lot[seq(0, 1626, by= 20)]
labels_y <- MProduct$Value[seq(0, 1626, by= 15)]
plot.MProduct <- ggplot(MProduct, aes(x = Lot,
y = Value,
colour = V4)) +
facet_grid(Parameter ~.,
scales = "free_y") +
scale_x_discrete(breaks=labels_x) +
scale_y_discrete(breaks=labels_y) +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (angle = 90,
hjust = 1,
vjust = 0.5))
# ggsave("MProduct.png")
plot.MProduct
Does anyone know how to make this graph more readable? Setting labels/breaks manually greatly limits flexibility and readability. There should be an option to place a tick every N values, right? Same with y.
I need to apply this as a function to multiple datasets, so I'm not very happy about having to specify the column length of the "gathered" dataset every time either, which, in this case is 1626.
Since I'm here, I would also like to take the opportunity to ask about this code:
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
More often than not, I need to label my data in a specific order, which is not necessarily alphabetical. R, however, defaults to some kind of odd behaviour whereupon I have to plot and verify that the labels are indeed where they should be. Any clue how I could force them to be presented in order? As it is, my solution is to keep shifting their position in that line of code until it produces the graph correctly.
Many thanks.
Okay. I'm going to ignore the y axis labels because the defaults seem to work just fine as long as you don't try to overwrite them with your custom labels_y thing. Just let the defaults do their work. For the X axis, we'll give a couple options:
(A) Label every Nth product on the x-axis. Looking at ?scale_x_discrete, we can set the labels to a function that takes all the levels of the factor and returns the labels we want. So we'll write a function factory: a function that returns a function that keeps every Nth label:
every_n_labeler <- function(n = 3) {
  function(x) {
    # keep positions 1, n + 1, 2n + 1, ...; blank out the rest
    ind <- (seq_along(x) - 1) %% n == 0
    x[!ind] <- ""
    x
  }
}
Now let's use that as the labeler:
ggplot(df, aes(x = Lot,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
scale_x_discrete(labels = every_n_labeler(3)) +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
You can change the every_n_labeler(3) to (10) to make it every 10th label.
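To see what the labeler does before plotting, it can be called directly on a character vector. This is a small sketch; the lot names are made up, and the function is repeated here only so the snippet runs on its own.

```r
# Same labeler as above, repeated so this snippet is self-contained
every_n_labeler <- function(n = 3) {
  function(x) {
    keep <- (seq_along(x) - 1) %% n == 0  # TRUE at positions 1, n + 1, 2n + 1, ...
    x[!keep] <- ""
    x
  }
}

# Hypothetical lot names
lots <- paste0("X", 1:7)

# Keep every 3rd label, blank out the others
labels_3 <- every_n_labeler(3)(lots)
print(labels_3)
```

Because the labeler only looks at positions, it needs no knowledge of the dataset length, which helps when the same plotting function is applied to several datasets.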
(B) Maybe more appropriate: it seems like your x-axis is actually numeric and just happens to have an "X" in front of it. Let's convert it to numeric and let the defaults do the labeling work:
df$time = as.numeric(gsub(pattern = "X", replacement = "", x = df$Lot))
ggplot(df, aes(x = time,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
With your full x range, I imagine that would look nice.
(C) But who wants to read those 9-digit numbers? You're labeling the x-axis as "Time (s)", which makes me think it's actually a time, measured in seconds from some start time. I'll pretend that your start time is 2010-01-01 and convert these seconds to actual times, and then we get a nice date-time scale:
ggplot(df, aes(x = as.POSIXct(time, origin = "2010-01-01"),
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
If this is the real meaning behind your data, then using a date-time axis is a big step up for readability. (Again, notice that we are not specifying the breaks, the defaults work quite well.)
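The conversion itself can be checked outside ggplot. This is a minimal sketch that assumes the same made-up origin of 2010-01-01 and fixes the time zone to UTC so the result does not depend on the local machine; the values in secs are illustrative only.

```r
# Seconds elapsed since an assumed start time (the origin is a guess, as in the answer)
secs <- c(0, 3600, 86400)

# Convert to date-times; tz = "UTC" keeps the result independent of the local time zone
stamps <- as.POSIXct(secs, origin = "2010-01-01", tz = "UTC")

# Format for display
stamp_text <- format(stamps, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(stamp_text)
```

Once the column is a POSIXct vector, ggplot2 picks a date-time scale for it automatically, and breaks can be tuned later with scale_x_datetime() if the defaults are not suitable.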
Using this data (I subset your sample data down to 2 facets and used dput to make it copy/pasteable):
df = structure(list(Lot = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L), .Label = c("X180106482", "X180126485", "X180306523",
"X180526326"), class = "factor"), Value = c(201, 156, 253, 211,
178, 202.5, 203.4, 204.3, 205.2, 2.02, 2.17, 1.23, 1.28, 1.54,
1.28, 1.45, 1.61, 2.35, 1.34, 1.36, 1.67, 2.01, 2.06, 2.07, 2.19,
1.44, 2.19), Parameter = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Var 1", "Var 2", "Var 3", "Var 4"
), class = "factor"), Machine = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Machine 1", "Machine 2"), class = "factor"),
time = c(180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482, 180106482, 180126485,
180306523, 180526326, 180106482, 180126485, 180306523, 180526326,
180106482, 180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482)), row.names = c(NA,
-27L), class = "data.frame")
I've searched SO and the internet far and wide, but somehow can't find a reason for or solution to this problem. When plotting time-series type data using ggplot2 I always seem to get a vertical line connecting my points, instead of the points being plotted individually and simply connected via lines over time. Here's an example using mpg.
require(ggplot2)
gg <- ggplot(mpg, aes(x=year, y=cty,
group=manufacturer, colour=manufacturer))
gg + geom_point() + geom_line()
Is there any way to have the vertical line connecting the points removed? And why does ggplot2 do this? Thanks for your help in advance!
EDITED BASED ON DOWN VOTE AND QUESTIONS BELOW.
Perhaps mpg wasn't the best dataset to use as an example. I have multiple observations for individuals at defined time points which I want to plot by combining geom_point() and geom_line(). However, at each time point my individual observations (points) are also connected with a vertical line, and I do not know what it means or how it can be removed. Is it because I have multiple observations for the same individual at the same time point?
Here's a dataset that helps illustrate the problem.
dput(x1)
structure(list(Assessment_Time = structure(c(1L, 2L, 1L, 1L,
2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 1L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 4L, 4L,
6L, 6L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L), .Label = c("Initial",
"First follow-up", "Second follow-up", "Third follow-up", "Fourth follow-up",
"Fifth follow-up"), class = "factor"), id = c(454316L, 454316L,
1184099L, 1184099L, 1184099L, 1184099L, 1184099L, 1184099L, 1184099L,
1184099L, 124227L, 124227L, 124227L, 124227L, 124227L, 124227L,
124227L, 124227L, 124227L, 124227L, 124227L, 124227L, 124227L,
124227L, 1227808L, 1227808L, 1234280L, 1234280L, 1234280L, 1234280L,
1233898L, 1233898L, 1233898L, 1233898L, 1233898L, 1233898L, 1233898L,
1233898L, 1191086L, 1191086L, 1191086L, 1232973L, 1232973L, 1232973L,
1232973L, 1232973L, 1232973L, 1251251L, 1251251L, 1251251L),
US_thickest_um = c(3400, 1500, 7600, 6000, 6600, 4500, 6100,
4000, 6400, 3500, 2300, 2400, 3400, 2200, 1500, 2500, 2100,
1500, 2500, 1700, 1700, 3800, 2800, 2800, 2300, 1300, 6000,
3200, 3800, 1900, 5400, 6200, 2200, 3000, 1900, 2100, 1900,
2500, 4600, 2800, 2100, 3400, 1900, 2400, 1700, 2100, 1300,
2800, 4000, 3700)), .Names = c("Assessment_Time", "id", "US_thickest_um"
), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"
))
gg <- ggplot(x1, aes(x=Assessment_Time, y=US_thickest_um, group=factor(id)))
gg + geom_point(aes(colour=factor(id))) + geom_line(aes(colour=factor(id)))
It's not totally clear what your goal is here, but let's say it is to compare the mean for each manufacturer in 1999 and 2008 in a way that also shows the variation by plotting the individual points.
You could do something like this, playing around with the options until you get it the way you want.
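library(dplyr)    # provides %>%, group_by() and summarize()
library(ggplot2)  # provides the mpg data and plotting functions
# Mean city mileage per manufacturer and year
means <- mpg %>%
  group_by(year, manufacturer) %>%
  summarize(cty = mean(cty))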
ggplot(mpg, aes(x=year, y = cty)) +
geom_jitter(aes(colour = manufacturer), width = 0.15) +
geom_line(data = means, aes(group = manufacturer, colour = manufacturer))
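The vertical segments appear because geom_line() joins all observations within a group in order of x, so two readings that share the same x position are connected by a vertical stroke. The sketch below applies the same summarise-then-draw idea to repeated measurements per individual, using base R only; the data and the column names id, visit and value are made up for illustration.

```r
# Hypothetical repeated measurements: two readings per id at each visit
x1 <- data.frame(
  id    = rep(c("a", "b"), each = 4),
  visit = factor(rep(c("Initial", "First follow-up"), each = 2, times = 2),
                 levels = c("Initial", "First follow-up")),
  value = c(7600, 6000, 6600, 4500, 2300, 2400, 1500, 2500)
)

# Collapse to one value per id and visit, so each line has one point per x position
x1_mean <- aggregate(value ~ id + visit, data = x1, FUN = mean)
print(x1_mean)
```

The raw x1 can then go to geom_point() and x1_mean to geom_line(data = x1_mean, aes(group = id)), which leaves one line per individual without vertical segments.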
It's not clear what you're trying to do. You refer to time-series data but actually use something completely different: neither mpg nor your updated sample data are time-series data.
I assume you are asking how to plot time-series data in ggplot and encode different time series as differently coloured lines. Here is a simple example that should help you get started.
First off, let's generate data for 10 time series.
ts <- replicate(
10,
ts(cumsum(1 + round(rnorm(100), 2)), start = c(1954, 7), frequency = 12),
simplify = FALSE)
We convert the ts objects into a list of data.frames.
library(zoo)  # for as.yearmon()
lst <- lapply(setNames(ts, paste0("series_", 1:10)), function(x)
  data.frame(Y = as.matrix(x), date = as.Date(as.yearmon(time(x)))))
We now plot data by mapping id to the colour aesthetic to show the 10 different time series as 10 differently coloured line graphs.
library(tidyverse)
dplyr::bind_rows(lst, .id = "id") %>%
ggplot(aes(date, Y, colour = as.factor(id))) +
geom_line()
You need to reconsider your plot design.
There are only two years in the data, so this can't be a classic time-series line chart.
library(tidyverse)
count(mpg, year)
year n
<int> <int>
1 1999 117
2 2008 117
One of the alternatives can be this
gg <- ggplot(mpg, aes(x=manufacturer, fill = as.factor(cyl)))
gg + geom_bar(stat = "count") +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))