Show difference value over time in trend line - r

My question is similar to this one.
In the linked question, the plot shows the difference between values over time. I want to show a line plot as well, along with the difference between the values.
In other words, in addition to the arrows, I want a trend line across the years for each set of values. How can I do that?
Data to replicate (similar to the linked question):
library(ggplot2)
library(dplyr)
library(reshape2)  # provides melt()
original.df <- read.table(text = "year Arabica Robusta
1990 100 200
1995 180 120
2000 200 190
2005 190 210
2012 230 120", header = TRUE)
# flag whether Robusta is above or below Arabica, then reshape to long format
df <- original.df %>%
  mutate(direction = ifelse(Robusta - Arabica > 0, "Up", "Down")) %>%
  melt(id = c("year", "direction"))
# group = year pairs the two points of each year so geom_path draws one arrow per year
g1 <- ggplot(df, aes(x = year, y = value, color = variable, group = year)) +
  geom_point(size = 4) +
  geom_path(aes(color = direction), arrow = arrow())
The resulting plot looks like the one in the linked question.
If I add geom_smooth, nothing is shown, which makes sense to me: as I understand it, geom_smooth does not know which points to use, Arabica or Robusta.

I tried a few things and apparently managed to solve it:
ggplot(df, aes(x = year, y = value, color = variable)) +
  geom_line() +
  geom_point(size = 4) +
  geom_path(aes(color = direction, group = year), arrow = arrow())
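For a fitted trend line, rather than lines that simply connect the points, one option is to give geom_smooth its own grouping. The sketch below rests on two assumptions that are mine, not the original post's: that "trend line" means a straight-line fit per variety (method = "lm"), and that reshaping with tidyr::pivot_longer is an acceptable substitute for melt. The names df_long and g_trend are only illustrative.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

original.df <- read.table(text = "year Arabica Robusta
1990 100 200
1995 180 120
2000 200 190
2005 190 210
2012 230 120", header = TRUE)

# Long format: one row per year and variety (assumed stand-in for melt()).
df_long <- original.df %>%
  mutate(direction = ifelse(Robusta - Arabica > 0, "Up", "Down")) %>%
  pivot_longer(c(Arabica, Robusta), names_to = "variable", values_to = "value")

g_trend <- ggplot(df_long, aes(x = year, y = value, color = variable)) +
  geom_point(size = 4) +
  # arrows: group by year so each path joins the two points of one year
  geom_path(aes(color = direction, group = year), arrow = arrow()) +
  # trend: group by variety so each fit uses only that variety's points
  geom_smooth(aes(group = variable), method = "lm", formula = y ~ x, se = FALSE)

g_trend
```

The key point is that the grouping is set per layer: group = year for the arrows and group = variable for the smoother, so neither layer inherits a grouping that does not suit it.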

Related

Finding multiple peak densities on facet wrapped ggplot for two datasets

I am currently attempting to plot densities of flies on julian dates, per year. The aim is to see when there are peak densities of flies, for two methods of data collection (group 1 and group 2). I have many rows of data, over the course of 10 years, for example, the data set looks like this:
year  julian  group
2000  214     1
2001  198     1
2001  224     1
2000  189     2
2000  214     2
2001  222     2
2001  259     2
2000  260     2
2000  212     1
Each row is a single observation.
This is my first time plotting with ggplot2, so I am confused about how to plot vertical lines at the peaks for each year.
The code currently looks like this:
data$group <- as.factor(data$group)
plots <- ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group),adjust = 2) + facet_wrap(~year, ncol = 2)
I have attempted to plot peaks using this code:
geom_vline(data = vline, aes(xintercept = density(data$julian)$x[which.max(density(data$julian)$y)]))
vline <- summarise(group_by(data,year, group=group), density(data$julian, group=group)$x[which.max(density(data$julian)$y)])
vline
However, I assume this finds the peak density across all years and all groups.
Could anyone advise me on how to plot the maximum density for each year and group in each facet? Even better, if there are multiple peaks, how would I find them, along with a quantitative value for each peak?
Thank you in advance; I am very new to ggplot2.
Instead of trying to wrangle all the computations into one line of code, I would suggest splitting it into steps, like so. Instead of using your code to find the highest peak, I make use of this answer, which in principle should also find multiple peaks (see below):
library(dplyr)
library(ggplot2)
fun_peak <- function(x, adjust = 2) {
  d <- density(x, adjust = adjust)
  # a peak is a point where the density switches from rising to falling
  d$x[c(FALSE, diff(diff(d$y) >= 0) < 0)]
}
vline <- data %>%
group_by(year, group) %>%
summarise(peak = fun_peak(julian))
#> `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group), adjust = 2) +
geom_vline(data = vline, aes(xintercept = peak)) +
facet_wrap(~year, ncol = 2)
And here is a small example with multiple peaks based on the example data in the linked answer:
x <- c(1,1,4,4,9)
data <- data.frame(
year = 2000,
julian = rep(c(1,1,4,4,9), 2),
group = rep(1:2, each = 5)
)
data$group <- as.factor(data$group)
vline <- data %>%
group_by(year, group) %>%
summarise(peak = fun_peak(julian, adjust = 1))
#> `summarise()` has grouped output by 'year', 'group'. You can override using the `.groups` argument.
ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group), adjust = 1) +
geom_vline(data = vline, aes(xintercept = peak)) +
facet_wrap(~year, ncol = 2)
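The question also asks for a quantitative value for each peak. As a rough sketch, the same rising-then-falling rule can return the estimated density height next to each peak location. The helper peak_table and the data frame toy are hypothetical names introduced here, and group_modify() is my choice for returning several rows per group; none of this is from the original answer.

```r
library(dplyr)

# Hypothetical helper: same peak rule as fun_peak(), but it also
# returns the estimated density at each peak.
peak_table <- function(x, adjust = 2) {
  d <- density(x, adjust = adjust)
  rising <- diff(d$y) >= 0
  # TRUE where the curve was rising into a point and falls after it
  is_peak <- c(FALSE, diff(rising) < 0, FALSE)
  data.frame(peak = d$x[is_peak], height = d$y[is_peak])
}

# Toy data mirroring the small example above.
toy <- data.frame(
  year   = 2000,
  julian = rep(c(1, 1, 4, 4, 9), 2),
  group  = factor(rep(1:2, each = 5))
)

# One row per peak, per year and group.
peaks <- toy %>%
  group_by(year, group) %>%
  group_modify(~ peak_table(.x$julian, adjust = 1)) %>%
  ungroup()

peaks
```

The resulting data frame can be passed to geom_vline(data = peaks, aes(xintercept = peak)) in the same way as vline, and the height column gives a number to compare peaks with. Keep the adjust value identical to the one used in geom_density, otherwise the lines will not sit on the plotted curve's maxima.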

R Sort Cleveland Dot Plot by not shown variable

I followed this manual (https://afit-r.github.io/cleveland-dot-plots) to create a Cleveland dot plot, which I was able to reproduce, but I ran into the following challenges:
How do I sort my y-axis in historical order? The varieties on my y-axis have different release years, and although those are not shown in my plot, I would like to order the varieties by release year. Right now they are in some weird alphabetical order starting from the bottom, and I don't know how to change that.
I couldn't manage to show the differences between the points as percentages (like in the manual). Could anyone explain that to me in more detail?
Do you see any possibility of including the same data for another year?
See below for my code and picture:
require(ggplot2)
require(reshape2)
require(dplyr)
require(plotrix)
cleanup = theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(color = "black"))
data19 = read.csv("Harvest_2019_V2.csv", sep = ";")
data19$Experiment_Year <- as.factor(data19$Experiment_Year)
data19$Release_year <- as.factor(data19$Release_year)
Subset2019 = subset(data19, Experiment_Year == 2019)
agHarvest.Weight <- aggregate(Subset2019[, 9], list(Subset2019$Variety,Subset2019$Release_year,Subset2019$Treatment), mean)
agHarvest.Weight$Variety <- agHarvest.Weight$Group.1
agHarvest.Weight$Release_Year <- agHarvest.Weight$Group.2
agHarvest.Weight$Treatment <- agHarvest.Weight$Group.3
agHarvest.Weight$Yield <- agHarvest.Weight$x
right_label <- agHarvest.Weight %>%
group_by(Variety) %>%
arrange(desc(Yield)) %>%
top_n(1)
left_label <- agHarvest.Weight %>%
group_by(Variety) %>%
arrange(desc(Yield)) %>%
slice(2)
ggplot(agHarvest.Weight, aes(Yield, Variety)) +
geom_line(aes(group = Variety)) +
geom_point(aes(color = Treatment), size = 1.5) +
geom_text(data = right_label, aes(color = Treatment, label = round(Yield, 0)),
size = 3, hjust = -.5) +
geom_text(data = left_label, aes(color = Treatment, label = round(Yield, 0)),
size = 3, hjust = 1.5) +
scale_x_continuous(limits = c(2500, 4500)) + cleanup + xlab("Yield, g") +
scale_color_manual(values=c("blue","darkgreen"))
OP. Understandably, you cannot always share data for various reasons. This is why it is always recommended to either use an existing publicly available dataset or craft your own in order to produce a minimal reproducible example. Fortunately, you're in luck, as I don't mind doing this for you. :)
TL;DR - there are many ways, but the simplest method is to use reorder(your_variable, variable_to_sort_by). Note that the y axis direction goes "bottom-up" rather than "top-to-bottom" on the plot.
Example Data
df <- data.frame(
Variety=rep(LETTERS[1:5], each=2),
Yield=c(265, 285, 458, 964, 152, 202, 428, 499, 800, 900),
Treatment=rep(c('first','second'), 5),
Year=rep(c(2000, 2001, 2010, 1999, 1998), each=2)
)
> df
   Variety Yield Treatment Year
1        A   265     first 2000
2        A   285    second 2000
3        B   458     first 2001
4        B   964    second 2001
5        C   152     first 2010
6        C   202    second 2010
7        D   428     first 1999
8        D   499    second 1999
9        E   800     first 1998
10       E   900    second 1998
Basic Cleveland Dot Plot
p <- ggplot(df, aes(x=Yield, y=Variety)) +
geom_line(aes(group=Variety)) +
geom_point(size=3) +
geom_text(aes(label=Yield), nudge_y=0.2, size=2) +
theme_bw()
p
Sort Variety (Y axis) by Year Column
You should first notice how ggplot2 arranges your axes. The key is to understand that the origin of the plot is at the bottom left corner. This means that the lowest value for the x and y axes will be at the left and bottom, respectively. This is the reason why df$Variety is alphabetical, but "goes up" (from bottom to top). To reverse the y axis, you can just add scale_y_reverse() to your plot code, but that only works for continuous axes. For discrete axes, you can use scale_y_discrete(limits=rev), which reverses whatever order the axis currently has (passing a function as limits needs ggplot2 3.3.0 or later). You'll see in the following approach we can avoid that.
To sort the y axis by another column, you can use reorder() right within the aes() call. The reorder() function is basically set up as follows:
reorder(columnA, column_to_use_to_sort_columnA)
In this case, you'll want to sort df$Variety by df$Year, so this should become:
reorder(Variety, Year)
...but remember how the y axis "goes up"? If you want the y axis to be sorted by df$Year and "go down", you can either reverse the axis via scale_y_discrete(limits=rev), or more conveniently just sort by df$Year in reverse using the syntax:
reorder(Variety, -Year)
Putting this together you get this:
p1 <- ggplot(df, aes(x=Yield, y=reorder(Variety, -Year))) +
geom_line(aes(group=Variety)) +
geom_point(size=2) +
geom_text(aes(label=Yield), nudge_y=0.2, size=2) +
theme_bw()
p1
You'll see we have our proper order now, where df$Variety is sorted by ascending df$Year, starting from the top (1998) and going down to the bottom (2010).
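To check the ordering without drawing anything, you can look at the factor levels that reorder() produces; the first level is the one ggplot2 places at the bottom of a discrete y axis. This is only a small sketch on the example data, and the variable names lev_by_year and lev_by_neg_year are made up for illustration.

```r
df <- data.frame(
  Variety = rep(LETTERS[1:5], each = 2),
  Year    = rep(c(2000, 2001, 2010, 1999, 1998), each = 2)
)

# reorder() sorts the levels by the mean of the second argument, ascending.
lev_by_year     <- levels(reorder(df$Variety, df$Year))
lev_by_neg_year <- levels(reorder(df$Variety, -df$Year))

lev_by_year      # "E" "D" "A" "B" "C"  (1998 first, so 1998 at the bottom)
lev_by_neg_year  # "C" "B" "A" "D" "E"  (2010 first, so 1998 at the top)
```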
Other ways?
There are other ways to do your sorting, but I found this the most straightforward. The other fundamentally different approach would be to sort your data frame first, then plot. However, if you do this, be aware that ggplot2 will convert any column with discrete values into a factor first, and the default factor levels are created by sorting the names in alphabetical order. This means that if you sort your data frame first, then plot, you'll still be stuck with alphabetical order. You would need to sort, then explicitly convert df$Variety into a factor (and specify the levels), then plot. Something like this works just the same:
df <- dplyr::arrange(df, -Year) # arrange by descending Year
df$Variety <- factor(df$Variety, levels=unique(df$Variety)) # factor and indicate levels
ggplot(df, aes(x=Yield, y=Variety)) +
  geom_line(aes(group=Variety)) +
  geom_point(size=2) +
  geom_text(aes(label=Yield), nudge_y=0.2, size=2) +
  theme_bw()
The code above gives you the same plot as the method using reorder(Variety, -Year). No extra scale_y_discrete() reversal is needed here: the first factor level (2010) is drawn at the bottom and the last (1998) at the top.

Creating line graph for variable against proportion of another variable using ggplot in R

I have a large survey dataset for Women and labour force. The answers are categorical values with different data labels. The dataset consists of 63,000 responses and 2000 different variables but I have attached a small snippet of the relevant variables below along with the data labels.
I need to construct a line graph for the Age profile of women in labour force by geographical location. I have the data for Age, Currently employed (with values 0 and 1 ; 0 being no and 1 being yes) and place of residence (values are 1 and 2; 1 being urban and 2 being rural) but I cannot figure out a way to combine the data and plot it since I am a beginner.
I wish to take the proportion of women currently employed on the y-axis and age on the x-axis and get two lines one showing urban and one for rural.
I have attached an image of the kind of output I have in mind and the snippet of the variables. Since I couldn't add two separate images, I have put them together.
I understand that I can show urban-rural using facet_grid but I'm having trouble figuring out how to combine that data.
Image link
I would greatly appreciate any help.
Welcome! Like @stefan said, it is easier if we can see some of your data, so I generated some from your description.
library(tidyverse)
library(magrittr)
place = sample(c(1,2),63000, replace = TRUE) # 1 = Urban and 2 = Rural
employ = sample(c(0,1),63000, replace = TRUE) # 0 = Not Employed and 1 = Employed
age = sample(c(20:45), 63000, replace = TRUE) # Age
df = data.frame(place,employ, age)
df %>%
  group_by(age, place, employ) %>%
  summarise(n = n()) %>%          # result is still grouped by age and place
  mutate(prop = n / sum(n)) %>%   # share of each employ value within age and place
  filter(employ == 1) %>%
  mutate(newplace = case_when(place == 1 ~ "Urban", place == 2 ~ "Rural")) %>%
  ggplot(., aes(x = age, y = prop, color = newplace)) +
  geom_line(aes(linetype = newplace)) +
  scale_color_manual(values = c("blue", "red")) + # or colors of your choice
  labs(title = "Proportion of Women Employed:\n Comparing Urban and Rural Communities",
       y = "Proportion of Employed Women", color = "",
       linetype = "") + # removed legend titles since they were redundant
  theme_classic() +
  theme(plot.title = element_text(hjust = .5), legend.position = "bottom")
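Because employment is coded 0/1, the proportion employed within a group is simply the mean of that column, which allows a shorter variant. This is a sketch on simulated data, not the real survey: the column names place, employ and age follow the description in the question, and props and p_employed are illustrative names.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # simulated stand-in for the survey data
df <- data.frame(
  place  = sample(1:2, 1000, replace = TRUE),   # 1 = urban, 2 = rural
  employ = sample(0:1, 1000, replace = TRUE),   # 0 = no, 1 = yes
  age    = sample(20:45, 1000, replace = TRUE)
)

# The mean of a 0/1 column is the share of 1s in each age-by-place group.
props <- df %>%
  group_by(age, place) %>%
  summarise(prop = mean(employ), .groups = "drop") %>%
  mutate(place = factor(place, levels = c(1, 2), labels = c("Urban", "Rural")))

p_employed <- ggplot(props, aes(x = age, y = prop, color = place)) +
  geom_line() +
  labs(y = "Proportion of Employed Women", color = "")

p_employed
```

If the real data contain missing employment answers, mean(employ, na.rm = TRUE) would be needed, and whether dropping them is appropriate depends on the survey design.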

R ggplot geom_point overlay from 2 data frames, differentiated by color, subset by id

I have two data frames with identical rows and columns, DataMaster and IMPSAVG, for which I'm trying to create a series of combined overlaid 2d scatterplots (subset by country "ids" and variable columns) with observations from the two data sets differentiated by color in ggplot. The code below does not work, but gives a sense of what I'm aiming for (acctm is the variable and ARG is the country in this example).
ggplot() +
geom_point(data=DataMaster, aes(x="Year", y="acctm"), subset = .(Country %in% c("ARG")), shape=21, color= "red") +
geom_point(data=IMPSAVG, aes(x="Year", y="acctm"), subset = .(Country %in% c("ARG")), shape=21, color= "blue")
While just getting the above to work would be much appreciated, a loop to create separate plots of this variable for all unique country ids in the column Country found in both datasets (also specified by the vector CountryList$Country) would be amazing. Thanks!
Without a reproducible example of your dataset, it is hard to be sure what you are looking for.
However, using these fake datasets:
df1 <- data.frame(Country = c("A","A","A","B","B"),
Year = 2010:2014,
Value = sample(1:100,5))
df2 <- data.frame(Country = c("A","A","A","B","B"),
Year = 2010:2014,
Value = sample(1:100,5))
1) Plotting without joining datasets (not the most appropriate)
You don't absolutely have to combine your data frames to plot them, but keeping them separate makes things a little harder (especially if you want to customize several parameters).
Here you can do:
library(ggplot2)
ggplot()+
geom_point(data = df1, aes(x = Year, y = Value, color = "blue"), shape = 21)+
geom_point(data = df2, aes(x = Year, y = Value, color = "red"), shape = 21, show.legend = TRUE)+
scale_color_manual(values = c("blue","red"), labels = c("df1","df2"), name = "")
2) Assembling both dataframes (best way to do it)
However, it will be much easier if you combine both data frames (ggplot2 is designed to work with data frames in a longer format).
So, here, you can do:
df1$Dataset = "DF1"
df2$Dataset = "DF2"
DF <- rbind(df1,df2)
   Country Year Value Dataset
1        A 2010    66     DF1
2        A 2011    64     DF1
3        A 2012    40     DF1
4        B 2013    58     DF1
5        B 2014    20     DF1
6        A 2010    78     DF2
7        A 2011    25     DF2
8        A 2012    71     DF2
9        B 2013    40     DF2
10       B 2014    61     DF2
Now, you can simply plot it like this which is much more concise:
library(ggplot2)
ggplot(DF, aes(x = Year, y = Value, color = Dataset))+
geom_point(shape = 21)
3) Subsetting dataframe
To plot only a subset of your dataframes, starting with the assembled dataframe DF, you can simply do:
library(ggplot2)
ggplot(subset(DF, Country =="A"), aes(x = Year, y = Value, color = Dataset))+
geom_point(shape = 21)
Does this answer your question?
I think you need to create a new dataframe, which combines those two dataframes and subsets the countries that you are interested in. You can use rbind for combining the two, and also you should add a column for samples indicating which dataframe they are coming from, so that you can use it later in aes(..., color = new_column).
Just to add to dc37's excellent write-up, here is the trick for drawing one data frame's points on top of the other's:
ggplot(subset(DF, Country == "A"), aes(x = Year, y = Value, color = Dataset)) +
  geom_point(shape = 21, na.rm = TRUE) +
  # this second layer is drawn last, so the DF1 points end up on top
  geom_point(data = subset(DF, Dataset == "DF1" & Country == "A"),
             aes(x = Year, y = Value, color = Dataset), shape = 21, na.rm = TRUE)
where "DF1" is the data frame you want plotted on top.
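The question also asks for one plot per country. Starting from the combined data frame, a possible sketch is to loop over the unique country ids with lapply and keep the plots in a named list. The fake data follow the answer above; set.seed, the Dataset labels and the names countries and plots are assumptions for illustration, and with the real data you would substitute the combined DataMaster/IMPSAVG frame and the variable of interest.

```r
library(ggplot2)

set.seed(42)  # fake data in the same shape as the answer's example
df1 <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  Year = 2010:2014,
                  Value = sample(1:100, 5),
                  Dataset = "DF1")
df2 <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  Year = 2010:2014,
                  Value = sample(1:100, 5),
                  Dataset = "DF2")
DF <- rbind(df1, df2)

countries <- unique(DF$Country)

# One scatterplot per country, colored by source data frame.
plots <- lapply(countries, function(cty) {
  ggplot(subset(DF, Country == cty),
         aes(x = Year, y = Value, color = Dataset)) +
    geom_point(shape = 21) +
    ggtitle(cty)
})
names(plots) <- countries

plots[["A"]]  # display the plot for one country
```

If all countries should appear in a single figure instead, adding facet_wrap(~Country) to the unsubsetted plot is a simpler alternative to the loop.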

Setting facet-specific breaks in stat_contour

I'd like to show a contour plot using ggplot and stat_contour for two categories of my data with facet_grid. I want to highlight a particular level based on the data. Here's an analogous dummy example using the usual volcano data.
library(dplyr)
library(ggplot2)
v.plot <- volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
ggplot(aes(Var1, Var2, z = value)) +
stat_contour(breaks = seq(90, 200, 12)) +
facet_grid(~dummy)
Plot 1:
Let's say within each factor level (here east and west halves, I guess), I want to find the mean height of the volcano and show that. I can calculate it manually:
volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
group_by(dummy) %>%
summarise(h.bar = mean(value))
# A tibble: 2 × 2
  dummy    h.bar
  <lgl>    <dbl>
1 FALSE 140.7582
2 TRUE  119.3717
Which tells me that the mean heights on each half are 141 and 119. I can draw BOTH of those on BOTH facets, but not just the appropriate one on each side.
v.plot + stat_contour(breaks = c(141, 119), colour = "red", size = 2)
Plot 2:
And you can't put breaks= inside an aes() statement, so passing it in as a column in the original dataframe is out. I realize with this dummy example I could probably just do something like bins=2 but in my actual data I don't want the mean of the data, I want something else altogether.
Thanks!
I made another attempt at this problem and came up with a partial solution, but I'm forced to use a different geom.
volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
group_by(dummy) %>%
mutate(h.bar = mean(value), # edit1
is.close = round(h.bar) == value) %>% #
ggplot(aes(Var1, Var2, z = value)) +
stat_contour(breaks = seq(90, 200, 12)) +
geom_point(colour = "red", size = 3, # edit 2
aes(alpha = is.close)) + #
scale_alpha_discrete(range = c(0,1)) + #
facet_grid(~dummy)
In edit 1 I added a mutate() to the above block to generate a variable identifying where value was "close enough" (equal to the mean rounded to the nearest integer) to the desired highlight point (the mean of the data for this example).
In edit 2 I added geom_point() to show the grid locations with the desired value, and hid the undesired ones using an alpha of 0, i.e. totally transparent.
Plot 3:
The problem with this solution is that it's very gappy, and trying to bridge those with geom_path is a jumbled mess. I tried coarser rounding as well, and it just made things muddy.
Would love to hear other ideas! Thanks
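One more idea, offered as a sketch rather than something checked against the real data: since breaks cannot be mapped inside aes(), add one stat_contour layer per facet level, each with its own subset of the data and its own break. The data frame vol is rebuilt with base R to mirror what reshape2::melt(volcano) returns, and the names highlight and v.facet are made up here. The mean is only a placeholder for whatever per-facet level is wanted.

```r
library(ggplot2)

# Same long layout as reshape2::melt(volcano): row index, column index, value.
vol <- data.frame(
  Var1  = as.vector(row(volcano)),
  Var2  = as.vector(col(volcano)),
  value = as.vector(volcano)
)
vol$dummy <- vol$Var1 > median(vol$Var1)

# One highlight layer per facet level. Each layer only contains the rows of
# its own facet, so its break is drawn in that panel alone.
highlight <- lapply(split(vol, vol$dummy), function(d) {
  stat_contour(data = d, breaks = mean(d$value), colour = "red")
})

v.facet <- ggplot(vol, aes(Var1, Var2, z = value)) +
  stat_contour(breaks = seq(90, 200, 12)) +
  highlight +                    # a list of layers can be added with +
  facet_grid(~dummy)

v.facet
```

This works because each layer's data still carries the dummy column, so facet_grid sends it to the matching panel. The trade-off is one layer per facet, which is fine for a handful of panels but gets heavy with many.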
