Two line graphs in the same plot in R - r

I have a large data frame. I am trying to plot sales for two different years in the same plot as line graphs, to show the month-by-month variation across the two years. I did a long series of grouping and filtering before arriving at the data frame below.
The data frame has 3 columns (month, sales and year).
When I try to plot the sales across the different years with:
ggplot(df,aes(x=month.sales,y=sales/100000,color=year)) +
geom_line()
I get a blank graph with only the x and y labels, whereas a column graph of the same data works.
Please help.
thank you

I'm guessing your data looks something like this:
set.seed(69)
df <- data.frame(month.sales = factor(rep(month.abb, 2), month.abb),
year = rep(2018:2019, each = 12),
sales = runif(24, 1, 2) * 100000)
df
#> month.sales year sales
#> 1 Jan 2018 114570.1
#> 2 Feb 2018 123197.1
#> 3 Mar 2018 166092.7
#> 4 Apr 2018 163214.1
#> 5 May 2018 109486.6
#> 6 Jun 2018 131429.8
#> 7 Jul 2018 167363.6
#> 8 Aug 2018 191097.6
#> 9 Sep 2018 127427.4
#> 10 Oct 2018 145360.1
#> 11 Nov 2018 134577.1
#> 12 Dec 2018 169486.6
#> 13 Jan 2019 168493.2
#> 14 Feb 2019 147552.5
#> 15 Mar 2019 139811.3
#> 16 Apr 2019 156351.2
#> 17 May 2019 199368.3
#> 18 Jun 2019 130953.6
#> 19 Jul 2019 148150.5
#> 20 Aug 2019 166307.3
#> 21 Sep 2019 121830.8
#> 22 Oct 2019 101838.1
#> 23 Nov 2019 109716.9
#> 24 Dec 2019 125407.9
In which case you can draw a line plot like this:
library(ggplot2)
ggplot(df, aes(x = month.sales, y = sales / 100000,
color = factor(year), group = factor(year))) +
geom_line()
Note that you need to add the group aesthetic so that ggplot doesn't automatically group your data points according to the factor levels on the x axis.
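To see what the group aesthetic changes, one option is to inspect the groups ggplot2 computes for each version of the plot. This is only a sketch: the sales values below are made up, and the names p_default, p_grouped, n_default and n_grouped are illustrative. It relies on layer_data() from ggplot2, which returns the data a layer will draw, including its group column.

```r
library(ggplot2)

# Made-up data with the same shape as the example above
df <- data.frame(
  month.sales = factor(rep(month.abb, 2), levels = month.abb),
  year        = rep(2018:2019, each = 12),
  sales       = seq(100000, 200000, length.out = 24)
)

# No explicit group: ggplot2 groups by the interaction of all discrete
# variables, so every month/year combination becomes its own group
p_default <- ggplot(df, aes(x = month.sales, y = sales / 100000,
                            color = factor(year))) +
  geom_line()

# Explicit group: one group, and therefore one line, per year
p_grouped <- ggplot(df, aes(x = month.sales, y = sales / 100000,
                            color = factor(year), group = factor(year))) +
  geom_line()

# Count the distinct groups each plot would draw
n_default <- length(unique(layer_data(p_default)$group))
n_grouped <- length(unique(layer_data(p_grouped)$group))
```

A group that contains a single point cannot form a line segment, which is consistent with the empty panel described in the question.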

Related

R -> Sum part of Columns + aggregating observations [duplicate]

I am very new to coding and have just started doing some R graphics, and now I am a bit lost with my data analysis and need some light! I am practising some analyses on a very long dataset with 19 countries x 12 months x 22 products, with a profit for every month. It looks something like this:
Country Month Product Profit
Brazil Jan A 50
Brazil fev A 80
Brazil mar A 15
Austria Jan A 35
Austria fev A 80
Austria mar A 47
France Jan A 21
France fev A 66
France mar A 15
[...]
France Dez C 40 etc...
I was thinking of making one graph showing the profits through the year and another for every country, so I could see the top and bottom 2 countries. I wanted to have something like:
All Countries Jan 106 or Brazil 2021 145
All Countries Fev 146 Austria 2021 162
All Countries Mar 77 France 2021 112
But the sum function can't handle character columns, and as I have a long list, I don't know how to sum only part of the column.
Sorry if this is confusing.
The package dplyr has quite a natural syntax for this:
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(
Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
Profit = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), 3L),
Year = sort(rep(c(2019, 2020, 2021), 9L))
)
df %>%
group_by(Country, Month) %>%
summarize(sum = sum(Profit))
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: Country [3]
#> Country Month sum
#> <chr> <chr> <dbl>
#> 1 Austria Apr 105
#> 2 Austria Jun 141
#> 3 Austria May 240
#> 4 Brazil Feb 240
#> 5 Brazil Jan 150
#> 6 Brazil Mar 45
#> 7 France Aug 198
#> 8 France Jul 63
#> 9 France Sep 45
Using base R, you can try something along these lines.
# sum of profit per month
out1 <- tapply(df$Profit, df$Month, sum)
# sum of profit per year per country
out2 <- data.frame(
profit = sapply(split(df, f = ~ df$Country + df$Year), function(x) sum(x$Profit))
)
out2$Country <- gsub('\\.[0-9]*', '', rownames(out2))
out2$Year <- gsub('[a-zA-Z]*\\.', '', rownames(out2))
rownames(out2) <- NULL
Output
> out1
Apr Aug Feb Jan Jul Jun Mar May Sep
105 198 240 150 63 141 45 240 45
> head(out2)
profit Country Year
1 162 Austria 2019
2 145 Brazil 2019
3 102 France 2019
4 162 Austria 2020
5 145 Brazil 2020
6 102 France 2020
Data
# sample data
df <- data.frame(
Country = rep(c(rep('Brazil',3L),rep('Austria',3L),rep('France',3L)), 3L),
Profit = rep(c(50,80,15,35,80,47,21,66,15), 3L),
Month = rep(c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep'),3L),
Year = sort(rep(c(2019,2020,2021), 9L))
)
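As an alternative to the split/sapply approach, the same two summaries can probably be obtained with the formula interface of aggregate, which returns the grouping columns directly so no row names need parsing. This is a minimal sketch on the sample data above; by_month and by_country_year are illustrative names.

```r
# Same sample data as in the answer above
df <- data.frame(
  Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
  Profit  = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
  Month   = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), 3L),
  Year    = sort(rep(c(2019, 2020, 2021), 9L))
)

# Total profit per month, summed over all countries and years
by_month <- aggregate(Profit ~ Month, data = df, FUN = sum)

# Total profit per country and year
by_country_year <- aggregate(Profit ~ Country + Year, data = df, FUN = sum)
```

Both results are ordinary data frames, so they can be passed straight to a plotting function or sorted with order() to find the top and bottom countries.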

Computing information taken out from applying 'aggregate' in R

I have the following information:
head(Callao20)
Dia Mes Aho Temp
1 12 Feb 2020 NA
2 12 Feb 2020 NA
3 12 Feb 2020 NA
4 12 Feb 2020 NA
5 12 Feb 2020 NA
6 12 Feb 2020 NA
Despite the NAs shown here, there are actual values further down. By the way, would you recommend deleting the NAs?
Anyway, I'd like to compute the cv for each month, so I computed the following monthly statistics:
aggregate(Callao20[, 4], list(Callao20$Mes), mean)
Group.1 x
1 Feb NA
2 Mar 17.84195
3 Abr 17.50487
4 May 16.77294
5 Jun 16.45750
6 Jul 15.53369
7 Ago 14.93071
8 Set 14.65176
9 Oct 14.60224
10 Nov 14.48786
11 Dic 14.47635
...and also:
aggregate(Callao20[, 4], list(Callao20$Mes), sd)
Group.1 x
1 Feb NA
2 Mar 0.6280132
3 Abr 0.7163050
4 May 0.3962204
5 Jun 0.4165841
6 Jul 0.3743657
7 Ago 0.4063140
8 Set 0.3538223
9 Oct 0.6060919
10 Nov 0.5034747
11 Dic 0.3035467
Knowing that cv = (sd/mean)*100, how would you recommend computing it for each month from what I already have?
We could use dplyr from the tidyverse and compute both statistics per group in a single step, passing na.rm = TRUE to skip the missing values:
library(dplyr)
Callao20 %>%
group_by(Mes) %>%
summarise(out = sd(Temp, na.rm = TRUE)/mean(Temp, na.rm = TRUE) * 100)
Or, if we want to use aggregate, we can use the formula interface (the \(x) lambda shorthand requires R 4.1.0 or later):
aggregate(Temp ~ Mes, Callao20,
\(x) sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE) * 100)
I would suggest doing this in a single aggregate call rather than running separate aggregate calls and then trying to combine the results.
aggregate(Callao20[, 4], list(Callao20$Mes),
function(x) (sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE))*100)
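The two aggregate variants treat a month that contains only NAs differently, which matters for Feb in the question. The sketch below uses a hypothetical toy data frame (toy, with made-up temperatures) in place of Callao20; cv, cv_all and cv_obs are illustrative names.

```r
# Hypothetical toy data standing in for Callao20 (values are made up)
toy <- data.frame(
  Mes  = c("Feb", "Feb", "Mar", "Mar", "Mar", "Abr", "Abr", "Abr"),
  Temp = c(NA, NA, 17, 18, 19, 16, 18, 20)
)

# Coefficient of variation in percent, ignoring missing values
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

# Non-formula interface: every month is kept; a month with only NAs
# yields a missing result
cv_all <- aggregate(toy$Temp, list(Mes = toy$Mes), cv)

# Formula interface: rows with NA are dropped before grouping
# (default na.action = na.omit), so an all-NA month is absent
cv_obs <- aggregate(Temp ~ Mes, data = toy, FUN = cv)
```

So there is no need to delete the NAs beforehand: either form skips them, and the choice only decides whether an empty month appears in the result as a missing value or not at all.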

R, dplyr: How to divide data frame elements by specific elements

edit: Solution at the end.
I have a dataframe that contains different variables and the sum of these different variables as a variable called "total".
I want to add a new column that calculates each variable's share of the "total" variable.
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
value <- seq_along(month)
df = data.frame(name, month, value)
# Create total variable
dfTotal =
df%>%
group_by(month)%>%
summarize(value = sum(value, na.rm = TRUE))
dfTotal[["name"]] <- "Total"
dfTotal = as.data.frame(dfTotal)
# Add total column to dataframe
df2 = rbind(df, dfTotal)
df2
which gives the dataframe
name month value
1 A oct 2018 1
2 A nov 2018 2
3 B oct 2018 3
4 B nov 2018 4
5 Total nov 2018 6
6 Total oct 2018 4
What I want is to produce a new column with the shares of the total for each month in the above dataframe, so that I get something like
name month value share
1 A oct 2018 1 0.25 (=1/4)
2 A nov 2018 2 0.33 (=2/6)
3 B oct 2018 3 0.75 (=3/4)
4 B nov 2018 4 0.67 (=4/6)
5 Total nov 2018 6 1.00 (=6/6)
6 Total oct 2018 4 1.00 (=4/4)
Does anybody know how I can produce the last column of the second data frame from the first one?
Solution:
Based on tmfmnk's comment, the following solves the problem:
df2 =
df2 %>%
group_by(month) %>%
mutate(share = value/max(value))
df2
which gives
name month value share
<fct> <fct> <int> <dbl>
1 A oct 2018 1 0.25
2 A nov 2018 2 0.333
3 B oct 2018 3 0.75
4 B nov 2018 4 0.667
5 Total nov 2018 6 1
6 Total oct 2018 4 1
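Dividing by max(value) works here because the "Total" row is the largest value in each month, which holds as long as the values are non-negative. A variant that does not depend on that is to divide by the "Total" row explicitly. This is a sketch assuming exactly one "Total" row per month; shares is an illustrative name.

```r
library(dplyr)

# Same shape as df2 above: per-name values plus one "Total" row per month
df2 <- data.frame(
  name  = c("A", "A", "B", "B", "Total", "Total"),
  month = c("oct 2018", "nov 2018", "oct 2018", "nov 2018",
            "nov 2018", "oct 2018"),
  value = c(1, 2, 3, 4, 6, 4)
)

shares <- df2 %>%
  group_by(month) %>%
  # Within each month, divide every value by that month's "Total" value
  mutate(share = value / value[name == "Total"]) %>%
  ungroup()
```

If a month could have no "Total" row, or more than one, the division would fail or recycle unexpectedly, so that assumption is worth checking first.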

Use dplyr/tidyr to turn rows into columns in R data frame

I have a data frame like this:
year <- floor(runif(100, min = 2015, max = 2017))
month <- floor(runif(100, min = 1, max = 13))
inch <- floor(runif(100, min = 0, max = 10))
mm <- floor(runif(100, min = 0, max = 100))
df <- data.frame(year, month, inch, mm)
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the name of the month and the rest of the columns is the value of mm.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things need to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert the month numbers to names, you can index month.abb. You can then summarize by year and month, and spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
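In current tidyr, spread() is superseded by pivot_wider(), so the same reshaping could be written as below. This is a sketch rather than a drop-in replacement: the data is regenerated with an arbitrary seed, so the numbers will differ from the output above, and wide is an illustrative name. Turning month into a factor with month.abb as its levels lets arrange() sort chronologically.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed so the toy data is reproducible
df <- data.frame(
  year  = sample(2015:2016, 100, replace = TRUE),
  month = sample(1:12, 100, replace = TRUE),
  mm    = sample(0:99, 100, replace = TRUE)
)

wide <- df %>%
  # Month names as a factor ordered Jan..Dec
  mutate(month = factor(month.abb[month], levels = month.abb)) %>%
  group_by(year, month) %>%
  # Mean is only an example; sum or another aggregate works the same way
  summarise(mm = mean(mm), .groups = "drop") %>%
  # One column per year, filled with the aggregated mm values
  pivot_wider(names_from = year, values_from = mm) %>%
  arrange(month)
```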

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, a boxplot may not be the best way to depict your data. You only have one value per year/month, whereas a boxplot needs several. A simple scatter plot may work better.
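The rep(NA, 6) padding above is needed because the formula interface of aggregate drops rows with NA, so month/year cells without data disappear and the trend line would otherwise shift left. One way to avoid hard-coding the offset is to merge the aggregated means back onto the full month/year grid. This is a sketch on hypothetical toy data (one made-up value per month); grid and a_full are illustrative names.

```r
# Hypothetical toy data: no observations for the first half of 2015
RI <- expand.grid(month = 1:12, year = 2015:2016)
RI$RI <- seq(0.1, 0.9, length.out = nrow(RI))
RI$RI[RI$year == 2015 & RI$month <= 6] <- NA

# Formula interface: NA rows are dropped, so empty cells are missing
a <- aggregate(RI ~ month + year, data = RI, FUN = mean)

# Merge onto the complete grid so the empty cells come back as NA
grid <- expand.grid(month = 1:12, year = 2015:2016)
a_full <- merge(grid, a, all.x = TRUE)

# Same order as the boxes: month varies fastest within each year
a_full <- a_full[order(a_full$year, a_full$month), ]
```

After the boxplot call, lines(a_full$RI, col = "red", lwd = 2) should then line up with the boxes wherever the gaps fall.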
