Computing information derived from 'aggregate' output in R

I have the following information:
head(Callao20)
Dia Mes Aho Temp
1 12 Feb 2020 NA
2 12 Feb 2020 NA
3 12 Feb 2020 NA
4 12 Feb 2020 NA
5 12 Feb 2020 NA
6 12 Feb 2020 NA
Although the first rows are NA, there are actual values further down. By the way, would you recommend deleting those NAs?
Anyway, I'd like to estimate the coefficient of variation (cv) for each month, so I computed the following monthly statistics:
aggregate(Callao20[, 4], list(Callao20$Mes), mean)
Group.1 x
1 Feb NA
2 Mar 17.84195
3 Abr 17.50487
4 May 16.77294
5 Jun 16.45750
6 Jul 15.53369
7 Ago 14.93071
8 Set 14.65176
9 Oct 14.60224
10 Nov 14.48786
11 Dic 14.47635
...and also:
aggregate(Callao20[, 4], list(Callao20$Mes), sd)
Group.1 x
1 Feb NA
2 Mar 0.6280132
3 Abr 0.7163050
4 May 0.3962204
5 Jun 0.4165841
6 Jul 0.3743657
7 Ago 0.4063140
8 Set 0.3538223
9 Oct 0.6060919
10 Nov 0.5034747
11 Dic 0.3035467
Knowing that cv = (sd/mean)*100, how would you recommend estimating it for each month from what I already have?

We could use dplyr from the tidyverse and pass na.rm = TRUE so that the NAs are ignored within each group
library(dplyr)
Callao20 %>%
group_by(Mes) %>%
summarise(out = sd(Temp, na.rm = TRUE)/mean(Temp, na.rm = TRUE) * 100)
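Or, if we want to use aggregate, we can use the formula interface with the \(x) lambda shorthand (which needs R 4.1.0 or later). Note that the formula interface drops rows with NA by default (na.action = na.omit), so a month that is entirely NA, such as Feb, will not appear in the result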
aggregate(Temp ~ Mes, Callao20,
\(x) sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE) * 100)

I would suggest doing this in a single aggregate call instead of running separate aggregate calls and then trying to combine them.
aggregate(Callao20[, 4], list(Callao20$Mes),
function(x) (sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE))*100)
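If you would rather reuse the two aggregate results you already have, as the question asks, one possible sketch is to join them on the month and divide. The data frame toy and its values are made up for illustration and only stand in for Callao20; the column names x.mean and x.sd come from the suffixes chosen here, not from the original post.

```r
# Toy data standing in for Callao20 (values are made up for illustration)
toy <- data.frame(
  Mes  = c("Feb", "Feb", "Mar", "Mar", "Mar", "Abr", "Abr", "Abr"),
  Temp = c(NA, NA, 17.2, 18.1, 18.3, 17.0, 17.9, 17.6)
)

# The two monthly summaries, computed as in the question but with NAs dropped
means <- aggregate(toy$Temp, list(Mes = toy$Mes), mean, na.rm = TRUE)
sds   <- aggregate(toy$Temp, list(Mes = toy$Mes), sd, na.rm = TRUE)

# Join on the month so the rows cannot get misaligned, then divide
both <- merge(means, sds, by = "Mes", suffixes = c(".mean", ".sd"))
both$cv <- both$x.sd / both$x.mean * 100
both
```

A month with no non-missing values (Feb here) still ends up with an NA cv, since there is nothing to compute it from.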

Related

Applying a custom function to a list of data frames, taking another list as input, in R

I have a list of dfs and a list of annual budgets.
Each df represents one business year, and each budget represents a total spend for that year.
# the business year starts from Feb and ends in Jan.
# the budget column is first populated with the % of annual budget allocation
df <- data.frame(monthly_budget=c(0.06, 0.13, 0.07, 0.06, 0.1, 0.06, 0.06, 0.09, 0.06, 0.06, 0.1, 0.15),
month=month.abb[c(2:12, 1)])
# dfs for 3 years
df2019_20 <- df
df2020_21 <- df
df2021_22 <- df
# budgets for 3 years
budget2019_20 <- 6000000
budget2020_21 <- 7000000
budget2021_22 <- 8000000
# into lists
df_list <- list(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
I've written the following function, which assigns the right year to Jan and fills in the year for the other months by deparsing the respective df's name.
It works perfectly if I supply a single df and a single budget.
budget_func <- function(df, budget){
df_name <- deparse(substitute(df))
df <- df %>%
mutate(year=ifelse(month=="Jan",
as.numeric(str_sub(df_name, -2)) + 2000,
as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
)
for (i in 1:12){
df[i,1] <- df[i,1] * budget
i <- i+1
}
return(df)
}
To speed things up I want to pass both lists as arguments to mapply. However I don't get the results I want - what am I doing wrong?
final_budgets <- mapply(budget_func, df_list, budget_list)
deparse(substitute(df)) works when a single dataset is passed directly, but inside mapply or Map the function receives a list element rather than a named object, so the name cannot be recovered that way. Instead, we can add a new argument to pass the names. The list also needs to carry the names when it is created: we can use list(df2019_20 = df2019_20, ...), or setNames, or, more simply, dplyr::lst, which names each element after the object passed
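library(dplyr)
library(stringr)
budget_func <- function(df, budget, nm1){
# nm1 is the data frame's name, e.g. "df2019_20", passed in explicitly
df_name <- nm1
df <- df %>%
mutate(year=ifelse(month=="Jan",
as.numeric(str_sub(df_name, -2)) + 2000,
as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
)
# turn the allocation shares in column 1 into amounts
for (i in 1:12){
df[i,1] <- df[i,1] * budget
}
return(df)
}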
-testing
df_list <- dplyr::lst(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
Map(budget_func, df_list, budget_list, names(df_list))
-output
$df2019_20
monthly_budget month year
1 360000 Feb 2019
2 780000 Mar 2019
3 420000 Apr 2019
4 360000 May 2019
5 600000 Jun 2019
6 360000 Jul 2019
7 360000 Aug 2019
8 540000 Sep 2019
9 360000 Oct 2019
10 360000 Nov 2019
11 600000 Dec 2019
12 900000 Jan 2020
$df2020_21
monthly_budget month year
1 420000 Feb 2020
2 910000 Mar 2020
3 490000 Apr 2020
4 420000 May 2020
5 700000 Jun 2020
6 420000 Jul 2020
7 420000 Aug 2020
8 630000 Sep 2020
9 420000 Oct 2020
10 420000 Nov 2020
11 700000 Dec 2020
12 1050000 Jan 2021
$df2021_22
monthly_budget month year
1 480000 Feb 2021
2 1040000 Mar 2021
3 560000 Apr 2021
4 480000 May 2021
5 800000 Jun 2021
6 480000 Jul 2021
7 480000 Aug 2021
8 720000 Sep 2021
9 480000 Oct 2021
10 480000 Nov 2021
11 800000 Dec 2021
12 1200000 Jan 2022
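A separate thing that may have contributed to the unexpected result from the original mapply call: by default mapply tries to simplify its result (SIMPLIFY = TRUE), which turns a set of data frames into a list-matrix, whereas Map always returns a plain list. The sketch below uses hypothetical toy inputs (shares, budgets, scale_fun), not the data from the question.

```r
# Hypothetical toy inputs, not the asker's data
shares  <- list(a = c(0.4, 0.6), b = c(0.5, 0.5))
budgets <- list(100, 200)

# Returns a two-column data frame for one "year"
scale_fun <- function(share, budget) {
  data.frame(share = share, amount = share * budget)
}

# Default SIMPLIFY = TRUE collapses the data frames into a matrix of list cells
simplified <- mapply(scale_fun, shares, budgets)

# SIMPLIFY = FALSE keeps one data frame per element, which is what Map does
as_list     <- mapply(scale_fun, shares, budgets, SIMPLIFY = FALSE)
same_as_map <- Map(scale_fun, shares, budgets)

is.matrix(simplified)
is.data.frame(as_list$a)
identical(as_list, same_as_map)
```

So if you prefer to stay with mapply, adding SIMPLIFY = FALSE should give the same list of data frames as Map.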

Two line graphs in the same plot in R

I have a large dataframe. I am trying to plot sales for 2 different years in the same plot as line graphs, to show the month-by-month variation across the 2 years. I did a long series of grouping and filtering before getting the dataframe below.
The dataframe has 3 columns (month, sales and the year).
When I try to plot the sales across the different years as:
ggplot(df,aes(x=month.sales,y=sales/100000,color=year)) +
geom_line()
I get a blank graph with only the x and y labels, whereas a column graph works.
Please help.
Thank you
I'm guessing your data looks something like this:
set.seed(69)
df <- data.frame(month.sales = factor(rep(month.abb, 2), month.abb),
year = rep(2018:2019, each = 12),
sales = runif(24, 1, 2) * 100000)
df
#> month.sales year sales
#> 1 Jan 2018 114570.1
#> 2 Feb 2018 123197.1
#> 3 Mar 2018 166092.7
#> 4 Apr 2018 163214.1
#> 5 May 2018 109486.6
#> 6 Jun 2018 131429.8
#> 7 Jul 2018 167363.6
#> 8 Aug 2018 191097.6
#> 9 Sep 2018 127427.4
#> 10 Oct 2018 145360.1
#> 11 Nov 2018 134577.1
#> 12 Dec 2018 169486.6
#> 13 Jan 2019 168493.2
#> 14 Feb 2019 147552.5
#> 15 Mar 2019 139811.3
#> 16 Apr 2019 156351.2
#> 17 May 2019 199368.3
#> 18 Jun 2019 130953.6
#> 19 Jul 2019 148150.5
#> 20 Aug 2019 166307.3
#> 21 Sep 2019 121830.8
#> 22 Oct 2019 101838.1
#> 23 Nov 2019 109716.9
#> 24 Dec 2019 125407.9
In which case you can draw a line plot like this:
library(ggplot2)
ggplot(df, aes(x = month.sales, y = sales / 100000,
color = factor(year), group = factor(year))) +
geom_line()
Note that you need to add the group aesthetic so that ggplot doesn't automatically group your data points according to the factor levels on the x axis.

How to apply 10-year average filter in R on a dataframe?

How can I run a 10-year average filter on the NBP on this dataframe?
This is the head of dataframe
> head(df3)
Year NBP
1 1850 35.454343
2 1851 4.5634543
3 1852 112.389182
4 1853 151.169251
5 1854 73.123145
6 1855 -72.309647
In reality I have years from 1850 to 2100. How can I apply a 10-year average filter to the NBP variable in this dataframe and plot it over time?
One option is the slide_dbl() function from the slider package, which computes rolling (sliding-window) summaries. Here is the code:
library(slider)
library(dplyr)
set.seed(123)
#Data
df <- data.frame(Year=1990:2020,NBP=rnorm(31,2,0.5))
# 10-year trailing rolling mean (current year plus the 9 before it)
df %>%
mutate(rollingNBP = slide_dbl(NBP, mean, .before = 9, .complete = T))
Output:
Year NBP rollingNBP
1 1990 1.8399718 NA
2 1991 1.3442388 NA
3 1992 1.7001958 NA
4 1993 1.9352947 NA
5 1994 2.4433681 NA
6 1995 1.9243020 NA
7 1996 2.1648956 NA
8 1997 0.3863386 NA
9 1998 1.6141041 NA
10 1999 2.1432743 1.749598
11 2000 1.3897440 1.704576
12 2001 2.2172752 1.791879
13 2002 2.4000884 1.861868
14 2003 1.9180345 1.860142
15 2004 2.6214594 1.877952
16 2005 1.5328075 1.838802
17 2006 2.1968543 1.841998
18 2007 2.2018157 2.023546
19 2008 1.5567816 2.017813
20 2009 1.3405312 1.937539
21 2010 2.0144220 2.000007
22 2011 1.7839351 1.956673
23 2012 2.8449363 2.001158
24 2013 2.6141964 2.070774
25 2014 2.1380117 2.022429
26 2015 1.4755122 2.016700
27 2016 1.7395653 1.970971
28 2017 2.8116013 2.031949
29 2018 1.4649659 2.022768
30 2019 2.8429436 2.173009
31 2020 1.8791551 2.159482
If you want to include a plot, you can use ggplot2:
library(ggplot2)
#Code2
df %>%
mutate(rollingNBP = slide_dbl(NBP, mean, .before = 9, .complete = T)) %>%
ggplot(aes(x=Year,y=rollingNBP))+
geom_line()
And if you want to see both series, try this:
library(tidyr)
#Code 3
df %>%
mutate(rollingNBP = slide_dbl(NBP, mean, .before = 9, .complete = F)) %>%
pivot_longer(-Year) %>%
ggplot(aes(x=Year,y=value,group=name,color=name))+
geom_line()
Another option is rollmeanr() (the right-aligned version of rollmean()) from zoo
library(dplyr)
library(zoo)
df %>%
mutate(rollingNBP = rollmeanr(NBP, k = 10, fill = NA))
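If you want to avoid extra packages, a base R sketch of the same trailing 10-year mean is possible with stats::filter, using ten equal weights of 1/10. The data frame toy and its values are made up here; with the real data you would use df3$NBP instead. Calling it as stats::filter avoids the clash with dplyr::filter when dplyr is loaded.

```r
# Toy series with made-up values; the real data would be df3$NBP
set.seed(1)
toy <- data.frame(Year = 1850:1879, NBP = rnorm(30, mean = 50, sd = 40))

# Equal weights of 1/10 over the current year and the nine before it;
# sides = 1 makes the window trailing rather than centred
smoothed <- stats::filter(toy$NBP, rep(1 / 10, 10), sides = 1)

# stats::filter returns a ts object, so convert before storing it
toy$rollingNBP <- as.numeric(smoothed)
head(toy, 12)
```

As with the other approaches, the first nine years are NA because a full 10-year window is not yet available.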

How to insert a value in a table

I have aggregated a table from my data file using this syntax:
sumtab <- as.data.frame(table(S$MONTH))
colnames(sumtab) <- c("Month", "Frq")
rownames(sumtab) <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug",
"Sep","Oct","Dec")
Resulting in this table sumtab:
Month Frq
Jan 1 3
Feb 2 5
Mar 3 16
Apr 4 45
May 5 11
Jun 6 16
Jul 7 99
Aug 8 101
Sep 9 45
Oct 10 456
Dec 12 112
And this script produces a ggplot:
ggplot(sumtab, aes(x=Month,y=Frq),width=1.5) +
scale_y_continuous(limit=c(0,17),expand=c(0, 0)) +
geom_bar(stat='identity',fill="lightgreen",colour="black") +
xlab("Month") + ylab("No of bears killed") +
theme_bw(base_size = 11) +
theme(axis.text.x=element_text(angle=0,size=9))
The problem is that there are no values for November in my data, and I need to somehow enter a zero for November in the table. This is probably simple for most of you. I have searched other questions, googled and read the books, but I have been unable to find the correct syntax. I need a little help.
Adding rbind into the script:
sumtab <- as.data.frame(table(S$MONTH))
sumtab <- rbind(sumtab, c(11, 0))
produced this warning message:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 11) :
invalid factor level, NA generated
and this table:
Var1 Freq
1 1 3
2 2 5
3 3 6
4 4 14
5 5 7
6 6 2
7 7 13
8 8 12
9 9 3
10 10 1
11 12 4
12 <NA> 0
So thanks @PaulH for your help, but I've probably applied your suggestion incorrectly.
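You could use rbind to add the November row. Note that the month column produced by as.data.frame(table(...)) is a factor, so convert it first (for example with as.character); otherwise 11 is not an existing level and becomes NA: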
sumtab <- rbind(sumtab, Nov = c(11, 0))
Good luck!
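An alternative that avoids adding the row afterwards is to declare all 12 months as factor levels before tabulating, so that table() reports a zero for any month that never occurs. This is only a sketch: the vector month holds made-up month numbers standing in for S$MONTH, and the column names Month and Frq follow the question.

```r
# Made-up month numbers standing in for S$MONTH; November (11) never occurs
month <- c(1, 1, 2, 3, 3, 3, 10, 12, 12)

# Declaring all 12 levels up front makes table() count empty months as 0
month_f <- factor(month, levels = 1:12, labels = month.abb)

# Name the columns directly instead of renaming them afterwards
sumtab <- as.data.frame(table(Month = month_f), responseName = "Frq")
sumtab
```

Because Month is a factor with its levels in calendar order, ggplot should also keep the bars in that order, with an empty slot for November.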

forecast throws error K must be not be greater than period/2

I issue the following commands:
ops <- read.csv("ops.csv")
ops.ts <- ts(ops, frequency=12, start=c(2014,1))
ops.fc <- forecast(ops.ts)
forecast() then throws the following error:
Error in ...fourier(x, K, 1:length(x)) :
K must be not be greater than period/2
The data from the csv looks like this according to summary(ops):
1 10
2 3
3 7
4 4
5 2
6 20
7 13
8 9
9 8
10 7
11 6
12 11
13 7
R is up to date, Forecast is installed via CRAN.
I appreciate any advice, especially because I am quite new to R.
The error message is self-explanatory.
You have 13 elements in your dataset so when you do:
ops.ts <- ts(ops, frequency = 12, start=c(2014, 1))
You get (notice the 2015 value here):
#> ops.ts
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#2014 10 3 7 4 2 20 13 9 8 7 6 11
#2015 7
I'm guessing you only want to use the first 12 months and then call forecast()? If that is the case, you can do either:
ops.ts <- ts(ops, frequency = 12, start = 2014, end = c(2015, 0))
ops.fc <- forecast(ops.ts)
or
ops <- ops[1:12, ]
ops.ts <- ts(ops, frequency = 12, start = 2014)
ops.fc <- forecast(ops.ts)
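As a further sketch using only base R, you can check how many seasonal periods the series covers and trim an existing ts object with window(). The values below are copied from the listing in the question; the name ops.2014 is just an illustrative choice.

```r
# Values taken from the listing in the question
ops <- c(10, 3, 7, 4, 2, 20, 13, 9, 8, 7, 6, 11, 7)
ops.ts <- ts(ops, frequency = 12, start = c(2014, 1))

# Number of seasonal periods covered: 13 observations / 12 per year
length(ops.ts) / frequency(ops.ts)

# Keep only January to December 2014, dropping the lone 2015 value
ops.2014 <- window(ops.ts, end = c(2014, 12))
length(ops.2014)
```

The trimmed series can then be passed to forecast() in the same way as in the two options above.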
