I have a dataframe df as follows (my real data has many more columns):
df <- read.table(text = "date hfgf lmo
2019-01-01 0.7 1.4
2019-02-01 0.11 2.3
2019-03-01 1.22 6.7
2020-04-01 0.44 5.2
2020-05-01 0.19 2.3
2021-06-01 3.97 9.5
",
header = TRUE, stringsAsFactors = FALSE)
I would like to replace the monthly values in each value column (hfgf, lmo, etc.) with the yearly mean.
Note that I could melt the data and use the summarize function, but I need to keep the columns as they are.
If we want to update the columns with the yearly mean, group by the year extracted from 'date' and use mutate with across to overwrite each column with its mean.
If the goal is instead to return a single mean row per 'year', use summarise:
library(lubridate)
library(dplyr)
df %>%
  group_by(year = year(date)) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
-output
# A tibble: 3 × 3
year hfgf lmo
<dbl> <dbl> <dbl>
1 2019 0.677 3.47
2 2020 0.315 3.75
3 2021 3.97 9.5
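The mutate variant mentioned in this answer is not shown, so here is a minimal sketch of it, assuming the df from the question with date stored as text. The year is taken with format(as.Date(...)) instead of lubridate, and the function is passed as a lambda because handing na.rm to across() through its dots is deprecated as of dplyr 1.1.0. The helper column name year and the result name res are only illustrative.

```r
library(dplyr)

df <- read.table(text = "date hfgf lmo
2019-01-01 0.7 1.4
2019-02-01 0.11 2.3
2019-03-01 1.22 6.7
2020-04-01 0.44 5.2
2020-05-01 0.19 2.3
2021-06-01 3.97 9.5", header = TRUE, stringsAsFactors = FALSE)

res <- df %>%
  # group on the year part of the date (helper column, dropped again below)
  group_by(year = format(as.Date(date), "%Y")) %>%
  # overwrite every numeric column with its within-year mean
  mutate(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  ungroup() %>%
  select(-year)

res
```

The result keeps all six rows and the original three columns; only the values change.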
Here is a base R solution with aggregate.
res <- aggregate(cbind(hfgf, lmo) ~ format(as.Date(df$date), "%Y"), df, mean)
names(res)[1] <- names(df)[1]
res
# date hfgf lmo
#1 2019 0.6766667 3.466667
#2 2020 0.3150000 3.750000
#3 2021 3.9700000 9.500000
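aggregate collapses the data to one row per year. If the rows have to stay as they are, as the question asks, base R's ave can fill each column with its group mean instead. This is only a sketch under the assumption that every column except date is numeric; the names yr, value_cols and df_mean are made up for the example.

```r
df <- read.table(text = "date hfgf lmo
2019-01-01 0.7 1.4
2019-02-01 0.11 2.3
2019-03-01 1.22 6.7
2020-04-01 0.44 5.2
2020-05-01 0.19 2.3
2021-06-01 3.97 9.5", header = TRUE, stringsAsFactors = FALSE)

# year of each row, used as the grouping vector
yr <- format(as.Date(df$date), "%Y")

# every column except the date is treated as a value column (assumption)
value_cols <- setdiff(names(df), "date")

df_mean <- df
df_mean[value_cols] <- lapply(df[value_cols], function(x) {
  # ave() returns a vector as long as x, holding the group mean in each position
  ave(x, yr, FUN = function(v) mean(v, na.rm = TRUE))
})

df_mean
```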
I am not sure, but maybe you mean this kind of solution. In essence it is the same as akrun's solution. Here it is with mutate; alternatively you could use the commented-out summarise:
library(lubridate)
library(dplyr)
df %>%
  group_by(year = year(date)) %>%
  mutate(across(c(hfgf, lmo), mean, na.rm=TRUE, .names = "mean_{unique(year)}_{.col}"))
  # summarise(across(c(hfgf, lmo), mean, na.rm=TRUE, .names = "mean_{unique(year)}_{.col}"))
date hfgf lmo year mean_2019_hfgf mean_2019_lmo mean_2020_hfgf mean_2020_lmo mean_2021_hfgf mean_2021_lmo
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-~ 0.7 1.4 2019 0.677 3.47 NA NA NA NA
2 2019-~ 0.11 2.3 2019 0.677 3.47 NA NA NA NA
3 2019-~ 1.22 6.7 2019 0.677 3.47 NA NA NA NA
4 2020-~ 0.44 5.2 2020 NA NA 0.315 3.75 NA NA
5 2020-~ 0.19 2.3 2020 NA NA 0.315 3.75 NA NA
6 2021-~ 3.97 9.5 2021 NA NA NA NA 3.97 9.5
Related
When I use tq_transmute, I am able to rename all the columns. When I use tq_mutate, only the first two columns get renamed; the last two keep their default names. Is this a limitation, or am I missing something needed to make tq_mutate work?
In the first output below, you will see that "TR_high" and "TR_Low" do not appear as the names of the last two columns.
#https://www.rpubs.com/stephenodea54/776350
# Loads tidyquant, lubridate, xts, quantmod, TTR, and PerformanceAnalytics
library(tidyverse)
library(tidyquant)
library(ggthemes)
startdt <- "2021-02-01"
AMC <- tq_get(
"AMC",
get = "stock.prices",
from = startdt
)
test1<-AMC %>%
tq_mutate(select = c("high", "low", "close"), n=14, mutate_fun = ATR,col_rename = c("TrueRange","ATR","TR_high","TR_Low")) # %>%
head(test1)
# A tibble: 6 x 12
symbol date open high low close volume adjusted TrueRange ATR ATR..1 ATR..2
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AMC 2021-02-01 17 17.2 12.9 13.3 434608000 13.3 NA NA NA NA
2 AMC 2021-02-02 9.48 10.1 6 7.82 462775900 7.82 7.3 NA 13.3 6
3 AMC 2021-02-03 8.85 9.77 7.89 8.97 221405100 8.97 1.95 NA 9.77 7.82
Now, I am just using a different function, and all the 4 columns are nicely renamed.
test2<-AMC %>%
tq_transmute(select = c("high", "low", "close"), n=14, mutate_fun = ATR,col_rename = c("TrueRange","ATR","TR_high","TR_Low")) # %>%
head(test2)
# A tibble: 6 x 5
date TrueRange ATR TR_high TR_Low
<date> <dbl> <dbl> <dbl> <dbl>
1 2021-02-01 NA NA NA NA
2 2021-02-02 7.3 NA 13.3 6
3 2021-02-03 1.95 NA 9.77 7.82
I am struggling with the tidyverse package. I'm using the mpg dataset from R to illustrate the issue I'm facing (ignore whether the relationships are meaningful; it is just for the sake of explaining my problem).
What I'm trying to do is obtain the average "displ" grouped by manufacturer and year AND at the same time (this is what I can't figure out) have a separate column for each fuel type (i.e. a column for the mean of diesel, a column for the mean of petrol, etc.).
This is the first part of the code. I'm new to R, so I really don't know what I need to add...
mpg %>%
group_by(manufacturer, year) %>%
summarize(Mean. = mean(c(displ)))
# A tibble: 30 × 3
# Groups: manufacturer [15]
manufacturer year Mean.
<chr> <int> <dbl>
1 audi 1999 2.36
2 audi 2008 2.73
3 chevrolet 1999 4.97
4 chevrolet 2008 5.12
5 dodge 1999 4.32
6 dodge 2008 4.42
7 ford 1999 4.45
8 ford 2008 4.66
9 honda 1999 1.6
10 honda 2008 1.85
# … with 20 more rows
Any help is appreciated, thank you.
Perhaps, we need to reshape into 'wide'
library(dplyr)
library(tidyr)
library(ggplot2) # provides the mpg dataset
mpg %>%
  select(manufacturer, year, fl, displ) %>%
  pivot_wider(names_from = fl, values_from = displ, values_fn = mean)
-output
# A tibble: 30 x 7
manufacturer year p r e d c
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 2.36 NA NA NA NA
2 audi 2008 2.73 NA NA NA NA
3 chevrolet 2008 6.47 4.49 5.3 NA NA
4 chevrolet 1999 5.7 4.22 NA 6.5 NA
5 dodge 1999 NA 4.32 NA NA NA
6 dodge 2008 NA 4.42 4.42 NA NA
7 ford 1999 NA 4.45 NA NA NA
8 ford 2008 5.4 4.58 NA NA NA
9 honda 1999 1.6 1.6 NA NA NA
10 honda 2008 2 1.8 NA NA 1.8
# … with 20 more rows
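An equivalent route, which may read closer to the code in the question, is to compute the grouped means first and only then spread the fuel types into columns. The sketch below uses a small made-up table (cars) in place of mpg so that the result can be checked by hand; the mean_ prefix is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# hypothetical toy data with the same columns as the mpg subset
cars <- tibble(
  manufacturer = c("audi", "audi", "audi", "ford", "ford"),
  year         = c(1999L, 1999L, 1999L, 2008L, 2008L),
  fl           = c("p", "p", "d", "r", "r"),
  displ        = c(1.8, 2.0, 1.9, 4.0, 5.0)
)

wide <- cars %>%
  # one mean per manufacturer, year and fuel type
  group_by(manufacturer, year, fl) %>%
  summarise(displ = mean(displ), .groups = "drop") %>%
  # one column per fuel type; missing combinations become NA
  pivot_wider(names_from = fl, values_from = displ, names_prefix = "mean_")

wide
```

With real data the only change should be replacing cars with mpg.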
The following code was executed:
tb <- tibble(
year <- rep(2001:2020,10)
)
tb %<>% arrange(year) %>%
mutate(
id <- rep(1:10,20),
r1 <- rnorm(200,0,1),
r2 <- rnorm(200,1,1),
r3 <- rnorm(200,2,1)
)
Then the error message popped up:
Error: arrange() failed at implicit mutate() step.
x Could not create a temporary column for ..1.
ℹ ..1 is year.
Can anyone shed light on what the reason is?
Try this. It looks like a variable assignment issue: inside tibble() and mutate(), columns are named with =, not <-. Try replacing <- with = and %<>% with %>%. Here is a possible solution:
#Data
tb <- tibble(
year = rep(2001:2020,10)
)
#Code
tb %>% arrange(year) %>%
mutate(
id = rep(1:10,20),
r1 = rnorm(200,0,1),
r2 = rnorm(200,1,1),
r3 = rnorm(200,2,1)
)
Output:
# A tibble: 200 x 5
year id r1 r2 r3
<int> <int> <dbl> <dbl> <dbl>
1 2001 1 1.10 1.62 2.92
2 2001 2 0.144 1.18 1.08
3 2001 3 -0.118 2.32 3.15
4 2001 4 -0.912 0.701 1.36
5 2001 5 -1.44 -0.648 1.11
6 2001 6 -0.797 1.95 -0.333
7 2001 7 1.25 -0.113 1.85
8 2001 8 0.772 1.62 2.32
9 2001 9 -0.220 1.51 1.29
10 2001 10 -0.425 1.37 3.24
# ... with 190 more rows
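To see why the arrow matters, the effect can be reproduced with a plain data.frame, used here as a stand-in for tibble. The exact generated column name differs between the two, but the point is the same: with <- no column called year is created. The object names good and bad are only for illustration.

```r
# `=` inside the call is argument naming, so the column is called "year"
good <- data.frame(year = rep(2001:2003, 2))
names(good)

# `<-` is an ordinary assignment expression passed as an unnamed argument,
# so the column name is derived from the expression text instead
bad <- data.frame(year <- rep(2001:2003, 2))
names(bad)

# the lookup that arrange(year) relies on therefore fails for `bad`
"year" %in% names(good)
"year" %in% names(bad)
```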
What I want is to create var3 using a lag (dplyr package), but it should be consistent with the year and the ID. I mean, the lagged value should belong to the corresponding ID. The dataset is an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
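Do you mean group_by(ID) and effectively "order by YEAR"? Grouping by both YEAR and ID leaves a single row in each group, so lag() has no previous row to return and gives NA.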
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I have turned your summarise into a mutate for now.)
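If the rows are not already sorted by year within each ID, the order_by argument of dplyr::lag can make the "order by YEAR" part explicit. The following is a sketch on a small made-up unbalanced panel (ID 2 has no 2011 row, and the rows are deliberately shuffled); panel and res are hypothetical names.

```r
library(dplyr)

# hypothetical unbalanced, unsorted panel
panel <- data.frame(
  YEAR = c(2012, 2010, 2011, 2012, 2010),
  ID   = c(1, 1, 1, 2, 2),
  var1 = c(10, 11, 12, 13, 14),
  var2 = c(1, 2, 3, 4, 5)
)

res <- panel %>%
  group_by(ID) %>%
  # lag var2 within each ID, following YEAR rather than row position
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup() %>%
  arrange(ID, YEAR)

res
```

Note that lag returns the previous observation of the same ID, not necessarily the previous calendar year: for ID 2 the 2012 row is paired with 2010. If gaps should yield NA instead, the missing years would have to be filled in first, for example with tidyr::complete.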
I'm extracting "wide" data which I intend to tidy with tidyr::pivot_longer().
library(tidyverse)
df1 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df2 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df3 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df4 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
lst <- list(df1, df2, df3, df4)
colname <-
c("ticker", "2017", "2018", "2019")
header <- list("Leverage", "Gearing", "Capex.to.sales", "FCFex")
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = header)
Using values_to = header gives me this error:
Error in `[[<-.data.frame`(`*tmp*`, ".value", value = list("Leverage", :
replacement has 4 rows, data has 3
Instead, I had to use the default values_to = "value", and subsequently use this code to rename my columns:
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = "value")
lst <- map(seq_along(lst), function(i){
x <- lst[[i]]
colnames(x)[3] <- header[[i]]
x
})
My output is shown below (columns renamed), but I was wondering whether there is a way to feed a vector into values_to instead of using map (as that would make for better piping). Or is there a more efficient way of going about this?
> lst
[[1]]
# A tibble: 30 x 3
ticker Period Leverage
<fct> <chr> <dbl>
1 a 2017 6.01
2 a 2018 4.82
3 a 2019 1.58
4 able 2017 8.64
5 able 2018 6.70
6 able 2019 0.831
7 about 2017 -0.187
8 about 2018 0.549
9 about 2019 0.829
10 absolute 2017 1.26
# ... with 20 more rows
[[2]]
# A tibble: 30 x 3
ticker Period Gearing
<fct> <chr> <dbl>
1 a 2017 2.37
2 a 2018 3.58
3 a 2019 5.63
4 able 2017 0.311
5 able 2018 0.708
6 able 2019 -0.0651
7 about 2017 2.89
8 about 2018 6.25
9 about 2019 10.1
10 absolute 2017 6.48
# ... with 20 more rows
[[3]]
# A tibble: 30 x 3
ticker Period Capex.to.sales
<fct> <chr> <dbl>
1 a 2017 5.22
2 a 2018 1.88
3 a 2019 0.746
4 able 2017 -3.90
5 able 2018 3.06
6 able 2019 1.91
7 about 2017 1.35
8 about 2018 4.12
9 about 2019 11.1
10 absolute 2017 1.76
# ... with 20 more rows
[[4]]
# A tibble: 30 x 3
ticker Period FCFex
<fct> <chr> <dbl>
1 a 2017 1.76
2 a 2018 2.85
3 a 2019 1.86
4 able 2017 -3.38
5 able 2018 -3.02
6 able 2019 -1.52
7 about 2017 6.46
8 about 2018 5.39
9 about 2019 0.810
10 absolute 2017 8.08
# ... with 20 more rows
For the second part of my question, I intend to use bind_cols() to combine all four dataframes into one, but the two common columns are being duplicated (as seen below).
How do I tell R to bind only the rightmost, renamed column, i.e. to exclude the first two columns of the last three dataframes? Thank you.
Metrics <- bind_cols(lst)
> Metrics
# A tibble: 30 x 12
ticker Period Leverage ticker1 Period1 Gearing ticker2 Period2
<fct> <chr> <dbl> <fct> <chr> <dbl> <fct> <chr>
1 a 2017 6.01 a 2017 2.37 a 2017
2 a 2018 4.82 a 2018 3.58 a 2018
3 a 2019 1.58 a 2019 5.63 a 2019
4 able 2017 8.64 able 2017 0.311 able 2017
5 able 2018 6.70 able 2018 0.708 able 2018
6 able 2019 0.831 able 2019 -0.0651 able 2019
7 about 2017 -0.187 about 2017 2.89 about 2017
8 about 2018 0.549 about 2018 6.25 about 2018
9 about 2019 0.829 about 2019 10.1 about 2019
10 absol~ 2017 1.26 absolu~ 2017 6.48 absolu~ 2017
# ... with 20 more rows, and 4 more variables: Capex.to.sales <dbl>,
# ticker3 <fct>, Period3 <chr>, FCFex <dbl>
You could do it with purrr:
library(purrr)
lst <- map(lst, setNames, colname)
map2_dfc(lst, header, ~ pivot_longer(
.x, -ticker, names_to = "Period", values_to = .y)) %>%
select(c(1:3, 6, 9, 12))
Output:
ticker Period Leverage Gearing Capex.to.sales FCFex
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 a 2017 6.20 3.43 7.87 7.52
2 a 2018 1.63 3.30 0.126 1.52
3 a 2019 2.32 1.49 -0.286 6.95
4 able 2017 6.38 3.42 7.34 2.60
5 able 2018 0.763 1.68 -0.648 -2.85
6 able 2019 5.56 2.35 -0.572 3.21
7 about 2017 -0.762 1.49 3.12 2.43
8 about 2018 9.07 -1.22 0.821 4.00
9 about 2019 1.37 8.27 -0.700 -1.05
10 absolute 2017 1.39 2.49 0.390 2.40
# … with 20 more rows
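Selecting columns by position works as long as every table has exactly the same rows in the same order. A variation that does not rely on positions is to join the long tables on their shared keys. The sketch below assumes small made-up inputs (make_wide is a hypothetical helper, and only two metrics are used) and swaps in map2 plus reduce, partly because the *_dfc variants are marked as superseded in purrr 1.0.0.

```r
library(dplyr)
library(tidyr)
library(purrr)

# hypothetical helper building a tiny wide table like the ones in the question
make_wide <- function(offset) {
  data.frame(
    ticker = c("a", "able"),
    `2017` = offset + 1:2,
    `2018` = offset + 3:4,
    check.names = FALSE
  )
}

lst    <- list(make_wide(0), make_wide(10))
header <- c("Leverage", "Gearing")

metrics <- map2(lst, header, function(x, nm) {
  # each table gets its own value column name
  pivot_longer(x, -ticker, names_to = "Period", values_to = nm)
}) %>%
  # join on the shared keys so ticker and Period appear only once
  reduce(function(x, y) full_join(x, y, by = c("ticker", "Period")))

metrics
```

Because the join matches on ticker and Period, it also copes with tables whose rows are ordered differently.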