The following code was executed:
tb <- tibble(
  year <- rep(2001:2020, 10)
)

tb %<>% arrange(year) %>%
  mutate(
    id <- rep(1:10, 20),
    r1 <- rnorm(200, 0, 1),
    r2 <- rnorm(200, 1, 1),
    r3 <- rnorm(200, 2, 1)
  )
Then the error message popped up:
Error: arrange() failed at implicit mutate() step.
x Could not create a temporary column for ..1.
ℹ ..1 is year.
Can anyone shed light on what the reason is?
It looks like a variable assignment issue: inside tibble() and mutate(), columns must be defined with =, not <-. Try replacing <- with = (and %<>% with %>% if you don't want to overwrite tb). Here is a possible solution:
# Data
tb <- tibble(
  year = rep(2001:2020, 10)
)

# Code
tb %>% arrange(year) %>%
  mutate(
    id = rep(1:10, 20),
    r1 = rnorm(200, 0, 1),
    r2 = rnorm(200, 1, 1),
    r3 = rnorm(200, 2, 1)
  )
Output:
# A tibble: 200 x 5
year id r1 r2 r3
<int> <int> <dbl> <dbl> <dbl>
1 2001 1 1.10 1.62 2.92
2 2001 2 0.144 1.18 1.08
3 2001 3 -0.118 2.32 3.15
4 2001 4 -0.912 0.701 1.36
5 2001 5 -1.44 -0.648 1.11
6 2001 6 -0.797 1.95 -0.333
7 2001 7 1.25 -0.113 1.85
8 2001 8 0.772 1.62 2.32
9 2001 9 -0.220 1.51 1.29
10 2001 10 -0.425 1.37 3.24
# ... with 190 more rows
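Note that the assignment pipe %<>% is not itself the problem; the error comes from the <- used inside tibble() and mutate(). If you want to keep writing the result back into tb, this variant also works (a sketch, assuming magrittr is attached for %<>%):
library(magrittr)
library(dplyr)

# %<>% pipes tb through the chain and assigns the result back to tb
tb %<>%
  arrange(year) %>%
  mutate(
    id = rep(1:10, 20),
    r1 = rnorm(200, 0, 1),
    r2 = rnorm(200, 1, 1),
    r3 = rnorm(200, 2, 1)
  )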
I'm trying to complete a data.frame with scaled scores.
First I have a set of scores that relate to a grade, and a universal score that has been calculated.
library(dplyr)
library(tidyr)

df <- tibble(grade = c("X", "E", "D", "C", "B", "A", "Max"),
             score = c(0, 17, 25, 33, 41, 48, 60),
             universal = c(0, 22, 44, 65, 87, 108, 108))
I expand the frame to include all integer values of score
df %>% complete(score = full_seq(score, period = 1)) %>%
fill(grade, .direction = "down")
I now want to complete the universal score that relates to each integer score based on the relative steps between the previously defined universal scores for each grade.
This is based on a conversion/scaling factor:
(universal boundary for grade above - universal boundary below)/(score boundary grade above - score boundary grade below)
For the grade X this would be (22-0)/(17-0) ≈ 1.29. Each integer step in score adds this factor to the previous universal score to give the next universal value.
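As a quick check, the factor for every grade band can be computed directly from the example data; a minimal sketch:
# change in universal per unit of score, one value per grade band
diff(df$universal) / diff(df$score)
#> [1] 1.294118 2.750000 2.625000 2.750000 3.000000 0.000000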
So the first part of the result should look like this:
score grade universal
    0     X      0
    1     X      1.29
    2     X      2.59
    3     X      3.88
    4     X      5.18
    5     X      6.47
    6     X      7.76
    7     X      9.06
    8     X     10.35
    9     X     11.65
   10     X     12.94
   11     X     14.24
   12     X     15.53
   13     X     16.82
   14     X     18.12
   15     X     19.41
   16     X     20.71
   17     E     22.00
I'm trying to achieve this with Tidy principles and various combinations of group_by(), complete(), seq(), etc., but haven't been able to achieve it in a neat way. I think my problem is that my max value is outside the grouping variable.
Any help will be much appreciated.
Base R has the approx function to do this linear interpolation. You can use it in a tidyverse context like this:
df %>%
complete(score = full_seq(score, period = 1)) %>%
fill(grade, .direction = "down") %>%
mutate(universal = approx(x=score,y=universal,xout=score)$y)
# A tibble: 61 × 3
score grade universal
<dbl> <chr> <dbl>
1 0 X 0
2 1 X 1.29
3 2 X 2.59
4 3 X 3.88
5 4 X 5.18
6 5 X 6.47
7 6 X 7.76
8 7 X 9.06
9 8 X 10.4
10 9 X 11.6
An alternative without approx() is to compute the per-band increment with diff() up front, carry it forward with fill(), and add it cumulatively within each grade:
df %>%
  mutate(
    inc = c(diff(universal) / diff(score), NA)
  ) %>%
  complete(score = full_seq(score, period = 1)) %>%
  fill(grade, inc, .direction = "down") %>%
  group_by(grade) %>%
  mutate(universal = first(universal) + (row_number() - 1) * inc) %>%
  ungroup() %>%
  print(n = 30)
# # A tibble: 61 × 4
# score grade universal inc
# <dbl> <chr> <dbl> <dbl>
# 1 0 X 0 1.29
# 2 1 X 1.29 1.29
# 3 2 X 2.59 1.29
# 4 3 X 3.88 1.29
# 5 4 X 5.18 1.29
# 6 5 X 6.47 1.29
# 7 6 X 7.76 1.29
# 8 7 X 9.06 1.29
# 9 8 X 10.4 1.29
# 10 9 X 11.6 1.29
# 11 10 X 12.9 1.29
# 12 11 X 14.2 1.29
# 13 12 X 15.5 1.29
# 14 13 X 16.8 1.29
# 15 14 X 18.1 1.29
# 16 15 X 19.4 1.29
# 17 16 X 20.7 1.29
# 18 17 E 22 2.75
# 19 18 E 24.8 2.75
# 20 19 E 27.5 2.75
# 21 20 E 30.2 2.75
# 22 21 E 33 2.75
# 23 22 E 35.8 2.75
# 24 23 E 38.5 2.75
# 25 24 E 41.2 2.75
# 26 25 D 44 2.62
# 27 26 D 46.6 2.62
# 28 27 D 49.2 2.62
# 29 28 D 51.9 2.62
# 30 29 D 54.5 2.62
# # … with 31 more rows
# # ℹ Use `print(n = ...)` to see more rows
I have a dataframe df as follows (in my real data there are many columns):
df <- read.table(text = "date hfgf lmo
2019-01-01 0.7 1.4
2019-02-01 0.11 2.3
2019-03-01 1.22 6.7
2020-04-01 0.44 5.2
2020-05-01 0.19 2.3
2021-06-01 3.97 9.5",
header = TRUE, stringsAsFactors = FALSE)
I would like to replace the monthly values in the columns (hfgf, lmo, etc.) with the yearly mean.
Note that I could melt and use the summarize function, but I need to keep the columns as they are.
If we want to update the columns with the yearly mean, group by the year extracted from 'date' and use mutate to update the columns with the mean of those columns by looping with across().
If it is to return a single mean row per 'year', use summarise:
library(lubridate)
library(dplyr)
df %>%
group_by(year = year(date)) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
-output
# A tibble: 3 × 3
year hfgf lmo
<dbl> <dbl> <dbl>
1 2019 0.677 3.47
2 2020 0.315 3.75
3 2021 3.97 9.5
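For the first case, where the monthly rows are kept and only the values are replaced, the same grouping works with mutate; a minimal sketch (across() skips the grouping column year):
df %>%
  group_by(year = year(date)) %>%
  mutate(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  ungroup()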
Here is a base R solution with aggregate (wrapping date in as.Date so the year can be extracted with format):
res <- aggregate(cbind(hfgf, lmo) ~ format(as.Date(df$date), "%Y"), df, mean)
names(res)[1] <- names(df)[1]
res
# date hfgf lmo
#1 2019 0.6766667 3.466667
#2 2020 0.3150000 3.750000
#3 2021 3.9700000 9.500000
I am not sure, but maybe you mean this kind of solution. In essence it is the same as akrun's solution above: here with mutate; alternatively, you could use the commented-out summarise:
library(lubridate)
library(dplyr)
df %>%
group_by(year = year(date)) %>%
mutate(across(c(hfgf, lmo), mean, na.rm=TRUE, .names = "mean_{unique(year)}_{.col}"))
# summarise(across(c(hfgf, lmo), mean, na.rm=TRUE, .names = "mean_{unique(year)}_{.col}"))
date hfgf lmo year mean_2019_hfgf mean_2019_lmo mean_2020_hfgf mean_2020_lmo mean_2021_hfgf mean_2021_lmo
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-~ 0.7 1.4 2019 0.677 3.47 NA NA NA NA
2 2019-~ 0.11 2.3 2019 0.677 3.47 NA NA NA NA
3 2019-~ 1.22 6.7 2019 0.677 3.47 NA NA NA NA
4 2020-~ 0.44 5.2 2020 NA NA 0.315 3.75 NA NA
5 2020-~ 0.19 2.3 2020 NA NA 0.315 3.75 NA NA
6 2021-~ 3.97 9.5 2021 NA NA NA NA 3.97 9.5
I'm trying to calculate the year-wise standard error for the variable acrePrice. I'm running the function stderr (I also tried sd(acrePrice)/count(n)). Both of these return an error.
Here's the relevant code:
library(alr4)
library(tidyverse)
MinnLand %>% group_by(year) %>% summarize(sd(acrePrice)/count(n))
MinnLand %>% group_by(year) %>% summarize(stderr(acrePrice))
Why is there a problem? The mean and SDs are easily calculated.
The issue with the first expression is count(), which requires a data.frame; to get the number of rows per group it should be n() instead:
library(dplyr)
MinnLand %>%
group_by(year) %>%
summarize(SE = sd(acrePrice)/n(), .groups = 'drop')
-output
# A tibble: 10 x 2
# year SE
# <dbl> <dbl>
# 1 2002 2.25
# 2 2003 0.840
# 3 2004 0.742
# 4 2005 0.862
# 5 2006 0.849
# 6 2007 0.765
# 7 2008 0.708
# 8 2009 1.23
# 9 2010 0.986
#10 2011 1.95
According to ?stderr
stdin(), stdout() and stderr() are standard connections corresponding to input, output and error on the console respectively (and not necessarily to file streams).
We can use std.error from plotrix, which computes the conventional standard error sd(x)/sqrt(n); this is why the values below differ from the sd/n output above.
library(plotrix)
MinnLand %>%
group_by(year) %>%
summarize(SE = std.error(acrePrice))
-output
# A tibble: 10 x 2
# year SE
# <dbl> <dbl>
# 1 2002 53.4
# 2 2003 38.6
# 3 2004 37.0
# 4 2005 41.5
# 5 2006 39.7
# 6 2007 36.3
# 7 2008 34.9
# 8 2009 47.1
# 9 2010 42.1
#10 2011 63.6
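If you prefer not to add a package, the same standard error (the conventional sd/sqrt(n)) can be computed directly in summarize; a minimal sketch:
library(dplyr)
MinnLand %>%
  group_by(year) %>%
  summarize(SE = sd(acrePrice) / sqrt(n()), .groups = "drop")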
What I want is to create var3 using a lag (dplyr package), but it should be consistent with the year and the ID; that is, the lag should belong to the corresponding ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            var3 = var1 - dplyr::lag(var2))
Attempt #2:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            gr = sprintf(YEAR, ID),
            var3 = var1 - dplyr::lag(var2, order_by = gr))
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(Note that I changed your summarise to a mutate here, so every row is kept.)
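If the rows of the panel are not already ordered by year within each ID (common with an unbalanced panel), an explicit arrange() before the grouped lag keeps var3 aligned with the previous year; a sketch:
MyData %>%
  arrange(ID, YEAR) %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  ungroup()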
I'm extracting "wide" data which I intend to tidy with tidyr::pivot_longer().
library(tidyverse)
df1 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df2 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df3 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df4 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
lst <- list(df1, df2, df3, df4)
colname <-
c("ticker", "2017", "2018", "2019")
header <- list("Leverage", "Gearing", "Capex.to.sales", "FCFex")
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = header)
Using values_to = header gives me this error:
Error in `[[<-.data.frame`(`*tmp*`, ".value", value = list("Leverage",  :
  replacement has 4 rows, data has 3
Instead, I had to use the default values_to = "value", and subsequently use this code to rename my columns:
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = "value")
lst <- map(seq_along(lst), function(i){
x <- lst[[i]]
colnames(x)[3] <- header[[i]]
x
})
My output is shown below (columns renamed), but I was wondering if there is a way to feed a vector into values_to instead of using map (as it makes for better piping)? Or is there a more efficient way of going about this?
> lst
[[1]]
# A tibble: 30 x 3
ticker Period Leverage
<fct> <chr> <dbl>
1 a 2017 6.01
2 a 2018 4.82
3 a 2019 1.58
4 able 2017 8.64
5 able 2018 6.70
6 able 2019 0.831
7 about 2017 -0.187
8 about 2018 0.549
9 about 2019 0.829
10 absolute 2017 1.26
# ... with 20 more rows
[[2]]
# A tibble: 30 x 3
ticker Period Gearing
<fct> <chr> <dbl>
1 a 2017 2.37
2 a 2018 3.58
3 a 2019 5.63
4 able 2017 0.311
5 able 2018 0.708
6 able 2019 -0.0651
7 about 2017 2.89
8 about 2018 6.25
9 about 2019 10.1
10 absolute 2017 6.48
# ... with 20 more rows
[[3]]
# A tibble: 30 x 3
ticker Period Capex.to.sales
<fct> <chr> <dbl>
1 a 2017 5.22
2 a 2018 1.88
3 a 2019 0.746
4 able 2017 -3.90
5 able 2018 3.06
6 able 2019 1.91
7 about 2017 1.35
8 about 2018 4.12
9 about 2019 11.1
10 absolute 2017 1.76
# ... with 20 more rows
[[4]]
# A tibble: 30 x 3
ticker Period FCFex
<fct> <chr> <dbl>
1 a 2017 1.76
2 a 2018 2.85
3 a 2019 1.86
4 able 2017 -3.38
5 able 2018 -3.02
6 able 2019 -1.52
7 about 2017 6.46
8 about 2018 5.39
9 about 2019 0.810
10 absolute 2017 8.08
# ... with 20 more rows
For the second part of my question, I intend to use bind_cols() to combine all four dataframes into one, but the two common columns are being duplicated (as seen below).
How do I tell R to just bind the rightmost column that was renamed i.e. exclude the first two columns for the last three dataframes? Thank you.
Metrics <- bind_cols(lst)
> Metrics
# A tibble: 30 x 12
ticker Period Leverage ticker1 Period1 Gearing ticker2 Period2
<fct> <chr> <dbl> <fct> <chr> <dbl> <fct> <chr>
1 a 2017 6.01 a 2017 2.37 a 2017
2 a 2018 4.82 a 2018 3.58 a 2018
3 a 2019 1.58 a 2019 5.63 a 2019
4 able 2017 8.64 able 2017 0.311 able 2017
5 able 2018 6.70 able 2018 0.708 able 2018
6 able 2019 0.831 able 2019 -0.0651 able 2019
7 about 2017 -0.187 about 2017 2.89 about 2017
8 about 2018 0.549 about 2018 6.25 about 2018
9 about 2019 0.829 about 2019 10.1 about 2019
10 absol~ 2017 1.26 absolu~ 2017 6.48 absolu~ 2017
# ... with 20 more rows, and 4 more variables: Capex.to.sales <dbl>,
# ticker3 <fct>, Period3 <chr>, FCFex <dbl>
You could do it with purrr:
library(purrr)
lst <- map(lst, setNames, colname)
map2_dfc(lst, header, ~ pivot_longer(
.x, -ticker, names_to = "Period", values_to = .y)) %>%
select(c(1:3, 6, 9, 12))
Output:
ticker Period Leverage Gearing Capex.to.sales FCFex
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 a 2017 6.20 3.43 7.87 7.52
2 a 2018 1.63 3.30 0.126 1.52
3 a 2019 2.32 1.49 -0.286 6.95
4 able 2017 6.38 3.42 7.34 2.60
5 able 2018 0.763 1.68 -0.648 -2.85
6 able 2019 5.56 2.35 -0.572 3.21
7 about 2017 -0.762 1.49 3.12 2.43
8 about 2018 9.07 -1.22 0.821 4.00
9 about 2019 1.37 8.27 -0.700 -1.05
10 absolute 2017 1.39 2.49 0.390 2.40
# … with 20 more rows