I want to create a new column that holds the return of the following day, placed in the following row. My example data frame has three columns: date, price and return. First I want to detect overreactions: if a return is higher than the mean plus one standard deviation, it counts as an overreaction. If not, the value is NA.
library(tidyverse)
library(quantmod)
df <- tibble(
date = lubridate::today() +0:9,
price = c(1,2.5,2,3,5,6.5,4,9,3,4))
df <- mutate(df, return = Delt(price))
df <- df %>% mutate(overreaction=
ifelse(return > mean(df$return, na.rm = TRUE) + sd(df$return, na.rm = TRUE),
yes = return, no = NA
)
)
Now I create a new column that gives me the return of the following day if an overreaction took place on the previous day.
df <- df %>% mutate(following_day =
ifelse(!is.na(overreaction), # test for missing values with is.na(), not by comparing to the string "NA"
yes= return%>% data.table::shift(n=1L, fill=NA, type=c("lead")),
no=NA)
)
print(df)
# A tibble: 10 x 5
date price return overreaction following_day
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-02-04 1 NA NA NA
2 2019-02-05 2.5 1.5 1.5 -0.200
3 2019-02-06 2 -0.200 NA NA
4 2019-02-07 3 0.5 NA NA
5 2019-02-08 5 0.667 NA NA
6 2019-02-09 6.5 0.3 NA NA
7 2019-02-10 4 -0.385 NA NA
8 2019-02-11 9 1.25 1.25 -0.667
9 2019-02-12 3 -0.667 NA NA
10 2019-02-13 4 0.333 NA NA
It works except for one problem:
I want the values in the following_day column to be shifted down by one row, so that they sit next to the day they belong to.
This is how the data frame should look:
# A tibble: 10 x 5
date price return overreaction following_day
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-02-04 1 NA NA NA
2 2019-02-05 2.5 1.5 1.5 NA
3 2019-02-06 2 -0.200 NA -0.200
4 2019-02-07 3 0.5 NA NA
5 2019-02-08 5 0.667 NA NA
6 2019-02-09 6.5 0.3 NA NA
7 2019-02-10 4 -0.385 NA NA
8 2019-02-11 9 1.25 1.25 NA
9 2019-02-12 3 -0.667 NA -0.667
10 2019-02-13 4 0.333 NA NA
Can someone help me?
Enclose df$following_day in dplyr::lag:
library(tidyverse)
library(quantmod)
df <- tibble(
date = lubridate::today() +0:9,
price = c(1,2.5,2,3,5,6.5,4,9,3,4)) %>%
mutate(return= Delt(price))
df <- mutate(df, overreaction =
ifelse( return > mean(df$return, na.rm = TRUE) + sd(df$return, na.rm = TRUE),
return, NA))
df <- mutate(df, following_day = ifelse(!is.na(overreaction),
data.table::shift(df$return, type = "lead"),
NA))
df$following_day <- dplyr::lag(df$following_day) # or data.table::shift
Output:
> df
# A tibble: 10 x 5
date price return overreaction following_day
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-02-04 1 NA NA NA
2 2019-02-05 2.5 1.5 1.5 NA
3 2019-02-06 2 -0.200 NA -0.200
4 2019-02-07 3 0.5 NA NA
5 2019-02-08 5 0.667 NA NA
6 2019-02-09 6.5 0.3 NA NA
7 2019-02-10 4 -0.385 NA NA
8 2019-02-11 9 1.25 1.25 NA
9 2019-02-12 3 -0.667 NA -0.667
10 2019-02-13 4 0.333 NA NA
The same can be achieved by subbing dplyr::lag with data.table::shift(df$following_day, type = "lag")
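Because the lead and the lag cancel each other out, the same column can probably be built in one step by lagging the overreaction flag instead of shifting the returns twice. A minimal sketch, assuming only dplyr is loaded; the fixed start date and the manual return formula `price / lag(price) - 1` are my stand-ins for `lubridate::today()` and `quantmod::Delt()`, not part of the answer above:

```r
library(dplyr)

df <- tibble(
  date  = as.Date("2019-02-04") + 0:9,   # fixed date so the example is reproducible
  price = c(1, 2.5, 2, 3, 5, 6.5, 4, 9, 3, 4)
) %>%
  mutate(
    # simple one-period return; assumed equivalent to quantmod::Delt(price) here
    return = price / lag(price) - 1,
    overreaction = if_else(
      return > mean(return, na.rm = TRUE) + sd(return, na.rm = TRUE),
      return, NA_real_
    ),
    # keep a day's return only if the previous day was flagged as an overreaction
    following_day = if_else(!is.na(lag(overreaction)), return, NA_real_)
  )

print(df)
```

With this data, only rows 3 and 9 should end up with a non-missing following_day, matching the desired table in the question.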
I have a data frame that looks like this:
Date = seq(as.Date("2022/1/1"), by = "day", length.out = 10)
x = c(rnorm(9),NA)
y = c(NA,0,rnorm(4),6,0,0,10)
z = c(rnorm(4),0,0,NA,NA,3,2)
d = tibble(Date,x,y,z);d
# A tibble: 10 x 4
Date x y z
<date> <dbl> <dbl> <dbl>
1 2022-01-01 2.456174 NA 0.2963012
2 2022-01-02 0.3648335 0 0.3981664
3 2022-01-03 0.8283570 -0.1843364 1.194378
4 2022-01-04 1.061199 1.507231 -0.2337116
5 2022-01-05 -0.07824196 -0.6708553 0
6 2022-01-06 -0.2654019 0.3008499 0
7 2022-01-07 1.426953 6 NA
8 2022-01-08 -0.5776817 0 NA
9 2022-01-09 0.8706953 0 3
10 2022-01-10 NA 10 2
How can I replace all the zeros across all columns with NA using the dplyr package?
With dplyr, you could use na_if():
library(dplyr)
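# Restricted to numeric columns so the Date column is not compared with 0;
# the lambda form avoids passing extra arguments through across()
d %>%
mutate(across(where(is.numeric), ~ na_if(.x, 0)))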
or simply
d[d == 0] <- NA
# Pipeline
d %>%
`[<-`(d == 0, value = NA)
I would avoid dplyr here (there are already three base R examples in the comments on the question), but you could do:
library(dplyr)
d |> mutate_all( \(x) replace(x, x == 0, NA))
# x y z
# <dbl> <dbl> <dbl>
# 1 -0.626 NA -2.21
# 2 0.184 NA 1.12
# 3 -0.836 -0.305 -0.0449
# 4 1.60 1.51 -0.0162
# 5 0.330 0.390 NA
# 6 -0.820 -0.621 NA
# 7 0.487 6 NA
# 8 0.738 NA NA
# 9 0.576 NA 3
# 10 NA 10 2
Reproducible data:
set.seed(1)
x = c(rnorm(9),NA)
y = c(NA,0,rnorm(4),6,0,0,10)
z = c(rnorm(4),0,0,NA,NA,3,2)
d = tibble(x,y,z)
Here is my comment to the question as a dplyr pipe.
suppressPackageStartupMessages(library(dplyr))
x = c(rnorm(9),NA)
y = c(NA,0,rnorm(4),6,0,0,10)
z = c(rnorm(4),0,0,NA,NA,3,2)
Date <- seq(as.Date("2022-01-01"), by = "day", length = length(x))
d = tibble(Date,x,y,z)
d %>%
mutate(across(everything(), ~`is.na<-`(., . == 0)))
#> # A tibble: 10 × 4
#> Date x y z
#> <date> <dbl> <dbl> <dbl>
#> 1 2022-01-01 -0.311 NA -0.891
#> 2 2022-01-02 -0.192 NA 0.278
#> 3 2022-01-03 1.24 0.742 -0.331
#> 4 2022-01-04 0.0130 -1.18 0.384
#> 5 2022-01-05 -1.11 -1.17 NA
#> 6 2022-01-06 -0.330 -0.629 NA
#> 7 2022-01-07 -1.25 6 NA
#> 8 2022-01-08 0.0937 NA NA
#> 9 2022-01-09 0.986 NA 3
#> 10 2022-01-10 NA 10 2
Created on 2022-09-14 with reprex v2.0.2
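For comparison, the same replacement can be sketched in base R while leaving non-numeric columns such as the date untouched. The small data frame below is made-up illustration data (not the random data from the question), chosen so the result is easy to check by eye:

```r
# Hypothetical fixed data, small enough to verify by hand
d <- data.frame(
  Date = as.Date("2022-01-01") + 0:3,
  x = c(1.5, 0, NA, 2),
  y = c(0, 0, 3, NA)
)

# Only touch numeric columns; is.numeric() is FALSE for Date columns
num_cols <- vapply(d, is.numeric, logical(1))

d[num_cols] <- lapply(d[num_cols], function(col) {
  col[which(col == 0)] <- NA   # which() ignores the NA comparisons
  col
})

print(d)
```

Existing NA values stay NA, zeros become NA, and every other value is unchanged.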
I have a dataframe which looks like this example, just much larger:
Name date var1 var2 var3
Peter 2020-03-30 0.4 0.5 0.2
Ben 2020-10-14 0.6 0.4 0.1
Mary 2020-12-06 0.7 0.2 0.9
I want to create a new dataframe for each variable (i.e., var1, var2, var3), which should look like this, e.g., for var1:
date Peter Ben Mary
2020-03-30 0.4 NA NA
2020-10-14 NA 0.6 NA
2020-12-06 NA NA 0.7
I can do it with spread for one variable at a time:
df_new <- tidyr::spread(df[, -c(4:5)], Name, var1) # drop var2 and var3, spread the Name column
But I could not figure out how to loop it over all columns as I am new to R.
Thank you!
First we want to create a list of data frames and then pivot each one:
library(tidyverse)
res_list = dat %>%
pivot_longer(cols = contains("var")) %>%
split(., .$name) %>%
map(. %>% pivot_wider(names_from="Name"))
$var1
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var1 0.4 NA NA
2 2020-10-14 var1 NA 0.6 NA
3 2020-12-06 var1 NA NA 0.7
$var2
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var2 0.5 NA NA
2 2020-10-14 var2 NA 0.4 NA
3 2020-12-06 var2 NA NA 0.2
$var3
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var3 0.2 NA NA
2 2020-10-14 var3 NA 0.1 NA
3 2020-12-06 var3 NA NA 0.9
Then you can access a single data frame with double brackets (single brackets would return a one-element list):
res_list[["var1"]]
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var1 0.4 NA NA
2 2020-10-14 var1 NA 0.6 NA
3 2020-12-06 var1 NA NA 0.7
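If the helper `name` column is not wanted in the result, a variation is to loop over the variable names and widen one variable at a time. This is only a sketch: the data frame below is retyped from the question, and `vars` and `res_list` are names chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Example data retyped from the question
df <- tibble(
  Name = c("Peter", "Ben", "Mary"),
  date = as.Date(c("2020-03-30", "2020-10-14", "2020-12-06")),
  var1 = c(0.4, 0.6, 0.7),
  var2 = c(0.5, 0.4, 0.2),
  var3 = c(0.2, 0.1, 0.9)
)

vars <- c("var1", "var2", "var3")

# One wide data frame per variable, collected in a named list
res_list <- lapply(setNames(vars, vars), function(v) {
  df %>%
    select(Name, date, all_of(v)) %>%
    pivot_wider(names_from = Name, values_from = all_of(v))
})

res_list[["var1"]]
```

Each list element then has the columns date, Peter, Ben and Mary, as in the desired output from the question.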
We can do it this way:
The beginning is similar to user438383's solution.
But then we name each tibble in the list and save them to the global environment within the pipe. For this we need massign from the collapse package; thanks to @akrun: How to save each named tibble in a list, as a separate tibble or dataframe in one run
library(tidyverse)
library(collapse)
df %>%
pivot_longer(cols = contains("var")) %>%
group_split(name) %>%
setNames(unique(df$Name)) %>%
map(. %>% pivot_wider(names_from = Name)) %>%
map(. %>% select(-name)) %>%
massign(names(.), ., .GlobalEnv)
Ben
Mary
Peter
> Ben
# A tibble: 3 x 4
date Peter Ben Mary
<chr> <dbl> <dbl> <dbl>
1 2020-03-30 0.5 NA NA
2 2020-10-14 NA 0.4 NA
3 2020-12-06 NA NA 0.2
> Mary
# A tibble: 3 x 4
date Peter Ben Mary
<chr> <dbl> <dbl> <dbl>
1 2020-03-30 0.2 NA NA
2 2020-10-14 NA 0.1 NA
3 2020-12-06 NA NA 0.9
> Peter
# A tibble: 3 x 4
date Peter Ben Mary
<chr> <dbl> <dbl> <dbl>
1 2020-03-30 0.4 NA NA
2 2020-10-14 NA 0.6 NA
3 2020-12-06 NA NA 0.7
This post is similar to another post where the direction was to look upwards on the column: How to search upwards a column for a value based on whether another column is NA or not?
This time I need to look downwards on the first and second entry where value is not NA.
Again, using a simple shift will not work here.
EDIT: Added grouping variable and the possibility of both TYPE and VALUE NOT NA
dtihave = data.table(id = c(rep(1,9)),
date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-02", "2020-05-09", "2020-06-10", "2020-07-18", "2020-08-23", "2020-09-09")),
type = c(1,1,1,NA,1,1,NA,NA,1),
value = c(7,NA,6,8,15,NA,5,9,NA))
> dtihave
id date type value
1: 1 2020-01-01 1 7
2: 1 2020-02-01 1 NA
3: 1 2020-03-03 1 6
4: 1 2020-04-02 NA 8
5: 1 2020-05-09 1 15
6: 1 2020-06-10 1 NA
7: 1 2020-07-18 NA 5
8: 1 2020-08-23 NA 9
9: 1 2020-09-09 1 NA
dtiwant2 = data.table(id = c(rep(1,9)),
date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-02", "2020-05-09", "2020-06-10", "2020-07-18", "2020-08-23", "2020-09-09")),
type = c(1,1,1,NA,1,1,NA,NA,1),
value = c(7,NA,6,8,15,NA,5,9,NA),
iwantdateonedown = c(as.Date("2020-03-03"), as.Date("2020-03-03"), as.Date("2020-04-02"), NA, as.Date("2020-07-18"), as.Date("2020-07-18"), NA,NA,NA),
iwantvalueonedown = c(6,6,8,NA,5,5,NA,NA,NA),
iwantdatetwodown = c(as.Date("2020-04-02"), as.Date("2020-04-02"), as.Date("2020-05-09"), NA, as.Date("2020-8-23"), as.Date("2020-08-23"), NA,NA,NA),
iwantvaluetwodown = c(8,8,15,NA,9,9,NA,NA,NA))
> dtiwant2
id date type value iwantdateonedown iwantvalueonedown iwantdatetwodown iwantvaluetwodown
1: 1 2020-01-01 1 7 2020-03-03 6 2020-04-02 8
2: 1 2020-02-01 1 NA 2020-03-03 6 2020-04-02 8
3: 1 2020-03-03 1 6 2020-04-02 8 2020-05-09 15
4: 1 2020-04-02 NA 8 <NA> NA <NA> NA
5: 1 2020-05-09 1 15 2020-07-18 5 2020-08-23 9
6: 1 2020-06-10 1 NA 2020-07-18 5 2020-08-23 9
7: 1 2020-07-18 NA 5 <NA> NA <NA> NA
8: 1 2020-08-23 NA 9 <NA> NA <NA> NA
9: 1 2020-09-09 1 NA <NA> NA <NA> NA
Current solution
dtihave %>%
group_by(id) %>%
mutate( value = replace(value, 1, NA),
val_na = !is.na(value), idx=na_if(val_na * row_number(), 0),
idx = nafill(idx, 'nocb'), idx = idx * NA^val_na,
idx1 = row_number() *na_if(val_na & lag(val_na), 0),
idx1 = nafill(idx1, 'nocb') * NA ^val_na,
value1 = value[idx], date1 = date[idx],
value2 = value[idx1], date2=date[idx1],
idx = NULL,idx1 = NULL,val_na = NULL
)
# A tibble: 9 x 8
# Groups: id [1]
id date type value value1 date1 value2 date2
<dbl> <date> <dbl> <dbl> <dbl> <date> <dbl> <date>
1 1 2020-01-01 1 NA 6 2020-03-03 8 2020-04-02
2 1 2020-02-01 1 NA 6 2020-03-03 8 2020-04-02
3 1 2020-03-03 1 6 NA NA NA NA
4 1 2020-04-02 NA 8 NA NA NA NA
5 1 2020-05-09 1 15 NA NA NA NA
6 1 2020-06-10 1 NA 5 2020-07-18 9 2020-08-23
7 1 2020-07-18 NA 5 NA NA NA NA
8 1 2020-08-23 NA 9 NA NA NA NA
9 1 2020-09-09 1 NA NA NA NA NA
dtihave %>%
mutate( value = replace(value, 1, NA),
val_na = !is.na(value), idx=na_if(val_na * row_number(), 0),
idx = nafill(idx, 'nocb'), idx = idx * NA^val_na,
idx1 = row_number() *na_if(val_na & lag(val_na), 0),
idx1 = nafill(idx1, 'nocb') * NA ^val_na,
value1 = value[idx], date1 = date[idx],
value2 = value[idx1], date2=date[idx1],
idx = NULL,idx1 = NULL,val_na = NULL
)
date type value value1 date1 value2 date2
1: 2020-01-01 1 NA 6 2020-03-03 8 2020-04-02
2: 2020-02-01 1 NA 6 2020-03-03 8 2020-04-02
3: 2020-03-03 NA 6 NA <NA> NA <NA>
4: 2020-04-02 NA 8 NA <NA> NA <NA>
5: 2020-05-09 1 NA 5 2020-07-18 9 2020-08-23
6: 2020-06-10 1 NA 5 2020-07-18 9 2020-08-23
7: 2020-07-18 NA 5 NA <NA> NA <NA>
8: 2020-08-23 NA 9 NA <NA> NA <NA>
9: 2020-09-09 1 NA NA <NA> NA <NA>
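One possible way to get the wanted columns, sketched under my reading of dtiwant2: for every row whose type is not NA, take the first and the second later row (within the same id) whose value is not NA. The helper `kth_next_non_na()` is a hypothetical name introduced here, and the row-by-row search is quadratic, so treat it as an illustration rather than a tuned solution.

```r
library(data.table)

dt <- data.table(
  id    = rep(1, 9),
  date  = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-02",
                    "2020-05-09", "2020-06-10", "2020-07-18", "2020-08-23",
                    "2020-09-09")),
  type  = c(1, 1, 1, NA, 1, 1, NA, NA, 1),
  value = c(7, NA, 6, 8, 15, NA, 5, 9, NA)
)

# Hypothetical helper: for each position, the index of the k-th later
# position with a non-NA value (NA if there is none)
kth_next_non_na <- function(value, k) {
  non_na <- which(!is.na(value))
  vapply(seq_along(value), function(i) {
    later <- non_na[non_na > i]
    if (length(later) >= k) later[k] else NA_integer_
  }, integer(1))
}

dt[, c("date1", "value1", "date2", "value2") := {
  i1 <- kth_next_non_na(value, 1L)
  i2 <- kth_next_non_na(value, 2L)
  i1[is.na(type)] <- NA   # rows without a type get no look-ahead
  i2[is.na(type)] <- NA
  list(date[i1], value[i1], date[i2], value[i2])
}, by = id]

print(dt)
```

On the example data this should reproduce the value and date columns of dtiwant2, with NA in rows 4, 7, 8 and 9.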
I have a dataset with three variables: date, signal and value. Now I want to add a new column that is conditional on the signal and calculated from the value column.
If there was a signal on the previous day (ifelse(lag(signal) == 1, ...)), then calculate the mean of the values of the following three days.
In this case I've used this expression:
(value+lead(value)+lead(value,n = 2)) /3.
And so I get what I want:
set.seed(123)
df<-tibble(date=today()+0:10,
signal=c(0,1,0,0,0,0,1,0,0,0,0),
value= sample.int(n=11))
df%>%mutate(calculation=ifelse(lag(signal)==1,
(value+lead(value)+lead(value, n = 2)) /3,
NA))
# A tibble: 11 x 4
date signal value calculation
<date> <dbl> <int> <dbl>
1 2019-07-17 0 1 NA
2 2019-07-18 1 7 NA
3 2019-07-19 0 5 6.33
4 2019-07-20 0 4 NA
5 2019-07-21 0 10 NA
6 2019-07-22 0 2 NA
7 2019-07-23 1 9 NA
8 2019-07-24 0 3 7.33
9 2019-07-25 0 11 NA
10 2019-07-26 0 8 NA
11 2019-07-27 0 6 NA
But my problem is that I do not just want to use the following 3 days.
I want to use several days. And so I want to automate the code and calculate several columns. Maybe with something like an apply-function.
Here is my desired output (in this example with 5 following days):
date signal value calc_day_1 calc_day2 calc_day3 calc_day4 calc_day5
<date> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-07-17 0 1 NA NA
2 2019-07-18 1 7 NA NA
3 2019-07-19 0 5 5 (5+4)/2=4.5
4 2019-07-20 0 4 NA NA
5 2019-07-21 0 10 NA NA
6 2019-07-22 0 2 NA NA
7 2019-07-23 1 9 NA NA
8 2019-07-24 0 3 3 (3+11)/2=7
9 2019-07-25 0 11 NA NA
10 2019-07-26 0 8 NA NA
11 2019-07-27 0 6 NA NA
Can someone show me how can I solve this problem?
Hi you can use the rlang package and the purrr package as follows:
library(tidyverse)
myfun <- paste0("if_else(lag(signal) == 1, map_dbl(1:n(), ~mean(value[.x - 1 + 1:",
1:5 ,"])), NA_real_)") %>%
setNames(paste0("calc_day", 1:5)) %>%
purrr::map(rlang::parse_expr)
df %>%
mutate(!!! myfun)
# A tibble: 11 x 8
date signal value calc_day1 calc_day2 calc_day3 calc_day4 calc_day5
<date> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-07-17 0 3 NA NA NA NA NA
2 2019-07-18 1 11 NA NA NA NA NA
3 2019-07-19 0 2 2 4 6 5.75 5.4
4 2019-07-20 0 6 NA NA NA NA NA
5 2019-07-21 0 10 NA NA NA NA NA
6 2019-07-22 0 5 NA NA NA NA NA
7 2019-07-23 1 4 NA NA NA NA NA
8 2019-07-24 0 9 9 8.5 6 6.25 NA
9 2019-07-25 0 8 NA NA NA NA NA
10 2019-07-26 0 1 NA NA NA NA NA
11 2019-07-27 0 7 NA NA NA NA NA
Small explanation: if you just wanted one of these columns (say calc_day2) you could do the following:
df %>%
mutate(calc_day2 = if_else(lag(signal) == 1, map_dbl(1:n(), ~ mean(value[.x - 1 + 1:2])), NA_real_))
So in theory you could just copy this line just five times (each time replacing the 2 with the corresponding number).
Or you use the rlang package (see also this question) to take a shortcut :).
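If building expressions from strings feels heavy, a plain loop over the window sizes gives the same kind of columns. The sketch below uses base R only and assumes the values printed in the question's table (1, 7, 5, ...) instead of `sample.int()`, so the numbers can be checked by hand; `prev_signal` and `window_mean` are names made up for the example.

```r
df <- data.frame(
  date   = as.Date("2019-07-17") + 0:10,
  signal = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  value  = c(1, 7, 5, 4, 10, 2, 9, 3, 11, 8, 6)
)

# Signal of the previous day (same idea as dplyr::lag(signal))
prev_signal <- c(NA, head(df$signal, -1))

for (k in 1:5) {
  # Mean of the current value and the k - 1 values after it;
  # indexing past the end yields NA, so incomplete windows become NA
  window_mean <- vapply(
    seq_len(nrow(df)),
    function(i) mean(df$value[i:(i + k - 1)]),
    numeric(1)
  )
  df[[paste0("calc_day", k)]] <- ifelse(prev_signal == 1, window_mean, NA)
}

print(df)
```

For row 3 this gives 5 for calc_day1 and (5 + 4) / 2 = 4.5 for calc_day2, in line with the desired output in the question.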
I want to create several columns with an ifelse() condition. Here is my example code:
df <- tibble(
date = lubridate::today() +0:9,
return= c(1,2.5,2,3,5,6.5,1,9,3,2))
And now I want to add new columns with ascending conditions (from 1 to 8). The first column should only contain values from the "return"-column, which are higher than 1, the second column should only contain values, which are higher than 2, and so on...
I can calculate each column with a mutate() function:
df <- df %>% mutate( `return>1`= ifelse(return > 1, return, NA))
df <- df %>% mutate( `return>2`= ifelse(return > 2, return, NA))
df <- df %>% mutate( `return>3`= ifelse(return > 3, return, NA))
df <- df %>% mutate( `return>4`= ifelse(return > 4, return, NA))
df <- df %>% mutate( `return>5`= ifelse(return > 5, return, NA))
df <- df %>% mutate( `return>6`= ifelse(return > 6, return, NA))
df <- df %>% mutate( `return>7`= ifelse(return > 7, return, NA))
df <- df %>% mutate( `return>8`= ifelse(return > 8, return, NA))
> head(df)
# A tibble: 6 x 10
date return `return>1` `return>2` `return>3` `return>4` `return>5` `return>6` `return>7` `return>8`
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-08 1 NA NA NA NA NA NA NA NA
2 2019-03-09 2.5 2.5 2.5 NA NA NA NA NA NA
3 2019-03-10 2 2 NA NA NA NA NA NA NA
4 2019-03-11 3 3 3 NA NA NA NA NA NA
5 2019-03-12 5 5 5 5 5 NA NA NA NA
6 2019-03-13 6.5 6.5 6.5 6.5 6.5 6.5 6.5 NA NA
Is there an easier way to create all these columns and reduce all this code? Maybe with a map_function? And is there a way to automatically name the new columns?
An option with lapply
n <- seq(1, 8)
df[paste0("return > ", n)] <- lapply(n, function(x)
replace(df$return, df$return <= x, NA))
# date return `return > 1` `return > 2` `return > 3` .....
# <date> <dbl> <dbl> <dbl> <dbl>
#1 2019-03-08 1 NA NA NA
#2 2019-03-09 2.5 2.5 2.5 NA
#3 2019-03-10 2 2 NA NA
#4 2019-03-11 3 3 3 NA
#5 2019-03-12 5 5 5 5
#6 2019-03-13 6.5 6.5 6.5 6.5
#...
Here is a for loop solution:
for(i in 1:8){
varname =paste0("return>",i)
df[[varname]] <- with(df, ifelse(return > i, return, NA))
}
Use purrr::map_df:
> bind_cols(df,purrr::map_df(setNames(1:8,paste0('return>',1:8)),
+ function(x) ifelse(df$return > x, df$return, NA)))
# A tibble: 6 x 10
# date return `return>1` `return>2` `return>3` `return>4` `return>5` `return>6` `return>7` `return>8`
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2019-03-08 1 NA NA NA NA NA NA NA NA
# 2 2019-03-09 2.5 2.5 2.5 NA NA NA NA NA NA
# 3 2019-03-10 2 2 NA NA NA NA NA NA NA
# 4 2019-03-11 3 3 3 NA NA NA NA NA NA
# 5 2019-03-12 5 5 5 5 5 NA NA NA NA
# 6 2019-03-13 6.5 6.5 6.5 6.5 6.5 6.5 6.5 NA NA