I have a dataset with three variables: date, signal and value. Now I want to mutate a new column that is conditional on the signal and calculated from the value column.
If there was a signal on the previous day (lag(signal) == 1), calculate the mean of the values of the three days following the signal.
For this case I used the expression:
(value+lead(value)+lead(value,n = 2)) /3.
And so I get what I want:
library(dplyr)      # tibble(), mutate(), lag(), lead()
library(lubridate)  # today()
set.seed(123)
df<-tibble(date=today()+0:10,
signal=c(0,1,0,0,0,0,1,0,0,0,0),
value= sample.int(n=11))
df%>%mutate(calculation=ifelse(lag(signal)==1,
(value+lead(value)+lead(value, n = 2)) /3,
NA))
# A tibble: 11 x 4
date signal value calculation
<date> <dbl> <int> <dbl>
1 2019-07-17 0 1 NA
2 2019-07-18 1 7 NA
3 2019-07-19 0 5 6.33
4 2019-07-20 0 4 NA
5 2019-07-21 0 10 NA
6 2019-07-22 0 2 NA
7 2019-07-23 1 9 NA
8 2019-07-24 0 3 7.33
9 2019-07-25 0 11 NA
10 2019-07-26 0 8 NA
11 2019-07-27 0 6 NA
But I do not want to use only the following 3 days.
I want to use several window lengths, so I would like to automate the code and calculate several columns, maybe with something like an apply function.
Here is my desired output (in this example with 5 following days):
date signal value calc_day1 calc_day2 calc_day3 calc_day4 calc_day5
<date> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-07-17 0 1 NA NA
2 2019-07-18 1 7 NA NA
3 2019-07-19 0 5 5 (5+4)/2=4.5
4 2019-07-20 0 4 NA NA
5 2019-07-21 0 10 NA NA
6 2019-07-22 0 2 NA NA
7 2019-07-23 1 9 NA NA
8 2019-07-24 0 3 3 (3+11)/2=7
9 2019-07-25 0 11 NA NA
10 2019-07-26 0 8 NA NA
11 2019-07-27 0 6 NA NA
Can someone show me how I can solve this problem?
Hi, you can use the rlang and purrr packages as follows:
library(tidyverse)
myfun <- paste0("if_else(lag(signal) == 1, map_dbl(1:n(), ~mean(value[.x - 1 + 1:",
1:5 ,"])), NA_real_)") %>%
setNames(paste0("calc_day", 1:5)) %>%
purrr::map(rlang::parse_expr)
df %>%
mutate(!!! myfun)
# A tibble: 11 x 8
date signal value calc_day1 calc_day2 calc_day3 calc_day4 calc_day5
<date> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-07-17 0 3 NA NA NA NA NA
2 2019-07-18 1 11 NA NA NA NA NA
3 2019-07-19 0 2 2 4 6 5.75 5.4
4 2019-07-20 0 6 NA NA NA NA NA
5 2019-07-21 0 10 NA NA NA NA NA
6 2019-07-22 0 5 NA NA NA NA NA
7 2019-07-23 1 4 NA NA NA NA NA
8 2019-07-24 0 9 9 8.5 6 6.25 NA
9 2019-07-25 0 8 NA NA NA NA NA
10 2019-07-26 0 1 NA NA NA NA NA
11 2019-07-27 0 7 NA NA NA NA NA
A short explanation: if you wanted just one of these columns (say calc_day2), you could do the following:
df %>%
mutate(calc_day2 = if_else(lag(signal) == 1, map_dbl(1:n(), ~ mean(value[.x - 1 + 1:2])), NA_real_))
So in theory you could copy this line five times (each time replacing the 2 with the corresponding number).
Or you can use the rlang package (see also this question) to take a shortcut :).
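The same columns can also be built without parsing code from strings. The following is only a sketch under assumptions: forward_mean is a hypothetical helper name, and the value column is typed in by hand from the output above so the example does not depend on the random seed.

```r
library(dplyr)
library(purrr)

# Signal and value columns copied from the answer's output above.
df <- tibble(
  signal = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  value  = c(3, 11, 2, 6, 10, 5, 4, 9, 8, 1, 7)
)

# Hypothetical helper: for every row i, the mean of x[i], ..., x[i + k - 1].
# Windows that run past the end of x contain NA, so their mean is NA.
forward_mean <- function(x, k) {
  map_dbl(seq_along(x), ~ mean(x[.x - 1 + seq_len(k)]))
}

windows <- 1:5

# One vector per window length, kept only where the previous row had a signal.
new_cols <- map(
  windows,
  ~ if_else(lag(df$signal) == 1, forward_mean(df$value, .x), NA_real_)
) %>%
  set_names(paste0("calc_day", windows))

result <- bind_cols(df, as_tibble(new_cols))
result
```

Changing windows to another vector (for example 1:10) is then enough to get more columns.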
I have a dataset like this (in the actual dataset, I have more columns like subj01):
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 6
3 3 5 5 9
4 4 9 6 NA
5 5 10 8 NA
6 6 NA 9 NA
7 7 NA 10 NA
8 8 NA NA NA
9 9 NA NA NA
10 10 NA NA NA
I created the dataset using the code below.
data = tibble(item = 1:10, subj01 = c(1,2,5,9,10,NA,NA,NA,NA,NA), subj02 = c(1,2,5,6,8,9,10,NA,NA,NA), subj03 = c(1,6,9,NA,NA,NA,NA,NA,NA,NA))
I would like to reorder all the columns beginning with "subj" so that the position of each value matches its position in the item column.
That is, for this example dataset, I would like to end up with this:
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
I've figured that I can match and re-order one column by running this:
data$subj01[match(data$item,data$subj01)]
[1] 1 2 NA NA 5 NA NA NA 9 10
But I am struggling to apply this across multiple columns (ideally I'd like to embed the command in a dplyr pipe).
I tried the command below, but this gave me an error "Error in mutate(x. = x.[match(item, x.)]) : object 'x.' not found".
data = data %>% across(mutate(x.=x.[match(item,x.)]))
I'd appreciate any suggestions! Thank you.
library(tidyverse)
data %>%
pivot_longer(-item) %>%
filter(!is.na(value)) %>%
mutate(item = value) %>%
complete(item = 1:10, name) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 10 × 4
item subj01 subj02 subj03
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
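The match() idea from the question can also be applied to every subj column at once. This is a minimal sketch, assuming dplyr 1.0.0 or later (for across()); the names dat and result are only illustrative.

```r
library(dplyr)

# Example data from the question.
dat <- tibble(
  item   = 1:10,
  subj01 = c(1, 2, 5, 9, 10, NA, NA, NA, NA, NA),
  subj02 = c(1, 2, 5, 6, 8, 9, 10, NA, NA, NA),
  subj03 = c(1, 6, 9, NA, NA, NA, NA, NA, NA, NA)
)

# For each subj column, look up where every item value occurs in that column
# and pull the value from there; items that do not occur become NA.
result <- dat %>%
  mutate(across(starts_with("subj"), ~ .x[match(item, .x)]))

result
```

Unlike the pivot-based approach, this keeps item as an integer column and does not reshape the data.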
I have a data frame grouped by 'id' and a variable 'age' which contains missing values, NA.
Within each 'id', I want to replace missing values of 'age', but only "fill up" before the first non-NA value.
data <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
age=c(NA,6,NA,8,NA,NA,NA,NA,3,8,NA,NA,NA,7,NA,9))
id age
1 1 NA
2 1 6 # first non-NA in id = 1. Fill up from here
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 NA
8 2 NA
9 2 3 # first non-NA in id = 2. Fill up from here
10 2 8
11 2 NA
12 3 NA
13 3 NA
14 3 7 # first non-NA in id = 3. Fill up from here
15 3 NA
16 3 9
Expected output:
1 1 6
2 1 6
3 1 NA
4 1 8
5 1 NA
6 1 NA
7 2 3
8 2 3
9 2 3
10 2 8
11 2 NA
12 3 7
13 3 7
14 3 7
15 3 NA
16 3 9
I tried using fill with .direction = "up" like this:
library(dplyr)
library(tidyr)
data1 <- data %>% group_by(id) %>%
fill(!is.na(age[1]), .direction = "up")
You could use cumall(is.na(age)) to find the positions before the first non-NA value.
library(dplyr)
data %>%
group_by(id) %>%
mutate(age2 = replace(age, cumall(is.na(age)), age[!is.na(age)][1])) %>%
ungroup()
# A tibble: 16 × 3
id age age2
<dbl> <dbl> <dbl>
1 1 NA 6
2 1 6 6
3 1 NA NA
4 1 8 8
5 1 NA NA
6 1 NA NA
7 2 NA 3
8 2 NA 3
9 2 3 3
10 2 8 8
11 2 NA NA
12 3 NA 7
13 3 NA 7
14 3 7 7
15 3 NA NA
16 3 9 9
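If cumall() is not available, the same rule can be written as a small helper. This is a sketch only; fill_leading_na is a hypothetical name, not a dplyr or tidyr function.

```r
library(dplyr)

data <- data.frame(
  id  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  age = c(NA, 6, NA, 8, NA, NA, NA, NA, 3, 8, NA, NA, NA, 7, NA, 9)
)

# Hypothetical helper: overwrite the NAs before the first non-NA value
# with that value; later NAs are left alone.
fill_leading_na <- function(x) {
  first <- which(!is.na(x))[1]           # NA if x has no non-NA value
  if (!is.na(first) && first > 1) {
    x[seq_len(first - 1)] <- x[first]
  }
  x
}

result <- data %>%
  group_by(id) %>%
  mutate(age2 = fill_leading_na(age)) %>%
  ungroup()

result
```

A group that consists only of NA values is returned unchanged.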
Another option (agnostic about where the missing and non-missing values start) could be:
data %>%
group_by(id) %>%
mutate(rleid = with(rle(is.na(age)), rep(seq_along(lengths), lengths)),
age2 = ifelse(rleid == min(rleid[is.na(age)]),
age[rleid == (min(rleid[is.na(age)]) + 1)][1],
age))
id age rleid age2
<dbl> <dbl> <int> <dbl>
1 1 NA 1 6
2 1 6 2 6
3 1 NA 3 NA
4 1 8 4 8
5 1 NA 5 NA
6 1 NA 5 NA
7 2 NA 1 3
8 2 NA 1 3
9 2 3 2 3
10 2 8 2 8
11 2 NA 3 NA
12 3 NA 1 7
13 3 NA 1 7
14 3 7 2 7
15 3 NA 3 NA
16 3 9 4 9
Given this tibble:
tibble(x = c(1:9))
I want to add a column x_lag_1 = c(NA,1:8), a column x_lag_2 = c(NA,NA,1:7), etc.
Up to x_lag_n.
This can be quick with data.table:
library(data.table)
df <- data.frame(x = 1:9)  # the example data from the question
n <- seq(4)
setDT(df)[, paste0('x_lag_', n) := shift(x, n)]
df
x x_lag_1 x_lag_2 x_lag_3 x_lag_4
1: 1 NA NA NA NA
2: 2 1 NA NA NA
3: 3 2 1 NA NA
4: 4 3 2 1 NA
5: 5 4 3 2 1
6: 6 5 4 3 2
7: 7 6 5 4 3
8: 8 7 6 5 4
9: 9 8 7 6 5
You may use map_dfc to add n new columns.
library(dplyr)
library(purrr)
df <- tibble(x = c(1:9))
n <- 3
bind_cols(df, map_dfc(seq_len(n), ~df %>%
transmute(!!paste0('x_lag', .x) := lag(x, .x))))
# x x_lag1 x_lag2 x_lag3
# <int> <int> <int> <int>
#1 1 NA NA NA
#2 2 1 NA NA
#3 3 2 1 NA
#4 4 3 2 1
#5 5 4 3 2
#6 6 5 4 3
#7 7 6 5 4
#8 8 7 6 5
#9 9 8 7 6
Edit 2: Reworked the answer to handle the case of a grouped df.
library(tidyverse)
set.seed(123)
df <- tibble(group = sample(letters[1:3], 30, replace = TRUE), x = c(1:30))
formulas <- seq(3, 12, 3) %>%
map(~ as.formula(str_glue("~lag(.,n={.x})"))) %>%
set_names(str_c("lag", seq(3, 12, 3)))
df %>%
summarise(x, across(x, lst(!!!formulas)))
#> # A tibble: 30 × 5
#> x x_lag3 x_lag6 x_lag9 x_lag12
#> <int> <int> <int> <int> <int>
#> 1 1 NA NA NA NA
#> 2 2 NA NA NA NA
#> 3 3 NA NA NA NA
#> 4 4 1 NA NA NA
#> 5 5 2 NA NA NA
#> 6 6 3 NA NA NA
#> 7 7 4 1 NA NA
#> 8 8 5 2 NA NA
#> 9 9 6 3 NA NA
#> 10 10 7 4 1 NA
#> # … with 20 more rows
df %>%
group_by(group) %>%
summarise(x, across(x, lst(!!!formulas)), .groups = "drop")
#> # A tibble: 30 × 6
#> group x x_lag3 x_lag6 x_lag9 x_lag12
#> <chr> <int> <int> <int> <int> <int>
#> 1 a 10 NA NA NA NA
#> 2 a 13 NA NA NA NA
#> 3 a 16 NA NA NA NA
#> 4 a 19 10 NA NA NA
#> 5 a 20 13 NA NA NA
#> 6 a 21 16 NA NA NA
#> 7 a 22 19 10 NA NA
#> 8 a 27 20 13 NA NA
#> 9 b 4 NA NA NA NA
#> 10 b 6 NA NA NA NA
#> # … with 20 more rows
Created on 2021-12-30 by the reprex package (v2.0.1)
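As far as I know, returning several rows per group from summarise() is deprecated since dplyr 1.1.0, so a variant with mutate() may be more future-proof. The sketch below uses a small hand-made data frame and a hypothetical helper make_lag; it is an illustration, not the original answer's code.

```r
library(dplyr)
library(purrr)

df <- tibble(group = rep(c("a", "b"), each = 4), x = 1:8)

# Hypothetical helper: returns a function that lags its input by k rows.
make_lag <- function(k) {
  force(k)  # evaluate k now so each closure keeps its own value
  function(v) dplyr::lag(v, n = k)
}

lags <- 1:2
lag_funs <- map(lags, make_lag) %>%
  set_names(paste0("lag", lags))

# across() names the new columns "<column>_<function name>", here x_lag1, x_lag2.
result <- df %>%
  group_by(group) %>%
  mutate(across(x, lag_funs)) %>%
  ungroup()

result
```

With mutate() the rows stay in their original order, whereas the summarise() version above returns them sorted by group.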
Hello, a really simple question, but I am stuck: how do I add a conditional column containing the number 1 where the completed column is not NA?
id completed
<chr> <chr>
1 abc123sdf 35929
2 124cv NA
3 125xvdf 36295
4 126v NA
5 127sdsd 43933
6 128dfgs NA
7 129vsd NA
8 130sdf NA
9 131sdf NA
10 123sdfd NA
I need this to calculate an overall percent of completed column/id.
(Additional question - how can I do this in dplyr without using a helper column?)
Thanks
You can use is.na to check for NA values.
library(dplyr)
df %>% mutate(newcol = as.integer(!is.na(completed)))
# id completed newcol
#1 abc123sdf 35929 1
#2 124cv NA 0
#3 125xvdf 36295 1
#4 126v NA 0
#5 127sdsd 43933 1
#6 128dfgs NA 0
#7 129vsd NA 0
#8 130sdf NA 0
#9 131sdf NA 0
#10 123sdfd NA 0
library("dplyr")
df <- data.frame(id = 1:10,
completed = c(35929, NA, 36295, NA, 43933, NA, NA, NA, NA, NA))
df %>%
mutate(is_na = as.integer(!is.na(completed)))
#> id completed is_na
#> 1 1 35929 1
#> 2 2 NA 0
#> 3 3 36295 1
#> 4 4 NA 0
#> 5 5 43933 1
#> 6 6 NA 0
#> 7 7 NA 0
#> 8 8 NA 0
#> 9 9 NA 0
#> 10 10 NA 0
But you shouldn't need this extra column to calculate a percentage; you can just use na.rm:
df %>%
mutate(pct = completed / sum(completed, na.rm = TRUE))
#> id completed pct
#> 1 1 35929 0.3093141
#> 2 2 NA NA
#> 3 3 36295 0.3124650
#> 4 4 NA NA
#> 5 5 43933 0.3782209
#> 6 6 NA NA
#> 7 7 NA NA
#> 8 8 NA NA
#> 9 9 NA NA
#> 10 10 NA NA
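If "overall percent of completed" means the share of ids that have a non-missing completed value (my reading of the question, which may not be what was intended), it can be computed directly from is.na() without a helper column. A minimal sketch:

```r
library(dplyr)

df <- data.frame(
  id        = 1:10,
  completed = c(35929, NA, 36295, NA, 43933, NA, NA, NA, NA, NA)
)

# mean() of a logical vector is the proportion of TRUE values.
overall <- df %>%
  summarise(
    n_completed   = sum(!is.na(completed)),
    pct_completed = 100 * mean(!is.na(completed))
  )

overall
```

Here 3 of the 10 ids have a value, so pct_completed is 30.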
We can also do
library(dplyr)
df %>%
mutate(newcol = +(!is.na(completed)))
I have an unbalanced data frame with dates, localities and prices. I would like to calculate the price differences between the different localities by date. Because my data is unbalanced, I am thinking of adding rows (localities) to balance it so that I can get all the price differences.
My data looks like this:
library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
locality= rbinom(21,3, 0.2),
price=rnorm(21, 50, 20))
df %>%
arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To balance the data, I am thinking of something like:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price differences between pairs of localities, I am thinking of:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First transform the data to show individual columns for each locality
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
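The same idea can be sketched with dplyr and tidyr only. The data below is a small hand-made example rather than the random data from the question, and the loc prefix and the _minus_ column names are illustrative choices.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  date     = c(1, 1, 1, 1, 2, 2),
  locality = c(0, 0, 1, 2, 0, 1),
  price    = c(60, 35, 54, 47, 26, 100)
)

# Number the observations within each date/locality pair, then spread the
# localities into columns; missing combinations are filled with NA.
wide <- df %>%
  group_by(date, locality) %>%
  mutate(rid = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = locality, values_from = price, names_prefix = "loc")

# All unordered pairs of locality columns.
pairs <- combn(c("loc0", "loc1", "loc2"), 2, simplify = FALSE)

# Add one difference column per pair.
for (p in pairs) {
  wide[[paste0(p[1], "_minus_", p[2])]] <- wide[[p[1]]] - wide[[p[2]]]
}

wide
```

As with dcast, no manual balancing is needed, because pivot_wider() fills the missing date/locality combinations with NA.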