I'm trying to use Slider to compute moving averages over some time series data. The data has day resolution (one observation per day). For each observation I want to compute the average daily value over the last 7 days.
The problem is that missing observations have an implied value of zero, but my code ignores them. So if my period is 7 days, and some 7-day window contains only 2 observations, it sums them and divides by 2, whereas I want to sum and divide by 7 to get the average per day.
In the code below you'll see that the second row (2023-02-03) is computing the average by dividing by 2 (the number of observations), rather than by dividing by 4 (the number of days in the period 2023-01-31 to 2023-02-03).
Is there a good way to achieve the desired result, or do I just need to replace the mean calculation with sum() / 7?
I had originally backfilled the missing observations which worked, but the data is relatively large and quite sparse and doing so massively increased the runtime (from ~8 seconds to ~100).
library(tidyverse)
library(slider)
data <- data.frame(
date = Sys.Date() - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
val = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
)
print(as_tibble(data))
summary <- function(data) {
summarise(data,
moving_total = sum(val),
moving_avg = mean(val, na.rm = FALSE),
num_observations = n()
)
}
res <- data %>%
arrange(date) %>%
mutate(
weekly = slide_period_dfr(
.x = pick(everything()),
.i = date,
.period = "day",
.f = summary,
.before = 6,
.complete = FALSE
)
)
print(as_tibble(res))
# A tibble: 11 x 2
date val
<date> <dbl>
1 2023-02-13 0
2 2023-02-12 0
3 2023-02-11 2
4 2023-02-10 1
5 2023-02-09 0
6 2023-02-07 10
7 2023-02-06 0
8 2023-02-05 1
9 2023-02-04 1
10 2023-02-03 6
11 2023-01-31 1
# A tibble: 11 x 3
date val weekly$moving_total $moving_avg $num_observations
<date> <dbl> <dbl> <dbl> <int>
1 2023-01-31 1 1 1 1
2 2023-02-03 6 7 3.5 2
3 2023-02-04 1 8 2.67 3
4 2023-02-05 1 9 2.25 4
5 2023-02-06 0 9 1.8 5
6 2023-02-07 10 18 3.6 5
7 2023-02-09 0 18 3 6
8 2023-02-10 1 13 2.17 6
9 2023-02-11 2 14 2.33 6
10 2023-02-12 0 13 2.17 6
11 2023-02-13 0 13 2.17 6
Just a note on the implementation: in the real world the moving averages are computed over groups, hence the use of pick(everything()) above. I don't think it's necessary for the toy example, but I've left it in just in case it influences the answer.
Thanks
A straightforward solution is to supply zero values for the absent dates.
library(tidyverse)
library(slider)
(data_ <- tibble(
date = Sys.Date() - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
val = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
))
summary <- function(.data) {
summarise(.data,
moving_total = sum(val),
moving_avg = mean(val, na.rm = FALSE),
num_observations = n()
)
}
(full_dates <- tibble(date=
seq(min(data_$date),
max(data_$date),
by = "1 day"
)))
# join the observations onto the full calendar; absent dates become NA, then 0
(fdj <- left_join(
  full_dates,
  data_,
  by = "date"
) |> mutate(val = if_else(
  is.na(val), 0, val
)))
(res <-
mutate(fdj,
weekly = slide_period_dfr(
.x = pick(everything()),
.i = date,
.period = "day",
.f = summary,
.before = 6,
.complete = FALSE
)
))
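If backfilling is too slow on large, sparse data (the question reports roughly 8 seconds growing to about 100), one possible alternative is to leave the gaps in place, take the windowed sum, and divide by the number of calendar days the window covers. The sketch below is an assumption-laden variant, not the method used above: it swaps Sys.Date() for a fixed reference date so the result is reproducible, the names ref_date, window_days and days_in_window are illustrative only, and it treats the first observed date as the start of the series, so early windows are divided by the days elapsed since then (4 for the 2023-02-03 row) instead of 7.

```r
library(slider)

# Same toy data as above, with a fixed reference date (assumption) instead of Sys.Date()
ref_date <- as.Date("2023-02-13")
dates <- ref_date - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13)
val <- c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)

# slide_index_*() expects the index in ascending order
ord <- order(dates)
dates <- dates[ord]
val <- val[ord]

window_days <- 7

# Sum over the 7 calendar days ending at each observation; missing days add nothing
moving_total <- slide_index_dbl(val, dates, sum, .before = window_days - 1)

# Calendar days covered by each window, capped at the window length
days_in_window <- pmin(window_days, as.integer(dates - min(dates)) + 1L)

moving_avg <- moving_total / days_in_window
data.frame(date = dates, val, moving_total, days_in_window, moving_avg)
```

For grouped data, the same calculation could go inside a grouped mutate(), with min(date) then taken per group. If every window should be divided by 7 regardless of where the series starts, moving_total / window_days is enough.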
I'm trying out the table() function in R for running transition matrices, and so far I'm finding it easy to use and very quick compared to other options I've been playing with such as for-loops and lapply().
However I'd like to expand the output table to include all possible combinations even if they don't manifest in the data, populating those "unused" fields with 0's. Is there a quick/easy way to do this?
When I run the following example data frame through the table() function, as shown below:
ID Period Balance Flags
1 10 1 5 X00
2 10 2 10 X01
3 10 3 15 X00
4 11 1 0 X01
5 11 2 2 X02
6 11 3 4 X02
7 15 1 3 X02
8 15 2 6 X01
9 15 3 2 X00
dataTest <-
data.frame(
ID = c(10,10,10,11,11,11,15,15,15),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Balance = c(5, 10, 15, 0, 2, 4, 3, 6, 2),
Flags = c("X00","X01","X00","X01","X02","X02","X02","X01","X00")
)
table(dataTest[dataTest$Period == 3,]$Flags, dataTest[dataTest$Period == 1,]$Flags)
I get the following results (which are correct):
      X00 X01 X02
  X00   1   0   1
  X02   0   1   0
However I would like to expand those results to include the unused combinations, so it looks like this:
      X00 X01 X02
  X00   1   0   1
  X01   0   0   0
  X02   0   1   0
This is a transition table, showing along the columns the state the elements started in (in Period = 1 per this example user input) and showing along the rows the state the elements ended in (in Period = 3 per this example user input).
library(tidyverse)
dataTest <-
data.frame(
ID = c(10, 10, 10, 11, 11, 11, 15, 15, 15),
Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Balance = c(5, 10, 15, 0, 2, 4, 3, 6, 2),
Flags = c("X00", "X01", "X00", "X01", "X02", "X02", "X02", "X01", "X00")
)
table(
dataTest[dataTest$Period == 3, ]$Flags,
dataTest[dataTest$Period == 1, ]$Flags
) %>%
as.data.frame() %>%
mutate(
Var1 = Var1 %>% factor(levels = unique(dataTest$Flags)),
Var2 = Var2 %>% factor(levels = unique(dataTest$Flags)),
) %>%
complete(Var1, Var2, fill = list(Freq = 0)) %>%
pivot_wider(names_from = Var2, values_from = Freq)
#> # A tibble: 3 × 4
#> Var1 X00 X01 X02
#> <fct> <int> <int> <int>
#> 1 X00 1 0 1
#> 2 X01 0 0 0
#> 3 X02 0 1 0
Created on 2022-03-11 by the reprex package (v2.0.0)
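A shorter route, sketched here as an alternative and not as part of the answer above, is to convert the flags to factors carrying the full set of levels before calling table(); table() keeps every factor level, including those with a zero count. The names all_flags, end_state and start_state are illustrative, and as in the question the pairing assumes the rows appear in the same ID order within each period.

```r
dataTest <- data.frame(
  ID = c(10, 10, 10, 11, 11, 11, 15, 15, 15),
  Period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Balance = c(5, 10, 15, 0, 2, 4, 3, 6, 2),
  Flags = c("X00", "X01", "X00", "X01", "X02", "X02", "X02", "X01", "X00")
)

# Every state that occurs anywhere in the data, in a fixed order
all_flags <- sort(unique(dataTest$Flags))

# Factors remember their levels even when a level has no observations
end_state <- factor(dataTest$Flags[dataTest$Period == 3], levels = all_flags)
start_state <- factor(dataTest$Flags[dataTest$Period == 1], levels = all_flags)

# Rows are the ending state, columns the starting state
tab <- table(end_state, start_state)
tab
```

This stays in base R and returns a regular table object, which may be convenient if later steps expect a matrix-like result.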
I have this data:
data.frame(start_date =as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
end_date =as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
a = c(9, 1, 8),
b = c(6, 5, 7))
and I want to reshape it like this:
data.frame(date =as.Date(c('2020-03-02', '2020-03-09', '2020-03-16', '2020-03-06', '2020-03-13', '2002-03-20')),
a = c(9, 1, 8, 9, 1, 8),
b = c(6, 5, 7, 6, 5, 7))
How can I do it? Thanks!
You can use tidyr's gather() function for this (gather() still works, but it has been superseded by pivot_longer() since tidyr 1.0.0).
- First, assign the data frame to an object.
- Then gather the start and end dates into a single column, excluding a and b from the gather with a minus sign ("-") so their values are carried along, and name the value column "date". The output of gather() looks like this:
df %>%
gather(key, value = "date", -a, -b)
a b key date
1 9 6 start_date 2020-03-02
2 1 5 start_date 2020-03-09
3 8 7 start_date 2020-03-16
4 9 6 end_date 2020-03-06
5 1 5 end_date 2020-03-13
6 8 7 end_date 2002-03-20
- Finally, to drop the "key" column (which holds start_date and end_date), select only the columns you want.
See the full code below:
library(dplyr)
library(tidyr)
# assign the data frame to an object
df <- data.frame(start_date = as.Date(c('2020-03-02', '2020-03-09', '2020-03-16')),
                 end_date = as.Date(c('2020-03-06', '2020-03-13', '2002-03-20')),
                 a = c(9, 1, 8),
                 b = c(6, 5, 7))
df1 <- df %>%
  gather(key, value = "date", -a, -b) %>% # gather both date columns into one
  select(date, a, b)                      # keep only the columns needed
The full code produces this output:
date a b
1 2020-03-02 9 6
2 2020-03-09 1 5
3 2020-03-16 8 7
4 2020-03-06 9 6
5 2020-03-13 1 5
6 2002-03-20 8 7
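The same reshape can be sketched with pivot_longer(), the tidyr function that supersedes gather(). This is an alternative under the same data, not the code the answer above used; the name df_long is illustrative. One difference to be aware of: pivot_longer() returns the rows interleaved (start and end for the first input row, then the second, and so on), so an arrange() step would be needed if the start dates must all come first.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  start_date = as.Date(c("2020-03-02", "2020-03-09", "2020-03-16")),
  end_date = as.Date(c("2020-03-06", "2020-03-13", "2002-03-20")),
  a = c(9, 1, 8),
  b = c(6, 5, 7)
)

df_long <- df %>%
  # stack both date columns into a single "date" column, repeating a and b
  pivot_longer(c(start_date, end_date), names_to = "key", values_to = "date") %>%
  # drop the helper column that records which original column each date came from
  select(date, a, b)

df_long
```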
I have the following tibble, called education_tibble (in reality it has many more rows):
library(tidyverse)
education_tibble <- tibble(
ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
1011016, 1011021, 1011017, 1011019),
hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9))
ghousecode hhc_educ
<dbl> <dbl>
1 1011027 2
2 1011017 0
3 1011021 11
4 1011019 10
5 1011025 14
6 1011017 4
7 1011016 8
8 1011021 16
9 1011017 0
10 1011019 9
I am trying to sum the hhc_educ so that each ghousecode has a corresponding "total hhc_educ". I am struggling to do this, and not sure what to do. I have been using the tidyverse, so have been exploring ways mostly within dplyr. Here is my code:
education_tibble %>%
group_by(ghousecode, add = TRUE)
summarize(total_educ = sum(hhc_educ))
The problem is that this code generates just one value for some reason, not a total_educ value for each group. Ultimately I'm after a new tibble with one row per ghousecode and the sum of all of its hhc_educ values next to it. Any help would be much appreciated! Thank you!
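You missed a %>% after group_by(), I think, so summarise() never receives the grouped data. The add = TRUE argument is also unnecessary here (and deprecated in favour of .add since dplyr 1.0.0), so it is dropped below.
library(tidyverse)
#data
education_tibble <- tibble(
ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
1011016, 1011021, 1011017, 1011019),
hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9))
# grouped sum
education_tibble %>%
  group_by(ghousecode) %>%
  summarise(total_educ = sum(hhc_educ))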
Produces:
# A tibble: 6 x 2
ghousecode total_educ
<dbl> <dbl>
1 1011016 8
2 1011017 4
3 1011019 19
4 1011021 27
5 1011025 14
6 1011027 2
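As a compact variant of the same grouped sum, dplyr's count() accepts a wt argument that sums a column per group instead of counting rows. This is only a sketch of an equivalent call; the name totals is illustrative.

```r
library(dplyr)

education_tibble <- tibble(
  ghousecode = c(1011027, 1011017, 1011021, 1011019, 1011025, 1011017,
                 1011016, 1011021, 1011017, 1011019),
  hhc_educ = c(2, 0, 11, 10, 14, 4, 8, 16, 0, 9)
)

# count() with wt sums hhc_educ within each ghousecode; name sets the output column
totals <- education_tibble %>%
  count(ghousecode, wt = hhc_educ, name = "total_educ")

totals
```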
I have a dataset with start and end times for events (called df_time), and another dataset with when an event happened (df_val). I want to use df_time to filter down df_val only to events that happened within recorded time intervals.
I'm a bit lost on how to accomplish this though.
start = c(1, 5, 7, 4)
end = c(2, 7, 11, 7)
df_time = data.frame(start, end)
time = c(3, 6, 2, 10, 11)
val = c(100, 20, 30, 40, 50)
df_val = data.frame(time, val)
df_val %>% select_all() %>%
filter(time >= df_time$start & time <= df_time$end)
Output:
time val
1 6 20
Warning messages:
1: In time >= df_time$start :
longer object length is not a multiple of shorter object length
2: In time <= df_time$end :
longer object length is not a multiple of shorter object length
The above will run with warning messages (above), and gives me the wrong output (ignores starts/ends that are equal to value timestamps). Above, all values but 3 should be printed.
I'm unsure on how to fix this, and would appreciate any help/resources!
Is this what you are trying to accomplish?
library(tidyverse)
start = c(1, 5, 7, 4)
end = c(2, 7, 11, 7)
df_time = data.frame(start, end)
time = c(3, 6, 2, 10, 11)
val = c(100, 20, 30, 40, 50)
df_val = data.frame(time, val)
# return one row for each start/end pair that time falls between
map2_dfr(start, end, ~filter(df_val, time >= .x, time <= .y) %>% mutate(start = .x, end = .y))
#> time val start end
#> 1 2 30 1 2
#> 2 6 20 5 7
#> 3 10 40 7 11
#> 4 11 50 7 11
#> 5 6 20 4 7
#return unique pairs
map2_dfr(start, end, ~filter(df_val, time >= .x, time <= .y)) %>% unique()
#> time val
#> 1 2 30
#> 2 6 20
#> 3 10 40
#> 4 11 50
#simpler method, probably
df_val %>% filter(map_lgl(time, ~any((.x >= start) & .x <= end)))
#> time val
#> 1 6 20
#> 2 2 30
#> 3 10 40
#> 4 11 50
Created on 2019-07-25 by the reprex package (v0.2.1)
Edit: added some alternatives
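Here is another option, a non-equi inner join with data.table. Note that this version assumes both tables also carry an ID column (not present in the sample data above) and uses strict inequalities, so a time equal to a start or end is excluded; use start<=time, end>=time to include the boundaries: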
library(data.table)
setDT(df_time)
setDT(df_val)
df_time[df_val, on=.(ID, start<time, end>time), nomatch=0L,
c(mget(paste0("x.", names(df_time))), mget(paste0("i.", names(df_val))))]
output:
x.ID x.start x.end i.ID i.time i.val
1: 1 5 7 1 6 20
2: 1 4 7 1 6 20
3: 1 7 11 1 10 40
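For the inclusive bounds the question asks for, and without an ID column, a non-equi join can also be sketched in dplyr. This assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available; the name kept is illustrative. semi_join() keeps each df_val row that falls inside at least one interval, without duplicating rows that match several intervals.

```r
library(dplyr)

df_time <- data.frame(start = c(1, 5, 7, 4), end = c(2, 7, 11, 7))
df_val <- data.frame(time = c(3, 6, 2, 10, 11), val = c(100, 20, 30, 40, 50))

# Keep rows of df_val whose time lies in [start, end] for any row of df_time;
# between() inside join_by() is inclusive on both ends by default
kept <- semi_join(df_val, df_time, by = join_by(between(time, start, end)))

kept
```

Swapping semi_join() for inner_join() would instead return one row per matching interval, with the start and end columns attached.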
I'm trying to recode answers using a vector that contains the correct answers. I made a for loop that creates a new column (with the coded answer) on each iteration, using a vector with the possible names for the new columns.
However, it seems that mutate() does not accept a vector of names. I've tried some different vectors and some paste0() combinations but nothing seems to work.
Here is my reproducible code:
library(dplyr)
library(tibble)
correct = c(4, 5, 2, 2, 2, 3, 3, 5, 4, 5, 2, 1, 3, 4, 2, 2, 2, 4, 3, 1, 1, 5, 4, 1, 3, 2)
sub1 = c(3, 5, 1, 5, 4, 3, 2, 5, 4, 3, 4, 4, 4, 1, 5, 1, 4, 3, 3, 4, 3, 2, 4, 2, 3, 4)
df = t(data.frame(sub1))
colnames(df) = paste0("P", 1:26)
new_names = paste0("P", 1:26, "_coded")
for(i in 1:26){
df = as.tibble(df) %>%
mutate(new_names = case_when(.[i] == correct[i] ~ 1,
.[i] != correct[i] ~ 0,
T ~ 9999999))
print(df) # to know what's going on.
}
Also, I know that .dots can receive names in a vector (I think), but I don't quite understand how to use it with case_when inside mutate().
Other ways to create new columns with the recoded values are also welcome.
UPDATE:
My expected output would be the original data frame with 26 new columns, P1_COD:P26_COD with possible values 1 (if correct) and 0 (if incorrect).
Something like this (I just created four columns with 1s and 0s as an example).
df %>%
mutate(P1_COD = 1,
P2_COD = 0,
P3_COD = 1,
P4_COD = 1)
The data is not in a format that dplyr handles well. I would suggest restructuring your data to long format (one row per subject and question); the case_when() then becomes trivial and no for loop is required.
See the tidyr documentation on tidyverse.org for more on data formats.
Here is an example of the long format, including your sample data. I also added a couple of other subjects with random answers.
library(tidyverse)
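# data_frame() is deprecated; tibble() is the current constructor
responses <- tibble(
  subject = rep(1:3, each = 26),
  qNum = rep(1:26, 3),
  # subject 1 uses sub1 from the question; subjects 2 and 3 get random answers
  response = c(sub1,
               sample(5, 26, replace = TRUE),
               sample(5, 26, replace = TRUE)))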
The answers can be created and then merged:
answers <- tibble(
  qNum = 1:26,
  answer = correct)
# join explicitly on the question number
df <- left_join(responses, answers, by = "qNum")
Next, score the answers using dplyr::case_when:
df <- df %>% mutate(score = case_when(response == answer ~ 1,
TRUE ~ 0))
Note: the TRUE ~ 0 may be confusing at first. It is the catch-all branch: any row that did not match the first condition gets 0. The resulting df/tibble:
# A tibble: 78 x 5
   subject  qNum response answer score
     <int> <int>    <dbl>  <dbl> <dbl>
 1       1     1        3      4     0
 2       1     2        5      5     1
 3       1     3        1      2     0
 4       1     4        5      2     0
 5       1     5        4      2     0
 6       1     6        3      3     1
 7       1     7        2      3     0
 8       1     8        5      5     1
 9       1     9        4      4     1
10       1    10        3      5     0
# ... with 68 more rows
If you want to convert this to "wide" format, use tidyr::spread:
df %>%
select(-response, -answer) %>%
spread(qNum, score, sep = ".")
# A tibble: 3 x 27
subject qNum.1 qNum.2 qNum.3 qNum.4 qNum.5 qNum.6 qNum.7 qNum.8 qNum.9 qNum.10
* <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 0 1 0 1 1 0
2 2 0 0 0 0 1 0 0 0 0 0
3 3 0 0 0 0 1 0 0 0 0 0
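Since spread() has been superseded in tidyr, the same widening step can be sketched with pivot_wider(). To keep the example deterministic it uses only subject 1 from the question (no random subjects), and the names scored and wide are illustrative; names_prefix reproduces the qNum.1, qNum.2, ... column names that sep = "." gave above.

```r
library(dplyr)
library(tidyr)

correct <- c(4, 5, 2, 2, 2, 3, 3, 5, 4, 5, 2, 1, 3, 4, 2, 2, 2, 4, 3, 1, 1, 5, 4, 1, 3, 2)
sub1 <- c(3, 5, 1, 5, 4, 3, 2, 5, 4, 3, 4, 4, 4, 1, 5, 1, 4, 3, 3, 4, 3, 2, 4, 2, 3, 4)

# Long format: one row per subject and question, scored 1 (correct) or 0 (incorrect)
scored <- tibble(subject = 1L, qNum = 1:26, response = sub1, answer = correct) %>%
  mutate(score = if_else(response == answer, 1, 0))

# Wide format: one row per subject, one score column per question
wide <- scored %>%
  select(subject, qNum, score) %>%
  pivot_wider(names_from = qNum, values_from = score, names_prefix = "qNum.")

wide
```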