How to apply rolling t.test in R on a single variable?

I have a data.frame (df) with two columns (Date & Count) which looks something like shown below:
Date Count
1/1/2022 5
1/2/2022 13
1/3/2022 21
1/4/2022 29
1/5/2022 37
1/6/2022 45
1/7/2022 53
1/8/2022 61
1/9/2022 69
1/10/2022 77
1/11/2022 85
1/12/2022 93
1/13/2022 101
1/14/2022 109
1/15/2022 117
Since I have a single variable (Count), the idea is to check whether the mean has changed every three days. I therefore want to apply a rolling t.test with a window of 3 days and save the resulting p-value next to the Count column so that I can plot it later. I have usually seen these tests done with two variables, so I can't figure out how to do it with a single variable.
For example, I saw this relevant answer here:
ttestFun <- function(dat) {
myTtest = t.test(x = dat[, 1], y = dat[, 2])
return(myTtest$p.value)
}
rollapply(df_ts, 7, FUN = ttestFun, fill = NA, by.column = FALSE)
But again, this is with two columns. Any guidance please?

Irrespective of any discussion about the usefulness of the approach, given a fixed window of 3 measurements, you could shift the counts by 3 and run the t-test between the two resulting columns, as in your example:
library(data.table)
set.seed(123)
dates <- seq(as.POSIXct("2022-01-01"), as.POSIXct("2022-02-01"), by = "1 day")
dt <- data.table(Date=dates, count = sample(1:200, length(dates), replace=TRUE), key="Date")
dt[, nxt:=shift(count, 3, type = "lead")]
dt[, group:=rep(1:ceiling(length(dates)/3), each=3)[seq_along(dates)]]
dt[, p:= tryCatch(t.test(count, nxt)$p.value, error=function(e) NA), by="group"][]
#> Date count nxt group p
#> 1: 2022-01-01 159 195 1 0.7750944
#> 2: 2022-01-02 179 170 1 0.7750944
#> 3: 2022-01-03 14 50 1 0.7750944
#> 4: 2022-01-04 195 118 2 0.2240362
#> 5: 2022-01-05 170 43 2 0.2240362
#> 6: 2022-01-06 50 14 2 0.2240362
#> 7: 2022-01-07 118 118 3 0.1763296
#> 8: 2022-01-08 43 153 3 0.1763296
#> 9: 2022-01-09 14 90 3 0.1763296
#> 10: 2022-01-10 118 91 4 0.8896343
#> 11: 2022-01-11 153 197 4 0.8896343
#> 12: 2022-01-12 90 91 4 0.8896343
#> 13: 2022-01-13 91 185 5 0.8065021
#> 14: 2022-01-14 197 92 5 0.8065021
#> 15: 2022-01-15 91 137 5 0.8065021
#> 16: 2022-01-16 185 99 6 0.1060465
#> 17: 2022-01-17 92 72 6 0.1060465
#> 18: 2022-01-18 137 26 6 0.1060465
#> 19: 2022-01-19 99 7 7 0.5283156
#> 20: 2022-01-20 72 170 7 0.5283156
#> 21: 2022-01-21 26 137 7 0.5283156
#> 22: 2022-01-22 7 164 8 0.9612965
#> 23: 2022-01-23 170 78 8 0.9612965
#> 24: 2022-01-24 137 81 8 0.9612965
#> 25: 2022-01-25 164 43 9 0.6111337
#> 26: 2022-01-26 78 103 9 0.6111337
#> 27: 2022-01-27 81 117 9 0.6111337
#> 28: 2022-01-28 43 76 10 0.6453494
#> 29: 2022-01-29 103 143 10 0.6453494
#> 30: 2022-01-30 117 NA 10 0.6453494
#> 31: 2022-01-31 76 NA 11 NA
#> 32: 2022-02-01 143 NA 11 NA
#> Date count nxt group p
Created on 2022-04-07 by the reprex package (v2.0.1)
You could further clean that up, e.g. by taking the first date per group:
dt[, .(Date=Date[1], count=round(mean(count), 2), p=p[1]), by="group"]
#> group Date count p
#> 1: 1 2022-01-01 117.33 0.7750944
#> 2: 2 2022-01-04 138.33 0.2240362
#> 3: 3 2022-01-07 58.33 0.1763296
#> 4: 4 2022-01-10 120.33 0.8896343
#> 5: 5 2022-01-13 126.33 0.8065021
#> 6: 6 2022-01-16 138.00 0.1060465
#> 7: 7 2022-01-19 65.67 0.5283156
#> 8: 8 2022-01-22 104.67 0.9612965
#> 9: 9 2022-01-25 107.67 0.6111337
#> 10: 10 2022-01-28 87.67 0.6453494
#> 11: 11 2022-01-31 109.50 NA
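If a p-value is wanted for every row rather than one per block of three, the same idea can be written as a sliding comparison in base R. This is only a sketch under assumptions: the helper name roll_p and its width argument are hypothetical, and it compares the window ending at each row with the window of equal width just before it (a Welch t-test, the t.test default). It uses the question's sample counts.

```r
# Sample data from the question: 15 daily counts.
df <- data.frame(
  Date  = seq(as.Date("2022-01-01"), by = "day", length.out = 15),
  Count = seq(5, 117, by = 8)
)

# Hypothetical helper: t-test of the window ending at row i
# against the window of the same width that precedes it.
roll_p <- function(x, width = 3) {
  n <- length(x)
  p <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    if (i < 2 * width) next  # not enough history for two full windows
    cur  <- x[(i - width + 1):i]
    prev <- x[(i - 2 * width + 1):(i - width)]
    # t.test() errors on constant data, so fall back to NA in that case
    p[i] <- tryCatch(t.test(cur, prev)$p.value, error = function(e) NA_real_)
  }
  p
}

df$p <- roll_p(df$Count, width = 3)
df
```

The first five rows stay NA because two full windows are not yet available. With only three points per window the test has very little power, so the p-values are best read as a rough indicator.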

You can create a grouping column (grp) and then apply a t.test to each consecutive pair of groups:
library(dplyr)
d <- d %>% mutate(grp=rep(1:(n()/3), each=3))
d %>% left_join(
tibble(grp = 2:max(d$grp),
pval = sapply(2:max(d$grp), function(x) {
t.test(d %>% filter(grp==x) %>% pull(Count),
d %>% filter(grp==x-1) %>% pull(Count))$p.value
})
)) %>% group_by(grp) %>% slice_min(Date)
Output: (p-value is constant only because of the example data you provided)
Date Count grp pval
<date> <dbl> <int> <dbl>
1 2022-01-01 5 1 NA
2 2022-01-04 29 2 0.0213
3 2022-01-07 53 3 0.0213
4 2022-01-10 77 4 0.0213
5 2022-01-13 101 5 0.0213
Or a data.table approach:
setDT(d)[, `:=`(grp=rep(1:(nrow(d)/3), each=3),cy=shift(Count,3))] %>%
.[!is.na(cy), pval:=t.test(Count,cy)$p.value, by=grp] %>%
.[,.SD[1], by=grp, .SDcols=!c("cy")]
Output:
grp Date Count pval
<int> <Date> <num> <num>
1: 1 2022-01-01 5 NA
2: 2 2022-01-04 29 0.02131164
3: 3 2022-01-07 53 0.02131164
4: 4 2022-01-10 77 0.02131164
5: 5 2022-01-13 101 0.02131164

Related

How best to do this join in R?

Below is the sample data. I know that I have to do a left join. The question is how to have it only return values that match (indcodelist = indcodelist2) but with the highest codetype value.
indcodelist <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000)
estemp <- c(11,21,31,41,51,61,55,21,22,874)
projemp <- c(15,25,36,45,52,61,31,29,31,899)
nchg <- c(4,4,5,4,1,0,-24,8,9,25)
firsttable <- data.frame(indcodelist,estemp,projemp,nchg)
indcodelist2 <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000,110000,111000,112000,113000)
codetype <- c(18,18,18,18,18,18,18,18,18,18,10,10,10,10)
codetitle <- c("Accountant","Doctor","Lawyer","Teacher","Economist","Financial Analyst","Meteorologist","Dentist", "Editor","Veterinarian","Accounting Technician","Doctor","Lawyer","Teacher")
secondtable <- data.frame(indcodelist2,codetype,codetitle)
tried <- left_join(firsttable,secondtable, by =c(indcodelist = "indcodelist2"))
Desired Result
indcodelist estemp projemp nchg codetitle
110000 11 15 4 Accountant
111000 21 25 4 Doctor

If you only want values that match in both tables, inner_join might be what you’re looking for. You can see this answer to understand different types of joins.
To get the highest codetype, you can use dplyr::slice_max(). Be aware the default behavior is to return values that tie. If there is more than one codetitle at the same codetype, they’ll all be returned.
library(tidyverse)
firsttable %>%
inner_join(., secondtable, by = c("indcodelist" = "indcodelist2")) %>%
group_by(indcodelist) %>%
slice_max(codetype)
#> # A tibble: 10 × 6
#> # Groups: indcodelist [10]
#> indcodelist estemp projemp nchg codetype codetitle
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian
Created on 2022-09-15 by the reprex package (v2.0.1)
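For comparison, the same "highest codetype per key, then join" logic can be sketched in base R. This is a swapped-in technique, not the answer's dplyr code, and it runs on a shortened copy of the question's tables. The intermediate names ord, best and res are made up for illustration. Unlike slice_max() with its default tie handling, !duplicated() keeps exactly one row per key.

```r
# Shortened copies of the question's tables.
firsttable <- data.frame(
  indcodelist = c(110000, 111000, 112000),
  estemp      = c(11, 21, 31)
)
secondtable <- data.frame(
  indcodelist2 = c(110000, 111000, 112000, 110000, 111000),
  codetype     = c(18, 18, 18, 10, 10),
  codetitle    = c("Accountant", "Doctor", "Lawyer",
                   "Accounting Technician", "Doctor"),
  stringsAsFactors = FALSE
)

# Sort so that the highest codetype comes first within each code,
# then keep only the first row per code.
ord  <- secondtable[order(secondtable$indcodelist2, -secondtable$codetype), ]
best <- ord[!duplicated(ord$indcodelist2), ]

# merge() performs an inner join by default.
res <- merge(firsttable, best, by.x = "indcodelist", by.y = "indcodelist2")
res
```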

You might use {powerjoin}:
library(powerjoin)
power_inner_join(
firsttable,
secondtable |> summarize_by_keys(dplyr::across()[which.max(codetype),]),
by = c("indcodelist" = "indcodelist2")
)
#> indcodelist estemp projemp nchg codetype codetitle
#> 1 110000 11 15 4 18 Accountant
#> 2 111000 21 25 4 18 Doctor
#> 3 112000 31 36 5 18 Lawyer
#> 4 113000 41 45 4 18 Teacher
#> 5 114000 51 52 1 18 Economist
#> 6 115000 61 61 0 18 Financial Analyst
#> 7 121000 55 31 -24 18 Meteorologist
#> 8 210000 21 29 8 18 Dentist
#> 9 211000 22 31 9 18 Editor
#> 10 315000 874 899 25 18 Veterinarian

How do I create a categorical variable with multiple values from an existing variable?

I am working with a survey dataset in R where participants are asked what their present-day religion is. However, I am only interested in whether a person is religious or not. The question has 14 answer categories: categories 1-8, 11, 13 and 14 are religious, and categories 9, 10 and 12 are non-religious. How do I create a new variable that merges these subcategories, so that religious is displayed as 1 and non-religious as 0? I have tried the ifelse function, but that didn't really work.
Thank you all for your help

Please find below a reprex which gives you a solution using the data.table library
Reprex
Simulated data
set.seed(58) # set seed to make the reprex reproducible!
df <- data.frame(ID = c(seq(1:100)), # ID of participants
Answer_cat = c(sample(1:14, 100, replace = TRUE))) # simulated answers
Code
library(data.table)
setDT(df)[, Religion_or_Not := fifelse(Answer_cat %in% c(9,10,12),0,1)][]
Output
#> ID Answer_cat Religion_or_Not
#> 1: 1 3 1
#> 2: 2 13 1
#> 3: 3 6 1
#> 4: 4 2 1
#> 5: 5 2 1
#> 6: 6 9 0
#> 7: 7 4 1
#> 8: 8 12 0
#> 9: 9 9 0
#> 10: 10 1 1
#> 11: 11 4 1
#> 12: 12 1 1
#> 13: 13 10 0
#> 14: 14 12 0
#> 15: 15 4 1
#> 16: 16 3 1
#> 17: 17 12 0
#> 18: 18 14 1
#> 19: 19 8 1
#> 20: 20 7 1
#> 21: 21 3 1
#> 22: 22 3 1
#> 23: 23 11 1
#> 24: 24 9 0
#> 25: 25 6 1
#> 26: 26 1 1
#> 27: 27 11 1
#> 28: 28 1 1
#> 29: 29 13 1
#> 30: 30 1 1
#> 31: 31 9 0
#> 32: 32 1 1
#> 33: 33 6 1
#> 34: 34 13 1
#> 35: 35 9 0
#> 36: 36 9 0
#> 37: 37 14 1
#> 38: 38 8 1
#> 39: 39 1 1
#> 40: 40 3 1
#> 41: 41 4 1
#> 42: 42 12 0
#> 43: 43 12 0
#> 44: 44 7 1
#> 45: 45 12 0
#> 46: 46 12 0
#> 47: 47 2 1
#> 48: 48 14 1
#> 49: 49 8 1
#> 50: 50 9 0
#> 51: 51 13 1
#> 52: 52 4 1
#> 53: 53 12 0
#> 54: 54 2 1
#> 55: 55 12 0
#> 56: 56 9 0
#> 57: 57 13 1
#> 58: 58 6 1
#> 59: 59 9 0
#> 60: 60 13 1
#> 61: 61 7 1
#> 62: 62 11 1
#> 63: 63 6 1
#> 64: 64 2 1
#> 65: 65 14 1
#> 66: 66 9 0
#> 67: 67 12 0
#> 68: 68 7 1
#> 69: 69 4 1
#> 70: 70 9 0
#> 71: 71 12 0
#> 72: 72 6 1
#> 73: 73 13 1
#> 74: 74 7 1
#> 75: 75 4 1
#> 76: 76 9 0
#> 77: 77 12 0
#> 78: 78 4 1
#> 79: 79 7 1
#> 80: 80 5 1
#> 81: 81 7 1
#> 82: 82 7 1
#> 83: 83 12 0
#> 84: 84 6 1
#> 85: 85 10 0
#> 86: 86 6 1
#> 87: 87 9 0
#> 88: 88 1 1
#> 89: 89 10 0
#> 90: 90 6 1
#> 91: 91 12 0
#> 92: 92 9 0
#> 93: 93 14 1
#> 94: 94 14 1
#> 95: 95 12 0
#> 96: 96 1 1
#> 97: 97 3 1
#> 98: 98 6 1
#> 99: 99 7 1
#> 100: 100 6 1
#> ID Answer_cat Religion_or_Not
Created on 2021-11-06 by the reprex package (v2.0.1)
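Since the question mentions ifelse, here is a minimal base R sketch of the same recoding. The vector name answer_cat is hypothetical. One design point worth knowing: %in% returns FALSE for NA, so a missing answer would be coded as 1 (religious) unless it is handled explicitly, which this sketch does by keeping it NA.

```r
# Hypothetical answers, including one missing value.
answer_cat <- c(1, 9, 10, 12, 14, NA)

non_religious <- c(9, 10, 12)

# 0 for non-religious categories, 1 otherwise, NA stays NA.
religious <- ifelse(
  is.na(answer_cat),
  NA_integer_,
  as.integer(!(answer_cat %in% non_religious))
)
religious
```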

Keep most recent observations when there are duplicates in R

I have the following data.
date var1 level score_1 score_2
2020-02-19 12:10:52.166661 dog n1 1 3
2020-02-19 12:17:25.087898 dog n1 3 6
2020-02-19 12:34:27.624939 dog n2 4 3
2020-02-19 12:35:50.522116 cat n1 2 0
2020-02-19 12:38:49.547181 cat n2 3 4
There should be just one observation for any combination of var1 and level. I want to eliminate duplicates and keep only the most recent records. In the example above, the first row should be eliminated because the dog-n1 record in row 2 is more recent. However, I want to keep row 3 even though var1 is also "dog", because its level is different.
So, what I want to obtain:
date var1 level score_1 score_2
2020-02-19 12:17:25.087898 dog n1 3 6
2020-02-19 12:34:27.624939 dog n2 4 3
2020-02-19 12:35:50.522116 cat n1 2 0
2020-02-19 12:38:49.547181 cat n2 3 4

Using tidyverse
df %>%
group_by(var1, level) %>%
filter(date == max(date)) %>%
ungroup()

In base R, use duplicated. Looks like your data is already sorted by date, so you can use
df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]
(By default, duplicated will give FALSE for the first occurrence of anything, and TRUE for every other occurrence. Setting fromLast = TRUE reverses the direction, so the last occurrence is kept.)
If you're not sure your data is already sorted, sort it first!
df = df[order(df$var1, df$level, df$date), ]
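A self-contained run of this approach on the question's five rows might look as follows. It is a sketch under assumptions: the timestamps are truncated to whole seconds and parsed as UTC, and the result name latest is made up.

```r
# The question's data, with timestamps truncated to whole seconds.
df <- data.frame(
  date = as.POSIXct(c("2020-02-19 12:10:52", "2020-02-19 12:17:25",
                      "2020-02-19 12:34:27", "2020-02-19 12:35:50",
                      "2020-02-19 12:38:49"), tz = "UTC"),
  var1    = c("dog", "dog", "dog", "cat", "cat"),
  level   = c("n1", "n1", "n2", "n1", "n2"),
  score_1 = c(1, 3, 4, 2, 3),
  score_2 = c(3, 6, 3, 0, 4),
  stringsAsFactors = FALSE
)

# Sort by date so that the last occurrence of each pair is the newest one.
df <- df[order(df$date), ]

# Keep the last row of every var1/level combination.
latest <- df[!duplicated(df[c("var1", "level")], fromLast = TRUE), ]
latest
```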

You can also use data.table approach as follows:
library(data.table)
setDT(df)[, .SD[which.max(date)], .(var1, level)]

Another tidyverse answer, using dplyr::slice_max().
To demonstrate with a reproducible example, here is flights data from nycflights13 package:
library(nycflights13) # for the data
library(dplyr, warn.conflicts = FALSE)
my_flights <- # a subset of 3 columns
flights |>
select(carrier, dest, time_hour)
my_flights # preview of the subset data
#> # A tibble: 336,776 × 3
#> carrier dest time_hour
#> <chr> <chr> <dttm>
#> 1 UA IAH 2013-01-01 05:00:00
#> 2 UA IAH 2013-01-01 05:00:00
#> 3 AA MIA 2013-01-01 05:00:00
#> 4 B6 BQN 2013-01-01 05:00:00
#> 5 DL ATL 2013-01-01 06:00:00
#> 6 UA ORD 2013-01-01 05:00:00
#> 7 B6 FLL 2013-01-01 06:00:00
#> 8 EV IAD 2013-01-01 06:00:00
#> 9 B6 MCO 2013-01-01 06:00:00
#> 10 AA ORD 2013-01-01 06:00:00
#> # … with 336,766 more rows
Grouping by carrier & dest, we can see many rows for each group.
my_flights |>
count(carrier, dest)
#> # A tibble: 314 × 3
#> carrier dest n
#> <chr> <chr> <int>
#> 1 9E ATL 59
#> 2 9E AUS 2
#> 3 9E AVL 10
#> 4 9E BGR 1
#> 5 9E BNA 474
#> 6 9E BOS 914
#> 7 9E BTV 2
#> 8 9E BUF 833
#> 9 9E BWI 856
#> 10 9E CAE 3
#> # … with 304 more rows
So if we want to deduplicate those in-group rows by taking the most recent time_hour value, we could utilize slice_max()
my_flights |>
group_by(carrier, dest) |>
slice_max(time_hour)
#> # A tibble: 329 × 3
#> # Groups: carrier, dest [314]
#> carrier dest time_hour
#> <chr> <chr> <dttm>
#> 1 9E ATL 2013-05-04 07:00:00
#> 2 9E AUS 2013-02-03 16:00:00
#> 3 9E AVL 2013-07-13 11:00:00
#> 4 9E BGR 2013-10-17 21:00:00
#> 5 9E BNA 2013-12-31 15:00:00
#> 6 9E BOS 2013-12-31 14:00:00
#> 7 9E BTV 2013-09-01 12:00:00
#> 8 9E BUF 2013-12-31 18:00:00
#> 9 9E BWI 2013-12-31 19:00:00
#> 10 9E CAE 2013-12-31 09:00:00
#> # … with 319 more rows
By the same token, we could have used slice_min() to get the rows with the earliest time_hour value.

How to use R to find the closest date before a specific date of matrix A? And 14 days after the specific date?

I’m a newbie in R.
I have two datasets, A and B.
A <- data.table::fread(
"
V1 DATE ID
1 7/16/11 a
2 2/18/09 b
3 3/25/08 c
")
B <- data.table::fread(
"
V1 DATE ID Value
1 2013-06-13 a 109
2 2017-08-22 a 86
3 2017-09-15 a 88
4 2008-11-05 a 78
5 2009-02-17 a 74
6 2009-03-09 a 84
7 2009-03-17 a 81
8 2009-04-14 a 57
9 2009-04-21 a 65
10 2009-05-12 a 54
11 2009-06-08 a 54
12 2009-08-27 a 68
13 2011-08-26 b 199
14 2011-12-07 b 174
15 2012-01-31 b 66
16 2012-02-15 b 58
17 2012-04-17 b 59
18 2012-12-21 b 78
19 2013-01-14 b 91
20 2014-03-12 b 74
21 2014-08-28 b 98
22 2014-10-18 b 112
23 2010-12-15 b 36
24 2011-08-26 b 199
25 2011-12-07 b 174
26 2012-01-31 b 66
27 2012-02-15 b 58
28 2012-04-17 b 59
29 2015-05-08 c 105
30 2006-03-27 c 69
31 2007-03-12 c 104
32 2007-11-09 c 63
33 2008-03-25 c 239
34 2008-04-04 c 446
35 2008-04-09 c 354
36 2008-04-10 c 365
37 2008-04-11 c 366
38 2008-04-18 c 273
39 2008-04-28 c 271
40 2008-05-06 c 262
41 2008-05-19 c 72
42 2008-05-24 c 86
43 2008-06-20 c 47
44 2008-07-10 c 46
45 2008-08-06 c 55
46 2008-09-01 c 58
47 2008-09-29 c 56
48 2008-10-30 c 53
49 2008-12-09 c 71
50 2008-12-18 c 63
51 2009-01-14 c 60
52 2009-02-21 c 58
53 2009-03-28 c 54
54 2009-04-29 c 56
55 2009-04-30 c 59
56 2009-06-23 c 64
57 2009-07-24 c 69
58 2009-08-17 c 73
59 2009-10-04 c 127
60 2009-11-26 c 289
61 2009-12-02 c 277
62 2009-12-08 c 230
")
I have tried for weeks to use R to:
find the value from B where ID == A$ID and B$DATE is the closest date before, or the same date as, A$DATE;
The expected result is: ID=c, DATE=2008-03-25, Value=239
find the value from B where ID == A$ID and B$DATE is 14 days after A$DATE. If there is no record exactly 14 days later, take the value at the closest later date (e.g. 15, 16 or 17 days after A$DATE).
The expected result is: ID=c, DATE=2008-04-09, Value=354

Both questions can be answered using a rolling join from data.table.
However, there are two important steps in preparing the data.
The date strings need to be converted to class IDate (or Date) to allow for date arithmetic. (IDate uses an integer representation to save memory).
The dataframes need to be coerced to data.table to enable the enhanced syntax. setDT() coerces a dataframe or tibble to data.table by reference, i.e., without copying.
BTW: The sample datasets provided by the OP were already data.tables as the OP had used the data.table::fread() function.
Data preparation:
library(data.table)
setDT(A)[, DATE := as.IDate(DATE, "%m/%d/%y")]
setDT(B)[, DATE := as.IDate(DATE)]
Now, we can apply the rolling join:
B[A, on = .(ID, DATE), roll = +Inf, .(ID, DATE, Value)]
ID DATE Value
1: a 2011-07-16 68
2: b 2009-02-18 NA
3: c 2008-03-25 239
The result can be verified by printing B in proper order B[order(ID, DATE)]. The earliest date for ID == "b" in B is 2011-08-26. So, there is no date in B on or before 2009-02-18.
Please, note that the value in the DATE column is the reference date A$DATE, not the matching B$DATE.
Edit after clarification of the expected result by the OP:
The second question can also be solved by a rolling join, but the code requires three modifications:
The reference dates A$DATE need to be shifted 14 days later.
We need a backward rolling join because the OP wants to find the closest date in B on or after the shifted reference date.
According to the OP's expected result, the result should contain the matching B$DATE.
With the additional requirements we get
B[A[, .(ID, DATE = DATE + 14)], on = .(ID, DATE), roll = -Inf, .(ID, DATE = x.DATE, Value)]
ID DATE Value
1: a 2013-06-13 109
2: b 2010-12-15 36
3: c 2008-04-09 354
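To make the matching rule of the two rolling joins explicit, here is a base R sketch of the same logic on a small excerpt of the data. It is not the data.table implementation. The helper pick and its arguments shift and before are hypothetical names, and the loop is meant for illustration rather than speed.

```r
# Excerpt of the question's data, dates already converted.
A <- data.frame(ID = c("b", "c"),
                DATE = as.Date(c("2009-02-18", "2008-03-25")),
                stringsAsFactors = FALSE)
B <- data.frame(
  ID    = c("b", "b", "c", "c", "c", "c"),
  DATE  = as.Date(c("2010-12-15", "2011-08-26", "2007-11-09",
                    "2008-03-25", "2008-04-04", "2008-04-09")),
  Value = c(36, 199, 63, 239, 446, 354),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for row i of A, return the row of B with the same ID
# whose date is closest on or before (before = TRUE) or on or after
# (before = FALSE) the reference date plus `shift` days.
pick <- function(i, shift = 0, before = TRUE) {
  ref  <- A$DATE[i] + shift
  cand <- B[B$ID == A$ID[i], ]
  cand <- if (before) cand[cand$DATE <= ref, ] else cand[cand$DATE >= ref, ]
  if (nrow(cand) == 0) {
    # No qualifying record: return an NA row, as the rolling join does.
    return(data.frame(ID = A$ID[i], DATE = as.Date(NA), Value = NA_real_,
                      stringsAsFactors = FALSE))
  }
  d   <- as.numeric(cand$DATE)
  hit <- if (before) which.max(d) else which.min(d)
  cand[hit, ]
}

q1 <- do.call(rbind, lapply(seq_len(nrow(A)), pick))
q2 <- do.call(rbind, lapply(seq_len(nrow(A)), pick, shift = 14, before = FALSE))
q1
q2
```

Unlike the first rolling join above, q1 here reports the matching B$DATE rather than the reference date.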

A solution using dplyr:
q1 and q2 correspond to your two questions.
library(dplyr)
A$DATE <- as.Date(A$DATE,format = "%m/%d/%y")
B$DATE <- as.Date(B$DATE)
BA <- left_join(B,A, by= c("ID"="ID"))
q1 <- BA %>%
filter(ID %in% A$ID) %>%
filter(DATE.x <= DATE.y) %>% # <= keeps a record dated on the reference date itself
group_by(ID) %>%
arrange(desc(DATE.x)) %>%
slice(1)
q2 <- BA %>%
filter(ID %in% A$ID) %>%
group_by(ID) %>%
filter(as.numeric(DATE.x) - as.numeric(DATE.y) >= 14)
q1
#> # A tibble: 2 x 6
#> # Groups: ID [2]
#> V1.x DATE.x ID Value V1.y DATE.y
#> <int> <date> <chr> <int> <int> <date>
#> 1 12 2009-08-27 a 68 1 2011-07-16
#> 2 33 2008-03-25 c 239 3 2008-03-25
q2
#> # A tibble: 48 x 6
#> # Groups: ID [3]
#> V1.x DATE.x ID Value V1.y DATE.y
#> <int> <date> <chr> <int> <int> <date>
#> 1 1 2013-06-13 a 109 1 2011-07-16
#> 2 2 2017-08-22 a 86 1 2011-07-16
#> 3 3 2017-09-15 a 88 1 2011-07-16
#> 4 13 2011-08-26 b 199 2 2009-02-18
#> 5 14 2011-12-07 b 174 2 2009-02-18
#> 6 15 2012-01-31 b 66 2 2009-02-18
#> 7 16 2012-02-15 b 58 2 2009-02-18
#> 8 17 2012-04-17 b 59 2 2009-02-18
#> 9 18 2012-12-21 b 78 2 2009-02-18
#> 10 19 2013-01-14 b 91 2 2009-02-18
#> # ... with 38 more rows

Using Prophet Package to Predict By Group in Dataframe in R

I am using the new package released from Facebook called Prophet. It does time series predictions and I want to apply this function By Group.
Scroll down to R Section.
https://facebookincubator.github.io/prophet/docs/quick_start.html
This is my attempt:
grouped_output = df %>% group_by(group) %>%
do(m = prophet(df[,c(1,3)])) %>%
do(future = make_future_dataframe(m, period = 7)) %>%
do(forecast = prophet:::predict.prophet(m, future))
grouped_output[[1]]
I then need to extract the results from the list of each group which I am having trouble doing.
Below is my original dataframe without the groups:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90)
y<-as.numeric(y)
df <- data.frame(ds, y)
df
ds y
1 2016-11-01 15
2 2016-11-02 17
3 2016-11-03 18
4 2016-11-04 19
5 2016-11-05 20
6 2016-11-06 54
7 2016-11-07 67
8 2016-11-08 23
9 2016-11-09 12
10 2016-11-10 34
11 2016-11-11 12
12 2016-11-12 78
13 2016-11-13 34
14 2016-11-14 12
15 2016-11-15 3
16 2016-11-16 45
17 2016-11-17 67
18 2016-11-18 89
19 2016-11-19 12
20 2016-11-20 111
21 2016-11-21 123
22 2016-11-22 112
23 2016-11-23 14
24 2016-11-24 566
25 2016-11-25 345
26 2016-11-26 123
27 2016-11-27 567
28 2016-11-28 56
29 2016-11-29 87
30 2016-11-30 90
The current function works when I do it to a single group as follows:
#install.packages('prophet')
library(prophet)
m<-prophet(df)
future <- make_future_dataframe(m, period = 7)
forecast <- prophet:::predict.prophet(m, future)
forecast$yhat
[1] -2.649032 -29.762095 128.169781 59.573684 -11.623727 107.473617 -29.949730 -42.862455 -62.378408 104.797639 46.868610
[12] -12.502864 119.282058 -4.914921 -4.402638 -10.643570 169.309505 123.321261 74.734746 215.856347 99.290218 105.508059
[23] 102.882915 284.245984 237.401258 185.688202 321.466962 197.451536 194.280518 180.535663 349.304365 288.684031 222.337210
[34] 342.968499 203.648851 185.377165
I now want to change this so that it applies the prophet:::predict function to each group. So the NEW dataframe BY GROUP looks like this:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30',
'2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90,
45,23,12,10,21,34,12,45,12,44,87,45,32,67,1,57,87,99,33,234,456,123,89,333,411,232,455,55,90,21)
y<-as.numeric(y)
group<-c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
df <- data.frame(ds,group, y)
df
ds group y
1 2016-11-01 A 15
2 2016-11-02 A 17
3 2016-11-03 A 18
4 2016-11-04 A 19
5 2016-11-05 A 20
6 2016-11-06 A 54
7 2016-11-07 A 67
8 2016-11-08 A 23
9 2016-11-09 A 12
10 2016-11-10 A 34
11 2016-11-11 A 12
12 2016-11-12 A 78
13 2016-11-13 A 34
14 2016-11-14 A 12
15 2016-11-15 A 3
16 2016-11-16 A 45
17 2016-11-17 A 67
18 2016-11-18 A 89
19 2016-11-19 A 12
20 2016-11-20 A 111
21 2016-11-21 A 123
22 2016-11-22 A 112
23 2016-11-23 A 14
24 2016-11-24 A 566
25 2016-11-25 A 345
26 2016-11-26 A 123
27 2016-11-27 A 567
28 2016-11-28 A 56
29 2016-11-29 A 87
30 2016-11-30 A 90
31 2016-11-01 B 45
32 2016-11-02 B 23
33 2016-11-03 B 12
34 2016-11-04 B 10
35 2016-11-05 B 21
36 2016-11-06 B 34
37 2016-11-07 B 12
38 2016-11-08 B 45
39 2016-11-09 B 12
40 2016-11-10 B 44
41 2016-11-11 B 87
42 2016-11-12 B 45
43 2016-11-13 B 32
44 2016-11-14 B 67
45 2016-11-15 B 1
46 2016-11-16 B 57
47 2016-11-17 B 87
48 2016-11-18 B 99
49 2016-11-19 B 33
50 2016-11-20 B 234
51 2016-11-21 B 456
52 2016-11-22 B 123
53 2016-11-23 B 89
54 2016-11-24 B 333
55 2016-11-25 B 411
56 2016-11-26 B 232
57 2016-11-27 B 455
58 2016-11-28 B 55
59 2016-11-29 B 90
60 2016-11-30 B 21
How do I predict using the prophet package, the y-hat by group rather than in total?

Here is a solution using tidyr::nest to nest the data by group, fit the models in those groups using purrr::map and then retrieving the y-hat as requested.
I took your code, but incorporated it into mutate calls that compute new columns using purrr::map.
library(prophet)
library(dplyr)
library(purrr)
library(tidyr)
d1 <- df %>%
nest(-group) %>%
mutate(m = map(data, prophet)) %>%
mutate(future = map(m, make_future_dataframe, period = 7)) %>%
mutate(forecast = map2(m, future, predict))
Here is the output at this point:
d1
# A tibble: 2 × 5
group data m future
<fctr> <list> <list> <list>
1 A <tibble [30 × 2]> <S3: list> <data.frame [36 × 1]>
2 B <tibble [30 × 2]> <S3: list> <data.frame [36 × 1]>
# ... with 1 more variables: forecast <list>
Then I use unnest() to retrieve the data from the forecast column and select the y-hat value as requested.
d <- d1 %>%
unnest(forecast) %>%
select(ds, group, yhat)
And here is the output for the newly forecasted values:
d %>% group_by(group) %>%
top_n(7, ds)
Source: local data frame [14 x 3]
Groups: group [2]
ds group yhat
<date> <fctr> <dbl>
1 2016-11-30 A 180.53422
2 2016-12-01 A 349.30277
3 2016-12-02 A 288.68215
4 2016-12-03 A 222.33501
5 2016-12-04 A 342.96654
6 2016-12-05 A 203.64625
7 2016-12-06 A 185.37395
8 2016-11-30 B 131.07827
9 2016-12-01 B 222.83703
10 2016-12-02 B 236.33555
11 2016-12-03 B 145.41001
12 2016-12-04 B 228.59687
13 2016-12-05 B 162.49244
14 2016-12-06 B 68.44477
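The nest/map pipeline is a split-apply-combine pattern, which can also be sketched in base R. To keep the sketch runnable without prophet installed, a plain lm() trend is used as a stand-in model; that is an assumption for illustration only and not what prophet fits. The helper forecast_group, its horizon argument and the toy data are hypothetical.

```r
# Toy data with two groups (not the question's values).
df <- data.frame(
  ds    = rep(seq(as.Date("2016-11-01"), by = "day", length.out = 10), times = 2),
  group = rep(c("A", "B"), each = 10),
  y     = c(1:10, seq(20, 2, by = -2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fit a stand-in model on one group and
# forecast `horizon` days past that group's last date.
forecast_group <- function(d, horizon = 7) {
  d$day <- as.numeric(d$ds)
  fit <- lm(y ~ day, data = d)          # stand-in for prophet(d)
  future <- data.frame(ds = max(d$ds) + seq_len(horizon))
  future$day   <- as.numeric(future$ds)
  future$yhat  <- as.numeric(predict(fit, newdata = future))
  future$group <- d$group[1]
  future[c("ds", "group", "yhat")]
}

# Split by group, forecast each piece, and stack the results.
forecasts <- do.call(rbind, lapply(split(df, df$group), forecast_group))
forecasts
```

Swapping the lm() lines for prophet(), make_future_dataframe() and predict() inside the helper would give the per-group prophet forecast, assuming each group's data frame has the ds and y columns prophet expects.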

I was looking for a solution for the same problem. I came up with the following code, which is a bit simpler than the accepted answer.
library(tidyr)
library(dplyr)
library(prophet)
data = df %>%
group_by(group) %>%
do(predict(prophet(.), make_future_dataframe(prophet(.), periods = 7))) %>%
select(ds, group, yhat)
And here are the predicted values
data %>% group_by(group) %>%
top_n(7, ds)
# A tibble: 14 x 3
# Groups: group [2]
ds group yhat
<date> <fctr> <dbl>
1 2016-12-01 A 316.9709
2 2016-12-02 A 258.2153
3 2016-12-03 A 196.6835
4 2016-12-04 A 346.2338
5 2016-12-05 A 208.9083
6 2016-12-06 A 216.5847
7 2016-12-07 A 206.3642
8 2016-12-01 B 230.0424
9 2016-12-02 B 268.5359
10 2016-12-03 B 190.2903
11 2016-12-04 B 312.9019
12 2016-12-05 B 266.5584
13 2016-12-06 B 189.3556
14 2016-12-07 B 168.9791
