glmmTMB with autocorrelation of irregular times

I'm putting together a glmmTMB model. I have data collected at a single site over the course of May, every year, for 4 years. Time resolution within a year ranges from a few minutes apart (or even the same minute) to days apart.
The covariance vignette says that the ar1() structure requires a regular time series, but the ou(times + 0 | group) structure can handle irregular times. That said, it looks like the times argument is a factor; how does that work with an irregular time structure?
So, for example, is this a correct use of the ou() structure?
df <- structure(list(DayYear = c(234, 220, 234, 231, 243, 229, 228,
223, 220, 218, 234, 237, 234, 231, 241, 237, 241, 241, 233, 234,
234, 232, 218, 227, 232, 229, 220, 223, 228, 224), DateTime = structure(c(1495477980,
1399590540, 1495479780, 1495225920, 1464631980, 1495052760, 1463324460,
1494525780, 1494256560, 1494088440, 1495471320, 1495730940, 1495476960,
1495225200, 1432919940, 1495725900, 1432924200, 1432918860, 1495384020,
1495479900, 1463848140, 1495298820, 1399420080, 1463253000, 1463692920,
1495037040, 1494275160, 1494510780, 1463348220, 1494597180), class = c("POSIXct",
"POSIXt"), tzone = ""), Year = c(2017, 2014, 2017, 2017, 2016,
2017, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2015, 2017,
2015, 2015, 2017, 2017, 2016, 2017, 2014, 2016, 2016, 2017, 2017,
2017, 2016, 2017), N = c(2, 2, 7, 2, 6, 4, 1, 4, 1, 3, 1, 6,
2, 2, 2, 2, 5, 5, 3, 5, 3, 2, 4, 1, 6, 2, 2, 3, 5, 2)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
# create sampling factor within year (requires dplyr and glmmTMB)
library(dplyr)
library(glmmTMB)
df <- df %>%
  arrange(DateTime) %>%
  group_by(Year) %>%
  mutate(times = 1:n()) %>%
  ungroup() %>%
  mutate(YearF = as.factor(Year),
         times = numFactor(times))
mod1 <- glmmTMB(N ~ DayYear + YearF +
                  ou(times + 0 | YearF),
                family = nbinom2,
                data = df)
This particular model doesn't run too well because the toy dataset is so tiny (and probably doesn't show what I need it to show), but is that a correct specification of the autocorrelation structure for an irregular time series?

No, it's not: you have to use decimal times/dates in numFactor. The way you've done it coerces the data set to be equally spaced. Below I use lubridate::decimal_date(DateTime) %% 1 to get the fraction-of-year variable that's used as the time coordinate.
library(dplyr)
library(lubridate)
library(glmmTMB)
df2 <- (df
    %>% arrange(DateTime)
    %>% group_by(Year)
    %>% mutate(times = lubridate::decimal_date(DateTime) %% 1)
    %>% ungroup()
)
df3 <- (df2
    %>% mutate(YearF = as.factor(Year),
               times = glmmTMB::numFactor(times))
    %>% select(N, DayYear, YearF, times)
)
mod1 <- glmmTMB(N ~ DayYear + YearF +
                  ou(times + 0 | YearF),
                family = nbinom2,
                data = df3)
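As for the "times argument is a factor" confusion: numFactor() stores the numeric coordinates in the factor's level labels, and glmmTMB parses them back out internally, so irregular spacing survives even though the variable is a factor. A minimal illustration using numFactor() and its companion parseNumLevels(), both exported by glmmTMB:
library(glmmTMB)
# the level labels carry the numeric coordinates, e.g. "(0.25)"
f <- numFactor(c(0.10, 0.25, 0.70))
levels(f)
# recover the numeric time coordinates from the labels
parseNumLevels(levels(f))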

Related

Easy method for checking for duplicates and errata in panel dataset

Imagine a dataframe:
df1 <- tibble::tribble( ~City, ~Year, ~Burger_cost, ~Cola_cost, ~Resident_AVGGrowth_cm,
"Abu Dhabi", 2020, 2, 3, 175,
"Abu Dhabi", 2019, 1, 3, 174,
"Abu Dhabi", 2018, 1, 2, 173,
"New York", 2020, 4, 5, 500,
"New York", 2019, 3, 5, 184,
"New York", 2018, 2, 3, 183,
"Abu Dhabi", 2020, 2, 3, 175,
"Abu Dhabi", 2019, 1, 3, 174,
"Abu Dhabi", 2018, 1, 2, 173,
"Abu Dhabi", 2017, 1, NA, 100,
"London", 2020, 5, 6, 186,
"London", 2019, 4, 6, 188,
"London", 2018, 3, 5, 185,
"New York", 2020, 4, 5, 185,
"New York", 2019, 3, 5, 184,
"New York", 2018, 3, 3, 183,
"London", 2020, 5, 6, 186,
"London", 2019, 4, 6, 188,
"London", 2018, 3, 5, 185)
There can be many such cases. For example, the data for London is identical across all years, so we can delete the duplicates. The data for Abu Dhabi is identical for 2018-2020 but differs for 2017.
The data for New York contains a discrepancy in year 2018 for the burger cost (and in the growth of a city resident).
The resident-growth value is surely erroneous in row 4 for New York (the discrepancy is too large), but it has a duplicate in row 16; in this case we would prefer to delete row 4 and keep row 16, based on the criterion that no person can be shorter than 50 cm or taller than 4 meters [400 cm] (especially in the mean values :)).
Rows 6 and 16 (for New York, 2018) contain different values for the burger cost, which can only be resolved by a human (say, the 3 USD variant is correct, but R needs to show the discrepancy to the end user).
The question: can you suggest nice and neat methods for performing these operations? What do you usually use to solve such issues?
I have just started to think about this issue.
It is an easy task in C#. I have several strategies in mind, but I would be extremely interested to know what approaches you use to solve such issues :) Any ideas are much appreciated :)
Perhaps this helps. Grouped by 'City', create a flag for outliers in 'Resident_AVGGrowth_cm' using boxplot. Then add 'Year' to the grouping and create another flag for unique values across the columns Burger_cost to Resident_AVGGrowth_cm, using n_distinct looped over with if_all (which returns TRUE for a row only if every column tested returns TRUE). Finally, grouped by 'City' again, filter out the cities where all rows are duplicates (e.g. London), and remove the rows with the outlier flag.
library(dplyr)
df1 %>%
  group_by(City) %>%
  mutate(outlier_flag = Resident_AVGGrowth_cm %in%
           boxplot(Resident_AVGGrowth_cm, plot = FALSE)$out) %>%
  group_by(Year, .add = TRUE) %>%
  mutate(flag_all_unq = if_all(Burger_cost:Resident_AVGGrowth_cm,
                               ~ n_distinct(.x, na.rm = TRUE) == 1)) %>%
  group_by(City) %>%
  filter(!all(flag_all_unq)) %>%
  ungroup() %>%
  filter((!outlier_flag) | flag_all_unq)
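To also show the discrepancy to the end user rather than silently filtering, a small follow-up sketch (reusing df1 from above) could list the rows a human still needs to resolve, e.g. the conflicting burger costs for New York 2018:
library(dplyr)
# rows where the same City/Year reports more than one burger cost
df1 %>%
  group_by(City, Year) %>%
  filter(n_distinct(Burger_cost, na.rm = TRUE) > 1) %>%
  ungroup()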

Negative binomial regression of month trend

I read a paper about negative binomial regression: "We modelled the number of E coli bloodstream infections and E coli UTIs per month using negative-binomial regression (incorporating overdispersion), assuming the same underlying population (no offset)." The paper includes a figure of the monthly counts with a fitted trend line.
I also have a set of data and want to plot the infections by month/year like that figure. How can I do that? Thank you very much.
df <- structure(list(Year = c(2013, 2013, 2013, 2013, 2013, 2013, 2013,
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014,
2014, 2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2015, 2015), Month = c(1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), Incidence = c(2.25538216197745,
3.49502862307924, 2.76311704439615, 2.9836483329794, 3.09375,
3.0368028900429, 3.82920688208141, 3.9154960734432, 3.33517393705135,
3.54593329432417, 3.27586206896552, 3.25655281969817, 3.35912052117264,
3.21672101986362, 2.78237182605312, 2.58435732397113, 2.72516428295323,
3.1227603153476, 2.6300688599847, 2.66324718879463, 2.62653374233129,
2.45256358498183, 2.39520958083832, 3.58683926645092, 3.41995942421022,
3.61001317523057, 2.62718158187895, 2.86944045911047, 2.77978993118435,
2.89282762420792, 2.69410829432029, 3.22232223222322, 3.39818882811799,
3.36725958337297, 2.90030211480363, 3.20789124668435), Inpatient = c(8779,
6638, 9663, 9418, 9600, 8858, 9532, 9041, 9055, 8545, 9280, 10072,
9824, 6746, 10279, 10254, 10348, 9767, 10456, 10138, 10432, 9908,
9853, 11124, 10351, 7590, 10772, 11152, 11044, 10889, 11321,
11110, 11153, 10513, 11585, 12064), infection = c(198, 232, 267,
281, 297, 269, 365, 354, 302, 303, 304, 328, 330, 217, 286, 265,
282, 305, 275, 270, 274, 243, 236, 399, 354, 274, 283, 320, 307,
315, 305, 358, 379, 354, 336, 387)), row.names = c(NA, -36L), class = c("tbl_df",
"tbl", "data.frame"))
Reference:
Vihta K-D, Stoesser N, Llewelyn MJ, et al. Trends over time in Escherichia coli bloodstream infections, urinary tract infections, and antibiotic susceptibilities in Oxfordshire, UK, 1998–2016: a study of electronic health records. The Lancet Infectious Diseases, 2018, 18(10): 1138–1149.
Using the data above, one can do the following:
library(MASS) # for function glm.nb
library(ggplot2)
library(broom) # for tidy model outputs
Create a date column to make plotting easy:
df$t <- as.Date(paste("01", df$Month, df$Year, sep = "-"), format = "%d-%m-%Y")
Plot the data. geom_smooth adds the trend line and confidence intervals, using the date as the predictor.
p <- ggplot(data = df, aes(x = t, y = infection)) +
  geom_point() +
  geom_smooth(method = "glm.nb")
p
To perform the regression, set the count of infections as the dependent variable and the nth month as the independent variable, called month_as_integer below.
df$month_as_integer <- seq_along(df$Month)
m1 <- glm.nb(infection ~ month_as_integer, data = df)
Using tidy from the broom package, one can get the estimates and confidence intervals as a data frame.
out1 <- as.data.frame(tidy(m1, exponentiate = TRUE, conf.int = TRUE) )
out1
              term   estimate   std.error  statistic     p.value   conf.low  conf.high
1      (Intercept)  264.44399 0.048006493 116.184897 0.000000000 240.943378 290.556355
2 month_as_integer    1.00697 0.002250993   3.085763 0.002030303   1.002569   1.011394
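Since exponentiate = TRUE was used, the month_as_integer estimate is a rate ratio: infections increase by about (1.00697 - 1) * 100 ≈ 0.7% per month, with a 95% CI of roughly 0.26% to 1.14% per month.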

Forecast model giving odd MAPE values, can someone please tell me if this is correct?

I ran this script as part of a forecasting project for school, but I got some odd results, especially with the MAPE values. What it's supposed to do is predict international terrorism incidents for the next 12 months. Can anyone tell me if this report is accurate or if I missed something? I tried to include the graphs, but I don't think they can be posted here.
Thanks
library(ggplot2)
library(forecast)
library(tseries)
library(reprex)
terror <- tibble::tribble(
~imonth, ~iyear, ~monthly,
1, 2015, 1534,
2, 2015, 1295,
3, 2015, 1183,
4, 2015, 1277,
5, 2015, 1316,
6, 2015, 1168,
7, 2015, 1263,
8, 2015, 1290,
9, 2015, 1107,
10, 2015, 1269,
11, 2015, 1172,
12, 2015, 1091,
1, 2016, 1162,
2, 2016, 1153,
3, 2016, 1145,
4, 2016, 1120,
5, 2016, 1353,
6, 2016, 1156,
7, 2016, 1114,
8, 2016, 1162,
9, 2016, 1045,
10, 2016, 1140,
11, 2016, 1114,
12, 2016, 923,
1, 2017, 879,
2, 2017, 879,
3, 2017, 961,
4, 2017, 856,
5, 2017, 1081,
6, 2017, 1077,
7, 2017, 994,
8, 2017, 968,
9, 2017, 838,
10, 2017, 805,
11, 2017, 804,
12, 2017, 749
)
# aggregated data
terror_byMonth_Train <- ts(data = terror$monthly,
                           start = c(2015, 1),
                           end = c(2016, 12),
                           frequency = 12)
terror_byMonth_Test <- ts(data = terror$monthly,
                          start = c(2017, 1),
                          end = c(2017, 12),
                          frequency = 12)
# arima instead of exp smooth
m_arima <- auto.arima(terror_byMonth_Train)
#> Warning in value[[3L]](cond): The chosen test encountered an error, so no
#> seasonal differencing is selected. Check the time series data.
# fit exp smooth model
m_ets = ets(terror_byMonth_Train)
# Get length of terror_byMonth_Test set
size <- length(terror_byMonth_Test)
# forecast for 2017 using multiple forecast (Davis Style)
f_arima_multi <- m_arima %>%
forecast(h = size)
f_arima_multi %>%
autoplot()
# forecast ARIMA 2017 (Original Style)
f_arima<-forecast(m_arima,h=12)
f_arima %>%
autoplot()
# forecast ETS 2017
f_ets = forecast(m_ets, h=12)
f_ets %>%
autoplot()
# check accuracy ETS
acc_ets <- accuracy(m_ets)
#check accuracy ARIMA, between train and test sets
acc_arima_TrainVSTest <- accuracy(f_arima_multi, x = terror_byMonth_Test)
# check accuracy ARIMA
acc_arima <- accuracy(f_arima)
# MAPE(ETS)= 20.03 < MAPE(ARIMA) = 22.05
# ETS model chosen
# Compare to 2017 data
accuracy(f_ets, terror_byMonth_Test)
#> ME RMSE MAE MPE MAPE MASE
#> Training set -14.30982 90.08823 70.06438 -1.606862 5.900178 0.5790445
#> Test set 303.53575 316.03133 303.53575 23.986363 23.986363 2.5085599
#> ACF1 Theil's U
#> Training set 0.0008690031 NA
#> Test set -0.2148651254 2.356116
Created on 2019-02-13 by the reprex package (v0.2.1)
The issue is in how you defined terror_byMonth_Test. It should be, e.g.,
terror_byMonth_Test <- ts(data = tail(terror$monthly, 12),
                          start = c(2017, 1),
                          end = c(2017, 12),
                          frequency = 12)
That is, simply providing start and end dates isn't enough for ts to know which 12 of the 36 observations in terror$monthly to take; as written, your test set silently picked up the first 12 values (the 2015 data). This fix reduces the test-set MAPE to 10.4%.
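Alternatively, a sketch of the same split (assuming, as here, that the rows are already in date order): build a single ts over the full series and split it with window(), so the dates and the data can't get out of sync.
terror_all <- ts(terror$monthly, start = c(2015, 1), frequency = 12)
# window() selects by date, so each set gets the right observations
terror_byMonth_Train <- window(terror_all, end = c(2016, 12))
terror_byMonth_Test <- window(terror_all, start = c(2017, 1))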

Subset by increasing count over time in R

I have a data frame that contains counts of sales by seller, year, and month, called sales_by_month:
library(tidyverse)
sales_by_month <- tribble(
~Seller, ~Year, ~Month, ~Sales,
"John Doe", 2018, 1, 82,
"John Doe", 2018, 2, 72,
"John Doe", 2018, 3, 42,
"Sally Jane", 2018, 1, 25,
"Sally Jane", 2018, 2, 77)
I need to subset this dataset to only those sellers whose sales are increasing over time, and I cannot figure out how to do it.
The resulting subset should contain:
Seller Year Month Sales
Sally Jane 2018 1 25
Sally Jane 2018 2 77
Because Sally's sales are increasing, while John's sales are decreasing.
Any help would be very much appreciated!!
First, make a variable representing the difference in Sales (I name it dif). If dif < 0, there is a decrease somewhere in that seller's sales.
library(dplyr)
sales_by_month %>%
  arrange(Seller, Year, Month) %>%
  group_by(Seller) %>%
  mutate(dif = c(0, diff(Sales))) %>%
  filter(all(dif >= 0)) %>%
  select(-dif) # drop dif
# Seller      Year Month Sales
# <chr>      <dbl> <dbl> <dbl>
# 1 Sally Jane 2018     1    25
# 2 Sally Jane 2018     2    77
More concise:
sales_by_month %>%
  group_by(Seller) %>%
  arrange(Seller, Year, Month) %>%
  filter(all(c(0, diff(Sales)) >= 0))
How to do this really depends on how you want to define "increasing over time". One way is to require a month-over-month increase. My solution just looks at whether the last month had an increase, but this could be changed to look at it in different ways:
1. Calculate the change every month, filter to just the last month and a positive change, then pull out the unique Seller names.
2. Filter for the seller names obtained in step 1.
The code below does this, along with a data frame you can load straight into R:
library(tidyverse)
sales_by_month <- tribble(
~Seller, ~Year, ~Month, ~Sales,
"John Doe", 2018, 1, 82,
"John Doe", 2018, 2, 72,
"John Doe", 2018, 3, 42,
"Sally Jane", 2018, 1, 25,
"Sally Jane", 2018, 2, 77)
increased_from_last_month <- sales_by_month %>%
  group_by(Seller) %>%
  arrange(Seller, Year, Month) %>%
  mutate(change = Sales - lag(Sales, default = 0)) %>%
  summarise_all(last) %>%
  filter(change > 0) %>%
  pull(Seller) %>%
  unique()
sales_by_month %>%
  filter(Seller %in% increased_from_last_month)
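If "increasing over time" should instead mean an overall upward trend rather than an increase in every month, a hedged sketch of another definition: fit a per-seller linear trend of Sales against the observation index and keep sellers with a positive slope.
library(dplyr)
sales_by_month %>%
  arrange(Seller, Year, Month) %>%
  group_by(Seller) %>%
  # slope of Sales over time; > 0 means an overall increase
  filter(coef(lm(Sales ~ seq_along(Sales)))[2] > 0) %>%
  ungroup()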

Error in svd(ctr) : infinite or missing values in 'x'

I'm trying to run svd on some data and I'm getting an error. I saw another post suggesting that this can happen when one or more of the columns are all 0, but that is not the case here. Can someone please explain what is going on and how to fix it? Note that this is a subset of a much larger dataset. Thank you.
year <- c(2015, 2015, 2015, 2015, 2015, 2015)
week <- c(1, 1, 1, 1, 1, 1)
flight_type_name <- c("Commercial", "Filler", "Label", "Commercial", "Filler", "Filler")
userdata_country <- c("NO", "SG", "NI", "None", "CA", "GT")
platform <- c("iphone", "linux", "iphone", "linux", "web", "ipad")
num_users <- c("26726, 2, 161, 1, 4316, 577")
impressions <- c(135019, 0, 312, 0, 37014, 11492)
clicks <- c(407, 2, 2, 2, 59, 25)
ctr <- data.frame(year, week, flight_type_name, userdata_country, platform, num_users, impressions, clicks)
svd(ctr)
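For what it's worth: svd() needs a numeric matrix, and as.matrix(ctr) produces a character matrix here because the data frame mixes non-numeric (character/factor) and numeric columns; note also that num_users as written is a single string, not six numbers. Every entry of a character matrix fails the is.finite() check inside svd(), which is what triggers this error. A minimal sketch of one fix, keeping only the numeric columns:
# select the numeric columns before calling svd()
num_cols <- sapply(ctr, is.numeric)
svd(as.matrix(ctr[, num_cols]))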
