How to calculate a running slope for rlm using runner?

I have a data frame "customers" built from customer id, month, and total purchases in that month.
I'm trying to calculate a running slope for a window of 12 months using robust regression.
I have tried the following:
library(MASS); library(dplyr); library(runner)
Coef <- function(x) rlm(cbind(x) ~ cbind(1:length(x)))$coefficients[2]
customer_slope <- customers %>% mutate(slope = runner(x = total_purchases, k = 12, f = Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I expected:
Coef(c(4, 11, 7, 15, 5, 14, 8, 9, 14, 17, 14, 13))
# cbind(1:length(x))
#          0.6888112

I ran into similar problems and finally arrived at the solution below using slider. It produces a rolling estimate over a 3-observation window (which you can of course change as you see fit). This doesn't quite get to your answer (which you could probably get with loops), but it's most of the way there.
library(MASS)
library(dplyr)
library(slider)
dat <- tibble::tibble(customers = c(4, 11, 7, 15, 5, 14, 8, 9, 14, 17, 14, 13)) %>%
  mutate(t = as.numeric(1:n()))

dat %>%
  mutate(results = slide_dbl(.x = .,
                             .f = ~ rlm(customers ~ t, data = .x)$coefficients[2],
                             .before = 2,        # window = current row plus 2 before
                             .complete = TRUE))  # NA until a full window is available;
                                                 # the window size is set here, not via rlm's k

It looks like that's the way to go, thanks!
It seems that what caused the singularity was that I didn't change .complete from its default FALSE to TRUE, so the incomplete leading windows were passed to rlm.
So, combined with your suggestion, this is how I made it work (it took about two hours for 3M rows; I did, however, have a more complex group_by involved, which is not shown below):
slope_rlm <- function(x) {
  x <- as.numeric(x)
  prep <- tibble(data = x) %>% mutate(t = as.numeric(1:n()))
  rlm(data ~ t, data = prep)$coefficients[2]
}

customers_rlm <- customers %>%
  mutate(cust_rlm_12 = slide_dbl(total_purchases, slope_rlm,
                                 .before = 11, .complete = TRUE))
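Alternatively, staying with runner, guarding against short windows also avoids the singular fits. A minimal sketch, assuming (as the .complete fix suggests) that the incomplete leading windows are the culprit:

Coef_safe <- function(x) {
  # Windows with fewer than 3 points cannot support a robust two-parameter fit
  if (length(x) < 3) return(NA_real_)
  rlm(x ~ seq_along(x))$coefficients[2]
}

customer_slope <- customers %>%
  mutate(slope = runner(x = total_purchases, k = 12, f = Coef_safe))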

Consider data for two customers spanning 1000 days (100 observations). total_purchases is cumulative within each customer, and each purchase size is ~ Pois(5).
set.seed(1)
customers <- data.frame(
  id = factor(rep(1:2, length.out = 100)),
  date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
  group_by(id) %>%
  mutate(
    total_purchases = cumsum(rpois(n(), lambda = 5))
  )
When calculating a regression in a rolling window, make sure that you handle errors coming from insufficient degrees of freedom, singularity, etc. That is why I've put tryCatch around the rlm call: if there is any error, the function returns NA for the failing window.
The data below are grouped by id, which means that the model is fitted per customer. Because purchases accumulate in steps of ~ Pois(5), the yearly rolling regression should converge to a slope of 5 (+/- random error).
customers <- customers %>%
  group_by(id) %>%
  mutate(
    slope = runner(
      x = .,
      f = function(x) {
        tryCatch(
          rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
          error = function(e) NA
        )
      },
      idx = "date",
      k = "year"
    )
  )
Plotting the slope over time for both customers:
library(ggplot2)
ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
  geom_line() +
  geom_hline(yintercept = 5, color = "red")

Related

Dealing with factor in detection covariates in unmarked package

I am trying to compare (with a statistical test) the probability of detection according to a categorical covariate with 3 levels, say A, B, and C, in a very simple occupancy model. I am interested in the 3 pairwise comparisons. Is there any way to get this result from the unmarked package?
I decided to circumvent this problem by changing the reference level in order to use the test provided by the summary output of the fitted occupancy model. However, even though my covariate is specified as a factor, the function unmarkedFrameOccu does not recognize it and I get the following warning:
Warning message:
obsCovs contains characters. Converting them to factors.
As a consequence, the conversion is done by unmarked and I have no way to control the reference level.
Do you have any idea what the issue might be and why unmarkedFrameOccu does not recognize my factor? I add a reproducible example below. I found a way to deal with my main issue by recoding the categorical variable as 1/0, but I am still curious about this warning.
Thanks for your help and time, and here is the example.
Marie
library(tidyverse)
library(unmarked)
n_occas <- 6
n_sites <- 10
p_detect <- c(0.3, 0.8, 0.5)
cov <- matrix(factor(sample(c("A", "B", "C"), size = 60, replace = TRUE)),
              ncol = n_occas, nrow = n_sites)
sites <- sample(c(0, 1), size = n_sites, replace = TRUE)

## generate actual detection according to presence state
y <- cov %>%
  as_tibble() %>%
  add_column(truth = sites) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(cols = c(-truth, -id), names_to = "occas", values_to = "cov") %>%
  mutate(detection = case_when(
    # one uniform draw per row (runif(n()) rather than a single shared draw)
    cov == "A" ~ truth * (runif(n()) < p_detect[1]),
    cov == "B" ~ truth * (runif(n()) < p_detect[2]),
    cov == "C" ~ truth * (runif(n()) < p_detect[3])
  )) %>%
  select(-cov) %>%
  pivot_wider(names_from = occas, values_from = detection, values_fill = NA) %>%
  select(-truth, -id)

unmarkedFrameOccu(y = y,
                  obsCovs = list(cov1 = cov))
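For what it's worth, one likely explanation (my reading, consistent with the warning text in recent R versions): an R matrix cannot store a factor, so wrapping factor() inside matrix() silently drops the factor class and leaves a character matrix, which is exactly what unmarkedFrameOccu then complains about. A quick check:

f <- factor(sample(c("A", "B", "C"), 6, replace = TRUE))
is.factor(f)             # TRUE
m <- matrix(f, ncol = 2)
is.factor(m)             # FALSE: matrix() dropped the factor class
typeof(m)                # "character"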

Change normal regression model to rolling regression

I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variable.
I've coded a regression model and it works. As the next step I would like to do a rolling regression: how do I need to adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
library(dplyr)
library(tidyr)
library(purrr)

nest_data <- raw_data %>%
  group_by(product, country) %>%
  nest()

# function
lm_function <- function(data, formula) {
  lm(formula = formula, data = data)
}

# regression
lm_data <- nest_data %>%
  mutate(lm = map(
    .x = data,
    .f = lm_function,
    formula = sales ~ clicks + needs + guesses + interests
  ))

# show solution
solution_data <- lm_data %>%
  mutate(solution = map(lm, sw_tidy)) %>%
  unnest(solution) %>%
  select(-data, -lm)

# where to put?!?
rollapply(lm_function, width = 10, roll)
You could try the roll_lm function from the roll package. See the description of the roll package on CRAN.
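A minimal sketch of how roll_lm could slot into the nested workflow above, assuming raw_data has the columns from your formula (the 10-observation width is illustrative):

library(roll)
library(dplyr)
library(tidyr)
library(purrr)

rolling_fits <- raw_data %>%
  group_by(product, country) %>%
  nest() %>%
  mutate(roll_coefs = map(data, function(d) {
    # roll_lm returns rolling coefficients, one row per observation
    # (NA until `width` observations are available)
    roll_lm(
      x = as.matrix(d[, c("clicks", "needs", "guesses", "interests")]),
      y = d$sales,
      width = 10
    )$coefficients
  }))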

GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names)

I am trying to fit Polish local government election results from 2015, following the superb blog post by @GavinSimpson: https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6-digit identifier (there may be leading 0's), which I kept as a text variable. EDIT: I simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.
library(tidyverse)
library(sf)
library(mgcv)

# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file
boroughs_shp <- st_read("../../_mapy/gminy.shp", options = "ENCODING=WINDOWS-1250",
                        stringsAsFactors = FALSE) %>%
  st_transform(crs = 4326) %>%
  janitor::clean_names() %>%
  # st_simplify(preserveTopology = T, dTolerance = 0.01) %>%
  mutate(teryt = str_sub(jpt_kod_je, 1, 6)) %>%
  select(teryt, nazwa = jpt_nazwa, geometry)

# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
  readxl::read_excel("data/2015-gl-lis-gm.xls", trim_ws = TRUE, col_names = TRUE) %>%
  janitor::clean_names() %>%
  select(teryt, liczba_wyborcow, glosy_niewazne)

elections <-
  boroughs_shp %>%
  fortify() %>%
  left_join(elections_xls, by = "teryt") %>%
  arrange(teryt) %>%
  mutate(idx = as.factor(seq.int(nrow(.))),
         teryt = as.factor(teryt))

# Neighbors
boroughs_nb <- spdep::poly2nb(elections, snap = 0.01, queen = FALSE,
                              row.names = elections$idx)
names(boroughs_nb) <- attr(boroughs_nb, "region.id")

# Model
ctrl <- gam.control(nthreads = 4)
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)),
          data = elections,
          offset = log(liczba_wyborcow), # number of voters
          method = 'REML',
          control = ctrl,
          family = betar())
Here is the error message:
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) :
mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
the condition has length > 1 and only the first element will be used
elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?
EDIT: The condition mentioned in the error message is in fact met:
> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE
It seems that I solved the issue, though as a stats beginner I may not quite understand how.
First, not a single NA should be present in the modeled data, and there was one. After removing it, mgcv seemed to run, but it took a long time (a quarter of an hour) and, inexplicably to me, only when I limited the number of knots to k = 50, with poor results (fewer or more knots and it did not return any result) and with a warning to be cautious about the results.
Then I removed offset = log(liczba_wyborcow) (the offset on the number of voters) and made the number of void votes per 1000 voters my response variable:
elections <-
  boroughs_shp %>%
  left_join(elections_xls, by = "teryt") %>%
  na.omit() %>%
  arrange(teryt) %>%
  mutate(idx = as.factor(row_number())) %>%
  mutate(void_ratio = round(glosy_niewazne / liczba_wyborcow, 3) * 1000)
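To confirm the NA is gone (and to see where it hid in the first place), a quick check on the joined data is enough. A sketch, with st_drop_geometry only there to set aside sf's sticky geometry column:

boroughs_shp %>%
  sf::st_drop_geometry() %>%
  left_join(elections_xls, by = "teryt") %>%
  summarise(across(c(glosy_niewazne, liczba_wyborcow), ~ sum(is.na(.x))))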
Now that the response is a count, why not try changing family = betar() in the gam formula to poisson()? Still not a good result. Then to the negative binomial, family = nb().
Now my formula looks like this:
m1 <-
  gam(
    void_ratio ~ s(
      idx,
      bs = 'mrf',
      k = 500,
      xt = list(nb = boroughs_nb),
      fx = TRUE
    ),
    data = elections,
    method = 'REML',
    control = gam.control(nthreads = 4),
    family = nb()
  )
It now runs blazingly fast and returns valid results with no warnings or errors. On a laptop with a 4-core Intel Core i7-6820HQ @ 2.70GHz, 16GB RAM, and Win10, it now takes 1-2 minutes to build the model.
In brief, what I changed was: remove a single NA, remove the offset from the formula, and use the negative binomial distribution.
Here is the result of what I wanted to achieve (figure not shown): from left to right, the real rate of void votes, the rate smoothed by the model, and the residuals indicating discrepancies. The mgcv code let me do that.
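For reference, a sketch of how one such map panel can be drawn (it assumes m1 and elections from above; fitted(m1) lines up row-for-row because the NA was dropped before fitting):

library(ggplot2)
elections %>%
  mutate(smoothed = fitted(m1)) %>%
  ggplot() +
  geom_sf(aes(fill = smoothed), colour = NA) +
  scale_fill_viridis_c(name = "void votes / 1000")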

Self-made function works in test but not for my actual data set

I am working with functions. I wrote a function for basal area:
ba <- function(dbh, na.rm) {
  stopifnot(is.numeric(dbh))
  answer <- dbh^2 * 0.005454
  return(answer)
}
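For example, on a small test vector (illustrative values):

ba(c(10, 12, 15))
# [1] 0.545400 0.785376 1.227150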
As shown, the function works with a test vector. Now I am trying to do some summaries of a dataset I have (copied from R, with spacing modified to read more easily):
plot.summary <- trees %>%
  group_by(MU, Plot, Inv) %>%
  summarize(year = first(Year),
            arithmetic.mean = my.mean(dbh, na.rm = TRUE),
            quadratic.mean = my.q.mean(dbh, na.rm = TRUE),
            var = my.var(dbh, na.rm = TRUE),
            n.trees = n())
When I run it, it says:
Error in summarise_impl(.data, dots) :
  Column `basal.area` must be length 1 (a summary value), not 19
I am not sure why: the data set has only 18 columns. The command works perfectly fine when I do not include the basal area part, so I am not sure what I might be missing.
Thank you for any help!
The variables you refer to in the group_by call are not in the built-in dataset trees, so I've taken some liberties to create a reproducible example that hopefully fits your needs.
Assuming you wanted to group by a variable like Height, here is a working example:
library(dplyr)

plot.summary <- trees %>%
  group_by(Height) %>%
  summarise(mean.basal.area = mean(ba(Girth)),
            n.trees = n())
In the above, your function ba is wrapped in mean, which yields the mean basal area for the set of Girth values that share the same Height. This also explains your original error: ba() returns one value per tree (presumably 19 trees in the offending group), whereas summarize() needs a length-1 result per group, hence the wrapping in an aggregating function.
Is that the kind of thing you want?

Using ARIMA with exogenous regressors for outlier detection in R

I would like to detect outliers in real-time data that is aggregated per hour. For this example, I've selected the hourly pedestrian data from Melbourne, Australia
(Pedestrian volume (updated monthly), Pedestrian Counting System)
I understand there are a large number of existing detection algorithms, which in time I'll learn and use.
In the short term I'd like to use the simplest approach. One such method is outlined by @Aksakal in the following Stack Exchange post:
What algorithm should I use to detect anomalies on time-series?
I think the key is the "unexpected" qualifier in your graph. In order to detect the unexpected you need to have an idea of what's expected. I would start with a simple time series model such as AR(p) or ARMA(p,q). Fit it to data, add seasonality as appropriate. For instance, your SAR(1)(24) model could be: $y_{t}=c+\phi y_{t-1}+\Phi_{24}y_{t-24}+\Phi_{25}y_{t-25}+\varepsilon_t$, where $t$ is time in hours. So, you'd be predicting the graph for the next hour. Whenever the prediction error $e_t=y_t-\hat y_t$ is "too big" you throw an alert.

When you estimate the model you'll get the variance $\sigma_\varepsilon$ of the error $\varepsilon_t$. Depending on your distributional assumptions, such as normal, you can set the threshold based on the probability, such as $|e_t|<3\sigma_\varepsilon$ for 99.7% or one-sided $e_t>3\sigma_\varepsilon$.

The number of visitors is probably quite persistent, but super seasonal. It might work better to try seasonal dummies instead of the multiplicative seasonality, then you'd try ARMAX where X stands for exogenous variables, which could be anything like holiday dummy, hour dummies, weekend dummies etc.
Unfortunately the post does not go into details, hence I have a few questions:
Q.1) How do I calculate/extract the variance $\sigma_\varepsilon$ of the ARIMA error term $\varepsilon_t$ from the fitted model produced by auto.arima(data, xreg = xreg)?
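(For what it's worth, a minimal sketch of the quoted alert rule, assuming fit is the object returned by auto.arima() in the example below; an Arima fit stores the innovations variance in fit$sigma2. With lambda = 0 the residuals are on the Box-Cox (log) scale.)

sigma_eps <- sqrt(fit$sigma2)            # sd of the innovations, sigma_epsilon
e <- residuals(fit)                      # in-sample one-step errors e_t
alerts <- which(abs(e) > 3 * sigma_eps)  # two-sided 3-sigma rule (~99.7%)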
Below is a complete R example that uses multiple seasonality to capture daily, weekly and yearly seasonality. This is not optimised and is only presented as an example implementation to help answer question 2.
I wish to predict the thresholds for a whole year (or at least for a 30-day period), which means h = 24 * 30 = 720.
In essence, I want to forecast not the mean of the hourly pedestrian counts but the upper expected number of pedestrians per hour (e.g. $3\sigma_\varepsilon$ above the point forecast) for h >> 1 (e.g. h = 720 hours (30 days) or even h = 24 * 365 = 8760 hours (1 year)).
Q.2) How can I achieve this using the method above?
Example code to help solve the above questions.
library(rwalkr)
library(forecast)
library(tidyverse)
library(tsibble)
library(xts)
library(dygraphs)
pedestrian <- as_tibble(rwalkr::run_melb(year = c(2015:2018)))

pedestrian_statelibrary <- pedestrian %>%
  filter(Sensor == "State Library") %>%
  left_join(tsibble::holiday_aus(2015:2018, state = 'VIC'), by = c('Date' = 'date')) %>%
  mutate(holiday = replace_na(holiday, ''),
         # Replace all counts of zero with NA so the Box-Cox transform
         # (lambda = 0) works and the output is constrained to be positive.
         Count = ifelse(Count == 0, NA, Count))

pedestrian_statelibrary_train <- pedestrian_statelibrary %>%
  filter(Date >= as.Date('2015-05-13'), Date < as.Date('2017-01-01'))
pedestrian_statelibrary_test <- pedestrian_statelibrary %>%
  filter(Date >= as.Date('2017-01-01'))

# tsbox functions to convert the tsibble to ts indirectly. Must be a better way of doing this...
pedestrian_statelibrary_train_zoo <- tsbox::ts_zoo(pedestrian_statelibrary_train %>% select(Date_Time, Count))
pedestrian_statelibrary_train_ts <- tsbox::ts_ts(pedestrian_statelibrary_train_zoo)
pedestrian_statelibrary_test_zoo <- tsbox::ts_zoo(pedestrian_statelibrary_test %>% select(Date_Time, Count))
pedestrian_statelibrary_test_ts <- tsbox::ts_ts(pedestrian_statelibrary_test_zoo)
## Create external regressors.
xreg_holidays_train <- model.matrix(~ as.factor(pedestrian_statelibrary_train$holiday))
xreg_holidays_train <- xreg_holidays_train[, -1] # remove intercept
# Remove 1st level from levels()
colnames(xreg_holidays_train) <- levels(as.factor(pedestrian_statelibrary_train$holiday))[-1]

xreg_holidays_test <- model.matrix(~ as.factor(pedestrian_statelibrary_test$holiday))
xreg_holidays_test <- xreg_holidays_test[, -1] # remove intercept
colnames(xreg_holidays_test) <- levels(as.factor(pedestrian_statelibrary_test$holiday))[-1]

# Periods (samples per period) for hourly data.
period_day <- 24
period_week <- 24 * 7
period_year <- 24 * 365.25
seasonal_periods <- c(period_day, period_week, period_year)

pedestrian_statelibrary_train_msts <- msts(pedestrian_statelibrary_train_ts,
                                           start = start(pedestrian_statelibrary_train_ts),
                                           seasonal.periods = seasonal_periods)
pedestrian_statelibrary_test_msts <- msts(pedestrian_statelibrary_test_ts,
                                          start = start(pedestrian_statelibrary_test_ts),
                                          seasonal.periods = seasonal_periods)

# Set the number of Fourier terms per season. Not optimal.
Ks <- c(12, 10, 2)
xreg_train <- cbind(seasonality = fourier(pedestrian_statelibrary_train_msts, K = Ks),
                    holidays = xreg_holidays_train)
######################################
## Fit regression with exogenous factors and ARIMA errors
######################################
fit <- pedestrian_statelibrary_train_msts %>%
  auto.arima(xreg = xreg_train,
             seasonal = FALSE,
             stepwise = FALSE,
             parallel = TRUE,
             num.cores = NULL,
             lambda = 0)

######################################
## Forecast
######################################
fc <- forecast(fit,
               xreg = cbind(seasonality = fourier(pedestrian_statelibrary_test_msts, K = Ks),
                            holidays = xreg_holidays_test))

######################################
## Check residuals and accuracy.
######################################
checkresiduals(fit)
checkresiduals(fc)
accuracy(fc, pedestrian_statelibrary_test_msts)
######################################
## Display fitted model and forecast using an interactive dygraph.
######################################
# Plotting a `forecast` prediction using `dygraphs`:
# https://stackoverflow.com/questions/43624634/plotting-forecast-prediction-using-dygraphs#43668603
as.forecast.ts <- function(forecast_obj) {
  training <- forecast_obj$x
  lower <- forecast_obj$lower[, 2]
  upper <- forecast_obj$upper[, 2]
  point_forecast <- forecast_obj$mean
  cbind(training, lower, upper, point_forecast)
}

fc_ts <- as.forecast.ts(fc)

# Add the time stamps back to the ts object.
idx_train <- pedestrian_statelibrary_train %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_test <- pedestrian_statelibrary_test %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_all <- rbind(idx_train, idx_test)

# Append testing values to the fc_ts object by left-joining two xts objects.
test_xts <- as.xts(x = pedestrian_statelibrary_test %>%
                     dplyr::ungroup() %>%
                     as.data.frame() %>%
                     dplyr::select(Count) %>%
                     dplyr::rename('testing' = 'Count'),
                   pedestrian_statelibrary_test$Date_Time)
fc_xts <- as.xts(x = fc_ts %>% as.data.frame(),
                 idx_all$Date_Time)
fc_xts <- fc_xts %>% xts::merge.xts(test_xts, join = 'left')

dygraph(data = fc_xts, main = "Pedestrian traffic forecasting for State Library.") %>%
  dyRangeSelector() %>%
  dySeries(name = "training", label = "Train") %>%
  dySeries(name = 'testing', label = "Test") %>%
  dySeries(name = "point_forecast", label = "Predicted") %>%
  dyLegend(show = "always", hideOnMouseOut = FALSE) %>%
  dyOptions(axisLineColor = "navy", gridLineColor = "grey")
