Using ARIMA with exogenous regressors for outlier detection in R

I would like to detect outliers in real-time data that is aggregated per hour. For this example I've selected the hourly pedestrian data from Melbourne, Australia ("Pedestrian volume (updated monthly)", Pedestrian Counting System).
I understand there are a large number of existing detection algorithms, which in time I'll learn and use.
In the short term I'd like to use the simplest approach. One such method is outlined by @Aksakal in the following Stack Exchange post:
What algorithm should I use to detect anomalies on time-series?
I think the key is "unexpected" qualifier in your graph. In order to
detect the unexpected you need to have an idea of what's expected.
I would start with a simple time series model such as AR(p) or
ARMA(p,q). Fit it to data, add seasonality as appropriate. For
instance, your SAR(1)(24) model could be: $y_{t} = c + \phi y_{t-1} + \Phi_{24} y_{t-24} + \Phi_{25} y_{t-25} + \varepsilon_t$, where $t$ is time in hours. So, you'd be predicting the graph for the next hour.
Whenever the prediction error $e_t=y_t-\hat y_t$ is "too big" you
throw an alert.
When you estimate the model you'll get the variance
$\sigma_\varepsilon$ of the error $\varepsilon_t$. Depending on your
distributional assumptions, such as normal, you can set the threshold
based on the probability, such as $|e_t|<3\sigma_\varepsilon$ for
99.7% or one-sided $e_t>3\sigma_\varepsilon$.
The number of visitors is probably quite persistent, but super
seasonal. It might work better to try seasonal dummies instead of the
multiplicative seasonality, then you'd try ARMAX where X stands for
exogenous variables, which could be anything like holiday dummy, hour
dummies, weekend dummies etc.
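(For concreteness, a minimal sketch of the kind of exogenous dummy matrix the quoted answer suggests; this is illustrative only, and timestamps is a hypothetical vector of hourly POSIXct time stamps, not an object from the example below, which uses Fourier terms and holiday dummies instead.)
library(forecast)
library(lubridate)
hour_of_day <- factor(lubridate::hour(timestamps))                    # 0-23
is_weekend  <- as.integer(lubridate::wday(timestamps) %in% c(1, 7))   # Sunday = 1, Saturday = 7
xreg_dummies <- cbind(model.matrix(~ hour_of_day)[, -1],              # 23 hour-of-day dummies
                      weekend = is_weekend)
# fit <- auto.arima(y, xreg = xreg_dummies)   # ARMAX-style fit with the dummies as X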
Unfortunately the post does not go into details, hence I have a few questions:
Q.1) How do I calculate/extract the variance $\sigma_\varepsilon$ of the ARIMA error term $\varepsilon_t$ from the fitted model produced by auto.arima(data, xreg = xreg)?
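For Q.1, a minimal sketch assuming the fitted object comes from forecast::auto.arima() as above: the estimated innovation variance is stored in the fit's sigma2 element.
library(forecast)
fit <- auto.arima(data, xreg = xreg)   # as in the question
sigma2_eps <- fit$sigma2               # estimated innovation variance
sigma_eps  <- sqrt(sigma2_eps)
# In-sample version of the 3-sigma rule from the quoted answer; note that if a
# Box-Cox lambda is used, these residuals are on the transformed scale.
which(abs(residuals(fit)) > 3 * sigma_eps)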
Below is a complete R example that uses multiple seasonality to capture daily, weekly and yearly seasonality. This is not optimised and is only presented as an example implementation to help answer question 2.
I wish to predict the thresholds for a whole year (or at least for a 30-day period), which means h = 24 hours * 30 = 720.
In essence, I want to forecast not the mean of the hourly pedestrian counts but the upper expected number of pedestrians per hour (e.g. $3\sigma_\varepsilon$ above the forecast) for h >> 1 (e.g. h = 720 hours (30 days) or even h = 24 * 365 = 8760 hours (1 year)).
Q.2) How can I achieve this using the method above? (A sketch of one approach follows the example code below.)
Example code to help solve the above questions.
library(rwalkr)
library(forecast)
library(tidyverse)
library(tsibble)
library(xts)
library(dygraphs)
pedestrian <- as_tibble(rwalkr::run_melb( year = c(2015:2018) ))
pedestrian_statelibrary <- pedestrian %>%
  filter(Sensor == "State Library") %>%
  left_join(tsibble::holiday_aus(2015:2018, state = 'VIC'), by = c('Date' = 'date')) %>%
  mutate(holiday = replace_na(holiday, ''),
         # Replace all counts of zero with NA so that the Box-Cox transform with
         # lambda = 0 (log) is defined and the output is constrained to be positive.
         Count = ifelse(Count == 0, NA, Count))
pedestrian_statelibrary_train <- pedestrian_statelibrary %>% filter(Date >= as.Date('2015-05-13'), Date < as.Date('2017-01-01') )
pedestrian_statelibrary_test <- pedestrian_statelibrary %>% filter(Date >= as.Date('2017-01-01') )
# tsbox functions to convert the tsibble to a ts object indirectly. There must be a better way of doing this...
pedestrian_statelibrary_train_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_train %>% select(Date_Time, Count) )
pedestrian_statelibrary_train_ts <- tsbox::ts_ts(pedestrian_statelibrary_train_zoo)
pedestrian_statelibrary_test_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_test %>% select(Date_Time, Count) )
pedestrian_statelibrary_test_ts <- tsbox::ts_ts(pedestrian_statelibrary_test_zoo)
## Create external regressors.
xreg_holidays_train <- model.matrix(~as.factor(pedestrian_statelibrary_train$holiday))
xreg_holidays_train <- xreg_holidays_train[,-1] # remove intercept.
# Remove 1st level from levels()
colnames(xreg_holidays_train) <- levels(as.factor(pedestrian_statelibrary_train$holiday))[-1]
xreg_holidays_test <- model.matrix(~as.factor(pedestrian_statelibrary_test$holiday))
xreg_holidays_test <- xreg_holidays_test[,-1] # remove intercept.
colnames(xreg_holidays_test) <- levels(as.factor(pedestrian_statelibrary_test$holiday))[-1]
# Seasonal periods (number of hourly samples per period).
period_day <- 24
period_week <- 24*7
period_year <- 24*365.25
seasonal_periods = c(period_day, period_week, period_year)
pedestrian_statelibrary_train_msts <- msts(pedestrian_statelibrary_train_ts,
                                           start = start(pedestrian_statelibrary_train_ts),
                                           seasonal.periods = seasonal_periods)
pedestrian_statelibrary_test_msts <- msts(pedestrian_statelibrary_test_ts,
                                          start = start(pedestrian_statelibrary_test_ts),
                                          seasonal.periods = seasonal_periods)
# Number of Fourier terms per seasonal period (not optimised).
Ks <- c(12, 10, 2)
xreg_train <- cbind(seasonality = fourier(pedestrian_statelibrary_train_msts, K = Ks),
                    holidays = xreg_holidays_train)
######################################
## Fit model of exogenous factors and ARIMA as error
######################################
fit <- pedestrian_statelibrary_train_msts %>%
  auto.arima(xreg = xreg_train,
             seasonal = FALSE,
             stepwise = FALSE,
             parallel = TRUE,
             num.cores = NULL,
             lambda = 0)
######################################
## Forecast
######################################
fc <- forecast(fit,
               xreg = cbind(seasonality = fourier(pedestrian_statelibrary_test_msts, K = Ks),
                            holidays = xreg_holidays_test))
######################################
## Check residuals and accuracy.
######################################
checkresiduals(fit)
checkresiduals(fc)
accuracy(fc, pedestrian_statelibrary_test_msts)
######################################
## Display fitted model and forecast using interactive dygraph.
######################################
# Plotting `forecast` prediction using `dygraphs`
# https://stackoverflow.com/questions/43624634/plotting-forecast-prediction-using-dygraphs#43668603
as.forecast.ts <- function(forecast_obj){
  training <- forecast_obj$x
  lower <- forecast_obj$lower[, 2]
  upper <- forecast_obj$upper[, 2]
  point_forecast <- forecast_obj$mean
  cbind(training, lower, upper, point_forecast)
}
fc_ts <- as.forecast.ts(fc)
# Add the time stamps back to ts object.
idx_train <- pedestrian_statelibrary_train %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_test <- pedestrian_statelibrary_test %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_all <- rbind(idx_train, idx_test)
# Append testing values to fc_ts object, by left joining two xts objects.
test_xts <- as.xts(x = pedestrian_statelibrary_test %>%
                     dplyr::ungroup() %>%
                     as.data.frame() %>%
                     dplyr::select(Count) %>%
                     dplyr::rename('testing' = 'Count'),
                   pedestrian_statelibrary_test$Date_Time)
fc_xts <- as.xts(x = fc_ts %>% as.data.frame(),
                 idx_all$Date_Time)
fc_xts <- fc_xts %>% xts::merge.xts(test_xts, join = 'left')
dygraph(data = fc_xts, main = "Pedestrian traffic Forecasting for State Library.") %>%
  dyRangeSelector() %>%
  dySeries(name = "training", label = "Train") %>%
  dySeries(name = "testing", label = "Test") %>%
  dySeries(name = "point_forecast", label = "Predicted") %>%
  dyLegend(show = "always", hideOnMouseOut = FALSE) %>%
  dyOptions(axisLineColor = "navy", gridLineColor = "grey")
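Returning to Q.2, a hedged sketch using the fit and test-period regressors defined above: request a ~99.7% prediction interval from forecast() and keep its upper bound as the hourly threshold. Unlike the fixed one-step 3-sigma rule in the quoted answer, this band widens with the horizon, because the multi-step forecast variance accumulates.
xreg_test <- cbind(seasonality = fourier(pedestrian_statelibrary_test_msts, K = Ks),
                   holidays = xreg_holidays_test)
fc_997 <- forecast(fit, xreg = xreg_test, level = 99.7)
hourly_threshold <- fc_997$upper[, 1]   # upper expected pedestrians per hour, back-transformed (lambda = 0)
# An observed count above hourly_threshold at the corresponding hour would raise an alert.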

Related

How to calculate running slope for rlm using runner?

I have a data frame "customers" built from customer id, month and total purchases that month.
I'm trying to calculate a running slope for a window of 12 months using robust regression.
I have tried the following:
Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
customer_slope = customers %>% mutate(slope = runner(x=total_purchases,k=12,f=Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I've expected:
Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))
cbind(1:length(x))
0.6888112
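(A likely explanation, offered as an editorial aside: by default runner also evaluates the incomplete leading windows, so the very first call receives a single observation, which cannot support an intercept plus a slope. A minimal illustration with the Coef() helper from the question:)
library(MASS)
Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
Coef(c(4, 11, 7))   # three points, two coefficients: fine
try(Coef(4))        # one point: rank-deficient design matrix -> the singular-fit error above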
So I ran into similar problems and finally came to the solution below using slider. It provides a 3-observation rolling estimate (you can of course change the window as you see fit). This doesn't quite get to your answer (which you could probably get with loops), but it goes most of the way there.
library(MASS)
library(dplyr)
library(slider)
dat <- tibble::tibble(customers = c(4, 11, 7, 15, 5, 14, 8, 9, 14, 17, 14, 13)) %>%
  mutate(t = 1:n() %>% as.numeric())
dat %>%
  mutate(results = slide_dbl(.x = .,
                             .f = ~rlm(customers ~ t, k = 12, data = .x)$coefficients[2],
                             .before = 2,
                             .complete = T))
It looks like that's the way to go, thanks!
It seems that what caused the singularity was that I didn't change the default .complete from F to T.
So, combined with your suggestion, this is how I made it work (it took about two hours for 3M rows; I did, however, have a more complex group_by involved, which is not shown below):
slope_rlm <- function(x) {
  x <- as.numeric(x)
  prep <- tibble(data = x) %>% mutate(t = 1:n() %>% as.numeric())
  return(rlm(data ~ t, data = prep)$coefficients[2])
}
customers_rlm <- customers %>%
  mutate(cust_rlm_12 = slide_dbl(total_purchases, slope_rlm, .before = 11, .complete = T))
Consider data for two customers spanning 1000 days. total_purchases is cumulative per customer, and each purchase size is ~Pois(5).
library(MASS)      # rlm
library(dplyr)
library(runner)
library(ggplot2)

set.seed(1)
customers <- data.frame(
  id = factor(rep(1:2, length.out = 100)),
  date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
  group_by(id) %>%
  mutate(
    total_purchases = cumsum(rpois(n(), lambda = 5))
  )
When calculating a regression in a rolling window, make sure you handle errors coming from insufficient degrees of freedom, singularity, etc. That is why I've put tryCatch around the rlm call: if there is any error, the function returns NA for the failing window.
The data below are grouped by id, which means the model is calculated per customer. The yearly rolling regression should converge to a slope of 5 (+/- random error).
customers <- customers %>%
  group_by(id) %>%
  mutate(
    slope = runner(
      x = .,
      f = function(x) {
        tryCatch(
          rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
          error = function(e) NA
        )
      },
      idx = "date",
      k = "year"
    )
  )
Plotting the slope over time for each customer:
ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
geom_line() +
geom_hline(yintercept = 5, color = "red")

Change normal regression model to rolling regression

I have a sample data set with number of sales for 1000 different products in 13 countries over 3 years:
13 Countries = (US, China, UK…)
1000 Products = (Prod A, B, C …)
Number of Sales --> my dependent variable.
Number of Clicks and 3 more variables as independent variable.
I've coded a regression model and it works. In the next step I would like to do a rolling regression: how should I adapt the code for that?
Thanks for your help and many greetings! :)
# prepare data
nest_dt = raw_data %>%
group_by(product, country) %>%
nest()
# function
lm_function = function(data, formula) {
lm(formula = formula,
data = data)
}
# regression
lm_data = nest_dt %>%
  mutate(lm = map(
    .x = data,
    .f = lm_function,
    formula = sales ~ clicks + needs + guesses + interests
  ))
# show solution
solution_data = lm_data %>%
mutate(solution = map(lm, sw_tidy)) %>%
unnest(solution) %>%
select(-data, -lm)
# where to put?!?
rollapply(lm_function, width=10, roll)
You could try the roll_lm function from the roll package. See the description here: Package 'roll' on CRAN.
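For completeness, a minimal sketch of how roll_lm might be wired into the grouped setup from the question. This is an illustration only: the column names (date, sales, clicks, needs, guesses, interests) and the 10-period window are assumptions taken from the question, not tested against your data.
library(dplyr)
library(roll)
rolling_lm_data <- raw_data %>%
  group_by(product, country) %>%
  arrange(date, .by_group = TRUE) %>%                 # assumes a date column orders the rows
  group_modify(~ {
    x <- as.matrix(.x[, c("clicks", "needs", "guesses", "interests")])
    fit <- roll_lm(x, .x$sales, width = 10)           # 10-period rolling OLS
    coefs <- as.data.frame(fit$coefficients)
    names(coefs) <- paste0("beta_", names(coefs))     # avoid clashing with the predictor columns
    bind_cols(.x, coefs)
  }) %>%
  ungroup()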

GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names)

I am trying to fit Polish local government election results in 2015, following the superb blog post by @GavinSimpson: https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6-digit identifier (there may be leading 0's) and kept it as a text variable. EDIT: I have simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.
library(tidyverse)
library(sf)
library(mgcv)
# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file
boroughs_shp <- st_read("../../_mapy/gminy.shp",options = "ENCODING=WINDOWS-1250",
stringsAsFactors = FALSE ) %>%
st_transform(crs = 4326)%>%
janitor::clean_names() %>%
# st_simplify(preserveTopology = T, dTolerance = 0.01) %>%
mutate(teryt=str_sub(jpt_kod_je, 1, 6)) %>%
select(teryt, nazwa=jpt_nazwa, geometry)
# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
readxl::read_excel("data/2015-gl-lis-gm.xls",
trim_ws = T, col_names = T) %>%
janitor::clean_names() %>%
select(teryt, liczba_wyborcow, glosy_niewazne)
elections <-
boroughs_shp %>% fortify() %>%
left_join(elections_xls, by = "teryt") %>%
arrange(teryt) %>%
mutate(idx = seq.int(nrow(.)) %>% as.factor(),
teryt = as.factor(teryt))
# Neighbors
boroughs_nb <-spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx )
names(boroughs_nb) <- attr(boroughs_nb, "region.id")
# Model
ctrl <- gam.control(nthreads = 4)
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)),
          data = elections,
          offset = log(liczba_wyborcow), # number of voters
          method = 'REML',
          control = ctrl,
          family = betar())
Here is the error message:
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) :
mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
the condition has length > 1 and only the first element will be used
elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?
EDIT: The condition mentioned in the error message is met:
> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE
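(Editorial aside: a compact pre-fit check covering both the name matching above and the stray-NA issue that, as the answer below explains, turned out to be the culprit; sf::st_drop_geometry() is assumed to be available.)
# nb names must match the levels of the smoothing factor
all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
# any NA in the modelled columns makes gam() drop rows and can break the mrf setup
colSums(is.na(sf::st_drop_geometry(elections)[, c("glosy_niewazne", "liczba_wyborcow")]))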
It seems that I have solved the issue, though, being a statistics beginner, I may not quite understand why it worked.
First, not a single NA should be present in the modelled data; there was one. After removing it, mgcv seemed to run, but it took a long time (a quarter of an hour) and, inexplicably to me, only when I limited the number of knots to k = 50, with poor results (with fewer or more knots it did not return any result) and a warning to be cautious about the results.
Then I removed offset = log(liczba_wyborcow), i.e. the offset for the number of voters, and made the number of void votes per 1000 voters my response variable.
elections <-
  boroughs_shp %>%
  left_join(elections_xls, by = "teryt") %>%
  na.omit() %>%
  arrange(teryt) %>%
  mutate(idx = row_number() %>% as.factor()) %>%
  mutate(void_ratio = round(glosy_niewazne / liczba_wyborcow, 3) * 1000)
Now that it is a count, why not change family = betar() in the gam formula to poisson()? Still not a good result, so I then tried the negative binomial, family = nb().
Now my model looks like this:
m1 <-
  gam(
    void_ratio ~ s(
      idx,
      bs = 'mrf',
      k = 500,
      xt = list(nb = boroughs_nb),
      fx = TRUE),
    data = elections,   # the data frame built above
    method = 'REML',
    control = gam.control(nthreads = 4),
    family = nb()
  )
It now seems to be blazingly fast and returns valid results with no warnings or errors. On a laptop with a 4-core Intel Core i7-6820HQ @ 2.70GHz, 16 GB RAM, Windows 10, it now takes 1-2 minutes to build the model.
In brief, what I changed was: remove a single NA, remove the offset from the formula, and use a negative binomial distribution.
Here is the result of what I wanted to achieve; from left to right: the real rate of void votes, the rate smoothed by the model, and residuals indicating discrepancies. The mgcv code let me do that.

How to recommend items for all users and test accuracy? (user-item collaborative filtering)

I'm currently working on a user-item collaborative filtering model.
I have a set of users and places they have shopped at, and have attempted to build a recommender model using R.
There are two aims of this project:
a) Recommend new shops to ALL customers
b) Give a stat to show how accurate the model is.
I have 2 years worth of data.
To answer b), I have subset my data to customers that have purchased in both the first 1.5 years AND in the following 6 months.
I have created a model on the transactions in the first 1.5 years, and have then compared the model predictions to the ACTUAL 6 months of data.
By performing the above, I determined that I should use UBCF with nn = 500, and I was able to achieve an accuracy of approx 80%.
However, I am now unsure of how to predict for the ENTIRE user base.
I was thinking of applying the ENTIRE dataset to the model I have just created, but that would be biased / inaccurate, as not all shops are represented in the smaller model I have created.
I have read articles and tutorials where people have done different things.
I have seen one where they input the entire dataset and apply subsetting (via which), so that the model is created on 80% of the data and tested using the remaining 20%.
My question is, if I were to use this process, how would I then get recommendations for ALL users, when the model only gives predictions for 20% of the users?
Is it best to create the model on the entire dataset?
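(As an editorial illustration only, a minimal sketch of the pattern described in those tutorials, assuming full_binary_matrix is a binaryRatingMatrix covering all customers and shops; it is a hypothetical stand-in for the matrices built below. The idea: evaluate on a hold-out split to get the accuracy statistic, then retrain on everything and predict for every user.)
library(recommenderlab)
scheme <- evaluationScheme(full_binary_matrix, method = "split", train = 0.8, given = 1)
rec <- Recommender(getData(scheme, "train"), method = "UBCF",
                   parameter = list(method = "Jaccard", nn = 500))
pred <- predict(rec, getData(scheme, "known"), type = "topNList", n = 10)
calcPredictionAccuracy(pred, getData(scheme, "unknown"), given = 1)   # precision/recall etc.
# Refit on the full matrix and recommend for every user
rec_all <- Recommender(full_binary_matrix, method = "UBCF",
                       parameter = list(method = "Jaccard", nn = 500))
top_n_all <- predict(rec_all, newdata = full_binary_matrix, type = "topNList", n = 10)
as(top_n_all, "list")   # named list of recommended shops, one element per customer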
SUBSET THE DATA
Create period flags
#If in 1.5 years, then 1. If in following 6 months, then 0.
FV$Flag1<-ifelse(FV$Date<="2018-10-01",1,0)
FV$Flag2<-ifelse(FV$Date>"2018-10-01",1,0)
IDENTIFICATION OF CUSTOMERS TO USE IN TRAINING MODEL
#Create SCV
#FV
FV_SCV<-select(FV, Customer, Flag1, Flag2) %>%
group_by(Customer) %>%
summarise_all(funs(sum)) #Sum all variables.
#Determine which customers to use based on if they have purchased both in the first and second years
FV_SCV$Use<-ifelse(FV_SCV$Flag1>0 & FV_SCV$Flag2>0, 1,0)
EXTRACT CUSTOMER LIST FOR TRAINING MODEL
#Training. Where customers have purchased both in the first & second year, but we only run the model on the first.
FV_Train<-FV_SCV %>%
filter(Use==1 )
SUBSET TO CUSTOMERS THAT HAVE PURCHASED IN 1 YEAR AND OF THE CUSTOMERS THAT HAVE PURCHASED IN BOTH YEARS, TO ONLY THOSE THAT HAVE SHOPPED IN THE FIRST YEAR
#FV_SCV$flag_sum<- FV_SCV$Flag1+FV_SCV$Flag2
SCV FOR CUSTOMERS USED IN THE TRAINING MODEL
#Join on the USE flag
FV_Train_Transactions<- FV %>% #Join on the page info
left_join(select(FV_Train, Customer, Use), by=c("Customer"="Customer"))
#Replace NA with 0
FV_Train_Transactions[is.na(FV_Train_Transactions)] <- 0
##Subset to only the users' transactions to be used in training
FV_Train_Transactions<-FV_Train_Transactions %>%
filter(Use==1)
##Create date flag for train and test to use to create the model on the train and comparing the results with the output of the test df
FV_Train_Transactions_Compare<-FV_Train_Transactions %>%
filter(Flag2>0)
##Create SCV for TRAIN
FV_TRAIN_SCV<-FV_Train_Transactions %>%
filter(Flag1>0) %>%
group_by(Customer, Brand)%>%
select(Customer, Brand)
FV_TRAIN_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TRAIN_SCV<-distinct(FV_TRAIN_SCV)
##Create scv for TEST
FV_TEST_SCV<-FV_Train_Transactions_Compare %>%
filter(Flag2>0) %>%
select(Customer, Brand) %>%
group_by(Customer, Brand)
FV_TEST_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TEST_SCV<-distinct(FV_TEST_SCV)
Transpose to columns
install.packages("reshape")
install.packages("reshape2")
install.packages("tidytext")
library(reshape)
library(reshape2)
library(tidytext)
#Melt data for transposition
#Train
fv_train_md<-melt(FV_TRAIN_SCV, id=(c("Customer", "Brand")))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#Test
fv_test_md<-melt(FV_TEST_SCV, id=(c("Customer" , "Brand")))
#Do the same for the overall transactions table
#Make FV_SCV a binary rating matrix
fv_overall<- FV[,c(1,3)] #The table name is case sensitive. Select only the customer and brand columns
fv_overall<- distinct(fv_overall) #Remove dups
fv_overall$Flag<-1
fv_overall_md<-melt(fv_overall, id=(c("Customer", "Brand")))
fv_overall_2<- dcast(fv_overall_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#fv_test_123<-dcast(FV_TEST_SCV, Customer~Brand, value.var="Brand")
#colnames(fv_test_123)
#fv_test_12345<-which(fv_test_123==1, arr.ind=TRUE)
#fv_test_123<-colnames(fv_test_123)[fv_test_12345]
#fv_test_123
#fv_test_123_df<-as.data.frame((fv_test_123))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value="Value", fun.aggregate = mean)
FV_TEST_SCV2<-dcast(fv_test_md, Customer~Brand, value="Value", fun.aggregate = mean)
#Replace NaN with 0
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
FV_TRAIN_SCV2[is.nan(FV_TRAIN_SCV2)] <- 0
FV_TEST_SCV2[is.nan(FV_TEST_SCV2)] <- 0
fv_overall_2[is.nan(fv_overall_2)] <- 0
#
#
install.packages("recommenderlab")
library(recommenderlab)
row.names(FV_TRAIN_SCV2)<-FV_TRAIN_SCV2$Customer
FV_TRAIN_SCV2$Hawkers<-0
FV_TRAIN_SCV2$Pollini<-0
FV_TRAIN_SCV2$"Twin Set"<-0
FV_TRAIN_SCV2_matrix<-as.matrix(FV_TRAIN_SCV2[,2:ncol(FV_TRAIN_SCV2)])
FV_TRAIN_SCV2_binarymatrix<-as(FV_TRAIN_SCV2_matrix,"binaryRatingMatrix")
similarity_FV_train_trans_items<-similarity(FV_TRAIN_SCV2_binarymatrix, method="jaccard", which="items")
train_col<- data.frame(colnames(FV_TRAIN_SCV2))
#------------------------------------------------------------------------------------
row.names(fv_overall_2)<-fv_overall_2$Customer
#Convert NaN to 0
fv_overall_2[is.nan(fv_overall_2)]<-0
#fv_overall_matrix<- as.matrix(fv_overall_2[,2:(ncol(fv_overall_2)-3)])#Convert to matrix
fv_overall_matrix<- as.matrix(fv_overall_2[,2:ncol(fv_overall_2)])#Convert to matrix
#fv_overall_matrix<- as.matrix(fv_overall_matrix[,1:(ncol(fv_overall_2)-3)])
fv_matrix_binary<- as(fv_overall_matrix, "binaryRatingMatrix") #Make a binary ratings matrix
FV_overall_similarity<-similarity(fv_matrix_binary, method="jaccard", which="items")
overall_col<- data.frame(colnames(fv_overall_2))
#---------------------------------------------------------------------------------------------------------
#
#
#Now, define multiple recommender algorithms to compare them all.
algorithms <- list(`user-based CF 50` = list(name = "UBCF",param = list(method = "Jaccard", nn = 50)),
`user-based CF 100` = list(name = "UBCF",param = list(method = "Jaccard", nn = 100)),
`user-based CF 200` = list(name = "UBCF",param = list(method = "Jaccard", nn = 200)),
`user-based CF 500` = list(name = "UBCF",param = list(method = "Jaccard", nn = 500)),
#
`item-based CF 3` = list(name = "IBCF",param = list(method = "Jaccard", k = 3)),
`item-based CF 5` = list(name = "IBCF",param = list(method = "Jaccard", k = 5)),
`item-based CF 10` = list(name = "IBCF",param = list(method = "Jaccard", k = 10)),
`item-based CF 50` = list(name = "IBCF",param = list(method = "Jaccard", k = 50))
)
scheme <- evaluationScheme(FV_TRAIN_SCV2_binarymatrix, method = "cross", k = 4,given = 1)
scheme <- evaluationScheme(fv_matrix_binary, method = "cross", k = 4,given = 1)
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8))
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8,50,100,200,500))#Evaluating with n=c(1,2,3.....) being the number of recommendations
#names(results) #Check that all results have run.
#results
#Plot results to help determine which of the above models is best for further analysis
#plot(results, annotate = c(1, 3), legend = "right") #ROC Curve
#plot(results, "prec/rec", annotate = 3) #Precision/Recall Plot
The first of these plots (with FPR on the x axis) is the ROC curve. The better-performing model is the curve with the largest area under it; of the tested parameters, that is UBCF with nn = 500 (or nn = 50).
Based on the precision/recall plot, nn should be set to 500.
MODEL USING UBCF nn = 500
recc_model <- Recommender(data = FV_TRAIN_SCV2_binarymatrix, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
#Running on ENTIRE DATA
recc_model <- Recommender(data = fv_matrix_binary, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
install.packages("plyr")
library(plyr)
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = FV_TRAIN_SCV2_binarymatrix, n = 198, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= FV_TRAIN_SCV2_binarymatrix,type="topNList", n=198)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
#------------------------------------------------------------
#On the overall model:
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = fv_matrix_binary, n = 80, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= fv_matrix_binary,type="topNList", n=80)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
Reshape the df so all ratings are in one column. Use this to then create a unique table and do a COUNTIF-style count, as this always crashes Excel.
install.packages("data.table")
library(data.table)
df.m1<- melt(ibcf_list_scores, id.vars=c(".id"),
value.name="Rating")
df.m1.unique<- data.frame(df.m1)
df.m1.unique$variable<-NULL
df.m1.unique$.id<-NULL
#df.m1.unique<-distinct(df.m1.unique)
#df.m1.unique<- df.m1.unique[order(df.m1.unique$Rating),] #This comma means it is only ordering based on this one var.
#Using ave
df.m1.unique$count<- ave(df.m1.unique$Rating, df.m1.unique[,c("Rating")], FUN=length)
rownames(df.m1.unique) <- c() #Remove rownames
df.m1.unique<-distinct(df.m1.unique)
df.m1.unique<- df.m1.unique[order(-df.m1.unique$Rating),]#Sort by descending rating
#Plot this
df.m1.unique.plot<- data.frame(df.m1.unique[2:(nrow(df.m1.unique)-1),])
#plot(x=df.m1.unique.plot$Rating, y=df.m1.unique.plot$count)
#Get the cumulative distribution
df.m1.unique.plot2<- df.m1.unique.plot %>%
mutate(Percentage=cumsum(100*(count/sum(count))),
cumsum=cumsum(count))
Remove ratings
#a) Remove values that are less than specific rating
#Using logical indexing with replacement
ibcf_list_scores_removal<- ibcf_list_scores
#Replace low values with 0
ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)][ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)] < 0.0217] <- 0
#To flag whether customer is recommended the brand, replace all values >0 with 1. Keep 0 as is.
ibcf_list_scores_removal_b<- ibcf_list_scores_removal #Call a new df
ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)][ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)] > 0] <- 1#Create the flag
So basically I'd like to know: how do I create the model on my ENTIRE dataset, and how do I extract all the ratings?
Thank you

correlogram with reliabilities in the main diagonal

I'm rather new to R; here is something I encountered in my first steps with it.
In some papers it is required to present a correlogram with the reliability (Cronbach's alpha) of the correlated variables in the main diagonal (where the correlations are 1).
An example might be 5 correlated psychometric measures:
Job_ins (an average of 4 items)
Employability (an average of 4 items)
INT_to_quit (an average of 4 items)
Mobility_pref (an average of 5 items)
Career_self_mgmt (an average of 8 items)
Note that with the native cor() R function the main diagonal (the correlations of the measures with themselves) shows 1.
What I would like to do is present the internal reliability (Cronbach's alpha) in the main diagonal instead.
any ideas?
Saar
If I understood you correctly, this is my (long) solution.
#Loading pkgs
require(tidyverse)
require(Hmisc)
require(psych)
#Creating example data
set.seed(123) #making the random data reproducible
#Creating the items for each subject
job <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
Employability <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
Career <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
#Arranging the data to one data.frame
df <- data.frame(as.data.frame(job$observed) %>%
                   set_names(c("job1", "job2", "job3", "job4")),
                 as.data.frame(Employability$observed) %>%
                   set_names(c("Employability1", "Employability2",
                               "Employability3", "Employability4")),
                 as.data.frame(Career$observed) %>%
                   set_names(c("Career1", "Career2", "Career3", "Career4")))
#Creating a vector with the Cronbach's alpha for each subject
CronAlpha <- c(
alpha(df %>%
select(job1, job2, job3, job4))$total$std.alpha,
alpha(df %>%
select(Employability1, Employability2,
Employability3, Employability4))$total$std.alpha,
alpha(df %>%
select(Career1, Career2,
Career3, Career4))$total$std.alpha)
#Calculating the mean for each subject, then the correlations
Correlation <- df %>%
#Calculating the means
mutate(job = rowMeans(data.frame(job1, job2, job3, job4), na.rm = TRUE),
Employability =rowMeans(data.frame(Employability1, Employability2,
Employability3, Employability4), na.rm = TRUE),
Career =rowMeans(data.frame(Career1, Career2,
Career3, Career4), na.rm = TRUE)) %>%
#Selecting only the vars that I want for the correlation matrix
select(job, Employability, Career) %>%
as.matrix() %>%
rcorr()
#Extracting the Pearson's r
CorrelationRs <- Correlation$r
#Looping through the correlation data.frame and replacing with
# Cronbach's alpha
i <- 1
for (i in 1:nrow(CorrelationRs)) {
CorrelationRs[i, i] <- CronAlpha[i]
}
CorrelationRs
Edit
Instead of using a loop, I should have used diag():
diag(CorrelationRs) <- CronAlpha
