I'm rather new to R, here is something I encountered in my first steps with it.
In some papers it is required to present a correlogram with the reliability (Cronbach Alpha) of the the correlated variables in the main diagonal (where the correlations are 1 )
an example might be 5 correlated psychometric measures
Job_ins (an average of 4 items)
Employability (an average of 4 items)
INT_to_quit (an average of 4 items)
Mobility_pref (an average of 5 items)
Career_self_mgmt (an average of 8 items)
note that in the native cor() R function the main diagnal (the correlations of the measures with themselves) shows 1.
What I would like to do is to present internal reliablity (cronbach alphas) in the main diagonal instead.
any ideas?
Saar
If I understood you correctly, this is my (long) solution.
#Loading pkgs
require(tidyverse)
require(Hmisc)
require(psych)
#Creating example data
set.seed(123) #making the random data reproducible
#Creating the items for each subject
job <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
Employability <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
Career <- sim.congeneric(N=200, short = FALSE, low = 1, high=10,categorical=TRUE)
#Arranging the data to one data.frame
df <- data.frame(as.data.frame(job$observed) %>%
set_names(c("job1", "job2", "job3", "job4")),
as.data.frame(Employability$observed) %>%
set_names(c("Employability1", "Employability2",
"Employability3", "Employability4")),
as.data.frame(job$observed) %>%
set_names(c("Career1", "Career2", "Career3", "Career4")))
#Creating a vector with the Cronbach's alpha for each subject
CronAlpha <- c(
alpha(df %>%
select(job1, job2, job3, job4))$total$std.alpha,
alpha(df %>%
select(Employability1, Employability2,
Employability3, Employability4))$total$std.alpha,
alpha(df %>%
select(Career1, Career2,
Career3, Career4))$total$std.alpha)
#Calculating the mean for each subject, than the correlations
Correlation <- df %>%
#Calculating the means
mutate(job = rowMeans(data.frame(job1, job2, job3, job4), na.rm = TRUE),
Employability =rowMeans(data.frame(Employability1, Employability2,
Employability3, Employability4), na.rm = TRUE),
Career =rowMeans(data.frame(Career1, Career2,
Career3, Career4), na.rm = TRUE)) %>%
#Selecting only the vars that I want for the correlation matrix
select(job, Employability, Career) %>%
as.matrix() %>%
rcorr()
#Extracting the Pearson's r
CorrelationRs <- Correlation$r
#Looping through the correlation data.frame and replacing with
# Cronbach's alpha
i <- 1
for (i in 1:nrow(CorrelationRs)) {
CorrelationRs[i, i] <- CronAlpha[i]
}
CorrelationRs
Edit
Instead of using loop, I should use diag().
diag(CorrelationRs) <- CronAlpha
Related
I have a dataset with multiple observations for each category:
country PC1 PC2 PC3 PC4 PC5
BD 0.0960408090569664 0.373740208940467 -0.369920989335273 -1.02993010449105 -0.481901935725247
BD -0.538617581045194 0.537010643603669 0.447050616992454 -1.3888975041278 -0.759524281163431
PK -0.452943925236246 0.507244835779749 0.64679762176707 -1.38054973938184 -0.278384245105666
PK -1.01487954986928 0.737191371806965 -0.202656866687033 -1.22663700666619 0.186305912881529
UK -0.377594639422628 0.817593863033578 0.3739216019342 -1.73856626173224 1.12404906217336
UK -0.636564327570674 0.714647668634421 1.00488527275837 -1.4344227886331 0.637219423443802
US -0.775649983771687 0.0900448150403809 0.243317360780493 -1.72498526814162 -0.618714136277983
US -0.372815509141658 0.419096654055852 0.904247466040119 -0.573219421959129 -0.0154666267035251
I want to run hierarchical cluster analysis on it in R, such that there are only 4 nodes (corresponding to 4 levels of country). The only way I can think of is to take mean values of the columns (PC1, PC2...) based on country and then run hclust in R. Since I have multiple observations for each categorical variable (there are at least 200 for each level), I want to run a bootstrap version of hierarchical cluster analysis on thousands of sub-samples (by randomly selecting one observation for each categorical variable) and running hclust, and then get a final result. I have come across the following ways of bootstrap clustering. pvclust appears to be useful for the summarised version of this data. ClusterBootstrap and Bclust also do not look useful for my scenario. Any ideas how can I run bootstrap using sub-samples of actual observations instead of using the summarized version with /without replacement?
Bootstrap cluster analysis is possible as follows:
library(future)
plan(multisession)
library(shipunov)
library(dplyr)
data = data.frame(country = c(rep("PK", 10), rep("UK", 10), rep("US", 10), rep("BD", 10), rep("IN", 10)),
"PC1" = runif(n = 50, min = -2, max = 3),
"PC2" = runif(n = 50, min = -2.5, max = 4),
"PC3" = runif(n = 50, min = -4, max = 2))
#original that will be used for comparison
d1 = data |>
dplyr::group_by(country) |>
dplyr::summarise_if(is.numeric, mean) |>
tibble::column_to_rownames(var="country") |>
data.frame()
dist_mat <- dist(d1, method = 'euclidean')
list_of_hc <- furrr::future_map(1:20000, function(i) {
print(i)
##create a dataframe with replacement using original df and summarize it
d = data |> group_by(country) |> slice_sample(prop = 1, replace=TRUE) |>
ungroup() |>
dplyr::group_by(country) |>
dplyr::summarise_if(is.numeric, mean) |>
tibble::column_to_rownames(var="country")
##run hclust on the data
dist_mat = dist(d, method = 'euclidean')
hc = hclust(dist_mat)
##save the hclust result to a list
hc}, .progress = TRUE)
#first element of the list is based on original df
list_of_hc[[1]] <- hclust(dist_mat)
#use Bclust to calculate similarity b/w the original (first element) and subsequent bootstrapped hclust
(bb3 <- Bclust(hclist=list_of_hc, relative = TRUE))
plot(bb3)
Result:
Hello I have the following code from a course
library(tidyverse)
library(dslabs)
data("polls_us_election_2016")
head(results_us_election_2016)
results_us_election_2016 %>% arrange(desc(electoral_votes)) %>% top_n(5, electoral_votes)
'Computing the average and standard deviation for each state'
polls <- polls_us_election_2016 %>%
filter(state != "U.S." &
!grepl("CD", "state") &
enddate >= "2016-10-31" &
(grade %in% c("A+", "A", "A-", "B+") | is.na(grade))) %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
group_by(state) %>%
summarize(avg = mean(spread), sd = sd(spread), n = n()) %>%
mutate(state = as.character(state))
# joining electoral college votes and results
results <- left_join(polls, results_us_election_2016, by="state")
head(results)
# states with no polls: note Rhode Island and District of Columbia = Democrat
results_us_election_2016 %>% filter(!state %in% results$state)
# assigns sd to states with just one poll as median of other sd values
results <- results %>%
mutate(sd = ifelse(is.na(sd), median(results$sd, na.rm = TRUE), sd))
#Calculating the posterior mean and posterior standard error
mu <- 0
tau <- 0.02
results %>% mutate(sigma = sd/sqrt(n),
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2))) %>%
arrange(abs(posterior_mean))
#Monte Carlo simulation of Election Night results (no general bias)
mu <- 0
tau <- 0.02
clinton_EV <- replicate(1000, {
results %>% mutate(sigma = sd/sqrt(n),
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2)),
simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
clintonvotes = ifelse(simulated_result > 0, electoral_votes, 0)) %>% # award votes if Clinton wins state
summarize(clinton = sum(clintonvotes)) %>% # total votes for Clinton
.$clinton + 7 # 7 votes for Rhode Island and DC
})
mean(clinton_EV > 269) # over 269 votes wins election
I don't understand how this line works
simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se)
length(posterior_mean) = 47, so rnorm should return a vector of size 47.
When I replace this with 1 each state gets the same result from rnorm although posterior_mean and posterior_se are diffent for each state. When I change it 46 I get an error.
So it seems to me that this line fills the whole column simulated_result (perhaps 47 times with the same results?). I would have expected that mutate uses the values of each row only to manipulate this particulate row.
Can perhaps someone explain this behavior to me or point me to a resource where this is explained?
For the rnorm function, if you check the vignette:
rnorm(n, mean = 0, sd = 1) Arguments
x, q :vector of quantiles.
p :vector of probabilities.
n :number of observations. If length(n) > 1, the length is taken to be the number required.
mean :vector of means.
sd :vector of standard deviations.
There are two ways to use it, one, you generate a vector of length n, coming from normal distribution of same mean and sd, for example:
set.seed(111)
rnorm(10,0,1)
[1] 0.2352207 -0.3307359 -0.3116238 -2.3023457 -0.1708760 0.1402782 -1.4974267 -1.0101884
[9] -0.9484756 -0.4939622
If you provide a vector that is as long as n, you are specifying the mean and sd for each entry, for example:
set.seed(111)
rnorm(10,1:10,1:10)
[1] 1.23522071 1.33852826 2.06512853 -5.20938263 4.14561978 6.84166935 -3.48198659 -0.08150735
[9] 0.46371956 5.06037783
In this case, you generate a vector of 10 random normal variable, first entry comes from mean=1, sd=1, 2nd entry mean=2, sd=2 and so on. We can also do something in between:
set.seed(111)
rnorm(10,1:10,1))
[1] 1.235221 1.669264 2.688376 1.697654 4.829124 6.140278 5.502573 6.989812 8.051524 9.506038
In this case, it returns a vector of length 10, first entry coming from mean = 1,sd=1, 2nd coming from mean =2,sd =1, and we can visualize this by re-running this:
t(replicate(10,rnorm(10,1:10,1)))
It's not very clear what you replaced with 1, but essentially the purpose of mutate is to assign a column with the values. And the simulated results columns work like the above.
I am trying to fit Polish local government election results in 2015 following the superb blog by #GavinSimpson. https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6 digit identifier (there may be leading 0's). I kept it as a text variable. EDIT, I simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.
library(tidyverse)
library(sf)
library(mgcv)
# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file
boroughs_shp <- st_read("../../_mapy/gminy.shp",options = "ENCODING=WINDOWS-1250",
stringsAsFactors = FALSE ) %>%
st_transform(crs = 4326)%>%
janitor::clean_names() %>%
# st_simplify(preserveTopology = T, dTolerance = 0.01) %>%
mutate(teryt=str_sub(jpt_kod_je, 1, 6)) %>%
select(teryt, nazwa=jpt_nazwa, geometry)
# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
readxl::read_excel("data/2015-gl-lis-gm.xls",
trim_ws = T, col_names = T) %>%
janitor::clean_names() %>%
select(teryt, liczba_wyborcow, glosy_niewazne)
elections <-
boroughs_shp %>% fortify() %>%
left_join(elections_xls, by = "teryt") %>%
arrange(teryt) %>%
mutate(idx = seq.int(nrow(.)) %>% as.factor(),
teryt = as.factor(teryt))
# Neighbors
boroughs_nb <-spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx )
names(boroughs_nb) <- attr(boroughs_nb, "region.id")
# Model
ctrl <- gam.control(nthreads = 4)
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)),
data = elections,
offset = log(liczba_wyborcow), # number of votes
method = 'REML',
control = ctrl,
family = betar())
Here is the error message:
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) :
mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
the condition has length > 1 and only the first element will be used
elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?
EDIT: The condition mentioned in error message is met:
> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE
It seems that I solved the issue, maybe not quite realizing how it did being stat beginner.
First, not a single NA should be present in modeled data. There was one. After that the mcgv seemed to run, but it took long time (quarter of an hour) and inexplicably for me, only when I limited no of knots to k=50, with poor results (less or more and it did not return any result) and with warning to be cautious about results.
Then I tried to remove offset=log(liczba_wyborcow) ie offset number of voters and made number of void votes per 1000 my predicted variable.
elections <-
boroughs_shp %>%
left_join(elections_xls, by = "teryt") %>% na.omit() %>%
arrange(teryt) %>%
mutate(idx = row_number() %>% as.factor()) %>%
mutate(void_ratio=round(glosy_niewazne/liczba_wyborcow,3)*1000)
Now that it is a count, why not try change family = betar() in gam formula to poisson() - still not a good result, and then to negative binomial family = nb()
Now my formula looks like
m1 <-
gam(
void_ratio ~ s(
idx,
bs = 'mrf',
k =500,
xt = list(nb = boroughs_nb),
fx = TRUE),
data = elections_df,
method = 'REML',
control = gam.control(nthreads = 4),
family = nb()
)
It seems now to be blazingly fast and return valid results with no warnings or errors. On a laptop with 4 cores Intel Core I7 6820HQ # 2.70GHZ 16GB Win10 it takes now 1-2 minutest to build a model.
In brief, what I changed was: remove a single NA, remove offset from formula and use negative binomial distribution.
Here is the result of what I wanted to achieve, from left to right, real rate of void votes, a rate smoothed by a model and residuals indicating discrepancies. The mcgv code let me do that.
I'm currently working on a user-item collaborative filtering model.
I have a set of users and places they have shopped at, and have attempted to build a recommender model using R.
There are two aims of this project:
a) Recommend new shops to ALL customers
b) Give a stat to show how accurate the model is.
I have 2 years worth of data.
To answer b), I have subset my data to customers that have purchased in both the first 1.5 years AND in the following 6 months.
I have created a model on the data of transactions in the first 1.5 years, then have compared to model predictions to the ACTUAL 6 months of data.
By performing the above, I determined that I was to use UBCF and nn=500, and I was able to achieve accuracy of approx 80%.
However, I am now unsure of how to predict for the ENTIRE user base.
I was thinking of applying the ENTIRE dataset to the model I have just created, but there is bias/will not be accurate, as not all shops are represented in this small model I have created.
I have read articles and tutorials where people have done different things.
I have seen one where they input the entire dataset, and apply the [which] subsetting, so that it creates the model in 80% and tests using the remaining 20%.
My question is, if I was to use this process, how would I then get recommendations for ALL users, when the model only gives predictions for 20% of the users?
Is it best to create the model on the entire dataset?
SUBSET THE DATA
Create period flags
#If in 1.5 years, then 1. If in following 6 months, then 0.
FV$Flag1<-ifelse(FV$Date<="2018-10-01",1,0)
FV$Flag2<-ifelse(FV$Date>"2018-10-01",1,0)
IDENTIFICATION OF CUSTOMERS TO USE IN TRAINING MODEL
#Create SCV
#FV
FV_SCV<-select(FV, Customer, Flag1, Flag2) %>%
group_by(Customer) %>%
summarise_all(funs(sum)) #Sum all variables.
#Determine which customers to use based on if they have purchased both in the first and second years
FV_SCV$Use<-ifelse(FV_SCV$Flag1>0 & FV_SCV$Flag2>0, 1,0)
EXTRACT CUSTOMER LIST FOR TRAINING MODEL
#Training. Where customers have purchased both in the first & second year, but we only run the model on the first.
FV_Train<-FV_SCV %>%
filter(Use==1 )
SUBSET TO CUSTOMERS THAT HAVE PURCHASED IN 1 YEAR AND OF THE CUSTOMERS THAT HAVE PURCHASED IN BOTH YEARS, TO ONLY THOSE THAT HAVE SHOPPED IN THE FIRST YEAR
#FV_SCV$flag_sum<- FV_SCV$Flag1+FV_SCV$Flag2
SCV FOR CUSTOMERS USED IN THE TRAINING MODEL
#Join on the USE flag
FV_Train_Transactions<- FV %>% #Join on the page info
left_join(select(FV_Train, Customer, Use), by=c("Customer"="Customer"))
#Replace NA with 0
FV_Train_Transactions[is.na(FV_Train_Transactions)] <- 0
##Subset to only the users' transactions to be used in training
FV_Train_Transactions<-FV_Train_Transactions %>%
filter(Use==1)
##Create date flag for train and test to use to create the model on the train and comparing the results with the output of the test df
FV_Train_Transactions_Compare<-FV_Train_Transactions %>%
filter(Flag2>0)
##Create SCV for TRAIN
FV_TRAIN_SCV<-FV_Train_Transactions %>%
filter(Flag1>0) %>%
group_by(Customer, Brand)%>%
select(Customer, Brand)
FV_TRAIN_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TRAIN_SCV<-distinct(FV_TRAIN_SCV)
##Create scv for TEST
FV_TEST_SCV<-FV_Train_Transactions_Compare %>%
filter(Flag2>0) %>%
select(Customer, Brand) %>%
group_by(Customer, Brand)
FV_TEST_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TEST_SCV<-distinct(FV_TEST_SCV)
Transpose to columns
install.packages("reshape")
install.packages("reshape2")
install.packages("tidytext")
library(reshape)
library(reshape2)
library(tidytext)
#Melt data for transposition
#Train
fv_train_md<-melt(FV_TRAIN_SCV, id=(c("Customer", "Brand")))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#Test
fv_test_md<-melt(FV_TEST_SCV, id=(c("Customer" , "Brand")))
#Do the same for the overall transactions table
#Make FV_SCV a binary rating matrix
fv_overall<- FV[,c(1,3)] #The table name is case sensitive. Select only the customer and brand columns
fv_overall<- distinct(fv_overall) #Remove dups
fv_overall$Flag<-1
fv_overall_md<-melt(fv_overall, id=(c("Customer", "Brand")))
fv_overall_2<- dcast(fv_overall_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#fv_test_123<-dcast(FV_TEST_SCV, Customer~Brand, value.var="Brand")
#colnames(fv_test_123)
#fv_test_12345<-which(fv_test_123==1, arr.ind=TRUE)
#fv_test_123<-colnames(fv_test_123)[fv_test_12345]
#fv_test_123
#fv_test_123_df<-as.data.frame((fv_test_123))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value="Value", fun.aggregate = mean)
FV_TEST_SCV2<-dcast(fv_test_md, Customer~Brand, value="Value", fun.aggregate = mean)
#Replace NaN with 0
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
FV_TRAIN_SCV2[is.nan(FV_TRAIN_SCV2)] <- 0
FV_TEST_SCV2[is.nan(FV_TEST_SCV2)] <- 0
fv_overall_2[is.nan(fv_overall_2)] <- 0
#
#
install.packages("recommenderlab")
library(recommenderlab)
row.names(FV_TRAIN_SCV2)<-FV_TRAIN_SCV2$Customer
FV_TRAIN_SCV2$Hawkers<-0
FV_TRAIN_SCV2$Pollini<-0
FV_TRAIN_SCV2$"Twin Set"<-0
FV_TRAIN_SCV2_matrix<-as.matrix(FV_TRAIN_SCV2[,2:ncol(FV_TRAIN_SCV2)])
FV_TRAIN_SCV2_binarymatrix<-as(FV_TRAIN_SCV2_matrix,"binaryRatingMatrix")
similarity_FV_train_trans_items<-similarity(FV_TRAIN_SCV2_binarymatrix, method="jaccard", which="items")
train_col<- data.frame(colnames(FV_TRAIN_SCV2))
#------------------------------------------------------------------------------------
row.names(fv_overall_2)<-fv_overall_2$Customer
#Convert NaN to 0
fv_overall_2[is.nan(fv_overall_2)]<-0
#fv_overall_matrix<- as.matrix(fv_overall_2[,2:(ncol(fv_overall_2)-3)])#Convert to matrix
fv_overall_matrix<- as.matrix(fv_overall_2[,2:ncol(fv_overall_2)])#Convert to matrix
#fv_overall_matrix<- as.matrix(fv_overall_matrix[,1:(ncol(fv_overall_2)-3)])
fv_matrix_binary<- as(fv_overall_matrix, "binaryRatingMatrix") #Make a binary ratings matrix
FV_overall_similarity<-similarity(fv_matrix_binary, method="jaccard", which="items")
overall_col<- data.frame(colnames(fv_overall_2))
#---------------------------------------------------------------------------------------------------------
#
#
#Now, define multiple recommender algorithms to compare them all.
algorithms <- list(`user-based CF 50` = list(name = "UBCF",param = list(method = "Jaccard", nn = 50)),
`user-based CF 100` = list(name = "UBCF",param = list(method = "Jaccard", nn = 100)),
`user-based CF 200` = list(name = "UBCF",param = list(method = "Jaccard", nn = 200)),
`user-based CF 500` = list(name = "UBCF",param = list(method = "Jaccard", nn = 500)),
#
`item-based CF 3` = list(name = "IBCF",param = list(method = "Jaccard", k = 3)),
`item-based CF 5` = list(name = "IBCF",param = list(method = "Jaccard", k = 5)),
`item-based CF 10` = list(name = "IBCF",param = list(method = "Jaccard", k = 10)),
`item-based CF 50` = list(name = "IBCF",param = list(method = "Jaccard", k = 50))
)
scheme <- evaluationScheme(FV_TRAIN_SCV2_binarymatrix, method = "cross", k = 4,given = 1)
scheme <- evaluationScheme(fv_matrix_binary, method = "cross", k = 4,given = 1)
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8))
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8,50,100,200,500))#Evaluating with n=c(1,2,3.....) being the number of recommendations
#names(results) #Check that all results have run.
#results
#Plot results to help determine which of the above models is best for further analysis
#plot(results, annotate = c(1, 3), legend = "right") #ROC Curve
#plot(results, "prec/rec", annotate = 3) #Precision/Recall Plot
The first of these plots (with FPR on x axis) is the ROC curve. The better performing model is the curve with the highest area therefore the better performing model, of these tested parameters, is UBCF with nn=500. Or, with nn=50.
Based on the precision/recall plot, nn should be set to 500.
MODEL USING UBCF nn = 500
recc_model <- Recommender(data = FV_TRAIN_SCV2_binarymatrix, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
#Running on ENTIRE DATA
recc_model <- Recommender(data = fv_matrix_binary, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
install.packages("plyr")
library(plyr)
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = FV_TRAIN_SCV2_binarymatrix, n = 198, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= FV_TRAIN_SCV2_binarymatrix,type="topNList", n=198)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
#------------------------------------------------------------
#On the overall model:
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = fv_matrix_binary, n = 80, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= fv_matrix_binary,type="topNList", n=80)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
Reshape df so all ratings are in one column. Use this to then create a unique table, to thendo a count if, as this always crashes excel.
install.packages("data.table")
library(data.table)
df.m1<- melt(ibcf_list_scores, id.vars=c(".id"),
value.name="Rating")
df.m1.unique<- data.frame(df.m1)
df.m1.unique$variable<-NULL
df.m1.unique$.id<-NULL
#df.m1.unique<-distinct(df.m1.unique)
#df.m1.unique<- df.m1.unique[order(df.m1.unique$Rating),] #This comma means it is only ordering based on this one var.
#Using ave
df.m1.unique$count<- ave(df.m1.unique$Rating, df.m1.unique[,c("Rating")], FUN=length)
rownames(df.m1.unique) <- c() #Remove rownames
df.m1.unique<-distinct(df.m1.unique)
df.m1.unique<- df.m1.unique[order(-df.m1.unique$Rating),]#Sort by ascending rating
#Plot this
df.m1.unique.plot<- data.frame(df.m1.unique[2:(nrow(df.m1.unique)-1),])
#plot(x=df.m1.unique.plot$Rating, y=df.m1.unique.plot$count)
#Get the cumulative distribution
df.m1.unique.plot2<- df.m1.unique.plot %>%
mutate(Percentage=cumsum(100*(count/sum(count))),
cumsum=cumsum(count))
Remove ratings
#a) Remove values that are less than specific rating
#Using logical indexing with replacement
ibcf_list_scores_removal<- ibcf_list_scores
#Replace low values with 0
ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)][ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)] < 0.0217] <- 0
#To flag whether customer is recommended the brand, replace all values >0 with 1. Keep 0 as is.
ibcf_list_scores_removal_b<- ibcf_list_scores_removal #Call a new df
ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)][ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)] > 0] <- 1#Create the flag
So basically I'd like to know how to create the model on my ENTIRE dataset?
And how to extract all ratings?
Thank you
I would like to detect outliers in real-time data that is aggregated per hour. For this example, I've selected the hourly pedestrian data from Melbourne, Australia
(Pedestrian volume (updated monthly), Pedestrian Counting System)
I understand there are a large number of existing detection algorithms, which in time I'll learn and use.
In the short term I'd like to use the simplest approach. One such method is outlined by #Aksakal in the following stackexchange post:
What algorithm should I use to detect anomalies on time-series?
I think the key is "unexpected" qualifier in your graph. In order to
detect the unexpected you need to have an idea of what's expected.
I would start with a simple time series model such as AR(p) or
ARMA(p,q). Fit it to data, add seasonality as appropriate. For
instance, your SAR(1)(24) model could be: $y_{t}=c+\phi
y_{t-1}+\Phi_{24}y_{t-24}+\Phi_{25}y_{t-25}+\varepsilon_t$, where $t$
is time in hours. So, you'd be predicting the graph for the next hour.
Whenever the prediction error $e_t=y_t-\hat y_t$ is "too big" you
throw an alert.
When you estimate the model you'll get the variance
$\sigma_\varepsilon$ of the error $\varepsilon_t$. Depending on your
distributional assumptions, such as normal, you can set the threshold
based on the probability, such as $|e_t|<3\sigma_\varepsilon$ for
99.7% or one-sided $e_t>3\sigma_\varepsilon$.
The number of visitors is probably quite persistent, but super
seasonal. It might work better to try seasonal dummies instead of the
multiplicative seasonality, then you'd try ARMAX where X stands for
exogenous variables, which could be anything like holiday dummy, hour
dummies, weekend dummies etc.
Unfortunately the post does not go into details, hence I have a few questions:
Q.1) How do I calculate/extract the variance $\sigma_\varepsilon$ of the ARIMA error term $\epsilon$ from the fitted model produced by auto.arima(data, xreg = xreg)?
Below is a complete R example that uses multiple seasonality to capture daily, weekly and yearly seasonality. This is not optimised and is only presented as an example implementation to help answer question 2.
I wish to predict the thresholds for a whole year (or at least for a 30 day period), this means that h = 24hrs * 30 = 720.
In essence, I want to forecast, not the mean of the hourly pedestrian counts, but the upper expected number of pedestrians per hour (e.g. 3σ_ε) for h>>1 (eg, h = 720 hours (30 days) or even h = 24*365 = 8760 hours (1 year) ).
Q.2) How can I achieve this using the method above?
Example code to help solve the above questions.
library(rwalkr)
library(forecast)
library(tidyverse)
library(tsibble)
library(xts)
library(dygraphs)
pedestrian <- as_tibble(rwalkr::run_melb( year = c(2015:2018) ))
pedestrian_statelibrary <- pedestrian %>%
filter(Sensor == "State Library") %>%
left_join(tsibble::holiday_aus(2015:2018, state='VIC'), by=c( 'Date' = 'date' )) %>%
mutate(holiday = replace_na(holiday, ''),
Count = ifelse(Count == 0, NA, Count))
# Replace all counts of zero with NA so Box-Cox transform lambda = 0 and constrain output to +ve.
pedestrian_statelibrary_train <- pedestrian_statelibrary %>% filter(Date >= as.Date('2015-05-13'), Date < as.Date('2017-01-01') )
pedestrian_statelibrary_test <- pedestrian_statelibrary %>% filter(Date >= as.Date('2017-01-01') )
# tsbox functions to convert tsibble to tz indirectly. Must be a better way of doing this...
pedestrian_statelibrary_train_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_train %>% select(Date_Time, Count) )
pedestrian_statelibrary_train_ts <- tsbox::ts_ts(pedestrian_statelibrary_train_zoo)
pedestrian_statelibrary_test_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_test %>% select(Date_Time, Count) )
pedestrian_statelibrary_test_ts <- tsbox::ts_ts(pedestrian_statelibrary_test_zoo)
## Create external regressors.
xreg_holidays_train <- model.matrix(~as.factor(pedestrian_statelibrary_train$holiday))
xreg_holidays_train <- xreg_holidays_train[,-1] # remove intercept.
# Remove 1st level from levels()
colnames(xreg_holidays_train) <- levels(as.factor(pedestrian_statelibrary_train$holiday))[-1]
xreg_holidays_test <- model.matrix(~as.factor(pedestrian_statelibrary_test$holiday))
xreg_holidays_test <- xreg_holidays_test[,-1] # remove intercept.
colnames(xreg_holidays_test) <- levels(as.factor(pedestrian_statelibrary_test$holiday))[-1]
# periods (intervals(samples) per period) for hourly data.
period_day <- 24
period_week <- 24*7
period_year <- 24*365.25
seasonal_periods = c(period_day, period_week, period_year)
pedestrian_statelibrary_train_msts <- msts(pedestrian_statelibrary_train_ts,
start = start(pedestrian_statelibrary_train_ts),
seasonal.periods = seasonal_periods)
pedestrian_statelibrary_test_msts <- msts(pedestrian_statelibrary_test_ts,
start = start(pedestrian_statelibrary_test_ts),
seasonal.periods = seasonal_periods)
# set number of Fourier terms per season. Not optimal.
Ks = c(12, 10, 2)
xreg_train <- cbind( seasonality = fourier(pedestrian_statelibrary_train_msts, K = Ks),
holidays = xreg_holidays_train )
######################################
## Fit model of exogenous factors and ARIMA as error
######################################
fit <- pedestrian_statelibrary_train_msts %>%
auto.arima( xreg = xreg_train,
seasonal=FALSE,
stepwise = FALSE,
parallel = TRUE,
num.cores = NULL,
lambda = 0
)
######################################
## Forecast
######################################
fc <- forecast( fit,
xreg=cbind( seasonality = fourier(pedestrian_statelibrary_test_msts, K = Ks),
holidays = xreg_holidays_test)
)
######################################
## Check residuals and accuracy.
######################################
checkresiduals(fit)
checkresiduals(fc)
accuracy(fc, pedestrian_statelibrary_test_msts)
######################################
## Display fitted model and forecast using interactive dygraph.
######################################
# Plotting `forecast` prediction using `dygraphs`
# https://stackoverflow.com/questions/43624634/plotting-forecast-prediction-using-dygraphs#43668603
as.forecast.ts <- function(forecast_obj){
training <- forecast_obj$x
lower <- forecast_obj$lower[,2]
upper <- forecast_obj$upper[,2]
point_forecast <- forecast_obj$mean
cbind(training, lower, upper, point_forecast)
}
fc_ts <- as.forecast.ts(fc)
# Add the time stamps back to ts object.
idx_train <- pedestrian_statelibrary_train %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_test <- pedestrian_statelibrary_test %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_all <- rbind(idx_train, idx_test)
# Append testing values to fc_ts object, by left joining two xts objects.
test_xts <- as.xts(x = pedestrian_statelibrary_test %>%
dplyr::ungroup() %>%
as.data.frame() %>%
dplyr::select( Count ) %>%
dplyr::rename( 'testing' = 'Count'),
pedestrian_statelibrary_test$Date_Time)
fc_xts <- as.xts(x = fc_ts %>%
as.data.frame(),
idx_all$Date_Time )
fc_xts <- fc_xts %>% xts::merge.xts(test_xts, join='left')
dygraph(data = fc_xts, main = "Pedestrian traffic Forecasting for State Library.") %>%
dyRangeSelector %>%
dySeries(name = "training", label = "Train") %>%
dySeries(name = 'testing', label = "Test") %>%
dySeries(name = "point_forecast", label = "Predicted") %>%
dyLegend(show = "always", hideOnMouseOut = FALSE) %>%
dyOptions(axisLineColor = "navy", gridLineColor = "grey")