First Order Markov Transition Matrix for Person Period Data

Using time series data for a single person, I can calculate a first-order probability transition matrix (with library(markovchain)) and then calculate its network density (with library(statnet)).
This code works:
library(markovchain)
library(statnet)
ds = matrix(c(1,1,2,1,2,4,1,3,6,1,4,8),ncol=3,byrow=TRUE) #create person period data for a single person
colnames(ds) = c("Id", "Time", "Evt")
ds = as.data.frame(ds)
mc = markovchainFit(ds$Evt, name = "mc")$estimate #fit the markov chain
am = mc@transitionMatrix #extract the transition matrix slot from the S4 object
em = network(am, matrix.type="adjacency", directed=TRUE, Weighted = TRUE, loops = FALSE) #make network object
gden(em) #calculate density of network, etc.
But I am having trouble making it work for data with multiple IDs using tapply. The code below stops working after the tapply() line, but it is how a solution looks in my head:
ds2 = matrix(c(1,1,2,1,2,4,1,3,6,1,4,8,2,1,3,2,2,5,2,3,7,2,4,9),ncol=3,byrow=TRUE) #create person period data for two people
colnames(ds2) = c("Id", "Time", "Evt")
ds2 = as.data.frame(ds2)
mc2 = tapply(ds2$Evt, ds2$Id, markovchainFit) #it works to here and I am STUCK for days *see below
am2 = mc2@transitionMatrix #can't figure out how to integrate these steps from above
em2 = network(am2, matrix.type="adjacency", directed=TRUE, Weighted = TRUE, loops = FALSE)
gden(em2)
*For each person in the list I can't figure out:
how to name the markov chain S4 object
how to extract the transition matrix slot from the S4 object
how to pass additional functions after markovchainFit()
Does anybody have any suggestions about how to loop my analysis for a single person through an ID vector? It would be very much appreciated.

How about something like the below? I make a function that does all of the interim work and returns the result of gden() on the appropriate object, then apply it per Id with tapply().
ds2 = matrix(c(1,1,2,1,2,4,1,3,6,1,4,8,2,1,3,2,2,5,2,3,7,2,4,9),ncol=3,byrow=TRUE) #create person period data for two people
colnames(ds2) = c("Id", "Time", "Evt")
ds2 = as.data.frame(ds2)
mcfun <- function(x){
  mc <- markovchainFit(x, name="mc")$estimate
  am <- mc@transitionMatrix #extract the transition matrix slot
  em <- network(am, matrix.type="adjacency", directed=TRUE, Weighted = TRUE, loops = FALSE) #make network object
  gden(em) #calculate density of network, etc.
}
tapply(ds2$Evt, ds2$Id, mcfun)
# 1 2
# 0.25 0.25
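If you also want to keep each person's fitted chain and transition matrix (the three stuck points above) rather than just the density, a small variation returns everything in a named list. A minimal sketch along the same lines (the helper name mcfun_all is mine):
library(markovchain)
library(statnet)
mcfun_all <- function(x){
  mc <- markovchainFit(x)$estimate          #fitted markovchain (S4)
  am <- mc@transitionMatrix                 #transition matrix slot
  em <- network(am, matrix.type="adjacency", directed=TRUE, loops=FALSE)
  list(chain = mc, matrix = am, density = gden(em))
}
res <- tapply(ds2$Evt, ds2$Id, mcfun_all)   #one element per Id
res[["1"]]$matrix                           #transition matrix for person 1
res[["1"]]$density                          #0.25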


How to recommend items for all users and test accuracy?

I'm currently working on a user-item collaborative filtering model.
I have a set of users and places they have shopped at, and have attempted to build a recommender model using R.
There are two aims of this project:
a) Recommend new shops to ALL customers
b) Give a stat to show how accurate the model is.
I have 2 years worth of data.
To answer b), I have subset my data to customers that have purchased in both the first 1.5 years AND in the following 6 months.
I have created a model on the data of transactions in the first 1.5 years, then compared the model predictions to the ACTUAL 6 months of data.
By performing the above, I determined that I should use UBCF with nn=500, and I was able to achieve accuracy of approx 80%.
However, I am now unsure of how to predict for the ENTIRE user base.
I was thinking of applying the ENTIRE dataset to the model I have just created, but that would be biased/inaccurate, as not all shops are represented in this smaller model I have created.
I have read articles and tutorials where people have done different things.
I have seen one where they input the entire dataset, and apply the [which] subsetting, so that it creates the model in 80% and tests using the remaining 20%.
My question is, if I was to use this process, how would I then get recommendations for ALL users, when the model only gives predictions for 20% of the users?
Is it best to create the model on the entire dataset?
SUBSET THE DATA
Create period flags
#Flag1 = 1 if in the first 1.5 years; Flag2 = 1 if in the following 6 months.
FV$Flag1<-ifelse(FV$Date<="2018-10-01",1,0)
FV$Flag2<-ifelse(FV$Date>"2018-10-01",1,0)
IDENTIFICATION OF CUSTOMERS TO USE IN TRAINING MODEL
#Create SCV
#FV
library(dplyr)
FV_SCV<-select(FV, Customer, Flag1, Flag2) %>%
group_by(Customer) %>%
summarise_all(funs(sum)) #Sum all variables.
#Determine which customers to use based on if they have purchased both in the first and second years
FV_SCV$Use<-ifelse(FV_SCV$Flag1>0 & FV_SCV$Flag2>0, 1,0)
EXTRACT CUSTOMER LIST FOR TRAINING MODEL
#Training. Where customers have purchased both in the first & second year, but we only run the model on the first.
FV_Train<-FV_SCV %>%
filter(Use==1 )
SUBSET TO CUSTOMERS THAT HAVE PURCHASED IN 1 YEAR AND OF THE CUSTOMERS THAT HAVE PURCHASED IN BOTH YEARS, TO ONLY THOSE THAT HAVE SHOPPED IN THE FIRST YEAR
#FV_SCV$flag_sum<- FV_SCV$Flag1+FV_SCV$Flag2
SCV FOR CUSTOMERS USED IN THE TRAINING MODEL
#Join on the USE flag
FV_Train_Transactions<- FV %>% #Join on the page info
left_join(select(FV_Train, Customer, Use), by=c("Customer"="Customer"))
#Replace NA with 0
FV_Train_Transactions[is.na(FV_Train_Transactions)] <- 0
##Subset to only the users' transactions to be used in training
FV_Train_Transactions<-FV_Train_Transactions %>%
filter(Use==1)
##Create date flags for train and test: build the model on the train period and compare the results with the output of the test df
FV_Train_Transactions_Compare<-FV_Train_Transactions %>%
filter(Flag2>0)
##Create SCV for TRAIN
FV_TRAIN_SCV<-FV_Train_Transactions %>%
filter(Flag1>0) %>%
group_by(Customer, Brand)%>%
select(Customer, Brand)
FV_TRAIN_SCV$Flag<-1
#group_by()/select() does not drop duplicate rows, so remove duplicates explicitly.
FV_TRAIN_SCV<-distinct(FV_TRAIN_SCV)
##Create scv for TEST
FV_TEST_SCV<-FV_Train_Transactions_Compare %>%
filter(Flag2>0) %>%
select(Customer, Brand) %>%
group_by(Customer, Brand)
FV_TEST_SCV$Flag<-1
#group_by()/select() does not drop duplicate rows, so remove duplicates explicitly.
FV_TEST_SCV<-distinct(FV_TEST_SCV)
Transpose to columns
install.packages("reshape")
install.packages("reshape2")
install.packages("tidytext")
library(reshape)
library(reshape2)
library(tidytext)
#Melt data for transposition
#Train
fv_train_md<-melt(FV_TRAIN_SCV, id=(c("Customer", "Brand")))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value.var="value", fun.aggregate = mean)
#Test
fv_test_md<-melt(FV_TEST_SCV, id=(c("Customer" , "Brand")))
#Do the same for the overall transactions table
#Make FV_SCV a binary rating matrix
fv_overall<- FV[,c(1,3)] #The table name is case sensitive. Select only the customer and brand columns
fv_overall<- distinct(fv_overall) #Remove dups
fv_overall$Flag<-1
fv_overall_md<-melt(fv_overall, id=(c("Customer", "Brand")))
fv_overall_2<- dcast(fv_overall_md, Customer~Brand, value.var="value", fun.aggregate = mean)
#fv_test_123<-dcast(FV_TEST_SCV, Customer~Brand, value.var="Brand")
#colnames(fv_test_123)
#fv_test_12345<-which(fv_test_123==1, arr.ind=TRUE)
#fv_test_123<-colnames(fv_test_123)[fv_test_12345]
#fv_test_123
#fv_test_123_df<-as.data.frame((fv_test_123))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value.var="value", fun.aggregate = mean)
FV_TEST_SCV2<-dcast(fv_test_md, Customer~Brand, value.var="value", fun.aggregate = mean)
#Replace NaN with 0
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
FV_TRAIN_SCV2[is.nan(FV_TRAIN_SCV2)] <- 0
FV_TEST_SCV2[is.nan(FV_TEST_SCV2)] <- 0
fv_overall_2[is.nan(fv_overall_2)] <- 0
#
#
install.packages("recommenderlab")
library(recommenderlab)
row.names(FV_TRAIN_SCV2)<-FV_TRAIN_SCV2$Customer
FV_TRAIN_SCV2$Hawkers<-0
FV_TRAIN_SCV2$Pollini<-0
FV_TRAIN_SCV2$"Twin Set"<-0
FV_TRAIN_SCV2_matrix<-as.matrix(FV_TRAIN_SCV2[,2:ncol(FV_TRAIN_SCV2)])
FV_TRAIN_SCV2_binarymatrix<-as(FV_TRAIN_SCV2_matrix,"binaryRatingMatrix")
similarity_FV_train_trans_items<-similarity(FV_TRAIN_SCV2_binarymatrix, method="jaccard", which="items")
train_col<- data.frame(colnames(FV_TRAIN_SCV2))
#------------------------------------------------------------------------------------
row.names(fv_overall_2)<-fv_overall_2$Customer
#Convert NaN to 0
fv_overall_2[is.nan(fv_overall_2)]<-0
#fv_overall_matrix<- as.matrix(fv_overall_2[,2:(ncol(fv_overall_2)-3)])#Convert to matrix
fv_overall_matrix<- as.matrix(fv_overall_2[,2:ncol(fv_overall_2)])#Convert to matrix
#fv_overall_matrix<- as.matrix(fv_overall_matrix[,1:(ncol(fv_overall_2)-3)])
fv_matrix_binary<- as(fv_overall_matrix, "binaryRatingMatrix") #Make a binary ratings matrix
FV_overall_similarity<-similarity(fv_matrix_binary, method="jaccard", which="items")
overall_col<- data.frame(colnames(fv_overall_2))
#---------------------------------------------------------------------------------------------------------
#
#
#Now, define multiple recommender algorithms to compare them all.
algorithms <- list(`user-based CF 50` = list(name = "UBCF",param = list(method = "Jaccard", nn = 50)),
`user-based CF 100` = list(name = "UBCF",param = list(method = "Jaccard", nn = 100)),
`user-based CF 200` = list(name = "UBCF",param = list(method = "Jaccard", nn = 200)),
`user-based CF 500` = list(name = "UBCF",param = list(method = "Jaccard", nn = 500)),
#
`item-based CF 3` = list(name = "IBCF",param = list(method = "Jaccard", k = 3)),
`item-based CF 5` = list(name = "IBCF",param = list(method = "Jaccard", k = 5)),
`item-based CF 10` = list(name = "IBCF",param = list(method = "Jaccard", k = 10)),
`item-based CF 50` = list(name = "IBCF",param = list(method = "Jaccard", k = 50))
)
scheme <- evaluationScheme(FV_TRAIN_SCV2_binarymatrix, method = "cross", k = 4, given = 1) #on the training matrix
scheme <- evaluationScheme(fv_matrix_binary, method = "cross", k = 4, given = 1) #or on the overall matrix (overwrites the scheme above)
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8))
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8,50,100,200,500))#Evaluating with n=c(1,2,3.....) being the number of recommendations
#names(results) #Check that all results have run.
#results
#Plot results to help determine which of the above models is best for further analysis
#plot(results, annotate = c(1, 3), legend = "right") #ROC Curve
#plot(results, "prec/rec", annotate = 3) #Precision/Recall Plot
The first of these plots (with FPR on the x axis) is the ROC curve. The better performing model is the curve with the largest area under it, so of these tested parameters the better performing model is UBCF with nn=500 (or with nn=50).
Based on the precision/recall plot, nn should be set to 500.
MODEL USING UBCF nn = 500
recc_model <- Recommender(data = FV_TRAIN_SCV2_binarymatrix, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
#Running on ENTIRE DATA
recc_model <- Recommender(data = fv_matrix_binary, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
install.packages("plyr")
library(plyr)
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = FV_TRAIN_SCV2_binarymatrix, n = 198, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= FV_TRAIN_SCV2_binarymatrix,type="topNList", n=198)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
#------------------------------------------------------------
#On the overall model:
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = fv_matrix_binary, n = 80, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= fv_matrix_binary,type="topNList", n=80)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
Reshape the df so all ratings are in one column. Use this to then create a unique table for a COUNTIF-style tally, as this always crashes Excel.
install.packages("data.table")
library(data.table)
df.m1<- melt(ibcf_list_scores, id.vars=c(".id"),
value.name="Rating")
df.m1.unique<- data.frame(df.m1)
df.m1.unique$variable<-NULL
df.m1.unique$.id<-NULL
#df.m1.unique<-distinct(df.m1.unique)
#df.m1.unique<- df.m1.unique[order(df.m1.unique$Rating),] #This comma means it is only ordering based on this one var.
#Using ave
df.m1.unique$count<- ave(df.m1.unique$Rating, df.m1.unique[,c("Rating")], FUN=length)
rownames(df.m1.unique) <- c() #Remove rownames
df.m1.unique<-distinct(df.m1.unique)
df.m1.unique<- df.m1.unique[order(-df.m1.unique$Rating),]#Sort by descending rating
#Plot this
df.m1.unique.plot<- data.frame(df.m1.unique[2:(nrow(df.m1.unique)-1),])
#plot(x=df.m1.unique.plot$Rating, y=df.m1.unique.plot$count)
#Get the cumulative distribution
df.m1.unique.plot2<- df.m1.unique.plot %>%
mutate(Percentage=cumsum(100*(count/sum(count))),
cumsum=cumsum(count))
Remove ratings
#a) Remove values that are less than specific rating
#Using logical indexing with replacement
ibcf_list_scores_removal<- ibcf_list_scores
#Replace low values with 0
ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)][ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)] < 0.0217] <- 0
#To flag whether customer is recommended the brand, replace all values >0 with 1. Keep 0 as is.
ibcf_list_scores_removal_b<- ibcf_list_scores_removal #Call a new df
ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)][ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)] > 0] <- 1#Create the flag
So basically, I'd like to know: how do I create the model on my ENTIRE dataset, and how do I extract all ratings?
Thank you
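For what it's worth, here is a minimal sketch of the whole-dataset variant, reusing the fv_matrix_binary object built above. It assumes the full binary rating matrix is what the final model should be fit on, keeps nn=500 from the held-out comparison, and the full_* object names are mine:
library(recommenderlab)
library(plyr)
#Fit UBCF on the full binary rating matrix (all customers, all brands)
recc_model_full <- Recommender(data = fv_matrix_binary, method = "UBCF",
                               parameter = list(method = "Jaccard", nn = 500))
#Predicted scores for every user in the matrix
full_ratings <- predict(object = recc_model_full, newdata = fv_matrix_binary, type = "ratings")
full_scores <- ldply(as(full_ratings, "list"), rbind)
full_scores[is.na(full_scores)] <- 0
#Top-N recommendations for every user
full_topn <- predict(object = recc_model_full, newdata = fv_matrix_binary, type = "topNList", n = 10)
full_topn_list <- as(full_topn, "list")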

How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the following function:
textstat_readability(texts(mwe, groups = "document"), "Flesch")
I then want to bootstrap the results to obtain a 95% confidence interval by wrapping a function:
b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure)
n <- 10
groups <- factor(mwe[["document"]]$document)
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups)
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975))
But "b <-" fails with the error:
"Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts on matrix"
I've wasted two days trying to debug this with no luck. What am I doing wrong? Any advice would be much appreciated...
MWE:
mwe<-structure(list(document = structure(c(1L, 1L),
.Label = c("a", "b", "c", "d", "e"), class = "factor"), text = c("Text 1. Text 1.1", "Text 2."), section = structure(2:1, .Label = c("aa", "bb", "cc", "dd", "ee", "ff", "hh", "ii", "jj", "kk"), class = "factor"), year = c(1919L, 1944L), preamble = structure(8:9, .Label = c("1", "2","3", " 4 ", "5", "6 ", "7 ", "8 ", "9 ", "10 "), class = "factor"), articles = c(43L, 47L), pages = c(5.218, 7.666), wordcount = c(3503L, 4929L), mean_articles = c(45, 45)), row.names = 1:2, class = "data.frame")
mwe <- corpus(mwe)
b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure)
n <- 10
groups <- factor(mwe[["document"]]$document)
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups)
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975))
A good question that involves knowing a lot about the boot package as well as how to index and group corpus texts in quanteda. Here's the best (currently) and safest way to do it. "Safest" here means future-proof, since there are some things that currently work in the internal addressing of a quanteda corpus that will not work in upcoming v2. (We warn about this very clearly in ?corpus but no one seems to heed that warning...) Note also that while this should always work, we are also planning, in future versions, more direct methods for bootstrapping textual statistics that would not require the user to do this sort of deep dive into the boot package.
Let's try a reproducible example from built-in objects first. To "bootstrap" a text, we will construct a new, hypothetical text using sentence-level resampling (with replacement) from the original, and use texts(x, groups = "<groupvar>") to piece this together into a hypothetical kind of text. (This is how I have done it in the two references at the end of this post.) To make this happen, we can exploit the property of texts() that it works to get texts from a corpus object but also works on character objects (but with fast grouping).
To get the sentences, after subsetting the corpus to simplify our example here, we reshape it into sentences.
First, however, I recorded the original document's name in a new document variable, so that we can use this for grouping later. In this example, we could also have used Year, but doing it this way will work for any example. (There are some internal records about the original docname that we might have used, but doing it this way will be future-proof.)
library("quanteda")
## Package version: 1.4.1
library("boot")
docvars(data_corpus_inaugural, "docnameorig") <- docnames(data_corpus_inaugural)
sent_corpus <- data_corpus_inaugural %>%
corpus_subset(Year > 2000) %>%
corpus_reshape(to = "sentences")
Then we have to define the function to be bootstrapped. We will use the "index" method and call the index i (as you did above). Here, x will be a character and not a corpus, even though we will call texts() on it, again using the grouping variable to reassemble it. This will also need to return a vector and not a data.frame, which is the normal form of a textstat_*() return. So we will extract just the measure column and return it as a vector.
b_readability <- function(x, i, groups = NULL, measure = "Flesch") {
textstat_readability(texts(x[i], groups = groups[i]), measure)[[measure]]
}
We will call our grouping variable simgroups just to distinguish the value from the argument name, and use this for both the groups argument and for strata in the call to boot(). The strata is an argument to boot(), while groups is passed through to our function b_readability(). We need to factorize this grouping variable since the function seems to want that. Then we call boot() and get our answer.
simgroups <- factor(docvars(sent_corpus, "docnameorig"))
boot(texts(sent_corpus), b_readability, R = 10,
strata = simgroups, groups = simgroups)
##
## STRATIFIED BOOTSTRAP
##
##
## Call:
## boot(data = texts(sent_corpus), statistic = b_readability, R = 10,
## strata = simgroups, groups = simgroups)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 60.22723 -0.01454477 2.457416
## t2* 53.23332 1.24942328 2.564719
## t3* 60.56705 1.07426297 1.996705
## t4* 53.55532 -0.28971190 1.943986
## t5* 58.63471 0.52289051 2.502101
These correspond to the five (original) documents, here distinguished by year, although unfortunately those names have been replaced by t1, t2, ... in the return object from boot().
To return to your original example, let's say these form two documents from one stratum (since these are too short to subdivide further). Then:
simgroups <- rep(1, ndoc(mwe))
boot(texts(mwe), b_readability, R = 10, strata = simgroups, groups = simgroups)
##
## STRATIFIED BOOTSTRAP
##
##
## Call:
## boot(data = texts(mwe), statistic = b_readability, R = 10, strata = simgroups,
## groups = simgroups)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 119.19 0.6428333 0.4902916
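To recover the 95% confidence interval you were originally after, you can apply quantile() across the bootstrap replicates (as in your own code), or use boot.ci() per statistic. A quick sketch, assuming b holds the boot() result from above:
b <- boot(texts(sent_corpus), b_readability, R = 10,
          strata = simgroups, groups = simgroups)
#percentile intervals for all groups at once
apply(b$t, 2, quantile, c(.025, .5, .975))
#or per statistic, e.g. the first document
boot.ci(b, type = "perc", index = 1)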

How to perform Johansen cointegration test iteratively for 2 variables taking some rows at a time?

I want to test cointegration between two time series using the Johansen cointegration test. I want to perform the test on a rolling window: the first 120 observations, then dropping one observation from the top and adding one at the bottom, across a total of 2250 observations. I want to automate this with a for loop, but the code throws an error. Please help.
library(urca)
x= BDICOM$BDI
y= BDICOM$Soybn
for(i in 1:2666){
A = x[i:i+120]; B = y[i:i+120]
jocot[i] = ca.jo(data.frame(A,B), type = "eigen", ecdet = "none",K = 2, spec = "longrun");i=i+1
}
Try this. Note that x[i:i+120] in your loop is parsed as x[(i:i)+120] because : binds tighter than +, and that ca.jo() returns an S4 object rather than a number, so you need to extract a slot such as @teststat. Without a sample of your data I can't be sure it's correct:
library(urca)
x = BDICOM$BDI
y = BDICOM$Soybn
jocot <- vector('numeric', (2000 - 120))
for (i in 1:(2000 - 120)) {
  A = x[i:(i + 120)]
  B = y[i:(i + 120)]
  jocot[i] = ca.jo(
    data.frame(A, B),
    type = "eigen",
    ecdet = "none",
    K = 2,
    spec = "longrun"
  )@teststat[2]
}
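If you also want a cointegration decision per window rather than just the statistic, the fitted ca.jo object carries critical values in its @cval slot. A sketch for a single window (the 5% column name is an assumption; check the row/column layout of @cval on one fitted object first):
jc <- ca.jo(data.frame(A = x[1:121], B = y[1:121]),
            type = "eigen", ecdet = "none", K = 2, spec = "longrun")
#test statistic for r = 0 against its 5% critical value
jc@teststat[2] > jc@cval[2, "5pct"] #TRUE -> reject r = 0, evidence of cointegration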

Learning hidden markov model in R

A hidden Markov model (HMM) is one in which you observe a sequence of observations, but do not know the sequence of states the model went through to generate the observations. Analyses of hidden Markov models seek to recover the sequence of hidden states from the observed data.
I have data with both observations and hidden states (observations are of continuous values) where the hidden states were tagged by an expert. I would like to train a HMM that would be able - based on a (previously unseen) sequence of observations - to recover the corresponding hidden states.
Is there any R package to do that? The existing packages I have studied (depmixS4, HMM, seqHMM - the last for categorical data only) only allow you to specify the number of hidden states.
EDIT:
Example:
data.tagged.by.expert = data.frame(
hidden.state = c("Wake", "REM", "REM", "NonREM1", "NonREM2", "REM", "REM", "Wake"),
sensor1 = c(1,1.2,1.2,1.3,4,2,1.78,0.65),
sensor2 = c(7.2,5.3,5.1,1.2,2.3,7.5,7.8,2.1),
sensor3 = c(0.01,0.02,0.08,0.8,0.03,0.01,0.15,0.45)
)
data.newly.measured = data.frame(
sensor1 = c(2,3,4,5,2,1,2,4,5,8,4,6,1,2,5,3,2,1,4),
sensor2 = c(2.1,2.3,2.2,4.2,4.2,2.2,2.2,5.3,2.4,1.0,2.5,2.4,1.2,8.4,5.2,5.5,5.2,4.3,7.8),
sensor3 = c(0.23,0.25,0.23,0.54,0.36,0.85,0.01,0.52,0.09,0.12,0.85,0.45,0.26,0.08,0.01,0.55,0.67,0.82,0.35)
)
I would like to create a HMM with discrete time t where the random variable x(t) represents the hidden state at time t, x(t) ∈ {"Wake", "REM", "NonREM1", "NonREM2"}, and 3 continuous random variables sensor1(t), sensor2(t), sensor3(t) representing the observations at time t.
model.hmm = learn.model(data.tagged.by.expert)
Then I would like to use the created model to estimate hidden states responsible for newly measured observations
hidden.states = estimate.hidden.states(model.hmm, data.newly.measured)
Data (training/testing)
To be able to run learning methods for the Naive Bayes classifier, we need a longer data set:
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
artificial.hypnogram = rep(c(5,4,1,2,3,4,5), times = c(40,150,200,300,50,90,30))
data.tagged.by.expert = data.frame(
hidden.state = states[artificial.hypnogram],
sensor1 = log(artificial.hypnogram) + runif(n = length(artificial.hypnogram), min = 0.2, max = 0.5),
sensor2 = 10*artificial.hypnogram + sample(c(-8:8), size = length(artificial.hypnogram), replace = T),
sensor3 = sample(1:100, size = length(artificial.hypnogram), replace = T)
)
hidden.hypnogram = rep(c(5,4,1,2,4,5), times = c(10,10,15,10,10,3))
data.newly.measured = data.frame(
sensor1 = log(hidden.hypnogram) + runif(n = length(hidden.hypnogram), min = 0.2, max = 0.5),
sensor2 = 10*hidden.hypnogram + sample(c(-8:8), size = length(hidden.hypnogram), replace = T),
sensor3 = sample(1:100, size = length(hidden.hypnogram), replace = T)
)
Solution
In the solution, we used Viterbi algorithm - combined with Naive Bayes classifier.
At each clock time t, a Hidden Markov Model consists of
an unobserved state (denoted as hidden.state in this case) taking a finite number of states
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
a set of observed variables (sensor1, sensor2, sensor3 in this case)
Transition matrix
A new state is entered based upon a transition probability distribution
(transition matrix). This can be easily computed from data.tagged.by.expert e.g. using
library(markovchain)
trans_p <- markovchainFit(data.tagged.by.expert$hidden.state)$estimate
Emission matrix
After each transition is made, an observation (sensor_i) is produced according to a conditional probability distribution (emission matrix) which depends only on the current state H of hidden.state. We will replace the emission matrices by a Naive Bayes classifier.
library(caret)
library(klaR)
library(e1071)
model = train(hidden.state ~ .,
data = data.tagged.by.expert,
method = 'nb',
trControl=trainControl(method='cv',number=10)
)
Viterbi algorithm
To solve the problem, we use Viterbi algorithm with the initial probability of 1 for "Wake" state and 0 otherwise. (We expect the patient to be awake in the beginning of the experiment)
# we expect the patient to be awake in the beginning
start_p = c(NonREM1 = 0,NonREM2 = 0,NonREM3 = 0, REM = 0, Wake = 1)
# Naive Bayes model
model_nb = model$finalModel
# the observations
observations = data.newly.measured
nObs <- nrow(observations) # number of observations
nStates <- length(states) # number of states
# T1, T2 initialization
T1 <- matrix(0, nrow = nStates, ncol = nObs) #define two 2-dimensional tables
row.names(T1) <- states
T2 <- T1
Byj <- predict(model_nb, newdata = observations[1,])$posterior
# init first column of T1
for(s in states)
T1[s,1] = start_p[s] * Byj[1,s]
# fill T1 and T2 tables
for(j in 2:nObs) {
Byj <- predict(model_nb, newdata = observations[j,])$posterior
for(s in states) {
res <- (T1[,j-1] * trans_p[,s]) * Byj[1,s]
T2[s,j] <- states[which.max(res)]
T1[s,j] <- max(res)
}
}
# backtrack best path
result <- rep("", times = nObs)
result[nObs] <- names(which.max(T1[,nObs]))
for (j in nObs:2) {
result[j-1] <- T2[result[j], j]
}
# show the result
result
# show the original artificial data
states[hidden.hypnogram]
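As a quick sanity check, one could also compare the recovered path with the true states directly:
# proportion of time points where the Viterbi path matches the true state
mean(result == states[hidden.hypnogram])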
References
To read more about the problem, see: Vomlel, Jiří and Kratochvíl, Václav: "Dynamic Bayesian Networks for the Classification of Sleep Stages", Proceedings of the 11th Workshop on Uncertainty Processing (WUPES'18), pp. 205-215, Eds: Kratochvíl, Václav and Vejnarová, Jiřina, Třeboň, CZ, 2018.

Multi label classification in R

I have a training data set with 28 variables (13 labels and 15 features). A test data set with 15 features and I have to predict labels for this test data set based on the features. I made KNN classifiers for all 13 labels individually.
Is there a possibility of combining all these 13 individual label KNN classifiers into one single multi label classifier?
My current code for single label:
library(class)
train_from_train <- train[1:600,2:16]
target_a_train_from_train <- train[1:600,17]
test_from_train <- train[601:800,2:16]
target_a_test_from_train <- train[601:800,17]
knn_pred_a <-knn (train = train_from_train, test = test_from_train, cl= target_a_train_from_train, k = 29)
table(knn_pred_a, target_a_test_from_train)
mean(knn_pred_a != target_a_test_from_train)
knn_pred_a_ON_TEST <-knn (train = train[,2:16], test = test[2:16], cl= train[,17], k = 29)
knn_pred_a_ON_TEST
I scoured the internet, and the package mldr seems to be an option, but I couldn't adapt it to my needs.
You can use the package RANN for this. However, it is not exact as far as I know.
library(RANN)
library(reshape2)
####
## generate some sample data and randomize order
iris.knn <- iris[sample(1:150,150),]
#add a second class
iris.knn["Class2"] <- iris.knn[,5]=="versicolor"
iris.knn$org.row.id <- 1:nrow(iris.knn)
train <- iris.knn[1:100,]
test <- iris.knn[101:150,]
##
#####
## get nearest neighbours
nn.idx <- as.data.frame(nn2(train[1:4],query=test[1:4],k=4)$nn.idx)
## add row id of the test rows (used as the melt id below)
nn.idx$row.id <- test$org.row.id
#classes and row id
multiclass.vec <- data.frame(org.row.id=1:150,iris.knn[,5:6])
#1 row per nearest neighbour
melted <-melt(nn.idx,id.vars="row.id")
merged <- merge(melted,multiclass.vec, by.x = "value",by.y="org.row.id")
#aggregate a single class
aggregate(merged$Species, list(merged$row.id), function(x) names(which.max(table(x))))
#### aggregate for all classes
all.classes <- melt(merged[c(2,4,5)],id.vars = "row.id")
fun.agg <- function(x) {
if(length(x)==0){
"" #<-- default value adaptation might be needed.
}else{
names(which.max(table(x)))
}
}
dcast(all.classes,row.id~variable, fun.aggregate=fun.agg,fill=NULL)
I first did the aggregation only for a single class. Doing this step for all classes requires another melt operation and makes the code pretty messy.
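A quick way to sanity-check the multi-label output against the held-out truth, assuming the final dcast result is stored in a data frame (the pred name is mine):
pred <- dcast(all.classes, row.id~variable, fun.aggregate=fun.agg, fill=NULL)
#rows of pred are ordered by row.id, matching test's org.row.id order
mean(pred$Species == as.character(test$Species)) #per-label accuracy
mean(pred$Class2 == as.character(test$Class2))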
