Hidden Markov models package in R

I need some help implementing an HMM module in R. I'm new to R and don't have much knowledge of it.
I have to implement an information extraction (IE) system using an HMM. I have two folders of files: one with the sentences and the other with the corresponding tags I want to learn from each sentence.
folder1 > event1.txt: "2013 2nd International Conference on Information and Knowledge Management (ICIKM 2013) will be held in Chengdu, China during July 20-21, 2013."
folder2 > event1.txt:
"N: 2nd International Conference on Information and Knowledge Management (ICIKM 2013)
D: July 20-21, 2013
L: Chengdu, China"
N -> Name; D -> Date; L -> Location
My question is how to implement this in R: how do I initialize the model, and how do I train it? And then how do I apply it to an arbitrary sentence to extract the information?
Thanks in advance for all the help!

If you run the following command:
RSiteSearch('hidden markov model')
Then it finds 4 Task Views, 40 Vignettes, and 255 functions (as of when I ran it; there could be more by the time you run it).
I would suggest looking through those results (probably start with the views and vignettes) to see if anything there works for you. If not, then tell us what you have tried and what you need that is not provided there.

I'm not sure exactly what you want to do, but you might find this excellent tutorial on hidden Markov models using R useful. You build the functions and models from scratch, starting from regular Markov models and then moving on to hidden Markov models. That is really valuable for understanding how they work.
There is also the R package depmixS4 for specifying and fitting hidden Markov models. Its documentation is pretty solid, and going through the example code might help you.
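For a feel of the depmixS4 workflow, here is a minimal sketch of specifying and fitting a 2-state HMM on made-up discrete data (the data frame and its obs column are hypothetical, not from your task):
library(depmixS4)
set.seed(1)
d <- data.frame(obs = factor(sample(c("a", "b", "c"), 200, replace = TRUE)))  # dummy observations
mod <- depmix(obs ~ 1, data = d, nstates = 2, family = multinomial())  # specify a 2-state HMM
fm <- fit(mod, verbose = FALSE)  # estimate by EM
summary(fm)  # transition and emission parameters
head(posterior(fm))  # most likely hidden state per observation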

depmixS4 is the most general and a reasonably good package, if you can get it to work on your data. It checked out on dummy data for me but gave an error on my real data. The HMM package also works, but only if you have discrete variables rather than continuous ones.
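For discrete observations, a minimal sketch with the HMM package looks like this (the states and symbols here are made up for illustration):
library(HMM)
hmm <- initHMM(States = c("X", "Y"), Symbols = c("a", "b"), startProbs = c(0.5, 0.5))  # initialize
obs <- c("a", "b", "b", "a")
trained <- baumWelch(hmm, obs, maxIterations = 50)$hmm  # train with Baum-Welch (EM)
viterbi(trained, obs)  # most likely hidden state sequence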

depmixS4 is what you are looking for.
First of all, you need to identify the best number of hidden states for your model. This can be done by fitting models with different numbers of hidden states and taking the one with the lowest AIC.
I have created a function HMM_model_execution which returns the fitted model and the number of states for the best model.
library(depmixS4)

# doc_data: first column should be the visible state, remaining columns external variables
# k: largest number of hidden states to compare
HMM_model_execution <- function(doc_data, k)
{
  aic_values <- vector(mode = "numeric", length = k - 1)  # to store AIC values
  for (i in 2:k)
  {
    print(paste("loop counter", i))
    mod <- depmix(response = doc_data$numpresc ~ 1, data = doc_data, nstates = i)
    fm <- fit(mod, verbose = FALSE)
    aic_values[i - 1] <- AIC(fm)
  }
  min_index <- which.min(aic_values)
  # the number of hidden states for the best model is min_index + 1
  mod <- depmix(response = doc_data$numpresc ~ 1, data = doc_data, nstates = (min_index + 1))
  fm <- fit(mod, verbose = FALSE)  # refit the best model
  print(paste("best model with number of hidden states", min_index + 1))
  return(list(fm, min_index + 1))
}
External variables (covariates) can be passed in the depmix() call. summary(fm) will give you all the model parameters.
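Hypothetical usage, assuming doc_data is a data frame whose numpresc column holds the observed variable (the column name comes from the code above):
out <- HMM_model_execution(doc_data, k = 6)  # compare models with 2 to 6 hidden states
fm_best <- out[[1]]   # fitted model with the lowest AIC
n_states <- out[[2]]  # chosen number of hidden states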

Estimation to plot person-item map not feasible because items "have no 0-responses" in data matrix

I am trying to create a person-item map that organizes the questions from a dataset in order of difficulty. I am using the eRm package, and the output should look like this:
[person-item map](https://hansjoerg.me/post/2018-04-23-rasch-in-r-tutorial_files/figure-html/unnamed-chunk-3-1.png)
As one of the steps before running the function that outputs the map, I have to fit the dataset to obtain the object that the plotting function uses to create the actual map, but I am getting an error at that step.
I have already tried to follow and review some documentation that might be useful if you want extra information:
[Tutorial](https://hansjoerg.me/2018/04/23/rasch-in-r-tutorial/#plots)
[Plotting function](https://rdrr.io/rforge/eRm/man/plotPImap.html)
[Documentation](https://eeecon.uibk.ac.at/psychoco/2010/slides/Hatzinger.pdf)
Now, this is the code that I am using. First, I install and load the respective libraries and the data:
> library(eRm)
> library(ltm)
Loading required package: MASS
Loading required package: msm
Loading required package: polycor
> library(difR)
Then I fit the PCM to generate the object of class Rm, and here is the error:
*The PCM function is used here because it is specific to polytomous data; if I use a different one, the output says that I am not using a dichotomous dataset.
> res <- PCM(my.data)
Warning:
The following items have no 0-responses:
AUT_10_04 AUN_07_01 AUN_07_02 AUN_09_01 AUN_10_01 AUT_11_01 AUT_17_01
AUT_20_03 CRE_05_02 CRE_07_04 CRE_10_01 CRE_16_02 EFEC_03_07 EFEC_05
EFEC_09_02 EFEC_16_03 EVA_02_01 EVA_07_01 EVA_12_02 EVA_15_06 FLX_04_01
... [rest of items]
Responses are shifted such that lowest category is 0.
Warning:
The following items do not have responses on each category:
EFEC_03_07 LC_07_03 LC_11_05
Estimation may not be feasible. Please check data matrix
I must clarify that the whole dataset has a range from 1 to 5; it is a polytomous Likert dataset.
Finally, when I try to use the plot function, it does not produce any output; the system just keeps loading ad infinitum with no response:
> plotPImap(res, sorted=TRUE)
I would like to add the description of that particular function and the arguments:
>PCM(X, W, se = TRUE, sum0 = TRUE, etaStart)
X: Input data matrix or data frame with item responses (starting from 0); rows represent individuals, columns represent items. Missing values are inserted as NA.
W: Design matrix for the PCM. If omitted, the function will compute W automatically.
se: If TRUE, the standard errors are computed.
sum0: If TRUE, the parameters are normed to sum-0 by specifying an appropriate W. If FALSE, the first parameter is restricted to 0.
etaStart: A vector of starting values for the eta parameters can be specified. If missing, the 0-vector is used.
I do not understand why it is necessary for the scores to begin from 0; I think that is what the error is trying to say, but I don't quite understand the output.
I would highly appreciate any hint you can provide.
Feel free to ask for any information that could be useful in reaching a solution to this issue.
The problem is not caused by the fact that there are items with no 0-responses. The model automatically corrects this by centering the response scale categories on zero. (You'll notice that the PI-map you linked to is centered on zero. Also, I believe the map you linked to is of dichotomous data; polytomous data should include the scale categories on the PI-map.)
Without being able to see your data, it is impossible to know the exact cause though.
It may be that the model is not converging. That may be what this error was alluding to: Estimation may not be feasible. Please check data matrix. You could check by entering > res at the prompt. If the model was able to converge you should see something like:
Conditional log-likelihood: -2.23709
Number of iterations: 27
Number of parameters: 8
...
Does your data contain answers with decimal numbers? I ran into the same error and solved it by using the dplyr::dense_rank() function:
df_ranked <- sapply(df_decimal_data, dplyr::dense_rank)
Worked.
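To see what dense_rank() does to decimal responses, a quick illustration:
library(dplyr)
dense_rank(c(1.2, 3.4, 1.2, 5.0))  # returns 1 2 1 3: decimals become consecutive integer categories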

package "fdapace" (R) - How to access the principal components of the functional principal component analysis

After applying the FPCA() function of the "fdapace" package to a dataset, the function returns an FPCA object with various values and fields. Unfortunately, I don't know which of those fields contain the principal components, or how to access or plot them. I know there is documentation for the package, but as a beginner it doesn't really help me (no criticism intended). You can find the documentation here: fdapace.pdf
The estimates of the functional principal component (FPC) scores are saved in xiEst in the result list: a matrix in which each row holds the FPC scores for one subject in the data. You can make whatever plots you want with this information. See the following for an example.
res = FPCA(Ly, Lt)
res$xiEst # This is the matrix containing the FPC estimates.
Plotting the first eigenfunction:
workGrid = res$workGrid  # grid of time points on which the functions are evaluated
phi1 = res$phi[, 1]      # first estimated eigenfunction
plot(workGrid, phi1)
Plotting the mean function:
mu = res$mu              # estimated mean function
workGrid = res$workGrid
plot(workGrid, mu)
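If you want a self-contained example to experiment with, here is a minimal sketch with simulated data (the simulation is mine, not from the package documentation):
library(fdapace)
set.seed(1)
tGrid <- seq(0, 1, length.out = 50)
# simulate 20 curves as random combinations of a sine and a cosine
Ly <- lapply(1:20, function(i) rnorm(1) * sin(2 * pi * tGrid) + rnorm(1) * cos(2 * pi * tGrid))
Lt <- lapply(1:20, function(i) tGrid)  # all curves observed on the same grid
res <- FPCA(Ly, Lt)
head(res$xiEst)  # FPC scores, one row per subject
matplot(res$workGrid, res$phi, type = "l")  # all estimated eigenfunctions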

R+Tableau connection: Using linear regression & Relaimpo package; Working in R but not in connection

I am applying a linear regression model to data and using the relaimpo package to find the most significant factors.
When I run the following code in R, it works fine:
library(readxl)
nba <- read_excel("XXXX")
View(nba)
library(relaimpo)
rec = lm(won ~ o_fgm + o_ftm + o_pts , data = nba)
x= calc.relimp(rec, type = c("lmg"), rela = TRUE, rank = TRUE)
x$lmg
I get output of:
o_fgm o_ftm o_pts
0.3374366 0.2628543 0.3997091
When connecting via Tableau I use the following code:
SCRIPT_REAL("
won=.arg1
o_fgm=.arg2
o_ftm=.arg3
o_pts=.arg4
library(relaimpo)
rec = lm(won ~ o_fgm + o_ftm + o_pts)
x= calc.relimp(rec, type = c('lmg'), rela = TRUE, rank = TRUE)
"
,MEDIAN([Won]),MEDIAN([O Fgm]),MEDIAN([O Ftm]),MEDIAN([O Pts]))
I am getting the following error:
An error occurred while communicating with the RServe service.
Error in calc.relimp.default.intern(object = structure(list(won = 39, : Too few complete observations for estimating this model
I have run it with just the regression and it works fine, so the issue seems to be with the relaimpo package. There is limited documentation online for this package, so I cannot find a fix; any help is really appreciated, thanks!
Data is from kaggle at https://www.kaggle.com/open-source-sports/mens-professional-basketball
(the "basketball_teams.csv" file)
When Tableau calls R or Python using the SCRIPT_REAL() function, or any SCRIPT_XXX() function, it is using what Tableau calls a table calculation. This has the effect of passing R one or more vectors -- and receiving back vector results -- instead of calling the function once for each scalar cell.
However, you are responsible for specifying how to partition your aggregate results into vectors, and how to order the rows in the vectors you send to R or Python. You do that by specifying the "partitioning" and "addressing" of each table calc via the Edit Table Calc command (right click on a calc field).
So the most likely issue is that you are sending R less data than you expect, perhaps many short vectors instead of the one long one you intend. Read about table calcs, partitioning, and addressing in the online help. You specify partitioning in particular by the choice of which dimensions are not set to "compute using" (a synonym for addressing dimensions). The Table Calc editor gives you some visible feedback as you try different settings; I recommend using specific dimensions in most cases.
For table calcs, the choice of partitioning and addressing is as important as the actual formula.
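You can reproduce the R-side failure directly: the error message above shows that R received a single value (won = 39), and a regression cannot be estimated from one row. A hypothetical illustration with made-up numbers:
library(relaimpo)
one_row <- data.frame(won = 39, o_fgm = 100, o_ftm = 50, o_pts = 250)  # one aggregated row per partition
try(calc.relimp(lm(won ~ o_fgm + o_ftm + o_pts, data = one_row), type = "lmg"))
# fails with "Too few complete observations for estimating this model", matching the error above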

iterating a coxph() model using various sets of covariates

I'm still a little new to R, so this may be a basic question.
I am looking for risk estimates for a joint Cox model using coxph(). I have to iterate the model about 60 times using various combinations of variables. Since each iteration of the model will have different covariates (and main exposures), I want to write one function to do it all. In the age-adjusted model I had just the main exposure, and everything ran fine. I can add the covariates and it runs... I just need a way to write a single function where "covars" can be whatever I put into the function call.
Note: this is a simplified version that runs just fine; I just want to make it work without writing out 60 unique variations of it.
subtype <- function(expo, covars){
  temp <- coxph(Surv(FAIL, OUTCOME) ~ joint[[expo]]*strata(EVENT2)+
                covars+
                cluster(ID)+strata(AGE_INT),
              na.action=na.exclude,
              data=joint)
  return(summary(temp))
}
results <- subtype("RACE", covars=...)
results2 <- subtype("GENDER", covars=...)
When I did this macro programming in SAS, it was easy.
Thank you for your help.
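One possible approach (a sketch, not from the original post): build the model formula as a string and convert it with as.formula(), so the covariates can vary per call. The covariate names in the example call are hypothetical:
library(survival)
subtype <- function(expo, covars){
  f <- as.formula(paste0("Surv(FAIL, OUTCOME) ~ ", expo, "*strata(EVENT2) + ",
                         paste(covars, collapse = " + "),
                         " + cluster(ID) + strata(AGE_INT)"))
  temp <- coxph(f, na.action = na.exclude, data = joint)
  return(summary(temp))
}
results <- subtype("RACE", covars = c("GENDER", "SMOKING"))  # hypothetical covariate names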

Accessing class values in R's poLCA

I am trying my hand at learning latent class analysis, while also learning R. I'm using the poLCA package, and am having a bit of trouble accessing the attributes. I can run the sample code just fine:
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds = within(ds, (cesdcut = ifelse(cesd>20, 1, 0)))
library(poLCA)
res2 = poLCA(cbind(homeless=homeless+1,
cesdcut=cesdcut+1, satreat=satreat+1,
linkstatus=linkstatus+1) ~ 1,
maxiter=50000, nclass=3,
nrep=10, data=ds)
but in order to make this more useful, I'd like to access the attributes within the object created by poLCA, like this:
attr(res2, 'Nobs')
attr(res2, 'maxiter')
but they both come up as 'Null'. I expect Nobs to be 453 (determined by the function) and maxiter to be 50000 (dictated by my input value).
I'm sure I'm just being naive, but I could use any help available. Thanks a lot!
Welcome to R. You've got the model-fitting syntax right, in that you can get a model out (I don't know how latent class analysis works, so I can't speak to the statistical validity of your result). However, you've mixed up the different ways in which R can store information pertaining to a model.
poLCA returns an object of class poLCA, which is a list containing the following elements:
(...)
Nobs: number of fully observed cases (less than or equal to N).
maxiter: maximum number of iterations through which the estimation algorithm was set to run.
Since it's a list, you can extract individual elements from your model object using the $ operator:
res2$Nobs # number of observations
res2$maxiter # maximum iterations
In some cases, there might be extractor functions to get this information without having to do low-level indexing. For example, many model-fitting functions will have a fitted method, which pulls out the vector of fitted values on the training data; and similarly residuals pulls out the vector of residuals. You should check whether there are such extractor functions provided by the poLCA package and use them if possible; that way, you're not making assumptions about the structure of the model object that might be broken in the future.
This is distinct from getting the attributes of an object, which is what attr is for. Attributes in R are what you might call metadata: they contain R-specific information about an object itself, rather than information about whatever the object represents. Examples of common attributes include class (the class of an object), dim (the dimensions of an array or matrix), names (the names of individual elements of a vector/list/array), and so on.
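To see the distinction concretely on the fitted object:
names(res2)  # the list elements, extracted with $
attributes(res2)  # the metadata: here the element names and the class "poLCA"
attr(res2, "class")  # same as class(res2)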
