Generating a confusion matrix with the HTK HResults tool for the ICFHR handwriting recognition example

I am studying how the HTK tools work for handwriting recognition. Following the ICFHR–2010 TUTORIAL, I ran the examples for the "Spanish-Numbers" corpus and obtained the resulting HMMs (files stored in the folder hmm and listed in HMMsList), and res32.mlf with the recognition results produced by HVite. I also have the master label file SamplesRef.mlf.
Now I want to see statistics on the recognition results, i.e. I am studying the HResults tool.
When I run HResults as
HResults -I SamplesRef.mlf HMMsList res32.mlf
I see
====================== HTK Results Analysis =======================
Date: Tue Mar 31 15:21:11 2015
Ref : SamplesRef.mlf
Rec : res32.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=2, N=2]
WORD: %Corr=77.78, Acc=77.78 [H=7, D=0, S=2, I=0, N=9]
===================================================================
But if I add the -p option to get a confusion matrix, I see the following error message:
~/icfhr$ HResults -p -I SamplesRef.mlf HMMsList res32.mlf
ERROR [+3331] Index: Label millones not in list[0 of 19]
FATAL ERROR - Terminating program HResults
As I understand it, the message means that there is no HMM named "millones". I found that the samples in my res32.mlf look like this:
"'*'/210341.rec"
mil
seiscientos
cincuenta
y
siete
millones
.
If I use a text editor to turn res32.mlf into res33.mlf with content like:
"'*'/210341.rec"
m
i
l
s
e
i
s
c
i
... and so on.
and use samples.mlf (instead of SamplesRef.mlf), which looks like this inside:
"*/210341.lab"
m
i
l
#
q
u
i
n
i
e
n
t
o
s
#
c
... and so on.
I get the desired result:
~/icfhr$ HResults -p -I samples.mlf HMMsList res33.mlf
====================== HTK Results Analysis =======================
Date: Tue Mar 31 15:35:42 2015
Ref : samples.mlf
Rec : res33.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=2, N=2]
WORD: %Corr=79.63, Acc=77.78 [H=43, D=5, S=6, I=1, N=54]
------------------------ Confusion Matrix -------------------------
      a  c  d  e  i  l  m  n  o  s  t  u  v  y  Del [ %c / %e]
#     0  0  0  0  0  1  1  0  0  0  0  0  0  0    5 [ 0.0/3.7]
a     2  0  0  0  0  0  0  0  0  0  0  0  0  0    0
c     0  2  0  0  0  0  0  0  0  0  0  0  0  0    0
d     0  0  1  0  0  0  0  0  0  0  0  0  0  0    0
e     0  0  0  6  0  0  0  0  0  0  0  0  0  0    0
i     0  0  0  0  6  0  0  0  0  0  0  0  0  0    0
l     0  0  0  0  0  3  0  0  0  0  0  0  0  0    0
m     0  0  0  0  0  0  2  0  0  0  0  0  0  0    0
n     0  1  0  0  0  0  0  6  0  0  0  0  0  0    0 [85.7/1.9]
o     0  0  0  0  0  0  0  0  4  0  0  0  0  0    0
q     0  0  0  0  0  0  0  0  0  1  0  0  0  0    0 [ 0.0/1.9]
s     0  0  0  0  0  0  0  0  0  4  0  0  0  0    0
t     0  0  0  0  0  0  0  0  0  0  4  0  0  0    0
u     0  0  0  1  0  0  0  0  0  0  0  1  0  0    0 [50.0/1.9]
v     0  0  0  0  0  0  0  0  0  0  0  0  1  0    0
y     0  0  0  0  1  0  0  0  0  0  0  0  0  1    0 [50.0/1.9]
Ins   0  0  0  0  0  0  0  0  0  1  0  0  0  0
===================================================================
So, the main question is:
What is the simplest way (without a text editor) to adapt the MLF files so that HResults can produce a confusion matrix?
(I suppose I am missing some option of some HTK tool… but which tool and which option?)
Any useful ideas would be highly appreciated.

To use the -p option, you need to provide the list of class labels, not the list of your HMMs. For example, if you are trying to recognize the words Yes, No and Never, then your "HMMsList" file should be written as:
Yes
No
Never
regardless of the HMMs that actually constitute the words.
In other words, your "HMMsList" file should really be a "LabelsList".
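Such a label list does not have to be typed by hand. The following is only a minimal sketch in R (the language used elsewhere on this page), not an HTK feature: the helper name mlf_labels and the demo file are made up for illustration, and it assumes label-only MLF lines (no start/end times or scores), as in the excerpts shown in the question.

```r
# Hypothetical helper: collect the distinct labels used in an HTK master label file.
# Assumption: every transcription line holds just a label (no times or scores).
mlf_labels <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  keep <- nzchar(lines) &          # drop blank lines
    lines != "#!MLF!#" &           # drop the MLF header line
    lines != "." &                 # drop end-of-transcription markers
    !startsWith(lines, "\"")       # drop quoted file-name patterns
  sort(unique(lines[keep]))
}

# Demonstration on a temporary file that mimics a word-level MLF.
demo <- tempfile(fileext = ".mlf")
writeLines(c("#!MLF!#",
             "\"*/210341.lab\"", "mil", "seiscientos", "millones", ".",
             "\"*/210342.lab\"", "mil", "y", "."), demo)
labels <- mlf_labels(demo)
print(labels)

# Write one label per line, the layout described for the label list above.
writeLines(labels, file.path(tempdir(), "LabelsList"))
```

Running the helper on both the reference and the recognized MLF and taking the union of the results would cover every label that HResults can encounter; the resulting file would then take the place of HMMsList in the HResults -p command.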

Related

Genetic Algorithm in R: Specify number of 1s in binary chromosomes

I am using the rbga function, but my question still stands for other genetic algorithm implementations in R. Is there a way to specify the number of 1s in binary chromosomes?
I have the following example provided by the library documentation.
data(iris)
library(MASS)    # lda()
library(genalg)  # rbga.bin()
# Four real (scaled) iris features plus 36 columns of random noise
X <- as.data.frame(cbind(scale(iris[,1:4]), matrix(rnorm(36*150), 150, 36)))
Y <- iris[,5]
# Evaluation function: cross-validated LDA misclassification rate (lower is better)
iris.evaluate <- function(indices) {
  print("Chromosome")
  print(indices)
  print("================================")
  result = 1
  if (sum(indices) > 2) {
    huhn <- lda(X[,indices==1], Y, CV=TRUE)$posterior
    result = sum(Y != dimnames(huhn)[[2]][apply(huhn, 1,
                 function(x)
                 which(x == max(x)))]) / length(Y)
  }
  result
}
monitor <- function(obj) {
  minEval = min(obj$evaluations);
  plot(obj, type="hist");
}
woppa <- rbga.bin(size=40, mutationChance=0.05, zeroToOneRatio=10,
                  evalFunc=iris.evaluate, showSettings=TRUE, verbose=TRUE)
Here are some of the chromosomes.
"Chromosome"
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
"================================"
"Chromosome"
0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
"================================"
The numbers of 1s (i.e., the chosen characteristics) in these chromosomes are 5, 8, 5 and 4, respectively.
I am trying to follow the technique specified in a paper and they claim that they apply a genetic algorithm and in the end they pick a specific number of characteristics.
Is it possible to specify in a genetic algorithm the number of characteristics that I want my solution(s)/chromosome(s) to have?
Could this be done on the final solution/chromosome and if yes how?
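One possible approach, sketched here as an assumption rather than as the method of the paper mentioned above, is to leave the genetic algorithm unchanged and penalize, inside the evaluation function, every chromosome whose number of 1s differs from the target. The wrapper name with_cardinality_penalty, the target k and the penalty weight are hypothetical, and the sketch assumes that lower evaluation values are better, as with the misclassification rate in the example.

```r
# Hypothetical wrapper: add a penalty proportional to how far the number of 1s
# is from the desired count `k`. Assumes the optimizer minimizes the evaluation.
with_cardinality_penalty <- function(eval_fun, k, penalty = 1) {
  function(indices) {
    eval_fun(indices) + penalty * abs(sum(indices) - k)
  }
}

# Toy evaluation function standing in for a real one such as iris.evaluate.
toy_eval <- function(indices) 0.1

penalised <- with_cardinality_penalty(toy_eval, k = 5)
on_target  <- penalised(c(1, 1, 1, 1, 1, 0, 0, 0))  # five 1s: no penalty added
off_target <- penalised(c(1, 1, 1, 0, 0, 0, 0, 0))  # three 1s: penalty of 2 added
print(c(on_target, off_target))
```

The wrapped function could then be passed as evalFunc. A penalty only steers the search towards chromosomes with k ones; it does not guarantee that the final solution has exactly that count.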

Filling a table with additional columns if they don't exist

I have the following problem. Here is a short example of my data. Assume that I have two data sets (my real example has about 20). The data frames are returned as a list by a self-written function called with lapply, so I put the data frames in my example in a list, too. Then I "rbind" them to compute a frequency table.
df1 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df1) <- c("k", "a")
df2 <- data.frame(rev(seq(12:0)), paste0("a=",sample(0:12, 13, replace=T)))
colnames(df2) <- c("k", "a")
list_df <- list(df1,df2)
df_combine<- plyr::ldply(list_df, rbind)
freq_foo <- table(df_combine$k,df_combine$a)
I get a frequency table of the following form.
a=0 a=11 a=12 a=2 a=5 a=6 a=7 a=8 a=3 a=9
1 1 0 0 0 0 0 0 1 0 0
2 1 0 0 0 0 0 0 0 0 1
3 1 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 0 0 0
5 0 0 0 1 1 0 0 0 0 0
6 0 0 0 0 0 0 1 0 0 1
7 0 1 1 0 0 0 0 0 0 0
8 1 0 0 0 0 1 0 0 0 0
9 0 0 0 0 0 0 2 0 0 0
10 0 0 1 0 1 0 0 0 0 0
11 1 1 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 1 0 1 0
13 1 0 1 0 0 0 0 0 0 0
I want to extend and manipulate my table in the following way:
First, the table should cover the range a=0 to a=15, so any missing column should be added. Second, I want to order the columns from 0 to 15.
For the first problem I tried
if(freq_foo$paste0("a=",0:15) == F){freq_foo$paste("a=",0:15) <- 0}
but this would work only for data frames and not for tables. Also, I have no idea how to sort the columns in ascending order. The data type isn't important to me because I just want to use the output for further calculations, so it can also be a data frame instead of a table.
#convert freq_foo table to dataframe
df <- as.data.frame.matrix(freq_foo)
#add all zeros column for missing column name in 0:15 series
df[, paste0("a=", c(0:15)[!(c(0:15) %in% as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))])] <- 0
#order columns from 0 to 15
df <- df[, order(as.numeric(gsub(".*=(\\d+)", "\\1", names(df))))]
Output is:
a=0 a=1 a=2 a=3 a=4 a=5 a=6 a=7 a=8 a=9 a=10 a=11 a=12 a=13 a=14 a=15
1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0
5 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
7 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
8 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
10 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
11 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
12 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
13 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
(Edit: Updated code after getting a requirement clarification from OP)
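As a hedged alternative sketch (not part of the answer above), the missing columns and their order can also be fixed before counting: table() keeps every level of a factor, in level order, even when a level never occurs. The vectors k and a below are made-up sample data standing in for df_combine$k and df_combine$a.

```r
# Made-up sample data standing in for the combined data frame.
k <- c(1, 2, 3, 1)
a <- c("a=0", "a=11", "a=2", "a=0")

# Declaring every expected level up front makes table() emit one column per
# level, in this order, including levels that are absent from the data.
a_levels <- paste0("a=", 0:15)
freq <- table(k, factor(a, levels = a_levels))

print(dim(freq))
print(colnames(freq))
```

With the real data this would correspond to wrapping df_combine$a in factor(..., levels = a_levels) inside the existing table() call.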

Creating a larger matrix from smaller matrices in R

I have a series of text files in a folder called "Disintegration T1" which look like this:
> 1.txt
0 0 0 0 1
1 0 0 0 1
0 1 0 0 1
0 0 0 0 0
1 1 1 1 0
> 2.txt
0 1 1 0 1
0 0 1 1 1
1 1 0 1 1
1 1 1 0 1
0 0 0 0 1
> 3.txt
0 1 1 1
1 0 0 0
0 0 0 0
1 0 0 0
The files are all either 4x4 or 5x5. They must be read in as matrices, as the data is for social network analyses. My goal is to automate the process of putting these matrices into one larger matrix, so that they sit along its diagonal and the remaining cells are filled with 0s. In this case the final result would look like:
> mega_matrix
0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 0 1 1 1 0 0 0 0
0 0 0 0 0 1 1 0 1 1 0 0 0 0
0 0 0 0 0 1 1 1 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0
Thank you!
You want bdiag from the Matrix package:
library(Matrix)
bdiag(matrix1, matrix2, matrix3)
And to do the whole directory (thanks to @user20650 in the comments):
bdiag(lapply(dir(), function(x){as.matrix(read.table(x))}))
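For illustration, the same block-diagonal layout can be sketched in base R, which also makes the file selection and ordering explicit. This is only a sketch under assumptions: block_diag is a hypothetical helper, not a function of the Matrix package, and the demo files are made up and written to a temporary folder. It returns an ordinary dense matrix, whereas bdiag returns a sparse one.

```r
# Hypothetical helper: place matrices along the diagonal of a zero matrix.
block_diag <- function(mats) {
  n_rows <- vapply(mats, nrow, integer(1))
  n_cols <- vapply(mats, ncol, integer(1))
  out <- matrix(0, sum(n_rows), sum(n_cols))
  row_off <- 0
  col_off <- 0
  for (i in seq_along(mats)) {
    out[row_off + seq_len(n_rows[i]), col_off + seq_len(n_cols[i])] <- mats[[i]]
    row_off <- row_off + n_rows[i]
    col_off <- col_off + n_cols[i]
  }
  out
}

# Demonstration in a temporary folder with two small made-up files.
folder <- file.path(tempdir(), "block_diag_demo")
dir.create(folder, showWarnings = FALSE)
writeLines(c("0 1", "1 0"), file.path(folder, "1.txt"))
writeLines(c("0 0 1", "1 0 0", "0 1 0"), file.path(folder, "2.txt"))

# Keep only .txt files and sort them by numeric stem (so 10.txt follows 9.txt).
files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
files <- files[order(as.numeric(sub("\\.txt$", "", basename(files))))]

mats <- lapply(files, function(f) as.matrix(read.table(f)))
mega_matrix <- block_diag(mats)
print(mega_matrix)
```

Sorting by the numeric part of the file name matters once there are ten or more files, because a plain alphabetical listing would place 10.txt before 2.txt.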

Rarefaction in specaccum producing errors for some data but not others of the same layout

I am trying to produce rarefied species accumulation curves for two different habitat types with the rarefaction method of the specaccum function in the vegan package, using the code:
spa <- specaccum(Example, method = "rarefaction")
The function works for one data set but not the other where it produces this error:
Error in rarefy(t(freq), ind[i], se = TRUE) :
function accepts only integers (counts)
However, the data is laid out exactly the same as the other data set and when I investigate the data frame it says that all the data for the species is in integer form.
Here is a much smaller section of the data frame:
Site AA AB AC AD AE AF AG AH AI AJ AK AL
1.1 0 0 0 0 1 0 0 2 0 0 0 0
1.2 1 0 0 0 0 0 0 0 0 0 0 0
1.3 0 0 1 0 0 0 0 0 0 0 0 0
2.1 1 0 0 0 1 0 0 0 0 0 0 0
2.2 0 1 0 0 0 0 0 0 0 0 0 0
2.3 0 0 0 0 0 0 0 1 0 0 1 1
3.2 0 2 1 0 1 0 0 0 0 0 0 0
3.3 0 0 0 0 2 0 0 3 0 0 0 0
4.1 0 0 0 0 0 0 1 1 0 0 0 0
4.2 0 0 0 0 0 0 0 0 0 0 0 0
4.3 0 0 0 0 1 0 0 1 0 0 0 0
5.1 0 1 0 0 0 1 0 3 0 0 0 0
5.2 0 0 1 0 2 0 0 1 0 0 0 0
5.3 0 0 0 1 3 2 0 4 0 0 0 0
6.1 0 0 0 0 0 0 0 0 0 0 0 0
6.2 0 2 2 0 0 2 0 0 0 0 0 0
6.3 0 0 0 0 0 0 0 0 0 0 0 0
Sites being sampling points within the habitat site, letters denoting species. I am unsure as to why it is working for one set of data but not the other when they are both laid out like this. Can someone help me understand?
Thanks
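One way to narrow this down, offered only as a diagnostic sketch and not as a confirmed explanation of the error, is to list the columns of the data frame that contain anything other than whole numbers. The helper name non_integer_columns and the small example data frame are made up for illustration.

```r
# Hypothetical helper: names of columns that are non-numeric or hold
# values that are not whole numbers.
non_integer_columns <- function(df) {
  is_bad <- vapply(df, function(col) {
    !is.numeric(col) || any(col != round(col), na.rm = TRUE)
  }, logical(1))
  names(df)[is_bad]
}

# Made-up example in which the site identifier is stored as a data column.
example <- data.frame(Site = c(1.1, 1.2, 1.3),
                      AA   = c(0, 1, 0),
                      AB   = c(0, 0, 2))
bad <- non_integer_columns(example)
print(bad)
```

In this made-up example only the Site column is reported, because identifiers such as 1.1 are not whole numbers. If the same happened with the real data, the identifier column would need to be dropped or moved to the row names before the data is passed on.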

How do you silently save an inspect object in R's tm package?

When I save the inspect() object in R's tm package it prints to screen. It does save the data that I want in the data.frame, but I have thousands of documents to analyze and the printing to screen is eating up my memory.
library(tm)
data("crude")
matrix <- TermDocumentMatrix(crude,control=list(removePunctuation = TRUE,
                                                stopwords=TRUE))
out= data.frame(inspect(matrix))
I have tried every trick that I can think of. capture.output() changes the object (not the desired effect), as does sink(). dev.off() does not work. invisible() does nothing. suppressWarnings(), suppressMessages(), and try() unsurprisingly do nothing. There are no silent or quiet options in the inspect command.
The closest that I can get is
out= capture.output(inspect(matrix))
out= data.frame(out)
which notably does not give the same data.frame, but pretty easily could be if I need to go down this route. Any other (less hacky) suggestions would be helpful. Thanks.
Windows 7
64-bit R-3.0.1
tm package is the most recent version (0.5-9.1).
Assign inside the capture then:
capture.output(out <- data.frame(inspect(matrix))) -> .null # discarding this
But really, inspect is for visual inspection, so maybe try
as.data.frame(as.matrix(matrix))
instead (btw matrix is a very unfortunate name for a variable, as that's a base function).
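The assign-inside-the-capture pattern can be tried without tm. The sketch below uses a made-up function, noisy, that prints to the console and also returns a data frame, standing in for inspect():

```r
# Made-up stand-in for a function that both prints and returns a value.
noisy <- function() {
  print("printing...")
  data.frame(x = 1:3)
}

# The assignment happens inside capture.output(), so `out` receives the return
# value while the printed text is diverted into `printed` instead of the console.
printed <- capture.output(out <- noisy())

str(out)
```

Here out holds the data frame unchanged, and printed holds the text that would otherwise have gone to the screen and can simply be ignored.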
Using this input (variable name changed from your question, as using a variable named "matrix" can be confusing):
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude,control=list(removePunctuation = TRUE,
stopwords=TRUE))
Then this will avoid printing to screen
m <- as.matrix(tdm)
and then I would personally do something like
require(data.table)
data.table(m, keep.rownames=TRUE)
# rn 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
# 1: 100000 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
# 2: 108 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 3: 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 4: 115 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 5: 12217 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
# ---
# 996: yesterday 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0
# 997: yesterdays 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# 998: york 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
# 999: zero 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 1000: zone 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0

Resources