How do you define sum and product of weighted finite state transducers?

So reading through this paper:
http://www.cs.nyu.edu/~mohri/pub/fla.pdf
I see that a weighted finite state transducer (WFST) is defined over a semiring, and many operations on WFSTs can be expressed in terms of the "sum" and "product" of that semiring. For example, the composition of transducers T1 and T2 is:
$(T_1 \circ T_2)(x, y) = \bigoplus_{z \in \Delta^*} T_1(x, z) \otimes T_2(z, y)$
But I can't seem to find an explanation of how to compute the pure sum and product of WFSTs, and I am having trouble backing those operations out of the composition example above.
A demonstration over this example would be much appreciated:
format: state1 state2, input symbol : output symbol, transition probability
T1
0 1 a : b, 0.1
0 2 b : b, 0.2
2 3 b : b, 0.3
0 0 a : a, 0.4
1 3 b : a, 0.5
T2
0 1 b : a, 0.1
1 2 b : a, 0.2
1 1 a : d, 0.3
1 2 a : c, 0.4
Example taken from: How to perform FST (Finite State Transducer) composition
--------------- update ------------
Found the answer in this document: http://www.cs.nyu.edu/~mohri/pub/hwa.pdf
page 12
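For reference, the definitions given there are, up to notation (my transcription, so double-check against page 12): the sum (union) is pointwise,
$(T_1 \oplus T_2)(x, y) = T_1(x, y) \oplus T_2(x, y)$
and the product (concatenation) sums over all ways of splitting the input and output strings,
$(T_1 \otimes T_2)(x, y) = \bigoplus_{x = x_1 x_2,\; y = y_1 y_2} T_1(x_1, y_1) \otimes T_2(x_2, y_2)$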

Related

R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I post my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell, and zoneID is an area identifier (note: in the real dataset I have up to 40,000 zoneIDs):
rank <- seq(0.1, 1, 0.1)
Esvalue <- seq(10, 1)
zoneID <- rep(seq.int(1, 2), times = 5)
df <- data.frame(rank, Esvalue, zoneID)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value: for each rank, including all lower ranks, the cumulative ES value as a share of the total ES value of all zones:
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone: for each rank, including all lower ranks, the cumulative Esvalue as a share of the total Esvalue of each zoneID. I tried this using mutate() from dplyr, but so far it only gives me the cumulative sum, not the share. In the end this will generate a variable for each zoneID:
library(dplyr)
df %>%
  mutate(cA = cumsum(ifelse(!is.na(zoneID) & zoneID == 1, Esvalue, 0))) %>%
  mutate(cB = cumsum(ifelse(!is.na(zoneID) & zoneID == 2, Esvalue, 0)))
I want to combine these two variables by
1) calculating the absolute difference between each zone's cumulative share and the overall cumulative share, for all zoneIDs
2) for each rank, taking the mean of those absolute differences over all zoneIDs
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
I created the last column using some intermediate steps in Excel, but my final dataset will be way too big to be handled by Excel. Any advice on how to proceed would be appreciated.
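One approach that avoids Excel entirely (a minimal sketch, untested on the full data, and assuming the df built above): compute the overall cumulative share once, build one carried-forward cumulative share per zone, and average the absolute differences row-wise.
library(dplyr)
df <- df %>%
  arrange(rank) %>%
  mutate(total_share = cumsum(Esvalue) / sum(Esvalue))
# one cumulative share per zone, carried across every rank
zones <- unique(df$zoneID)
zone_shares <- sapply(zones, function(z) {
  cumsum(ifelse(df$zoneID == z, df$Esvalue, 0)) / sum(df$Esvalue[df$zoneID == z])
})
# mean absolute difference between each zone's share and the overall share
df$mean_abs_diff <- rowMeans(abs(zone_shares - df$total_share))
With 40,000 zoneIDs this builds a dense ranks-by-zones matrix, so a long-format data.table variant may scale better, but the logic is the same.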

Iterate in data table by factor and insert result in another dataframe

I have the following data.table:
> total.dt[,list(first, sched, platform, CCR, speedup)]
first sched platform CCR speedup
1: mult static_hlfet 1 0.1 1.000000
2: mult static_mcp 1 0.1 1.000000
3: mult static_eft 1 0.1 1.000000
4: mult static_lheft 1 0.1 1.000000
5: mult greedy 1 0.1 1.000000
---
1634: gen64 static_eft 64 10.0 9.916995
1635: gen64 static_lheft 64 10.0 8.926877
1636: gen64 greedy 64 10.0 5.235970
1637: gen64 Horizon-8 64 10.0 11.523087
1638: gen64 Horizon-1 64 10.0 9.896009
I want to find out how many times each sched is better than every other sched when the fields first, platform and CCR are equal, and group these counts by sched.
First I create all combinations of the groups in which I do the comparison:
setkey(total.dt, first, platform, CCR)
comb <- unique(total.dt[,list(first, platform, CCR)])
Now I can get a group where I can do the comparison:
d <- total.dt[comb[n,], list(first, platform, CCR, sched, speedup)]
> print (d) # if n equals 1
first platform CCR sched speedup
1: mult 1 0.1 static_hlfet 1
2: mult 1 0.1 static_mcp 1
3: mult 1 0.1 static_eft 1
4: mult 1 0.1 static_lheft 1
5: mult 1 0.1 greedy 1
6: mult 1 0.1 Horizon-8 1
7: mult 1 0.1 Horizon-1 1
And now I have to count how many times each sched wins against the others (has a bigger speedup), loses, or draws. I have to store this in a data frame with 5 columns (first, second, win, lose, draw), repeat the operation for every row in comb, and accumulate the numbers in that second data frame.
And here I'm a bit lost, because I do not understand how to do this and how to store the result.
I'd appreciate any help, and sorry if this kind of question is not appropriate for SO.
UPD.
Minimal example.
I have the following data:
d <- expand.grid(first = c("heat", "lu"),
                 sched = c("eft", "mcp"),
                 CCR = c(0.1, 1), platform = c(1, 2))
d$speedup <- 1:16
I want to get the following result:
res <- data.frame(first = c("eft", "mcp"),
                  win = c(0, 8), lose = c(8, 0), draw = c(0, 0),
                  second = c("mcp", "eft"))
How do I calculate this? First I take the rows where first="heat", platform=1, CCR=0.1. There are two such rows: the first has sched=eft, speedup=1; the second has sched=mcp, speedup=3. This means mcp wins. In the data.frame res we increase the win counter in the row where first=mcp, second=eft, and we increase the lose counter in the row where first=eft, second=mcp.
Then I take the next rows one by one from data frame d and repeat the procedure, filling the res data frame.
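One possible data.table approach (a sketch, checked only against this minimal example; the column names are taken from the question): join the table to itself on the grouping fields, drop the self-pairs, and count wins, losses and draws per ordered pair of scheds.
library(data.table)
dt <- as.data.table(d)
# pair every sched with every other sched inside each (first, platform, CCR) group
pairs <- merge(dt, dt, by = c("first", "platform", "CCR"),
               allow.cartesian = TRUE)[sched.x != sched.y]
# count wins, losses and draws for each ordered pair of scheds
res <- pairs[, .(win  = sum(speedup.x > speedup.y),
                 lose = sum(speedup.x < speedup.y),
                 draw = sum(speedup.x == speedup.y)),
             by = .(s1 = sched.x, s2 = sched.y)]
setnames(res, c("s1", "s2"), c("first", "second"))
On the minimal example this should reproduce res above, with mcp winning all 8 comparisons against eft.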

R - association rules - apriori

I'm running the apriori algorithm like this:
library(arules)
rules <- apriori(dt)
inspect(rules)
where dt is my data.frame with this format:
> head(dt)
Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1
The idea of the dataset is to capture the customer and whether he/she bought three different items (T, C and B) on a particular purchase. For example, based on the information above, we can see that C1 bought C and B, customers C2 to C5 bought only C, and customer C6 bought C and B.
The output is the following:
lhs rhs support confidence lift
1 {} => {T=0} 0.90 0.9000000 1.0000000
2 {} => {C=1} 0.91 0.9100000 1.0000000
3 {B=0} => {T=0} 0.40 0.8163265 0.9070295
4 {B=0} => {C=1} 0.40 0.8163265 0.8970621
5 {B=1} => {T=0} 0.50 0.9803922 1.0893246
6 {B=1} => {C=1} 0.51 1.0000000 1.0989011
My questions are:
1) How can I get rid of rules where T, C or B are equal to 0? If you think about it, the rule {B=0} => {T=0}, or even {B=1} => {T=0}, doesn't really make sense.
2) I was reading about the apriori algorithm, and in most of the examples each line represents the actual transaction, so in my case it should be something like:
C, B
C
C
C
C
C, B
instead of my sets of ones and zeros. Is that required, or can I still work with my format?
Thanks
I'm not sure what the aim of the program is supposed to be, but the aim of the Apriori algorithm is, first, to extract frequent itemsets from given data, where a frequent itemset is a set of items that appears together in the data sufficiently often; and second, to generate association rules from those extracted frequent itemsets. An association rule looks, for example, like this:
B -> C
In the stated case this means that customers who bought B also buy C with a certain probability, where the probability is governed by the support and confidence levels of the Apriori algorithm. The support level regulates the number of frequent itemsets and the confidence level the number of association rules. Association rules above the confidence level are called strong association rules.
Against this backdrop I do not understand why the Apriori algorithm is used to determine whether a customer bought different articles; that could be answered by an if statement. The provided output also makes little sense in this context: the third line, for example, says that if a customer does not buy B, then he does not buy T, with a support of 40% and a confidence of 81.6%. Apart from that, association rules do not carry a support of their own; only the association rule B -> C is correct, but its confidence value is wrong.
Nevertheless, if the aim is to generate the described association rules, the original Apriori algorithm cannot operate on input in this format:
> head(dt)
Cus T C B
1: C1 0 1 1
2: C2 0 1 0
3: C3 0 1 0
4: C4 0 1 0
5: C5 0 1 0
6: C6 0 1 1
The uncustomized Apriori algorithm needs a dataset in this format:
> head(dt)
C1: {B, C}
C2: {C}
C3: {C}
C4: {C}
C5: {C}
C6: {B, C}
I see two solutions: either reformat the input beforehand, or customize the Apriori algorithm to this format (which is arguably just a change of the input format inside the algorithm). To clarify the need for the stated input format, here is the Apriori algorithm in a nutshell with the provided data:
Support level = 0.3
Confidence level = 0.3
Number of customers = 6
Total number of B's bought = 2
Total number of C's bought = 6
Support of B = 2 / 6 ≈ 0.33 >= 0.3 = support level
Support of C = 6 / 6 = 1 >= 0.3 = support level
Support of B, C = 2 / 6 ≈ 0.33 >= 0.3 = support level
-> Frequent itemsets = {B, C, BC}
-> Association rules = {B -> C}
Confidence of B -> C = 2 / 2 = 1 >= 0.3 = confidence level
-> Strong association rules = {B -> C}
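In R, one way to get both the transaction format and the restriction to purchased items (an untested sketch, using the dt from the question; the column subset and thresholds are my choices, matching the example above) is to coerce the 0/1 columns to logical before building transactions, so that only actual purchases (TRUE) become items and no X=0 rules can appear:
library(arules)
# coerce the 0/1 item columns to logical; the "transactions" coercion
# then treats each TRUE as the presence of that item
items <- as.data.frame(lapply(dt[, c("T", "C", "B")], as.logical))
trans <- as(items, "transactions")
rules <- apriori(trans, parameter = list(support = 0.3, confidence = 0.3))
inspect(rules)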
Hope this helps.

coxph() X matrix deemed to be singular;

I'm having some trouble using coxph(). I have two categorical variables, "tecnologia" and "pais", and I want to evaluate the possible interaction effect of "pais" on "tecnologia". "tecnologia" is a factor with 2 levels, gps and convencional, and "pais" has 2 levels, PT and ES. I have no idea why this warning keeps appearing.
Here's the code and the output:
cox_AC<-coxph(Surv(dados_temp$dias_seg,dados_temp$status)~tecnologia*pais,data=dados_temp)
Warning message:
In coxph(Surv(dados_temp$dias_seg, dados_temp$status) ~ tecnologia * :
X matrix deemed to be singular; variable 3
> cox_AC
Call:
coxph(formula = Surv(dados_temp$dias_seg, dados_temp$status) ~
tecnologia * pais, data = dados_temp)
coef exp(coef) se(coef) z p
tecnologiagps -0.152 0.859 0.400 -0.38 7e-01
paisPT 1.469 4.345 0.406 3.62 3e-04
tecnologiagps:paisPT NA NA 0.000 NA NA
Likelihood ratio test=23.8 on 2 df, p=6.82e-06 n= 127, number of events= 64
I'm opening another question about this subject, although I asked a similar one some months ago, because I'm facing the same problem again with other data, and this time I'm sure it's not a data-related problem.
Can somebody help me?
Thank you
UPDATE:
The problem does not seem to be perfect classification:
> xtabs(~status+tecnologia,data=dados)
tecnologia
status conv doppler gps
0 39 6 24
1 30 3 34
> xtabs(~status+pais,data=dados)
pais
status ES PT
0 71 8
1 49 28
> xtabs(~tecnologia+pais,data=dados)
pais
tecnologia ES PT
conv 69 0
doppler 1 8
gps 30 28
Here's a simple example which seems to reproduce your problem:
> library(survival)
> (df1 <- data.frame(t1=seq(1:6),
+                    s1=rep(c(0, 1), 3),
+                    te1=c(rep(0, 3), rep(1, 3)),
+                    pa1=c(0, 0, 1, 0, 0, 0)))
t1 s1 te1 pa1
1 1 0 0 0
2 2 1 0 0
3 3 0 0 1
4 4 1 1 0
5 5 0 1 0
6 6 1 1 0
> (coxph(Surv(t1, s1) ~ te1*pa1, data=df1))
Call:
coxph(formula = Surv(t1, s1) ~ te1 * pa1, data = df1)
coef exp(coef) se(coef) z p
te1 -23 9.84e-11 58208 -0.000396 1
pa1 -23 9.84e-11 100819 -0.000229 1
te1:pa1 NA NA 0 NA NA
Now let's look for 'perfect classification' like so:
> (xtabs( ~ s1+te1, data=df1))
te1
s1 0 1
0 2 1
1 1 2
> (xtabs( ~ s1+pa1, data=df1))
pa1
s1 0 1
0 2 1
1 3 0
Note that a value of 1 for pa1 exactly predicts having a status s1 equal to 0. That is to say, based on your data, if you know that pa1==1 then you can be sure that s1==0. Thus fitting Cox's model is not appropriate in this setting and will result in numerical errors.
This can be seen with
> coxph(Surv(t1, s1) ~ pa1, data=df1)
giving
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; beta may be infinite.
It's important to look at these cross tables before fitting models. Also it's worth starting with simpler models before considering those involving interactions.
If we add the interaction term to df1 manually like this:
> (df1 <- within(df1,
+ te1pa1 <- te1*pa1))
t1 s1 te1 pa1 te1pa1
1 1 0 0 0 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 0 0
5 5 0 1 0 0
6 6 1 1 0 0
Then check it with
> (xtabs( ~ s1+te1pa1, data=df1))
te1pa1
s1 0
0 3
1 3
We can see that it's a useless classifier, i.e. it does not help predict status s1.
When combining all 3 terms, the fitter does manage to produce a numerical value for te1 and pa1, even though pa1 is a perfect predictor as above. However, a look at the values of the coefficients and their errors shows them to be implausible.
Edit #JMarcelino: If you look at the warnings from the first coxph model in the example, you'll see:
2: In coxph(Surv(t1, s1) ~ te1 * pa1, data = df1) :
X matrix deemed to be singular; variable 3
This is likely the same error you're getting, and it is due to this classification problem. Also, your third cross table, xtabs(~ tecnologia+pais, data=dados), is not as important as the table of status by interaction term. You could add the interaction term manually first, as in the example above, and then check the cross table. Or you could say:
> with(df1,
table(s1, pa1te1=pa1*te1))
pa1te1
s1 0
0 3
1 3
That said, I notice one of the cells in your third table is zero (conv, PT), meaning you have no observations with this combination of predictors. This is going to cause problems when trying to fit.
In general, the outcome should have some values for all levels of the predictors, and the predictors should not classify the outcome as exactly all-or-nothing or 50/50.
Edit 2 #user75782131: Yes, generally speaking, xtabs or a similar cross table should be examined for models where the outcome and predictors are discrete, i.e. have a limited number of levels. If 'perfect classification' is present then a predictive model / regression may not be appropriate. This is true, for example, for logistic regression (binary outcome) as well as Cox's model.
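Applying the same check to the question's data would look something like this (a sketch assuming the dados object from the UPDATE; the gps/PT cell is just one illustrative combination):
# cross-tabulate status against one cell of the interaction
with(dados, table(status, gpsPT = (tecnologia == "gps") & (pais == "PT")))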

Update dataframe column efficiently using some hashmap method in R

I am new to R and can't figure out what I might be doing wrong in the code below, and how I could speed it up.
I have a dataset and would like to add a column containing an average value calculated from two columns of data. Please take a look at the code below (warning: it could take some time to read my question, but the code runs fine in R).
First let me define a dataset df (again, I apologize for the long description of the code):
> df<-data.frame(prediction=sample(c(0,1),10,TRUE),subject=sample(c("car","dog","man","tree","book"),10,TRUE))
> df
prediction subject
1 0 man
2 1 dog
3 0 man
4 1 tree
5 1 car
6 1 tree
7 1 dog
8 0 tree
9 1 tree
10 1 tree
Next I add the new column called subjectRate to df:
df$subjectRate <- with(df,ave(prediction,subject))
> df
prediction subject subjectRate
1 0 man 0.0
2 1 dog 1.0
3 0 man 0.0
4 1 tree 0.8
5 1 car 1.0
6 1 tree 0.8
7 1 dog 1.0
8 0 tree 0.8
9 1 tree 0.8
10 1 tree 0.8
From the new table I generate a rateMap, so as to automatically fill in new data with a subjectRate column initialized with the previously obtained averages. (Note that deduplicating on subjectRate rather than on subject drops any subject that shares a rate with an earlier one; that is why car, whose rate equals dog's 1.0, is missing from rateMap below and later falls back to 0.5.)
rateMap <- df[!duplicated(df[, c("subjectRate")]), c("subject","subjectRate")]
> rateMap
subject subjectRate
1 man 0.0
2 dog 1.0
4 tree 0.8
Now I define a new dataset with a combination of the old subjects in df and new subjects:
> dfNew<-data.frame(prediction=sample(c(0,1),15,TRUE),subject=sample(c("car","dog","man","cat","book","computer"),15,TRUE))
> dfNew
prediction subject
1 1 man
2 0 cat
3 1 computer
4 0 dog
5 0 book
6 1 cat
7 1 car
8 0 book
9 0 computer
10 1 dog
11 0 cat
12 0 book
13 1 dog
14 1 man
15 1 dog
My question: how do I create the third column efficiently? Currently I am running the test below, where I look up the subject rate in the map and input the value if found, or 0.5 if not:
> all_facts<-levels(factor(rateMap$subject))
> dfNew$subjectRate <- sapply(dfNew$subject, function(t)
+   ifelse(t %in% all_facts,
+          rateMap[as.character(rateMap$subject) == as.character(t), ][1, "subjectRate"],
+          0.5))
> dfNew
prediction subject subjectRate
1 1 man 0.0
2 0 cat 0.5
3 1 computer 0.5
4 0 dog 1.0
5 0 book 0.5
6 1 cat 0.5
7 1 car 0.5
8 0 book 0.5
9 0 computer 0.5
10 1 dog 1.0
11 0 cat 0.5
12 0 book 0.5
13 1 dog 1.0
14 1 man 0.0
15 1 dog 1.0
But with the real dataset (more than 200,000 rows, with multiple columns similar to subject over which to compute averages), the code takes a very long time to run. Can somebody suggest a better way to do what I am trying to achieve? Maybe some merge or something, but I am out of ideas.
Thank you.
I suspect (but am not sure, since I haven't tested it) that this will be faster:
dfNew$subjectRate <- rateMap$subjectRate[match(dfNew$subject,rateMap$subject)]
since it mostly uses just indexing and match(). It's certainly a bit simpler, I think. This will fill in the "new" values with NAs rather than 0.5, which can then be filled in however you like with:
dfNew$subjectRate[is.na(dfNew$subjectRate)] <- newValue
If the ave piece is particularly slow, the standard recommendation these days is to use the data.table package:
require(data.table)
dft <- as.data.table(df)
setkeyv(dft, "subject")
dft[, subjectRate := mean(prediction), by = subject]
and this will probably attract a few comments suggesting ways to eke a bit more speed out of that data.table aggregation in the last line. Indeed, merging or joining using pure data.tables may be even slicker (and faster), so you might want to investigate that option as well. (See the very bottom of ?data.table for a bunch of examples.)
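For completeness, a sketch of that join-based approach (assuming rateMap and dfNew as defined in the question):
require(data.table)
# character keys avoid factor-level mismatches between the two tables
rateDT <- as.data.table(rateMap)[, subject := as.character(subject)]
newDT  <- as.data.table(dfNew)[, subject := as.character(subject)]
# update join: pull each subject's rate from rateDT; unseen subjects stay NA
newDT[rateDT, subjectRate := i.subjectRate, on = "subject"]
newDT[is.na(subjectRate), subjectRate := 0.5]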
