KMP failure function calculation - string-matching

My professor solved the KMP failure function as follows:
index 1 2 3 4 5 6 7 8 9
string a a b a a b a b b
ff 0 1 2 1 2 3 4 5 1
From other texts I checked online, I suspect this might be wrong. I went back to confirm with him, and he told me he's absolutely right. Can someone please explain, in a simple step-by-step manner, why this is right or wrong? Thanks

As I understand the algorithm, the failure function for your example should be the following:
index  1 2 3 4 5 6 7 8 9
string a a b a a b a b b
f      0 1 0 1 2 3 4 0 0
f - failure function (by definition, this is the length of the longest proper prefix of the string which is also a suffix; "proper" means the whole string itself does not count)
Here is how I built it step by step:
f(a) = 0 (always = 0 for one letter)
f(aa) = 1 (one letter 'a' is both a prefix and suffix)
f(aab) = 0 (no prefix equals a suffix: a != b, aa != ab)
f(aaba) = 1 ('a' is the same in the beginning and the end, but if you take 2 letters, they won't be equal: aa != ba)
f(aabaa) = 2 ( you can take 'aa' but no more: aab != baa)
f(aabaab) = 3 ( you can take 'aab')
f(aabaaba) = 4 ( you can take 'aaba')
f(aabaabab) = 0 ( 'a' != 'b', 'aa' != 'ab' and so on; it can't be 5, since 'aabaa' != 'aabab')
f(aabaababb) = 0 ( the same situation)
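If you'd rather check the table mechanically than by hand, here is a minimal sketch of the usual linear-time construction, written in R. The function name failure_function is my own (it is not from any package), and it uses 1-based indices like the tables above.

```r
# Sketch: KMP failure function, 1-based, returning one value per character.
failure_function <- function(s) {
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  f <- integer(n)          # f[1] stays 0 for a single letter
  k <- 0L                  # length of the currently matched prefix
  for (i in seq_len(n)[-1]) {
    # On a mismatch, fall back to the next shorter prefix that is also a suffix
    while (k > 0L && chars[i] != chars[k + 1L]) k <- f[k]
    if (chars[i] == chars[k + 1L]) k <- k + 1L
    f[i] <- k
  }
  f
}

failure_function("aabaababb")
# 0 1 0 1 2 3 4 0 0
```

The result agrees with the step-by-step table: the values at indices 3, 8 and 9 are 0, not 2, 5 and 1.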

Since #user1041889 was confused (and got me confused too), I'll lay out the differences between the Z-function and the failure function here.
Failure function, π[i]:
It maps an index i to the length of the longest proper prefix of the string that is also a suffix of the substring ending at i.
That definition is dense, so here it is in plainer terms:
How long is the longest substring at the beginning of the string of interest that is equal to a substring ending at index i (not counting the whole substring itself)?
Or equivalently:
What is the length of the longest substring ending at index i which matches the start of the string of interest?
So in your example:
index 1 2 3 4 5 6 7 8 9
string a a b a a b a b b
ff 0 1 0 1 2 3 4 0 0
We observe that π[6] = 3, so what's the substring that ends at index 6 with length 3? aab!
Interesting how we've seen that before!
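Let's check that it is indeed the longest one. For length 4, baab != aaba; for length 5, abaab != aabaa. Yup!
Notice that the failure function can grow by at most 1 from one index to the next (π[i+1] <= π[i] + 1), although it can drop by any amount.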
That isn't the case for the Z-algorithm.
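To make this concrete, here is a brute-force sketch in R that computes π straight from the definition and then checks the step size on the example string. The helper name failure_by_definition is hypothetical, and the approach is deliberately slow; it is only meant for illustration.

```r
# Sketch: failure function computed directly from the definition.
failure_by_definition <- function(s) {
  n <- nchar(s)
  vapply(seq_len(n), function(i) {
    best <- 0L
    # Try every proper prefix length of the substring s[1..i]
    for (len in seq_len(i - 1L)) {
      if (substr(s, 1L, len) == substr(s, i - len + 1L, i)) best <- len
    }
    best
  }, integer(1))
}

p <- failure_by_definition("aabaababb")
p                    # 0 1 0 1 2 3 4 0 0
all(diff(p) <= 1)    # TRUE: it never jumps up by more than 1
min(diff(p))         # -4: but it can drop sharply (index 7 to 8)
```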


colon operator sequence in R

Why does using the colon operator inside seq() give a different result from seq(from, to) when the starting number is less than 0?
i.e.:
seq1 = seq(-1,10)
returns
-1 0 1 2 3 4 5 6 7 8 9 10
whereas
seq = seq(-1:10)
returns
1 2 3 4 5 6 7 8 9 10 11 12
I'm not sure what you are expecting with seq(-1:10). The : operator is a shortcut to seq itself. So that's the same as seq(seq(-1, 10)) which is also the same as
x <- -1:10
seq(x)
When you pass only a single argument to seq() and that argument has a length greater than 1, it returns a sequence of the same length as that vector, starting at 1. Basically it behaves like seq_along() in that case. See the ?seq help page for more info. See also
seq(c("a","b","c"))
#[1] 1 2 3
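A quick way to convince yourself, as a small sketch using only base R (the variable names are just for illustration):

```r
x <- -1:10                      # the colon operator already builds the vector
length(x)                       # 12

# With a single vector argument, seq() only looks at its length
all(seq(x) == seq_along(x))     # TRUE
all(seq(-1:10) == 1:12)         # TRUE

# What was probably intended: either form gives -1, 0, ..., 10
all(seq(-1, 10) == -1:10)       # TRUE
```

So the negative start is a red herring: seq(1:10) only looks right because 1:10 happens to equal its own index sequence.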

R set column to maximum of current entry and specified value in an elegant way [duplicate]

NOTE: I technically know how to do this, but I feel like there has to be a "nicer" way to do this. If such questions are not allowed here just delete it, but I would really like to improve my R style, so any suggestions are welcome.
I have a dataframe data <- data.frame(foo=rep(c(-1,2),5))
foo
1 -1
2 2
3 -1
4 2
5 -1
6 2
7 -1
8 2
9 -1
10 2
Now I would like to be able to set the entries of foo to a certain value (for this example, let's say 1) if the current entry is smaller than that value.
So my desired output would be
foo
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 2
I feel like there should be something like data$foo <- max(data$foo,1) that does the job (but of course, that takes the maximum over the whole column).
Is there an elegant way to do this?
data$foo <- ifelse(data$foo < 1,1,data$foo) and data$foo <- lapply(data$foo,function(x) max(1,x)) just feel somewhat "ugly".
max gives you the maximum of the whole column, but for your case you need pmax (parallel maximum), which gives you, for each number in the vector, the maximum of that number and 1.
data$foo <- pmax(data$foo, 1)
data
# foo
#1 1
#2 2
#3 1
#4 2
#5 1
#6 2
#7 1
#8 2
#9 1
#10 2
This works:
data <- data.frame(foo=rep(c(-1,2),5))
val <- 1
data[data$foo < val, ] <- val
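Let's break this down. data$foo takes the column and makes it into a vector. data$foo < val checks which elements of this vector are smaller than val, creating a new vector of the same length filled with TRUE and FALSE at the corresponding positions.
Finally, the entire line data[data$foo < val, ] <- val uses that vector of TRUE and FALSE to select the rows (using the [, ]) of data to which val is assigned. Note that the empty column index selects every column, so in a data frame with more than one column this overwrites the whole row; data$foo[data$foo < val] <- val restricts the assignment to foo.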
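For comparison, here is a small sketch that runs the three approaches from this thread side by side on the question's data. The names floor_value, with_pmax, with_ifelse and with_index are only illustrative.

```r
data <- data.frame(foo = rep(c(-1, 2), 5))
floor_value <- 1

# max() collapses everything to a single number
max(data$foo, floor_value)            # 2

# Three element-wise ways to raise small entries to floor_value
with_pmax   <- pmax(data$foo, floor_value)
with_ifelse <- ifelse(data$foo < floor_value, floor_value, data$foo)
with_index  <- data$foo
with_index[with_index < floor_value] <- floor_value

all(with_pmax == with_ifelse)         # TRUE
all(with_pmax == with_index)          # TRUE
```

All three agree here; pmax is simply the shortest to write.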

How to produce a binary column in R from an existing numerical column?

Here are some sample data:
Age Parent
0 4
2 4
5 3
8 3
10 4
15 2
18 2
19 0
The data represent male and female parental attendance at a bird nest. Here, 4= both parents are present, 3= only male is present, 2= only female is present, 0= neither parent present.
I would like to produce a new column (preferably in addition to the original parent column rather than replacing it) giving binary data, where 3 and 4 become '1' and 2 and 0 become '0'.
So my sample data would give the following binary column:
Age Parent
0 1
2 1
5 1
8 1
10 1
15 0
18 0
19 0
I hope I have given enough information but please ask if you need some extra details.
As is the case with most R questions, there are a couple of different ways to do this, but the easiest is probably (let's say you've stored your data in a data frame d):
d$Father <- ifelse(d$Parent >= 3, 1, 0)
There are, of course, any number of logical vectors that could take the place of d$Parent >= 3 in the above code.
You can also directly take advantage of the fact that R coerces TRUE to 1 and FALSE to 0 in arithmetic:
d$Dad <- d$Parent %in% c(3,4)
d$Dad_Num <- as.numeric(d$Parent %in% c(3,4))
Both of those resulting vectors will work for most R applications.
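As a small sketch of why the logical version is usually enough, using the sample data from the question (the column names Male_Present and Male_Present_Num are made up for this example):

```r
d <- data.frame(Age    = c(0, 2, 5, 8, 10, 15, 18, 19),
                Parent = c(4, 4, 3, 3, 4, 2, 2, 0))

# Logical column: TRUE where the code is 3 or 4
d$Male_Present <- d$Parent %in% c(3, 4)

# Explicit 0/1 column, kept alongside the original Parent column
d$Male_Present_Num <- as.integer(d$Male_Present)

# Logicals are coerced in arithmetic, so counting works on either column
sum(d$Male_Present)        # 5
sum(d$Male_Present_Num)    # 5
```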
For more complicated case handling, the memisc package provides a cases function (though the syntax takes a bit of getting used to):
library(memisc)
d$Father <- cases(
d$Parent == 4 -> 1,
d$Parent == 3 -> 1,
d$Parent == 2 -> 0,
d$Parent == 0 -> 0
)
This is overkill in your case, but may prove useful to know about in the future.
You can do something like this:
df$parentNew <- ifelse(df$Parent %in% c(3, 4), 1, 0)
df
Age Parent parentNew
1 0 4 1
2 2 4 1
3 5 3 1
4 8 3 1
5 10 4 1
6 15 2 0
7 18 2 0
8 19 0 0

Cumulative sum conditional over multiple columns in r dataframe containing the same values

Say my data.frame is as outlined below:
df <- as.data.frame(cbind("Home" = c("a","c","e","b","e","b"),
                          "Away" = c("b","d","f","c","a","f")))
df$Index<-rep(1,nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative sum using the Index column for each character a - f regardless of whether they in the Home or Away columns. Thus a column called Cumulative_Sum_Home, say, takes the character in the Home row, "b" in the case of row 6, and counts how many times "b" has appeared in either the Home or Away columns in all previous rows including row 6. Thus in this case b has appeared 3 times cumulatively in the first 6 rows, and thus the Cumulative_Sum_Home gives the value 3. Likewise the same logic applies to the Cumulative_Sum_Away column. Taking row 5, character "a" appears in the Away column, and has cumulatively appeared 2 times in either Home or Away columns up to that row, so the column Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at the data.table approaches, but I've never used that package before so I can't immediately see how to solve it. Any tips would be greatly received.
There is scope to make this leaner but if that doesn't matter much for you then this should be okay.
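NewColumns = list()
# Use the values rather than levels(), so this also works when Home and Away
# are character columns (the default since R 4.0) and not factors
for (i in sort(unique(c(as.character(df$Home), as.character(df$Away))))) {
  # TRUE on every row where team i plays, home or away
  NewColumnAddition = i == df$Home | i == df$Away
  # Replace the TRUEs with a running count; other rows become 0
  NewColumnAddition[NewColumnAddition] = cumsum(NewColumnAddition[NewColumnAddition])
  NewColumns[[i]] = NewColumnAddition
}
# For each row, look up the running count of that row's home team
df$Cumulative_Sum_Home = sapply(
  seq(nrow(df)),
  function(i) {
    NewColumns[[as.character(df[i,"Home"])]][i]
  }
)
# Same lookup for the away team
df$Cumulative_Sum_Away = sapply(
  seq(nrow(df)),
  function(i) {
    NewColumns[[as.character(df[i,"Away"])]][i]
  }
)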
> df
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
Here's a data.table alternative -
library(data.table)
setDT(df)
# Use the values rather than levels(), so character columns work too
for (i in sort(unique(c(as.character(df$Home), as.character(df$Away))))) {
  df[, TotalSum := cumsum(i == Home | i == Away)]
  df[Home == i, Cumulative_Sum_Home := TotalSum]
  df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[, TotalSum := NULL]
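For what it's worth, the same running count can be had without a loop. This is only a sketch in base R, not something taken from the answers above: it interleaves Home and Away in row order and numbers each team's appearances with ave(). The names teams, running and counts are illustrative, and it assumes a team never appears as both Home and Away in the same row.

```r
df <- data.frame(Home = c("a", "c", "e", "b", "e", "b"),
                 Away = c("b", "d", "f", "c", "a", "f"))

# Interleave: row 1 home, row 1 away, row 2 home, row 2 away, ...
teams <- as.vector(t(as.matrix(df[, c("Home", "Away")])))

# Number each team's appearances in order: 1, 2, 3, ...
running <- ave(seq_along(teams), teams, FUN = seq_along)

# Fold back into two columns, one per original column
counts <- matrix(running, ncol = 2, byrow = TRUE)
df$Cumulative_Sum_Home <- counts[, 1]
df$Cumulative_Sum_Away <- counts[, 2]
df
```

On the sample data this gives 1 1 1 2 2 3 for the home column and 1 1 1 2 2 2 for the away column, matching the desired output in the question.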

R isTRUE logic incoherence

I am trying to code a new variable based on ifs and elses using isTRUE and am having difficulty. I would like to have a condition such as
if (isTRUE(t$a > t$b)) {
  t$c <- 0
} else if (isTRUE(t$a < t$b)) {
  t$c <- 1
} else {
  t$c <- 2
}
Consider the following data:
t<-as.data.frame(c(1:5))
names(t)<-"a"
t$b<-c(5:1)
Running the above code gives c values as always being 2 i.e. isTRUE(t$a > t$b) and isTRUE(t$a < t$b) are always FALSE.
Read the help for isTRUE:
‘isTRUE(x)’ is an abbreviation of ‘identical(TRUE, x)’, and so is
true if and only if ‘x’ is a length-one logical vector whose only
element is ‘TRUE’ and which has no attributes (not even names).
This is probably not what you want.
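A short sketch of what is going on, using the question's data (cmp is just an illustrative name):

```r
t <- data.frame(a = 1:5, b = 5:1)

cmp <- t$a > t$b     # a logical vector of length 5
cmp                  # FALSE FALSE FALSE  TRUE  TRUE

isTRUE(cmp)          # FALSE: not a single TRUE, so the if branch is never taken
isTRUE(cmp[4])       # TRUE: a length-one TRUE

# Element-wise alternative: one result per row
t$c <- ifelse(t$a > t$b, 0, ifelse(t$a < t$b, 1, 2))
t$c                  # 1 1 2 0 0
```

So isTRUE() answers one question about the whole vector, while the question needs one answer per row.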
I'm guessing that you want a vector, t$c that is 0 if t$a>t$b, 1 if t$a<t$b and 2 otherwise. In R, we can do that in a single vectorised operation:
Easier setup:
> t = data.frame(a=1:5, b=5:1)
> t
a b
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
Now if c is 0 if a>b, 1 if a<b, and 2 otherwise:
> t$c=2-((t$a>t$b)+(t$a!=t$b))
> t
a b c
1 1 5 1
2 2 4 1
3 3 3 2
4 4 2 0
5 5 1 0
Logical operations (>, != etc) operate along vectors, and evaluate numerically to 1 for TRUE and 0 for FALSE. If you try typing parts of my expression for t$c you should learn how this all works together.
If you don't like that tricksy boolean arithmetic, a couple of nested ifelse functions work:
t$c = ifelse(t$b>t$a, 1, ifelse(t$b==t$a,2,0))
This has the advantage of being a bit more readable: if b>a it's 1, otherwise if b==a it's 2, otherwise it's 0. Note how ifelse works, like lots of R functions, on each element of a vector.
Here's an approach using sign() and indexing a vector of desired values:
t <- data.frame(a=1:5,b=5:1);
t;
## a b
## 1 1 5
## 2 2 4
## 3 3 3
## 4 4 2
## 5 5 1
t$c <- c(1,2,0)[sign(t$a-t$b)+2];
t;
## a b c
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2
## 4 4 2 0
## 5 5 1 0
The advantage here is that you can easily change the desired values later, because they're defined explicitly in the indexed vector as (a<b,a==b,a>b). Spacedman's solution to use logical arithmetic is rather brilliant (+1 from me!), but does not easily lend itself to future changes in values.
Use a nested ifelse:
t$c=ifelse(t$a<t$b,1,
           ifelse(t$a>t$b,0,2)
)
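From the help for ifelse (?ifelse). Note that the excerpt below documents an S4 method for class 'db.obj' from an add-on package rather than base R's ifelse(test, yes, no), though the arguments play the same roles:
S4 method for class 'db.obj':
ifelse(test, yes, no)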
Arguments
test
A db.obj object, which has only one column. The column can be casted into boolean values.
yes
A normal value or a db.obj object. It is the returned value when test is TRUE.
no
The returned value when test is FALSE.
