Erlang Understanding recursion: Find if substring is prefix of string recursively

I want to understand this implementation of finding a prefix in a string, which is implemented without using any built-in list functions, but using recursion to iterate through the string.
attempt:
checkStringPrefix([C|StringTail],[C|LookupTail]) ->
checkStringPrefix(StringTail,LookupTail);
checkStringPrefix(_,[]) ->
"***";
checkStringPrefix(_String,_Lookup) ->
"".
It works: the function calls itself recursively, separating the first character from the tail.
Example calls:
1> stringUtil:checkStringPrefix("test xxxxx", "test").
"***"
2> stringUtil:checkStringPrefix("test xxxxx", "testtt").
[]
In the case of two non-equal characters, the last function variant gets called.
I don't fully grasp this concept, and I would appreciate an explanation. I understand the process of recursive iteration; what I don't understand is why the second variant gets called at the correct moment.

Consider what happens when passing either single character strings or empty strings as arguments:
Passing "a" and "a": this calls the first clause of checkStringPrefix/2 because that clause explicitly matches the first elements of both arguments in its function head, which also enforces that neither argument can be the empty list. This clause calls the function recursively with the two tails, and since both tails are empty lists, the second function clause gets called because the second argument matches the empty list, so the result is "***".
Passing "a" and "b": this won't call the first clause because the first elements do not match, and it won't call the second clause because the second argument isn't the empty list. It therefore calls the third clause, so the result is "".
Passing "" and "": this won't call the first clause because both arguments are empty lists; it calls the second clause because the second argument matches the empty list, so the result is "***". In fact, since the second clause treats its first argument as a don't-care, this same analysis applies even if the first argument is non-empty.
Passing "" and "a": this won't call the first clause because the first argument is empty, and it won't call the second clause because the second argument is not empty, so it calls the third clause and the result is "".
No matter the lengths of the two argument strings, these are the choices for each invocation.
To answer your specific question about when the second function clause gets called, it happens any time the second argument is the empty list because of the specific match for that case in the function head, and that function head also ignores the first argument. Matching occurs in order of declaration; the first function clause requires both arguments to have at least one element, so when the second argument is the empty list, it can't match the first clause and so matches the second clause.
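The clause ordering described above can also be traced outside Erlang. The following is a minimal Python sketch, assuming that a top-to-bottom chain of if statements stands in for Erlang's clause selection; the name check_string_prefix is a hypothetical port and not part of the original module.

```python
def check_string_prefix(string: str, lookup: str) -> str:
    # Clause 1: both arguments are non-empty and share the same first character,
    # so recurse on the two tails.
    if string and lookup and string[0] == lookup[0]:
        return check_string_prefix(string[1:], lookup[1:])
    # Clause 2: the lookup is exhausted; the first argument is ignored.
    if lookup == "":
        return "***"
    # Clause 3: everything else (a mismatch, or the string ran out first).
    return ""


if __name__ == "__main__":
    print(repr(check_string_prefix("test xxxxx", "test")))
    print(repr(check_string_prefix("test xxxxx", "testtt")))
```

The second check is only reached when the first one did not apply, which is the same reason the second Erlang clause fires exactly when the lookup string has been consumed.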

Related

cross-parsing in function input parameters

This really falls under the purview of "Don't DO that!" ,but..
I wrote this to see what would happen.
circit <-function(x=deparse(substitute(y)),y=deparse(substitute(x)))
{
return(list(x=x,y=y))
}
Two examples:
> circit()
$x
[1] "deparse(substitute(x))"
$y
[1] "deparse(substitute(y))"
> circit(3)
$x
[1] 3
$y
[1] "3"
Notice the subtle swap of "x" and "y" in the output.
I can't follow the logic, so can someone explain how the argument parser handles this absurd pair of default inputs? (the second case is easy to follow)
The key thing to understand/remember is that formal arguments are promise objects, and that substitute() has special rules for how it evaluates promise objects. As explained in ?substitute, it returns their expression slot, not their value:
Substitution takes place by examining each component of the parse
tree as follows: If it is not a bound symbol in ‘env’, it is
unchanged. If it is a promise object, i.e., a formal argument to
a function or explicitly created using ‘delayedAssign()’, the
expression slot of the promise replaces the symbol. If it is an
ordinary variable, its value is substituted, unless ‘env’ is
‘.GlobalEnv’ in which case the symbol is left unchanged.
To make this clearer, it might help to walk through the process in detail. In the first case, you call circit() with no supplied arguments, so circit() uses the default values of both x= and y=.
For x, that means its value is gotten by evaluating deparse(substitute(y)). The symbol y in that expression is matched by the formal argument y, a promise object. substitute() replaces the symbol y with its expression slot, which holds the expression deparse(substitute(x)). Deparsing that expression returns the text string "deparse(substitute(x))", which is what gets assigned to x's value slot.
Likewise, the value of y is gotten by evaluating the expression deparse(substitute(x)). The symbol x is matched by the formal argument x, a promise object. Even though the value of x is something else, its expression slot is still deparse(substitute(y)), so that's what's returned by evaluating substitute(x). As a result, deparse(substitute(x)) returns the string "deparse(substitute(y))".
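The walk-through above can be mimicked with a toy model. This is only a Python sketch under strong assumptions: the Promise class is hypothetical, it keeps the expression slot as plain text, and it ignores everything else a real R promise carries (such as its evaluation environment). It is meant to show why reading the expression slot, rather than the value, produces the swapped output.

```python
from typing import Any, Callable, Dict


class Promise:
    """Toy stand-in for an R promise: an expression slot plus a lazy value."""

    def __init__(self, expr: str, compute: Callable[[], Any]) -> None:
        self.expr = expr          # expression slot (unevaluated source text)
        self._compute = compute   # how to produce the value when forced
        self._forced = False
        self._value: Any = None

    def force(self) -> Any:
        # Evaluate at most once, the first time the value is needed.
        if not self._forced:
            self._value = self._compute()
            self._forced = True
        return self._value


def circit(*supplied: Any) -> Dict[str, Any]:
    env: Dict[str, Promise] = {}
    if supplied:
        # x was supplied: its expression slot holds the argument's source text.
        env["x"] = Promise(repr(supplied[0]), lambda: supplied[0])
    else:
        # Default for x, deparse(substitute(y)): read y's expression slot.
        env["x"] = Promise("deparse(substitute(y))", lambda: env["y"].expr)
    # Default for y, deparse(substitute(x)): read x's expression slot.
    env["y"] = Promise("deparse(substitute(x))", lambda: env["x"].expr)
    return {"x": env["x"].force(), "y": env["y"].force()}


if __name__ == "__main__":
    print(circit())
    print(circit(3))
```

Neither default ever forces the other promise; each only looks at the other's expression slot, which is why the circular pair does not recurse forever.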

Convert string argument to regular expression

Trying to get into Julia after learning python, and I'm stumbling over some seemingly easy things. I'd like to have a function that takes strings as arguments, but uses one of those arguments as a regular expression to go searching for something. So:
function patterncount(string::ASCIIString, kmer::ASCIIString)
numpatterns = eachmatch(kmer, string, true)
count(numpatterns)
end
There are a couple of problems with this. First, eachmatch expects a Regex object as the first argument and I can't seem to figure out how to convert a string. In python I'd do r"{0}".format(kmer) - is there something similar?
Second, I clearly don't understand how the count function works (from the docs):
count(p, itr) → Integer
Count the number of elements in itr for which predicate p returns true.
But I can't seem to figure out what the predicate is for just counting how many things are in an iterator. I can make a simple counter loop, but I figure that has to be built in. I just can't find it (tried the docs, tried searching SO... no luck).
Edit: I also tried numpatterns = eachmatch(r"$kmer", string, true) - no go.
To convert a string to a regex, call the Regex function on the string.
Typically, to get the length of an iterator you can use the length function. However, in this case that won't really work. The eachmatch function returns an object of type Base.RegexMatchIterator, which doesn't have a length method. So, you can use count, as you thought. The first argument (the predicate) should be a one-argument function that returns true or false depending on whether you would like to count a particular item in your iterator. In this case that function can simply be the anonymous function x->true, because we want to count every x in the RegexMatchIterator.
So, given that info, I would write your function like this:
patterncount(s::ASCIIString, kmer::ASCIIString) =
count(x->true, eachmatch(Regex(kmer), s, true))
EDIT: I also changed the name of the first argument to be s instead of string, because string is a Julia function. Nothing terrible would have happened if we had left that argument name the same in this example, but it is usually good practice not to give a variable the same name as a built-in function.
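Since the question comes from a Python background, the same two ideas (building a regex from a string at run time, and counting an iterator with an always-true predicate) can be sketched in Python. The helper count below is a hypothetical equivalent of Julia's count(p, itr), written here for illustration. One assumption to note: Python's re.finditer yields only non-overlapping matches, so this does not reproduce the overlap flag (true) passed to eachmatch.

```python
import re
from typing import Callable, Iterable


def count(pred: Callable[[object], bool], itr: Iterable) -> int:
    """Count the elements of itr for which pred returns True."""
    return sum(1 for item in itr if pred(item))


def pattern_count(s: str, kmer: str) -> int:
    # Build the regex object from a plain string at run time.
    regex = re.compile(kmer)
    # Every match should be counted, so the predicate is always True.
    return count(lambda match: True, regex.finditer(s))


if __name__ == "__main__":
    print(pattern_count("ACGTACGT", "ACG"))
```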

Sliding window matching in functional programming

I'm trying to implement a sliding window algorithm for matching words in a text file. I come from a procedural background and my first attempt to do this in a functional language like Erlang seems to require time O(n^2) (or even more). How would one do this in a functional language?
-module(test).
-export([readText/1,patternCount/2,main/0]).
readText(FileName) ->
{ok,File} = file:read_file(FileName),
unicode:characters_to_list(File).
patternCount(Text,Pattern) ->
patternCount_(Text,Pattern,string:len(Pattern),0).
patternCount_(Text,Pattern,PatternLength,Count) ->
case string:len(Text) < PatternLength of
true -> Count;
false ->
case string:equal(string:substr(Text,1,PatternLength),Pattern) of
true ->
patternCount_(string:substr(Text,2),Pattern,PatternLength,Count+1);
false ->
patternCount_(string:substr(Text,2),Pattern,PatternLength,Count)
end
end.
main() ->
test:patternCount(test:readText("file.txt"),"hello").
Your question is a bit too broad, since it asks about implementing this algorithm in functional languages but how best to do that is language-dependent. My answer therefore focuses on Erlang, given your example code.
First, note that there's no need to have separate patternCount and patternCount_ functions. Instead, you can just have multiple patternCount functions with different arities as well as multiple clauses of the same arity. Let's start by rewriting your functions to take that into account, and also replace calls to string:len/1 with the length/1 built-in function:
patternCount(Text,Pattern) ->
patternCount(Text,Pattern,length(Pattern),0).
patternCount(Text,Pattern,PatternLength,Count) ->
case length(Text) < PatternLength of
true -> Count;
false ->
case string:equal(string:substr(Text,1,PatternLength),Pattern) of
true ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count+1);
false ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count)
end
end.
Next, the multi-level indentation in the patternCount/4 function is a "code smell" indicating it can be done better. Let's split that function into multiple clauses:
patternCount(Text,Pattern,PatternLength,Count) when length(Text) < PatternLength ->
Count;
patternCount(Text,Pattern,PatternLength,Count) ->
case string:equal(string:substr(Text,1,PatternLength),Pattern) of
true ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count+1);
false ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count)
end.
The first clause uses a guard to detect that no more matches are possible, while the second clause looks for matches. Now let's refactor the second clause to use Erlang's built-in matching. We want to advance through the input text one element at a time, just as the original code does, but we also want to detect matches as we do so. Let's perform the matches in our function head, like this:
patternCount(_Text,[]) -> 0;
patternCount(Text,Pattern) ->
patternCount(Text,Pattern,Pattern,length(Pattern),0).
patternCount(Text,_Pattern,_Pattern,PatternLength,Count) when length(Text) < PatternLength ->
Count;
patternCount(Text,[],Pattern,PatternLength,Count) ->
patternCount(Text,Pattern,Pattern,PatternLength,Count+1);
patternCount([C|TextTail],[C|PatternTail],Pattern,PatternLength,Count) ->
patternCount(TextTail,PatternTail,Pattern,PatternLength,Count);
patternCount([_|TextTail],_,Pattern,PatternLength,Count) ->
patternCount(TextTail,Pattern,Pattern,PatternLength,Count).
First, note that we added a new argument to the bottom four clauses: we now pass Pattern as both the second and third arguments to allow us to use one of them for matching and one of them to maintain the original pattern, as explained more fully below. Note also that we added a new clause at the very top to check for an empty Pattern and just return 0 in that case.
Let's focus only on the bottom three patternCount/5 clauses. These clauses are tried in order at runtime, but let's look at the second of these three clauses first, then the third clause, then the first of the three:
In the second of these three clauses, we write the first and second arguments in [Head|Tail] list notation, which means Head is the first element of the list and Tail is the rest of the list. We use the same variable for the head of both lists, which means that if the first elements of both lists are equal, we have a potential match in progress, so we then recursively call patternCount/5 passing the tails of the lists as the first two arguments. Passing the tails allows us to advance through both the input text and the pattern an element at a time, checking for matching elements.
In the last clause, the heads of the first two arguments do not match; if they did, the runtime would execute the second clause, not this one. This means that our pattern match has failed, and so we no longer care about the first element of the first argument nor about the second argument, and we have to advance through the input text to look for a new match. Note that we write both the head of the input text and the second argument as the _ "don't care" variable, as they are no longer important to us. We recursively call patternCount/5, passing the tail of the input text as the first argument and the full Pattern as the second argument, allowing us to start looking for a new match.
In the first of these three clauses, the second argument is the empty list, which means we've gotten here by successfully matching the full Pattern, element by element. So we recursively call patternCount/5 passing the full Pattern as the second argument to start looking for a new match, and we also increment the match count.
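To see the four patternCount/5 clauses as a single loop, here is a rough Python sketch. It is an illustrative port and not the answer's code: the indices i and j are assumptions standing in for the remaining text and the remaining pattern. The j == 0 test mirrors the guarded clause, whose head uses the same variable for its second and third arguments and therefore only applies when a fresh match attempt is about to start.

```python
def pattern_count(text: str, pattern: str) -> int:
    # patternCount/2, first clause: an empty pattern never matches.
    if not pattern:
        return 0
    n, m = len(text), len(pattern)
    count = 0
    i = 0  # position in text (the "remaining text" is text[i:])
    j = 0  # position in pattern (the "remaining pattern" is pattern[j:])
    while True:
        if j == 0 and n - i < m:
            # Guarded clause: starting fresh and too little text is left.
            return count
        if j == m:
            # Remaining pattern is empty: a full match was consumed.
            count += 1
            j = 0
        elif text[i] == pattern[j]:
            # Heads are equal: advance through both text and pattern.
            i += 1
            j += 1
        else:
            # Heads differ: drop one text element and restart the pattern.
            i += 1
            j = 0


if __name__ == "__main__":
    print(pattern_count("hello world hello", "hello"))
```

Like the clauses it mirrors, this sketch counts non-overlapping matches, and after a partial match fails it resumes scanning after the mismatched element rather than one position past where the attempt began. As a consequence, a pattern such as "ab" is not found in "aab".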
Try it! Here's the full revised module:
-module(test).
-export([read_text/1,pattern_count/2,main/0]).
read_text(FileName) ->
{ok,File} = file:read_file(FileName),
unicode:characters_to_list(File).
pattern_count(_Text,[]) -> 0;
pattern_count(Text,Pattern) ->
pattern_count(Text,Pattern,Pattern,length(Pattern),0).
pattern_count(Text,_Pattern,_Pattern,PatternLength,Count)
when length(Text) < PatternLength ->
Count;
pattern_count(Text,[],Pattern,PatternLength,Count) ->
pattern_count(Text,Pattern,Pattern,PatternLength,Count+1);
pattern_count([C|TextTail],[C|PatternTail],Pattern,PatternLength,Count) ->
pattern_count(TextTail,PatternTail,Pattern,PatternLength,Count);
pattern_count([_|TextTail],_,Pattern,PatternLength,Count) ->
pattern_count(TextTail,Pattern,Pattern,PatternLength,Count).
main() ->
pattern_count(read_text("file.txt"),"hello").
A few final recommendations:
Searching through text element by element is slower than necessary. You should have a look at the Boyer-Moore algorithm and other related algorithms to see ways of advancing through text in larger chunks. For example, Boyer-Moore attempts to match at the end of the pattern first, since if that's not a match, it can advance through the text by as much as the full length of the pattern.
You might also want to look into using Erlang binaries rather than lists, as they are more compact memory-wise and they allow for matching more than just their first elements. For example, if Text is the input text as a binary and Pattern is the pattern as a binary, and assuming the size of Text is equal to or greater than the size of Pattern, this code attempts to match the whole pattern:
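case Text of
%% Pattern is already bound, so this clause matches only if Text starts with it.
<<Pattern:PatternLength/binary, TextTail/binary>> ->
patternCount(TextTail,Pattern,PatternLength,Count+1);
%% Otherwise skip a single byte; only the last segment may omit its size.
<<_, TextTail/binary>> ->
patternCount(TextTail,Pattern,PatternLength,Count)
end.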
Note that this code snippet reverts to using patternCount/4 since we no longer need the extra Pattern argument to work through element by element.
As shown in the full revised module, when calling functions in the same module, you don't need the module prefix. See the simplified main/0 function.
As shown in the full revised module, conventional Erlang style does not use mixed case function names like patternCount. Most Erlang programmers would use pattern_count instead.
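The binary-matching recommendation above (try to match the whole pattern at the front of the remaining text, otherwise skip one byte) can be sketched in Python with bytes objects. This is a hedged illustration only: pattern_count_bytes is a hypothetical name, and the index i is an assumption that stands in for taking the tail of the binary.

```python
def pattern_count_bytes(text: bytes, pattern: bytes) -> int:
    if not pattern:
        return 0
    count = 0
    i = 0  # offset of the remaining text
    # Stop once fewer bytes remain than the pattern needs.
    while len(text) - i >= len(pattern):
        if text.startswith(pattern, i):
            # Whole pattern matched at once: count it and skip past it.
            count += 1
            i += len(pattern)
        else:
            # No match here: advance by a single byte.
            i += 1
    return count


if __name__ == "__main__":
    print(pattern_count_bytes(b"hello world hello", b"hello"))
```

Because the whole pattern is compared at each position, no second copy of the pattern has to be carried along, which is the same reason the Erlang snippet can go back to four arguments.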

Case insensitive token matching

Is it possible to set the grammar to match case-insensitively?
so for example a rule:
checkName = 'CHECK' Word;
would match check name as well as CHECK name
Creator of PEGKit here.
The only way to do this currently is to use a Semantic Predicate in a round-about sort of way:
checkName = { MATCHES_IGNORE_CASE(LS(1), @"check") }? Word Word;
Some explanations:
Semantic Predicates are a feature lifted directly from ANTLR. The Semantic Predicate part is the { ... }?. These can be placed anywhere in your grammar rules. They should contain either a single expression or a series of statements ending in a return statement which evaluates to a boolean value. This one contains a single expression. If the expression evaluates to false, matching of the current rule (checkName in this case) will fail. A true value will allow matching to proceed.
MATCHES_IGNORE_CASE(str, regexPattern) is a convenience macro I've defined for your use in Predicates and Actions to do regex matches. It has a case-sensitive friend: MATCHES(str, regexPattern). The second argument is an NSString* regex pattern. Meaning should be obvious.
LS(num) is another convenience macro for your use in Predicates/Actions. It means fetch a Lookahead String, and the argument specifies how far to look ahead. So LS(1) means look ahead by 1. In other words, "fetch the string value of the first upcoming token the parser is about to try to match".
Notice that I'm still matching Word twice at the end there. The first Word is necessary for matching 'check' (even though it was already tested in the predicate, it was not matched and consumed). The second Word is for your name or whatever.
Hope that helps.
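The predicate-then-match idea can be illustrated with a small hand-written sketch in Python. None of this is PEGKit API: matches_ignore_case and parse_check_name are hypothetical names, the token list is assumed to be pre-split, every token is treated as a Word, and the sketch assumes the regex must match the whole token. It only shows the order of events, namely that the lookahead test consumes nothing and the two Word matches do the consuming.

```python
import re
from typing import List, Optional, Tuple


def matches_ignore_case(text: str, pattern: str) -> bool:
    # Hypothetical stand-in for MATCHES_IGNORE_CASE (whole-token match assumed).
    return re.fullmatch(pattern, text, flags=re.IGNORECASE) is not None


def parse_check_name(tokens: List[str], pos: int = 0) -> Optional[Tuple[str, str, int]]:
    # Predicate: peek at the next token without consuming it, like LS(1).
    if pos >= len(tokens) or not matches_ignore_case(tokens[pos], "check"):
        return None  # predicate is false, so the rule fails
    # Two Word matches: the first consumes the keyword, the second the name.
    if pos + 2 > len(tokens):
        return None
    keyword, name = tokens[pos], tokens[pos + 1]
    return keyword, name, pos + 2


if __name__ == "__main__":
    print(parse_check_name(["CHECK", "name"]))
```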

Recursion Append list

Here's a snippet:
translate("a", "4").
translate("m", "/\\/\\").
tol33t([], []).
tol33t([Upper|UpperTail], [Lower|LowerTail]) :-
translate([Upper], [Lower]),
tol33t(UpperTail, LowerTail).
Basically, what I want to do is look up a letter in the table, get its translation, and add it to the new list.
What I have works if the translation is a single character, but I'm not sure how to append the new list of characters to the old one.
Example input:
l33t("was", L).
It will be put through like this:
l33t([119,97,115], L).
Now that should come back as:
[92,47,92,47]++[52]++[53] or [92,47,92,47,52,53]
The problem is that I don't know how to append it like that.
Consider these modifications to tol33t/2:
tol33t([], []).
tol33t([Code|Codes], Remainder) :-
translate([Code], Translation), !,
tol33t(Codes, Rest),
append(Translation, Rest, Remainder).
tol33t([Code|Codes], [Code|Remainder]) :-
tol33t(Codes, Remainder).
The first clause is the base case.
The second clause will succeed iff there is a translation for the current Code via translate/2, as a list of characters of arbitrary length (Translation - note you had [Lower] instead, which restricted results to lists of length 1 only). The cut (!) after the check for a code translation commits to finding the Rest of the solution recursively and then appends the Translation to the front, as the Remainder to return.
The third clause is executed iff there was no translation for the current Code (i.e., the call to translate/2 in the second clause failed). In this case, no translation for the Code means we just return it as is and compute the rest.
EDIT:
If you don't have cut (!), the second and third clauses can be combined to become:
tol33t([Code|Codes], Remainder) :-
tol33t(Codes, Rest),
(translate([Code], Translation) ->
append(Translation, Rest, Remainder)
; Remainder = [Code|Rest]
).
This (unoptimized) version checks, at every Code in the character list, if there is a translate/2 that succeeds; if so, the Translation is appended to the Rest, else the Code is passed through unchanged. Note that this has the same semantics as the implementation above, in that solutions are committed to (i.e., simulating a cut !) if the antecedent to -> (translate/2) succeeds. Note that the cut in both implementations (explicit in the first, implied by -> in the second) is strictly necessary; without it, the program will backtrack to find solutions where Code bindings are not translated even though there exists an applicable translate/2 clause.
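For comparison, the translate-or-pass-through logic can be written as a short Python sketch. It assumes the two translate/2 facts from the question as a dictionary named TRANSLATE, and to_l33t is a hypothetical name. It works on lists of character codes, as the Prolog version does, and recurses on the tail before prepending the translation, like the combined clause above.

```python
from typing import Dict, List

# The two translate/2 facts from the question, keyed by character.
TRANSLATE: Dict[str, str] = {"a": "4", "m": "/\\/\\"}


def to_l33t(codes: List[int]) -> List[int]:
    # Base case: an empty list translates to an empty list.
    if not codes:
        return []
    head, tail = codes[0], codes[1:]
    rest = to_l33t(tail)
    translation = TRANSLATE.get(chr(head))
    if translation is not None:
        # A translation exists: prepend all of its codes (any length).
        return [ord(ch) for ch in translation] + rest
    # No translation: pass the code through unchanged.
    return [head] + rest


if __name__ == "__main__":
    print(to_l33t([ord(ch) for ch in "mama"]))
```

A dictionary lookup either finds an entry or does not, so there is no backtracking to suppress here; that commitment is what the cut and the -> construct provide in the Prolog versions.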
