Problem with understanding Functional Programming concept - functional-programming

I'm taking a course in functional programming and, coming from OOP, my brain hurts trying to solve something that I think is quite trivial; I'm just not understanding the concept here. This is an exercise I need to do for school.
Given a phrase, in my case it is
"Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment"
I need to check starting letters of every word for a matching pattern and apply specific modification depending on pattern.
I'm not even entirely sure what my current code is producing. When I inspect x and y I sort of understand what the loops are doing: they produce a list for every word of the phrase, and each inner list consists of checks against every single letter, nil if the word doesn't start with that letter and the modified word if it does.
The OOP in me wants those loops not to return any nil and to return only a single edited word in every iteration. In functional programming I can't break out of loops and force returns, so I need to think about this in another way.
My question is how should I approach this problem from functional programming perspective?
In the beginning I get a list of words which form a phrase, then I wish to edit every word and in the end get a list again containing these edited words.
Maybe a pseudo-code-like structure on how to tackle this would help me understand underlying concepts.
Here is my current code:
#Task two
def taskTwo() do
  IO.puts "Task Two"
  IO.puts "Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment"
  IO.puts "Task Two\n...Editing Words ..."
  phrase = String.downcase("Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment") |> String.split()
  x = for word <- phrase do
    checkVowels(word)
  end
  y = for word <- phrase do
    checkConsonants(word)
  end
  IO.inspect x
  IO.inspect y
end
#check vowels
def checkVowels(word) do
  vowels = ["a","e","i","o","u"]
  for vowel <- vowels do
    if String.starts_with?(word, vowel) do
      word <> "ay "
    end
  end
end
#check consonants
def checkConsonants(word) do
  consonants = ["b","c","d","f","g","h","j","k","l","m","n","p","q","r","s","t","v","w","x","z","y"]
  for consonant <- consonants do
    if String.starts_with?(word, consonant) do
      edited = String.replace_prefix(word, consonant, "")
      edited <> consonant <> "ay "
    end
  end
end
Modifications I need to apply: first check the starting letter and apply the modification, then check again and see whether the word contains any of the multi-letter combinations:
Words beginning with consonants should have the consonant moved to the end of the word, followed by "ay".
Words beginning with vowels (aeiou) should have "ay" added to the end of the word.
Some groups of letters are treated like consonants, including "ch", "qu", "squ", "th", "thr", and "sch".
Some groups are treated like vowels, including "yt" and "xr".

One possible way to solve the first part of your exercise (maybe not the most efficient, but using some of Elixir's strengths) is to use recursion and pattern matching:
defmodule Example do
  @vowels ~w[a e i o u y]
  @phrase "Pattern Matching with Elixir. Remember that equals sign is a match operator, not an assignment"
  def task_two() do
    phrase =
      @phrase
      |> String.downcase()
      |> String.split()
      # reversed so that prepending to the accumulator restores the original order
      |> Enum.reverse()
      # splitting the first letter of each word
      |> Enum.map(&String.split_at(&1, 1))
    # pass the phrase and an accumulator as arguments
    result = check_words(phrase, [])
    IO.inspect(result)
  end
  # using recursion to traverse the list
  # when the phrase has no more words, it will match the empty list
  def check_words([], accumulator) do
    Enum.join(accumulator, " ")
  end
  # pattern match on the first letter and use a guard to decide if it is a vowel
  def check_words([{first_letter, rest_of_word} | rest_of_list], accumulator)
      when first_letter in @vowels do
    new_word = first_letter <> rest_of_word <> "ay"
    # call the function again, with the rest of the list
    check_words(rest_of_list, [new_word | accumulator])
  end
  # when the guard does not match for vowels, it should be a consonant
  def check_words([{first_letter, rest_of_word} | rest_of_list], accumulator) do
    new_word = rest_of_word <> first_letter <> "ay"
    check_words(rest_of_list, [new_word | accumulator])
  end
end
You will need to skip or handle the punctuation (commas and periods), and possibly make some more tweaks, for this to be fully functional. But I hope it helps as a general idea.
Notes:
Write tests. They will really help you understand what the code is doing.
Elixir is an amazing language. If you come from OOP, it may look strange in the beginning. The book Programming Elixir is really good and will help you a lot if you want to progress quickly.
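The question also asked for a pseudo-code-like structure. As a rough sketch only, written in Python rather than Elixir so it reads as language-neutral pseudo-code, the idea is one pure function that maps one word to exactly one result, applied to every word of the list. The names pig_latin and translate are made up for illustration, and the sketch assumes the multi-letter groups only matter at the start of a word and that punctuation has already been stripped.

```python
# Prefixes treated like vowels: the word is kept and "ay" is appended.
VOWEL_PREFIXES = ("a", "e", "i", "o", "u", "yt", "xr")
# Groups treated like a single consonant; longer groups come first so "thr" wins over "th".
CONSONANT_CLUSTERS = ("squ", "sch", "thr", "ch", "qu", "th")


def pig_latin(word):
    """Return the transformed word: one input, exactly one output, no nils."""
    if word.startswith(VOWEL_PREFIXES):
        return word + "ay"
    for cluster in CONSONANT_CLUSTERS:
        if word.startswith(cluster):
            return word[len(cluster):] + cluster + "ay"
    # Plain single consonant: move it to the end.
    return word[1:] + word[0] + "ay"


def translate(phrase):
    """Map the pure function over every word; the result is a new list."""
    return [pig_latin(word) for word in phrase.lower().split()]


print(translate("Pattern Matching with Elixir"))
```

In Elixir the same shape would be several function clauses selected by pattern matching, mapped over the list with Enum.map, instead of nested for comprehensions that each return a list.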

Related

grepl() in R using complex pattern with multiple AND, OR

Is that possible to use a pattern like this (see below) in grepl()?
(poverty OR poor) AND (eradicat OR end OR reduc OR alleviat) AND extreme
The goal is to determine if a sentence meets the pattern using
ifelse(grepl(pattern, x, ignore.case = TRUE),"Yes","No")
For example, if x = "end extreme poverty in the country", it will return "Yes", while if x = "end poverty in the country", it will return "No".
An earlier post here works only for single words, like poor AND eradicat AND extreme, but does not work for my case. Is there any way to achieve my goal?
Tried this, pattern = "(?=.*poverty|poor)(?=.*eradicat|end|reduce|alleviate)(?=.*extreme)", but it does not work. The error is 'Invalid regexp'
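To use all 3 assertions, group the alternatives inside each lookahead with a non-capturing group. Without the group, (?=.*poverty|poor) is read as "either .*poverty, or poor at the current position", so the alternation does not apply where you intended.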
^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+
^ Start of string
(?=.*(?:poverty|poor)) Assert either poverty OR poor
(?=.*extreme) Assert extreme
(?=.*(?:eradicat|end|reduc|alleviat)) Assert either eradicat OR end OR reduc or alleviat
.+ Match the whole line for example
For grepl, you have to set perl = TRUE to enable PCRE, which supports the lookarounds; the default engine does not, hence the 'Invalid regexp' error.
grepl('^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+', v, perl=T)
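For comparison, here is a small sketch of the same lookahead pattern in Python (not R); the function name meets_pattern is hypothetical. It shows how three independent lookaheads anchored at the start of the string act as AND, while the alternation inside each group acts as OR.

```python
import re

# Each lookahead scans the whole string from the start, so all three must succeed (AND).
PATTERN = re.compile(
    r"^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat))",
    re.IGNORECASE,
)


def meets_pattern(sentence):
    """Return "Yes" when every group of keywords is present, otherwise "No"."""
    return "Yes" if PATTERN.search(sentence) else "No"


print(meets_pattern("end extreme poverty in the country"))
print(meets_pattern("end poverty in the country"))
```

Note that stems such as end also match inside longer words (for example "extend"), which is the same behavior as the R pattern above.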

Can quantifiers be used in regex replacement in R?

My objective is to replace a matched string with a symbol repeated once per character of the match, in the same way that letters can be converted to capitals with \\U\\1. If my pattern were "...(*)...", the replacement for what (*) captures would be something like x\\q1 or {\\q1}x, giving as many x characters as * matched.
Is this possible?
I am thinking mainly of sub and gsub, but answers using other libraries such as stringi or stringr are welcome.
You can use perl = TRUE or perl = FALSE and any other options, as convenient.
I assume the answer may be negative, since the options seem quite limited (?gsub):
a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Main quantifiers are (?base::regex):
?
The preceding item is optional and will be matched at most once.
*
The preceding item will be matched zero or more times.
+
The preceding item will be matched one or more times.
{n}
The preceding item is matched exactly n times.
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
OK, but there seems to be an option, (*) (not in PCRE; I am not sure whether it is in Perl or elsewhere), which captures the number of characters the star quantifier matches (I found it at https://www.rexegg.com/regex-quantifier-capture.html), so that \q1 (same reference) could refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,}, but I'm not sure whether that holds for what I'm interested in.
EDIT UPDATE:
Since commenters asked, I am updating my question with a specific example, adapted from another interesting question. Let's say we have a <- "I hate extra spaces elephant". We want to keep a single space between words and the first 5 characters of each word (as in the original question), but then put a dot for each remaining character (not sure whether this is what the original question expected, but it doesn't matter). The resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant at the end of elephant). So I started by keeping the first 5 characters with
gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"
How should I replace the exact number of characters in \\S* by dots or any other symbol?
Quantifiers cannot be used in the replacement pattern, nor can the number of characters they matched.
What you need is a \G-based PCRE pattern to find consecutive matches after a specific place in the string:
a <- "I hate extra spaces elephant"
gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
Details
(?:\G(?!^)|(?<!\S)\S{5}) - the end of the previous successful match or five non-whitespace chars not preceded with a non-whitespace char
\K - a match reset operator discarding text matched so far
\S - any non-whitespace char.
gsubfn is like gsub except that the replacement can be a function which takes the match as input and returns the replacement. The function can optionally be expressed as a formula, as we do here, replacing each string of word characters with the output of the function. No complex regular expressions are needed.
library(gsubfn)
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), strrep(".", max(0, nchar(x) - 5))), a)
## [1] "I hate extra space. eleph..."
or almost the same, except that the function is slightly different:
gsubfn("\\w+", ~ paste0(substr(x, 1, 5), substring(gsub(".", ".", x), 6)), a)
## [1] "I hate extra space. eleph..."
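The function-as-replacement idea is not specific to gsubfn. As an illustration in Python (an analogue, not the R API), re.sub also accepts a callable that receives each match and returns its replacement; the helper name mask_after is made up for this sketch.

```python
import re


def mask_after(text, keep=5, symbol="."):
    """Keep the first `keep` characters of each word and replace the rest with `symbol`."""

    def replace(match):
        word = match.group()
        # One symbol per character beyond the kept prefix; none for short words.
        return word[:keep] + symbol * max(0, len(word) - keep)

    return re.sub(r"\S+", replace, text)


print(mask_after("I hate extra spaces elephant"))
```

Because the replacement is computed per match, the number of dots follows the length of each word without any need for quantifier information in the replacement string.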

Extract first letter in each word in R

I had a data.frame with some categorical variables. Let's suppose sentences is one of these variables:
sentences <- c("Direito à participação e ao controle social",
"Direito a ser ouvido pelo governo e representantes",
"Direito aos serviços públicos",
"Direito de acesso à informação")
For each value, I would like to extract just the first letter of each word, ignoring words that have 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:
[1] "DPCS", "DOGR", "DSP", "DAI"
I tried to make a pattern subset using stringr with a regex pattern found here:
library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)
But I got an error when creating the pattern object:
Error: '\w' is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"
What am I doing wrong?
Thanks in advance for any help.
You can use gsub to delete all the unwanted characters and keep only the ones you want. From the expected output, it seems you are still using letters from words that are 3 characters long:
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DSOPGR" "DASP" "DAI"
But if we were to ignore the words you indicated then it would be:
gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DOGR" "DSP" "DAI"
@Onyambu's answer is great, though as a regular expression beginner it took me a long time to understand it well enough to adapt it to my own needs.
Here is my understanding of gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE).
I am posting it in the hope that it is helpful to others.
Background information:
\\b: boundary of word
\\pL matches any kind of letter from any language
{4,} is an occurrence indicator
{m}: The preceding item is matched exactly m times.
{m,}: The preceding item is matched m or more times, i.e., m+
{m,n}: The preceding item is matched at least m times, but not more than n times.
| is OR logic operator
. represents any one character except newline.
\\U\\1 in the replacement text is to reinsert text captured by the pattern as well as capitalize the texts. Note that parentheses () create a numbered capturing group in the pattern.
With all the background knowledge, the interpretation of the command is
replace words matching \\b(\\pL)\\pL{4,} with the first letter
replace any character not matching the above pattern with "" as nothing is captured for this group
Here are two great places where I learned all this background.
https://www.regular-expressions.info/rlanguage.html
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
You can use this pattern: (?<=^| )\S(?=\pL{4,})
I used a positive lookbehind to make sure the matches are preceded by either a space or the beginning of the line. Then I match one character, only if it is followed by 4 or more letters, hence the positive lookahead.
I suggest you don't use \w for non-English languages, because in many regex engines it won't match characters with accents. Instead, \pL matches any letter from any language.
Once you have your matches, you can just concatenate them to create your strings (dpcs, dogr, etc...)
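As a sketch of the match-then-concatenate approach, here it is in Python (not R). Python's built-in re module does not support \pL, so this version assumes [^\W\d_]+ as a stand-in for "a run of letters" and filters by length in code rather than in the pattern; the function name acronym is hypothetical.

```python
import re


def acronym(sentence, min_length=5):
    """Join the capitalized first letters of words with at least `min_length` letters."""
    # In Python 3, \w is Unicode-aware for str patterns, so accented letters are included.
    words = re.findall(r"[^\W\d_]+", sentence)
    return "".join(word[0].upper() for word in words if len(word) >= min_length)


sentences = [
    "Direito à participação e ao controle social",
    "Direito a ser ouvido pelo governo e representantes",
    "Direito aos serviços públicos",
    "Direito de acesso à informação",
]
print([acronym(s) for s in sentences])
```

Filtering on word length in code keeps the regex trivial, at the cost of a second pass over the matched words.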

regex - define boundary using characters & delimiters

I realize this is a rather simple question and I have searched throughout this site, but I just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First, have the regex pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to get this:
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations and am making substitutions, but am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind. It says that any match must be preceded by either a semi-colon or the beginning of the string, but what it matches is not included in the part to be replaced. The " *" (a space followed by *) will match 0 or more spaces, since your example string had one case where there was a space between the semi-colon and the lab. It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on only one side (the first in my example), since the match consumes the delimiter (unlike with the look-ahead or look-behind). If you consume both delimiters, the next field will be skipped and only every other field will be considered for replacement.
I'd recommend doing this in two steps:
Split the string by the delimiters
Do the replacements
(optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = T))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = T))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste(fixedApts, collapse = ";") to collapse the vector back into a single semicolon-delimited string at the end, if that's what you need to do.
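The split, replace, rejoin steps can be sketched as follows in Python (not R), with a lookup table standing in for the regular expressions, in the spirit of the switch() idea. The table contents and the name expand_fields are assumptions for illustration.

```python
# Hypothetical lookup table: exact field value -> replacement.
REPLACEMENTS = {"Apt": "Apartment", "Apts": "Apartment"}


def expand_fields(value, delimiter=";"):
    """Split on the delimiter, replace whole fields via the table, and rejoin."""
    fields = value.split(delimiter)
    # dict.get falls back to the original field when there is no replacement.
    return delimiter.join(REPLACEMENTS.get(field, field) for field in fields)


print(expand_fields("Apt;House;Condo;Apts;"))
```

Because whole fields are compared, "Aptitude" would be left alone, which is what the ^...$ anchors achieve in the regex version. Python's split keeps the empty field after a trailing delimiter, so the trailing semicolon survives the round trip here.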

Use regular expressions inside only the end portion of strings

I am pre-processing a data frame with 100,000+ blog URLs, many of which contain content from the blog header. The grep function lets me drop many of those URLs because they pertain to archives, feeds, images, attachments or a variety of other reasons. One of them is that they contain “atom”.
For example,
string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one"
df <- data.frame(row, string)
df$string <- as.character(df$string)
df[-grep("atom", df$string), ]
My problem is that the pattern “atom” might appear in a blog header, which is important content, and I do not want to drop those URLs.
How can I concentrate the grep on only the final 20 characters (or some other number that greatly reduces the risk that I will grep out content that contains the pattern, rather than the ending elements)? A related question, "Regular Expressions _# at end of string", uses $ at the end but is not about R; besides, I don't know how to extend the $ back 20 characters.
Assume that it is not always the case that the pattern has forward slashes on either or both ends. E.g, /atom/.
The function substr can isolate the end portion of the strings, but I don’t know how to grep only within that portion. The pseudo-code below draws on the %in% function to try to illustrate what I would like to do.
substr(df$string, nchar(df$string)-20, nchar(df$string)) # extracts last 20 characters; start at nchar end -20, to end
But what is the next step?
string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]
Thank you for your guidance.
lastpart <- substr(df$string, nchar(df$string)-20, nchar(df$string))
if (length(grep("atom", lastpart)) > 0) {
  # atom was in there
} else {
  # atom was not in there
}
You could also do it without lastpart:
if (length(grep("atom", substr(df$string, nchar(df$string)-20, nchar(df$string)))) > 0) {
  # atom was in there
} else {
  # atom was not in there
}
but then things become harder to read (it may perform marginally better, though).
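Note that grep on a whole column returns the indices of all matching elements, so the if/else above tells you whether any row matched rather than deciding row by row. As a sketch of the per-row version of the same idea, here it is in Python (not R); the function name and the second URL are invented for illustration.

```python
def drop_urls_with_tail_match(urls, pattern="atom", tail_length=20):
    """Keep only the URLs whose last `tail_length` characters do not contain `pattern`."""
    # Slicing with a negative start is safe even for URLs shorter than tail_length.
    return [url for url in urls if pattern not in url[-tail_length:]]


urls = [
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/",
]
print(drop_urls_with_tail_match(urls))
```

The first URL is dropped because "atom" appears in its final 20 characters; the second is kept even though "atomic" appears earlier in the header part.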
You could try using a URL component depth approach (i.e. only return df rows which contain the word "atom" after 5 slashes):
find_first_match <- function(string, pattern) {
  components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
  matches <- grepl(pattern = pattern, x = components)
  if (any(matches)) {
    # which.max() on a logical vector returns the index of the first TRUE
    first.match <- which.max(matches)
  } else {
    first.match <- NA_integer_
  }
  return(first.match)
}
Which can be used as follows:
# Add index for first component match of "atom" in url
df$first.match <- vapply(df$string, find_first_match, integer(1), pattern = "atom", USE.NAMES = FALSE)
# Return rows which have the word "atom" only after the first 5 components
df[which(df$first.match >= 6), ]
# row string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/ 6
This gives you control over which URLs to return based on the depth of when "atom" appears
I chose the second answer because it is easier for me to understand and because with the first one it is not possible to predict how many forward slashes to include in the “component depth”.
The second answer translated into English from the inside function to the broadest function out says:
Define the final 20 characters of your string with the substr() function, your substring;
then find if the pattern “atom” is in that sub-string with the grep() function;
then check whether “atom” was found at least once in the substring, i.e. whether the result has length greater than zero, in which case that row will be omitted;
finally, if no pattern is matched, i.e., no “atom” is found in the final 20 characters, leave the row alone. All of this is done with the if…else construct.

Resources