zsh: Removing a run of certain characters at the end of a variable

To keep it simple, consider the following task:
Given a non-empty shell variable var, where we know that the last character is the letter a and that at least one character is not an a, remove all the a characters from the right of the variable.
Example: If the variable initially contains abcadeaaa, it should contain abcade afterwards.
I am wondering whether this can be done in a compact way in Zsh.
Of course this is trivial to do using an external program (such as sed), or by using a while loop that consecutively strips the last a (${VAR%a}) until the value of the variable no longer changes. Both approaches would work in a POSIX shell, in bash or in ksh. However, given that Zsh has so many nice features for expansion, I wonder whether there isn't a better way.
The problem is that matching a run of a certain character (independent of its length) cries out for regular expressions, but the pattern after % in parameter expansion, and the pattern in the s/// substitution, are both wildcard patterns, which don't allow me to do what I want - at least according to my understanding of the zshexpn man page.
Any ideas for this?
Note that this question is not meant to solve a real-world problem (in which case I would simply use sed, as that would get the job done), but is asked more out of academic interest, to find out how far we can stretch the limits of the Zsh expansion mechanism.
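A minimal sketch of the portable while-loop approach described above; it works in zsh as well as in any POSIX shell, bash or ksh (the variable name var and the sample value are just the example from the question):

var=abcadeaaa
while [ "${var%a}" != "$var" ]; do
  var=${var%a}              # strip one trailing 'a' per iteration
done
printf '%s\n' "$var"        # prints: abcade

For a compact zsh-only form, one possibility (a suggestion, not the canonical answer) is to enable EXTENDED_GLOB, whose # operator means "zero or more of the preceding pattern" and is also honored in parameter-expansion patterns:

setopt extendedglob
var=abcadeaaa
print -r -- ${var%%a#}      # %% removes the longest matching suffix: abcade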

Related

Can I extract all matches with functions like aregexec?

I've been enjoying the powerful function aregexec that allows me to mine strings in a fuzzy way.
With it I can search for a nucleotide string such as "ATGGCTTCGTC" within a DNA section, with a defined allowance of insertions, deletions and substitutions.
However, it only shows me the first match instead of scanning the whole string. For example, if I run
aregexec("a","adfasdfasdfaa")
only the first "a" shows up in the result. I'd like to see all the matches.
I wonder if there are other, more powerful functions, or an argument that could be added to this one.
Thank you very much.
P.S. I explained the fuzzy search poorly. I mean that the match doesn't have to be perfect. Say I allow a substitution of one character and search for AATTGG in ctagtactaAATGGGatctgct; the capitalized part will be considered a match. I can similarly allow insertions and deletions of certain characters.
gregexpr will report every occurrence of the pattern in the string, as in this example.
gregexpr("as","adfasdfasdfaa")
There is much more information if you use ?grep in R; it explains every aspect of using regular expressions.
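A small runnable sketch of extracting all the matches that gregexpr finds, via regmatches; note that, unlike aregexec, this is exact matching only, with no allowance for insertions, deletions or substitutions:

x <- "adfasdfasdfaa"
m <- gregexpr("a", x)       # start positions of every exact match of "a"
regmatches(x, m)[[1]]       # the matched substrings themselves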

Convention for cell-local variables in R markdown

I often have variables in cells that I do not need again, say a loop index or just an intermediate result. These build up in the list of variables in RStudio, which does not make it easier to work.
Therefore I would like to mark those “cell local” variables such that they are not used in other cells. I am using underscores instead of dots to separate words in identifiers, sticking with Hadley Wickham's guideline.
One convention that I have seen is prefixing with a dot. Does that make sense in this case? Or rather an underscore, like in Python?
I assume you mean "chunk local." There is nothing automatic that I know of, but if you adopt a convention like a leading ., you can periodically run something like
rm(list = grep("^\\.", ls(all.names = TRUE), value = TRUE))
in a chunk
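A hedged sketch of how the convention might look inside a chunk (the object names here are made up):

.tmp_total <- 0                        # leading dot marks a chunk-local temporary
for (.i in 1:10) .tmp_total <- .tmp_total + .i
result <- .tmp_total                   # keep only what later chunks need

# Periodic cleanup of the dot-prefixed temporaries; note that this also catches
# other dot-prefixed objects in the environment, such as .Random.seed.
rm(list = grep("^\\.", ls(all.names = TRUE), value = TRUE))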

How to find out the longest definition entry in an English dictionary text file?

I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the number of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Super User but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which converted to text has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed in parentheses), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (including treating multiple definitions of a single word as separate ones), then count the number of characters and/or words within (ignoring the examples in parentheses, since those are not part of the proper definition), and finally provide a list of the longest definitions (I don't think I would need more than, say, a top 20 or so to compare). If the file format were an issue, I could convert the file to PDF, EPUB, etc. with no problem. And I guess ideally I would want to be able to choose between counting length by characters and by words, if possible.
How should I go about doing this? I have a little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago and give it a whack. When you run into problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems (see the sketch after this list):
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, e.g.:
All entries have a preamble with the pronunciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: First find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by presence of lone numerals or whatever; if it's not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.
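For concreteness, here is a rough, hedged R sketch of the subproblems above. R is just one possible language choice; the file name dictionary.txt and the pattern handling (preamble, lone sense numerals, parenthesized examples) are assumptions based on the three sample entries and would need adjusting for the real file.

# Read the dictionary file and drop the blank separator lines (steps 1-3).
lines <- readLines("dictionary.txt")
lines <- lines[nzchar(trimws(lines))]

results <- data.frame(word = character(), definition = character(),
                      words = integer(), chars = integer())

for (entry in lines) {
  word <- sub("^(\\S+).*", "\\1", entry)            # the headword is the first token
  body <- sub("^\\S+\\s+/[^/]*/\\s*", "", entry)    # drop headword and /pronunciation/
  body <- gsub("\\([^)]*\\)", "", body)             # strip parenthesized examples
  defs <- trimws(strsplit(body, "\\s\\d+\\s")[[1]]) # split multi-sense entries on lone numerals
  defs <- defs[nzchar(defs)]
  for (d in defs) {
    results <- rbind(results, data.frame(
      word = word, definition = d,
      words = length(strsplit(d, "\\s+")[[1]]),     # word-wise length
      chars = nchar(d)))                            # character-wise length
  }
}

# Top 20 by word count; order by 'chars' instead to rank by characters.
head(results[order(-results$words), ], 20)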

Cleaning data: which to use, grep() or str_extract_all()?

I need to extract from the dataset all the elements that mention "mean" and "std" (standard deviation).
Here is an example of how it is written in feat, in column 2 (the variables).
Goal: I am trying to extract only the elements that are written like this:
"tBodyAcc-mean()-Z"
"tBodyAcc-std()-X"
feat<-read.table("features.txt")
I assumed that using
grep("mean"&"std",feat[,2])
would work, but it does not; I get this error:
"operations are possible only for numeric, logical or complex types"
I found someone who has used this:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
It worked fine, but I do not understand the meaning of the backslashes.
I don't understand exactly what the expression means, and I don't want to use something I don't understand.
What you need is an alternation operator | in a regex pattern. grep allows using literal values (when fixed=TRUE is used) or a regular expression (by default).
Now, you found:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
The -(mean|std)\(\) regex matches a -, then either mean or std (since (...) is a grouping construct that allows enumerating alternatives inside a bigger expression), then ( and then ). The final parentheses must be escaped with a backslash to be matched literally, and the backslash itself has to be doubled inside an R string literal - that is why you see \\( and \\) in the R code.
If you think the expression is overkill, and you only want to find entries with either std or mean as substrings, you can use a simpler
meansd<-grep("mean|std",feat[,2])
Here, no grouping construct is necessary since you only have two alternatives in the expression.
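A small runnable sketch on a made-up vector of feature names (features.txt itself is not reproduced here, so feat_names is an assumption standing in for feat[,2]):

library(stringr)   # for str_extract_all, mentioned in the question title
feat_names <- c("tBodyAcc-mean()-Z", "tBodyAcc-std()-X", "tBodyAcc-max()-Y")

grep("-(mean|std)\\(\\)", feat_names)                 # indices of the matching elements
grep("-(mean|std)\\(\\)", feat_names, value = TRUE)   # the matching names themselves
str_extract_all(feat_names, "-(mean|std)\\(\\)")      # matched substrings per element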

What is wrong with this Regex "^(.|\s){1,280}$"

Should be validating 1-280 input characters, but it hangs when more than 280 characters are input.
Clarification
I am using the above regex to validate the length of the input string to be 280 characters maximum.
I am using asp:RegularExpressionValidator to do that.
There's nothing “wrong” with it per se, but it's horrendous because, with most RE engines (you don't say which one you're using), when the pattern doesn't match on the first attempt the engine backtracks and tries loads of different possibilities (none of which can ever produce a match). So it's not a hang, but rather a machine that's trying to execute around 2^280 operations to see if a match is possible. Excuse me if I don't wait around for that!
Of course, it's theoretically possible for the RE compiler to merge the (.|\s) part of the RE into something it doesn't need to backtrack to deal with. Some RE engines do this (typically the more automata-theoretic ones) but many don't (the stack-based ones).
It is trying every possible combination of . and \s for each character trying to find a version of the pattern that matches the string.
A . already matches any character other than a newline (and \s covers the newlines), so (.|\s) buys you very little while inviting heavy backtracking. Further, if you just want to check the length of the string, then just do that - why pull in regexes at all?
If you really want to use a regular expression, you could use ^.{1,280}$ combined with the Singleline option, so that the . metacharacter will match everything, including newlines.
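As a quick illustration of the "just check the length" suggestion (sketched here in R rather than in the ASP.NET validator, purely for concreteness):

x <- strrep("a", 300)               # a 300-character test string
nchar(x) >= 1 && nchar(x) <= 280    # FALSE - a plain length check, no regex needed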
