Parsing strings with grep/str_extract - r

As part of my feature engineering, I need to parse text strings from different languages and keep text enclosed within parentheses. Everything was going well until I encountered a very strange phenomenon. For some languages, the parentheses I need to find look slightly different, and various regexp options fail.
I'm pasting screen-shots because strangely, copying and pasting the strange parentheses changes it to a 'normal' one, so I can't set up a different regex to find those separately.
Notice that the parentheses in the first entry look normal, but for the second entry, it appears sort of 'sharp'
If I use stringr's str_extract, the first instance works fine, but the second fails.
But, the encodings are the same. Anyone know what's going on?
[Edit: here are the results of dput on these same examples. dput apparently sees the parentheses as equivalent, even though grep does not]
c("Obnaružena poterâ šaga na (Motor šprica pipettora R1).", "(STAT tàn zhen Z zhóu ma dá) tàn cè dào diu bù<U+3002>")
Finally, I am actually copy and pasting the two parentheses from R into the code window below; they do appear different this way. First is normal, second is the strange one.
( (

Related

When does pressing the Enter (Return) key matter when creating a regex matching expression?

I want to search for multiple codes appearing in a cell. There are so many codes that I'd like to write parts of the code in succeeding lines. For example, let's say I am looking for "^a11","^b12", "^c67$" or "^d13[[:blank:]]". I am using:
^a11|^b12|^c67$|^d13[[:blank:]]
This seems to work. Now, I tried:
^a11|^b12|
^c67$|^d13[[:blank:]]
That also seemed to work. However when I tried:
^a11|^b12|^c67$|
^d13[[:blank:]]
It did not count the last one.
Note that my code is wrapped into a function. So the above is an argument that I feed the function. I'm thinking that's the problem, but I still don't know why one truncation works while the other does not.
I realized the answer today. The problem is that since I am feeding the regex argument, it will count the next line in the succeeding code.
Thus, the code below was only "working" because ^c67$ is empty.
^a11|^b12|
^c67$|^d13[[:blank:]]
And the code below was not working because ^d13 is not empty but also this setup looks for (next line)^d13[[:blank:]] instead of just ^d13[[:blank:]]
^a11|^b12|^c67$|
^d13[[:blank:]]
So an inelegant fix is:
^a11|^b12|^c67$|
^nothinghere|^d13[[:blank:]]
This inserts a burner code that is empty which is affected by the line break.

Paste bracketing causes slow pastes

When I paste in large blocks of text, or even just long commands or urls, pasting is done character by character and takes a long time. I can make it speed up by pressing right arrow but that sometimes autosuggest will then come in and paste other things.
I know this is caused in part by paste bracketing. I tried turning it off which makes it about the speed it goes when I press right arrow during a slow paste. But then indents aren't preserved and it executes line by line. This doesn't work for me since I regularly paste in multiline commands like with Docker for example. Also, syntax highlighting doesn't work anymore when using the right arrow technique.
Is there a way to make pasting instant but preserve the bracketing behavior? In vim it's instant but I guess indents are just newline characters or something I guess? Is bracketing even causing this problem or is there some other explanation/fix?
I'm using iTerm2 with ohmyzsh and Powerlevel10k on MacOS Catalina. Searched high and low but only found posts about the ~0 and ~1 characters being inserted with bracketing or something. I never had that happen.
A gif for illustration purposes:
https://i.imgur.com/vFL0gmL.mp4
Just disable magic functions in .zsh.
Uncoment DISABLE_MAGIC_FUNCTIONS="true"

Multiline prompt formatting incorrectly due to date command in zsh

I have the following in my .zshrc:
setopt PROMPT_SUBST
precmd(){
echo""
LEFT="$fg[cyan]$USERNAME#$HOST $fg[green]$PWD"
RIGHT="$fg[yellow]$(date +%I:%M\ %P)"
RIGHTWIDTH=$(($COLUMNS-${#LEFT}))
echo $LEFT${(l:$RIGHTWIDTH:)RIGHT}
}
PROMPT="$ "
This gives me the following screenshot
The time part on the right is always not going all the way to the edge of the terminal, even when resized. I think this is due to the $(date +%I:%M\ %P) Anyone know how to fix this?
EDIT: Zoomed in screenshot
While your idea is commendable, the problem you suffer from is that your LEFT and RIGHT contains ANSI escape sequences (for colors), which should be zero-width characters, but are nevertheless counted toward the length of a string if you naively use $#name, or ${(l:expr:)name}.
Also, as a matter of style, you're better off using Zsh's builtin prompt expansion, which wraps a lot of common things people may want to see in their prompts in short percent escape sequences. In particular, there are builtin sequences for colors, so you don't need to rely on nonstandard $fg[blah].
Below is an approximate of your prompt written in my preferred coding style... Not exactly, I made everything super verbose so as to be understandable (hopefully). The lengths of left and right preprompts are calculated after stripping the escape sequences for colors and performing prompt expansion, which gives the correct display length (I can't possibly whip that up in minutes; I ripped the expression off pure).
precmd(){
local preprompt_left="%F{cyan}%n#%m %F{green}%~"
local preprompt_right="%F{yellow}%D{%I:%M %p}%f"
local preprompt_left_length=${#${(S%%)preprompt_left//(\%([KF1]|)\{*\}|\%[Bbkf])}}
local preprompt_right_length=${#${(S%%)preprompt_right//(\%([KF1]|)\{*\}|\%[Bbkf])}}
local num_filler_spaces=$((COLUMNS - preprompt_left_length - preprompt_right_length))
print -Pr $'\n'"$preprompt_left${(l:$num_filler_spaces:)}$preprompt_right"
}
PROMPT="$ "
Edit: In some terminal emulators, printing exactly $COLUMN characters might wrap the line. In that case, replace the appropriate line with
local num_filler_spaces=$((COLUMNS - preprompt_left_length - preprompt_right_length - 1))
End of edit.
This is very customizable, because you can put almost anything in preprompt_left and preprompt_right and still get the correct lengths — just remember to use prompt escape sequence for zero width sequences, e.g., %F{}%f for colors, %B%b for bold, etc. Again, read the docs on prompt expansion: http://zsh.sourceforge.net/Doc/Release/Prompt-Expansion.html.
Note: You might notice that %D{%I:%M %p} expands to things like 11:35 PM. That's because I would like to use %P to get pm, but not every implementation of strftime supports %P. Worst case scenario: if you really want lowercase but %P is not supported, use your original command subsitution $(date +'%I:%M %P').
Also, I'm using %~ instead of %/, so you'll get ~/Desktop instead of /c/Users/johndoe/Desktop. Some like it, some don't. However, as I said, this is easily customizable.

ZSH prompt substitution issues

I've searched through several answers here and through Google, but I'm still not sure what's going wrong with my prompt.
According to the documentation I've read, this should work
setopt prompt_subst
autoload -U colors && colors
PROMPT="%{[00m[38;5;245m%}test %D%{[00m%}"
My prompt is the following, however:
[00m[38;5;245mtest 15-07-01[00m
Note that the date expansion actually worked, so prompt substitution is working. The ZSH man pages for prompt expansion states that %{...%} should be treated as a raw escape code, but that doesn't seem to be happening. Passing that string to print -P also results in the output above. I've found example prompts on the Internet for ZSH that also seem to indicate that the above syntax should work. See this for one example - the $FG and $FX arrays are populated with escape codes and are defined here. I've tried this example directly by merging both the files above, adding setopt prompt_subst to the beginning just to make sure it's set, then sourcing it and the prompt is a mess of escape codes.
The following works
setopt prompt_subst
autoload -U colors && colors
PROMPT=$'%{\e[00m\e[38;5;245m%}test %D%{\e[00m%}'
I get the expected result of test 15-07-01 in the proper color.
I've tested this on ZSH 5.0.5 in OSX Yosimite, 5.0.7 from MacPorts, and 4.3.17 on Debian, with the same results. I know I have provided a valid solution to my own problem here with the working example, but I'm wondering why the first syntax isn't working as it seems it should.
I think this all has to do with the timeless and perennial problem of escaping. It's worth reminding ourselves what escaping means, briefly: an escape character is an indicator to the computer that what follows should not be output literally.
So there are 2 escaping issues with:
PROMPT="%{[00m[38;5;245m%}test %D%{[00m%}"
Firstly, the colour escape sequences (eg; [00m) should all start with the control character like so \e[00m. You may have also seen it written as ^[00m and \003[00m. What I suspect has happened is one of the variations has suffered the common fate of being inadvertently escaped by either the copy/paste of the author or the website's framework stack, whether that be somewhere in a database, HTTP rendering or JS parsing. The control character (ie, ^, \e or \003), as you probably know, does not have a literal representation, say if you press it on the keyboard. That's why a web stack might decide to not display anything if it sees it in a string. So let's correct that now:
PROMPT="%{\e[00m\e[38;5;245m%}test %D%{\e[00m%}"
This actually nicely segues into the next escaping issue. Somewhat comically \e[ is actually a representation of ESC, it is therefore in itself an escape sequence marker that, yes, is in turn escaped by \. It's a riff on the old \\\\\\\\\\ sort of joke. Now, significantly, we must be clear on the difference between the escape expressions for the terminal and the string substitutions of the prompt, in pseudo code:
PROMPT="%{terminal colour stuff%}test %D%{terminal colour stuff%}"
Now what I suspect is happening, though I can't find any documentation to prove it, is that once ZSH has done its substitutions, or indeed during the substitution process, all literal characters, regardless of escape significations, are promoted to real characters¹. To yet further the farce, this promotion is likely done by escaping all the escape characters. For example if you actually want to print '\e' on the command line, you have to do echo "\\\e". So to overcome this issue, we just need to make sure the 'terminal colour stuff' escape sequences get evaluated before being assigned to PROMPT and that can be done simply with the $'' pattern, like so:
PROMPT=$'%{\e[00m\e[38;5;245m%}test %D%{\e[00m%}'
Note that $'' is of the same ilk as $() and ${}, except that its only function is to interpret escape sequences.
[1] My suspicion for this is based on the fact that you can actually do something like the following:
PROMPT='$(date)'
where $(date) serves the same purpose as %D, by printing a live version of the date for every new prompt output to the screen. What this specific examples serves to demonstrate is that the PROMPT variable should really be thought of as storage for a mini script, not a string (though admittedly there is overlap between the 2 concepts and thus stems confusion). Therefore, as a script, the string is first evaluated and then printed. I haven't looked at ZSH's prompt rendering code, but I assume such evaluation would benefit from native use of escape sequences. For example what if you wanted to pass an escape sequence as an argument to a command (a command that gets run for every prompt render) in the prompt? For example the following is functionally identical to the prompt discussed above:
PROMPT='%{$(print "\e[00m\e[38;5;245m")%}test $(date)%{$(print "\e[00m")%}'
The escape sequences are stored literally and only interpreted at the moment of each prompt rendering.

Improperly formatted CSV, how to repair?

I have a csv, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it to wordpress and map it to current fields, but it isn't functioning properly. Any suggestions?
Just use search and replace in this case. remove the commas at the end and then replace the remaining commas with ",".
Should anyone else have the same issue. Know that this solution will only work with data much like the example giving. If data has a lot of text and there are commas within the text that need kept. Then search replacing comma will not work. Using regex would be the next option and that can be done in Notepad ++
However I think the regex pattern depends on the data so not much point creating an example.
PHP could be used to explode each line also. Remove values that match a regex out of many i.e. URL, money. Then what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns with a lot of text

Resources