grep on two strings - r

I'm trying to grab two different elements from a character vector.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is:
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
Any help would be appreciated.

Use normal parentheses (, not the square brackets [
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
This matches an underscore _, followed by any character (.) three times ({3}), immediately followed by the end of the string ($).
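On the example vector this picks out the same four elements:
grep('_.{3}$', str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"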

This should work: grep('_abc|_zxy', str, value = TRUE)
X|Y matches when either X matches or Y matches.

In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?
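A close variant, in case you only need the subset: grepl() returns a logical vector directly, and fixed = TRUE treats the underscore as a literal string rather than a regex (a small sketch, not from the answers above):
str[grepl("_", str, fixed = TRUE)]
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"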

Related

Insert characters when a string changes its case R

I would like to insert characters at the places where a string changes its case. I tried this to insert a '\n' after a fixed number of characters and then a ' ', as I couldn't figure out how to detect the case change:
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed, but I would like to know how to do it for the general case.
Surely there's a less convoluted regex for this, but you could try:
gsub('([a-z])([A-Z])', '\\1\n\\2', gsub('([A-Z][0-9])', ' \\1', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
Here is an option
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"

R regex match things other than known characters

For a text field, I would like to expose the values that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [:punct:], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
'This is an 2-piece à-la-carte dessert' to pass, while
'Ã this Øs an apple' shows up as an anomaly.
The 'does not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]", c("abcdef", "defabc", "apple"))
(those that do not contain 'abc') matches all three, while
grep("(abc)", c("abcdef", "defabc", "apple"))
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z Any lowercase ASCII letter
1-9 Any digit in the range from 1 to 9 (excludes 0)
[:punct:] Any punctuation character
The space character
àâæçéèêëîïôœùûüÿ Any valid French character with a diacritic mark
- The hyphen character
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]", especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that ranges such as "[0-9A-Za-z]" might be locale- and encoding-dependent.
If by "expose" you instead meant "identify the postion within the string" then you could use the negation operator "^" within the character class formalism and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
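To pull out the offending characters themselves rather than their positions, the gregexpr() result can be passed to regmatches(); a small sketch with the same pattern:
regmatches(x, gregexpr("[^A-z1-9[:punct:] àéèçè]+", x))
[[1]]
[1] "Ã" "Ø"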

regex to replace text *outside* of {}

I want to use regex to replace commands or tags around strings. My use case is converting LaTeX commands to bookdown commands, which means doing things like replacing \citep{*} with [#*], \ref{*} with \#ref(*), etc. However, let's stick to the generalized question:
Given a string <begin>somestring<end>, where <begin> and <end> are known and somestring is an arbitrary sequence of characters, can we use regex to substitute <newbegin> and <newend> to get the string <newbegin>somestring<newend>?
For example, consider the LaTeX command \citep{bonobo2017}, which I want to convert to [#bonobo2017]. For this example:
<begin> = \citep{
somestring = bonobo2017
<end> = }
<newbegin> = [#
<newend> = ]
I'm hoping for an R or Notepad++ solution.
Additional Examples
Convert \citet{bonobo2017} to #bonobo2017
Convert \ref{myfigure} to \#ref(myfigure)
Convert \section{Some title} to # Some title
Convert \emph{something important} to *something important*
I'm looking for a template regex that I can fill in my <begin>, <end>, <newbegin> and <newend> on a case-by-case basis.
You can try something like this with dplyr + stringr:
string = "\\citep{bonobo2017}"
begin = "\\citep{"
somestring = "bonobo2017"
end = "}"
newbegin = "[#"
newend = "]"
library(stringr)
library(dplyr)
string %>%
  str_extract(paste0("(?<=\\Q", begin, "\\E)\\w+(?=\\Q", end, "\\E)")) %>%
  paste0(newbegin, ., newend)
or:
string %>%
  str_replace_all(paste0("\\Q", begin, "\\E|\\Q", end, "\\E"), "") %>%
  paste0(newbegin, ., newend)
You can also make it a function for convenience:
convertLatex = function(string, BEGIN, END, NEWBEGIN, NEWEND){
  string %>%
    str_replace_all(paste0("\\Q", BEGIN, "\\E|\\Q", END, "\\E"), "") %>%
    paste0(NEWBEGIN, ., NEWEND)
}
convertLatex(string, begin, end, newbegin, newend)
# [1] "[#bonobo2017]"
Notes:
Notice that I manually added an additional \ to "\\citep{bonobo2017}". This is because raw strings did not exist in R at the time (they arrived in R 4.0.0), so a single \ would be treated as the start of an escape sequence; another \ is needed to escape the first \.
The regex in str_extract uses a positive lookbehind and a positive lookahead to extract the somestring in between begin and end.
str_replace_all takes another approach: removing begin and end from string.
The "\\Q" ... "\\E" pair in the regex quotes everything between them, so non-alphanumeric characters are treated as literal text rather than regex metacharacters. This is especially useful in your case since your LaTeX commands are likely to contain special characters; the quoting escapes them for you automatically.

String recognition in idl

I have the following strings:
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat
and from each I want to extract three variables: 1. SWIR32, 2. the date, and 3. the text following the date. I want to automate this process for about 200 files, so individually selecting the locations won't exactly work for me.
so I want:
variable1=SWIR32
variable2=2005210
variable3=East_A
variable4=SWIR32
variable5=2005210
variable6=Froemke-Hoy
I am going to be using these to add titles to graphs later on, but since the position of the text in each string varies, I am unsure how to do this using STRMID.
I think you want to use a combination of STRPOS and STRSPLIT. Something like the following:
s = ['F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat', $
     'F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat']
name = STRARR(s.length)
date = name
txt = name
foreach sub, s, i do begin
  sub = STRMID(sub, 1 + STRPOS(sub, '\', /REVERSE_SEARCH))   ; drop the directory part
  sub = STRMID(sub, 0, STRPOS(sub, '.', /REVERSE_SEARCH))    ; drop the ".dat" extension
  parts = STRSPLIT(sub, '_', /EXTRACT)
  name[i] = parts[0]                                         ; e.g. SWIR32
  date[i] = parts[1]                                         ; e.g. 2005210
  txt[i] = STRJOIN(parts[2:*], '_')                          ; e.g. East_A or Froemke-Hoy
endforeach
You could also do this with a regular expression (using just STRSPLIT), but regular expressions tend to be complicated and error-prone.
Hope this helps!

Pyparsing - name not starting with a '$' character

I am trying to use Pyparsing to identify a keyword that does not begin with $. So for the following input:
$abc = 5 # is not a valid one
abc123 = 10 # is valid one
abc$ = 23 # is a valid one
I tried the following
var = Word(printables, excludeChars='$')
var.parseString('$abc')
But this doesn't allow any $ in var. How can I specify all printable characters other than $ in the first character position? Any help will be appreciated.
Thanks
Abhijit
You can use the method I used to define "all characters except X" before I added the excludeChars parameter to the Word class:
from pyparsing import Word, printables

NOT_DOLLAR_SIGN = ''.join(c for c in printables if c != '$')
keyword_not_starting_with_dollar = Word(NOT_DOLLAR_SIGN, printables)
This should be a bit more efficient than building up with a Combine and a NotAny. But this will match almost anything: integers, words, valid identifiers, invalid identifiers. So I'm skeptical of the value of this kind of expression in your parser.
