Discrepancy between stringr and base R regular expressions - r

I have this rather annoying to read regular expression.
pattern = "(?<=(?<=[0-9])[dD](?=[0-9]))[0-9]+"
It was generated automatically so human readability or efficiency is less of an issue than validity. It was meant to parse RPG dice type syntax, such as 10d20. Specifically it is supposed to match the 20.
If I use the old method of string matching in R
text = '10d20'
regmatches(text,regexpr(pattern,text,perl = TRUE))
I get what I want, which is 20, however using the more modern method of string matching
stringr::str_match(text, pattern)
I get nothing. I was wondering what causes this difference between the two methods and how can I avoid issues like this in the future.

Unless you need the extras that come with ICU (via stringi which stringr is merely a crutch helper wrapper for) there's no need for woe.
In fact, there's a pkg with less marketing power than tidyverse-based pkgs called stringb which puts "data first" (like string[ir]) and relieves you from base regexp inanity. Vis-a-vis:
library(stringb)
pattern <- "(?<=(?<=[0-9])[dD](?=[0-9]))[0-9]+"
text <- '10d20'
text_extract(text, pattern, perl = TRUE)
## [1] "20"
You get saner syntax without relying on a massive compiled code dependencies and 1-away* stringr abstraction. Bellisimo!
* TBFair: the stringb package also has 1-away abstraction from base R functions but the saner syntax makes up for it IMO (unlike stringr).

Related

Rename a column with R

I'm trying to rename a specific column in my R script using the colnames function but with no sucess so far.
I'm kinda new around programming so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes and name it Nota Final in a data frame called notas with the codes:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in [this post][1] a code that goes:
colnames(notas) [13] <- `Nota Final`
But it also return the same message.
What I'm doing wrong?
Ps:. Sorry for any misspeling, English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(#Whatif's answer shows how you can do this with the numeric index, but probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future])
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use back-ticks, because tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
Why using backtick? Use the normal quotation mark.
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use space in names. If there are spaces in names (it works and is called a non-syntactic name: And according to Wickham Hadley's description in Advanced R book this is due to historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview what syntactic names are use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"

Extract a theme after a dot (in R)

I am trying to extract the theme (DPS3_2h) that is after the dot in:
ABBCCAA.DPS3_2h
Using a below command I am able to extract E15 that before the underscore. Not sure how to do it for the above example.
E15_AAACCCAAGAGGCTGT
types <- sub('(.*)_.*', '\\1', colnames(E15))
If you aren't bound to base R, the stringr package offers a lot of nice functions to work with strings. The pattern used here is using a positive look behind (?<=).
stringr::str_extract("ABBCCAA.DPS3_2h",
pattern = "(?<=\\.).*")
#> [1] "DPS3_2h"

How to check whether an English word is meaningful in Julia?

In Julia, how can I check an English word is a meaningful word? Suppose I want to know whether "Hello" is meaningful or not. In Python, one can use the enchant or nltk packages(Examples: [1],[2]). Is it possible to do this in Julia as well?
What I need is a function like this:
is_english("Hello")
>>>true
is_english("Hlo")
>>>false
# Because it doesn't have meaning! We don't have such a word in English terminology!
is_english("explicit")
>>>true
is_english("eeplicit")
>>>false
Here is what I've tried so far:
I have a dataset that contains frequent 5char English words(link to google drive). So I decided to augment it to my question for better understanding. Although this dataset is not adequate (because it just contains frequent 5char meaningful words, not all the meaningful English words with any length), it's suitable to use it to show what I want:
using CSV
using DataFrames
df = CSV.read("frequent_5_char_words.csv" , DataFrame , skipto=2)
df = [lowercase(item) for item in df[:,"0"]]
function is_english(word::String)::Bool
return lowercase(word) in df
end
Then when I try these:
julia>is_english("Helo")
false
julia>is_english("Hello")
true
But I don't have an affluent dataset! So this isn't enough. So I'm curious if there are any packages like what I mentioned before, in Julia or not?
(not enough rep to post a comment!)
You can still use NLTK in Julia via PyCall. Or, as it seems you don't need an NLP tool but just a dictionary, you can use wiktionary to do some lookup or build the dataset.
There is a recently new package, Named LanguageDetect.jl. It does not return true/false, but a list of probabilities. You could define something like:
using LanguageDetect: detect
function is_english(text, threshold=0.8)
langs = detect(text)
for lang in langs
if lang.language == "en"
return lang.probability >= threshold
end
end
ret

Running regex in R using str_extract_all has regexp not yet implemented

I am trying to use regex to parse a file using regex. Most of the solutions to using regex in R use the stringr package. I have not found another way, or another package to use that would work. If you have another way of going about this that would also be acceptable.
What I am trying to accomplish is to grab a couple of values that are seperated by spaces with the last value being some comma seperated values of variable length. This should go into a matrix or df in table like format is it is currently.
foo foo_123bar foo,bar,bazz
foo2 foo_456bar foo2,bar2
I have the working example of my regex here.
There could be a couple of issues I could be running into. The first could be that the regex I am writing is not supported by R's regex engine. Although I have the feeling from this that would be supported. I have seen that R uses a POSIX like format which could make things interesting. The second simply could be exactly what the error message bellow is showing. This is not a feature that has been coded in yet. This however would be the most troubling because I don't know another way to solve my problem without this package.
Below is the R code that I am using to replicate this error
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
pattern = "
(?(DEFINE)
(?<blanks>[[:blank:]]+)
(?<var>\"?[[:alnum:]_]+\"?)
(?<csvar>(\"?[[:alnum:]_]+\"?,?)+)
)
^
(?&blanks)((?&var))
(?&blanks)((?&var))
(?&blanks)((?&csvar))"
# Both of these are throwing the error
str_extract_all(string, pattern)
str_extract_all(string, regex(pattern, multiline=TRUE, comments=TRUE))
> Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
> Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)
# Using the example from ?str_extract_all runs without error
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
I am looking for a solution, not necessarily a stringr solution, but this is the only way I found that fits my needs. The other simpler R regex functions only accept the pattern and not the extra parameters that include the multi line and comment functionality that I am using.
You have a PCRE regex that can only be used in methods/functions that parse the regex with the PCRE regex library (or Boost, it is based on PCRE). stringr str_extract parses the regex with the ICU regex library. ICU regex does not support recursion and DEFINE block. You just can't use the in-pattern approach to define subpatterns and then re-use them.
Instead, just declare the regex parts you need to re-use as variables and build the pattern dynamically:
library("stringr")
string = " foo foo_123bar foo,bar,bazz\n foo2 foo_456bar foo2,bar2,bazz2"
blanks <- "[[:blank:]]+"
vars <- "\"?[[:alnum:]_]+\"?"
csvar <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^",blanks,"(", vars, ")",blanks,"(", vars,")",blanks,"(",csvar, ")")
str_match_all(string, pattern)
# [[1]]
# [,1] [,2] [,3] [,4]
#[1,] " foo foo_123bar foo,bar,bazz" "foo" "foo_123bar" "foo,bar,bazz"
Note: you need to use str_match (or str_match_all) to extract the capturing group values as str_extract or str_extract_all only allows access to the whole match values.

stargazer and omit regular expressions

I am trying to use regular expressions to omit some variables in stargazer. I finally found a working regex, but it's using the Perl standard. This doesn't work for the base regex in R, though regexpr in R can take a perl=T option. Given that you wrap the regex for variable sets to omit in "", you can't really pass it this option. Any ideas on how to use perl regex with stargazer?
An example of the regex I would like to use is
placed.ind2*(?:(?!:switchind).)*$
applied to these 4 strings:
placed.ind2PROF SERVICES
placed.ind2TRANSPORT
placed.ind2PROF SERVICES:switchind2TRUE
placed.ind2TRANSPORT:switchind2TRUE
I would like the first two to be selected, but the last to be.
Starting from version 4.0 (on CRAN now), you can run stargazer with the argument perl=TRUE to allow for Perl-compatible regular expressions in your other arguments.

Resources