Substitute captured non-ascii letter with upper case

Is it possible to replicate, using only regex and only base R (only using the g*sub() functions), the following...
sub("(i)", "\\U\\1", "string", perl = TRUE)
# [1] "strIng"
For non-ascii letters?
# Hoped for output
sub("(í)", "?", "stríng", perl = TRUE)
# [1] "strÍng"
PS. R regex flavours are TRE and PCRE.
PS2. I'm using R 4.2.1 with Sys.getlocale() giving:
[1] "LC_COLLATE=Icelandic_Iceland.utf8;LC_CTYPE=Icelandic_Iceland.utf8;LC_MONETARY=Icelandic_Iceland.utf8;LC_NUMERIC=C;LC_TIME=Icelandic_Iceland.utf8"

You can use
x="stríng"
gr <- gregexpr("í", x)
mat <- regmatches(x, gr)
regmatches(x, gr) <- lapply(mat, toupper)
# > x
# > [1] "strÍng"
See the R demo online.
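The same regmatches<- idea can be wrapped in a small function so that it applies to a whole character vector. This is only a sketch: the name upper_matches is made up for illustration, and whether toupper() converts a given non-ASCII letter still depends on the session locale.

```r
# Hypothetical helper (not part of base R): upper-case every match of
# `pattern` in each element of the character vector `x`.
upper_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x)
  # Elements without a match yield character(0) and are left unchanged.
  regmatches(x, m) <- lapply(regmatches(x, m), toupper)
  x
}

res <- upper_matches("i", c("string", "text"))
res
```

Calling it with pattern "í" on "stríng" should behave like the snippet above in a UTF-8 locale.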

For a slightly more involved/explicit solution that only uses base R:
sub_nascii <- function(pattern, string) {
  matches <- gregexpr(pattern, string)[[1]]
  for (i in matches) {
    substr(string, i, i) <- toupper(substr(string, i, i))
  }
  string
}
sub_nascii(pattern = "í", "stríng")
This works in my locale, where sub on its own doesn't.

Related

gsub / sub to extract between certain characters

How can I extract the numbers / ID from the following string in R?
link <- "D:/temp/sample_data/0000098618-13-000011.htm"
I want to just extract 0000098618-13-000011
That is discard the .htm and the D:/temp/sample_data/.
I have tried grep and gsub without much luck.

1) basename Use basename followed by sub:
sub("\\..*", "", basename(link))
## [1] "0000098618-13-000011"
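2) file_path_sans_ext Note that file_path_sans_ext only strips the extension, so basename is still needed to drop the directory:
library(tools)
file_path_sans_ext(basename(link))
## [1] "0000098618-13-000011"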
3) sub
sub(".*/(.*)\\..*", "\\1", link)
## [1] "0000098618-13-000011"
4) gsub
gsub(".*/|\\.[^.]*$", "", link)
## [1] "0000098618-13-000011"
5) strsplit
sapply(strsplit(link, "[/.]"), function(x) tail(x, 2)[1])
## [1] "0000098618-13-000011"
6) read.table. If link is a vector, this will only work if all elements have the same number of /-separated components. It also assumes that the only dot is the one separating the extension.
DF <- read.table(text = link, sep = "/", comment = ".", as.is = TRUE)
DF[[ncol(DF)]]
## [1] "0000098618-13-000011"
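The alternatives above agree on the example path but differ once a file name contains more than one dot. The sketch below illustrates this with a second, made-up path (report.v2.htm is an assumption, not from the question): stripping from the first dot and stripping only the last extension give different results.

```r
library(tools)

# The second path is hypothetical and only there to show the difference.
paths <- c("D:/temp/sample_data/0000098618-13-000011.htm",
           "D:/temp/report.v2.htm")

# Strips everything from the FIRST dot of the file name onwards.
first_dot <- sub("\\..*", "", basename(paths))

# Strip only the LAST extension, in two equivalent ways.
last_dot  <- file_path_sans_ext(basename(paths))
last_dot2 <- gsub(".*/|\\.[^.]*$", "", paths)

first_dot
last_dot
```

Which behaviour is right depends on whether dots inside the ID are possible in your data.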

Using stringr:
library(stringr)
str_extract(link , "[0-9-]+")
# "0000098618-13-000011"

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.

You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, or N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple of ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
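One difference worth knowing: regmatches() silently drops elements that do not match, whereas str_extract() returns NA for them. A minimal base R sketch that keeps the NA placeholders, assuming the count n is passed into the pattern with sprintf() (the variable names are illustrative):

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
n <- 4

# Build the pattern for the requested number of A/G/N occurrences.
pattern <- sprintf("([^AGN]*[AGN]){%d}", n)
m <- regexpr(pattern, strings)

# Start with NA everywhere, then fill in only the elements that matched.
out <- rep(NA_character_, length(strings))
out[m != -1] <- regmatches(strings, m)
out
```

With n <- 4 the second string has only three such letters, so its slot stays NA instead of disappearing from the result.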

Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
  paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
  ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
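A caveat with the which.max() idiom: when a string never reaches the target count, the comparison is all FALSE and which.max() returns 1, so the first character would be returned. The sketch below is one way to guard against that; cut_at_n is a hypothetical name, and returning NA for such strings is an assumption about the desired behaviour.

```r
# Hypothetical helper: cut each string after the n-th occurrence of any of
# `chars`; return NA when a string contains fewer than n occurrences.
cut_at_n <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    hit <- which(cumsum(ch %in% chars) == n)
    if (length(hit) == 0) {
      NA_character_
    } else {
      paste(ch[seq_len(hit[1])], collapse = "")
    }
  }, character(1))
}

res <- cut_at_n(c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG", "AAB"))
res
```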

Identify the positions of the pattern using gregexpr, then take the n-th position (here the 3rd) and extract everything from 1 to this n-th position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.

This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
       function(x) {
         raw <- rawToChar(charToRaw(x), multiple = TRUE)
         idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
         paste(raw[1:idx], collapse = "")
       })
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
       function(x) {
         raw <- charToRaw(x)
         idx <- which.max(cumsum(raw %in% test) == 3)
         rawToChar(raw[1:idx])
       })

Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
  # Replacing chars in str with "!"
  chars = paste0(chars, collapse = "")
  replacement = paste0(rep("!", nchar(chars)), collapse = "")
  str_alias = chartr(chars, replacement, str)
  # Obtain indices with ! for each string
  idx = stringr::str_locate_all(pattern = '!', str_alias)
  # Reduce each string in str
  reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
  result = vapply(seq_along(str), reduce, "character")
  return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

How to do str_extract with base R?

I am balancing several versions of R and want to change my R libraries loaded depending on which R and which operating system I'm using. As such, I want to stick with base R functions.
I was reading this page to see what the base R equivalent to stringr::str_extract was:
http://stat545.com/block022_regular-expression.html
It suggested I could replicate this functionality with grep. However, I haven't been able to get grep to do more than return the whole string if there is a match. Is this possible with grep alone, or do I need to combine it with another function? In my case I'm trying to distinguish between CentOS versions 6 and 7.
grep(pattern = "release ([0-9]+)", x = readLines("/etc/system-release"), value = TRUE)

1) strcapture If you want to extract a string of digits and dots from "release 1.2.3" using base then
x <- "release 1.2.3"
strcapture("([0-9.]+)", x, data.frame(version = character(0)))
## version
## 1 1.2.3
2) regexec/regmatches There is also regmatches and regexec but that has already been covered in another answer.
3) sub Also it is often possible to use sub:
sub(".* ([0-9.]+).*", "\\1", x)
## [1] "1.2.3"
3a) If you know the match is at the beginning or end then delete everything after or before it:
sub(".* ", "", x)
## [1] "1.2.3"
4) gsub Sometimes we know that the field to be extracted has certain characters and they do not appear elsewhere. In that case simply delete every occurrence of every character that cannot be in the string:
gsub("[^0-9.]", "", x)
## [1] "1.2.3"
5) read.table One can often decompose the input into fields and then pick off the desired one by number or via grep. strsplit, read.table or scan can be used:
read.table(text = x, as.is = TRUE)[[2]]
## [1] "1.2.3"
5a) grep/scan
grep("^[0-9.]+$", scan(textConnection(x), what = "", quiet = TRUE), value = TRUE)
## [1] "1.2.3"
5b) grep/strsplit
grep("^[0-9.]+$", strsplit(x, " ")[[1]], value = TRUE)
## [1] "1.2.3"
6) substring If we know the character position of the field we can use substring like this:
substring(x, 9)
## [1] "1.2.3"
6a) substring/regexpr or we may be able to use regexpr to locate the character position for us:
substring(x, regexpr("\\d", x))
## [1] "1.2.3"
7) read.dcf Sometimes it is possible to convert the input to dcf form in which case it can be read with read.dcf. Such data is of the form name: value
read.dcf(textConnection(sub(" ", ": ", x)))
## release
## [1,] "1.2.3"

You could do
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
stringr::str_extract(txt, pattern)
# [1] "release 123" NA "release 123"
sapply(regmatches(txt, regexec(pattern, txt)), "[", 1)
# [1] "release 123" NA "release 123"
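Because regexec() also records the capture groups, the same call can return just the version number rather than the whole match, which is what distinguishing release 6 from release 7 needs. A small sketch along those lines (the variable names m and version are illustrative):

```r
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"

m <- regmatches(txt, regexec(pattern, txt))

# Element 1 of each match is the full match, element 2 the first capture
# group; indexing a non-match (character(0)) with 2 yields NA.
version <- as.integer(sapply(m, "[", 2))
version
```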

txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
Extract first match
sapply(
  X = txt,
  FUN = function(x){
    tmp = regexpr(pattern, x)
    m = attr(tmp, "match.length")
    st = unlist(tmp)
    if (st == -1){NA}else{substr(x, start = st, stop = st + m - 1)}
  },
  USE.NAMES = FALSE)
#[1] "release 123" NA "release 123"
Extract all matches
sapply(
  X = txt,
  FUN = function(x){
    tmp = gregexpr(pattern, x)
    m = attr(tmp[[1]], "match.length")
    st = unlist(tmp)
    if (st[1] == -1){
      NA
    }else{
      sapply(seq_along(st), function(i) substr(x, st[i], st[i] + m[i] - 1))
    }
  },
  USE.NAMES = FALSE)
#[[1]]
#[1] "release 123"
#[[2]]
#[1] NA
#[[3]]
#[1] "release 123" "release 123"
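For comparison, the manual start/length bookkeeping above can be delegated to regmatches() combined with gregexpr(). This sketch differs in one detail that should be treated as a design choice: elements without a match come back as character(0) rather than NA.

```r
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"

# One list element per input string, holding all of its matches.
all_matches <- regmatches(txt, gregexpr(pattern, txt))

# Number of matches found in each string.
n_matches <- lengths(all_matches)
n_matches
```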

How to extract parts from a string

I have a string called PATTERN:
PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
and I would like to parse the string using a pattern matching function, like grep, sub, ... to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome" and an integer variable IMP equal to number.
If MODEL, OUTCOME and IMP were all integers, I could get the values using function sub:
PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"
MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))
Do you have any idea of how to match the string contained in variable PATTERN?
Possible tricky patterns are:
PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

A solution which is also able to deal with the 'tricky' patterns:
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)
which gives:
> lst
[1] "linear-model" "stroke_i" "001"
And if you want that in a dataframe:
df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')
which gives:
> df
MODEL OUTCOME IMP
1 linear-model stroke_i 001

A minimal-regex approach,
sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
# [,1]
#[1,] "PS2"
#[2,] "stroke_i"
#[3,] "001"

You may use a pattern with capturing groups matching any chars, as few as possible between known delimiting substrings:
MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)
See the regex demo. Note that the last .* is greedy since you get all the rest of the string into this capture.
You can make this pattern more precise so that it only matches the expected characters (say, to match only digits in the last capturing group, use ([0-9]+) rather than (.*)).
Use it with, say, str_match from stringr:
> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"
>
A base R solution using the same regex combines regmatches and regexec:
> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
>
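To handle several strings at once and get IMP as an integer, the regmatches / regexec result can be bound into a matrix. This is a sketch under a few assumptions: every input matches the pattern, IMP is always numeric, and your version of R has a regexec() that accepts perl = TRUE. The second input's IMP value is made up for illustration.

```r
x <- c("MODEL_PS2-OUTCOME_stroke_i-IMP_001",
       "MODEL_linear-model-OUTCOME_stroke_i-IMP_002")
rx <- "MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)"

# Each match is c(full match, group 1, group 2, group 3); rbind stacks them
# into one row per input string.
parts <- do.call(rbind, regmatches(x, regexec(rx, x, perl = TRUE)))

parsed <- data.frame(MODEL   = parts[, 2],
                     OUTCOME = parts[, 3],
                     IMP     = as.integer(parts[, 4]),
                     stringsAsFactors = FALSE)
parsed
```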

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can do a loop where I constantly count and build a new string character by character but surely there must be something better no?

Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
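One edge case: when the string length is an exact multiple of 5, the pattern above also adds a space after the final group. A possible variant, assuming a trailing space is unwanted, uses a PCRE lookahead so that a space is only inserted when at least one more character follows:

```r
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"

# (?=.) requires one more character after each group of five,
# so the last group never gets a trailing space.
spaced <- gsub("(.{5})(?=.)", "\\1 ", s, perl = TRUE)
spaced

# Ten characters: the plain pattern leaves a trailing space, the lookahead does not.
plain_10     <- gsub("(.{5})", "\\1 ", "ABCDEFGHIJ")
lookahead_10 <- gsub("(.{5})(?=.)", "\\1 ", "ABCDEFGHIJ", perl = TRUE)
```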

Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"

You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.

No *apply stringi solution:
library(stringi)
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need to use *apply here.

You can also use substring without a loop; substring is the vectorized version of substr:
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
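The substring approach generalises easily to other chunk sizes and separators. The wrapper below is only a sketch; chunk_string and its arguments are hypothetical names, and it assumes a single non-empty string as input.

```r
# Hypothetical helper: split one string into chunks of `n` characters
# and join the chunks with `sep`.
chunk_string <- function(x, n = 5, sep = " ") {
  stopifnot(length(x) == 1, nchar(x) > 0)
  starts <- seq(1, nchar(x), by = n)
  # substring() truncates the last chunk at the end of the string.
  paste(substring(x, starts, starts + n - 1), collapse = sep)
}

res5 <- chunk_string("XOVEWVJIEWNIGOIWENVOIWEWVWEW")
res3 <- chunk_string("ABCDEFGH", n = 3, sep = "-")
res5
```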
