Inserting characters around a part of a string? - r

I'm looking to wrap parts of a string in R, following certain rules, in a vectorised way.
Put simply, if I had a vector:
c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b")
the function would sweep through each element and wrap I() around those parts where there is an exponent, resulting in the following output:
c("I(x^2)", "I(x^2):z", "z", "x:z", "z:x:b", "z:I(x^2):b")
I've tried various approaches where I first split by : and then gsub, but this isn't particularly scalable.

Something like below?
> gsub("(x(\\^\\d+))", "I(\\1)", c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b"))
[1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
[6] "z:I(x^2):b"

These seem reasonably general. They don't assume that the variable in the complex fields must be named x but handle any names made of word characters, and they don't assume that the arithmetic expression must be an exponentiation but handle any arithmetic expression that includes non-word characters. For example, they would surround y+pi with I(...).
1) This one-liner captures each field and processes it using the indicated function, expressed in formula notation. It surrounds each field that contains a non-word character with I(...). It works with any variables whose names are made from word characters.
library(gsubfn)
gsubfn("[^:]+", ~ if (grepl("\\W", x)) sprintf("I(%s)", x) else x, s)
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
2) This surrounds any field containing a character that is not :, letter or number with I(...)
gsub("([^:]*[^:[:alnum:]][^:]*)", "I(\\1)", s)
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
3) In this alternative we split the strings at colon, then surround fields containing a non-word character with I(...) and paste them back together.
surround <- function(x) ifelse(grepl("\\W", x), sprintf("I(%s)", x), x)
s |>
  strsplit(":") |>
  sapply(function(x) paste(surround(x), collapse = ":"))
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
Note
The input used is the following:
s <- c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b")
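Approaches 2 and 3 use slightly different tests for "complex" fields, which matters for names containing an underscore or a dot. The sketch below, using a made-up input s2 and illustrative result names by_alnum and by_word, shows where they agree and where they differ: an underscore counts as a word character but is not alphanumeric, while a dot is neither.

```r
# Hypothetical input, not from the question
s2 <- c("y+pi:z", "my_var:x^2", "my.var")

# Approach 2: wrap any field containing a character that is not ':' or alphanumeric
by_alnum <- gsub("([^:]*[^:[:alnum:]][^:]*)", "I(\\1)", s2)
by_alnum
## [1] "I(y+pi):z"        "I(my_var):I(x^2)" "I(my.var)"

# Approach 3: wrap any field containing a non-word character (\W)
surround <- function(x) ifelse(grepl("\\W", x), sprintf("I(%s)", x), x)
by_word <- sapply(strsplit(s2, ":"), function(x) paste(surround(x), collapse = ":"))
by_word
## [1] "I(y+pi):z"     "my_var:I(x^2)" "I(my.var)"
```

Which behaviour is preferable depends on the variable names in the data, so it may be worth checking a few representative names before picking one.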

Related

Why does R appear to be a lazy match [duplicate]

I want to use stri_replace_all_regex to replace strings, as follows.
It's known that R defaults to greedy matching, so why does it appear to match lazily here?
library(stringi)
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)
The result is [1] "ab" "xyc" "mn", which is not what I want. I expected "abc" "xyc" "mnb".
You are calling stri_replace_all_regex with four arguments:
a is length 3. That's the str argument.
"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that's beside the point.) That's the pattern argument.
b is length 5. That's the replacement argument.
The last argument is vectorize_all=FALSE.
What it tries to do is documented as follows:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and - independently - over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]). Note that you
must set length(pattern) >= length(replacement).
That's pretty sloppy documentation (I want to know what it does, not "something like" what it does!), but I think the process is as follows:
Your first pattern is "\\bab\\S+". That says "word boundary followed by ab followed by one or more non-whitespace chars". That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.
The handling of a[3] is more complicated. The first match replaces it with "mnb", based on pattern[4]. Then a second replacement happens, because "mnb" matches pattern[5], and it gets changed again to "mn".
When you say R defaults to greedy matching, that's when doing a single regular expression match. You're doing five separate greedy matches, not one big greedy match.
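The sequential behaviour described above can be reproduced in base R. This is only a sketch of the loop that the documentation says the call is "equivalent to something like"; it uses gsub with perl = TRUE as a stand-in for stri_replace_all_regex, and the names patterns and out are made up for illustration.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")
patterns <- paste0("\\b", b, "\\S+")

# Apply each pattern in turn to the result of the previous replacement
out <- a
for (i in seq_along(patterns)) {
  out <- gsub(patterns[i], b[i], out, perl = TRUE)
}
out
#> [1] "ab"  "xyc" "mn"
```

The third element shows the chaining: "mnb345" becomes "mnb" at the fourth pattern, and "mnb" then becomes "mn" at the fifth.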
EDITED to add:
I don't know the stringi functions well, but in the base regex functions you can do this with just one regex:
a <- c('abc2','xycd2','mnb345')
b <- c('ab','abc','xyc','mnb','mn')
# Build a big pattern:
# "|" means "or", "(" ... ")" capture the match
pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
pattern
#> [1] "\\b(ab)\\S+|\\b(abc)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
# \\1 etc contain whatever matched the parenthesized
# patterns. Only one will match, the rest will be empty
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "ab" "xyc" "mnb"
# I would have guessed the greedy rule would have found "abc"
# Try again:
pattern <- paste0("\\b(", b[c(2, 1, 3:5)], ")\\S+", collapse = "|")
pattern
#> [1] "\\b(abc)\\S+|\\b(ab)\\S+|\\b(xyc)\\S+|\\b(mnb)\\S+|\\b(mn)\\S+"
gsub(pattern, "\\1\\2\\3\\4\\5", a)
#> [1] "abc" "xyc" "mnb"
Created on 2023-02-13 with reprex v2.0.2
It appears the "|" takes the first match, not the greedy match. I don't think the R docs specify it one way or the other.
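If the alternatives really are tried in order, one way to avoid reordering by hand is to sort the candidates so that longer ones come first. The following is a minimal sketch, assuming the entries of b contain no regex metacharacters. It passes perl = TRUE, because PCRE documents that alternatives are tried from left to right, and it puts all alternatives in a single capture group. The names b_sorted, pattern1 and res are illustrative.

```r
a <- c("abc2", "xycd2", "mnb345")
b <- c("ab", "abc", "xyc", "mnb", "mn")

# Put longer candidates first so that "abc" is tried before "ab"
b_sorted <- b[order(nchar(b), decreasing = TRUE)]

# One capture group holding all alternatives
pattern1 <- paste0("\\b(", paste(b_sorted, collapse = "|"), ")\\S+")

res <- gsub(pattern1, "\\1", a, perl = TRUE)
res
#> [1] "abc" "xyc" "mnb"
```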

How can I paste a comma (,) in a string of numbers in R Statistics?

I'm quite new to R. I have a vector of numbers, and I want to put a comma between the first and second digits of every element.
x gives this result:
```
[8] -8196110 -7681989 -8042092 -8196660 -7606310 -7217828 -7634887
[15] -7401244 -7211947 -7636932 -7606444 -7598894 -7398965
```
My question is how to automatically put a comma between the first and second digits of all those elements. The desired output would be:
```
[1] -8,385772 -7,390682 -8,019960 -8,300000 -8,069984 -8,786782 -7,414995
[8] -8,196110 -7,681989 -8,042092 -8,196660 -7,606310 -7,217828 -7,634887
[15] -7,401244 -7,211947 -7,636932 -7,606444 -7,598894 -7,398965
```
We can use sub to capture the first digit (together with an optional leading minus sign) from the start (^) of the string and replace it with the backreference (\\1) followed by a comma.
sub("^(-?\\d)", "\\1,", x)
-output
[1] "-8,196110" "-7,681989" "-8,042092" "-8,196660" "-7,606310" "-7,217828" "-7,634887" "-7,401244" "-7,211947" "-7,636932" "-7,606444" "-7,598894" "-7,398965"
data
x <- c(-8196110, -7681989, -8042092, -8196660, -7606310, -7217828,
-7634887, -7401244, -7211947, -7636932, -7606444, -7598894, -7398965
)
We can use strsplit to split each number into a vector of its individual characters. Then pass that list into an sapply call that inserts a comma after the first digit. Because the numbers are negative, the first character is the minus sign, so it has to be skipped:
x_split = strsplit(as.character(x), split = '')
# insert the comma after the first digit, skipping a leading minus sign
sapply(x_split, function(k){
  paste0(append(k, ',', after = if (k[1] == '-') 2 else 1), collapse = '')})

extract substring in R

Suppose I have a string such as "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings that contains only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way in R to do this?
Using base R, you can try
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
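The question's example output, [+229][+57], lists each bracketed number once. A small base R sketch of the "wrap it in unique" suggestion, with tags and joined as illustrative names:

```r
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"

# Extract every bracketed number, then drop repeats (order of first appearance is kept)
tags <- unique(regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]])
tags
#[1] "[+229]" "[+57]"

# Collapse into a single string if that is the form needed
joined <- paste(tags, collapse = "")
joined
#[1] "[+229][+57]"
```

The [[1]] assumes x has length one; for a longer vector the same steps would need to be applied to each list element.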
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
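For a self-contained run, here is a sketch with a small data frame built from the question's three examples. The data frame words and its column names first, second and extra are assumptions for illustration. It uses mapply over the two columns instead of apply over rows, and pastes the letters together so the result fits in a single column.

```r
# Hypothetical data frame mirroring the examples in the question
words <- data.frame(first  = c("carpet", "bag", "dog"),
                    second = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

chars_for <- function(str) strsplit(str, "")[[1]]

# Letters found in the second word but not in the first, one string per row
words$extra <- mapply(
  function(w1, w2) paste(setdiff(chars_for(w2), chars_for(w1)), collapse = ""),
  words$first, words$second, USE.NAMES = FALSE
)
words$extra
## [1] "l"  "fl" "i"
```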
Use regex :) Wrap each word from the first column in brackets [] to build a character class, then use a regex replace function. The regex matches any letter listed in the brackets and replaces it with an empty string (in other words, it removes those letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
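One difference between filtering with %in% and using setdiff is worth knowing: the filter keeps repeated letters, while setdiff returns each letter once. A short sketch with a made-up word pair (first, second, kept and uniq are illustrative names):

```r
first  <- strsplit("bag", "")[[1]]
second <- strsplit("fluff", "")[[1]]

# Filtering keeps every occurrence
kept <- second[!second %in% first]
kept
#[1] "f" "l" "u" "f" "f"

# setdiff() returns each letter only once
uniq <- setdiff(second, first)
uniq
#[1] "f" "l" "u"
```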

Extract first X Numbers from Text Field using Regex

I have strings that look like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with sub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character and makes it optional (similar to just . but then it wouldn't match the case without the leading P). The () signify a group that I reference in the replacement '\\1'. If there were multiple sets of () I could reference them too with '\\2'. Inside the group, and you had the syntax correct, I want only digits and I want exactly 4 of them. The final piece says zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the line above into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this keeps the positions of the first four digits in each x
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
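For completeness, the same extraction can be done in base R without a replacement step, by locating the first run of four digits with regexpr and pulling it out with regmatches. This is a sketch rather than any of the answers above; x2, m and first4 are illustrative names, and the extra "none" element is added only to show what happens when nothing matches.

```r
x2 <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds", "none")

# Position of the first run of exactly four consecutive digits (-1 if absent)
m <- regexpr("[0-9]{4}", x2)

# regmatches() silently drops non-matching elements, so fill a result
# vector of the right length to keep it aligned with the input
first4 <- rep(NA_character_, length(x2))
first4[m > 0] <- regmatches(x2, m)
first4
## [1] "2134" "0983" "8723" NA
```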

Resources