extract substring in R


Suppose I have a string such as "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a character vector that contains only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way to do this in R?

Using base R, you can try
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"

We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
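As a minimal sketch of the "wrap it in unique" suggestion, the snippet below keeps one copy of each bracketed number per input string and pastes the copies together. The second input, "PEPTIDE", is a made-up string with no modifications, added only to show what happens when nothing matches.

```r
# Hypothetical two-element input; only the first string comes from the question.
x <- c("S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR", "PEPTIDE")

# One list element per input string, each holding all bracketed numbers found.
hits <- regmatches(x, gregexpr("\\[\\+[0-9]+\\]", x))

# De-duplicate within each string and collapse to a single string.
out <- vapply(hits, function(h) paste(unique(h), collapse = ""), character(1))
out
```

A string without any match yields an empty string, so the result always has the same length as x.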

Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, character(1)) # vapply() deals with 0-length inputs
character(0)

Related

Inserting characters around a part of a string?

I'm looking to wrap parts of a string in R, following certain rules, in a vectorised way.
Put simply, if I had a vector:
c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b")
the function would sweep through each element and wrap I() around those parts where there is an exponent, resulting in the following output:
c("I(x^2)", "I(x^2):z", "z", "x:z", "z:x:b", "z:I(x^2):b")
I've tried various approaches where I first split by : and then gsub, but this isn't particularly scalable.

Something like below?
> gsub("(x(\\^\\d+))", "I(\\1)", c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b"))
[1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
[6] "z:I(x^2):b"

These seem reasonably general. They don't assume that the variable in the complex fields must be named x but handle any names made of word characters. They also don't assume that the arithmetic expression must be an exponential but handle any arithmetic expression that includes non-word characters. For example, they would surround y+pi with I(...).
1) This one-liner captures each field and processes it using the indicated function, expressed in formula notation. It surrounds each field that contains a non-word character with I(...). It works with any variables whose names are made from word characters.
library(gsubfn)
gsubfn("[^:]+", ~ if (grepl("\\W", x)) sprintf("I(%s)", x) else x, s)
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
2) This surrounds any field containing a character that is not :, letter or number with I(...)
gsub("([^:]*[^:[:alnum:]][^:]*)", "I(\\1)", s)
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
3) In this alternative we split the strings at colon, then surround fields containing a non-word character with I(...) and paste them back together.
surround <- function(x) ifelse(grepl("\\W", x), sprintf("I(%s)", x), x)
s |>
  strsplit(":") |>
  sapply(function(x) paste(surround(x), collapse = ":"))
## [1] "I(x^2)" "I(x^2):z" "z" "x:z" "z:x:b"
## [6] "z:I(x^2):b"
Note
The input used is the following:
s <- c("x^2", "x^2:z", "z", "x:z", "z:x:b", "z:x^2:b")
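Approaches 2 and 3 are not quite equivalent, because an underscore is a word character but not an alphanumeric one. The sketch below runs both on a few made-up fields (s2 is hypothetical, not from the question) and uses vapply() in place of sapply() so the result is always a character vector.

```r
# Hypothetical inputs chosen to show where the two rules differ.
s2 <- c("y+pi:z", "x_1:z", "a:b")

# Approach 2: wrap fields containing a character that is neither ":"
# nor alphanumeric (so "_" triggers wrapping).
by_alnum <- gsub("([^:]*[^:[:alnum:]][^:]*)", "I(\\1)", s2)

# Approach 3: split at ":", wrap fields containing a non-word character
# ("_" counts as a word character, so it does not trigger wrapping).
wrap_fields <- function(fields) {
  needs_wrap <- grepl("\\W", fields)
  fields[needs_wrap] <- sprintf("I(%s)", fields[needs_wrap])
  paste(fields, collapse = ":")
}
by_word <- vapply(strsplit(s2, ":"), wrap_fields, character(1))

by_alnum  # "I(y+pi):z" "I(x_1):z" "a:b"
by_word   # "I(y+pi):z" "x_1:z"    "a:b"
```

Which rule is appropriate depends on whether names such as x_1 should be treated as plain variables.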

How to count "(" in the string? [duplicate]

I am trying to get the number of open brackets in a character string in R. I am using the str_count function from the stringr package
s<- "(hi),(bye),(hi)"
str_count(s,"(")
Error in stri_count_regex(string, pattern, opts_regex = attr(pattern, :
Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
I am hoping to get 3 for this example

( is a special character. You need to escape it:
str_count(s,"\\(")
# [1] 3
Alternatively, given that you're using stringr, you can use the coll function:
str_count(s,coll("("))
# [1] 3

You could also use gregexpr along with sum in base R:
sum(gregexpr("(", s, fixed=TRUE)[[1]] > 0)
[1] 3
gregexpr takes a character vector and returns a list with the starting positions of each match. I added fixed=TRUE in order to match the pattern literally. length alone will not work, because gregexpr returns -1 (a vector of length one) when no match is found.
If you have a character vector of length greater than one, you would need to feed the result to sapply:
# new example
s<- c("(hi),(bye),(hi)", "this (that) other", "what")
sapply((gregexpr("(", s, fixed=TRUE)), function(i) sum(i > 0))
[1] 3 1 0
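Two other base R variants avoid the special case for -1; this is a sketch using the same three-element example. The first counts the extracted matches with lengths(), the second compares string lengths before and after deleting the character.

```r
s <- c("(hi),(bye),(hi)", "this (that) other", "what")

# regmatches() returns character(0) for elements without a match,
# so lengths() gives 0 there without any extra handling.
counts_a <- lengths(regmatches(s, gregexpr("(", s, fixed = TRUE)))

# Number of characters removed when every "(" is deleted.
counts_b <- nchar(s) - nchar(gsub("(", "", s, fixed = TRUE))

counts_a  # 3 1 0
counts_b  # 3 1 0
```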

If you want to do it in base R you can split into a vector of individual characters and count the "(" directly (without representing it as a regular expression):
> s<- "(hi),(bye),(hi)"
> chars <- unlist(strsplit(s,""))
> length(chars[chars == "("])
[1] 3

Split and re-concatenate a string

I am trying to get the host of an IP address from a list of strings.
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
I want to get the first two dot-separated fields from each of the ips. Desired output:
ips <- c('140.112', '132.212', '31.2', '7.112')
This is the code that I wrote to convert them:
cat(unlist(strsplit(ips, "\\.", fixed = FALSE))[1:2], sep = ".")
When I check the type of individual ips in the end I get something like this:
140.112 NULL
Not sure what I am doing wrong. If you have some other ideas completely different from this that is completely fine too.

With sub:
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
sub('\\.\\d+\\.\\d+$', '', ips)
# [1] "140.112" "132.212" "31.2" "7.112"
With str_extract from stringr:
library(stringr)
str_extract(ips, '^\\d+\\.\\d+')
# [1] "140.112" "132.212" "31.2" "7.112"
With strsplit + sapply:
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.'))
# [1] "140.112" "132.212" "31.2" "7.112"
With read.table + apply:
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.')
#[1] "140.112" "132.212" "31.2" "7.112"
Notes:
sub('\\.\\d+\\.\\d+$', '', ips):
i. \\.\\d+\\.\\d+$ matches a literal dot, a digit one or more times, a literal dot again, and a digit one or more times at the end of the string
ii. sub removes the above match from the string
str_extract(ips, '^\\d+\\.\\d+'):
i. ^\\d+\\.\\d+ matches a digit one or more times, a literal dot and a digit one or more times in the beginning of the string
ii. str_extract extracts the above match from the string
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.')):
i. strsplit(ips, '\\.') splits each ip using a literal dot as the delimiter. This returns a list of vectors after the split
ii. With sapply, paste(x[1:2], collapse = '.') is applied to every element of the list, thus taking only the first two numbers from each vector, and collapsing them with a dot as the separator. sapply then coerces the list to a vector, thus returning a vector of the desired ips.
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.'):
i. read.table(textConnection(ips), sep='.')[1:2] treats ips as text input and reads it in with dot as a delimiter. Only taking the first two columns.
ii. apply enables paste to be operated on each row, and collapses with a dot.
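The notes above describe the working alternatives; the sketch below looks at why the attempt in the question printed 140.112 and then gave NULL. It only re-runs the question's own calls step by step; the names parts and res are introduced here for illustration.

```r
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')

# unlist() flattens the per-address pieces into one long vector,
# so the grouping by address is lost.
parts <- unlist(strsplit(ips, "\\."))
length(parts)  # 16 pieces in total
parts[1:2]     # only the first two pieces of the first address

# cat() prints its arguments as a side effect and returns NULL,
# so nothing useful can be assigned from it.
res <- cat(parts[1:2], sep = ".")
is.null(res)   # TRUE
```

Keeping the list returned by strsplit() and building each result with paste(), as in the sapply() answer, avoids both problems.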

Could you please try the following?
gsub("^([0-9]+\\.[0-9]+)(.*)","\\1",ips)
Explanation: the regex captures digits, a literal dot (escaped as \\., since a bare . matches any character), and more digits in the first group, and everything after them (.*) in the second group. Replacing the whole match with \\1 keeps only the first group, i.e. the first two fields.

One solution is the following:
vapply(strsplit(ips, ".", fixed = TRUE),
       function(x) paste(x[1:2], collapse = "."),
       character(1L))
vapply applies function(x) to each element of the output of strsplit
strsplit produces a list where each element of the list is the components of the IP addresses separated by "."; setting fixed = TRUE requests to split using the exact value of the splitting string (i.e., "."), not using regex
function(x) takes the first two elements (x[1:2]) of each item coming out of strsplit and pastes them together, separated by "."
character(1L) tells vapply that each element of the output (i.e., each value returned from function(x)) should be a character vector of length 1.
Edit: @useR posted this solution right before me (using sapply).

substr is vectorised on the stop argument, so you can use it with a vector of the positions just before the second dot. regexpr gives the position of the first match, so if you sub out the first dot you can match on the second one, which will conveniently be one before its true position, as needed (since you removed the first one).
substr(ips,1,regexpr("\\.",sub("\\.","",ips)))
[1] "140.112" "132.212" "31.2" "7.112"

We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions:
format(numeric_version(ips)[, 1:2])
[1] "140.112" "132.212" "31.2" "7.112"

check if a character within a string is uppercase

I have a string
x <- "lowerUpper"
and want to determine if and which character within this string is an uppercase letter.
I can use toupper(x) == x, which tells me if all characters are uppercase, but how do I check if only some (and which) are?

One option is gregexpr to find the position where the character is uppercase
unlist(gregexpr("[A-Z]", x))
#[1] 6

You can also use the POSIX character class [[:upper:]], which matches upper-case letters in the current locale:
unlist(gregexpr("[[:upper:]]", "lowerUpper"))
#[1] 6

> x <- "lowerUpper"
> sapply(strsplit(x, ''), function(a) which(a %in% LETTERS)[1])
[1] 6
or
> library(stringi)
> stri_locate_first_regex(x, "[A-Z]")

Another option is to check each letter:
which(toupper(strsplit(x,split = "")[[1]])==strsplit(x,split = "")[[1]])
#[1] 6

Perhaps a cleaner code version using %in%
unlist(strsplit("lowerUpper",'')) %in% LETTERS
An advantage here is the return of the logical vector indicating each letter position in the string; wrap it in which() to get the positions. This solution works for multiple uppercase letters too (as does gregexpr, which returns every match; regexpr returns only the first). Lastly, using LETTERS makes for more readable code to my mind.
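To see how these options behave with more than one uppercase letter, here is a small sketch on a made-up string ("lowerUpperCase" is not from the question). It compares the %in% LETTERS approach with a gregexpr() call using the POSIX class [[:upper:]].

```r
# Hypothetical input with two uppercase letters.
x <- "lowerUpperCase"

# Split into single characters and test each one against LETTERS.
chars <- strsplit(x, "")[[1]]
pos_in <- which(chars %in% LETTERS)

# gregexpr() reports the start of every match; as.integer() drops
# the match.length and other attributes.
pos_re <- as.integer(gregexpr("[[:upper:]]", x)[[1]])

pos_in  # 6 11
pos_re  # 6 11
```

Note that LETTERS covers only the 26 ASCII capitals, whereas [[:upper:]] depends on the locale.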

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!

You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"

You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character and makes it optional (a plain . would fail on the element without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of () I could reference them too, e.g. with '\\2'. Inside the group I want only digits, and exactly 4 of them. The final piece, .*, matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.

This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
       strsplit(x, ""),
       lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
# this gets you the positions of every digit in each element of x
matches <- gregexpr("\\d", x)
# this keeps only the positions of the first four digits
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)

Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
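For completeness, a base R sketch that extracts the match directly instead of substituting around it, assuming (as in the question) that the four digits are consecutive. The name first4 is introduced here for illustration.

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() locates the first run of four digits in each element;
# regmatches() pulls out the matched text.
first4 <- regmatches(x, regexpr("[0-9]{4}", x))
first4  # "2134" "0983" "8723"
```

One caveat: regmatches() with regexpr() silently drops elements that have no match, so the result can be shorter than x if some strings contain no four-digit run.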
