Detect a length&alphanumeric pattern in a vector of strings

I have a character vector that stores different sentences. Some sentences include an alphanumeric string of a fixed length (15 characters), but some don't. The strings of interest vary in their combination of alphanumeric characters, but they always consist of:
upper case letters
lower case letters
digits
no special characters
and always have a length of 15.
In some vector elements, there will be leading and trailing blank space before/after the string of interest. However, in other elements the string might show immediately at the beginning, or otherwise at the end. The most complex situation is when the string of interest shows without any spaces before/after, which means that it's embedded within another string.
I want to take such a vector and return a new vector of the same length, but:
in elements that contained a string of interest, return only the string of interest.
in elements that did not contain a string of interest, return NA.
Example
set.seed(2020)
library(stringi)
library(stringr)
vector_of_strings <- stri_rand_strings(n = 100, length = 15, pattern = "[A-Za-z0-9]")
my_sentences <-
c(
str_interp("my sentece contains ${sample(vector_of_strings, size = 1)}"),
str_interp("${sample(vector_of_strings, size = 1)} is in my sentence"),
str_interp("sometimes - ${sample(vector_of_strings, size = 1)} - it shows like this"),
str_interp("other times it could be${sample(vector_of_strings, size = 1)}without any space before or after"),
"occasionally there's no string of interest, so such element should become 'NA'"
)
my_sentences
## [1] "my sentece contains 8OarR1YUGPBoRfi"
## [2] "WoV8ym3WB2zg2TD is in my sentence"
## [3] "sometimes - pmMk73q0L73qKUa - it shows like this"
## [4] "other times it could be1qvzWei5FxPtRGXwithout any space before or after"
## [5] "occasionally there's no string of interest, so such element should become 'NA'"
How can I take my_sentences and have it return the following?
[1] "8OarR1YUGPBoRfi"
[2] "WoV8ym3WB2zg2TD"
[3] "pmMk73q0L73qKUa"
[4] "1qvzWei5FxPtRGX"
[5] NA
EDIT
Based on ekoam's comment, I wonder whether the following mechanism could be utilized.
(1) First step, test whether any of the strings in vector_of_strings exists in my_sentences's elements. If yes, return the string of interest.
(2) Else, if no match, test whether any combination of alphanumeric and 15-character length exists. If there's a definitive single match, return the matched string.
(3).a. Else, if there's more than one possible match, return all possible strings.
(3).b. Else, if there's no match whatsoever, return NA.
For the sake of this example, let's assume that step (1) above is based on matching against sample(vector_of_strings, size = 50). This leaves some room for no match (so that we can move forward to step (2)).
And just to make the desired output clearer: I'm trying to get it all in a single vector of the same length as the original my_sentences, with the output(s) of the "mechanism" described above at the respective element positions of the original vector.

This is not straightforward to address. All the conditions you mention are easy to operationalize but one: that the string of interest can be embedded within a larger string. What you could do is extract all words that have the right combination of character types but allow their length to go beyond 15:
library(stringr)
x <- str_extract(my_sentences, "\\b[A-Za-z0-9]{15,}\\b")
x
[1] "heLvIQabKdDTrBC" "KpxeqHQ0Z94X6vG" "UNMcDuUDzPsRU7s" "beZccQAS3rCFZ5UO7without"
[5] NA
This way you would at least be sure not to have overlooked the embedded target strings. If the number of such strings is not too large you could then in a second step isolate the larger-than-15-char strings and remove the unwanted bits:
x[which(nchar(x) > 15)]
[1] "beZccQAS3rCFZ5UO7without"
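One way to sketch the two-step idea from the question's EDIT in base R is shown below. The helper name extract_id and its known argument are hypothetical and not part of the answer above. The sketch assumes that a list of candidate strings is available for the embedded case; otherwise it only accepts a run of exactly 15 alphanumerics that is not flanked by further alphanumerics.

```r
# Hypothetical helper: returns one string (or NA) per input sentence.
extract_id <- function(sentences, known = character(0)) {
  out <- rep(NA_character_, length(sentences))

  # Step 1: literal lookup of known strings, which also finds embedded ones.
  for (k in known) {
    hit <- is.na(out) & grepl(k, sentences, fixed = TRUE)
    out[hit] <- k
  }

  # Step 2: a standalone run of exactly 15 alphanumeric characters.
  pat <- "(?<![A-Za-z0-9])[A-Za-z0-9]{15}(?![A-Za-z0-9])"
  m <- regexpr(pat, sentences, perl = TRUE)
  standalone <- rep(NA_character_, length(sentences))
  standalone[m != -1] <- regmatches(sentences, m)

  # Keep step-1 hits and fill the remaining gaps from step 2.
  out[is.na(out)] <- standalone[is.na(out)]
  out
}

extract_id(
  c("my sentece contains 8OarR1YUGPBoRfi",
    "other times it could be1qvzWei5FxPtRGXwithout any space before or after"),
  known = "1qvzWei5FxPtRGX"
)
```

Without a matching entry in known, the embedded case stays NA, because no regular expression alone can decide which 15 characters of a longer alphanumeric run are the intended ones.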

Related

Subsetting a string vector based on a partial match of unknown characters

I have a vector of 8-character file names of the format
"/relative/path/to/folder/a(bc|de|fg)...[xy]1.sav"
where the brackets hold one of two or three known character sequences, and the '...' are three unknown characters. I want to match all file names that share the same unknown sequence XXX and sort them into a list of character vectors.
I am not sure how to proceed on this. I am thinking of extracting the letters in the fourth to sixth positions (...) into a vector and then using grep to get all the files with the matching string.
E.g.
# Pseudo-code. Not functioning code, but sort of the thing I want to do
> char.extr <- str_extract(file.vector, !"a(bc|de|fg)...[xy]1.sav")
> char.extr
"JKL", "MNO" ,"PQR" ...
# Use grep and lapply to put matched strings into list
> path.list <- lapply(char.extr, grep, file.vector)
> path.list
1. "/relative/path/to/folder/abcJKLx1.sav"
"/relative/path/to/folder/adeJKLy1.sav"
2. "/relative/path/to/folder/afgMNOx1.sav"
"/relative/path/to/folder/abcMNOy1.sav"

Since we know the name structure, I'd imagine extracting the 3-letter substring and then using split to get individual lists is what you're looking for.
split(file.vector, substr(basename(file.vector), 4, 6))
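To make this concrete, here is a small runnable sketch using the four example paths from the question. It assumes the input vector is called file.vector, as in the question, and that the unknown sequence always sits in positions 4 to 6 of the file name.

```r
# Example input taken from the question's desired output.
file.vector <- c(
  "/relative/path/to/folder/abcJKLx1.sav",
  "/relative/path/to/folder/adeJKLy1.sav",
  "/relative/path/to/folder/afgMNOx1.sav",
  "/relative/path/to/folder/abcMNOy1.sav"
)

# basename() drops the directory part; characters 4-6 are the unknown sequence.
key <- substr(basename(file.vector), 4, 6)

# split() groups the full paths by that key and names each list element after it.
path.list <- split(file.vector, key)
names(path.list)
```

Each element of path.list then holds all paths sharing one three-character sequence.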

Extracting word with co-occurring alphabets in R

I wanted to extract certain words from a bigger word-list. One example of a desired extracted word-list is: extract all the words that contain /s/ followed by /r/. So this should give me words such as sər'ka:rəh, e:k'sa:r, səmʋitərəɳ, and so:'ha:rd. from the bigger word-list.
Consider the data (IPA transcription) to be the one given below:
sər'ka:rəh
sə'lᴔ:nija:
hã:ki:
pu:'dʒa:ẽ:
e:k'sa:r
mritko:
dʒʱã:sa:
pə'hũtʃ'ne:'ʋa:le:
kərəpʈ
tʃinhirit
tʃʰəʈʈʰi:
dʱũdʱ'la:pən
səmʋitərəɳ
so:'ha:rd
məl'ʈi:spe:'ʃijliʈi:
la:'pər'ʋa:i:
upləbɡʱ
Thanks much!

Here's an answer to the issue described in the first paragraph of your post. (To my mind, the examples in the second paragraph are inconsistent with the issue described in the first para, so I'll take the liberty of ignoring them here).
You say you want to "extract all the words that contain p followed by t". The word 'extract' implies that there are other characters in the same string than those you want to match and extract. The verb 'contain' implies that the words you want to extract need not necessarily have p in word-initial position. Based on these premises, here's some mock data and a solution to the task:
Data:
x <- c("pastry is to the pastor's appetite what pot is to the pupil's")
Solution:
library(stringr)
unlist(str_extract_all(x, "\\b\\w*(?<=p)\\w*t\\w*\\b"))
This uses word boundaries \\b to extract the target words from the surrounding context; it further uses positive lookbehind (?<=...) to assert the condition that for there to be a matching t there needs to be a p character occurring prior to the match.
The regex in more detail:
\\b: the opening word boundary
\\w*: zero or more alphanumeric chars (or an underscore)
(?<=p): positive lookbehind: "if and only if you see a p char on the left..."
\\w*: zero or more alphanumeric chars (or an underscore)
t: the literal character t
\\w*: zero or more alphanumeric chars (or an underscore)
\\b: the closing word boundary
Result:
[1] "pastry" "pastor" "appetite" "pot"
EDIT 1:
Now that the question has been updated, a more definitive answer is possible.
Data:
x <- c("sər'ka:rəh","sə'lᴔ:nija:","hã:ki:","pu:'dʒa:ẽ:","e:k'sa:r",
"mritko:","dʒʱã:sa:","pə'hũtʃ'ne:'ʋa:le:","kərəpʈ","tʃinhirit",
"tʃʰəʈʈʰi:","dʱũdʱ'la:pən","səmʋitərəɳ","so:'ha:rd",
"məl'ʈi:spe:'ʃijliʈi:", "la:'pər'ʋa:i:","upləbɡʱ")
If you want to match (rather than extract) words that "contain /s/ followed by /r/", you can use grep in various ways. Here are two ways:
grep("s.*r", x, value = T)
or:
grep("(?<=s).*r", x, value = T, perl = T) # with lookbehind
The result is the same in either case:
[1] "sər'ka:rəh" "e:k'sa:r" "səmʋitərəɳ" "so:'ha:rd"
EDIT 2:
If the aim is to match words that "contain /s/ or /p/ followed by /r/ or /t/", you can use the metacharacter | to allow for alternatives:
grep("s.*r|s.*t|p.*r|p.*t", x, value = T)
# or, more succinctly:
grep("(s|p).*(r|t)", x, value = T)
[1] "sər'ka:rəh" "e:k'sa:r" "pə'hũtʃ'ne:'ʋa:le:" "səmʋitərəɳ" "so:'ha:rd"
[6] "la:'pər'ʋa:i:"

You can use grep function. Assuming your list is called list:
grep("p[a-z]+t", list, value=TRUE)

Regular Expression in R - Spaces before and after the text

I have a stats file that has lines that are like this:
"system.l2.compressor.compression_size::1 0 # Number of blocks that compressed to fit in 1 bits"
0 is the value that I care about in this case. The spaces between the actual statistic and whatever is before and after it are not the same each time.
My code to get the stats looks something like this:
if (grepl("system.l2.compressor.compression_size::1", line))
{
matches <- regmatches(line, gregexpr("[[:digit:]]+\\.*[[:digit:]]", line))
compression_size_1 = as.numeric(unlist(matches))[1]
}
The reason I have this regular expression
[[:digit:]]+\\.*[[:digit:]]
is because in other cases the statistic is a decimal number. I don't anticipate in the cases that are like the example I posted for the numbers to be decimals, but it would be nice to have a "fail safe" regex that can capture even such a case.
In this case I get "2." "1" "0" "1" as answers. How can I restrict it so that I can get only the true stat as the answer?
I tried using something like this
"[:space:][[:digit:]]+\\.*[[:digit:]][:space:]"
or other variations, but either I get back NA, or the same numbers but with spaces surrounding them.

Here are a couple base R possibilities depending on how your data is set up. In the future, it is helpful to provide a reproducible example. Definitely provide one if these don't work. If the pattern works, it will probably be faster to adapt it to a stringr or stringi function. Good luck!!
# The digits after the space after the anything not a space following "::"
gsub(".*::\\S+\\s+(\\d+).*", "\\1", strings)
[1] "58740" "58731" "70576"
# Getting the digit(s) following a space and preceding a space and pound sign
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Combining the two (this is the most restrictive)
gsub(".*::\\S+\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Extracting the first digits surrounded by spaces (least restrictive)
gsub(".*?\\s+(\\d+)\\s+.*", "\\1", strings)
[1] "58740" "58731" "70576"
# Or, using stringr for the last pattern:
as.numeric(stringr::str_extract(strings, "\\s+\\d+\\s+"))
[1] 58740 58731 70576
EDIT: Explanation for the second one:
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
.* - .=any character except \n; *=any number of times
\\s+ - \\s =whitespace; +=at least one instance (of the whitespace)
(\\d+) - ()=capture group, which you can reference later by its position (i.e., "\\1" returns what the first capture group matched); \\d=digit; +=at least one instance (of a digit)
\\s+# - \\s =whitespace; +=at least one instance (of the whitespace); # a literal pound sign
.* - .=any character except \n; *=any number of times
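Since the question also asks for a pattern that survives decimal values, here is a hedged sketch in base R. The helper name stat_value is hypothetical, and it assumes every line has the layout name, whitespace, value, whitespace, pound sign; the decimal line in the example is made up for illustration.

```r
# Hypothetical helper: pulls the value between the stat name and the "#".
stat_value <- function(lines) {
  # Group 1 is an integer or decimal number, optionally negative.
  pat <- "^\\S+\\s+(-?[0-9]+(\\.[0-9]+)?)\\s+#"
  m <- regmatches(lines, regexec(pat, lines))
  # Element 1 is the full match, element 2 is group 1; no match gives NA.
  value <- vapply(
    m,
    function(g) if (length(g) >= 2) g[2] else NA_character_,
    character(1)
  )
  as.numeric(value)
}

stat_value(c(
  "system.l2.compressor.compression_size::1    0    # Number of blocks that compressed to fit in 1 bits",
  "some.hypothetical.stat::rate    0.25    # made-up decimal example"
))
```

Anchoring at the start of the line keeps the digits inside the stat name and the comment from being picked up.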
Data:
strings <- c("system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
"system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
"system.l2.overall_hits::.cpu.data 70576 # number of overall hits")

Regex returns digit string with leading "_"

I am using an R script in the PowerBI Query Editor to find a six-digit numeric string in a description column and add it as a new column to the table. It works EXCEPT where the number string is preceded by a "_" (underscore character).
# 'dataset' holds the input data for this script ##
library(stringr)
# assign regex to variable #
pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
# define function to use pattern ##
isNewSiteNum = function(x) substr(str_extract(x,pattern),1,6)
# output statement - within adds new column to dataset ##
output <- within(dataset,{NewSiteNum=isNewSiteNum(dataset$LineItemComment)})
The number string can be at the start, end or middle of the description text. When the number string is preceded by an underscore (_123456 for example) the regex returns _12345 instead of 123456. I am not sure how to tell it to skip the underscore but still grab the six digits (without breaking the cases with no leading underscore that currently work).
regex101.com shows the full match as '_123456' and group 1 as '123456', but my result column has '_12345'. For the case with a leading space the full match is ' 123456', yet my result column is correct. I seem to be missing something, since the full match has 7 characters and the desired group 1 has 6.

The problem was with the str_extract which I could not get to work. However, by using the str_match and selecting the group I get what I am looking for.
# 'dataset' holds input data
library(stringr)
pattern<-"(?:^|\\D)(\\d{6})(?!\\d)"
SiteNum = function(x) str_match(x, pattern)[,2]
output<-within(dataset,{R_SiteNum2=SiteNum(dataset$ReqComments)})
This does not pick up non-numeric initial characters.
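An alternative sketch, assuming base R with perl = TRUE: replacing the non-capturing prefix with a negative lookbehind means the match itself contains only the six digits, so no group selection is needed. The helper name extract_six and the sample comments are hypothetical.

```r
# Hypothetical helper: six digits not adjacent to any other digit.
extract_six <- function(x) {
  # Lookarounds assert the context without consuming the "_" or space.
  m <- regexpr("(?<![0-9])[0-9]{6}(?![0-9])", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

extract_six(c("_123456 at the start", "site 654321", "too long 1234567", "none"))
```

The same pattern should also work with str_extract, since the full match no longer includes the leading character.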

adding or retaining leading zeros without converting to character format

Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code and the last two lines gave the wrong answer. I just thought maybe if I could have leading zeros with numeric format obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7=c(0,1,1,1); # 4
a77 = '1110' ; b77='0111' ; # 4
a777 = 1110 ; b777=0111 ; # 4
length(b7[(b7 %in% intersect(a7,b7))])
# Approach from "R - count matches between characters of one string and another, no replacement":
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))), widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2,][(ab7[2,] %in% intersect(ab7[1,],ab7[2,]))])

You are not thinking correctly about what a "number" is. Programming languages store an internal representation which retains full precision to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say, a couple bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.

You could always create your own class of objects that has one slot for the value of the number (but if it is stored as numeric then what we see as 123 will actually be stored as a binary value, something like 01111011 (though probably with more leading 0's)) and another slot or attribute for either the number of leading 0's or the number of significant digits. Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, sig digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
But this seems a bit overkill in most cases (though I know that some fields make a big deal about indicating number of significant digits so that leading 0's could be important). It may be simpler to use the conversion to character methods that you already know about, but just do the printing in a way that does not look obviously like a number, see the cat and print functions for the options.
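A minimal sketch of that idea as an S3 class, assuming whole-number values. The class name padded and its width attribute are hypothetical, chosen only for illustration.

```r
# Hypothetical S3 class: a numeric value plus a display width.
padded <- function(x, width) {
  structure(x, width = as.integer(width), class = "padded")
}

# Formatting pads with zeros; as.integer() drops the class and attributes.
format.padded <- function(x, ...) {
  formatC(as.integer(x), width = attr(x, "width"), flag = "0")
}

# Printing shows the padded form but leaves the stored value untouched.
print.padded <- function(x, ...) {
  cat(format(x), "\n")
  invisible(x)
}

x <- padded(123, width = 5)
print(x)             # displays 00123
as.numeric(x) + 1    # the stored value is still numeric: 124
```

Only the printed form carries the zeros; arithmetic works on the underlying number.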
