Extracting numeric character of length (1|2) from character list - r

I am scraping PDFs for data and am trying to search for a numeric character (1:9) that is either of length 1 or 2. Unfortunately the value I am after changes position across the PDFs so I cannot simply call the index of the value and assign it to a variable.
I have tried many regex functions and can get numbers out of the list, but cannot seem to implement the argument to only pull numbers of the specific length.
# Data comes in as a long string
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74")
# Separate data into individual pieces with str_split (requires stringr)
library(stringr)
Split_Test<-str_split(Test[1],"\\s+")
# We can easily unlist it with the following code (Not sure if needed)
Test_Unlisted<-unlist(Split_Test)
> Test_Unlisted
[1] "82026-424" "82026-424" "1" "CSX10" "Store" "Room"
[7] "75.74" "75.74"
My desired outcome would be to get the "1" out of the character list, and then if the value was "20" also be able to recognize that.
The best logic I can think of in code is below, but it does not work:
Test_Final<-str_match(Test_Unlisted, "\\d|\\d\\d")
Using this code I can grab anything of length 1, but it is not guaranteed to be a numeric character:
Test_Final<-which(sapply(Test_Unlisted, nchar)==1)
Thanks for all the help!

You need to use
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74, 20")
regmatches(Test, gregexpr("\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)", Test, perl=TRUE))
See the regex demo.
Details
\b - a word boundary
(?<!\d\.) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a digit and a dot
\d{1,2} - 1 or 2 digits
\b - a word boundary
(?!\.\d) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a dot and a digit.
Note that due to the lookarounds used in the pattern, the regex should be passed to the PCRE regex engine, hence the perl=TRUE argument is required.
With stringr, which is powered by the ICU regex engine, you may use
library(stringr)
str_extract_all(Test, "\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)")
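As an alternative sketch that follows the split-into-tokens idea from the question, one could keep only the tokens that consist entirely of one or two digits. This uses base R's strsplit() and grepl() rather than stringr; the names tokens and short_numbers are illustrative and not from the original post.

```r
# Input string from the answer above (with ", 20" appended)
Test <- "82026-424 82026-424 1 CSX10 Store Room 75.74 75.74, 20"

# Split on whitespace; strsplit() returns a list, so take its first element
tokens <- strsplit(Test, "\\s+")[[1]]

# Keep tokens that are exactly one or two digits from start (^) to end ($)
short_numbers <- tokens[grepl("^[0-9]{1,2}$", tokens)]
print(short_numbers)
```

Unlike the lookaround pattern, this only finds values that are separated from their neighbours by whitespace, so a token such as "20," (with trailing punctuation) would be skipped.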

Related

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character in a regular expression, we need to escape it with a backslash to match a literal period: \. is the regex for a period. In an R string, however, the backslash is itself an escape character, so it has to be doubled, giving \\. in the code. Note that only the first period in the pattern above is escaped: \\. matches one literal period, the following . matches any single character, and .* matches any character 0 or more times. Together they remove everything from the first period onward. To require exactly three literal periods, \\.{3}.* can be used instead.
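A small sketch of the difference between the pattern above and a stricter one that insists on three literal periods. The stricter pattern is an assumption about the intent, and the name "Q1.a" is a made-up example, not a column from the question.

```r
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")

# Stricter variant: three literal periods, then the rest of the name
stripped <- sub("\\.{3}.*$", "", col)
print(stripped)

# The looser pattern also truncates a name that contains a single period
loose  <- sub("\\...*", "", "Q1.a")      # literal ".", any char, then anything
strict <- sub("\\.{3}.*$", "", "Q1.a")   # no "..." present, so nothing changes
print(c(loose, strict))
```

If column names can legitimately contain a single period, the stricter pattern is the safer choice.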
You could sub and capture the first word in each column:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.
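Since the question is about data frame columns, here is a minimal sketch of applying the same sub() call to names(). The data frame df and its values are invented for illustration.

```r
# Hypothetical data frame; names are assigned afterwards so that
# data.frame() does not alter them
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(df) <- c("UserLanguage", "Q21...20", "Q22_4_TEXT...202")

# Keep only the leading run of word characters in each column name
names(df) <- sub("^(\\w+).*$", "\\1", names(df))
print(names(df))
```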

how to remove decimal point between numbers in R

I am trying to remove the decimal points in decimal numbers in R. Please note I want to keep the full stop of strings.
Example:
data= c("It's 6.00pm, and is late.")
I know that I have to use regex for this, but I am struggling. My desired output is:
"It's 6 00pm, and is late."
Thank you in advance.
Try this:
sub("(?<=\\d)\\.(?=\\d)", " ", data, perl = TRUE)
This solution uses a lookbehind (?<=...) and a lookahead (?=...) to assert that the period you wish to remove is enclosed by digits (thus avoiding a match on the period at the end of the sentence). If a string contains several such cases, use gsub instead of sub.
I suggest using a simple pattern to find the target text, then adding parentheses to identify the parts of the matching text that you want to retain.
# Test data
data <- c("It's 6.00pm, and is late.")
The target pattern is a literal dot with a string of digits before and after it. \\d+ matches one or more digits and \\. matches a literal dot. Testing the pattern to see if it works:
grepl("\\d+\\.\\d+", data)
Result
TRUE
If we wanted to eliminate the whole thing we could do a simple replacement with an empty string. Testing if this targets the correct text:
sub("\\d+\\.\\d+", "", data)
Result
"It's pm, and is late."
Instead, to discard only a section of the matched text we can identify the parts we want to keep, which is done by surrounding them with parentheses. Once done, we can refer to the captured text in the replacement. \\1 refers to the first chunk of captured text and \\2 refers to the second, corresponding to the first and second sets of parentheses.
# pattern replacement
sub("(\\d+)\\.(\\d+)", "\\1\\2", data)
Result
[1] "It's 600pm, and is late."
This effectively removes the dot by omitting it from the replacement text.
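The desired output in the question has a space where the dot was. A minimal sketch of that variant puts a space between the two captured groups and uses gsub so that every decimal in a string is handled. The second test string is made up for illustration.

```r
data <- c("It's 6.00pm, and is late.", "From 6.00pm to 7.30pm.")

# Replace the dot between two digit runs with a space; sentence-ending
# periods are untouched because no digits follow them
spaced <- gsub("(\\d+)\\.(\\d+)", "\\1 \\2", data)
print(spaced)
```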

Pattern match with R

I am trying to match a pattern using the grep() function as below:
grep("XYZ31__Sheqwqet1__CSV.csv", "^(XYZ)+[0-9]{2}[a-zA-Z_]+(csv)+$")
Unfortunately, the above expression results in no match. Any pointer in the right direction would be very helpful.
Thanks for your time
Before the csv there is also a . and a digit, neither of which [a-zA-Z_]+ can match. In addition, grep() expects the pattern first, followed by the input x (if the arguments are passed by name, the order does not matter).
grep( "^(XYZ)+[0-9]{2}[[:alnum:]_.]+(csv)$", "XYZ31__Sheqwqet1__CSV.csv")
#[1] 1
The pattern matches:
^ - start of the string
(XYZ)+ - one or more occurrences of the letters XYZ
[0-9]{2} - two digits
[[:alnum:]_.]+ - one or more alphanumeric characters, underscores, or dots
(csv)$ - csv at the end of the string
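A short sketch of the argument-order point: passing the arguments by name makes the call independent of position, while swapping them positionally silently returns no match. The variable names fname, pat, idx, is_match and swapped are illustrative.

```r
fname <- "XYZ31__Sheqwqet1__CSV.csv"
pat   <- "^(XYZ)+[0-9]{2}[[:alnum:]_.]+(csv)$"

# Named arguments: order no longer matters
idx <- grep(x = fname, pattern = pat)

# grepl() returns a logical instead of an index
is_match <- grepl(pat, fname)

# Swapped positional arguments: the file name is treated as the pattern
swapped <- grep(fname, pat)

print(idx)
print(is_match)
print(length(swapped))
```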

Find occurrences with regex and then only remove first character in matched expression

Surprisingly I haven't found a satisfactory answer to this regex problem. I have the following vector:
row1
[1] "AA.8.BB.CCCC" "2017" "3.166.5" "3.080.2" "68" "162.6"
[7] "185.223.632.4" "500.332.1"
My end result should look like this:
row1
[1] "AA.8.BB.CCCC" "2017" "3,166.5" "3,080.2" "68" "162.6"
[7] "185,223,632.4" "500,332.1"
The last period in each of the numeric values is the decimal point and the other periods should be converted to commas. I want this done without affecting the value with letters ([1]). I tried the following:
gsub("[.]\\d{3}[.]", ",", row1)
This regex sort of works but doesn't quite do what I want. Additionally it removes the numbers, which is problematic. Is there a way to find the regex and then only remove the first character and not the entire matched values? If there is a better way of approaching this I welcome those responses as well.
You can use the following:
See code in use here
gsub("\\G\\d+\\K\\.(?=\\d+(?!$))", ",", row1, perl = TRUE)
See regex in use here
Note: The regex at the URL above is changed to (?:\G|^) for display purposes (\G matches the start of the string \A, but not the start of the line).
\G\d+\K\.(?=\d+(?!$))
How it works:
\G asserts position either at the end of the previous match or at the start of the string
\d+\K\. matches a digit one or more times, then resets the match (previously consumed characters are no longer included in the final match), then match a dot . literally
(?=\d+(?!$)) positive lookahead ensuring what follows is one or more digits, but not followed by the end of the line
One option is to use a combination of a lookbehind and a lookahead to match a dot only when there is a digit on its left and three digits followed by a dot on its right.
With gsub, set perl = TRUE and use a comma as the replacement.
(?<=\d)[.](?=\d{3}[.])
Regex demo | R demo
Double escaped for use in an R string, as noted by @r2evans:
(?<=\\d)[.](?=\\d{3}[.])
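Putting the lookaround pattern into a complete call, as a sketch using the vector from the question (the name fixed is illustrative):

```r
row1 <- c("AA.8.BB.CCCC", "2017", "3.166.5", "3.080.2", "68", "162.6",
          "185.223.632.4", "500.332.1")

# Replace a dot only when a digit precedes it and exactly three digits
# plus another dot follow it; lookarounds need perl = TRUE
fixed <- gsub("(?<=\\d)[.](?=\\d{3}[.])", ",", row1, perl = TRUE)
print(fixed)
```

Because only the dot itself is matched, the surrounding digits are left in place. The pattern assumes that every thousands group has three digits and that a decimal point always follows the last group, so a value like "1.234" is left unchanged.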

Extract up to two more digits

This may be a very simple question but I have not much experience with regex expressions. This page is a good source of regex expressions but could not figure out how to include them into my following code:
data %>% filter(grepl("^A01H1", icl))
Question
I would like to extract the values in one column of my data frame starting with this A01H1 up to 2 more digits, for example A01H100, A01H140, A01H110. I could not find a solution despite my few attempts:
Attempts
I looked at this question, from which I used ^A01H1[0-9].{2} to select up to two more digits.
I tried with adding any character ^A01H1[0-9][0-9][x-y] to stop after two digits.
Any help would be much appreciated :)
You can use "^A01H1\\d{1,2}$".
The first part ("^A01H1"), you figured out yourself, so what are we doing in the second part ("\\d{1,2}$")?
\d includes all digits and is equivalent to [0-9], since we are working in R you need to escape \ and thus we use \\d
{1,2} indicates we want to have 1 or 2 matches of \\d
$ specifies the end of the string, so nothing may come afterwards, which prevents matching more than 2 digits
It looks as if you want to match a part of a string that starts with A01H1, then contains 1 or 2 digits, and is not followed by any further digit.
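A minimal base R sketch of the same filter, assuming a data frame with an icl column as in the question. The data frame df and its values are made up, and the subsetting below stands in for the dplyr filter() call.

```r
# Hypothetical data; only the column name icl comes from the question
df <- data.frame(
  icl = c("A01H100", "A01H140", "A01H1105", "A01H1", "C12N15"),
  stringsAsFactors = FALSE
)

# Keep rows where icl is A01H1 followed by exactly one or two digits
kept <- df[grepl("^A01H1\\d{1,2}$", df$icl), , drop = FALSE]
print(kept$icl)
```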
You may use
^A01H1\d{1,2}(?!\d)
See the regex demo. If there can be no text after two digits at all, replace (?!\d) with $.
Details
^ - start of string
A01H1 - literal string
\d{1,2} - one to two digits
(?!\d) - no digit allowed immediately to the right
$ - end of string
In R, you could use it like
grepl("^A01H1\\d{1,2}(?!\\d)", icl, perl=TRUE)
Or, with the string end anchor,
grepl("^A01H1\\d{1,2}$", icl)
Note the perl=TRUE is only necessary when using PCRE specific syntax like (?!\d), a negative lookahead.
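To extract the matching prefix rather than only test for it, one possible sketch combines regexpr() with regmatches(). The sample values in icl are invented for illustration.

```r
icl <- c("A01H100", "A01H14 extra", "A01H1234", "A01H1", "B01H110")
pat <- "^A01H1\\d{1,2}(?!\\d)"

# Which elements match (the lookahead requires perl = TRUE)
has_code <- grepl(pat, icl, perl = TRUE)

# regmatches() returns the matched text for matching elements only
codes <- regmatches(icl, regexpr(pat, icl, perl = TRUE))

# The end-anchored variant rejects trailing text after the digits
strict <- grepl("^A01H1\\d{1,2}$", icl)

print(has_code)
print(codes)
print(strict)
```

Note that regmatches() drops non-matching elements, so codes is shorter than icl; use has_code to line the results up with the original vector.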
