how to use grep in R to get the specified character? - r

I have
str=c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
and I want to get
"00005.profit" "00006.profit"
How can I achieve this using grep in R?

Here is one way:
R> s <- c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
R> unique(gsub("([0-9]+\\.profit).*", "\\1", s))
[1] "00005.profit" "00006.profit"
We define a regular expression as digits followed by .profit, and capture it by enclosing the expression in parentheses. The \\1 then recalls the first such captured group, and since the replacement recalls nothing else, that is all we get. The unique() then reduces the four items to two unique ones.
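To make the two steps visible, here is a small sketch that prints the intermediate result before unique() is applied. The variable names trimmed and result are only for illustration, and the dot is escaped so that it matches a literal ".".

```r
s <- c("00005.profit", "00005.profit-in", "00006.profit", "00006.profit-in")

# Step 1: keep only the captured group (digits followed by a literal ".profit");
# the trailing .* swallows anything after it, such as "-in".
trimmed <- gsub("([0-9]+\\.profit).*", "\\1", s)
print(trimmed)

# Step 2: drop the duplicates that step 1 produced.
result <- unique(trimmed)
print(result)
```

The first print shows four elements (each value twice), the second only the two distinct ones.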

Dirk's answer is pretty much the ideal generalisable answer, but here are a couple of other options based on the fact that your example always has a - character starting the part you wish to chop off:
1: gsub to return everything prior to the -
gsub("(.+)-.+","\\1",str)
2: strsplit on - and keep only the first part.
sapply(strsplit(str,"-"),head,1)
Both return:
[1] "00005.profit" "00005.profit" "00006.profit" "00006.profit"
which you can then wrap in unique to not return duplicates like:
unique(gsub("(.+)-.+","\\1",str))
unique(sapply(strsplit(str,"-"),head,1))
These will then return:
[1] "00005.profit" "00006.profit"
Another non-generalisable solution would be to just take the first 12 characters (assuming string length for the part you want to keep doesn't change):
unique(substr(str,1,12))
[1] "00005.profit" "00006.profit"

I'm actually interpreting your question differently. I think you might want
grep("[0-9]+\\.profit$",str,value=TRUE)
That is, if you only want the strings that end with profit. The $ special character stands for "end of string", so it excludes cases that have additional characters at the end ... The \\. means "I really want to match a dot, not any character at all" (a . by itself would match any character). You weren't entirely clear about your target pattern -- you might prefer "0+[1-9]\\.profit$" (any number of zeros followed by a single non-zero digit), or even "0{4}[1-9]\\.profit$" (4 zeros followed by a single non-zero digit).
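A quick sketch of how the anchor and value=TRUE change what grep returns, using the vector from the question. The names idx, anchored, unanchored and strict are just illustrative.

```r
str <- c("00005.profit", "00005.profit-in", "00006.profit", "00006.profit-in")

# Without value = TRUE, grep returns the positions of the matching elements.
idx <- grep("[0-9]+\\.profit$", str)

# With value = TRUE, it returns the matching strings themselves.
anchored <- grep("[0-9]+\\.profit$", str, value = TRUE)

# Dropping the $ anchor lets the "-in" variants match as well.
unanchored <- grep("[0-9]+\\.profit", str, value = TRUE)

# The stricter alternative: four zeros, then a single non-zero digit.
strict <- grep("0{4}[1-9]\\.profit$", str, value = TRUE)

print(idx)
print(anchored)
print(unanchored)
print(strict)
```

Note that grep only selects whole elements; it never trims them, which is why this reading of the question differs from the gsub-based answers.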

Related

Find occurrences with regex and then only remove first character in matched expression

Surprisingly I haven't found a satisfactory answer to this regex problem. I have the following vector:
row1
[1] "AA.8.BB.CCCC" "2017" "3.166.5" "3.080.2" "68" "162.6"
[7] "185.223.632.4" "500.332.1"
My end result should look like this:
row1
[1] "AA.8.BB.CCCC" "2017" "3,166.5" "3,080.2" "68" "162.6"
[7] "185,223,632.4" "500,332.1"
The last period in each of the numeric values is the decimal point and the other periods should be converted to commas. I want this done without affecting the value with letters ([1]). I tried the following:
gsub("[.]\\d{3}[.]", ",", row1)
This regex sort of works but doesn't quite do what I want. Additionally it removes the numbers, which is problematic. Is there a way to find the regex and then only remove the first character and not the entire matched values? If there is a better way of approaching this I welcome those responses as well.
You can use the following:
See code in use here
gsub("\\G\\d+\\K\\.(?=\\d+(?!$))", ",", row1, perl=TRUE)
See regex in use here
Note: The regex at the URL above is changed to (?:\G|^) for display purposes (\G matches the start of the string \A, but not the start of the line).
\G\d+\K\.(?=\d+(?!$))
How it works:
\G asserts position either at the end of the previous match or at the start of the string
\d+\K\. matches a digit one or more times, then resets the match (previously consumed characters are no longer included in the final match), then matches a dot . literally
(?=\d+(?!$)) positive lookahead ensuring what follows is one or more digits, but not followed by the end of the line
One option is to use a combination of a lookbehind and a lookahead to match only a dot when what is on the left is a digit and on the right are 3 digits followed by a dot.
You could add perl = TRUE using gsub.
In the replacement use a comma.
(?<=\d)[.](?=\d{3}[.])
Regex demo | R demo
Double-escaped for use in an R string, as noted by @r2evans:
(?<=\\d)[.](?=\\d{3}[.])
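Put together, the lookaround version might be used like this on the vector from the question. This is a minimal sketch; the name fixed_row is only illustrative.

```r
row1 <- c("AA.8.BB.CCCC", "2017", "3.166.5", "3.080.2", "68", "162.6",
          "185.223.632.4", "500.332.1")

# Replace a dot only when a digit precedes it and exactly three digits
# plus another dot follow it. Lookarounds need perl = TRUE.
fixed_row <- gsub("(?<=\\d)[.](?=\\d{3}[.])", ",", row1, perl = TRUE)
print(fixed_row)
```

Because the lookarounds do not consume any characters, only the dot itself is replaced, so the digits around it are left untouched and the value containing letters is not changed.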

Extracting numeric character of length (1|2) from character list

I am scraping PDFs for data and am trying to search for a numeric character (1:9) that is either of length 1 or 2. Unfortunately the value I am after changes position across the PDFs so I cannot simply call the index of the value and assign it to a variable.
I have tried many regex functions and can get numbers out of the list, but cannot seem to implement the argument to only pull numbers of the specific length.
library(stringr)
# Data comes in as a long string
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74")
# Separate data into individual pieces with str_split
Split_Test<-str_split(Test[1],"\\s+")
# We can easily unlist it with the following code (Not sure if needed)
Test_Unlisted<-unlist(Split_Test)
> Test_Unlisted
[1] "82026-424" "82026-424" "1" "CSX10" "Store" "Room"
[7] "75.74" "75.74"
My desired outcome would be to get the "1" out of the character list, and then if the value was "20" also be able to recognize that.
The best logic I can think of in code is below, but this does not work:
Test_Final<-str_match(Test_Unlisted, "\\d|\\d\\d")
Using this code I can grab anything of length=1, but it is not guaranteed to be a character:
Test_Final<-which(sapply(Test_Unlisted, nchar)==1)
Thanks for all the help!
You need to use
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74, 20")
regmatches(Test, gregexpr("\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)", Test, perl=TRUE))
See the regex demo and the regex demo.
Details
\b - a word boundary
(?<!\d\.) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a digit and a dot
\d{1,2} - 1 or 2 digits
\b - a word boundary
(?!\.\d) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a dot and a digit.
Note that due to the lookarounds used in the pattern, the regex should be passed to the PCRE regex engine, hence the perl=TRUE argument is required.
With stringr, which is powered by the ICU regex engine, you may use
library(stringr)
str_extract_all(Test, "\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)")
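For reference, a small base R sketch of what the pattern above picks out of the extended test string. The second half is an alternative that is not from the answer above: it assumes the values are separated by spaces or commas, splits on those, and keeps only tokens that consist entirely of one or two digits. The names pattern, matches, tokens and short_numbers are illustrative.

```r
Test <- "82026-424 82026-424 1 CSX10 Store Room 75.74 75.74, 20"

# Lookaround version: 1-2 digit numbers that are not part of a decimal.
pattern <- "\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)"
matches <- regmatches(Test, gregexpr(pattern, Test, perl = TRUE))[[1]]
print(matches)

# Alternative (assumption: tokens are separated by spaces and/or commas):
# split first, then keep tokens made up of exactly one or two digits.
tokens <- strsplit(Test, "[ ,]+")[[1]]
short_numbers <- grep("^[0-9]{1,2}$", tokens, value = TRUE)
print(short_numbers)
```

The split-then-filter variant avoids lookarounds entirely, but it only works if the separator assumption holds for every PDF.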

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parentheses-bound pattern we'd like to keep, as sometimes we might have multiple parentheses-bound elements. See the example below:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wild cards by the way. If you'd like to learn more about using regex in R, here's a reference that can get you started.
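As a further illustration of numbered groups, the sketch below recalls both groups in a single replacement, so you can see what each one captured. The " | " separator and the name both are only for display purposes.

```r
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

# Group 1 captures from "SS" up to (not including) "T0";
# group 2 captures "T0" and everything after it.
both <- sub("^.*(SS.*)(T0.*)$", "\\1 | \\2", Target)
print(both)
```

Anything matched outside the parentheses (here the leading .*) is dropped, because the replacement never refers to it.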

remove/replace specific words or phrases from character strings - R

I looked around both here and elsewhere, I found many similar questions but none which exactly answer mine. I need to clean up naming conventions, specifically replace/remove certain words and phrases from a specific column/variable, not the entire dataset. I am migrating from SPSS to R, I have an example of the code to do this in SPSS below, but I am not sure how to do it in R.
EG:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column with 32,000 cases, and what needs replacing/removing varies and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), meaning it's much faster to code what needs to be removed/replaced, it's not as simple or clean as a regular expression to remove all spaces, all characters after a specific word or character, all special characters, etc. And it must include leading spaces.
I have looked at the replace() gsub() and other similar commands in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can include nothing at all, and if it does not find the specific characters, the case is unchanged.
Yes, I will end up repeating the same syntax many times, it's probably easier to create a vector but if possible I'd like to get the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
Legend:
^ Start of the string.
() Capture group (or token).
\w* Zero or more word characters.
.* Zero or more occurrences of any character.
$ End of the string.
\1 Refers to the first captured group in the replacement.
Maybe I'm missing something, but I don't see why you can't simply use alternation (|) in your regular expression, then trim off the leftover white space.
string <- c("Arcadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Arcadia" "Fifth"
dataframename$varname <- gsub(" Parish","", dataframename$varname)
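If you want something that behaves like the SPSS REPLACE call, i.e. a literal search-and-replace with no regex interpretation, gsub accepts fixed = TRUE. A minimal sketch, where the data frame df and its column county are made-up names for illustration:

```r
# Hypothetical data frame standing in for the real dataset.
df <- data.frame(county = c("Acadia Parish", "Fifth District", "Orleans"),
                 stringsAsFactors = FALSE)

# fixed = TRUE treats the pattern as a literal string, leading space included.
# Rows that do not contain the phrase are left unchanged.
df$county <- gsub(" Parish", "", df$county, fixed = TRUE)
df$county <- gsub(" District", "", df$county, fixed = TRUE)
print(df$county)
```

Each line only touches the one column and can be repeated for every phrase that needs removing or replacing.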

Replacing all occurrences of a pattern in a string

I am used to working with numbers and matrices in R, but when it comes to strings and characters I am lost. I want to analyze some data where the time is read into R as follows:
> my.time.char[1]
[1] "\"2011-10-05 15:55:00\""
I want to end up with a string containing only:
"2011-10-05 15:55:00"
Using the function sub() (which I barely understand...), I got the following result:
> sub("(\")","",my.time.char[1])
[1] "2011-10-05 15:55:00\""
This is closer to the format I am looking for, but I still need to get rid of the last two characters (\").
The second line from ?sub explains:
sub and gsub perform replacement of the first and all matches respectively.
which should tell you to use gsub instead.
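For example, with a stand-in value for the string in the question, the difference looks like this (first_only and cleaned are illustrative names):

```r
my.time.char <- "\"2011-10-05 15:55:00\""

# sub() removes only the first embedded quote ...
first_only <- sub("\"", "", my.time.char)

# ... while gsub() removes every one of them.
cleaned <- gsub("\"", "", my.time.char)

print(first_only)
print(cleaned)
```

When printed, cleaned still appears wrapped in quotes, but those are just how R displays a character string; the embedded \" characters are gone.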

Resources