How to remove "p.485" inside the brackets with R?

I have:
String="(anthony,2019, p.485)"
Desired output:
String="(anthony,2019)"
I want to remove only p.485.
I used the regex:
gsub("\\( \\,p\\.[0-9]\\)","",String)
but it does not work.
Thanks!

We can use sub to match a comma followed by a space, 'p', a literal dot (\\.) and one or more digits (\\d+), and replace the match with an empty string (""):
sub(", p\\.\\d+", "", String)
#[1] "(anthony,2019)"

We could also try this slightly inefficient regex:
String="(anthony,2019, p.485)"
gsub(",\\s\\w.\\d{1,}","",String,perl=TRUE)
#[1] "(anthony,2019)"

This seems like it might be a bit fragile, since it depends on there not being any spaces after commas except for the last one:
gsub("[,] [^,)]+","", String)
[1] "(anthony,2019)"
That fragility afflicts akrun's answer as well. A more robust solution would be to match on the last comma but leave in the closing paren. This solution will locate that last comma and excise everything up to but not including the closing paren:
gsub("(.+)([,] [^,]+)([)])","\\1\\3", String) # 3 capture groups
[1] "(anthony,2019)" # keep only the first and third groups
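To make the fragility concrete, here is a small sketch. The second and third citations are invented test cases (they are not from the question), and the anchored pattern is only one possible way to tie the match to the page reference; its lookahead requires perl = TRUE.

```r
# Hypothetical citations: the last two entries are made up for illustration
refs <- c("(anthony,2019, p.485)", "(smith, 2020, p.12)", "(lee,2018)")

# Space-dependent pattern: it also strips ", 2020" from the second entry
fragile <- gsub("[,] [^,)]+", "", refs)

# Match only ", p.<digits>" that sits directly before the closing parenthesis
anchored <- sub(",\\s*p\\.\\s*\\d+(?=\\))", "", refs, perl = TRUE)

print(fragile)   # second entry collapses to "(smith)"
print(anchored)  # second entry stays "(smith, 2020)"
```

Entries without a page reference, such as the third one, pass through both calls unchanged.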

Related

How to remove a certain portion of the column name in a dataframe?

I have column names in the following format:
col= c('UserLanguage','Q48','Q21...20','Q22...21',"Q22_4_TEXT...202")
I would like to get the column names without everything that is after ...
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
I am not sure how to code it. I found this post here but I am not sure how to specify the pattern in my case.
You can use gsub.
gsub("\\...*","",col)
#[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
Or you can use stringr
library(stringr)
str_remove(col, "\\...*")
Since . matches any character, we need to escape it with a backslash to match a literal period, so the regular expression for a period is \.. However, the backslash is itself the escape character in R strings, so it has to be doubled when the pattern is written as a string literal: \\.. Note that in \\...* only the first period is escaped. The pattern therefore matches a literal period, then any single character (the second .), then zero or more of any character (.*, where * applies only to the . immediately before it). This works here because the first period in each name is the start of the ... suffix; to match exactly three literal periods, use \\.{3}.* or [.]{3}.* instead.
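A quick way to check what the pattern really consumes is to try it on a name that has a single period before the ... suffix. The name v1.2...3 below is a made-up example, not one of the asker's columns.

```r
# "v1.2...3" is a hypothetical name with one period before the "..." suffix
nm <- c("Q21...20", "v1.2...3")

# Only the first period is escaped, so this cuts at the first literal period
loose <- sub("\\...*", "", nm)

# Requires three literal periods in a row before cutting
strict <- sub("\\.{3}.*", "", nm)

print(loose)   # "Q21" "v1"
print(strict)  # "Q21" "v1.2"
```

For the column names in the question both patterns give the same result; they differ only when a name contains a period of its own.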
You could sub and capture the first word in each column:
col <- c("UserLanguage", "Q48", "Q21...20", "Q22...21", "Q22_4_TEXT...202")
sub("^(\\w+).*$", "\\1", col)
[1] "UserLanguage" "Q48" "Q21" "Q22" "Q22_4_TEXT"
The regex pattern used here says to match:
^ from the start of the input
(\w+) match AND capture the first word
.* then consume the rest
$ end of the input
Then, using sub we replace with \1 to retain just the first word.
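Since the question is about data frame column names, the same call can be applied to names() directly. This is a minimal sketch on a toy data frame whose values are invented; check.names = FALSE is used so the names stay exactly as typed.

```r
# Toy data frame with made-up values; only the column names matter here
df <- data.frame(UserLanguage = "EN", `Q21...20` = 1, `Q22_4_TEXT...202` = "a",
                 check.names = FALSE)

# Keep the leading run of word characters (letters, digits, underscore)
names(df) <- sub("^(\\w+).*$", "\\1", names(df))

print(names(df))  # "UserLanguage" "Q21" "Q22_4_TEXT"
```

If two original names share the same prefix, the cleaned names will be duplicated, so it may be worth checking anyDuplicated(names(df)) afterwards.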

Find occurrences with regex and then only remove first character in matched expression

Surprisingly I haven't found a satisfactory answer to this regex problem. I have the following vector:
row1
[1] "AA.8.BB.CCCC" "2017" "3.166.5" "3.080.2" "68" "162.6"
[7] "185.223.632.4" "500.332.1"
My end result should look like this:
row1
[1] "AA.8.BB.CCCC" "2017" "3,166.5" "3,080.2" "68" "162.6"
[7] "185,223,632.4" "500,332.1"
The last period in each of the numeric values is the decimal point and the other periods should be converted to commas. I want this done without affecting the value with letters ([1]). I tried the following:
gsub("[.]\\d{3}[.]", ",", row1)
This regex sort of works but doesn't quite do what I want. Additionally it removes the numbers, which is problematic. Is there a way to find the regex and then only remove the first character and not the entire matched values? If there is a better way of approaching this I welcome those responses as well.
You can use the following:
gsub("\\G\\d+\\K\\.(?=\\d+(?!$))", ",", row1, perl = TRUE)
Note: the regex in the original online demo was changed to (?:\G|^) for display purposes (\G matches at the start of the string, like \A, but not at the start of each line).
\G\d+\K\.(?=\d+(?!$))
How it works:
\G asserts position either at the end of the previous match or at the start of the string
\d+\K\. matches one or more digits, then resets the start of the match (previously consumed characters are no longer included in the final match), then matches a literal dot .
(?=\d+(?!$)) positive lookahead ensuring what follows is one or more digits, but not followed by the end of the line
Another option is to combine a lookbehind and a lookahead so that a dot is matched only when it is preceded by a digit and followed by three digits and another dot.
With gsub, set perl = TRUE and use a comma as the replacement.
(?<=\d)[.](?=\d{3}[.])
Double escaped for an R string, as noted by @r2evans:
(?<=\\d)[.](?=\\d{3}[.])
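The sketch below runs the lookaround pattern on the vector from the question and contrasts it with a capture-group variant. The capture-group call is my own illustration of why consuming the trailing dot is a problem; it is not taken from any of the answers.

```r
row1 <- c("AA.8.BB.CCCC", "2017", "3.166.5", "3.080.2", "68", "162.6",
          "185.223.632.4", "500.332.1")

# Capture-group variant: each match consumes the trailing dot, so the dot
# that starts the next group of three digits can no longer be matched
consumed <- gsub("[.](\\d{3}[.])", ",\\1", row1)

# Lookaround variant: only the dot itself is matched and replaced
converted <- gsub("(?<=\\d)[.](?=\\d{3}[.])", ",", row1, perl = TRUE)

print(consumed[7])   # "185,223.632.4"
print(converted[7])  # "185,223,632.4"
```

Because lookarounds are zero-width, consecutive thousands separators can share the digits between them, which is what the seventh element needs.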

Add a white-space between number and special character condition R

I'm trying to use stringr or base R calls to conditionally add a whitespace in a large vector wherever a numeric value is immediately followed by a special character, in this case a $ sign, with no space in between. str_pad doesn't appear to allow for a reference vector.
For example, for:
$6.88$7.34
I'd like to add a whitespace after the last number and before the next dollar sign:
$6.88 $7.34
Thanks!
If there is only one instance, use sub to capture the digit and the $ separately, and in the replacement add a space between the backreferences to the two captured groups:
sub("([0-9])([$])", "\\1 \\2", v1)
#[1] "$6.88 $7.34"
Or with a regex lookaround
gsub("(?<=[0-9])(?=[$])", " ", v1, perl = TRUE)
data
v1 <- "$6.88$7.34"
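The difference between sub and gsub matters as soon as a string has more than one missing space. A short sketch, using a made-up input with two gaps:

```r
v2 <- "$1.50$2.75$3.00"  # hypothetical input with two missing spaces

# sub() stops after the first match
first_only <- sub("([0-9])([$])", "\\1 \\2", v2)

# gsub() repeats the replacement for every match
all_gaps <- gsub("([0-9])([$])", "\\1 \\2", v2)

print(first_only)  # "$1.50 $2.75$3.00"
print(all_gaps)    # "$1.50 $2.75 $3.00"
```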
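This also works when a string contains several instances (gsub is vectorized, so it applies to every element of a character vector):
mystring <- '$6.88$7.34 $8.34$4.31'
gsub("(?<=\\d)\\$", " $", mystring, perl = TRUE)
[1] "$6.88 $7.34 $8.34 $4.31"
Dollar signs that already follow a space are left alone, because the lookbehind requires a digit immediately before the $.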
Regarding the question asked in the comments:
mystring2 <- 'Regular_Distribution_Type† Income Only" "Distribution_Rate 5.34%" "Distribution_Amount $0.0295" "Distribution_Frequency Monthly'
gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", mystring2, perl = TRUE)
[1] "Regular_Distribution_Type<U+2020> Income_Only\" \"Distribution_Rate 5.34%\" \"Distribution_Amount $0.0295\" \"Distribution_Frequency_Monthly"
Note that the \ in the output only escapes the double quotes embedded in the string when it is printed, and <U+2020> is how the console displays the † character in this encoding; neither affects the result.
Explanation of regex:
(?<=[[:alpha:]]) This first part is a positive lookbehind, introduced by ?<=. It checks that the text immediately before the match is what we specify, without including it in the match. Here we look for [[:alpha:]], which matches an alphabetic character.
\s matches a whitespace character, which is what we actually want to replace; in an R string the backslash must be doubled, so we write \\s.
(?=[[:alpha:]]+) is a positive lookahead, introduced by ?=, which checks that the match is followed by another letter.
The logic is to find a whitespace character between two letters and match only that character, which gsub then replaces with _.
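To illustrate why the lookarounds are needed, the sketch below compares the pattern with a version that matches the neighbouring letters directly. The input string is made up for the example.

```r
s <- "Income Only 5.34 Monthly Rate"  # hypothetical example string

# Lookarounds check the neighbouring letters without consuming them
out <- gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]])", "_", s, perl = TRUE)

# Without lookarounds the letters are part of the match and get replaced too
eaten <- gsub("[[:alpha:]]\\s[[:alpha:]]", "_", s)

print(out)    # "Income_Only 5.34 Monthly_Rate"
print(eaten)  # "Incom_nly 5.34 Monthl_ate"
```

The spaces next to 5.34 are untouched in both cases because one of their neighbours is a digit rather than a letter.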

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove everything before SS and everything from T0 onward (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parenthesized group we'd like to keep, since a pattern may contain several such groups. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
By the way, .* acts as a wildcard: . matches any character and * repeats it zero or more times.
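Two follow-up points can be sketched in code. First, the alternation the asker tried can be made to work in a single call if the prefix is removed up to, but not including, SS; this is one possible fix, assuming perl = TRUE for the lookahead. Second, because .* is greedy, the captured group can run past the first T0 when T0 occurs more than once; the string tricky below is invented to show that.

```r
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

# Remove the prefix before SS, or T0 and everything after it, in one call
trimmed <- gsub("^.*?(?=SS)|T0.*$", "", Target, perl = TRUE)
print(trimmed)  # "SS1G_340" "SS2G_340"

# Hypothetical string that contains T0 twice
tricky <- "x_SS1_T0a_T0b"
greedy <- sub(".*(SS.*)T0.*", "\\1", tricky, perl = TRUE)   # runs to the last T0
lazy   <- sub(".*(SS.*?)T0.*", "\\1", tricky, perl = TRUE)  # stops at the first T0
print(greedy)  # "SS1_T0a_"
print(lazy)    # "SS1_"
```

For the vector in the question the greedy and lazy versions agree, since each element contains T0 only once.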

get last part of a string

I would like to get the last substring of a variable (the last part after the underscore), in this case: "myvar".
x = "string__subvar1__subvar2__subvar3__myvar"
My attempts result in a match starting from the first substring, e.g.
library(stringr)
str_extract(x, "__.*?$")
How do I do this in R?
You can do
sub('.*\\__', '', x)
You can do:
library(stringr)
str_extract(x,"[a-zA-Z]+$")
EDIT:
A lookbehind could be used as well: str_extract(x, "(?<=_)[a-zA-Z]+$")
Also, in base R:
regmatches(x,gregexpr("[a-zA-Z]+$",x))[[1]]
From documentation ?regex:
The caret ^ and the dollar sign $ are metacharacters that respectively
match the empty string at the beginning and end of a line.
Could this work? Sorry, I hope I'm understanding what you're asking correctly.
substr(x,gregexpr("_",x)[[1]][length(gregexpr("_",x)[[1]])]+1,nchar(x))
[1] "myvar"
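Two shorter base R alternatives, offered as a sketch: a greedy .* in sub removes everything up to the last separator, and strsplit with fixed = TRUE splits on the literal "__" so the last piece can be picked by index.

```r
x <- "string__subvar1__subvar2__subvar3__myvar"

# Greedy .* runs to the last "__", so only the final piece survives
via_sub <- sub(".*__", "", x)

# Split on the literal separator and keep the last piece
parts <- strsplit(x, "__", fixed = TRUE)[[1]]
via_split <- parts[length(parts)]

print(via_sub)    # "myvar"
print(via_split)  # "myvar"
```

Unlike the [a-zA-Z]+$ pattern, both versions also keep digits or other characters in the last piece.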
