I have tried gsub as follows to remove everything before the first space, but it didn't work.
lagl2$SUSPENSE <- gsub(pattern = "(.*)\\s",replace=" ", lagl2$SUSPENSE)
Example of the row data: 64400/GL WORKERS COMPENSATION
I want the result to look like this: WORKERS COMPENSATION
This is just an example; I have many observations in one column and need to delete everything before the first space in each of them.
I am new to R and to programming but I started loving it.
You can remove everything before the first space using sub:
sub(".*?\\s", "", "64400/GL WORKERS COMPENSATION")
#[1] "WORKERS COMPENSATION"
To apply it to the whole column:
lagl2$SUSPENSE <- sub(".*?\\s", "", lagl2$SUSPENSE)
You could also anchor the match at the start of the string with ^ and then match optional non-whitespace characters followed by one or more whitespace characters, \S*\s+, which is the part you want to remove.
sub("^\\S*\\s+", "", "64400/GL WORKERS COMPENSATION")
Output
[1] "WORKERS COMPENSATION"
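To see why the original gsub attempt fails while the patterns above work, here is a small side-by-side sketch. The vector x is made up for illustration; only its first element comes from the question.

```r
x <- c("64400/GL WORKERS COMPENSATION", "NOSPACE", "A B C")

# Greedy .* runs up to the LAST space, so nearly everything is swallowed
greedy <- gsub("(.*)\\s", " ", x)

# Lazy .*? stops at the FIRST space; sub() replaces only the first match
lazy <- sub(".*?\\s", "", x)

# Anchored variant: the leading non-whitespace run plus the whitespace after it
anchored <- sub("^\\S*\\s+", "", x)

print(greedy)
print(lazy)
print(anchored)
```

Strings that contain no space at all are left unchanged by all three calls.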
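You can also match everything before the first space using a lookahead. The pattern below is written in regex-literal notation with the g and m flags; to use it in R, double the backslashes and pass perl = TRUE.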
/^[^\s]+(?=\s)\s+/gm
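A possible R translation of that pattern is sketched below. It assumes the PCRE engine (perl = TRUE), since lookaheads are not available in R's default regex engine; the variable names are only for illustration.

```r
x <- "64400/GL WORKERS COMPENSATION"

# Same pattern with doubled backslashes; perl = TRUE enables the lookahead
res <- sub("^[^\\s]+(?=\\s)\\s+", "", x, perl = TRUE)

print(res)
```

The lookahead is not strictly needed here, because the \\s+ that follows already requires whitespace, so this behaves like the simpler ^\\S+\\s+.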
I have nearly 100,000 rows of scraped data that I have converted to data frames. One column is a character string, but it is behaving strangely. In the example below, the text contains bracketed information that I want to remove, and I also want to remove " (c)". However, the space in front is not technically a space (is it considered whitespace?).
I am not sure how to reproduce the example here, because when I copy/paste a record it is treated as normal text and works, but in the scraped data it does not. As a gut check I counted the spaces and got 4, which means the space in front of ( is not a true space. I do not know how to remove it!
The code I would usually run is below. Again, it works here, but it does not work on my scraped data.
test<-c("Barry Windham (c) & Mike Rotundo (c)")
test<-gsub("[ ][(]c[)]","",test)
You can consider using:
test<-c("Barry Windham (c) & Mike Rotundo (c)")
gsub("(*UCP)\\s+\\(c\\)", "", test, perl=TRUE)
# => [1] "Barry Windham & Mike Rotundo"
Details
(*UCP) - makes all shorthand character classes in the PCRE regex (it is PCRE because of perl=TRUE) Unicode-aware
\\s+ - one or more Unicode whitespace characters
\\(c\\) - the literal (c) substring.
If you need to keep (c), capture it and use a backreference in the replacement:
gsub("(*UCP)\\s+(\\(c\\))", "\\1", test, perl=TRUE)
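To reproduce the problem without the scraped data, one option is to build the string with an explicit escape. The sketch below assumes the odd "space" is a non-breaking space (U+00A0), which is common in scraped HTML; the question does not confirm which character it actually is.

```r
# Assumption: the character in front of "(c)" is U+00A0 (non-breaking space)
test <- "Barry Windham\u00a0(c) & Mike Rotundo\u00a0(c)"

# Inspect the code point just before the first "(": a plain space would be 32
code_point <- utf8ToInt(substr(test, 14, 14))

# The plain-space pattern finds nothing to remove
unchanged <- gsub("[ ][(]c[)]", "", test)

# With (*UCP), \\s also covers Unicode whitespace such as U+00A0
cleaned <- gsub("(*UCP)\\s+\\(c\\)", "", test, perl = TRUE)

print(code_point)
print(cleaned)
```

Checking code points with utf8ToInt() on the real data is a quick way to find out which character you are actually dealing with.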
I am detecting substrings within reports and then adding suffix words to the end of the reports depending on whether the substring is present or absent. Shorter words are dangerous as they are usually parts of longer words. Example: ear and overbearing. A leading space tends to be a reasonable solution. Therefore, instead of searching for the substring 'ear' I will use ' ear'. Note the white space in front of the substring, and no white space at the end of the substring, as I don't want to miss the plural ears.
The problem is when the 1st word in the entire report is Ear. There is no leading white space.
I tried to solve the problem with the stringr library by adding a space to the beginning of each report, but the text is returned unchanged.
library(stringr)
Data$Fail <- str_pad(Data$text, width = 1, side = "left")
Data$Fail <- str_pad(Data$text, width = 1, side = "left") didn't work because str_pad() pads a string up to a given minimum width, which you specified as width = 1, so it would only have inserted a space if the text were initially empty.
But if you just want to insert a space at the start of a string, you don't need a special library - text = paste("", text) would do.
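A minimal base R sketch of that suggestion follows. The reports vector is invented for illustration, and matching case-insensitively is an assumption on my part, made so that the capitalised "Ear" at the start of a report is found too.

```r
# Made-up reports, only for illustration
reports <- c("Ear pain reported", "overbearing manner", "pain in both ears")

# paste() uses sep = " " by default, so this prepends one space
padded <- paste("", reports)

# Search for " ear" regardless of case
has_ear <- grepl(" ear", padded, ignore.case = TRUE)

print(padded)
print(has_ear)
```

paste0(" ", reports) gives the same padded strings.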
Armali already answered your question (use paste('', text) to add a space in front of ear). Since you also want to match the Ear at the start of a sentence, you are better off using a regex, as pointed out by HO LI Pin.
pattern <- '(?<![A-Za-z])[Ee]ar'
This will only match E/ear if it is not preceded by another letter (it can thus still be preceded by things like _, ( and so on, but it is not clear from your question whether this is allowed or not). Then you can use either base R or, more simply, the stringr library to find all matches of this regex pattern:
library(stringr)
# [A-Za-z] rather than [A-z]: the range A-z also spans [ \ ] ^ _ and `
pattern <- '(?<![A-Za-z])[Ee]ar'
text <- 'Ear this is some nice text as you can hear with your ear about overbearing'
unlist(str_extract_all(text, pattern, simplify = FALSE))
Which will give you:
[1] "Ear" "ear"
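For the base R route mentioned above, a sketch along these lines should behave the same way. It uses perl = TRUE because the lookbehind needs the PCRE engine, and it writes the letter class as [A-Za-z]; the second test vector is made up.

```r
pattern <- "(?<![A-Za-z])[Ee]ar"
text <- "Ear this is some nice text as you can hear with your ear about overbearing"

# gregexpr() locates all matches, regmatches() extracts them
found <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# grepl() gives a present/absent flag per report
flags <- grepl(pattern, c("Ear ache", "overbearing"), perl = TRUE)

print(found)
print(flags)
```

The logical vector from grepl() is probably what you want for deciding which suffix to add to each report.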
I'm trying to use stringr or R base calls to conditionally add a white-space for instances in a large vector where there is a numeric value then a special character - in this case a $ sign without a space. str_pad doesn't appear to allow for a reference vectors.
For example, for:
$6.88$7.34
I'd like to add a whitespace after the last number and before the next dollar sign:
$6.88 $7.34
Thanks!
If there is only one instance, use sub to capture the digit and the $ separately, and in the replacement add a space between the backreferences to the captured groups:
sub("([0-9])([$])", "\\1 \\2", v1)
#[1] "$6.88 $7.34"
Or with a regex lookaround
gsub("(?<=[0-9])(?=[$])", " ", v1, perl = TRUE)
data
v1 <- "$6.88$7.34"
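If a string can contain more than one such boundary, the same capture-group pattern can be used with gsub instead of sub. The vector v below is made up to show the difference.

```r
# Made-up values: one boundary, two boundaries, and one already spaced
v <- c("$6.88$7.34", "$1.00$2.00$3.00", "$5.00 $6.00")

# sub() only touches the first digit-dollar boundary in each string
first_only <- sub("([0-9])([$])", "\\1 \\2", v)

# gsub() handles every boundary
all_fixed <- gsub("([0-9])([$])", "\\1 \\2", v)

print(first_only)
print(all_fixed)
```

Values that already have a space are untouched, because the pattern requires the $ to follow a digit directly.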
This will work if you are working with a character vector:
mystring <- '$6.88$7.34 $8.34$4.31'
gsub("(?<=\\d)\\$", " $", mystring, perl = TRUE)
[1] "$6.88 $7.34 $8.34 $4.31"
This also handles cases where a space is already present: no extra space is added there.
Regarding the question asked in the comments:
mystring2 <- 'Regular_Distribution_Type† Income Only" "Distribution_Rate 5.34%" "Distribution_Amount $0.0295" "Distribution_Frequency Monthly'
gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", mystring2, perl = TRUE)
[1] "Regular_Distribution_Type<U+2020> Income_Only\" \"Distribution_Rate 5.34%\" \"Distribution_Amount $0.0295\" \"Distribution_Frequency_Monthly"
Note that the \ appears because of the nested quotes in the string and should not make a difference. Likewise, <U+2020> appears because of how the special character † is encoded when printed.
Explanation of regex:
(?<=[[:alpha:]]) This first part is a positive lookbehind, introduced by ?<=. It looks behind whatever we are trying to match to make sure that what we define in the lookbehind is there. In this case we are looking for [[:alpha:]], which matches an alphabetic character.
We then check for a whitespace character with \s; in R we have to double the escape, so \\s. This is what we are actually matching.
Finally we use (?=[[:alpha:]]+), a positive lookahead introduced by ?=, which checks that our match is followed by another letter, as explained above.
The logic is to find a blank space between letters and match only the space, which gsub then replaces with a _.
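One way to see why lookarounds are used here instead of ordinary capture groups: lookarounds do not consume the letters on either side, so adjacent matches can share a letter. The sketch below uses a made-up input to show the difference.

```r
x <- "a b c"

# Lookarounds only inspect the neighbouring letters, so both spaces match
with_lookaround <- gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]])", "_", x, perl = TRUE)

# Capture groups consume the letters: after "a b" is matched, the "b"
# is no longer available to start the next match
with_groups <- gsub("([[:alpha:]])\\s([[:alpha:]])", "\\1_\\2", x)

print(with_lookaround)
print(with_groups)
```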
I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parenthesized pattern we'd like to keep, since we might have several parenthesized elements. See the example below:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* is a wildcard, by the way: . matches any character and * repeats it zero or more times.
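The two approaches above give the same result on the sample data, but they can differ on other inputs. The sketch below uses base R's regmatches() in place of str_extract so that it runs without stringr, and the value in tricky is hypothetical, chosen to contain a T that is not part of T0.

```r
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

# Base R counterpart of str_extract(Target, "SS[^T]*")
extracted <- regmatches(Target, regexpr("SS[^T]*", Target))

# Capture-group version; sub() is enough since the pattern spans the whole string
captured <- sub(".*(SS.*)T0.*", "\\1", Target)

# Hypothetical value with a T right after SS
tricky <- "tes_7_SSTX_340T05"
tricky_extracted <- regmatches(tricky, regexpr("SS[^T]*", tricky))  # stops at the first T
tricky_captured <- sub(".*(SS.*)T0.*", "\\1", tricky)               # stops at T0

print(extracted)
print(tricky_extracted)
print(tricky_captured)
```

If a T can appear between SS and T0 in your data, the capture-group version is the safer choice.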
I have a lot of error messages that I am trying to clean up.
Some of the errors end with the text "(sec): 0.xxx".
I'm trying to use gsub to remove everything after (sec):
data$Message <- gsub("(sec).*", "", data$Message, perl = TRUE)
But this removes everything after the ( rather than everything after (sec).
I know it would be easy to just use ":" or ")", but then it affects other errors that I do not want to change.
Is there a way to use gsub to look at several characters, like "(sec)", instead of just one?
On a related note, is there a symbol that represents any number (excluding text), similar to "."?
You can use a regex lookbehind, ?<=, to keep sec) from being removed while still asserting that the removed text follows it, so (?<=sec\\)).* will remove everything after sec) but not sec) itself:
gsub("(?<=sec\\)).*", "", "(sec): 0.xxx", perl = TRUE)
# [1] "(sec)"
You can capture the first part of the expression (in parentheses) and drop the rest:
gsub('(^.*\\(sec\\)).*', '\\1', '(sec): 0.xxx')
## [1] "(sec)"
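To show why the original attempt cut too much, and to touch on the "any number" part of the question, here is a small sketch. The messages in msgs are invented; [0-9] (or \\d) is the usual way to match a digit.

```r
# Made-up messages, only for illustration
msgs <- c("Timeout (sec): 0.532", "Disk full: code 12")

# Unescaped parentheses form a group, so this matches "sec" and the rest,
# leaving the opening "(" behind
too_much <- sub("(sec).*", "", msgs)

# Escaped parentheses match the literal "(sec)"; the group keeps it
kept <- sub("(\\(sec\\)).*", "\\1", msgs)

# Stricter variant: only strip a trailing number made of digits and dots
strict <- sub("\\(sec\\): [0-9.]+$", "(sec)", msgs)

print(too_much)
print(kept)
print(strict)
```

Messages that do not contain (sec) are left unchanged in all three cases.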