Substring detection for 1st word - r

I am detecting substrings within reports and then adding suffix words to the end of the reports depending if the substring is present or absent. Shorter words are dangerous as they are usually parts of longer words. Example: ear and overbearing. The spacebar tends to be a reasonable solution. Therefore, instead of search for the substring 'ear' I will use ' ear'. Note the white space in front of the substring. And no white space at the end of the substring, as I don't want to miss the plural ears.
The problem is when the 1st word in the entire report is Ear. There is no leading white space.
I tried to solve the problem with library stringr but adding a space to the beginning of each report, but the text is returned unchanged.
(stringr)
Data$Fail <- str_pad(Data$text, width = 1, side = "left")

Data$Fail <- str_pad(Data$text, width = 1, side = "left") didn't work because str_pad() pads a string to a fixed length, which you specified as width = 1, so it would only have inserted a space if the text were initially empty.
But if you just want to insert a space at the start of a string, you don't need a special library - text = paste("", text) would do.

Armali already answered your question (use paste('',text)) to add a space in front of ear. Since you also want to match the Ear at the start of a sentence you can better use a regex as pointed out by HO LI Pin.
pattern <- '(?<![A-z])[Ee]ar'
This will only match E/ear if not preceded by any other letter (it can thus still be preceded by things like _ ,(, etc. but it is not clear from your question whether this is allowed or not. Then you can either use the base R or simpler the stringr library to search all matches using this regex pattern:
library(stringr)
pattern <- '(?<![A-z])[Ee]ar'
text = 'Ear this is some nice text as you can hear with your ear about overbearing'
unlist(str_extract_all(text, pattern, simplify = FALSE))
Which will give you:
[1] "Ear" "ear"

Related

How to remove characters between space and specific character in R

I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I were to do something like this: str_remove_all(myString, " .*jpg) I end up with
[1] "This"
I know that what's happening is R is finding the first instance of a space and removing everything between that space and ".jpg" but I want it to be the first space immediately before ".jpg". My final result I hope for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. the text continues here.
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.
You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the pattern part: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"

how to remove decimal point between numbers in R

I am trying to remove the decimal points in decimal numbers in R. Please note I want to keep the full stop of strings.
Example:
data= c("It's 6.00pm, and is late.")
I know that I have to use regex for this, but I am struggling. My desired output is:
"It's 6 00pm, and is late."
Thank you in advance.
Try this:
sub("(?<=\\d)\\.(?=\\d)", " ", data, perl = TRUE)
This solution uses lookbehind (?<=...) and lookahead (?=...)to assert that the period you wish to remove be enclosed by digits (thus avoiding matching the period at the sentence end). If you have several such cases within strings, then use gsubinstead of sub.
I suggest using a simple pattern to find the target text, then adding parenthesis to identify the parts of the matching text that you want to retain.
# Test data
data <- c("It's 6.00pm, and is late.")
The target pattern is a literal dot with a string of digits before and after it. \\d+ matches one or more digits and \\. matches a literal dot. Testing the pattern to see if it works:
grepl("\\d+\\.\\d+", data)
Result
TRUE
If we wanted too eliminate the whole thing we could do a simple replacement with an empty string. Testing if this targets the correct text:
sub("\\d+\\.\\d+", "", data)
Result
"It's pm, and is late."
Instead, to discard only a section of matched text we can identify the parts we want to keep, which is done by surrounding them with parenthesis. Once done we can refer to the captured text in the replacement. \\1 refers to the first chunk of text captured and \\2 refers to the second chunk of text, corresponding to the first and second sets of parenthesis
# pattern replacement
sub("(\\d+)\\.(\\d+)", "\\1\\2", data)
Result
[1] "It's 600pm, and is late."
This effectively removes the dot by omitting it from the replacement text.

How to prevent code from detecting and pulling patterns within words (Example: I want 'one' detected but not 'one' in the word al'one')?

I have this code that is meant to add highlights to some numbers in a text stored in "lines"
stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})
where nums is the following pattern being deteced
nums<-(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+\\s?(Hundred|Thousand|Million|Billion|Trillion)?'
The problem I'm having is that the line of code above also leads to numbers embedded in words also being detected. In the following text this happens:
Get <<ten>> eggs. That is what is writ<<ten>>. I am <<one>> and al<<one>>.
when it should be:
Get <<ten>> eggs. That is what is written. I am <<one>> and alone.
I don't want to remove the question mark after the \s because I want to detect both numbers like "One" followed by no space and "One Hundred" which has a space in between.
Does anyone know how to do this?
Surround (Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+ with \b.
\b matches word boundaries, so this expression will newer match inside a word.

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitepace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
[#1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve your replacement using single gsub because in one of your requirement, you want to remove all lowercase letters starting from the beginning, and your second requirement is to introduce a space before every capital letter except the first capital letter of the resultant string after removing all lowercase letters from the beginning of text.
Doing it in single gsub call would have been possible in cases where somehow we can re-use some of the existing characters to make the conditional replace which can't be the case here. So in first step, you can use ^[a-z]+ regex to get rid of all lowercase letters only from the beginning of string,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
And next step you can use this (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one as you might not want an extra space before your sentence. But you can combine both and write them as this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see, in the demo, every capital letter is preceded by a pink mark which is the place where a space will get inserted. But we don't want space to be inserted before the very first letter as that is not needed. Hence we need a condition in regex, which will not select the first capital letter which appears at the start of string. And for that, we need to use a negative look behind (?<!^) which means that Do not select the position which is preceded by start of string and hence this (?<!^) helps in discarding the upper case letter that is preceded by just start of string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an altenative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next uppercase into Group 1 that will be accessed by passing n argument to the anonymous function. If n length is non-zero, this alternative matched and the we need to replace with this value. Else, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what #pushpesh-kumar-rajwanshi and #akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't thinkperl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

remove/replace specific words or phrases from character strings - R

I looked around both here and elsewhere, I found many similar questions but none which exactly answer mine. I need to clean up naming conventions, specifically replace/remove certain words and phrases from a specific column/variable, not the entire dataset. I am migrating from SPSS to R, I have an example of the code to do this in SPSS below, but I am not sure how to do it in R.
EG:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column with 32,000 cases, and what needs replacing/removing varies and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), meaning it's much faster to code what needs to be removed/replaced, it's not as simple or clean as a regular expression to remove all spaces, all characters after a specific word or character, all special characters, etc. And it must include leading spaces.
I have looked at the replace() gsub() and other similar commands in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can include nothing at all, and if it does not find the specific characters, the case is unchanged.
Yes, I will end up repeating the same syntax many times, it's probably easier to create a vector but if possible I'd like to get the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
Legend:
^ Start of pattern.
() Group (or token).
\w* One or more occurrences of word character more than 1 times.
.* one or more occurrences of any character except new line \n.
$ end of pattern.
\1 Returns group from regexp
Maybe I'm missing something but I don't see why you can't simply use conditionals in your regex expression, then trim out the annoying white space.
string <- c("Arcadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Arcadia" "Fifth"
dataframename$varname <- gsub(" Parish","", dataframename$varname)

Resources