Removing a specific first item in a string in R

I have strings such as:
'THE HOUSE'
'IN THE HOUSE'
'THE THE HOUSE'
And I would like to remove 'THE' only if it occurs at the first position in the string.
I know how to remove 'THE' with:
gsub("\\<THE\\>", "", string)
And I know how to grab the first word with:
"([A-Za-z]+)" or "([[:alpha:]]+)" or "(\\w+)"
But I have no idea how to combine the two to end up with:
'HOUSE'
'IN THE HOUSE'
'THE HOUSE'
Cheers!

You may use
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE")
sub("^THE\\b\\s*", "", string)
## => [1] "HOUSE" "IN THE HOUSE" "THE HOUSE"
See the regex demo and an online R demo.
Details
^ - start of string
THE - a literal substring
\\b - a word boundary (you may keep \\> trailing word boundary if you wish)
\\s* - 0+ whitespace chars.
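To see what the ^ anchor changes, the anchored pattern can be compared with the unanchored one from the question. This is only a sketch on the same sample vector; the helper name drop_leading_word is hypothetical, and it assumes the word passed in contains no regex metacharacters.

```r
string <- c("THE HOUSE", "IN THE HOUSE", "THE THE HOUSE")

# Unanchored: gsub removes every standalone THE, wherever it occurs
gsub("\\<THE\\>", "", string)

# Anchored: sub removes THE (plus trailing whitespace) only at the start
sub("^THE\\b\\s*", "", string)

# Hypothetical helper that builds the same pattern for any leading word.
# Assumes `word` contains no regex metacharacters.
drop_leading_word <- function(x, word) {
  sub(paste0("^", word, "\\b\\s*"), "", x)
}

drop_leading_word(string, "THE")
```

Because sub replaces only the first match and the pattern is anchored at the start, the second THE in 'THE THE HOUSE' is left alone.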

Related

Regular expression meaning in R : "( \n|\n )"

I am new to R programming and was trying out the gsub function for text replacement in a pandas dataframe series (i.e. new_text).
It is a vast series, so I will not be able to print it all here.
It is just a series of strings containing postal addresses.
I came across this gsub code: gsub(pattern = "( \n|\n )", replacement = " ", x = new_text) -> new_text
Can you please let me know the meaning of this regular expression, as well as the Python alternative using a regular expression?
Your pattern, slightly rewritten, is [ ]\n|\n[ ], which says to match:
[ ]\n a space followed by a newline
| OR
\n[ ] a newline followed by a space
Note that you might be able to use [ ]?\n[ ]? to the same effect, depending on the actual text you are using with gsub.
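A small sketch may make the difference between the two patterns concrete. The address strings below are made up for illustration and are not taken from the question's data.

```r
# Hypothetical sample addresses containing line breaks
new_text <- c("12 Main St \nSpringfield", "34 Oak Ave\n Shelbyville")

# Original pattern: a newline is replaced only when a space touches it
cleaned <- gsub(pattern = "( \n|\n )", replacement = " ", x = new_text)
cleaned

# A bare newline with no adjacent space is left untouched by the original pattern
bare_original <- gsub("( \n|\n )", " ", "a\nb")

# The optional-space variant also replaces a bare newline
bare_optional <- gsub("[ ]?\n[ ]?", " ", "a\nb")
```

Whether the optional-space variant is a drop-in replacement therefore depends on whether bare newlines in the data should also be turned into spaces.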

R regex match whole word taking punctuation into account

I'm in R. I want to match whole words in text, taking punctuation into account.
Example:
to_match = c('eye','nose')
text1 = 'blah blahblah eye-to-eye blah'
text2 = 'blah blahblah eye blah'
I would like eye to be matched in text2 but not in text1.
That is, the command:
to_match[sapply(paste0('\\<',to_match,'\\>'),grepl,text1)]
should return character(0). But right now, it returns eye.
I also tried with '\\b' instead of '\\<', with no success.
Use
to_match[sapply(paste0('(?:\\s|^)',to_match,'(?:\\s|$)'),grepl,text1)]
The point is that word boundaries match between a word char and a non-word char, which is why you had a match in eye-to-eye. You want to match only when the word is delimited by whitespace or by the start or end of the string.
In a TRE regex, this is better done with groups, as this regex library does not support lookarounds, and you only need to test a string for a single pattern match to return true or false.
The (?:\s|^) noncapturing group matches any whitespace or start of string and (?:\s|$) matches whitespace or end of string.
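As a sketch, the approach can be wrapped in a small function and run against both sample texts from the question. The name match_whole is hypothetical, and the function assumes the words contain no regex metacharacters.

```r
to_match <- c("eye", "nose")
text1 <- "blah blahblah eye-to-eye blah"
text2 <- "blah blahblah eye blah"

# Hypothetical helper: keep the words that occur delimited by
# whitespace or by the start/end of the string
match_whole <- function(words, text) {
  patterns <- paste0("(?:\\s|^)", words, "(?:\\s|$)")
  words[sapply(patterns, grepl, text)]
}

match_whole(to_match, text1)  # no match: eye is attached to a hyphen
match_whole(to_match, text2)  # eye is surrounded by spaces
```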

Only select last part of a string after the last point

I have a dataframe with one column that represents the requests made by my users. A few examples look like this:
GET /enviro/html/tris/tris_overview.html
GET /./enviro/gif/emcilogo.gif
GET /docs/exposure/meta_exp.txt.html
GET /hrmd/
GET /icons/circle_logo_small.gif
I only want to select the last part of the string after the last "." in such a way that I return the pagetype of the string. The output of these lines should therefore be:
.html
.gif
.html
.gif
I tried doing this with sub but I only manage to select everything after the first "." example:
tring <- c("GET /enviro/html/tris/tris_overview.html", "GET /./enviro/gif/emcilogo.gif", "GET /docs/exposure/meta_exp.txt.html", "GET /hrmd/", "GET /icons/circle_logo_small.gif")
sub("^[^.]*", "", sapply(strsplit(tring, "\\s+"), `[`, 2))
this returns:
".html"
"./enviro/gif/emcilogo.gif"
".txt.html"
""
".gif"
I created the following gsub code that works for string containing two points:
gsub(pattern = ".*\\.", replacement = "", "GET /./enviro/gif/finds.gif")
this returns:
"gif"
However, I can't seem to come up with one gsub/sub that works for all possible input. It should read the string from right to left, stop when it sees the first ".", and return everything that was found after that ".".
I am new to R and I can't come up with something that is doing this. Any help would be highly appreciated!
You can't change the string parsing direction with R regex. Instead, you may match all up to . and remove it, or match the . that has no . chars to the right of it till the end of string.
string <- c('GET /enviro/html/tris/tris_overview.html','GET /./enviro/gif/emcilogo.gif','GET /docs/exposure/meta_exp.txt.html','GET /hrmd/','GET /icons/circle_logo_small.gif')
res <- regmatches(string, regexec("\\.[^.]*$", string))
res[lengths(res)==0] <- ""
unlist(res)
Or
sub("^(.*(?=\\.)|.*)", "", string, perl=TRUE)
See the R online demo. Both return
[1] ".html" ".gif" ".html" "" ".gif"
Here, \.[^.]*$ matches a . and then any 0+ chars other than . till the end of string. The sub code used ^(.*(?=\\.)|.*) pattern that matches the start of string, then either any 0+ chars as many as possible till . without consuming the dot, or just matches any 0+ chars as many as possible, and replaces the match with an empty string.
See Regex 1 and Regex 2 demos.
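Since the question mentions a dataframe column, here is a minimal sketch of the first pattern applied to one. The helper name get_extension and the column names request and pagetype are assumptions made for illustration.

```r
# Hypothetical helper: return the last "." and everything after it,
# or "" when the string contains no "."
get_extension <- function(x) {
  pos <- regexpr("\\.[^.]*$", x)   # position of the last dot, -1 if none
  out <- rep("", length(x))
  hit <- pos > 0
  out[hit] <- substring(x[hit], pos[hit])
  out
}

# Assumed data frame layout: one request per row
requests <- data.frame(
  request = c("GET /enviro/html/tris/tris_overview.html",
              "GET /./enviro/gif/emcilogo.gif",
              "GET /docs/exposure/meta_exp.txt.html",
              "GET /hrmd/",
              "GET /icons/circle_logo_small.gif"),
  stringsAsFactors = FALSE
)
requests$pagetype <- get_extension(requests$request)
requests$pagetype
```

Using regexpr rather than regmatches keeps the result the same length as the input, so it can be assigned straight back to the data frame.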
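Here is a solution that splits on the dot instead (note that it returns the extension without the leading dot); a is the vector of request strings:
sapply(
  seq_along(a),
  function(i) {
    # last dot-separated piece, or "" when the string contains no dot
    if (grepl("\\.", a[i])) tail(strsplit(a[i], "\\.")[[1]], 1) else ""
  }
)
# [1] "html" "gif" "html" "" "gif"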

REGEX: Remove middle of string after certain number of "/"

How do I remove the middle of a string using regex? I have the following url:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml
but I want it to look like this:
https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml
I can get rid of everything after "data/../../"
That last long string of numbers isn't needed.
I tried this
sub(sprintf("^((?:[^/]*;){8}).*"),"", URLxml)
But it doesn't do anything! Help please!
To remove the last but one subpart of the path, you may use
x <- "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml"
sub("^(.*/).*/(.*)", "\\1\\2", x)
## [1] "https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/exh1025730032017.xml"
See the online R demo and here is a regex demo.
Details:
^ - start of a string
(.*/) - Group 1 (referred to with \1 from the replacement string) any 0+ chars up to the last but one /
.*/ - any 0+ chars up to the last /
(.*) - Group 2 (referred to with \2 backreference from the replacement string) any 0+ chars up to the end.
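The greedy behaviour described in these details is easier to follow on a short path. The following is a sketch only; the function name drop_penultimate_segment and the sample paths are hypothetical.

```r
# Hypothetical helper: drop the second-to-last segment of a "/"-separated path
drop_penultimate_segment <- function(path) {
  # Group 1 greedily takes everything up to the last-but-one "/",
  # the ungrouped .*/ consumes the segment to discard,
  # and group 2 keeps whatever follows the last "/".
  sub("^(.*/).*/(.*)", "\\1\\2", path)
}

short <- drop_penultimate_segment("a/b/c/d.xml")

# With fewer than two "/" the pattern cannot match, so the input is returned as-is
single <- drop_penultimate_segment("only/one.xml")
```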
a<-'https://www.sec.gov/Archives/edgar/data/1347185/000134718517000016/0001347185-17-000016-index.htm/exh1025730032017.xml'
gsub('data/(.+?)/(.+?)/(.+?)/','data/\\1/\\2/',a)
so in the url:
data/.../.../..(this is removed)../ ....

How can I delete every single letter of a row after a certain character in R?

I am having a problem cleaning transaction data. I have an Excel file with every single transaction that clients make, with the number, the gloss, and the industry code. I convert this Excel file to text separated by ";", then I only need to clean the gloss and convert it back into an Excel file.
tolower(tabla1)
lapply(tabla1, tolower)
tabla1[] <- lapply(tabla1, tolower)
str(tabla1)
tabla1
tabla1_texto <- gsub("[.]", "", tabla1)
tabla1_texto <- gsub("[(]", " ", tabla1_texto)
I know that I need to use gsub(), but I'm not sure how to use it. On the other hand, does anyone know how to build a proper dictionary so that I keep only certain words and delete every other word?
If you have a string like this one:
string <- "Some text here; and some text here; and some more text here"
Then you can delete everything after the first ; with:
gsub(";.*$", "", string)
[1] "Some text here"
Explanation of ;.*$ which you will be substituting for "" (empty string):
starting with ;
any character . zero or more times *
up until the end of the line $
If you have a table, apply this to the relevant column: gsub works on a whole character vector at once, so every row is processed in a single call.
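Note that sub() and gsub() are vectorized over x, so in practice a whole column can be cleaned in one call. The sketch below uses made-up data; the data frame and its column names are assumptions, not the asker's actual table.

```r
# Hypothetical sample data; column names are assumptions
tabla <- data.frame(
  id = 1:3,
  gloss = c("pago tienda; ref 123", "compra online", "retiro; cajero; 456"),
  stringsAsFactors = FALSE
)

# sub() is enough here: the pattern runs to the end of the string,
# so it can match at most once per element
tabla$gloss_clean <- sub(";.*$", "", tabla$gloss)
tabla$gloss_clean
```

Strings without a ";" are returned unchanged.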
