Put spaces around all punctuation but excluding apostrophes - r

I'm new to this, so I'm sorry if this is a stupid question. I need help with a bit of R code.
I have a bit of code (below) which puts a space around every punctuation mark in all the txt files in a folder. It works, but I don't want it to add spaces around apostrophes (').
Can anybody help me exclude apostrophes in the gsub("(\\.+|[[:punct:]])", " \\1 ", ...) part?
Or is there another way to do it (with [^ ?)
I get this:
"I want : spaces around all these marks ; : ! ? . but i didn ’ t want it there in didn ’ t"
I want this:
"I want : spaces around all these marks ; : ! ? . but i didn’t want it there in didn’t"
for(file in filelist){
tx=readLines(file)
tx2=gsub("(\\.+|[[:punct:]])", " \\1 ", tx)
writeLines(tx2, con=file)
}

You can use
tx <- "I want: spaces around all these marks;:!?.but i didn’t want it there in didn't"
gsub("\\s*(\\.+|[[:punct:]])(?<!\\b['’]\\b)\\s*", " \\1 ", tx, perl=TRUE)
## => [1] "I want : spaces around all these marks ; : ! ? . but i didn’t want it there in didn't"
The perl=TRUE only means that the regex is handled with the PCRE library (note that PCRE regex engine is not the same as Perl regex engine).
See the R demo online and the regex demo.
Details:
\s* - zero or more whitespaces
(\.+|[[:punct:]]) - Group 1 (\1): one or more dots, or a punctuation char
(?<!\b['’]\b) - immediately on the left, there must be no ' or ’ enclosed with word chars
\s* - zero or more whitespaces
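As a minimal sketch of the lookbehind approach described above, the snippet below wraps the pattern in a small helper. The name space_punct is hypothetical, and the sketch only handles the straight apostrophe (') to keep the example ASCII-only; it is an illustration under those assumptions, not the answer's exact code.

```r
# Hypothetical helper (the name is illustrative, not from the answer above).
# Pads every punctuation mark with spaces, except a straight apostrophe
# that sits between two word characters.
space_punct <- function(x) {
  gsub("\\s*(\\.+|[[:punct:]])(?<!\\b'\\b)\\s*", " \\1 ", x, perl = TRUE)
}

res <- space_punct("don't stop,now!")
print(res)
# [1] "don't stop , now ! "
```

The apostrophe in "don't" is left alone because the negative lookbehind rejects a ' that has a word boundary on both sides, while the comma and the exclamation mark are padded as usual.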

We can match the ’ and skip it with (*SKIP)(*FAIL), so that only the other punctuation marks are matched and padded:
gsub("’(*SKIP)(*FAIL)|([[:punct:].])", " \\1 ", tx, perl = TRUE)
-output
[1] "I want : spaces around all these marks ; : ! ? . but i didn’t want it there in didn’t"
data
tx <- "I want:spaces around all these marks;:!?. but i didn’t want it there in didn’t"
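To connect the (*SKIP)(*FAIL) idea back to the file loop in the question, here is a hedged sketch. The helper name pad_punct, the temporary folder, and the sample file are assumptions made up for the demonstration; the pattern skips both the straight apostrophe and the curly one (written as \u2019).

```r
# Hypothetical helper: pad punctuation with spaces, leaving ' and the
# curly apostrophe (U+2019) untouched.
pad_punct <- function(x) {
  gsub("['\u2019](*SKIP)(*FAIL)|([[:punct:]])", " \\1 ", x, perl = TRUE)
}

# Demo folder with a single file, created only for this sketch.
demo_dir <- file.path(tempdir(), "pad_punct_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "sample.txt")
writeLines("it's here;now", demo_file)

# Same loop shape as in the question: read, transform, write back.
filelist <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
for (file in filelist) {
  tx <- readLines(file)
  writeLines(pad_punct(tx), con = file)
}

result <- readLines(demo_file)
print(result)
# [1] "it's here ; now"
```

Because the loop overwrites each file in place, it may be worth trying it on a copy of the folder first.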

Related

Compressing multiple line breaks \n and tabs \t from a string in R

I tried to use
gsub('(\t\\n)+','\n',.)
function to compress multiple \n and \t into only \n, but it didn't work.
I'm kinda confused by regex, so can anyone help me?
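If your question is how to convert consecutive "\n"s and "\t"s (possibly mixed together) into a single "\n", then the following would work. Note that the replacement is a real newline, "\n": in a replacement string, "\\n" is read as an escaped n and inserts the letter n.
gsub("(\\t|\\n)+","\n",inputStr)
If '\t' (a real TAB character) should be included as well, then
gsub("(\t|\\t|\\n)+","\n",inputStr)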
You can use
gsub("\t*\n[\t\n]*", "\n", x)
This will replace all sequences of tabs and newlines where one newline char is obligatory with a single newline (LF) char.
See an R demo online:
x <- "A\tB\t\n\n\n\nD\tF"
gsub("\t*\n[\t\n]*", "\n", x)
## => [1] "A\tB\nD\tF"
Details:
\t* - zero or more tabs
\n - an LF, newline char
[\t\n]* - zero or more TAB or LF chars.
If you need to include CR as an optional char, use
gsub("[\t\r]*\n[\t\r\n]*", "\n", x)
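A small sketch of the CR-aware variant, using a made-up input with Windows-style line endings. It also shows a pitfall worth knowing: in a gsub replacement string, a backslash followed by a non-digit simply yields that character, so "\\n" inserts the letter n rather than a newline.

```r
# Made-up input mixing tabs, CRLF and LF line breaks.
x <- "A\tB\t\r\n\r\n\nD\tF"

# Collapse every run of tabs/CRs/LFs that contains at least one LF.
collapsed <- gsub("[\t\r]*\n[\t\r\n]*", "\n", x)
print(collapsed)
# [1] "A\tB\nD\tF"

# Pitfall: "\\n" in the replacement is an escaped "n", not a newline.
wrong <- gsub("[\t\n]+", "\\n", "a\n\nb")
print(wrong)
# [1] "anb"
```

Tabs that are not adjacent to a line break (as between A and B) are left alone, because the pattern requires one newline in every match.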

Remove number between end of row and new space

I'm trying to remove the numbers at the beginning of a row inside quotation marks.
> g<-"My name is Paul.\nI like playing football.\n\"55012\" And that's all."
> cat(g)
My name is Paul.
I like playing football.
"55012" And that's all.
> gsub("[\r\n]\"+[[:digit:]][^[[:space:]]]*"," ",g)
[1] "My name is Paul.\nI like playing football. 012\" And that's all."
This should work, but I don't know why only \n"55 is being replaced and not the entire number.
Your negated bracket expression contains redundant brackets. [^[[:space:]]] is parsed as the bracket expression [^[[:space:]] followed by a literal ]: it matches any char other than [ and whitespace, and then a ] char.
However, even that is not enough to fully fix the issue.
You may use
gsub("(^|\n)\"+[0-9]+\"+\\s*","\\1", g)
See the R demo
Pattern details
(^|\n) - start of string or a newline captured in Group 1 (referred to with \1 from the replacement pattern)
\"+ - one or more double quotes
[0-9]+ - 1+ digits
\"+ - one or more double quotes
\s* - 0+ whitespaces.
See the regex demo
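For reference, this is the suggested pattern run against the string from the question, as a short sketch. Only the variable name cleaned is added here.

```r
g <- "My name is Paul.\nI like playing football.\n\"55012\" And that's all."

# Remove a quoted number at the start of the string or right after a newline,
# keeping the newline itself via the backreference \\1.
cleaned <- gsub("(^|\n)\"+[0-9]+\"+\\s*", "\\1", g)
cat(cleaned)
# My name is Paul.
# I like playing football.
# And that's all.
```

Capturing (^|\n) and putting it back is what keeps the line break in place while the quoted number and the whitespace after it are dropped.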

Replace single quotes without changing the apostrophe

I have a data frame with a column Title, and I want to replace the single quotes with double quotes without changing the apostrophe. For example, 'I don't go to work tomorrow' should become "I don't go to work tomorrow".
I tried like this:
gsub("(\\w'\\w+) |, ", "\\1", "I don't go to work tomorrow")
I have tried a couple of ways, but have not got the expected result.
I also tried str_replace_all() from stringr, but it replaces every ' with ". Any recommendation would be appreciated.
I think your rule is perhaps as simple as: if an apostrophe has something (non-space) before and after it, then don't replace it; otherwise, replace it.
gsub("^'|(?<= )'|'(?= )|'$", '"', "'I don't go to work tomorrow'", perl = TRUE)
# [1] "\"I don't go to work tomorrow\""
(Updated so that it does not consume the preceding/following space, if present.)
Patterns
To match an apostrophe only at the start/end of the string:
^'|'$
See the regex demo
If the apostrophe should be matched only outside a word, you may use
\b'\b(*SKIP)(*FAIL)|'
See this regex demo. Here, the ' is matched only if it is not enclosed on both ends with letters, digits or underscores since all ' that are enclosed with word chars are skipped/failed.
If you need to match a ' only when it is not between two letters, use
'(?!(?<=[A-Za-z]')[A-Za-z]) # ASCII only
'(?!(?<=\p{L}')\p{L}) # Any Unicode letters
See this regex demo.
Usage
gsub("^'|'$", '"', "'I don't go to work tomorrow 2'5.'")
## => "I don't go to work tomorrow 2'5."
gsub("\\b'\\b(*SKIP)(*FAIL)|'", '"', "'I don't go to work tomorrow 2'5.'", perl=TRUE)
## => "I don't go to work tomorrow 2'5."
gsub("'(?!(?<=\\p{L}')\\p{L})", '"', "'I don't go to work tomorrow 2'5.'", perl=TRUE)
## => "I don't go to work tomorrow 2"5."
See the R demo online.
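Since the question is about a data frame column, here is a sketch of applying one of the patterns above to a whole column. The data frame df and its two titles are made up for illustration; only the column name Title comes from the question.

```r
# Hypothetical data; only the column name Title comes from the question.
df <- data.frame(
  Title = c("'I don't go to work tomorrow'", "'It's fine'"),
  stringsAsFactors = FALSE
)

# gsub is vectorised, so the whole column is processed in one call.
# Apostrophes enclosed by word characters are skipped; the rest become ".
df$Title <- gsub("\\b'\\b(*SKIP)(*FAIL)|'", '"', df$Title, perl = TRUE)
print(df$Title)
# [1] "\"I don't go to work tomorrow\"" "\"It's fine\""
```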

R utf-8 and replace a word from a sentence based on ending character

I am working with a large data set that contains double-byte characters (Korean text). I want to look for a character and replace it. To display the Korean text correctly in the browser I changed the locale settings in R, but I am not sure whether that also applies to the code. Below is my code to change the locale to Korean; the Korean text shows properly in the viewer, but printing in the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is a data.table with a column of Korean text. Example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the first word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to match the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character, I wrote \\b시. Is this not the right way to write it? And how do I handle the () at the end of the sentence?
What would be a good reference for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
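To see the recursion on its own, the sketch below runs the same pattern on ASCII stand-ins. The sample strings are invented for the demonstration and are not from the question; they only mirror its shape (text followed by nested parentheses at the end).

```r
# Invented ASCII stand-ins with the same shape as the Korean address.
addr   <- "Main St 49 (south,(basement))"
deeper <- "x (a(b(c)))"

# (?1) re-enters Group 1, so any depth of balanced nesting is consumed.
pattern <- "\\s*(\\((?:[^()]++|(?1))*\\))$"

res_addr   <- sub(pattern, "", addr,   perl = TRUE)
res_deeper <- sub(pattern, "", deeper, perl = TRUE)
print(res_addr)
# [1] "Main St 49"
print(res_deeper)
# [1] "x"
```

A plain pattern such as \\(.*\\)$ would also remove these, but it could swallow earlier, unrelated parentheses; the recursive form only matches one balanced group that ends the string.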
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

R regex: remove apostrophes except the ones preceded and followed by a letter

I'm cleaning a text and I'd like to remove every apostrophe except the ones preceded and followed by letters, as in i'm, i'll, he's, etc.
I have the following preliminary solution, which handles many cases, but I want a better one:
rmAps <- function(x) gsub("^\'+| \'+|\'+ |[^[:alpha:]]\'+(a-z)*|\\b\'*$", " ", x)
rmAps("'i'm '' ' 'we end' '")
[1] " i'm we end "
I also tried:
(?<![a-z])'(?![a-z])
But I think I am still missing something.
gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)
#[1] "i'm we end "
Remove occasions when your character is not followed by a word character: '(?!\\w).
Remove occasions when your character is not preceded by a word character: (?<!\\w)'.
If either of those situations occur, you want to remove it, so '(?!\\w)|(?<!\\w)' should do the trick. Just note that \\w includes the underscore, and adjust as necessary.
Another option is
gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)
In this case, you match any instances when ' is surrounded by word characters: \\w'\\w, and then force that match to fail with (*SKIP)(*FAIL). But, also look for ' using |'. The result is that only occurrences of ' not wrapped in word characters will be matched and substituted out.
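A short side-by-side sketch of the two options above, on made-up inputs. It also shows one case where they differ: because \\w'\\w consumes the word character after the apostrophe, a second apostrophe that follows immediately (as in a'b'c) is no longer protected by the (*SKIP)(*FAIL) version.

```r
# Made-up input: inner apostrophes should stay, outer ones should go.
x <- "'i'm' he's 'ok'"

by_lookaround <- gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)
by_skip_fail  <- gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)
print(by_lookaround)
# [1] "i'm he's ok"
print(by_skip_fail)
# [1] "i'm he's ok"

# Edge case: two apostrophes sharing a neighbouring letter.
y <- "a'b'c"
chained_lookaround <- gsub("'(?!\\w)|(?<!\\w)'", "", y, perl = TRUE)
chained_skip_fail  <- gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", y, perl = TRUE)
print(chained_lookaround)
# [1] "a'b'c"
print(chained_skip_fail)
# [1] "a'bc"
```

If inputs like a'b'c matter, the lookaround version is the safer choice, since lookarounds check the neighbouring characters without consuming them.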
You can use the following regular expression:
(?<=\w)'(?=\w)
(?<=) is a positive lookbehind. Everything inside needs to match before the next selector
(?=) is a positive lookahead. Everything inside needs to match after the previous selector
\w any alphanumeric character and the underscore
You could also switch \w to e.g. [a-zA-Z] if you want to restrict the results.
→ Here is your example on regex101 for live testing.
