replace white space outside quotes - r

I am working in R. I am trying to import data, some use tab and some use white space between columns. My code looks like this:
gsub("[[:blank:]]", ";",
readLines(paste(PathToRecipes,recipe.files[i],sep="/")))
I read the lines of each file, and gsub replaces all white space with a semicolon. I then pass the result to write.table to produce a new set of files that I can re-import into a data frame.
The problem is I get this:
"1";6;"medium;(2-1/4\";to;3-1/4\";dia.)";"Potatoes,;boiled,;cooked;without;skin,;flesh,;without;salt";11367
When I should get:
"1";6;"medium (2-1/4\" to 3-1/4\" dia.)";"Potatoes, boiled, cooked without skin, flesh, without salt";11367
There is text within quotes where white space should not be replaced with ";". How can I tell it to avoid quotes?

If the whitespace inside the quoted text is never a tab character, you can keep your current approach but use \\t instead of [[:blank:]], so that only the tab characters are replaced rather than all white space.
gsub("\\t", ";",readLines(paste(PathToRecipes,recipe.files[i],sep="/")))
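As a rough sketch of this tab-only approach, the record below is made up (it is not taken from the asker's files): its fields are separated by tabs, while the quoted text contains ordinary spaces.

```r
# Hypothetical tab-separated record; spaces occur only inside the quoted fields.
line <- "\"1\"\t6\t\"medium potato\"\t\"Potatoes, boiled, without salt\"\t11367"

# Replace only the tab characters; the spaces inside the quotes are left alone.
converted <- gsub("\\t", ";", line)
cat(converted, "\n")
```

This does nothing for files whose columns are separated by spaces, so those still need a quote-aware pattern.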

You may use a PCRE regex with a SKIP-FAIL technique:
(*UCP)"[^"\\]*(?:\\.[^"\\]*)*"(*SKIP)(*F)|\s+
See the regex demo. Your data seems to contain only paired double quotes, so the above pattern is sufficient. Otherwise, add a bit more to it:
(*UCP)(?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"(*SKIP)(*F)|\s+
Details
(*UCP) - make \s Unicode aware
(?<!\\) - no \ immediately to the left of the current location is allowed
(?:\\{2})* - 0+ sequences of double backslashes
" - a " char
[^"\\]* - zero or more chars other than " and \
(?:\\.[^"\\]*)* - zero or more sequences of any escape sequence and then zero or more chars other than " and \
" - a " char
(*SKIP)(*F) - discard the current match and resume the search for the next match from the current index (where the skipped match ended)
| - or
\s+ - matches 1 or more whitespace in any other context.
R demo:
rx <- '(*UCP)(?<!\\\\)(?:\\\\{2})*"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"(*SKIP)(*F)|\\s+'
x <- c('"1"; 6; "medium (2-1/4\\" to 3-1/4\\" dia.)"; "Potatoes, boiled, cooked without skin, flesh, without salt"; 11367')
cat(gsub(rx, '', x, perl=TRUE))
## => "1";6;"medium (2-1/4\" to 3-1/4\" dia.)";"Potatoes, boiled, cooked without skin, flesh, without salt";11367
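The demo above replaces with an empty string because its input already contains semicolons. For the original task, a minimal sketch (assuming a made-up line that uses spaces between columns) would pass ";" as the replacement instead:

```r
# Same pattern as above: quoted substrings are matched and skipped, so only
# whitespace outside them is left for the \s+ alternative.
rx <- '(*UCP)(?<!\\\\)(?:\\\\{2})*"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"(*SKIP)(*F)|\\s+'

# Hypothetical input line with spaces (not semicolons) between the columns.
x <- '"1" 6 "medium (2-1/4\\" to 3-1/4\\" dia.)" "Potatoes, boiled" 11367'

converted <- gsub(rx, ";", x, perl = TRUE)
cat(converted, "\n")
```

Because \s+ matches runs of whitespace, several consecutive spaces or tabs between two columns collapse into a single semicolon.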

Related

Remove first occurrence of special characters until the first word or word character in R using regex

For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:
mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")
String continues following the above pattern.
My target is to remove parts that start with - and end - [number], such as:
"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"
I am planning to use the below to remove these parts with (will be looped in the future)
library(stringr)
library(qdapRegex)
temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))
but for this to work, I need a regex expression that will remove the first occurrence of "-----------" in text until the first word or word character. If a string starts with text (word or word characters), it needs to ignore this and identify the first occurrence of "-----------" for my potential loop to work.
I was wondering if this can be done with regular expressions? Any help is appreciated. I have a very computationally demanding solution for this; split the string based on the special character "-" and then identify the parts of the text that I need through a set of conditionals. But due to the fact that it takes a lot more of the processing time, this solution is not very scalable for processing a large number of such .txt files.
You can use
gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)
See the regex demo.
Details:
-{9,} - nine or more - chars
(?:(?!-{9}).)*? - any one char, other than a line break char, zero or more but as few as possible occurrences, that does not start a nine hyphen char sequence
- \[ - a - [ string
\d+ - one or more digits
] - a ] char.
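To see the pattern at work, here is a small sketch on a shortened, made-up string. The dashes() helper is hypothetical and is only there to make the hyphen counts explicit.

```r
# Hypothetical helper: a run of n hyphens.
dashes <- function(n) strrep("-", n)

# Shortened stand-in for the text in the question.
s <- paste0(
  dashes(9),  "Some text.",
  dashes(9),  "More text - [3548]",
  dashes(5),  " Even more.",
  dashes(11), "More text - [408]",
  dashes(9),  " The end."
)

# Remove every run of 9+ hyphens together with the text up to "- [number]",
# as long as no new 9-hyphen run starts in between.
cleaned <- gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", s, perl = TRUE)
cat(cleaned, "\n")
```

The first and last hyphen runs survive because no "- [number]" follows them before the next run of nine hyphens or the end of the string.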

Remove number between end of row and new space

I'm trying to remove the numbers at the beginning of a row inside quotation marks.
> g<-"My name is Paul.\nI like playing football.\n\"55012\" And that's all."
> cat(g)
My name is Paul.
I like playing football.
"55012" And that's all.
> gsub("[\r\n]\"+[[:digit:]][^[[:space:]]]*"," ",g)
[1] "My name is Paul.\nI like playing football. 012\" And that's all."
This should work, but I don't know why only \n"55 is being replaced and not the entire number.
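You wrapped the POSIX character class in one pair of brackets too many. [^[[:space:]]] is parsed as the bracket expression [^[[:space:]] (any char other than [ and whitespace) followed by a literal ] char, and the trailing * applies only to that ]. The pattern therefore consumes just one more char after the first digit, which is why only \n"55 is replaced.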
However, even that is not enough to fully fix the issue.
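To make the difference between the two bracket expressions concrete, this small sketch reuses g from the question and compares what each form matches (the leading [\r\n] is left out to keep the patterns short):

```r
g <- "My name is Paul.\nI like playing football.\n\"55012\" And that's all."

# Malformed: [^[[:space:]] consumes one char, then ]* matches zero or more "]".
broken <- regmatches(g, regexpr("\"+[[:digit:]][^[[:space:]]]*", g))

# Well-formed: [^[:space:]]* consumes every following non-whitespace char.
fixed <- regmatches(g, regexpr("\"+[[:digit:]][^[:space:]]*", g))

print(broken)
print(fixed)
```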
You may use
gsub("(^|\n)\"+[0-9]+\"+\\s*","\\1", g)
See the R demo
Pattern details
(^|\n) - start of string or a newline captured in Group 1 (referred to with \1 from the replacement pattern)
\"+ - one or more double quotes
[0-9]+ - 1+ digits
\"+ - one or more double quotes
\s* - 0+ whitespaces.
See the regex demo
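Applied to the string from the question, the suggested call can be checked with a short sketch like this:

```r
g <- "My name is Paul.\nI like playing football.\n\"55012\" And that's all."

# Keep the line start (Group 1); drop the quoted number and the spaces after it.
cleaned <- gsub("(^|\n)\"+[0-9]+\"+\\s*", "\\1", g)
cat(cleaned, "\n")
```

Since the newline is captured and put back through \1, the line structure of the text is preserved.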

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitespace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
#[1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve your replacement with a single gsub call, because you have two different requirements: the first is to remove all lowercase letters from the beginning of the string, and the second is to insert a space before every capital letter except the first capital letter of the string that remains.
Doing it in a single gsub call would be possible if we could somehow re-use some of the existing characters to make the replacement conditional, which is not the case here. So, as a first step, you can use the ^[a-z]+ regex to get rid of the lowercase letters at the beginning of the string only,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
In the next step, you can use the (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one, since you probably do not want an extra space at the start of your sentence. You can combine both steps and write them like this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see in the demo, every capital letter is preceded by a pink mark, which is the place where a space will be inserted. But we don't want a space inserted before the very first letter, as it is not needed there. Hence we need a condition in the regex that excludes the capital letter at the start of the string. For that, we use the negative lookbehind (?<!^), which means "do not select a position that is preceded by the start of the string", so (?<!^) discards the position in front of an uppercase letter that sits at the very start of the string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
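The effect of the lookbehind can also be checked directly in R. The sketch below uses a shortened version of the string, with the leading lowercase letters already removed:

```r
s <- "TextIWantToDisplay"

# Lookahead only: a space is inserted before every capital, including the first.
with_leading <- gsub("(?=[A-Z])", " ", s, perl = TRUE)

# Adding (?<!^) rules out the position at the very start of the string.
without_leading <- gsub("(?<!^)(?=[A-Z])", " ", s, perl = TRUE)

print(with_leading)
print(without_leading)
```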
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
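Since gsub and trimws are both vectorised, the same call should work on a whole character vector. A minimal sketch, assuming two made-up ASCII inputs:

```r
# Hypothetical inputs: a lowercase prefix followed by CamelCase text.
x <- c("marchTextIWantToDisplay", "aprilHelloWorld")

# Replace the lowercase prefix and every letter/uppercase boundary with a
# space, then trim the leading whitespace left behind by the prefix.
spaced <- trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl = TRUE))
print(spaced)
```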
Here is an alternative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next letter (here, the first uppercase one) into Group 1, which is passed to the anonymous function as the n argument. If the length of n is non-zero, this alternative matched and we need to replace with this value. Else, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what @pushpesh-kumar-rajwanshi and @akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't think perl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

R Capturing String inside Brackets

I'm trying to parse some of my chess pgn data but I'm having some trouble capturing characters just inside one bracket.
testString <- '[Event \"?\"]\n[Site \"http://www.chessmaniac.com play free chess\"]\n[Date \"2018.08.25\"]\n[Round \"-\"]\n[White \"NothingFancy 1497\"]\n[Black \"JR Smith 1985\"]\n[Result \"1-0\"]\n\n1.'
#Attempt to just get who white is, which is inside a bracket [White xxx]
findWhite <- regexpr('\\[White.*\\]', testString)
regmatches(testString, findWhite)
The stringr package seems to do what I want, but I'm curious what is different about the use of the same regular expression. I'm fine using stringr, but I like to also know how to do this in base R.
library(stringr)
str_extract(testString, '\\[White.*\\]')
If you need the whole match starting with [White and ending with ] you may use
regmatches(testString, regexpr("\\[White\\s*[^][]*]", testString))
[1] "[White \"NothingFancy 1497\"]"
If you only need the quoted value after White (the double quotes are kept):
regmatches(testString, regexpr("\\[White\\s*\\K[^][]*", testString, perl=TRUE))
[1] "\"NothingFancy 1497\""
See the regex demo.
To strip the double quotes, you may use something like
regmatches(testString, regexpr('\\[White\\s*"\\K.*(?="])', testString, perl=TRUE))
[1] "NothingFancy 1497"
See another regex demo and an online R demo.
Details
\\[ - a [ char
White - a literal substring
\\s* - 0+ whitespaces
\\K - match reset operator discarding the text matched so far
[^][]* - 0+ chars other than [ and ]
.* (in the other version) - matches any 0+ chars other than line break chars, as many as possible
(?="]) - a positive lookahead that matches a position inside a string that is immediately followed with "].
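The same idea can be wrapped in a small function to read other tags as well. This is only a sketch: get_pgn_tag is a hypothetical helper, not part of any package, and it assumes the tag name contains no regex metacharacters. It uses [^"]* instead of .* so that the match cannot run past the closing quote.

```r
testString <- '[Event "?"]\n[Site "http://www.chessmaniac.com play free chess"]\n[Date "2018.08.25"]\n[Round "-"]\n[White "NothingFancy 1497"]\n[Black "JR Smith 1985"]\n[Result "1-0"]\n\n1.'

# Hypothetical helper: return the quoted value of one PGN tag.
get_pgn_tag <- function(pgn, tag) {
  # \K drops "[Tag " and the opening quote; the lookahead requires the closing "].
  rx <- paste0("\\[", tag, "\\s*\"\\K[^\"]*(?=\"\\])")
  regmatches(pgn, regexpr(rx, pgn, perl = TRUE))
}

white <- get_pgn_tag(testString, "White")
black <- get_pgn_tag(testString, "Black")
result <- get_pgn_tag(testString, "Result")
print(c(white, black, result))
```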
At least one way to do it in base R is to use sub and only keep the part that you want.
sub(".*\\[White\\s*(.*?)\\].*", "\\1", testString)
[1] "\"NothingFancy 1497\""
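On the question of why base R and stringr disagree: as far as I know, R's default regex engine (TRE) lets . match a newline, while PCRE (perl=TRUE) and the ICU engine behind stringr do not by default, so the greedy .* in the original pattern runs on to the last ] in the string. A small sketch of that difference, using a shortened, made-up PGN header:

```r
pgn <- '[White "NothingFancy 1497"]\n[Black "JR Smith 1985"]\n[Result "1-0"]'

# Default engine (TRE): .* can cross line breaks, so the match runs to the last ].
tre_match <- regmatches(pgn, regexpr("\\[White.*\\]", pgn))

# PCRE: . stops at the newline, so the match stays on the [White ...] line.
pcre_match <- regmatches(pgn, regexpr("\\[White.*\\]", pgn, perl = TRUE))

print(tre_match)
print(pcre_match)
```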

R utf-8 and replace a word from a sentence based on ending character

I am working with a large data set that contains double-byte characters (Korean text). I want to look for a character and replace it. To display the Korean text correctly in the browser I changed the locale settings in R, but I am not sure whether that also applies to the code. Below is my code to change the locale to Korean; the Korean text displays properly in the viewer, but printing it in the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is a data.table that contains a column with Korean text, for example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the first word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search at the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character, I am writing it as \\b시. Is this not the right way to write it? And how do I take care of the () at the end of the sentence?
What would be a good source to refer to for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
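The recursion does not depend on the Korean text, so it can be tried on plain ASCII first. A minimal sketch, assuming some made-up addresses:

```r
# Hypothetical ASCII addresses; the first ends with nested parentheses.
addr <- c("Main St 49 (south,(basement))", "Oak Ave 7 (rear)", "Elm Rd 3")

# Group 1 matches a balanced (...) block and calls itself via (?1) for nesting.
rx <- "\\s*(\\((?:[^()]++|(?1))*\\))$"
stripped <- sub(rx, "", addr, perl = TRUE)
print(stripped)
```

Strings without a trailing parenthesised block, like the third one, are returned unchanged.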
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.

Resources