Edge Conditional White Space Issue in R

I'm trying to clean a long character vector and am getting an edge case where separating the following format of text isn't possible:
$4.917.10%
The issue is how to set a conditional whitespace so that the text looks like this: $4.91 7.10%.
The vector is called "test9" and the script that cleans the typical situations where there is a "-" in front of % is:
gsub("(?=[-])", " ", test9, perl = TRUE)
The edge case is infrequent but a feature of the vector that needs to be adjusted for. There isn't a fixed number of digits to the left of the decimal (whether expressing $ or %) but there are always two decimals to the right of a decimal which makes me think conditionally approaching that is probably the way to go.
Here is a sample of a large piece of one element of the vector:
$28.00$25.0518.09%
Thanks!

Here's another option.
gsub("(?<=\\.\\d{2})(?!%)", " ", "$28.00$25.0518.09%", perl = TRUE)
# [1] "$28.00 $25.05 18.09%"
We have a positive lookbehind (?<=\\.\\d{2}) looking for a dot and two digits, and a negative lookahead (?!%) for %.
More generally, I guess you may also have "$28.00$25.0518.09%18.09%" in which case we need something else:
gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$])", " ", "$28.00$25.0518.09%18.09%", perl = TRUE)
# [1] "$28.00 $25.05 18.09% 18.09%"
Now we have either a positive lookbehind for a dot and two digits or a positive lookbehind for %, and a positive lookahead for a digit or a dollar sign (inside a character class, $ is a literal character, not the end-of-string anchor).
If I understand correctly that your general problem is of the form "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05", then we may use almost the same solution as the latter one:
gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$-])", " ", "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05", perl = TRUE)
# [1] "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
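As a minimal sketch, the last pattern can be wrapped in a small helper and applied to a whole vector, including the "$4.917.10%" case from the question. The name split_values and the sample vector are assumptions made for illustration.

```r
# Hypothetical helper: insert a space wherever one value ends
# (after ".dd" or after "%") and the next one begins (digit, "$" or "-").
split_values <- function(x) {
  gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$-])", " ", x, perl = TRUE)
}

# Made-up sample vector standing in for test9
test9 <- c("$4.917.10%", "$28.00$25.0518.09%-7.10%", "$25.05-$25.05")
split_values(test9)
# [1] "$4.91 7.10%"                 "$28.00 $25.05 18.09% -7.10%"
# [3] "$25.05 -$25.05"
```

Because gsub() is vectorized, no loop over the elements is needed.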

One option is to do it in two stages. First insert a space after every decimal point followed by two digits. Then move the unwanted space this inserts before a % to after the %.
x = '$28.00$25.0518.09%'
y = gsub('(\\.\\d{2})', '\\1 ', x, perl = T) #insert space after decimals
trimws(gsub('\\s%', '% ', y)) # move space from before % to after %
# "$28.00 $25.05 18.09%"
This should also work for the more general cases @Julius described:
x = "$28.00$25.0518.09%18.09%" # "$28.00 $25.05 18.09% 18.09%"
x = "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05" # "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
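To check the two-stage idea on all three inputs at once, it can be wrapped in a function. This is only a sketch; the name two_stage is an assumption made for illustration.

```r
# Hypothetical wrapper around the two stages described above
two_stage <- function(x) {
  y <- gsub("(\\.\\d{2})", "\\1 ", x, perl = TRUE)  # space after every ".dd"
  trimws(gsub("\\s%", "% ", y))                     # move the space from before % to after it
}

x <- c("$28.00$25.0518.09%",
       "$28.00$25.0518.09%18.09%",
       "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05")
two_stage(x)
# [1] "$28.00 $25.05 18.09%"
# [2] "$28.00 $25.05 18.09% 18.09%"
# [3] "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
```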

Related

Keep only the first letter of each word after a comma

I have strings like Sacher, Franz Xaver or Nishikawa, Kiyoko.
Using R, I want to change them to Sacher, F. X. or Nishikawa, K..
In other words, the first letter of each word after the comma should be retained with a dot (and a whitespace if another word follows).
Here is a related response, but it cannot be applied to my case 1:1 as it does not have a comma in its strings; it seems that simply adding (?<=, ) does not work.
E.g. in the following attempts, gsub() replaces everything, while my str_replace_all()-attempt leads to an error:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
# first attempt
# (resembles the response from the other thread)
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1', TEST, perl = TRUE)
# second attempt
# error: "Incorrect unicode property"
stringr::str_replace_all(TEST, '(?<=, )\\b(\\pL)\\pL{2,}|.','\\U\\1')
I would be grateful for your help!
You can use
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
See the regex demo. Details:
(*UCP) - the PCRE verb that will make \b Unicode aware
^[^,]+(*SKIP)(*F) - start of string and then one or more chars other than a comma, and then the match is failed and skipped, the next match starts at the location where the failure occurred
| - or
\b - word boundary
(\p{L}) - Group 1: any Unicode letter
\p{L}* - zero or more Unicode letters
See the R demo:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
gsub("(*UCP)^[^,]+(*SKIP)(*F)|\\b(\\p{L})\\p{L}*", "\\U\\1.", TEST, perl=TRUE)
## => [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
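If the backtracking verbs feel opaque, a different route (not the one used above) is to split each string at the comma and abbreviate only the part after it. This is a sketch under the assumption that every element contains exactly one comma; the name abbreviate_given is hypothetical.

```r
# Hypothetical helper; assumes exactly one comma per element
abbreviate_given <- function(x) {
  surname <- sub(",.*$", "", x)           # everything before the comma
  given   <- sub("^[^,]*,\\s*", "", x)    # everything after the comma
  # Reduce each word of the given names to its upper-cased initial plus a dot
  initials <- gsub("(*UCP)\\b(\\p{L})\\p{L}*", "\\U\\1.", given, perl = TRUE)
  paste0(surname, ", ", initials)
}

TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
abbreviate_given(TEST)
# [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
```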
A crude approach, splitting the string:
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")
sapply(strsplit(TEST, '\\s+'), function(x)
paste0(x[1], paste0(substr(x[-1], 1, 1), collapse = '.'), '.'))
#[1] "Sacher,F.X." "Nishikawa,K." "Al-Assam,M."
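The splitting approach drops the spaces the question asked for. A small variation that keeps them might look like this (a sketch, still assuming the surname is a single word ending in a comma):

```r
TEST <- c("Sacher, Franz Xaver", "Nishikawa, Kiyoko", "Al-Assam, Muhammad")

res <- vapply(strsplit(TEST, "\\s+"), function(x) {
  # x[1] is the surname with its comma; the rest are given names
  initials <- paste0(substr(x[-1], 1, 1), ".", collapse = " ")
  paste(x[1], initials)
}, character(1))
res
# [1] "Sacher, F. X." "Nishikawa, K." "Al-Assam, M."
```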
An approach using multiple backreference:
gsub("(\\b\\w+,\\s)(\\b\\w).*(\\b\\w)*", "\\1\\2.\\3", TEST)
[1] "Sacher, F." "Nishikawa, K." "Al-Assam, M."
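Here, we use three capturing groups to refer back to in gsub's replacement argument via backreferences:
(\\b\\w+,\\s): this first group captures the last name plus the comma followed by whitespace
(\\b\\w): this second group captures the initial of the first name
(\\b\\w)*: this third group is meant to capture the initial of the middle name; however, the greedy .* before it consumes the rest of the string, so the group matches nothing and the middle initial is dropped (hence "Sacher, F." rather than "Sacher, F. X." in the output above)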

How to insert a white space before open bracket

I have a string 3.4(2.5-4.7), I want to insert a white space before the open bracket "(" so that the string becomes 3.4 (2.5-4.7).
Any idea how this could be done in R?
x <- "3.4(2.5-4.7)"
sub("(.*)(?=\\()", "\\1 ", x, perl = T)
[1] "3.4 (2.5-4.7)"
This regex is based on a lookahead: it creates one capturing group subsuming everything up to the lookahead, namely the opening parenthesis (?=\\(), recalls that group, and inserts one whitespace after it in the replacement argument to sub (which is enough unless you have more than one such substitution per string, in which case gsub is needed). The argument perl = T needs to be added to enable the lookahead.
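For strings with several brackets, or where some brackets already have a space in front of them, one possible sketch uses gsub with a lookbehind instead of a capturing group. The sample vector is made up for illustration.

```r
# Made-up sample: one bracket, two brackets, and one that is already spaced
x <- c("3.4(2.5-4.7)", "1.2(0.8-1.9) and 5(4-6)", "7 (6-8)")

# Insert a space before "(" only when it directly follows a non-space character
spaced <- gsub("(?<=\\S)\\(", " (", x, perl = TRUE)
spaced
# [1] "3.4 (2.5-4.7)"             "1.2 (0.8-1.9) and 5 (4-6)"
# [3] "7 (6-8)"
```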
EDIT:
If you have a string like this:
x <- "3.4(2.5to4.7)"
the regex gets slightly more complex; the underlying idea though remains the same: you divide the string into different capturing groups (...), which you then recall using appropriate backreferences in the replacement argument while adding the sought spaces:
sub("(.*)(\\(\\d+\\.\\d+)(to)(\\d+\\.\\d+\\))", "\\1 \\2 \\3 \\4", x)
[1] "3.4 (2.5 to 4.7)"
EDIT2:
x <- '3.4(2.5,4.7)'
sub("(.*)(\\(\\d+\\.\\d+)(,)(\\d+\\.\\d+\\))", "\\1 \\2\\3 \\4", x)
[1] "3.4 (2.5, 4.7)"
EDIT3:
x <- '3(2,4)'
sub("(.*)(\\(\\d+)(,)(\\d+)", "\\1 \\2\\3 \\4", x)
A very short way uses sub, which will substitute the first open bracket ( with a space followed by an open bracket, i.e. " (".
x <- '3.4(2.5-4.7)'
sub("\\(", " (", x)
# [1] "3.4 (2.5-4.7)"
Alternatively, you can specify the argument fixed = TRUE which considers the pattern as fixed and not as a regular expression.
x <- '3.4(2.5-4.7)'
sub("(", " (", x, fixed = TRUE)
# [1] "3.4 (2.5-4.7)"
Try
gsub('(.*)(\\(.*\\))', '\\1 \\2', '3.4(2.5-4.7)')
#[1] "3.4 (2.5-4.7)"
The regex creates two groups. The first group (.*) captures everything before the opening parenthesis, and the second group (\\(.*\\)) captures the parenthesized part, from the opening parenthesis to the closing one. Note that we need to escape the parentheses, so we use \\( and \\). We then join those two groups with a space between them: \\1 \\2.

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespaces if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceded by a digit and followed by a letter".
I tried the stringr package, but I failed to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. They can check what comes directly before a position (called a lookbehind) or what follows it (called a lookahead). Each comes in a positive and a negative form, so there are four types in total (pos./neg. lookbehind/lookahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
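Applied to the question's input in R, the lookaround idea could be sketched as follows, generalizing DEC to all month abbreviations. The variable names months and pat are chosen here for illustration.

```r
txt <- "START1SEP2 1DECX JANEND"

# Alternation of all upper-cased month abbreviations: "JAN|FEB|...|DEC"
months <- paste(toupper(month.abb), collapse = "|")

# Month preceded by a digit (lookbehind) and followed by a capital letter (lookahead)
pat <- paste0("(?<=[0-9])(", months, ")(?=[A-Z])")

out <- gsub(pat, " \\1 ", txt, perl = TRUE)
out
# [1] "START1SEP2 1 DEC X JANEND"
```

Since the lookarounds consume nothing, only the month itself is replaced, so the neighbouring digit and letter stay where they are.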
A Base R version
patt_month <- capture.output(cat(toupper(month.abb), sep = "|"))#concatenate all month.abb with OR (sep = "|" is needed, otherwise cat() joins them with spaces)
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")#make it a three group thing
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl =TRUE)#same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

Tidying messy coordinates for use in measurements

I have some rather messy degrees, decimal minutes coordinates (the source of which is out of my control) in the following format (see below). I am trying to work out the distance between the points ultimately.
minlat <- "51 12.93257'"
maxlat <- "66 13.20549'"
minlong <- "- 5 1.23944'"
maxlong <- "- 5 1.36293'"
As they stand, they are in a rather unfriendly format for conv_unit() (from the measurements package):
measurements::conv_unit(minlat, from = 'deg_dec_min', to = 'dec_deg')
and ultimately
distm(c(minlong, minlat), c(maxlong, maxlat), fun = distHaversine)
I think I need to use gsub() to get them into a friendly format, whereby I would like them to be
minlat <- 51 12.93257 # removing the double space
minlong <- -5 1.23944 # removing the double space and the space after the -
I've been messing around with gsub() all morning and it has beaten me, any help would be great!!
It sounds like you just need to strip all excess whitespace. We can try using gsub with lookarounds here.
minlong <- " - 5 1.23944 " # -5 1.23944
minlong
# [1] " - 5 1.23944 "
gsub("(?<=^|\\D) | (?=$|\\D)", "", gsub("\\s+", " ", minlong), perl=TRUE)
# [1] "-5 1.23944"
The inner call to gsub replaces any run of whitespace with just a single space. The outer call then selectively removes a remaining single space only if it is not sandwiched between two digits.
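Putting the cleaning step into a function, and also stripping the trailing minute mark (') that the question's data carries, gives something like the sketch below. The helper names clean_coord and to_decimal are hypothetical, and the manual degrees + minutes/60 arithmetic is a stand-in used here for illustration, not the measurements::conv_unit() call from the question.

```r
# Hypothetical helper: drop the minute mark, collapse whitespace,
# then remove spaces that are not between two digits.
clean_coord <- function(x) {
  x <- gsub("'", "", x, fixed = TRUE)
  x <- gsub("\\s+", " ", x)
  gsub("(?<=^|\\D) | (?=$|\\D)", "", x, perl = TRUE)
}

# Hypothetical helper: "deg min" string -> decimal degrees.
# Assumes each cleaned string has exactly two space-separated parts.
to_decimal <- function(x) {
  parts <- strsplit(clean_coord(x), " ", fixed = TRUE)
  vapply(parts, function(p) {
    sign <- if (startsWith(p[1], "-")) -1 else 1
    sign * (abs(as.numeric(p[1])) + as.numeric(p[2]) / 60)
  }, numeric(1))
}

clean_coord(c("51  12.93257'", "-  5  1.23944'"))
# [1] "51 12.93257" "-5 1.23944"
to_decimal(c("51  12.93257'", "-  5  1.23944'"))
```

Taking the sign from the string rather than from the parsed number keeps coordinates such as "-0 30.0" negative.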

center a string by padding spaces up to a specified length

I have a vector of names, like this:
x <- c("Marco", "John", "Jonathan")
I need to format it so that the names get centered in 10-character strings, by adding leading and trailing spaces:
> output
# [1] " Marco " " John " " Jonathan "
I was hoping for a solution less complicated than going with paste, rep, and counting nchar (maybe with sprintf, but I don't know how).
Here's a sprintf() solution that uses a simple helper vector f to determine the low side widths. We can then insert the widths into our format using the * character, taking the ceiling() on the right side to account for an odd number of characters in a name. Since our max character width is at 10, each name that exceeds 10 characters will remain unchanged because we adjust those widths with pmax().
f <- pmax((10 - nchar(x)) / 2, 0)
sprintf("%-*s%s%*s", f, "", x, ceiling(f), "")
# [1] "  Marco   " "   John   " " Jonathan " "Christopher"
Data:
x <- c("Marco", "John", "Jonathan", "Christopher")
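For reuse with other widths, the same centering logic can be written with explicit integer padding. This is a sketch using strrep() rather than the sprintf() format above; the function name center and its width argument are assumptions made for illustration.

```r
# Hypothetical helper: center each string in a field of `width` characters.
# Strings longer than `width` are returned unchanged.
center <- function(x, width = 10) {
  pad   <- pmax(width - nchar(x), 0)  # total padding needed
  left  <- pad %/% 2                  # smaller half goes on the left
  right <- pad - left                 # any extra space goes on the right
  paste0(strrep(" ", left), x, strrep(" ", right))
}

x <- c("Marco", "John", "Jonathan", "Christopher")
center(x)
# [1] "  Marco   " "   John   " " Jonathan " "Christopher"
```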
Finally, I know it's not the same language, but it is worth noting that Python (unlike R) has a built-in string method for doing just that, called center():
example = "John"
example.center(10)
#### '   John   '
In this example it adds the extra space on the right when the padding is odd, and it allows you to input the filling character of your choice, although it's not vectorized.
