Tidying messy coordinates for use in measurements - r

I have some rather messy coordinates in degrees and decimal minutes (the source of which is out of my control) in the following format. Ultimately, I am trying to work out the distance between the points.
minlat <- "51 12.93257'"
maxlat <- "66 13.20549'"
minlong <- "- 5 1.23944'"
maxlong <- "- 5 1.36293'"
As they stand, they are in a rather unfriendly format for conv_unit (from the measurements package):
measurements::conv_unit(minlat, from = 'deg_dec_min', to = 'dec_deg')
and ultimately
distm(c(minlong, minlat), c(maxlong, maxlat), fun = distHaversine)
I think I need to use gsub() to get them into a friendly format, so that they look like this:
minlat <- "51 12.93257" # removing the double space
minlong <- "-5 1.23944" # removing the double space and the space after the -
I've been messing around with gsub() all morning and it has beaten me. Any help would be great!

It sounds like you just need to strip all excess whitespace. We can try using gsub with lookarounds here.
minlong <- " - 5 1.23944 " # -5 1.23944
minlong
gsub("(?<=^|\\D) | (?=$|\\D)", "", gsub("\\s+", " ", minlong), perl=TRUE)
[1] " - 5 1.23944 "
[1] "-5 1.23944"
The inner call to gsub replaces every run of one or more whitespace characters with a single space. The outer call then removes a remaining single space unless it is sandwiched between two digits.
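As a rough sketch of how this could be applied to all four coordinates from the question: the helpers below, tidy_coord() and dmm_to_dd(), are hypothetical names introduced here, not functions from any package. The sketch assumes the trailing ' minute mark should simply be dropped and that each cleaned value has the form "degrees minutes". The conversion to decimal degrees is done by hand in base R rather than through measurements::conv_unit.

```r
# Hypothetical helper: strip the minute mark and excess whitespace.
tidy_coord <- function(x) {
  x <- gsub("'", "", x, fixed = TRUE)   # drop the trailing minute mark
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace
  # remove a space unless it sits between two digits
  gsub("(?<=^|\\D) | (?=$|\\D)", "", x, perl = TRUE)
}

# Hypothetical helper: "deg min" strings -> decimal degrees.
dmm_to_dd <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(p) {
    # take the sign from the text so that "-0 30" stays negative
    sgn <- if (startsWith(p[1], "-")) -1 else 1
    sgn * (abs(as.numeric(p[1])) + as.numeric(p[2]) / 60)
  }, numeric(1))
}

raw <- c(minlat = "51 12.93257'", maxlat = "66 13.20549'",
         minlong = "- 5 1.23944'", maxlong = "- 5 1.36293'")
clean <- tidy_coord(raw)
dd <- dmm_to_dd(clean)
print(clean)
print(dd)
```

The decimal-degree values could then be passed to the distance calculation, keeping in mind that the call in the question supplies each point as (longitude, latitude).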

Related

Edge Conditional White Space Issue R

I'm trying to clean a long character vector and am getting an edge case where separating the following format of text isn't possible:
$4.917.10%
The issue is how to set a conditional whitespace so that the text looks like this: $4.91 7.10%.
The vector is called "test9" and the script that cleans the typical situations where there is a "-" in front of % is:
gsub("(?=[-])", " ", test9, perl = TRUE)
The edge case is infrequent, but it is a feature of the vector that needs to be handled. There isn't a fixed number of digits to the left of the decimal point (whether expressing $ or %), but there are always exactly two digits to the right of it, which makes me think that keying on those two decimals is probably the way to go.
Here is a sample of a large piece of one element of the vector:
$28.00$25.0518.09%
Thanks!
Here's another option.
gsub("(?<=\\.\\d{2})(?!%)", " ", "$28.00$25.0518.09%", perl = TRUE)
# [1] "$28.00 $25.05 18.09%"
We have a positive lookbehind (?<=\\.\\d{2}) looking for a dot and two digits, and a negative lookahead (?!%) for %.
More generally, I guess you may also have "$28.00$25.0518.09%18.09%" in which case we need something else:
gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$])", " ", "$28.00$25.0518.09%18.09%", perl = TRUE)
# [1] "$28.00 $25.05 18.09% 18.09%"
Now we have either a positive lookbehind for a dot and two digits or a positive lookbehind for %, and a positive lookahead for a digit or a literal $ (inside a character class, $ is not an end-of-string anchor).
If I understand correctly that your general problem is of the form "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05", then we may use almost the same solution as the latter one:
gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$-])", " ", "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05", perl = TRUE)
# [1] "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
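A small sketch of how the last pattern might be wrapped up and run over a whole vector, including the "$4.917.10%" edge case from the question. The function name split_amounts() and the contents of test9 here are illustrative assumptions, not the asker's real data.

```r
# Hypothetical wrapper around the lookaround pattern shown above.
split_amounts <- function(x) {
  gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$-])", " ", x, perl = TRUE)
}

# Illustrative stand-in for the asker's vector.
test9 <- c("$4.917.10%", "$28.00$25.0518.09%", "$28.00-7.10%")
spaced <- split_amounts(test9)
print(spaced)

# Splitting on the inserted space yields the individual amounts.
pieces <- strsplit(spaced, " ", fixed = TRUE)
print(pieces[[1]])
```

Because gsub is vectorized, no loop is needed; each element is handled independently.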
One option is to do it in two stages. First insert a space after every second decimal. Then remove the unwanted space this inserts before a %
x = '$28.00$25.0518.09%'
y = gsub('(\\.\\d{2})', '\\1 ', x, perl = TRUE) # insert space after decimals
trimws(gsub('\\s%', '% ', y)) # move space from before % to after %
# "$28.00 $25.05 18.09%"
This should also work for the more general cases @Julius described:
x = "$28.00$25.0518.09%18.09%" # "$28.00 $25.05 18.09% 18.09%"
x = "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05" # "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
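To check that claim, the two stages can be put into a function and run against the general cases. The name two_stage() is a hypothetical helper used only for this sketch.

```r
# Hypothetical wrapper for the two-stage approach.
two_stage <- function(x) {
  y <- gsub("(\\.\\d{2})", "\\1 ", x, perl = TRUE)  # space after every two decimals
  trimws(gsub("\\s%", "% ", y))                     # move the space from before % to after it
}

cases <- c("$28.00$25.0518.09%18.09%",
           "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05")
result <- two_stage(cases)
print(result)
```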

splitting text into character and numeric

Could someone help me split this string:
string <- "Rolling in the deep $15.25"
I'm trying to get two outputs out of this:
1) Rolling in the Deep # character
2) 15.25 # numeric value
I know how to do this in Excel, but I'm a bit lost with R.
Using strsplit will do the trick:
string <- "Rolling in the deep $15.25"
strsplit(string, "\\s+\\$")
# \\s+ matches one or more whitespace characters
# \\$ matches a literal $ (escaped with \\ because a bare $ anchors the end of the string)
# Result
#"Rolling in the deep" "15.25"
strsplit(string, "\\s+\\$")[[1]][1]
#[1] "Rolling in the deep"
strsplit(string, "\\s+\\$")[[1]][2]
#[1] "15.25"
As long as the right-hand side is always preceded by a dollar sign, you can split on it; you just need to "escape" the dollar sign. Try this:
# you will need stringr, which you could load alone but the tidyverse is amazing
library(tidyverse)
string <- "Rolling in the deep $15.25"
str_split_fixed(string, "\\$", n = 2)
Here's how you can extract the information using only regular expressions:
x <- c("Rolling in the deep $15.25",
       "Apetite for destruction $20.00",
       "Piece of mind $19")
rgx <- "^(.*)\\s{2,}(\\$.*)$"
data.frame(album = trimws(gsub(rgx, "\\1", x)),
           price = trimws(gsub(rgx, "\\2", x)))
                    album  price
1     Rolling in the deep $15.25
2 Apetite for destruction $20.00
3           Piece of mind    $19
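Since the question asks for the price as a numeric value, here is a hedged variant of the regex approach. It assumes that the title and the price are separated by zero or more spaces followed by a $, and that the price contains only digits and a decimal point. The pattern and the albums data frame are illustrative, not part of the answer above.

```r
x <- c("Rolling in the deep $15.25",
       "Apetite for destruction $20.00",
       "Piece of mind $19")

# Lazy title, optional whitespace, literal $, then the numeric part.
rgx <- "^(.*?)\\s*\\$([0-9.]+)\\s*$"

albums <- data.frame(
  album = sub(rgx, "\\1", x, perl = TRUE),              # character column
  price = as.numeric(sub(rgx, "\\2", x, perl = TRUE)),  # numeric column
  stringsAsFactors = FALSE
)
print(albums)
```

Dropping the $ inside the pattern is what allows as.numeric() to convert the second capture group directly.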

center a string by padding spaces up to a specified length

I have a vector of names, like this:
x <- c("Marco", "John", "Jonathan")
I need to format it so that the names get centered in 10-character strings, by adding leading and trailing spaces:
> output
# [1] "  Marco   " "   John   " " Jonathan "
I was hoping for a solution less complicated than going with paste, rep, and counting nchar (maybe with sprintf, but I don't know how).
Here's a sprintf() solution that uses a simple helper vector f to determine the low side widths. We can then insert the widths into our format using the * character, taking the ceiling() on the right side to account for an odd number of characters in a name. Since our max character width is at 10, each name that exceeds 10 characters will remain unchanged because we adjust those widths with pmax().
f <- pmax((10 - nchar(x)) / 2, 0)
sprintf("%-*s%s%*s", f, "", x, ceiling(f), "")
# [1] "  Marco   " "   John   " " Jonathan " "Christopher"
Data:
x <- c("Marco", "John", "Jonathan", "Christopher")
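For comparison, here is a minimal base R sketch that computes the left and right padding explicitly. The function name center() is hypothetical. It assumes the extra space should go on the right when the padding is odd, and that names longer than the width are returned unchanged.

```r
# Hypothetical helper: centre strings in a field of `width` characters.
center <- function(x, width = 10) {
  pad   <- pmax(width - nchar(x), 0)  # total padding, never negative
  left  <- pad %/% 2                  # smaller half goes on the left
  right <- pad - left                 # remainder goes on the right
  paste0(strrep(" ", left), x, strrep(" ", right))
}

x <- c("Marco", "John", "Jonathan", "Christopher")
centered <- center(x)
print(centered)
print(nchar(centered))
```

If loading a package is acceptable, stringr::str_pad() with side = "both" is another option to look at, though which side receives the extra space there should be checked against its documentation.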
I know it's not the same language, but it is worth noting that Python (unlike R) has a built-in method for doing just that, str.center():
example = "John"
example.center(10)
# '   John   '
When the padding is odd, one side gets the extra character (the right side for "Marco" in a width of 10), and you can supply the fill character of your choice. It is not vectorized, though.

Move location of special character

I have an entire vector of strings in which the only special symbol is "-".
To be clear, a sample string looks like 23 C-Exam
I'd like to change it to 23-C Exam
I essentially want R to find the location of "-" and move it 2 characters back.
I feel this is a really simple task, although I can't figure out how.
Assume that whenever R finds "-", the position two characters back is whitespace, just like the example above.
regex attempt:
x <- c("23 C-Exam","45 D-Exam")
#[1] "23 C-Exam" "45 D-Exam"
sub(".(.)-", "-\\1 ", x)
#[1] "23-C Exam" "45-D Exam"
Find a character ., before a character (.), followed by a literal dash -.
Replace with a literal dash -, the saved character from above \\1, and overwrite the dash with a space
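A slightly stricter variant, sketched under the assumption stated in the question that the character two positions before the dash is always a space: matching a literal space instead of . means strings that do not fit the pattern are left untouched. The third element of x below is made up for illustration.

```r
x <- c("23 C-Exam", "45 D-Exam", "No dash here")

# Literal space, one captured non-space character, then the dash.
moved <- sub(" (\\S)-", "-\\1 ", x)
print(moved)
```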
There is probably a sleek way of doing this with regular expressions, but one approach is to simply splice together the various pieces of the desired output. First, I find the index in the string containing the -, and then I use substr() to piece together the output.
x <- "23 C-Exam"
pos <- regexpr("-", x) # index of the dash
x <- paste0(substr(x, 1, pos-3),
            "-",
            substr(x, pos-1, pos-1),
            " ",
            substr(x, pos+1, nchar(x)))
> x
[1] "23-C Exam"
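We can also use chartr, which swaps every space with a dash and every dash with a space, so it relies on each string containing only the one space and the one dash shown in the sample: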
chartr(" -", "- ", x)
#[1] "23-C Exam" "45-D Exam"
data
x <- c("23 C-Exam","45 D-Exam")

how do you format numbers in vector without having extra spaces and quotes around the numbers

I have a vector like this:
dput(yy)
c(97.1433841613379, 1102.1208262592, 32.5418522860492, 217.694780086999,
1306.31759309228, 202.18335752298, 22.8301149425287)
I need to only keep 2 decimal points and I am doing this to get rid of additional decimal points:
yy<-format(yy, digits=1)
When I do dput(yy), I get additional spaces in front of my values, like this:
dput(yy)
c("  97.14", "1102.12", "  32.54", " 217.69", "1306.32", " 202.18",
"  22.83")
Is there an easy way to format the numbers without inserting extra space and quotes around the numbers?
You could use ?sprintf (it uses the same syntax as sprintf in C):
x <- c(97.1433841613379, 1102.1208262592, 32.5418522860492, 217.694780086999, 1306.31759309228, 202.18335752298, 22.8301149425287)
sprintf("%.2f", x)
# [1] "97.14" "1102.12" "32.54" "217.69" "1306.32" "202.18" "22.83"
EDIT:
Or are you looking for ?round?
round(x, digits=2)
# [1] 97.14 1102.12 32.54 217.69 1306.32 202.18 22.83
If you want to keep everything as numbers, then use round(x, 2). However, that will turn a number like 1.5000002 into 1.5 rather than the 1.50 that you could get with format or sprintf.
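The trade-off can be seen side by side in a short sketch. The variable names are illustrative. noquote() is used only to print character values without the surrounding quotes, and trimws() removes the padding that format() adds to align the values.

```r
yy <- c(97.1433841613379, 1102.1208262592, 32.5418522860492, 217.694780086999,
        1306.31759309228, 202.18335752298, 22.8301149425287)

as_text <- sprintf("%.2f", yy)                 # character, always two decimals, no padding
padded  <- format(round(yy, 2), nsmall = 2)    # character, padded to a common width
trimmed <- trimws(padded)                      # same text with the padding removed

print(noquote(as_text))                        # printed without quotes
print(round(1.5000002, 2))                     # numeric: trailing zero is not shown
print(sprintf("%.2f", 1.5000002))              # character: trailing zero is kept
```

Which form is appropriate depends on whether the values are still needed for arithmetic (keep them numeric and round) or only for display (convert to text).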

Resources