Pattern Matching & Replacement / Cleaning of Data in R

I want to plot geospatial data, so I need coordinates. The data I've been given is very messy, and I need a reliable way to convert a vector of coordinates in multiple formats into one consistent format, as shown below:
Input:
- lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
- lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")
Output:
- lat <- c(41.122333, 40.39156, 41.10589, 38.31106, 38.720647)
- lon <- c(8.194066, 7.523195, 8.37832, 8.54170, -9.22522)
Does anyone have a creative solution? Any response is much appreciated!

lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")
gsub(" ", "", sub("\\s", ".", gsub("º|\\'|\\.", " ", lat)))
[1] "41.122333" "40.39156" "41.10589" "38.31106" "38.720647"
gsub(" ", "", sub("\\s", ".", gsub("º|\\'|\\.", " ", lon)))
[1] "8.194066" "7.523195" "8.37832" "8.54170" "-9.22522"
1. Replace every º, ' and . with a space.
2. Replace the first space with a decimal point.
3. Replace all remaining spaces with "" so the pieces are pasted back together.
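
If the result is needed as numbers rather than strings, the three steps can be wrapped in a small helper. This is only a sketch: the name clean_coord is made up for illustration, and it assumes every input follows one of the formats shown in the question.

```r
# Hypothetical helper (not from the original answer): applies the three
# steps in order and converts the cleaned strings to numeric.
clean_coord <- function(x) {
  x <- gsub("º|'|\\.", " ", x)  # step 1: separators become spaces
  x <- sub("\\s", ".", x)       # step 2: first space becomes the decimal point
  x <- gsub(" ", "", x)         # step 3: drop the remaining spaces
  as.numeric(x)
}

lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")

lat_num <- clean_coord(lat)
lon_num <- clean_coord(lon)
```

Note that as.numeric() drops trailing zeros, so "8.54170" becomes 8.5417; keep the strings if the exact digits matter.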

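With base R, you could try the following. It marks the first separator, removes all the others, and then turns the marked one into the decimal point.
lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
for (i in lat)
{
  i <- sub("º| |\\.", "#", i)    # mark the first separator
  i <- gsub("º|'| |\\.", "", i)  # remove every remaining separator
  i <- sub("#", ".", i)          # the marked separator becomes the decimal point
  print(i)
}
Output will be as follows.
[1] "41.122333"
[1] "40.39156"
[1] "41.10589"
[1] "38.31106"
[1] "38.720647"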

This function will also work:
# DATA
lat <- c("41º12'23.33''", "40º39'15.6'", "41 10 589", "38 31 10.6", "38.720647")
lon <- c("8º19'40.66''", "7º52'31.95'", "8 37 832", "8 54 17.0", "-9.22522")
# FUNCTION
convert_coordinates <- function(x) {
  # Split on the unwanted punctuation. You can add more characters to the
  # pattern; just separate them with a |
  splits <- strsplit(x, "º| |[.]|'")
  splits <- lapply(splits, function(p) p[p != ""]) # Remove any empty strings
  output <- character(length(splits))
  for (i in seq_along(splits)) {
    # The first piece is the integer part; the rest go after the decimal point
    output[i] <- paste0(splits[[i]][1], ".", paste0(splits[[i]][-1], collapse = ""))
  }
  return(output)
}
# RESULTS
convert_coordinates(lat)
# [1] "41.122333" "40.39156" "41.10589" "38.31106" "38.720647"
convert_coordinates(lon)
# [1] "8.194066" "7.523195" "8.37832" "8.54170" "-9.22522"

Related

How to manipulate digits in a character string in R?

I feel like I have a super easy question but for the life of me I can't find it when googling or searching here (or I don't know the correct terms to find a solution) so here goes.
I have a large amount of text in R in which I want to identify all numbers/digits, and add a specific number to them, for example 5.
So just as a small example, if this were my text:
text <- c("Hi. It is 6am. I want to leave at 7am")
I want the output to be:
> text
[1] "Hi. It is 11am. I want to leave at 12am"
But also I need the addition for each individual digit, so if this is the text:
text <- c("Hi. It is 2017. I am 35 years old.")
...I want the output to be:
> text
[1] "Hi. It is 75612. I am 810 years old."
I have tried 'grabbing' the numbers from the string and adding 5, but I don't know how to then get them back into the original string so I can get the full text back.
How should I go about this? Thanks in advance!

Here is how I would do the time. I would search for a number that is followed by am or pm and then sub in a math expression to be evaluated by gsubfn. This is pretty flexible, but would require whole hours in its current implementation. I added an am and pm if you wanted to swap those, but I didn't try to code in detecting if the number changes from am to pm. Also note that I didn't code in rolling from 12 to 1. If you add numbers over 12, you will get a number bigger than 12.
text1 <- c("Hi. It is 6am. I want to leave at 7am")
text2 <- c("It is 9am. I want to leave at 10am, but the cab comes at 11am. Can I push my flight to 12am?")
change_time <- function(text, hours, sign, am_pm){
string_change <- glue::glue("`(\\1{sign}{hours})`{am_pm}")
gsub("(\\d+)(?=am|pm)(am|pm)", string_change, text, perl = TRUE)|>
gsubfn::fn$c()
}
change_time(text = text1, hours = 5, sign = "+", am_pm = "am")
#> [1] "Hi. It is 11am. I want to leave at 12am"
change_time(text = text2, hours = 3, sign = "-", am_pm = "pm")
#> [1] "It is 6pm. I want to leave at 7pm, but the cab comes at 8pm. Can I push my flight to 9pm?"
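
The answer above notes that it does not handle rolling over from 12 to 1 or switching between am and pm. A minimal base R sketch of that missing piece is shown here, under the assumption that times are whole hours written like "6am" or "11pm". The function name shift_hours is hypothetical. It converts each time to a 24-hour value, shifts it, and converts back, so 7am plus 5 hours becomes 12pm rather than the "12am" in the question's example.

```r
# Hypothetical helper: shift every "<hour>am"/"<hour>pm" token by `hours`,
# wrapping around the 12-hour clock and flipping am/pm when needed.
shift_hours <- function(text, hours) {
  m <- gregexpr("[0-9]+(am|pm)", text)
  regmatches(text, m) <- lapply(regmatches(text, m), function(tok) {
    h     <- as.integer(sub("(am|pm)$", "", tok))
    is_pm <- grepl("pm$", tok)
    # 12am is hour 0 and 12pm is hour 12 on a 24-hour clock
    h24 <- (h %% 12 + ifelse(is_pm, 12, 0) + hours) %% 24
    h12 <- ifelse(h24 %% 12 == 0, 12, h24 %% 12)
    paste0(h12, ifelse(h24 >= 12, "pm", "am"))
  })
  text
}

shift_hours("Hi. It is 6am. I want to leave at 7am", 5)
#> [1] "Hi. It is 11am. I want to leave at 12pm"
shift_hours("It is 11pm", 3)
#> [1] "It is 2am"
```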

text1 <- c("Hi. It is 2017. I am 35 years old.")
text2 <- c("Hi. It is 6am. I want to leave at 7am")
change_number <- function(text, change, sign){
string_change <- glue::glue("`(\\1{sign}{change})`")
gsub("(\\d)", string_change, text, perl = TRUE) %>%
gsubfn::fn$c() }
change_number(text = text1, change = 5, sign = "+")
#>[1] "Hi. It is 75612. I am 810 years old."
change_number(text = text2, change = 5, sign = "+")
#>[1] "Hi. It is 11am. I want to leave at 12am"
This works perfectly. Many thanks to @AndS. I tweaked (or rather, simplified) your code to fit my needs better. I was determined to figure out the other text myself, so thanks for showing me how!

Something quick and dirty with base R:
add_n = \(x, n, by_digit = FALSE) {
if (by_digit) ptrn = "[0-9]" else ptrn = "[0-9]+"
tmp = gregexpr(ptrn, x)
raw = regmatches(x, gregexpr(ptrn, x))
raw_plusn = lapply(raw, \(x) as.integer(x) + n)
for (i in seq_along(x)) regmatches(x[i], tmp[i]) = raw_plusn[i]
x
}
text = c(
"Hi. It is 6am. I want to leave at 7am",
"wow it's 505 dollars and 19 cents",
"Hi. It is 2017. I am 35 years old."
)
> add_n(text, 5)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 510 dollars and 24 cents"
# [3] "Hi. It is 2022. I am 40 years old."
> add_n(text, -2)
# [1] "Hi. It is 4am. I want to leave at 5am" "wow it's 503 dollars and 17 cents"
# [3] "Hi. It is 2015. I am 33 years old."
> add_n(text, 5, by_digit = TRUE)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 10510 dollars and 614 cents"
# [3] "Hi. It is 75612. I am 810 years old."

Here's a tidyverse solution (it uses dplyr, tidyr and stringr, so load them first, for example with library(tidyverse)):
data.frame(text) %>%
# separate `text` into individual characters:
separate_rows(text, sep = "(?<!^)(?!$)") %>%
# add `5` to any digit:
mutate(
# if you detect a digit...
text = ifelse(str_detect(text, "\\d"),
# ... extract it, convert it to numeric, add `5`:
as.numeric(str_extract(text, "\\d")) + 5,
# ... else leave `text` as is:
text)
) %>%
# string the characters back together:
summarise(text = str_c(text, collapse = ""))
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 11am. I want to leave at 12am
Data 1:
text <- c("Hi. It is 6am. I want to leave at 7am")
Note that the same code works for the second text as well without any change:
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 75612. I am 810 years old.
Data 2:
text <- c("Hi. It is 2017. I am 35 years old.")

How to separate values in a string only after the second space

I have a string of names, for example:
st <- 'IKE IROEGBU NIMROD LEVI KYLE GIBSON CHAVAUGHN LEWIS BRYCE WASHINGSON'
and I want the output to be a vector like this:
c('IKE IROEGBU', 'NIMROD LEVI', 'KYLE GIBSON', 'CHAVAUGHN LEWIS', 'BRYCE WASHINGSON')
how can I do this?

You can do:
st <- 'IKE IROEGBU NIMROD LEVI KYLE GIBSON CHAVAUGHN LEWIS BRYCE WASHINGSON'
c(stringr::str_match_all(st, "\\S+\\s\\S+")[[1]])
#> [1] "IKE IROEGBU" "NIMROD LEVI" "KYLE GIBSON" "CHAVAUGHN LEWIS"
#> [5] "BRYCE WASHINGSON"

Another way, without regular expressions:
sst <- strsplit(st, " ")[[1]]
paste(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
# [1] "IKE IROEGBU" "NIMROD LEVI" "KYLE GIBSON" "CHAVAUGHN LEWIS" "BRYCE WASHINGSON"

Another way:
unlist(strsplit(st, " ")) -> names
i = 0
while (i < length(names) / 2) {
print(
paste0(names[1:2 + i * 2], collapse = " ")
)
i = i + 1
}
# [1] "IKE IROEGBU"
# [1] "NIMROD LEVI"
# [1] "KYLE GIBSON"
# [1] "CHAVAUGHN LEWIS"
# [1] "BRYCE WASHINGSON"
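
The loop above prints each pair but does not return them. As a sketch of how the pairs could be collected into a vector, and generalized to groups of any size, the words can be split and then grouped by position. The function name group_words and its n argument are made up for illustration.

```r
# Hypothetical helper: split on single spaces, then paste every n
# consecutive words back together.
group_words <- function(s, n = 2) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  grp <- ceiling(seq_along(words) / n)  # 1, 1, 2, 2, 3, 3, ... for n = 2
  unname(vapply(split(words, grp), paste, character(1), collapse = " "))
}

st <- 'IKE IROEGBU NIMROD LEVI KYLE GIBSON CHAVAUGHN LEWIS BRYCE WASHINGSON'
pairs <- group_words(st)
pairs
# [1] "IKE IROEGBU"      "NIMROD LEVI"      "KYLE GIBSON"      "CHAVAUGHN LEWIS"
# [5] "BRYCE WASHINGSON"
```

If the number of words is not a multiple of n, the last group is simply shorter.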

Sequence of numbers by hyphen without hyphenating single occurrences

I want to generate readable number sequences (e.g. 1, 2, 3, 4 = 1-4), but for a set of data where each number in the sequence must have four digits (e.g. 99 = 0099 or 1 = 0001 or 1022 = 1022) AND where there are different letters in front of each number.
I was looking at the answer to this question, which managed to do almost exactly as I want with two caveats:
- If there is a stand-alone number that does not appear in a sequence, it will appear twice with a hyphen in between
- If there are several stand-alone numbers that do not appear in a sequence, they won't be included in the result
### Create Data Set ====
## Create the data for different tags. I'm only using two unique levels here, but in my dataset I've got
## 400+ unique levels.
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
## Combine data
my.seq1 <- c(FM, SC)
## Sort data by number in sequence
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
### Attempt Number Sequencing ====
## Get the letters
sp.tags <- substr(my.seq1, 1, 2)
## Get the readable number sequence
my.seq2 <- lapply(split(my.seq1, sp.tags), ## Split data by the tag ID
  function(x){
    ## Get the run lengths as per the previous answer
    rl <- rle(c(1, pmin(diff(as.numeric(substr(x, 3, 7))), 2)))
    ## Generate number sequence by separator as per the previous answer
    seq2 <- paste0(x[c(1, cumsum(rl$lengths))], c("-", ",")[rl$values], collapse="")
    return(substr(seq2, 1, nchar(seq2)-1))
  })
## Combine lists and sort elements
my.seq2 <- unlist(strsplit(do.call(c, my.seq2), ","))
my.seq2 <- my.seq2[order(substr(my.seq2, 3, 7))]
names(my.seq2) <- NULL
my.seq2
[1] "FM0001-FM0001" "SC0002-SC0004" "FM0016-FM0019" "FM0028" "SC0039"
my.seq1
[1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021"
[13] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
The major problems with this are:
- Some values are completely missing from the data set (e.g. FM0021, FM0024, FM0026)
- The first number in the sequence (FM0001) appears with a hyphen in between
I feel like I'm getting warmer by using A5C1D2H2I1M1N2O1R2T1's answer to utilize seqToHumanReadable because it's quite elegant AND solves both problems. Two more problems are that I'm not able to tag the ID before each number and can't force the number of digits to four (e.g. 0004 becomes 4).
library(R.utils)
lapply(split(my.seq1, sp.tags), function(x){
return(unlist(strsplit(seqToHumanReadable(substr(x, 3, 7)), ',')))
})
$FM
[1] "1" " 16-19" " 21" " 24" " 26" " 28"
$SC
[1] "2-4" " 10" " 12" " 14" " 33" " 36" " 39"
Ideally the result would be:
"FM0001, SC0002-SC0004, SC0010, SC0012, SC0014, FM0016-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"
Any ideas? It's one of those things that's really simple to do by hand but would take blinking ages, and you'd think a function would exist for it but I haven't found it yet or it doesn't exist :(

This should do it:
# get the prefix/tag and number
# ([A-Za-z] is used because [A-z] also matches some punctuation characters)
tag <- gsub("(^[A-Za-z]+)(.+)", "\\1", my.seq1)
num <- gsub("([A-Za-z]+)(\\d+$)", "\\2", my.seq1)
# get a sequence id
n <- length(tag)
do_match <- c(FALSE, diff(as.numeric(num)) == 1 & tag[-1] == tag[-n])
seq_id <- cumsum(!do_match) # a sequence id
# tapply to combine the result
res <- setNames(tapply(my.seq1, seq_id, function(x)
if(length(x) < 2)
return(x)
else
paste(x[1], x[length(x)], sep = "-")), NULL)
# show the result
res
#R> [1] "FM0001" "SC0002-SC0004" "SC0010" "SC0012" "SC0014" "FM0016-FM0019" "FM0021"
#R> [8] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
# compare with
my.seq1
#R> [1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021" "FM0024"
#R> [14] "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
Data
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
my.seq1 <- c(FM, SC)
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
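
To get the single comma-separated string asked for in the question, the same run-detection idea can be wrapped in a function and the pieces pasted together. The following is a sketch only: the name collapse_runs is hypothetical, and it assumes a non-empty input that is already sorted by number, with IDs made of letters followed by digits.

```r
# Hypothetical wrapper around the run-detection approach: consecutive
# numbers with the same tag are collapsed to "first-last".
collapse_runs <- function(ids) {
  tag <- sub("^([A-Za-z]+).*$", "\\1", ids)      # letter prefix
  num <- as.numeric(sub("^[A-Za-z]+", "", ids))  # numeric part
  n <- length(ids)
  # TRUE when an element continues the run started by its predecessor
  continues <- c(FALSE, diff(num) == 1 & tag[-1] == tag[-n])
  run_id <- cumsum(!continues)
  runs <- tapply(ids, run_id, function(x)
    if (length(x) < 2) x else paste(x[1], x[length(x)], sep = "-"))
  paste(unname(runs), collapse = ", ")
}

FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
ids <- c(FM, SC)
ids <- ids[order(substr(ids, 3, 7))]

readable <- collapse_runs(ids)
```

Because the original strings are reused for the run endpoints, the tags and the four-digit zero padding are preserved.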

Create a list with named values by applying a function to each row of a data frame

I'm trying to get a list where each element has a name, by applying a function to each row of a data frame, but can't get the right output.
Assuming this is the function that I want to apply to each row:
format_setup_name <- function(m, v, s) {
a <- list()
a[[paste(m, "machines and", v, s, "GB volumes")]] <- paste(num_machines,num_volumes,vol_size,sep="-")
a
}
If this is the input data frame:
df <- data.frame(m=c(1,2,3), v=c(3,3,3), s=c(15,20,30))
I can't get a list that looks like:
$`1-3-15`
[1] "1 machines and 3 15 GB volumes"
$`2-3-20`
[1] "2 machines and 3 20 GB volumes"
$`3-3-30`
[1] "3 machines and 3 30 GB volumes"
Can someone give me hints how to do it?
Why do I need this? Well, I want to populate selectizeInput in shiny using values coming from the database. Since I'm combining several columns, I need a way to match the selected input with the values.

This is a good use case for setNames, which returns an object with its names() attribute set. Also, if you use as.list, you can do this in just one line without any looping:
setNames(as.list(paste(df$m, ifelse(df$m == 1, "machine", "machines"), "and", df$v, df$s, "GB volumes")), paste(df$m,df$v,df$s,sep="-"))
# $`1-3-15`
# [1] "1 machine and 3 15 GB volumes"
#
# $`2-3-20`
# [1] "2 machines and 3 20 GB volumes"
#
# $`3-3-30`
# [1] "3 machines and 3 30 GB volumes"

Thomas has already found a pretty neat solution to your problem (and in one line, too!). But I'll just show you how you could have succeeded with the approach you first tried:
# We'll use the same data, this time called "dat" (I avoid calling
# objects `df` because `df` is also a function's name)
dat <- data.frame(m = c(1,2,3), v = c(3,3,3), s = c(15,20,30))
format_setup_name <- function(m, v, s) {
a <- list() # initialize the list, all is well up to here
# But here we'll need a loop to assign in turn each element to the list
for(i in seq_along(m)) {
a[[paste(m[i], v[i], s[i], sep="-")]] <-
paste(m[i], "machines and", v[i], s[i], "GB volumes")
}
return(a)
}
Note that what goes inside the brackets is the name of the element, while what's at the right side of the <- is the content to be assigned, not the inverse as your code was suggesting.
So let's try it:
my.setup <- format_setup_name(dat$m, dat$v, dat$s)
my.setup
# $`1-3-15`
# [1] "1 machines and 3 15 GB volumes"
#
# $`2-3-20`
# [1] "2 machines and 3 20 GB volumes"
#
# $`3-3-30`
# [1] "3 machines and 3 30 GB volumes"
Everything seems nice. Just one thing to note: with the $ operator, you'll need to use single or double quotes to access individual items by their names:
my.setup$"1-3-15" # my.setup$1-3-15 won't work
# [1] "1 machines and 3 15 GB volumes"
my.setup[['1-3-15']] # equivalent
# [1] "1 machines and 3 15 GB volumes"
Edit: lapply version
Since loops have really fallen out of favor, here's a version with lapply:
format_setup_name <- function(m, v, s) {
a <- lapply(seq_along(m), function(i) paste(m[i], "machines and", v[i], s[i], "GB volumes"))
names(a) <- paste(m, v, s, sep="-")
return(a)
}
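
The question mentions needing to match a selected name back to the underlying values. One way to do that, sketched here with made-up names (keys, lookup_row), is to build the same "m-v-s" key for every row and use match() to find the row for a selected key. This assumes the keys are unique.

```r
dat <- data.frame(m = c(1, 2, 3), v = c(3, 3, 3), s = c(15, 20, 30))

# One key per row, built the same way as the list names above
keys <- paste(dat$m, dat$v, dat$s, sep = "-")

# Hypothetical helper: return the row of `data` whose key equals `key`
lookup_row <- function(key, data, keys) {
  data[match(key, keys), , drop = FALSE]
}

sel <- lookup_row("2-3-20", dat, keys)
sel$s
# [1] 20
```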

From timespan (for example "15 min" or "2 sec") to "00:15:00" or "00:00:02"

I have been searching all over the help for an R function that would convert a timespan, for example "15 min" or "1 hour" or "6 sec" or "1 day", into a datetime object like "00:15:00" or "01:00:00" or "00:00:06" or "1960-01-02 00:00:00" (not sure about this last one). I am sure a function like this exists, or that there is a neat way to avoid programming it...
To be more specific I would like to do something like this (using made up function name transform.span.to.time):
library(chron)
times(transform.span.to.time("15 min"))
which should yield the same result as
times("00:15:00")
Does a function like transform.span.to.time("15 min"), which returns something like "00:15:00", exist, or is there a trick for doing that?

We will assume a single space separating the numbers and units, that the hour and minute units are each followed by a space, and that there is no trailing space after the "secs" unit. This will handle mixed units:
library(chron)
test <- "0 hours 15 min 0 secs"
transform.span <- function(test){
  testh <- if(!grepl(" hour | hours ", test)) {
    # no hours: prepend a zero hour field
    sub("^", "0:", test)
  } else {
    sub(" hour | hours ", ":", test)
  }
  testm <- if(!grepl(" min | minutes ", testh)) {
    # no minutes: insert a zero minute field after the hours
    sub(":", ":0:", testh)
  } else {
    sub(" min | minutes ", ":", testh)
  }
  test.s <- if(!grepl(" sec| secs| seconds", testm)) {
    # no seconds: append a zero seconds field
    paste0(testm, "0")
  } else {
    sub(" sec| secs| seconds", "", testm)
  }
  return(times(test.s))
}
### Use
> transform.span(test)
[1] 00:15:00
> test2 <- "21 hours 15 min 38 secs"
> transform.span(test2)
[1] 21:15:38

The first solution uses strapply in the gsubfn package and transforms to days, e.g. 1 hour is 1/24th of a day. The second solution transforms to an R expression which calculates the number of days and then evaluates it.
library(gsubfn)
library(chron)
unit2days <- function(d, u)
as.numeric(d) * switch(tolower(u), s = 1, m = 60, h = 3600)/(24 * 3600)
transform.span.to.time <- function(x)
sapply(strapply(x, "(\\d+) *(\\w)", unit2days), sum)
Here is a second solution:
library(chron)
transform.span.to.time2 <- function(x) {
x <- paste(x, 0)
x <- sub("h\\w*", "*3600+", x, ignore.case = TRUE)
x <- sub("m\\w*", "*60+", x, ignore.case = TRUE)
x <- sub("s\\w*", "+", x, ignore.case = TRUE)
unname(sapply(x, function(x) eval(parse(text = x)))/(24*3600))
}
Tests:
> x <- c("12 hours 3 min 1 sec", "22h", "18 MINUTES 23 SECONDS")
>
> times(transform.span.to.time(x))
[1] 12:03:01 22:00:00 00:18:23
>
> times(transform.span.to.time2(x))
[1] 12:03:01 22:00:00 00:18:23
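
If only the "HH:MM:SS" string is needed, the same unit arithmetic can be sketched in base R without gsubfn or chron. The function name span_to_hms is made up for illustration. Like unit2days above, it looks only at the first letter of each unit (h, m or s), so it assumes "m" always means minutes and does not handle days.

```r
# Hypothetical base R variant: sum the seconds for every "<number> <unit>"
# pair, then format the total as HH:MM:SS.
span_to_hms <- function(x) {
  unit_secs <- c(s = 1, m = 60, h = 3600)
  parse_one <- function(s) {
    tok  <- regmatches(s, gregexpr("[0-9]+ *[A-Za-z]", s))[[1]]
    n    <- as.numeric(sub(" *[A-Za-z]$", "", tok))  # the number
    u    <- tolower(substring(tok, nchar(tok)))      # first letter of the unit
    secs <- as.integer(sum(n * unit_secs[u]))
    sprintf("%02d:%02d:%02d", secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
  }
  vapply(x, parse_one, character(1), USE.NAMES = FALSE)
}

span_to_hms(c("15 min", "1 hour", "6 sec", "12 hours 3 min 1 sec"))
# [1] "00:15:00" "01:00:00" "00:00:06" "12:03:01"
```

The resulting strings can be passed to chron's times() as in the question.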

The base function ?cut.POSIXt does this work for a specified set of values for breaks:
breaks: a vector of cut points _or_ number giving the number of
intervals which ‘x’ is to be cut into _or_ an interval
specification, one of ‘"sec"’, ‘"min"’, ‘"hour"’, ‘"day"’,
‘"DSTday"’, ‘"week"’, ‘"month"’, ‘"quarter"’ or ‘"year"’,
optionally preceded by an integer and a space, or followed by
‘"s"’. For ‘"Date"’ objects only ‘"day"’, ‘"week"’,
‘"month"’, ‘"quarter"’ and ‘"year"’ are allowed.
See the source code by typing in cut.POSIXt, the relevant section starts with this:
else if (is.character(breaks) && length(breaks) == 1L) {
You could adapt the code in this section to work for your needs.

You can define the time span with difftime:
span2time <- function(span, units = c('mins', 'secs', 'hours')) {
span.dt <- as.difftime(span, units = match.arg(units))
format(as.POSIXct("1970-01-01") + span.dt, "%H:%M:%S")
}
For example:
> span2time(15)
[1] "00:15:00"
EDIT: modified to produce character string acceptable to chron's times.

@DWin: thank you.
Based on DWin's example, I rearranged the code a bit, and here is the result:
transform.span<-function(timeSpan) {
timeSpanH <- if(!grepl(" hour | hours | hour| hours|hour |hours |hour|hours", timeSpan)) {
# First consequent if no hours
sub("^", "00:", timeSpan)
} else {
sub(" hour | hours | hour| hours|hour |hours |hour|hours", ":", timeSpan)
}
timeSpanM <- if(!grepl( " min | minutes | min| minutes|min |minutes |min|minutes", timeSpanH)) {
# no minutes: insert a zero minute field after the hour field, so that
# an input such as "1 hour" becomes "1:00:" rather than "00:1:"
sub(":", ":00:", timeSpanH)
} else{
sub(" min | minutes | min| minutes|min |minutes |min|minutes", ":", timeSpanH)
}
timeSpanS <- if(!grepl( " sec| secs| seconds|sec|secs|seconds", timeSpanM)) {
# first consequent if no seconds
paste(timeSpanM, "00", sep="")
} else{
sub(" sec| secs| seconds|sec|secs|seconds", "", timeSpanM)
}
return(timeSpanS)
}
### Use
test <- "1 hour 2 min 1 sec"
times(transform.span(test))
test1hour <- "1 hour"
times(transform.span(test1hour))
test15min <- "15 min"
times(transform.span(test15min))
test4sec <- "4 sec"
times(transform.span(test4sec))
