Replace specific pattern (shortening notations) by full notation in R - r

I have a data frame of short forms, for example:
Ann-e/i is the short form for Anne and Anni.
How can I replace the pattern -e/i in the data frame by the full notations?
Another example is Matt-e/i for Matte and Matti.
Thanks in advance for any help!

x <- c("Ann-e/i", "Matt-e/i")
gsub("(^[a-zA-Z]+?)-([a-z])/([a-z])$", "\\1\\2 and \\1\\3", x)
[1] "Anne and Anni" "Matte and Matti"
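Since the question mentions a data frame, here is a minimal sketch of applying the same replacement to a column. The data frame `df` and its column `name` are made up for illustration; the pattern is a slightly simplified variant of the one above, and entries without the short notation are left unchanged.

```r
# Hypothetical data frame; the column name "name" is an assumption.
df <- data.frame(name = c("Ann-e/i", "Matt-e/i", "Annerl"),
                 stringsAsFactors = FALSE)

# Expand "stem-x/y" into "stemx and stemy"; non-matching values pass through.
df$name <- sub("^([A-Za-z]+)-([a-z])/([a-z])$", "\\1\\2 and \\1\\3", df$name)

df$name
# [1] "Anne and Anni"   "Matte and Matti" "Annerl"
```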

Wimpel's suggestion using gsub from base R works well and is quite flexible. Another approach is provided by the package stringr from the tidyverse, which might be more intuitive.
library(stringr)
strings <- c("Ann-e/i", "Annerl", "Matt-e/i")
str_replace(strings, "(\\w+)-e/i", "\\1i or \\1e")
#> [1] "Anni or Anne" "Annerl" "Matti or Matte"
Created on 2021-11-08 by the reprex package (v2.0.1)
You'll find it helpful to learn about regular expressions (regex), if you're not already familiar with them. Since there are several varieties of regex with different syntax, here's a link that is specific to using it with stringr. https://stringr.tidyverse.org/articles/regular-expressions.html

If you have comma-separated values, you can do either of these, depending on your desired outcome:
Data:
string <- c("Annerl,Ann-e/i", "Matt-e/i")
First solution:
sub("(^\\w+)-(\\w)/(\\w)$", "\\1\\2 and \\1\\3", unlist(strsplit(string, ",")))
# [1] "Annerl" "Anne and Anni" "Matte and Matti"
Second:
c(sub("(^\\w+),(\\w+)-(\\w)/(\\w)$", "\\1, \\2\\3 and \\2\\4", string[grepl(",", string)]),
sub("(^\\w+)-(\\w)/(\\w)$", "\\1\\2 and \\1\\3", string[grep(",", string, invert = TRUE)]))
# [1] "Annerl, Anne and Anni" "Matte and Matti"


R turn 6 digit number into HMS i.e. "130041" into "13:00:41"

As the question states, I want to turn "130041" into "13:00:41" i.e. HMS data
lubridate::ymd("20220413") works without problems, but lubridate::hms("130041") does not.
I assume there should be a reasonably simple solution?!
Thank you.
If you need the output as a lubridate Period object, rather than a character vector, as you need to perform operations on it, you can use the approach suggested by Tim Biegeleisen of adding colon separators to the character vector and then using lubridate:
x <- "130041"
gsub("(\\d{2})(?!$)", "\\1:", x, perl = TRUE) |>
lubridate::hms()
# [1] "13H 0M 41S"
The output is similar but it is a Period object. I used a slightly different regex as well (add a colon when there are two digits not followed by the end of string) but it is fundamentally the same approach.
You could use sub here:
x <- "130041"
output <- sub("(\\d{2})(\\d{2})(\\d{2})", "\\1:\\2:\\3", x)
output
[1] "13:00:41"
Another regex, which also works when the hour uses only one digit:
gsub("(?=(..){1,2}$)", ":", c("130041", "30041"), perl=TRUE)
#[1] "13:00:41" "3:00:41"
The result can then be used in, e.g., lubridate::hms or hms::as_hms.
In base::as.difftime a format could be given: as.difftime("130041", "%H%M%S")
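If the single-digit-hour case should come out zero-padded instead, one possible sketch is to pad to six digits first and then reuse the sub approach from above. The helper name `to_hms_string` is made up for illustration, and it assumes the input contains only digits.

```r
# Hypothetical helper: pad to 6 digits, then insert the colon separators.
to_hms_string <- function(x) {
  padded <- sprintf("%06d", as.integer(x))  # "30041" becomes "030041"
  sub("(\\d{2})(\\d{2})(\\d{2})", "\\1:\\2:\\3", padded)
}

to_hms_string(c("130041", "30041"))
# [1] "13:00:41" "03:00:41"
```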

How to delete all data after the third colon in strings/ substrings in R?

So I have a series of about 200,000 data points that look like this: DATA:abc:de123fg:12ghk8d and DATA:ghi:56kdv:128485hg. The only identifying data that I need to look at is before the third colon. I want to remove everything after the third colon so I can aggregate unique identifiers from the rest of the substring.
So far, I have attempted to use str_remove_all and gsub to remove everything after the third colon. The problem with this is that sometimes the data points are grouped together in the same string like this:
DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d
So str_remove_all is just removing the end of the last substring and it ends up looking like this:
DATA:ghi:56kdv:128485hg|DATA:abc:de123fg
Does anyone know how I can accomplish this task? Thanks in advance.
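Here's an option in base R with regmatches and regexpr. Note that this pattern keeps only the text up to the second colon of the first entry in each string: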
str <- c("DATA:abc:de123fg:12ghk8d", "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")
regmatches(str, regexpr("[^:]*:[^:]*", str))
#> [1] "DATA:abc" "DATA:ghi"
And the corresponding solution in stringr, if you prefer:
library(stringr)
str_extract(str, "[^:]*:[^:]*")
#> [1] "DATA:abc" "DATA:ghi"
Created on 2019-12-03 by the reprex package (v0.3.0)
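To keep everything before the third colon in every |-separated entry, as the question asks, a sketch along these lines may work. It assumes entries are separated only by | and that no field contains a | or a colon; the variable names `ids`, `trimmed` and `unique_ids` are made up for illustration.

```r
ids <- c("DATA:abc:de123fg:12ghk8d",
         "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")

# Capture the first three colon-separated fields of each entry, then drop
# the third colon and everything up to the next "|" (or the end of string).
trimmed <- gsub("((?:[^:|]*:){2}[^:|]*):[^|]*", "\\1", ids, perl = TRUE)
trimmed
# [1] "DATA:abc:de123fg"                "DATA:ghi:56kdv|DATA:abc:de123fg"

# Split grouped entries to collect the unique identifiers.
unique_ids <- unique(unlist(strsplit(trimmed, "|", fixed = TRUE)))
unique_ids
# [1] "DATA:abc:de123fg" "DATA:ghi:56kdv"
```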

Partial string extraction with stringr - getting NA

I'm trying to extract part of a string using stringr.
I'm aiming for the output to be E5_1_C33 and E5_1_C23, but instead I'm getting NA.
Any help would be appreciated!
library(stringr)
mystring <- c("can_ComplianceWHOInfrastructurePol_E5_1_C33","can_ComplianceWHOInfrastructurePol_E5_1_C23")
str_extract(mystring, "A\\d_\\d_B\\d\\d$")
I slightly modified your line, since you need to match any letter, not only the literal A and B:
str_extract(mystring, "[A-Za-z]\\d_\\d_[A-Za-z]\\d\\d$")
Here's a base R approach using gsub:
gsub(".*(\\w{2}_\\w{1}_\\w{3})$", "\\1", mystring)
[1] "E5_1_C33" "E5_1_C23"
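For comparison, a base R sketch that extracts the match directly instead of replacing around it. The original pattern returned NA because A and B are matched literally, while the strings contain E and C; the character classes below accept any ASCII letter. The variable name `codes` is made up for illustration, and note that regmatches drops elements without a match rather than returning NA.

```r
mystring <- c("can_ComplianceWHOInfrastructurePol_E5_1_C33",
              "can_ComplianceWHOInfrastructurePol_E5_1_C23")

# letter, digit, "_", digit, "_", letter, two digits, anchored at the end
pattern <- "[A-Za-z][0-9]_[0-9]_[A-Za-z][0-9]{2}$"
codes <- regmatches(mystring, regexpr(pattern, mystring))
codes
# [1] "E5_1_C33" "E5_1_C23"
```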

stringr equivalent to grep

Is there a stringr equivalent to base R's grep function?
I want to have the index of the string that matches. Example:
grep("F|Y", LETTERS)
[1] 6 25
With stringr my workaround would be using which as follows:
which(str_detect(LETTERS, "F|Y"))
[1] 6 25
Sorry for the late answer but it might be helpful for future visitors:
Now you can use str_which(string, pattern) which is a wrapper around which(str_detect(string, pattern)) and equivalent to grep(pattern, string).
str_which(LETTERS, "F|Y")
[1] 6 25
More details at: http://stringr.tidyverse.org/reference/str_subset.html
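With the newer stringr releases, str_like is also available. Note that it matches SQL LIKE patterns (with % and _ as wildcards) against the whole string rather than regular expressions, so alternation with | is not supported; match each value separately instead:
which(str_like(LETTERS, "F") | str_like(LETTERS, "Y"))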
Read more about the stringr updates that are linked below.
Hope this helps everyone.

extract text from string in R

I have a lot of strings that all look similar, e.g.:
x1= "Aaaa_11111_AA_Whatiwant.txt"
x2= "Bbbb_11111_BBBB_Whatiwanttoo.txt"
x3= "Ccc_22222_CC_Whatiwa.txt"
I would like to extract the: Whatiwant, Whatiwanttoo, and the Whatiwa in R.
I started with substring(x1,15,23), but I don't know how to generalize it. How can I always extract the part between the last _ and the .txt ?
Thank you!
You can use regexp capture groups:
gsub(".*_([^_]*)\\.txt","\\1",x1)
You can also use the stringr library, with functions like str_extract (among many other possibilities), if you'd rather not get deep into regular expressions. It is extremely easy to use:
x1= "Aaaa_11111_AA_Whatiwant.txt"
x2= "Bbbb_11111_BBBB_Whatiwanttoo.txt"
x3= "Ccc_22222_CC_Whatiwa.txt"
library(stringr)
patron <- "(What)[a-z]+"
str_extract(x1, patron)
## [1] "Whatiwant"
str_extract(x2, patron)
## [1] "Whatiwanttoo"
str_extract(x3, patron)
## [1] "Whatiwa"
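The pattern above relies on every target starting with "What". For the general case asked about (the part between the last _ and the extension), one possible base R sketch is to strip the extension first and then remove everything up to the last underscore. The vector names `x` and `parts` are made up for illustration.

```r
x <- c("Aaaa_11111_AA_Whatiwant.txt",
       "Bbbb_11111_BBBB_Whatiwanttoo.txt",
       "Ccc_22222_CC_Whatiwa.txt")

# Drop the file extension, then greedily remove everything up to the last "_".
parts <- sub(".*_", "", tools::file_path_sans_ext(x))
parts
# [1] "Whatiwant"    "Whatiwanttoo" "Whatiwa"
```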
