Extract text from string in R

I have a lot of strings that all look similar, e.g.:
x1= "Aaaa_11111_AA_Whatiwant.txt"
x2= "Bbbb_11111_BBBB_Whatiwanttoo.txt"
x3= "Ccc_22222_CC_Whatiwa.txt"
I would like to extract Whatiwant, Whatiwanttoo, and Whatiwa in R.
I started with substring(x1,15,23), but I don't know how to generalize it. How can I always extract the part between the last _ and the .txt ?
Thank you!

You can use regexp capture groups:
gsub(".*_([^_]*)\\.txt","\\1",x1)
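Since sub is vectorized, the same capture-group idea can be applied to all the strings at once. This is only a sketch: the vector name x and the added ^ and $ anchors are my own assumptions, not part of the answer above.

```r
# Hypothetical vector holding the three example strings from the question
x <- c("Aaaa_11111_AA_Whatiwant.txt",
       "Bbbb_11111_BBBB_Whatiwanttoo.txt",
       "Ccc_22222_CC_Whatiwa.txt")

# ".*_" is greedy, so it runs up to the LAST underscore;
# "([^_]*)" captures everything after it, up to the literal ".txt"
res <- sub("^.*_([^_]*)\\.txt$", "\\1", x)
print(res)
# [1] "Whatiwant"    "Whatiwanttoo" "Whatiwa"
```

Only one substitution per string is needed here, so sub is enough; gsub gives the same result.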

You can also use the stringr package, with functions like str_extract (among many other possibilities), if you would rather not get deep into regular expressions. It is extremely easy to use. Note that the pattern below relies on every wanted part starting with "What":
x1= "Aaaa_11111_AA_Whatiwant.txt"
x2= "Bbbb_11111_BBBB_Whatiwanttoo.txt"
x3= "Ccc_22222_CC_Whatiwa.txt"
library(stringr)
patron <- "(What)[a-z]+"
str_extract(x1, patron)
## [1] "Whatiwant"
str_extract(x2, patron)
## [1] "Whatiwanttoo"
str_extract(x3, patron)
## [1] "Whatiwa"
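If the wanted part does not always start with "What", a lookahead can pick out whatever sits between the last underscore and ".txt". Here is a minimal base R sketch, so it runs without extra packages; the vector name x is an assumption, and the same pattern should also work when passed to str_extract.

```r
# Hypothetical vector holding the three example strings from the question
x <- c("Aaaa_11111_AA_Whatiwant.txt",
       "Bbbb_11111_BBBB_Whatiwanttoo.txt",
       "Ccc_22222_CC_Whatiwa.txt")

# "[^_]+" matches a run of characters without underscores;
# the lookahead "(?=\\.txt$)" requires ".txt" at the end of the string
# but does not include it in the match
res <- regmatches(x, regexpr("[^_]+(?=\\.txt$)", x, perl = TRUE))
print(res)
# [1] "Whatiwant"    "Whatiwanttoo" "Whatiwa"
```

Keep in mind that regmatches silently drops strings with no match, so the result can be shorter than the input.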

Related

R turn 6 digit number into HMS i.e. "130041" into "13:00:41"

As the question states, I want to turn "130041" into "13:00:41" i.e. HMS data
lubridate::ymd("20220413") works without problems, but lubridate::hms("130041") does not.
I assume there should be a reasonably simple solution?!
Thank you.
If you need the output as a lubridate Period object rather than a character vector (for example, because you need to perform operations on it), you can use the approach suggested by Tim Biegeleisen: add colon separators to the character vector and then pass the result to lubridate:
x <- "130041"
gsub("(\\d{2})(?!$)", "\\1:", x, perl = TRUE) |>
  lubridate::hms()
# [1] "13H 0M 41S"
The output is similar but it is a Period object. I used a slightly different regex as well (add a colon when there are two digits not followed by the end of string) but it is fundamentally the same approach.
You could use sub here:
x <- "130041"
output <- sub("(\\d{2})(\\d{2})(\\d{2})", "\\1:\\2:\\3", x)
output
[1] "13:00:41"
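The same sub call can be wrapped in a small helper that also copes with inputs that lost their leading zero. This is just a sketch: to_hms_string is a made-up name, and it assumes every input is a whole number of at most six digits.

```r
# Hypothetical helper: left-pad to six digits, then insert the colons
to_hms_string <- function(x) {
  padded <- sprintf("%06d", as.integer(x))  # "30041" becomes "030041"
  sub("^(\\d{2})(\\d{2})(\\d{2})$", "\\1:\\2:\\3", padded)
}

res <- to_hms_string(c("130041", "30041"))
print(res)
# [1] "13:00:41" "03:00:41"
```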
Another regex, which will also work when the hour uses only one digit:
gsub("(?=(..){1,2}$)", ":", c("130041", "30041"), perl=TRUE)
#[1] "13:00:41" "3:00:41"
The result can then be used in, e.g., lubridate::hms or hms::as_hms.
With base::as.difftime, a format can be given directly: as.difftime("130041", "%H%M%S")
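For completeness, here is a base R sketch that avoids regex entirely by parsing the digits with a format string. The variable names are my own, and passing units = "secs" is my choice so that the printed number is easy to check by hand.

```r
x <- "130041"

# Parse as a time of day, then print it back with colon separators
hms_string <- format(strptime(x, format = "%H%M%S"), "%H:%M:%S")
print(hms_string)
# [1] "13:00:41"

# as.difftime with explicit units gives a duration you can do arithmetic on
d <- as.difftime(x, format = "%H%M%S", units = "secs")
print(as.numeric(d))
# [1] 46841
```

Here 46841 is simply 13 * 3600 + 0 * 60 + 41.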

Replace specific pattern (shortening notations) by full notation in R

I have a data frame of short forms like
Ann-e/i is the short form for Anne and Anni
How can I replace the pattern -e/i in the data frame by the full notations?
Another example is Matt-e/i for Matte and Matti.
Thanks in advance for any help!
x <- c("Ann-e/i", "Matt-e/i")
gsub("(^[a-zA-Z]+?)-([a-z])/([a-z])$", "\\1\\2 and \\1\\3", x)
[1] "Anne and Anni" "Matte and Matti"
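To see what each capture group picks up, you can inspect the groups with regexec before building the replacement. This is only an illustrative sketch; I simplified the pattern slightly (anchor outside the group, no lazy quantifier), which is an assumption on my part and not the exact pattern above.

```r
x <- c("Ann-e/i", "Matt-e/i")

# Group 1: the stem, group 2: first ending, group 3: second ending
pattern <- "^([a-zA-Z]+)-([a-z])/([a-z])$"
groups <- regmatches(x, regexec(pattern, x))
print(groups[[1]])
# [1] "Ann-e/i" "Ann"     "e"       "i"

# Rebuild the full notation by hand from the captured pieces
full <- vapply(groups,
               function(g) paste0(g[2], g[3], " and ", g[2], g[4]),
               character(1))
print(full)
# [1] "Anne and Anni"   "Matte and Matti"
```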
Wimpel's suggestion using gsub from base R works well and is quite flexible. Another approach is provided by the package stringr from the tidyverse, which might be more intuitive.
library(stringr)
strings <- c("Ann-e/i", "Annerl", "Matt-e/i")
str_replace(strings, "(\\w+)-e/i", "\\1i or \\1e")
#> [1] "Anni or Anne" "Annerl" "Matti or Matte"
Created on 2021-11-08 by the reprex package (v2.0.1)
You'll find it helpful to learn about regular expressions (regex), if you're not already familiar with them. Since there are several varieties of regex with different syntax, here's a link that is specific to using it with stringr. https://stringr.tidyverse.org/articles/regular-expressions.html
If you have comma-separated values, you can do either of these, depending on your desired outcome:
Data:
string <- c("Annerl,Ann-e/i", "Matt-e/i")
First solution:
sub("(^\\w+)-(\\w)/(\\w)$", "\\1\\2 and \\1\\3", unlist(strsplit(string, ",")))
# [1] "Annerl" "Anne and Anni" "Matte and Matti"
Second:
c(sub("(^\\w+),(\\w+)-(\\w)/(\\w)$", "\\1, \\2\\3 and \\2\\4", string[grepl(",", string)]),
  sub("(^\\w+)-(\\w)/(\\w)$", "\\1\\2 and \\1\\3", string[grep(",", string, invert = TRUE)]))
# [1] "Annerl, Anne and Anni" "Matte and Matti"
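If the short forms can carry more than two endings, splitting may be easier to extend than a fixed set of capture groups. The sketch below is hypothetical: expand_short and the "Matt-e/i/a" input are made up for illustration, and it assumes a name contains at most one hyphen (so something like a hyphenated double name would be mangled).

```r
# Hypothetical helper: split "stem-a/b/c" into stem + each ending
expand_short <- function(s) {
  parts <- strsplit(s, "-", fixed = TRUE)[[1]]
  if (length(parts) != 2) return(s)            # no short form, return as-is
  endings <- strsplit(parts[2], "/", fixed = TRUE)[[1]]
  paste0(parts[1], endings)                    # stem is recycled over endings
}

expanded <- lapply(c("Ann-e/i", "Annerl", "Matt-e/i/a"), expand_short)
# expanded[[1]] is c("Anne", "Anni")
# expanded[[2]] is "Annerl"
# expanded[[3]] is c("Matte", "Matti", "Matta")
```

Each element can then be collapsed with paste(..., collapse = " and ") if you want a single string per entry.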

Regex to remove a tail after a date

I've looked around but I'm having trouble utilizing regex to remove the "*/tues" from a date variable using the sub() function.
All values in a$date look like this:
01/01/2017/Sun
01/03/2017/Tues
etc..
And I'm trying to do this:
sub(a$date,"*7/\\*","")
I'm sure I've just messed up the regex.
How do I escape the first two / and tell it only to delete the third one and everything after?
We can use
sub("\\/[^0-9]+$","", a$date)
#[1] "01/01/2017" "01/03/2017"
Or with substr
substr(a$date, 1, 10)
#[1] "01/01/2017" "01/03/2017"
data
a <- data.frame(date = c("01/01/2017/Sun", "01/03/2017/Tues"))
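Once the tail is gone, the result can be turned into a real Date. A minimal sketch, assuming month/day/year order (which is consistent with the weekdays in the sample data); the names stripped and parsed are my own.

```r
a <- data.frame(date = c("01/01/2017/Sun", "01/03/2017/Tues"))

# Drop the final slash and the non-digit weekday that follows it
stripped <- sub("/[^0-9/]+$", "", a$date)

# Assumption: the dates are month/day/year
parsed <- as.Date(stripped, format = "%m/%d/%Y")
print(format(parsed))
# [1] "2017-01-01" "2017-01-03"
```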
Another solution is simply to truncate your string:
library("stringr")
# keep the first 10 characters; ellipsis = "" stops str_trunc from replacing the end with "..."
a$date <- str_trunc(a$date, 10, ellipsis = "")

How to create a sequence of strings with different numbers in R

I just can't figure out how to create a vector in which the strings are constant but the numbers are not. For example:
c("raster[1]","raster[2]","raster[3]")
I'd like to use something like seq(raster[1],raster[99], by=1), but this does not work.
Thanks in advance.
The sprintf function should also work:
rasters <- sprintf("raster[%s]",seq(1:99))
head(rasters)
[1] "raster[1]" "raster[2]" "raster[3]" "raster[4]" "raster[5]" "raster[6]"
As suggested by Richard Scriven, %d is more efficient than %s. So, if you were working with a longer sequence, it would be more appropriate to use:
rasters <- sprintf("raster[%d]",seq(1:99))
We can do
paste0("raster[", 1:6, "]")
# [1] "raster[1]" "raster[2]" "raster[3]" "raster[4]" "raster[5]" "raster[6]"
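Both approaches can be wrapped in a small function if you need this for several prefixes. This is only a sketch; make_labels is a made-up name, and the zero-padded variant is an extra that may or may not be what you need.

```r
# Hypothetical helper: the prefix is recycled over the sequence 1..n
make_labels <- function(prefix, n) {
  sprintf("%s[%d]", prefix, seq_len(n))
}

labels <- make_labels("raster", 3)
print(labels)
# [1] "raster[1]" "raster[2]" "raster[3]"

# "%02d" pads with a leading zero, which keeps the labels sortable as text
padded <- sprintf("raster[%02d]", 1:3)
print(padded)
# [1] "raster[01]" "raster[02]" "raster[03]"
```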

Convert mathematical notation to string

The solution might be very simple, but I can't seem to figure it out easily. I have the following number:
a = 1000000
#> a
#[1] 1e+06
I would like to convert "a" to a string, but when I try using toString, it gives the following:
#> toString(a)
#[1] "1e+06"
I would like to get: 1,000,000 instead, with the comma separator. Is that easily feasible?
Thanks!
format(1e6, big.mark = ",", scientific = FALSE) or prettyNum(1000000, big.mark = ",", scientific = FALSE) should give you the desired result.
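A short sketch of how this looks in practice, plus formatC as another base R option that formats each element of a vector on its own. The variable names and the second value 2500 are my own additions for illustration.

```r
a <- 1000000

# scientific = FALSE suppresses the 1e+06 notation
s1 <- format(a, big.mark = ",", scientific = FALSE)
print(s1)
# [1] "1,000,000"

# formatC with format = "d" treats the values as whole numbers
s2 <- formatC(c(1000000, 2500), format = "d", big.mark = ",")
print(s2)
# [1] "1,000,000" "2,500"
```

Both return character strings, so do the formatting as the last step, after any arithmetic.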
