Converting IDs from three to four digits [duplicate] - r

I have the following data
GT-BU7867-09
GT-BU6523-113
GT-BU6452-1
GT-BU8921-12
How do I use R to pad the number after the last hyphen with leading zeros so that it always has three digits? The result should look like this:
GT-BU7867-009
GT-BU6523-113
GT-BU6452-001
GT-BU8921-012

Base solution:
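x <- c("GT-BU7867-09", "GT-BU6523-113", "GT-BU6452-1", "GT-BU8921-12")
# split on "-", zero-pad the third piece to width 3, and glue the pieces back together
sapply(strsplit(x, "-"), function(x)
  paste(x[1], x[2], sprintf("%03d", as.numeric(x[3])), sep = "-")
)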
Result:
[1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001" "GT-BU8921-012"

A solution using stringr and str_pad and strsplit
library(stringr)
x <- readLines(textConnection('GT-BU7867-09
GT-BU6523-113
GT-BU6452-1
GT-BU8921-12'))
unlist(lapply(strsplit(x,'-'),
              function(x){
                x[3] <- str_pad(x[3], width = 3, side = 'left', pad = '0')
                paste0(x, collapse = '-')}))
[1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001"
[4] "GT-BU8921-012"

Another version using str_pad and str_extract from package stringr
library(stringr)
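x <- str_replace(x, "[[:digit:]]+$", str_pad(str_extract(x, "[[:digit:]]+$"), 3, pad = "0"))
i.e. extract the trailing digits of each element of x, pad them to width 3 with zeros, then substitute them for the original trailing digits. str_replace is used here rather than gsub because gsub only uses the first element of a replacement vector, so every ID would receive the same suffix.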
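For a version that needs no extra packages, the same extract, pad, substitute idea can be sketched with regexpr and the regmatches<- replacement function. This assumes every element of x ends in digits; padded is just an illustrative name.

```r
x <- c("GT-BU7867-09", "GT-BU6523-113", "GT-BU6452-1", "GT-BU8921-12")

# locate the run of digits at the end of each string
m <- regexpr("[0-9]+$", x)

# work on a copy so the original vector stays untouched
padded <- x

# overwrite each match with its zero-padded, three-digit form
regmatches(padded, m) <- sprintf("%03d", as.integer(regmatches(x, m)))
padded
# [1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001" "GT-BU8921-012"
```

Converting to integer before formatting means an already padded suffix such as "009" is left as "009" rather than being padded twice.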

Related

Regex: Match first two digits of a four digit number

I have:
'30Jun2021'
I want to skip/remove the first two digits of the four digit number (or any other way of doing this):
'30Jun21'
I have tried:
^.{0,5}
https://regex101.com/r/hAJcdE/1
I have the first 5 characters but I have not figured out how to skip/remove the '20'
Manipulating datetimes is better using the dedicated date/time functions.
You can convert the variable to date and use format to get the output in any format.
x <- '30Jun2021'
format(as.Date(x, '%d%b%Y'), '%d%b%y')
#[1] "30Jun21"
You can also use lubridate::dmy(x) to convert x to date.
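One caveat with the date route: %b parses and prints month abbreviations according to the current locale, so "Jun" may not parse on a non-English setup. A minimal sketch that pins LC_TIME to "C" for the conversion and then restores it (the variable names are illustrative):

```r
x <- c("30Jun2021", "01Jan1999")

# remember the current setting, then switch to the "C" locale (English month names)
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

d   <- as.Date(x, format = "%d%b%Y")  # parse day, abbreviated month, 4-digit year
out <- format(d, "%d%b%y")            # print with a 2-digit year

# restore whatever locale was active before
invisible(Sys.setlocale("LC_TIME", old_locale))
out
# [1] "30Jun21" "01Jan99"
```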
You don't even need regex for this. Just use substring operations:
x <- '30Jun2021'
paste0(substr(x, 1, 5), substr(x, 8, 9))
[1] "30Jun21"
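The fixed positions above assume a two-digit day. If the day can be a single digit, a sketch that counts from the end of the string instead (assuming the year is always the last four characters) would be:

```r
x <- c("30Jun2021", "1Jul2021")
n <- nchar(x)

# keep everything before the 4-digit year, then append the last two digits of the year
short <- paste0(substr(x, 1, n - 4), substr(x, n - 1, n))
short
# [1] "30Jun21" "1Jul21"
```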
Use sub
sub('\\d{2}(\\d{2})$', "\\1", x)
[1] "30Jun21"
or with str_remove
library(stringr)
str_remove(x, "\\d{2}(?=\\d{2}$)")
[1] "30Jun21"
data
x <- '30Jun2021'
You could also match the format of the string with 2 capture groups, where you would match the part that you want to omit and capture what you want to keep.
\b(\d+[A-Z][a-z]+)\d\d(\d\d)\b
sub("\\b(\\d+[A-Z][a-z]+)\\d\\d(\\d\\d)\\b", "\\1\\2", "30Jun2021")
Output
[1] "30Jun21"

How to pad with zeroes to the string using regexp to get a length of 4 (from the beginning to the point)?

I have a vector:
x <- c("1. Ure.html", "15. Astra basta.html", "16. Mafa of Part 4.html", "16.1 Veka--Cons.pdf")
How do I get vector y using a regexp? I need to pad the leading number (the digits from the start of the string up to the first period) with zeros to a width of 4.
y <-c("0001. Ure.html", "0015. Astra basta.html", "0016. Mafa of Part 4.html", "0016.1 Veka--Cons.pdf")
An option is gsubfn
library(gsubfn)
gsubfn("^\\d+", ~ sprintf("%04d", as.numeric(x)), x)
#[1] "0001. Ure.html" "0015. Astra basta.html"
#[3] "0016. Mafa of Part 4.html" "0016.1 Veka--Cons.pdf"
We can use str_replace from stringr and pad the additional values with 0
library(stringr)
str_replace(x, "\\d+", function(m) str_pad(m, 4, pad = '0'))
#[1] "0001. Ure.html" "0015. Astra basta.html"
#[3] "0016. Mafa of Part 4.html" "0016.1 Veka--Cons.pdf"
This can also be achieved with sprintf, converting the match to an integer first (zero-padding with %s is platform-dependent and may pad with spaces instead)
str_replace(x, "\\d+", function(m) sprintf('%04d', as.integer(m)))
In base R, find the matches
m <- regexpr("^\\d+", x)
extract and coerce the matches to the desired format and update the match locations in the original vector
regmatches(x, m) <- sprintf("%04d", as.integer(regmatches(x, m)))
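Put together as a runnable sketch, here is the base R route with formatC doing the padding. Formatting an integer rather than a string avoids relying on the 0 flag with %s, which R's documentation describes as platform-dependent. It assumes every element starts with digits; the result is written to a copy named y to mirror the question.

```r
x <- c("1. Ure.html", "15. Astra basta.html",
       "16. Mafa of Part 4.html", "16.1 Veka--Cons.pdf")

y <- x
m <- regexpr("^[0-9]+", y)   # leading run of digits in each string

# replace each leading number with its zero-padded, width-4 form
regmatches(y, m) <- formatC(as.integer(regmatches(y, m)), width = 4, flag = "0")
y
# [1] "0001. Ure.html"            "0015. Astra basta.html"
# [3] "0016. Mafa of Part 4.html" "0016.1 Veka--Cons.pdf"
```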

substring replace nth positions R

I need to replace the characters in positions 6 to 8 with a single "_". I passed the start and stop positions to substring, but it didn't work:
> a=c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
> substring(a, 6,8) <- "_"
> a
[1] "UHI78_KJRH2V" "TYR32_FHASJKDG" "DHA92_NFSYFN34"
I need UHI78_RH2V TYR32_ASJKDG DHA92_SYFN34
Using sub, we can match on the pattern (?<=^.{5}).{3}, and then replace it by a single underscore:
a <- c("UHI786KJRH2V", "TYR324FHASJKDG","DHA927NFSYFN34")
out <- sub("(?<=^.{5}).{3}", "_", a, perl=TRUE)
out
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
We could also try doing substring operations here, but we would have to do some splicing:
out <- paste0(substr(a, 1, 5), "_", substr(a, 9, nchar(a)))
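If this comes up often, the splicing can be wrapped in a small helper. replace_positions is a made-up name for this sketch, not a function from base R or any package:

```r
# hypothetical helper: replace characters start..stop of each string with `replacement`
replace_positions <- function(s, start, stop, replacement) {
  paste0(substr(s, 1, start - 1),       # everything before the span
         replacement,                   # the new text, of any length
         substr(s, stop + 1, nchar(s))) # everything after the span
}

a <- c("UHI786KJRH2V", "TYR324FHASJKDG", "DHA927NFSYFN34")
res <- replace_positions(a, 6, 8, "_")
res
# [1] "UHI78_RH2V"   "TYR32_ASJKDG" "DHA92_SYFN34"
```

Unlike substring<-, this does not require the replacement to be as long as the span it replaces.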
1) str_sub<-. The str_sub<- replacement function in the stringr package can do that.
library(stringr)
str_sub(a, 6, 8) <- "_"
a
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
2) Base R. With only base R you could do this. It replaces the entire string with the match to the first capture group, an underscore and the match to the second capture group.
sub("(.....)...(.*)", "\\1_\\2", a)
## [1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"
That regex could also be written as "(.{5}).{3}(.*)" .
3) separate/unite. If a is a column in a data frame then we could use dplyr and tidyr to do this:
library(dplyr)
library(tidyr)
DF <- data.frame(a)
DF %>%
  separate(a, into = c("pre", "junk", "post"), sep = c(5, 8)) %>%
  select(-junk) %>%
  unite(a)
giving:
a
1 UHI78_RH2V
2 TYR32_ASJKDG
3 DHA92_SYFN34
From the documentation:
If the portion to be replaced is longer than the replacement string, then only the portion the length of the string is replaced.
So we could do something like this:
substring(a, 6,8) <- "_##"
sub("#+", "", a)
[1] "UHI78_RH2V" "TYR32_ASJKDG" "DHA92_SYFN34"

Replace multiple symbols in a string differently in r

I am trying to recode values such as (5,10],(20,20] to 5-10%,20-20% using gsub. The opening parenthesis should be removed, the comma should become a dash, and the closing bracket should become %. All I have managed so far is
x<-c("(5,10]","(20,20]")
gsub("\\,","-",x)
Then the comma is changed to the dash. How can I change others as well?
Thanks.
Keeping it very simple, a set of gsubs.
x <- c("(5,10]","(20,20]")
x <- gsub(",", "-", x) # replace comma with dash
x <- gsub("\\(", "", x) # remove opening parenthesis
x <- gsub("]", "%", x) # replace ] with %
x
[1] "5-10%" "20-20%"
Here's another alternative:
> gsub("\\((\\d+),(\\d+)\\]", "\\1-\\2%", x)
[1] "5-10%" "20-20%"
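The \\d+ groups above only match whole numbers. If the bounds might contain decimals (an assumption, since the question only shows integers), a slightly looser sketch of the same capture-group idea is:

```r
x <- c("(5,10]", "(20,20]", "(2.5,7.5]")

# group 1: everything between "(" and the comma; group 2: everything up to the final "]"
out <- sub("^\\((.+),(.+)\\]$", "\\1-\\2%", x)
out
# [1] "5-10%"    "20-20%"   "2.5-7.5%"
```

The ^ and $ anchors mean a string that does not have the full "(a,b]" shape is returned unchanged instead of being partly rewritten.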
Other solution.
Using regmatches we extract all the numbers. We then combine every first and second number.
nrs <- regmatches(x, gregexpr("[[:digit:]]+", x))
nrs <- as.numeric(unlist(nrs))
i <- 1:length(nrs); i <- i[(i%%2)==1]
for(h in i){print(paste0(nrs[h],'-',nrs[h+1],'%'))}
[1] "5-10%"
[1] "20-20%"
Just for fun, an ugly one-liner:
sapply(regmatches(x, gregexpr("\\d+", x)), function(x) paste0(x[1], "-", x[2], "%"))
[1] "5-10%" "20-20%"

convert digits to special format

In my data processing, I need to do the following:
#convert '7-25' to '0007 0025'
#pad 0's to make each four-digit number
digits.formatter <- function ('7-25'){.......?}
I have no clue how to do that in R. Can anyone help?
In base R, split the character string (or vector of strings) at -, convert its parts to numeric, format the parts using sprintf, and then paste them back together.
sapply(strsplit(c("7-25", "20-13"), "-"), function(x)
paste(sprintf("%04d", as.numeric(x)), collapse = " "))
#[1] "0007 0025" "0020 0013"
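Wrapped into the digits.formatter function the question asks for, a minimal base R sketch could look like this. It assumes the input contains only non-negative integers separated by "-".

```r
digits.formatter <- function(x) {
  # split each string on the literal hyphen
  parts <- strsplit(x, "-", fixed = TRUE)
  # zero-pad every piece to width 4 and join the pieces with a space
  vapply(parts,
         function(p) paste(sprintf("%04d", as.integer(p)), collapse = " "),
         character(1))
}

res <- digits.formatter(c("7-25", "20-13"))
res
# [1] "0007 0025" "0020 0013"
```

vapply is used instead of sapply so the result is always a character vector, even for zero-length input.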
A solution with stringr:
library(stringr)
digits.formatter <- function(string){
  str_vec = str_split(string, "-")
  output = sapply(str_vec, function(x){
    str_padded = str_pad(x, width = 4, pad = "0")
    paste(str_padded, collapse = " ")
  })
  return(output)
}
digits.formatter(c('7-25', '8-30'))
# [1] "0007 0025" "0008 0030"
The pad= argument in str_pad specifies the character to pad with, whereas width= specifies the minimum width of the padded string. You can also use the optional side= argument to choose which side of the string is padded (the default is side = "left"). For example:
str_pad(1:5, width = 4, pad = "0", side = "right")
# [1] "1000" "2000" "3000" "4000" "5000"
We could do this with gsubfn
library(gsubfn)
gsubfn("(\\d+)", ~sprintf("%04d", as.numeric(x)), v1)
#[1] "0007-0025" "0020-0013"
If we don't need the -,
either use sub after the gsubfn
sub("-", " ", gsubfn("(\\d+)", ~sprintf("%04d", as.numeric(x)), v1))
#[1] "0007 0025" "0020 0013"
or directly use two capture groups in gsubfn
gsubfn("(\\d+)-(\\d+)", ~sprintf("%04d %04d", as.numeric(x), as.numeric(y)), v1)
#[1] "0007 0025" "0020 0013"
data
v1 <- c("7-25", "20-13")

Resources