In my data processing, I need to do the following:
#convert '7-25' to '0007 0025'
#pad 0's to make each four-digit number
digits.formatter <- function ('7-25'){.......?}
I have no clue how to do that in R. Can anyone help?
In base R, split the character string (or vector of strings) at -, convert its parts to numeric, format the parts using sprintf, and then paste them back together.
sapply(strsplit(c("7-25", "20-13"), "-"), function(x)
paste(sprintf("%04d", as.numeric(x)), collapse = " "))
#[1] "0007 0025" "0020 0013"
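The steps above can be wrapped into the function the question asks for. This is only a sketch: the name digits_formatter and its width argument are illustrative choices, not part of the original answer, and it assumes every piece between hyphens is a whole number.

```r
# Hypothetical helper: split on "-", zero-pad each part, rejoin with spaces.
digits_formatter <- function(x, width = 4) {
  fmt <- paste0("%0", width, "d")              # e.g. "%04d"
  parts <- strsplit(x, "-", fixed = TRUE)      # list of character vectors
  vapply(
    parts,
    function(p) paste(sprintf(fmt, as.integer(p)), collapse = " "),
    character(1)
  )
}

digits_formatter(c("7-25", "20-13"))
#> [1] "0007 0025" "0020 0013"
```

Using vapply with character(1) rather than sapply guarantees a character vector back, even for zero-length input.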
A solution with stringr:
library(stringr)
digits.formatter <- function(string){
  str_vec = str_split(string, "-")
  output = sapply(str_vec, function(x){
    str_padded = str_pad(x, width = 4, pad = "0")
    paste(str_padded, collapse = " ")
  })
  return(output)
}
digits.formatter(c('7-25', '8-30'))
# [1] "0007 0025" "0008 0030"
The pad= argument in str_pad specifies the character to pad with, whereas width= specifies the minimum width of the padded string. You can also use the optional argument side= to specify which side of the string to pad (defaults to side = "left"). For example:
str_pad(1:5, width = 4, pad = "0", side = "right")
# [1] "1000" "2000" "3000" "4000" "5000"
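The "minimum width" behaviour described above is not specific to str_pad. As a rough illustration in base R (the vector ids is made up for the example), sprintf and formatC also pad short values and leave values that are already wide enough untouched:

```r
# Illustrative input: two short numbers and one wider than the target width.
ids <- c(7L, 25L, 12345L)

padded_sprintf <- sprintf("%04d", ids)                  # C-style format string
padded_formatC <- formatC(ids, width = 4, flag = "0")   # same idea via formatC

padded_sprintf
#> [1] "0007"  "0025"  "12345"
padded_formatC
#> [1] "0007"  "0025"  "12345"
```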
We could do this with gsubfn
library(gsubfn)
gsubfn("(\\d+)", ~sprintf("%04d", as.numeric(x)), v1)
#[1] "0007-0025" "0020-0013"
If we don't need the -,
either use sub after the gsubfn
sub("-", " ", gsubfn("(\\d+)", ~sprintf("%04d", as.numeric(x)), v1))
#[1] "0007 0025" "0020 0013"
or directly use two capture groups in gsubfn
gsubfn("(\\d+)-(\\d+)", ~sprintf("%04d %04d", as.numeric(x), as.numeric(y)), v1)
#[1] "0007 0025" "0020 0013"
data
v1 <- c("7-25", "20-13")
Related
I have the following data
GT-BU7867-09
GT-BU6523-113
GT-BU6452-1
GT-BU8921-12
How do I use R to make the numbers after the hyphen to pad leading zeros so it will have three digits? The resulting format should look like this:
GT-BU7867-009
GT-BU6523-113
GT-BU6452-001
GT-BU8921-012
Base solution:
sapply(strsplit(x,"-"), function(x)
paste(x[1], x[2], sprintf("%03d",as.numeric(x[3])), sep="-")
)
Result:
[1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001" "GT-BU8921-012"
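If the identifiers may contain a different number of hyphen-separated fields, a variant that only touches the trailing digits could be sketched with regexpr and the regmatches<- replacement function. This assumes every element ends in at least one digit; the vector x is repeated here so the snippet runs on its own.

```r
x <- c("GT-BU7867-09", "GT-BU6523-113", "GT-BU6452-1", "GT-BU8921-12")

# Locate the run of digits at the end of each string ...
m <- regexpr("[0-9]+$", x)
# ... and overwrite it in place with its zero-padded version.
regmatches(x, m) <- sprintf("%03d", as.integer(regmatches(x, m)))

x
#> [1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001" "GT-BU8921-012"
```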
A solution using stringr and str_pad and strsplit
library(stringr)
x <- readLines(textConnection('GT-BU7867-09
GT-BU6523-113
GT-BU6452-1
GT-BU8921-12'))
unlist(lapply(strsplit(x, '-'),
  function(x){
    x[3] <- str_pad(x[3], width = 3, side = 'left', pad = '0')
    paste0(x, collapse = '-')
  }))
[1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001"
[4] "GT-BU8921-012"
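Another version using str_pad, str_extract and str_replace from package stringr
library(stringr)
x <- str_replace(x, "[[:digit:]]+$", str_pad(str_extract(x, "[[:digit:]]+$"), 3, pad = "0"))
i.e. extract the trailing numbers of x, pad them to 3 with 0s, then substitute these for the original trailing numbers. str_replace is used rather than gsub because it is vectorised over the replacement; gsub would use only the first padded value for every element.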
I am trying to format UK postcodes that come in as a vector of different input in R.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write a generic code that would convert entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is the first part of the postcode is separated from the second part of the postcode by one space and all letters are capitals.
EDIT: the second part of the postcode is always the 3 last characters (combination of a number followed by letters)
I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
One approach:
1. Convert to uppercase
2. Extract the alphanumeric characters
3. Paste back together with a space before the last three characters
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
You can remove everything that is not a word character with \\W (or [^[:alnum:]_]) and then insert a space before the last 3 characters with (.{3})$ and \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
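The same two steps can be packaged as a reusable function. The sketch below is one possible arrangement, not the answerer's code: the name format_postcode and the sanity check at the end are assumptions added for illustration, and it strips with [^[:alnum:]] so that underscores are removed too.

```r
# Hypothetical helper: strip separators, upper-case, re-insert one space
# before the last three characters.
format_postcode <- function(x) {
  cleaned <- toupper(gsub("[^[:alnum:]]+", "", x))
  sub("(.{3})$", " \\1", cleaned)
}

postcodes <- c("IV41 8PW", "IV408BU", "kY11..4hJ", "KY1.1UU", "KY4 9RW", "G32-7EJ")
formatted <- format_postcode(postcodes)
formatted
#> [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU"  "KY4 9RW"  "G32 7EJ"

# Loose check that the second part is a digit followed by two letters.
looks_ok <- grepl("^[A-Z0-9]+ [0-9][A-Z]{2}$", formatted)
```

The final grepl is deliberately loose; it only flags entries whose last three characters do not look like a digit plus two letters, and is not a full UK postcode validator.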
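# Option 1 using regex:
# squash runs of non-word characters to one space, upper-case,
# then insert a space where the two parts are still fused together
res1 <- gsub(
  "(\\w+)(\\d[[:upper:]]\\w+$)",
  "\\1 \\2",
  gsub("\\W+", " ", toupper(postcodes))
)
# Option 2 using substrings:
res2 <- vapply(
  trimws(gsub("\\W+", " ", toupper(postcodes))),
  function(ir){
    paste(
      trimws(substr(ir, 1, nchar(ir) - 3)),  # everything before the last three characters
      substr(ir, nchar(ir) - 2, nchar(ir))   # the last three characters
    )
  },
  character(1),
  USE.NAMES = FALSE
)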
Say I have a file of character strings that I would like to split at a character and then select one side of the split as a new field. Is there a way to do this in one step? For example:
x <- strsplit('dark_room',"_")
x[[1]][2]
[1] "room"
This is a two-step operation to access "room". How can I subset "room" in one operation?
An easier option is to use word from stringr and specify the sep and other parameters
library(stringr)
word('dark_room', -1, sep="_")
#[1] "room"
Or with str_extract to match characters other than _ ([^_]+) at the end ($) of the string
str_extract('dark_room', "[^_]+$")
#[1] "room"
Or with stri_extract from stringi where we can specify the first and last
library(stringi)
stri_extract_first('dark_room', regex = '[^_]+')
#[1] "dark"
stri_extract_last('dark_room', regex = '[^_]+')
#[1] "room"
strsplit always returns a list. So the general approach (applicable when there is more than one element) is to loop through the list with sapply and subset with either tail
sapply( strsplit('dark_room',"_"), tail, 1)
#[1] "room"
or head to get the first n elements
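For completeness, a short sketch of the head variant next to tail and plain indexing. The second input string is invented for the example, and vapply is used instead of sapply only to pin the return type.

```r
parts <- strsplit(c("dark_room", "light_house_keeper"), "_")

first_piece  <- vapply(parts, head, character(1), n = 1)  # first element of each split
last_piece   <- vapply(parts, tail, character(1), n = 1)  # last element of each split
second_piece <- vapply(parts, `[`, character(1), 2)       # element at a fixed position

first_piece
#> [1] "dark"  "light"
last_piece
#> [1] "room"   "keeper"
second_piece
#> [1] "room"  "house"
```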
Or using scan
tail(scan(text = 'dark_room', what = "", sep="_"), 1)
Here's a way using sub from base R. Because .* is greedy, the first call removes everything up to and including the last underscore, and the second keeps only what precedes it -
x <- 'dark_room'
# for right side of '_'
sub(".*_", "", x)
[1] "room"
# for left side of '_'
sub("(.*)_.*", "\\1", x)
[1] "dark"
Define your own functions (using the solution in your question or any of the other solutions from the answers)
left = function(x, sep){
substring(x, 1, regexpr(sep, x) - 1)
}
right = function(x, sep){
substring(x, regexpr(sep, x) + 1, nchar(x))
}
s = c("dark_room")
left(s, "_")
#[1] "dark"
right(s, "_")
#[1] "room"
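The functions above treat sep as a regular expression and assume it is one character long. A possible variant, sketched here under the hypothetical names left_of and right_of, matches the separator literally, allows multi-character separators, and returns the input unchanged when the separator is absent.

```r
# Hypothetical variants: literal (fixed) matching on the first occurrence of sep.
left_of <- function(x, sep) {
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))   # -1 when sep is not found
  ifelse(pos > 0, substring(x, 1, pos - 1), x)
}

right_of <- function(x, sep) {
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  ifelse(pos > 0, substring(x, pos + nchar(sep)), x)
}

left_of("dark.room", ".")     # "." is matched literally, not as "any character"
#> [1] "dark"
right_of("dark::room", "::")
#> [1] "room"
left_of("darkroom", "_")      # no separator: input comes back as-is
#> [1] "darkroom"
```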
I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off the string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
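One thing to watch with the base R versions: regmatches drops elements that do not match, and sub returns them unchanged. A sketch that builds the pattern from parameters and returns NA for strings with too few occurrences might look like this (the function name cut_after_n and its arguments are illustrative assumptions):

```r
# Hypothetical helper: keep each string up to the n-th occurrence of any
# character in `chars`; NA when there are fewer than n occurrences.
cut_after_n <- function(x, chars = "AGN", n = 3) {
  pattern <- paste0("([^", chars, "]*[", chars, "]){", n, "}")
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)   # regmatches returns only the matched elements
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_after_n(strings)
#> [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_after_n(strings, n = 4)
#> [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```

This mirrors the NA behaviour of str_extract while staying in base R.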
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
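A caveat on the which.max idiom: when a string never reaches the target count, the comparison is all FALSE and which.max returns 1, so only the first character comes back. A guarded sketch (the function name cut_at_count is made up here) returns NA in that case instead:

```r
# Hypothetical guarded version of the cumsum approach.
cut_at_count <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    hit <- which(cumsum(ch %in% chars) == n)    # positions where the running count equals n
    if (length(hit) == 0) return(NA_character_) # fewer than n occurrences
    paste(ch[seq_len(hit[1])], collapse = "")   # keep up to the first such position
  }, character(1))
}

cut_at_count(c("ABBSDGNHNGA", "AABSDGDRY", "XYZA"))
#> [1] "ABBSDGN" "AABSDG"  NA
```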
Identify the positions of the pattern using gregexpr, then extract the n-th position (3) and take the substring from 1 to this n-th position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.
This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"
How would I be able to insert a vertical bar in between every character of a string in R? For example, say I have a string "ABC123". How could I obtain the output to be "A|B|C|1|2|3"? If anyone could vectorize this idea for a vector of character strings, that would be great.
First, separate into individual characters and then collapse
paste(unlist(strsplit("ABC123", "")), collapse = "|")
#[1] "A|B|C|1|2|3"
For vector of strings, use sapply to loop through them
mystrings = c("ABC123", "PASDP")
sapply(strsplit(mystrings, ""), paste, collapse = "|")
#[1] "A|B|C|1|2|3" "P|A|S|D|P"
Here is an option using regex
gsub("(?<=.)(?=.)", "|", "ABC123", perl = TRUE)
#[1] "A|B|C|1|2|3"
Or with more than one string
mystrings <- c("ABC123", "PASDP")
gsub("(?<=.)(?=.)", "|", mystrings, perl = TRUE)
#[1] "A|B|C|1|2|3" "P|A|S|D|P"