Replace multiple symbols in a string differently in R

I tried to recode values such as (5,10],(20,20] to 5-10%,20-20% using gsub. The opening parenthesis should be removed, the comma should become a dash, and the closing bracket should become %. All I managed so far was:
x<-c("(5,10]","(20,20]")
gsub("\\,","-",x)
Then the comma is changed to the dash. How can I change others as well?
Thanks.

Keeping it very simple, a set of gsubs.
x <- c("(5,10]","(20,20]")
x <- gsub(",", "-", x) # replace comma by dash
x <- gsub("\\(", "", x) # remove opening parenthesis
x <- gsub("]", "%", x) # replace ] by %
x
"5-10%" "20-20%"
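The same chain can also be driven by a lookup table, which may be easier to extend. This is only a sketch: `swaps` and `recode_symbols` are hypothetical names, and `fixed = TRUE` is assumed so that `(` and `]` are treated as literal characters rather than regex syntax.

```r
x <- c("(5,10]", "(20,20]")

# Hypothetical lookup table: names are the literal symbols, values their replacements
swaps <- c("(" = "", "," = "-", "]" = "%")

# Apply each replacement in turn; fixed = TRUE avoids having to escape "(" and "]"
recode_symbols <- function(s, swaps) {
  for (sym in names(swaps)) {
    s <- gsub(sym, swaps[[sym]], s, fixed = TRUE)
  }
  s
}

recode_symbols(x, swaps)
# [1] "5-10%"  "20-20%"
```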

Here's another alternative:
> gsub("\\((\\d+),(\\d+)\\]", "\\1-\\2%", x)
[1] "5-10%" "20-20%"

Another solution.
Using regmatches we extract all the numbers. We then combine every first and second number.
nrs <- regmatches(x, gregexpr("[[:digit:]]+", x))
nrs <- as.numeric(unlist(nrs))
i <- 1:length(nrs); i <- i[(i%%2)==1]
for(h in i){print(paste0(nrs[h],'-',nrs[h+1],'%'))}
[1] "5-10%"
[1] "20-20%"
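The loop above prints one result per line. If a character vector is wanted instead, the same odd/even pairing can be done with plain indexing. A minimal sketch, assuming every element of x contains exactly two numbers (`odd` and `res` are names introduced here for illustration):

```r
x <- c("(5,10]", "(20,20]")

# Flatten all numbers into one character vector: "5" "10" "20" "20"
nrs <- unlist(regmatches(x, gregexpr("[[:digit:]]+", x)))

# Odd positions hold the lower bounds, the following positions the upper bounds
odd <- seq(1, length(nrs), by = 2)
res <- paste0(nrs[odd], "-", nrs[odd + 1], "%")
res
# [1] "5-10%"  "20-20%"
```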

Just for fun, an ugly one-liner:
sapply(regmatches(x, gregexpr("\\d+", x)), function(x) paste0(x[1], "-", x[2], "%"))
[1] "5-10%" "20-20%"


Converting IDs from three to four digits

I have the following data
GT-BU7867-09
GT-BU6523-113
GT-BU6452-1
GT-BU8921-12
How do I use R to pad the number after the last hyphen with leading zeros so that it has three digits? The resulting format should look like this:
GT-BU7867-009
GT-BU6523-113
GT-BU6452-001
GT-BU8921-012
Base solution:
sapply(strsplit(x,"-"), function(x)
paste(x[1], x[2], sprintf("%03d",as.numeric(x[3])), sep="-")
)
Result:
[1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001" "GT-BU8921-012"
A solution using stringr and str_pad and strsplit
library(stringr)
x <- readLines(textConnection('GT-BU7867-09
GT-BU6523-113
GT-BU6452-1
GT-BU8921-12'))
unlist(lapply(strsplit(x,'-'),
function(x){
x[3] <- str_pad(x[3], width = 3, side = 'left', pad = '0')
paste0(x, collapse = '-')}))
[1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001"
[4] "GT-BU8921-012"
Another version using str_pad and str_extract from package stringr
library(stringr)
# str_replace is vectorised over the replacement; gsub would only use its first element
x <- str_replace(x, "[[:digit:]]+$", str_pad(str_extract(x, "[[:digit:]]+$"), 3, pad = "0"))
i.e. extract the trailing numbers of x, pad them to 3 with 0s, then substitute these for the original trailing numbers.
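The extract, pad, substitute idea can also be sketched in base R without any package. This assumes every ID ends in a run of digits after the last hyphen; `prefix`, `num` and `padded` are names introduced here for illustration.

```r
x <- c("GT-BU7867-09", "GT-BU6523-113", "GT-BU6452-1", "GT-BU8921-12")

# Everything up to and including the last hyphen
prefix <- sub("[[:digit:]]+$", "", x)

# The trailing number, converted to integer so that %03d can zero-pad it
num <- as.integer(sub("^.*-", "", x))

padded <- paste0(prefix, sprintf("%03d", num))
padded
# [1] "GT-BU7867-009" "GT-BU6523-113" "GT-BU6452-001" "GT-BU8921-012"
```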

How to pad a string with leading zeros using a regexp so that the part before the first period has length 4?

I have a vector:
x <- c("1. Ure.html", "15. Astra basta.html", "16. Mafa of Part 4.html", "16.1 Veka--Cons.pdf")
How do I get vector y using a regexp? I need to pad the number before the first period with leading zeros so that it has a length of 4.
y <-c("0001. Ure.html", "0015. Astra basta.html", "0016. Mafa of Part 4.html", "0016.1 Veka--Cons.pdf")
An option is gsubfn
library(gsubfn)
gsubfn("^\\d+", ~ sprintf("%04d", as.numeric(x)), x)
#[1] "0001. Ure.html" "0015. Astra basta.html"
#[3] "0016. Mafa of Part 4.html" "0016.1 Veka--Cons.pdf"
We can use str_replace from stringr and pad the additional values with 0
library(stringr)
str_replace(x, "\\d+", function(m) str_pad(m, 4, pad = '0'))
#[1] "0001. Ure.html" "0015. Astra basta.html"
# "0016. Mafa of Part 4.html" "0016.1 Veka--Cons.pdf"
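This can also be achieved with sprintf. Convert the match to an integer and use %04d, because zero-padding with %s is platform-dependent (on some systems it pads with spaces instead)
str_replace(x, "\\d+", function(m) sprintf("%04d", as.integer(m)))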
In base R, find the matches
m <- regexpr("^\\d+", x)
Extract the matches, convert them to integers, format them with leading zeros, and write them back into the original vector (using %04d, because zero-padding with %s is platform-dependent)
regmatches(x, m) <- sprintf("%04d", as.integer(regmatches(x, m)))
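As a self-contained sketch of the base R steps, the snippet below works on a copy `y` and adds one hypothetical element ("Notes.txt", not in the original data) to illustrate that elements without a leading number are left unchanged. It assumes the leading number fits in an integer.

```r
x <- c("1. Ure.html", "15. Astra basta.html", "16. Mafa of Part 4.html",
       "16.1 Veka--Cons.pdf", "Notes.txt")  # last element is a made-up example

# Locate the leading run of digits in each element (-1 where there is none)
m <- regexpr("^[0-9]+", x)

# Replace only the matched pieces; unmatched elements are not touched
y <- x
regmatches(y, m) <- sprintf("%04d", as.integer(regmatches(y, m)))
y
# [1] "0001. Ure.html"            "0015. Astra basta.html"
# [3] "0016. Mafa of Part 4.html" "0016.1 Veka--Cons.pdf"
# [5] "Notes.txt"
```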

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I would first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for elements that mix numbers and letters and remove the numbers (and periods) from them. Then we define groups for collapsing: a letter element that directly follows another letter element is put in the same group as its predecessor. Finally, we collapse all the values in the same group.
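To see how the grouping step behaves, it may help to look at the intermediate vectors. The sketch below starts from the already cleaned vector (so "1.3zyz" has become "zyz") and uses the same expression as clean(); `prev_is_char` and `starts_group` are names introduced here for illustration.

```r
x <- c("1.2", "asd", "gkd", "232", "4343", "zyz", "fva", "3213", "1232", "dasd")

is_char <- grepl("[[:alpha:]]", x)
# Was the previous element a letter element? (FALSE for the first one)
prev_is_char <- c(FALSE, head(is_char, -1))

# A new group starts at every non-letter element and at every letter
# element that does not directly follow another letter element
starts_group <- !is_char | (is_char & !prev_is_char)
grp <- cumsum(starts_group)
grp
# [1] 1 2 2 3 4 5 5 6 7 8

unname(tapply(x, grp, paste, collapse = ""))
# [1] "1.2"    "asdgkd" "232"    "4343"   "zyzfva" "3213"   "1232"   "dasd"
```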
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

R Match And Sub On Space Between Specific Characters

I need a little help with a regular expression using gsub. Take this object:
x <- "4929A 939 8229"
I want to remove the space in between "A" and "9", but I am not sure how to match on only the space between them and not on the second space. I essentially need something like this:
x <- gsub("A 9", "", x)
But I am not sure how to write the regular expression to not match on the "A" and "9" and only the space between them.
Thanks in advance!
You may use the following regex in sub:
> x <- "4929A 939 8229"
> sub("\\s+", "", x)
[1] "4929A939 8229"
The \\s+ will match 1 or more whitespace symbols.
The replacement part is an empty string.
See the online R demo
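Note that sub with \\s+ removes the first run of whitespace wherever it occurs. If the space should only be removed when it sits between "A" and "9", lookarounds with perl = TRUE are one option. A minimal sketch; the second element of x is a made-up string added to show that other spaces are kept.

```r
x <- c("4929A 939 8229", "12 4929A 939")  # second element is hypothetical

# (?<=A) requires an "A" just before the space, (?=9) a "9" just after it;
# neither is part of the match, so only the space itself is removed
res <- gsub("(?<=A) (?=9)", "", x, perl = TRUE)
res
# [1] "4929A939 8229" "12 4929A939"
```

An equivalent without lookarounds would capture both neighbours and put them back, e.g. gsub("(A) (9)", "\\1\\2", x).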
gsub replaces every match, whereas sub replaces only the first one. So
sub(" ", "", "4929A 939 8229") # returns "4929A939 8229"
will do the job.
Removing the second/nth occurrence
You can do that e.g. by using strsplit as follows:
x <- c("4929A 939 8229", "4929A 9398229")
collapse_nth <- function(x_split, split, nth, replacement){
left <- paste(x_split[seq_len(nth)], collapse = split)
right <- paste(x_split[-seq_len(nth)], collapse = split)
paste(left, right, sep = replacement)
}
remove_nth <- function(x, nth, split, replacement = ""){
x_split <- strsplit(x, split, fixed = TRUE)
x_len <- vapply(x_split, length, integer(1))
out <- x
out[x_len>nth] <- vapply(x_split[x_len>nth], collapse_nth, character(1), split, nth, replacement)
out
}
Which gives you:
# > remove_nth(x, 2, " ")
# [1] "4929A 9398229" "4929A 9398229"
and
# > remove_nth(x, 2, " ", "---")
# [1] "4929A 939---8229" "4929A 9398229"

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can write a loop that counts and builds a new string character by character, but surely there must be something better, no?
Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
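One caveat with this pattern: when the string length is an exact multiple of 5, the last group also gets a space appended. A possible variation, shown here as a sketch, adds a lookahead so that a space is only inserted when at least one more character follows (this needs perl = TRUE).

```r
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"

# (?=.) only lets a group of 5 match if another character comes after it
gsub("(.{5})(?=.)", "\\1 ", s, perl = TRUE)
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

# Length is a multiple of 5: no trailing space with the lookahead
gsub("(.{5})(?=.)", "\\1 ", "ABCDEFGHIJ", perl = TRUE)
# [1] "ABCDE FGHIJ"
```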
Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
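The pasting step could look like this. It is a sketch that repeats the chunking from above so it runs on its own; `chunks` and `spaced` are names introduced here for illustration.

```r
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
l <- seq(from = 5, to = nchar(s), by = 5)

# Same chunking as above: start positions c(0, l) + 1, end positions c(l, nchar(s))
chunks <- mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))

# Join the chunks into a single string separated by spaces
spaced <- paste(chunks, collapse = " ")
spaced
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
```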
No *apply stringi solution:
library(stringi)
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need to use *apply here.
You can also use substring without a loop. substring is the vectorized version of substr
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
