Remove part of string after 3-digit number - r

I would like to substitute the strings in the list by cutting each string after the first 3-digit number.
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
I would like for it to look something like
a <- c("MTH314","LB471","PHY472")
I have tried something like
b <- gsub("[100-999].*","",a)
but it returns c("MTH","LB","PHY") without the first number

A possible solution, based on stringr::str_remove:
library(stringr)
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
str_remove(a, "(?<=\\d{3}).*")
#> [1] "MTH314" "LB471" "PHY472"

c("MTH314PHY410","LB471LB472","PHY472CHM141") %>%
stringr::str_extract('.+?\\d{3}')
[1] "MTH314" "LB471" "PHY472"

Related

Add a character to a specific part of a string?

I have a list of file names as such:
"A/B/file.jpeg"
"A/C/file2.jpeg"
"B/C/file3.jpeg"
and a couple of variations of such.
My question is how would I be able to add a "new" or any characters into each of these file names after the second "/" such that the length of the string/name doesn't matter just that it is placed after the second "/"
Results would ideally be:
"A/B/newfile.jpeg"
"A/B/newfile2.jpeg" etc.
Thanks!
Another possible solution, based on stringr::str_replace:
library(stringr)
l <- c("A/B/file.jpeg", "A/B/file2.jpeg", "A/B/file3.jpeg")
str_replace(l, "\\/(?=file)", "\\/new")
#> [1] "A/B/newfile.jpeg" "A/B/newfile2.jpeg" "A/B/newfile3.jpeg"
Using gsub.
gsub('(file)', 'new\\1', x)
# [1] "A/B/newfile.jpeg" "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
Data:
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")

Remove first "." from values in R

I have a dataset with different values in R. Some values are like 11.474 and others like 1.034.496 in the same column. I would like to change the values with two dots from 1.034.496 to 1034.496. Is there anyone who could help me please?
Thanks for the help!
Use gsub with Perl regexes:
df <- data.frame(a = c('11.474', '1.034.496', '1.234.034.496'))
df$a = gsub('[.](?=.*[.])', '', df$a, perl = TRUE)
print(df)
## a
## 1 11.474
## 2 1034.496
## 3 1234034.496
Here, [.](?=.*[.]) is a literal dot (has to be escaped like so \. or put into a character class like so: [.]), followed by a literal dot using positive lookahead: (?=PATTERN).
I guess there must be other smarter regex approaches than the below one, but here is my attempt
> ifelse(lengths(gregexpr("\\.",v))>1,sub("\\.","",v),v)
[1] "11.474" "1034.496"
where
v <- c("11.474","1.034.496")

How to extract the trailing digits from a string in R? [duplicate]

This question already has answers here:
Extract a substring according to a pattern
(9 answers)
Closed 4 years ago.
I have a column of data that looks like this:
**varX**
Q1#_1
Q1#_5
Q1#_10
I would like to edit the data to look like this:
**varX**
1
5
10
Is there a command I could use to simply keep all information after the underscore?
If you want a tidyverse solution, you can use str_extract from the stringr package:
data %>%
mutate(varx = str_extract(varx, "[0-9]+$")) %>%
mutate(varx = as.numeric(varx)) # include this last line if you want a number and not character
In case you always have the Q1#_ string, you can do:
gsub("Q1#_", "", df$varX)
I think you're looking for sub, substitute a certain part of a string with something else. You can give it a regular expression if you want to go fancy, or just give it a literal:
VarX <- sub('Q1#_', '', VarX, fixed=T)
The fancy way ("remove everything before and including the underscore") would be
VarX <- sub('^.*_', '', VarX)
And you may want to convert it to a numeric or an integer:
VarX <- as.integer(sub('Q1#_', '', VarX, fixed=T)) # or as.numeric
You could you use regular expressions:
df[["varX"]] <- sub(".+_", "", df[["varX"]])
df
varX
1 1
2 5
3 10
Or regular expressions-free: with strsplit():
df[["varX"]] <- sapply(df[["varX"]], function(x) strsplit(x, "_")[[c(1,2)]])

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add leading zero if the number is less than 5 digits. However, some numbers have "HQ" in the end, some don't.(I did check other posts, they dont have similar problem in the "HQ" part)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course if there is 5 numbers, then the regex isn't matched, so it doesn't change.
I was going to use the sprintf approach, but found the the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using #thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

Find first matching substring in a long string in R

I'm trying to find the first matching string from a vector in a long string. I have for example a example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow' and a matching_vector<- c('Turtle',Dog') Now I want that it returns 'Dog' as this is the first substring in the matching_vector that we see in the example string: LionabcdBear1231DogextKittyisananimalTurtleisslow
I already tried pmatch(example_string,matching_vector) but it doesn't work. Obviously as it doesn't work with substrings...
Thanks!
Tim
Is the following solution working for you?
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector<- c('Turtle','Dog')
match_ids <- sapply(matching_vector, function(x) regexpr(x ,example_string))
result <- names(match_ids)[which.min(match_ids)]
> result
[1] "Dog"
We can use stri_match_first from stringi
library(stringi)
stri_match_first(example_string, regex = paste(matching_vector, collapse="|"))

Resources