How to edit names using a regular expression in R? [duplicate]

I have names like the following. I just want to keep the part before the dot.
>name
uc001aaa.3
uc001aac.4
uc001aae.4
uc001aah.4
uc001aai.1
uc001aak.3
uc001aal.1
uc001aam.4
uc001aaq.2
uc001aar.2
How can I implement this using regex or sub in R?

I thought this would certainly be a duplicate, but despite the number of gsub questions I can't easily find one (e.g. https://stackoverflow.com/questions/23844473/exclude-a-pattern-in-all-collumn-names-in-r). Update: ironically, the closest one is a question the OP asked a few days ago, How to trim the column name of the matrix? ...
Anyway,
gsub("\\.[0-9]$","",name)
does what you want;
\\. specifies a literal . character (one backslash is required to specify that . is literal rather than meaning "any character"; the second is required to protect the first inside an R string). As @MatthewLundberg points out, you could also use [.] here (. is interpreted literally, rather than as "any character", within the brackets []).
[0-9] means "a single character in the range 0-9" (not, as you seem to think, the first 9 characters of the string)
$ means "end of string"
So this will remove a dot plus a single digit from the end of every string. It doesn't matter how many characters are before the dot. On the other hand, if you might have multi-digit suffixes, e.g. foo.123, you would need "\\.[0-9]+$" instead (the + means "one or more of the preceding pattern").
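As a quick check of the multi-digit variant, here is a minimal sketch. The values "foo.123" and "no_suffix" are made-up edge cases, not part of the original data, and sub is used instead of gsub because the anchored pattern can match at most once per string.

```r
# Sample identifiers; "foo.123" and "no_suffix" are hypothetical edge cases.
name <- c("uc001aaa.3", "uc001aac.4", "foo.123", "no_suffix")

# Strip a literal dot followed by one or more digits at the end of the string.
# Strings without such a suffix are returned unchanged.
trimmed <- sub("\\.[0-9]+$", "", name)
print(trimmed)
```

Strings that do not end in a dot plus digits pass through untouched, so this should be safe to apply to a whole column.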

Here is a strsplit method, which separates the string on . characters, and keeps the first portion:
sapply(strsplit(name, '[.]'), '[', 1)
## [1] "uc001aaa" "uc001aac" "uc001aae" "uc001aah" "uc001aai" "uc001aak" "uc001aal" "uc001aam" "uc001aaq" "uc001aar"
I'm using the regular expression [.] to match a literal dot rather than \\. because I find it more readable. (It also helps if you have multiple levels of interpretation, but that's not an issue here.)

Related

Identify if character string contains any numbers?

I've been trying to word this title for 5 minutes to avoid it being a similarly phrased question. No luck, so apologies if this has already been discussed. I couldn't find any other threads on this particular subject.
Simply put, I want to identify whether a string of class character contains any numbers. If it does, I want to apply further functions.
Here's a dodgy attempt.
x <- "900 years old"
if(str_detect(x, ">=0")) {
print("contains numbers")
}
So obviously the problem is that I'm trying to use relational operators within a character string. Considering it's of this class, how can I identify numeric characters?
[0-9] is a regex pattern for the digits 0 to 9. You could also use the special patterns \d or [:digit:] (for digits). In R, you have to add an extra escape to \d, and [:digit:] must itself sit inside brackets. All of these should work:
str_detect(x, "[0-9]")
str_detect(x, "\\d")
str_detect(x, "[[:digit:]]")
With base R, we can use grepl
grepl('[0-9]', x)
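To tie this back to the "if true, apply further functions" part of the question, here is a small base R sketch. The second string and the labels are just illustrative assumptions.

```r
# Example inputs; the second string is a made-up counterexample.
x <- c("900 years old", "no digits here")

# grepl returns one logical per element: TRUE where a digit is present.
has_digit <- grepl("[0-9]", x)

# Branch on the result element-wise; replace the labels with real work.
label <- ifelse(has_digit, "contains numbers", "no numbers")
print(label)
```

For a single string, `if (grepl("[0-9]", x)) { ... }` works the same way as the original attempt was meant to.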

R: Regular Expression for Twitter hashtags? [duplicate]

I'm trying to come up with a regular expression that matches Twitter hashtags. Twitter hashtags have the following rules:
1) They cannot contain spaces.
2) They cannot contain punctuation.
3) They cannot start with or use only numbers.
This is what I've come up so far, but it still has issues with spaces and punctuation characters:
"#{1}[^0-9]*[^[::punct::]\\s]*?[A-z0-9]*?"
Would appreciate any help with this. Thanks!
Your regex looks a bit complicated. You only need to match the #, then a letter, and then alphanumeric characters.
You also don't need quantifier for a single character. This should work:
#[a-zA-Z]\w*
If you don't want to allow underscores (they are legal characters in tweets), use this instead:
#[a-zA-Z][\da-zA-Z]*
It looks like the real spec for a hashtag, however, is that underscores and numbers are valid anywhere as long as there is at least one letter.
So this would be better:
#\w*[a-zA-Z]\w*
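Since the question is tagged R, here is a rough sketch of how that last pattern could be applied to pull hashtags out of text. The sample tweet is invented, and perl = TRUE is an assumption made so that \w behaves as in the patterns above; note the doubled backslashes inside the R string.

```r
# A made-up tweet with valid and invalid hashtags.
tweet <- "Loving #rstats and #R_4 but not #123 or #2fast"

# '#', then word characters, with at least one letter somewhere among them.
pattern <- "#\\w*[a-zA-Z]\\w*"

# gregexpr finds every match; regmatches extracts the matched text.
tags <- regmatches(tweet, gregexpr(pattern, tweet, perl = TRUE))[[1]]
print(tags)
```

Here "#123" is skipped because it contains no letter, while "#2fast" is kept because digits are allowed anywhere as long as a letter is present.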
This regex captures only valid hashtags:
(#[a-zA-Z]+[\w]?)(?:\s|$)

Split a string in a flexible manner with a regular expression

Context: I need to split strings that are too long and that are used as column headers in an HTML table. Those strings are variable names, so they don't have any spaces in them.
If I let the CSS max-width property do the job, the string is split at a fixed place, without making use of the dots or underscores in the string.
For example, suppose I have this string:
this.is.a.long.string.indeed.yeah.well.you.know
Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:
All substrings must be 12 characters or less
Separators [._] should be at the end, not at the beginning of a substring
The number of substrings must be minimal
If several solutions exist, the one having the most similar substring lengths is to be preferred.
I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
What I have so far:
Regex: .{1,12}(_|\b|\Z)
Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know
It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.
You were so close, you just need to present it with another alternative for cases where no separator is found:
.{1,12}(_|\b|\Z)|.{1,12}
Check it out: https://regex101.com/r/XrJuYj/2/
Edit: to ensure the split portion contains a non-separating character, you can use the following:
(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}
See it at: https://regex101.com/r/XrJuYj/3
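For anyone wanting to try the simpler two-alternative pattern from R, here is a minimal sketch. It assumes perl = TRUE so that \b and \Z are handled by PCRE, and the separator-free string is a made-up example of the fallback case.

```r
# Pattern from the answer: prefer a split at '_' or a word boundary,
# otherwise fall back to a hard split after 12 characters.
pattern <- ".{1,12}(_|\\b|\\Z)|.{1,12}"

split_header <- function(s) {
  # Extract all consecutive matches; each one is a chunk of at most 12 characters.
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

with_dots <- split_header("this.is.a.long.string.indeed.yeah.well.you.know")
no_seps   <- split_header("abcdefghijklmnopqrstuvwxyz")
print(with_dots)
print(no_seps)
```

The chunks can then be joined with whatever break marker the table needs. This sketch does not attempt the "most similar lengths" criterion from the question.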

Trying to validate two different format in one regular expression [duplicate]

I want to validate these formats in one regular expression in asp.net:
XX-XXXXXXX or XXX-XX-XXXX
These have to be numeric only, with no other characters except the "-".
Is this possible? I've been trying without any success so I want to ask the experts.
Thanks,
Pune
The following should work given your requirements.
"(^\d{2}-\d{7}$)|(^\d{3}-\d{2}-\d{4}$)"
Try something like this:
/^([0-9]{2}-[0-9]{7}|[0-9]{3}-[0-9]{2}-[0-9]{4})$/
[0-9] means any digit from 0 to 9.
{X} means exactly X repetitions
| means "or"
- means a literal "-"
( and ) delimit a group, here so that ^ and $ apply to both alternatives
^ and $ anchor the match to the beginning and the end of the string.
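The question is about ASP.NET, but the pattern itself is portable. As a sanity check, here is a small sketch in R (chosen only to match the other examples on this page); the test values are invented.

```r
# Two accepted shapes: XX-XXXXXXX or XXX-XX-XXXX, digits only.
pattern <- "^([0-9]{2}-[0-9]{7}|[0-9]{3}-[0-9]{2}-[0-9]{4})$"

# Made-up inputs: the first two should pass, the rest should fail.
ids <- c("12-3456789", "123-45-6789", "12-34-5678", "123-456789", "ab-cdefghi")

valid <- grepl(pattern, ids)
print(data.frame(ids, valid))
```

Because the alternation sits inside one group between ^ and $, partial matches such as a valid ID embedded in a longer string are rejected.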

R extract parts of a string based on punctuation characters that appear more than once

I have spent a lot of time reading successful answers on how to extract part of a string using substr and substring. Still, I am having trouble applying answers because I cannot differentiate where specific punctuation characters are used to indicate when to start and stop selecting other punctuation characters, if those characters appear more than once within the string.
In a generalised case, I would like to split my string in several places based on recurring "_" and "." characters.
In my individual case, a cell in one column of my dataframe contains a filename as a string, and I would like to use that string to generate strings in 3 subsequent columns on the same row.
To demonstrate, the string might look like: "Name_12. Word_CsvData.txt" and that should be split without referring to numeric character positions into "Name", "12", "Word"
To do this, I am aiming to find out how to extract part of a string:
1) from the beginning of the string to the first instance of an underscore character.
2) from the first underscore to the first full stop.
3) from the space to the second underscore character.
Any help would be much appreciated.
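One possible approach, offered as a sketch rather than a definitive answer: describe the three pieces with capture groups and extract them with regexec and regmatches. The pattern assumes every filename has the same shape as the example (underscore, full stop, space, underscore), which is an assumption and may not hold for all rows.

```r
filename <- "Name_12. Word_CsvData.txt"

# Group 1: start of string up to the first underscore.
# Group 2: after that underscore up to the first full stop.
# Group 3: after the space up to the next underscore.
pattern <- "^([^_]*)_([^.]*)\\. ([^_]*)_.*$"

# regexec returns the full match plus each group; drop the full match.
parts <- regmatches(filename, regexec(pattern, filename))[[1]][-1]
print(parts)
```

For a data frame column, the same pattern could be applied with sub and the backreferences "\\1", "\\2" and "\\3" to fill the three new columns.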
