Identify if character string contains any numbers? - r

I've been trying to word this title for 5 minutes to avoid it being a similarly phrased question. No luck, so apologies if this has already been discussed. I couldn't find any other threads on this particular subject.
Simply put, I want to identify whether a string of class character contains any numbers. If it does, I want to apply further functions.
Here's a dodgy attempt.
x <- "900 years old"
if(str_detect(x, ">=0")) {
print("contains numbers")
}
So obviously the problem is that I'm trying to use relational operators within a character string. Considering it's of this class, how can I identify numeric characters?

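[0-9] is a regex pattern that matches any single digit from 0 to 9. You could also use the special patterns \d or [:digit:] (for digits). In R, you have to add extra escapes to the special patterns: the backslash is doubled inside a string literal, and [:digit:] must sit inside a bracket expression. All of these should work: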
str_detect(x, "[0-9]")
str_detect(x, "\\d")
str_detect(x, "[[:digit:]]")

With base R, we can use grepl
grepl('[0-9]', x)
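As a minimal sketch of how the check might gate further processing (the sample vector and the names has_digit, with_numbers and first_number are illustrative, not from the question): grepl is vectorised, so it can drive an if() on a single string or subset a whole vector.

```r
# Illustrative input; only the first element comes from the question
x <- c("900 years old", "no digits here", "room 12b")

# TRUE for every element that contains at least one digit
has_digit <- grepl("[0-9]", x)

# Keep only the strings that contain digits
with_numbers <- x[has_digit]

# One possible "further function": extract the first run of digits as a number
first_number <- as.numeric(regmatches(with_numbers, regexpr("[0-9]+", with_numbers)))

# For a single string the result can be used directly in if()
if (grepl("[[:digit:]]", x[1])) {
  message("contains numbers")
}
```

The same logical vector should come back from str_detect(x, "[0-9]") if stringr is loaded.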

Related

Splitting strings into elements from a list

A function in a package gives me a character, where the original strings are merged together. I need to separate them, in other words I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no string to base the split on, especially since I have no prior knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attr(index, "match.length") - 1
substring(result, starts, stops)
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and looks whether it occurs in your string (result in our case), and returns a TRUE/FALSE value. We subset the original values by the logical vector - does the value occur in the string?
Possible improvements:
fixed=TRUE may be a bright idea, because you don't need the full regex power for simple strings matching
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
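The first two improvements can be sketched as follows. This is a rough illustration under stated assumptions, not a definitive fix: "answer10" is added to orig purely to show the overlap problem, the candidate strings are assumed to contain no regex metacharacters, and the names naive, by_length and found are hypothetical.

```r
orig <- c("answer1", "answer2", "answer3", "answer10")  # "answer10" added for illustration
result <- "answer10answer2"

# fixed = TRUE treats each candidate as a literal string, but plain substring
# matching still reports "answer1" because it is contained in "answer10"
naive <- orig[sapply(orig, grepl, x = result, fixed = TRUE)]

# Try longer candidates first, so "answer10" is consumed before "answer1" is tested.
# perl = TRUE is used so that alternatives are tried in the order they are written.
by_length <- orig[order(nchar(orig), decreasing = TRUE)]
pattern <- paste(by_length, collapse = "|")
found <- regmatches(result, gregexpr(pattern, result, perl = TRUE))[[1]]
```

Here naive contains the false positive "answer1", while found lists the pieces in the order they appear in result.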

R str_extract_all expression to extract all letters, numbers, dollar signs, single and double quotes

I've only taken up R a few days ago and regular expressions themselves so far are more difficult than any programming language I've tried, heh...
I'm in desperate need of one, that would help me extract all sequences of letters, numbers, dollar signs, single and double quotes (last two seem to be the issue).
It is for a spam prediction project using Naive Bayes and differentiating between symbol sequences that may have single or double quotes in them is a requirement.
I'm specifically using the str_extract_all function from stringr library and must've read like 50 articles over the last two days without finding what would solve my specific problem to the point where I simply don't have time left.
Any help would be greatly appreciated and would put me a step forward in my machine learning interests.
Cheers.
You may try using regmatches here to return all matches of your pattern within a given input string:
txt <- "Hello World \"how are you today\"? Goodbye."
m <- gregexpr("[0-9A-Za-z$'\"]+", txt, perl = TRUE)
regmatches(txt, m)
[[1]]
[1] "Hello" "World" "\"how" "are" "you" "today\"" "Goodbye"
The output may not make much sense, but you did not include whitespace as being characters which are allowed in a sequence. Hence, we are left with words, possibly having a quote on either side.
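To see how the same character class treats dollar signs and apostrophes, here is a small sketch on a made-up spam-like sentence (the input text and the name tokens are assumptions for illustration only):

```r
# Hypothetical input containing a dollar amount, an apostrophe and double quotes
txt <- "Win $100 now, it's \"free\"!"

# Same character class as above: letters, digits, $, single and double quotes
m <- gregexpr("[0-9A-Za-z$'\"]+", txt)
tokens <- regmatches(txt, m)[[1]]

# The comma, the spaces and the exclamation mark are not in the class,
# so they act as separators and are dropped
print(tokens)
```

With stringr loaded, str_extract_all(txt, "[0-9A-Za-z$'\"]+") should give the same tokens.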

Convert string to variable name in R

I have spent hours looking for a proper solution, but I found nothing on the Internet. Here is my question. In R, I have a list of character strings containing my desired variable names ("2011_Q4", "2012_Q1", ...). When I try to assign a dataset to each of these names in a loop, it works, but the output is strange. Indeed, I have
> View(`2011_Q4`)
instead of
> View(2011_Q4)
And I don't know how to remove this apostrophe. It's very annoying since I have to type this ` in order to call the variable.
Can somebody help me? I would appreciate any help.
Thanks a lot and best regards
Firstly, it's a backtick (`), not an apostrophe ('). In R, backticks occasionally denote variable names; apostrophes work as single quotes for denoting strings.
The issue you're having is that your variables start with a number, which is not allowed in R. Since you somehow made it happen anyway, you need to use backticks to tell R not to interpret 2011_Q4 as a number, but as a variable.
From ?Quotes:
Names and Identifiers
Identifiers consist of a sequence of letters, digits, the period (.)
and the underscore. They must not start with a digit nor underscore,
nor with a period followed by a digit. Reserved words are not valid
identifiers.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used
directly in R code. Almost always, other names can be used provided
they are quoted. The preferred quote is the backtick (`), and deparse
will normally use it, but under many circumstances single or double
quotes can be used (as a character constant will often be converted to
a name). One place where backticks may be essential is to delimit
variable names in formulae: see formula.
The best solution to your issue is simply to change your variable names to something that starts with a letter, e.g. Y2011_Q4.
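A minimal sketch of two ways this could look in practice. The placeholder data frames and the names quarters, safe_names, auto_names and datasets are assumptions for illustration; the asker's real datasets are not shown in the question.

```r
quarters <- c("2011_Q4", "2012_Q1")

# Option 1: build syntactic names, either by hand or with make.names(),
# which prepends "X" when a name starts with a digit
safe_names <- paste0("Y", quarters)
auto_names <- make.names(quarters)

# Option 2: skip assign() and keep the datasets in a named list;
# list element names do not have to be syntactic
datasets <- setNames(
  lapply(quarters, function(q) data.frame(quarter = q)),  # placeholder data
  quarters
)
first <- datasets[["2011_Q4"]]
```

With the list approach the original labels stay intact, and a loop over names(datasets) needs no backticks.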

How to edit names using regular expression in R? [duplicate]

This question already has answers here:
Find a word before one of two possible separators
(4 answers)
Closed 8 years ago.
I have names like the following. I just want to keep the part before the dot (.).
>name
uc001aaa.3
uc001aac.4
uc001aae.4
uc001aah.4
uc001aai.1
uc001aak.3
uc001aal.1
uc001aam.4
uc001aaq.2
uc001aar.2
How can I implement this using regex or sub in R ?
I thought this would certainly be a duplicate, but despite the number of gsub questions I can't easily find one (e.g. https://stackoverflow.com/questions/23844473/exclude-a-pattern-in-all-collumn-names-in-r). Update: ironically, the closest one is a question the OP asked a few days ago, How to trim the column name of the matrix? ...
Anyway,
gsub("\\.[0-9]$","",name)
does what you want;
\\. specifies a literal . character (one backslash is required to specify that . is literal rather than meaning "any character"; the second is required to protect the first!). As @MatthewLundberg points out, you could also use [.] here (. is interpreted literally, rather than as "any character", within the range brackets []).
[0-9] means "a single character in the range 0-9" (not, as you seem to think, the first 9 characters of the string)
$ means "end of string"
So this will remove a dot plus a single digit from the end of every string. It doesn't matter how many characters are before the dot. On the other hand, if you might have multi-digit values, e.g. foo.123, you would need "\\.[0-9]+$" instead (the + means "one or more of the preceding pattern").
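The difference between the two patterns can be checked on a short sample. The vector below reuses two names from the question and adds "foo.123" as the multi-digit case; the names single and multi are illustrative.

```r
name <- c("uc001aaa.3", "uc001aac.4", "foo.123")

# Single-digit pattern: "foo.123" is left untouched because the dot
# is not followed by exactly one digit at the end of the string
single <- sub("\\.[0-9]$", "", name)

# One-or-more-digits pattern: strips the suffix in all three cases
multi <- sub("\\.[0-9]+$", "", name)
```

sub is enough here because the pattern is anchored with $ and can match at most once per string; gsub gives the same result.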
Here is a strsplit method, which separates the string on . characters, and keeps the first portion:
sapply(strsplit(name, '[.]'), '[', 1)
## [1] "uc001aaa" "uc001aac" "uc001aae" "uc001aah" "uc001aai" "uc001aak" "uc001aal" "uc001aam" "uc001aaq" "uc001aar"
I'm using the regular expression [.] to match a literal dot rather than \\. because I find it more readable. (It also helps if you have multiple levels of interpretation, but that's not an issue here.)

Regular expression for x number of digits and only one hyphen?

I made the following regex:
(\d{5}|\d-\d{4}|\d{2}-\d{3}|\d{3}-\d{2}|\d{4}-\d)
And it seems to work. That is, it will match a 5 digit number or a 5 digit number with only 1 hyphen in it, but the hyphen can not be the lead or the end.
I would like a similar regex, but for a 25 digit number. If I use the same tactic as above, the regex will be very long.
Can anyone suggest a simpler regex?
Additional Notes:
I'm putting this regex into an XML file which is to be consumed by an ASP.NET application. I don't have access to the .NET backend code, but I suspect they would do something like this:
Match match = Regex.Match("Something goes here", "my regex", RegexOptions.None);
You need to use a lookahead:
^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$
Explanation:
Either it's \d{25} from start to end, 25 digits.
Or: it is 26 characters of [\d\-] (digits or hyphen) AND it matched \d+-\d+ - meaning it has exactly one hyphen in the middle.
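As a quick sanity check, the pattern can be exercised outside .NET. The sketch below uses R with perl = TRUE, on the assumption that PCRE and the .NET engine behave the same for this particular pattern; the test strings and the names digits, candidates and ok are made up for illustration.

```r
# Backslashes are doubled because the pattern sits in an R string literal
pattern <- "^(?:\\d{25}|(?=\\d+-\\d+$)[\\d\\-]{26})$"

# Helper: a run of n digits
digits <- function(n) strrep("7", n)

candidates <- c(
  plain       = digits(25),                                        # 25 digits, no hyphen
  one_hyphen  = paste0(digits(12), "-", digits(13)),               # hyphen in the middle
  leading     = paste0("-", digits(25)),                           # hyphen first
  trailing    = paste0(digits(25), "-"),                           # hyphen last
  two_hyphens = paste0(digits(8), "-", digits(8), "-", digits(8)), # 26 chars, 2 hyphens
  too_long    = digits(26)                                         # 26 digits
)

# Lookaheads need the PCRE engine, hence perl = TRUE
ok <- grepl(pattern, candidates, perl = TRUE)
```

Only the first two candidates are accepted; the lookahead rejects a leading, trailing or second hyphen, and the length quantifiers reject the 26-digit string.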
You could use this regex:
^[0-9](?:(?=[0-9]*-[0-9]*$)[0-9-]{24}|[0-9]{23})[0-9]$
The lookahead makes sure there's only 1 dash and the character class makes sure there are 23 numbers between the first and the last. Might be made shorter though I think.
EDIT: A 'bit' shorter xP
^(?:[0-9]{25}|(?=[^-]+-[^-]+$)[0-9-]{26})$
A bit similar to Kobi's though, I admit.
If you aren't fussy about the length at all (i.e. you only want a string of digits with an optional hyphen) you could use:
([\d]+-[\d]+){1}|\d
(You may want to add line/word boundaries to this, depending on your circumstances)
If you need to have a specific length of match, this pattern doesn't really work. Kobi's answer is probably a better fit for you.
I think the fastest way is to do a simple match and then add up the lengths of the capture buffers. Why attempt math in a regex? It makes no sense.
^(\d+)-?(\d+)$
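A rough sketch of that idea, written in R rather than .NET and with a hypothetical helper name (has_25_digits): match the simple pattern, then add up the lengths of the two captured groups.

```r
# Hypothetical helper: TRUE when s is 25 digits with at most one interior hyphen
has_25_digits <- function(s) {
  # regexec returns the full match followed by the two capture groups
  m <- regmatches(s, regexec("^([0-9]+)-?([0-9]+)$", s))[[1]]
  # No match gives character(0); otherwise m[2] and m[3] are the groups
  length(m) == 3 && nchar(m[2]) + nchar(m[3]) == 25
}

plain      <- has_25_digits(strrep("7", 25))
hyphenated <- has_25_digits(paste0(strrep("7", 12), "-", strrep("7", 13)))
too_short  <- has_25_digits(strrep("7", 24))
leading    <- has_25_digits(paste0("-", strrep("7", 25)))
```

The length check lives in ordinary code, so changing the required digit count means changing one number rather than rewriting the pattern.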
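This will match 25 digits with exactly one hyphen in the middle. As written, it does not match 25 digits without a hyphen, and it also accepts 26 digits with no hyphen at all, because . matches any character: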
^(?=(-*\d){25})\d.{24}\d$

Resources