I have spent hours looking for a proper solution but found nothing on the Internet. Here is my question. In R, I have a list of character strings containing my desired variable names ("2011_Q4", "2012_Q1", ...). When I try to assign a dataset to each of these names in a loop, it works, but the output is strange. Indeed, I have
> View(`2011_Q4`)
instead of
> View(2011_Q4)
And I don't know how to remove this apostrophe. It's very annoying since I have to type this ` in order to call the variable.
Can somebody help me? I would appreciate any help.
Thanks a lot and best regards
Firstly, it's a backtick (`), not an apostrophe ('). In R, backticks quote names that are not syntactically valid; apostrophes work as single quotes for denoting strings.
The issue you're having is that your variable names start with a digit, which is not allowed for a syntactic name in R. Since you somehow made it happen anyway, you need to use backticks to tell R not to interpret 2011_Q4 as a number, but as a variable.
From ?Quotes:
Names and Identifiers
Identifiers consist of a sequence of letters, digits, the period (.)
and the underscore. They must not start with a digit nor underscore,
nor with a period followed by a digit. Reserved words are not valid
identifiers.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used
directly in R code. Almost always, other names can be used provided
they are quoted. The preferred quote is the backtick (`), and deparse
will normally use it, but under many circumstances single or double
quotes can be used (as a character constant will often be converted to
a name). One place where backticks may be essential is to delimit
variable names in formulae: see formula.
The best solution to your issue is simply to change your variable names to something that starts with a letter, e.g. Y2011_Q4.
I have a long character string, extracted from a PDF, that I want to process.
It contains recurring instances of Table X. Name of the table, each of which is always followed by a \r\n
However, when I try to extract all the tables into a list using List_Tables <-str_extract_all(Plain_Text, "Table\\s+\\d+\\.\\s+(([A-z]|\\s))+\\r\\n"), the match often includes an extra line, e.g.
> List_Tables
[[1]]
[1] "Table 1. Real GDP\r\n Percentage changes\r\n"
[2] "Table 2. Nominal GDP\r\n Percentage changes\r\n"
What have I missed in my code?
\s matches all whitespace, including line breaks! When combined with the greedy quantifier +, this means that (([A-z]|\\s))+ matches, in your first example,
Real GDP\r\n […] Percentage changes\r\n
The easiest way to fix this is to use a non-greedy quantifier: i.e. +? instead of +.
Just for completeness’ sake I’ll mention that there are alternatives, but they get more complicated. For instance, you could use negative assertions to include an “if” test to match whitespace which isn’t a line break character; or you could use the character class [ \t] instead of \s, which is more restrictive but also more explicit and probably closer to what you want.
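To see the difference between the greedy and the non-greedy quantifier, here is a minimal sketch. It uses Python's re module purely for illustration (the question uses stringr in R, but greedy versus lazy matching behaves the same way for this pattern). The sample text is reconstructed from the output shown in the question, and [A-z] is replaced by [A-Za-z] on the assumption that only letters were intended, since [A-z] also matches a few punctuation characters.

```python
import re

# Sample text shaped like the output shown in the question (an assumption).
plain_text = (
    "Table 1. Real GDP\r\n Percentage changes\r\n"
    "Table 2. Nominal GDP\r\n Percentage changes\r\n"
)

# Greedy +: \s also matches \r and \n, so the match runs past the first line break.
greedy = re.findall(r"Table\s+\d+\.\s+(?:[A-Za-z]|\s)+\r\n", plain_text)

# Lazy +?: the match stops at the first \r\n after the title.
lazy = re.findall(r"Table\s+\d+\.\s+(?:[A-Za-z]|\s)+?\r\n", plain_text)

# Explicit class: only letters, spaces and tabs are allowed inside the title.
explicit = re.findall(r"Table\s+\d+\.\s+[A-Za-z \t]+\r\n", plain_text)

print(greedy)
print(lazy)
print(explicit)
```

The lazy and the explicit versions both return only the title lines here; the explicit class has the advantage that it can never cross a line break, whatever follows.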
Context: I need to split strings that are too long and that are used as column headers in an html table. Those strings are variable names, so they don't have any spaces in them.
If I let the css max-width property do the job, the string is split at a fixed place, not making use of the dots or _'s in the string.
For example, suppose I have this string:
this.is.a.long.string.indeed.yeah.well.you.know
Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:
All substrings must be 12 characters or less
Separators [._] should be at the end, not at the beginning of a substring
The number of substrings must be minimal
If several solutions exist, the one having the most similar substring lengths is to be preferred.
I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
What I have so far:
Regex: .{1,12}(_|\b|\Z)
Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know
It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.
You were so close, you just need to present it with another alternative for cases where no separator is found:
.{1,12}(_|\b|\Z)|.{1,12}
Check it out: https://regex101.com/r/XrJuYj/2/
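For anyone who wants to try this outside regex101, here is a rough sketch in Python's re module. The helper name split_header and the second test string are made up for the illustration; the pattern is the one given above with a non-capturing group. It only covers the length limit and the separator preference, not the "most similar lengths" criterion from the question.

```python
import re

# Prefer a chunk that ends at a separator or boundary; otherwise fall back
# to a plain chunk of up to 12 characters.
CHUNK = re.compile(r".{1,12}(?:_|\b|\Z)|.{1,12}")

def split_header(name):
    """Return the successive chunks matched by CHUNK (hypothetical helper)."""
    return [m.group(0) for m in CHUNK.finditer(name)]

dotted = "this.is.a.long.string.indeed.yeah.well.you.know"
print(split_header(dotted))

# A run of more than 12 letters without a separator triggers the fallback,
# so no characters are skipped.
long_run = "averyveryverylongname.x"
pieces = split_header(long_run)
print(pieces)
```

Without the fallback alternative, the engine would skip ahead in long_run until it finds a start position from which a boundary is reachable, silently dropping the leading characters.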
Edit: to ensure the split portion contains a non-separating character, you can use the following:
(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}
See it at: https://regex101.com/r/XrJuYj/3
The read.table family (read.table, read.csv, read.delim et al) has the argument check.names with the following explanation:
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
Say I have loaded a data frame containing syntactically invalid column names. Is there any other consequence apart from having to access a specific column by name using the ` character?
Check out help(make.names) to understand what it is doing and why.
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
The character "X" is prepended if necessary. All invalid characters
are translated to ".". A missing value is translated to "NA". Names
which match R keywords have a dot appended to them. Duplicated values
are altered by make.unique.
The big ones that will trip you up are blank column names (df$`` gives an error) and repeated column names (df$val will return the first val column result only).
Outside of that, if you pass this data.frame to a function that is expecting a data.frame with valid names, you will likely get errors, and perhaps silent ones that are hard to detect.
I want a regular expression that checks that a string contains at least one letter [a-zA-Z] or one digit. All other special characters are allowed, but strings made up of only special characters, only spaces, or only spaces and special characters must not be accepted.
I have tried /\b(?=[A-Z]*[0-9])(?=[0-9]*[A-Z])[\s\S]\b/i and ^(a-zA-Z0-9).*[\s\S]*$ and ^(a-zA-Z0-9).*[\s].*[\S]*$ etc. But it's not working. Awaiting your valuable response.
Thanks
^(?=.*[\w\d]).+
This pattern will fail unless the string contains at least one word character (a letter, a digit, or an underscore, since \w already covers both \d and _), whatever special characters and spaces surround it.
I'm not sure I understood you correctly, but from what I've gathered you want to have at least one letter or digit (a-z, 0-9) in the string. This regex will do just that: /^(?=.*[a-z\d]).+/igm
(Set the flags however they need to be set in asp.net. The m-flag might be redundant for you, I only used it for the demo. The g-flag likely does not exist. If so, just remove it.)
Demo+explanation: http://regex101.com/r/jY9fJ5
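As a quick sanity check, here is a small sketch of the lookahead approach in Python's re module (not ASP.NET, so flag names differ; the sample strings are invented). It also shows the one practical difference between [a-z\d] and [\w\d]: the latter accepts a string made only of underscores.

```python
import re

# Lookahead requires a letter or digit somewhere; .+ then accepts anything.
# re.ASCII keeps \d and case-insensitive matching restricted to ASCII.
HAS_ALNUM = re.compile(r"^(?=.*[a-z\d]).+", re.IGNORECASE | re.ASCII)

samples = ["abc", "!! 7", "a#", "___", "   ", "#$%", ""]
accepted = [s for s in samples if HAS_ALNUM.match(s)]
print(accepted)

# \w includes the underscore, so this variant lets "___" through.
underscore_ok = re.match(r"^(?=.*[\w\d]).+", "___") is not None
print(underscore_ok)
```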
If you want at least one letter or digit, followed by only spaces and symbols:
/^.*[a-zA-Z0-9][^a-zA-Z0-9]*$/
If you want exactly one letter or digit, with only spaces and symbols around it:
/^[^a-zA-Z0-9]*[a-zA-Z0-9][^a-zA-Z0-9]*$/
I can't imagine what else it is that you are looking for. Examples would help immensely.
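In case concrete inputs help, here is a small sketch comparing the two patterns, written with Python's re module for illustration; the test strings are made up.

```python
import re

AT_LEAST_ONE = re.compile(r"^.*[a-zA-Z0-9][^a-zA-Z0-9]*$")
EXACTLY_ONE = re.compile(r"^[^a-zA-Z0-9]*[a-zA-Z0-9][^a-zA-Z0-9]*$")

# Each line prints: the input, then whether each pattern accepts it.
for s in ["!!a!!", "ab", "! 7 ?", "!!!", "  "]:
    print(repr(s), bool(AT_LEAST_ONE.search(s)), bool(EXACTLY_ONE.search(s)))
```

The two patterns only disagree on inputs such as "ab", which contain more than one letter or digit.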
(?=.*?[0-9])(?=.*?[A-Za-z]).+
Allows special characters and requires at least one digit and one letter.
(?=.*?[0-9])(?=.*?[A-Za-z])(?=.*[^0-9A-Za-z]).+
Demands at least one letter, one digit and one special-character.
The first one does not demand special chars, only allows them.
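A short sketch of how the two patterns behave, using Python's re module for illustration with invented inputs. Note that both require a digit and a letter together, which is stricter than "a letter or a digit".

```python
import re

DIGIT_AND_LETTER = re.compile(r"(?=.*?[0-9])(?=.*?[A-Za-z]).+")
DIGIT_LETTER_SPECIAL = re.compile(
    r"(?=.*?[0-9])(?=.*?[A-Za-z])(?=.*[^0-9A-Za-z]).+"
)

# Each line prints: the input, then whether each pattern accepts it.
for s in ["a1", "a1!", "abc", "123", "!!!"]:
    print(repr(s), bool(DIGIT_AND_LETTER.match(s)), bool(DIGIT_LETTER_SPECIAL.match(s)))
```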
I'm trying to set up a validation expression for an ASP.Net Regular Expression Validator control. It is for validating the creation of a user name, so I want to limit the number of characters, and I also want to prevent them from using spaces. Here's what I've got so far:
^.*(?=.{5,20})(?=.*\w{5,255}).*$
The \w{5,255} part prevents spaces and special characters (except for underscores, apparently). I have no idea how "5,255" makes it work, but it does; I just copied it from somewhere else.
The main problem I'm having is that if the first or last character is a space (or special character), it passes validation, which is not acceptable. Can anyone help me? I'm sure it is something simple, but I know next to nothing about regular expressions.
You can use something simpler like this:
^[a-zA-Z0-9_]{5,255}$
This will allow usernames of 5 to 255 characters made up of letters, digits, and underscores.
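To check the behaviour at the edges, here is a small sketch comparing the pattern from the question with the simpler one. It runs in Python's re module rather than in the ASP.NET validator, so treat it as an approximation; the usernames are invented.

```python
import re

ORIGINAL = re.compile(r"^.*(?=.{5,20})(?=.*\w{5,255}).*$")
SIMPLE = re.compile(r"^[a-zA-Z0-9_]{5,255}$")

# Each line prints: the input, then whether each pattern accepts it.
for name in ["alice", "alice_01", " alice", "alice ", "al ice", "bob"]:
    print(repr(name), bool(ORIGINAL.match(name)), bool(SIMPLE.match(name)))
```

The original pattern accepts " alice" and "alice " because the leading and trailing .* can absorb the space; the character class anchored with ^ and $ leaves no room for it. If the upper limit should be 20 as in the question, change {5,255} to {5,20}.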
(let's expand overall understanding of how to at least use regex!)
The main reason why the posted regex wasn't working is because you were attempting to use lookahead. Lookahead is a 0-length pattern that just guarantees that the next part of the string will match a certain pattern (and is usually used to take advantage of it being 0-length, so it doesn't expand your capturing group).
Effectively, what your regex (going off of the original /^.(?=.{5,20})(?=.\w{5,255}).*$/) meant was:
^. "The beginning of our line should match any single character (provided it's not a newline, although this depends on the regex implementation as well as flags that may or may not have been passed in)"
(?= "and guarantee that after here"
.{5,20}) "are any 5-20 characters."
(?= "Also, after that same first character (since, remember, lookahead is 0-length), guarantee"
. "one arbitrary character"
\w{5,255}) "and 5-255 word characters."
.*$ "And of course, since all of that exhaustive matching was 0-length, we want the rest of the line to be an arbitrary number of characters."
What you technically could have done to use lookaround was ^(?=\w{5,255}$).{5,255}$ (the $ inside the lookahead matters; without it only the first five characters are checked), but that's just overly convoluted. I'd suggest just using \w{5,255} or something along those lines.
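To make the "0-length" point concrete, here is a small sketch in Python's re module (an illustration only; the names and inputs are made up). It shows that a lookahead consumes nothing, and that a lookahead without an end anchor inside it only vets the first few characters.

```python
import re

# A lookahead matches without consuming: the match ends where it started.
m = re.match(r"(?=\w{5})", "alice")
print(m.span())

LOOKAHEAD_OPEN = re.compile(r"^(?=\w{5,255}).{5,255}$")      # no $ in the lookahead
LOOKAHEAD_ANCHORED = re.compile(r"^(?=\w{5,255}$).{5,255}$")  # $ in the lookahead
PLAIN = re.compile(r"^\w{5,255}$")

# Each line prints: the input, then whether each pattern accepts it.
for name in ["alice_01", "abcde fg"]:
    print(
        repr(name),
        bool(LOOKAHEAD_OPEN.match(name)),
        bool(LOOKAHEAD_ANCHORED.match(name)),
        bool(PLAIN.match(name)),
    )
```

The anchored lookahead and the plain \w version agree on both inputs, which is why the plain version is the easier choice.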