Legal column names in R and consequences of syntactically invalid column names - r

df <- read.csv(
text = '"2019-Jan","2019-Feb",
"3","1"',
check.names = FALSE
)
OK, so I use check.names = FALSE and now my column names are not syntactically valid. What are the practical consequences?
df
#> 2019-Jan 2019-Feb
#> 1 3 1 NA
And why is this NA appearing in my data frame? I didn't put that in my code. Or did I?
Here's the check.names man page for reference:
check.names logical. If TRUE then the names of the variables in the
data frame are checked to ensure that they are syntactically valid
variable names. If necessary they are adjusted (by make.names) so that
they are, and also to ensure that there are no duplicates.

The only consequence is that you need to escape or quote the names to work with them. You either string-quote and use standard evaluation with the [[ column subsetting operator:
df[['2019-Jan']]
… or you escape the identifier name with backticks (R confusingly also calls this quoting), and use the $ subsetting:
df$`2019-Jan`
Both work, and can be used freely (as long as they don’t lead to exceedingly unreadable code).
To make matters more confusing, R allows using '…' and "…" instead of `…` in certain contexts:
df$'2019-Jan'
Here, '2019-Jan' is not a character string as far as R is concerned! It’s an escaped identifier name.1
This last one is a really bad idea because it confuses names2 with character strings, which are fundamentally different. The R documentation advises against this. Personally I’d go further: writing 'foo' instead of `foo` to refer to a name should become a syntax error in future versions of R.
1 Kind of. The R parser treats it as a character string. In particular, both ' and " can be used, and are treated identically. But during the subsequent evaluation of the expression, it is treated as a name.
2 “Names”, or “symbols”, in R refer to identifiers in code that denote a variable or function parameter. As such, a name is either (a) a function name, (b) a non-function variable name, (c) a parameter name in a function declaration, or (d) an argument name in a function call.

The NA issue is unrelated to the names. read.csv is expecting an input with no comma after the last column. You have a comma after the last column, so read.csv reads the blank space after "2019-Feb", as the column name of the third column. There is no data for this column, so an NA value is assigned.
Remove the extra comma and it reads properly. Of course, it may be easier to just remove the last column after using read.csv.
df <- read.csv(
text = '"2019-Jan","2019-Feb"
"3","1"',
check.names = FALSE
)
df
# 2019-Jan 2019-Feb
# 1 3 1

Consider df$foo where foo is a column name. Syntactically invalid names will not work.
As for the NA it’s a consequence of there being three columns in your first line and only two in your second.

Related

read.csv ;check.names=F; R;Look at the picture,why it works a treat?

please see the the column name "if" in the second column,the deifference is :when check.name=F,"." beside "if" disappear
Sorry for the code,because I try to type some codes to generate this data.frame like in the picture,but i failed due to the "if".We know that "if" is a reserved word in R(like else,for, while ,function).And here, i deliberately use the "if" as the column name (the 2nd column),and see whether R will generate some novel things.
So using another way, I type the "if" in the excel and save as the format of csv in order to use read.csv.
Question is:
Why "if." changes to "if"?(After i use check.names=FALSE)
enter image description here
?read.csv describes check.names= in a similar fashion:
check.names: logical. If 'TRUE' then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
'make.names') so that they are, and also to ensure that there
are no duplicates.
The default action is to allow you to do something like dat$<column-name>, but unfortunately dat$if will fail with Error: unexpected 'if' in "dat$if", ergo check.names=TRUE changing it to something that the parser will not trip over. Note, though, that dat[["if"]] will work even when dat$if will not.
If you are wondering if check.names=FALSE is ever a bad thing, then imagine this:
dat <- read.csv(text = "a,a\n2,3")
dat
# a a.1
# 1 2 3
dat <- read.csv(text = "a,a\n2,3", check.names = FALSE)
dat
# a a
# 1 2 3
In the second case, how does one access the second column by-name? dat$a returns 2 only. However, if you don't want to use $ or [[, and instead can rely on positional indexing for columns, then dat[,colnames(dat) == "a"] does return both of them.

How to replace NA's with blank value?

I am pretty new to R and I wonder if I can replace NA value (which looks like string) with blank, nothing
It is easy when entire table is as.character however my table contains double's as well therefore when I try to run
f <- as.data.frame(replace(df, is.na(df), ""))
or
df[is.na(df)] <- ""
Both does not work.
Error is like
Assigned data `values` must be compatible with existing data.
Error occurred for column `ID`.
Can't convert <character> to <double>.
and I understand, but I really need ID's as well as any other cell in the table (character, double or other) blank to remain in the table, later on it is connected to BI tool and I can't present "NA", just blank for the sake of clarity
If your column is of type double (numbers), you can't replace NAs (which is the R internal for missings) by a character string. And "" IS a character string even though you think it's empty, but it is not.
So you need to choose: converting you whole column to type character or leave the missings as NA.
EDIT:
If you really want to covnert your numeric column to character, you can just use as.character(MYCOLUMN). But I think what you really want is:
Telling your exporting function how to treat NA'S, which is easy, e.g. write.csv(df, na = ""). Also check the help function with ?write.csv.

Assign names to list elements without titled quotes

I am interested to assign names to list elements. To do so I execute the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears with ``.
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result to receive the latter naming pattern without quotes?
gsub() and paste () seem to produce the same class () object. What is the difference?
Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the quotations, that has nothing to do with gsub and everything to do with the fact that you are naming variables/columns with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll realize you have the same problem when invoking the naming pattern. It basically has to do with the fact that R doesn't want to be confused about whether a number is really a number or a variable because that would be chaos (how can 1 refer to a variable and also the number 1?), so what it does in such cases is change a number 1 into the character "1", which can be given names. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically correcting that for you! This is not a problem when you name something using characters. Finally, an easy fix: just add a character in front of the numbers of your naming pattern, and you'll be able to invoke them without the quotations. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
Should do the trick (or just change the "file_" into whatever you want as long as it's not empty, cause then you just have numbers left and the same problem)!

R encoding ASCII backtick

I have the following backtick on my list's names. Prior lists did not have this backtick.
$`1KG_1_14106394`
[1] "PRDM2"
$`1KG_20_16729654`
[1] "OTOR"
I found out that this is a 'ASCII grave accent' and read the R page on encoding types. However what to do about it ? I am not clear if this will effect some functions (such as matching on list names) or is it OK leave it as is ?
Encoding help page: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
Thanks!
My understanding (and I could be wrong) is that the backticks are just a means of escaping a list name which otherwise could not be used if left unescaped. One example of using backticks to refer to a list name is the case of a name containing spaces:
lst <- list(1, 2, 3)
names(lst) <- c("one", "after one", "two")
If you wanted to refer to the list element containing the number two, you could do this using:
lst[["after one"]]
But if you want to use the dollar sign notation you will need to use backticks:
lst$`after one`
Update:
I just poked around on SO and found this post which discusses a similar question as yours. Backticks in variable names are necessary whenever a variable name would be forbidden otherwise. Spaces is one example, but so is using a reserved keyword as a variable name.
if <- 3 # forbidden because if is a keyword
`if` <- 3 # allowed, because we use backticks
In your case:
Your list has an element whose name begins with a number. The rules for variable names in R is pretty lax, but they cannot begin with a number, hence:
1KG_1_14106394 <- 3 # fails, variable name starts with a number
KG_1_14106394 <- 3 # allowed, starts with a letter
`1KG_1_14106394` <- 3 # also allowed, since escaped in backticks

Why are Xs added to data frame variable names when using read.csv?

When I use the read.csv() function in R to load data, I often find that an X has been added to variable names. I think I just about always see it it in the first variable, but I could be wrong.
At first, I thought R might be doing this because I had a space at the beginning of the variable name - I don't.
Second, I had read somewhere that if you have a variable that starts with a number, or is a very short variable name, R would add the X. The variable name is all text and the length of the name of this variable is 12 characters, so it's not short.
Now, this is purely an annoyance. I can rename the column, but it does add a step, albeit a small one.
Is there a way to prevent this from rogue X from infiltrating my data frame?
Here is my original code:
df <- read.csv("/file/location.filecsv", header=T, sep=",")
Here is the variable in question:
str(orders)
'data.frame': 2620276 obs. of 26 variables:
$ X.OrderDetailID : Factor w/ 2620193 levels "(2620182 row(s) affected)",..: 105845
read.table and read.csv have a check.names= argument that you can set to FALSE.
For example, try it with this input consisting of just a header:
> read.csv(text = "a,1,b")
[1] a X1 b
<0 rows> (or 0-length row.names)
versus
> read.csv(text = "a,1,b", check.names = FALSE)
[1] a 1 b
<0 rows> (or 0-length row.names)
It is surprising behavior, but I think we would need a reproducible example. Perhaps you have some invisible/special characters hiding in your file?
names(read.csv(textConnection(
"abcdefghijkl, a1,2x")))
behaves fine. Can you make an example along these lines that demonstrates your problem?
As described in the other answer, check.names=FALSE is a possible workaround. You can experiment with make.names to determine the behavior ...
As Gabor said, by default read.csv deafults to converting the names in your header row to be valid variable names (use check.names = FALSE to turn this off). This is done using the function make.names. The help page for that function explains what constitutes a valid variable name.
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
The list of reserved words is found on the help page ?reserved.
The other condition is that the variable name must be 10000 characters or less, but make.names won't shorten it. So be careful of being really verbose with your variable names.
You can check for valid variable names using
library(assertive.code)
is_valid_variable_name(x)

Resources