The read.table family (read.table, read.csv, read.delim et al) has the argument check.names with the following explanation:
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
Say I have loaded a data frame containing syntactically invalid column names. Is there any other consequence apart from having to access a specific column by name using the ` character?
Check out help(make.names) to understand what it is doing and why.
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
The character "X" is prepended if necessary. All invalid characters
are translated to ".". A missing value is translated to "NA". Names
which match R keywords have a dot appended to them. Duplicated values
are altered by make.unique.
The big ones that will trip you up are blank column names (df$`` gives an error) and repeated column names (df$val will return the first val column result only).
Outside of that, if you pass this data.frame to a function that is expecting a data.frame with valid names, you will likely get errors, and perhaps silent ones that are hard to detect.
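As a minimal sketch of the difference, compare the names you get with and without check.names. The inline CSV text and its column names are made up for illustration.

```r
# Hypothetical CSV header with a space, a leading digit and a duplicated name
csv <- "my col,2x,val,val\n1,2,3,4"

raw   <- read.csv(text = csv, check.names = FALSE)
fixed <- read.csv(text = csv)  # check.names = TRUE is the default

names(raw)    # names kept exactly as they appear in the file
names(fixed)  # names adjusted by make.names(unique = TRUE)

raw$`my col`  # a syntactically invalid name needs backticks
raw$val       # only the first of the two "val" columns is returned

# The same adjustment applied directly to a character vector
make.names(c("my col", "2x", "val", "val"), unique = TRUE)
```

Keeping check.names = FALSE is reasonable when the original headers have to be preserved for display or export, as long as downstream code does not assume valid, unique names.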
I have tables of weather data from several stations. When I import them separately using read.csv, the fields are factors, integers, and numerics. However, when I try to import one csv file with all of data combined, the resulting fields in a dataframe are all factors. In the combined file the 1st field has several alphanumeric variables, whereas in the individual files there is only one variable (name of station).
This is common behaviour of data.frame() in base R (it was the default before R 4.0.0), and the result of read.csv() is a data.frame. As @Duck suggested in the comments, you can avoid this behaviour by setting the stringsAsFactors argument to FALSE.
read.csv('myfile.csv', stringsAsFactors = FALSE)
The description below comes from the documentation page of the data.frame function, which you can open with the ?data.frame command.
Character variables passed to data.frame are converted to factor columns unless protected by I() or argument stringsAsFactors is false.
So in your case, this happens in your combined file because R is interpreting all variables as characters. Why? Probably because in one (or some) of your files, some rows in the numeric and integer columns are out of format. For example, maybe one row has an "x" to represent a missing value. read.csv() uses the entire file to decide which type each column has, so as soon as the function hits this "x" value, it interprets the entire column as character. When this data is passed to data.frame(), the function converts these characters to factors. You said that the first field of the combined file has some alphanumeric values. Values like these, appearing in columns that should be numeric, are probably the "x"s that are generating your problem.
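A small sketch of that effect, using made-up inline data in which "x" stands for a missing value. The station and temp column names are assumptions for illustration only.

```r
# Hypothetical data: one stray "x" in an otherwise numeric column
csv <- "station,temp\nA,1.5\nB,x\nC,2.5"

bad <- read.csv(text = csv, stringsAsFactors = FALSE)
class(bad$temp)  # the whole column is read as character

# Locate the values that cannot be parsed as numbers
bad$temp[is.na(suppressWarnings(as.numeric(bad$temp)))]

# Declaring "x" as a missing-value marker keeps the column numeric
good <- read.csv(text = csv, na.strings = c("NA", "x"),
                 stringsAsFactors = FALSE)
class(good$temp)
```

Running the second step on each suspicious column of the combined file is one way to find the offending rows before deciding how to treat them.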
I have a large data.frame d that was read from a .csv file using read (it is actually a data.table resulting from fread a .csv file). I want to check in every column of type character for weird/corrupted characters. Meaning the weird sequences of characters that result from other corrupted parts of a text file or from using the wrong encoding.
A data.table solution, or some other fast solution, would be best.
This is a pseudo-code for a possible solution
1) create a vector str_cols with the names of all the character columns of d
2) for each column j in str_cols compute a frequency table of the values: tab <- d[, .N, j]. (this step is probably not necessary, just used to reduce the dimensions of the object that will be checked in columns with repetitions)
3) Check the values of j in the summary table tab
The crucial step is 3. Is there a function that does that?
Edit1: Perhaps some smart regular expression? This is a related non R question, trying to explicitly list all weird characters. Another solution perhaps is to find any character outside of the accepted list of characters [a-z 0-9 + punctuation].
If you post some example data it would be easier to give a more definitive answer. You could likely try something like this though.
DT[, lapply(.SD, stringr::str_detect, "[^[:print:]]")]
It will return a data.table of the same size, but any string that has characters that aren't alphanumeric, punctuation, and/or space will be replaced with TRUE, and everything else replaced with FALSE. This would be my interpretation of your question about wanting to detect values that contain these characters.
You can change the behavior by replacing str_detect with whatever base R or stringr function you want, and slightly modifying the regex as needed. For example, you can remove the offending characters with the following code.
DT[, lapply(.SD, stringr::str_replace_all, "[^[:print:]]", "")]
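The three steps from the question can also be sketched in base R, which works on a data.table as well because it is a data.frame. This is only one possible interpretation: it assumes that "weird" means any byte outside printable ASCII, which is stricter than [:print:] and will also flag legitimate accented letters. The data frame d and its columns are made up for illustration.

```r
# Hypothetical data: one accented value and one control character
d <- data.frame(
  id   = 1:3,
  name = c("alpha", "caf\u00e9", "ok"),
  note = c("fine", "fine", "broken\x01"),
  stringsAsFactors = FALSE
)

# Step 1: names of all character columns
str_cols <- names(d)[vapply(d, is.character, logical(1))]

# Steps 2 and 3: reduce each column to its unique values, then keep
# those containing a byte outside the printable ASCII range 0x20-0x7E.
# useBytes = TRUE avoids errors on strings with invalid encodings.
suspicious <- lapply(str_cols, function(j) {
  vals <- unique(d[[j]])
  vals[grepl("[^\\x20-\\x7E]", vals, perl = TRUE, useBytes = TRUE)]
})
names(suspicious) <- str_cols

suspicious
```

Widening the accepted range in the pattern is the place to adjust if accented letters or other non-ASCII text are expected in the data.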
I have spent a lot of time reading successful answers on how to extract part of a string using substr and substring. Still, I am having trouble applying answers because I cannot differentiate where specific punctuation characters are used to indicate when to start and stop selecting other punctuation characters, if those characters appear more than once within the string.
In a generalised case, I would like to split my string in several places based on the recurring characters "_" and "."
In my individual case, a cell in one column of my dataframe contains a filename as a string, and I would like to use that string to generate strings in 3 subsequent columns on the same row.
To demonstrate, the string might look like: "Name_12. Word_CsvData.txt" and that should be split without referring to numeric character positions into "Name", "12", "Word"
To do this, I am aiming to find out how to extract part of a string:
1) from the beginning of the string to the first instance of an underscore character.
2) from the first underscore to the first full stop.
3) from the space to the second underscore character.
Any help would be much appreciated.
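One possible sketch of this, assuming the pieces are always separated by runs of "_", "." or space as in the example above: split on those characters with strsplit() and keep the first three pieces, so no numeric positions are involved. The data frame df and the new column names are hypothetical.

```r
# Hypothetical data frame with a filename column
df <- data.frame(
  file = c("Name_12. Word_CsvData.txt", "Other_7. Thing_CsvData.txt"),
  stringsAsFactors = FALSE
)

# Split on any run of underscore, full stop or space,
# then keep the first three pieces of each filename
pieces <- strsplit(df$file, "[_. ]+")
parts  <- do.call(rbind, lapply(pieces, `[`, 1:3))

df$name   <- parts[, 1]
df$number <- parts[, 2]
df$word   <- parts[, 3]

df
```

If the separators can vary between files, anchored patterns with sub() for each of the three parts would be a stricter alternative.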
I have spent hours looking for a proper solution but found nothing on the Internet. Here is my question. In R, I have a character vector containing my desired variable names ("2011_Q4", "2012_Q1", ...). When I try to assign a dataset to each of these names in a loop, it does work, but the output is strange. Indeed, I have
> View(`2011_Q4`)
instead of
> View(2011_Q4)
And I don't know how to remove this apostrophe. It's very annoying since I have to type this ` in order to call the variable.
Can somebody help me? I would appreciate any help.
Thanks a lot and best regards
Firstly, it's a backtick (`), not an apostrophe ('). In R, backticks occasionally denote variable names; apostrophes work as single quotes for denoting strings.
The issue you're having is that your variables start with a number, which is not allowed in R. Since you somehow made it happen anyway, you need to use backticks to tell R not to interpret 2011_Q4 as a number, but as a variable.
From ?Quotes:
Names and Identifiers
Identifiers consist of a sequence of letters, digits, the period (.)
and the underscore. They must not start with a digit nor underscore,
nor with a period followed by a digit. Reserved words are not valid
identifiers.
The definition of a letter depends on the current locale, but only
ASCII digits are considered to be digits.
Such identifiers are also known as syntactic names and may be used
directly in R code. Almost always, other names can be used provided
they are quoted. The preferred quote is the backtick (`), and deparse
will normally use it, but under many circumstances single or double
quotes can be used (as a character constant will often be converted to
a name). One place where backticks may be essential is to delimit
variable names in formulae: see formula.
The best solution to your issue is simply to change your variable names to something that starts with a letter, e.g. Y2011_Q4.
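A minimal sketch of that renaming inside the loop, assuming the names come from a character vector. The vector quarters and the placeholder data frames are made up for illustration.

```r
# Hypothetical names taken from the question
quarters <- c("2011_Q4", "2012_Q1")

# Option 1: prefix each name with a letter so it becomes syntactic
for (q in quarters) {
  assign(paste0("Y", q), data.frame(quarter = q, value = 1))
}
Y2011_Q4  # no backticks needed

# Option 2: keep the original names as labels in a named list
datasets <- lapply(quarters, function(q) data.frame(quarter = q, value = 1))
names(datasets) <- quarters
datasets[["2011_Q4"]]
```

The list version avoids creating many global variables and lets the original labels stay untouched, since list names do not have to be syntactic.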
I can’t find a spec of the language…
Note that I want a correct answer, e.g. like this, as I could easily come up with a simple but likely wrong approximation myself, such as [[:alpha:]._][\w._]*
The documentation for make.names() says
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
@Roland points out this section of the R language definition:
10.3.2 Identifiers
Identifiers consist of a sequence of letters, digits, the period (‘.’) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit.
The definition of a letter depends on the current locale: the precise set of characters allowed is given by the C expression (isalnum(c) || c == ‘.’ || c == ‘_’) and will include accented letters in many Western European locales.
Notice that identifiers starting with a period are not by default listed by the ls function and that ‘...’ and ‘..1’, ‘..2’, etc. are special.
Notice also that objects can have names that are not identifiers. These are generally accessed via get and assign, although they can also be represented by text strings in some limited circumstances when there is no ambiguity (e.g. "x" <- 1). As get and assign are not restricted to names that are identifiers they do not recognise subscripting operators or replacement functions.
The rules seem to allow "Morse coding":
> .__ <- 1
> ._._. <- 2
> .__ + ._._.
[1] 3
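Rather than writing a regular expression by hand, one commonly used check is to compare a name with what make.names() would turn it into. This is a sketch, not an official API: the helper name is_syntactic is made up, and the result inherits the locale dependence of make.names().

```r
# Hypothetical helper: TRUE when make.names() leaves the name unchanged
is_syntactic <- function(x) {
  make.names(x) == x
}

candidates <- c("x", "a.b", ".__", ".2way", "2011_Q4", "_a", "if")
data.frame(name = candidates, valid = is_syntactic(candidates))
```

Reserved words such as "if" are caught too, because make.names() appends a dot to them.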