Selecting non-existent columns in data.table - r

Here is my small example:
library(data.table)
data<-data.table(x_2dig_id=rnorm(100))
data$x_2dig
What I do not understand is why I do not get an error (there is no column x_2dig in my data).
It would be great if somebody could elaborate on this.

This happens with lists as well, which are one of the most basic data structures in R. $ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]]. It's often used to access variables in a data frame.
If you want to receive a warning when partial matching occurs, you can set:
options(warnPartialMatchDollar = TRUE)
From the R documentation:
?`[[`
x[[i, exact = TRUE]]
exact Controls possible partial matching of [[ when extracting by a
character vector (for most objects, but see under ‘Environments’). The
default is no partial matching. Value NA allows partial matching but
issues a warning when it occurs. Value FALSE allows partial matching
without any warning.
and
Both [[ and $ select a single element of the list. The main difference
is that $ does not allow computed indices, whereas [[ does. x$name is
equivalent to x[["name", exact = FALSE]]. Also, the partial matching
behavior of [[ can be controlled using the exact argument.
This is also explained in the subsetting chapter of Hadley Wickham's Advanced R book.
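A minimal base R sketch of the behaviour described above. A plain list is used in place of the data.table, and the object name `cols` is made up for illustration:

```r
# A plain list stands in for the data.table; the name mirrors the question.
cols <- list(x_2dig_id = c(1.5, 2.5, 3.5))

# `$` partially matches the unique prefix "x_2dig" to "x_2dig_id".
via_dollar <- cols$x_2dig

# `[[` defaults to exact = TRUE, so the same prefix finds nothing.
via_exact <- cols[["x_2dig"]]

# exact = FALSE reproduces what `$` does.
via_partial <- cols[["x_2dig", exact = FALSE]]

print(via_dollar)   # the values stored under x_2dig_id
print(via_exact)    # NULL
```

If silent partial matching is not wanted, `[[` with its default `exact = TRUE` is the safer habit.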

This goes back to the old days of R being overly helpful and trying to guess what you meant. Data frames (upon which data tables are built) have this functionality, while tibbles (data_frame) intentionally leave this out to prevent the problem you're seeing! Hadley talks about this sometimes in his talks.

Because data.table sees that the prefix you typed uniquely identifies one column. The example below demonstrates that once the prefix isn't unique, the lookup returns NULL.
data<-data.table(x_2dig_id=rnorm(100),x_2dig_id2=rnorm(100))
data$x_2dig
NULL
Edit: Per Joy, this may be an R thing, and not specific to data.table per se.
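The same result can be reproduced without data.table, which supports the reading that this is base R behaviour. A small sketch with an ordinary data frame (the object name `df` and the values are illustrative only):

```r
# Two column names share the prefix "x_2dig", so partial matching is ambiguous.
df <- data.frame(x_2dig_id = 1:3, x_2dig_id2 = 4:6)

# No unique partial match exists, so `$` returns NULL instead of a column.
ambiguous <- df$x_2dig

# An exact match is always honoured, even though "x_2dig_id" is also
# a prefix of "x_2dig_id2".
exact <- df$x_2dig_id

print(ambiguous)
print(exact)
```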

Related

Why doesn't R throw an error when I use only the initial part of my column name in a data frame?

I have a data frame containing various columns along with sender_bank_flag. I ran the below two queries on my data frame.
sum(s_50k_sample$sender_bank_flag, na.rm=TRUE)
sum(s_50k_sample$sender_bank, na.rm=TRUE)
I got the same output from both queries even though there is no such column as sender_bank in my data frame. I expected to get an error from the second line. I didn't know R had such functionality! Does anyone know what exactly this functionality is and how it can be better utilized?
Probably worthwhile to augment all comments into an answer.
Both my comment and BenBolker's point to doc page ?Extract:
Under Recursive (list-like) objects:
Both "[[" and "$" select a single element of the list. The main difference is that "$" does not allow computed indices, whereas "[[" does. x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of "[[" can be controlled using the exact argument.
Under Character indices:
Character indices can in some circumstances be partially matched (see ?pmatch) to the names or dimnames of the object being subsetted (but never for subassignment). Unlike S (Becker et al p. 358), R never uses partial matching when extracting by "[", and partial matching is not by default used by "[[" (see argument exact).
Thus the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by "$". Even in that case, warnings can be switched on by options(warnPartialMatchDollar = TRUE).
Note that the manual is rich in information, so make sure you digest it fully. I formatted the content and added links to Stack Overflow threads where relevant.
The links in phiver's comment are worth reading in the long term.
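To make the quoted rules concrete, here is a short sketch with a toy data frame. The name `s_sample` and its values are invented for illustration and are not the asker's data:

```r
s_sample <- data.frame(sender_bank_flag = c(1, 0, 1))

# By default `$` silently matches the prefix, so this sums sender_bank_flag.
silent_sum <- sum(s_sample$sender_bank, na.rm = TRUE)

# `[` never partially matches: asking for "sender_bank" is an error.
bracket_result <- tryCatch(
  s_sample["sender_bank"],
  error = function(e) "error"
)

# With the option switched on, the partial match signals a warning.
# tryCatch() captures the warning message here instead of printing it.
old <- options(warnPartialMatchDollar = TRUE)
dollar_result <- tryCatch(
  s_sample$sender_bank,
  warning = function(w) conditionMessage(w)
)
options(old)  # restore the previous setting
```

Setting `options(warnPartialMatchDollar = TRUE)` at the start of a session is a cheap way to surface accidental prefix matches while keeping existing code working.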

What's wrong with using $-extraction?

fortune(312) and fortune(343) allude to the problems with using $ to extract elements of a list instead of [[, but aren't specific about what exactly the dangers are.
The problem here is that the $ notation is a magical shortcut and like any other magic
if used incorrectly is likely to do the programmatic equivalent of turning yourself into
a toad.
-- Greg Snow (in response to a user that wanted to access a column whose name is stored
in y via x$y rather than x[[y]])
R-help (February 2012)
Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
consequences. It's best to acquire the '[[' and '[' habit early.
-- Peter Ehlers (about the use of $-extraction)
R-help (March 2013)
Looking through the documentation for `$`, I've found that
$ is only valid for recursive objects
and
The main difference is that $ does not allow computed indices, whereas [[ does [...] Also, the partial matching behavior of [[ can be controlled using the exact argument.
So, is the argument to use [[ over $ because the former offers greater control and transparency in writing code? What are the actual risks of using $, and if [[ is preferred are there any circumstances where it is appropriate to use $-extraction?
Consider the list:
foo<-list(BlackCat=1,BlackDog=2, WhiteCat=3,WhiteDog=4)
Suppose you want to choose the element according to two user-supplied parameters: colour and animal species.
Parametrising the colour and the species somewhere in the code as
myColour<-"Black"
mySpecies<-"Dog"
you can easily build the index from those parameters:
foo[[paste0(myColour,mySpecies)]]
by using [[ or [. However, this is not the case for $ extraction: foo$paste0(myColour,mySpecies) does not evaluate paste0 to build a name. R instead looks for an element literally named paste0, finds none, and then fails when it tries to call the resulting NULL as a function.
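The difference can be checked directly. The sketch below reuses the list from this answer; the variable names `key`, `by_double_bracket`, `by_dollar` and `dollar_call` are added for illustration:

```r
foo <- list(BlackCat = 1, BlackDog = 2, WhiteCat = 3, WhiteDog = 4)
myColour <- "Black"
mySpecies <- "Dog"

# `[[` evaluates its argument, so a computed name works.
key <- paste0(myColour, mySpecies)
by_double_bracket <- foo[[key]]

# `$` takes the name literally: this looks for an element called "key".
by_dollar <- foo$key

# foo$paste0(...) is parsed as a call to the element foo$paste0, which is NULL,
# so R raises an error; tryCatch() captures the message.
dollar_call <- tryCatch(
  foo$paste0(myColour, mySpecies),
  error = function(e) conditionMessage(e)
)

print(by_double_bracket)
print(by_dollar)
```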

R: Error in .Primitive, non-numeric argument to binary operator

I did some reading on similar SO questions, but couldn't figure out how to resolve my error.
I have written the following string of code:
points[paste0(score.avail,"_pts")] <-
Map('*', points[score.avail], mget(paste0(score.avail,'_m')) )
Essentially, I have a list of columns in the 'points' data frame, defined by 'score.avail'. I am multiplying each of the columns by a respective constant, defined as the paste0(score.avail, '_m') expression. It appends new fields based on the multiplication, given by paste0(score.avail, "_pts") expression.
I have used this function before in a similar setup with no issues. However, I am now getting the following error:
Error in .Primitive("*")(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
non-numeric argument to binary operator
I'm pretty sure R is telling me that one of the fields I'm trying to multiply is not numeric. However, I have checked all my fields, and they are numeric. I have even tried running a line as.numeric(score.avail) but that doesn't help. I also ran the following to remove NA's in the fields (before the Map function above).
for(col in score.avail){
points[is.na(get(col)) & (data.source == "average" |
data.source == "averageWeighted"), (col) := 0]}
The thing that stumps me is that this expression has worked with no issues before.
Update
I did some more digging by separating out each component of my original function. I'm getting odd output when running points[score.avail]. Previously when I ran this, it would return just the columns for all of my rows. Now, however, I'm getting none of the rows in my original data frame -- rather, it is imputing the column names in the 'score.avail' list as rows and filling in NA's everywhere (this is clearly the source of my problem).
I think this is because the object I'm pointing to is a data.table with key variables set. Previously, when I used this function, I had been pointing to a data frame.
Off to try a few more things.
Another Update
I was able to solve my problem by copying the 'points' object using as.data.frame(). However, I will leave the question open to see if anyone knows how to reset the data table key vars so that the function I specified above will work.
I was able to solve my problem by copying the 'points' object using as.data.frame(). Apparently classifying the object as a data.table was causing my headaches.
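One possible reading of the symptom, assuming points really is a keyed data.table: data.table treats a character vector in the first position of `[` as values to look up in the key (a row join), not as column names. The sketch below shows a way to select and multiply columns by name without converting to a data frame. It needs the data.table package, and the toy data, the `multipliers` list (standing in for the `mget(...)` constants) and `new_cols` are all hypothetical:

```r
# Sketch only: runs when data.table is installed; all data here is made up.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  points <- data.table(id = c("p1", "p2"), pass = c(10, 20), rush = c(1, 2))
  setkey(points, id)

  score.avail <- c("pass", "rush")
  multipliers <- list(pass = 0.5, rush = 2)
  new_cols <- paste0(score.avail, "_pts")

  # Select columns by name explicitly; with = FALSE treats j as column names.
  selected <- points[, score.avail, with = FALSE]

  # Multiply each selected column by its constant and add the results
  # as new columns by reference.
  points[, (new_cols) := Map(`*`, .SD, multipliers), .SDcols = score.avail]
}
```

As for resetting the key, `setkey(points, NULL)` removes it, but that alone should not turn `points[score.avail]` into a column selection, since the first argument of `[` on a data.table always addresses rows.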

Trying to understand R structure: what does a dot in function names signify?

I am trying to learn how to use R. I can use it to do basic things like reading in data and running a t-test. However, I am struggling to understand the way R is structured (I have a very mediocre Java background).
What I don't understand is the way the functions are classified.
For example in is.na(someVector), is is a class? Or for read.csv, is csv a method of the read class?
I need an easier way to learn the functions than simply memorizing them randomly. I like the idea of things belonging to other things. To me it seems like this gives a language a tree structure which makes learning more efficient.
Thank you
Sorry if this is an obvious question; I am genuinely confused and have been reading/watching quite a few tutorials.
Your confusion is entirely understandable, since R mixes two conventions of using (1) . as a general-purpose word separator (as in is.na(), which.min(), update.formula(), data.frame() ...) and (2) . as an indicator of an S3 method, method.class (i.e. foo.bar() would be the "foo" method for objects with class attribute "bar"). This makes functions like summary.data.frame() (i.e., the summary method for objects with class data.frame) especially confusing.
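A small sketch of convention (2) may help. The generic `describe` and the class `invoice` are invented for this illustration and are not part of base R:

```r
# A generic function: UseMethod() dispatches on the class of its first argument.
describe <- function(x, ...) UseMethod("describe")

# describe.invoice is the "describe" method for objects of class "invoice".
describe.invoice <- function(x, ...) paste("invoice for", x$amount)

# describe.default is used when no class-specific method exists.
describe.default <- function(x, ...) "no special description"

inv <- structure(list(amount = 42), class = "invoice")

from_method  <- describe(inv)   # dispatches to describe.invoice
from_default <- describe(1:3)   # falls back to describe.default

print(from_method)
print(from_default)
```

Here the dot carries meaning only because `describe` calls `UseMethod()`. In a name like `read.csv` the dot is just part of the name.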
As @thelatemail points out above, there are some other sets of functions that repeat the same prefix for a variety of different options (as in read.table(), read.delim(), read.fwf() ...), but these are entirely conventional, not specified anywhere in the formal language definition.
dotfuns <- apropos("[a-z]\\.[a-z]")
dotstart <- gsub("\\.[a-zA-Z]+","",dotfuns)
head(dotstart)
tt <- table(dotstart)
head(rev(sort(tt)),10)
## as is print Sys file summary dev format all sys
## 118 51 32 18 17 16 16 15 14 13
(Some of these are actually S3 generics, some are not. For example, Sys.*(), dev.*(), and file.*() are not.)
Historically _ was used as a shortcut for the assignment operator <- (before = was available as a synonym), so it wasn't available as a word separator. I don't know offhand why camelCase wasn't adopted instead.
Confusingly, methods("is") returns is.na() among many others, but it is effectively just searching for functions whose names start with "is."; it warns that "function 'is' appears not to be generic"
Rasmus Bååth's presentation on naming conventions is informative and entertaining (if a little bit depressing).
extra credit: are there any dot-separated S3 method names, i.e. cases where a function name of the form x.y.z represents the x.y method for objects with class attribute z ?
answer (from Hadley Wickham in comments): as.data.frame.data.frame() wins. as.data.frame is an S3 generic (unlike, say, as.numeric), and as.data.frame.data.frame is its method for data.frame objects. Its purpose (from ?as.data.frame):
If a data frame is supplied, all classes preceding ‘"data.frame"’
are stripped, and the row names are changed if that argument is
supplied.
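The class stripping described in that help text can be seen in a short sketch. The class name `my_extra_class` is made up for illustration:

```r
# A data frame carrying an extra, invented class in front of "data.frame".
df <- data.frame(a = 1:2)
class(df) <- c("my_extra_class", "data.frame")

# There is no as.data.frame method for "my_extra_class", so dispatch falls
# through to as.data.frame.data.frame, which strips the preceding class.
stripped <- as.data.frame(df)

print(class(df))
print(class(stripped))
```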

Remove values from a dataset based on a vector of those values

I have a dataset that looks like this, except it's much longer and with many more values:
dataset <- data.frame(grps = c("a","b","c","a","d","b","c","a","d","b","c","a"), response = c(1,4,2,6,4,7,8,9,4,5,0,3))
In R, I would like to remove all rows containing the values "b" or "c" using a vector of values to remove, i.e.
remove<-c("b","c")
The actual dataset is very long with many hundreds of values to remove, so removing values one-by-one would be very time consuming.
Try:
dataset[!(dataset$grps %in% remove),]
There's also subset:
subset(dataset, !(grps %in% remove))
... which is really just a wrapper around [ that lets you skip writing dataset$ over and over when there are multiple subset criteria. But, as the help page warns:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
I've never had any problems, but the majority of my R code is scripting for my own use with relatively static inputs.
2013-04-12
I have now had problems. If you're building a package for CRAN, R CMD check will throw a NOTE if you use subset in this way in your code: it will wonder whether grps is a global variable, even though subset is evaluating it within dataset's environment (not the global one). So if there's any possibility your code will end up in a package and you feel squeamish about NOTEs, stick with Rcoster's method.
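For code that may end up in a package, the `[`-based approach can be wrapped in a small helper that takes the column name as a string, so no unquoted column names appear. This is only a sketch: the function name `drop_groups` and the variable `to_remove` are hypothetical (the latter is renamed from `remove` to avoid masking the base function of that name):

```r
# Keep only the rows whose value in `column` is not listed in `values`.
drop_groups <- function(data, column, values) {
  keep <- !(data[[column]] %in% values)
  data[keep, , drop = FALSE]
}

dataset <- data.frame(
  grps = c("a", "b", "c", "a", "d", "b", "c", "a", "d", "b", "c", "a"),
  response = c(1, 4, 2, 6, 4, 7, 8, 9, 4, 5, 0, 3)
)
to_remove <- c("b", "c")

kept <- drop_groups(dataset, "grps", to_remove)
print(kept)
```

Using `[[` with a string also sidesteps the partial matching that `$` would apply to the column name.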
