Do vectors in R have the same functionality as a set data type? - r

Someone with familiarity with other programming languages asked me if R had a set data type. Elements of R vectors are numbered and have an order so it seems to me that this distinguishes them from the set data type. However, any of the operations you might do on a set can be performed in R. For example, append(), subsetting (including for removing elements), sample() for something like enumerate, length() to determine size, %in% for "is an element of" and you can easily compare membership using things like intersect() and setdiff() and so forth.
Questions:
Does R have a specific set data type?
Can vectors perform the same kind of functions as a set data type?

I don't see this as appropriate for another site, since it is clearly about the R language and its supported data types. No, there is no set "data type", i.e. no class that behaves like the mathematical set construct, although there are functions that perform set-like operations: unique, %in%, setdiff, intersect, union. (Arguably the question could be thought of as off-topic because it is essentially a request for a package recommendation.) There is a package that implements a set class, and it is unsurprisingly named sets.
install.packages("sets")
library(sets)
help(package = "sets")
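As a rough illustration of the set-like functions mentioned above, here is a minimal base R sketch. The vectors a and b are made up for the example; note that the vectors themselves still keep order and duplicates, and it is the functions that treat them as sets.

```r
# Two ordinary vectors; 'a' contains a duplicate on purpose
a <- c(1, 2, 3, 3)
b <- c(3, 4)

u <- union(a, b)       # duplicates are dropped: 1 2 3 4
i <- intersect(a, b)   # elements found in both: 3
d <- setdiff(a, b)     # elements of a not in b: 1 2
m <- 2 %in% a          # membership test: TRUE

# setequal() ignores order and duplicates, like set equality
s <- setequal(c(1, 2), c(2, 1, 1))   # TRUE
```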

Related

Access columns of an R data.table as a matrix, by reference

I recall there is a special (undocumented?) form that allows accessing selected columns of a data.table by reference as a matrix (in so doing allocating no memory) but for the life of me I can't find my note on the subject. Something analogous to setDF, but yielding a matrix, not a data.frame. I understand this might come with certain dangers, and would like to be reminded of how to do this, and what the dangers are.
It is not possible, because matrix has different layout in memory than list/data.frame/data.table, and we don't have any mapping that allows that.
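The layout difference can be seen in base R with a plain data.frame, which, like a data.table, is a list of separate column vectors. This is only a small sketch with made-up data, not a statement about data.table internals:

```r
# A data.frame is a list of column vectors
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
is.list(df)            # TRUE

# A matrix is one atomic vector with a dim attribute,
# so as.matrix() has to build a new object
m <- as.matrix(df)
is.list(m)             # FALSE

# Changing the matrix leaves the original columns untouched
m[1, "a"] <- 100
df$a[1]                # still 1
```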

What is the Julia's best approximation to R objects' attributes?

I store important metadata in R objects as attributes. I want to migrate my workflow to Julia and I am looking for a way to represent at least temporarily the attributes as something accessible by Julia. Then I can start thinking about extending the RData package to fill this data structure with actual objects' attributes.
I understand that annotating with things like label or unit in a DataFrame (I think the most important use for object attributes) is probably going to be implemented in the DataFrames package some time (https://github.com/JuliaData/DataFrames.jl/issues/35). But I am asking about a more general solution that doesn't depend on this specific use case.
For anyone interested, here is a related discussion in the RData package
In Julia it is idiomatic to define your own types: you'd simply make fields in the type to store the attributes. In R, the nice thing about storing things as attributes is that they don't affect how the type dispatches, e.g. adding metadata to a Vector doesn't make it stop behaving like a Vector. In Julia, that approach is a little more complicated: you'd have to define the AbstractVector interface for your type https://docs.julialang.org/en/latest/manual/interfaces/#man-interface-array-1 to have it behave like a Vector.
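For reference, the R behaviour described here looks roughly like this. The attribute names "unit" and "label" are just examples, not a convention required by R:

```r
# A plain numeric vector with metadata attached as attributes
x <- c(1.5, 2.5, 3.5)
attr(x, "unit") <- "cm"
attr(x, "label") <- "length"

# The vector still behaves like an ordinary numeric vector
is.numeric(x)        # TRUE
avg <- mean(x)       # 2.5

# The metadata can be read back at any time
attr(x, "unit")      # "cm"
```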
In essence, this means that the workflow solutions are a little different - e.g. often the attribute metadata in R is used to associate metadata to an object when it's returned from a function. An easy way to do something similar in Julia is to have the function return a tuple and assign the result to a tuple:
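function ex()
    res = rand(5)                                  # the actual result
    met = "uniformly distributed random numbers"   # metadata describing it
    return res, met                                # return both as a tuple
end
result, metadata = ex()                            # destructure the tuple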
I don't think there are plans to implement attributes like in R.

Trying to understand R structure: what does a dot in function names signify?

I am trying to learn how to use R. I can use it to do basic things like reading in data and running a t-test. However, I am struggling to understand the way R is structured (I have a very mediocre Java background).
What I don't understand is the way the functions are classified.
For example in is.na(someVector), is is a class? Or for read.csv, is csv a method of the read class?
I need an easier way to learn the functions than simply memorizing them randomly. I like the idea of things belonging to other things. To me it seems like this gives a language a tree structure which makes learning more efficient.
Thank you
Sorry if this is an obvious question I am genuinely confused and have been reading/watching quite a few tutorials.
Your confusion is entirely understandable, since R mixes two conventions of using (1) . as a general-purpose word separator (as in is.na(), which.min(), update.formula(), data.frame() ...) and (2) . as an indicator of an S3 method, method.class (i.e. foo.bar() would be the "foo" method for objects with class attribute "bar"). This makes functions like summary.data.frame() (i.e., the summary method for objects with class data.frame) especially confusing.
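A small sketch of convention (2) may help. The generic foo and the class bar are hypothetical names, used only to show how method.class dispatch works:

```r
# A hypothetical S3 generic: UseMethod() looks for foo.<class>
foo <- function(x, ...) UseMethod("foo")

# The "foo" method for objects whose class attribute is "bar"
foo.bar <- function(x, ...) "foo method for class bar"

# Fallback used when no class-specific method exists
foo.default <- function(x, ...) "default foo method"

obj <- structure(list(), class = "bar")
r1 <- foo(obj)   # dispatches to foo.bar
r2 <- foo(1)     # falls back to foo.default
```

By contrast, calling is.na(NA) does not involve a generic called "is" with a method for a class "na"; there the dot is only a word separator.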
As @thelatemail points out above, there are some other sets of functions that repeat the same prefix for a variety of different options (as in read.table(), read.delim(), read.fwf() ...), but these are entirely conventional, not specified anywhere in the formal language definition.
dotfuns <- apropos("[a-z]\\.[a-z]")
dotstart <- gsub("\\.[a-zA-Z]+","",dotfuns)
head(dotstart)
tt <- table(dotstart)
head(rev(sort(tt)),10)
##      as      is   print     Sys    file summary     dev  format     all     sys
##     118      51      32      18      17      16      16      15      14      13
(Some of these are actually S3 generics, some are not. For example, Sys.*(), dev.*(), and file.*() are not.)
Historically _ was used as a shortcut for the assignment operator <- (before = was available as a synonym), so it wasn't available as a word separator. I don't know offhand why camelCase wasn't adopted instead.
Confusingly, methods("is") returns is.na() among many others, but it is effectively just searching for functions whose names start with "is."; it warns that "function 'is' appears not to be generic"
Rasmus Bååth's presentation on naming conventions is informative and entertaining (if a little bit depressing).
extra credit: are there any dot-separated S3 method names, i.e. cases where a function name of the form x.y.z represents the x.y method for objects with class attribute z ?
answer (from Hadley Wickham in comments): as.data.frame.data.frame() wins. as.data.frame is an S3 generic (unlike, say, as.numeric), and as.data.frame.data.frame is its method for data.frame objects. Its purpose (from ?as.data.frame):
If a data frame is supplied, all classes preceding ‘"data.frame"’
are stripped, and the row names are changed if that argument is
supplied.
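A minimal sketch of that behaviour, assuming a made-up extra class "my_class" placed in front of "data.frame":

```r
# A data frame with an extra (hypothetical) class in front
df <- data.frame(x = 1:3)
class(df) <- c("my_class", "data.frame")

# There is no as.data.frame.my_class, so dispatch falls through to
# as.data.frame.data.frame, which strips the preceding classes
plain <- as.data.frame(df)
class(plain)   # "data.frame"
```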

Remove values from a dataset based on a vector of those values

I have a dataset that looks like this, except it's much longer and with many more values:
dataset <- data.frame(grps = c("a","b","c","a","d","b","c","a","d","b","c","a"), response = c(1,4,2,6,4,7,8,9,4,5,0,3))
In R, I would like to remove all rows containing the values "b" or "c" using a vector of values to remove, i.e.
remove<-c("b","c")
The actual dataset is very long with many hundreds of values to remove, so removing values one-by-one would be very time consuming.
Try:
dataset[!(dataset$grps %in% remove),]
There's also subset:
subset(dataset, !(grps %in% remove))
... which is really just a wrapper around [ that lets you skip writing dataset$ over and over when there are multiple subset criteria. But, as the help page warns:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
I've never had any problems, but the majority of my R code is scripting for my own use with relatively static inputs.
2013-04-12
I have now had problems. If you're building a package for CRAN, R CMD check will throw a NOTE if you use subset in this way in your code: it will wonder if grps is a global variable, even though subset is evaluating it within dataset's environment (not the global one). So if there's any possibility your code will end up in a package and you feel squeamish about NOTEs, stick with Rcoster's method.
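For code that may end up in a package, the [ approach can be wrapped in a function without any non-standard evaluation. The helper drop_groups below is a hypothetical name, sketched only to show the idea:

```r
dataset <- data.frame(
  grps = c("a","b","c","a","d","b","c","a","d","b","c","a"),
  response = c(1,4,2,6,4,7,8,9,4,5,0,3)
)

# Hypothetical helper: drop rows whose value in 'column' is in 'values'.
# The column is passed as a string, so no bare variable names are used.
drop_groups <- function(data, column, values) {
  data[!(data[[column]] %in% values), , drop = FALSE]
}

kept <- drop_groups(dataset, "grps", c("b", "c"))
unique(kept$grps)   # only "a" and "d" remain
```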

Finding What You Need in R: focused searching within R and all (3,500+) CRAN Packages

Often in R, there are a dozen functions scattered across as many packages--all of which have the same purpose but of course differ in accuracy, performance, documentation, theoretical rigor, and so on.
How do you locate these--from within R and even from among the CRAN Packages which you have not installed?
So for instance: the generic plot function. Setting secondary ticks is much easier using a function outside of the base package:
minor.tick(nx=n, ny=n, tick.ratio=n)
Of course plot is in R core, but minor.tick is not, it's actually in Hmisc.
Of course, that doesn't show up in the documentation for plot, nor should you expect it to.
Another example: data-input arguments to plot can be supplied by an object returned from the function hexbin, again, this function is from a library outside of R core.
What would be great obviously is a programmatic way to gather these function arguments from the various libraries and put them in a single namespace?
*edit: (trying to re-state my example just above more clearly:) the arguments to plot supplied in R core, e.g., setting the axis tick frequency are xaxp/yaxp; however, one can also set a/t/f via a function outside of the base package, again, as in the minor.tick function from the Hmisc package--but you wouldn't know that just from looking at the plot method signature. Is there a meta function in R for this?*
So far, as i come across them, i've been manually gathering them, each set gathered in a single TextMate snippet (along with the attendant library imports). This isn't that difficult or time consuming, but i can only update my snippet as i find out about these additional arguments/parameters. Is there a canonical R way to do this, or at least an easier way?
Just in case that wasn't clear, i am not talking about the case where multiple packages provide functions directed to the same statistic or view (e.g., 'boxplot' in the base package; 'boxplot.matrix' in gplots; and 'bplots' in Rlab). What i am talking about is the case in which the function name is the same across two or more packages.
The "sos" package is an excellent resource. Its primary interface is the "findFn" command, which accepts a string (your search term), scans the "function" entries in Jonathan Baron's site search database, and returns the entries that contain the search term in a data frame (of class "findFn").
The columns of this data frame are: Count, MaxScore, TotalScore, Package, Function, Date, Score, Description, and Link. Clicking on "Link" in any entry's row will immediately pull up the help page.
An example: suppose you wanted to find all convolution filters across all 1800+ R packages.
library(sos)
cf = findFn("convolve")
This query will look up the term "convolve" in the function entries; in other words, the term doesn't have to be the function name.
Keying in "cf" returns an HTML table of all matches found (23 in this case). This table is an HTML rendering of the data frame i mentioned just above. What is particularly convenient is that each column ("Count", "MaxScore", etc.) is sortable by clicking on the column header, so you can view the results by "Score", by "Package Name", etc.
(As an aside: when running that exact query, one of the results was the function "panel.tskernel" in a package called "latticeExtra". I was not aware this package had any time series filters in it and i doubt i would have discovered it otherwise.)
Your question is not easy to answer. There is not one definitive function.
formals is the function that gives the named arguments to a function and their defaults in a named list, but you can always have variable arguments through the ... parameter and hidden named arguments checked with an embedded hasArg call. To get a list of those you would have to use getAnywhere and then scan the expression for hasArg. I can't think of an automatic way of doing it yourself. That is, if the function's hidden arguments are not documented.
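As a small sketch of what formals returns, using a made-up function f:

```r
# A hypothetical function with a default value and a ... parameter
f <- function(x, n = 10, ...) NULL

fa <- formals(f)
names(fa)              # "x" "n" "..."
fa$n                   # the default value, 10

# formals() shows that ... exists, but not which extra named
# arguments the body might look for inside it
has_dots <- "..." %in% names(fa)
```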
