x <- 1:10
str(x)
# int [1:10] 1 2 3 4 5 6 7 8 9 10
str(as.double(x))
# num [1:10] 1 2 3 4 5 6 7 8 9 10
str(as(x, 'double'))
# int [1:10] 1 2 3 4 5 6 7 8 9 10
I'd be surprised if there were a bug in R in something as basic as type conversion. Is there a reason for this inconsistency?
as is for coercing to a new class, and double technically isn't a class but rather a storage.mode.
y <- x
storage.mode(y) <- "double"
identical(x, y)
# [1] FALSE
identical(as.double(x), y)
# [1] TRUE
The argument "double" is handled as a special case by as: it attempts to coerce to the class numeric, and since the class integer already inherits from numeric, no change is made.
is.numeric(x)
[1] TRUE
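A quick sketch of the distinction, using the same x as above:

```r
x <- 1:10
class(x)                  # "integer"
is(x, "numeric")          # TRUE: integer counts as numeric here
typeof(as(x, "double"))   # "integer": as() left the type alone
typeof(as.double(x))      # "double": as.double() really coerces
```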
Not so fast...
While the above made sense, there is some further confusion. From ?double:
It is a historical anomaly that R has two names for its floating-point
vectors, double and numeric (and formerly had real).
double is the name of the type. numeric is the name of the mode and
also of the implicit class. As an S4 formal class, use "numeric".
The potential confusion is that R has used mode "numeric" to mean
‘double or integer’, which conflicts with the S4 usage. Thus
is.numeric tests the mode, not the class, but as.numeric (which is
identical to as.double) coerces to the class.
Therefore as should really change x according to the documentation... I will investigate further.
The plot is thicker than whipped cream and cornflour soup...
Well, if you debug as, you find out that what eventually happens is that the following method gets created, rather than using the c("ANY", "numeric") signature for the coerce generic (which would call as.numeric):
function (from, strict = TRUE)
    if (strict) {
        class(from) <- "numeric"
        from
    } else from
So actually, class<- gets called on x and this eventually means R_set_class is called from coerce.c. I believe the following part of the function determines the behaviour:
...
else if(!strcmp("numeric", valueString)) {
    setAttrib(obj, R_ClassSymbol, R_NilValue);
    if(IS_S4_OBJECT(obj)) /* NULL class is only valid for S3 objects */
        do_unsetS4(obj, value);
    switch(TYPEOF(obj)) {
    case INTSXP: case REALSXP: break;
    default: PROTECT(obj = coerceVector(obj, REALSXP));
        nProtect++;
    }
...
Note the switch statement: it breaks out without doing coercion in the case of integers and real values.
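You can observe that code path from R itself; a minimal sketch (the comments describe what the C switch above implies):

```r
x <- 1:10
class(x) <- "numeric"   # goes through R_set_class and the switch above
typeof(x)               # still "integer": INTSXP breaks out, no coercion
class(x)                # back to the implicit class "integer"
```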
Bug or not?
Whether or not this is a bug depends on your point of view. Integers are numeric in one sense, as confirmed by is.numeric(x) returning TRUE, but strictly speaking they are not of class numeric. On the other hand, since integers get promoted to double automatically on overflow, one may view the two types as conceptually the same. There are two major differences: (i) integers require less storage space, which may be significant for larger vectors; and (ii) when interacting with external code that has stricter type discipline, conversion costs may come into play.
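Point (i) is easy to check with object.size(); the exact byte counts include a small header and may vary by build:

```r
object.size(integer(1e6))  # roughly 4 MB: 4 bytes per integer
object.size(double(1e6))   # roughly 8 MB: 8 bytes per double
```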
as(x,"double"):
Methods are pre-defined for coercing any object to one of the basic datatypes. For example, as(x, "numeric") uses the existing as.numeric function. These built-in methods can be listed by showMethods("coerce").
These functions manage the relations that allow coercing an object to
a given class.
as.double(x):
as.double is a generic function. It is identical to as.numeric. Methods should return an object of base type "double". as.double creates, coerces to or tests for a double-precision vector.
Related
In newer versions of R (I have 3.6 and previously had 3.2), the stats::regularize.values function has been changed to have a default value of TRUE for warn.collapsing. This function is used in splinefun and several other interpolation functions in R. In a microsimulation model, I am using splinefun to smooth a large number (n > 100,000) of data points of the form (x, f(x)). Here, x is a simulated vector of positive-valued scalars, and f(x) is some function of x. With an n that large, there are often some replications of pseudo-randomly generated values (i.e., not all values of x are unique). My understanding is that splinefun gets rid of ties in the x values. That is not a problem for me, but, because of the new default, I get a warning message printed each time (below):
"In regularize.values(x, y, ties, missing(ties)) : collapsing to
unique 'x' values"
Is there a way to either change the default of the warn.collapsing argument of the stats::regularize.values function back to FALSE? Or can I somehow suppress that particular warning? This matters because it's embedded in a long microsimulation code, and when I update it I often run into bugs, so I can't just ignore warning messages.
I tried using the formals function. I was able to get the default arguments of stats::regularize.values printed, but when I tried to assign new values using the alist function it said there is no object 'stats'.
I had this problem too, and fixed it by adding ties=min to the argument list of splinefun().
The value of missing(ties) is now passed as warn.collapsing to regularize.values().
https://svn.r-project.org/R/trunk/src/library/stats/R/splinefun.R
https://svn.r-project.org/R/trunk/src/library/stats/R/approx.R
Also see:
https://cran.r-project.org/doc/manuals/r-release/NEWS.html
and search for regularize.values().
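For completeness: passing ties explicitly makes missing(ties) FALSE inside splinefun, so regularize.values() is told not to warn. A minimal sketch with deliberately duplicated x values:

```r
x <- c(1, 2, 2, 3, 4)              # note the tied x value at 2
y <- c(1, 4, 5, 9, 16)
f <- splinefun(x, y)               # warns: collapsing to unique 'x' values
g <- splinefun(x, y, ties = mean)  # no warning: ties handled explicitly
g(2.5)
```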
Referencing this article
Wrap your call of regularize.values like this:
withCallingHandlers(regularize.values(x), warning = function(w) {
    if (grepl("collapsing to unique 'x' values", w$message))
        invokeRestart("muffleWarning")
})
Working example (adapted from the above link to call a function):
f1 <- function() {
    x <- 1:10
    x + 1:3
}
f1()  # if we just call f1() we get a warning
# Warning in x + 1:3 :
#   longer object length is not a multiple of shorter object length
# [1] 2 4 6 5 7 9 8 10 12 11
withCallingHandlers(f1(), warning = function(w) {invokeRestart("muffleWarning")})
# [1] 2 4 6 5 7 9 8 10 12 11
I'm trying to understand what the C++ sizeof operator does when applied to an Rcpp vector. As an example:
library(Rcpp)
cppFunction('int size_of(NumericVector a) {return(sizeof a);}')
size_of(1.0)
# [1] 16
This returns the value 16 for any numeric or integer vector passed to it. The same is true of
cppFunction('int size_of(IntegerVector a) {return(sizeof a);}')
size_of(1)
# [1] 16
I thought that numerics in R were 8 bytes and integers 4 bytes, so what is going on here? The motivation is to use memcpy on Rcpp vectors, for which the size needs to be known.
Everything we pass from R to C(++) and return is a SEXP type -- a pointer to an S Expression.
So if we generalize your function and actually let a SEXP in, we can see some interesting things:
R> Rcpp::cppFunction('int size_of(SEXP a) {return(sizeof a);}')
R> size_of(1L) ## single integer -- still a pointer
[1] 8
R> size_of(1.0) ## single double -- still a pointer
[1] 8
R> size_of(seq(1:100)) ## a sequence ...
[1] 8
R> size_of(help) ## a function
[1] 8
R> size_of(globalenv) ## an environment
[1] 8
In short, you got caught between a compile-time C++ type analysis operator (sizeof) and the run-time feature that everything is morphed into the SEXP type. For actual vectors, you probably want the size() or length() member functions and so on.
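A minimal sketch of that size() member function (requires the Rcpp package; the function name n_elems is just for illustration):

```r
library(Rcpp)

# size() reports the number of elements, not the size of the pointer
cppFunction('int n_elems(NumericVector a) { return a.size(); }')

n_elems(c(1.0, 2.0, 3.0))  # 3
</imports>
```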
You would have to get into how NumericVector and IntegerVector are implemented to discover why they statically take up a certain number of bytes.
Based on your observation of the size of a "numeric" or "integer" in this context, it is likely that the value 16 accounts for any/all of the following:
Pointer to [dynamically-allocated?] data
Current logical size of container (number of elements)
Any other metadata
Ideally, don't use memcpy to transfer the state of one object to another, unless you are absolutely certain that it is a trivial object with only members of built-in type. If I have correctly guessed the layout of a NumericVector, using memcpy on it will violate its ownership semantics and thus be incorrect. There are other ways to copy R vectors.
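One such way, sticking with Rcpp, is clone(), which performs a deep copy with the correct semantics (a sketch, assuming Rcpp is available; copy_vec is a made-up name):

```r
library(Rcpp)

cppFunction('NumericVector copy_vec(NumericVector a) {
    NumericVector b = clone(a);   // deep copy, not a shared SEXP
    b[0] = -1;                    // modifying b leaves a untouched
    return b;
}')

v <- c(1, 2, 3)
copy_vec(v)  # -1 2 3
v            # 1 2 3, unchanged
```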
I am currently reading Hands-on Programming with R. The author wrote the following paragraph -
A class is a special case of an atomic vector. For example, the die
matrix is a special case of a double vector. Every element in the
matrix is still a double, but the elements have been arranged into a
new structure. R added a class attribute to die when you changed its
dimensions. This class describes die’s new format. Many R functions
will specifically look for an object’s class attribute, and then
handle the object in a predetermined way based on the attribute.
From what I have understood, shouldn't this statement be the other way around: a vector is a special case of a matrix, because its dimensions are Nx1 instead of NxM? Similarly, shouldn't a vector be a special case of a class, because a vector has a NULL class?
Why is it not the case?
What the author refers to (in a bad way, imho) is the internal representation of objects. They are all some type of "list" with extra bits of information that define how R deals with them.
Take for example a matrix. A matrix is a vector with an extra attribute called "dim". It is this attribute that makes it a matrix. Removing the attribute, shows the underlying vector structure:
> x <- matrix(1:10, ncol = 5)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> attributes(x)
$dim
[1] 2 5
> attr(x,"dim") <- NULL
> x
[1] 1 2 3 4 5 6 7 8 9 10
Data frames on the other hand, are special cases of a list. They are defined as S3 classes, again by an attribute. This time the attribute is called "class".
The S3 system is a very rudimentary implementation of OOP: there is no formal class definition, so the class of an instance is only defined by the attributes. Generic functions like print(), summary() and so on look at that class attribute, and search for the specific method for that class.
Note how the attributes are a named list with extra information on the object. In the case of a data frame, that's the row and column names next to the class attribute itself:
> class(iris)
[1] "data.frame"
> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
$row.names
[1] 1 2 3 4 5 ...
$class
[1] "data.frame"
> class(iris) <- NULL
> class(iris)
[1] "list"
Other instances of S3 classes are also defined by that attribute "class". If you do a linear model for example, the output is a list with a class attribute that makes it of the class "lm". Removing the class attribute leaves you with a named list.
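A quick illustration with lm() and the built-in cars data:

```r
fit <- lm(dist ~ speed, data = cars)
class(fit)            # "lm": just an attribute on a list
class(unclass(fit))   # "list": remove the attribute, the list remains
names(fit)            # "coefficients", "residuals", ...
```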
When talking about S4, things become a bit more complex. But again an S4 object is a list-like structure, where every slot is an "element" of that "list". Note that you can't simply remove an attribute to get back to a normal list, as you could with an S3 class. S4 is more strictly defined, and hence the general idea voiced by the author does not apply to S4 objects.
To answer your question about vector and matrix: a vector does not have dimensions in R. Or, more exactly, it does not have a dimension attribute. You can add one, but then you end up with a one-dimensional array. Arrays do behave very similarly to vectors, but not always. So a matrix is internally a vector with one small extra piece of information. I wouldn't call that "a special case of a vector", but it's true that a matrix is derived from a vector and not the other way around.
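The vector/array distinction is easy to see in code (note: class() reporting "array" for a one-dimensional array assumes R >= 4.0.0):

```r
v <- 1:5
dim(v)        # NULL: a plain vector has no dim attribute
a <- v
dim(a) <- 5   # add one: now a one-dimensional array
class(a)      # "array"
m <- v
dim(m) <- c(5, 1)
class(m)      # "matrix" "array": two dimensions make a matrix
```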
I want to compute the mean of the "Population" column of the built-in matrix state.x77. The code is:
apply(state.x77[,"Population"],2,FUN=mean)
#Error in apply(state.x77[, "Population"], 2, FUN = mean) :
# dim(X) must have a positive length
How can I prevent this error? If I use the $ sign:
apply(state.x77$Population,2,mean)
# Error in state.x77$Population : $ operator is invalid for atomic vectors
What is an atomic vector?
To expand on joran's comments, consider:
> is.vector(state.x77[,"Population"])
[1] TRUE
> is.matrix(state.x77[,"Population"])
[1] FALSE
So, your Population data is now no different from any other vector, like 1:10, which has neither columns nor rows to apply against. It is just a series of numbers with no more advanced structure or dimension. E.g.
> apply(1:10,2,mean)
Error in apply(1:10, 2, mean) : dim(X) must have a positive length
This means you can just use the mean function directly on the matrix subset you have selected, e.g.:
> mean(1:10)
[1] 5.5
> mean(state.x77[,"Population"])
[1] 4246.42
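If you want the means of all columns at once, apply does work on the full matrix (or use colMeans, which is faster):

```r
apply(state.x77, 2, mean)["Population"]  # apply over columns of the matrix
colMeans(state.x77)["Population"]        # same result, vectorized
```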
To explain 'atomic' vector more, see the R FAQ again (and this gets a bit complex, so hold on to your hat)...
R has six basic (‘atomic’) vector types: logical, integer, real,
complex, string (or character) and raw.
http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Vector-objects
So atomic in this instance is referring to vectors as the basic building blocks of R objects (like atoms make up everything in the real world).
If you read R's inline help by entering ?"$" as a command, you will find it says:
‘$’ is only valid for recursive objects, and is only
discussed in the section below on recursive objects.
Since vectors (like 1:10) are basic building blocks ("atomic"), with no recursive sub-elements, trying to use $ to access parts of them will not work.
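Lists, by contrast, are recursive objects, so $ does work there; a quick sketch:

```r
lst <- list(Population = c(3615, 365, 2212))  # a recursive object
lst$Population                # works on a list
df <- as.data.frame(state.x77)  # a data frame is a list of columns
df$Population                 # so $ works after conversion too
```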
Since your matrix (state.x77) is essentially just a vector with some dimensions, like:
> str(matrix(1:10,nrow=2))
int [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
...you also can't use $ to access sub-parts.
> state.x77$Population
Error in state.x77$Population : $ operator is invalid for atomic vectors
But you can access subparts using [ and names like so:
> state.x77[,"Population"]
Alabama Alaska Arizona...
3615 365 2212...
Can somebody explain to me what's going on here? When a variable is coded as a factor and nchar coerces it to a character, why can't that function effectively count the number of characters?
> x <- c("73210", "73458", "73215", "72350")
> nchar(x)
[1] 5 5 5 5
>
> x <- factor(x)
> nchar(x)
[1] 1 1 1 1
>
> nchar(as.character(x))
[1] 5 5 5 5
thanks.
It is because with a factor, your data is represented internally by the integer codes 1, 2, etc. What you mean to do is count the characters of the levels:
> nchar(levels(x)[x])
[1] 5 5 5 5
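To see the integer codes that nchar() was effectively measuring:

```r
x <- factor(c("73210", "73458", "73215", "72350"))
as.integer(x)   # 2 4 3 1: one-digit codes into the (sorted) levels
levels(x)       # "72350" "73210" "73215" "73458"
levels(x)[x]    # the original strings, recovered from the codes
```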
see the warning section of ?factor:
The interpretation of a factor depends on both the codes and the
‘"levels"’ attribute. Be careful only to compare factors with the
same set of levels (in the same order). In particular,
‘as.numeric’ applied to a factor is meaningless, and may happen by
implicit coercion. To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
nchar(levels(x))
The other answers are correct, I think, that the issue is that nchar is examining the underlying integer codes, not the labels. However, what I think most directly addresses your question is this piece from ?nchar:
The internal equivalent of the default method of as.character is
performed on x (so there is no method dispatch)
I'm not 100% sure, but I suspect this means that the coercion performed inside nchar is not the same as what happens when you call as.character directly; it most likely goes straight to the integer codes rather than "smartly" looking at the labels.