Undo a round in R

This is a curiosity and I highly doubt you can do what I am asking, because the concept is, well, silly. If I were to round something, can it be unrounded?
So:
x <- round(rnorm(10))
x
You have no idea what the original values were; can you get back to the original numbers generated by rnorm?
I ask because when I write functions for users I often include rounding arguments to make the display nicer, but I always give the user control of the digits, with independent control of rounding for list objects. That fills a function with digits= arguments very quickly. I would handle these arguments internally if I knew the user could somehow magically re-extract the original values. I could leave the digits as they are, assign a class and use a print method, but for a list that is a pain at best.

If you round the actual data itself, in general you cannot recover it. Instead you should change the display, using a custom print method or something like options(digits = 3). In the very particular case of random number generation, you could recover the original data if you first set the seed (set.seed), remembered it, and then re-generated the random data from the same seed.
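For the random-number case, a minimal sketch of the set.seed approach (the seed value 123 is arbitrary):
set.seed(123)            # remember (or store) the seed
x <- round(rnorm(10))    # only the rounded values are kept around
set.seed(123)            # reset to the same seed ...
x.orig <- rnorm(10)      # ... and regenerate the original, unrounded draws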

You could use sprintf to just modify how things get printed.
myfun <- function() {
  x <- rnorm(3)
  print(sprintf("%.3f", x))
  invisible(x)
}
out <- myfun()
#[1] "-0.527" "0.226" "-0.168"
out
#[1] -0.5266562 0.2262599 -0.1680460

Since I can't resist doing it the hard way...
x <- runif(100) * 10
z <- round(x, 2)   # the rounded values you would show
y <- x - z         # the residuals you keep
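Keeping the residuals y alongside the rounded values means the original data can be reconstructed whenever it is needed (a quick check):
all.equal(x, z + y)
#[1] TRUE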

Related

index by name or by position in list / vector, which is faster?

I am currently trying to optimise the speed of a physical model computation. The specificity of this model is that it uses hundreds of input parameters, all stored in a big named vector:
initialize = c("temperature"=100, "airpressure"=150, "friction"=0.46)
The model, while iterating hundreds of times, needs to access the parameters, possibly update them, and so on:
compute(initialize['temperature'], initialize['airpressure'])
initialize['friction'] <- updateP(initialize['friction'])
This is the logic. However, I wonder whether it is really efficient to work like this. What happens behind an indexation by name; is it fast? Some ideas for changing this logic:
define each parameter as an independent variable in the environment?
(but then how do I pass a large number of them as arguments to a function?)
have a list of parameters instead of a named vector?
access each parameter by its index in the vector, like this:
compute(initialize[1], initialize[2])
If I go with this last solution, of course I will lose the readability of the code (which parameter is actually initialize[1]?). So a way to go could be to define their positions first:
temperature.pos <- 1
airpressure.pos <- 2
compute(initialize[temperature.pos], initialize[airpressure.pos])
Of course, why didn't I just try this and test the speed? Well, it would take me hours to rewrite every parameter access in the script, which is why I ask before doing it.
And maybe there is an even more clever solution?
Thanks
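Since no timings appear above, here is a minimal sketch of how the lookup styles could be compared on a toy vector (it assumes the microbenchmark package is installed; the parameter names come from the question):
library(microbenchmark)
params <- c("temperature" = 100, "airpressure" = 150, "friction" = 0.46)
params.list <- as.list(params)
microbenchmark(
  by_name     = params["temperature"],
  by_position = params[1],
  list_name   = params.list[["temperature"]],
  times       = 10000
)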

Summing up list members step by step

I've got a very simple problem, but I was unable to find a simple solution in R, because I am used to solving such problems with an incrementing for loop in other languages.
Let's say I've got a randomly distributed numeric list like:
rand.list <- list(4,3,3,2,5)
I'd like to change this randomly distributed pattern into a constantly rising pattern, so the result would look like:
[4,7,10,12,17]
Try using Reduce with the accumulate parameter set to TRUE:
Reduce("+",rand.list, accumulate = T)
I hope this helps.
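Run on the example list, this produces the running totals directly:
Reduce("+", rand.list, accumulate = TRUE)
#[1]  4  7 10 12 17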
My first thought was to do cumsum(unlist(rand.list)), where unlist collapses the list into a plain vector. However, a lucky try shows that cumsum(rand.list) also works.
It is not entirely clear to me how this works, as the source of cumsum is just a call to .Primitive (it is an internal generic), which is not easy to investigate further. But I made another, complementary experiment as follows:
x <- list(1:2, 3:4, 5:6)
cumsum(x)  ## does not work
x <- list(c(1, 2), c(3, 4), c(5, 6))
cumsum(x)  ## does not work
In this case, we have to do cumsum(unlist(x)).
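As a quick illustration of the unlist route on that example:
cumsum(unlist(x))
#[1]  1  3  6 10 15 21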

What does the "by" argument in ffbase::as.character do?

In the post below,
aggregation using ffdfdply function in R
There is a line like this.
splitby <- as.character(data$Date, by = 250000)
Just out of curiosity, I wonder what the by argument means. It seems to be related to ff data frames, but I'm not sure. Google searches and the R documentation for as.character and as.vector provided no useful information.
I tried some examples, but the code below gives the same results.
d <- seq.Date(Sys.Date(), Sys.Date()+10000, by = "day")
as.character(d, by=1)
as.character(d, by=10)
as.character(d, by=100)
If anybody could tell me what it is, I'd appreciate it. Thank you in advance.
Since as.character.ff works using the default as.character internally, and in view of the fact that ff vectors can be larger than RAM, the data needs to be processed in chunks. The partitioning into chunks is handled by the chunk function; in this case the relevant method is chunk.ff_vector. By default, this calculates the chunk size by dividing getOption("ffbatchbytes") by the record size. However, this behaviour can be overridden by supplying the chunk size via by.
In the example you give, the ff vector will be converted to character 250000 members at a time.
The end result will be the same for any value of by, or without by at all. Larger values will lead to greater temporary use of RAM but potentially quicker operation.
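To make the idea concrete, here is a base-R illustration of converting a long vector in chunks of by elements at a time; it only mimics the behaviour described above and does not use the ff/ffbase internals:
convert_in_chunks <- function(x, by = 250000) {
  starts <- seq(1, length(x), by = by)
  out <- character(length(x))
  for (s in starts) {
    idx <- s:min(s + by - 1, length(x))
    out[idx] <- as.character(x[idx])   # one chunk at a time
  }
  out
}
d <- seq.Date(Sys.Date(), Sys.Date() + 10000, by = "day")
identical(convert_in_chunks(d, by = 1000), as.character(d))
#[1] TRUE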
First, that function is ffbase::as.character, not plain old base::as.character.
See http://www.inside-r.org/packages/cran/ffbase/docs/as.character.ff
which says
as.character(x, ...)
Arguments:
x: an ff vector
...: other parameters passed on to chunk
So the by argument is being passed through to some chunk function.
Then you need to figure out which package's chunk function is being used. Type ?chunk, tell us which one, then go read its doc to see what its by argument does.

Ordered Map / Hash Table in R

While working with lists I've noticed an issue that I didn't expect.
result5 <- vector("list", length(queryResults[[1]]))
for (i in 1:length(queryResults[[1]])) {
  id <- queryResults[[1]][i]
  result5[[id]] <- getPrices(id)
}
The problem is that after this code runs, instead of the result staying the same size (whatever length queryResults[[1]] has), it grows up to the largest index, creating a bunch of NULL entries in the middle.
result5 currently stores a number of int/double lists, so it looks like:
result5[[index(int)]][[row]][col]
While on its own it's not too problematic, I would rather avoid it, simply for easier size calculations later on.
For clarification, id is an integer. And in the given case the for loop offers the same performance as, but greater convenience than, the apply functions.
After some testing, it seems the easiest way of doing it is to convert it to a hash using the hash package:
result6 <- hash(queryResults[[1]],lapply(queryResults[[1]],getPrices))
And if it needs to be accessed, calling
result6[[toString(id)]]
The difference in performance is marginal, although it's still fairly annoying to have to include toString in your code.
It's not clear exactly what your question is, but judging by the structure of the loop, you probably want
result5[[i]] <- getPrices(id)
rather than result5[[id]] <- getPrices(id).
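A sketch of that fix combined with named entries, so the list keeps the original length and can still be looked up by id (getPrices and queryResults are the placeholders from the question):
ids <- queryResults[[1]]
result5 <- setNames(lapply(ids, getPrices), as.character(ids))
length(result5)                  # same length as ids, no NULL gaps
result5[[as.character(ids[1])]]  # look up a result by id, via its name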

Do you use attach() or call variables by name or slicing?

Many intro R books and guides start off with the practice of attaching a data.frame so that you can call the variables by name. I have always found it favorable to call variables with $ notation or square bracket slicing [,2]. That way I can use multiple data.frames without confusing them and/or use iteration to successively call columns of interest. I noticed Google recently posted coding guidelines for R which included the line
1) attach: avoid using it
How do people feel about this practice?
I never use attach. with and within are your friends.
Example code:
> N <- 3
> df <- data.frame(x1=rnorm(N),x2=runif(N))
> df$y <- with(df,{
x1+x2
})
> df
x1 x2 y
1 -0.8943125 0.24298534 -0.6513271
2 -0.9384312 0.01460008 -0.9238312
3 -0.7159518 0.34618060 -0.3697712
>
> df <- within(df,{
x1.sq <- x1^2
x2.sq <- x2^2
y <- x1.sq+x2.sq
x1 <- x2 <- NULL
})
> df
y x2.sq x1.sq
1 0.8588367 0.0590418774 0.7997948
2 0.8808663 0.0002131623 0.8806532
3 0.6324280 0.1198410071 0.5125870
Edit: Hadley mentions transform in the comments. Here is some code:
> transform(df, xtot=x1.sq+x2.sq, y=NULL)
x2.sq x1.sq xtot
1 0.41557079 0.021393571 0.43696436
2 0.57716487 0.266325959 0.84349083
3 0.04935442 0.004226069 0.05358049
I much prefer to use with to obtain the equivalent of attach on a single command:
with(someDataFrame, someFunction(...))
This also leads naturally to a form where subset is the first argument:
with(subset(someDataFrame, someVar > someValue),
someFunction(...))
which makes it pretty clear that we operate on a selection of the data. And while many modelling functions have both data and subset arguments, the use above is more consistent, as it also applies to functions that do not have data and subset arguments.
The main problem with attach is that it can result in unwanted behaviour. Suppose you have an object named xyz in your workspace. Now you attach the data frame abc, which has a column named xyz. If your code refers to xyz, can you guarantee whether it references the object or the data frame column? If you don't use attach then it is easy: xyz refers to the object, and abc$xyz refers to the column of the data frame.
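A minimal illustration of that ambiguity (the names xyz and abc are just the ones used above):
xyz <- 1:5                        # an object already in the workspace
abc <- data.frame(xyz = 101:105)  # a data frame with a column of the same name
attach(abc)   # warns that xyz is masked by .GlobalEnv
xyz           # still the workspace object, not the column
detach(abc)
abc$xyz       # always unambiguous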
One of the main reasons that attach is used frequently in textbooks is that it shortens the code.
"Attach" is an evil temptation. The only place where it works well is in the classroom setting where one is given a single dataframe and expected to write lines of code to do the analysis on that one dataframe. The user is unlikely to ever use that data again once the assignement is done and handed in.
However, in the real world, more data frames can be added to the collection of data in a particular project. Furthermore one often copies and pastes blocks of code to be used for something similar. Often one is borrowing from something one did a few months ago and cannot remember the nuances of what was being called from where. In these circumstances one gets drowned by the previous use of "attach."
Just like Leoni said, with and within are perfect substitutes for attach, but I wouldn't completely dismiss it. I use it sometimes, when I'm working directly at the R prompt and want to test some commands before writing them into a script. Especially when testing multiple commands, attach can be a more convenient and even harmless alternative to with and within, since after you run attach, the command prompt is clear for you to write inputs and see outputs.
Just make sure to detach your data after you're done!
I prefer not to use attach(), as it is far too easy to run a batch of code several times, each time calling attach(). The data frame is added to the search path each time, extending it unnecessarily. Of course, good programming practice is to also detach() at the end of the block of code, but that is often forgotten.
Instead, I use xxx$y or xxx[,"y"]. It's more transparent.
Another possibility is to use the data argument available in many functions which allows individual variables to be referenced within the data frame. e.g., lm(z ~ y, data=xxx).
While I, too, prefer not to use attach(), it does have its place when you need to persist an object (in this case, a data.frame) through the life of your program and have several functions using it. Instead of passing the object into every R function that uses it, I think it is more convenient to keep it in one place and call its elements as needed.
That said, I would only use it if I know how much memory I have available and only if I make sure that I detach() this data.frame once it is out of scope.
Am I making sense?
