What does the "by" argument in ffbase::as.character do? - r

In the post below,
aggregation using ffdfdply function in R
There is a line like this.
splitby <- as.character(data$Date, by = 250000)
Just out of curiosity, I wonder what by argument means. It seems to be related to ff dataframe but I'm not sure. Google search and R documentation of as.character and as.vector provided no useful information.
I tried some examples, but the calls below all give the same result.
d <- seq.Date(Sys.Date(), Sys.Date()+10000, by = "day")
as.character(d, by=1)
as.character(d, by=10)
as.character(d, by=100)
If anybody could tell me what it is, I'd appreciate it. Thank you in advance.

Since as.character.ff works using the default as.character internally, and since ff vectors can be larger than RAM, the data needs to be processed in chunks. The partitioning into chunks is handled by the chunk function; in this case, the relevant method is chunk.ff_vector. By default, this calculates the chunk size by dividing getOption("ffbatchbytes") by the record size. However, this behaviour can be overridden by supplying the chunk size using by.
In the example you give, the ff vector will be converted to character 250000 members at a time.
The end result will be the same for any by or without by at all. Larger values will lead to greater temporary use of RAM but potentially quicker operation.
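To illustrate why the result does not depend on by, here is a minimal base-R sketch of chunked conversion. The helper as_character_chunked is hypothetical and is not part of ff or ffbase; it only mimics the idea of converting a fixed number of elements at a time, using an ordinary in-memory Date vector.

```r
# Hypothetical helper (not the ffbase implementation): convert x to character
# in chunks of `by` elements, so that only one chunk is converted at a time.
as_character_chunked <- function(x, by) {
  n <- length(x)
  if (n == 0) return(character(0))
  out <- character(n)
  for (start in seq(1, n, by = by)) {
    idx <- start:min(start + by - 1, n)   # indices of the current chunk
    out[idx] <- as.character(x[idx])
  }
  out
}

d <- seq.Date(as.Date("2020-01-01"), by = "day", length.out = 1000)
small_chunks <- as_character_chunked(d, by = 10)
large_chunks <- as_character_chunked(d, by = 250)
same_result  <- identical(small_chunks, large_chunks)
```

Only the number of elements handled per step changes with by; the assembled output is the same.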

First, that function is ffbase::as.character, not plain old base::as.character
See http://www.inside-r.org/packages/cran/ffbase/docs/as.character.ff
which says
as.character(x, ...)
Arguments:
x: a ff vector
...: other parameters passed on to chunk
So the by argument is being passed through to some chunk function.
Then you need to figure out which package's chunk function is being used. Type ?chunk, tell us which one, then go read its doc to see what its by argument does.

Related

Need help understanding the 'rep()' function

rep (2,5)
rep
Hello everyone, I am learning R by watching a Udemy tutorial and following along. Recently I learned the seq() and rep() functions. However, when I run the code above I get an additional output: the code returns 2.2.2.2.2 and .Primitive("rep"). I am using Kaggle notebooks. Help me understand how this function works, what is going wrong here, and what will happen if we provide multiple inputs such as rep(2,3,4,5) or (1,2,3,4,6,8).
In R, rep is a function. It is designed to replicate its first argument a number of times equal to its second argument. Thus rep(2, 5) returns a vector of length 5 with each element as 2.
In R, functions are also objects, and when you enter a function's name on its own, R tries to be useful by printing the function, which shows that the input is a function and lists its expected arguments. The .Primitive("rep") part tells you that rep is a primitive function, part of the base R code.
rep
function (x, ...) .Primitive("rep")
In this case, rep requires at least one argument x, which is the object to be replicated. The ... indicates that it can take a number of other optional arguments. To learn about them, you can access the help file for rep with ?rep.
You can call rep with more arguments, but the behavior might not be what you expect.
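As a small sketch of those optional arguments, the calls below name each one explicitly (times, each and length.out are documented in ?rep). Naming them avoids relying on positional matching, which is where calls with several bare numbers tend to surprise people. The variable names are just for illustration.

```r
# Repeat the whole vector a given number of times
five_twos <- rep(2, times = 5)              # 2 2 2 2 2

# Repeat each element before moving on to the next one
each_twice <- rep(c(1, 2), each = 2)        # 1 1 2 2

# Recycle the vector until the result has the requested length
length_five <- rep(c(1, 2), length.out = 5) # 1 2 1 2 1
```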
By typing rep without any details, you are asking R to show you the internal "guts" of what the function does. You can learn more about it by typing ?rep. The manual is probably a lot for a beginner but if you scroll to the bottom you will see some useful examples.
I hope this helps:
rep("hi", 5) # repeat "hi" five times
rep(c("hi", "hello"), 3) # repeat the vector holding "hi" and "hello" three times
rep(c("hi", "hello"), c(1, 2)) # repeat "hi" once and "hello" two times

not error, but not results either in R

I am trying to write a function in R that calculates the mean of nitrate, sulfate and ID. My original data frame has 4 columns (date, nitrate, sulfate, ID), so I wrote the following code:
prueba<-read.csv("C:/Users/User/Desktop/coursera/001.csv",header=T)
columnmean<-function(y, removeNA=TRUE){ #y will be a matrix
whichnumeric<-sapply(y, is.numeric)#which columns are numeric
onlynumeric<-y[ , whichnumeric] #selecting just the numeric columns
nc<-ncol(onlynumeric) #lenght of onlynumeric
means<-numeric(nc)#empty vector for the means
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
}
columnmean(prueba)
When I run the same steps line by line on my data, outside the function, I get the mean values. However, when I call the function so that it performs all the steps itself, it raises no error but does not return any value either; my environment only contains the data frame 'prueba' and the columnmean function.
What am I doing wrong?
A reproducible example would be nice (although not absolutely necessary in this case).
You need a final line return(means) at the end of your function. (Some old-school R users maintain that means alone is OK - R automatically returns the value of the last expression evaluated within the function whether return() is specified or not - but I feel that using return() explicitly is better practice.)
colMeans(y[sapply(y, is.numeric)], na.rm=TRUE)
is a slightly more compact way to achieve your goal (although there's nothing wrong with being a little more verbose if it makes your code easier for you to read and understand).
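For reference, a version of the function with the return value added might look like the sketch below. The toy data frame stands in for the asker's CSV file, whose contents are not shown, so its values are made up purely for illustration; drop = FALSE is an extra safeguard, not something the question used, to keep the selection a data frame even if only one numeric column remains.

```r
column_mean <- function(y, remove_na = TRUE) {
  is_num <- vapply(y, is.numeric, logical(1))   # which columns are numeric
  only_numeric <- y[, is_num, drop = FALSE]     # keep a data frame in all cases
  means <- numeric(ncol(only_numeric))          # preallocated result vector
  for (i in seq_along(means)) {
    means[i] <- mean(only_numeric[[i]], na.rm = remove_na)
  }
  names(means) <- names(only_numeric)
  return(means)                                 # the line the original lacked
}

# Made-up data with the same column layout as described in the question
toy <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03"),
  nitrate = c(1, NA, 3),
  sulfate = c(2, 4, NA),
  ID      = c(1L, 1L, 1L)
)
result  <- column_mean(toy)
compact <- colMeans(toy[sapply(toy, is.numeric)], na.rm = TRUE)
```

Both result and compact hold one mean per numeric column.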
The result of an R function is the value of the last expression. Your last expression is:
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
It may seem strange that the value of that expression is NULL, but that's the way it is with for-loops in R. The means vector does get changed sequentially, which means that BenBolker's advice to use return(.) is correct (as his advice almost always is). For-loops in R are a notable exception to the functional programming paradigm. They provide a mechanism for looping (as do the various *apply functions), but the commands inside the loop exert their effects in the calling environment via side effects (unlike the apply functions).
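A quick way to see this for yourself is the sketch below; the two function names are made up for the demonstration.

```r
# The last expression is the for-loop, so the function's value is NULL
loop_only <- function() {
  total <- 0
  for (i in 1:3) total <- total + i
}

# Same loop, but the accumulated value is returned explicitly
loop_then_return <- function() {
  total <- 0
  for (i in 1:3) total <- total + i
  return(total)
}

value_without_return <- loop_only()        # NULL
value_with_return <- loop_then_return()    # 6
```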

What does the t in tapply stand for?

There seems to be general agreement that the l in "lapply" stands for list, the s in "sapply" stands for simplify and the r in "rapply" stands for recursively. But I could not find anything on the t in "tapply". I am now very curious.
Stands for table since tapply is the generic form of the table function. You can see this by comparing the following calls:
x <- sample(letters, 100, replace = TRUE)
table(x)
tapply(x, x, length)
although obviously tapply can do more than counting.
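As a small sketch of the "more than counting" part, the same call shape works with any summary function. The data here are made up for illustration.

```r
scores <- c(10, 20, 30, 40, 50, 60)
group  <- c("a", "b", "a", "b", "a", "b")

# Counting per group, equivalent to table(group)
counts_tapply <- tapply(scores, group, length)
counts_table  <- table(group)

# Any other function can be applied per group, e.g. the mean
group_means <- tapply(scores, group, mean)   # a = 30, b = 40
```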
Also, some references that refer to "table-apply":
R and S Plus companion
Modern Applied Biostatistical Methods
I think of it as 'table'-apply since the result comes as a matrix/table/array and its dimensions are established by the INDEX arguments. An R table-classed object is really very similar in construction and behavior to an R matrix or array. The application is being performed in a manner similar to that of ave. Groups are first assembled on the basis of the "factorized" INDEX argument list (possibly with multiple dimensions) and a matrix or array is returned with the results of the FUN applied to each cross-classified grouping.
The other somewhat similar function is 'xtabs'. I keep thinking it should have a "FUN" argument, but what I'm probably forgetting at that point is really tapply.
tapply is sort of the odd man out. As far as I know, and as far as the R documentation for the apply functions goes, the 't' does not stand for anything, unlike the other apply functions which indicate the input or output options.

Yet another apply question

I am totally convinced that an efficient R program should avoid loops whenever possible and use the big family of apply functions instead.
But this does not come without pain.
For example, I face a problem whose solution involves a sum in the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem.
Assume N = 100:
sapply(list(1:N), function(n) (
choose(n,(floor(n/2)+1):n) *
eps^((floor(n/2)+1):n) *
(1- eps)^(n-((floor(n/2)+1):n))))
As you can see, the function inside causes the length of the resulting vector to explode,
whereas using sum inside collapses everything to a single value:
sapply(list(1:N), function(n) sum(
choose(n,(floor(n/2)+1):n) *
eps^((floor(n/2)+1):n) *
(1- eps)^(n-((floor(n/2)+1):n))))
What I would like to have is a list with N elements.
So what do you think? How can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the /length/ of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
#Create output object. We're specifying the length in advance so that writing to
#it is cheap
output <- numeric(length = length(N))
#Start the for loop
for(i in seq_along(output)){
output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
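Applied to the question, a preallocated loop could look like the sketch below. The value of eps is an assumption (the question never defines it), tail_prob is a made-up helper name, and the loop runs over each n from 1 to N rather than over the single vector 1:N. The sapply call at the end shows the equivalent apply-style version for comparison.

```r
N   <- 100
eps <- 0.1   # assumed value; not given in the question

# Made-up helper: sum of the binomial terms for k = floor(n/2) + 1, ..., n
tail_prob <- function(n, eps) {
  k <- (floor(n / 2) + 1):n
  sum(choose(n, k) * eps^k * (1 - eps)^(n - k))
}

# Preallocate the output, then fill one element per n
output <- numeric(N)
for (n in seq_len(N)) {
  output[n] <- tail_prob(n, eps)
}

# Same computation with sapply: one value per n, so a vector of length N
output_sapply <- sapply(seq_len(N), tail_prob, eps = eps)
```

Iterating over seq_len(N) instead of list(1:N) is what gives one result per n; list(1:N) is a list with a single element, so the function is only called once.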

Undo a round in R

This is a curiosity and I highly doubt you can do what I am asking because the concept is, well, silly. If I were to round something, can it be unrounded?
So:
x <- round(rnorm(10))
x
If you have no idea what the original values were, can you get back to the original numbers generated by rnorm?
I ask because when I write functions for users I often put rounding arguments in them to make display better but I always give the user control of the digits and allow independent control of digit rounding for list objects. That makes a function full of digits= arguments really quickly. I would put these arguments in the function internally if I knew the user could somehow magically re-extract the original values. I could leave the digits as are, assign to a class and use a print method but for a list this is a pain at best.
If you round the actual data itself, in general you cannot recover it. Instead you should change the display using a custom print or try something like options(digits=3). In the very particular case of random number generation, you could recover the original data if you first set the seed (set.seed), remembered it and then re-generated the random data from the same seed.
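A minimal sketch of the seed approach, assuming you control the call that generates the numbers (the seed value 42 is arbitrary):

```r
set.seed(42)              # remember this seed
original <- rnorm(10)
rounded  <- round(original)

# Later: the rounded values alone are not enough, but the seed is
set.seed(42)
regenerated <- rnorm(10)

recovered <- identical(original, regenerated)   # TRUE
```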
You could use sprintf to just modify how things get printed.
myfun <- function(){
x <- rnorm(3)
print(sprintf("%.3f", x))
invisible(x)
}
out <- myfun()
#[1] "-0.527" "0.226" "-0.168"
out
#[1] -0.5266562 0.2262599 -0.1680460
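The custom print idea mentioned above can be sketched in a similar way with a small S3 class. The class name display_rounded and the constructor new_display are hypothetical; the point is only that rounding happens at print time while the stored values keep full precision.

```r
# Hypothetical constructor: stores the full-precision values together with
# the number of digits to show when printing
new_display <- function(value, digits = 2) {
  structure(list(value = value, digits = digits), class = "display_rounded")
}

# S3 print method: rounds for display only and returns the object unchanged
print.display_rounded <- function(x, ...) {
  print(round(x$value, x$digits), ...)
  invisible(x)
}

raw    <- c(1.23456, 2.34567, 3.45678)
obj    <- new_display(raw, digits = 2)
shown  <- capture.output(print(obj))   # what the user sees
stored <- obj$value                    # full precision is still available
```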
Since I can't resist doing it the hard way...
x<-runif(100)*10
z<-round(x,2)
y<-x-z
