Accumulate vs fold vs reduce vs compress - functional-programming

Are the functions accumulate, fold, reduce, and compress synonyms?

Well, it depends on the language. It is a common function with different names in different languages.
See: Wikipedia entry
But yes, it's commonly known by the names you mentioned, plus inject.
The Wikipedia entry has a more comprehensive list of its aliases in several languages.
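As a rough illustration of why the answer is "it depends on the language", here is a small Python sketch (Python is chosen only for the example; the helper name fold_left is made up). In Python, functools.reduce is the fold, while itertools.accumulate returns every intermediate result rather than only the final one. In C++, by contrast, std::accumulate is the plain fold.

```python
from functools import reduce
from itertools import accumulate
import operator

values = [1, 2, 3, 4]

# Fold/reduce: combine all elements into a single result, left to right.
total = reduce(operator.add, values, 0)

# Python's accumulate is a running fold: it yields each intermediate result.
running = list(accumulate(values, operator.add))


def fold_left(func, initial, iterable):
    """Hand-written left fold (hypothetical helper) showing what reduce does."""
    acc = initial
    for item in iterable:
        acc = func(acc, item)
    return acc


print(total)                               # 10
print(running)                             # [1, 3, 6, 10]
print(fold_left(operator.sub, 0, values))  # ((((0-1)-2)-3)-4) = -10
```

So the names overlap heavily, but the exact behaviour (single result versus running results, left versus right, with or without an initial value) should be checked per language.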

Related

Does R include an efficient implementation of sets?

Is there an efficient implementation of the set data structure in R?
In C++ I would use an std::set (which is implemented using red-black trees), in Python a set (which is implemented using hash tables), but I am not sure what I should use in R.
I have found this link, which describes some set operations, like union() and intersect(), that you can perform on vectors. So I guess that, since vectors are involved, the complexities would not be logarithmic, as they could be with the data structures mentioned above.
Fun fact, note how in this case the name of the language does not help, searching "r set" one finds many results concerning $\mathbb{R}$, and not the programming language :D
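To make the complexity concern concrete, here is a minimal Python sketch (Python only because the question mentions its set type). It says nothing about how R implements union() internally. The two function names are made up for the illustration: one computes a union by scanning a plain list, the other uses a hash set for the membership test.

```python
def vector_union(xs, ys):
    """Union by linear scans of the result list: quadratic in the worst case."""
    result = []
    for item in list(xs) + list(ys):
        if item not in result:      # linear search through what we have so far
            result.append(item)
    return result


def hashed_union(xs, ys):
    """Same result, but membership checks use a hash set (average O(1))."""
    seen = set()
    result = []
    for item in list(xs) + list(ys):
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


a = [3, 1, 2, 3]
b = [2, 5, 1]
print(vector_union(a, b))  # [3, 1, 2, 5]
print(hashed_union(a, b))  # [3, 1, 2, 5]
```

Both return the same elements in first-seen order; they differ only in how much work each membership test costs as the inputs grow.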

Are JuliaDB or DataFrame faster than plain Array?

I wonder if there's a difference in performance of plain Array versus JuliaDB or DataFrame to do calculations on huge data sets (large but still fit in memory)?
I can use plain arrays and algorithms to do sorting, grouping, reducing etc. So why do I need JuliaDB or DataFrame?
I kinda understand why Python needs Pandas: it translates slow Python into fast C. But why does Julia need JuliaDB or DataFrame? Julia is already fast.
This is a possibly broad topic. Let me highlight the features that are key in my opinion.
What are the benefits of DataFrames.jl or JuliaDB.jl over standard arrays
They allow you to store columns of data having different types. You can do the same in arrays, but then they have to be arrays of Any in general which will be slower and use up more memory than having data columns having concrete types.
You can access columns using names. However, this is a secondary feature - e.g. NamedArrays.jl provides an array-like type with named dimensions.
The additional benefit is that there is an ecosystem built on the fact that columns have names (e.g. joining two DataFrames or building a GLM model using GLM.jl).
This type of storage (heterogeneous columns with names) is a representation of a table in relational databases.
What is the difference between DataFrames.jl and JuliaDB.jl
JuliaDB.jl supports distributed parallelism; normal use of DataFrames.jl assumes that data fits into memory (you can work around this using SharedArray but this is not a part of the design) and if you want to parallelise computations you have to do it manually;
JuliaDB.jl supports indexing while DataFrames.jl currently does not;
Column types of JuliaDB.jl are stable and for DataFrames.jl currently they are not. The consequences are:
when using JuliaDB.jl each time a new type of data structure is created all functions that are applied over this type have to be recompiled (which for large data sets can be ignored but when working with many heterogeneous small data sets can have a visible performance impact);
when using DataFrames.jl you have to use special techniques ensuring type inference to achieve high performance in some situations (most notably barrier functions as discussed here).

Is there performance difference between NumericVector and vector<double>?

Suppose one uses NumericVector and the other uses vector<double> in their Rcpp code. Is there a notable difference between the two usages, especially in performance?
Generally, yes.
All of the Rcpp(11) types are "thin proxy objects" (which we talk about in several places, talks, slide decks, my book, ...) around the underlying SEXP objects. That means no copies are made when you go from R to C++, and when you go back from C++ to R.
Using standard C++ types like std::vector<T>, however, generally requires a copy.
So you should easily see a difference on some trivial test script as N increases enough.
Personally speaking, I generally like the "clean" use of C++ / STL types for code that "feels more C++-ish" but remain aware of the performance penalty. Often it does not really matter as the C++ solution is faster than what you replace in a pure R solution.
But your question is if one dominates the other, and the answer is a clear yes.
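The proxy-versus-copy distinction can be sketched by analogy in Python (this is not Rcpp code, and the variable names are made up): a memoryview wraps an existing buffer without copying it, much like a thin proxy, whereas converting to a list copies every element.

```python
from array import array

# The "original" storage, standing in for the data owned by the caller.
data = array("d", [1.0, 2.0, 3.0])

# A thin proxy: shares the same memory, so no elements are copied.
view = memoryview(data)

# A conversion to another container type: every element is copied.
copied = list(data)

view[0] = 10.0     # writes through to the original buffer
copied[1] = 20.0   # only changes the copy

print(data[0])  # 10.0
print(data[1])  # 2.0
```

The copy costs time and memory proportional to the number of elements, which is why the difference grows with N.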

What is your preferred style for naming variables in R? [closed]

Which conventions for naming variables and functions do you favor in R code?
As far as I can tell, there are several different conventions, all of which coexist in cacophonous harmony:
1. Use of period separator, e.g.
stock.prices <- c(12.01, 10.12)
col.names <- c('symbol','price')
Pros: Has historical precedent in the R community, prevalent throughout the R core, and recommended by Google's R Style Guide.
Cons: Rife with object-oriented connotations, and confusing to R newbies
2. Use of underscores
stock_prices <- c(12.01, 10.12)
col_names <- c('symbol','price')
Pros: A common convention in many programming langs; favored by Hadley Wickham's Style Guide, and used in ggplot2 and plyr packages.
Cons: Not historically used by R programmers; is annoyingly mapped to '<-' operator in Emacs-Speaks-Statistics (alterable with 'ess-toggle-underscore').
3. Use of mixed capitalization (camelCase)
stockPrices <- c(12.01, 10.12)
colNames <- c('symbol','price')
Pros: Appears to have wide adoption in several language communities.
Cons: Has recent precedent, but not historically used (in either R base or its documentation).
Finally, as if it weren't confusing enough, I ought to point out that the Google Style Guide argues for dot notation for variables, but mixed capitalization for functions.
The lack of consistent style across R packages is problematic on several levels. From a developer standpoint, it makes maintaining and extending others' code difficult (esp. where its style is inconsistent with your own). From an R user standpoint, the inconsistent syntax steepens R's learning curve, by multiplying the ways a concept might be expressed (e.g. is that date casting function asDate(), as.date(), or as_date()? No, it's as.Date()).
Good previous answers so just a little to add here:
underscores are really annoying for ESS users; given that ESS is pretty widely used you won't see many underscores in code authored by ESS users (and that set includes a bunch of R Core as well as CRAN authors, exceptions like Hadley notwithstanding);
dots are evil too because they can get mixed up in simple method dispatch; I believe I once read comments to this effect on one of the R lists: dots are a historical artifact and no longer encouraged;
so we have a clear winner still standing in the last round: camelCase. I am also not sure if I really agree with the assertion of 'lacking precedent in the R community'.
And yes: pragmatism and consistency trump dogma. So whatever works and is used by colleagues and co-authors. After all, we still have white-space and braces to argue about :)
I did a survey of which naming conventions are actually used on CRAN, and it got accepted to the R Journal :) Here is a graph summarizing the results:
Turns out (no surprises perhaps) that lowerCamelCase was most often used for function names and period.separated names most often used for parameters. Using UpperCamelCase, as advocated by Google's R style guide, is really rare however, and it is a bit strange that they advocate that naming convention.
The full paper is here:
http://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf
Underscores all the way! Contrary to popular opinion, there are a number of functions in base R that use underscores. Run grep("^[^\\.]*$", apropos("_"), value = T) to see them all.
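For readers unfamiliar with the pattern, the grep call above keeps only names that contain an underscore and no dot. A rough Python equivalent of that filter, run on a small hand-picked sample of names rather than the real output of apropos("_"), might look like this:

```python
import re

# Hand-picked sample; in R the candidate list would come from apropos("_").
names = ["seq_along", "seq_len", "as.Date", "is.numeric_version", "R_system_version"]

# Same idea as "^[^\\.]*$": the whole name contains no "." character.
no_dot = re.compile(r"^[^.]*$")

underscore_only = [n for n in names if "_" in n and no_dot.match(n)]
print(underscore_only)  # ['seq_along', 'seq_len', 'R_system_version']
```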
I use the official Hadley style of coding ;)
I like camelCase when the camel actually provides something meaningful -- like the datatype.
dfProfitLoss, where df = dataframe
or
vdfMergedFiles(), where the function takes in a vector and spits out a dataframe
While I think _ really adds to the readability, there just seem to be too many issues with using .-_ or other characters in names. Especially if you work across several languages.
This comes down to personal preference, but I follow the google style guide because it's consistent with the style of the core team. I have yet to see an underscore in a variable in base R.
As I point out here:
How does the verbosity of identifiers affect the performance of a programmer?
it's worth bearing in mind how understandable your variable names are to your co-workers/users if they are non-native speakers...
For that reason I'd say underscores and periods are better than capitalisation, but as you point out consistency is essential within your script.
As others have mentioned, underscores will screw up a lot of folks. No, it's not verboten but it isn't particularly common either.
Using dots as a separator gets a little hairy with S3 classes and the like.
In my experience, it seems like a lot of the high muckity mucks of R prefer the use of camelCase, with some dot usage and a smattering of underscores.
I have a preference for mixedCapitals.
But I often use periods to indicate what the variable type is:
mixedCapitals.mat is a matrix.
mixedCapitals.lm is a linear model.
mixedCapitals.lst is a list object.
and so on.
Usually I name my variables using a mix of underscores and mixed capitalization (camelCase). Simple variables are named using underscores, for example:
PSOE_votes -> number of votes for the PSOE (political group of Spain).
PSOE_states -> Categorical, indicates the state where PSOE wins {Aragon, Andalucia, ...}
PSOE_political_force -> Categorical, indicates the position between political groups of PSOE {first, second, third}
PSOE_07 -> Union of PSOE_votes + PSOE_states + PSOE_political_force at 2007 (header -> votes, states, position)
If my variable is the result of applying a function to one or two variables, I use mixed capitalization.
Example:
positionXstates <- xtabs(~states+position, PSOE_07)

Applicative programming and common lisp types

I've just started learning Common Lisp--and rapidly falling in love with it--and I've just moved onto the type system. I seem to be developing a particular fondness for applicative programming.
As I understand it, in CL strings and lists are both sequences, but there don't seem to be any standard functions for mapping over a sequence, only lists. I can see why they would be supplied for lists, what with them being the fundamental datatype and all, but why was it not designed to work with sequences? As they are a more general type, it would seem more useful to target applicative functions at them rather than lists. Or am I completely misunderstandimatifying how it works?
Edit:
What I was feeling particularly confused about was the way that sequences -- the abstraction -- and lists -- an implementation -- seem to be muddled up in CL. The consensus seems to be that this is for historical reasons; lisp has been around so long that you can pretty much map out the development of software engineering practices through its functions and macros; which functions apply to sequences and which to lists seems arbitrary at first glance because CL has a mixture of pre-sequence-abstraction functions that operate only on lists, and functions that do the same thing in a more general way on sequences. As someone who is just learning CL at the moment, I think it would be useful if authors introduced sequences first as the cleaner abstraction, and then brought in lists as the most fundamental implementation of that abstraction. Lists would still be needed as syntax of course, but by the time it is necessary to state this explicitly many readers would have worked this out by themselves, which would be quite an ego boost when starting out.
Why, there are a lot of functions working on sequences. Mapping over a sequence is done with MAP or MAP-INTO.
Look at the sequences section of the CLHS to find out more.
There is also a quick reference that is nicely organized.
Well, you are generally correct. Many functions do indeed focus on lists (mapcar, append etc.) For a few of these there are equivalent functions for sequences (concatenate, some and every come to mind), and some, where the list-equivalent is outdated (e.g. nth for lists only vs. elt for all sequences). Some functions simply work on sequences (length, find, count and remove, for example).
CL is a bit of a mess. It's a big language, as in huge. Over 700 functions, AFAIK. And it's old. Some of these functions are deprecated by convention, and others are rarely, if ever, used.
Yes, it would be more useful to have mapping functions be methods, that applied as intended on all sequences. CL was simply not built that way. If it were to be built again today, I'm sure this would be considered, and it would look very different.
That said, you are not left completely in the cold. The loop macro works on sequences, as does iterate (a separate looping macro, which I happen to like more). This will get you far. For most practical purposes you will be using lists, and this won't be more than a pragmatic problem. If you do happen to lack a mapping function for vectors (or sequences in general), who's to stop you from writing it?
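As a sketch of how little such a generic mapping function needs, here is a Python analogue (not Common Lisp, and map_sequence is a made-up name). Like CL's MAP, it takes the desired result type as its first argument and stops at the shortest input sequence.

```python
def map_sequence(result_type, func, *sequences):
    """Apply func element-wise and build a sequence of the requested type."""
    mapped = [func(*args) for args in zip(*sequences)]
    if result_type is str:
        # Strings are built by joining, not by calling str() on a list.
        return "".join(mapped)
    return result_type(mapped)


print(map_sequence(list, lambda x: x + 1, (1, 2, 3)))           # [2, 3, 4]
print(map_sequence(tuple, lambda x, y: x * y, [1, 2], [3, 4]))  # (3, 8)
print(map_sequence(str, str.upper, "lisp"))                     # LISP
```

The point is only that the abstraction is cheap to add where the standard library lacks it. In CL itself, MAP already covers this case, as noted earlier in the thread.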
