If one is building a substantial, organization-wide code base in R, is it acceptable practice to rely on the sqldf package as the default approach for data munging tasks? Or is it best practice to rely on operations with R-specific syntax where possible? By relying on sqldf, one introduces a substantial amount of a different syntax, SQL, into the R code base.
I'm asking this question with specific regard to maintainability and style. I've searched existing R style guides and did not find anything on this subject.
EDIT: To clarify the workflow I'm concerned with, consider a data munging script making ample use of sqldf as follows:
library(sqldf)
gclust_group <- sqldf("SELECT clust, SUM(trips) AS trips2
                       FROM gclust
                       GROUP BY clust")

gclust_group2 <- sqldf("SELECT g.*, h.Longitude, h.Latitude, h.withinss, s.trips2
                        FROM highestd g
                        LEFT JOIN centers h ON g.clust = h.clust
                        LEFT JOIN gclust_group s ON g.clust = s.clust")
A script like this could continue for many lines. (For those familiar with Hadoop and Pig, the style is actually similar to a Pig script.) Most of the work is done using SQL syntax, albeit with the benefit of avoiding complex subqueries.
Write functions. Functions with clear names that describe their purpose. Document them. Write tests.
Whether the functions contain sqldf calls, use dplyr, use bare R code, or call Rcpp is irrelevant at that level.
But if you want to try changing something from sqldf to dplyr, the important thing is that you have a stable platform on which to experiment, which means well-defined functions and a good set of tests. Maybe there's a bottleneck in one function that might run 100x faster if you do it with dplyr? Great, you can profile and test the code with both.
You can even branch your code and have a sqldf branch and a dplyr branch in your revision control system (you are using an RCS, right?) and work in parallel until you get a winner.
From a maintainability perspective, it honestly doesn't matter that you are introducing other bits of syntax into your R code, as long as your codebase is well documented and tested.
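To make that concrete, here is a minimal sketch of the idea applied to the aggregation from the question; the function, documentation, and test names are mine, not the answerer's:

# Sum trips per cluster. The sqldf body can later be swapped for dplyr
# without changing any caller.
library(sqldf)

#' Total trips per cluster.
#' @param gclust data frame with columns clust and trips
#' @return data frame with columns clust and trips2
trips_per_cluster <- function(gclust) {
  sqldf("SELECT clust, SUM(trips) AS trips2 FROM gclust GROUP BY clust")
}

# A dplyr drop-in candidate to profile against the sqldf version:
# trips_per_cluster <- function(gclust) {
#   dplyr::summarise(dplyr::group_by(gclust, clust), trips2 = sum(trips))
# }

# A small testthat check that either implementation must pass:
# testthat::test_that("trips are summed per cluster", {
#   toy <- data.frame(clust = c(1, 1, 2), trips = c(2, 3, 5))
#   testthat::expect_equal(trips_per_cluster(toy)$trips2, c(5, 5))
# })

With this in place, a sqldf branch and a dplyr branch differ only inside the function body, which is what makes the comparison cheap.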
I am trying to learn how to use R/RStudio for a project. Some of the initial tasks I will be using R for are described below, and I would be very grateful for a resource that teaches me how to perform them.
I have a column of unique identifiers in one Excel document (document A), i.e. a, b, and c. For each of these identifiers I have another Excel document with the same name as the identifier. So for each unique ID, I want to look up the spreadsheet with the matching name and, from that spreadsheet, retrieve the first and final values in a certain column, as well as the mean and maximum values of that column.
I am interested in finding a resource that will teach me to do all this and more, and I don't mind investing time to learn, i.e. I am not in a rush to do this.
After this step, I have something more complicated I want to do.
I have another document (document B) where I have a column of identifiers, but the identifiers are repeated multiple times. So again, using the first document with the list of identifiers, I want to search through document B and retrieve values from the rows where the identifier is mentioned for the first and last time in the column.
If you have a resource I can study to learn to do all this and more I would be very grateful. Thank you.
R offers multiple ways to do what you want, and once you have understood the basics you will probably find it easy to implement a solution for the tasks you described.
Besides learning the R basics, I'd also suggest looking at the tidyverse collection of packages. Its dplyr package offers an easy-to-write and easy-to-read way of structuring code and, together with tidyr, almost all the functions you'll ever need for your day-to-day data wrangling (including the tasks mentioned in your question).
An Introduction to R - CRAN: the official intro to the basics of R. While you would probably use alternative solutions to many of the examples there, I think it's very useful to have read the basics at least once.
tidyverse: here you will find links (by clicking on the icons) to the tidyverse packages, notably ggplot2, probably the most widely used plotting package in R, the aforementioned dplyr and tidyr, and readxl, a package for reading data from Excel files.
Just to give you a glimpse of what lies ahead, the workflow to solve the tasks from your question could look something like the following (a rough sketch is shown after the list):
Read data from the Excel file with the unique identifiers using readxl::read_excel
Loop through the identifiers and load the corresponding files
Use dplyr::summarise (or dplyr::mutate) with mean, max, dplyr::first, and dplyr::last to compute the required values per file
Proceed similarly for document B, maybe using dplyr::group_by together with dplyr::first and dplyr::last
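A rough sketch of that workflow; the file names documentA.xlsx and documentB.xlsx and the column names id and value are placeholders, so substitute your own:

library(readxl)
library(dplyr)
library(purrr)

# 1. Read the unique identifiers from document A
ids <- read_excel("documentA.xlsx")$id

# 2. For each identifier, read the file of the same name and summarise one column
per_id <- map_dfr(ids, function(i) {
  d <- read_excel(paste0(i, ".xlsx"))
  summarise(d,
            id        = i,
            first_val = first(value),
            last_val  = last(value),
            mean_val  = mean(value),
            max_val   = max(value))
})

# 3. For document B, keep the rows where each identifier appears first and last
doc_b <- read_excel("documentB.xlsx")
first_last <- doc_b %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup()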
After reading the convincing book R for Data Science I was excited about all the tidyverse functions, especially the transformation and data wrangling components dplyr and tidyr. It seemed that coding with those saves a lot of time and results in better readability compared to base R. But the more I use dplyr, the more I encounter situations where the opposite seems to be the case. In one of my last questions I asked how to replace rows with NAs if one of the variables exceeds some threshold. In base R I would simply do
df[df$age > 90, ] <- NA
The two answers suggested using
df %>% select(x, y, age) %>% mutate_all(~replace(.x, age > 90, NA))
# or
df %>% mutate_all(function(i) replace(i, .$age > 90, NA))
Both answers are great and I am thankful for them. Still, the code in base R seems much simpler to me. Now I am facing another situation where my dplyr code is much more complicated, too. I am aware that whether some code is complicated is a subjective impression, but to put it more objectively I would say that nchar(dplyr_code) > nchar(base_code) in many situations.
Further, I noticed that I seem to encounter this more often when the code I need to write operates on rows rather than on columns. It can be argued that one can use tidyr from the tidyverse to transpose the data in order to change rows to columns. But even doing this seems much more complicated in the tidyverse framework than in base R (see here).
My question is whether I am facing this problem because I am quite new to the tidyverse, or whether it is the case that coding with base R is more efficient in some situations. If the latter is the case: are there resources that summarize, at an abstract level, when it is more efficient to code with base R versus the tidyverse, or can you name some situations? I am asking because sometimes I spend quite some time figuring out how to solve something with the tidyverse, only to notice in the end that base R is much more convenient in that situation. Knowing when to use the tidyverse and when to use base R for data wrangling and transformation would save me a lot of time.
If this question is too broad, please let me know and I will try to rephrase or delete the question.
If you have a clean, readable and functioning solution in base R that seems more appropriate, why would you go for an additional layer? Perhaps to keep the same interface (pipes) within a script, to improve readability? But as you argue, this is not always guaranteed with the tidyverse compared to base R.
A main difference is:
Base R is highly focused on stability, whereas the tidyverse will not guarantee this. From their own documentation: "tidyverse will make breaking changes in the search for better interfaces" (https://tidyverse.tidyverse.org/articles/paper.html).
This makes base R in some cases a better partner for production environments, as you may find tidyverse functions being deprecated or changed over time. I myself prefer as few dependencies as possible in packages.
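To make the stability point concrete, here is a small sketch (mine, not the answerer's), with df standing in for the data frame from the question: the base idiom has not changed, while the dplyr idiom has already shifted once, with mutate_all() superseded by across() as of dplyr 1.0.

library(dplyr)

# Base R, unchanged for decades:
df_base <- df
df_base[df_base$age > 90, ] <- NA

# Current dplyr idiom (dplyr >= 1.0); the mutate_all() form in the question is now superseded:
df_tidy <- df %>%
  mutate(across(everything(), ~ replace(.x, age > 90, NA)))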
This is my first question, so apologies if this isn't perfectly appropriate. A colleague of mine recently sent me a script where she loaded all of her packages using lapply, instead of repeated calls to library, e.g.
packages_list <- c("dplyr", "ggplot2")
lapply(packages_list, library, character.only = TRUE)
Instead, this is how I normally do it:
library(dplyr)
library(ggplot2)
Is using lapply computationally faster somehow? I can see how, for a script requiring a large number of packages, using lapply might save me from repeatedly typing out library, but with auto-complete this doesn't seem like a huge benefit.
Most of the people in my faculty are beginners with R (we are biologists), and many of my students don't even know how apply works... I think using lapply at the beginning of the script is possibly confusing for complete novices. I'm just curious whether you expert coders out there could provide more illuminating input. Thank you!
There's no speed advantage in using:
packages_list <- c("dplyr", "ggplot2")
lapply(packages_list, library, character.only = TRUE)
... over:
library(dplyr)
library(ggplot2)
It's also less compact until the number of packages exceeds about 8.
You should give weight to @BenBolker, who is a highly valued contributor, not just to SO but also to the R-help and R-SIG-ME mailing lists. (More opinion follows.) The preferred approach would be to write the script in a text editor or an IDE that supports copying library() calls if a large number of packages are needed.
I have Revolution Enterprise. I want to run two simple but computationally intensive operations on each of 121k files in a directory, outputting to new files. I was hoping to use some RevoScaleR function that chunks/parallel-processes the data similarly to lapply. So I'd have lapply(list of files, function), but using a faster RevoScaleR (xdf-based) function that might actually finish, since I suspect basic lapply would never complete.
So is there a RevoScaleR version of lapply? Will running it from Revolution Enterprise automatically chunk things?
I see parLapply and mclapply (http://www.inside-r.org/r-doc/parallel/clusterApply)... can I run these using cores on the same desktop? AWS servers? Do I get anything out of running these packages in RevoScaleR if it's not a native xdf function? I guess this is then more a question of what I can use as a "cluster" in this situation.
There is rxExec, which behaves like lapply in the single-core scenario, and like parLapply in the multi-core/multi-process scenario. You would use it like this:
# vector of file names to operate on
files <- list.files()

# func carries out the operations you want on each file
func <- function(fname) {
    ...
}

rxSetComputeContext("localpar")
rxExec(func, fname = rxElemArg(files))
Here, func is the function that carries out the operations you want on the files, and you pass it to rxExec much like you would to lapply. The rxElemArg function tells rxExec to execute func on each of the different values of files. Setting the compute context to "localpar" starts up a local cluster of slave processes, so the operations will run in parallel. By default the number of slaves is 4, but you can change this with rxOptions(numCoresToUse).
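For example, to use eight local workers instead of the default four (eight is just an illustrative number):

rxOptions(numCoresToUse = 8)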
How much speedup can you expect to get? That depends on your data. If your files are small and most of the time is taken up by computations, then doing things in parallel can get you a big speedup. However, if your files are large, then you may run into I/O bottlenecks especially if all the files are on the same hard disk.
I am exploring parallel programming in R and I have a good understanding of how the foreach function works, but I don't understand the differences between parallel, doParallel, doMC, doSNOW, snow, multicore, etc.
After doing a bunch of reading, it seems that these packages work differently depending on the operating system, and I see some packages use the word multicore and others use cluster (I am not sure if those are different), but beyond that it isn't clear what advantages or disadvantages each has.
I am working on Windows, and I want to calculate standard errors using replicate weights in parallel so I don't have to calculate each replicate one at a time (if I have n cores I should be able to do n replicates at once). I was able to implement it using doSNOW, but it looks like plyr and the R community in general use doMC, so I am wondering if using doSNOW is a mistake.
Regards,
Carl
My understanding is that parallel is a conglomeration of snow and multicore, and is meant to incorporate the best parts of both.
For parallel computing on a single machine, I find parallel to have been very effective.
For parallel computing using a cluster of multiple machines, I've never succeeded in completing the cluster setup using parallel, but I have succeeded using snow.
I've never used any of the do* packages, so I'm afraid I'm unable to comment.
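For completeness, a minimal single-machine sketch with parallel that matches the Windows replicate-weights use case from the question; calc_replicate, n_replicates, and my_data are hypothetical placeholders for your own function, replicate count, and data:

library(parallel)

cl <- makeCluster(detectCores() - 1)       # PSOCK cluster, which also works on Windows
# clusterExport(cl, "my_data")             # export any objects the workers need
results <- parLapply(cl, seq_len(n_replicates), calc_replicate)
stopCluster(cl)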