Background (sorry it's so long):
I've been tasked with maintaining an ETL that collects a variety of online advertising data, around 20-30 MB a day, and appends it to tables in MySQL. Outside contractors built the ETL with Pentaho Spoon (kitchen, kettle?). The ETL consists of about 250 jobs and transformations (.ktr, .kjb), each with about 5 to 25 steps. Something very commonly goes wrong somewhere in this large process. I've found that writing R scripts to do the transform and load is much more efficient. In fact, I think the ETL could be reduced to well under 1000 lines of code besides the RMySQL calls (i.e. plyr!). Perhaps Python would be used to extract the data from the web.
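For a sense of what I mean, a single transform-and-load step might look something like the rough sketch below; the file, table, and column names are invented, but this is the general shape of it with RMySQL and plyr:

library(RMySQL)
library(plyr)

con <- dbConnect(MySQL(), dbname = "ads", host = "db-host",
                 user = "etl_user", password = "secret")

# One day's raw report, already extracted to disk
raw <- read.csv("daily_ad_report.csv", stringsAsFactors = FALSE)

# Aggregate clicks and spend per campaign per day
daily <- ddply(raw, .(campaign_id, report_date), summarise,
               clicks = sum(clicks),
               spend  = sum(spend))

# Append the result to the existing MySQL table
dbWriteTable(con, "campaign_daily", daily, append = TRUE, row.names = FALSE)
dbDisconnect(con)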
My use of R has led to some resistance. The computer programmers who designed the ETL don't know R, so they couldn't be called in if I leave, and moreover a lot of time was invested in the Spoon ETL. Also, a layman can follow the steps visually in Spoon more easily than in the R scripts. For my part, I think we are getting bogged down by the ETL. However, I don't have a large say in the matter, as I don't have a background in computer science.
Please comment if you have any insights on the following. Please know I have been researching this for months and have read many opinions, but nothing as concise or reliable as SO usually provides:
Some at the company have said R is not as scalable. I think the opposite, mostly because of the logging capabilities. Spoon has limited pure logging output, whereas the output of every R script can be sink()ed into a daily log. Fixing and avoiding mistakes in the .ktr files is very tedious, but it is easy when you can set flags and/or search through the R log. Any thoughts on this?
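For instance, every run of a script can be captured into a dated log with something like this minimal sketch (the directory and file-name pattern are just illustrative):

log_file <- file.path("logs", format(Sys.Date(), "etl_%Y-%m-%d.log"))
log_con  <- file(log_file, open = "a")
sink(log_con, type = "output")
sink(log_con, type = "message")   # also capture messages, warnings and errors

tryCatch({
  message("Run started at ", Sys.time())
  # ... transform and load steps go here ...
  message("Run finished at ", Sys.time())
}, error = function(e) message("RUN FAILED: ", conditionMessage(e)))

sink(type = "message")
sink()
close(log_con)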
This leads to a big-picture question. What is the point of ETL tools like Pentaho? This post, Do I need an ETL?, leads me to believe that if you use R or another so-called OOL, there is no reason to have a tool like Pentaho. Can someone please confirm whether that is so? I really need a second opinion here. If it is, who uses tools like Pentaho? Is it simply people without a programming background, or someone else? I do see a fair amount of Pentaho questions on SO.
It is true that a lot more people use R than Pentaho, right? This page, http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html, makes it look so. To be honest, I was surprised that Pentaho was 5th, which makes me doubly wonder who uses Pentaho and whether my doubts about its use in my work setting are misplaced.
Thanks for any responses. I don't mean any condescension towards Spoon or Spoon users; I am just really confused and in need of outside opinions.
R as an ETL tool? That's a new one, but whatever floats your boat.
I will say this, though: if you can get 250 jobs and transformations down to under 1000 lines of R, your ETL is poorly written.
Along with this you have to think about supportability and scalability, both of which I would imagine are far easier with a graphical tool like Spoon than with R code.
Personally I think you are misguided and the question you ask is poorly written, but that's a different argument.
Regarding your points, PDI's logging is very good and you can log pretty much however you like, even into one large database table if you want a consolidated log.
ETLs won't be going away, even with the advent of the love of unstructured data storage pools like HDFS. Also think about data analysis done outside R: if you want reporting or OLAP over the top of your data, it will still need transforming regardless.
Is it true that more people use R vs Pentaho? What sort of question is that? By Pentaho I assume you mean PDI? How can that ever be compared? A data analysis tool vs an ETL tool and you want to count users? Eh? If on the other hand you mean R vs Pentaho as a whole, then I would guess no. You are looking at a report on R vs Weka and making it fit your ETL argument. That doesn't wash in a month of Sundays.
==EDIT==
Okay, so you have around 1000 lines of R & Python code currently. As your bosses' requirements expand, this slowly grows over time, and because you are trying to hit deadlines the new code isn't written as cleanly or documented as well as the code you currently have in place. So over time this grows to, say, 5000 lines plus a few Python scripts. Then one day you get hit by a bus, and some new person has to come in and manage your code... where do they start? How do they make changes?
Virtually anyone with a modicum of data experience could make a change to a PDI ETL should they be required to, whereas it would take someone with enough in-depth R knowledge to make changes to what you have done.
ETL tools are designed to be quick and easy to use. They also offer far more than R can provide in terms of data connectivity to different systems (non-database or non-file-based sources, for example), although I guess this is why people resort to Python etc.
That said, there is room for both; there is an R plugin for PDI kicking around in the community that I've seen demonstrated.
On top of that, I've seen enough TSQL-to-ETL migrations over the years to know from experience that even though maintaining your ETL in code may seem practical in the short term, in the long term it just brings more pain.
On the other hand if you can code 250 PDI transformations down to 1000 lines of R, your ETL is likely bloated through bad design by your predecessor.
If you'd like me to give an opinion on your existing PDI ETL structure, that can also be arranged.
Tom
I'm writing to ask for some guidance on choosing a language and a course of action in learning programming. I apologize if this type of question is inappropriate for Cross Validated; please direct me to another forum if that is the case.
I've seen thread after thread with questions from newbies, asking, "What is the best language to start with?" and then it always starts a flame war or someone just answers, "There's no best language, it's best to pick one and start learning it." My question is a little bit more focused than that.
First off, I've been programming my whole life, in very limited capacities. My deepest training was in C++. Whilst in my EECS degree program, I resolved to never be a software developer because I couldn't stand not interacting with people for such long periods of time. Instead I realized I wanted to be a math teacher, and so that is the path I have taken.
But now that I'm well down that path, I've started to realize that perhaps I could develop my own software to help me in the classroom. If I want to demonstrate the Euclidean algorithm, what better way than to have a piece of software that breaks down the process? Students could run that software as part of their studies, and the advanced students might even develop programs for themselves. Or, with an iPad in hand, why not have an app that lets students take their own attendance? It would certainly streamline some of the needs of classroom management.
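For example, just as a rough sketch in R (one of the languages I've dabbled with), the kind of step-by-step breakdown I have in mind might look like this; the function is made up for illustration:

# Print every division step of the Euclidean algorithm so students can follow it
gcd_steps <- function(a, b) {
  while (b != 0) {
    cat(a, "=", a %/% b, "x", b, "+", a %% b, "\n")
    r <- a %% b
    a <- b
    b <- r
  }
  cat("gcd =", a, "\n")
  invisible(a)
}

gcd_steps(252, 198)
# 252 = 1 x 198 + 54
# 198 = 3 x 54 + 36
# 54 = 1 x 36 + 18
# 36 = 2 x 18 + 0
# gcd = 18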
There's obviously a lot of great stuff already out there for math, and for education, but I want a way to more directly create things specific to my lectures. If I'm teaching a specific way of calculating a percent, I want to create an app that aligns with my teaching style, not just another calculator app that requires the student to learn twice.
The most I use in class right now is iWork Numbers/Microsoft Excel for my stats class. Students can learn the basic statistical functions, and turn some of their data into graphs.
I have dabbled a bit with R, and used Maple in college. I've started the basic tutorials for OS X/iOS development and have actually made good progress on an OS X app that takes a text string, converts it to numbers, and performs encryption using modular addition and multiplication. I sometimes use Wolfram|Alpha to save myself some time in getting quick solutions to equations or base conversions. I know of MATLAB and Mathematica, and recently people have been telling me to check into Python or Ruby. I also know basic HTML, and, while it's forgotten now, learned JavaScript and Perl in college.
If I keep on the path of Obj-C/Cocoa, I think it will have great benefits. Unfortunately, anything I produced for Mac would only be usable on a Mac, so it wouldn't be universal for all of my students. Perhaps, then, learning a web language would be better. Second, I'm wondering whether, if the primary use is mathematical, my time would be better spent learning the Mathematica programming language, or R, or something based less on a GUI and more on simple coding of algorithms, maybe Python or Ruby?
It seems that Mathematica already has a lot of demos for different math concepts, so why reinvent the wheel is also a question I have. I think overall, it would be good to have more control and design things the way I need. And then, if I do want to make an "Attendance" app or something else, I would already have the programming experience to more easily design something for my iPad or MacBook.
A related question is: what is a good language to teach my students? In his TED talk, Conrad Wolfram says one of the best ways to check the understanding of a student is to have them write a program. But if Mathematica does the math virtually automatically for them, then I'm not sure that will give them the deeper experience of working out the logic for themselves, like you do when you're writing C or a traditional procedural language.
I know that programming takes time to learn, but I also know that at this point my goal is not to be able to make an app like "Tiny Wings." With the ease of the App Store, some of my work may become an extra revenue stream, but I see myself as more of a hobbyist, and now a teacher looking to software development specifically for its ability to help me demonstrate mathematical concepts.
I think I will push ahead with Obj-C/Cocoa for OS X/iOS, but if anyone has better guidance regarding all of the other available options, it would be much appreciated. I don't think I would want to go fully to the web (I like apps), but perhaps someone could suggest a nice way of bridging what I produce in Xcode to a universal web version. For example, if you come up with an algorithm in Obj-C, is it easiest to port it to Ruby and run it online, or is there another approach that works better?
Mathematica is pretty awesome for the first part of your question. I've used the interactive mode (Manipulate[]) for explaining things to my colleagues (and myself). It makes really nice dynamic figures and is fairly expressive (although your code can end up looking like line noise). It is very powerful, but it does far less for you than you might think. It's pretty intuitive, which is a good thing for teaching.
You could use Scala if you want an "easy" way to make a domain specific language for teaching. Python seems to confuse people as a first programming language. Objective C seems like a completely random choice to me.
Mathematica then. It's worth the price. But anything that is interpreted and has an interactive shell is probably better than a compiled language. BBC BASIC?
Nothing beats Haskell for general-purpose mathematical programming. The wiki's quite extensive and the IRC channel (#haskell on Freenode) is great for asking questions. If you statically link your binaries on compilation, you should be able to run your programs on just about any system (with a few exceptions, e.g., libgmp).
Haskell code reads (roughly) like mathematical notation once you get the hang of it, so it can really help to tie things together for your students who are motivated to write their own programs. The purely functional style can be beneficial, as well, since it focuses less on I/O and the marshalling of data (perfectly useful in applications, perhaps less so in pure math), and more on the actual creation and refinement of functions and algorithms. You can even compose functions just as you would on paper.
If you want to get really serious, you could also look into Coq or Agda, but those might be a bit much for most classes.
For a Haskell program idea for an educator, check out this link.
A nice list of arguments can also be found at:
Eleven Reasons to use Haskell as a Mathematician and the book The Haskell Road to Logic, Maths and Programming
This question is inspired by a remark of Duncan Murdoch's on the r-devel mailing list, in response to a bug report about Sweave:
This is fixed in R-patched. (It would have been fixed in 2.12.0 if more people tested the betas...).
Honestly, I've stayed away from beta (aka development) versions for a number of reasons, and these are reasons I hear from other people too:
I am a bit horrified it would somehow cause conflicts with my current R distribution. As I need it for work, having to repair it regularly would be a loss of time I can't explain to my boss.
I wouldn't have a clue how to test efficiently. I reckon every test I could come up with has already been run by the development team.
I still find it difficult to figure out when something is a bug, and when (most often) it is my own stupidity kicking in.
But as I understand it, testing would be a valuable contribution to the R community, and I'm willing to do my bit of the testing as well if I can fit it somehow into my own work. I was thinking of keeping the beta on the side and running my scripts through it as well, as a checkup. Saving the constructed objects allows a quick and easy all.equal() to see if something is wrong.
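Concretely, the checkup I have in mind is something like the sketch below (the paths and the object name are made up): run the same script under the release and under the beta, save the key objects, then compare them.

# After running the script under the release version:
save(results, file = "checks/results_release.RData")

# After running the same script under the beta/patched version:
save(results, file = "checks/results_patched.RData")

# Then load both copies into separate environments and compare:
release <- new.env(); load("checks/results_release.RData", envir = release)
patched <- new.env(); load("checks/results_patched.RData", envir = patched)
all.equal(release$results, patched$results)   # TRUE, or a description of the differences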
Does anybody have more/better ideas on how I could help with testing with a minimum amount of effort and a maximum amount of efficiency?
I'd also like to promote this a bit more on our department as well. Apart from the "It's time to give back to the community", any other good reasons why testing betas is worth the effort? How can I counter the arguments given above?
Edit:
As Dirk Eddelbuettel pointed out in the comments, part of the deal is keeping the path variables in Windows from causing conflicts. I have some ideas on that, but pointers on how to practically organize your computer for testing R-devel versions are greatly appreciated as well.
I fear you misunderstand. This may not be straightforward or obvious at first so maybe this helps:
"patched" is not "beta". Patched is what R 2.12.1 will be.
There is no conflict. It drops in for 2.12.0.
It is a separate download, and a nightly build available from here.
This is not r-devel but r-patched.
It is our duty as users to test pre-releases as well. So if anything, in an ideal world you would have R-patched installed --- as well as R-devel!
Testing can be as easy as installing another version, keeping it outside your path and then adjusting PATH and R_HOME dynamically from a script. Testing means running it on your code and data to prevent you from getting bitten by bugs once the new code is released.
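One simple alternative, if you prefer not to touch your normal PATH at all, is to call the patched build by its full path from a script; the install location below is just an example:

# Hypothetical install location; adjust to wherever you unpacked R-patched
rscript_patched <- "C:/R/R-patched/bin/Rscript.exe"

# Run an existing analysis script under R-patched and keep its output in a log
system2(rscript_patched, args = "daily_analysis.R",
        stdout = "patched_run.log", stderr = "patched_run.log")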
I wouldn't have a clue how to test efficiently. I reckon every test I could come up with has already been run by the development team.
I still find it difficult to figure out when something is a bug, and when (most often) it is my own stupidity kicking in.
The problem is, software is not (or not only) going to be used by developers. It is going to be used by people who may not have any programming knowledge at all (I'm speaking generally; this is valid for R as well as for any other software).
If the help, or the interface, or the general way the software is built does not give you enough information on how to do something, well, that is maybe not a bug, but it is something that can be improved (and pointed out to the devs).
Also, remember that the developers wrote the software. They know how to use it, and often they will be biased in testing it, mainly by using it correctly and seeing whether it gives the right result rather than by trying to break it.
By using it in YOUR way (which may possibly be "incorrect"), you are effectively running tests that may have escaped the developers, just because they were not thinking of using it like you do.
I've been using R for a little over a year now and it's been a successful venture. But all too often, I find that there is something that I can't figure out for lack of knowing how to find it or an example of it.
Stack Overflow,
Could you recommend a pathway for learning R in a manner that provides one with a toolset at their disposal to solve problems of a statistical nature?
There's a wealth of knowledge on the internet, between the r-project website and the mailing lists, but it seems to be "everywhere" and nowhere when you're actually looking for it.
For example, when I first started using R, I went through "Intro to R". Then I read the language definition (which obviously hasn't sunk in). But every time I ask a question on Stack Overflow I'm presented with some new badass function that is the solution to all my problems in the short term. My question is, how did you know these functions existed in the first place? And how does one go about finding them? Presumably you read something or found some resources that detoured your learning onto the exponential part of the curve. What was it?
Obviously, R's functionality as a statistical tool is broad. For my own purposes I work mostly with economic or financial data. Hence, answers with this in mind would be most helpful.
Completely biased response: learn plyr, reshape2 and ggplot2. They will cover 90% of your data manipulation and visualisation needs. All three packages have a consistent philosophy of data (which the ggplot2 book touches upon), and are designed to be consistent and easier to learn.
Rather than learning many specialised functions, I really encourage you to learn about simple functions that can be flexibly composed to solve a wide range of problems. This is what plyr strives to do for data manipulation, and what ggplot2 strives to do for visualisation. It does mean you need to invest more time up front to learn a little about the underlying theory, but it's my belief that it will pay off handsomely in the long run.
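As an illustration of that composition (the data frame and column names here are made up), a typical split-summarise-reshape-plot pass looks like this:

library(plyr)
library(reshape2)
library(ggplot2)

# Made-up daily returns for two tickers
prices <- data.frame(ticker = rep(c("AAA", "BBB"), each = 100),
                     day    = rep(1:100, 2),
                     ret    = rnorm(200, sd = 0.01))

# plyr: split by ticker, apply a summary, combine the results
summ <- ddply(prices, .(ticker), summarise,
              mean_ret = mean(ret),
              sd_ret   = sd(ret))

# reshape2: melt to long form so each statistic becomes a row
long <- melt(summ, id.vars = "ticker")

# ggplot2: one panel per statistic
ggplot(long, aes(ticker, value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ variable, scales = "free_y")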
This is the way I learned R.
R resources:
To learn R, the most important resource is Google. Search for: “TOPIC r-project”, “TOPIC filetype:r”, or “TOPIC site:nabble.com”.
Second, look at the example code provided with most packages. Go to “http://bm2.genes.nig.ac.jp/”, search for a topic and look at the example code. Run it and adapt it; this way you can often solve part of your problem.
Third: the r-help mailing list. Read the posts; the basic questions get asked over and over again. If you have a problem and you are completely stuck, ask a question on the mailing list.
Finally, look at the source code of the R packages. That's the hardest part. If you can alter the code to your needs, you have mastered R ;-)
Some Tips:
R has a steep learning curve. That's a feature ;-) It is designed to solve advanced problems, and in the end you are faster than when using an alternative to R.
Know every single R package and function that is relevant to your problem. The strength of R is that there are so many packages available (around 2000, I think). Usually there is a package that's better suited or that already solves your problem. (Some help pages are badly written and hard to understand; I got used to it.)
R books are not helpful in learning R. Yes, that's true. If you are an expert programmer and expert statistician, you don't need any book on R. (The only exception is Hadley Wickham's ggplot2 book.) If you are not, learn programming in general and/or advanced statistics.
Some R packages have known bugs which nobody will fix (the package owner left university, etc.). Just a warning: this can be tricky if you are looking for a bug in your code and the bug is in an R package.
I'll start with this:
My question is, how did you know these functions existed in the first place?
Simple - we tried to solve a similar problem and came across that function. It either suited or didn't suit our needs but we now know it's there. I haven't used R much personally but what you're describing is the learning curve for every programming language ever. Firstly, you learn the "grammar" i.e. what you can do. Then you try to do something. You find you can't.
At that stage a programmer has a number of options. What do I do personally? Depends. I'll try and look up that package/header/library/whatever's member functions to see if something suits my needs. I might Google it, because unless you're really pushing the boundaries someone somewhere has probably tried and failed to do it before and had their question answered. If you are pushing the boundaries, someone somewhere has probably tried and failed before, but got no answer. I might try a forum or two to see what happens. I personally don't use IRC much, but that's another option, as are mailing lists depending on how specialised the problem is.
I also have a folder on my computer full of books which I search through depending on the problem and a small library of books I look through/learnt from, which often contain practical, not-quite-there-but-adaptable examples.
My only comment would be that attempting to read the language specification is unlikely to be massively useful to you as a beginner. You won't fully understand what it means because you haven't pushed the bounds and tried things yet. For example, a novice in C might try this:
char c = '7';
int x = (int) c;   /* x is now 55, the ASCII code for '7', not the number 7 */
to convert the character '7' into integer form. It's not a bad thought process until you understand how characters and ASCII work; then you see why the above doesn't give you what you want.
In short, I think this is going to be part of the learning process, and I don't think you can cut it any shorter. The consolation is that, like any research, the more you do it the more you'll know where to look and what questions to ask in the various communities.
One of the things I do is follow the RSS feed of R questions on SO (https://stackoverflow.com/feeds/tag/r). Then I can browse what other people have asked/answered.
Often I will favourite a particular question/answer if I think I'll use it, or jot down the salient points into my notebook software (OneNote); occasionally I'll even try the question/answer out myself.
EDIT:
I'd also recommend Patrick Burns's book The R Inferno. It's not so much a training book as a description of all the gotchas and "oooh" moments Patrick has found (so far).
There's a free book you might be interested in: Introduction to Probability and Statistics Using R
Here is a good list of resources for learning R:
https://stats.stackexchange.com/questions/138/resources-for-learning-r
Also, that website in general is a good resource.
In general I would say that following a mailing list, or a help list is the best way I have found for learning new things. (That and the "R magazine": http://www.r-bloggers.com )
Learning the RODBC package to interact directly with Oracle data made a big impact at my job. My boss was amazed when I pulled Oracle data directly into R and cranked out a plot in only a few lines of code. Try doing that in Excel!
Moral of the story, learn how to pull in data and manipulate it within R. Then move to some of the cooler stuff like ggplot.
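For the curious, the whole round trip is roughly the sketch below; the DSN, credentials, query, and column names are all invented:

library(RODBC)

ch <- odbcConnect("ORACLE_DSN", uid = "reporting", pwd = "secret")
daily <- sqlQuery(ch, "SELECT TRUNC(order_date) AS order_day,
                              SUM(amount)       AS total_amount
                         FROM sales
                        GROUP BY TRUNC(order_date)
                        ORDER BY 1")
odbcClose(ch)

names(daily) <- tolower(names(daily))   # Oracle hands the aliases back in upper case

# One line to plot the daily totals
plot(total_amount ~ order_day, data = daily, type = "l",
     xlab = "Day", ylab = "Total")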
I can recommend Penn University's Introductory Course on R.
The ggplot chapter alone is worth reading - I found ggplot very confusing but this is a great explanation.
The book that helped my learning the most was The Art of R Programming. A lot of programming books can be dry. Since R is commonly an entry point to programming, it's important for the voice of the materials to resonate with the student. That book did just that with me. The voice felt very casual and I liked that.
Some interesting links:
Intro, links and examples: http://manuals.bioinformatics.ucr.edu/home/programming-in-r
A lot of documentation: https://en.wikibooks.org/wiki/R_Programming
R forum: http://r.789695.n4.nabble.com/
The [R] tag FAQ, right here on Stack Overflow, https://stackoverflow.com/questions/tagged/r?sort=frequent, provides numerous reproducible examples that one can use to "learn by doing".
Most of the problems are very common and will eventually be something that you will have to look up as a beginner. The FAQ also provides highly literate (and experienced) examples of usage for a diverse range of functions and useful packages.
If you're new to R and you prefer a more hands-on approach to learning, the FAQ should not be overlooked as a potential resource for learning. Many of the questions also provide useful discussion surrounding paradigms of the language itself (vectorization, workflow, and debugging are just a few examples).
Nearly every question in the FAQ is worth studying as a new user as it touches on elements that, speaking for myself, I wish I had been pointed to when I asked this question originally.
Just a few examples:
How to make a great R reproducible example
Grouping functions (tapply, by, aggregate) and the *apply family
Workflow for statistical analysis and report writing
How to sort a dataframe by multiple column(s)?
What is your favorite R debugging trick?
This question is not coming from a programmer (obviously). I currently have a programmer making a website for me, and I am realizing that he isn't going to completely work out.
He has already done quite a bit of work and the site is almost there, but I need someone better to take it the rest of the way. The site has been done in asp.net, and I am wondering how hard it would be for a more experienced programmer to take over and finish the work he has already done.
In general, is it hard for an asp.net programmer to come in towards the end of a project and fix what needs to be fixed?
There are five different pages on the site, with two overlays for sign-up and sign-in (five pages with many different versions). There is a database and client-side scripting. AJAX was also used. It's a site somewhat similar to SO, only not quite as complex and about something completely different. I would say think of something that falls somewhere between Stack Overflow and Craigslist. That's all I can say for now, as I don't know the technical words.
You'll probably find that the new programmer will want to rewrite most of the code from scratch. If you are on a tight deadline or a tight budget and can't accept a complete rewrite, then you will need to hire someone who is not just good at writing good code, but good at reading, refactoring, and improving bad code. Those are two completely different skill sets, and the second is much rarer. Depending on the quality of the existing code (and I'm assuming here that it is not good), your new programmer may end up rewriting much of the existing codebase just to understand what is going on.
It depends on how good the previous programmer was and on the complexity of the project. It might be anything from trivial (well-commented source, some high-level docs, unit tests, a modular or simple project) to "this crap needs a complete rewrite" (no docs, custom "let's try this" solutions, etc.). If you're not a developer, it might be really hard to tell. And other people won't be able to answer without more details.
I'm no asp.net expert, but I suspect the ease with which the replacement will be able to finish the project will depend mostly on just how bad a job the first programmer actually did. Bad code is painful to fix in any language. :)
A good idea would be to have them work together for, say, a week or two. This will help the new programmer get some much-needed training on your current system.
You may find that although the site is almost complete, the successor will have to spend more time than anticipated when performing alterations, as this person will not have the mental model of the software that the current developer has. Hence the need for the next developer to "re-write" the code base.
If you can, you'll want to ensure that the code base you have built is maintainable. That is, the solution is built in such a way that it can support alterations easily. As Mark Byers suggested, you'll want to get someone who can not only program but can also rework your existing code, the goal being that someone else will inevitably implement future changes. If the software is something you need to keep working for an extended period, you'll want to make the investment in ensuring that new functionality can be added easily.
Remember this experience described at The Daily WTF. Take appropriate precautions.
Generally, if the site is set up in some sort of standard fashion, then another programmer should be able to pick it up easily. If the existing programmer did things to obscure the code, then it will be hard for another programmer to pick it up. Basically, the question is: how readable is the code?
If the current programmer is unwilling to communicate the true status of the project in a professional, non-technical manner, then give him an ultimatum - your way or the highway. Odds are he will be more forthcoming if he knows you mean business. Make sure you have a copy of the latest code before broaching the subject.
It sounds like you are going to end up hiring someone else anyway, especially if you're asking these kinds of questions at this stage, so you might as well go for broke.
As Mark Byers said, it takes a seasoned developer to take someone else's code and resist the urge to "pretty it up" in order to bring the project to a working conclusion!