As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Background (sorry it's so long):
I've been tasked with maintaining an ETL that collects a variety of online advertising data, around 20-30 MBs a day, and appends it to tables in MySQL. Outside contractors built the ETL with Pentaho Spoon (kitchen, kettle?). The ETL consists of about 250 jobs and transformations (.ktr,.kjb), each with about 5 to 25 steps. It is very common that something is going wrong in this large process. I've found that writing R scripts to do the transform and load is much more efficient. In fact, I think the ETL could be reduced to well under 1000 lines of code besides calls with RMySQL (i.e. plyr!). Perhaps Python would be used to extract the data from the web.
My use of R has led to some resistance. The computer programmers that designed the ETL don't know R so couldn't be called if I leave, and moreover a lot of time was invested in the Spoon ETL. Also, a layman can more easily follow the steps visually in Spoon, than in the R scripts. For my part, I think we are getting bogged down by the ETL. However, I don't have a large say in the matter as I don't have a background in computer science.
Please comment if you have any insights on the following. Please know I have been researching this for months and have read many opinions, but nothing as concise or reliable as SO usually provides:
R has been called not as scalable by some at the company. I think the opposite mostly because of the logging capabilities. Spoon has limited pure logging output, whereas all R scripts can be sinked into a daily log. Fixing and avoiding mistakes in the .ktrs is very tedious, but easy with setting flags and/or searching through the R log. Any thoughts on this?
This leads to a big picture question. What is the point of ETLs like Pentaho? This post Do I need a ETL?, leads me to believe that if you use R or other so-called OOL, there is no reason to have a tool like Pentaho. Can someone please confirm this if so? I really need a second opinion here. If this is so who uses tools like Pentaho? Is it simply people without the programming background, or someone else? I do see a fair amount of Pentaho questions on SO.
It is true that a lot more people use R and than Pentaho, right? This http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html makes it look so. To be honest I was surprised that Pentaho was 5th, which makes me doubly wonder who uses Pentaho and if my doubts about it's use in my work setting are misplaced.
Thanks for any responses. I don't mean any condescension towards Spoon or Spoon users; I am just really confused and in need of outside opinions.
R as an ETL tool? Thats a new one, but whatever floats your boat.
I would say this though, if you can get 250 jobs and transformations down to under 1000 lines of R I would say your ETL is poorly written.
Along with this you have to think about supportability and scalability. Both of which I would imagine would be far easier with a graphical tool like Spoon rather than R code.
Personally I think you are misguided and the question you ask is poorly written but thats a different argument.
Regarding your points, PDI's logging is very good and you can log pretty much however you like, all into one large database table if you like a consolidated log.
ETL's wont be going away, even with the advent of the love of unstructured data storage pools like HDFS, also think about data analysis done outside R, if you want reporting or OLAP over the top of your data, it will still need transforming regardless.
Is it true, more people use R vs Pentaho? What sort of question is that? By Pentaho I assume you mean PDI? How can that ever be compared? A data analysis tool vs ETL tool and you want to count users? eh? If on the other hand you mean R vs Pentaho as a whole, then I would guess no.You are looking at a report on R vs Weka and making it fit your ETL argument. That doesn't wash in a month of sundays.
==EDIT==
Okay so you have around 1000 lines of R & Python code currently. As your bosses requirements expand this slowly grows over time, and because you are trying to hit deadlines the new code is written as cleanly or as well documented as the code you currently have in place. So over time this grows to 5000 lines say plus a few python scripts. Then one day you get hit by a bus, and some new person has to come in and manage your code... where do they start, how to they make changes?
Virtually anyone with a modicum of data experience could make a change to a PDI ETL should they be required to. Where as it would take some with enough in depth R knowledge to make changes to what you have done.
ETL tools are designed to be quick and easy to use, they also offer far more than R can provide in terms of data connectivity to different systems (non db or file based, for example), although I guess this is why people resort to python etc.
That said there is room for both, there is an R plugin for PDI kicking around in the community I've seen demonstrated.
On top of that I've seen enough TSQL to ETL migrations over the years to know from experience, that even though maintaining your ETL in code may seem practical in the short term, in the long term it just brings more pain.
On the other hand if you can code 250 PDI transformations down to 1000 lines of R, your ETL is likely bloated through bad design by your predecessor.
If you'd like me to give an opinion on your existing PDI ETL structure, that can also be arranged.
Tom
Related
I am a software engineer working in Medical Imaging.I have just started using the language IDL and i feel very comfortable with it.As a new member in this field with a language like IDL, i would like to know the chances of IDL in this field.Can any one help me?
Well, so here is my biased opinion -> I'm heading the opposite way to you. I have used IDL (and before PV-Wave) on and off for ca 10 years (mostly MRI) and I'm now trying to part from it. Here is why. If you are proficient you can very quickly test something in an interactive / lightly scripted fashion. This is the typical use case of scientists; most have little CS education and are happy to grab any tool that seems to helpful. In fact, IDL is fairly good at dealing with largish arrays/images etc as you are likely to encounter in imaging.
However, it is not very pretty and coding gets increasingly awkward as your project size increases. If you are a software engineer, I suspect you'll hit the limits soon and will be cursing it no end. If you try to develop GUI code for people around you, you might be in for a rough ride. This is one of the main reasons I am moving over to Python + EPD with scipy and the likes. Also, binding to existing sophisticated image processing tools as you might need (registration, segmentation, etc) are not ideal.
A further complaint I have are the ongoing licensing costs. Even in an academic environment they are becoming prohibitive and I'd rather spend it on a Coop-student who could code for me than on ITT. A nice feature though is the ability to compile almost all IDL code into a sav file that others can use with a free IDL virtual machine.
Essentially, what it will come down to is how much your collaborators need you to use IDL. If it's fully your choice, I would look elsewhere. If there is a significant (and decent) code base, I would stay. The medical imaging plus astro community is dependent enough to keep this going for a while. If you do decide to hang on, I can highly recommend Dave Fanning's writings (his web page + his book + the google-group). He is somewhat of an icon in the idl community and certainly taught me things that were very useful. (Check out the mighty histogram function, I'm not kidding!)
Hope this works out for you.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
Improve this question
What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?
A few ideas germane to this discussion:
Knowing SQL and the use of a DB such as MySQL, PostgreSQL was great till the advent of NoSql and non-relational databases. MongoDB, CouchDB etc. are becoming popular to work with web-scale data.
Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and such others to the list.
Data now comes in the form of text, urls, multi-media to name a few, and there are different paradigms associated with their manipulation.
What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop ?
OLS Regression now has Artificial Neural Networks, Random Forests and other relatively exotic machine learning/data mining algos. for company
Thoughts?
To quote from the intro to Hadley's phd thesis:
First, you get the data in a form that
you can work with ... Second, you
plot the data to get a feel for what
is going on ... Third, you iterate
between graphics and models to build a
succinct quantitative summary of the
data ... Finally, you look back at
what you have done, and contemplate
what tools you need to do better in
the future
Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)
Step 2 means visualisation/ plotting skills.
Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.
The final step is mostly about soft skills like introspection and management-type skills.
Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
Just to throw in some ideas for others to expound upon:
At some ridiculously high level of abstraction all data work involves the following steps:
Data Collection
Data Storage/Retrieval
Data Manipulation/Synthesis/Modeling
Result Reporting
Story Telling
At a minimum a data scientist should have at least some skills in each of these areas. But depending on specialty one might spend a lot more time in a limited range.
JD's are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:
Skill #1: Statistics (Studying)
Skill #2: Data Munging (Suffering)
Skill #3: Visualization (Story telling)
At dataist the question is addressed in a general way with a nice Venn diagram:
JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.
The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.
I think it's important to have command of a commerial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.
Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.
Matrix algebra is my top pick
The ability to collaborate.
Great science, in almost any discipline, is rarely done by individuals these days.
There are several computer science topics that are useful for data scientists, many of them have been mentioned: distributed computing, operating systems, and databases.
Analysis of algorithms, that is understanding the time and space requirements of a computation, is the single most-important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection; and determining your computational needs, such as how much RAM or how many Hadoop nodes.
Patience - both for getting results out in a reasonable fashion and then to be able to go back and change it for what was 'actually' required.
Study Linear Algebra on MIT Open course ware 18.06 and substitute your study with the book "Introduction to Linear Algebra". Linear Algebra is one of the essential skill sets in data analytic in addition to skills mentioned above.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
After investigating a little bit scrum and kanban, I finally read this answer and decided to start using kanban, picking something from scrum (note that I'm working mostly by myself, and I do have read this question and its answers).
Now, my question is: which tool would be best to get started?
whiteboard and postit
agilezen.com
JIRA with greenhopper
a spreadsheet (possibly on Google Docs)
brightgreenprojects.com
Agilo
Target Process
something else (please specify)
Notes about each:
I would lean towards the whiteboard, but there are several drawbacks (e.g. cannot make automatic charts, time measurements, metrics, and sometimes I work from home - where I need it most - and it's not convenient to carry :-)
I don't want to remember another username/password (I promised to myself to signup only to OpenID-enabled services)
My employer has JIRA but my group doesn't use it - I might ask for an account (it shouldn't require another password) and maybe later involve the rest of the group. But I don't know if they are using greenhopper and if it's a big deal installing it.
I generally hate spreadsheets
maybe overkill?
I'd be happy to have a localhost instance, but it could be problematic to give access to the whole group (per network/firewalls) - not a deal-breaker but surely a concern
What I'd like to get from this?
being more productive
tracking how much time I spend in any given task, possibly discussing the issue with my supervisor
tracking what "blocks" me most often
immediately see where I am compared to my schedule
manage in a better way my long todo list (e.g. answering faster to the "what I should do next?" question)
Do you have any suggestion?
Note on the scrumish tag: read the Henrik Kniberg's PDF. He first introduced the definition of scrumish on page 9.
If I may, I think that you are on the wrong path. Anything else than 1. or 4. is overkill and pretty much useless for a non distributed team. So for a team of one person...
Seriously, if you can avoid using a web based application, just do it. First, unless you are already mastering Scrum/Kaban, you need to learn the process, not a tool. Don't let a tool dictate the process. Then most web based tools are just too much click intensive, less easy and fast to update, less transparent/visible than a spreadsheet and a physical board. They are really 2nd category options.
So, I'd go for a spreadsheet and a physical board combo. If you need some charts (I'm still wondering what kind of chart/metrics you want to generate and what value they provide), a spreadsheet is the ideal tool (but honestly, you don't need any tool to draw a burndown). If you need to work from home, take the spreadsheet (or use google docs) and post its with you. Let's be objective, the impediments you mentioned are actually not real.
Last thing, if you had chosen the simplest thing that can possibly work, you would already be doing Scrum, Scrumban or whatever. So, instead of looking for a tool, my advice would be to just start doing it.
agilezen.com seems like the ideal solution for you. I have used it in the past solo for myself and it is convenient. I would not let a prejudice against non-OpenID sites get in the way of making a good choice.
pick the tool you already have, and start using it; don't let the absence of the "perfect tool" become an excuse not to start
EDIT: pick the simplest thing that can possibly work. In your case that would be whiteboard and postit notes. These have almost no setup overhead and will provide a constant visual reminder of what you're supposed to be doing.
And I suggest that you get used to making decisions on your own, as you're going to have to be your own Scrum Master ;-)
In the interests of diversity ;-) www.kanbantool.com has just launched too. It's open beta and seems at first glance even more "lightweight" than agilezen.
Target process is good too
We've been using JIRA with Greenhopper for a few months, in an effort to go agile. I use it for both Scrum for development, as well as for my personal kanban. The software is pretty flexible, and I'm going to keep with it. The story/subtask management is really handy, and it's fairly easy to use. One thing I like is that you can add stories/cards quickly, and can customize the data. This allowed me to add definition of done fields, work order numbers, etc.
In short, we're happy with it.
Bright Green have just launched a free version of their tool. It looks good .. the free version is fully functional too: https://signup.brightgreenprojects.com/plan/Free
I've tried out another kanban product for personal use and am absolutely loving this one. Feels lightweight and simple but actually packs in a fair amount of functionality at the same time.
www.kanbanery.com (free for personal use)
A novel tool not mentioned before is getsmartQ (out of beta since Dec 2010)
Key characteristics: very intuitive, supports LWP, customizable forms, and task discussions
Pros
configurable workflow, mark overdue stories, mail notifications (e.g., for upcoming story deadlines), multiple story owners
Stories form completely customizable, per project workflow and story forms, different roles (people only creating stories)
very responsive GUI with partial updates
Apparently good support: I've asked a question and got a good answer within a few hours
Cons
no easy way to declare something as blocked or to distinguish type (feature/bug/..)
no API
no subtasks or story dependencies
In comparison to Agilezen it has a more sophisticated notification system, but apart from that still lacks important features (see cons above).
Note, getsmartQ is under active development and hence missing features mentioned above may have been implemented in the meantime.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm trying to evaluate the purchase of a statistical tool. This will be used in part by non-programming users (doing clinical studies) and in part by programmers, so I'm trying to find a good compromise between usability and automation. Of course, cost is an issue, but if I can build a solid case, we could probably buy a commercial package, so we're not totally limited to free options.
So far, our options are:
Statistica (which some non-programmers already know)
Matlab Statistics toolbox (programmers already use matlab)
R language (would need a UI for non-programmers)
Hack something into Excel (not fun, but that's what non-programmers do right now)
?...
What else is out there? What's the industry standard? What kind of distinctive features should I look for? What would you recommend, and why?
Ideally, we'd like a tool that can run both on Linux and Windows machines.
(I work in medical imaging, so we do both biostatistics, and software engineering statistics)
Hands down it's R. R is very programmer friendly. It has functional aspects and it's GNU.
S-PLUS and R are both based off the S language. Both are similar and in most cases you can run as S-PLUS program in R and vice versa.
SAS is another option, although geared more towards BI and enterprise. SAS has a simpler syntax than R and in my opinion is easier to pickup for a non-programmer.
Other options include SPSS, Matlab, and even Excel.
I recommend R, personally. It's used by bioinformaticians and psychologists, I hear. Don't know what your field is though, so maybe it's a lousy choice. It is reasonably easy to use and learn.
Stata and SPSS tend to be the most commonly used packages in clinical studies. Both are pretty easy to pick up and use for non-technically minded folks but are generally flexible enough. I've used Stata more than any of the others and have been pretty happy with its options (supports both menu-based and command line operation, decent enough plugin system to get new user-created modules, good graphing support).
R is a little more daunting for newbie users, though it is popular with the biostatisticians. Since it's free, that's another nice point in its favor.
For a statistical package with a GUI which non-technical users can use, I would recommend that you go with "SAS Enterprise Guide". You will get the common and advanced SAS procedures, an excellent graphics facility and the ability to program for the technical users. I recommend that you start with the "SAS Learning Edition" (http://support.sas.com/learn/le/) which is a fully functional version of Enterprise Guide, but limited to processing 1000 rows at a time only. It is under $500, which makes it a pretty good deal.
I would look at S-Plus.
You get a strong programming environment (S-Plus Workbench, based upon the Eclipse platform), an intuitive GUI for non-programmers, and an extensive user community (including users of R, which was based upon the original S).
Visual Numerics is another option.
It sounds like you're trying to maximize multiple goals. You say "This will be used in part by non-programming users (doing clinical studies) and in part by programmers, so I'm trying to find a good compromise between usability and automation", with an implicit assumption that this will be the same tool in both cases, when that might not be realistic. What's the compromise for Word and LaTeX, for example?
Some different questions about the requirements:
Should it be extensible for programmers
Able to use C extensions
Easy to make new procedures and methods
What analysis are non-programmers going to want to use?
Graphics?
Ease of use for different groups
So my read on this:
Easy to extend: R/S-plus, Matlab/Octave (I happen to prefer R, but I do more stats and fewer matrix things)
Easy to use for normal people: Excel, custom wrapped R, SPSS
Also, R on windows has a limited GUI, which may or may not help your users.
If it was me, I'd go with a hybrid solution. Use R, and give a cheat sheet for for common tasks to non-programmers that illustrates common tasks, or even better, write some wrapper functions with names like "image_summary" that automate their exploratory work.
For writing front end scripts for R, the RPy python wrappers might help as well.
SAS Enterprise Guide has good usability for non-programmers. Also, it has good options to connect to Excel. And for programmers, it's the most robust option out there. The sas server runs on anything, though, enterprise guide is Windows only.
Consider Excel one more time. It is well known, and widely available. Refer this book or this book.
This Wikipedia page compares the features available for several statistical packages, as well as their OS compatibility and pricing info (which seems a little out of date, but it gives an overall idea)
We ended up getting the Matlab Statistics toolbox (mainly because we already have some experience with Matlab in the team, and needed the tool anyway)
So far, it's doing what we need to do, and it's easily expansible. Usage will show if non-programmers really use it, but so far it's looking good.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
In my work experience, most fresh out of school programmers are set right to creating reports for 6-12 months or so. While I see the benefit of doing something non-crucial, it seems to really discourage them.
So my question is, should organizations allow newbies to work with someone experienced right off the bat, obviously doing non-critical phases of a project, do get a real feel for what their career choice has in stock, or throw them on reports out of the gate?
Ah, there really is nothing like exploiting interns for remedial jobs...
Seriously though, you get back what you put in. Forcing them to do a thoughtless, thankless job for a long period of time is a quick way to build up a useless team member.
Perhaps they should be looking for a job at different companies? Maybe they shouldn't settle?
I was once a fresh-grad, and I have never been asked to work on a report. I had a programming check-in within the first 5 days of my job.
Maybe I am confused about the question. We are talking about folks who apply for programming positions and are sent to doing "reports" related job?!
I didn't start in "reports". I started on a conversion -- just get stuff to run on the new platform. Relatively safe, minor programming changes.
Then I did some new development for a while.
Then another conversion.
Then -- 2 years into my career -- no longer a complete n00b -- I wound up in "Reports". They wanted something like a dozen dumb-as-dirt accounting reports. Each was a "pull from the general ledger", "do some quick math" and "write a columnar report". [It was 1980, that's how stuff was done.]
I couldn't stand to do copy-and-paste programming. So I wrote a thing that extracted from the ledger into an array of values. It used a flexible notation for doing calculations on values in that array, then it wrote out the results of the calculations.
It could add, subtract, multiply and divide. You could use multiple operations on a series of "cells" to compute wonderfully complex things. To a limit.
I had invented the spreadsheet, built as a COBOL batch program. Seriously. That's what putting someone on reports can lead to. A single program that produced the dozen dumb-as-dirt financial reports. And a large number of additional reports, too.
Bonus. It was built in an Agile, incremental fashion. The first version did a half-dozen of the really easy reports. The next one did two or three more.
I don't think "reports" is a bad gig. What's bad is forcing people to copy and paste yet another dumb-as-dirt report program from a cookie-cutter template.
I believe it to be beneficial. It's what happened to me long ago and it provided me an opportunity to learn the database schema, the domain, and how the data is being used.
But, if they were hired as a Software Engineer they shouldn't be a report writer indefinitely. Programmer/Analyst however...
It's beneficial to the company in the short run, because then you can get useful work out of new graduates. It's harmful to everyone in the long run, because creating reports isn't really that hard, so the newbies don't learn much from doing it.
That being said, 6-12 months is a really long time to stick anybody on doing reports (unless they enjoy it, which most people don't). Maybe a shorter time period would be better training for a new employee.
I've worked in shops that threw a lot at the new hires where the results were mixed and I've worked at shops where they did pointless monkey-business exercises such as writing reports that nobody would read, attending 'process' meetings and open-ended tasks like "read a book about C++" or "learn something about this technology or that one. Both of these approaches were a waste of effort and time.
At my shop, if you are the new guy you aren't going to get left to your own devices to figure out X or to create busy work for yourself. Typically, we'll run you through our products so you are familiar with them as a user, then we'll talk through whatever task it is we need you to do, do the "I'm right over here, tell me if you need assistance" thing and then check up on them during the morning "what are you working on?" meeting. The goal at my shop is to get a developer up to speed as quickly as possible without skipping over the important stuff.
I think the key to successfully developing the new employee, particularly one who may be right out of school is to challenge them, provide them with interesting tasking that will make them not dread coming to work. If you get them interested in the work, you get an employee who becomes valuable. There are some tasks that just aren't interesting, and we all do them at my place. For me, I dread getting anywhere near MS Word to write formal documentation, but that comes with the territory sometimes. The 'new guy' needs to realize it won't always be code slinging or new development. Sometimes it is maintenance coding - much of the time it is. Sometimes it is 'turn the crank' type work. Sometimes it is report writing.
A good manager or senior developer will mentor the new hire. If a shop doesn't do that, I'd probably not want to work there myself.
They should be pair programming (or spectator programming) with different people from their department for a few weeks. Then they get to know all the people, the structure, the code and useful tips.
Reports are a wonderful introduction.
They tend to have very specific specifications, unlike many other projects. They're a good "stand alone" task. They also give the developer a good introduction to the domain model, which they must use to actually get the data out for the report.
Finally, they're (typically) reasonably simple with some reporting framework doing most of the heavy lifting for them. So they need to focus on learning the tools of the trade, deployment, and the data model.
They're a nice gradual introduction to the larger domain and application.
I've never been put on a non-important job as a safety function. Even when I didn't know exactly what I was doing I got put on important projects people wanted yesterday, and then paired with someone who had specific development he/she wanted to offload onto the new-hire.
It works pretty well that way.
If you put a college grad on report-writing duty for a long time, he's going to bail on you. Bad management and a waste of money...
I have two contrasting experiences with Crystal Reports in two different companies:
With my first employer (fresh out of University), our Crystal Reports expert was leaving, so I was asked to take over the role. No actual training was provided, so I had to learn everything on-the-job, with no support from either the Vendor or the Employer. Although my position description was as an IT Developer, I eventually spent 100% of my time working on Crystal Reports. It was an unproductive experience for me, and a waste of manpower and resources.
My current employer asked me to assist another Developer in creating and maintaining their Crystal Reports setup. Because they provided adequate training, and I was mentored in the role, I gained knowledge on multiple systems and databases. I even a little experience at administrating and maintaining SQL Server. And I also got the chance to interact with many different clients in the company, as many different sections of the company needed these reports.
So my answer to the original question is that it really depends on the organization, rather than the central concept. If your employer is intending to use it as a way of familiarizing new employees with multiple systems, then I think it's a great idea. If it's just a short-term way of foisting a thankless (and rotten) job on a hapless new employee, then I think it's a waste of manpower and resources.
The good thing about reports is that they are not updating information so there's no chance that any data will be lost.
Depending on what the tools are for reporting too. When I did reporting, I learned tons about SQL, and stored procedures. Of course that is probably not the norm for reporting.
It depends on the report, and it depends on the job. Many reports are anything but trivial, and excellent SQL skills are needed to create a performant and properly maintainable back-end. If your newbies are good with SQL, let them cut their teeth on the queries. It will be a good way for them to learn the schema of your database.
However, if "putting them on reports" is just a euphamism for them trying in vain for hours without direction or inspiration to format a table in Crystal reports 25 (or whatever the current version is), well, I think you probably already know my answer to that question...