How does software development compare with statistical programming/analysis? [closed] - r

Statistical analysis/programming is writing code. Whether the analysis is descriptive or inferential, you write code to import the data, clean it, analyse it, and compile a report.
Analyzing the data can involve many twists and turns of statistical procedures, and many angles from which you look at your data. At the end, you have many files, with many lines of code, performing tasks on your data. Some of it is reusable, and you encapsulate it as a "good to have" function.
This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.
From the Wikipedia article on Software development:
The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products.
For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.
According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.
Which leads me to my question: what differences can you outline between the two activities?
It can be in terms of the technical aspects, the different strategies or work styles, and whatever else you think is relevant.
This question came to me from the following threads:
How do you combine "Revision Control" with "Workflow" for R?
How to organize large R programs?
Workflow for statistical analysis and report writing

As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.
A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.
Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:
Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
A more routine kind of analysis (e.g. an annual government survey such as the census) could follow a more traditional methodology, such as the waterfall approach, since it would follow a very clear set of steps that are mostly known in advance.
I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.
Statistical analysis in some sense requires even better practices around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance so long as the program functions the way the specifications require. On the other hand, with scientific and statistical computing, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.
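Sweave itself is an R/LaTeX tool, but the underlying reproducibility idea carries over to any language. As a minimal sketch in Python (the file name and simulated data below are hypothetical), an analysis script can fix its random seed, record the environment it ran under, and emit its report alongside the code, so the numbers can be regenerated exactly:

```python
# Minimal sketch of the reproducibility idea: fixed seed, recorded environment,
# and a report written next to the code so re-running regenerates it exactly.
import json
import platform
import random
import statistics

random.seed(42)  # fixed seed so the simulated data, and hence the report, is reproducible

# simulated measurements standing in for real study data (hypothetical)
data = [random.gauss(100, 15) for _ in range(50)]

report = {
    "python_version": platform.python_version(),  # record the environment used
    "n": len(data),
    "mean": round(statistics.mean(data), 2),
    "sd": round(statistics.stdev(data), 2),
}

# write the report next to the code; re-running the script reproduces it
with open("analysis_report.json", "w") as fh:
    json.dump(report, fh, indent=2)
print(report)
```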

Perhaps the common denominator is "problem solving."
Beyond that, I doubt I could provide much insight, but I can at least provide a limited answer from personal experience.
This issue arises for us in hiring -- i.e., do we hire a programmer and teach them statistics, or do we hire a statistics person and teach them to program? Ideally we could find someone fluent in both disciplines, and indeed that's the third net we cast, but rarely with any success.
Here's an example. The most stable distinction between the two activities (software development and statistical analysis) is probably their respective outputs, or project deliverables. For instance, in my group someone is conducting the statistical analysis of the results of our split-path and factorial experiments (e.g., deciding from the t-test results whether the difference is significant, or whether the test ought to continue). That analysis will be sent to the marketing department, which will use it to modify the web pages comprising the Site with a view towards improving conversion. A second task involves the abstraction and partial automation of those analyses so the results can be processed in near-real time.
For the first task, we'll assign a statistician; for the second, a programmer. The business problem we are trying to solve is the same for both tasks, yet for the first the crux is statistics, while for the second the statistics problems have largely been solved and the crux is a core programming task (I/O).
Notice also how the tools associated with the two activities have evolved so that the distinction between them (software development and data analysis) is further blurred: mainstream development languages are being adapted for use as domain-specific analytical tools, while at the same time frameworks continue to be developed which enable non-developers to quickly build lightweight, task-oriented applications in DSLs.
For instance, Python, a general-purpose development language, has R bindings (RPy2) which, along with its native interactive interpreter (IDLE), substantially facilitate Python's use in statistical analysis. At the same time, there is a clear trend in R package development toward (web) application development: R bindings for Qt, gWidgetsWWW, and RApache are all R packages directed at client or web app development, and whose initial release was (I think) within the past 18 months. Aside from that, since at least the last quarter of last year, I've noticed an accelerating frequency of blog posts, presentations, etc. on the subject of web app development in R.
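To make the RPy2 point concrete, here is a minimal sketch of driving R from Python (it assumes R and the rpy2 package are installed; the sample numbers are made up):

```python
# Minimal sketch of calling R from Python with rpy2 (assumes R and rpy2 are installed).
import rpy2.robjects as robjects

# two small samples, defined on the Python side (illustrative numbers only)
x = robjects.FloatVector([5.1, 4.9, 5.4, 5.0, 5.2])
y = robjects.FloatVector([5.6, 5.8, 5.5, 5.9, 5.7])

t_test = robjects.r['t.test']      # look up R's t.test function
result = t_test(x, y)              # returns an R list object
print(result.rx2('p.value')[0])    # pull the p-value back into Python
```

The point is not the t-test itself but that the two environments compose: Python handles the plumbing, R supplies the statistics.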
Finally, I wonder if your question is perhaps evidence of the growing popularity of R. Here's what I mean. A decade ago, when my employer purchased a site license, I began learning and using one of the major statistical computing products (no point here in saying which one; it begins with "S"). I found it unnatural and inflexible. Unlike Perl (which I was using at the time), this tool was not an extension of my brain (which isn't an optional attribute of an analytical tool; to me it's more or less the definition of one). Interacting with this system was more like using a vending machine: I selected some statistical function I wanted and then waited for the "output", which was often an impressive set of high-impact, full-color charts and tables. Nearly always, though, what I wanted was to modify my input or use that output for the next analytical step. That seemed to require another, separate trip to the vending machine. The fact that this tool was context-aware -- i.e., it knew statistics -- while Perl didn't, didn't compensate for the awkward interaction. Statistical analysis done this way would never be confused with software development. (Again, this is just a summary of my own experience; I don't claim it can be generalized. It's also not a polemic against any (or all) commercial data analysis platforms -- millions use them and they've earned zillions for the people who created them, so let's assume it was my own limitations that caused the failure to bond.)
I had never heard of R until about 18 months ago, and I only discovered it while scanning PyPI (the web interface to Python's external package repository) for statistics libraries for Python. There I came across RPy, which seemed brilliant but required a dependency called "R" (RPy, of course, is really just a set of Python bindings to R).
Perhaps R appeals to programmers and non-programmers equally; still, for a programmer/analyst, this was a godsend. It hit everything on my wish list for a data analysis platform: an engine based on a full-featured, general programming language (which in this case is a proven Scheme descendant), an underlying functional paradigm, a built-in interactive interpreter, native data types built from the ground up for data analysis, and the domain knowledge baked in. Data analysis became more like coding. Life was good.

If you are using R, then you'll likely be writing code to solve your statistical questions, so in this sense, statistical analysis is a subset of programming.
On the other hand, there are plenty of SPSS users who have never ventured beyond a bit of pointing and clicking to solve their stats problems. This feels less like programming to me.

Related

Essential skills of a Data Scientist [closed]

What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?
A few ideas germane to this discussion:
Knowing SQL and how to use a database such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with their manipulation.
What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.
Thoughts?
To quote from the intro to Hadley's PhD thesis:
First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future
Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)
Step 2 means visualisation/ plotting skills.
Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.
The final step is mostly about soft skills like introspection and management-type skills.
Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
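Putting steps 1-3 together, a rough Python sketch of that workflow might look like the following (the file "survey.csv" and its columns are hypothetical stand-ins for whatever data you actually have):

```python
# Rough sketch of steps 1-3: munge, plot, model. File and column names are hypothetical.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: get the data into a form you can work with
df = pd.read_csv("survey.csv")
df = df.dropna(subset=["age", "income"])          # simple munging: drop incomplete rows

# Step 2: plot it to get a feel for what is going on
df.plot.scatter(x="age", y="income")
plt.savefig("age_vs_income.png")

# Step 3: iterate between graphics and models; here, one pass of a linear fit
slope, intercept = np.polyfit(df["age"], df["income"], deg=1)
print(f"income ~= {intercept:.1f} + {slope:.1f} * age")
```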
Just to throw in some ideas for others to expound upon:
At some ridiculously high level of abstraction all data work involves the following steps:
Data Collection
Data Storage/Retrieval
Data Manipulation/Synthesis/Modeling
Result Reporting
Story Telling
At a minimum a data scientist should have at least some skills in each of these areas. But depending on specialty one might spend a lot more time in a limited range.
JD's points are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:
Skill #1: Statistics (Studying)
Skill #2: Data Munging (Suffering)
Skill #3: Visualization (Story telling)
At dataist the question is addressed in a general way with a nice Venn diagram.
JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.
The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.
I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
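As a small illustration of that last point, here is a hedged sketch of pulling aggregated data out of a SQL store and into an analysis-friendly data frame (the database file, table, and column names are invented):

```python
# Sketch of getting data out of storage and into an analytic tool;
# the database file, table, and column names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("trades.db")
query = """
    SELECT trade_date, desk, SUM(notional) AS total_notional
    FROM trades
    WHERE trade_date >= '2010-01-01'
    GROUP BY trade_date, desk
"""
df = pd.read_sql_query(query, conn)   # SQL does the heavy lifting, pandas gets a tidy frame
conn.close()
print(df.describe())
```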
In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.
Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.
Matrix algebra is my top pick
The ability to collaborate.
Great science, in almost any discipline, is rarely done by individuals these days.
There are several computer science topics that are useful for data scientists, many of them have been mentioned: distributed computing, operating systems, and databases.
Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you need.
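A worked example of that kind of sizing arithmetic (all numbers are illustrative, not a real capacity plan):

```python
# Back-of-the-envelope sizing of the kind described above; the numbers are
# illustrative, not a real capacity plan.
n_rows = 200_000_000          # observations you expect to hold in memory
n_features = 10               # numeric columns
bytes_per_value = 8           # one float64

ram_bytes = n_rows * n_features * bytes_per_value
print(f"~{ram_bytes / 1e9:.0f} GB just for the raw matrix")   # ~16 GB

# If that exceeds one machine, the same arithmetic tells you roughly how many
# Hadoop-style workers you need before you write any code.
workers = -(-ram_bytes // (8 * 10**9))   # ceiling division against 8 GB nodes
print(f"about {workers} nodes with 8 GB usable RAM each")
```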
Patience - both for getting results out in a reasonable fashion and then being able to go back and change them for what was 'actually' required.
Study linear algebra with MIT OpenCourseWare 18.06 and supplement your study with the book "Introduction to Linear Algebra". Linear algebra is one of the essential skill sets in data analytics, in addition to the skills mentioned above.
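As a tiny taste of that material applied in code, here is an ordinary least-squares fit solved directly with numpy (the data points are made up):

```python
# A tiny linear-algebra example: a least-squares fit solved with numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix for y ~ a + b*x, solved via lstsq
A = np.column_stack([np.ones_like(x), x])
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(f"intercept={coef[0]:.2f}, slope={coef[1]:.2f}")
```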

exploring mathematics of/in computer science [closed]

I have been working for two years in software industry. Some things that have puzzled me are as follows:
There is a lack of application of mathematics in the current software industry.
e.g.: When a mechanical engineer designs an electricity pole, he computes the stress on the foundation using stress analysis techniques (read: mathematical equations) to determine exactly what kind and what grade of steel should be used. But when a software developer deploys a web server application, he just guesses at the estimated load on his server and leaves the rest to luck and god; there is nothing he can use to simulate the problem mathematically and answer it (my observation).
Great software (wind tunnel simulators, etc.) and computing programs (like Matlab) exist to simulate real-world problems (because they have their mathematical equations), but we in the software industry are still clueless about how many actual resources, in terms of memory, computing power, clock speed, RAM, etc., will be needed when our server-side application is actually deployed. We just keep guessing at the solution and solve such problems more or less by trial and error (my observation).
Programming is done against APIs, whether in C, C#, Java, etc. We are never able to exactly check the complexity of our code, and hence its efficiency, because somewhere we are using an abstraction written by someone else, whose source code we either don't have or didn't have the time to check.
e.g. If I write a simple client-server app in C# or Java, I am never able to calculate beforehand what the efficiency and complexity of this code is going to be, or what the minimum resources are that this whole client-server app will require (my observation).
Load balancing and scalability analysis are just too vague and are merely solved by adding more nodes if requests on the server are increasing (my observation).
Please post answers to any of my above puzzling observations.
Please post relevant references also.
I would be happy if someone proves me wrong and shows the right way.
Thanks in advance
Ashish
I think there are a few reasons for this. One is that in many cases, simply getting the job done is more important than making it perform as well as possible. A lot of software that I write is stuff that will only be run on occasion on small data sets, or stuff where the performance implications are pretty trivial (it's a loop that does a fixed computation on each element, so it's trivially O(n)). For most of this software, it would be silly to spend time analyzing the running time in detail.
Another reason is that software is very easy to change later on. Once you've built a bridge, any fixes can be incredibly expensive, so it's good to be very sure of your design before you do it. In software, unless you've made a horrible architectural choice early on, you can generally find and optimize performance hot spots once you have some more real-world data about how it performs. In order to avoid those horrible architectural choices, you can generally do approximate, back-of-the-envelope calculations (make sure you're not using an O(2^n) algorithm on a large data set, and estimate within a factor of 10 or so how many resources you'll need for the heaviest load you expect). These do require some analysis, but usually it can be pretty quick and off the cuff.
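For example, a back-of-the-envelope estimate of the sort described above might look like this (every number is an assumption you would replace with your own):

```python
# The kind of quick, off-the-cuff estimate described above; every number here
# is an assumption you would replace with your own.
peak_requests_per_sec = 50
cpu_ms_per_request = 20                     # measured or guessed per-request cost
safety_factor = 10                          # stay within a factor of 10

cpu_seconds_needed = peak_requests_per_sec * (cpu_ms_per_request / 1000) * safety_factor
cores_needed = cpu_seconds_needed           # one core delivers ~1 CPU-second per second
print(f"plan for roughly {cores_needed:.0f} cores at peak")   # = 10 cores
```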
And then there are cases in which you really, really do need to squeeze the ultimate performance out of a system. In these cases, people frequently do actually sit down, work out the performance characteristics of the systems they are working with, and do very detailed analyses. See, for instance, Ulrich Drepper's very impressive paper What Every Programmer Should Know About Memory (pdf).
Think about the engineering sciences: they all have very well defined laws that are applicable to the design and building of physical items, things like gravity, strength of materials, etc. In computer science, by contrast, there are not many well defined laws to build an application against.
I can think of many different ways to write a simple hello world program that would satisfy the requirement. However, if I have to build an electricity pole, I am severely constrained by the physical world and the requirements of the pole.
Point by point
An electricity pole has to withstand the weather, a load, corrosion etc and these can be quantified and modelled. I can't quantify my website launch success, or how my database will grow.
Premature optimisation? Good enough is exactly that; fix it when needed. If you're a vendor, you've no idea what your code will be running on in real life or how it's configured. Again, you can't quantify it.
Premature optimisation
See point 1. I can add as needed.
Carrying on... even engineers bollix up: collapsing bridges, blackouts, car safety recalls, the "wrong kind of snow", etc. Shall we change the question to "why don't engineers use more empirical observations?"
The answer to most of these is that, in order to have meaningful measurements (and accepted equations, limits, tolerances, etc.) like those you have in real-world engineering, you first need a way of measuring what it is that you are looking at.
Most of these things simply can't be measured easily. Software complexity is a classic: what is "complex"? How do you look at source code and decide whether it is complex or not? McCabe's cyclomatic complexity is the closest standard we have for this, but it's still basically just counting branch points in methods.
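As a rough illustration of what that counting amounts to, using the common decision-points-plus-one rule of thumb:

```python
# Cyclomatic complexity by the usual rule of thumb: number of decision points + 1.
# This function has three decision points (for, if, elif), so its complexity is 4.
def classify(values):
    labels = []
    for v in values:            # decision point 1
        if v < 0:               # decision point 2
            labels.append("negative")
        elif v == 0:            # decision point 3
            labels.append("zero")
        else:
            labels.append("positive")
    return labels

print(classify([-2, 0, 3]))     # ['negative', 'zero', 'positive']
```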
There is little math in software programs because the programs themselves are the equation. It is not possible to figure out the equation before it is actually run. Engineers use simple (and very complex) programs to simulate what happens in the real world; it is very difficult to simulate a simulator. Additionally, many problems in computer science are computationally intractable: see the traveling salesman problem.
Much of the mathematics is also built into languages and libraries. If you use a hash table to store data, you know that finding any element can be done in constant time, O(1), no matter how many elements are in the hash table. If you store it in a binary search tree, lookups take longer as the number of elements grows: O(log n) for a balanced tree, O(n) in the worst case.
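A quick way to see the difference is to time a membership test against a plain list (a linear scan) versus a hash-based dict:

```python
# Rough illustration of why the container choice matters: constant-time dict
# lookup versus a linear scan of a list, for the same membership question.
import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)

# Searching for an element near the end of the collection
scan = timeit.timeit(lambda: (n - 1) in as_list, number=200)   # O(n) per lookup
hashed = timeit.timeit(lambda: (n - 1) in as_dict, number=200) # O(1) on average

print(f"list scan: {scan:.4f}s   dict lookup: {hashed:.6f}s")
```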
The problem is that software talks with other software, written by humans. The engineering examples you describe deal with physical phenomena, which are constant. If I develop an electrical simulator, everyone in the world can use it. If I develop a protocol X simulator for my server, it will help me, but probably won't be worth the work.
No one can design a system from scratch and people that write semi-common libraries generally have plenty of enhancements and extensions to work on rather than writing a simulator for their library.
If you want a network traffic simulator you can find one, but it will tell you little about your server load because the traffic won't be using the protocol your server understands. Every server is going to see completely different sets of traffic.
There is a lack of application of mathematics in the current software industry.
e.g.: When a mechanical engineer designs an electricity pole, he computes the stress on the foundation using stress analysis techniques (read: mathematical equations) to determine exactly what kind and what grade of steel should be used. But when a software developer deploys a web server application, he just guesses at the estimated load on his server and leaves the rest to luck and god; there is nothing he can use to simulate the problem mathematically and answer it (my observation).
I wouldn't say that luck or god are always the basis for load estimation. Often realistic data can be had.
It's also not true that there are no mathematical techniques to answer the question. Operations research and queuing theory can be applied to good advantage.
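For instance, the textbook M/M/1 queue gives you response-time estimates from just an arrival rate and a service rate (the rates below are made up):

```python
# Textbook M/M/1 queue calculation: single server, Poisson arrivals at rate
# lam, exponential service at rate mu (both numbers below are made up).
lam = 80.0   # requests per second arriving
mu = 100.0   # requests per second the server can handle

rho = lam / mu                      # utilisation
W = 1.0 / (mu - lam)                # mean time a request spends in the system
L = rho / (1.0 - rho)               # mean number of requests in the system

print(f"utilisation={rho:.0%}, mean response time={W*1000:.0f} ms, "
      f"mean requests in system={L:.1f}")
# At 80% utilisation the mean response time is already 5x the bare 10 ms
# service time, which is exactly the kind of non-obvious result the math gives you.
```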
The real problem is that mechanical engineering is based on laws of physics and a foundation of thousands of years worth of empirical and scientific investigation. Computer science is only as old as me. Computer science will be much further along by the time your children and grandchildren apply the best practices of their day.
An MIT EE grad would not have this problem ;)
My thoughts:
Some people do actually apply math to estimate server load. The equations are very complex for many applications and many people resort to rules of thumb, guess and adjust or similar strategies. Some applications (real time applications with a high penalty for failure... weapons systems, powerplant control applications, avionics) carefully compute the required resources and ensure that they will be available at runtime.
Same as 1.
Engineers also use components provided by others, with a published interface. Think of electrical engineering. You don't usually care about the internals of a transistor, just its interface and operating specifications. If you wanted to examine every component you use in all of its complexity, you would be limited to what one single person can accomplish.
I have written fairly complex algorithms that determine what to scale when, based on various factors such as memory consumption, CPU load, and I/O. However, the most efficient solution is sometimes to measure and adjust. This is especially true if the application is complex and evolves over time. The effort invested in modeling the application mathematically (and updating that model over time) may be more than the cost of the efficiency lost to try-and-correct approaches. Eventually, I could envision a better understanding of the correlation between code and the environment it executes in leading to systems that predict resource usage ahead of time. Since we don't have that today, many organizations load-test code under a wide range of conditions to gather that information empirically.
Software engineering is very different from the typical fields of engineering. Where "normal" engineering is bound to the context of our physical universe and the laws in it we've identified, there's no such boundary in the software world.
Producing software is usually an attempt to mirror a subset of the real-life world into a virtual reality. Here we define the laws ourselves, picking only the ones we need and making them just as complex as we need. Because of this fundamental difference, you need to look at problem-solving from a different perspective. We try to make abstractions to make complex parts less complex, just as we teach kids that yellow + blue = green, when it's really the wavelength of the light that bounces off the paper that changes.
Once in a while we are bound by different laws, though. Things like Big-O, test coverage, complexity measurements, and UI measurements are all models of mathematical laws. If you look into digital signal processing, real-time programming, and functional programming, you'll often find that the programmers use equations to figure out a way to do what they want. But these techniques aren't really (to some extent) useful for creating a virtual domain that can handle complex logic, branching, and interaction with a user.
The reasons why wind tunnels, simulations, etc.. are needed in the engineering world is that it's much cheaper to build a scaled down prototype, than to build the full thing and then test it. Also, a failed test on a full scale bridge is destructive - you have to build a new one for each test.
In software, once you have a prototype that passes the requirements, you have the full-blown solution; there is no need to build the full-scale version. You should be running load simulations against your server apps before going live with them, but since loads are variable and often unpredictable, you're better off building the app to be able to scale to any size by adding more hardware than targeting a particular load. Bridge builders have a given target load they need to handle. If they had a predicted usage of 10 cars at any given time, and then a year later the bridge's popularity soared to 1,000,000 cars per day, nobody would be surprised if it failed. But with web applications, that's the kind of scaling that has to happen.
1) Most business logic is usually broken down into decision trees. This is the "equation" that should be proven with unit tests: if you put in x then you should get y. I don't see any issue there. (A minimal sketch appears after this list.)
2,3) Profiling can provide some insight as to where performance issues lie. For the most part you can't say that software will take x cycles, because that will change over time (e.g. the database becomes larger, the OS starts going funky, etc.). Bridges, for instance, require constant maintenance; you can't slap one up and expect it to last 50 years without spending time and money on it. Using libraries is like not trying to figure out pi every time you want to find the circumference of a circle: it has already been proven (and is cost-effective), so there is no need to reinvent the wheel.
4) For the most part web applications scale well horizontally (multiple machines). Vertical scaling (multithreading/multiprocessing) tends to be much more complex. Adding machines is usually relatively easy and cost-effective, and it avoids some bottlenecks that are hit rather easily (disk I/O). Load balancing also removes the possibility of one machine being a single point of failure.
It isn't exactly rocket science, as you never know how many consumers will come to the serving line. Generally it is better to have too much capacity than to have errors, pissed-off customers, and someone (generally your boss) chewing your hide out.
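As a minimal sketch of point 1 (the discount rule below is hypothetical), the business rule is the "equation" and the unit test proves it: put in x, expect y.

```python
# Minimal sketch: a toy decision-tree style business rule and the unit tests
# that proof it (put in x, expect y). The rule itself is hypothetical.
import unittest

def discount(order_total, is_member):
    """Toy business rule expressed as a small decision tree."""
    if order_total >= 100 and is_member:
        return 0.15
    if order_total >= 100:
        return 0.10
    return 0.0

class DiscountRuleTest(unittest.TestCase):
    def test_member_over_threshold(self):
        self.assertEqual(discount(150, True), 0.15)

    def test_non_member_over_threshold(self):
        self.assertEqual(discount(150, False), 0.10)

    def test_small_order(self):
        self.assertEqual(discount(20, True), 0.0)

if __name__ == "__main__":
    unittest.main()
```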

Evolutionary vs throwaway prototyping [closed]

Who is winning in the "Low vs High fidelity prototyping" debate?
Should prototype-zero (P0) be the first version of the final product? Or should P0 always be a throwaway? What approach is the industry favoring?
Excellent article from Wikipedia: Software prototyping
A prototype should always be a throwaway - a prototype is used to quickly prove a concept and influence the design of the real product. As such, a lot of things which are important for a real product (a thought-out architecture and design, reliability, security, maintainability, etc.) fall by the wayside. If you do take these things into account when building your prototype, you're not really building a prototype anymore.
My experience with prototypes where the code directly evolved into an actual product shows that the end result suffers because of it: the lack of a real architecture resulted in a lot of cobbled-together code that had to be constantly hacked to add new features. I've even seen a case where the original technology chosen for rapid development of the prototype was not the best choice for the actual product, and a complete rewrite was necessary for V2.
I think we, the pedants, have lost this particular battle -- alleged "prototypes" (which by definition should be rewritten from scratch!) are in fact being "evolved" into (often half-baked) "betas", etc.
Even today, I applauded the smart attempt by a colleague of mine to recapture the concept, even if the term is a lost battle: he's setting up a way for small proof-of-concept projects to be developed (and, if the concept does get proven, transferred to software engineers for real prototyping, then development).
The idea is that, in our department, we have many people who aren't (and aren't in fact supposed to be!-) software developers, but are very smart, computer savvy, and in daily contact with the reality "in the trenches" -- they are the ones who are most likely to smell an opportunity for some potential innovation which could have real impact once implemented as a "production-ready" software project. Salespeople, account managers, business analysts, technology managers -- at our company, they all often fit this description.
But they're NOT going to program in C++, hardly at all in Java, maybe in Python but miles away from "productionized" -- indeed they're far more likely to whip up a smart proof of concept in php, javascript, perl, bash, Excel+VBA, and sundry other "quick and dirty" technologies we don't even want to dream about productionizing and supporting forevermore!-)
So by calling their prototypes "proofs of concept", we hope to encourage them to embody their daring concepts in concrete form (vague natural-language blabberings and much waving of hands being least useful, and alien to the company's culture anyway;-) and yet sharply indicate that such projects, if promoted to exist among the software engineers' goals and priorities, DO have to be programmed from scratch -- the proof-of-concept serves, at best, as a good draft/sketch spec for what the engineers are aiming for, definitely NOT to be incrementally enriched, but redone from the root up!-).
It's early to say how well this idea works -- ask me in three months, when we evaluate the quarter's endeavors (right now, we're just providing a blueprint for them, hot on the heels of evaluating last quarter's department- and company-wise undertakings!-).
Write the prototype, then keep refactoring it until it becomes the product.
The key is to not hesitate to refactor when necessary.
It helps to have few people working on it initially. With too many people working on something, refactoring becomes more difficult.
Response from BUNDALLAH, HAMISI
A prototype typically simulates only a few aspects of the features of the eventual program, and may be completely different from the eventual implementation.
Contrary to what my other colleagues have suggested above, I would NOT advise my boss to opt for the throwaway prototype model. I am with Anita on this. Given the two prototype models and the circumstances provided, I would strongly advise the management (my boss) to opt for the evolutionary prototype model. With the company being large, and given all the other variables such as the complexity of the code and the newness of the programming language to be used, I would not use the throwaway prototype model.
The throwaway prototype becomes the starting point from which users can re-examine their expectations and clarify their requirements. When this has been achieved, the prototype is 'thrown away', and the system is formally developed based on the identified requirements (Crinnion, 1991). But in this situation, the users may not know all the requirements at once, due to the complexity of the factors given.
Evolutionary prototyping is the process of developing a computer system by gradual refinement. Each refinement of the system contains a system specification and software development phase. In contrast to both the traditional waterfall approach and incremental prototyping, which required everyone to get everything right the first time, this approach allows participants to reflect on lessons learned from the previous cycle(s). It is usual to go through three such cycles of gradual refinement, but there is nothing stopping a process of continual evolution, which is often the case in many systems. According to Davis (1992), evolutionary prototyping acknowledges that we do not understand all the requirements (as we have been told above that the system is complex, the company is large, the code will be complex, and the language is fairly new to the programming team).
The main goal when using evolutionary prototyping is to build a very robust prototype in a structured manner and constantly refine it. The reason for this is that the evolutionary prototype, when built, forms the heart of the new system, and the improvements and further requirements will be built on it. This technique allows the development team to add features, or make changes, that couldn't be conceived during the requirements and design phase.
For a system to be useful, it must evolve through use in its intended operational environment. A product is never "done"; it is always maturing as the usage environment changes. Developers often try to define a system using their most familiar frame of reference -- where they are currently (or rather, the current system status). They make assumptions about the way business will be conducted and the technology base on which the business will be implemented. A plan is enacted to develop the capability, and, sooner or later, something resembling the envisioned system is delivered (SPC, 1997).
Evolutionary Prototypes have an advantage over Throwaway Prototypes in that they are functional systems. Although they may not have all the features the users have planned, they may be used on an interim basis until the final system is delivered.
In Evolutionary Prototyping, developers can focus themselves to develop parts of the system that they understand instead of working on developing a whole system. To minimize risk, the developer does not implement poorly understood features. The partial system is sent to customer sites. As users work with the system, they detect opportunities for new features and give requests for these features to developers. Developers then take these enhancement requests along with their own and use sound configuration-management practices to change the software-requirements specification, update the design, recode and retest. (Bersoff and Davis, 1991).
However, the main problems with evolutionary prototyping are due to poor management: Lack of defined milestones, lack of achievement - always putting off what would be in the present prototype until the next one, lack of proper evaluation, lack of clarity between a prototype and an implemented system, lack of continued commitment from users. This process requires a greater degree of sustained commitment from users for a longer time span than traditionally required. Users must be constantly informed as to what is going on and be completely aware of the expectations of the 'prototypes'.
References
Bersoff, E., Davis, A. (1991). Impacts of Life Cycle Models of Software Configuration Management. Comm. ACM.
Crinnion, J. (1991). Evolutionary Systems Development, a practical guide to the use of prototyping within a structured systems methodology. Plenum Press, New York.
Davis, A. (1992). Operational Prototyping: A new Development Approach. IEEE Software.
Software Productivity Consortium (SPC). (1997). Evolutionary Rapid Development. SPC document SPC-97057-CMC, version 01.00.04.

How to get started on Information Extraction?

Could you recommend a training path to start and become very good in Information Extraction? I started reading about it to do one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments: I am more interested in Text Information Extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy - they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on "searching and ranking", document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging, and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus since most of the target information is already extracted in infoboxes - this might provide you with some limited amount of measurement feedback.
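A first taste of both tasks using NLTK might look like this (it assumes nltk is installed and the listed data packages have been downloaded; resource names may vary slightly between NLTK versions):

```python
# Small POS-tagging / NER taste using NLTK (assumes `pip install nltk`;
# resource names below may vary slightly between NLTK versions).
import nltk

# one-time downloads; harmless if already present
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Barack Obama visited Paris in 2009."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)        # part-of-speech tags
tree = nltk.ne_chunk(tagged)         # named-entity chunks (PERSON, GPE, ...)

print(tagged)
print(tree)
```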
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig: Theorizing from data as a starting point, and a very good motivator - maybe you could reimplement some of their results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point for IE tasks usually grows exponentially - in development and research time. It's also quite underdocumented - most of the high-quality info is currently in obscure white papers (Google Scholar is your friend) - do check them out once you've got your hand burned a couple of times. But most importantly, do not let these obstacles throw you off - there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE; just understand how the algorithm works, experiment on the cases for which you need optimal performance and the scale at which you need to reach your target accuracy level, and work with that. You are basically working with algorithms, programming, and aspects of CS/AI/machine learning theory, not writing a PhD paper on a new machine-learning algorithm where you have to convince someone, by way of mathematical principles, why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory; as we all know, mathematicians are stuck more on theory than on the practicability of algorithms for producing workable business solutions. You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people have found from their results.
IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets. You would then also want to weigh up whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or whether you want to do it using statistical machine learning approaches like Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing your data
segmentation/classification/clustering/association - your black box, where most of your extraction work will be done
post-processing - cleansing your data back into the form in which you want to store it or represent it as information
Also, you need to understand the difference between what is data and what is information, as you can reuse your discovered information as a source of data to build more information maps/trees/graphs. It is all very contextualized.
standard steps for: input->process->output
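A deliberately toy-sized sketch of that input -> process -> output pipeline on plain text (the documents and the keyword-based "black box" are invented stand-ins for a real model):

```python
# Sketch of the input -> process -> output pipeline on plain text; the
# documents and the keyword rule standing in for the "black box" are toys.
import re
from collections import Counter

docs = [
    "Acme Corp announced record quarterly earnings in London.",
    "The weather in Paris was mild and rainy all week.",
]

def preprocess(text):
    """Pre-processing: lowercase, strip punctuation, tokenize."""
    return re.findall(r"[a-z]+", text.lower())

def classify(tokens):
    """The 'black box': a trivial keyword rule standing in for a real model."""
    finance_terms = {"earnings", "quarterly", "profit", "revenue"}
    return "finance" if finance_terms & set(tokens) else "other"

# Post-processing: store the extracted information in a structured form
results = []
for doc in docs:
    tokens = preprocess(doc)
    results.append({
        "label": classify(tokens),
        "top_terms": Counter(tokens).most_common(3),
    })

for r in results:
    print(r)
```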
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or make unstructured dumps into Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on, say the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial for defining what exactly it is you are trying to extract from a standardized document representation.
Standard measures include precision, recall, and F1 measure, amongst others.
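For reference, a worked example of those measures (the counts are invented):

```python
# Worked example of the standard measures; the counts are invented for illustration.
tp, fp, fn = 40, 10, 20          # true positives, false positives, false negatives

precision = tp / (tp + fp)       # 40 / 50 = 0.80
recall = tp / (tp + fn)          # 40 / 60 = 0.67
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```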
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on linear algebra, probability, and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (it's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that:
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing a NER system (and training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.

Statistical tools for programmers [closed]

I'm trying to evaluate the purchase of a statistical tool. This will be used in part by non-programming users (doing clinical studies) and in part by programmers, so I'm trying to find a good compromise between usability and automation. Of course, cost is an issue, but if I can build a solid case, we could probably buy a commercial package, so we're not totally limited to free options.
So far, our options are:
Statistica (which some non-programmers already know)
Matlab Statistics toolbox (programmers already use matlab)
R language (would need a UI for non-programmers)
Hack something into Excel (not fun, but that's what non-programmers do right now)
?...
What else is out there? What's the industry standard? What kind of distinctive features should I look for? What would you recommend, and why?
Ideally, we'd like a tool that can run both on Linux and Windows machines.
(I work in medical imaging, so we do both biostatistics, and software engineering statistics)
Hands down it's R. R is very programmer friendly. It has functional aspects and it's GNU.
S-PLUS and R are both based on the S language. They are similar, and in most cases you can run an S-PLUS program in R and vice versa.
SAS is another option, although geared more towards BI and enterprise. SAS has a simpler syntax than R and in my opinion is easier to pickup for a non-programmer.
Other options include SPSS, Matlab, and even Excel.
I recommend R, personally. It's used by bioinformaticians and psychologists, I hear. Don't know what your field is though, so maybe it's a lousy choice. It is reasonably easy to use and learn.
Stata and SPSS tend to be the most commonly used packages in clinical studies. Both are pretty easy to pick up and use for non-technically minded folks but are generally flexible enough. I've used Stata more than any of the others and have been pretty happy with its options (supports both menu-based and command line operation, decent enough plugin system to get new user-created modules, good graphing support).
R is a little more daunting for newbie users, though it is popular with the biostatisticians. Since it's free, that's another nice point in its favor.
For a statistical package with a GUI which non-technical users can use, I would recommend that you go with "SAS Enterprise Guide". You will get the common and advanced SAS procedures, an excellent graphics facility and the ability to program for the technical users. I recommend that you start with the "SAS Learning Edition" (http://support.sas.com/learn/le/) which is a fully functional version of Enterprise Guide, but limited to processing 1000 rows at a time only. It is under $500, which makes it a pretty good deal.
I would look at S-Plus.
You get a strong programming environment (S-Plus Workbench, based upon the Eclipse platform), an intuitive GUI for non-programmers, and an extensive user community (including users of R, which was based upon the original S).
Visual Numerics is another option.
It sounds like you're trying to maximize multiple goals. You say "This will be used in part by non-programming users (doing clinical studies) and in part by programmers, so I'm trying to find a good compromise between usability and automation", with an implicit assumption that this will be the same tool in both cases, when that might not be realistic. What's the compromise for Word and LaTeX, for example?
Some different questions about the requirements:
Should it be extensible for programmers
Able to use C extensions
Easy to make new procedures and methods
What analysis are non-programmers going to want to use?
Graphics?
Ease of use for different groups
So my read on this:
Easy to extend: R/S-plus, Matlab/Octave (I happen to prefer R, but I do more stats and fewer matrix things)
Easy to use for normal people: Excel, custom wrapped R, SPSS
Also, R on windows has a limited GUI, which may or may not help your users.
If it was me, I'd go with a hybrid solution. Use R, and give non-programmers a cheat sheet that illustrates common tasks, or even better, write some wrapper functions with names like "image_summary" that automate their exploratory work.
For writing front end scripts for R, the RPy python wrappers might help as well.
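For example, a hedged sketch of such a front-end wrapper using rpy2 (assuming R and rpy2 are installed; the function and CSV file name are hypothetical), so a non-programmer never has to touch R syntax:

```python
# Sketch of a Python front-end wrapper around R via rpy2; the function name
# and CSV file are hypothetical, and base R does the statistical work.
import rpy2.robjects as robjects

def quick_summary(csv_path):
    """Read a CSV in R and return summary() output as plain text."""
    r_code = f'capture.output(summary(read.csv("{csv_path}")))'
    lines = robjects.r(r_code)      # character vector of printed lines
    return "\n".join(lines)

# usage by a non-programmer:
# print(quick_summary("trial_results.csv"))
```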
SAS Enterprise Guide has good usability for non-programmers. Also, it has good options to connect to Excel. And for programmers, it's the most robust option out there. The SAS server runs on anything; Enterprise Guide, though, is Windows-only.
Consider Excel one more time. It is well known and widely available. Refer to this book or this book.
This Wikipedia page compares the features available for several statistical packages, as well as their OS compatibility and pricing info (which seems a little out of date, but it gives an overall idea)
We ended up getting the Matlab Statistics toolbox (mainly because we already have some experience with Matlab in the team, and needed the tool anyway)
So far, it's doing what we need to do, and it's easily extensible. Usage will show whether non-programmers really use it, but so far it's looking good.
