I am working on a Spring-based web application which performs predictive analysis on users' historical data and comes up with offers for them. I need to implement predictive analysis or some kind of regression functionality that gives a confidence score/prediction for presenting those offers. I am a Java developer and looked at Weka and Mahout to get the desired result, but neither tool provides good documentation and it is very difficult to proceed with them. I need a suggestion for a Java-based analytics API that processes my data using regression, neural networks, or decision trees and provides a confidence score depicting a customer's probability of buying the product in the future.
Any help in this regard is highly appreciated.
I've just finished a long project that involved building a GUI with JavaFX and R using the JRI package; it uses forecasting from the forecast package in R.
If you choose this solution (JavaFX + R), all of R's statistical packages will be available to you. R has great documentation for this, but the JRI interface is a challenge.
The program I built runs standalone, not as Java Web Start.
Most of the fuss involves setting up the environment variables and passing parameters to the JVM. The big problem is deployment: you need to make sure your clients have R and set up all the links between R and Java on their PCs.
If you're interested in any predictive analysis (trees, regressions, ...) in R using Java/JRI, let me know and I'll post it.
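In the meantime, here is a minimal sketch of what calling the forecast package from Java via JRI can look like. The Rengine/REXP classes are the standard JRI API; the sales figures, the auto.arima call, and the forecast horizon are invented for illustration, and the sketch assumes R, the forecast package, and the JRI native library are installed and visible to the JVM.

```java
import org.rosuda.JRI.REXP;
import org.rosuda.JRI.Rengine;

public class ForecastViaJri {
    public static void main(String[] args) {
        // Start an embedded R engine; requires R plus the JRI native library
        // on java.library.path and R_HOME set correctly.
        Rengine engine = new Rengine(new String[]{"--vanilla"}, false, null);
        if (!engine.waitForR()) {
            System.err.println("Unable to start R");
            return;
        }
        try {
            // Hypothetical monthly sales figures to forecast.
            double[] sales = {120, 135, 128, 150, 162, 158, 171, 180, 175, 190, 205, 210};
            engine.assign("y", sales);

            // Fit an ARIMA model in R and forecast the next 6 periods.
            engine.eval("library(forecast)");
            REXP result = engine.eval(
                "as.numeric(forecast(auto.arima(ts(y, frequency = 12)), h = 6)$mean)");

            double[] predictions = result.asDoubleArray();
            for (int i = 0; i < predictions.length; i++) {
                System.out.printf("t+%d: %.2f%n", i + 1, predictions[i]);
            }
        } finally {
            engine.end();
        }
    }
}
```

Most of the work in practice is the environment setup described above, not the Java code itself.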
I'd advise you to keep trying with Weka. It's a great tool, not only for implementation but also to get an idea of which algorithms will work for you, what your data looks like, etc.
The book is worth the price, but if you're not willing to buy it, this wiki page might be a good starting point.
It might be best to start with testing, not programming - I believe the quote goes "60% of the difficulty of machine learning is understanding the dataset". Play around with the Weka GUI, find out what works best for you and your data, and do try some of the meta-classifiers (boosting, bagging, stacking); they often give great results (at the cost of processing time).
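To make the Weka suggestion concrete for the original question (a confidence score for presenting an offer), here is a minimal, hedged sketch using the standard Weka Java API (J48, Instances, DataSource). The ARFF file name and the assumption that the last attribute is a yes/no "buys" class are illustrative only.

```java
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OfferScorer {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file with customer history; the last attribute is
        // assumed to be the class ("buys" = yes/no).
        Instances data = DataSource.read("customer_history.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a C4.5-style decision tree (any weka.classifiers.Classifier works here).
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Score one customer: distributionForInstance returns one probability
        // per class, which can serve as the confidence score for the offer.
        Instance customer = data.instance(0);
        double[] distribution = tree.distributionForInstance(customer);
        int predicted = (int) tree.classifyInstance(customer);

        System.out.printf("Predicted class: %s (confidence %.2f)%n",
                data.classAttribute().value(predicted), distribution[predicted]);
    }
}
```

Because distributionForInstance gives a probability per class, the probability of the "yes" class can be used directly as the buying-likelihood score the question asks for, with J48 swappable for any other Weka classifier.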
I'm currently interested in working in and studying big data analytics and web analytics, but I don't know how or where to get started. I tried looking on the Internet, but some of the material is too advanced for me. Are there any skills or knowledge in statistics and mathematics that I need before going down this route?
My current plan is to attend online courses on weekends, since I'm currently working as an Associate Software Engineer during weekdays, and to practice the programming languages needed for big data, like R. I already have a degree in Computer Science, so familiarity with some statistical and mathematical methods is not a problem. Any suggestions and comments are much appreciated!
For those who already have experience, how has it been, and what do you work with most?
I am in a similar boat as you. I work in a web development department as a business analyst. I do some software development, data mining, and data visualization, but I am constantly improving my skills because it's all pretty interesting to me, and it makes me an extremely versatile employee.
Web Analytics/Big Data
See if you can get read access to your company's Google Analytics account, assuming they have a website. The API is really good, and pre-built packages in R make it really easy to get large amounts of data out. If their website is big enough, you can easily create your own, real data sets. While these probably won't be "big" as in "big data", they're definitely great for practicing data visualizations. I'd suggest learning Shiny and R Markdown. You can easily create web stats visualizations you can share with your company. If you end up running into issues with the amount of data you're trying to process (i.e. if they have a huge web presence), then you might look into Spark for processing big data. Coursera has a specialization focusing on Big Data - https://www.coursera.org/specializations/big-data. You can take all the classes for free if you just "audit" them. You won't get a certificate or anything, but you get access to all the course material. They apparently go through Spark, Hadoop, Pig, and Hive. I haven't taken it, but the UCSD Coursera classes I have taken have been pretty good.
Obviously Coursera isn't the end-all-be-all... Also check out edx.org, Pluralsight, Udemy, etc... You can get a free Pluralsight membership for a year - just Google it. Mine was through Microsoft somehow. My favorite courses by Pluralsight have been (unrelated to data/analytics) Ethical Hacking. Udemy often has amazing deals on HUGE courses - like 21 hours of lectures about Python for data analysis and stuff like that. Just sign up for the service, and you'll get a "special offer" in a week or two. They're usually $10-20. https://www.brighttalk.com/ is also a good place for webinars and talks related to data science/analytics.
Databases
My company uses SQL Server (Microsoft), so I also took some database classes on MVA (Microsoft Virtual Academy). They have a bunch of classes from complete noob to brushing up on skills: MVA Database Stuff.
Data Sets
If you find yourself needing big data sets, join Kaggle. They often have great data sets for machine learning, but you can use them yourself to mine and do visualizations. I'd look for labelled data sets in particular. Many of the bigger sets are completely anonymized - no labels, no nothin'. But that's not very fun if you're just digging around. Additionally, someone has compiled a bunch of public data sources here: https://github.com/caesar0301/awesome-public-datasets. Finally, NYC Open Data is one of my favorite places to get neat data sets. Some are super boring, but there have been some cool analyses done on parking tickets and the like.
More...
If you're just looking for more classes to take or books to read, check out https://www.metacademy.org/. They have a few suggested paths to learn deep learning, machine learning, Bayesian stats, and other stuff like that. I think machine learning is an excellent next step - once you're versed in software development, database management/creation/querying, and visualization.
Even more...
Just immerse yourself. There are TONS of data blogs, podcasts, meetup groups, conferences, and news out there. Do all you can to get in there and figure out what's going on and who's doing what. It's super interesting anyway. Two of my favorite things to follow: DataTau (Hacker News for data science) and I Quant NY (linked above, for parking tickets).
OK, this question is not exactly technical, but it is very pertinent and current.
In case you haven't heard of it, Eureqa (http://creativemachines.cornell.edu/eureqa) is a machine-learning-based tool that helps you find hidden equations and mathematical relationships within your data. It sounds futuristic and experimental, and to a great degree it seems it is.
This is the relevant talk by Eureqa inventor Hod Lipson, entitled "Mining experimental data": http://www.youtube.com/watch?v=Xja6sLl6dVg
So I believe this could become popular among many R users.
On the official site one can obtain Eureqa clients for MATLAB, Mathematica, Python, etc., but none so far for R.
So this question is just what it is: is anyone of you working on creating one, or if you know this project, do you know what it would take to make one?
I asked the same question on the Eureqa group a few months ago. Here's the link:
http://groups.google.com/group/eureqa-group/browse_thread/thread/cb251327b50dbd4f
The last entry has the following links:
http://r.eureqa.ivi.eu.com/
http://groups.google.com/group/eureqa-r
I haven't tried this, so I don't know if it works. If you try it, please post your results.
I am currently facing a situation where I as an advocate of test driven development have to compete with an advocate of model driven software development (MDSD) / model driven architecture (MDA).
In my opinion, code generation is a valuable tool in my toolbox and I make heavy use of templates and automation when needed. I also create UML diagrams when I think it helps to understand the inner workings or to discuss architecture at the whiteboard. However, I strongly doubt that creating software via UML (creating state charts and sequence diagrams to produce working code, not just skeletons of code) is more efficient for multi-tier applications (database layer, business/domain layer, and a GUI, maybe even distributed). It seems to me that when it comes to MDSD, the CASE tooling suddenly isn't just a tool anymore but the thing to satisfy: as I see it, on the one hand MDSD developers profit from the higher abstraction UML gives them, but at the same time they struggle with modifying the code generator/template/engine to fulfill needs that might be easily implemented (and tested) with another tool out of their toolbox (Visual Studio, Eclipse, ...).
All this makes me wonder whether there has been a success story (success being that the product was rolled out on time, within budget, and with only a few bugs, and parts of the software were reused later on) for a real-world application which fulfills these criteria and has been developed using a strict model-driven approach:
it has nothing to do with the Object Management Group (OMG) or with consultants related to MDSD/MDA/SOA/
the application is not related to Business Process Modelling and is not a CASE tool itself
the application is actively used by end users
it has at least three tiers, including a user interface which goes beyond displaying raw table values and is not one of the common MDA/MDSD examples ("how to model a coffee machine, traffic light, dishwasher").
A tiny, but nevertheless useful testimonial on the use of MDSD has been posted on the Model Driven Software Network:
http://www.modeldrivensoftware.net/profiles/blogs/viva-mdd-follow-up-building-a?xg_source=activity
It is a relatively small app being developed, but still a good example of MDSD in action.
More success stories are listed at Metacase's site (http://www.metacase.com/cases/index.html). Metacase sells MetaEdit+, which implements DSM (Domain-Specific Modeling). DSM is just a form of MDSD.
I am also developing ABSE (Atom-Based Software Engineering), another form of MDSD, very close to DSM. ABSE is outlined at http://www.abse.info.
I used MDA and code generation on an embedded system project using 4 processors connected via CAN. We had over 20 axes of motion and many, many sensors. The system was highly robust and maintainable as the mechanical components were evaluated and modified.
We worked in the models and generated code so the models were always up-to-date. We did a careful domain analysis to achieve subject matter isolation. The motor control required very high performance and so was not modeled or generated. Our network drivers were also hand-coded, and we wrote interfaces that allowed bridge services to send events to any service anywhere in the system as needed (although this was tightly controlled so as to minimize interprocessor dependencies).
Using the method took a bit of discipline, but having working models was great because they can be reviewed by non-software types.
Version control and differencing of the models was a bit of a challenge but we had a small, localized team so we were able to avoid merge issues.
The good people at Pathfinder Solutions (our tool vendor) can help mentor you through the project.
You could also take a look at the slides from previous Code Generation conferences. Several of these talks were from successful case studies e.g. http://www.codegeneration.net/cg2009/slides.php
I am working on a legacy modernization project using an MDA tool named Bluage. It's for a big healthcare organization and it's in production, so I could say that it's successful. MDA is a good fit for legacy modernization, as it can generate a KDM model from technologies such as Pacbase which are going out of support.
I worked on a MDSD system that generated admin style web apps in Google Closure. I believe that your question is compelling. Too much complexity and your MDSD system is too hard to use. Too simple and you won't generate apps that are useful in the real world. Where MDSD really shines is in saving developer time typing lots of plumbing style code but how can MDSD remain effective over multiple releases? Requirements can go in many directions. That is the real challenge. I recently blogged about my MDSD lessons learned on that project.
What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?
A few ideas germane to this discussion:
Knowing SQL and the use of a DB such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with their manipulation.
What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.
Thoughts?
To quote from the intro to Hadley's PhD thesis:
"First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future."
Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)
Step 2 means visualisation/ plotting skills.
Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.
The final step is mostly about soft skills like introspection and management-type skills.
Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
Just to throw in some ideas for others to expound upon:
At some ridiculously high level of abstraction all data work involves the following steps:
Data Collection
Data Storage/Retrieval
Data Manipulation/Synthesis/Modeling
Result Reporting
Story Telling
At a minimum a data scientist should have at least some skills in each of these areas. But depending on specialty one might spend a lot more time in a limited range.
JD's are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:
Skill #1: Statistics (Studying)
Skill #2: Data Munging (Suffering)
Skill #3: Visualization (Story telling)
At dataist the question is addressed in a general way with a nice Venn diagram.
JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.
The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.
I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
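As a small, hedged illustration of "getting data out of storage": a plain JDBC pull from a SQL Server table before handing the numbers to whatever analytic tool you use. The connection string, table, and column names are made up for the example; DB/2 or Oracle would only change the driver and URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class TradeExtract {
    public static void main(String[] args) throws Exception {
        // Hypothetical SQL Server connection string and query.
        String url = "jdbc:sqlserver://dbhost;databaseName=finance;user=analyst;password=secret";
        String sql = "SELECT trade_date, notional FROM trades WHERE trade_date >= '2010-01-01'";

        List<Double> notionals = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // In practice you would stream these into R, SAS, or SPSS;
                // here we just collect them to show the extraction step.
                notionals.add(rs.getDouble("notional"));
            }
        }
        System.out.println("Rows pulled: " + notionals.size());
    }
}
```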
In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.
Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.
Matrix algebra is my top pick
The ability to collaborate.
Great science, in almost any discipline, is rarely done by individuals these days.
There are several computer science topics that are useful for data scientists, many of them have been mentioned: distributed computing, operating systems, and databases.
Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you'll need.
Patience - both for getting results out in a reasonable fashion and then for being able to go back and change them to what was 'actually' required.
Study linear algebra with MIT OpenCourseWare 18.06 and supplement your study with the book "Introduction to Linear Algebra". Linear algebra is one of the essential skill sets in data analytics, in addition to the skills mentioned above.
Statistical analysis/programming is writing code. Whether descriptive or inferential, you write code to import data, clean it, analyse it, and compile a report.
Analyzing the data can involve many twists and turns of statistical procedures, and angles from which you look at your data. At the end, you have many files, with many lines of code, performing tasks on your data. Some of that code is reusable, and you encapsulate it as a "good to have" function.
This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.
From the Wikipedia article on Software development:
The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.
According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.
Which leads me to my question: what differences can you outline between the two activities?
It can be in terms of the technical aspects, the different strategies or work styles, and whatever else you think is relevant.
This question came to me from the following threads:
How do you combine "Revision Control" with "Workflow" for R?
How to organize large R programs?
Workflow for statistical analysis and report writing
As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.
A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.
Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:
Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
A more routine kind of analysis (e.g. a government annual survey such as the Census) could follow a more traditional methodology such as the waterfall approach, since it would be following a very clear set of steps that are mostly known in advance.
I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.
Statistical analysis in some sense requires even more good practices around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance so long as the program functions the way the specifications require. On the other hand, with scientific and statistical computing, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.
Perhaps the common denominator is "problem solving."
Beyond that, I doubt I could provide much insight, but I can at least offer a limited answer from personal experience.
This issue arises for us in hiring: do we hire a programmer and teach them statistics, or do we hire a statistics person and teach them to program? Ideally we could find someone fluent in both disciplines, and indeed that's the third net we cast, but rarely with any success.
Here's an example. The most stable distinction between the two activities (software dev & statistical analysis) is probably their respective outputs, or project deliverables. For instance, in my group someone is conducting the statistical analysis on the results of our split-path and factorial experiments (e.g., from the t-test results, whether the difference is significant, or whether the test ought to continue). That analysis will be sent to the marketing department which they'll use to modify the web pages comprising the Site with a view towards improving conversion. A second task involves the abstraction of and partial automation of those analyses so the results can be processed in near-real time.
For the first task, we'll assign a statistician; for the second, a programmer. The business problem we are trying to solve is the same for both tasks, yet for the first, the crux is statistics, for the second, the statistics problems have been largely solved and the crux is a core programming task (I/O).
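As a hedged sketch of that second, programmer-facing task: once the statistician has specified the test, automating it is mostly I/O plus a library call. Here the two-sample t-test comes from Apache Commons Math; the conversion-rate arrays and the 5% threshold are invented for illustration.

```java
import org.apache.commons.math3.stat.inference.TTest;

public class SplitTestCheck {
    public static void main(String[] args) {
        // Hypothetical conversion rates for two page variants; in the real
        // pipeline these would be read from the analytics store.
        double[] variantA = {0.031, 0.028, 0.035, 0.030, 0.033, 0.029, 0.032};
        double[] variantB = {0.036, 0.039, 0.034, 0.041, 0.038, 0.037, 0.040};

        TTest tTest = new TTest();
        double pValue = tTest.tTest(variantA, variantB);
        boolean significant = tTest.tTest(variantA, variantB, 0.05);

        // The "is the difference significant / should the test continue?"
        // decision, now produced automatically instead of by hand.
        System.out.printf("p-value = %.4f, significant at 5%%: %b%n", pValue, significant);
    }
}
```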
Notice also how the tools associated with the two activities have evolved, further obfuscating the distinction between software development and data analysis: mainstream development languages are being adapted for use as domain-specific analytical tools, while at the same time frameworks continue to be developed which enable non-developers to quickly build lightweight, task-oriented applications in DSLs.
For instance, Python, a general-purpose development language, has R bindings (RPy2) which, along with its native interactive interpreter (IDLE), substantially facilitate Python's use in statistical analysis. At the same time, there is a clear trend in R package development toward (web) application development: R bindings for Qt, gWidgetsWWW, and RApache are all R packages directed at client or web app development, and their initial releases were (I think) within the past 18 months. Aside from that, since at least the last quarter of last year, I've noticed an accelerating frequency of blog posts, presentations, etc. on the subject of web app development in R.
Finally, I wonder if your question is perhaps evidence of the growing popularity of R. Here's what I mean. A decade ago, when my employer purchased a site license, I began learning and using one of the major statistical computing products (no point here in saying which one, it begins with "S"). I found it unnatural and inflexible. Unlike Perl (which I was using at the time), this tool was not an extension of my brain (which isn't an optional attribute of an analytical tool; to me it's more or less the definition of one). Interacting with this system was more like using a vending machine: I selected some statistical function I wanted and then waited for the "output", which was often an impressive set of high-impact, full-color charts and tables. Nearly always, though, what I wanted was to modify my input or use that output for the next analytical step. That seemed to require another, separate trip to the vending machine. The fact that this tool was context-aware (i.e., it knew statistics) while Perl didn't, didn't compensate for the awkward interaction. Statistical analysis done this way would never be confused with software development. (Again, this is just a summary of my own experience; I don't claim it can be abstracted. It's also not a polemic against any (or all) commercial data analysis platforms - millions use them and they've earned zillions for the people who created them, so let's assume it was my own limitations that caused the failure to bond.)
I had never heard of R until about 18 months ago, and I only discovered it while scanning PyPI (the web interface to Python's external package repository) for statistics libraries for Python. There I came across RPy, which seemed brilliant but required a dependency called "R" (RPy of course is really just a set of Python bindings to R).
Perhaps R appeals to programmers and non-programmers equally; still, for a programmer/analyst, this was a godsend. It hit everything on my wish list for a data analysis platform: an engine based on a full-featured, general programming language (which in this case is a proven Scheme descendant), an underlying functional paradigm, a built-in interactive interpreter, native data types built from the ground up for data analysis, and the domain knowledge baked in. Data analysis became more like coding. Life was good.
If you are using R, then you'll likely be writing code to solve your statistical questions, so in this sense, statistical analysis is a subset of programming.
On the other hand, there are plenty of SPSS users who have never ventured beyond a bit of pointing and clicking to solve their stats problems. This feels less like programming to me.