Essential skills of a Data Scientist [closed]

What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?
A few ideas germane to this discussion:
Knowing SQL and a relational database such as MySQL or PostgreSQL was enough until the advent of NoSQL and non-relational databases; MongoDB, CouchDB, and the like are becoming popular for working with web-scale data.
Knowing a stats tool like R is enough for analysis, but building applications may require adding Java, Python, or other languages to the list.
Data now comes in the form of text, URLs, and multimedia, to name a few, and each has its own paradigms for manipulation.
What about cluster computing, parallel computing, the cloud, Amazon EC2, and Hadoop?
OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.
Thoughts?

To quote from the intro to Hadley's PhD thesis:
First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future
Step 1 almost certainly involves data munging, and may involve accessing databases or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)
Step 2 means visualisation/plotting skills.
Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.
The final step is mostly about soft skills like introspection and management-type skills.
Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
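For concreteness, a minimal base-R sketch of that munge/plot/model loop might look something like this (using the built-in mtcars data purely as an illustration):

    # Step 1: get the data into a workable form (a little munging)
    data(mtcars)
    mtcars$cyl <- factor(mtcars$cyl)

    # Step 2: plot the data to get a feel for what is going on
    plot(mpg ~ wt, data = mtcars,
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

    # Step 3: iterate between graphics and models
    fit <- lm(mpg ~ wt + cyl, data = mtcars)
    summary(fit)                              # a succinct quantitative summary
    abline(lm(mpg ~ wt, data = mtcars))       # overlay a simple fit on the plot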

Just to throw in some ideas for others to expound upon:
At some ridiculously high level of abstraction, all data work involves the following steps:
Data Collection
Data Storage/Retrieval
Data Manipulation/Synthesis/Modeling
Result Reporting
Story Telling
At a minimum, a data scientist should have at least some skill in each of these areas, but depending on specialty one might spend far more time on a narrower range.

JD's points are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:
Skill #1: Statistics (Studying)
Skill #2: Data Munging (Suffering)
Skill #3: Visualization (Story telling)

At dataist, the question is addressed in a general way with a nice Venn diagram.

JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.
The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.

I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
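As a rough sketch of what that looks like in practice - here assuming R with the RODBC package, and with the DSN, credentials, and table names all hypothetical:

    library(RODBC)

    # Connect through a hypothetical ODBC data source pointing at SQL Server
    ch <- odbcConnect("FinanceDW", uid = "analyst", pwd = "secret")

    # Pull an aggregated extract straight into an R data frame
    trades <- sqlQuery(ch, "
        SELECT trade_date, desk, SUM(notional) AS notional
        FROM   dbo.trades                     -- hypothetical table
        WHERE  trade_date >= '2011-01-01'
        GROUP  BY trade_date, desk")

    odbcClose(ch)
    head(trades)   # the data now lives in an ordinary data frame, ready for analysis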
In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.
Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.

Matrix algebra is my top pick.

The ability to collaborate.
Great science, in almost any discipline, is rarely done by individuals these days.

There are several computer science topics that are useful for data scientists; many of them have already been mentioned: distributed computing, operating systems, and databases.
Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you require.
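As a small, hedged illustration in R of the "how much RAM" part - a dense numeric object costs roughly 8 bytes per cell, so you can size a data set before you ever try to load it:

    # Back-of-envelope estimate: 1e6 rows x 50 numeric columns
    n <- 1e6; p <- 50
    n * p * 8 / 2^30                       # ~0.37 GB needed just to hold the matrix

    # Check the arithmetic on something small enough to allocate
    x <- matrix(0, nrow = 1e5, ncol = 50)
    print(object.size(x), units = "MB")    # ~38 MB, matching 1e5 * 50 * 8 bytes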

Patience - both for getting results out in a reasonable time, and then for being able to go back and change them to what was 'actually' required.

Study linear algebra with MIT OpenCourseWare 18.06 and supplement it with the book "Introduction to Linear Algebra". Linear algebra is one of the essential skill sets in data analysis, in addition to the skills mentioned above.
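To make the connection to data analysis concrete, here is a small R sketch (illustrative only) showing that ordinary least squares is just matrix algebra:

    set.seed(1)
    X <- cbind(1, rnorm(100))          # design matrix with an intercept column
    y <- X %*% c(2, 3) + rnorm(100)    # simulated response with known coefficients

    solve(t(X) %*% X, t(X) %*% y)      # normal equations: (X'X)^(-1) X'y
    coef(lm(y ~ X[, 2]))               # lm() recovers the same coefficients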

Related

How to get started in Big Data and Web Analytics [closed]

I'm currently interested in working in and studying Big Data analytics and web analytics, but I don't know how and where to get started. I tried looking on the Internet, but some of the material is too advanced for me. Are there any skills or knowledge in statistics and mathematics that I need before going down this route?
My current plan is to attend online courses every weekend, since I'm currently working as an Associate Software Engineer during weekdays, and to practice programming languages used for Big Data, like R. I already have a degree in Computer Science, so familiarity with some statistical and mathematical methods is not a problem. Any suggestions and comments are much appreciated!
For those who already have experience, how has your experience been, and what do you work with most?
I am in a similar boat as you. I work in a web development department as a business analyst. I do some software development, data mining, and data visualization, but I am constantly improving my skills because it's all pretty interesting to me, and it makes me an extremely versatile employee.
Web Analytics/Big Data
See if you can get read access to your company's Google Analytics account, assuming they have a website. The API is really good, and pre-built packages in R make it easy to get large amounts of data out. If the website is big enough, you can easily create your own real data sets. While these probably won't be "big" as in "big data", they're definitely great for practicing data visualizations. I'd suggest learning Shiny and R Markdown; you can easily create web-stats visualizations you can share with your company. If you run into issues with the amount of data you're trying to process (i.e., if they have a huge web presence), then you might look into Spark for processing big data. Coursera has a specialization focusing on Big Data - https://www.coursera.org/specializations/big-data. You can take all the classes for free if you just "audit" them. You won't get a certificate or anything, but you get access to all the course material. They apparently go through Spark, Hadoop, Pig, and Hive. I haven't taken it, but the UCSD Coursera classes I have taken have been pretty good.
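To give a flavour of the Shiny suggestion above, here is a minimal sketch of a web-stats dashboard; the traffic data is simulated, since in practice it would come out of the Google Analytics API:

    library(shiny)

    # Fake daily visit counts standing in for real analytics data
    traffic <- data.frame(date   = Sys.Date() - 29:0,
                          visits = rpois(30, lambda = 500))

    ui <- fluidPage(
      titlePanel("Daily visits (last 30 days)"),
      plotOutput("visitsPlot")
    )

    server <- function(input, output) {
      output$visitsPlot <- renderPlot({
        plot(traffic$date, traffic$visits, type = "l",
             xlab = "Date", ylab = "Visits")
      })
    }

    shinyApp(ui = ui, server = server)   # launches the app in a browser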
Obviously Coursera isn't the end-all-be-all... Also check out edx.org, Pluralsight, Udemy, etc... You can get a free Pluralsight membership for a year - just Google it. Mine was through Microsoft somehow. My favorite courses by Pluralsight have been (unrelated to data/analytics) Ethical Hacking. Udemy often has amazing deals on HUGE courses - like 21 hours of lectures about Python for data analysis and stuff like that. Just sign up for the service, and you'll get a "special offer" in a week or two. They're usually $10-20. https://www.brighttalk.com/ is also a good place for webinars and talks related to data science/analytics.
Databases
My company uses SQL Server (Microsoft), so I also took some database classes on MVA (Microsoft Virtual Academy). They have a bunch of classes from complete noob to brushing up on skills: MVA Database Stuff.
Data Sets
If you find yourself needing big data sets, join Kaggle. They often have great data sets for machine learning, but you can use them yourself to mine and visualize. I'd look for labelled data sets in particular; many of the bigger sets are completely anonymized - no labels, no nothin' - and that's not very fun if you're just digging around. Additionally, someone has compiled a bunch of public data sources here: https://github.com/caesar0301/awesome-public-datasets. Finally, NYC Open Data is one of my favorite places to get neat data sets. Some are super boring, but there have been some cool analyses done on parking tickets and the like.
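Once you've downloaded something, the "just dig around" workflow in R is short; the file and column names below are hypothetical stand-ins for whatever you grab:

    # Hypothetical CSV exported from NYC Open Data or Kaggle
    tickets <- read.csv("parking_violations.csv", stringsAsFactors = FALSE)

    str(tickets)       # what columns did we actually get?
    summary(tickets)   # quick sanity check on ranges and missing values

    # Hypothetical column: which violation codes show up most often?
    head(sort(table(tickets$violation_code), decreasing = TRUE))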
More...
If you're just looking for more classes to take or books to read, check out https://www.metacademy.org/. They have a few suggested paths to learn deep learning, machine learning, Bayesian stats, and other stuff like that. I think machine learning is an excellent next step - once you're versed in software development, database management/creation/querying, and visualization.
Even more...
Just immerse yourself. There are TONS of data blogs, podcasts, meetup groups, conferences, and news out there. Do all you can to get in there and figure out what's going on and who's doing what. It's super interesting anyway. Two of my favorite things I follow: DataTau (Hacker News for data science) and I Quant NY (linked above, for parking tickets).

R vs Pentaho Spoon as an ETL tool [closed]

Background (sorry it's so long):
I've been tasked with maintaining an ETL that collects a variety of online advertising data, around 20-30 MB a day, and appends it to tables in MySQL. Outside contractors built the ETL with Pentaho Spoon (Kitchen, Kettle?). The ETL consists of about 250 jobs and transformations (.ktr, .kjb), each with about 5 to 25 steps. It is very common for something to go wrong in this large process. I've found that writing R scripts to do the transform and load is much more efficient. In fact, I think the ETL could be reduced to well under 1000 lines of code, aside from the RMySQL calls (i.e. plyr!). Perhaps Python would be used to extract the data from the web.
My use of R has led to some resistance. The programmers who designed the ETL don't know R, so they couldn't be called on if I leave, and moreover a lot of time was invested in the Spoon ETL. Also, a layman can more easily follow the steps visually in Spoon than in the R scripts. For my part, I think we are getting bogged down by the ETL. However, I don't have a large say in the matter, as I don't have a background in computer science.
Please comment if you have any insights on the following. Please know I have been researching this for months and have read many opinions, but nothing as concise or reliable as SO usually provides:
R has been called not as scalable by some at the company. I think the opposite, mostly because of the logging capabilities. Spoon has limited pure logging output, whereas all R scripts can be sink()ed into a daily log. Fixing and avoiding mistakes in the .ktrs is very tedious, but it is easy when setting flags and/or searching through the R log. Any thoughts on this? (See the sketch at the end of this question for what I mean by a logged R load step.)
This leads to a big-picture question. What is the point of ETL tools like Pentaho? This post, Do I need an ETL?, leads me to believe that if you use R or another so-called OOL, there is no reason to have a tool like Pentaho. Can someone please confirm this if so? I really need a second opinion here. If this is so, who uses tools like Pentaho? Is it simply people without a programming background, or someone else? I do see a fair amount of Pentaho questions on SO.
It is true that a lot more people use R than Pentaho, right? This http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html makes it look so. To be honest I was surprised that Pentaho was 5th, which makes me doubly wonder who uses Pentaho and whether my doubts about its use in my work setting are misplaced.
Thanks for any responses. I don't mean any condescension towards Spoon or Spoon users; I am just really confused and in need of outside opinions.
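For reference, a rough sketch of the kind of logged R load step I have in mind (connection details, table, and file are hypothetical):

    library(RMySQL)

    # Everything printed below also goes into a dated log file
    log_file <- format(Sys.Date(), "etl_%Y-%m-%d.log")
    sink(log_file, split = TRUE)

    con <- dbConnect(MySQL(), dbname = "adstats", host = "db.example.com",
                     user = "etl", password = "secret")

    daily <- read.csv("ad_clicks_today.csv")   # hypothetical daily extract
    cat(format(Sys.time()), "- loaded", nrow(daily), "rows from feed\n")

    ok <- dbWriteTable(con, "clicks", daily, append = TRUE, row.names = FALSE)
    cat(format(Sys.time()), "- append to clicks:", if (ok) "OK" else "FAILED", "\n")

    dbDisconnect(con)
    sink()                                     # close the daily log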
R as an ETL tool? That's a new one, but whatever floats your boat.
I would say this though: if you can get 250 jobs and transformations down to under 1000 lines of R, I would say your ETL is poorly written.
Along with this, you have to think about supportability and scalability, both of which I would imagine would be far easier with a graphical tool like Spoon than with R code.
Personally I think you are misguided and the question you ask is poorly written, but that's a different argument.
Regarding your points, PDI's logging is very good and you can log pretty much however you like, all into one large database table if you like a consolidated log.
ETLs won't be going away, even with the advent of the love for unstructured data storage pools like HDFS. Also think about data analysis done outside R: if you want reporting or OLAP on top of your data, it will still need transforming regardless.
Is it true that more people use R vs Pentaho? What sort of question is that? By Pentaho I assume you mean PDI? How can that ever be compared? A data analysis tool vs an ETL tool, and you want to count users? Eh? If on the other hand you mean R vs Pentaho as a whole, then I would guess no. You are looking at a report on R vs Weka and making it fit your ETL argument. That doesn't wash in a month of Sundays.
==EDIT==
Okay, so you have around 1000 lines of R and Python code currently. As your bosses' requirements expand, this slowly grows over time, and because you are trying to hit deadlines, the new code is written no more cleanly or better documented than the code you currently have in place. So over time this grows to, say, 5000 lines plus a few Python scripts. Then one day you get hit by a bus, and some new person has to come in and manage your code... where do they start, and how do they make changes?
Virtually anyone with a modicum of data experience could make a change to a PDI ETL should they be required to, whereas it would take someone with enough in-depth R knowledge to make changes to what you have done.
ETL tools are designed to be quick and easy to use, and they also offer far more than R can provide in terms of data connectivity to different systems (non-DB or non-file-based, for example), although I guess this is why people resort to Python etc.
That said, there is room for both; there is an R plugin for PDI kicking around in the community that I've seen demonstrated.
On top of that, I've seen enough T-SQL-to-ETL migrations over the years to know from experience that even though maintaining your ETL in code may seem practical in the short term, in the long term it just brings more pain.
On the other hand, if you can code 250 PDI transformations down to 1000 lines of R, your ETL is likely bloated through bad design by your predecessor.
If you'd like me to give an opinion on your existing PDI ETL structure, that can also be arranged.
Tom

Programmer understanding academic writings on maths [closed]

I would like to know in which order I should learn different areas of maths so I can have a robust overview of all the theory in case I need something for a computer programming problem.
So I've created this mind map.
I do not intend to know all the small details of how to do a certain thing (e.g. Gauss-Jordan reduction); I would rather look over an example, and then do it with math software like SageMath or Mathematica.
I would like to know, for instance, how to get to a Taylor series given an analytic function (I know this already; I am merely illustrating the kind of knowledge depth I expect).
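To make that concrete, the sort of formula I mean is the Taylor expansion of a function f about a point a:

    f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x-a)^n
         = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots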
So all in all, I want to be able to read academic articles about maths that have applicability in computer science/programming, and actually understand something from those articles, so I can use that knowledge to solve actual programming problems.
The open question is:
(a) In what order do you suggest learning about these areas, and which areas should I focus on more?
(b) Do you see any missing areas in the mind map?
I was recommended this book in a data structures and algorithms class a few years ago. It covers a lot of relevant areas (probability, graphs, counting, relations, etc.) and it's free! :-)
If you want to be serious, you should take a graduate-level course in computer science at a university. There is no replacement for this.
You must know basic set theory, big O, basic data structures, basic real analysis. I suggest looking at Cormen et al. 'Introduction to Algorithms', and/or Manber's 'Introduction to Algorithms: A Creative Approach'. For number theory, check Victor Shoup's book - contains far too much, but it is readable at any level.
I suggest not bothering at all with topics such as: complex analysis, functional analysis, projective/inversive geometry, control theory, mathematical physics, until you know you need them.
Instead of making your graph bigger and bigger, make it as small as possible.
There is a good book that I think would help you get more out of computer science research papers and dissertations. It's called "Concrete Mathematics: A Foundation for Computer Science", and it's available on Amazon:
http://www.amazon.com/Concrete-Mathematics-Foundation-Computer-Science/dp/0201558025/ref=sr_1_1?s=books&ie=UTF8&qid=1341081763&sr=1-1&keywords=math+computer+science
I think this would help because it will all be relevant, and it's consolidated, which will help expedite the learning process.
Even if you don't have any money, just Google it and take a look at the index to get an idea of what areas you might want to learn.
And here's one more interesting book.
This is almost impossible to answer, as many programming tasks require no mathematical knowledge (other than counting and basic logic) at all. If you have specific interests in an area (such as numerical linear algebra or statistics, for example), then start there. Still, what I would suggest is getting a good grasp of how finite-precision arithmetic works. Perhaps read an introductory text on numerical analysis, as this will give a good understanding of what numerical stability is. A good book on the analysis of algorithms (with regard to speed and efficiency) would do no harm, and if you plan on doing any kind of mathematical programming, getting a sound knowledge of linear algebra is a very good idea. Almost everything in applied mathematics reduces to ultimately solving a linear system of equations (or doing so iteratively).
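A quick R illustration of the finite-precision point, in case it is unfamiliar:

    0.1 + 0.2 == 0.3            # FALSE: binary floating point cannot represent these exactly
    abs(0.1 + 0.2 - 0.3)        # about 5.6e-17, the size of the rounding error
    all.equal(0.1 + 0.2, 0.3)   # TRUE: compare numbers with a tolerance instead
    .Machine$double.eps         # machine epsilon, roughly 2.2e-16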
There is no right answer to be honest.

How to make the application intelligent to grade answers?

I am adding a feature to an application in which students answer questions that are descriptive in nature. I am curious to know if there's a way to make the system "smart" enough to grade these answers. Of course, I can run the answers through a set of keywords to ensure that the student has at least included the keywords in the answer, but obviously this is not smart enough.
I know there's no foolproof way of grading descriptive answers, but I was wondering if there are any technologies out there that I can look into.
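For what it's worth, the keyword check the question describes is only a few lines; the sketch below (in R, with made-up keywords and answer text) also shows why it isn't "smart" - it has no notion of meaning, negation, or synonyms:

    keywords <- c("photosynthesis", "chlorophyll", "sunlight", "glucose")
    answer   <- "Plants use sunlight and chlorophyll to make glucose."

    # Which keywords appear anywhere in the (lower-cased) answer?
    hits  <- sapply(keywords, grepl, x = tolower(answer), fixed = TRUE)
    score <- sum(hits) / length(keywords)
    score   # 0.75 -- three of the four keywords are present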
You could use Mechanical Turk, which is essentially an API for humans - and probably as far as you can get with "AI'ing" your system. Understanding and grading actual text is one of the last remaining problems where humans are way better than computers (i.e. computers suck).
One notable exception is Watson which is actually really good at Jeopardy, but it runs on a huge computing cluster and includes some serious optimizations and smarts. That's nothing you just turn on. Sorry...
The answer is not so simple. There are "automated grading systems" out there, used, I believe, to grade GRE exams, for example. See this paper and this one by ETS.

How does software development compare with statistical programming/analysis? [closed]

Statistical analysis/programming is writing code. Whether for descriptive or inferential work, you write code to import data, clean it, analyse it, and compile a report.
Analyzing the data can involve many twists and turns of statistical procedures, and many angles from which you look at your data. At the end, you have many files, with many lines of code, performing tasks on your data. Some of it is reusable, and you encapsulate it as a "good to have" function.
This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.
From the Wikipedia article on Software development:
The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.
According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.
Which leads me to my question: what differences can you outline between the two activities?
It can be in terms of technical aspects, different strategies or work styles, or whatever else you think is relevant.
This question came to me from the following threads:
How do you combine "Revision Control" with "Workflow" for R?
How to organize large R programs?
Workflow for statistical analysis and report writing
As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.
A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.
Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:
Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
A more routine kind of analysis (e.g. an annual government survey such as the Census) could follow a more traditional methodology such as the waterfall approach, since it would be following a very clear set of steps that are mostly known in advance.
I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.
Statistical analysis in some sense requires even more good practices around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance so long as the program functions the way the specifications require. On the other hand, with scientific and statistical computing, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.
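As a hedged illustration of what literate programming with Sweave looks like, a minimal .Rnw document might be sketched like this (the model and text are made up):

    % analysis.Rnw: LaTeX text with embedded R chunks, so every number in the
    % report is recomputed from the data when the document is rebuilt.
    \documentclass{article}
    \begin{document}

    <<model, echo=FALSE>>=
    fit <- lm(mpg ~ wt, data = mtcars)
    @

    Each additional 1000\,lbs of weight is associated with a change of
    \Sexpr{round(coef(fit)["wt"], 2)} miles per gallon.

    \end{document}

Running Sweave() on the file produces a .tex file with the results filled in, which then compiles as usual, so the report can never drift out of sync with the analysis.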
Perhaps the common denominator is "problem solving."
Beyond that, I doubt I could provide any insight, but I can at least provide a limited answer from personal experience.
This issue arises for us in hiring--i.e., do we hire a programmer and teach them statistics, or do we hire a statistics person and teach them to program? Ideally we could find someone fluent in both disciplines, and indeed that's the third net we cast, but rarely with any success.
Here's an example. The most stable distinction between the two activities (software dev & statistical analysis) is probably their respective outputs, or project deliverables. For instance, in my group someone is conducting the statistical analysis on the results of our split-path and factorial experiments (e.g., from the t-test results, whether the difference is significant, or whether the test ought to continue). That analysis will be sent to the marketing department, which will use it to modify the web pages comprising the site with a view towards improving conversion. A second task involves the abstraction and partial automation of those analyses so the results can be processed in near-real time.
For the first task, we'll assign a statistician; for the second, a programmer. The business problem we are trying to solve is the same for both tasks, yet for the first, the crux is statistics, for the second, the statistics problems have been largely solved and the crux is a core programming task (I/O).
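For a flavour of the first task, a split test of two page variants often comes down to a few lines of R (the counts below are made up):

    conversions <- c(A = 180, B = 210)    # converted visitors per variant
    visitors    <- c(A = 5000, B = 5000)  # visitors shown each variant

    # Two-sample test of equal conversion rates
    prop.test(conversions, visitors)
    # A small p-value suggests the difference is real; otherwise keep the test running.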
Notice also how the tools associated with the two activities have evolved, so that the distinction between the two (software dev & data analysis) is further obfuscated: mainstream development languages are being adapted for use as domain-specific analytical tools, while at the same time frameworks continue to be developed which enable non-developers to quickly build lightweight, task-oriented applications in DSLs.
For instance, Python, a general-purpose development language, has R bindings (RPy2) which, along with its native interactive interpreter (IDLE), substantially facilitate Python's use in statistical analysis. At the same time, there is a clear trend in R package development toward (web) application development: R bindings for Qt, gWidgetsWWW, and RApache are all R packages directed at client or web-app development, and whose initial release was (I think) within the past 18 months. Aside from that, since at least the last quarter of last year, I've noticed an accelerating frequency of blog posts, presentations, etc. on the subject of web app development in R.
Finally, I wonder if your question is perhaps evidence of the growing popularity of R. Here's what I mean. A decade ago, when my employer purchased a site license, I began learning and using one of the major statistical computing products (no point here in saying which one; it begins with "S"). I found it unnatural and inflexible. Unlike Perl (which I was using at the time), this tool was not an extension of my brain (which isn't an optional attribute of an analytical tool; to me it's more or less the definition of one). Interacting with this system was more like using a vending machine: I selected some statistical function I wanted and then waited for the "output", which was often an impressive set of high-impact, full-color charts and tables. Nearly always, though, what I wanted was to modify my input or use that output for the next analytical step. That seemed to require another, separate trip to the vending machine. The fact that this tool was context-aware--i.e., it knew statistics--while Perl didn't, didn't compensate for the awkward interaction. Statistical analysis done this way would never be confused with software development. (Again, this is just a summary of my own experience; I don't claim it can be generalized. It's also not a polemic against any (or all) commercial data analysis platforms--millions use them and they've earned zillions for the people who created them, so let's assume it was my own limitations that caused the failure to bond.)
I had never heard of R until about 18 months ago, and I only discovered it while scanning PyPI (the web interface to Python's external package repository) for statistics libraries for Python. There I came across RPy, which seemed brilliant but required a dependency called "R" (RPy, of course, is really just a set of Python bindings to R).
Perhaps R appeals to programmers and non-programmers equally; still, for a programmer/analyst, this was a godsend. It hit everything on my wish list for a data analysis platform: an engine based on a full-featured, general programming language (which in this case is a proven Scheme descendant), an underlying functional paradigm, a built-in interactive interpreter, native data types built from the ground up for data analysis, and the domain knowledge baked in. Data analysis became more like coding. Life was good.
If you are using R, then you'll likely be writing code to solve your statistical questions, so in this sense, statistical analysis is a subset of programming.
On the other hand, there are plenty of SPSS users who have never ventured beyond a bit of pointing and clicking to solve their stats problems. That feels less like programming to me.
