I am just curious, since in a Scrum-enabled development process the Team Leader (TL) role is merged with the Scrum Master (SM) role and the team is supposed to be self-motivated, self-organised and self-driven.
Without a Team Leader (and architect!) inside the team, who makes future-proof implementation decisions? Say, which libraries to use, how to handle client data, etc.
Based on the Scrum concept, such decisions should be made by the team collaboratively or by an individual team member. Chances are they will not be qualified or experienced enough and the decision could cost millions in the years to come (even in a one- or two-year perspective).
How is this addressed in Scrum, please?
Thanks
Scrum is only the delivery process. Technical decisions should be made by the relevant responsible people; those decisions may come out of discussions and team meetings, and team members can also participate in those meetings if required. The role of the Scrum Master is to make sure the backlog items taken into the sprint are delivered; it has nothing to do with technical decision making. (The Scrum Master is responsible for removing blockers and keeping the team moving towards the sprint goal.) The team can allocate time for architectural decision making or a design/redesign process when they take in a task during the planning meeting.
For example, the TL may have a four-hour design task on the Scrum board. If this task involves a meeting with the architect, and the architect is offsite, the Scrum Master should take the necessary actions to schedule a meeting and make sure this blocker is removed; maybe he can arrange a call between them.
Also, it is wrong to think the technical leader should be the Scrum Master. Those are two separate roles and may or may not be held by the same person.
In Scrum there are three roles defined:
Scrum Master
Product Owner
Development Team
The Scrum Master helps the team to follow Scrum and to remove impediments. The Product Owner looks after the backlog of work and prioritises things.
The Development Team does everything else. There are no defined roles in the Development Team, but instead we have capabilities.
For example, the Development Team has a capability to do development. It also has a capability to do testing.
There is nothing to stop a Development Team having a capability to do architecture. For example, it might have a team member who is from an architecture background or perhaps several team members have architecture experience.
As for the Team Leader role, there is no need for that as everyone in Scrum can act as a leader. For example, if somebody in the team is really experienced in database work, they might show leadership when the team is working on database work items.
who makes future-proof implementation decisions?
The team does.
An experienced Scrum team will get together regularly to talk about 'the big picture'. They will think about the future of the implementation, its architecture, and so on.
which libraries to use, how to handle client data, etc.
Who has experience making decisions on which libraries to use? They might be a good person to suggest an approach. The team can then discuss it and agree on one. The same goes for client data-handling.
This is collaboration. Everyone has a voice and everyone is a potential leader.
Chances are they will not be qualified or experienced enough and the decision could cost millions in the years to come
If the team is concerned that they don't have enough experience to make important decisions then they should raise that as an issue. Some possible solutions include:
Get additional training for team members
Bring somebody into the team with more experience
Use communities of practice to get advice
The Agile Manifesto directly addresses your question in the Principle that mentions self-organizing teams: "The best architectures, requirements, and designs emerge from self-organizing teams." Relative to the other well-known Agile methods, Scrum perhaps does the best job of trusting "motivated individuals" (per another Principle) to do so, using what scientists call the "group mind." My coaching practice is based on my review of scientific studies related to teamwork. These have shown repeatedly that cross-functional small groups with diverse backgrounds will out-perform a single expert in decision quality over time. Scrum facilitates this by having people plan everybody's work together and holding them accountable to the group through the scrums and Demos.
That doesn't mean you can't have a technical expert like an architect on the team or providing input to the team. I agree with Leni that it's best not to have the technical expert be the SM, as that consolidates too much social power in one role and places further psychological blocks to true self-organization. (Having a full-time SM prevents real self-organization from the social psychology perspective, so I have members rotate the role.) But those experts can serve as advisors by coordinating bottom-up standards development across teams, providing input into user stories, sitting in on various teams' Planning Ceremonies and Demos, etc.
I have two datasets, one with details of contracts and the other with details of organizations. For example, one dataset has the fields company name, description, and company type; the other has contract name, contract description, and CPV code.
I want an algorithm that can: 1) given a company, find the top 10 contracts that are most closely related or potentially interesting to this company; 2) given a contract, find the companies most likely to bid on or win the contract.
This might be a one-off, real-time algorithm that matches one row of the first dataset to a best-match cluster in the second dataset.
Is it possible to do this type of row by row cross matching in two different datasets? Is it possible to use text descriptions for this kind of matching?
It would be of great help if someone has code examples. Thank you.
I am also attaching example datasets here.
Company data
Contract data
Your question is effectively "Will someone do ~10K worth of data science for me for free?" What you are looking for is a recommender system, and more specifically a content-based filtering system. For these to work, you are going to have to look at your two datasets and develop features that can be used to quantitatively describe the contracts and the clients. If you have information about previous contracts the organizations were interested in, you can use a hybrid algorithm that incorporates aspects of collaborative filtering.
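For the content-based route, a minimal sketch in Python with scikit-learn might look like the following. Note that the file and column names are my assumptions based on the fields listed in the question, not the actual attached datasets:

    # Content-based matching: TF-IDF vectors + cosine similarity.
    # File and column names are hypothetical, based on the question's fields.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    companies = pd.read_csv("companies.csv")
    contracts = pd.read_csv("contracts.csv")

    # Fit one vocabulary over both corpora so the vectors live in the same space.
    texts = pd.concat([companies["description"].fillna(""),
                       contracts["contract_description"].fillna("")])
    vectorizer = TfidfVectorizer(stop_words="english").fit(texts)

    company_vecs = vectorizer.transform(companies["description"].fillna(""))
    contract_vecs = vectorizer.transform(contracts["contract_description"].fillna(""))

    # Question 1: top 10 contracts for the company in row i.
    i = 0
    scores = cosine_similarity(company_vecs[i], contract_vecs).ravel()
    print(contracts.iloc[scores.argsort()[::-1][:10]])

    # Question 2 is the same computation in the other direction:
    # cosine_similarity(contract_vecs[j], company_vecs) ranks companies for contract j.

This also shows that your two sub-questions are really one problem: a single company-by-contract similarity matrix answers both directions.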
R has a package, recommenderlab, that can help you work on these types of problems. I haven't used it, but skimming over it, it seems solid. If you want something a little more plug-and-play, though with fewer options, I would recommend checking out AzureML. It uses GUI interfaces to guide users through the data science process, including a recommender tutorial. You may also be able to use some of their text classifier tutorial to help engineer features from your free-form text fields.
Best of luck.
Besides BM25, what other ranking functions exist? Where can I find information on this topic?
BM25 is one of the term-based ranking algorithms; nowadays there are concept-based algorithms as well.
BM25 is the state of the art in term-based information retrieval; however, there are some challenges that term-based models cannot overcome, such as relating synonyms, matching abbreviations, or recognizing homonyms.
Here are some examples:
synonym: "buy" and "purchase"
antonym: "Professor" and "Prof."
homonym:
bow – a long wooden stick with horse hair that is used to play certain string instruments such as the violin
bow – to bend forward at the waist in respect (e.g. "bow down")
To deal with these problems, some researchers use concept-based models, such as those described in this article and this article.
Concept-based models mostly use dictionaries or external terminologies to identify concepts, and each has its own representation of concepts and weighting algorithms.
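Since the question asks about ranking functions, here is a minimal sketch of the classic Okapi BM25 formula itself (my own illustration, not from either linked article; k1 = 1.5 and b = 0.75 are the usual textbook defaults, and the +1 inside the log is a common smoothing that keeps the IDF non-negative):

    # Okapi BM25: score one document against a query, given the whole corpus.
    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, docs, k1=1.5, b=0.75):
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N    # average document length
        tf = Counter(doc_terms)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in docs if term in d)             # document frequency
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # smoothed IDF
            f = tf[term]                                       # term frequency in doc
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
        return score

    docs = [["buy", "a", "violin", "bow"], ["bow", "down", "in", "respect"]]
    print(bm25_score(["violin", "bow"], docs[0], docs))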
Vanilla tf-idf is what is often used. If you want to learn about these things, the best place to start is this book.
What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?
A few ideas germane to this discussion:
Knowing SQL and a DB such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases; MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with their manipulation.
What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.
Thoughts?
To quote from the intro to Hadley's PhD thesis:
"First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future."
Step 1 almost certainly involves data munging, and may involve database access or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)
Step 2 means visualisation/plotting skills.
Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.
The final step is mostly about soft skills like introspection and management-type skills.
Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
Just to throw in some ideas for others to expound upon:
At some ridiculously high level of abstraction all data work involves the following steps:
Data Collection
Data Storage/Retrieval
Data Manipulation/Synthesis/Modeling
Result Reporting
Story Telling
At a minimum a data scientist should have at least some skill in each of these areas, but depending on their specialty one might spend a lot more time in a limited range of them.
JD's points are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:
Skill #1: Statistics (Studying)
Skill #2: Data Munging (Suffering)
Skill #3: Visualization (Story telling)
At dataist the question is addressed in a general way with a nice Venn diagram.
JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.
The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.
I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code; you need to be able to get data out of storage and into your analytic tool.
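To make that concrete, here is a minimal sketch of pulling query results into an analysis tool, using Python's sqlite3 and pandas; the database file, table, and column names are invented for illustration:

    # Pull rows out of a SQL database into a pandas DataFrame for analysis.
    # "finance.db" and the trades table/columns are hypothetical.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("finance.db")
    df = pd.read_sql("SELECT trade_date, ticker, price FROM trades WHERE price > 100", conn)
    print(df.describe())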
In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.
Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.
Matrix algebra is my top pick
The ability to collaborate.
Great science, in almost any discipline, is rarely done by individuals these days.
There are several computer science topics that are useful for data scientists, many of them have been mentioned: distributed computing, operating systems, and databases.
Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you need.
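For example, a quick back-of-envelope RAM estimate for a dense feature matrix (the workload numbers below are made up for illustration):

    # RAM needed for a dense float64 matrix: rows * cols * 8 bytes.
    rows, cols, bytes_per_value = 10_000_000, 100, 8   # hypothetical workload
    print(rows * cols * bytes_per_value / 1e9, "GB")   # 8.0 GB: fits one big server, not a laptop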
Patience: both for getting results out in a reasonable fashion and then being able to go back and change them to what was 'actually' required.
Study linear algebra with MIT OpenCourseWare course 18.06 and supplement your study with the book Introduction to Linear Algebra. Linear algebra is one of the essential skill sets in data analytics, in addition to the skills mentioned above.
Assume you know a student who wants to study Machine Learning and Natural Language Processing.
What specific computer science subjects should they focus on and which programming languages are specifically designed to solve these types of problems?
I am not looking for your favorite subjects and tools, but rather industry standards.
Example: I'm guessing that knowing Prolog and Matlab might help them. They also might want to study Discrete Structures*, Calculus, and Statistics.
*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.
This related stackoverflow question has some nice answers: What are good starting points for someone interested in natural language processing?
This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific languages: Lisp was created "as an afterthought" for doing AI research, while Prolog (with its roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell), as they are very well suited to this kind of analysis.
Here are some more specific pointers:
For Machine Learning, Stanford CS 229: Machine Learning is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.
Note the prerequisites:
Students are expected to have the following background: knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program; familiarity with basic probability theory; familiarity with basic linear algebra.
The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):
Christopher Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
Richard Duda, Peter Hart and David Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course Stanford CS 224: Natural Language Processing includes all the lectures online and has the following prerequisites:
Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.
Some recommended texts are:
Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Second Edition. Prentice Hall.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
James Allen. 1995. Natural Language Understanding. Benjamin/Cummings, 2nd ed.
Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in Prolog. Addison-Wesley. (this is available online for free)
Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.
The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same textbooks. The required artificial intelligence course is also available online, along with all the lecture notes, and uses:
S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Second Edition
This is the standard Artificial Intelligence text and is also worth reading.
I use R for machine learning myself and really recommend it. For this, I would suggest looking at The Elements of Statistical Learning, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.
My recommendation would be any or all of these (depending on their amount and area of interest):
The Oxford Handbook of Computational Linguistics
Foundations of Statistical Natural Language Processing
Introduction to Information Retrieval
String algorithms, including suffix trees. Calculus and linear algebra. Various branches of statistics. Artificial intelligence optimization algorithms. Data clustering techniques... and a million other things. This is a very active field right now, depending on what you intend to do.
It doesn't really matter what language you choose to operate in. Python, for instance, has the NLTK, which is a pretty nice free package for tinkering with computational linguistics.
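As a taste of how low the barrier to entry is, a minimal NLTK session (assuming nltk is installed and the standard data packages are downloaded, as below) looks like this:

    # Tokenize and part-of-speech tag a sentence with NLTK.
    import nltk
    nltk.download("punkt")                        # tokenizer models
    nltk.download("averaged_perceptron_tagger")   # POS tagger models

    tokens = nltk.word_tokenize("Colorless green ideas sleep furiously.")
    print(nltk.pos_tag(tokens))                   # a list of (token, tag) pairs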
I would say probability & statistics is the most important prerequisite. Gaussian mixture models (GMMs) and hidden Markov models (HMMs) in particular are very important in both machine learning and natural language processing (of course, these subjects may be part of the course if it is introductory).
Then, I would say basic CS knowledge is also helpful, for example Algorithms, Formal Languages and basic Complexity theory.
The Stanford CS 224: Natural Language Processing course that was already mentioned also includes videos online (in addition to other course materials). The videos aren't linked from the course website, so many people may not notice them.
Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(
Also, if you're a decent programmer, it's never too early to toy around with NLP programs. NLTK comes to mind (Python). It has a book you can read free online, which was also published in print (by O'Reilly, I think).
How about Markdown and an Introduction to Parsing Expression Grammars (PEG) posted by cletus on his site cforcoding?
ANTLR seems like a good place to start for natural language processing. I'm no expert though.
Broad question, but I certainly think that knowledge of finite state automata and hidden Markov models would be useful. That requires knowledge of statistical learning, Bayesian parameter estimation, and entropy.
Latent semantic indexing is a commonly used and relatively recent tool in many machine learning problems. Some of the methods are rather easy to understand, and there are a bunch of potential basic projects (a minimal sketch of the first one follows the list below).
Find co-occurrences in text corpora for document/paragraph/sentence clustering.
Classify the mood of a text corpus.
Automatically annotate or summarize a document.
Find relationships among separate documents to automatically generate a "graph" among the documents.
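As a minimal sketch of the first project (my own illustration with scikit-learn and a toy corpus, not a definitive implementation): project the documents into a low-rank latent space via truncated SVD of the TF-IDF matrix, then cluster them there.

    # Latent semantic indexing: TF-IDF + truncated SVD, then k-means clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    docs = ["the violin bow was carved from wood",
            "he took a bow after the concert",
            "stock markets fell sharply today",
            "investors sold shares amid the downturn"]

    X = TfidfVectorizer().fit_transform(docs)               # documents x terms
    latent = TruncatedSVD(n_components=2).fit_transform(X)  # low-rank LSI space
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latent)
    print(labels)   # the music-ish and finance-ish documents should separate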
EDIT: Nonnegative matrix factorization (NMF) is a tool that has grown considerably in popularity due to its simplicity and effectiveness, and it's easy to understand. I currently research the use of NMF for music information retrieval; NMF has been shown to be useful for latent semantic indexing of text corpora as well. Here is one paper. PDF
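A minimal NMF sketch in the same vein (again my own toy example, not from the linked paper): factor the TF-IDF matrix into nonnegative document-topic and topic-term matrices, then read off the top terms per topic.

    # NMF topic discovery on a TF-IDF matrix (assumes scikit-learn >= 1.0
    # for get_feature_names_out).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = ["guitar chords and melody",
            "melody and harmony in songs",
            "matrix rank and eigenvalues",
            "eigenvalues of a symmetric matrix"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)                        # nonnegative documents x terms
    model = NMF(n_components=2, init="nndsvd").fit(X)

    terms = vec.get_feature_names_out()
    for k, topic in enumerate(model.components_):      # topics x terms
        print(f"topic {k}:", [terms[j] for j in topic.argsort()[::-1][:3]])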
Prolog will only help them academically; it is limited to logic-constraint and semantic NLP work, and it is not an industry-friendly language, so it is not yet practical in the real world. Matlab is also an academic tool; unless they are doing a lot of scientific or quant work, they won't really have much need for it.
To start off, they might want to pick up the 'Norvig' book, enter the world of AI, and get a grounding in all the areas: basic probability, statistics, databases, operating systems, data structures, and most likely an understanding of, and experience with, a programming language. They need to be able to prove to themselves why AI techniques work and where they don't. The Norvig book cites references after every chapter, so they already have a lot of further reading available, and there is plenty of reference material over the internet, in books, and in journal papers for guidance. Don't just read the book: try to build tools in a programming language, then extrapolate 'meaningful' results. Did the learning algorithm actually learn as expected? If it didn't, why was that the case, and how could it be fixed?