What's the difference between relational diagrams, ER diagrams and EER diagrams?

I got confused when someone spoke of relational models when I asked a question about EER models. I've learned about the difference between ER and EER diagrams, but I'd like to understand the whole modelling process... I know EER are enhanced ER models, i.e. ER models with specialization/generalization.
When someone says ER modelling, do they also imply EER modelling?
And what about database normalization? Does that only apply to relational diagrams?

The ER model was originally introduced in 1976 by Peter Chen, although it was influenced by earlier work. In the early 1980s it was almost exclusively used to model data at the conceptual level, where its principal value was that it was unbiased with regard to implementation. While it was, and remains, very easy to convert an ER model into a relational model, the ER model has also been seen as useful in some cases where the final implementation was to be some kind of pre-relational DBMS like IMS. It has also been used in a preliminary stage in projects where the final implementation was to be in some kind of unstructured or post-relational DBMS, or an Object Database.
A great many practitioners merge ER modeling and relational modeling, and come up with a single model that serves both purposes. While the two models have a lot of overlap, the differences are important enough that merging the two of them waters them both down. This merging is most visible when it comes to ER diagrams. Many, perhaps most, of the so-called ER diagrams are really relational models, even if they use ER diagramming conventions.
The Wikipedia article on ER mentions the classic three layers: conceptual, logical, and physical, and treats them all as variants on the ER model. That's not how it was in the 1980s. The ER model was conceptual. The logical model was relational, provided the final target was to be a relational database. The physical level was DBMS specific, and tried to meet performance and volume goals as well as the more abstract goals of the logical and conceptual levels.
All this may be ancient history, or even pre-history in the world of IT, which is forever young.
The biggest differences are these: foreign keys are not present in an ER model. Relationships are visible in an ER model, but ER is silent on how they are to be implemented. Foreign keys are just one way to implement relationships; in a relational database, they are the only way that makes sense. ER also models many-to-many relationships directly, without putting another entity in the middle. Relational models require an intermediate table (often called a "junction table") to hold the two foreign keys that implement the many-to-many relationship.
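To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module and a made-up Student/Course example (not from the original answer) of the junction table a relational model needs where an ER diagram would simply draw a many-to-many relationship between the two entities.

    import sqlite3

    # Hypothetical Student/Course schema: the ER model shows a direct
    # many-to-many relationship; the relational model needs a junction table.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE course (
        course_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    -- The junction table holds two foreign keys; it exists only in the
    -- relational model, not in the conceptual ER model.
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
    """)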
The enhancements included in EER consist mainly of adding gen-spec (superclass/subclass) and unions to the modelling conventions. These are nearly universally part of ER by now, so the term EER is really a historical accident.
Normalization as originally developed is properly part of relational database design. It can't really be applied in non-relational situations without substantially messing around with the normal forms (1NF through 5NF and DKNF). Properly speaking, normalization is irrelevant in ER modeling. However, there is a modelling error that's easy to make in ER modelling that almost always correlates with normalization errors at the logical level: associating an attribute with the wrong entity, or conflating two distinct attributes into a single one.
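To illustrate that correlation with a made-up Customer/Order example (not from the original answer): putting the customer's address on the order entity shows up at the logical level as a transitive dependency, i.e. a third normal form violation, and attaching the attribute to the right entity removes it.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Modelling error: customer_address describes the customer, not the order.
    # Logically this is a transitive dependency
    # (order_id -> customer_id -> customer_address), a 3NF violation.
    conn.execute("""
    CREATE TABLE bad_order (
        order_id         INTEGER PRIMARY KEY,
        customer_id      INTEGER NOT NULL,
        customer_address TEXT NOT NULL
    )""")

    # Associating the attribute with the right entity removes the anomaly.
    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        address     TEXT NOT NULL
    );
    CREATE TABLE good_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    """)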
I could go on, but this is already too long.

What is software physical specification and logical specification?

What is software physical specification and logical specification? I understand logical specifications, which can be derived from user requirements by identifying attributes, entities and use cases and depicting the software graphically in UML. But what is the physical specification of software?
Logical vs physical terminology
The terminology logical vs. physical specification is related to the idea of an implementation-independent specification (logical) that is then refined to take into account implementation details and related constraints (physical).
This distinction can be made for any system viewpoint, such as architecture, data flows and process design. But the terms are mainly used in the context of data modeling (ERD):
the logical specification describes how data meets the business requirements. Typically, you'd describe entities, their attributes and their relationships;
the physical specification describes how a logical data model is implemented in the database, taking into consideration also technical requirements and constraints. Typically, you'd find tables, columns, primary keys, foreign keys, indexes and everything that matters for the implementation.
Remark: The term "physical" probably dates back to the times when you had to carefully design the physical layout of the data (e.g. in COBOL you had to define the fields of a record at the byte level, and that layout was really used to physically store the data on disk; it was also very difficult to change it afterwards).
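As a rough illustration (the Customer example below is made up, not part of the original answer), the same concept might appear as a plain list of attributes and relationships in the logical model, and as a concrete table with types, keys and an index in the physical model:

    # Logical level: the entity, its attributes and its relationships,
    # with no commitment to storage details.
    logical_model = {
        "Customer": {
            "attributes": ["name", "email"],
            "relationships": ["places Order (1:N)"],
        }
    }

    # Physical level: the same entity as a concrete table, with data types,
    # a primary key and an index chosen for technical reasons.
    physical_ddl = """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        email       VARCHAR(255) NOT NULL
    );
    CREATE UNIQUE INDEX ix_customer_email ON customer (email);
    """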
Purpose oriented terminology
Nowadays, specifications or models tend to be named according to their purpose. But what they are called, and whether they are independent models or successive refinements of the same model, is very dependent on the methodology. Some popular terminology:
Requirement specification / Analysis model, to express the business needs (i.e. problem space)
Design specification / model, to describe the solution (i.e. solution space)
Implementation specification / model, with all the technical details (i.e. one-to-one with the code, and therefore difficult to keep in sync).
Domain model, to express the design of business objects and business logic in a given domain, but without any application-specific design (i.e. like design model but with only elements that are of interest for the business).
UML
UML is UML and the same kind of diagrams may be used for different purposes. For example:
A use-case diagram generally represents user goals and tends to be mapped to requirements ("logical"). But use cases can also show the relationship of an autonomous device / independent component to technical actors in its environment ("physical").
A class diagram can be used to document a domain model ("logical"). But a class diagram can also document the implementation details ("physical"). See for example this article with an example of logical vs. physical class diagram.

UML class diagram: Association or Composition?

I'm a little confused on what the relationship would be for the scenario below. When examples of composition are used they always tend to use simple ones such as rooms and a building.
The scenario is that doctor-patient visits are recorded. Would it be an association, composition or a mix of both? I've included a picture below of the two different relationships I am stuck between. I am thinking composition because the visit belongs to each party?
Derived association
In general my rule of thumb is that when in doubt, always use association rather than composition/aggregation. My reasons for this are:
(1) In Object-oriented analysis and design for information systems, Wazlawick notes that the real advantage of composition and aggregation is that the attributes of the parts are often used to derive attributes of the whole. As an example he mentions that the total value of an order (whole) is derived from the value of each of its items (parts). However, this to him is a design concern rather than a conceptual modelling concern. From a conceptual modelling perspective, he believes that modellers often apply aggregation and composition inappropriately (that is, where whole-part relations are not present) and that their use seldom has real benefit. Hence he suggests avoiding or even abolishing their use.
(2) UML aims to provide a semi-formalization of part-whole relations through composition/aggregation. However, formalization of part-whole relations is a non-trivial task, to which the UML specification does not do justice. Indeed, a number of researchers have pointed out various aspects of aggregation and composition in which the UML specification is underspecified. All have proposed means for addressing the shortcomings of the UML specification, but to date these changes have not been incorporated into it. See for instance Introduction to part-whole relations.
When in doubt about which kind of association to use, use the more generic one. In particular, in your case there is no real "consists of" relation. Furthermore, in your EX2 you would have an instance of Visit whose existence is bound to a Doctor instance and to a Patient instance. This is a problem when applying the composition rules, as it also implicitly introduces an existence relation between Doctor and Patient. Thus, this should not be done.
I guess the concept you are looking for is an association class. This is a class whose instances give the association between a Doctor instance and a Patient instance some further information.
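A minimal Python sketch of that idea (the attribute names beyond Doctor, Patient and Visit are made up): Visit plays the role of the association class, referencing a Doctor and a Patient without owning either.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Doctor:
        name: str

    @dataclass
    class Patient:
        name: str

    # Visit acts as the association class: each instance links one Doctor to
    # one Patient and carries data that belongs to the association itself.
    @dataclass
    class Visit:
        doctor: Doctor      # plain association, no ownership implied
        patient: Patient    # plain association, no ownership implied
        visit_date: date
        notes: str = ""

    visit = Visit(Doctor("Dr. House"), Patient("Alice"),
                  date(2024, 3, 1), "routine check-up")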

What are the symbols for specializations and generalizations etc. in crow foot notation

My university is forcing me to learn from a terrible textbook on ERDs, in which they use a notation I personally don't like because I've never seen it used before (the book is so bad it doesn't even say what notation they're using), and I'd rather learn using a more common notation. Therefore I chose to learn it using crow's foot notation. (Please enlighten me if you think this is a bad decision.)
Now the textbook is covering is-a(n) relationships (a.k.a. specializations/generalizations) and I'd like to know how to represent one consistently with Martin/crow's foot notation... I learned about it thanks to a YouTube video (https://www.youtube.com/watch?v=MTG1zl8PkXk) but I noticed he's not using the same notation I'm using.
So how do I represent a specialization or generalization in crow's foot notation? Or is crow's foot notation only specific to cardinality? In my textbook, a few pages ahead, I also see concepts like multidimensional relationships (entity A has the same relation with entity B as with entity C) and relationships that refer to the same entity itself (so one employee can hire multiple other employees). Extra much love if you can show me how I should draw those as well :)
Unfortunately I could not find much information on this using search engines...
Try searching on "EER Diagram PDF". You'll get more images than you can shake a stick at. Some of them use crow's foot notation. Others don't.
The extra "E" stands for "Enhanced". This has to do with the fact that the original ER model did not have modeling conventions for Gen-spec (superclass/subclass) or for unions.
Unlike most people, I prefer to make a sharp distinction between diagrams that depict an ER model and ones that depict a relational model. Contrary to prevailing opinion, ER modeling isn't just "relational lite". It's a different model, with different purposes. You can look up the history if you are really interested.
I tend to use crow's foot notation in ER diagrams, and I always leave out junction tables and foreign keys. This makes the diagram more useful for stakeholders who want to see the big picture.
I like arrowhead notation for relational diagrams. Foreign keys and junction tables must be included in relational diagrams. They are part of the model, and implement relationships.
As far as a relational table design for gen-spec, I don't think you can beat Fowler's treatment of the subject. Try searching on "Fowler Class Table Inheritance" for an entry point into this aspect of the topic.
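For orientation only, here is a rough sketch of the kind of table layout Class Table Inheritance describes, using Python's sqlite3 and a made-up Party/Person/Organization gen-spec: the superclass gets its own table, and each subclass table reuses the superclass primary key as both its primary key and a foreign key.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Superclass table
    CREATE TABLE party (
        party_id   INTEGER PRIMARY KEY,
        party_type TEXT NOT NULL CHECK (party_type IN ('person', 'organization'))
    );
    -- Each subclass table shares the superclass primary key.
    CREATE TABLE person (
        party_id   INTEGER PRIMARY KEY REFERENCES party(party_id),
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE organization (
        party_id   INTEGER PRIMARY KEY REFERENCES party(party_id),
        legal_name TEXT NOT NULL
    );
    """)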

Besides BM25, what other ranking functions exist?

Besides BM25, what other ranking functions exist? Where can I find information on this topic?
BM25 is one of the term-based ranking algorithms. Nowadays there are concept-based algorithms as well.
BM25 is the state of the art of term-based information retrieval; however, there are some challenges that term-based models cannot overcome, such as relating synonyms, matching abbreviations or recognizing homonyms.
Here are the examples:
synonym: "buy" and "purchase"
abbreviation: "Professor" and "Prof."
homonym:
bow – a long wooden stick with horse hair that is used to play certain string instruments such as the violin
bow – to bend forward at the waist in respect (e.g. "bow down")
To deal with these problems, some are using concept-based models, such as those described in this article and this article.
Concept-based models mostly use dictionaries or external terminologies to identify concepts, and each has its own representation of concepts and weighting algorithms.
Vanilla tf-idf is what is often used. If you want to learn about these things, the best place to start is this book.
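For concreteness, here is a minimal sketch of the Okapi BM25 scoring function in plain Python; BM25 refines vanilla tf-idf with term-frequency saturation and document-length normalization (k1 and b are the usual free parameters; the toy documents are made up). Treat it as an illustration of the formula, not production code.

    import math
    from collections import Counter

    def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
        """Score each tokenized document against the query with Okapi BM25."""
        N = len(documents)
        avgdl = sum(len(d) for d in documents) / N
        # document frequency of each query term
        df = {t: sum(1 for d in documents if t in d) for t in set(query_terms)}
        scores = []
        for doc in documents:
            tf, dl, score = Counter(doc), len(doc), 0.0
            for t in query_terms:
                if tf[t] == 0:
                    continue
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
            scores.append(score)
        return scores

    docs = [["buy", "cheap", "violin", "bow"],
            ["bow", "down", "in", "respect"],
            ["purchase", "a", "violin"]]
    print(bm25_scores(["violin", "bow"], docs))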

How to get started on Information Extraction?

Could you recommend a training path to start with and become very good at Information Extraction? I started reading about it for one of my hobby projects and soon realized that I would have to be good at math (algebra, stats, probability). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comments, I am more interested in Text Information Extraction.
Depending on the nature of your project, natural language processing and computational linguistics can both come in handy: they provide tools to measure and extract features from the textual information, and to apply training, scoring, or classification.
Good introductory books include O'Reilly's Programming Collective Intelligence (the chapters on searching and ranking, document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging, and named entity recognition (the ability to recognize names, places, and dates in plain text). You can use Wikipedia as a training corpus since most of the target information is already extracted in infoboxes; this might provide you with some limited amount of measurement feedback.
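A minimal sketch of both suggested tasks using NLTK, which a later answer also recommends (the download names are the classic NLTK model identifiers and may vary slightly between versions; the sample sentence and output are illustrative only):

    import nltk

    # One-time model downloads; resource names may differ in newer NLTK versions.
    for resource in ("punkt", "averaged_perceptron_tagger",
                     "maxent_ne_chunker", "words"):
        nltk.download(resource, quiet=True)

    text = "Tim Berners-Lee invented the World Wide Web at CERN in 1989."

    tokens = nltk.word_tokenize(text)   # tokenization
    tagged = nltk.pos_tag(tokens)       # part-of-speech tagging
    tree = nltk.ne_chunk(tagged)        # named entity recognition

    # Collect the recognized named entities with their types.
    entities = [(" ".join(w for w, _ in t.leaves()), t.label())
                for t in tree.subtrees() if t.label() != "S"]
    print(entities)  # e.g. entities labelled PERSON, ORGANIZATION, ...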
The other big hammer in IE is search, a field not to be underestimated. Again, O'Reilly's book provides some introduction to basic ranking; once you have a large corpus of indexed text, you can do some real IE tasks with it. Check out Peter Norvig's talk Theorizing from Data as a starting point and a very good motivator; maybe you could reimplement some of their results as a learning exercise.
As a forewarning, I think I'm obligated to tell you that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage point usually grows exponentially, in both development and research time. It's also quite underdocumented: most of the high-quality info is currently in obscure white papers (Google Scholar is your friend), so do check them out once you've got your hand burned a couple of times. But most importantly, do not let these obstacles throw you off; there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE; you just need to understand how the algorithm works, experiment on the cases for which you need optimal result performance and the scale at which you need to achieve the target accuracy level, and work with that. You are basically working with algorithms, programming and aspects of CS/AI/machine learning theory, not writing a PhD thesis on building a new machine-learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works, so I totally disagree with that notion. There is a difference between practice and theory; as we all know, mathematicians are stuck more on theory than on the practicality of algorithms for producing workable business solutions.
You would, however, need to do some background reading, both books on NLP and journal papers, to find out what people found in their results. IE is a very context-specific domain, so you would need to define first in what context you are trying to extract information. How would you define this information? What is your structured model? Suppose you are extracting from semi-structured and unstructured data sets.
You would then also want to weigh up whether you want to approach your IE with a standard human approach, which involves things like regular expressions and pattern matching, or whether you want to do it using statistical machine-learning approaches such as Markov chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data for extraction from various or specific sources, cleansing your data
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between what is data and what is information, since you can reuse your discovered information as sources of data to build more information maps/trees/graphs. It is all very contextualized.
These are the standard steps for input -> process -> output; a tiny regex-based sketch of such a pipeline follows below.
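As a toy illustration of the regular-expression approach mentioned above, following those three steps (the patterns and the input text are made up):

    import re

    RAW = """
    Contact john.doe@example.com before 2024-05-01.
    Alice (alice@example.org) replied on 2024-05-03.
    """

    # pre-processing: normalize whitespace
    text = " ".join(RAW.split())

    # processing: simple pattern matching for email addresses and ISO dates
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

    # post-processing: store the extracted items in a structured form
    record = {"emails": sorted(set(emails)), "dates": sorted(set(dates))}
    print(record)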
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web), and for your defined contextual model you can build up relationship and association graphs that will most likely change as you make more and more extraction requests. Deploy it as a RESTful service if you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching, say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production, as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, and Mahout for large-scale classification, amongst others. Store your datasets in MongoDB, or put unstructured dumps into Jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on, such as the Reuters corpus, TIPSTER, TREC, etc. You can even check out AlchemyAPI, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than from, say, a web document, so representation at the pre-processing step becomes even more crucial in defining what exactly you are trying to extract from a standardized document representation.
Standard measures include precision, recall and F1 measure, amongst others.
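These measures are straightforward to compute over sets of extracted items; a small sketch (the example entities are made up):

    def precision_recall_f1(predicted, gold):
        """Precision, recall and F1 for sets of extracted items."""
        predicted, gold = set(predicted), set(gold)
        tp = len(predicted & gold)                 # true positives
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Three extracted entities, two correct, four in the gold standard.
    print(precision_recall_f1({"CERN", "Paris", "IBM"},
                              {"CERN", "IBM", "NASA", "MIT"}))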
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package, but who cares about packages? What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from the Bayesian way of thinking (and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on linear algebra, probability and Bayesian theory. Bayes, I should add, is as important for this as oxygen is for human beings (it's a little exaggerated, but you get what I mean, right?). Also, get a good grip on machine learning. Just using other people's work is perfectly fine, but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that:
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :)
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need an enterprise-grade NER service. Developing a NER system (and training sets) is a very time-consuming and highly skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.
