Please help me to solve this homework. I need to draw the ER diagram, relationships and cardinality.
An environmental agency needs to catalog all the plants in an area that is vulnerable to acid rain. Plants exist in quadrants, and a botanist is responsible for cataloging plants. The data to be stored should include genus, species, quantity (in numbers and in kg) of the plants, date of record, quadrant id, quadrant location, average altitude of the quadrant, and botanist information such as name.
Before you begin to learn how to draw an ER diagram, you will do well to learn the differences between a relational model of the data and an ER model of the data. Most of the ER diagrams being presented here in SO are really diagrams of a relational model.
This may seem overly picky, but the confusion between the two kinds of models slows down beginners enormously. If you have decided on a relational model, and want to use an ERD to depict it, you can do that. But learn how to make a model before you learn how to draw a picture of a model.
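To make the distinction concrete, here is a minimal sketch of what one relational reading of the homework data could look like, using Python's built-in sqlite3 module. The table names, column names, key choices, and cardinalities (one botanist, one quadrant, and one plant per record) are all assumptions made for illustration, as is the sample row. In an ER model the relationships would be named things in their own right; in the relational version below they only appear as foreign keys.

```python
import sqlite3

# Hypothetical relational schema; every name here is an assumption.
SCHEMA = """
CREATE TABLE botanist (
    botanist_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE quadrant (
    quadrant_id    INTEGER PRIMARY KEY,
    location       TEXT NOT NULL,
    avg_altitude_m REAL
);
CREATE TABLE plant (
    plant_id INTEGER PRIMARY KEY,
    genus    TEXT NOT NULL,
    species  TEXT NOT NULL,
    UNIQUE (genus, species)
);
-- Relationships are expressed only through foreign keys.
CREATE TABLE plant_record (
    record_id      INTEGER PRIMARY KEY,
    plant_id       INTEGER NOT NULL REFERENCES plant(plant_id),
    quadrant_id    INTEGER NOT NULL REFERENCES quadrant(quadrant_id),
    botanist_id    INTEGER NOT NULL REFERENCES botanist(botanist_id),
    record_date    TEXT NOT NULL,
    quantity_count INTEGER,
    quantity_kg    REAL
);
"""

def build_catalog():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn

conn = build_catalog()
# Made-up sample data, only to show how the tables link together.
conn.execute("INSERT INTO botanist VALUES (1, 'A. Botanist')")
conn.execute("INSERT INTO quadrant VALUES (1, 'north slope', 850.0)")
conn.execute("INSERT INTO plant VALUES (1, 'Quercus', 'robur')")
conn.execute(
    "INSERT INTO plant_record VALUES (1, 1, 1, 1, '2020-05-01', 12, 340.5)"
)
rows = conn.execute(
    """
    SELECT b.name, q.location, p.genus, p.species, r.quantity_count
    FROM plant_record AS r
    JOIN botanist AS b ON b.botanist_id = r.botanist_id
    JOIN quadrant AS q ON q.quadrant_id = r.quadrant_id
    JOIN plant    AS p ON p.plant_id    = r.plant_id
    """
).fetchall()
print(rows)
```

Whether a record really belongs to exactly one botanist, or a plant to exactly one quadrant, is precisely the kind of question the modelling step has to settle before any diagram is drawn.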
I am new to pathway analysis. (I am working in R, but am open to try other programs for this).
I have built a model to analyze some ecological data, so I have some known relationships among some of my variables. Let's say I have a known structure among variables v1-v6 as depicted in the attached diagram. I know that an external variable, e.g. latitude, acts on v6. But I want to find out at which point latitude actually acts (e.g. it could be that latitude in reality affects any of v1-v5, thus carrying its effect over to v6 through the relationships between v1-v6).
Also, at a later stage, I'd like to do this for more external variables.
My question is, is there any way to take into account such known relationships in pathway analysis? Furthermore, a few of these known relationships are actually non-linear. I understand that nonlinear relations are not easy in pathway analysis, such as e.g. in SEM's, but it also seems to me that the difficulty arises from testing for such nonlinear relationships. Here, I would not need to test for any of those nonlinear relationships. The relationship of latitude on v1-v6 is assumed to be linear.
Thanks for any input, appreciate to hear if anyone has dealt with a similar situation!
Pathway Diagram
I am looking for Visibility Graph applications. In line with the articles I read, I have obtained the applications of this algorithm, which are as follows:
Robot path planning
Placement of radio antennas
Complex network theory
Regional planning
This algorithm is also used to analyse time series. When a time series is analysed this way, a question arises once the graph has been obtained: what is this graph useful for?
If we take meteorological data and build its graph with the Visibility Graph algorithm, we can obtain statistical properties from the graph, such as the degree distribution, which for some networks follows a power law.
In general, my question is: what use and what information does the graph built from a meteorological time series, from purchases of medicine at certain times, or from many other time series provide?
As explained in the paper "From time series to complex networks: The visibility graph" by Lucas Lacasa, Bartolo Luque, Fernando Ballesteros, Jordi Luque, and Juan Carlos Nuño (2008), the visibility graph of a time series is invariant under translation, rescaling, addition of a linear trend, and other transformations. It nevertheless captures key time series features such as periodicity and self-similarity.
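As a rough illustration of the invariance claim, here is a minimal brute-force sketch of the natural visibility graph. It assumes equally spaced observations indexed 0, 1, 2, ... and uses the visibility criterion from the paper (two points are linked if every point between them lies strictly below the straight line joining them). The function names and the toy series are made up for the example.

```python
def visibility_graph(series):
    """Return the edge set {(a, b), a < b} of the natural visibility graph."""
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            ya, yb = series[a], series[b]
            # Point c is below the line a-b iff
            #   y_c < y_b + (y_a - y_b) * (b - c) / (b - a);
            # cross-multiplied by (b - a) > 0 to avoid division.
            visible = all(
                (series[c] - yb) * (b - a) < (ya - yb) * (b - c)
                for c in range(a + 1, b)
            )
            if visible:
                edges.add((a, b))
    return edges

def degree_sequence(edges, n):
    """Degree of every node, e.g. as input for a degree distribution."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

# Toy series (made-up values).
series = [3, 1, 2, 5, 4, 1, 6]
# Same series after rescaling (x3), adding a linear trend (+2t) and a shift (+5).
transformed = [3 * y + 2 * t + 5 for t, y in enumerate(series)]

edges = visibility_graph(series)
print(sorted(edges))
print(degree_sequence(edges, len(series)))
print(edges == visibility_graph(transformed))
```

The double loop with an inner scan is cubic in the series length in the worst case, so this is only meant for small examples.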
What is the difference between the Maximal Information Coefficient and Hierarchical Agglomerative Clustering in identifying functional and non-functional dependencies?
Which of them can identify duplicates better?
This question doesn't make a lot of sense, sorry.
The MIC and HAC have close to zero in common.
The MIC is a crippled form of "correlation" with a very crude heuristic search and plenty of promotional videos and news announcements, and it received some pretty harsh reviews from statisticians. You can file it in the category "if it had been submitted to an appropriate journal, it would have been rejected as-is; better expert reviewers would have demanded major changes". (Science is quite unspecific and overrated, and it probably shouldn't publish such topics at all, or should at least get better reviewers from the subject domains. It's not the first Science article of this quality.) See, e.g.,
Noah Simon and Robert Tibshirani, Comment on “Detecting Novel Associations in Large Data Sets” by Reshef et al., Science Dec. 16, 2011
"As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome."
And "tibs" is a highly respected author. And this is just one of many who were surprised that such things get accepted in such a high-reputation journal. IIRC, the MIC authors even failed to compare to "ancient" alternatives such as Spearman or to modern alternatives like dCor, or to properly test the statistical power of their method.
MIC works much worse than advertised when studied with statistical scrutiny:
Gorfine, M., Heller, R., & Heller, Y. (2012). Comment on "detecting novel associations in large data sets"
"under the majority of the noisy functionals and non-functional settings, the HHG and dCor tests hold very large power advantages over the MIC test, under practical sample sizes; "
As a matter of fact, MIC gives wildly inappropriate results on some trivial data sets such as a checkerboard uniform distribution ▄▀, which it considers maximally correlated (as correlated as y=x), by design. Their grid-based design is overfitted to the rather special scenario with the sine curve. It has some interesting properties, but these are IMHO captured better by earlier approaches such as Spearman and dCor.
The failure by the MIC authors to compare to Spearman is IMHO a severe omission, because their own method is also purely rank-based, if I recall correctly. Spearman is Pearson-on-ranks, yet they compare only to Pearson. The favorite example of MIC (another questionable choice) is the sine wave, which after rank transformation is actually just a zigzag curve, not a sine anymore. I consider this to be "cheating" to make Pearson look bad, by not using the rank transformation with Pearson, too. Good reviewers would have demanded such a comparison.
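The "Spearman is Pearson-on-ranks" point can be sketched in a few lines of plain Python. The helper names and the toy data (a monotone but strongly non-linear relation, y = exp(x)) are assumptions chosen for illustration; this is not the comparison from the MIC paper or its comments.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data: monotone, but far from linear.
x = [i / 10 for i in range(1, 51)]
y = [math.exp(v) for v in x]

print(round(pearson(x, y), 3))   # noticeably below 1
print(round(spearman(x, y), 3))  # 1.0, since the ranks agree exactly
```

Applying the rank transformation first is what lets a Pearson-style coefficient pick up any monotone dependency, which is why leaving Spearman out of a comparison matters.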
Now all of these complaints are essentially unrelated to HAC. HAC is not trying to define any form of "correlation", but it can be used with any distance or similarity (including correlation similarity).
HAC is something completely different: a clustering algorithm. It analyzes a larger set of rows, not two (!) columns.
You could even combine them: if you compute the MIC for every pair of variables (but I'd rather use Pearson correlation, Spearman correlation, or distance correlation dCor instead), you can use HAC to cluster variables.
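A minimal sketch of that combination, under several assumptions: Pearson correlation is used as the pairwise measure, 1 - |r| as the distance, and a naive single-linkage merge loop stands in for a real HAC implementation. The variable names v1..v4 and their values are made up.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_variables(columns, k):
    """Single-linkage HAC on variables, stopped when k clusters remain."""
    names = list(columns)
    # Distance between two variables: 1 - |correlation|.
    dist = {
        (a, b): 1.0 - abs(pearson(columns[a], columns[b]))
        for a in names for b in names if a != b
    }
    clusters = [{name} for name in names]

    def linkage(i, j):
        # Single linkage: smallest distance between any two members.
        return min(dist[a, b] for a in clusters[i] for b in clusters[j])

    while len(clusters) > k:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda p: linkage(*p))
        clusters[i] |= clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters

# Toy columns: v2 is roughly 2 * v1, v4 is exactly -v3.
columns = {
    "v1": [1, 2, 3, 4, 5, 6, 7, 8],
    "v2": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2],
    "v3": [4, 8, 1, 5, 3, 7, 2, 6],
    "v4": [-4, -8, -1, -5, -3, -7, -2, -6],
}
groups = cluster_variables(columns, k=2)
print(sorted(sorted(g) for g in groups))
```

Swapping in Spearman or dCor only changes how `dist` is filled; the clustering step itself never looks at the raw data.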
For finding actual duplicates, neither is a good choice. Just sort your data, and duplicates will follow each other. (Or, if you sort columns, they will be next to each other.)
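The sorting approach can be sketched as follows; the function name and the toy rows are made up, and rows are assumed to be tuples of mutually comparable values.

```python
def find_duplicates(rows):
    """Return each row that occurs more than once, found by sorting."""
    ordered = sorted(rows)
    duplicates = []
    # After sorting, equal rows are adjacent, so one pass is enough.
    for previous, current in zip(ordered, ordered[1:]):
        if previous == current and (not duplicates or duplicates[-1] != current):
            duplicates.append(current)
    return duplicates

rows = [(3, "c"), (1, "a"), (2, "b"), (1, "a"), (3, "c"), (1, "a")]
print(find_duplicates(rows))  # [(1, 'a'), (3, 'c')]
```

For duplicate columns, the same function can be applied to the transposed data, e.g. `find_duplicates(list(zip(*rows)))`.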
I recently started to work with a huge dataset provided by a medical emergency service. I have about 25,000 spatial points of incidents.
I have been searching books and the internet for quite some time and am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated the K, L and G functions for them and they confirm serious clustering.
I also have a population point dataset, with one point for every citizen, that is clustered similarly to the incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out whether they are similarly distributed. I want to know if there are places where there are more incidents than the population would suggest. In other words, I want to use the population dataset to explain the intensity and then figure out whether the incident dataset corresponds to that intensity. The assumption is that incidents should occur randomly with respect to population.
I want to get a plot of the region with information where there are more or less incidents than expected if the incidents were randomly happening to people.
How would you do it with R?
Should I use Kest or Kinhom to calculate the K function?
I read the description, but still don't understand the basic difference between them.
I tried using Kcross, but as I figured out, one of the two datasets used should be CSR (complete spatial randomness).
I also found Kcross.inhom, should I use that one for my data?
How can I get a plot (image) of incident deviations relative to population?
I hope I asked clearly.
Thank you for taking the time to read my question, and even more thanks if you can answer any of my questions.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R so I have a preference for using these (and I genuinely believe these are the best tools for your problem).
Conceptual issue: How big is your study region and does it make sense to treat the points as distributed everywhere in the region or are they confined to be on the road network?
For now I will assume they can be distributed anywhere.
A simple approach would be to estimate the population density using density.ppp and then fit a Poisson model to the incidents with the population density as the intensity using ppm. This would probably be a reasonable null model, and if it fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More info on density.ppp and ppm is in chapters 6 and 9 of the book, respectively, and of course in the spatstat help files.
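The idea behind this null model can be illustrated, very roughly, without spatstat. The sketch below is a crude grid-count stand-in written in plain Python, not the density.ppp/ppm workflow: it assumes the region is cut into square cells, takes the expected number of incidents in a cell to be proportional to the population counted there, and reports a Pearson-type residual per cell. The function names, the cell size, and the toy coordinates are all made up.

```python
import math
from collections import Counter

def cell_of(point, cell_size):
    """Index of the square grid cell containing a point (x, y)."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def incident_residuals(population, incidents, cell_size):
    """Per-cell (observed - expected) / sqrt(expected) under the null model
    'incidents happen at random to people'."""
    pop_counts = Counter(cell_of(p, cell_size) for p in population)
    inc_counts = Counter(cell_of(p, cell_size) for p in incidents)
    rate = len(incidents) / len(population)  # incidents per person overall
    residuals = {}
    # Cells without any population are skipped (expected count would be 0).
    for cell, pop in pop_counts.items():
        expected = rate * pop
        observed = inc_counts.get(cell, 0)
        residuals[cell] = (observed - expected) / math.sqrt(expected)
    return residuals

# Toy data: two cells with equal population but unequal incident counts.
population = [(0.5, 0.5)] * 100 + [(1.5, 0.5)] * 100
incidents = [(0.2, 0.3)] * 5 + [(1.7, 0.9)] * 15

residuals = incident_residuals(population, incidents, cell_size=1.0)
for cell, value in sorted(residuals.items()):
    print(cell, round(value, 2))
```

Positive values mark cells with more incidents than the population alone would suggest, negative values fewer. A kernel-smoothed intensity, as produced by density.ppp, avoids the arbitrary choice of cell boundaries that this sketch depends on.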
If you use summary statistics like the K/L/G/F/J-functions, you should always use the inhom versions to take the population density into account. This is covered in chapter 7 of the book.
It could also be interesting to look at the relative risk (relrisk) if you combine all your points into a marked point pattern with two types (background and incidents). See chapter 14 of the book.
Unfortunately, only chapters 3, 7 and 9 of the book are available as free sample chapters to download, but I hope you have access to it at your library or have the option of buying it.
For my thesis assignment I need to perform a cluster analysis on a high-dimensional data set containing purchase data from a retail store (1000+ dimensions). Because traditional clustering algorithms are not well suited for high dimensions (and dimension reduction is not really an option), I would like to try algorithms specifically developed for high-dimensional data (e.g. ProClus).
Here however, my problem starts.
I have no clue what value I should use for parameter d. Can anyone help me?
This is just one of the many limitations of ProClus.
The parameter is the average dimensionality of your cluster. It assumes there is a linear cluster somewhere in your data. This likely will not hold for purchase data, but you can try. For sparse data such as purchases, I would rather focus on frequent itemset mining.
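To give an idea of what frequent itemset mining looks for in purchase data, here is a deliberately naive sketch that counts how often single items and item pairs occur together and keeps those above a support threshold. It enumerates all combinations by brute force rather than pruning candidates as Apriori or FP-Growth would; the function name, the `max_size` parameter, and the toy baskets are made up.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support (fraction of baskets containing them) >= min_support."""
    n = len(baskets)
    result = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for basket in baskets:
            # Each basket contributes every size-element subset once.
            for combo in combinations(sorted(set(basket)), size):
                counts[combo] += 1
        for combo, count in counts.items():
            if count / n >= min_support:
                result[combo] = count / n
    return result

# Toy purchase data.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "chips"],
    ["bread", "butter"],
    ["beer", "bread", "milk"],
]
for itemset, support in sorted(frequent_itemsets(baskets, 0.5).items()):
    print(itemset, support)
```

On sparse data with 1000+ items this brute-force enumeration is only practical for very small itemset sizes; a dedicated implementation would be needed for real use.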
There is no universal clustering algorithm. Any clustering algorithm will come with a variety of parameters that you need to experiment with.
For cluster analysis it is essential that you somehow can visualize or analyze the result, to be able to find out if and how well the method worked.