Graph data modeling assistance relating to soccer matches

I am trying to model soccer matches and the referees and teams that play in them. I want to create nodes for matches, referees, and players, and I am not clear on the best approach to modeling them. Should I model around cities, or around matches? Do I create a root node, etc.?
The kind of information I would be looking for later is along these lines:
1) Show all the matches for a particular referee (could be in multiple cities)
2) Show all matches where the referee worked and the home team won
3) Show all referees that have the highest count of wins for the home team
4) Show the most active referees in a particular city
As you can see, there are all sorts of questions, and for someone new this can be a little overwhelming. While I am reading some books, I wanted to see if any experts could help me with the scenario above. Again, I am not sure whether I need a root node that connects all the cities, referees, and matches, or whether I should just keep things independent. Your feedback would be most appreciated.

One possible model that seems, at the moment, to satisfy the queries you've posted:
(Team)-[:PLAYS]->(Match)
(Match)-[:HAS_REFEREE]->(Referee)
(Match)-[:PLAYED_IN]->(City)
The PLAYS relation could have a property to indicate whether the team was the home team. You could also have a property on the PLAYS relation to indicate whether that team won. Or, if winning is a big part of what you're looking for, you can create an extra relation such as
(Team)-[:WON]->(Match) (though then you need to think about how to model draws; the absence of a WON relation from either of the two teams in a match could perhaps indicate a draw).
1) All matches for a particular referee: start at the referee and traverse through the Match nodes to the Cities. You might index some unique property of the referee to be able to look him up quickly.
2) All matches where the referee worked and the home team won: start at the referee, find all his matches, and filter on the WON relation/property and the home-team property.
3) All referees that have the highest count of wins for the home team: same as above, but start at all referees.
4) Most active referees for a city: start at the city, then find all matches and their referees.
You might move things around a bit depending on other questions you want to answer (especially home-team properties, WIN/LOSE relations or properties, etc.).
And I don't think you need the root node at all. You can index all matches/cities/referees etc. if you want to find all of them.
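For example, the first query might look something like this in Cypher (a minimal sketch; the 'Referee' index name and the lookup string are assumptions):
START referee=node:Referee('name:SomeReferee')
MATCH referee<-[:HAS_REFEREE]-match-[:PLAYED_IN]->city
RETURN match, city
Starting from the indexed referee, this follows the incoming HAS_REFEREE relation to each of his matches and then the PLAYED_IN relation out to the city each match was played in.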

I've done some modelling of football/soccer matches which might be interesting to look at - http://staging.thinkingingraphs.com/
Mostly the same as what Luanne said, although I've got specific relationship types indicating which team played at home and which away. I've been writing up what I discovered while building out the model here as well - http://www.markhneedham.com/blog/tag/neo4j/page/2/

Related

Users, Roles, and Security Groups Management - How to Set up a Downline in SuiteCRM

SuiteCRM 7.5.1 - in reference to using Users, Roles, and Security Groups within SuiteCRM specifically.
So, I have a specific setup. I've looked through and read lots of documentation and tried my best to wrap my head around how SuiteCRM does this.
How would one correctly implement the following scenario?
Let's say I have a tree like so:
We'll number these rows for the sake of understanding: 1, 2, 3 and 4. Then we have Administrators who are employees to throw into the mix.
Administrators can work with almost all records; they cannot work with workflows, mess with code, or mess with a few custom modules. Outside of that, they have very few restrictions and don't obey any of the rules of the downline.
Then we follow the downline:
Person 1's can see all Person 2's, 3's, and 4's that are specifically within their downline and within their Territory. They cannot see any other Person 1's, period. They cannot see any 2's, 3's, or 4's that aren't within their downline or their Territory. They also cannot see Administrators or anything assigned to them.
Person 2's can see all Person 3's and 4's within their specific Downline and Territory. They cannot see any Person 1's or 2's, period. They cannot see any Person 3's or 4's outside of their Territory or Downline. They also cannot see Administrators or anything assigned to them.
Person 3's can see all 4's within their specific Downline and Territory. They cannot see any Person 1's, 2's, or other 3's, period. They cannot see any Person 4's outside of their Territory or Downline. They also cannot see Administrators or anything assigned to them.
Person 4's can see only records assigned to them.
In this example the hierarchy is only 4 levels deep; in the real world it is actually 12 levels deep, plus administrators, plus me, the Super Admin.
How can I go about resolving this?
I wrote SecuritySuite and what you need is fairly typical. There can be a large learning curve for figuring this out so I wrote up an example setup for a 3 deep hierarchy here to try to help with that a bit: https://www.sugaroutfitters.com/docs/securitysuite/example-of-a-typical-setup.
Your example is a 4-deep hierarchy, but it's fairly similar. The key is to create groups for the lowest level. In your case, this would be at the Person 4 level. So persons 4a, 4b, and 4c would all be in Group A. A role with Owner-only rights would be assigned directly to Group A so that 4a/4b/4c could only access their own records.
Person 3a would be in Group A, but a "Manager" role would be created with Group access and assigned directly to person 3a. Person 3a's Group A membership would be marked as non-inheritable so that when person 3a creates a record Group A wouldn't be assigned to it directly. Person 3a would also be in Group AA along with person 3b/3c/3d (according to the picture above).
Person 2b (2nd person in the 2nd tier of the image above) would be in Group A and Group AA, both marked as non-inheritable. Person 2b would have the "Manager" role assigned directly.
Person 1 would have a role assigned directly with "All" access as this person can see everyone.

Graph design (for neo4j) for sports tournaments

I want to use a graph database for a web application that tracks the players, matches, and leagues for a given sport, say volleyball. Below is the first-level model I came up with. I would like to support the statistics below for this web application.
Player
Show all the leagues played by a player.
Show all the matches played by a player in each league.
Player's current team and his previous teams.
How many times the player was a captain and all the leagues for which he was the captain.
Team
All leagues played by a team.
How many times the team was a winner or runner-up.
Your model looks good; however, after looking at your use cases, I have a few questions/suggestions:
Query
I'll give these in Cypher as it's easiest to show in this format.
Player
Show all the leagues played by a player.
START player=node:Player('indexForPlayer')
MATCH player-[:PLAYED]->match-[:PART_OF]->league
RETURN DISTINCT league
Show all the matches played by a player in each league.
START player=node:Player('indexForPlayer')
MATCH player-[:PLAYED]->match-[:PART_OF]->league
RETURN league, collect(match)
Player's current team and his previous teams.
START player=node:Player('indexForPlayer')
MATCH player-[:BELONGED_TO]->team
RETURN team
How many times the player was a captain and all the leagues for which he was the captain.
How do you determine if they were a captain of a league?
Team
How many times the team was a winner or runner-up.
You might want to model this as a relationship such as (match)-[:WINNER]->(team); that way, to find out how many wins a team has, all you have to do is count the WINNER relationships.
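A count query along those lines could look something like this (a sketch; the 'Team' index lookup is an assumption):
START team=node:Team('indexForTeam')
MATCH match-[:WINNER]->team
RETURN count(match)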
Data Model
Add a property to the Match node for the date played. I'm unfamiliar with sports, but a year alone may not be enough if players can swap teams within a year. Note that Neo4j doesn't really have a good method for dealing with time, other than a 'seconds since epoch' type system.

How to mitigate against bandwagon effect (voting behavior) in my ranking system?

What I mean by the bandwagon effect is the following:
Already top-ranked items have a higher tendency to get voted on at all, possibly even to get upvoted.
What I am hoping to get is some concrete recommendations, at best based on your practical experience with a mathematical formula and in which situation it helped.
However, any useful pointers are more than welcome!
My ranking system
Please consider a ranking system on a website that has a reputation system, where users cast only upvotes on items, and where the ranking table is reset to start fresh every month.
Every user has one upvote per item within each month, and there is a reward for users who, within a certain month, upvoted an item that made it into the top ranks at the end of that month.
Users are told the following about what increases the weight of their upvote:
1)... the more reputation you have at the time of upvoting
2)... the fewer items you upvote within the current month (including the current upvote)
3)... the fewer upvotes that item already has within the current month before your own upvote
The ranking table is recalculated once a day and is visible to all.
Goal
I'd like to implement part 3) in an effort to correct the ranks of items where one cannot tell whether some users upvoted them just because of the bandwagon effect (those users might hope to gain a "tactical" advantage simply by voting for what they perceive lots of other users have already upvoted).
Also, I hope in this way to mitigate the possible use of sock puppets that have managed to attain some reputation but upvote the same item or group of items.
Question
Is there a (maybe even tested?) mathematical formula that I could just apply to the time-ordered list of upvotes for each item to get a coefficient for each of those upvotes, so that their weights will be corrected in a sensible fashion?
I'm thinking it's got to be something like a logarithmic function, but I can't quite get a grip on it...
Thank you!
Edit
Zack says: "beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed"
To further clarify: what I am after is which actual mathematical approaches are worth trying out that will, in the form of a mathematical function, translate this decrease in popularity (i.e., apply coefficients to the weights, see above) in a sensible, balanced manner.
My hope is that someone has practical experience with such approaches in a situation similar or comparable to the one above.
Consider applying the "Indie Rock Peter Principle": beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed.
Term coined by Leonard Richardson in this paper. Indie Rock Peter is of course from Diesel Sweeties.
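One simple way to turn that principle into the per-upvote coefficient the question asks for would be a logarithmic discount (an illustration only, not a formula I have tested in practice): give the $k$-th upvote an item receives within the month the coefficient
$$c_k = \frac{1}{\log_2(k + 1)}$$
so the first upvote counts in full ($c_1 = 1$), the third counts half as much, the fifteenth a quarter, and so on. The item's monthly score would then be $\sum_k c_k w_k$, where $w_k$ is the reputation-based weight of the $k$-th voter.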
I have always disliked the bandwagon effect in voting systems, especially "most viewed" rankings in which simply clicking on a highly ranked item increases its rank. My solution to this problem, which I have never tested or seen implemented, would be to keep track of how an item was reached (and then voted for), and ignore (or greatly decrease the weight of) votes that came from any sorted-by-ranking page.

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, exceptionally bad videos could make it to the top of the charts (just because they're so bad).
If I go the route of a scoring system based on the number of "like" and "dislike" votes (e.g. 100 like votes and 50 dislike votes equals a score of 2), videos with few views could appear at the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What are your thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per se.
You seem to be missing the point that likes and dislikes of movies are anything but objective, even within the context of a relatively homogeneous group of "voters". Think of how the term "Chix Flix" or the success story called "NetFlix" illustrates this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem of dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie, or because they neither truly liked nor disliked it? Very likely a bit of both; therefore we can/should use the count of "page views without a vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies appear more notorious or popular).
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts are visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (e.g. "likes"/"total" or "likes"/"dislikes", etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereas the number of votes (and of views) is indicative of the notoriety ("name recognition", etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce much volatility into the rating. Phrased otherwise, small samples make for not-so-statistically-representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/views happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but may also be used to direct users towards currently hot items. BTW, that feeds the bandwagon effect mentioned above :-( but also increases the voting sample size :-).
All these considerations suggest caution in implementing this rating system. They also hint at the likely need to include statistics about the complete set of movies in the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of its own vote/view counts, but also on, say, the average vote count a movie receives, the maximum views a movie page gets, etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and the ranking is then recalculated using the statistics of groups of similarly rated movies, may provide a better system (provided the formulas are "fair" and somehow converge).
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes, which gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal) and how many votes are needed to change the rating substantially.
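As a formula, with like count $L$, dislike count $D$, and baseline values $b_L$ and $b_D$ (both 10 in the example above):
$$\text{score} = \frac{L + b_L}{D + b_D}$$
A brand-new video starts at $10/10 = 1$; a video with 100 likes and 50 dislikes scores $110/60 \approx 1.83$, close to the raw ratio of 2 but still moderated by the baseline.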

Algorithm for similarity (of topic) of news items

I want to determine the similarity of the content of two news items, similar to Google News, but different in the sense that I want to be able to determine what the basic topics are and then determine which topics are related.
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
If you can just throw around keywords like k-nearest neighbours with a little explanation about why they work (if you can), I will do the rest of the research and tweak the algorithm. I'm just looking for a place to get started, since I know someone out there must have tried something similar before.
First thoughts:
toss away noise words (and, you, is, the, some, ...).
count all other words and sort by quantity.
for each word in the two articles, add a score depending on the sum (or product or some other formula) of the quantities.
the score represents the similarity.
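With product scoring, for instance, that amounts to something like
$$\text{sim}(A, B) = \sum_{w \notin \text{noise}} c_A(w) \cdot c_B(w)$$
where $c_X(w)$ is the number of times word $w$ appears in article $X$ (only words occurring in both articles contribute to the sum).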
It seems to me that an article primarily about Donald Rumsfeld would contain those two words quite a bit, which is why I weight them heavily in the article.
However, there may be an article mentioning Warren Buffet many times with Bill Gates once, and another mentioning both Bill Gates and Microsoft many times. The correlation there would be minimal.
Based on your comment:
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).
That's where I'd start and I can see potential holes in the theory already (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Hillary for the other).
I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).
One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.
At the moment I am thinking of something like this.
Each non-noise word is a dimension. Each article is represented by a vector, where words that don't appear are represented by zero and those that do appear get a value equal to the number of times they appear divided by the total number of words on the page. I can then take the Euclidean distance between points in this space to get the similarity of any two articles.
The next step would be to determine clusters of the articles and a central point for each cluster, then compute the Euclidean distance between the central points of any two clusters, which gives the similarity of the topics.
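In symbols (just restating the idea above): with term frequencies $v_A[w] = \mathrm{count}_A(w) / |A|$, where $|A|$ is the total word count of article $A$, the distance is
$$d(A, B) = \sqrt{\sum_w \bigl(v_A[w] - v_B[w]\bigr)^2}$$
and a smaller $d$ means more similar articles; the same distance applied to cluster centroids gives the topic similarity.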
Baaah I think by typing it out I solved my own problem. Of course only in a very high level way, I am sure when I get down to it I will find problems ... the devil is always in the detail.
But comments and improvements still highly appreciated.
