Data Modelling for Big Data - graph

I have this type of database to implement.
The circled points are cities.
People travel from one city to another.
The number of people travelling from one city to another is shown by the weight on each edge.
Circle G is my goal city.
What do I want to achieve?
The total number of people reaching "G".
The paths they followed to reach the goal "G".
For example:
200 people started from A->F.
100 went back to A using the path F->A.
Of the remaining 100, only 20 reached the goal "G".
Overall, the number of people reaching "G" from the right side is 80.
The information I need at point "G":
80 people from the right side = 20 (from A->F->G) + 60 (from A->D->F->G)
This is a small graph.
I want to implement this on a graph with 1000+ nodes.
My current approach (using ArangoDB) is:
I create one vertex collection and one edge collection.
Each city (A, B, C, D) is a document inside the same collection.
I save the complete previous path for every person travelling.
E.g. John is travelling from A->G.
The details I save at F for John: {"John : A_D_F"}
The details I save at city G for John: {"John : A_D_F_G"}
I repeat this for every single person travelling.
In short, I want to achieve funneling at any point (city) in the graph.
What is a better way of modelling the data for this type of graph in ArangoDB or another big data store, and which big data store would be best?

You are right in your conclusion to treat this as a graph problem. Irrespective of the tech stack that you want to use, I suggest you model your data by following some of the best practices/examples outlined in these links
There are a lot of proven choices for scaling to a 1,000 or even 10,000 node graph.
Here is one possible way to model this:
a] Treat the Cities and Persons as Nodes
b] Then model the City-to-City path as Relationships
c] Also then add the Person-has-travelled-to-City as a Relationship
d] If you need to sequence the relationship you can use Properties on the Person-to-City relationship
The next steps are to:
Create these in a Graphdb of your choice
Create the Sample Dataset
Run your queries and check the answers
See if you need to optimize either the model or the Data
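As a minimal sketch of the kind of question this model can answer, assume each person's visits have already been read back from the graph as an ordered list of cities (for example, by sorting on a sequence property of the Person-to-City relationship). The function name funnel_at and the sample journeys are hypothetical, and plain Python stands in here for whatever query language your graph database provides:

```python
from collections import Counter


def funnel_at(goal, journeys):
    """Count, per distinct path, how many people reached `goal`.

    `journeys` maps a person to the ordered list of cities they visited.
    Only the part of each journey up to the first arrival at `goal` is kept.
    """
    counts = Counter()
    for person, cities in journeys.items():
        if goal in cities:
            prefix = cities[: cities.index(goal) + 1]
            counts["->".join(prefix)] += 1
    return counts


# Hypothetical sample data, not taken from the question's graph.
journeys = {
    "John": ["A", "D", "F", "G"],
    "Ann": ["A", "F", "G"],
    "Bob": ["A", "F", "A"],  # went back to A, never reached G
    "Eve": ["A", "D", "F", "G"],
}

funnel = funnel_at("G", journeys)
total_reaching_goal = sum(funnel.values())
```

Here funnel holds the breakdown by path and total_reaching_goal the overall count, which are the two figures the question asks for at "G".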
Hope this helps


Interesting Python data structure problem involving disjoint sets, hashing, and graphs

Problem: You are planning an around-the-world trip with your two best friends for the summer. There is a total of n cities that the three of you want to visit. As you are traveling around the world, you are worried about time zones and airport access. Therefore, some cities can only be visited after visiting another city first, which is in a nearby timezone or has an airport, which are expressed as a list of pairs (cityX,cityY) (cityX can only be visited after visiting cityY).
Given the total number of cities and a list of dependency pairs, is it possible for you all to visit all cities?
Your task is to write the function can_visit_all_cities, which determines whether visiting the n cities is possible or not given the dependencies.
• Must run in O(m+n), and cannot use the built-in Python set/dictionary types
This sounds like a dependency graph. I don't know of a built-in Python data structure for this.
If you were to implement one on your own, you'd have to build it from lists, since sets and dictionaries are ruled out by the constraint.
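One common way to check a dependency graph like this is Kahn's topological-sort algorithm: all cities can be visited exactly when the dependencies contain no cycle. The sketch below is one possible approach, not necessarily the intended solution. It assumes cities are numbered 0 to n-1 (the problem statement does not say how cities are identified) and uses only lists, so it stays within O(m+n):

```python
def can_visit_all_cities(n, dependencies):
    """Return True if all n cities can be visited given the dependencies.

    Each pair (city_x, city_y) means city_x can only be visited after city_y.
    Assumes cities are integers in range(n).
    """
    # unlocks[y] lists the cities that become one step closer once y is visited.
    unlocks = [[] for _ in range(n)]
    # remaining[x] counts prerequisites of x that have not been visited yet.
    remaining = [0] * n
    for city_x, city_y in dependencies:
        unlocks[city_y].append(city_x)
        remaining[city_x] += 1

    # Start with every city that has no prerequisite.
    ready = [city for city in range(n) if remaining[city] == 0]
    visited = 0
    while ready:
        city = ready.pop()
        visited += 1
        for nxt in unlocks[city]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)

    # Cities caught in a dependency cycle never become ready.
    return visited == n
```

Each city is pushed and popped at most once and each dependency pair is examined once, which gives the O(m+n) bound.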

Using Socrata API distance_in_meters() function with separate latitude, longitude field

I am trying to retrieve information about trees surrounding a given location from the Socrata API.
API Endpoint Description
I found two functions, within_circle(...) and distance_in_meters(...), which I could use to filter the data set. The problem is that those functions need either a location or a point data type, which is not present in the data set.
There are, however, latitude and longitude fields.
Is there any way to use those functions, or to get nearby trees some other way?
Tried this, but POINT(0 0) must be the point of the tree:
$where=within_circle('POINT(0 0)', 0, 0, 400)
I need something like this:
$where=within_circle(make_point(latitude, longitude), 0, 0, 400)
If you have not already done so, you may want to submit this question at In addition to the possibility that the people there will have an answer I do not, it would serve as some feedback that a function like what you had in mind would be useful.
I cannot think of a way to do exactly what you have in mind. Really, what I mean is a way that is within your power (or mine). The owner of the dataset could create a Point column -- and you may want to reach out to the NYC open data team to ask for that if you have not already done so.
However, since the X and Y coordinates, in feet, are present, you should be able to use the Pythagorean theorem to determine the distance from any given point. For that matter, the size of a degree of latitude or longitude cannot vary that much over an area as small as NYC so you could do the same thing with those values and save having to figure out the X and Y of your reference point.
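The latitude/longitude variant of that idea could look roughly like the following sketch, applied client-side after downloading the rows. It treats the area as flat, which is only reasonable over short distances. The function name approx_distance_m and the constant of about 111,320 meters per degree of latitude are assumptions for illustration, not part of the Socrata API:

```python
import math

# Approximate length of one degree of latitude in meters (an assumption;
# the true value varies slightly with latitude).
METERS_PER_DEGREE_LAT = 111_320.0


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Flat-earth (Pythagorean) distance in meters between two nearby points."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat2 - lat1) * METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dx = (lon2 - lon1) * METERS_PER_DEGREE_LAT * math.cos(mean_lat)
    return math.hypot(dx, dy)


def trees_within(trees, lat, lon, radius_m):
    """Keep rows whose 'latitude'/'longitude' fields lie within radius_m."""
    return [
        tree for tree in trees
        if approx_distance_m(lat, lon, float(tree["latitude"]),
                             float(tree["longitude"])) <= radius_m
    ]
```

The float() conversion is there because the fields may arrive as strings in a JSON response; that is also an assumption about the data.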
Good luck!

Choosing a similarity metric for user-scores of television shows

I have a database of user ratings of various television shows on a 1-10 scale. I've been trying to find a good way of determining how similar two users' score lists are to one another for shared shows.
It feels like the most obvious way to do this is just to take the absolute value of the difference, and then sum or average that over all shared shows. But I was reading that this does not take into account that users rate things on different scales. I saw some people saying cosine similarity is better for this sort of thing. Unfortunately, I've run into a lot of cases where that metric doesn't really make sense.
overall average of user1 = 8.1
overall average of user2 = 5.8
scores for shared shows only:
S1 = [8,8,10,10,10,10,6,8,10,5,6,10]
S2 = [5,6,7,8,9,9,4,5,9,1,2,8]
Obviously, these two people rated the shows they watched pretty differently. When I use the average difference it says they are not very similar (2.3 where 0 is the same). When I use something like the cosine similarity it says they are extremely similar (0.97 where 1 is the same).
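For reference, the two figures above can be reproduced with a short sketch like this one (the function names are illustrative, and only the standard library is assumed):

```python
import math


def mean_absolute_difference(a, b):
    """Average of |a_i - b_i| over the shared shows."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between the two score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


S1 = [8, 8, 10, 10, 10, 10, 6, 8, 10, 5, 6, 10]
S2 = [5, 6, 7, 8, 9, 9, 4, 5, 9, 1, 2, 8]

mad = mean_absolute_difference(S1, S2)
cos = cosine_similarity(S1, S2)
```

The cosine stays high because both vectors contain only positive scores and point in roughly the same direction, so a constant offset between the two users barely changes the angle between them.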
Is there a metric that would be better suited for this kind of thing? My ultimate goal is to recommend users shows from other users that have similar tastes to them.

modeling scenario with mostly semi-additive facts

I'm learning dimensional modeling and I'm trying to create a model. I was thinking about a social media platform which rates hotels. The platform has the following data:
hotel information: name and address
a user can rate hotels (1-5 points)
a user can write comments
platform stores the date of the comments
hotel can answer via comment and it stores the date of it
the platform stores the total number of ratings at each level (i.e. all ratings with 1 point, all ratings with 2 points, etc.)
platform stores information of the user: sex, name, total number of votes he/she made and address
First, I tried to define which information belongs in a dimension table and which in a fact table
(here I also checked whether each fact is additive, semi-additive or non-additive).
I realized my example is rather difficult, because it's hard to decide whether something belongs in a fact table or a dimension.
I would like to hear some advice. Would someone agree with my model?
This is how I would model it:
Hotel information -> hotel dimension
User rating -> additive fact, because I can aggregate it across all dimensions
User comment -> semi-additive? Because I can aggregate comments with the date dimension (I don't know if my argument is correct, but I know I would have new comments every day, which for me is a reason to store them in a fact table)
Answer as comment -> same handling as the user comments
Date of comment -> dimension
Total number of all votes (1/2/3/4/5) -> semi-additive facts; it makes no sense to aggregate them, since they are already totals, but I could get the average
User information (sex, name, address) -> user dimension
User information: total number of votes -> could be a dimension or a fact. It depends on how often it changes. If it changes often, I store it in a fact. If not that often, then in a dimension
I still have some questions and hope someone can help me:
1st question: should I create two date dimensions, or can I store both dates in one date dimension?
2nd question: each user and hotel has just one address. Are there arguments for separating the address dimension into its own hierarchy? Can I create a 1:1 relationship between a user dimension and an address dimension?
Your model looks well considered, but here are some thoughts:
User comment (and answers to comments): they are an event to be captured (with new ones each day, as you mention) so are factual, with dimensionality of the commenter, type of comment, date, and the measure is at least a 'count' which is additive. But you don't want to store big text in a fact so you would put that in a dimension by itself which is 1:1 with the fact, for situations where you need to query on the comment itself.
Total Number of all votes (1/2/3/4/5) are, as you say, already aggregates, mostly for performance. Totals should be easy from the raw data itself so maybe not worthwhile to store them at all. You might also consider updating the hotel dimension with columns (hotel A has 5 '1' votes and 4 '2' votes) that you'd update as you go on, for easy filtering and categorisation.
User Information: total number of votes: it is factual information about a user (dimension) and it depends on whether you always just want to 'find it out' about a person or whether you are likely to use it to filter other information (i.e. show me all reviews for users who have made 10-20 votes). In that case you might store the total in the user dimension (and/or a banding, like 'number of reviews range' with 10-20, 20-30). You can update dimensions often if you need to, but you're right, it could still just live as a fact only.
As for date dimensions, if the 'grain' is 'day' then you only need one dimension, that you refer to from multiple facts.
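A small sketch of that idea, using Python's built-in sqlite3 module as a stand-in for a real warehouse. All table names, column names and sample rows are hypothetical. Both fact tables point at the same dim_date table through a date_key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
-- Two different facts, both at day grain, sharing one date dimension.
CREATE TABLE fact_rating (
    date_key INTEGER REFERENCES dim_date(date_key),
    hotel_id INTEGER, user_id INTEGER, points INTEGER);
CREATE TABLE fact_comment (
    date_key INTEGER REFERENCES dim_date(date_key),
    hotel_id INTEGER, user_id INTEGER, comment_count INTEGER);

-- Hypothetical sample rows.
INSERT INTO dim_date VALUES (20240101, '2024-01-01'), (20240102, '2024-01-02');
INSERT INTO fact_rating VALUES (20240101, 1, 10, 4), (20240102, 1, 11, 2);
INSERT INTO fact_comment VALUES (20240101, 1, 10, 1), (20240101, 1, 11, 1);
""")

# Comments per day: the additive 'count' measure summed over the date dimension.
comments_per_day = conn.execute("""
    SELECT d.full_date, SUM(c.comment_count)
    FROM dim_date d JOIN fact_comment c USING (date_key)
    GROUP BY d.full_date ORDER BY d.full_date
""").fetchall()

# Average rating per day, using the same date dimension from the other fact.
avg_rating_per_day = conn.execute("""
    SELECT d.full_date, AVG(r.points)
    FROM dim_date d JOIN fact_rating r USING (date_key)
    GROUP BY d.full_date ORDER BY d.full_date
""").fetchall()
```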
As for addresses, you're right that there are arguments on both sides! Many people separate addresses into their own dimension, referred to from the other dimensions that use them. Kimball suggests you can do that behind the scenes if necessary, but prefers for each dimension to have its own set of address columns (but modelled as consistently as possible).

SSAS facts sharing the same dimension

I'm building a cube with 2 fact tables that share some dimensions.
In the example below, I have Fact_Employee, Fact_Manager, Dim_Date, Dim_Country, Dim_Employee and Dim_Manager, with the respective links.
In SSAS I've created one Dim_Country. In the Cube "Dimension Usage" I am creating 2 dimensions (Man_Country and Emp_Country) and linking to the respective measure groups.
My Fact_Employee has the key for the Dim_Manager, so I can relate them.
My problem is that when I drag Man_Country, Emp_Country, Emp_Amount and Man_Amount into the pivot table, it doesn't work: I get the list of all manager countries, unrelated to the manager number, while the employee countries are correctly linked to the employee number but are duplicated.
The below image shows the result Pivot table and what I am trying to get.
What do I need to change in the data source view or cube dimension usage to get the correct results?
The users should be able to filter the pivot by, for example, Manager Country to see all the employee Countries and Numbers and the amounts (for Managers and Employees).
Many thanks in advance for any help.
If you have a country dimension, then you should use this dimension for both measure groups; just remember to configure dimension usage for this dimension against both measure groups.
There are special cases where you would want to separate those dimensions, for example if you want them to act separately. Let's say you have a fact table with parcels and you need both DimFromCountry and DimToCountry. In this case you would want to use a role-playing dimension: it is still the same dimension, but connected differently.
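The role-playing idea can be illustrated outside SSAS with a plain relational sketch, here using Python's built-in sqlite3 module. This is only an analogy for what the cube's dimension usage does, not SSAS configuration, and the table names, column names and sample rows are hypothetical. The single dim_country table is joined twice, once per role:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_parcel (
    parcel_id INTEGER PRIMARY KEY,
    from_country_key INTEGER REFERENCES dim_country(country_key),
    to_country_key INTEGER REFERENCES dim_country(country_key));

-- Hypothetical sample rows.
INSERT INTO dim_country VALUES (1, 'Norway'), (2, 'Portugal');
INSERT INTO fact_parcel VALUES (100, 1, 2), (101, 2, 2);
""")

# One physical dimension, two roles: aliased once as the "from" country
# and once as the "to" country.
rows = conn.execute("""
    SELECT p.parcel_id, f.name AS from_country, t.name AS to_country
    FROM fact_parcel p
    JOIN dim_country f ON f.country_key = p.from_country_key
    JOIN dim_country t ON t.country_key = p.to_country_key
    ORDER BY p.parcel_id
""").fetchall()
```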
