Interesting Python data structure problem involving disjoint sets, hashing, and graphs

Problem: You are planning an around-the-world trip with your two best friends for the
summer. There are a total of n cities that the three of you want to visit. Because you are worried about
time zones and airport access, some cities can only be visited after first visiting another city
(one that is in a nearby time zone or has an airport). These dependencies are expressed as a list of pairs
(cityX, cityY), meaning cityX can only be visited after visiting cityY.
Given the total number of cities and the list of dependency pairs, is it possible for you all to visit all cities?
Your task is to write the function can_visit_all_cities, which determines whether visiting all n cities is possible
given the dependencies.
Requirements
• Must run in O(m + n) and cannot use Python's built-in set/dictionary types

This is a dependency-graph problem: all cities can be visited exactly when the dependency graph has no cycle.
Python does not have a built-in graph data structure, so you build one yourself, and since the requirements rule out built-in sets and dictionaries, plain lists indexed by city number are enough.
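A minimal sketch under those constraints, assuming cities are numbered 0 to n-1: build the adjacency list from plain lists and run Kahn's topological sort, which touches every city and every dependency once, i.e. O(n + m).

def can_visit_all_cities(n, dependencies):
    """Return True if all n cities can be visited, given (cityX, cityY)
    pairs meaning cityX can only be visited after cityY."""
    # Adjacency list and in-degree counts, indexed by city id 0..n-1.
    adj = [[] for _ in range(n)]
    indegree = [0] * n
    for city_x, city_y in dependencies:
        adj[city_y].append(city_x)   # city_y must come before city_x
        indegree[city_x] += 1

    # Start with every city that has no prerequisites.
    queue = [city for city in range(n) if indegree[city] == 0]
    head = 0                          # list used as a queue, O(1) pops
    visited = 0
    while head < len(queue):
        city = queue[head]
        head += 1
        visited += 1
        for nxt in adj[city]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Every city gets processed exactly when there is no cycle.
    return visited == n

# Example: 1 depends on 0 and 2 depends on 1 -> possible; a mutual dependency -> not possible.
print(can_visit_all_cities(3, [(1, 0), (2, 1)]))   # True
print(can_visit_all_cities(2, [(0, 1), (1, 0)]))   # False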

Related

Reverse geocode latitude/longitude coordinates to retrieve landuse data (eg. residential area, highway, etc.)

I would like to analyse the locations of electric vehicle charging stations in Germany, Italy and France. I chose those three countries because they differ quite a lot with regard to their respective incentive programmes for public charging station infrastructure.
What I have so far are .csv exports from both OpenChargeMap and OpenStreetMap containing the location data (latitude and longitude) of all charging stations in those three countries, along with a few other fields that I can process in R.
What I would like to do now is some sort of reverse geocoding on those latitude and longitude coordinates to retrieve additional information on the surroundings. Especially, whether the respective charging station is located in a residential area in a city for example or at a rest stop on the highway. By knowing at what kind of locations the charging stations are placed in those three countries I am hoping to be able to draw conclusions regarding the incentive programmes. I'm not looking for specific addresses in this case, but rather an API or another way to process thousands of coordinates and retrieve information regarding for example population density or any other piece of data from which I could derive conclusions.
I have tried to get OpenStreetMap exports to work, but unfortunately I cannot seem to query for the 'landuse' attribute through the Overpass Turbo API. This is the basic query I'm using in that API, but as soon as I query for ["landuse" = "residential"] instead of ["landuse" = ""] I get empty results.
I found an API from Google which would offer lookup for various address components/types. Unfortunately, registering an API key at Google is not quite realistic for the scope of my work. Does somebody know of a (preferably FOSS) API that is able to do something like this? Or even how to make a 'landuse' query work in the Overpass Turbo API linked above?
Thank you in advance for your time.
Your Overpass API query is looking for elements that are tagged with both amenity=charging_station and a landuse tag. That combination is rather uncommon, since charging stations and landuse areas are mapped as distinct objects. Instead, you need to look around charging stations for landuse elements.
So instead of
area["ISO3166-1"="DE"]->.a;
nwr(area.a)["amenity"="charging_station"]["landuse"=""];
you will need a query like
area["ISO3166-1"="DE"]->.a;
nwr(area.a)["amenity"="charging_station"];
way(around:200)["landuse"];
This searches for ways with a landuse tag located within 200 meters of charging stations.
Note that this is a rather heavy query. You should probably use your own Overpass API server for it.
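To run the corrected query from a script, a minimal sketch with Python's requests against the public Overpass endpoint could look like this; the [out:json] header, timeout and out statement are standard additions to the query above, and for country-sized runs you would point it at your own server:

import requests

# The query from the answer above, wrapped with standard output directives.
QUERY = """
[out:json][timeout:900];
area["ISO3166-1"="DE"]->.a;
nwr(area.a)["amenity"="charging_station"];
way(around:200)["landuse"];
out center;
"""

response = requests.post(
    "https://overpass-api.de/api/interpreter",   # or your own Overpass server
    data={"data": QUERY},
    timeout=900,
)
response.raise_for_status()

# Each element is a landuse way within 200 m of some charging station.
for element in response.json().get("elements", []):
    print(element["id"], element.get("tags", {}).get("landuse"))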

Recommendation systems - converting transaction counts to star ratings

I'm doing some exploratory work on recommendation systems and have been reading about collaborative filtering techniques involving user-based, item-based, and SVD algorithms. I am also trying out R's recommenderlab package.
One apparent assumption in the literature is that the user data has labelled items based on a rating scale, e.g. between 1 and 5 stars. I'm looking at problems where the user data does not have ratings but rather just transactions. For example, if I want to recommend restaurants to a user, the only data I have is how often he has visited other restaurants.
How can I convert these "transaction" counts into ratings that can be used by recommendation algorithms that expect a fixed-scale rating? One approach I thought of is simple binning:
0 stars = 0-1 visits
1 star = 2-3 visits
...
5 stars = 10+ visits
However, that doesn't seem like it would work well. For example, if someone visited a restaurant only once, he may still really love it.
Any help would be appreciated.
I would try different approaches. As you said, a single visit may still mean that the user loves the restaurant, but you don't know for sure. Your goal is not to optimize for one single user but for all users. So split your data into training and test sets, train with the different scales, and evaluate on the test data.
The different scales may be:
a binary scale (0: never visited, 1: visited). This is what online shops mostly use (bought or not), and it sidesteps the one-visit issue you mention.
your proposed scale, or other ranges for the 5 stars. You can also use more than 5 stars; I would probably not group 0 and 1 visits into the same bin.
The approach with the best accuracy should be chosen.
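A minimal Python sketch of those two count-to-rating mappings (the star bin edges are illustrative and should be tuned on the held-out test data; the question uses R's recommenderlab, so treat this as a language-agnostic sketch):

def to_binary(visit_counts):
    """Binary scale: 1 = visited at least once, 0 = never visited."""
    return [1 if count > 0 else 0 for count in visit_counts]


def to_stars(visit_counts, edges=(2, 4, 7, 10)):
    """Map visit counts to a 1-5 star scale.

    The bin edges are illustrative: 0-1 visits -> 1 star, 2-3 -> 2,
    4-6 -> 3, 7-9 -> 4, 10+ -> 5. Tune them (and the number of bins)
    against held-out data rather than fixing them up front.
    """
    def stars(count):
        return 1 + sum(1 for edge in edges if count >= edge)
    return [stars(count) for count in visit_counts]


# Example: the same visit history under both scales.
visits = [0, 1, 3, 12]
print(to_binary(visits))   # [0, 1, 1, 1]
print(to_stars(visits))    # [1, 1, 2, 5]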
Here's an idea: restaurants the user has visited zero or one times tell you nothing about what they like. Restaurants they have visited many times tell you lots. Why not just look for restaurants similar to those the customer most regularly frequents? In this way, you're using positive information (what they like) but none of the negative since you don't have access to it anyway.
If you absolutely had to infer some continuous measure, I think it would only be sensible to look at the propensity for another visit given past behaviour. This would start with the prior probability of choosing that restaurant (background frequency, or just uniform over restaurants) with a likelihood term related to the number of visits to that restaurant. In this way the more a user visits a restaurant the more likely they are to visit again.
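A minimal sketch of that propensity idea, blending a user's own visit counts with a background prior over restaurants; the smoothing weight alpha is an assumed, untuned value:

def visit_propensity(user_visits, background_visits, alpha=5.0):
    """Estimate, for one user, the probability that their next restaurant
    visit goes to each restaurant.

    Starts from the background frequency of each restaurant (the prior)
    and updates it with the user's own visit counts (the likelihood term):
    the more a user visits a restaurant, the higher its propensity.
    alpha controls how strongly sparse users are pulled toward the prior.
    """
    total_background = float(sum(background_visits))
    prior = [b / total_background for b in background_visits]
    total_user = sum(user_visits)
    return [
        (count + alpha * p) / (total_user + alpha)   # smoothed estimate, sums to 1
        for count, p in zip(user_visits, prior)
    ]

# Example: two restaurants, equally popular overall; the user visited the
# second one three times and the first never.
print(visit_propensity([0, 3], [100, 100]))   # roughly [0.31, 0.69]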

Graph data modeling assistance relating to soccer matches

I am trying to model soccer matches and the referees and teams that play in them. I want to create nodes for matches, referees, and players, but I am not clear on the best approach to modelling them. Should I model around cities or matches? Do I need to create a root node, etc.?
The kind of information I would be looking for later is stuff like:
1). Show all the matches for a particular referee (could be in multiple cities)
2). Show all matches where the referee worked and the home team won
3). Show all referees that have the highest count of wins for the home team
4). Show the most active referees in a particular city
As you can see, there are all sorts of questions, and for someone new this can be a little overwhelming. While I am reading some books, I wanted to see if any experts could help me with the scenario above. Again, I am not sure whether I need a root node that connects all the cities, referees and matches, or whether to just keep things independent. Your feedback would be most appreciated.
One possible model that at the moment seems to satisfy the queries you've posted:
(Team)-[:PLAYS]->(Match)
(Match)-[:HAS_REFEREE]->(Referee)
(Match)-[:PLAYED_IN]->(City)
The PLAYS relation could have a property to indicate if the team was the home team. You could also have a property on the PLAYS relation to indicate whether that team won or not. Or if winning is a big part of what you're looking for, you can create an extra relation such as
(Team)-[:WON]->(Match) (though then you need to think about how to model draws; the absence of a WON relation from either of the two teams to a match could perhaps indicate a draw).
1) All matches for a particular referee: Start at the referee, traverse through the Match to the Cities. You might index some unique property of the referee to be able to look him up quickly
2) All matches where the referee worked and the home team won: Start at the referee, find all his matches, filter on the WON relation/property and the home team property
3) All referees that have the highest count of wins for the home team: Same as above, start at all referees
4) Most active referees for a city: Start at the city, find all matches and their referees
You might move things around a bit depending on more questions that you want to answer (especially home team properties, WIN/LOSE relations or properties etc.)
And I don't think you need the root node at all. You can index all matches/cities/referees etc if you want to find all of them
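For concreteness, a hedged sketch of how queries 1 and 4 could look in Cypher against this model, run from Python with the official neo4j driver; the name properties, labels and connection details are assumptions, not something from the original answer:

from neo4j import GraphDatabase

# Query 1: all matches (and the cities they were played in) for one referee.
MATCHES_FOR_REFEREE = """
MATCH (r:Referee {name: $referee})<-[:HAS_REFEREE]-(m:Match)-[:PLAYED_IN]->(c:City)
RETURN m, c
"""

# Query 4: most active referees in one city.
MOST_ACTIVE_IN_CITY = """
MATCH (c:City {name: $city})<-[:PLAYED_IN]-(m:Match)-[:HAS_REFEREE]->(r:Referee)
RETURN r.name AS referee, count(m) AS matches
ORDER BY matches DESC
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(MOST_ACTIVE_IN_CITY, city="London"):
        print(record["referee"], record["matches"])
driver.close()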
I've done some modelling of football/soccer matches which might be interesting to look at - http://staging.thinkingingraphs.com/
Mostly the same as what Luanne said although I've got specific relationship types indicating which team played at home and away. I've been writing up what I discovered while building out the model here as well - http://www.markhneedham.com/blog/tag/neo4j/page/2/

Travel APIs how to integrate them all?

I may start working on a project very similar to Hipmunk.com, where it pulls the hotel cost information by calling different APIs (like expedia, orbitz, travelocity, hotels.com etc)
I did some research on this, but I am not able to find any unique hotel id or other field to match hotels across the different APIs. Does anyone have experience with how to match a hotel from Expedia with the same hotel from Orbitz or Travelocity, etc.?
Thanks
EDIT: Google is also doing the same thing: http://www.google.com/hotelfinder/
From what I have seen of GDS systems and these APIs, there is rarely a unique identifier for hotels that is shared between systems.
Airports, airlines and countries, on the other hand, do have unique standard identifiers (ISO country codes, IATA/ICAO airport and airline codes): http://www.iso-code.com/airports.2.html
I would guess you are going to have to have your own internal mapping to identify and disambiguate the properties.
When you get started with hotel APIs, the choice of free ones isn't really that big, see e.g. here for an overview.
The most extensive and accessible one is Expedia's EAN http://developer.ean.com/ which includes Sabre and Venere with unique IDs but still each structured differently.
That is, you are looking into different database tables.
You do get several identifiers such as name, address, and coordinates, which can serve for unique identification, assuming they are free of errors. That is quite an assumption.
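A minimal sketch of such an internal mapping, matching records from two APIs on coordinate distance plus fuzzy name similarity; the field names and thresholds are assumptions you would tune against hand-checked pairs:

import math
from difflib import SequenceMatcher

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in metres."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def normalise(name):
    """Crude name normalisation; real data needs more (chains, diacritics, abbreviations)."""
    return " ".join(name.lower().replace("hotel", "").split())

def same_property(a, b, max_dist_m=150, min_name_sim=0.8):
    """Heuristic: two records (dicts with 'name', 'lat', 'lon') describe the
    same hotel if they are close together and their names are similar."""
    if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) > max_dist_m:
        return False
    similarity = SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()
    return similarity >= min_name_sim

# Example with made-up records from two providers.
expedia = {"name": "Hotel Adlon Kempinski", "lat": 52.5163, "lon": 13.3803}
orbitz = {"name": "Adlon Kempinski Berlin", "lat": 52.5162, "lon": 13.3801}
print(same_property(expedia, orbitz))   # True: ~20 m apart, very similar names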

Counting cars on a specific road

I am receiving coordinates from a large number of cars in the city. I would like to associate each car with the nearest road and later count how many cars are on each road. I am using Google Maps. I would like to know if there is a more efficient approach than what I am doing: I hand-draw all major streets and store the polylines, and when I receive a location, I search my database of roads (polylines) for the nearest one. This is slow because mapping all roads by hand is very difficult and I receive thousands of positions per minute.
It looks like you'd be better off using a GIS-enabled database like PostGIS, loaded with a suitable dataset such as OpenStreetMap's data, provided the OpenStreetMap data are of sufficiently high quality for your region and purpose, of course.
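A hedged sketch of that nearest-road lookup in Python, assuming a PostGIS database populated by osm2pgsql with its default schema (roads in planet_osm_line, geometries in EPSG:3857); the connection details are placeholders:

import psycopg2

# The <-> operator orders by distance using the GiST index, so each lookup
# stays fast even with thousands of incoming positions per minute.
NEAREST_ROAD_SQL = """
SELECT osm_id, name, highway
FROM planet_osm_line
WHERE highway IS NOT NULL
ORDER BY way <-> ST_Transform(ST_SetSRID(ST_MakePoint(%s, %s), 4326), 3857)
LIMIT 1;
"""

def nearest_road(conn, lon, lat):
    """Return (osm_id, name, highway) of the road closest to a GPS fix."""
    with conn.cursor() as cur:
        cur.execute(NEAREST_ROAD_SQL, (lon, lat))
        return cur.fetchone()

# Example usage (placeholder connection parameters):
# conn = psycopg2.connect(dbname="osm", user="postgres")
# print(nearest_road(conn, 13.3889, 52.5170))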

Resources