I am scraping different betting sites to get the best odds on the same events. Let's say I have these results for the same game on two different sites:
{
"1": 1.27,
"2": 10,
"game": "Juventus - Spal 2013",
"X": 5.45
}
and
{
"1": 1.28,
"2": 11,
"game": "Juventus - Spal",
"X": 5.5
}
What is the best way I can "tell" my system that "Spal" and "Spal 2013" are the same team? (This is just an example; it can happen for many events, teams and players.)
In the end, I chose to use the string-similarity package, which has the method findBestMatch(item, targetStrings).
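For reference, a minimal sketch of the same fuzzy-matching idea in Python, using the standard difflib module (the string-similarity package's findBestMatch does essentially the same thing in JavaScript). The team list and the 0.5 threshold below are just illustrative assumptions:
from difflib import SequenceMatcher

def best_match(name, candidates):
    # Return (score, candidate) for the candidate most similar to `name`.
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c) for c in candidates]
    return max(scored)

known_teams = ["Juventus", "Spal 2013", "Inter"]   # hypothetical canonical list
score, team = best_match("Spal", known_teams)
if score > 0.5:                                    # threshold is a tuning choice
    print("'Spal' matched '%s' with score %.2f" % (team, score))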
We're using a CosmosDB Graph API instance provisioned with 120K RUs. We've set up a consistent partitioning structure using the /partition_key property.
When querying our graph using Gremlin, we've noticed that some queries use an unreasonably high amount of RUs compared to other queries. The queries themselves are identical except for the partition_key value.
The following query costs 23.25 RUs, for example:
g.V().has('partition_key', 'xxx')
Whereas the same query with a different partition_key value costs 4.14 RUs:
g.V().has('partition_key', 'yyy')
Looking at the .executionProfile() results for both queries, they look similar.
The expensive query which costs 23.25 RUs (xxx):
[
{
"gremlin": "g.V().has('partition_key', 'xxx').executionProfile()",
"activityId": "ec181c9d-59a1-4849-9c08-111d6b465b88",
"totalTime": 12,
"totalResourceUsage": 19.8,
"metrics": [
{
"name": "GetVertices",
"time": 12.324,
"stepResourceUsage": 19.8,
"annotations": {
"percentTime": 98.78,
"percentResourceUsage": 100
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 848,
"storageCount": 1,
"storageSize": 791,
"time": 12.02,
"storeResourceUsage": 19.8
}
]
},
{
"name": "ProjectOperator",
"time": 0.15259999999999962,
"stepResourceUsage": 0,
"annotations": {
"percentTime": 1.22,
"percentResourceUsage": 0
},
"counts": {
"resultCount": 1
}
}
]
}
]
The cheap query which costs 4.14 RUs (yyy):
[
{
"gremlin": "g.V().has('partition_key', 'yyy').executionProfile()",
"activityId": "841e1c37-471c-461e-b784-b53893a3c349",
"totalTime": 6,
"totalResourceUsage": 3.08,
"metrics": [
{
"name": "GetVertices",
"time": 5.7595,
"stepResourceUsage": 3.08,
"annotations": {
"percentTime": 98.71,
"percentResourceUsage": 100
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 862,
"storageCount": 1,
"storageSize": 805,
"time": 5.4,
"storeResourceUsage": 3.08
}
]
},
{
"name": "ProjectOperator",
"time": 0.07500000000000018,
"stepResourceUsage": 0,
"annotations": {
"percentTime": 1.29,
"percentResourceUsage": 0
},
"counts": {
"resultCount": 1
}
}
]
}
]
Both queries return a single vertex of about the same size.
Can someone please help explain why this is so? And why one is significantly more expensive than the other? Is there some aspect that I don't understand about Cosmos DB partitioning?
Edit 1:
We also did some experimentation by adding other query parameters, such as id and also label. An id clause did indeed reduce the cost of the expensive query from ~23 RUs to ~4.57 RUs. The problem with this approach is that in general it makes all other queries less efficient (i.e. it increases RUs).
For example, other queries (like the fast one in this ticket) go from ~4.14 RUs to ~4.80 RUs, with the addition of an id clause. So that's not really feasible as 99% of queries would be worse off. We need to find the root cause.
Edit 2:
Queries are run on the same container using the Data Explorer tool in Azure Portal. Here's the partition distribution graph:
The "issue" you're describing can be related to the size boundaries of physical partitions (PP) and logical partition (LP). Cosmos DB allows "infinite" scaling based on its partitioning architecture. But scaling performance and data growth highly depends on logical partition strategy. Microsoft highly recommend to have as granular LP key as possible so data will be equally distributed across PPs.
Initially when creating container for 120k RUs you will end-up with 12 PP - 10k RUs limit per physical partition. Once you start loading data it is possible to end-up with more PP. Following scenarios might lead to "split":
size of the LP (total size of data per your partition key) is larger than 20GB
size of PP (total sum of all LP sizes stored in this PP) is larger than 50GB
As per the documentation:
A physical partition split simply creates a new mapping of logical partitions to physical partitions.
Based on the PP storage allocation, it looks like you had multiple "splits", resulting in ~20 PPs being provisioned.
Every time a "split" occurs, Cosmos DB creates two new PPs and divides the existing PP equally between them. Once this process is finished, the "skewed" PP is deleted. You can roughly estimate the number of splits from the PP ids on the Metrics chart you provided (you would have ids [1-12] if no splits had happened).
Splits can potentially result in higher RU consumption due to request fan-out and cross-partition queries.
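A rough back-of-the-envelope sketch of the partition math described above (the ~20 PP figure is an estimate read from the metrics chart, not an exact number):
provisioned_rus = 120_000
rus_per_physical_partition = 10_000        # throughput limit per physical partition (PP)

initial_pps = provisioned_rus // rus_per_physical_partition
print("PPs at container creation:", initial_pps)             # 12

# Each split replaces one PP with two, i.e. a net gain of one PP per split,
# so ~20 PPs visible on the metrics chart implies roughly 8 splits.
estimated_current_pps = 20
print("Estimated splits:", estimated_current_pps - initial_pps)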
Assume I have a LINK_ID which I got using the LINK_FC5 layer with search/proximity resource described here.
Note that the search/proximity resource doesn't allow non-geometric layers such as LINK_ATTRIBUTE_FC5 to be specified.
Errorcode: 400, message: Provided layer does not contain geometries.
The documentation suggests using the tile resource to get the non-geometric layers, but that seems quite inefficient: within one tile there are many LINK_IDs. It's hard for me to believe there is no better way to do this. Hence the question:
What is an efficient way to retrieve all attributes from the LINK_ATTRIBUTE_FC5 layer using the LINK_ID?
You need to combine two APIs to get the desired attributes for your link IDs. A tile comprises multiple layers, and each layer contains multiple link IDs, so you need to associate these resources with each other. You can also refer to this example: https://tcs.ext.here.com/examples/v3/pde_get_any_link_info
The first API is:
https://s.fleet.ls.hereapi.com/1/index.json?layer=ROAD_GEOM_FCn&attributes=LINK_ID&values=548294575,833539855,550088940,930893121&apiKey=xxx
The response will look like this:
"Layers": [
{
"layer": "ROAD_GEOM_FC5",
"level": 13,
"tileXYs": [
{
"x": 8580,
"y": 6376
}
]
},
{
"layer": "ROAD_GEOM_FC1",
"level": 9,
"tileXYs": [
{
"x": 534,
"y": 397
},
{
"x": 536,
"y": 398
}
]
}
From here you get the level, the tile x/y and the layer.
The second API is:
https://s.fleet.ls.hereapi.com/1/tiles.json?apiKey=xx&tilexy=536,398&levels=13&layers=LINK_ATTRIBUTE_FC5
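Putting the two calls together, here is a sketch in Python using requests. The exact shape of the tiles.json response and the LINK_ID field name inside its rows are assumptions to verify against the PDE documentation:
import requests

API_KEY = "your_api_key"                      # placeholder
link_ids = "548294575,833539855"              # LINK_IDs you already have

# 1) index.json: find which tiles (level + x/y) contain these LINK_IDs
index_resp = requests.get(
    "https://s.fleet.ls.hereapi.com/1/index.json",
    params={"layer": "ROAD_GEOM_FC5", "attributes": "LINK_ID",
            "values": link_ids, "apiKey": API_KEY},
).json()

# 2) tiles.json: fetch the LINK_ATTRIBUTE_FC5 tiles for those coordinates
#    and keep only the rows for your links
wanted = set(link_ids.split(","))
for layer in index_resp.get("Layers", []):
    for xy in layer.get("tileXYs", []):
        tiles_resp = requests.get(
            "https://s.fleet.ls.hereapi.com/1/tiles.json",
            params={"layers": "LINK_ATTRIBUTE_FC5",
                    "levels": layer["level"],
                    "tilexy": "%d,%d" % (xy["x"], xy["y"]),
                    "apiKey": API_KEY},
        ).json()
        for tile in tiles_resp.get("Tiles", []):          # response shape assumed
            for row in tile.get("Rows", []):
                if str(row.get("LINK_ID")) in wanted:
                    print(row)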
I am looking to use HERE's geocoding service to locate the lat and lon of a place based on a UK postcode. At the moment my request returns a rough location even though I have provided a full postcode.
The old "geocode" API that I used previously would return relevant results; however, it has been put into maintenance and replaced with the "geocode and search" API. This new API seems to just look through a list of stored points of interest within HERE's database and return the closest match to what you searched for, rather than trying to find the exact location entered.
How can I get more accurate results using the request below? Bear in mind that I will only have access to the postcode.
https://geocode.search.hereapi.com/v1/geocode?q={postCode}&apiKey={key}
At the moment I receive a response similar to the one below using postcode PE1 1QL. It should point to a car park; however, if you enter the lat and lon returned from the API into a map, e.g. Google Maps, it gives you a more general location rather than an accurate one.
{
"title": "PE1 1, Peterborough, England",
"id": "here:cm:namedplace:22221149",
"resultType": "locality",
"localityType": "postalCode",
"address": {
"label": "PE1 1, Peterborough, England",
"countryCode": "GBR",
"countryName": "England",
"county": "Cambridgeshire",
"city": "Peterborough",
"postalCode": "PE1 1"
},
"position": {
"lat": 52.57362,
"lng": -0.24219
},
"mapView": {
"west": -0.23515,
"south": 52.56739,
"east": -0.25194,
"north": 52.57984
},
"scoring": {
"queryScore": 0.67,
"fieldScore": {
"postalCode": 0.95
}
}
},
I would expect the lat and lng to be much closer to the postcode entered than in the above example.
According to the release notes at https://developer.here.com/documentation/geocoding-search-api/release_notes/topics/known-issues.html, "High precision postal codes are not yet supported":
Known Issues
The following table lists issues known to be present in the current release.
Search for intersections is not yet supported
Search by telephone numbers is not yet supported
Political views are not yet supported. All views are “International”
Places detail views are not yet supported
High precision postal codes are not yet supported
The Geocoder API 6.2 will be supported at least until the end of 2020 (possibly longer), and "Maintenance" in the documentation means: no new features.
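If you need the more precise postcode positions in the meantime, one option is to keep calling the Geocoder API 6.2 while it is still supported. Below is a minimal sketch; the 6.2 endpoint, parameter names and response structure shown here are from the legacy API and are worth double-checking against its documentation:
import requests

API_KEY = "your_api_key"        # placeholder
postcode = "PE1 1QL"

resp = requests.get(
    "https://geocoder.ls.hereapi.com/6.2/geocode.json",
    params={"searchtext": postcode, "country": "GBR", "apiKey": API_KEY},
).json()

# Drill down to the first match's coordinates in the 6.2 response structure
view = resp["Response"]["View"]
if view:
    pos = view[0]["Result"][0]["Location"]["DisplayPosition"]
    print(pos["Latitude"], pos["Longitude"])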
I am working with Microsoft Cognitive Service's Language Understanding Service API, LUIS.ai.
Whenever text is parsed by LUIS, whitespace tokens are always inserted around punctuation.
This behavior is intentional, according to the documentation.
"English, French, Italian, Spanish: token breaks are inserted at any
whitespace, and around any punctuation."
For my project, I need to preserve the original query string, without these tokens, as some entities trained for my model will include punctuation, and it's annoying and a bit hacky to strip the extra whitespace from the parsed entities.
Example of this behavior:
Is there a way to disable this? It would save quite a bit of effort.
Thanks!!
Unfortunately there's no way to disable that for now, but the good news is that the predictions returned will deal with the original string, not the tokenized one you see in the example labeling process.
In the documentation on how to understand the JSON response, you can see that the example output preserves the original "query" string, and the extracted entities carry zero-based character indices ("startIndex", "endIndex") into the original string; this lets you work with the indices instead of the parsed entity phrases.
{
"query": "Book me a flight to Boston on May 4",
"intents": [
{
"intent": "BookFlight",
"score": 0.919818342
},
{
"intent": "None",
"score": 0.136909246
},
{
"intent": "GetWeather",
"score": 0.007304534
}
],
"entities": [
{
"entity": "boston",
"type": "Location::ToLocation",
"startIndex": 20,
"endIndex": 25,
"score": 0.621795356
},
{
"entity": "may 4",
"type": "builtin.datetime.date",
"startIndex": 30,
"endIndex": 34,
"resolution": {
"date": "XXXX-05-04"
}
}
]
}
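For example, here is a small sketch of pulling the original (untokenized) entity text back out of the query string using those indices; note that in the response above endIndex is inclusive ("boston" spans 20-25), so the slice needs a +1:
response = {
    "query": "Book me a flight to Boston on May 4",
    "entities": [
        {"entity": "boston", "type": "Location::ToLocation", "startIndex": 20, "endIndex": 25},
        {"entity": "may 4", "type": "builtin.datetime.date", "startIndex": 30, "endIndex": 34},
    ],
}

query = response["query"]
for ent in response["entities"]:
    original_text = query[ent["startIndex"]: ent["endIndex"] + 1]
    print(ent["type"], "->", original_text)    # "Boston", "May 4"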
I am trying to scrape all the objects with the same tag from a specific site (Google Scholar) with BeautifulSoup, but it doesn't scrape the objects hidden behind the "show more" button at the end of the page. How can I fix this?
Here's an example of my code:
# -*- coding: cp1253 -*-
from urllib import urlopen
from bs4 import BeautifulSoup
webpage=urlopen('http://scholar.google.gr/citations?user=FwuKA4UAAAAJ&hl=el')
soup=BeautifulSoup(webpage)
for t in soup.findAll('a',{"class":"gsc_a_at"}):
    print t.text
You have to pass pagination parameters to the request URL; a sketch follows the parameter descriptions below.
cstart - Parameter defines the result offset. It skips the given number of results. It's used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.).
pagesize - Parameter defines the number of results to return. (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). Maximum number of results to return is 100.
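Applied to the original BeautifulSoup approach, that means requesting the profile URL again with those parameters added. Here is a sketch in Python 3 (Google may still rate-limit or block plain scripted requests):
from urllib.request import urlopen
from bs4 import BeautifulSoup

base = "http://scholar.google.gr/citations?user=FwuKA4UAAAAJ&hl=el"
cstart, pagesize = 0, 100            # start at the first result, 100 per page

while True:
    url = "%s&cstart=%d&pagesize=%d" % (base, cstart, pagesize)
    soup = BeautifulSoup(urlopen(url), "html.parser")
    titles = soup.find_all("a", {"class": "gsc_a_at"})
    if not titles:
        break                        # no more results
    for t in titles:
        print(t.text)
    cstart += pagesize               # move to the next page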
You could also use a third party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example Python code (other languages are also available) to retrieve the second page of results:
from serpapi import GoogleSearch
params = {
"engine": "google_scholar_author",
"hl": "en",
"author_id": "FwuKA4UAAAAJ",
"start": "20",
"api_key": "secret_api_key"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"articles": [
{
"title": "MuseumScrabble: Design of a mobile game for children’s interaction with a digitally augmented cultural space",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:RHpTSmoSYBkC",
"citation_id": "FwuKA4UAAAAJ:RHpTSmoSYBkC",
"authors": "C Sintoris, A Stoica, I Papadimitriou, N Yiannoutsou, V Komis, N Avouris",
"publication": "Social and organizational impacts of emerging mobile devices: Evaluating use …, 2012",
"cited_by": {
"value": 69,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=6286720977869955347",
"serpapi_link": "https://serpapi.com/search.json?cites=6286720977869955347&engine=google_scholar&hl=en",
"cites_id": "6286720977869955347"
},
"year": "2012"
},
{
"title": "The effective combination of hybrid usability methods in evaluating educational applications of ICT: Issues and challenges",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:hqOjcs7Dif8C",
"citation_id": "FwuKA4UAAAAJ:hqOjcs7Dif8C",
"authors": "N Tselios, N Avouris, V Komis",
"publication": "Education and Information Technologies 13 (1), 55-76, 2008",
"cited_by": {
"value": 68,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1046912849634390721",
"serpapi_link": "https://serpapi.com/search.json?cites=1046912849634390721&engine=google_scholar&hl=en",
"cites_id": "1046912849634390721"
},
"year": "2008"
},
...
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
In Chrome, try F12 --> Network, select 'Preserve log' and disable cache.
Now hit the show more button.
Check the GET/POST request being sent. You will know what to do next.