How to do a paginated crawl? - web-scraping

I am trying to run the following script, which goes to Flipkart, crawls all the product links, and extracts the product, price, and description. However, it only grabs one page. I want to repeat the crawl across all pages (page 1, 2, 3, etc.).
GOTO flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
CRAWL //div[2]/div[2]/div[1]/div//div[1]/a[@class="_2cLu-l"][1]
EXTRACT {
  "product": "//span[@class=\"_35KyD6\"][1]",
  "price": "//div[@class=\"_1vC4OE _3qQ9m1\"][1]",
  "description": "//div[@class=\"_3u-uqB\"][1]"
}

You need to put a paginator, written as [[xpath_for_nextpage_element]], in front of the crawl XPath.
In this case the XPath for the "next page" link is //nav/a[11]/span. Wrap it in [[ and ]] and put it right after the CRAWL keyword, before the XPath for the product links.
So the paginator is: [[//nav/a[11]/span]].
GOTO flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
CRAWL [[//nav/a[11]/span]] //div[2]/div[2]/div[1]/div//div[1]/a[@class="_2cLu-l"][1]
EXTRACT {
  "product": "//span[@class=\"_35KyD6\"][1]",
  "price": "//div[@class=\"_1vC4OE _3qQ9m1\"][1]",
  "description": "//div[@class=\"_3u-uqB\"][1]"
}
This is now a scraper that follows the "next page" link and grabs the product information from every page of results.
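For readers who want to see the idea outside this scripting language, here is a minimal Python sketch of what a paginator does: collect the links on a page, follow the "next page" pointer, and stop when there is none. The `SITE` dictionary and its URLs are hypothetical stand-ins for fetched pages, and `crawl_all_pages` is an illustrative name, not part of any tool mentioned above.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory "site": each page lists its product links and the
# URL of the next page (None on the last page). A real crawler would fetch
# and parse HTML here instead.
SITE: Dict[str, dict] = {
    "/search?page=1": {"links": ["/p/a", "/p/b"], "next": "/search?page=2"},
    "/search?page=2": {"links": ["/p/c"], "next": "/search?page=3"},
    "/search?page=3": {"links": ["/p/d", "/p/e"], "next": None},
}


def crawl_all_pages(start: str, site: Dict[str, dict] = SITE,
                    max_pages: int = 100) -> List[str]:
    """Collect product links from `start` and every page reachable via 'next'."""
    links: List[str] = []
    seen = set()
    url: Optional[str] = start
    # Stop on the last page, on a loop back to a visited page, or at the cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = site[url]
        links.extend(page["links"])
        url = page["next"]
    return links


if __name__ == "__main__":
    print(crawl_all_pages("/search?page=1"))
```

The `seen` set and `max_pages` cap are there because a "next" link can point back to an earlier page, which would otherwise loop forever.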

Related

Google Sheets: Importing numbers from website using ImportXML

I have no experience coding!
I am having trouble scraping data from a website into my Google spreadsheet. I want to get the Observation number into my spreadsheet from this page.
I have tried this but honestly have no idea what I'm doing:
=IMPORTXML(A3,"//*[@id="obsstatcol"]/div/div[1]")
A3 holds the above page URL, and the rest is a mash-up of some tutorial I found and the XPath copied from the observation value I'm trying to scrape off the page.
Can anyone help me make sense of what the hell I'm trying to do and offer some advice?
Thanks in advance
Good attempt! Unfortunately, the observation number is not filled in until after the page loads. That means that your formula:
=IMPORTXML(A3,"//*[@id=""obsstatcol""]/div/div[1]")
yields
{{ shared.numberWithCommas( totalObservations ) }}
so you cannot just use ImportXML() in this case.
However, all is not lost. I opened the network monitor with F12, and saw that the page was making a web request to this url:
https://api.inaturalist.org/v1/observations/observers?verifiable=any&quality_grade=needs_id&user_id=ericthuranira&locale=en-US
to get the observation data, which appears to be in JSON format. E.g. (formatted for readability)
{
"total_results": 1,
"page": 1,
"per_page": 500,
"results": [
{
"user_id": 1265521,
"observation_count": 121,
"species_count": 42,
"user": {
"id": 1265521,
"login": "ericthuranira",
"spam": false,
"suspended": false,
"created_at": "2018-10-09T11:43:22+00:00",
"login_autocomplete": "ericthuranira",
"login_exact": "ericthuranira",
"name": "Eric Thuranira",
"name_autocomplete": "Eric Thuranira",
"orcid": null,
"icon": "https://static.inaturalist.org/attachments/users/icons/1265521/thumb.jpeg?1580369132",
"observations_count": 237,
"identifications_count": 203,
"journal_posts_count": 0,
"activity_count": 440,
"species_count": 150,
"universal_search_rank": 237,
"roles": [],
"site_id": 1,
"icon_url": "https://static.inaturalist.org/attachments/users/icons/1265521/medium.jpeg?1580369132"
}
}
]
}
This is not in XML format, so you'll have to use a JSON parser to read it. Fortunately, somebody has made one for Google Sheets! You can easily get this for yourself by doing the following:
Paste the code from here into your script editor (Tools > Script Editor), and save it as ImportJSON. This gives you your JSON parser.
Taking the API URL I mentioned above for the observers, use this formula (assuming the URL is in A3):
=ImportJSON(A3,"/results/observation_count","noHeaders")
And this will get you the number you want.
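To make the "/results/observation_count" path concrete, here is a small Python sketch of what such a path selects from the response shown above. It is not the ImportJSON implementation; `select` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the response keeping only the fields used.

```python
import json
from typing import Any, List

# Trimmed copy of the API response shown above (only the fields used here).
SAMPLE = """
{
  "total_results": 1,
  "results": [
    {"user_id": 1265521, "observation_count": 121, "species_count": 42}
  ]
}
"""


def select(data: Any, path: str) -> List[Any]:
    """Return every value reached by a slash-separated path.

    Lists are expanded, so a key applies to each element of a list.
    """
    current = [data]
    for key in path.strip("/").split("/"):
        next_level = []
        for node in current:
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_level.append(item[key])
        current = next_level
    return current


if __name__ == "__main__":
    print(select(json.loads(SAMPLE), "/results/observation_count"))
```

With the sample above, the path walks into "results", expands the one-element list, and picks out the "observation_count" of that element.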

What should the "author" field for a LinkedIn UGC post be for Showcase/Brand pages?

I am trying to specify an author for a UGC post to a showcase page. I am expecting that the author of the showcase post is the showcase page itself, which is what happens when I manually create a post, but this doesn't seem to work with the API.
Let's say I have a showcase urn:li:organizationBrand:123456. If I specify the showcase as the author ("author": "urn:li:organizationBrand:123456") I get an error about an invalid "author" field. But if I wrap the brand URN ID with "organization" instead of "organizationBrand" ("author": "urn:li:organization:123456") it works, but I have not found this interchangeability documented anywhere.
This same workaround works for retrieving post stats (/organizationalEntityShareStatistics).
Can anyone explain what the right approach is supposed to be?
Are organization brand URNs meant to effectively be an alias of organization URNs?
You can use the organizationalEntityAcls API to find your URN. Note that organization URNs are not necessarily interchangeable with organizationBrand URNs.
For example:
GET https://api.linkedin.com/v2/organizationalEntityAcls?q=roleAssignee
{
"paging": {
"count": 10,
"start": 0
},
"elements": [
{
"state": "APPROVED",
"role": "ADMINISTRATOR",
"roleAssignee": "urn:li:person:R8302pZx",
"organizationalTarget": "urn:li:organization:1000"
}
]
}
source: https://learn.microsoft.com/en-us/linkedin/marketing/integrations/community-management/organizations/organization-access-control#find-access-control-information
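As a rough illustration of reading that response, the Python sketch below pulls out the `organizationalTarget` URNs for which the caller is an approved administrator. The `RESPONSE` dictionary is copied from the example above; `admin_targets` is a hypothetical helper name, and whether the returned URN is accepted as "author" for a given page is something to verify against the API rather than assume.

```python
from typing import Dict, List

# Response body copied from the example above.
RESPONSE: Dict = {
    "paging": {"count": 10, "start": 0},
    "elements": [
        {
            "state": "APPROVED",
            "role": "ADMINISTRATOR",
            "roleAssignee": "urn:li:person:R8302pZx",
            "organizationalTarget": "urn:li:organization:1000",
        }
    ],
}


def admin_targets(response: Dict) -> List[str]:
    """URNs of the entities where the caller is an approved administrator."""
    return [
        element["organizationalTarget"]
        for element in response.get("elements", [])
        if element.get("state") == "APPROVED"
        and element.get("role") == "ADMINISTRATOR"
    ]


if __name__ == "__main__":
    print(admin_targets(RESPONSE))
```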

structure data for different queries

I am learning firebase and trying to find the best way to structure my data.
Take the example of a simple leave application. Employees can submit and view their leaves. Managers can approve leaves.
Option 1
"leaves": [
{
"employee": "pCIUfttSrXQ1dLPDwH7j9GExCkA2",
"date": "2017-03-01",
"status": "pendingApproval"
},
{
"employee": "YSJCAe4wZdYCplA3e0ejMqzQmEF3",
"date": "2017-01-01",
"status": "approved"
}]
With option 1, filtering will be required in both cases:
When an employee lists his leave history (filter by "employee")
When a manager lists all the pending leaves (filter by "status=pendingApproval")
Option 2
"leaves":
{
"pCIUfttSrXQ1dLPDwH7j9GExCkA2" : [
{
"date": "2017-03-01",
"status": "pendingApproval"
}
],
"YSJCAe4wZdYCplA3e0ejMqzQmEF3" : [
{
"date": "2017-01-01",
"status": "approved"
}
]
}
With option 2, no filtering is required when an employee lists his leave history, but filtering is required (and I don't know how) for the manager to list pending leaves.
What should be the right way to structure the data? And if it's option 2, how would we filter the pending leaves for all employees?
Use the second option.
For the manager to filter the pending leaves, use:
// queryOrdered(byChild:) compares the "status" key of each direct child of "leaves"
FIRDatabase.database().reference().child("leaves").queryOrdered(byChild: "status").queryEqual(toValue: "pendingApproval").observeSingleEvent(of: .value, with: { snapshot in
    // snapshot.value is optional, so avoid force-unwrapping it
    print(snapshot.value ?? "no pending leaves")
    // If you have multiple pending requests, loop over snapshot.children
    // and handle each one as a separate entity
})
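To show what the manager's filter has to do when leaves are grouped per employee, here is a plain-Python sketch that walks the Option 2 structure and collects the pending entries. It does not use the Firebase API: `LEAVES` is the question's data as it would look once read into memory, and `pending_leaves` is a hypothetical helper. Because it scans every employee's list, it only suits small data sets.

```python
from typing import Dict, List, Tuple

# Option 2 data from the question, as a plain in-memory structure.
LEAVES: Dict[str, List[Dict[str, str]]] = {
    "pCIUfttSrXQ1dLPDwH7j9GExCkA2": [
        {"date": "2017-03-01", "status": "pendingApproval"}
    ],
    "YSJCAe4wZdYCplA3e0ejMqzQmEF3": [
        {"date": "2017-01-01", "status": "approved"}
    ],
}


def pending_leaves(leaves: Dict[str, List[Dict[str, str]]],
                   status: str = "pendingApproval") -> List[Tuple[str, str]]:
    """Return (employee, date) pairs for every leave with the given status."""
    return [
        (employee, leave["date"])
        for employee, entries in leaves.items()
        for leave in entries
        if leave["status"] == status
    ]


if __name__ == "__main__":
    print(pending_leaves(LEAVES))
```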

Stop GA reporting from appending the domain to the path when previewing

I'm using Google Analytics to track data across multiple domains in a single profile.
By default, reporting only shows the path, not the full URL. This makes it quite confusing where multiple pages on our different domains have the same paths (e.g. '/index' or '/about').
To get round this, I've implemented the filter advised by Google to display the full URL in reporting:
Filter Type: Custom filter > Advanced
Field A: Hostname Extract A: (.*)
Field B: Request URI Extract B: (.*)
Output To: Request URI Constructor: $A1$B1
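A short Python sketch of what this filter computes may help: each field is captured whole by (.*), and the constructor $A1$B1 joins the two captures. The function name and the example hostname are hypothetical; this mimics the filter's effect and is not how Google Analytics implements it.

```python
import re


def apply_advanced_filter(hostname: str, request_uri: str) -> str:
    """Mimic the filter: capture each field with (.*) and join as $A1$B1."""
    a1 = re.fullmatch(r"(.*)", hostname).group(1)     # Field A -> Extract A
    b1 = re.fullmatch(r"(.*)", request_uri).group(1)  # Field B -> Extract B
    return a1 + b1                                    # Constructor: $A1$B1


if __name__ == "__main__":
    # Hypothetical hostname, used only for illustration.
    print(apply_advanced_filter("www.example.com", "/about"))
```

The reported "Request URI" therefore already contains the hostname, which is why anything that later puts a domain in front of it produces a broken address.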
This works just fine; the only downside is that using the 'preview link' button in the reporting always appends the domain, resulting in a 404 error.
...clicking the 'link preview' icon results in...
Does anyone know a way around this, either by preventing GA from appending the domain or by a better way of displaying the full URLs in reporting?
Thanks Eike, I took your advice and wrote a small browser extension for Chrome. Obviously this isn't essential, but I wanted to address it because our marketing team uses the feature so frequently.
The manifest JSON:
{
"manifest_version": 2,
"name": "Analytics cross-domain link shortcut",
"version": "1.0",
"description": "Makes the links shortcuts in analytics work when using a 'full url' filter!",
"content_scripts":
[
{
"matches": ["*://*/*"],
"js": ["myscript.js"],
"run_at": "document_start"
}
]
}
And the script:
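// Only act on freshly opened popup windows that have no referrer
if (window.opener && document.referrer === "") {
  var currentLocation = window.location.href;
  var appended = "www.appendedurl.com/"; // the domain that GA puts in front of the full URL
  var index = currentLocation.indexOf(appended);
  if (index > -1) {
    // Drop everything up to and including the appended domain
    var newLocation = currentLocation.substring(index + appended.length);
    window.location.href = "http://" + newLocation;
  }
}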
So it's essentially just snipping off the appended URL (if present) on freshly opened popup windows.
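The snipping step can be checked in isolation. Below is a minimal Python sketch of the same idea, useful for testing the string handling outside the browser. `strip_appended_host` and the example URLs are hypothetical, and the `http` default scheme simply mirrors the script above.

```python
def strip_appended_host(url: str, appended_host: str, scheme: str = "http") -> str:
    """Remove `appended_host` from the front of `url`, keeping the real host and path."""
    marker = appended_host.rstrip("/") + "/"
    index = url.find(marker)
    if index == -1:
        return url  # nothing was appended; leave the URL untouched
    remainder = url[index + len(marker):]
    if not remainder:
        return url  # only the appended host itself; nothing to redirect to
    return f"{scheme}://{remainder}"


if __name__ == "__main__":
    print(strip_appended_host(
        "http://www.appendedurl.com/www.example.org/about",
        "www.appendedurl.com",
    ))
```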

Relate two entities using properties in Freebase

I want to find out how Wenjin SU and Jimei University are related in Freebase. I have found that Wenjin SU has the type /business/board_member/, which has the property /business/board_member/leader_of. How can I use this information in a Freebase MQL query to extract the term or mid of Jimei University?
If you go to the Freebase page for Wenjin SU, you see that he has the type /business/board_member/ and that, under that section, it lists him as the /business/board_member/leader_of Jimei University.
The first thing you should do is go to the Query Editor and create a skeleton MQL query for that relationship:
{
"id": "/m/0sxhm9v",
"name": null,
"/business/board_member/leader_of": [{}]
}
When you run this query you get the following result:
{
"result": {
"name": "Wenjin SU",
"/business/board_member/leader_of": [{
"name": null,
"type": [
"/organization/leadership"
],
"id": "/m/0sxhm9s"
}],
"id": "/m/0sxhm9v"
}
}
This is not quite what you were asking for. It's saying that he is the leader_of an unnamed topic /m/0sxhm9s. Now, if you visit the Freebase page for that topic, you'll see that it's a mediator node that connects a person and their role to an organization for a specific date range. You'll also notice that Jimei University is listed as the /organization/leadership/organization on this page.
We can now add this mediated property to our MQL query to get the full relationship that you're looking for:
{
"id": "/m/0sxhm9v",
"name": null,
"/business/board_member/leader_of": [{
"/organization/leadership/organization": {
}
}]
}
If you're building an application that has a pre-determined set of relationships like this then you can use this process of exploring the Freebase data to build MQL queries for those relationships. If you're looking to find any arbitrary connection between any two entities in Freebase then you'll need to download the Freebase Data Dumps and run a shortest path algorithm over the entire graph.
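For the arbitrary-connection case, a breadth-first search is one common shortest path algorithm for unweighted graphs. The Python sketch below runs it on a tiny hand-built graph using the two mids from the example. The node name "jimei_university" is a hypothetical placeholder, since the university's mid is not shown above, and a real run over the data dumps would need the graph loaded from those files first.

```python
from collections import deque
from typing import Dict, List, Optional

# Tiny undirected graph built from the example above.
# "jimei_university" is a hypothetical placeholder for the university's mid.
GRAPH: Dict[str, List[str]] = {
    "/m/0sxhm9v": ["/m/0sxhm9s"],                      # Wenjin SU
    "/m/0sxhm9s": ["/m/0sxhm9v", "jimei_university"],  # leadership mediator node
    "jimei_university": ["/m/0sxhm9s"],
}


def shortest_path(graph: Dict[str, List[str]], start: str,
                  goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the node list from start to goal, or None."""
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the 'previous' links back to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


if __name__ == "__main__":
    print(shortest_path(GRAPH, "/m/0sxhm9v", "jimei_university"))
```

The path found here passes through the mediator node, matching the two-hop relationship that the MQL query above spells out by hand.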

Resources