How to reverse engineer POST request's body generation - web-scraping

I'm trying to scrape reviews from Google Play. Google Play loads reviews dynamically after the page has been scrolled to the end. I intercepted the POST requests the browser sends to retrieve reviews and noticed that the only thing that changes per request is the request body. What I'm struggling to understand is how the request body is generated.
The first request's body looked like this:
f.req: [[["UsvDTd","[null,null,[2,null,[40,null,\"CpUBCpIBKm0KOfc7ms0D_z7jKJielp7Fz8_Pz8_Pms3OzpuZyJvMnMXOxYmSxc3MyczPz8vIycjMysbHxszPysb__hAoITbZQaENmbWoMU2VCwWZPGwZOdccwQD8MmXEUABaCwlwT4zmNQBa2BADYMm1lu0EMiEKHwodYW5kcm9pZF9oZWxwZnVsbmVzc19xc2NvcmVfdjI\"],null,[]],[\"com.feelingtouch.zf3d\",7]]",null,"generic"]]]
and this is the second request:
f.req: [[["UsvDTd","[null,null,[2,null,[40,null,\"CpUBCpIBKm0KOfc7msyg_28-Rpielp7Fz8_Pz8_Pm56eypyZzcycm8XOxYmSxc3MyczPz8vIycjMysbHxszPysb__hB4ITbZQaENmbWoMZI5V7V-7g3BObnBkABfM2XEUABaCwli2aizD1W9ExADYMm1lu0EMiEKHwodYW5kcm9pZF9oZWxwZnVsbmVzc19xc2NvcmVfdjI\"],null,[]],[\"com.feelingtouch.zf3d\",7]]",null,"generic"]]]
Can I somehow reverse engineer how the request is generated?
I tried to use Selenium, but after scrolling down a few dozen times, RAM usage climbs and Selenium becomes unresponsive.

The main thing that changes is the pagination token, but there are a couple of other parameters as well.
Here is the full encoded request body, with the parameters wrapped in #{} (number_of_results, pagination_token, and product_id).
f.req=%5B%5B%5B%22UsvDTd%22%2C%22%5Bnull%2Cnull%2C%5B2%2Cnull%2C%5B#{number_of_results}%2Cnull%2C#{pagination_token}%5D%2Cnull%2C%5B%5D%5D%2C%5B%5C%22#{product_id}%5C%22%2C7%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D
So each time you scroll the page, the pagination_token changes; it's used to retrieve the next page of results.
You don't need to reverse engineer the token itself. You can find the first one by inspecting the page source, and then each response includes the next_page_token for the following request. So you just keep replacing the token until you reach the last page, and retrieve all the reviews.
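For reference, here's a minimal sketch of that loop in Python with requests. The batchexecute endpoint, the "UsvDTd" RPC id, and the offsets used to pull data out of the response are assumptions based on what DevTools shows for these requests; verify them yourself, since Google can change any of them at any time.

import json
import requests

# Endpoint the browser POSTs to when the page loads more reviews (from DevTools).
URL = "https://play.google.com/_/PlayStoreUi/data/batchexecute"
PRODUCT_ID = "com.feelingtouch.zf3d"

def build_body(token, count=40):
    # The inner payload is a JSON string embedded inside JSON, hence the double dumps.
    inner = json.dumps([None, None, [2, None, [count, None, token], None, []],
                        [PRODUCT_ID, 7]])
    return {"f.req": json.dumps([[["UsvDTd", inner, None, "generic"]]])}

token = "..."  # the first token, scraped from the page source
while token:
    resp = requests.post(URL, data=build_body(token))
    envelope = json.loads(resp.text.lstrip(")]}'\n"))  # strip the )]}' prefix
    payload = json.loads(envelope[0][2])                # inner payload is again a JSON string
    reviews = payload[0]                                # review entries (position observed, not guaranteed)
    token = payload[-1][-1] if payload[-1] else None    # next token; empty on the last page
    # ... store `reviews` somewhere ...

Since this is plain HTTP, it sidesteps the Selenium memory problem entirely.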
Alternatively, you could use a third-party solution like SerpApi. We handle proxies, solve captchas, and parse all rich structured data for you.
Example Python code for retrieving reviews of the YouTube app (also available in other languages):
from serpapi import GoogleSearch

params = {
    "api_key": "SECRET_API_KEY",
    "engine": "google_play_product",
    "store": "apps",
    "gl": "us",
    "product_id": "com.google.android.youtube",
    "all_reviews": "true"
}

search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"reviews": [
{
"title": "Qwerty Jones",
"avatar": "https://play-lh.googleusercontent.com/a/AATXAJwSQC_a0OIQqkAkzuw8nAxt4vrVBgvkmwoSiEZ3=mo",
"rating": 3,
"snippet": "Overall a great app. Lots of videos to see, look at shorts, learn hacks, etc. However, every time I want to go on the app, it says I need to update the game and that it's \"not the current version\". I've done it about 3 times now, and it's starting to get ridiculous. It could just be my device, but try to update me if you have any clue how to fix this. Thanks :)",
"likes": 586,
"date": "November 26, 2021"
},
{
"title": "matthew baxter",
"avatar": "https://play-lh.googleusercontent.com/a/AATXAJy9NbOSrGscHXhJu8wmwBvR4iD-BiApImKfD2RN=mo",
"rating": 1,
"snippet": "App is broken, every video shows no dislikes even after I hit the button. I've tested this with multiple videos and now my recommended is all messed up because of it. The ads are longer than the videos that I'm trying to watch and there is always a second ad after the first one. This app seriously sucks. I would not recommend this app to anyone.",
"likes": 352,
"date": "November 28, 2021"
},
{
"title": "Operation Blackout",
"avatar": "https://play-lh.googleusercontent.com/a-/AOh14GjMRxVZafTAmwYA5xtamcfQbp0-rUWFRx_JzQML",
"rating": 2,
"snippet": "YouTube used to be great, but now theyve made questionable and arguably stupid decisions that have effectively ruined the platform. For instance, you now have the grand chance of getting 30 seconds of unskipable ad time before the start of a video (or even in the middle of it)! This happens so frequently that its actually a feasible option to buy an ad blocker just for YouTube itself... In correlation with this, YouTube is so sensitive twords the public they decided to remove dislikes. Why????",
"likes": 370,
"date": "November 24, 2021"
},
...
],
"serpapi_pagination": {
"next": "https://serpapi.com/search.json?all_reviews=true&engine=google_play_product&gl=us&hl=en&next_page_token=CpEBCo4BKmgKR_8AwEEujFG0VLQA___-9zuazVT_jmsbmJ6WnsXPz8_Pz8_PxsfJx5vJns3Gxc7FiZLFxsrLysnHx8rIx87Mx8nNzsnLyv_-ECghlTCOpBLShpdQAFoLCZiJujt_EovhEANgmOjCATIiCiAKHmFuZHJvaWRfaGVscGZ1bG5lc3NfcXNjb3JlX3YyYQ&product_id=com.google.android.youtube&store=apps",
"next_page_token": "CpEBCo4BKmgKR_8AwEEujFG0VLQA___-9zuazVT_jmsbmJ6WnsXPz8_Pz8_PxsfJx5vJns3Gxc7FiZLFxsrLysnHx8rIx87Mx8nNzsnLyv_-ECghlTCOpBLShpdQAFoLCZiJujt_EovhEANgmOjCATIiCiAKHmFuZHJvaWRfaGVscGZ1bG5lc3NfcXNjb3JlX3YyYQ"
}
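To collect every page, keep feeding the returned token back into the request. Here's a minimal sketch, assuming the engine accepts a next_page_token parameter mirroring the serpapi_pagination block above:

from serpapi import GoogleSearch

params = {
    "api_key": "SECRET_API_KEY",
    "engine": "google_play_product",
    "store": "apps",
    "gl": "us",
    "product_id": "com.google.android.youtube",
    "all_reviews": "true"
}

all_reviews = []
while True:
    results = GoogleSearch(params).get_dict()
    all_reviews.extend(results.get("reviews", []))
    pagination = results.get("serpapi_pagination", {})
    if "next_page_token" not in pagination:
        break  # no more pages
    # Assumed parameter name, mirroring the token returned in the output above.
    params["next_page_token"] = pagination["next_page_token"]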
Check out the documentation for more details.
Test the search live on the playground.
Disclaimer: I work at SerpApi.

Related

Google Analytics gtag.js - Ecommerce - Sending product click Event

Decided to use Google's recommended new method of installing Google Analytics: "gtag.js"
I've followed the instructions for measuring "Enhanced Ecommerce" and as part of that build, here's my code to measure a product click:
gtag('event', 'select_content', {
  "content_type": "product",
  "items": [
    {
      "id": "TEST-SKU-1",
      "name": "Test Product",
      "list_name": "Home Page",
      "category": "For Testing",
      "list_position": 1,
      "price": 8.76
    }
  ]
});
I then added that code on the link to the product like this:
<a href="/for-testing/test-product-1" onclick="gtag('event', 'select_content', { 'content_type': 'product', 'items': [{ 'id': 'TEST-SKU-1', 'name': 'Test Product', 'list_name': 'Home Page', 'category': 'For Testing', 'list_position': 1, 'price': 8.76 }] });">
It shows up inside Google Analytics Events (it works), but it shows up with an Event Category/Action/Label of "engagement"/"select_content"/"product". After doing some research, it looks like Google is auto-magically building the Event for me, based off of some new "standard way" they want to do things :-(
QUESTION:
How do I override the gtag.js defaults for Event Category/Action/Label it uses when I send a "Product Click"?
NOTE:
I realize that I can swap out gtag.js for analytics.js or I could use jQuery's post() method and send this directly via Google Analytics' Measurement Protocol but that is NOT my question... I need to figure out "the new correct non-hacky way of doing this"... preferably.
After lots of hacking, inspecting, and painfully time consuming trial & error, I think I got it figured out... One word of caution: This is what I came up with after finding zero info on this subject, so no, I'm NOT 100% certain I've considered everything here. If you know something I don't, please comment below. For those going as insane as I did trying to figure this out, here's what I'm doing:
When you install Google Analytics via the new/default "gtag.js" it actually installs the old "analytics.js" stuff behind the scenes, and it looks like the "gtag.js" is using the "analytics.js" to run code, kinda like a dumbed down wrapper for it.
So what this means is that if you don't like how "gtag.js" is doing something, without editing/hacking/installing/changing anything... all of the old "analytics.js" stuff is available, as long as you remember to do a ga('create'... before the command.
So as a very basic example, here's roughly how you would send a "Test" Event on a link click (the tracking ID is a placeholder):
<a href="/" onclick="ga('create', 'UA-XXXXX-Y', 'auto'); ga('send', 'event', 'Test Category', 'Test Action', 'Test Label');">Go Home Test</a>
So that's the answer to my original question: Do it the "old way" because it is the "old way" under the hood anyway.

No large images in shares posted using LinkedIn API

During the last couple of weeks, shares made using the LinkedIn sharing API haven't displayed large images, even though we provide all the required information, including the image URL. The same happens when we use the REST Console. Below you can see a sample request and what the share looks like.
{
  "comment": "How Triggre achieves its simplicity",
  "content": {
    "title": "Triggre / Blog / Design Philosophy - Part 3",
    "description": "In the previous two posts about our design philosophy you could read how we decided to build Triggre and why we chose simplicity as the core of our desi...",
    "submitted-url": "https://www.triggre.com/en/blog/the-triggre-design-philosophy-part-3/",
    "submitted-image-url": "https://www.triggre.com/media/1105/sagrada-familia.jpg?width=800"
  },
  "visibility": {
    "code": "anyone"
  }
}
[Screenshot: the share rendered without a large image]
What is happening, and how can we work around it?

To get more than 5 reviews from google places API

I am building an application where I extract Google reviews using the Google Places API. When I read the documentation at https://developers.google.com/maps/documentation/javascript/places, I found out that I can only get the top 5 reviews. Is there any option to get more reviews?
In order to have access to more than 5 reviews with the Google API you have to purchase Premium Data Access from Google.
That premium plan will grant you access to all sorts of additional data points, but you'll have to shell out a pretty penny.
If you are a business owner wanting to retrieve all of your reviews, you can do so, but first you have to get verified; you can then do this through the My Business API. More info here: https://developers.google.com/my-business/
There is a feature request for that: Issue 7630: Response to Include More Than 5 Reviews. I'd recommend you "star" it to receive updates.
Unfortunately there's no way to get more than 5 reviews through the Places API unless you are the business owner and get verified, as Tekill said.
But it looks like there are some external services that can get all the reviews. My guess is that they scrape them from Google Maps directly:
Some of these services are Wextractor, ReviewShake and AllReviews
Alternatively, you can use a third party solution like SerpApi to scrape all the reviews of any place. It's a paid API with a free trial.
Each page fetches 10 results. To implement pagination, just use the start parameter, which defines the result offset (e.g., 0 (default) is the first page of results, 10 is the 2nd page, 20 is the 3rd page, etc.); see the sketch after the example output below.
Example Python code (also available in other languages):
from serpapi import GoogleSearch

params = {
    "engine": "google_maps_reviews",
    "place_id": "0x89c259a61c75684f:0x79d31adb123348d2",
    "api_key": "SECRET_API_KEY"
}

search = GoogleSearch(params)
results = search.get_dict()
reviews = results['reviews']
Example output:
"reviews": [
{
"user": {
"name": "Waylon Bilbrey",
"link": "https://www.google.com/maps/contrib/107691056156160235121?hl=en-US&sa=X&ved=2ahUKEwiUituIlpTvAhVYCc0KHbvTCrgQvvQBegQIARAx",
"thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GjOj6Wjfk1kSYjhvH7WIBNMdl4nPj6FvUhvYcR6=s40-c0x00000000-cc-rp",
"reviews": 1
},
"rating": 4,
"date": "a week ago",
"snippet": "I've been here multiple times. The coffee itself is just average to me. The service is good (the people working are nice). The aesthetic is obviously what brings the place some fame. A little overpriced (even for NY). A very small cup for $6 where I feel like the price comes from the top rainbow foam decor , when I'm going to cover it anyways. If it's for an insta pic then it may be worth it?"
},
{
"user": {
"name": "Amber Grace Sale",
"link": "https://www.google.com/maps/contrib/106390058588469541899?hl=en-US&sa=X&ved=2ahUKEwiUituIlpTvAhVYCc0KHbvTCrgQvvQBegQIARA7",
"thumbnail": "https://lh3.googleusercontent.com/a-/AOh14Gj84nHu_9V_0V4yRbZcr-8ZTYAHua6gUBP8fC7W=s40-c0x00000000-cc-rp-ba3",
"local_guide": true,
"reviews": 33,
"photos": 17
},
"rating": 5,
"date": "2 years ago",
"snippet": "They really take pride in their espresso roast here and the staff is extremely knowledgeable on the subject. It’s also a GREAT place to do work although a table is no guarantee; you might have to wait for a bit. My almond milk cappuccino was very acidic at the end which wasn’t expected but I could still tell the bean was high quality. Their larger lattés they put in a tall glass cup which looks really really cool. Would definitely go again.",
"likes": 2,
"images": [
"https://lh5.googleusercontent.com/p/AF1QipMup24_dHrWtNN4ZD70EPsiRMf_tykcUkPw6A1H=w100-h100-p-n-k-no"
]
},
{
"user": {
"name": "Kelvin Petar",
"link": "https://www.google.com/maps/contrib/100859090874785206875?hl=en-US&sa=X&ved=2ahUKEwiUituIlpTvAhVYCc0KHbvTCrgQvvQBegQIARBG",
"thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GhdIvUDamzfPqbYIpwhnGJV2XWSi77iVXfEsiKS=s40-c0x00000000-cc-rp",
"reviews": 3
},
"rating": 4,
"date": "3 months ago",
"snippet": "Stumptown Cafe is the perfect place to work or catch up with friends. Never too loud, never too dead. Their lattes and deliciously addicting and the toasts are tasty as well. Wifi is always fast, which is a huge plus! The staff are the friendliest, I highly recommend this place!"
},
...
]
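Here's a minimal sketch of that offset-based pagination, with start stepping through 0, 10, 20, ... as described above:

from serpapi import GoogleSearch

all_reviews = []
start = 0
while True:
    params = {
        "engine": "google_maps_reviews",
        "place_id": "0x89c259a61c75684f:0x79d31adb123348d2",
        "start": start,  # result offset: 0, 10, 20, ...
        "api_key": "SECRET_API_KEY"
    }
    results = GoogleSearch(params).get_dict()
    page = results.get("reviews", [])
    if not page:
        break  # no more pages
    all_reviews.extend(page)
    start += 10  # each page fetches 10 results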
You can check out the documentation for more details.
Disclaimer: I work at SerpApi.
Adding to @miguev's answer: at the moment there's no way to get more than 5 top reviews without using premium APIs (according to a Google Maps engineer I talked to), and that's pricey.
We tried to sign up for the Google Maps Platform Premium Plan to show them on pages like this, but Google said it's no longer available for sign-up or new customers. Right now we're limited to 5 reviews only.

Microsoft Band Web Tile Not Refreshing

This post is similar to Microsoft Band Web Tile not Updating, but the response marked as an answer to that question didn't really solve my issue, so I thought I'd start a new post.
I recently purchased a Band 2 and am trying to set up a web tile that will pull data from a service that provides data in JSON format (not an rss feed). So, I created a single-page non-feed tile using the 5-step authoring tool. When I first deployed the tile to my band, it successfully polled the service and displayed data; however, since that point, the data displayed on the web tile has not updated, even though the refresh interval is set (the default of 30 minutes).
The service that's being called is an ASP.Net Web API service. It is setting the following cache-related headers:
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
Last-Modified:
ETag:
If I review the HTTP logs for the site, I can see where my service endpoint is getting called from my band/phone, roughly every 30 minutes, and the server responds with a 200 OK response on every call - I'm not seeing a 304 Not Modified response on the server side of the transaction.
My band is paired with an Android device (Samsung GS5). I've also tried pairing with an iPhone 6, with the same result. Other tiles on the band seem to work fine (i.e., the standard ones that come with the MS Health app). As part of pairing/re-pairing, I've done a factory reset twice, and that didn't seem to help. I've tried re-starting both phones (when they were paired) as well. That doesn't help, either.
What am I missing?
For reference, here is what the web tile's manifest.json file contains (with placeholders for some data points):
{
  "manifestVersion": 1,
  "name": "<Name Here>",
  "description": "<Description here>",
  "version": 1,
  "versionString": "1",
  "author": "<Author Here>",
  "organization": "",
  "contactEmail": "",
  "tileIcon": {
    "46": "icons/tileIcon.png"
  },
  "icons": {},
  "refreshIntervalMinutes": 30,
  "resources": [
    {
      "url": "<URL Here>",
      "style": "Simple",
      "content": {
        "_1_bg": "BG",
        "_1_datestring": "DateString",
        "_1_trend": "Trend",
        "_1_direction": "Direction"
      }
    }
  ],
  "pages": [
    {
      "layout": "MSBand_MetricsWithIcons",
      "condition": "true",
      "textBindings": [
        {
          "elementId": "12",
          "value": "BG: {{_1_bg}}"
        },
        {
          "elementId": "22",
          "value": "{{_1_datestring}}"
        },
        {
          "elementId": "32",
          "value": "Trend: {{_1_trend}}, {{_1_direction}}"
        }
      ]
    }
  ],
  "notifications": [
    {
      "condition": "{{_1_bg}} >= 250",
      "title": "HIGH BG: {{_1_bg}}",
      "body": "{{_1_datestring}}"
    },
    {
      "condition": "{{_1_bg}} <= 80",
      "title": "Low BG: {{_1_bg}}",
      "body": "{{_1_datestring}}"
    },
    {
      "condition": "{{_1_bg}} <= 55",
      "title": "REALLY LOW: {{_1_bg}}",
      "body": "{{_1_datestring}}"
    }
  ]
}
Can you supply the URL for the resource? If so I can take a look at your server responses and see why the tile is not refreshing.
Better yet, can you share the webtile and I can try that to see why it is not refreshing. You can build your WebTile at https://developer.microsofthealth.com/WebTile/ and choose to submit it. Reply here with the name of it and I will take a look.
By the way, here is how we handle refresh on a simple tile:
If an ETag was in the last response, then use that with the next request to let the server decide if there is something new to provide.
If an ETag was not supplied, then look for Last-Modified and use that when available.
Otherwise, process the downloaded data and send it to the tile.
So, if you have Etag or Last-Modified in your server responses then we will use that to send in future requests and that may be causing your problem. In that case you would want to make sure that Etag and Last-Modified are not being sent in your server responses.
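In plain HTTP terms that's the standard conditional-request flow. Here's a rough Python sketch of what the sync is effectively doing (illustrative only, not Microsoft's actual code):

import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    # Prefer the ETag validator, fall back to Last-Modified, else fetch unconditionally.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    elif last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:
        return None, etag, last_modified  # nothing new; the tile keeps its old data
    # New data: remember the validators for the next poll.
    return resp.content, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

Note that the empty ETag: and Last-Modified: headers in the question could plausibly trip up exactly this logic, so stripping those two headers from the responses entirely is worth trying.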
Some things I can think of:
Are you keeping the tile open on the Band while the updates are happening? If so, tiles in some firmware versions of the Band do not update when new data comes in; close the tile and open it again after the sync.
You can test your tile syncing more often than 30 minutes by hitting the sync icon on the top left of the left nav bar inside the Microsoft Health app.
After that, if you are still having problems please send feedback from inside the Microsoft Health app. Access via Left Nav, bottom under Settings, use the "Help and Feedback".
When reporting feedback, if you can attach the webtile that will help us test the webtile you are having problems with.
I share the frustration here. I too have exactly the same issue, it seems, and I have been a developer for 20 years. My conclusion is that there is perhaps a bug when JSON is used, and/or with Android phones. I've tried to get answers and discussions with Microsoft but haven't had any luck. My issue is described at "Web Tile works once but never refreshes".

Using webhooks with Google Analytics

I'm trying to integrate my CRM with Google Analytics to monitor lead changes (from lead to sale) and so on. As I understand it, I need to use the Google Measurement Protocol to receive webhooks from the CRM and translate them into Analytics conversions.
But in fact, I don't really understand how to do it. I need to write some script to translate the webhook payload into Analytics hits, but where do I place that script? Are there some templates? And so on.
So, if you know some tutorials/courses/freelancers that could help me with integrating webhooks with Analytics, I'd appreciate your advice.
Example of webhook from CRM:
{
  "leads": {
    "status": {
      "id": "25399013",
      "name": "Lead title",
      "old_status_id": "7039101",
      "status_id": "142",
      "price": "0",
      "responsible_user_id": "102525",
      "last_modified": "1413554372",
      "modified_user_id": "102525",
      "created_user_id": "102525",
      "date_create": "1413554349",
      "account_id": "7039099",
      "custom_fields": [
        {
          "id": "427183",
          "name": "Checkbox custom field",
          "values": ["1"]
        },
        {
          "id": "427271",
          "name": "Date custom field",
          "values": ["1412380800"]
        },
        {
          "id": "1069602",
          "name": "Checkbox custom field",
          "values": ["0"]
        },
        {
          "id": "427661",
          "name": "Text custom field",
          "values": ["Валера"]
        },
        {
          "id": "1075272",
          "name": "Date custom field",
          "values": ["1413331200"]
        }
      ]
    }
  }
}
"Webhook" is a fancy way of saying that your CRM can call a web based service whenever something interesting happens (i.e. the CRM can "hook" into a web based application). E.g. if a new lead is created you can call an url with the lead details as parameters.
Specifics depend on your CRM, but when you set up a webhook there should be a field to set a url; the script that evaluates the CRM data is located at the URL.
You have that big JSON thing as your example - no real way to tell without knowing your system, but I assume that is sent as the request body. So in your script you evaluate the request body, extract the parameters you want to send to Analytics (be mindful that you are not allowed to store personally identifiable information), and send them via the Measurement Protocol as described in the documentation linked in the other answer.
Depending on the system you might even be able to call the measurement protocol without having a custom script in between (after all the measurement protocol is an url with a few parameters).
This is an awfully generic answer, but then the question is really broad.
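To make it a bit more concrete anyway, here's a minimal sketch of such an in-between script as a small Flask app. The route name and the field-to-parameter mapping are assumptions to adapt to your CRM and GA property:

import requests
from flask import Flask, request

app = Flask(__name__)

GA_ENDPOINT = "https://www.google-analytics.com/collect"
GA_TRACKING_ID = "UA-123456-1"  # your GA property id

@app.route("/crm-webhook", methods=["POST"])
def crm_webhook():
    # Assumes the CRM posts JSON shaped like the example payload above.
    lead = request.get_json()["leads"]["status"]
    hit = {
        "v": "1",
        "tid": GA_TRACKING_ID,
        "cid": lead["account_id"],   # ideally a real GA client id; see the notes on cid/uid below
        "t": "event",
        "ec": "CRM",                 # event category
        "ea": "lead_status_change",  # event action
        "el": lead["name"],          # event label (avoid personally identifiable info)
        "ev": lead["price"],         # event value
    }
    requests.post(GA_ENDPOINT, data=hit)
    return "", 204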
I've done just this in my line of work.
You need to first decide on your data model, i.e., how you would like the CRM data to look within Google Analytics. This could be just mapping Google Analytics' event category, event label, and event action to your data, or perhaps using custom dimensions and metrics.
Then, to make it most useful, you would want to be able to link the CRM activity of a customer to their online activity. You can do this if they log in online. In that case, you can set the cid and/or uid of the user to your CRM id.
Then, if you send in a GA hit with the same cid/uid in your Measurement Protocol hit, you will link the online sessions with your offline CRM activity.
To make the actual record hit Google Analytics, you will need to program something that takes the CRM data and turns it into a Measurement Protocol hit, which is essentially just a URL with the correct parameters. Look here for reference: https://developers.google.com/analytics/devguides/collection/protocol/v1/reference
An example could be: http://www.google-analytics.com/collect?v=1&tid=UA-123456-1&cid=5555&t=pageview&dp=%2FpageA
We usually have this as a separate process that fires when the CRM data is written to its database (the webhook in your example). If it's a lot of data, you should probably implement checks to see if the hit was successful, and caching in case the service is not online - you have an optional parameter that gives you 4 hours of leeway in sending data.
Hope this gets you at least started.
