How to retrieve more than 50 results using the YouTube API in R

I am working on a school project in R where I am attempting to map where the most popular YouTube videos are posted around the world. I am able to get the data for the 50 most popular videos, but am having trouble understanding how to use pageToken.
The GET request I am currently using is the following:
https://www.googleapis.com/youtube/v3/videos?part=snippet%2CrecordingDetails&chart=mostPopular&maxResults=50&key={api_key}
Is it possible to retrieve more than 50 results using "pageToken"? (I am unfamiliar with how this works.)
Any help would be appreciated, thanks!

Videos: list
pageToken (string): The pageToken parameter identifies a specific page in the result set that should be returned. In an API response, the nextPageToken and prevPageToken properties identify other pages that could be retrieved.
Note: This parameter is supported for use in conjunction with the myRating parameter, but it is not supported for use in conjunction with the id parameter.
So when you get the results from the first request, you should have a field called nextPageToken. If you send that with the next request:
&pageToken=api_pageToken
it should give you the next bunch of rows.
Note: I am not an R programmer, so I can't help with the code for a loop over the results to find out whether there are page tokens or not.
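For illustration only, here is a minimal pagination sketch in Python (using the requests library rather than R; the endpoint and parameters come from the question, while the loop structure is an assumption you would adapt to R):
import requests

# Sketch: keep requesting pages of the mostPopular chart until no nextPageToken is returned.
url = "https://www.googleapis.com/youtube/v3/videos"
params = {
    "part": "snippet,recordingDetails",
    "chart": "mostPopular",
    "maxResults": 50,
    "key": "YOUR_API_KEY",  # placeholder
}

items = []
while True:
    data = requests.get(url, params=params).json()
    items.extend(data.get("items", []))
    token = data.get("nextPageToken")
    if not token:
        break  # no more pages
    params["pageToken"] = token  # pass the token back on the next request

print(len(items), "videos collected")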

Related

Using LinkedIn API to retrieve advertising reports

I'm working on a simple app to programmatically retrieve ads performance within LinkedIn. I have general API experience, but this is the first time I get my feet wet with the LinkedIn API.
One example from the LinkedIn API documentation suggests something that would get me started:
GET https://api.linkedin.com/v2/adAnalyticsV2?q=analytics&dateRange.start.month=1&dateRange.start.day=1&dateRange.start.year=2016&timeGranularity=MONTHLY&pivot=CREATIVE&campaigns=urn:li:sponsoredCampaign:112466001
I am encountering two problems:
First, this example implies that you already know the campaign ID. However, I am unable to find a way to retrieve a list of campaign IDs for a given account.
Second, if I manually pull a campaign ID, I receive an error: {"serviceErrorCode":2,"message":"Too many fields requested. Maximum possible fields to request: 20","status":400}. Pretty clear error.
A little research tells me that by adding the parameter "&fields=" I will be able to limit my query to fewer than 20 fields (I really need only a dozen anyway), but I can't find any documentation regarding the names of the fields available.
Any help or pointer will be appreciated.
Please refer to the link below and scroll down to where the field names are listed as metrics; these are the fields.
https://learn.microsoft.com/en-us/linkedin/marketing/integrations/ads-reporting/ads-reporting?tabs=http#analytics-finder
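As a rough illustration, a hedged sketch of adding the fields parameter to the request from the question (Python with the requests library; the metric names below are placeholders, the real names are the metrics listed in the link above):
import requests

# Sketch only: the endpoint and query parameters come from the question above;
# the field names are placeholders; use the metric names from the documentation link.
url = "https://api.linkedin.com/v2/adAnalyticsV2"
params = {
    "q": "analytics",
    "dateRange.start.month": 1,
    "dateRange.start.day": 1,
    "dateRange.start.year": 2016,
    "timeGranularity": "MONTHLY",
    "pivot": "CREATIVE",
    "campaigns": "urn:li:sponsoredCampaign:112466001",
    # Keep this under 20 fields to avoid the 400 error:
    "fields": "impressions,clicks,costInLocalCurrency",  # placeholder metric names
}
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}  # placeholder token

response = requests.get(url, params=params, headers=headers)
print(response.json())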

Why is soup find_all not showing all tags

I'm trying to scrape the daily temperature data from this page - specifically the min and max daily temp: https://www.wunderground.com/calendar/gb/birmingham/EGBB/date/2020-8
I found the line in the html where the data is located:
[screenshot: calendar days temperature li tag]
and the rest of the daily temperature can also be found in the other li tags:
[screenshot: other li tags where temp data is inside]
I'm trying to use Beautiful Soup to scrape the said data, but when I use the following code I am not getting all the li tags from the HTML, even though they are there when I inspect the HTML on the website.
When I print the resulting temp_cont, the other li tags are there but not the ones that contain the daily data: [screenshot: result of soup find_all]
I've already tried using other HTML parsers but it didn't work - all the other parsers output the same data.
I'm looking at other solutions like trying to load it using javascript, since others suggest that some pages may load dynamically, but I don't really understand this.
Hope you can help me with this one.
from requests import get
from bs4 import BeautifulSoup

response = get(url, headers=headers)  # url and headers are defined earlier in the script
soup = BeautifulSoup(response.content, 'lxml')
temp_cont = soup.find_all('li')
EDIT (ADDITIONAL QUESTION):
I tested the solution recommended by @AaronS below and the scraping worked perfectly fine. However, when I try to re-run the script a few hours later, a 'NoneType' error is raised since one of the list elements is None.
When I inspected the website again in the network preview of the API, the first element of temperatureMax is now "null". I don't understand why/how it changed, or if there's a workaround so that the scraping works again. See screenshot here: [screenshot: network preview with null]
So if you disable javascript in your browser, you will find that none of the information you require is there. This is what Roman was explaining: javascript can make HTTP requests and grab data from APIs, and the response is then fed back to the browser.
If you inspect the page and go to the network tools, you will be able to see all the requests made to load the page. Among those requests there is one where, if you click it and go to the preview, you'll see some temperature data.
I'm lazy, so I copy the cURL of this request and input it into a website like curl.trillworks.com, which converts it into a Python request.
Here you can see I'm copying the cURL request.
Code Example
import requests

params = (
    ('apiKey', '6532d6454b8aa370768e63d6ba5a832e'),
    ('geocode', '52.45,-1.75'),
    ('language', 'en-US'),
    ('units', 'e'),
    ('format', 'json'),
)

response = requests.get('https://api.weather.com/v3/wx/forecast/daily/15day', params=params)

max = response.json()['temperatureMax'][0]
min = response.json()['temperatureMin'][0]

print('Min Temperature: ', min)
print('Max Temperature: ', max)
Output
Min Temperature: 65
Max Temperature: 87
Explanation
So the URL is an API that weather.com has for daily forecasts. It has parameters specifying where you are and the format the response should be returned in.
We make an HTTP GET request with those parameters, and the response we get back is a JSON object. We can convert this to a Python dictionary using the response.json() method.
Now if you output response.json() you'll get a lot of data. If you look at the preview of that HTTP request in your browser, you can navigate down to the data you want. We want the data under the keys 'temperatureMax' and 'temperatureMin'. Each value is actually a list, and today's max and min are the first items of those lists, hence response.json()['temperatureMax'][0] and response.json()['temperatureMin'][0].
Additional Information
This is a case of a website with dynamic content which is loaded by javascript. There are two broad ways to deal with this type of content.
Mimic the HTTP request that the javascript invokes (This is what we have done here)
Use a package like selenium to invoke browser activity. You can use selenium to render the entire HTML including the javascript invoked parts of that HTML. Then select the data you want.
The benefit of the first option is efficiency: it's much faster and the data is more structured. The reason to treat the second option as a last resort is that selenium was never meant for web scraping; it's brittle, and if anything changes on the website you'll find yourself needing to maintain the code often.
So my advice is to try the first option: inspect the page, go to the network tools and look at the previews of all the requests made. Play about with the preview data to see if it has what you want, then re-create that request. Sometimes you just need a simple HTTP request without parameters, cookies or headers.
In this particular example you just needed parameters, but sometimes you'll need all three and possibly more for different websites, so be mindful of that if you're not able to get the data. It's not foolproof; there are definitely instances where re-creating the HTTP request is difficult, and there are things required that you, as the user of the website, are not privy to. In fact, a well-developed API may have such measures in place precisely to stop people scraping it.
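If you do end up needing the second option, a minimal Selenium sketch might look like this (an assumption-laden illustration: it presumes chromedriver is installed, and the selectors for the temperature elements would still need to be worked out on the rendered page):
from bs4 import BeautifulSoup
from selenium import webdriver

# Sketch: render the page in a real browser so the javascript-built content exists,
# then hand the resulting HTML to BeautifulSoup as before.
driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.wunderground.com/calendar/gb/birmingham/EGBB/date/2020-8")
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
li_tags = soup.find_all('li')
print(len(li_tags), "li tags found after rendering")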

Google reviews counter

I want to know if there is any API that would allow me to get the number of reviews from a URL.
I know that Google offers the possibility to get this number using the place ID, but the only information I have is the URL of a company's website.
Any ideas please?
Maybe, but probably not.
Places API Text Search seems to be able to find places by their URL:
https://maps.googleapis.com/maps/api/place/textsearch/json?key=YOURKEY&query=http://www.starbucks.com/store/1014527/us/303-congress-street/303-congress-street-boston-ma-02210
However, this is not a documented feature of the API and I do not think this can be relied upon, so I'd recommend filing a feature request, to make this a supported, reliable feature.
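For what it's worth, a minimal sketch of that undocumented lookup (Python with the requests library; the key is a placeholder, and since the behaviour is undocumented the result may be empty):
import requests

# Sketch of the undocumented text-search-by-URL trick described above.
# Not a documented feature, so treat any result as best-effort.
params = {
    "key": "YOUR_API_KEY",  # placeholder
    "query": "http://www.starbucks.com/store/1014527/us/303-congress-street/303-congress-street-boston-ma-02210",
}
resp = requests.get("https://maps.googleapis.com/maps/api/place/textsearch/json", params=params)
results = resp.json().get("results", [])
if results:
    print(results[0].get("name"), results[0].get("place_id"))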
As for the amount of reviews, you may be interested in:
Issue 3484: Add # of reviews to the Place Details Results
I've written an API like this for Reviewsmaker, but I target specific business names not URLs. See this example (I activated a key for this purpose for now):
http://reviewsmaker.com/api/google/?business=life%20made%20a%20little%20easier&api_key=4a2819f3-2874-4eee-9c46-baa7fa17971c
Or, try yourself with any business name:
http://reviewsmaker.com/api/google/?business=Toys R Us&api_key=4a2819f3-2874-4eee-9c46-baa7fa17971c
The following call would return a JSON object which shows:
{
    "results": {
        "business_name": "Life Made A Little Easier",
        "business_address": "1702 Sheepshead Bay Rd, Brooklyn, NY 11235, USA",
        "place_id": "ChIJ_xjIR2REwokRH2qEigdFCvs",
        "review_count": 38
    },
    "api": {
        "author": "Ilan Patao",
        "home": "www.reviewsmaker.com"
    }
}
Pinging this endpoint with a cron job, for example once every hour or two, and returning the review_count can pretty much build your own review monitoring app.
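As a rough illustration, polling that endpoint and reading review_count could look like this (Python; the business name and API key are the ones shown above and may no longer be active):
import requests

# Sketch: poll the endpoint shown above and read the review count from the JSON response.
params = {
    "business": "Toys R Us",
    "api_key": "4a2819f3-2874-4eee-9c46-baa7fa17971c",  # key from the answer; may have expired
}
data = requests.get("http://reviewsmaker.com/api/google/", params=params).json()
print(data["results"]["business_name"], data["results"]["review_count"])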
You can probably do what you're looking for if you query the Places API Text Search or the CSE (Custom Search Engine) API to look up the URL, return the matching name of the business associated with that URL, and then call an endpoint like this one to return the associated review count.
You can probably code this in Python or PHP. Not sure how familiar you are with data parsing, but I was able to build my API on top of Google's CSE API. CSE provides metadata in its results which contains the total reviews, so if you create a CSE engine and use the CSE API to look for business schemas, review schemas, etc., you can return items; within the PageMap node there are objects with the data you need, and very little tweaking (such as string replacing and trimming) will return the values you're looking for.
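A hedged sketch of that CSE lookup (Python; the key and cx values are placeholders for your own API key and Custom Search Engine ID, and the exact PageMap keys vary from site to site):
import requests

# Sketch: query the Custom Search JSON API and dig review counts out of the PageMap.
# Placeholders: YOUR_API_KEY, YOUR_CSE_ID. PageMap keys differ between sites.
params = {
    "key": "YOUR_API_KEY",
    "cx": "YOUR_CSE_ID",
    "q": "Life Made A Little Easier Brooklyn reviews",
}
data = requests.get("https://www.googleapis.com/customsearch/v1", params=params).json()
for item in data.get("items", []):
    pagemap = item.get("pagemap", {})
    # Aggregate-rating schemas often expose a count field such as 'reviewcount'
    for agg in pagemap.get("aggregaterating", []):
        print(item.get("title"), agg.get("reviewcount"))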
Hope my answer helped, at least to lead you in the right direction :)

Best Approach To Retrieve Search Result

I am trying to write a program that extracts shipping container information from a specific site. I've had success with several shipping companies' websites that use POST methods to submit searches. For these sites I have been using cURL, a PHP library. However, this one site, http://www.cma-cgm.com/eBusiness/Tracking/, has been very difficult to interact with. I have tried using cURL, but all I retrieve is the surrounding HTML without the actual search results.
A sample container I am trying to track is CMAU1173561.
The actual tracking URL seems to be http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx?ContNum=CMAU1173561&T=292012319448 where ContNum is the shipping container and T is a value constructed from current time.
I also noted the .aspx extension. What is the best approach for retrieving these search results programmatically?

Request GA statistic data for a specific large set of pages

I've spent the last few days trying to find a solution to the problem below.
I have a set of URLs for which I would like to request data - mainly pageviews and visits by month within a specific time interval. These URLs specify one web section, and we would like to get statistics for this section. I'm using PHP GAPI.
I am able to construct the correct filter for the URL set:
ga:pagePath==[url1]||ga:pagePath==[url2]||ga:pagePath==[url3]...
But this only works for a few URLs, because the request is sent via GET and there is a request length limitation for GET.
At first I tried to make several requests, each for a few URLs from the whole set, and after all the requests (when I had data for all pages) I summed the pageviews and visits. Then I realized that this could work for pageviews but not for visits (one particular visit could be counted in more than one response and, because of the sum, be counted multiple times).
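For reference, a rough sketch of the chunking approach described above (plain Python rather than the PHP GAPI used here; the length cap is a guess, and as noted this only makes summing safe for pageviews, not visits):
# Sketch: split a long list of URLs into filter strings that each stay under a
# conservative GET length budget. Only pageviews can safely be summed across chunks.
urls = ["/page-a", "/page-b", "/page-c"]  # placeholder URL list

MAX_FILTER_LEN = 1500  # rough guess at a safe query-string budget

filters, current = [], []
for url in urls:
    candidate = "||".join("ga:pagePath==" + u for u in current + [url])
    if current and len(candidate) > MAX_FILTER_LEN:
        filters.append("||".join("ga:pagePath==" + u for u in current))
        current = [url]
    else:
        current.append(url)
if current:
    filters.append("||".join("ga:pagePath==" + u for u in current))

print(len(filters), "requests needed")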
And then I have these limitations:
I can't use regular expressions to shorten the filter. The URLs of the pages are badly designed (not thanks to us :) ), so the pages in a web section don't have a nice URL prefix like /my-section/*.
I need historical data (2 years back), so it won't help to start tracking some custom variable or event for pages in a particular web section from now.
So I tried to make a POST request to the API. I was able to get an auth token, but POSTing the request to get the statistics data returns:
403 Forbidden
Target feed is read-only
I tried to find out whether it is actually possible to use the POST method, but had no luck finding exact info (some clues suggest that it is not possible).
Another idea could be redesigning the URLs to have some nice prefix to filter by regexp and somehow changing the stored URLs in GA, but I have a feeling that's not possible either.
Does anyone have an idea how to solve this?
Thanks for any suggestions :)
