I'm trying to capture some information from an architects' register, but I can only scrape the first 25 pages. How can I access the pages after page 25?
Example URL http://architects-register.org.uk/search/address/london
Try adding the following parameters to the source URL of your API:
kimoffset=xxx&kimlimit=1000
where xxx is the number of elements per page multiplied by the number of pages you want to skip.
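As a rough illustration of that arithmetic, here is a minimal Python sketch that builds such a URL. The base URL and the figure of 10 results per page are placeholders, not values from the question; only kimoffset and kimlimit come from the answer above.

```python
from urllib.parse import urlencode


def with_offset(source_url, per_page, pages_to_skip, limit=1000):
    """Append kimoffset/kimlimit so the API skips the first pages of results."""
    # Number of elements to skip = elements per page x pages to skip.
    offset = per_page * pages_to_skip
    separator = "&" if "?" in source_url else "?"
    return source_url + separator + urlencode({"kimoffset": offset, "kimlimit": limit})


# Hypothetical source URL: skip the 25 pages that were already scraped.
print(with_offset("https://example.com/api/abc123", per_page=10, pages_to_skip=25))
```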
Related
I am trying to fetch all of the videos, tags, etc. through the WordPress REST API, but it doesn't seem to return all of the data that I have in WordPress. For example, I have 13 videos in the admin panel (and web app), but through the REST API I can only see 4. I don't have any filters to limit the results I get in the fetch() function.
For the tags, I have 10, for example, but it only returns 8. When I request /wp-json/wp/v2/tags/53, I get the data for tag #53 (which is one of the tags that don't show up on /wp-json/wp/v2/tags).
The current method of controlling the number of results returned is to add per_page as a query parameter:
website.com/wp-json/wp/v2/posts/?per_page=15
Replace 15 with the desired number of results; any value from 1 to 100 is permitted.
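Since per_page is capped at 100, larger collections have to be fetched page by page. The following is a minimal Python sketch of that loop, assuming the site exposes the standard page parameter and the X-WP-TotalPages response header; the helper names (build_url, fetch_page, fetch_all) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base, page, per_page=100):
    """Build the URL for one page of a WordPress REST collection."""
    return f"{base}?{urlencode({'per_page': per_page, 'page': page})}"


def fetch_page(url):
    """Return (items, total_pages) for one request; needs network access."""
    with urlopen(url) as response:
        total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
        return json.load(response), total_pages


def fetch_all(base, per_page=100, fetch=fetch_page):
    """Collect every item by requesting pages until the reported total is reached."""
    items, page, total_pages = [], 1, 1
    while page <= total_pages:
        batch, total_pages = fetch(build_url(base, page, per_page))
        items.extend(batch)
        page += 1
    return items
```

Calling fetch_all("https://website.com/wp-json/wp/v2/tags") would then return all tags in one list. The fetch argument exists only so the loop can be exercised without a live site.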
I wrote some code which should check whether a product is back in stock and when it is, send me an email to notify me. This works when the things I'm looking for are in the html.
However, sometimes certain objects are loaded through JavaScript. How could I edit my code so that the web scraping also works with JavaScript?
This is my code thus far:
import time
import requests
while True:
    # Get the url of the IKEA page
    url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
    # Get the text from that page and put everything in lower case
    productpage = requests.get(url).text.lower()
    # Set the strings that should be on the page if the product is not available
    outofstockstrings = ['niet beschikbaar voor levering', 'alleen beschikbaar in de winkel']
    # Check whether the strings are in the text of the webpage
    if any(x in productpage for x in outofstockstrings):
        # Still out of stock: wait 30 minutes and check again
        time.sleep(1800)
        continue
    else:
        # send me an email and break the loop
        break
Instead of scraping and analyzing the HTML, you could use the unofficial stock API that the IKEA website itself uses. That API returns JSON data, which is much easier to analyze, and you'll also get estimates of when the product will be back in stock.
There is even a project written in JavaScript / Node.js that provides this kind of information straight from the command line: https://github.com/Ephigenia/ikea-availability-checker
You can easily check the stock amount of the chair in all stores in the Netherlands:
npx ikea-availability-checker stock --country nl 20336841
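Whichever data source ends up being used (the HTML page or the JSON API), the waiting logic can be kept separate from the stock check. Below is a minimal Python sketch of that idea; check is a hypothetical callable that you would supply yourself, returning True once the product is available, and nothing here reflects the actual IKEA API.

```python
import time


def wait_until_in_stock(check, interval=1800, sleep=time.sleep, max_checks=None):
    """Poll check() until it returns True.

    Returns the number of checks made, or None if max_checks ran out first.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check():
            return checks
        # Not available yet: wait before asking again.
        sleep(interval)
    return None
```

The email can then be sent after wait_until_in_stock(...) returns, and the check function can be swapped without touching the loop.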
I recently tried Rfacebook package by pablobarbera, which works quite well. I am having this slight issue, for which I am sharing the code.
install.packages("Rfacebook") # from CRAN
# or the development version from GitHub:
library(devtools)
install_github("pablobarbera/Rfacebook", subdir = "Rfacebook")
library(Rfacebook)
# token generated here: https://developers.facebook.com/tools/explorer
token <- "**********"
page <- getPage("DarazOnlineShopping", token, n = 1000)
The getPage command works, but it only retrieves 14 records from the Facebook page I used in the command. In the example in pablobarbera's original post, he retrieved all the posts from "Humans of New York", but when I tried the same command, Facebook asked me to reduce the number of posts, and I barely managed to get 20 posts. This is the command Pablo Barberá used:
page <- getPage("humansofnewyork", token, n = 5000)
I thought Facebook was withholding the data because I was using a temporary access token, but I completed the whole Facebook OAuth process and got the same result.
Can somebody look into this and tell me why this is happening?
The getPage() command looks fine to me. I manually counted 14 posts (including photos) on the main page. It could be that Daraz Online Shopping has multiple pages and that the page name you are using only returns results from the main page, when (I assume) you want results from all of them.
getPage() also accepts page IDs. You might want to collect a list of IDs associated with Daraz Online Shopping, loop through and call each of them and combine the outputs to get the results you need.
To find these IDs, you could write a scraper (or search for them manually) that views the page source and looks for the unique page ID. Searching for content="fb://page/?id= will highlight the location of the page ID in the source code.
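The search described above can be sketched with a regular expression. This is a minimal Python illustration rather than R, and it assumes the ID is a run of digits directly after the marker; the sample markup and the ID in it are invented.

```python
import re

# Marker from the answer above, followed by the numeric page ID.
PAGE_ID_RE = re.compile(r'content="fb://page/\?id=(\d+)')


def find_page_ids(html):
    """Return the unique page IDs found in the page source, in order of appearance."""
    seen = []
    for page_id in PAGE_ID_RE.findall(html):
        if page_id not in seen:
            seen.append(page_id)
    return seen


# Invented sample markup for illustration only.
sample = '<meta property="al:android:url" content="fb://page/?id=123456789">'
print(find_page_ids(sample))
```

Each ID collected this way could then be passed to getPage() in turn and the results combined.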
I'm trying to get a list of the pages people go to after visiting a selected page. pagePath is the starting page, and nextPagePath is the resulting list of pages I'm interested in. I included a filter to show only one starting page. But the result I get is confusing:
What am I doing wrong?!
Use ga:previousPagePath instead of ga:pagePath. For some reason unknown to me, ga:pagePath and ga:nextPagePath refer to the same thing in the GA API.
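To make the suggestion concrete, here is a minimal Python sketch that builds such a query, assuming the Core Reporting API v3 and its usual parameters (ids, start-date, end-date, metrics, dimensions, filters). The view ID, dates, and starting page are placeholders, and authentication is left out.

```python
from urllib.parse import urlencode


def next_pages_query(view_id, start_page, start_date, end_date):
    """Build a query listing the pages viewed right after start_page."""
    params = {
        "ids": f"ga:{view_id}",
        "start-date": start_date,
        "end-date": end_date,
        "metrics": "ga:pageviews",
        # The starting page is the *previous* page; ga:pagePath is where users went next.
        "dimensions": "ga:previousPagePath,ga:pagePath",
        "filters": f"ga:previousPagePath=={start_page}",
    }
    return "https://www.googleapis.com/analytics/v3/data/ga?" + urlencode(params)


# Placeholder view ID and path.
print(next_pages_query("12345678", "/pricing", "2015-01-01", "2015-01-31"))
```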
How can you modify the URL for the current page that gets passed to Google Analytics?
(I need to strip the extensions from certain pages because for different cases a page can be requested with or without it and GA sees this as two different pages.)
For example, if the page URL is http://mysite/cake/ilikecake.html, how can I pass to google analytics http://mysite/cake/ilikecake instead?
I can strip the extension fine, I just can't figure out how to pass the URL I want to Google Analytics. I've tried this, but the stats in the Google Analytics console don't show any page views:
pageTracker._trackPageview('cake/ilikecake');
Thanks,
Mike
You could edit the GA profile and add custom filters ...
Create a 'Search and Replace' custom filter, setting the filter field to 'Request URI' and using something like:
Search String: (.*ilikecake)\.html$
Replace String: $1
(was \1)
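The effect of such a search-and-replace can be checked locally before touching the profile. Here is a minimal Python sketch, generalized to any path ending in .html and with the dot kept outside the capture group so that it is removed along with the extension. Note that Python's re module writes the back-reference as \1, whereas the GA filter field uses $1.

```python
import re


def strip_html_extension(request_uri):
    """Drop a trailing .html so both URL variants map to the same page."""
    return re.sub(r"^(.*)\.html$", r"\1", request_uri)


print(strip_html_extension("/cake/ilikecake.html"))
```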
Two possibilities come to mind:
1. It can take a while, up to about 24 hours, for visits to be reflected in the Analytics statistics. How long ago did you make your change?
2. Try beginning the pathname with a "/", so
pageTracker._trackPageview('/cake/ilikecake');
and then wait a bit, as per the first item.
Usually you have the GA script code at the end of your file, while special _trackPageview() calls are often made somewhere else.
Have you made sure that your call to pageTracker._trackPageview() comes after you have defined pageTracker?
Like this:
var pageTracker = _gat._getTracker("UA-XXXXXXX-X");
pageTracker._trackPageview();
Otherwise you will just get a JavaScript error, I suppose.