Scrape wikipedia "prop=linkshere" more efficiently? - web-scraping

My scraping code works, but it seems inefficient: I have to send a bazillion "continue" requests to get it all. Here's the initial request:
https://en.m.wikipedia.org/w/api.php?action=query&prop=linkshere&format=json&maxlag=2&titles=Korn
and I get back a continuation number, so I follow with:
https://en.m.wikipedia.org/w/api.php?action=query&prop=linkshere&format=json&maxlag=2&titles=Korn&lhcontinue=20653
over and over and over until the end. Each request gives a tiny amount of the total data.
Am I missing something simple to get more data on each request? Thanks!

The default lhlimit for each response is 10. Change it to max, e.g. https://en.m.wikipedia.org/w/api.php?action=query&prop=linkshere&format=json&maxlag=2&titles=Korn&lhlimit=max .
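As a rough sketch of the whole loop (not the asker's actual code), the following Python generator requests the largest batch with lhlimit=max and keeps feeding the returned continuation values back until none are left. The names fetch_json and links_here and the User-Agent string are made up for illustration; the endpoint and query parameters are the ones from the question.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.m.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); replace it with one describing your tool.
    request = urllib.request.Request(url, headers={"User-Agent": "linkshere-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def links_here(title, fetch=fetch_json):
    """Yield the title of every page that links to `title`."""
    params = {
        "action": "query",
        "prop": "linkshere",
        "format": "json",
        "maxlag": 2,
        "titles": title,
        "lhlimit": "max",  # ask for the largest batch the API allows
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("linkshere", []):
                yield link["title"]
        if "continue" not in data:
            break
        # Copy every continuation value (e.g. lhcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Calling list(links_here("Korn")) would then collect every linking title with far fewer round trips. The fetch argument exists only so the loop can be exercised without touching the network.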

Related

How many http get requests/minute can we get without being considered attack/spam

Let's say we have a URL fuzzer for security testing written in Python.
How many GET requests can we send per second or minute without affecting the website?
For example: are 200 requests/minute too many?
Thank you.
There's no fixed number; it depends entirely on the website.
Sometimes the HTTP response itself tells you: a 429 (Too Many Requests) status code means you are sending requests too quickly. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
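One way to act on a 429 is sketched below, assuming the server may send a Retry-After header given in seconds. The function name polite_get, the one-second spacing, and the retry count are illustrative choices, not limits any particular site publishes.

```python
import time
import urllib.error
import urllib.request


def polite_get(url, min_interval=1.0, max_retries=3,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, backing off whenever the server answers 429."""
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as response:
                body = response.read()
            sleep(min_interval)  # fixed pause so successive requests stay spaced out
            return body
        except urllib.error.HTTPError as error:
            # Anything other than 429, or running out of retries, is passed on.
            if error.code != 429 or attempt == max_retries:
                raise
            retry_after = error.headers.get("Retry-After", "")
            # Retry-After may also be an HTTP date; this sketch only handles seconds
            # and otherwise falls back to exponential backoff.
            if retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = min_interval * 2 ** (attempt + 1)
            sleep(delay)
```

The opener and sleep parameters are there only so the behaviour can be checked without real network traffic or real waiting.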

Set departure datetime in https://route.api.here.com/routing/7.2/calculateroute.json not working?

I'm trying to do truck routing using the HERE calculateroute service. I need to calculate the estimated arrival time. From my understanding this should work if I simply add departure="2020-03-10T17:00:00+02" or "now" to the request. However, if I add this I get the same result as with a request that does not have a fixed departure time. I'd expect to see a departure- and arrival time in the response but they're not there.
curl -X GET 'https://route.api.here.com/routing/7.2/calculateroute.json?waypoint0=50.16193,8.53361&waypoint1=50.11208,8.68342&jsonAttributes=1&alternatives=1&routeattributes=waypoints,summary,summaryByCountry,shape,boundingBox,legs,notes,lines,routeId,groups,tickets,incidents,zones&legattributes=waypoint,maneuvers,links,length,travelTime,shape,indices,boundingBox,baseTime,trafficTime,summary&linkattributes=consumption,dynamicSpeedInfo,flags,functionalClass,indices,length,maneuver,nextLink,nextStopName,publicTransportLine,remainDistance,remainTime,roadName,roadNumber,shape,speedLimit,timeDependentRestriction,timezone,truckRestrictions&instructionformat=text&app_id={app_id}&app_code={app_code}&mode=fastest;truck;traffic:enabled&truckType=truck&trailersCount=0&axleCount=2&limitedWeight=20&height=4&width=2.5&length=10&truckRestrictionPenalty=strict&departure=now'
curl -X GET 'https://route.api.here.com/routing/7.2/calculateroute.json?waypoint0=50.16193,8.53361&waypoint1=50.11208,8.68342&jsonAttributes=1&alternatives=1&routeattributes=waypoints,summary,summaryByCountry,shape,boundingBox,legs,notes,lines,routeId,groups,tickets,incidents,zones&legattributes=waypoint,maneuvers,links,length,travelTime,shape,indices,boundingBox,baseTime,trafficTime,summary&linkattributes=consumption,dynamicSpeedInfo,flags,functionalClass,indices,length,maneuver,nextLink,nextStopName,publicTransportLine,remainDistance,remainTime,roadName,roadNumber,shape,speedLimit,timeDependentRestriction,timezone,truckRestrictions&instructionformat=text&app_id={app_id}&app_code={app_code}&mode=fastest;truck;traffic:enabled&truckType=truck&trailersCount=0&axleCount=2&limitedWeight=20&height=4&width=2.5&length=10&truckRestrictionPenalty=strict'
According to the documentation my request seems to be fine: https://developer.here.com/documentation/routing/dev_guide/topics/resource-calculate-route.html
I also found out that it works with the newer routing API v8 (8.20.3). But since there's no way to also get the link ids (I think?) I need to use v7.2.
Am I doing anything wrong?
If you don't pass a departure value, it defaults to now, so departure=now returns the same result as omitting the parameter. Compare departure=now with, for example, departure=2021-03-19T08:23:05Z and you will see a difference in the traffic time, and possibly in the base time as well.

How to get FB like count for each post while pulling the feed via Open Graph API?

I am trying to get all the posts for a page by using
https://graph.facebook.com/PAGE_ID/feed
And it works like a charm. I can get all the info for each post except the like count.
The feed does return "likes" for each post, but it shows the like info for the first 25 likes. I cannot know the like count of a post.
The closest solution I found on the net is to set "summary=1" when requesting info of a post, e.g.
https://graph.facebook.com/POST_ID/likes?summary=1
This will return a summary field that shows the like count of this post, which is exactly what I need.
However, if this is the only way to solve the problem, I have to make an additional network request for each post just to get the like count. I could originally finish the job with only ONE network request, but now I have to make 1+N requests (where N is the number of posts in the page feed).
I think I must be missing something. FB must have some way to embed the like count in the feed info. The FB app and website show the like count of every post immediately, and surely they don't make N additional network requests to do so.
Hope someone can help. Thanks a lot in advance.
Finally, I found there is a way to get the like/comment counts for each post while pulling the feed without making further network requests:
/url/feed?fields=likes.summary(1).limit(0)
Isn't it great?
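A small Python sketch of how that fields expression might be used follows. The helper names feed_url and like_counts, the access_token parameter, and the assumed response shape (a summary object with a total_count under each post's likes) are not taken from the answer above, so check them against a real response before relying on them.

```python
import urllib.parse

GRAPH_URL = "https://graph.facebook.com"


def feed_url(page_id, access_token):
    """Build a feed URL that asks for the like summary but no individual likes."""
    query = urllib.parse.urlencode({
        "fields": "likes.summary(1).limit(0)",
        "access_token": access_token,  # assumption: the question does not show authentication
    })
    return f"{GRAPH_URL}/{page_id}/feed?{query}"


def like_counts(feed):
    """Map each post id to its like count, or None when no summary came back."""
    counts = {}
    for post in feed.get("data", []):
        # Assumed shape: {"likes": {"data": [], "summary": {"total_count": N}}}
        summary = post.get("likes", {}).get("summary", {})
        counts[post["id"]] = summary.get("total_count")
    return counts
```

With this, a single feed request carries a count for every post, and like_counts only has to walk the decoded JSON.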

Request GA statistic data for a specific large set of pages

I've spent the last few days trying to find a solution to the problem below.
I have a set of URLs for which I would like to request data, mainly pageviews and visits by month in a specific time interval. These URLs make up one web section, and we would like to get statistics for that section. I'm using PHP GAPI.
I am able to construct the correct filter for the URL set:
ga:pagePath==[url1]||ga:pagePath==[url2]||ga:pagePath==[url3]...
But this works only for a few URLs, because the request is sent via GET and GET requests have a length limit.
At first I tried to make several requests, each covering a few URLs from the whole set, and after all requests (when I had data for all pages) I summed the pageviews and visits. Then I realized that this could work for pageviews but not for visits (one particular visit could be counted in more than one response, and the sum would then count it multiple times).
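The batching approach just described could look roughly like the Python sketch below (the question itself uses PHP GAPI; Python is used here only to illustrate the idea). The names batch_filters, total_pageviews, and fetch_pageviews, as well as the 1500-character limit, are hypothetical. The filter syntax is copied from the filter shown above.

```python
def batch_filters(paths, max_length=1500):
    """Group page paths into OR-ed filter strings no longer than `max_length`."""
    batches, current = [], ""
    for path in paths:
        clause = f"ga:pagePath=={path}"
        candidate = f"{current}||{clause}" if current else clause
        if current and len(candidate) > max_length:
            # Adding this clause would exceed the limit, so start a new batch.
            batches.append(current)
            current = clause
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def total_pageviews(paths, fetch_pageviews, max_length=1500):
    """Sum pageviews across batches; `fetch_pageviews` is a caller-supplied function."""
    # Summing is only valid for pageviews. A visit that touches pages in two
    # different batches would be counted once per batch.
    return sum(fetch_pageviews(f) for f in batch_filters(paths, max_length))
```

Each page appears in exactly one batch, which is why pageviews add up correctly while visits do not.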
And then I have these limitations:
I can't use a regular expression to shorten the filter. The page URLs are badly designed (not thanks to us :) ), so the pages in a web section don't share a nice URL prefix like /my-section/*
I need historical data (2 years back), so it won't help to start tracking some custom variable or event for pages in particular web section from now.
So I tried to make a POST request to the API. I was able to get an auth token, but POSTing the request for statistics data returns:
403 Forbidden
Target feed is read-only
I tried to find out whether it is actually possible to use the POST method, but had no luck finding exact information (some clues suggest that it is not possible).
Another idea would be to redesign the URLs to share a prefix that a regexp can filter on, and somehow change the URLs already stored in GA, but I have a feeling that this is not possible either.
Does anyone have an idea how to solve this?
Thanks for any suggestions :)

Why is the GET method faster than POST in HTTP?

I am new to web programming and just curious to know about the GET and POST methods of sending data from one page to another.
It is said that the GET method is faster than POST but I don't know why.
One reason I could find is that GET can take only 255 characters?
Is there any other reason? Please someone explain to me.
It's not much about speed. There are plenty of cases where POST is more applicable. For example, search engines will index GET URLs and browsers can bookmark them and make them show up in history. As a result, if you take actions like modifying a DB based on a GET request, it might be harmful as some bots might also traverse the URL.
The other case can be security issue. If you post credentials using GET, it'll get listed in browser history and server log files.
There are several misconceptions about GET and POST in HTTP. There is one primary difference: GET must be safe and idempotent, while POST does not have to be. This means that a GET should cause no side effects, i.e. I can send a GET to a web application as many times as I want to (think hitting Ctrl+R or F5 many times) and the requests will be 'safe'.
I cannot do that with POST; a POST may change data on the server. For example, if I order an item on the web, the item should be added with a POST because state is changed on the server: the number of items I've added has increased by 1. If I did this with a POST and hit refresh, the browser warns me; if I did it with a GET, the browser would simply resend the request.
On the server, GET vs POST is pure convention, i.e. it's up to me as a developer to write the POST handler so that a repeated call is not processed twice. There are various ways of doing this, but that's another question.
To actually answer the question: if I use GET or POST to perform the same task, there is no performance difference.
You can read the RFC (http://www.w3.org/Protocols/rfc2616/rfc2616.html) for more details.
Looking at the HTTP protocol, POST and GET should be equally easy and fast to parse. I would argue there is no performance difference.
Take a look at the raw HTTP requests:
HTTP GET
GET /index.html?userid=joe&password=guessme HTTP/1.1
Host: www.mysite.com
User-Agent: Mozilla/4.0
HTTP POST
POST /login.jsp HTTP/1.1
Host: www.mysite.com
User-Agent: Mozilla/4.0
Content-Length: 27
Content-Type: application/x-www-form-urlencoded
userid=joe&password=guessme
From my point of view, performance should not be considered when comparing GET and POST.
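To make the comparison concrete, here is a small Python sketch that serializes the same parameters both ways. The function names raw_get and raw_post are made up for illustration, and the User-Agent header from the example above is omitted. Note the empty line that HTTP requires between the headers and the POST body.

```python
from urllib.parse import urlencode


def raw_get(host, path, params):
    """Serialize a GET request: the parameters travel in the request line."""
    return (
        f"GET {path}?{urlencode(params)} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "\r\n"
    ).encode("ascii")


def raw_post(host, path, params):
    """Serialize a POST request: the same parameters travel in the body."""
    body = urlencode(params).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        "\r\n"  # the blank line that ends the headers
    ).encode("ascii")
    return head + body
```

For the login example the POST is only a few dozen bytes longer than the GET (the two extra headers), which supports the point that the size difference is negligible.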
You should think of GET as "a place to go", and POST as "doing something". For example, a search form should be submitted using GET because the search result page is a "place" and the user will want to bookmark it or retrieve it from their history at a later date. If you submit the form using POST the user can only recreate the page by submitting the form again. On the other hand, if you were to perform an action such as clicking a delete button, you would not want to submit this with GET, as the action would be repeated whenever the user returned to the URL.
Just my few cents from 2016.
I am creating a simple message system. At first I used POST to receive new alerts. In jQuery I had:
$.post('/a/alerts', 'stamp=' + STAMP, function(result)
{
});
And in PHP I used $_POST['stamp']. Even from localhost I got 90-100 ms for every request like this.
I simply changed:
$.get('/a/alerts?stamp=' + STAMP, function(result)
{
});
and in PHP switched to $_GET['stamp']. So a little less than 1 minute of changes. Now every request takes 30-40 ms.
So GET can be twice as fast as POST. Of course not always, but for small amounts of data I get the same results every time.
GET is slightly faster because the values are sent in the URL (the request line at the top of the request), whereas with POST the values are sent in the request body, in the format that the content type specifies.
Usually the content type is application/x-www-form-urlencoded, so the request body uses the same format as the query string:
parameter=value&also=another
When you use a file upload in the form, you use the multipart/form-data encoding instead, which has a different format. It's more complicated.
I agree with the other answers, but nobody mentioned that responses to GET requests can be cached, while responses to POST requests are normally not cached. I think this is the main reason why some GET requests complete faster.
(Of course, this means that sometimes no request is actually sent. Hence it's not actually the GET request which is faster, but your browser's cache.)
HTTP Methods: GET vs. POST: http://www.w3schools.com/tags/ref_httpmethods.asp
POST adds a few headers (such as Content-Length and Content-Type), so the request is slightly larger, but the difference ought to be negligible, so I don't see why this should be a concern.
Just bear in mind that the proper way to speak HTTP is to use GET only for retrieving data and POST for actions that change it. You don't have to, but you also don't want a case where Google's bots can, for example, insert, delete or manipulate data that was only meant for a human to handle, simply because they follow the links they find.
