scrapy + splash response returns an incomplete page

I have been scraping this URL, https://www.disco.com.ar/prod/409496/at%C3%BAn-en-aceite-jumbo-120-gr, for months, in order to get the price for example. But last week I couldn't, and I don't understand what changed, because the response now returns only the icon and not the HTML.
I use Scrapy + Splash.
Here is an example of the response in Splash.
I changed the settings in Scrapy and also the Lua script in Splash, but nothing worked.
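In case it helps anyone hitting the same wall: a common cause is that the site started rejecting non-browser headers, or that the price is now rendered later by JavaScript. Below is a minimal sketch of a Scrapy spider using scrapy-splash with a longer wait and a browser-like User-Agent; the spider name and the CSS selector are hypothetical placeholders, not taken from the original project.

import scrapy
from scrapy_splash import SplashRequest  # from the scrapy-splash plugin

class PriceSpider(scrapy.Spider):
    name = 'disco_price'  # hypothetical name

    def start_requests(self):
        url = 'https://www.disco.com.ar/prod/409496/at%C3%BAn-en-aceite-jumbo-120-gr'
        yield SplashRequest(
            url,
            callback=self.parse,
            args={
                'wait': 5,  # give the JS-rendered page more time before snapshotting
                # browser-like User-Agent, in case the site now rejects the default one
                'headers': {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'},
            },
        )

    def parse(self, response):
        # selector is a guess; inspect the rendered HTML for the real one
        yield {'price': response.css('.price::text').get()}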

Related

Making an external POST request is not working in my Vue code

I have a problem that I've been trying to fix for many days and I can't solve it. I just want to send a hit to Google Analytics when I click one button (to do it, I have to make a POST request).
This is my code; I have removed the ID that I have in GA, just to show you the original code:
import Vue from 'vue'
import VueResource from 'vue-resource'

Vue.use(VueResource)

handleMP () {
  this.$http.post('www.google-analytics.com?v=1&t=pageview&tid=UA-XXXXXX-X&cid=555&dp=%2Fanalytics')
}
The problem is that, for a reason I don't understand, when I make the POST the URL I use gets appended to http://localhost:8080/, so I can't make the POST.
Example URL I can see in the console:
http://localhost:8080/www.google-analytics.com?v=1&t=pageview&tid=UA-XXXXXXX-X&cid=555&dp=%2Fanalytics
How can I fix this?
Thanks in advance
I'm pretty sure you're missing an http:// or https:// prefix in the URL, so Vue thinks it's just a relative URL and appends what you entered to the current address.
Try adding the https:// so it looks like this:
handleMP () {
  this.$http.post('https://www.google-analytics.com?v=1&t=pageview&tid=UA-XXXXXX-X&cid=555&dp=%2Fanalytics')
}

Why the same URL gives different results?

On the following page, the numbers 2, 3, ... at the bottom all point to the same URL, yet different tables are shown. Does anybody know what specific technique is used here? And how can I extract the information in these tables using raw HTTP requests (I prefer not to use a headless browser)? Thanks.
https://services27.ieee.org/fellowsdirectory/home.html#results_table
It is using JavaScript (AJAX) to make HTTP calls to the server.
If you inspect the network activity in the developer tools, you will see calls to the following URL: https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html.
The JavaScript sends this form data:
selectedJSON: {"alpha":"ALL","menu":"ALPHABETICAL","gender":"All","currPageNum":1,"breadCrumbs":[{"breadCrumb":"Alphabetical Listing "}],"helpText":"Click on any of the alphabet letters to view a list of Fellows."}
inputFilterJSON: {"sortOnList":[{"sortByField":"fellow.lastName","sortType":"ASC"}],"typeAhead":false}
pageNum: 2
You can see the pageNum property. This is how they request a specific page of results.
When you click the number buttons, some JavaScript code makes an AJAX POST request to https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html;jsessionid=yoursessionid with form data including pageNum: 3 and some other formatting parameters. The server responds with the HTML block of table rows that gets loaded into the page. You can watch the requests on that webpage in your browser's network inspector (in the developer tools) to see exactly what HTTP requests are happening, and replicate them as in the sketch below.
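For example, a raw-HTTP version with Python's requests might look like this (untested sketch; the payloads are copied verbatim from the request above, and the site may additionally require the session cookie you get from loading the home page first):

import requests

session = requests.Session()
# load the home page first so the session cookie (jsessionid) is set
session.get('https://services27.ieee.org/fellowsdirectory/home.html')

form_data = {
    'selectedJSON': ('{"alpha":"ALL","menu":"ALPHABETICAL","gender":"All","currPageNum":1,'
                     '"breadCrumbs":[{"breadCrumb":"Alphabetical Listing "}],'
                     '"helpText":"Click on any of the alphabet letters to view a list of Fellows."}'),
    'inputFilterJSON': '{"sortOnList":[{"sortByField":"fellow.lastName","sortType":"ASC"}],"typeAhead":false}',
    'pageNum': '2',  # the page of results you want
}
response = session.post(
    'https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html',
    data=form_data,
)
print(response.text)  # the HTML block of table rows for that page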
The link has an onclick handler that changes its href when it is clicked. Go to
https://services27.ieee.org/fellowsdirectory/home.html#results_table
In the console, enter:
window.location=getDetailProfileUrl('lOH1bDxMyI1CCIxo5ODlGg==');
This redirects to Aarons, Jules.
Now go back and enter window.location=getDetailProfileUrl('JJuL3J00kHdIUozoVAgKdg==');
This opens Aarts, Ronald.
Basically, when the link is clicked, the JavaScript changes the url of the link.
To extract them using PHP, use the file_get_contents() function.
echo file_get_contents('https://services27.ieee.org/fellowsdirectory/home.html#results_table');
That will print out the page. Now scrape it with JavaScript.
echo "<script>console.log(document.querySelectorAll('.name'));</script>";
Hope this helps.

Scraping startpage with bs4 and requests

I'm trying to scrape the search results off of http://startpage.com/, and I have already scraped the results using bs4 and requests. I ran into a problem after being able to scrape the results: I cannot get to the next page of the search results, and I cannot find a link for it using the browser's developer tools. When I inspect the element, this is what it shows: 2
That's the number-2 button. The other option is the next button, Next<span class="i_next"></span>. How do I make a request, or whatever it is I need to do, to get to the next page after scraping the results of the first page?
import requests
from bs4 import BeautifulSoup

def dork():
    url = 'https://www.startpage.com/do/search?cmd=process_search&query=inurl:admin&language=english_au&cat=web&with_language=&with_region=&pl=&ff=&rl=&abp=-1&with_date=m'
    # note: requests.get(url, 'html') passed 'html' as the params argument; drop it
    source_code = requests.get(url)
    plain_txt = source_code.text
    soup = BeautifulSoup(plain_txt, "lxml")
    for text in soup.find_all('h3', {'class': 'clk'}):
        for link in text.find_all('a'):
            href = link.get('href')
            print(href)

dork()
That's the code that gets the links.
I recommend you try Selenium/PhantomJS, which gives you a real, headless, scriptable browser; check out this answer. A rough sketch of that approach follows.
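A minimal Selenium sketch (untested against the live site; the i_next selector comes from the question's markup, and the driver choice is up to you):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # PhantomJS also worked historically; any driver will do
driver.get('https://www.startpage.com/do/search?cmd=process_search&query=inurl:admin')

# the same h3.clk anchors the bs4 version was extracting
for link in driver.find_elements(By.CSS_SELECTOR, 'h3.clk a'):
    print(link.get_attribute('href'))

# the "Next" button wraps a span with class i_next (per the question's markup);
# clicking it loads page 2, which can then be scraped the same way
driver.find_element(By.CSS_SELECTOR, 'span.i_next').click()
driver.quit()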

Web scraping: images show when shared to Facebook but not my app. Error 401 No signature found

I'm building a news curation service that uses RSS feeds from various sources including The Guardian.
When I try to pull the image from Guardian articles, I get an Error 401 No signature found response.
However, when you share the article to Facebook etc., the image shows up in the feed.
For example, this is the image link to a current article:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=1200&h=630&q=55&auto=format&usm=12&fit=crop&crop=faces%2Centropy&bm=normal&ba=bottom%2Cleft&blend64=aHR0cHM6Ly91cGxvYWRzLmd1aW0uY28udWsvMjAxNi8wNi8wNy9vdmVybGF5LWxvZ28tMTIwMC05MF9vcHQucG5n&s=bb057e1ec495b0ec4eb75a892b6a190c
From this page: https://www.theguardian.com/global-development/2016/mar/22/world-water-day-quiz-are-you-a-fount-of-wisdom
Is there a way for me to use the image like Facebook is able to?
Thanks.
The 401 error you're facing is probably caused by trying to use a resource without being logged in or authenticated to the system.
Using the following code, you'll be able to fetch a smaller version of your picture. It reads the HTML source of the page you provided and searches for img tags with the specific class.
Code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.theguardian.com/global-development/2016/mar/22/world-water-day-quiz-are-you-a-fount-of-wisdom'
html_source = requests.get(url).text
#print(html_source)

# was BeautifulSoup(source, ...), which raises NameError; the variable is html_source
soup = BeautifulSoup(html_source, 'html.parser')
img = soup.find_all('img', {'class': 'maxed responsive-img'})
Then you can print your results:
Only the first img:
print(img[0]['src'])
Output:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ba3a4698fe5fce056174eff9ff3863d6
All img results:
for i in img:
    print(i['src'])
Output:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ba3a4698fe5fce056174eff9ff3863d6
https://i.guim.co.uk/img/media/6ef58c034b1e86f3424db4258e398c88bb3a3fb4/0_0_5200_3121/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ea8370295d1e2d193136fd221263c8b8
https://i.guim.co.uk/img/media/e1c2b1336979a752a68c3c554611bc28aa0a4baa/0_290_4324_2594/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=eef138cefe66834919c3544826a3e468
https://i.guim.co.uk/img/media/37df4e7b52dfd554d431f7d439cdd1a137789fa4/0_0_4256_2553/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=9e461f6739325cf3524a1228f5f7e60b
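If you then need the image bytes themselves (for example, to rehost the picture in your curation service), a short follow-up request will fetch them; a sketch, with a hypothetical output filename:

import requests

img_url = img[0]['src']  # first match from the soup.find_all() above
data = requests.get(img_url).content
with open('article-image.jpg', 'wb') as f:  # hypothetical filename
    f.write(data)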

PhantomJS: a URL referenced from a CSS file isn't being requested in my script

I have a webpage with CSS, font, JS, etc. files, and these files call other files.
If I use the netlog.js PhantomJS example, all files are requested successfully (i.e. a response is received for each).
If I use my company's script (thousands of lines), I don't get one font file.
When I look at the netlog.js output, the missing file is the very last to be received, but it is the 4th of 9 in terms of requests: I can see it is requested 4th, but its response (page.onResourceReceived) doesn't arrive until after all the others have returned.
When I look at the company script, the missing file is not requested at all, hence it is missing. How can someone mis-program PhantomJS to ignore this file so that it isn't requested? I assume that is the bug I'm hunting for.
In case the HTML/CSS is the culprit somehow, I'm going to include it below.
I have an HTML page that includes CSS files via a style tag (partial tag below):
<style>#import url(//fast.fonts.net/t/1.css?apiType=ad&projectid=2731384a-7cac-11e5-9c62-005056a60fc6&fontids=32RbV4zvBY&campaignid=HKyhF7DchmY);#import url(//fonts.googleapis.com/css?family=Montserrat:700&text=%2C-ABCDEFGHILMNOPRSTUVWabcdefghilmnoprstuvw);
The 1.css? request is correctly processed, and then the next import of css?family is also requested. That second imported URL requests another file: http://fonts.gstatic.com/l/font?kit=IQHow_FEYlDC4Gzy_m8fcqJ_SlhcvGEAn8FM2hC_Gzi8FMKbpN1MIaqg2HOsKpgsB-MyxXR1frCnhD4ZhVnHAATo_LDfaGo7fRovcW5LQvM&skey=11a939c399e8c9fe&v=v7
netlog.js picks up the fonts.gstatic.com request, even if it doesn't come back until after everything else. The company script never figures out that it needs to request fonts.gstatic.com.
netlog.js is very basic: it doesn't mess with timing, headers, or events. I think the company script is doing something, via some setting or event, to stop the request for fonts.gstatic.com once Phantom discovers it, but I don't know where to start.
