Not able to scrape a webpage using scrapy or Beautifulsoup - web-scraping

I tried to scrape the search results on the website below:
https://www.guelphchamber.com/find/#/action/AdvancedSearch/cid/119/id/201/listingType/O?value=&city=14
But the response body only contains JavaScript arrays rather than the rendered results. Do I need to use Selenium instead of requests?
import requests

session = requests.Session()
url = "https://www.guelphchamber.com/find/#/action/AdvancedSearch/cid/119/id/201/listingType/O?value=&city=14"
response = session.get(url)
print(response.text)  # shows only the JavaScript bootstrap, not the search results
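Since the results are rendered client-side, one option is a headless browser. A minimal Selenium sketch, assuming a WebDriver such as chromedriver is installed (the fixed wait time is a guess and may need tuning for this site):

```python
import time

def fetch_rendered_html(url, wait_seconds=10):
    """Load the page in headless Chrome so the client-side JavaScript can
    run, then return the fully rendered page source."""
    # Imported inside the function so the sketch can be defined and read
    # without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        # Crude wait for the JS app to render; a WebDriverWait on a known
        # result selector would be more robust.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()

# Usage (requires chromedriver on your PATH):
# html = fetch_rendered_html("https://www.guelphchamber.com/find/#/action/AdvancedSearch/cid/119/id/201/listingType/O?value=&city=14")
```

You can then feed the returned HTML to BeautifulSoup as usual.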

Related

scrapy + splash response incomplete page

I have been scraping this URL, https://www.disco.com.ar/prod/409496/at%C3%BAn-en-aceite-jumbo-120-gr, for months in order to get the price, for example. But last week it stopped working, and I don't understand what changed: the response now returns only the icon, not the HTML.
I use Scrapy + Splash.
Here is an example of the response in Splash.
I changed the settings in Scrapy, and also the Lua script in Splash, but nothing worked.

Scraping startpage with bs4 and requests

I'm trying to scrape the search results from http://startpage.com/. I have already scraped the first page of results using bs4 and requests, but I ran into a problem after that: I cannot get to the next page of results, and I can't find a link using the browser's developer tools. When I inspect the element, this is what it shows for the number 2 button: 2
The other option is the next button: Next<span class="i_next"></span>. How do I make a request (or whatever else I need to do) to get to the next page after scraping the results of the first page?
import requests
from bs4 import BeautifulSoup

def dork():
    url = 'https://www.startpage.com/do/search?cmd=process_search&query=inurl:admin&language=english_au&cat=web&with_language=&with_region=&pl=&ff=&rl=&abp=-1&with_date=m'
    source_code = requests.get(url)  # the stray 'html' positional argument was being sent as query params; drop it
    plain_txt = source_code.text
    soup = BeautifulSoup(plain_txt, "lxml")
    for text in soup.find_all('h3', {'class': 'clk'}):
        for link in text.find_all('a'):
            href = link.get('href')
            print(href)

dork()

That's the code that gets the links.
I recommend you try Selenium with PhantomJS (or another headless browser), which gives you a real, scriptable browser. Check out this answer.
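A minimal sketch of that approach, assuming a WebDriver is installed; the selectors for the result links and the next button are taken from the markup quoted in the question and may need adjusting:

```python
def scrape_pages(query_url, max_pages=3):
    """Collect result links page by page, clicking the 'Next' button
    between pages."""
    # Imported inside the function so it can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    results = []
    try:
        driver.get(query_url)
        for _ in range(max_pages):
            # Same selector the requests/bs4 version used for result links.
            for link in driver.find_elements(By.CSS_SELECTOR, "h3.clk a"):
                results.append(link.get_attribute("href"))
            try:
                driver.find_element(By.CSS_SELECTOR, "span.i_next").click()
            except Exception:
                break  # no 'Next' button found: we are on the last page
    finally:
        driver.quit()
    return results
```

Because the browser actually submits the pagination request, you don't need to reverse-engineer the form parameters behind the button.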

Web scraping: images show when shared to Facebook but not my app. Error 401 No signature found

I'm building a news curation service that uses RSS feeds from various sources including The Guardian.
When I try to pull the image from Guardian articles, I get an "Error 401 No signature found" response.
However, when the article is shared to Facebook etc., the image does show in the feed.
For example, this is the image link to a current article:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=1200&h=630&q=55&auto=format&usm=12&fit=crop&crop=faces%2Centropy&bm=normal&ba=bottom%2Cleft&blend64=aHR0cHM6Ly91cGxvYWRzLmd1aW0uY28udWsvMjAxNi8wNi8wNy9vdmVybGF5LWxvZ28tMTIwMC05MF9vcHQucG5n&s=bb057e1ec495b0ec4eb75a892b6a190c
From this page: https://www.theguardian.com/global-development/2016/mar/22/world-water-day-quiz-are-you-a-fount-of-wisdom
Is there a way for me to use the image like Facebook is able to?
Thanks.
The 401 "No signature found" error suggests that the full-size image URL is signed (note the s=... parameter at the end of the URL) and your request is being rejected because the signature doesn't validate for your use.
Using the following code you can fetch a smaller version of the picture. It reads the HTML source of the page you provided and searches for <img> tags with a specific class.
Code:
from bs4 import BeautifulSoup
import requests

url = 'https://www.theguardian.com/global-development/2016/mar/22/world-water-day-quiz-are-you-a-fount-of-wisdom'
html_source = requests.get(url).text
soup = BeautifulSoup(html_source, 'html.parser')  # 'source' was undefined; parse html_source
img = soup.find_all('img', {'class': 'maxed responsive-img'})
Then you can print your results:
Only the first img:
print(img[0]['src'])
Output:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ba3a4698fe5fce056174eff9ff3863d6
All img results:
for i in img:
    print(i['src'])
Output:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ba3a4698fe5fce056174eff9ff3863d6
https://i.guim.co.uk/img/media/6ef58c034b1e86f3424db4258e398c88bb3a3fb4/0_0_5200_3121/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ea8370295d1e2d193136fd221263c8b8
https://i.guim.co.uk/img/media/e1c2b1336979a752a68c3c554611bc28aa0a4baa/0_290_4324_2594/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=eef138cefe66834919c3544826a3e468
https://i.guim.co.uk/img/media/37df4e7b52dfd554d431f7d439cdd1a137789fa4/0_0_4256_2553/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=9e461f6739325cf3524a1228f5f7e60b
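Once you have a working src URL, saving the image is straightforward. A small sketch (the function name is mine; it just writes the bytes to disk with requests):

```python
def download_image(src, path):
    """Fetch the image bytes and write them to disk."""
    import requests  # imported lazily; the answer above already uses requests

    resp = requests.get(src, timeout=30)
    resp.raise_for_status()  # surface a 401/404 instead of saving an error page
    with open(path, "wb") as f:
        f.write(resp.content)

# Usage, with the first result from the answer above:
# download_image(img[0]['src'], "water-day.jpg")
```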

Avoid downloading images using Beautifulsoup and urllib.request

I am using BeautifulSoup (the 'lxml' parser) with urllib.request.urlopen() to get text information from a website. However, when I check the network section in my Activity Monitor, I see that Python downloads a lot of data. This suggests that not only the text is downloaded, but the images as well.
Is it possible to avoid downloading images when webscraping with BeautifulSoup?
That's unlikely: the images are not in the page itself, only references to them in tags like <img src="/here/goes/this/img">. A browser has to make extra trips to wherever the static files (JS, images, CSS) live, but urllib only fetches the single URL you pass to urlopen(), so BeautifulSoup never downloads the images at all. If you want to reduce the size of the HTML transfer itself, request compressed content:
Add an "Accept-Encoding": "gzip" header to the Request object. If the server supports it, the size reduction will be good. You will then gzip.decompress() the body to get the string data.
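The gzip suggestion can be sketched like this (the function name is mine; the decompression round-trip at the bottom runs locally without any network):

```python
import gzip
import urllib.request

def fetch_html_gzipped(url):
    """Request gzip-compressed HTML and decompress it to a string."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
        # Some servers ignore the header and send plain text; only
        # decompress when the response says it is gzipped.
        if resp.headers.get("Content-Encoding") == "gzip":
            data = gzip.decompress(data)
    return data.decode("utf-8", errors="replace")

# The decompression step itself, demonstrated locally:
payload = gzip.compress(b"<html>hello</html>")
assert gzip.decompress(payload) == b"<html>hello</html>"
```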

How can I scrape Google?

How do I get HTML inside google.com?
Let's say I go to Google and type "Humpty Dumpty" and I get the search results and the URL changes to something like:
https://www.google.com/search?newwindow=1&q=humpty+dumpty&oq=humtp&gs_l=serp.3.0.0i10l10.7599.8190.0.9757.5.5.0.0.0.0.373.732.3j1j0j1.5.0....0...1c.1.30.serp..2.3.187.2B69R71ux4U
But when I use HttpWebRequest to download this page, the response doesn't contain any of the search-result HTML. I think this is because Google requests the results after the page is loaded?
Is there any way I can get the HTML?
P.S: I know scraping from Google is against their TOS. I am trying to learn of how to scrape such websites.
Using the code below, I'm seeing the correct HTML come back (content about nursery rhymes). It uses WebClient to retrieve the page:
WebClient wbclient = new WebClient();
string html = wbclient.DownloadString("https://www.google.com/search?newwindow=1&q=humpty+dumpty&oq=humtp&gs_l=serp.3.0.0i10l10.7599.8190.0.9757.5.5.0.0.0.0.373.732.3j1j0j1.5.0....0...1c.1.30.serp..2.3.187.2B69R71ux4U");
