I'm trying to scrape the search results off of http://startpage.com/. I have already scraped the results using bs4 and requests, but I ran into a problem after that: I cannot get to the next page of the search results, and I cannot find a link for it using the browser's developer tools. When I inspect the element, all it shows for the number 2 button is 2.
That's the number 2 button. The other option is the Next button, which shows Next<span class="i_next"></span>. How do I make a request, or whatever it is I need to do, to get to the next page after scraping the results of the first page?
import requests
from bs4 import BeautifulSoup
def dork():
    url = 'https://www.startpage.com/do/search?cmd=process_search&query=inurl:admin&language=english_au&cat=web&with_language=&with_region=&pl=&ff=&rl=&abp=-1&with_date=m'
    # Fetch the results page (requests.get takes no parser argument;
    # passing 'html' as the second positional parameter was a bug).
    source_code = requests.get(url)
    plain_txt = source_code.text
    soup = BeautifulSoup(plain_txt, "lxml")
    # Each search result heading is an <h3 class="clk"> wrapping the link.
    for text in soup.find_all('h3', {'class': 'clk'}):
        for link in text.find_all('a'):
            href = link.get('href')
            print(href)
dork()
That's the code that gets the links.
I recommend you try Selenium/PhantomJS, which gives you a real, headless, scriptable browser. Check out this answer.
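A minimal sketch of that approach using Selenium with headless Firefox (PhantomJS itself is deprecated); the span.i_next selector comes from the markup quoted in the question, and the result-link selector mirrors the h3.clk lookup above, so verify both against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)
driver.get("https://www.startpage.com/do/search?cmd=process_search&query=inurl:admin")

# Collect the result links on the current page.
for link in driver.find_elements(By.CSS_SELECTOR, "h3.clk a"):
    print(link.get_attribute("href"))

# Click the "Next" button (the span with class i_next sits inside it) to load page two.
driver.find_element(By.CSS_SELECTOR, "span.i_next").click()
driver.quit()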
So I have tried web scraping a website that has a field you can type into (a navigation bar of some sort).
Whenever I write something there, it creates a dropdown of things related to what I wrote (things that contain what I wrote).
What I'm trying to do is essentially use requests.post from the requests library in Python to fill in a value, and afterwards grab whatever the dropdown shows.
I've had a few problems while doing it:
The dropdown disappears whenever you click somewhere else on the website, so the HTML tags for the list only exist temporarily.
I couldn't find a way to actually post something into the navigation bar.
A great example I've found on the web is FUTWIZ, which does exactly what I described above. Whenever I try with F12, I see it creates some HTML for the suggestions, so is there a way to grab the HTML after the value is put into the actual navigation bar?
EDIT
This is the code I've tried:
import requests
from bs4 import BeautifulSoup
url = "https://www.futwiz.com/en/"
# Fetch the landing page and parse it (the dropdown markup is not in this static HTML).
request = requests.get(url)
bs4_out = BeautifulSoup(request.text, "html.parser")
# Attempt to post a value into the search field.
poster = requests.post(url, data={"form-control": "Messi"})
print(poster.text)
Now, I know the data in requests.post only sends it as a query, but I can't really figure out how to fill in the header.
This is the link to FUTWIZ, which has the navigation bar I'm trying to work with:
https://www.futwiz.com/en/
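Since the dropdown is built by JavaScript as you type, requests.post alone can't trigger it; a real browser can. A minimal sketch with Selenium, where the input selector (input.form-control, guessed from the form field name above) and the dropdown-item selector are assumptions to verify in the developer tools:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.futwiz.com/en/")

# Type into the search field so the site builds the dropdown.
search_box = driver.find_element(By.CSS_SELECTOR, "input.form-control")  # assumed selector
search_box.send_keys("Messi")

# Wait for the suggestion list to be injected into the DOM, then read it.
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".dropdown-menu li"))  # assumed selector
)
for item in items:
    print(item.text)
driver.quit()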
I just started learning web scraping and decided to scrape the daily value from this site:
https://www.tradingview.com/symbols/INDEX-MMTW/
I am using BeautifulSoup, and I got the selector by inspecting the element and choosing Copy -> CSS Selector.
However, the returned items always have length 0. I tried both the select() method (from ATBS) and the find() method.
Not sure what I am doing wrong. Here is the code...
import requests, bs4
res = requests.get('https://www.tradingview.com/symbols/INDEX-MMTW/')
res.raise_for_status()
nmmtw_data = bs4.BeautifulSoup(res.text, 'lxml')
(Instead of writing the selector yourself, you can also right-click on the element in your browser and select Inspect Element. When the browser's developer console opens, right-click on the element's HTML and select Copy ▸ CSS Selector to copy the selector string to the clipboard and paste it into your source code.)
elems = nmmtw_data.select("div.js-symbol-last > span:nth-child(1)")  # comes back as an empty list
new_try = nmmtw_data.find(class_="tv-symbol-price-quote__value js-symbol-last")  # comes back as None
print(type(new_try))
print(len(new_try))  # len(None) raises a TypeError
print(elems)
print(type(elems))
print(len(elems))  # 0
Thanks in advance!
Since the price table is generated with JavaScript, unfortunately, we cannot simply use BeautifulSoup to scrape it. Instead, you should use a web-browser automation framework.
I'm sure you've found the solution by now, but if not, I believe the answer to your problem is the selenium module. Additionally, you need to install the webdriver specific to the browser you're using. I think BeautifulSoup alone is very limited these days because most sites are generated using JavaScript.
All the info you need for Selenium can be found here:
https://www.selenium.dev/documentation/webdriver/
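For example, a minimal sketch, assuming Chrome with a matching ChromeDriver installed and reusing the js-symbol-last class from the question's own selector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.tradingview.com/symbols/INDEX-MMTW/")

# Wait for the JavaScript-rendered price element before reading its text.
price = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".js-symbol-last"))
)
print(price.text)
driver.quit()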
I have scraped this URL, https://www.disco.com.ar/prod/409496/at%C3%BAn-en-aceite-jumbo-120-gr, for months in order to get the price, for example. But last week I couldn't, and I don't understand what changed, because the response only returns the icon and not the HTML.
I use scrapy + splash.
Here is an example of the response in Splash.
I changed the settings in Scrapy and also the Lua script in Splash, but nothing worked.
I have tried to scrape the search results on the website below:
https://www.guelphchamber.com/find/#/action/AdvancedSearch/cid/119/id/201/listingType/O?value=&city=14
But the response text shows the body with only JavaScript arrays. Do I need to use Selenium instead of requests?
import requests
session = requests.session()
url = "https://www.guelphchamber.com/find#/action/AdvancedSearch/cid/119/id/201/listingType/O?value=&city=14"
# The static HTML comes back, but the search results themselves are built by JavaScript.
response = session.get(url)
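If the results only exist after the page's JavaScript runs, then yes, the same Selenium approach as in the answers above is the usual fix. A minimal sketch, assuming the listings render into the page DOM once loaded:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.guelphchamber.com/find/#/action/AdvancedSearch/cid/119/id/201/listingType/O?value=&city=14")

# Wait until the JavaScript has rendered something into the body, then grab the HTML.
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_element(By.TAG_NAME, "body").text) > 0
)
print(driver.page_source)
driver.quit()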
I'm building a news curation service that uses RSS feeds from various sources including The Guardian.
When I try to pull the image from The Guardian articles, I get an "Error 401 No signature found" error.
However, when you share the article to Facebook etc., the image shows up in the feed.
For example, this is the image link to a current article:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=1200&h=630&q=55&auto=format&usm=12&fit=crop&crop=faces%2Centropy&bm=normal&ba=bottom%2Cleft&blend64=aHR0cHM6Ly91cGxvYWRzLmd1aW0uY28udWsvMjAxNi8wNi8wNy9vdmVybGF5LWxvZ28tMTIwMC05MF9vcHQucG5n&s=bb057e1ec495b0ec4eb75a892b6a190c
From this page: https://www.theguardian.com/global-development/2016/mar/22/world-water-day-quiz-are-you-a-fount-of-wisdom
Is there a way for me to use the image like Facebook is able to?
Thanks.
The 401 error that you're facing is probably caused by trying to use some intranet resources without being logged in or authenticated to the system.
Using the following code, you'll be able to fetch a smaller version of your picture. It reads the HTML source of the page you provided and searches for img tags matching specific requirements.
Code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.theguardian.com/global-development/2016/mar/22/world-water-day-quiz-are-you-a-fount-of-wisdom'
html_source = requests.get(url).text
#print(html_source)
soup = BeautifulSoup(html_source, 'html.parser')  # parse the fetched HTML (not the undefined name 'source')
img = soup.find_all('img', {'class':'maxed responsive-img'})
Then you can print your results.
Only the first img:
print(img[0]['src'])
Output:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ba3a4698fe5fce056174eff9ff3863d6
All img results:
for i in img:
    print(i['src'])
Output:
https://i.guim.co.uk/img/media/dd92773d05e7da9adcff7c007390a746930c2f71/0_0_2509_1505/master/2509.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ba3a4698fe5fce056174eff9ff3863d6
https://i.guim.co.uk/img/media/6ef58c034b1e86f3424db4258e398c88bb3a3fb4/0_0_5200_3121/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=ea8370295d1e2d193136fd221263c8b8
https://i.guim.co.uk/img/media/e1c2b1336979a752a68c3c554611bc28aa0a4baa/0_290_4324_2594/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=eef138cefe66834919c3544826a3e468
https://i.guim.co.uk/img/media/37df4e7b52dfd554d431f7d439cdd1a137789fa4/0_0_4256_2553/2000.jpg?w=300&q=55&auto=format&usm=12&fit=max&s=9e461f6739325cf3524a1228f5f7e60b