I just started learning web scraping and decided to scrape the daily value from this site:
https://www.tradingview.com/symbols/INDEX-MMTW/
I am using BeautifulSoup, and I get the selector by inspecting the element in the browser and choosing Copy ▸ CSS Selector.
However, the returned results always have length 0. I tried both the select() method (from Automate the Boring Stuff) and the find() method.
I'm not sure what I am doing wrong. Here is the code...
import requests, bs4
res = requests.get('https://www.tradingview.com/symbols/INDEX-MMTW/')
res.raise_for_status()
nmmtw_data = bs4.BeautifulSoup(res.text, 'lxml')
(Instead of writing the selector yourself, you can also right-click on the element in your browser
and select Inspect Element. When the browser’s developer console opens, right-click on the element’s
HTML and select Copy ▸ CSS Selector to copy the selector string to the clipboard and paste it into your
source code.)
elems = nmmtw_data.select("div.js-symbol-last > span:nth-child(1)")
new_try = nmmtw_data.find(class_="tv-symbol-price-quote__value js-symbol-last")
print(type(new_try))
print(len(new_try))
print(elems)
print(type(elems))
print(len(elems))
Thanks in advance!
Since the price data is rendered with JavaScript, unfortunately we cannot simply scrape it with BeautifulSoup. Instead, you should use a web browser automation framework.
I'm sure you've found the solution by now, but if not, I believe the answer to your problem is the Selenium module. Additionally, you need to install the WebDriver specific to the browser you're using. BeautifulSoup on its own is quite limited these days because many sites render their content with JavaScript.
All the info that you need for selenium you can find here:
https://www.selenium.dev/documentation/webdriver/
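As a minimal sketch (untested against the live site, and assuming the price element still carries the js-symbol-last class you used in your find() call), the Selenium version of your script could look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # requires chromedriver; use webdriver.Firefox() with geckodriver if you prefer
try:
    driver.get('https://www.tradingview.com/symbols/INDEX-MMTW/')
    # Wait until the page's JavaScript has actually filled in the quote.
    price = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.js-symbol-last'))
    )
    print(price.text)
finally:
    driver.quit()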
I hope you are all doing well.
I am facing a problem while web scraping in R with the SelectorGadget tool. When I select the data on the Coursera website, the tool shows the correct number of values (10), but when I copy that CSS selector into R and run it, it returns 18 names in the list. Please can anyone help me with this? Here is a screenshot of the SelectorGadget output:
And here is what gets returned in R when I scrape that CSS selector:
The content rendered in a browser is not necessarily the same as the content returned by a plain HTTP request (which is what rvest makes), because a browser can run JavaScript to update the page.
Inspect the page source by pressing Ctrl+U in your browser on that page.
You can rewrite your CSS selector to match the HTML that is actually returned. One example would be as follows; it also removes the reliance on dynamic classes, which change more frequently and would break your program sooner.
library(rvest)
read_html("https://in.coursera.org/degrees/bachelors") |>
html_elements('[data-e2e="degree-list"] div[class] > p:first-child') |>
html_text2()
Learn about CSS selectors and operators here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
I'm new to Scrapy.
I am trying to get the link to the next page from this site: https://book24.ru/knigi-bestsellery/?section_id=1592
In the Scrapy shell I ran this command:
response.css('li.pagination__button-item._next a::attr(href)')
It returns an empty list.
I have also tried
response.css('a.pagination__item._link._button._next.smartLink')
but it also returns an empty list.
I will be grateful for the help!
The page is generated with JavaScript; see how it looks with view(response).
# with css:
In [1]: response.css('head > link:nth-child(28)::attr(href)').get()
Out[1]: 'https://book24.ru/knigi-bestsellery/page-2/'
# with xpath:
In [2]: response.xpath('//link[@rel="next"]/@href').get()
Out[2]: 'https://book24.ru/knigi-bestsellery/page-2/'
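If it helps, here is a minimal sketch (untested) of a spider that follows that rel="next" link from the page head to walk through the listing pages; the item-extraction part is left as a placeholder:
import scrapy

class Book24Spider(scrapy.Spider):
    name = 'book24'
    start_urls = ['https://book24.ru/knigi-bestsellery/?section_id=1592']

    def parse(self, response):
        # ... extract whatever you need from the current listing page here ...
        next_url = response.xpath('//link[@rel="next"]/@href').get()
        if next_url:
            # Follow the next page advertised in the <head> and parse it the same way.
            yield response.follow(next_url, callback=self.parse)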
I would like to add to @SuperUser's answer. Seeing as the site loads its HTML via JavaScript, please read the documentation on how to handle JavaScript websites. scrapy-playwright is a recent library that I have found to be quite fast and easy to use when scraping JS-rendered sites.
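As a rough sketch of what that setup looks like (the settings below come from the scrapy-playwright README; adjust them to your project), requests marked with playwright in their meta are rendered by a headless browser before they reach parse():
import scrapy

class Book24JsSpider(scrapy.Spider):
    name = 'book24_js'
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    def start_requests(self):
        # Ask scrapy-playwright to render this page in a headless browser.
        yield scrapy.Request(
            'https://book24.ru/knigi-bestsellery/?section_id=1592',
            meta={'playwright': True},
        )

    def parse(self, response):
        # The JavaScript-generated pagination markup from the question should now be present.
        yield {'next_page': response.css('li.pagination__button-item._next a::attr(href)').get()}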
I'm trying to build a Hacker News scraper using Symfony 2's DomCrawler [1].
When I try out the XPath with a Chrome plugin [2], it works. But when I try it in my scraper I keep getting "The current node list is empty."
Here's my scraper code:
$crawler1 = $client1->request('GET','https://news.ycombinator.com/item?id=8296437');
$hnpost->selftext = $crawler1->filterXPath('/html/body/center/table/tbody/tr[3]/td/table[1]/tbody/tr[4]/td[2]')->text();
[1] http://api.symfony.com/2.0/Symfony/Component/DomCrawler/Crawler.html#method_filter
[2] https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en-US
If the problem is what I think it is, I've been battered by this one a couple of times. Chrome implicitly adds any missing <tbody> tags to the DOM, so if you then copy the XPath or CSS path, you may also have copied tags that don't necessarily exist in the source document. Try viewing the page's source and see if the DOM reported by your browser's console corresponds to the original source HTML. If the <tbody> tags are absent, be sure to exclude them in your filterXPath() call.
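You can see the effect with any HTML parser that, unlike Chrome, does not insert the missing <tbody>. Here is a small Python/lxml sketch with made-up markup, purely to illustrate the principle (the same idea applies to your filterXPath() call):
from lxml import html

# Hypothetical markup: the raw source has no <tbody>, although Chrome's DOM will show one.
doc = html.fromstring('<html><body><table><tr><td>cell</td></tr></table></body></html>')
print(doc.xpath('//table/tbody/tr/td/text()'))  # [] -- the tbody only exists in the browser's DOM
print(doc.xpath('//table/tr/td/text()'))        # ['cell'] -- the path without tbody matches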
So, the HTML of the page has Hi There, and I'm using CSS to convert it to HI THERE.
I run Cucumber to check that the page has HI THERE on it (as it should, since that's the end result).
Yet when I run Cucumber, I get an error: HI THERE is not on the page. If I edit the test to look for Hi There instead, it passes.
Key Question: How do I tell Cucumber that I need it to render the CSS before giving me an OK answer?
Edit: So, Cucumber does load the CSS when I get good old Firefox to pop up and run for me. Therefore, my question now becomes: is there a way to tell Cucumber to render the CSS without a browser being loaded?
def hi_there(options = {})
click_button('hi there')
expect(page).to have_content("HI THERE")
# some other code for more testing
end
I'm using CSS to convert it to HI THERE.
This means that you are using the text-transform: uppercase; property. That transformation is applied by the browser's rendering engine, and Capybara's default rack_test driver does not render CSS (or run JavaScript), so it only ever sees the original Hi There text.
So you would need to install (if you haven't already) a JavaScript-capable driver and use it for your scenario.
Refer to the "Using Capybara with Cucumber" section of the Capybara documentation.
I am new to Rails and Selenium but have used other automated testing tools.
I exported a script from the Selenium 2 IDE to Rails/RSpec and am altering the code to get it to run. The script fails to find a specific link.
The original Ruby code as exported from the working IDE script was:
#driver.find_element(:link, "skip").click
This failed, so I attempted to identify the element with an XPath statement. (There were other elements that failed in the originally exported code that I fixed by using XPath, so that’s why I am using this strategy.)
I tried different alternatives to identify the link in the Ruby code, such as:
Attempt #1: #driver.find_element(:xpath, "//*[@class='skip-link']").click
Attempt #2: #driver.find_element(:xpath, "//*[@value='skip']").click
Result in all cases: Unable to locate element: {"method":"link text","selector":"skip"}
HTML reported per Firebug:
<a style="float: right; margin-right: 10px; text-decoration: none;" href="/yourfuture" class="skip-link"> skip </a>
XPath reported per Firebug:
/html/body/div[2]/div[2]/div/div/div[3]/div[4]/a
I have the timeout set to 60 seconds and I can see the skip link displayed for several seconds before the script fails, so I don't think this is an issue.
One possibly relevant fact: when the app presents the window with all the controls, the skip link does not appear immediately. By design, the app waits about 5 seconds before displaying it.
What am I doing wrong? Thanks in advance.
It turns out the link that was not found was inside an iframe. The attributes of the iframe change, so it was difficult to identify it with a static property. I simply inserted the following line of code just before the line that WebDriver could not "see":
#driver.switch_to.frame(0)
I could have also used #driver.switch_to.frame("iframe-identifier-per-switch") if there had been a reliable way of identifying the iframe.
Yeah, you probably knew that but maybe this will help another newbie avoid the same problem.