Two different BeautifulSoup outputs? - web-scraping

I have a piece of code that I made in Google Colab that essentially just scrapes a piece of data from a website, shown below:
#imports
#<div class="sc-aef7b723-0 dDQUel priceTitle">
#<div class="priceValue ">
from bs4 import BeautifulSoup
import requests
import time
url = 'https://coinmarketcap.com/currencies/index-cooperative/'
HTML = requests.get(url)
soup = BeautifulSoup(HTML.text, 'html.parser')
text = soup.find('div', attrs={'class':'sc-aef7b723-0 dDQUel priceTitle'}).find('div', attrs={'class':'priceValue '}).text
print(text)
I need this to run as a py file on my computer, but when it runs as a py file, I get the error:
text = soup.find('div', attrs={'class':'sc-aef7b723-0 dDQUel priceTitle'}).find('div', attrs={'class':'priceValue '}).text
AttributeError: 'NoneType' object has no attribute 'text'
I was wondering why this happened as it is the exact same code. All of my packages are at the most recent version as well.

You just need to remove the trailing space in the class name, attrs={'class':'priceValue'}, because when you run the specified web page through html.parser it corrects the HTML in some ways.
In this case it removes the trailing space that is present on the web page, because it doesn't really make sense to have a trailing space in a class name. Spaces are only needed when an element has more than one class.
So the parsed web page that you store in your soup variable has that div looking like this: <div class="priceValue"><span>$1.74</span></div>. The soup.find function does care about trailing spaces, so it couldn't match the class 'priceValue ' against 'priceValue'.
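You can see this normalization with a minimal inline snippet (a sketch against a hard-coded fragment, not the live page; the $1.74 value is copied from the example above):
from bs4 import BeautifulSoup

# html.parser stores the class as ['priceValue'], dropping the trailing space
html = '<div class="priceValue "><span>$1.74</span></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('div', attrs={'class': 'priceValue '}))      # None
print(soup.find('div', attrs={'class': 'priceValue'}).text)  # $1.74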
If you need to match the class with any trailing or leading whitespace, you could use the soup.select function instead. It matches elements with CSS selectors, which don't care about surrounding spaces, so you could find an element of that class like this (with any amount of trailing and/or leading whitespace):
css_selected_value = soup.select("[class= priceValue ]")[0].text
print(css_selected_value)
That being said, I'm not sure why your code works properly on Google Colab; I've never tried it. Maybe I'll dig into it later.

Related

How to scrape transliterated or font-rendered text from an HTML page

I want to scrape https://777codes.com/newtestament/gen1.html and fetch all the Hebrew sentences.
However, some letters in the words are rendered by the stylesheet and font files, so data fetched by scraping the HTML directly is not complete.
For example, when I use Beautiful Soup and fetch the contents of the first "stl_01 stl_21" class div, I get "ייתꢀראꢁראꢁ" when I should be getting "בראשית".
I think I need to build a character map and match and replace the missing letters? How do I convert the scraped string into something I can work with, like UTF-8 encoded text or Unicode code points, so I can then look up and replace the missing/replaced characters with their correct values?
Or is there a simpler way to get "בראשית" instead of "ייתꢀראꢁראꢁ" when scraping the first "stl_01 stl_21" class div?
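Here is roughly the remapping I have in mind (assuming scraped holds the text fetched from the div; the two map entries are placeholders, since the real glyph-to-letter pairs would have to be recovered from the site's font files):
# inspect which code points actually came back from the scrape
print([hex(ord(ch)) for ch in scraped])

# placeholder character map: the real entries must come from the site's fonts
glyph_map = {
    '\ua880': '\u05e9',  # placeholder: custom glyph -> Hebrew shin
    '\ua881': '\u05d1',  # placeholder: custom glyph -> Hebrew bet
}
fixed = ''.join(glyph_map.get(ch, ch) for ch in scraped)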

CSS Selector With Several Quotations

I'm having trouble finding a unique element or set of elements to identify a password field. There are two attributes I want to experiment with, but I haven't figured out how to deal with all the quotation marks. Typically there are only two sets, and I know one must be single and the other double, but how are, e.g., three sets to be managed?
(Is this impossible and should I take another approach such as using a path-like approach using descendants/children?)
Website I'm working on:
https://myibd.investors.com/secure/signin.aspx?eurl=https://marketsmith.investors.com/
My code so far:
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# create driver object and launch the webpage
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 20)
driver.get("https://myibd.investors.com/secure/signin.aspx?eurl=https://marketsmith.investors.com/")
# switch to the iframe we need
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[id = 'signin-iframe']"))
# create variable for the login field
login_field = driver.find_element_by_css_selector("input[name='username'][data-gigya-placeholder='Email']")
# variable for password field
pswd_field = driver.find_element_by_css_selector()
The two tags & values I'd like to experiment with:
gigya-expression:data-gigya-placeholder="screenset.translations['PASSWORD_132128826476804690_PLACEHOLDER']"
gigya-expression:aria-label="screenset.translations['PASSWORD_132128826476804690_PLACEHOLDER']"
Edit 1
Two new attempts that did not work:
1.) I tried using backslashes \ to escape the quotes inside the value as suggested below.
password_field = driver.find_element_by_css_selector("input[gigya-expression:data-gigya-placeholder='\"screenset.translations[\'PASSWORD_132128826476804690_PLACEHOLDER\']\"]")
2.) I took @Laif's idea a little further and found this article on using escape characters for single and double quotes, \' and \" respectively.
password_field = driver.find_element_by_css_selector("input[gigya-expression:data-gigya-placeholder='"screenset.translations['PASSWORD_132128826476804690_PLACEHOLDER']"']")
Edit 2
Workaround using XPath
I haven't figured out how to deal with the problem in CSS, but I've gotten what I need using XPath. After inspecting the target field, you can right-click the HTML and choose Copy > Copy XPath.
copied XPath:
//*[@id="gigya-login-form"]/div[2]/div[3]/div[2]/input
Python code:
pswd_field = driver.find_element_by_xpath("//*[@id='gigya-login-form']/div[2]/div[3]/div[2]/input")
If quotations are giving you trouble, remember you can escape any character inline with \
For example, if I were to assign s = "\"\"\"\"", then print(s) would print out """".
Both of these are valid strings as well, because the ' characters are enveloped by the " characters:
gigya-expression:data-gigya-placeholder="screenset.translations['PASSWORD_132128826476804690_PLACEHOLDER']"
gigya-expression:aria-label="screenset.translations['PASSWORD_132128826476804690_PLACEHOLDER']"
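Applied to the attributes above, one way through the quoting (a sketch, assuming Selenium 3's find_element_by_css_selector and that the gigya-expression:... attribute is queryable as-is) is to let each layer use a different quote character:
# Python string in single quotes, CSS attribute value in double quotes;
# the inner single quotes are escaped only at the Python level, and the
# ':' in the attribute name is escaped for CSS with a backslash
# (doubled so it survives Python's own string parsing)
selector = ('input[gigya-expression\\:data-gigya-placeholder='
            '"screenset.translations[\'PASSWORD_132128826476804690_PLACEHOLDER\']"]')
pswd_field = driver.find_element_by_css_selector(selector)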

Beautiful soup replaces certain symbols in a URL with other symbols

I am parsing a certain webpage with Beautiful Soup, trying to retrieve all links that are inside h3 tags:
page = requests.get('https://www....')
soup = BeautifulSoup(page.text, "html.parser")
links = []
for item in soup.find_all('h3'):
    links.append(item.a['href'])
However, the links found are different from the links present in the page. For example, when the link http://www.estense.com/?p=116872 is present in the page, Beautiful Soup returns http://www.estense.com/%3Fp%3D116872, replacing '?' with '%3F' and '=' with '%3D'. Why is that?
Thanks.
You can unquote the URL using urllib.parse
from urllib import parse
parse.unquote(item.a['href'])
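Applied to the loop from the question, that might look like:
from urllib import parse

links = []
for item in soup.find_all('h3'):
    links.append(parse.unquote(item.a['href']))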

How to get the span of a dictionary as it appears on the site?

I am trying to get all the meanings in the "noun" heading of the word the user enters.
This is my code for now:
import requests
from bs4 import BeautifulSoup
word=raw_input("Enter word: ").lower()
url=('http://www.dictionary.com/browse/'+word)
r=requests.get(url)
soup=BeautifulSoup(r.content,"html.parser")
try:
    meaning=soup.find("div",attrs={"class":"def-content"}).get_text()
    print "Meaning of",word,"is: "
    print meaning
except AttributeError:
    print "Sorry, we were not able to find the word."
finally:
    print "Thank you for using our dictionary."
Now suppose the user enters the word "today" and my output will be:
this present day: Today is beautiful.
I don't understand why it leaves so many spaces and why the part "Today is beautiful" doesn't come down to the next line.
Anyway, when you look up that word on the site, you can see there are 2 meanings, yet my program only shows one.
I want the output to be:
1.this present day:
Today is beautiful.
2.
this present time or age:
the world of today.
Can anyone explain what's wrong and how I can fix it? I have no idea what's wrong, so please don't think I didn't try.
With the above code you are only getting the first noun meaning. I have rewritten the code as below:
from bs4 import BeautifulSoup
import requests
word = raw_input("Enter word: ").lower()
url = ('http://www.dictionary.com/browse/' + word)
r = requests.get(url)
bsObj = BeautifulSoup(r.content, "lxml")
nouns = bsObj.find("section", {"class": "def-pbk ce-spot"})
data = nouns.findAll('div', {'class': 'def-content'})
count = 1
for item in data:
    temp = ' '.join(item.get_text().strip().split())
    print str(count) + '. ' + temp
    count += 1
Explanation:
Yes. Assuming the website shows the noun meanings first, I am retrieving the first section, which contains the complete noun data. Then I find all the meanings under that section, store them in the data variable, and iterate over them in a loop, fetching the text of each meaning. Finally, to remove the extra spaces, I split the fetched text and rejoin it with single spaces, adding a number at the beginning of each line.
try:
    meaning = soup.find(attrs={"class": "def-pbk ce-spot"}).get_text(separator="\n", strip=True)
You can strip the whitespace from the text by passing strip=True to get_text().
The reason you didn't get all the text is that your selector was wrong; you should make the search range bigger.
I added separator="\n" to get_text() to format the output.
If you have any questions, you can read the BeautifulSoup documentation.
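Put together with the question's setup, a minimal sketch (assuming the def-pbk ce-spot class from the answer above still exists on the page) might look like:
from bs4 import BeautifulSoup
import requests

word = raw_input("Enter word: ").lower()
r = requests.get('http://www.dictionary.com/browse/' + word)
soup = BeautifulSoup(r.content, "html.parser")

section = soup.find(attrs={"class": "def-pbk ce-spot"})
if section is not None:
    # one meaning per line, surrounding whitespace trimmed
    print(section.get_text(separator="\n", strip=True))
else:
    print("Sorry, we were not able to find the word.")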

How to represent markdown properly with escaping and line breaks?

I'm currently trying to build a chat app, using the official markdown package as well as underscore's escape function, and my template contains something like this:
<span class="message-content">
{{#markdown}}{{text}}{{/markdown}}
</span>
When I grab the text from the chat input box, I try to escape any HTML and then add in line breaks. safeText is then inserted into the database and displayed in the above template.
rawText = $("#chat-input-textbox").val();
safeText = _.escape(rawText).replace(/(?:\r\n|\r|\n)/g, '\n');
The normal stuff like headings, italics, and bold looks okay. However, there are two major problems:
Code escape issue - With the following input:
<script>alert("test")</script>
```
alert('hello');
```
This is _italics_!
Everything looks fine, except the alert('hello'); has become alert(&#x27;hello&#x27;); instead. The <pre> blocks aren't rendering the escaped characters, which makes sense. But the problem is, the underscore JS escape function escapes everything.
SOLVED: Line break Issue - With the following input:
first
second
third
I get first second third displayed with no line breaks. I understand this could be a markdown thing, since I believe you need an empty line between paragraphs to get line breaks in markdown. But having the above behaviour would be ideal; does anyone know how to do this?
UPDATE: The line break issue has been solved by adding an extra \n to my regex replacement, so now any line break is represented by at least two \n characters (i.e. \n\n).
You should check the showdown docs and the wiki article they have on the topic.
The marked npm package, which is used by Telescope, removes disallowed tags; these include <script>, of course. As the article I linked to above explains, there's still another problem with this:
<a href='javascript:alert("kidding! Im more the world domination kinda guy. Muhahahah")'>
click me for world peace!
</a>
This isn't prevented by marked. I'd follow the author's advice and use an HTML sanitization library, like OWASP's ESAPI or Caja's html-sanitizer. Both of these projects seem outdated, though. I also found a showdown extension for this called showdown-xss-filter. So my advice is to write your own helper and use showdown-xss-filter.
