bs4 find sub-text / find_next - web-scraping

On the linked page there is a section called "COSEWIC Assessment report". This section has bold text heading each category, followed by non-bold text containing the information for that category. I am looking to scrape the non-bold text using bs4.
The bold text is wrapped in <strong> sample text </strong> tags, so I can find the bold title for each category with result = s.find("strong", text=re.compile("Scientific name")).
From there, I would like to scrape the information under each given header. If I inspect the HTML for that section, it looks like this:
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
So, from a starting point of having located the "Scientific name" part, how do I get the "Anarta edwarsii" part?
I thought maybe bs4's find_next_sibling() would work, or something of the sort, but so far nothing has been successful. It is also important to note that I cannot look up the element by its text, because I have to repeat the process for many different species: the header remains constant but its sub-text changes.
Thanks!!

You can use next_siblings as a result set, iterate over it with a list comprehension and join() the results:
' '.join([x.text for x in soup.select_one('p:-soup-contains("Scientific name:") strong').next_siblings]).strip()
Output:
'"Anarta edwarsii"'
Alternative example:
Select the <p> that contains the string "Scientific name", get its stripped_strings as a list (['Scientific name:', 'Anarta edwardsii']) and pick the last element:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'Referer': 'https://www.google.com/'
}
r = requests.get('https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/edwards-beach-moth-2009.html',headers=headers)
soup = BeautifulSoup(r.text,'lxml')
list(soup.select_one('p:-soup-contains("Scientific name:")').stripped_strings)[-1]
Output:
'"Anarta edwarsii"'
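The find_next_sibling() route mentioned in the question can also be made to work; here is a minimal sketch run against the exact HTML snippet from the question, locating the <strong> header and then taking the remaining text inside the same <p>. (It uses string= rather than text=, since recent bs4 releases prefer the former; nothing else is assumed beyond the snippet itself.)

```python
import re
from bs4 import BeautifulSoup

html = """
<p>
<strong> Scientific name </strong>
<br>
"Anarta edwarsii"
</p>
"""

soup = BeautifulSoup(html, "html.parser")
# locate the bold header by its text
strong = soup.find("strong", string=re.compile("Scientific name"))
# the category value is the last non-empty string inside the same <p>
value = list(strong.parent.stripped_strings)[-1]
print(value)  # prints the species name, quote characters included
```

Because the header is matched by a regex rather than the species text, the same lookup works unchanged for every species page.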

Related

How to scrape transliterated or font rendered text from a html page

I want to scrape https://777codes.com/newtestament/gen1.html and fetch all the Hebrew sentences.
However, some letters in the words are rendered by the stylesheet and font files, so data fetched by scraping the HTML directly is incomplete.
For example, when I use Beautiful Soup and fetch the contents of the first "stl_01 stl_21" class div, I get "ייתꢀראꢁראꢁ" when I should be getting "בראשית".
I think I need to build a character map and then match and replace the missing letters. How do I convert the scraped string into something I can work with (UTF-8 encoded text or Unicode code points) so I can look up and replace the missing/substituted characters with their correct values?
Or is there a simpler way to get "בראשית" instead of "ייתꢀראꢁראꢁ" when scraping the first "stl_01 stl_21" class div?
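One rough sketch of the character-map idea: build a translation table from the odd glyphs to the Hebrew letters they stand for, apply it with str.translate(), and reverse the result if the font stores letters in visual rather than logical order. The actual pairs would have to be recovered by comparing a passage you can read with its scraped output; the glyphs and letters below are made-up placeholders, not the real mapping.

```python
# Hypothetical mapping: the codepoints on the left stand in for whatever
# substituted glyphs the font uses; the Hebrew letters on the right are
# placeholders discovered by eye, not the real correspondence.
glyph_map = str.maketrans({
    "\ua880": "\u05e9",  # assumed: glyph -> shin
    "\ua881": "\u05d1",  # assumed: glyph -> bet
})

def fix_scraped(text: str) -> str:
    """Replace substituted glyphs, then flip visual order back to logical order."""
    return text.translate(glyph_map)[::-1]

print(fix_scraped("\ua880\ua881"))
```

Whether the reversal step is needed depends on how the site's stylesheet lays the text out; drop the [::-1] if the scraped string is already in logical order.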

I am scraping a Japanese site, but instead of showing Japanese characters, the output shows numbers and symbols

My web scraping code is below.
import requests
import bs4
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.52'
}
url = 'https://www.mlit.go.jp/'
r = requests.get(url,{'headers':headers})
soup = bs4.BeautifulSoup(r.text,'html.parser')
soup.find_all('div',{'class':'clearfix'})[1].text
The output is as below:
'\n\n\n\n\n\nã\x83\x88ã\x83\x94ã\x83\x83ã\x82¯ã\x82¹ å\x9b½å\x9c\x9f交é\x80\x9aç\x9c\x81ã\x81®æ´»å\x8b\x95\n'
I want Japanese sentences instead of this kind of output.
Can anybody help me?
I tried encoding and decoding with UTF-8, CP932 and Shift-JIS; nothing worked.
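Two things look likely here. First, requests.get(url, {'headers': headers}) passes the dict as the params argument, not as headers; it should be requests.get(url, headers=headers). Second, output like ã\x83\x88… is the classic sign of UTF-8 bytes decoded as Latin-1, which requests falls back to when the server doesn't declare a charset; either set r.encoding yourself (e.g. r.encoding = r.apparent_encoding) before reading r.text, or pass the raw bytes r.content to BeautifulSoup and let it sniff the encoding. The round trip can be demonstrated offline:

```python
# UTF-8 bytes wrongly decoded as Latin-1 produce exactly this kind of mojibake.
original = "トピックス"                                   # "Topics", as on the page
raw = original.encode("utf-8")                           # the bytes on the wire
mojibake = raw.decode("latin-1")                         # what r.text yields with no charset
repaired = mojibake.encode("latin-1").decode("utf-8")    # undo the wrong decode
print(repr(mojibake))  # starts with 'ã\x83\x88', matching the output shown
print(repaired)
```

In the scraping code itself, BeautifulSoup(r.content, 'html.parser') is usually the simplest fix, since it skips the wrong text decode entirely.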

read.csv ignoring delimiters

I have a problem handling a questionnaire that I need to import into R. The problem is that the following entry contains commas.
"Mozilla/5.0 (Windows NT 6.2; Win64;" x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome
When I specify commas as delimiters, it splits the entry at "KHTML, like Gecko" and at "...36",Windows 8". But it somehow does not split at other commas, such as in "4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,".
I do not understand why these cases are handled differently, or how to resolve the issue.
Any help would be greatly appreciated.
4,21 Jul 2020 - 16:06:04 CEST,21 Jul 2020 - 16:20:55 CEST,0h 14m 51s,0h 57m 2s,1595340364,1595341255,5,5,,Terminating,1,"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 ( like Gecko) Chrome/73.0.3683.75 Safari/537.36",Windows 8,Chrome 73.0.3683.75,1920,,DKJK1995,1,1,1,4,2,2,6,6,7,1,2,5,5,4,4,2,2,3,2,2,3,4,3,3,5,5,5,2,2,4,4,4,5,6,6,6,1,2,3,3,4,3,5,6,6,2,3,4,5,4,4,4,2,3,2,3,5,5,5,6,2,2,1,2,3,5,5,4,5,2,2,3,100,0,100,0,20,72,100,0,100,0,0,0,100,0,100,0,4,5,5,4,4,4,3,3,2,4,4,5,5,6,6,4,4,4,4,4,5,2,2,3,3,5,3,3,5,5,2,2,2,2,4,3,3,4,5,2,3,2,3,5,3,2,2,2,2,2,2,6,6,6,24,1,2,90,60,90,6,Deutschland,4,1,2,2,1,,1,1,"1,6,5,8,3,2,7,4","6,5,7,3,1,8,4,2","2,1","1,2,4,3",3,56,92,70,54,25,64,29,132,22,34,27,136,15,66,32,109,16,46,36,132,13,31,20,132,10,40,20,107,8,25,16,84,12,27,19,91,7,148,329,355,233,185,29,216,72,"4,2,6,7,5,1,8,3","2,3,1,7,4,8,6,5","5,6,7,1,3,8,4,2","8,3,11,13,10,2,16,5","7,4,11,13,9,2,16,6","7,4,11,13,9,2,16,6";
The problem is that the line you showed has mismatched quotes. From the comments, it appears that the quote in Win64;" is not supposed to end the field, which is supposed to continue to Safari/537.36".
Fixing this will need three steps: read the file without interpreting fields, clean up the bad strings, then read it again as a CSV. For example, if your file is called data.csv, you could do it like this:
lines <- readLines("data.csv")
cleaned <- sub('Win64;"', 'Win64;', lines)
data <- read.csv(textConnection(cleaned))
I don't know if the line to clean it will work in general: you'll need to find out where those spurious quotes are, and work out a pattern to match and remove them.
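As for why the cases are handled differently: this is standard CSV quoting. Commas inside a quoted field are data, not delimiters, so a stray quote flips the parser's idea of where quoted fields start and end, and the commas that follow get treated the wrong way. A minimal illustration of the rule (shown here with Python's csv module; read.csv applies the same rule):

```python
import csv
import io

# Commas inside a quoted field are not treated as delimiters.
line = 'a,"x, y",b'
rows = list(csv.reader(io.StringIO(line)))
print(rows)  # [['a', 'x, y', 'b']]
```

With the quotes balanced, "x, y" stays one field; with a stray quote, everything after it is parsed in the opposite quoting state, which is exactly the mis-splitting described above.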

Difficulty with importing txt file into R

I'm currently trying to import a txt file into R, and none of read.table, read_table or read_table2 worked. With read.table I got an error that there were more columns than data. With read_table I got two columns of data, and with read_table2 I got too many columns, split in a way that made no sense, with columns and data fields split up as per the image below.
This is a sample of the headers and first lines from the txt file:
Source Generator Generator Identifier Created Timestamp Created Date Recorded Timestamp Recorded Date Visit ID URL ID Referrer ID Domain Protocol URL Title Top Domain Search Terms Transition Type Timestamp Date
27-strong-rooks-knitted-patiently Web Historian - Community Edition/1.3.6 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 web-historian 1517096769 2018-01-27T23:46:09.773127+00:00 1527424157 2018-05-27T12:29:17.479092+00:00 96844 0 american.edu https https://engage.american.edu/learn/login/index.php
This is an image of what the result was with read_table2:
Thanks!

How to get the span of a dictionary as it appears on the site?

I am trying to get all the meanings under the "noun" heading for the word the user enters.
This is my code for now:
import requests
from bs4 import BeautifulSoup
word = raw_input("Enter word: ").lower()
url = 'http://www.dictionary.com/browse/' + word
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
try:
    meaning = soup.find("div", attrs={"class": "def-content"}).get_text()
    print "Meaning of", word, "is: "
    print meaning
except AttributeError:
    print "Sorry, we were not able to find the word."
finally:
    print "Thank you for using our dictionary."
Now suppose the user enters the word "today" and my output will be:
this present day: Today is beautiful.
I don't understand why it leaves so many spaces, and why the "Today is beautiful" part doesn't come down onto the next line.
Anyway, when you look up that word on the site you can see there are two meanings, yet my program only shows one.
I want the output to be:
1.this present day:
Today is beautiful.
2.
this present time or age:
the world of today.
Can anyone explain what's wrong and how I can fix it?
I have no idea what's wrong, so please don't think I didn't try.
Your code above only gets the first noun meaning.
I have rewritten it as below:
from bs4 import BeautifulSoup
import requests

word = raw_input("Enter word: ").lower()
url = 'http://www.dictionary.com/browse/' + word
r = requests.get(url)
bsObj = BeautifulSoup(r.content, "lxml")
nouns = bsObj.find("section", {"class": "def-pbk ce-spot"})
data = nouns.findAll('div', {'class': 'def-content'})
count = 1
for item in data:
    temp = ' '.join(item.get_text().strip().split())
    print str(count) + '. ' + temp
    count += 1
Explanation:
Yes. Assuming the website shows the noun meanings first, I retrieve the first section, which contains the complete noun data. Then I find all the meanings under that section (the data variable), iterate over them in a loop, and fetch the text of each meaning. Finally, to remove the extra spaces, I split the fetched text and rejoin it with single spaces, adding a number at the beginning.
try:
    meaning = soup.find(attrs={"class": "def-pbk ce-spot"}).get_text(separator="\n", strip=True)
You can strip the whitespace from the text by passing strip=True to get_text().
The reason you didn't get all the text is that your selector was wrong: you need to widen its scope.
I added separator="\n" to get_text() to format the output.
If you have any questions, you can read the BeautifulSoup documentation.
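To see what separator= and strip= do, here is a small self-contained example on a snippet shaped like the dictionary markup (the class name is taken from the question; the text is made up):

```python
from bs4 import BeautifulSoup

html = '<div class="def-content">  this present day:  <em>Today is beautiful.</em>  </div>'
soup = BeautifulSoup(html, "html.parser")
# strip=True trims each text fragment; separator="\n" joins them on new lines
text = soup.find("div").get_text(separator="\n", strip=True)
print(text)
```

This is why the stray spaces disappear and the example sentence lands on its own line.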
