Extract specific text in tr beautifulsoup - web-scraping

I'm stuck getting information out of HTML code with BeautifulSoup. I extracted the HTML piece below with the following steps:
result = requests.get(url, headers = headers)
soup = BeautifulSoup(result.text, 'lxml')
tably = soup.find("table", id="table4")
last_row = tably.findAll('tr')[-1]
Now, I want to obtain the following output:
Classification: Mass murderer
Characteristics: Militant Al-Takfir wa al-Hijran (Renunciation and Exile) faction
Number of victims: 23
Sample HTML:
<tr>
<td style="font-size: 8pt; color: #000000" width="100%">
<font color="#000000" face="Verdana">
Classification: <b>Mass murderer</b></font></td>
</tr>
<tr>
<td width="100%" style="font-size: 8pt; color: #000000">
<font style="font-size: 8pt" color="#000000" face="Verdana">
Characteristics: <b>Militant Al-Takfir wa
al-Hijran </b>(Renunciation and Exile)<b> faction</b></font></td>
</tr>
<tr>
<td width="100%" style="font-size: 8pt; color: #000000">
<font style="font-size: 8pt" color="#000000" face="Verdana">
Number of victims: <b>23</b></font></td>
</tr>
</font>
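For reference, a minimal sketch that pulls the label/value lines out of a snippet like this one; it assumes the rows sit inside a table, and it uses get_text with a separator so the text around the <b> runs keeps its spacing:
from bs4 import BeautifulSoup

snippet = """<table>
<tr><td><font>Classification: <b>Mass murderer</b></font></td></tr>
<tr><td><font>Characteristics: <b>Militant Al-Takfir wa
al-Hijran </b>(Renunciation and Exile)<b> faction</b></font></td></tr>
<tr><td><font>Number of victims: <b>23</b></font></td></tr>
</table>"""

soup = BeautifulSoup(snippet, "lxml")
for td in soup.find_all("td"):
    # join text segments with spaces, then collapse any run of whitespace
    print(" ".join(td.get_text(" ", strip=True).split()))
This prints exactly the three lines from the expected output above.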

You might want to try this:
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.152 Safari/537.36"
}
page = requests.get("https://murderpedia.org/male.A/a/abbas.htm", headers=headers).text
table = BeautifulSoup(page, "html5lib").find("table", {"id": "table4"})
output = [
    " ".join(i.getText(strip=True).split()).split(":") for i
    in table.find_all("td") if i.getText(strip=True)
][:9]
print(tabulate(output))
Output:
----------------- --------------------------------------------------------------
Classification Mass murderer
Characteristics Militant Al-Takfir wa al-Hijran(Renunciation and Exile)faction
Number of victims 23
Date of murders December 8,2000
Date of birth 1967
Victims profile Maleworshippers
Method of murder Shooting(Kalashnikov assault rifle)
Location Omdurman, Sudan
Status Shot to death by police
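Note the missing spaces in several rows (al-Hijran(Renunciation and Exile)faction, December 8,2000): getText(strip=True) drops the whitespace between adjacent elements. Passing a separator preserves it, e.g. replacing the comprehension's expression with
" ".join(i.get_text(" ", strip=True).split()).split(":")
as in the sketch above.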

Related

I need to scrape HTML table with irregular columns

Please, can anyone help me? I am very new to Beautiful Soup for scraping web pages. I want to extract the first table on the webpage; however, the table has irregular columns, as shown in the attached image. The first 3 rows have 3 columns each, followed by a row with a single column. Please note the first three rows can change in the future to be more or fewer than 3 rows. The single-column row is followed by some rows with 4 columns, then another single-column row. Is there a way I can write the Beautiful Soup Python script to extract the first 3 rows before the td tag with a colspan of 6, and then extract the rows after that td tag?
My Python code so far (it gives me the entire table with the irregular columns, but not what I want):
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

url = " "
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = []
for child in soup.find_all('table')[1].children:
    row = []
    for td in child:
        try:
            row.append(td.text.replace('\n', ''))
        except:
            continue
    if len(row) > 0:
        rows.append(row)
pd.DataFrame(rows[1:])
Once you have the table element, I would select the tr children that do not have a td with colspan=6.
Here is an example with colspan 3:
from bs4 import BeautifulSoup as bs
html = '''<table id="foo">
<tbody>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td colspan="3">I'm not invited to the party :-(</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
<td>I</td>
</tr>
</tbody>
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('#foo tr:not(:has(td[colspan="3"]))'):
    print([child.text for child in tr.select('th, td')])
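If you also want the rows after the divider, as the question asks, one option is to split at the colspan row rather than filter it out. A sketch against the same example, reusing the html string and bs import above (for the real table, match colspan="6" instead):
soup = bs(html, 'lxml')
before, after = [], []
past_divider = False
for tr in soup.select('#foo tr'):
    if tr.find('td', attrs={'colspan': '3'}):
        past_divider = True   # skip the divider row itself
        continue
    cells = [c.get_text(strip=True) for c in tr.select('th, td')]
    (after if past_divider else before).append(cells)
print(before)  # [['A', 'B', 'C'], ['D', 'E', 'F']]
print(after)   # [['G', 'H', 'I']]
The two lists can then be fed to pd.DataFrame separately.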

python html scraping with BeautifulSoup separately

I am trying to scrape the 2,768 and the 25,000 separately from this HTML fragment:
<td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter" >
<a class="aMeasure" title="Text. href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
with this python code:
def get_posts():
    global Comp_Name
    Comp_Name = ""
    plain_text = r.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('td', {'class': 'alignCenter'}):
        title = link.string
        if title != None:
            list_of_titles.append(title)
Unfortunately, it returns the two values together. I would be happy for some help getting each number separately.
Thanks.
To get these two numbers, you could use this script:
data = ''' <td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter" >
<a class="aMeasure" title="Text. href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
numbers = [t.get_text(strip=True) for t in soup.select('.alignCenter')]
print(numbers[0])
print(numbers[2])
Prints:
2,768
25,000
Based on the HTML supplied, you might be able to use nth-of-type, though accessing the soup twice seems less efficient than just indexing into a list of both.
soup.select_one('td.alignCenter:nth-of-type(2)').text
and
soup.select_one('td.alignCenter:nth-of-type(3)').text
The nth-of-type indices came from testing with jsoup on your HTML after adding surrounding table tags. Your mileage may vary, but the principle is the same.
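If the column positions move around, a content-based filter may be sturdier than fixed indices. A sketch against the same data string from the script above: the unwanted 69 sits inside an <a>, so skipping any cell that wraps a link leaves only the plain numbers (this assumes the link is what distinguishes that cell):
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')
numbers = [td.get_text(strip=True)
           for td in soup.find_all('td', class_='alignCenter')
           if not td.find('a')]
print(numbers)  # ['2,768', '25,000', '7']
The first two entries are the 2,768 and 25,000 the question asks for.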

The website cannot display the page issue

I am stuck on a very basic problem: "The website cannot display the page".
The sfcRecInsplst.asp page is there. If I change it to szhref="http://google.com/", it redirects to Google, so why not to the sfcRecInsplst.asp page?
What should I check? How do I solve this issue?
function DoSearch() {
    var szhref;
    var szplantid;
    var szwono;
    var szskuno;
    var szcartonpn;
    var szfromdate;
    var sztodate;
    szfromdate = document.search.lfromdate.value;
    sztodate = document.search.ltodate.value;
    szplantid = document.search.lPlantid.value;
    szhref = "sfcRecInsplst.asp?pfromdate=" + szfromdate + "&ptodate=" + sztodate + "&lplantid=" + szplantid;
    win = window.open(szhref, 'CartonUsagerpt', 'toolbar=yes,top=0,left=0,width=<%=session("width")%>,height=<%=session("height")-100%>,menubar=yes,scrollbars=yes,maximize=yes,resizable=yes,status=yes,statusbar=yes');
    win.focus();
}
function DoSearchReset() {
    document.search.reset();
}
function lRecestatus_onkeypress() {
    if (window.event.keyCode == 13) {
        DoSearch();
    }
}
</script>
<table width="80%" bgcolor="#c0c0c0" border="1" rules="NONE" cellspacing="0" cellpadding="0">
<tr>
<td align=right valign=bottom>
<a href="JavaScript:DoSearch();">
<img src="/images/goe.gif" border="0" alt="Search" valign="middle"><font face="Arial, Helvetica, sans-serif" size="1">Go</font></a>
<a href="JavaScript:DoSearchReset();">
<img src="/images/resete.gif" border="0" alt="Reset" valign="middle"><font face="Arial, Helvetica, sans-serif" size="1">Reset</font></a>
You don't mention what version of IIS you are using, but in IIS 8 you can enable parent paths by doing the following:
Under Sites, click on your web site.
In the Features View pane, double-click on ASP.
Set Enable Parent Paths to True.
You may need to restart IIS for it to take effect.
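The same setting can be scripted; a sketch with appcmd, assuming IIS 7 or later and substituting your own site name for "Default Web Site":
%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site" -section:system.webServer/asp /enableParentPaths:"True"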

How to import and save data from a CSV into a web2py database table?

I used an SQLite database.
I wrote code like this:
Module:
db.py
db = DAL('sqlite://storage.sqlite')
db.define_table('data3')
db.data3.import_from_csv_file(open('mypath/test.csv'),'r')
Controller:
def website_list():
    return dict(websites=db().select(db.data4.ALL))
View:
{{extend 'layout.html'}}
<h2>List Of Websites</h2>
<table class="flakes-table" style="width:100% ;">
<thead>
<tr>
<td class="id">ID</td>
<td class="link" >Link</td>
</tr>
</thead>
{{for web in websites:}}
<tbody class="list">
<tr>
<td >{{=web.id}}</td>
<td >{{=web.Link}}</td>
</tr>{{pass}}
</tbody>
</table>
But it is showing an error:
"type 'exceptions.AttributeError'"
The error also contains this line:
Function argument list
(self=, key='data3')
I think something is wrong with how the CSV file is read. My CSV file has the following data:
"Link_Title","Link"
"Apple's Ad Blockers Rile Publishers","somelink"
"Uber Valued at More Than $50 Billion","somelink"
"England to Roll Out Tailored Billboards","somelink"
Can anyone help with this?
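Two things stand out: define_table('data3') creates a table with no fields, so the CSV import has nothing to map the columns onto, and the controller selects db.data4 while the table is named data3. A sketch of a corrected model and controller, assuming the fields should mirror the CSV header (in web2py's model file, DAL and Field are already in scope):
# db.py -- define the fields the CSV columns map onto
db = DAL('sqlite://storage.sqlite')
db.define_table('data3',
                Field('Link_Title'),
                Field('Link'))

# the file mode belongs inside open(), not as a second argument
db.data3.import_from_csv_file(open('mypath/test.csv', 'r'))
db.commit()

# controller -- select from data3, the table that was defined
def website_list():
    return dict(websites=db().select(db.data3.ALL))
Note that db.py runs on every request, so in practice you would guard the import, for example by only running it when the table is empty.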

Crawler4j missing outgoing links?

I'm trying to crawl the Apache Mailing Lists to get all the archived messages using Crawler4j. I provided a seed URL and am trying to get links to the other messages. However, it seems to not be extracting all the links.
Following is the HTML of my seed page (http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Re: some healthy broker disappear from zookeeper</title>
<link rel="stylesheet" type="text/css" href="/archives/style.css" />
</head>
<body id="archives">
<h1>kafka-users mailing list archives</h1>
<h5>
Site index · List index</h5> <table class="static" id="msgview">
<thead>
<tr>
<th class="title">Message view</th>
<th class="nav">« Date » · « Thread »</th>
</tr>
</thead>
<tfoot>
<tr>
<th class="title">Top</th>
<th class="nav">« Date » · « Thread »</th>
</tr>
</tfoot>
<tbody>
<tr class="from">
<td class="left">From</td>
<td class="right">Neha Narkhede <neha.narkh...#gmail.com></td>
</tr>
<tr class="subject">
<td class="left">Subject</td>
<td class="right">Re: some healthy broker disappear from zookeeper</td>
</tr>
<tr class="date">
<td class="left">Date</td>
<td class="right">Tue, 20 Nov 2012 19:01:56 GMT</td>
</tr>
<tr class="contents"><td colspan="2"><pre>
zookeeper server version is 3.3.3 is pretty buggy and has known
session expiration and unexpected ephemeral node deletion bugs.
Please upgrade to 3.3.4 and retry.
Thanks,
Neha
On Tue, Nov 20, 2012 at 10:42 AM, Xiaoyu Wang <xwang#rocketfuel.com> wrote:
> Hello everybody,
>
> We have run into this problem a few times in the past week. The symptom is
> some broker disappear from zookeeper. The broker appears to be healthy.
> After that, producers start producing lots of ZK producer cache stale log
> and stop making any progress.
> "logger.info("Try #" + numRetries + " ZK producer cache is stale.
> Refreshing it by reading from ZK again")"
>
> We are running kafka 0.7.1 and the zookeeper server version is 3.3.3.
>
> The missing broker will show up in zookeeper after we restart it. My
> question is
>
> 1. Did anyone encounter the same problem? how did you fix it?
> 2. Why producer is not making any progress? Can we make the producer
> work with those brokers that are listed in zookeeper.
>
>
> Thanks,
>
> -Xiaoyu
</pre></td></tr>
<tr class="mime">
<td class="left">Mime</td>
<td class="right">
<ul>
<li><a rel="nofollow" href="/mod_mbox/kafka-users/201211.mbox/raw/%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg#mail.gmail.com%3e/">Unnamed text/plain</a> (inline, None, 1037 bytes)</li>
</ul>
</td>
</tr>
<tr class="raw">
<td class="left"></td>
<td class="right">View raw message</td>
</tr>
</tbody>
</table>
</body>
</html>
These are the outgoing URLs as identified by Crawler4j.
http://mail-archives.apache.org/archives/style.css
http://mail-archives.apache.org/mod_mbox/
http://mail-archives.apache.org/mod_mbox/kafka-users
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
However, the URLs that I'm interested in are missing.
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ#mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw#mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A#mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg#mail.gmail.com%3e
What am I doing wrong? How do I get Crawler4j to extract the URLs I need?
Please tell me you have noticed there are direct links for downloading mbox files for mailing lists. In your case, just wget this, no crawler needed:
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox
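In Python terms, the same shortcut might look like this (a sketch with requests; plain wget on the URL is equivalent):
import requests

# download the whole month's archive as a single mbox file
url = "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox"
resp = requests.get(url)
resp.raise_for_status()
with open("kafka-users-201211.mbox", "wb") as f:
    f.write(resp.content)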
You are probably giving the wrong seed page.
I think your seed page should be:
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
and then use
public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    // href is lowercased above, so compare against a lowercase pattern
    // (a mixed-case "%3cCA" would never match)
    return (!FILTERS.matcher(href).matches()
            && href.contains("http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cca"));
}
I hope that helps.
