I need to scrape an HTML table with irregular columns

Can anyone help me? I am very new to Beautiful Soup for scraping web pages. I want to extract the first table on the webpage, but the table has irregular columns, as shown in the attached image. The first 3 rows have 3 columns each and are followed by a row with a single column. Note that the number of leading rows may change in the future to be more or fewer than 3. The single-column row is followed by several rows with 4 columns, then another single-column row. Is there a way to write a Beautiful Soup Python script that extracts the first 3 rows before the td tag with a colspan of 6, then extracts the rows after that td tag?
My Python code so far (it gives me the entire table with the irregular columns, which is not what I want):
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = " "
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = []
for child in soup.find_all('table')[1].children:
    row = []
    for td in child:
        try:
            row.append(td.text.replace('\n', ''))
        except AttributeError:
            # plain strings between tags have no .text attribute
            continue
    if len(row) > 0:
        rows.append(row)
pd.DataFrame(rows[1:])

Once you have the table element, I would select the tr children that do not contain a td with colspan="6".
Here is an example with colspan 3:
from bs4 import BeautifulSoup as bs
html = '''<table id="foo">
<tbody>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td colspan="3">I'm not invited to the party :-(</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
<td>I</td>
</tr>
</tbody>
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('#foo tr:not(:has(td[colspan="3"]))'):
    print([child.text for child in tr.select('th, td')])
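The selector above drops the full-width row but returns the remaining rows as one flat sequence. To get the separate blocks the question asks for (the rows before the full-width cell, then the rows after it), one option is to walk the rows in order and start a new group at each full-width row. This is only a minimal sketch under assumptions: the helper name `split_on_spanning_rows` and the sample table are hypothetical, and the colspan value (3 here, 6 on the asker's page) has to match the real markup.

```python
from bs4 import BeautifulSoup


def split_on_spanning_rows(table, colspan):
    """Group consecutive rows, starting a new group at each full-width row.

    Hypothetical helper: `table` is a bs4 Tag for a <table>, and `colspan`
    is the colspan value (as an int) that marks a separator row.
    """
    groups, current = [], []
    for tr in table.find_all("tr"):
        if tr.find("td", attrs={"colspan": str(colspan)}) is not None:
            # Separator row: close the current group and skip the row itself.
            if current:
                groups.append(current)
            current = []
            continue
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            current.append(cells)
    if current:
        groups.append(current)
    return groups


# Sample markup, assumed for illustration only.
html = """<table id="foo">
<tr><th>A</th><th>B</th><th>C</th></tr>
<tr><td>D</td><td>E</td><td>F</td></tr>
<tr><td colspan="3">separator</td></tr>
<tr><td>G</td><td>H</td><td>I</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
groups = split_on_spanning_rows(soup.find("table", id="foo"), colspan=3)
print(groups)
```

Each group is a plain list of rows, so it can be passed to pd.DataFrame on its own, and the number of rows before the separator does not need to be known in advance.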

Related

Use rowspan on an empty cell of a previous row

I want to fuse a cell with the cell of the previous row in a shopping cart.
Here is what I want to do: [image: Table rowspan]
No problem for the first and last columns: I just use rowspan on cells 1 and 4, then hide cells 5 and 8. But how can I do the same for cells 3 and 7? I know I could use JavaScript to append the content of cell 7 into cell 3 and then use rowspan on cell 3, but I'm wondering whether it is possible to do this with CSS only.
My code currently looks like this:
<table>
<tr>
<td rowspan="2">Image</td>
<td>Title</td>
<td></td>
<td rowspan="2">Price</td>
</tr>
<tr>
<td>Description</td>
<td>Delete</td>
</tr>
</table>
I want the cell "Delete" to fill both rows, and I can't swap its content with the empty cell on the above TR.

How to get a tag value that is wrapped in a table?

<td> <label for="cp_designation">Designation : </label></td>
<td> PARTNER</td>
</tr>
<tr>
<td><label for="cp_category">Category : </label></td>
<td>SPORTS GEARS</td>
</tr>
<tr>
<td> <label for="cp_address">Address : </label></td>
<td> A-148, WARD NO.4, PAINTER STREETSIALKOT-CANTT.</td>
</tr>
<tr>
<td> <label for="cp_phone">Phone : </label></td>
<td> 4603886,</td>
</tr>
soup = bs(page.content, "html.parser")
for i in soup:
    label = soup.find_all('label',text='Designation : ')
    print(label.find('tr'))
Hi all, I want to extract the value that belongs to each label in the table. I have tried many things but failed to get the value. If anyone has expertise with this, help would be highly appreciated. Thanks in advance.

You can iterate over the tr tags with find_all, take the label in each row as the key, and use find_next to get the td that follows the label as the value, building key-value pairs:
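from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the page markup
dict1 = {}
for i in soup.find_all("tr"):
    label = i.find("label")
    if label is None:
        continue  # skip rows without a label
    # Key: label text; value: text of the next td after the label
    dict1[label.get_text(strip=True)] = label.find_next("td").get_text(strip=True)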
Output:
{'Designation :': 'PARTNER',
'Category :': 'SPORTS GEARS',
'Address :': 'A-148, WARD NO.4, PAINTER STREETSIALKOT-CANTT.',
'Phone :': '4603886,'}

What we do here is take a list of the headers and a list of the table rows, zip the headers with the text of each row's td cells, convert the result to a dictionary, and add it to a list.
This isn't the most robust way of scraping: you can hit issues where data is missing or sits in the wrong position. However, you can adapt the code below to be more robust.
soup = BeautifulSoup(html, "html.parser")
all_data = []
table = soup.find('table')
headers = [i.text for i in table.find_all('th')]
rows = table.find_all('tr')
for row in rows:
    table_data_text = [i.text for i in row.find_all('td')]
    if not table_data_text:
        continue  # header row has no td cells
    output_dict = dict(zip(headers, table_data_text))
    all_data.append(output_dict)
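One way to handle the missing-data case mentioned above is to pad short rows so that every dictionary carries every header. This is a rough sketch, not a definitive fix: the sample table is made up for illustration, and using None for a missing cell is an assumption about what is convenient downstream.

```python
from bs4 import BeautifulSoup

# Made-up table: the second data row is missing its Phone cell.
html = """<table>
<tr><th>Name</th><th>Phone</th></tr>
<tr><td>Ann</td><td>123</td></tr>
<tr><td>Bob</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

all_data = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        continue  # header row (or an empty row): nothing to record
    # Pad short rows with None so every dict has every header as a key.
    padded = cells + [None] * (len(headers) - len(cells))
    all_data.append(dict(zip(headers, padded)))

print(all_data)
```

Padding only helps when cells are missing from the end of a row. If a cell is missing in the middle, the remaining values still shift left, so the markup needs some other cue (a class or a label) to place them.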

Python HTML scraping with BeautifulSoup: getting values separately

I am trying to scrape the values 2,768 and 25,000 separately from this HTML fragment:
<td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter" >
<a class="aMeasure" title="Text. href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
with this python code:
def get_posts():
    global Comp_Name
    Comp_Name=""
    plain_text = r.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('td',{'class': 'alignCenter'}):
        title = link.string
        if title is not None:
            list_of_titles.append(title)
Unfortunately, it returns the two values together.
I would appreciate help getting each number separately.
Thanks.

To get these two numbers, you could use this script:
data = ''' <td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter" >
<a class="aMeasure" title="Text. href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
numbers = [t.get_text(strip=True) for t in soup.select('.alignCenter')]
print(numbers[0])
print(numbers[2])
Prints:
2,768
25,000

Based on the HTML supplied, you might be able to use nth-of-type, though running two separate selections seems less efficient than indexing into a single list of matches.
soup.select_one('td.alignCenter:nth-of-type(2)').text
and
soup.select_one('td.alignCenter:nth-of-type(3)').text
The nth-of-type indices came from testing with jsoup on your HTML after adding surrounding table tags. Your mileage may vary, but the principle is the same.

How to create zebra-stripe CSS with TAL?

How can I use Chameleon or Zope Page Templates to easily create CSS zebra striping? I want to add odd and even classes to each row in a table, but using a condition with repeat/name/odd or repeat/name/even looks rather verbose even with a conditional expression:
<table>
<tr tal:repeat="row rows"
tal:attributes="class python:repeat['row'].odd and 'odd' or 'even'">
<td tal:repeat="col row" tal:content="col">column text text</td>
</tr>
</table>
This gets especially tedious if you have multiple classes to calculate.

The Zope Page Templates implementation of the repeat variable has an under-documented extra parameter, parity, that gives you the string 'odd' or 'even', alternating between iterations:
<table>
<tr tal:repeat="row rows"
tal:attributes="class repeat/row/parity">
<td tal:repeat="col row" tal:content="col">column text text</td>
</tr>
</table>
This is also much easier to interpolate into a string expression:
tal:attributes="class string:striped ${row/class} ${repeat/row/parity}"
This works in Chameleon as well.

Select table rows excluding first row

How can I select table rows excluding the first row? The number of table rows can vary.
Here is an example:
<table id="grdVerzekeringen" >
<tr>
<th>First name</th><th>Last name</th>
</tr>
<tr>
<td>Pera</td><td>Peric</td>
</tr>
<tr>
<td>Mika</td><td>Mikic</td>
</tr>
<tr>
<td>Zika</td><td>Zikic</td>
</tr>
</table>
In this example I want to select the table rows that contain actual data, not the header row. I could use CSS selectors or XPath.

If the header row uses th, you are lucky. Just use the following XPath expression:
table[@id="grdVerzekeringen"]/tr[td]
If the header uses td as well, you can use the position() function:
table[@id="..."]/tr[position()>1]
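As a quick way to try the first expression, here is a sketch using Python's standard-library xml.etree.ElementTree. Some caveats: ElementTree supports only a subset of XPath (the tr[td] predicate works, position()>1 does not), and it parses the sample only because that markup happens to be well-formed XML. Real-world HTML would normally need an HTML parser with a full XPath engine, such as lxml.

```python
import xml.etree.ElementTree as ET

# The sample table from the question, condensed onto fewer lines.
markup = """<table id="grdVerzekeringen">
<tr><th>First name</th><th>Last name</th></tr>
<tr><td>Pera</td><td>Peric</td></tr>
<tr><td>Mika</td><td>Mikic</td></tr>
<tr><td>Zika</td><td>Zikic</td></tr>
</table>"""

table = ET.fromstring(markup)

# tr[td]: rows with at least one <td> child, so the <th> header row is skipped.
data_rows = table.findall("tr[td]")
by_predicate = [[td.text for td in tr.findall("td")] for tr in data_rows]

# ElementTree has no position()>1, so skip the first row by slicing instead.
by_position = [[cell.text for cell in tr] for tr in table.findall("tr")[1:]]

print(by_predicate)
```

Both approaches return the same three data rows for this table. They differ only when the header row also uses td, in which case the predicate version would keep the header and the slicing version would still drop it.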
