I am trying to scrape the 2,768 and 25,000 values separately from this HTML fragment:
<td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter" >
<a class="aMeasure" title="Text. href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
with this Python code:
# r (the requests response) and list_of_titles are defined earlier in the script
def get_posts():
    global Comp_Name
    Comp_Name = ""
    plain_text = r.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('td', {'class': 'alignCenter'}):
        title = link.string
        if title is not None:
            list_of_titles.append(title)
Unfortunately, it returns the two values together. I would appreciate help so that each number comes out separately. Thanks.
To get these two numbers, you could use this script:
data = ''' <td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter" >
<a class="aMeasure" title="Text. href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
numbers = [t.get_text(strip=True) for t in soup.select('.alignCenter')]
print(numbers[0])
print(numbers[2])
Prints:
2,768
25,000
Based on the HTML supplied you might be able to use nth-of-type, though accessing the document twice seems less efficient than simply indexing into the list of both values.
soup.select_one('td.alignCenter:nth-of-type(2)').text
and
soup.select_one('td.alignCenter:nth-of-type(3)').text
The nth-of-type indices came from testing your HTML with jsoup after adding surrounding table tags. Your mileage may vary, but the principle is the same.
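For reference, here is a minimal, self-contained sketch of the nth-of-type approach, assuming the fragment is wrapped in table and tr tags. With this well-formed markup and html.parser the matching positions come out as 2 and 4 rather than 2 and 3, which is exactly the "mileage may vary" caveat above:
from bs4 import BeautifulSoup

html = '''<table><tr>
<td class="ColCompany">Company</td>
<td class="alignCenter">2,768</td><td class="alignCenter">
<a class="aMeasure" title="Text" href="/Reports/Index#Measure"> 69 </a></td>
<td class="alignCenter">25,000</td>
<td class="alignCenter">7</td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
# nth-of-type counts all td siblings, not just the .alignCenter ones
print(soup.select_one('td.alignCenter:nth-of-type(2)').get_text(strip=True))  # 2,768
print(soup.select_one('td.alignCenter:nth-of-type(4)').get_text(strip=True))  # 25,000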
Related
<td> <label for="cp_designation">Designation : </label></td>
<td> PARTNER</td>
</tr>
<tr>
<td><label for="cp_category">Category : </label></td>
<td>SPORTS GEARS</td>
</tr>
<tr>
<td> <label for="cp_address">Address : </label></td>
<td> A-148, WARD NO.4, PAINTER STREETSIALKOT-CANTT.</td>
</tr>
<tr>
<td> <label for="cp_phone">Phone : </label></td>
<td> 4603886,</td>
</tr>
soup = bs(page.content, "html.parser")
for i in soup:
    label = soup.find_all('label', text='Designation : ')
    print(label.find('tr'))
Hi y'all, my question is that I want to extract the label values that are in the td tags. I have tried so many things but failed to get the values. If you have any expertise here, it would be highly appreciated. Thanks in advance.
Here you can use find_all to get each main tr tag, take its label tag, and use find_next to get the following td, so the labels and their values come out as key-value pairs:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the markup shown above
dict1 = {}
for i in soup.find_all("tr"):
    label = i.find("label")
    dict1[label.get_text(strip=True)] = label.find_next("td").get_text(strip=True)
print(dict1)
Output:
{'Designation :': 'PARTNER',
'Category :': 'SPORTS GEARS',
'Address :': 'A-148, WARD NO.4, PAINTER STREETSIALKOT-CANTT.',
'Phone :': '4603886,'}
What we do here is take a list of the headers and a list of the table rows, zip the headers to the data stored in the td tags (as text), convert that to a dictionary, and add it to a list.
This isn't the most robust way of scraping, as you can hit issues where data is missing or sits in the wrong location; however, you can adapt the code below to be more defensive (see the sketch after it).
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html holds the page markup
all_data = []
table = soup.find('table')
headers = [i.text for i in table.find_all('th')]
rows = table.find_all('tr')
for row in rows:
    table_data_text = [i.text for i in row.find_all('td')]
    output_dict = dict(zip(headers, table_data_text))
    all_data.append(output_dict)
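Following that note about robustness, here is a minimal sketch (assuming the same soup object as above) that skips any row whose cell count does not match the header count, so header, spacer, or partial rows cannot shift values into the wrong columns:
all_data = []
table = soup.find('table')
headers = [th.get_text(strip=True) for th in table.find_all('th')]
for row in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) != len(headers):
        continue  # header row, spacer row, or otherwise incomplete row
    all_data.append(dict(zip(headers, cells)))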
I'm trying to use RSelenium + seleniumPipes to access this zip file by name: "PUB004_PRE_20220316.zip"
<tr id="l1_VkFMX1BSRV9SUl8yMDIyMDMxMi56aXA" class="elfinder-cwd-file elfinder-ro ui-selectee ui-draggable-handle" title="PUB004_PRE_20220316.zip
Hoy 02:20 PM (4.22 MB)">
<td class="elfinder-col-name">
<div class="elfinder-cwd-file-wrapper">
<span class="elfinder-cwd-icon elfinder-cwd-icon-application elfinder-cwd-icon-zip">.</span>
<span class="elfinder-perms"></span>
<span class="elfinder-lock"></span>
<span class="elfinder-cwd-filename">PUB004_PRE_20220316.zip</span>
</div>
</td>
<td class="elfinder-col-perm">lectura</td>
<td class="elfinder-col-date">Hoy 02:20 PM</td>
<td class="elfinder-col-size">4.22 MB</td>
<td class="elfinder-col-kind">Archivo ZIP</td>
</tr>
But I can't seem to get the XPath right.
Some of my attempts:
select_file <- robot$findElement(
  "xpath", "//tr[.//td[.//div[@class='elfinder-cwd-file-wrapper']]]//span[@class='elfinder-cwd-filename']//*[text()='PUB004-PRE-20220316.zip']")
select_file$clickElement()

select_file <- robot$findElement(
  "xpath", "//*[@class='elfinder-cwd-file-wrapper']//*[@class='elfinder-cwd-filename']//*[text()='PUB004-PRE-20220316.zip']")
select_file$clickElement()

select_file <- robot$findElement(
  "xpath", "//*[@class='elfinder-cwd-filename']//*[text()='PUB004-PRE-20220316.zip']")
select_file$clickElement()
This is the webpage. I want to download the zip files.
Note: I need to do it by name because I'm interested in downloading the file programmatically by date (20220316).
It seems you were close enough; instead of the _ character it should have been the - character, i.e. PUB004-PRE-20220316.zip
Solution
To identify the element you can use either of the following locator strategies:
Using xpath and the innerText:
select_file <- robot$findElement("xpath", "//span[text()='PUB004-PRE-20220316.zip']")
Using xpath with class and the innerText:
select_file <- robot$findElement("xpath", "//span[#class='elfinder-cwd-filename' and text()='PUB004-PRE-20220316.zip']")
If I understand you correctly, you want to click on the tr parent element that contains the span element with the desired file name in its text. Right?
But the tr element itself contains that string in its title.
So why not simply use this XPath?
"//tr[contains(#title,'20220316')]"
Please can anyone help me? I am very new to Beautiful Soup for scraping web pages. I want to extract the first table on the webpage; however, the table has irregular columns, as shown in the attached image. The first 3 rows have 3 columns each and are followed by a row with a single column. Please note that the first block can change in the future to have more or fewer than 3 rows. The single-column row is followed by some rows with 4 columns, then another single-column row. Is there a way to write the Beautiful Soup Python script so that it extracts the first 3 rows before the td tag with a colspan of 6, and then extracts the rows after that td tag?
My Python code so far (this gives me the entire table with the irregular columns, but not what I want):
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

url = " "
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

rows = []
for child in soup.find_all('table')[1].children:
    row = []
    for td in child:
        try:
            row.append(td.text.replace('\n', ''))
        except:
            continue
    if len(row) > 0:
        rows.append(row)

pd.DataFrame(rows[1:])
Once you have the table element, I would select the tr children that do not have a td with colspan="6".
Here is an example with colspan="3":
from bs4 import BeautifulSoup as bs
html = '''<table id="foo">
<tbody>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td colspan="3">I'm not invited to the party :-(</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
<td>I</td>
</tr>
</tbody>
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('#foo tr:not(:has(td[colspan="3"]))'):
    print([child.text for child in tr.select('th, td')])
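And a sketch adapting the same idea to the table in the question, assuming (as in the question's code) that soup.find_all('table')[1] is the table of interest and that the spanning rows use colspan="6":
# Sketch only: `soup` and `pd` here refer to the objects built in the question's code.
table = soup.find_all('table')[1]
rows = []
for tr in table.select('tr:not(:has(td[colspan="6"]))'):
    rows.append([cell.get_text(strip=True) for cell in tr.select('th, td')])
pd.DataFrame(rows)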
I have a website to test and there is this piece of HTML code in it:
<table id="tableid">
<tbody>
<tr class="first">
<td>Hello World</td>
</tr>
<tr class="second">
<td>Bye World</td>
</tr>
</tbody>
</table>
So I want to create a list of the tr tags and iterate over them with the following code:
List<WebElement> list = driver.findElements(By.xpath("//table[@id='tableid']/tbody/tr"));
for (WebElement l : list) {
    System.out.println(l.getAttribute("class"));
    System.out.println(l.getLocation());
    System.out.println(l.hashCode());
    System.out.println(l.findElement(By.xpath("//td")).getText());
}
The output of these four System.out.println calls is:
first
(32, 300)
1573
Hello World
second
(64, 600)
1574
Hello World
So the locations are different, and even the class attributes are different, but the getText method returns only the text from the first element. Why? Am I missing something or doing something wrong? I can't figure it out.
EDIT/UPDATE:
This seems kind of odd. The above code does not work, but the following code works fine. Any explanations?
List<WebElement> list = driver.findElements(By.xpath("//table[@id='tableid']/tbody/tr/td"));
System.out.println(list.get(0).getText());
System.out.println(list.get(1).getText());
Output:
Hello World
Bye World
Your XPath is wrong: //td means "any td element anywhere in the document", so it always resolves to the first td on the page rather than the td inside the current row. Try l.findElement(By.xpath("td")).getText() (or ".//td") instead, so the search is relative to each row element. I think you'll get the result you want.
Is it possible to use regex to remove HTML tags inside a particular block of HTML?
E.g.
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
<p>My First HTML Table</p>
</td>
</tr>
</table>
I don't want to remove all p tags, only those within the table element.
The ability to either remove or retain the text inside the nested p tag would be ideal.
Thanks.
There are a lot of warnings against using regex to parse HTML, so you could use the Html Agility Pack for this:
var html = @"
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
<p>My First HTML Table</p>
</td>
</tr>
</table>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

var nodes = document.DocumentNode.SelectNodes("//table//p");
foreach (HtmlNode node in nodes)
{
    node.ParentNode.ReplaceChild(
        HtmlNode.CreateNode(node.InnerHtml),
        node
    );
}

string result = null;
using (StringWriter writer = new StringWriter())
{
    document.Save(writer);
    result = writer.ToString();
}
So after all these manipulations, you'll get the following result:
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
My First HTML Table
</td>
</tr>
</table></body>
I have found this link where it seems the exact question was asked:
"I have an HTML document in .txt format containing multiple tables and other text, and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:"
Regex to delete HTML within <table> tags
<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>
The round brackets denote a numbered capture group which will contain your text.
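As a quick illustration of how that capture group can be used to either keep or drop the text (a sketch in Python against a hypothetical fragment; the pattern itself is the point):
import re

# Hypothetical fragment: a <p> wrapped inside a table cell
snippet = "<td>\n    <p>My First HTML Table</p>\n</td>"
pattern = re.compile(r"<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>")

print(pattern.sub(r"<td>\1</td>", snippet))  # keep the inner text: <td>My First HTML Table</td>
print(pattern.sub("<td></td>", snippet))     # drop the text as well: <td></td>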
However, using regular expressions in this way relies on a lot of assumptions regarding the content of the <p> tag and the construction of the HTML.
Have a read of the ubiquitous SO question regarding using regular expressions to parse (X)HTML and see @Bruno's answer for a more robust solution.
Possible to some extent, but not reliable!
I would rather suggest you look at an HTML parser such as the Html Agility Pack.