How to scrape when parents are similar but not the same

How would you scrape the titles and links of this website if the parents are not named the same?
For example, as you can see from the screenshot, the first title and link are inside div class="slot type-post type-order-1". The second title and link are inside div class="slot type-post type-order-2", and so on.
The site is https://thechive.com/
Without a better approach, I'd end up with very long, repetitive code like this, which doesn't seem sensible:
content1 = soup.find_all('div', class_='slot type-post type-order-1')
content2 = soup.find_all('div', class_='slot type-post type-order-2')
for contents in content1:
    title1 = contents.find('h3', class_='post-title entry-title card-title').text
    link1 = contents.h3.a['href']
    print(title1)
    print(link1)
for content in content2:
    title2 = content.find('h3', class_='post-title entry-title card-title').text
    link2 = content.h3.a['href']
    print(title2)
    print(link2)

You can use a CSS selector with the select method.
soup.select('div[class*="slot type-post type-order-"]')
The *= operator means "contains": it matches when the attribute value contains the given substring.
Ref:
CSS selector cheat sheet
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://thechive.com/')
soup = BeautifulSoup(r.text, 'html.parser')
for content in soup.select('div[class*="slot type-post type-order-"]'):
    title = content.find('h3', class_='post-title entry-title card-title').text
    link = content.h3.a['href']
    print(title)
    print(link)
Output:
GAPs can help keep you warm through this winter freeze (45 Photos)
https://thechive.com/2021/02/15/gaps-can-help-keep-you-warm-through-this-winter-freeze/
Texans REALLY do not know how to handle a little snow (20 Photos)
https://thechive.com/2021/02/15/texans-really-do-not-know-how-to-handle-a-little-snow-20-photos/
...
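
As a self-contained illustration of the selector above, the following sketch runs it against a small inline HTML snippet instead of the live site. The markup and the example.com links are made up for the example and only mimic the class names from the question. The helper titles_and_links is hypothetical, and the div.slot.type-post variant is an alternative that does not come from the answer.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class pattern described in the question.
html = """
<div class="slot type-post type-order-1">
  <h3 class="post-title entry-title card-title"><a href="https://example.com/a">First post</a></h3>
</div>
<div class="slot type-post type-order-2">
  <h3 class="post-title entry-title card-title"><a href="https://example.com/b">Second post</a></h3>
</div>
<div class="slot type-ad type-order-3">
  <h3><a href="https://example.com/ad">Advert</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def titles_and_links(selector):
    """Return (title, link) pairs for every element matched by the CSS selector."""
    results = []
    for content in soup.select(selector):
        anchor = content.select_one("h3 a")
        results.append((anchor.get_text(strip=True), anchor["href"]))
    return results


# Substring match on the raw class attribute, as in the answer.
by_substring = titles_and_links('div[class*="slot type-post type-order-"]')

# Alternative (an assumption, not from the answer): match the two shared
# classes and ignore the numbered one.
by_classes = titles_and_links("div.slot.type-post")

for title, link in by_substring:
    print(title, link)
```

The substring form depends on the class names appearing in exactly that order and spacing inside the attribute, while the class form matches regardless of order. Which one fits better depends on how stable the site's markup is.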

Related

Insert folium map html code inside a bokeh app

I posted this as a follow-up question on "Include folium in bokeh tabs", and now as a new question as well.
I'm trying to render the raw HTML code from my folium map, but it's not working. Any ideas? :)
div = Div(
text=map.get_root().render(),
width=x,
height=y
)
I would much rather render my folium map directly into a bokeh Div object instead of running a Flask app on the side. I have looked into using an iframe, but there seems to be something off with my code here as well:
div.text = """<iframe srcdoc= """ + map.get_root().render() + """ height=""" + y + """ width=""" + x +"""></iframe>"""
I managed to use a Flask app on the side for the folium map and then use the url as src to my iframe, but then I was having trouble updating the content of that map from my bokeh tool.
Feel free to comment on anything of the above, cheers! :)
Update - Testscript:
from bokeh.models.widgets import Div
from bokeh.layouts import row
from bokeh.plotting import curdoc
import folium
def run():
    folium_map = folium.Map(location=(60., 20.))
    div = Div(
        text=folium_map._repr_html_(),
        width=500,
        height=500,
    )
    return row(div)
bokeh_layout = run()
doc = curdoc()
doc.add_root(bokeh_layout)

map.get_root().render() returns the entire HTML page. If you just want an iframe, you can use the folium map's _repr_html_() method:
div = Div(
text=map._repr_html_(),
width=x,
height=y
)
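
The iframe attempt in the question concatenates the rendered page into srcdoc without quoting or escaping it, so the first quote or angle bracket in the page ends the attribute early. The concatenation also fails if x and y are integers. The following is a minimal sketch of building such a string safely with the standard library only. The helper name iframe_from_html is hypothetical, the page string is a stand-in for the output of map.get_root().render(), and whether a bokeh Div displays the result as intended has not been verified here.

```python
import html


def iframe_from_html(page_html, width, height):
    """Wrap a full HTML page in an <iframe> via a quoted, escaped srcdoc attribute."""
    # Escape &, <, > and both quote characters so the page cannot break
    # out of the attribute value.
    escaped = html.escape(page_html, quote=True)
    return '<iframe srcdoc="{}" width="{}" height="{}"></iframe>'.format(
        escaped, int(width), int(height)
    )


# Stand-in for the string returned by map.get_root().render().
page = '<!DOCTYPE html><html><body><div id="map" class="folium-map"></div></body></html>'
snippet = iframe_from_html(page, 500, 500)
print(snippet)
```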

How to extract data from a list of Urls for web scraping

I am new to web scraping and I want to extract the coordinates from a <div> tag on a page accessed through a URL. I have a list of URLs from which I want to extract the coordinates and save them in a CSV file.
<div class="single-view-data-row">
<div class="single-view-data-title">Coordinates</div>
<div class="single-view-data-get">
17.009164 N, -90.309259 E<br/>»» <a href="http://geographiclib.sourceforge.net/cgi-bin/GeoConvert?input=17.009164+-90.309259">UTM / MGRS</a></div></div></div>
Thanks for the Help!!!

To extract link and coordinates from this HTML text, you can use this script:
from bs4 import BeautifulSoup
txt = ''' <div class="single-view-data-row">
<div class="single-view-data-title">Coordinates</div>
<div class="single-view-data-get">
17.009164 N, -90.309259 E<br/>»» <a href="http://geographiclib.sourceforge.net/cgi-bin/GeoConvert?input=17.009164+-90.309259">UTM / MGRS</a></div></div></div>
'''
soup = BeautifulSoup(txt, 'html.parser')
link = soup.select_one('.single-view-data-get a')['href']
coords = soup.select_one('.single-view-data-get').find_next(text=True).split(',')
print(link)
print(coords[0].strip())
print(coords[1].strip())
Prints:
http://geographiclib.sourceforge.net/cgi-bin/GeoConvert?input=17.009164+-90.309259
17.009164 N
-90.309259 E
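
The question also asks about a list of URLs and a CSV file, which the script above does not cover. The following is a rough sketch of that part, assuming every page uses the same single-view-data-get markup. The function names, the example.com URLs, and the column names are made up for illustration. To keep the example runnable offline, the HTML is passed in as strings; in practice each string would come from something like requests.get(url).text.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_coordinates(page_html):
    """Return (latitude, longitude) strings from the coordinates block, or None if absent."""
    cell = BeautifulSoup(page_html, "html.parser").select_one(".single-view-data-get")
    if cell is None:
        return None
    first_text = next(cell.stripped_strings, None)  # the text before the <br/>
    if first_text is None or "," not in first_text:
        return None
    lat, lon = (part.strip() for part in first_text.split(",", 1))
    return lat, lon


def write_csv(pages, out_file):
    """Write one row per (url, html) pair; pages without coordinates are skipped."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "latitude", "longitude"])
    for url, page_html in pages:
        coords = extract_coordinates(page_html)
        if coords is not None:
            writer.writerow([url, *coords])


# Hypothetical input standing in for downloaded pages.
sample = '''<div class="single-view-data-row">
<div class="single-view-data-title">Coordinates</div>
<div class="single-view-data-get">
17.009164 N, -90.309259 E<br/>UTM / MGRS</div></div>'''
pages = [
    ("https://example.com/site/1", sample),
    ("https://example.com/site/2", "<p>no data</p>"),
]

buffer = io.StringIO()
write_csv(pages, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass open("coordinates.csv", "w", newline="") as out_file.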

Add a footer on Migradoc last page

I need to add a footer on the MigraDoc.
The following code adds footer to all the pages.
The page has a header which needs to appear on each page.
Document document = new Document();
PdfDocumentRenderer pdfRenderer = new PdfDocumentRenderer(false);
Section HeaderSection = document.AddSection();
HeaderSection.PageSetup.DifferentFirstPageHeaderFooter = false;
MigraDoc.DocumentObjectModel.Shapes.Image image = HeaderSection.Headers.Primary.AddImage("../images/logo.jpg");
image.Height = new Unit(65);
image.Width = new Unit(150);
image.LockAspectRatio = false;
image.RelativeVertical = RelativeVertical.Line;
image.RelativeHorizontal = RelativeHorizontal.Margin;
Paragraph ParaHead1 = HeaderSection.AddParagraph();
ParaHead1.AddFormattedText("..dfg");
Table table = HeaderSection.Footers.Primary.AddTable();
table.Borders.Width = 0;
Column column = table.AddColumn();
column.Width =Unit.FromPoint(300);
column.Format.Alignment = ParagraphAlignment.Left;
Column column1 = table.AddColumn();
column1.Width = Unit.FromPoint(200);
column1.Format.Alignment = ParagraphAlignment.Left;
Row row = table.AddRow();
Cell cell = row.Cells[0];
cell.AddParagraph("Regards,");
cell = row.Cells[1];
Paragraph para1 = cell.AddParagraph();
para1.AddFormattedText("Support Team");
I need the footer table to appear only on the last page.
I don't want to add the table as the last paragraph of the body instead of a footer, because then it would appear directly after the text.
The content on the page is dynamic.

You cannot use the MigraDoc footers for a footer on the last page only.
To achieve this effect, you have to add the text to the main body - or draw the footer later using PDFsharp.
You can use a TextFrame to have the footer at a fixed location, but you must take care that the TextFrame will not overlap with other main body content.
To answer the question from the comment:
To have the "footer" directly below the content, just add it to the main body in any form you like (table, paragraph, ...)
To have the footer at an absolute position (e.g. using a TextFrame): I recommend adding an empty dummy paragraph to the main body text (if needed) to make sure the footer does not overlap with the main body; the height of the dummy paragraph will be the height of the footer that overlaps with the main body area of the document

The approach I used was to add a flag to PageSetup within a Section.
The flag tells the engine to replace last page header and footer with the ones specified by the LastPageHeader and LastPageFooter keywords.
This is an example of a section supporting a last-page header and footer (it uses a special MigraDoc/XML syntax, but it is supported with the original MDDL as well):
<Section>
<Attributes>
<PageSetup PageHeight="29.7cm" PageWidth="21cm" Orientation="Portrait" DifferentLastPageHeaderFooter="true"/>
</Attributes>
<LastPageHeader>
....
</LastPageHeader>
<LastPageFooter>
....
</LastPageFooter>
</Section>
A fork supporting this functionality is available here: https://github.com/emazv72/MigraDoc
Note that LastPageHeader and LastPageFooter only work with PDF, not with RTF.

How to send an Embedded Image along with text in a Message via Telegram Bot API

Using Telegram Bot API,
I'm aware that it is possible to send an image via https://core.telegram.org/bots/api#sendphoto
However, how can I embed a remote image into a formatted message?
The message I am looking to send, can be compared to a news article with a title in bold, an image, and a longer text with links. I figured out how to create bold text and links with markdown, but I'm failing at inserting images. How can we do that?

You must set the parse mode to HTML and put your image URL in an A tag like this:
<a href="IMAGE_URL">&#8205;</a>
&#8205; (a zero-width joiner) is never shown in the message.

You can use zero-width space trick. Works for both Markdown and HTML parse mode.
Markdown:
$data = [
'chat_id' => $chat_id,
'parse_mode' => 'markdown',
'text' => "[](https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Stack_Overflow_logo.svg/200px-Stack_Overflow_logo.svg.png) Some text here.",
];
Note: The zero-width space is in-between the brackets "[]".

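import requests
text = "testing"
img = "http://imageurl.png"
# "\u200b" is the zero-width space that serves as the invisible link text
params = {"chat_id": "@your_channel_here", "parse_mode": "markdown", "text": "[\u200b](" + img + ")" + text}
r = requests.get("https://api.telegram.org/botyour_token_here/sendMessage", params=params)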

The method using <a href=http://.......jpg>..</a> shows a preview of the image below the text.
It will look better if you send an image with a caption.

You should just add a caption:
bot.send_video(user_id, video, caption='some interesting text')
In this case the caption is the text.

Using sendPhoto rather than sendMessage is a cleaner way of achieving this, depending on your use case, for example:
import io
import json
import requests
telegram_bot_token = 'INSERT_TOKEN_HERE'
chat_id = '#INSERT_CHAT_ID_HERE'
bot_url = 'https://api.telegram.org/bot' + telegram_bot_token + '/sendPhoto'
img_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Stack_Overflow_logo.svg/200px-Stack_Overflow_logo.svg.png'
msg_txt = '<b>Stack Overflow Logo</b>'
msg_txt += '\n\nStack Overflow solves all our problems'
payload = {
    'chat_id': chat_id,
    'caption': msg_txt,
    'parse_mode': 'html'
}
remote_image = requests.get(img_url)
photo = io.BytesIO(remote_image.content)
photo.name = 'img.png'
files = {'photo': photo}
req = requests.post(url=bot_url, data=payload, files=files)
response = req.json()
print(response)
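
The Bot API also accepts an HTTP URL as the photo parameter, in which case Telegram fetches the image itself and the download step above is not needed. The following is a sketch of that variant, not part of the original answer. The helper name build_send_photo_request is hypothetical and the token and chat ID are placeholders. It only builds the request so that it can be inspected without network access; the commented line shows how it would be sent with requests.

```python
def build_send_photo_request(token, chat_id, img_url, caption):
    """Return the (endpoint, payload) pair for a sendPhoto call that passes the image by URL."""
    endpoint = "https://api.telegram.org/bot{}/sendPhoto".format(token)
    payload = {
        "chat_id": chat_id,
        "photo": img_url,  # Telegram downloads the image from this URL itself
        "caption": caption,
        "parse_mode": "HTML",
    }
    return endpoint, payload


endpoint, payload = build_send_photo_request(
    "INSERT_TOKEN_HERE",
    "INSERT_CHAT_ID_HERE",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Stack_Overflow_logo.svg/200px-Stack_Overflow_logo.svg.png",
    "<b>Stack Overflow Logo</b>\n\nStack Overflow solves all our problems",
)
print(endpoint)
print(payload)

# To actually send it (requires the requests package and a valid token):
# response = requests.post(endpoint, data=payload).json()
```

Captions are limited in length (1024 characters in the Bot API documentation), so longer article text would still need a separate sendMessage call.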

TYPO3 lib HTML and TEXT code

I want to show a lib on every page of my site except the page with uid = 3.
So in my main TypoScript I have this:
[globalVar = TSFE:id <> 3]
.....
[end]
My question is: how do I set up a lib that has some text and HTML content in it?
Let's say this is what I want to show:
<div class="ProductListTitle_style1">
my text my text
<p> text text text... </p>
</div>

You can use lib = COA in combination with TEXT and IMAGE
lib.b = COA
lib.b {
wrap = <div class="ProductListTitle_style1">|</div>
10 = TEXT
10.value = my text my text
20 = TEXT
20.value = text text text...
20.wrap = <p>|</p>
30 = IMAGE
30.file = path/to/file.png
30.altText = My image
30.width = 300
}
Before TYPO3 6.0 you could use lib = HTML.
lib.a = HTML
lib.a.value (
<div class="ProductListTitle_style1">
my text my text
<p> text text text... </p>
</div>
)
You can also combine the two possibilities
lib.c = COA
lib.c {
wrap = <div class="ProductListTitle_style1">|</div>
10 = TEXT
10.value = my text my text
20 = HTML
20.value = <p> text text text... </p>
}

Just for clarification: In TYPO3 4.5+, the Content Objects TEXT and HTML have the same functionality. So you can of course put HTML tags in a TEXT object:
lib.something = TEXT
lib.something.value = <p>My Text</p>
Since both objects could do the same since TYPO3 4.5, the HTML cObject was deprecated and removed in 6.0.
As for Thomas's question about COA: a COA is a "content object array", that is, an array of content objects. A COA is used when more than one content object needs to be combined in one TypoScript object. So if you have just one object (as in my example above), you don't need a COA, but if you have more than one, use it (as in hildende's first example).
