HTML <p> nodes InnerText including anchor text in CsQuery

I'm parsing some wordpress blog articles using CsQuery to do some text clustering analysis on them. I'd like to strip out the text from the pertinent <p> node.
var content = dom["div.entry-content>p"];
if (content.Length == 1)
{
    System.Diagnostics.Debug.WriteLine(content[0].InnerHTML);
    System.Diagnostics.Debug.WriteLine(content[0].InnerText);
}
In one of the posts the InnerHTML looks like this:
An MIT Europe project that attempts to <a title="Wired News: Gizmo Puts Cards
on the Table" href="http://www.wired.com/news/technology/0,1282,61265,00.html?
tw=rss.TEK">connect two loved ones seperated by distance</a> through the use
of two tables, a bunch of RFID tags and a couple of projectors.
and the corresponding InnerText looks like this:
An MIT Europe project that attempts to through the use of two tables,
a bunch of RFID tags and a couple of projectors.
i.e. the inner text is missing the anchor text. I could parse the HTML myself but I am hoping there is a way to have CsQuery give me
An MIT Europe project that attempts to connect two loved ones
seperated by distance through the use of two tables, a bunch of RFID
tags and a couple of projectors.
(My italics, marking the missing anchor text.) How should I get this?

string result = dom["div.entry-content>p"].Text();
The Text() function will include everything below the <p> element, including the text of nested tags such as the anchor.

Try using HtmlAgilityPack:
using HAP = HtmlAgilityPack;
...
var doc = new HAP.HtmlDocument();
doc.LoadHtml("Your html");
var node = doc.DocumentNode.SelectSingleNode(@"node xPath");
Console.WriteLine(node.InnerText);
xPath is the path to the node on the page.
For example, in Google Chrome press F12, select your node, then right-click it and choose "Copy XPath".
For this topic's header, the XPath is: //*[@id="question-header"]/h1/a

Related

cannot see some data after scraping a link using requests.get or scrapy

I am trying to scrape data from a stock exchange website. Specifically, I need to read the numbers in the top left table. If you inspect the HTML page, you will see these numbers under <div> tags, following <td> tags whose ids are "e0", "e3", "e1" and "e4". However, the response, once saved into a text file, lacks all these numbers and some others. I have tried using Selenium with 20-second delays (so that the JavaScript is loaded), but this does not work and the element cannot be found.
Is there any workaround for this issue?
If you open Inspect Element > Network and filter by XHR, you will see the request that sends the data.
In your case it is this link: http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=7745894403636165&c=23%20
Unfortunately for you, the data is badly arranged, so you will have to work out at which position in the response the values that interest you appear. Good luck.
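A minimal sketch of that approach with the requests library; the User-Agent header and the ';'/',' delimiters are assumptions here, so verify them against the real response before relying on any field positions:

import requests

# Sketch only: fetch the XHR endpoint the page itself calls, instead of the rendered page.
url = "http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=7745894403636165&c=23%20"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# The body is plain delimited text rather than HTML; print it first to see how it is laid out.
print(response.text)

# Assuming the payload is split into sections by ';' and fields by ',' (an assumption to check),
# enumerate the pieces to find the numbers shown in the top-left table.
for i, section in enumerate(response.text.split(";")):
    print(i, section.split(","))

Once you know which index holds which figure, you can pick the values out directly instead of rendering the page with Selenium.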

Apache tika: preserve bullets list and styles and customize output

I'm doing some tests with Apache Tika. The goal is to turn complex Word documents (a few pages of text, tables, images, bullet lists with many levels of indentation) into XHTML, preserving as much information and styling as possible.
I found the out-of-the-box example shown at the end of this question on the official site. It does its job, but with several limitations:
1. Bullet and numbered lists are not output correctly: <p class="list_Paragraph">· first element of the list</p> is generated instead of <ul><li>first element of the list</li>…, and indentation levels are lost when lists are nested.
2. Text colors, font size, alignment and many other styles are not output at all.
3. Is it possible to generate specific output for a specific tag/style? (e.g. Heading 3 turned into <smallHeading> instead of <h3>)
4. Images are not extracted.
Point 4 probably requires implementing an extractor (from what I found in other posts), but is it possible to achieve the first three points? Are we talking about a few settings or extending the example parser/handler, or does everything have to be implemented from scratch? Suggestions?
Thanks a lot.
public String parseToHTML() throws IOException, SAXException, TikaException {
    ContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

How can I scrape the string from this tag in ruby

I'm currently trying to do my first proper project outside of Codecademy/BaseRails and could use some pointers. I'm using a scraper from one of the BaseRails projects as a base to work from. My aim is to get the string "Palms Trax" and store it in an array called dj, and to get the string "Solid Steel Radio Show" and store it in an array called source. My plan was to extract all the lines from the details section into a subarray and then filter them into the dj and source arrays, but if there is a better way of doing it please tell me. I've been trying various combinations such as '.details none.li.div', 'ul details none.li.div.a' etc., but can't seem to stumble on the right one. Also, could someone please explain to me why the code
page.css('ol').each do |line|
  subarray = line.text.strip.split(" - ")
end
only works if I declare subarray earlier, outside of the loop, when in the BaseRails project I am working from this did not seem to be the case.
Here is the relevant html:
<!-- Infos -->
<ul class="details none">
<li><span>Source</span><div> Solid Steel Radio Show</div></li>
<li><span>Date</span><div>2015.02.27</div></li>
<li><span>Artist</span><div>Palms Trax</div></li>
<li><span>Genres</span><div>Deep HouseExperimentalHouseMinimalTechno</div></li>
<li><span>Categories</span><div>Radio ShowsSolid Steel Radio Show</div></li>
<li><span>File Size</span><div> 135 MB</div></li>
<li><span>File Format</span><div> MP3 Stereo 44kHz 320Kbps</div></li>
</ul>
and my code so far:
require "open-uri"
require "nokogiri"
require "csv"
# store url to be scraped
url = "http://www.electronic-battle-weapons.com/mix/solid-steel-palms-trax/"
# parse the page
page = Nokogiri::HTML(open(url))
# initialize empty arrays
details = []
dj = []
source = []
artist = []
track = []
subarray = []
# store data in arrays
page.css('ul details none.li.div').each do |line|
  details = line.text.strip
end
puts details
page.css('ol').each do |line|
  subarray = line.text.strip.split(" - ")
end
I'm Alex, one of the co-founders of BaseRails. Glad you're now starting to work on your own projects - that's the best way to start applying what you've learned. I thought I'd chip in and see if I can help out.
I'd try this:
page.css('ul.details.none li div a')
This will grab each of the <a> tags, and you'll be able to use .text to extract the text of the link (e.g. Solid Steel Radio Show, Palms Trax, etc). To understand the code above, remember that the . means "with a class called..." and a space means "that has the following nested inside".
So in English, "ul.details.none li div a" translates to "a <ul> tag with a class called "details" and another class called "none", with an <li> tag nested inside, a <div> tag nested inside that, and an <a> tag inside that". Try that out and see if you can then figure out how to filter the results into DJ, Source, etc.
Finally, I'm not sure why your subarray needs to be declared. It shouldn't need to be declared if that's the only context in which you're using it. FYI the reason why we don't need to declare it in the BaseRails course is because the .split function returns an array by default. It's unlike our name, price, and details arrays where we're using a different function (<<). The << function can be used in multiple contexts, so it's important that we make clear that we're using it to add elements to an array.
Hope that helps!

using the chrome console to select out data

I'm looking to pull out all of the companies from this page (https://angel.co/finder#AL_claimed=true&AL_LocationTag=1849&render_tags=1) in plain text. I saw someone use the Chrome Developer Tools console to do this and was wondering if anyone could point me in the right direction?
TLDR; How do I use Chrome console to select and pull out some data from a URL?
Note: since jQuery is available on this page, I'll just go ahead and use it.
First of all, we need to select the elements that we want, e.g. the names of the companies. These are kept in the list with ID startups_content, inside elements with class items, in a field with class name. Therefore, a selector for these can look like this:
$('#startups_content .items .name a')
As a result, we will get a bunch of HTMLElements. Since we want plain text, we need to extract it from these HTMLElements by doing:
.map(function(idx, item){ return $(item).text(); }).toArray()
Which gives us an array of company names. However, let's make a single plain-text list out of it:
.join('\n')
Connecting all the steps above we get:
$('#startups_content .items .name a').map(function(idx, item){ return $(item).text(); }).toArray().join('\n');
which should be executed in the DevTools console.
If you need some other data, e.g. company URLs, just follow the same steps as described above doing appropriate changes.

How to read website content in python

I am trying to write a program which reads articles (posts) from any website, whether a Blogspot or WordPress blog or any other site. To write code that is compatible with almost all websites, whatever flavour of HTML5/XHTML they are written in, I thought of using RSS/Atom feeds as the starting point for extracting content.
However, as RSS/Atom feeds usually do not contain the entire article, I thought I would gather all the post links from the feed using feedparser and then extract the article content from each respective URL.
I can get the URLs of all articles on a website (including the summary, i.e. the article content shown in the feed), but I want to access the entire article, for which I have to use the respective URL.
I came across various libraries like BeautifulSoup, lxml, etc. (various HTML/XML parsers), but I really don't know how to get the "exact" content of the article (by "exact" I mean the data with all hyperlinks, iframes, slide shows etc. still present; I don't want the CSS part).
So, can anyone help me on it?
Fetching the HTML code of all linked pages is quite easy.
The hard part is to extract exactly the content you are looking for. If you simply need all code inside of the <body> tag, this shouldn't be a big problem either; extracting all text is equally simple. But if you want a more specific subset, you have more work to do.
I suggest that you install the requests and BeautifulSoup modules (both available via easy_install requests/bs4 or, better, pip install requests/bs4). The requests module makes fetching your page really easy.
The following example fetches a rss feed and returns three lists:
linksoups is a list of the BeautifulSoup instances of each page linked from the feed
linktexts is a list of the visible text of each page linked from the feed
linkimageurls is a list of lists with the src-urls of all the images embedded in each page linked from the feed
e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]
import requests, bs4

# request the content of the feed and create a BeautifulSoup object from its content
response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
responsesoup = bs4.BeautifulSoup(response.text)

linksoups = []
linktexts = []
linkimageurls = []

# iterate over all <link>…</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
    url = link.text
    linkresponse = requests.get(url)  # add support for relative urls with urlparse
    soup = bs4.BeautifulSoup(linkresponse.text)
    linksoups.append(soup)
    # append all text between tags inside of the body tag to the second list
    linktexts.append(soup.find('body').text)
    # get the src attribute of each <img> tag and append it to imageurls
    images = soup.find_all('img')
    imageurls = []
    for image in images:
        imageurls.append(image['src'])
    linkimageurls.append(imageurls)

# now somehow merge the retrieved information.
That might be a rough starting point for your project.
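Since the question mentions feedparser, here is a small sketch of the same first step that uses feedparser to collect the post URLs instead of scanning the feed's <link> tags by hand; the feed URL is just the example one from above, and the rest of the per-page processing would stay identical:

import feedparser, requests, bs4

# Sketch: let feedparser resolve the feed entries instead of soup.find_all('link').
feed = feedparser.parse('http://rss.slashdot.org/Slashdot/slashdot')

linksoups = []
for entry in feed.entries:
    # entry.link is the URL of the full post; fetch and parse it exactly as above
    linkresponse = requests.get(entry.link)
    linksoups.append(bs4.BeautifulSoup(linkresponse.text))

feedparser also exposes each entry's summary, so you can keep the feed-provided excerpt alongside the full page you fetch.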
