How to edit this code so that I can view only text? - web-scraping

This is a snippet from the config file of snownews (a terminal-based news aggregator).
Problem: When I view an RSS feed in the terminal, I can only see the text up to the first image; after that, everything is blank. Since I'm using a terminal, images are not supported anyway.
image: https://imgur.com/a/O3Gq2Dl
Here is the code where the web scraping takes place. I only need to view the text, without any images.
# Importing
if (($PROGRAM_NAME =~ "snow2opml") || ($ARGV[0] eq "--export")) {
OPMLexport();
} else {
my $parser = XML::LibXML->new();
$parser->validation(0); # Turn off validation from libxml
$parser->recover(1); # And ignore any errors while parsi>
my(@lines) = <>;
my($input) = join ("\n", @lines);
my($doc) = $parser->parse_string($input);
my($root) = $doc->documentElement();
# Parsing the document tree using xpath
my(@items) = $root->findnodes("//outline");
foreach (@items) {
my(@attrs) = $_->attributes();
foreach (@attrs) {
# Only print attribute xmlUrl=""
if ($_->nodeName =~ /xmlUrl/i) {
print $_->value."\n";
}
}
}
}
If the full code is needed I can post it

Related

Extracting data from a web page to Excel sheet

How can I extract information from a web page into an Excel sheet?
The website is https://www.proudlysa.co.za/members.php and I would like to extract all the companies listed there and all their respective information.
The process you're referring to is called web scraping, and there are several VBA tutorials out there for you to try.
Alternatively, you can always try [image, source: netdna-ssl.com]
I tried to write something that grabs all pages, but I ran out of time and it had bugs. This should help you a little; you will have to do this on each of the 112 pages.
Using Chrome, go to the page.
Type javascript: in the URL bar, then paste the code below. It should extract what you need. Then you just have to copy and paste it into Excel.
var list = $(document).find(".pricing-list");
var csv ="";
for (i = 0; list.length > i;i++) {
var dataTags = list[i].getElementsByTagName('li');
var dataArr = [];
for (j = 0; dataTags.length > j;j++) {
dataArr.push(dataTags[j].innerText.trim());
}
csv += dataArr.join(', ') + "<br>";
}
You will get something like this
EDITED
Use this instead. It will automatically download each page as a CSV file; you can then combine the files afterwards.
Make sure to type javascript: in the URL bar before pasting and pressing Enter.
This works in Chrome; I'm not sure about other browsers, as I don't use them much.
var list = $(document).find(".pricing-list");
var csv ="data:text/csv;charset=utf-8,";
for (i = 0; list.length > i;i++) {
var dataTags = list[i].getElementsByTagName('li');
var dataArr = [];
for (j = 0; dataTags.length > j;j++) {
dataArr.push(dataTags[j].innerText.trim());
}
csv += dataArr.join(', ') + "\n";
}
var a = document.createElement("a");
a.href = ""+ encodeURI(csv);
a.download = "data.csv";
a.click();
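One thing to watch with the snippet above: joining fields with ', ' produces broken rows as soon as a company name or address contains a comma or a quote. A minimal sketch of CSV quoting follows; the helper names csvField and toCsv are made up for illustration, and the rows argument is assumed to be an array of string arrays (one inner array per .pricing-list block).

```javascript
// Quote a single field: wrap it in double quotes when it contains a comma,
// a double quote, or a line break, and double any embedded quotes.
function csvField(value) {
  var text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// rows: array of arrays of strings (assumed shape, not from the original answer).
// Returns one CSV line per inner array, each terminated by a newline.
function toCsv(rows) {
  return rows.map(function (row) {
    return row.map(csvField).join(',');
  }).join('\n') + '\n';
}
```

In the download version you would push each dataArr into a rows array and build the link from toCsv(rows). It is probably also safer to use encodeURIComponent instead of encodeURI for the data: URI, because encodeURI leaves # unescaped and a # in the data would cut the file short.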

How does TextRenderInfo work in iTextSharp?

I found some code online that gives me the font sizes, but I do not understand how TextRenderInfo reads text. I tried renderInfo.GetText(), which returns a varying number of characters: sometimes 3, sometimes 2, sometimes more or fewer. How does renderInfo read the data?
My intention is to separate every line and paragraph in the PDF and to read their properties individually, such as font size and font style. If you have any suggestions, please mention them.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;
namespace FontSizeDig1
{
class Program
{
static void Main(string[] args)
{
// reader ==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfReader.html#pdfVersion
PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "document.pdf"));
TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();//strategy==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextExtractionStrategy.html
// for (int i = 1; i <= reader.NumberOfPages; i++)
// {
string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1/*i*/, S);
// PdfTextExtractor.GetTextFromPage(reader, 6, S) ==>> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/PdfTextExtractor.html
Console.WriteLine(F);
// }
Console.ReadKey();
//this.Close();
}
}
public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
//HTML buffer
private StringBuilder result = new StringBuilder();
//Store last used properties
private Vector lastBaseLine;
private string lastFont;
private float lastFontSize;
//http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
private enum TextRenderMode
{
FillText = 0,
StrokeText = 1,
FillThenStrokeText = 2,
Invisible = 3,
FillTextAndAddToPathForClipping = 4,
StrokeTextAndAddToPathForClipping = 5,
FillThenStrokeTextAndAddToPathForClipping = 6,
AddTextToPaddForClipping = 7
}
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
string curFont = renderInfo.GetFont().PostscriptFontName; // http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextRenderInfo.html#getFont--
//Check if faux bold is used
if ((renderInfo.GetTextRenderMode() == 2/*(int)TextRenderMode.FillThenStrokeText*/))
{
curFont += "-Bold";
}
//This code assumes that if the baseline changes then we're on a newline
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
//See if something has changed, either the baseline, the font or the font size
if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
{
//if we've put down at least one span tag close it
if ((this.lastBaseLine != null))
{
this.result.AppendLine("</span>");
}
//If the baseline has changed then insert a line break
if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
{
this.result.AppendLine("<br />");
}
//Create an HTML tag with appropriate styles
this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
}
//Append the current text
this.result.Append(renderInfo.GetText());
Console.WriteLine("me=" + renderInfo.GetText());//by imtiaj
//Set currently used properties
this.lastBaseLine = curBaseline;
this.lastFontSize = curFontSize;
this.lastFont = curFont;
}
public string GetResultantText()
{
//If we wrote anything then we'll always have a missing closing tag so close it here
if (result.Length > 0)
{
result.Append("</span>");
}
return result.ToString();
}
//Not needed
public void BeginTextBlock() { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
Take a look at this PDF:
What do you see?
I see:
Hello World
Hello People
Now, let's parse this file. What do you expect?
You probably expect:
Hello World
Hello People
I don't.
That's where you and I differ, and that difference explains why you ask this question.
What do I expect?
Well, I'll start by looking inside the PDF, more specifically at the content stream of the first page:
I see 4 strings in the content stream: ld, Wor, llo, and He (in that order). I also see coordinates. Using those coordinates, I can compose what is shown:
Hello World
I don't immediately see "Hello People" anywhere, but I do see a reference to a Form XObject named /Xf1, so let's examine that Form XObject:
Woohoo! I'm in luck, "Hello People" is stored in the document as a single string value. I don't need to look at the coordinates to compose the actual text that I can see with my human eyes.
Now for your question. You say "I need to know how the renderInfo is reading data" and now you know: by default, iText will read all the strings from a page in the order they occur: ld, Wor, llo, He, and Hello People.
Depending on how the PDF is created, you can have output that is easy to read (Hello People), or output that is hard to read (ld, Wor, llo, He). iText comes with "strategies" that reorder all those snippets so that [ld, Wor, llo, He] is presented as [He, llo, Wor, ld], but detecting which of those parts belong to the same line, and which lines belong to the same paragraph, is something you will have to do.
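To make the line-detection part concrete, here is a rough sketch of the idea, written in plain JavaScript rather than against the iText API. The chunk shape ({ text, x, y }, with y as the baseline and x as the start of the chunk), the function name groupIntoLines and the tolerance parameter are all assumptions for illustration; in iTextSharp you would collect those values from each TextRenderInfo yourself.

```javascript
// chunks: [{ text, x, y }] in content-stream order (assumed shape).
// tolerance: how far two baselines may differ and still count as one line.
function groupIntoLines(chunks, tolerance) {
  // In PDF user space y grows upward, so the largest y is the top line.
  var byHeight = chunks.slice().sort(function (a, b) { return b.y - a.y; });

  var lines = [];
  byHeight.forEach(function (chunk) {
    var current = lines[lines.length - 1];
    if (current && Math.abs(current.y - chunk.y) <= tolerance) {
      current.chunks.push(chunk);                  // same baseline: same line
    } else {
      lines.push({ y: chunk.y, chunks: [chunk] }); // new baseline: new line
    }
  });

  // Within a line, order the chunks left to right and glue the text together.
  return lines.map(function (line) {
    return line.chunks
      .sort(function (a, b) { return a.x - b.x; })
      .map(function (c) { return c.text; })
      .join('');
  });
}
```

Fed with the five strings from the example (using invented coordinates), this would return the two lines in reading order. Grouping lines into paragraphs needs a further heuristic, for instance comparing the vertical gap between consecutive lines, and that is not covered here.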
NOTE: at iText Group, we already have plenty of closed source code that could save you plenty of time. Since we are the copyright owner of the iText library, we can ask money for that closed source code. That's something you typically can't do if you're using iText for free (because of the AGPL). However, if you are a customer of iText, we can probably disclose more source code. Do not expect us to give that code for free, as that code has too much commercial value.

How can I print the data contained by a C++ Map while debugging using DBX

I want to see the contents of a map while debugging a C++ program.
I am using command-line dbx.
I have a pointer to the map.
Is there a way I can get the data printed?
--
Edit:
p *dataMap gives me this:
p *dataMap
*dataMap = {
__t = {
__buffer_size = 32U
__buffer_list = {
__data_ = 0x3ba2b8
}
__free_list = (nil)
__next_avail = 0x474660
__last = 0x474840
__header = 0x3b97b8
__node_count = 76U
__insert_always = false
__key_compare = {
/* try using "print -r" to see any inherited members */
}
}
}
Thanks
Alok Kr.
You need to write a ksh function to pretty-print the map. Here is an example:
Put the following line in .dbxrc:
source /ksh_STL_map
In dbx, use ppp to call the ksh function defined in ksh_STL_map:
(dbx) ppp k
k = 2 elems {343, 0x301f8; 565, 0x30208}
I tried to post the content of ksh_STL_map here, but the editor's formatting mangles it. It would be better if you posted your email address, so that I can send ksh_STL_map to you directly.

Getting a cmdlet's dynamic parameters via reflection

PowerShell exposes some parameters, "dynamic parameters", based on context. The MSDN page explains the mechanism pretty well, but the short version is that to find out about these one must call GetDynamicParameters(), which returns a class containing the additional parameters. I need to get these parameters via reflection and (here's the crux of it) in a ReflectionOnly context, that is, the types are loaded with ReflectionOnlyLoadFrom. So, no Assembly.InvokeMember("GetDynamicParameters").
Can this be done?
No. Reflection works against static assembly metadata. Dynamic parameters in PowerShell are added at runtime by the command or function itself.
Perhaps this helps:
1: Definition of the dynamic parameters
#===================================================================================
# DEFINITION OF FREE FIELDS USED BY THE CUSTOMER
#-----------------------------------------------------------------------------------
# SYNTAX: @{ <FF-Name>=@(<FF-Number>,<isMandatory_CREATE>,<isMandatory_UPDATE>); }
$usedFFs = @{
"defaultSMTP"=@(2,1,0); `
"allowedSMTP"=@(3,1,0); `
"secondName"=@(100,1,0); `
"orgID"=@(30001,1,0); `
"allowedSubjectTypeIDs"=@(30002,1,0); `
}
# FF-HelpMessage for input
$usedFFs_HelpMSG = @{ 2="the default smtp domain used by the organization. Sample:'algacom.ch'"; `
3="comma separated list of allowed smtp domains. Sample:'algacom.ch,basel.algacom.ch'"; `
100="an additional organization name. Sample:'algaCom AG'"; `
30001="a unique ID (integer) identifying the organization entry"; `
30002="comma separated list of allowed subject types. Sample:'1,2,1003,10040'"; `
}
2: Definition of the function that builds the dynamic parameters
#-------------------------------------------------------------------------------------------------------
# Build-DynParams : Used to build the dynamic input parameters based on $usedFFs / $usedFFs_HelpMSG
#-------------------------------------------------------------------------------------------------------
function Build-DynParams($type) {
$paramDictionary = New-Object -Type System.Management.Automation.RuntimeDefinedParameterDictionary
foreach($ffName in $usedFFs.Keys) {
$ffID = $usedFFs.Item($ffName)[0]
$dynAttribCol = New-Object -Type System.Collections.ObjectModel.Collection[System.Attribute]
$dynAttrib = New-Object System.Management.Automation.ParameterAttribute
$dynAttrib.ParameterSetName = "__AllParameterSets"
$dynAttrib.HelpMessage = $usedFFs_HelpMSG.Item($ffID)
switch($type) {
"CREATE" { $dynAttrib.Mandatory = [bool]($usedFFs.Item($ffName)[1]) }
"UPDATE" { $dynAttrib.Mandatory = [bool]($usedFFs.Item($ffName)[2]) }
}
$dynAttribCol.Add($dynAttrib)
$dynParam = New-Object -Type System.Management.Automation.RuntimeDefinedParameter($ffName, [string], $dynAttribCol)
$paramDictionary.Add($ffName, $dynParam)
}
return $paramDictionary
}
3: Function that makes use of the dynamic params
#-------------------------------------------------------------------------------------------------------
# aAPS-OrganizationAdd : This will add a new organization entry
#-------------------------------------------------------------------------------------------------------
Function aAPS-OrganizationAdd {
[CmdletBinding()]
Param(
[Parameter(Mandatory=$true,HelpMessage="The name of the new organization")]
[String]$Descr,
[Parameter(Mandatory=$false,HelpMessage="The name of the parent organization")]
[String]$ParentDescr=$null,
[Parameter(Mandatory=$false,HelpMessage="The status of the new organization [1=Active|2=Inactive]")]
[int]$Status = 1,
[Parameter(Mandatory=$false,HelpMessage="If you want to see the data of the deactivated object")]
[switch]$ShowResult
)
DynamicParam { Build-DynParams "CREATE" }
Begin {}
Process {
# do what you want here
}
End {}
}

read more problem

I wrote a "read more" function, but it's not working correctly. I can split my test post and cut the string with Substring, using the < !--kamore--> keyword as the split marker.
But after I cut the string and assign it to InnerHtml, the CSS breaks if there is an unclosed HTML tag before the marker index (for example < p>< !--kamore-->). I can't solve this. If I use a regex, it strips all the HTML tags and turns the post into plain text, which is not good: any links or tables in the post are no longer rendered.
Here is my code.
#region ReadMore
string strContent = drvRow["cont"].ToString();
//strContent = Server.HtmlDecode(strContent);
//strContent = Regex.Replace(strContent, @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", string.Empty);
// More extension by kad1r
int kaMoreIndex;
kaMoreIndex = strContent.IndexOf("<!--kamore-->");
if (kaMoreIndex > 0)
{
if (strContent.Length >= kaMoreIndex)
{
aReadMore.Visible = true;
article.InnerHtml = strContent.Substring(0, kaMoreIndex);
// if this ends like this there is a problem
// < p>< !--kamore--> or < div>< !--kamore-->
// because there is no end of this tag!
}
else
{
article.InnerHtml = strContent;
}
}
else
{
article.InnerHtml = strContent;
}
#endregion
I fixed it. I found this code and applied it to my string. Now everything works fine.
http://social.msdn.microsoft.com/Forums/en-US/csharpgeneral/thread/0f06a2e9-ab09-4692-890e-91a6974725c0
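The code from the linked thread is not reproduced here. For readers who land on this question, the general idea of such a fix can be sketched as follows: cut the content at the marker, then append closing tags for every element that is still open in the excerpt. This is only an illustration in JavaScript, not the code from the thread; the function names closeOpenTags and excerpt and the VOID_TAGS list are made up, and a regex-based tag scan like this will not cope with every kind of malformed HTML.

```javascript
// Elements that never take a closing tag (partial list, for illustration).
var VOID_TAGS = ['br', 'hr', 'img', 'input', 'meta', 'link'];

// Append closing tags for elements that are still open at the end of fragment.
function closeOpenTags(fragment) {
  var open = [];
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  var match;
  while ((match = tagPattern.exec(fragment)) !== null) {
    var name = match[2].toLowerCase();
    if (match[3] === '/' || VOID_TAGS.indexOf(name) !== -1) {
      continue; // self-closing or void element: nothing to close
    }
    if (match[1] === '/') {
      // Closing tag: drop the matching opener and anything nested inside it.
      var at = open.lastIndexOf(name);
      if (at !== -1) { open.length = at; }
    } else {
      open.push(name);
    }
  }
  // Close the innermost element first.
  return fragment + open.reverse().map(function (n) {
    return '</' + n + '>';
  }).join('');
}

// Return the part of content before marker, with open tags closed.
// If the marker is absent, the content is returned unchanged.
function excerpt(content, marker) {
  var cut = content.indexOf(marker);
  return cut === -1 ? content : closeOpenTags(content.substring(0, cut));
}
```

With this approach the links and tables before the marker stay as HTML, and the surrounding page layout is not affected by a dangling <p> or <div>.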

Resources