Crawler4j missing outgoing links?

I'm trying to crawl the Apache Mailing Lists to get all the archived messages using Crawler4j. I provided a seed URL and am trying to get links to the other messages. However, it seems to not be extracting all the links.
Following is the HTML of my seed page (http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Re: some healthy broker disappear from zookeeper</title>
<link rel="stylesheet" type="text/css" href="/archives/style.css" />
</head>
<body id="archives">
<h1>kafka-users mailing list archives</h1>
<h5>
Site index · List index</h5> <table class="static" id="msgview">
<thead>
<tr>
<th class="title">Message view</th>
<th class="nav">« Date » · « Thread »</th>
</tr>
</thead>
<tfoot>
<tr>
<th class="title">Top</th>
<th class="nav">« Date » · « Thread »</th>
</tr>
</tfoot>
<tbody>
<tr class="from">
<td class="left">From</td>
<td class="right">Neha Narkhede <neha.narkh...#gmail.com></td>
</tr>
<tr class="subject">
<td class="left">Subject</td>
<td class="right">Re: some healthy broker disappear from zookeeper</td>
</tr>
<tr class="date">
<td class="left">Date</td>
<td class="right">Tue, 20 Nov 2012 19:01:56 GMT</td>
</tr>
<tr class="contents"><td colspan="2"><pre>
zookeeper server version is 3.3.3 is pretty buggy and has known
session expiration and unexpected ephemeral node deletion bugs.
Please upgrade to 3.3.4 and retry.
Thanks,
Neha
On Tue, Nov 20, 2012 at 10:42 AM, Xiaoyu Wang <xwang#rocketfuel.com> wrote:
> Hello everybody,
>
> We have run into this problem a few times in the past week. The symptom is
> some broker disappear from zookeeper. The broker appears to be healthy.
> After that, producers start producing lots of ZK producer cache stale log
> and stop making any progress.
> "logger.info("Try #" + numRetries + " ZK producer cache is stale.
> Refreshing it by reading from ZK again")"
>
> We are running kafka 0.7.1 and the zookeeper server version is 3.3.3.
>
> The missing broker will show up in zookeeper after we restart it. My
> question is
>
> 1. Did anyone encounter the same problem? how did you fix it?
> 2. Why producer is not making any progress? Can we make the producer
> work with those brokers that are listed in zookeeper.
>
>
> Thanks,
>
> -Xiaoyu
</pre></td></tr>
<tr class="mime">
<td class="left">Mime</td>
<td class="right">
<ul>
<li><a rel="nofollow" href="/mod_mbox/kafka-users/201211.mbox/raw/%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg@mail.gmail.com%3e/">Unnamed text/plain</a> (inline, None, 1037 bytes)</li>
</ul>
</td>
</tr>
<tr class="raw">
<td class="left"></td>
<td class="right">View raw message</td>
</tr>
</tbody>
</table>
</body>
</html>
These are the outgoing URLs as identified by Crawler4j.
http://mail-archives.apache.org/archives/style.css
http://mail-archives.apache.org/mod_mbox/
http://mail-archives.apache.org/mod_mbox/kafka-users
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
However, the URLs that I'm interested in are missing.
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAFbh0Q1uJGxQiH15a7xS+pCwq+Jft9yKhb66t_C78UrMX338mQ@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczTexvtqSK+nmauvj37vhTF31awzeegpWdk6eZ-+LaGTVw@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAPkvrnkTEtfFhYnCMj=xMs58pFU1sy-9sJuJ6e19mGVVipRg0A@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cCAA+BczS5JOCA+QpgLU+tXeG=Ke_MXxiG_PinMt0YDxGBtz5nPg@mail.gmail.com%3e
What am I doing wrong? How do I get Crawler4j to extract the URLs I need?

Note that the mailing list archives offer direct links for downloading the mbox files. In your case, you can simply fetch this URL with wget; no crawler is needed:
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox

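You are probably using the wrong seed page. The message page you are seeding with only links to the date and thread indexes.
I think your seed page should be the thread index:
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread
and then use
public boolean shouldVisit(WebURL url) {
    // The URL is lowercased before matching, so the prefix below must be all lowercase too.
    // FILTERS is assumed to be the file-extension Pattern defined elsewhere in your crawler class.
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
        && href.contains("http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cca");
}
I hope that helps.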
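A side note on matching these URLs: the hex digits in percent escapes are case-insensitive, so `%3C` and `%3c` denote the same character, and `%40` is simply an encoded `@`. The following is a minimal Python sketch of that point only, not crawler4j code; the helper name `same_message` is hypothetical, and the second URL is an illustrative spelling of the seed URL in the style used by the archive's own links.

```python
from urllib.parse import unquote


def same_message(url_a: str, url_b: str) -> bool:
    """Compare two archive URLs after decoding their percent escapes."""
    return unquote(url_a) == unquote(url_b)


base = "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/"

# Form reported by the crawler: upper-case escapes, '=', '+' and '@' encoded.
crawled = base + "%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E"

# Form used in the page's own links: lower-case escapes, the rest left literal.
linked = base + "%3cCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3=Ao_J8Linhpnc+6y7tOcxg@mail.gmail.com%3e"

print(same_message(crawled, linked))
```

A filter that compares raw strings has to settle on one spelling (as the lowercased prefix does); decoding first sidesteps the difference.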

Related

How to embed file content into body of the email using mail command?

I have a requirement where I need to send a file's content as the mail body. Can we do this through Unix scripting?
Thanks in advance.
Create an HTML file from the data, and then send that file as the email content.
Use an expression to create your file data like this:
v_data= ' <tr>
<td>'||company ||'</td>
<td>'|| contact_person|| '</td>
<td>'|| country ||'</td>
</tr>'
Use an aggregator to concatenate all these rows into one single row, with no group-by.
Then use another expression transformation.
Create ports like this:
v_head ='
<head></head>
<body>
<b>pls find below data.</b>
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>'
v_body = Aggregated_v_data
v_tail='</table></body>'
v_output = v_head||v_body ||v_tail
Then connect this output to a flat file target.
Then send this flat file using the mailx command or any mail client.
The HTML file should look like this:
<head></head>
<body>
pls find below data.
<br> </br>
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds </td>
<td>Maria </td>
<td>Germany</td>
</tr>
</table>
</body>
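
The head/body/tail concatenation described above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample row, the subject, and the example.com addresses are placeholders, and the message is built but not sent.

```python
from email.message import EmailMessage
from html import escape

# Placeholder data; in practice these rows would come from the source file.
rows = [("Alfreds", "Maria", "Germany")]

head = (
    "<head></head>\n<body>\n<b>pls find below data.</b>\n<table>\n"
    "<tr>\n<th>Company</th>\n<th>Contact</th>\n<th>Country</th>\n</tr>\n"
)
# One <tr> per data row, with values escaped so they cannot break the markup.
body = "".join(
    "<tr>\n"
    f"<td>{escape(company)}</td>\n<td>{escape(contact)}</td>\n<td>{escape(country)}</td>\n"
    "</tr>\n"
    for company, contact, country in rows
)
tail = "</table></body>"
html = head + body + tail

# Wrap the HTML in a message whose body is declared as text/html.
msg = EmailMessage()
msg["Subject"] = "Report"                 # placeholder
msg["From"] = "sender@example.com"        # placeholder
msg["To"] = "recipient@example.com"       # placeholder
msg.set_content(html, subtype="html")

print(msg.get_content_type())
```

Actually sending the message depends on the environment; with a reachable SMTP server, `smtplib.SMTP(host).send_message(msg)` is one option.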

SNMP/mrtg/traffic reporting incorrect on interfaces

I've installed MRTG and customized mrtg.cfg with some options. The interface speed is 1 Gb/s. I wanted the graph and data shown in bits, not bytes. I ended up with graphs that show, for example, 8 Mb/s rather than 80 Mb/s. Where is the mistake in my mrtg.cfg?
Target[10.0.1.1_7]: 7:public@10.0.1.1:
YLegend[10.0.1.1_7]: Bits per Second
Colours[10.0.1.1_7]: GREEN#00eb0c,BLUE#1000ff,DARK GREEN#006600,VIOLET#ff00ff
Background[10.0.1.1_7]: #a0a0a0a
Kilo[10.0.1.1_7]: 1024
SetEnv[10.0.1.1_7]: MRTG_INT_IP="a.b.c.d" MRTG_INT_DESCR="eth1"
MaxBytes[10.0.1.1_7]: 125000000
Title[10.0.1.1_7]: WAN -- Oslo
PageTop[10.0.1.1_7]: <h1>WAN -- OSL</h1>
<div id="sysdetails">
<table>
<tr>
<td>System:</td>
<td>Cisco RV320 OSL </td>
</tr>
<tr>
<td>Maintainer:</td>
<td></td>
</tr>
<tr>
<td>Description:</td>
<td>eth1 </td>
</tr>
<tr>
<td>ifType:</td>
<td>ethernetCsmacd (6)</td>
</tr>
<tr>
<td>ifName:</td>
<td>eth1</td>
</tr>
<tr>
<td>Max Speed:</td>
<td>1000 Mbit/s</td>
</tr>
<tr>
<td>Ip:</td>
<td>a.b.c.d (No DNS name)</td>
</tr>
</table>
</div>
I also added following options:
options[_]: growright,bits,transparent,nobanner,nolegend
The answer lay in how mrtg.cfg is read, which I had misunderstood. By default, cfgmaker creates a new options line and places RunAsDaemon at the bottom of mrtg.cfg.
Then I came upon this text: "The later setting replaces the previous one for the rest of the configuration file."
This means MRTG runs with the defaults through the config file and only switches to my values when it reads them at the end.
Solution: all global settings must be placed above the SNMP/interface/etc. settings to take effect.
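
As a sanity check on the units in the config above: MaxBytes is given in bytes per second, while the `bits` option makes MRTG display values multiplied by 8. The short Python sketch below only illustrates that conversion (the helper name is made up for this example); it does not reproduce MRTG's behaviour or explain the discrepancy by itself.

```python
BITS_PER_BYTE = 8


def bytes_to_mbit(rate_bytes_per_s: float) -> float:
    """Convert a rate in bytes/s to megabits/s (decimal mega, 10**6)."""
    return rate_bytes_per_s * BITS_PER_BYTE / 1_000_000


max_bytes = 125_000_000  # MaxBytes value from the config above

print(bytes_to_mbit(max_bytes))    # 1000.0, i.e. the 1 Gb/s interface speed
print(bytes_to_mbit(10_000_000))   # 80.0, i.e. 10 MB/s of traffic is 80 Mb/s
```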

How to import and save data from csv in web2py database table?

I'm using an SQLite database.
I wrote code like this:
Module:
db.py
db = DAL('sqlite://storage.sqlite')
db.define_table('data3')
db.data3.import_from_csv_file(open('mypath/test.csv'),'r')
Controller:
def website_list():
return dict(websites = db().select(db.data4.ALL))
View:
{{extend 'layout.html'}}
<h2>List Of Websites</h2>
<table class="flakes-table" style="width:100% ;">
<thead>
<tr>
<td class="id" >ID</a></td>
<td class="link" >Link</td>
</tr>
</thead>
{{for web in websites:}}
<tbody class="list">
<tr>
<td >{{=web.id}}</td>
<td >{{=web.Link}</td>
</tr>{{pass}}
</tbody>
</table>
But it shows this error:
"type 'exceptions.AttributeError'"
The error also includes this line:
Function argument list
(self=, key='data3')
I think something is wrong in reading the CSV file. My CSV file has the following data:
"Link_Title","Link"
"Apple's Ad Blockers Rile Publishers","somelink"
"Uber Valued at More Than $50 Billion","somelink"
"England to Roll Out Tailored Billboards","somelink"
Can anyone help with this?
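
One way to narrow this down is to check the CSV on its own, outside web2py. The sketch below parses the sample rows with Python's standard `csv` module (Python 3 syntax; the data is inlined here as an assumption in place of `mypath/test.csv`). If this works, the file is fine and the places to look are the model, which defines `data3` without any fields, and the controller, which selects from `data4`.

```python
import csv
import io

# Sample data copied from the question; a real run would open the file instead.
sample = '''"Link_Title","Link"
"Apple's Ad Blockers Rile Publishers","somelink"
"Uber Valued at More Than $50 Billion","somelink"
"England to Roll Out Tailored Billboards","somelink"
'''

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# The header row gives the column names a matching table would need.
print(reader.fieldnames)
for row in rows:
    print(row["Link_Title"], "->", row["Link"])
```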

Laravel 4 email template styling

I'm trying to style my email template using Bootstrap, but every time Laravel parses the template it adds a '3D' wherever there is an '=' in the template, which results in style=3D"table" instead of style="table". Here is a snippet of the mail source code:
<div class=3D"well">
<table class=3D"table table-bordered table-striped" id=3D'table'>
<thead>
<th>Group Name</th>
<th>Kpi Name</th>
<th>User Name</th>
</thead>
<tbody>
</tbody>
</table>
</div>
Here is my template code:
<html lang="en-US">
<head>
<meta charset="utf-8">
{{ HTML::style('app\\client\\css\\bootstrap.css') }}
{{ HTML::script('app\\javascripts\\js\\jquery-1.8.3.min.js') }}
{{ HTML::script('app\\client\\js\\bootstrap.min.js') }}
<style>
#table {
border: 2px solid #ccc;
border-radius: 5px;
}
</style>
</head>
<body>
<div class="well">
<table class="table table-bordered table-striped" id='table'>
<thead>
<th>Group Name</th>
<th>Kpi Name</th>
<th>User Name</th>
</thead>
<tbody>
@foreach ($groups as $group)
<tr>
<td>{{ $group['name'] }}</td>
<td>
<ul>
@if (array_key_exists('kpis', $group))
@foreach ($group['kpis'] as $kpi)
<li>{{ Kpi::find($kpi['kpi'])->title }}</li>
@endforeach
@endif
</ul>
</td>
<td>
<ul>
@if (array_key_exists('users', $group))
@foreach ($group['users'] as $user)
<li>{{ User::find($user['user'])->fName.' '.User::find($user['user'])->lName }}</li>
@endforeach
@endif
</ul>
</td>
</tr>
@endforeach
</tbody>
</table>
</div>
<footer>
<br>
<br>
Regards,<br>
<strong>Yogesh Joshi</strong><br>
Group Leader
</footer>
</body>
</html>
Is there something I'm missing, or is there a problem in Laravel or the mail server (Gmail or Hotmail)? And yes, I've cross-checked the script and style files; they do exist in the public folder.
Please help or suggest an alternative method.
For the =3D items, see this post: What's a 3D doing in this HTML?
I also see that you are trying to load JavaScript in your email; that's not a good idea. See: Is JavaScript supported in an email message?
I am not sure about loading a stylesheet into an HTML email, so I will leave that one open.
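
For context on the =3D items: `=3D` is how the quoted-printable transfer encoding writes a literal `=`, so it appears in the raw mail source and is decoded again by the mail client. A small Python sketch using the standard `quopri` module shows the decoding on a fragment of the snippet from the question (this illustrates the encoding only, not Laravel's mailer):

```python
import quopri

# Raw fragment as it appears in the mail source.
encoded = (
    b'<div class=3D"well">\n'
    b"<table class=3D\"table table-bordered table-striped\" id=3D'table'>"
)

# Decoding turns every =3D back into a plain '='.
decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)
```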
Using Bootstrap in emails sounds like a bit of overkill, but there's an old project called Bootstrap Template for HTML Email. Not sure if this helps.
This question also seems to partially overlap with the SO question "Has anyone gotten HTML emails working with Twitter Bootstrap?"
It is probably easier to use HTML Email Boilerplate with a small style sheet copied from your actual project.
Or, if you're lazy, use the Mailchimp WYSIWYG editor and export the result out of Mailchimp. They have a really nice WYSIWYG editor and responsive templates. I've been using it, and it saves some time and effort.

Sorting a Dojo DataGrid Declaratively

I have a DataGrid that is loaded from an XML data store, all created declaratively. I'd like to set the sort when the data is loaded. All of the examples I've found deal with doing this programmatically and hint that it should be doable declaratively.
This is the code that creates the data source.
<head>
<title>Untitled Page</title>
<style type="text/css">
@import "StyleSheet.css";
@import "js/dojotoolkit/dijit/themes/pfga/pfga.css";
@import "js/dojotoolkit/dojo/resources/dojo.css";
@import "js/dojotoolkit/dojox/grid/resources/Grid.css";
@import "js/dojotoolkit/dojox/grid/resources/pfgaGrid.css";
</style>
<script src="js/dojotoolkit/dojo/dojo.js" type="text/javascript" djConfig="parseOnLoad: true"></script>
<script type="text/javascript">
dojo.require("dojo.parser");
dojo.require("dojox.grid.DataGrid");
dojo.require("dojox.data.XmlStore");
dojo.require("dijit.layout.ContentPane");
</script>
</head>
<body class="pfga">
<div dojotype="dojox.data.XmlStore" url="events.xml" jsID="eventStore"></div>
<table dojoType="dojox.grid.DataGrid" store="eventStore" class="pfga" style="height:500px" clientSort="true" jsID="eventGrid">
<thead>
<tr>
<th field="date" width="80px">Date</th>
<th field="description" width="600">Description</th>
<th field="DateID" sortDesc="true" hidden="false">DateSort</th>
</tr>
<tr>
<th field="time" colspan="3">Details</th>
</tr>
</thead>
</table>
</body>
For the record, in dojo 1.5 it's the 'sortInfo' param passed to the Data Grid. It uses the same convention as the 'canSort' function, i.e. a number indicating the column (starting at 1) and the sign indicating the direction of sort.
I added a comment to http://docs.dojocampus.org/dojox/grid/DataGrid to this effect.
For example, this grid is sorted by the 'created' column in 'most recent first' order:
<table dojoType="dojox.grid.DataGrid" clientSort="true" selectionMode="single"
formatterScope="formatterScope" dojoAttachPoint="logGrid" sortInfo="-2">
<thead><tr>
<th field="clientId" width="10%">Client ID</th>
<th field="created" width="20%" formatter="datefmt">Created</th>
<th field="message" width="30%" formatter="messagebodyfmt">Message</th>
<th field="token" width="10%">Token</th>
<th field="type" width="20%">Type</th>
<th field="username" width="10%">Username</th>
</tr></thead>
</table>
Of course, your choice of store and the way it understands the sort directives will have further impact. For example, I'm using JsonQueryRestStore, and the sortInfo param results in the store query including sort syntax based on dojox.data.JsonQuery; the backend handling the query must understand how to sort the data before returning it.
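
To make the sortInfo convention described above concrete, here is a small Python sketch that interprets such a value: a 1-based column number whose sign gives the direction. The helper `decode_sort_info` is hypothetical and is not part of Dojo; it only mirrors the convention, with a negative value taken as descending, as in the "most recent first" example.

```python
def decode_sort_info(sort_info: int, fields: list[str]) -> tuple[str, bool]:
    """Return (field name, descending?) for a signed, 1-based column index."""
    if sort_info == 0 or abs(sort_info) > len(fields):
        raise ValueError(f"sortInfo {sort_info} does not match any column")
    return fields[abs(sort_info) - 1], sort_info < 0


# Column order taken from the example grid above.
fields = ["clientId", "created", "message", "token", "type", "username"]

print(decode_sort_info(-2, fields))  # ('created', True): second column, descending
```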
It looks like the sort started to work once I added the jsID to solve my filtering problem.
