I am new to Apache Nutch 2.3 and Solr, and I am trying to get my first crawl working. I installed Apache Nutch and Solr as described in the official documentation and both are working fine. However, when I run the following steps I get errors:
bin/nutch inject examples/dmoz/ - Works correctly
(InjectorJob: total number of urls rejected by filters: 2
InjectorJob: total number of urls injected after normalization and filtering:130)
Error - $ bin/nutch generate -topN 5
GeneratorJob: starting at 2015-06-25 17:51:50
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 5
java.util.NoSuchElementException
at java.util.TreeMap.key(TreeMap.java:1323)
at java.util.TreeMap.firstKey(TreeMap.java:290)
at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73) ...
GeneratorJob: generated batch id: 1435279910-1190400607 containing 0 URLs
I get the same error if I run $ bin/nutch readdb -stats:
Error - java.util.NoSuchElementException ...
Statistics for WebTable:
jobs: {db_stats-job_local970586387_0001={jobName=db_stats, jobID=job_local970586387_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, REDUCE_INPUT_RECORDS=0, SPILLED_RECORDS=0, MAP_INPUT_RECORDS=0, SPLIT_RAW_BYTES=653, MAP_OUTPUT_BYTES=0, REDUCE_SHUFFLE_BYTES=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, REDUCE_OUTPUT_RECORDS=0, MAP_OUTPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, COMMITTED_HEAP_BYTES=514850816}, File Input Format Counters ={BYTES_READ=0}, File Output Format Counters ={BYTES_WRITTEN=98}, FileSystemCounters={FILE_BYTES_WRITTEN=1389120, FILE_BYTES_READ=1216494}}}}
TOTAL urls: 0
I am also not able to use the generate or crawl commands.
Can anyone tell me what I am doing wrong?
Thanks.
I too am new to Nutch, but I think the problem is that you haven't configured a data store. I got the same error, and setting one up got me a bit further. You need to follow either https://wiki.apache.org/nutch/Nutch2Tutorial or https://wiki.apache.org/nutch/Nutch2Cassandra. Then, rebuild: ant runtime
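For reference, the HBase route from the Nutch2Tutorial boils down to roughly this (store class and property names as given in that tutorial; if you go with Cassandra, use the corresponding Gora store instead):
In conf/nutch-site.xml:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
In conf/gora.properties:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
and in ivy/ivy.xml, make sure the gora-hbase dependency is enabled before running ant runtime.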
Related
Using: telegraf version 1.23.1
That's the workflow: Telegraf => Influx => Grafana.
I am using Telegraf to collect metrics on a shared server. So far so good: I was already able to initialize the Telegraf uWSGI plugin and display the data of my running Django projects in Grafana.
Problem
Now I also want to check some folder sizes with the [[inputs.filecount]] Telegraf plugin, and this works well too. However, I do not need metrics every 10s for this plugin, so I changed the interval in the [[inputs.filecount]] plugin as described in the documentation.
telegraf.conf
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "5s"
flush_interval = "10s"
flush_jitter = "0s"
#... PLUGIN
[[inputs.filecount]]
# set different interval for this input plugin every 10min
interval=“600s”
collection_jitter=“20s”
# Default from Doc =>
directories = ["/home/myserver/logs", "/home/someName/growingData, ]
name = "*"
recursive = true
regular_only = false
follow_symlinks = false
size = "0B"
mtime = "0s"
After restarting Telegraf with Supervisor, it crashed because it could not parse the new lines.
supervisor.log
Error running agent: Error loading config file /home/user/etc/telegraf/telegraf.conf: Error parsing data: line 208: invalid TOML syntax
These are the lines I added, because I thought that is how the documentation says to do it.
telegraf.conf
# set different interval for this input plugin every 10min
interval=“600s”
collection_jitter=“20s”
Question
So my question is: how can I change or set up the interval for a single input plugin in Telegraf?
Or do I have to use a different TOML syntax, like [[inputs.filecount.agent]] or something similar?
I assume that I do not have to change any output interval either? Even though it is currently 10s, if this input plugin only collects data every 600s it should not matter; some flush cycle will push the data to Influx.
How can I change or set up the interval for a single input plugin in Telegraf?
As the link you pointed to shows, individual inputs can set the interval and collection_jitter options. There is no difference in the TOML syntax; for example, I can do the following for the memory input plugin:
[[inputs.mem]]
interval="600s"
collection_jitter="20s"
I assume that I do not have to change any output interval either?
Correct, these are independent of each other.
line 208: invalid TOML syntax
Knowing exactly what is on line 208 and around that line will hopefully resolve your issue and get you going again. Also make sure the quotes you used are correct: sometimes when people copy and paste, they end up with curly quotes (”) instead of straight quotes ("), which can cause issues!
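For example, a corrected version of the snippet from the question, using plain straight quotes, would look like this (directory values taken from your config):
[[inputs.filecount]]
  # collect this input only every 10 minutes
  interval = "600s"
  collection_jitter = "20s"
  directories = ["/home/myserver/logs", "/home/someName/growingData"]
  name = "*"
  recursive = true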
I work on a Symfony project and I want to exclude some generated code from Sonar analysis.
I want to exclude the folder at this path: src/Application/Sonata.
I tried many possibilities with sonar.exclusions, but in vain:
sonar.exclusions=src/Application/Sonata/*
sonar.exclusions=src/Application/Sonata/**
sonar.exclusions=src/Application/Sonata/**/*
This is my sonar-project.properties file:
# Required metadata
sonar.projectKey=project
sonar.projectName=project
sonar.projectVersion=0.1.3
# Description
sonar.projectDescription=project a base symphony 2
# Encoding of the source code
sonar.sourceEncoding=UTF-8
sonar.exclusions=src/Application/Sonata/**/* ,src/project/Resources/public/js/lib/**/*, src/project/Resources/public/js/jquery.validate.js
After many tests, I found the correct syntax for the sonar.exclusions clause:
sonar.exclusions=src/Application/Sonata/**/*,src/Simuleo5BOBundle/Resources/public/js/lib/**/*,src/Simuleo5BOBundle/Resources/public/js/jquery.validate.js
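Applied to the sonar-project.properties from the question, that would be the following single line; the difference from the non-working version appears to be that there are no spaces around the commas:
sonar.exclusions=src/Application/Sonata/**/*,src/project/Resources/public/js/lib/**/*,src/project/Resources/public/js/jquery.validate.js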
Correctly setting exclusions from properties is difficult to get right, which is why it's not documented. Instead, you should use the UI to set your exclusions.
Sometimes one needs to view a particular configuration value.
Let's say in nginx.conf I have a line like passenger_max_pool_size 69;.
So is there a way to output this value to the terminal?
Imagine something like this:
$ passenger-config-value passenger_max_pool_size
$ 69
Try passenger-status --show=xml. It will show you the internal state as an XML dump, including memoized configuration options.
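If you only want that one value in the terminal, you can filter the dump, for example like this (the exact element name in the XML is an assumption here, so grep for a fragment of the option name):
passenger-status --show=xml | grep -i pool_size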
Log rotation for a Plone installation would be a nice feature. What are the current best practices for integrating log rotation into Plone?
I found this article: http://encolpe.wordpress.com/2010/06/17/how-to-get-log-files-rotate-in-zope-with-buildout/ but as there is no documentation on plone.org, I'd like to ask the community for the known good practices, so people don't fill up their hard disks.
ZConfig has support for the standard library RotatingFileHandler and TimedRotatingFileHandler. Taking an example from the ZConfig tests:
<eventlog>
  <logfile>
    path /path/to/file.log
    level debug
    when D
    interval 3
    old-files 11
  </logfile>
</eventlog>
This will roll over the logs every three days, keeping 11 old files.
You place these config snippets in your buildout using the event-log-custom / access-log-custom parameters of your instance recipe, plone.recipe.zope2instance.
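For example, in the buildout it would look roughly like this (the part name and log path are illustrative; the ZConfig snippet is the one above):
[instance]
recipe = plone.recipe.zope2instance
# ... other instance options ...
event-log-custom =
    <logfile>
      path ${buildout:directory}/var/log/instance.log
      level debug
      when D
      interval 3
      old-files 11
    </logfile>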
Similar to what Laurence said above, but this keeps the size under 10mb and saves only 1 old file.
<eventlog>
  level INFO
  <logfile>
    path /path/to/plone4/var/log/client1.log
    max-size 10mb
    old-files 1
  </logfile>
</eventlog>
plone.recipe.zope2instance can generate this now. For example, you can specify the following options:
event-log-max-size = 10mb
event-log-old-files = 3
Here's what we do; it's simple but it works:
In your buildout you add this part:
[logrotate]
recipe = collective.recipe.template
input = ${buildout:directory}/templates/logrotate.conf
output = ${buildout:directory}/etc/logrotate.conf
And in templates/logrotate.conf
rotate 4
weekly
create
compress
delaycompress
missingok
${buildout:directory}/var/log/instance1.log ${buildout:directory}/var/log/instance1-Z2.log {
    sharedscripts
    postrotate
        /bin/kill -USR2 $(cat ${buildout:directory}/var/instance1.pid)
    endscript
}
${buildout:directory}/var/log/instance2.log ${buildout:directory}/var/log/instance2-Z2.log {
    sharedscripts
    postrotate
        /bin/kill -USR2 $(cat ${buildout:directory}/var/instance2.pid)
    endscript
}
Add whatever other log rotations you need. Then it's about linking /etc/logrotate.conf to the generated file.
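For example (paths and schedule are illustrative), either a cron entry that runs logrotate against the generated file:
0 3 * * * /usr/sbin/logrotate --state /path/to/buildout/var/logrotate.status /path/to/buildout/etc/logrotate.conf
or a symlink so the system logrotate picks it up:
ln -s /path/to/buildout/etc/logrotate.conf /etc/logrotate.d/plone-buildout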
Mikko, you should ask me directly ;)
My blog article is still good, and this feature is still undocumented in Zope2. It can be used with Plone 4 and Plone 5. The extension iw.rotatezlogs was designed for Plone 3 only.
Using logrotate is not a good option because it doesn't handle log writes that happen during rotation. It can make your instance fail silently and keep writing logs in RAM until you restart the instance.
I've been using iw.rotatezlogs since at least early Plone 3 very successfully. I see from your link that I should no longer need iw.rotatezlogs.
I don't like to rely on logrotate, since it's not usable on the one Windows server I have to deploy to.
As near as I can tell, the pure-Zope solution (which I still haven't tested) doesn't do logfile compression (at least I can't see it in ZConfig/components/logger/handlers.xml, which is where I think it would be defined), so I suspect I'll be staying with iw.rotatezlogs.
I'm trying to submit a binary file, in this case an Excel file, from my local server (a Solaris server with mainframe rehosting software) to a destination server (a mainframe) using Connect:Direct NDM.
Here are the environment values I set:
SODETFL "DetailedReport.xls"
SODDETNDM "FIN.REPORT(+1)"
TDCOPTS ":DATATYPE=BINARY:XLATE=NO:STRIP.BLANKS=NO"
Here is the NDM configuration I use:
ASSGNDD ddname='SYSIN' type='INSTREAM' << !
SIGNON 00260005
SUBMIT PROC=COPYFILE - 00270005
JOBNAME=JOB00001 - 00280005
PNODE=SERVER001 - 00290005
SNODE=NDMIDS - 00300005
SNODEID=(xxxxxx,xxxxxx) - 00310005
HOLD=NO - 00320005
NOTIFY=CCACTD - 00330005
NODE=, - 00360005
DSN1=${SODDETFL} - 00370005
DSN2=${SODDETNDM} -
DCBINFO='dcb=(dsorg=ps, recfm=vb, lrecl=1504)' - 00385005
DISP1=NEW, - 00390005
DISP2=CATLG,DELETE - 00400005
UNIT=BATCH - 00410005
SYSOPTS=${TDCOPTS} - 00440005
AEFAJOB=PSIAPNB5
SEL PROC WHERE (QUEUE=A) TABLE 00450005
SIGNOFF 00460005
I'm able to send text files via NDM all day long, no problems there. However, it seems that binary is a bit more difficult. When I try with the above configuration, I get the following error:
Completion Code => 8
Message Id => XCPS009I
Short Text => Read buffer too small. Possibly src reclen > dest reclen.
Ckpt=>Y Lkfl=>N Rstr=>N Xlat=>Y Scmp=>N Ecmp=>Y Ecpr=>0.00 CRC=>N Zlvl=>1 win=>13 Zmem=>4
Can anyone shed some light on how I can go about submitting a binary file via NDM?
Off the cuff...
Try changing RECFM=VB to RECFM=U and specify a BLKSIZE= instead of a LRECL=
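For example, the DCB line in the submit proc might become something like the following (the BLKSIZE value is only an illustrative half-track figure, not something from your shop's standards; keep the binary TDCOPTS/SYSOPTS you already have):
DCBINFO='dcb=(dsorg=ps, recfm=u, blksize=27998)' -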
This is really not all that different from how executable load modules are stored on the mainframe, except you don't want the file to be a PDS dataset. I'm not at my office right now, but I think I have some examples of NDM jobs that transmit load modules that I can look up if this suggestion doesn't work; I think it will, though.
Give this suggestion a shot, and if it still doesn't fly, let me know.