I am playing with Apache Nutch and I am crawling a website successfully. I want to make a clone of a website with Nutch so that I can access the crawled webpages offline. Is there a way to do that? I'm looking for something like an endpoint that receives a URL and returns the content of the webpage as if I were GETting the URL with curl.
I know there are more specialized solutions like HTTrack, but I want to know if it is possible to use Nutch to do this.
I think you don't need this anymore, but I am posting my answer anyway. Of course it's possible with Apache Nutch. After injecting seed URLs and generating segments of URLs to fetch, you execute the fetch command -
$ bin/nutch fetch -all
The HBase webpage table structure (it is conventional to use HBase as the storage backend for Nutch 2.x) will look like this -
webpage : {
key : "com.example.dev:http/",
f : {
bas : {
timestamp : 1293732801833,
value : "http://dev.example.com/"
},
cnt : {
timestamp : 1293732801833,
value : "DOCTYPE html PUBLIC "-//W3C//DTD X...rest of page content"
},
fi : {
timestamp : 1293676557658,
value : "\x00'\x8D\x00"
},
prot : {
timestamp : 1293732801833,
value : "\x02\x00\x00"
},
st : {
timestamp : 1293732801833,
value : "\x00\x00\x00\x02"
},
ts : {
timestamp : 1293676557658,
value : "\x00\x00\x01-5!\x9D\xE5"
},
typ : {
timestamp : 1293732801833,
value : "application/xhtml+xml"
}
},
h : {
Cache-Control : {
timestamp : 1293732801833,
value : "private"
},
Content-Type : {
timestamp : 1293732801833,
value : "text/html; charset=UTF-8"
},
Date : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 18:13:21 GMT"
},
ETag : {
timestamp : 1293732801833,
value : "40bdf8b9-8c0a-477e-9ee4-b19995601dde"
},
Expires : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 18:13:21 GMT"
},
Last-Modified : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 15:01:20 GMT"
},
Server : {
timestamp : 1293732801833,
value : "GSE"
},
Set-Cookie : {
timestamp : 1293732801833,
value : "blogger_TID=130c0c57a66d0704;HttpOnly"
},
X-Content-Type-Options : {
timestamp : 1293732801833,
value : "nosniff"
},
X-XSS-Protection : {
timestamp : 1293732801833,
value : "1; mode=block"
}
},
mk : {
_injmrk_ : {
timestamp : 1293676557658,
value : "y"
},
_gnmrk_ : {
timestamp : 1293732629430,
value : "1293732622-2092819984"
},
_ftcmrk_ : {
timestamp : 1293732801833,
value : "1293732622-2092819984"
}
},
mtdt : {
_csh_ : {
timestamp : 1293676557658,
value : "\x80\x00\x00"
}
},
s : {
s : {
timestamp : 1293676557658,
value : "\x80\x00\x00"
}
}
}
The cnt column under the f column family contains the whole HTML content of the page, so you can read it back to reproduce the pages offline.
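For illustration, here is a minimal sketch of reading that content back with the classic HBase Java client. This is an assumption-laden example: the table name "webpage" and the reversed-URL row key come from the dump above, and the older HTable API is used, so adjust it to your HBase/Nutch version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PageContentReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Nutch 2.x keys the webpage table by reversed-domain URL (table name assumed).
        HTable table = new HTable(conf, "webpage");
        Get get = new Get(Bytes.toBytes("com.example.dev:http/"));
        Result result = table.get(get);
        // f:cnt holds the raw page bytes; f:typ holds the detected content type.
        byte[] content = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("cnt"));
        System.out.println(Bytes.toString(content));
        table.close();
    }
}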
Moreover, you can write custom plugins implementing the Parser or ParseFilter interfaces to capture the whole content of the page. The code is very straightforward.
Related
I'm trying to add a WireMock stub that matches if the reason field in a JSON request body is either non-existent OR an empty string.
The stub I have at the moment is:
{
"id" : "e331007e-3e6d-4660-b575-b04e774e88c6",
"request" : {
"urlPathPattern" : "/premises/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/bookings/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/non-arrivals",
"method" : "POST",
"bodyPatterns" : [ {
"matchesJsonPath" : "$.[?(#.reason === '' || #.reason == null)]"
} ]
},
"response" : {
"status" : 400,
"jsonBody" : {
"type" : "https://example.net/validation-error",
"title" : "Invalid request parameters",
"code" : 400,
"invalid-params" : [ {
"propertyName" : "reason",
"errorType" : "blank"
} ]
},
"headers" : {
"Content-Type" : "application/problem+json;charset=UTF-8"
}
},
"uuid" : "e331007e-3e6d-4660-b575-b04e774e88c6"
}
It matches if the reason is '', but not if reason is not present. Any ideas?
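One possible workaround, sketched with WireMock's Java DSL rather than stub JSON: register two stubs that return the same response, one matching an empty reason and one matching an absent reason. This is a sketch, assuming an absent() sub-matcher inside matchingJsonPath matches when the path yields nothing; the URL regex and response body are trimmed for brevity.

import static com.github.tomakehurst.wiremock.client.WireMock.*;
import com.github.tomakehurst.wiremock.client.ResponseDefinitionBuilder;

// Shared 400 response used by both stubs.
ResponseDefinitionBuilder blankReason = aResponse()
        .withStatus(400)
        .withHeader("Content-Type", "application/problem+json;charset=UTF-8")
        .withBody("{\"title\":\"Invalid request parameters\"}");

// Stub 1: "reason" is present but empty.
stubFor(post(urlPathMatching("/premises/[^/]+/bookings/[^/]+/non-arrivals"))
        .withRequestBody(matchingJsonPath("$.reason", equalTo("")))
        .willReturn(blankReason));

// Stub 2: "reason" is missing from the body entirely.
stubFor(post(urlPathMatching("/premises/[^/]+/bookings/[^/]+/non-arrivals"))
        .withRequestBody(matchingJsonPath("$.reason", absent()))
        .willReturn(blankReason));

Both stubs return the same response, so either request shape produces the 400.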
Can someone help determine why a query that returns hits in Kibana does not return them when issued through the Elasticsearch Java REST client API?
I am currently using
Elasticsearch/Kibana: 7.16.2
Elasticsearch Java Client: 6.6.2
I am reluctant to upgrade the Java client because of the numerous geometry-related updates that would be needed.
Fields:
mydatefield: timestamp of the document
category: keyword field
We have 1,000 or more records per category per day.
I want an aggregation that buckets categories by day and includes the first and last "mydatefield" for each category.
This query works in Kibana
GET /mycategories/_search
{
"size":0,
"aggregations":{
"bucket_by_date":{
"date_histogram":{
"field":"mydatefield",
"format":"yyyy-MM-dd",
"interval":"1d",
"offset":0,
"order":{
"_key":"asc"
},
"keyed":false,
"min_doc_count":1
},
"aggregations":{
"unique_categories":{
"terms":{
"field":"category",
"size":10,
"min_doc_count":1,
"shard_min_doc_count":0,
"show_term_doc_count_error":false,
"order":[
{
"_count":"desc"
},
{
"_key":"asc"
}
]
},
"aggregations":{
"min_mydatefield":{
"min":{
"field":"mydatefield"
}
},
"max_mydatefield":{
"max":{
"field":"mydatefield"
}
}
}
}
}
}
}
}
The first record of the result: category1 and category2 for 2022-05-07, with min and max "mydatefield" for each category.
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 4,
"successful" : 4,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2593,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"bucket_by_date" : {
"buckets" : [
{
"key_as_string" : "2022-05-07",
"key" : 1651881600000,
"doc_count" : 2,
"unique_missions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "category1",
"doc_count" : 1,
"min_mydatefield" : {
"value" : 1.651967952E12,
"value_as_string" : "2022-05-07T13:22:17.000Z"
},
"max_mydatefield" : {
"value" : 1.651967952E12,
"value_as_string" : "2022-05-07T23:59:12.000Z"
}
},
{
"key" : "category2",
"doc_count" : 1,
"min_mydatefield" : {
"value" : 1.651967947E12,
"value_as_string" : "2022-05-07T03:47:23.000Z"
},
"max_mydatefield" : {
"value" : 1.651967947E12,
"value_as_string" : "2022-05-07T23:59:07.000Z"
}
}
]
}
},
I have successfully coded other, less complex aggregations without problems. However, I have not been able to get results with either an AggregationBuilder or a WrapperQuery; zero results are returned.
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
Before executing the query, I copy the SearchRequest.source() into Kibana, and it runs and returns the desired information.
Below is the AggregationBuilder code that seems to replicate my Kibana query but returns no results.
AggregationBuilder aggregation =
AggregationBuilders
.dateHistogram("bucket_by_date").format("yyyy-MM-dd")
.minDocCount(1)
.dateHistogramInterval(DateHistogramInterval.DAY)
.field("mydatefield")
.subAggregation(
AggregationBuilders
.terms("unique_categories")
.field("category")
.subAggregation(
AggregationBuilders
.min("min_mydatefield")
.field("mydatefield")
)
.subAggregation(
AggregationBuilders
.max("max_mydatefield")
.field("mydatefield")
)
);
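For reference, here is roughly how that aggregation would be attached and executed with the 6.x high-level REST client; the index name "mycategories" and the RestHighLevelClient variable client are assumptions based on the Kibana query above. Given that the empty response reports "_shards":{"total":0}, it is also worth double-checking that the SearchRequest actually targets the index.

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// size(0) mirrors the "size":0 of the Kibana query: aggregations only, no hits.
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
        .size(0)
        .aggregation(aggregation); // the builder shown above

SearchRequest searchRequest = new SearchRequest("mycategories").source(sourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);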
The issue I am facing: while updating user message details (such as whether the user has seen the message), Firebase creates nodes under previous message IDs, which results in empty messages in the chat app.
My chat workflow:
While sending a message to another user (using a Groupie RecyclerView), I create a separate reference with "Messages" as the parent node. It looks like this:
val ref = FirebaseDatabase.getInstance().getReference("/Messages/$fromId/$toId")
val toRef = FirebaseDatabase.getInstance().getReference("/Messages/$toId/$fromId")
fromId -- the current user
toId -- the recipient of the message
While sending messages, another function listens for messages from both ends by distinguishing the ChatFromItem and ChatToItem adapters. In the ChatToItem adapter I wrote the logic that picks up the seen status:
toRef.addListenerForSingleValueEvent(object : ValueEventListener {
    override fun onCancelled(p0: DatabaseError) {
        TODO("Not yet implemented")
    }

    override fun onDataChange(p0: DataSnapshot) {
        for (ds in p0.children) {
            val fromuserupdate = HashMap<String, String>()
            if (chatMessage.fromId != FirebaseAuth.getInstance().uid && chatMessage.toId == FirebaseAuth.getInstance().uid) {
                val Totoken = ds.child("tokey").value.toString()
                val Frommessageid = ds.child("fromkey").value.toString()
                val toUserChatRef = FirebaseDatabase.getInstance().getReference("users-messages").child(fromId).child(toId).child(Totoken)
                val fromUserref = FirebaseDatabase.getInstance().getReference("users-messages").child(toId).child(fromId).child(Frommessageid)
                fromuserupdate.put("messageseen", "true")
                fromUserref.updateChildren(fromuserupdate as Map<String, Any>).addOnCompleteListener { }
                toUserChatRef.updateChildren(fromuserupdate as Map<String, Any>).addOnCompleteListener { }
                fromUserref.removeEventListener(this)
                toUserChatRef.removeEventListener(this)
            }
        }
    }
})
When I send a new message to another user for the first time, I can update the seen status without issues. But when I send another message, the logic picks up the old message's database path and creates another node, which causes an empty value in the chat application. And whenever I go to another activity and return to ChatActivity, it creates an additional node with null.
Note that I store messageseen as a String, not a Boolean.
My database output:
"users-messages" : {
"4bgqdomQZlRIFnq9lHtKE78wyrv2" : {
"f4b3phpVJnTByNt2vgCKnKRuxc82" : {
"-M7nbxCZ3WwPGFcLRMGx" : {
"messageseen" : "true"
},
"-M7nbxCakllN4LvLsOQ0" : {
"fromId" : "f4b3phpVJnTByNt2vgCKnKRuxc82",
"fromkey" : "-M7nbxCZ3WwPGFcLRMGx",
"fulldate" : "21/05/2020",
"message" : "hi",
"messageseen" : "true",
"timespot" : " 01:16 AM",
"timestamp" : 1590003971,
"toId" : "4bgqdomQZlRIFnq9lHtKE78wyrv2",
"tokey" : "-M7nbxCakllN4LvLsOQ0"
},
"-M7neJ-qsFODYXbaZeHS" : {
"fromId" : "4bgqdomQZlRIFnq9lHtKE78wyrv2",
"fromkey" : "-M7neJ-qsFODYXbaZeHS",
"fulldate" : "21/05/2020",
"message" : "hello",
"messageseen" : "true",
"timespot" : " 01:26 AM",
"timestamp" : 1590004589,
"toId" : "f4b3phpVJnTByNt2vgCKnKRuxc82",
"tokey" : "-M7neJ-qsFODYXbaZeHT"
}
}
},
"f4b3phpVJnTByNt2vgCKnKRuxc82" : {
"4bgqdomQZlRIFnq9lHtKE78wyrv2" : {
"-M7nbxCZ3WwPGFcLRMGx" : {
"fromId" : "f4b3phpVJnTByNt2vgCKnKRuxc82",
"fromkey" : "-M7nbxCZ3WwPGFcLRMGx",
"fulldate" : "21/05/2020",
"message" : "hi",
"messageseen" : "true",
"timespot" : " 01:16 AM",
"timestamp" : 1590003971,
"toId" : "4bgqdomQZlRIFnq9lHtKE78wyrv2",
"tokey" : "-M7nbxCakllN4LvLsOQ0"
},
"-M7nbxCakllN4LvLsOQ0" : {
"messageseen" : "true"
},
"-M7neJ-qsFODYXbaZeHT" : {
"fromId" : "4bgqdomQZlRIFnq9lHtKE78wyrv2",
"fromkey" : "-M7neJ-qsFODYXbaZeHS",
"fulldate" : "21/05/2020",
"message" : "hello",
"messageseen" : "true",
"timespot" : " 01:26 AM",
"timestamp" : 1590004589,
"toId" : "f4b3phpVJnTByNt2vgCKnKRuxc82",
"tokey" : "-M7neJ-qsFODYXbaZeHT"
}
}
}
}
}
How can I avoid creating these additional nodes without touching the old values?
It sounds like you want more granular information in your code about which data was modified in the database. In that case you'll be better off using a ChildEventListener instead of a ValueEventListener.
With ChildEventListener you get called for each child that was added, updated, removed or moved, and you can then easily update the UI based on that. For example, if you only want to add new nodes from the database to your list, you'd do something like:
toRef.addChildEventListener(object : ChildEventListener {
    override fun onCancelled(p0: DatabaseError) {
        TODO("Not yet implemented")
    }

    override fun onChildAdded(snapshot: DataSnapshot, previousChildKey: String?) {
        val fromuserupdate = HashMap<String, String>()
        if (chatMessage.fromId != FirebaseAuth.getInstance().uid && chatMessage.toId == FirebaseAuth.getInstance().uid) {
            val Totoken = snapshot.child("tokey").value.toString()
            val Frommessageid = snapshot.child("fromkey").value.toString()
            ...
        }
    }

    // The remaining overrides (onChildChanged, onChildRemoved, onChildMoved) are required as well.
    ...
})
When I query /mycollections?ql=Select * where name='dfsdfsdfsdfsdfsdf' I get
{
"action" : "get",
"application" : "859e6180-de8a-11e4-9360-f1aabbc15f58",
"params" : {
"ql" : [ "Select * where name='dfsdfsdfsdfsdfsdf'" ]
},
"path" : "/mycollections",
"uri" : "http://localhost:8080/myorg/myapp/mycollections",
"entities" : [ {
"uuid" : "2ff8961a-dea8-11e4-996b-63ce373ace35",
"type" : "mycollection",
"name" : "dfsdfsdfsdfsdfsdf",
"created" : 1428577466865,
"modified" : 1428577466865,
"metadata" : {
"path" : "/mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35",
"connections" : {
"relations" : "/mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35/relations"
}
}
} ],
"timestamp" : 1428589309204,
"duration" : 53,
"organization" : "myorg",
"applicationName" : "myapp",
"count" : 1
}
Now if I query /mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35/relations I get the second entity
{
"action" : "get",
"application" : "859e6180-de8a-11e4-9360-f1aabbc15f58",
"params" : { },
"path" : "/mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35/relations",
"uri" : "http://localhost:8080/myorg/myapp/mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35/relations",
"entities" : [ {
"uuid" : "56a1185a-dec1-11e4-9ac0-e9343f86b604",
"type" : "secondcollection",
"name" : "coucou",
"created" : 1428588269141,
"modified" : 1428588269141,
"metadata" : {
"connecting" : {
"relations" : "/mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35/relations/56a1185a-dec1-11e4-9ac0-e9343f86b604/connecting/relations"
},
"path" : "/mycollections/2ff8961a-dea8-11e4-996b-63ce373ace35/relations/56a1185a-dec1-11e4-9ac0-e9343f86b604"
}
} ],
"timestamp" : 1428589668542,
"duration" : 51,
"organization" : "myorg",
"applicationName" : "myapp"
}
What I want is for Usergrid to nest the related entity directly in the first JSON response, instead of only providing its path, so that I only need to make a single HTTP request instead of two.
You cannot. Usergrid is not designed that way. You would need to write an extra wrapper REST endpoint to simulate a single response.
I'm not sure what DB you are using. If you are using a document DB like MongoDB, you can write a Node.js script to do this manipulation. Apigee has Volos.js; check whether it is possible to do this with scripting.
We work with two types of documents in Elasticsearch (ES): items and slots, where items are parents of slot documents.
We define the index with the following command:
curl -XPOST 'localhost:9200/items' -d @itemsdef.json
where itemsdef.json has the following definition
{
"mappings" : {
"item" : {
"properties" : {
"id" : {"type" : "long" },
"name" : {
"type" : "string",
"_analyzer" : "textIndexAnalyzer"
},
"location" : {"type" : "geo_point" },
}
}
},
"settings" : {
"analysis" : {
"analyzer" : {
"activityIndexAnalyzer" : {
"alias" : ["activityQueryAnalyzer"],
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["trim", "lowercase", "asciifolding", "spanish_stop", "spanish_synonym"]
},
"textIndexAnalyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["word_delimiter_impl", "trim", "lowercase", "asciifolding", "spanish_stop", "spanish_synonym"]
},
"textQueryAnalyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["trim", "lowercase", "asciifolding", "spanish_stop"]
}
},
"filter" : {
"spanish_stop" : {
"type" : "stop",
"ignore_case" : true,
"enable_position_increments" : true,
"stopwords_path" : "analysis/spanish-stopwords.txt"
},
"spanish_synonym" : {
"type" : "synonym",
"synonyms_path" : "analysis/spanish-synonyms.txt"
},
"word_delimiter_impl" : {
"type" : "word_delimiter",
"generate_word_parts" : true,
"generate_number_parts" : true,
"catenate_words" : true,
"catenate_numbers" : true,
"split_on_case_change" : false
}
}
}
}
}
Then we add the child document definition using the following command:
curl -XPOST 'localhost:9200/items/slot/_mapping' -d @slotsdef.json
Where slotsdef.json has the following definition:
{
"slot" : {
"_parent" : {"type" : "item"},
"_routing" : {
"required" : true,
"path" : "parent_id"
},
"properties": {
"id" : { "type" : "long" },
"parent_id" : { "type" : "long" },
"activity" : {
"type" : "string",
"_analyzer" : "activityIndexAnalyzer"
},
"day" : { "type" : "integer" },
"start" : { "type" : "integer" },
"end" : { "type" : "integer" }
}
}
}
Finally we perform a bulk index with the following command:
curl -XPOST 'localhost:9200/items/_bulk' --data-binary @testbulk.json
Where testbulk.json holds the following data:
{"index":{"_type": "item", "_id":35}}
{"location":[40.4,-3.6],"id":35,"name":"A Name"}
{"index":{"_type":"slot","_id":126,"_parent":35}}
{"id":126,"start":1330,"day":1,"end":1730,"activity":"An Activity","parent_id":35}
Through the ES Head plugin we can see that the definitions seem to be OK. We test the analyzers to check that they have been loaded and work, and both documents appear in the ES Head browser view. But if we try to retrieve the child document through the API, ES responds that it does not exist:
$ curl -XGET 'localhost:9200/items/slot/126'
{"_index":"items","_type":"slot","_id":"126","exists":false}
When we import 50 documents, all parent documents can be retrieved through the API, but only SOME of the requests for child elements get a successful response.
My guess is that it has something to do with how docs are distributed across shards by the routing, which is not clear to me.
Any clue on how to retrieve individual child documents? ES Head shows they have been stored, but HTTP GETs to localhost:9200/items/slot/XXX randomly respond with "exists":false.
The child documents use the parent's id for routing. So, in order to retrieve a child document, you need to specify the parent id in the routing parameter of your query:
curl "localhost:9200/items/slot/126?routing=35"
If parent id is not available, you will have to search for the child documents:
curl "localhost:9200/items/slot/_search?q=id:126"
or switch to an index with a single shard.
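For the single-shard option, a minimal sketch of creating the index (same name as above; the mappings from itemsdef.json would still need to be supplied, in this body or added afterwards):

curl -XPOST 'localhost:9200/items' -d '{"settings" : {"index" : {"number_of_shards" : 1}}}'

With a single shard, every parent and child lands on the same shard, so GETs by id work without a routing parameter.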