How do I change Scrapy to crawl onion links? - web-scraping

I am trying to crawl a darknet forum using Scrapy, and I have tried changing the Scrapy settings.
I am using Python 2.7.
Can someone help me?
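One possible approach, as a minimal sketch under stated assumptions: as far as I know Scrapy only speaks to HTTP proxies, so .onion requests are commonly routed through a local HTTP proxy (for example Privoxy) that forwards to a running Tor client. The proxy address 127.0.0.1:8118 and the class name OnionProxyMiddleware below are assumptions for illustration, not something taken from the question. The middleware only sets request.meta["proxy"], which Scrapy's built-in HttpProxyMiddleware honors.

```python
# Sketch of a Scrapy downloader middleware that sends .onion requests through
# a local HTTP proxy. ASSUMPTION: Tor is running and an HTTP proxy such as
# Privoxy listens on 127.0.0.1:8118 and forwards traffic to Tor.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

TOR_HTTP_PROXY = "http://127.0.0.1:8118"  # assumed proxy address


class OnionProxyMiddleware(object):
    """Hypothetical middleware: tag .onion requests with the Tor-facing proxy."""

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ""
        if host.endswith(".onion"):
            # Scrapy's HttpProxyMiddleware reads request.meta["proxy"].
            request.meta["proxy"] = TOR_HTTP_PROXY
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered in settings.py under DOWNLOADER_MIDDLEWARES with an order below 750 (the default position of HttpProxyMiddleware), using whatever module path your project has, e.g. a hypothetical "myproject.middlewares.OnionProxyMiddleware".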

Related

Advance Import Addon of Newsletter plugin not visible

I'm using WordPress to build my website, and I've used the most popular plugin, called Newsletter. I'm using its free version for now. This plugin provides an option to import CSV files. To use the import function I need the Advance Import Addon, which is free, but I'm not able to see it even after installing it. I've already followed the official Newsletter documentation and video tutorial. Below is the video tutorial screenshot.
Here is my dashboard screenshot.
Please let me know what I am missing here.

Is there any way to convert my WordPress website to Wix?

Today I'm very concerned about one thing. Can anyone tell me whether I can convert my WordPress website to a Wix website? Please help me with your suggestions.
Thanks in advance!
As far as I'm aware Wix will only automatically import blog posts from WordPress - I think you'll have to manually import the rest of your content into Wix from your WordPress website.
Of course, this does beg the question of "why would a store for WordPress themes and plugins not use WordPress?"...

Crawling Twitter, LinkedIn using Nutch

I have been trying to use Nutch (Nutch-0.9) to crawl Twitter and LinkedIn data.
However, when I try to crawl Twitter the regex filter doesn't seem to work. My regex-filter file has
+^https://([a-z0-9]*.)twitter.com/a
What I want is to crawl only those URLs that follow the above pattern, but I end up with URLs such as https://twitter.com/document.
As for the LinkedIn part, it always times out whenever I try to crawl it. The robots.txt on LinkedIn says that you need to email them to get your crawler whitelisted, but they never respond.
Appreciate your help!
If you want to crawl only these specific URLs, you should also include the following line:
-.*
This rule will exclude all other URLs!
Also, if you want to crawl Twitter or LinkedIn, you can use dedicated libraries like twit4j or linkedin-j!
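To see why the extra reject rule matters, here is a small Python sketch that imitates a first-match-wins URL filter, which is how I understand Nutch's regex URL filter to behave. It is not Nutch code. The accepts helper and the adjusted pattern (dots escaped, subdomain made optional) are my own assumptions, as is the idea that a catch-all accept rule such as "+." was still present further down the asker's file.

```python
import re

# Rules in file order: a leading "+" accepts, a leading "-" rejects.
# ASSUMPTION: the first rule whose regex is found in the URL decides.
ORIGINAL_RULES = [
    r"+^https://([a-z0-9]*.)twitter.com/a",  # pattern from the question
    r"+.",                                   # assumed leftover catch-all accept
]
FIXED_RULES = [
    r"+^https://([a-z0-9]*\.)?twitter\.com/a",  # adjusted pattern (assumption)
    r"-.*",                                     # reject everything else
]


def accepts(url, rules):
    """Return True if the first matching rule is an accept rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as filtered out


if __name__ == "__main__":
    for url in ("https://twitter.com/about", "https://twitter.com/document"):
        print(url, accepts(url, ORIGINAL_RULES), accepts(url, FIXED_RULES))
```

With the assumed catch-all in place, https://twitter.com/document slips through the original rules, while the fixed rule set rejects it and still accepts https://twitter.com/about.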
As far as I know, Nutch does not support crawling Twitter and LinkedIn data. To crawl Twitter data you should use the Twitter API; check this one: http://twitter4j.org/en/. To crawl LinkedIn data, you could have a look at https://github.com/pondering/scrapy-linkedin.
Hope this helps

How to crawl other websites with Drupal?

I am looking for a solution to crawl specified websites with Drupal and make their content visible in my search after the crawling process.
Any ideas about that?
So far I have tried the Drupal Apache Solr module, which works very well as a search, but I don't know how to extend it so that the index is filled with content from other sites.
Try using the Feeds Crawler module.

Difficulties in migration from BlogEngine.Net to WordPress

I have to move a client's BlogEngine.Net site to WordPress. People have asked why I would want to do that; I am just doing what has been requested of me. I have managed to get the BlogML file exported out of BlogEngine.Net. I have gone over it and it looks like everything is there. I have also retrieved all images from posts and put them in a zip.
My questions, in general: what difficulties have you experienced doing this? What inconsistencies might I expect during this process?
Hosting: We are going to use DreamHost. Would basic LAMP hosting with WordPress installed be enough? I know it's enough for the site, but will the BlogML import raise anything that needs more rights on the machine than basic hosting would provide?
BlogML Import: What is the recommended tool to import the XML file into the WordPress site? What is tried and true? I found this tool by Aaron Learch and this tool by Wayne John. Are these the only two options?
BlogML Import: What difficulties may arise with the import process?
Here is what I have learned on the interwebs that may arise:
Possible memory size errors.
Permissions issues.
Edits to the BlogML file, as well as manual edits after import that are still to be found... That is kinda scary :0
So anything you may know that I have not listed would be extremely helpful.
Thank you
One thing you may run into is a problem with internal links. There are plugins you can get, Broken Link Checker and Redirection, which will help you fix these problems.
Another problem you might have is with any hosted files, images, etc. If you back these up as well from your current server's images/files directories, you may be able to replace them.
HTH
