I have been trying to use Nutch (Nutch-0.9) to crawl Twitter and LinkedIn data.
However, when I try to crawl Twitter, the regex filter doesn't seem to work. My regex-filter file has:
+^https://([a-z0-9]*.)twitter.com/a
What I want is to crawl only those URLs that follow the above pattern, but I end up with URLs such as https://twitter.com/document.
As for the LinkedIn part, it always shows a timeout whenever I try to crawl it. The robots.txt on LinkedIn says that you need to email them to get your crawler whitelisted, but they never respond.
Appreciate your help!
If you want to crawl only these specific URLs, you should also include the following line, after your + rule:
-.*
This rule will exclude all other URLs!
Also, if you want to crawl Twitter or LinkedIn, you can use dedicated client libraries such as Twitter4J or linkedin-j!
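To see why URLs such as https://twitter.com/document can get through, the rule order can be simulated outside Nutch. This is a minimal Python sketch, not Nutch code. It assumes that rules are tried top to bottom, that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is rejected. The names make_filter, leaky and strict are hypothetical, the catch-all "+." rule is an assumption about what else is in the filter file, and the tightened pattern (escaped dots, optional subdomain) is a suggestion rather than something from the question.

```python
import re

def make_filter(rules):
    """Build an accept(url) function from ordered (sign, pattern) rules."""
    compiled = [(sign, re.compile(pattern)) for sign, pattern in rules]

    def accept(url):
        for sign, regex in compiled:
            if regex.search(url):      # partial match; the first hit decides
                return sign == "+"
        return False                   # assumption: no matching rule means reject

    return accept

# The asker's rule followed by a catch-all accept (assumed to be left in the file).
leaky = make_filter([
    ("+", r"^https://([a-z0-9]*.)twitter.com/a"),
    ("+", r"."),
])

# Escaped dots, optional subdomain, then an explicit catch-all reject.
strict = make_filter([
    ("+", r"^https://([a-z0-9]*\.)?twitter\.com/a"),
    ("-", r".*"),
])

for url in ("https://twitter.com/about", "https://twitter.com/document"):
    print(url, leaky(url), strict(url))
```

Note that the pattern from the question requires at least one character between https:// and twitter.com, so on its own it never matches a bare https://twitter.com/... URL. In this sketch those URLs are accepted only by the later catch-all rule.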
As far as I know, Nutch does not support crawling Twitter and LinkedIn data. For crawling Twitter data you should use the Twitter API; check this one: http://twitter4j.org/en/. For crawling LinkedIn data, you could have a look at https://github.com/pondering/scrapy-linkedin.
Hope this helps
The website is a WordPress site and it was hit by an XSS attack. I've already installed Wordfence and MalCare to scan for and remove the malicious code and files, but the Google search results still show spam links under the main result. Most of those pages lead to 404 pages, and I was told Googlebot would remove them automatically, but the issue still remains after 4 days. If any expert has a solution or advice, I would much appreciate it.
You can try resubmitting your sitemap to Google in Search Console.
Otherwise, similarly, try using the Google Removals tool to temporarily remove these links; hopefully they will be cleared from the search results by the time the links are restored.
Tutorial: https://support.google.com/webmasters/answer/9689846?hl=en
I'm looking for an easy way to share through LinkedIn without all the hassle of OAuth 2.0, which doesn't seem to be required on other pages that use this kind of sharing (they didn't require anything from me; I can share straight away).
Straight to the issue:
this one works: https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Frefair.me
this one doesn't: https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Frefair.me%2Fjob%2F494
It seems that I can't get sharing to work for anything beyond the main domain. For comparison, here is a link from another site that goes deeper and is still shareable: https://www.linkedin.com/shareArticle?mini=true&url=https://bulldogjob.pl/companies/jobs/2043-programista-java-warszawa-bms-sp-z-o-o&title=Programista+Java&summary=&source=https://bulldogjob.pl
I also tested with and without the source and summary query params. Has anyone had this issue?
LinkedIn uses the Open Graph protocol (http://ogp.me/) to determine how pages are shared in LinkedIn.
You may also use the LinkedIn Post Inspector (https://www.linkedin.com/post-inspector/) tool to debug how various pages would be shared in LinkedIn.
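Before reaching for the Post Inspector, it can help to check locally which Open Graph tags a page actually emits. The following is a rough sketch using only the Python standard library. The class OpenGraphParser, the function missing_og_tags, the sample HTML and the list of "required" tags are all assumptions made for illustration, not an official LinkedIn checklist.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            self.tags.setdefault(prop, attr["content"])  # keep the first occurrence

def missing_og_tags(html, required=("og:title", "og:image", "og:description", "og:url")):
    """Return the names from `required` that the HTML does not define."""
    parser = OpenGraphParser()
    parser.feed(html)
    return [name for name in required if name not in parser.tags]

# Hypothetical page source with only two of the four tags present.
sample = """<html><head>
<meta property="og:title" content="Example job posting">
<meta property="og:url" content="https://example.com/job/1">
</head><body></body></html>"""

print(missing_og_tags(sample))
```

This only inspects static HTML. Tags injected by client-side JavaScript would not be seen by this parser.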
I decoded your URL so I could get a cleaner look...
https://www.linkedin.com/sharing/share-offsite/?url=https://refair.me/job/494
So, let's try to visit your URL: https://refair.me/job/494 . The webpage you are sharing DOES NOT LOAD.
Is your site down for everyone? Yes, your site is down for everyone.
In order to share a URL on LinkedIn, you must fulfill the following minimum requirements:
The URL must load.
If you just want to test out the API, try using wikipedia.org or google.com as test pages.
Surprisingly, the old refair.me URL by itself works fine in LinkedIn, but that could be from some internal cache, from way back in the day when the page once did work. It certainly does not do so anymore.
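For reference, the encoding and decoding of the share link discussed above can be sketched in Python with the standard library. The helper names build_share_link and shared_page are hypothetical; the endpoint is the share-offsite URL from the question.

```python
from urllib.parse import quote, urlsplit, parse_qs

SHARE_ENDPOINT = "https://www.linkedin.com/sharing/share-offsite/"

def build_share_link(page_url):
    """Percent-encode the whole page URL and attach it as the url parameter."""
    return f"{SHARE_ENDPOINT}?url={quote(page_url, safe='')}"

def shared_page(share_link):
    """Decode a share link back to the page URL it points at (None if absent)."""
    query = parse_qs(urlsplit(share_link).query)
    return query.get("url", [None])[0]

link = build_share_link("https://refair.me/job/494")
print(link)
print(shared_page(link))
```

Building the link correctly is only half of it: the decoded page URL still has to load for the share to work.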
For some reason sharing links on LinkedIn from my client's site does not work.
I've checked the specifications on https://www.linkedin.com/help/linkedin/answer/46687/making-your-website-shareable-on-linkedin?lang=en and it looks like all the og meta tags are correct. Sharing on other social media works without problems. It's only on LinkedIn that the Open Graph data is not picked up.
Here is a sample URL which does not work on LinkedIn:
https://nomadandvillager.com/bestemmingen/kenia/mbara/vrouwenbesnijdenis/
Am I missing something?
Would you be able to share a bit more on how you're getting the tags to render on Facebook and the like?
I encountered the same problem (My site runs on Angular), but my approach was to redirect the LinkedInBot via .htaccess.
Eventually, I gave up and wrote a custom share button with LinkedIn's REST API.
With it, you will be able to specify the details of the share, and avoid the messy workarounds for LinkedIn.
Direct Sharing on LinkedIn
Sharing via REST API
Original link: http://blog.crazy.technology/post/Clash-of-Clans---How-to-use-the-Web-API-570dd2b2
I was looking into ways in which I can decode Google ad-click URLs to the actual website they redirect to, via code...
I have a big DB of URLs like the following:
https://www.google.co.in/aclk?sa=L&ai=DChcSEwjY9KL2m4fRAhXTCioKHXEWBN0YABAK&sig=AOD64_3p0RvGkZj0fn81FSXIKtQ9XPVBvg&ctype=5&q=&ved=0ahUKEwialZ72m4fRAhVKwI8KHbGmDB8QvhcIKg&adurl=
https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwjY9KL2m4fRAhXTCioKHXEWBN0YABAM&ohost=www.google.co.in&cid=CAASIuRoPu3Xxj7yyeUtRHLYBy-5952U-NXdaW3ftj91LB2rPAQ&sig=AOD64_0ksuGT2UtbiAEScV_lASVCVh7eFg&ctype=5&q=&ved=0ahUKEwialZ72m4fRAhVKwI8KHbGmDB8QvhcILw&adurl=
http://www.google.com/aclk?gclid=...
I am searching for methods to determine what the target website is. Any help appreciated.
The only way I know to do it is to request the URL, for example with PHP's file_get_contents.
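One cheap check before making any HTTP request is to look at the adurl query parameter, in case a click URL carries its target there. This is a small Python sketch under that assumption; ad_target and both sample URLs are hypothetical. The URLs in the question end with an empty adurl=, so for those this returns None and the target would still have to come from requesting the URL.

```python
from urllib.parse import urlsplit, parse_qs

def ad_target(click_url):
    """Return the decoded adurl parameter, or None if it is absent or empty."""
    query = parse_qs(urlsplit(click_url).query)  # blank values are dropped by default
    values = query.get("adurl")
    return values[0] if values else None

# Hypothetical click URLs: one with a target in adurl, one with it left empty.
with_target = "https://www.googleadservices.com/pagead/aclk?sa=L&adurl=https%3A%2F%2Fexample.com%2Flanding"
without_target = "https://www.google.co.in/aclk?sa=L&ctype=5&q=&adurl="

print(ad_target(with_target))
print(ad_target(without_target))
```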
I need help solving the following issue:
I need to validate the URLs that the Google search engine has cached for a particular site. If a URL returns a 404, or the page does not render some necessary HTML elements (and is therefore considered broken), I need to log that URL and later 301 redirect it to the correct URL. I know PHP and a little bit of Python, but I'm not sure what approach to use to scrape all URLs from the search engine results for a given site.
http://simplehtmldom.sourceforge.net/ - a simple HTML parser. There is an example on that page; not sure if it still works with Google's instant search etc.
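Since the question mentions Python, here is a minimal sketch of the validation step only, using the standard library. It does not collect URLs from Google; it assumes you already have a list of URLs from some other source. The function names and the idea of detecting "necessary HTML elements" by plain substring markers are assumptions for illustration.

```python
import urllib.request
import urllib.error

def is_broken(status, html, required_markers):
    """A page counts as broken on a non-200 status or a missing marker."""
    if status != 200:
        return True
    return any(marker not in html for marker in required_markers)

def fetch(url, timeout=10):
    """Return (status, body); HTTP errors such as 404 are reported, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as error:
        return error.code, ""

def broken_urls(urls, required_markers):
    """Fetch each URL and keep the ones that should be logged for a 301."""
    return [url for url in urls if is_broken(*fetch(url), required_markers)]
```

A call such as broken_urls(my_urls, ['<div id="content"']) would return the URLs to log, where my_urls and the marker are placeholders. Network failures other than HTTP errors (DNS, timeouts) are not handled here and would raise.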