Scrapy: decode requests.seen hashes from crawling history - web-scraping

I have a Scrapy project that saves its crawl state in a JOBDIR so that already-seen URLs are not crawled again.
My problem is that sometimes a particular URL is crawled badly for some reason, and I need to fix the spider and run it again. Of course, I do not want to re-crawl all URLs, only the specific one that caused problems.
I would like to locate the URL in requests.seen and simply delete it, but that is not possible because all the URLs are encoded.
How can I decode the requests.seen file back to the original URLs?
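For context: the requests.seen file written by Scrapy's default RFPDupeFilter contains one request fingerprint (a SHA1 hash) per line, so it cannot be decoded back into URLs. What you can do instead is recompute the fingerprint of the problem URL and delete that line. A minimal sketch, assuming the default dupefilter and the legacy request_fingerprint helper (the exact fingerprint function depends on your Scrapy version); the path and URL below are placeholders:

# Sketch: drop one URL's fingerprint from JOBDIR/requests.seen so that only
# that URL gets re-crawled on the next run.
# Assumes the default RFPDupeFilter; request_fingerprint is the legacy helper
# (newer Scrapy versions expose scrapy.utils.request.fingerprint instead, and
# the hash it produces may differ).
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

SEEN_FILE = "crawls/my-job/requests.seen"         # hypothetical JOBDIR path
BAD_URL = "https://example.com/page-that-failed"  # the URL you want re-crawled

fingerprint = request_fingerprint(Request(BAD_URL))  # hex SHA1 string

with open(SEEN_FILE) as f:
    lines = f.readlines()

with open(SEEN_FILE, "w") as f:
    f.writelines(line for line in lines if line.strip() != fingerprint)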

Related

WordPress Export/Translate/Import Single Pages

Back story:
My company has recently expanded into other countries and requires the site to have multiple translated versions, with users redirected based on their geolocation. I manage our public-facing website as part of the marketing department (no other developers on the team but me). Our website is a custom WordPress template that I inherited, originally developed by a third-party agency. Our global parent company has a relationship with a company called Language Wire, who will take files in many formats, translate them by hand to ensure quality, and then return them translated. In this case they will accept .html and .xml files. Before we settle into a workflow of sending and receiving files regularly, they wanted me to send some test files, which they will machine-translate quickly and send back so we can make sure the workflow will be smooth.
Issues:
I did a WordPress export (which I've never done before) and I believe it exported our entire site as an .xml file, but I'm not sure what this file really is. Their machine translation translated a lot of dependencies that needed to remain in English, which broke the file and prevented me from importing it back in. I'm assuming their agents who do manual translations will be able to navigate this better, so I'm going to ask them to do a manual translation as a test. However, I'm wondering if there is a better way to export single pages, or groups of single pages, and then import them back in after translation without losing the template, so that I can send them a smaller file to translate manually.
Does anyone have experience with this kind of situation? I'm also researching geolocation redirect plugins and have no idea how I'm going to organize the subdirectories for the translated versions of our site. My main issue right now, though, is exporting and importing WordPress pages. Thanks!

how can i remove weird links from my website

I have a website that was developed with WordPress. It was hacked; I removed the malicious files that I found on the server and got the site back, but when I search for the website on Google I find strange links that I can't open (see photo).
Remove the URLs individually from Google Webmaster Tools; it will take some time for them to be removed.
You should know that removing the malicious files doesn't mean you have cleaned up the site. In many cases a leftover file will recreate all the malicious files again, and sometimes it sits above your website's root folder. It's best to use a couple of plugins to scan the whole site directory, and then check a couple of days later whether the malicious files have returned (see the sketch after the plugin links below). If they do return, you are best off switching to a new server or reformatting if you have the option, as it can get quite expensive to pay someone to clean up your server.
First make sure you have completely cleaned up the hack. Then the hacked pages should get deindexed by Google, as they won't exist anymore. It's probably not viable to remove every single hacked page indexed in Google via Webmaster Tools, as there could be tens of thousands (depending on the hack).
Below are a couple good scanners.
https://wordpress.org/plugins/quttera-web-malware-scanner
https://wordpress.org/plugins/sucuri-scanner
I would also recommend some plugins for enhanced security moving forward.
https://wordpress.org/plugins/ninjafirewall
https://wordpress.org/plugins/better-wp-security
https://wordpress.org/plugins/vulnerable-plugin-checker
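On the earlier point about checking whether the malicious files come back: one way to do that, assuming you can run Python on the server, is to record a hash baseline of the site directory and compare against it a few days later. A minimal sketch; the webroot and baseline paths are placeholders:

# Sketch: baseline-hash the site directory and report new or changed files later.
# Run once to create the baseline, run again a few days later to see what came back.
import hashlib, json, os

WEBROOT = "/var/www/html"           # hypothetical WordPress root
BASELINE = "/root/site_hashes.json" # hypothetical baseline location

def hash_tree(root):
    hashes = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                hashes[path] = hashlib.sha256(f.read()).hexdigest()
    return hashes

current = hash_tree(WEBROOT)
if not os.path.exists(BASELINE):
    with open(BASELINE, "w") as f:
        json.dump(current, f)
    print("Baseline written.")
else:
    with open(BASELINE) as f:
        baseline = json.load(f)
    for path, digest in current.items():
        if path not in baseline:
            print("NEW:", path)
        elif baseline[path] != digest:
            print("CHANGED:", path)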

Google Analytics shows me weird links for one of my visitors

I have a website which is registered with Google Analytics so I can see its statistics. The problem is that sometimes it shows me this link:
website.com/www.bndv521.cf/
or:
website.com/admin
I do not know if this is a hacker trying to hack me or something, but I assume nobody would try to access my admin page with good intentions.
Can you help me figure out what this link refers to?
Consider checking for malicious code included on your pages. And yes, it's likely that someone is trying to access those pages, but the request may not do anything because it's an invalid path. You should consider blocking such IP addresses after checking your logs.
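To follow up on checking the logs: here is a minimal sketch of pulling the offending IPs out of a combined-format (Apache/Nginx) access log; the log path and the list of suspicious path fragments are assumptions for illustration:

# Sketch: find IPs that requested suspicious paths in a combined-format access log.
import re
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"      # hypothetical log location
SUSPICIOUS = ["/admin", "www.bndv521.cf"]   # fragments seen in Analytics

# The client IP is the first field of a combined-format log line;
# the requested path sits inside the quoted request ("GET /path HTTP/1.1").
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)')

hits = Counter()
with open(LOG_FILE) as f:
    for line in f:
        m = line_re.match(line)
        if m and any(s in m.group(2) for s in SUSPICIOUS):
            hits[m.group(1)] += 1

for ip, count in hits.most_common():
    print(f"{ip}\t{count}")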
Although trying to reach an admin page seems like a suspicious action, on our website we come across this issue in roughly one in every ten thousand requests.
We think that a browser extension or a virus-like program tries to change the URL or add this keyword to it, not for hacking purposes but to redirect users to its advertising website.
Very similar issue here: Weird characters in URL

Google URL Crawl error 404 - domain appending to end of URL

I recently built and published my WordPress site at www.kernelops.com and submitted it to the Google index and Webmaster Tools. Today I logged into Webmaster Tools and found 60 URL errors, all with the same type of issue: the base domain www.kernelops.com is being appended to all of my site's page, category, and post URLs. An example of a failing URL looks like this:
http://www.kernelops.com/blog/www.kernelops.com
Google Webmaster Tools indicates that this weird link originates from the base URL "http://www.kernelops.com/blog", which obviously means the issue is on my end. My WordPress permalink settings are set to use the post name; I'm not sure if that could be causing this, e.g.:
http://www.kernelops.com/sample-post/
I can't seem to find any help resolving this weird issue via Google searches and thought someone here might be able to point me in the right direction.
The WordPress plugins that could potentially affect the site's URLs are the following:
All in One SEO
XML-Sitemap
But I can't see any sort of setting within these plugins that would be causing this type of issue.
Any ideas would be greatly appreciated - thanks in advance!
This is a long shot, but it may be happening if the Google crawler picks up a link that looks like a relative path and appends it to the current directory. Note that an href written without a scheme (e.g. href="www.kernelops.com") is a relative URL by definition, so any crawler, not just Google's, would resolve it that way.
The closest thing I could find that might be treated as a relative path is this:
<div class="copyright">
...
Kernel, Inc.
...
</div>
I doubt that this is the problem, but it may be worth fixing it.
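To illustrate how such a link would produce exactly the failing address from the question, here is standard URL resolution in Python; the base URL is the blog page and the href is assumed to be written without a scheme:

# Sketch: how a scheme-less href resolves against the page it appears on.
from urllib.parse import urljoin

base = "http://www.kernelops.com/blog/"  # page containing the link
href = "www.kernelops.com"               # href written without http://

print(urljoin(base, href))
# -> http://www.kernelops.com/blog/www.kernelops.com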
Now, there is yet another possibility, and that's if the website serves slightly different content depending on the User-Agent string. When Google presents your website with its User-Agent string, the SEO plugin detects it and tries to optimize things in order to improve your ranking (I'm not familiar with that plugin, so I don't know exactly what it does). There may be a bug in the SEO plugin that makes the www.kernelops.com URL look like a relative path, or that actually constructs the faulty URL somehow.
You can test this by setting the user-agent string in your browser (e.g. with Firefox's user-agent switcher) to Googlebot's user-agent string and seeing what happens when you visit your website. Look at the page source you receive for any links that resemble the one Google is finding.
However, if the SEO tool is smart enough, it will "realize" that your IP doesn't match one of the valid IPs for Googlebot and it will not make the modifications.
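If you'd rather do that check from a script than from the browser, a small sketch along those lines: it fetches the page with Googlebot's published user-agent string and searches for scheme-less hrefs; the href pattern is an assumption for illustration:

# Sketch: fetch the page as "Googlebot" and look for scheme-less hrefs that
# would resolve into the faulty URL.
import re
import requests

PAGE = "http://www.kernelops.com/blog/"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

html = requests.get(PAGE, headers={"User-Agent": GOOGLEBOT_UA}, timeout=10).text

# Any href that starts with "www." (no scheme) is a relative link in disguise.
for href in re.findall(r'href=["\'](www\.[^"\']+)["\']', html):
    print("suspicious relative href:", href)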

web scraping search results

I need help solving the following issue:
I need to validate the URLs that the Google search engine has cached for a particular site. If a URL returns a 404, or the page does not render some necessary HTML elements (and is therefore considered broken), I need to log that URL and later 301-redirect it to the correct URL. I know PHP and a little bit of Python, but I'm not sure what approach to use to scrape all the URLs from the search engine results for a given site.
http://simplehtmldom.sourceforge.net/ - a simple HTML parser. There is an example on that page; I'm not sure if it still works with Google's instant search, etc.
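Once the URLs are collected (by whatever means), the validation and logging step itself is straightforward. A minimal Python sketch, assuming the requests library, a placeholder URL list, and a placeholder marker standing in for the "necessary HTML element":

# Sketch: check a list of already-collected URLs and log the broken ones.
# Collecting the URLs from Google's results is a separate step.
import requests

urls = ["https://example.com/page-1", "https://example.com/page-2"]  # hypothetical list
REQUIRED_MARKER = '<div id="content"'  # hypothetical element the page must contain

broken = []
for url in urls:
    try:
        resp = requests.get(url, timeout=10)
        if resp.status_code == 404 or REQUIRED_MARKER not in resp.text:
            broken.append(url)
    except requests.RequestException:
        broken.append(url)

with open("broken_urls.log", "w") as f:
    f.write("\n".join(broken))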
