I am using import.io to scrape information from various websites for a research project. While it usually does a very good job, it occasionally outputs a blank page within the scraper interface and I am not able to select any data or otherwise interact with the website.
Is there anything I can do to manipulate the URL or the site itself so that I can access its data? There aren't any issues with accessing the website normally and I feel like there should be a workaround, but I'm a little inexperienced and haven't been able to figure anything out. I've tried using different browsers and toggling the script and styles options on import.io.
I know I have seen websites that allowed me to download multiple files at the same time. I clicked a link, got a warning that multiple files were coming my way, and then selected a folder to place them all in. Unfortunately, I don't remember any specific sites anymore.
Now I'd like to add this feature to a website I'm developing myself, but I cannot find out how to do it. Even searching Google and here on SO yields nothing (or maybe I haven't figured out the right keywords).
So, does anyone know how this is done? I know that it DIDN'T involve any custom browser plugins.
Added: A colleague showed me one such site.
Although this gave me multiple Save-as dialog boxes, it's still better than nothing. Also, in the developer tools I saw that it first downloaded the files via the Fetch API and then somehow made the Save-as dialogs appear via JavaScript.
Does anyone know how this is done?
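For what it's worth, the behaviour described above can be reproduced with the Fetch API plus temporary object URLs and programmatically clicked anchor elements. This is only a minimal sketch of that pattern, not necessarily what that particular site does, and the file list is a made-up placeholder:

```typescript
// Minimal browser sketch: fetch each file, wrap it in an object URL, and
// click a temporary <a download> link so the browser offers to save it.
// The file URLs below are hypothetical placeholders.
const fileUrls: string[] = ["/files/report1.pdf", "/files/report2.pdf"];

async function downloadAll(urls: string[]): Promise<void> {
  for (const url of urls) {
    const response = await fetch(url);
    const blob = await response.blob();

    const objectUrl = URL.createObjectURL(blob);
    const link = document.createElement("a");
    link.href = objectUrl;
    link.download = url.split("/").pop() ?? "download";
    document.body.appendChild(link);
    link.click(); // triggers one download / Save-as per file
    link.remove();
    URL.revokeObjectURL(objectUrl);
  }
}

downloadAll(fileUrls);
```

Whether each file produces a Save-as dialog or downloads silently depends on the user's browser settings, and some browsers ask for permission before allowing multiple downloads from a single page.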
Just a small curiosity: I often use extensions like Wappalyzer to work out which CMS certain sites are built on. In that regard, I have a question: is there a way to prevent extensions like the one mentioned above from identifying the CMS used?
You can try, but I don't think you can completely hide the CMS behind your site, and the attempt is largely a waste of time. For Drupal, check this page:
https://www.drupal.org/node/766404
Most CMSes can be run headless, decoupling the backend from the frontend. They either render a static set of HTML pages, or expose a JSON feed with all of the site's data, which is then rendered by some kind of JavaScript app. This way no one can tell where the data is coming from, or whether a CMS was involved at all.
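As a rough sketch of what that decoupled frontend can look like (the /api/articles.json endpoint and the field names here are hypothetical, not any specific CMS's API):

```typescript
// Hypothetical frontend for a headless CMS: the browser only ever sees a
// generic JSON endpoint and client-side rendering, not the CMS itself.
interface Article {
  title: string;
  body: string;
}

async function renderArticles(): Promise<void> {
  // "/api/articles.json" is a placeholder; it could be a static export or a proxy.
  const response = await fetch("/api/articles.json");
  const articles: Article[] = await response.json();

  const container = document.querySelector("#content");
  if (!container) return;

  for (const article of articles) {
    const el = document.createElement("article");
    const title = document.createElement("h2");
    title.textContent = article.title;
    const body = document.createElement("p");
    body.textContent = article.body;
    el.append(title, body);
    container.appendChild(el);
  }
}

renderArticles();
```

Fingerprinting extensions mostly look for things like generator meta tags, cookie names, characteristic asset paths, and headers, and a static or proxied frontend like this exposes few of those signals.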
Apologies if this is the wrong place to ask and if it is, please do let me know where best to do so.
I want to write a script that will pull data from website B (external site, not owned by myself) and display that data on website A (site owned by myself).
Now, I know how to do this programmatically and so my question is more about the legalities of the approach.
For example, Twitter provides API access so that you can embed tweets or a twitter feed into your page. The sites that I would like to pull data from may or may not have such APIs and so I would have to write a scraper.
Am I allowed to scrape information from websites and display it on my own site? I will of course make it absolutely clear where the information has come from; I do not intend to use any information and claim that it is my own.
I think this is generally frowned upon, as you are basically doing the same as copying a CD, putting your own label on it, and selling it to others (i.e. taking someone else's stuff and pretending it's your own). I suppose it depends on the licence of the website you are scraping. If the website provides an API (like Twitter), then they probably allow copying.
We have just moved to Drupal and are trying to proactively identify all broken external web (http://, https://) links.
I've seen some references to validation of links, but I wasn't sure whether that only meant validating the syntax of a link, as opposed to checking whether the link actually works (e.g. doesn't return a 404).
What is the easiest way to go through all web links on a Drupal site and identify the broken external ones? This is something we'd like to automate and schedule every day/week.
As someone else mentioned, use the Link Checker module. It's a great tool.
In addition, you can check the Crawl Errors report in Google Webmaster Tools for 404'd links.
Clicking any URL there will show you where it was linked from, so you can update any internal broken links. Be sure to use canonical URLs to avoid that.
Make sure you're using a proper internal linking strategy to avoid broken internal links in the first place, too: http://www.daymuse.com/blogs/drupal-broken-internal-link-path-module-tutorial
Essentially: use canonical, relative links to avoid broken internal links in the future when you change aliases. In simple Drupal terms, be sure you're linking to "node/23" instead of "domain.ext/content/my-node-title" since multiple parts of that might change in the future.
I have not found a Drupal-based approach for this. The best free piece of software I've found for finding bad links on sites is the Screaming Frog SEO Spider Tool.
http://www.screamingfrog.co.uk/seo-spider/
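If you'd rather script a quick external-link check yourself and run it on a schedule (cron or similar), a minimal sketch might look like the following. The URL list is a placeholder you would populate from your own site, for example from an exported Drupal view or a sitemap crawl:

```typescript
// Minimal external-link checker sketch (Node 18+, which ships a global fetch).
// The URL list below is a placeholder.
const urls: string[] = [
  "https://example.com/some-external-link",
  "https://example.org/another-link",
];

async function checkLink(url: string): Promise<void> {
  try {
    // HEAD is cheaper; some servers reject it, so fall back to GET on failure.
    let res = await fetch(url, { method: "HEAD", redirect: "follow" });
    if (!res.ok) {
      res = await fetch(url, { method: "GET", redirect: "follow" });
    }
    if (!res.ok) {
      console.log(`BROKEN (${res.status}): ${url}`);
    }
  } catch (err) {
    console.log(`UNREACHABLE: ${url} (${(err as Error).message})`);
  }
}

async function main(): Promise<void> {
  for (const url of urls) {
    await checkLink(url); // sequential to stay polite; parallelize with care
  }
}

main();
```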
What should I do when I see some IP in my logs scrolling through hundreds of pages on my site? I have a WordPress blog, and it seems like this isn't a real person. This happens almost daily with different IPs.
UPDATE: Oh, I forgot to mention, I'm pretty sure it's not a search engine spider. The hostname is not a search engine, but some random host from India (it ends in '.in').
What I am concerned with is: if it is a scraper, is there anything I can do? Or could it possibly be something worse than a scraper, e.g. a hacker?
It's a spider/crawler. Search engines use these to compile their listings, researchers use them to figure out the structure of the internet, the Internet Archive uses them to download the contents of the web for future generations, spammers use them to harvest e-mail addresses, and so on.
Checking out the user agent string in your logs may give you more information on what they're doing. Well-behaved bots will generally indicate who/what they are - Google's search bots, for example, are called Googlebot.
If you're concerned about script kiddies, I suggest checking your error logs. The scripts often look for things you may not have; e.g. on one system I run, I don't have ASP, yet I can tell when a script kiddie has probed the site because I see lots of attempts to find ASP pages in my error logs.
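If you want to see at a glance which IPs are hammering the site and what they claim to be, a rough sketch like the one below can summarize an access log. It assumes the common Apache/Nginx "combined" log format, and the log path is just a placeholder:

```typescript
// Rough sketch: summarize request counts and user agents per IP from an
// Apache/Nginx "combined" format access log. The log path is a placeholder.
import { readFileSync } from "node:fs";

const LOG_PATH = "/var/log/nginx/access.log"; // placeholder path

const lineRe = /^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"/;

const hits = new Map<string, { count: number; agents: Set<string> }>();

for (const line of readFileSync(LOG_PATH, "utf8").split("\n")) {
  const m = lineRe.exec(line);
  if (!m) continue;
  const [, ip, agent] = m;
  const entry = hits.get(ip) ?? { count: 0, agents: new Set<string>() };
  entry.count += 1;
  entry.agents.add(agent);
  hits.set(ip, entry);
}

// Print the busiest IPs first; a single IP with hundreds of hits and a
// bot-style user agent is usually a crawler rather than a person.
[...hits.entries()]
  .sort((a, b) => b[1].count - a[1].count)
  .slice(0, 10)
  .forEach(([ip, { count, agents }]) => {
    console.log(`${ip}: ${count} requests, agents: ${[...agents].join(" | ")}`);
  });
```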
Probably some script kiddie looking to take advantage of an exploit in your blog (or server). That, or some web crawler.
It's probably a spider/bot indexing your site. The "User-Agent" might give it away. It's easy to rack up hundreds of GET requests for a dynamically generated WordPress site if the requests aren't all blog pages but include things like CSS, JS, and images.