Extracting content data from webpages - web-scraping

I'm looking to get structured article data from webpage urls. So far I've found these two services http://www.diffbot.com/ and http://embed.ly/extract/demos/nlp. Are there better alternatives or is it worthwhile to write the code to do this myself?

If you'd like to skip the code and are looking for simple software for web scraping / ETL applications, I'd suggest Foxtrot. It's easy enough to use and doesn't require coding. I use it to scrape data from certain gov't websites and dump it into an Excel spreadsheet for reporting purposes.

I have been doing web scraping / content extraction for quite some time now.
For me the best approach is to write a Chrome extension and automate the browser through its API. This requires that you know JavaScript and HTML. In one of my recent projects I use a background page with a couple of editable divs to configure the scraping session. The background page has some buttons to start the process, and it loads a JS script which listens for click events on those buttons.
When one of the buttons is clicked, I open a new tab for the scraping session with chrome.tabs.create. The background JS also registers chrome.tabs.onUpdated.addListener callbacks to inject content scripts when the tab URL contains a specific page/domain name.
The content script then does the scraping job, for example selecting some elements with jQuery, regular expressions etc., and finally sends a message with an object back to the background JS using chrome.runtime.sendMessage. The background JS listens for messages with chrome.runtime.onMessage.addListener and acts on the extracted content.
The extension also automates web databases, for example by clicking the next-page links.
I have added a timing setting to control the number of links clicked / tabs opened per minute, so that access is slowed down on purpose and excessive crawling is avoided.
Finally the results are uploaded to a database with an AJAX call and inserted into MySQL by a PHP page.
The next time the extension runs, it fetches the keys/links that already exist in the database with another AJAX call and ensures that only new information is extracted.
I have also built extensions like the above for Firefox, but the best and easiest solution for me is a Chrome/Chromium extension.
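A rough sketch of the moving parts described above, in case it helps. This is not my actual extension code: the helper names, the host list and the rate limit are made up for illustration, and the content-script injection is left to a caller-supplied inject function because the exact call depends on the manifest version you target.

```javascript
// Hypothetical helpers; names, hosts and limits are illustrative assumptions.

// True when the tab URL belongs to a site we want to scrape.
function shouldInject(url, targetHosts) {
  try {
    const host = new URL(url).hostname;
    return targetHosts.some((t) => host === t || host.endsWith("." + t));
  } catch (e) {
    return false; // not a parseable URL (e.g. an empty string)
  }
}

// Keep only links that are not already stored in the database.
function filterNewLinks(links, knownLinks) {
  const known = new Set(knownLinks);
  return links.filter((link) => !known.has(link));
}

// Spread `count` page visits evenly so that at most `perMinute` start
// per minute. Returns the delay in milliseconds before each visit.
function visitDelays(count, perMinute) {
  const gap = Math.ceil(60000 / perMinute);
  return Array.from({ length: count }, (_, i) => i * gap);
}

// Wire the pieces to the extension API. `chromeApi` is the global `chrome`
// object in a real extension; `inject(tabId)` and `onResult(message)` are
// supplied by the caller.
function wireBackground(chromeApi, targetHosts, inject, onResult) {
  chromeApi.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
    const url = tab.url || "";
    if (changeInfo.status === "complete" && shouldInject(url, targetHosts)) {
      inject(tabId);
    }
  });
  chromeApi.runtime.onMessage.addListener((message) => {
    onResult(message);
  });
}
```

The delays from visitDelays can be fed to setTimeout before each chrome.tabs.create call, and filterNewLinks is where the list of keys fetched from the database would be applied.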

Related

Displaying values from a Google spreadsheet on a WordPress page

I have followed the tutorial on this website https://www.wp-tweaks.com/display-a-single-cell-from-google-sheets-wordpress/ which allows you to dynamically display values from a Google spreadsheet on a WordPress page using a simple shortcode:
[get_sheet_value location="Cell Location"]
This solution worked seamlessly until a single page contained hundreds of those shortcodes (I basically need the whole content of the page to be editable via the spreadsheet). I started getting 100% Errors by API method (based on the Google Metrics) and the content was not displayed properly anymore. I realize that sending hundreds of read requests after each page load is not ideal and will inevitably affect the load performance and that Google imposes quota limits too. Is there a way to bypass this issue? For example by pulling the values from the Google spreadsheet only once a day. Unfortunately, I don't have much coding experience but I'm open to all solutions.
Thanks in advance!
You could publish the sheet to the web and embed it to your website:
In your sheet, go to File > Publish to the web
In the window that appears, click Embed.
Click Publish.
Copy the code in the text box and paste it into your site.
To show or hide parts of the spreadsheet, edit the HTML on your site.
It would look like this:
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR3UbHTtAkR8TGNtXU3o4hzkVVhSwhnckMp7tQVCl1Fds3AnU5WoUJZxTfJBZgcpBP0VqTJ9n_ptk6J/pubhtml?gid=1223818634&single=true&widget=true&headers=false"></iframe>
You could try reading the entire spreadsheet as a JSON file and parse it within your code.
https://www.freecodecamp.org/news/cjn-google-sheets-as-json-endpoint/
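To illustrate the "read it once, then reuse it" idea, here is a minimal sketch. It is only an illustration under assumptions: createCachedLoader and cellAt are hypothetical names, the sheet is assumed to arrive as a plain array of rows, and the same logic would have to be ported to wherever your shortcode actually runs.

```javascript
// Hypothetical helper: wraps any loader so it runs at most once per `ttlMs`.
// `now` is injectable so the behaviour can be checked without waiting.
function createCachedLoader(load, ttlMs, now = Date.now) {
  let value;
  let loadedAt = null;
  return function get() {
    const t = now();
    if (loadedAt === null || t - loadedAt >= ttlMs) {
      value = load(); // may be a plain value or a promise
      loadedAt = t;
    }
    return value;
  };
}

// Look up one cell by spreadsheet-style location ("B2") in an array of rows.
function cellAt(rows, location) {
  const match = /^([A-Z]+)(\d+)$/.exec(location.toUpperCase());
  if (!match) return undefined;
  let col = 0;
  for (const ch of match[1]) col = col * 26 + (ch.charCodeAt(0) - 64);
  const row = rows[Number(match[2]) - 1];
  return row ? row[col - 1] : undefined;
}
```

With a TTL of 24 * 60 * 60 * 1000 the sheet is read once a day, and every shortcode on the page becomes a cellAt lookup against the cached rows instead of a separate API request. Note that this sketch also caches a failed load until the TTL expires, so real code would want to handle errors.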

Getting Google Spreadsheet in the Background

We have a Google Spreadsheet from which we wish to load data into our webpage.
I started by using the Google Spreadsheet API via C# and the Google API .NET libraries to read the spreadsheet and load it into an HTML unordered list.
The spreadsheet has about 200 rows, but could have more, as it will be updated frequently. So the problem is that the users have to wait until the spreadsheet data is retrieved and parsed before they can see anything in the webpage (the page is white whilst loading).
How can I load this data in the background whilst the page loads?
I've already written my code in C# and don't much want to spend the time swapping to javascript, but I will if I have to.
Could I use the AJAX Control Toolkit to do this? I know it will load html, but can I use it to fetch google data?
What can I do here that would be fast and easy?
[Edit]
The account that hosts the Google spreadsheet is inside a Google domain, so its documents can't be shared to the public as a whole - only to individuals. The C# libraries allow me to use the account's username and password to log into the account to get the spreadsheet data, and so the spreadsheet doesn't need to be shared at all. Even if I went with a JavaScript/AJAX solution, I would still need this functionality.
Well, this probably isn't the BEST answer, but it IS a solution. I'd like to see if y'all have a better one.
Anyway, I found this, which is an example of how to use an asp:Timer to delay the calling of a function for a certain amount of time - in my case, long enough for the page itself to load. At least this way, the user gets to see the page, and can watch the nice loading-gif until the actual content arrives.
It is an AJAXy approach that allows me to keep my c# programming without having to add any javascript.

WebAii Test web page that produces non-html content (CSV, JSON, XML...)

I use WebAii to test an ASP.Net application. This application has an "Export to CSV" feature, and I would like to test that it works correctly with WebAii. Is there a way to access the exact source that was generated for a page?
I tried using ActiveBrowser.ViewSourceString, but it appears to work only for HTML. (it contains the HTML of the page that called the "Export to CSV" instead of the CSV content)
It may seem strange to use WebAii to test plain text content, when I could bypass WebAii and the browser and use HttpRequest to directly call the page. The reason why I need to do it this way is that the Export to CSV gets its parameters (a series of search filters) on the query string, and I need to make sure the calling code (an ASP.Net web page) is correctly passing the right parameters.
I work in Telerik's technical support department for WebAii. I'll try to assist. I need to know what happens when you click this "Export to CSV" button/link. Normally such a button causes the webserver to create a file and send it to the browser for downloading. You then save it as a file on your local machine. Is this what is happening or is the browser simply displaying the CSV content in its window?
ActiveBrowser.ViewSourceString is the right approach for getting at the HTML loaded in the browser window. It is possible that the HTML held by the framework is out of sync with what's actually in the browser, because we cache the DOM for performance reasons. You can use:
ActiveBrowser.RefreshDomTree();
This forces the framework to resync its copy of the DOM with what's actually contained in the browser. See if ActiveBrowser.ViewSourceString is now different after clicking on your "Export to CSV" button/link.
Feel free to post questions like this on our Telerik Testing Framework forum. http://www.telerik.com/automated-testing-tools/community/forums/webui-test-studio-developer-edition/webaii-automation-framework.aspx. This is where I hang out daily.
Cody

Loading web application content through AJAX

I'm about to build a web application (not a web presentation) which will load its content through AJAX (jQuery) into a specific div. There will be a menu above the div, and when a user clicks on an item from the menu, the appropriate page will be loaded into the main div.
I'd like to know if there are any pros and cons of choosing this pattern for a web application.
So far I'm aware that the browser back button and history/URL will be gone.
Two possible downsides are that it could make it difficult for users to bookmark content on your site and difficult for search engines to differentiate pages on your site.
You should probably provide more information on your reasons for taking this approach. You might have good reasons or it might be a case of using a technology (AJAX) because it is cool to use.
If you want to give the users the impression of fast responsiveness, then yes AJAX load your pages, but still have a different url for each page. This will take more code but it will solve both issues that I mentioned.
http://yourdomain.com/home.aspx //loads its own content via AJAX
http://yourdomain.com/contact.aspx //loads its own content via AJAX
etc
This is really only appropriate if you have a lot of content, or where the content involves time-consuming calculations, such as on a financial site. In most cases, it would be less trouble to just load your pages normally or break your content into paged chunks.
The main con of this approach is that it will make your site very difficult for search engines to crawl. They don't read Javascript, so your content won't get seen or indexed by them. Try to do progressive enhancement so that they (and any users who browse without Javascript, e.g. with some screen readers) don't get left behind.
On the other hand, you can keep browser history functionality. This can be done using the URL hash, e.g. http://www.example.com/#home vs http://www.example.com/#about-us. The nicest way to do this is to get Ben Alman's hashchange plugin and then use the hashchange event:
$(window).hashchange(function () {
    var location = window.location.hash;
    // do your processing here based on the contents of location
});
This will allow your users to use the history function and the bookmarking function of their browsers. See the documentation on his site for more information.
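As a sketch of what "processing based on the contents of location" could look like, the hash can be mapped to the fragment that should be loaded into the main div. The PAGES table, the fragment paths and the fragmentForHash name are all made up for this example.

```javascript
// Hypothetical table of hash names and the fragment each one loads.
const PAGES = {
  home: "/fragments/home.html",
  "about-us": "/fragments/about-us.html",
};

// Map a location hash such as "#about-us" to a fragment URL.
// Unknown or empty hashes fall back to the `fallback` entry.
function fragmentForHash(hash, pages, fallback) {
  const key = hash.replace(/^#/, "");
  const known = Object.prototype.hasOwnProperty.call(pages, key);
  return known ? pages[key] : pages[fallback];
}

// Inside the hashchange handler this could be used roughly as:
//   $("#main").load(fragmentForHash(window.location.hash, PAGES, "home"));
// where "#main" is an assumed id for the content div.
```

Keeping the lookup in a plain function means a bookmarked or mistyped hash still resolves to a real page instead of an empty div.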

VB.Net application - display a message to the user whilst the application is starting up

I have recently created an application where a lot of data is loaded into objects when the application starts up, and other data as it is required. For example if the user requests the catalogue page then it will load all the top level category data into objects of type Category. This will then stay there to be used by other users (who will therefore not have to load this data into objects) and can be altered by admin if they happen to login during the same application instance. I know this is not the most efficient solution, as pointed out below, but it works and the page load, at the moment, is not too long. It is very quick if most of the required data is already loaded into objects. It is also tailored to the business' needs - unlike other techniques such as Linq-to-SQL.
The problem I am facing is when a page is requested which requires lots of data to be displayed about different types of object. For example when a catalogue page is requested which displays information on a product which can be bought, it then loads all the products and categories (as the products make reference to the category object, not just the category name).
I would like to display a loading symbol with a message whilst all this data is being loaded into objects, so the user knows it's not just stuck in a loop or anything. Is there any way to do this? I am open to using JS / jQuery if I need to.
Thanks in advance.
Regards,
Richard
PS I am working on ways to make it more efficient - such as using HashTables or HashMaps. However this is taking time as there are so many different types of item (News, Events, Catalogue Item - Range, Collection, Design, RangeCollection, CollectionDesign, RangeCollectionDesign and RangeDesign - Users, PageViews and the list goes on).
Please correct me if I'm wrong, but I do believe that Javascript is required in order to display a "loading" image... Using server-side scripting alone would typically require an entire page load after all the content loads unless you want to start messing with IFrames.
This is a job for AJAX. A common solution to your problem is to have a small page that displays a loading icon. The page has some JavaScript that makes additional HTTP requests to the server to download the rest of the page. jQuery has a "$.ajax" method that is designed to simplify this process.
I would suggest looking at the jQuery documentation for the .ajax method. Unfortunately, it seems to be a rather delicate process to get all the scripting code right and it takes a while to learn it all.
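A minimal sketch of the loading-icon pattern, under some assumptions: loadWithIndicator and the view object are hypothetical names, and the request is passed in as a function so the same logic works with $.ajax or anything else.

```javascript
// `request` is any function with the shape request(onSuccess, onError).
// `view` is a hypothetical object with three display methods.
function loadWithIndicator(request, view) {
  view.showLoading("Loading, please wait...");
  request(
    (html) => view.showContent(html),
    (reason) => view.showError("Could not load the page: " + reason)
  );
}

// In a page that uses jQuery, the request could be wired up roughly as
// follows; the URL is an assumption, not a real endpoint:
//   loadWithIndicator(
//     (ok, fail) => $.ajax({
//       url: "/catalogue/data",
//       success: ok,
//       error: (xhr, status) => fail(status),
//     }),
//     view
//   );
```

The small page is served immediately with the loading message, and the slow work of loading the objects happens on the server while the AJAX request is in flight.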

Resources