Parse Onlineshop - Onlineshop Data - web-scraping

I am searching for a solution for crawling & parsing a whole website (online shop) automatically and saving all products, as product name and product price, in a CSV.

Getting data out of a website can be extremely simple or the complete opposite; it depends on how the website is made. A shop tends to be a complex website, and thus the DOM (the HTML structure) is mostly unique to that website. It is very unlikely that someone else has already tried the exact same thing you want for that page, so you have to write code and extract the necessary pieces yourself.
This will be our example product: http://www.thomann.de/gb/focusrite_scarlett_2i2.htm
HTML uses classes to tell the CSS (used for styling) how to design or render a certain element. You can use this behaviour to your advantage and find the element containing the price by its class. In this example it is .tr-prod-price.
Every major browser has an Inspect Element function, and it can be used to find the class of an element that appears on screen. Right-click on your text (price or title) and press Q (Firefox only).
Now you've got a step closer to parsing your data, and it is time to write code. You could use Python, Java or even JavaScript, to give some examples. JavaScript in conjunction with Node.js could be very convenient, because JS has built-in methods for much of what we need.
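To make that concrete, here is a minimal sketch in Python using requests and BeautifulSoup. The .tr-prod-price class comes from the example page above; the h1 selector for the product name is an assumption you would verify with the inspector.

    import csv
    import requests
    from bs4 import BeautifulSoup

    def scrape_product(url):
        # Fetch one product page and return (name, price).
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        price = soup.select_one(".tr-prod-price")  # class found via Inspect Element
        name = soup.select_one("h1")               # assumed selector - check it in the inspector
        return (name.get_text(strip=True) if name else "",
                price.get_text(strip=True) if price else "")

    urls = ["http://www.thomann.de/gb/focusrite_scarlett_2i2.htm"]  # extend with more product pages

    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["product_name", "product_price"])
        for url in urls:
            writer.writerow(scrape_product(url))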
You may need a search engine to find the detail pages of the products. Google can list all results for a query like site:thomann.de/gb. But of course Google does not provide an easy way (an API) to get this information, and if you start writing your own parser for its results I am not sure about the legal consequences. The legal side also needs to be addressed for your main intention.

Related

Automate process of grabbing elements from a webpage

I'm looking to automate test cases for webpage development using Robot Framework. I have about 5000 test case strings that describe pathways to different page elements. Now I'm going to go through and grab the specific "id" or "css selector" of each element within the webpage for automation. My default option is to manually inspect each button, link, table etc. and enter it into a huge spreadsheet for automation, but I feel like there must be a less arduous method of extracting the elements.
I've looked into different options, and the closest thing I can find to a solution is Python web scraping, but from what I understand web scraping assumes the elements are already defined and your goal is to extract information rather than the elements themselves.
Does anyone have a solution that might be a bit less tedious than inspecting 5000 webpage elements? ;)
If you can put your page in an IFRAME, then you could probably use JS (in the parent) to wait until the page is loaded and then get (all or specific) elements in the IFRAME.
That way you should be able to get all the elements of the fully rendered page.
(never did this, but it should work)
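Alternatively, since the question mentions Python web scraping: if the pages can be fetched directly (no login and no client-side rendering), a short script can extract the elements themselves rather than their content. A rough sketch with BeautifulSoup (the URL list is hypothetical) that dumps every element carrying an id into a CSV:

    import csv
    import requests
    from bs4 import BeautifulSoup

    pages = ["http://example.com/page1", "http://example.com/page2"]  # hypothetical list of pages

    with open("elements.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["page", "tag", "id", "css_selector"])
        for url in pages:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            for el in soup.find_all(attrs={"id": True}):  # every element that carries an id
                writer.writerow([url, el.name, el["id"], "#" + el["id"]])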

How can I effectively clean up styles in a large web site?

Our web site has been under constant development for the better part of the last five years. As it happens, pretty much all the styles for the site are in one big CSS file. With time this CSS file has grown to about 9,000 lines - and I'm sure some of those styles are not used any more and quite a few styles provide duplicate functionality.
The site is written with PHP/Smarty; there are over 300 smarty templates and the whole site contains over 1000 different pages (read - unique URLs). I'm sure it will continue growing - as will the CSS file.
What's the best way to clean up this file?
Update: Unfortunately, online parsers where I put in a URL won't work for me, as 75% of the site is behind username/password logins - and depending on the login, there are half a dozen different roles, each of which has its own set of pages. There are also transactional elements (online shop), where pages are displayed after (for example) a credit card payment is taken/processed. I doubt that any online tool would be able to handle any of these. Therefore if there's a tool, it would have to work on a source tree.
Short of going through each .tpl file and searching the file for the selectors manually, I don't see any other way.
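That manual search could at least be scripted against the source tree, which also sidesteps the login problem. A rough sketch, assuming the selectors of interest are plain class names (the file paths are placeholders, and anything generated dynamically in PHP will show up as a false positive, so verify by hand before deleting):

    import os
    import re

    CSS_FILE = "styles.css"      # the big 9,000-line stylesheet (hypothetical path)
    TEMPLATE_DIR = "templates"   # the Smarty .tpl source tree (hypothetical path)

    # Pull every .classname out of the stylesheet (naive: ignores ids, media queries, etc.).
    with open(CSS_FILE, encoding="utf-8") as f:
        classes = set(re.findall(r"\.([A-Za-z_][\w-]*)", f.read()))

    # Concatenate the raw text of every template.
    template_text = ""
    for root, _, files in os.walk(TEMPLATE_DIR):
        for name in files:
            if name.endswith(".tpl"):
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    template_text += f.read()

    # Any class never mentioned in a template is a candidate for removal.
    for cls in sorted(c for c in classes if c not in template_text):
        print(cls)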
You could of course use Dust-Me selectors, but you'd still have to go through each page that uses the .tpl files (not each url as I know that many of them will be duplicates).
Sounds like a big job! I had to do it once before and I did exactly that, took me a week.
Another tool is a Firebug plugin called CSS Usage. As far as I read it can work across multiple pages but might break if used site-wide. Give it a go.
Triumph! Check out the Unused CSS online tool. Type your index URL into the field and voila, a few minutes later you get a list of all the used selectors :) I know you want the unused ones, but then the only work left is finding the unused ones in the file (ctrl+f) and removing them :)
Make sure to use the 2nd option; they'll email you the results of the crawl of your entire website. It might take up to half an hour, but that's far better than a week. Grab some coffee :)
Just tested it, works a treat :)
I had to do this about 3 years ago on a rather large classic ASP web application.
I took the approach that there are only a finite number of styled items on each page and started by identifying these. For example, I went through the main pages and identified that the majority of labels were bold and dark blue and that all buttons were the same width.
Once I'd done that, I spoke to the team and we decided that anything that didn't conform to these rules I'd identified should conform, so I wrote a stylesheet based on this assumption.
We ended up with about 30 styles to apply to several hundred pages. Several regular-expression-find-and-replaces later (we were fortunate that the original development had used reasonably well structured HTML) we had something usable that just needed the odd tweaking.
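For illustration, one such find-and-replace might look like the following (the inline-style pattern here is invented; the real patterns depend on the markup you actually have):

    import re

    html = '<span style="font-weight: bold; color: darkblue;">Customer name</span>'

    # Replace the recurring "bold, dark blue" inline style with the new .label class.
    pattern = r'style="font-weight:\s*bold;\s*color:\s*darkblue;?"'
    cleaned = re.sub(pattern, 'class="label"', html)

    print(cleaned)  # <span class="label">Customer name</span>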
The key points are:
Aim for uniformity across the site. In other words, don't assume that the resultant site will look exactly the same as the original, but aim for it to look the same as itself (uniform) from page to page
Tackle the obvious styles first (labels / buttons / paragraph fonts / headers) and then worry about the smaller styles or the unique styles later
You might also find it helps to keep unique styles (e.g. for a dashboard page that has unique elements that don't appear elsewhere) in separate files, to keep the size of the main file down. Obviously, it depends on your site as to whether this would help.
Additionally, there are many sites that will search for these for you, like this one: http://unused-css.com/. I don't know how it measures up to Dust-Me Selectors, but I do know that Dust-Me Selectors isn't compatible with Firefox 8.0.
You could use the Dust-Me Selectors plugin for Firefox to find unused styles:
http://www.sitepoint.com/dustmeselectors/
If you have a sitemap you could use that to let the plugin crawl your site:
The spider dialog has all the controls for performing a site-wide spider operation. Enter the URL of either a Sitemap XML file, or an HTML sitemap, and the program will read that file and extract all its links. It will then load each of those pages in turn and perform a cumulative Find operation on each one.
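If the plugin route doesn't pan out, the same spider idea is easy to approximate with a small script: read the sitemap, fetch each page, and record which classes are actually used. A rough sketch (assumes a standard sitemap.xml and publicly reachable pages, so it won't help with the login-protected parts):

    import re
    import requests
    from bs4 import BeautifulSoup

    SITEMAP_URL = "http://example.com/sitemap.xml"  # hypothetical sitemap location

    # Pull every <loc> URL out of the sitemap.
    sitemap = requests.get(SITEMAP_URL, timeout=10).text
    urls = re.findall(r"<loc>(.*?)</loc>", sitemap)

    # Accumulate every class that appears on any crawled page.
    used_classes = set()
    for url in urls:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for el in soup.find_all(class_=True):
            used_classes.update(el.get("class", []))

    print(len(used_classes), "classes in use across", len(urls), "pages")
    # Compare this set against the selectors in the big CSS file to find removal candidates.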
I see there's not a good answer yet. I have tried the "Unused CSS online tool" and it seems to work OK for public sites. The problem is if you have one CSS file serving your public website + an intranet (for example: a WordPress site + a login for registered users). The intranet pages won't be tracked and you will lose your CSS styles.
My next try will be using gulp + uncss:
https://github.com/ben-eb/gulp-uncss
You have to define all the URLs of your site (external and internal), and (maybe; I'm not sure) if you are running the site logged in with user + password in your browser, gulp + uncss can reach the internal URLs.
Update: I see the unused-css online tool has a login solution!

Are CSS Generated Content acceptable in terms of SEO?

I run an online literary journal, which leads to an indexing problem: our content is not "about" literature -- it is literature. As such, Google is really bad at identifying what's going on, and due to the very low keyword density we have to try to work with, I've been looking for ways to slash interface text and turn it into iconography where possible.
I've been looking for a way to do the same with our post dates, but it's been a long search. I stumbled across the idea of using CSS generated content, content: attr(id), to substitute the ID attribute of an invisible image into the page itself.
This works on the display level, however, I haven't been able to track down anything conclusive on whether this interface-only text will still get indexed, or whether we'll be able to move away from months and days of the week being our most-frequent keywords. I know Google will still see it; anyone know if it'll "count"?
As far as I'm aware, the 'best' way to ensure something is hidden from a search engine is to either load it via AJAX or (shudder) include it with flash.
If you feel that the non-content aspects of your site are adversely affecting your site's standing in the various search engines, you could load these elements via AJAX.
Only if you really think these elements are seriously affecting your position.
Below is an image describing areas of this page that one could conceivably post-load via AJAX, if one was overly concerned about their impact on SEO:
I know this doesn't specifically answer your question, it's a suggestion for an alternative way to tackle your issue.

How to implement a "news" section in an ASP.NET website?

I'm implementing a "news" section in an ASP.NET website. There is a list of short versions of articles on one page, and when you click one of the links it redirects you to a page with the full article. The problem is that the article text on the second page will come from a database, but the articles may vary - some may have links, some may have an image or a set of images, some may be formatted differently, etc.
The obvious solution that my friend has come up with is to keep the article in the database as HTML, including all links, images, formatting, etc. Then it would simply be displayed on the second page. I feel this is not a good solution: if, for example, we decide to change the CSS class of some div inside this HTML (let's say it is used in all articles), we will have to find it and change it in every single record of the articles table in our database. But on the other hand we have no idea how to do it differently. My question is: how do you usually handle something like this?
I personally don't like the idea of storing full html in the database. Here's an attempt at solving the problem.
Don't go for a potentially infinite number of layouts. Yes, all articles may be different, but if you stick to a few good layouts then you're going to save yourself a lot of hassle. These layouts can be stored as templates, e.g. ArticleWithImagesAtTheBottom, ArticleWithImagesOnLeft, etc.
This way, your headache is less as you can easily change the templates. I guess you could also argue then that the site has some consistency in layout.
Then for storage you have at least 2 options:
Use the model-per-view approach and have e.g. an ArticleWithImagesAtTheBottomModel (sketched below), which would have properties like 1stparagraph, 2ndparagraph, MainImage, ExtraImages
Parse the article according to the template you want to use, e.g. look for a paragraph break if you need to.
Always keep the images separate and reference them in another column/table in the db. That gives you most freedom.
By the way, option #2 would be slower as you'd have to parse on the fly each time. I like the model-per-view approach.
Essentially I guess I'm trying to say: beware of making things too complicated. An infinite number of layouts means an infinite number of potential problems. You can always add more templates as you go if you really want to expand, but you're probably best off starting with say 3 or 4 layouts.
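To make option 1 a bit more concrete: the model-per-view idea is essentially one class per template. A rough sketch of the shape, using the property names suggested above (shown here as a Python dataclass only for brevity; in an ASP.NET site this would be a C# model class bound to a view):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ArticleWithImagesAtTheBottomModel:
        # One model per template: the view knows exactly what to render and where.
        title: str
        first_paragraph: str               # "1stparagraph" in the suggestion above
        second_paragraph: str              # "2ndparagraph"
        main_image: str                    # stored/referenced separately from the article text
        extra_images: List[str] = field(default_factory=list)

    # Restyling a layout then means editing one template, not every database record.
    article = ArticleWithImagesAtTheBottomModel(
        title="Sample article",
        first_paragraph="Intro text...",
        second_paragraph="More text...",
        main_image="/images/sample.jpg",
        extra_images=["/images/a.jpg", "/images/b.jpg"],
    )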
EDITED FROM THIS POINT:
Actually, thinking about it this may not be the best solution. It could work depending on your needs, but I was wondering how the big sites do it. If you really need that much flexibility, you could (as I think was sort of suggested) use a custom markup. Maybe even a simplified or full wiki markup. I'd still tend toward using templates in general, but if you need to insert at least links and images then you can parse for those.
Surely the point of storing HTML with logically placed <div>s is that you DON'T have to go through every bit of HTML you store to make changes to styles?
I presume you're not using inline styles in your stored HTML, and are referencing an external CSS file, right?
The objection you raise to your colleague's proposal does not say anything about the use of a DB. A DB as opposed to what: files? Then it's all the same. You want to screw around with the HTML, you have to do it on "every single record." Which is not any harder than "on every single file." Global changes are a bitch unless you plan for it by, say, referencing an external CSS. But if you're going to have millions of news articles, you had better plan on versioning the CSS as well.
Anyway, the CMSes do what you're thinking of doing. Using a DB is a fine way to go. How to use it would depend on knowing the problem more intimately.
Have you looked into using free content management systems? I can think of a few good ones:
Joomla
Drupal
WordPress
TONS of others... just do some googling.
Check out this Wikipedia article: http://en.wikipedia.org/wiki/List_of_content_management_systems

How to check a whole website for certain conditions in the rendered source of every page, automatically?

For example, I want to check things like:
On every page, an <h3> tag must only appear after an <h2>; otherwise the page should be flagged.
If a page links to a PDF, then some particular text, e.g. <p>Download Adobe reader from here</p>, should be at the bottom of that page; if this condition is not met, the page should be flagged.
I want to define different types of conditions like these, check the whole site against them, and generate a report of anything that doesn't match.
Do you necessarily have to use XHTML? I'd use Python and BeautifulSoup, myself.
(Edit: I was confused - I was thinking of XSLT, not XHTML, and I thought "why would you use XSLT for something like this?". XHTML is fine, and my recommendation of Python and BeautifulSoup still stands.)
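A minimal sketch of both checks with BeautifulSoup (the URL list is a placeholder; the required PDF notice text is taken verbatim from the question):

    import requests
    from bs4 import BeautifulSoup

    PDF_NOTICE = "Download Adobe reader from here"  # required wording, per the question
    pages = ["http://example.com/page1"]            # hypothetical list of URLs to audit

    report = []
    for url in pages:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

        # Condition 1: every <h3> must be preceded somewhere by an <h2>.
        for h3 in soup.find_all("h3"):
            if not h3.find_previous("h2"):
                report.append((url, "h3 appears before any h2"))
                break

        # Condition 2: if the page links to a PDF, the notice text must be present.
        has_pdf = any(a.get("href", "").lower().endswith(".pdf") for a in soup.find_all("a"))
        if has_pdf and PDF_NOTICE not in soup.get_text():
            report.append((url, "PDF link without the Adobe Reader notice"))

    for url, problem in report:
        print(url, "-", problem)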
This ruby gem looks like it could be useful to you:
http://code.google.com/p/opticon/
I haven't personally used it, but it claims to basically do what you're asking for.
I've had, and still have, the same need on many of my projects. In my case I'm looking for anything with the class 'error'. This is supported by the TestPlan product in its verification engine.
In my case, as a quick example, I have several "Web" states and my generic verify script is:
CheckNot //div[@class='error']
Now the way TestPlan works is that every state within "Web" will first run this generic verify script.
If you're interested I could help you come up with the exact syntax needed to do your check.
