Automate process of grabbing elements from a webpage - web-scraping

I'm looking to automate test cases for webpage development using Robot Framework. I have about 5000 test case strings that describe pathways to different page elements, and now I need to go through and grab the specific "id" or "css selector" locator for each element on the webpage for automation. My default option is to manually inspect each button, link, table, etc. and enter it into a huge spreadsheet, but I feel like there must be a less arduous method of extracting the elements.
I've looked into different options, and the closest thing I can find to a solution is Python web scraping, but from what I understand web scraping assumes the elements are already identified and the goal is to extract information from them rather than the element locators themselves.
Does anyone have a solution that might be a bit less tedious than inspecting 5000 webpage elements? ;)

If you can put your page in an IFRAME, then you could probably use JS (in the parent) to wait until the page is loaded and then get (all or specific) elements inside the IFRAME.
That way you should be able to get all the elements of the fully rendered page.
(I've never done this, but it should work.)
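A rough sketch of that idea (untested; the iframe id below is made up, and it only works if the framed page is same-origin so the parent is allowed to reach into it):

// Parent page: wait for the iframe to load, then collect candidate locators
var frame = document.getElementById('pageUnderTest'); // hypothetical iframe id
frame.addEventListener('load', function () {
  var doc = frame.contentDocument; // same-origin only
  doc.querySelectorAll('a, button, input, select, table').forEach(function (el) {
    // Prefer the id; otherwise fall back to a simple tag.class selector
    var locator = el.id
      ? '#' + el.id
      : el.tagName.toLowerCase() +
        (el.className ? '.' + el.className.trim().split(/\s+/).join('.') : '');
    console.log(locator);
  });
});

From there you could print the collected locators with console.table or copy() and paste them straight into your spreadsheet.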

Best practices for structuring CSS code [duplicate]

I am creating a design for a mid-sized Web application. It's my first time, and there is no established design process at my workplace. Previous projects are small internal applications, and the back-end developer used a minimal design just enough to make stuff align where it should.
I started doing the design for each type of page separately, and created a new CSS file for each type of page, e.g. a separate one for input forms, another one for the search interface, and so on. I also made one large file with elements used everywhere (header, footer, buttons, warning messages and so on). It was the only reasonable structure I could think of.
I've been at it for a while, and I'm now noticing that I've created some sort of chaos. When I have an element and need to change the definition of its style, I always have to go through Inspect Element and then Visual Studio's search function, which is still reasonably efficient. But I also frequently find myself looking at definitions in the stylesheet, having no idea what they are for, or if they're still in use at all - maybe we have already thrown out the elements which use them, or they were an attempt to solve a problem which got a better solution.
I am already trying to give good, semantic names to my classes, but it's not sufficient, and sometimes even impossible - every workaround I use seems to leave me with names like .centeringWrapper.
What is a good, workable structure of CSS code which prevents these problems? What principles can I apply to arranging the code?
How can I divide the code into files so I can find the correct file?
How should I structure the code inside the files so I can stay oriented within a file?
How can I keep an overview of the different definitions for the same element that are used within different @media blocks?
Any advice for making my work less messy is welcome.
The best practice for structuring your CSS is to structure your CSS. By that I mean have a system. It doesn't really matter what your system is, as long as it makes sense to you and your team and people can consistently maintain it (at least for a reasonable length of time).
I can tell you one way not to do it, though, and that's designing each page separately with its own CSS.
I think you've figured this out already, but it's worth repeating.
Now, there are times when I've broken this rule. But it's rare and it's typically on small marketing-centric sites where I simply have 4 very different pages. In general, though, you want to re-use as much of your CSS as you can across all your pages.
One way to achieve that is to start with a pre-existing structure by working off of a CSS framework. A common one is Bootstrap, but there are literally dozens and dozens of options out there.

Parse Onlineshop - Onlineshop Data

I am searching for a solution for crawling & parsing a whole website (an online shop) automatically and saving every product, with its product name and product price, to a CSV.
Extracting data from a website can be extremely simple or the complete opposite. It depends on how the website is made. A shop tends to be a complex website, and thus the DOM (the HTML structure) is mostly unique to that website. It is very unlikely that someone else tried the exact same thing you want for that page, so you have to write code and extract the necessary pieces yourself.
This will be our example product: http://www.thomann.de/gb/focusrite_scarlett_2i2.htm
HTML uses classes to tell the CSS (for styling) how to design or render a certain element. You can use this to your advantage and find the element containing the price by its class. In this example it is .tr-prod-price.
Every major browser has an Inspect Element function that can be used to find the class of an element that appears on screen. Right-click your text (the price or the title) and press Q (Firefox only) to inspect it.
Now you've got closer to parsing your data, and it is time to write code. You could use Python, Java, or even JavaScript, to give some examples. JavaScript in conjunction with Node.js could be very easy, because JS has the built-in methods we need.
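To make that concrete, here is a minimal Node.js sketch. The cheerio package for DOM-style parsing is my assumption (the answer doesn't name a library), and the h1 selector for the product name is also just a guess:

// npm install cheerio   (Node 18+ ships fetch; otherwise use a request library)
const cheerio = require('cheerio');

async function scrapeProduct(url) {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  return {
    name: $('h1').first().text().trim(),              // assumed: product name in the first <h1>
    price: $('.tr-prod-price').first().text().trim()  // class found via Inspect Element, as above
  };
}

scrapeProduct('http://www.thomann.de/gb/focusrite_scarlett_2i2.htm')
  .then(p => console.log(p.name + ';' + p.price));    // one CSV-style line per product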
You may need a search engine to find the detail pages of the products. Google can list all results for a query like site:thomann.de/gb. But of course Google does not provide an easy way (an API) to get this information, and if you start writing your own parser for that, I am not sure about the legal consequences. The legal side also needs to be addressed for your main intention.

Easy way to make a static copy of a web app for JSFiddle?

I often have a problem where I'm working on a dynamic web app with tons of front-end or back-end code and there is a CSS problem that just eludes me despite an hour of scratching my head. I know that StackOverflow could solve it in a second, and I'd like to post it, but I either have to
Make the app public along with steps to reproduce the state, or
Tediously copy out the DOM and assets (CSS) along with the current state.
Neither is very straightforward. Note that the DOM is dynamically generated so "View Source" won't cut it. Similarly, the CSS could be spread out across multiple files and I'd like to just grab it all at once.
Is there an easy way to copy out the DOM and all CSS as a single file so that I can insert it into something like JSFiddle and be on my way?
The quickest way to get all HTML on the page as-is is to paste this in the address bar:
javascript:alert(document.body.outerHTML)
You can also use the console, of course, but the above works even in old IE versions and is easier to copy/paste.
I don't think there's a good way to get the CSS at all, but you could try using a jQuery selector or similar to get the URLs:
$('link[rel="stylesheet"], link[type="text/css"]').each(function (i, link) {
  // link.href gives the absolute URL of each linked stylesheet
  console.log(link.href);
});
And downloading and concatenating the CSS.
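Another option, subject to the same same-origin restriction on stylesheets, is to let the browser hand you the already-parsed rules and dump them as one blob from the console:

// Concatenate every readable CSS rule on the page into a single string
var allCss = '';
Array.prototype.forEach.call(document.styleSheets, function (sheet) {
  try {
    Array.prototype.forEach.call(sheet.cssRules, function (rule) {
      allCss += rule.cssText + '\n';
    });
  } catch (e) {
    // Cross-origin stylesheets can't be read this way; fetch those by URL instead
    console.warn('Skipped ' + sheet.href);
  }
});
console.log(allCss);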

How can I effectively clean up styles in a large web site?

Our web site has been under constant development for the better part of the last five years. As it happens, pretty much all the styles for the site are in one big CSS file. Over time this CSS file has grown to about 9,000 lines - and I'm sure some of those styles are no longer used, and quite a few styles duplicate one another's functionality.
The site is written with PHP/Smarty; there are over 300 Smarty templates and the whole site contains over 1000 different pages (read: unique URLs). I'm sure it will continue growing - as will the CSS file.
What's the best way to clean up this file?
Update: Unfortunately, online parsers where I put in a URL won't work for me, as 75% of the site is behind username/password logins - and depending on the login, there are half a dozen different roles, each of which has its own set of pages. There are also transactional elements (an online shop), where pages are displayed only after (for example) a credit card payment is taken and processed. I doubt that any online tool would be able to handle any of these. Therefore if there's a tool, it would have to work on a source tree.
Short of going through each .tpl file and searching the file for the selectors manually, I don't see any other way.
You could of course use Dust-Me Selectors, but you'd still have to go through each page that uses the .tpl files (not each URL, as I know that many of them will be duplicates).
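If you do end up searching the source tree, a throwaway script can at least produce a shortlist of suspects. Here's a rough Node.js sketch of my own (the file and directory names are just examples) that pulls class selectors out of the stylesheet and greps the template folder for them:

// Usage: node find-unused.js styles.css templates/
const fs = require('fs');
const path = require('path');

const [, , cssFile, tplDir] = process.argv;
const css = fs.readFileSync(cssFile, 'utf8');
// Crude: also picks up things like file extensions inside url(), so expect some noise
const classes = new Set((css.match(/\.[A-Za-z_][\w-]*/g) || []).map(c => c.slice(1)));

// Concatenate every template into one big string (fine for a few hundred files)
let sources = '';
(function walk(dir) {
  for (const entry of fs.readdirSync(dir)) {
    const full = path.join(dir, entry);
    if (fs.statSync(full).isDirectory()) walk(full);
    else sources += fs.readFileSync(full, 'utf8');
  }
})(tplDir);

for (const cls of classes) {
  if (!sources.includes(cls)) console.log('Possibly unused: .' + cls);
}

Anything it flags still needs a human check, since class names assembled dynamically in Smarty or JavaScript won't be matched.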
Sounds like a big job! I had to do it once before and I did exactly that, took me a week.
Another tool is a Firebug plugin called CSS Usage. As far as I read it can work across multiple pages but might break if used site-wide. Give it a go.
Triumph! Check out the Unused CSS online tool. Type your index URL into the field and voilà, a few minutes later you get a list of all the used selectors :) I know you want the unused ones, but then the only work left is finding the unused ones in the file (Ctrl+F) and removing them :)
Make sure to use the 2nd option; they'll email you the results of the crawl of your entire website. It might take up to half an hour, but that's far better than a week. Have some coffee :)
Just tested it, works a treat :)
I had to do this about 3 years ago on a rather large classic ASP web application.
I took the approach that there are only a finite number of styled items on each page and started by identifying these. For example, I went through the main pages and identified that the majority of labels were bold and dark blue and that all buttons were the same width.
Once I'd done that, I spoke to the team and we decided that anything that didn't conform to these rules I'd identified should conform, so I wrote a stylesheet based on this assumption.
We ended up with about 30 styles to apply to several hundred pages. Several regular-expression-find-and-replaces later (we were fortunate that the original development had used reasonably well structured HTML) we had something usable that just needed the odd tweaking.
The key points are:
Aim for uniformity across the site. In other words, don't assume that the resultant site will look exactly the same as the original, but aim for it to look the same as itself (uniform) from page to page
Tackle the obvious styles first (labels / buttons / paragraph fonts / headers) and then worry about the smaller styles or the unique styles later
You might also find it helps to keep unique styles (e.g. for a dashboard page that has unique elements that don't appear elsewhere) in separate files, to keep the size of the main file down. Obviously, it depends on your site whether this would help.
Additionally, there are many sites that will search for these for you, like this one: http://unused-css.com/. I don't know how it measures up to Dust-Me Selectors, but I do know that Dust-Me Selectors isn't compatible with Firefox 8.0.
You could use the Dust-Me Selectors plugin for Firefox to find unused styles:
http://www.sitepoint.com/dustmeselectors/
If you have a sitemap you could use that to let the plugin crawl your site:
The spider dialog has all the controls for performing a site-wide spider operation. Enter the URL of either a Sitemap XML file, or an HTML sitemap, and the program will read that file and extract all its links. It will then load each of those pages in turn and perform a cumulative Find operation on each one.
I see there's no good answer yet. I have tried the "Unused CSS online tool" and it seems to work OK for public sites. The problem is if you have one CSS file serving both your public website and an intranet (for example: a WordPress site plus a login area for registered users). The intranet pages won't be crawled and you will lose those CSS styles.
My next try will be using gulp + uncss:
https://github.com/ben-eb/gulp-uncss
You have to define all the URLs of your site (external and internal), and (maybe; I'm not sure) if you are running the site logged in with a username and password in your browser, gulp + uncss may be able to reach the internal URLs.
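For reference, the basic usage from the gulp-uncss README looks roughly like this (the file names and URLs below are placeholders):

// gulpfile.js - after: npm install gulp gulp-uncss
var gulp = require('gulp');
var uncss = require('gulp-uncss');

gulp.task('clean-css', function () {
  return gulp.src('site.css')
    .pipe(uncss({
      // Every page (local file or URL) whose markup should count as "using" a selector
      html: ['index.html', 'templates/**/*.html', 'http://example.com/some-internal-page']
    }))
    .pipe(gulp.dest('./out'));
});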
Update: I see unused-css online tool has a login solution!

How to implement a "news" section in asp.net website?

I'm implementing a "news" section in an asp.net website. There is a list of short versions of articles on one page, and when you click one of the links it redirects you to a page with the full article. The problem is that the article's text on the second page will come from a database, but the articles may vary - some may have links, some may have an image or a set of images, they may be formatted differently, etc. The obvious solution that my friend has come up with is to keep the article in the database as HTML, including all links, images, formatting, etc. Then it would simply be displayed on the second page. I feel this is not a good solution because if, for example, we decide to change the CSS class of some div inside this HTML (let's say it is used in all articles), we will have to find it and change it in every single record of the articles table in our database. But on the other hand, we have no idea how to do it differently. My question is: how do you usually handle something like this?
I personally don't like the idea of storing full HTML in the database. Here's an attempt at solving the problem.
Don't go for a potentially infinite number of layouts. Yes, all articles may be different, but if you stick to a few good layouts then you're going to save yourself a lot of hassle. These layouts can be stored as templates, e.g. ArticleWithImagesAtTheBottom, ArticleWithImagesOnLeft, etc.
This way, your headache is less as you can easily change the templates. I guess you could also argue then that the site has some consistency in layout.
Then for storage you have at least 2 options:
Use the model-per-view approach and have, e.g., an ArticleWithImagesAtTheBottomModel which would have properties like 1stParagraph, 2ndParagraph, MainImage, ExtraImages
Parse the article according to the template you want to use, e.g. look for a paragraph break if you need to.
Always keep the images separate and reference them in another column/table in the db. That gives you most freedom.
By the way, option #2 would be slower as you'd have to parse on the fly each time. I like the model-per-view approach.
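Just to make the model-per-view idea concrete, the stored data for one article might be nothing more than the fields its template knows how to place - shown here as a plain JavaScript object purely for illustration (property names adapted from the answer above):

// One record per article, tied to the template it was written for
var article = {
  template: 'ArticleWithImagesAtTheBottom',
  firstParagraph: 'Short intro text...',
  secondParagraph: 'The rest of the story...',
  mainImage: '/images/articles/123/main.jpg',
  extraImages: ['/images/articles/123/a.jpg', '/images/articles/123/b.jpg']
};
// The template decides the markup and CSS classes at render time,
// so a site-wide style change never touches the stored article data.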
Essentially I guess I'm trying to say: beware of making things too complicated. An infinite number of layouts means an infinite number of potential problems. You can always add more templates as you go if you really want to expand, but you're probably best off starting with, say, 3 or 4 layouts.
EDITED FROM THIS POINT:
Actually, thinking about it this may not be the best solution. It could work depending on your needs, but I was wondering how the big sites do it. If you really need that much flexibility, you could (as I think was sort of suggested) use a custom markup. Maybe even a simplified or full wiki markup. I'd still tend toward using templates in general, but if you need to insert at least links and images then you can parse for those.
Surely the point of storing HTML with logically placed <div>s is that you DON'T have to go through every bit of HTML you store to make changes to styles?
I presume you're not using inline styles in your stored HTML, and are referencing an external CSS file, right?
The objection you raise to your colleague's proposal does not say anything about the use of a DB. A DB as opposed to what: files? Then it's all the same. You want to screw around with the HTML, you have to do it on "every single record." Which is not any harder than "on every single file." Global changes are a bitch unless you plan for it by, say, referencing an external CSS. But if you're going to have millions of news articles, you had better plan on versioning the CSS as well.
Anyway, the CMSes do what you're thinking of doing. Using a DB is a fine way to go. How to use it would depend on knowing the problem more intimately.
Have you looked into using free content management systems? I can think of a few good ones:
Joomla
Drupal
WordPress
TONS of others... just do some googling.
Check out this Wikipedia article: http://en.wikipedia.org/wiki/List_of_content_management_systems
