Is it easier to scrape the AMP versions of webpages? - web-scraping

I'm working on a web scraper that aggregates newspaper articles. I know the AMP protocol mandates a stripped-down version of JavaScript, and I also know that JavaScript (in part) enables website administrators to detect or prevent scraping. So logically, I figured it would be easier to scrape AMP websites. On the other hand, if this were true, I presume StackOverflow would be on top of it, but I haven't found a single thread confirming my inference. Am I correct, or am I overlooking something?

I would say that AMP pages are definitely easier to scrape because there is virtually no custom JS code. Many sites insert content with JS or AJAX. AMP limits which libraries a page can use, so AMP pages load fewer of them than a regular site.
Furthermore, if you want to scrape content generated by JavaScript, you can use Selenium. If not, PHP is the way to go (IMHO), or BeautifulSoup in Python.
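As a rough illustration of the approach, a page that has an AMP version usually advertises it with a `<link rel="amphtml">` tag in its head, so a scraper can look for that link and fetch the AMP URL instead. The sketch below uses only the Python standard library; the sample markup, the `news.example.com` URLs, and the helper names `AmpLinkFinder` and `find_amp_url` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AmpLinkFinder(HTMLParser):
    """Collects the href of the first <link rel="amphtml"> tag."""

    def __init__(self):
        super().__init__()
        self.amp_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_href is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "amphtml" in rel_values:
            self.amp_href = attr_map.get("href")


def find_amp_url(html, page_url):
    """Return the absolute AMP URL advertised by the page, or None."""
    finder = AmpLinkFinder()
    finder.feed(html)
    if finder.amp_href is None:
        return None
    return urljoin(page_url, finder.amp_href)


# Hypothetical article markup, used only for illustration.
sample_html = """
<html><head>
<title>Example article</title>
<link rel="canonical" href="https://news.example.com/story">
<link rel="amphtml" href="/amp/story">
</head><body><p>Body text</p></body></html>
"""

amp_url = find_amp_url(sample_html, "https://news.example.com/story")
print(amp_url)
```

If `find_amp_url` returns None, the page has no advertised AMP version and the scraper would fall back to the regular page.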
Happy scraping!

Related

Is it possible to get an amazing Lighthouse score while using a WordPress page builder?

I'm a front-end engineer working in e-commerce and marketing, and one of my routine tasks is finding out why clients' websites are so slow. They are on different CMSs, mostly WordPress, and a constant problem I come across is that page builders ship with a ton of code that bogs the site down. This code falls under 'render-blocking resources'; see the screenshot of the issue below.
This file is entirely minified and the website isn't even large (it's still in staging, in fact). Is it possible to get an outstanding Lighthouse score when your site is built on a CMS and uses a small number of plugins (or apps, in the case of Shopify)? The majority of the clients whose sites I gauge are on a CMS and get a bad score because of how much data browsers have to request when loading 200,000 apps and plugins. I'm exaggerating of course, but even when a client has a small site, if it was built with a page builder and has a few popular plugins like Gravity Forms, the site still suffers a little.
In theory, yes, you can have a page builder that doesn't impact the score. In practice, all of the page builders that I have personally worked with are bad for Lighthouse/PSI scores.
The main reason is that page speed wasn't a conscious priority until Google started encouraging more awareness of a site's perceived performance. So the teams that built page builders didn't take it into account, and it's probably not an easy task to change their codebases so that page builders are more performant.
A page builder would need to follow a few rules to be performant:
1.) No redundant asset code. I noticed a few page builders that load all the code they might need for any section that a user might add, even if the section is not being used. Asset code should be loaded only when it is needed.
2.) Properly sized images. I noticed the Shogun page builder for Shopify is really bad at this, as the images are automatically oversized.
3.) Automatically lazy-loaded images. I noticed PageFly has lazy loading, but it has to be enabled individually for each template.
4.) No elements created with JavaScript. To reduce CLS and improve LCP, HTML elements should not be created entirely using JavaScript.
If page builders followed the above rules, and you replaced the original CSS and JS of the page (since they would be redundant), you could have a page builder that resulted in very performant pages.
I haven't found a page builder that meets these standards, though. Maybe as page speed becomes more important, more teams will be conscious of performance in their page builders.
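Some of these rules can be checked mechanically. The following is only a sketch of such a check, not a replacement for Lighthouse: it scans a page's HTML for external scripts in the head that are neither async, deferred, nor modules, for images without `loading="lazy"`, and for images missing explicit width and height. The class name `PageAudit`, the issue labels, and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Flags markup patterns that tend to hurt Lighthouse scores."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "script" and self.in_head and "src" in attr_map:
            # A classic external script in <head> blocks rendering unless it
            # is async, deferred, or loaded as a module.
            non_blocking = (
                "async" in attr_map
                or "defer" in attr_map
                or attr_map.get("type") == "module"
            )
            if not non_blocking:
                self.issues.append(("render-blocking script", attr_map["src"]))
        elif tag == "img":
            src = attr_map.get("src") or ""
            if attr_map.get("loading") != "lazy":
                self.issues.append(("image not lazy-loaded", src))
            if "width" not in attr_map or "height" not in attr_map:
                # Missing dimensions let the layout shift when the image loads.
                self.issues.append(("image missing width/height", src))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical page-builder output, used only for illustration.
sample_page = """
<html><head>
<script src="/builder/all-sections.js"></script>
<script src="/builder/slider.js" defer></script>
</head><body>
<img src="/hero.jpg" width="1200" height="600">
<img src="/footer-logo.png" loading="lazy">
</body></html>
"""

audit = PageAudit()
audit.feed(sample_page)
for kind, resource in audit.issues:
    print(kind, resource)
```

One caveat on the lazy-loading check: the image that ends up being the LCP element (often the hero image) is usually better loaded eagerly, so a flag on `/hero.jpg` is a prompt to look, not necessarily a defect.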

Why do people purchase website themes?

I am a newbie in Web designing. I learnt about templates that are available online that people can purchase and change the content according to their requirements.
One thing puzzled me. Why do people purchase themes when they can copy the code using the VIEW SOURCE option? I have tried searching for the answer, but Google has failed me.
Also, if I am not using WordPress, can I still use WordPress themes for my website?
Thanks!!
Because, while often technologically possible, that's still copyright infringement.
Because WordPress themes usually are more than just their raw HTML/CSS/JS. The PHP logic is frequently pretty complex and important.
1.) Copying source code is stealing. If you're copying the html/css, you'll probably copy the images, too. Definitely could be legal issues in that.
2.) If you decide to try and steal it anyways, you'll notice in a lot of cases they've used Iframes or JavaScript to pull the code in from elsewhere, where you do not have access to it.
3.) Copying CSS and HTML wouldn't do much for a WordPress site. You won't be able to copy any of the server-side stuff.
4.) You can't really use the theme because a Wordpress theme comes with functions and much more. You could use the CSS with a lot of hacking.
With View Source we can only copy the HTML and the referenced style sheets, not the full code of the site.
According to the law, copying a web design's structure and code is a crime, and it could definitely lead to legal issues.
If you are a newbie in web design, I suggest following some of the top web design blogs:
https://blog.hubspot.com/marketing/web-design-tips
https://blog.techreshape.com/5-web-design-tips-for-a-better-website-user-experience/

How do I integrate my website with AICC for LMS

I want my website to be published as content in an LMS. One of the experts suggested using either SCORM or AICC, and that we should make a wrapper around our website and then publish it on the LMS. I tried to search and read about SCORM and AICC but was not able to get any idea of how the wrapper has to be built. Can someone point me to a blog or outline the steps to achieve this?
Essentially, you have a few e-Learning Content Libraries out around the internet in a free and paid capacity. The primary job is to locate the LMS Runtime API. Your implementation is possible, but it would take some careful packaging and some custom work to get it done.
Every e-learning developer commonly has to build a Shareable Content Object player. This can be done using an IFRAME that loads pages. This allows the main page the LMS launches to load once and unload once. Another route modern content takes is to use AJAX to load in snippets of HTML and get a similar paging feel without the overhead of the IFRAME.
The next hurdle is trying to mitigate the cross-domain policy...
As you mentioned, you could go AICC, mainly because it's cross-domain enabled. AICC is a bit more limited in what you can record, though. Keep in mind this standard predates XML, so we are in the text-with-delimiters category of configuration.
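To give a feel for that text-with-delimiters style, here is a minimal sketch of the AICC HACP exchange as it is commonly described: the LMS launches the content with `AICC_SID` and `AICC_URL` in the query string, and the content posts a form-encoded `GetParam` command back to that URL and gets a key=value reply with bracketed groups. The launch URL, the version string "4.0", and the sample reply are all invented for illustration, so check the exact fields against your LMS's documentation.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def read_launch_params(launch_url):
    """Pull the AICC session id and LMS callback URL out of a launch URL."""
    query = parse_qs(urlparse(launch_url).query)
    # Normalise parameter names, since LMSs differ in the casing they send.
    params = {key.lower(): values[0] for key, values in query.items()}
    return params.get("aicc_sid"), params.get("aicc_url")


def build_getparam_body(session_id, version="4.0"):
    """Form-encoded body for a GetParam request (version is an assumption)."""
    return urlencode(
        {"command": "GetParam", "version": version, "session_id": session_id}
    )


def parse_hacp_response(text):
    """Split a HACP-style reply into header fields and [Group] sections."""
    header, groups, current = {}, {}, None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.lower().startswith("aicc_data="):
            # The data block may begin on the same line as the aicc_data key.
            line = line[len("aicc_data="):].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = groups.setdefault(line[1:-1].lower(), {})
            continue
        key, sep, value = line.partition("=")
        if sep:
            target = header if current is None else current
            target[key.strip().lower()] = value.strip()
    return header, groups


# Hypothetical launch URL and LMS reply, used only for illustration.
launch_url = (
    "https://content.example.com/course/index.html"
    "?AICC_SID=abc123&AICC_URL=https%3A%2F%2Flms.example.com%2Fhacp"
)
sample_reply = """error=0
error_text=Successful
version=4.0
aicc_data=[Core]
Student_ID=jdoe
Student_Name=Doe, Jane
Lesson_Status=incomplete
Score=
"""

session_id, lms_url = read_launch_params(launch_url)
request_body = build_getparam_body(session_id)
header, groups = parse_hacp_response(sample_reply)
print(lms_url, request_body, groups["core"]["lesson_status"])
```

The actual HTTP POST to the LMS is left out on purpose; the point is only the shape of the data the wrapper has to read and write.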
SCORM is going to have a cross-domain security (sandbox) error associated with the JavaScript on your website trying to talk to the LMS. There is a way around this, possibly by including your base index.html page, which can use the JavaScript, CSS, and images on your website. But if your website changes pages, we are back to my comment above: we need to wrap this in an IFRAME.
In the IFRAME scenario, you'd have to put your e-learning standard library there, and your sub-pages in the IFRAME could talk to the parent. This ensures your main page and all sub-pages have a line of sight to the parent of the frame and can make calls if it's present.
All of this really depends on how you built your website. We run content off media servers similarly and have these same hurdles, but the content is meant to be activities, games, and tests/quizzes. If you adjust your website to try to communicate this way, you'll have to pad it for working with and without an LMS and any runtime data.
You can still launch a website as an asset, though, with not graded as an option. Some commonly do this as part of a lesson where the student can read something, then take a test on it in a subsequent assignment.
Good Luck,
Mark

Are there good tools which help migrating an existing Website to a CMS-based site?

I'm looking for tools or libraries that load simple but old existing websites and produce output which can be loaded into WordPress or another CMS. The goal is to keep the existing website's navigation structure and content.
Any hints?
What I've discovered is that it really depends upon the CMS. I would recommend a "tag-pair"-based CMS like the new Craft or ExpressionEngine's free Core version (there're others too) where you drop in a looping control area, replace the fields with the tags and then it just runs. I personally like those because they offer a cleaner separation between content and design.
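To make the tag-pair idea concrete, here is a toy renderer, a sketch only: the markup between an opening and closing tag is repeated once per content entry, and single field tags inside it are replaced with that entry's values. The `{entries}...{/entries}` syntax, the `render` function, and the sample data are hypothetical; Craft and ExpressionEngine each have their own tag syntax.

```python
import html
import re

# Hypothetical tag syntax, invented for this illustration.
PAIR = re.compile(r"\{entries\}(.*?)\{/entries\}", re.DOTALL)
FIELD = re.compile(r"\{(\w+)\}")


def render(template, entries):
    """Repeat the markup inside the tag pair per entry, filling field tags."""

    def expand_pair(match):
        row_markup = match.group(1)
        rows = []
        for entry in entries:
            # Unknown fields render as empty strings; values are HTML-escaped.
            rows.append(
                FIELD.sub(
                    lambda m: html.escape(str(entry.get(m.group(1), ""))),
                    row_markup,
                )
            )
        return "".join(rows)

    return PAIR.sub(expand_pair, template)


# The existing static markup stays as-is; only the repeating part gets tags.
template = '<ul>{entries}<li><a href="{url}">{title}</a></li>{/entries}</ul>'
entries = [
    {"title": "About", "url": "/about"},
    {"title": "Contact", "url": "/contact"},
]

rendered = render(template, entries)
print(rendered)
```

The design point is that the designer's HTML is the template: the CMS only needs to know where the loop starts and ends and which fields go where.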
I'm trying to learn WordPress now and it's the reverse of what you want: you create template environments, and half the design seems to be controlled by the code. Great if you want to swap generic templates around, but if you customize them you're doing a lot of work. I'm looking at a site I inherited and the conditional statements are a nightmare: half in the templates and the other half in the plugins themselves (Events, I'm looking at you).

Seeking opinions on the best CMS platform for designers and front end developers

I have so far been a dedicated Wordpress user, but have been researching other CMS solutions of late, specifically, looking for something that could potentially allow me to EASILY convert an XHTML site into a CMS site for most projects.
I don't care for PHP - and find adding the appropriate tags to Wordpress a bit of a challenge. I am building fairly simple sites and simple blogs - I don't need a lot of extensibility.
I have heard good things about ModX and Textpattern and have today installed them on a localhost and started playing with them. Each has a bit of a learning curve, but I like what I see in terms of their tag codes (which looks a lot more like html than php).
I'd like to design websites where most pages are distinct from one another and don't necessarily have to fit into a template. I am looking for an automagical solution where I can directly input my html, css, javascript code into a CMS platform and it spits out my website exactly as I'd imagined it. Is this just a dream?
With so many solutions out there, I'm just wondering about the preferences of other web designers and non-hardcore developers.
I guess it's like asking whether to get an iPod, a BlackBerry, an N900, an HTC, etc.
While there are vast differences in how they work under the hood, for the most part you can expect about the same functionality, and it comes down to the provider and a particular feature you prefer in one over the other. In this case, instead of your carrier you need to worry about your host: whether they offer the PHP version required, the database you need, etc. But for the most part, you should be fine.
Modx Evo requirements : http://modxcms.com/learn/general-requirements.html
Modx Revo requires: http://rtfm.modx.com/display/revolution20/Server+Requirements
TextPattern requirements: http://textpattern.com/about/119/system-requirements
I find this an important thing to start with. While my server can handle either, you never know what server your clients are running (there are still some old configurations out there).
I haven't used Revo much yet, but it looks like there are quite a few really nice enhancements. One of them is installing packages. Previously you had to create TVs, snippets, and chunks yourself, so installing a package wasn't always so straightforward.
MODX Evo, however, has the import HTML feature, where you can put your HTML site in a folder, set the tag that holds the content, and MODX will automatically create and fill the resources (pages). Pretty nifty, but I've only used it to see what it does :)
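The general idea behind such an import can be sketched in a few lines. This is not how MODX implements it, just an illustration of the technique under simple assumptions: walk a folder of .html files, take each page's title and the inner HTML of one designated element, and build a record per page that a CMS importer could consume. The names `ContentExtractor` and `import_site`, the element id `content`, and the sample page are invented for the example.

```python
import html
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Captures the <title> text and the inner HTML of one element by id."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.title = ""
        self.parts = []
        self.depth = 0  # greater than 0 while inside the content element
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag not in VOID_TAGS and dict(attrs).get("id") == self.content_id:
            self.depth = 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth:  # skip the content element's own closing tag
                self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.depth:
            self.parts.append(html.escape(data, quote=False))


def import_site(folder, content_id):
    """Build one page record per .html file, keyed by its relative path."""
    root = Path(folder)
    pages = []
    for path in sorted(root.rglob("*.html")):
        extractor = ContentExtractor(content_id)
        extractor.feed(path.read_text(encoding="utf-8"))
        pages.append({
            "path": path.relative_to(root).as_posix(),
            "title": extractor.title.strip(),
            "content": "".join(extractor.parts).strip(),
        })
    return pages


# Hypothetical one-page site written to a temporary folder for the demo.
with tempfile.TemporaryDirectory() as site_dir:
    Path(site_dir, "index.html").write_text(
        "<html><head><title>Home</title></head><body>"
        '<div id="nav">menu</div>'
        '<div id="content"><h1>Welcome</h1><p>Hi<br>there</p></div>'
        "</body></html>",
        encoding="utf-8",
    )
    pages = import_site(site_dir, "content")

print(pages)
```

The relative path of each file is kept so the old navigation structure can be rebuilt on the CMS side; badly nested legacy markup would need a more forgiving parser than this sketch.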
MODX lets you get away with not knowing any PHP, but as your requirements change, you might find yourself needing an extra feature or script that is easy to add. I've found that using MODX for the past 2 years or so has vastly improved my PHP abilities because of these additional functions required.
While I haven't worked with textpattern much I do remember installing it and playing around, but it just wasn't love at first sight, like with modx.
I also think modx is pretty easy for clients to navigate with some minimal coaching, but that is more of a comparison with joomla and drupal etc.
Bottom line, I think few people have spent enough time with both to REALLY understand both in terms of differences/advantages and mostly what you'll get is "the one I use is better; evidence being me using it" (myself included) or worse yet "I see you're mentioning CMSes, here's my favorite one"
To wrap up, one more important aspect (and this one is really important if you don't know your way around backends and PHP much) is the community. If you run into problems, how likely is the MODX or Textpattern forum community to help? A good indication is community size. While I can't see the numbers for Textpattern, the MODX forums currently show "255 Guests, 46 Users (3 Hidden)".
good luck
If you're comfortable with HTML, then Textpattern's method of adding markup-like tags into templates for content will probably be relatively straightforward for you to pick up. The biggest issue most people seem to have in understanding Textpattern is its semantic model. There's a great page on the semantic model on Textpattern's Textbook site that should help.
I'm a big fan of Textpattern for building simple but powerful small to medium sites that do not require complicated user groups. It's quick to build well designed sites and the admin interface is really simple.
The Textpattern community is extremely friendly and helpful if you ever need help. Best place to ask things is on the Textpattern Forum.
You may want to consider sNews CMS in addition to MODx/Textpattern. It is quite minimalistic:
The PHP code is a single file (excluding language translations).
You need to import one SQL script in phpMyAdmin and you are set.
After that it's only styling and content.
