im trying to learn how to use WEBSCRAPER.IO so i can build a database using a webpage's data (For a project) - But when i try to do as the video shows me i cannot get the scrape to go through the pages because the URL is different to that of the video.
Video Example
www.webpage/products/laptops?page=[1-20]
The webpage i want to scan
www.webpage/products/laptops/page/2/
So how would i create the Start URL for webscraper to go through the 20 pages
when i try to use the example from the video it only scans 1 page of my chosen webpage
I have tried veriations like
www.webpage/products/laptops/page/page=[1-20]/
www.webpage/products/laptops/page=[1-20]/
www.webpage/products/laptops?page=[1-20]/
but none of them seem to work. Im stuck.
Could anybody provide my with any advice.
Thank you.
Hi!
Could someone tell me why using requests.get(url) with different url's I am getting same page. Story is:
I am scraping webpage to see products by brands. Thus I am generating urls based on brand list in page (I am retrieving them with xpath).
For other pages it works, however for this one is not.
So I wander maybe there is kind of protection from scraping the page? As when I am pasting those generated urls in chrome - it gives me page with specific Brand products I need.However with requests.get I am ending up at same page.
Also- maybe you can share easy- to - grasp info how requests works? How it reaches page source?
Zillion thanks to contributors !
I'm using a custom Genesis child theme and lately I've been noticing that many false articles have been showing up on webmaster tools. They look something like this:
I haven't written these nor are they topics my site focuses on so I have no clue why they are showing up. So far, I've had to delete about a hundred of these. I read on a forum that this can be due to my theme generating bad urls but I'm not sure what that means nor do I know how to fix it. What can be causing this?
I believe that this problem is due to your website being hacked or Google is trying to Crawl or follow a link within your content that is not really a link.
This is what webmaster tool tells you about the problem:
In Crawl Errors, you might occasionally see 404 errors for URLs you don't believe exist on your own site or on the web. These unexpected URLs might be generated by Googlebot trying to follow links found in JavaScript, Flash files, or other embedded content.
To find out if your website has been hacked. First get this total = WordPress number of pages + number of post + number of categories + number of PDF or files + Images. Then do a google search using the following query (without the quotes) "site:yourdomain.com" if the result number is exaggerated greater than the calculated total then your website is definitely hacked.
If you believe that your website is not hacked try to find from where these links are being generated. Here is the trick: Go to the Web Master Tool report and click on one of those links, check the "Linked from" tab. There should be one or many possible pages listed from where these unexpected links are coming from.
Two possible Outcomes:
The page from where the link is found is from your own website: Go
to that page and open the source code, do a Ctrl+F search for that
link, if found check what section or content is generating this
problem.
The page from where the link is found is NOT from your own website:
In this case try to contact the owner of the other site and ask the
link to be removed, if not possible I highly recommend you to create
a 404 page within your WordPress installation with some useful
links. Google how to do this, there are plenty of resources.
Hope this helps
Hypothetical Situation: I have a small obscure website called "miniatureBoltsInCarburetors.com" which provides content about the miniature bolts which hold a carburetor together as well as some general related automotive information. My site also has a single page which allows someone to find the missing bolt in their carburetor, and while no one will access this page directly from my website, one billion other popular automotive sites have embedded this single page in their website using an iframe, yet not included a link back to my site.
I recognize that this question is related to SEO which is considered off topic, however, all of the many SEO related forums discuss the marketing steps one could take, and not the programming steps or strategies, and hope others will allow this question to be answered here.
I wish my site "miniatureBoltsInCarburetors.com" to be ranked high for general automotive searches. What could I do to allow the 3rd party sites which include an iframe back to my site to improve my ranking? Could using JavaScript in the iframe to create a link on the parent page provide any value? What about when my server renders the page, use PHP to get the referring URL from $_SERVER, and include it in the content?
I am providing a solution here. Not sure if this is what you want though.
In your page which is used by other websites in iframe you can put below Javascript. This javascript checks if the webpage is opened inside an iframe or directly in browser.
So using this check when you see it is opened in an iframe. On click on something navigate to your website.
// This works in all browsers
function inIframe () {
try {
return window.self !== window.top;
} catch () {
return true;
}
}
Also for your reference you can check the below URL.
How to prevent my site page to be loaded via 3rd party site frame of iFrame
Hope it helps.
Iframes are seen seperate pages by Google. Your approach may end up being penalized due to being sourced from untrusted site. According to Google Webmaster Support
Frames can cause problems for search engines because they don't
correspond to the conceptual model of the web. Google tries to
associate framed content with the page containing the frames, but we
don't guarantee that we will.
One of the best approaches to rank higher for a specific keyword is, make multiple related sites. In your case a 3-4 paged site about carburetors, bolts, other things your primary site contain would do it. These mini sites will be more intense about the subject due to less page count. Of course they should contain unique articles on each page. Then link from mini websites to primary websites and you can see the dramatic change.
In fact, the thing you are trying to do was a tactic to rank competitors down worked occasionally a few years ago. Now, it is still a risk.
I see. You don't want to mess up the page for your own site, but you want to do something with all the uncredited embeddings.
The solution is fairly simple:
Create a copy of the page.
Switch your site to use the copy.
Amend the version that countless other sites are embedding, so that there is a small link back to you. Or, add an iframe blocker script that will load your site.
If the page is active (ie user interacts with it to find the missing bolt) you could include a sales message with the response encouraging the user to visit your site.
I think that your goal is getting your link onto these other sites long enough to get indexed by Google before it is noticed by the people doing the embedding, so it's a bit of a balancing act.
I see conflicting advice about how Google indexes iframes. You should use a PageRank checker to see if the existing iframe page url has PageRank, and compare it to the page that you embed it on.
I dont Think you need to worry ,.
Google bot does seem to crawl through Iframes ,but the Web-Page Containing that Iframe is not Credited for that Content .. In other Words,, Page-Ranking of that particular Web-Page do not Change due to Contents from Iframe .
is IFrame crawled by Google?
Do robots crawl iframes?
Is it possible to track if someone links to data on my site? Specifically if my data is used in a site dynamically generated by a developer program? I would like to know if someone is blatantly passing off my site's data as their own. There are obviously ways around directly linking to content, such as content manipulation or even manual manipulation. But if someone where to link(or directly add word for word or manipulate) my content into their website, is there a way to track it?
Can I avoid someone being able to scrape my website at all, or is everything just up for grabs?
the best answer and the easy one is called GOOGLE - WEBMASTER TOOLS!
HERE
actually doing that is very hard and you would need to crawl the web to discover those links that address to your pages... dynamic content as well is linked so it would be find by google as well.
this tool will allow you to see outer links that address to your site.. and you can check them.
for extra - you can monitor requests and traffic to your site and find ip's that are using the same page over and over again. that can tell u that an outer page is dynamically loading content from your web page.
EDIT:
here is a good article in this subject: link - scroll down and you can see the use of google
webmaster tool with some other progrmas and method.
here is a good start guide to the google webmaster: link
ENJOY!