Can I track who is linking or manipulating my site's data? - web-scraping

Is it possible to track if someone links to data on my site? Specifically if my data is used in a site dynamically generated by a developer program? I would like to know if someone is blatantly passing off my site's data as their own. There are obviously ways around directly linking to content, such as content manipulation or even manual manipulation. But if someone where to link(or directly add word for word or manipulate) my content into their website, is there a way to track it?
Can I avoid someone being able to scrape my website at all, or is everything just up for grabs?

the best answer and the easy one is called GOOGLE - WEBMASTER TOOLS!
HERE
actually doing that is very hard and you would need to crawl the web to discover those links that address to your pages... dynamic content as well is linked so it would be find by google as well.
this tool will allow you to see outer links that address to your site.. and you can check them.
for extra - you can monitor requests and traffic to your site and find ip's that are using the same page over and over again. that can tell u that an outer page is dynamically loading content from your web page.
EDIT:
here is a good article in this subject: link - scroll down and you can see the use of google
webmaster tool with some other progrmas and method.
here is a good start guide to the google webmaster: link
ENJOY!

Related

Tracking links within my site

I want to track particular links on my site to see where they come from. For example, I want to know which links on my navigation are being clicked, so if something is not being clicked I could potentially remove it.
I have been using UTM's, super easy, but results in skewed analytics data.
I looked into Google Tag Manager, but I don't want to slow down my website. I can change the site easily, so not sure if this is the best solution.
I found an article dated 2008 that says I can do this:
https://www.example.com/?from=topnav
Is that still valid? Is there a better way. I can't seem to find any information on this and assume somebody wants to acquire this information.
Thank you.
I have been using UTM's, super easy, but results in skewed analytics
data.
UTM codes are meant to track inbound traffic. Don't use them to track internal/outbound navigation, as it will seriously mess up your reporting.
I looked into Google Tag Manager, but I don't want to slow down my
website.
GTM is loading async, just like GA, so performance-wise they are equivalent.
I found an article dated 2008 that says I can do this:
https://www.example.com/?from=topnav
By default GA will not track link clicks. You can indeed add parameters to URLs and then use those to build custom reports and see which links are being clicked.
Since what you're trying to do is custom implementation, you won't find a single best answer, it's up to you to implement something that fits your needs. These are some examples:
https://analytical42.com/2017/track-internal-links-google-analytics-gtm/
https://www.gravitatedesign.com/blog/can-google-analytics-track-link-clicks/

Google Analytics, iframes & cross-domain

I have GA on every page on one domain (actually not me, but my company, whose programmer needs auditing). Just the default code (Classic version, ga.js), no special accommodations whatsoever that I've seen or know of. Bare minimal if any configuration past registering the service with the main site...
All the pages are either aspx or static HTML. It's common practice for this guy to embed pages on the site within other pages on the site in iframes, where both the parent (top-level) & child (embedded) pages contain the GA script.
I don't really know much at all about GA, have never worked with it, but I do suspect that might result in extra hits being counted by GA or something, that that may be messing with the metrics. But then I've read stuff about GA using first-party cookies so by default pages loaded in iframes won't be tracked/counted... I could really use some clarification on this, please.
Then our programmer frames pages from the main site in pages on other sites that we own, that are on different domains. So then there's this cross-domain business, with no segregation of sources, because they really don't care much. So what should be the outcome of that? The external sites' pages don't have the GA code.
However, we're rebuilding one of those other sites - actually I am, for the most part - and the programmer told me to just copy and paste the same exact GA script used on the main site into that one. So, it's a different domain. That wouldn't work as-is, would it? Wouldn't there have to be some sort of special configuration, setting of the domain, something?
I'd really appreciate if someone could tell me more about the scenarios described above. Thanks in advance.
In the Google Analytics developer menu, you can create a new 'profile' for this new site. The analytics will then be tracked for just that one site, not for all. In theory, it is possible to use one GA.js for all your sites, but it kind of kills the whole concept of Google Analytics, so it's not recommended.
Your really shouldn't be using iframes anymore IMO. There are reasons to use them like embedding code for tracking etc, I think, even GA uses iframes. But, generally Google doesn't like them because a lot of spammers use them to try and fool the Google Crawler.
Also, it get's very complicated to understand what is going on within GA.
To answer your question: Each iframe is like an independent webpage completely separated from the other webpage (for security reasons). So when Google or a web browser goes to your website it will do this:
Load your main html document.
Render that page.
See that you have an iframe.
Load that page in the iframe.
Render the iframe.
Now, if you don't have GA installed on the iframe page it will not track the page being loaded.
But if you do put GA in the iframe it will record when the iframe is loaded or the webpage is loaded.
But, remember that one of the main reasons of having GA is to see where your customers are coming from and why. If you have an iframe of another webpage, you really don't know if that is because a customer is:
A) visiting your website from the page directly.
OR
B) the customer is visiting that page through an iframe on another page.
It can get very complicated
You must generate a new tracker for each domain you are using. Otherwise what is to stop someone from just copying your GA code, and putting it on their webpage.

Cross Domain iFrame auto height resizing

I've been googling and looking at various options but could not seem to be able to find a perfect solution that works in what I'm attempting...so needing some help here.
The situation/environment that I have is the following:
Parent page (which has the iframe) - is on a different domain, and the only control I have is a portion of the body tag, where it is updated via an admin console using html/WYSIWG editor. No access to head tag or even hosting jscript in their domain.
Child page (iframe) - is hosted in our domain, and we have full control.
The parent site is actually 3rd party online stores where we have products there, and we want to put in common information that we can control on our end without having to edit each individual product listing one by one.
I've tried alot of options found but it does not seem to work as either they need to include in js file or access to the head tag in the parent page.
So wondering if there are any other options that can help us on this?
I'm afraid you need access to JavaScript on both domains to do this.
Could you get the 3rd online store to host a small JS library that all their clients could then use to solve this problem? I work on a project that allows third parties to add in iFrames and produced this little project for just this reason. When any one say they want to be able have their iFrame resize to content, we point to the iFrame js file and say include this on your page.
https://github.com/davidjbradshaw/iframe-resizer
Sorry, that's not quite the answer your after, but trying asking the store to support this and they might be open to the idea, as I expect others have the same issue with their site.

Web crawlers and IFrames

Hypothetical Situation: I have a small obscure website called "miniatureBoltsInCarburetors.com" which provides content about the miniature bolts which hold a carburetor together as well as some general related automotive information. My site also has a single page which allows someone to find the missing bolt in their carburetor, and while no one will access this page directly from my website, one billion other popular automotive sites have embedded this single page in their website using an iframe, yet not included a link back to my site.
I recognize that this question is related to SEO which is considered off topic, however, all of the many SEO related forums discuss the marketing steps one could take, and not the programming steps or strategies, and hope others will allow this question to be answered here.
I wish my site "miniatureBoltsInCarburetors.com" to be ranked high for general automotive searches. What could I do to allow the 3rd party sites which include an iframe back to my site to improve my ranking? Could using JavaScript in the iframe to create a link on the parent page provide any value? What about when my server renders the page, use PHP to get the referring URL from $_SERVER, and include it in the content?
I am providing a solution here. Not sure if this is what you want though.
In your page which is used by other websites in iframe you can put below Javascript. This javascript checks if the webpage is opened inside an iframe or directly in browser.
So using this check when you see it is opened in an iframe. On click on something navigate to your website.
// This works in all browsers
function inIframe () {
try {
return window.self !== window.top;
} catch () {
return true;
}
}
Also for your reference you can check the below URL.
How to prevent my site page to be loaded via 3rd party site frame of iFrame
Hope it helps.
Iframes are seen seperate pages by Google. Your approach may end up being penalized due to being sourced from untrusted site. According to Google Webmaster Support
Frames can cause problems for search engines because they don't
correspond to the conceptual model of the web. Google tries to
associate framed content with the page containing the frames, but we
don't guarantee that we will.
One of the best approaches to rank higher for a specific keyword is, make multiple related sites. In your case a 3-4 paged site about carburetors, bolts, other things your primary site contain would do it. These mini sites will be more intense about the subject due to less page count. Of course they should contain unique articles on each page. Then link from mini websites to primary websites and you can see the dramatic change.
In fact, the thing you are trying to do was a tactic to rank competitors down worked occasionally a few years ago. Now, it is still a risk.
I see. You don't want to mess up the page for your own site, but you want to do something with all the uncredited embeddings.
The solution is fairly simple:
Create a copy of the page.
Switch your site to use the copy.
Amend the version that countless other sites are embedding, so that there is a small link back to you. Or, add an iframe blocker script that will load your site.
If the page is active (ie user interacts with it to find the missing bolt) you could include a sales message with the response encouraging the user to visit your site.
I think that your goal is getting your link onto these other sites long enough to get indexed by Google before it is noticed by the people doing the embedding, so it's a bit of a balancing act.
I see conflicting advice about how Google indexes iframes. You should use a PageRank checker to see if the existing iframe page url has PageRank, and compare it to the page that you embed it on.
I dont Think you need to worry ,.
Google bot does seem to crawl through Iframes ,but the Web-Page Containing that Iframe is not Credited for that Content .. In other Words,, Page-Ranking of that particular Web-Page do not Change due to Contents from Iframe .
is IFrame crawled by Google?
Do robots crawl iframes?

Flex 3: Project Architecture & SEO

I've got a Flex 3 project. One of the problems I have is that not very much of its content is indexed by Google. Currently, I pull data from a mySQl database, so the Googlebot doesn't see most of the site.
My goal is to increase the amount of content indexed by Google, improve the SEO, and improve SERPs.
I thought that instead of pulling the data from the database that I would change the project's architecture and create separate "pages". So, in my case, I would compile each puzzle separately and upload it to the server in its own directory. This way the info in each puzzle would get indexed.
The negative is that if I add a puzzle, I'd have to add a link to it in all of the puzzles that are already on the server. I would have to add the link, re-compile each puzzle and upload it to the server. Is there a way to get around this problem? Also, if I wanted to communicate some data from one puzzle to another in the future, I wouldn't be able to do so.
Any suggestions?
Thank you.
-Laxmidi
The usual way to achieve this goal is to develop a hidden parallel site in HTML.
On the first page you will have your flash and, hidden by javascript, a list of links to the other pages. These links will be parsed by the robots. Ideally, the href pages are virtual (look for "url rewriting"). On each "fake" page, your server-side language will print on the page a content or links from your database AND the flash. The flash will be provided with a string explaining where it is and what it's supposed to show.
Ex: http://www.mysite.com/category1/content7 The URL rewriting sends this request to http://www.mysite.com/index.php?uri=category1/content7. The page should display the Flash with FlashVar "uri=category1/content7". The Flash knows which content it has to display so when an user comes from google, following this link, he will find the content he was looking for.
Every linking and content for SEO should be in HTML, don't trust robots capability of reading Flash.
have a look at Adobe's reference on deep-linking.
you can generate a website's sitemap.xml with a cron process (daily), such that the URLs encode the state of the application you need. This URL will encode whatever content you need to retrieve from the db, with just one index.html page.
good luck!

Resources