Eliminating Bot Traffic from Google Analytics

In my Google Analytics reports, I see traffic that I am almost sure comes from bots:
See how the service provider is amazon technologies inc. (from Ashburn, Virginia, apparently Amazon’s AWS bots) and microsoft corporation (from Coffeyville, Kansas).
I want to exclude all traffic from all bots, including Google, Amazon, Microsoft and any other company. I only want to see traffic from real people who visit my site, not from web robots. Thank you.

In Google Analytics View Settings, you'll see an option for "Bot Filtering". Check the box to "Exclude all hits from known bots and spiders". If Google Analytics recognizes those hits from Ashburn and Coffeyville as bots, the data from those bots won't be recorded in your view.
Bot Filtering
If Google Analytics doesn't recognize them as bots, you could investigate the impact of adding a filter to your view(s) that would exclude traffic from the ISP Organization(s).
View Filter for ISP Organization
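GA view filters on the ISP Organization dimension accept regular expressions. A sketch of the kind of exclude pattern you might use, testable here in Python (the organization names are taken from the question's reports; the helper name is hypothetical):

```python
import re

# Hypothetical exclude pattern for a GA view filter on the
# "ISP Organization" dimension; organization names come from
# the reports described in the question.
BOT_ISP_PATTERN = re.compile(
    r"amazon technologies inc\.|microsoft corporation",
    re.IGNORECASE,
)

def is_bot_isp(isp_organization):
    """Return True if the ISP organization matches the exclude filter."""
    return bool(BOT_ISP_PATTERN.search(isp_organization))

print(is_bot_isp("amazon technologies inc."))
print(is_bot_isp("comcast cable communications"))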

Most of these bots come from other tools. Last Friday we received a lot of sessions from Coffeyville with microsoft corporation as the service provider. It turned out we had used a tool to scan our website for cookies. My best option is to exclude any data from this town/city.
Screenshot from Google Analytics about how I implemented the filter in that view

You can use Robots.txt to try and exclude the bots: Robots exclusion standard
Some excerpts, in case the link ever fails:
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites. Not all robots cooperate with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
About the Standard
When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file doesn't exist, web robots assume that the web owner wishes to provide no specific instructions and crawl the entire site.
A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Some Simple Examples
This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed.
User-agent: *
Disallow:
The same result can be accomplished with an empty or missing robots.txt file.
This example tells all robots to stay out of a website:
User-agent: *
Disallow: /
This example tells all robots not to enter three directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
This example tells all robots to stay away from one specific file:
User-agent: *
Disallow: /directory/file.html
Note that all other files in the specified directory will be processed.
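Python's standard library can check such rules programmatically. A minimal sketch, feeding the parser the three-directory example above instead of fetching a live robots.txt:

```python
from urllib.robotparser import RobotFileParser

# The three-directory example from above, as a string rather than
# a fetched file.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyBot", "https://www.example.com/tmp/file.html"))  # False
print(parser.can_fetch("MyBot", "https://www.example.com/index.html"))     # True
```

Well-behaved crawlers perform exactly this check before requesting each URL.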

Thoughts on scraping "sort" pages disallowed by robots.txt for search engine purposes?

I'm building some spiders and am curious whether there is any consensus on scraping pages that robots.txt disallows, where the apparent purpose of the rule is to prevent those pages from showing up in search engines.
For example, I'm noticing some retail stores like https://www.barneys.com/robots.txt block certain sorted pages
Disallow: /*%7C*
Disallow: *product.maxSalePrice%7C1*
Disallow: /search*
https://www.barneys.com/category/sale/N-1d0527n?Ns=product.maxFinalPrice%7C1
Unless I'm missing another reason why they block these pages (they can't use up that many more resources, can they?), and aside from automated anti-scraping measures, would anyone expect a problem if I scraped one or two of these pages every 30 minutes? I'd imagine that would be better for both parties than scraping every product page and sorting the results myself. Would a search page for a specific term or two be any different, as these are typically disallowed as well?
I know it's going to vary on a site-by-site basis, but I'm curious to see what insight anyone might have.
If the website has a sitemap, you can get the product links by parsing the sitemap.xml referenced in robots.txt:
Disallow: /checkout*
Disallow: *product.maxSalePrice%7C1*
Sitemap: https://www.barneys.com/sitemap_index.xml
You can use SitemapSpider for it.
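If you'd rather not pull in Scrapy, the same idea works with the standard library. A sketch (the XML below is a stand-in; a real crawl would fetch the Sitemap URL advertised in robots.txt):

```python
import xml.etree.ElementTree as ET

# Stand-in for a fetched sitemap document; real sitemaps use this
# exact namespace and <url>/<loc> structure.
SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/product/1</loc></url>
  <url><loc>https://www.example.com/product/2</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)
```

Large sites usually publish a sitemap index that points at several child sitemaps, so you may need to apply the same parsing one level deeper.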

Getting Strange Google Analytics URL Data

I recently opened my Google Analytics reports and looked in depth at the Behavior > All Pages panel for the first time, and I noticed some strange pages such as:
/amobee/a3d-ad-loader.html?a3dWebglBanner=https://cdn-production.amobee3d.com/__integration__/9cbea9d/a3d-webgl-banner.js&adName=canon_sp&bucket=cdn-production.amobee3d.com&creativeId=phone&tpt={"tpt-click":"http://r.turn.com/r/tpclick/urlid/14BFxPtiFHxrmUcNdR4QHN-0x-Yel1rNyX3oaT1U1nk4Xtdr-WJQO1XlpD1d2cgzm_yn98_nqu0l-H7-6TDbnFAVUaa81rE5Va5TPoJV_1Ntn4-ZNPeiesLCUWGi5Q0pMIlxWeHujtiWU4hIRmxZhGDbLcisF5vf52pYjnxx7sgLDq60qaLSM9lSDH_P7r3m2LfHLNhuhT3pi82fEsIKY-zMcLaIqUa9FRu7ru1ABYiMCtsmIp-lbv-0tHQ0QtXb2XvAslSEVQju5WCkGeXtYPPWcOXdh4wRx2g-XrBQLJqyt0vA7eW1L6lLODoYREs9OBPuTEypwnf63U3p8t5FBYUJmQbyMz4eKCUfVCW3oZA8XwQsSlpxKWOwnR4ICWD6Hv0vAV2VuhJR0Xs53RIHS3H9Tz63br3HTEa4ZY_kKFET9A_ftQbvMsRO4u41FP6SKbtlYbh9rP6ujKbOzAN8TRFll4D4qUWscfwlVaUN_u2u5E4Vy42t_bSnl21XJcaYEQEFVUTsKZNXtOXj9z5KcYao4xmdD4GUUWyryckAdVyWahvx4V_d16JvQHawx4X3ioQH0_wNdsrb3RVATpziopDFpbZaBPUHiKLZ-bIyufGmXpZmxg-3vX-zu1vvsZPbJNqcc9li1Ympbj3ShiZ1AiIxqUrWzljp1f1In7Z8Im-yg3_KM0J57D8-gUsHIZ-oX3ZGD89yOo93M3XBqtzuW2Hsic-itJBXhnJzspzQ4UqNbGQz9oR24Gk94As9pRznxJBPBDq4ETbqpQBtH7BoKHQ/3c/https://adclick.g.doubleclick.net/aclk?sa=l&ai=C4Uxke2G6WaDwNofbpAOzu7eoBf7D7ZRGiM-B9pQBwI23ARABIABgyabejOCk0BSCARdjYS1wdWItODQ0NTQ5NjI3NTYxOTU2N6ABjPe59APIAQmoAwGqBHxP0Hg-A8VrFKLhd4VPGK02nOSLdJlNn7XiRxtz6uzu19NuxGmz5enbVlB2iirq6fTo1Hjk0ggr3O7qFuCqnbrLdm_fi-5tala6iCF3bFK5yG40vufVOofQQ-0YefypkSbFeGdRzK6ke5XOGaI8UaVEAoiTfHwrtnGA6nyzgAbd2MidmYzBhAygBiGoB6a-G9gHANIIBQiAARAB&num=1&sig=AOD64_06gu58j3wZF6kAoqQM6TYyaPYIBQ&client=REMOVED&adurl=/url/"}
/flashtalking/ftlocal.html?ifsrc=https://cdn.flashtalking.com/xre/271/2711110/1979640/js/j-2711110-1979640.js&ftx=&fty=&ftadz=&ftscw=&ft_custom=&ftOBA=1&ft_ifb=1&ft_domain=REMOVEDft_agentEnv=0&ft_referrer=REMOVED&cachebuster=750934.4320502493&click=https://googleads.g.doubleclick.net/dbm/clk?sa=L&ai=Ce9A4rOjBWbK7BMnWkgP6w6nwB4GMv7JMitPArpwG-7idztoIEAEg0JjELWDJ5v6GgICgGcgBCagDAaoEuAFP0IwsQBfm1IhnAEcv-Kxde6xOfh27RXolPw6jRU8iIA8UyhMCIzdPsjzlztPIEk-d6gwfr438fNB4ptnk2O2-NRq8iKLUF9M4vcKS2aV9IoNcN3v5gcOhtR8Woojv_R8C-z6cDbensRSTTYYVM9RS8OIGbiXrVvsrHcU7kb8vlmMS0EIKD_5NwhCenv4gRE9-_U1Q1r05lJPI1RAJ1m2m_LPSflL_nb5m8BpwYhfJFdBGanLwgh7LwASLsqq1rgHgBAOIBd3Zs7sDkAYBoAZN2AYCgAefycxeqAemvhvYBwCgCP-hpwSwCALSCAcIgGEQARgCyBPg1p0C0BMA2BMDghQTGhF3d3cudm9sdW1lYm90LmNvbQ&num=1&cid=CAASEuRoTBJWebFR9Y_pZL7ze3vdCg&sig=AOD64_2PqRgxPypUSzjJHrRA4kFBwKQPZQ&client=REMOVED&dbm_c=AKAmf-BdHzMrPFTxYQj06utKwilI6E9GHRDztBNwp4NEhB2BuaayZ6JG_BcT226zfnDtdwABfZhe&dbm_d=AKAmf-BWr8_Qqd0y7BMDQPUfEaK5z_iR3KXo8wstJkrl5wytBRYlArCAOqS_TR4m5kPBDNYQmT520pL98pRp6u4h6seeuW53gXANeGvEaPqByEZTbKzlzs7zvX_HqjcevAzg0oDNVrcKyt6jc0SRG5LJGM-YrbtMWCm0-ceIau7y4qp_WK-X5-c&adurl=&ftimpid=35502EEB8067F1&ft_id=&ftcustom=&ftsection=&fttime=1505880237&ftcfid=6825920&ftguid=3165AF587F7584
I removed the client value and some other fields and replaced them with REMOVED for anonymity, but I was wondering if anyone can tell me whether this is malware.
I have a site that uses WordPress in the cloud, and I have scanned it with Wordfence, which says my site is clean.
Was wondering if I should look deeper, or if this behavioral page is normal.
Amobee and Flashtalking are both advertising platforms, so it looks like somebody has configured advertising tags incorrectly. Those clicks should probably be routed through the respective platforms (e.g. to record data for bid management or something like that), but instead they go directly to your page with redirect URLs appended. If you do paid advertising, you should check with the people who configured this for you.
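Splitting one of these recorded page paths into its query parameters makes it easier to see which ad platform and script the tag was loading. A sketch with the standard library (the path is shortened from the example above):

```python
from urllib.parse import urlsplit, parse_qs

# One of the recorded page paths, truncated for readability.
page = ("/flashtalking/ftlocal.html?ifsrc=https://cdn.flashtalking.com/xre/271/"
        "2711110/1979640/js/j-2711110-1979640.js&ftOBA=1&cachebuster=750934.43")

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlsplit(page).query)
print(params["ifsrc"][0])   # the ad-platform script the tag was loading
print(params["ftOBA"])
```

Here `ifsrc` points at Flashtalking's CDN, which is what identifies the traffic as ad-tag activity rather than malware on the site itself.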

Telling Google to index a new page, without using Webmaster Tools

I have a WordPress site that generates a single page site for users from some fields they enter into a form and some images they upload. I want to get Google to come out and index the page but my users will not be technical enough to set their page up with Webmaster Tools. What can I do from WordPress when I build the page to tell Google a new page is up and to please come out and index it when they have a chance?
Well, you don't have to do anything, actually; you could just sit back and wait for it to happen naturally. However, there are things you can do to speed up the indexing process.
Here's a suggested way that does not involve having your users do anything:
Create one or more (quality) links pointing to their single page site from other websites that you know are already indexed in Google (see https://www.youtube.com/watch?v=4LsB19wTt0Q for more information). Ideally on blogs that are updated frequently, because then it is likely Google crawls them more frequently.
Use a site:domain.com search in Google to see whether Google has already found your new pages.
Here is how Google crawling and indexing works:
Crawling:
Crawling is the process by which Googlebot discovers new and
updated pages to be added to the Google index.
We use a huge set of computers to fetch (or "crawl") billions of pages
on the web. The program that does the fetching is called Googlebot
(also known as a robot, bot, or spider). Googlebot uses an algorithmic
process: computer programs determine which sites to crawl, how often,
and how many pages to fetch from each site.
Google's crawl process begins with a list of web page URLs, generated
from previous crawl processes, and augmented with Sitemap data
provided by webmasters. As Googlebot visits each of these websites it
detects links on each page and adds them to its list of pages to
crawl. New sites, changes to existing sites, and dead links are noted
and used to update the Google index.
Google doesn't accept payment to crawl a site more frequently, and we
keep the search side of our business separate from our
revenue-generating AdWords service.
Indexing:
Googlebot processes each of the pages it crawls in order to
compile a massive index of all the words it sees and their location on
each page. In addition, we process information included in key content
tags and attributes, such as Title tags and ALT attributes. Googlebot
can process many, but not all, content types. For example, we cannot
process the content of some rich media files or dynamic pages.
Source: https://support.google.com/webmasters/answer/70897?hl=en

Changed content type leading to wrong crawls by google

On our website, built on WordPress, we changed the name of one of our custom post types from 'A' to 'B' and also changed the hierarchy of a few categories.
Now the problem is that Google is indexing/crawling the old 'A' CPT name and the old category structure, which leads either to random pages (because WordPress guesses and shows a page matching the keywords in the URL) or to 404 errors.
What can we do (via Webmaster Tools) to make Google re-index our whole site and start honoring the new structure? Thanks.
Here is a brief explanation of Google's indexing policy:
The process
The crawl process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As Google crawlers visit these websites, they look for links to other pages to visit. The software pays special attention to new sites, changes to existing sites and dead links.
Computer programs determine which sites to crawl, how often and how many pages to fetch from each site. Google doesn't accept payment to crawl a site more frequently for your web search results. They care more about having the best possible results because in the long run that's what's best for users and, therefore, their business.
Choice for website owners
Most websites don't need to set up restrictions for crawling, indexing or serving, so their pages are eligible to appear in search results without having to do any extra work.
That said, site owners have many choices about how Google crawls and indexes their sites through Webmaster Tools and a file called “robots.txt”. With the robots.txt file, site owners can choose not to be crawled by Googlebot, or they can provide more specific instructions about how to process pages on their sites.
Site owners have granular choices and can choose how content is indexed on a page-by-page basis. For example, they can opt to have their pages appear without a snippet (the summary of the page shown below the title in search results) or a cached version (an alternate version stored on Google's servers in case the live page is unavailable). Webmasters can also choose to integrate search into their own pages with Custom Search.
Read more here and here.

WordPress site appears clear of malware, but clicking on Google search results redirects to spam sites

An issue was brought to me involving malware on a WP environment. When I search the brand in Google and click the corresponding link, I'm redirected to a 3rd party spam site.
This has been happening for a while (over a week), but my site hasn't been put on Google's blacklist. Additionally, site scanners like Norton Safeweb, etc. all claim the site isn't compromised.
Additional details:
I found and deleted some suspicious PHP eval() functions and then did a search and replace in my pages and database for any remaining code. After the site cleared into un-blacklisted status with Google I thought it was all over, ran updates and took numerous measures to protect the site from future infection.
However the issue still persists.
Were the nameservers ever changed by the malware or attackers? Google could have the wrong DNS information for your domain and think it's hosted at said spam site. Resubmit your site to Google or report the issue to them to resolve it (it may also be resolved automatically the next time Google crawls your domain).
It is a strange issue I have not seen before either. Have you looked at your .htaccess file in the root directory? It is possible that it has a rewrite condition that redirects you to the spam site if the referrer is Google.
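A malicious .htaccess rewrite of the kind described might look something like this (illustrative only; the referrer pattern and target host are made up):

```apache
# Illustrative only: fires when the visitor arrived from a search
# engine, so the site owner browsing directly never sees the redirect.
RewriteEngine On
RewriteCond %{HTTP_REFERER} (google|bing)\. [NC]
RewriteRule ^(.*)$ http://spam-site.example/ [R=302,L]
```

Checking the root .htaccess for unexpected RewriteCond lines on HTTP_REFERER is a quick first test for this class of attack.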
Solved this issue. At the time when this happened, this redirect attack was fairly new.
HTTP requests from visitors who passed referrer data from Google Search or Bing were being redirected, some of the time.
By targeting only those coming in from search, the webmaster or site owner is less likely to see the issue (until informed by a third party), while still manipulating a decent amount of the traffic (50% of traffic for most sites comes from search engines).
When I originally posted this question in 2012, this attack was new and because the redirect was being served server-side (directly in a lone PHP file, not via .htaccess), malware signatures from scanners didn't detect this.
Running Maldetect (with an updated database) was the best way to quarantine this issue and analyze the extent of the damage caused by malware.
This issue seems to be due to the wp-vcd malware, which creates rogue WordPress admin users and injects spam links. I faced a similar issue and it was resolved by following these steps.
The files you should check for and delete:
wp-feed.php
wp-vcd.php
wp-tmp.php
any copies of class.theme-modules.php
Also remove the injected code from the start of all the functions.php files.
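The file names above can be located with a short script before removing them by hand. A sketch (the WordPress root path is a placeholder; point it at your own install):

```python
from pathlib import Path

# File names reported for the wp-vcd infection, as listed above.
SUSPECT_NAMES = {"wp-feed.php", "wp-vcd.php", "wp-tmp.php",
                 "class.theme-modules.php"}

def find_suspects(wp_root):
    """Return sorted paths of suspicious files anywhere under wp_root."""
    return sorted(p for p in Path(wp_root).rglob("*.php")
                  if p.name in SUSPECT_NAMES)

wp_root = Path("/var/www/html")   # placeholder: your WordPress root
if wp_root.is_dir():
    for hit in find_suspects(wp_root):
        print(hit)
```

Review each hit before deleting it; class.theme-modules.php in particular can also exist as part of a legitimate-looking theme copy the malware created.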
You can find details on this issue at the following links:
https://wordpress.org/support/topic/wp-feed-php/
http://labs.sucuri.net/?note=2017-11-13
