How do I use the depth value from DepthMiddleware? - web-scraping

I have a broad crawler that recursively travels websites, and I wanted to implement a tier system that increments as the webpages get further away from the original seed URL.
For example, if I started with stackoverflow.com, any links that can be visited from http://stackoverflow.com will have a tier value of 1, while stackoverflow.com itself will have a tier value of 0 for being the seed URL.

The depth level of a response is available via response.meta['depth'].
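For reference, a minimal Scrapy spider sketch that copies that depth into a "tier" field (the spider name and seed URL below are just placeholders):

    import scrapy

    class TierSpider(scrapy.Spider):
        # Illustrative name and seed URL; substitute your own.
        name = "tier_spider"
        start_urls = ["http://stackoverflow.com"]

        def parse(self, response):
            # DepthMiddleware (enabled by default) stores the depth here:
            # the seed response has depth 0, pages linked from it depth 1, etc.
            tier = response.meta.get("depth", 0)
            yield {"url": response.url, "tier": tier}

            # Follow links; DepthMiddleware increments 'depth' on each hop.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)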

Related

LinkedIn: 2nd degree connections checking

We have a list of about 400 people who are prospective biz dev targets that we are cultivating.
We're looking for a way to see who on that list I'm a 2nd-degree connection with (and also who the mutual connections are). It can be done manually for each one of the 400, but this takes a long time. Also, when I add a new contact, I'd like to see if they are linked to any of the 400.
Perhaps it's possible with the API or some such?

HERE API Routing - Avoid unpaved roads

Can anyone explain how to avoid unpaved roads with HERE routing (or truck routing) in the REST API? I have checked the API and couldn't find an answer. The Routing API routes cars or trucks via dirt roads, which is unacceptable.
RouteFeatureType: The routing features can be used to define special conditions on the calculated route. The user can weight each feature with non-positive weights. Possible parameters are: tollroad, motorway, boatFerry, railFerry, tunnel, dirtRoad, park.
The feature weights are used to define weighted conditions on special route features like tollroad, motorways, etc.:
-3 strictExclude: The routing engine guarantees that the route does not contain strictly excluded features. If the condition cannot be fulfilled, no route is returned.
-2 softExclude: The routing engine does not consider links containing the corresponding feature. If no route can be found because of these limitations, the condition is weakened.
-1 avoid: The routing engine assigns penalties to links containing the corresponding feature.
0 normal: The routing engine does not alter the ranking of links containing the corresponding feature.
Of course, the map content also plays a huge role here. For the routing to work, the attribution, e.g. for a dirt road (unpaved road segment), needs to be set correctly.
You can also check details and report issues here: https://mapcreator.here.com
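To make this concrete, here is a rough request sketch assuming the legacy calculateroute.json (7.2) REST endpoint; the endpoint URL, the app_id/app_code credentials and the exact mode string syntax are assumptions and should be checked against the documentation for your API version:

    import requests

    # Rough sketch against the legacy HERE Routing API 7.2; newer API
    # versions use different endpoints and parameters.
    params = {
        "app_id": "YOUR_APP_ID",
        "app_code": "YOUR_APP_CODE",
        "waypoint0": "geo!52.5160,13.3779",   # example start
        "waypoint1": "geo!52.5206,13.3862",   # example destination
        # dirtRoad:-3 = strictExclude; no route is returned if dirt roads
        # cannot be avoided. Use -2 (softExclude) or -1 (avoid) to relax this.
        "mode": "fastest;truck;traffic:disabled;dirtRoad:-3",
    }

    response = requests.get(
        "https://route.api.here.com/routing/7.2/calculateroute.json",
        params=params,
    )
    print(response.json())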

How sessions are calculated when applying an Advanced Segment

Suppose I have a product home page, e.g. http://domain.com/products/sonymobile.com, and I need to find people who have visited the above product page in a session.
So I apply an Advanced Segment including the page as above. The tricky part is how the sessions are calculated:
a) Does Google count only the sessions wherein the session starts from "http://domain.com/products/sonymobile.com",
OR
b) can the page come anywhere in the whole session?
[Advanced Segment Image]
It depends on how you build your segment. If its scope is session, the answer is b). It seems that you are thinking of an issue like https://support.google.com/analytics/answer/2934985?hl=en, but that is not about segments.

Google Analytics Multi-domain Tracking Code

I have the same GA code on 2 domains that I'm trying to track: domain.com and domain.de. Do I need to add setAllowLinker and setDomainName in both domains? Both sites are under the same account/profile but use a hostname filter.
(the sites won't link between each other)
If the sites do not link to each other you don't need a special setup - the linker functions maintain sessions between domains (which is not possible via cookies alone, so the session data is transmitted via the URL) and setDomainName sets the cookie domain, which likewise is not necessary in your case.
I do not understand your second question. If you have created a second data view/profile for your .com domain it starts at 0 visits (pageviews, etc.) because the view collects data only from the moment it was created, which might be your problem here (plus, I think the etiquette at Stack Overflow suggests having only one question per post, so you might edit the second question out and put it in a separate post).

Basic site analytics doesn't tally with Google data

After being stumped by an earlier question: SO google-analytics-domain-data-without-filtering
I've been experimenting with a very basic analytics system of my own.
MySQL table:
hit_id, subsite_id, timestamp, ip, url
The subsite_id lets me drill down to a folder (as explained in the previous question).
I can now get the following metrics:
Page Views - Grouped by subsite_id and date
Unique Page Views - Grouped by subsite_id, date, url, IP (not necessarily how Google does it!)
The usual "most visited page", "likely time to visit" etc etc.
I've now compared my data to that in Google Analytics and found that Google has lower values for each metric, i.e. my own setup is counting more hits than Google.
So I've started discounting IPs from various web crawlers: Google, Yahoo & Dotbot so far.
Short questions:
Is it worth collating a list of all major crawlers to discount, and is any such list likely to change regularly?
Are there any other obvious filters that Google will be applying to GA data?
What other data would you collect that might be of use further down the line?
What variables does Google use to work out entrance search keywords to a site?
The data is only going to be used internally for our own "subsite ranking system", but I would like to show my users some basic data (page views, most popular pages, etc.) for their reference.
Lots of people block Google Analytics for privacy reasons.
Under-reporting by the client-side rig versus server-side seems to be the usual outcome of these comparisons.
Here's how I've tried to reconcile the disparity when I've come across these studies:
Data sources recorded in server-side collection but not client-side:
hits from mobile devices that don't support JavaScript (this is probably a significant source of disparity between the two collection techniques--e.g., a Jan '07 comScore study showed that 19% of UK Internet users access the Internet from a mobile device)
hits from spiders and bots (which you mentioned already)
Data sources/events that server-side collection tends to record with greater fidelity (far fewer false negatives) compared with JavaScript page tags:
hits from users behind firewalls, particularly corporate firewalls--firewalls block the page tag, plus some are configured to reject/delete cookies.
hits from users who have disabled JavaScript in their browsers--five percent, according to the W3C data.
hits from users who exit the page before it loads. Again, this is a larger source of disparity than you might think. The most frequently-cited study to support this was conducted by Stone Temple Consulting, which showed that the difference in unique visitor traffic between two identical sites configured with the same web analytics system, but which differed only in that the JS tracking code was placed at the bottom of the pages on one site and at the top of the pages on the other, was 4.3%.
FWIW, here's the scheme I use to remove/identify spiders, bots, etc. (a rough sketch of the first two checks follows this list):
monitor requests for our robots.txt file, then filter all other requests from the same IP address + user agent (not all spiders will request robots.txt of course, but with minuscule error, any request for this resource is probably a bot).
compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose.
pattern analysis: nothing sophisticated here; we look at (i) page views as a function of time (i.e., clicking a lot of links with 200 msec on each page is probative); (ii) the path by which the 'user' traverses our site, whether it is systematic and complete or nearly so (like following a back-tracking algorithm); and (iii) precisely-timed visits (e.g., 3 am each day).
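A minimal Python sketch of the first two checks (the function names, the in-memory sets, and the example bot tokens are all illustrative assumptions; the real lists would be loaded from iab.net / user-agents.org):

    # Flag clients that fetched robots.txt and clients whose user agent
    # matches a published bot list.
    known_bot_agents = {"googlebot", "slurp", "dotbot"}  # placeholder tokens
    robots_txt_clients = set()  # (ip, user_agent) pairs that requested robots.txt

    def observe_hit(ip, user_agent, url):
        """Record a hit; remember clients that requested robots.txt."""
        if url.rstrip("/").endswith("robots.txt"):
            robots_txt_clients.add((ip, user_agent))

    def looks_like_bot(ip, user_agent):
        """Heuristic: fetched robots.txt, or matches a published bot list."""
        if (ip, user_agent) in robots_txt_clients:
            return True
        ua = user_agent.lower()
        return any(token in ua for token in known_bot_agents)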
The biggest reasons are that users have to have JavaScript enabled and load the entire page, as the code is often in the footer. AWStats and other server-side solutions like yours will get everything. Plus, Analytics does a really good job of identifying bots and scrapers.
