ClickHouse SequenceCount group funnel - aggregate-functions

I've been trying to use sequenceCount to run a funnel analysis on website data. We want to track how many users get from point A to point B. sequenceCount is a perfect fit for this need, as we have trackers that follow the user's interactions on the website. However, there is not just one tracker but many that trigger point A in our funnel. This means that currently we would have to call sequenceCount for EACH trigger of A and make it point to B.
This could lead to a situation where we have to write a lot of sequenceCount calls, which could kill performance.
I've looked on the web for answers but haven't found any on this.
I'm looking for ideas that could solve the problem or avoid it.
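One way around this (a sketch, not a tested solution for your schema): the step conditions of sequenceCount are ordinary boolean expressions, so all the trackers that count as point A can be folded into a single condition with IN (or OR), leaving one sequenceCount call instead of one per tracker. The table and column names below (website_events, user_id, event_time, event_name) and the event names are placeholders; the query is run here through the clickhouse-driver Python client.

```python
# Rough sketch: one sequenceCount call that treats any of several trackers as "point A".
# Table, column, and event names are assumptions for illustration.
from clickhouse_driver import Client

client = Client("localhost")

query = """
SELECT
    user_id,
    sequenceCount('(?1)(?2)')(
        event_time,
        event_name IN ('banner_cta', 'search_cta', 'menu_cta'),  -- condition 1: any point-A tracker
        event_name = 'checkout_complete'                          -- condition 2: point B
    ) AS a_to_b_count
FROM website_events
GROUP BY user_id
"""

for user_id, a_to_b_count in client.execute(query):
    print(user_id, a_to_b_count)
```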

Related

Tracking a Search that leads to a sale in GA

This seems really basic, but I am struggling with it.
We have a client who runs a travel website.
They have a few different search bars, e.g. Flights, Hotels, Car hire.
I am trying to track the performance of each... "What % of people who ran a Flight search completed a sale?" Same for Hotels and for Car hire.
Any ideas for the best way to get this info in GA?
Many thanks
There are a few ways to get this information, each with their pros and cons. The options that I see immediately available are segments and goals.
Segments are great because they are retrospective and generally more flexible, with the ability to be changed if you find your criteria aren't quite right. You create one and specify sessions that go through the search results pages, etc.
Then you can create another segment for the booking confirmation page, and any other intermediary steps that you'd like to report on. The main con of segments is that you can only pull in 4 at a time, but if you have more you can pull them 4 at a time and copy and paste the data into an Excel sheet or Google Sheet. Segments can also be pulled via the Core Reporting API and Data Studio, which makes them great for automating into dashboards.
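Since segments can be pulled through the Core Reporting API, here is a rough sketch of what that automation can look like with the v4 Reporting API in Python. The view ID, segment ID, and key file are placeholders you would replace with your own; this is an illustration of the approach, not a drop-in report.

```python
# Minimal sketch: pull sessions/transactions for a segment via the GA Reporting API v4.
# "key.json", the view ID, and the segment ID are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "key.json", scopes=["https://www.googleapis.com/auth/analytics.readonly"]
)
analytics = build("analyticsreporting", "v4", credentials=creds)

response = analytics.reports().batchGet(body={
    "reportRequests": [{
        "viewId": "123456789",  # your numeric view (profile) ID
        "dateRanges": [{"startDate": "30daysAgo", "endDate": "yesterday"}],
        "metrics": [{"expression": "ga:sessions"}, {"expression": "ga:transactions"}],
        "dimensions": [{"name": "ga:segment"}],   # required whenever segments are requested
        "segments": [{"segmentId": "gaid::-1"}],  # built-in "All Users"; swap in your Flight-search segment ID
    }]
}).execute()

for row in response["reports"][0]["data"]["rows"]:
    print(row["dimensions"], row["metrics"][0]["values"])
```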
Goals are cool because they pull into the default reports, and basically track sessions through a particular page, event or sequence. The main con I see, and the reason I don't use them, is that they only start tracking from the time you create them, and if you change the configuration it does not affect historical data, so your data can get messed up quickly if you don't have sandbox GA views or sandbox goals for your testing before putting it into a dedicated goal slot. You can also only have 10 or 20 goals depending on your plan, and once data is tracked against a goal slot you can't remove or clear it.

Anomaly detection for Google Analytics in realtime

I'm trying to detect anomalies in Google Analytics events like page views or custom events.
I tested the custom alert feature from Google itself. The period for those alerts is per day, week or month. What I'm looking for is realtime detection. It would be useful to define rules for alerts like a maximum divergence between two points in time, for example [now, now - 15 minutes] or [now, now - 24 hours] or [now, now - 7 days]. Some solutions provide alerts when a fixed threshold is passed (like observe.io), but that's not very helpful for highly fluctuating numbers that depend on the day of the week and the time of day (like page views).
I would be thankful for any tips on how to detect anomalies in GA in realtime.
I agree that threshold solutions are not a good idea for detecting anomalies in time series, because thresholds are generally set by the user rather than learned, which can be a time-consuming and difficult process when monitoring many data streams.
Moreover, they need to be adjusted as the environment changes, so manual real-time maintenance is needed.
Besides, since they don't take temporal sequences into account, simple thresholds cannot identify pattern changes that take place within the normal range. I recommend you use methods for anomaly detection in time series or change point detection.
You can Google these topics and you'll find several algorithms. For realtime analysis, I can also recommend software like MOA (http://moa.cms.waikato.ac.nz/) and Numenta (https://numenta.com/).
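To make the idea concrete, here is a minimal sketch of a seasonality-aware check: compare the current count against the history of the same weekday/time-of-day bucket instead of a fixed threshold. It assumes you already pull page-view counts per 15-minute bucket (e.g. from the GA Real Time API); the numbers are made up for illustration.

```python
# Minimal sketch: flag a count as anomalous if it deviates strongly from the
# history of the SAME weekday/time-of-day bucket (a simple z-score, not a full
# time-series or change-point model).
import statistics

def is_anomaly(history, current, z_threshold=3.0):
    """history: past counts for the same weekday/time-of-day bucket."""
    if len(history) < 8:
        return False  # not enough data to judge
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # avoid division by zero
    return abs(current - mean) / stdev > z_threshold

# Example: page views observed in the "Tuesday 14:00-14:15" bucket over past weeks
past_tuesdays = [820, 790, 845, 810, 805, 798, 833, 812]
print(is_anomaly(past_tuesdays, 310))  # True  -> sudden drop
print(is_anomaly(past_tuesdays, 826))  # False -> within the normal range
```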

How to Sample Adobe Analytics (Omniture) Data

I can't find anything on the web about how to sample Adobe Analytics data. I need to integrate Adobe Analytics into a new website with a ton of traffic, so the stakeholders want to sample the data to avoid exorbitant server calls. I'm using DTM, but I'm not sure if that will help or be a non-factor. Can anyone either point me to some documentation or give me some direction on how to do this?
Adobe Analytics does not have any built-in method for sampling data, neither on their end nor in the js code.
DTM doesn't offer anything like this either. It doesn't have any (exposed) mechanisms in place to evaluate all requests made to a given property (container); any rules that extend state beyond "hit" scope are cookie based.
Adobe Target does offer the ability to output code based on a % of traffic, so you can achieve sampling this way, but really, you're just trading one server call cost for another.
Basically, your only solution would be to create your own server-side framework for conditionally outputting the Adobe Analytics (or DTM) tag, to achieve sampling with Adobe Analytics.
Update:
@MichaelJohns' comment below:
We have a file that we use as a boot strap file to serve the DTM file.
What I think we are going to do is use some JS logic and cookies
around that to determine if a visitor should be served the DTM code.
Okay, well, maybe I'm misunderstanding what your goal here is (but I don't think I am), but that's not going to work.
For example, if you only want to output tracking for 50% of visitors, how would you use JavaScript and cookies alone to achieve this? In order to know that you are only filtering 50%, you need to know the total # of people in play. By themselves, JavaScript and cookies only know about ONE browser, ONE person. They have no way of knowing anything about the other visitors unless you have some sort of shared state between all of them, like keeping track of a count in a database server-side.
The best you can do solely with JavaScript and cookies is basically flip a coin. In this example of 50%, you'd pick a random # between 1 and 100: the lower half gets no tracking, the upper half gets tracking.
The problem with this is that it is possible for the pendulum to swing 100% one way or the other. It is the same principle as flipping a coin 100 times in a row: it is entirely possible that it lands on tails all 100 times.
In theory, the trend over time should show an overall average of 50/50, but this has a major flaw: you may go one month with a ton of traffic, another month with very little. Or you could have a week with very little traffic followed by one day with a lot of traffic. And you really have no idea how that's going to manifest over time; you can't really know which way your pendulum is swinging unless you ARE actually recording 100% of the traffic to begin with. The effect of all this is that it will absolutely destroy your trended data, and trends are the core of making any kind of meaningful analysis.
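For illustration, the coin-flip version with no shared state looks like this (Python is used for consistency with the other sketches; the same logic applies to client-side JavaScript). This is the approach argued against above: each visitor is assigned independently at random, and the choice is remembered only in that visitor's cookie.

```python
# Naive per-visitor coin flip: no shared state, so the actual split can drift
# arbitrarily far from the target percentage over any given period.
import random

def coin_flip_decision(sample_percent=50):
    """Return "yes" (track) or "no" (don't track) for a brand-new visitor."""
    return "yes" if random.randint(1, 100) <= sample_percent else "no"

# On a first visit you would store the result in a cookie, e.g.
#   track = coin_flip_decision(50)
#   response.set_cookie("track", track)   # hypothetical response object
print(coin_flip_decision())
```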
So basically, if you really want to reliably output tracking to a % of traffic, you will need a mechanism in place that does in fact record 100% of traffic. If I were going to roll my own homebrewed "sampler", I would do this:
In either a flatfile or a database table I would have two columns, one representing "yes", one representing "no". And each time a request is made, I look for the cookie. If the cookie does NOT exist, I count this as a new visitor. Since it is a new visitor, I will increment one of those columns by 1.
Which one? It depends on what percent of traffic I want to (not) track. In this example, we're doing a very simple 50/50 split, so really, all I need to do is increment whichever one is lower, and in the case that they are currently equal, I can pick one at random. If you want to do a more uneven split, e.g. 30% tracked, 70% not tracked, then the formula becomes a bit more complex. But that's a different topic for discussion (also, there are a lot of papers, documents, and wikis out there published by people a lot smarter than me that can explain it a lot better than I can!).
Then, if it is fated that I incremented the "yes" column, I set the "track" cookie to "yes". Otherwise I set the "track" cookie to "no".
Then, in my controller (or bootstrap, router, whatever all requests go through), I would look for the cookie called "track" and see if it has a value of "yes" or "no". If "yes", then I output the tracking script. If "no", then I do not.
So, in summary, the process would be:
A request is made.
Look for the cookie.
If the cookie is not set, update the database/flatfile, incrementing either yes or no.
Set the cookie to yes or no.
If the cookie is set to yes, output tracking.
If the cookie is set to no, don't output tracking.
Note: Depending on the language/technology of your server, the cookie won't actually be set until the next request, so you may need to throw in logic to look for a returned value from the db/flatfile update, and then fall back to looking for the cookie value in the last two steps.
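As a rough sketch of the above (not a production implementation), here is what that flow could look like with Flask and a SQLite counter table; the cookie name, table name, and the tie-breaking rule (ties go to "yes" instead of a random pick) are assumptions made for brevity.

```python
# Sketch of the counter-backed sampler described above: a shared yes/no tally
# decides whether a brand-new visitor gets the tracking tag, and the decision
# is remembered in a "track" cookie.
import sqlite3
from flask import Flask, request, make_response

app = Flask(__name__)
DB = "sampler.db"

def decide_tracking():
    """Increment whichever bucket is lower so the overall split stays near 50/50."""
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS sampler (yes INTEGER, no INTEGER)")
    row = con.execute("SELECT yes, no FROM sampler").fetchone()
    if row is None:
        con.execute("INSERT INTO sampler VALUES (0, 0)")
        row = (0, 0)
    yes, no = row
    track = yes <= no  # tie goes to "yes" here; the text suggests picking at random
    con.execute("UPDATE sampler SET yes = yes + ?, no = no + ?",
                (1 if track else 0, 0 if track else 1))
    con.commit()
    con.close()
    return "yes" if track else "no"

@app.route("/")
def index():
    track = request.cookies.get("track")
    if track is None:          # new visitor: decide once and remember it
        track = decide_tracking()
    html = "<html><body>page content"
    if track == "yes":
        html += "<!-- output the DTM / Adobe Analytics tag here -->"
    html += "</body></html>"
    resp = make_response(html)
    resp.set_cookie("track", track)
    return resp
```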
Another (more general) note: In general, you should beware sampling. It is true that some tracking tools (most notably Google Analytics) samples data. But the thing is, it initially records all of the data, and then uses complex algorithms to sample from there, including excluding/exempting certain key metrics from being sampled (like purchases, goals, etc.).
Just think about that for a minute. Even if you take the time to set up a proper "sampler" as described above, you are basically throwing out the window data proving people are doing key things on your site - the important things that help you decide where to go as far as giving visitors a better experience, etc. So now the only way around it is to start recording everything internally and factoring those things into whether or not to send the data to AA.
But all that aside: look, I will agree that hits are something to be concerned about on some level. I've worked with very, very large clients with effectively unlimited budgets, and even they worry about hit costs racking up.
But the bottom line is you are paying for an enterprise-level tool. If you are concerned about the cost of Adobe Analytics relative to your site traffic, maybe you should consider moving away from Adobe Analytics towards a different tool like GA, or some other tool that doesn't charge by the hit. Adobe Analytics is an enterprise-level tool that offers a lot more than most other tools, and it is priced accordingly. No offense, but IMO that's like leasing a Mercedes and then cheaping out on the quality of gasoline you use.

Could "filling up" Google Analytics with millions of events slow down query performance / increase sampling?

I'm considering doing some relatively large-scale event tracking on my website.
I estimate this would create up to 6 million new events per month in Google Analytics.
My questions are, would all of this extra data that I'm now hanging onto:
a) Slow down GA UI performance
and
b) Increase the amount of data sampling
Notes:
I have noticed that GA seems to be taking longer to retrieve results for longer date ranges on my website lately, but I don't know whether that has to do with the increased amount of event tracking I've been doing or not – it may be that GA is fighting for resources as it matures and more and more people collect more and more data...
Finally, one might guess that adding events would only slow down reporting on events, but this isn't necessarily so, is it?
Drewdavid,
The amount of data being loaded will influence the speed of GA performance, but nothing really dramatic, I would say. I am running a website/app with 15+ million events per month, and even though all the reporting is automated via the API, every now and then we need to find something specific and use the regular GA UI.
More than speed, I would be worried about sampling. That's the reason we automated the reporting in the first place, as there are some ways you can eliminate it (with some limitations). See, for instance, this post that describes using Analytics Canvas, one of my favorite tools (I am not affiliated in any way :-)).
Also, let me ask what the purpose of your events would be. Think twice about whether you would actually use them later on...
Slow down GA UI performance
Standard Reports are precompiled and will display as usual. Reports that are generated ad hoc (because you apply filters, segments etc.) will take a little longer, but not so much that it hurts.
Increase the amount of data sampling
If by "sampling" you mean throwing away raw data, Google does not do that (I actually have that in writing from a Google representative). However the reports might not be able to resolve all data points (e.g. you get Top 10 Keywords and everything else is lumped under "other").
However those events will count towards you data limit which is ten million interaction hits (pageviews, events, transactions, any single product in a transaction, user timings and possibly others). Google will not drop data or close your account without warning (again, I have that in writing from a Google Sales Manager) but they reserve to right to either force you to collect less interaction hits or to close your account some time after they issued a warning (actually they will ask you to upgrade to Premium first, but chances are you don't want to spend that much money).
Google is pretty lenient when it comes to violations of the data limit but other peoples leniency is not a good basis for a reliable service, so you want to make sure that you stay withing the limits.

Scrape all google search result for a specific name

I think this question has been answered here before, but I could not find the desired topic. I am a newbie in web scraping. I have to develop a script that will take all the Google search results for a specific name. Then it will grab the related data for that name and, if more than one result is found, group the data according to the names.
All I know is that Google has some kind of restriction on scraping. They provide a custom search API. I have not used that API yet, but I am hoping to get all the resulting links for a query from it. However, I could not understand what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more about what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can up it to around 3k a day for 2000 USD/year. You can up it more by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method. If you only need a low number of requests and are mainly interested in getting some websites for a keyword, that's the choice.
Starting point would be here: https://code.google.com/apis/console/
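If you go the API route, a minimal sketch with the Custom Search JSON API looks like this; the API key and search engine ID (cx) are placeholders you get from the console linked above, and the query string is just an example.

```python
# Minimal sketch: collect result links for a name via the Custom Search JSON API.
# API_KEY and CX are placeholders; the API returns at most 10 results per request
# and roughly 100 results per query in total.
import requests

API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def search(query, start=1):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "start": start},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

links = []
for start in range(1, 92, 10):            # page through the available results
    items = search('"John Smith"', start=start)
    if not items:
        break
    links.extend(item["link"] for item in items)
print(links)
```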
b) You can scrape the real search results
That's the only way to get the true ranking positions, for SEO purposes or to track website positions. It also lets you get a large number of results, if done right.
You can Google for code; the most advanced free (PHP) code I know of is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains it in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be banned by IP/captcha. Not getting detected should be a priority.
