How do I regenerate statistics in OpenX?

Due to faulty hardware, statistics generated over a two-week period were significantly higher than normal (10,000 times higher than normal).
After moving the application to a new server, the problem resolved itself. The issue I have is that there are two weeks of stats that are clearly wrong.
I have checked the raw impressions table for the affected fortnight and it seems to be correct (i.e. stats per banner per day match the average for the previous month). Looking at the intermediate and summary impressions tables, the values are inflated.
I understand from the OpenX forum (link text) that it's possible to regenerate stats from the raw data, but only one hour at a time, meaning regenerating two weeks of stats would be very time-consuming.
Is there another, more efficient way to regenerate the stats from the raw data for the affected fortnight?

Have a look at this link as it appears to have a solution you may find helpful. The solution is similar to the one you posted in your question, but it appears that this one has been modified to make it easier to use. Other than using regenerateAdServerStatistics.php, I do not know of another option for regenerating the statistics you need.

I understand from the openx forum (link text) it's possible to regenerate stats from the raw data but it will only regenerate stats per hour, meaning regenerating stats for 2 weeks would be very time consuming
We have solved this problem on our installation by creating a wrapper shell script for regenerateAdServerStatistics.php with dateStart & dateEnd arguments for situations like the one you mention. It's used to:
regenerate statistics for a specific day (all hours, takes ~2h)
run normal maintenance to keep today's stats updated
go to step (1) as long as the day processed < dateEnd
To be honest the script is somewhat more complex, as we also need to import raw data from our data warehouse for each day to be processed, because the "live" data are kept in an in-memory database, but that's somewhat outside the scope of this post. A simplified sketch of the day-by-day loop is shown below.
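For reference, here is a minimal sketch of what such a wrapper could look like, written in Python for illustration rather than shell. The script paths and the --startDate/--endDate option names are assumptions, not the real flag names; check your OpenX version's regenerateAdServerStatistics.php for the actual options.

# Minimal sketch of a day-by-day regeneration wrapper (assumed paths and flag names).
import subprocess
from datetime import date, timedelta

def regenerate_range(start: date, end: date) -> None:
    """Regenerate OpenX statistics one day at a time between start and end."""
    day = start
    while day <= end:
        # Hypothetical invocation; verify the real option names for your version.
        subprocess.run(
            ["php", "scripts/maintenance/regenerateAdServerStatistics.php",
             "--startDate", day.isoformat(), "--endDate", day.isoformat()],
            check=True,
        )
        # Run normal maintenance so today's stats stay up to date in the meantime.
        subprocess.run(["php", "scripts/maintenance/maintenance.php"], check=True)
        day += timedelta(days=1)

if __name__ == "__main__":
    regenerate_range(date(2011, 5, 1), date(2011, 5, 14))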

Related

Data sampling in Google Analytics goal flow report

The goal flow report on my Google Analytics account shows some strange sampling behavior. While I can usually select up to a month of data before sampling starts, it seems to be different for the goal flow report.
As soon as I select more than one day of data, the data set used gets smaller very quickly. At three days the report is based on only 50% of the sessions, which, according to Analytics, comes to only 35 sessions.
Has anyone experienced similar sampling behavior even though only very small data sets are involved?
Sampling is triggered when your request is calculation-intensive; there's no guaranteed point at which it trips.
Goal flow complexity will increase exponentially as you add goals, so even a low number of goals might make this report demand a lot of processing.
Meanwhile, you'll find that most of the standard reports can cover large periods of time without sampling; they are pre-aggregated, so it's very cheap to load them.
If you want to know more about sampling, see here:
https://stackoverflow.com/a/37386181/5815149

Single-stat percentage change from initial value in Graphite/Grafana?

Is there a way to simply show the change of a value over the selected time period? All I'm interested in is the offset of the last value compared to the initial one. The values can vary above and below these over the time period; that's not really relevant (and would be exceptions in my case).
For an initial value of 100 and a final value of 105, I'd expect a single-stat box displaying 5%.
I have the feeling I'm missing something obvious, but I can't find a way to accomplish this deceptively simple task.
Edit:
I'm trying to create a scripted Grafana dashboard that will automatically populate disk consumption growth for all our various volumes. The data is already in Graphite, but for purposes of capacity management and finance planning (which projects/departments get billed) it would be helpful for managers to have a simple and coarse overview of which volumes grow outside expected parameters.
The idea was to create a list of single-stat values with color coding that could easily be scrolled through to find abnormalities. Disk usage would obviously never be negative, but volatility in usage between the start and end of the time period would be lost in this view. That's not a big concern for us as this is all shared storage and such usage is expected to a certain degree.
The perfect solution would be to have the calculations change dynamically based on the selected time period.
I'm thinking that this is not really possible (at least not easily) to do with just Graphite and Grafana and have started looking for alternative methods. We might have to implement a different reporting system for this purpose.
Edit 2
I've tried implementing the suggested solution from Leonid, and it works after a fashion. The calculations seem somewhat off from what I expected, though.
My test dashboard is set up as follows (screenshot not reproduced here).
If I were to calculate the change manually, I'd end up with roughly 24% change between the start (7.23) and end (8.96) values: (8.96 - 7.23) / 7.23 ≈ 23.9%. Graphite calculates this as 19%. There's probably a reason for the discrepancy, perhaps something to do with it being a time series and not discrete values?
As a side note: the example covers only 30 days, even though the most interesting number would be a year. We don't have quite a year of data in Graphite yet, and having a 30-day view is also interesting. It seems I will have to implement several dashboards with static time ranges.
You certainly can do that for some fixed period. For example, the following query takes the absolute difference between the current metric value and the value the metric had one minute ago (i.e. the initial value), and then calculates it as a percentage of the initial value.
asPercent(absolute(diffSeries(my_metric, timeShift(my_metric, '1m'))), timeShift(my_metric, '1m'))
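In plain arithmetic, the query above computes, point by point, something like the following. This is only a minimal Python sketch of the formula; the numbers are the example values from the question, not real metric data.

def percent_change(initial: float, current: float) -> float:
    """Absolute difference between current and initial, as a percentage of initial."""
    return abs(current - initial) / initial * 100

# Example: initial 7.23, current 8.96 -> roughly 23.9%.
print(round(percent_change(7.23, 8.96), 1))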
I believe you can't do that for the time period selected in the Grafana picker.
But is that really what you need? It seems strange, because as you said the value can change in both directions. Maybe standard deviation would be more suitable for you? It's available in Graphite as the stdev function.

Search Twitter by hour in R

I am currently using the twitteR package and running into a roadblock when trying to extract tweets by minute or hour. My ultimate goal is seeing the total number of tweets for a particular topic at a granular level (specifically for big events like the Super Bowl or the World Cup).
The package allows tweets to be searched using since and until, but the finest granularity one can get is by day.
Here is an example of the code:
tweets <- searchTwitter("grammy", n=1500, since='2016-02-15', until='2016-02-16')
Based on the results posted by @SQLMenace, it appears that twitteR only retrieves the status without returning accurate date/time information.
In that case, it depends on the scenario in which you're performing the analysis. If you're performing the analysis "live" while the event is occurring, you can simply run the R script as a cron job. Let's say every twenty minutes you run a job to get all the most recent tweets. Then you can eliminate duplicates to get an idea of how many unique tweets occurred in a 20-minute span; see the sketch below.
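A minimal sketch of that de-duplication bookkeeping, shown in Python for illustration (the same logic applies to an R script). The file names, the id_str field, and the idea of persisting seen IDs between runs are assumptions for the sake of a self-contained example.

# Sketch: count unique tweets per cron run by de-duplicating on tweet ID.
import json
import os

SEEN_FILE = "seen_ids.json"       # hypothetical state file shared between runs
BATCH_FILE = "latest_batch.json"  # hypothetical dump of the most recent search

def load_seen() -> set:
    """Load the IDs of tweets already counted in earlier runs."""
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as fh:
            return set(json.load(fh))
    return set()

def main() -> None:
    seen = load_seen()
    with open(BATCH_FILE) as fh:
        batch = json.load(fh)            # list of tweets fetched by the last run
    ids = {t["id_str"] for t in batch}   # assumed field name for the tweet ID
    new = ids - seen
    print(f"{len(new)} unique tweets in this 20-minute window")
    with open(SEEN_FILE, "w") as fh:
        json.dump(sorted(seen | new), fh)

if __name__ == "__main__":
    main()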
However, if you're performing the analysis retrospectively, the above method wouldn't work. And I'd caution against using twitteR. It seems as though the functionality for gathering tweets by date is not that versatile. I'd recommend using tweepy (for Python) which retrieves not only the status, but also the exact time the tweet was sent.
Hope that helps.

How Do You Deal With Time Zones in Time Series Graphs?

I imagined there would be more literature on this, but I'm having trouble finding any. I have a lot of non-algebraically-aggregatable time series data (that is to say, points for which no function exists that I could use to aggregate them to a higher granularity -- stuff like unique active users, unique contributors, etc., where knowing the amount I had every minute of some hour does not tell me how many I had in total during the hour). Currently, I'm just storing and presenting all of this data in UTC. The problem is that many of my clients find this confusing -- understandably so. Because the data is non-algebraically-aggregatable, there's no way to get from UTC data for one day, midnight to midnight, to, say, PST data from midnight to midnight. Recalculation would need to be done from the raw data.
So:
Recalculation from raw data is prohibitively expensive for some complicated analytics graphs
We could store all data for all time zones, but this would increase the amount of data we store x24.
All of that said, how do other people deal with this issue? Here's how Google Analytics does it, but this seems insufficient for my use case because I know that if I open the multiple-timezone can of worms, clients will ask for more than one. It will also take a lot of work that doesn't seem worth the effort, as just adding timezone support won't be extremely noticeable or a huge win. What I'm really hoping for is some clever design solution that presents the UTC data in some intuitive enough way that it's no longer confusing for people in other timezones. Has anyone dealt with similar problems and come upon a solution I'm missing?
First of all, you should recognize that there are a lot more than 24 time zones. In order to accurately take into account how people actually use time worldwide, you should be using IANA time zones, of which there are over 500. See also Wikipedia and the timezone tag wiki.
If you are dealing with individual points (discrete timestamps), then you can certainly convert from UTC to any time zone you wish, on the fly, as you render your graph. Just keep in mind that the range of data you query will also need to be translated to that time zone.
But if you are talking about aggregating data by the "day" of a specific time zone, then there is no magic bullet. You will need to decide ahead of time which time zones you want to support and calculate each one separately. When you do this, recognize that it's not just the view that's changing. Since the day boundaries are different for each time zone, then the data for each time zone could potentially have very different daily totals.
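To see why the daily totals differ, note that the same UTC timestamps can land on different local dates. A minimal sketch illustrating just the day-boundary point, assuming Python 3.9+ with the standard zoneinfo module (the event timestamps are made up):

from collections import Counter
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Two events late in the UTC day: same pair of events, different daily buckets.
events = [
    datetime(2016, 2, 15, 23, 30, tzinfo=timezone.utc),
    datetime(2016, 2, 16, 1, 15, tzinfo=timezone.utc),
]

def daily_counts(points, zone_name):
    """Group timestamps by the calendar date they fall on in the given zone."""
    zone = ZoneInfo(zone_name)
    return Counter(p.astimezone(zone).date() for p in points)

print(daily_counts(events, "UTC"))                  # split across Feb 15 and Feb 16
print(daily_counts(events, "America/Los_Angeles"))  # both fall on Feb 15 locally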
You should also be aware that not every day has 24 hours. If the day happens to be the date of a daylight saving time transition, it could have 23, 23.5, 24.5, or 25 hours. This could potentially affect how you draw your graph.
One approach you might consider is to be time zone ignorant in your aggregations, rather than using UTC or any specific time zone. Of course this depends heavily on the context of your data, but it is appropriate in certain circumstances. For example, on an invoice, you might care less about the specific timestamps, and more about which calendar date the invoice was assigned to. In that case, once a date is assigned, you would just aggregate on that date. Even if the company operates over multiple time zones, you wouldn't care about that in aggregate.
As far as some clever design that abstracts this from the user, I'm afraid I haven't seen much. The only two choices you really have are timezone-adjusted aggregations (UTC or otherwise), and time zone ignorant aggregations for calendar-date contexts.
We had similar issues rolling up generation data in renewables. We went with three options: User / Farm / UTC.
If the user selects USER, then all the data is based on their browser time zone, and "yesterday" means the 24 hours up to the last midnight in the user's local time.
Similarly, if it is Farm, we take the farm's local time zone and derive the same.
UTC is the standard, similar to what you have implemented.

Last “end date” with data in Analytics

I'm using "Reporting google Analitics API" and I can’t find information about what the last “end date” with data in Analytics is.
For example, let's suppose you want to retrive the last month’s data.
When do you have to perform the query?
The first day of the current month?
...or the second one?
...or maybe the third one?
And just one other question: are the returned data for days in Pacific time?
The Google Analytics API is supposed to have access to the same data you have in the interface.
Google says that data can take up to 24h to process. The time it takes to really update the data depends on the type and size of the account. Small accounts are updated multiple times a day and can have data available in just a few hours. Once you reach 1M hits a month, you are moved to a different mode where the data on your account is updated only once a day. Google Analytics Premium customers get updates more often, even for large amounts of traffic.
There's no way to tell through the API exactly what the time of the last processed hit is. You can query the data for today by the hour and see for yourself, though.
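As an illustration, a "today by the hour" query against the v3 Core Reporting API might look something like the sketch below. The view ID and OAuth token are placeholders, and using sessions as the metric is just an example; treat this as a rough sketch, not a definitive client.

# Rough sketch: ask the v3 Core Reporting API for today's sessions per hour,
# to see how far into the day processed data actually goes.
import datetime
import requests

VIEW_ID = "ga:12345678"            # placeholder view/profile ID
ACCESS_TOKEN = "ya29.placeholder"  # placeholder OAuth2 access token

today = datetime.date.today().isoformat()
resp = requests.get(
    "https://www.googleapis.com/analytics/v3/data/ga",
    params={
        "ids": VIEW_ID,
        "start-date": today,
        "end-date": today,
        "metrics": "ga:sessions",
        "dimensions": "ga:hour",
    },
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()
for hour, sessions in resp.json().get("rows", []):
    print(hour, sessions)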
Usually you don't care and just want to make sure that the data you're querying has been fully processed for that day.
So if you query data for yesterday, there's a chance it has not been completely updated; for example, if it's just past midnight, the data for yesterday ended only a couple of minutes ago and probably hasn't been completely processed yet. The safest bet in this case is to query data up to 2 days ago.
So if today is 2012-06-15 and you want to get one month of data, a safe approach is to query data with start-date=2012-05-13 and end-date=2012-06-13. This will most of the time give you data for days that have been fully processed, but it's not 100% safe either. Google Analytics has had outages in the past where data took longer than that to process; these are not usual, though. When you get the data out, it's really hard to tell just from the API whether the data for those days has been fully processed or not; using the 2-days-ago idea just makes it more likely that it has.
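In code, that "2 days ago" safety margin is just simple date arithmetic. A small sketch; the one-month length is handled naively as 31 days, which happens to reproduce the dates in the example above:

# Sketch: build a "safe" roughly one-month window ending 2 days ago.
from datetime import date, timedelta

def safe_month_window(today: date) -> tuple:
    end = today - timedelta(days=2)    # leave 2 days for processing
    start = end - timedelta(days=31)   # roughly one month back (naive)
    return start.isoformat(), end.isoformat()

print(safe_month_window(date(2012, 6, 15)))  # ('2012-05-13', '2012-06-13')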
The days are aggregated following the time zone settings configured on the Google Analytics profile.
