Google Analytics Sampling despite low sessions - google-analytics

I'm using the Google Analytics Reporting API, but I get sampled results even though the number of sessions in the specified date range is far below the 500K limit. I have only ~4K sessions in the month.
I have also set "samplingLevel" to "LARGE".
Here's the Python query:
response = analytics.reports().batchGet(
    body={
        "reportRequests": [
            {
                "viewId": myViewID,
                "dateRanges": [
                    {
                        "startDate": "2017-05-01",
                        "endDate": "2017-05-30"
                    }
                ],
                "samplingLevel": "LARGE",
                "metrics": [
                    {
                        "expression": "ga:sessions"
                    }
                ],
                "dimensions": [
                    {
                        "name": "ga:browser"
                    },
                    {
                        "name": "ga:city"
                    }
                ]
            }
        ]
    }
).execute()
As you can see below, the sampling space is 4365 sessions, far less than the 500K limit:
response.get('reports', [])[0].get('data',[]).get('samplesReadCounts',[])
Out[31]: [u'2051']
response.get('reports', [])[0].get('data',[]).get('samplingSpaceSizes',[])
Out[32]: [u'4365']
Breaking the request into smaller date ranges doesn't help either. I also tried this using the googleAnalyticsR package in R with anti_sample = TRUE.
> web_data <- google_analytics_4(view_id,
+ date_range = c("2017-05-01", "2017-05-30"),
+ dimensions = c("city","browser"),
+ metrics = c("hits"),
+ samplingLevel="LARGE",
+ anti_sample = TRUE)
2017-06-04 11:54:51> anti_sample set to TRUE. Mitigating sampling via multiple API calls.
2017-06-04 11:54:51> Finding how much sampling in data request...
2017-06-04 11:54:52> Downloaded [10] rows from a total of [15].
2017-06-04 11:54:52> Data is sampled, based on 47% of sessions.
2017-06-04 11:54:52> Finding number of sessions for anti-sample calculations...
2017-06-04 11:54:53> Downloaded [30] rows from a total of [30].
2017-06-04 11:54:53> Calculated [3] batches are needed to download approx. [18] rows unsampled.
2017-06-04 11:54:53> Anti-sample call covering 14 days: 2017-05-01, 2017-05-14
2017-06-04 11:54:54> Downloaded [7] rows from a total of [7].
2017-06-04 11:54:54> Data is sampled, based on 53.2% of sessions.
2017-06-04 11:54:54> Anti-sampling failed
2017-06-04 11:54:54> Anti-sample call covering 9 days: 2017-05-15, 2017-05-23
2017-06-04 11:54:54> Downloaded [4] rows from a total of [4].
2017-06-04 11:54:54> Data is sampled, based on 55.7% of sessions.
2017-06-04 11:54:54> Anti-sampling failed
2017-06-04 11:54:54> Anti-sample call covering 7 days: 2017-05-24, 2017-05-30
2017-06-04 11:54:55> Downloaded [10] rows from a total of [10].
2017-06-04 11:54:55> Data is sampled, based on 52.3% of sessions.
2017-06-04 11:54:55> Anti-sampling failed
Joining, by = c("city", "browser")
Joining, by = c("city", "browser")
2017-06-04 11:54:55> Finished unsampled data request, total rows [13]
When I check the same data in a custom request, I see similar sampling.
Any idea why I get sampled results even though the session count is far below the limit?

There's a ticket at Google about sampling despite low sessions at https://issuetracker.google.com/issues/62525952

You have only 4K sessions in that view, but that view may be using filters. Check how much traffic the property receives by looking at an unfiltered view: the 500K-session threshold applies at the property level, not at the view level.

The 500K limit applies to default reports.
Edit: more precisely, it is 500K sessions at the property level for the date range you are querying, and it applies to ad-hoc queries.
Default reports explained: Analytics has a set of preconfigured, default reports listed in the left pane under Audience, Acquisition, Behavior, and Conversions.
It looks like you are working with an ad-hoc report with secondary dimensions, so the 500K threshold probably no longer applies and the effective limit is likely much lower. There is more information on this in the page you originally linked to.

Related

R googleanalyticsR: strangely, data are not available

I set up a cron job at 5:00 AM to download page-view data for the previous day.
For two days now, though, the data has been missing, with this log:
2019-09-05 05:00:03> anti_sample set to TRUE. Mitigating sampling via multiple API calls.
2019-09-05 05:00:03> Finding how much sampling in data request...
2019-09-05 05:00:04> Downloaded [0] rows from a total of [].
2019-09-05 05:00:04> No sampling found, returning call
2019-09-05 05:00:04> Downloaded [0] rows from a total of [].
When I run the script manually later in the day, I do get the data.
Do you have any clue what is causing this behaviour?
(I checked the timezone of our GA account, and it is exactly my city's.)
Update: I edited the crontab to run every hour, looking for a possible time threshold at which the GA report data gets processed and becomes available.
library(googleAuthR)
library(googleAnalyticsR)
service_token <- googleAuthR::gar_auth_service("My Project .......json")

## fetch data
gaid <- .....
recent_dat_ga <- google_analytics(
  viewId = gaid,  # replace this with your view ID
  date_range = c(
    as.character(Sys.Date() - 1),
    as.character(Sys.Date() - 1)
  ),
  metrics = "pageviews",
  dimensions = c("pagePath", "date"),
  anti_sample = TRUE
)
So, the problem was simple: Google prepares and processes the daily data with a certain time lag. For example, I am in the UTC+3 timezone and I have to wait until about 8 AM for the previous day's data to be ready. The solution is to check for the fresh data periodically, so that it does not get lost.
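A minimal sketch of that approach, assuming the same gaid and service-account authentication as above; the number of attempts and the wait time between checks are arbitrary placeholders:
library(googleAnalyticsR)

# Poll for yesterday's report a few times instead of trusting a single 5 AM run.
fetch_yesterday <- function(gaid, attempts = 6, wait_secs = 3600) {
  for (i in seq_len(attempts)) {
    dat <- google_analytics(
      viewId      = gaid,
      date_range  = c(Sys.Date() - 1, Sys.Date() - 1),
      metrics     = "pageviews",
      dimensions  = c("pagePath", "date"),
      anti_sample = TRUE
    )
    if (!is.null(dat) && nrow(dat) > 0) return(dat)  # data has been processed
    Sys.sleep(wait_secs)                             # not ready yet, wait and retry
  }
  warning("No data returned after ", attempts, " attempts")
  NULL
}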

GET in for loop needs to be faster (httr) in R

I need to write an R script that, at the end, sends a huge number of GET requests to a server. Each row of my data frame contains several pieces of information, and the final column "url" holds the HTTP request for that row - for example: https://logsxxx.xxx.com/xx.xx?.....
The script may have to send 300,000-1,000,000 GET requests.
The good news is that my script works and the requests reach the server.
The bad news is that the loop takes a long time to send all the rows - about 9 hours for 300,000 rows.
I've tried ifelse and apply instead, but failed...
system.time(
  for (i in 1:300000) {
    try({
      GET(mydata$url[i], timeout(3600))
      print(paste("row", i, "sent at", Sys.time()))
    }, silent = FALSE)
  }
)
Another problem is that the script may fail to send 100% of the requests if the internet connection breaks for any reason. Then I see the following error:
[1] "row 18 sent at 2019-01-18 14:22:05"
[1] "row 19 sent at 2019-01-18 14:22:06"
[1] "row 20 sent at 2019-01-18 14:22:06"
[1] "row 21 sent at 2019-01-18 14:22:06"
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Connection timed out after 10000 milliseconds
[1] "row 23 sent at 2019-01-18 14:22:16"
[1] "row 24 sent at 2019-01-18 14:22:16"
[1] "row 25 sent at 2019-01-18 14:22:16"
At least the script doesn't break completely and continues with the next row. The problem is that the longer the internet connection is down, the more rows never get sent.
I would be very grateful if:
someone could show me a faster way to send the requests - maybe without the for loop
and could show me how to do something like this: "if the GET request fails because of the internet connection, retry 3 times before moving on to the next request. Do that until all elements of i are sent"
Kind regards,
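One way to get the retry behaviour described above is httr::RETRY(), which re-issues a request several times before giving up. A rough sketch, not tested against your server, assuming mydata$url holds the URLs as in the loop above:
library(httr)

send_with_retry <- function(url) {
  # RETRY() re-issues the request up to 'times' attempts, backing off between tries;
  # the tryCatch() keeps the loop alive even if all attempts for one URL fail.
  tryCatch(
    RETRY("GET", url, times = 3, pause_base = 2, timeout(3600)),
    error = function(e) NULL
  )
}

responses <- lapply(mydata$url, send_with_retry)
For raw speed, the bottleneck is waiting for each response in turn rather than the for loop itself; issuing requests concurrently (for example with the curl package's multi interface) is what would actually cut the nine hours down.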

Loading wikidata dump

I'm loading all geographic entries (Q56061) from the Wikidata JSON dump.
The whole dump contains about 16M entries according to the Wikidata:Statistics page.
Using Python 3.4 + ijson + libyajl2, it takes about 93 hours of CPU time (AMD Phenom II X4 945, 3 GHz) just to parse the file.
Using online sequential item queries for the 2.3M entries of interest takes about 134 hours in total.
Is there a more efficient way to perform this task?
(Maybe something like OpenStreetMap's PBF format and the osmosis tool.)
My loading code and estimates were wrong. Using ijson.backends.yajl2_cffi brings it down to about 15 hours for full parsing + filtering + storing to the database.

Retrieve video file duration (time) using R

I'm writing code to delete some video files that I don't need. The videos are CCTV footage recorded 24/7, but the recording software saves the files in roughly one-hour chunks, so their durations are not exact. I'm only interested in keeping videos from a particular part of the day (which varies), and the inexact durations are causing me problems.
Each file name has a date and time stamp, but only for the start, so if I could find the duration everything becomes simple algebra.
So my question is simple: is it possible to get the duration (time) of video files using R?
A couple of other useful details: the videos come from several cameras, and each camera has a different recording frame rate, so using file.info() to get the file size and derive the length of the video is not an option. Also, the video files are in .avi format.
Cheers
Patrao
As far as I know, there are no ready-made packages for handling video files in R (the way MATLAB does). This isn't a pure R solution, but it gets the job done: I installed the MediaInfo CLI and called it from R using system().
wolf <- system("q:/mi_cli/mediainfo.exe Krofel_video2volk2.AVI", intern = TRUE)
wolf # output by MediaInfo
[1] "General"
[2] "Complete name : Krofel_video2volk2.AVI"
[3] "Format : AVI"
[4] "Format/Info : Audio Video Interleave"
[5] "File size : 10.7 MiB"
[6] "Duration : 11s 188ms"
[7] "Overall bit rate : 8 016 Kbps"
...
[37] "Channel count : 1 channel"
[38] "Sampling rate : 8 000 Hz"
[39] "Bit depth : 16 bits"
[40] "Stream size : 174 KiB (2%)"
[41] "Alignment : Aligned on interleaves"
[42] "Interleave, duration : 63 ms (1.00 video frame)"
# Find where Duration is (general) and extract it.
find.duration <- grepl("Duration", wolf)
wolf[find.duration][1]# 1 = General, 2 = Video, 3 = Audio
[1] "Duration : 11s 188ms"
Have fun parsing the time.
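If it helps, here is a rough sketch of one way to turn that Duration line into seconds. It assumes MediaInfo's default text output, where hours, minutes, seconds and milliseconds appear as h, mn, s and ms tokens:
dur_line <- wolf[find.duration][1]     # "Duration : 11s 188ms"
dur_text <- sub(".*: ", "", dur_line)  # "11s 188ms"

to_seconds <- function(x) {
  # pull out "<number><unit>" tokens, e.g. "1h", "2mn", "11s", "188ms"
  tokens <- regmatches(x, gregexpr("[0-9]+ ?(h|mn|s|ms)", x))[[1]]
  value  <- as.numeric(gsub("[^0-9]", "", tokens))
  unit   <- gsub("[0-9 ]", "", tokens)
  mult   <- c(h = 3600, mn = 60, s = 1, ms = 0.001)
  sum(value * mult[unit])
}

to_seconds(dur_text)  # 11.188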
This might be a bit low level, but if you're up to the task of parsing binary data, look up a copy of the AVI spec and figure out how to get both the number of video frames and the frame rate.
If you look at one of the AVI files using a hex editor, you will see a series of LIST chunks at the beginning. A little farther into this chunk will be a vids chunk. Immediately following vids should be a human-readable video four-character code (FourCC) specifying the video codec, probably something like mjpg (MJPEG) or avc1 (H.264) for a camera. 20 bytes after that will be 4 bytes stored in little endian notation which indicate the frame rate. Skip another 4 bytes and then the next 4 bytes will be another little endian number which indicate the total number of video frames.
I'm looking at a sample AVI file right now where the numbers are: frame rate = 24 and # of frames = 0x37EB = 14315. This works out to 9m56s, which is accurate for this file.
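For reference, a rough sketch of that idea in R using readBin(), following the byte offsets described above (in the actual strh header these fields are dwScale, dwRate and dwLength, with frames per second equal to dwRate/dwScale). Treat the offsets as assumptions to verify against the AVI spec and your own files:
avi_duration <- function(path, search_bytes = 64 * 1024) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, "raw", n = search_bytes)

  # locate the "vids" FourCC that starts the video stream header
  pat <- charToRaw("vids")
  starts <- which(header == pat[1])
  pos <- NA_integer_
  for (i in starts) {
    if (i + 3 <= length(header) && identical(header[i:(i + 3)], pat)) { pos <- i; break }
  }
  if (is.na(pos)) stop("No 'vids' stream header found in the first ", search_bytes, " bytes")

  # read a 4-byte little-endian unsigned integer starting at 1-based index 'at'
  le32 <- function(at) sum(as.integer(header[at:(at + 3)]) * 256^(0:3))

  scale  <- le32(pos + 20)  # dwScale
  rate   <- le32(pos + 24)  # dwRate (the "frame rate" above when dwScale == 1)
  frames <- le32(pos + 32)  # dwLength, total number of video frames

  frames / (rate / scale)   # duration in seconds
}

avi_duration("Krofel_video2volk2.AVI")  # roughly 11.2 s for the file above, if the offsets hold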

Collect tweets with their related tweeters

I am doing text mining on tweets. I have collected random tweets from different accounts about a topic, transformed them into a data frame, and found the most frequent tweeters among them (using the "screenName" column). The tweets look like these:
[1] "ISCSP_ORG: #cybercrime NetSafe publishes guide to phishing:
Auckland, Monday 04 June 2013 – Most New Zealanders will have...
http://t.co/dFLyOO0Djf"
[1] "ISCSP_ORG: #cybercrime Business Briefs: MILL CREEK — H.M. Jackson
High School DECA chapter members earned the organizatio...
http://t.co/auqL6mP7AQ"
[1] "BNDarticles: How do you protect your #smallbiz from #cybercrime?
Here are the top 3 new ways they get in & how to stop them.
http://t.co/DME9q30mcu"
[1] "TweetMoNowNa: RT #jamescollinss: #senatormbishop It's the same
problem I've been having in my fight against #cybercrime. \"Vested
Interests\" - Tell me if …"
[1] "jamescollinss: #senatormbishop It's the same problem I've been
having in my fight against #cybercrime. \"Vested Interests\" - Tell me
if you work out a way!"
Different tweeters have sent many tweets within the collected dataset.
Now I want to group the related tweets by their corresponding tweeter/user.
Is there any way to do this in R? Any suggestions? Your help would be much appreciated.
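A minimal sketch of the grouping step, assuming the tweets already sit in a data frame (called tweets_df here) with screenName and text columns, as produced for example by twitteR::twListToDF():
# group the tweet texts by the account that posted them
tweets_by_user <- split(tweets_df$text, tweets_df$screenName)

# all collected tweets from one account
tweets_by_user[["ISCSP_ORG"]]

# or just count how many tweets each user contributed
sort(table(tweets_df$screenName), decreasing = TRUE)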
