R googleAnalyticsR: strangely, data are not available

I set a cron job at 5:00 AM to download page-view data for the past day.
For the last two days, though, the data have been missing, and the log shows:
2019-09-05 05:00:03> anti_sample set to TRUE. Mitigating sampling via multiple API calls.
2019-09-05 05:00:03> Finding how much sampling in data request...
2019-09-05 05:00:04> Downloaded [0] rows from a total of [].
2019-09-05 05:00:04> No sampling found, returning call
2019-09-05 05:00:04> Downloaded [0] rows from a total of [].
When I run the script manually later in the day, I do get the data...
Do you have any clues as to what is causing this behaviour?
(I checked the timezone of our GA account, and it matches my city's.)
Update: I edited the crontab to run every hour, looking for the time threshold at which GA report data becomes processed and available.
library(googleAuthR)
library(googleAnalyticsR)

## authenticate with the service-account JSON key
service_token <- googleAuthR::gar_auth_service("My Project .......json")

## fetch data
gaid <- .....

recent_dat_ga <- google_analytics(
  viewId      = gaid,  # replace this with your view ID
  date_range  = c(as.character(Sys.Date() - 1),
                  as.character(Sys.Date() - 1)),
  metrics     = "pageviews",
  dimensions  = c("pagePath", "date"),
  anti_sample = TRUE
)

So, the problem was simple. Google prepares and processes daily data with a certain time lag. For example, I am in the UTC+3 timezone, and I have to wait until about 8 AM for the previous day's data to be ready. The solution is to check for the fresh data once in a while, so that it does not get lost; a sketch of that idea is below.
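For illustration, here is a minimal sketch of that polling idea (my own, not from the original post; the 30-minute wait and the 12-attempt cap are arbitrary assumptions):

library(googleAnalyticsR)

## keep asking for yesterday's data until Google has finished processing it
fetch_when_ready <- function(gaid, max_attempts = 12, wait_secs = 1800) {
  for (attempt in seq_len(max_attempts)) {
    dat <- google_analytics(
      viewId      = gaid,
      date_range  = c(as.character(Sys.Date() - 1),
                      as.character(Sys.Date() - 1)),
      metrics     = "pageviews",
      dimensions  = c("pagePath", "date"),
      anti_sample = TRUE
    )
    if (!is.null(dat) && nrow(dat) > 0) return(dat)  # data are ready
    Sys.sleep(wait_secs)                             # not ready yet, check again later
  }
  warning("No data returned after ", max_attempts, " attempts")
  NULL
}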

Related

Error while using "EpiEstim" and "ggplot2" libraries

First of all, I must say I'm a complete noob in R, so I apologize in advance for asking for help with such a simple task. My task is to produce a graph of COVID-19 cases for a certain period using data from a CSV file. Unfortunately, at the moment I cannot contact the person from the World Health Organization who provided the data and the script. I was left with an error that I cannot fix either myself or with the help of Google.
script.R
library(EpiEstim)
library(ggplot2)
COVID<-read.csv("dataset.csv")
res_parametric_si<-estimate_R(COVID$I,method="parametric_si",config=make_config(list(mean_si=4,std_si=3)))
plot(res_parametric_si)
dataset.csv
Date,Suspected per day,Total suspected,Discarded/pending,Confirmed per day,Total confirmed,Deaths per day,Deaths Total,Case fatality rate,Daily confirmed,Recovered per day,Recovered total,Active cases,Tested with PCR,# of PCR tests total,average tests/ 7 days,Inf HCW,Inf HCW/d,Vent HCW,Susp per day
01-Jul-20,1239,91172,45285,889,45887,12,1185,2.58%,889,505,20053,24649,11109,676684,10073,6828,63,,1239
02-Jul-20,1249,92421,45658,876,46763,27,1212,2.59%,876,505,20558,24993,13167,689851,9966,6874,46,,1249
03-Jul-20,1288,93709,46032,914,47677,15,1227,2.57%,914,597,21155,25295,11825,701676,9915.7,6937,63,,1288
04-Jul-20,926,94635,46135,823,48500,22,1249,2.58%,823,221,21376,25875,9934,711610,9957,6990,53,,926
05-Jul-20,680,95315,46272,543,49043,13,1262,2.57%,543,327,21703,26078,6696,718306,9963.7,7030,40,,680
06-Jul-20,871,96186,46579,564,49607,21,1283,2.59%,564,490,22193,26131,9343,727649,10303.9,7046,16,,871
07-Jul-20,1170,97356,46942,807,50414,23,1306,2.59%,807,926,23119,25989,13568,741217,10806,7092,46,,1170
Error
Error in process_I(incid) (script.R#4): incid must be a vector or a dataframe with either i) a column called 'I', or ii) 2 columns called 'local' and 'imported'.
There seem to be two issues. The CSV has no column named I, so COVID$I is NULL, which is what triggers the process_I(incid) error; the daily case counts are in the "Daily confirmed" column, which read.csv imports as COVID$Daily.confirmed. On top of that, the example data only covers 7 data points, while the default configuration assumes it can window over more than 7 days. What worked for me was the following code (working in the sense that it does not throw an error):
config <- make_config(incid = COVID$Daily.confirmed,
                      method = "parametric_si",
                      list(mean_si = 4, std_si = 3, t_start = c(2, 3), t_end = c(6, 7)))
res_parametric_si <- estimate_R(COVID$Daily.confirmed, method = "parametric_si", config = config)
plot(res_parametric_si)

BlueSky Statistics - String to date [time] issues

I'm trying to convert a time stored as a string to a time variable.
I use Date/Dates/Convert String to Date...; for the format I use %H:%M:%S.
Here is the syntax from the GUI
[Convert String Variables to Date]
BSkystrptime (varNames = c('Time'),dateFormat = "%H:%M:%S",prefixOrSuffix = "prefix",prefixOrSuffixValue = "Con_",data = "Dataset2")
BSkyLoadRefreshDataframe(dframe=Dataset2,load.dataframe=TRUE)
A screen shot of the result is attached.
Compare the variables Time [string] and Con_Time [date/time]:
the hours are 2 hours out [wrong!]; the minutes and seconds are correct.
What am I doing wrong here?
I believe you are running into a known issue with a prior release of BlueSky Statistics. This issue is fixed in the current stable release available on the download page.
The reason for this is that although the time is converted correctly into the local time zone, BlueSky Statistics was then reading that local time and converting it to UTC.
You are probably 2 hours ahead of UTC, so you are seeing the time move 2 hours back. Give us a couple of days to post a patch.
You can also confirm this by writing and executing the following syntax in the syntax window:
Dataset2$Con_Time
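For intuition, here is a plain-R sketch of the effect (this is ordinary R, not BlueSky GUI syntax, and the UTC+2 zone is only assumed for illustration):

# parse a time in a UTC+2 zone (Europe/Berlin in summer), then render it in UTC
x <- as.POSIXct("2020-07-01 10:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "Europe/Berlin")
format(x, "%H:%M:%S")              # "10:30:00" -- correct in the local zone
format(x, "%H:%M:%S", tz = "UTC")  # "08:30:00" -- the 2-hour shift described in the question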

How can I set an inbuilt timer in RStudio to execute snippets of code?

I have some code in RStudio which sends an API request to Google BigQuery to run a saved query. My script then downloads the data back into RStudio to be fed into a machine learning model.
It's a lot of medical data, and I would like some of the process to be even more automated than before.
tags<-read.csv('patient_health_codes.csv',stringsAsFactors = FALSE)
tags<-tail(tags, 6)
This section takes a CSV to iterate over patient health groups (such as Eczema being 123456). - Section 1
MD2DS="2018-07-20"
MD2DE="2018-07-20"
The section above fills in the date periods for the query execution function. - Section 2
sapply(health_tags$ID, function(x) query_details(MD2_date_start = MD2DS,
                                                 MD2_date_end   = MD2DE,
                                                 Sterile_tag    = as.character(x)))
This section executes the query on Google BigQuery and iterates over all the different patient groups in x, i.e. Eczema, Asthma, Allergy Group, and so on. - Section 3
project <- "private-health-clinic"
bq_table <- paste0('[private-health-clinic:medical.london_', Sterile_tag, ']')
sql <- paste0('SELECT * FROM ', bq_table)
This section names each table after its patient group. - Section 4
data <- query_exec(sql, project = project, max_pages = Inf)
write.csv(data, file = paste0("medical_", Sterile_tag, ".csv"))
This code downloads the BigQuery table and writes it as a CSV in RStudio. - Section 5
My question is: how do I tell RStudio that, after someone executes section 3, it should wait 1 hour of real time, then execute section 4, and then execute section 5 five minutes after that?
Thank you in advance for the help - I'm not an R expert!
Just add this after section 3:
Sys.sleep(3600)
And after section 4, add:
Sys.sleep(300)
Depending on how long it takes to execute that code, it might be worthwhile to use Sys.sleep for the desired amount of waiting time minus the time spent calculating, as follows:
t0 <- Sys.time()
# section 3
t1 <- Sys.time()
# force the elapsed time into seconds before subtracting it from the delay
Sys.sleep(max(0, 3600 - as.numeric(difftime(t1, t0, units = "secs"))))
# section 4
t2 <- Sys.time()
Sys.sleep(max(0, 300 - as.numeric(difftime(t2, t1, units = "secs"))))
# section 5
Otherwise the waiting time will be added to the time spent running the sections.
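If this pattern comes up repeatedly, one option is to wrap it in a small helper; this is my own sketch, not part of the original answer, and run_then_wait is a made-up name:

# evaluate an expression, then sleep whatever remains of a fixed interval
run_then_wait <- function(expr, interval_secs) {
  t0 <- Sys.time()
  result <- force(expr)  # lazy evaluation: the expression runs here
  elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
  Sys.sleep(max(0, interval_secs - elapsed))
  result
}

# hypothetical usage with the question's section 5:
# data <- run_then_wait(query_exec(sql, project = project, max_pages = Inf), 300)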

Using Sys.sleep() to delay API call

I'm using R to make an API call to a weather data provider to download some weather forecasts. I'm using a free key that allows me to make no more than 10 calls per minute. I've tried using Sys.sleep() to ensure I don't go over the threshold, but the API resource monitor tells me that I've exceeded the number of allowed calls.
For example, if I'm making 6 calls, a time interval of 10 seconds between the calls ought to be sufficient (not taking into account the time R itself needs).
dat <- list()
for (i in 1:6) {
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  Sys.sleep(10)
  web_url <- gsub(i - 1, i, url)
}
The getWeatherData function does the following:
makes the API call (only one API call is made each time the function is invoked. Uses httr::GET() to get the data),
parses the XML output to get the desired variables (regular expressions),
performs some clean-up (for missing/garbage values),
converts strings to R date-time objects (POSIXct), and
rounds values to the nearest hour (lubridate::round_date()).
Function inputs:
web_url is a custom url,
api_key is my personal key,
history_date is a string (formatted as "%d/%m/%Y %H:%M:%S"), and
data_format specifies if I want an .XML or .json file as output.
I cannot share the URL/key for obvious reasons. As soon as I run this, I get a notification from the data provider that I've exceeded the allowable calls per minute (10). I don't get a notification every time, though - not sure why that is either.
Any help is appreciated!
This solution should be helpful if Sys.sleep doesn't do the trick.
Basically, it replaces the call to Sys.sleep with while-loop logic.
dat <- list()
delay_seconds <- 10
for (i in 1:6) {
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  date_time <- Sys.time()
  while ((as.numeric(Sys.time()) - as.numeric(date_time)) < delay_seconds) {}
  web_url <- gsub(i - 1, i, url)
}
Here, we are:
defining the number of seconds to wait ( delay_seconds <- 10 ),
defining a start time for comparison ( date_time <- Sys.time() ),
using a while loop that compares the present time to that start time and checks whether the elapsed time is less than our chosen delay interval ( (as.numeric(Sys.time()) - as.numeric(date_time)) < delay_seconds ), and
doing nothing until the wait time is over ( {} ).
Not knowing if you need/want to, but in case you're hoping to get your data out of the list and into a longer combined form, I recommend the dplyr function bind_rows().
dat2 <- bind_rows(dat)
Thanks to an answer by rbtj to this question: How to make execution pause, sleep, wait for X seconds in R?

Configuring scollector to get different frequencies for different collectors

I'm working with scollector and I want to set specific frequencies for different collectors.
For example:
get info from disk usage every 5 minutes
info from memory every minute
iostat every 30 seconds
and so on...
Here is a part of the conf.toml I made:
FullHost = true
Freq = 60
DisableSelf = true
[[iostat]]
Filter = "iostat"
Freq = 30
[[memory]]
Filter = "memory"
Freq = 60
But I get this error:
./scollector -conf="perso.toml" -p
2016/04/19 14:40:45 fatal: main.go:297: extra keys in perso.toml: [iostat iostat.Freq memory memory.Freq]
It seems that I cannot set multiple frequencies.
What should I do to get what I want?
Thank you all
According to the scollector documentation, Freq is a global setting, so it's not possible to set a different frequency for each collector. The exception is external collectors, which may be put in a folder named after the desired frequency (in seconds).
Freq is indeed a global setting, and each collector's interval is usually set to it, although some collectors override the interval with different values; e.g. elasticsearch-indices runs every 15 minutes because there's a lot of data to pull.
To change it, either:
(best) hack the scollector code to read and pass a freq parameter to every collector,
(second best) file a GitHub issue, or
(last resort) change the intervals in the code of the specific collectors and recompile scollector.
Well, we may have found something.
We created different folders representing several frequencies (0, 30, 60, 120...) and in each folder we put the external collectors we need:
'/etc/collectors/0',
'/etc/collectors/15',
'/etc/collectors/30',
'/etc/collectors/60',
'/etc/collectors/120',
'/etc/collectors/300',
'/etc/collectors/600'
In the conf.toml:
ColDir = "/etc/scollector/collectors"
If we want the internal collectors, we have to rewrite them :(
