I'm using skardhamar's rga ga$getData to query GA and get all data in an unsampled manner. The data is based on more than 500k sessions per day.
At https://github.com/skardhamar/rga, the paragraph 'extracting more observations than 10,000' says this is possible by using batch = TRUE. The paragraph 'Get the data unsampled' says that by walking over the days, you can get unsampled data. I'm trying to combine these two, but I cannot get it to work. E.g.
ga$getData(xxx,
start.date = "2015-03-30",
end.date = "2015-03-31",
metrics = "ga:totalEvents",
dimensions = "ga:date,ga:customVarValue4,ga:eventCategory,ga:eventAction,ga:eventLabel",
sort = "",
filters = "",
segment = "",
batch = TRUE, walk = TRUE
)
.. indeed gets unsampled data, but not all data. I get a dataframe with only 20k rows (10k per day). The result is capped at 10k rows per day, contrary to what I expected from the batch = TRUE setting. So for the 30th and 31st of March, I get a dataframe of 20k rows after seeing this output:
Run (1/2): for date 2015-03-30
Pulling 10000 observations in batches of 10000
Run (1/1): observations [1;10000]. Batch size: 10000
Received: 10000 observations
Received: 10000 observations
Run (2/2): for date 2015-03-31
Pulling 10000 observations in batches of 10000
Run (1/1): observations [1;10000]. Batch size: 10000
Received: 10000 observations
Received: 10000 observations
When I leave out the walk = TRUE setting, I do get all observations (771k rows, around 385k per day), but only in a sampled manner:
ga$getData(xxx,
start.date = "2015-03-30",
end.date = "2015-03-31",
metrics = "ga:totalEvents",
dimensions = "ga:date,ga:customVarValue4,ga:eventCategory,ga:eventAction,ga:eventLabel",
sort = "",
filters = "",
segment = "",
batch = TRUE
)
Notice: Data set contains sampled data
Pulling 771501 observations in batches of 10000
Run (1/78): observations [1;10000]. Batch size: 10000
Notice: Data set contains sampled data
...
Is my data just too big to get all observations unsampled?
You could try querying by device with filters = "ga:deviceCategory==desktop" (and filters = "ga:deviceCategory!=desktop" respectively) and then merging the resulting dataframes.
I'm assuming that your users use different devices to access your site. The underlying logic is that when you filter data, the Google Analytics servers filter it before you receive it, so you can "divide" your query and get unsampled data. I think it is the same methodology as the "walk" function.
Desktop only
ga$getData(xxx,
start.date = "2015-03-30",
end.date = "2015-03-31",
metrics = "ga:totalEvents",
dimensions = "ga:date,ga:customVarValue4,ga:eventCategory,ga:eventAction,ga:eventLabel",
sort = "",
filters = "ga:deviceCategory==desktop",
segment = "",
batch = TRUE, walk = TRUE
)
Mobile and Tablet
ga$getData(xxx,
start.date = "2015-03-30",
end.date = "2015-03-31",
metrics = "ga:totalEvents",
dimensions = "ga:date,ga:customVarValue4,ga:eventCategory,ga:eventAction,ga:eventLabel",
sort = "",
filters = "ga:deviceCategory!=desktop",
segment = "",
batch = TRUE, walk = TRUE
)
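The merging step could look roughly like the sketch below. Because ga:deviceCategory is not among the requested dimensions, the same combination of dimensions can appear in both results, so the sketch sums the metric after stacking. The names fetch_by_filters and fake_fetch are hypothetical and not part of rga; fake_fetch only stands in for the real ga$getData call so the example runs offline.

```r
# Hypothetical helper: run one query per filter, stack the results, and
# sum the metric over rows that share the same dimension values.
# `fetch` is any function taking a filter string and returning a data frame,
# e.g. function(f) ga$getData(xxx, ..., filters = f, batch = TRUE, walk = TRUE)
fetch_by_filters <- function(fetch, filters, metric = "totalEvents") {
  parts <- lapply(filters, fetch)
  stacked <- do.call(rbind, parts)
  dims <- setdiff(names(stacked), metric)
  aggregate(stacked[metric], by = stacked[dims], FUN = sum)
}

# Stand-in for the API call (made-up numbers, for illustration only).
fake_fetch <- function(f) {
  events <- if (f == "ga:deviceCategory==desktop") c(10, 20) else c(1, 2)
  data.frame(date = c("20150330", "20150331"),
             eventCategory = c("video", "video"),
             totalEvents = events,
             stringsAsFactors = FALSE)
}

combined <- fetch_by_filters(
  fake_fetch,
  c("ga:deviceCategory==desktop", "ga:deviceCategory!=desktop")
)
```

Whether each filtered query actually stays under the sampling threshold depends on your traffic, so check the output for the sampling notice before trusting the totals.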
Related
I am currently using the Radlibrary package. I used the following code:
query <- adlib_build_query(ad_active_status = "ALL",
ad_delivery_date_max = "2022-11-08",
ad_delivery_date_min = "2022-06-24",
ad_reached_countries = "US",
ad_type = "POLITICAL_AND_ISSUE_ADS",
search_terms = "democrat",
publisher_platforms = "FACEBOOK",
fields = c('id',
'ad_creation_time',
'ad_creative_bodies',
'ad_creative_link_captions',
'ad_creative_link_descriptions',
'ad_creative_link_titles',
'ad_delivery_start_time',
'ad_delivery_stop_time',
'ad_snapshot_url',
'bylines',
'currency',
'delivery_by_region',
'estimated_audience_size',
'languages',
'page_id',
'page_name',
'spend',
'publisher_platforms',
'demographic_distribution',
'impressions'))
response <- adlib_get(query)
data <- as_tibble(response)
I've noticed that I only get 1000 observations at a time from that time frame. Is there an efficient way to collect all the observations within the time frame? I've thought about changing the "stop time" based on the last date in the dataset, but that might take a long time if there are a lot of ads in the span of a few days. Any suggestions?
Hi, I'm using googleAnalyticsR to import my data from Google Analytics, but I'm having a problem because it only downloads 1,000 rows from a total of 1,000,000.
Any advice how to download all 1,000,000?
Here's my code!
df1 <- google_analytics_4(my_id,
date_range = c("2016-05-13", "2017-05-13"),
metrics = c("pageviews"),
dimensions = c("pagePath"))
By default it gets 1000 rows; if you set max = -1 in your call, it gets everything:
df1 <- google_analytics_4(my_id,
date_range = c("2016-05-13", "2017-05-13"),
metrics = "pageviews",
dimensions = "pagePath",
max = -1)
In R I have a dataset containing abbreviated numbers, I really want the full metric so I can sum the values... is there a library or something that would aid in this effort?
i.e.
start result
5k = 5,000
.5k = 500
.25k = 250
5m = 5,000,000
.5m = 500,000
and so on...
Data
dd <- data.frame(start = c('5k', '.5k', '.25k', '5m', '.5m'),
result = c('5,000', '500', '250', '5,000,000', '500,000'),
stringsAsFactors = FALSE)
There is no need for any library. Just declare k = 1000 (and likewise m = 1000000), insert a * operator between the number and the suffix, and evaluate the result.
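One way to apply that idea in base R, without evaluating strings as code, is to split each value into its numeric part and its suffix and multiply by a lookup table. This is a minimal sketch: the function name expand_abbrev is made up for illustration, and the "b" (billion) multiplier is an assumption that does not appear in the question.

```r
# Convert strings such as "5k" or ".5m" to plain numbers.
expand_abbrev <- function(x) {
  multipliers <- c(k = 1e3, m = 1e6, b = 1e9)  # "b" is an assumed extra suffix
  x <- tolower(trimws(x))
  suffix <- sub("^[0-9.,]+", "", x)            # trailing letters, "" if none
  number <- as.numeric(gsub(",", "", sub("[a-z]+$", "", x)))
  # No suffix means multiply by 1; an unknown suffix yields NA.
  mult <- ifelse(suffix == "", 1, multipliers[suffix])
  unname(number * mult)
}

dd <- data.frame(start = c('5k', '.5k', '.25k', '5m', '.5m'),
                 stringsAsFactors = FALSE)
dd$value <- expand_abbrev(dd$start)
total <- sum(dd$value)
```

The result is numeric, so the values can be summed directly; use format(dd$value, big.mark = ",") if you want the comma-separated display from the question.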
I am attempting to extract unsampled data for the past nine months. The website is pretty active, and as such, I'm unable to get the data in its entirety (over 3 million rows) unsampled. I'm currently attempting to break out the filtering so that each request returns under 10k rows (which is the API response limit). Is there a way I can loop over a number of days? I tried using the batch function with no success. I have included my code for reference. I was thinking of writing a loop and doing it in 10-day intervals. I appreciate any input.
Thanks!
library(RGA)
gaData <- get_ga(id, start.date = start_date,
end.date= "today" , metrics = "ga:sessions",
dimensions = "ga:date, ga:medium, ga:country, ga:hour, ga:minute",
filters = "ga:country==United States;ga:medium==organic",
max.results = NULL,
batch = TRUE,
sort = "ga:date")
The get_ga function doesn't have a batch parameter (see ?get_ga). Try the fetch.by option instead. You can test different variants: "month", "week", "day".
library(RGA)
authorize()
gaData <- get_ga(id, start.date = start_date,
end.date= "today" , metrics = "ga:sessions",
dimensions = "ga:date, ga:medium, ga:country, ga:hour, ga:minute",
filters = "ga:country==United States;ga:medium==organic",
sort = "ga:date", fetch.by = "week")
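If you would rather write the 10-day loop from the question yourself, the date ranges can be built in base R as in the sketch below. The helper name date_chunks is hypothetical, and the commented-out loop only indicates where a get_ga call would go; it is not tested against the API.

```r
# Split [start, end] into consecutive windows of `by_days` days.
date_chunks <- function(start, end, by_days = 10) {
  start <- as.Date(start)
  end <- as.Date(end)
  starts <- seq(start, end, by = by_days)   # a numeric `by` is read as days
  ends <- pmin(starts + by_days - 1, end)   # clip the last window at `end`
  data.frame(start = starts, end = ends)
}

chunks <- date_chunks("2015-01-01", "2015-01-25")

# Assumed usage: one request per window, then stack the pieces.
# parts <- lapply(seq_len(nrow(chunks)), function(i) {
#   get_ga(id, start.date = chunks$start[i], end.date = chunks$end[i],
#          metrics = "ga:sessions", dimensions = "ga:date,ga:hour")
# })
# gaData <- do.call(rbind, parts)
```

For this range, chunks has three rows: two full 10-day windows and a final 5-day window ending on 2015-01-25.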
I am currently trying to configure the rnoaa library to connect city, state data with a weather station, and therefore output ANNUAL weather data, namely temperature. I have included a hardcoded input for reference, but I intend on feeding in hundreds of geocoded cities eventually. This isn't the issue so much as it is retrieving data.
require(rnoaa)
require(ggmap)
city<-geocode("birmingham, alabama", output = "all")
bounds<-city$results[[1]]$geometry$bounds
se<-bounds$southwest$lat
sw<-bounds$southwest$lng
ne<-bounds$northeast$lat
nw<-bounds$northeast$lng
stations<-ncdc_stations(extent = c(se, sw, ne, nw),token = noaakey)
I am calculating an MBR (minimum bounding rectangle) around the geographic area, in this case Birmingham, and then getting a list of stations. I then pull out the station ID and try to retrieve results, but no combination of parameters has worked. I'm looking to associate annual temperatures with each city.
test <- ncdc(datasetid = "ANNUAL", locationid = topStation[1],
datatypeid = "DSNW",startdate = "2000-01-01", enddate = "2010-01-01",
limit = 1000, token = noaakey)
Warning message:
Sorry, no data found
It looks like the location ID is causing the issue. Try without it (it is an optional field):
ncdc_locs(datasetid = "ANNUAL",datatypeid = "DSNW",startdate = "2000-01-01", enddate = "2010-01-01", limit = 1000,token = <your token key>)
and then with valid location ID
ncdc_locs(datasetid = "ANNUAL",datatypeid = "DSNW",startdate = "2000-01-01", enddate = "2010-01-01", limit = 1000,locationid='CITY:US000001',token = <your token>)
returns
$meta
NULL
$data
mindate maxdate name datacoverage id
1 1872-01-01 2016-04-16 Washington D.C., US 1 CITY:US000001
attr(,"class")
[1] "ncdc_locs"