I have a fairly large set of URLs (> 8,500) that I want to query the Google Analytics API for using R. I'm working with the googleAnalyticsR package. The problem is that I am able to loop through my set of URLs, but the resulting data frame only contains the total values for the host ID in each row (i.e. the same values repeated on every row).
Here's how far I got to this point:
library(googleAnalyticsR)
library(lubridate)
#Authorize with google
ga_auth()
ga.acc.list = ga_account_list()
my.id = 123456
#set time range
soty = floor_date(Sys.Date(), "year")
yesterday = floor_date(Sys.Date(), "day") - days(1)
#get some - in this case - random URLs
urls = c("example.com/de/", "example.com/us/", "example.com/en/")
urls = gsub("^example.com/", "ga:pagePath=~", urls)
df = data.frame()
#get data
for (i in urls) {
  ga.data = google_analytics_4(my.id,
                               date_range = c(soty, yesterday),
                               metrics = c("pageviews","avgTimeOnPage","entrances","bounceRate","exitRate"),
                               filters = urls[i])
  df = rbind(df, ga.data)
}
The result is that every row of the data frame contains the total statistics for the whole my.id property (same values in every row, from my own data).
Does anyone know a better way to tackle this, or does Google Analytics simply prevent us from querying it in this way?
What you're getting is normal: you only queried for metrics (c("pageviews","avgTimeOnPage","entrances","bounceRate","exitRate")), so you only get the property-wide totals for those metrics.
If you want to break down those metrics, you need to use dimensions:
https://developers.google.com/analytics/devguides/reporting/core/dimsmets
In your case you're interested in the ga:pagePath dimension, so something like this (untested code):
ga.data = google_analytics_4(my.id,
                             date_range = c(soty, yesterday),
                             dimensions = c("pagePath"),
                             metrics = c("pageviews","avgTimeOnPage","entrances","bounceRate","exitRate"),
                             filters = urls[i])
I advise you to use the Google Analytics Query Explorer until you get the desired results, then port it to R.
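Putting the two together, a hedged sketch of the corrected loop could look like this. It reuses my.id, soty, yesterday and the urls vector from the question; note that the loop variable i already contains the filter string, so it is passed directly rather than as urls[i], which would index the character vector by name and return NA. Depending on your googleAnalyticsR version, the string filter may need to be supplied as filtersExpression instead of filters.
df = data.frame()
for (i in urls) {
  # i is already the "ga:pagePath=~..." string built with gsub() in the question
  ga.data = google_analytics_4(my.id,
                               date_range = c(soty, yesterday),
                               dimensions = c("pagePath"),
                               metrics = c("pageviews", "avgTimeOnPage", "entrances",
                                           "bounceRate", "exitRate"),
                               filters = i)
  df = rbind(df, ga.data)
}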
As for the number of results, you might be limited to 1,000 rows by default until you increase the max parameter. The API itself has a hard limit of 10,000 rows per request, which means you then have to use pagination to retrieve more results if needed. I see some examples in the R documentation with max=99999999; I don't know whether the R library automatically handles pagination beyond the first 10,000 rows or whether the authors are unaware of the hard limit:
batch_gadata <- google_analytics(id = ga_id,
                                 start = "2014-08-01", end = "2015-08-02",
                                 metrics = c("sessions", "bounceRate"),
                                 dimensions = c("source", "medium",
                                                "landingPagePath",
                                                "hour", "minute"),
                                 max = 99999999)
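For what it's worth, recent googleAnalyticsR documentation states that max = -1 fetches all rows, with the package batching the requests itself; a minimal sketch reusing my.id, soty and yesterday from the question:
all.data = google_analytics_4(my.id,
                              date_range = c(soty, yesterday),
                              dimensions = c("pagePath"),
                              metrics = c("pageviews"),
                              max = -1)  # -1 = fetch everything, per the package documentation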
Related
I am currently working on a project that deals with web scraping using R.
It is very basic but I am trying to understand how it works.
I am using a Google stock search results page as my URL, with Google's ticker as the stock I am viewing.
Here is my code:
library(rvest)
# Declaring our URL variable
google = html("https://www.google.com/searchq=google+stock%5D&oq=google+stock%5D&aqs=chrome..69i57j0l2j69i60l3.5208j0j4&sourceid=chrome&ie=UTF-8")
# Prints and initializes the data
google_stock = google %>%
  html_nodes("._FOc , .fac-l") %>%
  html_text()
# Creating a data frame table
googledf = data.frame(table(google_stock))
# Orders the data into highest frequency shown
googledf_order = googledf[order(-googledf$Freq),]
# Displays first few rows of data
head(googledf_order)
When I run this I get integer(0), where a stock price should be displayed.
I am not sure why this is not displaying the correct stock price.
I also tried running the code up until html_text() and it still did not show me the data that I wanted or needed.
I just need this to display the stock price from the web.
I am using SelectorGadget to get my html node ("._FOc , .fac-l")
I think there might be something wrong with your URL. When I try to paste it into a browser, I get a 404 error.
Instead of scraping you could use the quantmod package. To get historical data you could use the following:
library(quantmod)
start <- as.Date("2018-01-01")
end <- as.Date("2018-01-20")
getSymbols("GOOGL", src = "google", from = start, to = end)
To get the current stock quote you could use:
getQuote("GOOGL", src = "yahoo")
From the quantmod documentation, the getQuote function "only handles sourcing quotes from Yahoo Finance."
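For example, you could pull a single field out of the returned data frame (treat this as a sketch; the exact column names, such as Last, can vary between quantmod versions):
quote <- getQuote("GOOGL", src = "yahoo")
quote$Last  # current price, assuming the default Yahoo quote fields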
I am using googleAnalyticsR to download all the data I can from Google Analytics. My objective is to build a small data frame to analyze.
To download all the data I created a loop:
for (i in 1:length(metricsarray)) {
  print(paste(i))
  tryCatch(google_analytics_4(my_id,
                              date_range = c(start_date, end_date),
                              metrics = metricsarray[i],
                              dimensions = c('transactionId'),
                              max = -1)) %>%
    assign(gsub(" ", "", paste("metricsarray", i, sep = "")), ., inherits = TRUE)
}
The loop runs from 1 to 11 with no problems, i.e. it prints the value of i and gives me the message:
Downloaded [3537] rows from a total of [3537]
But I get this error when it reaches i = 12 in metricsarray[i]:
2017-10-04 10:37:56> Downloaded [0] rows from a total of [].
Error in if (nrow(out) < all_rows) { : argument is of length zero
I used tryCatch, but it had no effect. My objective was for the loop to keep testing each metricsarray[i] until the end, and also to continue when it finds the error:
JSON fetch error: Selected dimensions and metrics cannot be queried
together.
I am new to using the Google Analytics API in R, so feel free to suggest solutions, articles or anything you think will help me gain more knowledge about this.
Thank you,
JSON fetch error: Selected dimensions and metrics cannot be queried
together.
Not all Google Analytics dimensions and metrics can be queried together. The main reason is that either the data doesn't exist or the combination would make no sense.
The best way to test what metadata can be queried together is to check the dimensions and metrics reference. Invalid items will be grayed out.
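As for the loop aborting instead of skipping the failing combination: tryCatch only swallows an error if you give it an error handler. A minimal sketch of that idea, reusing my_id, start_date, end_date and metricsarray from the question, might look like this:
results <- list()
for (i in seq_along(metricsarray)) {
  results[[i]] <- tryCatch(
    google_analytics_4(my_id,
                       date_range = c(start_date, end_date),
                       metrics = metricsarray[i],
                       dimensions = c("transactionId"),
                       max = -1),
    error = function(e) {
      # Log the incompatible metric and keep looping instead of aborting
      message("Skipping ", metricsarray[i], ": ", conditionMessage(e))
      NULL
    }
  )
}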
For downloading data from Google Analytics I use RStudio (R version 3.2.3 (2015-12-10)) with the following code:
library(RGoogleAnalytics)
library(RGA)
gaData4_5 <- ga$getData(profileId,
                        start.date = as.Date("2017-01-04"),
                        end.date = as.Date("2017-01-05"),
                        metrics = "ga:impressions,ga:adCost,ga:adClicks,ga:users",
                        dimensions = "ga:date,ga:adwordsCampaignID,ga:adwordsAdGroupID",
                        filter = "ga:source==google,ga:medium==cpc",
                        sort = "ga:date", batch = TRUE)
When summing up impressions, adCost and adClicks, the numbers in R match exactly the numbers shown on the Analytics website itself. However, the number of users is too high and I do not understand why.
I also checked newUsers and userType (not in the above code), but neither of these alternatives provided the correct result.
Can anyone please explain how it can happen that three metrics are perfectly correct and only one is not? And what can be done to correct this?
Thanks a lot!
I am new to the Google Analytics API. I authenticated my application in R using this code:
library(RGoogleAnalytics)
client.id <- "**************.apps.googleusercontent.com"
client.secret <- "**********************"
token <- Auth(client.id, client.secret)
save(token,file="./token_file")
ValidateToken(token)
I am trying to figure out what needs to be entered for the parameters in the query below:
query.list <- Init(start.date = "2011-11-28",
                   end.date = "2014-12-04",
                   dimensions = "ga:date,ga:pagePath,ga:hour,ga:medium",
                   metrics = "ga:sessions,ga:pageviews",
                   max.results = 10000, sort = "-ga:date", table.id = "ga:33093633")
Where can I find the dimensions, metrics, sort and table.id values?
My eventual goal is to pull the text from "https://plus.google.com/105253676673287651806/posts"
Please assist me with this.
Using Google Analytics and R may not suit what you want to do here, as the Google+ website won't be included in the data you collect.
You may want to look at using rvest, which is a web scraping package for R. You could then get the information you need from any public URL into an R data frame to analyse later, as sketched below.
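A minimal sketch of that approach (the CSS selector below is purely illustrative and must be replaced with a real one, e.g. found via SelectorGadget; also note that pages which build their content with JavaScript may not expose the text to rvest at all):
library(rvest)
# Read a public page and extract the text of the nodes matching a CSS selector
page <- read_html("https://plus.google.com/105253676673287651806/posts")
posts <- page %>%
  html_nodes(".post-text") %>%  # hypothetical selector, replace with the real one
  html_text()
df <- data.frame(text = posts, stringsAsFactors = FALSE)
head(df)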
Query Explorer:
https://ga-dev-tools.appspot.com/query-explorer/?csw=1
Dimensions and metrics:
https://developers.google.com/analytics/devguides/reporting/core/dimsmets
I am using the twitteR package in R to extract tweets based on their ids.
But I am unable to do this for multiple tweet IDs without hitting either a rate limit or a 404 error.
This is because I am using showStatus(), which handles one tweet ID at a time.
I am looking for a function similar to getStatuses() that retrieves multiple tweet IDs per request.
Is there an efficient way to perform this action?
I believe only 60 requests can be made in a 15-minute window using OAuth.
So, how do I ensure that:
1. Multiple tweet IDs are retrieved per request, repeating these requests as needed.
2. The rate limit is respected.
3. Errors are handled for tweets that are not found.
P.S.: This activity is not user-based.
Thanks
I have come across the same issue recently. For retrieving tweets in bulk, Twitter recommends using the lookup-method provided by its API. That way you can get up to 100 tweets per request.
Unfortunately, this has not been implemented in the twitteR package yet; so I've tried to hack together a quick function (by re-using lots of code from the twitteR package) to use that API method:
lookupStatus <- function(ids, ...) {
  lapply(ids, twitteR:::check_id)
  batches <- split(ids, ceiling(seq_along(ids) / 100))
  results <- lapply(batches, function(batch) {
    params <- parseIDs(batch)
    statuses <- twitteR:::twInterfaceObj$doAPICall(paste("statuses", "lookup", sep = "/"),
                                                   params = params, ...)
    twitteR:::import_statuses(statuses)
  })
  return(unlist(results))
}
parseIDs <- function(ids) {
  id_list <- list()
  if (length(ids) > 0) {
    id_list$id <- paste(ids, collapse = ",")
  }
  return(id_list)
}
Make sure that your vector of ids is of class character (otherwise there can be some problems with very large IDs).
Use the function like this:
ids <- c("432656548536401920", "332526548546401821")
tweets <- lookupStatus(ids, retryOnRateLimit=100)
Setting a high retryOnRateLimit ensures you get all your tweets, even if your vector of IDs has more than 18,000 entries (100 IDs per request, 180 requests per 15-minute window).
As usual, you can turn the tweets into a data frame with twListToDF(tweets).