Google Analytics query for sessions - r

I am trying to analyze visits to purchase in Google Analytics through R.
Here is the code:
query.list <- Init(start.date = "2016-07-01",
                   end.date = "2016-08-01",
                   dimensions = c("ga:daysToTransaction", "ga:sessionsToTransaction"),
                   metrics = c("ga:transaction"),
                   sort = c("ga:date"),
                   table.id = "ga:104454195")
This code produces the following error:
Error in ParseDataFeedJSON(GA.Data) :
code : 400 Reason : Sort key ga:date is not a dimension or metric in this query.
Can you help me get the desired output below?
Days to Transaction   Transaction   %total
0                     44            50%
1                     11            20%
2-5                   22            30%

You are trying to sort your results based on a dimension that is not included in your result set. You have the ga:daysToTransaction and ga:sessionsToTransaction dimensions, and you have tried to apply a sort based on ga:date.
You'll need to use this for sorting:
sort = c("ga:daysToTransaction")
It is not clear to me whether you'll use ga:sessionsToTransaction in another part of your script, as it adds another breakdown compared to your desired output, which would need to be aggregated later to get your expected results.
Also, will you calculate %total in another part of the script, or do you expect it to be returned as part of the Analytics response? (I'm not sure whether that is even possible in the GA API.)
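For reference, here is a rough sketch of what the corrected query and a local %total step could look like, using RGoogleAnalytics' QueryBuilder() and GetReportData() helpers. Treat it as an illustration only: the metric name ga:transactions (plural) and the returned transactions column name are assumptions about the Core Reporting API, and token is an OAuth token you would have created earlier.
# Sketch only: corrected sort key; %total is computed locally because the API does not return it
query.list <- Init(start.date = "2016-07-01",
                   end.date = "2016-08-01",
                   dimensions = "ga:daysToTransaction",
                   metrics = "ga:transactions",
                   sort = "ga:daysToTransaction",
                   table.id = "ga:104454195")
ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, token)
ga.data$pct.total <- round(100 * ga.data$transactions / sum(ga.data$transactions), 1)
Bucketing rows such as "2-5" days would also be a local step, e.g. with cut() on the daysToTransaction column.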

Related

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers, and the conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
    auto_limit = TRUE,
    query = "select org_id,
                    COUNT(*)
             from table
             GROUP BY org_id
             ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument (a MySQL query) and various others which aren't important right now.
sample data looks like this:
org_id   inserted_at   lead_converted_at
1        10/17/2021    2021-01-27T03:39:03
2        10/18/2021    2021-01-28T03:39:03
1        10/17/2021    2021-01-28T03:39:03
3        10/19/2021    2021-01-29T03:39:03
2        10/18/2021    2021-01-29T03:39:03
I have looked through many online aggregation tutorials, but none of them seem to cover how to get data that is only visible pre-aggregation (such as the number of leads per month per org) from a dataset that has to be aggregated just to be pulled at all. Once the aggregation has happened that detail is gone: in the sample above, grouping by org_id removes the ability to see more than one row for org_id 1, for example. Maybe I just don't understand this well enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get the conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
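Putting those pieces together, a rough sketch of the combined query could look like the following, reusing the domo_get_query() call from the question. The DATE_FORMAT truncation and the lead_converted_at column name are assumptions based on the MySQL dialect and the sample data above, so adjust them to whatever DOMO actually accepts.
monthly_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
    auto_limit = TRUE,
    query = "select org_id,
                    DATE_FORMAT(inserted_at, '%Y-%m') as lead_month,
                    COUNT(*) as lead_count,
                    sum(case when lead_converted_at is not null then 1 else 0 end) as convert_count
             from table
             GROUP BY org_id, DATE_FORMAT(inserted_at, '%Y-%m')
             ORDER BY org_id, lead_month")
# conversion rate can then be derived per org and month in R
monthly_leads$conversion_rate <- monthly_leads$convert_count / monthly_leads$lead_count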

How to retrieve more than 90,000 users' data using lookup_users() with rtweet package?

I'm trying to get all user data for the followers of an account, but am running into an issue with the 90,000 user lookup limit. The documentation page says this can be done by iterating through the user IDs while avoiding the rate limit that has a 15-minute reset time, but it doesn't really give any guidance on how to do this. How would a complete user lookup for a list of more than 90,000 users be achieved?
I'm using the rtweet package. Below is an attempt with @lisamurkowski, who has 266,000 followers. I have tried using a retryonratelimit = TRUE argument to lookup_users(), but that doesn't do anything.
lisa <- lookup_users("lisamurkowski")
mc_flw <- get_followers("lisamurkowski", n = lisa$followers_count,
                        retryonratelimit = TRUE)
mc_flw_users <- lookup_users(mc_flw$user_id)
The expected output would be a tibble of all the user lookups, but instead I get
max number of users exceeded; looking up first 90,000
and the returned object contains only 90,000 observations before the process ends.
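One way to read the documentation's suggestion is to look the IDs up in chunks of at most 90,000 and wait out the 15-minute rate-limit window between chunks; the sketch below only illustrates that idea, with the chunk size and sleep time taken from the limits described above.
library(rtweet)

# Sketch: split the follower IDs into chunks of at most 90,000 and pause between chunks
chunks <- split(mc_flw$user_id, ceiling(seq_along(mc_flw$user_id) / 90000))
mc_flw_users <- list()
for (i in seq_along(chunks)) {
  mc_flw_users[[i]] <- lookup_users(chunks[[i]])
  if (i < length(chunks)) Sys.sleep(15 * 60)  # wait out the 15-minute rate-limit window
}
mc_flw_users <- do.call(rbind, mc_flw_users)  # combine the per-chunk results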

Google Analytics API returns fewer (a different number of) rows than GA

I'm using R to access GA data with the RGoogleAnalytics package.
I wrote the following query to get the Search Terms from Site Search from Oct 16 to 22.
query <- Init(start.date = "2017-10-16",
              end.date = "2017-10-22",
              dimensions = "ga:searchKeyword,ga:searchKeywordRefinement",
              metrics = "ga:searchUniques,ga:searchSessions,ga:searchExits,ga:searchRefinements",
              max.results = 99999,
              sort = "-ga:searchUniques",
              table.id = "ga:my_view_id")
ga.query2 <- QueryBuilder(query)
ga.data.refined <- GetReportData(ga.query2, token, paginate_query = T)
However, this returns 34,000 rows, which doesn't match the 45,000 rows that I see in GA. Note: I did add the extra dimension to the Search Terms report in GA as well.
Interestingly, if I remove the ga:searchKeywordRefinement dimension from the code and also in GA, the number of rows does match.
This is most likely caused by sampling in the data. I can't seem to locate the documentation on how to access this, but the documentation otherwise makes it clear it is possible:
RGoogleAnalytics GitHub with Readme
In cases where queries are sampled, the output also returns the percentage of sessions that were used for the query
So the answer is to access the output that returns the percentage of sessions used for the query; if it is less than 100%, you have found your problem.
To work around sampling, there are a few techniques. Review the section in the documentation that talks about splitting your query into single days and then unioning all the dates together.
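As an illustration of that day-splitting idea, here is a minimal sketch that reuses the Init()/QueryBuilder()/GetReportData() calls from the question and simply loops over single days; it is not the package's built-in anti-sampling mechanism, just one way to do the union by hand.
# Sketch: run the same query one day at a time to reduce sampling, then combine the results
dates <- seq(as.Date("2017-10-16"), as.Date("2017-10-22"), by = "day")
daily <- lapply(dates, function(d) {
  q <- Init(start.date = as.character(d),
            end.date = as.character(d),
            dimensions = "ga:searchKeyword,ga:searchKeywordRefinement",
            metrics = "ga:searchUniques,ga:searchSessions,ga:searchExits,ga:searchRefinements",
            max.results = 99999,
            sort = "-ga:searchUniques",
            table.id = "ga:my_view_id")
  GetReportData(QueryBuilder(q), token, paginate_query = T)
})
ga.data.daily <- do.call(rbind, daily)
# keyword totals still need to be re-aggregated across days (e.g. with aggregate()),
# since the same search term can appear on several days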

graphite summarize function not working as expected

I am feeding data into a metric, let's say it is "local.junk". What I send is just that metric, a 1 for the value, and the timestamp
local.junk 1 1394724217
where the timestamp changes, of course. I want to graph the total number of these data points over a period of time, so I used
summarize(local.junk, "1min")
Then I went and made some data entries. I expected to see the number of requests received in each minute, but it always just shows the line at 1. If I summarize over a longer period like 5 minutes, it shows me some seemingly random number: I tried 10 requests and I see the graph at something like 4 or 5. Am I loading the data wrong, or using the summarize function wrong?
The summarize() method just sums up your data values, so correlate and verify that you are indeed sending the correct values.
Also, to localize whether the function or the data is the issue, you can run it on metricsReceived:
summarize(carbon.agents.ip-10-0-0-1-a.metricsReceived,"1hour")
Which version of Graphite are you running?
You may want to check your carbon aggregator settings. By default, carbon aggregates data every 10 seconds. Without an entry in aggregation-rules.conf, Graphite only saves the last metric value it receives in each 10-second window.
You are seeing the problem above because of that behaviour. You need to add an entry for your metric in aggregation-rules.conf with the sum method, like this:
local.junk (10) = sum local.junk
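For context, this is roughly how that rule and the matching whisper retention could sit together; the storage-schemas.conf values are illustrative assumptions, not something taken from the question.
# aggregation-rules.conf: sum every value received for local.junk in each 10-second window
local.junk (10) = sum local.junk

# storage-schemas.conf (assumed): store the aggregated points at 10-second resolution
[junk]
pattern = ^local\.junk$
retentions = 10s:1d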

How to get Twitter's Impression and Reach with R twitteR package?

This question is about measuring Twitter impressions and reach using R.
I'm working on a Twitter analysis of "People voice about Lynas Malaysia through Twitter Analysis with R". To make it more complete, I wish to find out how to measure impressions, reach, frequency and so on from Twitter.
Definition:
Impressions: The aggregated number of followers that have been exposed to a brand/message.
Reach: The total number of unique users exposed to a message/brand.
Frequency: The number of times each unique user reached is exposed to a message.
My trial for #1:
From my understanding, impressions are the summed follower counts of all the tweeters that tweet a specific "keyword".
For #1, I wrote this:
rdmTweets <- searchTwitter(cloudstatorg, n = 1500)
tw.df <- twListToDF(rdmTweets)
n <- length(tw.df[, 2])
S <- 0
X <- 0
for (i in 1:n) {
  tuser <- getUser(tw.df$screenName[[i]])
  X <- tuser$followersCount
  S <- S + X
}
S
But the problem that occurs is
Error in .self$twFromJSON(out) :
Error: Rate limit exceeded. Clients may not make more than 150 requests per hour.
For #2 and #3 I still don't have any ideas; I hope to get help here. Thanks a lot.
The problem you are having with #1 has nothing to do with R or your code; it is about the number of calls you have made to the Twitter Search API, which exceeded the 150 calls per hour you get by default.
Depending on what you are trying to do, you can mix and match several components of the API to get the results you need.
You can read more in their docs: https://dev.twitter.com/docs/rate-limiting
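As an illustration of one way to cut down the call count for #1, the sketch below reuses tw.df from the question but swaps the per-tweet getUser() loop for a single batched twitteR::lookupUsers() call; the followersCount column name in the converted data frame is an assumption.
# Sketch: one batched lookup instead of one getUser() call per tweet
screen_names <- unique(tw.df$screenName)     # each author only needs to be looked up once
users <- lookupUsers(screen_names)           # batched, so far fewer API calls
users.df <- twListToDF(users)
impressions <- sum(users.df$followersCount)  # "impressions" as defined in the question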
