Looker Studio inconsistent in summing metrics? - google-analytics

I've come across what appear to be random inconsistencies in how Looker Studio is aggregating data from the same Analytics source. I can't tell if it's an issue on the software's end, or a result of my incorrectly summing and joining the data in question (or possibly misunderstanding how the aggregation is supposed to work).
I'd appreciate any information that could confirm whether it's an issue with my process (especially the choice of join used in the blend), or whether my expectations are correct and the problem is on Looker Studio's end.
Data
I have three data sources: a UA Analytics profile, a GA4 Analytics profile, and a blended source that is a full outer join of the two profiles on the Date dimension (UA left, GA4 right).
The two profiles were added as data sources using the default DS Google Analytics connector. There aren't any filters or source-level manipulation of data, though for the purposes of this report I'm only looking at some basic metrics for the month of April 2022, via the standard date range selectors in the report.
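For reference, my mental model of the blend is the equivalent of a full outer join on Date, along the lines of the R sketch below (the data frame and column names are mine, the example rows are lifted from the daily samples further down, and this is only meant to illustrate the join, not how Looker Studio implements blending):
# Two tiny stand-ins for the UA and GA4 daily exports
ua  <- data.frame(Date = as.Date(c("2022-04-01", "2022-04-02")),
                  Users = c(870, 849), Sessions = c(955, 923))
ga4 <- data.frame(Date = as.Date("2022-04-30"),
                  Total.users = 782, GA4.Sessions = 832)

# Full outer join on Date (UA left, GA4 right), as I understand the blend
blend <- merge(ua, ga4, by = "Date", all = TRUE)
blend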
The UA Analytics profile:
Sample of the UA metrics being used:
| Date | Users | Sessions | Bounce Rate | Goal Completions |
|---|---|---|---|---|
| Apr 1, 2022 | 870 | 955 | 0.756020942408377 | 85 |
| Apr 2, 2022 | 849 | 923 | 0.782231852654388 | 82 |
| Apr 3, 2022 | 1023 | 1105 | 0.774660633484163 | 100 |
| Apr 4, 2022 | 1007 | 1095 | 0.74337899543379 | 121 |
| Apr 5, 2022 | 1111 | 1210 | 0.765289256198347 | 130 |
| Apr 6, 2022 | 1010 | 1111 | 0.756975697569757 | 92 |
| Apr 7, 2022 | 1007 | 1141 | 0.743207712532866 | 100 |
| Apr 8, 2022 | 928 | 1009 | 0.77205153617443 | 80 |
| Apr 9, 2022 | 941 | 1054 | 0.80550284629981 | 72 |
| Apr 10, 2022 | 1002 | 1113 | 0.761006289308176 | 85 |
| ... | ... | ... | ... | ... |
| Apr 30, 2022 | 854 | 931 | 0.767991407089151 | 75 |
The GA4 profile:
Sample of the GA4 data being used (it was only implemented on the domain halfway through the month, so metrics are empty for the first few weeks):
| Date | Total users | Sessions | Engagement rate | Conversions |
|---|---|---|---|---|
| Apr 18, 2022 | 766 | 791 | 0.378002528445006 | 0 |
| Apr 19, 2022 | 890 | 930 | 0.394623655913978 | 0 |
| Apr 20, 2022 | 849 | 884 | 0.39027149321267 | 0 |
| Apr 21, 2022 | 844 | 891 | 0.354657687991021 | 0 |
| Apr 22, 2022 | 745 | 780 | 0.33974358974359 | 0 |
| Apr 23, 2022 | 833 | 871 | 0.330654420206659 | 0 |
| Apr 24, 2022 | 878 | 910 | 0.306593406593407 | 0 |
| Apr 25, 2022 | 904 | 949 | 0.355110642781876 | 0 |
| Apr 26, 2022 | 932 | 982 | 0.346232179226069 | 0 |
| Apr 27, 2022 | 910 | 963 | 0.349948078920042 | 0 |
| Apr 28, 2022 | 878 | 911 | 0.354555433589462 | 0 |
| Apr 29, 2022 | 809 | 850 | 0.342352941176471 | 0 |
| Apr 30, 2022 | 782 | 832 | 0.278846153846154 | 0 |
The data blend uses the auto date range for both sources, with no filters or segments applied.
Expected Output
Because they originate from the same data sources, I would expect the metrics from the data blend to equal those of their respective individual profiles. That is, I would expect the Total Users metric from the GA4 data source and the Total Users metric from the blended data source (which references the same GA4 data source) to show the same numbers, and the same for the UA data source.
Instead, I'm getting the following results on Chart Scorecards:
| Source | Total users | Sessions | Engagement Rate | Conversions | User engagement |
|---|---|---|---|---|---|
| GA4 Profile | 9874 | 11,711 | 34.20% | 0 | 57:30:07 |
| Blended GA4 | 11,020 | 11,544 | 34.78% | 0 | 57:30:07 |
| Source | Users | Sessions | Bounce Rate | Goal Completions | Avg. Session Duration |
|---|---|---|---|---|---|
| UA Profile | 25,528 | 31,370 | 76.34% | 2808 | 00:00:35 |
| Blended UA | 28,709 | 31,370 | 76.26% | 2808 | 00:17:44 |
I expect all of the GA4 metrics to match each other, but while Conversions and User Engagement match, the Total Users, Sessions, and Engagement Rate don't.
For the UA data, Sessions and Goal Completions match, but the Users, Bounce Rate, and Avg. Session Duration don't.
There doesn't appear to be a pattern as to which metrics have a discrepancy or by how much. They aren't multiples of each other, so I don't think it's a simple issue of rows being re-counted in the join.
Even weirder, if I make Chart Tables out of the same data, the tables' summary rows don't always show the same results as the scorecards, even though they're referencing the exact same data.
In addition, if I manually sum up all the metrics in each column by spreadsheet, I get results different from those in the tables' summary rows:
| Source | Users | UA Sessions | Bounce Rate | Goal Completions |
|---|---|---|---|---|
| UA summary row | 25,528 | 31,370 | 76.34% | 2808 |
| UA manual calculation | 28,709 | 31,370 | 76.26% | 2808 |
| Source | Total users | GA4 Sessions | Engagement Rate | Conversions |
|---|---|---|---|---|
| GA4 summary row | 9874 | 11,711 | 34.2% | 0 |
| GA4 manual calculation | 11,020 | 11,544 | 34.78% | 0 |
| Source | Users | Total users | UA Sessions | GA4 Sessions | Bounce Rate | Engagement Rate | Goal Completions | Conversions |
|---|---|---|---|---|---|---|---|---|
| Blended summary row | 28,709 | 9874 | 31,370 | 11,711 | 76.26% | 34.2% | 2808 | 0 |
| Blended manual calculation | 28,709 | 11,020 | 31,370 | 11,544 | 76.26% | 34.78% | 2808 | 0 |
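For reference, the manual-calculation figures above are just the arithmetic sketched below in R, shown on the 13 GA4 daily rows listed earlier (which cover all of that property's April data): straight sums for the count metrics and, judging by the 34.78% figure, a plain unweighted average of the daily rates. The vector names are mine.
ga4_users    <- c(766, 890, 849, 844, 745, 833, 878, 904, 932, 910, 878, 809, 782)
ga4_sessions <- c(791, 930, 884, 891, 780, 871, 910, 949, 982, 963, 911, 850, 832)
ga4_engage   <- c(0.378002528445006, 0.394623655913978, 0.390271493212670,
                  0.354657687991021, 0.339743589743590, 0.330654420206659,
                  0.306593406593407, 0.355110642781876, 0.346232179226069,
                  0.349948078920042, 0.354555433589462, 0.342352941176471,
                  0.278846153846154)

sum(ga4_users)     # 11020 -- the "manual calculation" Total users
sum(ga4_sessions)  # 11544 -- the "manual calculation" Sessions
mean(ga4_engage)   # ~0.3478, i.e. the 34.78% above
# Caveat: an unweighted mean of daily rates is not, in general, the same thing as
# total engaged sessions divided by total sessions, so this averaging choice may
# itself differ from whatever aggregation Looker Studio applies internally.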
These discrepancies happen even when sampling only two rows of data at a time, and never by the same amount.
For this specific report, the GA4 data for Total Users, Sessions, and Engagement rate have discrepancies of 1146, -167, and 0.58 respectively for the entirety of April (or 111%, 99%, and 102% as a percentage of the GA4 data source).
Populating the same report with Analytics' default demo data (see link at bottom), the GA4 data for Total Users, Sessions, and Engagement rate have discrepancies of 51732, 2432, and -2.1 for the entirety of April (or 135%, 102%, and 97% as a percentage of the GA4 data source).
Looking at just April 1 & April 2 with that same demo data, there are discrepancies of 160, 93, and -0.71 (or 102%, 101%, and 99% as a percentage of the GA4 data source).
Applying CAST to the metrics (e.g. CAST(Total Users as number)) has no effect and results in the same metrics and sums.
Chart configuration
All scorecards have the same default configuration, with the only difference being the data source that's referenced:
Same for the tables, except that the two Sessions fields are relabeled UA Sessions and GA4 Sessions for clarity:
Issue
Why are these discrepancies happening? As I understand it, these charts should be pulling the same data and summing it in the same way, so they should show the same results. Am I misunderstanding the output I should be receiving?
I've triple-checked all my charts to make sure they're using the same aggregation functions, date ranges, source metrics, and so on, so I don't know why so many of these summed metrics are inconsistent. I thought it might be a matter of date range processing, but I don't think that would explain the different sums in different chart types for the exact same data source, or why the UA sessions match each other while the GA4 sessions don't.
I've also tried different join methods for the blended data, but all return the same results except for cross joining (which multiplies all resulting values, as expected).
Am I missing something? My manual calculation results make me feel like the blended and spreadsheet metrics are more "trustworthy", but the blended data still showed incorrect summary row results in the report table, so I'm genuinely unsure if I'm getting correct results for any given data source.
Report
Made a copy populated with demo Analytics data, so the numbers aren't the same, but similar discrepancies are happening: https://datastudio.google.com/reporting/40bab31a-a0d0-4b79-8dcf-25c11279f229
Spreadsheet with manual summing of exports of the tables from the same report (note that they don't match said report): https://docs.google.com/spreadsheets/d/1CvM-4PqPNfBqNIlzJEe9QQVQ5tWNVkOdaZOVLfqnRRU/edit?usp=sharing
Even more simplified report that specifically highlights the discrepancy between default data source aggregation, and aggregation via spreadsheet: https://datastudio.google.com/reporting/a4f989ed-474e-4f04-955d-5ffb6339fc3a

Related

Sqlite remove duplicates within specific time range

I know there are many questions asked about removing duplicates in SQL. However in my case it is slightly more complicated.
The data contain a Barcode which repeats over the month, so entries with the same Barcode are expected. However, it turns out that, possibly due to a machine bug, the same data is sometimes recorded 2 to 3 times within a 4-5 minute timeframe. It does not happen for every entry, but it happens rather frequently.
Allow me to demonstrate with a sample table which contains the same Barcode "A00000"
| Barcode | No | Date | A | B | C | D |
|---|---|---|---|---|---|---|
| A00000 | 1499456 | 10/10/2019 3:28 | 607 | 94 | 1743 | 72D |
| A00000 | 1803564 | 10/20/2019 22:09 | 589 | 75 | 1677 | 14D |
| A00000 | 1803666 | 10/20/2019 22:13 | 589 | 75 | 1677 | 14D |
| A00000 | 1803751 | 10/20/2019 22:17 | 589 | 75 | 1677 | 14D |
| A00000 | 2084561 | 10/30/2019 12:22 | 583 | 86 | 1677 | 14D |
| A00000 | 2383742 | 11/9/2019 23:18 | 594 | 81 | 1650 | 07D |
As you can see, the entries on 10/20 contain identical data; they are duplicates that should be removed so that only one of the entries remains (any of them is fine, and the exact time is not the main concern). The "No" column is a purely arbitrary number which can safely be disregarded. The other entries should remain as they are.
I know this should be done using GROUP BY, but I am struggling with how to write the conditions. I have also tried INNER JOINing the table with itself and then removing the selected results:
T2.A = T2.B AND
T2.[Date] > T1.[Date] AND
strftime('%s',T2.[Date]) - strftime('%s',T1.[Date]) < 600
The results still seem a bit off, as some of the entries are selected twice and some are not selected at all. I am still not used to the SQL style of thinking. Any help is appreciated.
The format of the Date column complicates things a bit, but otherwise the solution basically is to use GROUP BY in the normal way. In the following, I've assumed the name of the table is test:
WITH sane AS
  (SELECT *,
          -- everything before the space, i.e. the date part such as '10/20/2019'
          substr(Date, 1, instr(Date, ' ') - 1) AS day
   FROM test)
SELECT Barcode, max(No), Date, A, B, C, D
FROM sane
GROUP BY Barcode, day;
The use of max() is perhaps unneeded but it gives some determinacy, which might be helpful.

How can I tabulate data that are listed on a map in a pdf?

So, the ATF publishes reports going all the way back to 2008 on trace statistics from each state. I need to pull the number of firearms traced from the source states listed on the pdf (see pdf).
This is for all the years going back to 2008, and I have no idea how to efficiently pull this data. I attempted this with R because that is the only programming language I have experience in (see below).
library(pdftools)
txt <- pdf_text("https://www.atf.gov/about/docs/report/colorado-firearms-trace-data-2014/download")
cat(txt[7])
The results...
Top 15 Source States for Firearms with a Colorado Recovery
January 1, 2014 – December 31, 2014
25
27
27
19 23
1,762
71 28
25
45 60
26
97 22
44
NOTE: An additional 32 states accounted for 261 other traces.
The source state was identified in 2,562 total traces.
Bureau of Alcohol, Tobacco, Firearms and Explosives, Office
of Strategic Intelligence and Information
Beyond this, I haven't been able to find anything online that can help me tabulate this data into something like this:
| recovered | year | from | weapons |
|---|---|---|---|
| colorado | 2014 | colorado | 1762 |
| colorado | 2014 | other | 261 |
| colorado | 2014 | washington | 25 |

(and so on...)
Realizing this may be due to the limitations of R, I just want to know if there is a good source where I can learn how to develop a function for this (possibly in R), especially before I attempt to type this out by hand or learn a new language from scratch (neither of which I'm sure I can make time for).
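Not a full solution, but as a minimal sketch of the parsing side (assuming the pdftools package and the same page as above): the two labelled totals come through as plain sentences and can be pulled with a regular expression, whereas the per-state counts on the map carry no state labels in the plain text, so they would need the word coordinates from pdf_data() or manual matching.
library(pdftools)
txt  <- pdf_text("https://www.atf.gov/about/docs/report/colorado-firearms-trace-data-2014/download")
page <- txt[7]

# Pull the two labelled totals out of the page text
other_traces <- as.numeric(gsub(",", "", sub(".*accounted for ([0-9,]+) other traces.*", "\\1", page)))
total_traces <- as.numeric(gsub(",", "", sub(".*identified in ([0-9,]+) total traces.*", "\\1", page)))
c(other = other_traces, total = total_traces)   # 261 and 2562 for the 2014 Colorado report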

summary() of transactions is wrong for itemMatrix object

I am trying to do some market basket analysis using the arules package, but when I use the summary() function on an itemMatrix object to check which are the most frequent items, the numbers do not add up.
If I do:
library(arules)
x <- read.transactions("Supermarket2014-15.csv")
summary(x)
I get:
transactions as itemMatrix in sparse format with
5001 rows (elements/itemsets/transactions) and
997 columns (items) and a density of 0.003557162
most frequent items:
45 28 42 35 22 (Other)
503 462 444 440 413 15474
But if I check with a for loop, or even in Excel, the count for the product 45 is 513 and not 503. The same for 28, which should be 499, and so on.
The odd thing is if I sum up all the totals (15474+413+440+444+462+503) I get the correct number for the total of transacted products.
The data has several NA values and products are factors.
And here is the raw data (Day ranges from 1 to 28, Product ranges from 1 to 50):
If you look at the result of your str(x) call, you will see under @itemInfo and $labels that some items have labels like "1;1", etc. This means that the items are not correctly separated when the file is read in. The default separator in read.transactions() is white space, but you seem to have (some) semicolons in there. Try sep=";" in read.transactions().
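That is, something along these lines (a sketch reusing the file name from the question):
library(arules)
# Tell read.transactions() that items within a line are separated by semicolons
x <- read.transactions("Supermarket2014-15.csv", sep = ";")
summary(x)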

cutting stock optimization/waste minimize in r using lpsolve/lpsolveapi

I am having a tough time understanding how to formulate code for a cutting stock problem. I have searched the web extensively and I see a lot of theory but no actual examples.
The majority of query results point to the wikipedia page: http://en.wikipedia.org/wiki/Cutting_stock_problem
There are 13 widths to be produced, with the required amounts listed alongside.
The machine produces a 5600-wide piece by default, which is to be cut into the widths below. The goal is to minimize waste.
| Width | Required amount |
|---|---|
| 1380 | 22 |
| 1520 | 25 |
| 1560 | 12 |
| 1710 | 14 |
| 1820 | 18 |
| 1880 | 18 |
| 1930 | 20 |
| 2000 | 10 |
| 2050 | 12 |
| 2100 | 14 |
| 2140 | 16 |
| 2150 | 18 |
| 2200 | 20 |
Would someone show me how to formulate this solution in R with lpsolve/lpsolve API?
stock=5600
widths=c(1380,1520,1560,1710,1820,1880,1930,2000,2050,2100,2140,2150,2200)
required=c(22,25,12,14,18,18,20,10,12,14,16,18,20)
library(lpSolveAPI)
...
solve(lprec)
get.variables(lprec)
You could model it as a Mixed Integer Problem and solve it using various techniques. Of course to generate variables (i.e. a valid pattern of widths) you need to use a suitable column generation method.
Have a look at this C++ project: https://code.google.com/p/cspsol
cspsol is based on GLPK API library, uses column generation and branch & bound to solve the MIP. It may give you some hints about how to do it in R.
Good luck!
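As a rough sketch of one way to set this up in R with lpSolveAPI: instead of column generation, the patterns are brute-force enumerated up front (workable here because the narrowest width is 1380, so at most four pieces fit on one 5600 roll), and the objective is the number of stock pieces used, the usual proxy for waste. All names below are made up for the sketch.
stock    <- 5600
widths   <- c(1380, 1520, 1560, 1710, 1820, 1880, 1930, 2000, 2050, 2100, 2140, 2150, 2200)
required <- c(22, 25, 12, 14, 18, 18, 20, 10, 12, 14, 16, 18, 20)
n <- length(widths)

# Enumerate every cutting pattern (piece counts per width) that fits on one roll
patterns <- list()
gen <- function(i, counts, remaining) {
  if (i > n) {
    if (sum(counts) > 0) patterns[[length(patterns) + 1]] <<- counts
    return(invisible(NULL))
  }
  for (k in 0:(remaining %/% widths[i])) {
    counts[i] <- k
    gen(i + 1, counts, remaining - k * widths[i])
  }
}
gen(1, integer(n), stock)
A <- do.call(cbind, patterns)                  # one row per width, one column per pattern

library(lpSolveAPI)
npat  <- ncol(A)
lprec <- make.lp(nrow = n, ncol = npat)
set.objfn(lprec, rep(1, npat))                 # minimise the number of rolls cut
for (j in seq_len(npat)) set.column(lprec, j, A[, j])
set.constr.type(lprec, rep(">=", n))           # meet or exceed each width's demand
set.rhs(lprec, required)
set.type(lprec, seq_len(npat), "integer")
solve(lprec)                                   # 0 indicates an optimal solution was found
get.objective(lprec)                           # total number of rolls needed
cuts <- get.variables(lprec)                   # how many times each pattern is cut
cbind(t(A[, cuts > 0, drop = FALSE]), rolls = cuts[cuts > 0])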

readPDF (tm package) in R

I tried to read an online PDF document in R using the readPDF function. My script goes like this:
safex <- readPDF(PdftotextOptions='-layout')(elem=list(uri='C:/Users/FCG/Desktop/NoteF7000.pdf'),language='en',id='id1')
R showed a message that the running command had status 309. I tried different pdftotext options; however, I get the same message, and the text file created has no content.
Can anyone read this PDF?
readPDF has bugs and probably isn't worth bothering with (check out this well-documented struggle with it).
Assuming that:
- you've got xpdf installed (see here for details), and
- your PATHs are all in order (see here for details of how to do that) and you've restarted your computer,
then you might be better off avoiding readPDF and instead using this workaround:
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
'"C:/Users/FCG/Desktop/NoteF7000.pdf"'), wait=FALSE)
And then read the text file into R like so...
require(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7001.txt"))
And have a look to confirm that it went well:
inspect(mycorpus)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Market Notice
Number: Date F7001 08 May 2013
New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.
Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EWJ US EQUITY US4642868487 1 (R1 per point)
Contract Size / Nominal
Expiry Dates & Times
10am New York Time; 14 Jun 2013 / 16 Sep 2013
Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price
USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)
4pm underlying spot level as captured by the JSE.
Currency Reference Price
The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.
JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za
Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons
Member of the World Federation of Exchanges
Company Secretary: GC Clarke
Settlement Method
Cash Settled
-
Clearing House Fees -
On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01
Initial Margin Class Spread Margin V.S.R. Expiry Date
R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013
The above instrument has been designated as "Foreign" by the South African Reserve Bank
Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or idx#jse.co.za
Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: grahams#jse.co.za
Distributed by the Company Secretariat +27 11 520 7346
Page 2 of 2
