Google freebase score - freebase

I am wondering whether the score returned by the Google Freebase API can be compared across different query entities. For example, can we set a threshold so that results scoring above it are treated as highly relevant?
Thanks!

Related

BigQuery New Users count strongly differ from displayed Firebase Analytics data

While inspecting our data, I found that after the 14th of January the new users count differs strongly between Google BigQuery and Google Firebase Analytics.
The discrepancy is higher than the typical 0.5-2% rate that can be attributed to the HyperLogLog algorithm used to make computation faster.
I wasn't able to find a precise description of how new users are computed in Firebase Analytics, so I could not write a query that reproduces the dashboard numbers exactly. Since the discrepancy is now above 30%, the problem has become much more significant.
Do you have the same problem? How can I better explain this strange behavior (for example, by running other queries to find more details about the issue)?
This is the query used to compare results:
SELECT APPROX_COUNT_DISTINCT(user_pseudo_id), event_date
FROM `practical-bot-198011.analytics_184597160.events_*`
WHERE event_name = 'first_open' AND _TABLE_SUFFIX BETWEEN '20200110' AND '20200127'
GROUP BY event_date
ORDER BY event_date ASC
and this is the result I get:
but in the Google Firebase Analytics Dashboard:
One reason the count in the Analytics dashboard does not match the BigQuery results is that Analytics updates the data for the most recent three days every 4-5 hours, whereas the BigQuery export runs only once per day. Queries that include the most recent three days will therefore show different results in Analytics and BigQuery.
APPROX_COUNT_DISTINCT, as used in your query, is an approximation. To get an exact count of unique IDs in standard SQL, use COUNT(DISTINCT user_pseudo_id); EXACT_COUNT_DISTINCT() is the legacy SQL equivalent, where COUNT(DISTINCT) is itself an approximation. Refer to this Stackoverflow thread.
Additionally, take a look at the official documentation.
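As a rough illustration of both points above, here is a minimal Python sketch that builds a comparison query using an exact count and stops before the most recent days. The function name `build_new_users_query` and the `settle_days` parameter are hypothetical; excluding the last three days is an assumption based on the update window described above, not a documented rule.

```python
from datetime import date, timedelta


def build_new_users_query(table: str, start: date, end: date,
                          today: date, settle_days: int = 3) -> str:
    """Build a standard SQL query counting distinct first_open users per day.

    Assumption: days within `settle_days` of `today` may still change in
    the dashboard, so they are left out of the comparison.
    """
    stable_end = min(end, today - timedelta(days=settle_days))
    if stable_end < start:
        raise ValueError("no fully settled days in the requested range")
    return (
        # COUNT(DISTINCT ...) is exact in BigQuery standard SQL.
        "SELECT event_date, COUNT(DISTINCT user_pseudo_id) AS new_users\n"
        f"FROM `{table}`\n"
        "WHERE event_name = 'first_open'\n"
        f"  AND _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{stable_end:%Y%m%d}'\n"
        "GROUP BY event_date\n"
        "ORDER BY event_date ASC"
    )


query = build_new_users_query(
    "practical-bot-198011.analytics_184597160.events_*",
    start=date(2020, 1, 10),
    end=date(2020, 1, 27),
    today=date(2020, 1, 28),
)
print(query)
```

If the exact count over settled days still differs from the dashboard by 30%, the gap cannot be explained by approximation error or export timing alone.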

Google Analytics MCF - how is sampling done?

According to the Google sampling documentation, sampling is done based on sessions (fact). In Google MCF queries, sessions are not available as a metric or dimension (fact?). Now I wonder: how is sampling done if there is no metric or dimension for sessions (and date)?
I suspect that sampling for Google MCF is based on a maximum of 10,000 rows per query: if there are more than 10,000 results, the data is sampled. Is this correct?
Use case: let's say I make a query and "TotalResults" is 67540. Does this mean that I can get all the unsampled data with 7 calls (6 x 10,000 rows and one call for the remaining 7,540 rows)? And how do I know which date ranges to query then?
I tried some different approaches and found out that the sampling is done based on sessions and dates, the same as for the other data from the API.

Is the Google Analytics API containsSampledData field reliable?

We are running the Google Analytics free version and I'm seeing some inconsistent results regarding data sampling. I have tried my requests in Google Analytics Query Explorer, the GA Sheets add-on, and within the GA interface.
Basically, I am comparing results from a complete date range against the sum of results for that date range broken into smaller chunks (to reduce/remove the chance of sampling occurring). Metrics are sessions, transactions, and revenue. I have a session-level dynamic segment applied: sessions::condition::!ga:landingPagePath=#/thanks
As you may expect, the results from the single request are different (counts are lower) from those obtained by summing the multiple smaller requests. For example, sessions are 45,311 vs. 51,596, and revenue is even further apart. This implies that sampling is being used for the larger request. The trouble is that the API response explicitly says that sampling is not used in any case, i.e. "Contains Sampled Data" equals "No", even for the full date range, within which our property should be exceeding the 500,000-session threshold for sampling to kick in.
I'm almost certain that the results from summing smaller date ranges are correct, as these are pretty close to what we see in our CMS analytics.
Can anyone explain the mechanics behind this? Is GA doing some sort of behind-the-scenes sampling to produce this inconsistency?
Thanks,
Daniel
Sounds like sampling. Check all your sources to see if they contain sampling, and make sure you have the sampling level set to "HIGHER_PRECISION".
1) Google Sheets Google Analytics Add-On: in cell B6 of the data for each query, check whether it says "Yes" for "Contains Sampled Data".
2) Google Analytics Query Explorer: in the header below your profile name, check whether it says "Contains Sampled Data: Yes".
You are on the right track in breaking your query down into smaller chunks with smaller date ranges to avoid sampling. Here is a post on how to avoid Google Analytics sampling using Python.
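The chunking idea can be sketched as follows. This is only the date-splitting step, under the assumption that each chunk is then requested separately and additive metrics such as sessions, transactions, and revenue are summed; `split_date_range` and `days_per_chunk` are hypothetical names, and no Google Analytics API call is shown.

```python
from datetime import date, timedelta


def split_date_range(start: date, end: date, days_per_chunk: int = 7):
    """Split [start, end] into consecutive, non-overlapping inclusive chunks."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        # The last chunk is shortened so it never runs past `end`.
        chunk_end = min(chunk_start + timedelta(days=days_per_chunk - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks


chunks = split_date_range(date(2020, 1, 1), date(2020, 1, 17), days_per_chunk=7)
for chunk_start, chunk_end in chunks:
    # Each pair would become the start and end date of one request.
    print(chunk_start.isoformat(), chunk_end.isoformat())
```

Smaller chunks lower the number of sessions per request, which makes sampling less likely, at the cost of more requests. It is still worth checking the sampling flag on every chunk's response.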

Freebase scoring with data dumps

If you use Freebase search to get matches for any entity by name, you will get results sorted by relevance score. Try for example Taj Mahal.
I'm trying to get similar results using Freebase data dumps, so in my database 'Taj Mahal' related topics would be sorted by relevance, i.e. building comes first, musician comes next and so on.
Are there any suggestions on how to achieve this without querying the Freebase search API?
The wiki page on relevance score that you linked to says:
Freebase entities have an inherent relevance score (ranking) computed during indexing that is function of its inbound and outbound link counts in Freebase and Wikipedia. Some popular Freebase entities also have a popularity score computed by Google. By default, both scores are combined together during queries.
Which should give you a pretty good idea where to start. Freebase in-degree and out-degree can be computed directly from the dump, but Wikipedia in/out-degrees would require using the Wikipedia dump (or Freebase's WEX dump). The "popularity score computed by Google" piece is obviously something that you're not going to be able to replicate.
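As a starting point, the Freebase part of that calculation could look like the sketch below. It assumes the dump has already been parsed into (subject, predicate, object) triples restricted to entity-to-entity links. The log-damped sum in `link_score` is an arbitrary stand-in, since the actual Freebase formula is not published, and the entity IDs are made up for the example.

```python
import math
from collections import Counter


def degree_counts(triples):
    """Count inbound and outbound links per entity from (s, p, o) triples."""
    in_degree = Counter()
    out_degree = Counter()
    for subject, _predicate, obj in triples:
        out_degree[subject] += 1
        in_degree[obj] += 1
    return in_degree, out_degree


def link_score(entity, in_degree, out_degree):
    """Assumed scoring: log-damped sum of in-degree and out-degree."""
    return math.log1p(in_degree[entity]) + math.log1p(out_degree[entity])


# Hypothetical triples; real dump identifiers look different.
triples = [
    ("m.agra", "contains", "m.taj_building"),
    ("m.shah_jahan", "built", "m.taj_building"),
    ("m.taj_building", "located_in", "m.agra"),
    ("m.taj_musician", "genre", "m.blues"),
]
in_degree, out_degree = degree_counts(triples)
candidates = ["m.taj_musician", "m.taj_building"]
ranked = sorted(candidates,
                key=lambda e: link_score(e, in_degree, out_degree),
                reverse=True)
print(ranked)
```

Wikipedia link counts could be folded into the same score once they are extracted from the Wikipedia or WEX dump.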

Getting search rank of freebase topics

Is there a way to get the search rank of Freebase topics? We want to order the Freebase topics in our database according to their search rank and popularity. I know the Metaweb search API returns topics in order of search rank, but that applies only to the results for a given query string. We want to apply that logic to the topics that exist in our database.
The Freebase Search API ranks topics based mostly on keyword matching in the title and the description of a topic with the given search query. To get this same feature in your own database you'll have to write your own search ranking code or use a library like Lucene.
You might also be interested in a related discussion that happened on the Freebase mailing list last month about how to rank topics based on overall measures of popularity.
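To give a feel for what "your own search ranking code" might involve, here is a deliberately simple keyword-matching sketch. It is not how the Freebase Search API scores results; the `title_weight` value and the sample topics are assumptions chosen for illustration, and a library like Lucene would handle this far more robustly.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_score(query, title, description, title_weight=3.0):
    """Score a topic by query-term hits, weighting title hits more heavily."""
    query_terms = set(tokenize(query))
    title_hits = sum(1 for term in tokenize(title) if term in query_terms)
    desc_hits = sum(1 for term in tokenize(description) if term in query_terms)
    return title_weight * title_hits + desc_hits


# Hypothetical topics standing in for rows in your own database.
topics = [
    {"title": "Agra", "description": "City that is home to the Taj Mahal"},
    {"title": "Taj Mahal", "description": "A mausoleum in Agra"},
]
ranked = sorted(
    topics,
    key=lambda t: keyword_score("taj mahal", t["title"], t["description"]),
    reverse=True,
)
print([t["title"] for t in ranked])
```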
