I'm contemplating using Firebase for an upcoming project, but before doing so I want to make sure I can develop a data structure that will meet my purposes. I'm interested in tracking horse race results for approximately 25 racetracks across the US. My initial impression was that my use case aligned nicely with the Firebase Weather Data Set. The Weather data set is organized by city and then in various time series: currently, hourly, daily and minutely.
My initial thought was that I could follow a similar approach and use the 25 tracks as cities and then organize by years, months, days and races.
This structure lends itself nicely to accessing data from a particular track, but suppose that I also want to access data across all tracks. For example, accessing data across all tracks for races that occurred in 2014 and had more than 10 horses.
Questions:
Does my proposed data structure limit me to queries by track only, or would I still be able to query across tracks, years, months, days, etc., and incorporate any and all of the various metadata attributes: number of horses, distance of race, etc.?
Given my interest in freeform querying, is there another data structure that would be more advantageous?
Is Firebase similar to MongoDB in having issues with collections (lists) that grow, or can one continue to push to a list without pre-allocating or worrying about sharding?
I believe my confusion stems from the URL/path nature of the data storage.
EDIT:
Here is a sample of what I had in mind:
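Something along these lines, with placeholder track names and race fields (the real attributes would come from the actual result data):

```ts
// Sketch of the nesting described above; the track key and race fields are
// placeholders, not a finished schema.
const raceResults = {
  tracks: {
    "churchill-downs": {
      "2014": {
        "05": {
          "03": {
            races: {
              "1": {
                distance: 1.25,          // miles
                numberOfHorses: 12,
                results: { /* per-horse finishing data */ },
              },
            },
          },
        },
      },
    },
    // ...roughly 24 more tracks
  },
};
```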
Thanks for your input.
I would think that you would want to organize by horse first. I guess it depends on what you are deriving from the data.
One horse could be at different tracks.
Horses table
* Horsename
  - Date
  - Track
  - Racenumber
  - Gate
  - Jockey
  - Place
  - Odds
  - Mud?

Races table
- Track
- Racenumber
- Date
- Time
- NumberOfHorses
Link the tables and you could get at any one part of it.
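For example, with a flat Races node the "2014 with more than 10 horses" question from above becomes one server-side filter plus a client-side check, since the Realtime Database only lets you order/filter on a single child per query. A rough sketch with assumed field names:

```ts
// Sketch only: assumes a flat "races" node where each race stores
// numberOfHorses and a date string. Because the Realtime Database allows a
// single orderByChild filter per query, the year is checked client-side.
import { getDatabase, ref, query, orderByChild, startAt, get } from "firebase/database";

async function bigFieldsIn2014() {
  const db = getDatabase();
  const q = query(
    ref(db, "races"),
    orderByChild("numberOfHorses"),
    startAt(11)                          // more than 10 horses
  );
  const snapshot = await get(q);
  const matches: unknown[] = [];
  snapshot.forEach((child) => {
    const race = child.val();
    if (typeof race.date === "string" && race.date.startsWith("2014")) {
      matches.push(race);                // second condition applied in the client
    }
  });
  return matches;
}
```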
I have a problem regarding the organization of my data.
What I want to achieve
TL;DR: one data point updated in real time across many different groups; how should I organize it?
Each user sets a daily goal (goal) he wants to achieve.
While working, each user increases his time to get closer to his daily goal (daily_time_spent), say from 1 minute spent to 2 minutes spent.
Each user can also be in a group with other users.
If there is a group of users, you can see each other's progress (goal/daily_time_spent) in real time (real time being every 2-5 minutes, for cost reasons).
It will later also be possible to set a daily goal for a specific group. Your own daily goal would contribute to each of the groups.
Say you are part of three groups with the goals 10m/20m/30m and you have already done 10m: then you would have completed the first group, done 50% of the second group, and about a third of the third group. Your own progress (daily_time_spent) contributes to all groups, regardless of the individual goals (group_daily_goal).
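In other words, each group's completion is just the shared daily_time_spent divided by that group's goal, roughly:

```ts
// Completion per group from the single shared counter, capped at 100%.
function groupProgressPct(dailyTimeSpent: number, groupDailyGoal: number): number {
  return Math.min(100, (dailyTimeSpent / groupDailyGoal) * 100);
}

console.log([10, 20, 30].map((goal) => groupProgressPct(10, goal))); // [100, 50, 33.33...]
```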
My ideas
How would I organize that? One idea: whenever a user increases his/her time, the new value gets written into each group the user is part of. But this seems pretty inefficient, because I would potentially write the same data in many different places (coming from the background of a SQL developer, it might also be expensive?).
Another option: each user tracks his time, say under userTimes/{user}, and then there are the groups, groups/{groupname}, with links to userTimes. But then I don't know how to get real-time updates.
Thanks a lot for your help!
Both approaches can work fine, and there is no single best approach here: as Alex said, it all depends on the use cases of your app and your comfort level with the code that each of them requires.
Duplicating the data under each relevant user will complicate the code that performs the write operation, and store more data. But in return for that, reading the data will be really simple and scale very well to many users.
Reading the data from under all followed users will complicate the code that performs the read operation, and slow it down a bit (though not nearly as much as you may expect, as Firebase can pipeline the requests). But it does keep your data minimal and your write operations simple.
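As a rough sketch of that read-side approach (the userTimes path follows the question, everything else is an assumption), you would attach one listener per group member and combine the results in the client:

```ts
// Sketch of reading without duplication: one onValue listener per member,
// each pointed at that user's own userTimes node.
import { getDatabase, ref, onValue, Unsubscribe } from "firebase/database";

function watchGroupProgress(
  memberIds: string[],
  onChange: (uid: string, dailyTimeSpent: number) => void
): Unsubscribe[] {
  const db = getDatabase();
  return memberIds.map((uid) =>
    onValue(ref(db, `userTimes/${uid}/dailyTimeSpent`), (snapshot) => {
      onChange(uid, snapshot.val() ?? 0);
    })
  );
}
```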
If you choose to duplicate the data, that is an operation that you can usually do well in a (RTDB-triggered) Cloud Function, but it's also possible to do it through a multi-path write operation from the client.
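Such a client-side multi-path write could look roughly like this (the paths mirror the question; field names are assumptions):

```ts
// Hypothetical fan-out write: duplicate a user's daily time under every group
// they belong to with a single multi-path update().
import { getDatabase, ref, update } from "firebase/database";

async function fanOutDailyTime(uid: string, groupIds: string[], dailyTimeSpent: number) {
  const db = getDatabase();
  const updates: Record<string, number> = {
    [`userTimes/${uid}/dailyTimeSpent`]: dailyTimeSpent,          // canonical copy
  };
  for (const groupId of groupIds) {
    // duplicated copy that group members can listen to directly
    updates[`groups/${groupId}/members/${uid}/dailyTimeSpent`] = dailyTimeSpent;
  }
  await update(ref(db), updates);                                 // applied atomically
}
```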
I've been trying to figure out how to best model data for a complex feed in Cloud Firestore without returning unnecessary documents.
Here's the challenge --
Content is created for specific topics, for example: Architecture, Bridges, Dams, Roads, etc. The topic options can expand to include as many as needed at any time. This means it is a growing and evolving list.
When the content is created it is also tagged to specific industries. For example, I may want to create a post in Architecture and I want it to be seen within the Construction, Steel, and Concrete industries.
Here is where the tricky part comes in. If I am a person interested in the Steel and Construction industries, I would like to have a feed that includes posts from both of those industries with the specific topics of Bridges and Dams. Since it's a feed the results will need to be in time order. How would I possibly create this feed?
I've considered these options:
Query for each individual topic selected that includes tags for Steel and Construction, then aggregate and sort the results (see the sketch after these options). The problem I have with this one is that it can return too many posts, which means I'm reading documents unnecessarily. If I select 5 topics within a specific time range, that's 5 queries, which is OK. However, each can have any possible number of results, which is problematic. I could add a limit, but then I run the risk of posts being omitted from topics even though they fall within the time range.
Create a post "index" table in Cloud SQL and perform queries on it to get the post IDs, then retrieve the Firestore documents as needed. Then the question is, why not just use Cloud SQL (MySQL) for everything... Well, it's a scaling, cost, and maintenance issue. The whole point of Firestore is not having to worry so much about DBAs, load, and scale.
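For reference, the per-topic query from the first option would look something like this in the web SDK (collection and field names are placeholders):

```ts
// Sketch of option 1: one query per selected topic, filtered to the followed
// industries and ordered by time. Needs a composite index; array-contains-any
// accepts only a limited number of values.
import {
  getFirestore, collection, query, where, orderBy, limit, getDocs,
} from "firebase/firestore";

async function postsForTopic(topic: string, industries: string[], pageSize = 20) {
  const db = getFirestore();
  const q = query(
    collection(db, "posts"),
    where("topic", "==", topic),
    where("industries", "array-contains-any", industries),
    orderBy("createdAt", "desc"),
    limit(pageSize)                    // the limit that risks dropping posts
  );
  const snapshot = await getDocs(q);
  return snapshot.docs.map((doc) => ({ id: doc.id, ...doc.data() }));
}
```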
I've not been able to come up with any other ideas and am hoping someone has dealt with such a challenge and can shed some light on the matter. Perhaps Firestore is just completely the wrong solution and I'm trying to fit a square peg into a round hole, but it seems like a workable solution can be found.
The perfect structure is to have a separate node for posts and then give each post a reference to its parent categories, e.g. Steel and Construction. Have them also arranged with timestamps. If you think that the database will be too massive for Firebase's queries, you can connect your Firebase database to Elasticsearch and do the search from there.
I hope it helps.
I would like some guidance on setting up BigQuery data storage from Google Analytics.
We have 6 different websites: 4 of them belong to one project and 2 of them to another. We would like to analyse the data separately for each site, per project across that project's sites, and across all the sites together.
Hence, what is the best structure to set up in BigQuery?
Two projects, with 4 and 2 datasets, or 1 main project with 2 datasets containing 4 and 2 tables? Or is that even possible?
Or is it so easy to extract the data that it doesn't matter, and we can just put every site in its own project and extract the data as we want?
Please give me some guidance on this issue.
Kind regards
The short answer:
Or is it so easy to extract the data that it doesn't matter, and we can just put every site in its own project and extract the data as we want?
Yes!
The longer answer:
You can extract data from only one view per property (Set up a BigQuery Export), so start by identifying which one you'll link and ensure the settings are the same across all of the views you are going to import, assuming this is important to you.
Each profile/site will go into its own dataset and will be partitioned by day, making it easy to query them individually, or together, as required.
It is possible to query across projects, so if you store data across two, you'll still be able to join them.
In my opinion it would make things easier for analysts if the data were all in one project, as you'll be able to save queries in a single location and track the query costs centrally, but if you need to keep 2 projects, your data can still be connected.
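To illustrate that last point, a single standard SQL statement can UNION the daily export tables regardless of which dataset or project they sit in. The snippet below is only a sketch, with placeholder project/dataset names and an arbitrary date range:

```ts
// Sketch only: unions the GA daily export tables from two datasets, which
// could just as well live in two different projects.
import { BigQuery } from "@google-cloud/bigquery";

async function usersAcrossSites() {
  const bigquery = new BigQuery();
  const sql = `
    SELECT 'site_a' AS site, date, COUNT(DISTINCT fullVisitorId) AS users
    FROM \`my-project-a.site_a_view.ga_sessions_*\`
    WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
    GROUP BY date
    UNION ALL
    SELECT 'site_b' AS site, date, COUNT(DISTINCT fullVisitorId) AS users
    FROM \`my-project-b.site_b_view.ga_sessions_*\`
    WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
    GROUP BY date`;
  const [rows] = await bigquery.query({ query: sql });
  return rows;
}
```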
I'm building an application in ASP.NET (VB) with a MS SQL database. It is a search tool for cars that has a list of every car and all of their attributes (colors, # of doors, gas mileage, mfg. year, etc.). This tool outputs the results in a gridview, and users have the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.), and I cannot seem to find the best way to do this filtering without a large SQL WHERE statement that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time, or will it require scans; and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values such as gas mileage by binning the mileage into bands of, say, 5 mpg (10-14, 15-19, 20-24, 25-29, etc.).
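A tiny illustration of that banding idea (the band width and labels are arbitrary):

```ts
// Map raw gas mileage to a 5 mpg band label, so the warehouse table can be
// filtered on a small set of discrete values instead of a numeric range.
function mpgBand(mpg: number): string {
  const lower = Math.floor(mpg / 5) * 5;   // e.g. 27 -> 25
  return `${lower}-${lower + 4}`;          // e.g. "25-29"
}

console.log(mpgBand(27)); // "25-29"
console.log(mpgBand(31)); // "30-34"
```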
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store a database entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, as I have stopped trusting Google Analytics lately.
Category - self-descriptive.
Minisite - self-descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data in advance, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when the displaying website starts up, you calculate them offline, either via a nightly batch process or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
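A minimal sketch of that per-session counting, with an in-memory map standing in for the aggregate table:

```ts
// Instead of one row per hit, keep one counter per (zone, session) and add to
// it as hits come in; reporting then sums a handful of counters per session
// rather than scanning every raw hit row.
const hitCounts = new Map<string, number>();

function recordHit(zone: string, sessionId: string): void {
  const key = `${zone}:${sessionId}`;
  hitCounts.set(key, (hitCounts.get(key) ?? 0) + 1);
}

function totalHits(zone: string): number {
  let total = 0;
  for (const [key, count] of hitCounts) {
    if (key.startsWith(`${zone}:`)) total += count;
  }
  return total;
}
```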
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it grows with your data, whereas a partition only ever holds, say, one day's worth).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
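Conceptually, the engine is just routing each row to a sub-table based on its date; a toy illustration of that mapping (not actual database syntax, and the week numbering is simplified):

```ts
// Map a timestamp to the weekly partition that would hold it, so a query
// constrained to one week only ever touches that partition's rows.
function weekPartitionName(date: Date): string {
  const year = date.getUTCFullYear();
  const startOfYear = Date.UTC(year, 0, 1);
  const week = Math.floor((date.getTime() - startOfYear) / (7 * 24 * 3600 * 1000)) + 1;
  return `stats_${year}_w${String(week).padStart(2, "0")}`;
}

console.log(weekPartitionName(new Date("2024-02-14"))); // "stats_2024_w07"
```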