BigQuery setup from Google analytics - google-analytics

I would like some guidance on setting up BigQuery data storage from Google Analytics.
We have 6 websites: 4 of them belong to one project and 2 to another. We would like to analyse the data for each site separately, for each project (combining the data of its sites), and for all the sites together.
So which structure is best to set up in BigQuery?
Two projects with 4 and 2 datasets, or 1 main project with 2 datasets holding 4 and 2 tables? Is that even possible?
Or is it so easy to extract the data that it doesn't matter, and we can just put every site in its own project and extract the data as we need it?
Please give me some guidance on this issue.
Kind regards

The short answer:
"Or is it so easy to extract the data that it doesn't matter, and we can just put every site in its own project and extract the data as we need it?"
Yes!
The longer answer:
You can export data from only one view per property (see "Set up a BigQuery Export"), so start by identifying which view you'll link, and make sure the settings are the same across all of the views you are going to export, assuming this is important to you.
Each profile/site will go into its own dataset and will be partitioned by day, making it easy to query them individually, or together, as required.
It is possible to query across projects, so if you store data across two, you'll still be able to join them.
In my opinion it would make things easier for analysts if the data was all in one project, as you'll be able to save queries in a single location and track the query costs centrally, but if you need to keep 2 projects your data can still be connected.
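As a rough illustration of querying several sites together, here is a minimal sketch that builds one Standard SQL query over two exported datasets living in different projects. The project IDs and dataset IDs are hypothetical, and the sketch assumes the export writes daily tables named ga_sessions_YYYYMMDD with a totals.visits field; check this against your own export before relying on it.

```python
def build_sessions_query(sources, start, end):
    """Build a query that unions daily session counts from several datasets.

    sources: list of (project_id, dataset_id) pairs (hypothetical names).
    start, end: inclusive table suffixes in YYYYMMDD form.
    """
    parts = []
    for project_id, dataset_id in sources:
        parts.append(
            f"SELECT '{dataset_id}' AS site, date, SUM(totals.visits) AS sessions\n"
            f"FROM `{project_id}.{dataset_id}.ga_sessions_*`\n"
            f"WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'\n"
            f"GROUP BY site, date"
        )
    # One SELECT per site, stitched together so all sites come back in one result.
    return "\nUNION ALL\n".join(parts)


# Hypothetical projects and dataset IDs, one pair per linked view.
sources = [("project-a", "111111"), ("project-b", "222222")]
query = build_sessions_query(sources, "20240101", "20240131")
print(query)
```

Passing a single pair gives the per-site query, and passing only one project's pairs gives the per-project view, so the same helper covers all three levels of analysis.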

Related

Apache Ignite Data Segregation

I have an application that creates persistent caches on a fixed region (MYAPP_REGION) with fixed cached names (MyApp.Data.Class1, MyApp.Data.Class2, ...etc.)
I am deploying 2 instances of this application for 2 different customers, but they use the same ignite clusters.
What is the correct way to separate the data between the instances: do I change the cache names to be per customer, or is a region per customer enough?
In an RDBMS scenario, we would create 2 different databases, so I am wondering how we would achieve the same thing when using Ignite as the storage solution.
Well, as you have mentioned, there are a variety of options. If it's only a logical division and you are OK with resource sharing, just like with a regular RDBMS, then use multiple caches/tables or different SQL schemas. Keep in mind the desired data distribution and the number of caches/tables per customer. For example, if you have 3 nodes and 3 customers with about the same amount of data, most likely you'd like to use a custom affinity function to collocate each customer's data on a single node, but that's a somewhat different question.
If you want a more physical division, for example if one of the customers needs more resources or special features like native persistence, then it's better to follow the separate-regions approach, which might end up requiring separate clusters though.

Structuring Data In Firebase

I'm contemplating using Firebase for an upcoming project, but before doing so I want to make sure I can develop a data structure that will meet my purposes. I'm interested in tracking horse race results for approximately 25 racetracks across the US. My initial impression was that my use case aligned nicely with the Firebase Weather Data Set. The Weather data set is organized by city and then in various time series: currently, hourly, daily and minutely.
My initial thought was that I could follow a similar approach and use the 25 tracks as cities and then organize by years, months, days and races.
This structure lends itself nicely to accessing data from a particular track, but suppose that I also want to access data across all tracks. For example, access data for all tracks for races that occurred in 2014 and had more than 10 horses.
Questions:
Does my proposed data structure limit me to queries by track only, or would I still be able to query across tracks, years, days, months, etc. and incorporate any and all of the various metadata attributes: number of horses, distance of race, etc.?
Given my interest in freeform querying, is there another data structure that would be more advantageous?
Is Firebase similar to MongoDB in having issues with collections (lists) that grow, or can one continue to push to a list without pre-allocating or worrying about sharding?
I believe my confusion stems from the URL/path nature of the data storage.
EDIT:
Here is a sample of what I had in mind:
Thanks for your input.
I would think that you would want to organize by horse first. I guess it depends what you are deriving from the data.
One horse could be at different tracks.
Horses table
* Horsename
-----Date
-----Track
-----Racenumber
-----Gate
-----Jockey
-----Place
-----Odds
-----Mud?
Races table
----Track
----Racenumber
----Date
----Time
----NumberOfHorses
Link the tables and you could get at any one part of it.
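To make the linking idea concrete, here is a minimal sketch in plain Python (not Firebase API calls) of two flat collections joined by a race ID. The race_id key, the track and horse names, and all sample values are made up for illustration.

```python
# Hypothetical sample data: races keyed by ID, horse entries kept in a flat list.
races = {
    "race1": {"track": "Track A", "date": "2014-06-07", "race_number": 3, "number_of_horses": 11},
    "race2": {"track": "Track B", "date": "2014-08-23", "race_number": 5, "number_of_horses": 8},
    "race3": {"track": "Track A", "date": "2013-06-08", "race_number": 3, "number_of_horses": 14},
}
entries = [
    {"horse": "Horse A", "race_id": "race1", "gate": 3, "place": 1},
    {"horse": "Horse B", "race_id": "race1", "gate": 7, "place": 2},
    {"horse": "Horse A", "race_id": "race2", "gate": 1, "place": 4},
]


def races_matching(races, year, min_horses):
    """Return IDs of races in the given year with more than min_horses horses."""
    return sorted(
        race_id
        for race_id, race in races.items()
        if race["date"].startswith(str(year)) and race["number_of_horses"] > min_horses
    )


def horses_in(entries, race_ids):
    """Return the names of horses entered in any of the given races."""
    wanted = set(race_ids)
    return sorted({entry["horse"] for entry in entries if entry["race_id"] in wanted})


big_2014_races = races_matching(races, 2014, 10)
print(big_2014_races)                      # race IDs across all tracks
print(horses_in(entries, big_2014_races))  # horses that ran in those races
```

Because neither collection is nested under a track, the same filter works across all tracks, and the track is just another attribute to filter on.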

MS SQL product list with filtering

I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of their attributes (colors, # of doors, gas mileage, mfg. year, etc.). This tool outputs the results in a gridview, and the users have the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.), and I cannot seem to find the best way to do this filtering without a large SQL WHERE clause that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time, or will it require scans; and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by striping the mileage into bands of, say, 5 mpg (15-19, 20-24, 25-29, etc.).
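The banding idea can be sketched as follows. The function name, the 5 mpg default width, and the sample cars are illustrative assumptions; in a real warehouse the band would be a precomputed column rather than something derived at query time.

```python
def mpg_band(mpg, width=5):
    """Map a mileage value to a band label such as '20-24'."""
    low = (int(mpg) // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical denormalized rows, with the band computed once at load time.
cars = [
    {"model": "Car A", "mpg": 22},
    {"model": "Car B", "mpg": 27.5},
    {"model": "Car C", "mpg": 31},
]
for car in cars:
    car["mpg_band"] = mpg_band(car["mpg"])

# A range filter becomes a simple membership test on a small set of labels.
selected_bands = {"20-24", "25-29"}
matches = [car["model"] for car in cars if car["mpg_band"] in selected_bands]
print(matches)
```

The trade-off is precision: a filter can only be as fine as the band width, which is often acceptable for a search UI.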
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.

I need to Edit 100,000+ Products

I'm looking at accepting a project that would require me to clean up an existing e-commerce website. It's been relatively successful and has over 100,000 individual products, loaded both by the client and its publishers.
The site wasn't originally designed for this many products and has become fairly disorganized.
So, the client has asked me to look at a more robust search option: filterable and so forth. I completely agree it needs to be improved, but after looking at the database, I can tell that there are dozens and dozens of categories and not everything is labeled correctly, etc.
Is there any database management software that could help me clean up 100,000 entries quickly? Make categories consistent, fix uppercase/lowercase problems, etc.
Are there any companies out there that I can outsource just this particular part of the project to?
It's a massive amount of data entry. If I spent 2 minutes per product, it would take roughly 3,300 hours, or more than a year and a half of full-time work, just to complete the database cleanup. I either need to get it down to a matter of seconds per product or find a company that specializes in this type of work.
I don't even know what to search for on Google.
Thanks guys!
--
Thanks everyone for your ideas! I have a lot of options now so I feel a lot more comfortable heading in to this project. Right now I think the direction we will go is to build a tool that allows the client to hire data entry people that can update it as necessary. Then I will work as a consultant, taking care of any UPDATE-WHERE type functions as necessary.
Thanks again!
If there are inconsistencies like you are describing, it sounds like the problem may be more an issue of a bad data model (i.e. lack of normalization) than just dirty data. If good normalization is in place, cleaning up categories should be as simple as updating a single record per category, but if the category name is used instead of a foreign key, then you will most likely need to perform a series of UPDATE WHERE statements to clean up the text.
You may want to look into an ETL (extract, transform, load) tool that can help with bulk data transformation. I'm not familiar with ETL tools for MySQL, but I'm sure they exist. SQL Server has a built-in service called SQL Server Integration Services that provides the ability to extract data from an existing data source, perform bulk changes or transformations, and then reload the data into a destination database. Tools like this may help speed up the process of standardizing capitalization, punctuation, changing categories, etc.
Even still, don't overlook the possibility that the data model may need tweaking to help prevent this type of situation in the future.
Edit: Wikipedia has a list of open-source ETL products that you may want to investigate.
In any case you'll probably need to do more than "clean the data", which means you'll need to build new normalized tables. So start there: build a new database that is fully normalized, and import the data "as is", with all the duplicate categories, etc.
For example, new tables:
Items
    ItemID      int identity/auto number
    ItemName    string
    CategoryID  int
    ....
Categories
    CategoryID    int identity/auto number
    CategoryName  string
    ....
Import the bad data into the new system:
Items
    ItemID  ItemName  CategoryID
    1       thing A   1
    2       thing B   2
    3       thing C   3
    4       thing D   1
Categories
    CategoryID  CategoryName
    1           Game
    2           food
    3           games
Now you can consolidate the data using the PKs:
    UPDATE Items
    SET CategoryID = 1
    WHERE CategoryID = 3;
    DELETE FROM Categories
    WHERE CategoryID = 3;
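The consolidation above can be tried end to end with a small script. This is a minimal sketch using Python's built-in sqlite3 module and an in-memory database rather than the production system; the merge_category helper is a made-up name, and the sample rows are the ones from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Categories (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Items (
    ItemID INTEGER PRIMARY KEY,
    ItemName TEXT,
    CategoryID INTEGER REFERENCES Categories(CategoryID)
);
INSERT INTO Categories VALUES (1, 'Game'), (2, 'food'), (3, 'games');
INSERT INTO Items VALUES (1, 'thing A', 1), (2, 'thing B', 2),
                         (3, 'thing C', 3), (4, 'thing D', 1);
""")


def merge_category(conn, duplicate_id, keep_id):
    """Repoint items from the duplicate category, then delete the duplicate."""
    with conn:  # both statements commit together, or roll back together
        conn.execute(
            "UPDATE Items SET CategoryID = ? WHERE CategoryID = ?",
            (keep_id, duplicate_id),
        )
        conn.execute("DELETE FROM Categories WHERE CategoryID = ?", (duplicate_id,))


merge_category(conn, duplicate_id=3, keep_id=1)
rows = conn.execute("SELECT ItemID, CategoryID FROM Items ORDER BY ItemID").fetchall()
remaining = conn.execute("SELECT CategoryName FROM Categories ORDER BY CategoryID").fetchall()
print(rows)
print(remaining)
```

Running the two statements in one transaction means a failure partway through cannot leave items pointing at a deleted category.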
You might just write an application where the customer can do the consolidation. Let them select the duplicates on a screen and merge them into a selected parent category. You have this application run the merge SQL from above.
If there are issues of needing to have a clean cut over date, create an application that generates a series of "Map" tables, where you store the CategoryNameOld="games" and the CategoryNameNew="Game" and use these when you do the conversion/load of the bad data into the new system's tables.
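A "Map" table of this kind might look like the following minimal sketch, again using sqlite3 in memory. CategoryNameOld and CategoryNameNew come from the text above; the RawItems and CategoryMap table names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RawItems (ItemName TEXT, CategoryName TEXT);
CREATE TABLE CategoryMap (CategoryNameOld TEXT PRIMARY KEY, CategoryNameNew TEXT);
INSERT INTO RawItems VALUES ('thing A', 'Game'), ('thing B', 'food'), ('thing C', 'games');
INSERT INTO CategoryMap VALUES ('games', 'Game');
""")

# Apply the map during the load: mapped names are replaced, unmapped names pass through.
cleaned = conn.execute("""
SELECT r.ItemName, COALESCE(m.CategoryNameNew, r.CategoryName) AS CategoryName
FROM RawItems AS r
LEFT JOIN CategoryMap AS m ON m.CategoryNameOld = r.CategoryName
ORDER BY r.ItemName
""").fetchall()
print(cleaned)
```

Since the map is just data, it can be built up and reviewed ahead of time, and the conversion can be re-run on the cut-over date without touching the source rows.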
I would implement the new search system or whatever and build them a tool that would allow them to easily go through and cleanup the listings, re-categorize, etc. This task requires domain knowledge, so they're the best ones to do it.
Do some number crunching so they can prioritize the list and clean in order of importance.
Keep in mind that one of your options is to build a crappy interface that somebody can use to edit records, hire half a dozen data-entry people from a temp agency, spend two days training them, and let them go to town.

Large Product catalog with statistics - alternatives to Sql Server?

I am building UI for a large product catalog (millions of products).
I am using Sql Server, FreeText search and ASP.NET MVC.
Tables are normalized and indexed. Most queries take less than a second to return.
The issue is this. Let's say user does the search by keyword. On search results page I need to display/query for:
Display 20 matching products on first page(paged, sorted)
Total count of matching products for paging
List of stores only of all matching products
List of brands only of all matching products
List of colors only of all matching products
Each query takes about 0.5 to 1 second. Altogether it comes to about 5 seconds.
I would like to get the whole page to load under 1 second.
There are several approaches:
Optimize queries even more. I already spent a lot of time on this one, so not sure it can be pushed further.
Load products first, then load the rest of the information using AJAX. More like a workaround. Will need to revise UI.
Re-organize data to be more Report friendly. Already aggregated a lot of fields.
I checked out several similar sites, for example zappos.com. Not only do they display the same information I would like in under 1 second, but they also include statistics (number of results in each category).
The following is the search for keyword "white"
http://www.zappos.com/white
How do sites like Zappos and Amazon make their results, filters and stats appear almost instantly?
So you asked specifically "how does Zappos.com do this". Here is the answer from our Search team.
An alternative idea for your issue would be using a search index such as Solr. Basically, the way these work is that you load your data set into the system and it does a huge amount of indexing. My projects include product catalogs with 200+ data points for each of the 140k products. The average return time is less than 20 ms.
The search indexing system I would recommend is Solr, which is based on Lucene. Both of these projects are open source and free to use.
Solr fits perfectly for your described use case in that it can actually do all of those things all in one query. You can use facets (essentially group by in sql) to return the list of different data values for all applicable results. In the case of keywords it also would allow you to search across multiple fields in one query without performance degradation.
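As a rough sketch of what such a single request could look like, the snippet below builds a Solr select URL that asks for one page of results plus facet counts. The host, the core name (products), and the field names (store, brand, color) are hypothetical; q, start, rows, facet and facet.field are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_solr_facet_url(base_url, keyword, facet_fields, page=0, page_size=20):
    """Build a select URL returning one page of matches plus facet counts."""
    params = [
        ("q", keyword),
        ("start", page * page_size),  # offset of the first result on this page
        ("rows", page_size),
        ("facet", "true"),
    ]
    # One facet.field parameter per attribute whose distinct values are wanted.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


url = build_solr_facet_url(
    "http://localhost:8983/solr/products",  # hypothetical host and core
    "white",
    ["store", "brand", "color"],
)
print(url)
```

The response to a request like this carries the page of documents, the total match count (numFound), and the per-field facet counts, which covers the five separate queries listed in the question.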
You could try replacing your aggregate queries with materialized indexed views of those aggregates. This will pre-compute all the aggregates and will be as fast as selecting any regular row data.
0.5 seconds is too long on appropriate hardware. I agree with Aaronaught: the first thing to do is to convert it into a single SQL statement, or possibly a stored procedure, to ensure it's compiled only once.
Analyze your queries to see if you can create even better indexes (consider covering indexes), fine-tune existing indexes, and employ partitioning.
Make sure you have an appropriate hardware config: data, log, temp and even index files should be located on independent spindles. Make sure you have enough RAM and CPUs. I hope you are running a 64-bit platform.
After all this, if you still need more, analyze the most used keywords and create aggregate result tables for the top 10 keywords.
As for Amazon, they most likely use superior hardware and also take advantage of CDNs. Also, they have thousands of servers serving up the content, and there are no performance bottlenecks: data is duplicated multiple times across several data centers.
As a completely separate approach, you may want to look into "in-memory" databases such as CACHE; this is the fastest you can get on the DB side.

Resources