Lost on .rds files/pulling data from tables - r

Very new to using R for anything database related, much less with AWS.
I'm currently trying to work with this set of code here, specifically the '### TEST SPECIFIC TABLES' section.
I'm able to get the code to run, but now I'm not sure how to actually pull data from the tables. I assume I have to do something with 'groups', but I'm not sure what I need to do next to get the data out.
More specifically: how would I pull out specific data, like revenue for all organizations within the year 2018, for example? I've tried readRDS to pull a table in as a data frame, but I get no observations or variables for any table. So I'm sort of lost on what I need to do here to pull the data out of the tables.
Thanks in advance!
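
Two things worth checking, without seeing the script itself. If readRDS() returns an object with no observations or variables, the .rds file probably doesn't contain the table itself; run str() on the result to see what the saved object actually is. And if the tables live in a database you connect to from R, you query them through the connection rather than through readRDS. Below is a minimal, self-contained sketch of that pattern using an in-memory SQLite table as a stand-in; the table and column names (organizations, org_id, year, revenue) are made up and would need to be replaced with whatever your schema actually uses.

```r
library(DBI)

# In-memory SQLite table standing in for whatever the script connects to.
# Table and column names here are hypothetical.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "organizations",
             data.frame(org_id  = c(1, 1, 2),
                        year    = c(2017, 2018, 2018),
                        revenue = c(100, 150, 200)))

dbListTables(con)                   # which tables the connection exposes
dbListFields(con, "organizations")  # which columns a given table has

# Revenue for all organizations in 2018, pulled into a regular data frame.
revenue_2018 <- dbGetQuery(con, "
  SELECT org_id, year, revenue
  FROM organizations
  WHERE year = 2018
")
revenue_2018

dbDisconnect(con)
```

Once the result of dbGetQuery() is a local data frame, it behaves like any other data frame in R, so the usual subsetting and summarising tools apply.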

Related

Complex Synthetic Data - Create manually or use a package/tool?

The data I work with consists of multiple tables (around 5-10), with individual tables containing up to 10 million entries. So, overall I'd describe it as a large data set, but not too large to work with on a 'normal' computer. I'm in need of a synthetic data set with the same structure and internal dependencies, i.e. a dummy data set. I can't use the data I work with, as it contains sensitive information.
I did research on synthetic data and came across different solutions. The first would be online providers where one uploads the original data and synthetic data is created based on the given input. This sounds like a nice solution, but I'd rather not share the original data with any external sources, so this is currently not an option for me.
The second solution I came across is the synthpop package in R. I tried that; however, I encountered two problems: the first being that for larger tables (like the tables in the original data set) it takes a very long time to execute. The second being that I only got it working for a single table, but I need to keep the dependencies between the tables, otherwise the synthetic data doesn't make any sense.
The third option would be to do the whole data creation by myself. I have good and solid knowledge about the domain and the data, so I would be able to define the internal constraints formally and then write a script to follow these. The problem I see here is that it would obviously be a lot of work and as I'm no expert on synthetic data creation, I might still overlook something important.
So basically I have two questions:
Is there a good package/solution you can recommend (preferably in R, but ultimately the programming language doesn't matter so much) for automatically creating synthetic (and private) data based on original input data consisting of multiple tables?
Would you recommend the manual approach, or would you recommend spending more time on the synthpop package, for example, and trying that approach?
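
For reference, a minimal sketch of the single-table synthpop workflow described above, on toy data (the real tables can't be shared, so the variables here are invented). Note that synthpop works on one data frame at a time; cross-table keys would still need to be handled separately, for example by synthesising the parent table first and then generating child rows per key.

```r
library(synthpop)

# Toy stand-in for one real table.
set.seed(123)
original <- data.frame(
  age    = sample(18:90, 1000, replace = TRUE),
  income = round(rlnorm(1000, meanlog = 10, sdlog = 0.5)),
  region = factor(sample(c("north", "south", "east", "west"), 1000, replace = TRUE))
)

# Synthesise a single table. "cart" is the default method and can be slow on
# wide or long tables; simpler per-variable methods can reduce runtime.
synth <- syn(original, method = "cart", seed = 123)

head(synth$syn)           # the synthetic data frame
compare(synth, original)  # quick comparison of marginal distributions
```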

export pivot table to R

I have a very large dataset which I can only access through a pivot table in Excel.
I would like to access the raw data behind it so I can work with it in R.
I have tried several things:
Copying and pasting the pivot table into both text and Excel files and then importing into R: it does not work; it only shows the filters that I have selected, not the totality of the database.
Clicking on a cell of the pivot table to see the underlying data: it only shows the first 1000 entries. This is by far the best result I have got; the only problem is that I get the first 1000 entries and I need all of them.
What I have been doing until now is to select the variables that interest me, copy them to a new sheet, and finally export to R. But I always forget a variable, and it's very time-consuming to redo this whole procedure every day.
Does anyone know how I can access the totality of a database from a pivot table in Excel?
I hope I have been clear; do not hesitate to ask for more information if needed.
Thank you in advance.
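
One hedged suggestion on the 1000-row limit: if the pivot table is backed by the Data Model or another OLAP source, the drill-through row cap can usually be raised in the connection properties (the "maximum number of records to retrieve" setting), which may let you drill the full record set out onto a regular worksheet. Once the underlying records are on a plain sheet or in a CSV export, reading them into R is straightforward; the file and sheet names below are placeholders.

```r
library(readxl)

# Placeholder file/sheet names: assumes the full underlying records have been
# drilled out onto a regular worksheet (with the row limit raised) or exported
# from the source system.
raw <- read_excel("drill_through_export.xlsx", sheet = 1,
                  guess_max = 100000)  # inspect more rows when guessing column types
str(raw)

# If a CSV export is possible instead, data.table::fread() handles large files well:
# raw <- data.table::fread("raw_export.csv")
```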

I want to combine two tables and display them as one table

I want to combine tables ① and ②, but I don't know how.
I created models like this:
① CalendarMaster (fields: Index, Date, CalenderHoliday, CompanyHoliday)
② AttendanceMaster (fields: Index, EmployeeId, Date, GoingTime, LeavingTime)
I want to join them using the Date of CalendarMaster and the Date of AttendanceMaster as the key.
I want to know what type of model to use and where to write the SQL query script.
Once the tables are joined into one,
I want to display [Date, CalenderHoliday, CompanyHoliday, GoingTime, LeavingTime] as a single table.
I looked at various sites and tried relations, but it didn't work, so please help.
I have been stuck on this for another week.
Waiting for advice.
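
Without knowing which framework the "model" belongs to, the join itself is the same SQL wherever the query ends up living. Here is a runnable sketch driven from R against an in-memory SQLite database, with tiny made-up rows for the two tables; a LEFT JOIN on Date keeps every calendar date and fills in GoingTime/LeavingTime where attendance exists. The table and field names follow the question; everything else is invented.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Tiny stand-ins for the two tables, reduced to the fields needed here.
dbWriteTable(con, "CalendarMaster", data.frame(
  Date            = c("2024-05-01", "2024-05-02", "2024-05-03"),
  CalenderHoliday = c(0, 0, 1),
  CompanyHoliday  = c(0, 0, 0)
))
dbWriteTable(con, "AttendanceMaster", data.frame(
  EmployeeId  = c(101, 101),
  Date        = c("2024-05-01", "2024-05-02"),
  GoingTime   = c("09:00", "09:05"),
  LeavingTime = c("18:00", "18:10")
))

# LEFT JOIN keeps every calendar date and fills in attendance where it exists.
result <- dbGetQuery(con, "
  SELECT c.Date, c.CalenderHoliday, c.CompanyHoliday,
         a.GoingTime, a.LeavingTime
  FROM CalendarMaster AS c
  LEFT JOIN AttendanceMaster AS a
    ON a.Date = c.Date
  -- AND a.EmployeeId = 101   -- add a condition like this for a single employee
")
result
dbDisconnect(con)
```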

Conditional place data into a prebuilt report

It's quite an interesting challenge, and I can't say that I entirely know how, or the best way, to go about it.
Basically I have a data set; I have attached a few pictures to try and show you what I am working with. The data was randomly generated, but it is similar to what I am working with.
I want to take the data and input the date and value into the report based on the category and date. The even more challenging part is that I need the report to be filled out for each unique id. So it will have to create many different reports and then fill them out.
Any ideas/questions? I have no idea how to go about it.
I am experienced in R and Excel, and know some Python and SQL (but very little).
If you have the dataset in R, you could write a function that takes the parameters needed, performs the aggregation, and writes the result to Excel.
It is not clear to me what exactly the data aggregation part is. Without reproducible data it is hard to go into more detail.
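
As a rough illustration of that suggestion: split the data by unique id, aggregate, and write one Excel file per id. The column names (id, category, date, value) and the sum aggregation are assumptions based on the description, so treat this as a sketch rather than a drop-in solution.

```r
library(dplyr)
library(openxlsx)

# Toy data with the kind of columns described; names and values are made up.
set.seed(1)
dat <- expand.grid(
  id       = c("A", "B"),
  category = c("sales", "costs"),
  date     = seq(as.Date("2018-01-01"), by = "month", length.out = 6),
  stringsAsFactors = FALSE
)
dat$value <- round(runif(nrow(dat), 100, 1000))

# One report per unique id: aggregate value by category and date,
# then write each result to its own Excel file.
write_reports <- function(df, out_dir = tempdir()) {
  for (one_id in unique(df$id)) {
    report <- df %>%
      filter(id == one_id) %>%
      group_by(category, date) %>%
      summarise(value = sum(value), .groups = "drop")
    openxlsx::write.xlsx(report, file.path(out_dir, paste0("report_", one_id, ".xlsx")))
  }
  invisible(NULL)
}

write_reports(dat)
```

If the report is a prebuilt template rather than a fresh file, openxlsx::loadWorkbook() plus writeData() can write the aggregated values into specific cells of an existing workbook instead of creating a new one.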

Adding new fields to historical tables in BigQuery

I'm getting daily exports of Google Analytics data into BigQuery and these form the basis for our main reporting dataset.
Over time I need to add new columns for additional things we use to enrich the data, like, say, a mapping from url to 'reporting category'.
This is easy to just add as a new column onto the processed tables (there are about 10 processing steps at the moment for all the enrichment we do).
The issue is when stakeholders then ask: can we add that new column to the historical data?
Currently I then need to rerun all the daily jobs, which is very slow and costly.
This is coming up frequently enough that I'm seriously thinking about redesigning my data pipelines to account for the fact that I often need to essentially drop and recreate ALL the data from time to time when I need to add a new field or correct old dirty data.
I'm just wondering if there are better ways to:
Add a new column to an old table in BQ (I would be happy to do this by hand in these instances, since I can just join the new column based on the GA [hit_key] I have defined, which is basically a row key).
(Less common) Update existing tables based on some WHERE condition.
Just wondering what best practices are, whether anyone has had similar issues where you basically need to update a historical schema, and whether there are ways to do it without just dropping and recreating, which is essentially what I'm currently doing.
To be clearer on my current approach: I'm taking the [ga_sessions_yyyymmdd] table and making a series of [ga_data_prepN_yyyymmdd] tables where I either add new columns at each step or reduce the data in some way. There are now 11 of these steps, and each time I'm taking all 100 or more columns along for the ride. This is what I'm going to try to design away from, as currently 90% of the columns at each stage don't even need to be touched; they could just be joined back on at the end, maybe based on hit_key or something.
It's a little bit messy, though, to try and pick apart.
Adding new columns to the schema of the existing historical tables is possible, but the values for the newly added columns will be NULLs. If you do need to populate values into these columns, probably the best approach is to use an UPDATE DML statement. More details on how to try it out are here: Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?
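
For the "join the new column on hit_key by hand" case mentioned above, a rough sketch of what that could look like, driven from R with the bigrquery package. All names (project, dataset, tables, columns) are hypothetical, and DML/DDL statements need a billing-enabled project rather than the sandbox.

```r
library(bigrquery)

# Hypothetical project id; hit_key is the row key described above.
project <- "my-gcp-project"

# 1. Add the new column to the historical table (existing rows get NULL).
bq_project_query(project, "
  ALTER TABLE `my-gcp-project.reporting.ga_data_prep`
  ADD COLUMN IF NOT EXISTS reporting_category STRING
")

# 2. Backfill it by joining a small mapping table on hit_key.
bq_project_query(project, "
  UPDATE `my-gcp-project.reporting.ga_data_prep` AS t
  SET reporting_category = m.reporting_category
  FROM `my-gcp-project.reporting.url_to_category_map` AS m
  WHERE t.hit_key = m.hit_key
")
```

The same two statements can of course be run directly in the BigQuery console; the R wrapper is only there so the backfill can sit alongside the rest of the pipeline code.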
