I wanted to know if there are any methods to do data cleansing in Kylo (https://kylo.io/). I was able to get the tool to point out errors using data validation rules, but I was curious to know if it can also perform other functions. Examples:
Deleting any empty records in between the data
Detecting and deleting duplicate columns in the data
Data cleansing is handled in Kylo using standardizers and validators. However, as of Kylo 0.9.0 there are no built-in functions for removing empty rows or duplicate columns. The current functions are limited to removing rows when a specific column is empty and removing duplicate rows.
This functionality could be added by writing a plugin:
http://kylo.readthedocs.io/en/latest/developer-guides/PluginApiIndex.html
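For reference, the two cleanup operations the question asks about are simple to express in plain code; a custom standardizer plugin would implement essentially this logic (this is only an illustrative sketch in Python, not Kylo's actual plugin API):

```python
# Illustrative only: the row/column cleanup a custom plugin would perform.
# Rows are dicts keyed by column name; None or "" counts as "empty".

def remove_empty_rows(rows):
    """Drop rows in which every value is None or an empty string."""
    return [r for r in rows if any(v not in (None, "") for v in r.values())]

def remove_duplicate_columns(rows):
    """Drop columns whose full sequence of values duplicates an earlier column."""
    if not rows:
        return rows
    seen = set()
    keep = []
    for col in rows[0]:
        values = tuple(r.get(col) for r in rows)
        if values not in seen:
            seen.add(values)
            keep.append(col)
    return [{c: r.get(c) for c in keep} for r in rows]

data = [
    {"id": 1, "name": "a", "name_copy": "a"},
    {"id": None, "name": "", "name_copy": ""},   # fully empty row
    {"id": 2, "name": "b", "name_copy": "b"},    # name_copy duplicates name
]
cleaned = remove_duplicate_columns(remove_empty_rows(data))
# cleaned -> [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
```

In a real plugin these transformations would run on Spark rather than in-memory lists, but the logic is the same.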
The data I work with consists of multiple tables (around 5-10), with individual tables containing up to 10 million entries. Overall I'd describe it as a large data set, but not too large to work with on a 'normal' computer. I need a synthetic data set with the same structure and internal dependencies, i.e. a dummy data set. I can't use the data I work with, as it contains sensitive information.
I did research on synthetic data and came across different solutions. The first would be online providers where one uploads the original data and synthetic data is created based on the given input. This sounds like a nice solution, but I'd rather not share the original data with any external sources, so this is currently not an option for me.
The second solution I came across is the synthpop package in R. I tried it, but I encountered two problems: the first is that for larger tables (like those in the original data sets) it takes a very long time to execute. The second is that I only got it working for a single table; however, I need to keep the dependencies between the tables, otherwise the synthetic data doesn't make any sense.
The third option would be to do the whole data creation by myself. I have good and solid knowledge about the domain and the data, so I would be able to define the internal constraints formally and then write a script to follow these. The problem I see here is that it would obviously be a lot of work and as I'm no expert on synthetic data creation, I might still overlook something important.
So basically I have two questions:
Is there a good package/solution you can recommend (preferably in R, but ultimately the programming language doesn't matter so much) for automatically creating synthetic (and private) data based on original input data consisting of multiple tables?
Would you recommend the manual approach, or would you recommend spending more time on the synthpop package, for example, and trying that approach?
I'm very new to using R for anything database related, much less with AWS.
I'm currently trying to work with this set of code here. Specifically the '### TEST SPECIFIC TABLES' section.
I'm able to get the code to run, but now I'm actually not sure how to pull data from the tables. I assume that I have to do something with 'groups', but I'm not sure what I need to do next to pull the data out.
So, even more specifically: how would I pull out specific data, like revenue for all organizations within the year 2018, for example? I've tried readRDS to pull a table as a data frame, but I get no observations or variables for any table. So I'm somewhat lost as to what I need to do here to pull the data out of the tables.
Thanks in advance!
We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and want to read only a subset of the file into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for products we are interested in?
In short, CSV puts you at the wrong end of the efficiency spectrum for this kind of selective read.
Using data that is:
Partitioned (the partitionBy option of the DataFrameWriter, or a correct directory structure) by the column of interest, or
Clustered (the bucketBy option of the DataFrameWriter plus a persistent metastore) on the column of interest,
can help narrow the search to particular partitions in some cases. But if filter(product == p1) is highly selective, you're likely looking at the wrong tool.
Depending on the requirements, a proper database or a data warehouse on Hadoop might be a better choice.
You should also consider choosing a better storage format (like Parquet).
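To illustrate the partition-pruning idea in miniature: writing one directory per product value (the `product=<value>/...` layout that Spark's partitionBy produces) lets a reader open only the partitions it needs instead of scanning the whole dataset. A toy plain-Python sketch with invented data:

```python
# Toy demonstration of partition pruning: data is written under
# product=<value>/ directories, and a filtered read only touches
# the single directory for the requested product.
import csv
import os
import tempfile

root = tempfile.mkdtemp()
sales = [("p1", 10.0), ("p2", 20.0), ("p1", 15.0), ("p3", 7.5)]

# Write: partition by product, mimicking Spark's directory layout.
for product, amount in sales:
    d = os.path.join(root, f"product={product}")
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part.csv"), "a", newline="") as f:
        csv.writer(f).writerow([amount])

def read_product(root, product):
    """Read only the partition for one product; other files stay untouched."""
    path = os.path.join(root, f"product={product}", "part.csv")
    with open(path, newline="") as f:
        return [float(row[0]) for row in csv.reader(f)]

p1_amounts = read_product(root, "p1")  # only product=p1/ is opened
```

In sparklyr, I believe the equivalent write would be spark_write_parquet with its partition_by argument, followed by a filtered read, but check the package documentation for the exact signature.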
I have multiple flat files (CSV), each with multiple records, that arrive in random order. I have to combine their records using unique ID fields.
How can I combine them if there is no single unique field common to all files, and I don't know which file will arrive first?
Here are some example files:
In reality there are 16 files.
There are many more fields and records than in this example.
I would avoid trying to do this purely in XSLT/BizTalk orchestrations/C# code. These are fairly simple flat files. Load them into SQL, and create a view to join your data up.
You can still use BizTalk to pick up and load the files. You can also still use BizTalk to execute the view or procedure that joins the data and sends your final message.
There are a few questions that might help guide how this would work here:
When do you want to join the data together? What triggers that (a time of day, a certain number of messages received, a certain type of message, a particular record, etc)? How will BizTalk know when it's received enough/the right data to join?
What does a canonical version of this data look like? Does all of the data from all of these files truly get correlated into one entity (e.g. a "Trade" or a "Transfer" etc.)?
I'd probably start by defining my canonical entity, and then work toward getting a "complete" picture of that entity by using SQL for this kind of case.
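A minimal sketch of the load-into-SQL-and-join approach, using SQLite as a stand-in for the real database (table and column names here are invented; the point is that the correlation logic lives in a view, not in XSLT or orchestrations):

```python
# Two staged tables stand in for two of the incoming flat files; the view
# is the "canonical entity" that joins them on a shared key once both
# files have been loaded, regardless of arrival order.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_a (trade_id TEXT, amount REAL);
    CREATE TABLE file_b (trade_id TEXT, counterparty TEXT);
    INSERT INTO file_a VALUES ('T1', 100.0), ('T2', 250.0);
    INSERT INTO file_b VALUES ('T1', 'ACME'), ('T2', 'Globex');

    -- The canonical entity: one row per trade, correlated across files.
    CREATE VIEW canonical_trade AS
    SELECT a.trade_id, a.amount, b.counterparty
    FROM file_a a
    JOIN file_b b ON b.trade_id = a.trade_id;
""")
rows = conn.execute(
    "SELECT * FROM canonical_trade ORDER BY trade_id"
).fetchall()
# rows -> [('T1', 100.0, 'ACME'), ('T2', 250.0, 'Globex')]
```

With 16 files the view would chain more joins (or use LEFT JOINs where some files are optional), and BizTalk would only poll the view once its trigger condition says the set is complete.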
I'm getting daily exports of Google Analytics data into BigQuery and these form the basis for our main reporting dataset.
Over time I need to add new columns for additional things we use to enrich the data, like, say, a mapping from URL to 'reporting category'.
This is easy to add as a new column on the processed tables (there are about 10 processing steps at the moment for all the enrichment we do).
The issue is when stakeholders then ask: can we add that new column to the historical data?
Currently I then need to rerun all the daily jobs, which is very slow and costly.
This is coming up frequently enough that I'm seriously thinking about redesigning my data pipelines to accommodate the fact that I often need to essentially drop and recreate ALL the data whenever I need to add a new field or correct old dirty data.
I'm just wondering if there are better ways to:
Add a new column to an old table in BQ (I'd be happy to do this by hand in these instances, where I can just join the new column based on the GA [hit_key] I have defined, which is basically a row key).
(Less commonly) update existing tables based on some WHERE condition.
I'm just wondering what the best practices are, whether anyone has had similar issues where you basically need to update a historic schema, and whether there are ways to do it without just dropping and recreating everything, which is essentially what I'm currently doing.
To be clearer on my current approach: I'm taking the [ga_sessions_yyyymmdd] table and making a series of [ga_data_prepN_yyyymmdd] tables, where I either add new columns at each step or reduce the data in some way. There are now 11 of these steps, and each time I'm taking all 100 or more columns along for the ride. This is what I'm going to try to design away from, as currently 90% of the columns at each stage don't even need to be touched; they could just be joined back on at the end, maybe based on hit_key or something.
It's a little bit messy though to try and pick apart.
Adding new columns to the schema of existing historical tables is possible, but the values in the newly added columns will be NULL. If you need to populate values into these columns, probably the best approach is to use an UPDATE DML statement. More details on how to try it out are here: Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?
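The backfill pattern looks like this: add the new column (existing rows get NULL), then populate it with an UPDATE that joins on the row key (hit_key here, as in the question). A sketch using SQLite as a stand-in for BigQuery, with invented table names; in BigQuery the UPDATE would run as a DML job against the historical table:

```python
# Step 1: ALTER TABLE adds the column with NULLs for existing rows.
# Step 2: UPDATE backfills it by looking up a mapping table per row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ga_data (hit_key TEXT, url TEXT);
    INSERT INTO ga_data VALUES ('k1', '/home'), ('k2', '/pricing');

    CREATE TABLE url_category (url TEXT, reporting_category TEXT);
    INSERT INTO url_category VALUES ('/home', 'landing'), ('/pricing', 'sales');

    -- Step 1: schema change; historical rows get NULL in the new column.
    ALTER TABLE ga_data ADD COLUMN reporting_category TEXT;

    -- Step 2: backfill via UPDATE, correlating on the mapping table.
    UPDATE ga_data
    SET reporting_category = (
        SELECT reporting_category FROM url_category
        WHERE url_category.url = ga_data.url
    );
""")
rows = conn.execute(
    "SELECT hit_key, reporting_category FROM ga_data ORDER BY hit_key"
).fetchall()
# rows -> [('k1', 'landing'), ('k2', 'sales')]
```

Note that BigQuery DML has its own quota and cost characteristics, so for very wide daily tables it is still worth checking whether a one-off UPDATE beats recreating the affected tables.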