Complex Synthetic Data - Create manually or use a package/tool? - r

The data I work with consists of multiple tables (around 5-10), with single tables containing up to 10 million entries. So, overall, I'd describe it as a large data set, but not too large to work with on a 'normal' computer. I'm in need of a synthetic data set with the same structure and internal dependencies, i.e. a dummy data set. I can't use the data I work with, as it contains sensitive information.
I did research on synthetic data and came across different solutions. The first would be online providers where one uploads the original data and synthetic data is created based on the given input. This sounds like a nice solution, but I'd rather not share the original data with any external sources, so this is currently not an option for me.
The second solution I came across is the synthpop package in R. I tried that; however, I encountered two problems: the first being that for larger tables (like those in the original data set) it takes a very long time to execute, and the second being that I only got it working for a single table. I need to keep the dependencies between the tables, otherwise the synthetic data doesn't make any sense.
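For context, the single-table run was essentially of this shape (simplified; 'my_table' stands in for one of the real tables, so this is only a sketch of the call):

```r
# Minimal single-table synthesis with synthpop ('my_table' is a placeholder)
library(synthpop)

# syn() fits a sequence of models (CART by default) and draws synthetic values
syn_result <- syn(my_table, seed = 123)

# The synthetic data frame, plus a quick comparison against the original
synthetic_table <- syn_result$syn
compare(syn_result, my_table)
```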
The third option would be to do the whole data creation myself. I have solid knowledge of the domain and the data, so I would be able to formally define the internal constraints and then write a script that follows them. The problem I see here is that it would obviously be a lot of work, and as I'm no expert in synthetic data creation, I might still overlook something important.
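To make the manual option more concrete, I imagine something along these lines, only much more elaborate; the tables, columns, and distributions below are made up purely for illustration.

```r
# Tiny sketch of the "manual" approach: two made-up tables where the orders
# table keeps a valid foreign key into the customers table, so the
# cross-table dependency is preserved by construction.
set.seed(42)

n_customers <- 1000
customers <- data.frame(
  customer_id = seq_len(n_customers),
  region      = sample(c("north", "south", "east", "west"),
                       n_customers, replace = TRUE)
)

n_orders <- 5000
orders <- data.frame(
  order_id    = seq_len(n_orders),
  customer_id = sample(customers$customer_id, n_orders, replace = TRUE),
  amount      = round(rlnorm(n_orders, meanlog = 4), 2)
)
```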
So basically I have two questions:
Is there a good package/solution you can recommend (preferably in R, but ultimately the programming language doesn't matter so much) for automatically creating synthetic (and private) data based on original input data consisting of multiple tables?
Would you recommend the manual approach, or would you recommend spending more time on, for example, the synthpop package and trying that approach?

Related

Lost on .rds files/pulling data from tables

Very new to using R for anything database related, much less with AWS.
I'm currently trying to work with this set of code here. Specifically the '### TEST SPECIFIC TABLES' section.
I'm able to get the code to run, but now I'm actually not sure how to pull data from the tables. I assume that I have to do something with 'groups', but I'm not sure what I need to do next to pull the data out of it.
So, even more specifically, how would I pull out specific data, like revenue for all organizations within the year 2018, for example? I've tried readRDS to pull a table in as a data frame, but I get no observations or variables for any table. So I'm sort of lost as to what I need to do here to pull the data out of the tables.
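To illustrate, this is roughly what I imagined doing; the file name and the column names are just my guesses.

```r
# What I pictured doing (file name and column names are guesses)
library(dplyr)

# Read one of the saved tables back in as a data frame
finance <- readRDS("finance.rds")

# Pull revenue for all organizations within 2018
revenue_2018 <- finance %>%
  filter(year == 2018) %>%
  select(organization, revenue)
```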
Thanks in advance!

Adding more custom entities into pretrained custom NER Spacy3

I have a huge amount of textual data and want to add around 50 different entity types. Initially, when I started working with it, I was getting a memory error. As I understand it, spaCy can handle roughly 100,000 tokens per GB of memory, up to a maximum of about 1,000,000. So I chunked my dataset into 5 sets and, using an annotator, created multiple JSON files. I started with one JSON file and successfully trained the model, and now I want to add more data to it so that I don't miss any tags and a good variety of data is used while training the model. Please guide me on how to proceed next.
I mentioned some points of confusion in a comment, but assuming that your issue is how to load a large training set into spaCy, the solution is pretty simple.
First, save your training data as multiple .spacy files in one directory. You do not have to make JSON files; that was the standard in v2. For details, see the training data section of the docs. In your config, you can specify this directory as the training data source, and spaCy will use all the files there.
Next, to avoid keeping all the training data in memory, you can specify max_epochs = -1 (see the docs on streaming corpora). Using this feature means you will have to specify your labels ahead of time as covered in the docs there. You will probably also want to shuffle your training data manually.
That's all you need to train with a lot of data.
The title of your question mentions adding entities to the pretrained model. It's usually better to train from scratch instead to avoid catastrophic forgetting, but you can see a guide to doing it here.

Conditionally place data into a prebuilt report

It's quite an interesting challenge, and I can't say that I entirely know how/the best way to go about it.
Basically, I have a data set; I have attached a few pictures to try and show you what I am working with. The data was randomly generated, but it is similar to what I am working with.
I want to take the data and input the date and value into the report based on the category and date. The even more challenging part is that I need the report to be filled out for each unique id. So it will have to create many different reports and then fill each one out.
Any ideas/questions? I have no idea how to go about it.
I am experienced in R and Excel, and know some Python and SQL (but very little).
If you have the dataset in R, you could write a function that takes the parameters needed, performs the aggregation, and writes the result to Excel.
It is not clear to me what exactly the data aggregation part is. Without reproducible data it is hard to go into more detail.
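Still, as a rough sketch of the idea (the column names id, category, date, and value, as well as the aggregation itself, are only guesses based on your description, and writexl is just one of several packages for writing Excel files):

```r
# Hedged sketch: one Excel file per unique id, with a placeholder aggregation.
# Column names (id, category, date, value) are assumed, not taken from the data.
library(dplyr)
library(writexl)

fill_report <- function(data, one_id) {
  data %>%
    filter(id == one_id) %>%
    group_by(category, date) %>%
    summarise(value = sum(value), .groups = "drop") %>%
    write_xlsx(paste0("report_", one_id, ".xlsx"))
}

# Create one report per unique id
for (one_id in unique(dataset$id)) {
  fill_report(dataset, one_id)
}
```

If the target is a prebuilt template rather than a fresh file, a package like openxlsx, which can write into an existing workbook, would be a more natural fit, but the overall structure stays the same.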

Filtering data while reading from S3 to Spark

We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and want to read only a subset of the files into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for products we are interested in?
In short, the format you are using (plain CSV) sits at the inefficient end of the spectrum for this kind of selective read.
Using data that is partitioned by the column of interest (the partitionBy option of the DataFrameWriter, or the corresponding directory structure) or clustered by the column of interest (the bucketBy option of the DataFrameWriter with a persistent metastore) can help narrow the search to particular partitions in some cases. But if filter(product == p1) is highly selective, then you're likely looking at the wrong tool, and depending on the requirements, a proper database or a data warehouse on Hadoop might be a better choice.
You should also consider choosing a better storage format, like Parquet.
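For example, with sparklyr one possible shape of this is a one-off conversion to Parquet partitioned by product, after which later reads only touch the relevant partitions. The S3 paths, the column name product, and the cluster master below are assumptions, so treat this as a sketch rather than a drop-in solution.

```r
# Sketch: convert the CSV data once to Parquet partitioned by product,
# then read and filter so Spark can prune to the partitions that matter.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")

# One-off conversion (this still scans the full CSV data once)
sales_csv <- spark_read_csv(sc, "sales_csv", "s3a://my-bucket/sales/")
spark_write_parquet(sales_csv, "s3a://my-bucket/sales_parquet/",
                    partition_by = "product")

# Subsequent analyses only read the partitions for the chosen products
sales <- spark_read_parquet(sc, "sales", "s3a://my-bucket/sales_parquet/")
p1_sales <- sales %>% filter(product == "p1")
```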

How to test two numbers that appear on different pages?

I have a stored procedure that produces a number (let's say 50) that is rendered as an anchor with the number as the text. When the user clicks the number, a popup opens, calls a different stored procedure, and shows 50 rows in an HTML table. The 50 rows are the disaggregation of the number the user clicked. In summary: two different ASPX pages and two different stored procedures that need to show the same amount; one is the aggregate and the other is the disaggregation of that aggregate.
Question: how do I test this code so that I know there is an error somewhere if the numbers do not match?
Note: this is a simplified example; in reality there are hundreds of anchor tags on the page.
This kind of testing falls outside of the standard, code-level testing paradigm. Here you are explicitly validating the data, and it sounds like you need a utility to achieve this.
There are plenty of environments in which to do this and approaches you can take, but here are two possible candidates.
SQL Server Management Studio: here you can write a simple script that runs through the various combinations from the two stored procedures, ensuring that the numbers and the rows match up. This will involve some inventive T-SQL, but nothing particularly taxing. The main advantage of this approach is that you'll have bare-metal access to the data.
Unit testing: as mentioned, your problem is somewhat outside of the typical testing scenario, where you would ordinarily mock the data and test into your business logic. However, that doesn't mean you cannot write the tests (especially if you are doing any DataSet manipulation prior to this processing). Check out this link and this one for various approaches (note: if you're using VS2008 or above, you get Testing Projects built in from the Professional version up).
In order to test what happens when the numbers do not match, I would temporarily change one of the stored procedures to return the correct amount + 1, or to always return zero, etc.

Resources