I have a huge amount of textual data and want to add around 50 different entity types. Initially, when I started working with it, I was getting a memory error. As we know, spaCy can handle roughly 100,000 tokens per GB of memory, up to a maximum of 1,000,000. So I chunked my dataset into 5 sets and, using an annotator, created a JSON file for each. I started with one JSON file and successfully trained a model; now I want to add more data to it so that I don't miss any tags and a good variety of data is used while training the model. Please guide me on how to proceed.
I mentioned some points of confusion in a comment, but assuming that your issue is how to load a large training set into spaCy, the solution is pretty simple.
First, save your training data as multiple .spacy files in one directory. You do not have to make JSON files; that was the standard in v2. For details, see the training data section of the docs. In your config you can specify this directory as the training data source, and spaCy will use all the files there.
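As a rough sketch, converting one chunk of annotated examples into a .spacy file might look like this (the example text, entity offsets, and output filename are hypothetical; your real annotations would come from your annotator's JSON export):

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, list of (start_char, end_char, label))
TRAIN_DATA = [
    ("Acme Corp hired Jane Doe in 2020.", [(0, 9, "ORG"), (16, 24, "PERSON")]),
]

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
db = DocBin()
for text, entities in TRAIN_DATA:
    doc = nlp(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in entities]
    doc.ents = [s for s in spans if s is not None]  # drop misaligned spans
    db.add(doc)
db.to_disk("train_chunk_0.spacy")  # save every chunk into the same directory
```

Repeating this for each of your five chunks gives you a directory of .spacy files that the config can point at directly.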
Next, to avoid keeping all the training data in memory, you can specify max_epochs = -1 (see the docs on streaming corpora). Using this feature means you will have to specify your labels ahead of time as covered in the docs there. You will probably also want to shuffle your training data manually.
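Since a streaming corpus is read in file order, one way to shuffle manually is to re-deal your annotated records across the chunk files before converting them. A minimal stdlib sketch (the directory layout, file names, and record shape are all hypothetical; it writes its own demo input so it is self-contained):

```python
import json
import random
from pathlib import Path

ann_dir = Path("annotations")
ann_dir.mkdir(exist_ok=True)

# Demo input: in practice these are your annotator-exported JSON chunk files.
for i in range(2):
    (ann_dir / f"chunk_{i}.json").write_text(
        json.dumps([{"text": f"example {i}-{j}", "entities": []} for j in range(3)])
    )

random.seed(42)  # reproducible shuffle
records = []
for path in sorted(ann_dir.glob("chunk_*.json")):
    records.extend(json.loads(path.read_text()))
random.shuffle(records)

# Re-deal the shuffled records into equally sized output chunks.
n_chunks = 2
size = -(-len(records) // n_chunks)  # ceiling division
for i in range(n_chunks):
    (ann_dir / f"shuffled_{i}.json").write_text(
        json.dumps(records[i * size:(i + 1) * size])
    )
```

The shuffled chunks can then be converted to .spacy files, so a streaming pass over the directory no longer sees records in their original grouped order.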
That's all you need to train with a lot of data.
The title of your question mentions adding entities to the pretrained model. It's usually better to train from scratch instead to avoid catastrophic forgetting, but you can see a guide to doing it here.
The data I work with consists of multiple tables (around 5-10), with single tables containing up to 10 million entries. So, overall I'd describe it as a large data set, but not too large to work with on a 'normal' computer. I'm in need of a synthetic data set with the same structure and internal dependencies, i.e. a dummy data set. I can't use the data I work with, as it contains sensitive information.
I did research on synthetic data and came across different solutions. The first would be online providers where one uploads the original data and synthetic data is created based on the given input. This sounds like a nice solution, but I'd rather not share the original data with any external sources, so this is currently not an option for me.
The second solution I came across is the synthpop package in R. I tried it, but I encountered two problems: the first being that for larger tables (as the tables in the original data sets are) it takes a very long time to execute; the second being that I only got it working for a single table, whereas I need to keep the dependencies between the tables, otherwise the synthetic data doesn't make any sense.
The third option would be to do the whole data creation by myself. I have good and solid knowledge about the domain and the data, so I would be able to define the internal constraints formally and then write a script to follow these. The problem I see here is that it would obviously be a lot of work and as I'm no expert on synthetic data creation, I might still overlook something important.
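To make the manual approach concrete, here is a minimal sketch of generating two linked tables while preserving a foreign-key dependency and one internal constraint. Everything here is hypothetical (the 'customers'/'orders' schema, the region effect on amount); real constraints would come from your domain knowledge:

```python
import random

random.seed(0)  # reproducible synthetic data

# Hypothetical parent table.
customers = [
    {"customer_id": i, "region": random.choice(["north", "south", "east", "west"])}
    for i in range(1000)
]

# Hypothetical child table: each order references an existing customer,
# which preserves referential integrity between the two tables.
orders = []
for order_id in range(5000):
    parent = random.choice(customers)
    orders.append({
        "order_id": order_id,
        "customer_id": parent["customer_id"],
        # Make amount depend on region to mimic an internal dependency.
        "amount": round(random.lognormvariate(3, 1)
                        * (2 if parent["region"] == "north" else 1), 2),
    })
```

This is obviously far simpler than a real schema, but it shows the shape of the work: one generator per table, sampling foreign keys from already-generated parents, with conditional distributions encoding the dependencies you know about.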
So basically I have two questions:
Is there a good package/solution you can recommend (preferably in R, but ultimately the programming language doesn't matter so much) for automatically creating synthetic (and private) data based on original input data consisting of multiple tables?
Would you recommend the manual approach, or would you recommend spending more time on the synthpop package, for example, and trying that approach?
Wondering if there are good examples or suggestions for how to handle steps that require manual review in a database-based scientific data pipeline (datajoint). For example, we'd like to handle the pre-processing and denoising/demixing of our neuronal calcium imaging data through the automated pipeline, but then each video and each cell requires manual review before being entered into the database for further analysis. What is the best practice for handling such steps? Add a manual table to add only data that pass review to downstream pipeline stages? Keep the steps before manual review separate from the rest of the pipeline (in their own schema?)? Thanks!
Automated curation
First, you could invoke an interactive GUI as part of the make method of a particular table that requires manual intervention. It would present the computed results for the human review/curation/correction.
A separate manual curation
Second, you can define a manual table to support manual review/curation. For example, the Curation table in the Calcium Imaging element follows this pattern: https://github.com/datajoint/element-calcium-imaging
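As a rough, non-runnable sketch of this second pattern (the schema name and the upstream Segmentation table are hypothetical, and creating the table requires a live database connection, so treat this as pseudocode):

```python
import datajoint as dj

schema = dj.schema("imaging_pipeline")  # hypothetical schema name

@schema
class Curation(dj.Manual):
    definition = """
    # one entry per manually reviewed segmentation result
    -> Segmentation            # hypothetical automated upstream table
    curation_id : int
    ---
    accepted    : bool         # did this cell/video pass manual review?
    note = ''   : varchar(1000)
    """
```

Downstream computed tables then depend on Curation rather than on the automated table, so only entries that passed review propagate to further analysis.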
We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and read only a subset of the file into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?
In short, your choice of format (CSV) sits at the inefficient end of the spectrum for this kind of selective read.
Using data that is:
Partitioned by the column of interest (the partitionBy option of the DataFrameWriter, or the correct directory structure).
Clustered by the column of interest (the bucketBy option of the DataFrameWriter with a persistent metastore).
can help to narrow down the search to particular partitions in some cases, but if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
Depending on the requirements:
A proper database.
Data warehouse on Hadoop.
might be a better choice.
You should also consider choosing a better storage format (like Parquet).
For example, let's say I wish to analyze a month's worth of company data for trends. I plan on doing regression analysis and classification using an MLP.
A month's worth of data has ~10 billion data points (rows).
There are 30 dimensions to the data.
12 features are numeric (integer or float; continuous).
The rest are categorical (integer or string).
Currently the data is stored in flat files (CSV) and is processed and delivered in batches. Data analysis is carried out in R.
I want to:
change this to stream processing (rather than batch processing).
offload the computation to a Spark cluster
house the data in a time-series database to facilitate easy read/write and query. In addition, I want the cluster to be able to query data from the database when loading the data into memory.
I have an Apache Kafka system that can publish the feed for the processed input data. I can write a Go module to interface this with the database (via cURL, or a Go API if one exists).
There is already a development Spark cluster available to work with (assume that it can be scaled as necessary, if and when required).
But I'm stuck on the choice of database. There are many solutions (here is a non-exhaustive list) but I'm looking at OpenTSDB, Druid and Axibase Time Series Database.
Other time-series databases which I have looked at briefly seem more as if they were optimised for handling metrics data (I have looked at InfluxDB, RiakTS and Prometheus).
"Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3." - Apache Spark Website
In addition, the time-series database should store that data in a fashion that exposes it directly to Spark (as this is time-series data, it should be immutable and therefore satisfies the requirements of an RDD - therefore, it can be loaded natively by Spark into the cluster).
Once the data (or data with dimensionality reduced by dropping categorical elements) is loaded, regression and classification models can be developed and experimented with using sparklyr (an R interface for Spark) and Spark's machine learning library (MLlib; this cheatsheet provides a quick overview of the functionality).
So, my questions:
Does this seem like a reasonable approach to working with big data?
Is my choice of database solutions correct? (I am set on working with columnar store and time-series database, please do not recommend SQL/Relational DBMS)
If you have prior experience working with data analysis on clusters, from both an analytics and systems point of view (as I am doing both), do you have any advice/tips/tricks?
Any help would be greatly appreciated.
I need to copy the MarkLogic DB contents (50 million XML docs) from one DB host to another. We can do this with a backup/restore. But I need to copy the data from two forests (25 million each) into 20 forests (2.5 million each) and distribute it evenly. Can this be done using xqsync or any other utility?
I'm in the process of doing the same migration this week: 14M documents from two forests on a single host to a cluster with six forests. We have done a couple of trial runs of the migration, using backup/restore followed by a forest rename, then adding the new forests to the cluster. We then use CORB to do the re-balance. It took a little fine-tuning to optimize the number of threads, and we had to adjust a Linux TCP timeout to make sure the CORB process didn't fail partway through the re-balance. I think we ended up using CORB because of the very old version of ML we are currently running.
If you are lucky enough to be able to run under ML7, then this is all a lot easier, along with much-reduced forest storage needs.
As wst indicates, MarkLogic 7 will do that automatically for you by default for new databases. For databases that you upgrade from earlier versions, you need to enable rebalancing manually from the Admin interface. You can find that setting on the Database Configure tab, near the bottom.
After that, you just add new forests to your database as needed, and redistribution is triggered automatically after a slight delay (based on a throttle level, like the reindexer), also across a cluster. You can follow rebalancing from the Database Status page in the Admin interface. It may take a while, though; it is designed to run in the background with low interference.
The other way around is almost as easy. Go to the Forests page under the database and select 'retired' next to the forest you want to remove. This automatically triggers rebalancing of documents away from that forest. Once that is done, you just detach it from the database.
All data is fully searchable and accessible during all this, though response times can be relatively slow, as caches need to be refreshed as well.
HTH
With ML6 or earlier, I would use backup and restore to move the forests, then https://github.com/mblakele/task-rebalancer to rebalance. Afterwards you'll probably want to force a merge to get rid of the deleted fragments in the original forests.