Is there a library that can generate CSV files given a data dictionary and data model in some format? - faker

Is there a library, in any language, that can generate a .csv file for each entity of a data model, with values that comply with a data dictionary?
For example:
the data dictionary is specified in a CSV file with these column names: field,regex,description
the data model is specified in another CSV file with these column names: entity,field
faker comes very close; however, it needs some programming to work against a data model. If there is a wrapper around faker, that might work great, I suppose.
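For what it's worth, a minimal R sketch of the glue code such a wrapper would need, assuming the two CSV layouts above; the file names, the row count and the dummy generator are placeholders, and a real implementation would map each field's regex or description onto a faker-style provider:

library(readr)

dict  <- read_csv("data_dictionary.csv")   # columns: field, regex, description
model <- read_csv("data_model.csv")        # columns: entity, field

# Stand-in generator factory: ignores the regex and returns random 8-character
# codes. A real wrapper would plug faker (or a regex-to-string generator) in here.
make_generator <- function(regex) {
  function(n) replicate(n, paste0(sample(c(LETTERS, 0:9), 8, replace = TRUE), collapse = ""))
}
generators <- setNames(lapply(dict$regex, make_generator), dict$field)

n_rows <- 100
for (ent in unique(model$entity)) {
  fields <- model$field[model$entity == ent]
  cols   <- setNames(lapply(fields, function(f) generators[[f]](n_rows)), fields)
  write_csv(as.data.frame(cols, stringsAsFactors = FALSE), paste0(ent, ".csv"))
}

The point of the sketch is that the "some programming" is mostly a lookup from data-model fields to data-dictionary generators; swapping in a real fake-data library only changes make_generator().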

Related

Read large txt file with a nested (unknown) json structure in R

I have a large (210 038 KB) txt file that contains JSON-structured itinerary data, which I would like to restructure on a per-journey basis; that should be easy enough once I find where in the nesting the journeys are located. My main challenge is that I don't know the structure of the data, and when I try to read it in R with, for instance, read.table('datafile.txt', header=FALSE), it either runs for a very long time and then crashes, or it produces an unsatisfactory result by splitting on the "wrong" character (and then has to restart itself).
I've glanced at this post: Parsing JSON arrays from a .txt file in R - several large files, which is similar to mine, but there the data were separated by newlines. I instead need to iteratively read the JSON structure and find out what it is comprised of.
Any suggestions?
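A minimal first step in R, assuming the whole file is a single JSON document rather than one object per line: read it with jsonlite without simplification and inspect the top levels of the nesting.

library(jsonlite)

# Parse the whole file as one JSON document, keeping the nested list structure
itineraries <- fromJSON("datafile.txt", simplifyVector = FALSE)

# Show only the top couple of levels to find where the journeys sit
str(itineraries, max.level = 2)

Once the journey level is located, that sub-list can be pulled out and simplified into a data frame on its own.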

Dump CSV files to Postgres and read in R while maintaining column data types

I'm new to R and currently working on a project to refactor code so that it reads from a database instead of from CSV files.
The work includes dumping the CSV files into a Postgres database and modifying the existing R scripts to ingest input data from the db tables instead of the CSV files for subsequent transformation.
Right now I have run into an issue: the data frame columns returned from dbGetQuery() have different modes and classes than the original data frame from read_csv().
Since the data I'm reading in has hundreds of columns, it is not that convenient to explicitly specify the mode and class for each column.
Is there an easy way to give the new data frame the same schema as the old one, so I can apply the existing data-transformation code to it?
For example, when I run a comparison between the old data frame and the new one from the db, this is what I see:
VARIABLE   CLASS (from csv)   CLASS (from db)
col1       numeric            integer64
col2       numeric            integer
col3       numeric            integer
This won't be possible in general, because some SQL datatypes (e.g. DATE, TIMESTAMP, INTERVAL) have no equivalent in R, and the R data type factor has no equivalent in SQL. Depending on your R version, strings are automatically converted to factors, so it will at least be useful to import the data with stringsAsFactors=FALSE.
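As a partial workaround, a minimal sketch that coerces the integer and integer64 columns back to numeric so they match the read_csv() schema; the connection con and the table name are placeholders:

library(DBI)
library(bit64)   # needed so integer64 columns convert cleanly

df_db <- dbGetQuery(con, "SELECT * FROM my_table")

# Columns that came back as integer or integer64 but were numeric from read_csv()
to_num <- vapply(df_db, function(x) is.integer(x) || inherits(x, "integer64"), logical(1))
df_db[to_num] <- lapply(df_db[to_num], as.numeric)

If the driver is RPostgres, the bigint argument of dbConnect() should also let bigint columns come back as numeric in the first place.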

Retrieving data from large xml file using node path in R

I am new to XML, and many of the XML examples I have found do not look like my file. I want to extract data from a large XML file using R (a dummy XML file is below). I know that, even though R has memory limitations, extracting specific nodes from a large XML file is possible using xmlEventParse() from the R XML package, provided I name the node path to my target data properly. My final output, in the form of a data frame, should have columns that reflect these nodes: N9:Shareholder, N5:IdentifierElement, N2:NameElement. Thanks for your help.
XML code (the dummy snippet's markup was lost; only the element values FOO LIMITED, 120801 and Companies Register remain)
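A minimal sketch of the branches approach with xmlEventParse(); the file name is a placeholder, the XPath matches on local names to sidestep namespace prefixes, and whether the branch should be keyed on N9:Shareholder or the bare local name depends on how the parser reports the element in this particular file:

library(XML)

rows <- list()

# Called once with each complete N9:Shareholder node
shareholder_branch <- function(node) {
  rows[[length(rows) + 1L]] <<- data.frame(
    IdentifierElement = xpathSApply(node, ".//*[local-name()='IdentifierElement']", xmlValue)[1],
    NameElement       = xpathSApply(node, ".//*[local-name()='NameElement']", xmlValue)[1],
    stringsAsFactors  = FALSE
  )
}

invisible(xmlEventParse(
  "big_file.xml",
  handlers = list(),
  branches = list("N9:Shareholder" = shareholder_branch)
))

shareholders <- do.call(rbind, rows)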

Sparklyr - How to change the parquet data types

Is there a way to change data types of columns when reading parquet files?
I'm using the spark_read_parquet function from sparklyr, but it doesn't have the columns option (from spark_read_csv) for changing them.
In csv files, I would do something like:
data_tbl <- spark_read_csv(sc, "data", path, infer_schema = FALSE, columns = list_with_data_types)
How could I do something similar with parquet files?
Specifying data types only makes sense when reading a data format that does not have built-in metadata on variable types. This is the case with csv or fwf files, which, at most, have variable names in the first row. Thus the read functions for such files have that functionality.
This sort of functionality does not make sense for data formats that have built-in variable types, such as Parquet (or .Rds and .RData in R).
So in this case you should (see the sketch after this list):
a) read the Parquet file into Spark
b) make the necessary data transformations
c) save the transformed data into a Parquet file, overwriting the previous file
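A minimal sketch of those three steps with sparklyr and dplyr; the paths and column names are placeholders:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# a) read the parquet file with whatever types it already carries
data_tbl <- spark_read_parquet(sc, name = "data", path = "path/to/data.parquet")

# b) cast the columns you need; dplyr translates these to CAST in Spark SQL
data_tbl <- data_tbl %>%
  mutate(col1 = as.numeric(col1),
         col2 = as.integer(col2))

# c) write the recast data back out as parquet
spark_write_parquet(data_tbl, path = "path/to/data_recast.parquet", mode = "overwrite")

Writing to a fresh path and swapping it in afterwards avoids reading from and overwriting the same file within one job.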

VCorpus RStudio combining .txt files

I have a directory of .txt files and need to combine them into one file; each file would be a separate line. I tried:
new_corpus <- VCorpus(DirSource("Downloads/data/"))
The data is in that directory, but I get an error:
Error in DirSource(directory = "Downloads/data/") :
empty directory
This is a bit basic, but this is the only information I was given on how to create the corpus. What I need to do is take these files and create one factor that is the .txt text and another that is an ID, in the form of:
ID .txt
ID .txt
.......
EDIT To clarify on emilliman5's comment:
I need both a data frame and a corpus. The example I am working from used a csv file with the data already tagged for a Naive Bayes problem. I can work through that example and all of its steps, but the data I have is in a different format: it is 2 directories (/ham and /spam) of short .txt files. I was able to create a corpus when I changed my command to:
new_corpus <- VCorpus(DirSource("~/Downloads/data/"))
I have cleaned the raw data and can make a DTM, but at the end I will need to create a crossTable with the labels spam and ham. I do not understand how to insert that information into the corpus.
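A minimal sketch of one way to get both objects, assuming the /ham and /spam layout from the edit: read each directory into a labelled data frame first, then build the corpus from that data frame so the labels stay aligned with the documents.

library(tm)

read_dir <- function(path, label) {
  files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
  data.frame(
    ID    = basename(files),
    text  = vapply(files, function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                   character(1)),
    type  = label,
    stringsAsFactors = FALSE
  )
}

msgs <- rbind(read_dir("~/Downloads/data/ham",  "ham"),
              read_dir("~/Downloads/data/spam", "spam"))

# The corpus is built from the data frame, so msgs$type stays in document order
new_corpus <- VCorpus(VectorSource(msgs$text))

The spam/ham labels never have to live inside the corpus itself; msgs$type can be used directly as the factor for the crossTable, since the DTM rows keep the same order as the data frame.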
