Retrieving data from a large XML file using a node path in R

I am new to XML, and the examples of XML nodes I have found do not match my file. I want to extract data from a large XML file using R (a dummy XML file is below). I understand that, even though R has memory limitations, extracting specific nodes from a large XML file is possible with xmlEventParse() from the R XML package, provided the node path to my target data is specified properly. My final output should be a data frame with columns that reflect the nodes N9:Shareholder, N5:IdentifierElement, and N2:NameElement. Thanks for your help.
XML code (dummy file): sample values FOO LIMITED, 120801, Companies Register

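A minimal sketch of the xmlEventParse() approach the question refers to. The dummy document, its namespace URIs, and the nesting are invented stand-ins so the example is self-contained; only the element names come from the question, so the paths would need to be adjusted to the real file:

library(XML)

# made-up dummy file, just so the sketch runs end to end
dummy <- '<?xml version="1.0"?>
<N9:Register xmlns:N9="urn:ex:n9" xmlns:N5="urn:ex:n5" xmlns:N2="urn:ex:n2">
  <N9:Shareholder>
    <N2:NameElement>FOO LIMITED</N2:NameElement>
    <N5:IdentifierElement>120801</N5:IdentifierElement>
  </N9:Shareholder>
</N9:Register>'
writeLines(dummy, "dummy.xml")

rows <- new.env(); rows$data <- list()   # collector: one row per Shareholder node

shareholderBranch <- function(node, ...) {
  # local-name() sidesteps the namespace URIs, which may not be known up front
  id   <- xpathSApply(node, ".//*[local-name()='IdentifierElement']", xmlValue)
  name <- xpathSApply(node, ".//*[local-name()='NameElement']",       xmlValue)
  rows$data[[length(rows$data) + 1L]] <- data.frame(
    IdentifierElement = if (length(id))   id[1]   else NA_character_,
    NameElement       = if (length(name)) name[1] else NA_character_,
    stringsAsFactors  = FALSE)
}

invisible(xmlEventParse(
  "dummy.xml",
  handlers = list(),
  # if the branch never fires on the real file, try the qualified name "N9:Shareholder"
  branches = list(Shareholder = shareholderBranch)))

result <- do.call(rbind, rows$data)   # final data frame

Because branches only build the matched subtree in memory, this pattern scales to files far larger than what xmlParse() could load whole.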
Related

Read large txt file with a nested (unknown) json structure in R

I have a large (210 038 KB) txt file containing JSON-structured itinerary data, which I would like to organize on a per-journey basis; that should be easy enough once I can find where in the nesting that level sits. My main challenge is that I don't know the structure of the data, and when I try to read it in R with, for instance, read.table('datafile.txt', header=FALSE), it either runs for a very long time and then crashes, or it produces an unsatisfactory result by splitting on the "wrong" character (and then R has to be restarted).
I've glanced at this post: Parsing JSON arrays from a .txt file in R - several large files, which is similar to mine, but there the data were separated by newlines. I instead need to read the JSON structure iteratively and find out what it is comprised of.
Any suggestions?
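A small sketch of one way to explore such a file, assuming the whole file is a single (deeply nested) JSON document; the file name and the "journeys" key are placeholders:

library(jsonlite)

x <- fromJSON("datafile.txt", simplifyVector = FALSE)  # parse the whole file into nested lists (needs enough RAM)

str(x, max.level = 2)   # inspect the top levels of the structure
names(x)                # candidate keys, e.g. where the journeys might live

# once the journey level is found (the name "journeys" is a guess):
# journeys <- fromJSON("datafile.txt")$journeys   # default simplification gives a data frame
# df <- jsonlite::flatten(journeys)               # flatten nested columns for analysis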

Writing new data to an existing excel file that has an XML map attached, without losing the XML data in R

I am trying to write to an Excel file that needs to be uploaded somewhere. The target software creates an Excel file which has an XML map attached to it. I recreated the entire file structure in R using code, but any time I try to write to that Excel file, I think R actually deletes the old file and creates a new one instead, because the XML map is gone the moment I start writing any data to it. Loading up the workbook also doesn't seem to bring in the XML map, only the workbook data and sheets.
Is there a way to write data to this existing file within R (or Python) without losing the XML map? At the moment I need to generate a file and manually copy-paste the data into the other Excel file.
I've been trying with the xlsx, readxl and xml2 packages.
In the past I've dealt with a similar problem. To my knowledge, almost all the R packages that interact with Excel replace the entire file with a new one, except the openxlsx package: it lets you replace specific sheets and ranges of cells without touching the rest (data, styling, etc.). One last comment: I don't know much about XML maps, but maybe you are lucky.
Here is the vignette:
https://cran.r-project.org/web/packages/openxlsx/vignettes/Introduction.html
Hope it helps
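For the sheet-replacement approach described above, a minimal sketch with openxlsx could look like this (the file name, sheet name, and my_df are placeholders; whether the attached XML map survives the round trip is exactly the open question here):

library(openxlsx)

wb <- loadWorkbook("upload_template.xlsx")      # open the existing workbook and keep its parts
writeData(wb, sheet = "Data", x = my_df,        # my_df: the data frame built in R
          startCol = 1, startRow = 2,           # leave the existing header row in place
          colNames = FALSE)
saveWorkbook(wb, "upload_template.xlsx", overwrite = TRUE)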

Is there a library that can generate CSV files given a data dictionary and a data model in some format?

Is there a library, in any language, that can generate .csv files for each entity of a data model such that the values comply with a data dictionary?
For example:
data dictionary is specified in a csv file with these column names - field,regex,description
data model is specified in another csv file with these column names - entity,field
faker comes very close; however, it needs some programming to work with a data model. If there is a wrapper around faker, that might work great, I suppose.
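Absent a ready-made wrapper, a rough base-R sketch of that "some programming" could look like the following; the file names are assumptions, and the placeholder generator ignores the regex column (real regex-driven values would need faker or similar):

dictionary <- read.csv("dictionary.csv", stringsAsFactors = FALSE)  # field, regex, description (regex unused below)
model      <- read.csv("model.csv",      stringsAsFactors = FALSE)  # entity, field

fake_value <- function(field, n) paste0(field, "_", seq_len(n))     # placeholder values only

n_rows <- 10
for (ent in unique(model$entity)) {
  fields <- model$field[model$entity == ent]
  df <- as.data.frame(setNames(lapply(fields, fake_value, n = n_rows), fields),
                      check.names = FALSE, stringsAsFactors = FALSE)
  write.csv(df, paste0(ent, ".csv"), row.names = FALSE)             # one CSV per entity
}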

Is there a way to compare the structure/architecture of .nc files in R?

I have a sample .nc file that contains a number of variables (5 to be precise) and is being read into a program. I want to create a new .nc file containing different data (and different dimensions) that will also be read into that program.
I have created a .nc file that looks the same as my sample file (I have included all of the necessary attributes for each of the variables that were included in the original file).
However, my file is still not being ingested.
My question is: is there a way to test for differences in the layout/structure of .nc files?
I have examined each of the variables/attributes within RStudio, and I have also opened the files in Panoply, and they look the same. There are obviously differences (besides the actual data that they contain), since the file is not being read.
I see that there are options to compare the actual data within .nc files online (Comparison of two netCDF files), but that is not what I want. I want to compare the variable/attributes names/states/descriptions/dimensions to see where my file differs. Is that possible?
The ideal situation here would be to create a .nc template from the variables that exist within the original file and then fill in my data. I could do this by defining the dimensions (ncdim_def), creating the file (nc_create), getting my data (ncvar_get) and putting it in the file (ncvar_put), but that is what I have done so far, and it is too reliant on me not making an error (which I obviously have, as the files are not the same).
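For reference, a minimal sketch of such a metadata comparison with the ncdf4 package (the file names are placeholders, and the comparison logic is only an illustration):

library(ncdf4)

describe_nc <- function(path) {
  nc <- nc_open(path)
  on.exit(nc_close(nc))
  list(vars = names(nc$var),                                    # variable names
       dims = sapply(nc$dim, function(d) d$len),                # dimension lengths
       atts = setNames(lapply(names(nc$var), function(v) ncatt_get(nc, v)),
                       names(nc$var)))                          # per-variable attributes
}

a <- describe_nc("sample.nc")
b <- describe_nc("myfile.nc")

setdiff(a$vars, b$vars)   # variables present in the sample but missing from my file
all.equal(a, b)           # lists every metadata difference it finds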
If you are on Unix, this is more easily achieved using CDO. See the Information section of the reference card: https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf.
For example, if you wanted to check that the grid descriptions are the same in the two files, just do:
cdo griddes example1.nc
cdo griddes example2.nc
You can easily use system() in R to wrap around this.
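A minimal sketch of that wrapper (the file names are placeholders):

grid1 <- system("cdo griddes sample.nc", intern = TRUE)   # capture the grid description as text
grid2 <- system("cdo griddes myfile.nc", intern = TRUE)

setdiff(grid1, grid2)   # lines in the sample's grid description but not in mine
setdiff(grid2, grid1)   # and vice versa

# the same pattern works for other read-only commands, e.g. "ncdump -h",
# to diff variable and attribute metadata rather than the grid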

Import .sps codebook in R

In many micro-data catalogs of household surveys (for instance http://microdata.worldbank.org), the data dictionary (i.e. the codebook) is actually described in a .sps or .sas syntax text file that follows a clear structure. The scripts include the mapping between question and modality labels and their names within the raw dataset.
See, for instance, the first downloadable zip file within any open record from the catalog.
Is there an already-available R function that can parse the .sps syntax file (better than .sas, as the question labels are fully preserved in the .sps) in order to obtain a data frame that makes it easy to re-encode the dataset?
The closest I found is http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html but it does not work out of the box for an .sps file.
There was also an old discussion here: http://r.789695.n4.nabble.com/how-to-read-sps-SPSS-file-extension-td875309.html and here: Input data into R from .dat and .sps files, but no solution was provided.
Thanks in advance!
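As a starting point, a rough base-R sketch that pulls variable labels out of a VARIABLE LABELS block in an .sps syntax file (the file name is a placeholder, and the block and quoting conventions are assumptions about the syntax files in question):

lines <- readLines("codebook.sps", warn = FALSE)

start <- grep("^\\s*VARIABLE LABELS", lines, ignore.case = TRUE)[1]
end   <- grep("\\.\\s*$", lines)          # SPSS commands end with a period
end   <- end[end >= start][1]

block <- lines[start:end]
block[1] <- sub("VARIABLE LABELS", "", block[1], ignore.case = TRUE)

# each entry is assumed to look like:  varname "Question label"
m <- regmatches(block, regexec("^\\s*(\\S+)\\s+[\"'](.+)[\"']", block))
codebook <- do.call(rbind, lapply(m[lengths(m) == 3], function(x)
  data.frame(name = x[2], label = x[3], stringsAsFactors = FALSE)))

The same pattern could be repeated for the VALUE LABELS block to recover the modality labels.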
