Import .sps codebook in R - r

In many microdata catalogs of household surveys (for instance http://microdata.worldbank.org ...), the data dictionary (i.e. the codebook) is described within a .sps or .sas syntax text file that follows a clear structure. These scripts include the mapping between question and modality labels and the corresponding variable names in the raw dataset.
See for instance any of the first downloadable zip files below within any open record from the catalog:
Is there an existing R function that can parse the .sps syntax file (preferable to .sas, as the question labels are fully preserved in the .sps...) and return a data frame that makes it easy to re-encode the dataset?
The closest I found is http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html but it does not work out of the box for a .sps file.
There was also an old discussion here: http://r.789695.n4.nabble.com/how-to-read-sps-SPSS-file-extension-td875309.html and here: Input data into R from .dat and .sps files, but no solution was provided...
Thanks in advance!
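To make the goal concrete, here is a minimal sketch of the kind of parser I have in mind, in base R. It assumes a simplified layout that real .sps files may not follow: a VARIABLE LABELS block with one name "label" pair per line, closed by a line containing only a period. The function name parse_sps_variable_labels and the sample lines are hypothetical; for a real file the lines would come from readLines() on the .sps file.

```r
# Hypothetical helper: extract variable labels from the VARIABLE LABELS block
# of an SPSS syntax file, assuming one `name "label"` pair per line.
parse_sps_variable_labels <- function(lines) {
  start <- grep("^\\s*VARIABLE LABELS", lines, ignore.case = TRUE)
  if (length(start) == 0) {
    return(data.frame(name = character(0), label = character(0),
                      stringsAsFactors = FALSE))
  }
  # The block is assumed to end at the first line holding only a period.
  ends <- grep("^\\s*\\.\\s*$", lines)
  end <- ends[ends > start[1]][1]
  if (is.na(end)) end <- length(lines) + 1
  block <- lines[seq(start[1] + 1, length.out = end - start[1] - 1)]
  # Capture the variable name and the quoted label on each line.
  pattern <- "^\\s*/?\\s*(\\S+)\\s+[\"'](.*)[\"']\\s*\\.?\\s*$"
  hits <- regmatches(block, regexec(pattern, block))
  hits <- hits[lengths(hits) == 3]
  data.frame(
    name = vapply(hits, `[`, character(1), 2),
    label = vapply(hits, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# Illustrative input only; not taken from a real catalog file.
sps <- c(
  "VARIABLE LABELS",
  "  hhid \"Household identifier\"",
  "  sex \"Sex of respondent\"",
  "  .",
  "VALUE LABELS",
  "  sex 1 \"Male\" 2 \"Female\"",
  "  ."
)
codebook <- parse_sps_variable_labels(sps)
print(codebook)
```

Value labels would need a second pass over the VALUE LABELS block, and multi-line or slash-separated entries are not handled here.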

Related

Is there a way to compare the structure/architecture of .nc files in R?

I have a sample .nc file that contains a number of variables (5 to be precise) and is being read into a program. I want to create a new .nc file containing different data (and different dimensions) that will also be read into that program.
I have created a .nc file that looks the same as my sample file (I have included all of the necessary attributes for each of the variables that were included in the original file).
However, my file is still not being ingested.
My question is: is there a way to test for differences in the layout/structure of .nc files?
I have examined each of the variables/attributes within RStudio and I have also opened the files in Panoply, and they look the same. There are obviously differences (besides the actual data that they contain), since my file is not being read.
I see that there are options to compare the actual data within .nc files online (Comparison of two netCDF files), but that is not what I want. I want to compare the variable/attributes names/states/descriptions/dimensions to see where my file differs. Is that possible?
The ideal situation here would be to create a .nc template from the variables that exist in the original file and then fill in my data. I could do this by defining the dimensions (ncdim_def), creating the file (nc_create), getting my data (ncvar_get) and putting it in the file (ncvar_put), but that is what I have done so far, and it relies too heavily on my not making an error (which I obviously have, as the files are not the same).
If you are on Unix, this is more easily achieved using CDO. See the Information section of the reference card: https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf.
For example, if you wanted to check that the grid descriptions are the same in both files, just do:
cdo griddes example1.nc
cdo griddes example2.nc
You can easily wrap this in R using system().
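A minimal sketch of such a wrapper, assuming the cdo binary is installed and on the PATH. It uses system2() rather than system() so the output can be captured as a character vector; the helper names cdo_info and diff_descriptions are made up for illustration.

```r
# Hypothetical helper: capture the output of a cdo operator for one file.
# Assumes the `cdo` binary is installed and on the PATH.
cdo_info <- function(file, operator = "griddes") {
  system2("cdo", c(operator, shQuote(file)), stdout = TRUE)
}

# Report the lines that appear in one description but not in the other.
diff_descriptions <- function(a, b) {
  list(only_in_first = setdiff(a, b), only_in_second = setdiff(b, a))
}

# Usage (not run here, since it needs cdo and the two files):
# diff_descriptions(cdo_info("example1.nc"), cdo_info("example2.nc"))
```

The operator argument can be swapped for other operators listed in the Information section of the reference card to compare more than the grid.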

Retrieving data from large xml file using node path in R

I am new to XML, and many of the XML examples I found have nodes that differ from those in my file. I want to extract data from a large XML file using R (a dummy XML file is below). I know that, even though R has memory limitations, extracting specific nodes from a large XML file is possible using xmlEventParse() from the R XML package, provided the node path to the target data is named properly. My final output, in the form of a data frame, should have columns that reflect these nodes: N9:Shareholder, N5:IdentifierElement, N2:NameElement. Thanks for your help.
XML code
FOO LIMITED
120801
Companies Register

How to export a dataset to SPSS?

I want to export a dataset in the MASS package to SPSS for further investigation. I'm looking for the EuStockMarkets data set in the package.
As described in http://www.statmethods.net/input/exportingdata.html, I did:
library(foreign)
write.foreign(EuStockMarkets, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
I got a text file, but the .sps file is not a valid SPSS file. I'm really looking for a way to export the dataset to something that SPSS can open.
As Thomas has mentioned in the comments, write.foreign doesn't generate native SPSS datafiles (.sav). What it does generate is the data in a comma delimited format (the .txt file) and a basic syntax file for reading that data into SPSS (the .sps file). The EuStockMarkets data object class is multivariate time series (mts) so when it's exported the metadata is lost and the resulting .sps file, lacking variable names, throws an error when you try to run it in SPSS. To get around this you can export it as a data frame instead:
write.foreign(as.data.frame(EuStockMarkets), "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
Now you just need to open mydata.sps as a syntax file (NOT as a datafile) in SPSS and run it to read in the datafile.
Rather than exporting it, use the STATS GET R extension command. It will take a specified data frame from an R workspace/dataset and convert it into a Statistics dataset. You need the R Essentials for Statistics and the extension command, which are available via the SPSS Community site (www.ibm.com/developerworks/spssdevcentral)
I'm not trying to re-answer a question that has already been answered. I just think there is something to add for other users looking for this.
On your SPSS window, you just need to find the first line of code and edit it. It should be something like this:
"file-name.txt"
You need to find the folder path where you're keeping your file:
"C:\Users\DELL\Google Drive\Folder-With-Your-File"
Then you just need to add this path to your file's name:
"C:\Users\DELL\Google Drive\Folder-With-Your-File\file-name.txt"
Otherwise SPSS will not recognize the .txt file.
Sorry if I'm repeating some information here, I just wanted to make it easier to understand.
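This approach expects a (labelled) data frame. Since EuStockMarkets is a multivariate time series rather than a data frame, convert it with as.data.frame() first.
This should work and even keep the variable and value labels:
library(sjlabelled)
write_spss(as.data.frame(EuStockMarkets), "mydata.sav")
Or try rio:
rio::export(as.data.frame(EuStockMarkets), "mydata.sav")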

Can SAS still read or create the combination of {.dat fixed-column ascii data file, .sas syntax file}, or is it obsolete?

In the past I have used the excellent SAScii package in R to read in this type of data: {.dat fixed-column data file + the corresponding .sas "syntax" file}. I want to be quite precise about that because there is no end of ambiguity surrounding phrases like "SAS file". These .dat files contain only integers, and the .sas files specify both the way to parse the columns and the way the integers represent the values in the actual data (this feature is sometimes called the "codebook".) I have found very good data in that format (i.e. in the form of the pair of files {.dat, .sas}) from places like Minnesota Population Center's IPUMS https://usa.ipums.org/usa/, and built up a lot of tools to analyze it using R and SAScii.
Now I have access to SAS itself, but would still like to re-use some of my tools and techniques. However, I can find no reference in SAS to data like that {fixed-column data in .dat, syntax file in .sas}. Has that format been entirely superseded within SAS (perhaps by the SAS7BDAT format)? Or perhaps the {.dat, .sas} format was never used within SAS? The reason I ask is that, now that I have access to SAS and so much data in SAS7BDAT format, I would like to be able to export some of it in {.dat, .sas} format for use with my own tools.
Thanks very much, and cheers - Ed
I don't think this is something built into SAS. You could, however, write such a program pretty easily.
First off, Chris Hemedinger has written something that basically does this (it creates datalines, not a .dat file, but that shouldn't be too hard to change if you know .NET, or you could modify the R module to accept datalines). That is discussed and available here. The title of the post is "Turn your data set into a data step program". This is roughly equivalent to the SQL Server task that creates "Create Table" code out of a table. This would only work in Enterprise Guide, although you should be able to do roughly the same thing in a standalone .NET program.
Second, you can easily write something like this in Base SAS. Creating the datalines is easy, and there are numerous ways to write out to a file.
For a CSV, for example, you can do this.
ods csv file="c:\temp\mydata.csv";
proc print data=mydata;
run;
ods csv close;
If you're going to write a flat file, you might as well make the input/output .sas first - after all it can be almost the same code. You can query dictionary.columns to generate the code, both the input and output code. Create a table with the variable names, lengths, and formats for each variable, then process it in a data step advancing the start variable by the length of each variable (so it moves to the next position after the last one finished). If you need formats for your R project, then proc format cntlout=<datasetname> will generate a dataset that contains those formatted value translations, and you can write that out in whatever format you need as well.
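The position bookkeeping described above (each variable starts where the previous one ended) can be sketched in a few lines of R. This only illustrates the arithmetic, not SAS code: the layout table, the variable names and the records are hypothetical, and the generated text uses SAS column input (name start-end), which should be checked against what the downstream reader expects.

```r
# Hypothetical layout table: variable names and fixed column widths.
layout <- data.frame(
  name = c("year", "sex", "age"),
  width = c(4L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Each variable starts where the previous one ended.
layout$start <- cumsum(c(1L, head(layout$width, -1)))
layout$end <- layout$start + layout$width - 1L

# Body of a SAS INPUT statement using column input (name start-end).
input_lines <- sprintf("  %s %d-%d", layout$name, layout$start, layout$end)
sas_input <- c("input", input_lines, ";")

# Matching fixed-width records: left-pad each value to its column width.
records <- data.frame(year = c(2001L, 2002L), sex = c(1L, 2L), age = c(34L, 7L))
padded <- Map(function(col, w) formatC(col, width = w),
              records[layout$name], layout$width)
dat_lines <- do.call(paste0, unname(padded))

writeLines(sas_input)
writeLines(dat_lines)
```

The same layout table drives both the syntax file and the data file, which keeps the two from drifting apart.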

X12 seasonal adjustment program from census, problem with input file extensions

I downloaded the X12 seasonal adjustment program located here: http://www.census.gov/srd/www/x12a/x12downv03_pc.html
I followed the setup and got the settings correct. When I go to select an input file, I have four options for file extensions: ".spc", ".mta", ".dta" and ".".
The problem is that my data is in Excel, and despite searching extensively I cannot figure out how to get data from Excel into one of these formats so that I can seasonally adjust it. Thanks
ADDED: After converting to a .dta file (using R, thanks to the comments left below), it looks like the program also makes you convert it to a .spc file. Anyone have a lead on how to do this? Thanks
My first reaction is to:
(1) export the data from Excel in something simple like csv.
(2) import that data into R
(3) use the R library "foreign" to export the data in .dta format.
So with the file "test.csv" containing:
V1,V2
1,2
3,4
5,6
you could do the following to produce "test.dta":
library(foreign)
testdata <- read.csv("test.csv")
write.dta(testdata,"test.dta")
Voila, data in .dta format. Would this work for what you have?
I've only ever used the command-line version of X12, but it sounds like you may be using the windows interface instead? If so the following might not be entirely accurate, but it should be close enough (I hope!).
The .dta and .mta files you refer to are just metafiles containing text lists of either spec files or data files to be processed; in particular the .dta files X12 uses are NOT Stata data format files like those produced by Nathan's R-based answer. It's probably best to ignore using metafiles until you are comfortable enough using the software to adjust a single time series.
You can export your data in tab separated variable format (year month/quarter value) without headings and use that as your data file. You can also use a simple list of data values separated by spaces, tabs, or newlines and then tell X12ARIMA what the start and end dates of the series are in the .spc file.
The .spc file doesn't contain the input data, it's a specification file telling X12 where to find the data file and how you want those data to be processed -- you'll have to write them yourself or create them in Win X-12.
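As a rough sketch of that split between data file and spec file, the following R code writes a "year month value" data file and a skeleton .spc that points at it. The values, file names and title are placeholders, and the series arguments used here (file, format="datevalue", period) and the bare x11{} spec are my assumptions about a minimal setup; check them against the reference documentation before relying on them.

```r
# Hypothetical monthly series; values and names are illustrative only.
values <- c(112, 118, 132, 129)
dates <- seq(as.Date("1972-07-01"), by = "month", length.out = length(values))

# Data file: one "year month value" record per line, tab separated, no header.
data_file <- file.path(tempdir(), "sales.dat")
write.table(
  data.frame(format(dates, "%Y"), as.integer(format(dates, "%m")), values),
  data_file, sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE
)

# Skeleton spec file telling X-12-ARIMA where the data file is.
spc_file <- file.path(tempdir(), "sales.spc")
writeLines(c(
  "series{",
  "  title=\"Example series\"",
  sprintf("  file=\"%s\"", data_file),
  "  format=\"datevalue\"",
  "  period=12",
  "}",
  "x11{}"
), spc_file)

writeLines(readLines(spc_file))
```

From here the skeleton can be tweaked by hand for each series, as suggested below.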
Ideally you should write a separate .spc file for each time series to be adjusted; while you can write a .spc file which invokes many of X12's autoselection and identification procedures, it's usually not a good idea to treat the process as a black box, and a bit of manual intervention in the .spc is often necessary to get a good quality adjustment (and essential if there's a seasonal break involved). I find it helpful to start with a fairly generic skeleton .spc file suitable for your computing environment to begin with and then tweak it from there as appropriate for each series.
If you really want to use a single .spc file to adjust multiple series, then you can provide a list of data files in a .dta file and a single .spc file instructing X12ARIMA how to adjust them, but take care to ensure this is appropriate for your data!
The "Getting started with X-12-ARIMA input files on your PC" document on that site is probably a good place to start reading, but you'll probably end up having to consult the complete reference documentation (in particular Chapters 3 and 7) as well.
Edit postscript:
The UK Office for National Statistics have a draft of their guide to seasonal adjustment with X12ARIMA available online here (archive.org), and it is worth a look. It's a good bit easier to work through than the Census Bureau documentation.
Ryan,
This is not elegant, but it might work for you. In this example I'm trying to replicate the spec file from the Example 3.2 in the Census documentation.
Concatenate the data into one text string, then save this single text string using the MS-DOS (TXT) format under the SAVE AS command. To make the text string, first insert two cells above your column header and type the following text into the second one.
series{title=
Next, insert double quotation marks before and after the text in your column header, like this:
"Monthly Retail Sales of Household Appliance Stores"
The data values need to sit inside data=( ... ), so put data=( in a new cell directly below the column header. Then, directly below the last data row, insert rows of text that close the data list and give the model specification, like the following:
)
start= 1972.jul}
transform{function = log}
regression{variables=td}
identify{diff=(0,1) sdiff=(0,1)}
So you should have something like the following:
<blank row>
series{title=
"Monthly Retail Sales of Household Appliance Stores"
data=(
530
529
...
592
590
)
start= 1972.jul}
transform{function = log}
regression{variables=td}
identify{diff=(0,1) sdiff=(0,1)}
For the next instructions I am assuming that the text series{title= appears in cell A2 and that cell B1 is empty. In cell B2, insert the following:
=CONCATENATE(B1,A2," ")
Then copy this formula into every cell down the column to concatenate all of the text in column A into a single cell at the end of column B. Finally, copy the final cell to cell A1 of a new spreadsheet using PASTE SPECIAL/VALUE, and save this spreadsheet using SAVE AS: TXT (MS-DOS), but change the extension to ".spc".
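If R is available, the same spec text can be assembled without the spreadsheet formulas. This is only a sketch of the idea: the four values are placeholders standing in for the full data column, the output file name is made up, and it assumes the observations are wrapped in data=( ... ) inside the series spec.

```r
# Placeholder values; substitute the full column of observations.
values <- c(530, 529, 592, 590)
title <- "Monthly Retail Sales of Household Appliance Stores"

# Assemble the spec line by line, with the data inline in the series spec.
spec <- c(
  sprintf("series{title=\"%s\"", title),
  "data=(",
  paste(values, collapse = " "),
  ")",
  "start=1972.jul}",
  "transform{function=log}",
  "regression{variables=td}",
  "identify{diff=(0,1) sdiff=(0,1)}"
)

# Hypothetical output location; any path ending in .spc would do.
spc_path <- file.path(tempdir(), "example.spc")
writeLines(spec, spc_path)
```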
Good luck (and from the little I read of the Census documentation - you'll need it).
