Reading Stata files with many variables in Stata and R - r

I am trying to read a 320 MB Stata file with more than 5,000 variables in Stata and in R. I first used Stata to read the file, but the maximum number of variables it can read is 5,000, so I can't use Stata to read the file. My questions are:
Is there a way to read the Stata file in Stata by first asking it to keep only certain variables (I know which ones) so that the number of variables falls below 5,000?
Is there a way to read this Stata file in R? I am using 32-bit (Vista) and R gives me the error "Error: cannot allocate vector of size 21k.Kb".
I used the following Stata and R code to read the file:
#The stata file is in the webpage: http://www.federalreserve.gov/econresdata/scf/scf_2010survey.htm#STATADAT
#1. set mem 400m
set maxvar 4000
use p10i6.dta, clear
keep x8166 x8167 x8168 x8163 x8164 x2422 x2506 x2606 x2623 x604 x614 x623 x716 x507 x513 x526 x1706 x1705 x1806 x1805 x1906 x1905 x2002 x2012 x1409 x1509 x1609 x1415 x1515 x1615 x1417 x1517 x1617 x1619 x1621 x3124 x3224 x3324 x3129 x3229 x3329 x3335 x3408 x3412 x3416 x3420 x3424 x3428 x4020 x4024 x4028 x4018 x4022 x4026 x4030 x4022 x4026 x4030 x4018 x3507 x3511 x3515 x3519 x3523 x3527 x3506 x3510 x3514 x3518 x3522 x3526 x3529 x3804 x3807 x3810 x3813 x3816 x3818 x3930 x3721 x3821 x3823 x3825 x3827 x3829 x3822 x3824 x3826 x3828 x3830
save p10i6.dta, replace
#2.
library(foreign)
year<-2010
yr <- substr( year , 3 , 4 )
p10i6.dta<-read.dta(paste0( "p" , yr , "i6.dta" ))
saveRDS(p10i6.dta,file=paste0( "p" , yr , "i6.rda" ))
p10i6.rda<-readRDS(paste0( "p" , yr , "i6.rda" ))

To read the data into R, there might be a way to do this with the memisc package's Stata.file function. Instead of reading in all the variables, select only the ones you need using subset. For example:
require(memisc)
?Stata.file
d1 <- subset(
  Stata.file(paste0( "p" , yr , "i6.dta" )),
  select = c(x8166, x2606, x2623, x604)
)

I presume you have Stata/IC or an older version of Stata. The current Stata/SE and Stata/MP can handle more than 32,000 variables, so the first logical step would be to upgrade to a flavour of Stata that can handle larger datasets. That is, unless the problem originates from a lack of available memory; Stata's exact error message would help to determine that.
As Richard Herron already noted in the comments, you should be able to read in a subset of the data using:
use X8166 X8167 ... using p10i6, clear
Remember that Stata is case sensitive: according to the website you linked to, the variables are called X... rather than x....
If you want to load the data into R using the foreign package, make sure you set the memory available to R to the maximum possible. On 32-bit Vista that's going to be about 3.5 GB:
memory.limit(3500)
If that doesn't help, your dataset is simply too big for that setup, and you can use any of the ASCII methods in either Stata or R to load the ASCII data provided on the website.

Related

Retain SPSS value labels when working with data

I am analysing student-level data from PISA 2015. The data is available in SPSS format here.
I can load the data into R using the read_sav function in the haven package. I need to be able to edit the data in R and then save/export the data in SPSS format with the original value labels that are included in the SPSS download intact. The code I have used is:
library(haven)
student<-read_sav("CY6_MS_CMB_STU_QQQ.sav",user_na = T)
student2<-data.frame(student)
#some edits to data
write_sav(student2,"testdata1.sav")
When my colleague (who works in SPSS) tries to open the "testdata1.sav" the value labels are missing. I've read through the haven documentation and can't seem to find a solution for this. I have also tried read/write.spss in the foreign package but have issues loading in the dataset.
I am using R version 3.4.0 and the latest build of haven.
Does anyone know if there is a solution for this? I'd be very grateful of your help. Please let me know if you require any additional information to answer this.
library(foreign)
df <- read.spss("spss_file.sav", to.data.frame = TRUE)
This may not be exactly what you are looking for, because it uses the labels as the data. So if you have an SPSS file with 0 for "Male" and 1 for "Female," you will have a df with values that are all Males and Females. It gets you one step further, but perhaps isn't the whole solution. I'm working on the same problem and will let you know what else I find.
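If you would rather keep the underlying numeric codes instead of the labels, read.spss has a use.value.labels argument; something like this should work (untested on the PISA file):
library(foreign)
# use.value.labels = FALSE keeps the 0/1 codes rather than converting them to factor labels
df_codes <- read.spss("CY6_MS_CMB_STU_QQQ.sav", to.data.frame = TRUE,
                      use.value.labels = FALSE)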
library ("sjlabelled")
student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")
student2 <-student
write_spss(student2,"testdata1.sav")
I have not tried it myself, but I hope it works. The sjlabelled package copes well with non-ASCII characters such as German umlauts.
But keep in mind that R stores the labels as attributes, and these attributes are lost during some data transformations (subsetting the data, for example). Once lost in R, they of course won't show up in SPSS either. The sjlabelled::copy_labels function is helpful in those cases:
student2 <- copy_labels(student2, student) #after data transformations and before export to spss
I think you need to recover the value labels in the data frame after importing the dataset into R, and then write that data frame to a .sav file.
# load libraries
library(haven)
library(purrr)   # for map()
library(dplyr)   # for mutate_at() and %>%
# load dataset
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = T)
# map to find the class of each column
map_dataset <- map(student, function(x) attr(x, "class"))
# loop to identify all haven-labelled variables
factor_variable <- c()
for(i in 1:length(map_dataset)){
  if(!is.null(map_dataset[[i]])){
    name <- names(map_dataset[i])
    factor_variable <- c(factor_variable, name)
  }
}
# convert all haven-labelled variables into factors
student2 <- student %>%
  mutate_at(factor_variable, as_factor)
# write dataset
write_sav(student2, "testdata1.sav")

use ape to phase a fasta file and create a DNAbin file as output, then test tajima's D using pegas

I'm trying to complete the very simple task of reading in an unphased fasta file, phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences): tajima.test() needs to compute the matrix of all pairwise distances. You can check this by trying:
dist.dna(DNAbin8c18, "N")
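A back-of-envelope calculation shows why the error reports roughly 2,489 Gb: a distance object for n sequences stores n(n-1)/2 doubles, and with the n = 817452 sequences in DNAbin8c18 that alone needs about 2.5 TB of RAM:
n <- 817452                   # number of sequences reported for DNAbin8c18
n * (n - 1) / 2 * 8 / 1024^3  # size of the pairwise-distance object in GiB, ~2489.3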
One possibility for you is to randomly sample some observations and repeat this many times, e.g.:
n <- nrow(DNAbin8c18)   # total number of sequences
tajima.test(DNAbin8c18[sample(n, size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(NA, nrow = 3, ncol = N)   # one row per statistic, one column per repeat
for (i in 1:N)
  RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' so that the run doesn't take too long. Then you may look at the distribution of each row of RES (the D statistic and the two p-values) across the repeats.
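For example, one way to summarise those repeats, assuming RES is filled as above with the components returned by tajima.test():
rownames(RES) <- c("D", "Pval.normal", "Pval.beta")
apply(RES, 1, summary)                                  # distribution of each statistic
hist(RES["D", ], main = "Tajima's D across subsamples")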

Reading haven created dta file in Stata - how to deal with dots in variable names?

We are working in Stata with data created in R that have been exported using the haven package. We stumbled upon an issue with variables that have a dot in the name. To replicate the problem, some minimal R code:
library("haven")
var.1 <- c(1,2,3)
var_2 <- c(1,2,3)
test_df <- employ.data <- data.frame(var.1, var_2)
str(test_df)
write_dta(test_df, "D:/test_df.dta")
Now, in Stata, when I do:
use "D:\test_df.dta"
d
First problem: I get an empty dataset. Second problem: we get a variable name with a dot, which should be illegal in Stata. Therefore any command that uses the variable name directly, like
drop var.1
returns an error:
factor variables and time-series operators not allowed
r(101);
What is causing such behaviour? Any solutions to this problem?
This will drop var.1 in Stata:
drop var?1
Here (as in Excel), ? is used as a wildcard for a single character (the regular-expression equivalent of .).
Unfortunately, this will also drop var_1, if it exists.
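A workaround on the R side is to sanitize the names before exporting, so the .dta file never contains a dot in the first place; something like:
# replace dots with underscores before writing the .dta file
names(test_df) <- gsub(".", "_", names(test_df), fixed = TRUE)
write_dta(test_df, "D:/test_df.dta")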
I am not sure about the missing values when writing a .dta file with haven. I am able to replicate this result in Stata 14.1 and haven 0.2.0.
However, using the read_dta function from haven,
temp2 <- read_dta("test_df.dta")
returns the data.frame. As an alternative to haven, I have used the readstata13 package in the past without issues.
library(readstata13)
save.dta13(test_df, "testdf.dta")
While this code has the same variable-name issue, it produced a .dta file that contained the correct values when read into Stata 14.1. There is a convert.underscore argument to save.dta13 that is intended to remove characters that are invalid in Stata variable names. I verified that it works properly for this example in readstata13 version 0.8.5, but it had a bug in some earlier versions, including 0.8.2.
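For example, something along these lines (using the same test_df from the question):
library(readstata13)
# convert.underscore = TRUE rewrites invalid characters such as the dot in var.1
save.dta13(test_df, "testdf.dta", convert.underscore = TRUE)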

replace NA with nothing in R changes variable types to categorical

Is there any way to replace NA with blank or nothing without converting to character?
I used
data_model <- sapply(data_model, as.character)
data_model[is.na(data_model)] <- " "
data_model=data.table(data_model)
However, it changes all the columns' types to categorical.
I want to save the data set and use it in SAS, which does not understand NA.
Here's a somewhat belated answer (and shameless self-promotion) from The R Primer on how to export a data frame to SAS. It should automatically handle your NAs correctly:
First you can use the foreign package to export the data frame as a SAS xport dataset. Here, I'll just export the trees data frame.
library(foreign)
data(trees)
write.foreign(trees, datafile = "toSAS.dat",
              codefile = "toSAS.sas", package = "SAS")
This gives you two files, toSAS.dat and toSAS.sas. It is easy to get the data into SAS since the codefile toSAS.sas contains a SAS script that can be read and interpreted directly by SAS and reads the data in toSAS.dat.
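If you mainly need blanks in place of NA in the exported file, another option is to write a CSV, since write.csv lets you choose the string used for missing values; SAS can then read the CSV with a simple data step. Using the data_model from the question:
# NA values are written as empty fields, which SAS treats as missing
write.csv(data_model, "data_model.csv", na = "", row.names = FALSE)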

R: use LaF (reads fixed-column-width data FAST) with SAScii (parses SAS dictionary for import instructions)

I'm trying to quickly read an ASCII fixed-column-width dataset into R, based on a SAS import file (the file that declares the column widths, etc.).
I know I can use the SAScii R package to translate the SAS import file (parse.SAScii) and actually import the data (read.SAScii). It works, but it is too slow, because read.SAScii uses read.fwf to do the import, which is slow. I would like to swap that for a fast import method, laf_open_fwf from the LaF package.
I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm not able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().
Here is the code, the data is from PNAD, national household survey, 2013:
# Set working dir.
setwd("C:/User/Desktop/folder")
# installing packages:
install.packages("SAScii")
install.packages("LaF")
library(SAScii)
library(LaF)
# Download and unzip data and documentation files
# Data
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
download.file(file_url,"Dados.zip", mode="wb")
unzip("Dados.zip")
# Documentation files
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
unzip("Dicionarios_e_input.zip")
# importing with read.SAScii(), based on read.fwf(): Works fine
dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")
# importing with parse.SAScii() and laf_open_fwf() : stuck here
dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
head(dic_dom2013)
data <- laf_open_fwf("Dados/DOM2013.txt",
column_types=????? ,
column_widths=dic_dom2013[,"width"],
column_names=dic_dom2013[,"Varname"])
I'm stuck on the last command, passing the import arguments to laf_open_fwf().
UPDATE: here are two solutions, using packages LaF and readr.
Solution using readr (8 seconds)
readr is based on LaF but surprisingly faster. More info on readr here
# Load Packages
library(readr)
library(data.table)
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
setDT(dic_pes2013) # convert to data.table
# read to data frame
pesdata2 <- read_fwf("Dados/PES2013.txt",
                     fwf_widths(dic_pes2013[, width],
                                col_names = dic_pes2013[, varname]),
                     progress = interactive())
Takeaway: readr seems to be the best option: it's faster, you don't need to worry about column types, the code is shorter, and it shows a progress bar :)
Solution using LaF (20 seconds)
LaF is one of the (maybe THE) fastest ways to read fixed-width files in R, according to this benchmark. It took me 20 sec. to read the person-level file (PES) into a data frame.
Here is the code:
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
# Read .txt file using LaF. This is virtually instantaneous
pesdata <- laf_open_fwf("./Dados/PES2013.txt",
                        column_types = rep("character", length(dic_pes2013[, "width"])),
                        column_widths = dic_pes2013[, "width"],
                        column_names = dic_pes2013[, "varname"])
# convert to data frame. This took me 20 sec.
system.time( pesdata <- pesdata[,] )
Note that I've used character in column_types. I'm not quite sure why the command returns an error if I try integer or numeric. This shouldn't be a problem, since you can convert all columns to numeric like this:
# convert all columns to numeric
varposition <- grep("V", colnames(pesdata))
pesdata[varposition] <- sapply(pesdata[varposition], as.numeric)
sapply(pesdata, class)
You can try read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and leads to no RAM issues (as the author himself describes). But it imports the data into a self-contained SQLite database file (no SQL server needed), not into a data.frame. You can then open it in R through a database connection. Here is the GitHub address for the code:
https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R
In the R console, you can just run:
source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")
Its arguments are almost the same as those for the regular read.SAScii.
I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.
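As a rough sketch of how the resulting SQLite file could then be queried from R (the file and table names below are placeholders; check what read.SAScii.sqlite actually creates):
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "pnad2013.db")         # open the database file
dbListTables(con)                                          # see which table(s) were created
dom <- dbGetQuery(con, "SELECT * FROM dom2013 LIMIT 10")   # pull a few rows into a data.frame
dbDisconnect(con)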
I think the best choice is to use fwf2csv() from the descr package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that I'm assuming you already have the dictionary (dicdom below) with three variables: the beginning of each field, the size of each field, and the variable name, AND the data in Dados/.
library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)
end_dom <- dicdom$beggining + dicdom$size - 1
fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beggining, end=end_dom)
dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')
