Problem with reading microdata from IPUMS into R

Problem with reading microdata from IPUMS into R - r

I am trying to read the micro data from the extract that I downloaded from IPUMS USA into R. It seemed simple at first, but I can't get it. I already downloaded the DDI and CSV, and it is not working!
Would appreciate any help as how to how to get this data into R.
I've tried two different ways. I learned how to do this code from this website: https://tech.popdata.org/Integrating-IPUMS-Data-with-R/ (but apparently it was wrong).
Here's my code:
cps_ddi <- read_ipums_ddi(ipums_example("wagesdata.xml"))
cps_data <- read_ipums_micro(cps_ddi, data_file = ipums_example("usa_00004.csv"), verbose = FALSE)
Console returns this:
Error in ipums_example("wagesdata.xml") :
Could not find file 'wagesdata.xml' in examples. Available files are: cps_00006.csv.gz, cps_00006.dat.gz, cps_00006.xml, cps_00010.dat.gz, cps_00010.xml, cps_00015.dat.gz, cps_00015.xml, nhgis0008_csv.zip, nhgis0008_shape_small.zip`
cps_data <- read_ipums_micro(cps_ddi, data_file = ipums_example("usa_00004.csv"), verbose = FALSE)
Error in read_ipums_micro(cps_ddi, data_file = ipums_example("usa_00004.csv"), : object 'cps_ddi' not found

The ipums_example() function is designed to find the example data included with the R package.
However, if you're working with your own data, you don't need it.
I believe this should work:
cps_ddi <- read_ipums_ddi("wagesdata.xml")
cps_data <- read_ipums_micro(cps_ddi, data_file = "usa_00004.csv", verbose = FALSE)
If it doesn't, then most likely you haven't downloaded the data to your current working directory. You can check where your session is by running command getwd() and see what files are currently available with list.files()

Related

NetCDF: HDF error only inside a loop in R

I have a script to loop through a selection of net cdf files. The files are opened, data extracted, then closed again. I have used this many times before and it works with no issue. I was recently sent a new selection of files to run through the same code. I can check the files individually using the ncdf4 package and nc_open() function. The files look fine and are not corrupt. However, when I run through the loop the function will not let me open the files and I get this error:
Error in R_nc4_open: NetCDF: HDF error
When I run though the loop to check, all is fine and the file opens. It just cannot open in the loop. There is no issue with the code.
Has anyone come across this before with non-corrupt net cdf files getting this error only on occasion. Even outside the loop I can run the code and get the error first time, then run it again without changing anything and the connection works.
Not sure how to trouble shoot this one, so just looking for advice as to why this might be happening.
Code snippet:
targetYear <- '2005-2019'
variables <- c('CHL','SSH')
ncNam <- list.files(folderdir, '.nc', recursive = TRUE)
for(v in 1:length((variables)))
{
varNam <- unlist(unique(variables))[v]
# Get names corresponding to variable
varLs <- ncNam[grep(varNam, basename(ncNam))]
varLs <- varLs[grep(targetYear, varLs)]}
varLs <- varLs[1]
export <- paste0(exportdir,varNam,'/')
dir.create(export, recursive = TRUE)
if(varNam == 'Proximity1km' | varNam == 'Proximity200m'| varNam ==
'ProximityCoast'| varNam == 'Bathymetry'){
fileNam <- varLs
ncfilename <- paste0(folderdir, fileNam)
print(ncfilename)
# Read ncfile
ncfile <- nc_open(ncfilename)
nc_close(ncfile)
gc()
} else {
fileNam <- varLs
ncfilename <- paste0(folderdir, fileNam)
print(ncfilename)
# Read ncfile
ncfile <- nc_open(ncfilename)
nc_close(ncfile)
gc()}`

I figured out the issue. It was to do with the error detection filer in the .nc files.
I removed the filter and the files work fine inside the loop. Still a bit strange.
Perhaps the ncdf4 package is not up to date with this filtering.

Read large csv file from S3 into R

I need to load a 3 GB csv file with about 18 million rows and 7 columns from S3 into R or RStudio respectively. My code for reading data from S3 usually works like this:
library("aws.s3")
obj <-get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
Now, with the file being much bigger than usual, I receive an error
> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68
Reading this post, I understand that the vector is too long but how would I subset the data in this case? Any other suggestion how to deal with larger files to read from S3?

Originally Building on Hugh's comment in the OP and adding an answer for those wishing to load regular size csv's from s3.
At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.
Thus
data <-
aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")
Will do the trick. However, if you want to make your work run faster and cleaner, I prefer this:
data <-
aws.s3::s3read_using(fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
janitor::clean_names()
Previously the more verbose method below was required:
library(aws.s3)
data <-
save_object("s3://myBucketName/directoryName/fileName.csv") %>%
data.table::fread()
It works for files up to at least 305 MB.
A better alternative to filling up your working directory with a copy of every csv you load:
data <-
save_object("s3://myBucketName/directoryName/fileName.csv",
file = tempfile(fileext = ".csv")
) %>%
fread()
If you are curious about where the tempfile is positioned, then Sys.getenv() can give some insight - see TMPDIR TEMP or TMP. More information can be found in the Base R tempfile docs..

You can use AWS Athena and mount your S3 files to athena and query only selective records to R. How to run r with athena is explained in detail below.
https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
Hope it helps.

If you are on Spark or similar, a other workaround would be to
- read/load the csv to DataTable and
- continue processing it with R Server / sparklyr

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.

Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file, I couldn't open it with the yet excellent readxl package.
Looks like a tab separated flat file, but no success with read.table() either. readr::read_delim() made it.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <-
read_delim(
unzip("./data/rte_data", exdir = "./data/"),
delim = "\t",
locale = locale(encoding = "ISO-8859-1"),
col_names = TRUE
)
There still are parsing problems, but not sure they should be dealt with in this SO question.

R read.dta and unz not working

I read a lot of files into R from zipped sources. I try to use the R function unz to read from zipped files because unlike unzip it does not leave any unzipped files on my harddisk.
However, this does not seem to work for zipped *.dta (Stata) files:
library(foreign)
temp <- tempfile()
download.file("http://databank.worldbank.org/data/download/WDI_csv.zip", temp)
wdi_unz <- read.csv(unz(temp, "WDI_Data.csv"))
unlink(temp)
temp <- tempfile()
download.file("http://www.rug.nl/research/ggdc/data/pwt/v80/pwt80.zip",temp)
pwt_unzip <- read.dta(unzip(temp, "pwt80.dta"))
pwt_unz <- read.dta(unz(temp, "pwt80.dta"))
unlink(temp)
Sorry for using the rather large World Development Indicators database (its 40+ MB), but I did not find any better working example.
The code produces an error when reading pwt_unz, [edit: but not when reading pwt_unzip]. What is the problem there? Probably it has something to do with the return value of unz not being compatible with the input for read.dta?

I think you need read.dta
Have a look here :
http://stat.ethz.ch/R-manual/R-devel/library/foreign/html/read.dta.html

Read SPSS file into R

I am trying to learn R and want to bring in an SPSS file, which I can open in SPSS.
I have tried using read.spss from foreign and spss.get from Hmisc. Both error messages are the same.
Here is my code:
## install.packages("Hmisc")
library(foreign)
## change the working directory
getwd()
setwd('C:/Documents and Settings/BTIBERT/Desktop/')
## load in the file
## ?read.spss
asq <- read.spss('ASQ2010.sav', to.data.frame=T)
And the resulting error:
Error in read.spss("ASQ2010.sav", to.data.frame = T) : error
reading system-file header In addition: Warning message: In
read.spss("ASQ2010.sav", to.data.frame = T) : ASQ2010.sav: position
0: character `\000' (
Also, I tried saving out the SPSS file as a SPSS 7 .sav file (was previously using SPSS 18).
Warning messages: 1: In read.spss("ASQ2010_test.sav", to.data.frame =
T) : ASQ2010_test.sav: Unrecognized record type 7, subtype 14
encountered in system file 2: In read.spss("ASQ2010_test.sav",
to.data.frame = T) : ASQ2010_test.sav: Unrecognized record type 7,
subtype 18 encountered in system file

I had a similar issue and solved it following a hint in read.spss help.
Using package memisc instead, you can import a portable SPSS file like this:
data <- as.data.set(spss.portable.file("filename.por"))
Similarly, for .sav files:
data <- as.data.set(spss.system.file('filename.sav'))
although in this case I seem to miss some string values, while the portable import works seamlessly. The help page for spss.portable.file claims:
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. They are also adapted to load efficiently large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions, provided by this package.

The read.spss seems to be outdated a little bit, so I used package called memisc.
To get this to work do this:
install.packages("memisc")
data <- as.data.set(spss.system.file('yourfile.sav'))

You may also try this:
setwd("C:/Users/rest of your path")
library(haven)
data <- read_sav("data.sav")
and if you want to read all files from one folder:
temp <- list.files(pattern = "*.sav")
read.all <- sapply(temp, read_sav)

I know this post is old, but I also had problems loading a Qualtrics SPSS file into R. R's read.spss code came from PSPP a long time ago, and hasn't been updated in a while. (And Hmisc's code uses read.spss(), too, so no luck there.)
The good news is that PSPP 0.6.1 should read the files fine, as long as you specify a "String Width" of "Short - 255 (SPSS 12.0 and earlier)" on the "Download Data" page in Qualtrics. Read it into PSPP, save a new copy, and you should be in business. Awkward, but free.
,

You can read SPSS file from R using above solutions or the one you are currently using. Just make sure that the command is fed with the file, that it can read properly. I had same error and the problem was, SPSS could not access that file. You should make sure the file path is correct, file is accessible and it is in correct format.
library(foreign)
asq <- read.spss('ASQ2010.sav', to.data.frame=TRUE)
As far as warning message is concerned, It does not affect the data. The record type 7 is used to store features in newer SPSS software to make older SPSS software able to read new data. But does not affect data. I have used this numerous times and data is not lost.
You can also read about this at http://r.789695.n4.nabble.com/read-spss-warning-message-Unrecognized-record-type-7-subtype-18-encountered-in-system-file-td3000775.html#a3007945

It looks like the R read.spss implementation is incomplete or broken. R2.10.1 does better than R2.8.1, however. It appears that R gets upset about custom attributes in a sav file even with 2.10.1 (The latest I have). R also may not understand the character encoding field in the file, and in particular it probably does not work with SPSS Unicode files.
You might try opening the file in SPSS, deleting any custom attributes, and resaving the file.
You can see whether there are custom attributes with the SPSS command
display attributes.
If so, delete them (see VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE commands), and try again.
HTH,
Jon Peck

If you have access to SPSS, save file as .csv, hence import it with read.csv or read.table. I can't recall any problem with .sav file importing. So far it was working like a charm both with read.spss and spss.get. I reckon that spss.get will not give different results, since it depends on foreign::read.spss
Can you provide some info on SPSS/R/Hmisc/foreign version?

Another solution not mentioned here is to read SPSS data in R via ODBC. You need:
IBM SPSS Statistics Data File Driver. Standalone driver is enough.
Import SPSS data using RODBC package in R.
See the example here. However I have to admit that, there could be problems with very big data files.

For me it works well using memisc!
install.packages("memisc")
load('memisc')
Daten.Februar <-as.data.set(spss.system.file("NPS_Februar_15_Daten.sav"))
names(Daten.Februar)

I agree with #SDahm that the haven package would be the way to go. I myself have struggled a bit with string values when starting to use it, so I thought I'd share my approach on that here, too.
The "semantics" vignette has some useful information on this topic.
library(tidyverse)
library(haven)
# Some interesting information in here
vignette('semantics')
# Get data from spss file
df <- read_sav(path_to_file)
# get value labels
df <- map_df(.x = df, .f = function(x) {
if (class(x) == 'labelled') as_factor(x)
else x})
# get column names
colnames(df) <- map(.x = spss_file, .f = function(x) {attr(x, 'label')})

There is no such problem with packages you are using. The only requirement for read a spss file is to put the file into a PORTABLE format file. I mean, spss file have *.sav extension. You need to transform your spss file in a portable document that uses *.por extension.
There is more info in http://www.statmethods.net/input/importingdata.html

In my case this warning was combined with a appearance of a new variable before first column of my data with values -100, 2, 2, 2, ..., a shift in the correspondence between labels and values and the deletion of the last variable. A solution that worked was (using SPSS) to create a new dump variable in the last column of the file, fill it with random values and execute the following code:
(filename is the path to the sav file and in my case the original SPSS file had 62 columns, thus 63 with the additional dumb variable)
library(memisc)
data <- as.data.set(spss.system.file(filename))
copyofdata = data
for(i in 2:63){
names(data)[i] <- names(copyofdata)[i-1]
}
data[[1]] <- NULL
newcopyofdata = data
for(i in 2:62){
labels(data[[i]]) <- labels(newcopyofdata[[i-1]])
}
labels(data[[1]]) <- NULL
Hope the above code will help someone else.

Turn your UNICODE in SPSS off
Open SPSS without any data open and run the code below in your syntax editor
SET UNICODE OFF.
Open the data set and resave it to remove the Unicode
read.spss('yourdata.sav', to.data.frame=T) works correctly then

I just came came across an SPSS file that I couldn't get open using haven, foreign, or memisc, but readspss::read.por did the trick for me:
download.file("http://www.tcd.ie/Political_Science/elections/IMSgeneral92.zip",
"IMSgeneral92.zip")
unzip("IMSgeneral92.zip", exdir = "IMSgeneral92")
# rio, haven, foreign, memisc pkgs don't work on this file! But readspss does:
if(!require(readspss)) remotes::install_git("https://github.com/JanMarvin/readspss.git")
ims92 <- readspss::read.por("IMSgeneral92/IMS_Nov7 92.por", convert.factors = FALSE)
Nice! Thanks, #JanMarvin!

1)
I've found the program, stat-transfer, useful for importing spss and stata files into R.
It resolves the issue you mention by converting spss to R dataset. Also very useful for subsetting super large datasets into smaller portions consumable by R. Not free, but a very useful tool for working with datasets from different programs -- especially if you don't have access to them.
2)
Memisc package also has an spss function worth trying.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Problem with reading microdata from IPUMS into R - r

Related

NetCDF: HDF error only inside a loop in R

Read large csv file from S3 into R

read an Excel file embedded in a website

R read.dta and unz not working

Read SPSS file into R

Categories

Resources