I use R to analyze large data files from IPUMS, which publishes sophisticated microdata on Census records. IPUMS offers its extracts as SPSS, SAS, or Stata files. To get the data into R, I've had the most luck downloading the SPSS version and using the read.spss function from the "foreign" package:
library(foreign)
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE)
This works brilliantly, save for this perpetual warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
(If anyone is feeling heroic, I uploaded the zipped .sav file here (39 MB) as well as the .SPS file and the more human-readable codebook. This is just a sample IPUMS extract and, like all IPUMS data, contains no private information.)
My question is whether my data is compromised by duplicate factor levels in the SPSS file, or whether this is something I can fix after the import.
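For reference, the situation can be reproduced with a toy factor (not the IPUMS data); depending on the R version, assigning duplicated levels is a deprecation warning or an outright error:
f <- factor(c(1, 2, 3))
# Two levels share the label "b". Older R versions warn
# "duplicated levels in factors are deprecated"; R >= 3.4.0 stops with an error.
levels(f) <- c("a", "b", "b")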
To figure out which of the columns was the culprit, I wrote a little diagnostic:
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE)
# Print every factor column along with the index of its first duplicated level
for (name in names(ipums)) {
  if (is.factor(ipums[[name]])) {
    print(name)
    print(anyDuplicated(levels(ipums[[name]])))
  }
}
This loop correctly identifies the column BPLD as the culprit. That's a detailed version of a person's birthplace, with 536 possible values in the .SPS file, as confirmed by this code:
fac <- levels(ipums$BPLD)
length(fac) #536
anyDuplicated(fac) #153
fac[153] #"Br. Virgin Islands, ns"
When I look at the .SPS file, I do in fact see two entries for this location:
26052 "Br. Virgin Islands, ns"
26069 "Br. Virgin Islands, ns"
However, I don't see a single instance of this location in the data:
NROW(subset(ipums, BPLD == "Br. Virgin Islands, ns")) # 0
This may well be because this is not a common location that's likely to show up in the data, but I cannot assume that will always be the case in future projects. So part two of my question is whether an SPSS file with duplicate factor levels will at least import the correct values, or whether a file that produces this warning message is potentially damaged.
As for fixing the problem, I see a few related StackOverflow posts, like this one, but I'm not sure they address the problem I have with complex public data from a third party. What is the most efficient way to clean up factors with duplicate levels so that I can have full confidence in the data?
SPSS does not require value labels to be unique. In this dataset, BPLD is a string. I believe read.spss will create a factor with duplicate levels but will assign all the duplicated values to just one of them. You can use droplevels() after reading the data to get rid of the unused level.
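A minimal sketch of that suggestion, assuming the file and column from the question:
library(foreign)
ipums <- read.spss("usa_00106.sav", to.data.frame = TRUE)
# droplevels() rebuilds the levels from the values actually present,
# so the unused duplicate "Br. Virgin Islands, ns" level disappears
ipums$BPLD <- droplevels(ipums$BPLD)
anyDuplicated(levels(ipums$BPLD))  # 0 if no duplicates remain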
Could you try importing with factor conversion turned off, with either:
# haven't tested
read.spss(x..., stringsAsFactors = FALSE)
or, from the help page for read.spss:
read.spss(x..., use.value.labels = FALSE)
?read.spss
# use.value.labels
#   logical: convert variables with value labels into R factors with those
#   levels? This is only done if there are at least as many labels as values
#   of the variable (values without a matching label are returned as NA).
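For instance, a sketch of the use.value.labels route (untested against this file; ipums_raw is just a name chosen here). With labels off, the column holds the raw codes, so no duplicated-levels warning can arise:
library(foreign)
# Import without converting value labels to factor levels; BPLD then
# contains the raw numeric codes (e.g. 26052, 26069) instead of labels
ipums_raw <- read.spss("usa_00106.sav", to.data.frame = TRUE,
                       use.value.labels = FALSE)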
I have dug into rlist and purrr and found them quite helpful for working with lists of pre-structured data. I tried to solve the problems that arose on my own to improve my coding skills, so thanks to the community for helping out! However, I've reached a dead end now:
I want to write code such that we can throw our Excel files (.xlsm format) into a folder and R does the rest.
I import my data using:
library(readxl)

vec.files <- list.files(pattern = "\\.xlsm$")   # all .xlsm files in the folder
vec.numbers <- gsub("\\.xlsm", "", vec.files)   # file names without extension
list.alldata <- lapply(vec.files, read_excel, sheet = "XYZ")
names(list.alldata) <- vec.numbers
The data we read is a combination of characters, dates (...).
When I try to use the rlist package, everything works fine until I try to filter on names that were not fixed entries in the Excel file (e.g. Measurable 1) but references to another field (e.g. =Table1!A1).
If I try to call such an element I get this error:
list.map(list.alldata, NameWhichWasAReferenceToAnotherFieldBefore)
Error in eval(.expr, .evalwith(.data), environment()) :
object 'Namewhichwasareferencetoanotherfieldbefore' not found
I am quite surprised, because if I call
names(list.alldata[[1]])
I get a vector with the correct entries / names.
Since I identified read_excel() as the likely cause, I tried adding col_names = TRUE, but that did not help. Also, with col_names = FALSE the correct values do end up in the dataset.
I assume that exporting the data as .csv would help, but this is not an option. Can this be easily done by R in a preprocessing loop?
In my workflow, accessing the data by name is essential and there is no workaround, so I really appreciate your help!
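One possible workaround sketch, not from the thread itself: extract columns by a character string rather than a bare name, which sidesteps the non-standard evaluation that trips over awkward names. The column name below is a placeholder.
library(purrr)

col <- "Name which was a reference to another field before"  # placeholder

# purrr accepts a character string as an extractor, so no bare name is needed
values <- map(list.alldata, col)

# Base-R equivalent, using [[ with the exact string
values2 <- lapply(list.alldata, function(df) df[[col]])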
I just tried to run the build.panel function from the psidR package. It downloaded all the .rda files successfully for the first part of the script, and I put them into a separate folder. However, now that I run the function I get an error:
Error in `[.data.table`(yind, , `:=`((as.character(ind.nas)), NA)) :
Can't assign to the same column twice in the same query (duplicates detected).
In addition: Warning message:
In `[.data.table`(tmp, , `:=`((nanames), NA_real_), with = FALSE) :
with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
It might be my fault for ill-defining my variables? I just use the getNamesPSID function and plug the result into a data.table, similar to the example code:
library(psidR)
library(openxlsx)
library(data.table)
cwf <- read.xlsx("http://psidonline.isr.umich.edu/help/xyr/psid.xlsx")
id.ind.educ1 <- getNamesPSID("ER30010", cwf)
id.fam.income1 <- getNamesPSID("V81", cwf)
famvars1 <- data.table(year=c(1968, 1969, 1970),
income1=id.fam.income1
)
indvars1 <- data.table(year=c(1968, 1969, 1970),
educ1=id.ind.educ1
)
build.panel(datadir = "/Users/Adrian/Documents/ECON 490/Heteroskedastic Dependency/Dependency/RDA", fam.vars = famvars1, ind.vars = indvars1, sample = "SRC", design = 3)
If you omit the datadir argument, R will download the corresponding datasets to a temporary directory; the output prints exactly where. As long as the R process is running you have access to it and can copy it elsewhere. The error should be reproducible. It might take a while until the datasets download the first time.
If it relates to the NAs within each getNamesPSID result, is there a workaround where I can still preserve the corresponding year, so I can tell the waves apart in my panel?
I know there was a similar issue on the corresponding GitHub page relating to zips with the same name as one of the data sets. However, my folder contains only the correct datasets and no zips.
I also tried to exclude the NA cases, but that messed up the length of my vectors. I also tried it with a standard data.frame.
I also checked my resulting famvars / indvars tables for duplicates in Excel, but there are none besides the NAs, which, according to the GitHub example at https://github.com/floswald/psidR, should be included in the dataset...
Thanks so much for your help :)
EDIT: here is the traceback():
3: `[.data.table`(yind, , `:=`((as.character(ind.nas)), NA))
2: yind[, `:=`((as.character(ind.nas)), NA)]
1: build.panel(datadir = "/Users/Adrian/Documents/ECON 490/Heteroskedastic Dependency/Dependency/RDA",
fam.vars = famvars, ind.vars = indvars, sample = "SRC", design = 3)
EDIT 2: thank you @Axeman, I cut down the reproducible example. My actual data.table contains many more variables.
UPDATE:
Just for anyone running into a similar issue:
After trying to get the function to work, I decided to instead manually merge all the files and data frames. Be prepared, it's a mammoth project, but so is any analysis of the PSID. I followed the instructions found here: http://asdfree.com/panel-study-of-income-dynamics-psid.html and combined them with helper functions from the psidR package (mainly getNamesPSID, to get the variable names in each wave). So far, very successful. I only wish there were more articles on the exact workings of the survey package on the web.
I'm trying to read a zipped folder called etfreit.zip, linked under "Purchases from April 2016 onward".
Inside the zipped folder is a file called 2016.xls which is difficult to read as it contains empty rows along with Japanese text.
I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:
download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
unzip("etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])
I'm trying to skip the first 10 rows, as I simply wish to read the data in the xls file. The code runs, but the data looks truly bizarre.
Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.
There is more than one bizarre thing going on here, I think, but I had some success with the (somewhat older) gdata package:
data = gdata::read.xls("2016.xls")
By the way, treating an xls file as csv seldom works; actually, it shouldn't work at all :) Find the proper import function for your type of data and then use it; don't assume that read.csv is going to take care of anything other than csv (properly).
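A possible alternative, not part of the original answer: the readxl package also reads legacy .xls files and can skip header rows directly. The skip value here assumes the Japanese header occupies the top 7 rows, as noted further below.
library(readxl)

# Skip the Japanese header block at the top of the sheet
data <- read_excel("2016.xls", skip = 7)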
As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit and gives you numeric variables instead of factors (note I'm using tidyr for that):
data2 = data[-c(1:7), -c(1, 6)]                  # drop header rows and empty columns
names(data2) = c("date", "var1", "var2", "var3")
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)
# Optionally convert the column with factor dates to POSIXct
# (via as.character, since as.POSIXct on a factor is unreliable across versions)
data2$date = as.POSIXct(as.character(data2$date))
Also, note that I am removing only the 7 top rows; this seems to be the portion of the data that contains the header with Japanese text.
"Odd" unusual excel tables cab be read with the jailbreakr package. It is still in development, but looks pretty ace:
https://github.com/rsheets/jailbreakr
I am attempting to read data from the National Health Interview Survey in R: http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm . The data is Sample Adult. The SAScii library actually has a function read.SAScii whose documentation has an example for the very data set I would like to use. The issue is that it "doesn't work":
NHIS.11.samadult.SAS.read.in.instructions <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2011/SAMADULT.sas"
NHIS.11.samadult.file.location <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2011/samadult.zip"
#store the NHIS file as an R data frame!
NHIS.11.samadult.df <-
read.SAScii (
NHIS.11.samadult.file.location ,
NHIS.11.samadult.SAS.read.in.instructions ,
zipped = T )
#or store the NHIS SAS import instructions for use in a
#read.fwf function call outside of the read.SAScii function
NHIS.11.samadult.sas <- parse.SAScii( NHIS.11.samadult.SAS.read.in.instructions )
#save the data frame now for instantaneous loading later
save( NHIS.11.samadult.df , file = "NHIS.11.samadult.data.rda" )
However, when running it I get the error Error in toupper(SASinput) : invalid multibyte string 533.
Others on Stack Overflow with a similar error, but for functions such as read.delim and read.csv, have recommended trying the argument fileEncoding = "latin1", for example. The problem with read.SAScii is that it has no such fileEncoding parameter.
See:
R: invalid multibyte string and Invalid multibyte string in read.csv
Just in case anyone has a similar problem: the issue and solution for me was to run options(encoding = "windows-1252") right before running the above code for read.SAScii, since the ASCII file is meant for use in SAS and therefore on Windows, while I am using Linux.
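A minimal sketch of that fix, using the objects from the question (saving and restoring the previous option is an extra precaution, not part of the original answer):
library(SAScii)

old.opts <- options(encoding = "windows-1252")  # the SAS script is Windows-encoded

NHIS.11.samadult.df <-
  read.SAScii(
    NHIS.11.samadult.file.location,
    NHIS.11.samadult.SAS.read.in.instructions,
    zipped = TRUE
  )

options(old.opts)  # restore the previous encoding setting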
The author of the SAScii library actually has another GitHub repository, asdfree, where he has working code for downloading CDC-NHIS datasets for all available years, as well as many other datasets from various surveys such as the American Housing Survey, FDA Drug Surveys, and many more.
The following links to the author's solution to the issue in this question. From there, you can easily find a link to the asdfree repository: https://github.com/ajdamico/SAScii/issues/3 .
As far as this dataset goes, the code in https://github.com/ajdamico/asdfree/blob/master/National%20Health%20Interview%20Survey/download%20all%20microdata.R#L8-L13 does the trick; however, it doesn't encode the columns as factors or numeric properly. The good thing is that for any given dataset in an NHIS year there are only about ten to twenty numeric columns, so converting those one by one is not so painful, and encoding the remaining columns as factors requires only a loop through the non-numeric columns.
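A sketch of such a loop; samadult and the column names are placeholders, since the genuinely numeric columns have to be identified by hand from the codebook:
# 'samadult' is the imported data frame; list the handful of numeric columns
numeric.cols <- c("AGE_P", "BMI")  # placeholder names, not verified

for (col in names(samadult)) {
  if (col %in% numeric.cols) {
    samadult[[col]] <- as.numeric(as.character(samadult[[col]]))
  } else {
    samadult[[col]] <- as.factor(samadult[[col]])
  }
}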
The easiest solution for me, since I only need the Sample Adult dataset for 2011 and was able to get my hands on a machine with SAS installed, was to run the SAS program included at http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm to encode the columns as necessary. Finally, I used proc export to export the SAS dataset to a CSV file, which I then opened in R easily, with no edits to the data needed except in dealing with missing values.
In case you want to work with NHIS datasets besides Sample Adult: when I ran the available SAS program for the 2010 "Sample Adult Cancer" file (http://www.cdc.gov/nchs/nhis/nhis_2010_data_release.htm) and exported the data to a CSV, there was an issue with having fewer column names than actual columns when I attempted to read the CSV file into R. Skipping the first line resolves this, but you lose the descriptive column names. You can, however, import this same data easily without encoding with the R code in the asdfree repository. Please read the documentation there for more info.
I'm trying to read an Excel-created .csv file into R. I've tried numerous suggestions, but none have completely panned out for me.
Here's how the data looks in the .csv file, with the first row being the header:
recipe_type,State,Successes,Attempts
paper,alabama ,586,3379
Here are my R commands to import the .csv file:
options( stringsAsFactors = FALSE )
results<-read.csv("recipe results.csv", header=TRUE, as.is=T)
results$Successes
[1] "586"
And Successes is being treated as character data.
And I've also tried this approach:
results[,3] <- as.numeric(levels(results$Successes))
but I get the rank of each value in this column rather than the actual value, which another post said would happen.
Any ideas on how to get this data treated as numeric so I can get proper stat.desc stats for it?
Thanks
Direct conversion of a factor to numeric yields the underlying integer level codes, which have nothing to do with the values themselves. You need to convert to character first:
results[,3] <- as.numeric(as.character(results$Successes))
Equivalently (see ?factor), you can convert the levels to numeric, and index by the (implicit) numeric conversion of the factor.
as.numeric(levels(results$Successes))[results$Successes]
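A quick illustration of the difference, with toy values rather than the question's data:
f <- factor(c("586", "3379", "586"))
as.numeric(f)                # 2 1 2 -- the internal level codes
as.numeric(as.character(f))  # 586 3379 586
as.numeric(levels(f))[f]     # 586 3379 586, converting each level only once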
I realise this is an old question, but I came across it today when having a similar issue.
I eventually found that (in my case) the problem arose because Excel's 'Number' format includes a comma (,) in its values, so: 1,000 instead of 1000. Once I removed the commas I was able to convert from factors without NA values.
df$col1 <-as.numeric(gsub(",","",df$col1))
Just in case anyone comes across something similar.
I found this package to be most helpful; it worked without any issues, aside from a warning: gdata.
This URL contains the info on the package: http://www.r-tutor.com/r-introduction/data-frame/data-import
I did convert my spreadsheet from .xlsx to .xls, which it seemed to expect. I didn't test whether an .xlsx would work.