psidR: build.panel() returns duplicate Error - r

Just tried to run the build.panel function within the psidR package. It did download all the rda files successfully for the first part of the script and I put them into a separate folder. However, now that I run the function I get an error code :
Error in [.data.table(yind, , :=((as.character(ind.nas)), NA)) :
Can't assign to the same column twice in the same query (duplicates detected).
In addition: Warning message:
In [.data.table(tmp, , :=((nanames), NA_real_), with = FALSE) :
with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
Might be my fault of ill-defining my variables? I just use the getNamesPSID function and plug it into a data.table, similar to the example code:
library(psidR)
library(openxlsx)
library(data.table)
cwf <- read.xlsx("http://psidonline.isr.umich.edu/help/xyr/psid.xlsx")
id.ind.educ1 <- getNamesPSID("ER30010", cwf)
id.fam.income1 <- getNamesPSID("V81", cwf)
famvars1 <- data.table(year=c(1968, 1969, 1970),
income1=id.fam.income1
)
indvars1 <- data.table(year=c(1968, 1969, 1970),
educ1=id.ind.educ1
)
build.panel(datadir = "/Users/Adrian/Documents/ECON 490/Heteroskedastic Dependency/Dependency/RDA", fam.vars = famvars1, ind.vars = indvars1, sample = "SRC", design = 3)
If you omit the datadir argument, R will download the corresponding datasets to a temporary directory. It will be printed in the output where exactly. As long as the R process runs you should have access to it and can copy it elsewhere. Error should be reproducible. Might take a bit until it downloads the datasets the first time.
If it relates to the NA's within each getNames argument, is there a workaround where I still preserve the corresponding year so I can tell that apart in my panel?
I know there was a similar issue on the corresponding github page relating to zips with the same name as one of the data sets. However, my folder only contains the correct datasets and no zips.
I also tried to exclude the NA cases but that messed up the length of my vectors. I also tried it with a standard data.frame.
I also checked my resulting famvars / indvars dataframes for duplicates with Excel but there are none besides the NA's, which, according to the github example found on https://github.com/floswald/psidR should be included in the dataset...
Thanks so much for your help :)
EDIT: here the traceback():
3: [.data.table(yind, , :=((as.character(ind.nas)), NA))
2: yind[, :=((as.character(ind.nas)), NA)]
1: build.panel(datadir = "/Users/Adrian/Documents/ECON 490/Heteroskedastic Dependency/Dependency/RDA",
fam.vars = famvars, ind.vars = indvars, sample = "SRC", design = 3)
EDIT'': thank you #Axeman, I cut down the reproducible example. My actual data.table contains many more variables.
UPDATE:
Just for anyone running into a similar issue:
After trying to find a way to get the function to work I decided to instead manually merge all the files and dataframes. Be prepared, its a mammoth project but so is any analysis of the PSID. I followed the instructions found here: http://asdfree.com/panel-study-of-income-dynamics-psid.html and combined them with helper function of the psidR package (getNamesPSID mainly, to get the variable names in each wave). So far, very successful. Only wish that there were more articles on the exact functioning of the survey package on the web.

Related

For loop data frame error when working with character vectors

I am currently doing data science with R and I generally write loops to access multiple files or objects at once. Normally this goes without any problems but recently a problem occurred when trying to run the following code:
setwd(PROJECT_FOLDER)
climate_forcing <- c("cf-1", "cf-2", "cf-3", "cf-4")
#load all mean stacks from IM and create rasterstack
for (i in 1:NROW(climate_forcing)){
setwd(PROJECT_FOLDER)
setwd(paste0("time frames mcor/X variable/IM/", climate_forcing[i], "/ncstack/"))
file.names <- list.files(pattern = ".nc", recursive=T, full.names=F) #list all files with ".nc"
stopwords <- c(".nc", "stack", "/dLAI") #stopwords
names.short <- gsub(paste(stopwords, collapse="|"), "", file.names)
assign("names.short", paste0(names.short, climate_forcing[i]))
for (j in 1:NROW(file.names)){
assign(paste0(names.short[j], "_stack"), stack(file.names[j]))
}
}
Error message returned:
Error in data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 1, 0
I wrote this a while ago and I ran it before and I think it used to work since the files being created by a similar script are there.
Anyways I did some testing and it seems that the error occurs in the for loop within the for loop (with the variable j). I am unsure what may cause this bug but has to do something with "file.names" and "names.short" right? When I compare them, their properties appear to be identical though, which I figured would be, since I create the latter out of the former. The reason I am creating them like this is because I want to create objects reading out the corresponding files of file.names.
The error I get refers to data.frame which confuses me because I'm working with character vectors here..
Maybe somebody with more experience can figure this issue out.
Thanks for any help and if there are any questions I will try to answer them.
Alright it turns out something was wrong with the R packages, I reinstalled and reloaded them (raster) and now it works. Thanks to anyone for your contributions!

Create data tables using SPSS in R

Using expss package I am creating cross tabs by reading SPSS files in R. This actually works perfectly but the process takes lots of time to load. I have a folder which contains various SPSS files(usually 3 files only) and through R script I am fetching the last modified file among the three.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
#get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
#use file.info to get the index of the file most recently modified
pass<-all_sav[with(file.info(all_sav), which.max(mtime))]
mydata = read_spss(pass,reencode = TRUE) # read SPSS file mydata
w <- data.frame(mydata)
args <- commandArgs(TRUE)
Everything is perfect and works absolutely fine but it generally takes too much time to load large files(112MB,48MB for e.g) which isn't good.
Is there a way I can make it more time-efficient and takes less time to create the table. The dropdowns are created using PHP.
I have searched for this and found another library called 'haven' but I am not sure whether that can give me significance as well. Can anyone help me with this? I would really appreciate that. Thanks in advance.
As written in the expss vignette (https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html) you can use in the following way:
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)

Reading zipped folder containing non-traditional spreadsheet

I'm trying to read a zipped folder called etfreit.zip contained in Purchases from April 2016 onward.
Inside the zipped folder is a file called 2016.xls which is difficult to read as it contains empty rows along with Japanese text.
I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:
download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
unzip("etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])
I'm trying to skip the first 10 rows as I simply wish to read the data in the xls file. The code works only to the extent that it runs, but the data looks truly bizarre.
Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.
There is more than one bizzare thing going on here I think, but I had some success with (somewhat older) gdata package:
data = gdata::read.xls("2016.xls")
By the way, treating xls file as csv seldom works. Actually it shouldn't work at all :) Find out a proper import function for your type of data and then use it, don't assume that read.csv is going to take care about anything else than csv (properly).
As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit, and gives you numeric variables instead of factors (note I'm using tidyr for that):
data2 = data[-c(1:7), -c(1, 6)]
names(data2) = c("date", "var1", "var2", "var3")
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)
# Optionally convert the column with factor dates to Posixct
data2$date = as.POSIXct(data2$date)
Also, note that I am removing only 7 upper rows - this seems to be the portion of the data that contains the header with Japanese.
"Odd" unusual excel tables cab be read with the jailbreakr package. It is still in development, but looks pretty ace:
https://github.com/rsheets/jailbreakr

How can I import data into R that is meant for use in SAS, SPSS, or STATA?

I am attempting to read data from the National Health Interview Survey in R: http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm . The data is Sample Adult. The SAScii library actually has a function read.SAScii whose documentation has an example for the same data set I would like to use. The issue is it "doesn't work":
NHIS.11.samadult.SAS.read.in.instructions <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2011/SAMADULT.sas"
NHIS.11.samadult.file.location <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2011/samadult.zip"
#store the NHIS file as an R data frame!
NHIS.11.samadult.df <-
read.SAScii (
NHIS.11.samadult.file.location ,
NHIS.11.samadult.SAS.read.in.instructions ,
zipped = T, )
#or store the NHIS SAS import instructions for use in a
#read.fwf function call outside of the read.SAScii function
NHIS.11.samadult.sas <- parse.SAScii( NHIS.11.samadult.SAS.read.in.instructions )
#save the data frame now for instantaneous loading later
save( NHIS.11.samadult.df , file = "NHIS.11.samadult.data.rda" )
However, when running it I get the error Error in toupper(SASinput) : invalid multibyte string 533.
Others on Stack Overflow with a similar error, but for functions such as read.delim and read.csv, have recommended to try changing the argument to fileEncoding="latin1" for example. The problem with read.SAScii is it has no such parameter fileEncoding.
See:
R: invalid multibyte string and Invalid multibyte string in read.csv
Just in case anyone has a similar problem, the issue and solution for me was to run options( encoding = "windows-1252" ) right before running the above code for read.SAScii since the ASCII file is meant for use in SAS and therefore on Windows. And I am using Linux.
The author of the SAScii library actually has another Github repository asdfree where he has working code for downloading CDC-NHIS datasets for all available years as well as as many other datasets from various surveys such as the American Housing Survey, FDA Drug Surveys, and many more.
The following links to the author's solution to the issue in this question. From there, you can easily find a link to the asdfree repository: https://github.com/ajdamico/SAScii/issues/3 .
As far as this dataset goes, the code in https://github.com/ajdamico/asdfree/blob/master/National%20Health%20Interview%20Survey/download%20all%20microdata.R#L8-L13 does the trick, however it doesn't encode the columns as factors or numeric properly. The good thing is that for any given dataset in an NHIS year, there are only about less than ten to twenty numeric columns where encoding these as numeric one by one is not so painful, and encoding the rest of the columns as numeric requires only a loop through the non-numeric columns.
The easiest solution for me, since I only require the Sample Adult dataset for 2011, and I was able to get my hands on a machine with SAS installed, was to run the SAS program included at http://www.cdc.gov/nchs/nhis/nhis_2011_data_release.htm to encode the columns as necessary. Finally, I used proc export to export the sas dataset onto a CSV file which I then opened in R easily with no necessary edits to the data except in dealing with missing values.
In case you want to work with NHIS datasets besides Sample Adult, it is worth noting that when I ran the available SAS program for 2010 "Sample Adult Cancer" (http://www.cdc.gov/nchs/nhis/nhis_2010_data_release.htm) and exported the data to a CSV, there was an issue with having less column names than actual columns when I attempted to read in the CSV file in R. Skipping the first line resolves this issue but you lose the descriptive column names. You can however import this same data easily without encoding with the R code in the asdfree repository. Please read the documentation there for more info.

R program does not output

I'm new to R and programming and taking a Coursera course. I've asked in their forums, but nobody can seem to provide an answer in the forums. To be clear, I'm trying to determine why this does not output.
When I first wrote the program, I was getting accurate outputs, but after I tried to upload, something went wonky. Rather than producing any output with [1], [2], etc. when I run the program from RStudio, I only get the the blue +++, but no errors and anything I change still does not produce an output.
I tried with a previous version of R, and reinstalled the most recent version 3.2.1 for Windows.
What I've done:
Set the correct working directory through RStudio
pol <- function(directory, pol, id = 1:332) {
files <- list.files("specdata", full.names = TRUE);
data <- data.frame();
for (i in ID) {
data <- rbind(data, read.csv(files_list[i]))
}
subset <- subset(data, ID %in% id);
polmean <- mean(subset[pol], na.rm = TRUE);
polmean("specdata", "sulfate", 1:10)
polmean("specdata", "nitrate", 70:72)
polmean("specdata", "nitrate", 23)
}
Can someone please provide some direction - debug help?
when I adjust the code the following errors tend to appear:
ID not found
Missing or unexpected } (although I've matched them all).
The updated code is as follow, if I'm understanding:
data <- data.frame();
files <- files[grepl(".csv",files)]
pollutantmean <- function(directory, pollutant, id = 1:332) {
pollutantmean <- mean(subset1[[pollutant]], na.rm = TRUE);
}
Looks like you haven't declared what ID is (I assume: a vector of numbers)?
Also, using 'subset' as a variable name while it's also a function, and pol as both a function name and the name of one of the arguments of that same function is just asking for trouble...
And I think there is a missing ")" in your for-loop.
EDIT
So the way I understand it now, you want to do a couple of things.
Read in a bunch of files, which you'll use multiple times without changing them.
Get some mean value out of those files, under different conditions.
Here's how I would do it.
Since you only want to read in the data once, you don't really need a function to do this (you can have one, but I think it's overkill for now). You correctly have code that makes a vector with the file names, and then loop over over them, rbinding them to each other. The problem is that this can become very slow. Check here. Make sure your directory only contains files that you want to read in, so no Rscripts or other stuff. A way (not 100% foolproof) to do this is using files <- files[grepl(".csv",files)], which makes sure you only have the csv's (grepl checks whether a certain string is a substring of another, and returns a boolean the [] then only keeps the elements for which a TRUE was returned).
Next, there is 'a thing you want to do multiple times', namely getting out mean values. This is where you'd use a function. Apparently you want to get the mean for different types of pollution, and you want this in restricted IDs.
Let's assume that 1. has given you a dataframe df with a column named Type for the type of pollution and a column called Id that somehow represents a sort of ID (substitute with the actual names in your script - if you don't have a column for ID, I'll edit the answer later on). Now you want a function
polmean <- function(type, id) {
# some code that returns the mean of a restricted version of df
}
This is all you need. You write the code that generates df, you then write a function that will get you what you want from that dataframe, and then you call it for the circumstances you want to use it in (the three polmean calls at the end of your original code, but now without the first argument as you no longer need this).
Ok - I finally solved this. Thanks for the help.
I didn't need to call "specdata" in line 2. the directory in line 1 referred to the correct directory.
My for/in statement needed to refer the the id in the first line not the ID in the dataset. The for/in statement doesn't appear to need to be indented (but it looks cleaner)
I did not need a subset
The last 3 lines for pollutantmean did not need to be a part of the program. These are used in the R console to call the results one by one.

Resources