I am using R's censusapi package to download data from the US Census Bureau's Economic Census, specifically for 2012. Some of the columns read into R with NAs introduced by coercion, apparently because R assumes a data type that's inappropriate. In short, how do you integrate a getCensus API call with designated variable types?
Specifically, the variables "NAICS2012_TTL" and some cases of "NAICS2012" read in as NA, and shouldn't. The first is entirely a text field which R insists should be numeric. The second is a series of numbers that should be treated as text, and includes some cases of hyphenated numbers, which read in as NA. How can I tell R to fetch this data and not give it an inappropriate data type? Code follows. You'll need a Census API key to test:
library("censusapi")
myFile <- getCensus(name = "ewks", vintage = 2012, key = "YOURKEYHERE",
  vars = c("EMP", "EMP_F", "EMP_S", "ESTAB", "ESTAB_F", "GEO_ID", "GEO_TTL",
           "GEOTYPE", "NAICS2012", "NAICS2012_TTL", "OPTAX", "PAYANN",
           "PAYANN_F", "PLACE"),
  region = "place:*", regionin = "state:01")
I have tried making myFile as a data.frame and specifying colClasses in the process, unsuccessfully. I have also read every support doc for the censusapi package I can find, to no avail.
After a few more hours of trial and error, I've concluded that the censusapi package is much better suited to the ACS data, and simply does not work well for the Economic Census data. To be clear, the Economic Census data is a Census Bureau data set that is supported by an API. However, the censusapi package converts some fields to NA for Economic Census data.
To more successfully read the Economic Census into R using its API, I turned to the "curl" and "jsonlite" packages instead. Here is the script that worked. It is written to download NAICS code 22 data for all places; you could use an apply function to loop over the other NAICS codes.
library(curl)
library(jsonlite)

# Download the raw JSON response to a local file
naics_22 <- "naics_22.json"
curl_download("https://api.census.gov/data/2012/ewks?get=EMP,NAICS2012_TTL,OPTAX,ESTAB,PAYANN,RCPTOT,GEO_TTL&for=place:*&NAICS2012=22&key=YOURKEYHERE",
              naics_22)

# Parse the JSON into a character matrix, then a data frame, and write to CSV
mymatrix <- fromJSON(naics_22)
matrix2 <- data.frame(mymatrix)
write.csv(matrix2, "EC22.csv")
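One thing to be aware of: the Census API returns the column headers as the first row of the JSON array, and fromJSON() gives you every field as character. A minimal follow-up sketch (the numeric conversion of EMP and PAYANN is just an example; leave NAICS2012 and the other code fields as text):
# Promote the first row to column names and drop it from the data
colnames(matrix2) <- mymatrix[1, ]
matrix2 <- matrix2[-1, ]

# Convert count/dollar fields explicitly; as.character() guards against the
# columns having been read in as factors
matrix2$EMP    <- as.numeric(as.character(matrix2$EMP))
matrix2$PAYANN <- as.numeric(as.character(matrix2$PAYANN))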
I am new to R and have just started to use it. I am currently experimenting with the quantmod, rugarch and rmgarch packages.
In particular, I'm using the last package to run a multivariate portfolio analysis for the European markets. For that, I need the 3-month German treasury bills to use as the risk-free rate. However, as far as I know, I can't download that data series from the Yahoo, Google or FDRA databases, so I have downloaded it from investing.com and want to load it into R.
The issue is that my data is different from what the getSymbols() function returns from Yahoo: in this case I only have two columns, a date column and a closing-price column. To sum up, the question is: is there any way to load this type of data into R for rmgarch purposes?
Thanks in advance.
Not sure if this is the issue, but this is how you might go about getting the data from a csv file.
data <- read.csv(file="file/path/data.csv")
head(data) # Take a look at your data
# Do this if you want the data only replacing ColumnName with the proper name
data_only <- data$ColumnName
It looks like the input data for rugarch needs to be an xts vector. So, you might want to take a look at this. You might also want to take a look at ?read.csv.
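For example, here is a rough sketch of turning that two-column file into an xts object. The column names "Date" and "Close" and the day/month/year date format are assumptions about your file, not something rugarch/rmgarch requires; adjust them to match your CSV.
library(xts)

# Read the two-column CSV (date, closing price) exported from investing.com
bills <- read.csv("file/path/data.csv", stringsAsFactors = FALSE)

# Build an xts series indexed by date; rename the column for clarity
bills_xts <- xts(bills$Close,
                 order.by = as.Date(bills$Date, format = "%d/%m/%Y"))
colnames(bills_xts) <- "DE3M"
head(bills_xts)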
I am working on applying the ptw package to my GC-MS wine data. So far I have been able to correctly use this package on the apples example data described in the vignette (MTBLS99). Since I am new to R, I am unable to get my .CDF files into the format used at the start of the vignette. The vignette starts from three data frames (All.pks, All.tics, All.xset), which I assume were generated with the xcms package, but I cannot recreate the specific steps used to format the data this way. Has anyone successfully applied 'ptw' to their LC/GC-MS data? Can someone share the code used to generate the All.pks, All.tics and All.xset data frames?
I apologise if this question has been asked already (I haven't been able to find it). I was under the impression that I could access datasets in R using data(), for example, from the datasets package. However, this doesn't work for time series objects. Are there other examples where this is not the case? (And why?)
data("ldeaths") # no dice
ts("ldeaths") # works
(However, this works for data("austres"), which is also a time-series object).
The data function is designed to load package data sets and all their attributes, time series or otherwise.
I think the issue you're having is that there is no stand-alone data set called ldeaths in the datasets package. ldeaths exists as one of three objects in the UKLungDeaths data set; the other two are fdeaths and mdeaths.
The following should lazily load all three series.
data(UKLungDeaths)
Then, typing ldeaths in the console or using it as an argument in some function will load it.
str(ldeaths)
While it is uncommon for package authors to include multiple objects in one data set, it does happen. This line from the data function documentation gives a heads-up about it:
"For each given data set, the first two types (‘.R’ or ‘.r’, and ‘.RData’ or ‘.rda’ files) can create several variables in the load environment, which might all be named differently from the data set"
That is the case here, as while there are three time series objects contained in the data set, not one of them is named UKLungDeaths.
This happens when the package author uses the save function to write multiple R objects to a single external file. In the wild, I've seen folks use the save function to bundle a description file with the data set, although this would not be the proper way to document something in a full-on package. If you're really curious, go read the documentation for the save function.
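For illustration, this is roughly how several objects end up in one data file; the file name here is made up.
# Hypothetical example: saving three objects into a single .rda file.
# Loading the file later restores fdeaths, mdeaths and ldeaths, but nothing
# named UKLungDeaths.
save(fdeaths, mdeaths, ldeaths, file = "UKLungDeaths.rda")
load("UKLungDeaths.rda")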
Justin
The University of Cape Town makes data available through its DataFirst Portal.
All their data is made available in the following formats:
SAS (sas7bdat)
SPSS
Stata (12)
I would like to import a dataset into R using the haven package, which supports all of the above formats (it utilises the ReadStat library).
Which would be the preferred format for doing this?
More specifically:
Are there differences in terms of data available in the original formats?
Are some formats closer to R's format than others, and does this affect the output?
Are there differences in terms of speed? (less important)
The best way to transfer data between different systems is .csv, as it can be read by all systems without much hassle.
As you only have access to the other formats, there shouldn't be too much difference (given that haven works with all of them).
As to your questions:
I am not aware of any differences in data availability or format compatibility. However, if you want to speed things up, you should probably look into data.table and its fread (it replaces read.table, so it offers no support for the formats mentioned above).
You can read the data like this:
library(haven)
dat <- read_sas("link_to_sas_file")
dat <- read_spss("link_to_spss_file")
dat <- read_stata("link_to_stata_file")
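If you want to check the speed question for yourself, here is a rough sketch; the file names and the column name below are placeholders, not files from the DataFirst portal.
library(haven)

# Rough speed comparison; file names are placeholders
system.time(dat_sas   <- read_sas("dataset.sas7bdat"))
system.time(dat_spss  <- read_spss("dataset.sav"))
system.time(dat_stata <- read_stata("dataset.dta"))

# SPSS and Stata value labels come in as 'labelled' columns; convert one to a
# factor if you prefer R's native representation (column name is hypothetical)
dat_spss$province <- as_factor(dat_spss$province)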
I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of memory you would expect (~800 MB) when I'm working with that data.
I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta from the Stata file, fread from a CSV file, read.spss from the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, that file is less than 100 MB, but on loading it is the same size as before.
Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.
e.g. I have a household-size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt', subset by 'htype'.
R code:
require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(
  svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ])
)
pushes the memory used by R up to 6 GB and takes a very long time:
user system elapsed
3.70 1.75 144.11
The equivalent operation in Stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference in both memory usage (by the object as well as the function) and time taken between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I expect the reason for this is R's limited set of data types. All my data is stored as integer, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store; Stata stores this using its 'byte' data type, while R stores it as 'int', leading to significant inefficiency in large surveys.
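For a quick sense of the 4-bytes-versus-1-byte difference, you can compare object.size() on an integer vector and a raw vector (raw simply stands in here for a 1-byte-per-element storage type; exact sizes vary slightly by platform):
# One million categorical responses coded 1-5
x_int <- sample(1:5, 1e6, replace = TRUE)   # R integer: 4 bytes per element
x_raw <- as.raw(x_int)                      # raw: 1 byte per element
object.size(x_int)   # roughly 4 MB
object.size(x_raw)   # roughly 1 MB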
Regarding the difference in memory usage: you're on the right track, and it's (mostly) because of object types. Storing everything as integer does take up a lot of memory, so setting variable types properly will improve R's memory usage. as.factor() helps if you convert after reading the data (see ?as.factor for details). To fix this while reading from the file, use the colClasses parameter of read.table() (and the analogous arguments of the Stata- and SPSS-specific readers). This helps R store the data more efficiently, since its on-the-fly type guessing is not top-notch.
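A rough sketch of both approaches; the file path "hh.csv" and the assumption that it contains exactly the three columns hhpers, hhwt and htype, in that order, are placeholders taken from your question:
# Fix types while reading: one class per column, in column order
hh <- read.csv("hh.csv", colClasses = c("integer", "numeric", "factor"))

# Or fix types after reading
hh$htype <- as.factor(hh$htype)
str(hh)  # check the resulting column types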
Regarding the second part, calculation speed: parsing large datasets is not base R's strong point, and that's where the data.table package comes in handy. It's fast and quite similar to ordinary data.frame behaviour, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with:
library(data.table)

hh <- as.data.table(hh)
hh[htype == "rural", weighted.mean(hhpers, hhwt)]
## or, by group
hh[, weighted.mean(hhpers, hhwt), by = htype]  # note the 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail on memory usage by the function: most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, data.table would help here, preventing R from making excessive copies and improving memory usage.
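For instance, a data.table column can be converted in place with := rather than by copying the whole table (a small sketch; the column name comes from your question):
# Convert a column in place; no copy of the full table is made
hh[, htype := as.factor(htype)]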
Of interest may also be the memisc package, which, for me, resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).
From the memisc vignette:
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all informations about variable labels, value labels, and user-defined missing values are retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
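For example, here is a minimal sketch of that importer workflow; the file name "hh.sav" and the variable names are placeholders borrowed from the question above:
library(memisc)

# Create an importer object; this reads only the metadata, not the data
imp <- spss.system.file("hh.sav")

# Load just the variables you need into memory
ds <- subset(imp, select = c(hhpers, hhwt, htype))
hh <- as.data.frame(ds)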