How to read data from a csv table in R?

I have a csv file, and I want to extract each column as a string so I can use it with the getSymbols function from the quantmod package.
The csv file looks like this:
AEGR,Aegerion Pharmaceuticals Inc
AKS,AK Steel Holding Corp
ALXA‎,Alexza Pharmaceuticals Inc
CCL‎,Carnival Corporation
CECO‎,Career Education Corp
CDXS‎,Codexis Inc
And I use this code to read the file:
data<-read.csv(file='CAPM/allquotes.csv',header=F)
symbols=gettext(data[,1])
symbol.names=gettext(data[,2])
getSymbols(symbols)
I get this error:
Error in download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot open URL 'http://chart.yahoo.com/table.csv?s=ALXA‎&a=0&b=01&c=2007&d=5&e=16&f=2012&g=d&q=q&y=0&z=ALXA‎&x=.csv'
In addition: Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot open: HTTP status was '404 Not Found'
When I enter the symbols one by one it works fine. I've also noticed that when I move to the end of the last line, the margins seem to be corrupted. In the image you can see the values of 'symbols'; the end of the line sits a few spaces further to the right than it should (you can tell by the color of the initial parenthesis).

Your csv has hidden characters in it, namely a left-to-right mark. Since you are using RStudio, you can remove it with gsub, using "\016" as the value for the pattern argument. Alternatively, instead of removing the hidden character that you don't want, you could keep only the characters that you know you DO want. For example, if your symbols will only ever contain letters and/or numbers, you could use something like gsub("[^A-Za-z0-9]", "", data[, 1]).
data <- read.csv(text="AEGR,Aegerion Pharmaceuticals Inc
AKS,AK Steel Holding Corp
ALXA‎,Alexza Pharmaceuticals Inc
CCL‎,Carnival Corporation
CECO‎,Career Education Corp
CDXS‎,Codexis Inc", header=FALSE)
#data[, 1] <- gsub("\016", "", data[, 1]) #this should work in RStudio
data[, 1] <- gsub("[^A-Za-z0-9]", "", data[, 1]) #but this should work anywhere
symbols=gettext(data[,1])
getSymbols(symbols, src='yahoo')
After you read.csv, you can examine the data object to see that something is amiss.
s <- as.character(data[, 1])
str(s)
#chr [1:6] "AEGR" "AKS" "ALXA""| __truncated__ "CCL""| __truncated__ "CECO""| __truncated__ "CDXS""| __truncated__
str(s[3])
#chr "ALXA""| __truncated__
charToRaw(s[3])
#[1] 41 4c 58 41 e2 80 8e
# Compare what we have to what we think we have
charToRaw("ALXA")
#[1] 41 4c 58 41
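The trailing bytes e2 80 8e are the UTF-8 encoding of U+200E, the left-to-right mark. As a hedged aside (not part of the original answer), you could also strip exactly that code point rather than whitelisting characters:
## Sketch: remove only the left-to-right mark (U+200E) and verify the bytes
s <- gsub("\u200e", "", s, fixed = TRUE)
charToRaw(s[3])
#[1] 41 4c 58 41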

I'm using the Systematic Investor Toolbox, which uses quantmod. Thanks to GSee, the solution came together this way:
source('SystematicInvestorToolbox.r')
load.packages('quantmod')
dates='2012::2012'
data<-read.csv(file='CAPM/allquotes.csv',header=F,stringsAsFactors=F)
data[, 1] <- gsub("[^A-Za-z0-9]", "", data[, 1])
symbols=gettext(data[,1])
symbol.names=gettext(data[,2])
ia=aa.test.create.ia.custom(symbols,symbol.names,dates)
plot.ia(ia,(1:1))
It's worth noting that the left-to-right marks only appear in 'symbols', not when I extract the characters for the names of the quotes in 'symbol.names'.
Thanks for the help.

Related

Use mapply to replace patterns in a vector with replacements in a vector in tm

Hi there: I'm using the tm package for some text analysis and I need to substitute a vector of terms with the paired replacement terms from a vector of replacements. The pattern/replacement dictionary looks like this:
# pattern-replacement dictionary
df<-data.frame(replace=c('crude', 'oil', 'price'), with=c('xcrude', 'xoil', 'xprice'))
#load tm
library(tm)
#load crude
data('crude')
I tried this and received an error:
tm_map(crude, mapply, gsub, df$replace, df$with)
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
Based on this answer, you could use stringi and wrap it in content_transformer() to preserve the corpus structure:
library(stringi)
corp <- tm_map(crude, content_transformer(
  function(x) {
    stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
  })
)
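Setting vectorize_all = FALSE makes stringi apply every pattern/replacement pair in turn to each document, instead of recycling the vectors element-wise, which is what lets a single call cover the whole dictionary.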
Or multigsub from qdap:
library(qdap)
corp <- tm_map(crude, content_transformer(
  function(x) {
    multigsub(df$replace, df$with, x, fixed = FALSE)
  })
)
Which gives:
> corp[[1]][1]
"Diamond Shamrock Corp said that\neffective today it had cut its
contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to
16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices
and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
Diamond is the latest in a line of U.S. xoil companies that\nhave
cut its contract, or posted, xprices over the last two
days\nciting weak xoil markets.\n Reuter"
You can then apply other tm functions to the resulting corpus:
> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity : 91%
#Maximal term length: 17
#Weighting : term frequency (tf)

DocumentTermMatrix error on Corpus argument

I have the following code:
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)
news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
When I run the DocumentTermMatrix() method, it gives me this error:
Error: inherits(doc, "TextDocument") is not TRUE
Why do I get this error? Are my rows not text documents?
Here is the output upon inspecting corpus_clean:
[[153]]
[1] obama holds technical school model us
[[154]]
[1] oil boom produces jobs bonanza archaeologists
[[155]]
[1] islamic terrorist group expands territory captures tikrit
[[156]]
[1] republicans democrats feel eric cantors loss
[[157]]
[1] tea party candidates try build cantor loss
[[158]]
[1] vehicles materials stored delaware bridges
[[159]]
[1] hill testimony hagel defends bergdahl trade
[[160]]
[1] tweet selfpropagates tweetdeck
[[161]]
[1] blackwater guards face trial iraq shootings
[[162]]
[1] calif man among soldiers killed afghanistan
[[163]]
[1] stocks fall back world bank cuts growth outlook
[[164]]
[1] jabhat alnusra longer useful turkey
[[165]]
[1] catholic bishops keep focus abortion marriage
[[166]]
[1] barbra streisand visits hill heart disease
[[167]]
[1] rand paul cantors loss reason stop talking immigration
[[168]]
[1] israeli airstrike kills northern gaza
Edit: Here is my data:
type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods
Which I read in like this:
news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)
Edit: Here is the traceback():
> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)
When I evaluate inherits(corpus_clean, "TextDocument") it is FALSE.
It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that functions like tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They instead return characters, and DocumentTermMatrix doesn't know how to handle a corpus of characters.
So you could change to:
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
Or you can run:
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
after all of your non-standard transformations (those not in getTransformations()) are done, just before you create the DocumentTermMatrix. That should make sure all of your data is stored as PlainTextDocuments, which should keep DocumentTermMatrix happy.
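Putting the two fixes together, a minimal sketch of the whole pipeline under tm >= 0.6.0 (assuming news_raw has been read in as in the question) would be:
library(tm)
trim <- function(x) gsub("^\\s+|\\s+$", "", x)  # strip leading/trailing whitespace
news_corpus  <- Corpus(VectorSource(news_raw$text))
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))  # wrap the custom transform too
news_dtm <- DocumentTermMatrix(corpus_clean)
Wrapping the custom trim in content_transformer() keeps each document a PlainTextDocument, so the extra PlainTextDocument pass should not be needed here.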
I found a way to solve this problem in an article about tm. An example that reproduces the error follows:
getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
matrix_terms <- DocumentTermMatrix(corpus)
Warning messages:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
This error occurs because DocumentTermMatrix needs a corpus built from a VectorSource, but the previous transformations turn the texts in your corpus into plain character vectors, i.e. a class the function does not accept.
However, if you wrap each transformation in content_transformer inside the tm_map call, you may not need any further command before using TermDocumentMatrix.
The code below changes the class (see the second-to-last line) and avoids the error:
getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class
matrix_term <- DocumentTermMatrix(corpus)
Change this:
corpus_clean <- tm_map(news_corpus, tolower)
To this:
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
This should work.
Alternatively, downgrading to tm 0.5-10, which appears to have done the conversion automatically, side-steps the change:
remove.packages("tm")
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip", repos=NULL)
library(tm)

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to scrape the NYT and WSJ and build a data frame with the number of articles on a given topic per year. That is:
     NYT  WSJ
2011   2    3
2012  10    7
I found this tutorial for the NYT, but it is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (an NYT API key is needed: http://developer.nytimes.com/apps/register)
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn=FALSE) # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
# concatenate URL for each offset
uri <- paste("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn=FALSE)
res <- fromJSON(raw.data)
dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't compare POSIX date-times with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot(freqs, type="l", xaxt="n", main=paste("Search term(s):", q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is an RNYTimes package created by Duncan Temple-Lang (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, but that's not relevant for you.
I'm rewriting RNYTimes right now... Install from GitHub. You need to install devtools first to get install_github:
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g.:
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you the number of articles found:
moocs$response$meta$hits
[1] 121
You could get word counts for each article with:
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
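To get from here to the per-year counts asked about, one hedged sketch (it assumes each element of moocs$response$docs carries a pub_date field, which the v2 Article Search JSON appears to include; this is not part of the original answer):
## Tabulate articles per year from the publication dates.
## 'pub_date' is an assumption about the v2 response structure.
pub_dates <- sapply(moocs$response$docs, "[[", "pub_date")
table(substr(pub_dates, 1, 4))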

Reading large fixed format text file in r

I am trying to read a large (> 70 MB) fixed-format text file into R. For a smaller file (< 1 MB), I can use the read.fwf() function as shown below.
condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
When I try to run the line of code below,
condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
I get the following error message:
Error: cannot allocate vector of size 2 Kb
The only difference between the 2 lines is the size of the input file.
The formatting for the file I want to import is given in the dataframe called testcsv3. I show a small snippet of the dataframe below:
> head(testcsv3)
Varlen Varname Varclass Varsep Varforfmt
1 2 "V1" "character" 2 "A2.0"
2 15 "V2" "character" 17 "A15.0"
3 28 "V3" "character" 45 "A28.0"
4 3 "V4" "character" 48 "F3.0"
5 1 "V5" "character" 49 "A1.0"
6 3 "V6" "character" 52 "A3.0"
At least part of my problem is that read.fwf() reads in all the data as factors, and I end up exceeding the memory limit on my computer.
I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.
http://data.princeton.edu/R/readingData.html
However, when I use the command below:
condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)
I get the following error message:
Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
Finally, I tried to use:
condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)
but I get the following message:
Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
All I am trying to do at this point is format the data as something other than factors as they come into R. I am hoping this will limit the amount of memory I am using and allow me to actually read the file. I would appreciate any suggestions about how to do this. I know the Fortran formats for all the variables and the column at which each variable begins.
Thank you,
Warren
Maybe this code works for you. You have to fill varlen with the field sizes and add the corresponding type strings (e.g. "numeric", "character", "integer") to colclasses:
my.readfwf <- function(filename, varlen, colclasses) {
  sidx <- cumsum(c(1, varlen[1:(length(varlen)-1)]))
  eidx <- sidx + varlen - 1
  filecontent <- scan(filename, character(0), sep="\n")
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  nlines <- length(filecontent)
  res <- list()
  for (i in seq_along(varlen)) {
    res[[i]] <- sapply(filecontent, substring, first=sidx[i], last=eidx[i])
    mode(res[[i]]) <- colclasses[i]
  }
  attributes(res) <- list(names=paste("V", seq_along(res), sep=""),
                          row.names=seq_along(res[[1]]),
                          class="data.frame")
  return(res)
}
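A hypothetical call for the data in the question, reusing the format data frame testcsv3 (a usage sketch, not part of the original answer, assuming the class strings are plain values like "character"):
## Hedged usage sketch: widths and classes come from testcsv3
condodattest1 <- my.readfwf(impfile,
                            varlen = testcsv3$Varlen,
                            colclasses = as.character(testcsv3$Varclass))
Because each column's mode is set explicitly, nothing is converted to a factor, which is what was blowing past the memory limit with read.fwf().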

Reading sdmx-xml files into a dataframe in R

I was wondering if anyone has managed to read SDMX-XML files into a dataframe. The file I’d like to read is https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml (1mb).
I saved the file as “pensions_funds.xml” to the pwd and tried to use the XML package to read it:
fileName <- system.file("pensions", "pensions_funds.xml", package="XML")
parsed<-xmlTreeParse("pension_funds.xml",getDTD=F)
r<-xmlRoot(parsed)
tmp = xmlSApply(r, function(x) xmlSApply(x, xmlValue))
The few lines above basically follow the example here: http://www.omegahat.org/RSXML/gettingStarted.html
but I think I would first need to somehow ignore the header (I have pasted below the first couple of pages of the file I'm trying to read). So the above might work, but it starts from the wrong node for my purposes. I would like to grab the obs_values, indexed by their time_period and ref_area.
The first thing would be to find the right node and start there; however, I suspect I might be on a fool's errand, since I have limited knowledge of data formats and I'm not sure the XML package can be used for SDMX-XML files. Smarter people appear to have tried to do this:
http://opensdmxdevelopers.wikispaces.com/RSDMX
I can't find this package for download on its homepage here
https://r-forge.r-project.org/projects/rsdmx/
(I can't see any link/download section, but maybe I'm blind), and it seems to be at an early stage. The existence of rsdmx suggests that reading SDMX with the XML package might not be easy, so I'm ready to give up at this stage unless anyone has had success with this. Actually, I'm mainly interested in reading this file:
http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml
But this is a 10mb file so I was starting smaller.
Edit 3:
Attempting sgibb's answer on the large file, using the changes from Mischa's comment:
library("XML")
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
sdmxHandler <- function() {
## data.frame which stores results
data <- data.frame(stringsAsFactors=FALSE)
## counter to store current row
i <- 1
## temp value to store current REF_AREA
## temp value to store current REF_AREA
refArea <- NA
bsItem <- NA
bsCountSector <- NA
## handler subroutine for Obs tag
Obs <- function(name, attr) {
## found an Obs tag and now fill data.frame
data[i, "refArea"] <<- refArea
data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
data[i, "bsItem"] <<- bsItem
data[i, "bsCountSector"] <<- bsCountSector
i <<- i + 1
}
## handler subroutine for Series tag
Series <- function(name, attr) {
refArea <<- attr["REF_AREA"]
bsItem <<- as.character(attr["BS_ITEM"])
bsCountSector <<- as.numeric(attr["BS_ITEM"])
}
return(list(getData=function() {return(data)},
Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Edit 2:
sgibb's answer looks ideal and works perfectly on the smaller file. I tried to run it on
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
(the 10 MB file; original link corrected), with the only modification being the addition of two lines:
data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])
data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])
(these are additional id variables which are needed to identify a row in this larger dataset).
It ran for a few minutes then finished with this error:
Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The data format seems very similar, so I thought this might work. The basic format of the 10 MB file is as below:
<Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
<Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>
Edit 1:
Desired data format:
Ref_area  time_period  obs_value
At        2006         118
At        2007         119
…
Be        2006         101
…
Here’s the first bit of the data.
</Header>
<DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D
RSDMX seems to be in an early development state. IMHO there is no package available yet. But you could easily implement it on your own using the XML package. I would suggest using xmlEventParse (see ?xmlEventParse for details):
EDIT: adapt example to changed requirements of outstanding_amounts.xml
EDIT2: add download.file
library("XML")
#url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url, tmp)
sdmxHandler <- function() {
## data.frame which stores results
data <- data.frame(stringsAsFactors=FALSE)
## counter to store current row
i <- 1
## temp value to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
refArea <- NA
bsItem <- NA
bsCountSector <- NA
## handler subroutine for Obs tag
Obs <- function(name, attr) {
## found an Obs tag and now fill data.frame
data[i, "refArea"] <<- refArea
data[i, "bsItem"] <<- bsItem
data[i, "bsCountSector"] <<- bsCountSector
data[i, "timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"], "-01", sep=""), format="%Y-%m-%d")
data[i, "obsValue"] <<- as.double(attr["OBS_VALUE"])
## update current row
i <<- i + 1
}
## handler subroutine for Series tag
Series <- function(name, attr) {
refArea <<- attr["REF_AREA"]
bsItem <<- attr["BS_ITEM"]
bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
}
return(list(getData=function() {return(data)},
Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(tmp, handlers=sdmxHandler())$getData()
head(df)
# refArea bsItem bsCountSector timePeriod obsValue
#1 DE A20 2210 12053 39.6
#2 DE A20 2210 12084 46.1
#3 DE A20 2210 12112 50.2
#4 DE A20 2210 12143 52.0
#5 DE A20 2210 12173 52.3
#6 DE A20 2210 12204 47.3
The package rsdmx allows you to read SDMX-ML files and coerce them into a data.frame. It is now hosted on GitHub and currently available on CRAN, but you can also install it easily from GitHub with the following:
require("devtools")
install_github("rsdmx", "opensdmx")
Applying to your data, you can do the following:
sdmx <- readSDMX("http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml")
df <- as.data.frame(sdmx)
More examples are given in the rsdmx wiki.
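From there, getting to the three columns requested in the question is a subset; a hedged sketch (the column names below are assumptions, since they depend on the dataflow's dimension names, so check names(df) first):
## Column names are assumptions -- inspect names(df) and adjust
head(df[, c("REF_AREA", "TIME_PERIOD", "OBS_VALUE")])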
Note that rsdmx currently loads the xml object into R, as a slot of the SDMX R objects it instantiates. In the future, we would like to investigate how rsdmx can use xmlEventParse (as suggested above by @sgibb) to read very large datasets.
library(XML)
xmlparsed <- xmlParse(file(url))
## obtain dataset node
series_data <- getNodeSet(xmlparsed, "//Series")
if (length(series_data) == 0) {
  datasetnode <- xmlChildren(xmlChildren(xmlparsed)[[1]])[[2]]
  series_data <- xmlChildren(datasetnode)[names(xmlChildren(datasetnode)) == "Series"]
}
## prepare dataset
dataset.frame <- data.frame(matrix(ncol=3))
colnames(dataset.frame) <- c('REF_AREA', 'TIME_PERIOD', 'OBS_VALUE')
## loop over data
counter <- 1
for (i in 1:length(series_data)) {
  if ('Obs' %in% names(xmlChildren(series_data[[i]]))) { ## ignore empty //Series nodes
    for (j in 1:length(xmlChildren(series_data[[i]]))) {
      dataset.frame[counter, 1] <- xmlAttrs(series_data[[i]])['REF_AREA']
      dataset.frame[counter, 2] <- xmlAttrs(series_data[[i]][[j]])['TIME_PERIOD']
      dataset.frame[counter, 3] <- xmlAttrs(series_data[[i]][[j]])['OBS_VALUE']
      counter <- counter + 1
    }
  }
}
head(dataset.frame, 5)
