Is it possible to load a list of frequent subsequences from a .txt file, and make TraMineR recognize it as a sequence object?
Unfortunately I don't have the raw data, therefore I am not able to recreate the analysis. The only file what I have is a .txt file containing the frequent subsequences. I assume it was created with the seqefsub() function from the TraMineR package, with maxGap=2, because the data looks like as an output of the mentioned function.
read.table() reads it as a data frame but as far as I understood, TraMineR handles event sequences as lists with many additional attributes, that for example are not contained in this file. Or I don't know how to extract them...
This is how the a couple of lines from the .txt file look like:
Subsequence Support Count
16 (WT4)-(WT3) 0.76666667 805
17 (WL2) 0.76380952 802
18 (S1) 0.76000000 798
19 (FRF,WL2) 0.74380952 781
20 (WT2)-(WT1) 0.70571429 741
To create an event sequence object from the (text) subsequences, you have to transform them into vertical time stamped event (TSE) form. The function below does the job for your data
## Function subseq.to.TSE
## puts the sequences into TSE format using
## position as timestamp
## sdf: a data frame with columns Id, Subsequence, Support and Count.
subseq.to.TSE <- function(sdf){
tse <- data.frame(id=0, event="", time=0)
k <- 0
for (i in 1:nrow(sdf)){
id <- sdf[i,"Id"]
s <- sdf[i,"Subsequence"]
ss <- gsub("\\(","",s)
ss <- gsub("\\)","",ss)
# split transitions
st <- strsplit(ss, split="-")[[1]]
for (j in 1:length(st)){
stt <- strsplit(st[j], split=",")[[1]]
for(jj in 1:length(stt)){
k <- k+1
tse[k,1] <- id
## parsing for simultaneous events
if (!(stt[jj] %in% levels(tse[,2])))
{levels(tse[,2]) <- c(levels(tse[,2]),stt[jj])}
tse[k,2] <- stt[jj]
tse[k,3] <- j
}
}
}
return(tse)
}
Here is how you would use it on the example data.
We first create the data frame that we name s.df
s.df <- data.frame(scan(what=list(Id=0, Subsequence="", Support=double(), Count=0)))
16 (WT4)-(WT3) 0.76666667 805
17 (WL2) 0.76380952 802
18 (S1) 0.76000000 798
19 (FRF,WL2) 0.74380952 781
20 (WT2)-(WT1) 0.70571429 741
# leave a blank line to end the scan
Then we extract the TSE data from s.df and create from it the event sequence object using seqecreate. Finally, we assign the counts as sequence weights.
s.tse <- subseq.to.TSE(s.df)
seqe <- seqecreate(s.tse)
seqeweight(seqe) <- s.df[,"Count"]
Now you can for instance plot the event sequences with
seqpcplot(seqe)
Related
I have the following MWE, although the datasets are unavailable:
N <- 84 #Number of datasets to pull data from
dates <- c("2010.01", "2010.02", "2010.03", "2010.04", "2010.05", "2010.06", "2010.07", "2010.08",
"2010.09", "2010.10", "2010.11", "2010.12", "2011.01", "2011.02", "2011.03", "2011.04", "2011.05",
"2011.06", "2011.07", "2011.08", "2011.09", "2011.10", "2011.11", "2011.12", "2012.01", "2012.02",
"2012.03", "2012.04", "2012.05", "2012.06", "2012.07", "2012.08", "2012.09", "2012.10", "2012.11",
"2012.12", "2013.01", "2013.02", "2013.03", "2013.04", "2013.05", "2013.06", "2013.07", "2013.08",
"2013.09", "2013.10", "2013.11", "2013.12", "2014.01", "2014.02", "2014.03", "2014.04", "2014.05",
"2014.06", "2014.07", "2014.08", "2014.09", "2014.10", "2014.11", "2014.12", "2015.01", "2015.02",
"2015.03", "2015.04", "2015.05", "2015.06", "2015.07", "2015.08", "2015.09", "2015.10", "2015.11",
"2015.12", "2016.01", "2016.02", "2016.03", "2016.04", "2016.05", "2016.06", "2016.07", "2016.08",
"2016.09", "2016.10", "2016.11", "2016.12") #list of all dates to loop through
A <- list() #empty list to store excel files
for (k in seq_along(dates)) {
A[k] <- read_excel(paste0("~/R/data.", dates[k], ".xlsx"), range = "B3:EO94")
}
Total <- array(unlist(A), dim=c(91,144,84))
First <- read_excel(paste0("~R/data.", dates[1], ".xlsx"), range="B3:EO94")
This gives me Total the 3-dimensional array and then First which should be the first "slice" of the array. So if I take some arbitrary coordinate, say 15,34, then I should be able to pull the exact some value from both Total and First so I try the following:
> Total[15,34,1]
[1] 0.000392432
> A1[15,34]
# A tibble: 1 x 1
`-97.5`
<dbl>
1 0.000384
The 0.000384 is the proper number found in the excel file from AI18 and the number given from Total is incorrect. What gives? To further double-check I compared Total[15,34,2] with the second "slice" and alas, the same incorrect result from Total.
Try using the double square brackets, A[[k]] to assign the data from the Excel files.
A <- list() #empty list to store excel files
for (k in seq_along(dates)) {
A[[k]] <- read_excel(paste0("~/R/data.", dates[k], ".xlsx"), range = "B3:EO94")
}
I have a 159 Ncdf4 files with 56 ensembles in each file. I want to pull out ensemble 1 from each of the 159 input files. Then produce a single NCDF4 file with all the ensemble 1 in a single file. My code is below. My problem is that only data the last file of the 159 is written to the output file. I think I am missing a nested loop, but not sure and my attempts have failed.
rm(list=ls())
library(ncdf.tools)
library(ncdf4)
library(ncdf4.helpers)
library(RNetCDF)
setwd("D:/Rwork/Project") # set working folder
#####Write NCDF4 files#############################################
dir("D:/Rwork/Project/Test")->xlab # This is the directory where the file for analysising are
filelist <- paste("Test/",dir("Test"),sep="")
N <- length(filelist) # Loop over the individual files
for(j in 1:N){
File<-nc_open(filelist[j])
print(filelist[j])
Temperature<-ncvar_get(File,"t2m")
Lat<-ncvar_get(File, "lat")
Lon<-ncvar_get(File,"lon")
Time<-ncvar_get(File,"time")
EnsambleNo.<-ncvar_get(File,"ensemble_member")
Temperature
Ensamble1<-Temperature[,,1,] #The Ensamble wanted, 1 to 56
Ensamble1<-round(Ensamble1,digits = 0)
tunits<-"hours since 1800-01-01 00:00:00"
#Define dimensions
##################################################################
londim<-ncdim_def("Lon","degrees_east",as.double(Lon))
latdim<-ncdim_def("Lat", "degrees_north",as.double(Lat))
timedim<-ncdim_def("Time",tunits,as.double(Time))
#Define variables
##################################################################
fillvalue<-1e32
dlname<-"tm2"
tmp_def<-ncvar_def("Ensamble1","deg_K", list(londim,latdim,timedim),fillvalue,dlname,prec = "double")
ncfname<-("D:/Rwork/Project/TrialEnsamble/TrialEnsamble.nc")
ncout<-nc_create(ncfname,list(tmp_def),force_v4=T)
ncvar_put(ncout,tmp_def,Ensamble1,start=NA,count = NA )# Think I need a nested loop here
ncatt_put(ncout,"Lon","axis","X")
ncatt_put(ncout, "Lat", "axis", "Y")
ncatt_put(ncout, "Time","axis", "T")
title<-c( 1:2 )
names(title)<-c("Ian","Gillespie")
title<-as.data.frame(title)
ncatt_put(ncout,0,"Make_NCDF4_File",1, prec="int")
ncatt_put(ncout,0,"Maynooth_University",1,prec="short")
ncatt_put(ncout,0,"AR000087828",1, prec="short")
ncatt_put(ncout,0,"mickymouse",1, prec="short")
history <- paste("P.J. Bartlein", date(), sep=", ")
ncatt_put(ncout,0,"description","this is the script to write NCDF4 files")
#Close file and write date to disk
##########################################################
nc_close(ncout)
}
Found it is better to prepare an empty array of four dimensions with the first 3 the same size and names dimensions as the array produced from the for loop with an additional fourth dimension. The forth dimension to hold the results of each iteration
dir("D:/Rwork/Project/Test")->xlab # This is the directory where the file for analysising are
filelist <- paste("Test/",dir("Test"),sep="")
output <- array(, dim=c(192,94,12,160))# need to change this to length(Lat), etc
N <- length(filelist) # Loop over the individual files
for(j in 1:N){
File<-nc_open(filelist[j])
print(filelist[j])
Temperature<-ncvar_get(File,"t2m")
Lat<-ncvar_get(File, "lat")
Lon<-ncvar_get(File,"lon")
Time<-ncvar_get(File,"time")
Year<-c(1851:2010)
EnsambleNo.<-ncvar_get(File,"ensemble_member")
Temperature
Lat<-round(Lat,digits = 2)
Lon<-round(Lon,digits = 2)
Ensamble1<-Temperature[,,1,] #The Ensamble wanted, 1 to 56
Ensamble1<-round(Ensamble1,digits = 1)
dimnames(Ensamble1)<-list(Lon,Lat,Time)
dimnames(output) <- list(Lon,Lat,Time,Year)
}
print(Ensamble1)
I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info).
For example:
library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
write(paste(i, 'is the current iteration'), myfile, append=T)
z = matrix(runif(nx*ny), nrow = ny) # random numbers with dim(nx, ny)
write.fwf(z, myfile, append=T, rownames=F, colnames=F) #write in fixed width format
}
With nx=5 and ny=2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given the output is regular, I thought readr would be a good idea (?).
The only way I can think of, is to do it manually by chunks in order to eliminate the useless info lines:
library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
z = as.vector(t(z))
ifirst = (i-1)*ny*nx + 1 # appropriate index
ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in DF for future analysis:
x = rep(rep(seq(1:nx), ny), niter)
y = rep(rep(seq(1:ny), each=nx), niter)
myDF = data.frame(x=x, y=y, z=z)
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there no reading function accepting a vector with precise locations as skip argument, rather than a single number of initial rows?
PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...
EDIT
The following solution (reading all lines with readLines and removing the undesirable ones and then processing the rest) is a faster alternative with niter very high:
bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines] # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # Then process output from readLines
library(data.table)
ztot <- fread(asOneVector)
ztot <- c(t(ztot))
Discussion on how to proceed results from the readLines can be found here
Pre-processing the file with a command line tool (i.e., not in R) is actually way faster. For example with awk:
tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand) # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))
Lines can be removed on the basis of a pattern or of indices for example.
This was suggested by #Roland from here.
Not sure if I still understood your problem correctly. Running your script created a file with 1310 lines. With This is iteration 1or2or3 printed at lines
Line 1: This is iteration 1
Line 132: This is iteration 2
Line 263: This is iteration 3
Line 394: This is iteration 4
Line 525: This is iteration 5
Line 656: This is iteration 6
Line 787: This is iteration 7
Line 918: This is iteration 8
Line 1049: This is iteration 9
Line 1180: This is iteration 10
Now there is data between these lines that you want to read and skip this 10 strings.
You can do this by tricking read.table saying your comment.char is "T" which will make read.table thinks all lines starting with letter "T" are comments and will skip those.
data<-read.table("bigFile.txt",comment.char = "T")
this will give you a data.frame of 1300 observations with 150 variables.
> dim(data)
[1] 1300 150
For a non-consisted strings. Read your data with read.table with fill=TRUE flag. This will not break your input process.
data<-read.table("bigFile.txt",fill=TRUE)
Your data looks like this
> head(data)
V1 V2 V3 V4 V5 V6 V7
1: 1.0000000 is the current iteration NA NA
2: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
3: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
4: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
5: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
6: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
Now if you see how the strings are distributed in columns. Now you can simply subset your data set with pattern matching. Matching columns that match these strings. For example
library(data.table)
data<-as.data.table(data)
cleaned_data<-data[!(V3 %like% "the"),]
> head(cleaned_data)
V1 V2 V3 V4 V5 V6 V7
1: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
2: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
3: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
4: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
5: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
6: 0.3943193 0.508373900 0.2131134905 0.92474343 0.432134031 0.4585807 0.9811607
I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
NYT WSJ
2011 2 3
2012 10 7
I found this tutorial for the NYT but is not working for me :_(. When I get to line 30 I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (A NYT api key is needed http://developer.nytimes.com/apps/register)
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
# concatenate URL for each offset
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F")
res <- fromJSON(raw.data)
dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# cant' seem to be able to compare Posix objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is a RNYTimes package created by Duncan Temple-Lang https://github.com/omegahat/RNYTimes - but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, but not relevant for you.
I'm rewiring RNYTimes right now...Install from github. You need to install devtools first to get install_github
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g,
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you number of articles found
moocs$response$meta$hits
[1] 121
You could get word counts for each article by
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
I was wondering if anyone has managed to read SDMX-XML files into a dataframe. The file I’d like to read is https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml (1mb).
I saved the file as “pensions_funds.xml” to the pwd and tried to use the XML package to read it:
fileName <- system.file("pensions", "pensions_funds.xml", package="XML")
parsed<-xmlTreeParse("pension_funds.xml",getDTD=F)
r<-xmlRoot(parsed)
tmp = xmlSApply(r, function(x) xmlSApply(x, xmlValue))
The few lines above basically follow the example here http://www.omegahat.org/RSXML/gettingStarted.html
but I think I would first need to somehow ignore the header (I have pasted below the first couple of pages of the file I’m trying to read). So I think the above might work but it starts from the wrong node for my purposes. I would like to grab the obs_values, indexed by their time_period and ref_area.
The first thing would be to find the right node and start there however I suspect I might be on a fool’s errand since I have limited knowledge of data formats and I’m not sure the XML package can be used for SDMX-XML files. Smarter people appear to have tried to do this
http://opensdmxdevelopers.wikispaces.com/RSDMX
I can’t find this package for download on its homepage here
https://r-forge.r-project.org/projects/rsdmx/
(I can’t see any link/download section but maybe I’m blind) and it seems to be early stages. The existence of the rsdmx suggests using the xml package to read sdmx might not be easy so I’m ready to give up at this stage unless anyone has had success with this. Actually I’m mainly interested in reading this file
http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml
But this is a 10mb file so I was starting smaller.
edit3
attempting sgibb's answer on large file using changes in Mischa's comment
library("XML")
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
sdmxHandler <- function() {
## data.frame which stores results
data <- data.frame(stringsAsFactors=FALSE)
## counter to store current row
i <- 1
## temp value to store current REF_AREA
## temp value to store current REF_AREA
refArea <- NA
bsItem <- NA
bsCountSector <- NA
## handler subroutine for Obs tag
Obs <- function(name, attr) {
## found an Obs tag and now fill data.frame
data[i, "refArea"] <<- refArea
data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
data[i, "bsItem"] <<- bsItem
data[i, "bsCountSector"] <<- bsCountSector
i <<- i + 1
}
## handler subroutine for Series tag
Series <- function(name, attr) {
refArea <<- attr["REF_AREA"]
bsItem <<- as.character(attr["BS_ITEM"])
bsCountSector <<- as.numeric(attr["BS_ITEM"])
}
return(list(getData=function() {return(data)},
Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
edit2:
the answer from sgibb looks ideal and works perfectly on the smaller file. I tried to run it on
url <- http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml
(the 10mb file, original link corrected), with the only modification being the addition of two lines:
data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])
data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])
(these are additional id variables which are needed to identify a row in this larger dataset).
It ran for a few minutes then finished with this error:
Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The basic format of the data seems very similar so I thought this might work. The basic format of the 10mb file is as below:
<Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
<Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>
edit1:
desired data format:
Ref_area time_period obs_value
At 2006 118
At 2007 119
…
Be 2006 101
…
Here’s the first bit of the data.
</Header>
DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D
RSDMX seems to be in an early development state. IMHO there is no package available yet. But you could easily implement it on your own using the XML package. I would suggest to use xmlEventParse (see ?xmlEventParse for details):
EDIT: adapt example to changed requirements of outstanding_amounts.xml
EDIT2: add download.file
library("XML")
#url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url, tmp)
sdmxHandler <- function() {
## data.frame which stores results
data <- data.frame(stringsAsFactors=FALSE)
## counter to store current row
i <- 1
## temp value to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
refArea <- NA
bsItem <- NA
bsCountSector <- NA
## handler subroutine for Obs tag
Obs <- function(name, attr) {
## found an Obs tag and now fill data.frame
data[i, "refArea"] <<- refArea
data[i, "bsItem"] <<- bsItem
data[i, "bsCountSector"] <<- bsCountSector
data[i, "timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"], "-01", sep=""), format="%Y-%m-%d")
data[i, "obsValue"] <<- as.double(attr["OBS_VALUE"])
## update current row
i <<- i + 1
}
## handler subroutine for Series tag
Series <- function(name, attr) {
refArea <<- attr["REF_AREA"]
bsItem <<- attr["BS_ITEM"]
bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
}
return(list(getData=function() {return(data)},
Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(tmp, handlers=sdmxHandler())$getData()
head(df)
# refArea bsItem bsCountSector timePeriod obsValue
#1 DE A20 2210 12053 39.6
#2 DE A20 2210 12084 46.1
#3 DE A20 2210 12112 50.2
#4 DE A20 2210 12143 52.0
#5 DE A20 2210 12173 52.3
#6 DE A20 2210 12204 47.3
The package rsdmx allows you to read SDMX-ML files and coerce them as data.frame. It is now hosted at Github, and currently available in CRAN, but in case you can install easily it from GitHub with the following:
require("devtools")
install_github("rsdmx", "opensdmx")
Applying to your data, you can do the following:
sdmx <- readSDMX("http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml")
df <- as.data.frame(sdmx)
More examples are given in the rsdmx wiki
Note that its functionalities currently load the xml object into R, as a slot part of the SDMX R objects instantiated by rsdmx. In the future, we would like to investigate how rsdmx can use xmlEventParse (as suggested above by #sgibb) to read very large datasets.
library(XML)
xmlparsed <- xmlParse(file(url))
## obtain dataset node::
series_data <- getNodeSet(xmlparsed, "//Series")
if(length(series_data)==0){
datasetnode <- xmlChildren( xmlChildren(xmlparsed)[[1]])[[2]]
series_data<-xmlChildren(datasetnode)[ names(xmlChildren(datasetnode))=="Series"]
}
## prepare dataset
dataset.frame <- data.frame(matrix(ncol=3))
colnames(dataset.frame) <- c('REF_AREA', 'TIME_PERIOD', 'OBS_VALUE')
## loop over data
counter=1
for (i in 1: length(series_data)){
if('Obs'%in%names(xmlChildren(series_data[[i]])) ){ ## To ignore empty //Series nodes
for (j in 1: length(xmlChildren(series_data[[i]]))){
dataset.frame[counter,1] <- xmlAttrs(series_data[[i]])['REF_AREA']
dataset.frame[counter,2] <- xmlAttrs(series_data[[i]][[j]])['TIME_PERIOD']
dataset.frame[counter,3] <- xmlAttrs(series_data[[i]][[j]])['OBS_VALUE']
counter=counter+1
}
}
}
head(dataset.frame,5)