R Load XML to dataframe, and include attributes

I am having trouble loading an XML file into an R data frame.
This is my XML structure [the data is made up]:
<?xml version="1.0" encoding="UTF-8"?>
<CancerExtract>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Jones</PatSurname>
        <PatFirstName>John</PatFirstName>
        <PatSecondName>Peter</PatSecondName>
      </PatientName>
      <PatientDetail Sex="1" IndigStatus="12">
        <DOB>01012000</DOB>
        <MedicareNo>xxxx776xxx66xx</MedicareNo>
        <COB>1101</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 Address Rd</StreetAddr>
        <Suburb>AwesomeCity</Suburb>
        <Postcode>ZZ304</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor>
        <TDSurname>Doctor</TDSurname>
        <TDFirstName>The Good</TDFirstName>
        <TDAddress>FixemUp ct</TDAddress>
        <TDMediProvidNo>DR0001</TDMediProvidNo>
      </TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>FixMeUp</HospitalName>
        <CampusCode>0000</CampusCode>
        <URN>123456</URN>
        <AdmissionDate>01012020</AdmissionDate>
        <DischargeDate>03012020</DischargeDate>
      </HospitalEpisode>
      <TumourDetail Grade="1" ECOG="9">
        <DiagnosisDate>01012050</DiagnosisDate>
        <PrimarySite>C61</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>8 8 7 10 3</Investigations>
        <AdditInfo>Some free text can be available here</AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>MyGP</GPSurname>
        <GPFirstName>Peter</GPFirstName>
        <GPAddress>100 GP street</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Some name</RegName>
        <RegDate>05122021</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Pt2</PatSurname>
        <PatFirstName>Frits</PatFirstName>
        <PatSecondName/>
      </PatientName>
      <PatientDetail Sex="4" IndigStatus="22" SomeOtherVariable="random value">
        <DOB>12121834</DOB>
        <MedicareNo>xxxxxxxx00001</MedicareNo>
        <COB>1201</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 church street</StreetAddr>
        <Suburb>Cityname Here</Suburb>
        <Postcode>7777YY</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor><!-- children collapsed in the original viewer --></TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>HospitalName two </HospitalName>
        <CampusCode>2166192</CampusCode>
        <URN>10REWR8XX640</URN>
        <AdmissionDate>23122025</AdmissionDate>
        <DischargeDate>23122027</DischargeDate>
      </HospitalEpisode>
      <TumourDetail EstDateFlag="1" PriorDiagFlag="Y" Laterality="8">
        <DiagnosisDate>01121812</DiagnosisDate>
        <WhereDiagnosed>At home</WhereDiagnosed>
        <PrimarySite>C9000</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>7 3 1</Investigations>
        <MetSite>C792 C788</MetSite>
        <AdditInfo>This is a second record. </AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>Jones</GPSurname>
        <GPFirstName>John</GPFirstName>
        <GPAddress>Test street 12 Unit 1</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Me Myself and I</RegName>
        <RegDate>01011801</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
</CancerExtract>
I created this R function to load the file and extract all data:
load_XML_File <- function(file){
  load <- tryCatch(
    expr = { xml2::read_xml(file) },
    warning = function(warning_condition) {
      message(paste("\n\n\nWarning loading file: ", file))
      message("\nHere's the original warning message:\n")
      message(warning_condition)
      return(NA)
    },
    error = function(error_condition) {
      message(paste("\n\n\nError loading file: ", file))
      message("\nHere's the original error message:\n")
      message(error_condition)
      return(NA)
    },
    finally = {
      message(paste0("\nLoaded file ", file))
    }
  )

  PerPt <- xml2::xml_find_all(load, ".//CancerRegRec")
  tmp <- xml2::as_list(PerPt)

  if(length(tmp) == 0){out <- NA}
  if(length(tmp) >= 1){
    for(i in seq_along(tmp)){
      tt <- data.frame(t(data.frame(unlist(tmp[i]))))
      rownames(tt) <- NULL
      if(i == 1){out <- tt}
      if(i > 1){out <- plyr::rbind.fill(out, tt)}
    }
  }
  return(out)
}
This works well and is fast enough for my purpose, but it does NOT extract the attributes.
How would I change my function so that the attributes are also included?
> load_XML_File(file)
Loaded file H:/TMP/testFile.xml
Demographic.PatientName.PatSurname Demographic.PatientName.PatFirstName Demographic.PatientName.PatSecondName Demographic.PatientDetail.DOB
1 Jones John Peter 01012000
2 Pt2 Frits <NA> 12121834
Demographic.PatientDetail.MedicareNo Demographic.PatientDetail.COB Demographic.PatientDetail.Language Demographic.PatientAddress.StreetAddr
1 xxxx776xxx66xx 1101 1201 1 Address Rd
2 xxxxxxxx00001 1201 1201 1 church street
Demographic.PatientAddress.Suburb Demographic.PatientAddress.Postcode Tumour.TreatingDoctor.TDSurname Tumour.TreatingDoctor.TDFirstName
1 AwesomeCity ZZ304 Doctor The Good
2 Cityname Here 7777YY Jansen Jan
Tumour.TreatingDoctor.TDAddress Tumour.TreatingDoctor.TDMediProvidNo Tumour.HospitalEpisode.HospitalName Tumour.HospitalEpisode.CampusCode
1 FixemUp ct DR0001 FixMeUp 0000
2 Jansen rd DVR0001 HospitalName two 2166192
Tumour.HospitalEpisode.URN Tumour.HospitalEpisode.AdmissionDate Tumour.HospitalEpisode.DischargeDate Tumour.TumourDetail.DiagnosisDate
1 123456 01012020 03012020 01012050
2 10REWR8XX640 23122025 23122027 01121812
Tumour.TumourDetail.PrimarySite Tumour.TumourDetail.Morph Tumour.TumourDetail.Investigations Tumour.TumourDetail.AdditInfo Tumour.GP.GPSurname
1 C61 81403 8 8 7 10 3 Some free text can be available here MyGP
2 C9000 81403 7 3 1 This is a second record. Jones
Tumour.GP.GPFirstName Tumour.GP.GPAddress Tumour.RegDetail.RegName Tumour.RegDetail.RegDate Tumour.TumourDetail.WhereDiagnosed Tumour.TumourDetail.MetSite
1 Peter 100 GP street Some name 05122021 <NA> <NA>
2 John Test street 12 Unit 1 Me Myself and I 01011801 At home C792 C788

It seems like the attributes are present in tmp.
PerPt <- xml2::xml_find_all(load, ".//CancerRegRec")
tmp <- xml2::as_list(PerPt)
This function visits each element of a list, recursively. It makes attributes into members of the element.
move_attr_to_member <- function(x) {
  ## capture the names, and the attributes other than the names
  names <- names(x)
  attributes <- attributes(unname(x))
  ## recursive application
  if (is.list(x))
    x <- lapply(x, move_attr_to_member)
  ## return x (with attributes removed but names restored)
  ## plus the captured attributes as new members
  attributes(x) <- NULL
  names(x) <- names
  c(x, attributes)
}
This could be used like
list_with_attrs_as_members <- move_attr_to_member(tmp)
A tibble is easily created with
dplyr::bind_rows(lapply(list_with_attrs_as_members, unlist))
I'd carefully check the output of move_attr_to_member() to make sure that it's doing the right thing!
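Putting it together for the original question, here is a sketch of an attribute-aware loader (untested against the real extract; the tryCatch error handling from above is omitted for brevity):

load_XML_File_with_attrs <- function(file){
  load <- xml2::read_xml(file)
  PerPt <- xml2::xml_find_all(load, ".//CancerRegRec")
  tmp <- xml2::as_list(PerPt)
  if (length(tmp) == 0) return(NA)
  ## promote attributes (Sex, IndigStatus, Grade, ...) to list members
  tmp <- move_attr_to_member(tmp)
  ## flatten each record to a named vector and row-bind into a tibble
  dplyr::bind_rows(lapply(tmp, unlist))
}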

Related

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match, when using Rinat package in R

I am using the Rinat API to extract the observations submitted by a list of users to the iNaturalist app. The API function I'm using is as below, which is taken from GitHub (https://github.com/ropensci/rinat/blob/master/R/get_inat_obs_user.R):
library(httr)  # provides GET() and content(); imported by the rinat package itself

get_inat_obs_user <- function(username, maxresults = 10000){
  # check Internet connection
  if (!curl::has_internet()) {
    message("No Internet connection.")
    return(invisible(NULL))
  }
  base_url <- "http://www.inaturalist.org/"
  # check that iNat can be reached
  if (httr::http_error(base_url)) { # TRUE: 400 or above
    message("iNaturalist API is unavailable.")
    return(invisible(NULL))
  }
  q_path <- paste0(username, ".csv")
  ping_path <- paste0(username, ".json")
  ping_query <- "&per_page=1&page=1"
  ### Make the first ping to the server to get the number of results
  ### easier to pull down if you make the query in json, but easier to arrange results
  ### that come down in CSV format
  ping <- GET(base_url, path = paste0("observations/", ping_path), query = ping_query)
  total_res <- as.numeric(ping$headers$`x-total-entries`)
  if(total_res == 0){
    stop("Your search returned zero results. Perhaps your user does not exist.")
  }
  page_query <- "&per_page=200&page=1"
  dat <- GET(base_url, path = paste0("observations/", q_path), query = page_query)
  data_out <- read.csv(textConnection(content(dat, as = "text")))
  if(maxresults > 200){
    for(i in 2:ceiling(total_res/200)){
      page_query <- paste0("&per_page=200&page=", i)
      dat <- GET(base_url, path = paste0("observations/", q_path), query = page_query)
      data_out <- rbind(data_out, read.csv(textConnection(content(dat, as = "text"))))
    }
  }
  if(maxresults < dim(data_out)[1]){
    data_out <- data_out[1:maxresults,]
  }
  return(data_out)
}
I'm also using a for loop to iterate the API function through a list of usernames (in below sample data):
usernames
user_login
1 a_random_kaiser
2 a_wade
3 achaas33
4 adam247
5 adam294
6 ailsa2
7 anne278
8 anne370
9 apwiemer
10 arbw
11 aslan1758
12 asmalley
13 aspinwall
14 atypicalhermit
15 aurora143
16 autumn307
17 avenkal
18 awburck
19 badbotanist
20 baleinenature
21 barich
22 batesfire
23 beavers
24 becca57
25 beesis
26 beetlequeenrs
27 ben543
28 ben566
29 bethanyk63
30 betty108
31 boatbirder
32 bobbyboy
33 bobkat
34 brad131987
35 bradders100
36 cadds
37 caketea86
38 carluk
39 carolina711
40 cleo2211
41 cobydo
42 cocothecat
EDIT
sequence <- seq_along(usernames)
User_obs_global <- data.frame()
for (var in sequence){
  User_obs <- get_inat_obs_user(as.character(
    User_no_obs_Oxfordshire$user_login[var]))
  User_obs_global <- rbind(User_obs_global, User_obs)
}

sequence <- 1:nrow(usernames)
for (var in sequence){
  User_obs <- get_inat_obs_user(as.character(username$user_login[var]))
}
However, I keep getting the below error:
Error in rbind(deparse.level, ...) : numbers of columns of arguments
do not match
I tried searching previous Stack Overflow posts related to this error, and I understand it could be due to a mismatch in the number of columns between successive username records. Nevertheless, the number of columns should be the same for all username records, since their source is the same, so I'm wondering if there is something wrong with my code.
I can run the API function successfully for individual usernames by the way, but I have a few thousand usernames to loop through.
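One workaround (not from the original post): individual users' CSVs can occasionally come back with extra or missing columns, and base rbind() refuses to combine those; note the same rbind() also sits inside the pagination loop of get_inat_obs_user itself. Collecting the per-user results in a list and combining them with dplyr::bind_rows(), which pads missing columns with NA (column types must still agree), is more tolerant. A minimal sketch, assuming usernames$user_login holds the logins:

library(dplyr)

results <- list()
for (var in seq_len(nrow(usernames))) {
  User_obs <- get_inat_obs_user(as.character(usernames$user_login[var]))
  # skip failed lookups (the function returns NULL on connection errors)
  if (!is.null(User_obs) && nrow(User_obs) > 0) {
    results[[length(results) + 1]] <- User_obs
  }
}
# bind_rows() fills columns absent from some pieces with NA
User_obs_global <- bind_rows(results)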

Google maps api status NOT FOUND R

I created my own function for accessing the Google Maps API. I am trying to find out how long it takes to get from different points to a target location.
library(jsonlite)  # provides fromJSON()

getDuration <- function(from, to, tMode, key){
  from <- iconv(from, to="UTF-8")
  to <- iconv(to, to="UTF-8")
  tMode <- iconv(tMode, to="UTF-8")
  from <- URLencode(from)
  to <- URLencode(to)
  tMode <- URLencode(tMode)
  strQuery <- paste0(
    "https://maps.googleapis.com/maps/api/directions/json?",
    "origin=", from,
    "&destination=", to,
    "&mode=", tMode,
    "&key=", key
  )
  print(strQuery)
  jDist <- fromJSON(strQuery, simplifyDataFrame = TRUE)
  if (jDist$status != "OK"){
    print(paste0("Bad status: ", jDist$status))
    return(NA)
  }
  if (length(jDist$routes) == 0){
    print("no route")
    return(NA)
  }
  if (length(jDist$routes$legs) == 0){
    print("no legs")
    return(NA)
  }
  return(jDist$routes$legs[[1]]$duration$value)
}
Then I am trying to apply this function to a character vector "from":
from
[1] "Étampes" "Étréchy" "Dourdan" "La Ferté-Alais" "Méréville" "Saint-Chéron"
sapply(from,function(x) { getDuration(x,to,"driving",key) })
The output I get is the following:
Étampes Étréchy Dourdan La Ferté-Alais Méréville Saint-Chéron
NA NA 3501 4280 NA NA
It is strange, because a route between Étampes and my target destination exists and is not empty:
https://maps.googleapis.com/maps/api/directions/json?origin=%C3%83%E2%80%B0tampes&destination=Cours%20Valmy,%20Nanterre&mode=driving&key=AIzaSyBrmNaCXH_ppK7F0uW4SXZhPIBoDLQdKFE
Does anybody know how to identify the root of the problem?
This problem doesn't appear to exist if you use googleway
library(googleway)
set_key("GOOGLE_API_KEY")
res <- google_directions(
origin = "Étampes",
destination = "cours valmy"
)
direction_legs(res)$distance
# text value
# 1 62.6 km 62648
direction_legs(res)$duration
# text value
# 1 1 hour 8 mins 4065
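A likely root cause in the original function: the failures line up with the accented names, and in the logged query URL "Étampes" appears as %C3%83%E2%80%B0tampes, which decodes to "Ã‰tampes", i.e. the string has been UTF-8 encoded twice somewhere in the iconv()/URLencode() round-trip. A hedged sketch of a fix, assuming jsonlite and inputs that are valid in the session encoding (getDurationFixed is my name, not an existing API):

library(jsonlite)

getDurationFixed <- function(from, to, tMode, key){
  # declare the encoding once, then percent-encode reserved characters once
  strQuery <- paste0(
    "https://maps.googleapis.com/maps/api/directions/json?",
    "origin=", URLencode(enc2utf8(from), reserved = TRUE),
    "&destination=", URLencode(enc2utf8(to), reserved = TRUE),
    "&mode=", tMode,
    "&key=", key
  )
  jDist <- fromJSON(strQuery, simplifyDataFrame = TRUE)
  if (jDist$status != "OK") return(NA)
  jDist$routes$legs[[1]]$duration$value
}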

rvest: select and scrape html table(s) after a specific (title) string

I am trying to scrape the following website with public healthcare data:
https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d
I would like to scrape only the table or tables that are preceded by a title
"Tabela xx procedury ICD-9", where xx is not a fixed number.
There may be 1, but occasionally 2-3, such tables on a page, and they can occur in a different order on other pages, so it is not possible to simply ask for the nth table.
For instance, in the sketch below I am interested only in the tables that are preceded by the "Icd-9 main" string and want to skip the others. There are 2 such tables, and I would like to scrape their content into a data.frame. They may not appear in the same order as shown, so I must rely on the preceding string. Occasionally there will be no Icd-9 main table at all.
Page
----
Icd-10
====
Table
====
Icd-9 main
====
Table
===
Icd-9 main
====
Table
====
Icd-9 supplementary
====
Table
===
I am only aware of code that selects the nth table in order, like in this tutorial:
https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
html() %>%
html_nodes(xpath='//*[#id="mw-content-text"]/table[1]') %>%
html_table()
population <- population[[1]]
whereby all the tables are dumped into a list and then we can select the table we want by number.
However, in my case I never know which table to scrape; the order may vary, and there can be 2 tables preceded by a string containing "Tabela xx: procedury ICD-9".
My question is: how do I select and scrape html tables based on a preceding title or description string, i.e. a table or tables appearing only immediately after a string containing e.g. "Tabela xx procedury ICD-9"?
library(rvest)
library(stringr)

doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")

# extract all the nodes that have the title (id = "tytul") or a table
# the css selector "," is like a boolean OR.
nodes <- doc %>% html_nodes(".tytul,table")

# loop through each node.
signal <- FALSE
my_tables <- list()
j <- 0
for (i in seq_along(nodes)) {
  # if a title signal was previously set and this is a table tag
  if (signal & html_name(nodes[i]) == "table") {
    cat("Match..\n")
    # get the table (data frame)
    this_table <- html_table(nodes[i], fill = TRUE, header = TRUE)[[1]]
    # append to list
    j <- j + 1
    my_tables[[j]] <- this_table
    # and reset the signal so we search for the next one
    signal <- FALSE
  }
  # if the signal is clear, look for a matching title
  if (!signal) {
    signal <- nodes[i] %>% html_text() %>% str_detect("Tabela.+ICD 9")
  }
}
my_tables[[1]][1:5,]
my_tables[[2]][1:5,]
# > my_tables[[1]][1:5,]
#   ICD 9                                             Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1 2.051             Założenie płytki sztucznej do czaszki                168      32,31                          7
# 2 1.247 Kraniotomia z usunięciem krwiaka podtwardówkowego                 55      10,58                         20
# 3 2.022                      Odbarczenie złamania czaszki                 43       8,27                          6
# 4 2.040                Przeszczep kostny do kości czaszki                 35       6,73                          8
# 5 1.093                 Inne aspiracje w zakresie czaszki                 33       6,35                          5
# > my_tables[[2]][1:5,]
#   ICD 9                                Nazwa Lb. hospitalizacji Udział (%) Mediana czasu pobytu (dni)
# 1   O35                             Sód (Na)                239      45,96                          8
# 2 89.00 Porada lekarska, konsultacja, asysta                230      44,23                          9
# 3   N45                            Potas (K)                217      41,73                          8
# 4 87.030              TK głowy bez kontrastu                214      41,15                          9
# 5 89.04     Opieka pielęgniarki lub położnej                202      38,85                          8
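If the titles and tables are siblings in the page's HTML (an assumption; inspect the real markup first), an XPath axis can express "the first table after each matching title" directly and avoid the manual signal loop. A sketch:

library(rvest)

doc <- read_html("https://prog.nfz.gov.pl/app-jgp/Grupa.aspx?id=Qpc6nYOpOBQ%3d")
# for every node with class "tytul" whose text mentions ICD 9,
# take the first <table> that follows it in document order
icd9_tables <- doc %>%
  html_nodes(xpath = "//*[contains(@class, 'tytul') and contains(., 'ICD 9')]/following::table[1]") %>%
  html_table(fill = TRUE, header = TRUE)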

Reading sdmx-xml files into a dataframe in R

I was wondering if anyone has managed to read SDMX-XML files into a dataframe. The file I’d like to read is https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml (1mb).
I saved the file as “pensions_funds.xml” to the pwd and tried to use the XML package to read it:
fileName <- system.file("pensions", "pensions_funds.xml", package="XML")
parsed<-xmlTreeParse("pension_funds.xml",getDTD=F)
r<-xmlRoot(parsed)
tmp = xmlSApply(r, function(x) xmlSApply(x, xmlValue))
The few lines above basically follow the example here http://www.omegahat.org/RSXML/gettingStarted.html
but I think I would first need to somehow ignore the header (I have pasted below the first couple of pages of the file I’m trying to read). So I think the above might work but it starts from the wrong node for my purposes. I would like to grab the obs_values, indexed by their time_period and ref_area.
The first thing would be to find the right node and start there; however, I suspect I might be on a fool's errand, since I have limited knowledge of data formats and I'm not sure the XML package can be used for SDMX-XML files. Smarter people appear to have tried to do this:
http://opensdmxdevelopers.wikispaces.com/RSDMX
I can't find this package for download on its homepage here:
https://r-forge.r-project.org/projects/rsdmx/
(I can't see any link/download section, but maybe I'm blind) and it seems to be at an early stage. The existence of rsdmx suggests that using the XML package to read SDMX might not be easy, so I'm ready to give up at this stage unless anyone has had success with this. Actually, I'm mainly interested in reading this file:
http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml
But this is a 10mb file, so I was starting smaller.
edit3:
attempting sgibb's answer on the large file, using the changes from Mischa's comment
library("XML")
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
sdmxHandler <- function() {
## data.frame which stores results
data <- data.frame(stringsAsFactors=FALSE)
## counter to store current row
i <- 1
## temp value to store current REF_AREA
## temp value to store current REF_AREA
refArea <- NA
bsItem <- NA
bsCountSector <- NA
## handler subroutine for Obs tag
Obs <- function(name, attr) {
## found an Obs tag and now fill data.frame
data[i, "refArea"] <<- refArea
data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
data[i, "bsItem"] <<- bsItem
data[i, "bsCountSector"] <<- bsCountSector
i <<- i + 1
}
## handler subroutine for Series tag
Series <- function(name, attr) {
refArea <<- attr["REF_AREA"]
bsItem <<- as.character(attr["BS_ITEM"])
bsCountSector <<- as.numeric(attr["BS_ITEM"])
}
return(list(getData=function() {return(data)},
Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
edit2:
sgibb's answer looks ideal and works perfectly on the smaller file. I tried to run it on
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
(the 10mb file, original link corrected), with the only modification being the addition of two lines:
data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])
data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])
(these are additional id variables which are needed to identify a row in this larger dataset).
It ran for a few minutes then finished with this error:
Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The format of the data seems very similar, so I thought this might work. The basic structure of the 10mb file is as below:
<Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
<Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>
edit1:
desired data format:
Ref_area time_period obs_value
At 2006 118
At 2007 119
…
Be 2006 101
…
Here’s the first bit of the data.
</Header>
<DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/>
<Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A">
<Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/>
<Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/>
<Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/>
<Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/>
</Series>
<Group D
RSDMX seems to be in an early development state. IMHO there is no package available yet, but you could easily implement it on your own using the XML package. I would suggest using xmlEventParse (see ?xmlEventParse for details):
EDIT: adapt example to changed requirements of outstanding_amounts.xml
EDIT2: add download.file
library("XML")
#url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url, tmp)
sdmxHandler <- function() {
## data.frame which stores results
data <- data.frame(stringsAsFactors=FALSE)
## counter to store current row
i <- 1
## temp value to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
refArea <- NA
bsItem <- NA
bsCountSector <- NA
## handler subroutine for Obs tag
Obs <- function(name, attr) {
## found an Obs tag and now fill data.frame
data[i, "refArea"] <<- refArea
data[i, "bsItem"] <<- bsItem
data[i, "bsCountSector"] <<- bsCountSector
data[i, "timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"], "-01", sep=""), format="%Y-%m-%d")
data[i, "obsValue"] <<- as.double(attr["OBS_VALUE"])
## update current row
i <<- i + 1
}
## handler subroutine for Series tag
Series <- function(name, attr) {
refArea <<- attr["REF_AREA"]
bsItem <<- attr["BS_ITEM"]
bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
}
return(list(getData=function() {return(data)},
Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(tmp, handlers=sdmxHandler())$getData()
head(df)
# refArea bsItem bsCountSector timePeriod obsValue
#1 DE A20 2210 12053 39.6
#2 DE A20 2210 12084 46.1
#3 DE A20 2210 12112 50.2
#4 DE A20 2210 12143 52.0
#5 DE A20 2210 12173 52.3
#6 DE A20 2210 12204 47.3
The package rsdmx allows you to read SDMX-ML files and coerce them into a data.frame. It is now hosted on GitHub and currently available on CRAN, but you can also easily install it from GitHub with the following:
require("devtools")
install_github("rsdmx", "opensdmx")
Applying it to your data, you can do the following:
sdmx <- readSDMX("http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml")
df <- as.data.frame(sdmx)
More examples are given in the rsdmx wiki.
Note that it currently loads the xml object into R as a slot of the SDMX R objects instantiated by rsdmx. In the future, we would like to investigate how rsdmx can use xmlEventParse (as suggested above by @sgibb) to read very large datasets.
library(XML)

xmlparsed <- xmlParse(file(url))

## obtain dataset node
series_data <- getNodeSet(xmlparsed, "//Series")
if(length(series_data) == 0){
  datasetnode <- xmlChildren(xmlChildren(xmlparsed)[[1]])[[2]]
  series_data <- xmlChildren(datasetnode)[names(xmlChildren(datasetnode)) == "Series"]
}

## prepare dataset
dataset.frame <- data.frame(matrix(ncol=3))
colnames(dataset.frame) <- c('REF_AREA', 'TIME_PERIOD', 'OBS_VALUE')

## loop over data
counter <- 1
for (i in 1:length(series_data)){
  if('Obs' %in% names(xmlChildren(series_data[[i]]))){ ## to ignore empty //Series nodes
    for (j in 1:length(xmlChildren(series_data[[i]]))){
      dataset.frame[counter, 1] <- xmlAttrs(series_data[[i]])['REF_AREA']
      dataset.frame[counter, 2] <- xmlAttrs(series_data[[i]][[j]])['TIME_PERIOD']
      dataset.frame[counter, 3] <- xmlAttrs(series_data[[i]][[j]])['OBS_VALUE']
      counter <- counter + 1
    }
  }
}
head(dataset.frame, 5)
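For completeness, a compact sketch with the newer xml2 package (my addition, not part of the original answers; "pension_funds.xml" is a placeholder, and the namespaces are stripped for simplicity, so verify against the real file):

library(xml2)

doc <- read_xml("pension_funds.xml")
xml_ns_strip(doc)
series <- xml_find_all(doc, "//Series")
# one data.frame per Series: the REF_AREA attribute is recycled
# across that Series' Obs children
df <- do.call(rbind, lapply(series, function(s) {
  obs <- xml_find_all(s, "./Obs")
  data.frame(REF_AREA    = xml_attr(s, "REF_AREA"),
             TIME_PERIOD = xml_attr(obs, "TIME_PERIOD"),
             OBS_VALUE   = as.numeric(xml_attr(obs, "OBS_VALUE")),
             stringsAsFactors = FALSE)
}))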

Adding a row to a dataframe

I am reading a file line by line and then adding specific lines to a dataframe. Here is an example of a line I would add to the dataframe:
ATOM 230 CA GLU A 31 66.218 118.140 2.411 1.00 31.82 C
I have verified that my checks are OK; I think the problem is specifically with my rbind command. Thanks for your help!
Edit: The output of the dataframe shows the problem:
   Residue AtomCount SideChain XCoord YCoord ZCoord
2      MET         1         A 62.935 97.579 30.223
21    <NA>         2         A 63.155 95.525 27.079
3     <NA>         3         A 65.289 96.895 24.308
It seems like it stops picking up the name of the residue.
The code I am using is:
get.positions <- function(sourcefile, chain_required = "A"){
  positions <- data.frame()
  visited <- list()
  filedata <- readLines(sourcefile, n = -1)
  for(i in 1:length(filedata)){
    input <- filedata[i]
    id <- substr(input, 1, 4)
    if(id == "ATOM"){
      type <- substr(input, 14, 15)
      if(type == "CA"){
        # if there are duplicates it takes the first one
        residue <- substr(input, 18, 20)
        type_of_chain <- substr(input, 22, 22)
        atom_count <- strtoi(substr(input, 23, 26))
        if(atom_count >= 1){
          if(type_of_chain == chain_required && !(atom_count %in% visited)){
            position_string <- trimws(substr(input, 30, 54))  # trimws() stands in for the undefined trim()
            position_string <- lapply(unlist(strsplit(position_string, " +")), as.numeric)
            positions <- rbind(positions, list(residue, atom_count, type_of_chain,
                                               position_string[[1]], position_string[[2]], position_string[[3]]))
          }
        }
      }
    }
  }
  return(positions)
}
When I ran your code with that data I got type == "LU" (so it failed the type == "CA" test) and the rest of the processing never got done. I think you may need to change the indices to
type = substr(input,10,11)
Fixing that problem brings up others, and it's going to be very difficult to fix all the problems since the goal is not clearly stated, but I suggest you edit your code and data so it's reproducible. This could be a reproducible input/execution method:
get.positions(textConnection("ATOM 230 CA GLU A 31 66.218 118.140 2.411 1.00 31.82 C"))
In the end, the following worked. First I made a much larger data frame, and then just replaced specific rows (thank you Joran, who linked me to the R inferno).
For the user that asked why I am splitting on a plus: your assumption is incorrect. The pattern is actually " +", that is space-plus, so it splits on one or more spaces. Finally, as for the incorrect indices, I've finally figured out how to show the extra spaces on the form. Here is the correct original line; you will see the indices match.
ATOM      2  CA  MET A   1      62.935  97.579  30.223  1.00 37.58           C
The R code that works is as follows.
get.positions <- function(sourcefile, chain_required = "A"){
  N <- 10^5
  AACount <- 0
  positions <- data.frame(Residue=rep(NA, N), AtomCount=rep(NA, N), SideChain=rep(NA, N),
                          XCoord=rep(NA, N), YCoord=rep(NA, N), ZCoord=rep(NA, N),
                          stringsAsFactors=FALSE)
  visited <- list()
  filedata <- readLines(sourcefile, n = -1)
  for(i in 1:length(filedata)){
    input <- filedata[i]
    id <- substr(input, 1, 4)
    if(id == "ATOM"){
      type <- substr(input, 14, 15)
      if(type == "CA"){
        # if there are duplicates it takes the first one
        residue <- substr(input, 18, 20)
        type_of_chain <- substr(input, 22, 22)
        atom_count <- strtoi(substr(input, 23, 26))
        if(atom_count >= 1){
          if(type_of_chain == chain_required && !(atom_count %in% visited)){
            visited <- c(visited, atom_count)
            AACount <- AACount + 1
            position_string <- trimws(substr(input, 30, 54))  # trimws() stands in for the undefined trim()
            position_string <- lapply(unlist(strsplit(position_string, " +")), as.numeric)
            #print(input)
            positions[AACount,] <- c(residue, atom_count, type_of_chain,
                                     position_string[[1]], position_string[[2]], position_string[[3]])
          }
        }
      }
    }
  }
  positions <- positions[1:AACount,]
  return(positions)
}
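Since PDB ATOM records are fixed width, an alternative sketch (mine, untested on a full PDB file; "file.pdb" is a placeholder) is to let read.fwf do the column slicing instead of repeated substr() calls:

# negative widths skip columns; positive widths map to cols
# 1-4, 14-15, 18-20, 22, 23-26, 31-38, 39-46, 47-54 of the record
cols <- c(4, -9, 2, -2, 3, -1, 1, 4, -4, 8, 8, 8)
atoms <- read.fwf("file.pdb", widths = cols, strip.white = TRUE,
                  col.names = c("Record", "Type", "Residue", "SideChain",
                                "AtomCount", "XCoord", "YCoord", "ZCoord"),
                  stringsAsFactors = FALSE)
# keep only alpha-carbon atoms of the required chain
ca <- subset(atoms, Record == "ATOM" & Type == "CA" & SideChain == "A")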
