I am trying to parse an XML file using the XML package in R. Sample XML content is as follows:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Header>
<messageContext xmlns="http://www.deltavista.com/dspone/ordercheck-if/V001">
<credentials>
<user>foobar</user>
<password>barbaz</password>
</credentials>
</messageContext>
</soapenv:Header>
<soapenv:Body>
<ns1:orderCheckResponse xmlns:ns1="http://www.deltavista.com/dspone/ordercheck-if/V001">
<ns1:returnCode>1</ns1:returnCode>
<ns1:product>
<ns1:name>Consumer</ns1:name>
<ns1:country>POL</ns1:country>
<ns1:language>POL</ns1:language>
</ns1:product>
<ns1:archiveID>420</ns1:archiveID>
<ns1:reportCreationTime>201911151220</ns1:reportCreationTime>
<ns1:foundAddress>
<ns1:legalForm>PERSON</ns1:legalForm>
<ns1:address>
<ns1:name>John</ns1:name>
<ns1:firstName>Dow</ns1:firstName>
<ns1:gender>MALE</ns1:gender>
<ns1:dateOfBirth>19960410</ns1:dateOfBirth>
<ns1:location>
<ns1:street>nowhere</ns1:street>
<ns1:house>48</ns1:house>
<ns1:city>farfarland</ns1:city>
<ns1:zip>00-500</ns1:zip>
<ns1:country>POL</ns1:country>
</ns1:location>
</ns1:address>
</ns1:foundAddress>
<ns1:myDecision>
<ns1:decision>YELLOW</ns1:decision>
</ns1:myDecision>
<ns1:personBasicData>
<ns1:knownSince>20181201</ns1:knownSince>
<ns1:contact>
<ns1:item>EMAIL</ns1:item>
<ns1:value>foo#gmail.com</ns1:value>
</ns1:contact>
<ns1:contact>
<ns1:item>PHONE</ns1:item>
<ns1:value>123456789</ns1:value>
</ns1:contact>
</ns1:personBasicData>
<ns1:decisionMatrix>
<ns1:identificationDecision>
<ns1:personStatus xsi:type="ns1:DecisionMatrixItemPersonStatus" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>ADULT</ns1:value>
</ns1:personStatus>
<ns1:identificationType xsi:type="ns1:DecisionMatrixItemIdentificationType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>IDENTITY_IN_CITY</ns1:value>
</ns1:identificationType>
<ns1:similarHit xsi:type="ns1:DecisionMatrixItemInt" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>0</ns1:value>
</ns1:similarHit>
<ns1:houseType xsi:type="ns1:DecisionMatrixItemString" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>SHARED_USAGE</ns1:value>
</ns1:houseType>
<ns1:nameHint xsi:type="ns1:DecisionMatrixItemNameHint" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>CONFIRMED</ns1:value>
</ns1:nameHint>
<ns1:locationIdentificationType xsi:type="ns1:DecisionMatrixItemLocationIdentificationType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>HOUSE_CONFIRMED</ns1:value>
</ns1:locationIdentificationType>
</ns1:identificationDecision>
<ns1:solvencyDecision>
<ns1:paymentExperience xsi:type="ns1:DecisionMatrixItemPHS" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>NOPROBLEM</ns1:value>
</ns1:paymentExperience>
<ns1:externalSourcesProcessingStatus xsi:type="ns1:DecisionMatrixItemExternalProcessingStatus" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>OK</ns1:value>
</ns1:externalSourcesProcessingStatus>
</ns1:solvencyDecision>
<ns1:clientExtensionsDecision>
<ns1:applicationFilter xsi:type="ns1:DecisionMatrixItemStringWithOverride" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>0</ns1:value>
</ns1:applicationFilter>
<ns1:myScore xsi:type="ns1:DecisionMatrixItemInt" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>YELLOW</ns1:partialDecision>
<ns1:value>401</ns1:value>
</ns1:myScore>
</ns1:clientExtensionsDecision>
</ns1:decisionMatrix>
<ns1:paymentHistory>
<ns1:currency>PLN</ns1:currency>
<ns1:count>0</ns1:count>
<ns1:dateOfLastEntry>20191111</ns1:dateOfLastEntry>
<ns1:amountTotal>0.0</ns1:amountTotal>
<ns1:amountTotalOpen>0.0</ns1:amountTotalOpen>
<ns1:creditStatusMax>0</ns1:creditStatusMax>
<ns1:masterRiskStatus>Brak danych o negatywnej historii</ns1:masterRiskStatus>
</ns1:paymentHistory>
<ns1:normalization>
<ns1:searchedAddress xsi:type="ns1:SearchedAddressN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:name>John</ns1:name>
<ns1:firstName>Dow</ns1:firstName>
<ns1:gender>MALE</ns1:gender>
<ns1:dateOfBirth>19960410</ns1:dateOfBirth>
<ns1:location>
<ns1:street>nowhere</ns1:street>
<ns1:house>39</ns1:house>
<ns1:houseExtension/>
<ns1:city>farfarland</ns1:city>
<ns1:zip>00-500</ns1:zip>
<ns1:country>POL</ns1:country>
</ns1:location>
<ns1:addressID>123</ns1:addressID>
<ns1:unitID>123</ns1:unitID>
<ns1:liableID>1231</ns1:liableID>
<ns1:houseID>1232</ns1:houseID>
<ns1:streetID>1233</ns1:streetID>
<ns1:cityID>1234</ns1:cityID>
</ns1:searchedAddress>
<ns1:foundAddress>
<ns1:addressID>1235</ns1:addressID>
<ns1:unitID>1236</ns1:unitID>
<ns1:liableID>1237</ns1:liableID>
<ns1:houseID>1238</ns1:houseID>
<ns1:streetID>1239</ns1:streetID>
<ns1:cityID>1230</ns1:cityID>
</ns1:foundAddress>
</ns1:normalization>
<ns1:clientExtensions>
<ns1:additionalData>
<ns1:name>pesel_verification_status</ns1:name>
<ns1:value>1</ns1:value>
</ns1:additionalData>
<ns1:additionalData>
<ns1:name>pesel_verification_execution_code</ns1:name>
<ns1:value>200</ns1:value>
</ns1:additionalData>
<ns1:additionalData>
<ns1:name>pesel_verification_codes</ns1:name>
<ns1:value>12010; 12013</ns1:value>
</ns1:additionalData>
</ns1:clientExtensions>
<ns1:executionStrategy/>
</ns1:orderCheckResponse>
</soapenv:Body>
</soapenv:Envelope>
Here is the code I am using to read and parse this XML content (stored in the character string str henceforth):
library(XML)
foobar <- xmlInternalTreeParse(str, encoding = 'KOI8-R', useInternalNodes = F)
xmlSApply(foobar$doc$children$Envelope, function(x) xmlSApply(x, names))
xmlSApply(foobar$doc$children$Envelope, function(x) xmlSApply(x, function(x1) xmlSApply(x1, names)))
Here I am able to parse the XML content and tried iterating over the nodes to at least print their names. However, I couldn't extract the values inside, even though I read many SO questions and tried countless combinations using xpathApply() etc. (reference)
Any hint on what I might be doing wrong here?
Consider a straightforward descendant XPath search with //* to retrieve all element names and values. The search acknowledges namespaces, although since this particular XPath does not reference any namespace prefixes, the namespaces argument is technically redundant in the xpathSApply calls below:
doc <- xmlParse(str, asText=TRUE)
nmsp <- c(soapenv = "http://schemas.xmlsoap.org/soap/envelope/",
doc = "http://www.deltavista.com/dspone/ordercheck-if/V001",
ns1 = "http://www.deltavista.com/dspone/ordercheck-if/V001")
# NAMED CHARACTER VECTOR OF ALL 117 ELEMENT NAMES AND VALUES
elem_vals <- setNames(xpathSApply(doc, path="//*", namespaces = nmsp, xmlValue) ,
xpathSApply(doc, path="//*", namespaces = nmsp, xmlName))
Output
Names (first 20 items)
head(names(elem_vals), 20)
# [1] "Envelope" "Header" "messageContext" "credentials" "user"
# [6] "password" "Body" "orderCheckResponse" "returnCode" "product"
# [11] "name" "country" "language" "archiveID" "reportCreationTime"
# [16] "foundAddress" "legalForm" "address" "name" "firstName"
Values (last 20 items)
tail(elem_vals, 20)
# streetID cityID
# "1233" "1234"
# foundAddress addressID
# "123512361237123812391230" "1235"
# unitID liableID
# "1236" "1237"
# houseID streetID
# "1238" "1239"
# cityID clientExtensions
# "1230" "pesel_verification_status1pesel_verification_execution_code200pesel_verification_codes12010; 12013"
# additionalData name
# "pesel_verification_status1" "pesel_verification_status"
# value additionalData
# "1" "pesel_verification_execution_code200"
# name value
# "pesel_verification_execution_code" "200"
# additionalData name
# "pesel_verification_codes12010; 12013" "pesel_verification_codes"
# value executionStrategy
# "12010; 12013" ""
Related
I am a new learner in R programming. I have a sample XML file as shown below:
<Attribute ID="GroupSEO" MultiValued="false" ProductMode="Property" FullTextIndexed="false" ExternallyMaintained="false" Derived="false" Mandatory="false">
<Name>Group SEO Name</Name>
<Validation BaseType="text" MinValue="" MaxValue="" MaxLength="1024" InputMask=""/>
<DimensionLink DimensionID="Language"/>
<MetaData>
<Value AttributeID="Attribute-Group-Order">1</Value>
<Value AttributeID="Enterprise-Label">NAV-GR-SEONAME</Value>
<Value ID="#NAMED" AttributeID="Attribute-Group-Name">#NAMED</Value>
<Value AttributeID="Enterprise-Description">Navigation Group SEO Name</Value>
<Value AttributeID="Attribute-Order">3</Value>
</MetaData>
<AttributeGroupLink AttributeGroupID="HTCategorizationsNavigation"/>
<AttributeGroupLink AttributeGroupID="HTDigitalServicesModifyClassifications"/>
<UserTypeLink UserTypeID="ENT-Group"/>
<UserTypeLink UserTypeID="NAVGRP"/>
<UserTypeLink UserTypeID="ENT-SubCategory"/>
<UserTypeLink UserTypeID="ENT-Category"/>
I want to convert this into a data frame using R. My expected output is:
## FullTextIndexed MultiValued ProductMode ExternallyMaintained Derived Mandatory Attribute-Group-Order Enterprise-Description UserTypeID
1 false false Property false false false 1 Navigation group seo name ENT-Group,ENT-Category,..
I have searched the internet but couldn't find a solution to my problem.
I found this code on the internet:
library("XML")
library("methods")
setwd("E:/Project")
xmldata<-xmlToDataFrame("Sample.xml")
print(xmldata)
but when I execute the code I get the error below:
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c(Name = "You YoutubeLink7 (URL)", :
duplicate subscripts for columns
In addition: Warning message:
In names(x) == varNames :
longer object length is not a multiple of shorter object length
> print(xmldata)
Error in print(xmldata) : object 'xmldata' not found
Could anyone help me understand what the error means, and suggest a solution to my problem? Sorry for the formatting issues.
Thanks in advance.
With corrected XML data (the closing </Attribute> tag added at the end of the file):
<?xml version="1.0" encoding="UTF-8"?>
<Attribute ID="GroupSEO" MultiValued="false" ProductMode="Property" FullTextIndexed="false" ExternallyMaintained="false" Derived="false" Mandatory="false">
<Name>Group SEO Name</Name>
<Validation BaseType="text" MinValue="" MaxValue="" MaxLength="1024" InputMask=""/>
<DimensionLink DimensionID="Language"/>
<MetaData>
<Value AttributeID="Attribute-Group-Order">1</Value>
<Value AttributeID="Enterprise-Label">NAV-GR-SEONAME</Value>
<Value ID="#NAMED" AttributeID="Attribute-Group-Name">#NAMED</Value>
<Value AttributeID="Enterprise-Description">Navigation Group SEO Name</Value>
<Value AttributeID="Attribute-Order">3</Value>
</MetaData>
<AttributeGroupLink AttributeGroupID="HTCategorizationsNavigation"/>
<AttributeGroupLink AttributeGroupID="HTDigitalServicesModifyClassifications"/>
<UserTypeLink UserTypeID="ENT-Group"/>
<UserTypeLink UserTypeID="NAVGRP"/>
<UserTypeLink UserTypeID="ENT-SubCategory"/>
<UserTypeLink UserTypeID="ENT-Category"/>
</Attribute>
Then we use XPath to get everything we need. Change the path to your XML file in the htmlParse step. Note that htmlParse lower-cases element and attribute names, which is why the XPath expressions below are written in lowercase.
library(XML)
data=htmlParse("C:/Users/.../yourxmlfile.xml")
fulltextindexed=xpathSApply(data,"normalize-space(//attribute/@fulltextindexed)")
multivalued=xpathSApply(data,"normalize-space(//attribute/@multivalued)")
productmode=xpathSApply(data,"normalize-space(//attribute/@productmode)")
externallymaintained=xpathSApply(data,"normalize-space(//attribute/@externallymaintained)")
derived=xpathSApply(data,"normalize-space(//attribute/@derived)")
mandatory=xpathSApply(data,"normalize-space(//attribute/@mandatory)")
attribute.group.order=xpathSApply(data,"//value[@attributeid='Attribute-Group-Order']",xmlValue)
enterprise.description=xpathSApply(data,"//value[@attributeid='Enterprise-Description']",xmlValue)
user.type.id=paste(xpathSApply(data,"//usertypelink/@usertypeid"),collapse = "|")
df=data.frame(fulltextindexed,multivalued,productmode,externallymaintained,derived,mandatory,attribute.group.order,enterprise.description,user.type.id)
Result: a one-row data frame with the nine columns built above.
Using tidyverse and xml2
DATA
#packages used throughout this answer
library(xml2)
library(tidyverse)
data <- read_xml('<Attribute ID="GroupSEO" MultiValued="false" ProductMode="Property" FullTextIndexed="false" ExternallyMaintained="false" Derived="false" Mandatory="false">
<Name>Group SEO Name</Name>
<Validation BaseType="text" MinValue="" MaxValue="" MaxLength="1024" InputMask=""/>
<DimensionLink DimensionID="Language"/>
<MetaData>
<Value AttributeID="Attribute-Group-Order">1</Value>
<Value AttributeID="Enterprise-Label">NAV-GR-SEONAME</Value>
<Value ID="#NAMED" AttributeID="Attribute-Group-Name">#NAMED</Value>
<Value AttributeID="Enterprise-Description">Navigation Group SEO Name</Value>
<Value AttributeID="Attribute-Order">3</Value>
</MetaData>
<AttributeGroupLink AttributeGroupID="HTCategorizationsNavigation"/>
<AttributeGroupLink AttributeGroupID="HTDigitalServicesModifyClassifications"/>
<UserTypeLink UserTypeID="ENT-Group"/>
<UserTypeLink UserTypeID="NAVGRP"/>
<UserTypeLink UserTypeID="ENT-SubCategory"/>
<UserTypeLink UserTypeID="ENT-Category"/>
</Attribute>')
CODE
#For attribute tag
Attributes <- xml_find_all(data, "//Attribute")
Attributes <- Attributes %>%
map(xml_attrs) %>%
map_df(~as.list(.))
#find AttributeID nodes
nodes <- xml_find_all(data, "//Value")
AGO <- nodes[xml_attr(nodes, "AttributeID")=="Attribute-Group-Order"]
Attributes["Attribute-Group-Order"] <- xml_text(AGO)
ED <- nodes[xml_attr(nodes, "AttributeID")=="Enterprise-Description"]
Attributes["Enterprise-Description"] <- xml_text(ED)
#UserTypelink tags
UserTypeLink <- xml_find_all(data, "//UserTypeLink")
UserTypeLink <- UserTypeLink %>%
map(xml_attrs) %>%
map_df(~as.list(.)) %>%
mutate(UserTypeID = toString(UserTypeID)) %>%
filter(row_number()==1)
#Final output
do.call("cbind", list(Attributes,UserTypeLink))
I am trying to match a pattern in a text file. It works well as long as the pattern sits within one line, but it turns out that in some cases the pattern can span two lines.
I have the following code:
#indicate the Name pattern to R
name_pattern = '<nameOfIssuer>([^<]*)</nameOfIssuer>'
#Collect lines that match the pattern we are looking for
datalines = grep(name_pattern, thepage[1:length(thepage)], value = TRUE)
#We will use gregexpr and gsub to extract the information without the html tags
#create a function first
getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
gg = gregexpr(name_pattern, datalines)
matches = mapply(getexpr, datalines, gg)
result = gsub(name_pattern, '\\1', matches)
result <- gsub("&", "&", result)
names(result) = NULL
It works well when the text looks like this:
<nameOfIssuer>Posco ADR</nameOfIssuer>
But when the text is split across lines as follows, it does not work:
<nameOfIssuer>Bank of
America Corp</nameOfIssuer>
Does anyone know how to handle both cases dynamically?
The full text is as follows:
<SEC-DOCUMENT>0001437749-18-018038.txt : 20181009
<SEC-HEADER>0001437749-18-018038.hdr.sgml : 20181009
<ACCEPTANCE-DATETIME>20181005183736
ACCESSION NUMBER: 0001437749-18-018038
CONFORMED SUBMISSION TYPE: 13F-HR
PUBLIC DOCUMENT COUNT: 2
CONFORMED PERIOD OF REPORT: 20180930
FILED AS OF DATE: 20181009
DATE AS OF CHANGE: 20181005
EFFECTIVENESS DATE: 20181009
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: DAILY JOURNAL CORP
CENTRAL INDEX KEY: 0000783412
STANDARD INDUSTRIAL CLASSIFICATION: NEWSPAPERS: PUBLISHING OR PUBLISHING & PRINTING [2711]
IRS NUMBER: 954133299
STATE OF INCORPORATION: SC
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 13F-HR
SEC ACT: 1934 Act
SEC FILE NUMBER: 028-15782
FILM NUMBER: 181111587
BUSINESS ADDRESS:
STREET 1: 915 EAST FIRST STREET
CITY: LOS ANGELES
STATE: CA
ZIP: 90012
BUSINESS PHONE: 2132295300
MAIL ADDRESS:
STREET 1: 915 EAST FIRST STREET
CITY: LOS ANGELES
STATE: CA
ZIP: 90012
FORMER COMPANY:
FORMER CONFORMED NAME: DAILY JOURNAL CO
DATE OF NAME CHANGE: 19870427
</SEC-HEADER>
<DOCUMENT>
<TYPE>13F-HR
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<edgarSubmission xmlns="http://www.sec.gov/edgar/thirteenffiler" xmlns:com="http://www.sec.gov/edgar/common">
<headerData>
<submissionType>13F-HR</submissionType>
<filerInfo>
<liveTestFlag>LIVE</liveTestFlag>
<flags>
<confirmingCopyFlag>false</confirmingCopyFlag>
<returnCopyFlag>true</returnCopyFlag>
<overrideInternetFlag>false</overrideInternetFlag>
</flags>
<filer>
<credentials>
<cik>0000783412</cik>
<ccc>XXXXXXXX</ccc>
</credentials>
</filer>
<periodOfReport>09-30-2018</periodOfReport>
</filerInfo>
</headerData>
<formData>
<coverPage>
<reportCalendarOrQuarter>09-30-2018</reportCalendarOrQuarter>
<filingManager>
<name>DAILY JOURNAL CORP</name>
<address>
<com:street1>915 EAST FIRST STREET</com:street1>
<com:city>LOS ANGELES</com:city>
<com:stateOrCountry>CA</com:stateOrCountry>
<com:zipCode>90012</com:zipCode>
</address>
</filingManager>
<reportType>13F HOLDINGS REPORT</reportType>
<form13FFileNumber>028-15782</form13FFileNumber>
<provideInfoForInstruction5>N</provideInfoForInstruction5>
</coverPage>
<signatureBlock>
<name>Gerald L. Salzman</name>
<title>Chief Executive Officer, President, CFO, Treasurer</title>
<phone>213-229-5300</phone>
<signature>/s/ Gerald L. Salzman</signature>
<city>Los Angeles</city>
<stateOrCountry>CA</stateOrCountry>
<signatureDate>10-05-2018</signatureDate>
</signatureBlock>
<summaryPage>
<otherIncludedManagersCount>0</otherIncludedManagersCount>
<tableEntryTotal>4</tableEntryTotal>
<tableValueTotal>159459</tableValueTotal>
<isConfidentialOmitted>false</isConfidentialOmitted>
</summaryPage>
</formData>
</edgarSubmission>
</XML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>INFORMATION TABLE
<SEQUENCE>2
<FILENAME>rdgit100518.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="us-ascii"?>
<informationTable xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">
<infoTable>
<nameOfIssuer>Bank of
America Corp</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>060505104</cusip>
<value>67758</value>
<shrsOrPrnAmt>
<sshPrnamt>2300000</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>2300000</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>Posco ADR</nameOfIssuer>
<titleOfClass>Sponsored ADR</titleOfClass>
<cusip>693483109</cusip>
<value>643</value>
<shrsOrPrnAmt>
<sshPrnamt>9745</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>9745</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>US Bancorp</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>902973304</cusip>
<value>7393</value>
<shrsOrPrnAmt>
<sshPrnamt>140000</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>140000</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>Wells Fargo & Co</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>949746101</cusip>
<value>83665</value>
<shrsOrPrnAmt>
<sshPrnamt>1591800</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>1591800</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
</informationTable>
</XML>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
Assuming there could be multiple matching <nameOfIssuer> tags in your document, and you want to match all of them, we can try using gregexpr with regmatches:
input <- "<nameOfIssuer>Bank of\n America Corp</nameOfIssuer>\n blah blah blah \n"
input <- paste0(input, "<nameOfIssuer>Citigroup</nameOfIssuer>")
m <- gregexpr("(?<=<nameOfIssuer>)([^<]*?)(?=</nameOfIssuer>)", input, perl=TRUE)
regmatches(input, m)[[1]]
[1] "Bank of\n America Corp" "Citigroup"
Using Tim's solution plus the collapse option of paste, the program works. The code is as follows:
thepage <- paste(thepage, collapse = "")
m <- gregexpr("(?<=<nameOfIssuer>)([^<]*?)(?=</nameOfIssuer>)", thepage, perl=TRUE)
result <- regmatches(thepage, m)[[1]]
names(result) = NULL
#put the result into a dataframe
Positions = as.data.frame(matrix(result, ncol=1, byrow = TRUE))
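One detail worth noting: names that were matched across a line break (e.g. Bank of America Corp above) still contain the embedded newline and indentation. A minimal cleanup sketch, applied to result before building Positions, assuming single-space-separated names are wanted:
#collapse any run of whitespace (including newlines) into a single space
result <- gsub("\\s+", " ", result)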
I am trying to download "World Gross domestic product, current prices" using the IMFData API. If I use this query:
library(IMFData)
databaseID <- 'IFS'
startdate = '2001-01-01'
enddate = '2016-12-31'
checkquery = TRUE
queryfilter <- list(CL_FREA="", CL_AREA_IFS="W0", CL_INDICATOR_IFS="NGDP_USD")
W0.NGDP.query <- CompactDataMethod(databaseID, queryfilter, startdate, enddate, checkquery)
I get this error:
Error: lexical error: invalid char in json text.
<?xml version="1.0" encoding="u
(right here) ------^
In addition: Warning message:
JSON string contains (illegal) UTF8 byte-order-mark!
How can I fix this? Is there a better way using Quandl, etc.?
For the last few days I have been searching for a way to get multiple nodes with Nokogiri, relative to a reference variable in an ancestor node.
What I need:
Currently I am collecting all "Id"s of the "Segment" nodes. Then I want to collect all subsequent "Resource"s within each "Segment" node. For collecting the "Resource"s I want to use the "Id" as a variable.
<CPL>
<SegmL>
<Segment>
<Id>UUID</Id> #UUID as a variable
<Name>name_01</Name>
<SequL>
<ImageSequence>
<Id>UUID</Id>
<Track>UUID</Track>
<ResourceList>
<Resource> #depending on SegmentId
<A>aaa</A>
<B>bbb</B>
<C>ccc</C>
<D>ddd</D>
</Resource>
</ResourceList>
</ImageSequence>
<AudioSequence>
<Id>UUID</Id>
<Track>UUID</Track>
<ResourceList>
<Resource>
<A>aaa</A>
<B>bbb</B>
<C>ccc</C>
<D>ddd</D>
</Resource>
</ResourceList>
</AudioSequence>
</SequL>
</Segment>
<Segment>
<Id>UUIDa</Id>
<Name>name_02</Name>
<SequL>
<ImageSequence>
<Id>UUID</Id>
<Track>UUID</Track>
<ResourceList>
<Resource>
<A>aaa</A>
<B>bbb</B>
<C>ccc</C>
<D>ddd</D>
</Resource>
</ResourceList>
</ImageSequence>
<AudioSequence>
<Id>UUID</Id>
<Track>UUID</Track>
<ResourceList>
<Resource>
<A>aaa</A>
<B>bbb</B>
<C>ccc</C>
<D>ddd</D>
</Resource>
</ResourceList>
</AudioSequence>
</SequL>
</Segment>
</SegmL>
</CPL>
Each piece of Resource data is collected with, for example, A = Resource.css("A").text.gsub(/\n/,"")
#first each do
cpls.each_with_index do |(cpl_uuid, mycpl), index|
cpl_filename = mycpl
cpl_file = File.open("#{resource_uri}/#{cpl_filename}")
cpl = Nokogiri::XML( cpl_file ).remove_namespaces!
#get UUID for UUID checks
cpl_uuid = cpl.css("Id").first.text.gsub(/\n/,"")
cpl_root_edit_rate = cpl.css("EditRate").first.text.gsub(/\s+/, "\/")
#second each do
cpl.css("Segment").each do |s| # loop segment
cpl_segment_list_uuid = s.css("Id").first.text.gsub(/\n/,"") #uuid of segment list
#third each do
cpl.css("Resource").each do |f| #loop resources
cpl_A = f.css("A").text.gsub(/\n/,"") # uuid of A
cpl_B = f.css("B").text.gsub(/\n/,"") # uuid of B
end #third
end #second
end #first
My expression gives me this information stored in an array:
A = 48000.0
B = 240000.0
C = 0.0
D = 240000.0
Then some functions calculate an average over the resources:
puts all_arry
A = 5.0
B = 5.0
C = 5.0
D = 5.0
A = 5.0
B = 5.0
C = 5.0
D = 5.0
= 8 values -> only 4 values should exist for a correct loop (2 average values per Segment)
At the moment every "Segment" Id collects all "Resource"s.
How can I allocate exactly the subsequent resources for each Segment Id as a variable?
I tried this code, but the loop is empty, I think because there are some more nodes between the "Id" of "Segment" and each "Resource" "A", "B"...:
if cpl.at("Segment/Id:contains(\"#{cpl_segment_list_uuid}\")")
cpl.css("Resource").each do |f|
#collecting resources here for each segmet
end
end
All nodes have NO attributes, ids, classes, etc.
Maybe you can help me with my problem. First of all, thank you politely for your support!
UPDATE 10/07/16
I also ran the code with the following expressions for the "each do" on the resources:
expression = "/SegmetList/Segment[Id>cpl_segment_list_uuid]"
cpl.xpath(expression).each do |f|
It runs the "each do", but I don't get the internal nodes.
cpl.css("Segment:contains(\"#{cpl_segment_list_uuid}\") > Resource").each do |f|
Same as previous
And with an "if" condition, also the same problem:
if cpl.at("Segment/Id:contains(\"#{cpl_segment_list_uuid}\")").each do|f|
#some code
end
UPDATE 2016/18/10
Actually I now get the right number of resources (4), but they are still not separated for each Segment. So the same four resources appear in each Segment.
The reason I don't get double the number of resources is that I create the array inside the "Segment" loop.
This is the present code:
#first each do
cpls.each_with_index do |(cpl_uuid, mycpl), index|
cpl_filename = mycpl
cpl_file = File.open("#{resource_uri}/#{cpl_filename}")
cpl = Nokogiri::XML( cpl_file ).remove_namespaces!
#get UUID for UUID checks
cpl_uuid = cpl.css("Id").first.text.gsub(/\n/,"")
cpl_root_edit_rate = cpl.css("EditRate").first.text.gsub(/\s+/, "\/")
#second each do
cpl.css("Segment").each do |s| # loop segment
cpl_segment_list_uuid = s.css("Id").first.text.gsub(/\n/,"") #uuid of segment list
array_for_resource_data = Array.new
#third each do
s.css("Resource").each do |f| #loop resources #all resources
s.search('//A | //B').each do |f| #selecting only resources "A" and "B"
cpl_A = f.css("A").text.gsub(/\n/,"") # uuid of A
cpl_B = f.css("B").text.gsub(/\n/,"") # uuid of B
end # close the inner A/B search loop
end #third
end #second
end #first
I hope my update will give you more details. Thank you very much for your help and answer!
UPDATE 2016/31/10
The problem with the double output of the segments is fixed. Now I have one more loop on each sequence under the segments:
cpl.css("Segment").each do |u|
segment_list_uuid = u.css("Id").first.text.gsub(/\n/,"")
sequence_list_uuid_arr = Array.new
u.xpath("//SequenceList[//*[starts-with(name(),'Sequence')]]").each do |s|
sequence_list_uuid = s.css("TrackId").first.text#.gsub(/\n/,"")
sequence_list_uuid_arr.push(sequence_list_uuid)
#following some resource nodes
s.css("Resource").each do |f|
asset_uuid = f.css("TrackFileId").text.gsub(/\n/,"")
resource_uuid = f.css("Id").text.gsub(/\n/,"")
edit_rate = f.css("EditRate").text.gsub(/\s+/, "\/")
#some more code
end #resource
end #sequence list
end #segment
Now I want to get all the different "resources" under each unique sequence. I have to list all the different resources and sum up some of the collected values.
Is there any way to collect each resource with different values (sub nodes) under the same "sequence id"? At the moment I have no idea for a solution... so there is no code I could show you that would work, even in part.
each_with_index for the "Resource" loop doesn't work.
Do you have any ideas or an approach that could help me with my new problem?
Try
resource.search('.//A | .//B')
.// will anchor the xpath query at the current element rather than searching the whole document.
Example
elem = doc.search('ImageSequence').first
elem.search('//A') # returns all A in the whole document
elem.search('.//A') # returns all A inside element
This is a common problem when tearing apart XML. Write your code similar to how the data is laid out in the XML, allowing for repeating blocks of similar data.
For instance:
require 'nokogiri'
cpl = Nokogiri::XML(<<EOT)
<CPL>
<SegmL>
<Segment>
<Id>UUID</Id> #UUID as a variable
<Name>name_01</Name>
<SequL>
<ImageSequence>
<Id>UUID</Id>
<Track>UUID</Track>
<ResourceList>
<Resource> #depending on SegmentId
<A>aaa</A>
<B>bbb</B>
<C>ccc</C>
<D>ddd</D>
</Resource>
</ResourceList>
</ImageSequence>
<AudioSequence>
<Id>UUID</Id>
<Track>UUID</Track>
<ResourceList>
<Resource>
<A>aaa</A>
<B>bbb</B>
<C>ccc</C>
<D>ddd</D>
</Resource>
</ResourceList>
</AudioSequence>
</SequL>
</Segment>
</SegmL>
</CPL>
EOT
Start by finding the node that contains the data you want to iterate over, then start descending into that data:
data = cpl.search('Segment').each_with_object([]) { |segment, ary|
hash = {}
hash[:id] = segment.at('Id').text
hash[:name] = segment.at('Name').text
image_sequence = segment.at('ImageSequence')
image_sequence_h = {}
image_sequence_h[:id] = image_sequence.at('Id').text
image_sequence_h[:track] = image_sequence.at('Track').text
image_resources_h = {
a: image_sequence.at('A').text,
b: image_sequence.at('B').text,
c: image_sequence.at('C').text,
d: image_sequence.at('D').text,
}
audio_sequence = segment.at('AudioSequence')
audio_sequence_h = {}
audio_sequence_h[:id] = audio_sequence.at('Id').text
audio_sequence_h[:track] = audio_sequence.at('Track').text
audio_resources_h = {
a: audio_sequence.at('A').text,
b: audio_sequence.at('B').text,
c: audio_sequence.at('C').text,
d: audio_sequence.at('D').text,
}
image_sequence_h[:resources] = image_resources_h
audio_sequence_h[:resources] = audio_resources_h
hash[:image_sequence] = image_sequence_h
hash[:audio_sequence] = audio_sequence_h
ary << hash
}
This is more verbose than I'd usually write it because I wanted the steps to be clearer.
The end result is an array of hashes:
# => [{:id=>"UUID",
# :name=>"name_01",
# :image_sequence=>
# {:id=>"UUID",
# :track=>"UUID",
# :resources=>{:a=>"aaa", :b=>"bbb", :c=>"ccc", :d=>"ddd"}},
# :audio_sequence=>
# {:id=>"UUID",
# :track=>"UUID",
# :resources=>{:a=>"aaa", :b=>"bbb", :c=>"ccc", :d=>"ddd"}}}]
Then it's easy to iterate over the array and access individual chunks of data, or individual elements of the data:
data[0][:image_sequence][:id] # => "UUID"
data[0][:audio_sequence][:resources][:d] # => "ddd"
I'm trying to download XML from a URL and save it as CSV.
Here is the script:
library(XML)
library(httr)
query <-"https://commission-detail.api.cj.com/v3/commissions?date-type=posting&start-date=2016-03-01&end-date=2016-03-30"
token <-"xxxx"
xmlfile <- xmlTreeParse(GET(url=query, add_headers(Authorization=token)))
but I received the results in the following structure:
$doc
$file
[1] "<buffer>"
$version
[1] "1.0"
$children
$children$`cj-api`
<cj-api>
<commissions total-matched="4">
<commission>
<action-status>closed</action-status>
<action-type>advanced sale</action-type>
<aid>10789406</aid>
<commission-id>1965209327</commission-id>
<country>US</country>
<event-date>2016-03-02T04:22:04-0800</event-date>
<locking-date>2016-05-10</locking-date>
<order-id>1786924</order-id>
<original>true</original>
<original-action-id>1691086180</original-action-id>
<posting-date>2016-03-02T05:03:45-0800</posting-date>
<website-id>7991782</website-id>
<action-tracker-id>337452</action-tracker-id>
<action-tracker-name>JimmyJazz Sale</action-tracker-name>
<cid>3010924</cid>
<advertiser-name>Jimmy Jazz</advertiser-name>
<commission-amount>0.16</commission-amount>
<order-discount>0.00</order-discount>
<sid/>
<sale-amount>1.99</sale-amount>
</commission>
<commission>
<action-status>locked</action-status>
<action-type>advanced sale</action-type>
<aid>12378040</aid>
<commission-id>1969836131</commission-id>
<country>IL</country>
<event-date>2016-03-14T05:53:36-0700</event-date>
<locking-date>2016-05-13</locking-date>
<order-id>27307042</order-id>
<original>true</original>
<original-action-id>1695118411</original-action-id>
<posting-date>2016-03-14T06:30:52-0700</posting-date>
<website-id>7991782</website-id>
<action-tracker-id>361197</action-tracker-id>
<action-tracker-name>Sale</action-tracker-name>
<cid>3848495</cid>
<advertiser-name>boohoo.com</advertiser-name>
<commission-amount>0.40</commission-amount>
<order-discount>0.00</order-discount>
<sid/>
<sale-amount>2.88</sale-amount>
</commission>
<commission>
<action-status>locked</action-status>
<action-type>advanced sale</action-type>
<aid>12378040</aid>
<commission-id>1970220452</commission-id>
<country>GB</country>
<event-date>2016-03-15T03:15:11-0700</event-date>
<locking-date>2016-05-14</locking-date>
<order-id>27330813</order-id>
<original>true</original>
<original-action-id>1695483653</original-action-id>
<posting-date>2016-03-15T04:01:28-0700</posting-date>
<website-id>7991782</website-id>
<action-tracker-id>361197</action-tracker-id>
<action-tracker-name>Sale</action-tracker-name>
<cid>3848495</cid>
<advertiser-name>boohoo.com</advertiser-name>
<commission-amount>0.60</commission-amount>
<order-discount>0.00</order-discount>
<sid>DnoAwoTTYtLs</sid>
<sale-amount>4.31</sale-amount>
</commission>
<commission>
<action-status>locked</action-status>
<action-type>advanced sale</action-type>
<aid>12378040</aid>
<commission-id>1972164361</commission-id>
<country>IL</country>
<event-date>2016-03-20T06:15:41-0700</event-date>
<locking-date>2016-05-19</locking-date>
<order-id>27439097</order-id>
<original>true</original>
<original-action-id>1697317694</original-action-id>
<posting-date>2016-03-20T07:00:46-0700</posting-date>
<website-id>7991782</website-id>
<action-tracker-id>361197</action-tracker-id>
<action-tracker-name>Sale</action-tracker-name>
<cid>3848495</cid>
<advertiser-name>boohoo.com</advertiser-name>
<commission-amount>1.01</commission-amount>
<order-discount>0.00</order-discount>
<sid>9rftdVKxGwud</sid>
<sale-amount>7.24</sale-amount>
</commission>
</commissions>
</cj-api>
attr(,"class")
[1] "XMLDocumentContent"
$dtd
$external
NULL
$internal
NULL
attr(,"class")
[1] "DTDList"
attr(,"class")
[1] "XMLDocument" "XMLAbstractDocument"
Another option I've tried is:
xmlfile2 <-
read.csv(text=rawToChar(
GET(url=query, add_headers(Authorization=token))
[["content"]]), header = TRUE, sep =',')
but then I received the data in the following way:
X..xml.version.1.0.encoding.UTF.8..
1 <cj-api><commissions total-matched=4><commission><action-status>closed</action-status><action-type>advanced sale</action-type><aid>10789406</aid><commission-id>1965209327</commission-id><country>US</country><event-date>2016-03-02T04:22:04-0800</event-date><locking-date>2016-05-10</locking-date><order-id>1786924</order-id><original>true</original><original-action-id>1691086180</original-action-id><posting-date>2016-03-02T05:03:45-0800</posting-date><website-id>7991782</website-id><action-tracker-id>337452</action-tracker-id><action-tracker-name>JimmyJazz Sale </action-tracker-name><cid>3010924</cid><advertiser-name>Jimmy Jazz</advertiser-name><commission-amount>0.16</commission-amount><order-discount>0.00</order-discount><sid></sid><sale-amount>1.99</sale-amount></commission><commission><action-status>locked</action-status><action-type>advanced sale</action-type><aid>12378040</aid><commission-id>1969836131</commission-id><country>IL</country><event-date>2016-03-14T05:53:36-0700</event-date><locking-date>2016-05-13</locking-date><order-id>27307042</order-id><original>true</original><original-action-id>1695118411</original-action-id><posting-date>2016-03-14T06:30:52-0700</posting-date><website-id>7991782</website-id><action-tracker-id>361197</action-tracker-id><action-tracker-name>Sale</action-tracker-name><cid>3848495</cid><advertiser-name>boohoo.com</advertiser-name><commission-amount>0.40</commission-amount><order-discount>0.00</order-discount><sid></sid><sale-amount>2.88</sale-amount></commission><commission><action-status>locked</action-status><action-type>advanced sale</action-type><aid>12378040</aid><commission-id>1970220452</commission-id><country>GB</country><event-date>2016-03-15T03:15:11-0700</event-date><locking-date>2016-05-14</locking-date><order-id>27330813</order-id><original>true</original><original-action-id>1695483653</original-action-id><posting-date>2016-03-15T04:01:28-0700</posting-date><website-id>7991782</website-id><action-tracker-id>361197</action-tracker-id><action-tracker-name>Sale</action-tracker-name><cid>3848495</cid><advertiser-name>boohoo.com</advertiser-name><commission-amount>0.60</commission-amount><order-discount>0.00</order-discount><sid>DnoAwoTTYtLs</sid><sale-amount>4.31</sale-amount></commission><commission><action-status>locked</action-status><action-type>advanced sale</action-type><aid>12378040</aid><commission-id>1972164361</commission-id><country>IL</country><event-date>2016-03-20T06:15:41-0700</event-date><locking-date>2016-05-19</locking-date><order-id>27439097</order-id><original>true</original><original-action-id>1697317694</original-action-id><posting-date>2016-03-20T07:00:46-0700</posting-date><website-id>7991782</website-id><action-tracker-id>361197</action-tracker-id><action-tracker-name>Sale</action-tracker-name><cid>3848495</cid><advertiser-name>boohoo.com</advertiser-name><commission-amount>1.01</commission-amount><order-discount>0.00</order-discount><sid>9rftdVKxGwud</sid><sale-amount>7.24</sale-amount></commission></commissions></cj-api>
Running the following:
xmlfile <- xmlTreeParse(GET(url=query, add_headers(Authorization=token)), useInternalNodes = T)
xmlToDataFrame(xmlfile)
I received:
commission
1 closedadvanced sale107894061965209327US2016-03-02T04:22:04-08002016-05-101786924true16910861802016-03-02T05:03:45-08007991782337452JimmyJazz Sale 3010924Jimmy Jazz0.160.001.99
NA
1 lockedadvanced sale123780401969836131IL2016-03-14T05:53:36-07002016-05-1327307042true16951184112016-03-14T06:30:52-07007991782361197Sale3848495boohoo.com0.400.002.88
NA
1 lockedadvanced sale123780401970220452GB2016-03-15T03:15:11-07002016-05-1427330813true16954836532016-03-15T04:01:28-07007991782361197Sale3848495boohoo.com0.600.00DnoAwoTTYtLs4.31
NA
1 lockedadvanced sale123780401972164361IL2016-03-20T06:15:41-07002016-05-1927439097true16973176942016-03-20T07:00:46-07007991782361197Sale3848495boohoo.com1.010.009rftdVKxGwud7.24
How can I save this XML into a data frame?
Thanks
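One possible approach (a sketch only, assuming the GET call above succeeds, and writing to an illustrative file name commissions.csv): parse the response with internal nodes, pass the <commission> node set to xmlToDataFrame so each commission becomes one row, then write the result to CSV.
library(XML)
library(httr)
resp <- GET(url=query, add_headers(Authorization=token))
doc <- xmlParse(rawToChar(resp$content), asText = TRUE)
#one row per <commission>, one column per child element
commissions <- xmlToDataFrame(nodes = getNodeSet(doc, "//commission"))
write.csv(commissions, "commissions.csv", row.names = FALSE)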