R: Reading XML as data.frame

I'm facing this issue: I cannot read an .xml file into a data.frame in R. I know this question already has great answers here and here, but I'm not able to adapt those answers to my needs, so sorry if it's a duplicate.
I have a .xml like this:
<?xml version='1.0' encoding='UTF-8'?>
<LexicalResource>
<GlobalInformation label="Created with the standard propagation algorithm"/>
<Lexicon languageCoding="UTF-8" label="sentiment" language="-">
<LexicalEntry id="id_0" partOfSpeech="adj">
<Lemma writtenForm="word"/>
<Sense>
<Confidence score="0.333333333333" method="automatic"/>
<Sentiment polarity="negative"/>
<Domain/>
</Sense>
</LexicalEntry>
</Lexicon>
</LexicalResource>
It is stored locally, so I tried this:
library(XML)
doc<-xmlParse("...\\test2.xml")
xmldf <- xmlToDataFrame(nodes=getNodeSet(doc,"//LexicalEntry/Lemma/Sense/Confidence/Sentiment"))
but the result is this:
> xmldf
data frame with 0 columns and 0 rows
So I tried the xml2 package:
library(xml2)
pg <- read_xml("...test2.xml")
recs <- xml_find_all(pg, "LexicalEntry")
> recs
{xml_nodeset (0)}
I lack experience manipulating .xml files, so I think I'm missing something. What am I doing wrong?

You need the attributes, not the element values; that's why the methods you used return nothing (all of the data in this file is stored in attributes, and xmlToDataFrame reads element text). Also note that Sentiment is a sibling of Confidence under Sense, not a child, so your XPath matches nothing. Try something like this:
data.frame(as.list(xpathApply(doc, "//Lemma", fun = xmlAttrs)[[1]]),
as.list(xpathApply(doc, "//Confidence", fun = xmlAttrs)[[1]]),
as.list(xpathApply(doc, "//Sentiment", fun = xmlAttrs)[[1]]))
writtenForm score method polarity
1 word 0.333333333333 automatic negative
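As an aside, the xml2 attempt returned an empty nodeset because "LexicalEntry" is evaluated relative to the document root; "//LexicalEntry" searches the whole tree. A minimal xml2 sketch of the same attribute extraction (my own variable names, assuming the file shown above):
library(xml2)
pg <- read_xml("test2.xml") # assumed local path
entry <- xml_find_first(pg, "//LexicalEntry")
data.frame(writtenForm = xml_attr(xml_find_first(entry, ".//Lemma"), "writtenForm"),
           score = xml_attr(xml_find_first(entry, ".//Confidence"), "score"),
           method = xml_attr(xml_find_first(entry, ".//Confidence"), "method"),
           polarity = xml_attr(xml_find_first(entry, ".//Sentiment"), "polarity"))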
Another option is to get all the attributes of the XML and build a data.frame from them:
df <- data.frame(as.list(unlist(xmlToList(doc, addAttributes = TRUE, simplify = TRUE))))
colnames(df) <- unlist(lapply(strsplit(colnames(df), "\\."), function(x) x[length(x)]))
df
label writtenForm score method
1 Created with the standard propagation algorithm word 0.333333333333 automatic
polarity id partOfSpeech languageCoding label language
1 negative id_0 adj UTF-8 sentiment -
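If the real lexicon contains many <LexicalEntry> nodes rather than the single one shown, here is a sketch that builds one row per entry (assuming each entry has exactly one Lemma, Confidence and Sentiment):
entries <- getNodeSet(doc, "//LexicalEntry")
rows <- lapply(entries, function(e) c(xmlAttrs(e),
                                      xmlAttrs(e[["Lemma"]]),
                                      xmlAttrs(e[["Sense"]][["Confidence"]]),
                                      xmlAttrs(e[["Sense"]][["Sentiment"]])))
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)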

Related

How to use a custom NRC-style lexicon on Syuzhet for R?

I am new to R and new to working with Syuzhet.
I am trying to make a custom NRC-style lexicon to use with the Syuzhet package in order to categorize words. Unfortunately, although this functionality now exists within Syuzhet, it doesn't seem to recognize my custom lexicon. Please excuse my weird variable names and the extra libraries; I plan to use them for other things later on and am just testing.
library(sentimentr)
library(pdftools)
library(tm)
library(readxl)
library(syuzhet)
library(tidytext)
texto <- "I am so love hate beautiful ugly"
text_cust <- get_tokens(texto)
custom_lexicon <- data.frame(lang = c("eng","eng","eng","eng"),
                             word = c("love", "hate", "beautiful", "ugly"),
                             sentiment = c("positive","positive","positive","positive"),
                             value = c("1","1","1","1"))
my_custom_values <- get_nrc_sentiment(text_cust, lexicon = custom_lexicon)
I get the following error:
my_custom_values <- get_nrc_sentiment(text_cust, lexicon = custom_lexicon)
New names:
• value -> value...4
• value -> value...5
Error in FUN(X[[i]], ...) :
custom lexicon must have a 'word', a 'sentiment' and a 'value' column
As far as I can tell, my data frame exactly matches that of the standard NRC library, containing columns labeled 'word', 'sentiment', and 'value'. So I'm not sure why I am getting this error.
The CRAN version of syuzhet's get_nrc_sentiment doesn't accept a lexicon; get_sentiment does. But your custom_lexicon also has an error: the values need to be numeric, not character. And to use your own lexicon you need to set the method to "custom", otherwise the custom lexicon will be ignored. The code below works with just syuzhet.
library(syuzhet)
texto <- "I am so love hate beautiful ugly"
text_cust <- get_tokens(texto)
custom_lexicon <- data.frame(lang = c("eng","eng","eng","eng"),
word = c("love", "hate", "beautiful", "ugly"),
sentiment = c("positive","positive","positive","positive"),
value = c(1,1,1,1))
get_sentiment(text_cust, method = "custom", lexicon = custom_lexicon)
[1] 0 0 0 1 1 1 1
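To see the value column actually doing something, here is a variant sketch with signed values (my own numbers, not from the question) so that negative words subtract:
signed_lexicon <- data.frame(lang = c("eng","eng","eng","eng"),
                             word = c("love", "hate", "beautiful", "ugly"),
                             sentiment = c("positive","negative","positive","negative"),
                             value = c(1, -1, 1, -1))
get_sentiment(text_cust, method = "custom", lexicon = signed_lexicon)
# should give: 0 0 0 1 -1 1 -1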

How to extract values from complex XML in R without discarding nodes that lack a value?

I am starting with a large, complex XML file and need to extract values and attributes of certain sub(sub...)nodes. But because not all subnodes have all of the wanted values (some are missing), I cannot simply use the very fast xml_find_all (xml2 package): it will of course skip the subnodes with missing values.
My solution is to use a for-loop cycling through all of my XML nodes (Objects) and check within each node whether my desired value exists; if yes, extract it. Thanks to the loop index I know which Object it belongs to and can write it to the corresponding data.frame$Feature[i].
This approach works fine, but for my large XML file it takes a VERY long time (20 min) and is very memory-hungry (~1.5 GB).
My XML: 100 MB, about 30,000 entries/Objects, each with about 50 features (~2 million lines).
The main problem, as far as I can tell: xpathSApply(... xml_path(Obj[i]) ...) gets very slow when the loop index [i] is high (>5000).
My questions are:
Are there any better/simpler ideas for solving my problem with a very complex and highly inhomogeneous structured XML, where not all features are present in all objects (nodes)?
I read this interesting approach, but could not figure out how to translate it to my very complex XML, where the desired values sit at different nodeset levels...
Is there maybe some nested xpathSApply expression to circumvent the for-loop and avoid indexing?
Do you know of any "vector"-processing approaches (which are much faster in R) for my problem?
See my MWE-Code with some more comments below.
XML
<?xml version="1.0" encoding="UTF-8"?>
<featureMember>
<Object>
<XML_Name>Object 1</XML_Name>
<XML_Feature1>
<XML_Feature1a href="URL1"></XML_Feature1a>
</XML_Feature1>
<XML_Feature2>
<XML_Feature2a>1</XML_Feature2a>
<XML_Feature2a>1x</XML_Feature2a>
<XML_Feature2a>1y</XML_Feature2a>
</XML_Feature2>
<XML_Feature3>
<XML_Feature3a>F3a_1</XML_Feature3a>
<XML_Feature3b>F3b_1</XML_Feature3b>
</XML_Feature3>
<XML_Feature3>
<XML_Feature3a>F3a_2</XML_Feature3a>
<XML_Feature3b>F3b_2</XML_Feature3b>
</XML_Feature3>
<XML_Feature4>F4_1</XML_Feature4>
<XML_Feature4>F4_2</XML_Feature4>
</Object>
<Object >
<XML_Name>Object 2</XML_Name>
<XML_Feature1>
<XML_Feature1a href="URL2"></XML_Feature1a>
</XML_Feature1>
</Object>
<Object >
<XML_Name>Object 3</XML_Name>
<XML_Feature1>
<XML_Feature1>
</XML_Feature1>
</XML_Feature1>
<XML_Feature2>
<XML_Feature2a>Value 3</XML_Feature2a>
</XML_Feature2>
</Object>
</featureMember>
R
require(xml2)
require(XML)
test_xml2 <- read_xml("above_file.xml") # using package xml2 (for xml_find_all)
test_XML <- xmlParse("above_file.xml")  # using package XML (for xpathSApply)
# XML nodeset of all Objects I want to process:
Obj <- xml_find_all(test_xml2, "//Object") # --> has 3 nodes, contains all Objects!
# initialize a destination dataframe and fill with NAs
df <- data.frame('Name'=integer(), 'f2a'=character() , 'f1a'=character(), stringsAsFactors = FALSE)
df[1:length(Obj),] <- NA
# My initial approach of extracting all features with xml_find_all (which is very fast) does not work, because not all XML nodes have all of the wanted features:
Name <- xml_text(xml_find_all(test_xml2, "//XML_Name"))
# --> length(Name)=3, because all 3 Objects have a name!
f1a <- xml_attr(xml_find_all(test_xml2, "//XML_Feature1/XML_Feature1a"),"href")
# --> length(f1a)=2, because XML_Feature1a is missing in Object3!
f2a <- xml_text(xml_find_all(test_xml2, "//XML_Feature2/XML_Feature2a"))
# --> length(f2a)=2, because XML_Feature2a is missing in Object2!
# Joining these into a final df is not possible, because "Name", "f2a" and "f1a" have different lengths, and correct matching of the data is impossible!
# Therefore I decided to take the following approach instead.
# 1.) crawl all features which are present in all nodes, because it's fast (here: "Name"):
df$Name <- xml_text(xml_find_all(test_xml2, "//XML_Name"))
# 2.) loop over all Objects/XML nodes of interest and check whether each wanted feature exists.
# if yes: write it to df$FeatureXY[i]
# if not: do nothing (so df$FeatureXY[i] stays NA from the initialization)
for (i in 1:length(Obj))
{ # 1. Feature:
  tmp <- xpathSApply(test_XML, paste0(xml_path(Obj[i]), "/XML_Feature1/XML_Feature1a"), xmlGetAttr, "href")
  if (length(tmp) > 0) { df$f1a[i] <- tmp } # guard against empty results, which would otherwise error
  # 2. Feature:
  tmp <- xpathSApply(test_XML, paste0(xml_path(Obj[i]), "/XML_Feature2/XML_Feature2a"), xmlValue)
  if (length(tmp) > 0) { df$f2a[i] <- tmp }
}
# Result of df as it should be:
# Name f2a f1a f3a f3b f4
# Object 1 1 # 1x # 1y URL1 F3a_1 # F3a_2 F3b_1 # F3b_2 F4_1 # F4_2
# Object 2 NA URL2 NA NA NA
# Object 3 Value 3 NA NA NA NA
Edit 1: Extended XML example (multiple elements of feature2a, feature3a/b feature4)
Problems like this can be tricky, since the actual data may vary from the sample data. If we assume there is at most a single "Feature1a" node and at most a single "Feature2a" node per "Object", then this breaks down to a straightforward problem.
First find all of the parent "Object" nodes, then use this vector of nodes to parse each one for the Name, the feature1a attribute and the feature2a text. xml_find_first will return a value if the node exists and NA if it does not. Since xml_find_first is vectorized, it will operate on the vector of parent nodes without the need for a loop, with a very significant performance improvement.
library(xml2)
library(dplyr)
#Read file to process
doc<- read_xml("above_file.xml")
#find parent nodes
parents <- xml_find_all(doc, ".//Object")
#Now extract the requested data from each parent
# Notice the use of the . in the xpath.
# // finds anywhere in the document (ignoring the current node)
# .// finds anywhere beneath the current node
Names<- xml_find_first(parents, ".//XML_Name") %>% xml_text()
feature1 <- xml_find_first(parents, ".//XML_Feature1a") %>% xml_attr("href")
#fill features with first elements as default
feature2 <- xml_find_first(parents, ".//XML_Feature2a") %>% xml_text()
#find XML_Feature2 nodes with more than one child
#(their positions happen to line up with the containing parents here)
moretwos <- which(xml_find_all(parents, ".//XML_Feature2") %>% xml_length() > 1)
#reparse the parent nodes with more than one child
feature2[moretwos] <-sapply(parents[moretwos], function(node){
xml_find_all(node, ".//XML_Feature2a") %>% xml_text() %>% paste(collapse = "#")
})
#Make combined data frame
answer <-data.frame(Names, feature1, feature2)
answer
Here is a similar question but with an unknown number of subnodes: Create data frame from xml with different number of elements
UPDATE
For your revised problem of having multiple subnodes with multiple children, but no grandchildren, here is an option.
#find parent nodes
parents<-xml_find_all(doc, ".//Object")
dfs<-lapply(parents, function(parent) {
#Get object name
object<-xml_find_first(parent, ".//XML_Name") %>% xml_text()
#find the number of children under each child
numchild<-xml_children(parent) %>% xml_length()
#if number of children is zero get name and value
name <- xml_children(parent)[numchild==0] %>% xml_name()
value <- xml_children(parent)[numchild==0] %>% xml_text()
#if the number of children is 1 or more, get the names and values of their children
namec2 <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_name()
valuec2 <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_text()
#make data frame of the values and column headings
df<-data.frame(object, name=c(name, namec2), value=c(value, valuec2), stringsAsFactors = FALSE)
df
})
#Make combined data frame
answer<-bind_rows(dfs)
answer
library(tidyr)
pivot_wider(answer, object, names_from = name, values_from= value, values_fn = list(value = toString))
The final answer will need some cleaning of the columns, e.g. gsub(", ", " # ", ...), plus retrieving the URL attribute as shown above; a sketch of the separator cleanup follows.
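A sketch of that cleanup, reusing the pivot_wider call from above (the resulting column names depend on the actual XML):
answer_wide <- pivot_wider(answer, object, names_from = name,
                           values_from = value, values_fn = list(value = toString))
# swap toString's ", " separator for the " # " used in the expected output
answer_wide[-1] <- lapply(answer_wide[-1], function(x) gsub(", ", " # ", x))
answer_wide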
At 100 MB, consider XSLT, the special-purpose language designed to transform XML files, e.g. from very nested levels to flatter outputs, for easy import into an R data frame. R can run XSLT with xslt, an extension package to xml2. Otherwise, use any XSLT executable to handle the transformation, as demonstrated further below. And because you also use XML, consider its convenience method xmlToDataFrame for importing the flatter XML.
XSLT (save as .xsl file, a special .xml file)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/featureMember">
<root>
<xsl:apply-templates select="Object"/>
</root>
</xsl:template>
<xsl:template match="Object">
<data>
<Name><xsl:apply-templates select="XML_Name"/></Name>
<f1><xsl:apply-templates select="XML_Feature1/XML_Feature1a/@href"/></f1>
<f2><xsl:apply-templates select="XML_Feature2"/></f2>
</data>
</xsl:template>
</xsl:stylesheet>
XML Output
<?xml version="1.0" encoding="utf-16"?>
<root>
<data>
<Name>Object 1</Name>
<f1>URL1</f1>
<f2>Value 1</f2>
</data>
<data>
<Name>Object 2</Name>
<f1>URL2</f1>
<f2 />
</data>
<data>
<Name>Object 3</Name>
<f1 />
<f2>Value 3</f2>
</data>
</root>
R (no for loops, apply calls, if logic, or XPath search needed)
library(xml2)
library(xslt)
library(XML)
# PARSE XML AND XSLT
doc <- read_xml('/path/to/input.xml')
style <- read_xml('/path/to/script.xsl')
# TRANSFORM NESTED INPUT INTO FLATTER OUTPUT
new_xml <- as.character(xslt::xml_xslt(doc, style))
# PARSE FLATTER XML
flat_xml <- XML::xmlParse(new_xml, asText=TRUE)
# BUILD DATA FRAME
final_df <- XML::xmlToDataFrame(nodes = XML::getNodeSet(flat_xml, "//data"))
To demonstrate an external XSLT solution, the code below interfaces with the xsltproc command line tool, available on Unix machines (i.e., Linux, macOS):
library(XML)
# COMMAND LINE CALL TO UNIX'S XSLTPROC (ALTERNATIVE TO xslt PACKAGE)
system("xsltproc -o /path/to/input.xml /path/to/script.xsl /path/to/output.xml")
flat_xml <- xmlParse("/path/to/output.xml")
final_df <- xmlToDataFrame(flat_xml, getNodes(nodes="//data"))

Fetching data from OECD into R via SDMX(XML)

I want to extract data from the OECD website, particularly the dataset "REGION_ECONOM" with the dimensions "GDP" (GDP of the respective regions) and "POP_AVG" (the average population of the respective region).
This is the first time I am doing this:
I picked all the required dimensions on the OECD website and copied the SDMX (XML) link.
I tried to load them into R and convert them to a data frame with the following code:
(in the link I replaced the list of all regions with "ALL" as otherwise the link would have been six pages long)
if (!require(rsdmx)) install.packages('rsdmx') + library(rsdmx)
url2 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.ALL.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx2 <- readSDMX(url2)
stats2 <- as.data.frame(sdmx2)
head(stats2)
Unfortunately, this returns a "400 Bad request" error.
When just selecting a couple of regions the error does not appear:
if (!require(rsdmx)) install.packages('rsdmx') + library(rsdmx)
url1 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.AUS+AU1+AU101+AU103+AU104+AU105.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx1 <- readSDMX(url1)
stats1 <- as.data.frame(sdmx1)
head(stats1)
I also tried to use the "OECD" package to get the data. There I had the same problem. ("400 Bad Request")
if (!require(OECD)) install.packages('OECD') + library(OECD)
df1<-get_dataset("REGION_ECONOM", filter = "GDP+POP_AVG",
start_time = 2008, end_time = 2009, pre_formatted = TRUE)
However, when I use the package for other data sets it does work:
df <- get_dataset("FTPTC_D", filter = "FRA+USA", pre_formatted = TRUE)
Does anyone know where my mistake could lie?
The SDMX-ML API does not seem to work as explained (using the ALL parameter), whereas the JSON API works just fine. The following query returns the values for all countries as JSON; I simply replaced ALL by an empty field:
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
Transforming it into a readable format is not so trivial. I played around a bit and found the following work-around:
# send a GET request using httr
library(httr)
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
dat_raw <- GET(query)
library(jsonlite) # parse_json() comes from jsonlite, not httr
dat_parsed <- parse_json(content(dat_raw, "text")) # parse the content
Next, access the observations from the nested list and transform them to a matrix. Also extract the features from the keys:
dat_obs <- dat_parsed[["dataSets"]][[1]][["observations"]]
dat0 <- do.call(rbind, dat_obs) # get a matrix
new_features <- matrix(as.numeric(do.call(rbind, strsplit(rownames(dat0), ":"))), nrow = nrow(dat0))
dat1 <- cbind(new_features, dat0) # add feature columns
dat1_df <- as.data.frame(dat1) # optionally transform to data frame
Finally, you want to find out what the keys mean. Those are hidden in the "structure" element, which you also need to parse correctly, so I wrote a function to make extracting the values and ids easier:
## Get keys of features
keys <- dat_parsed[["structure"]][["dimensions"]][["observation"]]
for (i in 1:length(keys)) print(paste("id position:", i, "is feature", keys[[i]]$id))
# apply keys
get_features <- function(data_input, keys_input, feature_index, value = FALSE) {
keys_temp <- keys_input[[feature_index]]$values
keys_temp_matrix <- do.call(rbind, keys_temp)
keys_temp_out <- keys_temp_matrix[, value + 1][unlist(data_input[, feature_index])+1] # column 1 is id, 2 is value
return(unlist(keys_temp_out))
}
head(get_features(dat1_df, keys, 7))
head(get_features(dat1_df, keys, 2, value = FALSE))
head(get_features(dat1_df, keys, 2, value = TRUE))
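To label every column at once, a small follow-up sketch (my own addition; it assumes one key per feature column and that the feature columns come first in dat1_df):
decoded <- as.data.frame(lapply(seq_along(keys), function(i)
  get_features(dat1_df, keys, i, value = TRUE)))
names(decoded) <- sapply(keys, function(k) k$id)
head(decoded)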
I hope that helps you in your project.

Result of character(0) when trying to web scrape text

I'm attempting to automate scraping the practice words from this site https://www.livechatinc.com/typing-speed-test/#/ but get a result of character(0).
I read the URL with read_html, pass that to html_nodes() along with the CSS selector for the practice words, and then read it with html_text, but I get character(0) every time.
No clue what I'm doing wrong; here is the code:
library('rvest')
url <- read_html("https://www.livechatinc.com/typing-speed-test/#/")
wbpg_html <- html_nodes(url,".test-prompt")
wbpg_txt <- html_text(wbpg_html)
> wbpg_txt
character(0)
I'd just like to get the practice words into R; I'll figure out how to automate it later.
Thanks for any help.
The word list comes from this js file: https://cdn.livechatinc.com/gtt/app.3.8.min.js
You can try to regex it out with R using:
e\\.exports=\\{words:\\[(.*?)\\]
I ran a quick test with Python:
import requests, re
r = requests.get('https://cdn.livechatinc.com/gtt/app.3.8.min.js')
p = re.compile(r'e\.exports={words:\[(.*?)\]')
words = p.findall(r.text)
print(words)
With R:
library(rvest)
library(stringr)
library(readr)
library(dplyr)
urlmatrix <- paste(readLines('https://cdn.livechatinc.com/gtt/app.3.8.min.js', warn = FALSE),
                   collapse = " ") %>%
  str_match(., 'e\\.exports=\\{words:\\[(.*?)\\]')
words <- strsplit(as.character(as.list(urlmatrix[,2])[[1]]), '","')
words[[1]][1] <- substring(words[[1]][1], 2, nchar(words[[1]][1])) # drop the leading quote
words[[1]][length(words[[1]])] <- gsub('\\"', "", words[[1]][length(words[[1]])]) # drop the trailing quote
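For reference, a more compact base-R variant of the same extraction (a sketch; same regex, my own variable names):
js <- paste(readLines("https://cdn.livechatinc.com/gtt/app.3.8.min.js", warn = FALSE), collapse = " ")
m <- regmatches(js, regexec('e\\.exports=\\{words:\\[(.*?)\\]', js))[[1]][2]
words <- gsub('"', "", strsplit(m, ",")[[1]])
head(words)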

R, Getting the top in every category from a data frame?

I have the following data frame
id,category,value
A,21,0.89
B,21,0.73
C,21,0.61
D,12,0.95
E,12,0.58
F,12,0.44
G,23,0.33
Note: they are already sorted by value within each (id, category). What I would like to do is take the top id from each category and make a string, followed by the second in each category, and so on. So for the above example it would look like
A,D,G,B,E,C,F
Is there a way to do it easily in R? Or am I better off relying on a Perl script to do it?
Thanks much in advance
This appears to work, but I'm certain we could simplify it somewhat, particularly if you are able to relax your ordering requirements:
library(plyr)
d <- read.table(text = "id,category,value
A,21,0.89
B,21,0.73
C,21,0.61
D,12,0.95
E,12,0.58
F,12,0.44
G,23,0.33",sep = ',',header = TRUE)
d <- ddply(d,.(category),transform,r = seq_along(category))
d <- arrange(d,id)
> paste(d$id[order(d$r)],collapse = ",")
[1] "A,D,G,B,E,C,F"
This version is probably more robust to ordering, and avoids plyr:
d$r <- unlist(sapply(rle(d$category)$lengths,seq_len))
d$s <- 1:nrow(d)
with(d,paste(id[order(r,s)],collapse = ","))
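For comparison, a dplyr-based sketch of the same idea (rank within category, then order by rank; assumes the same d as above, with ties ordered by id as in the plyr version):
library(dplyr)
d2 <- d %>%
  group_by(category) %>%
  mutate(r = row_number()) %>% # position within each category
  ungroup() %>%
  arrange(r, id)
paste(d2$id, collapse = ",")
# should give: "A,D,G,B,E,C,F"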
