Extracting data from XML file - R

I am trying to extract data from the following XML file, namely the value inside each Object-id_str element ("HMDB0003993" in this case). The goal is to extract ALL such values, as there will be multiple per XML file. I am then hoping to aggregate all these IDs into a data frame (one column, presumably).
<?xml version="1.0"?>
<PC-Substances
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov
ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"
>
<PC-Substance>
<PC-Substance_sid>
<PC-ID>
<PC-ID_id>348276207</PC-ID_id>
<PC-ID_version>1</PC-ID_version>
</PC-ID>
</PC-Substance_sid>
<PC-Substance_source>
<PC-Source>
<PC-Source_db>
<PC-DBTracking>
<PC-DBTracking_name>811</PC-DBTracking_name>
<PC-DBTracking_source-id>
<Object-id>
<Object-id_str>HMDB0003993</Object-id_str>
<PC-XRefData_patent>US7807614</PC-XRefData_patent>
I have tried readLines() with str_match(), and read_xml() with node functions, but am having trouble working with the results and turning them into a data frame.
I have tried the approach below, using this code:
my_xml <- read_xml("my_xml.xml")
usernodes <- xml_find_all(my_xml, ".//PC-Substance")
ui <- sapply(usernodes, function(n){xml_text(xml_find_first(n, ".//Object-id_str"))})
However, things go awry when trying to find the nodes: I keep getting back an empty nodeset (length 0).
Any help appreciated.
Thank you!
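A likely culprit here is the default namespace declared on <PC-Substances> (xmlns="http://www.ncbi.nlm.nih.gov"): with xml2, unprefixed XPath expressions do not match elements that live in a default namespace, which would explain the empty nodeset. Below is a minimal sketch (assuming the file name used above) that strips the namespaces before searching:

library(xml2)

my_xml <- read_xml("my_xml.xml")

# Drop the namespace declarations so plain XPath like ".//PC-Substance" matches
xml_ns_strip(my_xml)

usernodes <- xml_find_all(my_xml, ".//PC-Substance")
ids <- sapply(usernodes, function(n) xml_text(xml_find_first(n, ".//Object-id_str")))

# One-column data frame of the extracted IDs
id_df <- data.frame(object_id = ids, stringsAsFactors = FALSE)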

Related

How do you download data from API and export it into a nice CSV file to query?

I am trying to figure out how to 'download' data into a nice CSV file to be able to analyse.
I am currently looking at WHO data here:
I am doing so by following the documentation and getting output like this:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of lists of lists, which is not easy to analyse. How could I clean this up so that I keep only a couple of the fields returned by parse_json, say the dim entries such as REGION, YEAR and COUNTRY, plus the values from the Value column? I would like to turn this into a tidy data frame/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you data in a better format and the 3rd element in the list is where the main data is.
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
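Building on that, here is a rough sketch of narrowing the result down to the dimensions and values mentioned in the question. The column names dim.REGION, dim.YEAR, dim.COUNTRY and Value are assumptions about what flatten = TRUE produces here, so check names(data) first:

library(jsonlite)

url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- fromJSON(url, flatten = TRUE)  # flatten nested fields into plain columns
data <- tmp[[3]]

names(data)  # inspect the actual column names before subsetting

# Keep only the dimensions of interest plus the value, then write a CSV
wanted <- c("dim.REGION", "dim.YEAR", "dim.COUNTRY", "Value")
out <- data[, intersect(wanted, names(data))]
write.csv(out, "who_whs6_102.csv", row.names = FALSE)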

Analyze code in dataframe with Tidycode in R

I am trying to take R code, stored in cells of the content column of a data frame, and analyze the functions used by applying the tidycode package. However, I first need to convert the data to a matahari tibble before applying the unnest_calls() function.
Here is the data:
data <- read.csv("https://github.com/making-data-science-count/TidyTuesday-Analysis/raw/master/db-tmp/cleaned%20database.csv")
I have tried doing this in a number of different ways, including extracting each row (of the content column) as an .R file and then reading it back in with tidycode, for example:
tmp <- data$content2[1]
writeLines(tmp, "tmp.R")  # I've also used save() and write()
rfile <- tidycode::read_rfiles("tmp.R")
But, I keep getting errors such as: "Error in parse(text = x) : <text>:1:14: unexpected symbol
1: library(here)library"
Ultimately, what I would like to do is analyze the different types of code per file, and keep that linked to the other columns in the data data frame, such as date and username.
Any help would be greatly appreciated!
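Not a full answer, but a couple of things worth checking in the workflow above, sketched below: read.csv() in older R versions (before 4.0) returns the content column as a factor, so it usually needs as.character() before writeLines(), and a tempfile() keeps the working directory clean. The column name content follows the question text (the snippet above uses content2); adjust if yours differs.

data <- read.csv(
  "https://github.com/making-data-science-count/TidyTuesday-Analysis/raw/master/db-tmp/cleaned%20database.csv",
  stringsAsFactors = FALSE  # otherwise older R versions return factors, not character
)

# Write one cell of R code to a temporary .R file, then parse it with tidycode
tmp_path <- tempfile(fileext = ".R")
writeLines(as.character(data$content[1]), tmp_path)
rfile <- tidycode::read_rfiles(tmp_path)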

Why is Rtweet's parse_stream() function returning a NULL object?

I have a series of .json files each containing data captured from between 500 and 10,000 tweets (3-40 MB each). I am trying to use rtweet's parse_stream() function to read these files into R and store the tweet data in a data table. I have tried the following:
tweets <- parse_stream(path = "india1_2019090713.json")
There is no error message and the command creates a tweets object, but it is empty (NULL). I have tried this with other .json files, and the result is the same. Has anyone encountered this behaviour, or is there something obvious I am doing wrong? I would appreciate any advice for an rtweet newbie!
I am using rtweet version 0.6.9.
Many thanks!
As an update and partial answer:
I've not made progress with the original issue, but I have had a lot more success using the jsonlite package, which is amply able to read in large and complex .json files containing Tweet data.
library(jsonlite)
I used the fromJSON() function as detailed here. I found I needed to edit the original .json file to match the required structure, beginning and ending the file with square brackets ([ ]) and adding a comma before each line break at the end of each Tweet. Then:
tweetsdf <- fromJSON("india1_2019090713.json", simplifyDataFrame = TRUE, flatten = TRUE)
simplifyDataFrame ensures the contents are saved as a data frame with one row per Tweet, and flatten collapses most of the nested Tweet attributes to separate columns for each sub-value rather than generating columns full of unwieldy list structures.
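For completeness, here is a sketch of doing that edit programmatically rather than by hand, assuming the raw stream file has one Tweet JSON object per line:

library(jsonlite)

# Read the streamed file, join the Tweet lines with commas, and wrap the whole
# thing in [ ] so it becomes a single JSON array
raw_lines <- readLines("india1_2019090713.json", warn = FALSE)
raw_lines <- raw_lines[nzchar(raw_lines)]  # drop any empty lines
writeLines(paste0("[", paste(raw_lines, collapse = ","), "]"),
           "india1_2019090713_array.json")

tweetsdf <- fromJSON("india1_2019090713_array.json",
                     simplifyDataFrame = TRUE, flatten = TRUE)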

Creating temporary data frames in R

I am importing multiple excel workbooks, processing them, and appending them subsequently. I want to create a temporary dataframe (tempfile?) that holds nothing in the beginning, and after each successive workbook processing, append it. How do I create such temporary dataframe in the beginning?
I am coming from Stata and I use tempfile a lot. Is there a counterpart to tempfile from Stata to R?
As @James said, you do not need an empty data frame or tempfile; simply add each newly processed data frame to the first one. Here is an example (based on CSV files, but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
# do stuff
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
  rbind(pre_processor(read.csv('2.csv'))) %>%
  ...
Now if you have a lot of files or very complicated pre-processing then you might have other questions (e.g. how to loop over the list of files, or how to write the right pre-processing function), but those should really be separate questions and would need more specifics (example data, code so far, etc.).
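That said, here is a minimal sketch of the looped version, using the same hypothetical list_of_files and pre_processor() from above:

library(dplyr)

list_of_files <- c('1.csv', '2.csv')  # extend with the real file names

# Process each file, then stack the resulting data frames row-wise
dataframe <- list_of_files %>%
  lapply(function(f) pre_processor(read.csv(f))) %>%
  bind_rows()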

Creating a dataset from an XML file in R statistics

I am trying to download an XML file of journal article records and create a dataset for further interrogation in R. I'm completely new to XML and quite a novice at R. I cobbled together some code using bits from two sources:
GoogleScholarXScraper
and
Extracting records from pubMed
library(RCurl)
library(XML)
library(stringr)
#Search terms
SearchString<-"cancer+small+cell+non+lung+survival+plastic"
mySearch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",SearchString,"&usehistory=y",sep="",collapse=NULL)
#Search
pub.esearch<-getURL(mySearch)
#Extract QueryKey and WebEnv
pub.esearch<-xmlTreeParse(pub.esearch,asText=TRUE)
key<-as.numeric(xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["QueryKey"]]))
env<-xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["WebEnv"]])
#Fetch Records
myFetch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv=",env,"&retmode=xml&query_key=",key)
pub.efetch<-getURL(myFetch)
myxml<-xmlTreeParse(pub.efetch,asText=TRUE,useInternalNodes=TRUE)
#Create dataset of article characteristics #This doesn't work
pub.data<-NULL
pub.data<-data.frame(
journal <- xpathSApply(myxml,"//PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA", xmlValue),
abstract<- xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText",xmlValue),
affiliation<-xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Affiliation", xmlValue),
year<-xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", xmlValue)
,stringsAsFactors=FALSE)
The main problem I seem to have is that my returned XML file is not completely uniformly structured. For example, some references have a node structure like this:
- <Abstract>
<AbstractText>The Wilms' tumor gene... </AbstractText>
Whilst some have labels and look like this:
- <Abstract>
<AbstractText Label="BACKGROUND & AIMS" NlmCategory="OBJECTIVE">Some background text.</AbstractText>
<AbstractText Label="METHODS" NlmCategory="METHODS"> Some text on methods.</AbstractText>
When I extract the AbstractText I am hoping to get 24 rows of data back (there are 24 records when I run this made-up search today), but xpathSApply returns each labelled AbstractText as a separate element of my data frame. Is there a way to collapse the XML structure in this case, or to ignore the labels? Is there a way to make xpathSApply return NA when nothing is found at the end of a path? I am aware of xmlToDataFrame, which sounds like it should fit the bill, but whenever I try to use it, it doesn't seem to give me anything sensible.
Thanks for your help
I am unsure which one you want, however:
xpathSApply(myxml,"//*/AbstractText[#Label]")
will get the nodes with labels (keeping all attributes etc).
xpathSApply(myxml,"//*/AbstractText[not(#Label)]",xmlValue)
will get the nodes without labels.
EDIT:
test<-xpathApply(myxml,"//*/Abstract",xmlValue)
> length(test)
[1] 24
may give you what you want
EDIT:
To get affiliation, year, etc. padded with NAs:
dumfun <- function(x, xstr){
  res <- xpathSApply(x, xstr, xmlValue)
  if(length(res) == 0){
    out <- NA
  } else {
    out <- res
  }
  out
}
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Affiliation')
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Journal/JournalIssue/PubDate/Year')
