Extract metadata with R

Good day,
I am a newbie to Stack Overflow :)
I am trying my hand at programming with R and have found this platform a great source of help.
I have developed some code with the help of Stack Overflow, but now I am failing to read the metadata from this .htm file.
Please download this file directly before using it in R:
setwd("~/NLP")
library(tm)
library(rvest)
library(tm.plugin.factiva)
file <-read_html("facts.htm")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))
# See the contents of the documents
inspect(corpus)
head(corpus)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
# See metadata associated with the third article
meta(corpus[[3]])
author : character(0)
datetimestamp: 2017-08-31
description : character(0)
heading : Rain, Rain, Rain
id : TIMEUK-170830-e
language : en
origin : thetimes.co.uk
edition : character(0)
section : Comment
subject : c("Hurricanes/Typhoons", "Storms", "Political/General News", "Disasters/Accidents", "Natural Disasters/Catastrophes", "Risk News", "Weather")
coverage : c("United States", "North America")
company : character(0)
industry : character(0)
infocode : character(0)
infodesc : character(0)
wordcount : 333
publisher : News UK & Ireland Limited
rights : © Times Newspapers Limited 2017
How can I save each document's metadata (SE, HD, AU, ..., PUB) - all 18 metadata elements - column-wise in a data frame, or write them to Excel, with one row per document in the corpus?
Example of output:
      SE   HD   AU   ...
Doc 1
Doc 2
Doc 3
Thank you for your help

The simplest way I know to do it is:
Make a data frame from the metadata of each of the three documents in your corpus:
one   <- data.frame(unlist(meta(corpus[[1]])))
two   <- data.frame(unlist(meta(corpus[[2]])))
three <- data.frame(unlist(meta(corpus[[3]])))
Then you will want to merge them into a single data frame. For the first two this is easy: merging by "row.names" joins them on the (non-variable) row names. For the second merge, however, you need to merge on the column that is now named "Row.names". So you need to turn the row names of the third data frame into its first column and rename that column; using setDT() from data.table lets you do this without adding another full copy of the data, simply redirecting R to treat the row names as the first column.
library(data.table)  # provides setDT()
setDT(three, keep.rownames = TRUE)[]
colnames(three)[1] <- "Row.names"
Then you simply merge the first and second data frames into a variable named meta, and merge meta with three using "Row.names" (the name of the new first column):
meta <- merge(one, two, by="row.names", all=TRUE)
meta <- merge(meta, three, by = "Row.names", all=TRUE)
Your data will look like this:
Row.names unlist.meta.corpus..1.... unlist.meta.corpus..2.... unlist.meta.corpus..3....
1 author Jenni Russell <NA> <NA>
2 coverage1 United States North Korea United States
3 coverage2 North America United States North America
4 coverage3 <NA> Japan <NA>
5 coverage4 <NA> Pyongyang <NA>
6 coverage5 <NA> Asia Pacific <NA>
Those NA values are there because not all of the sub-lists had values for all of the observations.
By using the all=TRUE on both merges, you preserve all of the fields, with and without data, which makes it easy to work with moving forward.
If you look at this PDF from CRAN, the Details section on page two shows you how to access the content and metadata. From there it is simply a matter of unlisting to move them into data frames.
If you get lost, send a comment and I will do what I can to help you out!
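If your corpus has more than a handful of documents, the same unlist-and-combine idea can be generalized so that each document becomes one row and the metadata fields become columns. A hedged sketch, assuming every Factiva document in corpus carries the same set of fields:
# one row per document, one column per metadata field
meta_rows <- lapply(seq_along(corpus), function(i) {
  m <- meta(corpus[[i]])
  # collapse multi-valued fields such as subject or coverage into one string
  data.frame(lapply(m, function(x) paste(x, collapse = "; ")),
             stringsAsFactors = FALSE)
})
meta_wide <- do.call(rbind, meta_rows)  # rbind assumes identical field names in every document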
EDIT BY REQUEST:
Writing this to Excel is not difficult, because the data is already "square" in a uniform data frame. You would just install the xlsx package (and its xlsxjars dependency) and then use the following function:
library(xlsx)
write.xlsx(meta, file, sheetName = "Sheet1",
           col.names = TRUE, row.names = TRUE, append = FALSE, showNA = TRUE)
You can find information about the package here: page 38 gives more detail.
And if you want to save the document text instead, change meta to content in the lines that extract the data from the corpus and build the initial data frames; the entire process is the same otherwise, as sketched below.
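A quick illustration of that swap (a sketch only; the downstream merge and write.xlsx steps stay exactly as above):
# same pattern as before, but pulling the document text instead of the metadata
one   <- data.frame(unlist(content(corpus[[1]])))
two   <- data.frame(unlist(content(corpus[[2]])))
three <- data.frame(unlist(content(corpus[[3]])))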

Related

How do I create two subsets out of a corpus based on multiple keywords?

I am working with a large body of political speeches in quanteda and would like to create two subsets.
The first one should contain the speeches that include one or more of a list of specific keywords (e.g. "migrant*", "migration*", "asylum*"). The second one should contain the documents that do not contain any of these terms (the speeches which do not fall into the first subset).
Any input on this would be greatly appreciated. Thanks!
#first suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern=paste0(regex_pattern), ignore_case = TRUE, collapse="|"), "yes", "no")
Warning messages:
1: In (function (case_insensitive, comments, dotall, dot_all = dotall, :
Unknown option to `stri_opts_regex`.
2: In stringi::stri_detect_regex(corp_labcon, pattern = paste0(regex_pattern), :
longer object length is not a multiple of shorter object length
> table(corp_labcon$criteria)
no yes
556921 6139
#Second suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern = paste0(glob2rx(regex_pattern), collapse = "|")), "yes","no")
> table(corp_labcon$criteria)
no
563060
You didn't give a reproducible example, but I will show how it can be done with quanteda and the available corpus data_corpus_inaugural. You can make use of the docvars that you can attach to your corpus. It is just like adding a variable to a data.frame.
With stringi::stri_detect_regex you check each document for any of the target words; if one is found, the value in the criteria column is set to "yes", otherwise to "no". After that you can use corpus_subset to create two new corpora based on the criteria values. See the example code below.
library(quanteda)
# words used in regex selection
regex_pattern <- c("migrant*", "migration*", "asylum*")
# add selection to corpus
data_corpus_inaugural$criteria <- ifelse(stringi::stri_detect_regex(data_corpus_inaugural,
                                                                    pattern = paste0(regex_pattern,
                                                                                     collapse = "|")),
                                         "yes", "no")
# Check docvars and new criteria column
head(docvars(data_corpus_inaugural))
Year President FirstName Party criteria
1 1789 Washington George none yes
2 1793 Washington George none no
3 1797 Adams John Federalist no
4 1801 Jefferson Thomas Democratic-Republican no
5 1805 Jefferson Thomas Democratic-Republican no
6 1809 Madison James Democratic-Republican no
# split corpus into segment 1 and 2
segment1 <- corpus_subset(data_corpus_inaugural, criteria == "yes")
segment2 <- corpus_subset(data_corpus_inaugural, criteria == "no")
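A quick way to confirm the split worked is to count the documents in each subset with ndoc() (a minimal check, not part of the original answer):
# the two subsets together should account for every speech in the corpus
ndoc(segment1)
ndoc(segment2)
ndoc(segment1) + ndoc(segment2) == ndoc(data_corpus_inaugural)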
Not sure how your data is organised, but you could try the function grep(). Assuming the data is a data frame and each row is a text, you could try:
words <- c("migrant", "migration", "asylum")
pattern <- paste(words, collapse = "|")      # one regex that matches any of the words
df[grep(pattern, df$text), ]                 # lines containing at least one of the words
df[grep(pattern, df$text, invert = TRUE), ]  # lines containing none of the words
Probably though, your data is not structured like this! You should explain better what your data looks like.

R xml2: How to query only corresponding XML nodes

I'm trying to read and transform many XML files into R data frames (or preferably tibbles).
Unfortunately, all the R packages I've tried (XML, flatxml, xmlconvert) failed when I tried to convert the files using their built-in functions (e.g. xmlToDataFrame from the XML package and xml_to_df from the xmlconvert package), so I have to do it manually with xml2.
Here is my question with a small working example:
# Minimal Working Example
library(tidyverse)
library(xml2)
interimxml <- read_xml("<Subdivision>
                          <Name>Charles</Name>
                          <Salary>100</Salary>
                          <Name>Laura</Name>
                          <Name>Steve</Name>
                          <Salary>200</Salary>
                        </Subdivision>")
names <- xml_text(xml_find_all(interimxml ,"//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml ,"//Subdivision/Salary"))
names
salary
# combine into a tibble (doesn't work because of unequal vector lengths)
result <- tibble(names = names,
                 salary = salary)
result
rbind(names, salary)
From the (made-up) XML file you can see that Charles earns 100 dollars, Laura earns nothing (because of the missing entry; here is the problem) and Steve earns 200 dollars.
What I want xml2 to do, when querying the name and salary nodes, is to return NA (or zero, which would also be okay) when it finds a name but no corresponding salary entry, so that I end up with a nice table like this:
Name     Salary
Charles  100
Laura    NA
Steve    200
I know that I could modify the "xpath" to only pick up the last value (for Steve), which wouldn't help me, since (in the real data) it could also be the 100th or the 23rd person with missing salary information.
[I'm aware that the salary numbers are pulled as character values from the XML file. I would mutate(across(salary, as.double)) over the columns afterwards.]
Any help is highly appreciated. Thank you very much in advance.
You need to be a bit more careful to match up the names and salaries. Basically first find all the <Name> nodes, then check only if their next sibling is a <Salary> node. If not, then return NA.
nameNodes <- xml_find_all(interimxml, "//Subdivision/Name")
names <- xml_text(nameNodes)
# map_chr() comes from purrr (already attached above via library(tidyverse))
salary <- map_chr(nameNodes,
                  ~ xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))
tibble::tibble(names, salary)
# names salary
# <chr> <chr>
# 1 Charles 100
# 2 Laura NA
# 3 Steve 200
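And, as the question already notes, the salary column can then be converted to numeric. A minimal sketch, assuming dplyr is available (it is attached with the tidyverse above):
library(dplyr)
result <- tibble::tibble(names, salary) %>%
  mutate(salary = as.numeric(salary))  # "100"/"200" become numbers, the missing salary stays NA
result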

How to convert a PDF listing the world's ministers and cabinet members by country to a .csv in R

The CIA publishes a list of world leaders and cabinet ministers for all countries multiple times a year. This information is in PDF form.
I want to convert this PDF to CSV using R and then separate and tidy the data.
I am getting the PDF from "https://www.cia.gov/library/publications/resources/world-leaders-1/"
under the link 'PDF Version for Prior Years' located at the center right hand side of the page.
Each PDF has some introductory pages and then lists the leaders and ministers for each country,
with each 'Title' and 'Name' separated by a '..........' of varying length.
I have tried to use the pdftools package to convert from PDF, but I am not quite sure how to deal with the format of the data for sorting and tidying.
Here are the first steps I have taken with a downloaded PDF:
library(pdftools)
text <- pdf_text("Data/April2006ChiefsDirectory.pdf")
test <- as.data.frame(text)
Starting with a single PDF, I want to list each minister in a separate row, with individual columns for year, country, title and name.
With the steps I have taken so far, converting the PDF to .csv without any additional tidying, the data is in a single column and each row is a string of text containing the title and name for multiple countries.
I am a novice at data tidying; any help would be much appreciated.
You can do it with tabulizer, but it is going to require some work to clean it up if you want to import all 240 pages of the document.
Here I import page 4, which is the first page with information about the leaders:
library(tabulizer)
mw_table <- extract_tables(
  "https://www.cia.gov/library/publications/resources/world-leaders-1/pdfs/2019/January2019ChiefsDirectory.pdf",
  output = "data.frame",
  pages = 4,
  area = list(c(35.68168, 40.88842, 740.97853, 497.74737)),
  guess = FALSE
)
head(mw_table[[1]])
#> X Afghanistan
#> 1 Last Updated: 20 Dec 2017
#> 2 Pres. Ashraf GHANI
#> 3 CEO Abdullah ABDULLAH, Dr.
#> 4 First Vice Pres. Abdul Rashid DOSTAM
#> 5 Second Vice Pres. Sarwar DANESH
#> 6 First Deputy CEO Khyal Mohammad KHAN
You can pass a vector of the pages you want to import as the pages argument. Consider that you will have all the country names buried among the people's names in the second column. You can probably work out a method to identify the country rows by looking for the empty "" occurrences in the first column, as sketched below.
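A hedged sketch of that last idea, assuming the first column of the returned data frame holds the titles (empty on country-header rows) and the second holds the names; the real column names will differ, and the very first country may sit in the data frame's header, as in the output above:
df <- mw_table[[1]]
# rows whose title column is empty mark the start of a new country block
country_rows <- which(trimws(df[[1]]) == "")
# record the country name on those rows, then carry it down to the
# minister rows below (zoo::na.locf = last observation carried forward)
df$country <- NA_character_
df$country[country_rows] <- df[[2]][country_rows]
df$country <- zoo::na.locf(df$country, na.rm = FALSE)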

How to extract journal names from a PubMed search

I am trying to perform a search for a specific author.
I can look them up, but I don't know how to extract the citations or plot the journals they have published papers in.
library(RISmed)
#now let's look up this author
res <- EUtilsSummary('Gene Myers', type='esearch', db='pubmed')
summary(res)
The first thing to notice is that what you already produced contains the PubMed IDs
for the papers that match your query.
res@PMID
[1] "30481296" "29335514" "26102528" "25333104" "23541733" "22743769"
[7] "21685076" "20937014" "20122179" "19447790" "12804086" "12061009"
Knowing the IDs, you can retrieve detailed information on all of them
using EUtilsGet
res2 = EUtilsGet(res@PMID)
Now we can get the items required for a citation from res2.
ArticleTitle(res2) ## Article Titles
Title(res2) ## Publication Names
YearPubmed(res2) ## Year of publication
Volume(res2) ## Volume
Issue(res2) ## Issue number
Author(res2) ## Lists of Authors
There is much more information embedded in the res2 object.
If you look at the help page ?Medline, you can get a good idea
of the other information.
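Putting those accessors together into one citation table is then straightforward. A sketch (Author() is left out because it returns a list of data frames, one per paper):
citations <- data.frame(
  title   = ArticleTitle(res2),
  journal = Title(res2),
  year    = YearPubmed(res2),
  volume  = Volume(res2),
  issue   = Issue(res2),
  stringsAsFactors = FALSE
)
head(citations)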
When you retrieve the detailed information for the selected articles using EUtilsGet, the journal name is stored as an ISO-abbreviated title.
library(RISmed)
#now let's look up this author
res <- EUtilsSummary('Gene Myers', type='esearch', db='pubmed')
summary(res)
res2 = EUtilsGet(res, db = "pubmed")
sort(table(res2@ISOAbbreviation), decreasing = T)[1:5] ##Top 5 journals
Gigascience Bioinformatics J Comput Biol BMC Bioinformatics Curr Biol
3 2 2 1 1
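Since the question also asked about plotting the journals an author publishes in, the same table can be fed straight into a base-R bar chart. A minimal sketch:
journal_counts <- sort(table(res2@ISOAbbreviation), decreasing = TRUE)
par(mar = c(4, 12, 2, 1))  # widen the left margin so the journal names fit
barplot(rev(journal_counts), horiz = TRUE, las = 1,
        xlab = "Number of papers")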

Text summarization in R language

I have a long text file and, using R, I want to summarize the text in 10 to 20 lines or in a few short sentences.
How can I summarize a text in about 10 lines with R?
You may try this (from the LSAfun package):
genericSummary(D, k = 1)
whereby 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation).
For more information:
http://search.r-project.org/library/LSAfun/html/genericSummary.html
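For a self-contained run, here is a minimal sketch; the text below is only a placeholder, and in practice D would come from your file, e.g. D <- paste(readLines("mytext.txt"), collapse = " "):
library(LSAfun)
# placeholder document with three sentences
D <- paste("Text summarization picks the most representative sentences from a document.",
           "The LSAfun package ranks candidate sentences in a latent semantic space.",
           "genericSummary then returns the k highest-ranked sentences as the summary.")
genericSummary(D, k = 2)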
There's a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough on how to use it, but here is a quick example so you can test it yourself in R:
#load needed packages
library(xml2)
library(rvest)
library(lexRankr)
#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"
#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))
#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          #only 1 article; repeat same docId for all of input vector
                          docId = rep(1, length(page_text)),
                          #return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."
