I am trying to scrape the text of a speech by a well-known person from the web. First I open the page, view its source, and paste its URL into R so it can be read. After reading it in, I viewed the data. Here is how it looks:
Name    Type                                   Value
<html>  list [2] (S3: xml_document, xml_node)  List of length 2
This was the code:
library(rvest)
list <- read_html("https://linkoftext/")  # note: "list" masks base::list; a different name would be safer
How should I deal with it?
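An xml_document is not meant to be inspected as a list; you query it with selectors. A minimal sketch of the usual rvest workflow, assuming the speech sits in <p> tags (the real selector depends on the page):

library(rvest)

page <- read_html("https://linkoftext/")

# Pull the text of every paragraph node; swap "p" for a more specific
# CSS selector once you know where the speech sits in the page.
speech <- page %>% html_nodes("p") %>% html_text()
head(speech)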
Hi, first of all, thanks for the help. I would like to know if there's a way to extract specific data that sits in the same place on every page of an editable PDF file.
The file (modified to comply with privacy concerns) contains a series of payroll receipts; all pages share the same format and layout. I would like to extract only the SSN (No. IMSS) of each employee and put them in a data frame. I have searched for how to do this, but I have only found cases where the data is not properly structured. Since in this file all pages are exactly alike, I would like to know if there's a less troublesome way.
Using pdftools and the steps below I was able to isolate the data I wanted (located on line 9 of the page text), but only from an individual page. I would like to know if it's possible to write a command that works for all pages. Thank you.
library(pdftools)
test <- pdf_text("pruebas.pdf")   # one character string per page
orden <- strsplit(test, "\r\n")   # split each page into its lines
required <- orden[[1]]            # lines of the first page only
nss <- required[9]                # the No. IMSS sits on line 9
result <- as.data.frame(nss)
This is a text-parsing task, and there are several ways to do it. Perhaps the quickest is to split each page's text at every "No. IMSS:", take the second fragment, split the result at the line break, and take the first fragment. The code isn't pretty, but it works:
sapply(strsplit(sapply(strsplit(pdftools::pdf_text("pruebas.pdf"),
"No\\. IMSS: +"), `[`, 2), "\r"), `[`, 1)
#> [1] "12-34-56-7895-5" "12-34-56-7895-9" "12-34-56-7895-7" "12-34-56-7895-1"
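The same pipeline reads more clearly unrolled into steps; a sketch assuming, as above, that the "No. IMSS:" label appears exactly once per page of pruebas.pdf:

library(pdftools)

pages <- pdf_text("pruebas.pdf")                    # one string per page
after_label <- sapply(strsplit(pages, "No\\. IMSS: +"), `[`, 2)
nss <- sapply(strsplit(after_label, "\r"), `[`, 1)  # keep text up to the line break
result <- data.frame(nss, stringsAsFactors = FALSE)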
I'm trying, in R, to get the list of tickers for every exchange covered by Quandl.
There are two ways:
1) For every exchange they provide a zipped CSV with all tickers. The URL looks like this (XXXXXXXXXXXXXXXXXXXX is the API key, YYY the exchange code):
https://www.quandl.com/api/v3/databases/YYY/codes?api_key=XXXXXXXXXXXXXXXXXXXX
This looks pretty promising, but I was not able to read the file with read.table or, e.g., fread, and I don't know why. Is it because of the API key? read.table is supposed to read zip files without a problem.
2) I was able to get further with the second way. They provide a URL to a CSV of tickers, e.g.:
https://www.quandl.com/api/v3/datasets.csv?database_code=YYY&per_page=100&sort_by=id&page=1&api_key=XXXXXXXXXXXXXXXXXXXX
As you can see, the URL contains a page number. The problem is that they only mention in the text below that you need to request this URL many times (e.g. 56 times for the LSE) to get the full list. I was able to do it like this:
pages <- 1:100                  # "100" is taken just to be big enough
Source <- c("LSE", "FSE")       # ... vector of exchange codes
QUANDL_API_KEY <- "XXXXXXXXXXXXXXXXXXXXXXXXXX"

# One URL per exchange/page combination (sprintf() alone would recycle
# the two vectors against each other); the format string must stay on
# one line, since a string broken across lines keeps the embedded newline.
combos <- expand.grid(code = Source, page = pages, stringsAsFactors = FALSE)
urls <- sprintf(paste0(
  "https://www.quandl.com/api/v3/datasets.csv?",
  "database_code=%s&per_page=100&sort_by=id&page=%s&api_key=%s"),
  combos$code, combos$page, QUANDL_API_KEY)

TICKERS <- lapply(urls, fread, stringsAsFactors = FALSE)
TICKERS <- do.call(rbind, TICKERS)
The problem is that I just put 100 pages, and when R tries to fetch a non-existent page (e.g. #57) it throws an error and does not go any further. I was trying to do something like Excel's IFERROR (tryCatch in R), but failed.
Could you please give some hints?
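A minimal sketch of the tryCatch() idea: request pages until one fails or comes back empty, collecting the successful reads. fetch_exchange and max_pages are made-up names; the URL fields are those from the question.

library(data.table)

fetch_exchange <- function(code, api_key, max_pages = 100) {
  out <- list()
  for (p in seq_len(max_pages)) {
    url <- sprintf(paste0(
      "https://www.quandl.com/api/v3/datasets.csv?",
      "database_code=%s&per_page=100&sort_by=id&page=%d&api_key=%s"),
      code, p, api_key)
    page <- tryCatch(fread(url), error = function(e) NULL)
    if (is.null(page) || nrow(page) == 0) break  # ran past the last real page
    out[[p]] <- page
  }
  rbindlist(out)
}

TICKERS <- rbindlist(lapply(c("LSE", "FSE"), fetch_exchange,
                            api_key = QUANDL_API_KEY))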
I have just learnt about XML and intended to turn my data frame in R into XML so that it could be used for corpus analysis. I am not using conventional tagging labels because of the nature of the data. The output does look right in the R console, but when I opened the file in a browser to check whether the XML is right, nothing shows up... so does that mean my XML is not right?
Here is my R code:
comments$text<-gsub("&","&",comments$text)
comments$text<-gsub("<","<",comments$text)
comments$text<-gsub(">",">",comments$text)
comments$text<-gsub("\n"," <n/> ",comments$text)
top <- newXMLNode("CWII Step 1_1")
o <- newXMLNode("conversation", parent = top)
kids <- lapply(comments$text[comments$parent_group == comments$parent_group[3]],
               function(x) newXMLNode("u", x))
addChildren(o, kids)
saveXML(top, file = "CWII Step 1_1.xml")
This is what is shown in the R console:
<CWII Step 1_1>
<conversation>
<u>Hello!!! I am a person from India !!!!</u>
<u>Hi . When I think of your country India I i.</u>
</conversation>
</CWII Step 1_1>
It seems like XML to me, but unfortunately the .xml file does not show up when I open it in a browser, although it displays fine in Notepad with all the tags.
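The likely culprit is the element name: XML names may not contain spaces, so <CWII Step 1_1> is not well-formed, and browsers refuse to render malformed XML even though Notepad happily displays the raw text. A minimal sketch of the fix, assuming underscores are an acceptable substitute in the tag name:

library(XML)

# A legal XML name: no spaces, not starting with a digit.
top <- newXMLNode("CWII_Step_1_1")
o <- newXMLNode("conversation", parent = top)
# newXMLNode() escapes &, < and > in text automatically, so the manual
# gsub() entity replacements above are not needed.
newXMLNode("u", "Hello!!! I am a person from India !!!!", parent = o)
saveXML(top, file = "CWII_Step_1_1.xml")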
I am struggling to parse content out of HTML using htmlTreeParse and XPath.
Below is the web link from which I need to extract the information on the "most valuable brands" and create a data frame out of it:
http://www.forbes.com/powerful-brands/list/#tab:rank
As a first step towards building the table, I am trying to extract the list of brands (Apple, Google, Microsoft, etc.) with the code below:
library(XML)
library(RCurl)  # for getURL()

htmlContent <- getURL("http://www.forbes.com/powerful-brands/list/#tab:rank",
                      ssl.verifypeer = FALSE)
htmlParsed <- htmlTreeParse(htmlContent, useInternal = TRUE)
output <- xpathSApply(htmlParsed,
                      "/html/body/div/div/div/table[@id='the_list']/tbody/tr/td[@class='name']",
                      xmlValue)
But it returns NULL, and I am not able to find my mistake. "/html/body/div/div/div/table[@id='the_list']/thead/tr/th" works correctly, returning ("", "Rank", "Brand", etc.).
This means the path up to the table is correct, but I am not able to understand what's wrong after that.
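One thing worth checking (a guess, since this page is JavaScript-heavy): many sites ship only an empty table shell in the static HTML and fill in the rows client-side, in which case getURL never sees a tbody at all. A quick diagnostic sketch against the already-parsed document:

# Count the rows that actually exist in the fetched, non-rendered HTML.
rows <- xpathSApply(htmlParsed, "//table[@id='the_list']//tr", xmlValue)
length(rows)  # 0, or header rows only, means the data is added by JavaScript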
I am trying to download an XML file of journal article records and create a dataset for further interrogation in R. I'm completely new to XML and quite a novice at R. I cobbled together some code using bits of code from two sources:
GoogleScholarXScraper
and
Extracting records from pubMed
library(RCurl)
library(XML)
library(stringr)
#Search terms
SearchString<-"cancer+small+cell+non+lung+survival+plastic"
mySearch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",SearchString,"&usehistory=y",sep="",collapse=NULL)
#Search
pub.esearch<-getURL(mySearch)
#Extract QueryKey and WebEnv
pub.esearch<-xmlTreeParse(pub.esearch,asText=TRUE)
key<-as.numeric(xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["QueryKey"]]))
env<-xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["WebEnv"]])
#Fetch Records
myFetch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv=",env,"&retmode=xml&query_key=",key)
pub.efetch<-getURL(myFetch)
myxml<-xmlTreeParse(pub.efetch,asText=TRUE,useInternalNodes=TRUE)
#Create dataset of article characteristics  #This doesn't work: the four
#XPath queries return vectors of different lengths when nodes are missing
pub.data <- data.frame(
  journal     = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA", xmlValue),
  abstract    = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText", xmlValue),
  affiliation = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/Article/Affiliation", xmlValue),
  year        = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", xmlValue),
  stringsAsFactors = FALSE)
The main problem I seem to have is that my returned XML file is not completely uniformly structured. For example, some references have a node structure like this:
<Abstract>
  <AbstractText>The Wilms' tumor gene... </AbstractText>
whilst some have labels and look like this:
<Abstract>
  <AbstractText Label="BACKGROUND &amp; AIMS" NlmCategory="OBJECTIVE">Some background text.</AbstractText>
  <AbstractText Label="METHODS" NlmCategory="METHODS">Some text on methods.</AbstractText>
When I extract the 'AbstractText' I am hoping to get 24 rows of data back (there are 24 records when I run this made-up search today), but xpathSApply returns every labelled 'AbstractText' as an individual element of my data frame. Is there a way to collapse the XML structure in this instance, or to ignore the labels? Is there a way to make xpathSApply return NA when nothing is found at the end of a path? I am aware of xmlToDataFrame, which sounds like it should fit the bill, but whenever I try to use it, it doesn't seem to give me anything sensible.
Thanks for your help
I am not sure which one you want; however:
xpathSApply(myxml,"//*/AbstractText[#Label]")
will get the nodes with labels (keeping all attributes etc).
xpathSApply(myxml,"//*/AbstractText[not(#Label)]",xmlValue)
will get the nodes without labels.
EDIT:
test <- xpathApply(myxml, "//*/Abstract", xmlValue)
> length(test)
[1] 24
may give you what you want
EDIT:
To get affiliation, year, etc. padded with NAs:
# Return the value(s) at xpath xstr, or NA when nothing matches.
dumfun <- function(x, xstr) {
  res <- xpathSApply(x, xstr, xmlValue)
  if (length(res) == 0) {
    out <- NA
  } else {
    out <- res
  }
  out
}
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Affiliation')
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Journal/JournalIssue/PubDate/Year')