Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract full article content via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame with info on (at most) 100 articles. These, however, do not contain the full article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried reading the documentation, but I am not sure how.

I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to 260 characters. However, test$url holds the link to the source article, which you can use to scrape the full content; but since the articles are aggregated from many different sources, I don't think there is a single automated way to do this.
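As a rough starting point, though, here is a minimal sketch of scraping one of those URLs with rvest; the "p" selector is a generic guess, and every outlet structures its HTML differently, so it would need tuning per source:
library(rvest)

url <- test$url[1]                 # first article from the results above
page <- read_html(url)
full_text <- page %>%
  html_nodes("p") %>%              # crude: collect all paragraph nodes
  html_text() %>%
  paste(collapse = "\n")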

Related

Web Scraping with R: Truncation Issue [duplicate]

This question already has an answer here: avoid string printed to console getting truncated (in RStudio) (1 answer). Closed 6 years ago.
As a beginner, I am currently working on web scraping in R, using the 'rvest' package. My goal is to get the lyrics of any song from 'www.musixmatch.com'. This is my attempt:
library(rvest)
url <- "https://www.musixmatch.com/lyrics/Red-Hot-Chili-Peppers/Can-t-Stop"
musixmatch <- read_html(url)
lyrics <- musixmatch %>% html_nodes(".mxm-lyrics__content") %>% html_text()
This code creates a character vector 'lyrics' with 2 elements, containing the lyrics:
[1] "Can't stop addicted to the shindig\nChop top he says I'm gonna win big\nChoose not a life of imitation"
[2] "Distant cousin to the reservation\n\nDefunkt the pistol that you pay for\nThis punk the feeling that you stay for\nIn time I want to be your best friend\nEastside love is living on the Westend\n\nKnock out but boy you better come to\nDon't die you know the truth is some do\nGo write your message on the pavement\nBurn so bright I wonder what the wave meant\n\nWhite heat is screaming in the jungle\nComplete the motion if you stumble\nGo ask the dust for any answers\nCome back strong with 50 belly dancers\n\nThe world I love\nThe tears I drop\nTo be part of\nThe wave can't stop\nEver wonder if it's all for you\nThe world I love\nThe trains I hop\nTo be part of\nThe wave can't stop\n\nCome and tell me when it's time to\n\nSweetheart is bleeding in the snow cone\nSo smart she's leading me to ozone\nMusic the great communicator\nUse two sticks to make it in the nature\nI'll get you into penetration\nThe gender of a generation\nThe birth of every other nation\nWorth your weight the gold ... <truncated>
The problem is that the second element gets truncated at some point. From what I know about rvest, there is no parameter to adjust truncation, and I could not find anything on the internet about this issue. Does anybody know how to adjust or disable truncation here? Thanks a lot in advance!
Best regards,
Jan
I think it's easier to copy and paste the lyrics into Notepad or WordPad and save them as a .txt file.
Then use the readLines function. It prints out a warning message, but I was able to get the entire lyrics into an 84 x 1 character vector, which you can clean up or do whatever you please with.
words <- readLines("redhot.txt")
> head(words)
[1] "Can't stop addicted to the shindig"
[2] "Chop top he says I'm gonna win big"
[3] "Choose not a life of imitation"
[4] "Distant cousin to the reservation"
[5] "Defunkt the pistol that you pay for"
[6] "This punk the feeling that you stay for"
No truncation problem here.
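As the duplicate link above suggests, the truncation may only be in the console printing, not in the scraped string itself. A quick check, using the lyrics vector from the question:
nchar(lyrics[2])       # the full length, not just what the console shows
writeLines(lyrics[2])  # prints the whole string without truncation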

Can MeCab be configured / enhanced to give me the reading of English words too?

If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:
$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS
If I smash together everything I get from the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program and get output. Said program, however, doesn't handle English words.
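For the Japanese part, that "smash together the last column" step is straightforward to script. A rough sketch in R (the language used elsewhere on this page), assuming a mecab binary on the PATH and a UTF-8 locale:
out <- system2("mecab", input = "吾輩は猫である", stdout = TRUE)
out <- out[out != "EOS"]
feats <- strsplit(sub("^[^\t]+\t", "", out), ",")  # drop surface form, split features
# the pronunciation is the 9th feature field when MeCab knows the word
pron <- vapply(feats, function(f) if (length(f) >= 9) f[9] else "", "")
paste(pron, collapse = "")   # "ワガハイワネコデアル"
The English words are exactly the case where that 9th field is missing, which is the gap the questions below are about.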
If I throw English into MeCab, it manages to tokenise it (probably naively, at the spaces), but gives no readings:
$ echo "I am a cat" | mecab
I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS
I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
I have already scoured the web for solutions, and while I did find a bunch of websites whose transliteration appears adequate, I can't find any way to do it in my own code. In a couple of cases I emailed the site authors and have had no response after waiting a few weeks. (Just how far behind on their inboxes are these people?)
There are a number of directions I could go, but I have hit dead ends on all of them so far, so this is my compound question:
MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat?
Is there some other library or tool that can take English and spit out Katakana?
Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
As an aside, I find that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too, or at least its DLL and dictionary files are included in the install. It also uses another library, CaboCha, which as far as I can tell by running it does the exact same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code could be in the proprietary AITalk library they use. More research is needed, and I haven't yet figured out how to run either tool against their dictionaries to test this directly.

Can R download text from inside this sort of web app?

I am trying to download text, such as company profiles, from this website:
http://www.evca.eu/about-evca/members/member-search/#lsearch
In the past I have had good success with similar tasks using, for example, the XML package, but that won't work here because the data I am trying to get at is loaded by some sort of dynamic web app, and the single elements in the list don't have their own URLs.
Unfortunately I don't know much about web design, so I am not really sure how to address this. Any suggestions? It would really suck to do this manually. Thanks.
First, download Fiddler Web Debugger or some similar tool. It places itself between your browser and the web server, so you can see what is going on (including dynamic/AJAX communication).
Run it, go to the website you are trying to understand, and perform the actions you want to automate.
For example, if you open http://www.evca.eu/about-evca/members/member-search/#lsearch, enter "a" in the search box and then choose "All" (to get all results), you will see in Fiddler that the browser opens the URL http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999 and sends the body "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest".
You can do the same with R: parse the result, get some text, and maybe some links to other things.
The R code below does the same as described above:
library(httr)
library(stringr)

# replay the AJAX request the search form sends
r <- POST("http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999",
          body = "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest")
stop_for_status(r)
txt <- content(r, "text")

# grab every "Full company details ... </h2>" fragment
matches <- str_match_all(txt, "Full company details.*?</h2>")

# remove some rubbish (markup, tabs, newlines) from the matches
companies <- gsub("(Full company details)|\t|\n|\r|<[^>]+>", "", matches[[1]])

# remove leading spaces
companies <- gsub("^[ ]+", "", companies)
Result:
> length(companies)
[1] 1148
> head(companies)
[,1]
[1,] "350 Investment Partners"
[2,] "350 Investment Partners LLP"
[3,] "360° Capital Management SA"
[4,] "360° Capital Partners France - Advisory Company"
[5,] "360° Capital Partners Italia - Advisory Company"
[6,] "3i Deutschland Gesellschaft für Industriebeteiligungen mbH"

R webcorpus attribute extraction

I am using tm.plugin.webmining to get the latest news about a company, say Microsoft, using the following command:
corpus <- WebCorpus(GoogleBlogSearchSource(stock))
When I run meta(corpus[[1]]) I get:
Metadata:
author : character(0)
datetimestamp: 2014-07-17 20:28:10
description : Microsoft Layoffs – What it Means for MSFT StockInvestorplace.comWhile the layoffs are obviously
going to be hardest on the workers, as investors we still have to take
a rational and objective look at the corporation to see what it means
for MSFT – particularly if you are personally a
Microsoft stock holder ...Why Microsoft (MSFT) Stock Is Up
TodayTheStreet.comEarnings Preview: Microsoft Corporation (MSFT),
Apple Inc (AAPL), Facebook ...International Business TimesWhat Do
Microsoft's Layoff Plans Tell Us About Satya Nadella's Vision?Motley
FoolTech Insider -Insider Monkey (blog)all 2,176 news articles »
heading : Microsoft Layoffs – What it Means for MSFT Stock - Investorplace.com
id : tag:news.google.com,2005:cluster=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
language : character(0)
origin : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEadqFvThyxvJU3O5uHa6wiyoWNEw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778559643673&ei=Cr3LU8jGNMnNkwX_lYCICQ&url=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
So I can see that the different attributes are there, but when I run
Headers <- sapply(corpus, FUN = function(x) attr(x, "heading"))
Headers is a list of 100 items, all with NULL values. I am pretty sure this particular code was running a few days back. What changed in between: I reinstalled the packages on a new system and updated R from 3.1.0 to 3.1.1.
What can I do to get separate lists of the headings, descriptions, timestamps, etc., which I later want to combine into a 100 x 3 data frame?
With the newest R, please try the following code:
headers <- meta(corpus, tag = "heading")
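To get to the 100 x 3 data frame mentioned in the question, the same call can be repeated per tag and the results bound together. A sketch, assuming every article carries all three tags:
headings <- unlist(meta(corpus, tag = "heading"))
descriptions <- unlist(meta(corpus, tag = "description"))
# do.call(c, ...) keeps the POSIXct class that unlist() would drop
timestamps <- do.call(c, meta(corpus, tag = "datetimestamp"))
news <- data.frame(heading = headings,
                   description = descriptions,
                   timestamp = timestamps,
                   stringsAsFactors = FALSE)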

How to control the echo width using Sweave

I have a problem with the width of echoed output in Sweave: I have a list containing a large amount of text, and the echoed response from R runs off the page in the PDF. I have tried using
<<>>=
options(width=40)
@
but this has not changed anything.
An example: set up the list (not shown in the LaTeX output).
<<echo=FALSE>>=
my_list <- list(example="Site location was fixed using a Silvia Navigator handheld GPS in October 2003. Point of reference used was the station Bench Mark. If the bench mark location was remote from the site then the point of reference used was changed to the 0-1 metre gauge. Bench Mark location was then recorded as a separate entry in the Site History section [but not used as the site location].\r\nFor a Station location map and all digital photograph's of the station, river reach, and site details see H:\\hyd\\dat\\doc. For non digital photo's taken prior to October 2003 please see the relevant station file at Tumut office.")
@
And show the entry of the list.
<<>>=
my_list
@
Is there any way that I can get this to work without having to break up the list with cat statements?
You can use capture.output() to capture the printed representation of the list and then use writeLines() and strwrap() to display this output, nicely wrapped. As capture.output() returns a vector of strings containing the printed representation of the object, we can cat each of them to the screen/page but wrapped using strwrap(). The benefit of this approach is that the result looks like it was printed by R. Here's the solution:
writeLines(strwrap(capture.output(my_list)))
which produces:
$example
[1] "Site location was fixed using a Silvia Navigator
handheld GPS in October 2003. Point of reference used
was the station Bench Mark. If the bench mark location
was remote from the site then the point of reference used
was changed to the 0-1 metre gauge. Bench Mark location
was then recorded as a separate entry in the Site History
section [but not used as the site location].\r\nFor a
Station location map and all digital photograph's of the
station, river reach, and site details see
H:\\hyd\\dat\\doc. For non digital photo's taken prior
to October 2003 please see the relevant station file at
Tumut office."
From a 2010 posting to rhelp by Mark Schwartz:
cat(paste(strwrap(x, width = 70), collapse = "\\\\\n"), "\n")
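If the default wrap (90% of getOption("width")) is still too wide for the page, strwrap() also takes a width argument directly, which gives explicit control over the wrapping inside the chunk:
writeLines(strwrap(capture.output(my_list), width = 40))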
