R webcorpus attribute extraction

I am using the tm.plugin.webmining package to get the latest news about a company, say Microsoft, with the following command:
corpus<-WebCorpus(GoogleBlogSearchSource(stock))
When I run meta(corpus[[1]]) I get:
Metadata:
author : character(0)
datetimestamp: 2014-07-17 20:28:10
description : Microsoft Layoffs – What it Means for MSFT StockInvestorplace.comWhile the layoffs are obviously
going to be hardest on the workers, as investors we still have to take
a rational and objective look at the corporation to see what it means
for MSFT – particularly if you are personally a
Microsoft stock holder ...Why Microsoft (MSFT) Stock Is Up
TodayTheStreet.comEarnings Preview: Microsoft Corporation (MSFT),
Apple Inc (AAPL), Facebook ...International Business TimesWhat Do
Microsoft's Layoff Plans Tell Us About Satya Nadella's Vision?Motley
FoolTech Insider -Insider Monkey (blog)all 2,176 news articles »
heading : Microsoft Layoffs – What it Means for MSFT Stock - Investorplace.com
id : tag:news.google.com,2005:cluster=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
language : character(0)
origin : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEadqFvThyxvJU3O5uHa6wiyoWNEw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778559643673&ei=Cr3LU8jGNMnNkwX_lYCICQ&url=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
So the different attributes are clearly there, but when I run
Headers <- sapply(meta(corpus), FUN = function(x){ attr(x, "heading") })
Headers is a list of 100 items, all NULL. I am pretty sure this particular code was running a few days back. What changed in between: I reinstalled the packages on a new system and also updated R to 3.1.1 from 3.1.0.
What can I do to get separate lists of headings, descriptions, timestamps, etc., which I later want to combine into a 100x3 data frame?

With the newest R, please try the following code:
headers <- meta(corpus, tag = "heading")
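If the goal is the 100x3 data frame mentioned in the question, a rough sketch along these lines should work (it assumes every document in the corpus carries all three fields):
# Sketch: pull each metadata field per document and bind into a data frame
headings <- sapply(corpus, function(d) as.character(meta(d, "heading")))
descriptions <- sapply(corpus, function(d) as.character(meta(d, "description")))
timestamps <- sapply(corpus, function(d) as.character(meta(d, "datetimestamp")))
news <- data.frame(heading = headings, description = descriptions,
                   timestamp = timestamps, stringsAsFactors = FALSE)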

Related

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract full article content via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame with info on (at most) 100 articles. However, these do not contain the actual full article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation, but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So all the content is restricted to 260 characters. However, test$url has the link to the source article, which you could use to scrape the entire content; but since the results are aggregated from various sources, I don't think there is one automated way to do this.
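If you do want to follow those URLs, a minimal sketch with rvest might look like this; the "p" selector is an assumption and will usually need tuning for each site:
# Sketch: fetch one source article and collapse its <p> nodes into text.
# The "p" selector is a guess; every news site needs its own selector.
library(rvest)
page <- read_html(test$url[1])
full_text <- paste(html_text(html_nodes(page, "p")), collapse = "\n")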

Can MeCab be configured / enhanced to give me the reading of English words too?

If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:
$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS
If I smash together everything I get from the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program and get output. Said program, however, doesn't handle English words.
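(That concatenation step itself is easy to script; a rough sketch in R, assuming mecab is on the PATH with the default IPA dictionary:)
# Sketch: run mecab and join the last field (the pronunciation) of each token
out <- system2("mecab", input = "吾輩は猫である", stdout = TRUE)
tokens <- out[out != "EOS"]
fields <- strsplit(sub("\t", ",", tokens), ",", fixed = TRUE)
reading <- paste(sapply(fields, function(f) f[length(f)]), collapse = "")
# reading is now "ワガハイワネコデアル"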
If I throw English into MeCab, it manages to tokenise it (probably naively, at the spaces), but gives no readings:
$ echo "I am a cat" | mecab
I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS
I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
I have already scoured the web for solutions, and while I found a bunch of websites offering transliteration that appears adequate, I can't find any way to do it in my own code. In a couple of cases I emailed the site authors, and after a few weeks of waiting I still have no response. (Just how far behind on their inboxes are these people?)
There are a number of directions I could go, but I have hit dead ends on all of them so far, so this is my compound question:
MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat?
Is there some other library or tool that can take English and spit out Katakana?
Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
As an aside, I find that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too, or at least its DLL and dictionary files are included in the install. It also uses another library, CaboCha, which as far as I can tell by running it does the exact same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code could live in the proprietary AITalk library they use. More research is needed, and I haven't yet figured out how to run either tool against their dictionaries to test this directly.

How to connect data dictionaries to the unlabeled data

I'm working with some large government datasets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is a 670 MB file of unlabeled data (when unzipped) and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014
FIELDS:
=======
Field#  Name    Type/Size  Description
------  ------  ---------  --------------------------------------
1       CMPLID  CHAR(9)    NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
                           IS AN UPDATEABLE FIELD, THUS DATA FOR A
                           GIVEN RECORD POTENTIALLY COULD CHANGE FROM
                           ONE DATA OUTPUT FILE TO THE NEXT.
2       ODINO   CHAR(9)    NHTSA'S INTERNAL REFERENCE NUMBER.
                           THIS NUMBER MAY BE REPEATED FOR
                           MULTIPLE COMPONENTS.
                           ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
                           THIS NUMBER MAY BE REPEATED FOR MULTIPLE
                           PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21 CMPL_TYPE CHAR(4) SOURCE OF COMPLAINT CODE:
CAG =CONSUMER ACTION GROUP
CON =FORWARDED FROM A CONGRESSIONAL OFFICE
DP =DEFECT PETITION,RESULT OF A DEFECT PETITION
EVOQ =HOTLINE VOQ
EWR =EARLY WARNING REPORTING
INS =INSURANCE COMPANY
IVOQ =NHTSA WEB SITE
LETR =CONSUMER LETTER
MAVQ =NHTSA MOBILE APP
MIVQ =NHTSA MOBILE APP
MVOQ =OPTICAL MARKED VOQ
RC =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
RP =RECALL PETITION,RESULT OF A RECALL PETITION
SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
VOQ =NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: is this data dictionary in a standard format of some kind? I've tried Googling around, but it's hard to do without the right terminology. I would like to import it into R, though I'm flexible as long as it can be done programmatically.
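For concreteness, the brute-force alternative would be to transcribe the field names by hand and read the flat file directly; a rough sketch (FLAT_CMPL.txt and the truncated vectors are placeholders for illustration):
# Sketch: hand-typed names from the dictionary, plus a lookup vector to
# decode one coded field. File name and vectors are placeholders.
fields <- c("CMPLID", "ODINO")  # ...continued through all fields
complaints <- read.delim("FLAT_CMPL.txt", header = FALSE,
                         col.names = fields, quote = "",
                         stringsAsFactors = FALSE)
cmpl_type <- c(CAG = "CONSUMER ACTION GROUP",
               CON = "FORWARDED FROM A CONGRESSIONAL OFFICE")  # ...etc.
complaints$CMPL_TYPE_DESC <- cmpl_type[complaints$CMPL_TYPE]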

Upgraded quantstrat from 0.7.7 to 0.7.8 and old code no longer works

I have upgraded the quantstrat package from 0.7.7 (installed on Jan 7th, 2013) to 0.7.8, but my old code no longer works properly. It looks like we cannot place any entry orders, neither buy nor sell; only exit orders are executed. Here are the details. Does anyone know of major changes in add.rule or applyStrategy, or has the same issue been reported?
We set up the trading rules with add.rule():
add.rule(f,'ruleSignal',arguments=list(sigcol="DoSell",sigval=TRUE,orderqty=(-1*tradeSize),osFUN='osSUS',ordertype='market',TxnFees="calcTxnFee",prefer='Open'),type='enter',label=gExitLabel)
add.rule(f,'ruleSignal',arguments=list(sigcol="DoBuy", sigval=TRUE,orderqty=tradeSize,osFUN='osBuy',ordertype='market',TxnFees="calcTxnFee",prefer='Price'),type='enter',label=gEnterLabel)
add.rule(f,'ruleSignal',arguments=list(sigcol="DoStop", sigval=TRUE,orderqty=-1*tradeSize,osFUN='osStop',ordertype='stoplimit',threshold='StopLevel',TxnFees="calcTxnFee"),type='risk',label='Stop')
The problem is that we get no entry orders when we run applyStrategy. applyStrategy(rs, rs) only applied the sell signal (not buy):
[1] "2010-11-18 09:00:00 ABC -65660 # 4.6"
[1] "2010-12-07 09:00:00 ABC -37509 # 5.17"
However, getOrderBook() shows that both "Buy" and "Sell" orders were recorded in the order book; at the same time, order.prices were set to 0, order.status to "replaced", and Prefer to "Price" by the system.
It is hard to understand what your problem is exactly: "no entry signal", "orderbook has Buy and Sell" (whatever that means), "order.prices were set to 0"...
I see that you are using your own order sizing functions; could it have something to do with that? You could try dropping the order sizing functions for a test, just to check whether your entry orders are then executed, as in the sketch below.
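For instance, a sketch of that test on your buy rule, with the custom osFUN dropped so that ruleSignal falls back to the plain orderqty:
# Sketch: the entry rule from the question without osFUN, to test whether
# the custom order sizing function is what blocks the entries.
add.rule(f, 'ruleSignal',
         arguments = list(sigcol = "DoBuy", sigval = TRUE,
                          orderqty = tradeSize, ordertype = 'market',
                          TxnFees = "calcTxnFee", prefer = 'Price'),
         type = 'enter', label = gEnterLabel)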
Otherwise I suggest that you provide a complete example so I can run it and check.
Please be aware that quantstrat is under heavy development and that the code is patched almost on a daily basis, although the version number may not always be bumped up. So make sure that you always download the latest code.
HTH,
Jan Humme.

Trying to get the price of products with RCurl

I'm scraping the prices of some products from a website. In Python I used urllib2 without problems, but when I tried RCurl in R I couldn't download the source code.
I build each product's URL by pasting the base path together with the product code, then pick the price out of the page source. The path of a product is: http://www.americanas.com.br/produto/code_of_product.
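In R that construction would be something like the following (product_code is a stand-in for an actual code):
# Build a product URL from the base path and a product code
url <- paste0("http://www.americanas.com.br/produto/", product_code)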
The trouble is that I can't download a product's source code with RCurl at all. When I try, for example, getURL('http://www.americanas.com.br/produto/111467594'), it returns "".
I tried getURL('.../produtos/111467594') and could download that source, but that way I'm unable to get the price. :(
Does anyone know how I could get the price of the products?
Thanks.
PS: Sorry for my bad English. :)
Welcome to Stack Overflow.
It's hard for me to say why it doesn't work; could you include verbose = TRUE in the getURL() call? (There is a sketch of that at the end of this answer.) Also, I notice there are different prices on the webpage you linked. Do you want all of them or just the first? How about this to get the "Por" price:
library("stringr")
productwebpage<-readLines("http://www.americanas.com.br/produto/111467594")
pricerow<-productwebpage[grep("p class=\"sale price\"",productwebpage)]
price<-str_extract_all(pricerow,"\\(?[0-9,.]+\\)?")[[1]]
You could also substitute the grep("p class=\"sale price\"", productwebpage) with either grep("<p><span class=\"regular price\">", productwebpage) (to get the "de" price, i.e. the old price) or grep("<span class=\"p-v interest\">", productwebpage) (which gives you the "sem juros" price, the per-month payment). For the last one you get the number of months first and the payment after, so it will be:
> price
[1] "12" "83,25"
This should hopefully work for other products as well (I just tried five, and it seemed to work for all of them).
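As for the original getURL() failure mentioned at the top, here is a sketch of the verbose call; followlocation and the browser-like user agent are guesses at the usual causes of an empty result:
# Sketch: diagnose the empty getURL() result. verbose prints the HTTP
# exchange; redirects and a missing user agent often explain a "" response.
library(RCurl)
html <- getURL("http://www.americanas.com.br/produto/111467594",
               verbose = TRUE, followlocation = TRUE,
               useragent = "Mozilla/5.0")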
