Trying to get the price of products with RCurl - r

I'm scraping the prices of some products from a website. In Python I used urllib2 without problems, but when I tried RCurl in R I couldn't download the source code.
I have to build the URL from the product code and then extract the price from the page source. The path of a product is: http://www.americanas.com.br/produto/code_of_product.
The problem is that I can't download the source code of a product with RCurl. For example, getURL('http://www.americanas.com.br/produto/111467594') returns "".
I tried getURL('.../produtos/111467594') and could download the source, but from that page I'm unable to get the price. :(
Does anyone know how I could get the price of the products?
Thanks.
P.S.: Sorry for my bad English. :)

Welcome to Stack Overflow.
It's hard for me to say why it doesn't work. Could you add verbose=TRUE to the getURL call? Also, I notice there are different prices on the webpage you linked. Do you want all of them or just the first? How about this to get the "Por" price:
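library("stringr")
# Read the product page line by line
productwebpage <- readLines("http://www.americanas.com.br/produto/111467594")
# Keep the line(s) holding the sale ("Por") price
pricerow <- productwebpage[grep("p class=\"sale price\"", productwebpage)]
# Extract the numeric tokens from the first matching line
price <- str_extract_all(pricerow, "\\(?[0-9,.]+\\)?")[[1]]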
You could also replace grep("p class=\"sale price\"",productwebpage) with either grep("<p><span class=\"regular price\">",productwebpage) (to get the "De" price, i.e. the old price) or grep("<span class=\"p-v interest\">",productwebpage) (which gives you the "sem juros" price, i.e. the interest-free monthly payment). In the last case you get the number of months first and the payment second, so the result will be:
> price
[1] "12" "83,25"
This should hopefully work for other products as well (I tried five and it worked for all of them).
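Once the tokens are extracted they are still strings in Brazilian number format. The following is a minimal base-R sketch of turning them into numbers. It runs on a made-up HTML line instead of the live page, and both the sample line and the helper name to_number are assumptions for illustration, not something taken from the site.

```r
# Hypothetical HTML line standing in for one row of the downloaded page
pricerow <- '<span class="p-v interest">12x de R$ 83,25 sem juros</span>'

# Base R counterpart of str_extract_all: every run of digits, dots and commas
tokens <- regmatches(pricerow, gregexpr("[0-9][0-9.,]*", pricerow))[[1]]

# Hypothetical helper: convert a Brazilian-formatted number ("1.299,90")
# by dropping thousands separators and swapping the decimal comma for a dot
to_number <- function(x) {
  as.numeric(sub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

months  <- as.integer(tokens[1])  # number of instalments
payment <- to_number(tokens[2])   # value of each instalment
```

The same to_number helper could be applied to the "Por" and "De" prices before doing any arithmetic on them.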

Related

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract entire article content via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame with information on (at most) 100 articles. However, these do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So all the content is restricted to only 260 characters. However, test$url holds the link to the source article, which you can use to scrape the entire content. Since the articles are aggregated from various sources, I don't think there is one automated way to do this.
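Before scraping the source URLs, it may help to know which rows were actually cut off. Here is a small base-R sketch that reads the trailing "[+N chars]" marker; the sample string is shortened by hand and the variable names are only illustrative.

```r
# Hypothetical content value, shaped like the truncated text shown above
content <- "Tensions between China and the U.S. ratcheted up [+5173 chars]"

# Find the trailing "[+N chars]" marker, if there is one
marker <- regmatches(content, regexpr("\\[\\+[0-9]+ chars\\]$", content))

# Number of characters the API left out (0 when no marker is present)
missing_chars <- if (length(marker) == 1) {
  as.integer(gsub("[^0-9]", "", marker))
} else {
  0L
}
```

Rows with missing_chars greater than zero are the ones whose full text would have to come from the linked article.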

R webcorpus attribute extraction

I am using the tm.plugin.webmining package to get the latest news about a company, say Microsoft, with the following command:
corpus <- WebCorpus(GoogleBlogSearchSource(stock))
When I run meta(corpus[[1]]) I get:
Metadata:
author : character(0)
datetimestamp: 2014-07-17 20:28:10
description : Microsoft Layoffs – What it Means for MSFT StockInvestorplace.comWhile the layoffs are obviously
going to be hardest on the workers, as investors we still have to take
a rational and objective look at the corporation to see what it means
for MSFT – particularly if you are personally a
Microsoft stock holder ...Why Microsoft (MSFT) Stock Is Up
TodayTheStreet.comEarnings Preview: Microsoft Corporation (MSFT),
Apple Inc (AAPL), Facebook ...International Business TimesWhat Do
Microsoft's Layoff Plans Tell Us About Satya Nadella's Vision?Motley
FoolTech Insider -Insider Monkey (blog)all 2,176 news articles »
heading : Microsoft Layoffs – What it Means for MSFT Stock - Investorplace.com
id : tag:news.google.com,2005:cluster=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
language : character(0)
origin : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEadqFvThyxvJU3O5uHa6wiyoWNEw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778559643673&ei=Cr3LU8jGNMnNkwX_lYCICQ&url=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
So the different attributes are there, but when I run
Headers <- sapply(corpus, FUN=function(x){attr(x,"heading")})
Headers is a list of 100 items with NULL values. I am pretty sure this particular code was working a few days ago. What changed in between is that I reinstalled the packages on the new system and also updated R from 3.1.0 to 3.1.1.
What can I do to get separate lists of headings, descriptions, timestamps, etc., which I later want to convert into a 100 x 3 data frame?
With the newest R, please try the following code:
headers <- meta(corpus, tag="heading")

Making self-destructing code in R

I am making a package in R and would like to release it as a trial version valid for 30 days.
My question is: how can I make code self-destruct after a given number of days?
I played with the date and time functions for a while and learned that
Sys.Date() gives today's date, so I went forward with something like this:
today=Sys.Date()
a=today
b=a+1
if(a==today)
{
print(paste("today is sunday"))
}
if(b==today){
print(paste("today is monday"))
}
I know what I have done is silly. My idea was to fix the first use of the package as the starting day and count up every day until 30 days; when it reaches the limit, the package would automatically destroy itself using
file.remove(), with which I can remove some files.
I hope my idea is clear.
Sorry for the novice question.
Add this condition to the license ("30 days for free, after that you'll have to pay") and expect users to comply with it.
There is really nothing else you can do.
Well, actually you can. For example, on the first occasion your code is run, save the current date to a file in a certain location (say, "~/.datetocheck"). Then every time your code is run, check for the existence of this file, and if it exists, compare the dates. If more than 30 days have passed, give an error message:
stop("Time is over! You have to pay!")
The problem is that nothing prevents the user from simply deleting this file.
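The date-file check described above can be sketched in base R as follows. The function name trial_active, its arguments and the temporary stamp location are assumptions made for this example; the answer itself only suggests a file such as "~/.datetocheck".

```r
# Hypothetical helper: TRUE while the trial is still running.
# On the first call it records the start date in stamp_file.
trial_active <- function(stamp_file, trial_days = 30, today = Sys.Date()) {
  if (!file.exists(stamp_file)) {
    writeLines(format(today), stamp_file)  # first run: save the start date
    return(TRUE)
  }
  start <- as.Date(readLines(stamp_file, n = 1))
  # The trial ends once more than trial_days days have passed
  as.numeric(today - start) <= trial_days
}

# Demo with a stamp file in the session's temp directory (assumed location)
stamp <- file.path(tempdir(), ".datetocheck")
if (!trial_active(stamp)) stop("Time is over! You have to pay!")
```

As noted, this is only a courtesy check: deleting the stamp file restarts the trial.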

Writing several texts which follow the same link with reST

The following reST doesn't work as expected. How can I do it right?
Only the word python_ in this text is linked. What to do if
I want using other words such as `like it <python_>`_ to jump
to the same link?
.. _python: http://www.python.org
Thanks for any help!
mutetella
This isn't really well documented as far as I have been able to find, but the OP's method should work. I've tested the following input with rst2html.py (v 0.11) and Sphinx (v 1.2b1) on Cygwin. Both generate the correct hyperlinks to the CNN site.
* According to `CNN <http://www.cnn.com/>`_ the economy...
* According to `The Amazing Cable News Network`_ the economy...
* According to the `Cable News Network <CNN_>`_ the economy...
.. _The Amazing Cable News Network: CNN_
.. _CNN: http://www.cnn.com/
The second form was suggested by @Ajay, although it seems odd to me to have to create another alias to get the text you want. The third form seems to be what the OP was looking for, and as far as I can tell it works fine. The third form also works for internal links within the document.

Looking for a filter to modify the months

Hi, I'm making a simple plugin that replaces the wrong Russian month names with the right ones.
But I can't find any filter that works.
I have tried these filters without success:
add_filter('get_the_modified_date', 'russian-month');
add_filter('the_modified_date', 'russian-month');
add_filter('date_rewrite_rules', 'russian-month');
I just found out that you can use more hooks as filters, not only the ones listed at http://codex.wordpress.org/Plugin_API/Filter_Reference#Date_and_Time_Filters
add_filter('get_the_date', 'russian-month');
did the trick for me :)
