RSS feed for gas prices and how to interpret the feed - rss

I am trying to add an RSS feed of gas prices based on location to my application.
I searched for gas price RSS feeds and came across Motortrend's gas price feed:
http://www.motortrend.com/widgetrss/gas-
The feed seems fine, but the price values appear to be encoded as letters, as shown below:
Chevron 3921 Irvine Blvd, Irvine, CA 92602 (0.0 miles)
Monday, May 10, 2010 9:16 AM
Regular: ZEIECHK Plus: ZEHGIHC Premium: ZEGJEGE Diesel: N/A
How do I interpret these values to come up with a gas price? Or is the encoding internal to Motortrend and not meant to be used elsewhere?

View the source of the Motortrend Vista Sidebar gadget: download the file, rename its extension to .zip, unzip it, and look at the /js/gas.js file. There you will find a JavaScript function called parsePrice(). It basically converts each character to its Unicode value and then applies some simple math.
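The first half of that description, turning each letter into its Unicode value, can be sketched in R as follows. This is only an illustration of the character conversion step: the arithmetic that parsePrice() applies afterwards is not shown in this thread, so the sketch stops at the code points and does not try to produce a price.

```r
# Convert an encoded price string from the feed into Unicode code points.
# Only the character-to-code-point step is shown; the arithmetic used by
# parsePrice() in gas.js is not reproduced here.
encoded <- "ZEIECHK"               # "Regular" value from the sample feed item
code_points <- utf8ToInt(encoded)
print(code_points)

# Per-character view, handy when comparing against the JavaScript source
chars <- strsplit(encoded, "")[[1]]
print(data.frame(char = chars, code = code_points))
```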

Related

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract the full article content via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame with information on up to 100 articles. However, these do not contain the full article text. Instead, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation, but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to 260 characters. However, test$url holds the link to the source article, which you can use to scrape the full content. Since the articles are aggregated from various sources, I don't think there is a single automated way to do this.
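Before scraping anything, it can help to know which rows were actually cut off. Here is a minimal sketch in base R, assuming the truncated entries always end in a "[+N chars]" marker like the one in the sample above; the `content` vector is a hypothetical stand-in for the content column of results$results_df.

```r
# Hypothetical stand-in for the `content` column of results$results_df
content <- c(
  "Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co... [+5173 chars]",
  "A short article that was not truncated."
)

# Assumed marker format, taken from the sample output: "[+N chars]" at the end
marker <- "\\[\\+([0-9]+) chars\\]$"
truncated <- grepl(marker, content)

# Number of characters withheld (NA when the text is complete)
missing_chars <- rep(NA_integer_, length(content))
missing_chars[truncated] <- as.integer(
  sub(paste0(".*", marker), "\\1", content[truncated])
)

print(data.frame(truncated, missing_chars))
```

The rows flagged as truncated are the ones whose url would need to be fetched and parsed separately.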

Extract relevant text from a .txt file in R

I am still at a basic beginner level with R. I am currently working on some natural language stuff and I use the ProQuest Newsstand database. Even though the database lets you download txt files, I don't need everything they provide. The files you can download there look like this:
###############################################################################
____________________________________________________________
Report Information from ProQuest 16 July 2016 09:58
____________________________________________________________
____________________________________________________________
Inhaltsverzeichnis
1. Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY
____________________________________________________________
Dokument 1 von 1
Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY
http:...
Kurzfassung: Savills said that as part of its plans to build...
Links: ...
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY:   [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
____________________________________________________________
Kontaktieren Sie uns unter: http... Copyright © 2016 ProQuest LLC. Alle Rechte vorbehalten. Allgemeine Geschäftsbedingungen: ...
###############################################################################
What I need is a way to extract only the full text to a csv file. When I download hundreds of articles in one file, it is quite difficult to copy and paste them manually, and the file seems quite structured. The length of the text varies, but one could use the next header after the full text as a stop sign (I guess).
Is there any way to do this?
I really would appreciate some help.
Kind regards,
Steffen
Let's say you have all the publication information in a single text file. Make a copy of the file first so you can start over if needed. Using Notepad++ and a regular expression, go through the following steps:
Ctrl+F
Choose the Mark tab.
Search mode: Regular expression
Find what: ^Volltext:\s
Alt+M to check Bookmark line (if unchecked only)
Click on Mark All
From the main menu go to: Search > Bookmark > Remove Unmarked Lines
Then go through the following steps to strip the field label:
Ctrl+H
Search mode: Regular expression
Find what: ^Volltext:\s (choose from dropdown)
Replace with: NOTHING (clear text field)
Click on Replace All
Done ...
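Try this out:
# Read the whole file into a single string
con <- file("./R/sample text.txt")
content <- paste(readLines(con), collapse = "\n")
# Collapse blank lines
content <- gsub(pattern = "\\n\\n", replacement = "\n", x = content)
close(con)
# Keep everything from "Volltext:" up to the next rule of 10+ underscores
content.filtered <- sub(pattern = "(.*)(Volltext:.*?)(_{10,}.*)",
                        replacement = "\\2", x = content)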
Results (the metadata fields that follow the full text are still included):
> cat(content.filtered)
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY: [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
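To keep only the full text and to handle several documents in one file, a line-based approach is another option. The following is a minimal sketch, not a tested ProQuest parser: it assumes each full text starts on a line beginning with "Volltext:" and ends at the next line that looks like a field header ("Label: ...") or a rule of underscores. The sample lines, the second document and the helper name extract_fulltext are made up for illustration.

```r
# Sample lines standing in for readLines() on the exported file; the second
# document is invented for illustration.
lines <- c(
  "Dokument 1 von 2",
  "Kurzfassung: Savills said that as part of its plans to build...",
  "Volltext: Property agency Savills yesterday snapped up US real estate",
  "banking firm Granite Partners...",
  "Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910",
  "Titel: Savills cracks Granite deal to establish US presence",
  "____________________________________________________________",
  "Dokument 2 von 2",
  "Volltext: Second article text.",
  "Titel: Second article"
)

extract_fulltext <- function(lines) {
  starts <- grep("^Volltext:", lines)
  # Assumption: a field header looks like "Label: " at the start of a line;
  # a rule of underscores also ends the text.
  is_stop <- grepl("^([[:alpha:]/ -]+: |_{10,})", lines)
  vapply(starts, function(s) {
    after <- which(is_stop & seq_along(lines) > s)
    end <- if (length(after)) after[1] - 1 else length(lines)
    text <- paste(lines[s:end], collapse = " ")
    sub("^Volltext:\\s*", "", text)
  }, character(1))
}

fulltext <- extract_fulltext(lines)
# One row per document; a temporary file is used here as a placeholder path
write.csv(data.frame(doc = seq_along(fulltext), text = fulltext),
          file = tempfile(fileext = ".csv"), row.names = FALSE)
print(fulltext)
```

The header heuristic would cut a text short if a line inside the article happened to start with something like "He said: ", so matching against the known list of field labels would be safer for real data.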

R webcorpus attribute extraction

I am using tm.plugin.webmining to get the latest news about a company, say Microsoft, using the following command:
corpus<-WebCorpus(GoogleBlogSearchSource(stock))
When I run meta(corpus[[1]]) I get:
Metadata:
author : character(0)
datetimestamp: 2014-07-17 20:28:10
description : Microsoft Layoffs – What it Means for MSFT StockInvestorplace.comWhile the layoffs are obviously
going to be hardest on the workers, as investors we still have to take
a rational and objective look at the corporation to see what it means
for MSFT – particularly if you are personally a
Microsoft stock holder ...Why Microsoft (MSFT) Stock Is Up
TodayTheStreet.comEarnings Preview: Microsoft Corporation (MSFT),
Apple Inc (AAPL), Facebook ...International Business TimesWhat Do
Microsoft's Layoff Plans Tell Us About Satya Nadella's Vision?Motley
FoolTech Insider -Insider Monkey (blog)all 2,176 news articles »
heading : Microsoft Layoffs – What it Means for MSFT Stock - Investorplace.com
id : tag:news.google.com,2005:cluster=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
language : character(0)
origin : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEadqFvThyxvJU3O5uHa6wiyoWNEw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778559643673&ei=Cr3LU8jGNMnNkwX_lYCICQ&url=http://investorplace.com/2014/07/microsoft-layoffs-means-msft-stock/
So I can see that the different attributes are there, but when I run
Headers<-sapply(meta(corpus,FUN=function(x){attr(x,"heading")})
Headers is a list of 100 items with NULL values. I am pretty sure this particular code was running a few days ago. What changed in between is that I reinstalled the packages on a new system and updated R from 3.1.0 to 3.1.1.
What can I do to get separate lists of headers, descriptions, timestamps, etc., which I later want to convert into a 100x3 data frame?
With the newest R, please try the following code:
headers<-meta(corpus,tag="heading")
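Turning the per-tag results into a data frame could then look like the sketch below. It does not call tm itself: the three lists are hypothetical stand-ins for what meta(corpus, tag = ...) returns, under the assumption that each result has one element per document and that a missing value appears as character(0), as in the metadata printout above. The helper name flatten_meta is also made up.

```r
# Stand-ins for meta(corpus, tag = "heading") and friends: one list element
# per document, where a missing value shows up as character(0).
headers      <- list("Microsoft Layoffs - What it Means for MSFT Stock",
                     "Second heading")
descriptions <- list("While the layoffs are obviously going to be hardest...",
                     character(0))
timestamps   <- list("2014-07-17 20:28:10", "2014-07-18 09:00:00")

# Collapse a per-document list to a character vector, using NA for empties
flatten_meta <- function(x) {
  vapply(x, function(v) if (length(v)) as.character(v)[1] else NA_character_,
         character(1))
}

news <- data.frame(
  heading       = flatten_meta(headers),
  description   = flatten_meta(descriptions),
  datetimestamp = flatten_meta(timestamps),
  stringsAsFactors = FALSE
)
print(dim(news))
```

With 100 documents in the corpus, the same steps would give the 100x3 data frame asked for.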

How to connect data dictionaries to the unlabeled data

I'm working with some large government datasets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is a 670 MB file of unlabeled data (when unzipped), and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014
FIELDS:
=======
Field#  Name       Type/Size  Description
------  ---------  ---------  --------------------------------------
1       CMPLID     CHAR(9)    NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
                              IS AN UPDATEABLE FIELD,THUS DATA FOR A
                              GIVEN RECORD POTENTIALLY COULD CHANGE FROM
                              ONE DATA OUTPUT FILE TO THE NEXT.
2       ODINO      CHAR(9)    NHTSA'S INTERNAL REFERENCE NUMBER.
                              THIS NUMBER MAY BE REPEATED FOR
                              MULTIPLE COMPONENTS.
                              ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
                              THIS NUMBER MAY BE REPEATED FOR MULTIPLE
                              PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21      CMPL_TYPE  CHAR(4)    SOURCE OF COMPLAINT CODE:
                              CAG  =CONSUMER ACTION GROUP
                              CON  =FORWARDED FROM A CONGRESSIONAL OFFICE
                              DP   =DEFECT PETITION,RESULT OF A DEFECT PETITION
                              EVOQ =HOTLINE VOQ
                              EWR  =EARLY WARNING REPORTING
                              INS  =INSURANCE COMPANY
                              IVOQ =NHTSA WEB SITE
                              LETR =CONSUMER LETTER
                              MAVQ =NHTSA MOBILE APP
                              MIVQ =NHTSA MOBILE APP
                              MVOQ =OPTICAL MARKED VOQ
                              RC   =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
                              RP   =RECALL PETITION,RESULT OF A RECALL PETITION
                              SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
                              VOQ  =NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: Is this data dictionary a standard format of some kind? I've tried to Google around, but it's hard to do so without the right terminology. I would like to import into R, though I'm flexible so long as it can be done programmatically.
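Whether or not the layout follows a named standard, the excerpt is regular enough to parse with regular expressions. Below is a minimal sketch in base R, assuming that field rows start with a number followed by a name and a TYPE(size) token, and that code rows look like "CODE =LABEL". The patterns are guesses based only on the lines shown above and may need adjusting for the full dictionary.

```r
# A few dictionary lines copied from the excerpt (whitespace simplified)
dict_lines <- c(
  "Field#  Name       Type/Size  Description",
  "------  ---------  ---------  --------------------------------------",
  "1       CMPLID     CHAR(9)    NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.",
  "                              IS AN UPDATEABLE FIELD,THUS DATA FOR A",
  "2       ODINO      CHAR(9)    NHTSA'S INTERNAL REFERENCE NUMBER.",
  "21      CMPL_TYPE  CHAR(4)    SOURCE OF COMPLAINT CODE:",
  "                              CAG  =CONSUMER ACTION GROUP",
  "                              CON  =FORWARDED FROM A CONGRESSIONAL OFFICE"
)

# Field rows: number, name, TYPE(size), then the start of the description
field_re <- "^\\s*([0-9]+)\\s+([A-Z_0-9]+)\\s+([A-Z]+\\([0-9,]+\\))\\s+(.*)$"
is_field <- grepl(field_re, dict_lines)
fields <- data.frame(
  number = as.integer(sub(field_re, "\\1", dict_lines[is_field])),
  name   = sub(field_re, "\\2", dict_lines[is_field]),
  type   = sub(field_re, "\\3", dict_lines[is_field]),
  stringsAsFactors = FALSE
)

# Code rows: "CODE =LABEL"
code_re <- "^\\s*([A-Z]+)\\s*=(.+)$"
is_code <- grepl(code_re, dict_lines)
codes <- data.frame(
  code  = sub(code_re, "\\1", dict_lines[is_code]),
  label = sub(code_re, "\\2", dict_lines[is_code]),
  stringsAsFactors = FALSE
)

print(fields)
print(codes)
```

Once all field rows are parsed, fields$name could be passed as col.names to read.delim() with header = FALSE to label the data file, and the codes table could be merged onto the CMPL_TYPE column.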

How to control the echo width using Sweave

I have a problem with the width of the echoed output in Sweave. I have a list containing a large amount of text, and the echoed output from R runs off the page in the PDF. I have tried using
<<>>=
options(width=40)
@
but this has not changed anything.
An example: Set up the list (not showing in latex).
<<echo=FALSE>>=
my_list <- list(example="Site location was fixed using a Silvia Navigator handheld GPS in October 2003. Point of reference used was the station Bench Mark. If the bench mark location was remote from the site then the point of reference used was changed to the 0-1 metre gauge. Bench Mark location was then recorded as a separate entry in the Site History section [but not used as the site location].\r\nFor a Station location map and all digital photograph's of the station, river reach, and site details see H:\\hyd\\dat\\doc. For non digital photo's taken prior to October 2003 please see the relevant station file at Tumut office.")
@
Then show the entry of the list.
<<>>=
my_list
@
Is there any way to get this to work without having to break up the list with cat statements?
You can use capture.output() to capture the printed representation of the list and then use writeLines() and strwrap() to display this output, nicely wrapped. As capture.output() returns a vector of strings containing the printed representation of the object, we can cat each of them to the screen/page but wrapped using strwrap(). The benefit of this approach is that the result looks like it was printed by R. Here's the solution:
writeLines(strwrap(capture.output(my_list)))
which produces:
$example
[1] "Site location was fixed using a Silvia Navigator
handheld GPS in October 2003. Point of reference used
was the station Bench Mark. If the bench mark location
was remote from the site then the point of reference used
was changed to the 0-1 metre gauge. Bench Mark location
was then recorded as a separate entry in the Site History
section [but not used as the site location].\r\nFor a
Station location map and all digital photograph's of the
station, river reach, and site details see
H:\\hyd\\dat\\doc. For non digital photo's taken prior
to October 2003 please see the relevant station file at
Tumut office."
From a 2010 posting to rhelp by Mark Schwartz:
cat(paste(strwrap(x, width = 70), collapse = "\\\\\n"), "\n")
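The capture.output() and strwrap() approach can be wrapped in a small helper so that the wrap width is set per call. This is a sketch only; print_wrapped is a hypothetical name, and the list below is a shortened version of the example text.

```r
# Print any R object as usual, but wrap each output line to fewer than
# `width` characters. `print_wrapped` is a hypothetical helper name.
print_wrapped <- function(x, width = 60) {
  out <- capture.output(print(x))
  # exdent indents continuation lines so wrapped text stays readable
  wrapped <- strwrap(out, width = width, exdent = 4)
  writeLines(wrapped)
  invisible(wrapped)
}

my_list <- list(example = paste(
  "Site location was fixed using a handheld GPS in October 2003.",
  "Point of reference used was the station Bench Mark."
))
wrapped <- print_wrapped(my_list, width = 40)
```

Inside a Sweave chunk, calling print_wrapped(my_list) in place of my_list would keep the echoed output within the chosen width. Words longer than the width are not broken, so a long path or URL can still overflow.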
