library(quantmod)
getSymbols("GDPC1",src = "FRED")
I am trying to extract not only the numerical economic/financial data from FRED but also the metadata. I want to chart CPI and use the metadata as labels/footnotes. Is there a way to extract this data using the quantmod package?
Title: Real Gross Domestic Product
Series ID: GDPC1
Source: U.S. Department of Commerce: Bureau of Economic Analysis
Release: Gross Domestic Product
Seasonal Adjustment: Seasonally Adjusted Annual Rate
Frequency: Quarterly
Units: Billions of Chained 2009 Dollars
Date Range: 1947-01-01 to 2014-01-01
Last Updated: 2014-06-25 7:51 AM CDT
Notes: BEA Account Code: A191RX1
Real gross domestic product is the inflation adjusted value of the
goods and services produced by labor and property located in the
United States.
For more information see the Guide to the National Income and Product
Accounts of the United States (NIPA) -
(http://www.bea.gov/national/pdf/nipaguid.pdf)
You can use the same code that's in the body of getSymbols.FRED, but change ".csv" to ".xls", then read the metadata you're interested in from the .xls file.
library(gdata)
Symbol <- "GDPC1"
FRED.URL <- "http://research.stlouisfed.org/fred2/series"
tmp <- tempfile()
download.file(paste0(FRED.URL, "/", Symbol, "/downloaddata/", Symbol, ".xls"),
              destfile=tmp)
read.xls(tmp, nrows=17, header=FALSE)
# V1 V2
# 1 Title: Real Gross Domestic Product
# 2 Series ID: GDPC1
# 3 Source: U.S. Department of Commerce: Bureau of Economic Analysis
# 4 Release: Gross Domestic Product
# 5 Seasonal Adjustment: Seasonally Adjusted Annual Rate
# 6 Frequency: Quarterly
# 7 Units: Billions of Chained 2009 Dollars
# 8 Date Range: 1947-01-01 to 2014-01-01
# 9 Last Updated: 2014-06-25 7:51 AM CDT
# 10 Notes: BEA Account Code: A191RX1
# 11 Real gross domestic product is the inflation adjusted value of the
# 12 goods and services produced by labor and property located in the
# 13 United States.
# 14
# 15 For more information see the Guide to the National Income and Product
# 16 Accounts of the United States (NIPA) -
# 17 (http://www.bea.gov/national/pdf/nipaguid.pdf)
Instead of hardcoding nrows=17, you can use grep to search for the row that has the headers of the data, and subset to only include rows before that.
dat <- read.xls(tmp, header=FALSE, stringsAsFactors=FALSE)
dat[seq_len(grep("DATE", dat[, 1])-1),]
unlink(tmp) # remove the temp file when you're done with it.
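To tie this back to the question's goal of charting with the metadata as labels, here is a minimal, untested sketch that pulls the title and units out of the rows read above (the V1/V2 column names come from read.xls with header=FALSE):
library(quantmod)
getSymbols("GDPC1", src = "FRED")
meta <- dat[seq_len(grep("DATE", dat[, 1]) - 1), ]  # metadata rows only
ttl <- meta$V2[meta$V1 == "Title:"]
units <- meta$V2[meta$V1 == "Units:"]
plot(as.zoo(GDPC1), main = ttl, ylab = units, xlab = "")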
FRED has a straightforward, well-documented JSON interface (http://api.stlouisfed.org/docs/fred/) which provides both metadata and time series data for all of its economic series. Access requires a FRED account and API key, but these are available on request from http://api.stlouisfed.org/api_key.html .
The Excel-style descriptive data you asked about can be retrieved using:
get.FRSeriesTags <- function(seriesNam)
{
    # seriesNam = character string containing the ID identifying the FRED series to be retrieved
    #
    library("httr")
    library("jsonlite")
    # dummy FRED api key; request valid key from http://api.stlouisfed.org/api_key.html
    apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
    base <- "http://api.stlouisfed.org/fred/"
    seriesID <- paste("series_id=", seriesNam, sep="")
    fileType <- "&file_type=json"
    #
    # get series descriptive data
    #
    datType <- "series?"
    url <- paste(base, datType, seriesID, apiKey, fileType, sep="")
    series <- fromJSON(url)$seriess
    #
    # get series tag data
    #
    datType <- "series/tags?"
    url <- paste(base, datType, seriesID, apiKey, fileType, sep="")
    tags <- fromJSON(url)$tags
    #
    # format as excel descriptive rows
    #
    description <- data.frame(Title = series$title[1],
                              Series_ID = series$id[1],
                              Source = tags$notes[tags$group_id=="src"][1],
                              Release = tags$notes[tags$group_id=="gen"][1],
                              Frequency = series$frequency[1],
                              Units = series$units[1],
                              Date_Range = paste(series[1, c("observation_start","observation_end")], collapse=" to "),
                              Last_Updated = series$last_updated[1],
                              Notes = series$notes[1],
                              row.names = series$id[1])
    return(t(description))
}
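For example (after substituting a valid key for the dummy one above):
get.FRSeriesTags("GDPC1")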
Retrieving the actual time series data would be done in a similar way. There are several JSON packages available for R, but jsonlite works particularly well for this application.
There's a bit more to setting this up than the previous answer but perhaps worth it if you do much with FRED data.
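As a rough sketch of what "in a similar way" means, the series/observations endpoint returns the data points themselves. This is untested and assumes the same dummy API-key placeholder as above:
get.FRSeriesData <- function(seriesNam)
{
    library("jsonlite")
    # dummy FRED api key, as above
    apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
    base <- "http://api.stlouisfed.org/fred/"
    seriesID <- paste("series_id=", seriesNam, sep="")
    fileType <- "&file_type=json"
    url <- paste(base, "series/observations?", seriesID, apiKey, fileType, sep="")
    obs <- fromJSON(url)$observations
    # FRED returns values as character, with "." marking missing observations
    obs$value <- suppressWarnings(as.numeric(obs$value))
    obs$date <- as.Date(obs$date)
    obs[, c("date", "value")]
}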
I need to create a dataframe from a .csv file containing author references:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
Essentially I want to pull out the coauthors, year of publication, and article title.
refs$author[1]
Harris P R, Harris D L
refs$year[1]
1983
refs$title[1]
Training for the Metaindustrial Work Culture
At this stage, I do not need a publication source as I can get this via rscopus.
I can extract authors and years with this code:
library(dplyr)
library(stringr)
refs <- refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"))
However, I need help extracting the title (the substring between the two periods that follow the bracketed date).
This regex works for your minimal example:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference)
#> [1] " Training for the Metaindustrial Work Culture"
Explanation:
"[^.]+\\.([^.]+)\\..*" - whole regex
[^.]+\\. - one or more characters that isn't a period, followed by a period (i.e. everything up until the first period)
([^.]+)\\..* - start capturing 'group 1' "(" which contains one or more characters that aren't a period ([^.]+) then stop capturing group 1 ")" at the next period "\\." (group 1 now = the title), then match everything else ".*"
Then, in the sub command, you print group 1 ("\\1").
Unfortunately, you may run into problems with your 'real world' data. Using rscopus to extract the title might be a better solution to avoid unforeseen errors.
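For example, this invented reference, where the author initials add extra periods, already defeats the period-based split:
ref2 <- "Smith J. (1990). A Title. Some Journal, 1(1): 1."  # hypothetical example
sub("[^.]+\\.([^.]+)\\..*", "\\1", ref2)
#> [1] " (1990)"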
Using tidyverse functions:
library(tidyverse)
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
refs %>%
mutate(author = sub("\\(.*", "", reference),
year = str_extract(reference, "\\d{4}"),
title = sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
#> reference
#> 1 Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.
#> author year title
#> 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
Created on 2022-12-05 with reprex v2.0.2
I have two long strings that look like this in a vector:
x <- c("Job Information\n\nLocation: \n\n\nScarsdale, New York, 10583-3050, United States \n\n\n\n\n\nJob ID: \n53827738\n\n\nPosted: \nApril 22, 2020\n\n\n\n\nMin Experience: \n3-5 Years\n\n\n\n\nRequired Travel: \n0-10%",
"Job Information\n\nLocation: \n\n\nGlenview, Illinois, 60025, United States \n\n\n\n\n\nJob ID: \n53812433\n\n\nPosted: \nApril 21, 2020\n\n\n\n\nSalary: \n$110,000.00 - $170,000.00 (Yearly Salary)")
and my goal is to neatly organized them in a dataframe (output form) something like this:
#View(df)
Location Job ID Posted Min Experience Required Travel Salary
[1] Scarsdale,... 53827738 April 22... 3-5 Years 0-10% NA
[2] Glenview,... 53812433 April 21... NA NA $110,000.00 - $170,000.00 (Yearly Salary)
(...) was done to present the dataframe here neatly.
However, as you can see, the two strings don't necessarily have the same attributes. For example, the first string has Min Experience and Required Travel, but those fields don't exist in the second string, which has Salary instead. So this is getting very tricky for me. I thought I would split on the \n characters, but they are not consistent: some fields are separated by two newlines, others by four or five. I was wondering if someone could help me out. I would appreciate it!
We can split each string on one or more '\n' ('\n{1,}'). Remove the first element of each (which is 'Job Information') since we don't need it anywhere (x <- x[-1]). The remaining parts come in pairs of the form column name, column value. We build a one-row data frame from each string using alternating indices, and bind_rows combines them all by name.
dplyr::bind_rows(sapply(strsplit(gsub(':', '', x), '\n{1,}'), function(x) {
  x <- x[-1]
  setNames(as.data.frame(t(x[c(FALSE, TRUE)])), x[c(TRUE, FALSE)])
}))
# Location Job ID Posted Min Experience
#1 Scarsdale, New York, 10583-3050, United States 53827738 April 22, 2020 3-5 Years
#2 Glenview, Illinois, 60025, United States 53812433 April 21, 2020 <NA>
# Required Travel Salary
#1 0-10% <NA>
#2 <NA> $110,000.00 - $170,000.00 (Yearly Salary)
I am trying to fetch text from an anchor tag, which is embedded in a div tag, on the following website: http://mmb.moneycontrol.com/forum-topics/stocks-1.html
The text I want to extract is Mawana Sugars
So I want to extract all the stock names listed on this website along with their descriptions.
Here is my attempt to do it in R
library(XML)
doc <- htmlParse("http://mmb.moneycontrol.com/forum-topics/stocks-1.html")
xpathSApply(doc, "//div[@class='clearfix PR PB5']//text()", xmlValue)
But, it does not return anything. How can I do it in R?
My answer is essentially the same as the one I just gave here.
The data is dynamically loaded and cannot be retrieved directly from the HTML. But, looking at "Network" in Chrome DevTools for instance, we can find a nicely formatted JSON at http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1
To get you started:
library(jsonlite)
dat <- fromJSON("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1")
Output looks like:
dat[1:3, c("msg_id", "user_id", "topic", "heading", "flag", "price", "message")]
# msg_id user_id topic heading flag
# 1 47730730 liontrade NMDC Stocks APR
# 2 47730726 agrawalknath Glenmark Glenmark APR
# 3 47730725 bissy91 Infosys Stocks APR
# price
# 1 Price when posted : BSE: Rs. 127.90 NSE: Rs. 128.15
# 2 Price when posted : NSE: Rs. 714.10
# 3 Price when posted : BSE: Rs. 956.50 NSE: Rs. 955.00
# message
# 1 There is no mention of dividend in the announcement.
# 2 Eagerly Waiting for 670 to 675 to BUY second phase of Buying in Cash Delivery. Already Holding # 800.
# 3 6 ✂ ✂--Don t Pay High Brokerage While Trading. Take Delivery Free & Rs 20 to trade in any size - Join Today .👉 goo.gl/hDqLnm
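The query string ends in pgno=1, so presumably (an untested assumption) incrementing it pages through older messages:
# hypothetical paging sketch; pgno is assumed to select the page
base_url <- "http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno="
pages <- lapply(1:3, function(p) fromJSON(paste0(base_url, p)))
msgs <- do.call(rbind, lapply(pages, function(d) d[, c("msg_id", "user_id", "topic", "heading")]))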
I am trying to automatically extract electricity offers from this site. Once I set the postcode (e.g. 3000), I can manually download the PDF files.
I am using httr package :
library(httr)
library(XML)
qr <- POST("http://www.qenergy.com.au/What-Are-Your-Options",
           query=list(postcode=3000))
res <- htmlParse(content(qr))
The problem is that the file URLs are not in the query response. Any help, please?
Try this
library(httr)
qr <- POST("http://www.qenergy.com.au/What-Are-Your-Options",
           encode="form",
           body=list(postcode=3000))
res <- content(qr)
pdfs <- as(res['//a[contains(@href, "pdf")]/@href'], "character")
head(pdfs)
# [1] "flux-content/qenergy/pdf/VIC price fact sheet jemena distribution zone business/Jemena-Freedom-Biz-5-Day-Time-of-Use-A210.pdf"
# [2] "flux-content/qenergy/pdf/VIC price fact sheet jemena distribution zone business/Jemena-Freedom-Biz-7-Day-Time-of-Use-A250.pdf"
# [3] "flux-content/qenergy/pdf/VIC price fact sheet jemena distribution zone business/Jemena-Freedom-Biz-Single-Rate-CL.pdf"
# [4] "flux-content/qenergy/pdf/VIC price fact sheet jemena distribution zone business/Jemena-Freedom-Biz-Single-Rate.pdf"
# [5] "flux-content/qenergy/pdf/VIC price fact sheet united energy distribution zone business/United-Freedom-Biz-5-Day-Time-of-Use.pdf"
# [6] "flux-content/qenergy/pdf/VIC price fact sheet united energy distribution zone business/United-Freedom-Biz-7-Day-Time-of-Use.pdf"
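The hrefs are relative, so, assuming they resolve against the site root (not verified here), downloading one would look like:
# sketch: fetch the first fact sheet; URLencode handles the spaces in the paths
pdf_url <- paste0("http://www.qenergy.com.au/", URLencode(pdfs[1]))
download.file(pdf_url, destfile = basename(pdfs[1]), mode = "wb")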
I am using the R package tm.plugin.webmining. Using the function GoogleNewsSource(), I would like to query the news sorted by date and also from a specific date. Is there any parameter to query the news of a specific date?
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(params=list(hl="en", q=searchTerm,
ie="utf-8", num=10, output="rss" )))
headers <- meta(corpusGoog,tag="datetimestamp")
If you're looking for a data frame-like structure, this is how you'd go about creating it (note: not all fields are extracted from the corpus):
library(dplyr)
make_row <- function(elem) {
  data.frame(timestamp=elem[[2]]$datetimestamp,
             heading=elem[[2]]$heading,
             description=elem[[2]]$description,
             content=elem$content,
             stringsAsFactors=FALSE)
}
dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 4 variables:
## $ timestamp : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
## $ heading : chr "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
## $ description: chr "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
## $ content : chr "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...
Then, you can do anything you want. For example:
dat %>%
arrange(timestamp) %>%
select(heading) %>%
head
## Source: local data frame [6 x 1]
##
## heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2 Barack Obama to seek limits on student data mining - Politico
## 3 Why an obscure British data-mining company is worth $3 billion - Quartz
## 4 Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5 Fraud and data mining in Vancouverâ\u0080¦just Outside the Lines - Vancouver Sun (blog)
## 6 'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
If you want/need something else, you need to be clearer in your question.
I was looking at the Google query string and noticed that startdate and enddate tags are passed in the query if you click dates on the right-hand side of the page.
You can use the same tag names, and your results will be confined to within the start and end dates.
GoogleNewsSource(query, params = list(hl = "en", q = query, ie = "utf-8",
                                      start = 0, num = 25, output = "rss",
                                      startdate = '2015-10-26', enddate = '2015-10-28'))
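Plugged into the question's own setup, that would look something like this (untested):
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(searchTerm,
                          params = list(hl = "en", q = searchTerm, ie = "utf-8",
                                        start = 0, num = 25, output = "rss",
                                        startdate = "2015-10-26", enddate = "2015-10-28")))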