Converting Python code that reads in and manipulates CSV to R code

Below is Python code that reads a CSV from a URL, isolates the ticker symbols, and converts them to a list. I am brand new to R and am hoping there is an easy, quick way to convert this Python code to R before I get too deep into figuring it out myself.
import requests
import pandas as pd

# Read contents of csv link into string variable
cboe_csv_link = 'https://www.cboe.com/available_weeklys/get_csv_download/'
output = requests.get(cboe_csv_link).text
# Search string marking the start of the data of interest
find_str = "Available Weeklys - Exchange Traded Products (ETFs and ETNs)"
# Find index of search string in output
idx = output.find(find_str)
# Count number of newlines until search string is encountered
skiprows_val = output[:idx + len(find_str)].count("\n")
# Filter out rows and columns to isolate ticker symbols
cboe_csv = pd.read_csv(cboe_csv_link, skiprows=skiprows_val, usecols=[0], header=None)
tickers_df = cboe_csv[(cboe_csv[0] != 'Available Weeklys - Exchange Traded Products (ETFs and ETNs)')
                      & (cboe_csv[0] != 'Available Weeklys - Equity')]
# Convert dataframe column to list
tickers = tickers_df[0].tolist()

Here is one possible way to solve your problem:
library(magrittr)
tickers = readLines("https://www.cboe.com/available_weeklys/get_csv_download/") %>%
  gsub(pattern = '"', replacement = "") %>%
  subset(nzchar(.) & !grepl("Available Weekly|\\d+/\\d+/\\d+", .)) %>%
  sub(pattern = "([A-Z]+).+", replacement = "\\1")
# [1] "AMLP" "ARKF" "ARKG" "ARKK" "ASHR" "BRZU" "DIA" "DUST" "EEM"
# [10] "EFA" "EMB" "ERX" "EWH" "EWJ" "EWU" "EWW" "EWY" "EWZ"
# [19] "FAS" "FAZ" "FEZ" "FXE" "FXI" "FXY" "GDX" "GDXJ" "GLD"
# [28] "HYG" "IAU" "IBB" "ICLN" "IEF" "INDA" "ITB" "IVV" "IWF"
# [37] "IWM" "IYR" "JDST" "JETS" "JNK" "JNUG" "KRE" "KWEB" "LABD"
# ...
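If you prefer something closer to the original Python logic (find the marker line, then skip past it with read.csv), here is a rough base-R sketch; it assumes the file layout matches what the Python code expects (two-column ticker rows after the section headers):

# Find the line number of the section header, then skip that many lines when reading
cboe_csv_link <- "https://www.cboe.com/available_weeklys/get_csv_download/"
lines <- readLines(cboe_csv_link)
find_str <- "Available Weeklys - Exchange Traded Products (ETFs and ETNs)"
skiprows_val <- grep(find_str, lines, fixed = TRUE)[1]
cboe_csv <- read.csv(cboe_csv_link, skip = skiprows_val, header = FALSE)
# Drop the remaining section-header row and keep only the ticker column
tickers_df <- subset(cboe_csv, !V1 %in% c(find_str, "Available Weeklys - Equity"))
tickers <- as.character(tickers_df$V1)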

Not a translation of your Python code, but hopefully a fair interpretation.
cboe_csv_link <- "https://www.cboe.com/available_weeklys/get_csv_download/"
rr <- readLines(cboe_csv_link)
ss <- c(grep("Available Weeklys", rr), length(rr))
l <- list()
for (i in 1:(length(ss) - 1)) {
  l[[i]] <- read.csv(text = rr[(ss[i] + 1):(ss[i + 1] - 1)], header = FALSE)
}
names(l) <- rr[head(ss, -1)]
lapply(l, head)
# $`Available Weeklys - Exchange Traded Products (ETFs and ETNs)`
# V1 V2
# 1 AMLP ALPS ETF TR ALERIAN MLP
# 2 ARKF ARK ETF TR FINTECH INNOVA
# 3 ARKG ARK ETF TR GENOMIC REV ETF
# 4 ARKK ARK ETF TR INNOVATION ETF
# 5 ASHR DBX ETF TR XTRACK HRVST CSI
# 6 BRZU DIREXION SHS ETF TR BRZ BL 2X SHS
#
# $`Available Weeklys - Equity`
# V1 V2
# 1 AA ALCOA CORP COM
# 2 AAL AMERICAN AIRLS GROUP INC COM
# 3 AAOI APPLIED OPTOELECTRONICS INC COM
# 4 AAPL APPLE INC COM
# 5 ABBV ABBVIE INC COM
# 6 ABC AMERISOURCEBERGEN CORP COM

Here is a slightly different approach. First we download the data to a file, say weeklysmf.csv.
> url <- "https://www.cboe.com/available_weeklys/get_csv_download/"
> download.file(url, "weeklysmf.csv", quiet=TRUE)
>
We then use the fact that all the lines you are interested in have exactly two fields separated by a comma. This awk invocation filters all lines with exactly two fields, using , as the field separator:
$ awk -F, 'NF==2 {print $0}' weeklysmf.csv |head
"AMLP","ALPS ETF TR ALERIAN MLP"
"ARKF","ARK ETF TR FINTECH INNOVA"
"ARKG","ARK ETF TR GENOMIC REV ETF"
"ARKK","ARK ETF TR INNOVATION ETF"
"ASHR","DBX ETF TR XTRACK HRVST CSI"
"BRZU","DIREXION SHS ETF TR BRZ BL 2X SHS"
"DIA","SPDR DOW JONES INDL AVERAGE ET UT SER 1"
"DUST","DIREXION SHS ETF TR DAILY GOLD MINER"
"EEM","ISHARES TR MSCI EMG MKT ETF"
"EFA","ISHARES TR MSCI EAFE ETF"
$
We can use this with many of the CSV readers in R that can read from a command (R offers a connections interface where pipe() is an option, as are file() and url()). I like data.table, so this becomes
> dat <- data.table::fread(cmd="awk -F, 'NF==2 {print $0}' weeklysmf.csv")
> dat
AMLP ALPS ETF TR ALERIAN MLP
1: ARKF ARK ETF TR FINTECH INNOVA
2: ARKG ARK ETF TR GENOMIC REV ETF
3: ARKK ARK ETF TR INNOVATION ETF
4: ASHR DBX ETF TR XTRACK HRVST CSI
5: BRZU DIREXION SHS ETF TR BRZ BL 2X SHS
---
611: YY JOYY INC ADS REPSTG COM A
612: Z ZILLOW GROUP INC CL C CAP STK
613: ZM ZOOM VIDEO COMMUNICATIONS INC CL A
614: ZNGA ZYNGA INC CL A
615: ZS ZSCALER INC COM
>
(fread can also return a data.frame if you prefer; see its data.table=FALSE option).
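If you would rather stay in base R, the same awk filter can be read through a pipe() connection instead; this is only a sketch, assuming awk is on the PATH, weeklysmf.csv was downloaded as above, and the column names below are made up:

# Base-R equivalent of the fread(cmd=...) call, via a pipe() connection
con <- pipe("awk -F, 'NF==2 {print $0}' weeklysmf.csv")
dat2 <- read.csv(con, header = FALSE, col.names = c("ticker", "description"))
head(dat2)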

Related

How to extract substring between periods in R

I need to create a dataframe from a .csv file containing author references:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
Essentially I want to pull out the coauthors, year of publication, and article title.
refs$author[1]
Harris P R, Harris D L
refs$year[1]
1983
refs$title[1]
Training for the Metaindustrial Work Culture
At this stage, I do not need a publication source as I can get this via rscopus.
I can extract authors and years with this code:
refs <- refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"))
However, I need help extracting the title (substring between two periods after bracketed date).
This regex works for your minimal example:
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference)
#> [1] " Training for the Metaindustrial Work Culture"
Explanation:
"[^.]+\\.([^.]+)\\..*" - whole regex
[^.]+\\. - one or more characters that aren't a period, followed by a period (i.e. everything up to and including the first period)
([^.]+)\\..* - start capturing group 1 "(", which contains one or more characters that aren't a period ([^.]+), then stop capturing group 1 ")" at the next period "\\." (group 1 now holds the title), then match everything else ".*"
Then, in the sub command, you print group 1 ("\\1").
Unfortunately, you may run into problems with your 'real world' data. Using rscopus to extract the title might be a better solution to avoid unforeseen errors.
Using tidyverse functions:
library(tidyverse)
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"),
         title = sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
#> reference
#> 1 Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.
#> author year title
#> 1 Harris P R, Harris D L 1983 Training for the Metaindustrial Work Culture
Created on 2022-12-05 with reprex v2.0.2
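As a variation on the same idea, stringr::str_match() can pull all three pieces out in one pass. This is only a sketch that assumes the same "Authors (year). Title. Journal..." layout as the minimal example:

library(stringr)
refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
# group 1 = authors, group 2 = year, group 3 = title (text up to the next period)
m <- str_match(refs$reference, "^(.*?)\\s*\\((\\d{4})\\)\\.\\s*([^.]+)\\.")
refs$author <- m[, 2]
refs$year   <- m[, 3]
refs$title  <- m[, 4]
refs[, c("author", "year", "title")]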

Finding GICS Sector using Rblpapi in R

I am trying to replace a column in my data with the output of the function bdp(column + "Equity", "GICS_SECTOR_NAME").
library(Rblpapi)
# Create raw data example
ticker <- c(2, 3, 4, 5, 6)
sector <- c(NA, NA, NA, NA, NA)
dataraw <- data.frame(ticker, sector)
dataraw$sector <- bdp("dataraw$ticker Equity", "GICS_SECTOR_NAME")
This does not work because the quotes turn it into a literal string, and I also have to append the word "Equity", e.g. IBM Equity.
An example of it working perfectly would be bdp("IBM Equity", "GICS_SECTOR_NAME")
You can add the "Equity" part using paste and use the resulting ticker as an argument to bdp:
#Create raw data example
ticker <- c("IBM", "AAPL", "MSFT", "FB")
sector <- c(NA,NA,NA,NA)
df <- data.frame(ticker, sector)
df$ticker_full <- paste(df$ticker, "US Equity", sep = " ")
library(Rblpapi)
conn <- Rblpapi::blpConnect()
sectors <- bdp(securities = df$ticker_full,
               fields = "GICS_SECTOR_NAME")
> print(sectors)
GICS_SECTOR_NAME
IBM US Equity Information Technology
AAPL US Equity Information Technology
MSFT US Equity Information Technology
FB US Equity Communication Services
df$sector <- sectors$GICS_SECTOR_NAME
> print(df)
ticker sector ticker_full
1 IBM Information Technology IBM US Equity
2 AAPL Information Technology AAPL US Equity
3 MSFT Information Technology MSFT US Equity
4 FB Communication Services FB US Equity
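One small caveat (my own suggestion, not part of the original answer): since the bdp() result shown above carries the securities as row names, you can look sectors up by name rather than relying on row order:

# Defensive variant: index the bdp() result by row name instead of position
df$sector <- sectors[df$ticker_full, "GICS_SECTOR_NAME"]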

Remove certain words in string from column in dataframe in R

I have a dataset in R that lists out a bunch of company names, and I want to remove words like "Inc", "Company", "LLC", etc. as part of a clean-up effort. I have the following sample data:
sampleData
Location Company
1 New York, NY XYZ Company
2 Chicago, IL Consulting Firm LLC
3 Miami, FL Smith & Co.
Words I do not want to include in my output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.
removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company,stopwords)
The output for the above function looks like this:
[1] "XYZ Company Consulting Firm Smith"
The output should be:
Location Company
1 New York, NY XYZ Company
2 Chicago, IL Consulting Firm
3 Miami, FL Smith
Any help would be appreciated.
We can use the 'tm' package:
library(tm)
stopwords = readLines('stopwords.txt') #Your stop words file
x = df$company #Company column data
x = removeWords(x,stopwords) #Remove stopwords
df$company_new <- x #Add the list as new column and check
With a small tweak to the stopwords (inserting "\\" before the period in "Co." so it is not treated as a regex wildcard, and adding trailing spaces), this also works. (But the previous answer should be preferred if you don't want to keep an eye on the stopwords.)
stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")
gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company" "Consulting Firm " "Smith "
df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
# Location Company
#1 New York, NY XYZ Company
#2 Chicago, IL Consulting Firm
#3 Miami, FL Smith
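Alternatively, the original removeWords() helper can simply be vectorised over the rows, e.g. with vapply(). A sketch using the sample data and stopword vector from the question:

removeWords <- function(str, stopwords) {
  vapply(strsplit(str, " "),
         function(x) paste(x[!x %in% stopwords], collapse = " "),
         character(1))
}
sampleData$Company <- removeWords(as.character(sampleData$Company), stopwords)
sampleData
#       Location         Company
# 1 New York, NY     XYZ Company
# 2  Chicago, IL Consulting Firm
# 3    Miami, FL           Smith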

Quantmod FRED Metadata in R

library(quantmod)
getSymbols("GDPC1",src = "FRED")
I am trying to extract not only the numerical economic/financial data from FRED but also the metadata. I am trying to chart CPI and use the metadata as labels/footnotes. Is there a way to extract this data using the quantmod package?
Title: Real Gross Domestic Product
Series ID: GDPC1
Source: U.S. Department of Commerce: Bureau of Economic Analysis
Release: Gross Domestic Product
Seasonal Adjustment: Seasonally Adjusted Annual Rate
Frequency: Quarterly
Units: Billions of Chained 2009 Dollars
Date Range: 1947-01-01 to 2014-01-01
Last Updated: 2014-06-25 7:51 AM CDT
Notes: BEA Account Code: A191RX1
Real gross domestic product is the inflation adjusted value of the
goods and services produced by labor and property located in the
United States.
For more information see the Guide to the National Income and Product
Accounts of the United States (NIPA) -
(http://www.bea.gov/national/pdf/nipaguid.pdf)
You can use the same code that's in the body of getSymbols.FRED, but change ".csv" to ".xls", then read the metadata you're interested in from the .xls file.
library(gdata)
Symbol <- "GDPC1"
FRED.URL <- "http://research.stlouisfed.org/fred2/series"
tmp <- tempfile()
download.file(paste0(FRED.URL, "/", Symbol, "/downloaddata/", Symbol, ".xls"),
              destfile = tmp)
read.xls(tmp, nrows=17, header=FALSE)
# V1 V2
# 1 Title: Real Gross Domestic Product
# 2 Series ID: GDPC1
# 3 Source: U.S. Department of Commerce: Bureau of Economic Analysis
# 4 Release: Gross Domestic Product
# 5 Seasonal Adjustment: Seasonally Adjusted Annual Rate
# 6 Frequency: Quarterly
# 7 Units: Billions of Chained 2009 Dollars
# 8 Date Range: 1947-01-01 to 2014-01-01
# 9 Last Updated: 2014-06-25 7:51 AM CDT
# 10 Notes: BEA Account Code: A191RX1
# 11 Real gross domestic product is the inflation adjusted value of the
# 12 goods and services produced by labor and property located in the
# 13 United States.
# 14
# 15 For more information see the Guide to the National Income and Product
# 16 Accounts of the United States (NIPA) -
# 17 (http://www.bea.gov/national/pdf/nipaguid.pdf)
Instead of hardcoding nrows=17, you can use grep to search for the row that has the headers of the data, and subset to only include rows before that.
dat <- read.xls(tmp, header=FALSE, stringsAsFactors=FALSE)
dat[seq_len(grep("DATE", dat[, 1])-1),]
unlink(tmp) # remove the temp file when you're done with it.
FRED has a straightforward, well-documented JSON interface, http://api.stlouisfed.org/docs/fred/, which provides both metadata and time series data for all of its economic series. Access requires a FRED account and API key, but these are available on request from http://api.stlouisfed.org/api_key.html.
The Excel-style descriptive data you asked for can be retrieved using:
get.FRSeriesTags <- function(seriesNam)
{
  # seriesNam = character string containing the ID identifying the FRED series to be retrieved
  library("httr")
  library("jsonlite")
  # dummy FRED api key; request valid key from http://api.stlouisfed.org/api_key.html
  apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
  base <- "http://api.stlouisfed.org/fred/"
  seriesID <- paste("series_id=", seriesNam, sep = "")
  fileType <- "&file_type=json"
  #
  # get series descriptive data
  #
  datType <- "series?"
  url <- paste(base, datType, seriesID, apiKey, fileType, sep = "")
  series <- fromJSON(url)$seriess
  #
  # get series tag data
  #
  datType <- "series/tags?"
  url <- paste(base, datType, seriesID, apiKey, fileType, sep = "")
  tags <- fromJSON(url)$tags
  #
  # format as excel descriptive rows
  #
  description <- data.frame(Title = series$title[1],
                            Series_ID = series$id[1],
                            Source = tags$notes[tags$group_id == "src"][1],
                            Release = tags$notes[tags$group_id == "gen"][1],
                            Frequency = series$frequency[1],
                            Units = series$units[1],
                            Date_Range = paste(series[1, c("observation_start", "observation_end")], collapse = " to "),
                            Last_Updated = series$last_updated[1],
                            Notes = series$notes[1],
                            row.names = series$id[1])
  return(t(description))
}
Retrieving the actual time series data would be done in a similar way (a sketch follows below). There are several JSON packages available for R, but jsonlite works particularly well for this application.
There's a bit more to setting this up than the previous answer but perhaps worth it if you do much with FRED data.
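For completeness, here is a sketch of fetching the observations themselves through the fred/series/observations endpoint, in the same style as the function above (same dummy API key; note that FRED returns values as strings and marks missing observations with "."):

get.FRSeriesData <- function(seriesNam)
{
  library("httr")
  library("jsonlite")
  # dummy FRED api key; request a valid key from http://api.stlouisfed.org/api_key.html
  apiKey <- "&api_key=abcdefghijklmnopqrstuvwxyz123456"
  base <- "http://api.stlouisfed.org/fred/"
  url <- paste(base, "series/observations?", "series_id=", seriesNam, apiKey, "&file_type=json", sep = "")
  obs <- fromJSON(url)$observations
  # "." marks a missing observation, so coerce quietly to NA
  data.frame(date = as.Date(obs$date),
             value = suppressWarnings(as.numeric(obs$value)))
}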

Web-scraping with xpathSApply

I am doing some web scraping with the XML package, and I need to isolate the country name and the two numeric values that you see below:
<tr><td>Tonga</td>
<td class="RightAlign">3,000</td>
<td class="RightAlign">6,000</td>
</tr>
Here is the code I've written so far - I think that I just need the right XPath expression?
# vectors to store the results
pages <- character(0)
country_names <- character(0)
# go through all 6 pages containing the info we want, and store
# the html in a list
for (page in 1:6) {
  who_search <- paste(who_url, page, '.html', sep = '')
  page <- htmlTreeParse(who_search, useInternalNodes = TRUE)
  pages <- c(page, pages)
  # extract the country names from each page
  country <- xpathSApply(page, "????", xmlValue)
  country_names <- c(country, country_names)
}
No need to use xpathSApply here; use readHTMLTable instead:
library(XML)
library(RCurl)
page = htmlParse('http://www.who.int/diabetes/facts/world_figures/en/index4.html')
readHTMLTable(page)
Country 2000 2030
1 Albania 86,000 188,000
2 Andora 6,000 18,000
3 Armenia 120,000 206,000
4 Austria 239,000 366,000
5 Azerbaijan 337,000 733,000
6 Belarus 735,000 922,000
Using xpathSApply (note the use of gsub to clean the result):
country <- xpathSApply(page, '//*[@id="primary"]/table/tbody/tr',
                       function(x) gsub('\n', '', xmlValue(x)))
> country
[1] "Albania 86,000 188,000 "
[2] "Andora 6,000 18,000 "
[3] "Armenia 120,000 206,000 "
[4] "Austria 239,000 366,000 "
[5] "Azerbaijan 337,000 733,000 "
EDIT: As mentioned in the comments, we can use xpathSApply without gsub:
val <- xpathSApply(page, '//tbody/tr/td', xmlValue)  ## gets a vector of table cell values
as.data.frame(matrix(val, ncol = 3, byrow = TRUE))   ## reshape into a 3-column data frame
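If it helps, the three columns can then be named; the labels below are assumptions based on the table shown earlier:

tab <- as.data.frame(matrix(val, ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
names(tab) <- c("Country", "2000", "2030")  # assumed column labels from the page
head(tab)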
