Clean way to download multiple time series from Bloomberg in R

I am trying to download some time series data about euro swaps (EUSA10 Currency, for example) in R using Rblpapi, but I am encountering the following problems:
If I try to download, for example, the 2y, 5y, 10y and 30y swap rates using the include.non.trading.days=FALSE option, the resulting time series are for some reason of different lengths and I receive an error message about it. If, on the other hand, I set the non-trading-days option to TRUE, I get time series of similar length that can then be cleaned up using the na.omit() function.
The format in which the data is downloaded is messy: I would like a data frame in which the first column is the date, the second column is the first security, the third column is the second security, and so forth. Instead, what I get is [date][security1][date][security2]...[date][securityN]. Any suggestions on how to solve this?
Below are a few quick lines I wrote as an example:
# Load package
library(Rblpapi)
# Connect to Bloomberg
blpConnect()
# Declaring securities
sec <- c("eusa2 curncy", "eusa5 curncy", "eusa10 curncy")
# Declaring the field to be downloaded
flds <- "PX_LAST"
data <- as.data.frame(bdh(sec, flds, start.date = as.Date("2019-08-18"),
                          end.date = as.Date("2020-08-18"),
                          include.non.trading.days = TRUE))

The Rblpapi manual states that Rblpapi::bdh returns:
A list with as many entries as there are entries in securities; each list entry contains a data.frame with one row per observation and as many columns as entries in fields. If the list is of length one, it is collapsed into a single data frame. Note that the order of securities returned is determined by the backend and may be different from the order of securities in the securities field.
So I'd suggest you rbind the data and then reshape it to get the result you want. A fast way to do this is the data.table::rbindlist function: it takes a list as input and returns a data.table containing all entries, and if idcol=TRUE it appends a .id column showing which data.frame each row came from. This method also works even if the data.frames resulting from the Rblpapi::bdh call have different numbers of rows.
# Declaring the field to be downloaded
flds <- "PX_LAST"
# Loading the data from the API
l <- bdh(sec, flds, start.date = as.Date("2019-08-18"),
         end.date = as.Date("2020-08-18"), include.non.trading.days = TRUE)
# The names of the securities columns as returned by the API
securities <- paste0("eusa", c(2, 5, 10, 15, 30), ".curncy.", flds)
# Row-binding the resulting list
dt <- data.table::rbindlist(l, idcol = TRUE, use.names = FALSE)
# idcol = TRUE appends an id column (.id) to the resulting data.table
# use.names = FALSE because the columns of the data.frames are named differently
# Remaking the .id column so it holds the security column name instead of an index
dt[, .id := securities[.id]]
# Reshaping into a wider data.table
data.table::dcast(dt, eusa2.curncy.date ~ .id, value.var = securities[1])
# eusa2.curncy.date is the column that defines a group of observations
# .id supplies the names of the new columns
# securities[1], i.e. eusa2.curncy.PX_LAST, is the column that contains the values
Data used
As I don't have access to a Bloomberg API endpoint, I created this mock data, which resembles the output of bdh:
# Column names mimicking bdh output: a PX_LAST and a date column per security
col.names <- paste0("eusa", rep(c(2, 5, 10, 15, 30), each = 2),
                    ".curncy.", rep(c(flds, "date"), 5))
l <- rep(list(data.frame(rnorm(200), 1:200)), 5)
for (i in 1:length(l)) colnames(l[[i]]) <- col.names[(2 * i - 1):(2 * i)]
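As a side note, with the standard bdh output, where each list element is a data.frame with date and PX_LAST columns, a base-R alternative is to merge all the series on date. This is a minimal sketch under that assumption (note it does not apply to the mock data above, whose columns are named differently):
# Assumes each element of l has the usual bdh columns "date" and "PX_LAST"
wide <- Reduce(function(x, y) merge(x, y, by = "date", all = TRUE), l)
colnames(wide) <- c("date", names(l))  # names(l) are the security tickers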

Related

How do I merge 2 data frames on R based on 2 columns?

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables I want to combine the 2 data frames by.
The merge function works when I am only trying to do it based off of one column:
merged <- merge(popr,droppedcol,by="USUBJID")
When I attempt to merge using 2 columns and view the resulting data frame, Duration, the table is empty: there are no values, only column headers. It says "no data available in table".
I am tasked with replicating the SAS code for this in R:
data duration;
  set pop combined1;
  by usubjid trtag2n;
run;
In R, I have tried the following:
duration <- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")
duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))
I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.
Per the SAS documentation, Combining SAS Data Sets, and as confirmed by the SAS guru @Tom in the comments above, set with by simply means you are interleaving the datasets. No merge is taking place (merge, by the way, is also a SAS statement, which you do not use here):
Interleaving uses a SET statement and a BY statement to combine
multiple data sets into one new data set. The number of observations
in the new data set is the sum of the number of observations from the
original data sets. However, the observations in the new data set are
arranged by the values of the BY variable or variables and, within
each BY group, by the order of the data sets in which they occur. You
can interleave data sets either by using a BY variable or by using an
index.
Therefore, the best translation of set without by in R is rbind(), and of set with by is rbind plus ordering the rows:
duration <- rbind(pop, combined1) # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),]) # ORDER ROWS
However, do note: rbind does not allow unmatched columns between the concatenated data sets. Third-party packages do allow unmatched columns, including plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist.
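For instance, a minimal sketch of the same interleaving with dplyr's bind_rows, which pads unmatched columns with NA:
library(dplyr)
# Stack the two data sets (unmatched columns become NA),
# then order the rows by the BY variables, as SAS interleaving does
duration <- bind_rows(pop, combined1)
duration <- duration[order(duration$usubjid, duration$trtag2n), ]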

Counting unique subsets of data efficiently

I have a relatively large dataset that I wouldn't qualify as 'big data'. It's around 3 to 5 million rows; because of the size I'm using the data.table library to do analysis.
The dataset (named df, which is a data.table structure) composition can essentially be broken into:
n identifier fields, hereafter ID_1, ID_2, ..., ID_n, some of which are numeric and some of which are character vectors.
m categorical variables, hereafter C_1, ..., C_m, all of which are character vectors and have very few values apiece (2 in one, 3 in another, etc...)
2 measurement variables, M_1, and M_2, both numeric.
A subset of data is identified by ID_1 through ID_n, has a full set of all values of C_1 through C_m, and has a range of values of M_1 and M_2. A subset of data consists of 126 records.
I need to accurately count the unique sets of data and, because of the size of the data, I would like to know if there already exists a much more efficient way to do this. (Why roll my own if other, much smarter, people have done it already?)
I've already done a fair amount of Google work to arrive at the method below.
What I've done is to use the ht package (https://github.com/nfultz/ht) so that I can use a data frame as a hash value (using digest in the background).
I paste together the ID fields to create a new, single column, hereafter referred to as ID, which resembles...
ID = "ID1:ID2:...:IDn"
Then I loop through each unique identifier and, using just the subset data frame of C_1 through C_m, M_1, and M_2 (126 rows of data), hash the value / increment the hash.
Afterwards I take that information and put it back into the data frame.
# Create the hash structure
datasets <- ht()
# Declare the fields which will denote a subset of data
uniqueFields <- c("C_1",..."C_m","M_1","M_2")
# Create the REPETITIONS field in the original data.table structure
df[,REPETITIONS := 0]
# Create the KEY field in the original data.table structure
df[,KEY := ""]
# Use the updateHash function to fill datasets
updateHash <- function(val){
  key <- df[ID == val, uniqueFields, with = FALSE]
  if (is.null(datasets[key])) {
    # If this unique set of data doesn't already exist in datasets...
    datasets[key] <- list(val)
  } else {
    # If this unique set of data does already exist in datasets...
    datasets[key] <- append(datasets[key], val)
  }
}
# Loop through the ID fields. I've explored using apply;
# this vector is around 10-15K long. This version works.
for (id in unique(df$ID)) {
updateHash(id)
}
# Now update the original data.table structure so the analysis can
# be done. Again, I could use the R apply family, this version works.
for (dataset in ls(datasets)) {
  IDS <- unlist(datasets[[dataset]]$val)
  # For this set of information, how many times was it repeated?
  df[ID %in% IDS, REPETITIONS := length(datasets[[dataset]]$val)]
  # For this set, what is a unique identifier?
  df[ID %in% IDS, KEY := dataset]
}
This does what I want, though not blindingly fast. I now have the capability to present some neat analysis revolving around variability in datasets to people who care about it. I don't like that it's hacky and, one way or another, I'm going to clean this up and make it better. Before I do that, I want to do my final due diligence and see if it's simply my Google-fu failing me.
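For comparison, a minimal data.table-only sketch of the same idea, not from the original post: give each ID's subset a canonical row order, serialize it into one key string, and count IDs per key. It replaces the hash loop entirely, so the REPETITIONS/KEY initialization above isn't needed; the column names mirror the code above.
library(data.table)
# Canonical row order so identical subsets serialize identically
setorderv(df, c("ID", uniqueFields))
# One serialized key per ID summarising its full C/M subset
keys <- df[, .(KEY = paste(do.call(paste, .SD), collapse = "|")),
           by = ID, .SDcols = uniqueFields]
# IDs sharing a KEY have identical subsets; count them
keys[, REPETITIONS := .N, by = KEY]
# Attach the keys and counts back onto the original table
df <- keys[df, on = "ID"]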

Assigning names to rows in R

I would like to assign names to rows in R, but so far I have only found ways to assign names to columns. My data is in two columns, where the first column (geo) contains the name of the specific location I'm investigating and the second column (skada) is the observed value at that location. To clarify, I want to be able to assign names to every location instead of just having them all in one .txt file, so that the data is easier to work with. Does anyone with more experience than me know how to handle this in R?
First you need to import the data into your global environment, e.g. with the function read.table().
To name the rows (assuming your data.frame is named df), try:
rownames(df) <- df[, "geo"]
df <- df[, -1]
Well, your question is not that clear...
I assume you are trying to create a data.frame with named rows. If you look at the data.frame help, you can see the description of the row.names parameter:
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
which means you can manually specify the row names when you create the data.frame, or the column containing them. The former can be achieved as follows:
d = data.frame(x = rnorm(10),             # 10 normally distributed random values
               y = rnorm(10),             # 10 normally distributed random values
               row.names = letters[1:10]  # use the first 10 letters as row headers
)
while the latter is
d = data.frame(x = rnorm(10),      # 10 normally distributed random values
               y = rnorm(10),      # 10 normally distributed random values
               r = letters[1:10],  # the first 10 letters
               row.names = 3       # the column containing the row headers is the 3rd
)
If you are reading the data from a file, I will assume you are using the command read.table. Many of its parameters are the same as data.frame's; in particular you will find that the row.names parameter works the same way:
a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.
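For instance, a minimal sketch (the file name data.txt is an assumption) that reads the file and uses the geo column as row names in one step:
# "data.txt" is a placeholder; row.names = "geo" makes read.table use that
# column as row names instead of keeping it as an ordinary data column
df <- read.table("data.txt", header = TRUE, row.names = "geo")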
Finally, if you have already read in the data.frame and you want to change the row names, Pierre's answer is your solution.

Look up data frame with values stored in another data frame

I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named as FA.OFC1, FA.OFC2 etc.
ID     sex date       age.yrs important.var etc...
xx_111 F   xx.xx.xxxx x.x     x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
I recreated the name of the data frame I will need to look up to get the additional information by pasting information after the which.max return:
max  data frame
8    df2
Specifically, the names() function gave me the name of the column with the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
cols <- c(5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50)
sev.scores[52:53] <- as.data.frame(cbind(
  row.names(sev.scores[cols]),
  apply(sev.scores[cols], 1,
        function(x) names(sev.scores[cols])[which(x == max(x))])
))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above the information about the most severe episode is stored in data frame 2 (df2) and I need to take information from the 5th record (important.var) and return it to this summary data frame.
UPDATE
I have now stored these dfs in a list but am still having some trouble getting the information I would like.
I found the following example for getting the max value from a list
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
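A minimal sketch of this concatenation step, assuming the per-visit data.frames are named FA.OFC1, FA.OFC2, etc. as in the question (the source column name is my choice):
library(dplyr)
# Stack the per-visit data.frames; .id records which one each row came from
all_data <- bind_rows(list(FA.OFC1 = FA.OFC1, FA.OFC2 = FA.OFC2),
                      .id = "source")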
Then, once you have all the data in a single data.frame, you can create a summary trivially by simply selecting the most severe episode for each disease type:
sev_scores = all_data %>%
    group_by(ID) %>%
    filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the 'dplyr' package. You can perform an equivalent analysis using different packages (e.g. 'data.table') or base R functions, but I strongly recommend dplyr: the resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.

Dynamically specify column name in spread()

I am attempting to automate a simple process of importing some data and using the spread function from the tidyr package to convert it to wide-format data.
Below is a simplified example
Ticker <- c(rep("GOOG",5), rep("AAPL",5))
Prices <- rnorm(10, 95, 5)
Date <- rep(sapply(c("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04", "2015-01-05"),as.Date), 2)
exStockData <- data.frame(Ticker, Date, Prices)
After reading in a data frame like exStockData, I'd like to be able to create a data frame like the one below
library(tidyr)
#this is the data frame I'd like to be able to create
desiredDataFrame <- spread(exStockData, Ticker, Prices)
However, the column used for the key argument of the spread function will not always be called Ticker and the column used for the value argument of the function will not always be called Prices. The column names are read in from a different portion of the file that gets imported.
#these vectors are removed because the way my text file is read in
#I don't actually have these vectors
rm(Ticker, Prices, Date)
#the name of the first column (which serves as the key in
#the spread function) of the exStockData data frame will
#vary, and is read in from the file and stored as a one
#element character vector
secID <- "Ticker"
#the name of the last column in the data frame
#(which serves as the value in the spread function)
#is stored also stored as a one element character vector
fields <- "Prices"
#I'd like to be able to dynamically specify the column
#names using these other character vectors
givesAnError <- spread(exStockData, get(secID), get(fields))
The "See also" section of the documentation for the spread function mentions the spread_ function which is intended to be used in this situation.
In this case the solution is to use:
solved <- spread_(exstockData, secID, fields)
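As an aside, spread_() (and spread() itself) is superseded in tidyr 1.0+; a minimal sketch of the same reshape with pivot_wider(), which accepts column names held in character vectors via all_of():
library(tidyr)
# names_from/values_from use tidyselect, so all_of() resolves the
# column names stored in secID and fields
solved <- pivot_wider(exStockData,
                      names_from = all_of(secID),
                      values_from = all_of(fields))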
