I am relatively new to the concept of vectorization and would like to ask whether the community has any suggestions for improving the run time of a process I have been using to download Bloomberg API data and bind it to a matrix.
Currently, this process iterates through each individual date within my API call, which takes quite a bit of time. I am wondering if I can do this in a "vectorized" way in order to make numerous calls at once, and then bind the results to a data frame, reducing run time.
```
#packages used below: lubridate for today(), Rblpapi for bdp()
#(an open Bloomberg connection via blpConnect() is assumed)
library(lubridate)
library(Rblpapi)
#create fund names to feed through as param in loop below
fundList <- c("fund 1 on bloomberg",
"fund 2 on bloomberg",
"fund 3 on bloomberg",
"fund 4 on bloomberg",
"fund 5 on bloomberg",
"fund 6 on bloomberg",
"fund 7 on bloomberg",
)
#create datelist for params for loop
newDateList <- seq(as.Date(today()-1401),length=1401, by="days")
newDateListReformatted <- gsub("-","",newDateList)
#create df object and loop through bloomberg API, assign to dataframe object
df_total = data.frame()
for(fund in 1:length(fundList)){
df_total = data.frame() #reset for each fund
for(b in 1:(length(newDateListReformatted)-1)){ #stop one short so index b+1 stays in range
ovrd <- c("CUST_TRR_START_DT"=newDateListReformatted[b],"CUST_TRR_END_DT"=newDateListReformatted[b+1])
print(ovrd)
model <- bdp(fundList[fund],"CUST_TRR_RETURN_HOLDING_PER",overrides=ovrd)
print(model)
df <- data.frame(model)
df1 <- data.frame(newDateListReformatted[b+1])
df2 <- cbind(df,df1)
df_total <- rbind(df_total,df2)
}
assign(fundList[fund],df_total)
}
```
First the loop takes a fund at the outer level, iterates through all the dates, and binds the rows to the dataframe one step at a time before moving to the next fund in fundList and iterating through the time series again.
The way I am thinking about it, I would pass a vector of multiple date parameters to the function and "vertically" assign the results to the df_total matrix more than one row at a time, instead of adding a single row per iteration. Alternatively, I could call each individual date but do it across a number of funds and assign them "horizontally" to the matrix.
Any thoughts are appreciated.
Vectorization consists of writing functions that efficiently handle multiple values for each input. For example, one can calculate column means with a loop, lapply(mtcars, mean), or with the vectorized function colMeans(mtcars). The latter is much more efficient, as the looping is done in optimized compiled code rather than in R.
On Stack Overflow, vectorization is often conflated with readability of code, and as such using an *apply function is often considered vectorization; while these functions are useful for readability, they do not (by themselves) speed up your code.
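To see the difference in practice, here is a minimal benchmark sketch (assuming the microbenchmark package is installed; exact timings will vary by machine):
```
# Both calls return the named vector of column means of mtcars;
# colMeans() loops in compiled code, sapply() loops at the R level.
library(microbenchmark)
microbenchmark(
  loop       = sapply(mtcars, mean),
  vectorized = colMeans(mtcars)
)
```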
For your specific example, your bottleneck (and problem) comes partly from the repeated calls to bdp and partly from iteratively expanding your result using cbind, rbind and assign.
To speed up your code, we first need to be aware of how the function is implemented. From the documentation we can read that fields and securities accept multiple arguments. These arguments are thus vectorized, while overrides only accepts a named vector of override fields. This means we can eliminate the outer loop in your code by providing all the securities and fields in one go.
Next, to reduce the overhead from iteratively expanding your data.frame, we can store the intermediate results in a list and combine everything in one go once the loop has run. Combining these ideas we get a code example such as the one below:
n <- length(newDateListReformatted)
# Create override matrix (makes it easier to subset, but not strictly necessary)
periods <- matrix(c(newDateListReformatted[-n], newDateListReformatted[-1]), ncol = 2, byrow = FALSE)
colnames(periods) <- c('CUST_TRR_START_DT', 'CUST_TRR_END_DT')
models <- vector('list', n - 1)
for(i in seq_len(n - 1)){
models[[i]] <- bdp(fundList,
'CUST_TRR_RETURN_HOLDING_PER',
overrides = periods[i, ]
)
# Add identifier columns
models[[i]][,'CUST_TRR_START_DT'] <- periods[i, 1]
models[[i]][,'CUST_TRR_END_DT'] <- periods[i, 2]
}
# Combine results in single data.frame (if wanted)
model <- do.call(rbind, models)
Note that the code finishes by combining the intermediate results using do.call(rbind, models), which gives a single data.frame, but one could use bind_rows from the dplyr package or rbindlist from the data.table package as well.
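For example, either of these would combine the list as well (assuming the respective package is installed):
```
model <- dplyr::bind_rows(models)        # returns a data.frame/tibble
model <- data.table::rbindlist(models)   # returns a data.table
```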
Further note that I do not have access to bloomberg (currently) and cannot test my code for possible spelling mistakes.
Related
I would like to loop through nine data sets, perform calculations, and output a different file name.
Existing Code:
list <- c(corporate_service, finance, its, law, market_services, operations, president, member_services, System_Planning)
Calc <- function(list){
list %>% filter(Total_Flag == 1) %>%
select(Element, Amount, Total)
}
lapply(list, Calc)
I would like to loop through each dataset and apply the function above. More specifically, I would like to rename each processed dataframe something different. Is there a way to do this? I should also note, this code has not worked for me - is there anything noticeably wrong?
Thanks
Avoid flooding your global environment with separate, similarly structured data frames in the first place. Instead, continue to use a list of data frames. See #GregorThomas's best-practices answer for why. In fact, a named list is preferable for better indexing.
# DEFINE A NAMED LIST OF DATA FRAMES
df_list <- list(corporate_service = corporate_service,
finance = finance,
its = its,
law = law,
market_services = market_services,
operations = operations,
president = president,
member_services = member_services,
system_planning = System_Planning)
# REMOVE ORIGINALS FROM GLOBAL ENVIRONMENT
rm(corporate_service, finance, its, law, market_services,
operations, president, member_services, System_Planning)
# REVIEW STRUCTURE
str(df_list)
Then define a method to interact with a single data frame (not list) and its list name. Then call it iteratively:
Calc <- function(df, nm) {
df <- select(filter(df, Total_Flag == 1), Element, Amount, Total)
write.csv(df, file.path("path", "to", "my", "destination", paste0(nm, ".csv")))
return(df)
}
# ASSIGN TO A NEW LIST
new_df_list <- mapply(Calc, df_list, names(df_list), SIMPLIFY=FALSE)
new_df_list <- Map(Calc, df_list, names(df_list)) # EQUIVALENT WRAPPER TO ABOVE
To be clear, you lose no functionality of a data frame if it is stored in a larger container.
head(new_df_list$corporate_service)
tail(new_df_list$finance)
summary(new_df_list$its)
Such containers even help serialize same operations:
lapply(new_df_list, summary)
Even concatenate all data frame elements together with column of corresponding list name:
final_df <- dplyr::bind_rows(new_df_list, .id="division")
Overall, your organization and data management are enhanced, since you only have to work with a single, indexed object rather than many separate ones that require ls, mget, get, eval, or assign for dynamic operations.
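As a small illustration (the lookups below are hypothetical), compare pulling divisions dynamically out of the global environment with simply indexing the named list:
```
# With separate global objects, lookups need string-based tools:
div <- "finance"
one_df   <- get(div)
some_dfs <- mget(c("finance", "law"))
# With the named list, the same lookups are ordinary indexing:
one_df   <- new_df_list[[div]]
some_dfs <- new_df_list[c("finance", "law")]
```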
Essentially this is about using bitmask/binary columns and row-oriented operations against a data table/frame: firstly, constructing a logical vector from a combination of selected columns that can be used to mask a character vector, representing 'which' columns are flagged; secondly, row expansion - given a count in one column, producing a data table that contains the original row data replicated that number of times.
For summarising the flags using a row-wise bitmask, which uses purrr::reduce to concatenate the flags represented in each row, I cannot find a succinct way to do this in a %>% chain rather than a separate for loop. I suspect purrr::map is required but I cannot get the syntax right.
For the row expansion, the nested for loop has appalling performance and I cannot find a way for dplyr/purrr to replicate each row a given number of times. A map would need to produce and append multiple rows per input row, which I don't think map is capable of.
The following code produces the required output - but, apart from performance issues (especially regarding row expansion), I'd like to be able to do this as vectorised operations.
library(tidyverse)
library(data.table)
dt <- data.table(C1=c(0,0,1,0,1,0),
C2=c(1,0,0,0,0,1),
C3=c(0,1,0,0,1,0),
C4=c(0,1,1,0,0,0),
C5=c(0,0,0,0,1,1),
N=c(5,2,6,8,1,3),
Spurious = '')
flags <- c("Scratching Head","Screaming",
"Breaking Keyboard","Coffee Break",
"Giving up")
# Summarise states
flagSummary <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5),.funs=as.logical) %>%
dplyr::mutate(States=c(""))
for(i in 1:nrow(interim)){
interim$States[i] <-
flags[as.logical(interim[i,1:5])] %>%
purrr::reduce(~ paste(.x, .y, sep = ","),.init="") %>%
stringr::str_replace("^[,]","") }
dplyr::select(interim,States,N) }
summary <- flagSummary(dt)
View(summary)
# Expand states
expandStates <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5), .funs=as.logical) %>%
dplyr::select_at(vars(C1:C5,N)) %>%
data.table::setnames(.,append(flags,"Count"))
expansion <- interim[0,1:5]
for(i in 1:nrow(interim)){
for(j in 1:interim$Count[i]){
expansion <- bind_rows(expansion, interim[i,1:5]) } }
expansion }
expansion <- expandStates(dt)
View(expansion)
As stated, the code produces the expected result. I'd 'like' to see the same without resorting to for loops and whilst being able to chain the functions into the initial mutate/selects.
As for the row expansion of the expandStates function, the answer is proffered here Replicate each row of data.frame and specify the number of replications for each row? by A5C1D2H2I1M1N2O1R2T1.
Essentially, the nested for loop is simply replaced by
interim[rep(rownames(interim[,1:5]),interim$Count),][1:5]
On my 'actual' data, this reduces the user time from 28.64 seconds to 0.06 seconds to produce some 26,000 rows.
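Applied to the example data above, a self-contained sketch of the loop-free expansion (using the rep() idiom from that answer) could look like this:
```
expandStatesFast <- function(dt){
  interim <- dt %>%
    dplyr::mutate_at(vars(C1:C5), .funs = as.logical) %>%
    dplyr::select_at(vars(C1:C5, N)) %>%
    data.table::setnames(., append(flags, "Count"))
  # replicate each row 'Count' times instead of growing the result in a loop
  interim[rep(seq_len(nrow(interim)), interim$Count), 1:5]
}
expansion <- expandStatesFast(dt)
```
tidyr::uncount() offers a similar built-in row expansion if staying entirely within the tidyverse is preferred.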
I am using the R tidyverse package to extract several subsets of a large data set, each matching a specific field value. However, since the number of subsets to extract is large, extracting them one by one with a specific expression is time consuming, and I wonder if there is a faster way to do this.
Here is a minimal example:
The data frame looks like this and is called "dummy":
A <- c(605, 605, 608, 608)
B <- c(5, 6, 3, 4)
C <- c(500, 600, 300, 400)
dummy <-as.data.frame(A, B, C)
At present what I do is:
subject1 <- filter(dummy, A == "605")
subject2 <- filter(dummy, A == "608")
Since there are 100 subjects in my original data set, this process is time consuming, and I wonder if there is a faster method to do this.
I note that the numbers in column A are in order but not consecutive, as shown in the example.
Thanks for any help
We can do a split (which should be faster than repeated == filtering) into a list of data.frames:
lst1 <- split(dummy, dummy$A)
NOTE: Creating multiple objects in the global environment is not recommended
Once we have a list, it is easier to process/apply functions in each list element with lapply/sapply etc.
lapply(lst1, function(x) colMeans(x[-1]))
NOTE: If it is a group by operation, we don't need to split it
aggregate(.~ A, dummy, FUN = mean)
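Since the question already uses dplyr, the same group-wise summary could also be written as follows (a sketch assuming dplyr >= 1.0 for across()):
```
library(dplyr)
dummy %>%
  group_by(A) %>%
  summarise(across(everything(), mean))
```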
data
dummy <- data.frame(A, B, C)
You can do this using a loop. However, as #akrun mentioned, you could end up with a lot of objects in the global environment. For example, if you had 200 subjects, you would have 200 objects (very messy). Perhaps consider what your next steps will be and see whether you can achieve them without creating so many objects.
subjects <- c(605, 608)
for (i in 1:length(subjects)) {
object_name <- paste0("subject", i)
assign(object_name, filter(dummy, A == subjects[i]))
}
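If the subject IDs are not known in advance, they can be derived from the data rather than typed out, for example:
```
# take the distinct values of A instead of hard-coding them
subjects <- sort(unique(dummy$A))
```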
I do have some grasp on how to use lapply to, say, change the names of variables in several dataframes in a list. However, I am looking to carry out a slightly (but only slightly) more complicated operation.
More specifically, I am looking to calculate the mean growth rates for several entities. The growth rates have already been calculated, so I just need to perform the following operations on all data frames:
for (i in 1:13) {
growth.type[,i] <- tapply(growth[,8+i] , growth$type, mean, na.rm = TRUE)
}
This creates a new dataframe (growth.type) that contains the means, by type, of the several hundred growth rates in the original dataframe (growth).
Now, I would like to do this to several dataframes (like growth) and put them into new dataframes (like growth.type).
I hope this makes sense.
Put all data.frames you wish to process in a list
xy <- list(growth1, growth2, growth3, ...)
and then apply a custom function to this xy object.
customFunction <- function(.data) {
  # pre-allocate the result: one row per type, thirteen columns of means
  growth.type <- as.data.frame(matrix(NA, nrow = length(unique(.data$type)), ncol = 13))
  for (i in 1:13) {
    growth.type[, i] <- tapply(.data[, 8 + i], .data$type, mean, na.rm = TRUE)
  }
  growth.type # this is the object which will be returned when the function finishes
}
then just do
out <- lapply(xy, FUN = customFunction)
If you want to combine the result of lapply, you can use do.call, e.g. do.call("rbind", out).
I have around thirty separate time series in R. I would like to put them all inside one large data set but can not seem to do this.
I have used the following code but it doesn't work. All my time series are named ts1, ts2, etc. If I do df <- data.frame(ts1, ts2) this works, but not when I build the input as shown below.
for(i in 2:nrow(deal))
{
temp <- paste("ts",i,sep="")
mystring <- paste(mystring,temp,sep=",")
}
df <- data.frame(mystring)
Given that df <- data.frame(ts1,ts2) works, the following should work:
N <- nrow(deal) # or whatever number of time series you have
df <- data.frame(sapply(1:N, function(i) eval(parse(text=paste("ts",i,sep="")))))
Notes:
sapply loops over sequence from 1 to N and applies a function. The results of the function are gathered as columns to a matrix that is then coerced into a data frame.
The function that is applied constructs the string for the name of the i-th time series and uses this SO answer to evaluate the expression from the string. This returns the time series.
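For reference, a slightly simpler sketch that avoids eval(parse()) uses mget(), which fetches several objects by name in one call:
```
# collect ts1 ... tsN from the global environment and bind them as columns
df <- data.frame(mget(paste0("ts", 1:N), envir = .GlobalEnv))
```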
Hope this helps.