I am using the R tidyverse package to extract several subsets of a large data set, each matching a specific value in one field. However, since the number of subsets to be extracted is large, extracting them one by one with a specific expression is time consuming, and I wonder if there is a faster way to do this.
Here is a minimal example:
The data frame looks like this and is called "dummy":
A <- c(605, 605, 608, 608)
B <- c(5, 6, 3, 4)
C <- c(500, 600, 300, 400)
dummy <- data.frame(A, B, C)
At present, what I do is:
subject1 <- filter(dummy, A == "605")
subject2 <- filter(dummy, A == "608")
Since there are 100 subjects in my original data set, this process is time consuming, and I wonder if there is a faster method.
Note that the numbers in column A are in order but not consecutive, as shown in the example.
Thanks for any help
We can do a split (which should be faster than repeated == filtering) into a list of data.frames:
lst1 <- split(dummy, dummy$A)
NOTE: Creating multiple objects in the global environment is not recommended
Once we have a list, it is easier to process/apply functions in each list element with lapply/sapply etc.
lapply(lst1, function(x) colMeans(x[-1]))
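Individual subjects can then be pulled out of the list by name, for example:
lst1[["605"]]  # the subset where A == 605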
NOTE: If it is a group by operation, we don't need to split it
aggregate(.~ A, dummy, FUN = mean)
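Since the question uses the tidyverse, a dplyr equivalent of the group-by (a sketch, assuming dplyr >= 1.0 for across) is:
library(dplyr)
dummy %>%
  group_by(A) %>%
  summarise(across(everything(), mean))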
data
dummy <- data.frame(A, B, C)
You can do this using a loop. However, as #akrun mentioned, you could end up with a lot of objects in the global environment. For example, if you had 200 subjects, you would have 200 objects (very messy). Consider what your next steps will be and whether you can achieve your goal without creating that many objects.
subjects <- c(605, 608)
for (i in seq_along(subjects)) {
  # build names like "subject1", "subject2", ... and assign each subset to them
  object_name <- paste0("subject", i)
  assign(object_name, filter(dummy, A == subjects[i]))
}
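If your next steps allow it, a tidier sketch is to keep the subsets in a single named list instead of separate objects:
subject_list <- lapply(subjects, function(s) filter(dummy, A == s))
names(subject_list) <- paste0("subject", subjects)
subject_list[["subject605"]]  # the subset for subject 605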
I am relatively new to the concept of vectorization and would like to ask whether the community has any suggestions for improving the run time of a process I have been using to download Bloomberg API data and bind it to a data frame.
Currently, this process iterates through each individual date within my API call, which takes quite a bit of time. I am wondering if I can do this in a "vectorized" way, making numerous calls at once and then binding the results to a data frame, to reduce run time.
#create fund names to feed through as param in loop below
fundList <- c("fund 1 on bloomberg",
"fund 2 on bloomberg",
"fund 3 on bloomberg",
"fund 4 on bloomberg",
"fund 5 on bloomberg",
"fund 6 on bloomberg",
"fund 7 on bloomberg")
#create date list of params for the loop
newDateList <- seq(today() - 1401, length.out = 1401, by = "days")  # today() is from lubridate
newDateListReformatted <- gsub("-", "", newDateList)
#create df object and loop through bloomberg API, assign to dataframe object
df_total = data.frame()
for(fund in seq_along(fundList)){
df_total = data.frame()
# stop at length - 1 because the loop body reads element b + 1
for(b in 1:(length(newDateListReformatted) - 1)){
ovrd <- c("CUST_TRR_START_DT"=newDateListReformatted[b],"CUST_TRR_END_DT"=newDateListReformatted[b+1])
print(ovrd)
model <- bdp(fundList[fund],"CUST_TRR_RETURN_HOLDING_PER",overrides=ovrd)
print(model)
df <- data.frame(model)
df1 <- data.frame(newDateListReformatted[b+1])
df2 <- cbind(df,df1)
df_total <- rbind(df_total,df2)
}
assign(fundList[fund],df_total)
}
First the loop moves to a fund at the first level, iterates through all the dates, and binds the rows to the data frame one step at a time before moving to the next fund in fundList and iterating through the time series again.
The way I am thinking about it, I would pass a vector of multiple date parameters to the function and assign the results "vertically" to the df_total matrix, more than one row per iteration, rather than one at a time. Alternatively, I could call each individual date but do it across a number of funds at once and assign those results "horizontally" to the matrix.
Any thoughts are appreciated.
Vectorization consists of functions that efficiently handle multiple inputs at once. For example, one can calculate the mean of each column using a loop, lapply(mtcars, mean), or use the vectorized function colMeans(mtcars). The latter is much more efficient, as the function is optimized over its inputs.
On Stack Overflow, vectorization is often confused with readability of code, and as such using an *apply function is often considered vectorization; while these are useful for readability, they do not (by themselves) speed up your code.
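A quick way to see the difference for yourself (a sketch, assuming the microbenchmark package is installed):
library(microbenchmark)
microbenchmark(
  loop       = sapply(mtcars, mean),  # one call per column
  vectorized = colMeans(mtcars)       # single optimized call
)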
For your specific example, your bottleneck (and problem) comes partly from the repeated calls to bdp and partly from iteratively expanding your result using cbind, rbind and assign.
To speed up your code, we first need to be aware of how the function is implemented. From the documentation we can read that fields and securities accept multiple arguments. These arguments are thus vectorized, while overrides only accepts a named vector of override fields. This means we can eliminate the outer loop in your code, by providing all the fields and securities in one go.
Next, in order to reduce the overhead of iteratively expanding your data.frame, we can store the intermediate results in a list and combine everything in one go once the loop has run. Putting these together, we get a code example such as the one below.
n <- length(newDateListReformatted)
# Create an override matrix (makes it easier to subset, but not strictly necessary)
periods <- matrix(c(newDateListReformatted[-n], newDateListReformatted[-1]), ncol = 2, byrow = FALSE)
colnames(periods) <- c('CUST_TRR_START_DT', 'CUST_TRR_END_DT')
models <- vector('list', n - 1)
for(i in seq_len(n - 1)){
models[[i]] <- bdp(fundList,
'CUST_TRR_RETURN_HOLDING_PER',
overrides = periods[i, ]
)
# Add identifier columns
models[[i]][,'CUST_TRR_START_DT'] <- periods[i, 1]
models[[i]][,'CUST_TRR_END_DT'] <- periods[i, 2]
}
# Combine results in single data.frame (if wanted)
model <- do.call(rbind, models)
Note that the code finishes by combining the intermediary results using do.call(rbind, models) which gives a single data.frame, but one could use bind_rows from the dplyr package or rbindlist from the data.table package as well.
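For example, either of these (assuming the respective package is installed) combines the list in one go:
model <- dplyr::bind_rows(models)        # returns a data.frame/tibble
model <- data.table::rbindlist(models)   # returns a data.table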
Further note that I do not have access to bloomberg (currently) and cannot test my code for possible spelling mistakes.
I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is so long (and maybe even incorrect) that solutions which work for other for loops can't be applied to mine.
The overview of what I'm trying to do: I want to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop that selects each year in a range from a huge data set and applies a series of functions to narrow the data down to the observed pairs, adding a column of relatedness values transferred from another data set. Within this loop, a second for loop builds a data frame of all possible paired birds in that year, again with a column of transferred relatedness values. From these two data frames of pairs and relatedness within each year, I want to apply the Wilcoxon test to find a p value for each year, and store those p values in a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure this for loop will work (it seems pretty complex to me, but I am a beginner), though I was told a for loop should work for this situation. I understand there are probably easier or faster ways to do what I'm trying to do, which I also welcome, but I would also like to see how to fix this for loop so it works and how to store its results in a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value
So with your dataset, perhaps putting this at the bottom of your for loop (note that j runs over the years themselves, so index results by matching the year rather than by j):
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
    all_poss_df_geno$relatedness, alternative='greater')$p.value
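A minimal sketch of the whole pattern, with random stand-ins for the real relatedness vectors:
results <- data.frame(year = 2000:2013, pvalue = NA)
for (j in 2000:2013) {
  obs  <- rnorm(20)   # stand-in for observed-pair relatedness in year j
  poss <- rnorm(200)  # stand-in for all-possible-pair relatedness in year j
  results$pvalue[results$year == j] <-
    wilcox.test(obs, poss, alternative = 'greater')$p.value
}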
I have 2 relatively large data frames in R. I'm attempting to merge / find all combos as efficiently as possible. The resulting df turns out to be huge (its length is dim(myDF1)[1]*dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to other solutions, such as the bigmemory package, to work around these memory issues. I have virtually no experience with either of these packages.
Working example - assume I'm working with some data frame that looks similar to USArrests:
library('ff')
library('ffbase')
myNames <- USArrests
myNames$States <- rownames(myNames)
rownames(myNames) <- NULL
Now, I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their rownames later.
myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))
# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)
For example, my first set of states in myDF1 is myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey:
# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)
startTime <- Sys.time()
# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])
# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)
# Sort one DF (to later combine with the other):
myVector2 <- ffsorted(myVector2)
# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)
1) Is there a faster way to sort this?
# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)
# Very fast:
print(Sys.time() - startTime)
2) Is there an alternative to this type of combination (without using RAM)?
Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:
# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this
# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]
# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)
3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to most efficiently accomplish this?
Thanks!
It's pretty much unclear to me what you intend to do with the rowSum and your 3) element, but if you want an efficient and RAM-friendly way to combine 2 ff vectors to get all combinations, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions of 160 million rows x 2 columns in a few seconds.
require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
I want to apply a function that uses multiple columns to certain rows of a data frame, based on the contents of one column. I can, of course, accomplish this with a simple for loop, but I am sure it must be possible to do so more elegantly using one of the apply functions. I just can't quite figure it out.
(data <- data.frame(a = sample(10), b = sample(10), c=NA))
# for every value of b that is greater than 5,
# set c to be equal to a function of a and b, say: 3 * a + b
# otherwise, c = a
for(i in 1:nrow(data)){
if(data$b[i] > 5) {
data$c[i] <- 3*data$a[i]+data$b[i]
} else {
data$c[i] <- data$a[i]
}
}
data
I realize that there are three things going on here: (1) figuring out which rows to perform the function on, (2) performing the function on those rows and (3) performing the alternate function on the other rows. If I could figure out how to apply a function using multiple columns to every row, I could subset the data before I did that.
I thought that code like this would allow me to perform a function using multiple columns:
sapply(data$b, function(b, a) 3*a+b, a=data$a)
#or
lapply(data$b, function(b, a) 3*a+b, a=data$a)
But it returns an n-by-n matrix of numbers (or a list of n vectors, each of length n), and I can't figure out how it calculated them.
I also suspect it's possible to do the selection and the function at the same time (maybe with code like this:
data$c <- sapply(data$b, function(b, table) 3*table$a[b>5] + b[b>5], table=data)
)
But that code results in similar output problems.
I think most of my problems stem from the fact that I am not quite comfortable with the apply functions, especially with multiple arguments, but none of my fiddling has enlightened me.
Thank you!
You can use plyr::ddply (easiest for me) if you need to run functions rowwise.
In this example, as Blue Magister describes, it's probably easier to do it directly:
data$c<-ifelse(data$b > 5, 3 * data$a + data$b, data$a)
But here's a ddply example
require(plyr)
ddply(data, c("a","b"), function(df) ifelse(df$b > 5, 3 * df$a + df$b, df$a))
or
data <- adply(data, 1, transform, c = ifelse(b > 5, 3 * a + b, a))
Or obviously in this case you can just use apply:
data$c <- apply(data, 1, function(x) ifelse(x["b"] > 5, 3 * x["a"] + x["b"], x["a"]))
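For the general question of applying a function over multiple columns row by row, mapply iterates over both columns in parallel; a sketch:
data$c <- mapply(function(a, b) if (b > 5) 3 * a + b else a,
                 data$a, data$b)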
I am relatively new to R and have a complicated situation to solve. I have loaded a list of over 1000 data frames into R and called this list x. I want to take certain data frames from this list and compute the mean and variance of each entire data frame (excluding the first column of each), saving these into two separate vectors. For example, I wish to take the mean and variance of every third data frame in the list, starting from element 3 and going to element 54.
So what I ultimately want are two vectors:
meanvector=c(mean(data frame(3)), mean(data frame(6)),..., mean(data frame(54)))
variancevector=c(var(data frame (3)), var(data frame (6)), ..., var(data frame(54)))
This problem is way above my knowledge level but I am thinking I can do this effectively using some sort of loop but I do not know how to go about making such loop. Any help would be much appreciated! Thank you in advance.
You can use lapply and pass indices as follows:
ids <- seq(3, 54, by = 3)
out <- do.call(rbind, lapply(ids, function(idx) {
  # flatten everything except the first column into one numeric vector
  t <- unlist(x[[idx]][, -1])
  c(mean(t), var(t))
}))
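out is a matrix with one row per selected data frame, so the two vectors you asked for are:
meanvector     <- out[, 1]
variancevector <- out[, 2]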
If x is a list of 1000 data frames, you can use lapply to return the means and variances of a subset of this list.
ix <- seq(3, 54, by = 3)  # every third data frame, from element 3 to element 54
lapply(x[ix], function(df){
  # exclude the first column and flatten the rest to a single vector
  v <- unlist(df[, -1])
  c(mean(v), var(v))
})
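To get the two vectors directly, sapply simplifies the result to a matrix (one column per data frame):
res <- sapply(x[ix], function(df){
  v <- unlist(df[, -1])
  c(mean = mean(v), var = var(v))
})
meanvector     <- res["mean", ]
variancevector <- res["var", ]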