I am trying to create two vectors of the 20th and 80th percentiles of monthly return data for companies from 1927 to 2013. The issue I have encountered is that in my nested for loop I don't know how to reference both the month and the year (e.g. the returns across all companies in April 1945). Right now the code looks like this:
qunatile <- function(r){
  vec20 <- rep(0, 1038)
  vec80 <- rep(0, 1038)
  for(i in 1927:2013){
    for(j in 1:12){
      vec20[j + 12*(i - 1927)] <- quantile(r$(i, j), 0.20)
      vec80[j + 12*(i - 1927)] <- quantile(r$(i, j), 0.80)
    }
  }
  data1decilest <- rbind(ps1NYSE, vec20, vec80)
}
But I know that the r$(i, j) notation is not correct. I was wondering if anyone knew how to do what I am attempting with that clearly incorrect code (i.e. reference all returns from a given month in a given year).
Thank you!
One option that would eliminate nesting loops is to create a column in your dataframe that contains a month/year combo (e.g. "Jan1955", "Apr1999", etc.), then split your dataframe on that variable and apply the quantile function to each piece. It's hard to say whether this solves your problem since there is no reproducible example. I assume here your data is called df and contains a date and a value column.
library(lubridate)
library(plyr)
df$newtime <- paste0(month(df$date, label = T, abbr = T), year(df$date))
q20 <- function(df){ quantile(df$value, 0.20) }
q80 <- function(df){ quantile(df$value, 0.80) }
vec20 <- ddply(df, .(newtime), q20)
vec80 <- ddply(df, .(newtime), q80)
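To make that concrete, here is a self-contained sketch with made-up data (the date and value column names and the daily frequency are assumptions on my part, since there is no reproducible example to work from):
library(lubridate)
library(plyr)
# made-up data: three years of daily values
df <- data.frame(date = seq(as.Date("1998-01-01"), as.Date("2000-12-31"), by = "day"))
df$value <- rnorm(nrow(df))
# month/year label for each row, e.g. "Jan1998"
df$newtime <- paste0(month(df$date, label = TRUE, abbr = TRUE), year(df$date))
q20 <- function(df){ quantile(df$value, 0.20) }
q80 <- function(df){ quantile(df$value, 0.80) }
vec20 <- ddply(df, .(newtime), q20)   # one row per month/year label
vec80 <- ddply(df, .(newtime), q80)
head(vec20)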
I have a problem relating to subsetting.
Basically I have a dataset. This toy dataset is a good small example:
df<- data.frame(year = c(1980:2019), randnorm = rnorm(40, 0, 1), count1 = rpois(40, 18),
lograndnorm=(rlnorm(40, 3, 2)))
For each value of year between 2000 and 2019, I want to remove that year's observation and output a subset of the total df data excluding it. I then want to take the removed year and enter it into a model, using the remainder of the data to train the model.
For example, subset_ex2010 might exclude 2010. All data except rows where year == 2010 goes into subset_ex2010, and I can then use that data to predict 2010.
Once those parameters are entered into the model, the output is saved (after the model has run) and the loop does the next year, that is, removes 2009 from the full df dataframe and subsets the remainder.
I've tried:
for(i in 2000:2019){
subset_excl_[i] <- subset(df, year<i | year>i] )
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmmod[i], subset_of_[i])
}
and,
for(i in 2000:2019){
subset_excl_[i] <- [df$year-i]
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmmod[i], subset_of_[i])
}
but both fall over. Any assistance would be gratefully received.
I don't know much about linear modelling, but in both your blocks of code
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmMod[i], subset_of_[i])
You're referring to both lmmod and lmMod. R is case-sensitive.
If that alone doesn't fix it, put a browser() call at the head of the loop and single-step until you find where it's blowing up.
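Beyond that, here's a minimal sketch of how the leave-one-year-out loop could be written so each model and prediction is actually stored (keeping them in lists keyed by year is just one reasonable choice, not the only one):
set.seed(1)
df <- data.frame(year = 1980:2019, randnorm = rnorm(40, 0, 1),
                 count1 = rpois(40, 18), lograndnorm = rlnorm(40, 3, 2))
lmmod    <- list()
distPred <- list()
for (i in 2000:2019) {
  yr    <- as.character(i)
  train <- subset(df, year != i)   # all rows except the held-out year
  test  <- subset(df, year == i)   # the held-out year itself
  lmmod[[yr]]    <- lm(count1 ~ randnorm + lograndnorm, data = train)
  distPred[[yr]] <- predict(lmmod[[yr]], newdata = test)
}
distPred[["2010"]]   # prediction for the year 2010, held out of its fit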
Similar to the do.call/lapply approach here, and the data.table approach here, but both have the setup of:
MainDF with data and startdate/enddate ranges
SubDF with a vector of single dates
Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have
MainDF with data and a vector of single dates
SubDF with startdate/enddate ranges
and I am looking to append summaries to SubDF for the multiple rows of MainDF data which fall within each SubDF range. Example:
library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
by = "days",
length.out = 10),
DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)
Trying the data.table approach from the second link I figure it should be something like:
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]
But it errors with unused arguments for on. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA), but it was 3 repeats of the second value (in my actual data the final row won't run as it doesn't have a value for DateTo). I suspect using MainDF$ means I'm subverting the grouping.
I suspect I'm close but I'm really struggling to get my head around the data.table mindset for complex use cases. The summary stats I'm looking to do are (for example data):
Mean & Max of DataA
length(which(DataA > 3))
difftime(last(Dates), first(Dates), units = "mins")
Dates[which.max(DataA)]
I added the interval line above because data.table's %between% help suggests one might be able to use a Dates %between% interval format, but it doesn't mention intervals/difftimes specifically in the text or examples, and my attempts are already failing elsewhere, so I'm loath to concentrate on improving my running while I can't walk!
I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr's group_by/group_by_if could be used instead? group_by_if's .predicate seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.
Thanks in advance for any help!
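For reference, this is the rough shape of the non-equi join I think the second link is pointing at (only a sketch, untested beyond the toy data above; just two or three of the summary stats are shown, and the lubridate interval column isn't needed for this route):
library(data.table)
MainDT <- as.data.table(MainDF)
SubDT  <- as.data.table(SubDF[, c("DateFrom", "DateTo")])
# for each SubDT row, j is evaluated over the MainDT rows whose Dates fall in that range
res <- MainDT[SubDT,
              on = .(Dates >= DateFrom, Dates <= DateTo),
              .(MeanA  = mean(DataA),
                MaxA   = max(DataA),
                NOver3 = sum(DataA > 3)),
              by = .EACHI]
res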
OK, so let's take the code below, which calculates a rolling simple moving average over a 2-day period:
# Use TTR package to create rolling SMA n day moving average
new.df$close.sma.n2 <- SMA(new.df[,"Close"], 2)
Let's say I want to calculate this for n-day periods of 2:30.
The inputs here are:
close.sma.n**
and also the numerical value for the SMA calculation.
So my question is:
How can I write one line of code that performs the above calculation for different SMA periods and also makes a new column with the corresponding close.sma.n2, n3, n4, n5, n6, etc. values in a data frame?
I understand I can do:
n.sma <- 2:30
and put that variable in:
new.df$close.sma.n2 <- SMA(new.df[,"Close"], n.sma)
Perhaps I can:
name <- "n2:30"
and place that inside:
new.df$close.sma.name <- SMA(new.df[,"Close"], n.sma)
You didn't provide sample data or SMA, so I made up dummy functions to test my code.
df <- data.frame(Close=c(1, 2, 3, 4))
SMA <- function(x, numdays) {numdays}
Then, I wrote a function that takes in the number of days to average, and returns a function that takes a data.frame and takes the SMA over that many days.
getSMA <- function(numdays) {
function(new.df) {
SMA(new.df[,"Close"], numdays)
}
}
Then, create a matrix to put the SMAs in
smas <- matrix(nrow=nrow(df), ncol=0)
and fill it.
for (i in 2:30) {
smas <- cbind(smas, getSMA(i)(df))
}
Then, set the column names to what you want them to be
colnames(smas) <- sapply(2:30, function(n)paste("close.sma.n", n, sep=""))
and bind it with the starting data frame.
df <- cbind(df, smas)
Now you have your original data.frame, plus "close.sma.n2" through ".n30"
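An alternative sketch, assuming the real TTR::SMA and a new.df that has a Close column with enough observations, is to write the columns directly in a small loop rather than building a matrix first:
library(TTR)
for (n in 2:30) {
  new.df[[paste0("close.sma.n", n)]] <- SMA(new.df[, "Close"], n = n)
}
Note that SMA() needs at least n non-NA values, so on the four-row dummy data above this would error; it is meant for the full price series.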
I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but my code is so long (and maybe even incorrect) that the fixes that work for other for loops don't seem to apply to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2)))      # example adapted from ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value   # extract just the p value
So with your dataset, perhaps put something like this at the bottom of your for loop (indexing into the results data frame by year, since j runs over 2000:2013 rather than 1:14):
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
                                                 all_poss_df_geno$relatedness,
                                                 alternative = 'greater')$p.value
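As a self-contained illustration of that pattern (rnorm() stand-ins replace the two relatedness vectors, since the real data isn't reproducible here):
year    <- 2000:2013
results <- data.frame(year = year, pvalue = NA_real_)
for (j in year) {
  obs_rel  <- rnorm(30)    # stand-in for the observed-pair relatedness in year j
  poss_rel <- rnorm(300)   # stand-in for the all-possible-pair relatedness in year j
  results$pvalue[results$year == j] <-
    wilcox.test(obs_rel, poss_rel, alternative = "greater")$p.value
}
results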
I've gotten fairly good with the *apply family of functions, and I've recently learned to use the do.call("rbind", by(... as a wrapper for tapply. I'm working with a large data set (Compustat) and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x){
x <- x[order(x$datadate), ] # sort data into ascending by date
var <- x[,2] # the specific variable name in data.frame x hereby ignored
var.name <- paste(names(x)[2], "lag", sep = '.') # keep variable name
if(length(var)>1){ # no lagging if single observation
lagged <- c(NA, var[1:(length(var)-1)])
datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
datediff <- x$datadate - datelag
y <- data.frame(x$datadate, datediff, lagged) # join lagged variable and difference in YYYYMMDD data
y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA # 2 or more full years difference
y <- y[, c('x.datadate', 'lagged')]
names(y) <- c("datadate", var.name)
} else { y <- c(x$datadate[1], NA); names(y) <- c("datadate", var.name) }
return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x){#' lag vector when all data points are present, in order
if(length(x)>1){
y <- c(NA, x[1:(length(x)-1)])
} else {y <- NA}
return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see the main difference is that the tapply approach doesn't include any datadate information and so the function assumes that all data are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R messes up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
Thank you!
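For what it's worth, one direction that is usually much faster than do.call/by for this kind of grouped lag is data.table's shift(); the sketch below assumes datadate really is the YYYYMMDD integer described above, and mirrors the two-full-years cutoff:
library(data.table)
setDT(df)
setorder(df, gvkey, datadate)                       # sort once, by firm then date
df[, ni.lag       := shift(ni),       by = gvkey]   # previous ni within each firm
df[, datadate.lag := shift(datadate), by = gvkey]
df[datadate - datadate.lag >= 20000, ni.lag := NA]  # gap of 2+ full years: drop the lag
df[, datadate.lag := NULL]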