I have a data frame with 3 columns. For each monthly period (in other words, each row), I want to calculate the compounded product of the returns over a selected rolling window of months, where enough history is available. This is the basic structure of the data.
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$return <- runif(774, min=0, max=1)
The result should be a data frame with a new column holding the cumulative return over the selected period for each row. For example, I have shown below a four-month return. The calculation of the 4-month return on 03/30/2012 from the data would be:
((1 + 0.81/100) * (1 + 0.715/100) * (1 + 0.27/100) * (1 + 0.80/100) - 1) * 100
This calculation would then be repeated for each value of the x column.
I ended up using the mutate function, where you can set the rolling window width. In the end, this is the version I wanted:
library(dplyr)
library(zoo)
# Create Test Dataframe
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$performance <- runif(774, min=0, max=1)
This particular code creates a 5-month average on a rolling basis. If you sort by column x you can check and recreate it in Excel.
df <- df %>%
  group_by(x) %>%
  mutate(x_mean = rollmean(performance, 5, fill = NA, align = 'right'))
I also found a way to create a lag (a perf.1.previous column, not shown here) so I could take the 4 values prior to the observation and calculate their mean:
df2 <- df %>%
  mutate(perf.4.previous = rollapply(data = perf.1.previous, width = 4,
                                     FUN = mean, align = "right",
                                     fill = NA, na.rm = TRUE))
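For the compounded return itself (rather than a rolling mean), the same rollapply idea should work with a product in place of the mean. This is only a sketch, assuming the df built above (columns x, y, performance, with returns expressed in percent as in the example calculation) and a made-up helper compound(); it is not taken from the original answer.
library(dplyr)
library(zoo)

# Hypothetical helper: compound a window of percentage returns into one
# cumulative percentage return, i.e. ((1 + r1/100) * ... * (1 + rn/100) - 1) * 100
compound <- function(r) (prod(1 + r / 100) - 1) * 100

df <- df %>%
  group_by(x) %>%                    # one series per asset
  arrange(y, .by_group = TRUE) %>%   # keep the dates in order within each asset
  mutate(ret_4m = rollapplyr(performance, width = 4, FUN = compound, fill = NA)) %>%
  ungroup()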
I want to iterate over several columns of a flextable using the mk_par function. Consider the following example:
library(dplyr)
library(flextable)

tibble(a = c(1:10),
       b1 = letters[1:10],
       b2 = LETTERS[1:10],
       c1 = paste0("new_", letters[1:10]),
       c2 = paste0(LETTERS[1:10], "_new")) %>%
  flextable(col_keys = c("a", "b", "c")) %>%
  mk_par(j = "b", value = as_paragraph(b1, b2)) %>%
  mk_par(j = "c", value = as_paragraph(c1, c2))
I would like to replace the two mk_par() statements with a single expression that takes the argument c("b", "c") and renders the same output. I have managed to rewrite this with a for loop (where tt holds the flextable created above):
for (pref in c("b", "c")) {
  tt <- tt %>%
    mk_par(j = pref,
           value = as_paragraph(.data[[paste0(pref, 1)]],
                                .data[[paste0(pref, 2)]]))
}
but I wonder if there is a one-line expression that does the same thing and integrates smoothly into a dplyr pipe.
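For reference, one possible direction is to fold the prefixes with purrr::reduce() so everything stays inside a single piped expression. This is only a sketch and an assumption on my part; it has not been verified against every flextable version.
library(purrr)

tt <- tibble(a = c(1:10),
             b1 = letters[1:10],
             b2 = LETTERS[1:10],
             c1 = paste0("new_", letters[1:10]),
             c2 = paste0(LETTERS[1:10], "_new")) %>%
  flextable(col_keys = c("a", "b", "c")) %>%
  reduce(c("b", "c"), function(ft, pref) {
    # the same mk_par() call as in the for loop, applied once per prefix;
    # assumes mk_par() can resolve pref from this function's environment
    mk_par(ft, j = pref,
           value = as_paragraph(.data[[paste0(pref, 1)]],
                                .data[[paste0(pref, 2)]]))
  }, .init = .)
Here .init = . hands the flextable built so far to reduce(), and each step returns the updated flextable.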
I have two sets of data. Each contains a column for the name of the molecule and a column for the number of times that molecule appears in the sample. I want to create a scatterplot with the number of times a molecule appears in dataset #1 on the x-axis and the number of times it appears in dataset #2 on the y-axis. If a molecule is in one dataset and not the other, it appears 0 times.
Example:
dat1 <- data.frame(
  name = c("A", "B", "D", "E"),
  count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
  name = c("A", "B", "C", "F"),
  count = c(1, 3, 50, 40)
)
Point #1 would be (10,1) corresponding to A, Point #2 would be (1,3), Point #3 would be (0,50) and so on. I don't want to label my points since my datasets contain tens of thousands of molecules.
Try joining the data.frames:
library(dplyr)
library(ggplot2)

full_join(dat1, dat2, by = "name") %>%
  mutate_all(function(xx) ifelse(is.na(xx), 0, xx)) %>%
  ggplot(aes(count.x, count.y)) +
  geom_point()
which produces a scatterplot of count.x against count.y.
You would need a full_join():
library(dplyr)
library(ggplot2)
#Data
dat1 <- data.frame(
  name = c("A", "B", "D", "E"),
  count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
  name = c("A", "B", "C", "F"),
  count = c(1, 3, 50, 40)
)
#Code
dat1 %>%
  full_join(dat2 %>% rename(count2 = count)) %>%
  replace(is.na(.), 0) %>%
  ggplot(aes(x = count, y = count2)) +
  geom_point() +
  geom_text(aes(label = name), vjust = -0.5)
Output:
Why does this...
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
produce a less detailed report than this...?
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)
Reproducible data, per a user's request. I know it's weird, but I'm making a user-friendly stats program for my students. (Now that I've posted the reproducible code, the site wants me to provide more data.) I didn't originally post this code because I thought it was a bit much and my general coding is rough. My question still stands: why does one give me a short, unhelpful output while the other provides exactly what I want, even though I ran the same command?
library(tidyverse)
library(formattable)  # needed for formattable() below
DV <- c(1,1, 5,6, 1, 2, 7, 7, 1, 4, 9, 9)
IV1 <- c("A","B", "A", "B","A","B", "A", "B","A","B", "A", "B" )
IV2 <- c("C","C","D","D", "C","C", "D","D", "C","C", "D","D")
Participant <- c("A", "A", "A", "A", "B","B","B","B","C","C","C","C")
IV3 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
IV4 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
##### You have to tell the computer if the variable is within!
IV1_iswithin <-"Y"
IV2_iswithin <-"Y"
IV3_iswithin <-"N"
IV4_iswithin <-"N"
####### Your JOB is DONE
data <- data.frame(DV,Participant,IV1,IV2)
#Grouping the dataframe
data %>%
  group_by(IV2, IV1) %>%                                                # subset the data to calculate 4 different sets of stats
  mutate(MAD = median(abs(DV - median(DV)) * 2.5 * 1.4826)) %>%         # calculates 4 different MAD values
  mutate(MADLL = median(DV) - MAD) %>%                                  # calculates the lower limit of the MAD band
  mutate(MADUL = median(DV) + MAD) %>%                                  # calculates the upper limit of the MAD band
  mutate(OutlierPresent = ifelse(DV < MADLL | DV > MADUL, NA, DV)) %>%  # creates NA values if the observation is an outlier
  ungroup() %>%                                                         # converts back to the full data set
  mutate(OutlierPresent = ifelse(DV < MADLL | DV > MADUL, NA, DV)) %>%  # creates NAs again after ungrouping
  mutate(whichgroup = paste(IV1, IV2)) %>%
  mutate(observation = 1:n()) %>%
  {. ->> b}
b %>%
  select(DV, OutlierPresent) %>%
  {. ->> outlierfeedback}
formattable(outlierfeedback)

b %>%
  select(OutlierPresent, IV1, IV2, Participant) %>%
  pivot_wider(names_from = c(IV1, IV2), values_from = OutlierPresent,
              names_sep = "_", id_cols = Participant) %>%
  drop_na() %>%
  pivot_longer(-Participant, names_to = c("IV1", "IV2"), names_sep = "_",
               values_to = "DV") %>%
  {. ->> AnovaAnalysisData}
### Calculating the ANOVA for our outlier free data AnovaAnalysisData
#SettingUpTheListofFixedFactors
FactorModel <- list()
ifelse(length(IV1)> 1, FactorModel<- c(FactorModel, "IV1"), FactorModel<-FactorModel)
ifelse(length(IV2)> 1, FactorModel<- c(FactorModel, "IV2"), FactorModel<-FactorModel)
ifelse(length(IV3)> 1, FactorModel<- c(FactorModel, "IV3"), FactorModel<-FactorModel)
ifelse(length(IV4)> 1, FactorModel<- c(FactorModel, "IV4"), FactorModel<-FactorModel)
#SettingUPTheListofErrorFactors
ErrorModel <- list()
ifelse(IV1_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV1"), ErrorModel<-ErrorModel)
ifelse(IV2_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV2"), ErrorModel<-ErrorModel)
ifelse(IV3_iswithin== "Y",ErrorModel<- c(ErrorModel, "IV3"), ErrorModel<-ErrorModel)
ifelse(IV4_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV4"), ErrorModel<-ErrorModel)
StrStart = "aov(DV ~"
StrFactor<-paste(ErrorModel, collapse='*' )
StrErrorStart<-("+Error(Participant/(")
StrError<-paste(ErrorModel, collapse='*' )
StrErrorEnd<- ("),data=AnovaAnalysisData))")
StringforEvaluation<- paste(StrStart, StrFactor,StrErrorStart,StrError,StrErrorEnd)
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)
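As a diagnostic aid (not part of the original script), printing the constructed string next to the hand-written call makes it easier to compare what each aov() is actually being asked to fit:
# Show the model string that eval(parse(...)) actually fits
cat(StringforEvaluation, "\n")

# The call written out by hand, for side-by-side comparison
cat("aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))\n")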
The data I have contains pairwise distances between different locations (x, y, z) and (a, b, c, d, e, f, g, h, i, j). See below:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
I need to create another data set that includes: 1) the column name, 2) the row name of the minimum, and 3) the three minimum values for each column. So in total, the new data will contain three columns and nine rows. Here are the first three rows:
col_name <- c("x", "x", "x")
row_name <- c("h", "g", "a")
min_val <- c(14.21208, 14.88804, 14.98797)
newdat <- data.frame(col_name, row_name, min_val)
Similarly, this needs to be repeated for columns y and z.
How about this:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
# find indices of smallest values
idx <- sapply(distdat, order)[1:3, ]
# put everything in a data.frame
data.frame(col_name = rep(colnames(distdat), each = 3),
           row_name = row.names(distdat)[c(idx)],
           min_val = distdat[cbind(c(idx), rep(1:3, each = 3))])
Also, with the given seed I could not replicate your example; let me know if I missed something.
It's not real pretty, but this could work:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
library(dplyr)

distdat$row_name <- rownames(distdat)

select(distdat, x, row_name) %>%
  arrange(x) %>%
  head(3) %>%
  mutate(col_name = 'x') %>%
  rename(min_val = x) -> newdat_x

select(distdat, y, row_name) %>%
  arrange(y) %>%
  head(3) %>%
  mutate(col_name = 'y') %>%
  rename(min_val = y) -> newdat_y

select(distdat, z, row_name) %>%
  arrange(z) %>%
  head(3) %>%
  mutate(col_name = 'z') %>%
  rename(min_val = z) -> newdat_z

newdat <- bind_rows(newdat_x, newdat_y, newdat_z)
Certainly, we could (and should) write a function that builds those newdat_ data frames and then run it for each of the variables x, y, and z; a sketch of that idea follows.
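A minimal sketch of that refactoring, assuming dplyr and the distdat with a row_name column from above; the helper name min_three() is made up here:
library(dplyr)

# Hypothetical helper: the three smallest values of one column, together
# with their row names and the column name
min_three <- function(dat, col) {
  dat %>%
    arrange(.data[[col]]) %>%
    head(3) %>%
    transmute(col_name = col,
              row_name = row_name,
              min_val = .data[[col]])
}

newdat <- bind_rows(lapply(c("x", "y", "z"), function(v) min_three(distdat, v)))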
You can use the dplyr and tidyr packages to do this. They make transformations much more readable.
library(dplyr)
library(tidyr)

newdat <- distdat %>%
  mutate(row = rownames(.)) %>%
  gather(col, dist, -row) %>%
  group_by(col) %>%
  arrange(col, dist) %>%
  top_n(-3, dist)
I am trying to replicate the Fama French (1993) paper using R. I need to do the following sorting:
1. for each month,
2. calculate ME decile breakpoints on NYSE stocks only,
3. sort all stocks into the deciles created in 2.
Data generation:
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
                          exchangeCode = exchange[round(runif(n, 1, 3))],
                          Stock = stocks[round(runif(n, 1, 5))],
                          ME = floor(100 * abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month == 1 & df$exchangeCode == "NYSE", "ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month == 1, "ME"]))
cut(x = ME_ALL_vals,
    breaks = c(-Inf, quantile(ME_NYSE_vals, probs = seq(.1, .9, .1)), +Inf),
    labels = 1:10)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
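For what it's worth, here is a rough sketch of how that per-month requirement might be expressed with dplyr; the helper cut_by_nyse() is an assumption introduced here, and it may fail if a month has too few NYSE observations to produce unique decile breaks.
library(dplyr)

# Hypothetical helper: decile labels for all ME values in one month, with
# breakpoints taken from the NYSE observations of that month only
cut_by_nyse <- function(me, exchange) {
  brks <- c(-Inf, quantile(me[exchange == "NYSE"], probs = seq(.1, .9, .1)), Inf)
  cut(me, breaks = brks, labels = 1:10)
}

df %>%
  mutate(ME = as.numeric(as.character(ME))) %>%  # ME comes out of cbind() as character/factor
  group_by(Month) %>%
  mutate(ME_decile = cut_by_nyse(ME, exchangeCode)) %>%
  ungroup()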
If the intention is to keep the whole data frame but generate deciles only for the NYSE values, the code below could do it: deciles are computed only for the entries pertaining to NYSE, while the full data set is kept, achieving a form of partial sorting.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
character.only = TRUE)
# Transformations
df %<>%
  mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
  arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
                          Stock = c("A", "B", "C", "A"),
                          Month = 1:12,
                          ME = rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c( "tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
char = TRUE)
# Data generation ---------------------------------------------------------
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
                          exchangeCode = exchange[round(runif(n, 1, 3))],
                          Stock = stocks[round(runif(n, 1, 5))],
                          ME = floor(100 * abs(rnorm(n)))))
# Transformations ---------------------------------------------------------
# ME ends up as character/factor after cbind(), so convert it back to numeric
df$ME <- as.numeric(as.character(df$ME))
# Generate cuts
dfNtiles <- df %>%
  arrange(exchangeCode, Month, ME) %>%
  group_by(exchangeCode, Month) %>%
  mutate(cutsBsdOnNYSE = cut(x = ME,
                             breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
                                           g = 10, onlycuts = TRUE))) %>%
  ungroup() %>%
  group_by(cutsBsdOnNYSE) %>%
  mutate(grpBsdOnNYSE = n())
It's fairly straightforward and boils down to:
1. generating cut brackets reflecting a subset of the data (the NYSE entries),
2. applying those brackets to the whole ME vector,
3. numbering the obtained groups so a group identifier is created.