Calculating Compounded Return by ID in R - r

I am trying to calculate a CAGR value, defined as (Ending/Beginning)^(1/number of years) - 1.
I have a df with the columns "Stock", "date", and "Annual.Growth.Rate". A quick note: I tried to do this with the lag function, but I wasn't able to restart the recursive formula at the beginning of each stock. It'll make more sense looking at the dput:
structure(list(Stock = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
date = structure(c(6L, 2L, 3L, 4L, 5L, 1L, 12L, 8L, 9L, 10L,
11L, 7L), .Label = c("3/28/16", "3/29/12", "3/29/13", "3/29/14",
"3/29/15", "3/30/11", "6/28/16", "6/29/12", "6/29/13", "6/29/14",
"6/29/15", "6/30/11"), class = "factor"), Annual.Growth.Rate = c(0.1,
0.2, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2, 0.14, 0.14, 0.14, 0.14
), Growth = c(110, 132, 145.2, 159.72, 175.692, 193.2612,
130, 156, 177.84, 202.7376, 231.120864, 263.477785), CAGR = c(0.098479605,
0.098479605, 0.098479605, 0.098479605, 0.098479605, 0.098479605,
0.125, 0.125, 0.125, 0.125, 0.125, 0.125)), .Names = c("Stock",
"date", "Annual.Growth.Rate", "Growth.on.100", "CAGR"), class = "data.frame", row.names = c(NA,
-12L))
This is the expected output. Originally there were only the stock, date, and annual growth rate columns. The growth on 100 is not simply a "lag" of the previous row: the first available date is multiplied by a given starter, in this case 100, giving (1+.1)*100 = 110, and each following growth value is the previous value (110) times one plus the next growth rate, e.g. 110*(1+.2) = 132. I can figure out how to do the CAGR using dplyr, but I'm really stuck on growth on 100.

You could use cumprod in a mutate. The starting value of 100 is arbitrary: the whole series is a product, so you can compute the cumulative product of (1 + growth) first and then multiply by the starter.
library(dplyr)

starter <- 100
# Example data: per-stock annual growth rates, one row per date
my.data <- data.frame(stock=c('a','a','a','b','b','b'), growth = c(.1,.2,.1,.1,.1,.1), date = c(1,2,3,1,2,3))
my.data
my.data %>%
  group_by(stock) %>%
  # cumulative product of (1 + rate) in date order, restarted for each stock
  mutate(growth.unit = order_by(date, cumprod(1 + growth)),
         growth = growth.unit * starter) -> new.data
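For comparison, the same cumulative product can be sketched in base R with ave(), which applies a function within each group. This is only a sketch, not the answerer's code: the data frame retypes the Stock and Annual.Growth.Rate columns from the question, rows are assumed to be in date order already, and the CAGR line assumes the convention that seems to reproduce the expected column (last value over first value, raised to 1 over the number of rows per stock).

```r
# Growth rates from the question, assumed to be sorted by date within each stock
df <- data.frame(
  Stock = rep(c("A", "B"), each = 6),
  Annual.Growth.Rate = c(0.1, 0.2, 0.1, 0.1, 0.1, 0.1,
                         0.3, 0.2, 0.14, 0.14, 0.14, 0.14),
  stringsAsFactors = FALSE
)
starter <- 100

# ave() restarts cumprod() for every Stock, so no lag() bookkeeping is needed
df$Growth.on.100 <- starter * ave(1 + df$Annual.Growth.Rate, df$Stock, FUN = cumprod)

# Assumed CAGR convention: (last / first)^(1 / rows per stock) - 1
df$CAGR <- ave(df$Growth.on.100, df$Stock,
               FUN = function(g) (g[length(g)] / g[1])^(1 / length(g)) - 1)
```

Because cumprod() is applied per stock, the first row of each stock starts again from the starter value, which is the part that was hard to express with lag().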

Related

How can I add vertical and horizontal asterisks and lines to ggplot2?

I have a line plot with errorbars in ggplot2 and I would like to add lines and asterisks to show which comparisons were significant. I have done my testing in a different program, so I am looking for something other than "ggsignif" because it wouldn't reflect my significance testing.
This is a minimal example of my code so far:
data = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("no", "yes"), class = "factor"),
factorA = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("circle", "square"), class = "factor"),
factorB = structure(c(3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L), .Label = c("blue", "green", "red", "yellow"), class = "factor"),
rating = c(0.4, 0.1, 0.3, 0.7, 0.3, 0.15, 0.5, 0.47, 0.56, 0.34, 0.48, 0.74, 0.60, 0.45, 0.37, 0.57),
se = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1)),
row.names = c(NA, -16L), class = "data.frame")
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(data, aes(x=factorA, y=rating, colour=group, group=group)) +
  facet_grid(. ~ factorB) +
  geom_errorbar(aes(ymin=rating-se, ymax=rating+se), colour="black", width=.25) +
  geom_line() +
  geom_point()
And this is kind of what I am looking for:
I am also open to other suggestions how to reflect these significant comparisons:
no versus yes in the blue and the red condition
circle versus square in the yellow condition
Two thoughts:
For specifying comparisons and significance within plots, stat_compare_means from the ggpubr package is fantastic (see the package vignette). It works either with a ggpubr plot (e.g. ggline) or added to any other ggplot object.
library(ggpubr)
# Pairwise comparisons: Specify the comparisons you want
my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )
ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "npg") +
  # Add pairwise comparisons p-value
  stat_compare_means(comparisons = my_comparisons, label.y = c(29, 35, 40)) +
  stat_compare_means(label.y = 45) # Add global Anova p-value
More broadly, however, it's a busy plot (rating vs. factorA, faceted by factorB and grouped by group), and I'm not sure it's clear which comparisons you are making. For example, in the yellow facet the p-value is for the aggregate difference between circle and square, but the error bars show circle vs. square split further by group (yes/no).
Within blue and red, the plot then appears to be comparing yes vs. no.
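If the significance results come from another program, one option is to draw the markers by hand: put the coordinates of each line and asterisk in small data frames that carry the faceting variable, so each marker lands only in its own facet. The sketch below makes several assumptions that are not in the question: the y positions, the x offset of 2.3 for the vertical brackets, and the choice to draw the no-versus-yes brackets at the square level are all placeholders.

```r
library(ggplot2)

# Data retyped from the question
dat <- data.frame(
  group   = rep(c("no", "yes"), each = 8),
  factorA = rep(rep(c("square", "circle"), each = 4), times = 2),
  factorB = rep(c("red", "blue", "green", "yellow"), times = 4),
  rating  = c(0.4, 0.1, 0.3, 0.7, 0.3, 0.15, 0.5, 0.47,
              0.56, 0.34, 0.48, 0.74, 0.60, 0.45, 0.37, 0.57),
  se      = 0.1,
  stringsAsFactors = FALSE
)

# Horizontal bracket: circle vs square in the yellow facet (positions are guesses)
h_sig <- data.frame(factorB = "yellow", x = 1, xend = 2, y = 0.95,
                    stringsAsFactors = FALSE)

# Vertical brackets: no vs yes in the blue and red facets, drawn right of "square"
v_sig <- data.frame(factorB = c("blue", "red"), x = 2.3,
                    y = c(0.10, 0.40), yend = c(0.34, 0.56),
                    stringsAsFactors = FALSE)

p <- ggplot(dat, aes(x = factorA, y = rating, colour = group, group = group)) +
  facet_grid(. ~ factorB) +
  geom_errorbar(aes(ymin = rating - se, ymax = rating + se),
                colour = "black", width = 0.25) +
  geom_line() +
  geom_point() +
  # inherit.aes = FALSE keeps the annotation layers independent of colour/group
  geom_segment(data = h_sig, aes(x = x, xend = xend, y = y, yend = y),
               inherit.aes = FALSE) +
  geom_text(data = h_sig, aes(x = (x + xend) / 2, y = y + 0.03),
            label = "*", inherit.aes = FALSE) +
  geom_segment(data = v_sig, aes(x = x, xend = x, y = y, yend = yend),
               inherit.aes = FALSE) +
  geom_text(data = v_sig, aes(x = x + 0.1, y = (y + yend) / 2),
            label = "*", inherit.aes = FALSE)
```

Printing p draws the plot. Because the x axis is discrete, circle sits at position 1 and square at position 2, which is why the numeric x values above line up with the categories.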

In R I want to select max timepoints, while grouping?

I want to select the minimum Timepoint, maximum Timepoint and the duration (difference between max and min) grouped by Replicate, Stimulus, Attribute and Complexity
structure(list(Replicate = c(1L, 1L, 1L, 1L, 1L, 1L), Stimulus = c(1L,
1L, 1L, 1L, 1L, 1L), Subject = c("S001", "S001", "S001", "S001",
"S001", "S001"), Attribute = c("Soft", "Soft", "Soft", "Soft",
"Soft", "Soft"), Timepoint = c(0.77, 0.78, 0.79, 0.8, 0.81, 0.82
), Dominant = c(1L, 1L, 1L, 1L, 1L, 1L), Complexity = c(2L, 2L,
2L, 2L, 2L, 2L)), row.names = c(NA, 6L), class = "data.frame")
I am using the following code
library(dplyr)

modified_tds_merged2 <- tds_merged.df %>%
  as.data.frame() %>%
  # strip any letters from Timepoint and convert it to numeric
  mutate(Timepoint = as.numeric(gsub("[a-zA-Z]+", "", Timepoint))) %>%
  group_by(Replicate, Stimulus, Subject, Attribute, Complexity) %>%
  summarise(
    start_time = min(Timepoint),
    end_time = max(Timepoint),
    duration = end_time - start_time,
    n = n()
  ) %>%
  ungroup()
However, the result is incorrect: the end times are often wrong and seem random, and the intervals overlap even though there can be only one Complexity rating at any timepoint. Here is an example of the result. You can see that for S008, Crumbly_Particles, the Complexity rating of 3 goes from 0.47 to 0.71 and the Complexity rating of 4 goes from 0.51 to 0.66. When I check back in tds_merged.df, Complexity rating 3 should run from .47 to .50 and from .67 to .71, and Complexity rating 4 from .51 to .66, which is correct. So it looks like my code doesn't specify that if the Complexity rating changes from 3 to 4 and back to 3, the two runs of 3 need to be calculated separately.
structure(list(Replicate = c(1L, 1L, 1L, 1L, 1L, 1L), Stimulus = c(1L,
1L, 1L, 1L, 1L, 1L), Subject = c("S001", "S004", "S004", "S008",
"S008", "S008"), Attribute = c("Soft", "Crumbly_Particles", "Soft",
"Crumbly_Particles", "Crumbly_Particles", "Crunchy"), Complexity = c(2L,
2L, 2L, 3L, 4L, 2L), start_time = c(0.77, 0.95, 0.19, 0.47, 0.51,
0.79), end_time = c(0.99, 0.99, 0.94, 0.71, 0.66, 0.82), duration = c(0.22,
0.04, 0.75, 0.24, 0.15, 0.0299999999999999), n = c(23L, 5L, 76L,
9L, 16L, 4L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
Sorry for the long "question"! Hope someone can help. Maybe the problem is using max and min; is there another function? I am new to R.
"So it looks like my code doesn't specify that if the Complexity rating changes". Correct. group_by pools every row that shares the same combination of grouping values, wherever those rows sit in the data, so both runs of Complexity 3 end up in one group. The solution is to introduce another variable, RunID for example, that changes every time Complexity changes within Replicate, Stimulus, Subject and Attribute, and to add it to the grouping.
As suggested above, I used the rleid function (from data.table) to create a new group ID variable that handles the situation in my question.
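The run ID idea can be sketched in base R with rle(), which gives a similar result to rleid for a single column. The data below are hypothetical, loosely modelled on the S008 example from the question, and cover only one Subject/Attribute combination. With several combinations the run ID would have to restart or change across them as well, which data.table::rleid() handles when given all the grouping columns.

```r
# Hypothetical rows for one Subject/Attribute, already sorted by Timepoint
tds <- data.frame(
  Subject    = "S008",
  Attribute  = "Crumbly_Particles",
  Timepoint  = c(0.47, 0.48, 0.50, 0.51, 0.60, 0.66, 0.67, 0.71),
  Complexity = c(3L, 3L, 3L, 4L, 4L, 4L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Run ID: a counter that increases each time Complexity differs from the previous row
r <- rle(tds$Complexity)
tds$RunID <- rep(seq_along(r$lengths), times = r$lengths)

# Summarise each run separately, so the two runs of Complexity 3 stay apart
runs <- do.call(rbind, lapply(split(tds, tds$RunID), function(d) {
  data.frame(
    RunID      = d$RunID[1],
    Complexity = d$Complexity[1],
    start_time = min(d$Timepoint),
    end_time   = max(d$Timepoint),
    n          = nrow(d)
  )
}))
runs$duration <- runs$end_time - runs$start_time
```

Here runs has three rows (3, then 4, then 3 again) instead of the two overlapping rows that grouping by Complexity alone produces.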

Multiple vertical shaded area

I am plotting the proportion of deep sleep (y axis) vs days (x axis). I would like to add vertical shaded areas for better understanding (e.g. grey for weekends, orange for sick periods...).
I have tried using geom_ribbon (I created a variable that takes the value 30, which is the top of my y axis, when the day falls on a weekend; that information is given in another column), but instead of rectangles I get trapezoids.
In another post, someone proposed using "geom_rect", or "annotate" if one knows the x and y coordinates, but I don't see how to adapt it to my case, where I want the colored area repeated for every weekend (it is not exactly every 7 days because some data are missing).
Do you have any idea?
Many thanks in advance!
ggplot(Sleep.data, aes(x = DATEID)) +
geom_line(aes(y = P.DEEP, group = 1), col = "deepskyblue3") +
geom_point(aes(y = P.DEEP, group = 1, col = Sign.deep)) +
guides(col=FALSE) +
geom_ribbon(aes(ymin = min, ymax = max.WE), fill = '#6495ED80') +
facet_grid(MONTH~.) +
geom_hline(yintercept = 15, col = "forestgreen") +
geom_hline(yintercept = 20, col = "forestgreen", linetype = "dashed") +
geom_vline(xintercept = c(7,14,21,28), col = "grey") +
scale_x_continuous(breaks=seq(0,28,7)) +
scale_y_continuous(breaks=seq(0,30,5)) +
labs(x = "Days",y="Proportion of deep sleep stage", title = "Deep sleep")
[Figure: proportion of deep sleep vs time]
head(Sleep.data):
> dput(head(Sleep.data))
structure(list(DATE = structure(c(1L, 4L, 7L, 10L, 13L, 16L), .Label = c("01-Dec-17",
"01-Feb-18", "01-Jan-18", "02-Dec-17", "02-Feb-18", "02-Jan-18",
"03-Dec-17", "03-Feb-18", "03-Jan-18", "04-Dec-17", "04-Feb-18",
"04-Jan-18", "05-Dec-17", "05-Feb-18", "05-Jan-18", "06-Dec-17",
"06-Feb-18", "06-Jan-18", "07-Dec-17", "07-Feb-18", "07-Jan-18",
"08-Dec-17", "08-Jan-18", "09-Dec-17", "09-Feb-18", "09-Jan-18",
"10-Dec-17", "10-Jan-18", "11-Dec-17", "11-Feb-18", "11-Jan-18",
"12-Dec-17", "12-Jan-18", "13-Dec-17", "13-Feb-18", "13-Jan-18",
"14-Dec-17", "14-Feb-18", "14-Jan-18", "15-Dec-17", "15-Jan-18",
"16-Dec-17", "16-Jan-18", "17-Dec-17", "17-Jan-18", "18-Dec-17",
"18-Jan-18", "19-Dec-17", "19-Jan-18", "20-Dec-17", "21-Dec-17",
"21-Jan-18", "22-Dec-17", "22-Jan-18", "23-Dec-17", "23-Jan-18",
"24-Dec-17", "24-Jan-18", "25-Dec-17", "25-Jan-18", "26-Dec-17",
"26-Jan-18", "27-Dec-17", "27-Jan-18", "28-Dec-17", "28-Jan-18",
"29-Dec-17", "29-Jan-18", "30-Dec-17", "30-Jan-18", "31-Dec-17",
"31-Jan-18"), class = "factor"), DATEID = 1:6, MONTH = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("Decembre", "Janvier", "Février"
), class = "factor"), DURATION = c(8.08, 7.43, 6.85, 6.23, 7.27,
6.62), D.DEEP = c(1.67, 1.37, 1.62, 1.75, 1.95, 0.9), P.DEEP = c(17L,
17L, 21L, 24L, 25L, 12L), STIMS = c(0L, 0L, 0L, 0L, 390L, 147L
), D.REM = c(1.7, 0.95, 0.95, 1.43, 1.47, 0.72), P.REM = c(17L,
11L, 12L, 20L, 19L, 9L), D.LIGHT = c(4.7, 5.12, 4.27, 3.05, 3.83,
4.98), P.LIGHT = c(49L, 63L, 55L, 43L, 49L, 66L), D.AWAKE = c(1.45,
0.58, 0.47, 0.87, 0.37, 0.85), P.AWAKE = c(15L, 7L, 6L, 12L,
4L, 11L), WAKE.UP = c(-2L, 0L, 2L, -1L, 3L, 1L), AGITATION = c(-1L,
-3L, -1L, -2L, 2L, -1L), FRAGMENTATION = c(1L, -2L, 2L, 1L, 0L,
-1L), PERIOD = structure(c(3L, 3L, 4L, 4L, 4L, 4L), .Label = c("HOLIDAYS",
"SICK", "WE", "WORK"), class = "factor"), SPORT = structure(c(2L,
1L, 2L, 2L, 2L, 1L), .Label = c("", "Day", "Evening"), class = "factor"),
ACTIVITY = structure(c(6L, 1L, 3L, 4L, 5L, 1L), .Label = c("",
"Bkool", "eBike", "Gym", "Natation", "Run"), class = "factor"),
TABLETS = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), Ratio = c(1.15,
2.36, 3.45, 2.01, 5.27, 1.06), Sign = structure(c(2L, 2L,
2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
Sign.ratio = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), Sign.deep = structure(c(2L, 2L,
2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
Sign.awake = structure(c(1L, 2L, 2L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), Sign.light = structure(c(2L, 1L,
1L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"),
index = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), min = c(0, 0, 0, 0, 0, 0), max.WE = c(30,
30, 0, 0, 0, 0)), .Names = c("DATE", "DATEID", "MONTH", "DURATION",
"D.DEEP", "P.DEEP", "STIMS", "D.REM", "P.REM", "D.LIGHT", "P.LIGHT",
"D.AWAKE", "P.AWAKE", "WAKE.UP", "AGITATION", "FRAGMENTATION",
"PERIOD", "SPORT", "ACTIVITY", "TABLETS", "Ratio", "Sign", "Sign.ratio",
"Sign.deep", "Sign.awake", "Sign.light", "index", "min", "max.WE"
), row.names = c(NA, 6L), class = "data.frame")
Thanks for adding the data, that makes it easier to understand exactly what you're working with and to confirm that an answer actually addresses your question.
I thought it would be helpful to make a separate table with just the start and end of each contiguous set of rows with the same PERIOD. I did this using dplyr::case_when, assuming we should mark dates as a "start" if they are the first row in the table (row_number() == 1), or they have a different PERIOD value than the prior row. I mark dates as an "end" if they are the last row of the table, or have a different PERIOD than the next row. I only keep the starts and ends, and spread these into new columns called start and end.
library(tidyverse)
Period_ranges <- Sleep.data %>%
  mutate(period_status = case_when(row_number() == 1 ~ "start",
                                   PERIOD != lag(PERIOD) ~ "start",
                                   row_number() == n() ~ "end",
                                   PERIOD != lead(PERIOD) ~ "end",
                                   TRUE ~ "other")) %>%
  filter(period_status %in% c("start", "end")) %>%
  select(DATEID, PERIOD, period_status) %>%
  mutate(PERIOD_NUM = cumsum(PERIOD != lag(PERIOD) | row_number() == 1)) %>%
  spread(period_status, DATEID)
# Output based on sample data only. If there's a problem with the full data, please add more. To share full data, use `dput(Sleep.data)` or to share 20 rows use `dput(head(Sleep.data, 20))`.
> Period_ranges
  PERIOD PERIOD_NUM end start
1     WE          1   2     1
2   WORK          2   6     3
We can now use that in the plot. If you want to toggle the inclusion of different PERIOD types, or fiddle with their appearance separately, you could modify the code below to use, for example, Period_ranges %>% filter(PERIOD == "WE").
ggplot(Sleep.data, aes(x = DATEID)) +
  # Here I specify that this geom should use its own data.
  # I start the rectangles half a day before and end half a day after to fill the space.
  geom_rect(data = Period_ranges, inherit.aes = FALSE,
            aes(xmin = start - 0.5, xmax = end + 0.5,
                ymin = 0, ymax = 30,
                fill = PERIOD), alpha = 0.5) +
  # Here we can specify the shading color for each type of PERIOD
  scale_fill_manual(values = c(
    "WE" = '#6495ED80',
    "WORK" = "gray60"
  )) +
  # rest of your code
Chart based on data sample:
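The start/end table of contiguous periods can also be derived in base R with rle(), which may be easier to follow than the case_when/spread pipeline. This is a sketch under assumptions: the small data frame retypes only DATEID and PERIOD from the six sample rows, and rows are assumed to be sorted by DATEID. Note that rle() needs a character vector, so a factor PERIOD column would first have to go through as.character().

```r
# PERIOD values of the six sample rows, in DATEID order
sleep <- data.frame(
  DATEID = 1:6,
  PERIOD = c("WE", "WE", "WORK", "WORK", "WORK", "WORK"),
  stringsAsFactors = FALSE
)

# rle() finds consecutive runs of identical PERIOD values
runs <- rle(sleep$PERIOD)
last_row  <- cumsum(runs$lengths)          # row index where each run ends
first_row <- last_row - runs$lengths + 1   # row index where each run starts

period_ranges <- data.frame(
  PERIOD = runs$values,
  start  = sleep$DATEID[first_row],
  end    = sleep$DATEID[last_row],
  stringsAsFactors = FALSE
)
```

The resulting period_ranges has one row per contiguous period (WE from 1 to 2, WORK from 3 to 6 for the sample) and can be passed to geom_rect as its data argument in the same way as the table built with dplyr.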

Re-levelling in R for a xtab based on a condition

For a sample dataframe:
df <- structure(list(region = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("a", "b", "c", "d"), class = "factor"),
result = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L), weight = c(0.126,
0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)), .Names = c("region",
"result", "weight"), row.names = c(NA, 11L), class = "data.frame")
I draw a cross tabulation using:
df$region <- factor(df$region)
result <- xtabs(weight ~ region + result, data=df)
result
However, I want the regions of the xtab ordered by the percentage of 1s in each region (here the weighted 1s make up 29% of region a and 33% of region b). The xtab should therefore be reordered so that region b comes first, then a.
I know I could use relevel, however this would be dependent on me looking at the result and re-levelling where appropriate.
Instead I want this to be automatic in the code and not dependent on the user (as this code will be running lots of times, and completing further analysis on the resulting xtab).
If anyone has any ideas, I would greatly appreciate it.
You can reorder the rows of the xtab by the values in its second column using order, as follows:
result[order(result[, 2], decreasing = TRUE), ]
order returns the row indices that sort those values; decreasing = TRUE sorts them from largest to smallest.
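Note that ordering on result[, 2] sorts by the weighted total of 1s rather than by their share within each region; the two happen to agree for this sample. A minimal sketch of ordering by the share instead uses prop.table() with margin = 1 to get row proportions. The names tab, share_1 and tab_sorted are illustrative, and the data frame retypes the sample from the question.

```r
# Sample data retyped from the question
df <- data.frame(
  region = factor(c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b")),
  result = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L),
  weight = c(0.126, 0.5, 0.8, 1.5, 5.3, 2.2, 3.2, 1.1, 0.1, 1.3, 2.5)
)

tab <- xtabs(weight ~ region + result, data = df)

# Row proportions: weighted share of result == 1 within each region
share_1 <- prop.table(tab, margin = 1)[, "1"]

# Reorder the table rows from the highest share to the lowest
tab_sorted <- tab[order(share_1, decreasing = TRUE), ]

# Optionally relevel the factor itself so later xtabs() calls keep this order
df$region <- factor(df$region, levels = names(sort(share_1, decreasing = TRUE)))
```

Since the ordering is computed from the table itself, nothing has to be inspected or relevelled by hand when the code is rerun on new data.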

How to display value in a stacked bar chart by using geom_text?

I would like to display the percentage figures in the stacked bar. However, one group has a really low percentage, so two values overlap each other. I changed to position = 'identity', but it still won't work. Any thoughts?
x4.can.m <- structure(list(canopy = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("0%", "1 to 84%",
"85% +"), class = "factor"), YearQuarter = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("2011-09-01",
"2011-12-01", "2012-03-01", "2012-06-01", "2012-09-01"), class = "factor"),
value = c(0.51, 0.01, 0.48, 0.52, 0.01, 0.47, 0.53, 0.01,
0.47, 0.57, 0.01, 0.41, 0.61, 0.01, 0.38)), .Names = c("canopy",
"YearQuarter", "value"), row.names = c(NA, -15L), class = "data.frame")
x4.can.bar <- ggplot(data=x4.can.m, aes(x=factor(YearQuarter), y=value,fill=canopy)) + geom_bar(stat="identity",position = "stack",ymax=100)
x4.can.bar+scale_y_continuous(formatter='percent')+
labs(y="Percentage",x="Year Quarter") +
geom_text(aes(label =paste(round(value*100,0),"%",sep="")),size = 3, hjust = 0.5, vjust = 4,position ="identity")
You need to specify reasonable values for the placement of the labels - if you do this outside the ggplot call, it will be far easier than trying to do so within the call.
You can do this by taking the midpoint of each stacked component.
Using plyr and ddply, this is as simple as taking the cumulative sum and subtracting half the current value within each YearQuarter:
library(plyr)
library(ggplot2)
library(scales)  # provides percent for the axis labels
# csum = midpoint of each stacked segment within a YearQuarter
x4.can.m <- ddply(x4.can.m, .(YearQuarter), mutate, csum = cumsum(value)-value/2)
x4.can.bar <- ggplot(data=x4.can.m, aes(x=factor(YearQuarter), y=value,fill=canopy)) +
  geom_bar(stat="identity",position = "stack",ymax=100)
x4.can.bar +
  scale_y_continuous(expand = c(0,0), labels = percent) +
  labs(y="Percentage",x="Year Quarter")+
  geom_text(aes(y = csum,label =paste(round(value*100,0),"%",sep="")),
            size = 3, hjust = 1, vjust = 0)
Note that I am using ggplot2_0.9.2.1, so formatter is no longer a valid argument to scale_y_continuous; it has been replaced by labels = percent (percent comes from the scales package).
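In more recent ggplot2 releases than the 0.9.2.1 used above, the midpoints do not have to be computed by hand: position_stack() takes a vjust argument, and vjust = 0.5 centers each label in its segment. The sketch below assumes such a newer ggplot2 (one that has geom_col() and position_stack(vjust = )), and the data frame x4 retypes the data from the question.

```r
library(ggplot2)

# Data retyped from the question
x4 <- data.frame(
  canopy      = factor(rep(c("0%", "1 to 84%", "85% +"), times = 5)),
  YearQuarter = rep(c("2011-09-01", "2011-12-01", "2012-03-01",
                      "2012-06-01", "2012-09-01"), each = 3),
  value       = c(0.51, 0.01, 0.48, 0.52, 0.01, 0.47, 0.53, 0.01,
                  0.47, 0.57, 0.01, 0.41, 0.61, 0.01, 0.38),
  stringsAsFactors = FALSE
)

p <- ggplot(x4, aes(x = YearQuarter, y = value, fill = canopy)) +
  geom_col() +  # geom_col() stacks by default
  # vjust = 0.5 places each label at the vertical midpoint of its segment
  geom_text(aes(label = paste0(round(value * 100), "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percentage", x = "Year Quarter")
```

The label positions follow whatever stacking order the bars use, so they stay aligned with their segments. The 1% segments are still very thin, so their labels may touch the neighbouring ones; a smaller text size or dropping labels below some threshold are possible workarounds.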
One solution is to change the stacked bar to a dodged one:
x4.can.bar <- ggplot(data=x4.can.m, aes(x=factor(YearQuarter), y=value,fill=canopy)) +
geom_bar(stat="identity",position = "dodge",ymax=100) +
geom_text(aes(label =paste(round(value*100,0),"%",sep=""),ymax=0),
position=position_dodge(width=0.9), vjust=-0.25)
x4.can.bar
