Background
I get hourly interval reports on equipment in buildings: a lot of buildings and a lot of equipment. Each parameter on a piece of equipment is called a point, and the points already have names; I don't get to choose them. Each point name is unique. What I'm trying to do is run a standard report on each building. Eventually, I'd like to move this to Shiny, look at my graphs there, and maybe print a report, but... baby steps.
Question
Am I on the right track? Is there a more efficient way of doing this? Am I going to run into problems when I start to write Markdown reports or transfer this over to Shiny?
Sample Code
library(tidyverse)
set.seed(55)
test_func <- function(pointa, pointb, mult) {
  out <- (pointb - pointa) * mult
  return(out)
}

test_fail <- function(pointa, pointb) {
  out <- ifelse(pointa > (pointb - 9), 1, 0)
  return(out)
}
tbl.data <- data.frame(
date = c(rep("2/1/2018", 24),
rep("2/2/2018", 24),
rep("2/3/2018", 24),
rep("2/4/2018", 24),
rep("2/5/2018", 24),
rep("2/6/2018", 24),
rep("2/7/2018", 24)),
hour = rep(0:23, 7),
equipa.vala = runif(168, min = 50, max = 60),
equipb.vala = runif(168, min = 50, max = 60)
) %>%
  mutate(
    equipa.valb = 10 + equipa.vala * runif(168, min = 0.75, max = 1.25),
    equipb.valb = 10 + equipb.vala * runif(168, min = 0.75, max = 1.25)
  )
tbl.equip <- data.frame(
equipment.id = c(1,2),
equipment.name = c("equipa", "equipb"),
equipment.mult = c(5, 7)
)
tbl.point <- data.frame(
point = c("equipa.vala", "equipa.valb", "equipb.vala", "equipb.valb"),
equipment = c("equipa", "equipa", "equipb", "equipb"),
category = c("vala", "valb", "vala", "valb")
)
for (eq in tbl.equip[,2]) {
  vala <- as.character(
    tbl.point$point[tbl.point$equipment == eq &
                      tbl.point$category == "vala"]
  )
  valb <- as.character(
    tbl.point$point[tbl.point$equipment == eq &
                      tbl.point$category == "valb"]
  )
  equip.mult <- as.numeric(
    tbl.equip$equipment.mult[tbl.equip$equipment.name == eq]
  )
  for.data <- tbl.data %>%
    select_(cola = vala,
            colb = valb) %>%
    mutate(
      result = test_func(cola, colb, equip.mult),
      fault = test_fail(cola, colb)
    )
  score <- sum(for.data$fault) / length(for.data$fault)
  savings <- sum(for.data$result[for.data$result > 0])
  p1 <- ggplot(for.data, aes(x = colb, y = cola, color = as.factor(fault))) +
    geom_point() +
    annotate("text", label = paste("savings is:", savings), x = 50, y = 60) +
    annotate("text", label = paste("score is:", score), y = 51, x = 80) +
    ggtitle(paste("Equipment:", eq)) +
    theme_minimal()
  print(p1)
}
Explanation
So in this sample, the tbl.data data frame would be the data I receive from each building. I'd have to manually create the tbl.equip and tbl.point data frames, which I'd house in *.csv files on my machine or in a database (and be able to add/edit in Shiny). There's no standard for point names, and there's no guarantee that each piece of equipment has each point, so using select() helpers such as contains() or starts_with() is out of the question.
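One thing I'm aware of: select_() is deprecated in newer versions of dplyr, so the column lookup in the loop above could presumably be written with all_of() instead. A minimal, untested sketch of that idea:

# same lookup-by-name as in the loop, but without the deprecated select_();
# all_of() errors if a looked-up point name is missing from tbl.data,
# while any_of() would silently drop it
for.data <- tbl.data %>%
  select(cola = all_of(vala), colb = all_of(valb))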
So I just created an Equipment table, which holds parameters for the equipment (in this case a multiplier). There's also a Point table, which says which piece of equipment and which category each point belongs to.
For this simple example, there are two sample functions I included. One calculates a value based on the data; the other tests for a fault. My biggest problem in the past has been that when a piece of equipment doesn't have a point, execution stops, and I have to go in manually and take that equipment out, or work around it some other way. I guess the fix is to test that the point exists (with exists() or something similar) before running that piece of code, as sketched below.
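For instance, something like this at the top of the loop body (after vala and valb are looked up) is roughly what I have in mind; it's an untested sketch, and the warning wording is made up:

# guard for missing points: skip this equipment if no point name was found
# in tbl.point, or if the looked-up name isn't a column in tbl.data
if (length(vala) == 0 || length(valb) == 0 ||
    !all(c(vala, valb) %in% names(tbl.data))) {
  warning(paste("Skipping equipment", eq, "- missing point(s)"))
  next
}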
Again, for this simple example I just printed a plot, but the output could be a Markdown document (which I think I've done before, but not like this) or a Shiny app (I've built some simpler apps).
Conclusion
The big question is: is this the "right" way of doing it? I'm sure this is a pretty common situation, and there has to be a really efficient method I'm not using. What's going to set me up for success when I start writing code to print out reports, or when I take this into a Shiny app?
Related
I'm using the targets pipelining system in R and am wondering how to branch statically in an optimal way. I have a set of parameters for which I'd like to compute results for most, but not all, interactions. Notice how N_source_components and N_target_components aren't used by the agg_neighbourhoods target, but they are used by other targets that I didn't include in this example. With the current setup, agg_neighbourhoods will be run too many times (targets doesn't understand that not all columns in the values argument of tar_map are relevant for all targets, right?). Is there a smarter way?
I already tried nesting another tar_map call within the currently shown one, to which N_source_components and N_target_components get relegated. This fixes the redundant executions of agg_neighbourhoods, but doesn't allow me to filter undesirable combinations like I'm doing now because the value of query isn't known at 'compilation' time.
Many thanks :)
tar_map(
values = tidyr::expand_grid(
query = c('6369', '6489', '6493'),
k = c(10, 30, 50),
d = c(5, 10, 15),
genelist = c(
'informativeV15',
'informativeV15_monotonic',
'informativeV15_monoreporter'
),
N_source_components = 10L,
N_target_components = as.integer(c(3, 5))
) %>%
dplyr::filter(
!(query %in% c('6369') & N_target_components > 3)) %>%
{ . },
tar_target(agg_neighbourhoods, {
f(
so = tar_read(so_target, branch = e2i(query))[[1]],
genelist = genelist,
k = k,
d = d
)
}, iteration = 'list')
)
Hopefully this is helpful to someone: in simpler terms, my problem was that targets were being run needlessly, because I need to filter out some parameter combinations of the target instantiations and because not all parameters are used by all targets. A simpler, complete example of the scenario would be:
tar_map(
  values = tidyr::expand_grid(A = 1:2, B = 1:4) %>%
    dplyr::filter(!(A == 2 & B > 2)),
  tar_target(tarX, A*3),
  tar_target(tarY, A*4 + B^2)
)
tarX is being run for each value of B, whereas only one evaluation per value of A is required. However, since the values of both A and B are informative as to which combinations aren't required, we have to pre-specify the required targets.
Seeing the 'problem' in this cleaner, abstracted representation makes a solution more obvious: just do two calls to tar_map, each operating on a tailored selection of columns of the parameter grid.
param_grid <-
  tidyr::expand_grid(A = 1:2, B = 1:3) %>%
  dplyr::filter(!(A == 2 & B > 2))
list(
tar_map(
values = param_grid %>%
dplyr::select(-B) %>%
dplyr::distinct(),
tar_target(tarX, A*3)
),
tar_map(
values = param_grid,
tar_target(tarY, A*4 + B^2)
)
)
Perhaps there are other solutions as well. I'd be happy to hear them.
I want to make a little summary table for my colleagues in R Markdown using qwraps2::summary_table. The data.frame contains information on different exposures. All the variables are coded as binary.
library(qwraps2)
library(dplyr)
pop <- rbinom(n = 1000, size = 1, prob = runif(n = 10, min = 0, max = 1))
exp <- rbinom(n = 1000, size = 1, prob = .5)
ID <- c(1:500)
therapy <- factor(sample(x = pop, size = 500, replace = TRUE), labels = c("Control", "Intervention"))
exp_1 <- sample(x = exp, size = 500, replace = TRUE)
exp_2 <- sample(x = exp, size = 500, replace = TRUE)
exp_3 <- sample(x = exp, size = 500, replace = TRUE)
exp_4 <- sample(x = exp, size = 500, replace = TRUE)
df <- data.frame(ID, exp_1, exp_2, exp_3, exp_4, therapy)
head(df)
In the next step, I create a simple summary table as follows. In the table I want to have the groups (control vs. intervention) as columns and the exposures as rows:
my_summary <-
list(list("Exposure 1" = ~ n_perc(exp_1 %in% 1),
"Exposure 2" = ~ n_perc(exp_2 %in% 1),
"Exposure 3" = ~ n_perc(exp_3 %in% 1),
"Exposure 4" = ~ n_perc(exp_4 %in% 1))
)
my_table <- summary_table(group_by(df, therapy), my_summary)
my_table
In the next step I wanted to add a further column containing p-values for the group differences between the control and intervention groups, e.g. from fisher.test. I read in ?qwraps2::summary_table that cbind is a suitable method for the class qwraps2_summary_table, but to be honest, I'm struggling with it. I tried different ways but failed, unfortunately.
Is there a convenient way to add individual columns to a qwraps2::summary_table, especially p-values corresponding to the grouped columns?
Thanks for your help!
Best,
Florian
[SOLVED]
Meanwhile, after a lot of research on this topic, I found a convenient and easy way to add a p-value column. Maybe it is not the smartest solution, but it worked, at least for me.
First I calculated the p-values with a function that extracts the p-value from the output of fisher.test, and stored them in an object, in my case a simple numeric vector:
# function to extract the p-value from fisher.test
fisher.pvalue <- function(x) {
  value <- fisher.test(x)$p.value
  return(value)
}
# fisher test / generate p-values
p.vals <- round(sapply(list(
  table(df$exp_1, df$therapy),
  table(df$exp_2, df$therapy),
  table(df$exp_3, df$therapy),
  table(df$exp_4, df$therapy)), fisher.pvalue), digits = 2)
In the following step I simply added an empty table column called "P-Value" and filled its cells with the values in p.vals.
overall_table <- cbind(my_table, "P-Value" = "") # create empty column
overall_table[9:12] <- p.vals # add vals to empty column
# overall_table <- cbind(my_table, "P-Value" = p.vals) works the same way in one line of code
overall_table
In my case, I simply looked up the corresponding cell indices in overall_table (cells 9:12 for the P-Value column) and filled them using base syntax. In the qwraps2 vignette (https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html), the author uses regular expressions to identify the right cells (see section 3.2).
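A rough sketch of that regular-expression approach, assuming the labels from my_summary ("Exposure 1" to "Exposure 4") end up as the row names of the table matrix:

# find the target rows by name instead of hard-coding the indices 9:12,
# then write the p-values into the "P-Value" column by name
idx <- grep("^Exposure", rownames(overall_table))
overall_table[idx, "P-Value"] <- p.vals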
If there are other methods to add individual columns to a qwraps2::summary_table, I would appreciate seeing how it's done.
Best,
Florian
Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to re-ask it more clearly.
My problem is that I want to add a column to a data frame of daily returns of a stock. Let's say the returns are normally distributed, and I would like to add a column that contains the historical value at risk, computed by a function I wrote myself.
The restriction is that the function should be applied to each observation using that observation and the 249 observations before it.
So when the next observation is calculated, it should again use only the last 249 observations of the days before; the input window should move as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
  width <- nrow(x)
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width-249):width], probs=q)
  VaR <- quantil*sqrt(hd) %>%
    return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error that I get instead of my new variable...
Thanks !!
As wibom wrote in the comments, nrow(x) does not work for vectors; what you need is length() instead. Also, you do not need return() on the last line, as R automatically returns the value of the last expression in a function if there is no earlier return().
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
  width <- length(x) # here you need length(), as x is a vector; nrow() only works for data.frames/matrices
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width-249):width], probs = q)
  quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated from the current and the previous 249 observations of df$Returns, right?
Is this about what you wish to do?:
library(tidyverse)
set.seed(42)
df <- tibble(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata %>%
    filter(ridx <= i, ridx > i - 249) %>%
    pull(Returns)
  quantil <- quantile(r, probs = q)
  VaR <- quantil*sqrt(hd)
}
df <- df %>%
  mutate(ridx = row_number()) %>%
  mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
  quantil <- quantile(r, probs = q)
  VaR <- quantil*sqrt(hd)
}
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
I want to calculate the final number of unprocessed claims after each month. First, I calculate the total number of claims to process: it's last month's backlog plus any new_claims coming in during the current month. Then, to get the number of closed claims, I multiply that total by closed_total_ratio and subtract the result from the total. My final variable is open_claims, which should automatically feed into the next month's backlog as soon as it's calculated:
library(tidyverse)
set.seed(1)
df <- tibble(date = seq(from = lubridate::as_date('2018-01-01'), to = lubridate::as_date('2018-06-01'), by = 'months'),
backlog = c(120, rep(NA, 5)),
new_claims =sample(c(10,20,30), 6, replace = T),
closed_open_ratio = rep(0.2, 6),
open_claims = rep(NA, 6))
df
set.seed(1)
solution <- tibble(date = seq(from = lubridate::as_date('2018-01-01'), to = lubridate::as_date('2018-06-01'), by = 'months'),
backlog = c(120, 104, 99, 95, 100, 88),
new_claims =sample(c(10,20,30), 6, replace = T),
total = c(130, 124, 119, 125, 110, 118),
closed_total_ratio = rep(0.2, 6),
closed = c(26, 25, 24, 25,22,24),
open_claims = c(104, 99, 95, 100,88, 94)
)
solution
The thing is, if I apply something like this:
df %>%
  mutate(total = backlog[1] + cumsum(new_claims),
         closed = closed_open_ratio * total,
         open_claims = total - cumsum(closed)) %>%
  select(backlog, new_claims, total, closed_open_ratio, closed, open_claims)
I fail to move open_claims back to the backlog. What would be a better way of doing it?
Hi Kasia! I think we can't avoid iteration if the result in the next row depends on the result from the previous one. You wrote "I'll be iterating over big data frames", so the best way to save some time is to use Rcpp. You need to create a new "C++ File" (it's integrated with RStudio) with the following code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame forKasia(DataFrame DF) {
IntegerVector backlog = DF["backlog"];
IntegerVector new_claims = DF["new_claims"];
NumericVector closed_open_ratio = DF["closed_open_ratio"];
IntegerVector open_claims = DF["open_claims"];
int n = backlog.size();
IntegerVector total = IntegerVector(n);
IntegerVector closed = IntegerVector(n);
for (int i = 0; i < n; i++) {
total[i] = backlog[i] + new_claims[i];
closed[i] = round(total[i] * closed_open_ratio[i]);
open_claims[i] = total[i] - closed[i];
if (i < n - 1) {
backlog[i + 1] = open_claims[i];
}
}
return DataFrame::create(
_["date"] = DF["date"],
_["backlog"] = backlog,
_["new_claims"] = new_claims,
_["total"] = total,
_["closed_open_ratio"] = closed_open_ratio,
_["closed"] = closed,
_["open_claims"] = open_claims
);
}
Then source it and run:
Rcpp::sourceCpp('forKasia.cpp')
forKasia(df)
Not very elegant, but I think it works. There are some differences from your solution, but the new_claims column is also different:
df <- tibble(date = seq(from = lubridate::as_date('2018-01-01'), to = lubridate::as_date('2018-06-01'), by = 'months'),
backlog = c(120, rep(NA, 5)),
new_claims =sample(c(10,20,30), 6, replace = T),
closed_open_ratio = rep(0.2, 6),
open_claims = rep(NA, 6))
df <- data.frame(df)
for (i in 1:nrow(df)) {
  df$open_claims[i] <- (df$backlog[i] + df$new_claims[i]) - ((df$backlog[i] + df$new_claims[i]) * df$closed_open_ratio[i])
  if (i < nrow(df)) {
    df$backlog[i + 1] <- (df$backlog[i] + df$new_claims[i]) - ((df$backlog[i] + df$new_claims[i]) * df$closed_open_ratio[i])
  }
}
df
date backlog new_claims closed_open_ratio open_claims
1 2018-01-01 120.0000 10 0.2 104.00000
2 2018-02-01 104.0000 20 0.2 99.20000
3 2018-03-01 99.2000 10 0.2 87.36000
4 2018-04-01 87.3600 20 0.2 85.88800
5 2018-05-01 85.8880 30 0.2 92.71040
6 2018-06-01 92.7104 20 0.2 90.16832
Hope it helps.
You can use purrr::accumulate to carry forward the still-open claims starting from the backlog on Day 1. cumsum and cumprod are the most common examples of this type of computation, but in this case we need something more complex than cumsum because a proportion of the claims gets closed every day.
Let p be the probability of closing (a constant). Let q=1-p be the probability of not closing.
For Day 1 we have backlog + new_claims claims. Let's call the total x1. Then at the end of Day 1 we have q*x1 claims still open.
Then for Day 2 we have the previously open claims, q*x1, plus some new ones, x2, and at the end of Day 2 we have q*(q*x1 + x2) claims still open. Let's look at one more day to make it clear.
For Day 3 we have the previously open claims plus those received that day, and at the end of Day 3 we have q*(q*(q*x1 + x2) + x3) claims still open.
This is the kind of sequential computation we can perform with purrr::accumulate.
p_close <- 0.2
df %>%
# Not very elegant but need to add backlog to the first-day claims
mutate(new_claims = if_else(row_number() == 1,
new_claims + backlog, new_claims)) %>%
  # This computes the running total q*(q*(q*x1 + x2) + x3) + x4 ..., with q = 1 - p_close
mutate(tot_claims = accumulate(new_claims, function(x, y) (1-p_close)*x + y)) %>%
# Simple vectorized product to split the total claims into open and closed
mutate(open_claims = (1-p_close) * tot_claims,
closed_claims = p_close * tot_claims) %>%
  # The backlog is the previous day's open claims
mutate(backlog = if_else(row_number() == 1,
backlog, lag(open_claims)))
The above computation assumes that the probability p_close of closing a claim is the same every day. But you can work with purrr::accumulate2 to provide both a vector of claims and a vector of closing probabilities.
This accumulation is a little more complex so let's define it separately.
accumulate_claims <- function(new_claims, closed_open_ratio) {
f <- function(x, y, p) (1-p)*x + y
  # head(closed_open_ratio, -1) drops the last probability. We don't actually need
  # it here, as right now we are computing the sequential sums of previously opened
  # claims plus the new claims for the day
x <- accumulate2(new_claims, head(closed_open_ratio, -1), f)
unlist(x)
}
df %>%
# Not very elegant but need to add backlog to the first-day claims
mutate(new_claims = if_else(row_number() == 1, new_claims + backlog, new_claims)) %>%
  # This computes the running total q3*(q2*(q1*x1 + x2) + x3) + x4 ..., with qi = 1 - closed_open_ratio[i]
mutate(tot_claims = accumulate_claims(new_claims, closed_open_ratio)) %>%
# Simple vectorized product to split the total claims into open and closed
mutate(open_claims = (1-closed_open_ratio) * tot_claims,
closed_claims = closed_open_ratio * tot_claims) %>%
  # The backlog is the previous day's open claims
mutate(backlog = if_else(row_number() == 1, backlog, lag(open_claims)))
One way to tackle the sequential nature of the computation is with recursion: calculate the initial open claims using the first row of the dataset, then repeat the call on the remaining rows.
calc_open_claims <- function(current_backlog, new_claims, closed_open_ratio) {
  (current_backlog + new_claims) * (1 - closed_open_ratio)
}

open_claims <- function(weekly_changes, accumulator) {
  if (nrow(weekly_changes) == 0) return(accumulator)
  new_backlog <- calc_open_claims(last(accumulator),
                                  weekly_changes$new_claims[1],
                                  weekly_changes$closed_open_ratio[1])
  accumulator <- c(accumulator, new_backlog)
  open_claims(weekly_changes[-1, ], accumulator)
}
open_claims(df, 120)
# Wrapper to kick it off and align result
open_claims_wrapper <- function(df) {
  starting_backlog <- df$backlog[1]
  oc <- open_claims(df, starting_backlog) # starting_backlog seeds the accumulator
  oc <- oc[-1] # lop off the starting backlog
  mutate(df, open_claims = oc)
}
open_claims_wrapper(df)
I'm trying to plot a time series graph with nPlot and I'm having difficulty presenting the X-axis labels in a desirable way.
I've looked for earlier reports of this problem; it has come up before, but without a solution so far (as far as I managed to find). I wonder if a solution is available by now?
In this case I get an X-axis range between -1 and 1, and no lines on the graph:
date = c("2013-07-22", "2013-07-29" ,"2013-08-05", "2013-08-12", "2013-08-19","2013-08-26", "2013-09-02" ,"2013-09-09" ,"2013-09-16")
test = as.data.frame(date)
test$V1 = c("10","11","13","12","11","10","15","12","9")
test$V2 = c("50","51","53","52","51","50","55","52","59")
test1 = melt(test,id = c("date"))
n1 = nPlot(value ~ date, group = "variable", data = test1, type="lineWithFocusChart")
If I add the following and plot again:
test1$date = as.Date(test1$date)
I get the desired graph, but the X-axis labels are in their numeric form (15900...).
Thanks.
Here is one way to make it work. I have made some changes to your code. One, I have made V1 and V2 numeric, since you want to be plotting numbers on the y-axis. Two, I have added a utility function to_jsdate that takes the character date and converts it into a numeric timestamp (seconds since 1970-01-01; the tick formatter then multiplies by 1000 to get the milliseconds a JavaScript Date expects). Date handling is still a little raw in rCharts, but we are working on making it better.
date = c("2013-07-22", "2013-07-29" ,"2013-08-05", "2013-08-12", "2013-08-19",
"2013-08-26", "2013-09-02" ,"2013-09-09" ,"2013-09-16")
test = as.data.frame(date)
test$V1 = as.numeric(c("10","11","13","12","11","10","15","12","9"))
test$V2 = as.numeric(c("50","51","53","52","51","50","55","52","59"))
test1 = reshape2::melt(test,id = c("date"))
to_jsdate <- function(date_){
  val <- as.POSIXct(as.Date(date_), origin = "1970-01-01")
  as.numeric(val)
}
test1 = transform(test1, date2 = to_jsdate(date))
n1 = nPlot(value ~ date2, group = "variable", data = test1, type="lineWithFocusChart")
n1$xAxis(tickFormat = "#! function(d){
return d3.time.format('%Y-%m-%d')(new Date(d*1000))
} !#")
n1
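As a rough sanity check of what to_jsdate produces (the exact value assumes the date is interpreted in UTC):

to_jsdate("2013-07-22")
# should be 1374451200, i.e. seconds since 1970-01-01; the tick formatter then
# turns 1374451200 * 1000 back into "2013-07-22" via new Date(d*1000)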