Dynamic summarise through dynamic multiplication with an external vector - r

I have a tibble like this:
X = tibble(Name = rep(c("A","B","C"),5),
           Coeffs_1 = runif(15,0,1),
           Coeffs_2 = runif(15,0,1)) %>% arrange(Name)
In this example the number of reps per Name is arbitrarily set to 5 and the number of Coeff columns to 2, but in my real code they could be any number and I don't know them in advance.
I also have a vector with n = reps*2 elements:
Y = runif(10,0,100)
In this specific case, it's 10 = 5*2.
My task is to summarise, for each Name and each Coeff column, this formula:
Result_x[1]    = Coeff_x[1]*Y[2]      + Coeff_x[2]*Y[3]      + ... + Coeff_x[Reps]*Y[Reps+1] - Y[1]
Result_x[2]    = Coeff_x[1]*Y[3]      + Coeff_x[2]*Y[4]      + ... + Coeff_x[Reps]*Y[Reps+2] - Y[2]
...
Result_x[Reps] = Coeff_x[1]*Y[Reps+1] + Coeff_x[2]*Y[Reps+2] + ... + Coeff_x[Reps]*Y[Reps*2] - Y[Reps]
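Written as an explicit loop, the calculation for a single Name and a single Coeff column would be something like the sketch below (it just restates the formula above, using the example X and Y; Name "A" and Coeffs_1 are only for illustration):

coeffs <- X$Coeffs_1[X$Name == "A"]   # one Coeff column for one Name
reps   <- length(coeffs)              # Reps inferred from the data
result <- numeric(reps)
for (lag in seq_len(reps)) {
  # Result[lag] = sum over i of Coeff[i] * Y[lag + i], minus Y[lag]
  result[lag] <- sum(coeffs * Y[(lag + 1):(lag + reps)]) - Y[lag]
}
result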
So that, in the end, the final summarised tibble should look like:
Name  Lag   Result_1   ...       Result_x
A     +1    a number   numbers   a number
A     +2    a number   numbers   a number
A     ...   a number   numbers   a number
A     Reps  a number   numbers   a number
B     +1    a number   numbers   a number
B     +2    a number   numbers   a number
...   ...   a number   numbers   a number
...   Reps  a number   numbers   a number
The dynamic nature of the problem makes it hard for me to express this with a for loop, and the external vector, which must be re-indexed and properly summarised for each row of the original tibble, makes it difficult to handle in a pipeline.
I thought a custom function might help, but again, it doesn't fit cleanly into the pipeline code.

Split 'X' by 'Name' and loop over the resulting list (map_dfr). For each group, create lead-shifted copies of 'Y' in a list, with n specified as the vector 1:nrow(.x). Loop over that list and, for each shifted 'y', summarise across the 'Coeff' columns by taking the sum of the product of each column with the correspondingly trimmed 'y' and subtracting the first value of 'y'.
library(dplyr)
library(purrr)
library(data.table)
X %>%
  group_split(Name) %>%
  map_dfr(~ map_dfr(shift(Y, n = 1:nrow(.x), type = 'lead'),
                    function(y) .x %>%
                      summarise(Name = first(Name),
                                across(starts_with('Coeff'),
                                       ~ sum(. * y[seq_along(.)], na.rm = TRUE) - first(y))))) %>%
  mutate(Lag = rowid(Name))
-output
# A tibble: 15 × 4
Name Coeffs_1 Coeffs_2 Lag
<chr> <dbl> <dbl> <int>
1 A 127. 54.4 1
2 A 162. 134. 2
3 A 127. 68.2 3
4 A 109. 38.0 4
5 A 108. 94.0 5
6 B 175. 197. 1
7 B 187. 240. 2
8 B 151. 200. 3
9 B 132. 159. 4
10 B 102. 152. 5
11 C 48.8 131. 1
12 C 89.1 128. 2
13 C 42.5 98.7 3
14 C 29.4 95.7 4
15 C 41.7 50.1 5

Related

Keeping the max within a group constant within a group using base::cumsum

Use the data below to make the cumsum_a column look like the should column.
Data to start with:
> demo
th seq group
1 20.1 1 10
2 24.1 2 10
3 26.1 3 10
4 1.1 1 20
5 2.1 2 20
6 4.1 3 20
The "should" column below is the goal.
demo <- data.frame(th = c(20.1, 24.1, 26.1, 1.1, 2.1, 4.1),
                   seq = c(1:3, 1:3),
                   group = c(rep(10, 3), rep(20, 3)))
library(magrittr)
library(dplyr)
demo %>%
  group_by(group) %>%
  mutate(cumsum_a = cumsum(group^seq * (th / cummax(th)))) %>%
  ungroup() %>%
  mutate(
    cumsum_m = c( # As an example only, this manually does exactly what cumsum_a is doing (which is wrong)
      10^1*20.1/20.1,                                    # good
      10^1*20.1/20.1 + 10^2*24.1/24.1,                   # different denominators, bad
      10^1*20.1/20.1 + 10^2*24.1/24.1 + 10^3*26.1/26.1,  # different denominators, bad
      20^1*1.1/1.1,                                      # good
      20^1*1.1/1.1 + 20^2*2.1/2.1,                       # different denominators, bad
      20^1*1.1/1.1 + 20^2*2.1/2.1 + 20^3*4.1/4.1         # different denominators, bad
    ),
    should = c( # this is exactly the kind of calculation I want
      10^1*20.1/20.1,                                    # good
      10^1*20.1/24.1 + 10^2*24.1/24.1,                   # good
      10^1*20.1/26.1 + 10^2*24.1/26.1 + 10^3*26.1/26.1,  # good
      20^1*1.1/1.1,                                      # good
      20^1*1.1/2.1 + 20^2*2.1/2.1,                       # good
      20^1*1.1/4.1 + 20^2*2.1/4.1 + 20^3*4.1/4.1         # good
    )
  )
Most simply put, the denominators need to be the same within each row, e.g. 24.1 and 24.1 instead of 20.1 and 24.1 in the second row of cumsum_m (and in the underlying calculation of cumsum_a).
Here are the new columns, where should is what cumsum_a or cumsum_m should be.
th seq group cumsum_a cumsum_m should
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 20.1 1 10 10 10 10
2 24.1 2 10 110 110 108.
3 26.1 3 10 1110 1110 1100.
4 1.1 1 20 20 20 20
5 2.1 2 20 420 420 410.
6 4.1 3 20 8420 8420 8210.
You can use the following solution:
purrr::accumulate takes a two-argument function: the first argument, represented by .x or ..1, is the accumulated value from the previous iterations, and .y represents the current value of the vector we iterate over (2:n()). So the first accumulated value is the first element of group, since I supplied it as the .init argument.
Since you would like to change the denominator of the previous iterations' calculations, I multiply the previous result .x by the ratio of the previous cmax value to the current cmax value.
I think the rest is fairly clear, but if you have any more questions about it just let me know.
library(dplyr)
library(purrr)
demo %>%
group_by(group) %>%
mutate(cmax = cummax(th),
should = accumulate(2:n(), .init = group[1],
~ (.x * cmax[.y - 1] / cmax[.y]) + (group[.y] ^ seq[.y]) * (th[.y] / cmax[.y])))
# A tibble: 6 x 5
# Groups: group [2]
th seq group cmax should
<dbl> <int> <dbl> <dbl> <dbl>
1 20.1 1 10 20.1 10
2 24.1 2 10 24.1 108.
3 26.1 3 10 26.1 1100.
4 1.1 1 20 1.1 20
5 2.1 2 20 2.1 410.
6 4.1 3 20 4.1 8210.
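To see the rescaling at work, the second value of should within group 10 can be reproduced by hand; this simply restates the 'should' definition from the question:

10^1 * 20.1 / 24.1 + 10^2 * 24.1 / 24.1
#> [1] 108.3402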

How to create percentiles in R using dplyr with data frame?

I am looking to create an additional column named "percentile", based on the SOLD_QUOTES column. I do not want to use a window function; the percentile should be based on the entire dataset. The data is currently in descending order by SOLD_QUOTES, so ideally the first row (shown in the image) should be at the 99.99th percentile, with percentiles decreasing down the table.
Expected output
Maybe something like,
library(dplyr)
df <- tibble(sold_quotes = sample(1e6, 1e3, replace = TRUE))
pctiles <- seq(0, 1, 0.001)
df %>%
arrange(desc(sold_quotes)) %>%
mutate(percentile = cut(sold_quotes,
quantile(sold_quotes,
probs = pctiles),
labels = pctiles[2:length(pctiles)]*100))
#> # A tibble: 1,000 x 2
#> sold_quotes percentile
#> <int> <fct>
#> 1 999562 100
#> 2 996533 99.9
#> 3 996260 99.8
#> 4 995499 99.7
#> 5 994984 99.6
#> 6 994937 99.5
#> 7 994130 99.4
#> 8 993001 99.3
#> 9 992902 99.2
#> 10 990298 99.1
#> # … with 990 more rows
The percentile calculation doesn't depend on rearranging sold_quotes in descending order; you'll get the correct result without it. I was just mirroring your example.
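If a plain numeric percentile is enough, rather than the binned factor labels that cut() produces, dplyr::percent_rank() is a shorter alternative; a minimal sketch using the same df:

library(dplyr)

df %>%
  arrange(desc(sold_quotes)) %>%
  mutate(percentile = percent_rank(sold_quotes) * 100)

percent_rank() rescales the ranks to [0, 1], so the largest value maps to 100 and the smallest to 0, regardless of row order.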

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters might work. However, I cannot even think of a starting point, because a changing number of rows has to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the res results from the output of dunnTest() and create a table that gives the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
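To append the letters to the summary table from the question, the cldList() result can be joined back by month. A sketch, assuming cldList() returns its usual data frame with Group (the month labels as character) and Letter columns:

library(dplyr)
library(rcompanion)

cld <- cldList(P.adj ~ Comparison, data = Result)

ozone_summary %>%
  mutate(Month = as.character(Month)) %>%      # match the character group labels in cld
  left_join(cld, by = c("Month" = "Group")) %>%
  select(Month, Mean, Group = Letter)          # rename Letter to the desired Group column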

How to prevent R from rounding in frequency function?

I used the freq function from the frequency package to get frequency percentages for my dataset$MoriskyAdherence, but R gives me rounded percent values. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8. The sum of them is 100.1.
The exact amounts should be 35.0, 41.25, 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but...
The function freq returns a character data frame, and has no option to adjust the number of decimal places. However, it is easy to recreate the table however you want it. For example, I have written this function, which will give you the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
  df <- frequency::freq(data_frame)
  lapply(df, function(x)
  {
    # Convert the character Freq column to numbers (non-numeric cells become NA)
    n <- suppressWarnings(as.numeric(x$Freq))
    # The grand total sits in the last row
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    # Row index of the first "Total" row, i.e. the total of the valid entries
    t_row <- grep("Total", x[, 2])[1]
    valid_percent <- suppressWarnings(100 * n / as.numeric(x$Freq[t_row]))
    # Rebuild the percentage columns with two decimal places
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
    x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    # Blank out the cells that the original freq() output leaves empty
    x$'Cumulative Percent'[t_row:nrow(x)] <- ""
    x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
    return(x)
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim){
makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then, to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100,3))
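For completeness, a dplyr equivalent of the aggregate() call might look like the sketch below (sum_Var2 and Var2_pct are just illustrative column names):

library(dplyr)

df %>%
  count(Var1, wt = Var2, name = "sum_Var2") %>%                  # sum Var2 within each Var1
  mutate(Var2_pct = round(100 * sum_Var2 / sum(sum_Var2), 3))    # percentage of the grand total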

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this, but with more lab-value variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate, for each patient (with a unique ID), the decrease or slope per day between the first and last measurement, so that I can compare the slopes between patients. Time is in days. So, eventually I want a new variable, e.g. diff_labvalues, for each lab value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/ (7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last-first); etc for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
Then I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find HOW to extract the corresponding time values (the delta-time) that I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try the following (here labs stands for a single lab-value column, e.g. labs = labvalue1):
df %>%
group_by(id) %>%
filter(!is.na(labs)) %>%
summarise(diff_labs = (last(labs) - first(labs))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labs), .(diff_labs = (labs[.N] - labs[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
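If you need the slope for every lab column at once, as the mutate_all attempt in the question suggests, one option is to reshape to long format first. A sketch with tidyr (the column selection via starts_with("labvalue") is an assumption based on the example df):

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(starts_with("labvalue"), names_to = "lab", values_to = "value") %>%
  filter(!is.na(value)) %>%                     # drop missing measurements per lab
  group_by(id, lab) %>%
  summarise(diff = (last(value) - first(value)) / (last(time) - first(time)),
            .groups = "drop") %>%
  pivot_wider(names_from = lab, values_from = diff, names_prefix = "diff_")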

Resources