How to create percentiles in R using dplyr with a data frame?

I am looking to create an additional column named "percentile", based on the SOLD_QUOTES column. I do not want to use a window function; the percentile should be based on the entire dataset. The data is currently in descending order by SOLD_QUOTES, so ideally the first row in the table should be at roughly the 99.99th percentile, with the percentile decreasing as you move down the table.
Expected output

Maybe something like,
library(dplyr)

df <- tibble(sold_quotes = sample(1e6, 1e3, replace = TRUE))
pctiles <- seq(0, 1, 0.001)

df %>%
  arrange(desc(sold_quotes)) %>%
  mutate(percentile = cut(sold_quotes,
                          quantile(sold_quotes, probs = pctiles),
                          labels = pctiles[2:length(pctiles)] * 100,
                          include.lowest = TRUE))  # keeps the minimum value from becoming NA
#> # A tibble: 1,000 x 2
#> sold_quotes percentile
#> <int> <fct>
#> 1 999562 100
#> 2 996533 99.9
#> 3 996260 99.8
#> 4 995499 99.7
#> 5 994984 99.6
#> 6 994937 99.5
#> 7 994130 99.4
#> 8 993001 99.3
#> 9 992902 99.2
#> 10 990298 99.1
#> # … with 990 more rows
The percentile calculation doesn't depend on rearranging sold_quotes in descending order; you'll get the correct result without it. I was just mirroring your example.
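If you don't need the fixed 0.1% bins, a simpler sketch (using dplyr's percent_rank(), not part of the original answer) gives each row its empirical percentile directly:
library(dplyr)

# percent_rank() returns each value's empirical percentile over the whole
# column, so no quantile breaks or cut() labels are needed
df %>%
  mutate(percentile = percent_rank(sold_quotes) * 100) %>%
  arrange(desc(sold_quotes))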

Related

How to randomize an ascending list of values into two similar groups in R

I want to randomize an ascending list of values into two similar groups in R: two statistically similar groups, meaning the mean time (performance) is the same (the lower the time the better). I would like evenly distributed groups with fast and slow skiers.
I will have a pre-test and want to randomize some alpine ski athletes based on their performance.
The dataset will look like this (this is a test dataset; the real one, with n = 40, I will get at the pre-test):
# A tibble: 4 × 5
# Groups: BIB. [4]
BIB. `11` `99` `77` performance
<int> <dbl> <dbl> <dbl> <dbl>
1 1 14.2 NA NA NA
2 2 14.4 15.0 NA -0.600
3 3 14.3 14.6 NA -0.310
4 77 NA 12.9 61.4 NA
Can anyone help me?
My approach might be to sample it randomly some number of times (100?), then evaluate the means in each, and pick the set whose group means are closest.
Here is how you would do that.
library(dplyr)
library(purrr)

# invent some initial data
(hiris <- head(iris, n = 20) |>
   select(perftime = Sepal.Length) |>
   mutate(id = row_number()))

# make 100 scrambled datasets, then evaluate them for closest means
set.seed(42)
(nr <- nrow(hiris))
possible_sets <- map(1:100,
                     ~ slice_sample(.data = hiris, n = nr, replace = FALSE) |>
                       mutate(group = 1 * (row_number() <= nr / 2)))  # <= splits into two equal halves

(evaluations <- map_dbl(possible_sets, ~ {
  step1 <- group_by(.x, group) |> summarise(m = mean(perftime))
  abs(step1$m[[1]] - step1$m[[2]])  # absolute difference between group means
}))

(set_to_choose <- which.min(evaluations))

# to see the evaluation
plot(seq_along(evaluations), evaluations)
points(x = set_to_choose,
       y = evaluations[set_to_choose],
       col = "red", pch = 15)

# to use the 'best' set
(chosen_set <- possible_sets[[set_to_choose]])  # [[ extracts the tibble itself
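As a quick sanity check (a small follow-up sketch, not part of the original answer), you can compare the two group means of the chosen set directly:
# compare the group means and sizes of the selected set
chosen_set |>
  group_by(group) |>
  summarise(mean_perftime = mean(perftime), n = n())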

How to modify labels for p-values and size in ggstatsplot

I am working on a plot using ggstatsplot R package. After completing the plot I have an issue. This is the code for my plot:
library(ggstatsplot)
#Data
data("movies_long")
movies_long
#Plot
Myplot <- ggbarstats(
  data = movies_long,
  x = mpaa,
  y = year,
  title = "Distribution of movies per year",
  xlab = "year",
  legend.title = "",
  ggplot.component = list(
    ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 2))
  ),
  palette = "Set2"
)
The output is the following plot (image not included here):
The plot is very useful, but it has two problems. The first is the overlap of the p-values at the top of the bars. The second is the sample size n at the bottom, which also overlaps. How could I improve the p-value labels at the top of the bars and the size n labels at the bottom?
Is there any way to take these p-values and translate them into p < 0.001, p < 0.1, etc., using confidence levels?
Many thanks for your help. In the extreme case, is there any way to know how the p-values are computed (which test is applied) for each bar, so that I can replicate it and create a similar plot using ggplot2?
I challenge your statement that this plot is very useful. Apart from the overlapping labels, there are statistical issues including:
The plot almost completely obscures the fact that it summarizes 1 movie per year for most years before 1991.
The analysis computes lots of p-values without any correction for multiple hypothesis testing, which is bad practice.
Here is the same plot restricted to movies released between 1991 and 2005. It's better, though still peppered with an overabundance of uncorrected p-values.
Note: I've made the plot large enough to accommodate all those labels. Now the text is a bit small, but with some trial and error you can get the plot height and width "right".
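For reference, a hedged sketch of saving the plot with explicit dimensions (the file name and sizes here are placeholders to tune by trial and error):
library(ggplot2)

# save large enough that the labels don't overlap; adjust width/height to taste
ggsave("movies_barstats.png", plot = Myplot, width = 16, height = 10, dpi = 300)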
As for the p-values on top of each bar... there is a handy way to extract the data visualized in a ggplot (here p is the plot object, i.e. Myplot from the question):
p$data
#> # A tibble: 41 × 5
#> year mpaa counts perc .label
#> <fct> <fct> <int> <dbl> <chr>
#> 1 1991 R 5 83.3 83%
#> 2 1992 R 6 100 100%
#> 3 1993 R 5 83.3 83%
#> 4 1994 R 20 80 80%
#> 5 1995 R 59 69.4 69%
#> 6 1996 R 63 60.6 61%
#> 7 1997 R 71 62.3 62%
#> 8 1998 R 82 61.7 62%
#> 9 1999 R 98 61.3 61%
#> 10 2000 R 77 48.4 48%
#> # … with 31 more rows
Now that we've confirmed that each bar is a simple summary of the number of movies by audience rating (R, PG, PG-13), we can quickly check that the p-value comes from a chi-squared test on the counts by rating:
p$data %>%
  filter(year == 2000)
#> # A tibble: 3 × 5
#> year mpaa counts perc .label
#> <fct> <fct> <int> <dbl> <chr>
#> 1 2000 R 77 48.4 48%
#> 2 2000 PG-13 64 40.3 40%
#> 3 2000 PG 18 11.3 11%
chisq.test(c(77, 64, 18))
#>
#> Chi-squared test for given probabilities
#>
#> data: c(77, 64, 18)
#> X-squared = 36.264, df = 2, p-value = 1.335e-08
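With that confirmed, a minimal ggplot2 sketch (my own, not from ggstatsplot) can rebuild similar bars from the extracted data and lets you place whatever labels you like:
library(ggplot2)
library(dplyr)

# rebuild stacked percentage bars from the data ggbarstats computed;
# p$data is the tibble extracted above (year, mpaa, counts, perc, .label)
p$data %>%
  ggplot(aes(x = year, y = perc, fill = mpaa)) +
  geom_col(position = "stack") +
  geom_text(aes(label = .label),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(x = "year", y = "percent", fill = NULL)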

Dynamic summarise through dynamic multiplication with an external vector

I have a tibble such as:
library(dplyr)

X = tibble(Name = rep(c("A","B","C"), 5),
           Coeffs_1 = runif(15, 0, 1),
           Coeffs_2 = runif(15, 0, 1)) %>%
  arrange(Name)
In this example the number of reps per Name is arbitrarily fixed at 5 and the number of Coeffs columns is arbitrarily fixed at 2, but in my code they could be any number and I don't know the exact values in advance.
I also have a vector with n = reps * 2 elements:
Y = runif(10, 0, 100)
In this specific case, 10 = 5 * 2.
My task is to compute, for each Name and for each Coeffs column, the following:
Result_x[1]    = sum(Coeff_x[1]*Y[2]      + Coeff_x[2]*Y[3]      + ... + Coeff_x[Reps]*Y[Reps+1]) - Y[1]
Result_x[2]    = sum(Coeff_x[1]*Y[3]      + Coeff_x[2]*Y[4]      + ... + Coeff_x[Reps]*Y[Reps+2]) - Y[2]
...
Result_x[Reps] = sum(Coeff_x[1]*Y[Reps+1] + Coeff_x[2]*Y[Reps+2] + ... + Coeff_x[Reps]*Y[Reps*2]) - Y[Reps]
So that in the end, the final summarised tibble should look like:
Name  Lag   Result_1  ...  Result_x
A     +1    a number  ...  a number
A     +2    a number  ...  a number
A     ...   a number  ...  a number
A     Reps  a number  ...  a number
B     +1    a number  ...  a number
B     +2    a number  ...  a number
...   ...   a number  ...  a number
...   Reps  a number  ...  a number
The dynamic nature of the problem makes it hard for me to express with a for loop, and the external vector that must be re-indexed and summarised for each row of the original tibble makes it difficult to handle in a single pipeline.
I thought that defining a custom function could help, but again, it doesn't fit cleanly into pipeline code.
Split 'X' by 'Name' and loop over the list (map), while creating lead-shifted versions of 'Y' in a list (with n specified as a vector). Loop over that list and, for each shifted 'y', summarise across the 'Coeff' columns: take the sum of the product of the column values with the corresponding (length-corrected) values of 'y', and subtract the first value of 'y'.
library(dplyr)
library(purrr)
library(data.table)
X %>%
  group_split(Name) %>%
  map_dfr(~ map_dfr(shift(Y, n = 1:nrow(.x), type = 'lead'),
                    function(y) .x %>%
                      summarise(Name = first(Name),
                                across(starts_with('Coeff'),
                                       ~ sum(. * y[seq_along(.)], na.rm = TRUE) - first(y))))) %>%
  mutate(Lag = rowid(Name))
Output:
# A tibble: 15 × 4
Name Coeffs_1 Coeffs_2 Lag
<chr> <dbl> <dbl> <int>
1 A 127. 54.4 1
2 A 162. 134. 2
3 A 127. 68.2 3
4 A 109. 38.0 4
5 A 108. 94.0 5
6 B 175. 197. 1
7 B 187. 240. 2
8 B 151. 200. 3
9 B 132. 159. 4
10 B 102. 152. 5
11 C 48.8 131. 1
12 C 89.1 128. 2
13 C 42.5 98.7 3
14 C 29.4 95.7 4
15 C 41.7 50.1 5
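If the nested map_dfr is hard to follow, here is an equivalent plain-loop sketch (mine, not from the answer above; it assumes reps = 5 as in the example). Note that, like the pipeline, it subtracts the first element of the shifted slice, i.e. Y[lag + 1]:
library(dplyr)

reps <- 5  # number of rows per Name in this example
out <- list()
for (nm in unique(X$Name)) {
  coeffs <- X[X$Name == nm, startsWith(names(X), "Coeffs")]
  for (lag in seq_len(reps)) {
    y <- Y[(lag + 1):(lag + reps)]       # lead-shifted slice of Y
    res <- colSums(coeffs * y) - y[1]    # sum(coeff * y) minus first shifted value
    out[[length(out) + 1]] <- tibble(Name = nm, Lag = lag, !!!as.list(res))
  }
}
bind_rows(out)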

How can I apply calculations multiple times on similar variables in the Tidyverse?

I am trying to run calculations on multiple variables with similar names (mx1_var1...mx2_var1 etc).
A simplified version of the data is below.
structure(list(mx1_amenable = c(70.0382790687902, 20.8895416774022,
98.1328630153307, 8.63038330575823, 21.098387740395, 31.959849814698,
9.22952906324882, 74.4660849895597, 29.6851613973842, 60.941434908354
), mx1_Other = c(50.0261607893197, 46.0117649431311, 51.8219837573084,
73.7814971552898, 93.8008571298187, 92.6841115228084, 95.660659297798,
10.8184536035572, 43.6606611340557, 81.4415005182801), mx1_preventable = c(38.6864667127179,
22.5707957186912, 13.324746863086, 74.9369833030818, 13.0413382062397,
98.3757571024402, 86.6179643621766, 19.7927752780922, 2.28293032845359,
67.0137368426169), mx2_amenable = c(63.6636904898683, 40.361275660631,
3.2234218985236, 80.4870440564426, 49.483719663574, 71.0484920255819,
97.3726798797323, 30.0044347466731, 25.8476044496246, 39.4468283905231
), mx2_Other = c(4.0822540063483, 52.9579932985574, 38.3393867228102,
80.8093349013419, 89.5704617034906, 7.15269982141938, 44.9889904260212,
94.1639871656393, 17.4307996383923, 91.9360333328057), mx2_preventable = c(97.9327560952081,
42.7026845980086, 74.6785922702186, 27.4754587243202, 14.5174992869947,
29.298035056885, 3.2058044369044, 44.6985715883816, 33.7262168187378,
50.9358501169921)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
I want to run calculations, e.g.
mutate(diff_amenable = mx1_amenable - mx2_amenable)
Across all variables in the dataset, as well as further calculations based on the output of these new figures. I think some sort of string matching and a function should be able to do it, but all I could come across was this question: Function to perform similar calculations on variables with similar names.
At the moment I am working with the data in wide format and manually inputting the column names to run the calculations which is not feasible as I work with more variables (up to 70 paired values).
Any ideas how this could be done?
This might be a slight step forward: writing functions that perform the calculation for a pair of columns selected by name matching inside across(). This works for the six example columns in your dataset:
library(tidyverse)
difference <- function(...) {
x <- list(...)
x[[1]][[1]] - x[[1]][[2]]
}
proportion <- function(...) {
x <- list(...)
x[[1]][[1]] / x[[1]][[2]]
}
df %>%
rowwise() %>%
transmute(
mx1_allcause = sum(across(starts_with("mx1"))),
mx2_allcause = sum(across(starts_with("mx2"))),
diff_amenable = difference(across(ends_with("_amenable"))),
diff_allcause = difference(across(ends_with("_allcause"))),
prop_amenable = proportion(across(starts_with("diff")))
)
#> # A tibble: 10 x 5
#> # Rowwise:
#> mx1_allcause mx2_allcause diff_amenable diff_allcause prop_amenable
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 159. 166. 6.37 -6.93 -0.920
#> 2 89.5 136. -19.5 -46.5 0.418
#> 3 163. 116. 94.9 47.0 2.02
#> 4 157. 189. -71.9 -31.4 2.29
#> 5 128. 154. -28.4 -25.6 1.11
#> 6 223. 107. -39.1 116. -0.338
#> 7 192. 146. -88.1 45.9 -1.92
#> 8 105. 169. 44.5 -63.8 -0.697
#> 9 75.6 77.0 3.84 -1.38 -2.79
#> 10 209. 182. 21.5 27.1 0.794
Created on 2021-04-09 by the reprex package (v2.0.0)
Expanding this to your 70+ variables, though, might be different. My solution here relies on each two-column calculation being able to select its two columns (in order) by a text match. If you need a more complicated matching of one name to another, you might need a smarter approach, or to give in and manually define the pairings.
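For the 70+ pair case, one possible sketch (my own, assuming your columns follow the mx1_/mx2_ prefix convention) derives the variable stems from the column names and computes all the diff_ columns in one pass:
library(dplyr)
library(purrr)

# extract the cause names shared by the mx1_/mx2_ pairs, e.g. "amenable"
stems <- sub("^mx1_", "", grep("^mx1_", names(df), value = TRUE))

# for each stem, compute mx1 - mx2 and name the result diff_<stem>
diffs <- map_dfc(set_names(stems, paste0("diff_", stems)),
                 ~ df[[paste0("mx1_", .x)]] - df[[paste0("mx2_", .x)]])

bind_cols(df, diffs)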

How to prevent R from rounding in frequency function?

I used the freq function of the frequency package to get frequency percentages for my dataset$MoriskyAdherence, but R gives me the percent values rounded to one decimal place. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8. The sum of them is 100.1.
The exact amounts should be 35.00, 41.25, 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but nothing worked.
The function freq returns a character data frame and has no option to adjust the number of decimal places. However, it is easy to recreate the table however you want. For example, I have written this function, which will give you the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
  df <- frequency::freq(data_frame)
  lapply(df, function(x)
  {
    n <- suppressWarnings(as.numeric(x$Freq))
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    t_row <- grep("Total", x[, 2])[1]
    valid_percent <- suppressWarnings(100 * n / as.numeric(x$Freq[t_row]))
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$`Valid Percent` <- format(round(valid_percent, 2), nsmall = 2)
    x$`Cumulative Percent` <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    x$`Cumulative Percent`[t_row:nrow(x)] <- ""
    x$`Valid Percent`[(t_row + 1):nrow(x)] <- ""
    return(x)
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without parentheses), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be made at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim) {
  makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a data frame with two variables, one numeric and one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100,3))
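If all you need is an unrounded percent breakdown of a single factor, base R's table() and prop.table() give full precision directly; a minimal sketch using the vector from the question:
# full-precision percentages; round only at the display step
tab <- table(MoriskyAdherence)
pct <- 100 * prop.table(tab)
round(pct, 2)
# e.g. 35.00 41.25 23.75 for counts of 56, 66, 38 out of 160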
