I am using the dplyr library in R.
I created the following dataset:
library(dplyr)
#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )
#create frame
train_data = data.frame(a,b,group)
train_data$group = as.factor(train_data$group)
From here, I want to make a new variable called "diff" which records if variable "b" is bigger than variable "a":
train_data$diff = ifelse(train_data$b > train_data$a,1,0)
Now, I want to make a new variable ("perc") in the "train_data" table, which calculates:
for each unique group
the "percentage" of the "diff" variable
e.g.
suppose there are 20 rows where group = "A".
In those 20 rows, there are 10 rows where the variable "diff" is "1".
Therefore, "perc" = 0.5 (10/20 = 0.5). So, for those 20 rows, the value of "perc" should be 0.5.
Using another Stack Overflow post (Compute "percent complete" within subgroups using dplyr in R?), I tried to implement this:
final_table = data.frame(train_data %>% group_by(group) %>% mutate(perc = diff/max(diff)))
But this is not giving me the desired output:
head(final_table, 13)
a b group diff perc
1 107.19028 117.37028 D 1 1
2 105.34165 87.96513 A 0 0
3 120.21911 94.30301 C 0 0
4 98.06001 104.82173 D 1 1
5 104.54841 90.00205 B 0 0
6 90.77172 79.31384 D 0 0
7 96.22783 88.60185 D 0 0
8 113.67500 87.28380 B 0 0
9 96.82708 89.51343 C 0 0
10 115.38720 100.79550 C 0 0
11 105.30922 80.55969 C 0 0
12 114.93315 95.78172 B 0 0
13 105.20058 109.66729 C 1 1
For example, row 11 and row 13 both have group = "C", but different values of the "perc" variable. Furthermore, it doesn't seem like percentages are being calculated here at all.
Can someone please show me how to fix this?
Note: Is it also possible to create a table with 4 rows in which the summaries are provided? I think the Count = n() command can be used for this?
E.g.
Group Number of Rows Perc
a 20 0.6
b 20 0.7
c 50 0.9
d 10 0.24
Or a general summary (i.e. in the whole table, what is the percentage of rows where the "diff" variable is 1?):
d = sum(train_data$diff) / nrow(train_data)  # or simply mean(train_data$diff)
Thanks
Please let me know if I misunderstood your questions. Note that your diff/max(diff) can't give a percentage: within each group, max(diff) is 1 as soon as any row has diff == 1, so the expression just returns diff itself. What you want is sum(diff)/n(), i.e. a group-wise mean:
library(dplyr)
#create data
a = rnorm(100,100,10)
b = rnorm(100,100,10)
group <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.2, 0.15, 0.15) )
#create frame
train_data = data.frame(a,b,group)
# Question 1
train_data %>%
  group_by(group) %>%
  mutate(
    percent = sum(a > b) / n()  # for your "diff" (b > a), use sum(b > a)/n(), i.e. mean(diff)
  )
#> # A tibble: 100 x 4
#> # Groups: group [4]
#> a b group percent
#> <dbl> <dbl> <chr> <dbl>
#> 1 95.0 88.9 B 0.429
#> 2 96.4 95.1 A 0.35
#> 3 102. 110. A 0.35
#> 4 97.4 96.2 A 0.35
#> 5 90.7 92.7 A 0.35
#> 6 92.0 105. B 0.429
#> 7 93.8 85.1 A 0.35
#> 8 101. 102. B 0.429
#> 9 92.0 99.1 A 0.35
#> 10 77.6 87.8 B 0.429
#> # ... with 90 more rows
# Question 2
train_data %>%
  group_by(group) %>%
  summarize(
    rows = n(),
    percent = sum(a > b) / n()
  )
#> # A tibble: 4 x 3
#> group rows percent
#> <chr> <int> <dbl>
#> 1 A 60 0.35
#> 2 B 21 0.429
#> 3 C 8 0.375
#> 4 D 11 0.364
Created on 2021-07-02 by the reprex package (v2.0.0)
I have a data frame with two species, A and B, and variables a and b, with 100 rows in total.
I want to create a sampler such that in one set it randomly picks 6 rows (reps) from the df dataset. However, the samples for A must only come from rows associated with sp A in df, and similarly for B. I want to do this 500 times for each of species A and B.
I attempted a for loop, but when I ran the sampling it returned a single row with 6 columns. I would appreciate any guidance.
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
df.sample <- for(i in 1:1000){
  sampling <- sample(df[i,], 6, replace = TRUE)
}
#Output in a single row
a a.1 sp b sp.1 a.2
1000 1.68951 1.68951 B 1.395995 B 1.68951
#Expected dataframe
df.sample
set rep a b sp
1 1 1 9 A
1 2 3 2 A
1 3 0 2 A
1 4 1 2 A
1 5 1 6 A
1 6 4 2 A
2 1 1 2 B
2 2 5 2 B
2 3 1 2 B
2 4 1 6 B
2 5 1 8 B
2 6 9 2 B
....
Here's how I would do it (using tidyverse):
data:
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# create an empty table with desired columns
library(tidyverse)
output <- tibble(a = numeric(),
                 b = numeric(),
                 sp = character(),
                 set = numeric())
# sampling in a loop
set.seed(42)
for(i in 1:500){
  samp1 <- df %>% filter(sp == 'A') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
  samp2 <- df %>% filter(sp == 'B') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
  output %>% add_row(bind_rows(samp1, samp2)) -> output
}
Result
> head(output, 20)
# A tibble: 20 × 4
a b sp set
<dbl> <dbl> <chr> <dbl>
1 2.59 3.31 A 1
2 1.84 1.66 A 1
3 2.35 1.17 A 1
4 2.33 1.95 A 1
5 0.418 1.11 A 1
6 1.19 2.54 A 1
7 2.35 0.899 B 1
8 1.19 1.63 B 1
9 0.901 0.986 B 1
10 3.12 1.75 B 1
11 2.28 2.61 B 1
12 1.37 3.47 B 1
13 2.33 1.95 A 2
14 1.84 1.66 A 2
15 3.76 1.26 A 2
16 2.96 3.10 A 2
17 1.03 1.81 A 2
18 1.42 2.00 A 2
19 0.901 0.986 B 2
20 2.37 1.39 B 2
You could split df by species first. Random rows within each species can be drawn with x[sample(nrow(x), 6), ]. Passing that into replicate(), you can repeat the sampling many times. Here dplyr::bind_rows() is used to combine the samples and add a new column set indicating the sampling index.
lapply(split(df, df$sp), function(x) {
  dplyr::bind_rows(
    replicate(3, x[sample(nrow(x), 6), ], FALSE),
    .id = "set"
  )
})
Output
$A
set a b sp
1 1 1.52480034 3.41257975 A
2 1 1.82542370 2.08511584 A
3 1 1.80019901 1.39279162 A
4 1 2.20765154 2.11879412 A
5 1 1.61295185 2.04035172 A
6 1 1.92936567 2.90362816 A
7 2 0.88903679 2.46948106 A
8 2 3.19223788 2.81329767 A
9 2 1.28629416 2.69275525 A
10 2 2.61044815 0.82495427 A
11 2 2.30928735 1.67421328 A
12 2 -0.09789704 2.62434719 A
13 3 2.10386603 1.78157862 A
14 3 2.17542841 0.84016203 A
15 3 3.22202227 3.49863423 A
16 3 1.07929909 -0.02032945 A
17 3 2.95271838 2.34460193 A
18 3 1.90414536 1.54089645 A
$B
set a b sp
1 1 3.5130317 -0.4704879 B
2 1 3.0053072 1.6021795 B
3 1 4.1167657 1.1123342 B
4 1 1.5460589 3.2915979 B
5 1 0.8742753 0.9132530 B
6 1 2.0882660 1.5588471 B
7 2 1.2444645 1.8199525 B
8 2 2.7960117 2.6657735 B
9 2 2.5970774 0.9984187 B
10 2 1.1977317 3.7360884 B
11 2 2.2830643 1.0452440 B
12 2 3.1047150 1.5609482 B
13 3 2.9309124 1.5679255 B
14 3 0.8631965 1.3501631 B
15 3 1.5460589 3.2915979 B
16 3 2.7960117 2.6657735 B
17 3 3.1047150 1.5609482 B
18 3 2.8735390 0.6329279 B
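If a single data frame is preferred over the per-species list, the list can be collapsed with one more bind_rows() call — a small follow-up sketch, where res is just a name for the result of the call above:
res <- lapply(split(df, df$sp), function(x) {
  dplyr::bind_rows(
    replicate(3, x[sample(nrow(x), 6), ], FALSE),
    .id = "set"
  )
})
dplyr::bind_rows(res)  # the sp column already identifies each species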
If I understood what you want correctly, it can be done with the following code:
# Create the initial data frame
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# Row indices for sp = A and sp = B
row.A <- which(df$sp=="A")
row.B <- which(df$sp=="B")
# Sampling data.frame
sampling <- data.frame(matrix(ncol = 5, nrow = 0))
# "rep" column for each iteration
rep1 <- rep(1:6,2)
# Build the data.frame
for(i in 1:500){
  # Sampling row.A
  s.A <- sample(row.A, 6, replace = TRUE)
  # Sampling row.B
  s.B <- sample(row.B, 6, replace = TRUE)
  # Data frame with the subset of df and the "set" and "rep" values
  sampling <- rbind(sampling, cbind(set = rep(i, 12), rep = rep1, df[c(s.A, s.B),]))
}
# Delete row.names of sampling and redefine sampling's column names
row.names(sampling) <- NULL
colnames(sampling) <- c("set", "rep", "a", "b", "sp")
And the output looks like this:
set rep a b sp
1 1 3.713663 2.717456 A
1 2 2.456070 2.803443 A
1 3 2.166655 1.395556 A
1 4 1.453738 5.662969 A
1 5 2.692518 2.971156 A
1 6 2.699634 3.016791 A
Say I have a matrix like the following, with marker values per id, 10 events per id (in this example):
set.seed(123)
mymat <- matrix(rnorm(300), nrow=30)
rownames(mymat) <- paste0('id',rep(1:3,each=10))
colnames(mymat) <- letters[1:10]
> head(mymat)
a b c d e f g h i j
id1 -0.56047565 0.4264642 0.3796395 0.9935039 0.1176466 0.7877388 -1.0633261 0.1192452 -0.7886220 0.8450130
id1 -0.23017749 -0.2950715 -0.5023235 0.5483970 -0.9474746 0.7690422 1.2631852 0.2436874 -0.5021987 0.9625280
id1 1.55870831 0.8951257 -0.3332074 0.2387317 -0.4905574 0.3322026 -0.3496504 1.2324759 1.4960607 0.6843094
id1 0.07050839 0.8781335 -1.0185754 -0.6279061 -0.2560922 -1.0083766 -0.8655129 -0.5160638 -1.1373036 -1.3952743
id1 0.12928774 0.8215811 -1.0717912 1.3606524 1.8438620 -0.1194526 -0.2362796 -0.9925072 -0.1790516 0.8496430
id1 1.71506499 0.6886403 0.3035286 -0.6002596 -0.6519499 -0.2803953 -0.1971759 1.6756969 1.9023618 -0.4465572
And an associated data frame of cutoff values (a min and a max cutoff per id and marker), like this one:
cutoff_df <- data.frame(id=paste0('id',rep(1:3,each=10)), marker=rep(letters[1:10],3), min=runif(30, 0, 2), max=runif(30, 5, 7))
> head(cutoff_df)
id marker min max
1 id1 a 0.4744594 6.518271
2 id1 b 1.3729807 6.689669
3 id1 c 0.4516368 5.915843
4 id1 d 0.6369892 6.459263
5 id1 e 0.3479676 5.208157
6 id1 f 1.6028592 5.439966
What I want to do here is calculate a frequency table, so that I record the percentage of events per id and marker that fall within the cutoffs for that id and marker.
This is my attempt using some ugly nested loops... wondering if there is a nicer and cleaner way to do this, ideally with base functions, data.table, or tidyr...
My ugly code:
freq_mat <- matrix(nrow=length(unique(rownames(mymat))))
rownames(freq_mat) <- unique(rownames(mymat))
for (mk in colnames(mymat)){
  mk_freq <- NULL
  for (id in unique(rownames(mymat))){
    data <- mymat[rownames(mymat)==id, mk]
    min <- cutoff_df$min[cutoff_df$id==id & cutoff_df$marker==mk]
    max <- cutoff_df$max[cutoff_df$id==id & cutoff_df$marker==mk]
    ins <- length(data[data>=min & data<=max])
    freq <- ins/length(data)*100
    mk_freq <- c(mk_freq, freq)
  }
  mk_freq <- as.data.frame(mk_freq)
  names(mk_freq) <- mk
  freq_mat <- cbind(freq_mat, mk_freq)
}
> freq_mat
freq_mat a b c d e f g h i j
id1 NA 20 0 20 40 10 0 30 10 20 30
id2 NA 10 30 30 0 20 10 10 0 0 70
id3 NA 0 0 0 0 30 10 30 10 30 60
Something like this? Here, the sum of all cells is 100.
library(tidyverse)
set.seed(123)
mymat <- matrix(rnorm(300), nrow = 30)
rownames(mymat) <- paste0("id", rep(1:3, each = 10))
colnames(mymat) <- letters[1:10]
cutoff_df <- data.frame(
id = paste0("id", rep(1:3, each = 10)),
marker = rep(letters[1:10], 3), min = runif(30, 0, 2), max = runif(30, 5, 7)
)
mymat %>%
  as_tibble(rownames = "id") %>%
  pivot_longer(-id, names_to = "marker") %>%
  left_join(cutoff_df) %>%
  filter(value <= max & value >= min) %>%
  count(id, marker) %>%
  # group_by(marker) %>% # e.g. to make sum of 100 per marker
  mutate(n = n / sum(n) * 100) %>%
  pivot_wider(names_from = marker, values_from = n, values_fill = list(n = 0))
#> Joining, by = c("id", "marker")
#> # A tibble: 3 × 11
#> id a c d e g h i j b f
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 id1 3.77 3.77 7.55 1.89 5.66 1.89 3.77 5.66 0 0
#> 2 id2 1.89 5.66 0 3.77 1.89 0 0 13.2 5.66 1.89
#> 3 id3 0 0 0 5.66 5.66 1.89 5.66 11.3 0 1.89
Created on 2022-03-30 by the reprex package (v2.0.0)
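If instead each cell should be the percentage of that id's own events that fall within the cutoffs (the asker's original freq_mat; 10 events per id in this example), a variant sketch using the same long format:
mymat %>%
  as_tibble(rownames = "id") %>%
  pivot_longer(-id, names_to = "marker") %>%
  left_join(cutoff_df, by = c("id", "marker")) %>%
  group_by(id, marker) %>%
  summarise(pct = 100 * mean(value >= min & value <= max), .groups = "drop") %>%
  pivot_wider(names_from = marker, values_from = pct)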
Here is a solution based on the purrr package. I'm not sure that it is cleaner but it is shorter.
library(purrr)
library(dplyr)  # needed for filter() below

asplit(mymat, 2) |>
  imap(~{
    with(filter(cutoff_df, marker == .y),
         outer(.x, min, ">=") &
           outer(.x, max, "<") &   # note: strict "<" on the upper cutoff here
           outer(names(.x), id, "=="))
  }) |>
  map(rowSums) |>
  map_dfr(~tapply(.x, names(.x), FUN = sum),
          .id = "marker")
##> + # A tibble: 10 × 4
##> marker id1 id2 id3
##> <chr> <dbl> <dbl> <dbl>
##> 1 a 2 1 0
##> 2 b 0 3 0
##> 3 c 2 3 0
##> 4 d 4 0 0
##> 5 e 1 2 3
##> 6 f 0 1 1
##> 7 g 3 1 3
##> 8 h 1 0 1
##> 9 i 2 0 3
##> 10 j 3 7 6
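To turn these counts into the asker's percentages, divide by the number of events per id (10 in this example) — a hedged follow-up, with res standing for the tibble above:
res |> dplyr::mutate(dplyr::across(-marker, ~ 100 * .x / 10))  # assumes 10 events per id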
Another possible solution, based on dplyr:
library(dplyr)
data.frame(id = rownames(mymat), mymat, row.names = NULL) %>%  # row.names = NULL avoids duplicate row-name errors
  group_by(id) %>%
  summarise(across(everything(),
    ~ sum(.x >= cutoff_df[(cutoff_df$id == cur_group()$id[1]) & (cur_column() == cutoff_df$marker), 3] &
          .x <= cutoff_df[(cutoff_df$id == cur_group()$id[1]) & (cur_column() == cutoff_df$marker), 4]))) %>%
  mutate(aux = sum(cur_data()[,-1]), across(-id, ~ .x*100/aux[1]), aux = NULL)
#> # A tibble: 3 x 11
#> id a b c d e f g h i j
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 id1 3.77 0 3.77 7.55 1.89 0 5.66 1.89 3.77 5.66
#> 2 id2 1.89 5.66 5.66 0 3.77 1.89 1.89 0 0 13.2
#> 3 id3 0 0 0 0 5.66 1.89 5.66 1.89 5.66 11.3
Let's say df presents an aggregated metric in an AB test with groups A and B. x is, for example, the number of page visits, and n the number of users with that number of visits. (In reality there are many more users and the differences are small.) Note that the number of users per group differs.
library(tidyverse)
df <- bind_rows(
tibble(group = "A", x = rpois(100, 1)),
tibble(group = "B", x = rpois(200, 2))
) %>%
count(group, x)
I want to compare tiles of users. By tile, I mean users in group A that share the same x value.
For example, if 34.17% of users in group A have the value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, users with 1 visit in group A fall, for example, between 34.17% and 74.8%; I want to compare them with the same percentile range of users in group B (though it should be more precise). Etc...
Here's my try:
n_fake <- 1000
df_agg_per_imp <- df %>%
  group_by(group) %>%
  mutate(
    p_max = n_fake * cumsum(n) / sum(n),
    p_min = lag(p_max, default = 0),
    p = map2(p_min + 1, p_max, seq)
  ) %>%
  ungroup()
df_agg_per_imp %>%
  unnest(p) %>%
  pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
  group_by(A) %>%
  summarise(
    p_min = min(p) / n_fake,
    p_max = max(p) / n_fake,
    rel_uplift = mean(B) / mean(A)
  )
#> # A tibble: 6 × 4
#> A p_min p_max rel_uplift
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.001 0.34 Inf
#> 2 1 0.341 0.74 1.92
#> 3 2 0.741 0.91 1.57
#> 4 3 0.911 0.96 1.33
#> 5 4 0.961 0.99 1.21
#> 6 5 0.991 1 1.2
What I don't like is that I have to create a row for each user (and this could be millions) to get the results I want. Is there a simpler/better way to do it?
You may be able to do something like this:
First, extend the creation of your initial frame to get the proportions in A and B, and pivot wider:
set.seed(123)
df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>%
  count(group, x) %>%
  group_by(group) %>%
  mutate(prop = n/sum(n)) %>%
  pivot_wider(id_cols = x, names_from = group, values_from = prop)
With the seed above, this gives you a frame like this:
# A tibble: 7 x 3
x A B
<int> <dbl> <dbl>
1 0 0.35 0.095
2 1 0.38 0.33
3 2 0.21 0.285
4 3 0.04 0.14
5 4 0.02 0.085
6 5 NA 0.055
7 6 NA 0.01
Next, create a function that estimates the rel_uplift, while also returning an updated set of group B proportions and group B values (i.e. x values):
f <- function(a, aval, bvec, bvals) {
  cindex = which(cumsum(bvec) >= a)
  if(length(cindex) == 0) bindex = seq_along(bvec)
  else bindex = 1:min(cindex)
  rem = sum(bvec[bindex]) - a
  bmean = sum(bvals[bindex] * (bvec[bindex] - c(rep(0, length(bindex)-1), rem)))
  if(length(bindex) > 1) {
    if(rem != 0) bindex = bindex[1:(length(bindex)-1)]
    bvec = bvec[-bindex]
    bvals = bvals[-bindex]
  }
  bvec[1] = rem
  list("rel_uplift" = bmean/(a*aval), "bvec" = bvec, "bvals" = bvals)
}
Initialize an empty data frame, plus a list called fres which contains the initial bvec and bvals:
result=data.frame()
fres = list("bvec" = df$B,"bvals" = df$x)
Use a for loop over the values of df$A, each time getting the rel_uplift and preparing an updated set of bvec and bvals to be fed back into the function:
for(a in df %>% filter(!is.na(A)) %>% pull(A)) {
  x = df %>% filter(A == a) %>% pull(x)
  fres = f(a, x, fres[["bvec"]], fres[["bvals"]])
  result = rbind(result, data.frame(x = x, A = a, rel_uplift = fres[["rel_uplift"]]))
}
result
x A rel_uplift
1 0 0.35 Inf
2 1 0.38 1.855263
3 2 0.21 1.726190
4 3 0.04 1.666667
5 4 0.02 1.375000
If I understand right, you want to compare counts by two parameters simultaneously, i.e. by $group and by $x.
From the example in the initial post I see that not every value of $x may be available for each group.
Summarizing by two co-variables can be done with base R.
Here is a simple function (assuming that you're always looking at $group and $x):
countnByGroup <- function(xx, asPercent=FALSE) {
  lev <- unique(xx$x)
  grp <- unique(xx$group)
  out <- sapply(grp, function(x) {
    z <- rep(NA, length(lev)); names(z) <- lev
    w <- which(xx$group==x)
    if(length(w) > 0) z[match(xx$x[w], lev)] <- xx$n[w]
    z })
  if(asPercent) out <- 100*apply(out, 2, function(x) x/sum(x, na.rm=TRUE))
  out }
Note: in the function above the main variable was called 'xx' to avoid confusion with $x.
df # produced using the code from your example
## A tibble: 13 x 3
# group x n
# <chr> <int> <int>
# 1 A 0 36
# 2 A 1 38
# 3 A 2 19
# 4 A 3 6
# 5 A 4 1
# 6 B 0 27
# 7 B 1 44
# 8 B 2 55
# 9 B 3 44
#10 B 4 21
#11 B 5 6
#12 B 6 2
#13 B 8 1
One gets:
countnByGroup(df)
# A B
#0 36 27
#1 38 44
#2 19 55
#3 6 44
#4 1 21
#5 NA 6
#6 NA 2
#8 NA 1
## and
countnByGroup(df, asPercent=T)
# A B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3 6 22.0
#4 1 10.5
#5 NA 3.0
#6 NA 1.0
#8 NA 0.5
As long as you don't apply any rounding, the results are as precise as it gets.
By chance the random values from above didn't produce more digits when processing, so the percent values for A happen to be all integers.
Another interesting option may be to consider two-way tables in R using table().
But in this case you need your entries as separate lines, not already aggregated into counts as in your example above.
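A minimal sketch of that route, assuming the counted data frame df from the example (columns group, x, n): expand each count back into n individual rows, then tabulate.
df_rows <- df[rep(seq_len(nrow(df)), df$n), c("group", "x")]
tab <- table(df_rows$group, df_rows$x)
tab                                          # counts by group and x
round(100 * prop.table(tab, margin = 1), 1)  # row percentages per group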
I have a dataset containing about 60 variables (A, B, C, D, ...), each with 3 corresponding information columns (A, Group_A and WOE_A) as in the list below:
ID A Group_A WOE_A B Group_B WOE_B C Group_C WOE_C D Group_D WOE_D Status
213 0 1 0.87 0 1 0.65 0 1 0.80 915.7 4 -0.30 1
321 12 5 0.08 4 4 -0.43 6 5 -0.20 85.3 2 0.26 0
32 0 1 0.87 0 1 0.65 0 1 0.80 28.6 2 0.26 1
13 7 4 -0.69 2 3 -0.82 4 4 -0.80 31.8 2 0.26 0
43 1 2 -0.04 1 2 -0.49 1 2 -0.22 51.7 2 0.26 0
656 2 3 -0.28 2 3 -0.82 2 3 -0.65 8.5 1 1.14 0
435 2 3 -0.28 0 1 0.65 0 1 0.80 39.8 2 0.26 0
65 8 4 -0.69 3 4 -0.43 5 4 -0.80 243.0 3 0.00 0
565 0 1 0.87 0 1 0.65 0 1 0.80 4.0 1 1.14 0
432 0 1 0.87 0 1 0.65 0 1 0.80 81.6 2 0.26 0
I want to print a table in R with some statistics (Min(A), Max(A), WOE_A, Count(Group_A), Count(Group_A, where Status=1), Count(Group_A, where Status=0)), all grouped by Group for each of the 60 variables and I think I need to perform it in a loop.
I tried the "dplyr" package, but I don't know how to refer to all the three columns (A, Group_A and WOE_A) that relate to a variable (A) and also how to summarize the information for all the desired statistics.
The code I began with is:
df <- data
List <- list(df)
for (colname in colnames(df)) {
  List[[colname]] <- df %>%
    group_by(df[, colname]) %>%
    count()
}
List
List
This is how I want to print results:
Var A
Group  Min(A)  Max(A)  WOE_A  Count(Group_A)  Count_1(Group_A, where Status=1)  Count_0(Group_A, where Status=0)
1
2
3
4
5
Thank you very much!
Laura
Laura, as mentioned by the others, working with "long" data frames is better than with wide data frames.
Your initial idea using dplyr and group_by() got you almost there.
Note: this is also a way to break down your data and then combine the results over generic columns, if the wide-to-long conversion pushes the limits.
Let's start with this:
library(dplyr)
#---------- extract all "A" measurements
df %>%
  select(A, Group_A, WOE_A, Status) %>%
  #---------- grouped summary of multiple stats
  group_by(A) %>%
  summarise(
    Min = min(A)
    , Max = max(A)
    , WOE_A = unique(WOE_A)
    , Count = n()                      # n() is a helper function of dplyr
    , CountStatus1 = sum(Status == 1)  # use sum() to count logical conditions
    , CountStatus0 = sum(Status == 0)
  )
This yields:
# A tibble: 6 x 7
A Min Max WOE_A Count CountStatus1 CountStatus0
<dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 0 0 0 0.87 4 2 2
2 1 1 1 -0.04 1 0 1
3 2 2 2 -0.28 2 0 2
4 7 7 7 -0.69 1 0 1
5 8 8 8 -0.69 1 0 1
6 12 12 12 0.08 1 0 1
OK. Turning your wide data frame into a long one is not trivial, because measurements and variable names are nested. On top of that, ID and Status are id/key variables for each row.
The standard tool to convert wide to long is tidyr's pivot_longer(). Read up on this.
In your particular case we want to push multiple columns into multiple targets. For this you need to get a feel for the .value sentinel. The pivot_longer() help pages are useful for studying this case.
To ease the pain of constructing a complex regex to decode the variable names, I rename your group-id labels, e.g. A, B, to X_A, X_B. This ensures that all column names are built in the form `what_letter`.
library(tidyr)
df %>%
  # ----------- prepare variable names to be well-formed; you may do this upstream
  rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
  # ----------- call pivot_longer() with the .value sentinel and names_pattern --
  # ----------- an advanced use of its capabilities
  pivot_longer(
    cols = -c("ID", "Status")         # apply to all cols besides ID and Status
    , names_to = c(".value", "label") # target column names are based on the origin names
                                      # and an individual label (think id; name as you like)
    , names_pattern = "(.*)(.*_[A-D]{1})$"  # regex for the origin column patterns;
                                            # built of 2 parts (...)(...):
                                            # (.*) none or any symbols, possibly multiple times
                                            # (.*_[A-D]{1}) as above, but ending with an underscore and 1 letter
  )
This gives you
# A tibble: 40 x 6
ID Status label X Group WOE
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 213 1 _A 0 1 0.87
2 213 1 _B 0 1 0.65
3 213 1 _C 0 1 0.8
4 213 1 _D 916. 4 -0.3
5 321 0 _A 12 5 0.08
6 321 0 _B 4 4 -0.43
7 321 0 _C 6 5 -0.2
8 321 0 _D 85.3 2 0.26
9 32 1 _A 0 1 0.87
10 32 1 _B 0 1 0.65
Putting it all together:
df %>%
  # ------------ prepare and make long
  rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
  pivot_longer(cols = -c("ID", "Status")
               , names_to = c(".value", "label")
               , names_pattern = "(.*)(.*_[A-D]{1})$") %>%
  # ------------- calculate stats on groups
  group_by(label, X) %>%
  summarise(Min = min(X), Max = max(X), WOE = unique(WOE)
            , Count = n(), CountStatus1 = sum(Status == 1)
            , CountStatus0 = sum(Status == 0)
  )
Voila:
# A tibble: 27 x 8
# Groups: label [4]
label X Min Max WOE Count CountStatus1 CountStatus0
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 _A 0 0 0 0.87 4 2 2
2 _A 1 1 1 -0.04 1 0 1
3 _A 2 2 2 -0.28 2 0 2
4 _A 7 7 7 -0.69 1 0 1
5 _A 8 8 8 -0.69 1 0 1
6 _A 12 12 12 0.08 1 0 1
7 _B 0 0 0 0.65 5 2 3
8 _B 1 1 1 -0.49 1 0 1
9 _B 2 2 2 -0.82 2 0 2
10 _B 3 3 3 -0.43 1 0 1
# ... with 17 more rows
The loop that I managed to put together is below.
Apart from the tables that I wanted to list, I also needed to make a chart showing some of the information from each table, and then print a PDF with each variable's table and chart on a separate page.
data <- as.data.frame(data)
# 5 is the column where my first information related to a variable is, so for each variable I am building the data with its related columns
i <- 5
# each variable has 3 columns (Value, Group, WOE)
for (i in seq(5, 223, 3)){
  ID <- data[, 1]
  A <- data[, i]
  Group <- data[, i+1]
  WOE <- data[, i+2]
  Status <- data[, 224]
  df <- cbind(ID, A, Group, WOE, Status)
  df <- data.frame(df)
  # Perform table T with its corresponding statistics
  T <- df %>%
    select(A, Group, WOE, Status) %>%
    group_by(Group) %>%
    summarise(
      Min = min(A, na.rm=TRUE), Max = max(A, na.rm=TRUE), WOE = unique(WOE),
      Count = n(),
      CountStatus1 = sum(Status == 1),
      CountStatus0 = sum(Status == 0),
      BadRate = round((CountStatus1/Count)*100, 1))
  print(colnames(data)[i])
  print(T)
  # Then I plot some information from table T
  p <- ggplot(T) + geom_col(aes(x=Group, y=CountStatus1), size = 1, color = "darkgreen", fill = "darkgreen")
  p <- p + geom_line(aes(x=Group, y=WOE*1000), col="firebrick", size=0.9) +
    geom_point(aes(x=Group, y=WOE*1000), col="gray", size=3) +
    ggtitle(label = paste("WOE and Event Count by Group", " - ", colnames(data)[i])) +
    labs(x = "Group", y = "Event Count", size=7) +
    theme(plot.title = element_text(size=8, face="bold", margin = margin(10, 0, 10, 0)),
          axis.text.x = element_text(angle=0, hjust = 1)) +
    scale_y_continuous(sec.axis = sec_axis(trans = ~ . /1000, name="WOE", breaks = seq(-3, 5, 0.5)))
  print(p)
}
The information is printed for all the variables that I need as in the pictures below:
Table for one of the variables
Chart for the same variable
However, I now have a problem exporting the results to a PDF: I do not know how to print each table and chart on its own page.
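One way to handle this (a sketch, not part of the original code): open a pdf() graphics device before the loop and close it with dev.off() afterwards. Each printed ggplot then starts a new page, and the summary table can be drawn as a graphical object via gridExtra::tableGrob(), so every iteration emits one page holding both:
library(gridExtra)  # extra package assumed here, for tableGrob()/grid.arrange()

pdf("results.pdf", width = 8, height = 11)   # hypothetical output file
for (i in seq(5, 223, 3)){
  # ... build T and p exactly as in the loop above ...
  grid.arrange(tableGrob(T), p, ncol = 1)    # one page per variable: table on top, chart below
}
dev.off()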
Each segment has a different range; for example, A runs from 1 to 3 while C runs from 1 to 7.
For each segment there can be missing time points, for which I want to perform interpolation (linear, spline, etc).
How can I do this within dplyr?
have <- data.frame(time =c(1,3,1,2,5,1,3,5,7),
segment=c('A','A','B','B','B','C','C','C','C'),
toInterpolate= c(0.12,0.31,0.15,0.24,0.55,0.11,0.35,0.53,0.79))
have
want <- data.frame(time =c(1,2,3,1,2,3,4,5,1,2,3,4,5,6,7),
segment=c('A','A','A','B','B','B','B','B','C','C','C','C','C','C','C'),
Interpolated= c(0.12,0.21,0.31,0.15,0.24,0.34,0.41,0.55,0.11,0.28,0.35,0.45,0.53,0.69,0.79))
# note that the interpolated values here are just randomnly put, (not based on actual linear/spline interpolation)
want
We can use complete to complete the sequence and na.spline from zoo for interpolation.
library(dplyr)
library(tidyr)
library(zoo)
have %>%
  group_by(segment) %>%
  complete(time = min(time):max(time)) %>%
  mutate(toInterpolate = na.spline(toInterpolate))
# segment time toInterpolate
# <chr> <dbl> <dbl>
# 1 A 1 0.12
# 2 A 2 0.215
# 3 A 3 0.31
# 4 B 1 0.15
# 5 B 2 0.24
# 6 B 3 0.337
# 7 B 4 0.44
# 8 B 5 0.55
# 9 C 1 0.11
#10 C 2 0.246
#11 C 3 0.35
#12 C 4 0.439
#13 C 5 0.53
#14 C 6 0.641
#15 C 7 0.79
To interpolate at a smaller granularity, complete the finer time grid first and then interpolate (the spline must run after the grid is completed, otherwise the new rows stay NA):
have %>%
  group_by(segment) %>%
  complete(time = seq(min(time), max(time), 0.1)) %>%
  mutate(toInterpolate = na.spline(toInterpolate))