Melting data, keeping certain columns paired - r

I have data as follows:
DT <- structure(list(ECOST = c("Choice_01", "Choice_02", "Choice_03",
"Choice_04", "Choice_05", "Choice_06", "Choice_07", "Choice_08",
"Choice_09", "Choice_10", "Choice_11", "Choice_12"), control = c(18,
30, 47, 66, 86, 35, 31, 46, 55, 39, 55, 41), treatment = c(31,
35, 46, 68, 86, 36, 32, 42, 52, 39, 58, 43), control_p = c(0.163636363636364,
0.272727272727273, 0.427272727272727, 0.6, 0.781818181818182,
0.318181818181818, 0.281818181818182, 0.418181818181818, 0.5,
0.354545454545455, 0.5, 0.372727272727273), treatment_p = c(0.319587628865979,
0.360824742268041, 0.474226804123711, 0.701030927835051, 0.88659793814433,
0.371134020618557, 0.329896907216495, 0.43298969072165, 0.536082474226804,
0.402061855670103, 0.597938144329897, 0.443298969072165)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 12 x 5
ECOST control treatment control_p treatment_p
<chr> <dbl> <dbl> <dbl> <dbl>
1 Choice_01 18 31 0.164 0.320
2 Choice_02 30 35 0.273 0.361
3 Choice_03 47 46 0.427 0.474
4 Choice_04 66 68 0.6 0.701
5 Choice_05 86 86 0.782 0.887
6 Choice_06 35 36 0.318 0.371
7 Choice_07 31 32 0.282 0.330
8 Choice_08 46 42 0.418 0.433
9 Choice_09 55 52 0.5 0.536
10 Choice_10 39 39 0.355 0.402
11 Choice_11 55 58 0.5 0.598
12 Choice_12 41 43 0.373 0.443
I want to melt this data, but I want the columns control and control_p to stay together, and the columns treatment and treatment_p to stay together, creating a table with 24 rows and 4 columns.
Desired result:
# A tibble: 12 x 5
ECOST count percentage group
<chr> <dbl> <dbl>
1 Choice_01 18 0.164 control
2 Choice_02 30 0.273 control
3 Choice_03 47 0.427 control
4 Choice_04 66 0.6 control
5 Choice_05 86 0.782 control
6 Choice_06 35 0.318 control
7 Choice_07 31 0.282 control
8 Choice_08 46 0.418 control
9 Choice_09 55 0.5 control
10 Choice_10 39 0.355 control
11 Choice_11 55 0.5 control
12 Choice_12 41 0.373 control
13 Choice_01 18 0.320 treatment
14 Choice_02 30 0.361 treatment
15 Choice_03 46 0.474 treatment
16 Choice_04 68 0.701 treatment
17 Choice_05 86 0.887 treatment
18 Choice_06 36 0.371 treatment
19 Choice_07 32 0.330 treatment
20 Choice_08 42 0.433 treatment
21 Choice_09 52 0.536 treatment
22 Choice_10 39 0.402 treatment
23 Choice_11 58 0.598 treatment
24 Choice_12 43 0.443 treatment

Using pivot_longer, some data wrangling and afterwards pivot_wider you could achieve your desired result like so:
library(tidyr)
library(dplyr)
DT %>%
pivot_longer(-ECOST) %>%
separate(name, into = c("group", "what")) %>%
mutate(what = ifelse(is.na(what), "count", "percentage")) %>%
pivot_wider(names_from = "what", values_from = "value")
#> # A tibble: 24 x 4
#> ECOST group count percentage
#> <chr> <chr> <dbl> <dbl>
#> 1 Choice_01 control 18 0.164
#> 2 Choice_01 treatment 31 0.320
#> 3 Choice_02 control 30 0.273
#> 4 Choice_02 treatment 35 0.361
#> 5 Choice_03 control 47 0.427
#> 6 Choice_03 treatment 46 0.474
#> 7 Choice_04 control 66 0.6
#> 8 Choice_04 treatment 68 0.701
#> 9 Choice_05 control 86 0.782
#> 10 Choice_05 treatment 86 0.887
#> # … with 14 more rows
Created on 2021-02-21 by the reprex package (v1.0.0)

You could rename the columns so that you have clear distinction between count and percentage columns and then use pivot_longer
library(dplyr)
library(tidyr)
DT %>%
rename_with(~paste(sub('_.*', '', .),
rep(c('count', 'percentage'), each = 2), sep = '_'), -1) %>%
pivot_longer(cols = -ECOST,
names_to = c('group', '.value'),
names_sep = '_')
# A tibble: 24 x 4
# ECOST group count percentage
# <chr> <chr> <dbl> <dbl>
# 1 Choice_01 control 18 0.164
# 2 Choice_01 treatment 31 0.320
# 3 Choice_02 control 30 0.273
# 4 Choice_02 treatment 35 0.361
# 5 Choice_03 control 47 0.427
# 6 Choice_03 treatment 46 0.474
# 7 Choice_04 control 66 0.6
# 8 Choice_04 treatment 68 0.701
# 9 Choice_05 control 86 0.782
#10 Choice_05 treatment 86 0.887
# … with 14 more rows

Here is a data.table approach with a workaround for the limitation/feature of melt.data.table()
library( data.table )
setDT(DT)
#get suffixes
suffix <- unique( sub( "(^.*)(_[a-z])", "\\1", names( DT[ , -1] ) ) )
#melt
DT2 <- melt( DT, id.vars = "ECOST", measure.vars = patterns( count = "[a-oq-z]$", percentage = "_p$"))
#replace factor-levels with the colnames
setattr(DT2$variable, "levels", suffix )
ECOST variable count percentage
1: Choice_01 control 18 0.1636364
2: Choice_02 control 30 0.2727273
3: Choice_03 control 47 0.4272727
4: Choice_04 control 66 0.6000000
5: Choice_05 control 86 0.7818182
6: Choice_06 control 35 0.3181818
7: Choice_07 control 31 0.2818182
8: Choice_08 control 46 0.4181818
9: Choice_09 control 55 0.5000000
10: Choice_10 control 39 0.3545455
11: Choice_11 control 55 0.5000000
12: Choice_12 control 41 0.3727273
13: Choice_01 treatment 31 0.3195876
14: Choice_02 treatment 35 0.3608247
15: Choice_03 treatment 46 0.4742268
16: Choice_04 treatment 68 0.7010309
17: Choice_05 treatment 86 0.8865979
18: Choice_06 treatment 36 0.3711340
19: Choice_07 treatment 32 0.3298969
20: Choice_08 treatment 42 0.4329897
21: Choice_09 treatment 52 0.5360825
22: Choice_10 treatment 39 0.4020619
23: Choice_11 treatment 58 0.5979381
24: Choice_12 treatment 43 0.4432990
ECOST variable count percentage

Related

Divide columns by a reference row

I need to divide columns despesatotal and despesamonetaria by the row named Total:
Lets suppose your data set is df.
# 1) Delete the last row
df <- df[-nrow(df),]
# 2) Build the desired data.frame [combining the CNAE names and the proportion columns
new.df <- cbind(grup_CNAE = df$grup_CNAE,
100*prop.table(df[,-1],margin = 2))
Finally, rename your columns. Be careful with the matrix or data.frame formats, because sometimes mathematical operations may suppose a problem. If you you use dput function in order to give us a reproducible example, the answer would be more accurate.
Here is a way to get it done. This is not the best way, but I think it is very readable.
Suppose this is your data frame:
mydf = structure(list(grup_CNAE = c("A", "B", "C", "D", "E", "Total"
), despesatotal = c(71, 93, 81, 27, 39, 311), despesamonetaria = c(7,
72, 36, 22, 73, 210)), row.names = c(NA, -6L), class = "data.frame")
mydf
# grup_CNAE despesatotal despesamonetaria
#1 A 71 7
#2 B 93 72
#3 C 81 36
#4 D 27 22
#5 E 39 73
#6 Total 311 210
To divide despesatotal values with its total value, you need to use the total value (311 in this example) as the denominator. Note that the total value is located in the last row. You can identify its position by indexing the despesatotal column and use nrow() as the index value.
mydf |> mutate(percentage1 = despesatotal/despesatotal[nrow(mydf)],
percentage2 = despesamonetaria /despesamonetaria[nrow(mydf)])
# grup_CNAE despesatotal despesamonetaria percentage1 percentage2
#1 A 71 7 0.22829582 0.03333333
#2 B 93 72 0.29903537 0.34285714
#3 C 81 36 0.26045016 0.17142857
#4 D 27 22 0.08681672 0.10476190
#5 E 39 73 0.12540193 0.34761905
#6 Total 311 210 1.00000000 1.00000000
library(tidyverse)
Sample data
# A tibble: 11 x 3
group despesatotal despesamonetaria
<chr> <int> <int>
1 1 198 586
2 2 186 525
3 3 202 563
4 4 300 562
5 5 126 545
6 6 215 529
7 7 183 524
8 8 163 597
9 9 213 592
10 10 175 530
11 Total 1961 5553
df %>%
mutate(percentage_total = despesatotal / last(despesatotal),
percentage_monetaria = despesamonetaria/ last(despesamonetaria)) %>%
slice(-nrow(.))
# A tibble: 10 x 5
group despesatotal despesamonetaria percentage_total percentage_monetaria
<chr> <int> <int> <dbl> <dbl>
1 1 198 586 0.101 0.106
2 2 186 525 0.0948 0.0945
3 3 202 563 0.103 0.101
4 4 300 562 0.153 0.101
5 5 126 545 0.0643 0.0981
6 6 215 529 0.110 0.0953
7 7 183 524 0.0933 0.0944
8 8 163 597 0.0831 0.108
9 9 213 592 0.109 0.107
10 10 175 530 0.0892 0.0954
This is a good place to use dplyr::mutate(across()) to divide all relevant columns by the Total row. Note this is not sensitive to the order of the rows and will apply the manipulation to all numeric columns. You can supply any tidyselect semantics to across() instead if needed in your case.
library(tidyverse)
# make sample data
d <- tibble(grup_CNAE = paste0("Group", 1:12),
despesatotal = sample(1e6:5e7, 12),
despesamonetaria = sample(1e6:5e7, 12)) %>%
add_row(grup_CNAE = "Total", summarize(., across(where(is.numeric), sum)))
# divide numeric columns by value in "Total" row
d %>%
mutate(across(where(is.numeric), ~./.[grup_CNAE == "Total"]))
#> # A tibble: 13 × 3
#> grup_CNAE despesatotal despesamonetaria
#> <chr> <dbl> <dbl>
#> 1 Group1 0.117 0.0204
#> 2 Group2 0.170 0.103
#> 3 Group3 0.0451 0.0837
#> 4 Group4 0.0823 0.114
#> 5 Group5 0.0170 0.0838
#> 6 Group6 0.0174 0.0612
#> 7 Group7 0.163 0.155
#> 8 Group8 0.0352 0.0816
#> 9 Group9 0.0874 0.135
#> 10 Group10 0.113 0.0877
#> 11 Group11 0.0499 0.0495
#> 12 Group12 0.104 0.0251
#> 13 Total 1 1
Created on 2022-11-08 with reprex v2.0.2

Calculate Percentage by Group with multiple columns in R

I have several data frames with monthly data, I would like to find the percentage distribution for each product and for each month. I have problem with multiple columns with months. Currently, I can only get a percentage by group for one month.
data <- data.frame(group = rep(LETTERS[1:3], each = 4),
Product = letters[1:4],
January = sample(1:100,12),
February = sample(1:100,12))
data_new1 <- transform(data,
perc = ave(January,
group,
FUN = prop.table))
data_new1$perc<-round(data_new1$perc, 2)
> data_new1
group Product January February perc
1 A a 12 16 0.05
2 A b 73 75 0.32
3 A c 78 11 0.34
4 A d 65 35 0.29
5 B a 86 63 0.36
6 B b 33 71 0.14
7 B c 92 49 0.38
8 B d 30 60 0.12
9 C a 91 59 0.37
10 C b 31 45 0.12
11 C c 99 7 0.40
12 C d 28 50 0.11
tidyverse
library(dplyr)
data %>%
group_by(group) %>%
mutate(across(c("January", "February"), proportions, .names = "{.col}_perc")) %>%
ungroup()
# A tibble: 12 x 6
group Product January February January_perc February_perc
<chr> <chr> <int> <int> <dbl> <dbl>
1 A a 49 40 0.426 0.252
2 A b 1 3 0.00870 0.0189
3 A c 19 50 0.165 0.314
4 A d 46 66 0.4 0.415
5 B a 61 82 0.218 0.285
6 B b 88 51 0.314 0.177
7 B c 32 75 0.114 0.260
8 B d 99 80 0.354 0.278
9 C a 6 31 0.0397 0.373
10 C b 8 5 0.0530 0.0602
11 C c 92 20 0.609 0.241
12 C d 45 27 0.298 0.325
base
data <- data.frame(group = rep(LETTERS[1:3], each = 4),
Product = letters[1:4],
January = sample(1:100,12),
February = sample(1:100,12))
tmp <- sapply(c("January", "February"), function (x) ave(data[[x]], data$group, FUN = prop.table))
colnames(tmp) <- paste0(colnames(tmp), "_perc")
res <- cbind(data, tmp)
res
#> group Product January February January_perc February_perc
#> 1 A a 42 73 0.18260870 0.238562092
#> 2 A b 67 92 0.29130435 0.300653595
#> 3 A c 58 90 0.25217391 0.294117647
#> 4 A d 63 51 0.27391304 0.166666667
#> 5 B a 48 15 0.21621622 0.081521739
#> 6 B b 16 82 0.07207207 0.445652174
#> 7 B c 80 75 0.36036036 0.407608696
#> 8 B d 78 12 0.35135135 0.065217391
#> 9 C a 81 16 0.32793522 0.117647059
#> 10 C b 83 81 0.33603239 0.595588235
#> 11 C c 11 1 0.04453441 0.007352941
#> 12 C d 72 38 0.29149798 0.279411765
Created on 2021-12-20 by the reprex package (v2.0.1)
data.table
library(data.table)
COLS <- c("January", "February")
COLS_RES <- paste0(COLS, "_perc")
setDT(data)[, (COLS_RES) := lapply(.SD, proportions), by = group, .SDcol = COLS][]
These calculations are easier if your data is structured in a tidy way. In your case, January and February should probably be one single variable called month or something.
Example:
Underneath, I use tidyr::pivot_longer() to combine January and February into one column. Then I use the package dplyr to group the dataframe and calculate perc. I'm not using prop.table(), but I believe you just want the proportion of observation to the total of that group and month.
library(dplyr)
library(tidyr)
# To make the sampling underneath reproducable
set.seed(1)
data <- data.frame(
group = rep(LETTERS[1:3], each = 4),
Product = letters[1:4],
January = sample(1:100,12),
February = sample(1:100,12)
)
data %>%
pivot_longer(c(January, February), names_to = "month", values_to = "x") %>%
group_by(group, month) %>%
mutate(
perc = round(x/sum(x), 2)
)
I hope this is what you were looking for.
Another dplyr solution:
library(dplyr)
data %>%
group_by(group) %>%
mutate(across(c(2:5),
~./sum(.)*100, .names = "{.col}_pct"))
# A tibble: 12 × 10
# Groups: group [3]
group Product Jan Feb Mar May Jan_pct Feb_pct Mar_pct May_pct
<chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 A a 14 14 95 50 8 18.4 44.4 20.9
2 A b 100 33 28 32 57.1 43.4 13.1 13.4
3 A c 11 16 13 95 6.29 21.1 6.07 39.7
4 A d 50 13 78 62 28.6 17.1 36.4 25.9
5 B a 29 42 72 13 22.0 33.9 20.3 7.07
6 B b 3 4 88 41 2.27 3.23 24.9 22.3
7 B c 30 68 94 86 22.7 54.8 26.6 46.7
8 B d 70 10 100 44 53.0 8.06 28.2 23.9
9 C a 4 88 45 84 3.96 43.6 24.2 30.7
10 C b 52 12 26 55 51.5 5.94 14.0 20.1
11 C c 26 20 23 57 25.7 9.90 12.4 20.8
12 C d 19 82 92 78 18.8 40.6 49.5 28.5
Data:
data <- data.frame(group = rep(LETTERS[1:3], each = 4),
Product = letters[1:4],
Jan = sample(1:100,12),
Feb = sample(1:100,12),
Mar = sample(1:100, 12),
May = sample(1:100, 12))

R:dplyr summarise data by group with nth() call with variable n calculated during aggregation

I'm aggregating data with variable bin sizes (see previous question here: R: aggregate every n rows with variable n depending on sum(n) of second column). In addition to calculating sums and means over groups of variable ranges, I need to pull out single-value covariates at the midpoint of each group range. When I try to do this on the fly, I only get a value for the first group and NAs for the remaining.
df.summary<-as.data.frame(df %>%
mutate(rn = row_number()) %>%
group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
summarise(x=mean(x, na.rm = TRUE), d=sum(d, na.rm=T), ,i.start=first(rn), i.end=last(rn), y=nth(y, round(first(rn)+(last(rn)-first(rn))/2-1))))
head(df.summary)
grp x d i.start i.end y
1 1 0.07458317 88.99342 1 4 19.78992
2 2 0.07594546 97.62130 5 8 NA
3 3 0.05353308 104.69683 9 12 NA
4 4 0.06498291 106.23468 13 16 NA
5 5 0.08601759 98.24939 17 20 NA
6 6 0.06262427 84.43745 21 23 NA
sample data:
structure(list(x = c(0.10000112377193, 0.110742170350877, 0.0300274304561404,
0.0575619395964912, 0.109060465438596, 0.0595491225614035, 0.0539270264912281,
0.0812452063859649, 0.0341699389122807, 0.0391744879122807, 0.0411787485614035,
0.0996091644385965, 0.0970479474912281, 0.0595715843684211, 0.0483489989122807,
0.0549631194561404, 0.0705080555964912, 0.080437472631579, 0.105883664631579,
0.0872411613684211, 0.103236660631579, 0.0381296894912281, 0.0465064491578947,
0.0936565184561403, 0.0410095752631579, 0.0311180032105263, 0.0257758157894737,
0.0354721928947368, 0.0584999394736842, 0.0241286060175439, 0.112053376666667,
0.0769823868596491, 0.0558137530526316, 0.0374491000701754, 0.0419279142631579,
0.0260257506842105, 0.0544360374561404, 0.107411071842105, 0.103873468,
0.0419322114035088, 0.0483912961052632, 0.0328373653157895, 0.0866868717719298,
0.063990467245614, 0.0799280314035088, 0.123490407070175, 0.145676836280702,
0.0292878782807018, 0.0432093036666667, 0.0203547443684211),
d = c(22.2483512600033, 22.2483529247042, 22.2483545865809,
22.2483562542823, 22.24835791863, 25.1243105415557, 25.1243148759953,
25.1243192107884, 25.1243235416981, 25.1243278750792, 27.2240858553058,
27.2240943134697, 27.2241027638674, 27.224111222031, 27.2241196741942,
24.5623431981188, 24.5623453409221, 24.5623474809012, 24.562349626705,
24.5623517696847, 28.1458125837154, 28.1458157376341, 28.1458188889053,
28.1458220452951, 28.1458251983314, 27.8293318542146, 27.8293366652115,
27.8293414829159, 27.829346292148, 27.8293511094993, 27.5271773325046,
27.5271834011289, 27.5271894694002, 27.5271955369655, 27.5272016048837,
28.0376097925214, 28.0376146410729, 28.0376194959786, 28.0376243427651,
28.0376291969647, 26.8766095768196, 26.8766122563318, 26.8766149309023,
26.8766176123562, 26.8766202925746, 27.8736950101666, 27.8736960528853,
27.8736971017815, 27.8736981446767, 27.8736991932199), y = c(19.79001,
19.789922, 19.789834, 19.789746, 19.789658, 19.78957, 19.789468,
19.789366, 19.789264, 19.789162, 19.78906, 19.78896, 19.78886,
19.78876, 19.78866, 19.78856, 19.788458, 19.788356, 19.788254,
19.788152, 19.78805, 19.787948, 19.787846, 19.787744, 19.787642,
19.78754, 19.787442, 19.787344, 19.787246, 19.787148, 19.78705,
19.786956, 19.786862, 19.786768, 19.786674, 19.78658, 19.786486,
19.786392, 19.786298, 19.786204, 19.78611, 19.786016, 19.785922,
19.785828, 19.785734, 19.78564, 19.785544, 19.785448, 19.785352,
19.785256)), row.names = c(NA, 50L), class = "data.frame")
Let's add variable z and n in summarise part. Those variables are defined as below.
df %>%
mutate(rn = row_number()) %>%
group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
summarise(x=mean(x, na.rm = TRUE),
d=sum(d, na.rm=T), ,i.start=first(rn),
i.end=last(rn),
z = round(first(rn)+(last(rn)-first(rn))/2-1),
n = n())
grp x d i.start i.end z n
<dbl> <dbl> <dbl> <int> <int> <dbl> <int>
1 1 0.0746 89.0 1 4 2 4
2 2 0.0759 97.6 5 8 6 4
3 3 0.0535 105. 9 12 10 4
4 4 0.0650 106. 13 16 14 4
5 5 0.0860 98.2 17 20 18 4
6 6 0.0626 84.4 21 23 21 3
7 7 0.0479 112. 24 27 24 4
8 8 0.0394 83.5 28 30 28 3
9 9 0.0706 110. 31 34 32 4
10 10 0.0575 112. 35 38 36 4
11 11 0.0647 83.0 39 41 39 3
12 12 0.0659 108. 42 45 42 4
13 13 0.0854 111. 46 49 46 4
14 14 0.0204 27.9 50 50 49 1
In dataframe above, n indicates sample size of each groups separated by grp. However, as you state group_by(grp), when you call nth(y, z), YOU WILL CALL Z-TH VALUE BY GROUP.
It means that for 5th group, although there exists only 4 values, you call 18th value of y. So it prints NA.
To get this easy, the most simple way I think is use n().
df %>%
mutate(rn = row_number()) %>%
group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
summarise(x=mean(x, na.rm = TRUE),
d=sum(d, na.rm=T), ,i.start=first(rn),
i.end=last(rn),
y=nth(y, round(n()/2)))
grp x d i.start i.end y
<dbl> <dbl> <dbl> <int> <int> <dbl>
1 1 0.0746 89.0 1 4 19.8
2 2 0.0759 97.6 5 8 19.8
3 3 0.0535 105. 9 12 19.8
4 4 0.0650 106. 13 16 19.8
5 5 0.0860 98.2 17 20 19.8
6 6 0.0626 84.4 21 23 19.8
7 7 0.0479 112. 24 27 19.8
8 8 0.0394 83.5 28 30 19.8
9 9 0.0706 110. 31 34 19.8
10 10 0.0575 112. 35 38 19.8
11 11 0.0647 83.0 39 41 19.8
12 12 0.0659 108. 42 45 19.8
13 13 0.0854 111. 46 49 19.8
14 14 0.0204 27.9 50 50 NA
You'll call floor(n/2)th y, which means y that locates middle of each group. Note that you can also try floor(n/2)+1.
df %>%
mutate(rn = row_number()) %>%
group_by(grp = (cumsum(d)-1)%/% 100 + 1) %>%
summarise(x=mean(x, na.rm = TRUE),
d = sum(d, na.rm=T),
i.start=first(rn),
i.end=last(rn),
y = nth(y, floor(median(rn)) - i.start))

"summarise_at" and "mutate_if" for descriptive statistics for character variables

I would like to use summarise_at and mutate_at on multiple character variables at the same time. I have looked at many examples that use integer variables, but I just can't figure it out for character variables. Directly below is the code I use to produce descriptive statistics for a character (or factor) variable.
library(tidyverse)
# First block of code
starwars %>%
group_by(gender) %>%
summarise (n = n()) %>%
mutate(totalN = (cumsum(n))) %>%
mutate(percent = round((n / sum(n)), 3)) %>%
mutate(cumpercent = round(cumsum(freq = n / sum(n)),3))
This produces:
A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
I would like to produce this same thing, but for multiple character (or factor) variables at once. In this case, let's use the variables gender and eye_color This is what I have tried:
starwars %>%
summarise_at(vars(gender, eyecolor) (n = n()) %>%
mutate_at(vars(gender, eyecolor) (totalN = (cumsum(n))) %>%
mutate_at(vars(gender", "eyecolor) (percent = round((n / sum(n)), 3)) %>%
mutate_at(vars(gender, eyecolor) (cumpercent = round(cumsum(freq = n / sum(n)),3))))))
I get the following error:
Error in eval(expr, envir, enclos) : attempt to apply non-function
I understand that there are built-in functions called using funs, but I don't want to use them. I have tried playing with the code in many different ways to get it to work, but have come up short.
What I would like to produce, is something like this:
A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
Perhaps a loop would be better? Right now I have many lines of code to generate the descriptive statistics for each character variable because I have to run the first block of code (noted above) for each variable. It would be great if I could just list the variables I would like to use and run each through the first block of code.
Based on your expected output, mutate_at is not what you want, since it mutates on the columns selected. What you wanted to do is to group_by gender and eye_color separately. This is a good place to write your summary code into a function:
library(tidyverse)
library(rlang)
summary_func = function(group_by_var){
group_by_quo = enquo(group_by_var)
starwars %>%
group_by(!!group_by_quo) %>%
summarise(n = n()) %>%
mutate(totalN = (cumsum(n)),
percent = round((n / sum(n)), 3),
cumpercent = round(cumsum(freq = n / sum(n)),3))
}
Result:
> summary_func(gender)
# A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
> summary_func(eye_color)
# A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
The idea is to make your summary code into a function so that you can apply the same code over different group_by variables. enquo from rlang takes the code supplied to group_by_var and bundles it with the environment where it was called into a quosure. You can then use !! to unquote the group_by_quo in the group_by step. This enables non-standard evaluation (i.e. typing summary_func(gender) instead of summary_func("gender").
If you don't want to call summary_func for every variable you want to group_by, you can wrap your dplyr code in map from purrr, and unquote each argument of group_by_quo supplied as ... arguments. Notice the change from enquo to quos to convert each argument of ... to a list of quosures:
summary_func = function(...){
group_by_quo = quos(...)
map(group_by_quo, ~{
starwars %>%
group_by(!!.x) %>%
summarise(n = n()) %>%
mutate(totalN = (cumsum(n)),
percent = round((n / sum(n)), 3),
cumpercent = round(cumsum(freq = n / sum(n)),3))
})
}
You can now do this:
summary_func(gender, eye_color)
or with a vector of character variable names to group_by:
group_vars = c("gender", "eye_color")
summary_func(!!!syms(group_vars))
Result:
[[1]]
# A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
[[2]]
# A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000

How to scale the dots of a graph based on their p-value in R?

I have a data.frame named df.ordered that looks like:
labels gvs order color pvals
1 Adygei -2.3321916 1 1 0.914
2 Basque -0.8519079 2 1 0.218
3 French -0.9298674 3 1 0.000
4 Italian -2.8859587 4 1 0.024
5 Orcadian -1.4996229 5 1 0.148
6 Russian -1.5597359 6 1 0.626
7 Sardinian -1.4494841 7 1 0.516
8 Tuscan -2.4279528 8 1 0.420
9 Bedouin -3.1717421 9 2 0.914
10 Druze -0.5058627 10 2 0.220
11 Mozabite -2.6491331 11 2 0.200
12 Palestinian -0.7819299 12 2 0.552
13 Balochi -1.4095947 13 3 0.158
14 Brahui -1.2534511 14 3 0.162
15 Burusho 1.7958170 15 3 0.414
16 Hazara 2.2810477 16 3 0.152
17 Kalash -0.9258497 17 3 0.974
18 Makrani -0.9007551 18 3 0.226
19 Pathan 2.5543214 19 3 0.112
20 Sindhi 2.6614486 20 3 0.338
21 Uygur -1.2207974 21 3 0.652
22 Cambodian 2.3706977 22 4 0.118
23 Dai -0.9441980 23 4 0.686
24 Daur -1.0325107 24 4 0.932
25 Han -0.7381369 25 4 0.794
26 Hezhen -2.7590587 26 4 0.182
27 Japanese -0.5644325 27 4 0.366
28 Lahu -0.8449225 28 4 0.560
29 Miao -0.7237586 29 4 0.194
30 Mongola -0.9452944 30 4 0.768
31 Naxi -0.1625003 31 4 0.554
32 Oroqen -1.2035258 32 4 0.782
33 She -2.7758460 33 4 0.912
34 Tu -0.7703779 34 4 0.254
35 Tujia -1.0265275 35 4 0.912
36 Xibo -1.1163019 36 4 0.292
37 Yakut -3.2102686 37 4 0.030
38 Yi -0.9614190 38 4 0.838
39 Colombian -1.9659984 39 5 0.166
40 Karitiana -0.9195156 40 5 0.660
41 Maya 2.1239768 41 5 0.818
42 Pima -3.0895998 42 5 0.818
43 Surui -0.9377928 43 5 0.536
44 Melanesian -1.6961014 44 6 0.414
45 Papuan -0.7037952 45 6 0.386
46 BantuKenya -1.9311354 46 7 0.484
47 BantuSouthAfrica -1.8515908 47 7 0.016
48 BiakaPygmy -1.7657017 48 7 0.538
49 Mandenka -0.5423822 49 7 0.076
50 MbutiPygmy -1.6244801 50 7 0.054
51 San -0.9049735 51 7 0.478
52 Yoruba 2.0949378 52 7 0.904
I have made the following graph
I used the code:
jpeg("test3.jpg", 700,700)
df.ordered$color <- as.factor(df.ordered$color)
levels(df.ordered$color) <- c("blue","yellow3","red","pink","purple","green","orange")
plot(df.ordered$gvs, pch = 19, cex=2, col = as.character(df.ordered$color), xaxt="n")
axis(1, at=1:52, col=as.character(df.ordered$color),labels=df.ordered$labels, las=2)
dev.off()
I now want to scale the dots of the graph to the pvals column. I want the low pvalues to be larger dots, and the higher p-value to be the smaller dots. One issue is that some pvalues are 0. I was thinking of turning all pvals values that are 0.000 to 0.001 to fix this. Does anyone know how to do this? I want the graph to look similar to the graph in figure 5 here: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004412
The cex argument is vectorized, i.e., you can pass in a vector (of the same length of your data to plot). Take this as a simple example:
plot(1:5, cex = 1:5)
Now, it is completely up to you to define a relationship between cex and pvals. How about a + (1 - pvals) * (b - a)? This will map 1-pvals from [0,1] to [a,b]. For example, with a = 1, b = 5, you can try:
cex <- 1 + (1 - df.ordered$pvals) * (5 - 1)
I'm looking to have the p-values between 0.000 and 0.0010 to have cex = ~10, p-values between 0.010 and 0.20 to have cex = ~5, and p-values from 0.20-1.00 to have cex = ~0.5.
I recommend using cut():
fac <- cut(df.ordered$pvals, breaks = c(0, 0.001, 0.2, 1),
labels = c(10, 5, 0.5), right = FALSE)
cex <- c(10, 5, 0.5)[as.integer(fac)]
Adding to #zheyuan-li's answer, here is a normalization that puts the size of the points for p-values "equal" to 0 with size 2, and the point size of observations with p-values "equal" to 1 with size zero:
plot(df.ordered$gvs, pch = 19,
cex=2 * (1-df.ordered$pvals)/(df.ordered$pvals +1),
col = as.character(df.ordered$color), xaxt="n")

Resources