I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes the percentages over the whole data frame, whereas I just want the percentage within each column. Each column has a total of 2230 answers.
I was thinking of using something like (x/2230)*100, but I don't know how to go on.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
q1 q2 q3
1 159 106 99
2 139 284 222
3 1048 1043 981
4 571 672 843
5 93 125 94
We can divide by colSums after expanding the column totals to the same length as the data
100 * df/colSums(df)[col(df)]
or use sweep
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
-output
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303
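One detail worth checking: the question mentions 2230 answers per column, but in the sample shown the column totals are not all equal, so dividing by colSums and dividing by a fixed 2230 give different numbers for some columns. A minimal sketch comparing the two, assuming the fixed total from the question is the intended denominator for the full data (pct_colsum and pct_fixed are illustrative names, not from the original):

```r
df <- data.frame(q1 = c(159, 139, 1048, 571, 93),
                 q2 = c(106, 284, 1043, 672, 125),
                 q3 = c(99, 222, 981, 843, 94))

# Percentages relative to each column's own total (what the answers above compute)
pct_colsum <- 100 * sweep(df, 2, colSums(df), `/`)

# Percentages relative to a fixed number of respondents
# (2230 is taken from the question; adjust if the real total differs)
pct_fixed <- 100 * df / 2230

colSums(df)          # totals of the sample columns: 2010, 2230, 2239
colSums(pct_colsum)  # every column sums to 100
colSums(pct_fixed)   # sums to 100 only where the column total is 2230
```

If the full data really has 2230 answers in every column, both versions agree; on the partial sample only q2 does.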
You can apply prop.table for each column -
library(dplyr)
df %>% mutate(across(everything(), prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303
I have a list of lists, like so:
x <-list()
x[[1]] <- c('97', '342', '333')
x[[2]] <- c('97','555','556','742','888')
x[[3]] <- c ('100', '442', '443', '444', '445','446')
The first number in each list (97, 97, 100) refers to a node in a tree and the following numbers refer to traits associated with that node.
My goal is to create a dataframe that looks like this:
df= data.frame(node = c('97','97','97','97','97','97','100','100','100','100','100'),
trait = c('342','333','555','556','742','888','442','443','444','445','446'))
where each trait has its corresponding node.
I think the first thing I need to do is convert the list of lists into a single dataframe. I've tried doing so using:
do.call(rbind,x)
but that repeats the values in x[[1]] and x[[2]] to match the length of x[[3]]. I've also tried using:
dt_list <- map(x, as.data.table)
dt <- rbindlist(dt_list, fill = TRUE, idcol = T)
Which I think gets me closer, but I'm still unsure of how to assign the first node value to the corresponding trait values. I know this is probably a simple task but it's stumping me today!
Maybe you can try the code below
h <- sapply(x, `[`,1)
d <- lapply(x, `[`,-1)
df <- data.frame(node = rep(h,lengths(d)), trait = unlist(d))
such that
> df
node trait
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446
You can create a data frame with the first value from the vector in column 'node' and the rest of the values in column 'trait'. This strategy can be applied to all entries in the list using the map_df() function from purrr package, giving the output you describe.
library(purrr)
library(dplyr)
x %>%
map_df(., function(vec) data.frame(node = vec[1],
trait = vec[-1],
stringsAsFactors = F))
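The same strategy (one small data frame per list element, then stack them) can also be sketched without purrr. This is a base R variant, not the answerer's code; pieces and out are names chosen here for illustration:

```r
x <- list(c('97', '342', '333'),
          c('97', '555', '556', '742', '888'),
          c('100', '442', '443', '444', '445', '446'))

# One data frame per element: the first value is the node, the rest are traits.
# data.frame() recycles the single node value to match the number of traits.
pieces <- lapply(x, function(vec) {
  data.frame(node = vec[1], trait = vec[-1], stringsAsFactors = FALSE)
})

# Stack the pieces row-wise into a single data frame
out <- do.call(rbind, pieces)
out
```

Because every piece has the same two columns, rbind stacks them cleanly, unlike do.call(rbind, x) on the raw vectors of unequal length.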
An option with base R is
stack(setNames(lapply(x, `[`, -1), sapply(x, `[`, 1)))[2:1]
# ind values
#1 97 342
#2 97 333
#3 97 555
#4 97 556
#5 97 742
#6 97 888
#7 100 442
#8 100 443
#9 100 444
#10 100 445
#11 100 446
Another solution
library(tidyverse)
library(purrr)
node <- map(x, ~rep(.x[1], length(.x)-1)) %>% flatten_chr()
trait <- map(x, ~.x[2:length(.x)]) %>% flatten_chr()
out <- tibble(node, trait)
node trait
<chr> <chr>
1 97 342
2 97 333
3 97 555
4 97 556
5 97 742
6 97 888
7 100 442
8 100 443
9 100 444
10 100 445
11 100 446
I want to divide numbers separated by commas in one column by the numbers in another column.
Here is the input I have
> df = data.frame (SAMPLE1.DP=c("555","651","641","717"), SAMPLE1.AD=c("555", "68,583","2,639","358,359"), SAMPLE2.DP=c("1023","930","683","1179"), SAMPLE2.AD=c("1023","0,930","683","585,594"))
> df
SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD
1 555 555 1023 1023
2 651 68,583 930 0,930
3 641 2,639 683 683
4 717 358,359 1179 585,594
In the end I want to add two new columns (AD/DP) that divide the values of SAMPLE1.AD by SAMPLE1.DP and of SAMPLE2.AD by SAMPLE2.DP, giving the proportion for the number on each side of the comma, like this:
> end = data.frame(SAMPLE1.DP=c("555","651","641","717"),
+ SAMPLE1.AD=c("555", "68,583","204,437","358,359"),
+ SAMPLE1.AD_DP=c("1.00","0.10,0.90","0.32,0.68","0.50,0.50"),
+ SAMPLE2.DP=c("1023","930","683","1179"),
+ SAMPLE2.AD=c("1023","0,930","683","585,594"),
+ SAMPLE2.AD_DP=c("1.00","0.00,1.00","1.00","0.49,0.51"))
> end
SAMPLE1.DP SAMPLE1.AD SAMPLE1.AD_DP SAMPLE2.DP SAMPLE2.AD SAMPLE2.AD_DP
1 555 555 1.00 1023 1023 1.00
2 651 68,583 0.10,0.90 930 0,930 0.00,1.00
3 641 204,437 0.32,0.68 683 683 1.00
4 717 358,359 0.50,0.50 1179 585,594 0.49,0.51
it means :
XX YY,ZZ YY/XX,ZZ/XX AA BB,CC BB/AA,CC/AA
If I convert the values in the table with as.numeric, it does not work, since the values are separated by commas.
Do you have any idea how to do this?
Thanks in advance for your help
First thing you need to do is replace the , with . and cast to numeric. Then split based on your required condition and divide, i.e.
df[] <- lapply(df, function(i)as.numeric(gsub(',', '.', i)))
do.call(cbind, lapply(split.default(df, gsub('\\D+', '', names(df))), function(i) i[2] / i[1]))
# SAMPLE1.AD SAMPLE2.AD
#1 1.000000000 1.000000
#2 0.004066052 0.001000
#3 0.004117005 1.000000
#4 0.499803347 0.496687
If there are commas in your numbers, then the column has most likely been read in as character. What you need to do is convert your columns to numeric and then divide each column by its counterpart.
library(tidyverse)
dat <- tribble(~"SAMPLE1.DP", ~"SAMPLE1.AD", ~"SAMPLE2.DP", ~"SAMPLE2.AD",
555, 555, 1023, 1023,
651, "2,647", 930, ",93",
641, "2,639", 683, 683,
717, "358,359", 1179, "585,594")
dat %>%
mutate_at(c(2,4), list(~str_replace(., ",", "."))) %>%
mutate_all(as.numeric) %>%
mutate(addp1 = SAMPLE1.AD / SAMPLE1.DP,
addp2 = SAMPLE2.AD / SAMPLE2.DP)
#> # A tibble: 4 x 6
#> SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD addp1 addp2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 555 555 1023 1023 1 1
#> 2 651 2.65 930 0.93 0.00407 0.001
#> 3 641 2.64 683 683 0.00412 1
#> 4 717 358. 1179 586. 0.500 0.497
Created on 2019-05-20 by the reprex package (v0.2.1)
Thanks everyone, but I was not very clear in my question, very sorry.
In my input example, I have only whole numbers separated by commas, no decimals.
For example, on line 3 of my example:
2,647 means 2 AND 647, and I want to divide both numbers by 651 to get 2/651 and 647/651, so roughly 0.003 and 0.994 (or about 0.3% and 99.4%).
They are whole numbers (integers), separated by commas.
Hope I am clearer now, thanks.
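Given that clarification, one way to approach it is to split each AD string on the comma, divide every piece by the matching DP value, and paste the rounded ratios back together. The following is only a sketch under that reading of the question; ratio_string is a hypothetical helper name, and two decimals are assumed because the desired output shows two:

```r
df <- data.frame(SAMPLE1.DP = c("555", "651", "641", "717"),
                 SAMPLE1.AD = c("555", "68,583", "2,639", "358,359"),
                 SAMPLE2.DP = c("1023", "930", "683", "1179"),
                 SAMPLE2.AD = c("1023", "0,930", "683", "585,594"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split each AD entry on ",", divide each integer by the
# DP value of the same row, and rejoin the ratios with "," (two decimals)
ratio_string <- function(ad, dp) {
  parts <- strsplit(ad, ",", fixed = TRUE)
  mapply(function(p, d) {
    paste(sprintf("%.2f", as.numeric(p) / as.numeric(d)), collapse = ",")
  }, parts, dp, USE.NAMES = FALSE)
}

df$SAMPLE1.AD_DP <- ratio_string(df$SAMPLE1.AD, df$SAMPLE1.DP)
df$SAMPLE2.AD_DP <- ratio_string(df$SAMPLE2.AD, df$SAMPLE2.DP)
df
```

The result stays a character column because a single cell can hold one or two ratios; if numeric values are needed downstream, keeping the list returned by strsplit may be more convenient.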
We know that the mutate_at function from dplyr lets us mutate multiple selected columns by applying a function to each of them. I need the opposite: to apply multiple functions to the same column, or the same function multiple times to the same column. Take the following reproducible example.
> main <- structure(list(PolygonId = c(0L, 1L, 1612L, 3L, 2L, 1698L), Area = c(3.018892,
1.995702, 0.582808, 1.176975, 2.277057, 0.014854), Perimeter = c(10.6415,
8.6314, 4.8478, 6.1484, 9.2226, 0.6503), h0 = c(1000,500,700,1000,200,1200)), .Names = c("PolygonId",
"Area", "Perimeter", "h0"), row.names = c(NA, 6L), class = "data.frame")
> main
PolygonId Area Perimeter h0
1 0 3.018892 10.6415 1000
2 1 1.995702 8.6314 500
3 1612 0.582808 4.8478 700
4 3 1.176975 6.1484 1000
5 2 2.277057 9.2226 200
6 1698 0.014854 0.6503 1200
I am only concerned about h0 column in the df main.
Expected outcome:
The h10 field is h0 + 10% of h0 and h_10 is h0 - 10% of h0
PolygonId Area Perimeter h0 h10 h20 h_10 h_20
1 0 3.018892 10.6415 1000 1100 1200 900 800
2 1 1.995702 8.6314 500 550 600 450 400
3 1612 0.582808 4.8478 700 770 840 630 560
4 3 1.176975 6.1484 1000 1100 1200 900 800
5 2 2.277057 9.2226 200 220 240 180 160
6 1698 0.014854 0.6503 1200 1320 1440 1080 960
I'd usually do this:
calcH <- function(h, pc){
h + pc / 100 * h
}
new_main <- mutate ( main,
h10 = calcH(h0, 10),
h20 = calcH(h0, 20),
h_10 = calcH(h0, -10),
h_20 = calcH(h0, -20)
)
But this is going to be hectic and long code since I have to do this calculation for 1%, 2.5%, 5%, 7.5%, 10%, 12.5%, 15%... 30% in both positive and negative ways.
mutate_at can use multiple functions, but they need to exist in the environment as named functions (they can't be anonymous functions). So something like
pcts<-rep(c(1,2.5*1:12),2)*c(-1,1)
for(i in pcts){
assign(gsub("-","_",paste0("h",i)),eval(parse(text=sprintf("function(x) x*(100+%f)/100",i)))) }
main %>% mutate_at(vars(h0),gsub("-","_",paste0("h",pcts)))
would work
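The same idea of one named function per percentage can be sketched without assign() or eval(parse()) by building a named list of closures. This is a base R variant rather than the mutate_at call above; make_scaler and fns are illustrative names, and main is reduced to the h0 column for brevity:

```r
# Only the column of interest from the question's data
main <- data.frame(h0 = c(1000, 500, 700, 1000, 200, 1200))

# Positive and negative percentages: 1, 2.5, 5, ..., 30 and their negatives
pcts <- c(1, 2.5 * 1:12)
pcts <- c(pcts, -pcts)

# Factory returning a function that scales its input by (100 + p) percent;
# force() pins down p so each closure keeps its own percentage
make_scaler <- function(p) {
  force(p)
  function(x) x * (100 + p) / 100
}

# Named list of functions: h1, h2.5, ..., h30, h_1, ..., h_30
fns <- setNames(lapply(pcts, make_scaler), gsub("-", "_", paste0("h", pcts)))

# Apply every function to h0 and add the results as new columns
main[names(fns)] <- lapply(fns, function(f) f(main$h0))
```

Keeping the functions in a list avoids creating 26 objects in the global environment.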
I like to solve this kind of problem using a long data representation:
library(dplyr)
library(tidyr)
# create data frame with join helper and multiplier-values:
bla <- data.frame(mult = seq(-.1, .1, .01),
join = TRUE)
# join, calculate values, create names, transform to wide:
main %>%
mutate(join = TRUE) %>%
left_join(bla) %>%
mutate(h0 = h0*(1+mult),
mult = sub(x = paste0("h", mult*100), pattern = "-", replacement = "_")) %>%
select(-join) %>%
spread(mult, h0)
This is easy in base R. The idea is to create a vector with the required percentages, loop over that vector and calculate your metric, i.e.
v1 <- c(1, seq(2.5, 30, by = 2.5), seq(-30, -2.5, by = 2.5), -1)
sapply(v1, function(i) calcH(main$h0, i))
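The sapply() call returns an unnamed matrix, one column per percentage. A possible way to label those columns with the h10 / h_10 convention from the question and attach them to main is sketched below, assuming calcH is defined as in the question (res and new_main are illustrative names, and main is trimmed to two columns here):

```r
# Trimmed copy of the question's data
main <- data.frame(PolygonId = c(0L, 1L, 1612L, 3L, 2L, 1698L),
                   h0 = c(1000, 500, 700, 1000, 200, 1200))

# Helper from the question: h plus pc percent of h
calcH <- function(h, pc) {
  h + pc / 100 * h
}

v1 <- c(1, seq(2.5, 30, by = 2.5), seq(-30, -2.5, by = 2.5), -1)

# One column per percentage
res <- sapply(v1, function(i) calcH(main$h0, i))

# Name the columns h1, h2.5, ..., and h_1, h_2.5, ... for the negative ones
colnames(res) <- sub("-", "_", paste0("h", v1), fixed = TRUE)

new_main <- cbind(main, res)
```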
Here's another approach similar to #andyyy's, but uses rlang instead:
library(dplyr)
library(rlang)
percent <- c(1, 2.5*1:12)
calc_expr <- function(percent_vec){
parse_exprs(paste(paste0("h0+(",percent_vec,"/100*h0)"), collapse = ";"))
}
main %>%
mutate(!!!calc_expr (percent), !!!calc_expr (percent*-1)) %>%
setNames(c(colnames(main), paste0("h", percent), paste0("h_", percent)))
Result:
PolygonId Area Perimeter h0 h1 h2.5 h5 h7.5 h10 h12.5 h15 h17.5 h20 h22.5 h25 h27.5
1 0 3.018892 10.6415 1000 1010 1025.0 1050 1075.0 1100 1125.0 1150 1175.0 1200 1225.0 1250 1275.0
2 1 1.995702 8.6314 500 505 512.5 525 537.5 550 562.5 575 587.5 600 612.5 625 637.5
3 1612 0.582808 4.8478 700 707 717.5 735 752.5 770 787.5 805 822.5 840 857.5 875 892.5
4 3 1.176975 6.1484 1000 1010 1025.0 1050 1075.0 1100 1125.0 1150 1175.0 1200 1225.0 1250 1275.0
5 2 2.277057 9.2226 200 202 205.0 210 215.0 220 225.0 230 235.0 240 245.0 250 255.0
6 1698 0.014854 0.6503 1200 1212 1230.0 1260 1290.0 1320 1350.0 1380 1410.0 1440 1470.0 1500 1530.0
h30 h_1 h_2.5 h_5 h_7.5 h_10 h_12.5 h_15 h_17.5 h_20 h_22.5 h_25 h_27.5 h_30
1 1300 990 975.0 950 925.0 900 875.0 850 825.0 800 775.0 750 725.0 700
2 650 495 487.5 475 462.5 450 437.5 425 412.5 400 387.5 375 362.5 350
3 910 693 682.5 665 647.5 630 612.5 595 577.5 560 542.5 525 507.5 490
4 1300 990 975.0 950 925.0 900 875.0 850 825.0 800 775.0 750 725.0 700
5 260 198 195.0 190 185.0 180 175.0 170 165.0 160 155.0 150 145.0 140
6 1560 1188 1170.0 1140 1110.0 1080 1050.0 1020 990.0 960 930.0 900 870.0 840
Notes:
Using the vector of percentages, I construct multiple expressions using paste0 and parse_exprs then unquote and splice them in mutate using !!!. Finally, rename the columns using setNames.
I am trying to obtain a vector that contains, for each condition, the sum of the elements that satisfy it.
values = runif(5000)
bin = seq(0, 0.9, by = 0.1)
sum(values < bin)
I expected sum to return 10 values: one sum of the elements of "values" that satisfy the "<" condition for each element of "bin".
However, it returns only one value.
How can I achieve the result without using a while loop?
I understand this to mean that you want, for each value in bin, the number of elements in values that are less than bin. So I think you want vapply() here
vapply(bin, function(x) sum(values < x), 1L)
# [1] 0 497 1025 1501 1981 2461 2955 3446 3981 4526
If you want a little table for reference, you could add names
v <- vapply(bin, function(x) sum(values < x), 1L)
setNames(v, bin)
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# 0 497 1025 1501 1981 2461 2955 3446 3981 4526
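For context, sum(values < bin) returns a single number because bin is recycled along values element by element, producing one long logical vector that sum() then collapses. A small sketch of two vectorised alternatives, with a seed added here only to make the run repeatable (counts_outer and sums_below are illustrative names; the second covers the other reading of the question, the sum of the values themselves):

```r
set.seed(1)  # assumption: seed added for repeatability, not in the question
values <- runif(5000)
bin <- seq(0, 0.9, by = 0.1)

# Count of values below each threshold, as in the vapply() answer
counts_vapply <- vapply(bin, function(b) sum(values < b), integer(1))

# Same counts via a 5000 x 10 logical matrix: one column per threshold
counts_outer <- colSums(outer(values, bin, "<"))

# Alternative reading: sum of the values that fall below each threshold
sums_below <- vapply(bin, function(b) sum(values[values < b]), numeric(1))
```

The outer() version trades memory for brevity, which is fine at this size but may matter for much longer vectors.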
I personally prefer data.table over tapply or vapply, and findInterval over cut.
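set.seed(1)
# values and bin as in the question, regenerated after setting the seed
values <- runif(5000)
bin <- seq(0, 0.9, by = 0.1)
library(data.table)
dt <- data.table(values, groups=findInterval(values, bin))
setkey(dt, groups)
dt[,.(n=.N, v=sum(values)), groups][,list(cumsum(n), cumsum(v)),]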
# V1 V2
# 1: 537 26.43445
# 2: 1041 101.55686
# 3: 1537 226.12625
# 4: 2059 410.41487
# 5: 2564 637.18782
# 6: 3050 904.65876
# 7: 3473 1180.53342
# 8: 3951 1540.18559
# 9: 4464 1976.23067
#10: 5000 2485.44920
cbind(vapply(bin, function(x) sum(values < x), 1L)[-1],
cumsum(tapply( values, cut(values, bin), sum)))
# [,1] [,2]
#(0,0.1] 537 26.43445
#(0.1,0.2] 1041 101.55686
#(0.2,0.3] 1537 226.12625
#(0.3,0.4] 2059 410.41487
#(0.4,0.5] 2564 637.18782
#(0.5,0.6] 3050 904.65876
#(0.6,0.7] 3473 1180.53342
#(0.7,0.8] 3951 1540.18559
#(0.8,0.9] 4464 1976.23067
Using tapply with a cut()-constructed INDEX vector seems to deliver:
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.43052 71.06897 129.99698 167.56887 222.74620 277.16395
(0.6,0.7] (0.7,0.8] (0.8,0.9]
332.18292 368.49341 435.01104
Although I'm guessing you would want the cut-vector to extend to 1.0:
bin = seq(0, 1, by = 0.1)
tapply( values, cut(values, bin), sum)
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
25.48087 69.87902 129.37348 169.46013 224.81064 282.22455
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
335.43991 371.60885 425.66550 463.37312
I see that I understood the question differently than Richard. If you wanted his result you can use cumsum on my result.
Using dplyr:
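set.seed(1)
library(dplyr)
# df was not defined in the original; assumed here to hold the values and their cut() groups
values <- runif(5000)
bin <- seq(0, 0.9, by = 0.1)
df <- data.frame(values, groups = cut(values, bin))
df %>% group_by(groups) %>%
summarise(count = n(), sum = sum(values)) %>%
mutate(cumcount= cumsum(count), cumsum = cumsum(sum))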
Output:
groups count sum cumcount cumsum
1 (0,0.1] 537 26.43445 537 26.43445
2 (0.1,0.2] 504 75.12241 1041 101.55686
3 (0.2,0.3] 496 124.56939 1537 226.12625
4 (0.3,0.4] 522 184.28862 2059 410.41487
5 (0.4,0.5] 505 226.77295 2564 637.18782
6 (0.5,0.6] 486 267.47094 3050 904.65876
7 (0.6,0.7] 423 275.87466 3473 1180.53342
8 (0.7,0.8] 478 359.65217 3951 1540.18559
9 (0.8,0.9] 513 436.04508 4464 1976.23067
10 NA 536 509.21853 5000 2485.44920