R - Percentage of whole dataframe per column - r

I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes the percentage by dividing the values for the whole data frame, meanwhile, I just want the percentage for each column. Each column has a total of 2230 answers.
I was thinking to use something like (x/2230)*100 but I don't know how to go on.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
q1 q2 q3
1 159 106 99
2 139 284 222
3 1048 1043 981
4 571 672 843
5 93 125 94

We may use colSums to do the division after making the lengths same
100 * df/colSums(df)[col(df)]
or use sweep
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
-output
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303

You can apply prop.table for each column -
library(dplyr)
df %>% mutate(across(.fns = prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303

Related

Using dplyr to compute calculated fields depending on multiple columns without explicitly writing column names

Consider the following code.
set.seed(56)
library(dplyr)
df <- data.frame(
NUM_1 = sample.int(500, replace = TRUE),
DENOM_1 = sample.int(500, replace = TRUE),
NUM_2 = sample.int(500, replace = TRUE),
DENOM_2 = sample.int(500, replace = TRUE)
)
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2
1 417 379 154 173
2 160 437 239 154
3 243 315 106 361
4 291 169 393 340
5 170 450 429 421
6 422 131 75 64
Without having to manually specify each of the column names (the actual problem has about 40 of these I need to create), I would like to create columns FRAC_1 and FRAC_2 for which FRAC_X = NUM_X/DENOM_X.
So, this would be what I'm looking for with regard to output, but since I'm dealing with about 40 of these, I don't want to have to manually type out each column:
df_frac <- df %>%
mutate(FRAC_1 = NUM_1 / DENOM_1,
FRAC_2 = NUM_2 / DENOM_2)
head(df_frac)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I would strongly prefer a dplyr solution to this. I thought maybe I could use mutate() with across(), but it isn't clear to me how to tell across() to pair the NUM_x with the corresponding DENOM_x columns.
Here is one in tidyverse
Loop across the columns with names starts_with 'NUM'
Extract the column name cur_column(), replace the substring from 'NUM' to 'DENOM' in str_replace
get the column value, divide by the NUM column, and change the column name in .names to create the 'FRAC' columns
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("NUM"), ~
./get(str_replace(cur_column(), 'NUM', 'DENOM')),
.names = "{str_replace(.col, 'NUM', 'FRAC')}"))
-output
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750

Summing columns in a data frame and adding those values to a new data frame in R [duplicate]

This question already has answers here:
How to sum data.frame column values?
(5 answers)
Closed 2 years ago.
I am trying to sum the columns of a data frame and add these sums to a new output data frame. When I run the following script, I get an error stating that the replacement has two rows and the data has 3.
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
for (i in 1:ncol(a)) {
b <-as.data.frame(names(a))
c <- sum(a[i])
b$d[i] <- c[i]
}
I am looking for the output as a data frame such as:
name1 sum1
name2 sum2
name3 sum3
Your solution was already pretty close. I made some slight modifications for you and it works:
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
b <-as.data.frame(names(a))
for (i in 1:ncol(a)) {
b$sum[i] <- sum(a[i])
}
Output:
names(a) sum
1 name1 470
2 name2 616
3 name3 495
I would suggest a dplyr approach:
library(dplyr)
#Data
a <-data.frame(replicate(3,sample(1:100,10,rep=TRUE)))
colnames(a) <- c("name1", "name2","name3")
#Code
a %>%
mutate(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1 name2 name3 name1_sum name2_sum name3_sum
1 98 31 79 599 489 506
2 8 71 4 599 489 506
3 59 23 48 599 489 506
4 65 76 64 599 489 506
5 47 53 57 599 489 506
6 80 84 55 599 489 506
7 40 19 28 599 489 506
8 39 2 47 599 489 506
9 65 36 40 599 489 506
10 98 94 84 599 489 506
If only one dataframe is desired you can use this:
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) ))
Output:
name1_sum name2_sum name3_sum
1 599 489 506
Initial code should be used when you want to add those variables to same dataframe.
And if you want a variable for the names and another for results you can use previous code with pivot_longer() from tidyverse to produce this:
library(tidyverse)
#Code
a %>%
summarise(across(c(name1:name3),.fns = list(sum = ~ sum(.,na.rm=T)) )) %>%
pivot_longer(cols = everything())
Output:
# A tibble: 3 x 2
name value
<chr> <int>
1 name1_sum 599
2 name2_sum 489
3 name3_sum 506
It can be vectorized with colSums in base R
as.data.frame.list(colSums(a))
Or for a two column summary
stack(colSums(a))
If we need to create new columns in 'a'
a[paste0(names(a), "_sum")] <- colSums(a)

How to add columns to a dataframe based on indexes in R? (See example)

I'm working with a self made infix function which simply calculates the
percentage growth between observations in columns.
options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301), b=c(123,214,199), h=c(134, 217, 205))
Then I use lapply to my toy database in order to add two new columns.
test[,4:5] <- lapply(1:(ncol(test)-1), function(i) test[,i] %grow% test[,(i+1)])
test
#Output
a b h V4 V5
1 101 123 134 21.78 8.94
2 202 214 217 5.94 1.40
3 301 199 205 -33.89 3.02
This is easy considering I just have three columns and I just can write test[,4:5]. Now talking in general terms: How to do this if we have n columns using column indexes?
What I mean is I want to create n-1 columns to a given database starting from the last one. Something like:
test[,(last_current_column+1):(last_column_created_using_function)]
Considering what I've read in some other posts, using my example, test[,(last_current_column+1): could be written as:
test[,(ncol(test)+1):]
but second part is still missing and I have no idea how to write it.
I hope I made myself clear. I fully appreciate any comment or advise.
Happy 2019 :)
Another way would be:
#options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301),
b=c(123,214,199),
h=c(134, 217, 205),
d=c(156,234,235))
# a b h d
# 1 101 123 134 156
# 2 202 214 217 234
# 3 301 199 205 235
seqcols <- seq_along(test) # saved just to improve readability
test[,seqcols[-length(seqcols)] + max(seqcols)] <- lapply(seqcols[-length(seqcols)],
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h d V5 V6 V7
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
Similar to the second solution from #Ronak Shah, just with the use of map2_df from purrr:
cbind(test,
new=purrr::map2_df(test[seqcols[-length(seqcols)]], test[seqcols[-1]], `%grow%`),
deparse.level=1)
# a b h d new.a new.b new.h
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
You would always ncol(test) - 1 new columns. Now using this logic there are multiple ways to do this.
One way would be to construct a character vector with some prefix value.
test[paste0("new_col", seq_len(ncol(test) - 1))] <- lapply(1:(ncol(test)-1),
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h new_col1 new_col2
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075
Another option using mapply and transform by creating subsets of dataframe
transform(test,
new_col = mapply(`%grow%`, test[1:(ncol(test)- 1)], test[2:ncol(test)]))
# a b h new_col.a new_col.b
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,]) # for i 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
use .I to give you an index to call with the by= parameter, then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row for example
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins looks like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
When I mean the bins, I get on an average of ~900 samples
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on an average 900 samples (e.g. (0, 27] = 900, (27,28.5] = 900, and so on). I found something similar here, which deals with only one variable, not the whole dataframe.
I also tried Hmisc package, unfortunately the bins don't contain equal frequency!!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal sized buckets (based on your seq) statement, you can use something like:
df <- data.frame(var=runif(500, 0, 100)) # make data
cut.vec <- cut(
df$var,
breaks=quantile(df$var, 0:50/50), # breaks along 1/50 quantiles
include.lowest=T
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
Can be done by the function provided here by Joris Meys
EqualFreq2 <- function(x,n){
nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
data<-split(df2, EqualFreq2(df2$val, 25))

Resources