Two-way contingency table in R

Two-way contingency table in R - r

I have a dataframe and I want to output a two-way contingency table from two of the columns. They both have values "Too Little", "About Right" or "Too Much".
I'm typing
df %>%
filter(!is.na(col1)) %>%
group_by(col1) %>%
summarise(count = n())
for both of them separately and get something like this:
col1 count
<fctr> <int>
Too Little 19259
About Right 9539
Too Much 2816
What I would like to achieve is this:
Too Little About Right Too Much Total
col1 19259 9539 2816 31614
col2 20619 9374 2262 32255
Total 39878 18913 5078 63869
I've been trying to use table function
addmargins(table(df$col1, df$col2))
But the result is not what I want
Too Little About Right Too Much Sum
Too Little 13770 4424 740 18934
About Right 4901 3706 700 9307
Too Much 1250 800 679 2729
Sum 19921 8930 2119 30970

I'd give tabulate a try, which is the foundation for table (see ?tabulate). For example given
set.seed(123)
vals <- LETTERS[1:3]
df <- as.data.frame(replicate(3, sample(vals, 5, T)))
df <- data.frame(lapply(df, "levels<-", vals))
then you could do
m <- t(sapply(df, tabulate, nbins = length(vals)))
colnames(m) <- vals
addmargins(m)
# A B C Sum
# V1 1 1 3 5
# V2 1 3 1 5
# V3 1 2 2 5
# Sum 3 6 6 15
Or (via #thelatemail) just
addmargins(t(sapply(df, table)))
# A B C Sum
# V1 1 1 3 5
# V2 1 3 1 5
# V3 1 2 2 5
# Sum 3 6 6 15

We can use table in a loop then rbind:
# Using dummy data from #lukeA's answer
addmargins(do.call(rbind, lapply(df1, table)))
# A B C Sum
# V1 1 1 3 5
# V2 1 3 1 5
# V3 1 2 2 5
# Sum 3 6 6 15
Benchmarking
# bigger data
set.seed(123)
vals <- LETTERS[1:20]
df1 <- as.data.frame(replicate(20, sample(vals, 100000, T)))
df1 <- data.frame(lapply(df1, "levels<-", vals))
microbenchmark::microbenchmark(
lukeA = {
m1 <- t(sapply(df1, tabulate, nbins = length(vals)))
colnames(m1) <- vals
m1 <- addmargins(m1)
},
# as vals only used for luke's solution, keep it in.
lukeA_1 = {
vals <- LETTERS[1:20]
m2 <- t(sapply(df1, tabulate, nbins = length(vals)))
colnames(m2) <- vals
m2 <- addmargins(m2)
},
thelatemail = {m3 <- addmargins(t(sapply(df1, table)))},
zx8754 = {m4 <- addmargins(do.call(rbind, lapply(df1, table)))}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# lukeA 2.349969 2.371922 2.518447 2.473839 2.558653 3.363738 100
# lukeA_1 2.351680 2.377196 2.523473 2.473839 2.542831 3.459242 100
# thelatemail 38.316506 42.054136 43.785777 42.674912 44.234193 90.287809 100
# zx8754 38.695101 41.979728 44.933602 42.762006 44.244314 110.834292 100

Related

How to show values as percentage of column total in R like in excel pivot table? [duplicate]

I am trying to divide each cell in a data frame by the sum of the column. For example, I have a data frame df:
sample a b c
a2 1 4 6
a3 5 5 4
I would like to create a new data frame that takes each cell in and divides by the sum of the column, like so:
sample a b c
a2 .167 .444 .6
a3 .833 .556 .4
I have seen answers using sweep(), but that looks like its for matrices, and I have data frames. I understand how to use colSums(), but I'm not sure how to write a function that loops through every cell in the column, and then divides by the column sum. Thanks for the help!

Solution 1
Here are two dplyr solutions. We can use mutate_at or mutate_if to efficiently specify which column we want to apply an operation, or under what condition we want to apply an operation.
library(dplyr)
# Apply the operation to all column except sample
dat2 <- dat %>%
mutate_at(vars(-sample), funs(./sum(.)))
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
# Apply the operation if the column is numeric
dat2 <- dat %>%
mutate_if(is.numeric, funs(./sum(.)))
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
Solution 2
We can also use the map_at and map_if function from the purrr package. However, since the output is a list, we will need as.data.frame from base R or as_data_frame from dplyr to convert the list to a data frame.
library(dplyr)
library(purrr)
# Apply the operation to column a, b, and c
dat2 <- dat %>%
map_at(c("a", "b", "c"), ~./sum(.)) %>%
as_data_frame()
dat2
# # A tibble: 2 x 4
# sample a b c
# <chr> <dbl> <dbl> <dbl>
# 1 a2 0.167 0.444 0.600
# 2 a3 0.833 0.556 0.400
# Apply the operation if the column is numeric
dat2 <- dat %>%
map_if(is.numeric, ~./sum(.)) %>%
as_data_frame()
dat2
# # A tibble: 2 x 4
# sample a b c
# <chr> <dbl> <dbl> <dbl>
# 1 a2 0.167 0.444 0.600
# 2 a3 0.833 0.556 0.400
Solution 3
We can also use the .SD and .SDcols from the data.table package.
library(data.table)
# Convert to data.table
setDT(dat)
dat2 <- copy(dat)
dat2[, (c("a", "b", "c")) := lapply(.SD, function(x) x/sum(x)), .SDcols = c("a", "b", "c")]
dat2[]
# sample a b c
# 1: a2 0.1666667 0.4444444 0.6
# 2: a3 0.8333333 0.5555556 0.4
Solution 4
We can also use the lapply function to loop through all column except the first column to perform the operation.
dat2 <- dat
dat2[, -1] <- lapply(dat2[, -1], function(x) x/sum(x))
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
We can also use apply to loop through all columns but add an if-else statement in the function to make sure only perform the operation on the numeric columns.
dat2 <- dat
dat2[] <- lapply(dat2[], function(x){
# Check if the column is numeric
if (is.numeric(x)){
return(x/sum(x))
} else{
return(x)
}
})
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
Solution 5
A dplyr and tidyr solution based on gather and spread.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
gather(Column, Value, -sample) %>%
group_by(Column) %>%
mutate(Value = Value/sum(Value)) %>%
spread(Column, Value)
dat2
# # A tibble: 2 x 4
# sample a b c
# * <chr> <dbl> <dbl> <dbl>
# 1 a2 0.167 0.444 0.600
# 2 a3 0.833 0.556 0.400
Performance Evaluation
I am curious about which method has the best performance. So I conduct the following performance evaluation using the microbenchmark package with a data frame having the same column names as OP's example but with 1000000 rows.
library(dplyr)
library(tidyr)
library(purrr)
library(data.table)
library(microbenchmark)
set.seed(100)
dat <- data_frame(sample = paste0("a", 1:1000000),
a = rpois(1000000, lambda = 3),
b = rpois(1000000, lambda = 3),
c = rpois(1000000, lambda = 3))
# Convert the data frame to a data.table for later perofrmance evaluation
dat_dt <- as.data.table(dat)
head(dat)
# # A tibble: 6 x 4
# sample a b c
# <chr> <int> <int> <int>
# 1 a1 2 5 2
# 2 a2 2 5 5
# 3 a3 3 2 4
# 4 a4 1 2 2
# 5 a5 3 3 1
# 6 a6 3 6 1
In addition to all the methods I proposed, I also interested two other methods proposed by others: the prop.table method proposed by Henrik in the comments, and the apply method by Spacedman. I called all my solutions with m1_1, m1_2, m2_1, ... to m5. If there are two methods in one solution, I used _ to separate them. I also called the prop.table method as m6 and the apply method as m7. Notice that I modified m6 to have an output as a data frame so that all the methods can have data frame, tibble, or data.table output.
Here is the code I used to assess the performance.
per <- microbenchmark(m1_1 = {dat2 <- dat %>% mutate_at(vars(-sample), funs(./sum(.)))},
m1_2 = {dat2 <- dat %>% mutate_if(is.numeric, funs(./sum(.)))},
m2_1 = {dat2 <- dat %>%
map_at(c("a", "b", "c"), ~./sum(.)) %>%
as_data_frame()
},
m2_2 = {dat2 <- dat %>%
map_if(is.numeric, ~./sum(.)) %>%
as_data_frame()},
m3 = {dat_dt2 <- copy(dat_dt)
dat_dt2[, c("a", "b", "c") := lapply(.SD, function(x) x/sum(x)),
.SDcols = c("a", "b", "c")]},
m4_1 = {dat2 <- dat
dat2[, -1] <- lapply(dat2[, -1], function(x) x/sum(x))},
m4_2 = {dat2 <- dat
dat2[] <- lapply(dat2[], function(x){
if (is.numeric(x)){
return(x/sum(x))
} else{
return(x)
}
})},
m5 = {dat2 <- dat %>%
gather(Column, Value, -sample) %>%
group_by(Column) %>%
mutate(Value = Value/sum(Value)) %>%
spread(Column, Value)},
m6 = {dat2 <- dat
dat2[-1] <- prop.table(as.matrix(dat2[-1]), margin = 2)},
m7 = {dat2 <- dat
dat2[, -1] = apply(dat2[, -1], 2, function(x) {x/sum(x)})}
)
print(per)
# Unit: milliseconds
# expr min lq mean median uq max neval
# m1_1 23.335600 24.326445 28.71934 25.134798 27.465017 75.06974 100
# m1_2 20.373093 21.202780 29.73477 21.967439 24.897305 216.27853 100
# m2_1 9.452987 9.817967 17.83030 10.052634 11.056073 175.00184 100
# m2_2 10.009197 10.342819 16.43832 10.679270 11.846692 163.62731 100
# m3 16.195868 17.154327 34.40433 18.975886 46.521868 190.50681 100
# m4_1 8.100504 8.342882 12.66035 8.778545 9.348634 181.45273 100
# m4_2 8.130833 8.499926 15.84080 8.766979 9.732891 172.79242 100
# m5 5373.395308 5652.938528 5791.73180 5737.383894 5825.141584 6660.35354 100
# m6 117.038355 150.688502 191.43501 166.665125 218.837502 325.58701 100
# m7 119.680606 155.743991 199.59313 174.007653 215.295395 357.02775 100
library(ggplot2)
autoplot(per)
The result shows that methods based on lapply (m4_1 and m4_2) are the fastest, while the tidyr approach (m5) is the slowest, indicating that when row numbers are large it is not a good idea to use the gather and spread method.
DATA
dat <- read.table(text = "sample a b c
a2 1 4 6
a3 5 5 4",
header = TRUE, stringsAsFactors = FALSE)

Given this:
> d = data.frame(sample=c("a2","a3"),a=c(1,5),b=c(4,5),c=c(6,4))
> d
sample a b c
1 a2 1 4 6
2 a3 5 5 4
You can replace every column other than the first by applying over the rest:
> d[,-1] = apply(d[,-1],2,function(x){x/sum(x)})
> d
sample a b c
1 a2 0.1666667 0.4444444 0.6
2 a3 0.8333333 0.5555556 0.4
If you don't want d being stomped on make a copy beforehand.

You could do this in dplyr as well.
sample <- c("a2", "a3")
a <- c(1, 5)
b <- c(4, 5)
c <- c(6, 4)
dat <- data.frame(sample, a, b, c)
dat
library(dplyr)
dat %>%
mutate(
a.PCT = round(a/sum(a), 3),
b.PCT = round(b/sum(b), 3),
c.PCT = round(c/sum(c), 3))
sample a b c a.PCT b.PCT c.PCT
1 a2 1 4 6 0.167 0.444 0.6
2 a3 5 5 4 0.833 0.556 0.4

You can use the transpose of the matrix and then transpose again:
t(t(as.matrix(df))/colSums(df))

try apply:
mat <- matrix(1:6, ncol=3)
apply(mat,2, function(x) x / sum(x))
okay, if you have not numeric values in you columns you can force them to be numeric:
df <- data.frame( a=c('a', 'b'), b=c(3,4), d=c(1,6))
apply(df,2, function(x) {
x <- as.numeric(x)
x / sum(x)
})

Finding the first time a value shows up in a list efficiently

I was trying to solve a problem where there was a long list that had a variable amount of numbers at each index. The goals was to say what was the earliest index at which every number appeared. So if a 15 shows up at index 45 and 78 in the list, then I should return that 15 is located first at 48. In the original problem this went on with a list of length 10,000, so doing this fast was helpful.
Originally I tried to work with the existing list structure and did something like this, which at 10,000 lines very slow.
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
cbind(value = 1:100,
index = sapply(1:100, function(i) which.max(sapply(x, function(x) i %in% x))))
Eventually I tried converting the data in to a data.table, which worked much better but I always wondered if there was a better way to go about solving the problem. Like was the default list structure inherently inefficient or was there a better way I could have worked with it?
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
dt <- data.table(index = rep(1:100, sapply(x, length)), value = unlist(x))
dt[,.(index = first(index)),value][order(value)]
Here's the full dataset from the original problem if that's helpful.
library(RcppAlgos)
library(memoise)
library(data.table)
jgo <- function(n) {
if (isPrimeRcpp(n) | n == 1) return (n)
div <- divisorsRcpp(n)
div <- div[-c(1, length(div))]
div <- Map(function(a, b) c(a, b), div, rev(div))
div2 <- lapply(div, function(x) lapply(jgo(x[1]), c, x[2]))
unique(lapply(c(div, unlist(div2, recursive = FALSE)), sort))
}
jgo <- memoise(jgo)
x <- lapply(1:12500, function(x) x - sapply(jgo(x), sum) + sapply(jgo(x), length))

Here is another approach that uses match to find the first indices. This slightly outperforms the other suggested approaches and produces similar output as in OP's question:
## dummy data
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
## use match to find first indices
first_indices_match <- function(x) {
seq_x <- 1:length(x)
matrix(c(seq_x, rep(seq_x, lengths(x))[match(seq_x, unlist(x))]),
ncol = 2, dimnames = list(NULL, c("value", "index")))
}
head(first_indices_match(x))
#> value index
#> [1,] 1 1
#> [2,] 2 7
#> [3,] 3 45
#> [4,] 4 38
#> [5,] 5 31
#> [6,] 6 7
## data.table approach
library(data.table)
first_indices_dt <- function(x) {
dt <- data.table(index = rep(seq_along(x), sapply(x, length)), value = unlist(x))
dt[,.(index = first(index)),value][order(value)]
}
head(first_indices_dt(x))
#> value index
#> 1: 1 1
#> 2: 2 7
#> 3: 3 45
#> 4: 4 38
#> 5: 5 31
#> 6: 6 7
Benchmarks
## stack + remove duplicate approach
first_indices_shree <- function(x) {
names(x) <- seq_len(length(x))
(d <- stack(x))[!duplicated(d$values), ]
}
## benchmarks several list sizes
bnch <- bench::press(
n_size = c(100, 1E3, 1E4),
{
x <- replicate(n_size, sample(n_size, sample(10, 1)))
bench::mark(
match = first_indices_match(x),
shree = first_indices_shree(x),
dt = first_indices_dt(x),
check = FALSE
)
}
)
#> # A tibble: 9 x 7
#> expression n_size min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 match 100 18.17µs 21.2µs 45639. 637.3KB 27.4
#> 2 shree 100 361.88µs 411.06µs 2307. 106.68KB 11.1
#> 3 dt 100 759.17µs 898.26µs 936. 264.58KB 8.51
#> 4 match 1000 158.34µs 169.9µs 5293. 164.15KB 30.8
#> 5 shree 1000 1.54ms 1.71ms 567. 412.52KB 13.2
#> 6 dt 1000 1.19ms 1.4ms 695. 372.13KB 10.7
#> 7 match 10000 3.09ms 3.69ms 255. 1.47MB 15.9
#> 8 shree 10000 18.06ms 18.95ms 51.5 4.07MB 12.9
#> 9 dt 10000 5.65ms 6.33ms 149. 2.79MB 20.5

You can simply stack the list into a dataframe and remove duplicated values. This will give you the first index for all values in the list.
set.seed(1)
x <- replicate(100, sample(100, sample(10, 1)))
names(x) <- seq_along(x)
first_indices <- (d <- stack(x))[!duplicated(d$values), ]
head(first_indices)
values ind
1 38 1
2 57 1
3 90 1
5 94 2
6 65 2
7 7 3
You can now lookup index for any value you want using %in% -
subset(first_indices, values %in% c(37,48))
values ind
11 37 3
40 48 8
Benchmarks -
set.seed(1)
x <- replicate(1000, sample(1000, sample(10, 1)))
microbenchmark::microbenchmark(
Shree = first_indices(x),
JamesB = cbind(value = 1:1000,
index = sapply(1:1000, function(i) which.max(sapply(x, function(x) i %in% x))))
)
Unit: milliseconds
expr min lq mean median uq max neval
Shree 2.3488 2.74855 4.171323 3.0577 4.7752 17.0743 100
JamesB 1750.4806 1904.79150 2519.912936 1994.9814 3282.5957 5966.1011 100

R changing Data Frame values based on a secondary Data Frame

I'm looking for a more efficient way to do some replacements/lookups.
My current method is using paste0 to create a lookup value, and then matching on that to filter.
Given x,
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
var1 var2 val1 val2
1 AA -- 1 5
2 BB AA 2 5
3 CC AA 1 7
4 DD -- 4 8
var1 is the primary column and var2 is the secondary column. val1 and val2 are value columns.
If var2 is a value in var1 and the values match, we want to replace the stated val with NA - and we want to do this independently for the value columns.
The way that I've come up with uses a lookup that loops over the columns and creates a lookup value essentially.
lookup.df <- x %>% filter(var2 == "--")
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2,x[[column]])
var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})
which does return what I would expect.
> x
var1 var2 val1 val2
1 AA -- 1 5
2 BB AA 2 NA
3 CC AA NA 7
4 DD -- 4 8
However, in practice, when profiling the code, the majority of the time is spent in the paste - and it just doesn't feel like the most efficient way to do this.
My real data set is millions of rows and about 25 columns, and runs in around 60 seconds. I'd think there'd be a way to do a logical matrix replacement instead of accessing each column individually. I can't figure it out though.
Any help is greatly appreciated. Thanks!
Edit -- benchmarks
na.replace.orig <- function(x) {
lookup.df <- x %>% filter(var2 == "--")
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2,x[[column]])
var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})
return(x)
}
# pulled out the lookup table since it causes a lot of overhead
na.replace.orig.no.lookup <- function(x) {
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2,x[[column]])
var1.lookup <- paste0(lookup.df$var1,lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})
return(x)
}
na.replace.1 <- function(x) {
inx <- match(x$var2, x$var1)
jnx <- which(!is.na(inx))
inx <- inx[!is.na(inx)]
knx <- grep("^val", names(x))
for(i in seq_along(inx))
for(k in knx)
if(x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA
return(x)
}
na.replace.2 <- function(x) {
for(col in c("val1","val2")) {
x[x[,'var2'] %in% x[,'var1'] & x[,col] %in% lookup.df[,col] , col] <- NA
}
return(x)
}
> microbenchmark::microbenchmark(na.replace.orig(x), na.replace.orig.no.lookup(x), na.replace.1(x), na.replace.2(x), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
na.replace.orig(x) 1267.23 1274.2 1441.9 1408.8 1609.8 1762.8 10
na.replace.orig.no.lookup(x) 217.43 228.9 270.9 239.2 296.6 394.2 10
na.replace.1(x) 98.46 106.3 133.0 123.9 136.6 239.2 10
na.replace.2(x) 117.74 147.7 162.9 166.6 183.0 189.9 10
Edit - 3rd Variable Required
I realized that I have a 3rd variable I need to check against.
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
var3 = c("Y","Y","N","N"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
var1 var2 var3 val1 val2
1 AA -- Y 1 5
2 BB AA Y 2 5
3 CC AA N 1 7
4 DD -- N 4 8
with the expected result
var1 var2 var3 val1 val2
1 AA -- Y 1 5
2 BB AA Y 2 NA
3 CC AA N 1 7
4 DD -- N 4 8
My code still works for this case.
x[,c("val1","val2")] <- lapply(c("val1","val2"), function(column) {
var2.lookup <- paste0(x$var2, x$var3, x[[column]])
var1.lookup <- paste0(lookup.df$var1, x$var3, lookup.df[[column]])
x[[column]][var2.lookup %in% var1.lookup] <- NA
return(x[[column]])
})

The following solution uses only vectorized logic. It uses the lookup table you already made. I think it'll be even faster than Rui's solution
library(dplyr)
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
lookup.df <- x[ x[,'var2'] == "--", ]
x[x[,'var2'] %in% x[,'var1'] & x[,'val1'] %in% lookup.df[,'val1'] , 'val1'] <- NA
x[x[,'var2'] %in% x[,'var1'] & x[,'val2'] %in% lookup.df[,'val2'] , 'val2'] <- NA
x
#> var1 var2 val1 val2
#> 1 AA -- 1 5
#> 2 BB AA 2 NA
#> 3 CC AA NA 7
#> 4 DD -- 4 8
EDIT:
It might be or it might not be.
set.seed(4)
microbenchmark::microbenchmark(na.replace.orig(x), na.replace.1(x), na.replace.2(x), times = 50)
#> Unit: microseconds
#> expr min lq mean median uq max
#> na.replace.orig(x) 184.348 192.410 348.4430 202.1615 223.375 6206.546
#> na.replace.1(x) 68.127 86.621 281.3503 89.8715 93.381 9693.029
#> na.replace.2(x) 95.885 105.858 210.7638 113.2060 118.668 4993.849
#> neval
#> 50
#> 50
#> 50
OP, you'll need to test it on your dataset to see how the two scale differently at larger-sized dataframes.
Edit 2: Implemented Rui's suggestion for the lookup table. In order from slowest to fastest benchmark:
lookup.df <- x %>% filter(var2 == "--")
lookup.df <- filter(x, var2 == "--")
lookup.df <- x[x[,'var2'] == "--", ]

I find the following solution a bit confusing (and I came up with it!) but it works.
And contrary to the popular belief, for loops are not much slower than the *apply family.
inx <- match(x$var2, x$var1)
jnx <- which(!is.na(inx))
inx <- inx[!is.na(inx)]
knx <- grep("^val", names(x))
for(i in seq_along(inx))
for(k in knx)
if(x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA
x
# var1 var2 val1 val2
#1 AA -- 1 5
#2 BB AA 2 NA
#3 CC AA NA 7
#4 DD -- 4 8

Dividing each cell in a data set by the column sum in R

I am trying to divide each cell in a data frame by the sum of the column. For example, I have a data frame df:
sample a b c
a2 1 4 6
a3 5 5 4
I would like to create a new data frame that takes each cell in and divides by the sum of the column, like so:
sample a b c
a2 .167 .444 .6
a3 .833 .556 .4
I have seen answers using sweep(), but that looks like its for matrices, and I have data frames. I understand how to use colSums(), but I'm not sure how to write a function that loops through every cell in the column, and then divides by the column sum. Thanks for the help!

Solution 1
Here are two dplyr solutions. We can use mutate_at or mutate_if to efficiently specify which column we want to apply an operation, or under what condition we want to apply an operation.
library(dplyr)
# Apply the operation to all column except sample
dat2 <- dat %>%
mutate_at(vars(-sample), funs(./sum(.)))
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
# Apply the operation if the column is numeric
dat2 <- dat %>%
mutate_if(is.numeric, funs(./sum(.)))
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
Solution 2
We can also use the map_at and map_if function from the purrr package. However, since the output is a list, we will need as.data.frame from base R or as_data_frame from dplyr to convert the list to a data frame.
library(dplyr)
library(purrr)
# Apply the operation to column a, b, and c
dat2 <- dat %>%
map_at(c("a", "b", "c"), ~./sum(.)) %>%
as_data_frame()
dat2
# # A tibble: 2 x 4
# sample a b c
# <chr> <dbl> <dbl> <dbl>
# 1 a2 0.167 0.444 0.600
# 2 a3 0.833 0.556 0.400
# Apply the operation if the column is numeric
dat2 <- dat %>%
map_if(is.numeric, ~./sum(.)) %>%
as_data_frame()
dat2
# # A tibble: 2 x 4
# sample a b c
# <chr> <dbl> <dbl> <dbl>
# 1 a2 0.167 0.444 0.600
# 2 a3 0.833 0.556 0.400
Solution 3
We can also use the .SD and .SDcols from the data.table package.
library(data.table)
# Convert to data.table
setDT(dat)
dat2 <- copy(dat)
dat2[, (c("a", "b", "c")) := lapply(.SD, function(x) x/sum(x)), .SDcols = c("a", "b", "c")]
dat2[]
# sample a b c
# 1: a2 0.1666667 0.4444444 0.6
# 2: a3 0.8333333 0.5555556 0.4
Solution 4
We can also use the lapply function to loop through all column except the first column to perform the operation.
dat2 <- dat
dat2[, -1] <- lapply(dat2[, -1], function(x) x/sum(x))
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
We can also use apply to loop through all columns but add an if-else statement in the function to make sure only perform the operation on the numeric columns.
dat2 <- dat
dat2[] <- lapply(dat2[], function(x){
# Check if the column is numeric
if (is.numeric(x)){
return(x/sum(x))
} else{
return(x)
}
})
dat2
# sample a b c
# 1 a2 0.1666667 0.4444444 0.6
# 2 a3 0.8333333 0.5555556 0.4
Solution 5
A dplyr and tidyr solution based on gather and spread.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
gather(Column, Value, -sample) %>%
group_by(Column) %>%
mutate(Value = Value/sum(Value)) %>%
spread(Column, Value)
dat2
# # A tibble: 2 x 4
# sample a b c
# * <chr> <dbl> <dbl> <dbl>
# 1 a2 0.167 0.444 0.600
# 2 a3 0.833 0.556 0.400
Performance Evaluation
I am curious about which method has the best performance. So I conduct the following performance evaluation using the microbenchmark package with a data frame having the same column names as OP's example but with 1000000 rows.
library(dplyr)
library(tidyr)
library(purrr)
library(data.table)
library(microbenchmark)
set.seed(100)
dat <- data_frame(sample = paste0("a", 1:1000000),
a = rpois(1000000, lambda = 3),
b = rpois(1000000, lambda = 3),
c = rpois(1000000, lambda = 3))
# Convert the data frame to a data.table for later perofrmance evaluation
dat_dt <- as.data.table(dat)
head(dat)
# # A tibble: 6 x 4
# sample a b c
# <chr> <int> <int> <int>
# 1 a1 2 5 2
# 2 a2 2 5 5
# 3 a3 3 2 4
# 4 a4 1 2 2
# 5 a5 3 3 1
# 6 a6 3 6 1
In addition to all the methods I proposed, I also interested two other methods proposed by others: the prop.table method proposed by Henrik in the comments, and the apply method by Spacedman. I called all my solutions with m1_1, m1_2, m2_1, ... to m5. If there are two methods in one solution, I used _ to separate them. I also called the prop.table method as m6 and the apply method as m7. Notice that I modified m6 to have an output as a data frame so that all the methods can have data frame, tibble, or data.table output.
Here is the code I used to assess the performance.
per <- microbenchmark(m1_1 = {dat2 <- dat %>% mutate_at(vars(-sample), funs(./sum(.)))},
m1_2 = {dat2 <- dat %>% mutate_if(is.numeric, funs(./sum(.)))},
m2_1 = {dat2 <- dat %>%
map_at(c("a", "b", "c"), ~./sum(.)) %>%
as_data_frame()
},
m2_2 = {dat2 <- dat %>%
map_if(is.numeric, ~./sum(.)) %>%
as_data_frame()},
m3 = {dat_dt2 <- copy(dat_dt)
dat_dt2[, c("a", "b", "c") := lapply(.SD, function(x) x/sum(x)),
.SDcols = c("a", "b", "c")]},
m4_1 = {dat2 <- dat
dat2[, -1] <- lapply(dat2[, -1], function(x) x/sum(x))},
m4_2 = {dat2 <- dat
dat2[] <- lapply(dat2[], function(x){
if (is.numeric(x)){
return(x/sum(x))
} else{
return(x)
}
})},
m5 = {dat2 <- dat %>%
gather(Column, Value, -sample) %>%
group_by(Column) %>%
mutate(Value = Value/sum(Value)) %>%
spread(Column, Value)},
m6 = {dat2 <- dat
dat2[-1] <- prop.table(as.matrix(dat2[-1]), margin = 2)},
m7 = {dat2 <- dat
dat2[, -1] = apply(dat2[, -1], 2, function(x) {x/sum(x)})}
)
print(per)
# Unit: milliseconds
# expr min lq mean median uq max neval
# m1_1 23.335600 24.326445 28.71934 25.134798 27.465017 75.06974 100
# m1_2 20.373093 21.202780 29.73477 21.967439 24.897305 216.27853 100
# m2_1 9.452987 9.817967 17.83030 10.052634 11.056073 175.00184 100
# m2_2 10.009197 10.342819 16.43832 10.679270 11.846692 163.62731 100
# m3 16.195868 17.154327 34.40433 18.975886 46.521868 190.50681 100
# m4_1 8.100504 8.342882 12.66035 8.778545 9.348634 181.45273 100
# m4_2 8.130833 8.499926 15.84080 8.766979 9.732891 172.79242 100
# m5 5373.395308 5652.938528 5791.73180 5737.383894 5825.141584 6660.35354 100
# m6 117.038355 150.688502 191.43501 166.665125 218.837502 325.58701 100
# m7 119.680606 155.743991 199.59313 174.007653 215.295395 357.02775 100
library(ggplot2)
autoplot(per)
The result shows that methods based on lapply (m4_1 and m4_2) are the fastest, while the tidyr approach (m5) is the slowest, indicating that when row numbers are large it is not a good idea to use the gather and spread method.
DATA
dat <- read.table(text = "sample a b c
a2 1 4 6
a3 5 5 4",
header = TRUE, stringsAsFactors = FALSE)

Given this:
> d = data.frame(sample=c("a2","a3"),a=c(1,5),b=c(4,5),c=c(6,4))
> d
sample a b c
1 a2 1 4 6
2 a3 5 5 4
You can replace every column other than the first by applying over the rest:
> d[,-1] = apply(d[,-1],2,function(x){x/sum(x)})
> d
sample a b c
1 a2 0.1666667 0.4444444 0.6
2 a3 0.8333333 0.5555556 0.4
If you don't want d being stomped on make a copy beforehand.

You could do this in dplyr as well.
sample <- c("a2", "a3")
a <- c(1, 5)
b <- c(4, 5)
c <- c(6, 4)
dat <- data.frame(sample, a, b, c)
dat
library(dplyr)
dat %>%
mutate(
a.PCT = round(a/sum(a), 3),
b.PCT = round(b/sum(b), 3),
c.PCT = round(c/sum(c), 3))
sample a b c a.PCT b.PCT c.PCT
1 a2 1 4 6 0.167 0.444 0.6
2 a3 5 5 4 0.833 0.556 0.4

You can use the transpose of the matrix and then transpose again:
t(t(as.matrix(df))/colSums(df))

try apply:
mat <- matrix(1:6, ncol=3)
apply(mat,2, function(x) x / sum(x))
okay, if you have not numeric values in you columns you can force them to be numeric:
df <- data.frame( a=c('a', 'b'), b=c(3,4), d=c(1,6))
apply(df,2, function(x) {
x <- as.numeric(x)
x / sum(x)
})

Counting amount of zeros within a "melted" data frame

Hei, I learn R and I try to count how many zeros I have within the melted data. So, I want to know how many zeros corresponds to column a and b and print two results out.
I generated an example:
library(reshape)
library(plyr)
library(dplyr)
id = c(1,2,3,4,5,6,7,8,9,10)
b = c(0,0,5,6,3,7,2,8,1,8)
c = c(0,4,9,87,0,87,0,4,5,0)
test = data.frame(id,b,c)
test_melt = melt(test, id.vars = "id")
test_melt
I imagine for that I should create an if statement. Something with
if (test$value == 0){print()}, but how can I tell R to count zeros for a columns that have been melted?

With your data:
test_melt %>%
group_by(variable) %>%
summarize(zeroes = sum(value == 0))
# # A tibble: 2 x 2
# variable zeroes
# <fctr> <int>
# 1 b 2
# 2 c 4
Base R:
aggregate(test_melt$value, by = list(variable = test_melt$variable),
FUN = function(x) sum(x == 0))
# variable x
# 1 b 2
# 2 c 4
... and for curiosity:
library(microbenchmark)
microbenchmark(
dplyr = group_by(test_melt, variable) %>% summarize(zeroes = sum(value == 0)),
base1 = aggregate(test_melt$value, by = list(variable = test_melt$variable), FUN = function(x) sum(x == 0)),
# #PankajKaundal's suggested "formula" notation reads easier
base2 = aggregate(value ~ variable, test_melt, function(x) sum(x == 0))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 916.421 986.985 1069.7000 1022.1760 1094.7460 2272.636 100
# base1 647.658 682.302 783.2065 715.3045 765.9940 1905.411 100
# base2 813.219 867.737 950.3247 897.0930 959.8175 2017.001 100

sum(test_melt$value==0)
This should do it.

This might help . Is this what you're looking for ?
> test_melt[4] <- 1
> test_melt2 <- aggregate(V4 ~ value + variable, test_melt, sum)
> test_melt2
value variable V4
1 0 b 2
2 1 b 1
3 2 b 1
4 3 b 1
5 5 b 1
6 6 b 1
7 7 b 1
8 8 b 2
9 0 c 4
10 4 c 2
11 5 c 1
12 9 c 1
13 87 c 2
V4 is the count

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Two-way contingency table in R - r

Related

How to show values as percentage of column total in R like in excel pivot table? [duplicate]

Finding the first time a value shows up in a list efficiently

R changing Data Frame values based on a secondary Data Frame

Dividing each cell in a data set by the column sum in R

Counting amount of zeros within a "melted" data frame

Categories

Resources