I have the following data frame, where "x" is a grouping variable and "y" some values:
dat <- data.frame(x = c(1, 2, 3, 3, 2, 1), y = c(3, 4, 4, 5, 2, 5))
I want to create a new column where each "y" value is divided by the sum of "y" within each group defined by "x". E.g. the result for the first row is 3 / (3 + 5) = 0.375, where the denominator is the sum of "y" values for group 1 (x = 1).
There are various ways of solving this; here's one:
with(dat, ave(y, x, FUN = function(x) x/sum(x)))
## [1] 0.3750000 0.6666667 0.4444444 0.5555556 0.3333333 0.6250000
Here's another possibility
library(data.table)
setDT(dat)[, z := y/sum(y), by = x]
dat
# x y z
# 1: 1 3 0.3750000
# 2: 2 4 0.6666667
# 3: 3 4 0.4444444
# 4: 3 5 0.5555556
# 5: 2 2 0.3333333
# 6: 1 5 0.6250000
Here's a third one
library(dplyr)
dat %>%
  group_by(x) %>%
  mutate(z = y/sum(y))
# Source: local data frame [6 x 3]
# Groups: x
#
# x y z
# 1 1 3 0.3750000
# 2 2 4 0.6666667
# 3 3 4 0.4444444
# 4 3 5 0.5555556
# 5 2 2 0.3333333
# 6 1 5 0.6250000
Here are some base R solutions:
1) prop.table Use the base prop.table function with ave like this:
transform(dat, z = ave(y, x, FUN = prop.table))
giving:
x y z
1 1 3 0.3750000
2 2 4 0.6666667
3 3 4 0.4444444
4 3 5 0.5555556
5 2 2 0.3333333
6 1 5 0.6250000
2) sum This also works:
transform(dat, z = y / ave(y, x, FUN = sum))
And of course there's a way for people who think in SQL. It's very wordy in this case, but it generalises nicely to all sorts of similar problems:
library(sqldf)
dat <- sqldf("
  with sums as (
    select
      x
      ,sum(y) as sy
    from dat
    group by x
  )
  select
    d.x
    ,d.y
    ,d.y/s.sy as z
  from dat d
  inner join sums s
    on d.x = s.x
")
MWE.
library(data.table)
x <- data.table(
  g=rep(c("x", "y"), each=4),   # grouping variable
  time=c(1,3,5,7,2,4,6,8),      # time index
  val=1:8)                      # value
setkeyv(x, c("g", "time"))
cumsd <- function(x) sapply(sapply(seq_along(x)-1, head, x=x), sd)
x[, cumsd(val), by=g]
## Output
# g V1
# 1: x NA
# 2: x NA
# 3: x 0.7071068
# 4: x 1.0000000
# 5: y NA
# 6: y NA
# 7: y 0.7071068
# 8: y 1.0000000
I want to compute the standard deviation (or more generally, a mathematical function) of all prior values (not including the current value), per observation, by group, in R.
The cumsd ("cumulative sd") function above does what I need. For example, for row 3, V1 = sd(c(1, 2)), corresponding to the values in rows 1 and 2; for row 7, V1 = sd(c(5, 6)), corresponding to the values in rows 5 and 6.
However, cumsd is very slow (too slow to use in my real-world application). Any ideas on how to do the computation more efficiently?
Edit
For sd we can use runSD from library TTR as discussed here: Calculating cumulative standard deviation by group using R
Gabor's answer below addresses the more general case of any arbitrary mathematical function on prior values, though the generalisability potentially comes at some cost in efficiency.
We can specify the window widths as a vector and then omit the last value in the window for each application of sd.
library(zoo)
x[, sd:=rollapplyr(val, seq_along(val), function(x) sd(head(x, -1)), fill = NA), by = g]
giving:
> x
g time val sd
1: x 1 1 NA
2: x 3 2 NA
3: x 5 3 0.7071068
4: x 7 4 1.0000000
5: y 2 5 NA
6: y 4 6 NA
7: y 6 7 0.7071068
8: y 8 8 1.0000000
Alternatively, we can specify the offsets in a list. Negative offsets, used here, refer to prior values, so -1 is the immediately prior value, -2 the value before that, and so on.
negseq <- function(x) -seq_len(x)
x[, sd:=rollapplyr(val, lapply(seq_along(val)-1, negseq), sd, fill = NA), by = g]
giving:
> x
g time val sd
1: x 1 1 NA
2: x 3 2 NA
3: x 5 3 0.7071068
4: x 7 4 1.0000000
5: y 2 5 NA
6: y 4 6 NA
7: y 6 7 0.7071068
8: y 8 8 1.0000000
We can use TTR::runSD with shift (the shift lags the cumulative standard deviation by one row, so each value reflects only prior observations):
library(TTR)
setDT(x)[, cum_sd := shift(runSD(val, n = 2, cumulative = TRUE)), g]
# g time val cum_sd
#1: x 1 1 NA
#2: x 3 2 NA
#3: x 5 3 0.7071068
#4: x 7 4 1.0000000
#5: y 2 5 NA
#6: y 4 6 NA
#7: y 6 7 0.7071068
#8: y 8 8 1.0000000
It turned out that neither option was fast enough for my application (millions of groups and observations). But your comments inspired me to write a small function in Rcpp that did the trick. Thanks everyone!
library(data.table)
library(Rcpp)
x <- data.table(
  g=rep(c("x", "y"), each=4),   # grouping variable
  time=c(1,3,5,7,2,4,6,8),      # time index
  val=1:8)                      # value
setkeyv(x, c("g", "time"))
cumsd <- function(x) sapply(sapply(seq_along(x)-1, head, x=x), sd)
x[, v1:=cumsd(val), by=g]
cppFunction('
Rcpp::NumericVector rcpp_cumsd(Rcpp::NumericVector inputVector){
  int len = inputVector.size();
  // fill the result with NA; the sd of fewer than two prior values is undefined
  Rcpp::NumericVector outputVector(len, NumericVector::get_na());
  if (len < 3) return(outputVector);
  // for each position i, take the sd of all values strictly before it
  for (int i = 2; i < len; ++i){
    outputVector(i) = Rcpp::sd(inputVector[Rcpp::seq(0, i - 1)]);
  }
  return(outputVector);
}
')
x[, v2:= rcpp_cumsd(val), by=g]
all.equal(x$v1, x$v2)
## TRUE
The speed difference seems to depend on the number of groups vs. the number of observations per group in the data.table. I won't post benchmarks, but in my case the Rcpp version was much, much faster.
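For what it's worth, a minimal sketch of such a comparison (the microbenchmark package and the test vector here are my own assumptions, not from the original post) could look like:
library(microbenchmark)
v <- rnorm(1e3)  # hypothetical single group of values
microbenchmark(
  r_version    = cumsd(v),       # the pure-R version defined above
  rcpp_version = rcpp_cumsd(v),  # the Rcpp version defined above
  times = 5
)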
I have a dataframe df:
df <- data.frame(a = 1:5, b = 6:10)
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
For each column, I want to divide each value by the column mean, where the mean is calculated by excluding the focal value from calculation of the mean ("leave-one-out" mean).
For example, for the first two values in column "a", the calculation is:
1: 1 / ((2 + 3 + 4 + 5) / 4) = 0.2857143
2: 2 / ((1 + 3 + 4 + 5) / 4) = 0.6153846
etc.
"Leave-one-out means":
mean_a mean_b
1 3.5 8.5
2 3.25 8.25
3 3 8
4 2.75 7.75
5 2.5 7.5
The desired result: values / "leave-one-out" means
res_a res_b
1 0.285 0.705
2 0.615 0.848
3 1 1
4 1.454 1.161
5 2 1.333
Many thanks for any help!
If I understand it correctly, the following should do it.
res <- sapply(df, function(x)
  sapply(seq_along(x), function(i) x[i]/mean(x[-i]))
)
res <- as.data.frame(res)
names(res) <- paste("c", names(res), sep = "_")
res
# c_a c_b
#1 0.2857143 0.7058824
#2 0.6153846 0.8484848
#3 1.0000000 1.0000000
#4 1.4545455 1.1612903
#5 2.0000000 1.3333333
Just use the magic of indexing and vectors in R:
for(i in 1:nrow(df)){
  print(df$a[i]/mean(df$a[-i]))
}
I have only done this for column a; I hope you can do the same for b and convert the result into a data frame, as sketched below. Let me know if you need help; happy to help with R.
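For what it's worth, a small sketch of that extension (my own illustration, not part of the original answer) could loop over both columns and collect the results into a data frame:
res <- df  # copy of df to hold the results, keeping the same shape
for (j in names(df)) {
  for (i in seq_len(nrow(df))) {
    # divide each value by the mean of the other values in its column
    res[i, j] <- df[i, j] / mean(df[-i, j])
  }
}
res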
A vectorized possibility, which will be faster for larger data. It uses the fact that the leave-one-out mean of each value is (column sum - value) / (n - 1):
df / ((rep(colSums(df), each = nrow(df)) - df) / (nrow(df) - 1))
# a b
# 0.2857143 0.7058824
# 0.6153846 0.8484848
# 1.0000000 1.0000000
# 1.4545455 1.1612903
# 2.0000000 1.3333333
I have a dataframe where I would like to select, within each group, the rows where y is closest to a specific value (e.g. 5).
set.seed(1234)
df <- data.frame(x = c(rep("A", 4),
rep("B", 4)),
y = c(rep(4, 2), rep(1, 2), rep(6, 2), rep(3, 2)),
z = rnorm(8))
df
## x y z
## 1 A 4 -1.2070657
## 2 A 4 0.2774292
## 3 A 1 1.0844412
## 4 A 1 -2.3456977
## 5 B 6 0.4291247
## 6 B 6 0.5060559
## 7 B 3 -0.5747400
## 8 B 3 -0.5466319
The result would be:
## x y z
## 1 A 4 -1.2070657
## 2 A 4 0.2774292
## 3 B 6 0.4291247
## 4 B 6 0.5060559
Thank you, Philippe
df %>%
  group_by(x) %>%
  mutate(
    delta = abs(y - 5)
  ) %>%
  filter(delta == min(delta)) %>%
  select(-delta)
Alternatively, using base R (this relies on the rows being ordered by group, as they are here):
df[do.call(c, tapply(df$y, df$x, function(v) abs(v - 5) == min(abs(v - 5)))), ]
x y z
1 A 4 -1.2070657
2 A 4 0.2774292
5 B 6 0.4291247
6 B 6 0.5060559
Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); grouped by 'x', we take the absolute difference between 'y' and 5, find the elements where that difference is at its minimum, get the row index (.I), extract the row-index column ("V1"), and use it to subset the dataset.
library(data.table)
setDT(df)[df[, {v1 <- abs(y-5)
.I[v1==min(v1)]}, x]$V1]
# x y z
#1: A 4 -1.2070657
#2: A 4 0.2774292
#3: B 6 0.4291247
#4: B 6 0.5060559
val <- 5
delta <- abs(val - df$y)
# keep the rows whose distance to val is the minimum within their group
df <- df[delta == ave(delta, df$x, FUN = min), ]
I am trying to divide each value in columns B and C by that column's sum within each group defined by the factor in column A.
The starting data could look something like this (but with thousands of rows), where A is a factor and B and C contain the values:
library(data.table)
A <- c(1,1,2,2)
B <- c(0.2, 0.3, 1, 0.5)
C <- c(0.7, 0.5, 0, 0.9)
M <- data.table(A,B,C)
> M
   A   B   C
1: 1 0.2 0.7
2: 1 0.3 0.5
3: 2 1.0 0.0
4: 2 0.5 0.9
The factors can occur any number of times.
I was able to produce the per-factor sums with the data.table library:
library(data.table)
M.dt <- data.table(M)
M.sum <- M.dt[, lapply(.SD, sum), by = A]
> M.sum
A B C
1: 1 0.5 1.2
2: 2 1.5 0.9
but I didn't know how to go on from here while keeping the original format of the table.
The resulting table should look like this:
B.1 <- c(0.4, 0.6, 0.666, 0.333)
C.1 <- c(0.583, 0.416, 0, 1)
M.1 <- cbind(A, B.1, C.1)
> M.1
A B.1 C.1
[1,] 1 0.400 0.58333
[2,] 1 0.600 0.41666
[3,] 2 0.666 0.00000
[4,] 2 0.333 1.00000
The calculation for the first value in B.1 would go like this:
0.2/(0.2+0.3) = 0.4 and so on, where the values to add are given by the factor in A.
I have some basic knowledge of R, but despite trying hard, I do badly with matrix manipulations and loops.
Simply divide each value in each column by its sum for each value of A:
M[, lapply(.SD, function(x) x/sum(x)), A]
# A B C
# 1: 1 0.4000000 0.5833333
# 2: 1 0.6000000 0.4166667
# 3: 2 0.6666667 0.0000000
# 4: 2 0.3333333 1.0000000
If you want to update by reference, do
M[, c("B", "C") := lapply(.SD, function(x) x/sum(x)), A]
Or more generally
M[, names(M)[-1] := lapply(.SD, function(x) x/sum(x)), A]
A bonus solution for the dplyr junkies
library(dplyr)
M %>%
  group_by(A) %>%
  mutate_each(funs(./sum(.)))
# Source: local data table [4 x 3]
# Groups: A
#
# A B C
# 1 1 0.4000000 0.5833333
# 2 1 0.6000000 0.4166667
# 3 2 0.6666667 0.0000000
# 4 2 0.3333333 1.0000000
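As an aside (not from the original answer): mutate_each() and funs() have since been deprecated in dplyr, and with a current dplyr version the same result can be written with across():
library(dplyr)
M %>%
  group_by(A) %>%
  mutate(across(everything(), ~ .x / sum(.x))) %>%  # divide every non-grouping column by its group sum
  ungroup()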
Like most problems of this type, you can use either the data.table or plyr package, or some combination of the split, apply, and combine functions in base R.
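For instance, a base R split-apply-combine sketch of the same computation (my own illustration; the intermediate names are arbitrary) could be:
M_df <- as.data.frame(M)
# split the rows by A, rescale each value column by its group sum, then recombine
res <- do.call(rbind, lapply(split(M_df, M_df$A), function(d) {
  d[-1] <- lapply(d[-1], function(x) x / sum(x))
  d
}))
res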
For those who prefer the plyr package
library(plyr)
M <- data.table(A,B,C)
ddply(M, .(A), colwise(function(x) x/sum(x)))
Output is:
A B C
1 1 0.4000000 0.5833333
2 1 0.6000000 0.4166667
3 2 0.6666667 0.0000000
4 2 0.3333333 1.0000000