I have a dataframe df:
df <- data.frame(a = 1:5, b = 6:10)
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
For each column, I want to divide each value by the column mean, where the mean is calculated excluding the focal value (a "leave-one-out" mean).
For example, for the first two values in column "a" the calculation is:
1: 1 / ((2 + 3 + 4 + 5) / 4) = 0.2857143
2: 2 / ((1 + 3 + 4 + 5) / 4) = 0.6153846
etc.
"Leave-one-out means":
mean_a mean_b
1 3.5 8.5
2 3.25 8.25
3 3 8
4 2.75 7.75
5 2.5 7.5
The desired result: values / "leave-one-out" means
res_a res_b
1 0.285 0.705
2 0.615 0.848
3 1 1
4 1.454 1.161
5 2 1.333
Many thanks for any help!
If I understand it correctly, the following should do it.
res <- sapply(df, function(x)
  sapply(seq_along(x), function(i) x[i] / mean(x[-i]))
)
res <- as.data.frame(res)
names(res) <- paste("c", names(res), sep = "_")
res
# c_a c_b
#1 0.2857143 0.7058824
#2 0.6153846 0.8484848
#3 1.0000000 1.0000000
#4 1.4545455 1.1612903
#5 2.0000000 1.3333333
Just use the magic of indexing and vectors in R:
for (i in 1:nrow(df)) {
  print(df$a[i] / mean(df$a[-i]))
}
I have only done this for column "a"; the same idea applies to column "b", and the results can then be collected into a data frame (see the sketch below).
Let me know if you need help. Happy to help with R.
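For completeness, a minimal sketch of that loop extended to both columns and collected into a data frame (the res_ column names are my own choice, matching the desired output above):
res <- df  # same dimensions as df
for (j in names(df)) {
  for (i in seq_len(nrow(df))) {
    res[i, j] <- df[i, j] / mean(df[-i, j])
  }
}
names(res) <- paste("res", names(df), sep = "_")
res
#       res_a     res_b
# 1 0.2857143 0.7058824
# 2 0.6153846 0.8484848
# 3 1.0000000 1.0000000
# 4 1.4545455 1.1612903
# 5 2.0000000 1.3333333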
A vectorized possibility, which will be faster for larger data.
df / ((rep(colSums(df), each = nrow(df)) - df) / (nrow(df) - 1))
#           a         b
# 1 0.2857143 0.7058824
# 2 0.6153846 0.8484848
# 3 1.0000000 1.0000000
# 4 1.4545455 1.1612903
# 5 2.0000000 1.3333333
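This works because the leave-one-out mean of x[i] is (sum(x) - x[i]) / (n - 1), and the expression above builds exactly that denominator for every cell at once. A quick sanity check against the sapply result res from the first answer (a sketch):
loo <- df / ((rep(colSums(df), each = nrow(df)) - df) / (nrow(df) - 1))
all.equal(as.matrix(loo), as.matrix(res), check.attributes = FALSE)
# [1] TRUE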
Here is a sample dataset:
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
I would like to create a new column of means using rowMeans. I would like to keep na.rm = FALSE because if a value is truly NA, I do not want a mean calculated for that row at all (it should stay NA).
But if a value is 88 or 99, I would like R to ignore it while calculating the mean and still use the remaining valid values. So far I have the below.
data$mean <- rowMeans(subset(data, select = c(`x`,`y`,`z`,`w`)), na.rm = T)
But I am not sure how to add a function that would simply ignore the 88 and 99 in the calculation.
This is what I am hoping to get:
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4),
mean=c(4,NA,3.25,3.25,4.3))
Any help is appreciated - thank you!
Using rowMeans nevertheless with na.rm=TRUE, but on a row subset and temporarily replacing 88 and 99 with NA.
s <- rowSums(is.na(data)) == 0  ## store row subset (rows without genuine NA)
v <- c("x", "y", "z", "w")      ## col subset to calc. mean
data$mean <- NA                 ## initialize column
m <- as.matrix(data[v])         ## we'll need a matrix
data$mean[s] <- rowMeans(replace(m[s, v], m[s, v] %in% c(88, 99), NA), na.rm=TRUE)
data
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
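The same idea can be written a little more compactly by blanking out 88/99 in the whole matrix first and filling only the rows that contain no genuine NA (a sketch, reusing v from above):
m  <- as.matrix(data[v])
m2 <- replace(m, m %in% c(88, 99), NA)
data$mean <- ifelse(rowSums(is.na(m)) == 0, rowMeans(m2, na.rm = TRUE), NA)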
Or, more simply, use apply, although it is much slower.
f <- \(x) if (any(is.na(x))) NA else mean(x[!x %in% c(88, 99)])
cbind(data, mean=apply(data, 1, f))
# x y z w mean
# 1 4 4 88 4 4.000000
# 2 3 NA NA 5 NA
# 3 4 3 4 2 3.250000
# 4 4 2 4 3 3.250000
# 5 99 4 5 4 4.333333
From microbenchmark.
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# apply 35.018162 35.941815 38.834333 36.394632 36.960161 212.469412 100 b
# rowMeans 1.097393 1.119396 1.493563 1.193787 1.226691 9.352118 100 a
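The benchmark code itself is not shown; it was presumably something along these lines, run on a much larger copy of the data (the 1e5-row size here is only an assumption):
library(microbenchmark)
big <- data[sample(nrow(data), 1e5, replace = TRUE), v]  # assumed size
microbenchmark(
  rowMeans = {
    s <- rowSums(is.na(big)) == 0
    m <- as.matrix(big)
    out <- rep(NA_real_, nrow(big))
    out[s] <- rowMeans(replace(m[s, ], m[s, ] %in% c(88, 99), NA), na.rm = TRUE)
  },
  apply = apply(big, 1, f),
  times = 100L
)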
data <- data.frame(x=c(4,3,4,4,99),
y=c(4,NA,3,2,4),
z = c(88,NA,4,4,5),
w = c(4,5,2,3,4))
data$mean <- apply(data, 1, function(x) {
  idx <- which(!(x %in% c(88, 99)))
  mean(x[idx], na.rm = TRUE)
})
data
   x  y  z w     mean
1  4  4 88 4 4.000000
2  3 NA NA 5 4.000000
3  4  3  4 2 3.250000
4  4  2  4 3 3.250000
5 99  4  5 4 4.333333
Note that with na.rm = TRUE, row 2 comes out as 4 rather than the NA requested in the question.
I have the following data frame, where "x" is a grouping variable and "y" some values:
dat <- data.frame(x = c(1, 2, 3, 3, 2, 1), y = c(3, 4, 4, 5, 2, 5))
I want to create a new column where each "y" value is divided by the sum of "y" within each group defined by "x". E.g. the result for the first row is 3 / (3 + 5) = 0.375, where the denominator is the sum of "y" values for group 1 (x = 1).
There are various ways of solving this, here's one
with(dat, ave(y, x, FUN = function(x) x/sum(x)))
## [1] 0.3750000 0.6666667 0.4444444 0.5555556 0.3333333 0.6250000
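To store that directly as a new column (a one-line sketch):
dat$z <- with(dat, ave(y, x, FUN = function(x) x / sum(x)))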
Here's another possibility
library(data.table)
setDT(dat)[, z := y/sum(y), by = x]
dat
# x y z
# 1: 1 3 0.3750000
# 2: 2 4 0.6666667
# 3: 3 4 0.4444444
# 4: 3 5 0.5555556
# 5: 2 2 0.3333333
# 6: 1 5 0.6250000
Here's a third one
library(dplyr)
dat %>%
group_by(x) %>%
mutate(z = y/sum(y))
# Source: local data frame [6 x 3]
# Groups: x
#
# x y z
# 1 1 3 0.3750000
# 2 2 4 0.6666667
# 3 3 4 0.4444444
# 4 3 5 0.5555556
# 5 2 2 0.3333333
# 6 1 5 0.6250000
Here are some base R solutions:
1) prop.table Use the base prop.table function with ave like this:
transform(dat, z = ave(y, x, FUN = prop.table))
giving:
x y z
1 1 3 0.3750000
2 2 4 0.6666667
3 3 4 0.4444444
4 3 5 0.5555556
5 2 2 0.3333333
6 1 5 0.6250000
2) sum This also works:
transform(dat, z = y / ave(y, x, FUN = sum))
And of course there's a way for people thinking in SQL, very wordy in this case, but nicely generalising to all sorts of other similar problems:
library(sqldf)
dat <- sqldf("
with sums as (
select
x
,sum(y) as sy
from dat
group by x
)
select
d.x
,d.y
,d.y/s.sy as z
from dat d
inner join sums s
on d.x = s.x
")
Here is an example:
my_string<-"2a+5b-2c+2d=9; 3a-2b+1c-3d=34; -3a+3b+2c+4d=33; 2a+3b+4c+5d=125"
s <- my_string
p <- ";"
s2 <- gsub(p,"",s)
w <- nchar(s) - nchar(s2) + 1
s1 <- my_string
p1 <- "[:a-z:]"
s3 <- gsub(p1,"",s1)
k <- (nchar(s1) - nchar(s3) )/w +1
my_string<-strsplit(my_string, ";")
my_string<-unlist(my_string)
my_string<-trimws(my_string)
for (i in 1:w) {
print(noquote(paste0(my_string[i])))
}
sp2 <- strsplit(my_string, "=")
b <- as.numeric(sapply(sp2, '[[', 2))
sp3 <- lapply(lapply(sp2, '[[', 1), function(s) gsub("([-+])([[:alpha:]])", "\\11\\2", s))
sp3 <- lapply(sp3, trimws)
sp3 <- lapply(sp3, function(s1) sub("^([[:alpha:]])", "1\\1", s1))
A <- do.call(rbind, lapply(sp3, function(x) as.numeric(unlist(strsplit(x, "[[:alpha:]]")))))
x <- cbind(A,b)
x
The output of this program is this matrix:
b
[1,] 2 5 -2 2 9
[2,] 3 -2 1 -3 34
[3,] -3 3 2 4 33
[4,] 2 3 4 5 125
It works correctly. But a problem arises when at least one equation is missing some of the variables that appear in the other equations. In that case the program does not work.
My question is: how can it be modified so that, for example, this system of equations:
my_string <- "2a+5b-2c=9; 3a-2b-3d=34; -3a+3b+2c+4d=33; 2a+4c+5d=125"
will give this output:
b
[1,] 2 5 -2 0 9
[2,] 3 -2 0 -3 34
[3,] -3 3 2 4 33
[4,] 2 0 4 5 125
(every column stands for one variable; if some variables do not occur in an equation, the value 0 should be returned at those positions)
Thank you in advance.
One option with tidyverse would be
library(tidyverse)
str_split(my_string, "; ") %>%
  magrittr::extract2(1) %>%
  map_df(~ .x %>% {
    x1 <- str_extract_all(., pattern = "[0-9-]+")[[1]]
    x2 <- c(str_extract_all(., pattern = "[a-z]+")[[1]], "e")
    set_names(as.list(x1), x2)
  } %>%
    as_tibble) %>%
  select(a, b, c, d, e) %>%
  mutate_all(funs(as.numeric(replace(., is.na(.), 0))))
# A tibble: 4 x 5
# a b c d e
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2.00 5.00 -2.00 0 9.00
#2 3.00 -2.00 0 -3.00 34.0
#3 -3.00 3.00 2.00 4.00 33.0
#4 2.00 0 4.00 5.00 125
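If you prefer to stay in base R, the same idea (fill a coefficient vector named by all variables found in the string, leaving missing ones at 0) could look roughly like this, applied to the my_string with missing terms from the question; the regexes and object names are my own, so treat it as a sketch rather than the canonical solution:
eqs  <- trimws(unlist(strsplit(my_string, ";")))
vars <- sort(unique(unlist(regmatches(eqs, gregexpr("[a-z]", eqs)))))
A <- t(sapply(eqs, function(e) {
  lhs   <- sub("=.*$", "", e)
  lhs   <- gsub("(^|[+-])([a-z])", "\\11\\2", lhs)   # insert implicit 1 coefficients
  terms <- regmatches(lhs, gregexpr("[+-]?[0-9]+[a-z]", lhs))[[1]]
  coef  <- setNames(numeric(length(vars)), vars)
  coef[sub(".*[0-9]", "", terms)] <- as.numeric(sub("[a-z]$", "", terms))
  coef
}, USE.NAMES = FALSE))
b <- as.numeric(sub(".*=", "", eqs))
cbind(A, b)
This reproduces the 4 x 5 matrix asked for above, with zeros where a variable is absent from an equation.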
Recently, I have found that I am using the following pattern over and over again. The process is:
cross-tabulate numeric variable by factor using table
create data frame from created table
add original numeric values to data frame (from row names (!))
remove row names
reorder columns of aggregated data frame
In R, it looks like this:
# Sample data
df <- data.frame(x = round(runif(100), 1),
y = factor(ifelse(runif(100) > .5, 1, 0),
labels = c('failure', 'success'))
)
# Get frequencies
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
# Add column of original values from rownames
dfSummary$x <- as.numeric(rownames(dfSummary))
# Remove rownames
rownames(dfSummary) <- NULL
# Reorder columns
dfSummary <- dfSummary[, c(3, 1, 2)]
Is there anything more elegant in R, preferably using base functions? I know I can use SQL to do this in a single command; I think it must be possible to achieve similar behavior in R.
sqldf solution:
library(sqldf)
dfSummary <- sqldf("select
x,
sum(y = 'failure') as failure,
sum(y = 'success') as success
from df group by x")
An alternative with base R could be:
aggregate(. ~ x, transform(df, success = y == "success",
          failure = y == "failure", y = NULL), sum)
# x success failure
#1 0.0 2 4
#2 0.1 6 8
#3 0.2 1 7
#4 0.3 5 4
#5 0.4 6 6
#6 0.5 3 3
#7 0.6 4 6
#8 0.7 6 6
#9 0.8 4 5
#10 0.9 6 7
#11 1.0 1 0
Your code modified as a function would be efficient compared to the other base R solutions (so far). If you wanted the code in one line, a "reshape/table" combo from base R could be used.
reshape(as.data.frame(table(df)), idvar='x', timevar='y',
direction='wide')
# x Freq.failure Freq.success
#1 0 3 2
#2 0.1 3 9
#3 0.2 5 5
#4 0.3 8 7
#5 0.4 5 3
#6 0.5 9 4
#7 0.6 3 6
#8 0.7 7 6
#9 0.8 3 1
#10 0.9 4 3
#11 1 0 4
In case you want to try data.table
library(data.table)
dcast.data.table(setDT(df), x~y)
# x failure success
# 1: 0.0 3 2
# 2: 0.1 3 9
# 3: 0.2 5 5
# 4: 0.3 8 7
# 5: 0.4 5 3
# 6: 0.5 9 4
# 7: 0.6 3 6
# 8: 0.7 7 6
# 9: 0.8 3 1
#10: 0.9 4 3
#11: 1.0 0 4
Update
I didn't notice that as.data.frame(table(...)) converts to "factor" columns (thanks to @Hadley's comment). A workaround is:
res <- transform(reshape(as.data.frame(table(df), stringsAsFactors=FALSE),
idvar='x', timevar='y', direction='wide'), x= as.numeric(x))
data
set.seed(24)
df <- data.frame(x = round(runif(100), 1),
y = factor(ifelse(runif(100) > .5, 1, 0),
labels = c('failure', 'success'))
)
Benchmarks
set.seed(24)
df <- data.frame(x = round(runif(1e6), 1),
y = factor(ifelse(runif(1e6) > .5, 1, 0),
labels = c('failure', 'success'))
)
tomas <- function(){
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
dfSummary$x <- as.numeric(rownames(dfSummary))
dfSummary <- dfSummary[, c(3, 1, 2)]}
doc <- function(){aggregate(. ~ x, transform(df,
success = y == "success", failure = y == "failure",
y = NULL), sum)}
akrun <- function(){reshape(as.data.frame(table(df)),
idvar='x', timevar='y', direction='wide')}
library(microbenchmark)
microbenchmark(tomas(), doc(), akrun(), unit='relative', times=20L)
# Unit: relative
#expr min lq mean median uq max neval cld
#tomas() 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000 20 a
#doc() 13.451037 11.5050997 13.082074 13.043584 12.8048306 19.715535 20 b
#akrun() 1.019977 0.9522809 1.012332 1.007569 0.9993835 1.533191 20 a
Updated with dcast.data.table
df1 <- copy(df)
akrun2 <- function() {dcast.data.table(setDT(df1), x~y)}
microbenchmark(tomas(), akrun2(), unit='relative', times=20L)
# Unit: relative
# expr min lq mean median uq max neval cld
# tomas() 6.493231 6.345752 6.410853 6.51594 6.502044 5.591753 20 b
# akrun2() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
This should be relatively efficient. You cannot really suppress row names in a data frame, since they are a requirement of a valid data frame.
X <- table(df$x,df$y)
cbind( data.frame(x=rownames(X)), unclass(X) )
x failure success
0 0 5 3
0.1 0.1 6 1
0.2 0.2 7 8
0.3 0.3 7 3
0.4 0.4 6 6
0.5 0.5 6 4
0.6 0.6 2 5
0.7 0.7 2 7
0.8 0.8 3 7
0.9 0.9 4 6
1 1 2 0
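If you also want x as a numeric column with default row names, as in the original steps, a small follow-up (sketch):
dfSummary <- cbind(data.frame(x = as.numeric(rownames(X))), unclass(X))
rownames(dfSummary) <- NULL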