Find the max and min value within each group [duplicate]

This question already has answers here:
How to get summary statistics by group
(14 answers)
Closed 4 years ago.
I want to find the max and min for each Gene in the following table.
I know that the following call gives the max (or the min, if I pass min instead), but I could not manage to get both at the same time.
tapply(df$Value, df$Gene, max)
Any help is appreciated!
Small test set:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
A 123
A 1
B 3
B 5
B 6
C 1
D 3
D 45
D 98
D 234
D 4')

range() is the function that returns both the min and the max, in that order.
Here you'd do:
tapply(df$Value, df$Gene, range)
# $A
# [1] 1 123
# $B
# [1] 3 6
# $C
# [1] 1 1
# $D
# [1] 3 234
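If a table is easier to work with than the list that tapply returns, the same range() call can be collected into a matrix. This is only a sketch on the question's test data; the column names min and max are labels added here, not something range() produces.

```r
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
A 123
A 1
B 3
B 5
B 6
C 1
D 3
D 45
D 98
D 234
D 4')

# split() gives one vector of values per gene; range() returns c(min, max)
res <- t(sapply(split(df$Value, df$Gene), range))
colnames(res) <- c("min", "max")  # labels chosen for this example
res
```

Each row of res is one gene, so res["A", "max"] picks out a single statistic.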

You can keep using tapply, and simply modify the FUN argument to return multiple summary statistics. For example:
do.call(rbind, tapply(df$Value, df$Gene, FUN = function(x) c(max = max(x), min = min(x))))
# max min
#A 123 1
#B 6 3
#C 1 1
#D 234 3

Related

How to store replicate runs in a dataframe

I have a function (call it random_func) that generates random numbers according to some rules using parameters. I'm trying to repeatedly call that function and store the results in a dataframe.
df <- lapply(c(1,2,3,4,5), FUN = function(x) replicate(100, expr = random_func(n=10, param=x)))
Right now, the output is a list of 5 vectors each with 100 elements. What R voodoo do I need to do in order to get it to look something like:
param, result
1, 5
1, 6
1, 8
...
5, 10

set.seed(42)
do.call(rbind, #rbind results for different x together
lapply(c(1,2), FUN = function(x)
data.frame(param = x, #will be recycled
result = do.call(what = c, #concatenate results of replicate
replicate(n = 2,
expr = rnorm(n = 3, mean = x), #replace with random_func
simplify = FALSE))))) #when FALSE, replicate returns list
# param result
# 1 1 2.3709584
# 2 1 0.4353018
# 3 1 1.3631284
# 4 1 1.6328626
# 5 1 1.4042683
# 6 1 0.8938755
# 7 2 3.5115220
# 8 2 1.9053410
# 9 2 4.0184237
# 10 2 1.9372859
# 11 2 3.3048697
# 12 2 4.2866454
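A variant of the same idea lets replicate() simplify to a matrix and then flattens it with as.vector(). In this sketch, random_func is a hypothetical stand-in (the question does not show the real one), and the 100 runs of 10 draws mirror the call in the question.

```r
# Hypothetical stand-in for the asker's random_func
random_func <- function(n, param) rnorm(n, mean = param)

set.seed(42)
params <- 1:5
runs <- lapply(params, function(p) {
  # replicate() returns a 10 x 100 matrix here; as.vector() flattens it
  draws <- as.vector(replicate(100, random_func(n = 10, param = p)))
  data.frame(param = p, result = draws)  # param is recycled to match draws
})
out <- do.call(rbind, runs)  # stack the per-parameter data frames
str(out)
```

The result has one row per generated number, with the parameter that produced it in the first column.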

rerun and map_df solution from purrr
library(dplyr)
library(purrr)
Random function
random_func <- function(n, param) {
rnorm(n)+(param*10)
}
solution
myfun <- function() {
df <- 100 %>%
rerun(x=10, y=1:5) %>%
map_df(~data.frame(param=.x$y, result=random_func(n=.x$x, param=.x$y)))
}
Output
df <- myfun()
head(df)
param result
1 1 10.15325
2 2 19.52867
3 3 30.08218
4 4 40.06418
5 5 48.39804
6 1 11.00435
Additional validation
df %>%
group_by(param) %>%
summarise(mean = mean(result))
param mean
1 1 10.00634
2 2 20.03874
3 3 30.11093
4 4 40.06166
5 5 50.02632
Performance
library(microbenchmark)
microbenchmark(myfun())
expr min lq mean median uq max neval
myfun() 65.93166 66.80521 69.42349 68.5152 69.57185 90.77295 100

Return vector position in list r

I am trying to determine the vector where an element is coming from in a list I have created. I'll give a repeatable example here:
set.seed(101)
a <- runif(10, min=0, max=100)
b <- runif(10, min=0, max=100)
c <- runif(10, min=0, max=100)
d <- runif(10, min=0, max=100)
information <- list(a, b, c, d)
information.wanted <- do.call(pmax, information)
The code to get the information.wanted works just fine. What I am now trying to find is the individual vector in the list where each of the maximum values comes from. For example, value 1 in information.wanted (87.97...) comes from vector b in the information list. I would like to create another piece of code that gives the vector where the information.wanted comes from.
> information.wanted
[1] 87.97957 95.68375 73.19726 93.16344 92.33189 91.34787 82.04361 81.42830 62.20120
[10] 92.48044
I have no idea how to do this though. None of the code that I've tried has gotten me anywhere close.
position.of.information.wanted <- ??
I'm looking to get something like this. A numeric vector is fine. I can substitute the vector names in later.
> position.of.information.wanted
[1] 2 3 ...
Any help would be greatly appreciated. Thanks.

You need to apply which.max to each "i" index of each element in "information":
f1 = function(x)
sapply(seq_along(x[[1]]), function(i) which.max(sapply(x, "[[", i)))
f1(information)
# [1] 2 3 2 2 3 4 2 4 1 4
mapply already provides that kind of "parallel" functionality:
f2 = function(x)
unlist(.mapply(function(...) which.max(c(...)), x, NULL))
f2(information)
# [1] 2 3 2 2 3 4 2 4 1 4
Or, instead of concatenating "information" in chunks, convert it to a "matrix" at the start (as David Arenburg notes in the comments) and apply which.max to its rows:
f3a = function(x)
apply(do.call(cbind, x), 1, which.max)
f3a(information)
# [1] 2 3 2 2 3 4 2 4 1 4
or its columns:
f3b = function(x)
apply(do.call(rbind, x), 2, which.max)
f3b(information)
# [1] 2 3 2 2 3 4 2 4 1 4
Also, max.col is convenient for a "matrix":
f4 = function(x)
max.col(do.call(cbind, x), "first")
f4(information)
# [1] 2 3 2 2 3 4 2 4 1 4
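The "first" argument matters when a row contains ties: max.col breaks ties at random by default, whereas which.max always returns the first maximum. A small sketch on a made-up matrix (the values are chosen only to force ties):

```r
# Rows 1 and 2 each contain a tied maximum
m <- cbind(c(1, 5, 3),
           c(1, 2, 3),
           c(0, 5, 9))

by_maxcol <- max.col(m, ties.method = "first")  # first column holding the row max
by_apply  <- apply(m, 1, which.max)             # same rule, one row at a time

by_maxcol
by_apply
```

With ties.method = "first" the two approaches agree on this input; with the default tie handling the answer for rows 1 and 2 could differ between runs.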
In most other languages, a simple loop over the elements would provide both which.max and max. R can run the same loop over whole vectors instead:
f5 = function(x)
{
  ans = rep_len(1L, length(x[[1]]))
  maxs = x[[1]]
  for(i in seq_along(x)[-1]) { # loop is skipped when x holds a single vector
    wh = x[[i]] > maxs
    maxs[wh] = x[[i]][wh]
    ans[wh] = i
  }
  ans # or 'data.frame(i = ans, val = maxs)' for both
}
f5(information)
# [1] 2 3 2 2 3 4 2 4 1 4
It had to end with a benchmark:
set.seed(007)
dat = replicate(13, runif(1e4), FALSE)
identical(f1(dat), f2(dat))
#[1] TRUE
identical(f2(dat), f3a(dat))
#[1] TRUE
identical(f3a(dat), f3b(dat))
#[1] TRUE
identical(f3b(dat), f4(dat))
#[1] TRUE
identical(f4(dat), f5(dat))
#[1] TRUE
microbenchmark::microbenchmark(f1(dat), f2(dat), f3a(dat), f3b(dat), f4(dat), f5(dat), do.call(pmax, dat), times = 50)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# f1(dat) 274995.963 298662.210 339279.948 318937.172 350822.539 723673.972 50 d
# f2(dat) 94619.397 100079.205 114664.776 107479.127 114619.439 226733.260 50 c
# f3a(dat) 19767.925 23423.688 26382.919 25795.499 29215.839 40100.656 50 b
# f3b(dat) 20351.872 22829.997 28889.845 25090.446 30503.100 140311.058 50 b
# f4(dat) 975.102 1109.431 1546.571 1169.462 1361.733 8954.100 50 a
# f5(dat) 2427.665 2470.816 5299.386 2520.755 3197.793 112986.612 50 a
# do.call(pmax, dat) 1477.618 1530.166 1627.934 1551.046 1602.898 2814.295 50 a

add values that are the same within a vector

I have a vector
x <- c(1,2,5,4,3,1,1,4,2,6,7,2,4,1,5)
and I want to add the values that are the same, giving me the new vector
x <- c(4, 6, 3, 12, 10, 6, 7)
It sounds quite simple but I am stuck.

You can use sapply to iterate over the vector of the unique values, and then sum each one of the corresponding entries, like so:
> sapply(unique(x), function(i) sum(x[x == i]))
[1] 4 6 10 12 3 6 7
If the order matters, please indicate which order you want.
In this solution, the order is the same as the output of unique, which tells you which value each sum belongs to.
> unique(x)
[1] 1 2 5 4 3 6 7
Edit
It looks like you want the unique values in ascending order. In that case, you can do this:
> sapply(sort(unique(x)), function(i) sum(x[x == i]))
[1] 4 6 3 12 10 6 7

aggregate(x, list(number = x), FUN = sum )
# number x
#1 1 4
#2 2 6
#3 3 3
#4 4 12
#5 5 10
#6 6 6
#7 7 7
The result is a data.frame and you can extract the second column as usual.

Here's another option, for fun:
with(rle(sort(x)), lengths * values)
# [1] 4 6 3 12 10 6 7
Benchmarks
library(microbenchmark)
x <- c(1,2,5,4,3,1,1,4,2,6,7,2,4,1,5)
x <- rep(x, length.out=1000)
matthew <- function() with(rle(sort(x)), lengths * values)
iled <- function() sapply(sort(unique(x)), function(i) sum(x[x == i]))
kota <- function() as.numeric(table(x) * as.integer(names(table(x))))
deena <- function() {
freqTable = as.data.frame(table(x))
as.numeric(as.character(freqTable$x)) * freqTable$Freq
}
roland <- function() aggregate(x, list(number = x), FUN = sum )$x
microbenchmark(matthew(), iled(), kota(), deena(), roland())
# Unit: microseconds
# expr min lq mean median uq max neval
# matthew() 105.5 116.9 167.5 122.5 131.3 1466 100
# iled() 111.2 125.6 160.3 131.4 138.8 1449 100
# kota() 1821.5 1899.3 1960.4 1915.9 1940.7 3031 100
# deena() 1124.7 1175.6 1221.1 1187.9 1207.7 2700 100
# roland() 1912.2 1967.9 2116.6 1995.5 2078.5 3610 100

One way to do that would be to multiply each element by its frequency. The table function does a good job of that:
freqTable = as.data.frame(table(x))
requiredResult = as.numeric(as.character(freqTable$x)) * freqTable$Freq

You want to compute #n * n for each n, where #n is the number of occurrences of n.
This is just an alternative to the sapply approach above.
table(x) * as.integer(names(table(x)))
# x
# 1 2 3 4 5 6 7
# 4 6 3 12 10 6 7
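Since the task is a grouped sum where the grouping variable is the vector itself, tapply() and rowsum() also apply directly. A minimal sketch on the question's vector; neither call appears in the answers above, so treat it as one more option rather than a recommendation.

```r
x <- c(1, 2, 5, 4, 3, 1, 1, 4, 2, 6, 7, 2, 4, 1, 5)

# Group x by its own values and sum each group; groups come out in sorted order
sums_tapply <- as.vector(tapply(x, x, sum))

# rowsum() does the same grouped sum and returns a one-column matrix
sums_rowsum <- as.vector(rowsum(x, x))

sums_tapply
sums_rowsum
```

Both give 4 6 3 12 10 6 7, matching the vector requested in the question.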

R-ddply function over loop variable

I need to loop over a data frame and calculate functions over the variable that is being looped.
A table example:
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
I create a list of variables:
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
And the table that I will populate:
planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]
And the loop:
library(plyr)
for (i in 1:length(nums)){
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}
I need to get the min and max of the variable for each decile. But instead I get:
deciles min max pos
1 1 num1 num1 4
2 2 num2 num2 4
3 3 <NA> <NA> 2
4 4 <NA> <NA> 0
5 5 <NA> <NA> 0
6 1 num1 num1 4
7 2 num2 num2 4
8 3 <NA> <NA> 2
9 4 <NA> <NA> 0
10 5 <NA> <NA> 0
For variable num1 I need to get the result of:
ddply(table, 'deciles', summarize, min=min(num1),
max=max(num1),pos=sum(target))
deciles min max pos
1 5.736842 7.157895 0
2 7.631579 9.052632 0
3 1.000000 10.000000 2
4 1.947368 3.368421 4
5 3.842105 5.263158 4
And below the result of doing the same with num2.
I understand that I need to introduce the variable with the following form:
num1
but the code is writing
'num1'
I tried with:
min=min(as.name(nums[i]))
But I get an error:
Error in min(as.name(nums[i])) : invalid 'type' (symbol) of argument
How can I calculate a function over the variable that is being looped?

The gist of your question is applying a list of functions with the split-apply-combine method, so here is one way to do this in base R.
## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
table$deciles <- dplyr::ntile(table[, nums[1]], 5)  # ntile() comes from dplyr, not plyr
FUNS <- list(min = min, max = max, mean = mean)
## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))
## combine results
do.call('rbind', x)
# min max mean
# 1 1.000000 2.421053 1.710526
# 2 2.894737 4.315789 3.605263
# 3 4.789474 6.210526 5.500000
# 4 6.684211 8.105263 7.394737
# 5 8.578947 10.000000 9.289474
Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below
f <- function(num, data = table) {
FUNS <- list(min = min, max = max, mean = mean)
x <- tapply(data[, num], data[, 'deciles'], function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}
This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like
f('num1')
f('num2')
Or use a loop to get everything at once
lapply(c('num1','num2'), f)
# [[1]]
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# [[2]]
# deciles min max mean
# 1 1 20.00000 21.57895 20.78947
# 2 2 22.10526 23.68421 22.89474
# 3 3 24.21053 25.78947 25.00000
# 4 4 26.31579 27.89474 27.10526
# 5 5 28.42105 30.00000 29.21053
If you don't like lapply, you can Vectorize the function to make it a little easier:
Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))
Which you would more commonly use like this (SIMPLIFY = FALSE to retain the list structures)
v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
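The output in the question follows from min(nums[i]) being evaluated on the character string "num1" rather than on the column it names. Indexing the data frame with [[ fetches the values instead. The sketch below rebuilds the question's planF in base R under two assumptions: the data frame is called tab here to avoid masking base::table, and bucket() is a hypothetical stand-in for ntile() that cuts ranks into equal-sized groups.

```r
tab <- data.frame(num1 = seq(1, 10, len = 20), num2 = seq(20, 30, len = 20),
                  char1 = rep(c("a", "b"), each = 10),
                  target = rep(c(1, 0), each = 10))
nums <- setdiff(names(tab)[sapply(tab, is.numeric)], "target")

# Hypothetical stand-in for ntile(): equal-sized buckets based on rank
bucket <- function(x, n) ceiling(rank(x, ties.method = "first") * n / length(x))

planF <- do.call(rbind, lapply(nums, function(v) {
  col <- tab[[v]]       # the column's values, not the string "num1"
  grp <- bucket(col, 5)
  data.frame(variable = v,
             deciles  = sort(unique(grp)),
             min      = as.vector(tapply(col, grp, min)),
             max      = as.vector(tapply(col, grp, max)),
             pos      = as.vector(tapply(tab$target, grp, sum)))
}))
planF
```

The variable column is an addition so that the stacked rows for num1 and num2 stay distinguishable.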

I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _):
library(lazyeval)
library(dplyr)
# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10))
)
# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")
# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
lapply(
numericCols, function(x) {
# compute the quantiles
quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))
# create quantile membership
dfX[["quantile_membership"]] =
findInterval(dfX[[x]], vec = quantiles,
rightmost.closed = TRUE,
all.inside = TRUE)
# summarize variables by decile
dfX %>%
group_by(quantile_membership) %>%
summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
max = interp( ~ max(x_name), x_name = as.name(x)),
mean = interp( ~ mean(x_name), x_name = as.name(x)))
}),
numericCols
)
# inspect the output
liDFY[[numericCols[1]]]

Apply a correction factor to one column based on the value of a second column

Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4 <- Multiply A by 1
If B is greater than or equal to 4 and less than 6 <- Multiply A by 2
If B is greater than or equal to 6 <- Multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.

Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by @agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000

Similar to @Ferdinand's solution, but using transform:
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28

I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, {A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf) ) ]})
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.
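To see why the findInterval lookup works, it helps to print the intermediate index. The sketch below uses the question's example data; idx and factors are names introduced here for illustration.

```r
A <- c(1, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8, 7)
B <- c(4, 6, 7, 8, 2, 2, 2, 3, 8, 8, 7, 8)

# Intervals are closed on the left: B < 4 -> 1, 4 <= B < 6 -> 2, B >= 6 -> 3
idx <- findInterval(B, c(-Inf, 4, 6, Inf))
idx

# Use the interval number to pick the matching correction factor
factors <- c(1, 2, 4)[idx]
corrected <- A * factors
corrected
```

The corrected values match the newcol column produced by the transform answer above.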
