If I have this array:
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
result <- array(c(vector1,vector2,vector1,vector2,vector1,vector2,vector1,vector2,vector1,vector2,vector1,vector2),dim = c(3,3,12))
str(result)
num [1:3, 1:3, 1:12] 5 9 3 10 11 12 13 14 15 5 ...
What I need is to start from [,,4] and take 3 slices at a time, so I will have
[,,4] [,,5] [,,6] and then compute the mean for them
[,,7] [,,8] [,,9] and then compute the mean for them
[,,10] [,,11] [,,12] and then compute the mean for them
Here is one option. Subset 'result' with the indices along the third dimension, split the sequence of third-dimension indices into groups of 3 with a grouping variable created by gl, loop through those groups, subset the array slices for each index, Reduce by adding the corresponding elements, and divide by 3 to get the mean.
s1 <- result[,, 4:12]
i1 <- seq(dim(s1)[3])
out <- lapply(split(i1, as.integer(gl(length(i1), 3, length(i1)))),
              function(i) Reduce(`+`, lapply(i, function(i2) s1[,, i2]))/3)
If we want the output as an array
array(unlist(out), c(3, 3, 3))
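As a quick sanity check (a minimal sketch; blocks and check are just illustrative names), the same block means can be computed directly with apply and compared against out:
blocks <- split(4:12, rep(1:3, each = 3))
check <- array(sapply(blocks, function(idx) apply(result[, , idx], c(1, 2), mean)),
               c(3, 3, 3))
all.equal(check, array(unlist(out), c(3, 3, 3)))
#[1] TRUE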
Or melt it to 'long' format with the indices, then create the grouping variable and summarise to get the mean
library(tidyverse)
library(reshape2)
melt(result[, , 4:12]) %>%
group_by(Var1, Var2, grp = ((Var3-1) %/% 3 ) + 1) %>%
summarise(value = mean(value)) %>%
split(.$grp) %>%
map(~ .x %>%
select(-grp) %>%
spread(Var2, value) %>%
tibble::column_to_rownames('Var1')) %>%
unlist %>%
array(c(3, 3, 3))
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then, with this list of sub data frames, (s)apply a function that sums the second and third columns, adds those sums, and divides by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(.fn = sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
Below is a simplified version of a problem that involves transforming multiple input tables and joining the transformed output into a single table.
Three input tables are processed and summarized, yielding three output tables with identical x columns. x can thus be used as an index variable to combine the tables with left_join.
out_all is the desired final output table, with index column x and summary columns d, e, and f.
This code achieves the desired output, but it's not efficient for handling a large set of input tables.
What I hope to achieve, perhaps using purrr::map functions or a loop structure, is to iteratively join the new summary columns to the current version of the output table. Rather than pausing the workflow to save out the latest output, I want to feed that output back into the beginning of the loop so that it forms the LHS of the next version of itself, with a new summary column added on the RHS.
Thanks in advance for any help!
library(tidyverse)
in1 <- tribble(
~x, ~a,
1, 1,
1, 2,
1, 3,
2, 4,
3, 5
)
in2 <- tribble(
~x, ~b,
1, 1,
2, 2,
2, 3,
2, 4,
3, 5
)
in3 <- tribble(
~x, ~c,
1, 1,
2, 2,
3, 3,
3, 4,
3, 5
)
out1 <- in1 %>%
group_by(x) %>%
summarize(d = mean(a))
out2 <- in2 %>%
group_by(x) %>%
summarize(e = mean(b))
out12 <- left_join(out1, out2, by = 'x')
out3 <- in3 %>%
group_by(x) %>%
summarize(f = mean(c))
out_all <- left_join(out12, out3, by = 'x')
We get all the dataset objects into a list, use map to do the group-by mean on each list element, and then reduce to a single dataset.
library(tidyverse)
out <- mget(ls(pattern = "^in\\d+$")) %>%
map(~ .x %>%
group_by(x) %>%
summarise_if(is.numeric, mean)) %>%
reduce(left_join)
Also, if we want to name the columns differently
mget(ls(pattern = "^in\\d+$")) %>%
map2(., c("d", "e", "f"), ~
.x %>%
group_by(x) %>%
summarise(!! .y := mean(!! rlang::sym(names(.)[2])))) %>%
reduce(left_join)
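For the example inputs in the question, the renamed variant reproduces out_all and prints roughly as:
# A tibble: 3 x 4
#       x     d     e     f
#   <dbl> <dbl> <dbl> <dbl>
# 1     1     2     1     1
# 2     2     4     3     2
# 3     3     5     5     4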
Going back to your first comment: if your original input is a single wide table, why not gather the columns of interest and use dplyr to group_by and summarise them in a few steps, rather than creating various interim tables that aren't necessary for the desired output?
df <- data.frame(id=1:5,matrix(runif(n=26*5),ncol=26))
df %>% gather(k,v,-id) %>% group_by(id) %>%
summarise(m=mean(v))
# A tibble: 5 x 2
id m
<int> <dbl>
1 1 0.522
2 2 0.596
3 3 0.535
4 4 0.548
5 5 0.605
For a larger number of input tables it might be more efficient to first combine the tables and then summarize values instead of first summarizing values and then (recursively) combining the tables. Here is a small demonstration with 26 input tables (which are already simplified to a single list of tibbles):
library(microbenchmark)
library(purrr)
library(dplyr)
## data
in_dfs <- map(LETTERS, function(var) {
tibble(x = sample(1:3, 5, replace = TRUE), !!var := 1:5)
}) %>%
setNames(paste0("in", seq_along(LETTERS)))
## first combine then summarize
out_tidyverse1 <- function(input)
do.call(bind_rows, input) %>%
group_by(x) %>%
summarize_all(~mean(.x, na.rm = TRUE))
## first summarize then combine
out_tidyverse2 <- function(input)
map(input, ~ .x %>%
group_by(x) %>%
summarise_if(is.numeric, mean)) %>%
reduce(left_join)
microbenchmark(out_tidyverse1(in_dfs), out_tidyverse2(in_dfs))
#> Unit: microseconds
#>                    expr       min        lq     mean    median        uq       max neval cld
#>  out_tidyverse1(in_dfs)   891.425  1052.342  1356.81  1186.545  1345.594  6790.026   100  a
#>  out_tidyverse2(in_dfs) 20482.967 23807.713 26453.08 26144.013 28163.417 36884.574   100   b
NB: this would probably be quite a bit more efficient with data.table using e.g. data.table's rbindlist
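A minimal sketch of that data.table route, reusing the in_dfs list from above (out_dt is just an illustrative name):
library(data.table)
## stack all input tables into one table, then take the group-wise mean of every value column
out_dt <- rbindlist(in_dfs, fill = TRUE)[, lapply(.SD, mean, na.rm = TRUE), by = x]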
I need help with sample_n() in ‘dplyr’ in R:
I have a list of data frames (riskset[[1]], riskset[[2]], ..., riskset[[1000]]); each element riskset[[i]] of the list is a data frame of observations, and I divided the observations in each riskset into groups 1:4 based on the distribution of a variable. So the data in riskset[[i]] looks like this:
id sex grp ...
1 F 1 ...
2 M 3 ...
3 F 1 ...
4 M 4 ...
5 F 2 ...
6 F 3 ...
......................
I want to sample 2 observations from each grp within each riskset and save them as a list of sample. I used
sample<- list()
for(i in 1:1000){
sample[[i]] <- riskset[[i]] %>% group_by(grp) %>% sample_n(2,replace=F)
}
It gave me error:
size must be less or equal than 1 (size of data), set ‘replace = TRUE’ to use sampling with replacement.
I tried the code on risksets that have more than 2 obs in each grp and it worked, but it doesn't work on risksets that have fewer than 2 obs in some group. For a group that has fewer than 2 obs, I want all the obs it has; for a group that has more than 2 obs, I want to sample 2 obs without replacement. How can I achieve my sampling goal using R functions? Thanks in advance!
We can use map to loop over the list ('riskset') and, grouped by 'grp', apply sample_n with a size of pmin(n(), 2) so that a group with fewer than 2 rows keeps all of its rows
library(tidyverse)
out <- map(riskset, ~ .x %>%
             group_by(grp) %>%
             sample_n(pmin(n(), 2), replace = FALSE))
Or another option is slice
map(riskset, ~ .x %>%
      group_by(grp) %>%
      slice(if(n() < 2) 1 else sample(row_number(), 2)))
Or without using if/else
map(riskset, ~ .x %>%
      group_by(grp) %>%
      slice(sample(seq_len(n()), pmin(n(), 2))))
data
iris1 <- iris %>%
select(grp = Species, everything()) %>%
slice(c(1:5, 51))
riskset <- list(iris1, iris1)
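With this toy riskset, grp setosa has 5 rows and versicolor only 1, so each sampled element should end up with 3 rows (2 sampled setosa rows plus the single versicolor row). A quick check on out from the first option (a small sketch):
map_int(out, nrow)
# [1] 3 3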
I would like to find the closest value to column x3 below.
data=data.frame(x1=c(24,12,76),x2=c(15,30,20),x3=c(45,27,15))
data
x1 x2 x3
1 24 15 45
2 12 30 27
3 76 20 15
So desired output will be
Closest_Value_to_x3
24
30
20
Please help. Thank you
Use max.col(-abs(data[, 3] - data[, -3])) to find the column positions of the closest values, then use that result together with the row numbers as a matrix index (built with cbind) to extract the desired values from your data.
col <- 3
data[, -col][cbind(1:nrow(data),
max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20
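To make the matrix-indexing step concrete for this data (a small illustration; idx is just a temporary name):
idx <- cbind(1:nrow(data), max.col(-abs(data[, col] - data[, -col])))
idx
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    3    2
data[, -col][idx]
#[1] 24 30 20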
A tidyverse solution:
library(tidyverse)
data %>%
rowid_to_column() %>%
gather(var, val, -c(x3, rowid)) %>%
mutate(temp = x3 - val) %>%
group_by(rowid) %>%
filter(abs(temp) == min(abs(temp))) %>%
ungroup() %>%
select(val)
val
<dbl>
1 24
2 30
3 20
First, it adds a row ID. Second, it transforms the data from wide to long. Third, it calculates the difference between "x3" and the other variables. Finally, it groups by the row ID and keeps the rows where the absolute difference is the smallest.
Or:
data %>%
rowid_to_column() %>%
gather(var, val, -c(x3, rowid)) %>%
mutate(temp = x3 - val) %>%
group_by(rowid) %>%
filter(abs(temp) == min(abs(temp))) %>%
ungroup() %>%
pull(val)
[1] 24 30 20
Or using an approach originally proposed by #markus (it assumes that your columns are named "x"):
data %>%
mutate(temp = paste0("x", max.col(-abs(.[, -3] - .[, 3])))) %>%
rowwise() %>%
summarise(val = eval(as.symbol(temp)))
val
<dbl>
1 24.
2 30.
3 20.
First, it determines the column index of the variable whose absolute difference from "x3" is the smallest and combines it with "x". Then it evaluates the combination of "x" and the column index as a variable and returns the appropriate value.
Also borrowing the idea from #markus (not assuming that your columns are named "x"):
data %>%
mutate(temp = max.col(-abs(.[, -3] - .[, 3]))) %>%
rowwise %>%
mutate(temp = names(.)[[temp]]) %>%
summarise(val = eval(as.symbol(temp)))
First, it determines the column index of the variable whose absolute difference from "x3" is the smallest. Second, it returns the column name based on that index. Finally, it evaluates it as a variable and returns the appropriate value.
Or a variant where you can reference the "x3" variable by its name and not by column index (the basic idea still from #markus):
data %>%
mutate(temp = max.col(-abs(.[, !grepl("x3", colnames(.))] - .[, grepl("x3", colnames(.))]))) %>%
rowwise %>%
mutate(temp = names(.)[[temp]]) %>%
summarise(val = eval(as.symbol(temp)))
Here is another approach using matrixStats
x <- as.matrix(data[,-3L])
y <- abs(x - .subset2(data, 3L))
x[matrixStats::rowMins(y) == y]
# [1] 24 30 20
Or in base R using vapply
x <- as.matrix(data[,-3L])
y <- abs(x - .subset2(data, 3L))
vapply(1:nrow(data),
function(k) x[k,][which.min(y[k,])],
numeric(1))
# [1] 24 30 20
Define a function closest_to_3 that operates on a vector and returns the value in the vector that's closest to the third member:
closest_to_3 <- function(v) v[-3][which.min(abs( v[-3]-v[3] ))]
(The idiom v[-3] deletes the 3rd member from v.) Then apply this function to each row of your data frame:
apply(data, 1, closest_to_3)
#[1] 24 30 20
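If the reference column is not necessarily the third one, the same idea generalizes (a small sketch; closest_to_col is a hypothetical name):
closest_to_col <- function(v, col = 3) v[-col][which.min(abs(v[-col] - v[col]))]
apply(data, 1, closest_to_col, col = 3)
#[1] 24 30 20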
Suppose I have the tibble dat below. What I would like to do is calculate the maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I've tried many things and nothing works as I wanted it to; I'm still struggling with the string vs column name thing.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The function has two arguments: the first, dat, is a data frame in the same format as the one the OP provided; the second, group, is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
  # Get all column names with starting string "group"
  col_names <- colnames(dat)
  dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
  # Get the maximum values from all columns except the first column
  max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
  # Calculate max_value minus the values from the first column
  final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
  return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
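If you want all groups at once in a single object, the function can be applied over the group prefixes (a small usage sketch on top of max_minus_first; it assumes the prefixes are just "a" and "b"):
sapply(c("a", "b"), function(g) max_minus_first(dat, g))
#      a b
# [1,] 3 5
# [2,] 2 4
# [3,] 3 5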
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package, to convert the data frame from wide format to long format. I used gather twice because the first column of each group is different from the other columns in the same group. ends_with(" 1") is a select helper function from dplyr, which selects columns whose names end in " 1". Notice that the space in " 1" is important, because "1" alone may select other columns like a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) keeps only the rows where the value column and the first column belong to the same group.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, similar to Solution 2. It uses do to apply the operation to each Group, which makes the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(data_frame(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
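A quick cross-check that the base R function and the tidyverse output agree (a small sketch reusing the objects defined above):
all.equal(max_minus_first(dat, "a"), dat2$a)
# [1] TRUE
all.equal(max_minus_first(dat, "b"), dat2$b)
# [1] TRUE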