Closest value to a specific column in R

For each row of the data below, I would like to find the value (among the other columns) that is closest to column x3.
data=data.frame(x1=c(24,12,76),x2=c(15,30,20),x3=c(45,27,15))
data
x1 x2 x3
1 24 15 45
2 12 30 27
3 76 20 15
So the desired output will be:
Closest_Value_to_x3
24
30
20
Please help. Thank you

Use max.col(-abs(data[, 3] - data[, -3])) to find the column positions of the closest values, then use this result as part of an indexing matrix (built with cbind) to extract the desired values from your data:
col <- 3
data[, -col][cbind(1:nrow(data),
                   max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20
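The same idea can be wrapped in a small helper so the target column can be chosen by name or position (a sketch; closest_to_col is a hypothetical name, not from the original answer):
closest_to_col <- function(df, col) {
  # resolve a column name to its position if needed
  idx <- if (is.character(col)) match(col, names(df)) else col
  df[, -idx][cbind(seq_len(nrow(df)),
                   max.col(-abs(df[, idx] - df[, -idx])))]
}
closest_to_col(data, "x3")
#[1] 24 30 20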

A tidyverse solution:
library(tidyverse)

data %>%
  rowid_to_column() %>%
  gather(var, val, -c(x3, rowid)) %>%
  mutate(temp = x3 - val) %>%
  group_by(rowid) %>%
  filter(abs(temp) == min(abs(temp))) %>%
  ungroup() %>%
  select(val)
# A tibble: 3 x 1
    val
  <dbl>
1    24
2    30
3    20
First, it adds a row ID. Second, it transforms the data from wide to long. Third, it calculates the difference between "x3" and the other variables. Finally, it groups by the row ID and keeps the rows where the absolute difference is the smallest.
Or:
data %>%
  rowid_to_column() %>%
  gather(var, val, -c(x3, rowid)) %>%
  mutate(temp = x3 - val) %>%
  group_by(rowid) %>%
  filter(abs(temp) == min(abs(temp))) %>%
  ungroup() %>%
  pull(val)
[1] 24 30 20
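The same logic can also be written with pivot_longer, which supersedes gather in tidyr 1.0 and later (a sketch under that assumption):
data %>%
  rowid_to_column() %>%
  pivot_longer(-c(x3, rowid), names_to = "var", values_to = "val") %>%
  group_by(rowid) %>%
  filter(abs(x3 - val) == min(abs(x3 - val))) %>%
  ungroup() %>%
  pull(val)
#[1] 24 30 20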
Or using an approach originally proposed by @markus (it assumes that your column names start with "x"):
data %>%
  mutate(temp = paste0("x", max.col(-abs(.[, -3] - .[, 3])))) %>%
  rowwise() %>%
  summarise(val = eval(as.symbol(temp)))
# A tibble: 3 x 1
    val
  <dbl>
1   24.
2   30.
3   20.
First, it determines the column index of the variable whose absolute difference from "x3" is smallest and combines it with "x". Then, it evaluates that combination of "x" and column index as a variable and returns the corresponding value.
Also borrowing the idea from @markus (not assuming that your column names start with "x"):
data %>%
  mutate(temp = max.col(-abs(.[, -3] - .[, 3]))) %>%
  rowwise() %>%
  mutate(temp = names(.)[[temp]]) %>%
  summarise(val = eval(as.symbol(temp)))
First, it determines the column index of the variable whose absolute difference from "x3" is smallest. Second, it returns the column name based on that index. Finally, it evaluates the name as a variable and returns the corresponding value.
Or a variant where you can reference the "x3" variable by its name and not by column index (the basic idea still from @markus):
data %>%
  mutate(temp = max.col(-abs(.[, !grepl("x3", colnames(.))] - .[, grepl("x3", colnames(.))]))) %>%
  rowwise() %>%
  mutate(temp = names(.)[[temp]]) %>%
  summarise(val = eval(as.symbol(temp)))

Here is another approach using matrixStats
x <- as.matrix(data[,-3L])
y <- abs(x - .subset2(data, 3L))
x[matrixStats::rowMins(y) == y]
# [1] 24 30 20
Or in base R using vapply:
x <- as.matrix(data[,-3L])
y <- abs(x - .subset2(data, 3L))
vapply(1:nrow(data),
       function(k) x[k, ][which.min(y[k, ])],
       numeric(1))
# [1] 24 30 20

Define a function closest_to_3 that operates on a vector and returns the value in the vector that's closest to the third member:
closest_to_3 <- function(v) v[-3][which.min(abs( v[-3]-v[3] ))]
(The idiom v[-3] deletes the 3rd member from v.) Then apply this function to each row of your data frame:
apply(data, 1, closest_to_3)
#[1] 24 30 20
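The helper generalises so the target column is a parameter rather than hard-coded (a sketch; closest_to_k is a hypothetical name):
# k is the index of the target column
closest_to_k <- function(v, k) v[-k][which.min(abs(v[-k] - v[k]))]
apply(data, 1, closest_to_k, k = 3)
#[1] 24 30 20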

Related

Sum duplicated columns in dataframe in R

Hello, I have the following dataframe:
colnames(tv_viewing_time) <- c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives the viewing time for an individual on channel 1 and channel 2. For instance, for individual 1 I get:
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is actually a dataframe that sums up the values of duplicated columns.
I.e., I would get:
colnames(tv_viewing_time) <- c("channel_1", "channel_2")
Where, for instance, for individual 1 I would get:
tv_viewing_time[1,] <- c(5,7)
All row entries are summed when they correspond to duplicated column names.
I have looked for an answer, but the solutions suggested in other threads did not work for my dataframe.
Note that there are many more duplicated columns, so I am looking for a solution that can be efficiently applied to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
                     sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
# output
# channel_1 channel_2
#         5         7
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
  pivot_longer(cols = everything()) %>%
  group_by(name = str_remove(name, "\\.\\d+$")) %>%
  summarise(value = sum(value)) %>%
  pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
#   channel_1 channel_2
#       <dbl>     <dbl>
# 1         5         7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
                              channel_1 = 4, channel_2 = 5)
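Another base R option under the same naming assumption (data.frame renames the duplicates to channel_1.1, channel_2.1, ...) is to transpose and use rowsum; a sketch:
grp <- sub("\\.\\d+$", "", names(tv_viewing_time))
t(rowsum(t(as.matrix(tv_viewing_time)), group = grp))
#  channel_1 channel_2
#1         5         7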

Creating multiple new columns in a dataframe in an iterative way

-- Small edit made to the test data: columns are no longer grouped by round but by Team, as is the case in the real dataset.
I have tried a variety of methods for creating multiple new columns, while minimising the repetition in the code. My initial and successful method requires excessive copy and pasting, but I would like to minimise this as much as possible. Below is example data for the problem:
df <- tribble(~R1TeamX, ~R2TeamX, ~R3TeamX, ~R1TeamY, ~R2TeamY, ~R3TeamY,
              10, 11, 12, 15, 19, 20,
              11, 13, 14, 25, 18, 15)
This example data is for three rounds with the scores for both team X and team Y. I am looking to create additional columns, finding the difference between the scores of Team X and Y. The real dataset has upwards of 30 rounds.
My initial solution used mutate and works as follows:
df <- df %>%
  mutate(R1Diff = R1TeamX - R1TeamY,
         R2Diff = R2TeamX - R2TeamY,
         R3Diff = R3TeamX - R3TeamY)
While this does the job, it is not scalable. I have attempted to reduce this down to less code using str_c & mutate, but cannot identify the correct looping method to make this work for several lines of code. Below is my attempt at standardising the code so far:
teamx <- str_c("R", 1:3, "TeamX")
teamy <- str_c("R", 1:3, "TeamY")
round_diff <- str_c("R", 1:3, "Diff")
df <- df %>%
  mutate(!!round_diff[1] := UQ(parse_quo(teamx[1], global_env())) - UQ(parse_quo(teamy[1], global_env())),
         !!round_diff[2] := UQ(parse_quo(teamx[2], global_env())) - UQ(parse_quo(teamy[2], global_env())),
         !!round_diff[3] := UQ(parse_quo(teamx[3], global_env())) - UQ(parse_quo(teamy[3], global_env())))
While additional code is required, this standardises my input to some degree, reducing some of the legwork, but I know there must be some way to reduce this to a single line. I have explored mutate_at and for loops to no avail. I suspect this problem could also be tackled with purrr::map, but I do not have enough ability in this area to identify the correct approach.
Any help would be greatly appreciated.
Whilst it's possible to do this with dplyr and tidyr functions, remember you still have some useful base R options open to you. This method uses lapply and assumes that your columns alternate between team X and team Y:
seq(length(df) / 2) %>%
  lapply(function(x) df[[2 * x - 1]] - df[[2 * x]]) %>%
  as.data.frame() %>%
  setNames(paste0("R", seq(length(df) / 2), "Diff")) %>%
  cbind(df, .)
#>   R1TeamX R1TeamY R2TeamX R2TeamY R3TeamX R3TeamY R1Diff R2Diff R3Diff
#> 1      10      11      12      15      19      20     -1     -3     -1
#> 2      11      13      14      25      18      15     -2    -11      3
Try it this way:
library(tidyverse)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    names_to = c("set", ".value"),
    names_pattern = "(R\\d+Team)(X|Y)"
  ) %>%
  mutate(Diff = X - Y) %>%
  pivot_longer(-c(id, set)) %>%
  pivot_wider(id, names_from = c(set, name), values_from = value, names_sep = "")
Here's a solution that I believe is robust with respect to the number of rounds, the number of opponents of Team X and the order in which the results are stored.
First, make the data tidy: remove information about Teams and Rounds from column names.
newDF <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    names_to = c("Round", "Team"),
    names_pattern = "R(\\d+)Team(X|Y)",
    values_to = "Score"
  )
Now calculate the differences in scores
newDF %>%
  # Calculate difference in scores
  mutate(Team = ifelse(Team == "X", Team, "Opponent")) %>%
  pivot_wider(values_from = Score, names_from = Team) %>%
  mutate(Diff = X - Opponent) %>%
  select(-Opponent) %>%
  # Bring in identity of opponent
  left_join(
    newDF %>%
      filter(Team != "X") %>%
      select(-Score) %>%
      rename(Opposition = Team),
    by = c("id", "Round")
  )
Giving
# A tibble: 6 x 5
id Round X Diff Opposition
<int> <chr> <dbl> <dbl> <chr>
1 1 1 10 -5 Y
2 1 2 11 -8 Y
3 1 3 12 -8 Y
4 2 1 11 -14 Y
5 2 2 13 -5 Y
6 2 3 14 -1 Y
based on OP's revised input data.
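For completeness, a purrr-based sketch along the lines the OP hinted at; it assumes every RxTeamX column has a matching RxTeamY column:
library(dplyr)
library(purrr)
library(stringr)

# derive the round prefixes ("R1", "R2", ...) from the TeamX column names
rounds <- str_remove(str_subset(names(df), "TeamX$"), "TeamX")
diffs  <- map_dfc(set_names(rounds, paste0(rounds, "Diff")),
                  ~ df[[paste0(.x, "TeamX")]] - df[[paste0(.x, "TeamY")]])
bind_cols(df, diffs)
This scales to any number of rounds because the round prefixes are read from the data rather than typed out.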

How to start at a certain number in an array in R?

If I have this array:
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
result <- array(c(vector1, vector2, vector1, vector2, vector1, vector2,
                  vector1, vector2, vector1, vector2, vector1, vector2),
                dim = c(3, 3, 12))
str(result)
num [1:3, 1:3, 1:12] 5 9 3 10 11 12 13 14 15 5 ...
What I need is to start from [,,4] and count in groups of 3, so I will have:
[,,4] [,,5] [,,6] and then compute the mean for them
[,,7] [,,8] [,,9] and then compute the mean for them
[,,10] [,,11] [,,12] and then compute the mean for them
Here is one option. Subset 'result' with the indices along the third dimension, split the sequence of the third dimension with a grouping variable created with gl, loop through the indices, subset the array elements for each index, Reduce by summing the corresponding elements, and divide by 3 to get the mean:
s1 <- result[, , 4:12]
i1 <- seq(dim(s1)[3])
out <- lapply(split(i1, as.integer(gl(length(i1), 3, length(i1)))),
              function(i) Reduce(`+`, lapply(i, function(i2) s1[, , i2])) / 3)
If we want the output as an array:
array(unlist(out), c(3, 3, 3))
Or melt it to 'long' format with the indices, create the grouping variable, and summarise to get the mean:
library(tidyverse)
library(reshape2)
melt(result[, , 4:12]) %>%
  group_by(Var1, Var2, grp = ((Var3 - 1) %/% 3) + 1) %>%
  summarise(value = mean(value)) %>%
  split(.$grp) %>%
  map(~ .x %>%
        select(-grp) %>%
        spread(Var2, value) %>%
        tibble::column_to_rownames('Var1')) %>%
  unlist %>%
  array(c(3, 3, 3))
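A compact base R alternative (a sketch) is to reshape the nine selected slices into a fourth dimension of length 3 and average over the third:
# slices 4:12 as a 3 x 3 x 3 x 3 array: the 4th index is the group of three
a4 <- array(result[, , 4:12], dim = c(3, 3, 3, 3))
apply(a4, c(1, 2, 4), mean)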

Run a t-test/ANOVA for each row with 2 groups of 3 samples in R

My dataset looks something like this:
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"))
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic on this; however, I couldn't find a viable solution for my problem.
PS: my real dataset has about 35,000 rows (maybe it needs a different solution than one for only 4 rows).
After selecting the columns of interest, use pmap to apply the t.test on each row, taking the first 3 and the next 3 observations as the two inputs to t.test, and bind the extracted p-value as another column in the original data:
library(tidyverse)
df %>%
  select(AC.1:AM.3) %>%
  pmap_dbl(~ c(...) %>%
             {t.test(.[1:3], .[4:6])$p.value}) %>%
  bind_cols(df, pval_AC_AM = .)
Or, after selecting the columns, gather to convert to 'long' format, separate and spread, apply the t.test in summarise, and join with the original data:
df %>%
  select(compound, AC.1:AM.3) %>%
  gather(key, val, -compound) %>%
  separate(key, into = c('key1', 'key2')) %>%
  spread(key1, val) %>%
  group_by(compound) %>%
  summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
  right_join(df)
Update
If there are cases where a group has only a single unique value, then t.test throws an error. One option is to run the t.test and return NA for those cases. This can be done with possibly:
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA)
df %>%
  select(AC.1:AM.3) %>%
  pmap_dbl(~ c(...) %>%
             {posttest(.[1:3], .[4:6])}) %>%
  bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
[1] 0.67667626 0.39501003 0.26678161 0.01237438
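For a dataset with ~35,000 rows, a plain base R loop over rows also works (a sketch, assuming the AC columns are 2:4 and the AM columns are 5:7); note it will error on constant rows unless wrapped like the possibly() example above:
# one p-value per row, comparing the first 3 values against the next 3
apply(df[, 2:7], 1, function(v) t.test(v[1:3], v[4:6])$p.value)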

Issues calculating rowwise maximum

Suppose I have the tibble dat below. What I would like to do is calculate the maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I have tried many things, but they don't seem to work as I want; I'm still struggling with the string vs column name thing.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
              `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
  # Get all column names with starting string "group"
  col_names <- colnames(dat)
  dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
  # Get the maximum values from all columns except the first column
  max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
  # Calculate max_value minus the values from the first column
  final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
  return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
  rowid_to_column() %>%
  gather(Column, Value, -rowid, -ends_with(" 1")) %>%
  separate(Column, into = c("Group", "Column_Number")) %>%
  gather(Column_1, Value_1, ends_with(" 1")) %>%
  separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
  filter(Group == Group_1) %>%
  group_by(rowid, Group, Value_1) %>%
  summarise(Value = max(Value, na.rm = TRUE)) %>%
  mutate(Final = Value - Value_1) %>%
  ungroup() %>%
  select(-starts_with("Value")) %>%
  spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package to convert the data frame from wide format to long format. I used gather twice because the first column of each group is different from the other columns in the same group. ends_with(" 1") is a select helper function from dplyr, which selects columns with names ending in " 1". Notice that the space in " 1" is important because "1" alone may select other columns, like a 11, if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, similar to Solution 2. It uses do to apply the operation to each Group, making the code more concise.
dat2 <- dat %>%
  rowid_to_column() %>%
  gather(Column, Value, -rowid) %>%
  separate(Column, into = c("Group", "Column_Number")) %>%
  group_by(rowid, Group) %>%
  do(data_frame(Max = max(.$Value[.$Column_Number != 1]),
                First = .$Value[.$Column_Number == 1])) %>%
  mutate(Final = Max - First) %>%
  select(-Max, -First) %>%
  spread(Group, Final) %>%
  ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
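As a more recent alternative (not part of the original answers): with dplyr 1.0 or later, rowwise() together with c_across() gives a compact variant. A sketch, assuming the single-space column names of dat above and keeping the foo name from the question:
library(dplyr)

foo <- function(x = "a") {
  dat %>%
    rowwise() %>%
    # row maximum of the "x 2", "x 3", ... columns minus the "x 1" column
    mutate(out = max(c_across(starts_with(paste0(x, " ")) & !ends_with(" 1"))) -
                 .data[[paste0(x, " 1")]]) %>%
    pull(out)
}

foo("a")
# [1] 3 2 3
foo("b")
# [1] 5 4 5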
