Calculate maximum distance between groups of points in data frame - r

My data frame looks like this:
Time, Value, Group
0, 1.0, A
1, 2.0, A
2, 3.0, A
0, 4.0, B
1, 6.0, B
2, 6.0, B
0, 7.0, C
1, 7.0, C
2, 9.0, C
For each pair of groups (A, B), (A, C), (B, C), I need to find the maximum absolute difference in Value across the corresponding Time points.
For example, A and B differ most at t=1, where the difference is 6 (B) - 2 (A) = 4.
The full output should be something like this:
combination,time,distance
AB, 1, 4
AC, 0, 6
BC, 0, 3

One way in base R using combn:
do.call(rbind, combn(unique(df$Group), 2, function(x) {
df1 <- subset(df, Group == x[1])
df2 <- subset(df, Group == x[2])
df3 <- merge(df1, df2, by = 'Time')
value <- abs(df3$Value.x - df3$Value.y)
data.frame(combn = paste(x, collapse = ''),
time = df3$Time[which.max(value)],
max_difference = max(value))
}, simplify = FALSE))
# combn time max_difference
#1 AB 1 4
#2 AC 0 8
#3 BC 0 5
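We create all combinations of unique Group values, subset the data for each pair and merge the two subsets on Time. We then take the absolute difference of the corresponding Value columns and return the largest difference together with the Time at which it occurs. Note that in the data below all three rows of group C have Time 0, unlike the data in the question, which is why AC and BC come out as 8 and 5 rather than 6 and 3.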
data
df <- structure(list(Time = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 0L, 0L),
Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9), Group = c("A", "A",
"A", "B", "B", "B", "C", "C", "C")),
class = "data.frame", row.names = c(NA, -9L))
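For comparison, the same combn approach can be run on the data exactly as posted in the question, where group C has Time 0, 1 and 2. This is only a sketch: df_q and max_diff are names introduced here for illustration and are not part of the original answer.

```r
# Data as posted in the question (assumption: group C has Time 0, 1, 2)
df_q <- data.frame(
  Time  = rep(0:2, times = 3),
  Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9),
  Group = rep(c("A", "B", "C"), each = 3)
)

max_diff <- do.call(rbind, combn(unique(df_q$Group), 2, function(x) {
  # pair up the two groups on matching Time points
  both <- merge(subset(df_q, Group == x[1]),
                subset(df_q, Group == x[2]), by = "Time")
  d <- abs(both$Value.x - both$Value.y)
  data.frame(combn = paste(x, collapse = ""),
             time = both$Time[which.max(d)],  # first Time at which the maximum occurs
             max_difference = max(d))
}, simplify = FALSE))

max_diff
#   combn time max_difference
# 1    AB    1              4
# 2    AC    0              6
# 3    BC    0              3
```

When the maximum is reached at more than one Time (as for AC and BC here), which.max reports only the first one.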

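One dplyr option, which returns the distance for every Time and pair of groups rather than only the maximum per pair, could be:
library(dplyr)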
df %>%
inner_join(df, by = "Time") %>%
filter(Group.x != Group.y) %>%
group_by(Time,
Group = paste(pmax(Group.x, Group.y), pmin(Group.x, Group.y), sep = "-")) %>%
summarise(Max_Distance = abs(max(Value.x[Group.x == first(Group.x)]) - max(Value.y[Group.y == first(Group.y)])))
Time Group Max_Distance
<int> <chr> <dbl>
1 0 B-A 3
2 0 C-A 8
3 0 C-B 5
4 1 B-A 4
5 2 B-A 3

Related

Subset dataframe based on levels

I have the following data frame in R. I want to subset it based on three criteria, applied to each unique value of x within each level of id:
If there is only 1 row for that value of x, keep that row
If x has two rows with the same value of z but different values of y, keep the row where y is not 1.3
If x has three values of z, keep the two rows where y is not 1.3
id x y z
a 1 0.2 100
a 2 1 200
a 2 1.3 200
b 1 0.5 400
b 1 1 500
b 1 1.3 600
The solution would look like this:
id x y z
a 1 0.2 100
a 2 1 200
b 1 0.5 400
b 1 1 500
Any help would be appreciated
We can group by 'id' and 'x', then keep a row if it is the only one in its group or if its y is not 1.3:
library(dplyr)
df1 %>%
group_by(id, x) %>%
filter(n() == 1|(n() > 1 & y != 1.3))
data
df1 <- structure(list(id = c("a", "a", "a", "b", "b", "b"), x = c(1L,
2L, 2L, 1L, 1L, 1L), y = c(0.2, 1, 1.3, 0.5, 1, 1.3), z = c(100L,
200L, 200L, 400L, 500L, 600L)), class = "data.frame", row.names = c(NA,
-6L))
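The same filter can be sketched in base R with ave, which repeats each group's size on every row of that group. The names grp_size and kept are introduced here for illustration only.

```r
df1 <- data.frame(
  id = c("a", "a", "a", "b", "b", "b"),
  x  = c(1L, 2L, 2L, 1L, 1L, 1L),
  y  = c(0.2, 1, 1.3, 0.5, 1, 1.3),
  z  = c(100L, 200L, 200L, 400L, 500L, 600L)
)

# number of rows in each id/x group, repeated for every row of that group
grp_size <- ave(seq_len(nrow(df1)), df1$id, df1$x, FUN = length)

# keep single-row groups; in larger groups drop the rows where y is 1.3
kept <- df1[grp_size == 1 | df1$y != 1.3, ]
kept
```

This should keep rows 1, 2, 4 and 5, matching the result requested in the question.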

How to create a dataframe in R with a column calculation that references its own value in the prior row?

I am trying to use R to calculate sales as a function of inventory as a function of sales. See the data snapshot below. Is there any way to calculate this?
Group, Day and Build are independent variables
Sales = lag(Sales,1) * Build
I am given this data frame:
Group <- c("A","A","A","A","A","B","B","B","B","B")
Day <- c(1,2,3,4,5,1,2,3,4,5)
Build <- c(1.5,2,.3,.5,.6,1.2,.9,1.2,1.2,.4)
Sales <- c(50000,NA,NA,NA,NA,20000,NA,NA,NA,NA)
Trying to populate this data frame:
Group <- c("A","A","A","A","A","B","B","B","B","B")
Day <- c(1,2,3,4,5,1,2,3,4,5)
Build <- c(1.5,2,.3,.5,.6,1.2,.9,1.2,1.2,.4)
Sales <- c(50000,100000,30000,15000,9000,20000,18000,21600,25920,10368)
We can also do this with accumulate from purrr
library(dplyr)
library(purrr)
df1 %>%
group_by(Group) %>%
mutate(Sales = accumulate(Build[-1], ~ .y * .x, .init = first(Sales)))
# A tibble: 10 x 4
# Groups: Group [2]
# Group Day Build Sales
# <fct> <dbl> <dbl> <dbl>
# 1 A 1 1.5 50000
# 2 A 2 2 100000
# 3 A 3 0.3 30000
# 4 A 4 0.5 15000
# 5 A 5 0.6 9000
# 6 B 1 1.2 20000
# 7 B 2 0.9 18000
# 8 B 3 1.2 21600
# 9 B 4 1.2 25920
#10 B 5 0.4 10368
Or using base R with by and Reduce
df1$Sales <- do.call(c, by(df1[3:4], df1$Group, FUN =
function(dat) Reduce(function(x, y) x * y,
dat$Build[-1], init = dat$Sales[1], accumulate = TRUE)))
data
df1 <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), Day = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5), Build = c(1.5, 2, 0.3, 0.5, 0.6,
1.2, 0.9, 1.2, 1.2, 0.4), Sales = c(50000, NA, NA, NA, NA, 20000,
NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -10L
))
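Because each day's Sales is the previous day's Sales times Build, the recursion unrolls into the first Sales value times a running product of Build. A minimal base R sketch of that idea follows; it assumes the rows are already sorted by Day within each Group, and mult and start are names introduced here for illustration.

```r
df1 <- data.frame(
  Group = rep(c("A", "B"), each = 5),
  Day   = rep(1:5, times = 2),
  Build = c(1.5, 2, 0.3, 0.5, 0.6, 1.2, 0.9, 1.2, 1.2, 0.4),
  Sales = c(50000, NA, NA, NA, NA, 20000, NA, NA, NA, NA)
)

# multiplier relative to day 1: 1 on the first row, then the running product of Build
mult <- ave(df1$Build, df1$Group, FUN = function(b) cumprod(c(1, b[-1])))

# the known day-1 Sales value of each group, repeated on every row of the group
start <- ave(df1$Sales, df1$Group, FUN = function(s) s[1])

df1$Sales <- start * mult
df1
```

The result is built from floating-point products, so compare it with all.equal rather than == when checking it against the expected values.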

Select or subset variables whose column sums are not zero

I want to select or subset variables in a data frame whose column sum is not zero, while also keeping the other factor variables. It should be fairly simple, but I cannot figure out how to run the select_if() function on a subset of variables using dplyr:
df <- data.frame(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
require(dplyr)
df %>%
select_if(funs(sum(.) > 0))
#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) :
# ‘sum’ not meaningful for factors
Then I tried to only select B, C, D and this works, but I won't have variable A:
df %>%
select(-A) %>%
select_if(funs(sum(.) > 0)) -> df2
df2
# C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5
I could simply do cbind(A = df$A, df2) but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).
Trying to subset variables B, C, D in the sum() function doesn't work either:
df %>%
select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows
Try this:
df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
# A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5
The rationale is that for || if the left-side is TRUE, the right-side won't be evaluated.
Note:
The second argument of select_if should be a function name or a formula (lambda function). The ~ is necessary to tell select_if that !is.numeric(.) || sum(.) != 0 should be converted to a function.
As commented below by @zx8754, is.factor(.) should be used if one only wants to keep factor columns.
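The short-circuit behaviour of || described above can be checked without dplyr. In this sketch, keep_col and keep are hypothetical names introduced for illustration: sum() is never called on the non-numeric column A, so no error is raised.

```r
df <- data.frame(
  A = c("a", "a", "b", "c", "c", "d"),
  B = c(0, 0, 0, 0, 0, 0),
  C = c(3, 0, 0, 1, 1, 2),
  D = c(0, 3, 2, 1, 4, 5)
)

# TRUE for non-numeric columns (right-hand side skipped) and for numeric
# columns whose sum is not zero
keep_col <- function(col) !is.numeric(col) || sum(col) != 0

keep <- vapply(df, keep_col, logical(1))
keep        # A = TRUE, B = FALSE, C = TRUE, D = TRUE
df[keep]    # columns A, C and D
```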
Edit: a base R solution
cols <- c('B', 'C', 'D')
cols.to.keep <- cols[colSums(df[cols]) != 0]
df[!names(df) %in% cols | names(df) %in% cols.to.keep]
Here is an update for everyone who wants to use dplyr 1.0.0 or later, where the scoped variants (like select_if, as nicely shown by @mt1022) are superseded in favour of select(where(...)):
df %>%
select(where(is.numeric)) %>%
select(where(~sum(.) != 0))
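To compress the two select statements into one, use the scalar && rather than the element-wise &. && short-circuits, so sum(.x) is never evaluated on a non-numeric column, where it would throw an error:
df %>% select(where(~ is.numeric(.x) && sum(.x) != 0))
Both versions keep only the numeric columns, so A is dropped. To keep non-numeric columns as asked in the question, use ~ !is.numeric(.x) || sum(.x) != 0 inside where() instead.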
This is a solution using data.table:
library(data.table)
df <- data.table(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
# keep a column if it is non-numeric or if its sum is not zero
keep <- names(df)[sapply(df, function(x) !is.numeric(x) || sum(x) != 0)]
df[, keep, with = FALSE]

Calculate median for multiple columns by group based on subsets defined by other columns

I am trying to calculate the median (this could be substituted by a similar metric) by group for multiple columns, based on subsets defined by other columns. This is a direct follow-on question from this previous post of mine. I have attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by @Frank, but that didn't work. Let me illustrate:
Calculate median for A and B by groups GRP1 and GRP2
df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"), GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"), A = c(0,4,6,7,0,1,9,0,0,8,3,4), B = c(6,0,4,8,6,7,0,9,9,7,3,0))
med <- aggregate(.~GRP1+GRP2,df,FUN=median)
Simple. Now add columns that define which rows to use for each median: rows where a is NA should be dropped when calculating the median of A, and likewise for b and B:
a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)
As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.
med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])
This is the result I'm looking for:
GRP1 GRP2 A B
1 A A 4 5
2 B A 9 9
3 A B 4 7
4 B B 4 3
Any help will be much appreciated!
Using data.table
library(data.table)
setDT(df1)
df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]
GRP1 GRP2 A B
1: A A 4 5
2: A B 4 7
3: B A 9 9
4: B B 4 3
Same logic in dplyr
library(dplyr)
df1 %>%
group_by(GRP1, GRP2) %>%
summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))
The original df1:
df1 <- data.frame(
GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)
With dplyr:
library(dplyr)
df1 %>%
mutate(A = ifelse(is.na(a), NA, A),
B = ifelse(is.na(b), NA, B)) %>%
# I use this to put as NA the values we don't want to include
group_by(GRP1, GRP2) %>%
summarise(A = median(A, na.rm = TRUE),
B = median(B, na.rm = TRUE))
# A tibble: 4 x 4
# Groups: GRP1 [?]
GRP1 GRP2 A B
<fct> <fct> <dbl> <dbl>
1 A A 4 5
2 A B 4 7
3 B A 9 9
4 B B 4 3
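Since the question started from aggregate, the masking idea above can also be sketched in base R: set the unwanted values to NA, then let aggregate keep the NA rows (na.action = na.pass) and have median drop them (na.rm = TRUE). The names masked and med1 are used here for illustration only.

```r
df1 <- data.frame(
  GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
  A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
  B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
  a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
  b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)

# blank out the values of A and B whose selector column (a or b) is NA
masked <- df1[c("GRP1", "GRP2", "A", "B")]
masked$A[is.na(df1$a)] <- NA
masked$B[is.na(df1$b)] <- NA

# na.pass keeps rows containing NA; na.rm = TRUE is passed on to median
med1 <- aggregate(cbind(A, B) ~ GRP1 + GRP2, data = masked,
                  FUN = median, na.rm = TRUE, na.action = na.pass)
med1
```

Without na.action = na.pass, the formula method would drop every row that has an NA in either A or B before aggregating, which would change the medians.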

A clean way to aggregate/groupby involving only the distinct values in each column?

I've been using the dplyr package to create aggregated data tables, for example using the following code:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(price1, na.rm = TRUE),2),
mean_price2 = round(mean(price2, na.rm = TRUE),2))
However, I would like to only calculate the mean over the distinct values of price1 and price2 within groups
e.g:
Price1: 1 1 2 1 2 2 1
Goes to (before aggregation):
Price1: 1 2 1 2 1
(In general, price1 and price2 don't have the same number of values left after removal.) I would also like to calculate a count for each of price1 and price2, counting only distinct values within groups. (Groups are defined as two or more identical values adjacent to each other.)
I have tried:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),
mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))
And also tried wrapping the columns within the select function with distinct(), but both these throw errors.
Is there a way to do this using dplyr or another similar package without having to write something from scratch?
To satisfy your requirement for distinct, we need to remove successive values that are the same. For numeric vectors, this can be accomplished by:
x <- x[c(1, which(diff(x) != 0)+1)]
The default use of diff computes the difference between adjoining elements in the vector. We use this to detect successive values that are different, for which diff(x) != 0. Since the output differences are lagged by 1, we add 1 to the indices of these distinct elements, and we also want the first element as distinct. For example:
x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1
We can then use this with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))
or, better yet, define the function:
remove.repeats <- function(x) {
x[c(1,which(diff(x) != 0)+1)]
}
and use it with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(remove.repeats(price1)),
mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
count_non_rep_2 = length(remove.repeats(price2)),
mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))
Using this on some example data that is hopefully similar to yours:
df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3),
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)),
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
## calc.method price1 price2
##1 A 1 1
##2 A 1 1
##3 A 2 1
##4 A 1 1
##5 A 2 1
##6 A 2 1
##7 A 1 1
##8 B 1 2
##9 B 1 1
##10 B 2 2
##11 B 2 1
##12 B 2 2
##13 B 2 1
##14 B 1 2
##15 B 3 1
We get:
print(agg_data)
### A tibble: 2 x 6
## calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
## <fctr> <int> <int> <dbl> <int> <dbl>
##1 A 7 5 1.40 1 1.0
##2 B 8 4 1.75 8 1.5
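As a cross-check of these numbers, the same summary can be sketched in base R, using rle to collapse runs of identical adjacent values instead of diff. rle also works on character vectors; note that it treats every NA as its own run. The names collapse_runs, by_method and agg_base are introduced here for illustration.

```r
df <- data.frame(
  calc.method = rep(c("A", "B"), times = c(7, 8)),
  price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3),
  price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)
)

# one value per run of identical adjacent values
collapse_runs <- function(x) rle(x)$values

by_method <- split(df, df$calc.method)
agg_base <- do.call(rbind, lapply(names(by_method), function(m) {
  g  <- by_method[[m]]
  p1 <- collapse_runs(g$price1)
  p2 <- collapse_runs(g$price2)
  data.frame(calc.method     = m,
             count           = nrow(g),
             count_non_rep_1 = length(p1),
             mean_price1     = round(mean(p1, na.rm = TRUE), 2),
             count_non_rep_2 = length(p2),
             mean_price2     = round(mean(p2, na.rm = TRUE), 2))
}))
agg_base
```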
