Summarising rows and columns from known groupings in R

I have a symmetrical matrix of flows (in tibble form) similar to the below example:
library(tibble)
set.seed(2019)
df1 <- as_tibble(matrix(sample(1:10, 100, replace = TRUE),
                        nrow = 10, ncol = 10, byrow = TRUE,
                        dimnames = list(as.character(1:10),
                                        as.character(1:10))))
df1
# `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 8 8 4 7 1 1 9 1 2 7
# 2 8 7 3 2 7 7 1 8 4 5
# 3 5 6 10 2 2 1 6 10 7 5
# 4 7 1 9 2 1 1 4 5 1 8
# 5 7 3 9 7 9 5 10 10 3 2
# 6 4 1 1 4 6 4 10 10 1 1
# 7 2 3 8 4 8 10 4 1 9 6
# 8 4 2 4 2 7 10 2 6 4 8
# 9 1 10 10 3 6 2 6 7 8 4
#10 6 8 9 3 6 9 5 10 4 10
I also have a lookup table that shows the broad groups that each flow subgroup fits into:
lookup <- tibble(sector = as.character(1:10),
                 aggregate_sector = c(rep('A', 3), rep('B', 3), rep('C', 4)))
lookup
# sector aggregate_sector
#1 1 A
#2 2 A
#3 3 A
#4 4 B
#5 5 B
#6 6 B
#7 7 C
#8 8 C
#9 9 C
#10 10 C
I want to summarise my original df1 such that it represents the flows between each aggregate_sector (as per the lookup table) rather than each sector. Expected output:
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
My initial attempt was to convert the tibble into a matrix and then use a nested for loop to calculate the sum of flows for each aggregate_sector combination in turn:
mdat <- as.matrix(df1)
# replace row and column names with group names - assumes lookup is in same order as row and col names...
row.names(mdat) <- lookup$aggregate_sector
colnames(mdat) <- lookup$aggregate_sector
# pre-allocate an empty matrix
new_mat <- matrix(nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], LETTERS[1:3]))
# fill in matrix section by section
for (i in row.names(new_mat)) {
  for (j in colnames(new_mat)) {
    new_mat[i, j] <- sum(mdat[which(row.names(mdat) == i),
                              which(colnames(mdat) == j)])
  }
}
new_mat
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
While this is a satisfactory solution, I wonder if there's a solution using dplyr or similar that uses nicer logic and saves me from having to convert my actual data (which is a tibble) into matrix form.

The key step is to gather; after that it is all straightforward dplyr stuff:
library(dplyr)
flow_by_sector <- df1 %>%
  mutate(sector_from = rownames(.)) %>%
  tidyr::gather(sector_to, flow, -sector_from)

flow_by_sector_with_agg <- flow_by_sector %>%
  left_join(lookup, by = c("sector_from" = "sector")) %>%
  rename(agg_from = aggregate_sector) %>%
  left_join(lookup, by = c("sector_to" = "sector")) %>%
  rename(agg_to = aggregate_sector)

flow_by_agg <- flow_by_sector_with_agg %>%
  group_by(agg_from, agg_to) %>%
  summarise(flow = sum(flow))

tidyr::spread(flow_by_agg, agg_to, flow)
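As an aside, gather() and spread() are superseded in current tidyr; with tidyr >= 1.0.0 the same pipeline can be written with pivot_longer() and pivot_wider(). A sketch of the equivalent:
library(dplyr)
library(tidyr)
df1 %>%
  # row names of a tibble are just "1".."10", so create the column explicitly
  mutate(sector_from = as.character(row_number())) %>%
  pivot_longer(-sector_from, names_to = "sector_to", values_to = "flow") %>%
  # attach the aggregate sector for both ends of each flow
  left_join(lookup, by = c("sector_from" = "sector")) %>%
  rename(agg_from = aggregate_sector) %>%
  left_join(lookup, by = c("sector_to" = "sector")) %>%
  rename(agg_to = aggregate_sector) %>%
  group_by(agg_from, agg_to) %>%
  summarise(flow = sum(flow), .groups = "drop") %>%
  pivot_wider(names_from = agg_to, values_from = flow)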

Here's a base answer that uses stack and xtabs. It's not super robust: it assumes the lookup table is in the same order as the rows and columns of the data.frame.
colnames(df1) <- lookup$aggregate_sector
xtabs(values ~ sector + ind,
      data = data.frame(sector = rep(lookup$aggregate_sector, length(df1)),
                        stack(df1)))
Here's another way to build the data.frame, using expand.grid:
xtabs(values ~ Var1 + Var2,
      data = data.frame(expand.grid(lookup$aggregate_sector, lookup$aggregate_sector),
                        values = unlist(df1)))
Var2
Var1 A B C
A 59 30 65
B 42 39 65
C 67 70 94
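Another compact base option, with the same ordering caveat, is rowsum(): sum the rows by aggregate sector, transpose, sum again, and transpose back. A minimal sketch:
m <- as.matrix(df1)
t(rowsum(t(rowsum(m, lookup$aggregate_sector)), lookup$aggregate_sector))
#    A  B  C
# A 59 30 65
# B 42 39 65
# C 67 70 94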

I actually figured out a matrix algebra alternative to my problem which is much faster despite having to convert my data.frame into a matrix. I won't accept this solution as I did ask specifically for a dplyr answer, but thought it interesting enough to post here anyway.
I first had to form an adjustment matrix, S, from my lookup table, where the locations of the ones in row i of S indicate which sectors of the original matrix will be grouped together as sector i in the aggregated matrix:
library(dplyr)
library(tidyr)
S <- lookup %>%
  mutate(sector = as.numeric(sector), value = 1) %>%
  spread(sector, value) %>%
  column_to_rownames('aggregate_sector') %>%
  as.matrix()
S[is.na(S)] <- 0
S
# 1 2 3 4 5 6 7 8 9 10
#A 1 1 1 0 0 0 0 0 0 0
#B 0 0 0 1 1 1 0 0 0 0
#C 0 0 0 0 0 0 1 1 1 1
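Incidentally, S is just a one-hot indicator matrix, so it can also be built in one step with model.matrix(); a hedged shortcut (the rowname clean-up is purely cosmetic):
S2 <- t(model.matrix(~ aggregate_sector - 1, data = lookup))
rownames(S2) <- sub("^aggregate_sector", "", rownames(S2))
colnames(S2) <- lookup$sector
all(S2 == S)  # should be TRUE
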
Then I convert my original data.frame, df1, into a matrix x and simply calculate S.x.S' (that is, S %*% x %*% t(S)):
x <- as.matrix(df1)
S %*% x %*% t(S)
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94

Related

R: in for loop, using mutate to compute the difference between two variables dynamically

Purpose
Suppose I have four variables: two are original variables and the other two are predictions of the originals. (In the actual data there are many more original variables.)
I want to use a for loop and mutate to create columns that compute the difference between each original variable and its prediction. The sample data and my current approach follow:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <- data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)) {
  d <- d %>%
    mutate(diff = original[i] - pred[i])
  l <- length(d)
  colnames(d[l]) <- paste0(original[i], '.diff')
}
# Error: Problem with `mutate()` input `diff`.
# x non-numeric argument to binary operator
# ℹ Input `diff` is `original[i] - pred[i]`.
The error occurs because original[i] and pred[i] are column names (character strings), so the loop tries to subtract one string from another. With dplyr >= 1.0.0 you can use across() and look up the matching prediction column by name with cur_column():
d %>%
  mutate(
    across(
      .cols = starts_with("dv"),
      .fns = ~ . - get(paste0("pred_", cur_column())),
      .names = "diff_{.col}"
    )
  )
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
Subtraction can be applied to data.frames directly, so you can create a vector of the original column names and another of the prediction column names, then subtract one block of columns from the other to create the new columns.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...
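For reference, the original loop can also be repaired by using the names to index columns rather than subtracting the name strings themselves; a minimal sketch reusing the original and pred vectors from the question:
for (i in seq_along(original)) {
  # index the columns by name, then subtract the actual values
  d[[paste0(original[i], '.diff')]] <- d[[original[i]]] - d[[pred[i]]]
}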

using intervals in a column to populate values for another column

I have a dataframe:
# note: no seed is set, so your sampled Time values will differ from those below
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
                        Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can get the difference of the range (range() returns the min and max as a vector) and divide it by the constant passed in (i.e. 3) to use as the breaks in cut. Then use the integer index (labels = FALSE) to pick the corresponding letter from the LETTERS built-in R constant.
library(dplyr)
dataframe %>%
  group_by(Condition) %>%
  mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
                             labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
  group_by(grp = rleid(Condition)) %>%
  mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
                             labels = FALSE)])
Here's a one-liner using my santoku package. The rleid line is the same as mentioned in #akrun's solution.
library(magrittr)  # for the %<>% assignment pipe
library(santoku)
dataframe %<>%
  group_by(grp = data.table::rleid(Condition)) %>%
  mutate(
    Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
  )
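For completeness, the same rleid-based grouping works in base R with ave(); a minimal sketch, assuming data.table is installed for rleid():
# cut each run of Condition into 3 equal-width Time bins, then map bins to letters
bins <- with(dataframe,
             ave(Time, data.table::rleid(Condition),
                 FUN = function(x) cut(x, 3, labels = FALSE)))
dataframe$Trial <- LETTERS[bins]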

Is there an R function for selecting common values of 2 dataframes?

I am trying to select the common values of two data frames. I have a big_df and a small_df.
What I want is a data frame containing only the rows whose "ID" values appear in both data frames, keeping the big_df columns and dropping the small_df ones. My merge attempt keeps the Colour column as well:
library(dplyr)
df3 <- merge(big_df, small_df, by = "ID")
> df3
ID Age Name Colour
1 1 21 a blue
2 4 20 d green
3 8 87 h red
4 9 9 i black
big_df <- data.frame("ID" = 1:10, "Age" = c(21,15,1,20,34,45,67,87,9,77), "Name" = c("a","b","c","d","e","f","g","h","i","l"))
> big_df
ID Age Name
1 1 21 a
2 2 15 b
3 3 1 c
4 4 20 d
5 5 34 e
6 6 45 f
7 7 67 g
8 8 87 h
9 9 9 i
10 10 77 l
small_df <- data.frame("ID" = c(1,4,8,9), "Colour" = c("blue","green","red","black"))
> small_df
ID Colour
1 1 blue
2 4 green
3 8 red
4 9 black
I would like to have this instead, without the colour information:
> df3
ID Age Name
1 1 21 a
2 4 20 d
3 8 87 h
4 9 9 i
dplyr's semi_join() was intended for exactly this
big_df <- data.frame("ID" = 1:10, "Age" = c(21,15,1,20,34,45,67,87,9,77), "Name" = c("a","b","c","d","e","f","g","h","i","l"))
small_df <- data.frame("ID" = c(1,4,8,9), "Colour" = c("blue","green","red","black"))
library(dplyr)
semi_join(big_df,small_df,by='ID')
#
# ID Age Name
# 1 1 21 a
# 2 4 20 d
# 3 8 87 h
# 4 9 9 i
I have a feeling what you really need is:
#check which big IDs exist in small IDs and subset
big_df[big_df$ID %in% unique(small_df$ID), ]
# ID Age Name
#1 1 21 a
#4 4 20 d
#8 8 87 h
#9 9 9 i
So, I don't think you need a join in this case.
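If you prefer to stay in a dplyr pipeline, the same subset can be written with filter() (a minimal equivalent sketch):
library(dplyr)
big_df %>% filter(ID %in% small_df$ID)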

Find the smallest value under conditions about the index (NA output possible)

Question:
I am using dplyr to do data analysis in R, and I have come across the following problem.
My data frame is like this:
item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65
The data frame is already arranged by item, day. Now I want to mutate a new column where each row holds the smallest val among rows of the same item whose day falls within the next 2 days.
For the example above, I want the resulting data frame to be:
item day val output
1 A 1 90 100 # the smaller of 100 and 110
2 A 2 100 110 # the only value within 2 days
3 A 3 110 80 # the only value within 2 days
4 A 5 80 NA # there is no data within 2 days
5 A 8 70 NA # there is no data within 2 days
6 B 1 75 65 # the only value within 2 days
7 B 3 65 NA # there is no data within 2 days
I understand that I will probably use group_by and mutate, but how to write the inside function in order to achieve my desired result?
Any help is greatly appreciated. Let me know if you need me to clarify anything. Thank you!
Try this:
df %>%
# arrange(item, day) %>% # if not already arranged
# take note of the next two values & corresponding difference in days
group_by(item) %>%
mutate(val.1 = lead(val),
day.1 = lead(day) - day,
val.2 = lead(val, 2),
day.2 = lead(day, 2) - day) %>%
ungroup() %>%
# if the value is associated with a day more than 2 days away, change it to NA
mutate(val.1 = ifelse(day.1 %in% c(1, 2), val.1, NA),
val.2 = ifelse(day.2 %in% c(1, 2), val.2, NA)) %>%
# calculate output normally
group_by(item, day) %>%
mutate(output = min(val.1, val.2, na.rm = TRUE)) %>%
ungroup() %>%
# arrange results
select(item, day, val, output) %>%
mutate(output = ifelse(output == Inf, NA, output)) %>%
arrange(item, day)
# A tibble: 7 x 4
item day val output
<fctr> <int> <int> <dbl>
1 A 1 90 100
2 A 2 100 110
3 A 3 110 80.0
4 A 5 80 NA
5 A 8 70 NA
6 B 1 75 65.0
7 B 3 65 NA
Data:
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
We can use complete from the tidyr package to complete the dataset by day, and then use lead from dplyr and rollapply from zoo to find the minimum of the next two days.
library(dplyr)
library(tidyr)
library(zoo)
DF2 <- DF %>%
group_by(item) %>%
complete(day = full_seq(day, period = 1)) %>%
mutate(output = rollapply(lead(val), width = 2, FUN = min, na.rm = TRUE,
fill = NA, align = "left")) %>%
drop_na(val) %>%
ungroup() %>%
mutate(output = ifelse(output == Inf, NA, output))
DF2
# # A tibble: 7 x 4
# item day val output
# <chr> <dbl> <int> <dbl>
# 1 A 1.00 90 100
# 2 A 2.00 100 110
# 3 A 3.00 110 80.0
# 4 A 5.00 80 NA
# 5 A 8.00 70 NA
# 6 B 1.00 75 65.0
# 7 B 3.00 65 NA
DATA
DF <- read.table(text = "item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65",
header = TRUE, stringsAsFactors = FALSE)
We'll create a dataset with modified day values so we can left-join it onto the original dataset, keeping only the minimum value.
df %>%
  left_join(
    bind_rows(mutate(., day = day - 1), mutate(., day = day - 2)) %>%
      rename(output = val)) %>%
  group_by(item, day, val) %>%
  summarize_at("output", min) %>%
  ungroup()
# # A tibble: 7 x 4
# item day val output
# <fctr> <dbl> <int> <dbl>
# 1 A 1 90 100
# 2 A 2 100 110
# 3 A 3 110 80
# 4 A 5 80 NA
# 5 A 8 70 NA
# 6 B 1 75 65
# 7 B 3 65 NA
data
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
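A plain base-R alternative is to scan each row's own item for days one or two ahead; a minimal sketch (it loops over every row, so it is best suited to small data):
df$output <- sapply(seq_len(nrow(df)), function(i) {
  # values of the same item whose day falls within the next 2 days
  cand <- df$val[df$item == df$item[i] &
                 df$day > df$day[i] &
                 df$day <= df$day[i] + 2]
  if (length(cand)) min(cand) else NA
})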

transposing dataframe by column, finding column minimum and returning index

I have a dataframe p1. I would like to transpose it by column a, find the minimum of each row, and return the column name that holds the minimum value.
a=c(0,1,2,3,4,0,1,2,3,4)
b=c(10,20,30,40,50,9,8,7,6,5)
p1=data.frame(a,b)
p1
> p1
a b
1 0 10
2 1 20
3 2 30
4 3 40
5 4 50
6 0 9
7 1 8
8 2 7
9 3 6
10 4 5
The final required answer
0 1 2 3 4 row_minimum column_index_of_minimum
10 20 30 40 50 10 0
9 8 7 6 5 5 4
I used many things, but the main one was ave(p1$a, p1$a, FUN = seq_along), which allowed me to split b into groups based on how many times each value of a had occurred so far:
myans <- setNames(data.frame(do.call(rbind,
                    lapply(split(p1, ave(p1$a, p1$a, FUN = seq_along)),
                           function(x) x[, 2]))),
                  nm = rbind(p1$a[ave(p1$a, p1$a, FUN = seq_along) == 1]))
minimum = apply(myans, 1, min)
index = colnames(myans)[apply(myans, 1, which.min)]
myans$min = minimum
myans$index = index
myans
# 0 1 2 3 4 min index
#1 10 20 30 40 50 10 0
#2 9 8 7 6 5 5 4
Consider using a running group count followed by an aggregate and reshape:
# RUNNING GROUP COUNT
p1$grpcnt <- sapply(seq(nrow(p1)), function(i) sum(p1[1:i, c("a")]==p1$a[[i]]))
# MINIMUM OF B BY GROUP COUNT MERGING TO RETRIEVE A VALUE
aggdf <- setNames(merge(aggregate(b~grpcnt, p1, FUN=min),p1,by="b")[c("grpcnt.x","b","a")],
c("grpcnt", "row_minimum", "column_index_of_minimum"))
# RESHAPE/TRANSPOSE LONG TO WIDE
reshapedf <- setNames(reshape(p1, timevar=c("a"), idvar=c("grpcnt"), direction="wide"),
c("grpcnt", paste(unique(p1$a))))
# FINAL MERGE
finaldf <- merge(reshapedf, aggdf, by="grpcnt")[-1]
finaldf
# 0 1 2 3 4 row_minimum column_index_of_minimum
# 1 10 20 30 40 50 10 0
# 2 9 8 7 6 5 5 4
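A tidyverse version of the same idea, assuming (as above) that each repeat of a value of a starts a new row; a hedged sketch with pivot_wider():
library(dplyr)
library(tidyr)
wide <- p1 %>%
  group_by(a) %>%
  mutate(grp = row_number()) %>%  # running occurrence count, like ave(..., seq_along)
  ungroup() %>%
  pivot_wider(names_from = a, values_from = b)
vals <- wide[, -1]  # drop the grp column
wide$row_minimum <- apply(vals, 1, min)
wide$column_index_of_minimum <- colnames(vals)[apply(vals, 1, which.min)]
wide[, -1]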
