I need to group rows in a dataset based on a rowwise comparison or matching of four different variables, in a way that can be computed fast. The dataset has the following shape:
id  start  start_sep  end  end_sep
1   A      1          F    1
2   B      0          G    0
3   D      1          H    0
4   D      1          J    0
5   E      0          K    0
6   F      1          L    0
7   A      1          O    0
8   H      0          P    0
9   A      1          P    1
Specifically, I would like to group those rows that share the same value in either start<>end, end<>start, start<>start, or end<>end and have a matching value (>0) in the related start_sep and end_sep columns. start_sep and end_sep are basically an additional check that the first match is correct. Groups usually contain two matching rows, but can be of any size if a row's start and end match more than one other row. Matches are always unique, so there won't be a matching conflict between two rows with the same matching start and end combination.
Here is a little illustration of potential match combinations in the columns 'start' and 'end':
For the example above, the expected result would be:
id  start  start_sep  end  end_sep  group_id
1   A      1          F    1        1
2   B      0          G    0        NA
3   D      1          H    0        2
4   D      1          J    0        2
5   E      0          K    0        NA
6   F      1          L    0        1
7   A      0          O    1        3
8   O      1          P    0        3
9   A      1          P    0        1
In this sense, matches can occur within a single row.
I can do this with a slow for loop and a set of conditions, i.e., selecting a row (if group_id is NA), finding all matches (based on conditions describing the possible match combinations), and assigning a group_id. But since my dataset has 1 million plus rows, this is a very slow process. Here is an example for a single case:
#create data.frame
df <- data.frame (id = c("1", "2", "3","4","5","6","7","8","9"),
start = c("A", "B", "D","D","E","F","A","O","A"),
start_sep = c("1", "0", "1","1","0","1","0","1","1"),
end = c("F", "G", "H", "J","K","L","O","P","P"),
end_sep = c("1", "0", "0","0","0","0","1","0","0"),
gid = 0
)
#extract a single row
i <- df[1,]
#find all matching rows and assign row id as group id
y <- df[ which( i[,6] == 0 &
(i[,2] == df[,2] & i[,3] == df[,3] |
i[,4] == df[,4] & i[,5] == df[,5] |
i[,2] == df[,4] & i[,3] == df[,5] |
i[,4] == df[,2] & i[,5] == df[,3])
),] %>% mutate(gid = recode(gid, '0' = i[,1]) )
View(y)
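For reference, the full loop would look roughly like this. This is an untested, intentionally naive sketch: next_gid, hits, and r are just illustrative names, and transitive chains of matches are ignored for brevity.
#full (slow) loop -- untested sketch only
df$start_sep <- as.numeric(df$start_sep)
df$end_sep   <- as.numeric(df$end_sep)
df$gid <- NA
next_gid <- 0
for (r in seq_len(nrow(df))) {
  if (!is.na(df$gid[r])) next                    # row already grouped
  i <- df[r, ]
  hits <- which(
    (i$start == df$start & i$start_sep == df$start_sep & i$start_sep > 0) |
    (i$end   == df$end   & i$end_sep   == df$end_sep   & i$end_sep   > 0) |
    (i$start == df$end   & i$start_sep == df$end_sep   & i$start_sep > 0) |
    (i$end   == df$start & i$end_sep   == df$start_sep & i$end_sep   > 0)
  )
  hits <- setdiff(hits, r)                       # drop the row itself
  if (length(hits) > 0) {
    next_gid <- next_gid + 1
    df$gid[c(r, hits)] <- next_gid
  }
}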
I am wondering whether there might be a straightforward way with lapply or dplyr (e.g., mutate(), group_by(), rowwise(), cur_group_id()) that is fast, or maybe a much more efficient way than the approach outlined above?
Here are two solutions with dplyr and data.table respectively. Each package vectorizes its operations, so these solutions should be far faster than your loop; and the data.table solution should be the fastest of them all.
Let me know how each solution works for you!
Note
To identify the group to which each row belongs, we use the earliest row that it "matches"; where "matching" rows are defined as those that
share the same value in either start<>end, end<>start, start<>start, or end<>end and have a matching value (>0) in the related start_sep and end_sep column.
For a smaller dataset, it would be simple enough to perform a CROSS JOIN and then filter by your criteria. However, for a dataset with over 1 million rows, its CROSS JOIN would easily max out the available memory at over 1 trillion rows, so I had to find a different technique.
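For the record, here is roughly what that discarded cross-join approach would look like on a small df, using base merge() with by = NULL for the Cartesian product. The names lhs, rhs, pairs, and matches are hypothetical; the sketch reproduces the matching criteria but not the group numbering, and is shown only to illustrate why it cannot scale.
# Cartesian product of df with itself: nrow(df)^2 rows, which is why this
# approach is hopeless at 1 million rows.
lhs <- df
rhs <- setNames(df, paste0(names(df), "_y"))
pairs <- merge(lhs, rhs, by = NULL)   # base R cross join

# Keep only the pairs that satisfy the matching criteria.
matches <- subset(
  pairs,
  id != id_y & (
    (start == start_y & start_sep == start_sep_y & start_sep > 0) |
    (end   == end_y   & end_sep   == end_sep_y   & end_sep   > 0) |
    (start == end_y   & start_sep == end_sep_y   & start_sep > 0) |
    (end   == start_y & end_sep   == start_sep_y & end_sep   > 0)
  )
)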
To wit, I use paste0() to generate "artificial" keys. Here start and start_sep are combined into start_label, while end and end_sep are combined into end_label. Now we can directly match() on a single column like start_label; rather than sifting every possible match across a set of columns like {start, start_sep}.
This approach assumes that in those * and *_sep columns:
every distinct value can be represented as a distinct string;
the separator "|" is absent from that string.
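To make the mechanics concrete, here is what those keys and a single match() call look like on the reconstructed df from the Results section below (the printed output should match, assuming that df):
# Artificial keys: value and its *_sep, joined by the reserved separator.
start_label <- paste0(df$start, " | ", df$start_sep)
end_label   <- paste0(df$end,   " | ", df$end_sep)
head(start_label, 3)
#> [1] "A | 1" "B | 0" "D | 1"

# For each row, the index of the earliest row whose end_label equals this
# row's start_label (NA when no such row exists): the start-to-end matches.
match(start_label, end_label)
#> [1] NA NA NA NA NA  1 NA  7 NA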
Solution 1: dplyr
Once you load dplyr
library(dplyr)
# ...
# Code to generate 'df'.
# ...
this workflow should do the trick. Note that group IDs must be calculated before the JOIN; since cur_group_id() would otherwise "misidentify" the NAs as a group unto themselves.
df %>%
mutate(
# Create an artificial key for matching.
start_label = paste0(start, " | ", start_sep),
end_label = paste0(end, " | ", end_sep ),
# Identify the earliest row where each match is found.
start_to_start = match(start_label, start_label),
start_to_end = match(start_label, end_label ),
end_to_start = match(end_label , start_label),
end_to_end = match(end_label , end_label )
) %>%
# Include only rows meeting the criteria: remove any...
filter(
# ...without a match...
# |-------------------------------------------|
(start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
(end_sep > 0 & !(is.na(end_to_start ) & is.na(end_to_end )))
# |-----------|
# ...that corresponds to a positive '*_sep'.
) %>%
# For each row, identify the earliest of ALL its matches.
mutate(
match_id = pmin(
start_to_start, start_to_end, end_to_start, end_to_end,
na.rm = TRUE
)
) %>%
# Keep only the 'id' of each row, along with a 'group_id' for its earliest match.
group_by(match_id) %>%
transmute(
id,
group_id = cur_group_id()
) %>%
ungroup() %>%
# Map the original rows to their 'group_id's; with blanks (NAs) for no match.
right_join(df, by = "id") %>%
# Format into final form.
select(id, start, start_sep, end, end_sep, group_id) %>%
arrange(id)
Results
Please note that your sample data is inconsistent, so I have reconstructed my own df:
df <- structure(list(
id = 1:9,
start = c("A", "B", "D", "D", "E", "F", "A", "O", "A"),
start_sep = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L),
end = c("F", "G", "H", "J", "K", "L", "O", "P", "P"),
end_sep = c(1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)
),
class = "data.frame",
row.names = c(NA, -9L)
)
Given said df, the workflow should yield the following tibble:
# A tibble: 9 x 6
id start start_sep end end_sep group_id
<int> <chr> <int> <chr> <int> <int>
1 1 A 1 F 1 1
2 2 B 0 G 0 NA
3 3 D 1 H 0 2
4 4 D 1 J 0 2
5 5 E 0 K 0 NA
6 6 F 1 L 0 1
7 7 A 0 O 1 3
8 8 O 1 P 0 3
9 9 A 1 P 0 1
Solution 2: data.table
Here is essentially the same logic, but implemented in data.table.
library(data.table)
# ...
# Code to generate 'df'.
# ...
# Convert 'df' to a data.table.
df <- as.data.table(df)
Again, note that group IDs must be calculated before the JOIN; since .GRP would otherwise "misidentify" the NAs as a group unto themselves.
# Use 'id' as the key for efficient JOINs.
setkey(df, id
# Calculate the label and matching columns as before.
)[, c("start_label", "end_label") := .(
paste0(start, " | ", start_sep),
paste0(end , " | ", end_sep )
)][, c("start_to_start", "start_to_end", "end_to_start", "end_to_end") := .(
match(start_label, start_label),
match(start_label, end_label ),
match(end_label , start_label),
match(end_label , end_label )
# Filter by criteria as before.
)][
(start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
(end_sep > 0 & !(is.na(end_to_start ) & is.na(end_to_end )))
# Generate the 'group_id' as before.
,][, .(id, match_id = pmin(
start_to_start, start_to_end, end_to_start, end_to_end,
na.rm = TRUE
))][,
("group_id") := .GRP,
by = .(match_id)
# Perform the mapping (RIGHT JOIN) as before...
][
df,
# ...and select the desired columns.
.(id, start, start_sep, end, end_sep, group_id)
]
Results
With df as before, this solution should yield the following data.table:
id start start_sep end end_sep group_id
1: 1 A 1 F 1 1
2: 2 B 0 G 0 NA
3: 3 D 1 H 0 2
4: 4 D 1 J 0 2
5: 5 E 0 K 0 NA
6: 6 F 1 L 0 1
7: 7 A 0 O 1 3
8: 8 O 1 P 0 3
9: 9 A 1 P 0 1
Performance
At scale, the data.table solution should be proportionately faster than the dplyr solution; but both should be quite fast.
On the massive dataset big_df, a data.frame with over 1 million rows
# Find every combination of variables...
big_df <- expand.grid(
start = LETTERS,
start_sep = 0:1,
end = LETTERS,
end_sep = 0:1
)
# ...and repeat until there are (at least) 1 million...
n_comb <- nrow(big_df)
n_rep <- ceiling(1000000/n_comb)
# ...with unique IDs.
big_df <- data.frame(
id = 1:(n_comb * n_rep),
start = rep(big_df$start , n_rep),
start_sep = rep(big_df$start_sep, n_rep),
end = rep(big_df$end , n_rep),
end_sep = rep(big_df$end_sep , n_rep)
)
we can measure the relative performances of each solution at scale
library(microbenchmark)
performances <- microbenchmark(
# Repeat test 50 times, for reliability.
times = 50,
# Solution 1: "dplyr".
solution_1 = {
big_df %>%
mutate(
start_label = paste0(start, " | ", start_sep),
end_label = paste0(end, " | ", end_sep ),
start_to_start = match(start_label, start_label),
start_to_end = match(start_label, end_label ),
end_to_start = match(end_label , start_label),
end_to_end = match(end_label , end_label )
) %>%
filter(
(start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
(end_sep > 0 & !(is.na(end_to_start ) & is.na(end_to_end )))
) %>%
mutate(match_id = pmin(
start_to_start, start_to_end, end_to_start, end_to_end,
na.rm = TRUE
)) %>%
group_by(match_id) %>%
transmute(id, group_id = cur_group_id()) %>%
ungroup() %>%
right_join(big_df, by = "id") %>%
select(id, start, start_sep, end, end_sep, group_id) %>%
arrange(id)
},
# Solution 2: "data.table"
solution_2 = {
big_dt <- as.data.table(big_df)
setkey(big_dt, id)[, c("start_label", "end_label") := .(
paste0(start, " | ", start_sep),
paste0(end , " | ", end_sep )
)][, c("start_to_start", "start_to_end", "end_to_start", "end_to_end") := .(
match(start_label, start_label),
match(start_label, end_label ),
match(end_label , start_label),
match(end_label , end_label )
)][
(start_sep > 0 & !(is.na(start_to_start) & is.na(start_to_end))) |
(end_sep > 0 & !(is.na(end_to_start ) & is.na(end_to_end )))
,][, .(id, match_id = pmin(
start_to_start, start_to_end, end_to_start, end_to_end,
na.rm = TRUE
))][, ("group_id") := .GRP, by = .(match_id)
][big_dt, .(id, start, start_sep, end, end_sep, group_id)]
}
)
which I have tabulated here:
#> performances
Unit: milliseconds
expr min lq mean median uq max neval
solution_1 880.1443 972.9289 1013.2868 997.5746 1059.9192 1186.8743 50
solution_2 581.2570 606.7222 649.9858 650.2422 679.4404 734.3966 50
By converting from time to speed
library(formattable)
performances %>%
as_tibble() %>%
group_by(expr) %>%
summarize(t_mean = mean(time)) %>%
transmute(
solution = expr,
# Invert time to get speed; and normalize % by longest time.
advantage = percent(max(t_mean)/t_mean - 1)
)
we estimate that the data.table solution is (on average) roughly 56% faster than the dplyr solution.
# A tibble: 2 x 2
solution advantage
<fct> <formttbl>
1 solution_1 0.00%
2 solution_2 55.89%
Related
I have the below-mentioned dataframes in R:
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
Using the two dataframes mentioned above, I need the output in the following format:
I want to identify the exclusive count of Check and Remarks for each unique ID, as well as the combination of each Check with every other Check, and similarly for Remarks.
Note - The order of rows should be highest to lowest based on the Exclusive_Count of Check. It is quite possible that the number of unique Check and Remarks values may differ in my actual dataframe (e.g., 10 unique Remarks and 5 Checks, something like that).
DF_Output<-
Remark  Exclusive_Count  %        X1  X2  XR  XT    Check  Exclusive_Count  %        A1  A2  OT  LP
Blank   1                33.33%   0   0   0   0     Blank  2                50.00%   0   0   0   0
X1      1                33.33%   0   2   2   1     A1     1                25.00%   0   1   1   0
X2      0                0.00%    2   0   1   0     A2     1                25.00%   1   0   1   1
XR      0                0.00%    2   2   0   1     OT     0                0.00%    1   1   0   1
XT      1                33.33%   1   0   1   0     LP     0                0.00%    0   1   1   0
Total   3                100.00%  5   4   4   2     Total  4                100.00%  2   3   3   2
The OP has requested a canonical answer. So, I have created a function get_exclusive_counts() which takes the first two columns of any tibble, data.frame, or data.table where the first column contains IDs and the second column contains the payload, e.g., Check, in long format.
The function is independent of column names and will work with an arbitrary number of different items in the payload column. It returns a data.table for each input tibble:
get_exclusive_counts(DF)
Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0
6: Totals 4 100.00% 2 3 2 3
For the second use case DF2, the payload needs to be split into separate rows beforehand:
library(magrittr)
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks) %>%
get_exclusive_counts()
Remarks Exclusive_Count % X1 X2 XR XT
1: Blank 1 33.33% 0 0 0 0
2: X1 1 33.33% 0 2 2 1
3: XT 1 33.33% 1 0 1 0
4: X2 0 0.00% 2 0 2 0
5: XR 0 0.00% 2 2 0 1
6: Totals 3 100.00% 5 4 5 2
Note that the name of the first column of the result table has been retained from the input data.frame.
The OP has mentioned that the number of Remarks and Check may differ. Therefore, it doesn't really make sense to cbind() the two result tables, because this will only give a reasonable result when the number of rows is the same.
Also, OP's expected result has some column names repeated (at least Exclusive_Count, %, perhaps more) which indicates that the result may not be used for further processing but for display / print only.
Printing results side by side
However, I have created a function get_exclusive_counts_side_by_side() which prints the results from calling get_exclusive_counts()
for an arbitrary number of input datasets,
with differing numbers of rows, and
with the last rows (Totals) aligned.
The function returns a data.table with character columns.
The call below will reproduce OP's expected result:
get_exclusive_counts_side_by_side(
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Remarks Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 LP OT
1: Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
2: X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 0 1
3: XT 1 33.33% 1 0 1 0 A2 1 25.00% 1 0 1 1
4: X2 0 0.00% 2 0 2 0 LP 0 0.00% 0 1 0 1
5: XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 1 0
6: Totals 3 100.00% 5 4 5 2 Totals 4 100.00% 2 3 2 3
Here is another use case to demonstrate that it will work with differing rows and an arbitrary number of input datasets:
get_exclusive_counts_side_by_side(
DF,
DF3 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Check Exclusive_Count % A1 A2 LP OT Remarks Exclusive_Count % X1 X2 XR XT Y2 Y3 Y4 Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0 X1 2 50.00% 0 2 2 1 1 1 0 Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1 Blank 1 25.00% 0 0 0 0 0 0 0 A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1 XT 1 25.00% 1 0 1 0 0 0 0 A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1 X2 0 0.00% 2 0 2 0 0 0 0 LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0 XR 0 0.00% 2 2 0 1 0 0 0 OT 0 0.00% 1 1 1 0
6: Y2 0 0.00% 1 0 0 0 0 1 1
7: Y3 0 0.00% 1 0 0 0 1 0 0
8: Y4 0 0.00% 0 0 0 0 1 0 0
9: Totals 4 100.00% 2 3 2 3 Totals 4 100.00% 7 4 5 2 3 2 1 Totals 4 100.00% 2 3 2 3
Function definitions
The code looks rather bulky but half of the lines are comments. So, the code should be fairly self-explanatory.
Also, about half of the lines of code are due to OP's additional requirements, like a % column or a Totals row.
get_exclusive_counts <- function(DF) {
library(data.table)
library(magrittr)
# make copy of first 2 cols to preserve original attributes of DF
DT <- as.data.table(DF[, 1:2])
# retain original column names
old <- colnames(DT)[1:2]
# rename colnames in copy for convenience of programming
setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
# aggregate by id to find exclusive counts = ids with only one element
tmp <- DT[, .N, keyby = id][N == 1L]
# create table of exclusive counts by joining and aggregating
excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>%
# append column of proportions, will be formatted after computing Totals
.[, `%` := Exclusive_Count / sum(Exclusive_Count)]
# anti-join to find remaining rows
rem <- DT[!tmp, on = .(id)]
# create co-occurrence matrix in long format by a self-join
coocc <- rem[rem, on = .(id), allow.cartesian = TRUE] %>%
# reshape to wide format and compute counts of co-occurrences w/o diagonals
dcast(val ~ i.val, length, subset = .(val != i.val))
# build final result table by merging both subresults
merge(excl, coocc, by = "val", all = TRUE) %>%
# replace NA counts by 0
.[, lapply(.SD, nafill, fill = 0L), by = val] %>%
# clean-up: order by decreasing Exclusive_Count
.[order(-Exclusive_Count)] %>%
# append Totals row
rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>%
# clean-up: format proportion as percentage
.[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>%
# clean-up: Replace <NA> by "Blank" in val column
.[is.na(val), val := "Blank"] %>%
# rename val column
setnames("val", old[2]) %>%
# return result visibly
.[]
}
Here is the code for get_exclusive_counts_side_by_side():
get_exclusive_counts_side_by_side <- function(...) {
library(data.table)
library(magrittr)
# process input, return list of subresults
ec_list <- list(...) %>%
lapply(get_exclusive_counts)
# create row indices for maximum rows
rid <- ec_list %>%
lapply(nrow) %>%
Reduce(max, .) %>%
{data.table(.rowid = 1:.)}
# combine subresults
ec_list %>%
# insert empty rows if necessary
lapply(function(.x) .x[
, .rowid := .I][
# but align last row
.rowid == .N, .rowid := nrow(rid)][
rid, on =.(.rowid)][
, .rowid := NULL]
) %>%
# all data.tables have the same number of rows, now cbind()
do.call(cbind, .) %>%
# replace all NA by empty character strings
.[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}
Additional explanation
If I understand correctly, exclusive counts refers to IDs which have only one item (or NA) assigned to them. This is fairly straightforward to compute (see the condensed sketch after this list) by
counting the number of items per ID,
picking the IDs with only one item,
picking the rows in the input data.frame which belong to those IDs (using a join), and
counting the appearances of the items in the subset of exclusive rows.
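Condensed to just those four steps, the core logic looks roughly like this (id and val are the same internal names the function uses; single_ids is a hypothetical name for the intermediate result):
library(data.table)
DT <- as.data.table(DF[, 1:2])
setnames(DT, c("id", "val"))

# 1. count items per id;  2. keep the ids with exactly one item
single_ids <- DT[, .N, keyby = id][N == 1L]

# 3. join to get the rows belonging to those ids;  4. count the items in that subset
DT[single_ids, on = .(id)][, .(Exclusive_Count = .N), keyby = val]
# For DF this should give: <NA> (i.e. "Blank") = 2, A1 = 1, A2 = 1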
Furthermore, the function deals with OP's additional requirements which go beyond the identification of exclusive counts:
adding a matrix of co-occurrence counts of the remaining, non-exclusive rows,
adding a column of proportions of exclusive counts at a specific position and formatting it as percent,
adding a Totals row,
replacing NAs by zero or "Blank", resp.
Data
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
DF3 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}",
"I-8", "{X1,Y2,Y3}",
"I-9", "{Y2,Y4}",
"I10", "{X1}",
)
I think this will do what you're looking for... Likely not the most succinct, but seems to do the trick.
# Load Library
library('tidyverse')
### CHECK ###
# Load Check Table
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
# Count by ID
DF <- DF %>%
group_by(ID) %>%
mutate(count = n())
# Count by Check
DF_X <- DF %>% dplyr::filter(count == 1) %>%
group_by(Check) %>%
dplyr::summarize("Count" = sum(count))
# Identify unique values of Check
DF_UNIQUE <- unique(DF$Check)
DF_FIN <- data.frame("Check" = DF_UNIQUE,stringsAsFactors = FALSE)
# Join Counts by Check with unique list of Checks
DF_FIN <- left_join(x = DF_FIN, y = DF_X, by = "Check")
# Replace NA's with zeros
DF_FIN[is.na(DF_FIN$Count),2] <- 0
# Calculate Percentages
DF_FIN <- DF_FIN %>%
mutate("Check Percentage" = `Count`/sum(`Count`))
# Rename Columns
colnames(DF_FIN) <- c("Check", "Exclusive Count", "Check Percentage")
# Replace NA value with the word "BLANK"
DF_FIN[is.na(DF_FIN$Check),1] <- "BLANK"
# Sort by Exclusive Count and then by Check (alphabetical)
DF_FIN <- DF_FIN %>%
arrange(desc(`Exclusive Count`), Check)
# Join Checks to itself and count instances
DF_CHECKS <- full_join(x = DF, y = DF, by = "ID")
DF_CHECKS <- DF_CHECKS %>%
group_by(Check.x, Check.y) %>%
dplyr::summarize("N" = n())
DF_CHECKS_SPREAD <- DF_CHECKS %>%
tidyr::pivot_wider(names_from = Check.y, values_from = N)
check_order <- DF_CHECKS_SPREAD$Check.x
check_order[is.na(check_order)] <- 'NA'
DF_CHECKS_SPREAD <- DF_CHECKS_SPREAD %>% select(check_order)
# Set the diagonal to zeros
for (i in 1:nrow(DF_CHECKS_SPREAD)){
DF_CHECKS_SPREAD[i,i+1] <-0
}
# Rename Columns
colnames(DF_CHECKS_SPREAD)[1] <- "Check"
colnames(DF_CHECKS_SPREAD)[colnames(DF_CHECKS_SPREAD) == "NA"] <- "BLANK"
# Drop the BLANK column
DF_CHECKS_SPREAD$BLANK <- NULL
# Replace NA value with the word "BLANK"
DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD$Check),1] <- "BLANK"
# Replace all other NA's with zero
DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD)] <- 0
# Join the two Checks data sets together & calculate grand totals
FINAL_TABLE_CHECKS <- left_join(x = DF_FIN, y = DF_CHECKS_SPREAD, by = "Check")
FINAL_TABLE_CHECKS <- FINAL_TABLE_CHECKS %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~"Total")))
### REMARKS ###
# Load Remarks table
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
# Remove the {} from the Remarks string
DF2$Remarks <- str_replace_all(string = DF2$Remarks, c("\\{" = "", "\\}" = ""))
# Expand string into rows
DF2 <- separate_rows(DF2, Remarks, convert = TRUE)
# Group and count by ID
DF2 <- DF2 %>%
group_by(ID) %>%
mutate(count = n())
# Count by Remarks
DF2_X <- DF2 %>% dplyr::filter(count == 1) %>%
group_by(Remarks) %>%
dplyr::summarize("Count" = sum(count))
# Identify unique Remarks
DF2_UNIQUE <- unique(DF2$Remarks)
DF2_FIN <- data.frame("Remarks" = DF2_UNIQUE,stringsAsFactors = FALSE)
# Join count of Remarks with unique list of Remarks
DF2_FIN <- left_join(x = DF2_FIN, y = DF2_X, by = "Remarks")
# Replace NA's with zeros
DF2_FIN[is.na(DF2_FIN$Count),2] <- 0
# Calculate Percentages
DF2_FIN <- DF2_FIN %>%
mutate("Remarks Percentage" = `Count`/sum(`Count`))
# Rename columns
colnames(DF2_FIN) <- c("Remarks", "Exclusive Count", "Remarks Percentage")
# Replace NA value with the word "BLANK"
DF2_FIN[is.na(DF2_FIN$Remarks),1] <- "BLANK"
# Sort by Exclusive Count and then by Remarks (alphabetical)
DF2_FIN <- DF2_FIN %>%
arrange(desc(`Exclusive Count`), Remarks)
# Join Remarks to itself and count instances
DF_REMARKS <- full_join(x = DF2, y = DF2, by = "ID")
DF_REMARKS <- DF_REMARKS %>%
group_by(Remarks.x, Remarks.y) %>%
dplyr::summarize("N" = n())
DF_REMARKS_SPREAD <- DF_REMARKS %>%
tidyr::pivot_wider(names_from = Remarks.y, values_from = N)
check_order <- DF_REMARKS_SPREAD$Remarks.x
check_order[is.na(check_order)] <- 'NA'
DF_REMARKS_SPREAD <- DF_REMARKS_SPREAD %>% select(check_order)
# Set the diagonal to zeros
for (i in 1:nrow(DF_REMARKS_SPREAD)){
DF_REMARKS_SPREAD[i,i+1] <-0
}
# Rename Columns
colnames(DF_REMARKS_SPREAD)[1] <- "Remarks"
colnames(DF_REMARKS_SPREAD)[colnames(DF_REMARKS_SPREAD) == "NA"] <- "BLANK"
# Drop the BLANK column
DF_REMARKS_SPREAD$BLANK <- NULL
# Replace NA value with the word "BLANK"
DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD$Remarks),1] <- "BLANK"
# Replace all other NA's with zero
DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD)] <- 0
# Join the two Remarks data sets together & calculate grand totals
FINAL_TABLE_REMARKS <- left_join(x = DF2_FIN, y = DF_REMARKS_SPREAD, by = "Remarks")
FINAL_TABLE_REMARKS <- FINAL_TABLE_REMARKS %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~"Total")))
# Count rows in the Checks and Remarks dataframes and add rows to the dataframe
# with fewer rows to match the number of rows in the other.
checkRows <- nrow(FINAL_TABLE_CHECKS)
remarksRows <- nrow(FINAL_TABLE_REMARKS)
rowDiff <- abs(checkRows - remarksRows)
if(checkRows < remarksRows){
cat("Adding", rowDiff , "rows to the Checks dataframe.\n\n")
FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS)+rowDiff,] <- NA
FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS),] <- FINAL_TABLE_CHECKS[checkRows,]
FINAL_TABLE_CHECKS[checkRows,] <- NA
}else if(remarksRows < checkRows){
cat("Adding", rowDiff , "rows to the Remarks dataframe.\n\n")
FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS)+rowDiff,] <- NA
FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS),] <- FINAL_TABLE_REMARKS[remarksRows,]
FINAL_TABLE_REMARKS[remarksRows,] <- NA
}else{
print("There is no difference in number of rows between Checks and Remarks.\n\n")
}
# Combine columns from Checks and Remarks into one table.
RESULTS <- cbind(FINAL_TABLE_REMARKS, FINAL_TABLE_CHECKS)
RESULTS$`Check Percentage` <- paste(round(100*RESULTS$`Check Percentage`,2), "%", sep="")
RESULTS$`Remarks Percentage` <- paste(round(100*RESULTS$`Remarks Percentage`,2), "%", sep="")
RESULTS
I need to detect a sequence by group in a data.frame and compute new variable.
Consider I have this following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
df1
> df1
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 C A,B,C
5 1 5 1 A,B A,B,C
6 1 6 2 A,B,C A,B,C
7 1 7 3 D A,B,C,D
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
I am interested in computing a measure for IDs that follow this sequence:
- Count == 1
- Count > 1
- Count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
For these IDs and rows, I need to compute a measure called new that takes the value of the product in the last row of the sequence if that product appears in the second row of the sequence and is NOT in the stock of the first row of the sequence. For example, for ID == 1 the qualifying rows are seqs 2-4: the last row's product is C, C appears in the second row (seqs 3), and C is not in the first row's stock (A,B), so new = C.
The desired outcome is shown below:
> output
ID seq1 seq2 seq3 new
1 1 2 3 4 C
2 2 1 2 3
3 3 2 3 4 D
Note:
Within a detected sequence, no new products are added to the stock.
In the original data there are a lot of IDs that do not have any qualifying sequences.
Some ID have multiple qualifying sequences. All should be recorded.
Count is always 1 or greater.
The original data holds millions of ID with up to 1500 sequences.
How would you write an efficient piece of code to get this output?
Here's a data.table option:
library(data.table)
char_cols <- c("product", "stock")
setDT(df1)[,
(char_cols) := lapply(.SD, as.character),
.SDcols = char_cols] # in case they're factors
df1[, c1 := (count == 1) &
(shift(count) > 1) &
(shift(count, 2L) == 1),
by = ID] #condition1
df1[, pat := paste0("(", gsub(",", "|", product), ")")] # pattern
df1[, c2 := mapply(grepl, pat, shift(product)) &
!mapply(grepl, pat, shift(stock, 2L)),
by = ID] # condition2
df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
Here's another approach using the tidyverse; however, I think lag and lead have made this solution a bit time-consuming. I included comments within the code to make it more legible.
But I spent enough time on it to post it anyway.
library(tidyverse)
df1 %>% group_by(ID) %>%
# this finds the rows with count > 1 where ...
# ... the counts of the rows before and after each equal 1
mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>%
# this makes a column which is TRUE for each chunk ...
# ... that meets the desired condition, to later filter on it
mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F)) %>%
filter(test2) %>% ungroup() %>%
# group each three occurrences in case of having multiple ones within each ID
group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>%
# creating new column with string extracting techniques ...
#... (assuming those columns are characters)
mutate(new=
str_remove_all(
as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
stock[1])) %>%
# selecting desired columns and adding times for long to wide conversion
select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>%
# long to wide conversion using tidyr (part of tidyverse)
gather(key, value, -ID, -G, -new, -times) %>%
unite(col, key, times) %>% spread(col, value) %>%
# making the desired order of columns
select(-G,-new,new) %>% as.data.frame()
# ID seqs_1 seqs_2 seqs_3 new
# 1 1 2 3 4 C
# 2 2 1 2 3
# 3 3 2 3 4 D
I need to do something similar to below on a very large data set (with many groups), and read somewhere that using .SD is slow. Is there any faster way to perform the following operation?
To be more precise, I need to create a new column that contains the min value for each group after having excluded a subset of observations in that group (something similar to minif in Excel).
library(data.table)
dt <- data.table(valid = c(0,1,1,0,1),
a = c(1,1,2,3,4),
groups = c("A", "A", "A", "B", "B"))
dt[, valid_min := .SD[valid == 1, min(a, na.rm = TRUE)], by = groups]
With the output:
> dt
   valid a groups valid_min
1:     0 1      A         1
2:     1 1      A         1
3:     1 2      A         1
4:     0 3      B         4
5:     1 4      B         4
To make it even more complicated, groups could have no valid entries or they could have multiple valid but missing entries. My current code is similar to this:
dt <- data.table(valid = c(0,1,1,0,1,0,1,1),
a = c(1,1,2,3,4,3,NA,NA),
k = c("A", "A", "A", "B", "B", "C", "D", "D"))
dt[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k]
Output:
> dt
valid a k valid_min
1: 0 1 A 1
2: 1 1 A 1
3: 1 2 A 1
4: 0 3 B 4
5: 1 4 B 4
6: 0 3 C NA
7: 1 NA D NA
8: 1 NA D NA
There's...
dt[dt[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
This should be fast since the inner call to min is optimized for groups. (See ?GForce.)
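If you want to confirm that GForce actually kicks in on your data, you can run the inner aggregation with verbose output; the exact wording varies by data.table version, but the log should mention GForce:
# The verbose log should report something like "GForce optimized j to ..."
dt[valid == 1 & !is.na(a), min(a), by = k, verbose = TRUE]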
We can do the same using dplyr
dt %>%
group_by(groups) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE))
Which gives:
valid a groups valid_min
<dbl> <dbl> <chr> <dbl>
1 0 1 A 1
2 1 1 A 1
3 1 2 A 1
4 0 3 B 4
5 1 4 B 4
Alternatively, if you are not interested in keeping the 'non-valid' rows, we can do the following:
dt %>%
filter(valid == 1) %>%
group_by(groups) %>%
mutate(valid_min = min(a))
Looks like I provided the slowest approach. Comparing each approach (using a larger, replicated data frame called df) with a microbenchmark test:
library(microbenchmark)
library(ggplot2)
mbm <- microbenchmark(
dplyr.test = suppressWarnings(df %>%
group_by(k) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE),
valid_min = ifelse(valid_min == Inf,
NA,
valid_min))),
data.table.test = df[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k],
GForce.test = df[df[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
)
autoplot(mbm)
...well, I tried...
I need to delete all rows that contain a value of 2 or -2, regardless of which column it is in, except column one.
Example dataframe:
df
a b c d
zzz 2 2 -1
yyy 1 1 1
xxx 1 -1 -2
Desired output:
df
a b c d
yyy 1 1 1
I have tried
df <- df[!grepl(-2 | 2, df),]
df <- subset(df, !df[-1] == 2 |!df[-1] == -2)
My actual dataset has over 300 rows and 70 variables.
I believe I need to use some sort of apply function, but I am not sure.
Any help is appreciated; please let me know if you need more info.
We can create a logical index by comparing the absolute values of the dataset with 2 and taking the row-wise sum; if a row has no matching values the sum is 0 (negating with ! returns TRUE for those 0 values and FALSE for the others), and we subset based on that logical index.
df[!rowSums(abs(df[-1])==2),]
# a b c d
#2 yyy 1 1 1
Or another option is to compare within each column using lapply, collapse it to a logical vector with | and use that to subset the rows
df[!Reduce(`|`,lapply(abs(df[-1]), `==`, 2)),]
# a b c d
#2 yyy 1 1 1
We could also do this with tidyverse
library(tidyverse)
df %>%
select(-1) %>% #to remove the first column
map(~abs(.) ==2) %>% #do the columnwise comparison
reduce(`|`) %>% #reduce it to logical vector
`!` %>% #negate to convert TRUE/FALSE to FALSE/TRUE
df[., ] #subset the rows of original dataset
# a b c d
# 2 yyy 1 1 1
data
df <- structure(list(a = c("zzz", "yyy", "xxx"), b = c(2L, 1L, 1L),
c = c(2L, 1L, -1L), d = c(-1L, 1L, -2L)), .Names = c("a",
"b", "c", "d"), class = "data.frame", row.names = c(NA, -3L))
Option with dplyr:
library(dplyr)
a <- c("zzz","yyy","xxx")
b <- c(2,1,1)
c <- c(2,1,-1)
d <- c(-1,1,-2)
df <- data.frame(a,b,c,d)
filter(df,((abs(b) != 2) & (abs(c) != 2) & (abs(d) != 2)))
a b c d
1 yyy 1 1 1
I have the following dataset
#mydata
Factors Transactions
a,c 2
b 0
c 0
d,a 0
a 1
a 0
b 1
I'd like to count those factors that had transactions. For example, "a" appears twice with a transaction. I can write code that gives me the desired outcome for each variable separately. The following is for "a".
nrow (subset (mydata,mydata$Transaction > 0 & length(mydata[grep("a", mydata$Factors),] )> 0))
But I have too many variables and do not want to repeat the code for each of them. I would think there should be a way to write code that gives me the results for all of the variables. I wish to have the following output:
#Output
a 2
b 1
c 1
d 0
Equivalent data.table option:
library(data.table)
setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))),
by = Transactions][,.(Transactions = sum(Transactions > 0)), by = Factors]
# Factors Transactions
#1: a 2
#2: c 1
#3: b 1
#4: d 0
You could create a table using the unique values of the Factor column as the levels. Consider df to be your data set.
s <- strsplit(as.character(df$Factors), ",", fixed = TRUE)
table(factor(unlist(s[df$Transactions > 0]), levels = unique(unlist(s))))
#
# a c b d
# 2 1 1 0
Wrap in as.data.frame() for data frame output.
with(df, {
s <- strsplit(as.character(Factors), ",", fixed = TRUE)
f <- factor(unlist(s[Transactions > 0]), levels = unique(unlist(s)))
as.data.frame(table(Factors = f))
})
# Factors Freq
# 1 a 2
# 2 c 1
# 3 b 1
# 4 d 0
With tidyverse packages, assuming your data is strings/factors and numbers,
library(tidyr)
library(dplyr)
# separate factors with two elements
df %>% separate_rows(Factors) %>%
# set grouping for aggregation
group_by(Factors) %>%
# for each group, count how many transactions are greater than 0
summarise(Transactions = sum(Transactions > 0))
## # A tibble: 4 x 2
## Factors Transactions
## <chr> <int>
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
You could also avoid dplyr by using xtabs, though some cleaning is necessary to get to the same arrangement:
library(tidyr)
df %>% separate_rows(Factors) %>%
xtabs(Transactions > 0 ~ Factors, data = .) %>%
as.data.frame() %>%
setNames(names(df))
## Factors Transactions
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
A full base R equivalent:
df2 <- do.call(rbind,
Map(function(f, t){data.frame(Factors = strsplit(as.character(f), ',')[[1]],
Transactions = t)},
df$Factors, df$Transactions))
df3 <- as.data.frame(xtabs(Transactions > 0 ~ Factors, data = df2))
names(df3) <- names(df)
df3
## Factors Transactions
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
We can use cSplit from splitstackshape to split 'Factors' into 'long' format and, grouped by 'Factors', get the sum of the logical column (Transactions > 0).
library(splitstackshape)
cSplit(df1, "Factors", ",", "long")[, .(Transactions=sum(Transactions > 0)),.(Factors)]
# Factors Transactions
#1: a 2
#2: c 1
#3: b 1
#4: d 0
Or using base R
with(df1, table(factor(unlist(strsplit(Factors[Transactions>0], ",")),
levels = letters[1:4]) ))
# a b c d
# 2 1 1 0
data
df1 <- structure(list(Factors = c("a,c", "b", "c", "d,a", "a", "a",
"b"), Transactions = c(2L, 0L, 0L, 0L, 1L, 0L, 1L)), .Names = c("Factors",
"Transactions"), class = "data.frame", row.names = c(NA, -7L))