How to get exclusive count in R dataframe
I have the below-mentioned dataframes in R:
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
Using these two dataframes, I need the output in the following format, where I identify the exclusive count of Check and Remarks for each unique ID, as well as the count of each Check in combination with every other Check (and likewise for Remarks).
Note: the rows should be ordered from highest to lowest Exclusive_Count of Check. The numbers of unique Check and Remarks values may well differ in my actual dataframe (e.g., 10 unique Remarks and only 5 Checks).
DF_Output<-
Remark Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 OT LP
Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 1 0
X2 0 0.00% 2 0 1 0 A2 1 25.00% 1 0 1 1
XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 0 1
XT 1 33.33% 1 0 1 0 LP 0 0.00% 0 1 1 0
Total 3 100.00% 5 4 4 2 Total 4 100.00% 2 3 3 2
The OP has requested a canonical answer. So, I have created a function get_exclusive_counts() which takes the first two columns of any tibble, data.frame, or data.table where the first column contains IDs and the second column contains the payload, e.g., Check, in long format.
The function is independent of column names and will work with an arbitrary number of different items in the payload column. It returns a data.table for each input tibble:
get_exclusive_counts(DF)
Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0
6: Totals 4 100.00% 2 3 2 3
For the second use case DF2, the payload needs to be split into separate rows beforehand:
library(magrittr)
DF2 %>%
  dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
  tidyr::separate_rows(Remarks) %>%
  get_exclusive_counts()
Remarks Exclusive_Count % X1 X2 XR XT
1: Blank 1 33.33% 0 0 0 0
2: X1 1 33.33% 0 2 2 1
3: XT 1 33.33% 1 0 1 0
4: X2 0 0.00% 2 0 2 0
5: XR 0 0.00% 2 2 0 1
6: Totals 3 100.00% 5 4 5 2
Note that the name of the first column of the result table has been retained from the input data.frame.
The OP has mentioned that the number of Remarks and Check values may differ. Therefore, it doesn't really make sense to cbind() the two result tables, because this will only give a reasonable result if the numbers of rows are the same.
Also, OP's expected result has some column names repeated (at least Exclusive_Count and %, perhaps more), which indicates that the result may not be used for further processing but for display/printing only.
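As a quick illustration of the cbind() issue (plain base R behaviour, independent of this answer): with differing row counts, cbind() on data.frames recycles the shorter one instead of padding, silently producing misleading rows.
cbind(data.frame(a = 1:4), data.frame(b = 1:2))
#>   a b
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 4 2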
Printing results side by side
However, I have created a function get_exclusive_counts_side_by_side() which prints the results from calling get_exclusive_counts()
for an arbitrary number of input datasets,
with differing numbers of rows, and
with the last rows (Totals) aligned.
The function returns a data.table with character columns.
The call below will reproduce OP's expected result:
get_exclusive_counts_side_by_side(
  DF2 %>%
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
    tidyr::separate_rows(Remarks),
  DF)
Remarks Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 LP OT
1: Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
2: X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 0 1
3: XT 1 33.33% 1 0 1 0 A2 1 25.00% 1 0 1 1
4: X2 0 0.00% 2 0 2 0 LP 0 0.00% 0 1 0 1
5: XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 1 0
6: Totals 3 100.00% 5 4 5 2 Totals 4 100.00% 2 3 2 3
Here is another use case to demonstrate that it will work with differing rows and an arbitrary number of input datasets:
get_exclusive_counts_side_by_side(
  DF,
  DF3 %>%
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
    tidyr::separate_rows(Remarks),
  DF)
Check Exclusive_Count % A1 A2 LP OT Remarks Exclusive_Count % X1 X2 XR XT Y2 Y3 Y4 Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0 X1 2 50.00% 0 2 2 1 1 1 0 Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1 Blank 1 25.00% 0 0 0 0 0 0 0 A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1 XT 1 25.00% 1 0 1 0 0 0 0 A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1 X2 0 0.00% 2 0 2 0 0 0 0 LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0 XR 0 0.00% 2 2 0 1 0 0 0 OT 0 0.00% 1 1 1 0
6: Y2 0 0.00% 1 0 0 0 0 1 1
7: Y3 0 0.00% 1 0 0 0 1 0 0
8: Y4 0 0.00% 0 0 0 0 1 0 0
9: Totals 4 100.00% 2 3 2 3 Totals 4 100.00% 7 4 5 2 3 2 1 Totals 4 100.00% 2 3 2 3
Function definitions
The code looks rather bulky but half of the lines are comments. So, the code should be fairly self-explanatory.
Also, about half of the lines of code are due to OP's additional requirements, like a % column or a Totals row.
get_exclusive_counts <- function(DF) {
  library(data.table)
  library(magrittr)
  # make copy of first 2 cols to preserve original attributes of DF
  DT <- as.data.table(DF[, 1:2])
  # retain original column names
  old <- colnames(DT)[1:2]
  # rename colnames in copy for convenience of programming
  setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
  # aggregate by id to find exclusive counts = ids with only one element
  tmp <- DT[, .N, keyby = id][N == 1L]
  # create table of exclusive counts by joining and aggregating
  excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>%
    # append column of proportions, will be formatted after computing Totals
    .[, `%` := Exclusive_Count / sum(Exclusive_Count)]
  # anti-join to find remaining rows
  rem <- DT[!tmp, on = .(id)]
  # create co-occurrence matrix in long format by a self-join
  coocc <- rem[rem, on = .(id), allow.cartesian = TRUE] %>%
    # reshape to wide format and compute counts of co-occurrences w/o diagonals
    dcast(val ~ i.val, length, subset = .(val != i.val))
  # build final result table by merging both subresults
  merge(excl, coocc, by = "val", all = TRUE) %>%
    # replace NA counts by 0
    .[, lapply(.SD, nafill, fill = 0L), by = val] %>%
    # clean-up: order by decreasing Exclusive_Count
    .[order(-Exclusive_Count)] %>%
    # append Totals row
    rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>%
    # clean-up: format proportion as percentage
    .[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>%
    # clean-up: replace <NA> by "Blank" in val column
    .[is.na(val), val := "Blank"] %>%
    # rename val column
    setnames("val", old[2]) %>%
    # return result visibly
    .[]
}
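As a quick sanity check of the column-name independence claimed above (the renamed copy is purely hypothetical):
# the function works on column positions, so renamed inputs run unchanged;
# the first column of the result is then named "flag" instead of "Check"
get_exclusive_counts(setNames(DF, c("key", "flag")))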
Here is the code for get_exclusive_counts_side_by_side():
get_exclusive_counts_side_by_side <- function(...) {
  library(data.table)
  library(magrittr)
  # process input, return list of subresults
  ec_list <- list(...) %>%
    lapply(get_exclusive_counts)
  # create row indices for maximum rows
  rid <- ec_list %>%
    lapply(nrow) %>%
    Reduce(max, .) %>%
    {data.table(.rowid = 1:.)}
  # combine subresults
  ec_list %>%
    # insert empty rows if necessary
    lapply(function(.x) .x[
      , .rowid := .I][
        # but align last row
        .rowid == .N, .rowid := nrow(rid)][
        rid, on = .(.rowid)][
        , .rowid := NULL]
    ) %>%
    # all data.tables have the same number of rows, now cbind()
    do.call(cbind, .) %>%
    # replace all NA by empty character strings
    .[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}
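To see the Totals-alignment trick in isolation, here is a standalone sketch with hypothetical toy tables a and b (the real function applies the same idea to the subresults):
library(data.table)
a <- data.table(x = c("p", "Totals"))
b <- data.table(y = c("q", "r", "Totals"))
n <- max(nrow(a), nrow(b))
pad <- function(dt) {
  dt <- copy(dt)                    # don't modify the input by reference
  dt[, .rowid := .I]                # index the rows
  dt[.rowid == .N, .rowid := n]     # pin the last (Totals) row to slot n
  dt[data.table(.rowid = 1:n), on = .(.rowid)][, .rowid := NULL][]
}
do.call(cbind, lapply(list(a, b), pad))
#         x      y
# 1:      p      q
# 2:   <NA>      r
# 3: Totals Totals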
Additional explanation
If I understand correctly, exclusive counts refer to IDs which have only one item (or NA) assigned to them. This is fairly straightforward to compute (see the short sketch after this list) by
counting the number of items per ID,
picking the IDs with only one item,
picking the rows in the input data.frame which belong to those IDs (using a join), and
counting the appearances of the items in the subset of exclusive rows.
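Here is a minimal dplyr sketch of these four steps on the sample DF (illustrative only; get_exclusive_counts() above generalizes this and adds the extras):
library(dplyr)
DF %>%
  add_count(ID, name = "n_items") %>%      # 1. count items per ID
  filter(n_items == 1) %>%                 # 2./3. keep rows of single-item IDs
  count(Check, name = "Exclusive_Count")   # 4. count item appearances
#> A1: 1, A2: 1, NA ("Blank"): 2 -- matching the Exclusive_Count column above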
Furthermore, the function deals with OP's additional requirements which go beyond the identification of exclusive counts:
adding a matrix of co-occurrence counts of the remaining, non-exclusive rows,
adding a column of proportions of exclusive counts at a specific position and formatting it as percent,
adding a Totals row,
replacing NAs by zero or "Blank", respectively.
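For instance, the percent formatting is a single sprintf() call, as used in the function body above:
# format proportions as percentages (same pattern as in get_exclusive_counts())
sprintf("%3.2f%%", 100 * c(0.5, 0.25, 0))
#> [1] "50.00%" "25.00%" "0.00%"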
Data
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
DF3 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}",
"I-8", "{X1,Y2,Y3}",
"I-9", "{Y2,Y4}",
"I10", "{X1}",
)
I think this will do what you're looking for... Likely not the most succinct, but seems to do the trick.
# Load Library
library('tidyverse')
### CHECK ###
# Load Check Table
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
# Count by ID
DF <- DF %>%
  group_by(ID) %>%
  mutate(count = n())
# Count by Check
DF_X <- DF %>%
  dplyr::filter(count == 1) %>%
  group_by(Check) %>%
  dplyr::summarize("Count" = sum(count))
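# With the sample data, the IDs I-1, I-4, I-5 and I-7 have exactly one row
# each, so DF_X at this point should look roughly like this (hand-derived,
# for orientation only):
#   Check Count
#   A1        1
#   A2        1
#   <NA>      2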
# Identify unique values of Check
DF_UNIQUE <- unique(DF$Check)
DF_FIN <- data.frame("Check" = DF_UNIQUE, stringsAsFactors = FALSE)
# Join Counts by Check with unique list of Checks
DF_FIN <- left_join(x = DF_FIN, y = DF_X, by = "Check")
# Replace NA's with zeros
DF_FIN[is.na(DF_FIN$Count), 2] <- 0
# Calculate Percentages
DF_FIN <- DF_FIN %>%
  mutate("Check Percentage" = `Count` / sum(`Count`))
# Rename Columns
colnames(DF_FIN) <- c("Check", "Exclusive Count", "Check Percentage")
# Replace NA value with the word "BLANK"
DF_FIN[is.na(DF_FIN$Check), 1] <- "BLANK"
# Sort by Exclusive Count and then by Check (alphabetical)
DF_FIN <- DF_FIN %>%
  arrange(desc(`Exclusive Count`), Check)
# Join Checks to itself and count instances
DF_CHECKS <- full_join(x = DF, y = DF, by = "ID")
DF_CHECKS <- DF_CHECKS %>%
  group_by(Check.x, Check.y) %>%
  dplyr::summarize("N" = n())
DF_CHECKS_SPREAD <- DF_CHECKS %>%
  tidyr::pivot_wider(names_from = Check.y, values_from = N)
# Reorder the value columns to match the row order, keeping the label column
check_order <- DF_CHECKS_SPREAD$Check.x
check_order[is.na(check_order)] <- 'NA'
DF_CHECKS_SPREAD <- DF_CHECKS_SPREAD %>% select(Check.x, all_of(check_order))
# Set the diagonal to zeros
for (i in 1:nrow(DF_CHECKS_SPREAD)) {
  DF_CHECKS_SPREAD[i, i + 1] <- 0
}
# Rename Columns
colnames(DF_CHECKS_SPREAD)[1] <- "Check"
colnames(DF_CHECKS_SPREAD)[colnames(DF_CHECKS_SPREAD) == "NA"] <- "BLANK"
# Drop the BLANK column
DF_CHECKS_SPREAD$BLANK <- NULL
# Replace NA value with the word "BLANK"
DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD$Check), 1] <- "BLANK"
# Replace all other NA's with zero
DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD)] <- 0
# Join the two Checks data sets together & calculate grand totals
FINAL_TABLE_CHECKS <- left_join(x = DF_FIN, y = DF_CHECKS_SPREAD, by = "Check")
FINAL_TABLE_CHECKS <- FINAL_TABLE_CHECKS %>%
  bind_rows(summarise(.,
                      across(where(is.numeric), sum),
                      across(where(is.character), ~"Total")))
### REMARKS ###
# Load Remarks table
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
# Remove the {} from the Remarks string
DF2$Remarks <- str_replace_all(string = DF2$Remarks, c("\\{" = "", "\\}" = ""))
# Expand string into rows
DF2 <- separate_rows(DF2, Remarks, convert = TRUE)
# Group and count by ID
DF2 <- DF2 %>%
  group_by(ID) %>%
  mutate(count = n())
# Count by Remarks
DF2_X <- DF2 %>%
  dplyr::filter(count == 1) %>%
  group_by(Remarks) %>%
  dplyr::summarize("Count" = sum(count))
# Identify unique Remarks
DF2_UNIQUE <- unique(DF2$Remarks)
DF2_FIN <- data.frame("Remarks" = DF2_UNIQUE, stringsAsFactors = FALSE)
# Join count of Remarks with unique list of Remarks
DF2_FIN <- left_join(x = DF2_FIN, y = DF2_X, by = "Remarks")
# Replace NA's with zeros
DF2_FIN[is.na(DF2_FIN$Count), 2] <- 0
# Calculate Percentages
DF2_FIN <- DF2_FIN %>%
  mutate("Remarks Percentage" = `Count` / sum(`Count`))
# Rename columns
colnames(DF2_FIN) <- c("Remarks", "Exclusive Count", "Remarks Percentage")
# Replace NA value with the word "BLANK"
DF2_FIN[is.na(DF2_FIN$Remarks), 1] <- "BLANK"
# Sort by Exclusive Count and then by Remarks (alphabetical)
DF2_FIN <- DF2_FIN %>%
  arrange(desc(`Exclusive Count`), Remarks)
# Join Remarks to itself and count instances
DF_REMARKS <- full_join(x = DF2, y = DF2, by = "ID")
DF_REMARKS <- DF_REMARKS %>%
  group_by(Remarks.x, Remarks.y) %>%
  dplyr::summarize("N" = n())
DF_REMARKS_SPREAD <- DF_REMARKS %>%
  tidyr::pivot_wider(names_from = Remarks.y, values_from = N)
# Reorder the value columns to match the row order, keeping the label column
check_order <- DF_REMARKS_SPREAD$Remarks.x
check_order[is.na(check_order)] <- 'NA'
DF_REMARKS_SPREAD <- DF_REMARKS_SPREAD %>% select(Remarks.x, all_of(check_order))
# Set the diagonal to zeros
for (i in 1:nrow(DF_REMARKS_SPREAD)) {
  DF_REMARKS_SPREAD[i, i + 1] <- 0
}
# Rename Columns
colnames(DF_REMARKS_SPREAD)[1] <- "Remarks"
colnames(DF_REMARKS_SPREAD)[colnames(DF_REMARKS_SPREAD) == "NA"] <- "BLANK"
# Drop the BLANK column
DF_REMARKS_SPREAD$BLANK <- NULL
# Replace NA value with the word "BLANK"
DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD$Remarks), 1] <- "BLANK"
# Replace all other NA's with zero
DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD)] <- 0
# Join the two Remarks data sets together & calculate grand totals
FINAL_TABLE_REMARKS <- left_join(x = DF2_FIN, y = DF_REMARKS_SPREAD, by = "Remarks")
FINAL_TABLE_REMARKS <- FINAL_TABLE_REMARKS %>%
  bind_rows(summarise(.,
                      across(where(is.numeric), sum),
                      across(where(is.character), ~"Total")))
# Count rows in the Checks and Remarks dataframes and add rows to the one
# with fewer rows to match the other, keeping the Totals row last.
checkRows <- nrow(FINAL_TABLE_CHECKS)
remarksRows <- nrow(FINAL_TABLE_REMARKS)
rowDiff <- abs(checkRows - remarksRows)
if (checkRows < remarksRows) {
  cat("Adding", rowDiff, "rows to the Checks dataframe.\n\n")
  FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS) + rowDiff, ] <- NA
  FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS), ] <- FINAL_TABLE_CHECKS[checkRows, ]
  FINAL_TABLE_CHECKS[checkRows, ] <- NA
} else if (remarksRows < checkRows) {
  cat("Adding", rowDiff, "rows to the Remarks dataframe.\n\n")
  FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS) + rowDiff, ] <- NA
  FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS), ] <- FINAL_TABLE_REMARKS[remarksRows, ]
  FINAL_TABLE_REMARKS[remarksRows, ] <- NA
} else {
  cat("There is no difference in number of rows between Checks and Remarks.\n\n")
}
# Combine columns from Checks and Remarks into one table.
RESULTS <- cbind(FINAL_TABLE_REMARKS, FINAL_TABLE_CHECKS)
RESULTS$`Check Percentage` <- paste(round(100 * RESULTS$`Check Percentage`, 2), "%", sep = "")
RESULTS$`Remarks Percentage` <- paste(round(100 * RESULTS$`Remarks Percentage`, 2), "%", sep = "")
RESULTS