Efficient code to remove rows containing non-unique max? - r

Here's a simple example of an array for which I want to extract only those rows whose max value is unique (in that row).
foo <- expand.grid(1:3,1:3,1:3)
Var1 Var2 Var3
1 1 1 1
2 2 1 1
3 3 1 1
4 1 2 1
5 2 2 1
6 3 2 1
7 1 3 1
8 2 3 1
9 3 3 1
10 1 1 2
11 2 1 2
12 3 1 2
13 1 2 2
14 2 2 2
15 3 2 2
16 1 3 2
17 2 3 2
18 3 3 2
19 1 1 3
20 2 1 3
21 3 1 3
22 1 2 3
23 2 2 3
24 3 2 3
25 1 3 3
26 2 3 3
27 3 3 3
I've got working code:
winners <- max.col(foo)
finddupe <- rep(0,length=length(winners))
for (jf in 1:length(winners)) finddupe[jf] <- sum(foo[jf,] == foo[jf, winners[jf] ] )
winners <- winners[finddupe == 1]
foo <- foo[finddupe == 1, ]
That just looks inefficient to me.
I'd prefer a solution that uses only base R calls, but I am open to using tools from other packages.

Another base R solution:
subset(foo, max.col(foo, 'first') == max.col(foo, 'last'))
Var1 Var2 Var3
2 2 1 1
3 3 1 1
4 1 2 1
6 3 2 1
7 1 3 1
8 2 3 1
10 1 1 2
12 3 1 2
15 3 2 2
16 1 3 2
17 2 3 2
19 1 1 3
20 2 1 3
22 1 2 3
23 2 2 3
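Why this works: with ties.method = "first", max.col returns the column index of the first maximum in each row, and with "last" it returns the index of the last one. The two indices agree only when the maximum occurs exactly once. A minimal sketch that spells out the intermediate steps (the names m, first_max, last_max and unique_max are illustrative, not part of the answer):

```r
# Same example data as in the question
foo <- expand.grid(1:3, 1:3, 1:3)
m <- as.matrix(foo)

# Column index of the first and of the last maximum in each row
first_max <- max.col(m, ties.method = "first")
last_max  <- max.col(m, ties.method = "last")

# The maximum is unique exactly when both indices coincide
unique_max <- first_max == last_max

result <- foo[unique_max, ]
nrow(result)
```

Passing ties.method explicitly also avoids the default "random" tie-breaking, which would make a comparison like this non-deterministic.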
Same logic as above in dplyr way:
library(dplyr)
foo %>%
filter(max.col(., 'first') == max.col(., 'last'))

Create a column holding the row maximum with pmax across all the columns, then keep only the rows where exactly one value equals that maximum, counted by applying rowSums to the logical comparison:
library(dplyr)
foo %>%
mutate(mx = do.call(pmax, c(across(everything()), na.rm = TRUE))) %>%
filter(rowSums(across(Var1:Var3, ~ .x == mx), na.rm = TRUE) == 1)
-output
Var1 Var2 Var3 mx
1 2 1 1 2
2 3 1 1 3
3 1 2 1 2
4 3 2 1 3
5 1 3 1 3
6 2 3 1 3
7 1 1 2 2
8 3 1 2 3
9 3 2 2 3
10 1 3 2 3
11 2 3 2 3
12 1 1 3 3
13 2 1 3 3
14 1 2 3 3
15 2 2 3 3
Or with base R
subset(foo, rowSums(foo == do.call(pmax, c(foo, na.rm = TRUE)),
na.rm = TRUE) == 1)

A base R approach using apply
foo[apply(foo, 1, function(x) sum(x[which.max(x)] == x) <= 1), ]
Var1 Var2 Var3
2 2 1 1
3 3 1 1
4 1 2 1
6 3 2 1
7 1 3 1
8 2 3 1
10 1 1 2
12 3 1 2
15 3 2 2
16 1 3 2
17 2 3 2
19 1 1 3
20 2 1 3
22 1 2 3
23 2 2 3

After verifying the answers so far (18:00 EST Weds 15 Feb), I ran a benchmark comparison. @onyambu wins the race (cgw is me; ak** are akrun's solutions).
bar5 = 1:5
foo55 <- expand.grid(bar5,bar5,bar5,bar5,bar5)
microbenchmark(ony(foo55), cgw(foo55), akply(foo55), akbase(foo55), andre(foo55))
Unit: microseconds
expr min lq mean median uq max neval cld
ony(foo55) 455.117 495.2335 589.6801 517.3755 634.9795 3107.222 100 a
cgw(foo55) 314076.038 317184.4050 348711.9522 319784.5870 324921.0335 2691161.873 100 b
akply(foo55) 14156.653 14835.2230 16194.3699 15160.0270 16441.3550 74019.622 100 a
akbase(foo55) 858.969 896.8310 1055.4277 970.6395 1117.2420 4098.860 100 a
andre(foo55) 8161.406 8531.1700 9188.4801 8872.0325 9284.0995 14548.383 100 a
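The wrapper functions used in the benchmark are not shown above. As a sketch only, assuming each wrapper simply wraps the corresponding one-liner from the answers (the function bodies below are my reconstruction, not the benchmark author's code), one could check that three of the approaches select the same rows before timing them:

```r
# Hypothetical wrappers around the base R answers above
ony <- function(d) subset(d, max.col(d, "first") == max.col(d, "last"))
akbase <- function(d) {
  subset(d, rowSums(d == do.call(pmax, c(d, na.rm = TRUE)), na.rm = TRUE) == 1)
}
andre <- function(d) d[apply(d, 1, function(x) sum(x[which.max(x)] == x) <= 1), ]

bar5 <- 1:5
foo55 <- expand.grid(bar5, bar5, bar5, bar5, bar5)

# Compare the surviving row names rather than whole objects, so that
# differences in attributes cannot cause a false mismatch
same_rows <- identical(rownames(ony(foo55)), rownames(akbase(foo55))) &&
  identical(rownames(ony(foo55)), rownames(andre(foo55)))
same_rows
```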

Related

Sorting out the data with specific headers in R

A small sample of the data are as follows:
df<-read.table (text=" ID Class1a Time1a MD1a MD2a Class1b Time1b MD1b MD2b Class2a Time2a MD3a MD4a Class2b Time2b MD3b MD4b Class3a Time3a MD5a MD6a Class3b Time3b MD5b MD6b
1 1 1 1 2 2 1 1 2 9 2 2 2 10 2 1 1 17 3 2 2 18 3 1 1
2 3 1 1 1 4 1 2 1 11 2 2 1 12 2 1 1 19 3 2 1 20 3 1 1
3 5 1 2 1 6 1 2 2 13 2 1 1 14 2 2 2 21 3 1 1 22 3 2 2
4 7 1 1 1 8 1 2 2 15 2 1 1 16 2 1 1 23 3 1 1 24 3 1 1
", header=TRUE)
I want to get the following output, especially headers
ID Class Time MD MD1 MD2
1 1 1 1-2 1 2
2 3 1 1-2 1 1
3 5 1 1-2 2 1
4 7 1 1-2 1 1
1 2 1 1-2 1 2
2 4 1 1-2 2 2
3 6 1 1-2 2 2
4 8 1 1-2 2 2
1 9 2 3-4 2 2
2 11 2 3-4 2 1
3 13 2 3-4 1 1
4 15 2 3-4 1 1
1 10 2 3-4 2 1
2 12 2 3-4 2 1
3 14 2 3-4 2 2
4 16 2 3-4 2 1
1 17 3 5-6 2 2
2 19 3 5-6 2 2
3 21 3 5-6 1 2
4 23 3 5-6 1 2
1 18 3 5-6 1 1
2 20 3 5-6 1 1
3 22 3 5-6 2 2
4 24 3 5-6 1 1
df1<- df %>% pivot_longer(
cols = starts_with("Time"),
names_to = "Q",
values_to = "Score",
values_drop_na = TRUE)
df2<- df1 %>% pivot_longer(
cols = starts_with("Class"),
names_prefix = "MD",
values_drop_na = TRUE
) %>% dplyr::select(-value)
But I have failed to get the output of interest.
This answer started as a pivot_longer example using names_pattern, but while renaming some of them made sense, it becomes less intuitive how to easily extract the MD column (e.g., 1-2, 3-4) during the pivoting process.
Instead, let's split the frame by column-group, rename the columns as you'd like, then bind_rows them.
bind_rows(
lapply(split.default(df[,-1], cumsum(grepl("Class", names(df)[-1]))),
function(Z) {
out <- transform(Z,
ID = df$ID,
MD = paste(gsub("\\D", "", grep("^MD", names(Z), value = TRUE)), collapse = "-"))
names(out)[1:4] <- c("Class", "Time", "MD1", "MD2")
out
})
)
# Class Time MD1 MD2 ID MD
# 1 1 1 1 2 1 1-2
# 2 3 1 1 1 2 1-2
# 3 5 1 2 1 3 1-2
# 4 7 1 1 1 4 1-2
# 5 2 1 1 2 1 1-2
# 6 4 1 2 1 2 1-2
# 7 6 1 2 2 3 1-2
# 8 8 1 2 2 4 1-2
# 9 9 2 2 2 1 3-4
# 10 11 2 2 1 2 3-4
# 11 13 2 1 1 3 3-4
# 12 15 2 1 1 4 3-4
# 13 10 2 1 1 1 3-4
# 14 12 2 1 1 2 3-4
# 15 14 2 2 2 3 3-4
# 16 16 2 1 1 4 3-4
# 17 17 3 2 2 1 5-6
# 18 19 3 2 1 2 5-6
# 19 21 3 1 1 3 5-6
# 20 23 3 1 1 4 5-6
# 21 18 3 1 1 1 5-6
# 22 20 3 1 1 2 5-6
# 23 22 3 2 2 3 5-6
# 24 24 3 1 1 4 5-6
This relies on:
ID being the first column (ergo df[,-1] and names(df)[-1]), and
Each group of columns starting with a Class* column.
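To see how the grouping index is built, here is a small sketch on a shortened, made-up vector of column names (nms and grp are illustrative names): grepl flags every Class* column, and cumsum turns those flags into a running group number that split.default can use.

```r
# First eight non-ID column names from the example, for illustration
nms <- c("Class1a", "Time1a", "MD1a", "MD2a",
         "Class1b", "Time1b", "MD1b", "MD2b")

# TRUE at each Class* column; the running sum starts a new group there
grp <- cumsum(grepl("Class", nms))
grp

# The same index splits the names (or the columns of a data frame) into groups
split(nms, grp)
```

If a Class* column did not come first in its group, the columns before it would be assigned to the previous group, which is why the second assumption matters.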

identify whenever values repeat in r

I have a dataframe like this.
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3))
I want to populate a new variable Sequence which identifies whenever Condition starts again from 1.
So the new dataframe would look like this.
Thanks in advance for the help!
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3),
Sequence = c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3))
base R
data$Sequence2 <- cumsum(c(TRUE, data$Condition[-1] == 1 & data$Condition[-nrow(data)] != 1))
data
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
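The one-liner compares each value with its predecessor: a new sequence starts wherever the current value is 1 and the previous value is not 1, and the first row always starts one. A minimal sketch with the intermediate vector made explicit (is_start is an illustrative name):

```r
Condition <- c(1, 1, 2, 3, 1, 1, 2, 2, 2, 3, 1, 1, 2, 3, 3)
n <- length(Condition)

# Condition[-1] is "current value" for rows 2..n,
# Condition[-n] is "previous value" for the same rows
is_start <- c(TRUE, Condition[-1] == 1 & Condition[-n] != 1)

# Rows where a new sequence begins
which(is_start)

# Running count of starts gives the sequence id
Sequence <- cumsum(is_start)
Sequence
```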
dplyr
library(dplyr)
data %>%
mutate(
Sequence2 = cumsum(Condition == 1 & lag(Condition != 1, default = TRUE))
)
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
This took a while, but I finally found this solution:
library(dplyr)
data %>%
# Note: `-` binds tighter than `==`, so a row is counted when it and the next row are both 1
group_by(Sequence = cumsum(
ifelse(Condition==1, lead(Condition)+1, Condition)
- Condition==1)
)
Condition Sequence
<dbl> <int>
1 1 1
2 1 1
3 2 1
4 3 1
5 1 2
6 1 2
7 2 2
8 2 2
9 2 2
10 3 2
11 1 3
12 1 3
13 2 3
14 3 3
15 3 3

Delete rows with value if not only value in group

I'm somewhat new to R and need to delete rows based on multiple criteria. The data frame has 3 columns, and within each rid and qid group I need to delete the rows where bid = 99 if that group also contains values less than 99. In the desired output, each rid and qid group has either one or more bid values less than 99, or only bid = 99.
rid qid bid
1 1 5
1 1 6
1 1 99
1 2 6
2 1 7
2 1 99
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
3 2 99
4 1 2
4 1 6
4 2 1
4 2 2
4 2 99
5 1 99
5 2 99
The expected output...
rid qid bid
1 1 5
1 1 6
1 2 6
2 1 7
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
4 1 2
4 1 6
4 2 1
4 2 2
5 1 99
5 2 99
Any assistance would be appreciated.
You can use the base R function ave to generate a dropping variable like this:
df$dropper <- with(df, ave(bid, rid, qid, FUN= function(i) i == 99 & length(i) > 1))
ave calculates a function on bid, grouping by rid and qid. The function tests if each element of the grouped bid values i is 99 and if i has a length greater than 1. Also, with is used to reduce typing.
which returns
df
rid qid bid dropper
1 1 1 5 0
2 1 1 6 0
3 1 1 99 1
4 1 2 6 0
5 2 1 7 0
6 2 1 99 1
7 2 2 2 0
8 2 2 3 0
9 3 1 7 0
10 3 1 8 0
11 3 2 1 0
12 3 2 99 1
13 4 1 2 0
14 4 1 6 0
15 4 2 1 0
16 4 2 2 0
17 4 2 99 1
18 5 1 99 0
19 5 2 99 0
then drop the undesired observations with df[df$dropper == 0, 1:3], which will simultaneously drop the new variable.
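A self-contained sketch of the ave approach on the question's data follows. It swaps in a slightly different test, any(i < 99) instead of length(i) > 1, which is my variation rather than the answer's code: it also keeps groups that consist of several rows that are all 99. The names dropper and kept are illustrative.

```r
df <- data.frame(
  rid = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5),
  qid = c(1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2),
  bid = c(5, 6, 99, 6, 7, 99, 2, 3, 7, 8, 1, 99, 2, 6, 1, 2, 99, 99, 99)
)

# Within each (rid, qid) group, flag a row when it is 99 and the
# group also contains a value below 99; ave returns 0/1 here
dropper <- with(df, ave(bid, rid, qid,
                        FUN = function(i) i == 99 & any(i < 99)))

# Keep only the unflagged rows
kept <- df[dropper == 0, ]
kept
```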
If you want to just delete rows where bid = 99 then use dplyr.
library(dplyr)
df <- df %>%
filter(bid != 99)
Here df is your data frame, and != means "not equal to".
Updated solution using dplyr
df %>%
group_by(rid, qid) %>%
mutate(tempcount = n())%>%
ungroup() %>%
mutate(DropValue =ifelse(bid == 99 & tempcount > 1, 1,0) ) %>%
filter(DropValue == 0) %>%
select(rid,qid,bid)
Here is another option with all and if condition in data.table to subset the rows after grouping by 'rid' and 'qid'
library(data.table)
setDT(df1)[, if(all(bid==99)) .SD else .SD[bid!= 99], .(rid, qid)]
# rid qid bid
# 1: 1 1 5
# 2: 1 1 6
# 3: 1 2 6
# 4: 2 1 7
# 5: 2 2 2
# 6: 2 2 3
# 7: 3 1 7
# 8: 3 1 8
# 9: 3 2 1
#10: 4 1 2
#11: 4 1 6
#12: 4 2 1
#13: 4 2 2
#14: 5 1 99
#15: 5 2 99
Or without using the if
setDT(df1)[df1[, .I[all(bid==99) | bid != 99], .(rid, qid)]$V1]
Here is a solution using dplyr, which is a very expressive framework for this kind of problem.
df <- read.table(text =
" rid qid bid
1 1 5
1 1 6
1 1 99
1 2 6
2 1 7
2 1 99
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
3 2 99
4 1 2
4 1 6
4 2 1
4 2 2
4 2 99
5 1 99
5 2 99",
header = TRUE, stringsAsFactors = FALSE)
dplyr verbs let you express the program in terms that are very close to those of your question:
library(dplyr)
res <-
df %>%
group_by(rid, qid) %>%
filter(!(any(bid < 99) & bid == 99)) %>%
ungroup()
# # A tibble: 15 × 3
# rid qid bid
# <int> <int> <int>
# 1 1 1 5
# 2 1 1 6
# 3 1 2 6
# 4 2 1 7
# 5 2 2 2
# 6 2 2 3
# 7 3 1 7
# 8 3 1 8
# 9 3 2 1
# 10 4 1 2
# 11 4 1 6
# 12 4 2 1
# 13 4 2 2
# 14 5 1 99
# 15 5 2 99
Let's check we get the desired output:
desired_output <- read.table(text =
" rid qid bid
1 1 5
1 1 6
1 2 6
2 1 7
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
4 1 2
4 1 6
4 2 1
4 2 2
5 1 99
5 2 99",
header = TRUE, stringsAsFactors = FALSE)
identical(as.data.frame(res), desired_output)
# [1] TRUE

Count with table() and exclude 0's

I am trying to count triplets; for this I use three vectors packed into a data frame:
X=c(4,4,4,4,4,4,4,4,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
Y=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,4,2,2,2,2,3,4,1,1,2,2,3,3,4,4)
Z=c(4,4,5,4,4,4,4,4,6,1,1,1,1,1,1,1,2,2,2,2,7,2,3,3,3,3,3,3,3,3)
Count_Frame=data.frame(matrix(NA, nrow=(length(X)), ncol=3))
Count_Frame[1]=X
Count_Frame[2]=Y
Count_Frame[3]=Z
Counts=data.frame(table(Count_Frame))
There is the following problem: if I increase the value range in the vectors or use even more vectors the "Counts" dataframe quickly approaches its size limit due to the many 0-counts. Is there a way to exclude the 0-counts while generating "Counts"?
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(Count_Frame)) and, grouped by all the columns (.(X, Y, Z)), get the number of rows (.N).
library(data.table)
setDT(Count_Frame)[,.N ,.(X, Y, Z)]
# X Y Z N
# 1: 4 1 4 7
# 2: 4 1 5 1
# 3: 1 1 6 1
# 4: 1 1 1 3
# 5: 1 2 1 2
# 6: 1 3 1 1
# 7: 1 4 1 1
# 8: 2 2 2 4
# 9: 2 3 7 1
#10: 2 4 2 1
#11: 3 1 3 2
#12: 3 2 3 2
#13: 3 3 3 2
#14: 3 4 3 2
Instead of naming all the columns, we can use names(Count_Frame) as well (if there are many columns)
setDT(Count_Frame)[,.N , names(Count_Frame)]
You can accomplish this with aggregate:
Count_Frame$one <- 1
aggregate(one ~ X1 + X2 + X3, data=Count_Frame, FUN=sum)
This will calculate the positive instances of table, but will not list the zero counts.
One solution is to create a combination of the column values and count those instead:
library(tidyr)
as.data.frame(table(unite(Count_Frame, tmp, X1, X2, X3))) %>%
separate(Var1, c('X1', 'X2', 'X3'))
Resulting output is:
X1 X2 X3 Freq
1 1 1 1 3
2 1 1 6 1
3 1 2 1 2
4 1 3 1 1
5 1 4 1 1
6 2 2 2 4
7 2 3 7 1
8 2 4 2 1
9 3 1 3 2
10 3 2 3 2
11 3 3 3 2
12 3 4 3 2
13 4 1 4 7
14 4 1 5 1
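The same idea, combining the columns into one key and tabulating only the keys that occur, can be sketched in base R without tidyr. This is an illustrative variant, not the answer's code; key, tab, parts and the "_" separator are my own choices, and the separator must not appear in the data.

```r
X <- c(4,4,4,4,4,4,4,4,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
Y <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,4,2,2,2,2,3,4,1,1,2,2,3,3,4,4)
Z <- c(4,4,5,4,4,4,4,4,6,1,1,1,1,1,1,1,2,2,2,2,7,2,3,3,3,3,3,3,3,3)
Count_Frame <- data.frame(X1 = X, X2 = Y, X3 = Z)

# One string key per row, e.g. "4_1_4"
key <- do.call(paste, c(Count_Frame, sep = "_"))

# A one-dimensional table only contains keys that actually occur
tab <- table(key)

# Split the keys back into their three components
parts <- do.call(rbind, strsplit(names(tab), "_", fixed = TRUE))
Counts <- data.frame(
  X1 = as.numeric(parts[, 1]),
  X2 = as.numeric(parts[, 2]),
  X3 = as.numeric(parts[, 3]),
  Freq = as.vector(tab)
)
Counts
```

Because the table is built over a single key vector, its size grows with the number of distinct triplets rather than with the product of the value ranges.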
Or using plyr:
library(plyr)
count(Count_Frame, colnames(Count_Frame))
output
# > count(Count_Frame, colnames(Count_Frame))
# X1 X2 X3 freq
# 1 1 1 1 3
# 2 1 1 6 1
# 3 1 2 1 2
# 4 1 3 1 1
# 5 1 4 1 1
# 6 2 2 2 4
# 7 2 3 7 1
# 8 2 4 2 1
# 9 3 1 3 2
# 10 3 2 3 2
# 11 3 3 3 2
# 12 3 4 3 2
# 13 4 1 4 7
# 14 4 1 5 1

expand.grid with unknown set of variables

So, expand.grid returns a df of all the combinations of the vectors passed.
df <- expand.grid(1:3, 1:3)
df <- expand.grid(1:3, 1:3, 1:3)
What I would like is a generalized function that takes 1 parameter (number of vectors) and returns the appropriate data frame.
combinations <- function(n) {
return(expand.grid(0, 1, ... n))
}
Such that
combinations(2) returns(expand.grid(1:3, 1:3))
combinations(3) returns(expand.grid(1:3, 1:3, 1:3))
combinations(4) returns(expand.grid(1:3, 1:3, 1:3, 1:3))
etc.
combinations <- function(n)
expand.grid(rep(list(1:3),n))
> combinations(2)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
> combinations(3)
Var1 Var2 Var3
1 1 1 1
2 2 1 1
3 3 1 1
4 1 2 1
5 2 2 1
6 3 2 1
7 1 3 1
8 2 3 1
9 3 3 1
10 1 1 2
11 2 1 2
12 3 1 2
13 1 2 2
14 2 2 2
15 3 2 2
16 1 3 2
17 2 3 2
18 3 3 2
19 1 1 3
20 2 1 3
21 3 1 3
22 1 2 3
23 2 2 3
24 3 2 3
25 1 3 3
26 2 3 3
27 3 3 3
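This works because expand.grid accepts a single list of vectors in place of separate arguments, and rep(list(1:3), n) builds a list holding n copies of 1:3. A slightly generalized sketch, where the values parameter is my own hypothetical addition to make the repeated vector configurable:

```r
# n: number of vectors; values: the vector repeated n times (assumed default 1:3)
combinations <- function(n, values = 1:3) {
  expand.grid(rep(list(values), n))
}

dim(combinations(4))           # rows = length(values)^n, columns = n
names(combinations(2))
head(combinations(2, values = c("a", "b")))
```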
