Consider this sample:
df <- data.frame(v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
                 v1 = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'),
                 v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))
For a large data frame: if v0 > 4 in any row of a group, drop all rows with the corresponding v1 value (i.e. drop the whole group).
So here the result should be a data frame with all the "a" rows dropped, since v0 values of 5 exist for "a":
df_ExpectedResult <- data.frame(v0 = c(1, 2, 0, 1, 2, 2),
                                v1 = c('b', 'b', 'c', 'c', 'b', 'b'),
                                v2 = c(1, 8, 5, 10, 3, 3))
I would also like a new data frame that keeps the dropped groups:
df_Dropped <- data.frame(v1 = 'a')
How would you do this efficiently for a huge dataset? I am currently using a simple for loop with an if statement, but the manipulation takes too long.
An option with dplyr
library(dplyr)
df %>%
  group_by(v1) %>%
  # keep only groups in which no v0 value exceeds 4
  filter(sum(v0 > 4) < 1) %>%
  ungroup()
-output
# A tibble: 6 x 3
# v0 v1 v2
# <dbl> <chr> <dbl>
#1 1 b 1
#2 2 b 8
#3 0 c 5
#4 1 c 10
#5 2 b 3
#6 2 b 3
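The question also asks for a data frame of the dropped groups (df_Dropped). A small sketch with the same dplyr vocabulary (my addition, not part of the answer above) would be:
library(dplyr)
# groups in which any v0 exceeds 4 -- i.e. the groups being dropped
df_Dropped <- df %>%
  filter(v0 > 4) %>%
  distinct(v1)
df_Dropped
#   v1
# 1  a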
A base R option using subset + ave
subset(df, !ave(v0 > 4, v1, FUN = any))
gives
v0 v1 v2
4 1 b 1
5 2 b 8
6 0 c 5
7 1 c 10
8 2 b 3
9 2 b 3
It's two operations, but what about this:
drop_groups <- df %>% filter(v0 > 4) %>% pull(v1) %>% unique()   # character vector of the groups to drop
df_result <- df %>% filter(!(v1 %in% drop_groups))
df_result
# v0 v1 v2
# 1 1 b 1
# 2 2 b 8
# 3 0 c 5
# 4 1 c 10
# 5 2 b 3
# 6 2 b 3
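If the data really is huge, a data.table sketch of the same two-step idea (my addition, not one of the original answers) may also be worth trying:
library(data.table)
setDT(df)                                  # convert df to a data.table by reference
df_Dropped <- unique(df[v0 > 4, .(v1)])    # the groups being removed (here: "a")
df_result  <- df[!v1 %in% df_Dropped$v1]   # keep only rows from the remaining groups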
I'm trying to filter my data frame so that, within each value of one column, only the rows with the most common value of another column are kept.
df <- data.frame(points  = c(1, 2, 4, 3, 4, 8, 3, 3, 2),
                 assists = c(6, 6, 5, 6, 6, 9, 9, 1, 1),
                 team    = c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 4 5 A
4 3 6 A
5 4 6 A
6 8 9 C
7 3 9 C
8 3 1 C
9 2 1 C
to look like this:
df2 <- data.frame(points  = c(1, 2, 3, 4, 8, 3),
                  assists = c(6, 6, 6, 6, 1, 1),
                  team    = c('A', 'A', 'A', 'A', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 8 1 C
6 3 1 C
The goal is to keep, for each value of the "team" column ("A" and "C"), only the rows whose "assists" value is the most common one for that team ("6" for "A"). If there is a tie (such as "9" and "1" for "C"), the last of the tied values should be kept.
I currently do this with a for loop, but my data frame has 3,000,000 rows and the process is very slow. Does anyone know a faster alternative?
We could modify the usual Mode function and use a group-by approach to filter:
library(dplyr)
Mode <- function(x) {
  # get the unique elements
  ux <- unique(x)
  # convert to integer index with match and get the frequency
  # (tabulate should be faster than table)
  freq <- tabulate(match(x, ux))
  # use == on the max of the freq to get the corresponding ux values,
  # then take the last of them
  last(ux[freq == max(freq)])
}
df %>%
  # grouped by team
  group_by(team) %>%
  # filter only those assists that are returned from the Mode function
  filter(assists %in% Mode(assists)) %>%
  ungroup()
-output
# A tibble: 6 × 3
points assists team
<dbl> <dbl> <chr>
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 3 1 C
6 2 1 C
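A quick check of the tie-breaking behaviour, using team C's assists from the example (my own illustration, not part of the original answer):
Mode(c(9, 9, 1, 1))
# [1] 1   -- the last of the tied values, as required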
Or we may use data.table methods for faster execution:
library(data.table)
# setDT converts the data.frame to a data.table
# create a frequency column N (the group count `.N`) by reference (`:=`),
# grouped by the team and assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N counting from the end (`N[.N:1]`),
# subset the assists value at that index,
# create a logical vector with %in%,
# get the row indices with `.I` (returned in a default column V1),
# extract that column ($V1), use it to subset the data, and drop the helper N column
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) + 1]],
      by = team]$V1][, N := NULL][]
points assists team
<num> <num> <char>
1: 1 6 A
2: 2 6 A
3: 3 6 A
4: 4 6 A
5: 3 1 C
6: 2 1 C
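A more compact data.table variant simply reuses the Mode() function defined above (a sketch of my own; it assumes the setDT() call and the helper N column from the previous block, so N is dropped first):
df[, N := NULL]   # remove the helper frequency column created above
df[, .SD[assists %in% Mode(assists)], by = team]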
I am attempting to take every nth element from col1 and, for the remaining rows, use the value from col2, combining them into a new column, col3.
df <- data.frame(col1 = c('A', 'B', 'D', 'F', 'C'), col2 = c(2, 1, 2, 3, 1))
> df
col1 col2
1 A 2
2 B 1
3 D 2
4 F 3
5 C 1
If I were to take, for example, every odd row's value from col1 and every even row's value from col2, the output should look something like this:
> df
col1 col2 col3
1 A 2 A
2 B 1 1
3 D 2 D
4 F 3 3
5 C 1 C
Thanks.
We could do it with an ifelse statement, checking whether the row number is even or odd with the modulo operator %%:
library(dplyr)
df %>%
  mutate(col3 = ifelse((row_number() %% 2) == 0, col2, col1))
col1 col2 col3
1 A 2 A
2 B 1 1
3 D 2 D
4 F 3 3
5 C 1 C
In base R, we may also use row/column matrix indexing:
df$col3 <- df[cbind(seq_len(nrow(df)), rep(1:2, length.out = nrow(df)))]
-output
> df
col1 col2 col3
1 A 2 A
2 B 1 1
3 D 2 D
4 F 3 3
5 C 1 C
base
df <- data.frame(col1 = c('A', 'B', 'D', 'F', 'C'), col2 = c(2, 1, 2, 3, 1))
df$col3 <- df$col1
df$col3[c(FALSE, TRUE)] <- df$col2[c(FALSE, TRUE)]
df
#> col1 col2 col3
#> 1 A 2 A
#> 2 B 1 1
#> 3 D 2 D
#> 4 F 3 3
#> 5 C 1 C
Created on 2022-03-06 by the reprex package (v2.0.1)
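For completeness, a data.table sketch of the same even/odd logic (my addition; fifelse requires both branches to be the same type, hence the as.character()):
library(data.table)
setDT(df)[, col3 := fifelse(seq_len(.N) %% 2 == 0, as.character(col2), col1)][]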
I have this dataset:
group_ask <- c('A', 'A', 'B', 'B', 'C', 'C')
number_ask <- c(1, 3, 2, 4, 5, 8)
df_ask <- data.frame(group_ask, number_ask)
I am trying to expand each group in the group_ask column by filling in the missing values of the continuous number_ask sequence. The desired dataset should look like this:
group_want <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C')
number_want <- c(1, 2, 3, 2, 3, 4, 5, 6, 7, 8)
df_want <- data.frame(group_want, number_want)
I have unsuccessfully been trying to solve this with R's expand() function.
Any suggestions? Many thanks!
You may use complete -
library(dplyr)
library(tidyr)
df_ask %>%
  group_by(group_ask) %>%
  complete(number_ask = min(number_ask):max(number_ask)) %>%
  ungroup()
# group_ask number_ask
# <chr> <dbl>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 2
# 5 B 3
# 6 B 4
# 7 C 5
# 8 C 6
# 9 C 7
#10 C 8
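If df_ask were large, a data.table sketch of the same per-group expansion (my addition, assuming number_ask has no missing values) could be:
library(data.table)
dt_ask <- as.data.table(df_ask)   # work on a copy so df_ask itself is untouched
dt_ask[, .(number_ask = seq(min(number_ask), max(number_ask))), by = group_ask]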
A split-apply-combine approach using by:
do.call(rbind.data.frame,
        by(df_ask, df_ask$group_ask, \(x)
           cbind(x[1, 1], do.call(seq, as.list(x[, 2]))))) |>
  setNames(names(df_ask))
# group_ask number_ask
# A.1 A 1
# A.2 A 2
# A.3 A 3
# B.1 B 2
# B.2 B 3
# B.3 B 4
# C.1 C 5
# C.2 C 6
# C.3 C 7
# C.4 C 8
I have a data frame that looks as follows:
> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'A'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]
g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 A 4
I want to get the minimum value for group A and the maximum value for group B.
Expected output:
g x
1 A 3
2 B 9
Maybe this can help:
library(dplyr)
# Code
new <- df %>%
  group_by(g) %>%
  mutate(x = ifelse(g == 'A', min(x, na.rm = TRUE),
                    ifelse(g == 'B', max(x, na.rm = TRUE), NA_real_))) %>%
  summarise(x = unique(x))
Output:
# A tibble: 2 x 2
g x
<chr> <dbl>
1 A 3
2 B 9
Another option
df %>% group_by(g) %>% summarise(x = unique(ifelse(g == "A", min(x), max(x))))
#> # A tibble: 2 x 2
#> g x
#> <chr> <dbl>
#> 1 A 3
#> 2 B 9
Using data.table
library(data.table)
setDT(df)[, .(x = unique(fifelse(g == "A", min(x), max(x)))), .(g)]
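A variant of the same idea that avoids the unique()/ifelse() step is to branch on the group value directly inside summarise (my own sketch, not from the answers above; it assumes only the groups "A" and "B" occur):
library(dplyr)
df %>%
  group_by(g) %>%
  summarise(x = if (first(g) == "A") min(x) else max(x))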
If I construct a data frame as
# constructing df
a <- c(rep("A", 3), rep("B", 3), rep("A",2))
b <- c(1,1,2,4,1,1,2,2)
#c <- c("ir", "ir", "br", "ir", "us", "us", "ir", "br")
c <- c(1, 2, 3, 4, 4, 4, 4, 5)
df <- data.frame(a,b,c)
I can aggregate that via:
df_red <- aggregate(list(track = c), df[,c("a", "b")], '[')
What is the best way to disaggregate that back to what it was before?
In other words, how can I convert this:
a b track
1 A 1 1, 2
2 B 1 4, 4
3 A 2 3, 4, 5
4 B 4 4
to this:
a b c
1 A 1 1
2 A 1 2
3 A 2 3
4 B 4 4
5 B 1 4
6 B 1 4
7 A 2 4
8 A 2 5
1) unnest Try unnest like this (in recent tidyr versions it is best to name the list column explicitly):
library(tidyr)
df_red %>% unnest(track)
or
unnest(df_red, track)
2) base Here is a base solution:
# Map(data.frame, a, b, track) builds one small data frame per aggregated row,
# recycling a and b against each track vector; rbind then stacks them
do.call(rbind, do.call(Map, c(data.frame, df_red)))
3) separate_rows Also note that if you had aggregated into a string rather than a vector, we could use this pair:
library(tidyr)
ag_s <- aggregate(list(track = c), df[c("a", "b")], toString)
ag_s %>% separate_rows(track, convert = TRUE)   # convert = TRUE turns the split strings back into numbers
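A data.table sketch of the same disaggregation (my addition; it assumes track is a list column, as aggregate() produces here when the group sizes differ):
library(data.table)
as.data.table(df_red)[, .(c = unlist(track)), by = .(a, b)]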