A faster conditional subset


I'm trying to subset my dataframe so that, for each value in one column, only the rows with the most common value in another column are kept. I want this dataframe:
df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 4 5 A
4 3 6 A
5 4 6 A
6 8 9 C
7 3 9 C
8 3 1 C
9 2 1 C
to look like this:
df2 <- data.frame(points=c(1, 2, 3, 4, 3, 2),
assists=c(6, 6, 6, 6, 1, 1),
team=c('A', 'A', 'A', 'A', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 3 1 C
6 2 1 C
The goal is to keep, for each value in the "team" column ("A" and "C"), only the rows whose "assists" value is the most common one for that team ("6" for "A"). If there is a tie (such as "9" and "1" for "C"), the tied value that appears last should be kept.
I do this with a for loop, but my dataframe has 3,000,000 rows and the process is very slow. Does anyone know a faster alternative?

We could modify the usual Mode function so that ties resolve to the last value, and then filter with a group-by approach:
library(dplyr)
Mode <- function(x) {
# get the unique elements
ux <- unique(x)
# convert to integer index with match and get the frequency
# tabulate should be faster than table
freq <- tabulate(match(x, ux))
# use == on the max of the freq, get the corresponding ux
# then get the last elements of ux
last(ux[freq == max(freq)])
}
df %>%
# grouped by team
group_by(team) %>%
# filter only those assists that are returned from Mode function
filter(assists %in% Mode(assists)) %>%
ungroup
-output
# A tibble: 6 × 3
points assists team
<dbl> <dbl> <chr>
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 3 1 C
6 2 1 C
Or we may use data.table methods for faster execution:
library(data.table)
# setDT - converts data.frame to data.table
# create a frequency column (`.N`) by assignment (`:=`)
# grouped by team, assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N from reverse (`.N:1`)
#subset the assists with that index
# create a logical vector with %in%
# get the row index -.I, which creates a default column V1
# extract the column ($V1) and use that to subset the data
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) + 1]],
by = team]$V1][, N := NULL][]
points assists team
<num> <num> <char>
1: 1 6 A
2: 2 6 A
3: 3 6 A
4: 4 6 A
5: 3 1 C
6: 2 1 C
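The same tie-breaking rule can also be sketched in base R, without extra packages. This is only an illustration under the question's toy data: `last_mode()` is a hypothetical helper name (not part of any package), and its speed on 3,000,000 rows has not been measured here.

```r
# Toy data copied from the question
df <- data.frame(points  = c(1, 2, 4, 3, 4, 8, 3, 3, 2),
                 assists = c(6, 6, 5, 6, 6, 9, 9, 1, 1),
                 team    = c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))

# last_mode() is a hypothetical helper: it returns the most frequent value,
# and on a tie the tied value whose first appearance comes last
last_mode <- function(x) {
  ux <- unique(x)                     # values in order of first appearance
  freq <- tabulate(match(x, ux))      # frequency of each unique value
  ux[max(which(freq == max(freq)))]   # last position that reaches the max
}

# ave() repeats each team's mode along that team's rows
keep <- df$assists == ave(df$assists, df$team, FUN = last_mode)
res <- df[keep, ]
res
```

For the toy data this keeps the four rows with assists 6 for team A and the two rows with assists 1 for team C.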

Related

R - Compare values in a dataframe to aggregated dataframe

I'm trying to figure out how I can compare the values in a dataframe row-by-row that correspond to those given by the aggregate() function.
For example:
#create data frame
df <- data.frame(team=c('a', 'a', 'b', 'b', 'b', 'c', 'c'),
pts=c(5, 8, 14, 18, 5, 7, 7),
rebs=c(8, 8, 9, 3, 8, 7, 4))
#view data frame
df
team pts rebs
1 a 5 8
2 a 8 8
3 b 14 9
4 b 18 3
5 b 5 8
6 c 7 7
7 c 7 4
#find mean points scored by team
agg_df = aggregate(df$pts, list(df$team), FUN=mean)
Group.1 x
1 a 6.50000
2 b 12.33333
3 c 7.00000
What I want to do is create a new column in df using a logic similar to the following pseudo-code:
df$pts[i] > agg_df$x[i] then df$performance = 'overperformed' else df$performance = 'underperformed'.
But this is not exactly what I want. I want to compare row 1 and 2's points to the mean points for team a in agg_df. Similarly, rows 3-5 in df should be compared to the mean points for group b in agg_df.
The final result would look like:
> df
team pts rebs performance
1 a 5 8 under
2 a 8 8 over
3 b 14 9 over
4 b 18 3 over
5 b 5 8 under
6 c 7 7 average
7 c 7 4 average
I am a little puzzled as to how to achieve this, or if it is even achievable, so any help is very much appreciated.

You can do:
library(tidyverse)
df %>%
group_by(team) %>%
mutate(performance = case_when(pts > mean(pts) ~ "over",
pts == mean(pts) ~ "average",
pts < mean(pts) ~ "under")) %>%
ungroup()
which gives:
# A tibble: 7 x 4
team pts rebs performance
<chr> <dbl> <dbl> <chr>
1 a 5 8 under
2 a 8 8 over
3 b 14 9 over
4 b 18 3 over
5 b 5 8 under
6 c 7 7 average
7 c 7 4 average

Or in base R with merge():
# Merge data
db <- merge(df, agg_df, by.x = "team", by.y = 'Group.1')
db$performance <- ifelse(db$pts == db$x, 'average',
ifelse(db$pts > db$x, 'over', 'under'))
db$x <- NULL
db
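If the separate aggregated data frame is not needed, the per-team mean can be computed in place with ave(). The following is a minimal base R sketch on the question's data; `team_mean` is a name introduced here for illustration, and the exact comparison with the mean assumes values for which floating-point equality is reliable.

```r
# Data copied from the question
df <- data.frame(team = c('a', 'a', 'b', 'b', 'b', 'c', 'c'),
                 pts  = c(5, 8, 14, 18, 5, 7, 7),
                 rebs = c(8, 8, 9, 3, 8, 7, 4))

# ave() returns the group mean repeated for every row (FUN defaults to mean)
team_mean <- ave(df$pts, df$team)

# sign() gives -1, 0 or 1; adding 2 turns that into an index 1..3
df$performance <- c("under", "average", "over")[sign(df$pts - team_mean) + 2]
df
```

Unlike merge(), this keeps the original row order.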

How to create a new dataframe of consolidated values from multiple columns in R

I have a dataframe, df1, that looks like the following:
sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
A      2        3        1        7        4        6
B      5        9        7        0        3        7
C      6        8        9        2        3        0
D      3        9        0        5        8        3
I want to consolidate the dataframe by summing the values based on animal present in the header row, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample Ape Cat Dog
A      3   10  10
B      12  9   10
C      15  10  3
D      3   14  11
I have created a list called "animals_list" that contains all the animals.
I have then created a list of dataframes that subsets each animal into a separate dataframe with:
species_extract <- list()
for (i in 1:length(animals_list)){
species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: my dataframe has over 300 animals, meaning incorporating use of my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the structure, "number_animal_number" and therefore I can't use a repetitive search (sorry!).

A data.table approach:
library(data.table)
library(rlist)
#set data to data.table format
setDT(df1)
# split column 2:n by regex on column names
L <- split.default(df1[,-1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# Bind together again
data.table(sample = df1$sample,
as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11

Update: After clarification:
This may work, depending on the other names of your animals, but it is a start:
library(dplyr)
library(tidyr)
library(stringr) # str_extract() comes from stringr
df1 %>%
pivot_longer(
cols = -sample
) %>%
mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
group_by(sample, name1) %>%
summarise(sum=sum(value)) %>%
pivot_wider(
names_from = name1,
values_from= sum
)
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer:
Here is how we could do it with dplyr:
library(dplyr)
df1 %>%
mutate(Cat = rowSums(select(., contains("Cat"))),
Ape = rowSums(select(., contains("Ape"))),
Dog = rowSums(select(., contains("Dog")))) %>%
select(sample, Ape, Cat, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11

An alternative data.table solution
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
`99_Ape_1` = c(2, 5, 6, 3),
`93_Cat_1` = c(3, 9, 8, 9),
`87_Ape_2` = c(1, 7, 9, 0),
`84_Cat_2` = c(7, 0, 2, 5),
`90_Dog_1` = c(4, 3, 3, 8),
`92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Pivot wide (rows to columns)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Alternatively, leaving the column names as is (after comment from OP to previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
`99_Ape_1` = c(2, 5, 6, 3, 1),
`93_Cat_1` = c(3, 9, 8, 9, 1),
`87_Ape_2` = c(1, 7, 9, 0, 1),
`84_Cat_2` = c(7, 0, 2, 5, 1),
`90_Dog_1` = c(4, 3, 3, 8, 1),
`92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Pivot wide (rows to columns)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
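Since the question's edit asks for a solution driven by the existing list of identifiers, here is a minimal base R sketch that loops over `animals_list` directly instead of parsing the column names. The three-element `animals_list` below is a stand-in for the asker's real list of 300+ animals, and plain substring matching is an assumption: an identifier that is contained in another one (for example "Ape" inside "Grape") would need a stricter pattern.

```r
# Data from the question; check.names = FALSE keeps the original column names
df1 <- data.frame(sample = c("A", "B", "C", "D"),
                  `99_Ape_1` = c(2, 5, 6, 3),
                  `93_Cat_1` = c(3, 9, 8, 9),
                  `87_Ape_2` = c(1, 7, 9, 0),
                  `84_Cat_2` = c(7, 0, 2, 5),
                  `90_Dog_1` = c(4, 3, 3, 8),
                  `92_Dog_2` = c(6, 7, 0, 3),
                  check.names = FALSE)

# Stand-in for the asker's list of identifiers
animals_list <- c("Ape", "Cat", "Dog")

# For each animal, sum all columns whose name contains that identifier;
# drop = FALSE keeps a data frame even when only one column matches
totals <- sapply(animals_list, function(a) {
  rowSums(df1[, grep(a, names(df1), fixed = TRUE), drop = FALSE])
})

animal_total <- data.frame(sample = df1$sample, totals)
animal_total
```

Because the matching only asks whether the identifier occurs in the column name, it does not depend on the "number_animal_number" structure.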

How to expand a dataframe in R with a continuous variable? [duplicate]

I have this dataset:
group_ask <- c('A', 'A', 'B', 'B', 'C', 'C')
number_ask <- c(1, 3, 2, 4, 5, 8)
df_ask <- data.frame(group_ask, number_ask)
I am trying to expand the group_ask column by completing the continuous number_ask column. The solution dataset should look like this:
group_want <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C')
number_want <- c(1, 2, 3, 2, 3, 4, 5, 6, 7, 8)
df_want <- data.frame(group_want, number_want)
I have been trying, unsuccessfully, to solve this with R's expand() function.
Any suggestions? Many thanks!

You may use complete() from tidyr:
library(dplyr)
library(tidyr)
df_ask %>%
group_by(group_ask) %>%
complete(number_ask = min(number_ask):max(number_ask)) %>%
ungroup
# group_ask number_ask
# <chr> <dbl>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 2
# 5 B 3
# 6 B 4
# 7 C 5
# 8 C 6
# 9 C 7
#10 C 8

Split apply combine approach using by.
do.call(rbind.data.frame,
by(df_ask, df_ask$group_ask, \(x)
cbind(x[1, 1], do.call(seq, as.list(x[, 2]))))) |>
setNames(names(df_ask))
# group_ask number_ask
# A.1 A 1
# A.2 A 2
# A.3 A 3
# B.1 B 2
# B.2 B 3
# B.3 B 4
# C.1 C 5
# C.2 C 6
# C.3 C 7
# C.4 C 8
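The by() approach above passes each group's values straight to seq(), which fits the example because every group has exactly two rows. A minimal base R sketch that instead uses each group's minimum and maximum, and so should also cope with groups of other sizes, could look like this (`seqs` and `df_full` are names introduced here for illustration):

```r
# Data copied from the question
df_ask <- data.frame(group_ask  = c('A', 'A', 'B', 'B', 'C', 'C'),
                     number_ask = c(1, 3, 2, 4, 5, 8))

# One full sequence per group, from the group's minimum to its maximum
seqs <- lapply(split(df_ask$number_ask, df_ask$group_ask),
               function(v) seq(min(v), max(v)))

# Repeat each group label once per element of its sequence
df_full <- data.frame(group_ask  = rep(names(seqs), lengths(seqs)),
                      number_ask = unlist(seqs, use.names = FALSE))
df_full
```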

using R: drop rows efficiently based on different conditions

Considering this sample
df<-{data.frame(v0=c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),v1=c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'), v2=c(0, 10, 5, 1, 8, 5,10, 3, 3, 1, 5))}
For a large dataframe: if any row has v0 > 4, drop all rows that share that row's v1 value (i.e., drop the whole group).
So, here the result should be a dataframe dropping all the rows with "a" since v0 values of 5 exist for "a".
df_ExpectedResult<-{data.frame(v0=c( 1, 2, 0, 1, 2, 2 ),v1=c( 'b', 'b', 'c', 'c', 'b', 'b'), v2=c(1, 8, 5,10, 3, 3))}
Also, I would like to have a new dataframe keeping the dropped groups.
df_Dropped <- {data.frame(v1='a')}
How would you do this efficiently for a huge dataset? I am using a simple for loop and if statement, but it takes too long to do the manipulation.

An option with dplyr
library(dplyr)
df %>%
group_by(v1) %>%
filter(sum(v0 > 4) < 1) %>%
ungroup
-output
# A tibble: 6 x 3
# v0 v1 v2
# <dbl> <chr> <dbl>
#1 1 b 1
#2 2 b 8
#3 0 c 5
#4 1 c 10
#5 2 b 3
#6 2 b 3

A base R option using subset + ave
subset(df, !ave(v0 > 4, v1, FUN = any))
gives
v0 v1 v2
4 1 b 1
5 2 b 8
6 0 c 5
7 1 c 10
8 2 b 3
9 2 b 3

It's two operations, but what about this:
library(dplyr)
# pull() returns a plain vector, so %in% also works when several groups are dropped
drop_groups <- df %>% filter(v0 > 4) %>% pull(v1) %>% unique()
df_result <- df %>% filter(!(v1 %in% drop_groups))
df_result
# v0 v1 v2
# 1 1 b 1
# 2 2 b 8
# 3 0 c 5
# 4 1 c 10
# 5 2 b 3
# 6 2 b 3
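The question also asks for a data frame that records the dropped groups. A minimal base R sketch that produces both results in one pass is shown below; `bad`, `df_result` and `df_dropped` are names chosen here for illustration, and no timing on a large data set has been done.

```r
# Data copied from the question
df <- data.frame(v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
                 v1 = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'),
                 v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))

# Groups that contain at least one v0 > 4
bad <- unique(df$v1[df$v0 > 4])

# Rows from all other groups, and a record of what was dropped
df_result  <- df[!df$v1 %in% bad, ]
df_dropped <- data.frame(v1 = bad)
```

Both steps are vectorised, so no explicit loop over rows or groups is involved.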

Remove duplicates based on 2nd column condition

I am trying to remove duplicate rows from a data frame based on the max value in a different column.
So, for the data frame:
df<-data.frame(rbind(c("a",2,3),c("a",3,4),c("a",3,5),c("b",1,3),c("b",2,6),c("r",4,5)))
colnames(df)<-c("id","val1","val2")
id val1 val2
a 2 3
a 3 4
a 3 5
b 1 3
b 2 6
r 4 5
I would like to remove all duplicates by id, keeping for each id only the row that has the maximum value of val2.
Thus the data frame should become:
a 3 5
b 2 6
r 4 5
-> remove all "a" duplicates but keep the row with the max value of df$val2 within subset(df, df$id=="a")

Using base R. Here, the columns are factors (or character strings in R >= 4.0), so make sure to convert val2 to numeric first:
df$val2 <- as.numeric(as.character(df$val2))
df[with(df, ave(val2, id, FUN=max)==val2),]
# id val1 val2
#3 a 3 5
#5 b 2 6
#6 r 4 5
Or using dplyr
library(dplyr)
df %>%
group_by(id) %>%
filter(val2==max(val2))
# id val1 val2
#1 a 3 5
#2 b 2 6
#3 r 4 5

One possible way is to use data.table
library(data.table)
setDT(df)[, .SD[which.max(val2)], by = id]
## id val1 val2
## 1: a 3 5
## 2: b 2 6
## 3: r 4 5

Here's how I hope your data is really set up
df <- data.frame (id = c(rep("a", 3), rep("b", 2), "r"),
val1 = c(2, 3, 3, 1, 2, 4), val2 = c(3, 4, 5, 3, 6, 5))
You could do a split-unsplit
unsplit(lapply(split(df, df$id), function(x) {
if(nrow(x) > 1) {
x[duplicated(x$id) & x$val2 == max(x$val2),]
} else {
x
}
# unique() instead of levels(): id is a character column unless stringsAsFactors = TRUE
}), unique(df$id))
# id val1 val2
# 3 a 3 5
# 5 b 2 6
# 6 r 4 5
You can also use Reduce(rbind, ...) or do.call(rbind, ...) in place of unsplit

Another one
df %>% group_by(id) %>%
slice(which.max(val2))
id val1 val2
a 3 5
b 2 6
r 4 5
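Another base R possibility is to sort first and then drop duplicates. The sketch below assumes val2 is already numeric (as in the cleaned-up data frame suggested in an earlier answer); `ord` and `res` are names introduced here for illustration.

```r
# Data as set up in the earlier answer, with numeric val1 and val2
df <- data.frame(id   = c(rep("a", 3), rep("b", 2), "r"),
                 val1 = c(2, 3, 3, 1, 2, 4),
                 val2 = c(3, 4, 5, 3, 6, 5))

# Sort by id, then by val2 in decreasing order within each id
ord <- df[order(df$id, -df$val2), ]

# duplicated() flags every row after the first per id, so negating it
# keeps only the row with the largest val2
res <- ord[!duplicated(ord$id), ]
res
```

If several rows of an id share the maximum val2, this keeps exactly one of them, whereas the `val2 == max(val2)` filters above keep all tied rows.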
