How to expand a dataframe in R with a continuous variable? [duplicate]

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
R - Insert Missing Numbers in A Sequence by Group's Max Value
(2 answers)
Closed 1 year ago.
I have this dataset:
group_ask <- c('A', 'A', 'B', 'B', 'C', 'C')
number_ask <- c(1, 3, 2, 4, 5, 8)
df_ask <- data.frame(group_ask, number_ask)
I am trying to expand the group_ask column by completing the continuous number_ask column. The solution dataset should look like this:
group_want <- c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C')
number_want <- c(1, 2, 3, 2, 3, 4, 5, 6, 7, 8)
df_want <- data.frame(group_want, number_want)
I have been trying, unsuccessfully, to solve this with R's expand() function.
Any suggestions? Many thanks!

You may use complete:
library(dplyr)
library(tidyr)
df_ask %>%
  group_by(group_ask) %>%
  complete(number_ask = min(number_ask):max(number_ask)) %>%
  ungroup()
# group_ask number_ask
# <chr> <dbl>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 2
# 5 B 3
# 6 B 4
# 7 C 5
# 8 C 6
# 9 C 7
#10 C 8

A split-apply-combine approach using by:
do.call(rbind.data.frame,
        by(df_ask, df_ask$group_ask, \(x)
           cbind(x[1, 1], do.call(seq, as.list(x[, 2]))))) |>
  setNames(names(df_ask))
# group_ask number_ask
# A.1 A 1
# A.2 A 2
# A.3 A 3
# B.1 B 2
# B.2 B 3
# B.3 B 4
# C.1 C 5
# C.2 C 6
# C.3 C 7
# C.4 C 8
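A data.table sketch of the same expansion (an alternative not given in the original answers, assuming data.table is installed):

```r
library(data.table)

# Example data from the question
group_ask <- c('A', 'A', 'B', 'B', 'C', 'C')
number_ask <- c(1, 3, 2, 4, 5, 8)
df_ask <- data.frame(group_ask, number_ask)

# For each group, build the full sequence from its min to its max
res <- setDT(df_ask)[, .(number_ask = min(number_ask):max(number_ask)),
                     by = group_ask]
res
```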

Identify duplicates and make column with common id [duplicate]

This question already has answers here:
Concatenate strings by group with dplyr [duplicate]
(4 answers)
Closed 20 days ago.
I have a df
df <- data.frame(ID = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'),
var1 = c(1, 1, 3, 4, 5, 5, 7, 8),
var2 = c(1, 1, 0, 0, 1, 1, 0, 0),
var3 = c(50, 50, 30, 47, 33, 33, 70, 46))
Where columns var1 - var3 are numerical inputs into a modelling software. To save on computing time, I would like to simulate unique instances of var1 - var3 in the modelling software, then join the results back to the main dataframe using a left join.
I need to add a second identifier to each row to show that it is the same as another row in terms of var1-var3. The output would be like:
ID var1 var2 var3 ID2
1 a 1 1 50 ab
2 b 1 1 50 ab
3 c 3 0 30 c
4 d 4 0 47 d
5 e 5 1 33 ef
6 f 5 1 33 ef
7 g 7 0 70 g
8 h 8 0 46 h
Then I can subset the unique rows of var1-var3 plus ID2, simulate them in the software, and join the results back to the main df using the new ID2.
With paste:
library(dplyr) # 1.1.0
df %>%
  mutate(ID2 = paste(unique(ID), collapse = ""),
         .by = c(var1, var2, var3))
# ID var1 var2 var3 ID2
# 1 a 1 1 50 ab
# 2 b 1 1 50 ab
# 3 c 3 0 30 c
# 4 d 4 0 47 d
# 5 e 5 1 33 ef
# 6 f 5 1 33 ef
# 7 g 7 0 70 g
# 8 h 8 0 46 h
Note that the .by argument is a new feature of dplyr 1.1.0. You can still use group_by and ungroup with earlier versions and/or if you have a more complex pipeline.
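As noted above, the same result is available in earlier dplyr versions; a group_by/ungroup equivalent might look like this (a sketch, grouping on all three var columns per the question's requirement):

```r
library(dplyr)

# Example data from the question
df <- data.frame(ID = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'),
                 var1 = c(1, 1, 3, 4, 5, 5, 7, 8),
                 var2 = c(1, 1, 0, 0, 1, 1, 0, 0),
                 var3 = c(50, 50, 30, 47, 33, 33, 70, 46))

# group_by()/ungroup() equivalent of the .by version
res <- df %>%
  group_by(var1, var2, var3) %>%
  mutate(ID2 = paste(unique(ID), collapse = "")) %>%
  ungroup()
```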

A faster conditional subset

I'm trying to filter my dataframe so that, for each value in one column, only the rows with the most common value in another column are kept.
df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 4 5 A
4 3 6 A
5 4 6 A
6 8 9 C
7 3 9 C
8 3 1 C
9 2 1 C
to look like this:
df2 <- data.frame(points=c(1, 2, 3, 4, 3, 2),
                  assists=c(6, 6, 6, 6, 1, 1),
                  team=c('A', 'A', 'A', 'A', 'C', 'C'))
  points assists team
1      1       6    A
2      2       6    A
3      3       6    A
4      4       6    A
5      3       1    C
6      2       1    C
The goal is to keep, for each value in the "team" column (A and C), only the rows whose "assists" value is the most common one for that team ("6" for "A"). If there is a tie (such as "9" and "1" for "C"), the last most common value should be kept.
I currently do this with a for loop, but my dataframe has 3,000,000 rows and the process is very slow. Does anyone know a faster alternative?
We could use a modified Mode function in a group-by filter:
library(dplyr)
Mode <- function(x) {
  # get the unique elements
  ux <- unique(x)
  # convert to an integer index with match and tabulate the frequencies
  # (tabulate should be faster than table)
  freq <- tabulate(match(x, ux))
  # compare freq against its max, get the corresponding ux values,
  # then take the last of them (the tie-break)
  last(ux[freq == max(freq)])
}
df %>%
  # grouped by team
  group_by(team) %>%
  # keep only those assists returned by the Mode function
  filter(assists %in% Mode(assists)) %>%
  ungroup()
Output:
# A tibble: 6 × 3
points assists team
<dbl> <dbl> <chr>
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 3 1 C
6 2 1 C
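As a quick check of the tie-breaking rule, the Mode function can be run directly on each team's assists. Below is a base-only variant (tail() in place of dplyr::last, so no packages are needed):

```r
# Base-only variant of the answer's Mode function
Mode <- function(x) {
  ux <- unique(x)
  freq <- tabulate(match(x, ux))
  tail(ux[freq == max(freq)], 1)  # last of the tied modes
}

Mode(c(6, 6, 5, 6, 6))  # team A's assists: 6 is the clear mode
Mode(c(9, 9, 1, 1))     # team C's assists: 9/1 tie, the last one (1) wins
```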
Or use data.table methods for faster execution:
library(data.table)
# setDT converts the data.frame to a data.table
# create a frequency column (`N`) by reference assignment (`:=`),
# grouped by the team and assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N scanning from the end (`.N:1`),
# subset the assists with that index,
# build a logical vector with %in%,
# get the row indices with .I (returned in a default column V1),
# extract that column ($V1) and use it to subset the data,
# then drop the helper column N
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) + 1]],
       by = team]$V1][, N := NULL][]
points assists team
<num> <num> <char>
1: 1 6 A
2: 2 6 A
3: 3 6 A
4: 4 6 A
5: 3 1 C
6: 2 1 C
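If readability matters more than the last bit of speed, a more compact data.table sketch reusing a Mode-style function is possible (this is an assumption-level alternative, not the original answer's code, and .SD subsetting per group is typically slower than the index approach above):

```r
library(data.table)

# Base-only variant of the Mode function (last of the tied modes)
Mode <- function(x) {
  ux <- unique(x)
  freq <- tabulate(match(x, ux))
  tail(ux[freq == max(freq)], 1)
}

# Example data from the question
df <- data.frame(points = c(1, 2, 4, 3, 4, 8, 3, 3, 2),
                 assists = c(6, 6, 5, 6, 6, 9, 9, 1, 1),
                 team = c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))

# Within each team, keep only rows whose assists equal that team's mode
res <- setDT(df)[, .SD[assists == Mode(assists)], by = team]
```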

Filter Rows in a Dataframe Based on a Variable's Total Occurrences

I'm trying to remove observations containing values that don't occur often enough.
For instance, in this dataframe there are
5 observations of A and of B,
3 of C,
2 of D,
and 1 of E.
df.1 <- c('A', 'B', 'B', 'C', 'C', 'C', 'B', 'E')
df.2 <- c('B', 'D', 'D', 'A', 'A', 'B', 'A', 'A')
df <- data.frame(df.1, df.2)
df
# df.1 df.2
# 1 A B
# 2 B D
# 3 B D
# 4 C A
# 5 C A
# 6 C B
# 7 B A
# 8 E A
If I need a minimum of three observations, I'd like to remove any rows containing D or E, to get something such as:
final.df.1 final.df.2
1 A B
2 C A
3 C A
4 C B
5 B A
In base R, get the frequency count with table, then keep only the rows whose values all have a frequency of at least n:
n <- 3
nm1 <- names(which(table(unlist(df)) >= n))
subset(df, Reduce(`&`, lapply(df, `%in%`, nm1)))
df.1 df.2
1 A B
4 C A
5 C A
6 C B
7 B A
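The same idea can be sketched with dplyr's filter (an alternative formulation, not part of the original answer):

```r
library(dplyr)

# Example data from the question
df <- data.frame(df.1 = c('A', 'B', 'B', 'C', 'C', 'C', 'B', 'E'),
                 df.2 = c('B', 'D', 'D', 'A', 'A', 'B', 'A', 'A'))

n <- 3
# values occurring at least n times across both columns
keep <- names(which(table(unlist(df)) >= n))
res <- df %>% filter(df.1 %in% keep & df.2 %in% keep)
```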

Using R: drop rows efficiently based on different conditions

Considering this sample
df <- data.frame(v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
                 v1 = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'),
                 v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))
For a large dataframe: if v0 > 4 in any row, drop all rows with that row's v1 value (i.e. drop the whole group).
So here the result should be a dataframe with all the "a" rows dropped, since v0 values of 5 exist for "a".
df_ExpectedResult <- data.frame(v0 = c(1, 2, 0, 1, 2, 2),
                                v1 = c('b', 'b', 'c', 'c', 'b', 'b'),
                                v2 = c(1, 8, 5, 10, 3, 3))
Also, I would like to have a new dataframe keeping the dropped groups.
df_Dropped <- data.frame(v1 = 'a')
How would you do this efficiently for a huge dataset? I am using a simple for loop and if statement, but it takes too long to do the manipulation.
An option with dplyr
library(dplyr)
df %>%
  group_by(v1) %>%
  filter(sum(v0 > 4) < 1) %>%
  ungroup()
Output:
# A tibble: 6 x 3
# v0 v1 v2
# <dbl> <chr> <dbl>
#1 1 b 1
#2 2 b 8
#3 0 c 5
#4 1 c 10
#5 2 b 3
#6 2 b 3
A base R option using subset + ave
subset(df, !ave(v0 > 4, v1, FUN = any))
gives
v0 v1 v2
4 1 b 1
5 2 b 8
6 0 c 5
7 1 c 10
8 2 b 3
9 2 b 3
It's two operations, but what about this:
drop_groups <- df %>% filter(v0 > 4) %>% distinct(v1) %>% pull(v1)
df_result <- df %>% filter(!(v1 %in% drop_groups))
df_result
# v0 v1 v2
# 1 1 b 1
# 2 2 b 8
# 3 0 c 5
# 4 1 c 10
# 5 2 b 3
# 6 2 b 3
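The question also asked for a separate dataframe recording the dropped groups; a minimal base R sketch covering both outputs might be:

```r
# Example data from the question
df <- data.frame(v0 = c(1, 2, 5, 1, 2, 0, 1, 2, 2, 2, 5),
                 v1 = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'b', 'b', 'a', 'a'),
                 v2 = c(0, 10, 5, 1, 8, 5, 10, 3, 3, 1, 5))

bad <- unique(df$v1[df$v0 > 4])     # groups with any v0 > 4
df_Dropped <- data.frame(v1 = bad)  # record the dropped groups
df_result <- df[!df$v1 %in% bad, ]  # keep the remaining rows
```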

How to disaggregate a data frame containing a list column

If I construct a data frame as
# constructing df
a <- c(rep("A", 3), rep("B", 3), rep("A",2))
b <- c(1,1,2,4,1,1,2,2)
#c <- c("ir", "ir", "br", "ir", "us", "us", "ir", "br")
c <- c(1, 2, 3, 4, 4, 4, 4, 5)
df <- data.frame(a,b,c)
I can aggregate that via:
df_red <- aggregate(list(track = c), df[,c("a", "b")], '[')
What is the best way to disaggregate that back to what it was before?
In other words, how can I convert this:
a b track
1 A 1 1, 2
2 B 1 4, 4
3 A 2 3, 4, 5
4 B 4 4
to this:
a b c
1 A 1 1
2 A 1 2
3 A 2 3
4 B 4 4
5 B 1 4
6 B 1 4
7 A 2 4
8 A 2 5
1) unnest: Try unnest, specifying the list column (required in newer tidyr versions):
library(tidyr)
df_red %>% unnest(track)
or
unnest(df_red, track)
2) base: Here is a base solution:
do.call(rbind, do.call(Map, c(data.frame, df_red)))
3) separate_rows: Also note that if you had aggregated into a string rather than a list column, you could use this pair:
library(tidyr)
ag_s <- aggregate(list(track = c), df[c("a", "b")], toString)
ag_s %>% separate_rows(track)
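One caveat with the string route: after splitting, separate_rows leaves the column as character. tidyr's convert argument can restore numeric types (a sketch; the ag_s frame below is hand-built to match the aggregated output shown above):

```r
library(tidyr)

# Aggregated-to-string form, matching the toString() aggregation above
ag_s <- data.frame(a = c("A", "B", "A", "B"),
                   b = c(1, 1, 2, 4),
                   track = c("1, 2", "4, 4", "3, 4, 5", "4"))

# convert = TRUE runs type.convert() on the split column,
# turning the character pieces back into numbers
res <- separate_rows(ag_s, track, convert = TRUE)
```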