Expand dataframe by ID to generate a special column - r

I have the following dataframe
df<-data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
'A_Frequency'=c(1,2,3,4,5,1,2,3,4,5),
'B_Frequency'=c(1,2,NA,4,6,1,2,5,6,7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I Wish to create a new dataframe df2 from df that looks as follows
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new dataframe has a column CFreq that takes unique values from A_Frequency, B_Frequency and groups them by ID. Then it ignores the NA values and generates the CFreq column
I have tried dplyr but am unable to get the required response
df2<-df%>%group_by(ID)%>%select(ID, A_Frequency,B_Frequency)%>%
mutate(Cfreq=unique(A_Frequency, B_Frequency))
This yields the following which is quite different
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Request someone to help me here

gather function from tidyr package will be helpful here:
library(tidyverse)
df %>%
gather(x, CFreq, -ID) %>%
select(-x) %>%
na.omit() %>%
unique() %>%
arrange(ID, CFreq)

A different tidyverse possibility could be:
df %>%
nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7

Base R approach would be to split the dataframe based on ID and for every list we count the number of unique enteries and create a sequence based on that.
do.call(rbind, lapply(split(df, df$ID), function(x) data.frame(ID = x$ID[1] ,
CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
This will also work when A_Frequency B_Frequency has characters in them or some other random numbers instead of sequential numbers.
In tidyverse we can do
library(tidyverse)
df %>%
group_split(ID) %>%
map_dfr(~ data.frame(ID = .$ID[1],
CFreq= seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))

A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
.SDcols = cols,
by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7

Related

Group by cumulative sums with conditions

In this dataframe:
df <- data.frame(
ID = c("C", "B", "B", "B", NA, "C", "A", NA, "B", "B", "B")
)
I'd like to group the rows using cumsum with two conditions: (i) cumsum should not continue if is.na(ID) and (ii) it should not continue if the next ID value is the same as the prior. I do meet condition (i) with this:
df %>%
group_by(grp = cumsum(!is.na(ID)))
# A tibble: 11 x 2
# Groups: grp [9]
ID grp
<chr> <int>
1 C 1
2 B 2
3 B 3
4 B 4
5 NA 4
6 C 5
7 A 6
8 NA 6
9 B 7
10 B 8
11 B 9
but I don't know how to implement condition (ii) too, to obtain the desired result:
1 C 1
2 B 2
3 B 2
4 B 2
5 NA 2
6 C 3
7 A 4
8 NA 4
9 B 5
10 B 5
11 B 5
I tried it with this but I doesn't work:
df %>%
group_by(grp = cumsum(!is.na(ID) |!lag(ID,1) == ID))
Use na.locf0 from zoo to fill in the NAs and then apply rleid from data.table:
library(data.table)
library(zoo)
rleid(na.locf0(df$ID))
## [1] 1 2 2 2 2 3 4 4 5 5 5
Using tidyr and dplyr, you could do:
df %>%
mutate(grp = fill(., ID) %>% pull(),
grp = cumsum(grp != lag(grp, default = first(grp))))
ID grp
1 C 0
2 B 1
3 B 1
4 B 1
5 <NA> 1
6 C 2
7 A 3
8 <NA> 3
9 B 4
10 B 4
11 B 4
Using rle
library(zoo)
with(rle(na.locf0(df$ID)), rep(seq_along(values), lengths))
#[1] 1 2 2 2 2 3 4 4 5 5 5

Transpose and Merge columns in R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
Quite new to R and I have a dataset in this format:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
But I want it in this format:
A 1
A 2
A 3
A 4
A 5
B 1
B 2
B 3
...etc.
Seems like such a simple issue but I need HELP! Thanks
df <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
stack(df)
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 1 C
12 2 C
13 3 C
14 4 C
15 5 C
Examples using dplyr's gather function:
library(tidyverse)
A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5)
C <- c(1,2,3,4,5)
df <- data.frame(A,B,C)
df %>% gather(key = "key", value = "value")
key value
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
You can use the package tidyr. This let's you choose, which columns you want to gather in the column "variable".
# if not installed yet
install.packages("tidyr")
library(tidyr)
data <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
data %>% pivot_longer(c(A, B, C), names_to = "variable", values_to = "value")
# Result
variable value
<chr> <int>
1 A 1
2 B 1
3 C 1
4 A 2
5 B 2
6 C 2
7 A 3
8 B 3
9 C 3
10 A 4
11 B 4
12 C 4
13 A 5
14 B 5
15 C 5

Keep rows with specific string and the following row

This is my data frame
df <- data.frame(
id = 1:14,
group_id = c(rep(1:2, each = 3), rep(3:4, each = 4)),
type = rep("A", 14), stringsAsFactors = FALSE)
df[c(2,4,8,12),"type"] <- "B"
id group_id type
1 1 1 A
2 2 1 B
3 3 1 A
4 4 2 B
5 5 2 A
6 6 2 A
7 7 3 A
8 8 3 B
9 9 3 A
10 10 3 A
11 11 4 A
12 12 4 B
13 13 4 A
14 14 4 A
I'd like to keep all rows with type B as well as the following row.
I could do...
B <- which(df$type=="B")
afterB <- B+1
df_sel <- df[c(B, afterB), ]
df_sel <- df_sel[order(df_sel$id),]
df_sel
...to get what I want.
id group_id type
2 2 1 B
3 3 1 A
4 4 2 B
5 5 2 A
8 8 3 B
9 9 3 A
12 12 4 B
13 13 4 A
How can this be done in a more generic way.
Another way, very similar to what you do but in one step and without the need to reorder:
df_sel <- df[rep(which(df$type=="B"), e=2)+c(0, 1), ]
df_sel
# id group_id type
# 2 2 1 B
# 3 3 1 A
# 4 4 2 B
# 5 5 2 A
# 8 8 3 B
# 9 9 3 A
# 12 12 4 B
# 13 13 4 A
Using lag from dplyr
library(dplyr)
df[df$type == "B" | lag(df$type == "B", default = FALSE), ]
# id group_id type
#2 2 1 B
#3 3 1 A
#4 4 2 B
#5 5 2 A
#8 8 3 B
#9 9 3 A
#12 12 4 B
#13 13 4 A
using grep will provide a row index of all instances of B - rows; concatenate (c()) this with rows + 1 to select from df will work.
rows <- grep("B", df[, "type"])
df[sort(c(rows, rows + 1)), ]
gives:
id group_id type
2 2 1 B
3 3 1 A
4 4 2 B
5 5 2 A
8 8 3 B
9 9 3 A
12 12 4 B
13 13 4 A

Dynamic select expression in function [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
In to this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns. So it would also work if I had S1, S2, S3, S4 or if there was an additional category ie DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns which starts with the specific pattern. Therefore I create a regex which matches the column names and then first gather all of them, extract the group and the rename the groups with the cnames. Finally spread takes separates the new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows when the number of id is less than 5. The variable "id" is the grouping variable, and the groups to delete when the number of rows in a group is less than 5. In DT, need to determine which groups have less than 5 members, (groups "1" and "4") and then remove those rows.
1: a 3 5 2
2: b 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: c 3 2 3
8: c 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)

Resources