How to group_by and summarize multiple variables using regex? - r

I want to use regex to identify the variables to use to group_by and to summarize my data efficiently. I cannot do this separately because I have a large number of variables to summarize, and the grouping variable needs to be passed dynamically each time. data.table accepts regex to pass the grouping variable, but not the summarizing variables. My attempts so far using the tidyverse have been unsuccessful as well. Any help would be much appreciated.
My data:
tempDF <- structure(list(d1 = c("A", "B", "C", "A", "C"), d2 = c(40L, 50L, 20L, 50L, 20L),
                         d3 = c(20L, 40L, 50L, 40L, 50L), d4 = c(60L, 30L, 30L, 60L, 30L),
                         p_A = c(1L, 3L, 2L, 3L, 2L), p_B = c(3L, 4L, 3L, 3L, 4L),
                         p_C = c(2L, 1L, 1L, 2L, 1L), p4 = c(5L, 5L, 4L, 5L, 4L)),
                    class = "data.frame", row.names = c(NA, -5L))
View(tempDF)
lLevels<-c("d1")
lContinuum<-c("p_A", "p_B", "p_C")
My attempts:
setDT(tempDF)[ , list(group_means = mean(eval((paste0(lContinuum)))), by=eval((paste0(lLevels))))]
group_means by
1: NA d1
Warning message:
In mean.default(eval((paste0(lContinuum)))) :
argument is not numeric or logical: returning NA
But a single variable works:
setDT(tempDF)[ , list(group_means = mean(p_A)), by=eval((paste0(lLevels)))]
setDT(tempDF)[ , list(group_means = mean(p_B)), by=eval((paste0(lLevels)))]
setDT(tempDF)[ , list(group_means = mean(p_C)), by=eval((paste0(lLevels)))]
Expected output:
tempDF %>%
  group_by(d1) %>%
  summarise(p_A_mean = mean(p_A), p_B_mean = mean(p_B), p_C_mean = mean(p_C))
# A tibble: 3 x 4
d1 p_A_mean p_B_mean p_C_mean
<chr> <dbl> <dbl> <dbl>
1 A 2 3 2
2 B 3 4 1
3 C 2 3.5 1

The data.table approach is very simple:
library(data.table)
setDT(tempDF)
tempDF[, lapply(.SD, mean),
       by = lLevels,
       .SDcols = lContinuum]
d1 p_A p_B p_C
1: A 2 3.0 2
2: B 3 4.0 1
3: C 2 3.5 1
A similar approach in dplyr would be:
library(dplyr)
tempDF %>%
  group_by_at(lLevels) %>%
  summarize_at(lContinuum, mean)
# A tibble: 3 x 4
d1 p_A p_B p_C
<chr> <dbl> <dbl> <dbl>
1 A 2 3 2
2 B 3 4 1
3 C 2 3.5 1
In either case, you can replace lLevels and lContinuum with regex. The dplyr option would also allow select helpers such as starts_with() and ends_with():
https://www.rdocumentation.org/packages/tidyselect/versions/0.2.5/topics/select_helpers
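For example, a minimal sketch of the regex-based selection (the patterns "^d1$" and "^p_[A-C]$" are illustrative assumptions, not taken from the question):
library(data.table)
library(dplyr)
# data.table: build the grouping and summary column sets from regex matches on the names
grp_cols <- grep("^d1$", names(tempDF), value = TRUE)
sum_cols <- grep("^p_[A-C]$", names(tempDF), value = TRUE)
setDT(tempDF)[, lapply(.SD, mean), by = grp_cols, .SDcols = sum_cols]
# dplyr: the same selection via tidyselect's matches()
tempDF %>%
  group_by_at(vars(matches("^d1$"))) %>%
  summarize_at(vars(matches("^p_[A-C]$")), mean)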

I'm sure this could be made more efficient/succinct, but it meets the spec:
summarise_df <- function(df, grouping_var){
  # Store the grouping variable's name as a string:
  grouping_vec <- gsub(".*[$]", "", deparse(substitute(grouping_var)))
  # Split-apply-combine: return a list of per-group means of the numeric columns
  lapply(split(df[, sapply(df, is.numeric)], df[, grouping_vec]),
         function(x) sapply(x, mean))
}
tmp <- do.call(rbind, summarise_df(tempDF, tempDF$d1))
tmp_df <- data.frame(cbind(d1 = row.names(tmp), tmp), row.names = NULL)
With the summary vars dynamic too:
summarise_df <- function(df, grouping_var, summary_vars){
  # Store the grouping variable's name as a string:
  grouping_vec <- gsub(".*[$]", "", deparse(substitute(grouping_var)))
  # Split-apply-combine: return a list of per-group means of the chosen columns
  lapply(split(df[, summary_vars], df[, grouping_vec]),
         function(x) sapply(x, mean))
}
tmp <- do.call(rbind, summarise_df(tempDF, tempDF$d1, c("p_A", "p_B", "p_C")))
tmp_df <- data.frame(cbind(d1 = row.names(tmp), tmp), row.names = NULL)

Though it looks a bit roundabout, reshaping this into long form will allow grouping not only by d1 but also by however many of the p_A ... p_C columns are present in the dataset.
Edit: I also added code to keep certain columns (d_cols) by regex.
library(tidyverse)
tempDF <- structure(
list(d1 = c("A", "B", "C", "A", "C"),
d2 = c(40L, 50L, 20L, 50L, 20L),
d3 = c(20L, 40L, 50L, 40L, 50L),
d4 = c(60L, 30L, 30L,60L, 30L),
d5 = c("AA", "BB", "CC", "AA", "CC"),
p_A = c(1L, 3L, 2L, 3L, 2L),
p_B = c(3L, 4L, 3L, 3L, 4L),
p_C = c(2L, 1L, 1L,2L, 1L),
p4 = c(5L, 5L, 4L, 5L, 4L)),
class = "data.frame",
row.names = c(NA, -5L))
# columns of d to keep, in strings
d_cols <- str_subset(colnames(tempDF), "d[15]")
tempDF %>%
  pivot_longer(cols = matches("p_")) %>%
  group_by(!!!syms(d_cols), name) %>%
  summarize(mean = mean(value)) %>%
  pivot_wider(id_cols = d_cols,
              values_from = mean,
              names_prefix = "mean_")
#> # A tibble: 3 x 5
#> # Groups: d1, d5 [3]
#> d1 d5 mean_p_A mean_p_B mean_p_C
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 A AA 2 3 2
#> 2 B BB 3 4 1
#> 3 C CC 2 3.5 1
Created on 2019-10-19 by the reprex package (v0.3.0)
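A rough data.table counterpart of the same long-format idea (a sketch only; it reuses d_cols from above, and the result columns come out as p_A ... p_C rather than mean_p_A ... mean_p_C):
library(data.table)
setDT(tempDF)
# melt the p_ columns to long form, compute group means, then cast back to wide
long <- melt(tempDF, id.vars = d_cols, measure.vars = patterns("^p_"))
dcast(long[, .(mean = mean(value)), by = c(d_cols, "variable")],
      ... ~ variable, value.var = "mean")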

Related

Remove row within groups if coordinates of subgroup are within another subgroup in r

I have a dataframe such as
Groups NAMES start end
G1 A 1 50
G1 A 25 45
G1 B 20 51
G1 A 51 49
G2 A 200 400
G2 B 1 1600
G2 A 2000 3000
G2 B 4000 5000
and the idea is, within each Groups value, to look at the NAMES where the start and end coordinates of an A row fall within the coordinates of a B row.
For instance, in this example:
Groups NAMES start end
G1 A 1 50 <- A is outside any B coordinates
G1 A 25 45 <- A is **inside** the B coordinates `20-51`, so I remove that B row.
G1 B 20 51
G1 A 51 49 <- A is outside any B coordinates
G2 A 200 400 <- A is **inside** the B coordinates 1-1600, so I remove that B row.
G2 B 1 1600
G2 A 2000 3000 <- A is outside any B coordinates
G2 B 4000 5000 <- this one does not have any A inside it, so it will be kept in the output.
Then I should get as output:
Groups NAMES start end
G1 A 1 50
G1 A 25 45
G1 A 51 49
G2 A 200 400
G2 A 2000 3000
G2 B 4000 5000
Does someone have an idea, please?
Here is the dataframe in dput format, in case it helps:
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"),
               NAMES = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
               start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L),
               end = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)),
          class = "data.frame", row.names = c(NA, -8L))
Here's a possible approach. We'll split the df by NAMES and join the two parts to each other by Groups to do within-group comparisons. Only B rows can get dropped, so those are the only ones whose row numbers we need to keep track of.
We can then group by rowid to tag each B row by whether or not it has any A inside it. Finally, filter to the B rows to keep and concatenate back to the A rows.
library(tidyverse)
df <- structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA, -8L))
A <- filter(df, NAMES == "A")
B <- df %>%
  filter(NAMES == "B") %>%
  rowid_to_column()
comparison <- inner_join(A, B, by = "Groups") %>%
  mutate(A_in_B = start.x >= start.y & end.x <= end.y) %>%
  group_by(rowid) %>%
  summarise(keep_B = !any(A_in_B))
B %>%
  inner_join(comparison, by = "rowid") %>%
  filter(keep_B) %>%
  select(-rowid, -keep_B) %>%
  bind_rows(A) %>%
  arrange(Groups, NAMES)
#> Groups NAMES start end
#> 1 G1 A 1 50
#> 2 G1 A 25 45
#> 3 G1 A 51 49
#> 4 G2 A 200 400
#> 5 G2 A 2000 3000
#> 6 G2 B 4000 5000
Created on 2021-07-27 by the reprex package (v1.0.0)
This will also do, using purrr::map_dfr:
library(tidyverse)
df %>%
  group_split(Groups) %>%
  map_dfr(~ .x %>%
            mutate(r = row_number()) %>%
            full_join(.x %>% filter(NAMES == 'B'), by = 'Groups') %>%
            group_by(r) %>%
            filter(any(NAMES.x == 'B' | start.x > start.y & end.x < end.y)) %>%
            ungroup %>%
            select(Groups, ends_with('.x')) %>%
            distinct %>%
            rename_with(~ gsub('\\.x', '', .), everything()))
#> # A tibble: 6 x 4
#> Groups NAMES start end
#> <fct> <fct> <int> <int>
#> 1 G1 A 25 45
#> 2 G1 B 20 51
#> 3 G1 A 51 49
#> 4 G2 A 200 400
#> 5 G2 B 1 1600
#> 6 G2 B 4000 5000
Created on 2021-07-27 by the reprex package (v2.0.0)

How to combine transmute with grep function?

I'm trying to find a way to create a new table of variables using the rowSums() function on an existing dataframe. For example, my existing dataframe is called 'asn', and I want to sum up, for each row, the values of all variables whose names contain "2011". I want a new table consisting of just one column called asn_y2011 which contains that row sum.
Data
structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L, 0L, 4L),
               south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L),
               north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)),
          class = "data.frame", row.names = c(NA, -3L))
The existing 'asn' dataframe looks like this
row south_2010 south_2011 south_2012 north_2010 north_2011 north_2012
1 1 4 5 3 2 1
2 5 0 8 4 6 1
3 7 4 6 1 0 2
I'm trying to use the following function:
asn %>%
transmute(asn_y2011 = rowSums(, grep("2011")))
to get something like this
row asn_y2011
1 6
2 6
3 4
Continuing with your code, grep() should work like this:
library(dplyr)
asn %>%
transmute(row, asn_y2011 = rowSums(.[grep("2011", names(.))]))
# row asn_y2011
# 1 1 6
# 2 2 6
# 3 3 4
Or you can use tidy selection in c_across():
asn %>%
  rowwise() %>%
  transmute(row, asn_y2011 = sum(c_across(contains("2011")))) %>%
  ungroup()
Another base R option using rowSums
cbind(asn[1],asn_y2011 = rowSums(asn[grep("2011",names(asn))]))
which gives
row asn_y2011
1 1 6
2 2 6
3 3 4
An option in base R with Reduce
cbind(df['row'], asn_y2011 = Reduce(`+`, df[endsWith(names(df), '2011')]))
# row asn_y2011
#1 1 6
#2 2 6
#3 3 4
data
df <- structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L, 0L, 4L),
                     south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L),
                     north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)),
                class = "data.frame", row.names = c(NA, -3L))
I think that this code will do what you want:
library(magrittr)
tibble::tibble(row = 1:3, south_2011 = c(4, 0, 4), north_2011 = c(2, 6, 0)) %>%
  tidyr::gather(-row, key = "key", value = "value") %>%
  dplyr::mutate(year = purrr::map_chr(.x = key, .f = function(x) stringr::str_split(x, pattern = "_")[[1]][2])) %>%
  dplyr::group_by(row, year) %>%
  dplyr::summarise(sum(value))
I first load the package magrittr so that I can use the pipe, %>%. I've explicitly listed the packages from which the functions are exported, but you are welcome to load the packages with library if you like.
I then create a tibble, or data frame, like what you specify.
I use gather to reorganize the data frame before creating a new variable, year. I then summarise the counts by value of row and year.
You can try this approach
library(tidyverse)
df2 <- df %>%
  select(grep("_2011|row", names(df), value = TRUE)) %>%
  rowwise() %>%
  mutate(asn_y2011 = sum(c_across(south_2011:north_2011))) %>%
  select(row, asn_y2011)
# row asn_y2011
# <int> <int>
# 1 1 6
# 2 2 6
# 3 3 4
Data
df <- structure(list(row = 1:3, south_2010 = c(1L, 5L, 7L), south_2011 = c(4L, 0L, 4L), south_2012 = c(5L, 8L, 6L), north_2010 = c(3L, 4L, 1L), north_2011 = c(2L, 6L, 0L), north_2012 = c(1L, 1L, 2L)), class = "data.frame", row.names = c(NA,-3L))
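Assuming dplyr >= 1.0, the same row sum over the question's asn can also be written with across() inside transmute():
library(dplyr)
asn %>%
  transmute(row, asn_y2011 = rowSums(across(contains("2011"))))  # sum the matched 2011 columns row-wise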

Subsetting a data frame according to recursive rows and creating a column for ordering

Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries in columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and finally to create another column (labelled C) which records the order of the rows within each id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
Here is one option with dplyr, taking the top 3 values based on A (following the comments of @Ronak Shah).
library(dplyr)
df %>%
  group_by(id) %>%
  top_n(n = 3, wt = A) %>%                     # top 3 values based on A
  mutate(C = rank(id, ties.method = "first"))  # C consists of the order within each id
# A tibble: 6 x 4
# Groups: id [3]
id A B C
<int> <int> <int> <int>
1 1 20 1 1
2 1 12 1 2
3 1 13 0 3
4 2 11 1 1
5 2 21 0 2
6 3 17 0 1
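If the goal is literally the first three rows per id in their existing order (rather than the top three by A), a minimal dplyr sketch, assuming dplyr >= 1.0 for slice_head(), would be:
library(dplyr)
df %>%
  group_by(id) %>%
  slice_head(n = 3) %>%          # keep at most the first 3 rows per id
  mutate(C = row_number()) %>%   # C records the order within each id
  ungroup()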

Calculate max value across multiple columns by multiple groups

I have a data file with numeric values in three columns and two grouping variables (ID and Group) from which I need to calculate a single max value by ID and Group:
structure(list(ID = structure(c(1L, 1L, 1L, 2L), .Label = c("a1", "a2"), class = "factor"),
               Group = structure(c(1L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
               Score1 = c(10L, 0L, 0L, 5L), Score2 = c(0L, 0L, 5L, 10L),
               Score3 = c(0L, 11L, 2L, 11L)), class = "data.frame", row.names = c(NA, -4L))
The result I am trying to obtain is:
structure(list(ID = structure(c(1L, 1L, 2L), .Label = c("a1", "a2"), class = "factor"),
               Group = structure(c(1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
               Max = c(11L, 5L, 11L)), class = "data.frame", row.names = c(NA, -3L))
I am trying the following in dplyr:
SampTable <- SampDF %>%
  group_by(ID, Group) %>%
  summarize(max = pmax(SampDF$Score1, SampDF$Score2, SampDF$Score3))
But it generates this error:
Error in summarise_impl(.data, dots) :
Column `max` must be length 1 (a summary value), not 4
Is there an easy way to achieve this in dplyr or data.table?
A solution using data.table: find the max value over columns 3:5 (the Score columns) by ID and Group.
library(data.table)
setDT(d)
d[, .(Max = do.call(max, .SD)), .SDcols = 3:5, .(ID, Group)]
ID Group Max
1: a1 abc 11
2: a1 def 5
3: a2 def 11
Data:
d <- structure(list(ID = structure(c(1L, 1L, 1L, 2L), .Label = c("a1", "a2"), class = "factor"),
                    Group = structure(c(1L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
                    Score1 = c(10L, 0L, 0L, 5L), Score2 = c(0L, 0L, 5L, 10L),
                    Score3 = c(0L, 11L, 2L, 11L)), class = "data.frame", row.names = c(NA, -4L))
A solution using tidyverse.
library(tidyverse)
dat2 <- dat1 %>%
  gather(Column, Value, starts_with("Score")) %>%
  group_by(ID, Group) %>%
  summarise(Max = max(Value)) %>%
  ungroup()
dat2
# # A tibble: 3 x 3
# ID Group Max
# <fct> <fct> <dbl>
# 1 a1 abc 11
# 2 a1 def 5
# 3 a2 def 11
Here are a couple of other options with tidyverse.
library(tidyverse)
df1 %>%
  group_by(ID, Group) %>%
  nest %>%
  mutate(Max = map_dbl(data, ~ max(unlist(.x)))) %>%
  select(-data)
Or using pmax
df1 %>%
  mutate(Max = pmax(!!! rlang::syms(names(.)[3:5]))) %>%
  group_by(ID, Group) %>%
  summarise(Max = max(Max))
# A tibble: 3 x 3
# Groups: ID [?]
# ID Group Max
# <fct> <fct> <dbl>
#1 a1 abc 11
#2 a1 def 5
#3 a2 def 11
Or using base R
aggregate(cbind(Max = do.call(pmax, df1[3:5])) ~ ID + Group, df1, max)
Here is a tidyverse solution using nest:
library(tidyverse)
df %>%
  nest(-(1:2), .key = "Max") %>%
  mutate_at("Max", map_dbl, max)
# ID Group Max
# 1 a1 abc 11
# 2 a1 def 5
# 3 a2 def 11
In base R:
res <- aggregate(. ~ ID + Group,df,max)
res <- cbind(res[1:2], Max = do.call(pmax,res[-(1:2)]))
res
# ID Group Max
# 1 a1 abc 11
# 2 a1 def 5
# 3 a2 def 11
Here is a base R solution
# gives 2x2 table
x <- by(df[, !names(df) %in% c("ID", "Group")], list(df$ID, df$Group), max)
# get requested format
tmp <- expand.grid(ID = rownames(x), Group = colnames(x))
tmp$Max <- as.vector(x)
tmp[complete.cases(tmp), ]
#R ID Group Max
#R 1 a1 abc 11
#R 3 a1 def 5
#R 4 a2 def 11
with
df <- structure(list(
ID = structure(c(1L, 1L, 1L, 2L), .Label = c("a1", "a2"), class = "factor"),
Group = structure(c(1L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
Score1 = c(10L, 0L, 0L, 5L), Score2 = c(0L, 0L, 5L, 10L),
Score3 = c(0L, 11L, 2L, 11L)),
class = "data.frame", row.names = c(NA, -4L))
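Assuming a recent dplyr (>= 1.0), the row-wise maximum can also be taken with c_across(); a sketch:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(RowMax = max(c_across(starts_with("Score")))) %>%  # per-row max over the Score columns
  group_by(ID, Group) %>%
  summarise(Max = max(RowMax), .groups = "drop")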

Conditional data manipulation using data.table in R

I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                        time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L),
                        value = c(50L, 52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)),
                   .Names = c("group", "time", "value"), class = "data.frame", row.names = c(NA, -9L))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to only add these missing time values IF the group value appears in testx. In this example, I thus only want to add missing time values for the groups that match the group values in testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, which is why I prefer using data.table.
You don't need to refer to testy when you are within testy[] and grouping: using group directly as a variable gives the correct result. You need an extra else branch to return the rows whose group is not in testx if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37
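If you would rather drop the groups that do not appear in testx altogether (instead of returning them unexpanded), returning NULL in j removes them; a sketch reusing the question's key = 'time' setup:
library(data.table)
setDT(testy, key = "time")
testy[, if (group %in% testx$group) .SD[J(min(time):max(time))], by = group]  # groups not in testx return NULL and are dropped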
