Avoiding missing row after summarise

Avoiding missing row after summarise - r

I'm using RStudio Version 0.98.1028 on windows. Summarising a multi level data frame, package dplyr, using the function sum(), I lost a row, which had sum = 0. In other words, if my original data frame was something like
group <- as.factor(rep(c('X', 'Y'), each = 1, times = 6))
type <- as.factor(rep(c('a', 'b'), each = 2, times = 3))
day <- as.factor(rep(1:3, each = 4))
df = data.frame(type = type, day = day, value = abs(rnorm(12)))
df = df[day != 1 | type != 'a',]
and I summarise it
df1 = df %>%
group_by(day, type) %>%
summarise(sum = sum(value))
then I get one missing row, which is the interaction between day = 1 and type = a, which I would like to have (even if it's 0...)
Thanks in advance!
EB

You could try left_join
library(dplyr)
left_join(expand.grid(type=unique(df$type), day=unique(df$day)), df1) %>%
group_by(day, type) %>%
summarise(sum=sum(value, na.rm=TRUE))
# day type sum
#1 1 a 0.0000000
#2 1 b 0.5132914
#3 2 a 1.2482210
#4 2 b 0.9232343
#5 3 a 2.0381779
#6 3 b 0.7558351
where df1 is
df1 <- df[day != 1 | type != 'a',]

Related

How can I search in a data frame if all possible combinations exist in another data frame in R using dplyr?

I have two data frames.
The first one that contains all the possible combinations with their corresponding values and looks like this:
first
second
val
A
B
10
A
C
20
A
D
30
B
C
40
B
D
50
C
D
60
H
I
70
The second one that comes from the production line has two columns the date column that has grouped all the variables corresponding to their date and are concatenated:
date
var
2022-01-01
A
2022-02-01
B,C,F,E,G,H,I
I want to find all the combinations in the second data frame and to see if they match with any combinations in the first data frame. If a variable stands alone in the second data frame as A in 2022-01-01 to give me the 0 and otherwise the value of the combination.
Ideally I want the resulting data frame to look like this:
date
comb
val
2022-01-01
A
0
2022-02-01
B,C
40
2022-02-01
H,I
70
How can I do this in R using dplyr?
library(tidyverse)
first = c("A","A","A","B","B","C","H")
second = c("B","C","D","C","D","D","I")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("A","B,C,F,E,G,H,I")
df2 = tibble(date,var);df2

Using tidyverse:
library(tidyverse)
first = c("A","A","A","B","B","C","H")
second = c("B","C","D","C","D","D","I")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("A","B,C,F,E,G,H,I")
df2 = tibble(date,var);df2
df2_tidy <- df2 %>%
mutate(first = str_split(var, ","),
second = first) %>%
unnest(first) %>%
unnest(second) %>%
select(-var)
singles <- df2 %>%
filter(!str_detect(var, ",")) %>%
mutate(val = 0) %>%
select(date, comb = var, val)
combs <- df1 %>%
inner_join(df2_tidy, by = c("first", "second")) %>%
mutate(comb = paste(first, second, sep = ",")) %>%
select(date, comb, val)
bind_rows(singles, combs)

Combine attributes from two columns and sum the values from duplicate rows

This question is slightly modified from this one.
I have a dataframe in long table format like this:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
About the data: The value from the second row 50 corresponds to the value broad from the first row. Similarly, the value from the fourth row 50 corresponds to the value mangrove from the third row and so on.. In simple words, values for name c are related with name a.
I want to combine the value in such a way that I could get the corresponding values for each name, which would also aggregate the values with similar names:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
which should look like this:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50

Using reshape2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
Column names can be changed if desired, also "c_" can be added to the names with paste.

Using tidyverse:
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
Tha main problem you find in your dataframe is that a single column is containing, names and values, and that is the first thing you should fix. My advice is always modify the original dataframe into a tidy format (https://tidyr.tidyverse.org/articles/tidy-data.html) and from there leverage all tidyverse power, or data.table or your framework of choice.
Notice the temporal variable value_a could be included in the pipeline directly I have not done it for clarity. The main idea is to separate values and species in different columns, the first three calls in the pipeline, and then apply the usual tidyverse operations.

Might not be the most elegant, but it works:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
Here is somthing improved, that actually is humanly readable:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)

Base R solution:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)

additional option
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(ID, names_from = name, values_from = value) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = T), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
Created on 2021-04-12 by the reprex package (v2.0.0)

Applying functions in dplyr pipes

Given a data frame like data:
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
We want to filter values for group == b using dplyr and use boxplot.stats to identify outliers:
library(dplyr)
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
This returns the error Column out.stats must be length 1 (a summary value), not 4, why does this not work? How do you apply functions like this inside a pipe?

The following answers to the question and to the last comment to the question, where the OP asks for the row numbers of the outliers.
what if we want to return the row numbers that go with
boxplot.stats()$out from the pipe? so if we did
b<-data%>%filter(group=='b') outside of the pipe, we could have used:
which(b$value %in% boxplot.stats(b$value)$out)
This is done by left_joining with the original data.
library(dplyr)
set.seed(1234)
data <- data.frame(group = rep(c('a','b'), each= 100),
value = rnorm(200))
data %>% filter(group == 'b') %>% pull(value) %>%
boxplot.stats() %>% '[['('out') %>%
data.frame() %>%
left_join(data, by = c('.' = 'value'))
# . group
#1 3.043766 b
#2 -2.732220 b
#3 -2.855759 b

We can use the new version of dplyr which can also return summarise with more than one row
library(dplyr) # >= 1.0.0
data%>%
filter(group == 'b')%>%
summarise(out.stats = boxplot.stats(value))
# out.stats
#1 -2.4804222, -0.7546693, 0.1304050, 0.6390749, 2.2682247
#2 100
#3 -0.08980661, 0.35061653
#4 -3.014914

issues calculating rowwise maximum

suppose I have a tibble dat below, what I would like to do is to calculate maximum of (x 2, x 3) and then minus x 1, where x can be either a or b. In my real data I have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. tried many things, seems not working as I wanted them to, still struggling with the string vs column name thing..
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)

Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package to convert the data frame from the wide format to long format. I used gather twice because the first column of each group is different than other columns in the same group. ends_with(" 1") is a select helper function from the dplyr, which select the column with a name ending in " 1". Notice that the space in " 1" is important because "1" may select other columns like a 11 if such columns exist.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, which similar to Solution 2. It uses do to conduct operation to each Group hence making the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(data_frame(Max = max(.$Value[.$Column_Number != 1]),
First = .$Value[.$Column_Number == 1])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5

Grouped operation on all groups relative to "baseline" group, with multiple observations

Starting with data containing multiple observations for each group, like this:
set.seed(1)
my.df <- data.frame(
timepoint = rep(c(0, 1, 2), each= 3),
counts = round(rnorm(9, 50, 10), 0)
)
> my.df
timepoint counts
1 0 44
2 0 52
3 0 42
4 1 66
5 1 53
6 1 42
7 2 55
8 2 57
9 2 56
To perform a summary calculation at each timepoint relative to timepoint == 0, for each group I need to pass a vector of counts for timepoint == 0 and a vector of counts for the group (e.g. timepoint == 0) to an arbitrary function, e.g.
NonsenseFunction <- function(x, y){
(mean(x) - mean(y)) / (1 - mean(y))
}
I can get the required output from this table, either with dplyr:
library(dplyr)
my.df %>%
group_by(timepoint) %>%
mutate(rep = paste0("r", 1:n())) %>%
left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
group_by(timepoint.x) %>%
summarise(result = NonsenseFunction(counts.x, counts.y))
or data.table:
library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
, NonsenseFunction(counts.x, counts.y), by = timepoint.x]
This only works if the number of observations between groups is the same. Anyway, the observations aren't matched, so using the temporary rep variable seems hacky.
For a more general case, where I need to pass vectors of the baseline values and the group's values to an arbitrary (more complicated) function, is there an idiomatic data.table or dplyr way of doing so with a grouped operation for all groups?

Here's the straightforward data.table approach:
my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]
This probably grabs my.dt[timepoint==0, counts] again and again, for each group. You could instead save that value ahead of time:
v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]
... or if you don't want to add v to the environment, maybe
with(list(v = my.dt[timepoint==0, counts]),
my.dt[, f(counts, v), by=timepoint]
)

You could give the second argument to use the vector from your group of interest as a constant.
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))
Or if you want to make it beforehand:
constant = = my.df$counts[my.df$timepoint == 0]
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, constant))

You can try,
library(dplyr)
my.df %>%
mutate(new = mean(counts[timepoint == 0])) %>%
group_by(timepoint) %>%
summarise(result = NonsenseFunction(counts, new))
# A tibble: 3 × 2
# timepoint result
# <dbl> <dbl>
#1 0 0.0000000
#2 1 0.1398601
#3 2 0.2097902

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Avoiding missing row after summarise - r

Related

How can I search in a data frame if all possible combinations exist in another data frame in R using dplyr?

Combine attributes from two columns and sum the values from duplicate rows

Applying functions in dplyr pipes

issues calculating rowwise maximum

Grouped operation on all groups relative to "baseline" group, with multiple observations

Categories

Resources