I want to apply a function to all pairs of items in the same group.
Example input:
Group Item Value
A 1 89
A 2 76
A 3 2
B 4 21
B 5 10
The desired output is a vector of the function output for all items in the same group.
For argument's sake, suppose the function is:
addnums=function(x,y){
x+y
}
Then the desired output would be:
165, 91, 78, 31
I have tried to do this using summarize in the dplyr package but this can only be used if the output is a single value.
We can split Value by Group and then use combn to calculate the sum for each pair.
sapply(split(df$Value, df$Group), combn, 2, sum)
#$A
#[1] 165 91 78
#$B
#[1] 31
If needed as one vector we can use unlist.
unlist(sapply(split(df$Value, df$Group), combn, 2, sum), use.names = FALSE)
#[1] 165 91 78 31
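The calls above hard-code sum. To plug in an arbitrary two-argument function such as addnums, a minimal base R sketch could look like the following. The helper name pairwise is hypothetical, and df is rebuilt here from the question's example table. The length check is there because combn() stops with an error when it gets fewer than two values, and it treats a single positive integer n as 1:n.

```r
# Example data rebuilt from the question's table
df <- data.frame(Group = c("A", "A", "A", "B", "B"),
                 Item  = 1:5,
                 Value = c(89, 76, 2, 21, 10))

addnums <- function(x, y) x + y

# Hypothetical helper: apply a two-argument function f to every pair in v
pairwise <- function(v, f) {
  if (length(v) < 2) return(numeric(0))  # a group with one item has no pairs
  # Work on positions so a single numeric value is never expanded to 1:n
  combn(seq_along(v), 2, FUN = function(idx) f(v[idx[1]], v[idx[2]]))
}

res <- unlist(lapply(split(df$Value, df$Group),
                     function(v) pairwise(v, addnums)),
              use.names = FALSE)
res
```

Groups with a single item simply contribute nothing to the result instead of raising an error.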
If you are interested in a tidyverse solution using the same logic, we can do
library(dplyr)
library(purrr)
df %>%
group_split(Group) %>%
map(~combn(.x %>% pull(Value), 2, sum)) %>% flatten_dbl
#[1] 165 91 78 31
We can use a group by option with data.table
library(data.table)
setDT(df1)[, combn(Value, 2, FUN = sum), Group]
# Group V1
#1: A 165
#2: A 91
#3: A 78
#4: B 31
If we want to use addnums from the OP's post
setDT(df1)[, combn(Value, 2, FUN = function(x) addnums(x[1], x[2])), Group]
# Group V1
#1: A 165
#2: A 91
#3: A 78
#4: B 31
Or using tidyverse
library(dplyr)
library(tidyr)
df1 %>%
group_by(Group) %>%
summarise(Sum = list(combn(Value, 2, FUN = sum))) %>%
unnest(Sum)
# A tibble: 4 x 2
# Group Sum
# <chr> <int>
#1 A 165
#2 A 91
#3 A 78
#4 B 31
Using addnums
df1 %>%
group_by(Group) %>%
summarise(Sum = list(combn(Value, 2, FUN =
function(x) addnums(x[1], x[2])))) %>%
unnest(Sum)
Or using base R with aggregate
aggregate(Value ~ Group, df1, FUN = function(x) combn(x, 2, FUN = sum))
data
df1 <- structure(list(Group = c("A", "A", "A", "B", "B"), Item = 1:5,
Value = c(89L, 76L, 2L, 21L, 10L)), class = "data.frame", row.names = c(NA,
-5L))
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#> # A tibble: 4 x 5
#> Date col1 thisCol thatCol col999
#> <date> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 1 NA 25 99
#> 2 2020-01-01 2 8 26 99
#> 3 2020-01-01 3 NA 27 99
#> 4 NA 4 3 28 99
My actual R data frame has hundreds of columns that aren't neatly named, but can be approximated by the df data frame above.
I want to replace all values of NA with 0, with the exception of several columns (in my example I want to leave out the Date column and the thatCol column). I'd want to do it in this sort of fashion:
df %>% replace(is.na(.), 0)
#> Error: Assigned data `values` must be compatible with existing data.
#> i Error occurred for column `Date`.
#> x Can't convert <double> to <date>.
#> Run `rlang::last_error()` to see where the error occurred.
And my unsuccessful ideas for accomplishing the "everything except" replace NA are shown below.
df %>% replace(is.na(c(., -c(Date, thatCol)), 0))
df %>% replace_na(list([, c(2:3, 5)] = 0))
df %>% replace_na(list(everything(-c(Date, thatCol)) = 0))
Is there a way to select everything BUT in the way I need to? There are hundreds of columns, named inconsistently, so typing them one by one is not a practical option.
You can use mutate_at :
library(dplyr)
Remove them by Name
df %>% mutate_at(vars(-c(Date, thatCol)), ~replace(., is.na(.), 0))
Remove them by position
df %>% mutate_at(-c(1,4), ~replace(., is.na(.), 0))
Select them by name
df %>% mutate_at(vars(col1, thisCol, col999), ~replace(., is.na(.), 0))
Select them by position
df %>% mutate_at(c(2, 3, 5), ~replace(., is.na(.), 0))
If you want to use replace_na
df %>% mutate_at(vars(-c(Date, thatCol)), tidyr::replace_na, 0)
Note that mutate_at is superseded by across as of dplyr 1.0.0.
You have several options here based on data.table.
One of the coolest options: setnafill (version >= 1.12.4):
library(data.table)
setDT(df)
data.table::setnafill(df, fill = 0, cols = colnames(df)[!(colnames(df) %in% c("Date", "thatCol"))])
Note that your dataframe is updated by reference.
Another base solution:
to_change<-grep("^(this|col)",names(df))
df[to_change]<- sapply(df[to_change],function(x) replace(x,is.na(x),0))
df
# A tibble: 4 x 5
Date col1 thisCol thatCol col999
<date> <dbl> <dbl> <int> <dbl>
1 2020-01-01 1 0 25 99
2 2020-01-01 2 8 26 99
3 2020-01-01 3 0 27 99
4 NA 0 3 28 99
Data (I changed one value):
df <- structure(list(Date = structure(c(18262, 18262, 18262, NA), class = "Date"),
col1 = c(1L, 2L, 3L, NA), thisCol = c(NA, 8, NA, 3), thatCol = 25:28,
col999 = c(99, 99, 99, 99)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
replace works on a data.frame, so we can just do the replacement by index and update the original dataset
df[-c(1, 4)] <- replace(df[-c(1, 4)], is.na(df[-c(1, 4)]), 0)
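The positional indices above break if the column order changes. A name-based variant, as a sketch, assuming the excluded columns are known by name (keep_as_is and target are hypothetical variable names, and the data is rebuilt here as a plain data.frame with one NA added to col1):

```r
df <- data.frame(
  Date    = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", NA)),
  col1    = c(1L, 2L, 3L, NA),
  thisCol = c(NA, 8, NA, 3),
  thatCol = 25:28,
  col999  = rep(99, 4)
)

# Columns that must keep their NAs (assumed to be known by name)
keep_as_is <- c("Date", "thatCol")

# Everything else gets its NAs replaced, column by column
target <- setdiff(names(df), keep_as_is)
df[target] <- lapply(df[target], function(x) replace(x, is.na(x), 0))
df
```

Because the replacement happens per column, the Date column is never touched and the type error from the question does not arise.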
Or using replace_na with across (from the new dplyr)
library(dplyr)
library(tidyr)
df %>%
mutate(across(-c(Date, thatCol), ~ replace_na(., 0)))
If you know the ones that you don't want to change, you could do it like this:
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), NA),
col1 = 1:4,
thisCol = c(NA, 8, NA, 3),
thatCol = 25:28,
col999 = rep(99, 4))
#dplyr
df_nonreplace <- select(df, c("Date", "thatCol"))
df_replace <- df[ ,!names(df) %in% names(df_nonreplace)]
df_replace[is.na(df_replace)] <- 0
df <- cbind(df_nonreplace, df_replace)
> head(df)
Date thatCol col1 thisCol col999
1 2020-01-01 25 1 0 99
2 2020-01-01 26 2 8 99
3 2020-01-01 27 3 0 99
4 <NA> 28 4 3 99
I have a dataset1 which is as follows:
dataset1 <- data.frame(
id1 = c(1, 1, 1, 2, 2, 2),
id2 = c(122, 122, 122, 133, 133, 133),
num1 = c(1, NA, NA, 50,NA, NA),
num2 = c(NA, 2, NA, NA, 45, NA),
num3 = c(NA, NA, 3, NA, NA, 4)
)
How to convert multiple rows into a single row?
The desired output is:
id1, id2, num1, num2, num3
1 122 1 2 3
2 133 50 45 4
library(dplyr)
dataset1 %>% group_by(id1, id2) %>%
summarise_all(funs(.[!is.na(.)])) %>%
as.data.frame()
# id1 id2 num1 num2 num3
# 1 1 122 1 2 3
# 2 2 133 50 45 4
Note: Assuming there will be only 1 non-NA item in a column.
Using data.table
library(data.table)
data.table(dataset1)[, lapply(.SD, sum, na.rm = TRUE), by = c("id1", "id2")]
# id1 id2 num1 num2 num3
#1: 1 122 1 2 3
#2: 2 133 50 45 4
You can use dplyr to achieve that:
library(dplyr)
dataset1 %>%
group_by(id1, id2) %>%
mutate(
num1 = sum(num1, na.rm=T),
num2 = sum(num2, na.rm=T),
num3 = sum(num3, na.rm=T)
) %>%
distinct()
The output has one row per id1/id2 pair, matching the desired result.
This assumes that repeated values within a variable should be summed (if id1 = 1 has two values for num1, they are added together). If you're confident that every id has only one possible value for each of the num columns (num1 to num3), then you don't need to worry about it.
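If summing repeated values is not what you want, one alternative is to keep the first non-missing value per group instead. This is only a sketch in base R; first_non_na is a hypothetical helper name, and na.action = na.pass is needed so that aggregate() does not drop the rows containing NA before the helper sees them.

```r
dataset1 <- data.frame(
  id1  = c(1, 1, 1, 2, 2, 2),
  id2  = c(122, 122, 122, 133, 133, 133),
  num1 = c(1, NA, NA, 50, NA, NA),
  num2 = c(NA, 2, NA, NA, 45, NA),
  num3 = c(NA, NA, 3, NA, NA, 4)
)

# Hypothetical helper: first non-missing value, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse every num column within each id1/id2 group
res <- aggregate(. ~ id1 + id2, data = dataset1, FUN = first_non_na,
                 na.action = na.pass)
res
```

One difference from the sum-based versions: a group whose values are all NA stays NA here, whereas sum(..., na.rm = TRUE) returns 0 for it.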
I want to calculate differences by group. Although I have read the thread "R: Function “diff” over various groups" on SO, I am unable to compute the difference. I have tried three methods: a) spread, b) dplyr::mutate with base::diff(), c) data.table with base::diff(). While a) works, I am unsure how to solve this problem using b) and c).
Description about the data:
I have revenue data for the product by year. I have categorized years >= 2013 as Period 2 (called P2), and years < 2013 as Period 1 (called P1).
Sample data:
dput(Test_File)
structure(list(Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012,
2017, 2017, 2017, 2016, 2016, 2016, 2011, 2017), Name = c("Apple",
"Apple", "Banana", "Banana", "Banana", "Banana", "Apple", "Apple",
"Apple", "Banana", "Banana", "Banana", "Mango", "Pineapple"),
Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5,
7)), .Names = c("Ship_Date", "Name", "Revenue"), row.names = c(NA,
14L), class = "data.frame")
Expected Output
dput(Diff_Table)
structure(list(Name = c("Apple", "Banana", "Mango", "Pineapple"
), P1 = c(15, 58, 5, NA), P2 = c(75, 7, NA, 7), Diff = c(60,
-51, NA, NA)), .Names = c("Name", "P1", "P2", "Diff"), class = "data.frame", row.names = c(NA,
-4L))
Here's my code:
Method 1: Using spread [Works]
data.table::setDT(Test_File)
cutoff<-2013
Test_File[Test_File$Ship_Date>=cutoff,"Ship_Period"]<-"P2"
Test_File[Test_File$Ship_Date<cutoff,"Ship_Period"]<-"P1"
Diff_Table<- Test_File %>%
dplyr::group_by(Ship_Period,Name) %>%
dplyr::mutate(Revenue = sum(Revenue)) %>%
dplyr::select(Ship_Period, Name,Revenue) %>%
dplyr::ungroup() %>%
dplyr::distinct() %>%
tidyr::spread(key = Ship_Period,value = Revenue) %>%
dplyr::mutate(Diff = `P2` - `P1`)
Method 2: Using dplyr [Doesn't work: NAs are generated in Diff column.]
Diff_Table<- Test_File %>%
dplyr::group_by(Ship_Period,Name) %>%
dplyr::mutate(Revenue = sum(Revenue)) %>%
dplyr::select(Ship_Period, Name,Revenue) %>%
dplyr::ungroup() %>%
dplyr::distinct() %>%
dplyr::arrange(Name,Ship_Period, Revenue) %>%
dplyr::group_by(Ship_Period,Name) %>%
dplyr::mutate(Diff = diff(Revenue))
Method 3: Using data.table [Doesn't work: It generates all zeros in Diff column.]
Test_File[,Revenue1 := sum(Revenue),by=c("Ship_Period","Name")]
Diff_Table<-Test_File[,.(Diff = diff(Revenue1)),by=c("Ship_Period","Name")]
Question: Can someone please help me with method 2 and method 3 above? I am fairly new to R so I apologize if my work sounds too basic. I am still learning this language.
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(Test_File)), grouped by the run-length-id of 'Name' and 'Name', get the sum of 'Revenue', reshape it to 'wide' format with dcast, get the difference between 'P2' and 'P1' and assign (:=) it to 'Diff'
library(data.table)
dcast(setDT(Test_File)[, .(Revenue = sum(Revenue)),
.(grp=rleid(Name), Name)], Name~ paste0("P", rowid(Name)),
value.var = "Revenue")[, Diff := P2 - P1][]
# Name P1 P2 Diff
#1: Apple 15 75 60
#2: Banana 58 7 -51
#3: Mango 5 NA NA
#4: Pineapple 7 NA NA
Or for third case, i.e. base R, we create a grouping column based on whether the adjacent elements in 'Name' are the same or not ('grp'), then aggregate the 'Revenue' by 'Name' and 'grp' to find the sum, create a sequence column, reshape it to 'wide' and transform the dataset to create the 'Diff' column
Test_File$grp <- with(Test_File, cumsum(c(TRUE, Name[-1]!=Name[-length(Name)])))
d1 <- aggregate(Revenue~Name +grp, Test_File, sum)
d1$Seq <- with(d1, ave(seq_along(Name), Name, FUN = seq_along))
transform(reshape(d1[-2], idvar = "Name", timevar = "Seq",
direction = "wide"), Diff = Revenue.2- Revenue.1)
The tidyverse method can also be done using
library(dplyr)
library(tidyr)
Test_File %>%
group_by(grp = cumsum(c(TRUE, Name[-1]!=Name[-length(Name)])), Name) %>%
summarise(Revenue = sum(Revenue)) %>%
group_by(Name) %>%
mutate(Seq = paste0("P", row_number())) %>%
select(-grp) %>%
spread(Seq, Revenue) %>%
mutate(Diff = P2-P1)
#Source: local data frame [4 x 4]
#Groups: Name [4]
# Name P1 P2 Diff
# <chr> <dbl> <dbl> <dbl>
#1 Apple 15 75 60
#2 Banana 58 7 -51
#3 Mango 5 NA NA
#4 Pineapple 7 NA NA
Update
Based on the OP's comments to use only diff function
library(data.table)
setDT(Test_File)[, .(Revenue = sum(Revenue)), .(Name, grp = rleid(Name))
][, .(P1 = Revenue[1L], P2 = Revenue[2L], Diff = diff(Revenue)) , Name]
# Name P1 P2 Diff
#1: Apple 15 75 60
#2: Banana 58 7 -51
#3: Mango 5 NA NA
#4: Pineapple 7 NA NA
Or with dplyr
Test_File %>%
group_by(grp = cumsum(c(TRUE, Name[-1]!=Name[-length(Name)])), Name) %>%
summarise(Revenue = sum(Revenue)) %>%
group_by(Name) %>%
summarise(P1 = first(Revenue), P2 = nth(Revenue, 2)) %>% # nth() gives NA when a Name has only one period
mutate(Diff = P2-P1)
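For what it's worth, Methods 2 and 3 in the question group by both Ship_Period and Name, so diff() only ever sees values from a single period. The answers above instead label periods by order of appearance, which is why Pineapple ends up under P1. A base R sketch that assigns periods from the cutoff, as in the question's Method 1, and then takes diff() across the two periods for each Name (the name wide is hypothetical):

```r
Test_File <- data.frame(
  Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012, 2017, 2017, 2017,
                2016, 2016, 2016, 2011, 2017),
  Name = c("Apple", "Apple", "Banana", "Banana", "Banana", "Banana",
           "Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
           "Mango", "Pineapple"),
  Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5, 7)
)

cutoff <- 2013
Test_File$Ship_Period <- factor(
  ifelse(Test_File$Ship_Date >= cutoff, "P2", "P1"),
  levels = c("P1", "P2")
)

# Name x period matrix of summed revenue; empty cells come back as NA
wide <- tapply(Test_File$Revenue,
               list(Test_File$Name, Test_File$Ship_Period), sum)

Diff_Table <- data.frame(
  Name = rownames(wide),
  P1   = unname(wide[, "P1"]),
  P2   = unname(wide[, "P2"]),
  Diff = unname(apply(wide, 1, diff)),  # P2 - P1 for each Name
  row.names = NULL
)
Diff_Table
```

With the sample data this puts Pineapple's revenue under P2 and leaves Diff as NA for names that appear in only one period.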
This will do:
library("data.table")
setDT(Test_File)
T <- Test_File[, .(P=sum(Revenue)),by=.(Ship_Date, Name)]
T[Ship_Date>=2013][T[Ship_Date<2013][CJ(Name=T$Name, unique=TRUE), on="Name"], on="Name"][,`:=`(P1=i.P, P2=P, Diff=P-i.P)][]
# Ship_Date Name P i.Ship_Date i.P P1 P2 Diff
# 1: 2017 Apple 75 2010 15 15 75 60
# 2: 2016 Banana 7 2012 58 58 7 -51
# 3: NA Mango NA 2011 5 5 NA NA
# 4: 2017 Pineapple 7 NA NA NA 7 NA
Or with only the wanted columns:
T[Ship_Date>=2013][T[Ship_Date<2013][CJ(Name=T$Name, unique=TRUE), on="Name"], on="Name"][,`:=`(P1=i.P, P2=P, Diff=P-i.P)][,.(Name, P1, P2, Diff)]
# Name P1 P2 Diff
# 1: Apple 15 75 60
# 2: Banana 58 7 -51
# 3: Mango 5 NA NA
# 4: Pineapple NA 7 NA
Here is a variant using setnames():
setnames(T[Ship_Date>=2013][T[Ship_Date<2013][CJ(Name=T$Name, unique=TRUE), on="Name"], on="Name"],
c("P", "i.P"), c("P2", "P1"))[, Diff:=P2-P1][]
For example, suppose that you had a function that applied some dplyr functions, but you couldn't expect datasets passed to this function to have the same column names.
For a simplified example of what I mean, say you had a data frame, arizona.trees:
arizona.trees
group arizona.redwoods arizona.oaks
A 23 11
A 24 12
B 9 8
B 10 7
C 88 22
and another very similar data frame, california.trees:
california.trees
group california.redwoods california.oaks
A 25 50
A 11 33
B 90 5
B 77 3
C 90 35
And you wanted to implement a function that returns the mean for the given groups (A, B, ... Z) for a given type of tree that would work for both of these data frames.
foo <- function(dataset, group1, group2, tree.type) {
column.name <- colnames(dataset[2])
result <- filter(dataset, group %in% c(group1, group2) %>%
select(group, contains(tree.type)) %>%
group_by(group) %>%
summarize("mean" = mean(column.name))
return(result)
}
A desired output for a call of foo(california.trees, A, B, redwoods) would be:
result
mean
A 18
B 83.5
For some reason, doing something like the implementation of foo() just doesn't seem to work. This is likely due to some error with the data frame indexing - the function seems to think I am attempting to get the mean of the column.name string, rather than retrieving the column and passing the column to mean(). I'm not sure how to avoid this. There's the issue of the implicit passing of the modified dataframe that can't be directly referenced with the pipe operator that may be causing the issue.
Why is this? Is there some alternative implementation that would work?
We can use the quosure based solution from the devel version of dplyr (soon to be released 0.6.0)
foo <- function(dataset, group1, group2, tree.type){
group1 <- quo_name(enquo(group1))
group2 <- quo_name(enquo(group2))
colN <- rlang::parse_quosure(names(dataset)[2])
tree.type <- quo_name(enquo(tree.type))
dataset %>%
filter(group %in% c(group1, group2)) %>%
select(group, contains(tree.type)) %>%
group_by(group) %>%
summarise(mean = mean(UQ(colN)))
}
foo(california.trees, A, B, redwoods)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 18.0
#2 B 83.5
foo(arizona.trees, A, B, redwoods)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 23.5
#2 B 9.5
enquo takes the input arguments and converts them to quosures; quo_name then converts each quosure to a string for use with %in%. The second column name is converted from a string to a quosure using parse_quosure and then unquoted (UQ or !!) for evaluation within summarise.
NOTE: This is based on the OP's function about selecting the second column
The above solution was based on selecting the column based on position (as per the OP's code) and it may not work for other columns. So, we can match the 'tree.type' and get the 'mean' of the columns based on that
foo1 <- function(dataset, group1, group2, tree.type){
group1 <- quo_name(enquo(group1))
group2 <- quo_name(enquo(group2))
tree.type <- quo_name(enquo(tree.type))
dataset %>%
filter(group %in% c(group1, group2)) %>%
select(group, contains(tree.type)) %>%
group_by(group) %>%
summarise_at(vars(contains(tree.type)), funs(mean = mean(.)))
}
The function can be tested for different columns in the two datasets
foo1(arizona.trees, A, B, oaks)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 11.5
#2 B 7.5
foo1(arizona.trees, A, B, redwood)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 23.5
#2 B 9.5
foo1(california.trees, A, B, redwood)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 18.0
#2 B 83.5
foo1(california.trees, A, B, oaks)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 41.5
#2 B 4.0
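If you don't need bare (unquoted) arguments, the same column matching can be sketched without any packages by passing strings. This is a different technique from the quosure approach above, not a replacement for it; group_mean is a hypothetical name and the stopifnot() guard assumes exactly one column should match tree.type.

```r
california.trees <- data.frame(
  group = c("A", "A", "B", "B", "C"),
  california.redwoods = c(25, 11, 90, 77, 90),
  california.oaks = c(50, 33, 5, 3, 35)
)

# Hypothetical helper: mean of the column whose name contains tree.type,
# restricted to the requested groups
group_mean <- function(dataset, groups, tree.type) {
  col <- grep(tree.type, names(dataset), fixed = TRUE, value = TRUE)
  stopifnot(length(col) == 1)  # assumption: exactly one matching column
  rows <- dataset[dataset$group %in% groups, ]
  out <- aggregate(rows[[col]], by = list(group = rows$group), FUN = mean)
  names(out)[2] <- "mean"
  out
}

res <- group_mean(california.trees, c("A", "B"), "redwoods")
res
```

Taking the groups as a single character vector also removes the fixed group1/group2 pair, so any number of groups can be requested.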
data
arizona.trees <- structure(list(group = c("A", "A", "B", "B", "C"),
arizona.redwoods = c(23L,
24L, 9L, 10L, 88L), arizona.oaks = c(11L, 12L, 8L, 7L, 22L)),
.Names = c("group",
"arizona.redwoods", "arizona.oaks"), class = "data.frame",
row.names = c(NA, -5L))
california.trees <- structure(list(group = c("A", "A", "B", "B", "C"),
california.redwoods = c(25L,
11L, 90L, 77L, 90L), california.oaks = c(50L, 33L, 5L, 3L, 35L
)), .Names = c("group", "california.redwoods", "california.oaks"
), class = "data.frame", row.names = c(NA, -5L))
I have a data frame(df):
group col
a 12
a 15
a 13
b 21
b 23
Desired output is also a data frame(df1):
col1 col2
12 21
15 23
13 0
Namely, I want to partition "col" of "df" by "group" into multiple columns, "col1" and "col2".
When the columns have unequal lengths, "0" must be appended to the end of each shorter column until every column reaches the maximum column length.
We could either use base R functions split or unstack to split the 'col' by 'group' into a list, then pad NA to list elements that are less than the maximum length of the list element. Change the column names, replace 'NA' by 0.
lst <- unstack(df1, col~group)
d1 <- as.data.frame(sapply(lst, `length<-`, max(sapply(lst, length))))
d1[is.na(d1)] <- 0
colnames(d1) <- paste0('col', 1:ncol(d1))
d1
# col1 col2
#1 12 21
#2 15 23
#3 13 0
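One caveat with padding via NA and then replacing: any NA that was already in 'col' is turned into 0 as well. A sketch that pads with zeros directly, assuming the input looks like the question's example (the names lst, n and padded are hypothetical):

```r
df <- data.frame(group = c("a", "a", "a", "b", "b"),
                 col   = c(12, 15, 13, 21, 23))

# One vector per group
lst <- split(df$col, df$group)

# Append zeros until every vector has the length of the longest one
n <- max(lengths(lst))
padded <- lapply(lst, function(v) c(v, rep(0, n - length(v))))

d1 <- setNames(as.data.frame(padded), paste0("col", seq_along(padded)))
d1
```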
Or use stri_list2matrix from stringi
library(stringi)
d1 <- as.data.frame(stri_list2matrix(unstack(df1, col~group),
fill=0), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
Or using data.table/splitstackshape
library(splitstackshape)
setnames(dcast(getanID(df1, 'group'), .id~group, value.var='col',
fill=0L)[, .id:= NULL], paste0('col', 1:2))[]
# col1 col2
#1: 12 21
#2: 15 23
#3: 13 0
How to do it with dplyr...
library(dplyr)
library(tidyr)
df1 %>%
group_by(group) %>%
mutate(n = row_number()) %>%
spread(group, col) %>%
select(-n) %>%
(function(x) { x[is.na(x)] <- 0; x })
Since you fill with zeroes, another idea:
xtabs(col ~ ave(DF$col, DF$group, FUN = seq_along) + group, DF)
# group
#ave(DF$col, DF$group, FUN = seq_along) a b
# 1 12 21
# 2 15 23
# 3 13 0
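To get from that table to the data frame layout asked for in the question, one possible sketch is below. It rebuilds the example data, stores the within-group position in a column (idx is a hypothetical name), and converts the table with as.data.frame.matrix(). Keep in mind that xtabs() sums values, so this relies on each position occurring once per group.

```r
DF <- data.frame(group = factor(c("a", "a", "a", "b", "b")),
                 col   = c(12L, 15L, 13L, 21L, 23L))

# Position of each row within its group: 1, 2, 3, 1, 2
DF$idx <- ave(DF$col, DF$group, FUN = seq_along)

# Cross-tabulate position by group; missing cells are filled with 0
tab <- xtabs(col ~ idx + group, DF)

out <- as.data.frame.matrix(tab)
names(out) <- paste0("col", seq_along(out))
out
```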
Where "DF":
DF = structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), col = c(12L, 15L, 13L, 21L, 23L)), .Names = c("group",
"col"), class = "data.frame", row.names = c(NA, -5L))