I have a simple tibble and want to calculate the absolute difference of values after grouping them:
tibb <- tibble(id = c("A", "B", "C","A", "B", "C"), value = c(5,4,3,7,8,9))
# A tibble: 6 x 2
id value
<chr> <dbl>
1 A 5
2 B 4
3 C 3
4 A 7
5 B 8
6 C 9
tibb %>% group_by(id) %>% summarize(diff = function(x,y){abs(x-y)})
dplyr returns an error saying that the column diff is of an unsupported type.
The output should look like this:
# A tibble: 3 x 2
id sum
<chr> <int>
1 A 2
2 B 4
3 C 6
Error in summarise_impl(.data, dots) :
Column `diff` is of unsupported type function
Is there any way to calculate this?
If each group contains exactly two values, try this:
library(dplyr)
tibb <- tibble(id = c("A", "B", "C","A", "B", "C"), value = c(5,4,3,7,8,9))
tibb
tibb %>% group_by(id) %>% summarize(diff = abs(diff(value)))
#or
tibb %>% group_by(id) %>% summarize(diff = abs(value[1] - value[2]))
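The error in the question arises because summarize() was given a function definition rather than a value computed from each group. As a minimal sketch, assuming every id has exactly two rows, a helper can be defined first and then called inside summarize(). The name abs_diff is hypothetical and only used for illustration:

```r
library(dplyr)

tibb <- tibble(id = c("A", "B", "C", "A", "B", "C"),
               value = c(5, 4, 3, 7, 8, 9))

# abs_diff is a hypothetical helper name, not part of dplyr
abs_diff <- function(x, y) abs(x - y)

res <- tibb %>%
  group_by(id) %>%
  # call the function on the group's values instead of passing the function itself
  summarize(diff = abs_diff(first(value), last(value)))

res
```

With more than two rows per id this only compares the first and last value, so the two-row assumption matters.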
Inputs:
group1 <- c("A", "B", "C", "D")
group2 <- c("B", "A", "D", "C")
count <- c(1, 3, 2, 4)
df <- data.frame(group1, group2, count)
df:
group1 group2 count
1 A B 1
2 B A 3
3 C D 2
4 D C 4
Desired output:
group total
1 AB or BA 4
2 CD or DC 6
My actual dataset has a very long list of these group pairs.
Sort the strings so the alphabetically first one is in a specific column:
library(dplyr)
df %>%
mutate(g1 = pmin(group1, group2),
g2 = pmax(group1, group2)) %>%
group_by(g1, g2) %>%
summarize(total = sum(count), .groups = "drop")
# # A tibble: 2 × 3
# g1 g2 total
# <chr> <chr> <dbl>
# 1 A B 4
# 2 C D 6
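To get the single label column from the desired output, the two sorted columns can be pasted together afterwards. This is a sketch that assumes group1 and group2 are character columns (not factors), since pmin() and pmax() are used here to compare strings element-wise:

```r
library(dplyr)

df <- data.frame(group1 = c("A", "B", "C", "D"),
                 group2 = c("B", "A", "D", "C"),
                 count  = c(1, 3, 2, 4),
                 stringsAsFactors = FALSE)  # keep strings as character in older R versions

res <- df %>%
  mutate(g1 = pmin(group1, group2),   # alphabetically first member of the pair
         g2 = pmax(group1, group2)) %>%  # alphabetically last member of the pair
  group_by(g1, g2) %>%
  summarize(total = sum(count), .groups = "drop") %>%
  # build the "AB or BA" style label and drop the helper columns
  transmute(group = paste0(g1, g2, " or ", g2, g1), total)

res
```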
This works if the pairs are always in adjacent rows:
library(tidyverse)
df %>%
mutate(
# create row ID for each pair:
id = (row_number() - 1) %/% 2,
# create column `group`:
group = str_c(group1, group2, " or ", group2, group1)) %>%
group_by(id) %>%
summarise(across(group, first),
total = sum(count)) %>%
select(-id)
# A tibble: 2 × 2
group total
<chr> <dbl>
1 AB or BA 4
2 CD or DC 6
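The row ID in this approach comes from integer division of the zero-based row position by 2. A small base R illustration of that step alone, using six rows as an arbitrary example:

```r
# zero-based row positions 0..5, integer-divided by 2:
# rows 1-2 share one id, rows 3-4 the next, and so on
pair_id <- (seq_len(6) - 1) %/% 2
pair_id
```

If the two rows of a pair are not next to each other, this ID groups the wrong rows, so the data would have to be sorted accordingly first.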
I have:
group items value
grp1 A 1
grp1 B 2
grp2 B 3
I want:
group items value
grp1 A 1
grp1 B 2
grp1 C NA
grp2 A NA
grp2 B 3
grp2 C NA
"group" is taken from the input df. "items" is taken from a codelist vector with all possible entries, all other columns are filled in where known or else NA.
Example:
item_codelist <- c("A", "B", "C")
input <- data.frame("group" = c("grp1", "grp1", "grp2"), "items" = c("A", "B", "B"), "values" = c(1, 2, 3))
I looked into fill(), extend() and complete() but could not get any of these to work for this purpose.
Below is my current workaround but I find it somewhat complicated and I am using a for loop which will take forever for my 200 MB data frame...
If you know an easier way to do this (preferably in dplyr syntax) let me know. Thanks!
# create a data frame with all groups and items
codelist_df <- input %>% head(0) %>% select(group, items)
for (grp in unique(input$group)){
df <- data.frame("items" = item_codelist) %>%
mutate( group = grp, .before = 1)
codelist_df <- bind_rows(codelist_df, df)
}
# join that data frame to the input data
output <- input %>%
group_by(group) %>%
full_join(codelist_df) %>%
arrange(group, items)
Stefan's comment is by far the best solution, which I was unaware of, but here's one option:
library(dplyr)
library(purrr)
library(tidyr)
input <- data.frame("group" = c("grp1", "grp1", "grp2"), "items" = c("A", "B", "B"), "values" = c(1, 2, 3))
items <- c("A", "B", "C")
input %>%
  split(.$group) %>%
  map_df(~full_join(., as_tibble(items), by = c("items" = "value")) %>%
    # fill the group label within each piece, so that a missing label
    # is never filled in from the previous group
    fill(group, .direction = 'downup') %>%
    arrange(items))
#> group items values
#> 1 grp1 A 1
#> 2 grp1 B 2
#> 3 grp1 C NA
#> 4 grp2 A NA
#> 5 grp2 B 3
#> 6 grp2 C NA
It seems like you want to cross join the groups and items. To do that, you could use dplyr::full_join() with the argument by = character(), and then left join the values back in:
library(dplyr, warn.conflicts = FALSE)
item_codelist <- tibble(items = c('A', 'B', 'C'))
groups <- tibble(group = c('grp1', 'grp1', 'grp2'))
input <- tibble("group" = c("grp1", "grp1", "grp2"), "items" = c("A", "B", "B"), "values" = c(1, 2, 3))
item_codelist |>
full_join(groups, by = character()) |>
left_join(input, by = c('items', 'group')) |>
relocate(group) |>
arrange(group, items) |>
distinct()
#> # A tibble: 6 × 3
#> group items values
#> <chr> <chr> <dbl>
#> 1 grp1 A 1
#> 2 grp1 B 2
#> 3 grp1 C NA
#> 4 grp2 A NA
#> 5 grp2 B 3
#> 6 grp2 C NA
Created on 2022-07-11 by the reprex package (v2.0.1)
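For comparison, tidyr::complete(), which the question mentions trying, can build the same grid in one step when the codelist is passed as a named argument. A minimal sketch using the question's example data:

```r
library(dplyr)
library(tidyr)

item_codelist <- c("A", "B", "C")
input <- data.frame(group  = c("grp1", "grp1", "grp2"),
                    items  = c("A", "B", "B"),
                    values = c(1, 2, 3))

output <- input %>%
  # every existing group is combined with every item in the codelist;
  # combinations missing from the input get NA in the other columns
  complete(group, items = item_codelist) %>%
  arrange(group, items)

output
```

In dplyr 1.1.0 and later, the cross join itself is written with cross_join(), and by = character() is deprecated.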
I have in R the following data frame:
ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
"E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
"CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY
DATA = data.frame(ID,VAR,CATEGORY);DATA
It looks like the table below:
ID VAR CATEGORY
1 A ANE
1 A ANE
1 A ANA
1 A ANB
1 B ANE
2 C BOO
2 C BOA
2 D BOO
3 E CAT
3 E CAT
4 F DOG
4 A ANE
4 B ANE
4 F DOG
4 C FUT
4 F DOG
Ideally, the output for the above data frame would look like this:
ID TEXTS category
1 A ANE
2 C BOO
3 E CAT
4 F DOG
More specifically: for each ID (say ID 1), I want to find the most common value in the column VAR (here A), and then find the most common value in the column CATEGORY among the rows with that VAR value (here ANE), and so forth for the other IDs.
How can I do this in R?
This is only a small example. My real data frame contains 850,000 rows and 14,000 unique IDs.
Another dplyr strategy using count and slice:
library(dplyr)
DATA %>%
group_by(ID) %>%
count(VAR, CATEGORY) %>%
slice(which.max(n)) %>%
select(-n)
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOA
3 3 E CAT
4 4 F DOG
dplyr
library(dplyr)
DATA %>%
group_by(ID) %>%
filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
group_by(ID, VAR) %>%
summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
ungroup()
# # A tibble: 4 x 3
# ID VAR CATEGORY
# <dbl> <chr> <chr>
# 1 1 A ANE
# 2 2 C BOA
# 3 3 E CAT
# 4 4 F DOG
Data
DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))
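We could modify the Mode function to return the row index of the most frequent value and use that in slice after grouping by 'ID':
Modeind <- function(x) {
  ux <- unique(x)
  # most frequent value, then the position of its first occurrence in x
  # (which.max alone would give an index into ux, not into x)
  match(ux[which.max(tabulate(match(x, ux)))], x)
}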
library(dplyr)
DATA %>%
group_by(ID) %>%
slice(Modeind(VAR)) %>%
ungroup()
-output
# A tibble: 4 x 3
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOO
3 3 E CAT
4 4 F DOG
A base R option with nested subset + ave
subset(
subset(
DATA,
!!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
),
!!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)
gives
ID VAR CATEGORY
1 1 A ANE
6 2 C BOO
9 3 E CAT
11 4 F DOG
Explanation
The inner subset + ave keeps the rows with the most common VAR value (grouped by ID).
Based on the trimmed data frame from the previous step, the outer subset + ave keeps the first row with the most common CATEGORY value (grouped by ID + VAR).
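The same two steps can also be sketched in dplyr. This is one possible translation, not the method of any answer above. The column names n_var and n_cat are made up for illustration, and ties (such as BOO vs. BOA for ID 2) are broken by whichever row slice_max() happens to return first:

```r
library(dplyr)

DATA <- data.frame(
  ID = c(rep(1, 5), rep(2, 3), rep(3, 2), rep(4, 6)),
  VAR = c("A", "A", "A", "A", "B", "C", "C", "D",
          "E", "E", "F", "A", "B", "F", "C", "F"),
  CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO",
               "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")
)

res <- DATA %>%
  # step 1: count each VAR within its ID and keep the most common VAR
  add_count(ID, VAR, name = "n_var") %>%
  group_by(ID) %>%
  filter(n_var == max(n_var)) %>%
  # step 2: among the remaining rows, count CATEGORY and keep the most common
  count(VAR, CATEGORY, name = "n_cat") %>%
  slice_max(n_cat, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, VAR, CATEGORY)

res
```

Because the work is done with grouped counts rather than per-group calls to table(), this may scale reasonably to the 850,000-row case, though that has not been measured here.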
I'm using a group-by function on a dataset in R, but some IDs need to be counted in more than one target group. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With a traditional group-by on each ID, I can do
library(data.table)
DT <- data.table(dataset)
DT[, sum(Var1), by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group ID by A+B, B+C, and D
(PS: say that F = A+B and G = B+C)
and the target result dataset below:
ID V1
F 6
G 6
D 2
If I use a recoding technique on ID, B would have to be counted twice.
Does anyone have a solution?
Many thanks!
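library(dplyr)
library(tidyr)
# df is the sample dataset from the question (columns ID and Var1)
# flag membership in each (possibly overlapping) group
df <- df %>% mutate(F = ifelse(ID %in% c("A", "B"), 1, 0),
                    G = ifelse(ID %in% c("B", "C"), 1, 0),
                    D = ifelse(ID == "D", 1, 0))
df %>%
  gather(var, val, F:D) %>%
  filter(val == 1) %>%
  group_by(var) %>%
  # sum the original Var1 column within each group
  summarise(V1 = sum(Var1))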
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
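An alternative sketch that avoids one indicator column per group is a membership table with one row per (ID, group) pair, joined to the data before summing. The names dataset, membership and grp are assumptions made for this illustration; base R is used so that it runs without extra packages:

```r
dataset <- data.frame(ID   = c("A", "A", "B", "C", "C", "D"),
                      Var1 = c(1, 3, 2, 3, 1, 2))

# hypothetical lookup table: B appears twice because it belongs to both F and G
membership <- data.frame(ID  = c("A", "B", "B", "C", "D"),
                         grp = c("F", "F", "G", "G", "D"))

# the merge repeats each data row once per group it belongs to
expanded <- merge(dataset, membership, by = "ID")

# sum Var1 within each target group
res <- aggregate(Var1 ~ grp, data = expanded, FUN = sum)
res
```

Adding or changing a group then only means editing rows of the membership table.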
I have some data like this:
X Y
-----
A 1
A 2
B 3
B 4
C 5
C 6
I would like to add a new column with values equal to the mean of all Ys in rows where X is not equal to the X of the current observation.
In this particular case we would get
X Y Mean
-------------------
A 1 (3+4+5+6)/4
A 2 (3+4+5+6)/4
B 3 (1+2+5+6)/4
B 4 (1+2+5+6)/4
C 5 (1+2+3+4)/4
C 6 (1+2+3+4)/4
Thanks in advance!
You can likely do this more succinctly, but this will get you the result.
First, create columns containing the total number of observations and the sum of Y for the whole data.frame. Then group by the X column and compute the same two quantities per group. Subtracting the group values from the totals gives the sum and count of the other groups, and dividing one by the other gives the mean.
data
df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
Y = c(1:6))
solution
library(tidyverse)
df %>%
mutate(total_sum = sum(Y),
total_obs = n()) %>%
group_by(X) %>%
mutate(group_sum = sum(Y),
group_obs = n()) %>%
ungroup() %>%
mutate(other_group_sum = total_sum - group_sum,
other_group_obs = total_obs - group_obs,
other_mean = other_group_sum/other_group_obs) %>%
select(X, Y, other_mean)
result
# A tibble: 6 x 3
X Y other_mean
<fct> <int> <dbl>
1 A 1 4.50
2 A 2 4.50
3 B 3 3.50
4 B 4 3.50
5 C 5 2.50
6 C 6 2.50
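A more compact sketch of the same idea computes the overall totals once and subtracts each group's share inside a single grouped mutate(). The names total_sum and total_n are introduced here only for illustration:

```r
library(dplyr)

df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
                 Y = 1:6)

total_sum <- sum(df$Y)   # sum of Y over all rows
total_n   <- nrow(df)    # number of rows overall

res <- df %>%
  group_by(X) %>%
  # (sum of all other groups) / (number of rows in all other groups)
  mutate(other_mean = (total_sum - sum(Y)) / (total_n - n())) %>%
  ungroup()

res
```

If there were only one group, the denominator would be zero and other_mean would come out as NaN, so that case would need separate handling.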