which.max() by groups but output in the dataframe - r

There is this data frame given by (an example):
df <- read.table(header = TRUE, text = 'Group Utility
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
I want to use any command (I have been trying iterations of which.max() to no avail) to add a column to the dataset, say Choice, that indicates whether Utility is the maximum within the group given by Group. The table would look like:
Group Utility Choice
A 12 1
A 10 0
B 3 0
B 5 0
B 6 1
C 1 1
D 3 0
D 4 1

You can try this with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Choice = ifelse(Utility == max(Utility), 1, 0)) %>%
ungroup()
Output
# A tibble: 8 x 3
Group Utility Choice
<fct> <int> <dbl>
1 A 12 1
2 A 10 0
3 B 3 0
4 B 5 0
5 B 6 1
6 C 1 1
7 D 3 0
8 D 4 1

A one-liner base R solution.
df$Choice <- with(df, ave(Utility, Group, FUN = function(x) +(x == max(x))))
df
# Group Utility Choice
#1 A 12 1
#2 A 10 0
#3 B 3 0
#4 B 5 0
#5 B 6 1
#6 C 1 1
#7 D 3 0
#8 D 4 1
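Since the question mentions which.max(), here is a minimal sketch of a variant built on it. The two approaches differ only when there are ties: Utility == max(Utility) flags every tied row, whereas which.max() returns the position of the first maximum only. The helper name first_max_flag is made up for this illustration.

```r
df <- read.table(header = TRUE, text = 'Group Utility
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')

# Hypothetical helper: 1 for the first maximum in x, 0 everywhere else
first_max_flag <- function(x) as.integer(seq_along(x) == which.max(x))

# ave() applies the helper within each Group and keeps the original row order
df$Choice <- ave(df$Utility, df$Group, FUN = first_max_flag)
df
```

On this data there are no ties, so the result should agree with the answers above.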

An option with data.table
library(data.table)
setDT(df)[, Choice := +(Utility == max(Utility)), by = Group]

Related

R Extract nested cumulatives from dataframe

given a dataframe
period<-c(1,1,1,3,3,3,3)
item<-c("a","b","b","a","b","c","c")
quantity<-c(1,3,2,4,5,3,7)
df<-data.frame(period,item,quantity)
df
period item quantity
1 1 a 1
2 1 b 3
3 1 b 2
4 3 a 4
5 3 b 5
6 3 c 3
7 3 c 7
I want to obtain
period item cumulative
1 a 1
1 b 5
1 c 0
2 a 0
2 b 0
2 c 0
3 a 4
3 b 5
3 c 10
I am not sure what an efficient way to do this in R would be. The file has approximately 500k records and 10,000 different items.
Thanks!!
You can use complete to create the missing sequence of period and item and for each combination sum the quantity value.
library(dplyr)
library(tidyr)
df %>%
complete(period = min(period):max(period), item) %>%
group_by(period, item) %>%
summarise(quantity = sum(quantity, na.rm = TRUE)) %>%
ungroup
# period item quantity
# <dbl> <chr> <dbl>
#1 1 a 1
#2 1 b 5
#3 1 c 0
#4 2 a 0
#5 2 b 0
#6 2 c 0
#7 3 a 4
#8 3 b 5
#9 3 c 10
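For comparison, the same fill-then-sum idea can be sketched in base R with xtabs(): turning period into a factor with every level from min to max makes the empty combinations show up as zero counts. This is only a sketch under the assumption that period holds whole numbers; the names filled, tab and res are made up here, and note that period comes back as a factor.

```r
period <- c(1, 1, 1, 3, 3, 3, 3)
item <- c("a", "b", "b", "a", "b", "c", "c")
quantity <- c(1, 3, 2, 4, 5, 3, 7)
df <- data.frame(period, item, quantity)

# Make every period between min and max an explicit factor level
filled <- transform(df, period = factor(period, levels = min(period):max(period)))

# Sum quantity for every period/item combination; empty cells become 0
tab <- xtabs(quantity ~ period + item, data = filled)

# Back to long form, sorted by period and then item
res <- as.data.frame(tab, responseName = "quantity")
res <- res[order(res$period, res$item), ]
res
```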

Convert a binary dataframe to a grouped (long) list of combinations

I have the following binary dataframe
A B C D
0 1 1 0
0 0 1 1
1 1 1 0
0 1 1 1
I would like to create a list with all the column combinations and count the rows with '1' that are in common.
More precisely something like that:
A B 1
A C 1
A D 0
B A 1
B C 3
B D 1
C A 1
C B 3
C D 2
D A 0
D B 1
D C 2
But I'm struggling to think of a way to do that in R. I would appreciate any hint towards the right direction
Alternatively, a 'correlation'-like matrix would work for me. For example:
A B C D
A 0 1 1 0
B 1 0 3 1
C 1 3 0 2
D 0 1 2 0
Since I don't understand purrr/apply/loops easily, my approach will be like this
library(tidyverse)
df %>%
mutate(id = row_number()) %>%
pivot_longer(cols = 1:4) %>%
left_join(df %>% mutate(id = row_number())) %>%
pivot_longer(cols = 4:7, names_to = "Name2", values_to = "Value2") %>%
filter(name != Name2, value == Value2) %>%
select(-1) %>% group_by(name, Name2) %>%
summarise(sum(value))
# A tibble: 12 x 3
# Groups: name [4]
name Name2 `sum(value)`
<chr> <chr> <int>
1 A B 1
2 A C 1
3 A D 0
4 B A 1
5 B C 3
6 B D 1
7 C A 1
8 C B 3
9 C D 2
10 D A 0
11 D B 1
12 D C 2
Explanation: convert the data to long format, join it back to the original by row id, and pivot longer again. Filtering out pairs with the same name and rows whose two values differ leaves the desired combinations; summing value within each pair (both values being equal) then gives the desired counts.
One gtools, dplyr and purrr option might be:
map_dfr(.x = asplit(permutations(length(df), 2, names(df)), 1),
~ df %>%
summarise(pair = paste(.x, collapse = ","),
n = sum(rowSums(select(., all_of(.x))) == 2)))
pair n
1 A,B 1
2 A,C 1
3 A,D 0
4 B,A 1
5 B,C 3
6 B,D 1
7 C,A 1
8 C,B 3
9 C,D 2
10 D,A 0
11 D,B 1
12 D,C 2
A pure Base R option is as follows. Note that this only gives the unique combinations of columns. You arrive at a longer version of all permutations by changing the column order and copying the counted values.
Example Data
test <- data.frame(A = c(0, 0, 1, 0),
B = c(1, 0, 1, 1),
C = c(1,1,1,1),
D = c(0, 1, 0, 1))
Code
df_list <- lapply(1:(ncol(combn(1:ncol(test), m = 2))),
function(y) test[, combn(1:ncol(test), m = 2)[,y]])
values <- sapply(df_list, function(x) sum(apply(x, 1, sum) == 2))
names <- sapply(df_list, function(x) colnames(x))
df_final <- cbind.data.frame(t(names), values)
Output
> df_final
1 2 values
1 A B 1
2 A C 1
3 A D 0
4 B C 3
5 B D 1
6 C D 2
A base R option using expand.grid + subset
transform(
subset(
rev(
expand.grid(nm <- names(df), nm)
), Var1 != Var2
),
count = apply(
cbind(Var2, Var1),
1,
function(...) sum(do.call("*", df[...]))
)
)
gives
Var2 Var1 count
2 A B 1
3 A C 1
4 A D 0
5 B A 1
7 B C 3
8 B D 1
9 C A 1
10 C B 3
12 C D 2
13 D A 0
14 D B 1
15 D C 2
I'd suggest using crossprod. Here, I've added diag to set the diagonal to zero:
"diag<-"(crossprod(as.matrix(test)), 0)
# A B C D
# A 0 1 1 0
# B 1 0 3 1
# C 1 3 0 2
# D 0 1 2 0
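To see why crossprod works here: crossprod(m) computes t(m) %*% m, so entry (i, j) is the sum over rows of column i times column j, and for 0/1 data that product is 1 only when both columns are 1. A small sketch that checks one pair by hand (the names m, co and manual_BC are made up for this illustration):

```r
test <- data.frame(A = c(0, 0, 1, 0),
                   B = c(1, 0, 1, 1),
                   C = c(1, 1, 1, 1),
                   D = c(0, 1, 0, 1))
m <- as.matrix(test)

# Pairwise co-occurrence counts; the diagonal holds the column sums
co <- crossprod(m)

# Count the rows where both B and C equal 1 directly
manual_BC <- sum(test$B == 1 & test$C == 1)

# The matrix entry should agree with the direct count
co["B", "C"] == manual_BC
```

The diagonal counts each column against itself, which is why the answer zeroes it out.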
To get the long form, you can add a couple of steps:
mat <- "diag<-"(crossprod(as.matrix(test)), 0)
df <- data.frame(as.table(mat))
subset(df[order(df$Var1), ], Var1 != Var2)
# Var1 Var2 Freq
# 5 A B 1
# 9 A C 1
# 13 A D 0
# 2 B A 1
# 10 B C 3
# 14 B D 1
# 3 C A 1
# 7 C B 3
# 15 C D 2
# 4 D A 0
# 8 D B 1
# 12 D C 2
It's more compact using "data.table":
library(data.table)
mat <- "diag<-"(crossprod(as.matrix(test)), 0)
data.table(as.table(mat))[V1 != V2][order(V1)]
# V1 V2 N
# 1: A B 1
# 2: A C 1
# 3: A D 0
# 4: B A 1
# 5: B C 3
# 6: B D 1
# 7: C A 1
# 8: C B 3
# 9: C D 2
# 10: D A 0
# 11: D B 1
# 12: D C 2

How can I subtract values within one column based on values in multiple other columns?

I have a dataframe like this:
dat <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
id = c(rep(c("A","B","C"), 3)),
measurement = c(1:8, 1))
dat
# c id measurement
# 1 0 A 1
# 2 0 B 2
# 3 0 C 3
# 4 5 A 4
# 5 5 B 5
# 6 5 C 6
# 7 10 A 7
# 8 10 B 8
# 9 10 C 1
I want to subtract the values in the column "measurement" where c is 0 from all other values in this column. This should happen separately based on the info given in the column "id". E.g. the value where c is 0 and "id" is A should be subtracted from all values where c is > 0 and "id" is A. The value where c is 0 and "id" is B should be subtracted from all values where c is > 0 and "id" is B and so on.
If the difference would be negative the result should be 0.
The result should look like this:
result <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
id = c(rep(c("A","B","C"), 3)),
measurement = c(1:8, 1),
difference = c(0,0,0,3,3,3,6,6,0))
result
# c id measurement difference
# 1 0 A 1 0
# 2 0 B 2 0
# 3 0 C 3 0
# 4 5 A 4 3
# 5 5 B 5 3
# 6 5 C 6 3
# 7 10 A 7 6
# 8 10 B 8 6
# 9 10 C 1 0
I used dplyr to select the values of "measurement" based on the info from the other columns, but unfortunately I don't know how to do the calculations. So any suggestions are welcome!
For each id you can subtract measurement values with the value where c = 0. Using pmax we replace negative values with 0.
library(dplyr)
dat %>%
group_by(id) %>%
mutate(difference = pmax(measurement - measurement[c == 0], 0))
# c id measurement difference
# <dbl> <chr> <dbl> <dbl>
#1 0 A 1 0
#2 0 B 2 0
#3 0 C 3 0
#4 5 A 4 3
#5 5 B 5 3
#6 5 C 6 3
#7 10 A 7 6
#8 10 B 8 6
#9 10 C 1 0
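The same per-id baseline subtraction can be sketched in base R with a named lookup vector. This assumes each id has exactly one row with c == 0; the name baseline is made up for this illustration.

```r
dat <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
                  id = c(rep(c("A", "B", "C"), 3)),
                  measurement = c(1:8, 1))

# Lookup vector: the measurement at c == 0, named by id
# (assumes exactly one c == 0 row per id)
baseline <- setNames(dat$measurement[dat$c == 0],
                     as.character(dat$id[dat$c == 0]))

# Subtract each row's own baseline and clamp negative results to 0
dat$difference <- pmax(dat$measurement - unname(baseline[as.character(dat$id)]), 0)
dat
```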
Try this. You can join the data against the rows where c == 0 and then use dplyr verbs to reach the expected output. Note that this version overwrites measurement rather than adding a difference column:
library(dplyr)
#Code
new <- dat %>%
left_join(
dat %>% filter(c==0) %>% select(-c) %>% rename(Var=measurement)
) %>%
mutate(measurement=measurement-Var) %>%
replace(.<=0,0) %>% select(-Var)
Output:
c id measurement
1 0 A 0
2 0 B 0
3 0 C 0
4 5 A 3
5 5 B 3
6 5 C 3
7 10 A 6
8 10 B 6
9 10 C 0

R: select rows by group after resampling

I want to do bootstrapping manually for a panel dataset. I need to cluster at the individual level to keep later manipulation consistent, which means that all observations for the same individual must be selected together in the bootstrap sample. What I do is resample with replacement from the vector of unique individual IDs, which is then used as the index.
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"), v1 = c(3,1,2,4,2,2,5,6,9), v2 = c(1,0,0,0,1,1,0,1,0))
boot.index <- sample(unique(df$ID), replace = TRUE)
Then I select rows according to the index. Supposing boot.index = (B, B, C), I want a data frame like this:
ID v1 v2
B 4 0
B 2 1
B 2 1
B 4 0
B 2 1
B 2 1
C 5 0
C 6 1
C 9 0
Apparently df1 <- df[df$ID == boot.index,] does not give what I want. I tried subset and filter in dplyr, but nothing works. Basically this is an issue of selecting whole groups by a group index. Any suggestions? Thanks!
set.seed(42)
boot.index <- sample(unique(df$ID), replace = TRUE)
boot.index
#[1] C C A
#Levels: A B C
do.call(rbind, lapply(boot.index, function(x) df[df$ID == x,]))
# ID v1 v2
#7 C 5 0
#8 C 6 1
#9 C 9 0
#71 C 5 0
#81 C 6 1
#91 C 9 0
#1 A 3 1
#2 A 1 0
#3 A 2 0
Using %in% selects the rows whose ID appears in boot.index. Note that each matching row is returned only once, even if its ID was drawn more than once:
> df
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
7 C 5 0
8 C 6 1
9 C 9 0
> boot.index
[1] A B A
Levels: A B C
> df[df$ID %in% boot.index,]
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
dplyr::filter based solution:
> df %>% filter(ID %in% boot.index)
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
You can also do this with a join:
boot.index = c("B", "B", "C")
merge(data.frame("ID"=boot.index), df, by="ID", all.x=TRUE, all.y=FALSE)
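Another way to keep repeated draws, sketched here in base R, is to look up the row numbers of each group once and then index by the drawn IDs, so an ID drawn twice contributes its rows twice. The names rows_by_id, boot.rows and boot.df are made up for this illustration, and boot.index is fixed to the example from the question rather than sampled.

```r
df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 v1 = c(3, 1, 2, 4, 2, 2, 5, 6, 9),
                 v2 = c(1, 0, 0, 0, 1, 1, 0, 1, 0))

# Fixed draw from the question; in practice this would come from sample()
boot.index <- c("B", "B", "C")

# Row numbers of each individual, computed once
rows_by_id <- split(seq_len(nrow(df)), df$ID)

# Concatenate the row numbers in the order drawn, keeping duplicates
boot.rows <- unlist(rows_by_id[as.character(boot.index)], use.names = FALSE)

boot.df <- df[boot.rows, ]
boot.df
```

Splitting once avoids rescanning the ID column for every draw, which may help when there are many individuals.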

how to mutate a column with ID in group

data.frame like:
a b c
1 a 1 1
2 a 1 2
3 a 2 3
4 b 1 4
5 b 2 5
6 b 3 6
Group by a. The flag starts at 1 in each group; if b equals the previous b, the flag stays the same, otherwise flag += 1:
a b c flag
1 a 1 1 1 <- group a start with 1
2 a 1 2 1 <-- in group a, 1(in row 2)=1(in row 1)
3 a 2 3 2 <- in group a, 2(in row 3)!=1(in row 2)
4 b 1 4 1 <- group b start with 1
5 b 2 5 2 <- in group b, 2(in row 5)!=1(in row 4)
6 b 3 6 3 <- in group b, 3(in row 6)!=2(in row 5)
I am currently using this:
for(i in 2:nrow(x)){
x[i, 'flag'] = ifelse(x[i, 'a']!=x[i-1,'a'], 1, ifelse(x[i, 'b']==x[i-1, 'b'], x[i-1, 'flag'], x[i-1,'flag']+1))
}
but it is inefficient on large datasets.
UPDATE
dense_rank in dplyr gives me the answer:
> x %>% group_by(a) %>% mutate(dense_rank(b))
Source: local data frame [10 x 4]
Groups: a
a b c dense_rank(b)
1 a x 1 1
2 a x 2 1
3 a y 3 2
4 b x 4 1
5 b y 5 2
6 b z 6 3
7 c x 7 1
8 c y 8 2
9 c z 9 3
10 c z 10 3
thanks.
I am not entirely sure what you are trying to do. But it seems to me that you are trying to assign index numbers to values in b for each group (a or b).
#I modified your example here.
a <- rep(c("a","b"), each =3)
b <- c(4,4,5,11,12,13)
c <- 1:6
foo <- data.frame(a,b,c, stringsAsFactors = F)
a b c
1 a 4 1
2 a 4 2
3 a 5 3
4 b 11 4
5 b 12 5
6 b 13 6
#Since you referred to dplyr, I will use it (rbindlist comes from data.table).
library(dplyr)
library(data.table)
cats <- list()
for(i in unique(foo$a)){
ana <- foo %>%
filter(a == i) %>%
arrange(b) %>%
mutate(indexInb = as.integer(as.factor(b)))
cats[[i]] <- ana
}
bob <- rbindlist(cats)
a b c indexInb
1: a 4 1 1
2: a 4 2 1
3: a 5 3 2
4: b 11 4 1
5: b 12 5 2
6: b 13 6 3
Here's a quick vectorized way to solve this without using any for loops
Base R solution using ave and transform
transform(x, flag = ave(b, a, FUN = function(x) cumsum(c(1, diff(x) != 0))))
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
Or a data.table solution (more efficient)
library(data.table)
setDT(x)[, flag := cumsum(c(1, diff(b) != 0)), by = a]
x
# a b c flag
# 1: a 1 1 1
# 2: a 1 2 1
# 3: a 2 3 2
# 4: b 1 4 1
# 5: b 2 5 2
# 6: b 3 6 3
Or a dplyr solution (because you tagged it)
library(dplyr)
x %>%
group_by(a) %>%
mutate(flag = cumsum(c(1, diff(b) != 0)))
# Source: local data frame [6 x 4]
# Groups: a
#
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
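The rule stated in the question (increment when b differs from the previous row) and dense_rank only agree when b never returns to an earlier value within a group. A minimal base R sketch of the difference, using made-up data and made-up helper names (run_flag, dense_rank_base), where the second helper is a base R stand-in for dense_rank:

```r
# Made-up example: b returns to 1 after having been 2
x <- data.frame(a = c("a", "a", "a", "a"), b = c(1, 1, 2, 1))

# Hypothetical helper: increment whenever a value differs from the previous one
run_flag <- function(v) cumsum(c(TRUE, v[-1] != v[-length(v)]))

# Hypothetical helper: rank of each value among the sorted distinct values
dense_rank_base <- function(v) match(v, sort(unique(v)))

x$flag <- ave(x$b, x$a, FUN = run_flag)
x$rank <- ave(x$b, x$a, FUN = dense_rank_base)
x
```

Here flag comes out as 1, 1, 2, 3 while rank comes out as 1, 1, 2, 1, so which one is wanted depends on whether repeated values should reuse their earlier number.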

Resources