R: select rows by group after resampling

I want to do bootstrapping manually for a panel dataset. I need to cluster at the individual level so that later manipulations stay consistent, that is, all observations for the same individual need to be selected together into the bootstrap sample. What I do is resample with replacement from the vector of unique individual IDs, which is then used as the index.
df <- data.frame(ID = c("A","A","A","B","B","B","C","C","C"), v1 = c(3,1,2,4,2,2,5,6,9), v2 = c(1,0,0,0,1,1,0,1,0))
boot.index <- sample(unique(df$ID), replace = TRUE)
Then I select rows according to the index. Suppose boot.index = (B, B, C); I want to get a data frame like this:
ID v1 v2
B 4 0
B 2 1
B 2 1
B 4 0
B 2 1
B 2 1
C 5 0
C 6 1
C 9 0
Apparently df1 <- df[df$ID == boot.index,] does not give what I want. I tried subset and filter in dplyr; nothing works. Basically this is an issue of selecting whole groups by a group index. Any suggestions? Thanks!

set.seed(42)
boot.index <- sample(unique(df$ID), replace = TRUE)
boot.index
#[1] C C A
#Levels: A B C
# subset the rows for each sampled ID, then stack the pieces
do.call(rbind, lapply(boot.index, function(x) df[df$ID == x,]))
# ID v1 v2
#7 C 5 0
#8 C 6 1
#9 C 9 0
#71 C 5 0
#81 C 6 1
#91 C 9 0
#1 A 3 1
#2 A 1 0
#3 A 2 0

Using %in% to select the relevant rows would get your desired output.
> df
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
7 C 5 0
8 C 6 1
9 C 9 0
> boot.index
[1] A B A
Levels: A B C
> df[df$ID %in% boot.index,]
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1
A dplyr::filter-based solution:
> df %>% filter(ID %in% boot.index)
ID v1 v2
1 A 3 1
2 A 1 0
3 A 2 0
4 B 4 0
5 B 2 1
6 B 2 1

You can also do this with a join:
boot.index = c("B", "B", "C")
merge(data.frame("ID"=boot.index), df, by="ID", all.x=T, all.y=F)
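
Putting the pieces together, here is a minimal sketch of a full cluster-bootstrap draw that combines the resampling of unique IDs with the row-duplicating subset shown above (the function name boot_sample is ours, not from the original posts):

# draw one cluster bootstrap sample: all rows of a sampled ID stay together,
# and IDs drawn more than once contribute their rows more than once
boot_sample <- function(df, id = "ID") {
  ids <- sample(unique(df[[id]]), replace = TRUE)
  do.call(rbind, lapply(ids, function(x) df[df[[id]] == x, ]))
}

set.seed(42)
boot_sample(df)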

Related

Convert a binary dataframe to a grouped (long) list of combinations

I have the following binary dataframe
A B C D
0 1 1 0
0 0 1 1
1 1 1 0
0 1 1 1
I would like to create a list with all the column combinations and count the rows with '1' that are in common.
More precisely something like that:
A B 1
A C 1
A D 0
B A 1
B C 3
B D 1
C A 1
C B 3
C D 2
D A 0
D B 1
D C 2
But I'm struggling to think of a way to do that in R. I would appreciate any hint in the right direction.
Alternatively, a 'correlation'-like matrix would work for me. For example:
A B C D
A 0 1 1 0
B 1 0 3 1
C 1 3 0 2
D 0 1 2 0
Since I don't find purrr/apply/loops easy, my approach is like this:
library(tidyverse)
df %>%
mutate(id = row_number()) %>%
pivot_longer(cols = 1:4) %>%
left_join(df %>% mutate(id = row_number())) %>%
pivot_longer(cols = 4:7, names_to = "Name2", values_to = "Value2") %>%
filter(name != Name2, value == Value2) %>%
select(-1) %>% group_by(name, Name2) %>%
summarise(sum(value))
# A tibble: 12 x 3
# Groups: name [4]
name Name2 `sum(value)`
<chr> <chr> <int>
1 A B 1
2 A C 1
3 A D 0
4 B A 1
5 B C 3
6 B D 1
7 C A 1
8 C B 3
9 C D 2
10 D A 0
11 D B 1
12 D C 2
Explanation: convert the data to long format, then join it with the original (keeping the row ids in mind), then pivot_longer again. Filtering out identical names and unequal values gives the desired combinations, which, when summarised as the sum of the (equal) values, produce the desired output.
One gtools, dplyr and purrr option might be:
library(gtools)
library(dplyr)
library(purrr)
map_dfr(.x = asplit(permutations(length(df), 2, names(df)), 1),
~ df %>%
summarise(pair = paste(.x, collapse = ","),
n = sum(rowSums(select(., all_of(.x))) == 2)))
pair n
1 A,B 1
2 A,C 1
3 A,D 0
4 B,A 1
5 B,C 3
6 B,D 1
7 C,A 1
8 C,B 3
9 C,D 2
10 D,A 0
11 D,B 1
12 D,C 2
A pure base R option is as follows. Note that this only gives the unique combinations of columns; you can arrive at the longer version with all permutations by swapping the column order and copying the counted values (sketched after the output below).
Example Data
test <- data.frame(A = c(0, 0, 1, 0),
B = c(1, 0, 1, 1),
C = c(1,1,1,1),
D = c(0, 1, 0, 1))
Code
cmb <- combn(1:ncol(test), m = 2)
df_list <- lapply(seq_len(ncol(cmb)), function(y) test[, cmb[, y]])
values <- sapply(df_list, function(x) sum(apply(x, 1, sum) == 2))
names <- sapply(df_list, function(x) colnames(x))
df_final <- cbind.data.frame(t(names), values)
Output
> df_final
1 2 values
1 A B 1
2 A C 1
3 A D 0
4 B C 3
5 B D 1
6 C D 2
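As mentioned above, the full set of ordered pairs can be recovered from df_final by appending a copy with the two name columns swapped; a minimal sketch (our addition):

# stack the table with the two name columns swapped, then sort
swapped <- df_final[, c(2, 1, 3)]
names(swapped) <- names(df_final)
df_perm <- rbind(df_final, swapped)
df_perm[order(df_perm[[1]], df_perm[[2]]), ]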
A base R option using expand.grid + subset
transform(
subset(
rev(
expand.grid(nm <- names(df), nm)
), Var1 != Var2
),
count = apply(
cbind(Var2, Var1),
1,
function(...) sum(do.call("*", df[...]))
)
)
gives
Var2 Var1 count
2 A B 1
3 A C 1
4 A D 0
5 B A 1
7 B C 3
8 B D 1
9 C A 1
10 C B 3
12 C D 2
13 D A 0
14 D B 1
15 D C 2
I'd suggest using crossprod: for a 0/1 matrix M, crossprod(M) is t(M) %*% M, so entry (i, j) counts the rows in which both column i and column j are 1. Here, I've added diag<- to set the diagonal to zero:
"diag<-"(crossprod(as.matrix(test)), 0)
# A B C D
# A 0 1 1 0
# B 1 0 3 1
# C 1 3 0 2
# D 0 1 2 0
To get the long form, you can add a couple of steps:
mat <- "diag<-"(crossprod(as.matrix(test)), 0)
df <- data.frame(as.table(mat))
subset(df[order(df$Var1), ], Var1 != Var2)
# Var1 Var2 Freq
# 5 A B 1
# 9 A C 1
# 13 A D 0
# 2 B A 1
# 10 B C 3
# 14 B D 1
# 3 C A 1
# 7 C B 3
# 15 C D 2
# 4 D A 0
# 8 D B 1
# 12 D C 2
It's more compact using "data.table":
library(data.table)
mat <- "diag<-"(crossprod(as.matrix(test)), 0)
data.table(as.table(mat))[V1 != V2][order(V1)]
# V1 V2 N
# 1: A B 1
# 2: A C 1
# 3: A D 0
# 4: B A 1
# 5: B C 3
# 6: B D 1
# 7: C A 1
# 8: C B 3
# 9: C D 2
# 10: D A 0
# 11: D B 1
# 12: D C 2
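
As a quick sanity check of why crossprod counts co-occurrences (our addition, using the test data above), an off-diagonal entry equals the directly computed pairwise count:

crossprod(as.matrix(test))["B", "C"]   # 3
sum(test$B == 1 & test$C == 1)         # 3, the same count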

How can I subtract values within one column based on values in multiple other columns?

I have a dataframe like this:
dat <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
id = c(rep(c("A","B","C"), 3)),
measurement = c(1:8, 1))
dat
# c id measurement
# 1 0 A 1
# 2 0 B 2
# 3 0 C 3
# 4 5 A 4
# 5 5 B 5
# 6 5 C 6
# 7 10 A 7
# 8 10 B 8
# 9 10 C 1
I want to subtract the values in the column "measurement" where c is 0 from all other values in this column. This should happen separately based on the info given in the column "id". E.g. the value where c is 0 and "id" is A should be subtracted from all values where c is > 0 and "id" is A. The value where c is 0 and "id" is B should be subtracted from all values where c is > 0 and "id" is B and so on.
If the difference is negative, the result should be 0.
The result should look like this:
result <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
id = c(rep(c("A","B","C"), 3)),
measurement = c(1:8, 1),
difference = c(0,0,0,3,3,3,6,6,0))
result
# c id measurement difference
# 1 0 A 1 0
# 2 0 B 2 0
# 3 0 C 3 0
# 4 5 A 4 3
# 5 5 B 5 3
# 6 5 C 6 3
# 7 10 A 7 6
# 8 10 B 8 6
# 9 10 C 1 0
I used dplyr to select the values of "measurement" based on the info from the other columns, but unfortunately I don't know how to do the calculations. So any suggestions are welcome!
For each id you can subtract from the measurement values the measurement where c == 0. Using pmax we replace negative differences with 0.
library(dplyr)
dat %>%
group_by(id) %>%
mutate(difference = pmax(measurement - measurement[c == 0], 0))
# c id measurement difference
# <dbl> <chr> <dbl> <dbl>
#1 0 A 1 0
#2 0 B 2 0
#3 0 C 3 0
#4 5 A 4 3
#5 5 B 5 3
#6 5 C 6 3
#7 10 A 7 6
#8 10 B 8 6
#9 10 C 1 0
Try this. You can use a join after filtering the data on your defining condition (c == 0). After that, dplyr verbs are useful to reach the expected output:
library(dplyr)
#Code
new <- dat %>%
left_join(
dat %>% filter(c==0) %>% select(-c) %>% rename(Var=measurement)
) %>%
mutate(measurement=measurement-Var) %>%
replace(.<=0,0) %>% select(-Var)
Output:
c id measurement
1 0 A 0
2 0 B 0
3 0 C 0
4 5 A 3
5 5 B 3
6 5 C 3
7 10 A 6
8 10 B 6
9 10 C 0
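
For completeness, the same logic can be written in base R without dplyr; a sketch we added (not part of the original answers), assuming exactly one c == 0 row per id:

# look up each row's baseline (the measurement where c == 0 for that id),
# subtract it, and floor negative differences at zero
baseline <- with(dat, measurement[c == 0][match(id, id[c == 0])])
dat$difference <- pmax(dat$measurement - baseline, 0)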

which.max() by groups but output in the dataframe

There is this data frame given by (an example):
df <- read.table(header = TRUE, text = 'Group Utility
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
I want to use some command (I have been trying variations of which.max() to no avail) to get an additional column in the dataset, say Choice, that is an indicator of whether Utility is the max for its group, given by the Group elements. The table would look like:
Group Utility Choice
A 12 1
A 10 0
B 3 0
B 5 0
B 6 1
C 1 1
D 3 0
D 4 1
You can try this with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Choice = ifelse(Utility == max(Utility), 1, 0)) %>%
ungroup()
Output
# A tibble: 8 x 3
Group Utility Choice
<fct> <int> <dbl>
1 A 12 1
2 A 10 0
3 B 3 0
4 B 5 0
5 B 6 1
6 C 1 1
7 D 3 0
8 D 4 1
A one-liner base R solution, using ave() to compute the indicator by group (the unary + converts the logical comparison into 0/1).
df$Choice <- with(df, ave(Utility, Group, FUN = function(x) +(x == max(x))))
df
# Group Utility Choice
#1 A 12 1
#2 A 10 0
#3 B 3 0
#4 B 5 0
#5 B 6 1
#6 C 1 1
#7 D 3 0
#8 D 4 1
An option with data.table
library(data.table)
setDT(df)[, +(Utility == max(Utility)), Group]
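
If you would rather have the indicator written back into df as a Choice column instead of returned as a separate grouped result, a := variant (our adaptation of the option above) would be:

library(data.table)
setDT(df)[, Choice := +(Utility == max(Utility)), by = Group][]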

Taking the difference between two data frames in R

I was looking for an easy way of doing it with R, but I couldn't find it, so I'm posting it here.
Let's assume that I have the following data frame
state1 score1 state2 score2
A 1 A 3
A 2 B 13
A 1 C 5
B 10 A 1
B 5 B 0
B 3 C 0
C 2 A 5
C 0 B 6
C 1 C 3
and the 2nd data frame is
state1 state2 score
A A 0
A B -1
A C 3
B A 2
B B 1
B C 1
C A 2
C B 2
C C 1
Let's call the first data frame df1 and the second one df2.
Note that df1 and df2 have the same (state1, state2) pairs.
For each of those matching pairs, subtract the score in df2 from score1 in df1 and call it newscore1, and subtract the score in df2 from score2 in df1 and call it newscore2. For this case, the following would be the desired output:
state1 newscore1 state2 newscore2
A 1 A 3
A 3 B 14
A -2 C 2
B 8 A -1
B 4 B -1
B 2 C -1
C 0 A 3
C -2 B 4
C 0 C 2
Is there a one/two-liner solution to it?
Otherwise, I have to:
1) re-order df2 so that state1, state2 match with df1 (in this case I don't have to do anything, since row 1 in df1 already matches row 1 in df2, row 2 in df1 already matches row 2 in df2, and so on)
2) cbind df1$score1 - df2$score and df1$score2 - df2$score (a sketch of this manual approach is shown below)
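
For reference, a minimal sketch of that manual two-step approach (our illustration, assuming df1 and df2 are plain data frames with the columns shown):

# 1) reorder df2 so its (state1, state2) pairs line up with df1
idx <- match(paste(df1$state1, df1$state2),
             paste(df2$state1, df2$state2))
df2_ord <- df2[idx, ]
# 2) bind the differences onto the state columns
cbind(df1[c("state1", "state2")],
      newscore1 = df1$score1 - df2_ord$score,
      newscore2 = df1$score2 - df2_ord$score)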
A one-liner using library(data.table):
Do the join (as the other solutions have suggested), then use the update-by-reference operator (:=) to add the new columns in one step.
library(data.table)
setDT(df1)  # the join-update syntax requires df1 to be a data.table
df1[ df2, on = c("state1","state2"), `:=`(newscore1 = score1 - score, newscore2 = score2 - score)]
df1
# state1 score1 state2 score2 newscore1 newscore2
# 1: A 1 A 3 1 3
# 2: A 2 B 13 3 14
# 3: A 1 C 5 -2 2
# 4: B 10 A 1 8 -1
# 5: B 5 B 0 4 -1
# 6: B 3 C 0 2 -1
# 7: C 2 A 5 0 3
# 8: C 0 B 6 -2 4
# 9: C 1 C 3 0 2
Simply merge the two and subtract column by column:
dfm <- merge(df1, df2, by=c("state1", "state2"))
dfm$newscore1 <- dfm$score1 - dfm$score
dfm$newscore2 <- dfm$score2 - dfm$score
dfm <- dfm[c("state1", "newscore1", "state2", "newscore2")]
The cleanest way to do this will be with a join operation. I like dplyr for this. For example:
state1 <- gl(3, k=3, labels=c("A", "B", "C"))
score1 <- sample(1:10, size = 9, replace = TRUE)
state2 <- gl(3, k=1, length=9, labels=c("A", "B", "C"))
score2 <- sample(1:10, size = 9, replace = TRUE)
df1 <- data.frame(state1, score1, state2, score2)
Here's that first dataframe:
> df1
state1 score1 state2 score2
1 A 3 A 6
2 A 8 B 2
3 A 3 C 6
4 B 2 A 8
5 B 3 B 10
6 B 3 C 6
7 C 7 A 2
8 C 9 B 5
9 C 6 C 10
score <- sample(-5:5, size = 9, replace = TRUE)
df2 <- data.frame(state1, state2, score)
And here's the second:
> df2
state1 state2 score
1 A A -1
2 A B 1
3 A C -2
4 B A 5
5 B B 5
6 B C 5
7 C A 0
8 C B -1
9 C C -3
library(dplyr)
combined_df <- df1 %>%
# line df1 and df2 up by state1 and state2, and combine them
full_join(df2, by=c("state1", "state2")) %>%
# calculate the new columns you need
mutate(newscore1 = score1 - score, newscore2 = score2 - score) %>%
# drop the extra columns
select(state1, newscore1, state2, newscore2)
> combined_df
state1 newscore1 state2 newscore2
1 A 4 A 7
2 A 7 B 1
3 A 5 C 8
4 B -3 A 3
5 B -2 B 5
6 B -2 C 1
7 C 7 A 2
8 C 10 B 6
9 C 9 C 13

Perform function over groups in columns in R

I am completely new to R and have a question about performing a function over a column.
data <- read.table(text ="group; val
a; 4
a; 24
a; 12
b; 1
a; 2
c; 4
c; 5
b; 6 ", sep=";", header=T,stringsAsFactors = FALSE)
How could I add data in the following way?
I would like to create two new columns which I am doing like this:
data$col1 <- 0
data$col2 <- 1
What I now want to do is add +2 to the new columns for each distinct group value, reaching the following pattern:
group val col1 col2
a 4 0 1
a 24 0 1
a 12 0 1
b 1 2 3
a 2 0 1
c 4 4 5
c 5 4 5
b 6 2 3
How could I do this? I hope I made my example more or less clear.
Try this:
Creating an index to cumulatively add +2 depending on the number of groups
indx <- c(0, 2 * seq_len(length(unique(data[, 1])) - 1))
Splitting the data set by groups, adding (cumulatively) +2 and unsplitting back so everything comes back in place
data[, 3:4] <- unsplit(Map(`+`, split(data[, 3:4], data[, 1]), indx), data[, 1])
data
# group val col1 col2
# 1 a 4 0 1
# 2 a 24 0 1
# 3 a 12 0 1
# 4 b 1 2 3
# 5 a 2 0 1
# 6 c 4 4 5
# 7 c 5 4 5
# 8 b 6 2 3
Or you could do
within(data, {col1 <- 2*(as.numeric(factor(group))-1)
col2 <- col1+1})[,c(1:2,4:3)]
# group val col1 col2
#1 a 4 0 1
#2 a 24 0 1
#3 a 12 0 1
#4 b 1 2 3
#5 a 2 0 1
#6 c 4 4 5
#7 c 5 4 5
#8 b 6 2 3
Using data.table
library(data.table)
setDT(data)[, c('col1', 'col2') := {indx <- 2 * (match(group, unique(group)) - 1)
list(indx, indx + 1)}]
data
# group val col1 col2
#1: a 4 0 1
#2: a 24 0 1
#3: a 12 0 1
#4: b 1 2 3
#5: a 2 0 1
#6: c 4 4 5
#7: c 5 4 5
#8: b 6 2 3
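
For completeness, the same first-appearance indexing can also be written with dplyr (a sketch we added, not from the original answers):

library(dplyr)
data %>%
  mutate(col1 = 2 * (match(group, unique(group)) - 1),
         col2 = col1 + 1)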
