Taking the difference between two data frames in R

I was looking for an easy way of doing it with R, but I couldn't find it, so I'm posting it here.
Let's assume that I have the following data frame
state1 score1 state2 score2
     A      1      A      3
     A      2      B     13
     A      1      C      5
     B     10      A      1
     B      5      B      0
     B      3      C      0
     C      2      A      5
     C      0      B      6
     C      1      C      3
and the second data frame is
state1 state2 score
     A      A     0
     A      B    -1
     A      C     3
     B      A     2
     B      B     1
     B      C     1
     C      A     2
     C      B     2
     C      C     1
Let's call the first data frame df1 and the second (the margin) df2. Note that df1 and df2 contain the same (state1, state2) pairs.
For each matching pair, subtract score in df2 from score1 in df1 and call it newscore1, and subtract score in df2 from score2 in df1 and call it newscore2. In this case, the following would be the desired output.
state1 newscore1 state2 newscore2
     A         1      A         3
     A         3      B        14
     A        -2      C         2
     B         8      A        -1
     B         4      B        -1
     B         2      C        -1
     C         0      A         3
     C        -2      B         4
     C         0      C         2
Is there a one- or two-liner solution to this?
Otherwise, I have to:
1) re-order df2 so that (state1, state2) matches df1 (in this case I don't have to do anything, since row 1 of df1 already matches row 1 of df2, row 2 of df1 matches row 2 of df2, and so on)
2) cbind df1$score1 - df2$score and df1$score2 - df2$score, as sketched below
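For reference, a minimal base R sketch of that manual approach, assuming df1 and df2 are built as shown above (match() aligns the pairs, so step 1 works even when the rows are not already in the same order; the index i is just for illustration):
# align df2 to df1's (state1, state2) pairs, then subtract
i <- match(paste(df1$state1, df1$state2), paste(df2$state1, df2$state2))
out <- data.frame(state1    = df1$state1,
                  newscore1 = df1$score1 - df2$score[i],
                  state2    = df1$state2,
                  newscore2 = df1$score2 - df2$score[i])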

A one-liner using data.table.
Do the join (as the other solutions have suggested), and then use the update-by-reference operator (:=) to add the new columns in one step. Note that := modifies df1 in place, and both objects must be data.tables:
library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = c("state1", "state2"), `:=`(newscore1 = score1 - score, newscore2 = score2 - score)]
df1
#    state1 score1 state2 score2 newscore1 newscore2
# 1:      A      1      A      3         1         3
# 2:      A      2      B     13         3        14
# 3:      A      1      C      5        -2         2
# 4:      B     10      A      1         8        -1
# 5:      B      5      B      0         4        -1
# 6:      B      3      C      0         2        -1
# 7:      C      2      A      5         0         3
# 8:      C      0      B      6        -2         4
# 9:      C      1      C      3         0         2

Simply merge the two data frames and subtract column by column:
dfm <- merge(df1, df2, by = c("state1", "state2"))
dfm$newscore1 <- dfm$score1 - dfm$score
dfm$newscore2 <- dfm$score2 - dfm$score
dfm <- dfm[c("state1", "newscore1", "state2", "newscore2")]
Note that merge() sorts the result by the by columns; with this example df1 is already sorted that way, so the row order is unchanged.

The cleanest way to do this is with a join operation. I like dplyr for this. For example:
library(dplyr)
state1 <- gl(3, k = 3, labels = c("A", "B", "C"))
score1 <- sample(1:10, size = 9, replace = TRUE)
state2 <- gl(3, k = 1, length = 9, labels = c("A", "B", "C"))
score2 <- sample(1:10, size = 9, replace = TRUE)
df1 <- data.frame(state1, score1, state2, score2)
Here's that first data frame (the scores are random draws, so your values will differ):
> df1
  state1 score1 state2 score2
1      A      3      A      6
2      A      8      B      2
3      A      3      C      6
4      B      2      A      8
5      B      3      B     10
6      B      3      C      6
7      C      7      A      2
8      C      9      B      5
9      C      6      C     10
score <- sample(-5:5, size = 9, replace = TRUE)
df2 <- data.frame(state1, state2, score)
And here's the second:
> df2
  state1 state2 score
1      A      A    -1
2      A      B     1
3      A      C    -2
4      B      A     5
5      B      B     5
6      B      C     5
7      C      A     0
8      C      B    -1
9      C      C    -3
combined_df <- df1 %>%
  # line df1 and df2 up by state1 and state2, and combine them
  full_join(df2, by = c("state1", "state2")) %>%
  # calculate the new columns you need
  mutate(newscore1 = score1 - score, newscore2 = score2 - score) %>%
  # drop the extra columns
  select(state1, newscore1, state2, newscore2)
> combined_df
  state1 newscore1 state2 newscore2
1      A         4      A         7
2      A         7      B         1
3      A         5      C         8
4      B        -3      A         3
5      B        -2      B         5
6      B        -2      C         1
7      C         7      A         2
8      C        10      B         6
9      C         9      C        13

Related

How can I stack my dataset so each observation relates to all other observations but itself?

I would like to stack my dataset so that each observation is related to all other observations but itself.
Suppose I have the following dataset:
df <- data.frame(id = c("a", "b", "c", "d"),
                 x1 = c(1, 2, 3, 4))
df
  id x1
1  a  1
2  b  2
3  c  3
4  d  4
I would like observation a to be related to b, c, and d, and the same for every other observation. The result should look something like this:
   id x1 id2 x2
1   a  1   b  2
2   a  1   c  3
3   a  1   d  4
4   b  2   a  1
5   b  2   c  3
6   b  2   d  4
7   c  3   a  1
8   c  3   b  2
9   c  3   d  4
10  d  4   a  1
11  d  4   b  2
12  d  4   c  3
So observation a is related to b, c, d; observation b is related to a, c, d; and so on. Any ideas?
Another option:
library(dplyr)
left_join(df, df, by = character()) %>%
  filter(id.x != id.y)
Or
output <- merge(df, df, by = NULL)
output <- output[output$id.x != output$id.y, ]
Thanks @ritchie-sacramento, I didn't know the by = NULL option for merge before, and thanks @zephryl for the by = character() option for dplyr joins.
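As an aside, in dplyr 1.1.0 and later the by = character() idiom is deprecated in favour of a dedicated cross-join verb; assuming a recent dplyr, this should be equivalent:
library(dplyr)
# cross_join() pairs every row of the first table with every row of the second
cross_join(df, df) %>%
  filter(id.x != id.y)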
tidyr::expand_grid() accepts data frames, which can then be filtered to remove rows that share the id:
library(tidyr)
library(dplyr)
expand_grid(df, df, .name_repair = make.unique) %>%
  filter(id != id.1)
# A tibble: 12 × 4
   id       x1 id.1   x1.1
   <chr> <dbl> <chr> <dbl>
 1 a         1 b         2
 2 a         1 c         3
 3 a         1 d         4
 4 b         2 a         1
 5 b         2 c         3
 6 b         2 d         4
 7 c         3 a         1
 8 c         3 b         2
 9 c         3 d         4
10 d         4 a         1
11 d         4 b         2
12 d         4 c         3
You can use combn() to get all combinations of row indices, then assemble your dataframe from those:
rws <- cbind(combn(nrow(df), 2), combn(nrow(df), 2, rev))
df2 <- cbind(df[rws[1, ], ], df[rws[2, ], ])
# clean up row and column names
rownames(df2) <- 1:nrow(df2)
colnames(df2) <- c("id", "x1", "id2", "x2")
df2
   id x1 id2 x2
1   a  1   b  2
2   a  1   c  3
3   a  1   d  4
4   b  2   c  3
5   b  2   d  4
6   c  3   d  4
7   b  2   a  1
8   c  3   a  1
9   d  4   a  1
10  c  3   b  2
11  d  4   b  2
12  d  4   c  3
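Note the row order differs from the desired output (the unique combinations come first, followed by their reverses); if the order matters, one more sort, sketched here, lines it up:
# sort by the first id, then the second, to match the join-based outputs
df2 <- df2[order(df2$id, df2$id2), ]
rownames(df2) <- NULL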

Convert a binary dataframe to a grouped (long) list of combinations

I have the following binary dataframe
A B C D
0 1 1 0
0 0 1 1
1 1 1 0
0 1 1 1
I would like to create a list of all column pair combinations, together with a count of the rows in which both columns are 1.
More precisely, something like this:
A B 1
A C 1
A D 0
B A 1
B C 3
B D 1
C A 1
C B 3
C D 2
D A 0
D B 1
D C 2
But I'm struggling to think of a way to do that in R. I would appreciate any hint in the right direction.
Alternatively, a 'correlation'-like matrix would work for me. For example:
  A B C D
A 0 1 1 0
B 1 0 3 1
C 1 3 0 2
D 0 1 2 0
Since I don't find purrr/apply/loops easy to work with, my approach is like this:
library(tidyverse)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = 1:4) %>%
  left_join(df %>% mutate(id = row_number())) %>%
  pivot_longer(cols = 4:7, names_to = "Name2", values_to = "Value2") %>%
  filter(name != Name2, value == Value2) %>%
  select(-1) %>%
  group_by(name, Name2) %>%
  summarise(sum(value))
# A tibble: 12 x 3
# Groups:   name [4]
   name  Name2 `sum(value)`
   <chr> <chr>        <int>
 1 A     B                1
 2 A     C                1
 3 A     D                0
 4 B     A                1
 5 B     C                3
 6 B     D                1
 7 C     A                1
 8 C     B                3
 9 C     D                2
10 D     A                0
11 D     B                1
12 D     C                2
Explanation: convert the data to long format, then join it back to the original (keeping the row ids in mind), pivot longer again, and filter out pairs with the same name or differing values. That gives the desired combinations, which, when summarised as the sum of the (equal) values, produce the desired output.
One gtools, dplyr and purrr option might be:
library(gtools)
library(dplyr)
library(purrr)
map_dfr(.x = asplit(permutations(length(df), 2, names(df)), 1),
        ~ df %>%
          summarise(pair = paste(.x, collapse = ","),
                    n = sum(rowSums(select(., all_of(.x))) == 2)))
   pair n
1   A,B 1
2   A,C 1
3   A,D 0
4   B,A 1
5   B,C 3
6   B,D 1
7   C,A 1
8   C,B 3
9   C,D 2
10  D,A 0
11  D,B 1
12  D,C 2
A pure base R option is as follows. Note that this only gives the unique combinations of columns; you can arrive at the longer version with all permutations by swapping the column order and copying the counted values (see the sketch after the output).
Example Data
test <- data.frame(A = c(0, 0, 1, 0),
                   B = c(1, 0, 1, 1),
                   C = c(1, 1, 1, 1),
                   D = c(0, 1, 0, 1))
Code
df_list <- lapply(1:ncol(combn(1:ncol(test), m = 2)),
                  function(y) test[, combn(1:ncol(test), m = 2)[, y]])
values <- sapply(df_list, function(x) sum(apply(x, 1, sum) == 2))
names <- sapply(df_list, function(x) colnames(x))
df_final <- cbind.data.frame(t(names), values)
Output
> df_final
  1 2 values
1 A B      1
2 A C      1
3 A D      0
4 B C      3
5 B D      1
6 C D      2
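A hedged sketch of the permutation version mentioned above, built by swapping the two name columns of df_final and stacking the result (the names df_rev and df_long are just for illustration):
# swap the pair columns, reuse the counts, and stack to get all permutations
df_rev <- df_final[, c(2, 1, 3)]
names(df_rev) <- names(df_final)
df_long <- rbind(df_final, df_rev)
df_long <- df_long[order(df_long[[1]], df_long[[2]]), ]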
A base R option using expand.grid + subset
transform(
  subset(
    # all ordered pairs of column names; self-pairs are removed by the filter
    rev(
      expand.grid(nm <- names(df), nm)
    ),
    Var1 != Var2
  ),
  # for each pair, multiply the two 0/1 columns and sum the products
  count = apply(
    cbind(Var2, Var1),
    1,
    function(...) sum(do.call("*", df[...]))
  )
)
gives
   Var2 Var1 count
2     A    B     1
3     A    C     1
4     A    D     0
5     B    A     1
7     B    C     3
8     B    D     1
9     C    A     1
10    C    B     3
12    C    D     2
13    D    A     0
14    D    B     1
15    D    C     2
I'd suggest using crossprod: for a 0/1 matrix X, crossprod(X) computes t(X) %*% X, whose (i, j) entry counts the rows where columns i and j are both 1. Here, I've added diag to set the diagonal to zero:
"diag<-"(crossprod(as.matrix(test)), 0)
#   A B C D
# A 0 1 1 0
# B 1 0 3 1
# C 1 3 0 2
# D 0 1 2 0
To get the long form, you can add a couple of steps:
mat <- "diag<-"(crossprod(as.matrix(test)), 0)
df <- data.frame(as.table(mat))
subset(df[order(df$Var1), ], Var1 != Var2)
#    Var1 Var2 Freq
# 5     A    B    1
# 9     A    C    1
# 13    A    D    0
# 2     B    A    1
# 10    B    C    3
# 14    B    D    1
# 3     C    A    1
# 7     C    B    3
# 15    C    D    2
# 4     D    A    0
# 8     D    B    1
# 12    D    C    2
It's more compact using "data.table":
library(data.table)
mat <- "diag<-"(crossprod(as.matrix(test)), 0)
data.table(as.table(mat))[V1 != V2][order(V1)]
#     V1 V2 N
#  1:  A  B 1
#  2:  A  C 1
#  3:  A  D 0
#  4:  B  A 1
#  5:  B  C 3
#  6:  B  D 1
#  7:  C  A 1
#  8:  C  B 3
#  9:  C  D 2
# 10:  D  A 0
# 11:  D  B 1
# 12:  D  C 2

How can I subtract values within one column based on values in multiple other columns?

I have a dataframe like this:
dat <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
                  id = rep(c("A", "B", "C"), 3),
                  measurement = c(1:8, 1))
dat
#    c id measurement
# 1  0  A           1
# 2  0  B           2
# 3  0  C           3
# 4  5  A           4
# 5  5  B           5
# 6  5  C           6
# 7 10  A           7
# 8 10  B           8
# 9 10  C           1
I want to subtract the values in the column "measurement" where c is 0 from all other values in this column. This should happen separately based on the info given in the column "id". E.g. the value where c is 0 and "id" is A should be subtracted from all values where c is > 0 and "id" is A. The value where c is 0 and "id" is B should be subtracted from all values where c is > 0 and "id" is B and so on.
If the difference would be negative, the result should be 0.
The result should look like this:
result <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
                     id = rep(c("A", "B", "C"), 3),
                     measurement = c(1:8, 1),
                     difference = c(0, 0, 0, 3, 3, 3, 6, 6, 0))
result
#    c id measurement difference
# 1  0  A           1          0
# 2  0  B           2          0
# 3  0  C           3          0
# 4  5  A           4          3
# 5  5  B           5          3
# 6  5  C           6          3
# 7 10  A           7          6
# 8 10  B           8          6
# 9 10  C           1          0
I used dplyr to select the values of "measurement" based on the info from the other columns, but unfortunately I don't know how to do the calculations. So any suggestions are welcome!
For each id you can subtract from the measurement values the value of the row where c == 0. Using pmax we replace negative values with 0.
library(dplyr)
dat %>%
  group_by(id) %>%
  mutate(difference = pmax(measurement - measurement[c == 0], 0))
#       c id    measurement difference
#   <dbl> <chr>       <dbl>      <dbl>
# 1     0 A               1          0
# 2     0 B               2          0
# 3     0 C               3          0
# 4     5 A               4          3
# 5     5 B               5          3
# 6     5 C               6          3
# 7    10 A               7          6
# 8    10 B               8          6
# 9    10 C               1          0
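For reference, a hedged base R equivalent of the same idea: look up each id's baseline measurement (the row where c == 0) and clamp at zero with pmax. The intermediate name baseline is just for illustration:
# for each row, find the measurement of the same id where c == 0
baseline <- dat$measurement[dat$c == 0][match(dat$id, dat$id[dat$c == 0])]
dat$difference <- pmax(dat$measurement - baseline, 0)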
Try this. You can use a join and filter the data for the condition you defined. After that, dplyr verbs are useful to reach the expected output:
library(dplyr)
# Code
new <- dat %>%
  left_join(
    dat %>% filter(c == 0) %>% select(-c) %>% rename(Var = measurement)
  ) %>%
  mutate(measurement = measurement - Var) %>%  # note: this overwrites measurement
  replace(. <= 0, 0) %>%
  select(-Var)
Output:
   c id measurement
1  0  A           0
2  0  B           0
3  0  C           0
4  5  A           3
5  5  B           3
6  5  C           3
7 10  A           6
8 10  B           6
9 10  C           0

R: select rows by group after resampling

I want to do bootstrapping manually for a panel dataset. I need to cluster at the individual level to keep later manipulations consistent; that is to say, all the observations for the same individual need to be selected together in the bootstrap sample. What I do is resample with replacement from the vector of unique individual IDs, which is then used as the index.
df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 v1 = c(3, 1, 2, 4, 2, 2, 5, 6, 9),
                 v2 = c(1, 0, 0, 0, 1, 1, 0, 1, 0))
boot.index <- sample(unique(df$ID), replace = TRUE)
Then I select rows according to the index. Suppose boot.index = (B, B, C); I want to have a data frame like this:
ID v1 v2
 B  4  0
 B  2  1
 B  2  1
 B  4  0
 B  2  1
 B  2  1
 C  5  0
 C  6  1
 C  9  0
Apparently df1 <- df[df$ID == boot.index, ] does not give what I want. I tried subset and filter in dplyr; nothing works. Basically this is an issue of selecting whole groups by a group index. Any suggestions? Thanks!
set.seed(42)
boot.index <- sample(unique(df$ID), replace = TRUE)
boot.index
#[1] C C A
#Levels: A B C
do.call(rbind, lapply(boot.index, function(x) df[df$ID == x,]))
#    ID v1 v2
# 7   C  5  0
# 8   C  6  1
# 9   C  9  0
# 71  C  5  0
# 81  C  6  1
# 91  C  9  0
# 1   A  3  1
# 2   A  1  0
# 3   A  2  0
%in% to select the relevant rows would get your desired output. Note, though, that %in% returns each matching row only once, so an ID drawn more than once in the bootstrap sample will not be duplicated; the lapply/rbind and merge answers handle that case.
> df
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
7  C  5  0
8  C  6  1
9  C  9  0
> boot.index
[1] A B A
Levels: A B C
> df[df$ID %in% boot.index, ]
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
dplyr::filter based solution:
> df %>% filter(ID %in% boot.index)
  ID v1 v2
1  A  3  1
2  A  1  0
3  A  2  0
4  B  4  0
5  B  2  1
6  B  2  1
You can also do this with a join. Because merge() repeats the matched rows once for each occurrence in boot.index, IDs sampled more than once are duplicated, as bootstrapping requires:
boot.index <- c("B", "B", "C")
merge(data.frame(ID = boot.index), df, by = "ID", all.x = TRUE, all.y = FALSE)

Remove semi-duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C", 2), "D")
b <- c(NA, 1, 2, 4, 1, NA, 2, NA, NA)
c <- c(1, 1, 2, 4, 1, 1, 2, 2, 2)
d <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(a, b, c, d)
  a  b c d
1 A NA 1 1
2 A  1 1 2
3 A  2 2 3
4 B  4 4 4
5 B  1 1 5
6 B NA 1 6
7 C  2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that the rows with values in column b are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b', and the logical vector based on 'b', so that the NA elements come last. Then apply duplicated on 'a' and 'c' and keep only the non-duplicated rows.
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
#   a  b c d
# 2 A  1 1 2
# 3 A  2 2 3
# 5 B  1 1 5
# 4 B  4 4 4
# 7 C  2 2 7
# 9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two datasets: one with the values of column a that occur more than once, and one with the values that occur only once:
x  <- df[df$a %in% names(which(table(df$a) > 1)), ]
x1 <- df[df$a %in% names(which(table(df$a) == 1)), ]
Now use the na.omit function on dataset x to delete the rows with NA, and then rbind x and x1 into the final dataset.
rbind(na.omit(x), x1)
Answer:
  a  b c d
2 A  1 1 2
3 A  2 2 3
4 B  4 4 4
5 B  1 1 5
7 C  2 2 7
9 D NA 2 9
You can use dplyr to do this.
df %>% distinct(a, c, .keep_all = TRUE)
Note that distinct() keeps the first row within each group of (a, c), so here the b = NA row of group (A, 1) survives; to prioritize rows with non-missing b, sort first (see the sketch after the output).
Output
  a  b c d
1 A NA 1 1
2 A  2 2 3
3 B  4 4 4
4 B  1 1 5
5 C  2 2 7
6 D NA 2 9
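A hedged variation that matches the desired result by sorting the non-NA rows of b to the front of each (a, c) group before deduplicating:
library(dplyr)
df %>%
  arrange(a, c, is.na(b)) %>%   # non-NA b values come first within each (a, c) group
  distinct(a, c, .keep_all = TRUE)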
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
