R: How do you calculate the percentage of matching observations?

My data set looks like this, where the letters are factor variables. The Pred columns hold the predicted observations for each ID and the Real columns hold the real observations. I want to calculate the overall accuracy of the predicted values for each ID.
ID Pred1 Pred2 Pred3 Real1 Real2 Real3
1 A C E A D B
2 A B D E C C
3 E C A A B D
4 D A B B B D
5 B A C C A B
So I want to mutate a column called 'Score' which gives the percentage of matched observations between the Pred1-3 and Real1-3 columns.
I only care about whether each 'Pred' value is found in any of the 'Real' columns; the order does not matter.
If only Pred1 is found in one of Real1, Real2 and Real3, then I give a score of 1/3.
If Pred1 AND Pred2 (but not Pred3) are both found anywhere in the Real1, Real2 and Real3 columns (again, the order does NOT matter; Pred1 can be found in Real2 or Real3, just anywhere in the 'Real' columns), then I give a score of 2/3.
I hope that makes sense.
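For example, for ID 1 the rule works out like this (a quick sketch in R):
pred <- c("A", "C", "E")  # Pred1..Pred3 for ID 1
real <- c("A", "D", "B")  # Real1..Real3 for ID 1
sum(pred %in% real) / length(pred)  # only "A" is found, so the score is 1/3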
So I want something like below.
ID Pred1 Pred2 Pred3 Real1 Real2 Real3 Score
1 A C E A D B 1/3
2 A B D E C C 0
3 E C A A B D 1/3
4 D A B B B D 2/3
5 B A C C A B 1
I am trying to write a function and tried something like
ifelse("Pred1" %in% c("Real1", "Real2", "Real3"), 1/3, 0)
but it didn't work well (I got error messages about coercing to logical, which I didn't know how to solve). I have tried other things too but keep getting stuck with errors.
Can anyone help please? Thank you in advance!

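For reproducibility, the example data can be recreated like this (a reconstruction of the table above; as in the question, the value columns are assumed to be factors):
df <- data.frame(ID = 1:5,
                 Pred1 = c("A", "A", "E", "D", "B"),
                 Pred2 = c("C", "B", "C", "A", "A"),
                 Pred3 = c("E", "D", "A", "B", "C"),
                 Real1 = c("A", "E", "A", "B", "C"),
                 Real2 = c("D", "C", "B", "B", "A"),
                 Real3 = c("B", "C", "D", "D", "B"),
                 stringsAsFactors = TRUE)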
Two things trip up the attempt: "Pred1" %in% c("Real1", "Real2", "Real3") compares the literal strings rather than the column values, and factor columns with different levels are hard to compare directly. We can first convert the columns from factors to characters.
df[-1] <- lapply(df[-1], as.character)
Find the indices of the Pred and Real columns, then for every row check how many of the Pred observations are present in the Real ones.
pred_cols <- grep("^Pred", names(df))
real_cols <- grep("^Real", names(df))
df$Score <- sapply(1:nrow(df), function(x)
  sum(df[x, pred_cols] %in% df[x, real_cols])) / length(pred_cols)
df
# ID Pred1 Pred2 Pred3 Real1 Real2 Real3 Score
#1 1 A C E A D B 0.33
#2 2 A B D E C C 0.00
#3 3 E C A A B D 0.33
#4 4 D A B B B D 0.67
#5 5 B A C C A B 1.00
As it is a row-wise comparison, we can also use apply with MARGIN = 1 and the same logic. With this approach we don't need to explicitly convert the columns to characters, because apply coerces the data frame to a character matrix first.
apply(df, 1, function(x) sum(x[pred_cols] %in% x[real_cols]))/length(pred_cols)
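A quick way to confirm that coercion (a sketch):
# Every row reaches the anonymous function as a character vector,
# because apply() calls as.matrix() on the data frame first:
all(apply(df, 1, is.character))
# [1] TRUE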

One tidyverse possibility could be:
bind_cols(df %>%
            gather(var, val, -matches("(Real|ID)")) %>%
            select(ID, val),
          df %>%
            gather(var2, val2, -matches("(Pred|ID)")) %>%
            select(val2)) %>%
  group_by(ID) %>%
  summarise(res = paste0(sum(val %in% val2), "/3")) %>%
  left_join(df, by = c("ID" = "ID"))
ID res Pred1 Pred2 Pred3 Real1 Real2 Real3
<int> <chr> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 1/3 A C E A D B
2 2 0/3 A B D E C C
3 3 1/3 E C A A B D
4 4 2/3 D A B B B D
5 5 3/3 B A C C A B
First, it separately transforms the columns that contain Pred and Real from wide to long format. Second, it combines the two by columns. Finally, it groups by "ID", sums the number of matching cases and joins the result back to the original df.
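For intuition, this is what the first long-format piece looks like before the column bind (first rows only; head() added for brevity):
df %>%
  gather(var, val, -matches("(Real|ID)")) %>%
  select(ID, val) %>%
  head(3)
#   ID val
# 1  1   A
# 2  2   A
# 3  3   E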
Or if the number of pairs is not fixed to 3:
bind_cols(df %>%
            gather(var, val, -matches("(Real|ID)")) %>%
            select(ID, val),
          df %>%
            gather(var2, val2, -matches("(Pred|ID)")) %>%
            select(val2)) %>%
  add_count(ID) %>%
  group_by(ID) %>%
  summarise(res = paste(sum(val %in% val2), first(n), sep = "/")) %>%
  left_join(df, by = c("ID" = "ID"))
Or if you want a numeric variable as the result:
bind_cols(df %>%
            gather(var, val, -matches("(Real|ID)")) %>%
            select(ID, val),
          df %>%
            gather(var2, val2, -matches("(Pred|ID)")) %>%
            select(val2)) %>%
  add_count(ID) %>%
  group_by(ID) %>%
  summarise(res = sum(val %in% val2) / first(n)) %>%
  left_join(df, by = c("ID" = "ID"))
ID res Pred1 Pred2 Pred3 Real1 Real2 Real3
<int> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 0.333 A C E A D B
2 2 0 A B D E C C
3 3 0.333 E C A A B D
4 4 0.667 D A B B B D
5 5 1 B A C C A B

Related

R: Repeating rows of a dataframe with respect to multiple count columns

I have an R data frame with a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2 = c('c', 'd'))
So visually the data frame looks like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
  pivot_longer(starts_with("var"), names_to = "class") %>%
  filter(value != 0) %>%
  uncount(value) %>%
  mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
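For intuition, uncount() simply repeats each row according to a weights column and then drops that column (a minimal sketch):
tibble(x = c("a", "b"), n = c(1, 2)) %>% uncount(n)
# # A tibble: 3 x 1
#   x
#   <chr>
# 1 a
# 2 b
# 3 b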
Another slight variation is to use expandRows from splitstackshape in conjunction with the tidyverse.
library(splitstackshape)
df %>%
  pivot_longer(starts_with("var"), names_to = "class") %>%
  filter(value != 0) %>%
  expandRows("value") %>%
  mutate(class = str_extract(class, "\\d+"))
base R
Row order (and row names) notwithstanding:
tmp <- subset(reshape2::melt(df, id.vars = c("f1", "f2"), value.name = "class"),
              class > 0, select = -variable)
tmp[rep(seq_along(tmp$class), times = tmp$class), ]
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
library(dplyr)
# library(tidyr) # pivot_longer
df %>%
  pivot_longer(-c(f1, f2), values_to = "class") %>%
  dplyr::filter(class > 0) %>%
  select(-name) %>%
  slice(rep(row_number(), times = class))
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2

R group-by issue with multiple, overlapping combinations

I'm using the group-by function on a dataset in R, but some IDs need to be counted in more than one target group. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With a traditional group-by on each ID, I can do
DT <- data.table(dataset)
DT[, sum(Var1), by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group the IDs as A+B, B+C and D
(PS: say that F = A+B and G = B+C)
and get the target result below:
ID V1
F 6
G 6
D 2
If I use a recoding technique on ID, the duplicated B can only be assigned to one group, yet it has to be counted in both F and G.
Does anyone have a solution?
Many thanks!
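For reference, the sample data can be recreated like this (a reconstruction of the table above):
df <- data.frame(ID = c("A", "A", "B", "C", "C", "D"),
                 Var1 = c(1, 3, 2, 3, 1, 2))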
library(dplyr)
library(tidyr)

df <- df %>% mutate(F = ifelse(ID %in% c("A", "B"), 1, 0),
                    G = ifelse(ID %in% c("B", "C"), 1, 0),
                    D = ifelse(ID == "D", 1, 0))

df %>%
  gather(var, val, F:D) %>%
  filter(val == 1) %>%
  group_by(var) %>%
  summarise(V1 = sum(Var1))
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
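Since the question started from data.table, here is a hedged data.table sketch of the same idea (the named groups list is my own framing, not from the original post):
library(data.table)
DT <- data.table(ID = c("A", "A", "B", "C", "C", "D"), Var1 = c(1, 3, 2, 3, 1, 2))
groups <- list(F = c("A", "B"), G = c("B", "C"), D = "D")
rbindlist(lapply(names(groups), function(g)
  DT[ID %in% groups[[g]], .(ID = g, V1 = sum(Var1))]))
#    ID V1
# 1:  F  6
# 2:  G  6
# 3:  D  2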

Creating Nodes and Edges Dataframes from Tidy Dataframes

I have a data frame that's of this structure:
df <- data.frame(var1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 cat1 = c("A", "B", "D", "B", "C", "D", "E", "B", "A"))
> df
var1 cat1
1 1 A
2 1 B
3 1 D
4 2 B
5 2 C
6 3 D
7 3 E
8 3 B
9 3 A
And I am looking to create both nodes and edges data frames from it, so that I can draw a network graph using visNetwork. This network will show the number/strength of connections between the different cat1 values, as grouped by the var1 value.
I have the nodes data frame sorted:
nodes <- data.frame(id = unique(df$cat1))
> nodes
id
1 A
2 B
3 D
4 C
5 E
What I'd like help with is how to process df in the following manner:
for each distinct value of var1 in df, tally up the pairs of nodes that occur together under that value of var1, to give an edges data frame that ultimately looks like the one below. Note that I'm not bothered about the direction of flow along the edges; that they are connected is all I need.
> edges
from to value
1 A B 2
2 A D 2
3 A E 1
4 B C 1
5 B D 2
6 B E 1
7 D E 1
With thanks in anticipation,
Nevil
Update: I found a similar problem here and adapted that code, which gets close to what I want, but not quite there:
> df %>%
    group_by(var1) %>%
    filter(n() >= 2) %>%
    do(data.frame(t(combn(.$cat1, 2, function(x) sort(x))),
                  stringsAsFactors = FALSE))
# A tibble: 10 x 3
# Groups: var1 [3]
var1 X1 X2
<dbl> <chr> <chr>
1 1. A B
2 1. A D
3 1. B D
4 2. B C
5 3. D E
6 3. B D
7 3. A D
8 3. B E
9 3. A E
10 3. A B
I don't know whether there is already a suitable function for this task, so here is a detailed procedure. With this, you should be able to define your own function. Hope it helps!
# create an adjacency matrix
mat <- table(df)
mat <- t(mat) %*% mat
as.table(mat) # look at your adjacency matrix
# since the network is not directed, we can consider only the (strictly) upper triangular matrix
mat[lower.tri(mat, diag = TRUE)] <- 0
as.table(mat) # look at the new adjacency matrix
library(dplyr)
edges <- as.data.frame(as.table(mat))
edges <- filter(edges, Freq != 0)
colnames(edges) <- c("from", "to", "value")
edges <- arrange(edges, from)
edges # output
# from to value
#1 A B 2
#2 A D 2
#3 A E 1
#4 B C 1
#5 B D 2
#6 B E 1
#7 D E 1
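With the nodes and edges frames in place, drawing the graph is then a single call (a sketch, assuming the visNetwork package; its value column scales edge width):
library(visNetwork)
visNetwork(nodes, edges)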
Here are a couple of other ways.
In base R:
values <- unique(df$var1[duplicated(df$var1)])

do.call(rbind,
        lapply(values, function(i) {
          nodes <- as.character(df$cat1[df$var1 == i])
          edges <- combn(nodes, 2)
          data.frame(from = edges[1, ],
                     to = edges[2, ],
                     value = i,
                     stringsAsFactors = F)
        }))
In tidyverse:
library(dplyr)
library(tidyr)
df %>%
  group_by(var1) %>%
  filter(n() >= 2) %>%
  mutate(cat1 = as.character(cat1)) %>%
  summarise(edges = list(data.frame(t(combn(cat1, 2)), stringsAsFactors = F))) %>%
  unnest(edges) %>%
  select(from = X1, to = X2, value = var1)
In tidyverse, using tidyr::complete:
library(dplyr)
library(tidyr)
df %>%
  group_by(var1) %>%
  mutate(cat1 = as.character(cat1)) %>%
  mutate(i.cat1 = cat1) %>%
  complete(cat1, i.cat1) %>%
  filter(cat1 < i.cat1) %>%
  select(from = cat1, to = i.cat1, value = var1)
In tidyverse, using tidyr::expand:
library(dplyr)
library(tidyr)
df %>%
  group_by(var1) %>%
  mutate(cat1 = as.character(cat1)) %>%
  expand(cat1, to = cat1) %>%
  filter(cat1 < to) %>%
  select(from = cat1, to, value = var1)
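These variants return one row per pair per group (value is the group id); if the tallied edge weights from the expected output are wanted, one possible final step (my addition, not part of the original answer) is:
df %>%
  group_by(var1) %>%
  mutate(cat1 = as.character(cat1)) %>%
  expand(cat1, to = cat1) %>%
  filter(cat1 < to) %>%
  ungroup() %>%
  count(cat1, to) %>%
  rename(from = cat1, value = n)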

Remove exact rows, with their frequency, of a data.frame that are in another data.frame in R

Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is to add a duplicate counter to each table, so you get a unique match for each occurrence of a row. data.table is nice here because counting duplicates is easy (with .N), and it also provides the necessary set-operation function (fsetdiff).
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = T)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on, then delete later.
# Create a row number
a1_tmp <-
  a1 %>%
  group_by(A, B) %>%
  mutate(tmp_id = row_number()) %>%
  ungroup()
# Create a count
a2_tmp <-
  a2 %>%
  group_by(A, B) %>%
  summarise(count = n()) %>%
  ungroup()
## Keep all rows that have no entry in a2, or whose id exceeds the count (i.e. the a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
  ungroup() %>%
  filter(is.na(count) | tmp_id > count) %>%
  select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. It does the following: (1) adds a row-number column to both data frames to join on; (2) adds a temporary column to a2 (the 2nd data frame) that will show up as NA after the join wherever a row of a1 has no match (i.e. it is unique to a1).
library(dplyr)

left_join(a1 %>% group_by(A, B) %>% mutate(rn = row_number()) %>% ungroup(),
          a2 %>% group_by(A, B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
  filter(is.na(tmpcol)) %>%
  select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
I guess this is similar to DWal's solution but in base R
# Collapse each row into a single string
a1_temp <- Reduce(paste, a1)
# Append a within-duplicate counter so repeated rows stay distinguishable
a1_temp <- paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
# Do the same for a2, then keep the a1 rows with no exact counterpart
a2_temp <- Reduce(paste, a2)
a2_temp <- paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
a1[!a1_temp %in% a2_temp, ]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
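The ave() call is what keeps repeated rows distinguishable; it builds a per-duplicate counter (a quick illustration):
ave(seq_along(c("2 b", "2 b", "4 d")), c("2 b", "2 b", "4 d"), FUN = seq_along)
# [1] 1 2 1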
Here's another solution with dplyr:
library(dplyr)
a1 %>%
  arrange(A) %>%
  group_by(A) %>%
  filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.

Select rows based on non-directed combinations of columns

I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1 != group2)
df$score <- runif(n = 12, min = 0, max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}
group_id <- get_group_id(select(df, -score))
# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
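The key is that get_group_id() builds an order-independent key, so (a, b) and (b, a) fall into the same group (a quick illustration):
get_group_id(data.frame(group1 = c("a", "b"), group2 = c("b", "a")))
# [1] "a_b" "a_b"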
