comparing dataframe - percentage of mismatched cases


I have two people coding the same variables for a psychological test (more than 400 variables), and I need to compare the two datasets. I need two results:
1. the specific cases where the two codings do not match, and
2. as the final result, the percentage of mismatches per variable.
Here is what I mean by the "percentage of mismatch per variable":
library(tibble)
A <- tibble( ID = c(1:10),
v1 = rep(1),
v2 = rep(2),
v3 = rep(3))
B <- tibble( ID = c(1:10),
v1 = c(1,1,1,1,10,1,1,1,1,1),
v2 = c(30,2,2,2,51,2,2,2,2,40),
v3 = c(3,3,3,3,3,3,3,65,3,90))
A;B
# A tibble: 10 x 4
ID v1 v2 v3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 1 2 3
4 4 1 2 3
5 5 1 2 3
6 6 1 2 3
7 7 1 2 3
8 8 1 2 3
9 9 1 2 3
10 10 1 2 3
# A tibble: 10 x 4
ID v1 v2 v3
<int> <dbl> <dbl> <dbl>
1 1 1 30 3
2 2 1 2 3
3 3 1 2 3
4 4 1 2 3
5 5 10 51 3
6 6 1 2 3
7 7 1 2 3
8 8 1 2 65
9 9 1 2 3
10 10 1 40 90
How do I compare dataset A with dataset B to get a result like this:
result<- tibble(variables = c("v1", "v2", "v3"),
n.mismatch = c(1,3,2),
percentage.mismatch = c(0.10, 0.30, 0.20))
result
# A tibble: 3 x 3
variables n.mismatch percentage.mismatch
<chr> <dbl> <dbl>
1 v1 1 0.1
2 v2 3 0.3
3 v3 2 0.2

We can use Map to compare the column values.
as.data.frame(Map(function(x, y) {
inds <- x != y
c(n.mismatch = sum(inds), percentage.mismatch = mean(inds))
}, A[-1], B[-1]))
# v1 v2 v3
#n.mismatch 1.0 3.0 2.0
#percentage.mismatch 0.1 0.3 0.2
Similarly, in tidyverse, we can use map2
purrr::map2_df(A[-1], B[-1], ~{
inds = .x != .y
tibble::tibble(n.mismatch = sum(inds), percentage.mismatch = mean(inds))
}, .id = "variables")
# variables n.mismatch percentage.mismatch
# <chr> <int> <dbl>
#1 v1 1 0.1
#2 v2 3 0.3
#3 v3 2 0.2
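The answers above cover the per-variable percentages; the question also asks to see only the specific cases that mismatch. A minimal base R sketch of one way to list them, assuming both data sets contain the same IDs (the names `vars`, `diff_mat`, `pos` and `mismatches` are made up for this illustration):

```r
# Toy data from the question, as plain data frames to avoid extra dependencies
A <- data.frame(ID = 1:10, v1 = 1, v2 = 2, v3 = 3)
B <- data.frame(ID = 1:10,
                v1 = c(1, 1, 1, 1, 10, 1, 1, 1, 1, 1),
                v2 = c(30, 2, 2, 2, 51, 2, 2, 2, 2, 40),
                v3 = c(3, 3, 3, 3, 3, 3, 3, 65, 3, 90))

# Assumption: every ID in A also occurs in B; put B's rows in A's order
B <- B[match(A$ID, B$ID), ]

vars <- setdiff(names(A), "ID")
mat_A <- as.matrix(A[vars])
mat_B <- as.matrix(B[vars])

# Logical matrix: TRUE where the two coders disagree
diff_mat <- mat_A != mat_B

# Row and column positions of every disagreement
pos <- which(diff_mat, arr.ind = TRUE)

# One row per mismatching cell: who, which variable, and both values
mismatches <- data.frame(
  ID       = A$ID[pos[, "row"]],
  variable = vars[pos[, "col"]],
  value_A  = mat_A[pos],
  value_B  = mat_B[pos]
)
mismatches <- mismatches[order(mismatches$ID), ]
mismatches
```

Note that cells where either value is NA are not reported, because `NA != x` is NA and `which()` skips it; if missing codes matter, they would need separate handling.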

Related

Is there a way to get subdataframes with purrr in magrittr pipes workflow without using data.frame name?

That is, I was interested in doing the same as in the example, but with purrr functions.
tibble(a = 1:5, b = a * 2, c = 1) %>%
{lapply(X = names(.), FUN = function(.x) select(., 1:.x))}
[[1]]
# A tibble: 5 x 1
a
<int>
1 1
2 2
3 3
4 4
5 5
[[2]]
# A tibble: 5 x 2
a b
<int> <dbl>
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
[[3]]
# A tibble: 5 x 3
a b c
<int> <dbl> <dbl>
1 1 2 1
2 2 4 1
3 3 6 1
4 4 8 1
5 5 10 1
I could only do it by naming the data frame, foo <- tibble(a = 1:5, b = a * 2, c = 1), and calling select(foo, ...) inside map. I wanted to avoid that, because I want to keep mutating the data frame within the pipe workflow.
Thank you!

You can use map in the following way :
library(dplyr)
library(purrr)
tibble(a = 1:5, b = a * 2, c = 1) %>%
{map(names(.), function(.x) select(., 1:.x))}
Depending on your actual use case, you can also use imap, which passes each column's values (.x) along with its name (.y).
tibble(a = 1:5, b = a * 2, c = 1) %>%
imap(function(.x, .y) select(., 1:.y))
#$a
# A tibble: 5 x 1
# a
# <int>
#1 1
#2 2
#3 3
#4 4
#5 5
#$b
# A tibble: 5 x 2
# a b
# <int> <dbl>
#1 1 2
#2 2 4
#3 3 6
#4 4 8
#5 5 10
#$c
# A tibble: 5 x 3
# a b c
# <int> <dbl> <dbl>
#1 1 2 1
#2 2 4 1
#3 3 6 1
#4 4 8 1
#5 5 10 1
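Both snippets above use function(.x) rather than a ~ formula because, inside a purrr formula lambda, `.` refers to the lambda's first argument and no longer to the piped data frame. A small sketch of one way around that, assuming it is acceptable to give the piped tibble a local name inside the braces (`full` and `subframes` are made-up names):

```r
library(dplyr)
library(purrr)

subframes <- tibble(a = 1:5, b = a * 2, c = 1) %>%
  {
    full <- .  # local handle on the piped tibble
    # For i = 1, 2, 3 keep the first i columns
    map(seq_along(full), ~ full[seq_len(.x)])
  }

subframes
```

Unlike imap, this returns an unnamed list, since it iterates over column positions rather than over the columns themselves.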

Optimize computation in dplyr mutate function

Assume the following table:
library(dplyr)
library(tibble)
library(purrr)
df = tibble(
client = c(1,1,1,1,2,2,2,2),
prod_type = c(1,1,2,2,1,1,2,2),
max_prod_type = c(2,2,2,2,2,2,2,2),
value_1 = c(10,20,30,30,100,200,300,300),
value_2 = c(1,2,3,3,1,2,3,3),
)
# A tibble: 8 x 5
client prod_type max_prod_type value_1 value_2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1
2 1 1 2 20 2
3 1 2 2 30 3
4 1 2 2 30 3
5 2 1 2 100 1
6 2 1 2 200 2
7 2 2 2 300 3
8 2 2 2 300 3
The column 'max_prod_type' holds the maximum value of 'prod_type' for each 'client'. I need to compute a new column 'sum' that contains the total of 'value_1' and 'value_2', taken only over the rows where 'prod_type' == 'max_prod_type' within each 'client'.
I have tried the following code:
df %>%
mutate(
sum =
map2_dbl(
client, max_prod_type,
~case_when(
prod_type == .y~
filter(df, client == .x, prod_type == .y) %>%
mutate(sum = value_1 + value_2) %>%
select(sum) %>%
sum(),
T~NA_real_
)
)
)
The desired output is the following:
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
But it throws an error:
Error: Problem with `mutate()` input `sum`.
x Result 1 must be a single double, not a double vector of length 6
i Input `sum` is `map2_dbl(...)`.
Moreover, this implementation seems rather slow to me. I'm wondering if there is a correct and more efficient solution to this problem.
I appreciate your help!

One option could be:
df %>%
group_by(client) %>%
# %in% (rather than ==) avoids relying on vector recycling when several rows tie for the maximum
mutate(res = row_number() %in% which(value_1 == max(value_1)),
res = if_else(res, sum(value_1[res]) + sum(value_2[res]), NA_real_))
client prod_type max_prod_type value_1 value_2 res
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606

I think this is closer to what you want (note that it adds value_1 and value_2 row by row, so it does not total them per client):
df %>%
mutate(sum = case_when(prod_type == max_prod_type ~ value_1 + value_2,
TRUE ~ NA_real_))
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 33
4 1 2 2 30 3 33
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 303
8 2 2 2 300 3 303
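The desired output in the question repeats a per-client total (66 and 606) on every row where prod_type equals max_prod_type. A minimal sketch of one way to get exactly that with a grouped mutate, using the condition stated in the question (the helper column `is_max` and the result name `out` are made up for this illustration):

```r
library(dplyr)

df <- tibble(
  client        = c(1, 1, 1, 1, 2, 2, 2, 2),
  prod_type     = c(1, 1, 2, 2, 1, 1, 2, 2),
  max_prod_type = c(2, 2, 2, 2, 2, 2, 2, 2),
  value_1       = c(10, 20, 30, 30, 100, 200, 300, 300),
  value_2       = c(1, 2, 3, 3, 1, 2, 3, 3)
)

out <- df %>%
  group_by(client) %>%
  mutate(
    # Rows of this client whose prod_type is the client's maximum
    is_max = prod_type == max_prod_type,
    # Per-client total over those rows, written back onto those rows only
    sum = if_else(is_max, sum((value_1 + value_2)[is_max]), NA_real_)
  ) %>%
  ungroup() %>%
  select(-is_max)

out
```

Because everything is vectorised within each group, this avoids filtering the whole data frame once per row, which is what made the map2_dbl attempt slow.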

R Tidyverse - Randomize by ID

I have a df like this one:
id <- c(1,1,2,2,3,3,4,4,5,5)
v1 <- c(3,1,2,3,4,5,6,1,5,4)
pos <- c(1,2,1,2,1,2,1,2,1,2)
df <- data.frame(id,v1,pos)
How can I "randomize" the values of v1 while keeping the original order of the "id" variable and the values of "pos", so that I get a df with randomized values like this:
id v1 pos
1 1 1
1 3 2
2 2 1
2 3 2
3 5 1
3 4 2
4 6 1
4 1 2
5 5 1
5 4 2
Above is an example of the resulting df, with id and pos staying as originally created and v1 randomized.
Thanks!

Is sample what you're looking for?
library(dplyr)
df %>%
group_by(id) %>%
mutate(v1 = sample(v1, size = length(v1)))
# A tibble: 10 x 3
# Groups: id [5]
id v1 pos
<dbl> <dbl> <dbl>
1 1 3 1
2 1 1 2
3 2 3 1
4 2 2 2
5 3 4 1
6 3 5 2
7 4 1 1
8 4 6 2
9 5 5 1
10 5 4 2
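One caveat with sample(v1): when a group has a single row and its value is a number greater than 1, sample() draws from 1:v1 instead of returning the value itself. A hedged sketch of a variant that permutes row positions instead, with an arbitrary seed so the shuffle can be reproduced (the seed value and the name `shuffled` are assumptions for this illustration):

```r
library(dplyr)

df <- data.frame(
  id  = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  v1  = c(3, 1, 2, 3, 4, 5, 6, 1, 5, 4),
  pos = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

set.seed(42)  # arbitrary seed, only so the result can be reproduced

shuffled <- df %>%
  group_by(id) %>%
  # Permute the row positions within each id, then reorder v1 accordingly;
  # id and pos are left untouched
  mutate(v1 = v1[sample.int(n())]) %>%
  ungroup()

shuffled
```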

Create panel data from merged dataset

I have several merged data frames for baseline and endline data. The variable names therefore have .x and .y appended for baseline and endline, respectively. The data frames were merged by "Name". My data frames look something like this:
Name v1.x v2.x v3.x v1.y v2.y v3.y
a 1 2 5 3 4 6
b 4 5 3 5 3 5
and so on
I want to convert this to panel data so that it looks like this:
Name v1 v2 v3
a 1 2 5
a 3 4 6
b 4 5 3
b 5 3 5
I have a large amount of data across various merged data frames that I'd like to convert to panel data. How do I go about doing this?
Sample data (values such as "2 3" are a single multi-response entry):
Name  gen_dq_1.1.x  gen_dq_1.1_1.x
a     2             0
b     2 3           1
c     2 4           1
d     1             0
e     1 2 3         1
f     2 3           0
g     1             0
h     2 4           0
i     1 3           1
j     1 2           1
k     2 3           0
l     3 4           0

Does this work:
library(tidyr)
library(dplyr)
df %>% pivot_longer(cols = -Name, names_to = '.value', names_pattern = '(v[0-9])')
# A tibble: 4 x 4
Name v1 v2 v3
<chr> <dbl> <dbl> <dbl>
1 a 1 2 5
2 a 3 4 6
3 b 4 5 3
4 b 5 3 5
Data used:
df
# A tibble: 2 x 7
Name v1.x v2.x v3.x v1.y v2.y v3.y
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 5 3 4 6
2 b 4 5 3 5 3 5
Updated answer:
df %>% pivot_longer(!Name, names_to = '.value', names_pattern = '(.*)(?=\\.[xy])')
# A tibble: 4 x 6
Name v1 v2 v3 gen_dq_1.1 gen_dq_1.1_1
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 a 1 2 5 2 0
2 a 3 4 6 2 0
3 b 4 5 3 2 3 1
4 b 5 3 5 2 3 1
Data used:
df
# A tibble: 2 x 11
Name v1.x v2.x v3.x v1.y v2.y v3.y gen_dq_1.1.x gen_dq_1.1.y gen_dq_1.1_1.x gen_dq_1.1_1.y
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 a 1 2 5 3 4 6 2 2 0 0
2 b 4 5 3 5 3 5 2 3 2 3 1 1
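With the calls above, the two rows per Name can no longer be told apart as baseline or endline. A sketch of a variant that keeps the suffix as its own column, assuming every column other than Name ends in .x or .y (the column name `wave` and its labels are made up for this illustration):

```r
library(tidyr)
library(dplyr)

df <- tibble(
  Name = c("a", "b"),
  v1.x = c(1, 4), v2.x = c(2, 5), v3.x = c(5, 3),
  v1.y = c(3, 5), v2.y = c(4, 3), v3.y = c(6, 5)
)

panel <- df %>%
  pivot_longer(
    cols = -Name,
    # First capture group becomes the output column name (.value),
    # second capture group (x or y) goes into the new "wave" column
    names_to = c(".value", "wave"),
    names_pattern = "(.*)\\.([xy])$"
  ) %>%
  mutate(wave = if_else(wave == "x", "baseline", "endline"))

panel
```

The greedy `(.*)` also handles names that contain dots themselves, such as gen_dq_1.1.x, because only the final `.x` or `.y` is split off.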

repeat list in to a data frame in R

I have a list (a numeric vector), let's say
k<-c(1,2,3,4)
I want to create a data frame with, let's say, 5 rows, using the same list in each row, as shown below.
X1 X2 X3 X4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
I tried:
> rep(k, each = 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
However, I am not able to get the intended result. Any suggestions?

data.frame(t(replicate(5, k)))
#OR
data.frame(matrix(rep(k, each = 5), 5))
#OR
data.frame(t(sapply(1:5, function(x) k)))
# X1 X2 X3 X4
#1 1 2 3 4
#2 1 2 3 4
#3 1 2 3 4
#4 1 2 3 4
#5 1 2 3 4

Here is one option: convert the vector to a list with as.list, change it to a data.frame (as.data.frame), and replicate the rows
as.data.frame(as.list(k))[rep(1, 5),]
# X1 X2 X3 X4
#1 1 2 3 4
#1.1 1 2 3 4
#1.2 1 2 3 4
#1.3 1 2 3 4
#1.4 1 2 3 4
Or another option is to take the transpose of the vector to get a row matrix, replicate the rows and convert to data.frame
as.data.frame(t(k)[rep(1, 5),])
In tidyverse, one option is to convert to tibble and then uncount
library(dplyr)
library(tidyr)
library(stringr)
library(purrr) # provides set_names()
as.list(k) %>%
set_names(str_c("X", seq_along(k))) %>%
as_tibble %>%
uncount(5)
# A tibble: 5 x 4
# X1 X2 X3 X4
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 4
#2 1 2 3 4
#3 1 2 3 4
#4 1 2 3 4
#5 1 2 3 4

purrr::map_dfc(k, rep, 5)
# # A tibble: 5 x 4
# V1 V2 V3 V4
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 3 4
# 2 1 2 3 4
# 3 1 2 3 4
# 4 1 2 3 4
# 5 1 2 3 4

Using data.table:
library(data.table)
k = c(1,2,3,4)
n = 5 # Number of rows
# One column per element of k, each repeated n times
df = as.data.table(lapply(k, rep, n))
> df
V1 V2 V3 V4
1: 1 2 3 4
2: 1 2 3 4
3: 1 2 3 4
4: 1 2 3 4
5: 1 2 3 4
