Create panel data from merged dataset - r

I have several merged data frames for baseline and endline data. The variable names are therefore suffixed with .x and .y for baseline and endline, respectively. The data frames were merged by "Name". My data frames look something like this:
Name v1.x v2.x v3.x v1.y v2.y v3.y
a 1 2 5 3 4 6
b 4 5 3 5 3 5
and so on
I want to convert this to panel data so that it looks like this:
Name v1 v2 v3
a 1 2 5
a 3 4 6
b 4 5 3
b 5 3 5
I have a large amount of data across various merged data frames that I'd like to convert to panel data. How do I go about doing this?
Sample data:
Name gen_dq_1.1.x gen_dq_1.1_1.x
a 2 0
b 2 3 1
c 2 4 1
d 1 0
e 1 2 3 1
f 2 3 0
g 1 0
h 2 4 0
i 1 3 1
j 1 2 1
k 2 3 0
l 3 4 0

Does this work:
library(tidyr)
library(dplyr)
df %>% pivot_longer(cols = -Name, names_to = '.value', names_pattern = '(v[0-9])')
# A tibble: 4 x 4
Name v1 v2 v3
<chr> <dbl> <dbl> <dbl>
1 a 1 2 5
2 a 3 4 6
3 b 4 5 3
4 b 5 3 5
Data used:
df
# A tibble: 2 x 7
Name v1.x v2.x v3.x v1.y v2.y v3.y
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 5 3 4 6
2 b 4 5 3 5 3 5
Updated answer:
df %>% pivot_longer(!Name, names_to = '.value', names_pattern = '(.*)(?=\\.[xy])')
# A tibble: 4 x 6
Name v1 v2 v3 gen_dq_1.1 gen_dq_1.1_1
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 a 1 2 5 2 0
2 a 3 4 6 2 0
3 b 4 5 3 2 3 1
4 b 5 3 5 2 3 1
Data used:
df
# A tibble: 2 x 11
Name v1.x v2.x v3.x v1.y v2.y v3.y gen_dq_1.1.x gen_dq_1.1.y gen_dq_1.1_1.x gen_dq_1.1_1.y
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 a 1 2 5 3 4 6 2 2 0 0
2 b 4 5 3 5 3 5 2 3 2 3 1 1
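For panel data it is often useful to keep track of which wave each row came from. The following is a minimal sketch, assuming the suffixes are always .x (baseline) and .y (endline); the column name wave and the small example data frame are hypothetical and not part of the original answer. A second capture group in names_pattern sends the suffix to its own column.

```r
library(tidyr)

# Small stand-in for a merged data frame (hypothetical values).
df <- data.frame(
  Name = c("a", "b"),
  v1.x = c(1, 4), v2.x = c(2, 5),
  v1.y = c(3, 5), v2.y = c(4, 3)
)

# First group -> the variable name (".value"), second group -> the suffix.
panel <- pivot_longer(
  df,
  cols = -Name,
  names_to = c(".value", "wave"),
  names_pattern = "(.*)\\.([xy])$"
)

# Relabel the suffix; "wave" is an illustrative name.
panel$wave <- ifelse(panel$wave == "x", "baseline", "endline")
print(panel)
```

Because the first group is greedy, names that contain dots themselves, such as gen_dq_1.1.x, are split at the final .x or .y.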

Related

Use dynamically generated column names in dplyr

I have a data frame with multiple columns. The user provides a vector of column names, and for each row I want to count the maximum number of times any single value appears across those columns:
library(dplyr)
set.seed(42)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var1", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(c(var1,var3)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
This does exactly what I want, but when I try to use a vector of variable names I can't get it to work:
df %>%
rowwise() %>%
mutate(consensus=max(unlist(table(select_vars))))
You can wrap it in c(!!!syms()) to get it working, and apparently you don't need unlist. Honestly, though, I'm not sure what you are trying to do or why table is needed here. Do you just want 2 when var2 and var3 hold the same value and 1 otherwise?
library(dplyr)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var2", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(c(!!!syms(select_vars)))))
#> # A tibble: 10 x 4
#> # Rowwise:
#> var1 var2 var3 consensus
#> <int> <int> <int> <int>
#> 1 2 3 2 1
#> 2 3 1 3 1
#> 3 3 1 1 2
#> 4 3 3 3 2
#> 5 1 1 2 1
#> 6 2 1 3 1
#> 7 3 2 3 1
#> 8 1 2 3 1
#> 9 2 1 2 1
#> 10 2 1 1 2
Created on 2021-07-22 by the reprex package (v0.3.0)
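To see why the c(!!!syms()) wrapper works, it can help to look at the expression it builds before mutate evaluates it. This is a small sketch using rlang directly; the variable name expanded is only for illustration.

```r
library(rlang)

select_vars <- c("var2", "var3")

# syms() turns each string into a symbol; !!! splices them in as
# separate arguments, so the call refers to the columns themselves.
expanded <- expr(c(!!!syms(select_vars)))
print(expanded)
```

Inside mutate, the spliced symbols are then looked up as columns of the current row, which is what the string vector alone could not do.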
In the OP's code, we need select to extract the columns named in select_vars:
library(dplyr)
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(select(cur_data(), select_vars))) ))
-output
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or subset cur_data() directly; it returns the data for the current group, which here is a single row:
df %>%
rowwise %>%
mutate(consensus = max(table(unlist(cur_data()[select_vars]))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
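In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). A minimal sketch of the same rowwise idea with pick() follows, assuming dplyr >= 1.1.0 is installed; the three-row data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  var1 = c(1, 1, 2),
  var2 = c(1, 2, 3),
  var3 = c(1, 3, 2)
)
select_vars <- c("var1", "var3")

result <- df %>%
  rowwise() %>%
  # pick() returns the selected columns of the current row as a tibble.
  mutate(consensus = max(table(unlist(pick(all_of(select_vars)))))) %>%
  ungroup()
print(result)
```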
Or using pmap
library(purrr)
df %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
# A tibble: 10 x 4
var1 var2 var3 consensus
<int> <int> <int> <dbl>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
As these are rowwise operations, we can gain some efficiency by using collapse functions:
library(collapse)
tfm(df, consensus = dapply(slt(df, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
# A tibble: 10 x 4
var1 var2 var3 consensus
* <int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Benchmarks
As noted above, collapse is faster (run on a larger dataset of 1e6 rows):
df1 <- df[rep(seq_len(nrow(df)), 1e5), ]
system.time({
tfm(df1, consensus = dapply(slt(df1, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
})
#user system elapsed
# 5.257 0.123 5.323
system.time({
df1 %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
})
#user system elapsed
# 54.813 0.517 55.246
The rowwise operation was taking too long, so I stopped the execution:
system.time({
df1 %>%
rowwise() %>%
mutate(consensus=max(table(unlist(select(cur_data(), select_vars))) ))
})
Timing stopped at: 575.5 3.342 581.3
all_of is a selection helper, not a verb, so it only works inside a selecting function such as c_across:
# select_vars <- c("var1", "var3"), as in the question
df %>%
rowwise() %>%
mutate(consensus=max(table(c_across(all_of(select_vars)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 2 3 3 1
2 2 2 2 2
3 1 2 2 1
4 2 3 3 1
5 1 2 1 2
6 2 1 2 2
7 2 2 2 2
8 3 1 2 1
9 2 1 3 1
10 3 2 1 1

How to conditionally wrangle df from long to wide while modifying names in r?

I have a longitudinal df similar to the one below, with one row per participant (id) per visit (visit). The same 3 variables are recorded at each visit. I want one row per participant, with the values spread into wide format and each new variable name made of the original variable name with the visit name appended.
I'll have to repeat this many times, so I would like to avoid renaming columns manually after the fact. Any ideas?
I have tried dcast() but can't get my desired result. I think pivot_wider() may have a role here, but I can't figure it out.
# CURRENT:
# A tibble: 12 x 5
id visit var1 var2 var3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 v1 1 1 1
2 1 v2 1 2 1
3 1 v3 2 2 1
4 2 v1 1 1 1
5 2 v2 1 2 1
6 2 v3 2 2 1
7 2 v4 2 2 2
8 3 v1 1 1 1
9 3 v2 1 2 1
10 3 v3 2 3 1
11 3 v4 2 3 2
12 3 v5 3 3 3
# DESIRED
# A tibble: 3 x 16
id var1_v1 var1_v2 var1_v3 var1_v4 var1_v5 var2_v1 var2_v2 var2_v3 var2_v4 var2_v5 var3_v1 var3_v2 var3_v3 var3_v4 var3_v5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 NA NA 1 2 2 NA NA 1 1 1 NA NA
2 2 1 1 2 2 NA 1 2 2 2 NA 1 1 1 2 NA
3 3 1 1 2 2 3 1 2 3 3 3 1 1 1 2 3
Using pivot_wider:
tidyr::pivot_wider(df, names_from = visit, values_from = starts_with('var'))
# id var1_v1 var1_v2 var1_v3 var1_v4 var1_v5 var2_v1 var2_v2 var2_v3 var2_v4
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 1 2 NA NA 1 2 2 NA
#2 2 1 1 2 2 NA 1 2 2 2
#3 3 1 1 2 2 3 1 2 3 3
# … with 6 more variables: var2_v5 <int>, var3_v1 <int>, var3_v2 <int>,
# var3_v3 <int>, var3_v4 <int>, var3_v5 <int>
In data.table, using dcast:
library(data.table)
dcast(setDT(df), id~visit, value.var = grep('^var', names(df), value = TRUE))
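In base R, you could use reshape (converting to a plain data frame first, since reshape does not always behave well with tibbles):
reshape(as.data.frame(df), idvar = "id", timevar = "visit", direction = "wide", sep = "_")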
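If the default naming ever needs to change, pivot_wider also accepts a names_glue template. The sketch below spells out the variable_visit pattern explicitly on a small made-up data frame; the data and the object name wide are illustrative only.

```r
library(tidyr)

# Hypothetical long data: id 1 has two visits, id 2 has three.
df <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  visit = c("v1", "v2", "v1", "v2", "v3"),
  var1  = c(1, 1, 1, 1, 2),
  var2  = c(1, 2, 1, 2, 2)
)

wide <- pivot_wider(
  df,
  names_from  = visit,
  values_from = c(var1, var2),
  # {.value} is the source column, {visit} the visit label.
  names_glue  = "{.value}_{visit}"
)
print(wide)
```

Swapping the template to "{visit}_{.value}" would put the visit label first, with no manual renaming afterwards. Visits that a participant did not attend are filled with NA.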

Optimize computation in dplyr mutate function

Assume the following table:
library(dplyr)
library(tibble)
library(purrr)
df = tibble(
client = c(1,1,1,1,2,2,2,2),
prod_type = c(1,1,2,2,1,1,2,2),
max_prod_type = c(2,2,2,2,2,2,2,2),
value_1 = c(10,20,30,30,100,200,300,300),
value_2 = c(1,2,3,3,1,2,3,3),
)
# A tibble: 8 x 5
client prod_type max_prod_type value_1 value_2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1
2 1 1 2 20 2
3 1 2 2 30 3
4 1 2 2 30 3
5 2 1 2 100 1
6 2 1 2 200 2
7 2 2 2 300 3
8 2 2 2 300 3
The column 'max_prod_type' holds the maximum value of 'prod_type' for each 'client'. I need to compute a new column 'sum' containing the total of 'value_1' and 'value_2', taken only over the rows where 'prod_type' == 'max_prod_type' within each 'client'.
I have tried the following code:
df %>%
mutate(
sum =
map2_dbl(
client, max_prod_type,
~case_when(
prod_type == .y~
filter(df, client == .x, prod_type == .y) %>%
mutate(sum = value_1 + value_2) %>%
select(sum) %>%
sum(),
T~NA_real_
)
)
)
The desired output is the following:
# A tibble: 8 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
But it throws an error:
Error: Problem with `mutate()` input `sum`.
x Result 1 must be a single double, not a double vector of length 6
i Input `sum` is `map2_dbl(...)`.
Moreover, this implementation seems somewhat slow to me. Is there a correct and more efficient solution to this problem?
Appreciate your help!
One option could be:
df %>%
group_by(client) %>%
mutate(res = prod_type == max_prod_type,
res = if_else(res, sum(value_1[res]) + sum(value_2[res]), NA_real_))
client prod_type max_prod_type value_1 value_2 res
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 66
4 1 2 2 30 3 66
5 2 1 2 100 1 NA
6 2 1 2 200 2 NA
7 2 2 2 300 3 606
8 2 2 2 300 3 606
I think this is closer to what you want:
df %>%
mutate(sum = case_when(prod_type == max_prod_type ~ value_1 + value_2,
TRUE ~ NA_real_))
# A tibble: 6 x 6
client prod_type max_prod_type value_1 value_2 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 10 1 NA
2 1 1 2 20 2 NA
3 1 2 2 30 3 33
4 2 1 2 100 1 NA
5 2 1 2 200 2 NA
6 2 2 2 300 3 303
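To reproduce the grouped totals from the question (66 and 606), the sum has to be taken within each client over the rows that match the maximum product type. A minimal sketch under that reading is given below; the helper column is_max is a hypothetical name and is dropped at the end.

```r
library(dplyr)

df <- tibble(
  client        = c(1, 1, 1, 1, 2, 2, 2, 2),
  prod_type     = c(1, 1, 2, 2, 1, 1, 2, 2),
  max_prod_type = c(2, 2, 2, 2, 2, 2, 2, 2),
  value_1       = c(10, 20, 30, 30, 100, 200, 300, 300),
  value_2       = c(1, 2, 3, 3, 1, 2, 3, 3)
)

result <- df %>%
  group_by(client) %>%
  mutate(
    # Flag the rows of interest once per group.
    is_max = prod_type == max_prod_type,
    # One group-level total, written only to the flagged rows.
    sum = if_else(is_max, sum(value_1[is_max] + value_2[is_max]), NA_real_)
  ) %>%
  ungroup() %>%
  select(-is_max)
print(result)
```

Everything here is vectorised within groups, so no map2 call or per-row filter is needed.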

Add corresponding columns of data frame in R

Here's the example data I have.
df1 <- tibble(a=1:4,
b=1:4,
c=1:4,
d=1:4,
e=1:4)
# A tibble: 4 x 5
a b c d e
<int> <int> <int> <int> <int>
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
df2 <- tibble(b=1:4,
d=1:4,
e=1:4)
b d e
<int> <int> <int>
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
I would like to add the columns in common so that I can get a data frame like this
a b c d e
<int> <dbl> <int> <dbl> <dbl>
1 1 2 1 2 2
2 2 4 2 4 4
3 3 6 3 6 6
4 4 8 4 8 8
Is there an easy way to do this in R with tools like dplyr?
An easier option is to subset the first dataset 'df1' by the column names of 'df2' (assuming all the columns in 'df2' are present in 'df1'), add the two, and assign the result back to those columns in 'df1':
df1[names(df2)] <- df1[names(df2)] + df2
Or using dplyr
library(dplyr)
df1 %>%
mutate(c_across(names(df2)) + df2)
-output
# A tibble: 4 x 5
# a b c d e
# <int> <int> <int> <int> <int>
#1 1 2 1 2 2
#2 2 4 2 4 4
#3 3 6 3 6 6
#4 4 8 4 8 8
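Another way to write the dplyr version is with across() and cur_column(), which adds each shared column to its namesake in df2. This is a sketch that assumes both data frames have the same number of rows in the same order; the names common and result are illustrative.

```r
library(dplyr)

df1 <- tibble(a = 1:4, b = 1:4, c = 1:4, d = 1:4, e = 1:4)
df2 <- tibble(b = 1:4, d = 1:4, e = 1:4)

# Only touch columns that exist in both data frames.
common <- intersect(names(df1), names(df2))

result <- df1 %>%
  # cur_column() gives the name of the column currently being processed.
  mutate(across(all_of(common), ~ .x + df2[[cur_column()]]))
print(result)
```

Using intersect() means that extra columns in df2 that are missing from df1 are ignored rather than causing an error.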

comparing dataframe - percentage of mismatched cases

I have two people coding the same variables for a psychological test (more than 400 variables), and I need to compare the datasets. I need two results:
1. I need to see only the specific cases with a mismatch.
2. As the final result, I need the percentage of mismatch per variable.
Here is what I mean by the "percentage of mismatch per variable":
A <- tibble( ID = c(1:10),
v1 = rep(1),
v2 = rep(2),
v3 = rep(3))
B <- tibble( ID = c(1:10),
v1 = c(1,1,1,1,10,1,1,1,1,1),
v2 = c(30,2,2,2,51,2,2,2,2,40),
v3 = c(3,3,3,3,3,3,3,65,3,90))
A;B
# A tibble: 10 x 4
ID v1 v2 v3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 1 2 3
4 4 1 2 3
5 5 1 2 3
6 6 1 2 3
7 7 1 2 3
8 8 1 2 3
9 9 1 2 3
10 10 1 2 3
# A tibble: 10 x 4
ID v1 v2 v3
<int> <dbl> <dbl> <dbl>
1 1 1 30 3
2 2 1 2 3
3 3 1 2 3
4 4 1 2 3
5 5 10 51 3
6 6 1 2 3
7 7 1 2 3
8 8 1 2 65
9 9 1 2 3
10 10 1 40 90
How do I compare dataset A with dataset B to get a result like this:
result<- tibble(variables = c("v1", "v2", "v3"),
n.mismatch = c(1,3,2),
percentage.mismatch = c(0.10, 0.30, 0.20))
result
# A tibble: 3 x 3
variables n.mismatch percentage.mismatch
<chr> <dbl> <dbl>
1 v1 1 0.1
2 v2 3 0.3
3 v3 2 0.2
We can use Map to compare the column values.
as.data.frame(Map(function(x, y) {
inds <- x != y
c(n.mismatch = sum(inds), percentage.mismatch = mean(inds))
}, A[-1], B[-1]))
# v1 v2 v3
#n.mismatch 1.0 3.0 2.0
#percentage.mismatch 0.1 0.3 0.2
Similarly, in tidyverse, we can use map2
purrr::map2_df(A[-1], B[-1], ~{
inds = .x != .y
tibble::tibble(n.mismatch = sum(inds), percentage.mismatch = mean(inds))
}, .id = "variables")
# variables n.mismatch percentage.mismatch
# <chr> <int> <dbl>
#1 v1 1 0.1
#2 v2 3 0.3
#3 v3 2 0.2
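For the first requirement, listing only the specific mismatched cases, one possible base R sketch is shown below. It assumes both datasets contain the same IDs in the same row order; the column names coder_A and coder_B and the object mismatches are hypothetical.

```r
A <- data.frame(ID = 1:10, v1 = 1, v2 = 2, v3 = 3)
B <- data.frame(
  ID = 1:10,
  v1 = c(1, 1, 1, 1, 10, 1, 1, 1, 1, 1),
  v2 = c(30, 2, 2, 2, 51, 2, 2, 2, 2, 40),
  v3 = c(3, 3, 3, 3, 3, 3, 3, 65, 3, 90)
)

# The comparison is positional, so the rows must line up by ID.
stopifnot(identical(A$ID, B$ID))

vars  <- setdiff(names(A), "ID")
a_mat <- as.matrix(A[vars])
b_mat <- as.matrix(B[vars])

# Row/column positions of every cell where the coders disagree.
idx <- which(a_mat != b_mat, arr.ind = TRUE)

mismatches <- data.frame(
  ID       = A$ID[idx[, "row"]],
  variable = vars[idx[, "col"]],
  coder_A  = a_mat[idx],
  coder_B  = b_mat[idx],
  row.names = NULL
)
print(mismatches)
```

Each row of mismatches is one disagreement, so counting rows per variable gives the same n.mismatch values as the summaries above.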
