Making data wide where row names are not identical - r

I am trying to get a model results table into wide format. Since names are not the same on outcomes (dv variables), NA's show up in the table and I can't find a way to have one row per variable.
I need one row per variable/dv. Model 1 and 3 share all variables other than one.
Data:
table <- data.frame(variable=c("intercept", "a", "b", "intercept", "c", "intercept", "a", "e", "intercept", "c"),
b=c(1.2, 0.1, 0.4, 0.3, 0.9, 1.3, 2, .23, .4, .7),
p=(abs(rnorm(10, 0, .3))),
model=c(1,1,1,2,2,3,3,3,4,4),
dv=c(rep("dv1", 5), rep("dv2", 5)))
> table
variable b p model dv
1 intercept 1.20 0.03320481 1 dv1
2 a 0.10 0.16675234 1 dv1
3 b 0.40 0.53607394 1 dv1
4 intercept 0.30 0.14935514 2 dv1
5 c 0.90 0.58998515 2 dv1
6 intercept 1.30 0.21040677 3 dv2
7 a 2.00 0.14183742 3 dv2
8 e 0.23 0.32034711 3 dv2
9 intercept 0.40 0.06539247 4 dv2
10 c 0.70 0.30780133 4 dv2
Code:
table %>%
gather(key, value, b, p) %>% unite("stat_var", dv, key, sep=".") %>%
spread(stat_var, value) %>%
arrange(model, desc(variable))
Output:
variable model dv1.b dv1.p dv2.b dv2.p
1 intercept 1 1.2 0.21866737 NA NA
2 b 1 0.4 0.50600799 NA NA
3 a 1 0.1 0.18751178 NA NA
4 intercept 2 0.3 0.25133611 NA NA
5 c 2 0.9 0.04601194 NA NA
6 intercept 3 NA NA 1.30 0.34144108
7 e 3 NA NA 0.23 0.12793927
8 a 3 NA NA 2.00 0.37614448
9 intercept 4 NA NA 0.40 0.08852144
10 c 4 NA NA 0.70 0.26853770
Looking for:

I will set aside whether it makes sense to treat models as "equivalent" across outcomes (I don't see a valid reason for it); that part is handled with mutate(). Looking only at the table manipulation, here is a basic option that gets your desired output:
The code below yields names such as b_dv1; if you would rather have b and p as suffixes, you can use pivot_wider_spec() to set the names.
require(tidyverse)
table %>%
mutate(model = case_when(model == 3 ~ 1,
model == 4 ~ 2,
TRUE ~ model)) %>%
pivot_wider(names_from = dv, values_from = c("b", "p")) %>%
select(variable,
model,
ends_with("dv1"),
ends_with("dv2"))
# A tibble: 6 x 6
# variable model b_dv1 p_dv1 b_dv2 p_dv2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 intercept 1 1.2 0.318 1.3 0.200
# 2 a 1 0.1 0.120 2 0.419
# 3 b 1 0.4 0.309 NA NA
# 4 intercept 2 0.3 0.350 0.4 0.0148
# 5 c 2 0.9 0.185 0.7 0.530
# 6 e 1 NA NA 0.23 0.174

As I stated in my comment, it seems that your expected output is wrong.
However, you can reproduce it by tweaking the model variable:
table %>%
select(model, everything()) %>%
mutate(model=ifelse(model>2, model-2, model)) %>%
pivot_longer(c(b, p)) %>%
unite("name", c("dv", "name")) %>%
pivot_wider()
# # A tibble: 6 x 6
# model variable dv1_b dv1_p dv2_b dv2_p
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 intercept 1.2 0.193 1.3 0.160
# 2 1 a 0.1 0.650 2 0.476
# 3 1 b 0.4 0.190 NA NA
# 4 2 intercept 0.3 0.0435 0.4 0.145
# 5 2 c 0.9 0.372 0.7 0.243
# 6 1 e NA NA 0.23 0.297
Of note, gather() and spread() are superseded by the pivoting functions, which offer very nice improvements (although those are not used here).
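As a possible shortcut to the pivot_longer()/unite()/pivot_wider() chain above, here is a minimal sketch that builds the dv1_b-style names in a single pivot_wider() call. It assumes tidyr 1.2.0 or later (for names_vary); the seed and the object names results and wide are my own, not from the question.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumed seed, only so the p column is reproducible
results <- data.frame(
  variable = c("intercept", "a", "b", "intercept", "c",
               "intercept", "a", "e", "intercept", "c"),
  b = c(1.2, 0.1, 0.4, 0.3, 0.9, 1.3, 2, 0.23, 0.4, 0.7),
  p = abs(rnorm(10, 0, 0.3)),
  model = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  dv = c(rep("dv1", 5), rep("dv2", 5))
)

wide <- results %>%
  # same tweak as in the answers: map models 3 and 4 onto 1 and 2
  mutate(model = ifelse(model > 2, model - 2, model)) %>%
  pivot_wider(
    names_from = dv,
    values_from = c(b, p),
    names_glue = "{dv}_{.value}",  # dv first, statistic as suffix
    names_vary = "slowest"         # keep the columns of each dv together
  )

wide
```

Variables that appear under only one dv (b and e here) still get NA in the other dv's columns, which is expected.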

Related

forcats::fct_lump_prop() behavior

I have been struggling using the {forcats} fct_lump_prop() function, more specifically the use of its w = and prop = arguments. What exactly is it supposed to represent with the example below :
df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))
# A tibble: 6 x 2
var1 var2
<chr> <dbl>
1 a 0.2
2 a 0.3
3 a 0.2
4 b 0.1
5 b 0.1
6 b 0.1
What would be the prop value to set so that a is kept intact, but b is lumped as Others ?
# A tibble: 6 x 3
var1 var2 fct
<chr> <dbl> <fct>
1 a 0.2 a
2 a 0.3 a
3 a 0.2 a
4 b 0.1 Others
5 b 0.1 Others
6 b 0.1 Others
I have tried empirically but can't find how this works. Intuitively I would say that setting a prop value above 0.3 (sum of the b weights) and below 0.7 (sum of the a weights) would lump the b level, but this doesn't lump anything. The only threshold I found was prop = 0.7, and then everything becomes an Other level...
df %>%
mutate(fct = fct_lump_prop(var1, w = var2, prop = 0.7))
# A tibble: 6 x 3
var1 var2 fct
<fct> <dbl> <fct>
1 a 0.2 Other
2 a 0.3 Other
3 a 0.2 Other
4 b 0.1 Other
5 b 0.1 Other
6 b 0.1 Other
So what am I not understanding ? The little examples I found elsewhere did not help me grasp the prop and w behavior.
Thanks a lot.
Relevant explanation is in this line of the source of fct_lump_prop()
if (prop > 0 && sum(prop_n <= prop) <= 1) {
return(f) # No lumping needed
}
That is, if only one level would end up in "Other", the function does nothing.
library(forcats)
library(dplyr)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3),rep("c", 3)),
var2 = (1:9)/(45))
df %>% group_by(var1) %>% summarise(sum(var2))
#> # A tibble: 3 × 2
#> var1 `sum(var2)`
#> <chr> <dbl>
#> 1 a 0.133
#> 2 b 0.333
#> 3 c 0.533
df %>%
mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.45))
#> # A tibble: 9 × 3
#> var1 var2 fct
#> <chr> <dbl> <fct>
#> 1 a 0.0222 Other
#> 2 a 0.0444 Other
#> 3 a 0.0667 Other
#> 4 b 0.0889 Other
#> 5 b 0.111 Other
#> 6 b 0.133 Other
#> 7 c 0.156 c
#> 8 c 0.178 c
#> 9 c 0.2 c
df %>%
mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.25))
#> # A tibble: 9 × 3
#> var1 var2 fct
#> <chr> <dbl> <fct>
#> 1 a 0.0222 a
#> 2 a 0.0444 a
#> 3 a 0.0667 a
#> 4 b 0.0889 b
#> 5 b 0.111 b
#> 6 b 0.133 b
#> 7 c 0.156 c
#> 8 c 0.178 c
#> 9 c 0.2 c
Created on 2022-11-14 with reprex v2.0.2
It doesn't appear to be documented except in the source code itself.
Edit
There is also an issue on GitHub because the man page says that levels are lumped if they appear "fewer than" prop times, but they are also lumped if they appear "exactly" prop times.
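For the original two-level example, where only b would be lumped and fct_lump_prop() therefore leaves the factor alone, one possible workaround is to compute the weighted shares yourself and pass the small levels to fct_other(). This is only a sketch: the 0.5 threshold and the names share and small are my own choices.

```r
library(forcats)

var1 <- factor(c(rep("a", 3), rep("b", 3)))
var2 <- c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1)

# Weighted share of each level (a = 0.7, b = 0.3 here)
share <- tapply(var2, var1, sum) / sum(var2)

# Levels whose share is at or below the chosen threshold
small <- names(share)[share <= 0.5]

# fct_other() collapses the dropped levels even when there is only one of them
fct <- fct_other(var1, drop = small, other_level = "Others")
fct
```

If no level falls below the threshold, small is empty, so you may want to skip the fct_other() call in that case.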

Function that compares one column values against all other column values and returns matching one in R

So let's say I have two data frames
df1 <- data.frame(n = rep(c(0, 1, 2, 3, 4), times = 2), nn = c(rep(x = 1, 5), rep(x = 2, 5)),
y = rnorm(10), z = rnorm(10))
df2 <- data.frame(x = rnorm(20))
Here is the first df:
> head(df1)
n nn y z
1 0 1 1.5683647 0.48934096
2 1 1 1.2967556 -0.77891030
3 2 1 -0.2375963 1.74355935
4 3 1 -1.2241501 -0.07838729
5 4 1 -0.3278127 -0.97555379
6 0 2 -2.4124503 0.07065982
Here is the second df:
x
1 -0.4884289
2 0.9362939
3 -1.0624084
4 -0.9838209
5 0.4242479
6 -0.4513135
I'd like to subtract the x column values of df2 from the z column values of df1, and return the rows of both data frames for which the difference is approximately equal to the y value of df1.
Is there a way to construct such a function, so that I can specify the tolerance within which the values should be considered equal?
To be clear, I'd like to subtract all x values from all z values, then compare each difference to the y column value of df1 and check whether it approximately matches y.
Here's an approach where I match every row of df1 with every row of df2, then subtract x and y from z (as implied by your logic of comparing z-x to y; this is the same as comparing z-x-y to zero). Finally, I look at each row of df1 and keep the match with the lowest absolute difference.
library(dplyr)
left_join(
df1 %>% mutate(dummy = 1, row = row_number()),
df2 %>% mutate(dummy = 1, row = row_number()), by = "dummy") %>%
mutate(diff = z - x - y) %>%
group_by(row.x) %>%
slice_min(abs(diff)) %>%
ungroup()
Result (I used set.seed(42) before generating df1+df2.)
# A tibble: 10 x 9
n nn y z dummy row.x x row.y diff
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
1 0 1 1.37 1.30 1 1 0.0361 20 -0.102
2 1 1 -0.565 2.29 1 2 1.90 5 0.956
3 2 1 0.363 -1.39 1 3 -1.76 8 0.0112
4 3 1 0.633 -0.279 1 4 -0.851 18 -0.0607
5 4 1 0.404 -0.133 1 5 -0.609 14 0.0713
6 0 2 -0.106 0.636 1 6 0.705 12 0.0372
7 1 2 1.51 -0.284 1 7 -1.78 2 -0.0145
8 2 2 -0.0947 -2.66 1 8 -2.41 19 -0.148
9 3 2 2.02 -2.44 1 9 -2.41 19 -2.04
10 4 2 -0.0627 1.32 1 10 1.21 4 0.168
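The answer above keeps the single closest match per df1 row, whatever its distance. If you want to set the tolerance yourself, as the question asks, a minimal sketch could filter on the absolute difference instead. It assumes dplyr 1.1.0 or later for cross_join(); the function name approx_matches, the tol argument, and the row1/row2 columns are hypothetical names of mine.

```r
library(dplyr)

set.seed(42)
df1 <- data.frame(n = rep(c(0, 1, 2, 3, 4), times = 2),
                  nn = c(rep(1, 5), rep(2, 5)),
                  y = rnorm(10), z = rnorm(10))
df2 <- data.frame(x = rnorm(20))

# Hypothetical helper: all (df1 row, df2 row) pairs with |z - x - y| <= tol
approx_matches <- function(df1, df2, tol = 0.05) {
  cross_join(
    df1 %>% mutate(row1 = row_number()),  # remember the original row of df1
    df2 %>% mutate(row2 = row_number())   # remember the original row of df2
  ) %>%
    mutate(diff = z - x - y) %>%
    filter(abs(diff) <= tol)
}

matches <- approx_matches(df1, df2, tol = 0.05)
matches
```

Unlike slice_min(), this can return zero or several matches for a given df1 row, depending on tol.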

How to use case_when with entire dataframe?

I'd like to apply case_when to all columns in the data frame.
set.seed(1)
data <- tibble(x = runif(10), y = x * 2)
data
For all columns above 0.5, I'd like to replace with a string ">0.5", for those above 1, I'd like to replace with ">1".
I've tried to case_when, but it appears that I have to specify the column like x and y. I'd like to use case_when without specifying columns and use it on the entire data frame instead.
A purrr solution:
library(purrr)
data %>%
map_df(~case_when(.x > 0.5 & .x < 1 ~ ">0.5",
.x >= 1 ~ ">1"))
Output:
x y
<chr> <chr>
1 NA >0.5
2 NA >0.5
3 >0.5 >1
4 >0.5 >1
5 NA NA
6 >0.5 >1
7 >0.5 >1
8 >0.5 >1
9 >0.5 >1
10 NA NA
Here is a potential solution:
library(tidyverse)
set.seed(1)
data <- tibble(x = runif(10), y = x * 2)
data
#> # A tibble: 10 × 2
#> x y
#> <dbl> <dbl>
#> 1 0.266 0.531
#> 2 0.372 0.744
#> 3 0.573 1.15
#> 4 0.908 1.82
#> 5 0.202 0.403
#> 6 0.898 1.80
#> 7 0.945 1.89
#> 8 0.661 1.32
#> 9 0.629 1.26
#> 10 0.0618 0.124
data %>%
mutate(across(everything(),
~case_when(.x > 0.5 & .x < 1.0 ~ ">0.5",
.x >= 1.0 ~ ">1")))
#> # A tibble: 10 × 2
#> x y
#> <chr> <chr>
#> 1 <NA> >0.5
#> 2 <NA> >0.5
#> 3 >0.5 >1
#> 4 >0.5 >1
#> 5 <NA> <NA>
#> 6 >0.5 >1
#> 7 >0.5 >1
#> 8 >0.5 >1
#> 9 >0.5 >1
#> 10 <NA> <NA>
Created on 2021-10-24 by the reprex package (v2.0.1)
You can use cut -
library(dplyr)
data %>%
mutate(across(everything(), ~cut(., c(0.5, 1, Inf), c(">0.5", ">1"))))
# x y
# <fct> <fct>
# 1 NA >0.5
# 2 NA >0.5
# 3 >0.5 >1
# 4 >0.5 >1
# 5 NA NA
# 6 >0.5 >1
# 7 >0.5 >1
# 8 >0.5 >1
# 9 >0.5 >1
#10 NA NA
In base R, with lapply -
data[] <- lapply(data, function(x) cut(x, c(0.5, 1, Inf), c(">0.5", ">1")))
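One thing to keep in mind with cut(): by default its intervals are open on the left and closed on the right, so the breaks above mean (0.5, 1] and (1, Inf). A value of exactly 1 is therefore labelled ">0.5", whereas the case_when() versions with .x >= 1 label it ">1". A small sketch with made-up boundary values:

```r
# Made-up values sitting on or near the break points
vals <- c(0.5, 0.75, 1, 1.5)

labels <- as.character(cut(vals, c(0.5, 1, Inf), c(">0.5", ">1")))
labels
# 0.5 falls outside (0.5, 1] and becomes NA; 1 falls inside it
```

If you need 1 to count as ">1", right = FALSE flips the intervals to [0.5, 1) and [1, Inf), at the cost of also labelling exactly 0.5 as ">0.5".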
We can use if_all with everything() (to select all the columns) to create the logical vector
library(dplyr)
data %>%
mutate(new = case_when(if_all(everything(), `>`, 1) ~ ">1", if_all(everything(), `>`, 0.5) ~ ">0.5")
)
Output:
# A tibble: 10 × 3
x y new
<dbl> <dbl> <chr>
1 0.266 0.531 <NA>
2 0.372 0.744 <NA>
3 0.573 1.15 >0.5
4 0.908 1.82 >0.5
5 0.202 0.403 <NA>
6 0.898 1.80 >0.5
7 0.945 1.89 >0.5
8 0.661 1.32 >0.5
9 0.629 1.26 >0.5
10 0.0618 0.124 <NA>
NOTE: Since the OP asked about the entire dataset, this creates the new column by evaluating the conditions across all columns of each row.
If the OP meant each column separately, use between (values at or below 0.5 match neither condition and stay NA):
data %>%
mutate(across(everything(),
~ case_when(between(.x, 0.5, 1) ~ "> 0.5", .x > 1 ~ "> 1")))
# A tibble: 10 × 2
x y
<chr> <chr>
1 <NA> > 0.5
2 <NA> > 0.5
3 > 0.5 > 1
4 > 0.5 > 1
5 <NA> <NA>
6 > 0.5 > 1
7 > 0.5 > 1
8 > 0.5 > 1
9 > 0.5 > 1
10 <NA> <NA>
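If we want to apply the conditions to all cells at once (the stricter condition has to come first, since case_when() uses the first match):
out <- as.data.frame(data)
vals <- unlist(data) # flatten column by column
out[] <- case_when(vals > 1 ~ "> 1", vals > 0.5 ~ "> 0.5")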
Another base R solution:
ff <- function(z) { x <- rep(NA_character_, length(z)); x[z > .5] <- ">.5"; x[z > 1] <- ">1"; x }
sapply(data, ff)
# x y
# [1,] NA ">.5"
# [2,] NA ">.5"
# [3,] ">.5" ">1"
# [4,] ">.5" ">1"
# [5,] NA NA
# [6,] ">.5" ">1"
# [7,] ">.5" ">1"
# [8,] ">.5" ">1"
# [9,] ">.5" ">1"
#[10,] NA NA

How to apply value in first row to all subsequent rows by participant using the tidyverse [duplicate]

Just say we run a study where participants are measured on some outcome variable four times each. At the start of testing they provide their age and sex. Here is some toy data to illustrate.
set.seed(1)
sex <- NA
age <- NA
df <- data.frame(id = factor(rep(1:4,each=4)),
time = rep(1:4,times=4),
sex = as.vector(sapply(0:3, function(i) sex[i*4 + 1:4] <- c(sample(c("m", "f"), 1, replace = T), rep(NA,3)))),
age = as.vector(sapply(0:3, function(i) age[i*4 + 1:4] <- c(sample(18:75, 1, replace = T), rep(NA,3)))),
outcome = round(rnorm(16),2),
stringsAsFactors = F)
Here is what the data looks like
df
# output
# id time sex age outcome
# 1 1 m 29 0.33
# 1 2 <NA> NA -0.82
# 1 3 <NA> NA 0.49
# 1 4 <NA> NA 0.74
# 2 1 m 70 0.58
# 2 2 <NA> NA -0.31
# 2 3 <NA> NA 1.51
# 2 4 <NA> NA 0.39
# 3 1 f 72 -0.62
# 3 2 <NA> NA -2.21
# 3 3 <NA> NA 1.12
# 3 4 <NA> NA -0.04
# 4 1 f 56 -0.02
# 4 2 <NA> NA 0.94
# 4 3 <NA> NA 0.82
# 4 4 <NA> NA 0.59
Now what I want to do is to use the tidyverse to apply the values for the demographic variables, at present only on the first row of each participant's data, to all the rows.
At present all I could come up with was
df %>% group_by(id) %>% # group by id
distinct(sex) %>% # shrink to unique values for each id
dplyr::filter(!is.na(sex)) %>% # remove the NAs
left_join(df, by = "id")
Which yields the output
# A tibble: 16 x 6
# Groups: id [4]
# sex.x id time sex.y age outcome
# <chr> <fct> <int> <chr> <int> <dbl>
# 1 m 1 1 m 29 0.33
# 2 m 1 2 NA NA -0.82
# 3 m 1 3 NA NA 0.49
# 4 m 1 4 NA NA 0.74
# 5 m 2 1 m 70 0.580
# 6 m 2 2 NA NA -0.31
# 7 m 2 3 NA NA 1.51
# 8 m 2 4 NA NA 0.39
# 9 f 3 1 f 72 -0.62
# 10 f 3 2 NA NA -2.21
# 11 f 3 3 NA NA 1.12
# 12 f 3 4 NA NA -0.04
# 13 f 4 1 f 56 -0.02
# 14 f 4 2 NA NA 0.94
# 15 f 4 3 NA NA 0.82
# 16 f 4 4 NA NA 0.59
Now I would consider this partially successful because the first row in each participant's sex.x column has now been applied to all their other rows, but I really don't like that there are now two sex columns.
Now I could easily add some more functions to the chain that remove the superfluous sex.y column and rename the sex.x column to its original form, but this seems a bit clunky.
Can anyone suggest how to do this better?
You can fill the sex value for each id :
library(dplyr)
df %>% group_by(id) %>% tidyr::fill(sex)
# id time sex age outcome
# <fct> <int> <chr> <int> <dbl>
# 1 1 1 m 51 -1.54
# 2 1 2 m NA -0.93
# 3 1 3 m NA -0.290
# 4 1 4 m NA -0.01
# 5 2 1 f 40 2.4
# 6 2 2 f NA 0.76
# 7 2 3 f NA -0.8
# 8 2 4 f NA -1.15
# 9 3 1 m 60 -0.290
#10 3 2 m NA -0.3
#11 3 3 m NA -0.41
#12 3 4 m NA 0.25
#13 4 1 m 31 -0.89
#14 4 2 m NA 0.44
#15 4 3 m NA -1.24
#16 4 4 m NA -0.22
You could also fill the age value in the same call: df %>% group_by(id) %>% tidyr::fill(sex, age).
PS - I get different numbers from the same seed value though.
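Because the random numbers differ between R versions, here is a small hand-written sketch of the same idea that does not depend on a seed. The data frame toy is made up for illustration; the ungroup() at the end is my addition, so that later steps are not silently grouped.

```r
library(dplyr)
library(tidyr)

# Made-up data: demographics recorded only on each participant's first row
toy <- data.frame(
  id   = rep(1:2, each = 3),
  time = rep(1:3, times = 2),
  sex  = c("m", NA, NA, "f", NA, NA),
  age  = c(29L, NA, NA, 70L, NA, NA)
)

filled <- toy %>%
  group_by(id) %>%
  fill(sex, age, .direction = "down") %>%  # carry the first row's values downwards
  ungroup()

filled
```

Grouping by id matters: without it, a participant whose first row is missing would inherit the previous participant's values.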

for-loop inside mutate and append result

I have a simple for-loop which works as I would like on vectors. I would like to use it on a column of a dataframe, grouped by another column in the dataframe, e.g.:
# here is my for-loop working as expected on a simple vector:
vect <- c(0.5, 0.7, 0.1)
res <- vector(mode = "numeric", length = 3)
for (i in 1:length(vect)) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
[1] 1.9411537 0.9715143 5.5456579
And here is pseudo-code trying to do it on a column of a dataframe:
#Example data
my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
num1 = 1:3, vect = c(0.5, 0.7, 0.1), num3 = NA)
my.df
let num1 vect num3
1 A 1 0.5 NA
2 A 2 0.7 NA
3 A 3 0.1 NA
4 B 1 0.5 NA
5 B 2 0.7 NA
6 B 3 0.1 NA
7 C 1 0.5 NA
8 C 2 0.7 NA
9 C 3 0.1 NA
# My attempt:
require(tidyverse)
my.df <- my.df %>%
group_by(let) %>%
mutate(for (i in 1:length(vect)) {
num3[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
})
What the result should look like (my pseudo-code above doesn't work):
let num1 vect num3
1 A 1 0.5 1.9411537
2 A 2 0.7 0.9715143
3 A 3 0.1 5.5456579
4 B 1 0.5 1.9411537
5 B 2 0.7 0.9715143
6 B 3 0.1 5.5456579
7 C 1 0.5 1.9411537
8 C 2 0.7 0.9715143
9 C 3 0.1 5.5456579
I feel like I am not using tidyverse logic by trying to having a for-loop inside mutate, any suggestions much appreciated.
The simple solution is to create a custom function and pass that to mutate. A working solution:
custom_func <- function(vec) {
  # one result per element of the vector passed in (not the global vect)
  res <- numeric(length(vec))
  for (i in seq_along(vec)) {
    res[i] <- sum(exp(-2 * (vec[i] - vec[-i])))
  }
  res
}
library(tidyverse)
my.df %>%
group_by(let) %>%
mutate(num3 = custom_func(vect))
#> # A tibble: 9 x 4
#> # Groups: let [3]
#> let num1 vect num3
#> <fct> <int> <dbl> <dbl>
#> 1 A 1 0.5 1.94
#> 2 A 2 0.7 0.972
#> 3 A 3 0.1 5.55
#> 4 B 1 0.5 1.94
#> 5 B 2 0.7 0.972
#> 6 B 3 0.1 5.55
#> 7 C 1 0.5 1.94
#> 8 C 2 0.7 0.972
#> 9 C 3 0.1 5.55
I'm wondering whether a more elegant version of the custom function is possible - perhaps someone smarter than me can tell you whether purrr::map, for example, could provide an alternative.
We can use map_dbl from purrr and apply the formula for calculation.
library(dplyr)
library(purrr)
my.df %>%
group_by(let) %>%
mutate(num3 = map_dbl(seq_along(vect), ~ sum(exp(-2 * (vect[.] - vect[-.])))))
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
#1 A 1 0.5 1.94
#2 A 2 0.7 0.972
#3 A 3 0.1 5.55
#4 B 1 0.5 1.94
#5 B 2 0.7 0.972
#6 B 3 0.1 5.55
#7 C 1 0.5 1.94
#8 C 2 0.7 0.972
#9 C 3 0.1 5.55
You can turn your for-loop into a sapply-call and then use it in mutate.
sapply takes a function and applies it to each list element. In this case I'm looping over the number of elements in each group (n()).
my.df %>%
group_by(let) %>%
mutate(num3 = sapply(1:n(), function(i) sum(exp(-2 * (vect[i] - vect[-i])))))
# A tibble: 9 x 4
# Groups: let [3]
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
# 1 A 1 0.5 1.94
# 2 A 2 0.7 0.972
# 3 A 3 0.1 5.55
# 4 B 1 0.5 1.94
# 5 B 2 0.7 0.972
# 6 B 3 0.1 5.55
# 7 C 1 0.5 1.94
# 8 C 2 0.7 0.972
# 9 C 3 0.1 5.55
This is essentially equivalent to the very wrong-looking for-loop inside a mutate call below. In this case, however, I'd prefer the custom function provided by A. Stam.
my.df %>%
group_by(let) %>%
mutate(num3 = {
res <- numeric(length = n())
for (i in 1:n()) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
})
You can also replace sapply with purrr's map_dbl.
Or using data.table
library(data.table)
setDT(my.df)[, num3 := unlist(lapply(seq_len(.N),
function(i) sum(exp(-2 * (vect[i] - vect[-i]))))), let]
my.df
# let num1 vect num3
#1: A 1 0.5 1.9411537
#2: A 2 0.7 0.9715143
#3: A 3 0.1 5.5456579
#4: B 1 0.5 1.9411537
#5: B 2 0.7 0.9715143
#6: B 3 0.1 5.5456579
#7: C 1 0.5 1.9411537
#8: C 2 0.7 0.9715143
#9: C 3 0.1 5.5456579
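If the groups are small enough that an n-by-n matrix per group is acceptable, the loop can also be avoided altogether with outer(). This is a sketch rather than a drop-in replacement; the helper name pairwise_sum is my own.

```r
library(dplyr)

# Hypothetical helper: for each i, sum over j != i of exp(-2 * (v[i] - v[j]))
pairwise_sum <- function(v) {
  # outer(v, v, "-")[i, j] is v[i] - v[j]; the diagonal contributes exp(0) = 1,
  # so subtracting 1 from each row sum removes the j == i term
  rowSums(exp(-2 * outer(v, v, "-"))) - 1
}

my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
                    num1 = 1:3, vect = c(0.5, 0.7, 0.1))

out <- my.df %>%
  group_by(let) %>%
  mutate(num3 = pairwise_sum(vect)) %>%
  ungroup()

out
```

The trade-off is memory: the matrix grows quadratically with group size, so for very large groups the sapply or map_dbl versions above may be the safer choice.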

Resources