Reshaping data frame with many NAs - r

I have a data frame in R with four variables:
id var1 var2 var3
1  NA   0.4  NA
1  0.8  NA   NA
2  0.7  NA   NA
2  NA   0.5  NA
2  NA   NA   0.1
3  NA   0.5  NA
3  NA   NA   0.2
There are repeated entries per id and each observation only contains one data value besides the id. I would like to obtain one observation per id with all of the data values "collected".
The output should look like this:
id var1 var2 var3
1  0.8  0.4  NA
2  0.7  0.5  0.1
3  NA   0.5  0.2
I have played around with pivot_wider, data.table, gather, but am not getting anywhere. It seems to me that this should be very simple. Like some sort of collapse. Grateful for any pointers.

Or using summarise per group:
library(dplyr)
df |>
group_by(id) |>
summarise(across(everything(), ~ first(na.omit(.))))
Output:
# A tibble: 3 × 4
id var1 var2 var3
<int> <dbl> <dbl> <dbl>
1 1 0.8 0.4 NA
2 2 0.7 0.5 0.1
3 3 NA 0.5 0.2
Thanks to @Darren Tsai for the data.
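For comparison, the same per-group collapse can be sketched in base R with aggregate(). This is only an illustration under the assumption that each id has at most one non-NA value per column; the helper name first_non_na is hypothetical and not part of any package.

```r
# Data as posted in the question.
df <- read.table(text = "
id var1 var2 var3
1 NA 0.4 NA
1 0.8 NA NA
2 0.7 NA NA
2 NA 0.5 NA
2 NA NA 0.1
3 NA 0.5 NA
3 NA NA 0.2", header = TRUE)

# Hypothetical helper: first non-missing value, or NA if there is none.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA_real_
}

# na.action = na.pass keeps rows containing NA; the default (na.omit)
# would drop every row of this data set before aggregating.
res <- aggregate(. ~ id, data = df, FUN = first_non_na, na.action = na.pass)
res
```

If an id could carry several non-NA values in one column, this (like first(na.omit(.))) silently keeps only the first one.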

You can use tidyr::fill by groups and then subset unique rows.
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(var1:var3, .direction = "downup") %>%
distinct() %>%
ungroup()
# # A tibble: 3 × 4
# id var1 var2 var3
# <int> <dbl> <dbl> <dbl>
# 1 1 0.8 0.4 NA
# 2 2 0.7 0.5 0.1
# 3 3 NA 0.5 0.2
Data
df <- read.table(text = "
id var1 var2 var3
1 NA 0.4 NA
1 0.8 NA NA
2 0.7 NA NA
2 NA 0.5 NA
2 NA NA 0.1
3 NA 0.5 NA
3 NA NA 0.2", header = TRUE)

You can first pivot_longer, then remove the NAs, and finally pivot_wider back again:
library(tidyverse)
df %>%
pivot_longer(-id) %>%
na.omit() %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 3 × 4
id var2 var1 var3
<dbl> <dbl> <dbl> <dbl>
1 1 0.4 0.8 NA
2 2 0.5 0.7 0.1
3 3 0.5 NA 0.2

Related

How to loop over the columns in a dataframe, apply spread, and create a new dataframe in R?

I have a dataframe which looks like this example, just much larger:
Name date var1 var2 var3
Peter 2020-03-30 0.4 0.5 0.2
Ben 2020-10-14 0.6 0.4 0.1
Mary 2020-12-06 0.7 0.2 0.9
I want to create a new dataframe for each variable (i.e., var1, var2, var3), which should look like this, e.g., for var1:
date Peter Ben Mary
2020-03-30 0.4 NA NA
2020-10-14 NA 0.6 NA
2020-12-06 NA NA 0.7
I can do it with spread for one variable at a time:
df_new <- tidyr::spread(df[, c("Name", "date", "var1")], Name, var1)
But I could not figure out how to loop it over all columns as I am new to R.
Thank you!
First we want to create a list of data frames and then pivot each one:
library(tidyverse)
res_list = dat %>%
pivot_longer(cols = contains("var")) %>%
split(., .$name) %>%
map(. %>% pivot_wider(names_from="Name"))
$var1
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var1 0.4 NA NA
2 2020-10-14 var1 NA 0.6 NA
3 2020-12-06 var1 NA NA 0.7
$var2
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var2 0.5 NA NA
2 2020-10-14 var2 NA 0.4 NA
3 2020-12-06 var2 NA NA 0.2
$var3
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var3 0.2 NA NA
2 2020-10-14 var3 NA 0.1 NA
3 2020-12-06 var3 NA NA 0.9
Then you can access them like this:
res_list[["var1"]]
# A tibble: 3 × 5
date name Peter Ben Mary
<date> <chr> <dbl> <dbl> <dbl>
1 2020-03-30 var1 0.4 NA NA
2 2020-10-14 var1 NA 0.6 NA
3 2020-12-06 var1 NA NA 0.7
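Since the question asks for a loop over the columns, the same result can also be sketched with lapply() over the variable names. This is a minimal sketch, assuming the data frame is called df as in the question and that the variable columns all start with "var"; the names vars and res_list are chosen here for illustration.

```r
library(tidyr)

# Example data from the question.
df <- data.frame(
  Name = c("Peter", "Ben", "Mary"),
  date = as.Date(c("2020-03-30", "2020-10-14", "2020-12-06")),
  var1 = c(0.4, 0.6, 0.7),
  var2 = c(0.5, 0.4, 0.2),
  var3 = c(0.2, 0.1, 0.9)
)

# Names of the columns to loop over.
vars <- grep("^var", names(df), value = TRUE)

# One wide data frame per variable, stored in a named list.
res_list <- lapply(setNames(vars, vars), function(v) {
  pivot_wider(df[, c("Name", "date", v)],
              names_from = Name, values_from = all_of(v))
})

res_list[["var1"]]
```

Keeping the results in a named list rather than as separate objects makes it easy to apply further steps to all of them at once.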
We can do it this way:
The beginning is similar to user438383's solution.
But then we name each tibble in the list and save them to the global environment within the pipe. For this we need massign from the collapse package; thanks to @akrun (see "How to save each named tibble in a list, as a separate tibble or dataframe in one run").
library(tidyverse)
library(collapse)
df %>%
pivot_longer(cols = contains("var")) %>%
group_split(name) %>%
setNames(unique(df$Name)) %>%
map(. %>% pivot_wider(names_from = Name)) %>%
map(. %>% select(-name)) %>%
massign(names(.), ., .GlobalEnv)
Ben
Mary
Peter
> Ben
# A tibble: 3 x 4
date Peter Ben Mary
<chr> <dbl> <dbl> <dbl>
1 2020-03-30 0.5 NA NA
2 2020-10-14 NA 0.4 NA
3 2020-12-06 NA NA 0.2
> Mary
# A tibble: 3 x 4
date Peter Ben Mary
<chr> <dbl> <dbl> <dbl>
1 2020-03-30 0.2 NA NA
2 2020-10-14 NA 0.1 NA
3 2020-12-06 NA NA 0.9
> Peter
# A tibble: 3 x 4
date Peter Ben Mary
<chr> <dbl> <dbl> <dbl>
1 2020-03-30 0.4 NA NA
2 2020-10-14 NA 0.6 NA
3 2020-12-06 NA NA 0.7

Making data wide where row names are not identical

I am trying to get a model results table into wide format. Since names are not the same on outcomes (dv variables), NA's show up in the table and I can't find a way to have one row per variable.
I need one row per variable/dv. Model 1 and 3 share all variables other than one.
Data:
table <- data.frame(variable=c("intercept", "a", "b", "intercept", "c", "intercept", "a", "e", "intercept", "c"),
b=c(1.2, 0.1, 0.4, 0.3, 0.9, 1.3, 2, .23, .4, .7),
p=(abs(rnorm(10, 0, .3))),
model=c(1,1,1,2,2,3,3,3,4,4),
dv=c(rep("dv1", 5), rep("dv2", 5)))
> table
variable b p model dv
1 intercept 1.20 0.03320481 1 dv1
2 a 0.10 0.16675234 1 dv1
3 b 0.40 0.53607394 1 dv1
4 intercept 0.30 0.14935514 2 dv1
5 c 0.90 0.58998515 2 dv1
6 intercept 1.30 0.21040677 3 dv2
7 a 2.00 0.14183742 3 dv2
8 e 0.23 0.32034711 3 dv2
9 intercept 0.40 0.06539247 4 dv2
10 c 0.70 0.30780133 4 dv2
Code:
table %>%
gather(key, value, b, p) %>% unite("stat_var", dv, key, sep=".") %>%
spread(stat_var, value) %>%
arrange(model, desc(variable))
Output:
variable model dv1.b dv1.p dv2.b dv2.p
1 intercept 1 1.2 0.21866737 NA NA
2 b 1 0.4 0.50600799 NA NA
3 a 1 0.1 0.18751178 NA NA
4 intercept 2 0.3 0.25133611 NA NA
5 c 2 0.9 0.04601194 NA NA
6 intercept 3 NA NA 1.30 0.34144108
7 e 3 NA NA 0.23 0.12793927
8 a 3 NA NA 2.00 0.37614448
9 intercept 4 NA NA 0.40 0.08852144
10 c 4 NA NA 0.70 0.26853770
Looking for:
I will set aside why some models should be treated as "equivalent" (I see no valid reason for it) and simply handle that with mutate(). Looking only at the table manipulation, here is a basic option that gets your desired output:
If you want b and p as suffixes rather than prefixes in the column names, you can use pivot_wider_spec().
require(tidyverse)
table %>%
mutate(model = case_when(model == 3 ~ 1,
model == 4 ~ 2,
TRUE ~ model)) %>%
pivot_wider(names_from = dv, values_from = c("b", "p")) %>%
select(variable,
model,
ends_with("dv1"),
ends_with("dv2"))
# A tibble: 6 x 6
# variable model b_dv1 p_dv1 b_dv2 p_dv2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 intercept 1 1.2 0.318 1.3 0.200
# 2 a 1 0.1 0.120 2 0.419
# 3 b 1 0.4 0.309 NA NA
# 4 intercept 2 0.3 0.350 0.4 0.0148
# 5 c 2 0.9 0.185 0.7 0.530
# 6 e 1 NA NA 0.23 0.174
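As an alternative to building a spec, the column naming can be controlled directly with the names_glue argument of pivot_wider(). The following is a sketch under assumptions: the data frame is called tab here, and the p values are hard-coded (rounded from the question's printout) instead of drawn with rnorm(), so that the example is reproducible.

```r
library(dplyr)
library(tidyr)

# Same structure as the question's table; p is fixed here (an assumption).
tab <- data.frame(
  variable = c("intercept", "a", "b", "intercept", "c",
               "intercept", "a", "e", "intercept", "c"),
  b = c(1.2, 0.1, 0.4, 0.3, 0.9, 1.3, 2, 0.23, 0.4, 0.7),
  p = c(0.03, 0.17, 0.54, 0.15, 0.59, 0.21, 0.14, 0.32, 0.07, 0.31),
  model = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  dv = rep(c("dv1", "dv2"), each = 5)
)

wide <- tab %>%
  # Treat model 3 like model 1 and model 4 like model 2, as in the answer.
  mutate(model = case_when(model == 3 ~ 1,
                           model == 4 ~ 2,
                           TRUE ~ model)) %>%
  # "{dv}.{.value}" puts the dv first and the statistic (b or p) last.
  pivot_wider(names_from = dv, values_from = c(b, p),
              names_glue = "{dv}.{.value}")

wide
```

This yields column names of the form dv1.b and dv2.p, matching the naming used in the question's own gather()/unite() attempt.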
As I stated in my comment, it seems that your expected output is wrong.
However, you can reproduce it by tweaking the model variable:
table %>%
select(model, everything()) %>%
mutate(model=ifelse(model>2, model-2, model)) %>%
pivot_longer(c(b, p)) %>%
unite("name", c("dv", "name")) %>%
pivot_wider()
# # A tibble: 6 x 6
# model variable dv1_b dv1_p dv2_b dv2_p
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 intercept 1.2 0.193 1.3 0.160
# 2 1 a 0.1 0.650 2 0.476
# 3 1 b 0.4 0.190 NA NA
# 4 2 intercept 0.3 0.0435 0.4 0.145
# 5 2 c 0.9 0.372 0.7 0.243
# 6 1 e NA NA 0.23 0.297
Of note, gather() and spread() are superseded by the pivoting functions, which offer very nice improvements (although not used here).

Subset dataframe in R using function inside select_if to make it conditional on a grouping variable?

I would like to conditionally subset a dataframe in R, using dplyr::select_if(). More specifically, I have a dataframe that is made up of a grouping variable and numerous other variables that contain a bunch of NAs:
data <- tibble(group = sort(rep(letters[1:5],3)),
var_1 = c(1,1,1,1,rep(NA,11)),
var_2 = c(1,1,1,1,1,1,rep(NA,9)),
var_3 = 1,
var_4 = c(1,1,rep(NA,10),1,1,1),
var_5 = c(1,1,1,1,1,1,NA,NA,NA,NA,NA,NA,1,1,1))
# A tibble: 15 x 6
group var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 1 1 1 1 1
2 a 1 1 1 1 1
3 a 1 1 1 NA 1
4 b 1 1 1 NA 1
5 b NA 1 1 NA 1
6 b NA 1 1 NA 1
7 c NA NA 1 NA NA
8 c NA NA 1 NA NA
9 c NA NA 1 NA NA
10 d NA NA 1 NA NA
11 d NA NA 1 NA NA
12 d NA NA 1 NA NA
13 e NA NA 1 1 1
14 e NA NA 1 1 1
15 e NA NA 1 1 1
In this dataframe, I need to identify and remove columns like var_4 in this case that only occur in one group (but irrespective of whether or not they show up in the last group: "e"). Importantly, everything else has to remain untouched (i.e. I want to keep variables that look like var_1,var_2,var_3, and var_5). This is what I tried:
library(dplyr)
data %>%
filter(group!="e") %>% # Ignore last group.
select_if(~ function(col)) %>% # Write function to look for cols that only have values for one group of the total four groups remaining (a-d).
names() -> cols_to_drop # Save col names.
data %>% select(-cols_to_drop) -> new_data # Subset by saved col names.
Unfortunately, I can't figure out how to write that function inside select_if() to specify that grouping variable condition.
A second thing that I have been wondering about is whether I can use select_if() to remove cols based on the percentage of NAs it contains. Is there a way?
I am not sure whether select_if can do this kind of grouped column selection.
Here is one way to do it by first reshaping the data to long format:
library(dplyr)
cols <- data %>%
filter(group != "e") %>%
tidyr::pivot_longer(cols = starts_with('var')) %>%
group_by(name, group) %>%
summarise(value = any(!is.na(value))) %>%
summarise(value = sum(value)) %>%
filter(value > 1) %>%
pull(name)
# Select the columns (all_of() makes clear that cols is an external character vector)
data %>% select(group, all_of(cols))
# group var_1 var_2 var_3 var_5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 1 1 1
# 2 a 1 1 1 1
# 3 a 1 1 1 1
# 4 b 1 1 1 1
# 5 b NA 1 1 1
# 6 b NA 1 1 1
# 7 c NA NA 1 NA
# 8 c NA NA 1 NA
# 9 c NA NA 1 NA
#10 d NA NA 1 NA
#11 d NA NA 1 NA
#12 d NA NA 1 NA
#13 e NA NA 1 1
#14 e NA NA 1 1
#15 e NA NA 1 1
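The second part of the question, dropping columns by their share of NAs, is not covered above. One possible sketch uses select() with where(), the current replacement for select_if(). The threshold name max_na_share and its value of 0.5 are assumptions chosen for illustration.

```r
library(dplyr)

# Example data from the question.
data <- tibble(group = sort(rep(letters[1:5], 3)),
               var_1 = c(1, 1, 1, 1, rep(NA, 11)),
               var_2 = c(1, 1, 1, 1, 1, 1, rep(NA, 9)),
               var_3 = 1,
               var_4 = c(1, 1, rep(NA, 10), 1, 1, 1),
               var_5 = c(1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1))

# Hypothetical threshold: keep columns with at most 50% missing values.
max_na_share <- 0.5

# mean(is.na(x)) is the proportion of NAs in a column.
kept <- data %>%
  select(where(~ mean(is.na(.x)) <= max_na_share))

names(kept)
```

With this data, only group, var_3 and var_5 fall at or below the 50% threshold; the other columns have between 60% and 73% NAs.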

R Mutate multiple columns with ifelse()-condition

I want to create several columns with an ifelse() condition. Here is my example code:
df <- tibble(
date = lubridate::today() +0:9,
return= c(1,2.5,2,3,5,6.5,1,9,3,2))
And now I want to add new columns with ascending conditions (from 1 to 8). The first column should only contain values from the "return"-column, which are higher than 1, the second column should only contain values, which are higher than 2, and so on...
I can calculate each column with a mutate() function:
df <- df %>% mutate( `return>1`= ifelse(return > 1, return, NA))
df <- df %>% mutate( `return>2`= ifelse(return > 2, return, NA))
df <- df %>% mutate( `return>3`= ifelse(return > 3, return, NA))
df <- df %>% mutate( `return>4`= ifelse(return > 4, return, NA))
df <- df %>% mutate( `return>5`= ifelse(return > 5, return, NA))
df <- df %>% mutate( `return>6`= ifelse(return > 6, return, NA))
df <- df %>% mutate( `return>7`= ifelse(return > 7, return, NA))
df <- df %>% mutate( `return>8`= ifelse(return > 8, return, NA))
> head(df)
# A tibble: 6 x 10
date return `return>1` `return>2` `return>3` `return>4` `return>5` `return>6` `return>7` `return>8`
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-03-08 1 NA NA NA NA NA NA NA NA
2 2019-03-09 2.5 2.5 2.5 NA NA NA NA NA NA
3 2019-03-10 2 2 NA NA NA NA NA NA NA
4 2019-03-11 3 3 3 NA NA NA NA NA NA
5 2019-03-12 5 5 5 5 5 NA NA NA NA
6 2019-03-13 6.5 6.5 6.5 6.5 6.5 6.5 6.5 NA NA
Is there an easier way to create all these columns and reduce all this code? Maybe with a map_function? And is there a way to automatically name the new columns?
An option with lapply
n <- seq(1, 8)
df[paste0("return > ", n)] <- lapply(n, function(x)
replace(df$return, df$return <= x, NA))
# date return `return > 1` `return > 2` `return > 3` .....
# <date> <dbl> <dbl> <dbl> <dbl>
#1 2019-03-08 1 NA NA NA
#2 2019-03-09 2.5 2.5 2.5 NA
#3 2019-03-10 2 2 NA NA
#4 2019-03-11 3 3 3 NA
#5 2019-03-12 5 5 5 5
#6 2019-03-13 6.5 6.5 6.5 6.5
#...
Here is a for loop solution:
for(i in 1:8){
varname =paste0("return>",i)
df[[varname]] <- with(df, ifelse(return > i, return, NA))
}
Use purrr::map_df:
bind_cols(df, purrr::map_df(setNames(1:8, paste0('return>', 1:8)),
function(x) ifelse(df$return > x, df$return, NA)))
# A tibble: 6 x 10
# date return `return>1` `return>2` `return>3` `return>4` `return>5` `return>6` `return>7` `return>8`
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2019-03-08 1 NA NA NA NA NA NA NA NA
# 2 2019-03-09 2.5 2.5 2.5 NA NA NA NA NA NA
# 3 2019-03-10 2 2 NA NA NA NA NA NA NA
# 4 2019-03-11 3 3 3 NA NA NA NA NA NA
# 5 2019-03-12 5 5 5 5 5 NA NA NA NA
# 6 2019-03-13 6.5 6.5 6.5 6.5 6.5 6.5 6.5 NA NA
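The approaches above can also be wrapped in a small reusable function so that the column and the thresholds become parameters. This is a base R sketch; the function name add_threshold_cols is hypothetical, and fixed dates are used instead of lubridate::today() so that the example is reproducible.

```r
# Hypothetical helper: for each threshold t, add a column "<col>>t" that
# keeps the values of `col` greater than t and sets the rest to NA.
add_threshold_cols <- function(data, col, thresholds) {
  values <- data[[col]]
  new_cols <- lapply(thresholds, function(t) ifelse(values > t, values, NA))
  names(new_cols) <- paste0(col, ">", thresholds)
  data[names(new_cols)] <- new_cols
  data
}

df <- data.frame(
  date = as.Date("2019-03-08") + 0:9,
  return = c(1, 2.5, 2, 3, 5, 6.5, 1, 9, 3, 2)
)

out <- add_threshold_cols(df, "return", 1:8)
head(out)
```

Because the new names contain ">", they have to be accessed with backticks or with [[ ]], for example out[["return>5"]].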

Calculate new values by row

I'd like to create a new column (val_new) in which each value is multiplied by a value in another column (val2) by row. As I want to do this for several groups I'd prefer using dplyr, but how?
dat <- data.frame(group = rep(c("A", "B"), each = 3),
val1 = c(50, NA, NA, 40, NA, NA),
val2 = c(NA, 0.5, 0.3, NA, 0.8, 0.7))
> dat
group val1 val2
1 A 50 NA
2 A NA 0.5
3 A NA 0.3
4 B 40 NA
5 B NA 0.8
6 B NA 0.7
dat %>%
group_by(group) %>%
mutate(val_new = ifelse(!is.na(val1), val1, lag(val_new) * val2))
Error in mutate_impl(.data, dots) :
Evaluation error: object 'val_new' not found.
Desired result:
# A tibble: 6 x 4
# Groups: group [2]
group val1 val2 val_new
<fct> <dbl> <dbl> <dbl>
1 A 50 NA 50
2 A NA 0.5 25
3 A NA 0.3 7.5
4 B 40 NA 40
5 B NA 0.8 32
6 B NA 0.7 22.4
Try this:
dat %>%
group_by(group) %>%
mutate(val_new = cumprod(c(first(val1),val2[-1])))
## A tibble: 6 x 4
## Groups: group [2]
# group val1 val2 val_new
# <fct> <dbl> <dbl> <dbl>
#1 A 50 NA 50
#2 A NA 0.5 25
#3 A NA 0.3 7.5
#4 B 40 NA 40
#5 B NA 0.8 32
#6 B NA 0.7 22.4
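The cumprod() trick works because the desired recursion, val_new[i] = val_new[i - 1] * val2[i] with the group's first val1 as the starting value, is exactly a cumulative product. A plain loop makes this explicit. It is a sketch that assumes the rows are ordered by group and that the first row of each group holds the starting value in val1.

```r
dat <- data.frame(group = rep(c("A", "B"), each = 3),
                  val1 = c(50, NA, NA, 40, NA, NA),
                  val2 = c(NA, 0.5, 0.3, NA, 0.8, 0.7))

val_new <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  # A row starts a group if it is the first row or the group label changes.
  starts_group <- i == 1 || dat$group[i] != dat$group[i - 1]
  val_new[i] <- if (starts_group) {
    dat$val1[i]                  # starting value of the group
  } else {
    val_new[i - 1] * dat$val2[i] # previous result times the current factor
  }
}
dat$val_new <- val_new
dat
```

The loop reproduces the values 50, 25, 7.5 for group A and 40, 32, 22.4 for group B; the vectorised cumprod() version is shorter and usually preferable on larger data.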
