dplyr rowwise with lag variables

I am trying to fill the NAs in a variable using another, correlated variable, as in the code below.
library(dplyr)
test <- tibble(x = c(1, 4, 3, 2, 5, 6), y = c(2, NA, 6, NA, NA, 5))
test <- test %>% mutate(chng = x / lag(x, 1))
for (i in 2:nrow(test)) { # start at 2 so there is always a previous row
  if (is.na(test$y[i])) test$y[i] <- test$y[i - 1] * test$chng[i]
}
Can I do the same operation in dplyr? I've tried rowwise(), but it seems not to recognize the lag() function.
test %>% rowwise() %>% mutate(y = ifelse(is.na(y), lag(y, 1) * chng, y))
Multiple NAs in a row also prevent me from simply creating a new column with the lagged variable up front.

You could just repeat the dplyr operation until all NAs have been filled:
while (sum(is.na(test$y)) > 0) {
  test <- test %>%
    mutate(y = ifelse(is.na(y), lag(y, 1) * chng, y))
}
# A tibble: 6 x 3
      x     y  chng
  <dbl> <dbl> <dbl>
1     1     2 NA
2     4     8  4
3     3     6  0.75
4     2     4  0.667
5     5    10  2.5
6     6     5  1.2
I'm pretty sure this won't gain you any computing time over the loop, though.

It's not working because inside rowwise() you are calling lag() on a one-row subset, so there is no previous row for it to see. Creating a new column y.lag before you enter rowwise mode will work:
test %>%
  mutate(y.lag = lag(y, 1)) %>%
  rowwise() %>%
  mutate(y = ifelse(is.na(y), y.lag * chng, y)) %>%
  select(-y.lag)
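Note that this still fills only one NA per run of consecutive NAs, so it would need the while loop above for longer gaps. A single-pass alternative is to carry the last filled value forward with purrr::accumulate2 (a sketch, not from the original answers; it assumes y[1] is not NA and chng is non-NA wherever y is NA):
library(purrr)
test %>%
  mutate(y = unlist(accumulate2(
    y[-1], chng[-1],  # walk the remaining rows in order, seeded with the first y
    function(prev, yi, ci) if (is.na(yi)) prev * ci else yi,
    .init = y[1]
  )))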

Related

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 var1 = c(1, NA, 3, 6, NA),
                 var2 = c(NA, 1, 2, 6, 6),
                 var3 = c(NA, 2, NA, 4, 3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
                  var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
                  var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
                  var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding, as I have ~200 variables in total (and I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes) {
  which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
  left_join(var1, by = "ID") %>%
  mutate(var1 = coalesce(var1.x, var1.y)) %>%
  select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way to calculate the mode from the imputed data set for each variable by ID and then replace the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them with one piece of code, where i changes to each variable name, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variables names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
Here is one option using the tidyverse. Essentially, we can pivot both dataframes long, then join them together and coalesce in one step rather than column by column. The Mode function is the usual unique/tabulate/match helper.
library(tidyverse)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode)) %>%
  pivot_longer(-ID)
df %>%
  pivot_longer(-ID) %>%
  left_join(imp_long, by = c("ID", "name")) %>%
  mutate(value = coalesce(value.x, value.y)) %>%
  select(-c(value.x, value.y)) %>%
  pivot_wider(names_from = "name", values_from = "value")
Output
# A tibble: 5 × 4
     ID  var1  var2  var3
  <dbl> <dbl> <dbl> <dbl>
1     1     1     3     6
2     2     5     1     2
3     3     3     2     5
4     4     6     6     4
5     5     3     6     3
You can use:
library(dplyr)
mode_data <- imp %>%
  group_by(ID) %>%
  summarise(across(starts_with('var'), Mode))

df %>%
  left_join(mode_data, by = 'ID') %>%
  transmute(ID,
            across(matches('\\.x$'),
                   function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
                   .names = '{sub(".x$", "", .col)}'))
#  ID var1 var2 var3
#1  1    1    3    6
#2  2    5    1    2
#3  3    3    2    5
#4  4    6    6    4
#5  5    3    6    3
mode_data has the Mode value for each of the var columns.
Join df and mode_data by ID.
Since all the pairs have name.x and name.y in their names, we can take every name.x column and replace .x with .y to get its corresponding partner column (.[[sub('x$', 'y', cur_column())]]).
Use coalesce to select the non-NA value in each pair.
Change the column name by removing .x from it ({sub(".x$", "", .col)}), so var1.x becomes just var1.
where Mode is the usual unique/tabulate/match helper:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr, warn.conflicts = FALSE)
imp %>%
  group_by(ID) %>%
  summarise(across(everything(), Mode)) %>%
  bind_rows(df) %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#>      ID  var1  var2  var3
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     3     6
#> 2     2     5     1     2
#> 3     3     3     2     5
#> 4     4     6     6     4
#> 5     5     3     6     3
Created on 2022-01-03 by the reprex package (v2.0.1)
(Mode here is the same unique/tabulate/match helper defined above.)
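If your dplyr is recent enough (>= 1.0.0), rows_patch() expresses the "fill NAs from a lookup table" idea directly and avoids the .x/.y column juggling entirely. A sketch, reusing mode_data from the answer above:
library(dplyr)
df %>%
  rows_patch(mode_data, by = "ID")  # overwrites only the NA cells of df with mode_data values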

findInterval by group with dplyr [duplicate]

This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame (11 answers)
Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
library(dplyr)
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
             val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as in this example, the distribution of val depends on gr. The breakpoints are determined within each group by the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
                            purrr::map_dfr(1:3, function(gr) {
                              c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
                            })) %>%
  nest(bp = -gr) %>%
  mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess for defining the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
                          ~ findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient: with my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any ideas?
You can do this in a single group_by + mutate step:
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
                                    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup()
#       gr    val breakpoints
#    <int>  <dbl>       <int>
#  1     1  0.440           1
#  2     1  0.770           2
#  3     1  2.56            4
#  4     1  1.07            3
#  5     1  1.13            3
#  6     1  2.72            4
#  7     1  1.46            4
#  8     1 -0.265           1
#  9     1  0.313           1
# 10     1  0.554           2
# … with 20 more rows
findInterval() is applied within each gr separately.
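If approximate quartile groups are acceptable, dplyr::ntile() is an even shorter option, though note it splits each group into equal-sized buckets by rank rather than cutting at the quantile breakpoints, so ties and edge cases can differ from the findInterval result (a sketch, not from the original answer):
df %>%
  group_by(gr) %>%
  mutate(lvl = ntile(val, 4)) %>%
  ungroup()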

R program questions

I am trying to get some unique combinations of two variables.
For each value of x, I would like to keep only a single unique y value, and drop the x values that have several y values. Several x values may, however, share the same y value.
For example, with
a <- data.frame(x = c(1, 1, 2, 4, 5, 5), y = c(2, 3, 3, 3, 6, 6))
I would like to get the output:
b <- data.frame(x = c(2, 4, 5), y = c(3, 3, 6))
I have tried unique(), but it does not help in this situation.
Thank you!
First we use unique() to omit repeated rows with the same x and y values (keeping only one copy of each). Any repeated x values that are left have different y values, so we want to get rid of them. We use the standard way to remove all copies of any duplicated values, as in this R-FAQ.
a <- data.frame(x = c(1, 1, 2, 4, 5, 5), y = c(2, 3, 3, 3, 6, 6))
b <- unique(a)
b <- b[!duplicated(b$x) & !duplicated(b$x, fromLast = TRUE), ]
b
#   x y
# 3 2 3
# 4 4 3
# 5 5 6
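The same filter can also be written in a single base R step with ave(), counting distinct y values per x before deduplicating (a sketch, not part of the original answer):
unique(subset(a, ave(y, x, FUN = function(v) length(unique(v))) == 1))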
Fans of dplyr would probably do it like this, producing the same result.
library(dplyr)
a %>%
  group_by(x) %>%
  filter(n_distinct(y) == 1) %>%
  distinct()
Using dplyr:
library(dplyr)
a <- data.frame(x = c(1, 1, 2, 4, 5, 5), y = c(2, 3, 3, 3, 6, 6))
a %>%
  distinct() %>%
  add_count(x) %>% # adds an implicit group_by(x)
  filter(n == 1) %>%
  select(-n)
#> # A tibble: 3 x 2
#> # Groups: x [3]
#>       x     y
#>   <dbl> <dbl>
#> 1     2     3
#> 2     4     3
#> 3     5     6
Created on 2018-11-14 by the reprex package (v0.2.1)

Tidyverse Solution for Using Tibble Columns as Input to a Function

I am trying to run a function on all combinations of two column vectors in a tibble.
library(tidyverse)
combination <- tibble(x = c(1, 2), y = c(3, 4))
sum_square <- function(x, y) {
  x^2 + y^2
}
I would like to run this function on all combinations of column x and column y:
sum_square(1, 3)
sum_square(1, 4)
sum_square(2, 3)
sum_square(2, 4)
Ideally I would like a tidyverse solution.
We can first expand() the data and then apply sum_square() to the expanded dataset:
library(tidyverse)
expand(combination, x, y) %>%
  mutate(new = sum_square(x, y))
# A tibble: 4 x 3
#      x     y   new
#  <dbl> <dbl> <dbl>
# 1     1     3    10
# 2     1     4    17
# 3     2     3    13
# 4     2     4    20
Another option is outer():
combination %>%
  reduce(outer, FUN = sum_square) %>%
  c() %>%
  tibble(new = .)
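With the tidyverse loaded as above, tidyr::expand_grid() is a closely related alternative: it keeps every pairing in input order, whereas expand() returns the distinct, sorted combinations (a sketch, not from the original answers; assumes tidyr >= 1.0):
expand_grid(x = combination$x, y = combination$y) %>%
  mutate(new = sum_square(x, y))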

Remove duplicated rows using dplyr

I have a data.frame like this:
set.seed(123)
df <- data.frame(x = sample(0:1, 10, replace = TRUE),
                 y = sample(0:1, 10, replace = TRUE),
                 z = 1:10)
> df
   x y  z
1  0 1  1
2  1 0  2
3  0 1  3
4  1 1  4
5  1 0  5
6  0 1  6
7  1 0  7
8  1 0  8
9  1 0  9
10 0 1 10
I would like to remove duplicate rows based on the first two columns. Expected output:
df[!duplicated(df[, 1:2]), ]
  x y z
1 0 1 1
2 1 0 2
4 1 1 4
I am specifically looking for a solution using dplyr package.
Here is a solution using dplyr >= 0.5:
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)
df %>% distinct(x, y, .keep_all = TRUE)
  x y z
1 0 1 1
2 1 0 2
3 1 1 4
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
##   x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1.)
I've also been thinking about adding a slice() function that would
work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which
variables to use:
df %>% unique(x, y)
For completeness’ sake, the following also works:
df %>% group_by(x) %>% filter(!duplicated(y))
However, I prefer the solution using distinct, and I suspect it’s faster, too.
Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.
However, here's another approach that uses the slice() function from dplyr.
# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

# In each group of rows formed by combinations of x and y,
# retain only the first row
df %>%
  group_by(x, y) %>%
  slice(1)
Difference from using the distinct() function
The advantage of this solution is that it makes it explicit which rows are retained from the original dataframe, and it can pair nicely with the arrange() function.
Let's say you had customer sales data and you wanted to retain one record per customer, and you want that record to be the one from their latest purchase. Then you could write:
customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1)
When selecting a subset of columns in R you can often end up with duplicate rows.
These two lines give the same result; each outputs a unique data set with only the two selected columns:
distinct(mtcars, cyl, hp)
summarise(group_by(mtcars, cyl, hp))
If you want to find the rows that are duplicated, you can use find_duplicates() from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 2, 4),
             b = c(5, 2, 2, 8))
df %>% find_duplicates()
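Without adding a dependency, the same duplicate report can be produced in plain dplyr by keeping the groups that occur more than once (a sketch, not part of the original answer):
df %>%
  group_by(a, b) %>%
  filter(n() > 1) %>%
  ungroup()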
