Conditional update similar to SQL - r

I have the following dataframe
library(tidyverse)
x <- c(1,2,3,NA,NA,4,5)
y <- c(1,2,3,5,5,4,5)
z <- c(1,1,1,6,7,7,8)
df <- data.frame(x,y,z)
df
x y z
1 1 1 1
2 2 2 1
3 3 3 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
I would like to update the dataframe according to the following conditions
If z==1, update to x=1, else leave the current value for x
If z==1, update to y=2, else leave the current value for y
The following code does the job fine
df %>% mutate(x=if_else(z==1,1,x),y=if_else(z==1,2,y))
x y z
1 1 2 1
2 1 2 1
3 1 2 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
However, I have to repeat the if_else condition for both x and y inside mutate. This has the potential to make my code complicated and hard to read. To give you a SQL analogy, consider the following code
UPDATE df
SET x= 1, y= 2
WHERE z = 1;
I would like to achieve the following:
Specify the update condition ahead of time, so I don't have to repeat it for every mutate function
I would like to avoid using data.table or base R. I am using dplyr so I would like to stick to it for consistency

Using mutate_cond posted at dplyr mutate/replace several columns on a subset of rows we can do this:
df %>% mutate_cond(z == 1, x = 1, y = 2)
giving:
x y z
1 1 2 1
2 1 2 1
3 1 2 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
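Note that mutate_cond is not part of dplyr; it is a small helper taken from the linked answer. A minimal sketch of such a helper is shown below, assuming a plain data.frame as input. Treating NA conditions as "do not update" is an assumption added here, not something the question specifies.

```r
library(dplyr)

# Helper (not part of dplyr): apply mutate() only to rows where `condition` holds.
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  # Evaluate the unquoted condition using the columns of .data
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE  # assumption: NA means "leave the row alone"
  .data[condition, ] <- mutate(.data[condition, ], ...)
  .data
}

df <- data.frame(x = c(1, 2, 3, NA, NA, 4, 5),
                 y = c(1, 2, 3, 5, 5, 4, 5),
                 z = c(1, 1, 1, 6, 7, 7, 8))

res <- df %>% mutate_cond(z == 1, x = 1, y = 2)
res
```

The condition is written once, much like the WHERE clause of the SQL UPDATE in the question.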
sqldf
Of course you can directly implement it in SQL with sqldf -- ignore the warning message that the backend RSQLite issues.
library(sqldf)
sqldf(c("update df set x = 1, y = 2 where z = 1", "select * from df"))
base R
It is straightforward in base R:
df[df$z == 1, c("x", "y")] <- list(1, 2)

library(dplyr)
df %>%
mutate(x = replace(x, z == 1, 1),
y = replace(y, z == 1, 2))
# x y z
#1 1 2 1
#2 1 2 1
#3 1 2 1
#4 NA 5 6
#5 NA 5 7
#6 4 4 7
#7 5 5 8
In base R
transform(df,
x = replace(x, z == 1, 1),
y = replace(y, z == 1, 2))
If you store the condition in a variable, you don't have to type it multiple times
condn = (df$z == 1)
transform(df,
x = replace(x, condn, 1),
y = replace(y, condn, 2))

Here is one option with map2. Loop through the 'x' and 'y' columns of the dataset along with the replacement values, and apply case_when so that rows where 'z' is 1 get the new value while all other rows keep the original one. Then bind the result back to the remaining column of the original dataset
library(dplyr)
library(purrr)
map2_df(df %>%
select(x, y), c(1, 2), ~ case_when(df$z == 1 ~ .y, TRUE ~ .x)) %>%
bind_cols(df %>%
select(z), .) %>%
select(names(df))
Or using base R, create a logical vector, use that to subset the rows of columns 'x', 'y' and update by assigning to a list of values
i1 <- df$z == 1
df[i1, c('x', 'y')] <- list(1, 2)
df
# x y z
#1 1 2 1
#2 1 2 1
#3 1 2 1
#4 NA 5 6
#5 NA 5 7
#6 4 4 7
#7 5 5 8
The advantage of both solutions is that we can pass any number of columns with their corresponding values without repeating the code
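If you would rather stay within mutate, the same idea (one condition, any number of columns) can be sketched with across() and cur_column(). The name new_vals is hypothetical, and this assumes dplyr 1.0.0 or later:

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3, NA, NA, 4, 5),
                 y = c(1, 2, 3, 5, 5, 4, 5),
                 z = c(1, 1, 1, 6, 7, 7, 8))

# Hypothetical lookup: column name -> value to assign where the condition holds
new_vals <- c(x = 1, y = 2)

res <- df %>%
  mutate(across(all_of(names(new_vals)),
                # cur_column() gives the name of the column currently being processed
                ~ if_else(z == 1, new_vals[[cur_column()]], .x)))
res
```

The condition z == 1 appears only once, and adding another column means adding one entry to new_vals.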

If you have an SQL background, you should really check out data.table:
library(data.table)
dt <- as.data.table(df)
set(dt, which(dt$z == 1), c('x', 'y'), list(1, 2))
dt
# or perhaps more classic syntax
dt <- as.data.table(df)
dt
# x y z
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: NA 5 6
#5: NA 5 7
#6: 4 4 7
#7: 5 5 8
dt[z == 1, `:=`(x = 1, y = 2)]
dt
# x y z
#1: 1 2 1
#2: 1 2 1
#3: 1 2 1
#4: NA 5 6
#5: NA 5 7
#6: 4 4 7
#7: 5 5 8
The last option is an update join. This is great if you already have the lookup data prepared up front:
# update join:
dt <- as.data.table(df)
dt_lookup <- data.table(x = 1, y = 2, z = 1)
dt[dt_lookup, on = .(z), `:=`(x = i.x, y = i.y)]
dt

Related

Efficient recoding of numeric variables into a factor in a data.frame

In recoding values of numeric variables like var1 below into character values, sometimes there is an easy pattern. For example, suppose numeric values 1:4 in var1 need to be recoded as LETTERS[27-(4:1)], respectively.
In such situations, is it possible to avoid writing var1 = recode(var1,`1`="W",`2`="X",`3`="Y",`4`="Z") and instead loop the recoding?
library(tidyverse)
(dat <- data.frame(var1 = rep(1:4,2), id = 1:8))
mutate(dat, var1 = recode(var1,`1`="W",`2`="X",`3`="Y",`4`="Z")) # This works but can we
# loop it as well?
We can use a vectorized approach, no loops necessary. tail and base subsetting with [ will do the trick here.
library(dplyr)
dat %>% mutate(var1=tail(LETTERS, max(var1))[var1] %>% as.factor)
var1 id
1 W 1
2 X 2
3 Y 3
4 Z 4
5 W 5
6 X 6
7 Y 7
8 Z 8
data
dat <- data.frame(var1 = rep(1:4,2), id = 1:8)
data2
dat2 <- data.frame(var1 = c(2,1,3,1,4:1), id = 1:8)
var1 id
1 2 1
2 1 2
3 3 3
4 1 4
5 4 5
6 3 6
7 2 7
8 1 8
output2
var1 id
1 X 1
2 W 2
3 Y 3
4 W 4
5 Z 5
6 Y 6
7 X 7
8 W 8
You can use -
library(dplyr)
dat %>% mutate(var1 = LETTERS[length(LETTERS)-max(var1) + var1])
# var1 id
#1 W 1
#2 X 2
#3 Y 3
#4 Z 4
#5 W 5
#6 X 6
#7 Y 7
#8 Z 8
You can also just use the labels argument of factor()
library(dplyr)
dat <- data.frame(var1 = rep(1:4,2), id = 1:8) %>%
mutate(var1 = factor(var1, labels = tail(LETTERS, 4)))
dat
var1 id
1 W 1
2 X 2
3 Y 3
4 Z 4
5 W 5
6 X 6
7 Y 7
8 Z 8
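When the mapping does not follow an arithmetic pattern, a named lookup vector gives the same "recode without typing every pair" effect. This is only a sketch in base R; the name lookup is hypothetical and the data is the dat2 example from above:

```r
dat2 <- data.frame(var1 = c(2, 1, 3, 1, 4:1), id = 1:8)

# Names are the old codes, values are the new labels: c("1" = "W", ..., "4" = "Z")
lookup <- setNames(tail(LETTERS, 4), 1:4)

# Index the lookup by name; unname() drops the old codes from the result
dat2$var1 <- unname(lookup[as.character(dat2$var1)])
dat2
```

Because the lookup is indexed by name rather than by position, the codes do not have to be consecutive integers.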

How to replace values in a specific column

I have a big data frame from a survey. There are some statements where I need to use reverse coding, hence I need to change values in a few columns. I have tried the code below (where x represents the column where I want to make the changes)
df$x <- replace( df$x, 1=7, 2=6, 3=5, 5=3, 6=2, 7=1)
But this did not work. Any help is much appreciated.
If your column has only 1-7 values you can subtract those values from 8 to reverse the values.
set.seed(123)
df <- data.frame(x = sample(7, 10, replace = TRUE))
df$y <- 8 - df$x
#Or maybe more general
#df$y <- max(df$x) + 1 - df$x
df
# x y
#1 7 1
#2 7 1
#3 3 5
#4 6 2
#5 3 5
#6 2 6
#7 2 6
#8 6 2
#9 3 5
#10 5 3
You could try case_when from package dplyr. The syntax is very clean.
library(dplyr)
df %>%
mutate(x=case_when(
x == 1 ~ 7,
x == 2 ~ 6,
x == 3 ~ 5,
x == 6 ~ 2,
x == 7 ~ 1,
TRUE ~ as.numeric(x)
))
DATA
set.seed(1)
df <- data.frame(x = sample(7, 10, replace = TRUE))
df
The solution above overwrites the variable x. To compare the results, I created a new_x variable with the replaced data:
df %>%
mutate(new_x=case_when(
x == 1 ~ 7,
x == 2 ~ 6,
x == 3 ~ 5,
x == 6 ~ 2,
x == 7 ~ 1,
TRUE ~ as.numeric(x)
))
x new_x
1 1 7
2 4 4
3 7 1
4 1 7
5 2 6
6 5 5
7 7 1
8 3 5
9 6 2
10 2 6
One way you can replace values is using which:
df$x[which(df$x == 1)] <- 7 # this replaces 1 with 7
Another way is to use ifelse:
df$x <- ifelse(df$x == 1,7,ifelse(df$x == 2,6,ifelse....)) # replaces 1 with 7, 2 with 6 and so on..
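One thing to watch with step-by-step replacement in place: if you first turn every 1 into 7 and later turn every 7 into 1, the first change is undone. A position-based lookup avoids this because all values are mapped in a single step. The sketch below assumes x holds only the integers 1 to 7 with no missing values; reverse_key and x_rev are hypothetical names:

```r
df <- data.frame(x = c(1, 4, 7, 2, 6, 3, 5))

# Position i of the key holds the replacement for the value i
reverse_key <- 7:1

# All values are looked up at once, so no replacement can clobber another
df$x_rev <- reverse_key[df$x]
df
```

For this particular reversal the result is the same as 8 - df$x, but the lookup form also works for mappings that are not a simple reflection.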
An option with which.max
library(dplyr)
df %>%
mutate(y = x[which.max(x)] - x + 1)

Adding column conditioning on the other columns

I want to create a new column z based on the values of x and y. If x>y, z=y otherwise z=x.
x y
3 4
5 2
6 6
1 7
9 4
Output required:
x y z
3 4 3
5 2 2
6 6 6
1 7 1
9 4 4
You can use ifelse :
df$z <- with(df, ifelse(x > y, y, x))
#Or without with
#df$z <- ifelse(df$x > df$y, df$y, df$x)
df
# x y z
#1 3 4 3
#2 5 2 2
#3 6 6 6
#4 1 7 1
#5 9 4 4
In dplyr, you can use if_else, which works the same way as above, or case_when, which is helpful when you have to list multiple conditions.
library(dplyr)
df %>%
mutate(z = case_when(x > y ~ y,
TRUE ~x))
If I understand correctly, you are looking for the minimum value across several columns. You can use the pmin function:
library(dplyr)
df <- data.frame(x = c(3,5,6,1,9),
y = c(4,2,6,7,4))
df <- df %>% mutate(z = pmin(x, y))
result:
> df
x y z
1 3 4 3
2 5 2 2
3 6 6 6
4 1 7 1
5 9 4 4
It computes the minimum value row-wise and simplifies the syntax if you would like to include more than 2 columns:
df <- data.frame(x = c(3, 5, 6, 1, 9),
y = c(4, 2, 6, 7, 4),
a = c(2, 5, 7, 3, 3))
df <- df %>% mutate(z = pmin(x, y, a))
result:
> df
x y a z
1 3 4 2 2
2 5 2 5 2
3 6 6 7 6
4 1 7 3 1
5 9 4 3 3
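If the columns are given as a character vector rather than typed out, the same row-wise minimum can be sketched with do.call, which passes each selected column to pmin as a separate argument. The name cols is hypothetical:

```r
df <- data.frame(x = c(3, 5, 6, 1, 9),
                 y = c(4, 2, 6, 7, 4),
                 a = c(2, 5, 7, 3, 3))

cols <- c("x", "y", "a")  # columns to compare

# Equivalent to pmin(df$x, df$y, df$a)
df$z <- do.call(pmin, df[cols])
df
```

This assumes none of the selected columns is named na.rm, since that name would be matched to the na.rm argument of pmin.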
Similar to another answer but using data.table and pmin:
library(data.table)
dt <- data.table(x = c(3,5,6,1,9),
y = c(4,2,6,7,4))
dt[, z:= pmin(x,y)]
dt
# x y z
# 1: 3 4 3
# 2: 5 2 2
# 3: 6 6 6
# 4: 1 7 1
# 5: 9 4 4
The pmin function returns the parallel (element-wise) minima (https://www.rdocumentation.org/packages/mc2d/versions/0.1-17/topics/pmin)
Another option with fifelse in data.table
library(data.table)
setDT(dt)[, z := fifelse(x > y, y, x)]

R: Conditionally replacing values based on column pre-fixes and suffixes

I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
Another attempt which should essentially only be one assignment operation. Using @alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer using @alistaire's example data frame.
We can extend the same answer given below using mapply so that it can handle multiple columns for both foo and bar.
Finding out common columns between two dataframes and sorting them so they are in the same order.
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for :
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing value in foo and replace it with corresponding values from bar.
If you don't mind verbose base R approaches, then you can easily accomplish this using merge() and a careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersect of non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
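For reference, dplyr 1.0.0 and later also provide rows_patch(), which overwrites only the NA values of the first table with values from the second, matched on a key. A minimal sketch on the question's original data, assuming every id in bar also exists in foo (otherwise rows_patch() raises an error by default):

```r
library(dplyr)

foo <- data.frame(id = 1:6, x = c(NA, NA, NA, 1, 3, 8), z = 1:6)
bar <- data.frame(id = 1:2, x = c(10, 9))

# Only NA cells in foo are filled; existing values are left untouched
res <- rows_patch(foo, bar, by = "id")
res
```

All columns shared between the two tables (other than the key) are patched in one call, so no .x/.y suffix handling is needed.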

Filling in values in a data frame in R?

Suppose I have this data frame:
times vals
1 1 2
2 3 4
3 7 6
set up with
foo <- data.frame(times=c(1,3,7), vals=c(2,4,6))
and I want this one:
times vals
1 1 2
2 2 2
3 3 4
4 4 4
5 5 4
6 6 4
7 7 6
That is, I want to fill in all the times from 1 to 7, and fill in the vals from the latest time that is not greater than the given time.
I have some code to do it using dplyr, but it is ugly. Suggestions for better?
library(dplyr)
foo <- merge(foo, data.frame(times=1:max(foo$times)), all.y=TRUE)
foo2 <- merge(foo, foo, by=c(), suffixes=c('', '.1'))
foo2 <- foo2 %>% filter(is.na(vals) & !is.na(vals.1) & times.1 <= times) %>%
group_by(times) %>% arrange(-times.1) %>% mutate(rn = row_number()) %>%
filter(rn == 1) %>%
mutate(vals = vals.1,
rn = NULL,
vals.1 = NULL,
times.1 = NULL)
foo <- merge(foo, foo2, by=c('times'), all.x=TRUE, suffixes=c('', '.2'))
foo <- mutate(foo,
vals = ifelse(is.na(vals), vals.2, vals),
vals.2 = NULL)
This is a standard rolling join problem:
library(data.table)
setDT(foo)[.(1:7), on = 'times', roll = T]
# times vals
#1: 1 2
#2: 2 2
#3: 3 4
#4: 4 4
#5: 5 4
#6: 6 4
#7: 7 6
The above is for devel version (1.9.7+), which is smarter about column matching during joins. For 1.9.6 you still need to specify column name for the inner table:
setDT(foo)[.(times = 1:7), on = 'times', roll = T]
With approx:
data.frame(times = 1:7,
vals = unlist(approx(foo, xout = 1:7, method = "constant", f = 0)[2], use.names = F))
times vals
1 1 2
2 2 2
3 3 4
4 4 4
5 5 4
6 6 4
7 7 6
A dplyr and tidyr option:
library(dplyr)
library(tidyr)
foo %>%
right_join(tibble(times = min(foo$times):max(foo$times))) %>%
arrange(times) %>% # sort first: newer dplyr versions may place unmatched rows last
fill(vals)
# Joining by: "times"
# times vals
# 1 1 2
# 2 2 2
# 3 3 4
# 4 4 4
# 5 5 4
# 6 6 4
# 7 7 6
This is a bit longer and more verbose base R solution:
# calculate the number of repetitions needed for vals variable
reps <- c(diff(foo$times), 1)
# get result
fooDoneIt <- data.frame(times = min(foo$times):max(foo$times),
vals = rep(foo$vals, reps))
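Another base R sketch uses findInterval, which for each requested time returns the index of the latest entry in foo$times that is not greater than it. This assumes foo$times is sorted in increasing order; all_times and filled are hypothetical names:

```r
foo <- data.frame(times = c(1, 3, 7), vals = c(2, 4, 6))

all_times <- seq(min(foo$times), max(foo$times))

# findInterval(t, foo$times) = position of the last known time <= t
filled <- data.frame(times = all_times,
                     vals = foo$vals[findInterval(all_times, foo$times)])
filled
```

This is the same "last observation carried forward" logic as the rolling join, expressed as a single vectorised lookup.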
