Using map on a conditional statement inside a mutate command in R

I have a data frame containing numbers that I would like to bin according to their absolute value.
library(tidyverse)
dat <- data.frame(val = seq(-10, 10))
The following command accomplishes what I would like to do, but the values are hardcoded, which I need to avoid:
dat %>%
  mutate(grp = case_when(abs(val) <= 5 ~ "Grp 1",
                         abs(val) <= 7 ~ "Grp 2",
                         TRUE ~ "Grp 3"))
How can I accomplish the same transformation, but instead using a named vector as the input:
grps <- c("Grp 1" = 5, "Grp 2" = 7)
So that I can add/remove groups as needed, for example, adding in "Grp 3" = 9?

Instead of using map or another approach that works element by element, we can do this in a vectorized way with cut:
grps <- c("Grp 1" = 5, "Grp 2" = 7)
dat %>%
  mutate(
    grp = cut(abs(val), c(-Inf, grps, Inf), labels = c(names(grps), "Grp 3"))
  )
# val grp
# 1 -10 Grp 3
# 2 -9 Grp 3
# 3 -8 Grp 3
# 4 -7 Grp 2
# 5 -6 Grp 2
# 6 -5 Grp 1
# 7 -4 Grp 1
# 8 -3 Grp 1
# 9 -2 Grp 1
# 10 -1 Grp 1
# 11 0 Grp 1
# 12 1 Grp 1
# 13 2 Grp 1
# 14 3 Grp 1
# 15 4 Grp 1
# 16 5 Grp 1
# 17 6 Grp 2
# 18 7 Grp 2
# 19 8 Grp 3
# 20 9 Grp 3
# 21 10 Grp 3
Note that grp is a factor; if you want it to be character, just wrap it in as.character.
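For example, to add "Grp 3" = 9 as the question mentions, just extend the named vector; a sketch (the final catch-all label, built here with paste() as "Grp 4", is my assumption, not part of the original answer):
grps <- c("Grp 1" = 5, "Grp 2" = 7, "Grp 3" = 9)

dat %>%
  mutate(
    grp = cut(abs(val), c(-Inf, grps, Inf),
              labels = c(names(grps), paste("Grp", length(grps) + 1)))
  )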


Difference between rows in long format for R based on other column variables

I have an R dataframe such as:
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))
ID Condition Variable Value
1 1 A X 3
2 1 B X 5
3 2 A X 6
4 2 B X 6
5 1 A Y 3
6 1 B Y 8
7 2 A Y 3
8 2 B Y 6
I want to obtain the difference between each value of Condition (A - B) for each Variable and ID while keeping the long format. That would mean the value must appear every two rows, like this:
ID Condition Variable Value diff_value
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
So far, I managed to do something relatively similar using the dplyr package, but it does not work if I want to maintain the long format:
df %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) - Value)
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 0
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 0
6 1 B Y 8 -5
7 2 A Y 3 0
8 2 B Y 6 -3
You don't have to use lag; you can use diff instead:
df %>%
  group_by(Variable, ID) %>%
  mutate(diff = -diff(Value))
Output:
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
You don't need to create a lag variable; just use Value[Condition == "A"] - Value[Condition == "B"] as below:
df %>%
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"])
# A tibble: 8 x 5
# Groups: ID, Variable [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
This should work:
# Step one: create a new column of df where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condition").
temp.fun = function(x, dta)
{
  # Given a row x of dta, this function selects the value corresponding to the
  # row with the same "ID", the same "Variable" and a different "Condition".
  # Notice that if "Condition" is not binary, we need to generalize this
  # function. Notice also that this function is very specific to your case and
  # is meant to be used within apply().
  # INPUTS:
  # - x, a row of a data frame.
  # - dta, the data frame (df, in your case).
  # OUTPUT:
  # - temp.corresponding, the "Value" you want for each row.

  # Saving information.
  temp.id = as.numeric(x["ID"])
  temp.condition = as.character(x["Condition"])
  temp.variable = as.character(x["Variable"])
  # Index for selecting the row.
  temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
  # Selecting "Value".
  temp.corresponding = dta$Value[temp.row]
  return(temp.corresponding)
}
df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)
# Step two: add/subtract to create the column "diff_value".
# Key: if "Condition" equals "A", we subtract, otherwise we add.
df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]
Notice that this solution just fits the specifics of your problem, and may be neither elegant nor efficient.
I wrote comments in the code to explain how this solution works. The idea is to first write the function temp.fun(), which operates on single rows: for each row we pass, it finds the df$Value of the row satisfying the criteria you asked for (same ID, same Variable, different Condition). Then we use apply() to pass every row to temp.fun(), thus creating a new column in df storing the Value mentioned above.
We are now ready to compute df$diff_value. First, we initialize space, creating a column of NA. Then we perform the operations. Be careful: because of the specifics of the problem, the direction of the subtraction depends on Condition. When Condition equals A we compute df$Value - df$corr_value, and when it equals B we compute df$corr_value - df$Value.
Final warning: if Condition is not binary, this solution must be generalized in order to work.
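If Condition stays binary, a join-based sketch of the same lookup avoids the row-by-row apply(); this is my own addition, not part of the answer above, and just illustrates one vectorized alternative:
library(dplyr)

# For each row, look up the Value of the row with the same ID and Variable but
# the other Condition, then subtract in the required direction. Newer dplyr
# versions may warn that this is a many-to-many join; that is expected, since
# the same-Condition matches are filtered out right afterwards.
df %>%
  left_join(df %>% select(ID, Variable, Condition, other_value = Value),
            by = c("ID", "Variable"),
            suffix = c("", ".other")) %>%
  filter(Condition != Condition.other) %>%
  mutate(diff_value = ifelse(Condition == "A",
                             Value - other_value,
                             other_value - Value)) %>%
  select(ID, Condition, Variable, Value, diff_value)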

Replacement of column values based on a named vector

Consider the following named vector vec and tibble df:
vec <- c("1" = "a", "2" = "b", "3" = "c")
df <- tibble(col = rep(1:3, c(4, 2, 5)))
df
# # A tibble: 11 x 1
# col
# <int>
# 1 1
# 2 1
# 3 1
# 4 1
# 5 2
# 6 2
# 7 3
# 8 3
# 9 3
# 10 3
# 11 3
I would like to replace the values in the col column with the corresponding named values in vec.
I'm looking for a tidyverse approach that doesn't involve converting vec to a tibble.
I tried the following, without success:
df %>%
  mutate(col = map(
    vec,
    ~ str_replace(col, names(.x), .x)
  ))
Expected output:
# A tibble: 11 x 1
col
<chr>
1 a
2 a
3 a
4 a
5 b
6 b
7 c
8 c
9 c
10 c
11 c
You can use col (converted to character) as an index into vec:
df$col1 <- vec[as.character(df$col)]
Or in mutate:
library(dplyr)
df %>% mutate(col1 = vec[as.character(col)])
# col col1
# <int> <chr>
# 1 1 a
# 2 1 a
# 3 1 a
# 4 1 a
# 5 2 b
# 6 2 b
# 7 3 c
# 8 3 c
# 9 3 c
#10 3 c
#11 3 c
We can also use data.table:
library(data.table)
setDT(df)[, col1 := vec[as.character(col)]]
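If you want the purrr route from the question to work, reduce() is one way to thread the replacements through the column; this is my own sketch, not one of the answers above, and it only behaves well here because the values of col are single characters (str_replace would also hit substrings of longer values):
library(tidyverse)

# The original map() call fails because it returns one list element per entry
# of vec instead of applying the replacements cumulatively; reduce() folds all
# the replacements into a single character vector.
df %>%
  mutate(col = reduce(names(vec),
                      function(acc, nm) str_replace(acc, fixed(nm), vec[[nm]]),
                      .init = as.character(col)))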

Replace column value in a data frame based on other columns

I have the following data frame ordered by name and time.
set.seed(100)
df <- data.frame('name' = c(rep('x', 6), rep('y', 4)),
                 'time' = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
                 'score' = c(0, sample(1:10, 3), 0, sample(1:10, 2), 0, sample(1:10, 2)))
> df
name time score
1 x 1 0
2 x 1 4
3 x 2 3
4 x 2 5
5 x 2 0
6 x 3 1
7 y 1 5
8 y 2 0
9 y 3 5
10 y 4 8
In df$score there are zeros followed by an unknown number of actual values, e.g. df[1:4, ], and sometimes df$name changes between two rows with df$score == 0, e.g. df[6:7, ].
I want to change df$time where df$score != 0. Specifically, I want to assign the time value of the closest preceding row with df$score == 0, provided df$name matches.
The following code gives the desired output, but my data have millions of rows, so this solution is very inefficient.
score_0 <- append(which(df$score == 0), dim(df)[1] + 1)
for (i in 1:(length(score_0) - 1)) {
  df$time[score_0[i]:(score_0[i + 1] - 1)] <-
    ifelse(df$name[score_0[i]:(score_0[i + 1] - 1)] == df$name[score_0[i]],
           df$time[score_0[i]],
           df$time[score_0[i]:(score_0[i + 1] - 1)])
}
> df
name time score
1 x 1 0
2 x 1 4
3 x 1 3
4 x 1 5
5 x 2 0
6 x 2 1
7 y 1 5
8 y 2 0
9 y 2 5
10 y 2 8
Here score_0 gives the indices where df$score == 0. We see that df$time[2:4] are now all equal to 1, and that of df$time[6:7] only the first changed, because the second row has df$name == 'y' while the closest preceding row with df$score == 0 has df$name == 'x'. The last two rows have also changed correctly.
You can do it like this:
library(dplyr)
df %>%
  group_by(name) %>%
  mutate(ID = cumsum(score == 0)) %>%
  group_by(name, ID) %>%
  mutate(time = head(time, 1)) %>%
  ungroup() %>%
  select(name, time, score) %>%
  as.data.frame()
# name time score
# 1 x 1 0
# 2 x 1 8
# 3 x 1 10
# 4 x 1 6
# 5 x 2 0
# 6 x 2 5
# 7 y 1 4
# 8 y 2 0
# 9 y 2 5
# 10 y 2 9
Solution using dplyr and data.table:
library(data.table)
library(dplyr)
df %>%
  mutate(chck = score == 0,
         chck_rl = ifelse(score == 0, lead(rleid(chck)), rleid(chck))) %>%
  group_by(name, chck_rl) %>%
  mutate(time = first(time)) %>%
  ungroup() %>%
  select(-chck_rl, -chck)
Output:
# A tibble: 10 x 3
name time score
<chr> <dbl> <int>
1 x 1 0
2 x 1 2
3 x 1 9
4 x 1 7
5 x 2 0
6 x 2 1
7 y 1 8
8 y 2 0
9 y 2 2
10 y 2 3
Solution only using data.table:
library(data.table)
setDT(df)[, chck_rl := ifelse(score == 0, shift(rleid(score == 0), type = "lead"),
                              rleid(score == 0))
          ][, time := first(time), by = .(name, chck_rl)
          ][, chck_rl := NULL]
Output:
name time score
1: x 1 0
2: x 1 2
3: x 1 9
4: x 1 7
5: x 2 0
6: x 2 1
7: y 1 8
8: y 2 0
9: y 2 2
10: y 2 3
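For completeness, the same cumsum-of-zeros idea used in the answers above can also be written in base R with ave(); this is my own sketch and assumes df is still the original data.frame:
# Run id that increments at every zero score, computed within each name...
df$run <- ave(as.integer(df$score == 0), df$name, FUN = cumsum)
# ...then take the first time value within each (name, run) block.
df$time <- ave(df$time, df$name, df$run, FUN = function(x) x[1])
df$run <- NULL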

Divide one column of data frame by condition from another column

I have a data frame with 2 columns like this:
cond val
1 5
2 18
2 18
2 18
3 30
3 30
I want to change values in val in this way:
cond val
1 5 # 5 = 5/1 (only "1" in cond column)
2 6 # 6 = 18/3 (there are three "2" in cond column)
2 6
2 6
3 15 # 15 = 30/2
3 15
How to achieve this?
A base R solution:
# method 1:
mydf$val <- ave(mydf$val, mydf$cond, FUN = function(x) x / length(x))
# method 2:
mydf <- transform(mydf, val = ave(val, cond, FUN = function(x) x / length(x)))
which gives:
cond val
1 1 5
2 2 6
3 2 6
4 2 6
5 3 15
6 3 15
Here's the dplyr way:
library(dplyr)
df %>%
group_by(cond) %>%
mutate(val = val / n())
Which gives:
#Source: local data frame [6 x 2]
#Groups: cond [3]
#
# cond val
# (int) (dbl)
#1 1 5
#2 2 6
#3 2 6
#4 2 6
#5 3 15
#6 3 15
The idea is to divide val by the number of observations in the current group (cond), using n().
This seems like an appropriate situation for data.table:
library(data.table)
(dt <- data.table(df)[,val := val / .N, by = cond][])
# cond val
# 1: 1 5
# 2: 2 6
# 3: 2 6
# 4: 2 6
# 5: 3 15
# 6: 3 15
df <- read.table(
text = "cond val
1 5
2 18
2 18
2 18
3 30
3 30",
header = TRUE,
colClasses = "numeric"
)
In base R:
df$result = df$val / ave(df$cond, df$cond, FUN = length)
ave() splits the cond column by its unique values and takes the length of each subvector, i.e. the denominator you asked for.
Here is a base R answer that will work as long as rows with the same cond value are consecutive (rle() counts runs):
# get length of repeats
temp <- rle(df$cond)
temp <- data.frame(cond=temp$values, lengths=temp$lengths)
# merge onto data.frame
df <- merge(df, temp, by="cond")
df$valNew <- df$val / df$lengths
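Another dplyr option (my own sketch, not from the answers above) is add_count(), which attaches the group size as a column and makes the denominator explicit:
library(dplyr)

df %>%
  add_count(cond, name = "n_cond") %>%   # n_cond = number of rows with this cond value
  mutate(val = val / n_cond) %>%
  select(-n_cond)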

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
                grp = sort(rep(1:5, 4)),
                var1 = rep(c('A', 'B'), 10))
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes the number of distinct var1 values so far, i.e. up to that point in time, for each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer -
The logic is basically to set the first occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum on it -
df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
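The same first-occurrence-plus-cumsum idea can be written more compactly with duplicated(); this is my own condensation of the answer above rather than part of the original:
df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(!duplicated(var1))) %>%   # TRUE only at the first occurrence within grp
  ungroup()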
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base R solution that uses ave to create groups (note: this assumes var1 is a factor) and applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
Update: with your new dataset, an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's essentially ignored.

  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1]
  if(na.include == FALSE) firsts <- firsts[!is.na(x)]

  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)

  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
