How to mutate a column given a dataframe that has the conditions? - r

I have a two-column data frame. The first column is a timestamp and the second column is some value. For example:
library(tidyverse)
set.seed(123)
data_df <- tibble(t = 1:15,
value = sample(letters, 15))
I have another data frame that specifies the ranges of timestamps that need to be updated and their corresponding values. For example:
criteria_df <- tibble(start = c(1, 3, 7),
end = c(2, 5, 10),
value = c('a', 'b', 'c')
)
This means that I need to mutate the value column in data_df so that its value from t=1 to t=2 is 'a', from t=3 to t=5 is 'b' and from t=7 to t=10 is 'c'.
What is the recommended way to do this in R?
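The only way I could think of is to loop over each row in criteria_df and mutate the value column in data_df after filtering on the t column, like so:
library(iterators)
library(foreach)
library(magrittr)  # provides the %<>% assignment pipe used below
# iterate over criteria_df one row at a time
row_iter <- iter(criteria_df, by = 'row')
foreach(row = row_iter, .combine = c) %do% {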
seg_start = row$start
seg_end = row$end
new_value = row$value
data_df %<>%
mutate(value = if_else(between(t, seg_start, seg_end),
new_value,
value))
NULL
}

We can use a two-step base R solution: first, for each t, find the criteria_df value whose start-to-end range contains it; then replace the data_df value with that matching criteria_df value, or keep it as it is if there is no match.
inds <- sapply(data_df$t, function(x) criteria_df$value[x >= criteria_df$start
& x <= criteria_df$end])
data_df$value <- unlist(ifelse(lengths(inds) > 0, inds, data_df$value))
data_df
# t value
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 b
# 6 6 a
# 7 7 c
# 8 8 c
# 9 9 c
#10 10 c
#11 11 p
#12 12 g
#13 13 r
#14 14 s
#15 15 b
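The same range-based replacement can also be sketched as a plain loop over the rows of criteria_df. This is only an illustration, not the answer's method: it uses a small deterministic data_df (letters[1:15] instead of the random sample) so the result is easy to check by eye, and it assumes the ranges in criteria_df do not overlap.

```r
# Toy data (assumption: deterministic values instead of sample())
data_df <- data.frame(t = 1:15, value = letters[1:15],
                      stringsAsFactors = FALSE)
criteria_df <- data.frame(start = c(1, 3, 7),
                          end   = c(2, 5, 10),
                          value = c("a", "b", "c"),
                          stringsAsFactors = FALSE)

# For each criteria row, overwrite the values whose t falls in [start, end]
for (i in seq_len(nrow(criteria_df))) {
  in_range <- data_df$t >= criteria_df$start[i] &
              data_df$t <= criteria_df$end[i]
  data_df$value[in_range] <- criteria_df$value[i]
}

data_df$value
```

Rows with t = 6 and t = 11 to 15 are not covered by any range, so they keep their original values.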

Related

Subset with all values for a variable in R

I have a data frame in which one variable takes different values for each value of another variable.
Like this:
[image: DataFrame]
I need the subset of rows for which the value of S contains all the possible values of B. In this example, the subset consists of S = a and S = b:
[image: Subset]
Any ideas? Thanks!
An option would be to group by 'S' and keep only the groups whose 'B' values contain all the unique values of column 'B':
library(dplyr)
un1 <- unique(df1$B)
df1 %>%
group_by(S) %>%
filter(all(un1 %in% B))
# A tibble: 8 x 2
# Groups: S [2]
# S B
# <fct> <dbl>
#1 a 1
#2 a 2
#3 a 3
#4 a 4
#5 d 1
#6 d 2
#7 d 3
#8 d 4
Or with data.table
library(data.table)
setDT(df1)[, .SD[all(un1 %in% B)], S]
Or using base R
df1[with(df1, ave(B, S, FUN = function(x) all(un1 %in% x)) == 1),]
data
df1 <- data.frame(S = rep(letters[1:4], c(4, 3, 2, 4)),
B = c(1:4, c(1, 3, 4), 1:2, 1:4))
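To see what the per-group test all(un1 %in% B) evaluates to, here is a small base R sketch on the same df1. The names complete and keep are just illustrative helpers, not part of the answers above.

```r
df1 <- data.frame(S = rep(letters[1:4], c(4, 3, 2, 4)),
                  B = c(1:4, c(1, 3, 4), 1:2, 1:4))
un1 <- unique(df1$B)

# One TRUE/FALSE per group: does the group contain every value of B?
complete <- sapply(split(df1$B, df1$S), function(b) all(un1 %in% b))

# Names of the groups that pass the test
keep <- names(complete)[complete]

# Keep only the rows belonging to those groups
result <- df1[df1$S %in% keep, ]
```

With this data, only groups a and d contain all four values of B, which matches the tibble output shown above.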

Removing groups from dataframe if variable has repeated values

I would like to ask if there is a way of removing a group from a data frame using dplyr (or any other way, for that matter) in the following way. Let's say I have a data frame in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2, because its values are a,a,b, but not group 3, whose values are a,c,a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
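The "compare each value with the previous one" idea can also be sketched in base R without lag. This is a minimal illustration; has_consecutive_dupes is a hypothetical helper name, and the data frame is typed in by hand from the question's table.

```r
# TRUE if any element equals the element right before it
has_consecutive_dupes <- function(x) {
  if (length(x) < 2) return(FALSE)
  any(head(x, -1) == tail(x, -1))
}

df <- data.frame(variable1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 variable2 = c("a", "b", "a", "a", "b", "a", "c", "a"),
                 stringsAsFactors = FALSE)

# One flag per group, named by the group value
flag <- sapply(split(df$variable2, df$variable1), has_consecutive_dupes)

# Keep the rows whose group has no consecutive duplicates
kept <- df[!flag[as.character(df$variable1)], ]
```

Group 2 (a, a, b) is flagged and dropped, while group 3 (a, c, a) is kept because its repeated value is not consecutive.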
Prepare the data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
Write functions to test whether consecutive repetitions are present:
any.consecutive.p <- function(v) {
# seq_len() gives an empty loop for a length-1 vector, where 1:(length(v) - 1) would count down from 1 to 0
for (i in seq_len(max(length(v) - 1, 0))) {
if (v[i] == v[i + 1]) {
return(TRUE)
}
}
return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds the first consecutive repetition in a vector (v), and FALSE otherwise.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
Split the data frame by the values of Variable.1:
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test whether each one contains consecutive duplicates in its Variable.2 column.
If it does, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
Benchmark shows data.table is faster
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
library(microbenchmark)
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())

New column conditional on whether number is even/uneven and on column

Say I have the following df:
id<-rep(1:2,c(7,6))
name<-c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
id name
1 1 a
2 1 t
3 1 signal
4 1 b
5 1 s
6 1 e
7 1 signal
8 2 x
9 2 signal
10 2 r
11 2 s
12 2 t
13 2 signal
I want to add a new column with a character value that depends on whether the id number is even or odd, and on whether the string 'signal' has been reached in the 'name' column.
For odd id numbers, I would like the character 'T' up to and including the first 'signal' in the 'name' column. After the signal, the character should become 'C'.
For even id numbers, I would like the character 'C' up to and including the first 'signal' in the 'name' column. After the signal, the character should become 'T'.
For the example given, this should result in the following data.frame:
id, name, condition
1, a, T
1, t, T
1, signal, T
1, b, C
1, s, C
1, e, C
1, signal, C
2, x, C
2, signal, C
2, r, T
2, s, T
2, t, T
2, signal, T
Any help is very much appreciated!
This is not a vectorized solution, but it seems to work for me.
Data preparation: I add a new column to hold the condition
id<-rep(1:2,c(7,6))
name<-c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
df$condition <- rep("X", nrow(df))
I need to track two pieces of state: (i) whether the signal has already been seen; (ii) the parity (even or odd) of the last id, so that a change in parity can be detected. Then I read the data row by row and update the condition along with these two variables.
signal <- F
last <- 1
for (i in 1:nrow(df)){
# id changed - reset signal
if (last != (df[i, "id"] %% 2)) signal <- F
if(!signal){
df[i,"condition"] <- ifelse(df[i,"id"] %% 2, "T", "C")
} else {
df[i, "condition"] <- ifelse(df[i,"id"] %% 2, "C", "T")
}
# signal is on
if (df[i, "name"] == "signal") signal <- T
# save last id (even or odd)
last <- df[i, "id"] %% 2
}
I hope it helps.
We could make use of %% with == to create the column
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(ind = (cumsum(lag(name, default = name[1]) == 'signal')>0) + 1,
condition = c('T', 'C')[ifelse(id %%2 > 0, ind,
as.integer(factor(ind, levels = rev(unique(ind)))))] ) %>%
select(-ind)
# A tibble: 13 x 3
# Groups: id [2]
# id name condition
# <int> <chr> <chr>
# 1 1 a T
# 2 1 t T
# 3 1 signal T
# 4 1 b C
# 5 1 s C
# 6 1 e C
# 7 1 signal C
# 8 2 x C
# 9 2 signal C
#10 2 r T
#11 2 s T
#12 2 t T
#13 2 signal T
data
df1 <- data.frame(id, name, stringsAsFactors=FALSE)
Another approach could be
id <- rep(1:2,c(7,6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
library(dplyr)
df %>%
group_by(id) %>%
mutate(FirstSignalIndex=min(which(name=='signal'))) %>%
mutate(condition = ifelse((id %% 2)==0,
ifelse(row_number()>FirstSignalIndex, 'T', 'C'),
ifelse(row_number()>FirstSignalIndex, 'C', 'T')))
Hope this helps!
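For comparison, the same rule can be sketched in vectorized base R. This is only an illustration under the reading that the switch happens after the first 'signal' within each id; after_first and odd are hypothetical helper names.

```r
id <- rep(1:2, c(7, 6))
name <- c('a','t','signal','b','s','e','signal',
          'x','signal','r','s','t','signal')
df <- data.frame(id, name, stringsAsFactors = FALSE)

is_signal <- as.integer(df$name == "signal")

# Within each id: TRUE for rows strictly after the first 'signal'
# (shift the indicator down by one row, then take a running sum)
after_first <- ave(is_signal, df$id,
                   FUN = function(s) cumsum(c(0L, head(s, -1)))) > 0

odd <- df$id %% 2 == 1

# Odd ids start at 'T' and flip to 'C'; even ids start at 'C' and flip to 'T'
df$condition <- ifelse(xor(odd, after_first), "T", "C")
df$condition
```

The xor() captures both cases at once: a row gets 'T' when exactly one of "id is odd" and "the signal has already passed" is true.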

Select rows based on non-directed combinations of columns

I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
apply(groups, 1, function(row) {
paste0(sort(row), collapse = "_")
})
}
group_id <- get_group_id(select(df, -score))
# Perform the computation
df %>%
mutate(groupId = group_id) %>%
group_by(groupId) %>%
slice(which.max(score)) %>%
ungroup() %>%
select(-groupId)
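The order-independent group key can also be built without apply, using pmin and pmax on the two columns. The sketch below uses a tiny hand-made data frame (an assumption, not the question's random data) and requires the group columns to be character vectors rather than factors.

```r
df <- data.frame(group1 = c("a", "b", "a", "c"),
                 group2 = c("b", "a", "c", "a"),
                 score  = c(0.1, 0.6, 0.7, 0.2),
                 stringsAsFactors = FALSE)

# Same key for (a, b) and (b, a): smaller label first, larger label second
pair_key <- paste(pmin(df$group1, df$group2),
                  pmax(df$group1, df$group2), sep = "_")

# Keep the row holding the maximum score within each unordered pair
best <- df[ave(df$score, pair_key, FUN = max) == df$score, ]
```

Here rows 1 and 2 share the key "a_b" and rows 3 and 4 share "a_c", so only the higher-scoring row of each pair survives. If two rows of a pair tie on score, both are kept, unlike slice(which.max(score)), which keeps just one.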

Different sample number for each group in a data set

Given a dataset
key <- rep(c('a', 'b', 'c'), 10)
value <- sample(30)
df <- data.frame(key, value)
I would like a different number of samples for each group in key. A simple piece of dplyr code that obviously does not work for this task is
ns <- c('a'= 1, 'b'= 2, 'c' = 3)
df %>%
mutate(n_s = ns[key]) %>%
group_by(key) %>%
sample_n(n_s)
Is there a solution that looks as simple as that?
You can use mapply with split(df, df$key) and ns as arguments, but note that the names of ns are not used. It's the order of the groups that counts, and if the number of groups doesn't match the length of ns, ns will be recycled.
set.seed(129)
# rbind_all() is no longer available in current dplyr; bind_rows() does the same job
mapply(sample_n, split(df, df$key), ns, SIMPLIFY = FALSE) %>%
bind_rows()
# key value
# (fctr) (int)
#1 a 29
#2 b 14
#3 b 22
#4 c 10
#5 c 24
#6 c 3
You can look at the stratified function from my "splitstackshape" package:
library(splitstackshape)
ns <- c('a'= 1, 'b'= 2, 'c' = 3)
stratified(df, "key", size = ns)
# key value
# 1: a 7
# 2: b 10
# 3: b 13
# 4: c 4
# 5: c 20
# 6: c 9
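If the sample sizes should be matched to groups by name rather than by position, a base R sketch could look like the following. It is only an illustration; the seed value and the names groups, sampled and result are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
key <- rep(c('a', 'b', 'c'), 10)
value <- sample(30)
df <- data.frame(key, value, stringsAsFactors = FALSE)
ns <- c('a' = 1, 'b' = 2, 'c' = 3)

groups <- split(df, df$key)

# Look up each group's sample size in ns by the group's name
sampled <- lapply(names(groups), function(k) {
  g <- groups[[k]]
  g[sample(nrow(g), ns[[k]]), , drop = FALSE]
})

result <- do.call(rbind, sampled)
table(result$key)
```

Because ns is indexed with the group name k, reordering ns does not change which group gets which sample size.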
