New column conditional on whether number is even/uneven and on column - r

Say I have the following df:
id<-rep(1:2,c(7,6))
name<-c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
id name
1 1 a
2 1 t
3 1 signal
4 1 b
5 1 s
6 1 e
7 1 signal
8 2 x
9 2 signal
10 2 r
11 2 s
12 2 t
13 2 signal
I want to add a new column with a character value that depends on whether the id number is even or odd, and on whether the string 'signal' has been reached in the 'name' column.
For odd id numbers, rows up to and including the first 'signal' in the 'name' column should get the character 'T'. After that signal, the character should become 'C'.
For even id numbers, rows up to and including the first 'signal' in the 'name' column should get the character 'C'. After that signal, the character should become 'T'.
For the example given, this should result in the following data.frame:
id, name, condition
1, a, T
1, t, T
1, signal, T
1, b, C
1, s, C
1, e, C
1, signal, C
2, x, C
2, signal, C
2, r, T
2, s, T
2, t, T
2, signal, T
Any help is very much appreciated!

This is not a vectorized solution, but it works for me.
Data preparation: I add a new column to hold the condition.
id<-rep(1:2,c(7,6))
name<-c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
df$condition <- rep("X", nrow(df))
I need to track two states: (i) whether the signal has already been seen; (ii) whether the id has changed since the previous row (from even to odd or the other way round). Then I read the data row by row and update the condition together with these two variables.
signal <- FALSE
last <- 1
for (i in seq_len(nrow(df))) {
  # id changed - reset signal
  if (last != (df[i, "id"] %% 2)) signal <- FALSE
  if (!signal) {
    df[i, "condition"] <- ifelse(df[i, "id"] %% 2, "T", "C")
  } else {
    df[i, "condition"] <- ifelse(df[i, "id"] %% 2, "C", "T")
  }
  # signal is on
  if (df[i, "name"] == "signal") signal <- TRUE
  # save last id (even or odd)
  last <- df[i, "id"] %% 2
}
I hope it helps.
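For comparison, the same rule can be sketched in base R without an explicit loop. This is only a sketch, assuming the switch happens after the first 'signal' within each id (as in the expected output of the question); the names is_signal, after and odd are made up for illustration.

```r
id <- rep(1:2, c(7, 6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name, stringsAsFactors = FALSE)

# 1 where the row is a 'signal' row, 0 otherwise
is_signal <- as.integer(df$name == "signal")

# Per id: number of 'signal' rows strictly before the current row.
# cumsum(s) - s excludes the current row, so the first 'signal' row
# itself still counts as "before the switch".
after <- ave(is_signal, df$id, FUN = function(s) cumsum(s) - s) > 0

odd <- df$id %% 2 == 1

# odd id:  'T' before the switch, 'C' after
# even id: 'C' before the switch, 'T' after
df$condition <- ifelse(xor(odd, after), "T", "C")
df
```

The xor() call encodes the four cases in one expression: the result is 'T' exactly when "odd id" and "after the signal" disagree.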

We could make use of %% with == to create the column
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(ind = (cumsum(lag(name, default = name[1]) == 'signal') > 0) + 1,
         condition = c('T', 'C')[ifelse(id %% 2 > 0, ind,
                       as.integer(factor(ind, levels = rev(unique(ind)))))]) %>%
  select(-ind)
# A tibble: 13 x 3
# Groups: id [2]
# id name condition
# <int> <chr> <chr>
# 1 1 a T
# 2 1 t T
# 3 1 signal T
# 4 1 b C
# 5 1 s C
# 6 1 e C
# 7 1 signal C
# 8 2 x C
# 9 2 signal C
#10 2 r T
#11 2 s T
#12 2 t T
#13 2 signal T
data
df1 <- data.frame(id, name, stringsAsFactors=FALSE)

Another approach could be
id <- rep(1:2,c(7,6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(FirstSignalIndex = min(which(name == 'signal'))) %>%
  mutate(condition = ifelse((id %% 2) == 0,
                            ifelse(row_number() > FirstSignalIndex, 'T', 'C'),
                            ifelse(row_number() > FirstSignalIndex, 'C', 'T')))
Hope this helps!

Related

Replace a string if it is different from the last one and the next one within a vector

I have a large dataset grouped by agent and date, the variable I want to clean is a string type variable. For instance, for the following dataset
agent_id<-c("1","1","1","2","2","2","2")
date<-c("2007-02-01","2007-02-02","2007-02-05","2000-05-01","2000-05-02","2000-05-10","2000-05-20")
office<-c("A","A","B","C","D","C","C")
mydata<-data.frame(agent_id,date,office)
I want to replace an outlier within the office vector if it is different from both the previous observation and the next observation within each agent_id. For instance, for agent_id=1, I don't want to replace anything. For agent_id=2, I want to replace "D" with "C" in office because I observe C both before and after. Is there any way to do that with dplyr? Additionally, it would be better if I could define the cutoff for replacing the outliers, i.e. replace only if I observe the same value n times before and n times after.
You could do:
library(dplyr)
mydata %>%
group_by(agent_id) %>%
mutate(
office = replaceOutliers(x = office, window = 1)
)
Where replaceOutliers is a custom function:
replaceOutliers <- function(x, window = 1, fixed_wind = FALSE) {
  x <- as.character(x)
  n <- length(x)
  # Flag position y when every value in the window before it also occurs in the window after it
  flag_Outl <- c(FALSE,
                 sapply(2:(n - 1), function(y) {
                   before <- x[pmax(1, y - window):pmax(1, y - 1)]
                   after <- x[pmin(n - 1, y + 1):pmin(n - 1, y + window)]
                   length(setdiff(before, after)) == 0
                 }),
                 FALSE)
  if (fixed_wind) {
    # Number of observations actually available before and after each position
    # (uses x, the grouped vector passed in, rather than the global office)
    len_Lag <- sapply(1:n, function(y) length(x[pmax(1, y - window):pmax(1, y - 1)]))
    len_Lead <- sapply(1:n, function(y) length(x[pmin(n, y + 1):pmin(n, y + window)]))
    x <- sapply(1:length(flag_Outl), function(y) ifelse(flag_Outl[y] & len_Lag[y] == window & len_Lead[y] == window, x[y - 1], x[y]))
  } else {
    x <- sapply(1:length(flag_Outl), function(y) ifelse(flag_Outl[y], x[y - 1], x[y]))
  }
  return(x)
}
Output:
# A tibble: 7 x 3
# Groups: agent_id [2]
agent_id date office
<fct> <fct> <chr>
1 1 2007-02-01 A
2 1 2007-02-02 A
3 1 2007-02-05 C
4 2 2000-05-01 C
5 2 2000-05-02 C
6 2 2000-05-10 C
7 2 2000-05-20 C
As you will see I've included a fixed_wind parameter - basically you can decide whether you always need to have the exact number of observations before and after to consider something an outlier.
By default this is FALSE, and when you increase the window to 2 in your example, it'll still replace D, but if you put it to TRUE, it'll keep it as it is (as there is only 1 observation before it in the group):
mydata %>%
group_by(agent_id) %>%
mutate(
office2 = replaceOutliers(x = office, window = 2),
office3 = replaceOutliers(x = office, window = 2, fixed_wind = TRUE)
)
Output:
# A tibble: 7 x 5
# Groups: agent_id [2]
agent_id date office office2 office3
<fct> <fct> <fct> <chr> <chr>
1 1 2007-02-01 A A A
2 1 2007-02-02 A A A
3 1 2007-02-05 C C C
4 2 2000-05-01 C C C
5 2 2000-05-02 D C D
6 2 2000-05-10 C C C
7 2 2000-05-20 C C C
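For the simplest reading of the question (a value is an outlier when its two neighbours agree with each other but not with it), a base R sketch could look like the following. The function name replace_single_outlier and the column office_clean are hypothetical, and this is not the replaceOutliers() function from the answer, only an illustration of the window = 1 idea.

```r
agent_id <- c("1","1","1","2","2","2","2")
date <- c("2007-02-01","2007-02-02","2007-02-05","2000-05-01","2000-05-02","2000-05-10","2000-05-20")
office <- c("A","A","B","C","D","C","C")
mydata <- data.frame(agent_id, date, office, stringsAsFactors = FALSE)

# Hypothetical helper: replace a value when the previous and next values
# are equal to each other and different from the value itself.
replace_single_outlier <- function(x) {
  x <- as.character(x)
  n <- length(x)
  if (n < 3) return(x)          # no interior positions to check
  prev <- c(NA, x[-n])          # value one position earlier
  nxt  <- c(x[-1], NA)          # value one position later
  is_outlier <- !is.na(x) & !is.na(prev) & !is.na(nxt) &
    prev == nxt & x != prev
  x[is_outlier] <- prev[is_outlier]
  x
}

# Apply the helper separately within each agent_id
mydata$office_clean <- ave(mydata$office, mydata$agent_id,
                           FUN = replace_single_outlier)
mydata
```

With the question's data only the "D" of agent 2 changes; the trailing "B" of agent 1 is kept because it has no following observation.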

How to mutate a column given a dataframe that has the conditions?

I have a two-column data frame. The first column is a timestamp and the second column is some value. For example:
library(tidyverse)
set.seed(123)
data_df <- tibble(t = 1:15,
value = sample(letters, 15))
I have a another data frame that specifies the range of timestamps that need to be updated and their corresponding values. For example:
criteria_df <- tibble(start = c(1, 3, 7),
end = c(2, 5, 10),
value = c('a', 'b', 'c')
)
This means that I need to mutate the value column in data_df so that its value from t=1 to t=2 is 'a', from t=3 to t=5 is 'b' and from t=7 to t=10 is 'c'.
What is the recommended way to do this in R?
The only way I could think of is to loop each row in criteria_df and mutate the value column in data_df after filtering the t column, like so:
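library(iterators)
library(foreach)
library(magrittr)  # provides %<>%
# assumed definition: row_iter was not shown in the question
row_iter <- iter(criteria_df, by = "row")
foreach(row = row_iter, .combine = c) %do% {
  seg_start = row$start
  seg_end = row$end
  new_value = row$value
  data_df %<>%
    mutate(value = if_else(between(t, seg_start, seg_end),
                           new_value,
                           value))
  NULL
}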
We can use a two-step base R solution. First, for each t we look up the criteria_df value whose start and end range contains it. Then we replace the data_df value with that matching criteria_df value, or keep it as it is when no range matches.
inds <- sapply(data_df$t, function(x) criteria_df$value[x >= criteria_df$start
& x <= criteria_df$end])
data_df$value <- unlist(ifelse(lengths(inds) > 0, inds, data_df$value))
data_df
# t value
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 b
# 6 6 a
# 7 7 c
# 8 8 c
# 9 9 c
#10 10 c
#11 11 p
#12 12 g
#13 13 r
#14 14 s
#15 15 b
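If readability matters more than brevity, the same replacement can be sketched as a plain loop over the rows of criteria_df. The upper-case letters below are placeholder data chosen so the result is reproducible; they are not the sample() values from the question.

```r
# Placeholder data: upper-case letters stand in for the original values
data_df <- data.frame(t = 1:15, value = LETTERS[1:15], stringsAsFactors = FALSE)
criteria_df <- data.frame(start = c(1, 3, 7),
                          end   = c(2, 5, 10),
                          value = c("a", "b", "c"),
                          stringsAsFactors = FALSE)

# For each criteria row, overwrite the values whose t falls in [start, end]
for (k in seq_len(nrow(criteria_df))) {
  in_range <- data_df$t >= criteria_df$start[k] & data_df$t <= criteria_df$end[k]
  data_df$value[in_range] <- criteria_df$value[k]
}
data_df$value
```

Rows with t = 6 and t = 11 to 15 fall in no range and keep their placeholder value. If ranges overlapped, later rows of criteria_df would win, which may or may not be what you want.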

Removing groups from dataframe if variable has repeated values

I would like to ask if there is a way of removing a group from a dataframe using dplyr (or any other way, for that matter) in the following way. Let's say I have a dataframe in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. In the table above, that would remove group 2, because its values are a,a,b, but not group 3, where the values are a,c,a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test if consecutive repetitions are there or not:
any.consecutive.p <- function(v) {
  # seq_len() yields an empty loop when v has fewer than two elements
  for (i in seq_len(max(length(v) - 1, 0))) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds a consecutive repetition in a vector (v).
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
Split the data frame by the values of Variable.1:
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally go over this data.frame list and test for each data frame, if it contains consecutive duplicates in Variable.2 column.
If found, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
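The same test can also be sketched with rle(), which reports the length of each run of identical values; a group has consecutive duplicates exactly when some run is longer than 1. The name has_run is made up for this illustration.

```r
df <- data.frame(Variable.1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 Variable.2 = unlist(strsplit("abaabaca", "")),
                 stringsAsFactors = FALSE)

# For each group: is there any run of identical consecutive values longer than 1?
has_run <- vapply(split(df$Variable.2, df$Variable.1),
                  function(v) any(rle(v)$lengths > 1),
                  logical(1))

# Keep only the rows whose group has no such run
res <- df[!has_run[as.character(df$Variable.1)], ]
res
```

The named logical vector has_run is indexed by the group label of each row, which spreads the per-group answer back over the rows.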
Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
Benchmark shows data.table is faster
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
library(microbenchmark)
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())

R Count number of times a level occurs in n rows

I have, for example, a vector with 1000 observations and 3 levels (A, B, C). I want to count how many times level A occurs in every 5 rows and produce another vector of the count values, i.e. one with 200 observations. Is anyone able to help? I've found how to count based on another variable, but not based on a number of rows. Thank you!
df <- data.frame(test=factor(sample(c("A","B", "C" ),1000,replace=TRUE)))
head(df, 10)
test
1 A
2 A
3 B
4 C
5 B
6 A
7 C
8 B
9 C
10 C
Here are a couple of options you might find useful:
a) count all entries per 5 rows and return a list:
head(lapply(split(df$test, rep(1:200, each = 5)), table), 2)
# $`1` # <- result for rows 1:5
#
# A B C
# 1 0 4
#
# $`2` # <- result for rows 6:10
#
# A B C
# 3 0 2
b) count all entries per 5 rows and return a matrix:
head(t(sapply(split(df$test, rep(1:200, each = 5)), table)), 2)
# A B C
# 1 1 0 4
# 2 3 0 2
c) count number of As per 5 rows and return a list:
head(lapply(split(df$test == "A", rep(1:200, each = 5)), sum), 2)
# $`1`
# [1] 1
#
# $`2`
# [1] 3
d) count number of As per 5 rows and return a vector:
head(sapply(split(df$test == "A", rep(1:200, each = 5)), sum), 2)
#1 2
#1 3
Each of the results will be 200 entries long / have 200 rows.
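Because every block has the same size, another option is to reshape the comparison into a matrix with 5 rows and sum its columns. This is a sketch that assumes the vector length is an exact multiple of 5; the ten values are the ones printed by head(df, 10) in the question.

```r
# First ten values shown in the question
test <- factor(c("A","A","B","C","B","A","C","B","C","C"))

# matrix() fills column by column, so each column holds one block of 5 rows
is_a <- matrix(test == "A", nrow = 5)

# Number of As per block
counts <- colSums(is_a)
counts
```

For the full 1000-row vector the same two lines give a vector of length 200.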
Here is a solution with dplyr and tidyr
library(dplyr)
library(tidyr)
df %>%
mutate(Set = (seq_along(test) - 1) %/% 5) %>%
group_by(Set, test) %>%
summarise(N = n()) %>%
spread(key = test, value = N, fill = 0)
We can use data.table
library(data.table)
setDT(df)[, .N , .(grp= gl(nrow(df), 5, nrow(df)), test)]
If you prefer dplyr, you could use
c1 <- df %>%
mutate(group = rep(paste0("G", seq(1, 200)), each = 5)) %>%
# count each level
count(group, test)
Note that this method doesn't include levels that have no values in a given group, i.e. there are no rows with a count of 0.
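If the zero counts are needed, base R's table() keeps them, because it tabulates every combination of group and factor level. A small sketch with made-up data in which some levels are absent from a block:

```r
# Made-up data: block 0 has no C, block 1 has only C
test <- factor(c("A","A","B","B","B","C","C","C","C","C"),
               levels = c("A", "B", "C"))

# Block index: 0 for rows 1-5, 1 for rows 6-10
grp <- (seq_along(test) - 1) %/% 5

# Cross-tabulate block against level; absent combinations appear as 0
counts <- table(grp, test)
counts
```

The result has one row per block and one column per level, so as.data.frame(counts) gives a long table that includes the zero rows.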

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

I want to make a grouped filter using dplyr, in a way that within each group only that row is returned which has the minimum value of variable x.
My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.
Here's an example:
df <- data.frame(
A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
y=rnorm(9)
)
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))
As expected, all minima are returned:
Source: local data frame [6 x 3]
Groups: A
A x y
1 A 1 -1.04584335
2 A 1 0.97949399
3 B 2 0.79600971
4 C 5 -0.08655151
5 C 5 0.16649962
6 C 5 -0.05948012
With ddply, I would have approached the task this way:
library(plyr)
ddply(df, .(A), function(z) {
z[z$x == min(z$x), ][1, ]
})
... which works:
A x y
1 A 1 -1.04584335
2 B 2 0.79600971
3 C 5 -0.08655151
Q: Is there a way to approach this in dplyr? (For speed reasons)
Update
With dplyr >= 0.3 you can use the slice function in combination with which.min, which would be my favorite approach for this task:
df %>% group_by(A) %>% slice(which.min(x))
#Source: local data frame [3 x 3]
#Groups: A
#
# A x y
#1 A 1 0.2979772
#2 B 2 -1.1265265
#3 C 5 -1.1952004
Original answer
For the sample data, it is also possible to use two filters one after the other:
group_by(df, A) %>%
filter(x == min(x)) %>%
filter(1:n() == 1)
Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
For what it's worth, here's a data.table solution, to those who may be interested:
# approach with setting keys
library(data.table)
dt <- as.data.table(df)
setkey(dt, A,x)
dt[J(unique(A)), mult="first"]
# without using keys
dt <- as.data.table(df)
dt[dt[, .I[which.min(x)], by=A]$V1]
This can be accomplished by using row_number combined with group_by. row_number handles ties by assigning a rank not only by the value but also by the relative order within the vector. To get the first row of each group with the minimum value of x:
df.g <- group_by(df, A)
filter(df.g, row_number(x) == 1)
For more information see the dplyr vignette on window functions.
dplyr offers the slice_min function, which does the job with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(A) %>%
slice_min(x, with_ties = FALSE)
Output :
# A tibble: 3 x 3
# Groups: A [3]
A x y
<fct> <dbl> <dbl>
1 A 1 0.273
2 B 2 -0.462
3 C 5 1.08
Another way to do it:
set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, b)
dplyr::filter(x, !duplicated(a))
Result:
a b
1 1 -0.8356286
2 2 -2.2146999
Could also be easily adapted for getting the row in each group with maximum value.
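One way the adaptation to the maximum might look, sketched in base R with order() and duplicated(): sort by group and by descending value, then keep the first row of each group. The small data frame is made up so that ties are visible.

```r
# Made-up data with tied maxima in both groups
x <- data.frame(a = rep(1:2, each = 3), b = c(3, 9, 9, 5, 1, 5))

# Sort by group, then by b in descending order; ties keep their original order
x_sorted <- x[order(x$a, -x$b), ]

# The first row of each group is now the (first) row with the maximum b
top <- x_sorted[!duplicated(x_sorted$a), ]
top
```

Dropping the minus sign in order() gives the minimum version again.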
In case you want to filter on the minimum of x and then on the minimum of y, an intuitive way to do it is just to use filtering functions:
> df
A x y
1 A 1 1.856368296
2 A 1 -0.298284187
3 A 2 0.800047796
4 B 2 0.107289719
5 B 3 0.641819999
6 B 4 0.650542284
7 C 5 0.422465687
8 C 5 0.009819306
9 C 5 -0.482082635
df %>% group_by(A) %>%
filter(x == min(x), y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
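This code filters on the minima of x and y at the same time. Note that it only returns a row when the group's overall minimum of y sits in a row that also has the minimum x, which happens to hold for this sample.
You can also chain two filters, which reads more clearly and takes the minimum of y only among the rows that already have the minimum x: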
df %>% group_by(A) %>%
filter(x == min(x)) %>%
filter(y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
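The two forms agree on the sample above, but they are not equivalent in general. A minimal base R sketch with a single made-up group, where the smallest y does not sit in a row with the smallest x:

```r
# Made-up group: the smallest y (0) belongs to a row with x = 2
g <- data.frame(x = c(1, 1, 2), y = c(5, 3, 0))

# Both conditions at once: min(y) is taken over the whole group
single <- g[g$x == min(g$x) & g$y == min(g$y), ]

# Chained: min(y) is taken only among rows that have the minimum x
step1 <- g[g$x == min(g$x), ]
chained <- step1[step1$y == min(step1$y), ]

nrow(single)   # no row satisfies both conditions
chained        # the row with x = 1, y = 3
```

So when the goal is "minimum x first, then minimum y among those", the chained form is the safer choice.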
I like sqldf for its simplicity.
library(sqldf)
sqldf("select A,min(X),y from 'df.g' group by A")
Output:
A min(X) y
1 A 1 -1.4836989
2 B 2 0.3755771
3 C 5 0.9284441
For the sake of completeness, here's the base R answer:
df[with(df, ave(x, A, FUN = \(x) rank(x, ties.method = "first")) == 1), ]
# A x y
#1 A 1 0.1076158
#4 B 2 -1.3909084
#7 C 5 0.3511618

Resources