Creating an ifelse statement that is conditional by factor level - r

I'm new to R, so I'm sorry if this is obvious, but I've been stuck on this for a while and haven't been able to find an answer so far.
Data frame:
b c id e
0 1 45 5
1 0 45 7
0 1 48 5
1 0 46 7
Desired result:
b c id e f
0 1 45 5 1
1 0 45 7 1
0 1 48 5 0
1 0 46 7 0
What I'm trying to do: create column f based on the values of b and c across all rows that share the same id. Column e, along with other columns omitted here, still matters to me, so I can't collapse the data by id.
The closest I've gotten:
library(dplyr)
df2 <- df %>%
group_by(id) %>%
mutate(ifelse(b == 1 & c == 1, 1, 0))
I think my problem is that I'm not using dplyr::group_by correctly, so I'm essentially just running a plain row-wise ifelse.

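We don't need an ifelse here. Within each id, check whether any row has b equal to 1 and any row has c equal to 1, then convert the logical result to an integer:
df %>%
  group_by(id) %>%
  mutate(f = as.integer(any(b == 1) & any(c == 1)))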
# A tibble: 4 x 5
# Groups: id [3]
# b c id e f
# <int> <int> <int> <int> <int>
#1 0 1 45 5 1
#2 1 0 45 7 1
#3 0 1 48 5 0
#4 1 0 46 7 0
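As a cross-check that does not need dplyr, the same per-id logic can be sketched in base R with ave(). The data frame below is typed in from the question, and the helper name has_one is made up for this illustration; it is not part of the original answer.

```r
# Data typed in from the question
df <- data.frame(
  b  = c(0L, 1L, 0L, 1L),
  c  = c(1L, 0L, 1L, 0L),
  id = c(45L, 45L, 48L, 46L),
  e  = c(5L, 7L, 5L, 7L)
)

# Hypothetical helper: 1 if any value in the group equals 1, else 0
has_one <- function(x) as.integer(any(x == 1))

# ave() applies the helper within each id and recycles the single
# result back onto every row of that id
df$f <- ave(df$b, df$id, FUN = has_one) * ave(df$c, df$id, FUN = has_one)
df$f
```

Multiplying the two 0/1 vectors plays the role of the logical AND, so f is 1 only for ids that have a 1 somewhere in b and a 1 somewhere in c.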

Related

Find discontinuities in observational data with R

Data
id<-c("a","a","a","a","a","a","b","b","b","b","b","b")
d<-c(1,2,3,90,98,100000,4,6,7,8,23,45)
df<-data.frame(id,d)
I want to detect observational discontinuities of each "id".
My expected result is a way to detect discontinuities without using means or medians as a reference.
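You can check whether the difference between a row and the previous one within each group is different from 1. The first row of each group has no predecessor, so it is set to FALSE, and the unary + converts the logical result to 0/1:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(dis = +(c(FALSE, diff(d) != 1)))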
# A tibble: 12 × 3
# Groups: id [2]
id d dis
<chr> <dbl> <int>
1 a 1 0
2 a 2 0
3 a 3 0
4 a 90 1
5 a 98 1
6 a 100000 1
7 b 4 0
8 b 6 1
9 b 7 0
10 b 8 0
11 b 23 1
12 b 45 1
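If a step of exactly 1 is too strict, the same idea can be sketched in base R with an adjustable threshold. The function name flag_gaps and its max_gap parameter are hypothetical, and note the assumption that a "discontinuity" means a jump larger than max_gap, which is slightly different from the `!= 1` test above (it would not flag a step smaller than 1).

```r
id <- c(rep("a", 6), rep("b", 6))
d  <- c(1, 2, 3, 90, 98, 100000, 4, 6, 7, 8, 23, 45)
df <- data.frame(id, d)

# Hypothetical helper: flag a row when the jump from the previous row
# exceeds max_gap; the first row of a group is never flagged
flag_gaps <- function(x, max_gap = 1) {
  as.integer(c(FALSE, diff(x) > max_gap))
}

# Apply the helper separately within each id
df$dis <- ave(df$d, df$id, FUN = flag_gaps)
df
```

To use a different threshold, wrap the call, for example `FUN = function(x) flag_gaps(x, max_gap = 10)`.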

which.max() by groups but output in the dataframe

There is this data frame given by (an example):
df <- read.table(header = TRUE, text = 'Group Utility
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
I want to use any command (I have been trying iterations of which.max() to no avail) to get an additional column in the dataset, say Choice, that indicates whether Utility is the maximum within the group given by Group. The table would look like:
Group Utility Choice
A 12 1
A 10 0
B 3 0
B 5 0
B 6 1
C 1 1
D 3 0
D 4 1
You can try this with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Choice = ifelse(Utility == max(Utility), 1, 0)) %>%
ungroup()
Output
# A tibble: 8 x 3
Group Utility Choice
<fct> <int> <dbl>
1 A 12 1
2 A 10 0
3 B 3 0
4 B 5 0
5 B 6 1
6 C 1 1
7 D 3 0
8 D 4 1
A one-liner base R solution.
df$Choice <- with(df, ave(Utility, Group, FUN = function(x) +(x == max(x))))
df
# Group Utility Choice
#1 A 12 1
#2 A 10 0
#3 B 3 0
#4 B 5 0
#5 B 6 1
#6 C 1 1
#7 D 3 0
#8 D 4 1
An option with data.table, adding the column by reference:
library(data.table)
setDT(df)[, Choice := +(Utility == max(Utility)), by = Group]
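One detail worth checking is how ties are handled: `Utility == max(Utility)` flags every row that reaches the group maximum, while which.max() returns only the first such position. The sketch below uses a small made-up data frame with a tie in group A to show the difference; the column names all_max and first_max are for illustration only.

```r
# Made-up data with a tie in group A
df <- data.frame(Group = c("A", "A", "B", "B"),
                 Utility = c(5, 5, 1, 2))

# Flags every row equal to the group maximum (both tied rows in A)
df$all_max <- ave(df$Utility, df$Group,
                  FUN = function(x) +(x == max(x)))

# Flags only the first row that reaches the group maximum
df$first_max <- ave(df$Utility, df$Group,
                    FUN = function(x) +(seq_along(x) == which.max(x)))
df
```

Which behaviour is wanted depends on whether exactly one choice per group is required.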

Identifying duplicate within groups by latest date

I currently have a data frame that looks like this:
ID Value Date
1 1 A 1/1/2018
2 1 B 2/3/1988
3 1 B 6/3/1994
4 2 A 12/6/1999
5 2 B 24/12/1957
6 3 A 9/8/1968
7 3 B 20/9/2016
8 3 C 15/4/1993
9 3 C 9/8/1994
10 4 A 8/8/1988
11 4 C 6/4/2001
Within each ID I would like to identify a row where there is a duplicate Value. The Value that I would like to identify is the duplicate with the most recent Date.
The resulting data frame should look like this:
ID Value Date mostRecentDuplicate
1 1 A 1/1/2018 0
2 1 B 2/3/1988 0
3 1 B 6/3/1994 1
4 2 A 12/6/1999 0
5 2 B 24/12/1957 0
6 3 A 9/8/1968 0
7 3 B 20/9/2016 0
8 3 C 15/4/1993 0
9 3 C 9/8/1994 1
10 4 A 8/8/1988 0
11 4 C 6/4/2001 0
How do I go about doing this?
Using dplyr, we can first convert Date to an actual Date value, then group by ID and Value and assign 1 to the row holding the maximum Date in every group that has more than one row. All other rows get 0.
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%d/%m/%Y")) %>%
group_by(ID, Value) %>%
mutate(mostRecentDuplicate = +(n() > 1 & row_number() == which.max(Date))) %>%
ungroup()
# A tibble: 11 x 4
# ID Value Date mostRecentDuplicate
# <int> <fct> <date> <int>
# 1 1 A 2018-01-01 0
# 2 1 B 1988-03-02 0
# 3 1 B 1994-03-06 1
# 4 2 A 1999-06-12 0
# 5 2 B 1957-12-24 0
# 6 3 A 1968-08-09 0
# 7 3 B 2016-09-20 0
# 8 3 C 1993-04-15 0
# 9 3 C 1994-08-09 1
#10 4 A 1988-08-08 0
#11 4 C 2001-04-06 0
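The same rule can be sketched in base R, which may be handy for checking the result without dplyr. The data frame is typed in from the question, and the helper name latest_dup is hypothetical.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  Value = c("A", "B", "B", "A", "B", "A", "B", "C", "C", "A", "C"),
  Date  = c("1/1/2018", "2/3/1988", "6/3/1994", "12/6/1999", "24/12/1957",
            "9/8/1968", "20/9/2016", "15/4/1993", "9/8/1994", "8/8/1988",
            "6/4/2001")
)
# Dates are day/month/year
df$Date <- as.Date(df$Date, "%d/%m/%Y")

# Hypothetical helper: within one ID/Value group, mark the latest row,
# but only if the group has more than one row
latest_dup <- function(x) {
  as.integer(length(x) > 1 & seq_along(x) == which.max(x))
}

# ave() needs a plain numeric vector, so the dates are passed as numbers
df$mostRecentDuplicate <- ave(as.numeric(df$Date), df$ID, df$Value,
                              FUN = latest_dup)
df
```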

Fill in rows based on condition for grouped data using tidyr

I have the following data frame, in which I am trying to create the 'index2' field conditional on the 'index1' field:
This data represents a succession of behaviours for different individual penguins (ID). Dives have already been indexed into dive bouts (index1). I am trying to index groups of behaviour (index2) that cover each dive bout plus all other behaviours in between its dives. I would appreciate a tidyverse solution grouping by ID.
Reproducible:
df<-data.frame(ID=c(rep('A',9),rep('B',14)),behaviour=c('surface','dive','dive','dive','surface','commute','surface','dive', 'dive','dive','dive','surface','dive','dive','commute','commute','surface','dive','dive','surface','dive','dive','surface'),index1=c(0,1,1,1,0,0,0,1,1,2,2,0,3,3,0,0,0,3,3,0,3,3,0),index2=c(0,1,1,1,1,1,1,1,1,2,2,0,3,3,3,3,3,3,3,3,3,3,0))
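We could create a function with rle that works on the runs of equal values and carries the running maximum of the bout index across the zero runs: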
frle <- function(x) {
rl <- rle(x)
i1 <- cummax(rl$values)
i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
i1[i2] <- 0
as.integer(inverse.rle(within.list(rl, values <- i1)))
}
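After grouping by 'ID', apply the function to 'Index1' inside mutate to get the expected column. Note that df1 here has capitalised column names (Index1, Index2), unlike the reproducible df posted in the question: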
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Index2New = frle(Index1))
# A tibble: 19 x 5
# Groups: ID [2]
# ID behaviour Index1 Index2 Index2New
# <chr> <chr> <int> <int> <int>
# 1 A surface 0 0 0
# 2 A dive 1 1 1
# 3 A dive 1 1 1
# 4 A dive 1 1 1
# 5 A surface 0 1 1
# 6 A commute 0 1 1
# 7 A surface 0 1 1
# 8 A dive 1 1 1
# 9 A dive 1 1 1
#10 B dive 2 2 2
#11 B dive 2 2 2
#12 B surface 0 0 0
#13 B dive 3 3 3
#14 B dive 3 3 3
#15 B commute 0 3 3
#16 B commute 0 3 3
#17 B surface 0 3 3
#18 B dive 3 3 3
#19 B dive 3 3 3
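To see what frle does on a single group, it can be run directly on the index1 values from the reproducible data in the question (typed in below as x_a and x_b, names chosen for this illustration). This is only a trace of the function as posted, not a modified version.

```r
frle <- function(x) {
  rl <- rle(x)                                  # runs of equal values
  i1 <- cummax(rl$values)                       # running maximum per run
  i2 <- c(i1[-1] != i1[-length(i1)], FALSE)     # runs followed by a new maximum
  i1[i2] <- 0                                   # reset those runs to 0
  as.integer(inverse.rle(within.list(rl, values <- i1)))
}

# index1 for ID A and ID B, copied from the question's data.frame() call
x_a <- c(0, 1, 1, 1, 0, 0, 0, 1, 1)
x_b <- c(2, 2, 0, 3, 3, 0, 0, 0, 3, 3, 0, 3, 3, 0)

frle(x_a)
frle(x_b)
```

For x_a the zeros between the dives are filled with 1, as wanted. For x_b the final 0 comes back as 3, whereas index2 in the question ends with 0, so a trailing non-dive run may need separate handling if that matters.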

Grouping over 2 columns and use values of subsequent groups in calculations

Suppose I have a df with 3 columns, group1, group2 & variable
set.seed(1)
group1 = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3))
group2 = c("A","B","C","D","B","C","C","B","C","A","B","D")
variable = c(as.integer(rnorm(12,2)**3))
df=data.frame(group1, group2, variable)
I added the column 'min1', which states whether the group2 value of a row is also present in the previous group1 (group1 - 1), and 'plus1', which does the same for the next group1 (group1 + 1). The full data frame is below:
group1 group2 variable min1 plus1
1 1 A 3 0 0
2 1 B 11 0 1
3 1 C 2 0 1
4 2 D 47 0 0
5 2 B 13 1 1
6 2 C 2 1 1
7 3 C 16 1 0
8 3 B 21 1 1
9 3 C 18 1 0
10 4 A 5 0 0
11 4 B 44 1 0
12 4 D 14 0 0
Now I want to do calculations such as max() and sum() (but also some more exotic ones) on the variables but not just on all values within their own group1 & group2 combination, but including the values of the group before (or after it). The min1 example is shown below.
group1_min1 group2_min1 sum_min1 max_min1
1 2 B 24 13
2 2 C 4 2
3 3 C 36 18
4 3 B 34 21
5 4 B 65 44
Note that for group1_min1(3),group2_min1(C) three values are used: rows 6,7&9 (2,16&18).
I tried using group_by and summarize within dplyr, something like:
group_by(group1, group2) %>%
summarize_each(funs(sum, max))
EDIT:
I found a solution to add the sum to the original df:
sum_min1 = c()
j=0
for (j in 1:(length(df$group1))){
if (df[j,"min1"] == 0){sum_min1 = c(sum_min1,0)} else {
sum_min1 = c(sum_min1,(sum(df[which((df[,"group1"] == df[j,"group1"] | df[,"group1"] == (df[j,"group1"]-1)) & df[,"group2"]==(df[j,"group2"])),"variable"])))
}
}
df = cbind(df,sum_min1)
This delivers the output:
group1 group2 variable min1 plus1 sum_min1
1 1 A 3 0 0 0
2 1 B 11 0 1 0
3 1 C 2 0 1 0
4 2 D 47 0 0 0
5 2 B 13 1 1 24
6 2 C 2 1 1 4
7 3 C 16 1 0 36
8 3 B 21 1 1 34
9 3 C 18 1 0 36
10 4 A 5 0 0 0
11 4 B 44 1 0 65
12 4 D 14 0 0 0
However, this seems very crude and may be slow on big data sets. In reality there are also multiple variables and multiple functions, and some of the user-defined functions I want to apply include a for loop over all the values.
Is there a more elegant way to do this?
Sorry for anything I do wrong, I am new to R and StackOverflow and not a native speaker.
# Data
set.seed(1)
group1 = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3))
group2 = c("A","B","C","D","B","C","C","B","C","A","B","D")
variable = c(as.integer(rnorm(12,2)**3))
df=data.frame(group1, group2, variable)
For the first part (the min1 and plus1 indicator columns):
df$min1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "group1"] == 1){0} else {
max(df[x, "group2"] %in% df[df$group1 == df[x,"group1"] - 1,"group2"])}
})
df$plus1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "group1"] == max(df$group1)){0} else {
max(df[x, "group2"] %in% df[df$group1 == df[x,"group1"] + 1,"group2"])}
})
For the second part (the sum over the current and the previous group1):
df$sum_min1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "group1"] == 1){0}else{
sum(df[df$group1 == df[x,"group1"] &
df$group2 == df[x,"group2"],"variable"],
df[df$group1 == df[x,"group1"] - 1 &
df$group2 == df[x,"group2"],"variable"])}
})
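For larger data, a row-by-row sapply can be avoided by computing one total per group1/group2 combination and looking up the previous group's total by key. The following is a minimal base R sketch of that idea, not the answer's method: the variable values are copied from the question's 12-row table rather than regenerated, and the names key, prev_key, tot, own and prev are made up for this illustration. Unlike the sapply version above, it returns 0 wherever min1 is 0, which is what the question's output table shows.

```r
df <- data.frame(
  group1   = rep(1:4, each = 3),
  group2   = c("A", "B", "C", "D", "B", "C", "C", "B", "C", "A", "B", "D"),
  variable = c(3, 11, 2, 47, 13, 2, 16, 21, 18, 5, 44, 14)
)

# One text key per row for its own combination and for the same
# group2 value in the previous group1
key      <- paste(df$group1, df$group2)
prev_key <- paste(df$group1 - 1, df$group2)

# min1: does the previous group1 contain this row's group2 value?
df$min1 <- as.integer(prev_key %in% key)

# Total of variable per combination, as a named numeric vector
tot  <- sapply(split(df$variable, key), sum)
own  <- unname(tot[key])
prev <- unname(tot[prev_key])   # NA where the previous combination is absent

# Sum over current plus previous group, 0 where there is no match
df$sum_min1 <- ifelse(df$min1 == 1, own + prev, 0)
df
```

Other summaries follow the same lookup pattern but need their own way of combining the two groups, for example pmax() of the two per-combination maxima for max_min1.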
