Fill in rows based on condition for grouped data using tidyr - r

I have the following dataframe, in which I am trying to create the 'index2' field conditional on the 'index1' field:
Basically, this data represents a succession of behaviours for individual penguins (ID), and I am trying to index groups of behaviour ('index2') that incorporate all other behaviours between (and including) dives, which have already been indexed into dive bouts ('index1'). I would appreciate a tidyverse solution grouping by ID.
Reproducible:
df <- data.frame(
  ID = c(rep('A', 9), rep('B', 14)),
  behaviour = c('surface','dive','dive','dive','surface','commute','surface','dive','dive',
                'dive','dive','surface','dive','dive','commute','commute','surface','dive',
                'dive','surface','dive','dive','surface'),
  index1 = c(0,1,1,1,0,0,0,1,1, 2,2,0,3,3,0,0,0,3,3,0,3,3,0),
  index2 = c(0,1,1,1,1,1,1,1,1, 2,2,0,3,3,3,3,3,3,3,3,3,3,0)
)

We could create a helper function with rle:
frle <- function(x) {
  rl <- rle(x)
  # carry the highest bout index seen so far across runs
  i1 <- cummax(rl$values)
  # flag each run that sits just before the carried index increases,
  # i.e. the gap separating two different dive bouts
  i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
  i1[i2] <- 0
  # within.list is called directly because objects of class "rle"
  # are plain lists underneath
  as.integer(inverse.rle(within.list(rl, values <- i1)))
}
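To see the pieces at work, here is a quick trace on ID A's 'index1' values (intermediate results shown as comments):
x <- c(0, 1, 1, 1, 0, 0, 0, 1, 1)
rl <- rle(x)
rl$values
# [1] 0 1 0 1
cummax(rl$values)
# [1] 0 1 1 1
frle(x)
# [1] 0 1 1 1 1 1 1 1 1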
After grouping by 'ID', apply it to 'index1' inside mutate to get the expected column:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Index2New = frle(index1))
# A tibble: 19 x 5
# Groups: ID [2]
# ID behaviour index1 index2 Index2New
# <chr> <chr> <int> <int> <int>
# 1 A surface 0 0 0
# 2 A dive 1 1 1
# 3 A dive 1 1 1
# 4 A dive 1 1 1
# 5 A surface 0 1 1
# 6 A commute 0 1 1
# 7 A surface 0 1 1
# 8 A dive 1 1 1
# 9 A dive 1 1 1
#10 B dive 2 2 2
#11 B dive 2 2 2
#12 B surface 0 0 0
#13 B dive 3 3 3
#14 B dive 3 3 3
#15 B commute 0 3 3
#16 B commute 0 3 3
#17 B surface 0 3 3
#18 B dive 3 3 3
#19 B dive 3 3 3

Related

R Extract nested cumulatives from dataframe

Given a dataframe:
period <- c(1,1,1,3,3,3,3)
item <- c("a","b","b","a","b","c","c")
quantity <- c(1,3,2,4,5,3,7)
df <- data.frame(period, item, quantity)
df
period item quantity
1 1 a 1
2 1 b 3
3 1 b 2
4 3 a 4
5 3 b 5
6 3 c 3
7 3 c 7
I want to obtain
period item cumulative
1 a 1
1 b 5
1 c 0
2 a 0
2 b 0
2 c 0
3 a 4
3 b 5
3 c 10
I am not sure of an efficient way to do this in R. The file has approximately 500k records and 10,000 different items.
Thanks!!
You can use complete to create the missing combinations of period and item, and then sum the quantity values for each combination.
library(dplyr)
library(tidyr)
df %>%
  complete(period = min(period):max(period), item) %>%
  group_by(period, item) %>%
  summarise(quantity = sum(quantity, na.rm = TRUE)) %>%
  ungroup()
# period item quantity
# <dbl> <chr> <dbl>
#1 1 a 1
#2 1 b 5
#3 1 c 0
#4 2 a 0
#5 2 b 0
#6 2 c 0
#7 3 a 4
#8 3 b 5
#9 3 c 10
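Given the sizes mentioned (~500k records, 10,000 items), a data.table version of the same idea may be worth benchmarking. This is only a sketch, assuming df as defined above: build the full grid with CJ, right-join, then aggregate.
library(data.table)
dt <- as.data.table(df)
# full grid of every period/item combination
grid <- CJ(period = min(dt$period):max(dt$period), item = unique(dt$item))
# right join keeps all grid rows; unmatched combinations get NA quantity
out <- dt[grid, on = .(period, item)]
out[, .(quantity = sum(quantity, na.rm = TRUE)), by = .(period, item)]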

Add group to data frame using monotonically increasing numbers

I've got a data frame that looks like this (the real data is much larger and more complicated):
df.test = data.frame(
  sample = c("a","a","a","a","a","a","b","b"),
  day = c(0,1,2,0,1,3,0,2),
  value = rnorm(8)
)
sample day value
1 a 0 -1.11182146
2 a 1 0.65679637
3 a 2 0.03652325
4 a 0 -0.95351736
5 a 1 0.16094840
6 a 3 0.06829702
7 b 0 0.33705141
8 b 2 0.24579603
The data frame is organized by experiments, but the experiment ids are missing. The same sample can be used in different experiments, but I know that within a single experiment the days start at 0 and are monotonically increasing.
How can I add experiment ids as numbers {1, 2, ...}?
So the resulting data frame would be:
sample day value exp
1 a 0 -1.11182146 1
2 a 1 0.65679637 1
3 a 2 0.03652325 1
4 a 0 -0.95351736 2
5 a 1 0.16094840 2
6 a 3 0.06829702 2
7 b 0 0.33705141 3
8 b 2 0.24579603 3
I would appreciate any help, especially with a tidy/dplyr solution.
As indicated in the comments, you can do this with cumsum:
df.test %>% mutate(exp = cumsum(day == 0))
## sample day value exp
## 1 a 0 0.09300394 1
## 2 a 1 0.85322925 1
## 3 a 2 -0.25167313 1
## 4 a 0 -0.14811243 2
## 5 a 1 -1.86789014 2
## 6 a 3 0.45983987 2
## 7 b 0 2.81199150 3
## 8 b 2 0.31951634 3
Alternatively, you can use diff:
library(dplyr)
df.test %>% mutate(exp = cumsum(c(TRUE, diff(day) < 0)))
# sample day value exp
#1 a 0 -0.3382010 1
#2 a 1 2.2241041 1
#3 a 2 2.2202612 1
#4 a 0 1.0359635 2
#5 a 1 0.4134727 2
#6 a 3 1.0144439 2
#7 b 0 -0.1292119 3
#8 b 2 -0.1191505 3
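The two answers differ if an experiment does not begin at day 0: cumsum(day == 0) only counts explicit zeros, while the diff version detects any decrease in day. A small made-up example (df.alt is hypothetical):
# hypothetical data: the second run of sample "a" starts at day 1, not 0
df.alt <- data.frame(sample = "a", day = c(0, 1, 2, 1, 2), value = rnorm(5))
df.alt %>%
  mutate(exp_zero = cumsum(day == 0),                # 1 1 1 1 1 -- misses the reset
         exp_diff = cumsum(c(TRUE, diff(day) < 0)))  # 1 1 1 2 2 -- catches it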

How to create "blocks" based on column values by group?

I have a data frame that looks like this:
data = data.frame(
  userID = c("a","a","a","a","a","a","a","a","a","b","b"),
  diff = c(1,1,1,81,1,1,1,2,1,1,1)
)
Eventually, I want to get something like this:
data = data.frame(
  userID = c("a","a","a","a","a","a","a","a","a","b","b"),
  diff = c(1,1,1,81,1,1,1,2,1,1,1),
  block = c(1,1,1,2,2,2,2,3,3,1,1)
)
So basically, what I want to do is create a new block every time the value in the diff column is greater than 1, and I want to do this by group, i.e. userID.
Right now I am thinking about using LOCF or writing a loop, but neither seems to work. Any advice? Thanks!
In base R you can use ave like this:
data$block <- ave(data$diff > 1, data$userID, FUN = cumsum) + 1
# userID diff block
#1 a 1 1
#2 a 1 1
#3 a 1 1
#4 a 81 2
#5 a 1 2
#6 a 1 2
#7 a 1 2
#8 a 2 3
#9 a 1 3
#10 b 1 1
#11 b 1 1
An option would be to group by 'userID' and then take the cumulative sum of the logical expression (diff > 1):
library(dplyr)
data %>%
  group_by(userID) %>%
  mutate(block = 1 + cumsum(diff > 1))
# A tibble: 11 x 3
# Groups: userID [2]
# userID diff block
# <fct> <dbl> <dbl>
# 1 a 1 1
# 2 a 1 1
# 3 a 1 1
# 4 a 81 2
# 5 a 1 2
# 6 a 1 2
# 7 a 1 2
# 8 a 2 3
# 9 a 1 3
#10 b 1 1
#11 b 1 1
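For completeness, a data.table sketch of the same cumulative-sum idea (note that diff here is the column, not the base function):
library(data.table)
# add the block column by reference, restarting the count per userID
setDT(data)[, block := cumsum(diff > 1) + 1L, by = userID]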

How to copy value of a cell to other rows based on the value of other two columns?

I have a data frame that looks like this:
zz = "Sub Item Answer
1 A 1 NA
2 A 1 0
3 A 2 NA
4 A 2 1
5 B 1 NA
6 B 1 1
7 B 2 NA
8 B 2 0"
Data = read.table(text=zz, header = TRUE)
The desired result is to have the value under "Answer" (0 or 1) copied to the NA cells of the same Sub and the same Item. For instance, the Answer = 0 in row 2 should be copied to the Answer cell in row 1, but not to any other rows. The output should look like this:
zz2 = "Sub Item Answer
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0"
Data2 = read.table(text=zz2, header = TRUE)
How should I do this? I noticed there are some previous questions on copying a cell to other cells, such as replacing NA values with the group value, but those were based on the value of one column only. Also, this question is slightly different from Replace missing values (NA) with most recent non-NA by group, which aims to copy the most recent numeric value to NAs.
Thanks for all your answers!
You can use zoo::na.locf.
library(tidyverse)
library(zoo)
Data %>%
  group_by(Sub, Item) %>%
  mutate(Answer = na.locf(Answer))
# A tibble: 8 x 3
# Groups: Sub, Item [4]
# Sub Item Answer
# <fct> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
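One caveat: na.locf drops the leading NA by default, so the call above returns a length-one vector per group that dplyr then recycles. Since the NA always precedes the value here, filling backwards with fromLast is more explicit:
Data %>%
  group_by(Sub, Item) %>%
  mutate(Answer = na.locf(Answer, fromLast = TRUE))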
Thanks to @steveb, here is an alternative without having to rely on zoo::na.locf:
Data %>%
  group_by(Sub, Item) %>%
  mutate(Answer = Answer[!is.na(Answer)])
library(tidyverse)
Data %>%
  group_by(Sub, Item) %>%
  fill(Answer, .direction = "up")
# A tibble: 8 x 3
# Groups: Sub, Item [4]
Sub Item Answer
<fctr> <int> <int>
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0
Though it was not the OP's intention, I thought of situations where there are only NA values for a given Sub, Item group, or where a group has multiple non-NA values.
One way to handle such situations is to take the max/min of the group and treat it as NA when it comes out infinite (which happens when a group is all NA):
A solution could be:
library(dplyr)
Data %>%
  group_by(Sub, Item) %>%
  mutate(Answer = ifelse(max(Answer, na.rm = TRUE) == -Inf, NA,
                         as.integer(max(Answer, na.rm = TRUE))))
#Result
# Sub Item Answer
# <fctr> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
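On a recent tidyr (1.0.0 or later, which added the two-way fill directions), a single fill() call covers NAs on either side of the known value within each group:
library(dplyr)
library(tidyr)
Data %>%
  group_by(Sub, Item) %>%
  fill(Answer, .direction = "updown") %>%
  ungroup()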

Create equal length vectors from time series based upon factor in R

I have a data frame that is something like this:
time type count
1 -2 a 1
2 -1 a 4
3 0 a 6
4 1 a 2
5 2 a 5
6 0 b 3
7 1 b 7
8 2 b 2
I want to create a new data frame that takes type 'b' and creates the full time series by filling in zeroes for count. It should look like this:
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
I can certainly subset(df, type == 'b') and then hack the beginning and rbind, but I want it to be more dynamic in case the time vector changes.
We can use complete from tidyr to get the full 'time' for all the unique values of 'type', and then filter for the value of interest in 'type'.
library(tidyr)
library(dplyr)
val <- "b"
df %>%
  complete(time, type, fill = list(count = 0)) %>%
  filter(type == val)
# time type count
# <int> <chr> <dbl>
#1 -2 b 0
#2 -1 b 0
#3 0 b 3
#4 1 b 7
#5 2 b 2
With base R:
# use type 'a' rows to get the full time grid, with count initialised to 0
df1 <- data.frame(time = df[df$type == 'a', ]$time, type = 'b', count = 0)
# overwrite counts at the times where type 'b' was actually observed
df1[match(df[df$type == 'b', ]$time, df1$time), ]$count <- df[df$type == 'b', ]$count
df1
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
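A more general base R sketch that does not rely on type 'a' covering every time point: build the full time grid with expand.grid and left-join with merge (this assumes the data frame is named df, as in the answers above).
# full grid of times for the type of interest
full <- expand.grid(time = sort(unique(df$time)), type = "b")
# left join keeps all grid rows; missing counts come back as NA
out <- merge(full, df, by = c("time", "type"), all.x = TRUE)
out$count[is.na(out$count)] <- 0
out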
