How to create "blocks" based on column values by group? - r

I have a data frame that looks like this:
data = data.frame(userID = c("a","a","a","a","a","a","a","a","a","b","b"),
diff = c(1,1,1,81,1,1,1,2,1,1,1)
)
Eventually, I want to get something like this:
data = data.frame(userID = c("a","a","a","a","a","a","a","a","a","b","b"),
diff = c(1,1,1,81,1,1,1,2,1,1,1),
block = c(1,1,1,2,2,2,2,3,3,1,1)
)
So basically, every time the value in the diff column is greater than 1, a new block should start. And I want to do this by group, i.e. by userID.
Right now I am thinking about using LOCF or writing a loop, but neither seems to work. Any advice? Thanks!

In base you can use ave like:
data$block <- ave(data$diff>1, data$userID, FUN=cumsum)+1
# userID diff block
#1 a 1 1
#2 a 1 1
#3 a 1 1
#4 a 81 2
#5 a 1 2
#6 a 1 2
#7 a 1 2
#8 a 2 3
#9 a 1 3
#10 b 1 1
#11 b 1 1
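To see why this works, it can help to trace a single group by hand. The following is a minimal sketch in base R, using the diff values of user a from the question; the names diff_a, starts_new_block and block_a are made up for illustration.

```r
# diff values for userID "a", copied from the question
diff_a <- c(1, 1, 1, 81, 1, 1, 1, 2, 1)

# TRUE wherever a new block should start
starts_new_block <- diff_a > 1

# cumsum() treats TRUE as 1 and FALSE as 0, so it counts the block starts
# seen so far; adding 1 makes the first block number 1 instead of 0
block_a <- cumsum(starts_new_block) + 1
block_a
# 1 1 1 2 2 2 2 3 3
```

ave() simply applies this cumsum step to each userID separately and puts the results back in the original row order.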

An option would be to group by 'userID' and then take the cumulative sum of the logical expression (diff > 1)
library(dplyr)
data %>%
group_by(userID) %>%
mutate(block = 1 + cumsum(diff > 1))
# A tibble: 11 x 3
# Groups: userID [2]
# userID diff block
# <fct> <dbl> <dbl>
# 1 a 1 1
# 2 a 1 1
# 3 a 1 1
# 4 a 81 2
# 5 a 1 2
# 6 a 1 2
# 7 a 1 2
# 8 a 2 3
# 9 a 1 3
#10 b 1 1
#11 b 1 1
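One thing to keep in mind with the cumsum approach: the example data has no missing values, but if diff contained an NA, cumsum would return NA from that row onward. The sketch below shows one possible guard, under the assumption (not stated in the question) that an NA should not start a new block. The vector d is hypothetical.

```r
# hypothetical diff values with a missing entry
d <- c(1, NA, 3, 1)

# plain version: NA propagates through cumsum()
block_naive <- cumsum(d > 1) + 1

# guarded version: treat NA as "does not start a new block" (an assumption)
block_guarded <- cumsum(!is.na(d) & d > 1) + 1
block_guarded
# 1 1 2 2
```

The same guarded expression can be used inside ave() or mutate() in place of diff > 1.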

Related

Code values in new column based on whether values in another column are unique

Given the following data I would like to create a new column new_sequence based on the condition:
If an id appears only once, the new value should be 0. If an id appears several times, the new value should be numbered according to the values in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I have tried with the code below, that does not work since all unique values are coded as 0
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))
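A short base R sketch of why this attempt misbehaves: !duplicated(id) flags the first occurrence of every id, not only the ids that appear once. The variable names first_occurrence and appears_once are made up for illustration.

```r
id <- c(1, 2, 3, 3, 3, 4, 4)

# TRUE for the first row of every id, including ids 3 and 4
first_occurrence <- !duplicated(id)

# TRUE only for ids that occur exactly once: check duplicates from both ends
appears_once <- !(duplicated(id) | duplicated(id, fromLast = TRUE))

first_occurrence
# TRUE TRUE TRUE FALSE FALSE TRUE FALSE
appears_once
# TRUE TRUE FALSE FALSE FALSE FALSE FALSE
```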
Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
add_count(id) %>%
mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
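For readers who prefer not to load dplyr, the same counting idea can be sketched in base R with ave(). This is only an illustration of the technique, not part of the original answer; n_per_id is a made-up name.

```r
dat <- data.frame(id = c(1, 2, 3, 3, 3, 4, 4),
                  sequence = c(1, 1, 1, 2, 3, 1, 2))

# number of rows per id, repeated on every row of that id
n_per_id <- ave(dat$sequence, dat$id, FUN = length)

# 0 for ids that occur once, otherwise keep the sequence value
dat$new_sequence <- ifelse(n_per_id == 1, 0, dat$sequence)
dat$new_sequence
# 0 0 1 2 3 1 2
```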
You can also try the following. After grouping by id check if the number of rows in the group n() is 1 or not. Use separate if and else instead of ifelse since the lengths are different within each group.
dat %>%
group_by(id) %>%
mutate(new_sequence = if(n() == 1) 0 else sequence)
Output
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
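The difference between ifelse() and if/else matters here because the test n() == 1 is a single value per group. A minimal base R sketch of that difference, using a hypothetical three-row group:

```r
# one group with three rows
sequence <- c(1, 2, 3)
n <- length(sequence)

# ifelse() returns something shaped like the test, which has length 1,
# so only the first element of `sequence` comes back
from_ifelse <- ifelse(n == 1, 0, sequence)

# if/else returns the whole chosen branch
from_if_else <- if (n == 1) 0 else sequence

from_ifelse
# 1
from_if_else
# 1 2 3
```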

Fill in rows based on condition for grouped data using tidyr

I have the following dataframe of which I am trying to create the 'index2' field conditional on the 'index1' field:
Basically this data represents a succession of behaviours for different individual (ID) penguins and I am trying to index groups of behaviour (index 2) that incorporates all other behaviours in between (and including) dives (which have been indexed into dive bouts = index 1). I would appreciate a tidyverse solution grouping by ID.
Reproducible:
df<-data.frame(ID=c(rep('A',9),rep('B',14)),behaviour=c('surface','dive','dive','dive','surface','commute','surface','dive', 'dive','dive','dive','surface','dive','dive','commute','commute','surface','dive','dive','surface','dive','dive','surface'),index1=c(0,1,1,1,0,0,0,1,1,2,2,0,3,3,0,0,0,3,3,0,3,3,0),index2=c(0,1,1,1,1,1,1,1,1,2,2,0,3,3,3,3,3,3,3,3,3,3,0))
We could create a function with rle
frle <- function(x) {
rl <- rle(x)
i1 <- cummax(rl$values)
i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
i1[i2] <- 0
as.integer(inverse.rle(within.list(rl, values <- i1)))
}
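After grouping by 'ID', apply the function to 'Index1' inside mutate to get the expected column. Note that the code and output below use a data frame called df1 with capitalized column names (Index1, Index2) rather than the df, index1 and index2 of the reproducible example: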
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Index2New = frle(Index1))
# A tibble: 19 x 5
# Groups: ID [2]
# ID behaviour Index1 Index2 Index2New
# <chr> <chr> <int> <int> <int>
# 1 A surface 0 0 0
# 2 A dive 1 1 1
# 3 A dive 1 1 1
# 4 A dive 1 1 1
# 5 A surface 0 1 1
# 6 A commute 0 1 1
# 7 A surface 0 1 1
# 8 A dive 1 1 1
# 9 A dive 1 1 1
#10 B dive 2 2 2
#11 B dive 2 2 2
#12 B surface 0 0 0
#13 B dive 3 3 3
#14 B dive 3 3 3
#15 B commute 0 3 3
#16 B commute 0 3 3
#17 B surface 0 3 3
#18 B dive 3 3 3
#19 B dive 3 3 3
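To make the rle-based function easier to follow, here is a step-by-step trace of the same operations on the index1 values of individual A from the reproducible example. It is a sketch for illustration only; it assigns to rl$values directly instead of using within.list, which should be equivalent for this purpose.

```r
# index1 values for ID "A" from the question
x <- c(0, 1, 1, 1, 0, 0, 0, 1, 1)

rl <- rle(x)              # runs: values 0 1 0 1, lengths 1 3 3 2
i1 <- cummax(rl$values)   # running maximum per run: 0 1 1 1

# TRUE for a run whose running maximum differs from the next run's
i2 <- c(i1[-1] != i1[-length(i1)], FALSE)   # TRUE FALSE FALSE FALSE
i1[i2] <- 0               # reset those runs to 0

rl$values <- i1
result <- as.integer(inverse.rle(rl))   # expand the runs back to full length
result
# 0 1 1 1 1 1 1 1 1
```

The zeros between the two dive runs are filled with 1 because the running maximum carries the bout number across them.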

How to copy value of a cell to other rows based on the value of other two columns?

I have a data frame that looks like this:
zz = "Sub Item Answer
1 A 1 NA
2 A 1 0
3 A 2 NA
4 A 2 1
5 B 1 NA
6 B 1 1
7 B 2 NA
8 B 2 0"
Data = read.table(text=zz, header = TRUE)
The desirable result is to have the value under "Answer" (0 or 1) copied to the NA cells of the same Subject and the same Item. For instance, the answer = 0 in row 2 should be copied to the answer cell in row 1, but not other rows. The output should be like this:
zz2 = "Sub Item Answer
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0"
Data2 = read.table(text=zz2, header = TRUE)
How should I do this? I noticed that there are some previous questions that asked how to copy a cell to other cells such as replace NA value with the group value, but it was done based on the value of one column only. Also, this question is slightly different from Replace missing values (NA) with most recent non-NA by group, which aims to copy the most-recent numeric value to NAs.
Thanks for all your answers!
You can use zoo::na.locf.
library(tidyverse)
library(zoo)
Data %>% group_by(Sub, Item) %>% mutate(Answer = na.locf(Answer))
# A tibble: 8 x 3
## Groups: Sub, Item [4]
# Sub Item Answer
# <fct> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
Thanks to @steveb, here is an alternative without having to rely on zoo::na.locf:
Data %>% group_by(Sub, Item) %>% mutate(Answer = Answer[!is.na(Answer)])
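The indexing alternative appears to rely on each Sub/Item group having exactly one non-NA Answer, which mutate then recycles across the group. The same idea can be sketched in base R with ave(); this is an illustration under that assumption, and the helper column key is made up.

```r
Data <- data.frame(Sub = c("A", "A", "A", "A", "B", "B", "B", "B"),
                   Item = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
                   Answer = c(NA, 0L, NA, 1L, NA, 1L, NA, 0L))

# one grouping key per Sub/Item combination (hypothetical helper)
key <- paste(Data$Sub, Data$Item)

# within each group, take the first non-NA value and recycle it over the group
Data$Answer <- ave(Data$Answer, key, FUN = function(a) a[!is.na(a)][1])
Data$Answer
# 0 0 1 1 1 1 0 0
```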
library(tidyverse)
Data%>%group_by(Sub,Item)%>%fill(Answer,.direction = "up")
# A tibble: 8 x 3
# Groups: Sub, Item [4]
Sub Item Answer
<fctr> <int> <int>
1 A 1 0
2 A 1 0
3 A 2 1
4 A 2 1
5 B 1 1
6 B 1 1
7 B 2 0
8 B 2 0
Though it was not the OP's intention, I thought about situations where a Sub/Item group has only NA values, or where a group has multiple non-NA values.
One way to handle such situations is to take the max (or min) of the group and return NA when that max/min is infinite, which happens when the group contains only NA values.
A solution could be:
library(dplyr)
Data %>% group_by(Sub, Item) %>%
mutate(Answer = ifelse(max(Answer, na.rm=TRUE)== -Inf, NA,
as.integer(max(Answer, na.rm=TRUE))))
#Result
# Sub Item Answer
# <fctr> <int> <int>
#1 A 1 0
#2 A 1 0
#3 A 2 1
#4 A 2 1
#5 B 1 1
#6 B 1 1
#7 B 2 0
#8 B 2 0
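Calling max(..., na.rm = TRUE) on a group that contains only NA values returns -Inf with a warning. A possible variation that checks for the empty case first is sketched below in base R. The helper max_or_na and the small example vectors are hypothetical, not part of the original answer.

```r
# hypothetical helper: max of the non-NA values, or NA if there are none
max_or_na <- function(a) {
  v <- a[!is.na(a)]
  if (length(v) == 0) NA_integer_ else max(v)
}

# group "A" has one answer, group "B" has none
Sub <- c("A", "A", "B", "B")
Answer <- c(NA, 0L, NA, NA)

filled <- ave(Answer, Sub, FUN = max_or_na)
filled
# 0 0 NA NA
```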

dplyr how to count cycles in the records

For example, if I have records like:
A B
1 2
2 3
3 1
1 2
2 1
Let's say one cycle runs from 1 (to 2 to 3) back to 1, so I need my data frame to look like this:
No. A B
cycle1 1 2
cycle1 2 3
cycle1 3 1
cycle2 1 2
cycle2 2 1
Or a better way for me, I just need to record the time the same record appears, like
Time A B
Time1 1 2
Time1 2 3
Time1 3 1
Time2 1 2
Time1 2 1
I need to do this because I have to use summarize function in dplyr to do calculation but I cannot group data by A and B directly. The order of the data is also important.
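Is this what you want?
library(zoo)
T1 = which(df$A == 1)                # rows where a new cycle starts (A is back at 1)
T2 = paste('cycle', seq_along(T1))   # one label per cycle start
df$No = NA
df$No[T1] = T2                       # label only the starting rows
df$No = na.locf(df$No)               # carry each label forward over the rest of its cycle
df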
A B No
1 1 2 cycle 1
2 2 3 cycle 1
3 3 1 cycle 1
4 1 2 cycle 2
5 2 1 cycle 2
#the reason: keep the row Id with the calculation
library(dplyr)
df%>%group_by(A,B)%>%mutate(Time=paste('Time',row_number()))
A B Time
<int> <int> <chr>
1 1 2 Time 1
2 2 3 Time 1
3 3 1 Time 1
4 1 2 Time 2
5 2 1 Time 1
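Both labels can also be produced without zoo or dplyr. The following base R sketch rebuilds the example records from the question and assumes, as the answer above does, that a cycle starts whenever A equals 1. The helper key is a made-up name.

```r
df <- data.frame(A = c(1, 2, 3, 1, 2),
                 B = c(2, 3, 1, 2, 1))

# cycle number: count how many times A == 1 has been seen so far
df$No <- paste0("cycle", cumsum(df$A == 1))

# occurrence number of each (A, B) pair, in row order
key <- paste(df$A, df$B)
df$Time <- paste0("Time", ave(seq_along(key), key, FUN = seq_along))

df$No
# "cycle1" "cycle1" "cycle1" "cycle2" "cycle2"
df$Time
# "Time1" "Time1" "Time1" "Time2" "Time1"
```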
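Create an augmented 'diff' variable, c(-1, diff(your_var)). Within an increasing sequence the differences are positive, and the leading -1 marks the first row as the start of a group. A new group starts wherever this value is negative, so the cumulative sum of that condition numbers the groups. (My first iteration on the algorithm wasn't quite correct, so I modified it slightly.)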
dat %>% as_tibble() %>% mutate(G = cumsum( c(-1, diff(A)) < 0 ) )
# A tibble: 5 x 3
A B G
<int> <int> <int>
1 1 2 1
2 2 3 1
3 3 1 1
4 1 2 2
5 2 1 2
dat %>% as_tibble() %>% mutate(G = paste0( "time", cumsum( c(-1, diff(A)) < 0 ) ))
# A tibble: 5 x 3
A B G
<int> <int> <chr>
1 1 2 time1
2 2 3 time1
3 3 1 time1
4 1 2 time2
5 2 1 time2
One could also test for A == 1, but then sequences like 1,2,3,2,3,4 would not get properly split.
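The difference between the two tests can be checked on exactly that sequence. A minimal base R sketch, with made-up variable names:

```r
A <- c(1, 2, 3, 2, 3, 4)

# split wherever the value drops (the leading -1 starts the first group)
by_drop <- cumsum(c(-1, diff(A)) < 0)

# split only where A equals 1
by_one <- cumsum(A == 1)

by_drop
# 1 1 1 2 2 2
by_one
# 1 1 1 1 1 1
```

The drop-based version starts a second group at the fourth element, while the A == 1 version keeps everything in one group.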

Keep duplicate values only if they are represented in first sampling period

I am trying to clean my data so that only duplicate values that have an observation in my first sampling period are kept. For instance, if my data frame looks like this:
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,4,4), period = c(1,2,3,1,2,3,2,3,1,3), mass = rnorm(10, 5, 2))
df
ID period mass
1 1 1 3.313674
2 1 2 6.371979
3 1 3 5.449435
4 2 1 4.093022
5 2 2 2.615782
6 2 3 3.622842
7 3 2 4.466666
8 3 3 6.940979
9 4 1 6.226222
10 4 3 4.233397
I would like to keep only the observations for individuals that were measured during period 1. My new data frame would then look like this:
ID period mass
1 1 1 3.313674
2 1 2 6.371979
3 1 3 5.449435
4 2 1 4.093022
5 2 2 2.615782
6 2 3 3.622842
9 4 1 6.226222
10 4 3 4.233397
Using suggestions on this page (Remove all unique rows) I have tried using the following command, but it leaves in the observations for individual 3 (which was not measured in period 1).
subset(df, duplicated(ID) | duplicated(ID, fromLast = TRUE))
If you want a base solution, the following should work as well. (The mass values differ from the question because rnorm was called without a fixed seed.)
df_new <- df[df$ID %in% df$ID[df$period == 1], ]
df_new
ID period mass
1 1 1 3.238832
2 1 2 3.428847
3 1 3 1.205347
4 2 1 8.498452
5 2 2 7.523085
6 2 3 3.613678
9 4 1 3.324095
10 4 3 1.932733
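The one-liner is easier to read when split into two steps. The sketch below uses the ID and period columns from the question and leaves out the random mass column; ids_in_period_1 is a made-up name.

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
                 period = c(1, 2, 3, 1, 2, 3, 2, 3, 1, 3))

# individuals that have an observation in period 1
ids_in_period_1 <- df$ID[df$period == 1]
ids_in_period_1
# 1 2 4

# keep every row belonging to one of those individuals
df_new <- df[df$ID %in% ids_in_period_1, ]
df_new$ID
# 1 1 1 2 2 2 4 4
```

Individual 3 is dropped because it never appears with period equal to 1.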
You can use dplyr as follows:
library(dplyr)
df %>% group_by(ID) %>% filter(1 %in% period)
#Source: local data frame [8 x 3]
#Groups: ID [3]
# ID period mass
# <dbl> <dbl> <dbl>
#1 1 1 7.622950
#2 1 2 7.960665
#3 1 3 5.045723
#4 2 1 4.366568
#5 2 2 4.400645
#6 2 3 6.088367
#7 4 1 2.282713
#8 4 3 2.461640
