How to flag non-sequential numbers in R

I was wondering if there's a way to flag non-sequential numbers? For instance, see the column below.
Number
3
4
5
6
10
11
12
16
Is there a way to flag the number before the non-sequential number like so?
Number Flag
3 0
4 0
5 0
6 1
10 0
11 0
12 1
16
etc..
Thank you!

# flag rows where the next number is not exactly 1 greater; the last row,
# which has no successor to compare against, gets NA
dat$Flag <- +c(diff(dat$Number) != 1, NA)
dat
# Number Flag
# 1 3 0
# 2 4 0
# 3 5 0
# 4 6 1
# 5 10 0
# 6 11 0
# 7 12 1
# 8 16 NA
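To see what that one-liner does, here is a quick breakdown of the intermediate steps (illustrative only):
Number <- c(3, 4, 5, 6, 10, 11, 12, 16)
diff(Number)              # 1 1 1 4 1 1 4
diff(Number) != 1         # FALSE FALSE FALSE TRUE FALSE FALSE TRUE
# the unary + coerces the logicals to 0/1; c(..., NA) pads the last row
+c(diff(Number) != 1, NA) # 0 0 0 1 0 0 1 NA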

I would use lead() from dplyr here:
library(dplyr)
df$Flag <- as.numeric(lead(df$Number) != df$Number + 1)
df
Number Flag
1 3 0
2 4 0
3 5 0
4 6 1
5 10 0
6 11 0
7 12 1
8 16 NA
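If you would rather have 0 than NA in the last row (an assumption about the desired output), lead() accepts a default that makes the final comparison well-defined:
library(dplyr)
# default = last value + 1 makes the last row compare as "sequential", giving 0
df$Flag <- as.numeric(lead(df$Number, default = df$Number[nrow(df)] + 1) != df$Number + 1)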

Here's another option with data.table:
library(data.table)
setDT(df)[, Flag := +(shift(Number, type = "lead") - (Number + 1) != 0)]
Output
Number Flag
<int> <int>
1: 3 0
2: 4 0
3: 5 0
4: 6 1
5: 10 0
6: 11 0
7: 12 1
8: 16 NA
Or another option with dplyr:
library(dplyr)
df %>%
mutate(Flag = +(lead(Number) - (Number + 1) != 0))
Data
df <- structure(list(Number = c(3L, 4L, 5L, 6L, 10L, 11L, 12L, 16L)), class = "data.frame", row.names = c(NA,
-8L))

Related

replace with 0 duplicate variable according to ID

I have a dataframe like this one:
df
ID job_code
1 8
1 8
1 8
2 7
2 7
2 4
3 1
3 2
If an individual has the same job code several times, I would like to keep only the first one and replace the others with 0, to obtain a dataframe like this one:
df
ID job_code job_code_2
1 8 8
1 8 0
1 8 0
2 7 7
2 7 0
2 4 4
3 1 1
3 2 2
I thought of using something like:
dataframe %>%
group_by(ID) %>%
and then replace(), but I am not sure how to put it together.
Thank you in advance for your help.
library(tidyverse)
df <- data.frame(
ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
job_code = c(8L, 8L, 8L, 7L, 7L, 4L, 1L, 2L)
)
df %>%
group_by(ID, job_code) %>%
mutate(job_code2 = job_code * +(row_number() == 1)) %>%
ungroup()
#> # A tibble: 8 x 3
#> ID job_code job_code2
#> <int> <int> <int>
#> 1 1 8 8
#> 2 1 8 0
#> 3 1 8 0
#> 4 2 7 7
#> 5 2 7 0
#> 6 2 4 4
#> 7 3 1 1
#> 8 3 2 2
Created on 2022-03-23 by the reprex package (v2.0.1)
Use duplicated:
df %>%
group_by(ID) %>%
mutate(job_code2 = ifelse(duplicated(job_code), 0, job_code)) %>%
ungroup()
In base R you can use tapply + duplicated:
df$job_code2 <- unlist(tapply(df$job_code, df$ID, function(x) ifelse(duplicated(x), 0, x)))
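Note that the tapply() version relies on the rows being sorted by ID, because unlist() pastes the groups back in factor-level order. A small variation with ave() (not from the original answer) avoids that assumption, since ave() writes each group's results back into the original row positions:
df$job_code2 <- ave(df$job_code, df$ID, FUN = function(x) ifelse(duplicated(x), 0, x))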
Another possible solution (note that this one overwrites job_code in place rather than adding a new column):
library(tidyverse)
df <- read_table("ID job_code
1 8
1 8
1 8
2 7
2 7
2 4
3 1
3 2")
df %>%
group_by(ID, job_code) %>%
mutate(job_code = if_else(row_number() > 1, 0, job_code)) %>%
ungroup
#> # A tibble: 8 x 2
#> ID job_code
#> <dbl> <dbl>
#> 1 1 8
#> 2 1 0
#> 3 1 0
#> 4 2 7
#> 5 2 0
#> 6 2 4
#> 7 3 1
#> 8 3 2
The first solution is good, but it doesn't work for some subjects and I don't know why. It fails for subjects whose job code has already appeared for a previous subject. For example, for subject 4 I get a 0 when I should get an 8.
I get this:
ID job_code job_code_2
1 8 8
1 8 0
1 8 0
2 7 7
2 7 0
2 4 4
3 1 1
3 2 2
4 8 0
Instead of this:
ID job_code job_code_2
1 8 8
1 8 0
1 8 0
2 7 7
2 7 0
2 4 4
3 1 1
3 2 2
4 8 8
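For what it's worth, the duplicated() answer above groups by ID, so a code reappearing for a later subject should still be kept. A quick check with the extended data from the comment (a sketch, assuming a fourth subject with job code 8):
library(dplyr)
df_ext <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L),
  job_code = c(8L, 8L, 8L, 7L, 7L, 4L, 1L, 2L, 8L)
)
df_ext %>%
  group_by(ID) %>%
  mutate(job_code_2 = ifelse(duplicated(job_code), 0, job_code)) %>%
  ungroup()
# The last row (ID 4, job_code 8) keeps its 8, because duplicated()
# is evaluated within each ID, not across the whole column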

R: expand sequence of binary values from two columns

I have a very complex dataset, but for simplicity I will use the data below. There is one column X containing numbers at irregular intervals, and a binary 0/1 column. I want to expand X into a full step-by-1 sequence, carrying the binary value along, which creates two new, longer columns (newX, newbinary) from the original X and binary columns.
X binary
1 1 1
2 3 0
3 5 0
4 6 1
5 8 1
6 10 0
I want to expand X column here and put the binary value next to it. Something like,
newX newbinary
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
My attempt is this. First, I created the new X column (expanded X):
newX <- seq(X[1], X[length])
Then I used a for loop to iterate through newX and compare each value with X. If newX[i] is not equal to X[i], put the previous binary value; otherwise put binary[i].
for (i in 1:newX[length])
{
newbinary= ifelse((newX != X)&(between(newX[i], X[i],X[j+1])), lag(binary), binary)
}
But this is not working, and gives me this:
newX newbinary
1 1 1
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
I don't really know how the newX column (the longer one) should iterate through X (the shorter one) and pick up the corresponding values.
How can I achieve this in R?
Try this tidyverse approach: create a data frame holding the full sequence, then use left_join() and finally fill() to complete the binary variable:
library(dplyr)
library(tidyr)
#Code
newdf <- data.frame(X=seq(min(df$X),max(df$X),by=1)) %>%
left_join(df) %>%
fill(binary)
Output:
X binary
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
Some data used:
#Data
df <- structure(list(X = c(1L, 3L, 5L, 6L, 8L, 10L), binary = c(1L,
0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Does this work?
library(dplyr)
library(tidyr)
library(purrr)
df %>% mutate(y = lead(X, default = max(X)+1), newX = map2(X,y-1,`:`)) %>%
unnest(newX) %>% mutate(newbinary = binary) %>% select(newX, newbinary)
# A tibble: 10 x 2
newX newbinary
<int> <dbl>
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
Data used:
df
# A tibble: 6 x 2
X binary
<dbl> <dbl>
1 1 1
2 3 0
3 5 0
4 6 1
5 8 1
6 10 0
Assuming the data frame shown reproducibly in the Note at the end, the following one-liner uses read.zoo to convert it to a zoo object, converts that to a ts object (expanding it at the same time), uses na.locf to fill in the NAs, and fortify.zoo to convert it back to a data.frame. Finally, setNames restores the original column names.
library(zoo)
setNames(fortify.zoo(na.locf(as.ts(read.zoo(DF)))), names(DF))
giving:
X binary
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 1
7 7 1
8 8 1
9 9 1
10 10 0
The above one-liner could alternatively be written as a pipeline:
library(zoo)
library(magrittr)
DF %>%
read.zoo %>%
as.ts %>%
na.locf %>%
fortify.zoo %>%
setNames(names(DF))
Note
Lines <- "
X binary
1 1 1
2 3 0
3 5 0
4 6 1
5 8 1
6 10 0"
DF <- read.table(text = Lines)
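For completeness, a dependency-free base-R sketch of the same expansion, assuming df$X is sorted in increasing order (findInterval() requires that):
newX <- seq(min(df$X), max(df$X))
# findInterval() returns, for each newX, the last row of df with X <= newX,
# which is exactly the row whose binary value should be carried forward
idx <- findInterval(newX, df$X)
newdf <- data.frame(newX = newX, newbinary = df$binary[idx])
newdf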

grouping continuous data with specific pattern

I have a data frame with a column like this (I am not posting other columns)
Value
1
1
1
0
0
1
0
0
1
1
2
2
0
0
1
0
0
1
1
1
0
0
2
2
1
1
2
0
0
1
0
I am trying to group it based on a specific condition: a new group starts whenever the value switches between 1 and 2, but zeros in between do not break a group. For example, a run like this is a single group:
1 1 0 0 1 1 0 0
Basically, I need to group occurrences of 1, but in-between 0s are allowed.
Expected output:
Value Group
1 1
1 1
1 1
0 1
0 1
1 1
0 1
0 1
1 1
1 1
2 2
2 2
0 2
0 2
1 3
0 3
0 3
1 3
1 3
1 3
0 3
0 3
2 4
2 4
1 5
1 5
2 6
0 6
0 6
1 7
0 7
2 8
0 8
2 8
1 9
Here is another option using data.table (this version also groups by a date column, which is mentioned in the comments but not shown in the sample data):
DT[, Group := .GRP, .(date, rleid(nafill(replace(Value, Value==0L, NA_integer_), "locf")))]
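Without the date column, the same idea reads more easily in steps (a sketch):
library(data.table)
v <- dat$Value
# turn the zeros into NA, then carry the last non-zero value forward
filled <- nafill(replace(v, v == 0L, NA_integer_), "locf")
# number the runs of the filled vector: each run is one group
dat$Group <- rleid(filled)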
Here is another base approach that uses ave() to number the runs of 1s and 2s, and then uses cummax() on the result to carry those group numbers across the zeros.
dat$Group <- cummax(ave(dat$Value, dat$Value == 0, FUN = function(x) cumsum(c(x[1], diff(x) != 0))))
dat
Value Group
1 1 1
2 1 1
3 1 1
4 0 1
5 0 1
6 1 1
7 0 1
8 0 1
9 1 1
10 1 1
11 2 2
12 2 2
13 0 2
14 0 2
15 1 3
16 0 3
17 0 3
18 1 3
19 1 3
20 1 3
21 0 3
22 0 3
23 2 4
24 2 4
25 1 5
26 1 5
27 2 6
28 0 6
29 0 6
30 1 7
31 0 7
In response to your comment: if you want the result grouped by date, you can use a nested ave():
ave(ave(dat$Value, dat$Value == 0, dat$date, FUN = function(x) cumsum(c(x[1], diff(x) != 0))), dat$date, FUN = cummax)
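To see why the cummax() approach works, here are the intermediate steps on the sample data (illustrative):
# Step 1: ave() splits Value into the zero rows and the non-zero rows.
# In the non-zero subset, cumsum(c(x[1], diff(x) != 0)) starts at the first
# value (1 here) and increments at every change between 1 and 2; in the
# zero subset it stays 0 throughout.
grp <- ave(dat$Value, dat$Value == 0, FUN = function(x) cumsum(c(x[1], diff(x) != 0)))
head(grp, 15)
# [1] 1 1 1 0 0 1 0 0 1 1 2 2 0 0 3
# Step 2: cummax() carries the last group number across the zero rows
head(cummax(grp), 15)
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3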
This loop in base R does the trick:
group <- 0
lastgroupvalue <- NA
data$Group <- NA
for(i in 1:nrow(data)){
if(!data$Value[i] %in% c(lastgroupvalue, 0)){
group <- group + 1
lastgroupvalue <- data$Value[i]
}
data$Group[i] <- group
}
> data
Value Group
1 1 1
2 1 1
3 1 1
4 0 1
5 0 1
6 1 1
7 0 1
8 0 1
9 1 1
10 1 1
11 2 2
12 2 2
13 0 2
14 0 2
15 1 3
16 0 3
17 0 3
18 1 3
19 1 3
20 1 3
21 0 3
22 0 3
23 2 4
24 2 4
25 1 5
26 1 5
27 2 6
28 0 6
29 0 6
30 1 7
31 0 7
Data:
data <- structure(list(Value = c(1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
1L, 2L, 2L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 2L, 2L, 1L,
1L, 2L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-31L))
Another solution that avoids loops; it works similarly to Limey's solution but uses cumsum() to create the groups. (The default = -1 in lag() makes the first comparison TRUE, so the numbering starts at 1.)
df$Group <- dplyr::na_if(df$Value, 0)
df <- tidyr::fill(df, Group, .direction = "down")
df$Group <- cumsum(df$Group != dplyr::lag(df$Group, default = -1))
> df
Value Group
1 1 1
2 1 1
3 1 1
4 0 1
5 0 1
6 1 1
7 0 1
8 0 1
9 1 1
10 1 1
11 2 2
12 2 2
13 0 2
14 0 2
15 1 3
16 0 3
17 0 3
18 1 3
19 1 3
20 1 3
21 0 3
22 0 3
23 2 4
24 2 4
25 1 5
26 1 5
27 2 6
28 0 6
29 0 6
30 1 7
31 0 7
Or a tidyverse solution that avoids loops:
x <- tibble(Value=c(1,1,1,0,0,1,0,0,1,1,2,2,0,0,1,0,0,1,1,1,
0,0,2,2,1,1,2,0,0,1,0,2,0,2,1)) %>%
mutate(ModValue=ifelse(Value == 0, NA, Value)) %>%
fill(ModValue, .direction="down")
runLengths <- rle(x$ModValue)
groupIndex <- unlist(lapply(1:length(runLengths$lengths),
function(x) rep(x, runLengths$lengths[x]))
)
x <- x %>% add_column(Group=groupIndex) %>% select(-ModValue)
Your input data has a different length to your expected output. Took me a while to work that out... :)
Edit
And an inelegant solution to account for changing days (or other super-groupings):
x <- tibble(
RowNumber=1:35,
Date=lubridate::ymd(c(rep("2020-05-31", 20), rep("2020-06-01", 15))),
Value=c(1,1,1,0,0,1,0,0,1,1,2,2,0,0,1,0,0,1,1,1,0,0,2,2,1,1,2,0,0,1,0,2,0,2,1))
# Check we have a change of date mid-sequence
x %>% filter(row_number() > 15 & row_number() < 25)
x <- x %>%
mutate(ModValue=ifelse(Value == 0, NA, Value)) %>%
fill(ModValue, .direction="down")
# Inelegantly compute the groups
make_groups <- function(x) {
runs <- rle(x)
return(tibble(GroupWithinDay=unlist(
lapply(1:length(runs$lengths),
function(x) rep(x, runs$lengths[x])))))
}
y <- x %>% group_by(Date) %>% do(make_groups(.$ModValue))
x <- x %>% add_column(GroupWithinDay=y$GroupWithinDay) %>% select(-ModValue)
# Check the change of date is handled correctly
x %>% filter(row_number() > 15 & row_number() < 25)
Giving
# A tibble: 9 x 4
RowNumber Date Value GroupWithinDay
<int> <date> <dbl> <int>
1 16 2020-05-31 0 3
2 17 2020-05-31 0 3
3 18 2020-05-31 1 3
4 19 2020-05-31 1 3
5 20 2020-05-31 1 3
6 21 2020-06-01 0 1
7 22 2020-06-01 0 1
8 23 2020-06-01 2 2
9 24 2020-06-01 2 2
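If pulling in data.table is acceptable (an assumption), rleid() collapses the rle()/rep() construction into one call, and grouping by Date gives the per-day numbering directly:
library(dplyr)
library(tidyr)
x %>%
  mutate(ModValue = ifelse(Value == 0, NA, Value)) %>%
  fill(ModValue, .direction = "down") %>%
  group_by(Date) %>%
  # rleid() numbers the runs within each day, restarting at 1 per Date
  mutate(GroupWithinDay = data.table::rleid(ModValue)) %>%
  ungroup() %>%
  select(-ModValue)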

Conditional average or replace value if zero in dataframe

I have a problem finding the right code for the following problem.
Here is a simplified and short version of my dataframe df :
Line Id Amount
1 1 10
2 2 12
3 2 13
4 2 0
5 3 11
6 4 12
7 4 14
8 5 0
9 6 11
10 6 0
I would like to create another column, Amount_Avrg, with the following conditions:
- if several lines have the same Id and an Amount different from zero (the case for lines 2 and 3, and for lines 6 and 7), take the average of those amounts
- if a line has an Amount equal to 0, then:
A/ drop it if it is alone, i.e. no other line with the same Id has a value different from 0 (the case of line 8)
B/ if exactly one other line with the same Id has a value different from 0 (the case for lines 9 and 10), replace the 0 with that value
C/ if two or more other lines have values different from zero (line 4, where lines 2 and 3 are non-zero), replace the 0 with the average of those amounts
The final dataframe I am expecting would then look like this one:
Line Id Amount Amount_Avrg
1 1 10 10
2 2 12 12.5
3 2 13 12.5
4 2 0 12.5
5 3 11 11
6 4 12 13
7 4 14 13
9 6 11 11
10 6 0 11
I have read in many answers that for loops are not efficient in R, so if you could help me with another solution, that would be fantastic :-)
Using dplyr, we can group_by Id, take the mean of the non-zero Amounts, and remove the rows where that mean is NA. (For Id 5, whose only Amount is 0, mean() of an empty vector returns NaN, which is.na() treats as missing, so those rows are dropped.)
library(dplyr)
df %>%
group_by(Id) %>%
mutate(mn = mean(Amount[Amount > 0])) %>%
filter(!is.na(mn))
# Line Id Amount mn
# <int> <int> <int> <dbl>
#1 1 1 10 10
#2 2 2 12 12.5
#3 3 2 13 12.5
#4 4 2 0 12.5
#5 5 3 11 11
#6 6 4 12 13
#7 7 4 14 13
#8 9 6 11 11
#9 10 6 0 11
Or with data.table
library(data.table)
setDT(df)[, mn := mean(Amount[Amount > 0]), by = Id][!is.na(mn)]
data
df <- structure(list(Line = 1:10, Id = c(1L, 2L, 2L, 2L, 3L, 4L, 4L,
5L, 6L, 6L), Amount = c(10L, 12L, 13L, 0L, 11L, 12L, 14L, 0L,
11L, 0L)), class = "data.frame", row.names = c(NA, -10L))
You can use ave to calculate the mean per Id and then subset with !is.na to remove the rows where you have only 0 per Id.
x$Amount_Avrg <- ave(x$Amount, x$Id, FUN=function(x) mean(x[x>0]))
x <- x[!is.na(x$Amount_Avrg),]
x
# Line Id Amount Amount_Avrg
#1 1 1 10 10.0
#2 2 2 12 12.5
#3 3 2 13 12.5
#4 4 2 0 12.5
#5 5 3 11 11.0
#6 6 4 12 13.0
#7 7 4 14 13.0
#9 9 6 11 11.0
#10 10 6 0 11.0
Or with within and na.omit:
na.omit(within(x, Amount_Avrg <- ave(Amount, Id, FUN=function(x) mean(x[x>0]))))
Or using aggregate and merge:
merge(x, aggregate(cbind(Amount_Avrg = Amount) ~ Id, data=x[x$Amount>0,], mean))
Data:
x <- read.table(header=TRUE, text="Line Id Amount
1 1 10
2 2 12
3 2 13
4 2 0
5 3 11
6 4 12
7 4 14
8 5 0
9 6 11
10 6 0")
If you create a summary table of all the nonzero-means, you can right-join that to the original table to get the result displayed in the question.
library(data.table)
setDT(df)
nonzero_means <- df[Amount > 0, .(Amount_Avg = mean(Amount)), Id]
df[nonzero_means, on = .(Id)]
# Line Id Amount Amount_Avg
# 1: 1 1 10 10.0
# 2: 2 2 12 12.5
# 3: 3 2 13 12.5
# 4: 4 2 0 12.5
# 5: 5 3 11 11.0
# 6: 6 4 12 13.0
# 7: 7 4 14 13.0
# 8: 9 6 11 11.0
# 9: 10 6 0 11.0
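Roughly the same join idea expressed with dplyr (a sketch; the inner join drops Id 5, whose only Amount is 0 and therefore never enters the summary table):
library(dplyr)
nonzero_means <- df %>%
  filter(Amount > 0) %>%
  group_by(Id) %>%
  summarise(Amount_Avrg = mean(Amount))
df %>% inner_join(nonzero_means, by = "Id")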

Finding values in consecutive rows

An example of the dataframe I have is given below.
ID X
1 1
2 2
3 1
4 0
5 0
6 1
7 4
8 5
9 6
10 7
11 0
12 0
I want to apply logic that checks whether 3 or more consecutive rows have a value > 0. If they do, I want to flag them in another column. The output would then look as follows.
ID X Y
1 1 1
2 2 1
3 1 1
4 0 0
5 0 0
6 1 1
7 4 1
8 5 1
9 6 1
10 7 1
11 0 0
12 0 0
EXTENSION -
How would I get the following output, giving a different Y value to each group?
ID X Y
1 1 1
2 2 1
3 1 1
4 0 0
5 0 0
6 1 2
7 4 2
8 5 2
9 6 2
10 7 2
11 0 0
12 0 0
One option with base R: use rle to find the runs of adjacent values in 'X' that are greater than 0, then replicate the flag across each run's length.
df1$Y <- with(rle(df1$X > 0), as.integer(rep(values & lengths > 2, lengths)))
df1$Y
#[1] 1 1 1 0 0 1 1 1 1 1 0 0
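For illustration, this is what rle() sees for the sample data: four runs of the X > 0 condition, of which only the runs that are both TRUE and longer than 2 survive:
rle(df1$X > 0)
# Run Length Encoding
#   lengths: int [1:4] 3 2 5 2
#   values : logi [1:4] TRUE FALSE TRUE FALSE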
For the updated case in the OP's post
df1$Y <- inverse.rle(within.list(rle(df1$X > 0), {
i1 <- values & (lengths > 2)
values[i1] <- seq_along(values[i1])}))
df1$Y
#[1] 1 1 1 0 0 2 2 2 2 2 0 0
Or using rleid from data.table
library(data.table)
setDT(df1)[, Y := as.integer((.N > 2) * (X > 0)),rleid(X > 0)]
data
df1 <- structure(list(ID = 1:12, X = c(1L, 2L, 1L, 0L, 0L, 1L, 4L, 5L,
6L, 7L, 0L, 0L)), class = "data.frame", row.names = c(NA, -12L
))
We can use rleid from data.table to create run groups, use it in ave to get the length of each group, and assign 1 to groups whose length is at least 3.
library(data.table)
df$Y <- as.integer(ave(df$X, rleid(df$X > 0), FUN = length) >= 3)
df
# ID X Y
#1 1 1 1
#2 2 2 1
#3 3 1 1
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 4 1
#8 8 5 1
#9 9 6 1
#10 10 7 1
#11 11 0 0
#12 12 0 0
EDIT
For the updated post, we can combine the rleid grouping above with dplyr:
library(dplyr)
library(data.table)
df %>%
group_by(group = rleid(X > 0)) %>%
mutate(Y = ifelse(n() >= 3 & row_number() == 1, 1, 0)) %>%
ungroup() %>%
mutate(Y = cumsum(Y) * Y) %>%
group_by(group) %>%
mutate(Y = first(Y)) %>%
ungroup() %>%
select(-group)
# ID X Y
# <int> <int> <dbl>
# 1 1 1 1
# 2 2 2 1
# 3 3 1 1
# 4 4 0 0
# 5 5 0 0
# 6 6 1 2
# 7 7 4 2
# 8 8 5 2
# 9 9 6 2
#10 10 7 2
#11 11 0 0
#12 12 0 0
