I have a time series dataset containing different sensor measurements. The sensor software has some bugs, which result in missing measurements. I added rows for the missing measurement times, so the "value" column now contains NAs. The dataset looks as follows:
df <- structure(list(time_id = 1:10, value = c(-1.80603125680195, -0.582075924689333,
NA, NA, -0.162309523556819, NA, NA, NA, 1.6059096288573, NA),
is_missing = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE, TRUE)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L))
df
I want to group sequential rows with numeric vs missing values and at the same time count the number of sequential rows in each group. The result should look like this:
df %>% mutate(group = c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6),
seq_NA = c(1:2, 1:2, 1, 1:3, 1, 1))
Help is very much appreciated!
Here is another idea. We use is.na() to flag the NAs and start a new group whenever the missing/non-missing status changes, i.e.
df %>%
group_by(grp = cumsum(c(1, diff(is.na(value)) != 0))) %>%
mutate(seq_NA = seq(n()))
which gives,
# A tibble: 10 x 5
# Groups: grp [6]
time_id value is_missing grp seq_NA
<int> <dbl> <lgl> <dbl> <int>
1 1 -1.81 FALSE 1 1
2 2 -0.582 FALSE 1 2
3 3 NA TRUE 2 1
4 4 NA TRUE 2 2
5 5 -0.162 FALSE 3 1
6 6 NA TRUE 4 1
7 7 NA TRUE 4 2
8 8 NA TRUE 4 3
9 9 1.61 FALSE 5 1
10 10 NA TRUE 6 1
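To see why the grouping expression works, it can help to unpack it one step at a time. The following is a minimal base R sketch on a rounded copy of the question's values; the intermediate names (missing, changed, grp) are illustrative only and are not part of the answer above.

```r
# Rounded copy of the 'value' column from the question
value <- c(-1.81, -0.58, NA, NA, -0.16, NA, NA, NA, 1.61, NA)

missing <- is.na(value)          # FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
changed <- diff(missing) != 0    # TRUE wherever the status flips between neighbours
grp     <- cumsum(c(1, changed)) # start at 1, add 1 at every flip

# Running counter within each group, without dplyr
seq_NA <- ave(seq_along(grp), grp, FUN = seq_along)

print(grp)     # 1 1 2 2 3 4 4 4 5 6
print(seq_NA)  # 1 2 1 2 1 1 2 3 1 1
```

The leading 1 in c(1, changed) is needed because diff() returns one element fewer than its input.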
Here is a base R solution using ave() + rle()
df$group <- with(df, rep(seq_along(z<-rle(is_missing)$lengths),z))
df$seq_NA <- with(df,ave(seq(nrow(df)),group,FUN = seq_along))
such that
> df
time_id value is_missing group seq_NA
1 1 -1.8060313 FALSE 1 1
2 2 -0.5820759 FALSE 1 2
3 3 NA TRUE 2 1
4 4 NA TRUE 2 2
5 5 -0.1623095 FALSE 3 1
6 6 NA TRUE 4 1
7 7 NA TRUE 4 2
8 8 NA TRUE 4 3
9 9 1.6059096 FALSE 5 1
10 10 NA TRUE 6 1
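As a possible simplification of the rle() idea, the run lengths can feed both columns directly: rep() for the group id and base R's sequence() for the within-run counter. This is only a sketch on the question's is_missing vector; the variable names are assumptions made for illustration.

```r
is_missing <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)

r <- rle(is_missing)
print(r$lengths)  # 2 2 1 3 1 1  (length of each run of equal values)

group  <- rep(seq_along(r$lengths), r$lengths)  # one id per run, repeated over the run
seq_NA <- sequence(r$lengths)                   # 1..n restarted for every run

print(group)   # 1 1 2 2 3 4 4 4 5 6
print(seq_NA)  # 1 2 1 2 1 1 2 3 1 1
```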
I have data like this
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same id, the value in a row is equal to the value in the previous row, I want to increase the duplicate's value by 1. I want to get this
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be assigned back into time.)
This handles two or more duplicates. For example, if we append another copy of the 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
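One thing to watch: the ave() call above groups by time only, while the question asks about duplicates within the same id. The two agree on the example data, but they can differ when the same time value occurs under different ids. The sketch below uses made-up data (row 4 is changed to time 2) and a hypothetical helper named bump to show the difference; whether grouping by id is what you want depends on your data.

```r
# Made-up data: time 2 appears under id 1 (twice) and under id 2 (once)
df <- data.frame(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 2, 6, 6))

# Hypothetical helper: shift each repeat of a value by its repeat count
bump <- function(z) z + cumsum(duplicated(z))

by_time    <- ave(df$time, df$time, FUN = bump)         # grouping as in the answer
by_id_time <- ave(df$time, df$id, df$time, FUN = bump)  # duplicates counted per id

print(by_time)     # 1 2 3 4 6 7
print(by_id_time)  # 1 2 3 2 6 7
```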
You could use accumulate
library(tidyverse)
df %>%
group_by(id) %>%
mutate(time2 = accumulate(time, ~if(.x>=.y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice within a group.
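The accumulate() step carries the previously adjusted value forward and compares it with the next raw value. A rough base R equivalent uses Reduce() with accumulate = TRUE. The sketch below runs on a made-up vector and also shows that this rule is not always identical to the cumsum(duplicated()) approach: the two diverge once a later, larger value follows a run of repeats.

```r
time <- c(2, 2, 2, 6)  # made-up values for a single id

# Carry the adjusted value forward: bump by 1 if the next value would not increase
carried <- Reduce(function(prev, cur) if (prev >= cur) prev + 1 else cur,
                  time, accumulate = TRUE)

# Shift every value by the number of repeats seen so far
shifted <- time + cumsum(duplicated(time))

print(carried)  # 2 3 4 6
print(shifted)  # 2 3 4 8
```

Which behaviour is preferable depends on whether later, distinct values should stay where they are (carried) or be pushed along with the repeats (shifted).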
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
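It flags the rows whose id and whose time have both already appeared in an earlier row (duplicated() checks all earlier rows, not only the previous one) and adds 1 to time in those rows. Because it adds exactly 1, a value that occurs three or more times is not spread out any further.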
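A small check on made-up data illustrates how the duplicated()-based assignment behaves when a value occurs three times: every repeat is shifted by exactly 1, so two rows still share a value afterwards.

```r
# Made-up data: the same time three times for one id
df <- data.frame(id = c(2, 2, 2), time = c(6, 6, 6))

dup <- duplicated(df$id) & duplicated(df$time)  # FALSE TRUE TRUE
df$time[dup] <- df$time[dup] + 1

print(df$time)  # 6 7 7
```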
You can use dplyr's mutate with cumsum(duplicated())
data %>%
group_by(id) %>%
mutate(time = time + cumsum(duplicated(time))) %>%
ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
I am not able to filter on the logical column below and wanted to check whether there is a way to do it. If I filter on FALSE, each row where Var is NA comes back as all NA. How can I keep the rows where Var is NA but retain the values in the other columns?
asd <- data.frame(Cat = c("A","B","B","A","B","A"), Start_num = c(2, 5, 1, 6, 6, 4), End_num = c(3, 7, 4, 7, 8, 5))
new <- asd %>% arrange(Cat,Start_num) %>%
group_by(Cat) %>%
mutate(Var=lead(Start_num)>End_num)
new <- as.data.frame(new)
new[new$Var != FALSE,]
Cat Start_num End_num Var
1 A 2 3 TRUE
2 A 4 5 TRUE
NA <NA> NA NA NA
4 B 1 4 TRUE
NA.1 <NA> NA NA NA
If you want to subset on the TRUE values while the column also contains NA values, you need to exclude the NAs explicitly as well as test for TRUE:
new[!is.na(new$Var) & new$Var == TRUE,]
# A tibble: 3 x 4
# Groups: Cat [2]
Cat Start_num End_num Var
<chr> <dbl> <dbl> <lgl>
1 A 2 3 TRUE
2 A 4 5 TRUE
3 B 1 4 TRUE
Why not use %in%?
new[new$Var %in% TRUE, ]
# Cat Start_num End_num Var
# 1 A 2 3 TRUE
# 2 A 4 5 TRUE
# 4 B 1 4 TRUE
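The all-NA rows come from how base R handles NA in a logical index: an NA in the index returns a row of NAs rather than dropping the row. The sketch below uses a small made-up data frame (x, v and n are illustrative names) to compare the options, including one that keeps the NA rows with their other columns intact.

```r
x <- data.frame(v = c(TRUE, NA, FALSE), n = 1:3)

bad <- x[x$v != FALSE, ]      # index is TRUE, NA, FALSE: the NA yields an all-NA row
print(bad)

only_true <- x[which(x$v), ]  # which() drops NA, so only row 1 remains
same      <- x[x$v %in% TRUE, ]

# Keep TRUE rows and NA rows, with the other columns preserved
true_or_na <- x[is.na(x$v) | x$v, ]
print(true_or_na$n)  # 1 2
```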
I have a large dataframe. As an example:
Week <- c(1, 1, 1, 2, 2, 2)
Outcome <- c( FALSE, FALSE , TRUE , TRUE, FALSE, FALSE)
df <- data.frame(Week, Outcome)
Week Outcome
1 1 FALSE
2 1 FALSE
3 1 TRUE
4 2 TRUE
5 2 FALSE
6 2 FALSE
In Outcome, I would like to change FALSE to NA if Outcome contains a TRUE within the same Week.
So in this case the result will be:
Week Outcome
1 1 NA
2 1 NA
3 1 TRUE
4 2 TRUE
5 2 NA
6 2 NA
Thanks in advance for suggestions.
If you have cases when a week could be all FALSE and you don't want to change those ones, you can do it all in one pass like:
df$Outcome[df$Week %in% unique(df$Week[df$Outcome]) & (!df$Outcome)] <- NA
df
# Week Outcome
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
Extended example:
Week <- c(1, 1, 1, 2, 2, 2, 3, 3)
Outcome <- c( FALSE, FALSE , TRUE , TRUE, FALSE, FALSE, FALSE, FALSE)
df2 <- data.frame(Week, Outcome)
df2$Outcome[df2$Week %in% unique(df2$Week[df2$Outcome]) & (!df2$Outcome)] <- NA
df2
# Week Outcome
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
#7 3 FALSE
#8 3 FALSE
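The one-line assignment packs three steps together. Spelled out in base R on the extended example (the names weeks_with_true and to_blank are illustrative only), it reads roughly as follows.

```r
Week    <- c(1, 1, 1, 2, 2, 2, 3, 3)
Outcome <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

weeks_with_true <- unique(Week[Outcome])              # weeks containing at least one TRUE
to_blank <- Week %in% weeks_with_true & !Outcome      # FALSE rows inside those weeks
Outcome[to_blank] <- NA

print(weeks_with_true)  # 1 2
print(Outcome)          # NA NA TRUE TRUE NA NA FALSE FALSE
```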
We can negate 'Outcome' so that the FALSE positions become TRUE, and use that as an index to assign NA. Note that this replaces every FALSE, whether or not its week contains a TRUE.
df$Outcome[!df$Outcome] <- NA
-output
df
# Week Outcome
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
If some weeks contain no TRUE and should be left unchanged, an option is
library(dplyr)
df2 %>%
group_by(Week) %>%
mutate(Outcome = replace(Outcome, any(Outcome) & !Outcome, NA))
-output
# A tibble: 8 x 2
# Groups: Week [3]
# Week Outcome
# <dbl> <lgl>
#1 1 NA
#2 1 NA
#3 1 TRUE
#4 2 TRUE
#5 2 NA
#6 2 NA
#7 3 FALSE
#8 3 FALSE
Or using base R
df2$Outcome <- with(df2, !(NA^ave(Outcome, Week, FUN = any) & !Outcome))
data
df2 <- structure(list(Week = c(1, 1, 1, 2, 2, 2, 3, 3), Outcome = c(FALSE,
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)), class = "data.frame",
row.names = c(NA,
-8L))
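The base R one-liner relies on NA^0 being 1 and NA^1 being NA, which is compact but easy to misread. A possibly more readable sketch of the same idea uses ifelse() on the per-week any(); has_true is an illustrative name, and both versions are shown side by side so the results can be compared.

```r
df2 <- data.frame(Week = c(1, 1, 1, 2, 2, 2, 3, 3),
                  Outcome = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE))

print(NA^0)  # 1
print(NA^1)  # NA

# Compact version from the answer
compact <- with(df2, !(NA^ave(Outcome, Week, FUN = any) & !Outcome))

# Spelled-out version: does the row's week contain any TRUE?
has_true <- ave(df2$Outcome, df2$Week, FUN = any)
readable <- ifelse(has_true & !df2$Outcome, NA, df2$Outcome)

print(readable)  # NA NA TRUE TRUE NA NA FALSE FALSE
```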
I am trying to use the apply function to rows within a grouped dataframe to check for the existence of other rows within that group that match certain conditions dependent on each row. I am able to get this to work for one group but not for all.
For example, with no grouping:
library(dplyr)
id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2
df <-
df %>%
filter(id == 1) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
In the above code, for each station 2 row, I am trying to check all other rows to see if there exists a timeslot with a value of one greater (for any station). This works as expected.
Then, I go on to apply this to a grouped dataframe:
df <-
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
<int> <int> <int> <lgl>
1 1 1 13 FALSE
2 1 2 14 TRUE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
7 2 1 8 FALSE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 FALSE
11 2 2 16 TRUE
and get some unwanted results. It seems like it is not applied by group and I can't figure out how to fix this. How can I apply this function so that only the other rows within a group are checked? In reality, my dataset is much bigger and the conditions are more complex, so it is not running quickly either.
Thanks in advance
Edit: I should add that I have also tried a solution using the arrange() and lead() functions, but since some timeslot values are shared by many stations in my larger dataset, I could not get this to work.
This seems to work:
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = station == s & ((timeslot + 1) %in% timeslot))
# # A tibble: 11 x 4
# # Groups: id [2]
# id station timeslot match
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 13 FALSE
# 2 1 2 14 FALSE
# 3 1 3 20 FALSE
# 4 1 3 21 FALSE
# 5 1 2 23 TRUE
# 6 1 2 24 FALSE
# 7 2 1 8 FALSE
# 8 2 1 9 FALSE
# 9 2 3 10 FALSE
# 10 2 2 15 TRUE
# 11 2 2 16 FALSE
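The original attempt most likely misbehaved because . inside the piped mutate() refers to the whole data frame rather than the current group, so .$timeslot contained the timeslots of every id. The grouped %in% test avoids that. For comparison, here is a base R sketch of the same per-id check using ave(); the name has_next is an assumption for illustration, and the rows are left in their original (unsorted) order.

```r
id       <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station  <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2

# Per id: is timeslot + 1 present among that id's timeslots?
# ave() returns numeric 0/1 here because timeslot is numeric, hence as.logical()
has_next <- as.logical(ave(df$timeslot, df$id, FUN = function(t) (t + 1) %in% t))

df$match <- df$station == s & has_next
print(df[df$match, ])  # the rows with timeslot 23 (id 1) and 15 (id 2)
```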
My sincere apologies if I understood the question wrong. This does what I understand from the question:
df$match = apply(df, 1, function(line) any(df$id == line[1] &
df$station == line[2] &
df$timeslot == line[3] + 1))
The result then is
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 TRUE
4 1 3 21 FALSE
5 1 2 24 FALSE
6 1 2 23 TRUE
7 2 1 8 TRUE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 TRUE
11 2 2 16 FALSE
I have a dataframe with a lot of variables seen in multiple conditions. I'd like to merge each variable by condition.
The example data frame is a simplified version of what I have (3 variables over 2 conditions).
VAR.B_1 <- c(1, 2, 3, 4, 5, 'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_2 <- c(2, 2, 3, 4, 5,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_3 <- c(1, 1, 1, 1, 1,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.E_1 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
VAR.E_2 <- c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5)
VAR.E_3 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
Condition <- c("B", "B","B","B","B","E","E","E","E","E")
#Example dataset
data<-as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3, Condition))
I want to end up with this, appended to the original data frame:
VAR_1 VAR_2 VAR_3
1 2 1
2 2 1
3 3 1
4 4 1
5 5 1
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
I understand that R won't work with i inside the variable name, but I have an example of the kind of for loop I was trying to do. I would rather not call variables by column location, since there will be a lot of variables.
##Example of how I want to merge - this code does not work
for(i in 1:3) {
data$VAR_[,i] <-ifelse(data$Condition == "B", VAR.B_[,i],
ifelse(data$Condition == "E", VAR.E_[,i], NA))
}
This might work for your situation:
library(tidyverse)
library(stringr)
data %>%
mutate_all(as.character) %>%
gather(key, value, -Condition) %>%
filter(!is.na(value), value != "NA") %>%
mutate(key = str_replace(key, paste0("\\.", Condition), "")) %>%
group_by(Condition, key) %>%
mutate(rowid = 1:n()) %>%
spread(key, value) %>%
bind_cols(data)
#> # A tibble: 10 x 12
#> # Groups: Condition [2]
#> Condition rowid VAR_1 VAR_2 VAR_3 VAR.B_1 VAR.B_2 VAR.B_3 VAR.E_1
#> <chr> <int> <chr> <chr> <chr> <fctr> <fctr> <fctr> <fctr>
#> 1 B 1 1 2 1 1 2 1 NA
#> 2 B 2 2 2 1 2 2 1 NA
#> 3 B 3 3 3 1 3 3 1 NA
#> 4 B 4 4 4 1 4 4 1 NA
#> 5 B 5 5 5 1 5 5 1 NA
#> 6 E 1 1 1 1 NA NA NA 1
#> 7 E 2 1 2 1 NA NA NA 1
#> 8 E 3 1 3 1 NA NA NA 1
#> 9 E 4 1 4 1 NA NA NA 1
#> 10 E 5 1 5 1 NA NA NA 1
#> # ... with 3 more variables: VAR.E_2 <fctr>, VAR.E_3 <fctr>,
#> # Condition1 <fctr>
data.frame(lapply(split.default(data[-NCOL(data)], gsub("\\D+", "", head(names(data), -1))),
function(a){
a = sapply(a, function(x) as.numeric(as.character(x)))
rowSums(a, na.rm = TRUE)
}))
# X1 X2 X3
#1 1 2 1
#2 2 2 1
#3 3 3 1
#4 4 4 1
#5 5 5 1
#6 1 1 1
#7 1 2 1
#8 1 3 1
#9 1 4 1
#10 1 5 1
#Warning messages:
#1: In FUN(X[[i]], ...) : NAs introduced by coercion
#2: In FUN(X[[i]], ...) : NAs introduced by coercion
#3: In FUN(X[[i]], ...) : NAs introduced by coercion
Your data appears to have two kinds of NA values in it. It has NA, or R's NA value, and it also has the string 'NA'. In my solution below, I replace both with zero, cast each column in the data frame to numeric, and then just sum together like-numbered VAR columns. Then, drop the original columns which you don't want anymore.
data <- as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3),
stringsAsFactors=FALSE)
data[is.na(data)] <- 0
data[data == 'NA'] <- 0
data <- as.data.frame(lapply(data, as.numeric))
data$VAR_1 <- data$VAR.B_1 + data$VAR.E_1
data$VAR_2 <- data$VAR.B_2 + data$VAR.E_2
data$VAR_3 <- data$VAR.B_3 + data$VAR.E_3
data <- data[c("VAR_1", "VAR_2", "VAR_3")]
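If the number of variables is large, the loop from the question can be made to work by building the column names with paste0() and accessing them with [[ ]] instead of $. The following is a sketch on a reduced, made-up version of the data (two variables, four rows); it assumes the columns follow the VAR.B_i / VAR.E_i naming pattern and that the values are stored as character strings.

```r
data <- data.frame(
  VAR.B_1 = c("1", "2", "NA", "NA"),
  VAR.B_2 = c("2", "2", "NA", "NA"),
  VAR.E_1 = c(NA, NA, "1", "1"),
  VAR.E_2 = c(NA, NA, "4", "5"),
  Condition = c("B", "B", "E", "E"),
  stringsAsFactors = FALSE
)

for (i in 1:2) {
  b <- data[[paste0("VAR.B_", i)]]  # column selected by a constructed name
  e <- data[[paste0("VAR.E_", i)]]
  merged <- ifelse(data$Condition == "B", b,
                   ifelse(data$Condition == "E", e, NA))
  data[[paste0("VAR_", i)]] <- as.numeric(merged)
}

print(data[c("VAR_1", "VAR_2")])
```

Because the value is picked by Condition, the 'NA' strings in the unused columns are never converted, so as.numeric() does not warn on this data.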