I have a data frame df in which I need lag values so that I can compute the difference between times.
df
ColA ColB Lag(ColB)
1 11:00:12 11:00:13
1 11:00:13 11:00:14
1 11:00:14 NA
2 11:00:15 11:00:16
2 11:00:16 11:00:17
2 11:00:17 NA
3 11:00:18 11:00:19
3 11:00:19 11:00:20
3 11:00:20 NA
I need the lag only within each unique value of ColA. As you can see, whenever ColA changes (from 1 to 2, and from 2 to 3), the lag is NA. Is it possible to achieve this?
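As mentioned by @Sotos, you need to group by ColA before creating the lag column, and then calculate the time difference. (Your Lag(ColB) column holds the next value within each group, which is what dplyr calls lead.)
Using the dplyr and lubridate packages, you can calculate the time difference by group: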
library(dplyr)
library(lubridate)
df %>% group_by(ColA) %>% mutate(NewLag = lead(ColB)) %>%
mutate(diff = hms(NewLag)-hms(ColB))
# A tibble: 9 x 5
# Groups: ColA [3]
ColA ColB `Lag(ColB)` NewLag diff
<int> <chr> <chr> <chr> <dbl>
1 1 11:00:12 11:00:13 11:00:13 1
2 1 11:00:13 11:00:14 11:00:14 1
3 1 11:00:14 NA NA NA
4 2 11:00:15 11:00:16 11:00:16 1
5 2 11:00:16 11:00:17 11:00:17 1
6 2 11:00:17 NA NA NA
7 3 11:00:18 11:00:19 11:00:19 1
8 3 11:00:19 11:00:20 11:00:20 1
9 3 11:00:20 NA NA NA
Is this what you are looking for?
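For readers without dplyr or lubridate, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the helper to_time and the dummy date 1970-01-01 are introduced here and are not part of the answer above.

```r
# Rebuild the question's data (ColA and ColB only).
df <- data.frame(
  ColA = rep(1:3, each = 3),
  ColB = sprintf("11:00:%02d", 12:20),
  stringsAsFactors = FALSE
)

# Within each ColA group, shift the values up by one and pad with NA
# (the same effect as dplyr::lead inside group_by).
df$NewLag <- ave(df$ColB, df$ColA, FUN = function(x) c(x[-1], NA))

# Hypothetical helper: parse "HH:MM:SS" on a fixed dummy date so that
# two times can be subtracted; input that cannot be parsed becomes NA.
to_time <- function(x) {
  as.POSIXct(paste("1970-01-01", x), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# Difference in seconds between the next time and the current time.
df$diff <- as.numeric(difftime(to_time(df$NewLag), to_time(df$ColB), units = "secs"))
df
```

Because ave writes each group's result back into the original row positions, the last row of every group gets NA without any extra bookkeeping.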
Example Data
df <- structure(list(ColA = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
ColB = c("11:00:12", "11:00:13", "11:00:14", "11:00:15",
"11:00:16", "11:00:17", "11:00:18", "11:00:19", "11:00:20"
), `Lag(ColB)` = c("11:00:13", "11:00:14", NA, "11:00:16",
"11:00:17", NA, "11:00:19", "11:00:20", NA)), row.names = c(NA,
-9L), class = "data.frame")
I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
Or in base R, using ave:
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
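The reason ave avoids the ordering problem from the question is that it writes each group's result back into the original row positions, whereas unlist(tapply(...)) concatenates the groups in the order of the grouping levels. A minimal sketch of that, where the helper name ord is an assumption made up for this illustration:

```r
val <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L),
  Site = c(2L, 2L, 2L, 1L, 1L, 1L),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-02", "2020-01-03", "2020-01-03"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank of each value among the unique values,
# in order of first appearance.
ord <- function(x) match(x, unique(x))

d <- as.integer(as.Date(val$Date))

# Same computation with the grouping variables in either order.
want_a <- ave(d, val$Subject, val$Site, FUN = ord)
want_b <- ave(d, val$Site, val$Subject, FUN = ord)

# Both follow the row order of val, so they agree.
identical(want_a, want_b)
```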
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
I have two data frames
df1:
01.2020 02.2020 03.2020
11190 4 1 2
12345 3 3 1
11323 1 2 2
df2
08.2020 04.2020 09.2020
11190 1 2 2
12345 1 2 3
11324 1 2 2
Dummy Data -
df1 <- structure(list(`01.2020` = c(4L, 3L, 1L), `02.2020` = c(1L, 3L, 2L), `03.2020` = c(2L, 1L, 2L)), class = "data.frame", row.names = c("11190","12345", "11323"))
df2 <- structure(list(`08.2020` = c(1L, 1L, 1L), `04.2020` = c(2L, 2L, 2L), `09.2020` = c(2L, 3L, 2L)), class = "data.frame", row.names = c("11190", "12345", "11324"))
I want to "outer merge" these two data frames, using the index (the row names) as the key.
How can we do that? What should go in place of by=?
merge(x = sheet1_UN, y = sheet2_UN, by = "" , all = TRUE)
I want my final dataframe to look something like this
01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
11190 4 1 2 1 2 2
12345 3 3 1 1 2 3
11323 1 2 2 - - -
11324 - - - 1 2 2
Thanks in advance.
Another method:
df3 <- merge(df1, df2, by = "row.names", all = TRUE)
output:
Row.names 01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
1 11190 4 1 2 1 2 2
2 11323 1 2 2 NA NA NA
3 11324 NA NA NA 1 2 2
4 12345 3 3 1 1 2 3
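If the keys should end up as row names again rather than in a Row.names column, one possible follow-up step is sketched below. It rebuilds the dummy data so that it runs on its own; moving the key column back is an addition for illustration, not part of the answer above.

```r
df1 <- data.frame(`01.2020` = c(4L, 3L, 1L), `02.2020` = c(1L, 3L, 2L),
                  `03.2020` = c(2L, 1L, 2L),
                  row.names = c("11190", "12345", "11323"), check.names = FALSE)
df2 <- data.frame(`08.2020` = c(1L, 1L, 1L), `04.2020` = c(2L, 2L, 2L),
                  `09.2020` = c(2L, 3L, 2L),
                  row.names = c("11190", "12345", "11324"), check.names = FALSE)

# Full outer join on the row names; the key lands in a "Row.names" column.
df3 <- merge(df1, df2, by = "row.names", all = TRUE)

# Move the key back into the row names and drop the helper column.
rownames(df3) <- as.character(df3$Row.names)
df3$Row.names <- NULL
df3
```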
This should do:
library(dplyr) # full_join
library(tibble) # rownames_to_column
df1 %>% rownames_to_column('id') %>%
full_join(df2 %>% rownames_to_column('id'), by='id')
output:
id 01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
1 11190 4 1 2 1 2 2
2 12345 3 3 1 1 2 3
3 11323 1 2 2 NA NA NA
4 11324 NA NA NA 1 2 2
You might use replace_na('-') from tidyr if you want no NA values (note that this converts every column to character), like this:
library(tidyr) # replace_na
df1 %>% rownames_to_column('id') %>%
full_join(df2 %>% rownames_to_column('id'), by='id') %>%
mutate(across(everything(), ~.x %>% as.character %>% replace_na('-')))
I have the following table:
id_question id_event num_events
2015012713 49508 1
2015012711 49708 1
2015011523 41808 3
2015011523 44008 3
2015011523 44108 3
2015011522 41508 3
2015011522 43608 3
2015011522 43708 3
2015011521 39708 1
2015011519 44208 1
The third column gives the count of events by question. I want to create a variable that would index the events by question only where there are multiple events per question. It would look something like that:
id_question id_event num_events index_event
2015012713 49508 1
2015012711 49708 1
2015011523 41808 3 1
2015011523 44008 3 2
2015011523 44108 3 3
2015011522 41508 3 1
2015011522 43608 3 2
2015011522 43708 3 3
2015011521 39708 1
2015011519 44208 1
How can I do that?
We can use tidyverse to create 'index_event' after grouping by 'id_question'. If the number of rows in a group is greater than 1 (n() > 1), we take the sequence of rows (row_number()); otherwise the default of case_when applies, which is NA.
library(dplyr)
df1 %>%
group_by(id_question) %>%
mutate(index_event = case_when(n() >1 ~ row_number()))
# A tibble: 10 x 4
# Groups: id_question [6]
# id_question id_event num_events index_event
# <int> <int> <int> <int>
# 1 2015012713 49508 1 NA
# 2 2015012711 49708 1 NA
# 3 2015011523 41808 3 1
# 4 2015011523 44008 3 2
# 5 2015011523 44108 3 3
# 6 2015011522 41508 3 1
# 7 2015011522 43608 3 2
# 8 2015011522 43708 3 3
# 9 2015011521 39708 1 NA
#10 2015011519 44208 1 NA
Or with data.table, we use rowid on 'id_question' and turn the result into NA wherever 'num_events' is 1 by multiplying with NA^(num_events == 1). This relies on NA^0 being 1 and NA^1 being NA.
library(data.table)
setDT(df1)[, index_event := rowid(id_question) * NA^(num_events == 1)]
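Or using base R, another option is to take the sequence of the frequencies of 'id_question' and change elements to NA as in the previous case. Note that table() orders its counts by the sorted values of 'id_question', so this lines up with the rows only when the group sizes in row order match those in sorted order, as they do in this data.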
df1$index_event <- with(df1, sequence(table(id_question)) * NA^(num_events == 1))
df1$index_event
#[1] NA NA 1 2 3 1 2 3 NA NA
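The NA^ trick used in the data.table and base R versions can be checked in isolation. A small sketch with made-up vectors (idx and num_events below are illustrative values, not the full data):

```r
# In R, anything raised to the power 0 is 1, including NA.
NA^0
# NA raised to the power 1 stays NA.
NA^1

# Illustrative values: one single-event question, then a three-event one.
num_events <- c(1L, 3L, 3L, 3L)
idx <- c(1L, 1L, 2L, 3L)

# (num_events == 1) is TRUE/FALSE, i.e. 1/0 as an exponent, so the
# multiplier is NA where num_events is 1 and 1 everywhere else.
res <- idx * NA^(num_events == 1)
res
```

The result is a double vector, since NA^x returns a numeric value; wrap it in as.integer() if an integer column is wanted.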
data
df1 <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L,
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L,
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L,
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
If num_events is 1 you can return NA; otherwise create a row index within each id_question.
This can be done in base R:
df$index_event <- with(df, ave(num_events == 1, id_question,
FUN = function(x) replace(seq_along(x), x, NA)))
df
# id_question id_event num_events index_event
#1 2015012713 49508 1 NA
#2 2015012711 49708 1 NA
#3 2015011523 41808 3 1
#4 2015011523 44008 3 2
#5 2015011523 44108 3 3
#6 2015011522 41508 3 1
#7 2015011522 43608 3 2
#8 2015011522 43708 3 3
#9 2015011521 39708 1 NA
#10 2015011519 44208 1 NA
dplyr :
library(dplyr)
df %>%
group_by(id_question) %>%
mutate(index_event = if_else(num_events == 1, NA_integer_, row_number()))
Or data.table :
library(data.table)
setDT(df)
df[,index_event := ifelse(num_events == 1, NA_integer_, seq_len(.N)), id_question]
data
df <- structure(list(id_question = c(2015012713L, 2015012711L, 2015011523L,
2015011523L, 2015011523L, 2015011522L, 2015011522L, 2015011522L,
2015011521L, 2015011519L), id_event = c(49508L, 49708L, 41808L,
44008L, 44108L, 41508L, 43608L, 43708L, 39708L, 44208L), num_events = c(1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L)),class = "data.frame",row.names = c(NA, -10L))
I have a data table with 4 columns: ID, Name, Rate1, Rate2.
I want to remove duplicates where ID, Rate1, and Rate2 are the same, but if both rates are NA, I would like to keep both rows.
Basically, I want to conditionally remove duplicates, but only when the rates are not NA.
For example, I would like this:
ID Name Rate1 Rate2
1 Xyz 1 2
1 Abc 1 2
2 Def NA NA
2 Lmn NA NA
3 Hij 3 5
3 Qrs 3 7
to become this:
ID Name Rate1 Rate2
1 Xyz 1 2
2 Def NA NA
2 Lmn NA NA
3 Hij 3 5
3 Qrs 3 7
Thanks in advance!
EDIT: I know it's possible to just take a subset of the data table where the Rates are NA, then remove duplicates on what's left, then add the NA rows back in - but, I would rather avoid this strategy. This is because in reality there are quite a few couplets of rates that I want to do this for consecutively.
EDIT2: Added in some more rows to the example for clarity.
A base R option is to call duplicated on the dataset without the 'Name' column (i.e. dropping column index 2) to create a logical vector, and negate it (!, so that TRUE becomes FALSE and vice versa) so that TRUE marks the non-duplicated rows. Along with that, create a second condition with rowSums on a logical matrix (is.na(df1[3:4]), the Rate columns) to find the rows that are all NA; here we compare the row sums with 2, i.e. the number of Rate columns in the dataset. The two conditions are joined by | to create the expected logical index:
i1 <- !duplicated(df1[-2])| rowSums(is.na(df1[3:4])) == 2
df1[i1,]
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
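To see how the two conditions combine, the intermediate vectors can be inspected one at a time. The sketch below uses the six-row example from the question; the names not_dup, all_na and keep are chosen here for illustration only.

```r
df2 <- data.frame(
  ID = c(1L, 1L, 2L, 2L, 3L, 3L),
  Name = c("Xyz", "Abc", "Def", "Lmn", "Hij", "Qrs"),
  Rate1 = c(1L, 1L, NA, NA, 3L, 3L),
  Rate2 = c(2L, 2L, NA, NA, 5L, 7L),
  stringsAsFactors = FALSE
)

# TRUE for the first occurrence of each (ID, Rate1, Rate2) combination.
not_dup <- !duplicated(df2[-2])

# TRUE for rows where both Rate columns are NA.
all_na <- rowSums(is.na(df2[3:4])) == 2

# Keep a row if it is a first occurrence OR if its rates are all NA.
keep <- not_dup | all_na
df2[keep, ]
```

Only the second row (Abc) is dropped: it repeats the first row's ID and rates, and its rates are not NA.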
Or with Reduce from base R
df1[Reduce(`&`, lapply(df1[3:4], is.na)) | !duplicated(df1[-2]), ]
Wrapping it in a function
f1 <- function(dat, i, method ) {
nm1 <- grep("^Rate", colnames(dat), value = TRUE)
i1 <- !duplicated(dat[-i])
i2 <- switch(method,
"rowSums" = rowSums(is.na(dat[nm1])) == length(nm1),
"Reduce" = Reduce(`&`, lapply(dat[nm1], is.na))
)
i3 <- i1|i2
dat[i3,]
}
-testing
f1(df1, 2, "rowSums")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
f1(df1, 2, "Reduce")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
f1(df2, 2, "rowSums")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
#5 3 Hij 3 5
#6 3 Qrs 3 7
f1(df2, 2, "Reduce")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
#5 3 Hij 3 5
#6 3 Qrs 3 7
If there are multiple 'Rate' columns (say 100 or more), the only thing to change in the first solution is the 2, which should become the number of 'Rate' columns.
Or using tidyverse
library(tidyverse)
df1 %>%
group_by(ID) %>%
filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 3 x 4
# Groups: ID [2]
# ID Name Rate1 Rate2
# <int> <chr> <int> <int>
#1 1 Xyz 1 2
#2 2 Def NA NA
#3 2 Lmn NA NA
df2 %>%
group_by(ID) %>%
filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 5 x 4
# Groups: ID [3]
# ID Name Rate1 Rate2
# <int> <chr> <int> <int>
#1 1 Xyz 1 2
#2 2 Def NA NA
#3 2 Lmn NA NA
#4 3 Hij 3 5
#5 3 Qrs 3 7
As @Paul mentioned in the comments, the updated tidyverse syntax as of Nov 4 2021 is
library(dplyr)
df2 %>%
group_by(ID) %>%
filter(if_any(c(Rate1, Rate2), ~ !duplicated(.)|is.na(.)))
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L), Name = c("Xyz", "Abc",
"Def", "Lmn"), Rate1 = c(1L, 1L, NA, NA), Rate2 = c(2L, 2L, NA,
NA)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Name = c("Xyz",
"Abc", "Def", "Lmn", "Hij", "Qrs"), Rate1 = c(1L, 1L, NA, NA,
3L, 3L), Rate2 = c(2L, 2L, NA, NA, 5L, 7L)), class = "data.frame",
row.names = c(NA, -6L))
For a given data table (see the sample below), I want to keep only the rows where Difference is 2 or greater, by Unique_ID, without deleting the NA rows of those groups.
My_data_table <- structure(list(Unique_ID = structure(c(1L, 1L, 2L, 2L, 3L,
3L, 3L, 4L, 4L, 4L), .Label = c("1AA", "3AA", "5AA", "6AA"),
class = "factor"), Distance.km. = c(1, 2.05, 2, 4, 2, 4, 7,
8, 9, 10), Difference = c(NA, 1.05, NA, 2, NA, 2, 3, NA, 1, 1)),
.Names = c("Unique_ID", "Distance.km.", "Difference"),
class = "data.frame", row.names = c(NA, -10L))
My_data_table
Unique_ID Distance(km) Difference
1AA 1 NA
1AA 2.05 1.05
3AA 2 NA
3AA 4 2
5AA 2 NA
5AA 4 2
5AA 7 3
6AA 8 NA
6AA 9 1
6AA 10 1
Here is the result I'm looking for:
My_data_table
Unique_ID Distance(km) Difference
3AA 2 NA
3AA 4 2
5AA 2 NA
5AA 4 2
5AA 7 3
After converting to 'data.table' (setDT(df1)) and grouping by 'Unique_ID', if the sum of the logical vector (Difference >= 2) is greater than 0, we get the Subset of Data.table (.SD) where 'Difference' is either NA or (|) greater than or equal to 2. Here df1 stands for the question's My_data_table.
library(data.table)
setDT(df1)[, if(sum(Difference >=2, na.rm = TRUE)>0)
.SD[is.na(Difference)|Difference>=2], by = Unique_ID]
# Unique_ID Distance.km. Difference
#1: 3AA 2 NA
#2: 3AA 4 2
#3: 5AA 2 NA
#4: 5AA 4 2
#5: 5AA 7 3
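The same two-step logic (first decide which groups qualify, then filter rows inside them) can also be sketched in base R with ave. This is an illustrative equivalent rather than the answer's own method; the names d, big, has_big and keep are assumptions, and Unique_ID is stored as character here instead of factor.

```r
d <- data.frame(
  Unique_ID = c("1AA", "1AA", "3AA", "3AA", "5AA", "5AA", "5AA",
                "6AA", "6AA", "6AA"),
  Distance.km. = c(1, 2.05, 2, 4, 2, 4, 7, 8, 9, 10),
  Difference = c(NA, 1.05, NA, 2, NA, 2, 3, NA, 1, 1),
  stringsAsFactors = FALSE
)

# TRUE where Difference is present and at least 2.
big <- !is.na(d$Difference) & d$Difference >= 2

# TRUE for every row of a group that has at least one such value.
has_big <- ave(big, d$Unique_ID, FUN = any)

# Within qualifying groups, keep the NA rows and the rows with Difference >= 2.
keep <- has_big & (is.na(d$Difference) | big)
d[keep, ]
```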
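A dplyr solution (df stands for the question's My_data_table). Note that this keeps every row of any group that has at least one Difference >= 2: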
library(dplyr)
df %>%
group_by(Unique_ID) %>%
filter(any(Difference >= 2 & !is.na(Difference)))
# # A tibble: 5 x 3
# # Groups: Unique_ID [2]
# Unique_ID Distance.km. Difference
# <fctr> <dbl> <dbl>
# 1 3AA 2 NA
# 2 3AA 4 2
# 3 5AA 2 NA
# 4 5AA 4 2
# 5 5AA 7 3