Subset lagged values in R

For the data table below, I only want to keep, within each Unique_ID, the rows where Difference is greater than or equal to 2, without deleting the NA rows.
My_data_table <- structure(list(Unique_ID = structure(c(1L, 1L, 2L, 2L, 3L,
3L, 3L, 4L, 4L, 4L), .Label = c("1AA", "3AA", "5AA", "6AA"),
class = "factor"), Distance.km. = c(1, 2.05, 2, 4, 2, 4, 7,
8, 9, 10), Difference = c(NA, 1.05, NA, 2, NA, 2, 3, NA, 1, 1)),
.Names = c("Unique_ID", "Distance.km.", "Difference"),
class = "data.frame", row.names = c(NA, -10L))
My_data_table
Unique_ID Distance(km) Difference
1AA 1 NA
1AA 2.05 1.05
3AA 2 NA
3AA 4 2
5AA 2 NA
5AA 4 2
5AA 7 3
6AA 8 NA
6AA 9 1
6AA 10 1
Here is the result I'm looking for:
My_data_table
Unique_ID Distance(km) Difference
3AA 2 NA
3AA 4 2
5AA 2 NA
5AA 4 2
5AA 7 3

After converting to data.table with setDT(My_data_table), group by Unique_ID. For each group, if the sum of the logical vector Difference >= 2 is greater than 0, return the subset of the data.table (.SD) where Difference is either NA or greater than or equal to 2.
library(data.table)
setDT(My_data_table)[, if (sum(Difference >= 2, na.rm = TRUE) > 0)
  .SD[is.na(Difference) | Difference >= 2], by = Unique_ID]
# Unique_ID Distance.km. Difference
#1: 3AA 2 NA
#2: 3AA 4 2
#3: 5AA 2 NA
#4: 5AA 4 2
#5: 5AA 7 3

A dplyr solution (filter(any(...)) keeps every row of each group that contains at least one non-NA Difference of 2 or more):
library(dplyr)
My_data_table %>%
  group_by(Unique_ID) %>%
  filter(any(Difference >= 2 & !is.na(Difference)))
# # A tibble: 5 x 3
# # Groups: Unique_ID [2]
# Unique_ID Distance.km. Difference
# <fctr> <dbl> <dbl>
# 1 3AA 2 NA
# 2 3AA 4 2
# 3 5AA 2 NA
# 4 5AA 4 2
# 5 5AA 7 3
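For completeness, the same logic can be sketched in base R with ave(), using the data from the question: the group test and the row test are just two logical vectors combined (keep_grp and res are hypothetical names introduced here).

```r
# Data from the question
My_data_table <- data.frame(
  Unique_ID    = c("1AA", "1AA", "3AA", "3AA", "5AA", "5AA", "5AA", "6AA", "6AA", "6AA"),
  Distance.km. = c(1, 2.05, 2, 4, 2, 4, 7, 8, 9, 10),
  Difference   = c(NA, 1.05, NA, 2, NA, 2, 3, NA, 1, 1)
)

# TRUE for every row of a group that contains any Difference >= 2
keep_grp <- ave(My_data_table$Difference >= 2, My_data_table$Unique_ID,
                FUN = function(x) any(x, na.rm = TRUE))

# within those groups, keep rows that are NA or >= 2
res <- My_data_table[keep_grp &
                     (is.na(My_data_table$Difference) | My_data_table$Difference >= 2), ]
```

This returns the same five rows (the 3AA and 5AA groups) as the data.table and dplyr answers.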

Related

How to remove rows when values from a specified column in data set 1 do not match the values of the same column in data set 2, using dplyr

I have two data sets, both with an ID column containing the same kind of IDs. I have already removed rows from the first data set. For the second data set, I would like to use dplyr to remove any rows whose IDs have no match in the first data set.
In other words, whatever is in DF2 must also be in DF1; if it is not, it must be removed from DF2.
For example:
DF1
ID X Y Z
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
DF2
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
DF2 once rows have been removed
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
I used anti_join(), which shows me the rows that differ, but I cannot figure out how to actually remove the rows whose IDs have no match in the first data set using dplyr.
Try with paste
i1 <- do.call(paste, DF2) %in% do.call(paste, DF1)
# if it is only to compare the 'ID' columns
i1 <- DF2$ID %in% DF1$ID
DF3 <- DF2[i1,]
DF3
ID A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 5 5 5 5
5 6 6 6 6
DF4 <- DF2[!i1,]
DF4
ID A B C
4 4 4 4 4
7 7 7 7 7
data
DF1 <- structure(list(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L,
5L, 6L), Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
DF2 <- structure(list(ID = 1:7, A = 1:7, B = 1:7, C = 1:7), class = "data.frame", row.names = c(NA,
-7L))
# Load package
library(dplyr)
# Load dataframes
df1 <- data.frame(
ID = 1:6,
X = 1:6,
Y = 1:6,
Z = 1:6
)
df2 <- data.frame(
ID = 1:7,
X = 1:7,
Y = 1:7,
Z = 1:7
)
# Include all rows in df1
df1 %>%
  left_join(df2)
Joining, by = c("ID", "X", "Y", "Z")
ID X Y Z
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
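Since the question asks for dplyr specifically, its filtering joins are worth mentioning: semi_join() keeps the rows of the first table that have a match in the second, and anti_join() keeps those that don't. A minimal sketch using the DF1/DF2 from the question:

```r
library(dplyr)

# Data from the question
DF1 <- data.frame(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L, 5L, 6L),
                  Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L))
DF2 <- data.frame(ID = 1:7, A = 1:7, B = 1:7, C = 1:7)

# rows of DF2 whose ID also appears in DF1 (the desired result)
kept <- semi_join(DF2, DF1, by = "ID")

# rows of DF2 whose ID has no match in DF1 (what gets removed)
dropped <- anti_join(DF2, DF1, by = "ID")
```

Unlike left_join(), semi_join() never duplicates rows and never adds columns from DF1; it only filters DF2.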

R Merging non-unique columns to consolidate data frame

I'm having issues figuring out how to merge non-unique (duplicate-named) columns that look like this:
2_2  2_3  2_4  2_2  3_2
  1    2    3   NA   NA
  2    3   -1   NA   NA
 NA   NA   NA    3   -2
 NA   NA   NA   -2    4
To make them look like this:
2_2  2_3  2_4  3_2
  1    2    3   NA
  2    3   -1   NA
  3   NA   NA   -2
 -2   NA   NA    4
Essentially reshaping any non-unique columns. I have a large data set to work with so this is becoming an issue!
Note that data.frame doesn't allow duplicate column names: even if we create them, they may get modified when we apply functions, as make.unique is applied automatically. Assuming we have created the data.frame with duplicate names, one option is to use split.default to split the data into a list of column subsets, then loop over the list with map and apply coalesce:
library(dplyr)
library(purrr)
map_dfc(split.default(df1, names(df1)),~ invoke(coalesce, .x))
Output:
# A tibble: 4 × 4
`2_2` `2_3` `2_4` `3_2`
<int> <int> <int> <int>
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
data
df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA,
NA), `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
`3_2` = c(NA, NA, -2L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
Also using coalesce. Note that you are using non-syntactic names; R is strict about names (see https://adv-r.hadley.nz/names-values.html and the explanation by @akrun above), so after import the duplicate columns typically end up as X2_2 and X2_2.1:
library(dplyr)
df %>%
  mutate(X2_2 = coalesce(X2_2, X2_2.1), .keep = "unused")
X2_2 X2_3 X2_4 X3_2
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
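The same coalescing can also be sketched in base R, without dplyr/purrr: split the columns by name as above, then reduce each group of same-named columns with a small helper (coalesce2 and res are hypothetical names introduced here).

```r
# Data from the question, with duplicate names forced after construction
df1 <- data.frame(c(1L, 2L, NA, NA), c(2L, 3L, NA, NA), c(3L, -1L, NA, NA),
                  c(NA, NA, 3L, -2L), c(NA, NA, -2L, 4L))
names(df1) <- c("2_2", "2_3", "2_4", "2_2", "3_2")

# take the first non-NA value, element-wise
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

# split columns by (duplicated) name, then fold each group into one column
res <- lapply(split.default(df1, names(df1)),
              function(block) Reduce(coalesce2, block))
res <- as.data.frame(res, check.names = FALSE)
```

check.names = FALSE keeps the non-syntactic names like 2_2 intact instead of converting them to X2_2.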

Create lag numbers upto unique values

I have a dataframe df, where I need to have the lag values to get the difference between times
df
ColA ColB Lag(ColB)
1 11:00:12 11:00:13
1 11:00:13 11:00:14
1 11:00:14 NA
2 11:00:15 11:00:16
2 11:00:16 11:00:17
2 11:00:17 NA
3 11:00:18 11:00:19
3 11:00:19 11:00:20
3 11:00:20 NA
I need the lag only within each run of identical values: the moment ColA changes from 1 to 2, or from 2 to 3, the lag should be NA. Is it possible to achieve this?
As mentioned by @Sotos, you need to group by ColA before creating the lag column, and then calculate the time difference.
Using dplyr and lubridate packages, you can calculate diff time by group
library(dplyr)
library(lubridate)
df %>%
  group_by(ColA) %>%
  mutate(NewLag = lead(ColB)) %>%
  mutate(diff = hms(NewLag) - hms(ColB))
# A tibble: 9 x 5
# Groups: ColA [3]
ColA ColB `Lag(ColB)` NewLag diff
<int> <chr> <chr> <chr> <dbl>
1 1 11:00:12 11:00:13 11:00:13 1
2 1 11:00:13 11:00:14 11:00:14 1
3 1 11:00:14 NA NA NA
4 2 11:00:15 11:00:16 11:00:16 1
5 2 11:00:16 11:00:17 11:00:17 1
6 2 11:00:17 NA NA NA
7 3 11:00:18 11:00:19 11:00:19 1
8 3 11:00:19 11:00:20 11:00:20 1
9 3 11:00:20 NA NA NA
Is this what you are looking for?
Example Data
df <- structure(list(ColA = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
ColB = c("11:00:12", "11:00:13", "11:00:14", "11:00:15",
"11:00:16", "11:00:17", "11:00:18", "11:00:19", "11:00:20"
), `Lag(ColB)` = c("11:00:13", "11:00:14", NA, "11:00:16",
"11:00:17", NA, "11:00:19", "11:00:20", NA)), row.names = c(NA,
-9L), class = "data.frame")
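If dplyr/lubridate are not available, the same idea can be sketched in base R: build the within-group lead with ave(), then subtract the times as seconds since midnight (to_sec is a hypothetical helper introduced here, assuming ColB holds "HH:MM:SS" strings).

```r
# Rebuild the example data
df <- data.frame(
  ColA = rep(1:3, each = 3),
  ColB = sprintf("11:00:%02d", 12:20),
  stringsAsFactors = FALSE
)

# next value within each ColA run; NA at each group boundary
df$NewLag <- ave(df$ColB, df$ColA, FUN = function(x) c(x[-1], NA))

# convert "HH:MM:SS" to seconds since midnight (NA stays NA)
to_sec <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

df$diff <- to_sec(df$NewLag) - to_sec(df$ColB)
```

As in the dplyr answer, diff is 1 second within each run and NA at each change of ColA.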

Conditional Column Formatting

I have a data frame that looks like this:
cat df1 df2 df3
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 2 1 NA NA
5 2 2 NA NA
6 2 3 NA NA
I want to populate df3 so that when cat = 1, df3 = df2 and when cat = 2, df3 = df1. However I am getting a few different error messages.
My current code looks like this:
df$df3[df$cat == 1] <- df$df2
df$df3[df$cat == 2] <- df$df1
Try this code (note that the right-hand side must be subset the same way as the left-hand side):
df[df$cat == 1, "df3"] <- df[df$cat == 1, "df2"]
df[df$cat == 2, "df3"] <- df[df$cat == 2, "df1"]
The output:
df
  cat df1 df2 df3
1   1  NA   1   1
2   1  NA   2   2
3   1  NA   3   3
4   2   1  NA   1
5   2   2  NA   2
6   2   3  NA   3
You can try
ifelse(df$cat == 1, df$df2, df$df1)
[1] 1 2 3 1 2 3
# saving
df$df3 <- ifelse(df$cat == 1, df$df2, df$df1)
# if there are other values than 1 and 2 you can try a nested ifelse
# that is setting other values to NA
ifelse(df$cat == 1, df$df2, ifelse(df$cat == 2, df$df1, NA))
# or you can try a tidyverse solution.
library(tidyverse)
df %>%
  mutate(df3 = case_when(cat == 1 ~ df2,
                         cat == 2 ~ df1))
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
# data
df <- structure(list(cat = c(1L, 1L, 1L, 2L, 2L, 2L), df1 = c(NA, NA,
NA, 1L, 2L, 3L), df2 = c(1L, 2L, 3L, NA, NA, NA), df3 = c(NA,
NA, NA, NA, NA, NA)), .Names = c("cat", "df1", "df2", "df3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Select all rows up to and including first occurrence by group in a data frame

I've been scratching my head a little about how to do this. I'm reorganising some unbalanced panel data (stacked/long format). I need to keep all rows up to and including the first occurrence of an indicator value (indc == "D") within each group (id), and also keep the rows for groups where this has not occurred yet. The only rows I wish to discard are, within each group, the second and subsequent occurrences of indc == "D". I also need to keep all the columns in the data frame.
# Data
id<-factor(c(1,1,1,2,2,2,2,2, 3,3,3,3,3,3,4,4))
time<-c(1,2,3,1,2,3,4,5, 1,2,3,4,5,6, 1,2)
indc<-factor(c("C","C","D","C","C","C","D","D","C","C","C","C","D","D","C","C"))
var1<-sample(seq(1,8.5, by=0.5))
var2<-c(rep(1,8),rep(0,8))
df<-data.frame(id,time,indc,var1,var2)
My attempt uses by and match. The problem is that it returns the last variable as a match and the indices for each group; I'm stuck on how to get to the final solution.
attempt<-by(df, df$id, function(x) {match(unique(x$indc=="D"), x$indc=="D")} )
results<-(do.call("rbind", attempt))
The desired result is df2:
df2 <- df[c(1:3, 4:7, 9:13, 15:16), ]
I'd be very grateful if anyone has ideas on a solution.
One option is to use dplyr to group by id and then calculate a cumulative sum of the rows where indc == "D", keeping only the rows where this cumsum is <= 1.
require(dplyr)
df %>% group_by(id) %>% filter(cumsum(indc == "D") <= 1)
#Source: local data frame [14 x 5]
#Groups: id
#
# id time indc var1 var2
#1 1 1 C 1.5 1
#2 1 2 C 1.0 1
#3 1 3 D 7.0 1
#4 2 1 C 2.5 1
#5 2 2 C 3.5 1
#6 2 3 C 6.5 1
#7 2 4 D 3.0 1
#8 3 1 C 2.0 0
#9 3 2 C 7.5 0
#10 3 3 C 6.0 0
#11 3 4 C 8.0 0
#12 3 5 D 8.5 0
#13 4 1 C 4.0 0
#14 4 2 C 4.5 0
Edit #1 after comments:
Thanks to @akrun's comments below, here are two more options for subsetting:
Option 1: using base R:
df[with(df, ave(indc=='D', id, FUN=function(x) cumsum(x)<=1)),]
Option 2: using data.table:
require(data.table)
setDT(df)[,.SD[cumsum(indc=='D')<=1], by=id]
Credit goes to @akrun
Edit #2 after comment by OP:
It was not 100% clear how you want rows removed if, for example, the first "D" has occurred and then there is another row in the same group where "C" occurs (or some other letter). My initial answer would keep such a row if it occurred after the first "D". To change that behavior and remove all rows after the first "D" occurrence, you can simply add another cumsum to the code, like this (for the modified data presented below):
df %>% group_by(id2) %>% filter(cumsum(cumsum(indc2 == "D")) <= 1L)
#Source: local data frame [13 x 5]
#Groups: id2
#
# id2 time2 indc2 var1 var2
#1 1 1 C 8.0 1
#2 1 2 C 5.0 1
#3 1 3 D 7.0 1
#4 2 1 C 1.0 1
#5 2 2 C 2.0 1
#6 2 3 D 9.0 1
#7 3 1 C 4.5 0
#8 3 2 C 3.0 0
#9 3 3 C 7.5 0
#10 3 4 C 1.5 0
#11 3 5 D 4.0 0
#12 4 1 C 6.0 0
#13 4 2 C 6.5 0
data
df <- structure(list(id2 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L), .Label = c("1", "2",
"3", "4"), class = "factor"), time2 = c(1, 2, 3, 4, 1, 2, 3,
4, 5, 1, 2, 3, 4, 5, 6, 1, 2), indc2 = structure(c(1L, 1L, 2L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("C",
"D"), class = "factor"), var1 = c(8, 5, 7, 8.5, 1, 2, 9, 3.5,
2.5, 4.5, 3, 7.5, 1.5, 4, 5.5, 6, 6.5), var2 = c(1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("id2", "time2",
"indc2", "var1", "var2"), row.names = c(NA, -17L), class = "data.frame")
> df
id2 time2 indc2 var1 var2
1 1 1 C 8.0 1
2 1 2 C 5.0 1
3 1 3 D 7.0 1
4 1 4 C 8.5 1 <-- this row will also be removed now
5 2 1 C 1.0 1
6 2 2 C 2.0 1
7 2 3 D 9.0 1
8 2 4 D 3.5 1
9 2 5 D 2.5 0
10 3 1 C 4.5 0
11 3 2 C 3.0 0
12 3 3 C 7.5 0
13 3 4 C 1.5 0
14 3 5 D 4.0 0
15 3 6 D 5.5 0
16 4 1 C 6.0 0
17 4 2 C 6.5 0
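The double cumsum from Edit #2 is worth unpacking with a tiny sketch: the inner cumsum marks every row at or after the first "D", and the outer cumsum then lets exactly one such row (the first "D" itself) through.

```r
# toy indicator vector for one group
indc <- c("C", "C", "D", "C", "D")

cumsum(indc == "D")            # 0 0 1 1 2 : nonzero from the first "D" on
cumsum(cumsum(indc == "D"))    # 0 0 1 2 4 : grows on every row after the first "D"

# <= 1 keeps everything up to and including the first "D", nothing after
keep <- cumsum(cumsum(indc == "D")) <= 1
```

With a single cumsum, the trailing "C" after the first "D" would survive; the second cumsum removes it as well, which is exactly the behavior requested in the OP's comment.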
