Get the previous value within each group into a new column - r

How can I get the previous value within each group into a new column C? The first row of each group should be empty, since it has no previous value in its group.
Can dplyr perform this?
Code:
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
                 B = c("2017-02-20","2018-02-14","2017-02-06","2017-02-27",
                       "2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
Dataframe:
A B
a1 2017-02-20
a1 2018-02-14
b1 2017-02-06
b1 2017-02-27
b1 2017-02-29
c2 2017-02-28
d2 2017-02-09
d2 2017-02-10
Expected Output
A B C
a1 2017-02-20
a1 2018-02-14 2017-02-20
b1 2017-02-06
b1 2017-02-27 2017-02-06
b1 2017-02-29 2017-02-27
c2 2017-02-28
d2 2017-02-09
d2 2017-02-10 2017-02-09

You could use the lag function from dplyr:
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
                 B = c("2017-02-20","2018-02-14","2017-02-06",
                       "2017-02-27","2017-02-29","2017-02-28",
                       "2017-02-09","2017-02-10"))
library(dplyr)
df %>%
  group_by(A) %>%
  mutate(C = lag(B, 1, default = NA))
This applies the lag function within each group of "A".
Output:
# A tibble: 8 x 3
# Groups: A [4]
A B C
<fct> <fct> <fct>
1 a1 2017-02-20 NA
2 a1 2018-02-14 2017-02-20
3 b1 2017-02-06 NA
4 b1 2017-02-27 2017-02-06
5 b1 2017-02-29 2017-02-27
6 c2 2017-02-28 NA
7 d2 2017-02-09 NA
8 d2 2017-02-10 2017-02-09
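Note that lag() uses the current row order, so if the dates are not already sorted within each group you may want to arrange first. A minimal sketch (the shuffled input here is illustrative, not from the question):

```r
library(dplyr)

df <- data.frame(A = c('a1','a1','b1'),
                 B = c("2018-02-14","2017-02-20","2017-02-06"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(A) %>%
  arrange(B, .by_group = TRUE) %>%   # sort within each group first
  mutate(C = lag(B)) %>%             # previous value, NA for the first row
  ungroup()
```

Here res$C is NA for the first row of each group and the previous B otherwise.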

Related

How to create another table in R to calculate the difference?

I have a set of data frame as below:
ID      Parameter  value
123-01  a1         x
123-02  a1         x
123-01  b3         x
123-02  b3         x
124-01  a1         x
125-01  a1         x
126-01  a1         x
124-01  b3         x
125-01  b3         x
126-01  b3         x
I would like to find the sample IDs that end with "-02" and, for each parameter, calculate the difference between the samples that share the same first three digits.
For example, calculate the difference between 123-01 and 123-02 for parameter a1, then the difference between 123-01 and 123-02 for parameter b3, and so on.
In the end, I want a table containing:
ID   Parameter  DiffValue
123  a1         y
123  b3         y
127  a1         y
127  b3         y
How can I do this?
I tried using dplyr's filter() to create a table that only contains the duplicates, but then how do I match it back to the original table and do the calculation?
You could try it this way:
library(tidyverse)
df <- read.table(text = "ID Parameter value
123-01 a1 10
123-02 a1 10
123-01 b3 10
123-02 b3 10
124-01 a1 10
125-01 a1 10
126-01 a1 10
124-01 b3 10
125-01 b3 10
126-01 b3 10", header = T)
df %>%
  arrange(Parameter, ID) %>%
  separate(ID, into = c("id_grp", "n"), sep = "-", remove = F) %>%
  group_by(Parameter, id_grp) %>%
  mutate(diff_value = c(NA, diff(value))) %>%
  select(-c(id_grp, n))
#> Adding missing grouping variables: `id_grp`
#> # A tibble: 10 x 5
#> # Groups: Parameter, id_grp [8]
#> id_grp ID Parameter value diff_value
#> <chr> <chr> <chr> <int> <int>
#> 1 123 123-01 a1 10 NA
#> 2 123 123-02 a1 10 0
#> 3 124 124-01 a1 10 NA
#> 4 125 125-01 a1 10 NA
#> 5 126 126-01 a1 10 NA
#> 6 123 123-01 b3 10 NA
#> 7 123 123-02 b3 10 0
#> 8 124 124-01 b3 10 NA
#> 9 125 125-01 b3 10 NA
#> 10 126 126-01 b3 10 NA
Created on 2021-01-26 by the reprex package (v0.3.0)
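If you only want the summarised table from the question (one row per "-02" sample), you could extend the same idea by filtering on the suffix after the split. A sketch, assuming the "-01" row comes first within each group (arrange beforehand if it does not) and that the difference should be the "-02" value minus the "-01" value:

```r
library(tidyverse)

df <- read.table(text = "ID Parameter value
123-01 a1 10
123-02 a1 12
123-01 b3 10
123-02 b3 15", header = TRUE, stringsAsFactors = FALSE)

out <- df %>%
  separate(ID, into = c("id_grp", "n"), sep = "-", remove = FALSE) %>%
  group_by(Parameter, id_grp) %>%
  mutate(DiffValue = value - first(value)) %>%  # difference vs. the "-01" row
  filter(n == "02") %>%                         # keep only the "-02" rows
  ungroup() %>%
  select(ID = id_grp, Parameter, DiffValue)
```

This yields one row per "-02" sample, keyed by the three-digit prefix.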

Ascending group by date

I am not able to sort my groups by date in ascending order. Please help!
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
                 B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27",
                       "2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
Code:
df %>% group_by(A) %>% arrange(A,(as.Date(B)))
I am getting the wrong result, as the b1 group didn't sort:
A B
<fctr> <fctr>
1 a1 2017-02-20
2 a1 2018-02-14
3 b1 2017-02-06
4 b1 2018-02-27
5 b1 2017-02-29
6 c2 2017-02-28
7 d2 2017-02-09
8 d2 2017-02-10
You can see that 2017-02-29 is not a real date: February 2017 has only 28 days. So when you convert column B to Date, that value becomes NA, and NA values sort to the end of the group. Fix that entry and your code should work.
Also, you probably do not need to group_by A; arrange() alone is enough here.
library(dplyr)
#>
df <- data.frame(A = c('a1','a1','b1','b1','b1','c2','d2','d2'),
                 B = c("2017-02-20","2018-02-14","2017-02-06","2018-02-27",
                       "2017-02-29","2017-02-28","2017-02-09","2017-02-10"))
as.Date(df$B)
#> [1] "2017-02-20" "2018-02-14" "2017-02-06" "2018-02-27" NA
#> [6] "2017-02-28" "2017-02-09" "2017-02-10"
df %>% arrange(A, as.Date(B))
#> A B
#> 1 a1 2017-02-20
#> 2 a1 2018-02-14
#> 3 b1 2017-02-06
#> 4 b1 2018-02-27
#> 5 b1 2017-02-29
#> 6 c2 2017-02-28
#> 7 d2 2017-02-09
#> 8 d2 2017-02-10
Created on 2019-09-16 by the reprex package (v0.2.1)
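A quick way to locate such bad entries before sorting is to check which values fail to parse. A minimal sketch, assuming column B holds "YYYY-MM-DD" strings as in the question:

```r
df <- data.frame(A = c('a1','b1'),
                 B = c("2017-02-20","2017-02-29"),
                 stringsAsFactors = FALSE)

# rows whose B cannot be parsed as a date (e.g. the impossible 2017-02-29)
bad <- df[is.na(as.Date(df$B)), ]
bad
```

Any rows returned here will sort unpredictably (as NA) until their dates are corrected.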

how to perform merge or join operation in R with two different dataframe size

I have two data frames, A and B, of different sizes, and I am trying to implement either a left join or a merge based on certain conditions. Can anyone help me join these two tables in R? I want to use a1, a2 and b1, b2 to join the two data frames.
df A
a1 a2 a3 a4
1 1 2017-04-25 2017-05-24
1 1 2017-05-25 2017-06-24
2 3 2017-04-25 2017-05-24
3 4 2017-04-25 2017-05-24
4 5 2017-04-25 2017-05-24
4 5 2017-05-25 2017-06-24
4 7 2017-04-25 2017-05-24
5 8 2017-04-25 2017-05-24
5 8 2017-05-25 2017-06-24
df B
b1 b2 b3 b4 b5
1 1 2017-04-20 2017-05-02 M
2 3 2017-03-27 2017-05-19 A
3 4 2017-04-20 2017-05-22 B
4 5 2017-04-21 2017-05-12 N
4 7 2017-05-02 2017-05-09 L
5 8 2017-05-15 2017-05-04 U
Dimensions of the first data frame:
> dim(A)
[1] 506335 5
Dimensions of the second data frame:
> dim(B)
[1] 716776 6
I tried the left join below in R:
left_join(A, B, a1=b1, a2 = b2, a3 > b3 , a4 < b4)
Error:
Error in common_by(by, x, y) : object 'b3' not found
I tried a merge operation but get the error below:
merge(A,B,by=c("a1","a2", "a3 > b3" , "a4 < b4"))
Error:
Error in ungroup_grouped_df(x) :
object 'dplyr_ungroup_grouped_df' not found
From what I gather, you are trying to:
1. Merge the data frames by their first two columns.
2. Filter the merged data frame where the conditions a3 > b3 and a4 < b4 are met.
require(dplyr)
DF <- left_join(A,B, a1=b1, a2=b2) %>% filter(a3 > b3 , a4 < b4)
As Andrew Gustar has commented, you are trying to merge and filter at the same time. Instead, do the merge first, then the filter. It also looks like you're working with dates, so they need to be formatted correctly.
The code below can all be carried out in one chain, but I've broken it down to make it easier to understand.
For example, using the tidyverse dplyr and lubridate packages:
library(dplyr)
library(lubridate)
# load in your data
textA <- "a1 a2 a3 a4
1 1 2017-04-25 2017-05-24
1 1 2017-05-25 2017-06-24
2 3 2017-04-25 2017-05-24
3 4 2017-04-25 2017-05-24
4 5 2017-04-25 2017-05-24
4 5 2017-05-25 2017-06-24
4 7 2017-04-25 2017-05-24
5 8 2017-04-25 2017-05-24
5 8 2017-05-25 2017-06-24"
textB <- "b1 b2 b3 b4 b5
1 1 2017-04-20 2017-05-02 M
2 3 2017-03-27 2017-05-19 A
3 4 2017-04-20 2017-05-22 B
4 5 2017-04-21 2017-05-12 N
4 7 2017-05-02 2017-05-09 L
5 8 2017-05-15 2017-05-04 U"
# make dataframes
dfA <- read.table(text = textA, header = T)
dfB <- read.table(text = textB , header = T)
# now do the merging - when merging on more than one column, combine them using c
dfout <- left_join(x = dfA, y = dfB, by = c("a1" = "b1", "a2" = "b2"))
# now switch your a3, a4, b3, and b4 columns to dates format using the ymd function
dfout <- dfout %>% mutate_at(vars(a3:b4), ymd)
# finally the filtering
dfout <- dfout %>% filter(a3 > b3)
This returns:
a1 a2 a3 a4 b3 b4 b5
1 1 1 2017-04-25 2017-05-24 2017-04-20 2017-05-02 M
2 1 1 2017-05-25 2017-06-24 2017-04-20 2017-05-02 M
3 2 3 2017-04-25 2017-05-24 2017-03-27 2017-05-19 A
4 3 4 2017-04-25 2017-05-24 2017-04-20 2017-05-22 B
5 4 5 2017-04-25 2017-05-24 2017-04-21 2017-05-12 N
6 4 5 2017-05-25 2017-06-24 2017-04-21 2017-05-12 N
7 5 8 2017-05-25 2017-06-24 2017-05-15 2017-05-04 U
Note that filtering again (using the code below) on a4 < b4 returns a data frame with 0 rows.
dfout %>% mutate_at(vars(a3:b4), ymd) %>% filter(a3 > b3) %>% filter(a4 < b4)
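If you want the date conditions applied during the join itself rather than in a later filter, data.table's non-equi joins can express them directly. A sketch on a two-row subset of the data (the data.table approach is an alternative to the dplyr answer above, not part of it); note that in a non-equi join the columns named on the left of `on` display the values from the other table, which can be surprising:

```r
library(data.table)

A <- data.table(a1 = c(1, 2), a2 = c(1, 3),
                a3 = as.Date(c("2017-04-25", "2017-04-25")),
                a4 = as.Date(c("2017-05-24", "2017-05-24")))
B <- data.table(b1 = c(1, 2), b2 = c(1, 3),
                b3 = as.Date(c("2017-04-20", "2017-03-27")),
                b4 = as.Date(c("2017-05-02", "2017-05-19")),
                b5 = c("M", "A"))

# inner join on equal keys plus both date conditions (a3 > b3 and a4 < b4)
res  <- B[A, on = .(b1 == a1, b2 == a2, b3 < a3, b4 > a4), nomatch = 0]

# with only the a3 > b3 condition, both rows match
res2 <- B[A, on = .(b1 == a1, b2 == a2, b3 < a3), nomatch = 0]
```

Consistent with the answer above, res is empty here because a4 < b4 holds for no row, while res2 keeps both rows.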

Match in R while disregarding order [duplicate]

This question already has an answer here:
Match Dataframes Excluding Last Non-NA Value and disregarding order
(1 answer)
Closed 5 years ago.
I am trying to do a match in R regardless of the order of the columns.
Basically, the problem I am trying to solve is: if all of the values in df2's columns (from column 2 onward) are found in a row of df1 (after Partner), then match that df1 row.
Here's the catch: disregard the last non-NA value in each row when doing the match, but include it in the final output.
After the match, determine whether that last non-NA value exists anywhere in the matched row.
df1
Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA
df2
lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1
How do I match df1 with df2 so that the following happens:
1) The order of the columns in both data frames is disregarded.
2) It is then determined whether the last non-NA value exists in the matched row.
Final output:
df3
Partner Col1 Col2 Col3 Col4 lift rule1 rule2 rule3 EXIST?
A A1 A2 NA NA 11 A2 A1 A9 YES
A A1 A2 NA NA 10 A1 A3 NA NOPE
B A2 B9 NA NA 11 B9 A2 D7 YES
B A2 B9 NA NA 11 A2 B9 B1 YES
D Q1 Q3 Q4 NA 10 Q4 Q1 NA YES
I get one more B match than you, but this solution is very close to what you want. You first have to add an id column, as we use it to reconstruct the data. To perform the match, melt both data frames with gather from tidyr and use inner_join from dplyr. We then cbind the original data.frames using the ids.
library(tidyr);library(dplyr)
df1 <- read.table(text="Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA",header=TRUE, stringsAsFactors=FALSE)
df2 <- read.table(text="lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1",header=TRUE, stringsAsFactors=FALSE)
df1 <- cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
#melt with gather
d11 <- df1 %>% gather(Col, Value, starts_with("C"))   # long format
d11 <- d11 %>% na.omit() %>% group_by(df1_id) %>% slice(-n())   # remove last non-NA
d22 <- df2 %>% gather(Rule, Value, starts_with("r"))   # long format
res <- inner_join(d11,d22)
cbind(df1[res$df1_id,],df2[res$df2_id,])
df1_id Partner Col1 Col2 Col3 Col4 df2_id lift rule1 rule2 rule3
1 1 A A1 A2 <NA> <NA> 2 10 A1 A3 <NA>
1.1 1 A A1 A2 <NA> <NA> 1 11 A2 A1 A9
2 2 B A2 B9 <NA> <NA> 1 11 A2 A1 A9
2.1 2 B A2 B9 <NA> <NA> 5 11 A2 B9 B1
2.2 2 B A2 B9 <NA> <NA> 3 11 B9 A2 D7
4 4 D Q1 Q3 Q4 <NA> 4 10 Q4 Q1 <NA>

How can we do some calculations using last row within a group in data.table in R?

I have this data.table:
sample:
id cond date
1 A1 2012-11-19
1 A1 2013-05-09
1 A2 2014-09-05
2 B1 2015-03-05
2 B1 2015-07-06
3 A1 2015-02-05
4 B1 2012-09-26
4 B1 2015-02-05
5 B1 2012-09-26
I want to calculate overdue days from today's date within each group of 'id' and 'cond', so I am trying to get the difference in days between the last date in each group and Sys.Date(). The desired output is:
id cond date overdue
1 A1 2012-11-19 NA
1 A1 2013-05-09 832
1 A2 2014-09-05 348
2 B1 2015-03-05 NA
2 B1 2015-07-06 44
3 A1 2015-02-05 195
4 B1 2012-09-26 NA
4 B1 2015-02-05 195
5 B1 2012-09-26 1057
I tried to achieve this by following code:
sample <- sample[ , overdue := Sys.Date() - date[.N], by = c('id','cond')]
But I am getting the following output, where the value is recycled across the whole group:
id cond date overdue
1 A1 2012-11-19 832
1 A1 2013-05-09 832
1 A2 2014-09-05 348
2 B1 2015-03-05 44
2 B1 2015-07-06 44
3 A1 2015-02-05 195
4 B1 2012-09-26 195
4 B1 2015-02-05 195
5 B1 2012-09-26 1057
I am not sure how I can restrict my code to do the calculation only for the last row, without recycling. I am sure there are ways to do this; help is appreciated.
You could make a table of overdue values and the rows they belong in:
bycols = c("id","cond")
newcolDT2 = DT[, Sys.Date() - date[.N], by = bycols]
DT[newcolDT2, overdue := V1, on = bycols, mult = "last"]
# id cond date overdue
# 1: 1 A1 2012-11-19 NA days
# 2: 1 A1 2013-05-09 832 days
# 3: 1 A2 2014-09-05 348 days
# 4: 2 B1 2015-03-05 NA days
# 5: 2 B1 2015-07-06 44 days
# 6: 3 A1 2015-02-05 195 days
# 7: 4 B1 2012-09-26 NA days
# 8: 4 B1 2015-02-05 195 days
# 9: 5 B1 2012-09-26 1057 days
This is the (arguably uglier) one-liner version:
DT[J(unique(DT[, ..bycols])),
overdue := Sys.Date() - date, on = bycols, mult = "last"]
Data:
DT <- data.table(read.table(header=TRUE,text="id cond date
1 A1 2012-11-19
1 A1 2013-05-09
1 A2 2014-09-05
2 B1 2015-03-05
2 B1 2015-07-06
3 A1 2015-02-05
4 B1 2012-09-26
4 B1 2015-02-05
5 B1 2012-09-26"))[, date := as.IDate(date)]
# anyone know how to do this with fread()?
First, extract the rows you're interested in, then assign the values:
rows = DT[, .I[.N], by = .(id, cond)]$V1
DT[rows, overdue := Sys.Date() - date]
DT
# id cond date overdue
#1: 1 A1 2012-11-19 NA days
#2: 1 A1 2013-05-09 832 days
#3: 1 A2 2014-09-05 348 days
#4: 2 B1 2015-03-05 NA days
#5: 2 B1 2015-07-06 44 days
#6: 3 A1 2015-02-05 195 days
#7: 4 B1 2012-09-26 NA days
#8: 4 B1 2015-02-05 195 days
#9: 5 B1 2012-09-26 1057 days
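For comparison, the same "assign only in the last row of each group" idea can be sketched in dplyr (this is an alternative phrasing, not from the original answers); if_else is used rather than ifelse because it enforces a consistent type for both branches:

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 2),
                 cond = c("A1", "A1", "B1"),
                 date = as.Date(c("2012-11-19", "2013-05-09", "2015-03-05")))

res <- df %>%
  group_by(id, cond) %>%
  mutate(overdue = if_else(row_number() == n(),
                           as.integer(Sys.Date() - date),  # days overdue
                           NA_integer_)) %>%               # NA for earlier rows
  ungroup()
```

Only the last row of each (id, cond) group gets a value; the rest stay NA, matching the desired output.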