calculate difference based on id and date - r

I have a data set with id, date and time columns, and I want to calculate the difference in time between each available date, grouped by id. I have looked for similar problems on Stack Overflow and tried a few different approaches, but no luck so far. Any help would be great.
data set:
> dput(mydata)
structure(list(id = c("a", "a", "b", "b", "b", "c"), date = c("2018-04-13",
"2011-11-12", "2019-05-30", "2014-09-13", "2019-06-21", "1998-01-08"
), time = c("50", "40", "30", "20", "10", "30")), class = "data.frame", row.names = c(NA,
-6L))
Desired output:
id date time time_diff
a 2018-04-13 50 10
a 2011-11-12 40 NA/0
b 2019-05-30 30 10
b 2014-09-13 20 NA/0
b 2019-06-21 10 -20
c 1998-01-08 30 NA/0
I understand the earliest date for each id has nothing to take a difference against, so it can be either NA or 0 in this case.
Here is the code that I have tried, but it gives an error:
mydata <- mydata %>%
group_by(id,date) %>%
mutate(time_diff = diff(time))

library(dplyr)
df <- structure(list(
id = c("a", "a", "b", "b", "b", "c"),
date = c("2018-04-13", "2011-11-12", "2019-05-30", "2014-09-13", "2019-06-21", "1998-01-08"),
time = c("50", "40", "30", "20", "10", "30")),
class = "data.frame", row.names = c(NA, -6L))
df %>%
group_by(id) %>%
arrange(id, date) %>%
mutate(
time = as.numeric(time),
time_diff = time - lag(time)
)
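The same consecutive-difference idea can be sketched in base R without dplyr. This is only a minimal sketch: it rebuilds the question's data with date and time already converted, and the names `sorted` and `time_diff` are just illustrative.

```r
# Data from the question, with date and time converted up front
mydata <- data.frame(
  id   = c("a", "a", "b", "b", "b", "c"),
  date = as.Date(c("2018-04-13", "2011-11-12", "2019-05-30",
                   "2014-09-13", "2019-06-21", "1998-01-08")),
  time = as.numeric(c("50", "40", "30", "20", "10", "30"))
)

# Sort by id, then by date, so each row follows its predecessor in time
sorted <- mydata[order(mydata$id, mydata$date), ]

# Within each id: NA for the earliest date, then consecutive differences
sorted$time_diff <- ave(sorted$time, sorted$id,
                        FUN = function(x) c(NA, diff(x)))
sorted
```

Padding with NA keeps the result the same length as each group, which is what `diff()` alone does not do.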

For each id you may subtract the time corresponding to minimum date.
library(dplyr)
mydata %>%
mutate(time = as.numeric(time),
date = as.Date(date)) %>%
group_by(id) %>%
mutate(time_diff = time - time[which.min(date)]) %>%
ungroup
# id date time time_diff
# <chr> <date> <dbl> <dbl>
#1 a 2018-04-13 50 10
#2 a 2011-11-12 40 0
#3 b 2019-05-30 30 10
#4 b 2014-09-13 20 0
#5 b 2019-06-21 10 -10
#6 c 1998-01-08 30 0
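Note that subtracting the time at the earliest date is not the same as differencing consecutive dates. The two agree for groups with one or two rows, but for id b this approach gives -10 where the question's desired output shows -20. A small base R sketch of the difference, using the three time values of id b ordered by date (the variable names are illustrative only):

```r
# time values for id "b", ordered by date
# (2014-09-13, 2019-05-30, 2019-06-21)
time_b <- c(20, 30, 10)

# Difference from the value at the earliest date
baseline_diff <- time_b - time_b[1]

# Difference from the immediately preceding date
consecutive_diff <- c(NA, diff(time_b))

baseline_diff
consecutive_diff
```

Which one is wanted depends on whether "difference" means change since the first observation or change since the previous observation.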

We can use data.table
library(data.table)
mydata <- type.convert(mydata, as.is = TRUE)
setDT(mydata)[, time_diff := time - time[date %in% min(date)], id]
mydata
id date time time_diff
1: a 2018-04-13 50 10
2: a 2011-11-12 40 0
3: b 2019-05-30 30 10
4: b 2014-09-13 20 0
5: b 2019-06-21 10 -10
6: c 1998-01-08 30 0

Related

Replacing NA values with the next value in a column in R

I'm trying to mutate a column in a Dataframe using the lag() function as a condition without producing NA values. Let me create an example:
df <- data.frame("Score" = as.numeric(c("20", "10", "15", "30", "15", "10")),
"Time" = c("1", "2", "1", "2", "1", "2"),
"Team" = c("A", "A", "B", "B", "C", "C"))
After that, I created a new column named Diff that calculates the difference of the Score of every Team:
df <- df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score))
My problem is that this method creates NA values, obviously:
Score Time Team Diff
20 1 A NA
10 2 A -10
15 1 B NA
30 2 B 15
15 1 C NA
10 2 C -5
My goal is to have this at the end:
Score Time Team Diff
20 1 A -10
10 2 A -10
15 1 B 15
30 2 B 15
15 1 C -5
10 2 C -5
I've tried mutating again using the case_when() function to substitute the NA for the next value, but it also didn't work:
df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score)) %>%
mutate(Diff = case_when(
NA ~ lead(Diff)
))
Anyway, how do I make the NA values be replaced by the next Diff value?
Thanks a lot!
Just use fill() after the fact:
library(tidyverse)
df <- data.frame("Score" = as.numeric(c("20", "10", "15", "30", "15", "10")),
"Time" = c("1", "2", "1", "2", "1", "2"),
"Team" = c("A", "A", "B", "B", "C", "C"))
df <- df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score)) %>%
fill(Diff, .direction = 'up')
df
# output
# Score Time Team Diff
# <dbl> <chr> <chr> <dbl>
#1 20 1 A -10
#2 10 2 A -10
#3 15 1 B 15
#4 30 2 B 15
#5 15 1 C -5
#6 10 2 C -5
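The case_when() attempt in the question fails because each case needs a logical condition on the left of `~`, plus a fallback case for the rows that are not NA. A possible corrected version, as a sketch that assumes every team has at least two rows so that lead(Diff) is available for the first row (`res` is just an illustrative name):

```r
library(dplyr)

df <- data.frame(Score = c(20, 10, 15, 30, 15, 10),
                 Time  = c("1", "2", "1", "2", "1", "2"),
                 Team  = c("A", "A", "B", "B", "C", "C"))

res <- df %>%
  group_by(Team) %>%
  mutate(
    Diff = Score - lag(Score),
    # replace the NA in the first row of each team with the next Diff value
    Diff = case_when(is.na(Diff) ~ lead(Diff),
                     TRUE        ~ Diff)
  ) %>%
  ungroup()
res
```

Unlike fill(), this only looks one row ahead, which is enough here because lag() leaves an NA only in the first row of each group.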

how to split a dataframe by specific rows in r

I have data that looks like this:
data <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))
> data
# A tibble: 8 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 A B C D
5 10 20 30 40
6 10 20 30 40
7 B C D NA
8 200 300 400 NA
The rows were bound together incorrectly, and I want to split the data into three sub data frames (d1, d2 and d3) like this:
NOTE: In my real data, d1, d2 and d3 have different numbers of rows. I set nrow(d1) = 3, nrow(d2) = 2 and nrow(d3) = 1 just to simplify the question in this example.
d1 <- data.frame(A = rep(1,3), B = rep(2,3), C = rep(3,3), D = rep(4,3))
d2 <- data.frame(A = rep(10,2), B = rep(20,2), C = rep(30,2), D = rep(40,2))
d3 <- data.frame( B = 200, C = 300, D = 400)
> d1
A B C D
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
> d2
A B C D
1 10 20 30 40
2 10 20 30 40
> d3
B C D
1 200 300 400
And then I could bind them correctly using bind_rows from dplyr
bind_rows(d1, d2, d3) %>% as_tibble()
# A tibble: 6 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 10 20 30 40
5 10 20 30 40
6 NA 200 300 400
The problem is that I don't know how to get d1, d2 and d3 from data.
Any help will be highly appreciated!
Here is a tidyverse solution.
process_df takes a data frame and sets the column names and removes the first row.
process_df <- function(df, ...) {
df %>%
set_names(slice(., 1)) %>%
select(which(!is.na(names(.)))) %>%
slice(-1)
}
Add a header row that just contains the column names.
Use rowwise() and c_across() to get the values of all columns by row. Use this to identify which rows are header rows.
group_map will apply a function over each group and bind_rows will combine the results.
data %>%
add_row(!!!set_names(names(.)), .before = 1) %>%
rowwise() %>%
mutate(
group = all(is.na(c_across()) | c_across() %in% names(.))
) %>%
ungroup() %>%
mutate(group = cumsum(group)) %>%
group_by(group) %>%
group_map(process_df) %>%
bind_rows()
#> # A tibble: 6 x 4
#> A B C D
#> <chr> <chr> <chr> <chr>
#> 1 1 2 3 4
#> 2 1 2 3 4
#> 3 1 2 3 4
#> 4 10 20 30 40
#> 5 10 20 30 40
#> 6 NA 200 300 400
Explanation of the usage of !!! in new_row
set_names(names(.)) creates a named vector that represents the row we want to add. However, add_row doesn't accept a named vector - it wants the values to be specified as arguments.
Here is a simplified example.
new_row <- c(speed = 1, dist = 2)
add_row doesn't accept a named vector, so this doesn't work.
cars %>% add_row(new_row, .before = TRUE)
# (Error)
!!! will unpack the vector as arguments to the function.
cars %>% add_row(!!!new_row, .before = TRUE)
# (Works)
!!! above essentially results in this:
cars %>% add_row(speed = 1, dist = 2, .before = TRUE)
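The header-row idea used in this answer (a row is a header when every non-missing cell is one of the column names) can also be sketched in base R. This is a rough sketch rather than a drop-in replacement: `is_header`, `fix_piece` and `parts` are hypothetical names, and the data is rebuilt as a plain data frame.

```r
data <- data.frame(
  A = c("1", "1", "1", "A", "10", "10", "B", "200"),
  B = c("2", "2", "2", "B", "20", "20", "C", "300"),
  C = c("3", "3", "3", "C", "30", "30", "D", "400"),
  D = c("4", "4", "4", "D", "40", "40", NA, NA),
  stringsAsFactors = FALSE
)

# A row is a header if every non-missing cell is a column name
is_header <- apply(data, 1, function(r) all(is.na(r) | r %in% names(data)))

# Each header row starts a new group; rows before the first header are group 0
pieces <- split(data, cumsum(is_header))

fix_piece <- function(p, has_header) {
  if (has_header) {
    hdr <- unlist(p[1, ])                  # first row holds the real names
    p <- p[-1, !is.na(hdr), drop = FALSE]  # drop header row and unnamed columns
    names(p) <- unname(hdr[!is.na(hdr)])
  }
  rownames(p) <- NULL
  type.convert(p, as.is = TRUE)            # "10" -> 10 and so on
}

# Group "0" (if present) already carries the original column names
parts <- Map(fix_piece, pieces, names(pieces) != "0")
names(parts) <- paste0("d", seq_along(parts))
parts
```

The resulting list elements d1, d2 and d3 can then be combined with bind_rows(parts), as in the question.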
Does this work:
data
# A tibble: 5 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 A B C D
3 10 20 30 40
4 B C D NA
5 200 300 400 NA
data <- rbind(LETTERS[1:4],data)
data
# A tibble: 6 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
3 A B C D
4 10 20 30 40
5 B C D NA
6 200 300 400 NA
split(data, rep(1:ceiling(nrow(data)/2), each = 2))
$`1`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
$`2`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 10 20 30 40
$`3`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 B C D NA
2 200 300 400 NA
Base R solution:
Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2)))
Including pushing separate data.frames to Global Environment:
list2env(setNames(Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2))),
paste0('d', seq_len(ceiling(nrow(df) / 2)))), .GlobalEnv)
Tidyverse Solution:
library(tidyverse)
df %>%
rbind(names(df), .) %>%
split(cumsum(seq_len(nrow(.)) %% 2)) %>%
Map(function(x){setNames(x[2,], x[1,])[,complete.cases(t(x))]}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Note solution adjusted to reflect edit to the question:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1]); y[,!is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
New solution including push to Global Env:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
dflist <- Map(function(x) {
y <-
setNames(t(x[, -1, drop = FALSE]), x[, 1])
y[, !is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
list2env(setNames(dflist, paste0('d', names(dflist))), .GlobalEnv)
Adjusted Tidyverse solution:
df %>%
rbind(names(.), .) %>%
t() %>%
data.frame() %>%
type.convert() %>%
split.default(cumsum(!sapply(., is.integer))) %>%
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1])
data.frame(y[,!is.na(colSums(y)), drop = FALSE])}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Data:
df <- structure(list(A = c("1", "A", "10", "B", "200"), B = c("2", "B", "20", "C", "300"), C = c("3", "C", "30", "D", "400"), D = c("4","D", "40", NA, NA)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
Updated Data:
df <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))

How to deduplicate based upon an interval between dates in same column

I have a table that looks something like this:
ID Date Type
1 2019/03/12 A
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
4 2019/01/23 A
I want to deduplicate this table by ID, but only if the span between the dates listed is greater than 7 days. If it is less than 7 days, then I want to keep the earliest date.
Want:
ID Date Type
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
I'm just struggling with where to start conceptually.
One option is to convert 'Date' to the Date class (ymd from lubridate is used here) and then, grouped by 'ID', keep the first row plus any row whose 'Date' is at least 7 days after the previous row's:
library(dplyr)
library(lubridate)
df1 %>%
mutate(Date = ymd(Date)) %>%
group_by(ID) %>%
filter(c(TRUE, diff(Date) >= 7))
# A tibble: 5 x 3
# Groups: ID [4]
# ID Date Type
# <int> <date> <chr>
#1 1 2019-03-12 A
#2 2 2019-01-07 A
#3 2 2019-04-20 B
#4 3 2019-02-09 C
#5 4 2019-01-19 A
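The same filter can be sketched in base R. The sketch assumes the rows are already sorted by date within each ID, and `keep` and `dedup` are illustrative names. Like the dplyr version, it compares each row with the immediately preceding row, not with the last row that was kept, so a chain of dates each less than 7 days apart collapses to its first row.

```r
df1 <- data.frame(
  ID   = c(1L, 1L, 2L, 2L, 3L, 4L, 4L),
  Date = as.Date(c("2019/03/12", "2019/03/12", "2019/01/07", "2019/04/20",
                   "2019/02/09", "2019/01/19", "2019/01/23"),
                 format = "%Y/%m/%d"),
  Type = c("A", "A", "A", "B", "C", "A", "A")
)

# Per ID: always keep the first row, then rows at least 7 days after the previous one
keep <- ave(as.numeric(df1$Date), df1$ID,
            FUN = function(d) c(TRUE, diff(d) >= 7)) == 1

dedup <- df1[keep, ]
dedup
```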
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 4L, 4L), Date = c("2019/03/12",
"2019/03/12", "2019/01/07", "2019/04/20", "2019/02/09", "2019/01/19",
"2019/01/23"), Type = c("A", "A", "A", "B", "C", "A", "A")),
class = "data.frame", row.names = c(NA,
-7L))

include columnheader as another column value for each observation in R

I'm looking for a way to add the column header (date) next to each observation.
take df:
structure(list(dates = c("wt", "id", "", ""), X6.1.2018 = c("dd",
"a", "b", "c"), X6.2.2018 = c("qq", "d", "e", ""), X6.2.2018.1 = c("dd",
"z", "y", "")), class = "data.frame", row.names = c(NA, -4L))
where df looks like:
dates 6/1/2018 6/2/2018 6/2/2018
wt dd qq dd
id a d z
b e y
c
I'd like to end with df_final:
id date
a 6/1/2018
b 6/1/2018
c 6/1/2018
d 6/2/2018
e 6/2/2018
z 6/2/2018
y 6/2/2018
Any ideas are helpful - thanks
With tidyverse:
library(tidyverse)
df %>%
filter(dates != 'wt') %>%
select(-dates) %>%
gather(date, id) %>%
filter(id != '') %>%
mutate(date = as.Date(date, format = "X%m.%d.%Y"))
Output:
date id
1 2018-06-01 a
2 2018-06-01 b
3 2018-06-01 c
4 2018-06-02 d
5 2018-06-02 e
6 2018-06-02 z
7 2018-06-02 y
or with data.table::melt:
library(data.table)
dt = setDT(df)[dates != 'wt', !'dates']
melt(dt, measure.vars = 1:3, variable.name = "date",
value.name = "id")[id != '', .(id, date = as.Date(date, format = "X%m.%d.%Y"))]
Output:
id date
1: a 2018-06-01
2: b 2018-06-01
3: c 2018-06-01
4: d 2018-06-02
5: e 2018-06-02
6: z 2018-06-02
7: y 2018-06-02
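For comparison, a base R sketch of the same reshaping with stack(), which puts the column name next to each value. It makes the same assumption as the answers above, namely that as.Date() ignores the trailing ".1" in the duplicated column name; `long` and `df_final` are illustrative names.

```r
df <- structure(list(dates = c("wt", "id", "", ""),
                     X6.1.2018 = c("dd", "a", "b", "c"),
                     X6.2.2018 = c("qq", "d", "e", ""),
                     X6.2.2018.1 = c("dd", "z", "y", "")),
                class = "data.frame", row.names = c(NA, -4L))

# Drop the "wt" row and the dates column, then stack the remaining columns:
# 'values' holds the cells, 'ind' holds the column each cell came from
long <- stack(df[-1, -1])
long <- long[long$values != "", ]

df_final <- data.frame(
  id   = long$values,
  date = as.Date(as.character(long$ind), format = "X%m.%d.%Y")
)
df_final
```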

tidyr::spread resulting in multiple rows

I have a problem similar to the following one, but the solution presented at that link does not work for me:
tidyr spread does not aggregate data
I have a df in the following structure:
UndesiredIndex DesiredIndex DesiredRows Result
1 x1A x1 A 50,32
2 x1B x2 B 7,34
3 x2A x1 A 50,33
4 x2B x2 B 7,35
Using the code below:
dftest <- bd_teste %>%
select(-UndesiredIndex) %>%
spread(DesiredIndex, Result)
I expected the following result:
DesiredIndex A B
A 50,32 50,33
B 7,34 7,35
Although, I keep getting the following result:
DesiredIndex x1 x2
1 A 50.32 NA
2 B 7.34 NA
3 A NA 50.33
4 B NA 7.35
PS: Sometimes I force the column UndesiredIndex out with select(-UndesiredIndex), but I keep getting the following message:
Adding missing grouping variables: UndesiredIndex
Might be something easy to stack those rows, but I'm new to R and have been trying so hard to solve this but without success.
Thanks in advance!
We group by DesiredIndex, create a sequence column and then do the spread:
library(tidyverse)
df1 %>%
select(-UndesiredIndex) %>%
group_by(DesiredIndex) %>%
mutate(new = LETTERS[row_number()]) %>%
ungroup %>%
select(-DesiredIndex) %>%
spread(new, Result)
# A tibble: 2 x 3
# DesiredRows A B
# <chr> <chr> <chr>
#1 A 50,32 50,33
#2 B 7,34 7,35
Data
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
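In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A sketch of the same approach with the newer function follows; it assumes a recent tidyr is installed, and it numbers the repeated values within each DesiredRows value, which for this data yields the same helper column as grouping by DesiredIndex. The name `wide` is illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
  DesiredIndex   = c("x1", "x2", "x1", "x2"),
  DesiredRows    = c("A", "B", "A", "B"),
  Result         = c("50,32", "7,34", "50,33", "7,35")
)

wide <- df1 %>%
  group_by(DesiredRows) %>%
  mutate(new = LETTERS[row_number()]) %>%  # A for the first value, B for the second
  ungroup() %>%
  select(DesiredRows, new, Result) %>%     # keep only the id, name and value columns
  pivot_wider(names_from = new, values_from = Result)
wide
```

Dropping the index columns before widening matters: any column left in the data is treated as an identifier, which is what produces the extra rows filled with NA in the question.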
Shorter, but more theoretically round-about.
Data
(Thanks to @akrun!)
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
This is a great technique for concatenating rows.
df1 %>%
group_by(DesiredRows) %>%
summarise(Result = paste(Result, collapse = "|")) %>% #<Concatenate rows
separate(Result, into = c("A", "B"), sep = "\\|") #<Separate by '|'
#> # A tibble: 2 x 3
#> DesiredRows A B
#> <chr> <chr> <chr>
#> 1 A 50,32 50,33
#> 2 B 7,34 7,35
Created on 2018-08-06 by the reprex package (v0.2.0).
