R Function error in calculating date difference [duplicate]

This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 2 years ago.
I have a dataframe that looks like this:
Name Date
David 2019-12-23
David 2020-1-10
David 2020-2-13
Kevin 2019-2-12
Kevin 2019-3-19
Kevin 2019-5-1
Kevin 2019-7-23
Basically, I'm trying to calculate the date difference between each instance, specific to each person. I am currently using the following for-loop:
df$daysbetween <- with(df, ave(as.numeric(date), name,
  FUN = function(x) {
    z = c(NA, NA)
    for (i in seq_along(x)[-(1:2)]) {
      z <- c(z, (x[i] - x[i-1]))
    }
    return(z)
  }))
Currently, it calculates the difference between the second and third instances, and every following one, perfectly fine. However, it doesn't calculate the difference between the first and second dates, and I need it to. Where is the error in my code coming from? Would appreciate any help.
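The error comes from the initialization: z starts with two NAs and the loop begins at the third element (seq_along(x)[-(1:2)]), so x[2] - x[1] is never computed. A minimal sketch of the corrected loop, keeping the original ave() framing (assuming the Date column is of class Date; column names capitalized as in the sample data):

df$daysbetween <- with(df, ave(as.numeric(Date), Name,
  FUN = function(x) {
    z <- NA_real_                  # one NA for the first row of each group
    for (i in seq_along(x)[-1]) {  # drop only the first index, not the first two
      z <- c(z, x[i] - x[i - 1])
    }
    z
  }))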

A base R option is ave() with diff(), taking the lagged difference within each Name:

transform(df, diff = ave(Date, Name, FUN = function(x) c(NA, diff(as.Date(x)))))
Name Date diff
1 David 2019-12-23 <NA>
2 David 2020-1-10 18
3 David 2020-2-13 34
4 Kevin 2019-2-12 <NA>
5 Kevin 2019-3-19 35
6 Kevin 2019-5-1 43
7 Kevin 2019-7-23 83

Just use lag from the dplyr package:
Description:
Find the "previous" (lag()) or "next" (lead()) values in a vector. Useful for comparing values behind of or ahead of the current values.
library(dplyr)

df %>%
  group_by(Name) %>%                 # rows assumed sorted by date within each person
  mutate(diff = Date - lag(Date))
Output:
Name  Date       diff
<chr> <date>     <drtn>
1 David 2019-12-23 NA days
2 David 2020-01-10 18 days
3 David 2020-02-13 34 days
4 Kevin 2019-02-12 NA days
5 Kevin 2019-03-19 35 days
6 Kevin 2019-05-01 43 days
7 Kevin 2019-07-23 83 days


How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically I want to take one dataframe (covid cases in Berlin per district), calculate the sum of each column, and create a new dataframe with one column for the name of the district and another for the total number of cases. So I wrote:
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')

c_tot <- data.frame('district'=c(), 'number'=c())

for (n in colnames(covid_bln[3:14])){
  x <- data.frame('district'=c(n), 'number'=c(sum(covid_bln$n)))
  c_tot <- rbind(c_tot, x)
  next
}

print(c_tot)
This works properly for the names, but it returns the number of cases for the 8th district for every district. If you have any suggestion, even involving the use of other functions, it would be great. Thank you
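As an aside, the likely culprit in the loop is covid_bln$n: the $ operator does not substitute the loop variable, and on a data frame it partially matches a column name, here presumably neukoelln, the 8th district, so every iteration sums the same column. A minimal sketch of the fix is to index with [[ instead:

c_tot <- data.frame(district = character(), number = numeric())
for (n in colnames(covid_bln)[3:14]) {
  # [[n]] looks up the column named by the loop variable; $n does not
  x <- data.frame(district = n, number = sum(covid_bln[[n]]))
  c_tot <- rbind(c_tot, x)
}
print(c_tot)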
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)

# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
I want to provide a solution using the tidyverse.
The final result is ordered alphabetically by district.

library(tidyverse)

c_tot <- covid_bln %>%
  select(mitte:reinickendorf) %>%
  gather(district, number, mitte:reinickendorf) %>%
  group_by(district) %>%
  summarise(number = sum(number))
The result is:
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788
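For what it's worth, gather() is superseded in current tidyr; a sketch of the equivalent reshape with pivot_longer():

library(tidyverse)

c_tot <- covid_bln %>%
  pivot_longer(mitte:reinickendorf, names_to = "district", values_to = "number") %>%
  group_by(district) %>%
  summarise(number = sum(number))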

Get first n rows for each date in a dataframe [duplicate]

This question already has answers here:
Selecting top N rows for each group based on value in column
(4 answers)
Closed 3 years ago.
I am currently trying to subset the first n observations for each date in my dataset. Let's say n = 2 for example purposes. This is what the data set looks like:
Date Measure
2019-02-01 5
2019-02-01 4
2019-02-01 3
2019-02-01 6
… …
2019-02-02 5
2019-02-02 5
2019-02-02 2
… …
I would like to see this output:
Date Measure
2019-02-01 5
2019-02-01 4
2019-02-02 5
2019-02-02 5
… …
Unfortunately, this is not something I am able to do by hand. I am dealing with over 10 million rows of data, so the solution needs to select the first n rows for each unique date dynamically.
An option is to group by 'Date' and slice() the first n rows:

library(dplyr)

n <- 2
df1 %>%
  group_by(Date) %>%
  slice(seq_len(n))
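With dplyr >= 1.0.0, slice_head() expresses the same thing, and given the 10 million rows a data.table equivalent may be faster; both are sketches assuming df1 as above:

library(dplyr)
library(data.table)

# dplyr >= 1.0.0: first n rows per group
df1 %>%
  group_by(Date) %>%
  slice_head(n = n)

# data.table: head of each Date group
setDT(df1)[, head(.SD, n), by = Date]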

Convert column with minutes and seconds in seconds only in R

My dataframe has this format:
name <- c("Carlos", "Matthew", "Toth", "Mike", "Joseph", "Andrey")
time <- c("79:45","78","74:45","65:30","64","57")
myexample <- cbind.data.frame(name, time)
> myexample
name time
1 Carlos 79:45
2 Matthew 78
3 Toth 74:45
4 Mike 65:30
5 Joseph 64
6 Andrey 57
How can I convert the time column, which mixes two formats ("79:45" and "78"), into seconds?
The time column is in character format :(
As output:
> myexample
name time
1 Carlos 79:45
2 Matthew 78:00
3 Toth 74:45
4 Mike 65:30
5 Joseph 64:00
6 Andrey 57:00
Here is one option using sub:
myexample$time <- sub("^(\\d{1,})$", "\\1:00", myexample$time)
myexample
name time
1 Carlos 79:45
2 Matthew 78:00
3 Toth 74:45
4 Mike 65:30
5 Joseph 64:00
6 Andrey 57:00
Normally the best thing to do here would be to parse your text times into a formal time type. But since you are storing non-standard values, where the minutes component can exceed 59, I chose to leave it as text for the moment.
An option can be using grepl to detect the presence of ":" and appending ":00" to the values where ":" is not present.
myexample$time <- ifelse(grepl(":", myexample$time),
                         as.character(myexample$time),
                         paste0(myexample$time, ":00"))
myexample
# name time
# 1 Carlos 79:45
# 2 Matthew 78:00
# 3 Toth 74:45
# 4 Mike 65:30
# 5 Joseph 64:00
# 6 Andrey 57:00
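Since the title asks for seconds: once every value is normalized to "mm:ss" as above, a short sketch finishes the conversion (to_seconds is a hypothetical helper, not from any package):

# split "mm:ss" and convert to total seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}
myexample$seconds <- to_seconds(myexample$time)  # e.g. "79:45" -> 79*60 + 45 = 4785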

Rank most recent scores of students within a given date - 30 days window

Following is what my dataframe/data.table looks like. The rank column is my desired calculated field.
library(data.table)
df <- fread('
Name Score Date Rank
John 42 1/1/2018 3
Rob 85 12/31/2017 2
Rob 89 12/26/2017 1
Rob 57 12/24/2017 1
Rob 53 08/31/2017 1
Rob 72 05/31/2017 2
Kate 87 12/25/2017 1
Kate 73 05/15/2017 1
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]
I am trying to calculate the rank of each student at every given point in time in the data, within a 30-day window. For that, I need to fetch the most recent scores of all students at a given point in time and then apply the rank function.
In the 1st row, as of 1/1/2018, John has two competitors in the past 30-day window: Rob, with a most recent score of 85 on 12/31/2017, and Kate, with a most recent score of 87 on 12/25/2017; both of these dates fall within the 1/1/2018 - 30 day window. John gets a rank of 3 with the lowest score of 42. If only one student falls within the (date at a given row) - 30 day window, then the rank is 1.
In the 3rd row the date is 12/26/2017, so Rob's score as of 12/26/2017 is 89. There is only one other student that falls in the window of 12/26/2017 - 30 days, and that is Kate with her most recent score (87) on 12/25/2017. So within the window (12/26/2017) - 30, Rob's score of 89 is higher than Kate's 87, and therefore Rob gets rank 1.
I was thinking about using the framework from here Efficient way to perform running total in the last 365 day window but struggling to think of a way to fetch all recent score of all students at a given point in time before using rank.
This seems to work:
ranks = df[.(d_dn = Date - 30L, d_up = Date), on = .(Date >= d_dn, Date <= d_up), allow.cartesian = TRUE][,
  .(LatestScore = last(Score)), by = .(Date = Date.1, Name)]
setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]
df[ranks, on = .(Name, Date), r := i.r]
Name Score Date Rank r
1: John 42 2018-01-01 3 3
2: Rob 85 2017-12-31 2 2
3: Rob 89 2017-12-26 1 1
4: Rob 57 2017-12-24 1 1
5: Rob 53 2017-08-31 1 1
6: Rob 72 2017-05-31 2 2
7: Kate 87 2017-12-25 1 1
8: Kate 73 2017-05-15 1 1
... using last since the Cartesian join seems to sort and we want the latest measurement.
How the update join works
The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.
Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).
There's also...
ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")
# and then use the same last three lines above
I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.
A solution that uses data.table, though I am not sure it is the most efficient usage:

df[.(iName = Name, iScore = Score, iDate = Date, StartDate = Date - 30, EndDate = Date),
   .(Rank = frank(-c(iScore[1L], .SD[Name != iName, max(Score), by = .(Name)]$V1),
                  ties.method = "first")[1L]),
   by = .EACHI,
   on = .(Date >= StartDate, Date <= EndDate)]
Explanation:
1) The outer square brackets do a non-equi join within a date range (i.e. 30 days ago to the current date for each row). Try studying the below output against the input data:

df[, OrigDate := Date]  # created in the edit below; needed for this demo output
df[.(iName = Name, iScore = Score, iDate = Date, StartDate = Date - 30, EndDate = Date),
   c(.(RowGroup = .GRP),
     .SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
   by = .EACHI,
   on = .(Date >= StartDate, Date <= EndDate)]
2) .EACHI is to perform j calculations for each row of i.
3) Inside j, iScore[1L] is the score for the current row, .SD[Name != iName] means taking scores not corresponding to the student in the current row. Then, we use the max(Score) for each student of those students within the 30days window.
4) Concatenate all these scores and calculate the rank for the score of the current row while taking care of ties by taking the first tie.
Note:
see ?data.table to understand what i, j, by, on and .EACHI refers to.
EDIT after comments by OP:
I would add an OrigDate column and find the scores that match the latest date.
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Name=iName, Score=iScore, Date=iDate,
Rank=frank(-c(iScore[1L],
.SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
I came up with the following partial solution, but encountered a problem: is it possible that two people occur with the same date?
If not, have a look at the following piece of code:
library(tidyverse) # easy manipulation
library(lubridate) # time handling

# This function can be added to
get_top <- function(df, date_sel) {
  temp <- df %>%
    filter(Date > date_sel - months(1)) %>% # look one month into the past from the given date
    group_by(Name) %>%                      # and for each occurring name
    summarise(max_score = max(Score)) %>%   # find the maximal score
    arrange(desc(max_score)) %>%            # sort them
    mutate(Rank = 1:n())                    # and rank them
  temp
}
Now, you have to find the name in the table for a given date and return its rank; a sketch of that lookup step follows below.
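A minimal sketch of that lookup, using the get_top() helper above (and additionally restricting the input to scores known as of each row's date, which get_top() alone does not do):

# for each row, build the one-month window table as of that date
# and look up the person's rank in it
df$Rank <- mapply(function(nm, d) {
  tbl <- get_top(df[df$Date <= d, ], d)
  tbl$Rank[tbl$Name == nm]
}, df$Name, df$Date)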
library(data.table)
library(magrittr)

setorder(df, -Date)

fun <- function(i){
  df[i:nrow(df), head(.SD, 1), by = Name] %$%
    rank(-Score[Date > df$Date[i] - 30])[1]
}

df[, rank := sapply(1:.N, fun)]
This can be done by joining to df those rows of df that are within 30 days behind it or on the same date and have higher or equal scores. Then, for each original row and joined Name, keep only the most recent joined row. The count of the joined rows remaining for each original df row is the rank.
library(sqldf)
sqldf("with X as
(select a.rowid r, a.*, max(b.Date) Date
from df a join df b
on b.Date between a.Date - 30 and a.Date and b.Score >= a.Score
group by a.rowid, b.Name)
select Name, Date, Score, count(*) Rank
from X
group by r
order by r")
giving:
Name Date Score Rank
1 John 2018-01-01 42 3
2 Rob 2017-12-31 85 2
3 Rob 2017-12-26 89 1
4 Rob 2017-12-24 57 1
5 Rob 2017-08-31 53 1
6 Rob 2017-05-31 72 2
7 Kate 2017-12-25 87 1
8 Kate 2017-05-15 73 1
A tidyverse solution (dplyr + tidyr):
library(dplyr)
library(tidyr)

df %>%
  complete(Name, Date) %>%
  group_by(Name) %>%
  mutate(last_score_date = `is.na<-`(Date, is.na(Score))) %>%
  fill(Score, last_score_date) %>%
  filter(!is.na(Score) & Date - last_score_date < 30) %>%
  group_by(Date) %>%
  mutate(Rank = rank(-Score)) %>%
  right_join(df)
# # A tibble: 8 x 5
# # Groups: Date [?]
# Name Date Score last_score_date Rank
# <chr> <date> <int> <date> <dbl>
# 1 John 2018-01-01 42 2018-01-01 3
# 2 Rob 2017-12-31 85 2017-12-31 2
# 3 Rob 2017-12-26 89 2017-12-26 1
# 4 Rob 2017-12-24 57 2017-12-24 1
# 5 Rob 2017-08-31 53 2017-08-31 1
# 6 Rob 2017-05-31 72 2017-05-31 2
# 7 Kate 2017-12-25 87 2017-12-25 1
# 8 Kate 2017-05-15 73 2017-05-15 1
We add all missing combinations of Date and Name, then we create a column last_score_date, equal to Date when Score isn't NA. By filling NAs downward, Score becomes the latest score. We then filter out NAs and keep only scores that are less than 30 days old; that's our table of valid scores by date. From there it's easy to add ranks, and a final right_join on the original table gives us the expected output.
data
library(data.table)
df <- fread('
Name Score Date
John 42 01/01/2018
Rob 85 12/31/2017
Rob 89 12/26/2017
Rob 57 12/24/2017
Rob 53 08/31/2017
Rob 72 05/31/2017
Kate 87 12/25/2017
Kate 73 05/15/2017
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

Apply.weekly for non-unique date column?

I currently have the below data.table with Name and Id recycling per day.
Date Name Id Widgets
2016-12-31 Bob Jones 0052A00001 5
2016-12-31 James Smith 0052A00002 25
2016-12-31 Tom Wilson 0052A00003 29
...
2016-01-31 Bob Jones 0052A00001 8
2016-01-31 James Smith 0052A00002 18
2016-01-31 Tom Wilson 0052A00003 20
Is it possible to apply the xts function apply.weekly to this, since there are not unique values per date? If not, what is the easiest way to aggregate this by a weekly value (or a period of another length - say, 4 days) and create groupings according to that?
You can create a grouping table first and then join the week in. You can play around with cut to get your desired grouping.
library(data.table)

grpWeek <- data.table(Date = seq.Date(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "1 day"))[,
  list(Date,
       DT_Week   = week(Date),
       Week_Num  = format(Date, "%W"),
       User_Week = cut(Date, breaks = 52, labels = paste0("Week", 1:52)))]
dt <- fread("Date,Name,Id,Widgets
2016-12-31,Bob Jones,0052A00001,5
2016-12-31,James Smith,0052A00002,25
2016-12-31,Tom Wilson,0052A00003,29
2016-01-31,Bob Jones,0052A00001,8
2016-01-31,James Smith,0052A00002,18
2016-01-31,Tom Wilson,0052A00003,20")
dt[,Date:=as.Date(Date)]
grpWeek[dt, on="Date"]
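For the "period of another length" part of the question, cut.Date() accepts a multiple of a unit in its breaks string, so a 4-day grouping is just another breaks value (a sketch, assuming dt from above):

# bin dates into 4-day periods and sum widgets per person per bin
dt[, Period := cut(Date, breaks = "4 days")]
dt[, .(Widgets = sum(Widgets)), by = .(Name, Id, Period)]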
