I have a dataset that contains weekly data. The week starts on a Monday and ends on a Sunday. This dataset is also broken out by group.
I want to detect if there are any missing consecutive dates between the start and finish for each group. Here is an example dataset:
Week <- as.Date(c('2015-04-13', '2015-04-20', '2015-05-04', '2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A', 'A','B','B','B','B')
Value<- c(2,3,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
df
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-27 B 9
2015-08-03 B 8
For group B, there is missing data between 2015-07-06 and 2015-07-27, and for group A between 2015-04-20 and 2015-05-04. I want to add a row for each missing week with Value set to NA. I have many groups, and my expected output is below:
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-04-27 A NA
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-13 B NA
2015-07-20 B NA
2015-07-27 B 9
2015-08-03 B 8
Any help would be great, thanks!
You can use complete from the tidyr package, i.e.
library(tidyverse)
df %>%
  group_by(Group) %>%
  complete(Week = seq(min(Week), max(Week), by = 'week'))
which gives,
# A tibble: 10 x 3
# Groups: Group [2]
Group Week Value
<fct> <date> <dbl>
1 A 2015-04-13 2
2 A 2015-04-20 3
3 A 2015-04-27 NA
4 A 2015-05-04 10
5 B 2015-06-29 4
6 B 2015-07-06 11
7 B 2015-07-13 NA
8 B 2015-07-20 NA
9 B 2015-07-27 9
10 B 2015-08-03 8
The only way I've found to do this is using an inequality join in SQL.
library(tidyverse)
library(sqldf)
Week <- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04',
                  '2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#what are the start and end weeks for each group?
GroupWeeks <- df %>%
  group_by(Group) %>%
  summarise(start = min(Week),
            end = max(Week))
#What are all the possible weeks?
AllWeeks <- data.frame(Week = seq.Date(min(df$Week), max(df$Week), by = "week"))
#use an inequality join to add rows for every week within the group's range
sqldf("Select AllWeeks.Week, GroupWeeks.[Group], Value
From AllWeeks inner join GroupWeeks on AllWeeks.Week >= start AND AllWeeks.Week <= end
left join df on AllWeeks.Week = df.Week and GroupWeeks.[Group] = df.[Group]")
This can be achieved with the seq function. Note that the snippet below fills the gaps for group B only; the same steps would need to be repeated for each group.
Code:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04', '2015-06-29','2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#generate all the missing dates
alldates = seq(min(df$Week[df$Group == 'B']), max(df$Week[df$Group == 'B']), 7)
#filter out the dates that are not present in your dataset
dates = alldates[!(alldates %in% df$Week)]
#add these new dates to a new dataframe and rbind with the old dataframe
new_df = data.frame(Week = dates,Group = 'B', Value = NA)
df = rbind(df, new_df)
df = df[order(df$Week),]
Output:
Week Group Value
1 2015-04-13 A 2
2 2015-04-20 A 3
3 2015-04-27 A 2
4 2015-05-04 A 10
5 2015-06-29 B 4
6 2015-07-06 B 11
9 2015-07-13 B NA
10 2015-07-20 B NA
7 2015-07-27 B 9
8 2015-08-03 B 8
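Since the question mentions many groups, the same seq-based gap filling can be applied per group rather than hard-coding 'B'. A sketch using split/lapply (one of several ways to loop over groups; the filled name below is just for illustration):

```r
# Example data from the question
Week  <- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04',
                   '2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B')
Value <- c(2, 3, 2, 10, 4, 11, 9, 8)
df <- data.frame(Week, Group, Value)

# Apply the same gap-filling logic to every group
filled <- do.call(rbind, lapply(split(df, df$Group), function(d) {
  allweeks <- seq(min(d$Week), max(d$Week), 7)   # all weeks in this group's range
  missing  <- allweeks[!(allweeks %in% d$Week)]  # weeks with no row in the data
  if (length(missing) == 0) return(d)
  rbind(d, data.frame(Week = missing, Group = d$Group[1], Value = NA))
}))
filled <- filled[order(filled$Group, filled$Week), ]
rownames(filled) <- NULL
```

On this data, group A has no gaps and group B gains the two missing weeks, so filled ends up with 10 rows.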
Related
I am trying to replace some values in a column of a data frame using a date and ID from another data frame, but I cannot manage to find a solution. It will be clearer with an example.
I have two data frames constructed as followed:
date.1 <- c("01.02.2011","02.02.2011","03.02.2011","04.02.2011","05.02.2011","01.02.2011","02.02.2011","03.02.2011","04.02.2011","05.02.2011")
date.1 <- as.Date(date.1, format="%d.%m.%Y")
values.1 <- c("1","3","5","1","2","6","7","8","9","10")
ID.1 <- c("10","10","10","10","10","11","11","11","11","11")
df.1 <- data.frame(date.1, values.1, ID.1)
names(df.1) <- c("date","values","ID")
date.2 <- c("04.02.2011","04.02.2011")
date.2 <- as.Date(date.2, format="%d.%m.%Y")
values.2 <- c("1", "9")
ID.2 <- c("10","11")
df.2 <- data.frame(date.2, values.2, ID.2)
names(df.2) <- c("date","values","ID")
which looked like:
> df.1
date values ID
1 2011-02-01 1 10
2 2011-02-02 3 10
3 2011-02-03 5 10
4 2011-02-04 1 10
5 2011-02-05 2 10
6 2011-02-01 6 11
7 2011-02-02 7 11
8 2011-02-03 8 11
9 2011-02-04 9 11
10 2011-02-05 10 11
> df.2
date values ID
1 2011-02-04 1 10
2 2011-02-04 9 11
I would like to replace the "values" in df.2 for each ID with the "values" of df.1 on the next date, i.e. with the values on 2011-02-05, but I haven't managed to do it. Thus, I would like to obtain:
> df.2
date values ID
1 2011-02-04 2 10
2 2011-02-04 10 11
Your help would be really appreciated. If any editing of the question is needed, do not hesitate to let me know.
If next date means date + 1 day, then try this:
library(dplyr)
df.2 %>%
  mutate(date1 = date + 1) %>%
  select(-values) %>%
  left_join(df.1, by = c(date1 = "date", ID = "ID")) %>%
  select(-date1)
#> date ID values
#> 1 2011-02-04 10 2
#> 2 2011-02-04 11 10
Created on 2020-03-28 by the reprex package (v0.3.0)
Is this what you are looking for?
library(dplyr)

# Note: this relies on df.2 having one row per ID and being sorted by ID
df.2$values <- df.1 %>%
  filter(ID %in% df.2$ID & date %in% (df.2$date + 1)) %>%
  arrange(ID) %>%
  pull(values)
I have a dataframe that contains information for various countries, days and variables. I have observations for one of those variables only. A simple working example would look like this:
df <- data.frame(country=c("NL","NL","NL","NL","BE","BE","BE","BE"),
day=c("Monday","Monday","Tuesday","Tuesday","Monday","Monday","Tuesday","Tuesday"),
variable=c("A","B","A","B","A","B","A","B"),
value=c(8,NA,13,NA,12,NA,9,NA))
> df
country day variable value
1 NL Monday A 8
2 NL Monday B NA
3 NL Tuesday A 13
4 NL Tuesday B NA
5 BE Monday A 12
6 BE Monday B NA
7 BE Tuesday A 9
8 BE Tuesday B NA
I want to copy those observations over to the other variable, as long as country and day are identical. The end result would look like this:
> df
country day variable value
1 NL Monday A 8
2 NL Monday B 8
3 NL Tuesday A 13
4 NL Tuesday B 13
5 BE Monday A 12
6 BE Monday B 12
7 BE Tuesday A 9
8 BE Tuesday B 9
The actual dataframe is quite large and I would like to avoid having to build loops. A solution using pipes would be preferable.
Perhaps you could just do:
library(dplyr)
df %>%
  group_by(country, day) %>%
  mutate(value = value[!is.na(value)])
Output:
# A tibble: 8 x 4
# Groups: country, day [4]
country day variable value
<fct> <fct> <fct> <dbl>
1 NL Monday A 8
2 NL Monday B 8
3 NL Tuesday A 13
4 NL Tuesday B 13
5 BE Monday A 12
6 BE Monday B 12
7 BE Tuesday A 9
8 BE Tuesday B 9
Another way would be via fill, though this is probably unnecessary here (if you do need it, rather use mutate(value = zoo::na.locf(value)) as the last line, since fill itself is quite slow):
library(tidyverse)
df %>%
  group_by(country, day) %>%
  arrange(country, day, value) %>%
  fill(value)
With data.table, we can do
library(data.table)
setDT(df)[, value := na.omit(value), .(country, day)]
Or using na.locf
library(zoo)
setDT(df)[, value := na.locf0(value), .(country, day)]
I have a list of transactions for a lot of people. I wish to find out when each particular person has crossed a particular threshold value of total transactions.
Here is an example of what I have already done:
Example dataset:
df <- data.frame(name = rep(c("a","b"),4),
dates = seq(as.Date("2017-01-01"), by = "month", length.out = 8), amt = 11:18)
library(data.table)
setorderv(df, "name")
This gives me the following data frame
name dates amt
1 a 2017-01-01 11
3 a 2017-03-01 13
5 a 2017-05-01 15
7 a 2017-07-01 17
2 b 2017-02-01 12
4 b 2017-04-01 14
6 b 2017-06-01 16
8 b 2017-08-01 18
Then I wrote the following code to find the cumulative sums
df$cumsum <- ave(df$amt, df$name, FUN = cumsum)
This gives me the following data frame:
name dates amt cumsum
1 a 2017-01-01 11 11
3 a 2017-03-01 13 24
5 a 2017-05-01 15 39
7 a 2017-07-01 17 56
2 b 2017-02-01 12 12
4 b 2017-04-01 14 26
6 b 2017-06-01 16 42
8 b 2017-08-01 18 60
Now I want to know when each person crossed 20 and 40. I wrote the following code to find this out:
names <- unique(df$name)
result_df <- data.frame(name = character(length(names)),
                        date20 = as.Date(NA), date40 = as.Date(NA))
for (i in seq_along(names)){
  x1 <- Position(function(x) x >= 20, df$cumsum[df$name == names[i]])
  x2 <- Position(function(x) x >= 40, df$cumsum[df$name == names[i]])
  result_df[i,] <- c(names[i],
                     df[df$name == names[i],2][x1],
                     df[df$name == names[i],2][x2])
}
This code checks where the thresholds were crossed and stores the row number in a variable, then extracts the date from that row of the second column and stores it in another data frame.
The problem is, this code is really slow. I have over 200,000 people in my data set and over 10 million rows. This code takes about 25 seconds to execute for the first 50 users, which means it is likely to take about 30 hours for the entire dataset.
Is there a faster way to do this?
With dplyr you could group by person, filter for the rows where cumsum exceeds 20 (or 40), and then use slice(1) to select the first qualifying row per person. This should be considerably faster than a for loop.
df <- read.table(text = '
name dates amt cumsum
a 2017-01-01 11 11
a 2017-03-01 13 24
a 2017-05-01 15 39
a 2017-07-01 17 56
b 2017-02-01 12 12
b 2017-04-01 14 26
b 2017-06-01 16 42
b 2017-08-01 18 60', header = T)
library(dplyr)

df %>%
group_by(name) %>%
filter(cumsum > 20) %>%
slice(1)
name dates amt cumsum
<fctr> <fctr> <int> <int>
1 a 2017-03-01 13 24
2 b 2017-04-01 14 26
df %>%
group_by(name) %>%
filter(cumsum > 40) %>%
slice(1)
name dates amt cumsum
<fctr> <fctr> <int> <int>
a 2017-07-01 17 56
b 2017-06-01 16 42
Of course you could subsequently rbind these dataframes and arrange on person. Does this help?
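The rbind-and-arrange step suggested above might look like this (a sketch; thresh is a label column added here for illustration, not part of the original data):

```r
library(dplyr)

# Data with cumulative sums, as in the question
df <- data.frame(name   = rep(c("a", "b"), each = 4),
                 dates  = c("2017-01-01", "2017-03-01", "2017-05-01", "2017-07-01",
                            "2017-02-01", "2017-04-01", "2017-06-01", "2017-08-01"),
                 amt    = c(11, 13, 15, 17, 12, 14, 16, 18),
                 cumsum = c(11, 24, 39, 56, 12, 26, 42, 60))

# First row per person whose running total exceeds a given threshold
first_over <- function(d, threshold) {
  d %>%
    group_by(name) %>%
    filter(cumsum > threshold) %>%   # `cumsum` is the column here, not the function
    slice(1) %>%
    mutate(thresh = threshold)
}

crossed <- bind_rows(first_over(df, 20), first_over(df, 40)) %>%
  arrange(name, thresh)
```

Each person now appears twice in crossed, once per threshold, with the date on which that threshold was first exceeded.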
Using data table could be something like this:
library(data.table)
dt <- data.table(df[order(df$dates), ])
dt[, ':='(minDate20 = min(dates[cumsum(amt) > 20]),
          minDate40 = min(dates[cumsum(amt) > 40])), by = .(name)]
dt[dates == minDate20, ]
dt[dates == minDate40, ]
In the data frame below there are a number of continuous days with missing values.
I want to create a table that shows the missing days
Expected output
Table of missing values
from to
2012-01-08 2012-01-12
2012-01-18 2012-01-22
2012-01-29 2012-02-01
I tried to do it using this code
library(dplyr)
df$Date <- as.Date(df$Date, format = "%d-%b-%Y")
from_to_table_NA <- df %>%
  dplyr::filter(is.na(value)) %>%
  dplyr::summarise(from = min(Date),
                   to = max(Date))
> from_to_table_NA
from to
1 2012-01-08 2012-02-01
As expected, it gave me only the overall minimum and maximum dates of the missing values. I will highly appreciate any suggestion on how to get the desired output.
DATA
df <- read.table(text = c("
Date value
5-Jan-2012 5
6-Jan-2012 2
7-Jan-2012 3
8-Jan-2012 NA
9-Jan-2012 NA
10-Jan-2012 NA
11-Jan-2012 NA
12-Jan-2012 NA
13-Jan-2012 4
14-Jan-2012 5
15-Jan-2012 5
16-Jan-2012 7
17-Jan-2012 5
18-Jan-2012 NA
19-Jan-2012 NA
20-Jan-2012 NA
21-Jan-2012 NA
22-Jan-2012 NA
23-Jan-2012 12
24-Jan-2012 5
25-Jan-2012 7
26-Jan-2012 8
27-Jan-2012 8
28-Jan-2012 10
29-Jan-2012 NA
30-Jan-2012 NA
31-Jan-2012 NA
1-Feb-2012 NA
2-Feb-2012 12"), header =T)
You need to group the NA rows into runs of consecutive days. This can be done by taking the cumulative sum of a condition that flags where the difference between successive dates is not exactly 1:
df %>%
  filter(is.na(value)) %>%
  group_by(g = cumsum(coalesce(as.numeric(Date - lag(Date)), 1) != 1)) %>%
  summarise(from = min(Date),
            to = max(Date))
Gives:
# A tibble: 3 x 3
g from to
<int> <date> <date>
1 0 2012-01-08 2012-01-12
2 1 2012-01-18 2012-01-22
3 2 2012-01-29 2012-02-01
I used this code in R:
df[with(df,order(ID,Date)),]
to order the date values for each distinct ID in my data frame. Now I want to add a rank column (1 to n) next to the ordered date values, where rank = 1 is the oldest date and rank = n is the most recent date for each distinct ID.
I have seen questions about adding a rank column but not when sorting with a date value. How do I add this rank column using my code above? Thanks!
Here's a dplyr approach:
library(dplyr)
# Fake data
set.seed(5)
dat = data.frame(date=sample(seq(as.Date("2015-01-01"), as.Date("2015-01-31"),
"1 day"), 12),
ID=rep(LETTERS[1:3], c(2,6,4)))
dat %>%
  group_by(ID) %>%
  mutate(rank = rank(date)) %>%
  arrange(ID, date)
date ID rank
1 2015-01-07 A 1
2 2015-01-21 A 2
3 2015-01-03 B 1
4 2015-01-08 B 2
5 2015-01-14 B 3
6 2015-01-19 B 4
7 2015-01-20 B 5
8 2015-01-27 B 6
9 2015-01-06 C 1
10 2015-01-10 C 2
11 2015-01-22 C 3
12 2015-01-29 C 4
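One caveat: rank() returns fractional ranks when two dates tie within an ID. If strict integer ranks 1 to n are needed, row_number() from dplyr is a drop-in alternative (a sketch on the same fake data):

```r
library(dplyr)

# Same fake data as above
set.seed(5)
dat <- data.frame(date = sample(seq(as.Date("2015-01-01"), as.Date("2015-01-31"),
                                    "1 day"), 12),
                  ID = rep(LETTERS[1:3], c(2, 6, 4)))

dat %>%
  group_by(ID) %>%
  mutate(rank = row_number(date)) %>%  # always integer 1..n; ties broken by row position
  arrange(ID, date)
```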