Match dates from list of data frames in R - r

I have a list of 100+ time series data frames, my.list, with daily observations for each product in its own data frame. Some dates are missing entirely: there is no row for them at all. I would like to update each data frame in this list so that every date appears, with NA for Sales where no record exists on that date.
Dates:
start = as.Date('2016/04/08')
full <- seq(start, by='1 days', length=10)
Sample Time Series Data:
d1 <- data.frame(Date = seq(start, by ='2 days',length=5), Sales = c(5,10,15,20,25))
d2 <- data.frame(Date = seq(start, by= '1 day', length=10),Sales = c(1, 2, 3,4,5,6,7,8,9,10))
my.list <- list(d1, d2)
I want to merge all full date values into each data frame, and if no match exists then sales is NA:
my.list
[[d1]]
Date Sales
2016-04-08 5
2016-04-09 NA
2016-04-10 10
2016-04-11 NA
2016-04-12 15
2016-04-13 NA
2016-04-14 20
2016-04-15 NA
2016-04-16 25
2016-04-17 NA
[[d2]]
Date Sales
2016-04-08 1
2016-04-09 2
2016-04-10 3
2016-04-11 4
2016-04-12 5
2016-04-13 6
2016-04-14 7
2016-04-15 8
2016-04-16 9
2016-04-17 10

If I understand correctly, the OP wants to update each of the dataframes in my.list to contain one row for each date given in the vector of dates full.
Base R
In base R, merge() can be used as already mentioned by Hack-R. However, the answer below expands this to work on all dataframes in the list:
# create dataframe from vector of full dates
full.df <- data.frame(Date = full)
# apply merge on each dataframe in the list
lapply(my.list, merge, y = full.df, all.y = TRUE)
[[1]]
Date Sales
1 2016-04-08 5
2 2016-04-09 NA
3 2016-04-10 10
4 2016-04-11 NA
5 2016-04-12 15
6 2016-04-13 NA
7 2016-04-14 20
8 2016-04-15 NA
9 2016-04-16 25
10 2016-04-17 NA
[[2]]
Date Sales
1 2016-04-08 1
2 2016-04-09 2
3 2016-04-10 3
4 2016-04-11 4
5 2016-04-12 5
6 2016-04-13 6
7 2016-04-14 7
8 2016-04-15 8
9 2016-04-16 9
10 2016-04-17 10
Caveat
The answer assumes that full covers the overall range of Date of all dataframes in the list.
In order to avoid any mishaps, the overall range of Date can be retrieved from the available data in my.list:
overall_date_range <- Reduce(range, lapply(my.list, function(x) range(x$Date)))
full <- seq(overall_date_range[1], overall_date_range[2], by = "1 days")
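As a minimal sketch of the caveat above, assuming the sample d1 and d2 from the question, the date vector can be derived from the data and then used in the merge. The list names d1 and d2 and the variable filled are added here for illustration only:

```r
# Sample data from the question
start <- as.Date("2016-04-08")
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)
my.list <- list(d1 = d1, d2 = d2)  # names are an illustrative addition

# Derive the overall date range from the data instead of hard-coding it
overall_date_range <- Reduce(range, lapply(my.list, function(x) range(x$Date)))
full <- seq(overall_date_range[1], overall_date_range[2], by = "1 day")

# Merge every data frame against the full date vector
filled <- lapply(my.list, merge, y = data.frame(Date = full), all.y = TRUE)

# Every element should now have one row per date
sapply(filled, nrow)
```

Because lapply() keeps the names of its input, the elements of filled can still be addressed as filled$d1 and filled$d2.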
Using rbindlist()
Alternatively, the list of dataframes, which are identical in structure, can be stored in one large dataframe. An additional column indicates which product each row belongs to. The homogeneous structure simplifies subsequent operations.
The code below uses the rbindlist() function from the data.table package to create a large data.table. CJ() (cross join) creates all combinations of dates and product ids, which are then joined to the stacked data to fill in the missing dates:
library(data.table)
all_products <- rbindlist(my.list, idcol = "product.id")[
CJ(product.id = unique(product.id), Date = seq(min(Date), max(Date), by = "1 day")),
on = .(Date, product.id)]
all_products
product.id Date Sales
1: 1 2016-04-08 5
2: 1 2016-04-09 NA
3: 1 2016-04-10 10
4: 1 2016-04-11 NA
5: 1 2016-04-12 15
6: 1 2016-04-13 NA
7: 1 2016-04-14 20
8: 1 2016-04-15 NA
9: 1 2016-04-16 25
10: 1 2016-04-17 NA
11: 2 2016-04-08 1
12: 2 2016-04-09 2
13: 2 2016-04-10 3
14: 2 2016-04-11 4
15: 2 2016-04-12 5
16: 2 2016-04-13 6
17: 2 2016-04-14 7
18: 2 2016-04-15 8
19: 2 2016-04-16 9
20: 2 2016-04-17 10
Subsequent operations can be grouped by product.id, e.g., to determine the number of valid sales data for each product:
all_products[!is.na(Sales), .(valid.sales.data = .N), by = product.id]
product.id valid.sales.data
1: 1 5
2: 2 10
Or the total sales per product:
all_products[, .(total.sales = sum(Sales, na.rm = TRUE)), by = product.id]
product.id total.sales
1: 1 75
2: 2 55
If required for some reason, the result can be converted back to a list with:
split(all_products, by = "product.id")
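For readers without data.table, the same stack-then-fill idea can be sketched in base R. This is not the answer's method but a rough equivalent, assuming the sample d1 and d2 from the question; the names stacked and grid are illustrative:

```r
# Sample data from the question
start <- as.Date("2016-04-08")
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)
my.list <- list(d1, d2)

# Stack all data frames and record which list element each row came from
stacked <- do.call(rbind, Map(function(d, id) data.frame(product.id = id, d),
                              my.list, seq_along(my.list)))

# All combinations of product id and date (the role CJ() plays in data.table)
grid <- expand.grid(product.id = unique(stacked$product.id),
                    Date = seq(min(stacked$Date), max(stacked$Date), by = "1 day"))

# Left join: keep every combination, fill Sales with NA where nothing matches
all_products <- merge(grid, stacked, all.x = TRUE)
all_products <- all_products[order(all_products$product.id, all_products$Date), ]
```

The result has one row per product and date, so split(all_products, all_products$product.id) would recover a list of filled data frames.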

Related

How to split a data set with duplicated informations based on date

I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
And I would like to split the dataset into two, like this:
1-
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2- Lowest Date
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2=dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs for which date is NA and then subset based on that:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column which has TRUE/FALSE for each ID based on presence of NA and then use group_split to split into list of two.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NA_ID = any(is.na(date))) %>%
ungroup %>%
group_split(NA_ID, .keep = FALSE) # current dplyr spells the argument .keep; older releases used keep
The above dplyr logic can also be implemented in base R by using ave and split:
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)
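The answers above use df without defining it. A self-contained sketch of the base R variant, with the question's table typed in by hand (the name has_na is illustrative), might look like this:

```r
# Data from the question; NA dates are kept as NA
df <- data.frame(
  ID = rep(1:5, each = 2),
  date = as.Date(c("2014-12-02", "2014-10-02", "2014-11-03", "2014-09-03",
                   "2014-07-11", NA, "2014-10-05", "2014-08-09", NA, NA)),
  Weight = c(23, 25, 27, 45, 56, 34, 25, 14, NA, NA)
)

# TRUE for every row whose ID has at least one missing date
has_na <- ave(is.na(df$date), df$ID, FUN = any)

# Two data frames: "FALSE" (complete IDs) and "TRUE" (IDs with a missing date)
parts <- split(df, has_na)
parts[["FALSE"]]
parts[["TRUE"]]
```

Keeping the flag in a separate vector rather than a new column avoids having to drop the helper column again before splitting.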

creating a unique variable based on row differences of another variable considering groups

By using the data below, I want to create a new unique customer id by considering their contact date.
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
I couldn't get any further than calculating the date differences.
The original dataset I work with is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
treshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
"2019-01-08","2019-01-09","2019-02-02","2019-02-05",
"2019-02-05","2019-02-09","2019-02-12","2019-02-01",
"2019-02-01","2019-02-05")),
desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
class = "data.frame",
row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days:=contact_date - shift(contact_date, type = c("lag")), by=customer_id]
dt[, date_diff_in_days:=as.numeric(date_diff_in_days)]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check it on your actual data (comparing against results from, e.g., Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or when the threshold is exceeded.
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > treshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
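The cumsum logic does not depend on data.table. As a rough base R sketch of the same rule, assuming the question's sample data and a threshold of 2 days (the names gap and new_id are illustrative):

```r
df <- data.frame(
  customer_id = c("10", "20", "20", "20", "20", "20", "30", "30", "30",
                  "30", "30", "40", "50", "50"),
  contact_date = as.Date(c("2019-01-05", "2019-01-01", "2019-01-01", "2019-01-02",
                           "2019-01-08", "2019-01-09", "2019-02-02", "2019-02-05",
                           "2019-02-05", "2019-02-09", "2019-02-12", "2019-02-01",
                           "2019-02-01", "2019-02-05")),
  desired_output = c(1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10)
)
df <- df[order(df$customer_id, df$contact_date), ]

# Days since the previous contact of the same customer; NA for the first contact
gap <- ave(as.numeric(df$contact_date), df$customer_id,
           FUN = function(d) c(NA, diff(d)))

# Start a new id at each customer's first contact or when the gap exceeds 2 days
df$new_id <- cumsum(is.na(gap) | gap > 2)

all(df$new_id == df$desired_output)
```

The flag is TRUE exactly at the rows where a new id should start, so the running sum of the flags is the id itself.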

How can I fill missing data points in R for a given dataframe

I have a dataframe which contains dates, products and amounts. However, product b does not appear on every date; I would like it to appear on each date with an NA or 0 amount. Is this possible?
Summary_Date <-
as.Date(c("2017-01-31",
"2017-02-28",
"2017-03-31",
"2017-03-31",
"2017-04-30",
"2017-05-31",
"2017-05-31",
"2017-06-30"))
Product <-
as.character(c("a","a","a","b","a","a","b","a"))
Amounts <-
as.numeric(c(10,10,10,20,10,10,20,10))
df <- data.frame(Summary_Date,Product,Amounts)
Regards,
Aksel
You can use tidyr:
> library(tidyr)
> complete(data = df,Summary_Date,Product)
# A tibble: 12 x 3
Summary_Date Product Amounts
<date> <fctr> <dbl>
1 2017-01-31 a 10
2 2017-01-31 b NA
3 2017-02-28 a 10
4 2017-02-28 b NA
5 2017-03-31 a 10
6 2017-03-31 b 20
7 2017-04-30 a 10
8 2017-04-30 b NA
9 2017-05-31 a 10
10 2017-05-31 b 20
11 2017-06-30 a 10
12 2017-06-30 b NA
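The question also allows a 0 balance instead of NA. A minimal sketch of that variant, assuming the same df and using the fill argument of complete() (the name filled is illustrative):

```r
library(tidyr)

df <- data.frame(
  Summary_Date = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31", "2017-03-31",
                           "2017-04-30", "2017-05-31", "2017-05-31", "2017-06-30")),
  Product = c("a", "a", "a", "b", "a", "a", "b", "a"),
  Amounts = c(10, 10, 10, 20, 10, 10, 20, 10)
)

# fill takes a named list: one replacement value per column to fill
filled <- complete(df, Summary_Date, Product, fill = list(Amounts = 0))
filled
```

Whether 0 or NA is the better choice depends on whether a missing row really means a zero amount or simply an unknown one.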

Looping over unique values [closed]

I have a data frame in long format, with one observation row per measurement. I want to loop through each unique ID and find the "minimum" date for each unique individual. For example, patient 1 may be measured at three different times, but I want the earliest time. I thought about sorting the dataset by the date (in increasing order) and removing all duplicates, but I'm not sure if this is the best way to go. Any help or suggestions would be greatly appreciated. Thank you!
We can use data.table. Convert the data.frame to a data.table with setDT(df1), order the rows by Date (assuming it is of class Date; otherwise convert it with as.Date and the correct format), and take the first row per ID with head:
library(data.table)
setDT(df1)[order(Date), head(.SD, 1), by = ID]
Here is another way using base R:
earliestDates = aggregate(list(date = df$date), list(ID = df$ID), min)
result = merge(earliestDates,df)
earliestDates is a two column data frame that has the minimum date by ID. The merge will join the values in the other columns.
Example:
set.seed(1)
ID = floor(runif(20,1,5))
day = as.Date(floor(runif(20,1,25)),origin = "2017-1-1")
weight = floor(runif(20,80,95))
df = data.frame(ID = ID, date = day, weight = weight)
> df
ID date weight
1 2 2017-01-24 92
2 2 2017-01-07 89
3 3 2017-01-17 91
4 4 2017-01-05 88
5 1 2017-01-08 87
6 4 2017-01-11 91
7 4 2017-01-02 80
8 3 2017-01-11 87
9 3 2017-01-22 90
10 1 2017-01-10 90
11 1 2017-01-13 87
12 1 2017-01-16 92
13 3 2017-01-13 86
14 2 2017-01-06 83
15 4 2017-01-21 81
16 2 2017-01-18 81
17 3 2017-01-21 84
18 4 2017-01-04 87
19 2 2017-01-19 89
20 4 2017-01-11 86
After the aggregate and merge, the result is:
> result
ID date weight
1 1 2017-01-08 87
2 2 2017-01-06 83
3 3 2017-01-11 87
4 4 2017-01-02 80
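The asker's own idea of sorting by date and dropping duplicates also works. A small base R sketch on made-up data (the values below are illustrative, not the output of the set.seed example above):

```r
# Small illustrative data set: several measurements per ID
df <- data.frame(
  ID = c(2, 1, 2, 1, 3),
  date = as.Date(c("2017-01-24", "2017-01-08", "2017-01-07",
                   "2017-01-10", "2017-01-17")),
  weight = c(92, 87, 89, 90, 91)
)

# Sort by ID and date, then keep the first (earliest) row of each ID
sorted <- df[order(df$ID, df$date), ]
earliest <- sorted[!duplicated(sorted$ID), ]
earliest
```

Unlike the aggregate-and-merge approach, this always returns exactly one row per ID, even when the earliest date occurs more than once for the same ID.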
Try the following dplyr code:
library(dplyr)
set.seed(12345)
###Create test dataset
tb <- tibble(id = rep(1:10, each = 3),
date = rep(seq(as.Date("2017-07-01"), by=10, len=10), 3),
obs = rnorm(30))
# # A tibble: 30 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 1 2017-07-11 0.7094660
# 1 2017-07-21 -0.1093033
# 2 2017-07-31 -0.4534972
# 2 2017-08-10 0.6058875
# 2 2017-08-20 -1.8179560
# 3 2017-08-30 0.6300986
# 3 2017-09-09 -0.2761841
# 3 2017-09-19 -0.2841597
# 4 2017-09-29 -0.9193220
# # ... with 20 more rows
###Pipe the dataset through dplyr's 'group_by' and 'filter' commands
tb %>% group_by(id) %>%
filter(date == min(date)) %>%
ungroup() %>%
distinct()
# # A tibble: 10 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 2 2017-07-31 -0.4534972
# 3 2017-08-30 0.6300986
# 4 2017-07-01 -0.1162478
# 5 2017-07-21 0.3706279
# 6 2017-08-20 0.8168998
# 7 2017-07-01 0.7796219
# 8 2017-07-11 1.4557851
# 9 2017-08-10 -1.5977095
# 10 2017-09-09 0.6203798

How to recreate the table by key?

I thought this would be a very easy question, but I am a complete beginner in R.
I have a data.table with many rows and several columns, two of which could be set as the key. I want to recreate the table by key.
For example, take the simple data below. In this case, the key is ID and Act, which gives a total of 4 groups.
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is code to generate the test data:
library(data.table)
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01","2015-02-01", "2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
Volume=c(21,22,12,11,5,7,4,2,6))
After the change, each column should represent one specific group defined by the key (ID and Act).
Below is the desired result:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
2015-01-01 21 12 5 4
2015-02-01 22 11 7 2
2015-03-01 NA NA NA 6
Thanks a lot !
What you are trying to do is not recreating the data.table, but reshaping it from a long format to a wide format. You can use dcast for this:
dcast(dd, ValueDate ~ ID + Act, value.var = "Volume")
which gives:
ValueDate 1_EUR 1_MAD 2_EUR 3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
If you want the numbers in the resulting columns to be preceded with ID, then you can use:
dcast(dd, ValueDate ~ paste0("ID",ID) + Act, value.var = "Volume")
which gives:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
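For comparison, the same long-to-wide reshaping can be sketched in base R with tapply(). This is not what the answer uses, only an illustration that assumes the test data above stored in a plain data.frame; the name wide is hypothetical:

```r
dd <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                "2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                "2015-03-01"),
  Act = c("EUR", "EUR", "MAD", "MAD", "EUR", "EUR", "EUR", "EUR", "EUR"),
  Volume = c(21, 22, 12, 11, 5, 7, 4, 2, 6)
)

# Rows: ValueDate; columns: one per (ID, Act) group; empty cells become NA
wide <- with(dd, tapply(Volume,
                        list(ValueDate = ValueDate,
                             group = paste0("ID", ID, "_", Act)),
                        sum))
wide
```

The result is a matrix with the dates as row names rather than a data.table with a ValueDate column, which is one reason dcast() is usually the more convenient choice here.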
