I am trying to find the first date of each category and then subtract 5 days, AND I want to keep the days in between! This is where I am struggling. I tried seq() but it gave me an error, so I'm not sure if this is the right way to do it.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!
The error I got is this (for the commented-out part of the code):
Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
'from' must be of length 1
Any help would be greatly appreciated!
library ("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
"2020-06-09",
"2020-06-10",
"2020-06-11",
"2020-06-12",
"2021-07-13",
"2021-07-14",
"2021-07-15",
"2021-08-16",
"2021-08-17",
"2021-08-18",
"2021-09-19",
"2021-09-20"),
value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
group_by(category) %>%
arrange(date) %>%
slice(1L) %>% #takes first date
mutate(first_day = as.Date(date) - 5)#%>%
#seq(as.Date(first_day),by="day",length.out=5)
#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates, but in a column format! I'm also trying to input NA in the value column if a value doesn't already exist. I want to keep all possible columns, as the data frame I need to use this on has about 20 columns.
Dates: "2020-06-03 ", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2020-07-08 ", "2020-07-09", "2020-07-10", "2020-07-11", "2020-07-12", "2021-07-13", "2020-08-11 ", "2020-08-12", "2020-08-13", "2020-08-14", "2020-08-15", "2021-08-16", "2020-09-14 ", "2020-09-15", "2020-09-16", "2020-09-17", "2020-09-18", "2021-09-19",
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Here's one approach, but it's kinda clunky:
bind_rows(
data,
data %>%
group_by(category) %>%
slice_min(date) %>%
uncount(6, .id = "id") %>%
mutate(date = date - id + 1) %>%
select(-id)) %>%
arrange(category, date)
Result
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for those rows).
library(data.table)
setDT(data)[data[, .(date=seq(min(date)-5,by="day", length.out=6)), category], on=.(category,date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with a data.table generated similarly to the above:
rbind(
setDT(data)[order(date), .SD[1],category],
data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)
You indicate you have many columns, so if you are going to take this second approach, then rather than explicitly setting value=NA in the second input to rbind(), you can just leave it out and add fill=TRUE within rbind().
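For concreteness, that variant might look like this (a sketch; fill=TRUE pads any columns missing from the second table, here value, with NA):
rbind(
  setDT(data)[order(date), .SD[1], category],
  data[, .(date = seq(min(date) - 5, by = "day", length.out = 5)), category],
  fill = TRUE # fills the absent value column (and any other missing columns) with NA
)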
A dplyr version of the same is:
bind_rows(
data %>%
group_by(category) %>%
slice_min(date) %>%
ungroup() %>%
mutate(date=as.Date(date)),
data %>%
group_by(category) %>%
summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
Update (9/21/22) -
If you want the NA values to be filled, simply add this to the end of either data.table pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
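For example, appended to the dplyr pipeline above, the whole thing might read as follows (a sketch; na.rm=T assumes each category has at least one non-NA value, otherwise max() warns and returns -Inf):
bind_rows(
  data %>% group_by(category) %>% slice_min(date) %>% ungroup() %>% mutate(date = as.Date(date)),
  data %>% group_by(category) %>% summarize(date = seq(min(as.Date(date)) - 5, by = "day", length.out = 5), .groups = "drop")
) %>%
  group_by(category) %>%
  mutate(value = max(value, na.rm = TRUE)) %>%
  ungroup()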
#Jon Spring's answer inspired this alternative approach:
Here we first get the first days minus 5, as already presented in the question. Then we use bind_rows as Jon Spring does in his answer. The next step is to identify the original first dates within the date column (we use !duplicated inside filter). The last main step is to use coalesce:
library(lubridate)
library(dplyr)
data %>%
group_by(category) %>%
mutate(x = min(ymd(date))-5) %>%
slice(1) %>%
bind_rows(data) %>%
mutate(date = ymd(date)) %>%
filter(!duplicated(date)) %>%
mutate(x = coalesce(x, date)) %>%
arrange(category) %>%
select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10
I have data with varying start and end dates.
mydata <- data.frame(id=c(1,2,3), start=c("2010/01/01","2010/01/01","2010/01/02"), end=c("2010/01/01","2010/01/05","2010/01/03"), a=c(140,750,56),b=c(48,25,36))
mydata
id start end a b
1  1 2010/01/01 2010/01/01 140 48
2  2 2010/01/01 2010/01/05 750 25
3  3 2010/01/02 2010/01/03  56 36
I want to find the average of the variables a and b for each day. Below I execute it by expanding every row with different start and end dates, then collapsing it back to the daily level.
mydata$start <- as.Date(mydata$start, "%Y/%m/%d")
mydata$end <- as.Date(mydata$end, "%Y/%m/%d")
mydata$subt <- as.numeric(mydata$end - mydata$start + 1)
require(data.table)
mydata <- setDT(mydata)[ , list(idnum = id, date = seq(start, end, by = "day"), a = a/subt, b = b/subt), by = 1:nrow(mydata)]
mydata
nrow idnum date a b
1: 1 1 2010-01-01 140 48
2: 2 2 2010-01-01 150 5
3: 2 2 2010-01-02 150 5
4: 2 2 2010-01-03 150 5
5: 2 2 2010-01-04 150 5
6: 2 2 2010-01-05 150 5
7: 3 3 2010-01-02 28 18
8: 3 3 2010-01-03 28 18
library(dplyr)
mydata %>%
group_by(date) %>%
summarize(a = sum(a),
b = sum(b))
Desired Outcome:
date a b
<date> <dbl> <dbl>
1 2010-01-01 290 53
2 2010-01-02 178 23
3 2010-01-03 178 23
4 2010-01-04 150 5
5 2010-01-05 150 5
However, I have plenty of rows with different start and end dates, and sometimes the spans are very long. I am wondering if there is an easier way (i.e., without expanding every row) to find the daily averages for each variable. It would also be great if there were a way to find the weekly averages without first finding the daily figures. Thank you!
Here is an option with tidyverse. We convert the 'start' and 'end' columns to Date class with ymd (from lubridate), create a sequence of dates from 'start' to 'end' for corresponding elements with map2, divide 'a' and 'b' by the lengths of the list column 'date', unnest the 'date' column, and then, grouped by 'date', take the sum of 'a' and 'b':
library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)
mydata %>%
mutate(across(c(start, end), ymd)) %>%
transmute(id, date = map2(start, end, seq, by = 'day'), a, b) %>%
mutate(across(c(a, b), ~ ./lengths(date))) %>%
unnest(date) %>%
group_by(date) %>%
summarise(across(c(a, b), sum, na.rm = TRUE))
# A tibble: 5 x 3
# date a b
# <date> <dbl> <dbl>
#1 2010-01-01 290 53
#2 2010-01-02 178 23
#3 2010-01-03 178 23
#4 2010-01-04 150 5
#5 2010-01-05 150 5
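For the weekly figures the question also asks about, one simple option (hedged: it still passes through the daily sums first) is to bin the daily result into weeks with lubridate::floor_date():
# assumes `daily` holds the result of the pipeline above
daily %>%
  group_by(week = floor_date(date, "week", week_start = 1)) %>% # assumption: weeks begin on Monday
  summarise(across(c(a, b), sum))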
I'm a complete beginner to R and I just need to do some quick cleaning of my data, but I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series. The columns are ID, DATE and VALUE (temperature). Each ID is a measuring station, so I have a time series for each ID (around 2000 unique IDs, 4m rows). The dates span 1915-2016; some series overlap, some do not. If a measurement is missing for a week, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016, and I clearly understand why it happens. How can I make it fill values only between the actual start and end dates of each specific time series? I want a moving min and max that depend on the start and end date of each specific ID, and then to fill the missing dates between them.
library("RpostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit 1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I don't know why this works when your whole code does not, but possibly the data structure in your code is not what is needed. If so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't coming from the package you need; using tidyr::complete works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
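Rather than filtering each ID by hand, the same idea should extend to all IDs at once with a grouped complete() (a sketch; because ID is the grouping variable, it stays filled in on the new rows, unlike the NA IDs in the output above):
df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(ID) %>%
  tidyr::complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()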
I have a shopping list data like this:
df <- data.frame(id = 1:5, item = c("apple2milk5", "milk1", "juice3apple5", "egg10juice1", "egg8milk2"), stringsAsFactors = F)
# id item
# 1 1 apple2milk5
# 2 2 milk1
# 3 3 juice3apple5
# 4 4 egg10juice1
# 5 5 egg8milk2
I want to separate the variable item into multiple columns and record the number that follows each good. The problem I've run into is that the goods each person purchases differ, so I cannot solve it with tidyr::separate() or other analogous functions. What I expect is:
# id apple milk juice egg
# 1 1 2 5 NA NA
# 2 2 NA 1 NA NA
# 3 3 5 NA 3 NA
# 4 4 NA NA 1 10
# 5 5 NA 2 NA 8
Note: The categories of goods in the market are unknown. So don't assume there are only 4 kinds of goods.
Thanks for any help!
I just came up with a tidyverse solution which uses stringr::str_extract_all() to extract the quantities, sets their names as product names, and expands them to wide using tidyr::unnest_wider().
library(tidyverse)
df %>%
mutate(N = map2(str_extract_all(item, "\\d+"), str_extract_all(item, "\\D+"), set_names)) %>%
unnest_wider(N, transform = as.numeric)
# # A tibble: 5 × 6
# id item apple milk juice egg
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 apple2milk5 2 5 NA NA
# 2 2 milk1 NA 1 NA NA
# 3 3 juice3apple5 5 NA 3 NA
# 4 4 egg10juice1 NA NA 1 10
# 5 5 egg8milk2 NA 2 NA 8
I'll add yet another answer. It only slightly differs from #ASuliman's but uses a bit of the newer tidyr and some cute regex to be a bit more straightforward.
The regex trick is that the pattern "(?<=\\d)\\B(?=[a-z])" will match the non-boundary (i.e. an empty location) between numbers and letters, allowing you to create rows for every "apple5" type of entry. Extract the letters into an item column and numbers into a count column. Using the new pivot_wider which replaces spread, you can convert those counts to numeric values as you reshape.
library(dplyr)
library(tidyr)
df %>%
separate_rows(item, sep = "(?<=\\d)\\B(?=[a-z])") %>%
extract(item, into = c("item", "count"), regex = "^([a-z]+)(\\d+)$") %>%
pivot_wider(names_from = item, values_from = count, values_fn = list(count = as.numeric))
#> # A tibble: 5 x 5
#> id apple milk juice egg
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 5 NA NA
#> 2 2 NA 1 NA NA
#> 3 3 5 NA 3 NA
#> 4 4 NA NA 1 10
#> 5 5 NA 2 NA 8
Possibly something like this; it should work with any item/quantity.
It just assumes that the quantity follows the item.
Lets use a custom function which extracts item and quantity:
my_fun <- function(w) {
items <- stringr::str_split(w, "\\d+", simplify = TRUE)
items <- items[items != ""] # the split leaves empty strings behind; drop them
quantities <- stringr::str_split(w, "\\D+", simplify = TRUE)
quantities <- quantities[quantities != ""]
d <- data.frame(item = items, quantity=quantities, stringsAsFactors = F)
return(d)
}
Example:
my_fun("apple2milk5")
# gives:
# item quantity
# 1 apple 2
# 2 milk 5
Now we can apply the function to each id, using nest and map:
library(dplyr)
df_result <- df %>%
  nest(data = item) %>% # tidyr >= 1.0 syntax; older posts used nest(item)
  mutate(res = purrr::map(data, ~ my_fun(.x$item))) %>%
  select(-data) %>%
  unnest(res)
df_result
# # A tibble: 9 x 3
# id item quantity
# <int> <chr> <chr>
# 1 1 apple 2
# 2 1 milk 5
# 3 2 milk 1
# 4 3 juice 3
# 5 3 apple 5
# 6 4 egg 10
# 7 4 juice 1
# 8 5 egg 8
# 9 5 milk 2
Now we can use dcast() (probably spread would work too):
data.table::dcast(df_result, id~item, value.var="quantity")
# id apple egg juice milk
# 1 1 2 <NA> <NA> 5
# 2 2 <NA> <NA> <NA> 1
# 3 3 5 <NA> 3 <NA>
# 4 4 <NA> 10 1 <NA>
# 5 5 <NA> 8 <NA> 2
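For reference, the spread()/pivot_wider() equivalent hinted at above might look like this (a sketch; quantity is still character here, so convert afterwards if numeric columns are needed):
tidyr::pivot_wider(df_result, id_cols = id, names_from = item, values_from = quantity)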
Data:
df <- data.frame(id = 1:5, item = c("apple2milk5", "milk1", "juice3apple5", "egg10juice1", "egg8milk2"), stringsAsFactors = F)
# split each item string at every digit/non-digit boundary, then pair up names and counts
tmp = lapply(strsplit(df$item, "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)", perl = TRUE),
             function(x) {
               d = split(x, 0:1) # alternating pieces: d[[1]] = names, d[[2]] = counts
               setNames(as.numeric(d[[2]]), d[[1]])
             })
nm = unique(unlist(lapply(tmp, names))) # every good seen in any row
cbind(df, do.call(rbind, lapply(tmp, function(x) setNames(x[nm], nm))))
# id item apple milk juice egg
#1 1 apple2milk5 2 5 NA NA
#2 2 milk1 NA 1 NA NA
#3 3 juice3apple5 5 NA 3 NA
#4 4 egg10juice1 NA NA 1 10
#5 5 egg8milk2 NA 2 NA 8
Place a space before each numeric substring and a newline after it. Then read that data using read.table and unnest it. Finally use pivot_wider to convert from long to wide form.
library(dplyr)
library(tidyr)
df %>%
mutate(item = gsub("(\\d+)", " \\1\n", item)) %>%
rowwise %>%
mutate(item = list(read.table(text = item, as.is = TRUE))) %>%
ungroup %>%
unnest(item) %>%
pivot_wider(names_from = "V1", values_from = "V2")
giving:
# A tibble: 5 x 5
id apple milk juice egg
<int> <int> <int> <int> <int>
1 1 2 5 NA NA
2 2 NA 1 NA NA
3 3 5 NA 3 NA
4 4 NA NA 1 10
5 5 NA 2 NA 8
Variation
This is a variation of the above code that eliminates the unnest. We replace each numeric string with a space, that string, another space, the id, and a newline. Then we use read.table to read that in. Note the use of %$% (from magrittr, which exposes the columns of the left-hand side to read.table) rather than %>% before the read.table. Finally we use pivot_wider to convert from long to wide form.
library(dplyr)
library(magrittr)
library(tidyr)
df %>%
rowwise %>%
mutate(item = gsub("(\\d+)", paste(" \\1", id, "\n"), item)) %$%
read.table(text = item, as.is = TRUE, col.names = c("nm", "no", "id")) %>%
ungroup %>%
pivot_wider(names_from = "nm", values_from = "no")
You can try
library(tidyverse)
library(stringi)
df %>%
  mutate(item2 = gsub("[0-9]", " ", item)) %>%
  mutate(item3 = gsub("[a-z]", " ", item)) %>%
  mutate_at(vars(item2, item3), ~ stringi::stri_extract_all_words(.) %>% map(paste, collapse = ",")) %>%
  separate_rows(item2, item3, sep = ",") %>%
  spread(item2, item3)
id item apple egg juice milk
1 1 apple2milk5 2 <NA> <NA> 5
2 2 milk1 <NA> <NA> <NA> 1
3 3 juice3apple5 5 <NA> 3 <NA>
4 4 egg10juice1 <NA> 10 1 <NA>
5 5 egg8milk2 <NA> 8 <NA> 2
# replace any digit followed by a non-digit (using a positive look-ahead assertion) with the digit plus a comma
library(dplyr)
library(tidyr)
df %>% mutate(item = gsub('(\\d+(?=\\D))', '\\1,', item, perl = TRUE)) %>%
separate_rows(item, sep = ",") %>%
extract(item, into = c('prod','quan'), '(\\D+)(\\d+)') %>%
spread(prod, quan, fill=0)
id apple egg juice milk
1 1 2 0 0 5
2 2 0 0 0 1
3 3 5 0 3 0
4 4 0 10 1 0
5 5 0 8 0 2
This is a simple solution in base R and stringr:
goods <- unique(unlist(stringr::str_split(df$item, pattern = "[0-9]")))
goods <- goods[goods != ""]
df <- cbind(df$id, sapply(goods,
function(x) stringr::str_extract(df$item, pattern = paste0(x,"[0-9]*"))))
df <- as.data.frame(df)
df[-1] <- lapply(df[-1], function(x) as.numeric(stringr::str_extract(x, pattern = "[0-9]*$")))
names(df)[1] <- "id"
Output
id apple milk juice egg
1 1 2 5 NA NA
2 2 NA 1 NA NA
3 3 5 NA 3 NA
4 4 NA NA 1 10
5 5 NA 2 NA 8
Mostly base R with some input from stringr and data.table:
library(stringr)
library(data.table)
cbind(
id = df$id,
rbindlist(
lapply(df$item, function(x) as.list(setNames(str_extract_all(x, "[0-9]+")[[1]], strsplit(x, "[0-9]+")[[1]]))),
fill = TRUE
)
)
id apple milk juice egg
1: 1 2 5 <NA> <NA>
2: 2 <NA> 1 <NA> <NA>
3: 3 5 <NA> 3 <NA>
4: 4 <NA> <NA> 1 10
5: 5 <NA> 2 <NA> 8
A cleaner data.table solution with input from stringr:
setDT(df)[,
.(it_count = str_extract_all(item, "[0-9]+")[[1]],
it_name = str_extract_all(item, "[^0-9]+")[[1]]),
by = id
][, dcast(.SD, id ~ it_name, value.var = "it_count")]
id apple egg juice milk
1: 1 2 <NA> <NA> 5
2: 2 <NA> <NA> <NA> 1
3: 3 5 <NA> 3 <NA>
4: 4 <NA> 10 1 <NA>
5: 5 <NA> 8 <NA> 2
In the data frame below there are several runs of consecutive days with missing values.
I want to create a table that shows the missing days
Expected output
Table of missing values
from to
2012-01-08 2012-01-12
2012-01-18 2012-01-22
2012-01-29 2012-02-01
I tried to do it using this code
library(dplyr)
df$Date <- as.Date(df$Date, format = "%d-%b-%Y")
from_to_table_NA <- df %>%
dplyr::filter(is.na(value)) %>%
dplyr::summarise(from = min(Date),
to = max(Date))
> from_to_table_NA
from to
1 2012-01-08 2012-02-01
As expected, it gave me only the overall minimum and maximum dates of the missing values. I will highly appreciate any suggestion on how to get the desired output.
DATA
df <- read.table(text = c("
Date value
5-Jan-2012 5
6-Jan-2012 2
7-Jan-2012 3
8-Jan-2012 NA
9-Jan-2012 NA
10-Jan-2012 NA
11-Jan-2012 NA
12-Jan-2012 NA
13-Jan-2012 4
14-Jan-2012 5
15-Jan-2012 5
16-Jan-2012 7
17-Jan-2012 5
18-Jan-2012 NA
19-Jan-2012 NA
20-Jan-2012 NA
21-Jan-2012 NA
22-Jan-2012 NA
23-Jan-2012 12
24-Jan-2012 5
25-Jan-2012 7
26-Jan-2012 8
27-Jan-2012 8
28-Jan-2012 10
29-Jan-2012 NA
30-Jan-2012 NA
31-Jan-2012 NA
1-Feb-2012 NA
2-Feb-2012 12"), header =T)
You need to group by runs of consecutive days. This can be done by taking the cumulative sum of the condition that the difference between successive days is not exactly 1:
df %>%
  filter(is.na(value)) %>%
  group_by(g = cumsum(coalesce(as.numeric(Date - lag(Date)), 1) != 1)) %>% # as.numeric() because newer dplyr's coalesce() won't mix difftime with numeric
  summarise(from = min(Date),
            to = max(Date))
Gives:
# A tibble: 3 x 3
g from to
<int> <date> <date>
1 0 2012-01-08 2012-01-12
2 1 2012-01-18 2012-01-22
3 2 2012-01-29 2012-02-01
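If the helper column g is unwanted in the final table, it can be dropped at the end of the same pipeline (using the same ... shorthand as in the update above):
... %>%
  select(-g)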