I'm running into a data issue in RStudio regarding properly melting data. It is currently in the following form:
Campaign, ID, Start Date, End Date, Total Number of Days, Total Spend, Total Impressions, Total Conversions
I would like my data to look like the following:
Campaign, ID, Date, Spend, Impressions, Conversions
Each 'date' should contain a specific day the campaign was run while spend, impressions, and conversions should equal Total Spend / Total # of Days, Total Impressions / Total # of Days, and Total Conversions / Total # of Days, respectively.
I'm working in RStudio so a solution in R is needed. Does anyone have experience manipulating data like this?
This works, but it's not particularly efficient. If your data is millions of rows or more, I've had better luck using SQL and inequality joins.
library(tidyverse)

# create some bogus data
data <- data.frame(ID = 1:10,
                   StartDate = sample(seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day"), 10),
                   Total = runif(10)) %>%
  mutate(EndDate = StartDate + floor(runif(10) * 14))

# generate all dates between the min and max in the dataset
AllDates <- data.frame(Date = seq.Date(min(data$StartDate), max(data$EndDate), "day"),
                       Dummy = TRUE)

# join via a dummy variable to add rows for all dates to every ID
data %>%
  mutate(Dummy = TRUE) %>%
  inner_join(AllDates, by = c("Dummy" = "Dummy")) %>%
  # filter to just the dates between the start and end
  filter(Date >= StartDate, Date <= EndDate) %>%
  # divide the total by the number of days
  group_by(ID) %>%
  mutate(TotalPerDay = Total / n()) %>%
  select(ID, Date, TotalPerDay)
# A tibble: 91 x 3
# Groups: ID [10]
ID Date TotalPerDay
<int> <date> <dbl>
1 1 2018-06-21 0.00863
2 1 2018-06-22 0.00863
3 1 2018-06-23 0.00863
4 1 2018-06-24 0.00863
5 1 2018-06-25 0.00863
6 1 2018-06-26 0.00863
7 1 2018-06-27 0.00863
8 1 2018-06-28 0.00863
9 1 2018-06-29 0.00863
10 1 2018-06-30 0.00863
# ... with 81 more rows
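Applied to the columns in your question, the same pattern would look roughly like the sketch below. Note the assumptions: the data frame is called campaigns here, and the column names are written without spaces (StartDate, EndDate, TotalSpend, TotalImpressions, TotalConversions), so adjust to your actual names.

library(tidyverse)

# all calendar dates covered by any campaign, plus a dummy key for the join
AllDates <- data.frame(Date = seq.Date(min(campaigns$StartDate), max(campaigns$EndDate), "day"),
                       Dummy = TRUE)

campaigns %>%
  mutate(Dummy = TRUE) %>%
  inner_join(AllDates, by = "Dummy") %>%
  # keep only the dates each campaign actually ran
  filter(Date >= StartDate, Date <= EndDate) %>%
  # spread each total evenly over the campaign's days
  group_by(Campaign, ID) %>%
  mutate(Spend = TotalSpend / n(),
         Impressions = TotalImpressions / n(),
         Conversions = TotalConversions / n()) %>%
  ungroup() %>%
  select(Campaign, ID, Date, Spend, Impressions, Conversions)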
I'm working with an ecological dataset that has multiple individuals moving across a landscape where they can be detected at multiple sites. The data has a beginning and ending timestamp for when an individual was detected at a given site; hereafter we'll call this time window for an individual at a site an "event". These events are the rows in this data. I sorted the data by time and noticed that I can have multiple events while an individual remains at a given site (which can happen when an individual moves away from the receiver and comes back to it without being detected at an adjacent receiver).
Here's example data for a single individual, x:
input <- data.frame(individual = c("x","x","x","x","x","x","x"),
                    site = c("a","a","a","b","b","a","a"),
                    start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-01-14 11:13:10", "2020-01-14 11:16:20",
                                              "2020-02-14 11:11:11", "2020-02-14 11:13:10",
                                              "2020-03-14 11:12:11", "2020-03-15 11:12:11")),
                    end_time = as.POSIXct(c("2020-01-14 11:11:41", "2020-01-14 11:13:27", "2020-01-14 11:16:50",
                                            "2020-02-14 11:13:11", "2020-02-14 11:15:10",
                                            "2020-03-14 11:20:11", "2020-03-15 11:20:11")))
I want to aggregate these smaller events (e.g. the first 3 events at site a) into one larger event where I summarize the start/end times for the whole event:
output <- data.frame(individual = c("x","x","x"), site = c("a", "b", "a"),
                     start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-02-14 11:11:11", "2020-03-14 11:12:11")),
                     end_time = as.POSIXct(c("2020-01-14 11:16:50", "2020-02-14 11:15:10", "2020-03-15 11:20:11")))
Note that time intervals for events vary.
Using group_by(individual, site) would mean losing this temporal info, since individuals can travel among sites multiple times. I thought about using some sort of helper data frame that summarizes events for individuals at sites, but I am not sure how to retain the temporal info. I suppose there is a way to do this by indexing row numbers/looping in base R, but I am hoping there is a nifty dplyr trick that can help with this problem.
One approach would be to take the cumulative sum of the number of times the site has changed, and use that running count to summarize each individual's contiguous time at one site.
library(dplyr)

input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = ""))) %>%
  group_by(individual, site, indiv_new_site) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time))
# A tibble: 3 x 5
# Groups: individual, site [2]
individual site indiv_new_site start_time end_time
<chr> <chr> <int> <dttm> <dttm>
1 x a 1 2020-01-14 11:11:11 2020-01-14 11:16:50
2 x a 3 2020-03-14 11:12:11 2020-03-15 11:20:11
3 x b 2 2020-02-14 11:11:11 2020-02-14 11:15:10
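To see what the running counter is doing, here is the intermediate step on its own (just the example data before summarizing):

input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = "")))
# indiv_new_site comes out as 1 1 1 2 2 3 3: it only increments when the site
# changes, so each contiguous stay gets its own group id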
We could use rle from base R
library(dplyr)

input %>%
  arrange(individual, start_time) %>%
  group_by(individual, site, grp = with(rle(site),
                                        rep(seq_along(values), lengths))) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = 'drop') %>%
  select(-grp)
Output:
# A tibble: 3 x 4
# individual site start_time end_time
# <chr> <chr> <dttm> <dttm>
#1 x a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2 x a 2020-03-14 11:12:11 2020-03-15 11:20:11
#3 x b 2020-02-14 11:11:11 2020-02-14 11:15:10
In data.table we can use rleid.
library(data.table)
setDT(input)
input[, .(site = first(site),
          start_time = min(start_time),
          end_time = max(end_time)), .(individual, rleid(site))]
# individual rleid site start_time end_time
#1: x 1 a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2: x 2 b 2020-02-14 11:11:11 2020-02-14 11:15:10
#3: x 3 a 2020-03-14 11:12:11 2020-03-15 11:20:11
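If you would rather stay in dplyr, recent versions (1.1.0 and later) provide consecutive_id(), which plays the same role as rleid(); a minimal sketch on the same input:

library(dplyr)

input %>%
  arrange(individual, start_time) %>%
  group_by(individual, site, grp = consecutive_id(site)) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = "drop") %>%
  select(-grp)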
I have a dataset in R with sales values for N products from some yyyy-mm-dd to some yyyy-mm-dd, and I just want to filter the data to the last 12 months for each product in the dataset.
Eg:
Say, I have values from 2016-01-01 to 2020-02-01
So now I want to filter the sales values for the last 12 months that is from 2019-02-01 to 2020-02-01
I can't simply hard-code filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01")), because the end date keeps changing as each month passes, so I need to automate this.
You can use:

library(dplyr)
library(lubridate)

data %>%
  group_by(Product) %>%
  filter(between(date, max(date) - years(1), max(date)))
  #filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether each date is greater than or equal to the maximum date per product minus 365 days:
library(dplyr)

df %>%
  group_by(Products) %>%
  filter(Date >= max(Date) - 365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
  Products = c(1,1,1,1,2,2,2,3,3,3),
  Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
                   "2000-06-01", "2002-02-24", "2002-02-10",
                   "2001-07-01", "2003-01-02", "2005-03-10",
                   "2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.POSIXct(start.date.lt)
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1 (POSIXlt stores the day of the month in mday), and so forth if you need to adjust in different ways.
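A minimal sketch of such a helper, assuming you want February 29th to fall back to February 28th of the previous year (the function name and that fallback choice are illustrative, not prescribed):

today.last.year <- function(today = Sys.Date()) {
  lt <- as.POSIXlt(today)
  lt$year <- lt$year - 1
  # Feb 29 has no counterpart in the previous year; fall back to Feb 28
  if (lt$mon == 1 && lt$mday == 29) lt$mday <- 28
  as.Date(as.POSIXct(lt))
}

today.last.year(as.Date("2020-02-29"))  # "2019-02-28"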
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
  product = rep(c("a", "b"), 6),
  date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)

yearBefore <- function(x){
  year(x) <- year(x) - 1
  x
}

date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]

result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch = 0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?
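For what it's worth, a more compact data.table alternative (just a sketch, reusing the yearBefore() helper defined above) filters each product's rows against its own latest date and returns the same six rows:

# within each product, keep the dates on or after one year before the latest date
DT[, .SD[date >= yearBefore(max(date))], by = product]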
I've got a data set with reservation data that has the below format :
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to have the monthly total_paid amount divided by days and assigned to each month correctly for budget reasons.
While there's no issue for casa1, casa2 and casa3 have days reserved in both months and the totals get skewed because of this issue.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = seq_along(property),                      # a few helper variables
         day_paid = total_paid / as.numeric(check_out - check_in),
         date = check_in) %>%
  group_by(id) %>%
  complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>%  # one row per day of stay (except the last)
  ungroup() %>%
  mutate(month = cut(date, breaks = "month")) %>%       # determine the month of each date
  fill(property, check_in, check_out, total_paid, day_paid) %>%
  group_by(id, month) %>%
  summarise(property = unique(property),
            check_in = unique(check_in),
            check_out = unique(check_out),
            total_paid = unique(total_paid),
            paid_month = sum(day_paid))                 # summarise per month
Result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable, but please ask if there is something I should explain. The convention is that people don't pay for the last day of a stay, so I took that into account.
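As a quick sanity check (assuming the pipeline above is assigned to result), the monthly amounts should add back up to each booking's total:

result %>%
  group_by(id) %>%
  summarise(total_paid = unique(total_paid),
            paid_check = sum(paid_month))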
I have a data set with values every minute and I want to calculate the average value for every hour. I have tried using group_by(), filter() and summarise() from the dplyr package to reduce the data to hourly values. When I use only these functions I am able to get the mean value for every hour, but only per month, and I want it for each day.
> head(DF)
datetime pw cu year m d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065 2017 8 18 14 53
2 2017-08-18 14:57:00 0.0604653 1.86771 2017 8 18 14 57
3 2017-08-18 14:58:00 0.0601318 1.86596 2017 8 18 14 58
4 2017-08-18 14:59:00 0.0599276 1.83761 2017 8 18 14 59
5 2017-08-18 15:00:00 0.0598998 1.84177 2017 8 18 15 0
I had to use a for loop to reduce my table; I wrote the following to do it:
start <- as.POSIXct("2018-01-01 00:00:00")  # first hour of the year
datetime <- c()
eg_bf <- c()
for (i in 1:8760) {
  hour <- start + 3600
  # summarise the current hour into a temporary object so DF itself is not overwritten
  hourly <- DF %>%
    filter(datetime >= start & datetime < hour) %>%
    summarise(eg = mean(pw))
  datetime <- append(datetime, start)
  eg_bf <- append(eg_bf, hourly$eg)
  start <- hour
}
new_DF <- data.frame(datetime, eg_bf)
So I was able to get my new data set with the mean value for every hour of the year.
datetime eg_bf
1 2018-01-01 00:00:00 0.025
2 2018-01-01 01:00:00 0.003
3 2018-01-01 02:00:00 0.002
4 2018-01-01 03:00:00 0.010
5 2018-01-01 04:00:00 0.015
The problem I'm facing is that it takes a lot of time. The idea is to add this calculation to a Shiny UI, so every time I make a change it must recompute quickly. Any idea how to speed this calculation up?
You can try this: use make_datetime() from the lubridate package to build a new date_time column from the year, month, day and hour columns of your dataset, then group and summarise on that new column.
library(dplyr)
library(lubridate)

df %>%
  mutate(date_time = make_datetime(year, m, d, hr)) %>%
  group_by(date_time) %>%
  summarise(eg_bf = mean(pw))
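If you'd rather work from the datetime column directly and skip the helper columns, lubridate's floor_date() can truncate each timestamp to the hour; a small sketch, assuming the column is named datetime as in the question:

library(dplyr)
library(lubridate)

DF %>%
  mutate(hour_start = floor_date(datetime, unit = "hour")) %>%
  group_by(hour_start) %>%
  summarise(eg_bf = mean(pw))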
@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does, however, work with just dplyr:
df %>%
  group_by(year, m, d, hr) %>%
  summarise(test = mean(pw))
# A tibble: 2 x 5
# Groups: year, m, d [?]
year m d hr test
<int> <int> <int> <int> <dbl>
1 2017 8 18 14 0.0609
2 2017 8 18 15 0.0599
You said in your question:
When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.
What did you do differently?
Even if you've found your answer, I believe this is worth mentioning: if you're working with a lot of data and speed is an issue, you might want to see if you can use data.table instead of dplyr.
You can see with a simple benchmark how much faster data.table is:
library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)

set.seed(123)

# dummy data, one year, one entry per minute
# first as a data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"),
                                as.POSIXct("2019-01-02 00:00:00"), 60),
                 pw = runif(527041)) %>%
  mutate(year = year(datetime), m = month(datetime),
         d = day(datetime), hour = hour(datetime))

# save it as a data.table
dt <- as.data.table(DF)

# transformation with dplyr
f_dplyr <- function() {
  DF %>%
    group_by(year, m, d, hour) %>%
    summarize(eg_bf = mean(pw))
}

# transformation with data.table
f_datatable <- function() {
  dt[, mean(pw), by = .(year, m, d, hour)]
}

# benchmarking
microbenchmark(f_dplyr(), f_datatable())
#
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714 100 b
# f_datatable() 9.081295 9.712694 12.53998 10.55697 11.33933 41.85217 100 a
Check out this post, which covers the comparison in depth: data.table vs dplyr: can one do something well the other can't or does poorly?
As I understand it, you have a data frame of 365 * 24 * 60 rows. The code below returns the result almost instantly. The outcome is mean(pw) grouped by every hour of the year.
remove(list = ls())
library(dplyr)
library(lubridate)
library(purrr)
library(tibble)
date_time <- seq.POSIXt(
  as.POSIXct("2018-01-01"),
  as.POSIXct("2019-01-01"),
  by = "1 min"
)

n <- length(date_time)

data <- tibble(
  date_time = date_time,
  pw = runif(n),
  cu = runif(n),
  ye = year(date_time),
  mo = month(date_time),
  da = day(date_time),
  hr = hour(date_time)
)

grouped <- data %>%
  group_by(
    ye, mo, da, hr
  ) %>%
  summarise(
    mean_pw = mean(pw)
  )
I have a dataset with two columns Id and Date as shown below using a toy dataset.
Id Date
5373283 2010-11-05
5373283 2014-11-05
5373283 2001-07-13
5373283 2007-12-01
5373283 2015-07-07
3475684 2015-05-19
3475684 2010-06-24
I want to check whether any of the dates for each Id are within a 2-year range of each other. If they are, a column should show Yes; if not, No. The final output would look like this:
Id Status
5373283 Yes
3475684 No
Yes for Id 5373283 because the two dates 2014-11-05 and 2015-07-07 are within two years of each other; No for Id 3475684 because the two dates are more than 2 years apart. Any help on accomplishing this is much appreciated.
Hypothetical data.
DF <- data.frame(id = c(1, 1, 1, 2, 2),
                 date = c("2010-10-9", "2012-10-8", "2008-10-5",
                          "2007-7-5", "2009-7-5"), stringsAsFactors = FALSE)
The code below gets the minimal interval by ID in days.
What is happening is:
mutate redefines date as a Date-class column,
arrange sorts the data by date,
group_by tells R that the following computation should be done for each ID, and
summarize computes the minimum difference.
library(dplyr)
DF %>% mutate(date = as.Date(date)) %>%
  arrange(date) %>%
  group_by(id) %>%
  summarize(diffmin = as.numeric(min(diff(date)), units = "days"))
# id diffmin
# (dbl) (dbl)
#1 1 730
#2 2 731
If you can ignore leap years, a value smaller than or equal to 730 means within 2 years. Note that the difference between 2007-7-5 and 2009-7-5 is 731 days, so it is judged as outside 2 years.
If that is not acceptable, a simple difference in days is not enough, and we need to define a custom checker function.
check2years <- function(a, b) {
  # check if b - a <= 2 years
  # assumes a and b are Date objects
  yr_a <- format(a, "%Y") %>% as.integer()
  yr_b <- format(b, "%Y") %>% as.integer()
  dy_a <- format(a, "%m-%d")
  dy_b <- format(b, "%m-%d")
  (yr_b - yr_a < 2) | ((yr_b - yr_a == 2) & (dy_b >= dy_a))
}
Then, you can check if any combination is within 2 years by the following.
DF %>% mutate(date = as.Date(date)) %>%
  arrange(date) %>%
  group_by(id) %>%
  summarize(within2yr = any(check2years(head(date, length(date) - 1),
                                        tail(date, length(date) - 1))))
# id within2yr
# (dbl) (lgl)
#1 1 TRUE
#2 2 TRUE
You can also solve this without any library:
Using your example:
Id = c(5373283,5373283,5373283,5373283,5373283,3475684,3475684)
Date = as.Date(c("2010-11-05","2014-11-05","2001-07-13","2007-12-01","2015-07-07","2015-05-19","2010-06-24"))
df = data.frame(Id,Date)
> df
Id Date
7 3475684 2010-06-24
6 3475684 2015-05-19
3 5373283 2001-07-13
4 5373283 2007-12-01
1 5373283 2010-11-05
2 5373283 2014-11-05
5 5373283 2015-07-07
Do the following:
First, order your data by Id and then by Date:
df = df[order(df$Id,df$Date),]
Do an aggregate by Id using the function min(diff(x)), where x are the dates for each Id.
z = aggregate(df$Date,by = list(Id = df$Id),FUN = function(x){min(diff(x))})
This function returns the smallest difference between adjacent dates, which is why you need to order the data frame first.
This returns:
> z
Id x
1 3475684 1790
2 5373283 244
Where column x is the minimum difference in days.
Here, you only need to evaluate whether column x is less than or equal to 2*365:
z$result = z$x<=2*365
Giving:
Id x result
1 3475684 1790 FALSE
2 5373283 244 TRUE
Final code
df = df[order(df$Id,df$Date),]
z = aggregate(df$Date,by = list(Id = df$Id),FUN = function(x){min(diff(x))})
z$result = z$x<=2*365
You can use something like this with the dplyr library; the idea is to take the top two dates in sorted order for each Id and see whether they differ by less than two years:
library(dplyr)
df$Date <- as.Date(df$Date)

df %>%
  group_by(Id) %>%
  summarise(Status = as.numeric(difftime(max(Date), Date[order(Date, decreasing = TRUE)][2],
                                         units = 'days')) < 730)
Output will be as follows:
Source: local data frame [2 x 2]
Id Status
(int) (lgl)
1 3475684 FALSE
2 5373283 TRUE
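If you want the literal Yes/No column from the question, and want to compare every adjacent pair of dates rather than just the top two, a small variation on the same idea (a sketch, using the 730-day approximation from the other answers):

library(dplyr)

df %>%
  group_by(Id) %>%
  summarise(Status = ifelse(min(diff(sort(Date))) <= 730, "Yes", "No"))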