I wonder if there is a way to apply a function to each row of a data.frame such that the column classes are preserved? Let's look at an example to clarify what I mean:
test <- data.frame(startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
enddate = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
nEvents = c(123, 456, 789))
Suppose I would like to expand the data.frame test by inserting all days between startdate and enddate and distribute the number of events over those days. My first try to do so was this:
eventsPerDay1 <- function(row) {
n_days <- as.numeric(row$enddate - row$startdate) + 1
data.frame(date = seq(row$startdate, row$enddate, by = "1 day"),
nEvents = rmultinom(1, row$nEvents, rep(1/n_days, n_days)))
}
apply(test, 1, eventsPerDay1)
This, however, is not possible because apply calls as.matrix on test and thus it gets converted to a character matrix and all column classes are lost.
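The coercion is easy to verify directly:
mode(as.matrix(test))  # "character" -- every column has been coerced
str(test[1, ])         # subsetting as a data.frame keeps the column classes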
I already found two workarounds, which you can find below, so my question is more of a philosophical nature.
library(magrittr)
############# Workaround 1
eventsPerDay2 <- function(startdate, enddate, nEvents) {
n_days <- as.numeric(enddate - startdate) + 1
data.frame(date = seq(startdate, enddate, by = "1 day"),
nEvents = rmultinom(1, nEvents, rep(1/n_days, n_days)))
}
mapply(eventsPerDay2, test$startdate, test$enddate, test$nEvents, SIMPLIFY = F) %>%
do.call(rbind, .)
############# Workaround 2
seq_len(nrow(test)) %>%
lapply(function(i) test[i, ]) %>%
lapply(eventsPerDay1) %>%
do.call(rbind, .)
My "problem" with the workarounds is the following:
Workaround 1: It may not be the best reason, but I simply do not like mapply. It has a different signature than the other *apply functions (the order of arguments differs) and I always feel that a for loop would just have been clearer (though see the Map() note below).
Workaround 2: While being very flexible, I think it is not clear at first sight what is happening.
So does anyone know a function whose call would look like apply(test, 1, eventsPerDay1) and that will work?
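A side note on Workaround 1: Map() is a thin wrapper around mapply() with SIMPLIFY = FALSE that, like the other *apply functions, takes the function as its first argument, so the same idea can be written as
Map(eventsPerDay2, test$startdate, test$enddate, test$nEvents) %>%
do.call(rbind, .)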
Another idea:
library(dplyr)
library(tidyr)
test %>%
mutate(id = row_number()) %>%
group_by(startdate) %>%
complete(startdate = seq.Date(startdate, enddate, 1), nesting(id)) %>%
group_by(id) %>%
mutate(nEvents = rmultinom(1, first(nEvents), rep(1/n(), n()))) %>%
select(startdate, nEvents)
Which gives:
#Source: local data frame [152 x 3]
#Groups: id [3]
#
# id startdate nEvents
# <int> <date> <int>
#1 1 2010-03-07 6
#2 1 2010-03-08 6
#3 1 2010-03-09 6
#4 1 2010-03-10 7
#5 1 2010-03-11 12
#6 1 2010-03-12 5
#7 1 2010-03-13 8
#8 1 2010-03-14 5
#9 1 2010-03-15 5
#10 1 2010-03-16 9
# ... with 142 more rows
We can do this with data.table
library(data.table)
res <- setDT(test)[,n_days := as.numeric(enddate - startdate) + 1
][, .(date = seq(startdate, enddate, by= "1 day"),
nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))),
by = 1:nrow(test)][, nrow := NULL]
str(res)
#Classes ‘data.table’ and 'data.frame': 152 obs. of 2 variables:
# $ date : Date, format: "2010-03-07" "2010-03-08" "2010-03-09" "2010-03-10" ...
# $ nEvents: int 5 9 7 11 6 6 10 7 12 3 ...
The above can be wrapped in a function
eventsPerDay <- function(dat){
as.data.table(dat)[, n_days:= as.numeric(enddate - startdate) + 1
][, .(date = seq(startdate, enddate, by= "1 day"),
nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))) , 1:nrow(dat)
][, nrow := NULL][]
}
eventsPerDay(test)
I have asked myself the same question.
I either end up splitting the df into a list (the base way)
xy <- data.frame()
xy.list <- split(xy, 1:nrow(xy))
out <- lapply(xy.list, function(x) ...)
answer <- unlist(out)
or try the hadleyverse dplyr way using rowwise (the blackbox way)
xy %>%
rowwise() %>%
mutate(newcol = f(...)) # pseudocode: f() stands for your row-wise function
I agree that there should be a base implementation of apply(xy, 1, function(x)) that doesn't coerce to character, but I imagine the R ancients implemented the matrix conversion for an advanced reason my primitive mind can't understand.
df <- data.frame('Dev' = 1:12,
'GWP' = seq(10,120,10),
'2012' = 1:12,
'Inc' = seq(10,120,10),
'GWP2' = c(seq(10,100,10),NA,NA),
'2013'= 1:12,
'Inc2' = c(seq(10,100,10),NA,NA),
'GWP3' = c(seq(10,80,10),NA,NA,NA,NA),
'2014'= 1:12,
'Inc3' = c(seq(10,80,10),NA,NA,NA,NA))
head(df)
result_df <- data.frame('Dev' = rep(1:12,3),
'GWP' = c(seq(10,120,10),
c(seq(10,100,10),NA,NA),
c(seq(10,80,10),NA,NA,NA,NA)),
'YEAR' = c(rep(2012,12),
rep(2013,12),
rep(2014,12)),
'Inc' = c(seq(10,120,10),
c(seq(10,100,10),NA,NA),
c(seq(10,80,10),NA,NA,NA,NA)))
head(result_df)
The above is my data structure.
I'm trying to make df look like result_df. I'm assuming the reshape2 library would somehow do the trick, but I'm having trouble getting it to come out as expected:
x <- melt(df,id=c("Dev"))
x$value <- ifelse(x$variable == 'X2012',2012,
ifelse(x$variable == 'X2013',2013,
ifelse(x$variable == 'X2014',2014,x$value)))
x$variable <- ifelse(x$variable %in% c('GWP','GWP2','GWP3'),'GWP',
ifelse(x$variable %in% c('Inc','Inc2','Inc3'), 'Inc',
ifelse(x$variable %in% c('X2012','X2013','X2014'),"Year",
x$variable)))
The problem is that the "year" column in my actual data can go for 20-30 years and I want to avoid using multiple ifelse statements to map them up. Is there a way to do this?
The data needs some pre-processing before we can get the expected output. Using the tidyverse, one possible way is:
library(tidyverse)
df %>%
gather(key, value, -Dev) %>%
mutate(col = case_when(str_detect(key, "^GWP") ~ "GWP",
str_detect(key, "^X") ~ "Year",
str_detect(key, "^Inc") ~ "Inc"),
value = ifelse(col == "Year", sub("^X", "", key), value)) %>%
select(-key) %>%
group_by(col) %>%
mutate(Dev1 = row_number()) %>%
spread(col, value) %>%
select(-Dev1)
# A tibble: 36 x 4
# Dev GWP Inc Year
# <int> <chr> <chr> <chr>
# 1 1 10 10 2012
# 2 1 10 10 2013
# 3 1 10 10 2014
# 4 2 20 20 2012
# 5 2 20 20 2013
# 6 2 20 20 2014
# 7 3 30 30 2012
# 8 3 30 30 2013
# 9 3 30 30 2014
#10 4 40 40 2012
# … with 26 more rows
I found that this works for the first part:
apply(matrix(c(2012:2014)), 1, function(y) x$value[x$variable == paste("X", y, sep = "")] <<- y )
Create a one-dimensional matrix to iterate over with apply.
Create a function that replaces the values found through masking.
Note the use of <<-: it assigns the respective values to the x scoped one level above the function defined in the apply (here, the global environment).
Note that it modifies the variable x and returns the values used in the replacement.
For the second part:
x$variable[x$variable %in% c('GWP', 'GWP2', 'GWP3')] <- "GWP"
x$variable[x$variable %in% c('Inc', 'Inc2', 'Inc3')] <- "Inc"
Since the variable column is type factor and Year is not a level:
x <- transform(x, variable = as.character(variable))
x$variable[x$variable %in% c('X2012', 'X2013', 'X2014')] <- "Year"
x <- transform(x, variable = as.factor(variable))
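Since the year columns in real data can run for 20-30 years, here is a sketch that derives the year from the column name instead of enumerating levels (assuming the melted names keep the X prefix, as X2012, X2013, ... do above):
is_year <- grepl("^X\\d{4}$", x$variable)
x$value[is_year] <- as.numeric(sub("^X", "", x$variable[is_year]))
x$variable <- as.character(x$variable)
x$variable[grepl("^GWP", x$variable)] <- "GWP"
x$variable[grepl("^Inc", x$variable)] <- "Inc"
x$variable[is_year] <- "Year"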
In my data I have customer IDs, order dates and an indicator of whether an order contained a certain type of product.
I want to give each customer an indicator of whether his first order contained this type of product. But because my data is pretty big, I cannot use group_by and case_when; they are way too slow. I think I could speed things up a lot by using data.table.
Could you point me to a solution? I haven't had any contact with data.table until now...
# generate data
id <- round(rnorm(3000, mean = 5000, 400),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)
# Does the first Order contain X?
df <- df %>% group_by(id) %>% mutate(First_Order_contains_x = case_when(
date == min(date) & indicator == "1" ~ 1,
TRUE ~ 0
)) %>% ungroup()
# If first order > 1 ==> all orders get 1 //
df <- df %>% group_by(id) %>% mutate(Customer_type = case_when(
sum(First_Order_contains_x) > 0 ~ "Customer with X in first order",
TRUE ~ "Customer without x in first order"
)) %>% ungroup()
Another way:
library(data.table)
DT = data.table(df[, 1:3])
lookupDT = DT[, .(date = min(date)), by=id]
lookupDT[, fx := DT[copy(.SD), on=.(id, date), max(indicator), by=.EACHI]$V1]
DT[, v := "Customer without x in first order"]
DT[lookupDT[fx == 1L], on=.(id), v := "Customer with X in first order"]
# check results
fsetequal(DT[, .(id, v)], data.table(id = df$id, v = df$Customer_type))
# [1] TRUE
If you want more speed improvements, maybe see ?IDate.
The copy on .SD is needed due to an open issue.
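On the ?IDate hint, a one-line sketch: as.IDate() stores dates as integers, which makes sorting and grouping cheaper:
DT[, date := as.IDate(date)]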
Here's how you can improve your existing code using dplyr more efficiently:
lookup = data.frame(First_Order_contains_x = c(TRUE, FALSE),
Customer_Type = c("Customer with X in first order",
"Customer without x in first order"))
df %>%
group_by(id) %>%
mutate(First_Order_contains_x = any(as.integer(date == min(date) & indicator == 1))) %>%
ungroup() %>%
left_join(lookup, by = "First_Order_contains_x")
# A tibble: 3,000 x 5
id date indicator First_Order_contains_x Customer_Type
<fct> <date> <dbl> <lgl> <fct>
1 5056 2018-03-10 1 TRUE Customer with X in first order
2 5291 2018-12-28 0 FALSE Customer without x in first order
3 5173 2018-04-19 0 FALSE Customer without x in first order
4 5159 2018-11-13 0 TRUE Customer with X in first order
5 5252 2018-05-30 0 TRUE Customer with X in first order
6 5200 2018-01-20 0 FALSE Customer without x in first order
7 4578 2018-12-18 1 FALSE Customer without x in first order
8 5308 2018-03-24 1 FALSE Customer without x in first order
9 5234 2018-05-29 1 TRUE Customer with X in first order
10 5760 2018-06-12 1 TRUE Customer with X in first order
# … with 2,990 more rows
Another data.table approach. Sort the data first so that the first row within each id carries the earliest date; the first indicator can then be used to test the condition. Then convert the logical to an integer (FALSE -> 1, TRUE -> 2) and map it to the desired output with a character vector.
library(data.table)
setDT(df)
setorder(df, id, date)
map <- c("Customer without x in first order", "Customer with X in first order")
df[, idx := 1L+any(indicator[1L]==1L), by=.(id)][,
First_Order_contains_x := map[idx]]
If the original order is important, we can store the original order using df[, rn := .I] then finally setorder(df, rn).
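Concretely, that bookkeeping is just (a sketch; rn is an arbitrary helper column name):
df[, rn := .I]                  # remember the original row order before sorting
# ... setorder() and the computation above ...
setorder(df, rn)[, rn := NULL]  # restore the original order and drop the helper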
data:
set.seed(0L)
id <- round(rnorm(3000, mean = 5000, 5),0)
date <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day")
date <- sample(date, length(id), replace = TRUE)
indicator <- rbinom(length(id), 1, 0.5)
df <- data.frame(id, date, indicator)
df$id <- as.factor(df$id)
I'm trying to group by ids and date in this dataset, but I want to summarise based on records that fall outside of the group.
library(dplyr)
library(lubridate)
set.seed(100)
df <- data.frame(ids = sample(c('436247', '2465347', '346654645'), 10000, replace=TRUE),
date = sample(seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1), 10000, replace=TRUE))
new_df <- df %>%
group_by(ids, date) %>%
summarise(events = length(ids[date >= date - 30 & date <= date]))
I'm trying to take this dataframe and answer the question - "for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date". Unfortunately, when I group_by both the ids and date, it only looks within the grouped date. I've created the solution below, but not sure if there is a better one with dplyr?
groupby_function <- function(df, spec_date){
result <- df %>%
group_by(ids) %>%
summarise(events = length(ids[date >= spec_date - 30 & date <= spec_date])) %>%
mutate(date = spec_date)
return(result)
}
date_vector <- seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1)
list_results <- lapply(date_vector, groupby_function, df=df)
x <- do.call(rbind, list_results)
"for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date"
For that, a "join by" condition makes sense, but isn't yet included in dplyr. Until it is, you could use data.table inside your dplyr chain:
# enumerate id-date combos of interest
grid_df = expand.grid(
id = unique(df$ids),
d = seq(min(df$date), max(df$date), by="day")
)
# helper function
library(data.table)
count_matches = function(DF, targetDF, ...){
onexpr = substitute(list(...))
data.table(targetDF)[DF, on=eval(onexpr), .N, by=.EACHI]$N
}
# use a non-equi join to count matching rows
res = grid_df %>%
mutate(d_dn = d - 30) %>%
mutate(n = count_matches(., df, ids = id, date >= d_dn, date <= d)) %>%
as.tibble
# A tibble: 186 x 4
id d d_dn n
<fctr> <date> <date> <int>
1 436247 2018-03-01 2018-01-30 72
2 2465347 2018-03-01 2018-01-30 69
3 346654645 2018-03-01 2018-01-30 51
4 436247 2018-03-02 2018-01-31 123
5 2465347 2018-03-02 2018-01-31 120
6 346654645 2018-03-02 2018-01-31 100
7 436247 2018-03-03 2018-02-01 170
8 2465347 2018-03-03 2018-02-01 166
9 346654645 2018-03-03 2018-02-01 154
10 436247 2018-03-04 2018-02-02 228
# ... with 176 more rows
For equality conditions, I think it should work fine to write either ids = id or ids == id.
If you're interested, the syntax is x[i, on=, j, by=.EACHI] where x and i are tables. For each row of i, we look up rows of x based on the on= criteria (left-hand side refers to columns in x; right-hand to columns in i); then we do j for each ("by each row of i" so by=.EACHI). In this case, j = .N means that we count matched rows of x, returned as a column of counts N.
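A minimal toy illustration of that pattern (made-up data, nothing from the question):
library(data.table)
x <- data.table(id = c(1L, 1L, 2L), v = 1:3)
i <- data.table(id = 1:2)
x[i, on = .(id), .N, by = .EACHI]
#    id N
# 1:  1 2
# 2:  2 1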
You can look at the "ungrouped" data by just going back to the original data frame (calling df$date or df$ids). So I think what you are after is
test_df <- df %>%
group_by(ids, date) %>%
summarise(events = length(df$ids[df$date >= date[1] - 30 & df$date <= date[1] & df$ids == ids[1]]))
Also, I ran your proposed function, but I did not see any difference in the result from your original group_by solution, so I don't think that is what you want.
If a 'non dplyr' solution is acceptable, this gives you what you want.
df$diff <- as.vector(
sapply(unique(df$ids), function(x)
sapply(df$date[df$ids == x], function(y)
sum(abs(y - df$date[df$ids == x]) >= 30)
)
)
)
Alternatively, in dplyr, you can get a result like the above using:
f <- function(x) {
sapply(x, function(y) sum(abs(y - x) >= 30))
}
df$diff <- unlist(
df %>%
group_by(ids) %>%
do(diff = f(.$date)) %>%
.$diff
)
Here's an answer. But it assumes there's a continuous sequence of dates in each id.
df %>%
group_by(ids, date) %>%
count() %>%
arrange(ids, date) %>%
group_by(ids) %>%
mutate(
events = cumsum(n) - cumsum(lag(n, 30, 0))
)
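If the dates do have gaps, a sketch that pads them first so the 30-row lag stays aligned with 30 days (the count()/complete()/full_seq() steps are the only additions):
library(dplyr)
library(tidyr)
df %>%
  count(ids, date) %>%
  complete(ids, date = full_seq(date, 1), fill = list(n = 0L)) %>%
  arrange(ids, date) %>%
  group_by(ids) %>%
  mutate(events = cumsum(n) - cumsum(lag(n, 30, 0)))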
I have a panel data set for daily revenue (and other variables) by ID, where days with 0 revenue go unreported. I want to fill in these blanks with 0 for my analysis, meaning that for each ID's time series, I need to make sure there is an observation for each day. Each series can begin or end on a date distinct from the other series. I have been attempting to use the "padr" package, but I keep getting an "unused argument" error with the following sample code:
library(padr)
library(dplyr)
#unbalanced panel data
ID <- c(1,1,1,1,
2,2,2,2,2,2,
3,3,3,3,3,3,3,
4,4,4)
DT <- today() + c(1,3,4,5, #ID = 1
3,4,7,8,9,10, #ID = 2
2,5,6,7,8,9,10, #ID = 3
8,10,11) #ID = 4
#The end date denote the max date for each ID
EndDT <- today() + c(5,5,5,5, #ID = 1
13,13,13,13,13,13, #ID = 2
10,10,10,10,10,10,10, #ID = 3
15,15,15) #ID = 4
#random variables v1 and v2 to represent revenue and other variables
set.seed(1)
v1 <- rnorm(20,mean = 10000, sd = 5)
v2 <- rnorm(20,mean = 5000, sd = 1.5)
df <- as.data.frame(cbind(ID,DT,EndDT,v1,v2))
#format to simpler date
df$DT <- as.Date(DT, origin="1970-01-01")
df$EndDT <- as.Date(EndDT, origin="1970-01-01")
df_padded <- arrange(df,ID,DT) %>%
pad(by='DT',group='ID', end_val='EndDT') %>%
fill_by_value(v1,v2, value=0)
My error message:
Error in pad(., by = "DT", group = "ID", end_val = "EndDT") :
unused argument (group = "ID")
Answers not involving padr are also highly welcome.
After tussling with padr for a while, I decided to write my own function. This function works for the example set, but quickly ran into issues with real data. Either way, I figured this may be of use to someone else, so here it is:
date.pad <- function(df, date.var, group, replace.vars, new.val=0){
require("dplyr")
require("lazyeval")
require("lubridate")
tempdf1 <- arrange_(df,group,date.var)
finaldf <- tempdf1[0,]
unique.id <- unique(tempdf1[,group])
nonreplaced.vars <- setdiff(colnames(tempdf1),replace.vars)
nonreplaced.vars <- nonreplaced.vars[!nonreplaced.vars==date.var]
for(i in seq_along(unique.id)){
filter_criteria <- interp(~y==x, .values=list(y=as.name(group),x=unique.id[i])) #filter on the ID value, not the loop index; necessary for NSE
tempdf2 <- filter_(tempdf1,filter_criteria)
min.date <- min(tempdf2[[date.var]])
max.date <- max(tempdf2[[date.var]])
all.days <- as.Date(seq(min.date,max.date,by="days"),origin="1970-01-01")
distinct.days <- unique(tempdf2[,date.var])
app.days <- as.Date(setdiff(all.days,distinct.days),origin="1970-01-01")
tempdf3 <- tempdf2[0,]
for(n in seq_along(app.days)){
tempdf3[n,date.var] <- app.days[n]
}
for(j in seq_along(nonreplaced.vars)){
tempdf3[1:nrow(tempdf3),nonreplaced.vars[j]] <- tempdf2[1,nonreplaced.vars[j]]
}
finaldf <- bind_rows(finaldf,tempdf3)
}
finaldf[replace.vars] <-new.val
finaldf <- bind_rows(finaldf,df) %>% arrange_(group,date.var)
return(finaldf)
}
for.exmpl <- date.pad(df=df1, date.var="DT", group="ID", replace.vars=c("v1","v2"), new.val=0)
for.exmpl
Here is a new answer I've devised that is far more applicable outside my one application, and uses way less code:
library(tidyverse)
temp <- group_by(df1,ID) %>%
complete(DT = seq.Date(min(DT),max(EndDT),by="day")) %>%
fill(EndDT,sometext) %>%
arrange(ID,DT)
temp[is.na(temp)] <- 0
View(temp)
Which results in:
# A tibble: 33 x 6
# Groups: ID [4]
ID DT EndDT v1 v2 sometext
<dbl> <date> <date> <dbl> <dbl> <chr>
1 1. 2018-05-04 2018-05-08 9997. 5001. textvar
2 1. 2018-05-05 2018-05-08 0. 0. textvar
3 1. 2018-05-06 2018-05-08 10001. 5001. textvar
4 1. 2018-05-07 2018-05-08 9996. 5000. textvar
5 1. 2018-05-08 2018-05-08 10008. 4997. textvar
6 2. 2018-05-06 2018-05-16 10002. 5001. textvar
7 2. 2018-05-07 2018-05-16 9996. 5000. textvar
8 2. 2018-05-08 2018-05-16 0. 0. textvar
9 2. 2018-05-09 2018-05-16 0. 0. textvar
10 2. 2018-05-10 2018-05-16 10002. 5000. textvar
# ... with 23 more rows
(Please ignore the "sometext" variable. I created that while testing out my function below.)
Your code does not run because you pass a character to the end_val argument. This should be a Date, and you can only specify a single date for all the groups.
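For illustration, a call with a single Date would run, but it pads every ID out to the same end date, which is not what is wanted here (a sketch, assuming a padr version that accepts the group argument):
arrange(df, ID, DT) %>%
  pad(by = 'DT', group = 'ID', end_val = max(df$EndDT)) %>%
  fill_by_value(v1, v2, value = 0)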
In order to do what you want with padr, you should combine the DT and EndDT columns. This way for each ID its final date is present in the DT column:
df %>%
group_by(ID) %>%
summarise(DT = max(EndDT)) %>%
mutate(v1 = NA, v2 = NA) %>%
bind_rows(df %>% select(-EndDT), .) %>%
group_by(ID, DT) %>%
filter(row_number() == 1) %>%
group_by(ID) %>%
pad()
For the following dataset:
d = data.frame(date = as.Date(as.Date('2015-01-01'):as.Date('2015-04-10'), origin = "1970-01-01"),
group = rep(c('A','B','C','D'), 25), value = sample(1:100))
head(d)
date group value
1: 2015-01-01 A 4
2: 2015-01-02 B 32
3: 2015-01-03 C 46
4: 2015-01-04 D 40
5: 2015-01-05 A 93
6: 2015-01-06 B 10
.. can anyone advise a more elegant way to calculate a cumulative total of values by group than this (data.table) method?
library(data.table)
setDT(d)
d.cast = dcast.data.table(d, group ~ date, value.var = 'value', fun.aggregate = sum)
c.sum = d.cast[, as.list(cumsum(unlist(.SD))), by = group]
.. which is pretty clunky and yields a flat matrix that needs dplyr::gather or reshape2::melt to reformat.
Surely R can do better than this??
If you just want cumulative sums per group, then you can do
transform(d, new=ave(value,group,FUN=cumsum))
with base R.
This should work
library(dplyr)
d %>%
group_by(group) %>%
arrange(date) %>%
mutate(Total = cumsum(value))
As this question was tagged with data.table, you are probably looking for this (a modification of @Frank's comment):
setDT(d)[order(date), new := cumsum(value), by = group]
This will simultaneously rearrange the data by date (not sure if that is needed; if not, you can get rid of order(date)) and update your data set in place using the := operator.
Is this it?
sp <- split(d, d$group)
res <- lapply(seq_along(sp), function(i) cumsum(sp[[i]]$value))
res <- lapply(seq_along(res), function(i){
sp[[i]]$c.sum <- res[[i]]
sp[[i]]
})
res <- do.call(rbind, res)
res