I have the following dataset:
A B
2007-11-22 2004-11-18
<NA> 2004-11-10
When the value of column A is NA, I want it to be replaced by the date in column B, with an additional 25 days added.
Here is what the outcome should look like:
A B
2007-11-22 2004-11-18
2004-12-05 2004-11-10
So far, I have tried the following ifelse call, but with no success.
library(lubridate)
data$A <- ifelse(is.na(data$A), data$B + days(25), data$A)
Could anyone tell me what's wrong with it or give me an alternate solution? The code to build my dataset is below.
A <- c("2007-11-22 01:00:00", NA)
B <- c("2004-11-18", "2004-11-10")
data <- data.frame(A, B)
data$A <- as.Date(data$A); data$B <- as.Date(data$B)
The reason for the issue can be traced back to the source code of ifelse. If you type View(ifelse), you will see some lines at the bottom of the source code like these:
ans <- test
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
    ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
    ans[npos] <- rep(no, length.out = len)[npos]
ans
where test is a logical vector and ans is initialized as a copy of test. When ans[ypos] <- rep(yes, length.out = len)[ypos] runs, ans is coerced to numeric rather than Date. That's why you have integers in column A after using ifelse.
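A minimal demonstration of the coercion: both branches are Dates, yet the result is bare numeric, because the Date class attribute is lost when the values are assigned into the logical ans.
class(ifelse(TRUE, Sys.Date(), Sys.Date()))
# [1] "numeric"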
You can try the code below
data$A <- as.Date(ifelse(is.na(data$A), data$B + days(25), data$A), origin = "1970-01-01")
which gives
> data
A B
1 2007-11-22 2004-11-18
2 2004-12-05 2004-11-10
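As an aside, if you are open to dplyr, its if_else() is stricter than base ifelse(): it requires both branches to have the same type and preserves the Date class, so no origin fix-up is needed. A sketch:
library(dplyr)
library(lubridate)
data$A <- if_else(is.na(data$A), data$B + days(25), data$A)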
Assuming the data given reproducibly in the Note at the end -- in particular, we assume both columns are of Date class -- compute a logical vector is_na indicating which entries of A are NA, and then set those entries from B.
is_na <- is.na(data$A)
data$A[is_na] <- data$B[is_na] + 25
This would also work and has the advantage that it does not overwrite data:
transform(data, A = replace(A, is.na(A), B[is.na(A)] + 25))
Note
Lines <- "
A B
2007-11-22 2004-11-18
NA 2004-11-10"
data <- read.table(text = Lines, header = TRUE)
data[] <- lapply(data, as.Date) # convert to Date class
Instead of ifelse you could use coalesce
library(tidyverse)
library(lubridate)
A <- c("2007-11-22 01:00:00", NA)
B <- c("2004-11-18","2004-11-10")
data <- data.frame(A, B)
data <- data %>%
  mutate(A = as_date(A),
         B = as_date(B),
         A = coalesce(A, B + days(25)))
I have a data frame organized like so:
df <- data.frame(id = c(1, 1, 1),
                 startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
                 endDate = c("1990-01-24", "1990-01-25", "1990-01-31"))
Each row represents a start date and an end date. There is some overlap in the data that I'd like to identify. For example, the first range of dates goes from January 1, 1990 to January 24, 1990, and the second row of dates goes from January 23, 1990 to January 25, 1990, so the two overlap.
What I want to do is create a new data frame that would look something like this (illustrated by creating a new R data frame):
df <- data.frame(id = c(1, 1, 1),
                 startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
                 endDate = c("1990-01-24", "1990-01-25", "1990-01-31"),
                 overlap = c(TRUE, TRUE, FALSE),
                 newStartDate = c("1990-01-01", "1990-01-01", "1990-01-30"),
                 newEndDate = c("1990-01-25", "1990-01-25", "1990-01-31"))
So first, identify every row which overlaps with another row. Then create new columns (newStartDate, newEndDate) which take the earliest start date and the latest end date across all the overlapping rows.
I already have in my mind how this would work in pseudocode. However, I wonder if there is a way to make this work using "R"-style programming, with vectors, functions and so on. I'm having trouble conceptualizing how it would work, because there are lots of variables you would need to keep track of, and I'm not sure how you could achieve it using things like map, apply, etc.
Hopefully my question is clear!
Below are some alternative approaches.
(1) and (1a) only use Base R. These are the same except that (1) uses an adjacency matrix and (1a) uses a neighborhood list to avoid forming the potentially large adjacency matrix.
(2) is an SQL solution using the sqldf package.
(3) uses the igraph package and may give a different answer than the above alternatives, although in the case of the example in the question the answers are the same. (3a) is like (3) but, like (1a), avoids forming the adjacency matrix.
At the end we provide some graphics.
Alternatives
1) Base R First we convert the dates to Date class, giving df2. Then we define a function betw which checks whether its first argument lies between its second and third, and use it to define a function overlap which, given two row indexes into df2, determines whether the rows overlap (TRUE) or not (FALSE).
If V is the sequence from 1 to the number of rows in df2, then we can form an adjacency matrix adj such that adj[i, j] is 1 if rows i and j overlap. Using that, it is straightforward to compute the overlap, newStartDate and newEndDate columns.
This approach uses no packages.
df2 <- transform(df, startDate = as.Date(startDate), endDate = as.Date(endDate))
betw <- function(x, a, b) x >= a & x <= b
overlap <- function(i, j) {
  betw(df2[i, "startDate"], df2[j, "startDate"], df2[j, "endDate"]) ||
    betw(df2[j, "startDate"], df2[i, "startDate"], df2[i, "endDate"])
}
# form adjacency matrix of graph having vertices V
V <- 1:nrow(df2)
adj <- sapply(V, function(u) sapply(V, overlap, u)) + 0
orig <- "1970-01-01"
transform(df2, overlap = colSums(adj) > 1,
          newStartDate = as.Date(apply(adj, 1, function(ix) min(startDate[ix == 1])), orig),
          newEndDate = as.Date(apply(adj, 1, function(ix) max(endDate[ix == 1])), orig))
giving:
id startDate endDate overlap newStartDate newEndDate
1 1 1990-01-01 1990-01-24 TRUE 1990-01-01 1990-01-25
2 1 1990-01-23 1990-01-25 TRUE 1990-01-01 1990-01-25
3 1 1990-01-30 1990-01-31 FALSE 1990-01-30 1990-01-31
1a) A variation of (1) which avoids forming the adj adjacency matrix is to create a neighbors list nbrs such that nbrs[[i]] is a vector of the row numbers of the rows that row i overlaps.
nr <- nrow(df2)  # number of rows in df2
nbrs <- lapply(1:nr, function(j) Filter(function(i) overlap(i, j), 1:nr))
names(nbrs) <- 1:nr
orig <- "1970-01-01"
transform(df2, overlap = lengths(nbrs) > 1,
          newStartDate = as.Date(sapply(nbrs, function(ix) min(startDate[ix])), orig),
          newEndDate = as.Date(sapply(nbrs, function(ix) max(endDate[ix])), orig))
2) sqldf Using df2 we can use SQL to compute the desired output in a single SQL statement:
library(sqldf)
sqldf("select
a.id,
a.startDate as startDate__Date,
a.endDate as endDate__Date,
count(b.rowid) > 1 as overlap__logical,
min(b.startDate) as newStartDate__Date,
max(b.endDate) as newEndDate__Date
from df2 as a
left join df2 as b on (a.startDate between b.startDate and b.endDate) or
(b.startDate between a.startDate and a.endDate)
group by a.rowid
order by a.rowid", method = "name__class")
giving:
id startDate endDate overlap newStartDate newEndDate
1 1 1990-01-01 1990-01-24 TRUE 1990-01-01 1990-01-25
2 1 1990-01-23 1990-01-25 TRUE 1990-01-01 1990-01-25
3 1 1990-01-30 1990-01-31 FALSE 1990-01-30 1990-01-31
3) igraph Another approach, which is not equivalent to (1) or (2) but may be preferred, is to partition the rows into connected components using a transitive completion of the overlap relation. It is similar to this question: R: Find groups of vectors that have a > 80% overlap in their elements
Using adj from (1), form a graph g using the igraph package. Rows that have no other rows in their connected component are not overlapped. If we number the connected components 1, 2, ..., then memb is such that memb[i] is the number of the connected component containing row i, so for each row we can find the minimum and maximum date over the connected component it belongs to. Although this gives the same answer as (1) for the input in the question, in general it is different from (1) because, for example, if rows i and j do not overlap but each overlaps row k, then i, j and k are all in the same connected component for purposes of computing the columns of the output.
library(igraph)
g <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
memb <- components(g)$membership
# assemble desired output data frame
transform(df2,
          overlap = ave(memb, memb, FUN = length) > 1,
          newStartDate = ave(startDate, memb, FUN = min),
          newEndDate = ave(endDate, memb, FUN = max))
giving:
id startDate endDate overlap newStartDate newEndDate
1 1 1990-01-01 1990-01-24 TRUE 1990-01-01 1990-01-25
2 1 1990-01-23 1990-01-25 TRUE 1990-01-01 1990-01-25
3 1 1990-01-30 1990-01-31 FALSE 1990-01-30 1990-01-31
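To make the difference between (1) and (3) concrete, consider a hypothetical chained example (dates invented for illustration):
df <- data.frame(id = c(1, 1, 1),
                 startDate = c("1990-01-01", "1990-01-09", "1990-01-19"),
                 endDate = c("1990-01-10", "1990-01-20", "1990-01-30"))
Row 1 overlaps row 2 and row 2 overlaps row 3, but rows 1 and 3 do not touch. Under (1), row 1's newEndDate is 1990-01-20, since only direct neighbours count; under (3), all three rows fall into one connected component, so row 1's newEndDate becomes 1990-01-30.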
3a) Alternately we can form g from nbrs to avoid forming adj like this:
g0 <- graph_from_edgelist(as.matrix(stack(nbrs)), directed = FALSE)
g <- simplify(g0) # remove self loops
Graphics
As an aside, using g we can display a graphical representation in which node i represents row i and edges indicate overlaps.
plot(g)
I created a solution for a similar problem. As I needed to apply the same logic to a large dataset, my way to go was Rcpp and data.table (really for the sorting speed). It also works on multiple groups (ids). The conti() function produces the aggregated ranges of time periods without, in this case, a day of interruption (this can be tuned via the tolerance argument):
conti <- function(
    data = df,
    group = "id",            # the group variable by which to aggregate the dates
    dateFrom = "startDate",
    dateTo = "endDate",
    tolerance = 0,           # what gap still counts as an uninterrupted range of dates; here zero tolerance
    dateFormat = "%Y-%m-%d"  # the date format in df
) {
  if (!require(Rcpp)) { install.packages("Rcpp"); library(Rcpp) }
  if (!require(data.table)) { install.packages("data.table"); library(data.table) }
  cppFunction('DataFrame BezRcpp(DataFrame dtable) {
    int marker = 0;
    IntegerVector ID = dtable["group"];
    DateVector From = dtable["dateFrom"];
    DateVector To = dtable["dateTo"];
    IntegerVector Difference(ID.size(), 9999);
    for (int i = 1; i < ID.size(); i++) {
      if (ID[i] != ID[i-1]) {
        marker = i;
      } else {
        Difference[i] = From[i] - To[marker];
        if (Difference[i] > 1) marker = i;
        else if (To[i] > To[marker]) {
          To[marker] = To[i];
        }
      }
    }
    return DataFrame::create(
      _["group"] = ID,
      _["Difference"] = Difference,
      _["dateFrom"] = From,
      _["dateTo"] = To,
      _["stringsAsFactors"] = false);
  }')
  conti_Rcpp_ <- function(data) {
    A <- Sys.time()
    if (!"data.table" %in% class(data)) dtable <- as.data.table(data) else dtable <- copy(data)
    setnames(dtable, old = c("id", "startDate", "endDate"), new = c("group", "dateFrom", "dateTo"))
    if (class(dtable[["dateFrom"]]) != "Date" || class(dtable[["dateTo"]]) != "Date")
      for (j in c("dateFrom", "dateTo")) set(dtable, j = j, value = as.Date(dtable[[j]], dateFormat))
    setorderv(dtable, c("group", "dateFrom"))
    dt <- setDT(BezRcpp(dtable))
    dt <- dt[Difference > (tolerance + 1), c("group", "dateFrom", "dateTo"), with = FALSE]
    setnames(dt, new = c("id", "startDate", "endDate"), old = c("group", "dateFrom", "dateTo"))
    B <- Sys.time()
    print(paste0("Done in ", round(difftime(B, A, units = "secs"), 1), " secs. A data.table was produced."))
    return(dt)
  }
  return(conti_Rcpp_(data))
}
Then
df <- data.frame(id = c(1L, 1L, 1L),
                 startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
                 endDate = c("1990-01-24", "1990-01-25", "1990-01-31"), stringsAsFactors = FALSE)
conti(df)
#[1] "Done in 0 secs. A data.table was produced."
# id startDate endDate
#1: 1 1990-01-01 1990-01-25
#2: 1 1990-01-30 1990-01-31
You could construct a data.table-dependent function called find_overlaps like the one below:
library(data.table)
find_overlaps <- function(df,
                          groups = NULL,
                          start_var = NULL,
                          end_var = NULL,
                          fmt = "%Y-%m-%d") {
  calc_cummax_Date <- function(x) setattr(cummax(unclass(x)), "class", c("Date", "IDate"))
  df_overlap <- setDT(copy(df))
  rangevars <- c(start_var, end_var)
  groupsidx <- c(groups, "overlap_idx")
  df_overlap <- df_overlap[
    , (rangevars) := lapply(.SD, function(x) as.Date(as.character(x), format = fmt)), .SDcols = rangevars][
    , max_until_now := shift(calc_cummax_Date(get(end_var)), fill = get(end_var)[1]), by = mget(groups)][
    (max_until_now + 1L) < get(start_var), gap_between := 1][
    is.na(gap_between), gap_between := 0][
    , overlap_idx := cumsum(gap_between), by = mget(groups)][
    , `:=` (overlap = .N > 1,
            newStartDate = min(get(start_var)),
            newEndDate = max(get(end_var))), by = groupsidx][
    , c("gap_between", "max_until_now") := NULL
  ]
  return(df_overlap)
}
Calling this function (with [] at the end for printing the output) would give you the desired output:
# The code below will only print the output; to save it, assign the result, e.g. df <- find_overlaps(df, groups = "id", start_var = "startDate", end_var = "endDate")
find_overlaps(df, groups = "id", start_var = "startDate", end_var = "endDate")[]
id startDate endDate overlap_idx overlap newStartDate newEndDate
1: 1 1990-01-01 1990-01-24 0 TRUE 1990-01-01 1990-01-25
2: 1 1990-01-23 1990-01-25 0 TRUE 1990-01-01 1990-01-25
3: 1 1990-01-30 1990-01-31 1 FALSE 1990-01-30 1990-01-31
As you can see, I've also added a column named overlap_idx, as I believe it may be useful to have a separate index for each non-overlapping range within each id.
The function can handle multiple groups (see the sketch below). Since it checks for the cumulative maximum of the end date variable, it also works for cases where a line has the lowest start date but also the highest end date. Additional parameters (like max_days_between, i.e. the definition of what you consider continuous: a one-day gap or more) can be added easily.
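For example, a quick check with a second id (hypothetical data, assuming find_overlaps() as defined above):
df2 <- data.frame(id = c(1, 1, 2),
                  startDate = c("1990-01-01", "1990-01-23", "1990-01-30"),
                  endDate = c("1990-01-24", "1990-01-25", "1990-01-31"))
find_overlaps(df2, groups = "id", start_var = "startDate", end_var = "endDate")[]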
If you're interested, the above function is partly similar to a function called collapse_ranges from my package neatRanges (available on CRAN, but still in an experimental state). It would give you a collapsed output, similar to what you want but as a summary of only the first and last start/end dates for each non-overlapping range:
install.packages('neatRanges')
library(neatRanges)
collapse_ranges(df, groups = "id", start_var = "startDate", end_var = "endDate")[]
id startDate endDate
1 1 1990-01-01 1990-01-25
2 1 1990-01-30 1990-01-31
I'd like to create an aggregation without knowing either the column names or their positions, i.e. I retrieve the names dynamically.
Further, I'm able to use data.frame or data.table, as I'm forced to use R version 3.1.1.
Is there an option like do.call, as explained in this answer for order?
Trying a similar do.call with aggregate leads to an error:
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
                       group2 = rep(c('a','b'), times = 5),
                       x = rnorm(10),
                       y = rnorm(10))
group_by <- c('group1','group2')
test <- do.call(aggregate.data.frame, c(by = group_by, x = smalldat, FUN = mean))
#output
#Error in is.data.frame(x) : Argument "x" missing (no default)
Or is there an option with data.table?
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
                       group2 = rep(c('a','b'), times = 5),
                       x = rnorm(10),
                       y = rnorm(10))
# convert the data.frame to a data.table
library(data.table)
smalldat <- data.table(smalldat)
# add the aggregated variable to the raw data
smalldat[, aggGroup1 := mean(x), by = group1]
Thanks for the advice!
aggregate can take a formula, and you can build a formula from a string.
form = as.formula(paste(". ~", paste(group_by, collapse = " + ")))
aggregate(form, data = smalldat, FUN = mean)
# group1 group2 x y
# 1 1 a 0.1021667 -0.09798418
# 2 2 a -0.5695960 -0.67409059
# 3 1 b -1.0341342 -0.46696381
# 4 2 b -0.3102046 0.46478476
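Since the question also asked about data.table: its by argument accepts a character vector of column names, so the dynamic grouping carries over directly. A sketch, reusing the group_by vector and deriving the value columns dynamically:
library(data.table)
smalldat <- as.data.table(smalldat)
smalldat[, lapply(.SD, mean), by = group_by, .SDcols = setdiff(names(smalldat), group_by)]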
When aggregating an R dataframe, the dates are converted to integers. For instance, say I want to take the maximum date for every id in the following dataframe:
> df1 <- data.frame(id = rep(c(1, 2), 2), b = as.Date(paste("01/01/", 2000:2003, sep=''), format = "%d/%m/%Y"))
> df1
id b
1 1 2000-01-01
2 2 2001-01-01
3 1 2002-01-01
4 2 2003-01-01
> aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
id b
1 1 11688
2 2 12053
Why does R behave this way? (And what's the best way to keep a Date-class column in the returned dataframe?)
Thanks for your help,
That works for me on R version 3; perhaps there were some changes in updates, so I recommend that you update R :)
As for your version of R, have you tried the as.Date() function after aggregating?
In your example, it would look like:
dtf2 <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
dtf2$b <- as.Date(dtf2$b, origin = '1970-01-01')
Note the 'origin' option to as.Date: when R stores dates as integers, it counts days from January 1, 1970, and as.Date() needs that origin to convert the numbers back.
Hope that will help.
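As an aside, in recent versions of R the formula interface of aggregate() keeps the Date class intact, so no back-conversion is needed; a sketch:
aggregate(b ~ id, data = df1, FUN = max)
#   id          b
# 1  1 2002-01-01
# 2  2 2003-01-01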
I've been attempting to aggregate (somewhat erratic) daily data. I'm actually working with csv data, but if I recreate it, it would look something like this:
library(zoo)
dates <- c("20100505", "20100505", "20100506", "20100507")
val1 <- c("10", "11", "1", "6")
val2 <- c("5", "31", "2", "7")
x <- data.frame(dates = dates, val1=val1, val2=val2)
z <- read.zoo(x, format = "%Y%m%d")
Now I'd like to aggregate this on a daily basis (notice that sometimes there is more than one data point for a day, and sometimes there isn't any).
I've tried lots and lots of variations, but I can't seem to aggregate; for instance, this fails:
aggregate(z, as.Date(time(z)), sum)
# Error in Summary.factor(2:3, na.rm = FALSE) : sum not meaningful for factors
There seems to be a lot of content regarding aggregate, and I've tried a number of versions, but I can't seem to sum this at a daily level. I'd also like to run cummax and cumulative averages in addition to the daily sums.
Any help would be greatly appreciated.
Update
The code I am actually using is as follows:
z <- read.zoo(file = "data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, blank.lines.skip = T, na.strings="NA", format = "%Y%m%d");
It seems my (unintentional) quoting of the numbers above is similar to what is happening in practice, because when I do:
aggregate(z, index(z), sum)
#Error in Summary.factor(25L, na.rm = FALSE) : sum not meaningful for factors
There are a number of columns (100 or so); how can I specify that they should be numeric automatically? (stringsAsFactors = FALSE doesn't appear to work?)
Or you can aggregate before using zoo (val1 and val2 need to be numeric, though):
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
y <- aggregate(x[, 2:3], by = list(x[, 1]), FUN = sum)
and then feed y into zoo. You avoid the warning :)
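A sketch of that last step (the aggregated frame now has one row per date, so zoo no longer warns); the first column of y is the grouping column produced by aggregate:
z <- zoo(as.matrix(y[, -1]), as.Date(as.character(y[, 1]), format = "%Y%m%d"))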
You started on the right path but made a couple of mistakes.
First, zoo only consumes matrices, not data.frames. Second, those need numeric inputs:
> z <- zoo(as.matrix(data.frame(val1=c(10,11,1,6), val2=c(5,31,2,7))),
+ order.by=as.Date(c("20100505","20100505","20100506","20100507"),
+ "%Y%m%d"))
Warning message:
In zoo(as.matrix(data.frame(val1 = c(10, 11, 1, 6), val2 = c(5, :
some methods for "zoo" objects do not work if the index entries in
'order.by' are not unique
This gets us a warning, which is standard in zoo: it does not like identical time indices.
It is always a good idea to show the data structure, maybe via str() as well, and maybe to run summary() on it:
> z
val1 val2
2010-05-05 10 5
2010-05-05 11 31
2010-05-06 1 2
2010-05-07 6 7
And then, once we have it, aggregation is easy:
> aggregate(z, index(z), sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
>
val1 and val2 are character strings. data.frame() converts them to factors. Summing factors doesn't make sense. You probably intended:
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
z <- read.zoo(x, format = "%Y%m%d")
aggregate(z, as.Date(time(z)), sum)
which yields:
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
Convert the character columns to numeric and then use read.zoo making use of its aggregate argument:
> x[-1] <- lapply(x[-1], function(x) as.numeric(as.character(x)))
> read.zoo(x, format = "%Y%m%d", aggregate = sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
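The question also asked about running maxima and cumulative averages. Once the series has been aggregated to one row per date (call it za), a sketch along these lines should work, applying the cumulative function to each column:
za <- read.zoo(x, format = "%Y%m%d", aggregate = sum)
zoo(apply(coredata(za), 2, cummax), index(za))  # running maximum per column
zoo(apply(coredata(za), 2, function(v) cumsum(v) / seq_along(v)), index(za))  # cumulative average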