How to get rows, by group, of data frame with earliest timestamp? - r

df <- data.frame(group=c(1,2,4,2,1,4,2,3,3),
ts=c("2014-02-13","2014-06-01","2014-02-14","2014-02-11","2013-02-01","2014-02-02","2014-03-21","2014-12-01","2014-02-11"),
letter=letters[1:9])
df$ts <- as.Date(df$ts,format='%Y-%m-%d')
I want to find an operation that will produce the complete rows containing the minimum timestamp per group, in this case,
group ts letter
1 2013-02-01 e
4 2014-02-02 f
2 2014-02-11 d
3 2014-02-11 i
A quick and dirty (and slow) base R solution would be
dfo <- data.frame(df[order(df$ts,decreasing=F),],index=seq(1:nrow(df)))
mins <- tapply(dfo$index,dfo$group,min)
dfo[dfo$index %in% mins,]
Intuitively, I think if there was a way to add an order index by group then I could just filter to where that column's value is 1, but I'm not sure how to execute it without lots of subsetting and rejoining.
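For what it's worth, that "order index by group" idea can be written directly in base R with ave(), with no subsetting or rejoining; a rough sketch of the approach (the ord column name is just for illustration):
# add a within-group order index (1 = earliest ts in the group), then keep rank 1
df$ord <- ave(as.numeric(df$ts), df$group,
              FUN = function(x) rank(x, ties.method = "first"))
df[df$ord == 1, c("group", "ts", "letter")]  # order by ts afterwards if needed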

You could use dplyr
library(dplyr)
group_by(df, group) %>% summarise(min = min(ts), letter = letter[which.min(ts)])
# group min letter
# 1 1 2013-02-01 e
# 2 2 2014-02-11 d
# 3 3 2014-02-11 i
# 4 4 2014-02-02 f
You could also slice the ranked rows
group_by(df, group) %>%
mutate(rank = row_number(ts)) %>%
arrange(rank) %>%
slice(1)
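A more direct variant of the same idea, keeping all columns in a single step (a sketch along the same lines, not part of the original answer), is to slice on which.min:
group_by(df, group) %>% slice(which.min(ts))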

Here's a data.table solution. You seem to want the result ordered by ts rather than by group; this does that.
library(data.table)
setDT(df)[,.SD[which.min(ts)],by=group][order(ts)]
# group ts letter
# 1: 1 2013-02-01 e
# 2: 4 2014-02-02 f
# 3: 2 2014-02-11 d
# 4: 3 2014-02-11 i

Here's a one-liner using base R.
df[sapply(split(df,df$group), function(x) row.names(x)[which.min(x$ts)] ),]
Breaking it down some:
list.by.group <- split(df,df$group)
#a vector of the row names corresponding to the earliest date in each group
names.of.which.min <- sapply(list.by.group, function(x) row.names(x)[which.min(x$ts)])
#subset the data frame by row name
df[names.of.which.min,]

Related

How to filter rows based on difference in dates between rows in R?

Within each id, I would like to keep rows that are at least 91 days apart. In my dataframe df below, id=1 has 5 rows and id=2 has 1 row.
For id=1, I would like to keep only the 1st, 3rd and 5th rows.
This is because if we compare 1st date and 2nd date, they differ by 32 days. So, remove 2nd date. We proceed to comparing 1st and 3rd date, and they differ by 152 days. So, we keep 3rd date.
Now, instead of using 1st date as reference, we use 3rd date. 3rd date and 4th date differ by 61 days. So, remove 4th date. We proceed to comparing 3rd date and 5th date, and they differ by 121 days. So, we keep 5th date.
In the end, the dates we keep are 1st, 3rd and 5th dates. As for id=2, there is only one row, so we keep that. The desired result is shown in dfnew.
df <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 B 2006-02-02
1 C 2006-06-02
1 D 2006-08-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
dfnew <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 C 2006-06-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
I can only think of starting with grouping the df by id as follows:
library(dplyr)
dfnew <- df %>% group_by(id)
However, I am not sure of how to continue from here. Should I proceed with filter function or slice? If so, how?
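For reference, the rule described above amounts to a greedy scan over each id's sorted dates. Here is a minimal base R sketch of that scan, not part of the original question; the helper name keep_dates is made up for illustration:
keep_dates <- function(dates) {
  # dates: Date vector for one id, sorted ascending; returns the row indices kept
  keep <- 1
  last_kept <- dates[1]
  for (i in seq_along(dates)[-1]) {
    if (as.numeric(dates[i] - last_kept) >= 91) { # at least 91 days since the last kept date
      keep <- c(keep, i)
      last_kept <- dates[i]
    }
  }
  keep
}
# e.g. do.call(rbind, lapply(split(df, df$id),
#                            function(d) d[keep_dates(as.Date(d$date)), ]))
The answers below reach the same result more idiomatically.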
Here's an attempt using rolling joins in data.table, which I believe should be efficient.
library(data.table)
# Set minimum distance
mindist <- 91L
# Make sure it is a real Date
setDT(df)[, date := as.IDate(date)]
# Create a new column with distance + 1 to roll join to
df[, date2 := date - (mindist + 1L)]
# Perform a rolling join for each value in df$date2 that is at least 91 days from df$date
unique(df[df, on = c(id = "id", date = "date2"), roll = -Inf], by = c("id", "var1"))
# id var1 date date2 i.var1 i.date
# 1: 1 A 2005-10-01 2005-10-01 A 2006-01-01
# 2: 1 C 2006-03-02 2006-03-02 C 2006-06-02
# 3: 1 E 2007-08-31 2007-08-31 E 2007-12-01
# 4: 2 F 2007-01-18 2007-01-18 F 2007-04-20
This will give you two additional columns, but that's not a big deal IMO. Logically this makes sense, and I've tested it successfully on different scenarios, but it may need some additional stress testing.
An alternative that uses slice from dplyr is to define the following recursive function:
library(dplyr)
f <- function(d, ind = 1) {
  ind.next <- first(which(difftime(d, d[ind], units = "days") > 90))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}
This function operates on the date column starting at ind = 1. It finds the next index ind.next, the first index for which the date is more than 90 days (i.e., at least 91 days) after the date indexed by ind. If there is no such index, ind.next is NA and we just return ind. Otherwise, we recursively call f starting at ind.next and return its result concatenated with ind. The end result of the call is the set of row indices that are separated by at least 91 days.
With this function, we can do:
result <- df %>% group_by(id) %>% slice(f(as.Date(date, format="%Y-%m-%d")))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <chr>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
The use of this function assumes that the date column is sorted in ascending order within each id group. If not, we can just sort the dates before slicing. I'm not sure about the efficiency of this or about the dangers of recursive calls in R; hopefully David Arenburg or others can comment on that.
As suggested by David Arenburg, it is better to convert date to the Date class once up front rather than within each group:
result <- df %>% mutate(date=as.Date(date, format="%Y-%m-%d")) %>%
group_by(id) %>% slice(f(date))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <date>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
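If recursion depth is a concern (as noted above), the same index-picking logic can also be written iteratively; a sketch of an equivalent loop, not from the original answer (f_iter is a made-up name):
f_iter <- function(d) {
  # d: vector of dates for one id, assumed sorted ascending
  ind <- 1
  repeat {
    ind.next <- first(which(difftime(d, d[last(ind)], units = "days") > 90))
    if (is.na(ind.next)) return(ind)
    ind <- c(ind, ind.next)
  }
}
# result <- df %>% mutate(date = as.Date(date)) %>% group_by(id) %>% slice(f_iter(date))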

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on the Initials column). I am basically trying to pad the rows so that each person has the same number, while retaining the Initials column so I can tell them apart. My attempt failed completely. Anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr, function(x){
  toadd <- maxrows - dim(x)[1]
  replicate(toadd, x <- rbind(x, rep(NA, 1))) # colnames - 1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by Initials, add row numbers, ungroup, complete the row number/Initials combinations, then remove the row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows each initial needs, combine those initials with NA values, then rbind them to the data frame.
max(table(df$Initials)) finds the count for the initial with the most repeats, in this case 2 for a. By subtracting the other initials' counts, table(df$Initials), from that maximum we get a vector of the necessary additions. There's an added bonus to this method: by using table we also automatically get a named vector.
We use the names of that vector to know 1) which initials to repeat, and 2) how many times they should be repeated.
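To make the arithmetic concrete, the intermediate objects for this df look like this (a quick illustration of what the expressions above return):
table(df$Initials)
# a b
# 2 1
to.add <- max(table(df$Initials)) - table(df$Initials)
to.add
# a b
# 0 1
rep(names(to.add), to.add)  # "b": the initial that needs one more row
rep(NA, ncol(df) - 1)       # the NA filler for the remaining column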
To preserve the class of the data column (rbind-ing a character vector coerces it), you can assign the result to newdf and add newdf$data <- as.numeric(newdf$data).

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that would have sum values for all the combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. On the primitive example above, this can be achieved for the first couple of combinations with simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility with respect to how the function is applied, for instance I may want to apply mean instead of the sum
Save the "Total for" string to a separate object that I could easily edit when applying a function other than sum.
I was initially thinking of using dplyr, on the lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use lapply:
#result
res <- do.call(rbind,
               lapply(c(split(dta, dta$CatA),
                        split(dta, dta$CatNum),
                        split(dta, dta[, 1:2])),
                      function(i) sum(i[, "SomeVal"])))
#prettify the result
res1 <- data.frame(Category = paste0("Total for ", rownames(res)),
                   Sum = res[, 1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
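To address the flexibility points raised in the question (swapping sum for mean, and keeping the label text editable), the same split-based idea could be wrapped in a small helper. This is only a sketch; the function name group_summary and its arguments are made up for illustration:
group_summary <- function(data, val, fun = sum, label = "Total for ") {
  # three grouping schemes: CatA alone, CatNum alone, and their combination
  groups <- c(split(data, data$CatA),
              split(data, data$CatNum),
              split(data, data[, c("CatA", "CatNum")]))
  res <- vapply(groups, function(g) fun(g[[val]]), numeric(1))
  data.frame(Category = paste0(label, sub(".", " and ", names(res), fixed = TRUE)),
             Value = unname(res))
}
# group_summary(dta, "SomeVal")                                   # sums, as above
# group_summary(dta, "SomeVal", fun = mean, label = "Mean for ")  # means instead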

Replace NA with values from previous date

I have a dated data frame like this one, with approximately 1 million rows:
id date variable
1 1 2015-01-01 NA
2 1 2015-01-02 -1.1874087
3 1 2015-01-03 -0.5936396
4 1 2015-01-04 -0.6131957
5 1 2015-01-05 1.0291688
6 1 2015-01-06 -1.5810152
Reproducible example is here:
#create example data set
Df <- data.frame(id = factor(rep(1:3, each = 10)),
date = rep(seq.Date(from = as.Date('2015-01-01'),
to = as.Date('2015-01-10'), by = 1),3),
variable = rnorm(30))
Df$variable[c(1,7,12,18,22,23,29)] <- NA
What I want to do is replace NA values in variable with the value from the previous date, for each id. I created a loop which works but is very slow (you can find it below). Can you please advise a fast alternative for this task? Thank you!
library(dplyr)
#create new variable
Df$variableNew <- Df$variable
#create row numbers vector
Df$n <- 1:dim(Df)[1]
#order data frame by date
Df <- arrange(Df, date)
for (id in levels(Df$id)){
  I <- Df$n[Df$id == id] # create vector of rows for this specific id
  for (row in 1:length(I)){
    if (is.na(Df$variableNew[I[1]])) { # if variable == NA for the first date, change it to the mean value
      Df$variableNew[I[row]] <- mean(Df$variable, na.rm = T)
    }
    if (is.na(Df$variableNew[I[row]])){ # if variable == NA, assign the value from the previous date
      Df$variableNew[I[row]] <- Df$variableNew[I[row-1]]
    }
  }
}
This data.table solution should be extremely fast.
library(zoo) # for na.locf(...)
library(data.table)
setDT(Df)[,variable:=na.locf(variable, na.rm=FALSE),by=id]
Df[,variable:=if (is.na(variable[1])) c(mean(variable,na.rm=TRUE),variable[-1]) else variable,by=id]
Df
# id date variable
# 1: 1 2015-01-01 -0.288720759
# 2: 1 2015-01-02 -0.005344028
# 3: 1 2015-01-03 0.707310667
# 4: 1 2015-01-04 1.034107735
# 5: 1 2015-01-05 0.223480415
# 6: 1 2015-01-06 -0.878707613
# 7: 1 2015-01-07 -0.878707613
# 8: 1 2015-01-08 -2.000164945
# 9: 1 2015-01-09 -0.544790740
# 10: 1 2015-01-10 -0.255670709
# ...
So this replaces all embedded NAs using locf within each id, and then makes a second pass replacing any leading NA with the average of variable for that id. Note that if you do this in the reverse order you may get a different answer, because the mean would then be computed before the embedded NAs are filled in.
If you get the dev version of tidyr (0.3.0), available on GitHub, there is a function fill which will do exactly this:
#devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
Df %>% group_by(id) %>%
fill(variable)
It will not fill the first value; we can do that with a mutate and replace:
Df %>% group_by(id) %>%
mutate(variable = ifelse(is.na(variable) & row_number()==1,
replace(variable, 1, mean(variable, na.rm = TRUE)),
variable)) %>%
fill(variable)

Collapsing data frame by selecting one row per group

I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group.
For example, I'd like to convert this
> d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
> d
x y z
1 1 10 20
2 1 11 19
3 2 12 18
4 4 13 17
Into this:
x y z
1 1 11 19
2 2 12 18
3 4 13 17
I'm using aggregate to do this currently, but the performance is unacceptable with more data:
> d.ordered = d[order(-d$y),]
> aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})
I've tried split/unsplit with the same function argument as here, but unsplit complains about duplicate row numbers.
Is rle a possibility? Is there an R idiom to convert rle's length vector into the indices of the rows that start each run, which I can then use to pluck those rows out of the data frame?
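To answer the rle side question directly: the start index of each run is a cumulative sum over the length vector. A small sketch, assuming rows with equal x values are adjacent (sort first if they are not):
len <- rle(d$x)$lengths
starts <- cumsum(c(1, head(len, -1)))  # index of the first row in each run
d[starts, ]
# this keeps the first row of each run; order d (e.g. by x then -y) first to pick a different one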
Maybe duplicated() can help:
R> d[ !duplicated(d$x), ]
x y z
1 1 10 20
3 2 12 18
4 4 13 17
R>
Edit: Shucks, never mind. This picks the first in each block of repetitions, but you wanted the last. So here is another attempt using plyr:
R> library(plyr)
R> ddply(d, "x", function(z) tail(z,1))
x y z
1 1 11 19
2 2 12 18
3 4 13 17
R>
Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
Just to add a little to what Dirk provided... duplicated has a fromLast argument that you can use to select the last row:
d[ !duplicated(d$x,fromLast=TRUE), ]
Here is a data.table solution which will be time- and memory-efficient for large data sets:
library(data.table)
DT <- as.data.table(d) # convert to data.table
setkey(DT, x) # set key to allow binary search using `J()`
DT[J(unique(x)), mult ='last'] # subset out the last row for each x
DT[J(unique(x)), mult ='first'] # if you wanted the first row for each x
There are a couple of options using dplyr:
library(dplyr)
df %>% distinct(x, .keep_all = TRUE)
df %>% group_by(x) %>% filter(row_number() == 1)
df %>% group_by(x) %>% slice(1)
You can use more than one column with both distinct() and group_by():
df %>% distinct(x, y, .keep_all = TRUE)
The group_by() and filter() approach can be useful if there is a date or some other sequential field and
you want to ensure the most recent observation is kept, and slice() is useful if you want to avoid ties:
df %>% group_by(x) %>% filter(date == max(date)) %>% slice(1)
