How to optimize iterating over a huge dataframe with non-unique rows - r

I understand that if R is not updating a variable in place within the confines of a for loop, then I've just written some horrendously slow and expensive code. Unfortunately, with a set of very tight deadlines and a strong background in C++/Java, it's my go-to behaviour until I can get my R hat on.
I have a function I need to improve. It takes a dataframe (as below), returns the unique patid values, and uses those to retrieve subsets of that dataframe for date modifications. A trimmed example is below (note: I just pulled this out of a completed run, so the dates have already been modified). The last R run I performed was over a dataframe of 27 million rows and took about four to five hours. The size of the dataframe will only get bigger.
patid eventdate
1 12/03/1998
1 12/03/1998
2 04/03/2007
3 15/11/1980
3 15/11/1980
3 01/02/1981
A trimmed example of the function:
rearrangeDates <- function(dataFrame) {
  # return a list of the unique patient ids
  uniquePatids <- getUniquePatidList(dataFrame) # this is only called once and is very fast
  out <- NULL
  for (i in 1:length(uniquePatids)) { # iterate over the list
    idf <- subset(dataFrame, patid == uniquePatids[[i]])
    idf$eventdate <- as.POSIXct(idf$eventdate, format = "%d/%m/%Y")
    idf <- idf[order(idf$eventdate, decreasing = FALSE), ]
    out <- rbind(out, idf) # grows the result on every iteration
  }
  return(out)
}
Can anyone suggest improvements?

Since you want to sort your data on patid & eventdate, this should work:
library(dplyr)
df %>%
  mutate(eventdate = as.Date(eventdate, format = "%d/%m/%Y")) %>%
  arrange(patid, eventdate)
Output is:
patid eventdate
1 1 1998-03-12
2 1 1998-03-12
3 2 2007-03-04
4 3 1980-11-15
5 3 1980-11-15
6 3 1981-02-01
Sample data:
df <- structure(list(patid = c(1L, 1L, 2L, 3L, 3L, 3L), eventdate = c("12/03/1998",
"12/03/1998", "04/03/2007", "15/11/1980", "15/11/1980", "01/02/1981"
)), class = "data.frame", row.names = c(NA, -6L))
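For comparison, the same sort in base R (a minimal sketch of the same idea, still no loop needed):
df$eventdate <- as.Date(df$eventdate, format = "%d/%m/%Y")
df[order(df$patid, df$eventdate), ]  # sort once over the whole dataframe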

This is ideally suited to data.table: your data has a well-defined key to group by (patid, eventdate); the output will be no larger than the input, so it's safe to do in-place assignment (way faster) instead of iterative appends; and data.table has a nice fast unique function. So please try out the (loop-free!) code below and let us know how it compares both to your original and to the dplyr approach:
require(data.table)
dt <- data.table(patid = c(1, 1, 2, 3, 3, 3),
                 eventdate = c('12/03/1998', '12/03/1998', '04/03/2007',
                               '15/11/1980', '15/11/1980', '01/02/1981'))
dt[, eventdate := as.POSIXct(eventdate, format = "%d/%m/%Y")]
# Setting the key sorts the table in place by (patid, eventdate),
# so no loop and no explicit sort step are needed afterwards
setkeyv(dt, c('patid', 'eventdate'))
odt <- dt
odt
patid eventdate
1: 1 1998-03-12
2: 1 1998-03-12
3: 2 2007-03-04
4: 3 1980-11-15
5: 3 1980-11-15
6: 3 1981-02-01
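If you also need to collapse duplicate rows, that's where data.table's fast unique() comes in; a minimal sketch on the keyed table above:
# keep one row per (patid, eventdate) pair
udt <- unique(dt, by = c('patid', 'eventdate'))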
(One last thing: don't be afraid of POSIXct/lt; convert to them early. They're more efficient than strings, and they support comparison operators, so the column can be used as a key, sorted on, and compared.)
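A quick illustration of why the converted column compares correctly while the raw strings don't:
d1 <- as.POSIXct('12/03/1998', format = '%d/%m/%Y')
d2 <- as.POSIXct('04/03/2007', format = '%d/%m/%Y')
d1 < d2                      # TRUE: chronological comparison
'12/03/1998' < '04/03/2007'  # FALSE: lexicographic string comparison misleads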
(And for the fastest dplyr implementation, use dplyr::distinct())
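For instance, a minimal sketch using the df from the dplyr answer above:
library(dplyr)
df %>% distinct(patid)   # unique patient ids, typically faster than unique()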

Related

data.table aggregation based on multiple criteria

I am trying to calculate, for each person (pid), how many coworkers within their set of firms (fid) have a yob smaller than that person's yob. The second question is about counting unique pid. I am updating the question based on @langtang's efforts and my own reflections:
#Libraries:
library(data.table)
library(tictoc)
#Make it replicable:
set.seed(1)
#Define parameters of the simulation:
pid<-1:1000
fid<-1:5
time_periods<-1:12
yob<-sample(seq(1900,2010),length(pid),replace = TRUE)
#Obtain in how many firms a given pid works in a given month:
nr_firms_pid_time<-sample(1:length(fid),length(pid),replace = TRUE)
#This means:
#First pid: works in first firm;
#Second pid: works in first four firms;
#Third pid: works in first firm;
#Fourth pid: works in two firms.
#Aux functions:
function_rep <- function(x) {
  rep(1:12, x)
}
function_seq <- function(x) {
  1:x
}
#Create panel
data_panel<-data.table(pid = rep(pid,nr_firms_pid_time*length(time_periods)))
data_panel[,yearmonth:=do.call(c,sapply(nr_firms_pid_time,function_rep))]
data_panel[,fid:=rep(do.call(c,sapply(nr_firms_pid_time,function_seq)),each = 12)]
#Merge in yob:
data_yob<-data.table(pid = pid,yob = yob)
data_panel<-merge(data_panel,data_yob,by = c("pid"),all.x = TRUE)
#Remove not needed stuff:
rm(pid)
rm(fid)
rm(time_periods)
rm(yob)
rm(data_yob)
#Solution 1 (terribly slow):
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers <- function(id, yrmonth) {
  # First obtain firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  # Then extract data at a given month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  # Then extract this person's yob:
  dob_to_use <- unique(data_func[pid == id, yob])
  sum(data_func[pid != id]$yob < dob_to_use)
}
older_coworkers_unique <- function(id, yrmonth) {
  # First obtain firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  # Then extract data at a given month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  # Then extract this person's yob:
  dob_to_use <- unique(data_func[pid == id, yob])
  # Count the UNIQUE older coworkers:
  sum(unique(data_func[pid != id], by = c("pid"))$yob < dob_to_use)
}
#Works but is terribly slow:
tic()
sol_1<-data_panel[, .(older_coworkers(.BY$pid,.BY$yearmonth)),by = c("pid","yearmonth")]
toc()
#Solution 2 (better, but I don't like it; what if I want unique older coworkers?)
function_older <- function(x) {
  noc <- lapply(
    seq_along(x),
    function(i) {
      sum(x[-i] < x[i])
    }
  )
  unlist(noc)
}
#This is fast but I cannot get unique number:
tic()
sol_2<-data_panel[,.(pid,function_older(yob)),by = c("fid","yearmonth")][,sum(V2),by = c("pid","yearmonth")][order(pid,yearmonth)]
toc()
#Everything works:
identical(sol_1,sol_2)
The question is how to implement older_coworkers_unique in a very fast manner. Any suggestions would be greatly appreciated.
Update, based on OP's new reproducible dataset
If you want a one-liner to reproduce sol_2 above, you can do this:
data_panel[data_panel, on = .(yearmonth, fid, yob < yob)][, .N, by = .(i.pid, yearmonth)]
Explanation:
The above uses a non-equi join, which can be a helpful approach with data.table. I am joining data_panel on itself, requiring that yearmonth and fid be equal, but that year of birth on the left side of the join be less than year of birth on the right side. This returns a data.table where firm and yearmonth match, and where every older coworker (pid) is matched to each of their younger coworkers (i.pid). We can thus count the rows (.N) for each younger coworker (i.pid) and yearmonth. This produces the same as sol_1 and sol_2 above. You commented that you would like to find the unique coworkers, and so the second approach below does that, by using length(unique(pid)), in Option 2.
The same non-equi join approach can be used to get unique older coworkers, like this:
library(magrittr) # for the %>% pipe
data_panel[data_panel, on = .(yearmonth, fid, yob < yob)] %>%
  .[, .(older_coworkers = length(unique(pid))), by = .(i.pid, yearmonth)]
Previous Response, based on OP's original very small example dataset
I'm not sure exactly what you want the output to look like. However, in your example data I first drop the duplicate row (I couldn't understand why it was there; see my comment above), and then apply a function that counts the number of older coworkers for each pid/fid/ym.
# make your example data unique
data <- unique(data)
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers <- function(birth, firm, yrmonth, id) {
  data[dob < birth & fid == firm & ym == yrmonth & pid != id, .N]
}
# apply the function to the data
data[, .(num_older_coworkers = older_coworkers(dob,.BY$fid, .BY$ym, .BY$pid)), by=.(pid,fid,ym)]
Output:
pid fid ym num_older_coworkers
1: 1 1 200801 1
2: 1 2 200802 0
3: 2 1 200801 0
4: 3 2 200801 0
Person 1 at Firm 1 has one older coworker in the month of 2008-01 -- that is, Person 2 at Firm 1 in 2008-01.
Person 1 at Firm 2 (born in 1950) would also have an older coworker, namely Person 3 at Firm 2 (born in 1930), but the result shows 0 because Person 1's ym at Firm 2 (2008-01) does not match that potential older coworker's ym (2008-02).

Filtering a data frame in R using a for loop

I am new to R and am running into difficulty with more advanced filtering. I have a data frame containing 1500 rows of people in households and need to filter out everyone who is part of a household where at least one person is older than 24. For example, in the sample set below I would only want to keep rows 3, 4, and 5.
PersonalID DOB HouseholdID
1 1961-04-15 123
2 2017-01-12 123
3 2000-01-02 122
4 2001-03-05 122
5 1996-08-22 122
Initially I just filtered to get a new data frame with everyone in that age range and then filtered the original data frame again (and again and again and so on...) with each HouseholdID of someone under 25 to check if anyone else with that HouseholdID is over 24.
Whenever I'm doing the same thing over and over it seems like there's probably a way to use a function instead but I'm having a hard time coming up with one that works. This is my current attempt but I know there's plenty wrong with it:
UNDER25df <- filter(df, DOB >= "yyyy-mm-dd")
for (UNDER25df$HouseholdID in df) {
  if (all(df$DOB >= "yyyy-mm-dd")) {
    view(filter(df, HouseholdID == "$HouseholdID"))
  }
}
The error I get is:
unexpected '}' in "}"
but I'm pretty sure that I can nest an if statement in a for loop in R and that I've been careful about the positioning of the brackets so I don't know exactly what it's referring to.
What I'm not sure of is whether I can iterate through a data frame in this way, or whether this even makes sense. I've read that vectorizing might be better in general for advanced filtering, but I tried to read the documentation on it and couldn't really see how to make the jump to this problem. Does anyone have a suggestion or a direction I should be looking in?
You do not need a loop for this. Try
library(lubridate)
library(dplyr)
set.seed(1)
df <- tibble(DOB = Sys.Date() - sample(3000:12000, 6),
             personalID = 1:6,
             HouseholdID = c(1, 1, 2, 2, 2, 3))
df$DOB
# grab HouseholdID of all persons that are at least 24
oldies <- df[(lubridate::today() - lubridate::ymd(df$DOB)) > years(24),
             "HouseholdID", TRUE]
# base R way (born before 1993-02-10, i.e. at least 24 at the time of writing)
oldies <- df[as.Date(df$DOB) < as.Date("1993-02-10"),
             "HouseholdID", TRUE]
# household members in a household with someone 24 or older
df %>%
  filter(HouseholdID %in% oldies)
# household members in a household with no one 24 or older
df %>%
  filter(!(HouseholdID %in% oldies))
I am not sure if you want to keep the rows, grouped by household ID, where all members are 24 years old or younger. If so, then maybe you can try the code below
library(lubridate)
dfout <- subset(df, ave(floor(time_length(Sys.Date()-as.Date(DOB),"years"))<=24, HouseholdID, FUN = all))
If you really want to use for loop to make it, then the below is an example
dfout <- data.frame()
for (id in unique(df$HouseholdID)) {
  subdf <- subset(df, HouseholdID == id)
  if (with(subdf, all(floor(time_length(Sys.Date() - as.Date(DOB), "years")) <= 24))) {
    dfout <- rbind(dfout, subdf)
  }
}
Both approaches above give you the result shown as
> dfout
PersonalID DOB HouseholdID
3 3 2000-01-02 122
4 4 2001-03-05 122
5 5 1996-08-22 122
DATA
df <- structure(list(PersonalID = 1:5, DOB = c("1961-04-15", "2017-01-12",
"2000-01-02", "2001-03-05", "1996-08-22"), HouseholdID = c(123L,
123L, 122L, 122L, 122L)), class = "data.frame", row.names = c(NA,
-5L))
I am not sure if you want to select households where all the people are above 24, or where at least one person is above 24. In any case, you can use subset with ave
subset(df, ave(as.integer(format(Sys.Date(), "%Y")) -
               as.integer(format(as.Date(DOB), "%Y")) >= 24,
               HouseholdID, FUN = any))
This selects households where at least one person is above 24. If you want to select households where all people are above 24, use all instead of any in the FUN argument, as shown below.
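For completeness, a sketch of that all variant on the same DATA:
# households where every member is above 24
subset(df, ave(as.integer(format(Sys.Date(), "%Y")) -
               as.integer(format(as.Date(DOB), "%Y")) >= 24,
               HouseholdID, FUN = all))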
Similarly, using dplyr, we can use
library(dplyr)
df %>%
  group_by(HouseholdID) %>%
  filter(any(as.integer(format(Sys.Date(), "%Y")) -
             as.integer(format(as.Date(DOB), "%Y")) >= 24))

Totalize component products from table

I need my warehouse to be able to know how many items of each component we need per day. Basically, I have bundled items made of single products, and I want the warehouse to know how many of those single items they should provide in any given date.
I currently have data like this:
date bundle_name totbund prod1 totprod1 prod2 totprod2
06/01/2019 a_bund 1 a 1 b 1
06/01/2019 a 1
06/01/2019 b 2
07/01/2019 b_bund 1 b 2
07/01/2019 b_bund 2 b 4
07/01/2019 b 2
My expected output is this:
date all_item total
06/01/2019 a 2
06/01/2019 b 3
07/01/2019 b 8
Please notice that the bundle_name column can contain either a bundled item or a single item, so it is mixed.
Something like this could work (I use 'a' as an example; note it is the quantity columns, not the name columns, that get summed):
dat <- dat %>%
  group_by(date) %>%
  summarize(a_bund = sum(totbund[bundle_name == 'a'], na.rm = TRUE),
            a_prod1 = sum(totprod1[prod1 == 'a'], na.rm = TRUE),
            a_prod2 = sum(totprod2[prod2 == 'a'], na.rm = TRUE)) %>%
  mutate(a = a_bund + a_prod1 + a_prod2)
I wouldn't use the bundled notation; it sounds overly complicated. If you have everything in row format, you can use the group_by/summarize functionality of dplyr.
Assuming the data is called 'df'
library(dplyr)
df <- df %>%
  select(date, prod = prod1, totprod = totprod1) %>%
  filter(prod != "") %>%
  bind_rows(df %>%
              select(date, prod = prod2, totprod = totprod2) %>%
              filter(prod != "")) %>%
  group_by(date, prod) %>%
  summarize(totprod = sum(totprod))
I commented that you needed a better approach to this problem.
I suggest you consider this from a structured-database perspective. In such a view, your data (and thus your world) is made of tables with differing and complementary information. When you need information to solve your problem, you join data from different tables. If you have used Excel, then you'll know this as VLOOKUP.
How I'd approach your problem:
Table of components:
First, I'll have a table of components. This would be a very simple table of 3 columns: name of the product, component from which it is made and amount of component needed.
For your example, I'll have
library(data.table)
components <- structure(list(name = c("a", "b", "a_bund", "a_bund", "b_bund"),
component = c("a", "b", "a", "b", "b"),
amount = c(1, 1, 1, 1, 2)),
row.names = c(NA, -5L),
class = c("data.table", "data.frame"))
Which will produce:
components
name component amount
1: a a 1
2: b b 1
3: a_bund a 1
4: a_bund b 1
5: b_bund b 2
Notice that the information contained here is just the same information that you have in your table in columns 4 to 7. (By the way, your table is in "wide" format, while mine is "long". Long is much better for machine processing, and it's considered "tidy".)
Table of Requests
Now that you have a table for the components, you'll need a table to put how many units of product x your clients need by y date. Do you notice that I separated the information content in both tables? There's one with components and nothing else; and there's one with requests and nothing else. Each thingy in its own basket!
This table I called requests and it's composed of three columns: dates with the date of the request, name with the name of the product requested by the client, and qty with the quantity the client expects of the product. That would be what you have in columns one to three in your data.
requests <- structure(list(dates = structure(c(17902, 17902, 17902, 17903, 17903, 17903), class = "Date"),
name = c("a_bund", "a", "b", "b_bund", "b_bund", "b"),
qty = c(1, 1, 2, 1, 2, 2)),
row.names = c(NA, -6L),
class = c("data.table", "data.frame"))
Which produces:
requests
dates name qty
1: 2019-01-06 a_bund 1
2: 2019-01-06 a 1
3: 2019-01-06 b 2
4: 2019-01-07 b_bund 1
5: 2019-01-07 b_bund 2
6: 2019-01-07 b 2
Joining the tables
With these two tables, you now need to know how many of each component you'll need on any given date. To solve this I'll use the data.table package; please see ?data.table for details.
requests[components, on = "name" ][, sum(qty*amount), by = .(dates, component)]
What is in there?
requests[components, on = "name"] joins the table requests with components by matching elements with the same name. In other words, it brings the component and amount (from components, of course) for each name in requests. Paste the command and see what the result is.
data.table syntax allows "chaining", or passing an intermediate result to a new operation. That's what happens with the ][ sequence: I joined the tables and am now feeding that result into a new operation.
That new operation is sum(qty * amount). It is multiplying (you weren't wrong initially) the number of requested units qty by the amount of each component needed to produce it, and sums it (aggregates it) by = .(dates, component), which seems pretty self-explanatory. (If you come from the excel world, just think about a pivot or dynamic table).
That produces your expected output:
requests[components, on = "name" ][, sum(qty*amount), by = .(dates, component)]
dates component V1
1: 2019-01-06 a 2
2: 2019-01-06 b 3
3: 2019-01-07 b 8
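If you want the aggregate column named total as in your expected output, a small tweak (a sketch) names it directly in j:
requests[components, on = "name"][, .(total = sum(qty * amount)), by = .(dates, component)]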
While the result is the same other answers already provided, I hope you see the difference in approaches, and the enhanced usability of this one. If not, just imagine that k_bundle is made of 19 different components ;)

Extracting the Package name from fully defined class names using R scripting

I have the following sort of data set (ds1) in my CSV file; it includes a class name and the corresponding number of faults. I intend to extract or filter the package name from the rows having a number of faults equal to 2, using an R script.
Class Faults
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Delete 1
org.apache.tools.ant.taskdefs.Deltree 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Echo 1
org.apache.tools.ant.Exec 2
org.apache.tools.ant.Exec 2
I have written the following code, but it does not produce the desired output:
dschanged<- subset(ds1, grep( "/^([^\\.]+)/", class) & Faults==2 )
Technically, I require a proper regular expression to pull the string before the last dot (.) and generate the following output:
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant 2
org.apache.tools.ant 2
grep (and grepl) are inappropriate for this: you aren't filtering based on textual content. You are (a) filtering based on Faults, and (b) changing the text in Class.
Your data:
ds1 <- structure(list(Class = c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Delete", "org.apache.tools.ant.taskdefs.Deltree", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Echo", "org.apache.tools.ant.Exec", "org.apache.tools.ant.Exec"),
Faults = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L)),
.Names = c("Class", "Faults"), class = "data.frame", row.names = c(NA, -12L))
Filter on Faults (you already had this). You only need one of these two commands; they both do the same thing. The major differences are in readability (personal preference) and performance (the second one, in this case, takes about 35% less time, though since both are measured in microseconds, it seems silly to compete).
ds2 <- subset(ds1, Faults == 2)
ds2 <- ds1[ds1$Faults == 2,]
Update Class to remove the last word (and dot):
ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class)
ds2
# Class Faults
# 1 org.apache.tools.ant.taskdefs 2
# 2 org.apache.tools.ant.taskdefs 2
# 4 org.apache.tools.ant.taskdefs 2
# 5 org.apache.tools.ant.taskdefs 2
# 6 org.apache.tools.ant.taskdefs 2
# 7 org.apache.tools.ant.taskdefs 2
# 8 org.apache.tools.ant.taskdefs 2
# 9 org.apache.tools.ant.taskdefs 2
# 11 org.apache.tools.ant 2
# 12 org.apache.tools.ant 2
Note: this can also be done with sub instead of gsub, but the latter is my first-reach since most of my uses deal with larger and repeating regexes. The major (only?) difference between the two is that:
'sub' and 'gsub' perform replacement of the first and all matches respectively
(from ?sub).
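A quick illustration of that difference:
sub("a", "-", "banana")   # "b-nana" : only the first match is replaced
gsub("a", "-", "banana")  # "b-n-n-" : every match is replaced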
I know of no tool that does both the filtering and changing in a single command (though perhaps data.table does, I don't know).
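For what it's worth, a hedged sketch of how data.table can do both in a single [ call (assuming ds1 as defined above):
library(data.table)
# filter on Faults and rewrite Class in one step
ds2 <- as.data.table(ds1)[Faults == 2, .(Class = gsub("\\.[^.]*$", "", Class), Faults)]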
Similar to @egnha's solution (that uses magrittr), here's one using dplyr, which many people find very easy to read and adapt (at the potential cost of performance):
library(dplyr)
ds2 <- ds1 %>%
  filter(Faults == 2) %>%
  mutate(Class = gsub("\\.[^.]*$", "", Class))
Since I mentioned performance, here's a comparison:
library(microbenchmark)
microbenchmark(
  indexing = { ds2 <- ds1[ds1$Faults == 2, ]; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
  subset   = { ds2 <- subset(ds1, Faults == 2); ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
  dplyr    = { ds1 %>% filter(Faults == 2) %>% mutate(Class = gsub("\\.[^.]*$", "", Class)) }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# indexing 71.841 87.7045 109.4496 104.2975 120.7075 269.493 100
# subset 102.473 115.6020 147.0108 139.1230 165.5620 287.726 100
# dplyr 1067.030 1156.3745 1323.1174 1225.4805 1351.2920 4270.308 100
For the record, dplyr used in this way is not often this speed-poor in comparison to other methods. It is not commonly faster, but it is not often an order-of-magnitude slower.
I don't think you are looking for filtering based on class name.
Just do it in 2 steps.
# Filter
dschanged <- ds1[ds1$Faults == 2, ]
# Extract package name (note the column is Class, capitalized, in ds1)
dschanged$Class <- sub('(.*)[.](.*)', '\\1', dschanged$Class)
You can also do this without any fancy regexes: split each Class string on the dots, then dot-paste all but the last substring.
library(magrittr) # Provides pipe operator `%>%`
dschanged <- subset(ds1, Faults == 2)
dschanged$Class <- dschanged$Class %>%
  strsplit(split = "[.]") %>%
  sapply(function(x) head(x, -1L) %>% paste(collapse = "."))
Note that strings without dots will be transformed to empty strings. It is also quite a bit slower than the solution suggested by @r2evans.

Create new index / re-index in dplyr [duplicate]

This question already has answers here:
How to number/label data-table by group-number from group_by?
(6 answers)
Closed 6 years ago.
I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation that throws out a bunch of data in some preliminary steps.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious if it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments I have some fake data. I purposefully made the group and date values not be 1 to whatever, but I didn't make the date an actual date. I made the data unbalanced, removing some values to illustrate the issue. The dplyr command re-starts the index at 1 for each new group, regardless of what date_val it is. So every group starts at 1, even if the date is different.
df1 <- data.frame(id = 1:40,
                  group_id = (10 + rep(1:10, each = 4)),
                  date_val = (20 + rep(rep(1:4), 10)),
                  val = runif(40))
for (i in c(5, 17, 33)) {
  df1 <- df1[!df1$id == i, ]
}
df_new <- df1 %>%
  group_by(group_id) %>%
  arrange(date_val) %>%
  mutate(date_val_new = row_number(group_id)) %>%
  ungroup()
This is the base R method (match() does the real work; the pipe and mutate() are just for convenience):
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
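A quick sanity check (a sketch): both assign indices in order of first appearance, so they agree:
library(data.table)
dt1 <- as.data.table(df1)
dt1[, date_val_new := .GRP, by = date_val]  # group counter, in order of appearance
all(dt1$date_val_new == match(df1$date_val, unique(df1$date_val)))  # TRUE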
Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank()
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql()
tbl(my_db, sql("SELECT *,
DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
FROM df1"))
Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
# id group_id date_val val .id
# 1: 1 11 21 0.01857242 1
# 2: 2 11 22 0.57124557 2
# 3: 3 11 23 0.54318903 3
# 4: 4 11 24 0.59555088 4
# 5: 6 12 22 0.63045007 1
# 6: 7 12 23 0.74571297 2
# 7: 8 12 24 0.88215668 3
