Recycling error while using stringdist and data.table in R

I am trying to perform approximate string matching on a data.table of author names against a dictionary of first names. I have also set a high threshold (above 0.9) to improve the quality of the matching.
However, I get the warning message given below:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
This warning occurs even if I round the similarity score to 4 significant digits using signif(similarity_score, 4).
Some more information about the input data and approach:
The author_corrected_df is a data.table containing the columns "Author" and "Author_Corrected". Author_Corrected is an alphabetic representation of the corresponding Author (e.g. if Author = Jack123, then Author_Corrected = Jack).
The Author_Corrected column can contain variations of a proper first name, e.g. Jackk instead of Jack, and I would like to populate the corresponding gender in a column of author_corrected_df called Gender_Dict.
Another data.table called first_names_dict contains 'name' (i.e. first name) and 'gender' (0 for female, 1 for male, 2 for ties).
I would like to find the most relevant match for the "Author_Corrected" value in each row with respect to the 'name' in first_names_dict and populate the corresponding gender (one of 0, 1, 2).
To make the string matching more stringent, I use a threshold of 0.9720; later in the code (not shown below), the non-matched values are set to NA.
The first_names_dict and the author_corrected_df can be accessed from the link below:
https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7
for (ijk in 1:nrow(author_corrected_df)) {
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                            method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1, 4) >= 0.9720) {
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                                    method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
  } else {
    next
  }
}
Would appreciate help in knowing where the error lies, and whether there is a faster way to perform this sort of matching (though the latter is a second priority).
Thanks in advance.

Following the previous comments, here I select the gender most frequent in the selection:
for (ijk in 1:nrow(author_corrected_df)) {
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                            method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1, 4) >= 0.9720) {
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name,
                                    method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    # Analysis of the gender factor
    gender <- as.character(first_names_dict$gender[row_idx1])
    # Take the (first) gender most frequent in the selection
    df_count <- as.data.frame(table(gender))
    ref <- as.character(df_count$gender[which.max(df_count$Freq)])
    value <- unique(gender[which(gender == ref)])
    # Assign the single character value to the data frame
    author_corrected_df$Gender_Dict[ijk] <- value
  }
}
Hope this helps :)
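Two hedged notes, since I cannot run the linked data: the "Supplied 6 items ... recycled" warning is consistent with Gender_Dict not existing before the loop, so that the first successful assignment (apparently at row 6) creates a short vector which data.table then recycles across all 17789 rows; creating the column up front (e.g. author_corrected_df$Gender_Dict <- NA) should avoid that. For speed, the loop can also be replaced by stringdist::amatch(), since a Jaro-Winkler similarity of at least 0.9720 corresponds to a distance of at most 1 - 0.9720 = 0.028. A minimal sketch, assuming the same two tables as in the question:
library(data.table)
library(stringdist)
# Nearest dictionary name per Author_Corrected, or NA when nothing is
# within the distance threshold (0.028, i.e. similarity >= 0.9720).
idx <- amatch(author_corrected_df$Author_Corrected,
              first_names_dict$name,
              method = "jw", p = 0.1,
              maxDist = 1 - 0.9720,
              nthread = getOption("sd_num_thread"))
# Rows without a sufficiently close match get NA automatically, because idx is NA there.
author_corrected_df$Gender_Dict <- first_names_dict$gender[idx]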

Related

How to get in a specific order the results of an r lapply function with arguments from a dataframe

Following a previous question I asked, I got an awesome answer.
Here is a quick summary:
I want to compute a multidimensional development index based on South African data for several years. My list is composed of individual information for each year, so basically df1 is about year 1 and df2 about year 2.
df1 <- data.frame(var1 = c(1, 1, 1), var2 = c(0, 0, 1), var3 = c(1, 1, 0))
df2 <- data.frame(var1 = c(1, 0, 1), var2 = c(1, 0, 1), var3 = c(0, 1, 0))
mylist <- list(df1, df2)
var1 could be each person's stance on religion, var2 how she voted in the last national election, etc. In my very simple case, I have the data for 3 different persons each year.
From there, I compute an index based on a number of the variables (not all of them).
You can find here a very simplified working index function, using only 2 of the 3 variables, named dimX and dimY:
myindex <- function(x, dimX, dimY){
  econ_i <- x[dimX] + x[dimY]
  return( (1/length(econ_i)) * sum(econ_i) )
}
myindex(df1, "var2", "var3")
and
myindex2 = function(x, d) {
  myindex(x, d[1], d[2])
}
Then I have my dataframe of variables I want to use for my index. I am trying to compute the index for several sets of variables.
args <- data.frame(set1=c("var1", "var2"), set2=c("var2", "var3"), stringsAsFactors = F)
I'd like to have the result as follows: (a) list(set1 = list(df1, df2), set2 = list(df1, df2)) instead of (b) list(df1 = list(set1, set2), df2 = list(set1, set2)).
Case (a) represents a time series, meaning I have a list of results of my indexes each year for only one set of variables. Case (b) is the opposite where I have the index results of one year for every set of variables. Each individual result should be a unique numeric value. Hence, I am expecting to get a list of 2 sublists df1 and df2, each sublist containing 3 numeric values.
I've been advised to use this great command:
lapply(mylist, function(m) lapply(args, myindex2, x = m))
It works great, but I get the result in the "wrong" format, namely the second one (b) shown above.
How can I get the results ordered per set (i.e. case (a), as a time series) instead of per year?
Thanks a lot for your help!
PJ
EDIT: I've managed to find a solution that doesn't answer the question, but still lets me get my data in the desired order.
Namely, I'm transforming my list of lists into a matrix that I simply transpose.
This answer will be edited!
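For reference, a minimal sketch of that workaround, assuming the simplified mylist, args and myindex2 above (the real data would just have more years and sets):
# Compute everything grouped by year first (the suggested command), then
# flatten each year's results and bind them into a matrix; t() flips the orientation.
by_year <- lapply(mylist, function(m) lapply(args, myindex2, x = m))
mat <- t(sapply(by_year, unlist))  # rows = years (df1, df2), columns = sets
t(mat)                             # rows = sets, columns = years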
Currently, your function myindex() does this:
myindex <- function(x, dimX, dimY){
  econ_i <- x[dimX] + x[dimY]
  return( (1/length(econ_i)) * sum(econ_i) )
}
Aren't you after this, however?
myindex <- function(x, dimX, dimY){
  econ_i <- x[, dimX] + x[, dimY]
  return( (1/length(econ_i)) * sum(econ_i) )
}
The way you have it right now, length(econ_i) always returns 1 because econ_i is a one-column data.frame() and not a vector. The length() of a data.frame is its number of columns (here 1), while the length of a vector is the number of elements within it.
Kindly note what the output looks like in R:
df1["var1"]
var1
1 1
2 1
3 1
returns a data.frame()
df1[,"var1"]
[1] 1 1 1
returns a vector.
I will adjust this post to answer your question when you respond. I think it's important to solve this part first.
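In the meantime, a sketch of the re-ordering itself under the same simplified setup (mylist, args and myindex2 as in the question): swapping which object the outer lapply runs over groups the results by set first, and the existing by-year result can also be re-nested without recomputing anything.
# Grouped by set first (case (a)): outer loop over the sets of variables, inner loop over the years.
by_set <- lapply(args, function(d) lapply(mylist, myindex2, d = d))
# Or re-nest the by-year result produced by the original command.
by_year <- lapply(mylist, function(m) lapply(args, myindex2, x = m))
by_set2 <- lapply(setNames(names(args), names(args)),
                  function(s) lapply(by_year, `[[`, s))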
In case it provides any help, here is my actual index function, from this article:
RCI_a_3det <- function(x, econ1, econ2, econ3, perso1, perso2, perso3, civic1, civic2, civic3){
  econ_i  <- (1/3) * (x[econ1] + x[econ2] + x[econ3])
  perso_i <- (1/3) * (x[perso1] + x[perso2] + x[perso3])
  civic_i <- (1/3) * (x[civic1] + x[civic2] + x[civic3])
  daf <- data.frame(econ_i, perso_i, civic_i)
  colnames(daf) <- c("econ_i", "perso_i", "civic_i")
  df1 <- subset(daf, daf$econ_i != 1 & daf$perso_i != 1 & daf$civic_i != 1)
  sum_xik <- df1$econ_i + df1$perso_i + df1$civic_i
  return( 1/(3*nrow(df1)) * sum(sum_xik, na.rm = TRUE) )
}
Edit:
x is a list of all personal information, for every variable and for every year. It is pretty large.
I am using 9 variables to compute this index, but I actually have 30 such variables in my data, so I have set up a dataframe of sets of variables I could use to compute this index. This is the equivalent of my args df in the simple example. I am actually using 200 such combinations.

Data Extraction in matrix form in R

I need to make a matrix by extracting certain information from a file called flowdata. I need to extract the meanValue of all rows where inputGroup == 5. As the column names, I need the corresponding information in the "name" column, and as row names I need the information written in the cell 4 rows above the "inputGroup == 5" row [i-4], in this case "Methanol, at plant".
Here is the link to the dropbox that has the data
https://www.dropbox.com/s/x2knuqq1odbt5zg/flowdata01.txt?dl=0
Here is the R code I have:
flowdata <- flowdata01
in.up = length(unique(flowdata$name)) # number of unit processes
in.p = length(unique(flowdata$.attrs[which(flowdata$metadata == 'name')])) # number of inputs/outputs
input.mat = matrix(0, in.p, in.up) # empty matrix
colnames(input.mat) = unique(flowdata$name) # up names
rownames(input.mat) = unique(flowdata$attrs[which(flowdata$metadata == 'name')]) # inputs/outputs names
for (i in 1:nrow(flowdata)){ # for every row in flowdata
  if (flowdata$metadata[i] == "inputGroup" && flowdata$attrs[i] == 5){ # if it is an inputGroup 5
    col.name = flowdata$name[i] # up name
    row.name = flowdata$attrs[i-4] # i/o name 4 cells above
    value = as.numeric(flowdata$attrs[i-5]) # value 5 cells above
    input.mat[row.name, col.name] = value
  }
}
input.mat = input.mat[-which(rowSums(input.mat) == 0), ] # if the row is empty, then the flow was an input or output of no interest
When I run the above R code, I get this error message:
Error in `[<-`(`*tmp*`, row.name, col.name, value = 6397) :
subscript out of bounds
This is how the matrix should look.
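No answer was posted here, but a hedged debugging sketch may help localise the "subscript out of bounds" error: it usually means row.name or col.name is not among the matrix's dimnames. Note also that the code above builds in.p from flowdata$.attrs but the rownames and lookups from flowdata$attrs; if only one of those columns actually exists in the file, that alone can produce this symptom. The check below (an assumption-laden sketch, not a tested fix) reports the offending name instead of failing:
for (i in 1:nrow(flowdata)) {
  if (flowdata$metadata[i] == "inputGroup" && flowdata$attrs[i] == 5) {
    col.name <- flowdata$name[i]
    row.name <- flowdata$attrs[i - 4]
    # Report which subscript is missing instead of erroring out
    if (!(row.name %in% rownames(input.mat)))
      stop(sprintf("row %d: '%s' is not among rownames(input.mat)", i, row.name))
    if (!(col.name %in% colnames(input.mat)))
      stop(sprintf("row %d: '%s' is not among colnames(input.mat)", i, col.name))
    input.mat[row.name, col.name] <- as.numeric(flowdata$attrs[i - 5])
  }
}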

Error in data table: Item has no length? - R

I have an R script that contains a function, which I received in an answer to this question: R: For loop nested in for loop.
The script has been working fine on the first part of my data set, but I am now trying to use it on another part, which, as far as I can tell, has the exact same format as the first, yet for some reason I get an error when trying to use the script. I cannot figure out what causes the error.
This is the script I am using:
require(data.table)
MappingTable_Calibrated = read.csv2(file.choose(), header = TRUE)
head(MappingTable_Calibrated)
# The data is sorted primarily by Scaffold number in ascending order, and secondarily by Cal_Startgen in ascending order.
MappingTable_Calibratedord = MappingTable_Calibrated[order(MappingTable_Calibrated$Scaffold, MappingTable_Calibrated$Cal_Startgen), ]
head(MappingTable_Calibratedord)
dt <- data.table(MappingTable_Calibratedord, key = "Scaffold,Cal_Startgen")
head(dt)
# The following function creates pairs of loci for each scaffold.
# The function is a modified version of a function retrieved from http://www.stackoverflow.com
fn = function(dtIn, id){
  # Creates the object dtHead containing as many lines as dtIn minus the last line
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  # The names of dtHead are appended with _a. paste0 is short for paste(x, sep = "")
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  # Creates the object dtTail containing as many lines as dtIn minus the first line
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  # The names of dtTail are appended with _b.
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  # dtHead and dtTail are combined. Scaffold is defined as id. The blank column "Pairwise_Distance" is added to the table.
  cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0)
}
# The function is run on the data. .SDcols defines the columns to be included in the output.
output = dt[, fn(.SD, Scaffold), by = Scaffold, .SDcols = c("Name", "Startpos", "Endpos", "Rev", "Startgen", "Endgen", "Cal_Startgen", "Cal_Endgen", "Length")]
output = as.data.frame(output[, with = FALSE])
But when trying to create "output" I get the following error:
Error in data.table(..., key = key(..1)) : Item 1 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 2 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.
dt looks like this:
Name Length Startpos Endpos Scaffold Startgen Endgen Rev Match Cal_Startgen Cal_Endgen
1: Locus_7173 144 0 144 34 101196 101340 1 1 101196 101340
2: Locus_133 110 0 110 34 223659 223776 1 1 223659 223776
3: Locus_2746 161 0 89 65 101415 101504 1 1 101415 101576
A full dput of "dt" can be found here: https://www.dropbox.com/sh/3j4i04s2rg6b63h/AADkWG3OcsutTiSsyTl8L2Vda?dl=0
Start by tracking down the data which cause the error:
fn = function(dtIn, id){
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  r <- tryCatch(cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0), error = function(e) NULL)
  if (is.null(r)) browser()
  r
}
Then you can see you are trying to cbind elements of different nrow/length:
Browse[1]> dtHead
Empty data.table (0 rows) of 9 cols: Name_a,Startpos_a,Endpos_a,Rev_a,Startgen_a,Endgen_a...
Browse[1]> dtTail
Empty data.table (0 rows) of 9 cols: Name_b,Startpos_b,Endpos_b,Rev_b,Startgen_b,Endgen_b...
Browse[1]> id
[1] 76
Browse[1]> 0
[1] 0
Which is not allowed.
I recommend adding an if (nrow(...)) check or something similar, and for the nrow == 0 cases returning columns such as id = integer(), Pairwise_Distance = numeric().
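A minimal sketch of such a guard, assuming the same fn as in the question (whether single-locus scaffolds should be dropped entirely or kept as zero-row groups is an assumption here):
fn = function(dtIn, id){
  # A scaffold with a single locus cannot form a pair, so skip the cbind()
  # that otherwise fails. Returning NULL makes data.table drop the group;
  # alternatively, a zero-row table with the expected column types could be
  # returned, as suggested above.
  if (nrow(dtIn) < 2) return(NULL)
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0)
}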

function to subtract each column from one specific column in r

I want to subtract each column from a column called df$Means in R. I want to do this as a function, but I'm not sure how to iterate through each of the columns: each iteration relies on one column being subtracted from df$Means, and then there is a load of downstream code that uses the output. I have simplified the code here, as this is the bit that's giving me trouble. So far I have:
CopyNumberLoop <- function(i) {
  df$ZScore <- (df[3:5] - df$Means)/(df$sd)
}
apply(df[3:50], 2, CopyNumberLoop)
but I'm not sure how to make sure that the operation is done on one column at a time. I don't think df[3:5] is correct?
I have been asked to produce a reproducible example so all the code I want is here:
df1 <- read.delim(file.choose(), header = TRUE)
# Take the control samples and average each row for three columns excluding the first two columns - add the per-row means to the data frame
df$Means <- rowMeans(df[, 30:32])
RowVar <- function(x) {rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)}
df$sd = sqrt(RowVar(df[, c(30:32)]))
# Get a Z score by dividing the test sample count at each locus by the average for the control samples and divide everything by the st dev for controls at each locus.
{
  df$ZScore <- (df[, 35] - df$Means)/(df$sd)
  ######################################### QUARTILE FILTER ###########################################################
  alpha = 1.5
  numberofControls = 3
  UL = median(df$ZScore, na.rm = TRUE) + alpha*IQR(df$ZScore, na.rm = TRUE)
  LL = median(df$ZScore, na.rm = TRUE) - alpha*IQR(df$ZScore, na.rm = TRUE)
  # Copy the Z score if the score is > or < a certain number, i.e. LL or UL.
  Zoutliers <- which(df$ZScore > UL | df$ZScore < LL)
  df$Zoutliers <- ifelse(df$ZScore > UL | df$ZScore < LL, 1, -1)
  tempout = ifelse(df$ZScore[Zoutliers] > UL, 1, -1)
  ######################################### Three neighbour Isolation filter ##############################################################################
  finalSeb = c()
  for(i in 2:(length(Zoutliers)-1)){
    j = Zoutliers[i]
    if(sum(ifelse((j-1) == Zoutliers, 1, 0)) > 0 & tempout[i] == tempout[i-1] & sum(ifelse((j+1) == Zoutliers, 1, 0)) > 0 & tempout[i] == tempout[i+1]){
      finalSeb = c(finalSeb, i)
    }
  }
  finalset_row_number = Zoutliers[finalSeb]
  #View(finalset_row_number)
  p_seq = rep(0, nrow(df))
  for(i in 1:length(finalset_row_number)){
    p_seq[(finalset_row_number[i]-1):(finalset_row_number[i]+1)] = median(df$ZScore[(finalset_row_number[i]-1):(finalset_row_number[i]+1)])
  }
  nrow(as.data.frame(finalset_row_number))
}
For each column between 3 and 50, I'd like to generate nrow(as.data.frame(finalset_row_number)) and keep it in another dataframe. Admittedly my code is a mess, because I don't know how to create the function that will allow me to apply this to each column.
Your code isn’t using the parameter i at all. In fact, i is the current column, so that’s what you should use:
result = apply(df[, 3 : 50], 2, function (col) col - df$Means)
Or you can subtract the means directly:
result = df[, 3 : 50] - df$Means
This will return a new matrix consisting of the columns 3–50 from df, subtracting df$Means from each in turn. Or, if you want to calculate Z scores as your code seems to do:
result = (df[, 3 : 50] - df$Means) / df$sd
It appeared that you wanted the Z-scores assigned back into the original dataframe as named columns. If you want to loop over columns, it would be just as economical to use lapply or sapply. The receiving function will accept each column in turn and match it to the first parameter. Any other arguments offered after the receiving function will get matched by name or position to any other symbol/names in the parameter list. You do not do any assignment to 'df' inside the function:
CopyNumberLoop <- function(col) {
  (col - df$Means)/df$sd
}
df[, paste0('ZScore', 3:50)] <-    # assignment done outside the loop
  lapply(df[3:50], CopyNumberLoop) # result is a list,
                                   # but the `[<-.data.frame` method will accept a list.
Using apply coerces to a matrix, which may have undesirable effects if a column is not numeric (say factor or date-time). It's better to get into the habit of using lapply when working on ranges of columns in dataframes.
If you want to assign the result of this operation to a new dataframe, then the lapply(.) result would need to be wrapped in as.data.frame and then column names could be assigned. The same would need to be done to a result from apply(.).
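To get the per-column count asked for in the question (nrow(as.data.frame(finalset_row_number)) for each of columns 3 to 50), one option is to wrap the whole downstream pipeline in a function of the column index and sapply over the columns. A minimal sketch, assuming the same df, Means and sd columns as above; count_outlier_runs() is a hypothetical stand-in for the quartile-filter / three-neighbour code from the question, ending in nrow(as.data.frame(finalset_row_number)):
run_one_column <- function(i) {
  z <- (df[, i] - df$Means) / df$sd   # Z score for this column
  count_outlier_runs(z)               # hypothetical helper returning a single count
}
counts <- sapply(3:50, run_one_column)
result <- data.frame(column = names(df)[3:50], n_outlier_runs = counts)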

Finding sequences in rows in R based on the rep function on a certain column

I'm trying to find a sequence of 0's in each row, where the required length of the sequence is given by the value in a certain column (as it would be used in the times argument of rep). Below is my best attempt so far, which throws an error. I tried using an apply loop but failed miserably, and I don't really want to use a for loop unless I have to, as my true dataset is about 800,000 rows. I have tried looking up solutions but can't seem to find anything, and I have spent a few hours on this with no luck. I have also attached the desired output.
library(data.table)
TEST_DF <- data.table(INDEX = c(1,2,3,4),
                      COL_1 = c(0,0,0,0),
                      COL_2 = c(0,0,2,5),
                      COL_3 = c(0,0,0,0),
                      COL_4 = c(0,2,0,1),
                      DAYS = c(4,4,2,2))
IN_FUN <- function(x, y)
{
  x <- rle(x)
  if( max(as.numeric(x$lengths[x$values == 0])) >= y )
  {
    "Y"
  }
  else
  {
    "N"
  }
}
TEST_DF$DEFINITION <- apply(TEST_DF[, c(2:5), with = FALSE], 1,
                            FUN = IN_FUN(TEST_DF[, c(2:5), with = FALSE], TEST_DF$DAYS))
DESIRED <- data.table(P_ID = c(1,2,3,4),
                      COL_1 = c(0,0,0,0),
                      COL_2 = c(0,0,2,5),
                      COL_3 = c(0,0,0,0),
                      COL_4 = c(0,2,0,1),
                      DAYS = c(4,4,2,2),
                      DEFINITION = c("Y","N","Y","N"),
                      INDEX = c(2,NA,4,NA))
For the first row I want to see if there are four 0's within COL_1 to COL_4, four 0's within row 2, and two 0's within rows 3 and 4. Basically, the number of consecutive 0's to look for is given by the value in the DAYS column. So since four 0's are within row 1, DEFINITION gets a value of "Y"; row 2 gets a value of "N" since there are only three 0's; row 3 should get a value of "Y" since there are two consecutive 0's; etc.
Also, if possible, if the DEFINITION column has a value of "Y", then it should return the column index of the first occurrence of the desired sequence, e.g. in row 1, since the first 0 of the four 0's we're looking for is in COL_1, we should get a value of 2 for the INDEX column, and row 2 gets NA since DEFINITION is "N", etc.
Feel free to make any edits to make it clearer for other users and let me know if you need better information.
Cheers in advance :)
EDIT:
Below is a slightly extended data table. Let me know if this is sufficient.
TEST_DF <- data.table(P_ID = c(1,2,3,4,5,6,7,8,10),
                      COL_1 = c(0,0,0,0,0,0,0,5,90),
                      COL_2 = c(0,0,0,0,0,0,3,78,6),
                      COL_3 = c(0,0,0,0,0,0,7,5,0),
                      COL_4 = c(0,0,0,0,0,5,0,2,0),
                      COL_5 = c(0,0,0,0,0,7,2,0,0),
                      COL_6 = c(0,0,0,0,0,9,0,0,5),
                      COL_7 = c(0,0,0,0,0,1,0,0,6),
                      COL_8 = c(0,0,0,0,0,0,0,1,8),
                      COL_9 = c(0,0,0,0,0,1,6,1,0),
                      COL_10 = c(0,0,0,0,0,0,7,1,0),
                      COL_11 = c(0,0,0,0,0,0,8,3,0),
                      COL_12 = c(0,0,0,0,0,0,9,6,7),
                      DAYS = c(10,8,12,4,5,4,3,4,7))
Here the DEFINITION column for the rows would be c(1,1,1,1,1,0,1,0,0), where 1 is "Y" and 0 is "N". Either encoding is ok.
For the INDEX column in the new edit, the values should be c(2,2,2,2,2,NA,7,NA,NA).
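A base-R sketch of the row-wise rle() approach the question attempts, assuming the extended TEST_DF above (for 800,000 rows the vectorised answers below will likely be faster, but this shows how to pass each row together with its DAYS value):
# For each row: run-length encode the COL_ values, find the first run of zeros
# at least DAYS long, and report where it starts (offset by 1 because COL_1
# sits in column 2 of TEST_DF).
cols <- grep("^COL_", names(TEST_DF), value = TRUE)
first_run_start <- function(vals, days) {
  r <- rle(vals == 0)
  ok <- which(r$values & r$lengths >= days)
  if (length(ok) == 0) return(NA_integer_)
  sum(r$lengths[seq_len(ok[1] - 1)]) + 1L  # start position within the COL_ columns
}
m <- as.matrix(TEST_DF[, cols, with = FALSE])
starts <- mapply(first_run_start, asplit(m, 1), TEST_DF$DAYS)
TEST_DF[, DEFINITION := as.integer(!is.na(starts))]
TEST_DF[, INDEX := starts + 1L]  # +1 to match the table's column numbering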
I was able to do this with some math trickery. I created a binary matrix where an element is 1 if it was originally 0, and 0 otherwise. Then, for each row, I set the nth element equal to (the (n-1)th element + the nth element) times the nth element. In this transformed matrix, the value of an element is equal to the number of consecutive prior elements which were 0 (including the element itself).
m <- as.matrix(TEST_DF[, 2:(ncol(TEST_DF)-1L)])
m[m==1] <- 2
m[m==0] <- 1
m[m!=1] <- 0
for(i in 2:ncol(m)){
  m[,i] = (m[,i-1] + m[,i]) * m[,i]
}
# note the use of with=FALSE -- this forces ncol to be evaluated
# outside of TEST_DF, leading the result to be used as a
# column number instead of just evaluating to a scalar
m <- as.matrix(cbind(m, Days = TEST_DF[, ncol(TEST_DF), with = FALSE]))
indx <- apply(m[, -ncol(m)] >= m[, ncol(m)], 1, function(x) match(TRUE, x))
TEST_DF$DEFINITION <- ifelse(is.na(indx), 0, 1)
TEST_DF$INDEX <- indx - TEST_DF$DAYS + 2
Note: I stole some stuff from this post
I think I understand this better now that the question has been edited. This has loops, so it might not be optimal speed-wise, but the set statement should help with that. It still has some of the speed-up that data.table provides.
# Combine all column values into one giant string
TEST_DF[, COL_STRING := paste(COL_1, COL_2, COL_3, COL_4, COL_5, COL_6, COL_7, COL_8, COL_9, COL_10, COL_11, COL_12, sep = ",")]
TEST_DF[, COL_STRING := paste0(COL_STRING, ",")]
# Using the DAYS variable, create the string to be searched for
for (i in 1:nrow(TEST_DF))
  set(TEST_DF, i = i, j = "FIND", value = paste(rep("0,", TEST_DF[i]$DAYS), sep = "", collapse = ""))
# Find where the pattern starts. A value of -1 means it does not exist
for (i in 1:nrow(TEST_DF))
  set(TEST_DF, i = i, j = "INDEX", value = regexpr(TEST_DF[i]$FIND, TEST_DF[i]$COL_STRING, fixed = TRUE)[1])
# Define DEFINITION
TEST_DF[, DEFINITION := 1*(INDEX != -1)]
# Convert the character position into a column count by counting commas
require(stringr)
for (i in 1:nrow(TEST_DF))
  set(TEST_DF, i = i, j = "INDEX", value = str_count(substr(TEST_DF[i]$COL_STRING, 1, TEST_DF[i]$INDEX), ","))
# Clean up variables
TEST_DF[, INDEX := INDEX + DEFINITION*2L]
TEST_DF[INDEX == 0L, INDEX := NA_integer_]
You might explore the IRanges package. I just defined the test dataset as a data.frame, since I am not familiar with data.table. I then expanded it to your dataset size of 800,000 rows:
TEST_DF <- TEST_DF[sample(nrow(TEST_DF), 800000, replace=TRUE),]
Then, we put IRanges to work:
library(IRanges)
m <- t(as.matrix(TEST_DF[,2:13]))
l <- relist(Rle(m), PartitioningByWidth(rep(nrow(m), ncol(m))))
r <- ranges(l)
validRuns <- width(r) >= TEST_DF$DAYS
TEST_DF$DEFINITION <- sum(validRuns) > 0
TEST_DF$INDEX <- drop(phead(start(r)[validRuns], 1)) + 1L
The first step simplifies the table to a matrix, so we can transpose and get things in the right layout for a light-weight partitioning (PartitioningByWidth) of the data into a type of list. The data are converted into a run-length encoding (Rle) along the way, which finds the runs of zeros in each row. We can extract the ranges representing the runs and then compute on them more efficiently than we might on the split Rle directly. We find the runs that meet or exceed the DAYS and record which groups (rows) have at least one such run. Finally, we find the start of the valid runs, take the first start for each group with phead, and drop so that those with no runs become NA.
For 800,000 rows, this takes about 4 seconds. If that's not fast enough, we can work on optimization.
