R: Matching rows by two columns - r

I am currently trying to figure out a vectorized way to match by two values in the same row. I have the following two simplified data frames:
# Dataframe 1: Displaying all my observations
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"))
colnames(df1) <- c("ID", "Number1", "Number2")
> df1
ID Number1 Number2
1 1 A B
2 2 B E
3 3 C D
4 4 D A
5 5 A C
6 6 B A
7 7 A D
8 8 C A
# Dataframe 2: Matrix of observations I am interested in
df2 <- matrix(c("A", "B",
"D", "A",
"C", "B",
"E", "D"),
ncol = 2,
byrow = TRUE)
> df2
[,1] [,2]
[1,] "A" "B"
[2,] "D" "A"
[3,] "C" "B"
[4,] "E" "D"
What I am trying to accomplish is to create a new column in df1 that states TRUE only if the exact combination is present in df2 (for example ID = 1 is equivalent to the first row in df2 because both of them consist of A and B). Additionally, if there is a shortcut, I would also like the status to be TRUE if the numbers are reversed, i.e. df1$Number1 matches df2[i,2] and df1$Number2 matches df2[i,1] (for example for ID = 7, the combination in df1 is A,D and in df2, the combination is D,A --> TRUE).
My desired output looks like this:
> df1
ID Number1 Number2 Status
1 1 A B TRUE
2 2 B E FALSE
3 3 C D FALSE
4 4 D A TRUE
5 5 A C FALSE
6 6 B A TRUE
7 7 A D TRUE
8 8 C A FALSE
All I have gotten so far is this:
for (i in 1:nrow(df1)) {
for (j in 1:nrow(df2)) {
Status <- ifelse(df1$Number1[i] %in% df2[j,1] &&
df1$Number2[i] %in% df2[j,2], TRUE, FALSE)
StatusComb[i,j] <- Status
}
df1$Status[i] <- ifelse(any(StatusComb[i,]) == TRUE, TRUE, FALSE)
}
It is really inefficient (you can clearly tell I am new to R) and does not look very nice either. I would appreciate any help!

One method would be to merge things together.
Adapting your data, to account for reversed labels, I'll reverse df2 on itself and rbind it:
df2 <- rbind.data.frame(df2, df2[,c(2,1)])
colnames(df2) <- c("Number1", "Number2")
df2$a <- TRUE
df2
# Number1 Number2 a
# 1 A B TRUE
# 2 D A TRUE
# 3 C B TRUE
# 4 E D TRUE
# 5 B A TRUE
# 6 A D TRUE
# 7 B C TRUE
# 8 D E TRUE
I added the column a so that it is carried through the merge and marks the rows that matched. From there:
df3 <- merge(df1, df2, all.x = TRUE)
df3$a <- !is.na(df3$a)
df3[ order(df3$ID), ]
# Number1 Number2 ID a
# 1 A B 1 TRUE
# 5 B E 2 FALSE
# 7 C D 3 FALSE
# 8 D A 4 TRUE
# 2 A C 5 FALSE
# 4 B A 6 TRUE
# 3 A D 7 TRUE
# 6 C A 8 FALSE
If you look at it before !is.na(df3$a), you'll see that the column is wholly TRUE and NA (the NA were not present in df2); if that is enough for you, then you can omit the middle step. The order step is just because row-order with merge is not assured (in fact I find it always inconveniently different). Since it was previously ordered by ID, I reverted to that, but it was entirely for aesthetics here to match your desired output.
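If keeping the original row order matters, a keyed lookup is one way to avoid merge altogether. The following is only a minimal sketch and not part of the answer above: it pastes both columns into a single key and tests membership with %in%. The names key1, key2, key2_rev and exact_only and the "|" separator are assumptions made for illustration; the separator must not occur in the data.

```r
df1 <- data.frame(ID = 1:8,
                  Number1 = c("A", "B", "C", "D", "A", "B", "A", "C"),
                  Number2 = c("B", "E", "D", "A", "C", "A", "D", "A"),
                  stringsAsFactors = FALSE)
df2 <- matrix(c("A", "B",
                "D", "A",
                "C", "B",
                "E", "D"),
              ncol = 2, byrow = TRUE)

# One key per row of df1, and one key per row of df2 in both orders
key1     <- paste(df1$Number1, df1$Number2, sep = "|")
key2     <- paste(df2[, 1], df2[, 2], sep = "|")
key2_rev <- paste(df2[, 2], df2[, 1], sep = "|")

# Exact combination only
exact_only <- key1 %in% key2

# Exact or reversed combination; row order of df1 is untouched
df1$Status <- key1 %in% c(key2, key2_rev)
df1
```

Because %in% returns one logical per row of df1, no re-sorting by ID is needed afterwards.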

You could define the combination variable that you want to search for in alphabetical order as below:
combination <- apply(df2, 1, function(x) {
paste(sort(x), collapse = '')
})
combination
[1] "AB" "AD" "BC" "DE"
And then mutate the Status field based on the concatenation of the Number field
library(dplyr)
df1 %>%
rowwise() %>%
mutate(S = paste(sort(c(Number1, Number2)), collapse = "")) %>%
mutate(Status = ifelse(S %in% combination, TRUE, FALSE))
Source: local data frame [8 x 5]
Groups: <by row>
# A tibble: 8 x 5
ID Number1 Number2 S Status
<dbl> <chr> <chr> <chr> <lgl>
1 1 A B AB TRUE
2 2 B E BE FALSE
3 3 C D CD FALSE
4 4 D A AD TRUE
5 5 A C AC FALSE
6 6 B A AB TRUE
7 7 A D AD TRUE
8 8 C A AC FALSE
Data:
I set stringsAsFactors = F in the dataframe
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"), stringsAsFactors = F)
colnames(df1) <- c("ID", "Number1", "Number2")
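The same sorted-key idea can probably be written without rowwise(), since with exactly two columns the sorted pair is just the element-wise minimum followed by the element-wise maximum. This is a sketch under that assumption, in base R; the helper name canon is hypothetical, and the columns are assumed to be character (not factor).

```r
df1 <- data.frame(ID = 1:8,
                  Number1 = c("A", "B", "C", "D", "A", "B", "A", "C"),
                  Number2 = c("B", "E", "D", "A", "C", "A", "D", "A"),
                  stringsAsFactors = FALSE)
df2 <- matrix(c("A", "B",
                "D", "A",
                "C", "B",
                "E", "D"),
              ncol = 2, byrow = TRUE)

# Hypothetical helper: order-insensitive key, smaller label first.
# pmin() and pmax() work element-wise on character vectors.
canon <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "")

combination <- canon(df2[, 1], df2[, 2])
df1$Status  <- canon(df1$Number1, df1$Number2) %in% combination
df1
```

This only covers the two-column case; with more columns the apply/sort version above is the more general one.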

Related

Remove all records that have duplicates based on more than one variables

I have data like this
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"), var2 = c(1, 2, 3, 4, 5, 5, 6 ))
# var1 var2
# 1 A 1
# 2 A 2
# 3 B 3
# 4 B 4
# 5 C 5
# 6 D 5
# 7 E 6
A is mapped to 1, 2
B is mapped to 3, 4
C and D are both mapped to 5 (and vice versa: 5 is mapped to C and D)
E is uniquely mapped to 6 and 6 is uniquely mapped to E
I would like to filter the dataset so that only
var1 var2
7 E 6
is returned. Base or tidyverse solutions are welcome.
I have tried
unique(df$var1, df$var2)
df[!duplicated(df),]
df %>% distinct(var1, var2)
but without the wanted result.
Using igraph::components.
Represent data as graph and get connected components:
library(igraph)
g = graph_from_data_frame(df)
cmp = components(g)
Grab components where cluster size (csize) is 2. Output vertices as a two-column character matrix:
matrix(names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)]),
ncol = 2, dimnames = list(NULL, names(df))) # wrap in as.data.frame if desired
# var1 var2
# [1,] "E" "6"
Alternatively, use names of relevant vertices to index original data frame:
v = names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)])
df[df$var1 %in% v[1:(length(v)/2)], ]
# var1 var2
# 7 E 6
Visualize the connections:
plot(g)
Using a custom function to determine if the mapping is unique you could achieve your desired result like so:
df <- data.frame(
var1 = c("A", "A", "B", "B", "C", "D", "E"),
var2 = c(1, 2, 3, 4, 5, 5, 6)
)
is_unique <- function(x, y) ave(as.numeric(factor(x)), y, FUN = function(x) length(unique(x)) == 1)
df[is_unique(df$var2, df$var1) & is_unique(df$var1, df$var2), ]
#> var1 var2
#> 7 E 6
Another igraph option
decompose(graph_from_data_frame(df)) %>%
subset(sapply(., vcount) == 2) %>%
sapply(function(g) names(V(g)))
which gives
[,1]
[1,] "E"
[2,] "6"
A base R solution:
df[!(duplicated(df$var1) | duplicated(df$var1, fromLast = TRUE) |
duplicated(df$var2) | duplicated(df$var2, fromLast = TRUE)), ]
var1 var2
7 E 6
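The duplicated() condition above amounts to "the value occurs exactly once in its own column, for both columns". A minimal sketch of that reading, counting occurrences with ave(); the names n1, n2 and res are made up for illustration:

```r
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"),
                 var2 = c(1, 2, 3, 4, 5, 5, 6),
                 stringsAsFactors = FALSE)

# Number of rows sharing each row's var1 value, and each row's var2 value
n1 <- ave(seq_along(df$var1), df$var1, FUN = length)
n2 <- ave(seq_along(df$var2), df$var2, FUN = length)

# Keep rows whose var1 and var2 values each occur exactly once
res <- df[n1 == 1 & n2 == 1, ]
res
```

Note that an exact duplicate row (for example a second E, 6) would be dropped by this rule, whereas the graph-based answers would still treat E and 6 as a component of size 2.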

identified distinct arrangement of values by groups in data.frame

I have a large dataframe whose primary organization is a single column of groups that are all of identical length (3 in the toy example).
df <- data.frame(groups = c("gr1","gr1","gr1","gr2","gr2","gr2","gr3","gr3","gr3"),
no = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
colA = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
colB = c("a", "b", "c", "X_", "b", "c", "a", "b", "c"),
colC = c("a", "b", "c", "X_", "b", "c", "c", "b", "a"))
df
> df
> groups no colA colB colC
> 1 gr1 1 a a a
> 2 gr1 2 b b b
> 3 gr1 3 c c c
> 4 gr2 1 a X_ X_
> 5 gr2 2 b b b
> 6 gr2 3 c c c
> 7 gr3 1 a a c
> 8 gr3 2 b b b
> 9 gr3 3 c c a
I want to identify, for each column, which groups are the first occurrence of a distinct arrangement of values. So for colA it should return (T, F, F), since all three groups are identical and only group one is the first distinct one. For colB it should return (T, T, F), since there are two distinct arrangements and the third group is identical to the first. And for colC it should be (T, T, T), since the order of items matters.
So the final output could be a matrix like this
colA colB colC
> gr1 T T T
> gr2 F T T
> gr3 F F T
I think I could figure this out by breaking down the data frame into pairs of group and colA/B/C, identifying which ones are identical, storing the results in a vector, and then reassembling the whole deal. But I am seeing a ton of for-loops and have a hard time thinking about how to vectorize this. I have been using dplyr a bit, but I don't yet see how it can help.
Maybe there's a decent way to unstack each of the columns based on the groups and then run a comparison across the relevant subsets of new (and shorter) columns?
Edited to add:
Maybe group_by %>% summarize is a way to get at this. If the summary can essentially concatenate all values in a group per column into a really long string I could then see which of those is distinct per group?
Second edit:
I got as far as:
d1 <- df %>% group_by(groups) %>% summarise(colB = paste(unique(colB), collapse = ', ')) %>% distinct(colB)
which puts out
> # A tibble: 2 x 1
> colB
> <chr>
> 1 a, b, c
> 2 X_, b, c
It identifies the distinct groups, but I now have to figure out how to compare it against the full column to get T/F for each group.
Here's a base R approach :
cols <- grep('col', names(df))
cbind(unique(df[1]), sapply(df[cols], function(x)
!duplicated(by(x, df$groups, paste0, collapse = '-'))))
# groups colA colB colC
#1 gr1 TRUE TRUE TRUE
#4 gr2 FALSE TRUE TRUE
#7 gr3 FALSE FALSE TRUE
Your summarize idea is spot on:
df %>%
group_by(groups) %>%
summarize(across(starts_with("col"), paste, collapse = ""), .groups = "drop") %>%
mutate(across(starts_with("col"), ~!duplicated(.)))
# # A tibble: 3 x 4
# groups colA colB colC
# <chr> <lgl> <lgl> <lgl>
# 1 gr1 TRUE TRUE TRUE
# 2 gr2 FALSE TRUE TRUE
# 3 gr3 FALSE FALSE TRUE
With "data.table" you can try:
library(data.table)
cols <- c("colA", "colB", "colC")
fun <- function(x) !duplicated(x)
as.data.table(df)[, lapply(.SD, toString), groups, .SDcols = cols][
, (cols) := lapply(.SD, fun), .SDcols = cols][]
# groups colA colB colC
# 1: gr1 TRUE TRUE TRUE
# 2: gr2 FALSE TRUE TRUE
# 3: gr3 FALSE FALSE TRUE
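All three answers collapse each group into one string before calling duplicated(). With an empty or common separator, different value sets can in principle collapse to the same string (for example "ab","c" and "a","bc" with collapse = ""). A possible alternative, sketched here in base R, compares the per-group vectors directly, since duplicated() also accepts a list. The names g and first_unique are assumptions for illustration.

```r
df <- data.frame(groups = rep(c("gr1", "gr2", "gr3"), each = 3),
                 no   = rep(1:3, times = 3),
                 colA = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 colB = c("a", "b", "c", "X_", "b", "c", "a", "b", "c"),
                 colC = c("a", "b", "c", "X_", "b", "c", "c", "b", "a"),
                 stringsAsFactors = FALSE)

cols <- c("colA", "colB", "colC")

# Keep groups in order of first appearance rather than alphabetical order
g <- factor(df$groups, levels = unique(df$groups))

# split() gives one vector per group; duplicated() on that list flags
# groups whose vector already appeared in an earlier group
first_unique <- sapply(df[cols], function(x) !duplicated(split(x, g)))
rownames(first_unique) <- levels(g)
first_unique
```

The result is a logical matrix with one row per group and one column per col* variable.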

R - building new variables from sequenced data

This is an update / follow-up on this question. The answer outlined there doesn't meet the new requirements.
I am looking for an efficient way (data.table?) to construct two new measures for each ID.
Measure 1 and Measure 2 need to meet the following conditions:
Condition 1:
Find a sequence of three rows for which:
the first count > 0
the second count > 1 and
the third count ==1.
Condition 2 for Measure 1:
takes the value of the elements in product of the third row of the sequence that are:
in the product of second row of sequence and
NOT in the stock of the first row in sequence.
Condition 2 for measure 2:
takes the value of the elements in product of the last row of the sequence that are:
NOT in the product of second row of sequence
NOT in the stock of the first row in sequence.
Data:
df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df2
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 A,C,E A,B,C,E
5 1 5 1 A,B A,B,C,E
6 1 6 2 A,B,C A,B,C,E
7 1 7 3 D A,B,C,D,E
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
The desired output looks like this:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
How would you code this?
There are a few things you need to know to be able to do this:
shift function to compare values in your groups
separate_rows function to split your strings to get to the normalised data view.
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
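# shift within each ID so that a three-row window never spans two IDs
dt[, count.2 := shift(count, type = "lead"), by = ID]
dt[, count.3 := shift(count, n = 2, type = "lead"), by = ID]
dt[, product.2 := shift(product, type = "lead"), by = ID]
dt[, product.3 := shift(product, n = 2, type = "lead"), by = ID]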
dt <- dt[count > 0 & count.2 > 1 & count.3 == 1]
dt <- unique(dt, by = "ID")
library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")
dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[,
.(
measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE),
measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
),
ID
]
dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1: 1 C E
# 2: 2 <NA> <NA>
# 3: 3 D <NA>
I'm not sure what your criterion for efficiency is, but here's an approach using embed in a tidyverse style. It filters down at each step, so you are working with less and less data.
Loading up the data and packages (note that setdiff and intersect, used later on, are from dplyr)
library(purrr)
library(dplyr)
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B",
"A,B,C", "D", "A", "B", "A", "A",
"A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E",
"A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A",
"A,B,C", "A,B,C,D", "A,B,C,D"),
stringsAsFactors = FALSE)
Define a helper function to evaluate condition 1
meetsCond1 <- function(rseg) {
seg <- rev(rseg)
all(seg[1] > 0, seg[2] > 1, seg[3] == 1)
}
The embed function warps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.
cond1Match<- embed(df1$count, 3) %>%
apply(1, meetsCond1) %>%
which()
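To see why the helper reverses each row, it may help to look at what embed returns on a short vector. The sketch below uses the first five count values of ID 1; the names counts, w and starts are only for illustration.

```r
counts <- c(2, 1, 3, 1, 1)

# Each row of embed(x, 3) is a window of length 3, newest value first:
# row i holds counts[i + 2], counts[i + 1], counts[i]
w <- embed(counts, 3)
w

# Reversing a row puts the window back into its original order
rev(w[1, ])

# Same test as condition 1: first > 0, second > 1, third == 1
starts <- which(apply(w, 1, function(r) {
  s <- rev(r)
  s[1] > 0 && s[2] > 1 && s[3] == 1
}))
starts
```

Here only the window starting at the second element (counts 1, 3, 1) qualifies, which corresponds to seqs 2, 3, 4 of ID 1.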
You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.
finalProds <- df1$product[cond1Match + 2] %>%
strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>%
strsplit(",")
initialStock <- df1$stock[cond1Match] %>%
strsplit(",")
For both measures, neither of them can be in the stock.
notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])
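As a worked example of the set logic, here is the first match of ID 1 (seqs 2, 3, 4) done by hand with base R set functions. The variable names are made up for this illustration; the values come straight from the data above.

```r
init_stock <- strsplit("A,B", ",")[[1]]    # stock of the first row (seqs 2)
prev_prod  <- strsplit("C", ",")[[1]]      # product of the second row (seqs 3)
final_prod <- strsplit("A,C,E", ",")[[1]]  # product of the third row (seqs 4)

# Both measures exclude anything already in the initial stock
not_stock <- setdiff(final_prod, init_stock)   # "C" "E"

# Measure 1: also in the second row's product
measure1 <- intersect(not_stock, prev_prod)    # "C"

# Measure 2: not in the second row's product
measure2 <- setdiff(not_stock, prev_prod)      # "E"

paste(measure1, collapse = ",")
paste(measure2, collapse = ",")
```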
Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.
data.frame(ID = df1$ID[cond1Match],
seq1 = df1$seqs[cond1Match],
seq2 = df1$seqs[cond1Match + 1],
seq3 = df1$seqs[cond1Match + 2],
# collapse (not sep) is what joins several items into a single string
measure1 = imap_chr(notStock,
~intersect(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., collapse = ",")}
),
measure2 = imap_chr(notStock,
~setdiff(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., collapse = ",")}
),
stringsAsFactors = FALSE
) %>%
slice(match(unique(ID), ID))
which yields the desired output, which appears to allow at most one row per ID. In the original post, you specified that you wanted all matches reported. Removing the slice call would then instead yield
#> ID seq1 seq2 seq3 measure1 measure2
#> 1 1 2 3 4 C E
#> 2 1 6 7 1
#> 3 2 1 2 3
#> 4 2 3 1 2 C
#> 5 3 2 3 4 D
If you're looking to really squeeze out efficiency, you might gain a little by inlining the definitions of finalProds, prevProds, and initialStock where they are used instead of assigning them to variables first. Unless your set of matches is really large, I would expect the difference to be negligible.
A rolling window approach using data.table with base R code in j:
library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]
ans <- df2[,
transpose(lapply(1L:(.N-2L), function(k) {
if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
}
}), ignore.empty=TRUE),
ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans
output:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D

Ordering a dataframe by its subsegments

My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs
in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1 3 a in other words, url #1 appears three times each with a seg = "a",
2 2 b in other words: url #2 appears twice each with a seg = "b",
3 3 c in other words: url #3 appears three times with a seg = "c",
3 2 x two times with a seg = "x", and,
3 1 y once with a seg = "y"
4 1 d etc.
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
url.seg <- df[which(df$url == xx), ] # xx is already a url value, so compare against it directly; creates a dataframe for each of the unique urls and associated segs
freq.df.url <- data.frame(table(url.seg)) # summarize the frequency distribution of the segs by url
result <- rbind(result,freq.df.url) # append a new data.frame onto the last one
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq > 0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?
In base R you can do this :
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
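A small sketch of what that trick does, using the example data: calling `$<-` as an ordinary function returns a modified copy, so the original df keeps its two columns. The name with_freq is only for illustration.

```r
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# Functional form of df$freq <- 1; the result is a copy with the extra column
with_freq <- `$<-`(df, freq, 1)
names(with_freq)

# The source table is unchanged
names(df)

# Summing the column of ones per (seg, url) pair gives the frequencies
res <- aggregate(freq ~ seg + url, with_freq, sum)
res
```

data.frame(df, freq = 1), as in the commented alternative above, builds the same temporary table in a more conventional way.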
Another possibility:
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())
Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want
An option is to use table with as.data.frame to get the data into the format needed by the OP:
library(tidyverse)
table(df) %>% as.data.frame() %>%
filter(Freq > 0 ) %>%
arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>% group_by(url, seg) %>%
summarise(freq = n()) %>%
arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1

dplyr lag across groups

I am trying to do something like a lag, but across and not within groups. Sample data:
df <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
var = c("AB123","AC124", "AD125", "AE126",
"AF127", "AG128", "AF129",
"AG130","AH131",
"AHI132", "AJ133"))
The goal for every flag="B" is to create lagvar with the previous var value where flag="A".
This will show the desired output:
df1 <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
var = c("AB123","AC124", "AD125", "AE126",
"AF127", "AG128", "AF129",
"AG130","AH131",
"AHI132", "AJ133"),
lagvar = c("","AB123","","AD125","AD125","AD125","","AF129","AF129","","AHI132")
)
A dplyr solution is preferred, but I'm not picky!
EDIT: I found a solution using the zoo package, but I am interested in whether others have better ideas.
library(zoo)
df$lagvar <- ifelse(df$flag == "A", df$var, NA)
df <- df %>%
mutate(lagvar = na.locf(lagvar))
Here you go. I used NA instead of blanks, but you can adjust as needed:
df %>% mutate(lagvar = ifelse(flag == "A", as.character(var), NA),
lagvar = zoo::na.locf(lagvar),
lagvar = ifelse(flag == "A", NA, lagvar))
# flag var lagvar
# 1 A AB123 <NA>
# 2 B AC124 AB123
# 3 A AD125 <NA>
# 4 B AE126 AD125
# 5 B AF127 AD125
# 6 B AG128 AD125
# 7 A AF129 <NA>
# 8 B AG130 AF129
# 9 B AH131 AF129
# 10 A AHI132 <NA>
# 11 B AJ133 AHI132
My solution is a bit complicated. The idea is to find out the position of A each B should assign to and then join with a table, which only contains rows with flag A.
df %>%
mutate(pos=cumsum(flag == "A")) %>%
left_join(
df %>%
filter(flag == "A") %>%
mutate(pos=1:n()) %>%
select(pos, lagvar=var),
by="pos") %>%
mutate(lagvar=ifelse(flag == "A", "", as.character(lagvar)))
# flag var pos lagvar
# 1 A AB123 1
# 2 B AC124 1 AB123
# 3 A AD125 2
# 4 B AE126 2 AD125
# 5 B AF127 2 AD125
# 6 B AG128 2 AD125
# 7 A AF129 3
# 8 B AG130 3 AF129
# 9 B AH131 3 AF129
# 10 A AHI132 4
# 11 B AJ133 4 AHI132
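The same position idea can be sketched in base R without a join: cumsum(flag == "A") gives, for every row, the index of the most recent A, and that index can be used directly to look up the A values. The names is_a, pos, a_vals and prev_a are assumptions for illustration; the padding with NA handles a B that appears before any A.

```r
df <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
                 var = c("AB123", "AC124", "AD125", "AE126",
                         "AF127", "AG128", "AF129",
                         "AG130", "AH131",
                         "AHI132", "AJ133"),
                 stringsAsFactors = FALSE)

is_a   <- df$flag == "A"
pos    <- cumsum(is_a)      # index of the most recent A for every row
a_vals <- df$var[is_a]      # var values of the A rows, in order

# Shift by one so that pos == 0 (no A seen yet) maps to NA
prev_a <- c(NA, a_vals)[pos + 1]

# A rows get a blank, B rows get the var of the preceding A
df$lagvar <- ifelse(is_a, "", prev_a)
df
```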
