dplyr lag across groups

dplyr lag across groups - r

I am trying to do something like a lag, but across and not within groups. Sample data:
df <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
var = c("AB123","AC124", "AD125", "AE126",
"AF127", "AG128", "AF129",
"AG130","AH131",
"AHI132", "AJ133"))
)
The goal for every flag="B" is to create lagvar with the previous var value where flag="A".
This will show the desired output:
df1 <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
var = c("AB123","AC124", "AD125", "AE126",
"AF127", "AG128", "AF129",
"AG130","AH131",
"AHI132", "AJ133"),
lagvar = c("","AB123","","AD125","AD125","AD125","","AF129","AF129","","AHI132")
)
A dplyr solution is preferred, but I'm not picky!
EDIT: I found a solution using the zoo package but am interested if others have better ideas. df$lagvar <- ifelse(df$flag == "A", df$var, NA)
df <- df %>%
mutate(lagvar = na.locf(lagvar)

Here you go. I used NA instead of blanks, but you can adjust as needed:
df %>% mutate(lagvar = ifelse(flag == "A", as.character(var), NA),
lagvar = zoo::na.locf(lagvar),
lagvar = ifelse(flag == "A", NA, lagvar))
# flag var lagvar
# 1 A AB123 <NA>
# 2 B AC124 AB123
# 3 A AD125 <NA>
# 4 B AE126 AD125
# 5 B AF127 AD125
# 6 B AG128 AD125
# 7 A AF129 <NA>
# 8 B AG130 AF129
# 9 B AH131 AF129
# 10 A AHI132 <NA>
# 11 B AJ133 AHI132

My solution is a bit complicated. The idea is to find out the position of A each B should assign to and then join with a table, which only contains rows with flag A.
df %>%
mutate(pos=cumsum(flag == "A")) %>%
left_join(
df %>%
filter(flag == "A") %>%
mutate(pos=1:n()) %>%
select(pos, lagvar=var),
by="pos") %>%
mutate(lagvar=ifelse(flag == "A", "", as.character(lagvar)))
# flag var pos lagvar
# 1 A AB123 1
# 2 B AC124 1 AB123
# 3 A AD125 2
# 4 B AE126 2 AD125
# 5 B AF127 2 AD125
# 6 B AG128 2 AD125
# 7 A AF129 3
# 8 B AG130 3 AF129
# 9 B AH131 3 AF129
# 10 A AHI132 4
# 11 B AJ133 4 AHI132

Related

Efficiently Repeating Observations by Group

I am trying to find an efficient way to repeat rows by group in data.table for only some groups. Please consider the following example:
library(data.table)
DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
y = 1:8)
This dataset looks like:
head(DT)
x y
1: A 1
2: A 2
3: B 3
4: B 4
5: C 5
6: C 6
Say I have a separate vector rep <- c("A", "A", "A", "B", "B", "C"). Given this vector, I want to be able to repeat all rows of A three times (due to the cardinality of the "A" characters in rep) and all rows associated with B two times. Thus, the final dataset should like:
x y
1: A 1
2: A 2
3: A 1
4: A 2
5: A 1
6: A 2
7: B 3
8: B 4
9: B 3
10: B 4
11: C 5
12: C 6
Notice that I did not repeat "C" because the cardinality of "C" is only 1 in rep. I have a hackish way of doing this procedure at the moment, but I'm wondering if there was a more efficient data.table way of doing the above.
Thank you!
P.S. The reason I am doing this is because I am doing some matching with replacement in my regressions and sometimes, the same control firm is assigned to more than one treatment firm.

A data.table merge won't give you the same ordering but you aren't supposed to rely on ordering in datatables, anyway:
merge(DT, data.frame(x=rep), by="x")
x y
1: A 1
2: A 1
3: A 1
4: A 2
5: A 2
6: A 2
7: B 3
8: B 3
9: B 4
10: B 4
11: C 5
12: C 6

One solution is to gather up the counts and left join onto them:
library(data.table)
library(data.table)
DT <- data.table(x = c("A","A", "B", "B", "C","C", "D","D"),
y = 1:8)
rep_vec <- c("A", "A", "A", "B", "B", "C")
rep_DT <- DT %>%
left_join(data.frame(group = rep_vec), by = c("x" = "group"))
Are you sure duplicating rows in a dataframe is your ideal option though?

We can do
DT[ data.table(x = v1)[, .N, x], on = .(x)][rep(seq_len(.N), N)]
Or to return in the same order
DT[, .(y = list(y)), x][data.table(x = v1), on = .(x)][, .(x, y = unlist(y))]
data
v1 <- c("A", "A", "A", "B", "B", "C")

R - building new variables from sequenced data

This is an update / follow-up on this question. The answer outlined their doesn't meet the new requirements.
I am looking for an efficient way (data.table?) to construct two new measures for each ID.
Measure 1 and Measure 2 needs to meet the following conditions:
Condition 1:
Find a sequence of three rows for which:
the first count > 0
the second `count >1' and
the third count ==1.
Condition 2 for Measure 1:
takes the value of the elements in product of the third row of the sequence that are:
in the product of second row of sequence and
NOT in the stock of the first row in sequence.
Condition 2 for measure 2:
takes the value of the elements in product of the last row of the sequence that are:
NOT in the product of second row of sequence
NOT in the stock of the first row in sequence.
Data:
df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df2
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 A,C,E A,B,C,E
5 1 5 1 A,B A,B,C,E
6 1 6 2 A,B,C A,B,C,E
7 1 7 3 D A,B,C,D,E
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
The desired output looks like this:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
How would you code this?

Few things you need to know to be able to do this:
shift function to compare values in your groups
separate_rows function to split your strings to get to the normalised data view.
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
dt[, count.2 := shift(count, type = "lead")]
dt[, count.3 := shift(count, n = 2, type = "lead")]
dt[, product.2 := shift(product, type = "lead")]
dt[, product.3 := shift(product, n = 2, type = "lead")]
dt <- dt[count > 0 & count.2 > 1 & count.3 == 1]
dt <- unique(dt, by = "ID")
library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")
dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[,
.(
measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE),
measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
),
ID
]
dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1: 1 C E
# 2: 2 <NA> <NA>
# 3: 3 D <NA>

I'm not sure what the criteria for efficient is, but here's an approach using embed and tidyverse style. It filters down so you are working with less and less.
Loading up the data and packages (note later on setdiff and intersect are from dplry)
library(purrr)
library(dplyr)
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B",
"A,B,C", "D", "A", "B", "A", "A",
"A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E",
"A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A",
"A,B,C", "A,B,C,D", "A,B,C,D"),
stringsAsFactors = FALSE)
Define a helper function to evaluate condition 1
meetsCond1 <- function(rseg) {
seg <- rev(rseg)
all(seg[1] > 0, seg[2] > 1, seg[3] == 1)
}
The embed function warps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.
cond1Match<- embed(df1$count, 3) %>%
apply(1, meetsCond1) %>%
which()
You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.
finalProds <- df1$product[cond1Match + 2] %>%
strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>%
strsplit(",")
initialStock <- df1$stock[cond1Match] %>%
strsplit(",")
For both measures, neither of them can be in the stock.
notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])
Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.
data.frame(ID = df1$ID[cond1Match],
seq1 = df1$seqs[cond1Match],
seq2 = df1$seqs[cond1Match + 1],
seq3 = df1$seqs[cond1Match + 2],
measure1 = imap_chr(notStock,
~intersect(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., sep = ",")}
),
measure2 = imap_chr(notStock,
~setdiff(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., sep = ",")}
),
stringsAsFactors = FALSE
) %>%
slice(match(unique(ID), ID))
which yields the desired output, which seems to limit at most one line per ID. In the original post, you specify you want all reported. Removing the slice call would then instead yield
#> ID seq1 seq2 seq3 measure1 measure2
#> 1 1 2 3 4 C E
#> 2 1 6 7 1
#> 3 2 1 2 3
#> 4 2 3 1 2 C
#> 5 3 2 3 4 D
If you're looking to really squeeze efficiency, you might be able to gain some by placing the definitions of finalProds, prevProds, and initialStock instead of assigning them to variables first. I would imagine unless your set of matches is really large, it would be negligible.

A rolling window approach using data.table with base R code in j:
library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]
ans <- df2[,
transpose(lapply(1L:(.N-2L), function(k) {
if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
}
}), ignore.empty=TRUE),
ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans
output:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D

R: Matching rows by two columns

I am currently trying to figure out a vectorized way to match by two values in the same row. I have the following two simplified data frames:
# Dataframe 1: Displaying all my observations
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"))
colnames(df1) <- c("ID", "Number1", "Number2")
> df1
ID Number1 Number2
1 1 A B
2 2 B E
3 3 C D
4 4 D A
5 5 A C
6 6 B A
7 7 A D
8 8 C A
# Dataframe 2: Matrix of observations I am interested in
df2 <- matrix(c("A", "B",
"D", "A",
"C", "B",
"E", "D"),
ncol = 2,
byrow = TRUE)
> df2
[,1] [,2]
[1,] "A" "B"
[2,] "D" "A"
[3,] "C" "B"
[4,] "E" "D"
What I am trying to accomplish is to create a new column in df1 that states TRUE only if the exact combination is present in df2 (for example ID = 1 is equivalent to the first row in df2 because both of them consist of A and B). Additionally, if there is a shortcut, I would also like the status to be TRUE if the numbers are reversed, i.e. df1$Number1 matches df2[i,2] and df1$Number2 matches df2[i,1] (for example for ID = 7, the combination in df1 is A,D and in df2, the combination is D,A --> TRUE).
My desired output looks like this:
> df1
ID Number1 Number2 Status
1 1 A B TRUE
2 2 B E FALSE
3 3 C D FALSE
4 4 D A TRUE
5 5 A C FALSE
6 6 B A TRUE
7 7 A D TRUE
8 8 C A FALSE
All I have gotten so far is this:
for (i in 1:nrow(df1)) {
for (j in 1:nrow(df2)) {
Status <- ifelse(df1$Number1[i] %in% df2[j,1] &&
df1$Number2[i] %in% df2[j,2], TRUE, FALSE)
StatusComb[i,j] <- Status
}
df1$Status[i] <- ifelse(any(StatusComb[i,]) == TRUE, TRUE, FALSE)
}
It is really inefficient (you can clearly tell I am new to R) and does not look very nice either. I would appreciate any help!

One method would be to merge things together.
Adapting your data, to account for reversed labels, I'll reverse df2 on itself and rbind it:
df2 <- rbind.data.frame(df2, df2[,c(2,1)])
colnames(df2) <- c("Number1", "Number2")
df2$a <- TRUE
df2
# Number1 Number2 a
# 1 A B TRUE
# 2 D A TRUE
# 3 C B TRUE
# 4 E D TRUE
# 5 B A TRUE
# 6 A D TRUE
# 7 B C TRUE
# 8 D E TRUE
I added a so that it'll be merged in. From there:
df3 <- merge(df1, df2, all.x = TRUE)
df3$a <- !is.na(df3$a)
df3[ order(df3$ID), ]
# Number1 Number2 ID a
# 1 A B 1 TRUE
# 5 B E 2 FALSE
# 7 C D 3 FALSE
# 8 D A 4 TRUE
# 2 A C 5 FALSE
# 4 B A 6 TRUE
# 3 A D 7 TRUE
# 6 C A 8 FALSE
If you look at it before !is.na(df3$a), you'll see that the column is wholly TRUE and NA (the NA were not present in df2); if that is enough for you, then you can omit the middle step. The order step is just because row-order with merge is not assured (in fact I find it always inconveniently different). Since it was previously ordered by ID, I reverted to that, but it was entirely for aesthetics here to match your desired output.

You could define the combination variable that you want to search for in alphabetical order as below:
combination <- apply(df2, 1, function(x) {
paste(sort(x), collapse = '')
})
combination
[1] "AB" "AD" "BC" "DE"
And then mutate the Status field based on the concatenation of the Number field
library(dplyr)
df1 %>%
rowwise() %>%
mutate(S = paste(sort(c(Number1, Number2)), collapse = "")) %>%
mutate(Status = ifelse(S %in% combination, TRUE, FALSE))
Source: local data frame [8 x 5]
Groups: <by row>
# A tibble: 8 x 5
ID Number1 Number2 S Status
<dbl> <chr> <chr> <chr> <lgl>
1 1 A B AB TRUE
2 2 B E BE FALSE
3 3 C D CD FALSE
4 4 D A AD TRUE
5 5 A C AC FALSE
6 6 B A AB TRUE
7 7 A D AD TRUE
8 8 C A AC FALSE
Data:
I set stringsAsFactors = F in the dataframe
df1 <- data.frame(c(1, 2, 3, 4, 5, 6, 7, 8),
c("A", "B", "C", "D", "A", "B", "A", "C"),
c("B", "E", "D", "A", "C", "A", "D", "A"), stringsAsFactors = F)
colnames(df1) <- c("ID", "Number1", "Number2")

Count number of duplicates in other dataframe

I have two data.frames dfA and dfB. Both of them have a column called key.
Now I'd like to know how many duplicates for A$key there are in B$key.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
It should be A=2, B=3, C=0 and D=1. Whats the most easiest way to do this?

Use table
table(factor(B$key, levels = sort(unique(A$key))))
#A B C D
#2 3 0 1
factor is needed here such that we also 'count' entries that do not appear in B$key, that is C.

A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
library(dplyr)
library(tidyr)
B %>%
filter(key %in% A$key) %>% # keep values that appear in A
count(key) %>% # count values
complete(key = A$key, fill = list(n = 0)) # add any values from A that don't appear
# # A tibble: 4 x 2
# key n
# <chr> <dbl>
# 1 A 2
# 2 B 3
# 3 C 0
# 4 D 1

Using tidyverse you can do:
A %>%
left_join(B %>% #Merging df A with df B for which the count in "key" was calculated
group_by(key) %>%
tally(), by = c("key" = "key")) %>%
mutate(n = ifelse(is.na(n), 0, n)) #Replacing NA with 0
key n
1 A 2
2 B 3
3 C 0
4 D 1

Actually you mean how many occurrences of each value of A$key you have in B$key?
You can obtain this by coding B$key as factor with the unique values of A$key as levels.
o <- table(factor(B$key, levels=unique(A$key)))
Yielding:
> o
A B C D
2 3 0 1
If you really want to count duplicates, do
dupes <- ifelse(o - 1 < 0, 0, o - 1)
Yielding:
> dupes
A B C D
1 2 0 0

Ordering a dataframe by its subsegments

My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs
in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1 3 a in other words, url #1 appears three times each with a seg = "a",
2 2 b in other words: url #2 appears twice each with a seg = "b",
3 3 c in other words: url #3 appears three times with a seg = "c",
3 2 x two times with a seg = "x", and,
3 1 y once with a seg = "y"
4 1 d etc.
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
url.seg <- df[which(df$url == unique.df.url[xx]), ] # create a dataframe for each of the unique urls and associated segs
freq.df.url <- data.frame(table(url.seg)) # summarize the frequency distribution of the segs by url
result <- rbind(result,freq.df.url) # append a new data.frame onto the last one
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq |0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?

In base R you can do this :
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
Another possibility:
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.

url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1

Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())

Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want

An option can be to use table and tidyr::gather to get data in format needed by OP:
library(tidyverse)
table(df) %>% as.data.frame() %>%
filter(Freq > 0 ) %>%
arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>% group_by(url, seg) %>%
summarise(freq = n()) %>%
arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

dplyr lag across groups - r

Related

Efficiently Repeating Observations by Group

R - building new variables from sequenced data

R: Matching rows by two columns

Count number of duplicates in other dataframe

Ordering a dataframe by its subsegments

Categories

Resources