determining frequency of multiple variables based on multiple factors in R

determining frequency of multiple variables based on multiple factors in R - r

Suppose I have a dataset like this:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test1 <- c(12,16, NA, 11, 15,NA, 0,12, 5)
test2 <- c(1,NA, 2, 2, 2,2, NA,NA, NA)
df <- data.frame(id,visit,test1,test2)
I want to know the number of data points per visit PER test so that the final output looks something like this:
visit test1 test2
A 3 3
B 3 1
C 1 1
I know I can use the aggregate function like this for 1 variable as mentioned on this older post :
aggregate(x = df$id[!is.na(df$test)], by = list(df$visit[!is.na(df$test)]), FUN = length)
but how would I go about doing this for multiple tests?

You can also use data.table which could be useful for a flexible number of columns:
cols <- names(df)[grepl("test",names(df))]
setDT(df)[,lapply(.SD, function(x) sum(!is.na(x))), by = visit, .SDcols = cols]
df
# visit test1 test2
#1: A 3 3
#2: B 3 1
#3: C 1 1

Using table and rowSums in base R:
cols <- 3:4
sapply(cols, function(i) rowSums(table(df$visit, df[,i]), na.rm = TRUE))
# [,1] [,2]
#A 3 3
#B 3 1
#C 1 1

Related

Estimating the percentage of common set elements over combined past periods in a panel

I have a time-series panel dataset that is structured in the following way: There are 2 funds that each own different stocks at each time period.
df <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1, 1, 2,2,2,2),
time_Q = c(1,1,1,2,2,2,2,3, 3, 1,1,2,2),
stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
)
> df
fund_id time_Q stock_id
1 1 1 A
2 1 1 B
3 1 1 C
4 1 2 A
5 1 2 C
6 1 2 D
7 1 2 E
8 1 3 D
9 1 3 E
10 2 1 A
11 2 1 B
12 2 2 B
13 2 2 C
For each fund, I would like to calculate the percentage of stocks held in that current time_Q that were also ever held in any of the previous one to 2 quarters. So basically for every fund and every time_Q, I would like to have 2 columns with past 1 time_Q, past 1-2 time_Q which show what percentage of stocks held in that time were also present in any of that past time_Qs.
Here is what the result should look like:
result <- data.frame(
fund_id = c(1,1,1,2,2),
time_Q = c(1,2,3,1,2),
past_1Q = c("NA",0.5,1,"NA",0.5),
past_2Q = c("NA",0.5,1,"NA",0.5)
)
> result
fund_id time_Q past_1Q past_1_2Q
1 1 1 NA NA
2 1 2 0.5 0.5
3 1 3 1 1
4 2 1 NA NA
5 2 2 0.5 0.5
I already asked a similar question here, but now I'm looking for common elements across any of the past lagged periods. I'm looking for a dplyr or data.table scalable solution where I can have around 12 past quarters and deal with multiple funds and stocks and time periods.
Thanks in advance!

my solution
# dummy data
df <- data.table(fund_id = c(1,1,1,1,1,1,1,1, 1, 2,2,2,2)
, time_Q = c(1,1,1,2,2,2,2,3, 3, 1,1,2,2)
, stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
); df
# lower case col names
names(df) <- tolower(names(df))
# unique grouping
x <- df[, .(dummy =.N), .(fund_id, time_q)][, dummy := NULL]
# initialise empty table
y <- NULL
# loop
for(i in 1:nrow(x))
{
# current quarter & before
z <- df[fund_id == x[i, fund_id]
& time_q %between% c( x[i, time_q] - 12, x[i, time_q])
]
# current quarter
a <- z[fund_id == x[i, fund_id]
& time_q == x[i, time_q]
, unique(stock_id)
]
# minus 1 to minus 12 quarter (lapply)
b <- lapply(1:12, \(j) z[fund_id == x[i, fund_id]
& time_q %between% c( x[i, time_q] - j, x[i, time_q] - 1)
, unique(stock_id)
]
)
# results
c <- data.table(fund_id = x[i, fund_id]
, current_q = x[i, time_q]
)
# no. of stocks in current quarter
d <- length(a)
# calculate % for the 12 periods
c[, paste0('past_1_to_', 1:12, '_q') := lapply(1:12, \(j) length(intersect(a,b[[j]])) / d) ]
# collect results
y <- rbind(y, c)
}
benchmark
x <- 1e3
df <- data.table(fund_id = rep(1:x, each = x/10)
, time_Q = rep(1:4, each = x/4)
, stock_id = sample(letters[1:26], size=20, replace=T)
)
took 20 seconds on the above df with 100k rows and 1,200 groups (fund_id, time_q)

Remove all records that have duplicates based on more than one variables

I have data like this
df <- data.frame(var1 = c("A", "A", "B", "B", "C", "D", "E"), var2 = c(1, 2, 3, 4, 5, 5, 6 ))
# var1 var2
# 1 A 1
# 2 A 2
# 3 B 3
# 4 B 4
# 5 C 5
# 6 D 5
# 7 E 6
A is mapped to 1, 2
B is mapped to 3, 4
C and D are both mapped to 5 (and vice versa: 5 is mapped to C and D)
E is uniquely mapped to 6 and 6 is uniquely mapped to E
I would like filter the dataset so that only
var1 var2
7 E 6
is returned. base or tidyverse solution are welcomed.
I have tried
unique(df$var1, df$var2)
df[!duplicated(df),]
df %>% distinct(var1, var2)
but without the wanted result.

Using igraph::components.
Represent data as graph and get connected components:
library(igraph)
g = graph_from_data_frame(df)
cmp = components(g)
Grab components where cluster size (csize) is 2. Output vertices as a two-column character matrix:
matrix(names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)]),
ncol = 2, dimnames = list(NULL, names(df))) # wrap in as.data.frame if desired
# var1 var2
# [1,] "E" "6"
Alternatively, use names of relevant vertices to index original data frame:
v = names(cmp$membership[cmp$membership %in% which(cmp$csize == 2)])
df[df$var1 %in% v[1:(length(v)/2)], ]
# var1 var2
# 7 E 6
Visualize the connections:
plot(g)

Using a custom function to determine if the mapping is unique you could achieve your desired result like so:
df <- data.frame(
var1 = c("A", "A", "B", "B", "C", "D", "E"),
var2 = c(1, 2, 3, 4, 5, 5, 6)
)
is_unique <- function(x, y) ave(as.numeric(factor(x)), y, FUN = function(x) length(unique(x)) == 1)
df[is_unique(df$var2, df$var1) & is_unique(df$var1, df$var2), ]
#> var1 var2
#> 7 E 6

Another igraph option
decompose(graph_from_data_frame(df)) %>%
subset(sapply(., vcount) == 2) %>%
sapply(function(g) names(V(g)))
which gives
[,1]
[1,] "E"
[2,] "6"

A base R solution:
df[!(duplicated(df$var1) | duplicated(df$var1, fromLast = TRUE) |
duplicated(df$var2) | duplicated(df$var2, fromLast = TRUE)), ]
var1 var2
7 E 6

R get column names for changed rows

I have two dataframes, old and new, in R. Is there a way to add a column (called changed) to the new dataframe that lists the column names (in this case, separated with a ";") where the values are different between the two dataframes? I am also trying to use this is a function where the column names that I am comparing are contained in other variables (x1, x2, x3). Ideally, I would only refer to x1, x2, x3 instead of the actual column names, but I can make due if this isn't possible. A tidy solution is preferable.
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"
#Output, adding a new column changed to new dataframe
var1 var2 changed
1 1 A NA
2 4 B var1
3 3 D var2
4 6 Z var1; var2

A tidyverse way -
library(dplyr)
library(tidyr)
cols <- names(new)
bind_cols(new, map2_df(old, new, `!=`) %>%
rowwise() %>%
transmute(changed = {
x <- c_across()
if(any(x)) paste0(cols[x], collapse = ';') else NA
}))
# var1 var2 changed
#1 1 A <NA>
#2 4 B var1
#3 3 D var2
#4 6 Z var1;var2
The same logic can be implemented in base R as well -
new$changed <- apply(mapply(`!=`, old, new), 1, function(x)
if(any(x)) paste0(cols[x], collapse = ';') else NA)

Here is a base R approach.
new$changed <- apply(old != new, 1L, \(r, nms) toString(nms[which(r)]), colnames(old))
Output
var1 var2 changed
1 1 A
2 4 B var1
3 3 D var2
4 6 Z var1, var2

R - building new variables from sequenced data

This is an update / follow-up on this question. The answer outlined their doesn't meet the new requirements.
I am looking for an efficient way (data.table?) to construct two new measures for each ID.
Measure 1 and Measure 2 needs to meet the following conditions:
Condition 1:
Find a sequence of three rows for which:
the first count > 0
the second `count >1' and
the third count ==1.
Condition 2 for Measure 1:
takes the value of the elements in product of the third row of the sequence that are:
in the product of second row of sequence and
NOT in the stock of the first row in sequence.
Condition 2 for measure 2:
takes the value of the elements in product of the last row of the sequence that are:
NOT in the product of second row of sequence
NOT in the stock of the first row in sequence.
Data:
df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df2
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 A,C,E A,B,C,E
5 1 5 1 A,B A,B,C,E
6 1 6 2 A,B,C A,B,C,E
7 1 7 3 D A,B,C,D,E
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
The desired output looks like this:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
How would you code this?

Few things you need to know to be able to do this:
shift function to compare values in your groups
separate_rows function to split your strings to get to the normalised data view.
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
dt[, count.2 := shift(count, type = "lead")]
dt[, count.3 := shift(count, n = 2, type = "lead")]
dt[, product.2 := shift(product, type = "lead")]
dt[, product.3 := shift(product, n = 2, type = "lead")]
dt <- dt[count > 0 & count.2 > 1 & count.3 == 1]
dt <- unique(dt, by = "ID")
library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")
dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[,
.(
measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE),
measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
),
ID
]
dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1: 1 C E
# 2: 2 <NA> <NA>
# 3: 3 D <NA>

I'm not sure what the criteria for efficient is, but here's an approach using embed and tidyverse style. It filters down so you are working with less and less.
Loading up the data and packages (note later on setdiff and intersect are from dplry)
library(purrr)
library(dplyr)
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B",
"A,B,C", "D", "A", "B", "A", "A",
"A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E",
"A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A",
"A,B,C", "A,B,C,D", "A,B,C,D"),
stringsAsFactors = FALSE)
Define a helper function to evaluate condition 1
meetsCond1 <- function(rseg) {
seg <- rev(rseg)
all(seg[1] > 0, seg[2] > 1, seg[3] == 1)
}
The embed function warps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.
cond1Match<- embed(df1$count, 3) %>%
apply(1, meetsCond1) %>%
which()
You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.
finalProds <- df1$product[cond1Match + 2] %>%
strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>%
strsplit(",")
initialStock <- df1$stock[cond1Match] %>%
strsplit(",")
For both measures, neither of them can be in the stock.
notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])
Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.
data.frame(ID = df1$ID[cond1Match],
seq1 = df1$seqs[cond1Match],
seq2 = df1$seqs[cond1Match + 1],
seq3 = df1$seqs[cond1Match + 2],
measure1 = imap_chr(notStock,
~intersect(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., sep = ",")}
),
measure2 = imap_chr(notStock,
~setdiff(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., sep = ",")}
),
stringsAsFactors = FALSE
) %>%
slice(match(unique(ID), ID))
which yields the desired output, which seems to limit at most one line per ID. In the original post, you specify you want all reported. Removing the slice call would then instead yield
#> ID seq1 seq2 seq3 measure1 measure2
#> 1 1 2 3 4 C E
#> 2 1 6 7 1
#> 3 2 1 2 3
#> 4 2 3 1 2 C
#> 5 3 2 3 4 D
If you're looking to really squeeze efficiency, you might be able to gain some by placing the definitions of finalProds, prevProds, and initialStock instead of assigning them to variables first. I would imagine unless your set of matches is really large, it would be negligible.

A rolling window approach using data.table with base R code in j:
library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]
ans <- df2[,
transpose(lapply(1L:(.N-2L), function(k) {
if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
}
}), ignore.empty=TRUE),
ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans
output:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D

Ordering a dataframe by its subsegments

My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs
in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1 3 a in other words, url #1 appears three times each with a seg = "a",
2 2 b in other words: url #2 appears twice each with a seg = "b",
3 3 c in other words: url #3 appears three times with a seg = "c",
3 2 x two times with a seg = "x", and,
3 1 y once with a seg = "y"
4 1 d etc.
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
url.seg <- df[which(df$url == unique.df.url[xx]), ] # create a dataframe for each of the unique urls and associated segs
freq.df.url <- data.frame(table(url.seg)) # summarize the frequency distribution of the segs by url
result <- rbind(result,freq.df.url) # append a new data.frame onto the last one
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq |0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?

In base R you can do this :
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
Another possibility:
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.

url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1

Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())

Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want

An option can be to use table and tidyr::gather to get data in format needed by OP:
library(tidyverse)
table(df) %>% as.data.frame() %>%
filter(Freq > 0 ) %>%
arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>% group_by(url, seg) %>%
summarise(freq = n()) %>%
arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

determining frequency of multiple variables based on multiple factors in R - r

You can also use data.table which could be useful for a flexible number of columns: cols <- names(df)[grepl("test",names(df))] setDT(df)[,lapply(.SD, function(x) sum(!is.na(x))), by = visit, .SDcols = cols] df # visit test1 test2 #1: A 3 3 #2: B 3 1 #3: C 1 1

Using table and rowSums in base R: cols <- 3:4 sapply(cols, function(i) rowSums(table(df$visit, df[,i]), na.rm = TRUE)) # [,1] [,2] #A 3 3 #B 3 1 #C 1 1

Related

Estimating the percentage of common set elements over combined past periods in a panel

Remove all records that have duplicates based on more than one variables

R get column names for changed rows

R - building new variables from sequenced data

Ordering a dataframe by its subsegments

Categories

Resources