I have a dataset with names, dates, and several categorical columns. Let's say
data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
                   period = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
                   category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B"))
Which looks like this:
name period category
Anne 1 A
Ben 1 A
Cal 1 A
Anne 1 B
Ben 1 B
Cal 1 B
Anne 2 A
Ben 2 B
Ben 2 A
Ben 3 A
Cal 3 B
I want to compute, for each period and each group of my categorical variables, how many names were also present in the previous period. The output should be as follows:
period category recurrence_count
2 A 2 # due to Anne and Ben being on A, period 1
2 B 1 # due to Ben being on B, period 1
3 A 1 # due to Ben being on A, period 2
3 B 0 # no match from B, period 2
I am aware of the .I and .GRP special symbols in data.table, but I have no idea how to express the notion of 'next group' in the j part of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:
data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
You can first summarize your data by category and period, keeping the names of each group in a list column.
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
Join your summary back onto your original data, matching each row's category and period against the summary's category and next_period.
data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]
Now count, for each period and category, how many of the current names appear in the summarized (previous period) names.
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
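Periods with no previous period (period 1 here) simply count 0. If you only want rows that actually have a predecessor, you can filter them out at the end of the same chain, for example:
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)][period > 1]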
Another data.table alternative. For rows that can have a previous period (period != 1), create such a variable (prev_period := period - 1).
Join the original data with the subset that has values for 'prev_period' (data[data[!is.na(prev_period)]]), joining on 'category', 'period = prev_period' and 'name'.
In the resulting data set, for each 'period' and 'category' (by = .(period = i.period, category)), count the number of names from the original data (x.name) that had a match in the previous period (length(na.omit(x.name))).
data[period != 1, prev_period := period - 1]
data[data[!is.na(prev_period)], on = c("category", period = "prev_period", "name"),
.(category, i.period, x.name)][
, .(n = length(na.omit(x.name))), by = .(period = i.period, category)]
# period category n
# 1: 2 A 2
# 2: 2 B 1
# 3: 3 A 1
# 4: 3 B 0
One option in base R is to split the data by 'category' and loop over the list (lapply). Within each category, split the 'name's by 'period', apply Reduce with intersect and accumulate = TRUE, take the lengths of the resulting list, and build a data.frame with the unique 'period' values (dropping the first). Then use Map to add the 'category' from the names of the list output, and rbind the list of data.frames into a single dataset.
library(data.table)
lst1 <- lapply(split(data, data$category), function(x)
data.frame(period = unique(x$period)[-1],
recurrence_count = lengths(Reduce(intersect,
split(x$name, x$period), accumulate = TRUE)[-1])))
rbindlist(Map(cbind, category = names(lst1), lst1))[
order(period), .(period, category, recurrence_count)]
# period category recurrence_count
#1: 2 A 2
#2: 2 B 1
#3: 3 A 1
#4: 3 B 0
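To see what the accumulate step is doing, here is the intermediate result for category 'A' of the example data (just an illustration of the logic described above):
# names per period within category "A"
xA <- with(subset(data, category == "A"), split(name, period))
# running intersection: names seen in every period so far
Reduce(intersect, xA, accumulate = TRUE)
# lengths, dropping the first period which has no predecessor -> 2 1
lengths(Reduce(intersect, xA, accumulate = TRUE))[-1]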
Or using the same logic within data.table: grouped by 'category', split 'name' by 'period' and apply Reduce with intersect
setDT(data)[, .(period = unique(period),
recurrence_count = lengths(Reduce(intersect,
split(name, period), accumulate = TRUE))), .(category)][duplicated(category)]
# category period recurrence_count
#1: A 2 2
#2: A 3 1
#3: B 2 1
#4: B 3 0
Or a similar option in tidyverse
library(dplyr)
library(purrr)
data %>%
group_by(category) %>%
summarise(recurrence_count = lengths(accumulate(split(name, period),
intersect)), period = unique(period), .groups = 'drop') %>%
filter(duplicated(category))
# A tibble: 4 x 3
# category recurrence_count period
# <chr> <int> <int>
#1 A 2 2
#2 A 1 3
#3 B 1 2
#4 B 0 3
data
data <- structure(list(name = c("Anne", "Ben", "Cal", "Anne", "Ben",
"Cal", "Anne", "Ben", "Ben", "Ben", "Cal"), period = c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), category = c("A", "A", "A",
"B", "B", "B", "A", "B", "A", "A", "B")), class = "data.frame",
row.names = c(NA,
-11L))
A data.table option
setDT(data)[
,
{
u <- split(name, period)
data.table(
period = unique(period)[-1],
recurrence_count = lengths(
Map(
intersect,
head(u, -1),
tail(u, -1)
)
)
)
},
category
]
gives
category period recurrence_count
1: A 2 2
2: A 3 1
3: B 2 1
4: B 3 0
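For category 'A' of the example data, the Map() call lines up each period's names with the following period's names and intersects the pairs, which is where the counts come from. A small walk-through of that step:
# split(name, period) for category "A"
u <- list(`1` = c("Anne", "Ben", "Cal"), `2` = c("Anne", "Ben"), `3` = "Ben")
# pairs period 1 with 2 and period 2 with 3, intersecting each pair
Map(intersect, head(u, -1), tail(u, -1))
# gives list(c("Anne", "Ben"), "Ben"), so lengths() returns 2 and 1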
I have the following problem.
I have a large data.frame. In this data.frame there are 648 different combinations of 7 variables. The data.frame is 4 times that length giving 2592 rows. What I am trying to do is to create a vector in that data.frame, which indicates which of the combinations is in that row. So there should in the end be a vector which includes the numbers 1-648 each four times.
In the end it should look something like this; here is an example with two variables and 3 different combinations.
a b distinct_combinations
<dbl> <chr> <dbl>
1 1 a 1
2 2 b 2
3 3 c 3
4 1 a 1
5 2 b 2
6 3 c 3
Thank you!
The special symbol .GRP from package data.table is essentially what you are asking for:
.GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc. data.table documentation
library(data.table)
setDT(data) # change data to a data.table
data[, distinct_combinations := .GRP, by = .(a, b)]
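As a quick sanity check on the full data (assuming the grouping columns really are a and b), every index should occur exactly four times and there should be 648 of them:
data[, .N, by = distinct_combinations]   # expect 648 rows, each with N = 4
data[, uniqueN(distinct_combinations)]   # expect 648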
You can group_by your desired columns and use group_indices:
library(tidyverse)
data %>%
group_by(across(a:b)) %>%
mutate(distinct_combinations = group_indices())
# A tibble: 6 x 3
# Groups: a, b [3]
a b distinct_combinations
<int> <chr> <int>
1 1 a 1
2 2 b 2
3 3 c 3
4 1 a 1
5 2 b 2
6 3 c 3
You can also arrange your columns and use data.table::rleidv:
data %>%
arrange(across(a:b)) %>%
mutate(distinct_combinations = data.table::rleidv(.))
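The arrange() step matters here because rleid()/rleidv() is run-length based: identical combinations only get the same index when they sit in consecutive rows. A tiny illustration:
data.table::rleid(c("a", "a", "b", "a"))
# 1 1 2 3  (the last "a" starts a new run, so it gets a new id)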
data
structure(list(a = c(1L, 2L, 3L, 1L, 2L, 3L), b = c("a", "b",
"c", "a", "b", "c")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
You can also do it by creating an id column from the pasted values and converting its levels to numbers:
v1 <- rep(1:3, 2)
v2 <- rep(c("a", "b", "c"), 2)
df <- data.frame(v1, v2)
df$id <- as.factor(paste(v1, v2, sep = ""))
levels(df$id) <- seq_along(unique(df$id))
I am using RStudio, mostly dplyr processing, where I have a df of users (A, B, C, ...) and the day since their first visit on which they were active (1, 2, 3, ...)
user day active
A    1   T
A    3   T
B    2   T
B    4   T
I would like to complete this list with all missing days - up to their current maximum value (so for user B until 4 and for user A until 3) - and value FALSE:
user day active
A    1   T
A    2   F
A    3   T
B    1   F
B    2   T
B    3   F
B    4   T
I've been googling and chewing on this for hours now. Anybody have an idea?
We could group by 'user' and then get the sequence from min to max of 'day' in complete to expand the data, while filling the 'active' column with FALSE (by default, missing combinations are filled with NA).
library(dplyr)
library(tidyr)
df1 %>%
group_by(user) %>%
complete(day = min(day):max(day), fill = list(active = FALSE)) %>%
ungroup
Output:
# A tibble: 6 × 3
user day active
<chr> <int> <lgl>
1 A 1 TRUE
2 A 2 FALSE
3 A 3 TRUE
4 B 2 TRUE
5 B 3 FALSE
6 B 4 TRUE
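Note that min(day) starts each user's sequence at their own first active day (so user B starts at day 2 above). If, as in the desired output in the question, every user should start at day 1, swap min(day) for 1:
df1 %>%
  group_by(user) %>%
  complete(day = 1:max(day), fill = list(active = FALSE)) %>%
  ungroup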
data
df1 <- structure(list(user = c("A", "A", "B", "B"), day = c(1L, 3L,
2L, 4L), active = c(TRUE, TRUE, TRUE, TRUE)), class = "data.frame",
row.names = c(NA,
-4L))
You can create a new dataframe with all users and all days, join it to your existing dataframe, and then set the active column. Something like this:
library(dplyr)
fullDf <- data.frame("user" = c(rep("A", 4), rep("B", 4)),
                     "day" = rep(1:4, 2))
existingDf <- left_join(fullDf, existingDf, by = c("user", "day"))
existingDf$active <- ifelse(is.na(existingDf$active), FALSE, existingDf$active)
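If you'd rather not hard-code fullDf, one way to build it (assuming the intent is every user crossed with every day up to the overall maximum) is tidyr::crossing():
library(tidyr)
fullDf <- crossing(user = unique(existingDf$user),
                   day  = seq_len(max(existingDf$day)))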
I have data like this:
stim1 stim2
1: 2 3
2: 1 3
3: 2 1
4: 1 2
5: 3 1
structure(list(stim1 = c(2L, 1L, 2L, 1L, 3L),
stim2 = c(3L, 3L, 1L, 2L, 1L)),
row.names = c(NA, -5L), class = c("data.table", "data.frame"))
My objective is to add two columns: one for 'stim1' and one for 'stim2'. For each row of both columns, I want to find the previous occurrence of its value, in either column, and then grab the value in the opposite column.
For example, on row 3 'stim1' is 2. The previous occurrence of 2 is in 'stim1' on row 1. The value in the other column of that row is 3. So Prev1[3] is 3.
Another example: On row 4 'stim1' is 1. The previous occurrence of 1 is in 'stim2' on row 3. The value in the other column on that row is 2. So Prev1[4] is 2.
Desired output:
stim1 stim2 Prev1 Prev2
1: 2 3 <NaN> <NaN>
2: 1 3 <NaN> 2
3: 2 1 3 3
4: 1 2 2 1
5: 3 1 1 2
The tricky thing is that the OP wants to find the previous occurrence of a value in either column.
Therefore, the idea is to reshape the data into long format and to find the matching rows by aggregating in a non-equi self join.
library(data.table)
long <- melt(DT, measure.vars = patterns("^stim"), value.name = "stim")[
, rn := rowid(variable)][
, opposite := rev(stim), keyby = rn][]
long[, prev := long[long, on = c("stim", "rn < rn"),
.(max(x.rn), x.opposite[which.max(x.rn)]), by = .EACHI]$V2][]
dcast(long, rn ~ rowid(rn), value.var = c("stim", "prev"))
rn stim_1 stim_2 prev_1 prev_2
1: 1 2 3 NA NA
2: 2 1 3 NA 2
3: 3 2 1 3 3
4: 4 1 2 2 1
5: 5 3 1 1 2
Explanation
Reshape DT to long format.
Create an additional column rn which identifies the row numbers in the original dataset DT using rowid(variable).
Create an additional column opposite which contains the values of the opposite column. In long format this means to reverse the order of values within each rn group.
Now, join long with itself. The non-equi join condition is looking for all occurrences of the current stim value in rows before the current row. As there might be more than one match, aggregating by max(rn) within the .EACHI groups picks the row number of the previous occurrence of the value as well as the corresponding opposite value. So,
long[long, on = c("stim", "rn < rn"), .(max(x.rn), x.opposite[which.max(x.rn)]), by = .EACHI]
returns
stim rn V1 V2
1: 2 1 NA NA
2: 3 1 NA NA
3: 1 2 NA NA
4: 3 2 1 2
5: 2 3 1 3
6: 1 3 2 3
7: 1 4 3 2
8: 2 4 3 1
9: 3 5 2 1
10: 1 5 4 2
Create an additional column prev in long which contains the previous opposite value V2.
Finally, reshape long back to wide format, using both measure columns stim and prev.
Edit: Alternative solution
In case DT contains more columns than just stim1 and stim2, DT can be updated by reference, alternatively:
long <- melt(DT, measure.vars = patterns("^stim"), value.name = "stim")[
, rn := rowid(variable)][
, opposite := rev(stim), keyby = rn][]
DT[, c("prev1", "prev2") := dcast(
long[long, on = c("stim", "rn < rn"),
.(max(x.rn), x.opposite[which.max(x.rn)]), by = .EACHI],
rn ~ rowid(rn), value.var = "V2")[, rn := NULL]][]
stim1 stim2 prev1 prev2
1: 2 3 NA NA
2: 1 3 NA 2
3: 2 1 3 3
4: 1 2 2 1
5: 3 1 1 2
Data
library(data.table)
DT <- data.table(stim1 = c(2L, 1L, 2L, 1L, 3L),
stim2 = c(3L, 3L, 1L, 2L, 1L))
A quick helper function to iterate through the data:
func <- function(mtx) {
  na <- mtx[1][NA]                                 # a class-matched NA (see Note below)
  c(NA, sapply(seq_len(nrow(mtx))[-1], function(ind) {
    v <- mtx[ind, 1]                               # current value (first column)
    s <- seq_len(ind - 1)                          # all earlier rows
    m <- cbind(v == mtx[s, 1], v == mtx[s, 2])     # where did it occur before?
    if (any(m)) {
      m <- which(m, arr.ind = TRUE)
      row <- which.max(m[, 1])                     # most recent occurrence
      mtx[m[row, 1], m[row, 2] %% 2 + 1]           # value in the opposite column
    } else na
  }))
}
Demonstration:
dat[, Prev1 := func(cbind(stim1, stim2)) ][, Prev2 := func(cbind(stim2, stim1)) ]
# stim1 stim2 Prev1 Prev2
# <int> <int> <int> <int>
# 1: 2 3 NA NA
# 2: 1 3 NA 2
# 3: 2 1 3 3
# 4: 1 2 2 1
# 5: 3 1 1 2
Alternative, using zoo::rollapply:
func2 <- function(mtx) {
na <- mtx[1][NA]
if (!is.matrix(mtx)) return(na) # we're on the first row
v <- mtx[nrow(mtx),1] ; s <- seq_len(nrow(mtx)-1)
m <- cbind(v == mtx[s,1], v == mtx[s,2])
if (any(m)) {
m <- which(m, arr.ind = TRUE)
row <- which.max(m[,1])
mtx[m[row,1], m[row,2] %% 2 + 1]
} else na
}
dat[, Prev1 := zoo::rollapplyr(.SD, .N, FUN = func2, by.column = FALSE, partial = TRUE),
.SDcols = c("stim1", "stim2")
][, Prev2 := zoo::rollapplyr(.SD, .N, FUN = func2, by.column = FALSE, partial = TRUE),
.SDcols = c("stim2", "stim1") ]
It's not shorter, and in fact is slower (with a 5-row dataset), but if you prefer to think of this in a rolling fashion, this produces the same results. (It's possible the newer slider package might be clearer, faster, or neither compared with this.)
Note:
I assign na as a class-specific NA (there are at least six types of NA). I do this defensively: if there is at least one match, then the remainder of the NA values will be coerced into the correct class; however, if there are no matches, then the class returned by func will be logical which may not be the same as the original data, and data.table will complain.
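As a small illustration of the mtx[1][NA] trick: subsetting the first element and then indexing it with NA returns a single missing value that keeps the column's class, instead of the default logical NA:
x <- c(2L, 5L, 1L)   # an integer column
class(x[1][NA])      # "integer"
class(NA)            # "logical", which is what a bare NA would give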
Just in case there is interest in a tidyverse approach (probably not as fast!): you could add row numbers to your data.frame and put it into long form. Then, grouping by each value, get the previous row number and stim to reference in Prev. With a left_join you can obtain the appropriate value for Prev.
library(tidyverse)
df <- mutate(as.data.frame(df), rn = row_number())
df_long <- pivot_longer(df,
cols = -rn,
names_to = "stim",
names_pattern = "stim(\\d+)",
names_transform = list(stim = as.numeric))
df_long %>%
group_by(value) %>%
mutate(match_rn = lag(rn), match_stim = 3 - lag(stim)) %>%
left_join(df_long, by = c("match_rn" = "rn", "match_stim" = "stim")) %>%
pivot_wider(id_cols = rn,
names_from = stim,
values_from = value.y,
names_prefix = "Prev") %>%
right_join(df) %>%
arrange(rn)
Output
rn Prev1 Prev2 stim1 stim2
<int> <int> <int> <int> <int>
1 1 NA NA 2 3
2 2 NA 2 1 3
3 3 3 3 2 1
4 4 2 1 1 2
5 5 1 2 3 1
Here is another option:
setDT(DT)[, rn := .I]
dt1 <- DT[DT, on=.(stim1, rn<rn), mult="last", .(x.rn, v=x.stim2)]
dt2 <- DT[DT, on=.(stim2=stim1, rn<rn), mult="last", .(x.rn, v=x.stim1)]
DT[, Prev1 := fcoalesce(fifelse(dt1$x.rn > dt2$x.rn, dt1$v, dt2$v), dt1$v, dt2$v)]
# note: the Prev2 version just mirrors the Prev1 logic and has only been checked on this minimal example; please let me know if there are cases where Prev2 is wrong
dt1 <- DT[DT, on=.(stim2, rn<rn), mult="last", .(x.rn, v=x.stim1)]
dt2 <- DT[DT, on=.(stim1=stim2, rn<rn), mult="last", .(x.rn, v=x.stim2)]
DT[, Prev2 := fcoalesce(fifelse(dt1$x.rn > dt2$x.rn, dt1$v, dt2$v), dt1$v, dt2$v)]
output:
stim1 stim2 rn Prev1 Prev2
1: 2 3 1 NA NA
2: 1 3 2 NA 2
3: 2 1 3 3 3
4: 1 2 4 2 1
5: 3 1 5 1 2
data:
library(data.table)
DT <- structure(list(stim1 = c(2L, 1L, 2L, 1L, 3L),
stim2 = c(3L, 3L, 1L, 2L, 1L)),
row.names = c(NA, -5L), class = c("data.table", "data.frame"))
Given this data, with the first 4 columns (rowid, order, line, special), I need to create a column, numSpecial, as follows:
rowid order line special numSpecial
1     A     01   X       1
2     B     01           0
3     B     02   X       2
4     B     03   X       2
5     C     01   X       1
6     C     02           0
numSpecial is, for each row whose special value is X, the total number of special (X) rows within that order; rows that are not special get 0.
I first tried adding a column that simply concatenates 'order' with 'X' (call it orderX), which would look like:
orderX
AX
BX
BX
BX
CX
CX
Then count how many of the paste(order, special) values appear in orderX:
df$numSpecial <- sum(paste(order, special, sep = "") %in% orderX)
But that doesn't work; it returns the overall total for every row:
numSpecial
4
4
4
4
4
4
I then tried as.data.table, but I'm not getting the expected results using:
as.data.table(mydf)[, numSpecial := sum(paste(order, special, sep = "") %in% orderx), by = rowid]
However, that just returns a 1 or 0 for each row rather than the group sums:
numSpecial
1
0
1
1
1
0
Where am I going wrong with these? I don't think I should need to create that orderX column either, but I can't figure out how to get this count right. It's similar to a COUNTIF in Excel, which is easy to do there.
There are probably several ways, but you could just multiply the per-order count by a TRUE/FALSE flag for "X" being present:
dat[, numSpecial := sum(special == "X") * (special == "X"), by=order]
dat
# rowid order line special numSpecial
#1: 1 A 1 X 1
#2: 2 B 1 0
#3: 3 B 2 X 2
#4: 4 B 3 X 2
#5: 5 C 1 X 1
#6: 6 C 2 0
You could also do it a bit differently like:
dat[, numSpecial := 0L][special == "X", numSpecial := .N, by=order]
Where dat was:
library(data.table)
dat <- structure(list(rowid = 1:6, order = c("A", "B", "B", "B", "C",
"C"), line = c(1L, 1L, 2L, 3L, 1L, 2L), special = c("X", "",
"X", "X", "X", "")), .Names = c("rowid", "order", "line", "special"
), row.names = c(NA, -6L), class = "data.frame")
setDT(dat)
You could use ave with a dummy variable (just filled with 1s):
df$numSpecial <- ifelse(df$special == "X", ave(rep(1,nrow(df)), df$order, df$special, FUN = length), 0)
df
# rowid order line special numSpecial
#1 1 A 1 X 1
#2 2 B 1 0
#3 3 B 2 X 2
#4 4 B 3 X 2
#5 5 C 1 X 1
#6 6 C 2 0
Note I read in your data without the numSpecial column.
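For what it's worth, the ave() call above is simply computing per-row group sizes over (order, special); evaluated on the example columns it gives:
ave(rep(1, 6), c("A", "B", "B", "B", "C", "C"),
    c("X", "", "X", "X", "X", ""), FUN = length)
# 1 1 2 2 1 1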
Using the dplyr package:
library(dplyr)
df %>% group_by(order) %>%
mutate(numSpecial = ifelse(special=="X", sum(special=="X"), 0))
rowid order special numSpecial
1 1 A X 1
2 2 B 0
3 3 B X 2
4 4 B X 2
5 5 C X 1
6 6 C 0
One other option using base R only would be to use aggregate:
# Your data
df <- data.frame(rowid = 1:6, order = c("A", "B", "B", "B", "C", "C"), special = c("X", "", "X", "X", "X", ""))
# Make the counts
dat <- with(df,aggregate(x=list(answer=special),by=list(order=order,special=special),FUN=function(x) sum(x=="X")))
# Merge back to original dataset:
dat.fin <- merge(df,dat,by=c('order','special'))
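One caveat: merge() sorts its result by the join columns by default, so dat.fin comes back ordered by order and special rather than by rowid. If the original row order matters, re-sort it afterwards:
dat.fin <- dat.fin[order(dat.fin$rowid), ]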
So I have a data frame:
ID YearMon Var Count
1  012007  H   1
1  012007  D   2
1  022007      NA
1  032007  H   1
2  012007  H   1
2  022007      NA
2  022007  D   1
2  032007      NA
How would I go about getting just the row with the maximum Count for each unique ID and YearMon? Ideally it would return:
1 012007 D 2
1 022007   NA
1 032007 H 1
2 012007 H 1
2 022007 D 1
2 032007   NA
Using plyr this is easily achieved. This will split the data by ID and YearMon and return the max Count, along with the ID and YearMon, in a data frame.
library(plyr)
ddply( dat1 , .(ID,YearMon) ,function(x) {
Count = max( x$Count )
data.frame( Count=Count , Var=x[x$Count == Count,"Var"] )
})
In order to return all columns:
df[ is.na( df$Count ) , "Count" ] <- -9999
df2 <- ddply(df, .(ID,YearMon) , function(x){
Count = max( x$Count )
index = which( x$Count == max( x$Count ))
y <- x[ index ,]
data.frame( y )
})
df2[ df2$Count == -9999, "Count" ] <- NA
This will return your indexing values back to NA as well.
Using data.table, if you have a data table called dt, you can first calculate the max of Count by group, and then just keep the rows where Count is equal to the max for that group:
newdt <- dt[, max.count := max(Count), by=.(ID, YearMon)][Count==max.count,.(ID, YearMon, Var, Count)]
library(dplyr)
dt %>%
group_by(ID, YearMon) %>%
slice(Count %>% which.max)
Let's not forget about aggregate!
#### Clean up the data. The grouping variables need to be factors and the count needs to be numeric ####
dat1$Var.[dat1$Var.==1]=""
dat1$Count.<-as.numeric(dat1$Count.)
dat1$ID.<-as.factor(dat1$ID.)
dat1$YearMon.<-as.factor(dat1$YearMon.)
dat1<-dat1[,-3] ### Let's get rid of the Var column as it is not being used
aggregate(. ~ ID.+YearMon.,data = dat1,FUN=max ) #### Use aggregate. Simple and short code
ID. YearMon. Count.
1 1 12007 2
2 2 12007 1
3 2 22007 1
4 1 32007 1
Another data.table option using if/else. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'ID' and 'YearMon', if all of the 'Count' values are NA we return the Subset of Data.table (.SD); otherwise we get the index of the maximum value of 'Count' and subset the data.table (.SD[which.max(Count)]).
library(data.table)
setDT(df1)[, if(all(is.na(Count))) .SD else .SD[which.max(Count)],.(ID, YearMon)]
# ID YearMon Var Count
#1: 1 12007 D 2
#2: 1 22007 NA
#3: 1 32007 H 1
#4: 2 12007 H 1
#5: 2 22007 D 1
#6: 2 32007 NA
Or another option would be to concatenate the index from which.max and the rows which have all 'NA' for 'Count' grouped by the variables, get the row index (.I) and use that to subset the 'data.table'.
setDT(df1)[df1[, .I[c(which.max(Count), all(is.na(Count)))], .(ID, YearMon)]$V1]
# ID YearMon Var Count
#1: 1 12007 D 2
#2: 1 22007 NA
#3: 1 32007 H 1
#4: 2 12007 H 1
#5: 2 22007 D 1
#6: 2 32007 NA
Or we replace the NA by a very small number, use which.max and subset
setDT(df1)[, .SD[which.max(replace(Count, is.na(Count),-Inf ))], .(ID, YearMon)]
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
YearMon = c(12007L,
12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L), Var = c("H",
"D", "", "H", "H", "", "D", ""), Count = c(1L, 2L, NA, 1L, 1L,
NA, 1L, NA)), .Names = c("ID", "YearMon", "Var", "Count"),
class = "data.frame", row.names = c(NA, -8L))