Summarise data between given value of a categorical variable

I am looking for a clever and fast way to summarise data in a data frame. The data and desired output look as follows:
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")
propertyVector <- 1:length(categoriesVector)
dataVector <- 100 * rev(propertyVector)
df <- data.frame(categoriesVector, propertyVector, dataVector, stringsAsFactors = F)
df
desiredData <- c(700, sum(500, 400, 300), 100)
desiredProperty1 <- c(3, 5, 9)
desiredProperty2 <- c(3, 7, 9)
desiredDF <- data.frame(desiredData, desiredProperty1, desiredProperty2)
desiredDF
Basically, I need to sum the data and keep the first and last property between each two occurrences of category A. After a lot of headbanging I found a clumsy solution, which I would like to improve in terms of clarity and performance, preferably with dplyr:
library(dplyr)

numRows <- dim(df)[1]
.groupedID <- rep(NA, numRows)
ID <- 1
# 0 marks an "A" row; "B" rows get the ID of their run
.groupedID[[1]] <- ifelse(df$categoriesVector[[1]] == "A", 0, ID)
for(i in 2:numRows)
{
  if(df$categoriesVector[i] == "B")
  {
    .groupedID[i] <- ID
    # a "B" directly after a "B" continues the previous run
    if(df$categoriesVector[i - 1] == "B")
    {
      .groupedID[i] <- .groupedID[i - 1]
    }
    ID <- ID + 1
  } else
  {
    .groupedID[i] <- 0
  }
}
tempDF <-
  df %>%
  mutate(ID = .groupedID) %>%
  filter(ID != 0) %>%
  group_by(ID) %>%
  summarise(desiredProperty1 = head(propertyVector, 1),
            desiredProperty2 = tail(propertyVector, 1),
            desiredData = sum(dataVector)) %>%
  select(desiredData, desiredProperty1, desiredProperty2)
tempDF

You could use cumsum() to make your groupings and then process based on those like this:
df %>% mutate(Agroups = cumsum(categoriesVector == "A")) %>%
filter(categoriesVector == "B") %>%
group_by(Agroups) %>%
summarise(propertyStart = min(propertyVector),
propertyEnd = max(propertyVector),
dataTotal = sum(dataVector))
# A tibble: 3 x 4
Agroups propertyStart propertyEnd dataTotal
<int> <dbl> <dbl> <dbl>
1 2 3 3 700
2 3 5 7 1200
3 4 9 9 100
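To see why this grouping works, it may help to look at the intermediate vector on its own. The following is a minimal base-R sketch using the question's categoriesVector; the name isA is only used here for illustration.

```r
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")

# TRUE marks a row that starts a new block (an "A" row)
isA <- categoriesVector == "A"

# cumsum() counts the "A" rows seen so far, so every "B" row
# inherits the number of the most recent "A" before it
Agroups <- cumsum(isA)
Agroups
# 1 2 2 3 3 3 3 4 4

# group ids of the "B" rows only
Agroups[!isA]
# 2 3 3 3 4
```

Consecutive "B" rows share an id because no "A" occurs between them to increase the running count.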

Here is how I'd do it with data.table. First create a spanNumber variable to identify each span of "B" surrounded by "A", then calculate the variables you specified:
library(data.table)
setDT(df)
df[, catShiftConcat := paste0(categoriesVector, shift(categoriesVector, fill = "A"))]
df[categoriesVector == "B", spanNumber := cumsum(catShiftConcat == "BA")]
df[!is.na(spanNumber) , .(desiredData = sum(dataVector),
desiredProperty1 = propertyVector[1],
desiredProperty2 = propertyVector[.N]), by = spanNumber]
## spanNumber desiredData desiredProperty1 desiredProperty2
## 1: 1 700 3 3
## 2: 2 1200 5 7
## 3: 3 100 9 9

An alternative data.table method that uses rleid to group runs of the categories vector is
library(data.table)
setDT(df)[, .(categoriesVector = categoriesVector[1],
              desiredData = sum(dataVector),
              desiredProperty1 = propertyVector[1],
              desiredProperty2 = propertyVector[.N]),
          by = rleid(categoriesVector)
          ][categoriesVector == "B", ][, c("rleid", "categoriesVector") := NULL][]
The first [] computes the summaries for each run of the categories vector; taking categoriesVector[1] keeps a single row per run. The second [] keeps only the runs for which the categories vector is B. The third [] removes two helper variables, and the final [] is just there to print the result to the screen.
This returns
   desiredData desiredProperty1 desiredProperty2
1:         700                3                3
2:        1200                5                7
3:         100                9                9
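For readers who want to see what the run ids look like, here is a small base-R sketch that builds the same kind of id with rle(). The name runId is only for illustration; for a single vector it should match what data.table's rleid() returns.

```r
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")

# rle() reports the length of each run of identical values
runs <- rle(categoriesVector)

# repeat each run's index by its length to get one id per row
runId <- rep(seq_along(runs$lengths), runs$lengths)
runId
# 1 1 2 3 4 4 4 5 6
```

Every run of "B" gets its own id, so grouping by it and keeping the "B" runs gives one summary row per span.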

Related

Creating group ids by comparing values of two variables across rows: in R

I have a dataframe with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the dataframe.
Below is a simple example of the data
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17),
NEW_VAR = c(1,1,2,3,4))

You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels, you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble/tidyverse. To match the group numbers, I re-sorted the results too.
I hope this might be helpful.
library(tidyverse)
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
toy_data$rn = 1:nrow(toy_data)
eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]
together <- cbind(
setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)
groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
vals <- groups[ind]
groups <- c(
setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
groups[!ind]
)
}
min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)
lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
enframe() %>%
unnest(col = value) %>%
select(-rn)
Output
name start end
<int> <dbl> <dbl>
1 1 1 10
2 1 10 15
3 2 5 9
4 3 6 11
5 4 16 17

The following function should give you the desired identifier variable NEW_VAR.
identifier <- \(df) {
x <- array(0L, dim = nrow(df))
count <- 0L
my_seq <- seq_len(nrow(df))
for (i in my_seq) {
if(!df[i,]$start %in% df$end) {
x[i] <- my_seq[i] + count
} else {
x[i] <- my_seq[i]-1L + count
count <- count - 1L
}
}
x
}
Examples
# your example
toy_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
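The loop appears to amount to a running count of the rows whose start matches no end value, so a vectorised sketch is possible. The function name identifier_vec is hypothetical and not part of the answer; like the loop, it assumes the rows are already in the order in which the ids should grow.

```r
# hypothetical helper, not part of the answer above
identifier_vec <- function(df) {
  # a row opens a new group when its start matches no end value;
  # cumsum() then counts the groups opened so far
  cumsum(!(df$start %in% df$end))
}

toy_data <- data.frame(start = c(1, 10, 5, 6, 16),
                       end   = c(10, 15, 9, 11, 17))
identifier_vec(toy_data)
# 1 1 2 3 4

other_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
                         end   = c(16, 2, 21, 2, 2, 2, 3, 1))
identifier_vec(other_data)
# 0 0 0 1 1 1 2 2
```

Both results agree with the output of the loop version on the two examples shown.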

Subset data.frame based on lag between two columns

Suppose you want to subset a data.frame where the rule for keeping rows is based
on a lag between columns 'a' and 'b':
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
#output
a b
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
Essentially, if 'a' = 1 you want to keep that row as well as the subsequent run of rows in
'b' that have a value of 1. This capture continues until the next row with a = 0 & b = 0.
I've tried using nested 'ifelse()' statements, but I am stuck on how to incorporate logical tests that depend on a lag.
Suggestions?

This is how I would do it. There are probably options out there that require maybe 1 or 2 lines less.
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
library(dplyr)
df %>%
mutate(grp = cumsum(a==1|a+b==0)) %>%
group_by(grp) %>%
filter(any(a == 1)) %>%
ungroup() %>%
select(a, b)
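The grouping step is the core of this approach, so here is a base-R sketch of the same idea that exposes the intermediate grp vector. The names grp and keep are only for illustration.

```r
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

# a new group starts at every a == 1 row and at every all-zero row
grp <- cumsum(df$a == 1 | df$a + df$b == 0)
grp
# 1 1 1 2 3 3 3 4 5 6 6 6

# keep only the rows whose group contains an a == 1 row
keep <- grp %in% grp[df$a == 1]
df[keep, ]
```

The trailing b == 1 rows (group 6) are dropped because their group starts at an all-zero row rather than at a == 1.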

A solution without dplyr. Work with a flag:
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# create new empty df
new_df <- read.table(text = "", col.names = c("a", "b"))
a_okay = FALSE # initialize the flag
for (row_number in seq(1:nrow(df))) { # loop over each row of the original df
# if a is 1, we add the row to the new df and set the flag to TRUE
if (df[row_number, "a"] == 1) {
a_okay = TRUE
new_df[nrow(new_df) + 1, ] = c(df[row_number, "a"], df[row_number, "b"])
}
# now we consider the rows where a is not 1
else {
# if b is 1 and we are still following an a == 1: add the row
if (df[row_number, "b"] == 1 & a_okay) {
new_df[nrow(new_df) + 1, ] = c(df[row_number, "a"], df[row_number, "b"])
}
# if b is 0, we reset the flag
else {
a_okay = FALSE
}
}
}

Another base solution inspired by this post, @Wietse de Vries's answer and @Ben's comment.
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# identify groups
df$grp <- cumsum(df$a == 1 | df$b == 0)
# subset df by groups with first element of a == 1
df <- do.call(rbind, split(df, df$grp)[by(df, df$grp, function(x) {x$a[1] == 1})])
# remove grp
df$grp <- NULL

Assign the value of a variable based on another variable which is related to column names of a dataframe

I have a data frame with the following variables:
df <- data.frame(ID = seq(1:5),
Price.A = c(10,12,14,16,18),
Price.B = c(6,7,9,8,5),
Price.C = c(27,26,25,24,23),
Choice = c("A", "A", "B", "B", "C"))
I want to create a variable called Expenditure, which picks the value from Price.A, Price.B or Price.C depending on the value of the variable Choice.
I tried to create it with the following code:
df$Expenditure <- with(df, get(paste("Price.", Choice, sep ="")))
However, that returns the value of Price.A for all observations.
In my real application, instead of A, B and C, I have hundreds of names, so an ifelse command is not feasible.
Does anyone know how to do that?

It would probably make more sense to reshape your data. Currently your data is not in a "tidy" format.
library(dplyr)
library(tidyr)
df %>% gather(Price, Expenditure, -ID, -Choice) %>%
  filter(Price == paste0("Price.", Choice)) %>%
  select(-Price)
Otherwise you could do matrix-indexing of a matrix
cols <- grep("Price", names(df), value=T)
mm <- as.matrix(df[, cols])
colidx <- match(paste0("Price.", df$Choice), cols)
df$Expenditure <- mm[cbind(1:length(colidx), colidx)]
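Matrix indexing with a two-column matrix may be unfamiliar, so here is a small stand-alone sketch. The values are taken from the first three rows of the question's Price.A and Price.B columns; idx is an illustrative name.

```r
# a small matrix with named columns (filled column by column)
mm <- matrix(c(10, 12, 14,
               6,  7,  9),
             ncol = 2,
             dimnames = list(NULL, c("Price.A", "Price.B")))

# each row of the index matrix is one (row, column) pair
idx <- cbind(1:3, c(1, 1, 2))

# picks mm[1, 1], mm[2, 1] and mm[3, 2]
mm[idx]
# 10 12 9
```

Because the whole lookup is a single indexing operation, it scales to hundreds of price columns without a loop.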

df$Expenditure[df$Choice=="A"] <- df$Price.A[df$Choice=="A"]
df$Expenditure[df$Choice=="B"] <- df$Price.B[df$Choice=="B"]
df$Expenditure[df$Choice=="C"] <- df$Price.C[df$Choice=="C"]
Here's how to scale it up with a loop:
df$Expenditure <- NA
for(i in unique(df$Choice)){
j <- paste0("Price.",i)
df$Expenditure[df$Choice==i] <- df[df$Choice==i,colnames(df) == j]
}
ID Price.A Price.B Price.C Choice Expenditure
1 1 10 6 27 A 10
2 2 12 7 26 A 12
3 3 14 9 25 B 9
4 4 16 8 24 B 8
5 5 18 5 23 C 23
You could easily wrap this into a function and use apply if you prefer.
There are also lots of more complicated ways to do this, though I think it's a terrible practice to use a 3rd party package for this when base R does a wonderful job. Here's one:
df <- data.frame(ID = seq(1:5),
PriceA = c(10,12,14,16,18),
PriceB = c(6,7,9,8,5),
PriceC = c(27,26,25,24,23),
Choice = c("A", "A", "B", "B", "C"))
require(sqldf)
df$Expenditure <- unname(sqldf("SELECT
CASE
WHEN Choice == 'A' THEN PriceA
WHEN Choice == 'B' THEN PriceB
WHEN Choice == 'C' THEN PriceC
END
from df"))

Here are a couple of *apply based approaches:
df$Expenditure <- sapply(seq_along(df[[1]]), function(i) {
df[i, sprintf("Price.%s", df$Choice[i])]
})
df$Expenditure <- mapply(function(x, y) {
df[x, sprintf("Price.%s", y)]
}, row.names(df), df$Choice
)
The second one assumes your object has the default row.names of 1:nrow(df).

How about
for (i in 1:nrow(df)) {
df$Expenditure[i] <- with(df[i, ], get(paste("Price.", Choice, sep="")))
}

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.

Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
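The 3 in the call above is the width of one repeated block. As a sketch for the 133-column case, the block starts and width can be derived from the column names instead. This assumes every block begins with a question column and has the same layout; starts, width and stacked are illustrative names.

```r
input <- data.frame(
  id = 1:2,
  question = c("a", "b"), points = 0, max_points = c(3, 5),
  question = c("c", "d"), points = c(0, 20), max_points = c(5, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# position of the first column of each repeated block
starts <- which(names(input) == "question")
starts
# 2 5

# columns per block, assuming all blocks share the same layout
width <- length(unique(names(input)[-1]))

# take id plus one block at a time, then stack the pieces
stacked <- do.call(rbind, lapply(starts, function(i) {
  input[, c(1, i:(i + width - 1))]
}))
stacked
```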

The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.

Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20

You might need to clarify how you want the ID column to be handled, but perhaps something like this?
runme <- function(word , dat){
grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
I'm not sure how robust it is with respect to different numbers of repeated columns, but it works for your test data and should work if your columns are repeated an equal number of times.

Remove melt data based on condition

I'd like to remove any rows where the value of a >= b, but I'm not sure how to do this.
Sample data:
df <- data.frame(day = c(1, 1, 2, 2, 3, 3), var = c("a", "b", "a", "b", "a", "b"), value = c(1, 2, 3, 3, 2, 1))
Output:
day var value
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 3
5 3 a 2
6 3 b 1
Desired output:
day var value
1 1 a 1
2 1 b 2

Here's a data.table solution that avoids going from long to wide:
library(data.table)
dt <- data.table(df)
dt[, if (value[var == 'a'] < value[var == 'b']) .SD, by = day]
EDIT: your desired output keeps the day where a < b, so the condition is written that way; days where a >= b are dropped.
EDIT2: if you don't want to do it in data.table, then here's the dplyr solution:
df %>% group_by(day) %>% filter(value[var == 'a'] < value[var == 'b'])
EDIT3: if you want to put NAs in instead, then this:
df %>% group_by(day) %>% mutate(value = if(value[var == 'a'] >= value[var == 'b']) as.numeric(NA) else value)
EDIT4: NOTE this last solution appears to expose a bug, where NAs are handled strangely, see here: Why is dplyr removing values not met by condition?
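For comparison, the same per-day comparison can be sketched in base R, written to reproduce the desired output in the question. It assumes each day has exactly one a row and one b row and that both appear in the same day order; drop_days and result are illustrative names.

```r
df <- data.frame(day   = c(1, 1, 2, 2, 3, 3),
                 var   = c("a", "b", "a", "b", "a", "b"),
                 value = c(1, 2, 3, 3, 2, 1))

# one a value and one b value per day, in the same day order
a    <- df$value[df$var == "a"]
b    <- df$value[df$var == "b"]
days <- df$day[df$var == "a"]

# days where a >= b are the ones to remove
drop_days <- days[a >= b]
result <- df[!(df$day %in% drop_days), ]
result
```

Only the two rows for day 1 remain, since that is the only day where a is smaller than b.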

Shape's answer is a correct approach to your problem.
To extend Shape's answer, I want to contribute a slightly more generic solution.
The eav function in the dwtools package is designed to make calculations on measures in entity-attribute-value (EAV) data structures easier. The function is defined below, so you don't need the dwtools package.
It calculates the rm variable for each group. The formula for the calculation can be the same as the quoted j argument you would pass to a [.data.table call after dcasting your EAV data to wide format and before melting it back to EAV.
library(data.table)
eav = function(x, j, id.vars = key(x)[-length(key(x))], variable.name = key(x)[length(key(x))], measure.vars = names(x)[!(names(x) %in% key(x))], fun.aggregate = sum, shift.on = character(), wide=FALSE){
stopifnot(is.data.table(x))
r <- x[,lapply(.SD,fun.aggregate),c(id.vars,variable.name),.SDcols=measure.vars
][,dcast(.SD,formula=as.formula(paste(paste(id.vars,collapse=' + '),paste(variable.name,collapse=' + '),sep=' ~ ')),fun.aggregate=fun.aggregate,value.var=measure.vars)
][,eval(j), by = eval(id.vars[!(id.vars %in% shift.on)])
]
if(wide) r[] else melt(r,id.vars=id.vars, variable.name=variable.name, value.name=measure.vars)[,.SD,keyby=c(id.vars,variable.name)]
}
df = data.frame(day = c(1, 1, 2, 2, 3, 3), var = c("a", "b", "a", "b", "a", "b"), value = c(1, 2, 3, 3, 2, 1))
dt = as.data.table(df)
setkey(dt, day, var)
r = eav(dt, quote(rm := as.numeric(a >= b)))
print(r)
# day var value
#1: 1 a 1
#2: 1 b 2
#3: 1 rm 0
#4: 2 a 3
#5: 2 b 3
#6: 2 rm 1
#7: 3 a 2
#8: 3 b 1
#9: 3 rm 1
r[, if(value[var=="rm"] == 0) .SD, by = day
][var!="rm"] # you need to exclude temporary variable
# day var value
#1: 1 a 1
#2: 1 b 2
This solution may also be slower than Shape's (you can populate your sample of big data so it can be measured), but may be easier for complex computations on many measures in EAV, and supports shift'ing - see eav examples.
