Remove melt data based on condition - r

I'd like to remove any rows where the value of a >= b, but I'm not sure how to do this.
Sample data:
df <- data.frame(day = c(1, 1, 2, 2, 3, 3), var = c("a", "b", "a", "b", "a", "b"), value = c(1, 2, 3, 3, 2, 1))
Output:
day var value
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 3
5 3 a 2
6 3 b 1
Desired output:
day var value
1 1 a 1
2 1 b 2

Here's a data.table solution that avoids going from long to wide:
library(data.table)
dt <- data.table(df)
dt[,if(value[var == 'a'] >= value[var == 'b']) .SD,by = day]
EDIT: I realize now that your desired output does not fit your initial inequality (day 1 has a < b), so adjust the inequality to match :)
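For reference, a version matching the desired output (keep only the days where a is strictly less than b) would be:
dt[, if (value[var == 'a'] < value[var == 'b']) .SD, by = day]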
EDIT2: if you don't want to do it in data.table, then here's the dplyr solution
df %>% group_by(day) %>% filter(value[var == 'a'] >= value[var == 'b'])
EDIT3: if you want to put NA's in then this
df %>% group_by(day) %>% mutate(value = if(value[var == 'a'] >= value[var == 'b']) as.numeric(NA) else value)
EDIT4: NOTE this last solution appears to expose a bug, where NA's are handled strangely, see here: Why is dplyr removing values not met by condition?
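As a hedged workaround for that bug, replace() sidesteps the if/else branch entirely; a minimal sketch, assuming exactly one 'a' and one 'b' row per day:
library(dplyr)
df %>%
  group_by(day) %>%
  # replace() blanks the whole group when the length-1 condition is TRUE,
  # because a logical index recycles across the group
  mutate(value = replace(value, value[var == 'a'] >= value[var == 'b'], NA_real_)) %>%
  ungroup()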

Shape's answer is a correct approach to your problem.
Just to extend Shape's answer, I want to contribute a slightly more generic solution.
The eav function in the dwtools package is designed to ease calculations on measures in entity-attribute-value (EAV) data structures. The function is defined below, so you don't need the dwtools package.
It calculates an rm variable for each group. The formula for the calculation is passed as a quoted j argument to a [.data.table call; it is applied after dcasting your EAV data to wide format and before melting it back to EAV.
library(data.table)
eav = function(x, j, id.vars = key(x)[-length(key(x))],
               variable.name = key(x)[length(key(x))],
               measure.vars = names(x)[!(names(x) %in% key(x))],
               fun.aggregate = sum, shift.on = character(), wide = FALSE){
  stopifnot(is.data.table(x))
  # aggregate, cast EAV to wide, evaluate j by group, then melt back to EAV
  r <- x[, lapply(.SD, fun.aggregate), c(id.vars, variable.name), .SDcols = measure.vars
         ][, dcast(.SD, formula = as.formula(paste(paste(id.vars, collapse = ' + '),
                        paste(variable.name, collapse = ' + '), sep = ' ~ ')),
                   fun.aggregate = fun.aggregate, value.var = measure.vars)
         ][, eval(j), by = eval(id.vars[!(id.vars %in% shift.on)])]
  if (wide) r[] else melt(r, id.vars = id.vars, variable.name = variable.name,
                          value.name = measure.vars)[, .SD, keyby = c(id.vars, variable.name)]
}
df = data.frame(day = c(1, 1, 2, 2, 3, 3), var = c("a", "b", "a", "b", "a", "b"), value = c(1, 2, 3, 3, 2, 1))
dt = as.data.table(df)
setkey(dt, day, var)
r = eav(dt, quote(rm := as.numeric(a >= b)))
print(r)
# day var value
#1: 1 a 1
#2: 1 b 2
#3: 1 rm 0
#4: 2 a 3
#5: 2 b 3
#6: 2 rm 1
#7: 3 a 2
#8: 3 b 1
#9: 3 rm 1
r[, if(value[var=="rm"] == 0) .SD, by = day
][var!="rm"] # you need to exclude temporary variable
# day var value
#1: 1 a 1
#2: 1 b 2
This solution may also be slower than Shape's (generate a larger sample of your data if you want to measure it), but it may be easier for complex computations on many measures in EAV data, and it supports shift'ing - see the eav examples.
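For instance, a rough benchmark sketch on a larger sample (the microbenchmark package and the sizes here are my assumptions):
library(microbenchmark)
n <- 1e4
# one 'a' and one 'b' row per day, as in the sample data
big <- data.table(day = rep(seq_len(n), each = 2),
                  var = rep(c("a", "b"), n),
                  value = runif(2 * n))
setkey(big, day, var)
microbenchmark(
  shape = big[, if (value[var == "a"] >= value[var == "b"]) .SD, by = day],
  eav   = eav(big, quote(rm := as.numeric(a >= b))),
  times = 10L
)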

Related

How to compare one column to a series of related dummy variables without a for loop in R

I have working R code, but it is inelegant and inefficient. I am wondering if there is a better way: i.e. how can I vectorize this process and/or reduce the computing time?
library(data.table)
dt <- data.table(
visited_a = c(1, 1, 0, 0),
visited_b = c(1, 0, 0, 0),
visited_c = c(0, 0, 1, 1),
purchased = c("b", "b", "c", "a")
)
My data.table has dummy indicators for whether a consumer visited a store in 2017. So visited_a = 0 means she did not visit store a in 2017 while visited_b = 1 means she did visit store b in 2017. The data also list which store the consumer purchased from in 2018; all of these consumers made a purchase. Thus a consumer may or may not have visited the store (last year) that she purchased from (this year).
I want to add a variable purchased_was_visited to capture this. The solution would be:
dt$purchased_was_visited <- c(1, 0, 1, 0)
Here is my extraordinarily inelegant code, which sadly loops through the data.table one row at a time. There must be a better way!
dt[ , purchased_was_visited := NA]
for(i in 1:nrow(dt)) {
brand <- dt[i, purchased]
col <- paste0("visited_", brand)
was_it <- dt[i, ..col]
dt[i, purchased_was_visited := was_it]
}
I would give your consumers an ID column and store the data in two tables:
dt[, cid := .I]
# visits
vDT = melt(dt, id="cid", meas=patterns("visited"), variable.name = "store")[value == 1, !"value"]
vDT[, store := tstrsplit(store, "_")[[2]]]
vDT[, year := 2017L]
# choices
cDT = dt[, .(cid, year = 2018L, store = purchased)]
Then you can do a join to add the indicator column to cDT:
cDT[, v_before := vDT[.SD, on=.(cid, store, year < year), .N, by=.EACHI]$N]
cid year store v_before
1: 1 2018 b 1
2: 2 2018 b 0
3: 3 2018 c 1
4: 4 2018 a 0
An option is to use get(), grouping on each row. One has to build the column name (based on purchased) that needs to be evaluated to check for a past visit. get() will then provide the desired result if evaluated for each row, hence we group on each row (e.g. by = 1:NROW(dt)):
library(data.table)
dt[,purchased_was_visited := get(paste("visited",purchased,sep="_")), by=1:NROW(dt)]
dt
# visited_a visited_b visited_c purchased purchased_was_visited
# 1: 1 1 0 b 1
# 2: 1 0 0 b 0
# 3: 0 0 1 c 1
# 4: 0 0 1 a 0
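A fully vectorized alternative (a sketch, not from the answer above) avoids per-row grouping via matrix indexing, where a two-column (row, column) index matrix extracts one cell per row:
# build a (row, col) index pair for each consumer and look everything up in one shot
m <- as.matrix(dt[, .(visited_a, visited_b, visited_c)])
dt[, purchased_was_visited := m[cbind(.I, match(paste0("visited_", purchased), colnames(m)))]]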
Here is another way to do it, using base apply commands:
## get index of column names which are 1
vals <- apply(dt[,1:3], 1, function(x) which(x == 1))
vals <- lapply(vals, function(x) names(x))
# replace the string in column names before underscore
vals <- lapply(vals, function(x) gsub(pattern = '.*._',replacement = '',x = x))
# create the final column
dt[, purchased_was_visited := mapply(function(x,y) as.integer(x %in% y), purchased, vals)]
print(dt)
visited_a visited_b visited_c purchased purchased_was_visited
1: 1 1 0 b 1
2: 1 0 0 b 0
3: 0 0 1 c 1
4: 0 0 1 a 0
I am not sure about its elegance but here is a "tidy" and efficient dplyr solution:
library(dplyr)
setDF(dt)
dt <- dt %>%
mutate(
check_a = if_else(visited_a == 1, if_else(purchased == "a", 1, 0), 0),
check_b = if_else(visited_b == 1, if_else(purchased == "b", 1, 0), 0),
check_c = if_else(visited_c == 1, if_else(purchased == "c", 1, 0), 0),
purchased_was_visited = check_a + check_b + check_c
) %>%
select(-c(5:7))
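A sketch of a more compact variant that scales beyond three hard-coded stores, using rowwise() so get() can look up the matching column per row (my variation, not part of the answer above):
library(dplyr)
dt %>%
  rowwise() %>%
  # get() resolves the column named after this row's purchased store
  mutate(purchased_was_visited = get(paste0("visited_", purchased))) %>%
  ungroup()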
Here's another idea: make the values in "purchased" match the names you give the visited columns. This way "purchased" can be used as a direct lookup of the values needed. Here, I have renamed 'visited_a' to just 'a', etc. This could also be done programmatically, as you and some others have done, but perhaps designing your data and data table in advance for usability is the tidier option.
dt <- data.frame(
a = c(1, 1, 0, 0),
b = c(1, 0, 0, 0),
c = c(0, 0, 1, 1),
purchased = c("b", "b", "c", "a")
)
dt$purchased_was_visited <-
sapply(row.names(dt), function(i) {(dt[i, dt[i, 'purchased']])})
dt
# a b c purchased purchased_was_visited
# 1 1 1 0 b 1
# 2 1 0 0 b 0
# 3 0 0 1 c 1
# 4 0 0 1 a 0
Sorry, I used a data.frame not a data.table, but the same principle applies.
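For completeness, a hedged data.table sketch of the same lookup idea, relying on the named vector that unlist(.SD) yields per row:
library(data.table)
dtt <- as.data.table(dt)
# per row, unlist(.SD) is c(a = ., b = ., c = .), so 'purchased' indexes it directly
dtt[, purchased_was_visited := unlist(.SD)[purchased],
    by = seq_len(nrow(dtt)), .SDcols = c("a", "b", "c")]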

Evaluating same column data.table in r

How can I evaluate a column of a data.table against values of the same column, comparing each value with the values in the next two positions? The following example illustrates the problem and the desired result.
library(data.table)
dt <- data.table(a = c(2, 3, 2, 4))
result <- data.table(a = c(2, 3, 2, 4), b = c(T, F, NA, NA))
We can use shift to create two lead columns based on 'a' by specifying n = 1:2. Loop through the columns with lapply, check whether each is equal to 'a', Reduce the results to a single logical vector with |, and assign it to the 'b' column.
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
dt
# a b
#1: 2 TRUE
#2: 3 FALSE
#3: 2 NA
#4: 4 NA
As @Mike H. suggested, if we are comparing only against the next values, then doing the comparisons individually may be easier to understand:
dt[, b := (shift(a, 1, type = 'lead') == a) | (shift(a, 2, type = 'lead') == a)]
You could do a rolling join on row number:
dt[, r := .I]
dt[head(1:.N, -2), found :=
dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, .N, by=.EACHI]$N > 0L]
a r found
1: 2 1 TRUE
2: 3 2 FALSE
3: 2 3 NA
4: 4 4 NA
To see how it works, replace .N with x.r:
dt[head(1:.N, -2), dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, x.r, by=.EACHI]]
a r x.r
1: 2 2 3
2: 3 3 NA
The idea is that we look for the nearest match on a, starting from r + 1 and giving up after rolling one more row ahead.
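A quick sanity check (a sketch) that the rolling join agrees with the shift-based answer above:
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
identical(dt$b, dt$found)  # TRUE: both are c(TRUE, FALSE, NA, NA)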

Summarise data between given value of a categorical variable

I am looking for a clever and fast way to summarise data in a data frame. The data and desired output looks as follows:
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")
propertyVector <- 1:length(categoriesVector)
dataVector <- 100 * rev(propertyVector)
df <- data.frame(categoriesVector, propertyVector, dataVector, stringsAsFactors = F)
df
desiredData <- c(700, sum(500, 400, 300), 100)
desiredProperty1 <- c(3, 5, 9)
desiredProperty2 <- c(3, 7, 9)
desiredDF <- data.frame(desiredData, desiredProperty1, desiredProperty2)
desiredDF
Basically I need to sum data and keep the first and last property between each two occurrences of category A. After a lot of headbanging I found a clumsy solution, and I am looking for an improvement in terms of clarity and performance, preferably with dplyr:
numRows <- dim(df)[1]
.groupedID <- rep(NA, numRows)
ID <- 1
.groupedID[[1]] <- ifelse(df$categoriesVector[[1]] == "A", 0, ID)
for(i in 2:numRows)
{
if(df$categoriesVector[i] == "B")
{
.groupedID[i] <- ID
if(df$categoriesVector[i - 1] == "B")
{
.groupedID[i] <- .groupedID[i - 1]
}
ID <- ID + 1
} else
{
.groupedID[i] <- 0
}
}
tempDF <-
df %>%
mutate(ID = .groupedID) %>%
filter(ID != 0) %>%
group_by(ID) %>%
summarise(desiredProperty1 = head(propertyVector, 1),
desiredProperty2 = tail(propertyVector, 1),
desiredData = sum(dataVector)) %>%
select(desiredData, desiredProperty1, desiredProperty2)
tempDF
You could use cumsum() to make your groupings and then process based on those like this:
df %>% mutate(Agroups = cumsum(categoriesVector == "A")) %>%
filter(categoriesVector == "B") %>%
group_by(Agroups) %>%
summarise(propertyStart = min(propertyVector),
propertyEnd = max(propertyVector),
dataTotal = sum(dataVector))
# A tibble: 3 x 4
Agroups propertyStart propertyEnd dataTotal
<int> <dbl> <dbl> <dbl>
1 2 3 3 700
2 3 5 7 1200
3 4 9 9 100
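One caveat: min() and max() only coincide with the first and last property here because propertyVector increases within each group; first() and last() state the intent directly (a hedged variant):
df %>%
  mutate(Agroups = cumsum(categoriesVector == "A")) %>%
  filter(categoriesVector == "B") %>%
  group_by(Agroups) %>%
  summarise(propertyStart = first(propertyVector),
            propertyEnd = last(propertyVector),
            dataTotal = sum(dataVector))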
Here is how I'd do it with data.table. First create a spanNumber variable to identify each span of "B" surrounded by "A", then calculate the variables you specified:
library(data.table)
setDT(df)
df[, catShiftConcat := paste0(categoriesVector, shift(categoriesVector, fill = "A"))]
df[categoriesVector == "B", spanNumber := cumsum(catShiftConcat == "BA")]
df[!is.na(spanNumber) , .(desiredData = sum(dataVector),
desiredProperty1 = propertyVector[1],
desiredProperty2 = propertyVector[.N]), by = spanNumber]
## spanNumber desiredData desiredProperty1 desiredProperty2
## 1: 1 700 3 3
## 2: 2 1200 5 7
## 3: 3 100 9 9
An alternative data.table method that uses rleid to group runs of the categories vector is
library(data.table)
setDT(df)[, .(categoriesVector,
desiredData=sum(dataVector),
desiredProperty1=propertyVector[1],
desiredProperty2=propertyVector[.N]),
by=rleid(categoriesVector)
][categoriesVector == "B",][, c("rleid", "categoriesVector") := NULL][]
The contents of the first [] return the desired output, aggregated over runs of the categories vector. The second chained [] subsets the observations, keeping those for which the categories vector is B. The third [] removes the two helper variables, and the final [] is just there to print the result to screen.
This returns
desiredData desiredProperty1 desiredProperty2
1: 700 3 3
2: 1200 5 7
3: 1200 5 7
4: 1200 5 7
5: 100 9 9

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.
Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.
Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20
You might need to clarify how you want the ID column to be handled, but perhaps something like this?
runme <- function(word , dat){
grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
Not sure how robust it is with respect to handling different numbers of repeated columns, but it works for your test data and should work if your columns are repeated an equal number of times.

Grouping and applying function to .SD but one rolling entry

I am wondering if there is an elegant data.table (v1.9.4) way to do the following:
group a DT by two variables and then compute some function on the grouped tables (.SDs) for all entries in .SD but one, with that one entry rolling through .SD and the result being put back in DT. The result is thus (potentially) unique for each entry in the .SDs (and hence DT). You can think of it as computing some value for a peer group of each entry in DT, where the peer group is determined by the two grouping variables (same properties as the entry in DT) but excludes the entry itself.
I accomplished this with loops around a simple := in data.table's j, but was wondering if there is a pure data.table solution. I could imagine something like .SD[i != id , := , by=1:nrow(.SD)] inside DT[] could do the trick but:
Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group
The solution I have is (compute sum() for group determined by b and c except rolling ID):
DT <- data.table(ID = c("a","a","b","b","c","c"),
b = c(1, 2, 1, 2, 1, 2),
c = c("x", "x", "y", "z", "y", "x"),
Var1 = 1:6)
for (id2 in unique(DT$ID)) {
for (b2 in unique(DT$b)) {
c2 <- DT[ID==id2 & b==b2, c]
DT[ID == id2 & b == b2,
Var1_sum := sum(DT[ID != id2 & b == b2 & c == c2, Var1], na.rm=TRUE)]
}
}
DT
ID b c Var1 Var1_sum
1: a 1 x 1 0
2: a 2 x 2 6
3: b 1 y 3 5
4: b 2 z 4 0
5: c 1 y 5 3
6: c 2 x 6 2
Do we need that future feature := in .SD's j for this?
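For the sum() case in the example, it isn't needed: a hedged sketch using the leave-one-out identity (group total minus the entry's own contribution), assuming each ID contributes at most one row per (b, c) group, as in the sample data:
# peer-group sum per (b, c), excluding each row's own Var1
DT[, Var1_sum := sum(Var1, na.rm = TRUE) - Var1, by = .(b, c)]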
