I have a dataset that includes a vote result r for each voter v on a particular decision d. My data thus looks like:
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)
Not every voter votes in every election. What I want to do is see if other voters make the same call as a particular voter (let's say v == 8). Normally I would just use dplyr:
df %>% group_by(d) %>% mutate(like8 = ifelse(r == r[v == 8], 1, 0))
The problem is that this particular voter (v == 8) doesn't have a recorded vote for every decision (which is distinct from abstaining, which is recorded). Because of this I get the following error:
Error in mutate_impl(.data, dots) :
Column like8 must be length 3 (the group size) or one, not 0
What I've done so far is to write up a combination of ifelse and looping in order to get around this issue.
df$like8 <- NA  # initialize the column
for (i in unique(df$d)) {
  # restrict lookups to the current decision (d == i); note that assignments
  # inside with() would not modify df, so subset df directly
  if (8 %in% df$v[df$d == i]) {
    for (j in df$r[df$d == i]) {
      df$like8[df$d == i & df$r == j] <- ifelse(j == df$r[df$d == i & df$v == 8], 1, 0)
    }
  } else {
    df$like8[df$d == i] <- NA
  }
}
--note: I've never been formally instructed in 'good' programming conventions, so my bracket placement is probably unclear and open to suggestions.
The problem I have is that my actual dataset has over 500,000 observations, and this is extremely slow. I've seen solutions here using data.table for the case where the value isn't missing, but I don't understand data.table well enough to know how to adapt it to my case.
Try this:
df %>%
  group_by(d) %>%
  mutate(
    like8 = {
      if (sum(v == 8) > 0) as.numeric(r == r[v == 8])
      else NA
    }
  )
It wraps the test in an if/else statement that checks whether voter 8 is present in the group. The as.numeric() call is equivalent to your ifelse(), but should be faster when the response you want is 1/0.
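A quick way to see the difference (a minimal sketch; the timing pattern, not the exact numbers, is the point):
library(microbenchmark)
x <- sample(c(TRUE, FALSE), 1e6, replace = TRUE)
microbenchmark(
  ifelse(x, 1, 0), # evaluates and subsets both branches
  as.numeric(x)    # a single type coercion
)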
The expected output is not entirely clear. If we follow the methodology in @Melissa Key's tidyverse answer, the analogous approach in data.table (which the OP mentioned in the post) would be
library(data.table)
setDT(df)[, like8 := if(8 %in% v) +(r == r[v == 8]) else NA_integer_, by = d]
df
# d v r like8
# 1: 1 6 y 0
# 2: 1 7 y 0
# 3: 1 8 n 1
# 4: 1 9 n 1
# 5: 2 6 n 1
# 6: 2 7 n 1
# 7: 2 8 n 1
# 8: 2 9 n 1
# 9: 3 6 y NA
#10: 3 7 y NA
#11: 3 9 y NA
#12: 4 6 y 0
#13: 4 7 y 0
#14: 4 8 a 1
#15: 4 9 y 0
Or we can avoid the if/else by splitting the work into two steps and assigning only to the groups that satisfy the condition (8 %in% v):
i1 <- setDT(df)[, .I[8 %in% v], by = d]$V1
df[i1, like8 := +(r == r[v==8]), by = d]
The other values in 'like8' will be filled with NA by default.
data
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)
Another solution using 2 joins:
#initialize column
DT1[, like8 := NA_integer_][
  #set to 0 if voter 8 voted on decision
  DT1[v==8L], like8 := 0L, on=.(d)][
  #set to 1 if other voters voted the same in a particular decision
  DT1[v==8L], like8 := 1L, on=.(d, r)]
data:
library(data.table)
library(microbenchmark)
#generate dummy data
set.seed(0L)
numD <- 100L
numV <- 1e4L
DT <- unique(data.table(d=sample(numD, numD*numV, replace=TRUE),
v=sample(numV, numD*numV, replace=TRUE)))
DT[, r:=sample(c('y','n','a'), .N, replace=TRUE)]
setorder(DT, d, v, r)
#set key to speed up the subsetting to voter
setkey(DT, d, v)
DT1 <- copy(DT)
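Since microbenchmark is loaded above, the grouped if/else approach and the two-join approach can be timed against each other along these lines (a sketch; := updates the tables in place, so like8 is simply overwritten on repeated runs):
microbenchmark(
  grouped = DT[, like8 := if (8 %in% v) +(r == r[v == 8]) else NA_integer_, by = d],
  two_joins = DT1[, like8 := NA_integer_][
    DT1[v == 8L], like8 := 0L, on = .(d)][
    DT1[v == 8L], like8 := 1L, on = .(d, r)],
  times = 10L
)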
Related
I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id x y z
1: 1 109.3400 208.6732 308.7595
2: 2 101.6920 201.0989 310.1080
3: 3 119.4697 217.8550 313.9384
4: 4 111.4261 205.2945 317.3651
5: 5 100.4024 212.2826 305.1375
6: 6 114.4711 203.6988 319.4913
in several stages. I am aware that I could apply subset(.) sequentially to achieve this.
> s <- subset(dat, x>119)
> s <- subset(s, y>219)
> subset(s, z>315)
id x y z
1: 55 119.2634 219.0044 315.6556
My problem is that I need to automate this and it might happen that the subset is empty. In this case, I would want to skip the step(s) that result in an empty set. For example, if my data was
dat2 <- dat[1:50]
> s <-subset(dat2,x>119)
> s
id x y z
1: 3 119.4697 217.8550 313.9384
2: 50 119.2519 214.2517 318.8567
the second step subset(s, y>219) would come up empty but I would still want to apply the third step subset(s,z>315). Is there a way to apply a subset-command only if it results in a non-empty set? I imagine something like subset(s, y>219, nonzero=TRUE). I would want to avoid constructions like
s <- dat
if(nrow(subset(s, x>119))>0){s <- subset(s, x>119)}
if(nrow(subset(s, y>219))>0){s <- subset(s, y>219)}
if(nrow(subset(s, z>315))>0){s <- subset(s, z>315)}
because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.). That's why I am hoping to find a solution optimized for speed.
PS. I only chose subset(.) for clarity, solutions with e.g. data.table would be just as welcome if not more so.
I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):
f = function(x, ..., verbose = FALSE){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose = v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}
Usage
> f(dat, x > 119, y > 219, y > 1e6)
cond skip
1: x > 119 FALSE
2: y > 219 FALSE
3: y > 1e+06 TRUE
id x y z
1: 55 119.2634 219.0044 315.6556
The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I can see that the index is used.
Regarding "because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)":
If it's for non-interactive use, maybe better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. Also, the verbose console output could be captured and returned.
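A minimal sketch of that variant (same body as f above, with only the ending changed; f2 is a hypothetical name):
f2 = function(x, ..., verbose = FALSE){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval(substitute(x[cond, verbose = v], list(cond = L[[i]], v = verbose)))
    if (nrow(d)) x = d else mon[i, skip := TRUE]
  }
  # return the skip log alongside the result instead of printing it
  list(mon = mon, x = x)
}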
An interesting approach can be built on a modified version of the filter function offered in dplyr. When a condition is not met, the non_empty_filter function below returns the original data set.
Notes
IMHO, this is fairly non-standard behaviour and should be reported via a warning. Of course, the warning can be removed and has no bearing on the function's results.
Function
library(tidyverse)
library(rlang) # enquo
non_empty_filter <- function(df, expr) {
  expr <- enquo(expr)
  res <- df %>% filter(!!expr)
  if (nrow(res) > 0) {
    return(res)
  } else {
    # Indicate that the filter was not applied
    warning("No rows meeting condition")
    return(df)
  }
}
Condition met
Behaviour: Returning one row for which the condition is met.
dat %>%
non_empty_filter(x > 119 & y > 219)
Results
# id x y z
# 1 55 119.2634 219.0044 315.6556
Condition not met
Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.
dat %>%
non_empty_filter(x > 119 & y > 219 & y > 1e6)
Results
# id x y z
# 1: 1 109.3400 208.6732 308.7595
# 2: 2 101.6920 201.0989 310.1080
# 3: 3 119.4697 217.8550 313.9384
# 4: 4 111.4261 205.2945 317.3651
# 5: 5 100.4024 212.2826 305.1375
# 6: 6 114.4711 203.6988 319.4913
# 7: 7 112.1879 209.5716 319.6732
# 8: 8 106.1344 202.2453 312.9427
# 9: 9 101.2702 210.5923 309.2864
# 10: 10 106.1071 211.8266 301.0645
Condition met/not met one-by-one
Behaviour: Skipping filter that would return an empty data set.
dat %>%
non_empty_filter(y > 1e6) %>%
non_empty_filter(x > 119) %>%
non_empty_filter(y > 219)
Results
# id x y z
# 1 55 119.2634 219.0044 315.6556
I would like to join the two data frames:
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end), in order to get a result like this:
# x y
#1 1 a
#2 3 <NA>
#3 5 b
I don't want to build a potentially large cartesian product and then keep only the few rows matching the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the 'fuzzyjoin' package, but I cannot find examples fitting my need: the function to apply for the condition takes only two arguments. I also tried to put 'start' and 'end' into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine, fuzzy_left_join doesn't accept it.
I'm looking for solutions that work in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric, with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x == start+1) | (x == end+1), which here would give:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
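To get just the two columns from the expected output, drop the bound columns afterwards (assuming dplyr is loaded for select()):
library(dplyr)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`)) %>%
  select(x, y)
# x y
# 1 1 a
# 2 3 <NA>
# 3 5 b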
I eventually went into the code of fuzzy_join and found a way to do what I want, even without proper documentation. fuzzy_left_join doesn't work here, but there is the following way (not really pretty, and it actually does a cartesian product):
g <- function(x, y) (x > y[,"start"]) & (x < y[,"end"])
fuzzy_join(a, b, multi_by = list(x = "x", y = c("start","end")),
           multi_match_fun = g, mode = "left") %>% select(x, y)
A data.table approach could be
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
Update: In case you want to join both data frames on the (x == start+1) | (x == end+1) condition, you can try
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer to illustrate what I am trying to do: extending dplyr in some way. I will be happy to learn of ways to improve this solution, or of problems I didn't see.
The solution avoids the cartesian product, but it splits both one of the input data frames and the result into lists of data frames. I didn't include the final selection of columns x and y, which is easy to code.
my_left_join <- function(.DATA1, .DATA2, .WHERE) {
  call <- as.list(match.call())
  df1 <- .DATA1
  df1$._row_ <- 1:nrow(df1)
  dfl1 <- replyr::replyr_split(df1, "._row_")
  eval(substitute(
    dfl2 <- mapply(function(.x) {
      filter(.DATA2, with(.x, WHERE)) %>%
        mutate(._row_ = .x$._row_)
    }, dfl1, SIMPLIFY = FALSE),
    list(WHERE = call$.WHERE)))
  df2 <- replyr::replyr_bind_rows(dfl2)
  left_join(df1, df2, by = "._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
x y
1 1 a
2 3 <NA>
3 5 b
I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,1,2,2,2,1,1))
I want a result
df_result <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,0,2,2,2,1,0))
The data frame is time-ordered, so I need to keep the same order after replacing the values. I guess the nth() or duplicated() functions could be useful here, but I'm not sure how to use them. What I'm missing is a function that would count the number of consecutive occurrences of a given value. Once I have it, I could use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
EDIT:
After an answer by @Gregor, I wrote the following function, which solves the problem:
library(data.table)
library(dplyr)

replace_nth <- function(x, nth, code) {
  y <- data.table(x)
  y <- y[, code_rleid := rleid(y$Code)]
  y <- y[, seq := seq_along(Code), by = code_rleid]
  y <- y[seq == nth & Code == code, Code := 0]
  drop.cols <- c("code_rleid", "seq")
  y %>% select(-one_of(drop.cols)) %>% data.frame() %>% return()
}
To get the solution, simply run replace_nth(df, 2, 1)
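Per the df_result shown above, this returns:
replace_nth(df, 2, 1)
#   Values Code
# 1      1    1
# 2      4    0
# 3      5    2
# 4      6    2
# 5      3    2
# 6      3    1
# 7      2    0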
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(df$Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after), as sketched below. I'll leave it spelled out here and let you make modifications as you like.
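For instance, a combined one-liner might look like this (a sketch of the same logic, using rleid() directly in by and dropping the helper column at the end):
setDT(df)[, seq := seq_along(Code), by = rleid(Code)][
  seq == 2 & Code == 1, Code := 0][, seq := NULL]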
Description
The ifelse() function allows you to filter the values in a vector through a series of tests, each of them producing a different action in case of a positive result. For instance, let xx be a data.frame, as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx
a b
1 1
2 2
1 3
3 4
Suppose that you want to create a new column, c, from column b, but depending on the values in column a in the following way:
For each row,
if the value in column a is 1, the value in column c is the value in column b.
if the value in column a is 2, the value in column c is 100 times the value in column b.
in any other case, the value in column c is the negative of the value in column b.
Using ifelse(), a solution could be:
xx$c <- ifelse(xx$a==1, xx$b,
          ifelse(xx$a==2, xx$b*100,
            -xx$b))
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
Problem 1
An aesthetic problem arises when the number of tests increases, say, four tests:
xx$c <- ifelse(xx$a==1, xx$b,
          ifelse(xx$a==2, xx$b*100,
            ifelse(xx$a==3, ...,
              ifelse(xx$a==4, ...,
                ...))))
I found a partial solution to the problem on this page, which consists of defining the functions if.else_(), i_(), and e_(), as follows:
library(lazyeval)

i_ <- function(if_stat, then) {
  if_stat <- lazyeval::expr_text(if_stat)
  then <- lazyeval::expr_text(then)
  sprintf("ifelse(%s, %s, ", if_stat, then)
}

e_ <- function(else_ret) {
  else_ret <- lazyeval::expr_text(else_ret)
  else_ret
}

if.else_ <- function(...) {
  args <- list(...)
  for (i in 1:(length(args) - 1)) {
    if (substr(args[[i]], 1, 6) != "ifelse") {
      stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
    }
  }
  if (substr(args[[length(args)]], 1, 6) == "ifelse") {
    stop("Last argument needs to be an else_ function.", call. = FALSE)
  }
  args$final <- paste(rep(')', length(args) - 1), collapse = '')
  eval_string <- do.call('paste', args)
  eval(parse(text = eval_string))
}
In this way, the problem given in the Description, can be rewritten as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx$c <- if.else_(
  i_(xx$a==1, xx$b),
  i_(xx$a==2, xx$b*100),
  e_(-xx$b)
)
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
And the code for the four tests will simply be:
xx$c <- if.else_(
  i_(xx$a==1, xx$b),
  i_(xx$a==2, xx$b*100),
  i_(xx$a==3, ...), # dots meaning actions for xx$a==3
  i_(xx$a==4, ...), # dots meaning actions for xx$a==4
  e_(...)           # dots meaning actions for any other case
)
Problem 2 & Question
The given code apparently solves the problem. Then, I wrote the following test function:
test.ie <- function() {
  dd <- data.frame(a=c(1,2,1,3), b=1:4)
  if.else_(
    i_(dd$a==1, dd$b),
    i_(dd$a==2, dd$b*100),
    e_(-dd$b)
  ) # it should give c(1, 200, 3, -4)
}
When I tried the test:
test.ie()
it spit the following error message:
Error in ifelse(dd$a == 1, dd$b, ifelse(dd$a == 2, dd$b * 100, -dd$b)) :
object 'dd' not found
Question
Since the if.else_() syntactic constructor is not supposed to run only from the console, is there a way for it to 'know' the variables from the function that calls it?
Note
In "Best way to replace a lengthy ifelse structure in R", a similar problem was posted. However, the given solution there focuses on building the table's new column with the given constant output values (the "then" or "else" slots of the ifelse() function), whereas my case addresses a syntactic problem in which the "then" or "else" slots can even be expressions in terms of other data.frame elements or variables.
I think you can use dplyr::case_when inside dplyr::mutate to achieve this.
library(dplyr)
df <- tibble(a=c(1,2,1,3), b=1:4)
df %>%
  mutate(
    foo = case_when(
      .$a == 1 ~ .$b,
      .$a == 2 ~ .$b * 100L,
      TRUE ~ .$b * -1L
    )
  )
#> # A tibble: 4 x 3
#> a b foo
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 2 2 200
#> 3 1 3 3
#> 4 3 4 -4
In the upcoming release of dplyr 0.6.0 you won't need the awkward work-around of .$, and you can just use:
df %>%
  mutate(
    foo = case_when(
      a == 1 ~ b,
      a == 2 ~ b * 100L,
      TRUE ~ b * -1L
    )
  )
Taking into account MrFlick's advice, I re-coded the if.else_() function as follows:
if.else_ <- function(...) {
  args <- list(...)
  for (i in 1:(length(args) - 1)) {
    if (substr(args[[i]], 1, 6) != "ifelse") {
      stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
    }
  }
  if (substr(args[[length(args)]], 1, 6) == "ifelse") {
    stop("Last argument needs to be an else_ function.", call. = FALSE)
  }
  args$final <- paste(rep(')', length(args) - 1), collapse = '')
  eval_string <- do.call('paste', args)
  # evaluate in the caller's frame so that variables local to the calling
  # function (e.g. 'dd' inside test.ie) are visible
  eval(parse(text = eval_string), envir = parent.frame())
}
Now the test.ie() function runs properly
test.ie()
[1] 1 200 3 -4
With full respect to the OP's remarkable effort to improve nested ifelse(), I prefer a different approach which I believe is easy to write, concise, maintainable and fast:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
library(data.table)
# coerce to data.table, and set the default first
setDT(xx)[, c:= -b]
xx[a == 1L, c := b] # 1st special case
xx[a == 2L, c := 100L*b] # 2nd special case, note use of integer 100L
# xx[a == 3L, c := ...] # other cases
# xx[a == 4L, c := ...]
#...
xx
# a b c
#1: 1 1 1
#2: 2 2 200
#3: 1 3 3
#4: 3 4 -4
Note that for the 2nd special case b is multiplied by the integer constant 100L to make sure that the right hand sides are all of type integer in order to avoid type conversion to double.
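A quick check of the integer-versus-double point:
typeof(1L * 100L) # "integer"
typeof(1L * 100)  # "double" - assigning this would coerce column c to double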
Edit 2: This can also be written in an even more concise (but still maintainable) way as a one-liner:
setDT(xx)[, c := -b][a == 1L, c := b][a == 2L, c := 100L*b][]
data.table chaining works here, because c is updated in place so that subsequent expressions are acting on all rows of xx even if the previous expression was a selective update of a subset of rows.
Edit 1: This approach can be implemented with base R as well:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
xx$c <- -xx$b
idx <- xx$a == 1L; xx$c[idx] <- xx$b[idx]
idx <- xx$a == 2L; xx$c[idx] <- 100 * xx$b[idx]
xx
# a b c
#1 1 1 1
#2 2 2 200
#3 1 3 3
#4 3 4 -4
I was working on a data frame with 200,000+ rows and many columns. Let's take df, a dummy sample version, built as follows:
set.seed(1)
timeslot <- as.integer(abs(runif(10000, min = 1, max = 1000)))
ID <- LETTERS[abs(as.integer(rnorm(10000, 2)**3)) %% 9 + 1]
variable1 <- as.integer(rnorm(10000, 2)**3)
df <- data.frame(timeslot, ID, variable1)
df <- df[order(df$timeslot, df$ID), ]
I also calculate a column to check if the ID of that row is also present somewhere in the previous timeslot, called min1:
df$min1 <- sapply(seq(nrow(df)), function(x) {
  if (df[x, "timeslot"] == 1) {0} else {
    max(df[x, "ID"] %in% df[df$timeslot == df[x, "timeslot"] - 1, "ID"])
  }
})
This all goes quite well and delivers the following head(df)/tail(df):
timeslot ID variable1 min1
4919 1 A 15 0
2329 1 C 48 0
7359 1 C 1 0
1978 1 E 6 0
2883 1 F 7 0
7448 1 F 21 0
-------------------------------
8462 998 F 1 1
1724 998 H 2 0
989 999 A 7 1
2589 999 D 12 1
3473 999 D 0 1
780 999 I 5 0
I want to perform some calculations on variable1, grouped by unique timeslot+ID. One of these calculations is funfac:
total <- 0
# power-mean-style aggregate: (|sum(x^y)| / length(x))^(1/y); reads the global 'total'
funfac <- function(x, y) {
  for (i in x) { i <- i ** y; total <- total + i }
  return((abs(total / length(x))) ** (1 / y))
}
However, now comes the difficult part: per ID in a specific timeslot, I want to do a calculation over all same IDs in that timeslot and the previous timeslot. So if in timeslot '2' there are 3x D, and in timeslot '1' there are 2x D, the calculation should be done over all 5 Ds. My column min1 helps identify whether that ID is present in the previous timeslot. If not, the calculation should return NA.
First I did this with the following code:
lp5 <- c()
for (j in 1:nrow(df)) {
  if (df[j, "min1"] == 0) {
    lp5 <- c(lp5, NA)
  } else {
    total <- 0
    x <- df[which((df[, "timeslot"] == df[j, "timeslot"] |
                   df[, "timeslot"] == (df[j, "timeslot"] - 1)) &
                  df[, "ID"] == df[j, "ID"]), "variable1"]
    for (i in x) {
      i <- i ** 5
      total <- total + i
    }
    lp5 <- c(lp5, (abs(total / length(x))) ** (1/5))
  }
}
tempdf <- data.frame(df[, "timeslot"], df[, "ID"], lp5)
lp5 <- tempdf[!duplicated(tempdf[, 1:2]), ][, 3]
Figuring that I was doing many calculations twice, I thought: why not check whether the calculation has already been done? I did so by adding each unique timeslot+ID to a data frame, together with the calculated value, and each time checking whether the combination was already in that data frame.
lp5DF <- data.frame("timeslot" = numeric(0), "ID" = character(0), "lp5" = numeric(0))
for (j in 1:nrow(df)) {
  if (duplicated(rbind(lp5DF[, 1:2],
                       data.frame(timeslot = df[j, "timeslot"], ID = df[j, "ID"])))[nrow(lp5DF) + 1]) {
    next
  } else if (df[j, "min1"] == 0) {
    lp5DF <- rbind(lp5DF, data.frame("timeslot" = df[j, "timeslot"], "ID" = df[j, "ID"], "lp5" = NA))
  } else {
    total <- 0
    x <- df[which((df[, "timeslot"] == df[j, "timeslot"] |
                   df[, "timeslot"] == (df[j, "timeslot"] - 1)) &
                  df[, "ID"] == df[j, "ID"]), "variable1"]
    for (i in x) {
      i <- i ** 5
      total <- total + i
    }
    lp5DF <- rbind(lp5DF, data.frame("timeslot" = df[j, "timeslot"], "ID" = df[j, "ID"],
                                     "lp5" = (abs(total / length(x))) ** (1/5)))
  }
}
The output (head/tail) of lp5DF will be:
timeslot ID lp5
1 1 A NA
2 1 B NA
3 1 C NA
4 1 D NA
5 1 E NA
6 1 F NA
-------------------------
7738 999 B 14.83423
7739 999 C 14.80149
7740 999 E NA
7741 999 F 49.48538
7742 999 G 23.05222
7743 999 H NA
and lp5DF[,3] == lp5 confirms that both approaches give the same result.
However, checking this turned out to be a lot slower (6.5x in my case). Since I have to run this kind of calculation multiple times over a lot of rows (the data frame may be expanded later in the project), both of my approaches are too slow. Why is the second one so slow, and is there a way to speed this up? Maybe something with lapply or the dplyr package?
There is just a lot to optimize here. Try learning data manipulation packages like dplyr and data.table.
min1 can be calculated using the technique from here
library(dplyr)
# split the IDs by timeslot; an ID gets min1 = TRUE when it also appears
# in the lagged (previous) timeslot's ID vector
dfs <- split(df$ID, df$timeslot)
df$min1 <- unlist(mapply(`%in%`, dfs, lag(dfs)))
lp5 is a little tricky, but manageable:
df1 <- df %>%
  group_by(timeslot, ID) %>%
  summarise(min1 = all(min1), s = sum(variable1^5), n = n()) %>%
  group_by(ID) %>%
  mutate(s1 = s + lag(s), n1 = n + lag(n), lp5 = ifelse(min1, abs((s1/n1)^(1/5)), NA))
lp5 <- df1$lp5
The data.table equivalent is
library(data.table)
setDT(df)
dt1 <- df[, .(min1 = all(min1), s = sum(variable1^5), n = .N), by=.(timeslot, ID)]
dt1[, `:=`(s1 = s + shift(s), n1 = n + shift(n)), by=ID]
dt1[min1==TRUE, lp5 := abs((s1/n1)^(1/5)), by=ID]
lp5 <- dt1$lp5
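If you need the value back on every row of df (rather than one row per timeslot+ID group), an update join can map the group results back, e.g. this sketch (df was converted with setDT above):
df[dt1, lp5 := i.lp5, on = .(timeslot, ID)]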