Vectorize for loop over two rows with condition - r

I want to perform some operations on my dataframe, but I am having performance problems, so I was wondering how I could speed up my code.
My data has several columns, and if column X is 0, I want to do some operations on the other columns (adding and max). If X is 1, do nothing (X can only be 1 or 0).
df <- data.frame(X = c(0,0,1,0,1),Y = c(10,0,0,3,7),Z = c(2,2,0,4,5))
df
  X  Y Z
1 0 10 2
2 0  0 2
3 1  0 0
4 0  3 4
5 1  7 5
Right now my code looks like:
for(i in 1:(nrow(df)-1)){
  if(df$X[i] == 0){
    df$Y[i+1] <- df$Y[i] + df$Y[i+1]
    df$Z[i+1] <- max(df$Z[i], df$Z[i+1])
  }
}
The result should look like:
df
  X  Y Z
1 0 10 2
2 0 10 2
3 1 10 2
4 0  3 4
5 1 10 5
Is there a way to write this more efficiently?
Additionally, a lot of the rows contain only 0's, so I was wondering if there is an efficient way to skip the operations for these rows, as the value won't change.
Edit:
As I was a bit unspecific about the rules, here they are in greater detail:
Y should be summed up until there is a 1 again (the sum, including the value of the row containing the 1, should replace that row's value). The same principle applies to the Z variable, but this time with the max() function.
Many thanks!

How about something like this? This reproduces your expected output:
library(dplyr)
df <- data.frame(X = c(0,0,1,0,1), Y = c(10,0,0,3,7), Z = c(2,2,0,4,5))
df %>%
  mutate(
    group = cumsum(c(0, diff(X) == -1))) %>%
  group_by(group) %>%
  mutate(
    n = 1:n(),
    Y = cumsum(Y),
    Z = ifelse(n > 1, max(Z, lead(Z, default = 0)), Z)) %>%
  ungroup() %>%
  select(X, Y, Z)
# # A tibble: 5 x 3
# X Y Z
# <dbl> <dbl> <dbl>
#1 0. 10. 2.
#2 0. 10. 2.
#3 1. 10. 2.
#4 0. 3. 4.
#5 1. 10. 5.
Explanation: Group entries based on 0-series terminated by a 1; replace Y with the cumulative sum of Y; replace Z with the maximum of the entries in that row and the next row, starting from the second row (n > 1).
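To see the grouping step on its own, here is a quick sketch using the X column from the question: diff(X) == -1 flags each 1-to-0 transition, so the cumulative sum starts a new group id right after every terminating 1.
X <- c(0, 0, 1, 0, 1)
cumsum(c(0, diff(X) == -1))
# [1] 0 0 0 1 1
Rows 1-3 form the first 0-series ending in a 1; rows 4-5 form the next.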


Is there a way to count values by presence per rows in R?

I want a way to count values in a dataframe based on their presence by row.
a <- data.frame(c('a','b','c','d','f'),
                c('a','b','a','b','d'))
colnames(a) <- c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've written this code to count a value when its presence in a row is TRUE, but I want it to assign the counts automatically for all the values present in the dataframe:
# For counting the value 'a' and attributing the count to the b dataframe
b <- data.frame(unique(unlist(a)))
b$count <- 0
for(i in 1:nrow(a)){
  if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a')){
    b$count[1] <- b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values of each column separately, unlisting to a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
Output:
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
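For reference, this is the intermediate table object before stack() rearranges it into a data.frame (a quick sketch):
table(unlist(lapply(a, unique)))
#a b c d f
#2 2 1 2 1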
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>%
  pivot_longer(cols = everything()) %>%
  distinct() %>%
  count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d
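One caveat on the pivot_longer approach: distinct() removes duplicate (name, value) pairs, which happens to equal the per-row count for this data. If the same value could repeat within a single column, tagging rows before reshaping keeps the count truly per row. A hedged sketch:
library(dplyr)
library(tidyr)
a %>%
  mutate(row = row_number()) %>%   # tag each row before reshaping
  pivot_longer(cols = -row) %>%
  distinct(row, value) %>%         # count a value at most once per row
  count(value)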

Remove Duplicates from Col X based on condition in Col Y

I have a data frame in R that has duplicates in one of its columns; however, I only want to remove a duplicate based on a condition in another column.
For Example:
DF:
X J Y
1 2 3
2 3 1
1 3 2
I want to remove rows where X is a duplicate and Y = 3.
DF:
X J Y
2 3 1
1 3 2
I have tried reading up on dplyr, but so far I have been unable to get the desired result.
We can create the condition with duplicated and the equality operator:
df1 <- data.frame(X = c(1, 2, 1), J = c(2, 3, 3), Y = c(3, 1, 2))  # the example DF
subset(df1, !((duplicated(X) | duplicated(X, fromLast = TRUE)) & Y == 3))
# X J Y
#2 2 3 1
#3 1 3 2
If we need to remove the whole group of rows for an 'X' whenever any value of 'Y' is 3, then:
library(dplyr)
df1 %>%
  group_by(X) %>%
  filter(! 3 %in% Y)
  # or: filter(all(Y != 3))
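For completeness, a dplyr version of the first, row-wise condition (a sketch, assuming the same df1):
library(dplyr)
df1 %>%
  group_by(X) %>%
  filter(!(n() > 1 & Y == 3)) %>%  # drop rows whose X is duplicated and whose Y is 3
  ungroup()
This keeps the rows (2, 3, 1) and (1, 3, 2), matching the subset() result.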

Take first non-0 value or last 0 value if that's all there is

Hi,
Here is a reproducible example.
HAVE <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                   ABSENCE = c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
                   TIME = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
WANT <- data.frame(ID = c(1,2,3,4,5,6),
                   ABSENCE = c(NA,0,1,0,1,0),
                   TIME = c(NA,3,3,2,2,3))
The tall data frame HAVE is the one I need to convert to WANT. Essentially, for each ID I need to identify the first non-zero value of ABSENCE, and that value goes into WANT. If all values of ABSENCE are NA, then TIME is NA. If all values of ABSENCE are 0, then I report the last possible row (as reflected in the TIME variable).
This is my attempt:
WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))
but I do not know how to take the last of the 0 rows if there are only 0s.
library(data.table)
setDT(HAVE)
res = unique(HAVE[, .(ID)])
# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]
# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]
# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE
ID ABSENCE TIME
1: 1 NA NA
2: 2 0 3
3: 3 1 3
4: 4 0 2
5: 5 1 2
6: 6 0 3
I'm using data.table since the tidyverse does not and never will support sub-assignment / modifying only rows selected by a condition (like the is.na(ABSENCE) here).
If the two rules can be made more consistent with each other, this should be doable in a left join or in a single group_by + slice as the OP attempted, though. Okay, here's one way, though it looks impossible to debug:
library(dplyr)
HAVE %>%
  arrange(ID, -(ABSENCE > 0), TIME*(ABSENCE > 0), -TIME) %>%
  distinct(ID, .keep_all = TRUE)
ID ABSENCE TIME
1 1 NA 3
2 2 0 3
3 3 1 3
4 4 0 2
5 5 1 2
6 6 0 3
(Note that TIME for ID 1 comes out as 3 here rather than the NA in WANT; a final step setting TIME to NA where ABSENCE is NA would be needed to match exactly.)
Using data.table as well, based on subsetting the .I row counter:
WANT <- HAVE[
  HAVE[,
       if (all(is.na(ABSENCE))) .I[1] else
         if (!any(ABSENCE > 0, na.rm = TRUE)) max(.I[ABSENCE == 0], na.rm = TRUE) else
           min(.I[ABSENCE > 0], na.rm = TRUE),
       by = ID
  ]$V1,
]
WANT[is.na(ABSENCE), TIME := NA_integer_]
# ID ABSENCE TIME
#1: 1 NA NA
#2: 2 0 3
#3: 3 1 3
#4: 4 0 2
#5: 5 1 2
#6: 6 0 3
Here are two approaches using dplyr and custom functions. Both rely on the data being sorted by TIME.
Filter Approach
# We'll use this function inside filter() to keep only the desired rows
flag_wanted <- function(absence){
  flags <- rep(FALSE, length(absence))
  if (any(absence > 0, na.rm = TRUE)) {
    # There's a nonzero value somewhere in absence; we want the first one.
    flags[which.max(absence > 0)] <- TRUE
  } else if (any(absence == 0, na.rm = TRUE)) {
    # There's a zero value somewhere in absence; we want the last one.
    flags[max(which(absence == 0))] <- TRUE
  } else {
    # All values are NA; we want the last row.
    flags[length(absence)] <- TRUE
  }
  return(flags)
}
# After filtering, we have to flip TIME to NA if ABSENCE is NA
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  filter(flag_wanted(ABSENCE)) %>%
  mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
  ungroup()
# A tibble: 6 x 3
ID ABSENCE TIME
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 0. 3.
3 3. 1. 3.
4 4. 0. 2.
5 5. 1. 2.
6 6. 0. 3.
The filter() step reduces the dataframe to the rows you need. Since it doesn't modify the TIME values, we need to mutate() as well.
Summarize Approach
# This function captures the general logic of getting the value of one variable
# based on the value of another
get_wanted <- function(of_this, by_this){
  # If there are any positive values of `by_this`, use the first
  if (any(by_this > 0, na.rm = TRUE)) {
    return(of_this[which.max(by_this > 0)])
  }
  # If there are any zero values of `by_this`, use the last
  if (any(by_this == 0, na.rm = TRUE)) {
    return(of_this[max(which(by_this == 0))])
  }
  # Otherwise, use NA
  return(NA)
}
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(TIME = get_wanted(of_this = TIME, by_this = ABSENCE),
            ABSENCE = get_wanted(of_this = ABSENCE, by_this = ABSENCE))
# A tibble: 6 x 3
ID TIME ABSENCE
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 3. 0.
3 3. 3. 1.
4 4. 2. 0.
5 5. 2. 1.
6 6. 3. 0.
The order of summarization matters because we're overwriting variables, so this approach is risky. It only produces the output WANT if you summarize TIME and then ABSENCE.
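A quick sanity check of get_wanted on individual IDs, using the HAVE data from the question:
# ID 2: ABSENCE is all zeros, so we take TIME at the last zero row
with(subset(HAVE, ID == 2), get_wanted(of_this = TIME, by_this = ABSENCE))
# [1] 3
# ID 3: the first nonzero ABSENCE occurs at TIME 3
with(subset(HAVE, ID == 3), get_wanted(of_this = TIME, by_this = ABSENCE))
# [1] 3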

(dplyr) Sum of N values most recent to a date

I'm trying to create a function that sums the closest n values to a given date. So if I had 5 weeks of data, and n=2, the value on week 1 would be the sum of weeks 2&3, the value on week 2 would be the sum of weeks 1&3, etc. Example:
library(dplyr)
library(data.table)
Week <- 1:5
Sales <- c(1, 3, 5, 7, 9)
frame <- data.table(Week, Sales)
The desired result, with the Recent column added:
Week Sales Recent
1: 1 1 8
2: 2 3 6
3: 3 5 10
4: 4 7 14
5: 5 9 12
I want to make a function that does this for me with an input for the most recent n (not just 2), but for now I want to get n = 2 right. Here's my function using lag/lead:
RecentSum <- function(Variable, Lags){
  Sum <- 0
  for(i in 1:(Lags/2)){ # Lags/2 because I want half the values before and half after
    # Check to see if you can go backwards. If not, go forward (i.e. use lead).
    if(is.na(lag(Variable, i))){
      LoopSum <- lead(Variable, i)
    } else {
      LoopSum <- lag(Variable, i)
    }
    Sum <- Sum + LoopSum
  }
  for(i in 1:(Lags/2)){
    # Check to see if you can go forward. If not, go backwards (i.e. use lag).
    if(is.na(lead(Variable, i))){
      LoopSum <- lag(Variable, i)
    } else {
      LoopSum <- lead(Variable, i)
    }
    Sum <- Sum + LoopSum
  }
  Sum
}
When I do RecentSum(frame$Sales, 2) I get [1] 6 10 14 18 NA, which is wrong for a number of reasons:
My if statements only hit on week one, so lag will always be NA and lead will always be non-NA.
I need a way to track whether lag or lead was used the first time. The first value is 6 instead of 8 because the first for-loop sends it to lead(_,1), but then the second for-loop does the same. I can't think of how to make my second for-loop recognize this.
Is there a function or library (Zoo?) that makes this task easy? I'd like to get my own function to work for the sake of practice/understanding, but at this point I'd rather just get it done.
Thanks!
To elaborate on my comment, lead and lag are vectorized functions meant to be used inside dplyr verbs such as mutate. Here is a way to do it within dplyr without writing a separate function:
df <- tibble(week = Week, sales = Sales)
df %>%
  mutate(recent = case_when(is.na(lag(sales)) ~ lead(sales, n = 1) + lead(sales, n = 2),
                            is.na(lead(sales)) ~ lag(sales, n = 1) + lag(sales, n = 2),
                            TRUE ~ lag(sales) + lead(sales)))
That gives you this:
# A tibble: 5 x 3
week sales recent
<int> <dbl> <dbl>
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
1) Assuming that k is even, define to as a vector of indices such that, for each element of to, we sum the k+1 elements of Sales that end at that index and then subtract Sales:
k <- 2 # number of elements to sum
n <- nrow(frame)
to <- pmax(k+1, pmin(1:n + k/2, n))
Sum <- function(to, Sales) sum(Sales[seq(to = to, length = k+1)])
frame %>% mutate(recent = sapply(to, Sum, Sales) - Sales)
giving:
Week Sales recent
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
Note that by replacing the last line of code above with the following line the solution can be done entirely in base R:
transform(frame, recent = sapply(to, Sum, Sales) - Sales)
2) This concatenates the appropriate elements before and after the Sales series so that an ordinary rolling sum gives the result.
library(zoo)
ix <- c(seq(to = k+1, length = k/2), 1:n, seq(to = n-k, length = k/2))
frame %>% mutate(recent = rollsum(Sales[ix], k+1) - Sales)
Note that if k=2 it reduces to this one-liner:
frame %>% mutate(recent = rollsum(Sales[c(3, 1:n(), n()-2)], 3) - Sales)
giving:
Week Sales recent
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
Update: fixed for k > 2
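Since the question asks for a function that takes n as an input, approach (1) can be wrapped up as follows. This is a sketch, assuming k is even and k < nrow(frame); recent_sum is a name introduced here for illustration.
recent_sum <- function(Sales, k) {
  n  <- length(Sales)
  # end index of each (k+1)-wide window, clamped to the series bounds
  to <- pmax(k + 1, pmin(seq_len(n) + k/2, n))
  sums <- sapply(to, function(j) sum(Sales[seq(to = j, length = k + 1)]))
  sums - Sales  # drop the row's own value from its window sum
}
recent_sum(frame$Sales, 2)
# [1]  8  6 10 14 12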

Binning data by row values with minimum sample size

I'm trying to figure out how to create bins with a minimum sample size that also account for the values in a specific column.
So, in the dummy data below, I want to create bins with a minimum of 6 samples each, but if a bin includes a row with a particular value in column a, that bin should also include all other rows with that same value. I also do not want any bin to contain only 1 unique value of column a. The output should then have, per bin, the mean of the unique values in column a, the mean of all values in column b, and a column with the sample size.
df <- data.frame(a = c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
                 b = c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,11,12,13,14,16,14,13,15,13,15))
I want the output to look something like this:
mean.a mean.b n
1 2.0 12.33333 9
2 5.0 12.83333 6
3 7.0 12.66667 6
4 8.5 13.28571 7
This is what I have so far:
x <- df
final <- NULL
for(i in 1:16){
  x1 <- x[1:6,]
  x2 <- x[-c(1:6),]
  x3 <- rbind(x1, x2[x2$a == x1$a[6],])
  n <- nrow(x3)
  y <- mean(x3$b)
  z <- mean(unique(x3$a))
  f <- data.frame(mean.a = z, mean.b = y, n = n)
  final <- rbind(final, f)
  x <- x[-c(1:n),]
}
final <- final[complete.cases(final),]
The problem I'm having is that I can't figure out how to avoid a bin with only one unique value in column a. For example, in the third bin all 6 rows have a = 7, but I would like to add the next sequential row, plus all rows sharing that row's value in column a, to that bin (which here means all rows with a = 8).
Also, I can't figure out how to let the loop keep running without hardcoding 1:number at the top and then deleting the rows with NAs afterwards. This isn't a huge deal, but it's why the code is kind of messy.
I'm not attached to this loop by any means, and if there's a simpler way to answer this question, I'm all for it!
Here is a recursive solution to the problem, where get_6 returns a group variable based on column a. The conditions are checked in the inner get_i function: starting from index 6, it moves forward until the next value differs from the current one and the values seen so far are not all identical. Each time a run satisfying the condition is found, the id is incremented, so the result is similar to what the rleid function from data.table would give. From there, summary statistics can be calculated based on this group variable:
get_6 <- function(vec, id = 1) {
  if(length(vec) < 6) NULL
  else {
    get_i <- function(x, i = 6) {
      if(length(x) == i) i
      else if(x[i + 1] != x[i] && length(unique(x[1:i])) != 1) i
      else get_i(x, i + 1)
    }
    ind <- get_i(vec)
    c(rep(id, ind), get_6(vec[-(1:ind)], id + 1))
  }
}
s <- get_6(df$a)
s
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
library(dplyr)
df[1:length(s), ] %>%
  mutate(g = s) %>%
  group_by(g) %>%
  summarize(n = n(), mean.a = mean(unique(a)), mean.b = mean(b))
#Source: local data frame [4 x 4]
# g n mean.a mean.b
# <dbl> <int> <dbl> <dbl>
#1 1 9 2.0 12.33333
#2 2 6 5.0 12.83333
#3 3 9 7.5 12.44444
#4 4 7 9.5 14.28571
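One design note: get_i recurses once per row it scans, so a very long vector could hit R's nesting limit. An iterative version of the same rule (a sketch; get_6_iter is a name introduced here) avoids that:
get_6_iter <- function(vec) {
  id <- integer(0); g <- 1L; start <- 1L
  while (length(vec) - start + 1L >= 6L) {
    i <- start + 5L  # every bin starts with 6 rows
    # grow the bin while the next value matches the current one
    # or the bin still holds a single unique value
    while (i < length(vec) &&
           (vec[i + 1L] == vec[i] || length(unique(vec[start:i])) == 1L)) {
      i <- i + 1L
    }
    id <- c(id, rep(g, i - start + 1L))
    g <- g + 1L
    start <- i + 1L
  }
  id
}
get_6_iter(df$a)
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4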
