How to perform the equivalent of Excel rolling sumifs in dplyr? - r

In the below reproducible code, I would like to add a column for SumIfs using dplyr as detailed in the below image, whereby the Excel sumifs() formula in column H of the image has conditions with the tops of the specified ranges "anchored", for a "rolling" calculation as you move down row-wise. Any recommendations for how to do the same in dplyr? I'm sure it requires grouping but unsure of how to handle conditions. The blue below shows the current reproducible code output, the yellow shows what I would like to add, and the non-highlighted shows the underlying XLS formulas.
Now using my words: to derive Sumifs, for each row one-at-a-time rolling from top-to-bottom of the array sequentially, sum all values in column D from the top of the column D range to the current row in the Column D range that have a column C "Code1" value less than the current row column C "Code1" value. So for example in deriving the value of 3 in cell G6: add the 1 in cell D3 (because its Code1 of 0 (cell C3) is < Code1 of 3 (cell C6)) to the 2 in cell D5 (because its Code1 of 1 (cell C5) is < Code1 of 3 (cell C6)).
Reproducible code:
library(dplyr)
myData <-
data.frame(
Name = c("B","R","R","R","R","B","A","A","A"),
Group = c(0,1,1,2,2,0,0,0,0),
Code1 = c(0,1,1,3,3,4,-1,0,0),
Code2 = c(1,0,2,0,1,2,1,0,0)
)
CountIfs <- function(x,y) {
out <- integer(length(x))
for(i in seq_along(x)) {
cond1 <- y[1:i] > 0
cond2 <- x[1:i] == x[i]
out[i] <- sum(cond1*cond2)
}
out
}
myDataRender <-
myData %>%
mutate(CountIfs = CountIfs(Code1, Code2))
print.data.frame(myDataRender)
Adapting Tsai's solution for situations where both the top and the bottom of the Excel SUMIFS() ranges are anchored (fixed, not rolling), where the first formula in the image would be =SUMIFS(D$3:D$11,C$3:C$11,"<"&C3), for those of us transitioning from Excel to R:
myData %>% mutate(SumIfs = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))
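To make the difference between the rolling and the fully anchored ranges concrete, here is a minimal base R sketch (no dplyr) that computes both on the sample data. The helper names rolling_sumifs and anchored_sumifs are made up for this illustration.

```r
myData <- data.frame(
  Name  = c("B","R","R","R","R","B","A","A","A"),
  Group = c(0,1,1,2,2,0,0,0,0),
  Code1 = c(0,1,1,3,3,4,-1,0,0),
  Code2 = c(1,0,2,0,1,2,1,0,0)
)

# Rolling: the range grows from row 1 down to the current row i
rolling_sumifs <- function(crit, vals) {
  sapply(seq_along(crit), function(i) {
    rows <- seq_len(i)
    sum(vals[rows][crit[rows] < crit[i]])
  })
}

# Anchored: the range always covers every row
anchored_sumifs <- function(crit, vals) {
  sapply(seq_along(crit), function(i) sum(vals[crit < crit[i]]))
}

myData$SumIfsRolling  <- rolling_sumifs(myData$Code1, myData$Code2)
myData$SumIfsAnchored <- anchored_sumifs(myData$Code1, myData$Code2)
print(myData)
```

The anchored version also picks up matching rows below the current one, which is why, for example, the first row changes from 0 (rolling) to 1 (anchored).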

You could use map() or imap() from purrr:
library(dplyr)
library(purrr)
# (1)
myData %>%
mutate(SumIfs = map_dbl(1:n(), ~ sum(Code2[1:.x][Code1[1:.x] < Code1[.x]])))
# (2)
myData %>%
mutate(SumIfs = imap_dbl(Code1, ~ sum(Code2[1:.y][Code1[1:.y] < .x])))
# Name Group Code1 Code2 SumIfs
# 1 B 0 0 1 0
# 2 R 1 1 0 1
# 3 R 1 1 2 1
# 4 R 2 3 0 3
# 5 R 2 3 1 3
# 6 B 0 4 2 4
# 7 A 0 -1 1 0
# 8 A 0 0 0 1
# 9 A 0 0 0 1
If you don't want to rely on purrr, the map() solution can be adapted directly for the base sapply() version:
myData %>%
mutate(SumIfs = sapply(1:n(), \(x) sum(Code2[1:x][Code1[1:x] < Code1[x]])))

Here is another way using map2_dbl() with the row number.
library(dplyr)
library(purrr)
myData %>%
mutate(SumIfs = map2_dbl(Code1, row_number(),
~ sum(if_else(Code1 < .x & row_number() <= .y, Code2, 0))))
Also using base Map(), this will scale to as many criteria as you want.
library(dplyr)
myData %>%
mutate(SumIfs = unlist(Map(\(x, y) sum(if_else(Code1 < x & row_number() <= y, Code2, 0)),
Code1, row_number())))
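As a sketch of how the Map() idea extends to more criteria, the following base R version adds a second, purely illustrative condition (same Group as the current row) on top of the rolling Code1 comparison. The extra criterion and the column name SumIfs2 are assumptions for demonstration and not part of the original question.

```r
myData <- data.frame(
  Name  = c("B","R","R","R","R","B","A","A","A"),
  Group = c(0,1,1,2,2,0,0,0,0),
  Code1 = c(0,1,1,3,3,4,-1,0,0),
  Code2 = c(1,0,2,0,1,2,1,0,0)
)
idx <- seq_len(nrow(myData))

# One argument per criterion:
# x = current Code1, g = current Group, i = current row number
myData$SumIfs2 <- unlist(Map(
  function(x, g, i) {
    keep <- myData$Code1 < x & myData$Group == g & idx <= i
    sum(myData$Code2[keep])
  },
  myData$Code1, myData$Group, idx
))
print(myData)
```

Each further criterion is one more argument to the function and one more vector passed to Map().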

Related

Looping over multiple columns to generate a new variable based on a condition

I am trying to generate a new column (variable) based on the value inside multiple columns.
I have over 60 columns in the dataset and I wanted to subset the columns that I want to loop through.
The columns I am using in my condition are all character, and when a certain pattern is matched I want to return a value of 1 in the new variable.
I am using case_when because I need to run multiple conditions on each column to return a value.
CODE:
df <- read.csv("sample.csv")
# Generate new variable
df$new_var <- 0
# For loop through columns 16 to 45
for (i in colnames(df[16:45])) {
df <- df %>%
mutate(new_var=
case_when(
grepl("I8501", df[[i]]) ~ 1
))
}
This does not work as when I table the results, I only get 1 value matched.
My other attempt was using:
for (i in colnames(df[16:45])) {
df <- df %>%
mutate(new_var=
case_when(
df[[i]] == "I8501" ~ 1
))
}
Are there other ways in R to run through multiple columns with multiple conditions and change the value of the variable accordingly?
If I'm understanding what you want, I think you just need to specify another case in your case_when() for keeping the existing values when things don't match "I8501". This is how I would do that:
df$new_var <- 0
for (index in (16:45)) {
df <- df %>%
mutate(
new_var = case_when(
grepl("I8501", df[[index]]) ~ 1,
TRUE ~ df$new_var
)
)
}
I think a better way to do this though would be to use the ever useful apply():
has_match = apply(df[, 16:45], 1, function(x) sum(grepl("I8501", x)) > 0)
df$new_var = ifelse(has_match, 1, 0)
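The loop in the question failed because each iteration overwrote new_var, and case_when() returns NA wherever no condition matches. A column-wise alternative, sketched here on a small made-up data frame (the column names dx1 and dx2 are hypothetical stand-ins for columns 16 to 45), tests every column at once and then combines the results per row:

```r
df <- data.frame(
  id  = 1:4,
  dx1 = c("I8501", "K219", "E119", "I10"),
  dx2 = c("E119", "I8501", "I10", "K219"),
  stringsAsFactors = FALSE
)
cols <- c("dx1", "dx2")  # in the real data this could be names(df)[16:45]

# One logical column per input column: TRUE where the pattern occurs
hits <- sapply(df[cols], function(col) grepl("I8501", col, fixed = TRUE))

# A row is flagged if any of its columns matched
df$new_var <- as.integer(rowSums(hits) > 0)
print(df)
```

Because grepl() is vectorised over a whole column, this avoids calling a function once per row as apply() does.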
Kindly check if this works for your file.
Sample df:
df <- data.frame(C1=c('A','B','C','D'),C2=c(1,7,3,4),C3=c(5,6,7,8))
> df
C1 C2 C3
1 A 1 5
2 B 7 6
3 C 3 7
4 D 4 8
library(dplyr)
library(stringr) # str_detect() comes from stringr
df %>%
rowwise() %>%
mutate(new_var = as.numeric(any(str_detect(c_across(2:last_col()), "7")))) # change the 2:last_col() to select your column range ex: 2:5
Output for finding "7" in any of the columns:
C1 C2 C3 new_var
<chr> <dbl> <dbl> <dbl>
1 A 1 5 0
2 B 7 6 1
3 C 3 7 1
4 D 4 8 0

looping within a variable in panel data using loop in R

I have panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is
dta <- data.frame(cbind(id,intm)), which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1; # new column to be created
# loop
for (j in unique(dta$id)){ # let's separate the unique values of ID
index <- which(dta$id==j) # which row index satisfy id==1, or id==2, ...?
value <- dta$intm[index[1]] # which value of intm corresponds to the first value of the index?
new_column[i:tail(index,n = 1)] <- rep(value,nrow(dta[dta$id==j,])) # repeat this value once for each row that contains the ID
i <- tail(index,n = 1)+1 # the new_column component must start with its last value + 1
}
dta <- cbind(dta,new_column)
Alternatively, you can use the subset() function, i.e
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 = df %>% group_by(id) %>% do({
.$intm = .$intm[1, drop = TRUE]
.
})
You can also try the data.table library, which should be faster.
BTW: you do not need cbind to make a data.frame.
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))
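Another base R sketch, not taken from the answers above: match(id, id) returns, for every row, the position of the first row with the same id, so indexing intm with it broadcasts each group's first value without an explicit loop. The variable name first_row is made up for this example.

```r
id   <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
dta  <- data.frame(id, intm)

# Position of the first row of each id: 1 for id = 1, 8 for id = 2
first_row <- match(dta$id, dta$id)

# Replace every value by the first value within its id
dta$intm <- dta$intm[first_row]
print(dta)
```

This does not require the rows of one id to be contiguous, since match() always points at the first occurrence.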

Calculate value in third column based off values in other columns but different rows

Sorry if this is a trivial question or doesn't make sense; this is my first post. I'm coming from Excel, where I've worked with IF statements and INDEX/MATCH functions, and I'm trying to do something similar in R: pull data from two columns, but not necessarily the same row, to get a value in a third column. My example is this:
df<-data.frame(ID=c(1,5,4,2,3),A=c(1,0,1,1,1),B=c(0,0,1,0,0))
desired output: df<-data.frame(ID=c(1,5,4,2,3),A=c(1,0,1,1,1),B=c(0,0,1,0,0),C=c(0,0,0,0,1))
What I want is to create a third column "C" that essentially follows this format:
Ifelse(A[ID]=1 & B[ID+1]=1 , C[ID]=1 , C[ID]=0)
Essentially if A=1 in ID "x" and B=1 in ID "x+1" then in the new column C in ID "x" =1 otherwise =0. I could order everything by ID if that makes things easier but doing it by the ID column would be ideal.
So far I've tried ifelse statements but I imagine there is probably a better way of doing this
Using dplyr, we can use lead to get next element after arranging the data by ID.
library(dplyr)
df %>%
arrange(ID) %>%
mutate(C = as.integer(A == 1 & lead(B) == 1))
# ID A B C
#1 1 1 0 0
#2 2 1 0 0
#3 3 1 0 1
#4 4 1 1 0
#5 5 0 0 0
In base R, we can do
df1 <- df[order(df$ID),]
df1$C <- with(df1, c(A[-nrow(df1)] == 1 & tail(B, -1) == 1, 0))
Without arranging the data, we can probably do
transform(df, C = as.integer(A == 1 & B[match(ID + 1, ID)] == 1))
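One caveat with the lookups above: the row with the largest ID has no successor, so the comparison can yield NA when A is 1 in that row. A hedged sketch that treats a missing successor as "no match" (the variable name next_B is made up here):

```r
df <- data.frame(ID = c(1,5,4,2,3), A = c(1,0,1,1,1), B = c(0,0,1,0,0))

# B of the row whose ID is one higher; NA where no such row exists
next_B <- df$B[match(df$ID + 1, df$ID)]

# Flag only when A is 1 and a successor exists with B equal to 1
df$C <- as.integer(df$A == 1 & !is.na(next_B) & next_B == 1)
print(df)
```

This keeps the original row order and always returns 0 or 1, never NA.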
Using the lead function I got this to work
df <- df[order(df$ID), ]
df$C <- ifelse(df$A == 1 & dplyr::lead(df$B) == 1, 1, 0)

Add columns in order given some names of them data.frame R

I have a data.frame in R whose columns are named L1, L2, L3, etc., but in a given iteration I am randomly given a data.frame with only some of those columns, such as the following one.
L1,L3,L5
0.0000000,0.7142857,0.2857143
0.1052632,0.8947368,0.0000000
1.0000000,0.0000000,0.0000000
0.0000000,1.0000000,0.0000000
0.0000000,0.0000000,1.0000000
1.0000000,0.0000000,0.0000000
I need to create one with the full set of columns, with the column names in consecutive order as shown below. The added columns L2, L4, and L6 must be filled with 0.
L1,L2,L3,L4,L5,L6
0.0000000,0.0,0.7142857,0.0,0.2857143,0.0
0.1052632,0.0,0.8947368,0.0,0.0000000,0.0
1.0000000,0.0,0.0000000,0.0,0.0000000,0.0
0.0000000,0.0,1.0000000,0.0,0.0000000,0.0
0.0000000,0.0,0.0000000,0.0,1.0000000,0.0
1.0000000,0.0,0.0000000,0.0,0.0000000,0.0
With Base R:
# create example data
df <- read.csv(header=T,
text = "L1,L3,L5
0.0000000,0.7142857,0.2857143
0.1052632,0.8947368,0.0000000
1.0000000,0.0000000,0.0000000
0.0000000,1.0000000,0.0000000
0.0000000,0.0000000,1.0000000
1.0000000,0.0000000,0.0000000")
# create empty dataframe of zeros, with colnames L1:L6
df0 <- as.data.frame(matrix(0, nrow=nrow(df), ncol=6))
names(df0) <- paste0("L", 1:6)
# cbind df with zero cols from df0
df_result <- cbind(df, df0[ , -match(names(df), names(df0))])
# reorder columns L1:L6
df_result <- df_result[ , sort(names(df_result))]
Note that this is effective but inefficient code, as it creates an object full of zeros. This should work well with small to medium-sized data sets, but I would recommend something more clever for large data sets.
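One way to avoid building the full matrix of zeros, sketched under the assumption that the complete set of column names (here L1 to L6) is known in advance: add only the missing columns, then index by the wanted order. Indexing by name also sidesteps sort(), which orders names as text, so L10 would sort before L2. The names wanted and to_add are made up for this example, and the data is shortened to three rows.

```r
df <- read.csv(header = TRUE, text = "L1,L3,L5
0.0000000,0.7142857,0.2857143
0.1052632,0.8947368,0.0000000
1.0000000,0.0000000,0.0000000")

wanted <- paste0("L", 1:6)            # assumed full set of columns
to_add <- setdiff(wanted, names(df))  # here: L2, L4, L6

df[to_add] <- 0   # add each missing column, filled with 0
df <- df[wanted]  # reorder to L1, L2, ..., L6
print(df)
```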
Overview
After reading dplyr - mutate: use dynamic variable names, I tweaked the results to solve your problem of not knowing the column names ahead of time.
Using the tidyverse, you store the columns that are not found in your existing df and then dynamically add them by way of a for loop.
Code
# load necessary package --------
library(tidyverse)
library(rlang)
# load necessary data -----------
df <-
read_csv("L1,L3,L5
0.0000000,0.7142857,0.2857143
0.1052632,0.8947368,0.0000000
1.0000000,0.0000000,0.0000000
0.0000000,1.0000000,0.0000000
0.0000000,0.0000000,1.0000000
1.0000000,0.0000000,0.0000000")
# create function that creates one new column ------
FillNewColumns <- function(df, string) {
require(dplyr)
require(rlang)
df %>%
mutate(!!string := 0 )
}
# store the integers from the column names --------
integer.values <-
df %>%
names() %>%
str_extract("\\d+") %>%
as.integer()
# identify max value from existing integer.values and add 1 ----
max.value <-
integer.values %>%
max() + 1
# identify the new columns -------
# note: this requires that you know the maximum value ahead of time
new.columns <-
(1:max.value %in%
integer.values == FALSE) %>%
# take the indices of those TRUE values
# which do not appear in 1:max.value and create
# our new columns
which() %>%
paste0("L", .)
# dynamically add new columns to df ------
for (i in new.columns) {
df <- FillNewColumns(df, i)
}
# tidy up the results ------
df <-
df %>%
# rearrange the columns in alphabetical order
select(names(.) %>% sort())
# view results ----
df
# A tibble: 6 x 6
# L1 L2 L3 L4 L5 L6
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 0 0.714 0 0.286 0
# 2 0.105 0 0.895 0 0 0
# 3 1 0 0 0 0 0
# 4 0 0 1 0 0 0
# 5 0 0 0 0 1 0
# 6 1 0 0 0 0 0
# end of script #

R: How to automatically create flag variables for sequences of values?

Suppose, you're given the following dataframe:
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"))
What I am looking for is to create the variables flag_1, ..., flag_5 in a, containing the information of how many times the respective values occur. For a, I would expect the following result:
var flag_1 flag_2 flag_3 flag_4 flag_5
",1,2,3," 1. 1. 1. 0. 0.
",2,3,5," 0. 1. 1. 0. 1.
",1,3,5,5," 1. 0. 1. 0. 2.
I managed to get the result using a nested for-loop and an if-condition but there must be a nicer (more aesthetic and better performing) solution.
One option would be to do strsplit, get the table and then cbind with original data
cbind(a, do.call(rbind, lapply(strsplit(as.character(a$var), ","),
function(x) table(factor(x[nzchar(x)], levels = 1:5, labels = paste0("flag_", 1:5))))))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
Another option is with tidyverse
library(tidyverse)
str_extract_all(a$var, "[0-9]") %>%
map(~ as.integer(.x) %>%
as_tibble) %>%
bind_rows(.id = 'grp') %>%
count(grp, value = factor(value, levels = min(value):max(value))) %>%
spread(value, n, drop = FALSE, fill = 0) %>%
select(-grp) %>%
bind_cols(a, .) %>%
rename_at(vars(matches("^[0-9]+$")), ~ paste0("flag_", .))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
First, don't make the strings into factors. Nothing good comes from that.
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
stringsAsFactors = FALSE)
Getting from the strings to your table is simple enough if we take it in small steps. Here, I've written (or renamed) one function per step and then gone through the steps with lapply one at a time. You can string it all together in a pipeline if you like, but it would be roughly these steps.
First, I extract the numbers from the strings. That involves splitting on commas and getting rid of empty strings. You get those because a string can begin and end with a comma; otherwise that step wouldn't be necessary. Then we translate the strings into numbers and count how often we see each one (with the as.numeric and table functions, respectively). After that it is just a question of mapping the observed counts into a vector that also includes the values we haven't observed.
pick_indices <- function(str) unlist(strsplit(str, split = ","))
remove_empty <- function(chrs) chrs[nchar(chrs) > 0]
get_indices <- as.numeric
to_counts <- table
to_flag_vect <- function(counts, len) {
vec <- rep(0, len)
names(vec) <- 1:len
vec[names(counts)] <- counts
vec
}
strings <- lapply(a$var, pick_indices)
cleaned <- lapply(strings, remove_empty)
indices <- lapply(cleaned, get_indices)
counts <- lapply(indices, to_counts)
flags <- lapply(counts, to_flag_vect, len = 5)
We now have the flag-counts in a list, so to make it into the table you want, with the column names you want, we simply do this:
tbl <- do.call(rbind, flags)
colnames(tbl) <- paste0("flag_", 1:5)
tbl
Done.
Split and unlist the values into a factor with appropriate levels
x = strsplit(a$var, ",")
xp = factor(unlist(x), levels = seq_len(5))
Create an index that maps the values of xp to the rows they came from
i = rep(seq_along(x), lengths(x))
use xtabs() to cross-tabulate the entries by row
xt = xtabs(~ i + xp)
and cbind() the matrix representation of the result to the original
> cbind(a, unclass(xt))
var 1 2 3 4 5
1 ,1,2,3, 1 1 1 0 0
2 ,2,3,5, 0 1 1 0 1
3 ,1,3,5,5, 1 0 1 0 2
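A compact base R variant of the same idea, offered as a sketch: tabulate() counts how often each integer from 1 to nbins occurs, so it yields the zero-filled counts directly. It assumes the values are positive integers no larger than 5; the names vals, flags and result are made up for this example.

```r
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
                stringsAsFactors = FALSE)

# Split on commas and drop the empty string produced by the leading comma
vals <- lapply(strsplit(a$var, ","), function(x) as.integer(x[nzchar(x)]))

# tabulate() gives one count per value 1..5; transpose to one row per string
flags <- t(sapply(vals, tabulate, nbins = 5))
colnames(flags) <- paste0("flag_", 1:5)

result <- cbind(a, flags)
print(result)
```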
