How to perform the equivalent of Excel MINIFS in R dplyr?

I am slowly transitioning from Excel to R. In the reproducible code below, I would like to add "MinIfs" columns using dplyr, as detailed in the image below. The Excel minifs() formula in column G of the image has conditions with only the tops of the specified ranges anchored, giving a "rolling" calculation as you move down row by row; the minifs() formula in column M uses the same conditions but with the entire reference range fixed rather than rolling. Any recommendations for doing this in dplyr? dplyr already works for the equivalent Excel sumifs() in columns E and F. In the image, blue shows the current output of the reproducible code, yellow shows what I would like to add, and the non-highlighted cells show the underlying Excel formulas for the yellow cells to their immediate left.
In words:
To derive the rolling MinIfs (call it RollMinIfs), move through the rows one at a time from top to bottom and, for each row, show the minimum value in column C from the top of the column C range down to the current row, considering only rows whose column D "Code2" value is not equal to 0. For example, in deriving the value of -5 in cell G6: cell C6 has the lowest value in the range C4:C6 among the rows whose corresponding Code2 values in D4:D6 are not equal to 0.
To derive the fixed MinIfs (call it FixMinIfs), for every row show the minimum value in the fixed range C4:C10 among the rows whose corresponding column D "Code2" value is not equal to 0.
Reproducible code:
library(dplyr)

myDF <-
  data.frame(
    Name = c("B","R","R","R","R","B","A"),
    Group = c(0,1,1,2,2,0,0),
    Code1 = c(0,1,-5,3,3,4,-8),
    Code2 = c(1,0,2,0,1,2,1)
  )

myDFRender <-
  myDF %>%
  mutate(RollSumIfs = sapply(1:n(), function(x) sum(Code2[1:x][Code1[1:x] < Code1[x]]))) %>%
  mutate(FixSumIfs = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))

print.data.frame(myDFRender)
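One possible way to get both columns, sketched under the assumption that the condition is simply Code2 != 0 as described in words above: mask the excluded rows with Inf and take a cumulative minimum for the rolling version, and take a plain filtered minimum for the fixed version. The column names RollMinIfs and FixMinIfs come from the question; the use of cummin() is a suggestion, not something confirmed by the original post.

```r
library(dplyr)

myDF <- data.frame(
  Name  = c("B","R","R","R","R","B","A"),
  Group = c(0,1,1,2,2,0,0),
  Code1 = c(0,1,-5,3,3,4,-8),
  Code2 = c(1,0,2,0,1,2,1)
)

result <- myDF %>%
  mutate(
    # Rows with Code2 == 0 are replaced by Inf so they can never be the minimum;
    # cummin() then gives the running minimum from the first row to the current row.
    RollMinIfs = cummin(ifelse(Code2 != 0, Code1, Inf)),
    # Minimum of Code1 over all rows with Code2 != 0, repeated on every row.
    FixMinIfs = min(Code1[Code2 != 0])
  )

print(result$RollMinIfs)  # 0 0 -5 -5 -5 -5 -8
print(result$FixMinIfs)   # -8 on every row
```

One caveat: if the first rows all have Code2 == 0, this sketch yields Inf for those rows, so it may need an extra replacement step to match what the spreadsheet shows in that situation.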

Related

Multiply columns by values in other dataset based on matching values between column names and rows

I am trying to apply the following formula to my data:
my_function_alpha <- b1[1] * cond_num[x] + b0[1] + sigma[1] * alpha_norm[1-191]
I have two datasets. My df1 looks like this (the alpha_norm columns go up to 191, and the number in each column name is the subject ID):
[image not included]
My second dataset, df2, looks like this:
[image not included]
The numbers following alpha_norm in the df1 column names correspond to the unique.ID column in df2.
At the moment I have just part of the formula sorted out:
library(dplyr)
A <- function(x) x*df1$sigma.1. + df1$b0.1. + df1$b1.1.*df2$cond_num
new <- test2 %>%
mutate(across(11:201, A)) #11:201 represent the column number of alpha_norm
This bit is what I still need to sort out:
df1$b1.1.*df2$cond_num
Fundamentally, I want to loop over the multiplication by matching the ID embedded in the df1 column names with the Unique.id in df2 and, based on that match, multiplying by the corresponding value of cond_num (which can be either 0.5 or -0.5). So, for alpha_norm.1., multiply b1 by 0.5 (because cond_num is 0.5 when Unique.id is 1), whereas for alpha_norm.56., multiply b1 by -0.5 (because cond_num is -0.5 when Unique.id is 56).
I have tried using intersections or logical matrices but without success. Any help with this would be appreciated! Thanks!
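A minimal sketch of the matching step, assuming the column names follow the pattern alpha_norm.<ID>. and that df2 has one row per ID. The tiny df1 and df2 below are made-up stand-ins for the data shown in the images, and the exact column names (b1.1., b0.1., sigma.1., Unique.ID) are guesses based on the code in the question.

```r
# Hypothetical stand-ins for the real data (two subjects only)
df1 <- data.frame(
  b1.1.          = c(2, 3),
  b0.1.          = c(1, 1),
  sigma.1.       = c(0.5, 0.5),
  alpha_norm.1.  = c(1, 2),
  alpha_norm.56. = c(3, 4)
)
df2 <- data.frame(Unique.ID = c(1, 56), cond_num = c(0.5, -0.5))

# Find the alpha_norm columns and pull the subject ID out of each name
alpha_cols <- grep("^alpha_norm\\.", names(df1), value = TRUE)
ids <- as.integer(gsub("\\D", "", alpha_cols))

# Look up cond_num for each column's ID
cond <- df2$cond_num[match(ids, df2$Unique.ID)]

# Apply b1 * cond_num + b0 + sigma * alpha_norm column by column
out <- df1
for (i in seq_along(alpha_cols)) {
  col <- alpha_cols[i]
  out[[col]] <- df1$b1.1. * cond[i] + df1$b0.1. + df1$sigma.1. * df1[[col]]
}

print(out$alpha_norm.1.)   # 2.5 3.5
print(out$alpha_norm.56.)  # 1.5 1.5
```

match() returns NA for any ID that is missing from df2, so unmatched columns would come out as NA rather than silently using the wrong cond_num.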

Take unique rows in R, but keep most common value of a column, and use hierarchy to break ties in frequency

I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Values from the whole dataset.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
Create a data frame that defines the hierarchy, and assigns each possibility a numeric score.
hi <- data.frame(Poss = unique(df$Set), Nums =c(105,104))
In this case, A gets a numerical value of 105, B gets a numerical score of 104 (so A would be preferred over B in the case of a tie).
Join the hierarchy to the original data frame.
library(dplyr)
library(data.table) # provides setDT(), used below
matched <- left_join(df, hi, by = c("Set" = "Poss"))
Then, add a frequency column to the joined data frame that lists the number of times each unique Set-Values combination occurs.
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row for each Set-Values combination, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set first (ascending), then by mult (descending), and use distinct() to keep the row with the highest score within each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of B (numerical score = 104) will be added together (3 instances would give B a total score in the mult column of 312) but whenever A and B occur at the same frequency, A will win out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure they are spaced appropriately for the dataset at hand. For example, using 2 for A and 1 for B doesn't work because it requires 3 instances of B to trump A, instead of only 2. Likewise, if you anticipate large differences in the frequencies of A and B, use 1005 and 1004, since with the scores I used above A can eventually outscore B even at a lower frequency (200 * 104 is less than 199 * 105).
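As an alternative to multiplying scores, the tie-break can be kept separate from the frequency by sorting on both. The sketch below is a different technique from the one described above: it counts each Set-Values combination with table(), then orders by frequency and by a priority rank. The priority vector is hypothetical (it ranks the values themselves, with 2 preferred over 1 on a tie), so adjust it to whatever hierarchy applies.

```r
df <- data.frame(Set = c("A","A","A","B","B","B","B"),
                 Values = c(1,1,2,1,1,2,2))

# Hypothetical tie-break hierarchy: lower rank wins, so 2 beats 1 on a tie
priority <- c("2" = 1, "1" = 2)

# Count every Set-Values combination
counts <- as.data.frame(table(Set = df$Set, Values = df$Values),
                        stringsAsFactors = FALSE)
counts$rank <- priority[counts$Values]

# Highest frequency first within each Set; ties resolved by rank
counts <- counts[order(counts$Set, -counts$Freq, counts$rank), ]
final <- counts[!duplicated(counts$Set), c("Set", "Values", "Freq")]

print(final)
# Set A keeps value 1 (frequency 2 vs 1); Set B is tied 2 vs 2, so value 2 wins
```

Because frequency and rank are separate sort keys, no spacing between scores has to be chosen, and large frequency differences cannot be overturned by the tie-break.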

Summing a specific vector index

I'm having trouble figuring out how vectors are formatted. I need to find the average height of participants in the cystfibr data set from the ISwR package. When printing the entire height data set it appears to be a 21x2 matrix with height values and a 1 or 2 to indicate sex. However, ncol returns NA, suggesting it is a vector. Trying to get specific indexes of the matrix (heightdata[1,]) also returns an "incorrect number of dimensions" error.
I'm looking to sum up only the height values in the vector, but when I run the code I get the sum of what look like the male and female integers (25).
install.packages("ISwR")
library(ISwR)
attach(cystfibr)
heightdata = table(height)
print(heightdata)
print(sum(heightdata))
This is what the output looks like.
You can work with cystfibr as a data frame and sum each column directly.
install.packages("ISwR")
library(ISwR)
data <- data.frame(cystfibr) # cystfibr is already a data frame; this makes a working copy
Since the data has no unique identifier, the sums are taken across all observations.
apply(data[, "height", drop = FALSE], 2, sum) # sum of the height column
height
3820
unlist(lapply(data , sum))
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
sapply(data, sum)
age sex height weight bmp fev1 rv frc tlc pemax
362.0 11.0 3820.0 960.1 1957.0 868.0 6380.0 3885.0 2850.0 2728.0
table gives you the count of values in the vector.
If you want to sum the height values held in heightdata, note that they are stored as the names of heightdata, in character format; convert them to numeric and then sum.
sum(as.numeric(names(heightdata)))
#[1] 3177
which is the same as summing the unique values of height.
sum(unique(cystfibr$height))
#[1] 3177
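To make the difference between these sums concrete, here is a small sketch on a made-up height vector (not the cystfibr data). It shows that summing a table() result gives the number of observations, summing its names gives the sum of the unique values, and only the raw vector gives the total needed for an average.

```r
# Made-up heights; 112 appears twice
height <- c(109, 112, 112, 120)

counts <- table(height)            # how many times each height occurs
n_obs <- sum(counts)               # number of observations: 4
sum_unique <- sum(as.numeric(names(counts)))  # sum of distinct heights: 341
total <- sum(height)               # sum of all heights: 453
avg <- mean(height)                # average height: 113.25

# The total can also be rebuilt from the table: value * count
total_from_table <- sum(as.numeric(names(counts)) * as.vector(counts))

print(c(n_obs = n_obs, sum_unique = sum_unique, total = total, avg = avg))
```

For the original question, this suggests mean(cystfibr$height) is the direct route to the average, with no need for table() at all.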

How to access R data frame contents using element in factor level

As shown below, the data frame factorizedss is the factorized version of a source data frame ss.
ss <- data.frame(c('a','b','a'), c(1,2,1)); #There are string columns and number columns.
#So, I factorized them as below.
factorizedss <- data.frame(lapply(ss, as.factor)); #factorized version
indices <- data.frame(c(1,1,2,2), c(1,1,1,2)); #Now, given integer indices
Given these indices and factorizedss, is it possible to get the corresponding elements of the source data frame, as below? (The purpose is to access data frame elements by their integer factor-level number.)
a 1
a 1
b 1
b 2
You can access the first column like this
factorizedss[indices[,1],][,1]
and the second in a similar way
factorizedss[indices[,2],][,2]
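It gets more difficult when trying to combine them; you may have to convert them back to their native types. Note that as.numeric() applied directly to a factor returns the level codes, so go through as.character() first (the combined result is a character matrix).
t(rbind(as.character(factorizedss[indices[,1],][,1]), as.numeric(as.character(factorizedss[indices[,2],][,2]))))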
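If the integers in indices are meant as factor-level codes rather than row numbers, another option is to index levels() directly. This is a sketch under that assumption; the column names x and y are added here for readability and are not in the original code.

```r
ss <- data.frame(x = c("a", "b", "a"), y = c(1, 2, 1))
factorizedss <- data.frame(lapply(ss, as.factor))

# Integer level codes to look up, one column per factor
indices <- data.frame(x = c(1, 1, 2, 2), y = c(1, 1, 1, 2))

# levels() returns the labels in code order, so code k maps to levels(f)[k]
restored <- data.frame(
  x = levels(factorizedss$x)[indices$x],
  y = as.numeric(levels(factorizedss$y)[indices$y])
)

print(restored)
#   x y
# 1 a 1
# 2 a 1
# 3 b 1
# 4 b 2
```

This keeps each column in its native type (character and numeric) instead of collapsing everything into a character matrix.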

Compute new column based on values in current and following rows with dplyr in R

I have a big dataset (10+ million rows x 30 variables) and I am trying to compute some new variables based on complicated interactions of the current ones. For clarity, I am including only the important variables in the question. I have the following code in R, but I am interested in other views and opinions. I am using the dplyr package to compute new columns based on the current and following row values of 3 other columns (more explanation below the code).
I am wondering if there is a way to make this faster and more efficient, or maybe rewrite it completely.
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe used below

# the main function: data is a data frame, windowSize and ratio are ints
computeNewColumn <- function(data, windowSize, ratio) {
  # helper function used in the second mutate() further down;
  # all args are ints, and it returns a boolean
  windowAhead <- function(timeTo, window, reduction) {
    # subset the original data frame: only observations with values of
    # TimeToGo between timeTo-1 and window (basically the following X rows
    # from the current one)
    subframe <- data[(timeTo - 1 >= data$TimeToGo & data$TimeToGo >= window), ]
    isthere <- any(subframe$Price < reduction)
    return(isthere)
  }
  # I group by value of ID first and order by TimeToGo...
  data %<>% group_by(ID) %>%
    arrange(desc(TimeToGo)) %>%
    # ...create two new columns from simple interactions of existing ones...
    mutate(Window = ifelse(TimeToGo > windowSize, TimeToGo - windowSize, 0),
           Reduction = floor(Price - (ratio * Price))) %>%
    rowwise() %>%
    # ...now comes the more complex stuff: I want to compute a third column
    # depending on the next (TimeToGo - Window) number of values of Price
    mutate(Advice = ifelse(windowAhead(TimeToGo, Window, Reduction), 1, 0))
  return(data)
}
We have a dataset with the following columns: ID, Price, TimeToGo.
We first group by ID and compute two new columns from the current row's values (Window from TimeToGo and Reduction from Price). Next, we would like to compute a third column based on
1. the current value of Reduction
2. the next (TimeToGo - Window) values of Price in the data frame.
Is there a simple way to reference upcoming values of a column from within mutate()? Ideally I am looking for a sliding window function on one column, where the limits of the sliding window are set from two other columns' current values. My solution for now uses a custom function that subsets the original data frame manually, does a comparison, and returns a value to the mutate() call. Any help and ideas would be much appreciated!
P.S. Here's a sample of the data. Please let me know if you need any more information. Thanks!
> a
ID TimeToGo Price
1 AQSAFOTO30A 96 19
2 AQSAFOTO20A 95 19
3 AQSAFOTO30A 94 17
4 AQSAFOTO20A 93 18
5 AQSAFOTO25A 92 19
6 AQSAFOTO30A 91 17
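One way to avoid rowwise() and the repeated subsetting of the full data frame is to pass whole group vectors to a helper and compute the flag inside it. The sketch below assumes the look-ahead should stay within each ID (the original helper searches the whole data frame), and the helper name advise() and the parameter values window_size = 5 and ratio = 0.05 are made up for illustration.

```r
library(dplyr)

# Sample data from the question
a <- data.frame(
  ID = c("AQSAFOTO30A", "AQSAFOTO20A", "AQSAFOTO30A",
         "AQSAFOTO20A", "AQSAFOTO25A", "AQSAFOTO30A"),
  TimeToGo = c(96, 95, 94, 93, 92, 91),
  Price = c(19, 19, 17, 18, 19, 17)
)

# Hypothetical helper: works on the vectors of one ID at a time.
# Returns 1 if any later observation inside the window is priced below
# the current row's reduction threshold, otherwise 0.
advise <- function(time_to_go, price, window_size, ratio) {
  window <- pmax(time_to_go - window_size, 0)
  reduction <- floor(price - ratio * price)
  vapply(seq_along(price), function(i) {
    ahead <- time_to_go <= time_to_go[i] - 1 & time_to_go >= window[i]
    as.integer(any(price[ahead] < reduction[i]))
  }, integer(1))
}

result <- a %>%
  group_by(ID) %>%
  arrange(desc(TimeToGo), .by_group = TRUE) %>%
  mutate(Advice = advise(TimeToGo, Price, window_size = 5, ratio = 0.05)) %>%
  ungroup()

print(result$Advice)  # 0 0 0 1 0 0 (rows ordered by ID, then TimeToGo descending)
```

Each comparison now touches only the rows of one ID, which should scale better than scanning all rows per observation, though for 10+ million rows a dedicated rolling-window or non-equi join approach may still be worth benchmarking.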
