ifelse with sorted values by row - r

I have a dataframe (example data):
id <- c(1, 2, 3)
ex1 <- c(0.8, 0.2, 0.3)
ex2 <- c(0.1, 0.4, 0.04)
ex3 <- c(0.04, 0.3, 0.5)
ex <- c(1, 1, 1)
ran <- c(0.5, 0.7, 0.6)
dat <- data.frame(id, ex1, ex2, ex3, ex, ran)
dat
id ex1 ex2 ex3 ex ran
1 1 0.8 0.10 0.04 1 0.5
2 2 0.2 0.40 0.30 1 0.7
3 3 0.3 0.04 0.50 1 0.6
I want to change the values of "ex" with an if-else condition. "ex" should change to 5 (arbitrary) when "ran" is smaller than or equal to the highest value of the ex$ variables. It should also be greater than the other ex$ variables, but they should be sorted: the second-largest value is added to the smallest value. Here are examples for all ids, beginning with id 1:
dat$ex <- ifelse(dat$ran <= dat$ex1 & dat$ran > dat$ex2 + dat$ex3, 5, dat$ex)
Here, ex1 is the largest value, followed by ex2 and then ex3.
For id 2, it should be:
dat$ex <- ifelse(dat$ran <= dat$ex2 & dat$ran > dat$ex3 + dat$ex1, 5, dat$ex)
Here, ex2 is the largest value, followed by ex3 and then ex1.
For id 3:
dat$ex <- ifelse(dat$ran <= dat$ex3 & dat$ran > dat$ex1 + dat$ex2, 5, dat$ex)
Here, ex3 is the largest value, followed by ex1 and then ex2.
Now to the problem: How to generalize the ifelse-statement? Note: It is important that the summation of the two smaller values is performed as implemented in the examples. I need to identify the sorted values for ex1, ex2 and ex3 within ifelse by id.

Here is one way to achieve the task using dplyr and tidyr:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(
cols = ex1:ex3
) %>%
arrange(id, desc(value)) %>%
group_by(id) %>%
mutate(ex = ifelse(ran <= value[1] & ran > sum(value[2], value[3]), 5, ex)) %>%
pivot_wider(
names_from=name
)
output:
id ex ran ex1 ex2 ex3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 0.5 0.8 0.1 0.04
2 2 1 0.7 0.2 0.4 0.3
3 3 1 0.6 0.3 0.04 0.5

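We may use pmax to get the row-wise maximum (note that the code below checks only the first condition, ran <= max, and not the sum of the two smaller values):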
library(dplyr)
library(purrr)
dat %>%
mutate(ex= case_when(ran <=invoke(pmax, across(matches('^ex\\d+'))) ~ 5,
TRUE ~ ex))
id ex1 ex2 ex3 ex ran
1 1 0.8 0.10 0.04 5 0.5
2 2 0.2 0.40 0.30 1 0.7
3 3 0.3 0.04 0.50 1 0.6
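A base R sketch that covers both conditions from the question, assuming the candidate columns are exactly those named ex followed by digits. The names row_max and rest_sum are introduced here for illustration only. The row-wise maximum comes from pmax, and the sum of the remaining values is the row total minus that maximum.

```r
dat <- data.frame(
  id  = c(1, 2, 3),
  ex1 = c(0.8, 0.2, 0.3),
  ex2 = c(0.1, 0.4, 0.04),
  ex3 = c(0.04, 0.3, 0.5),
  ex  = c(1, 1, 1),
  ran = c(0.5, 0.7, 0.6)
)

# Assumption: the columns to compare are named "ex" followed by digits only
ex_cols <- dat[grep("^ex[0-9]+$", names(dat))]

# Row-wise maximum of the ex# columns (hypothetical helper name)
row_max <- do.call(pmax, unname(as.list(ex_cols)))
# Sum of all non-maximum ex# values in each row
rest_sum <- unname(rowSums(ex_cols)) - row_max

dat$ex <- ifelse(dat$ran <= row_max & dat$ran > rest_sum, 5, dat$ex)
dat$ex
# [1] 5 1 1
```

Because nothing here depends on there being exactly three ex# columns, the same lines should carry over to wider data, with "the rest" meaning everything except the maximum.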

exes <- t(apply(subset(dat, select = grep("^ex.+", names(dat))), 1, function(z) c(max(z), sum(z[-which.max(z)]))))
exes
# [,1] [,2]
# 1 0.8 0.14
# 2 0.4 0.50
# 3 0.5 0.34
ifelse(dat$ran <= exes[,1] & dat$ran > exes[,2], 5, dat$ex)
# 1 2 3
# 5 1 1
Walk-through:
subset(dat, ...) is a way to dynamically extract columns from a frame regardless of its type (e.g., data.frame, tbl_df, or data.table), and without risk of dropping the frame to a vector (e.g., mtcars[,2] is no longer a frame); there are other ways to do this, some in base R, some in other packages like dplyr or data.table
apply(..., 1, ...) applies the function to each row of the selected columns; with MARGIN=1 (the second argument), apply returns one column per input row, so we need t(.) to transpose the result back into one row per id;
exes is now a matrix whose first column contains the max of the ex# variables, and the second column contains the sum of the non-max ex# variables
From here, I think the use of exes is the "general" solution you were looking for.

Related

Add a column that shows whether the two previous columns include 0 or not

I have a data.frame called dat. I want to add a new column to it called dif. In each row, if the range from lower to upper includes 0 (e.g., -0.41 to 0.1 in the 1st row), I want the dif value to be FALSE; otherwise (e.g., 0.10 to 0.2 in the 2nd row) it should be TRUE.
Is this possible to do in R for any similar data.frame (the following is a toy example a functional answer is appreciated)?
dat <- data.frame(lower = c(-0.41, .1, -.2), upper = 1:3*.1, row.names = paste("a", 1:3)) # add a column called `dif`
desired_output <- data.frame(lower = c(-0.41, .1, -.2), upper = 1:3*.1, dif = c(F,T,F), row.names = paste("a", 1:3))
You can use dplyr::between:
library(dplyr)
dat %>%
rowwise() %>%
mutate(dif = !between(0, lower, upper))
Output
# A tibble: 3 x 3
# Rowwise:
lower upper dif
<dbl> <dbl> <lgl>
1 -0.41 0.1 FALSE
2 0.1 0.2 TRUE
3 -0.2 0.3 FALSE
You can use transform:
transform(dat, dif = lower > 0 | upper < 0)
# lower upper dif
#1 -0.41 0.1 FALSE
#2 0.10 0.2 TRUE
#3 -0.20 0.3 FALSE
We can use mutate from dplyr
library(dplyr)
mutate(dat, dif = lower > 0 | upper < 0)
Or an option in base R
Reduce(`|`, Map(function(x, y) match.fun(y)(x, 0), dat, c(">", "<")))
#[1] FALSE TRUE FALSE
Here is another base R option using do.call with *
dat$dif <- do.call("*", dat) > 0
such that
> dat
lower upper dif
a 1 -0.41 0.1 FALSE
a 2 0.10 0.2 TRUE
a 3 -0.20 0.3 FALSE
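Since the question asks for something reusable, the comparison can be wrapped in a small function. This is only a sketch: the name add_dif and its arguments are made up here, it assumes lower <= upper in every row, and it treats an interval that touches 0 as including 0.

```r
# Hypothetical helper: adds a logical column `dif` that is TRUE when the
# interval [lower, upper] does NOT contain 0
add_dif <- function(d, lower = "lower", upper = "upper") {
  d$dif <- d[[lower]] > 0 | d[[upper]] < 0
  d
}

dat <- data.frame(lower = c(-0.41, .1, -.2), upper = 1:3 * .1,
                  row.names = paste("a", 1:3))
out <- add_dif(dat)
out$dif
# [1] FALSE  TRUE FALSE
```

Passing the column names as strings keeps the function usable on data frames whose bounds are stored under different names.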

Mutate multiple columns of one value in a dataframe using a single vector

I am trying to do something like the table below: filling it by multiplying the vector on the left (df2) by each of the values in row 2 (df1: y1 to y10). I actually have a bigger data set, but this is reproducible.
dataframes
df1:
y1 y2 y3 y4
5 10 15 20
df2:
perc
0.08
0.02
0.08
0.12
0.20
0.30
0.12
0.03
0.01
codes that I've tried:
try1 <- df2 %>% mutate_each_(.funs = funs(. * df1[1,]), .cols = vars(contains("y")))
try2 <- df2 %>% mutate_each_(funs = funs(. * df1[1,]), vars = vars(contains("y")))
try3 <- df2 %>% mutate_at(.funs = funs(X = .*df1[1,1:5]), .vars = c(y1:y5))
try4 <- df1 %>% mutate_each(funs(op = .*df2), y1:y5)
try5 <- cbind(df1,apply(df1[1,1:5],2, function(x) x*df2))
try6 <- cbind(df2,apply(df2[,1],2, function(x) x*df1[1,1:5]))
At the end I would like to rename each column (y1 = 2010 ... y5 = 2015), as each represents the values for one year.
Any recommendation is very welcome.
Edited:
As this would be a matrix operation, df1 now is a matrix with more elements N [3 x 10]. df2 is the same [1 x 9].
I want to end with a df O[27 x 10] as explained in the figure below:
Where each value of M (df2) multiplies each value of N (df1).
This is matrix multiplication
y = seq(5, 20, by = 5)
x = c(0.08, 0.02, 0.08, 0.12, 0.2, 0.3, 0.12, 0.03, 0.01)
y %*% t(x)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] 0.4 0.1 0.4 0.6 1 1.5 0.6 0.15 0.05
# [2,] 0.8 0.2 0.8 1.2 2 3.0 1.2 0.30 0.10
# [3,] 1.2 0.3 1.2 1.8 3 4.5 1.8 0.45 0.15
# [4,] 1.6 0.4 1.6 2.4 4 6.0 2.4 0.60 0.20
The function outer does this more generally, letting you specify any binary operation (but the default is *).
outer(y, x)
# same result
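Note that y %*% t(x) puts the perc values in columns (4 x 9). Swapping the arguments of outer gives one row per perc value and one column per year instead. A sketch, assuming the years start at 2010 as in the question:

```r
y <- seq(5, 20, by = 5)
x <- c(0.08, 0.02, 0.08, 0.12, 0.2, 0.3, 0.12, 0.03, 0.01)

# 9 x 4: rows follow x (perc), columns follow y
res <- as.data.frame(outer(x, y))

# Assumption: the first column corresponds to the year 2010
names(res) <- 2009 + seq_along(y)
dim(res)
# [1] 9 4
```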
Simple matrix multiplication can do this easily.
df <- as.data.frame(as.matrix(df2) %*% as.matrix(df1))
If you want to use dplyr, you can do this.
library(dplyr)
df <- cbind(df2, df1) %>%
mutate_at(vars(-perc), ~ perc * .) %>%
select(-perc)
If you really want to stick in the "tidyverse", you can use crossing from tidyr instead of cbind. This will work for multiple rows in df1.
library(tidyr)
library(tibble)
df2 %>%
rowid_to_column("group") %>%
crossing(df1) %>%
mutate_at(vars(-group, -perc), ~ perc * .) %>%
select(-perc)
Regardless, you can then rename the resulting data frame using names().
names(df) <- 2010:2013
Later Edit (Some more Options):
This uses purrr from the tidyverse and might actually be the cleanest for you if starting from two data frames like your example.
library(purrr)
map_dfc(df1, ~ . * df2) %>%
set_names(2009 + seq_along(df1))
You can basically accomplish the same thing in base but this will return a matrix not a data frame unless we convert it.
setNames(as.data.frame(sapply(df1, function(x) t(x * df2))), 2009 + seq_along(df1))
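For the edited question (a 3 x 10 matrix N and 9 values in M), one possible reading is a Kronecker product. The figure is not reproduced here, so the row order below (all rows of N scaled by the first value of M, then by the second, and so on) is an assumption, and N is filled with made-up numbers.

```r
M <- c(0.08, 0.02, 0.08, 0.12, 0.2, 0.3, 0.12, 0.03, 0.01)

# Hypothetical 3 x 10 matrix standing in for the real df1
N <- matrix(1:30, nrow = 3, dimnames = list(NULL, paste0("y", 1:10)))

# Each block of 3 rows in O is N multiplied by one element of M
O <- kronecker(matrix(M, ncol = 1), N)
colnames(O) <- colnames(N)
dim(O)
# [1] 27 10
```

If the figure instead groups the rows by row of N, kronecker(N, matrix(M, ncol = 1)) would give that ordering.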

How to use dplyr::mutate to multiply pairs of columns specified by parts of the variable name

I have the following example:
df <- data.frame(
id = c(1,2,3),
fix_01.2012 = c(2,5,7),
fix_02.2012 = c(5,1,7),
fix_03.2012 = c(6,1,5),
fox_01.2012 = c(0.4, 0.5, 0.7),
fox_02.2012 = c(0.6, 0.5, 0.8),
fox_03.2012 = c(0.7, 0.5, 0.9)
)
id fix_01.2012 fix_02.2012 fix_03.2012 fox_01.2012 fox_02.2012 fox_03.2012
1 1 2 5 6 0.4 0.6 0.7
2 2 5 1 1 0.5 0.5 0.5
3 3 7 7 5 0.7 0.8 0.9
The table below is what I want to get.
I want to create a new column for each date (e.g. "01.2012"):
res_date = fix_date * fox_date
As I have many dates / pairs of dates, I guess this needs to be done by looping through the names.
id fix_01.2012 fix_02.2012 fix_03.2012 fox_01.2012 fox_02.2012 fox_03.2012 res_01.2012 res_02.2012 res_03.2012
1 1 2 5 6 0.4 0.6 0.7 0.8 3.0 4.2
2 2 5 1 1 0.5 0.5 0.5 2.5 0.5 0.5
3 3 7 7 5 0.7 0.8 0.9 4.9 5.6 4.5
Anyone can help? Thanks very much in advance!
Here is an idea that uses split.default to split the data frame based on similar column names (based on your conditions). We then loop over that list and multiply the columns. In this case, we use Reduce (rather than i[1]*i[2]) to multiply in order to account for more than two columns
do.call(cbind,
lapply(split.default(df[-1], gsub('.*_', '', names(df[-1]))), function(i) Reduce(`*`, i)))
# 01.2012 02.2012 03.2012
#[1,] 0.8 3.0 4.2
#[2,] 2.5 0.5 0.5
#[3,] 4.9 5.6 4.5
Bind them back to the original with cbind.data.frame()
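The binding step can be sketched as follows. The res_ prefix follows the column names in the desired output, and by_date is just a name chosen here for the intermediate list.

```r
df <- data.frame(
  id = c(1, 2, 3),
  fix_01.2012 = c(2, 5, 7),
  fix_02.2012 = c(5, 1, 7),
  fix_03.2012 = c(6, 1, 5),
  fox_01.2012 = c(0.4, 0.5, 0.7),
  fox_02.2012 = c(0.6, 0.5, 0.8),
  fox_03.2012 = c(0.7, 0.5, 0.9)
)

# Group the non-id columns by the date part of their names
by_date <- split.default(df[-1], gsub(".*_", "", names(df[-1])))

# Multiply the columns within each date group
res <- do.call(cbind, lapply(by_date, function(i) Reduce(`*`, i)))
colnames(res) <- paste0("res_", colnames(res))

# Attach the products to the original data frame
out <- cbind.data.frame(df, res)
names(out)
```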
If you want a tidyverse approach, it will take using a bit of tidy evaluation to get what you want.
library(tidyverse)
df <- data.frame(
id = c(1,2,3),
fix_01.2012 = c(2,5,7),
fix_02.2012 = c(5,1,7),
fix_03.2012 = c(6,1,5),
fox_01.2012 = c(0.4, 0.5, 0.7),
fox_02.2012 = c(0.6, 0.5, 0.8),
fox_03.2012 = c(0.7, 0.5, 0.9)
)
# colnames with "fix"
fix <- names(df)[grepl("fix",names(df))]
# colnames with "fox"
fox <- names(df)[grepl("fox",names(df))]
# Iterate over the two vectors of names and column bind the results (map2_dfc).
# Since these are strings, we need to have them evaluated as symbols
# Creating the column name just requires the string to be evaluated.
map2_dfc(fix, fox, ~transmute(df, !!paste0("res", str_extract(.x, "_(0\\d)")) := !!sym(.x) * !!sym(.y)))
#> res_01 res_02 res_03
#> 1 0.8 3.0 4.2
#> 2 2.5 0.5 0.5
#> 3 4.9 5.6 4.5
Much more verbose than the other answers, but to my eye easier to read/edit/adapt, is a heavy gather-spread approach (the way I'd reason the problem if I was solving it step-by-step):
library(tidyr)
library(dplyr)
df %>%
gather(-id, key=colname, value=value) %>%
separate(colname, c('fixfox', 'date'), sep='_') %>%
spread(key=fixfox, value=value) %>%
mutate(res=fix*fox) %>%
gather(-id, -date, key=colname, value=value) %>%
unite(new_colname, colname, date, sep='_') %>%
spread(key=new_colname, value=value)

Sum data points in the rows from data frame if they meet criteria from another data frame in R

I have two data frames both with 220 obs and 80 variables. The first data frame, df1, has only the data points 1, 2, and 3. The second data frame, df2, has different numeric values consisting of decimals, such as 0.12, -0.03, 0.01 etc. (supposed to portray market cap weighted stock returns for a given month). PS: The length of the original data set is 80.
For example
df1 = data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
a b c d
1 2 3 1 3
2 2 2 1 3
3 1 3 2 1
df2 = data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6), c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))
a b c d
1 0.1 0.3 0.2 0.1
2 0.1 0.4 0.3 0.5
3 0.2 0.6 0.5 0.6
How can I sum the rows of df2 and turn the result into a matrix with 220 obs and 3 variables based on the values in df1? Note that df1 and df2 have the same column names in the same order. How can I create a third data frame df3 based on the indicator variables from df1 by summing the rows of df2? I want to sum the rows of df2 based on the values in df1 to create df3:
df3 =
X1 X2 X3
1 0.2 0.1 0.4
2 0.3 0.5 0.5
3 0.8 0.5 0.6
Let's first look at (X1,1). Row 1 in df1 contains only one data point with value 1, which is (c,1). Thus, the sum over row 1 of df2 is just that entry, 0.2. Now look at (X1,3) (last value of column X1). Observe row 3 in df1 to find two data points with value 1. In df2 those two values are 0.2 (a,3) and 0.6 (d,3), and we sum them to get 0.8.
Here is how each entry of df3 is calculated:
calculation = data.frame("1" = c("0+0+0.2+0", "0+0+0.3+0", "0.2+0+0+0.6"), "2" = c("0.1+0+0+0", "0.1+0.4+0+0", "0+0+0.5+0"), "3" = c("0+0.3+0+0.1", "0+0+0+0.5", "0+0.6+0+0"))
X1 X2 X3
1 0 + 0 + 0.2 + 0 0.1 + 0 + 0 + 0 0 + 0.3 + 0 + 0.1
2 0 + 0 + 0.3 + 0 0.1 + 0.4 + 0 + 0 0 + 0 + 0 + 0.5
3 0.2 + 0 + 0 + 0.6 0 + 0 + 0.5 + 0 0 + 0.6 + 0 + 0
More practical explanation based on stocks. Assume df1 is a matrix that describes buy, hold, and sell recommendations. df2 describes the market weighted stock returns. All variables/columns are different stocks. df3 creates a matrix with three different portfolios. If the stock is "buy", I want to put it in a "buy" portfolio. If the stock is "hold", I want to put it in a "hold" portfolio, etc. This is easily done in Excel with nested IF,AND,OR functions, but I do not know how to do it in R.
We could use tapply after converting the datasets to matrices, grouping by the row index of the data and by the values in 'df1'
tapply(as.matrix(df2), list(row(df2), as.matrix(df1)), FUN = sum)
# 1 2 3
#[1,] 0.2 0.1 0.4
#[2,] 0.3 0.5 0.5
#[3,] 0.8 0.5 0.6
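The masking logic from the question's calculation table (keep a value where df1 matches the group, use 0 elsewhere, then sum across the row) can also be written directly. This is a sketch; groups is a name introduced here, and it assumes df1 and df2 have identical dimensions and column order.

```r
df1 <- data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
df2 <- data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6),
                  c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# Distinct indicator values found in df1 (here 1, 2, 3)
groups <- sort(unique(as.vector(m1)))

# For each group k: zero out entries where df1 != k, then sum each row
df3 <- sapply(groups, function(k) rowSums(m2 * (m1 == k)))
colnames(df3) <- groups
df3
```

Unlike tapply, this returns 0 rather than NA for a row in which a group does not occur at all.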
Or with tidyverse, bind the datasets after gathering the two into 'long' data, and then do a group by sum
library(tidyverse)
gather(df1) %>%
bind_cols(gather(df2)) %>%
group_by(key) %>%
group_by(rn = row_number(), value) %>%
summarise(value1 = sum(value1)) %>%
spread(value, value1) %>%
ungroup %>%
select(-rn)
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 0.2 0.1 0.4
#2 0.3 0.5 0.5
#3 0.8 0.5 0.6
Here is another base R method that uses rowsum to perform group sums and loops through the rows with mapply.
t(mapply(rowsum, as.data.frame(t(df2)), as.data.frame(t(df1))))
[,1] [,2] [,3]
V1 0.2 0.1 0.4
V2 0.3 0.5 0.5
V3 0.8 0.5 0.6
Note that I am using R 3.4.4. The as.data.frame calls are needed because t returns a matrix when it is fed a data.frame, and mapply has to loop over the columns of the transposed data (the original rows) rather than over individual elements.

How can I vectorize this task in R?

For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))
cut <- unique(pred)
cut_n <- length(cut)
n <- length(pred)
class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)
curr_cut <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1
for (i in 1:n){
if (curr_cut != pred[i]) {
j <- j + 1
curr_cut <- pred[i]
}
if (grp[i] == levels(grp)[1])
class_1_c <- class_1_c + 1
else
class_2_c <- class_2_c + 1
class_1[j] <- class_1_c
class_2[j] <- class_2_c
}
cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above was to compute the cumulative number of times the factors in grp appear for each unique value in pred. For example, I get the following output for above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have a few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!
You can start by getting the unique group/pred counts using a table
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
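The two rows can then be pulled out to reproduce the printout of the original loop. One difference worth keeping in mind: table orders the columns by the sorted unique values of pred, whereas the loop relies on pred already being grouped. The name cum is chosen here for illustration.

```r
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp  <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

# Rows: levels of grp; columns: unique values of pred; entries: running counts
cum <- t(apply(table(grp, pred), 1, cumsum))

cat("index:", colnames(cum), "\n")
cat("class1:", cum[1, ], "\n")
cat("class2:", cum[2, ], "\n")
```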
