Apply two different formulas on four data frame columns - r

I want to apply two different formulas on four columns of my dataframe df. I have done this manually, but since my original data frame has several columns, I want to be able to use loops or case when to do this faster.
Here's how sample dataframe df looks like:
A B C D
20 100 4 1200
40 150 6 2300
34 200 3 1230
32 225 9 1100
12 220 10 1000
Formula 1:
(x-max(x))/(max(x)-min(x))
Formula 2:
(min(x)-x)/(max(x)-min(x))
I'd like to apply formula 1 on columns B and D and formula 2 on columns A and C.
After applying the formula, I want to store the values in a different dataframe but with the same column names.
Here's what I did:
formula_1 <-function(x) {
(((x - min(x)))/(max(x) - min(x)))
}
formula_2 <-function(x){(min(x)-x)/(max(x)-min(x))
}
Create an empty dataframe BI_score
BI_score$B <- formula_1(df$B)
BI_score$D <- formula_1 (df$D)
BI_score$A <- formula_2 (df$A)
BI_score$C <- formula_2 (df$C)

EDIT
As there are some NAs and Inf values and if we want to exclude them from calculation, we can handle it by updating the function as below and then apply the function to column as shown previously.
formula_1 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (((temp - min(temp)))/(max(temp) - min(temp))))
}
formula_2 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (min(temp)-temp)/(max(temp)-min(temp)))
}
The most straight forward approach would be to use lapply to apply the function separately on selected columns.
BI_score <- df
fm1_cols <- c("B", "D")
fm2_cols <- c("A", "C")
BI_score[fm1_cols] <- lapply(df[fm1_cols], formula_1)
BI_score[fm2_cols] <- lapply(df[fm2_cols], formula_2)
BI_score
# A B C D
#1 -0.29 0.00 -0.14 0.154
#2 -1.00 0.40 -0.43 1.000
#3 -0.79 0.80 0.00 0.177
#4 -0.71 1.00 -0.86 0.077
#5 0.00 0.96 -1.00 0.000
As mentioned by #Sotos, if you want to apply the function on alternate columns you could do
BI_score[c(TRUE, FALSE)] <- lapply(df[c(TRUE, FALSE)], formula_1)
BI_score[c(FALSE, TRUE)] <- lapply(df[c(FALSE, TRUE)], formula_2)
Just for fun, approach using dplyr
library(dplyr)
bind_cols(df %>% select(fm1_cols) %>% mutate_all(formula_1),
df %>% select(fm2_cols) %>% mutate_all(formula_2))

If your goal is to apply the two functions on alternating columns, then you can do it via logical indexing
cbind.data.frame(sapply(df[c(TRUE, FALSE)], formula_2),
sapply(df[c(FALSE, TRUE)], formula_1))
# A C B D
#1 -0.2857143 -0.1428571 0.00 0.15384615
#2 -1.0000000 -0.4285714 0.40 1.00000000
#3 -0.7857143 0.0000000 0.80 0.17692308
#4 -0.7142857 -0.8571429 1.00 0.07692308
#5 0.0000000 -1.0000000 0.96 0.00000000

We can use mutate_at from dplyr
library(dplyr)
df1 %>%
mutate_at(vars(B, D), formula_1) %>%
mutate_at(vars(A, C), formula_2)

Related

Apply conditional function in every row of a data frame

I'm new in R and I'm struggling with this df that looks like this:
Date Group Factor 1 Factor 2 Spread
2019-04-01 a 1.01 1.011 0.01
2019-04-02 a 1.02 1.012 0.02
2019-04-03 a 1.03 1.013 0.03
2019-04-01 b 1.005 1.004 0.01
2019-04-02 b 1.0051 1.0041 0.02
2019-04-03 b 1.0052 1.0042 0.03
I would like do verify each group in each row and if the results are Group "a" do Factor1/Factor1(1 day lag) * Factor2 + spread, and if the group it's not "a" do not add the spread.
Since you are conditioning on the group, this is a good example of by (base R), dplyr::group_by, or data.table's x[,,by=].
The equation is effectively the same in all three, capitalizing on the fact that (Group[1] == "a") will be coerced from a logical to numeric when multipled by a number; since FALSE translates to a 0, then effectively disabled adding Spread.
Base
I use within here to make the internals a little more readable, but this is not a requirement (in which case you'd need to prepend x$ in front of all of the variable names).
The lagging can be done using dplyr::lag (even if you don't use the rest of the package for this) or many other techniques. I don't find stats::lag to be the most intuitive in applications like this, but I'm sure somebody will suggest a way to incorporate it into an answer. The use of c(NA, ...) ensures that we don't bring in a different group's data or impute data we don't have, since we have no value to bring in on the first row of a group. Finally, head(..., n = 1) returns the first element of a vector/list, while head(..., n = -1) (negative) returns all but the last.
newx <- by(x, x$Group, function(y) {
within(y, {
NewVal = Factor2 * Factor1 / c(NA, head(Factor1, n=-1)) + (Group[1] == "a") * Spread
})
})
newx
# x$Group: a
# Date Group Factor1 Factor2 Spread NewVal
# 1 2019-04-01 a 1.01 1.011 0.01 NA
# 2 2019-04-02 a 1.02 1.012 0.02 1.042020
# 3 2019-04-03 a 1.03 1.013 0.03 1.052931
# -------------------------------------------------------
# x$Group: b
# Date Group Factor1 Factor2 Spread NewVal
# 4 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5 2019-04-02 b 1.0051 1.0041 0.02 1.0042
# 6 2019-04-03 b 1.0052 1.0042 0.03 1.0043
This is really just a list with some fancy by-specific formatting, so you can treat it as such as combine them in an efficient base-R way:
do.call("rbind.data.frame", c(newx, stringsAsFactors = FALSE))
# Date Group Factor1 Factor2 Spread NewVal
# a.1 2019-04-01 a 1.0100 1.0110 0.01 NA
# a.2 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# a.3 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# b.4 2019-04-01 b 1.0050 1.0040 0.01 NA
# b.5 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# b.6 2019-04-03 b 1.0052 1.0042 0.03 1.004300
dplyr
Many find the tidyverse line of packages to read intuitively.
library(dplyr)
x %>%
group_by(Group) %>%
mutate(NewVal = Factor2 * Factor1 / lag(Factor1) + (Group[1] == "a") * Spread) %>%
ungroup()
# # A tibble: 6 x 6
# Date Group Factor1 Factor2 Spread NewVal
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-04-01 a 1.01 1.01 0.01 NA
# 2 2019-04-02 a 1.02 1.01 0.02 1.04
# 3 2019-04-03 a 1.03 1.01 0.03 1.05
# 4 2019-04-01 b 1.00 1.00 0.01 NA
# 5 2019-04-02 b 1.01 1.00 0.02 1.00
# 6 2019-04-03 b 1.01 1.00 0.03 1.00
data.table
On a different note, many find data.table better because of efficiencies gained from in-place modification (most of R's operations are copy-on-write, meaning some operations re-copy the object or a portion of it with each change).
library(data.table)
X <- as.data.table(x)
X[, NewVal := Factor2 * Factor1 / shift(Factor1) + (Group[1] == "a") * Spread, by = "Group"]
X
# Date Group Factor1 Factor2 Spread NewVal
# 1: 2019-04-01 a 1.0100 1.0110 0.01 NA
# 2: 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# 3: 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# 4: 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5: 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# 6: 2019-04-03 b 1.0052 1.0042 0.03 1.004300
The "in-place" part is evident on the second line here, where it appears as if the [ operation should just return a subset or something of the data ... but in this case using := causes the columns to be created (or changed) in-place.

Assign different values to a large number of columns

I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4x0+0.3x0+0.2x0.05+0.1x0.07, and SWeights = 0.2+0.1.
My initial idea was to assign the weights to each column like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. Soon I realize that this is impossible as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type and I'm not sure it's the real case with the real data, but I would first convert it to an appropriate type
df1[-1] <- lapply(df1[-1], type.convert)
Next, we will convert the column names to a numeric class too
vec <- as.numeric(names(df1)[-1])
Finally, we could easily create the new columns in two simple steps. This indeed has a to matrix conversion overhead, but maybe you should work with matrices in the first place. Either way, this is fully vectorized
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2004-04-30 0.05 0.03 0.19 0.03 0.070 1.0
Or, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient as this are basically by row operations
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE
)[, c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2004-04-30 0.070 1.0

Random sample in R when data is in long format

I need to randomly sample a dataset which is arranged in long format. In my dataset, each subject has 4 observations, so if I randomly sample a row I am randomly losing one or more observation per subject.
This is a simulated data for illustration purposes, my data is much bigger.
sub sex group dv1 dv2
P1 m A 0.66 0.94
P1 m B 0.98 0.26
P1 m C 0.02 0.03
P1 m D 0.60 0.30
P2 m A 0.92 0.99
P2 m B 0.82 0.09
P2 m C 0.44 0.67
P2 m D 0.53 0.80
P3 f A 0.29 0.22
P3 f B 0.46 0.20
P3 f C 0.37 0.77
P3 f D 0.76 0.54
P4 m A 0.28 0.99
P4 m B 0.16 0.57
P4 m C 0.46 0.75
P4 m D 0.28 0.21
In this example, I need to randomly select 2 males. For example, I tried using dplyr packaged (see below), but if I give a sample of 2, it just gives me 2 rows for sex="m" and 2 for sex="f". In total, 4 randomly chosen rows. What I need it to do is to give me 8 rows where 4 come from one male and 4 from another. Changing grouping parameter to sub doesn't work, as it barks that there are only 2 levels in the group (actually, it would work in this toy example as there are 4 levels for each sub, but note that I am choosing like 50 samples from a bigger dataset). Also, it would just give me 2 random rows for each sub, which is not what I need.
library(dplyr)
subset <- data %>%
group_by(sex) %>%
sample_n(2)
Please do not suggest to reshape the date to wide format and sample it there, as I know that I can do that. I am sure there must be a way to sample in long format.
I would sample from the patient names and then filter by those sampled names:
Look at all males
male_subset <- data %>% filter(sex == "m")
Look for unique male ID
male_IDs <- unique(male_subset$sub)
Sample from the unique IDs
sampled_IDs <- sample(male_IDs, 2)
Now you subset your data based on these sampled IDs:
data %>% filter(sub %in% sampled_IDs)
This should return all four rows for each of the 2 sampled individuals.
I'm not sure if I've quite understood what you want. Would this do it?
data %>% filter(sex == 'm') %>% filter(sub %in% sample(paste0('P',1:4), 2))
You'd have to change what's in the paste0 function for your real data, of course.
In base R,
set.seed(1)
subset<- sample(data[data$sex == "m",]$sub,2)
data_subset<-data[data$sub %in% subset,]
nrow(data_subset)
# [1] 8
Works, but not flashy.

How to select columns based on criteria in a certain row in R

I have a matrix of values with both row names and column names, as shown here.
C5.Outliers
Days J1 J2 J3 J4
0.01 458 -160 -151 -52
0.02 459 -163 -154 -46
0.03 457 -165 -150 -51
Perc 0.99 0.04 0.00 0.52
I would like to create a separate matrix using only the columns for which the value for the row "Perc" is =<50.0. In this example, I would be extracting columns J2 and J3.
This is the code I tried which isn't working (the "Perc" row is row #1414 on my matrix):
C5.Final<-subset(C5.Outliers, 1414<.51)
I assume you mean 0.50 since all the columns with the "Perc" are above 50.0.
this might not be the best way but it works:
#data:
df <- data.frame(Days=c(0.01,0.02,0.03,"Perc"),J1=c(458,459,457,0.99),
J2 =c(-165,-163,-160,0.04),J3=c(-151,-153,-131,0.00),J4=c(-52,-45,-51,0.52))
dfc <- subset(df,,select= which(c(TRUE,(df[which(df$Days == "Perc"), ] <= 0.50)[2:5])))
dfc
Days J2 J3
1 0.01 -165.00 -151
2 0.02 -163.00 -153
3 0.03 -160.00 -131
4 Perc 0.04 0
You can remove the TRUE, if you dont want the df$Days variable, change the 0.50 threshold if needed and expand the 2:5 if you have extra columns or even substitute the "Perc" with 1414 if you so wish.
Hope this works.
Presumably you meant <= 0.50 and not <= 50 since all "Perc" are less than 50. You can do
df[, unlist(df["Perc",]) <= 0.5]
# J2 J3
# 0.01 -160.00 -151
# 0.02 -163.00 -154
# 0.03 -165.00 -150
# Perc 0.04 0
But this may be safer and takes into account any NA values that may appear in "Perc".
u <- unlist(df["Perc",]) <= 0.50
df[, u & !is.na(u)]
Also, you can speed it up if need be by adding use.names = FALSE in unlist(). And finally, if you have a matrix and not a data frame, then you can remove unlist() all together.

R: aggregating time series groups of irregular length

I think this is a split-apply-combine problem, but with a time series twist. My data consists of irregular counts and I need to perform some summary statistics on each group of counts. Here's a snapshot of the data:
And here's it is for your console:
library(xts)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
xtsData <- xts(cbind(returns,count,maxCount,sumCount),date)
I have no idea how to construct the max and cumSum columns, especially since each count series is of an irregular length. Since I won't always know the start and end points of a count series, I'm lost at trying to figure out the index of these groups. Thanks for your help!
UPDATE: here is my for loop for attempting to calculating cumSum. it's not the cumulative sum, just the returns necessary, i'm still unsure how to apply functions to these ranges!
xtsData <- cbind(xtsData,mySumCount=NA)
# find groups of returns
for(i in 1:nrow(xtsData)){
if(is.na(xtsData[i,"count"]) == FALSE){
xtsData[i,"mySumCount"] <- xtsData[i,"returns"]
}
else{
xtsData[i,"mySumCount"] <- NA
}
}
UPDATE 2: thank you commenters!
# report returns when not NA count
x1 <- xtsData[!is.na(xtsData$count),"returns"]
# cum sum is close, but still need to exclude the first element
# -0.009 in the first series of counts and .027 in the second series of counts
x2 <- cumsum(xtsData[!is.na(xtsData$count),"returns"])
# this is output is not accurate because .03 is being displayed down the entire column, not just during periods when counts != NA. is this just a rounding error?
x3 <- max(xtsData[!is.na(xtsData$count),"returns"])
SOLUTION:
# function to pad a vector with a 0
lagpad <- function(x, k) {
c(rep(0, k), x)[1 : length(x)]
}
# group the counts
x1 <- na.omit(transform(xtsData, g = cumsum(c(0, diff(!is.na(count)) == 1))))
# cumulative sum of the count series
z1 <- transform(x1, cumsumRet = ave(returns, g, FUN =function(x) cumsum(replace(x, 1, 0))))
# max of the count series
z2 <- transform(x1, maxRet = ave(returns, g, FUN =function(x) max(lagpad(x,1))))
merge(xtsData,z1$cumsumRet,z2$maxRet)
The code shown is not consistent with the output in the image and there is no explanation provided so its not clear what manipulations were wanted; however, the question did mention that the main problem is distinguishing the groups so we will address that.
To do that we compute a new column g whose rows contain 1 for the first group, 2 for the second and so on. We also remove the NA rows since the g column is sufficient to distinguish groups.
The following code computes a vector the same length as count by first setting each NA position to FALSE and each non-NA position to TRUE. It then differences each position of that vector with the prior position. To do that it implicitly converts FALSE to 0 and TRUE to 1 and then performs the differencing. Next we convert this last result to a logical vector which is TRUE for each 1 component and FALSE otherwise. Since the first component of the vector that is differenced has no prior position we prepend 0 for that. The prepending operation implicitly converts the TRUE and FALSE values just generated to 1 and 0 respectively. Taking the cumsum fills in the first group with 1, the second with 2 and so on. Finally omit the NA rows:
x <- na.omit(transform(x, g = cumsum(c(0, diff(!is.na(count)) == 1))))
giving:
> x
returns count maxCount sumCount g
2010-11-26 -0.009 1 0.030 0.000 1
2010-12-03 0.030 1 0.030 0.030 1
2010-12-10 0.013 2 0.030 0.042 1
2010-12-17 0.003 2 0.030 0.045 1
2010-12-24 0.010 3 0.030 0.056 1
2010-12-31 0.001 4 0.030 0.056 1
2011-01-07 0.011 5 0.030 0.067 1
2011-01-14 0.017 6 0.030 0.084 1
2011-01-21 -0.008 7 0.030 0.077 1
2011-01-28 -0.005 7 0.030 0.071 1
2011-02-04 0.027 7 0.030 0.098 1
2011-02-11 0.014 7 0.030 0.112 1
2011-02-18 0.010 7 0.030 0.123 1
2011-03-18 0.027 1 0.027 0.000 2
2011-03-25 -0.019 2 0.027 -0.019 2
attr(,"na.action")
2010-11-18 2010-11-19 2011-02-25 2011-03-04 2011-03-11 2011-03-26 2011-03-27
1 2 16 17 18 21 22
attr(,"class")
[1] "omit"
You can now use ave to perform any calculations you like. For example to take cumulative sums of returns by group:
transform(x, cumsumRet = ave(returns, g, FUN = cumsum))
Replace cumsum with any other function that is suitable for use with ave.
Ah, so "count" are the groups and you want the cumsum per group and the max per group. I think in data.table, so here is how I would do it.
library(xts)
library(data.table)
date <- as.Date(c("2010-11-18", "2010-11-19", "2010-11-26", "2010-12-03", "2010-12-10",
"2010-12-17", "2010-12-24", "2010-12-31", "2011-01-07", "2011-01-14",
"2011-01-21", "2011-01-28", "2011-02-04", "2011-02-11", "2011-02-18",
"2011-02-25", "2011-03-04", "2011-03-11", "2011-03-18", "2011-03-25",
"2011-03-26", "2011-03-27"))
returns <- c(0.002,0.000,-0.009,0.030, 0.013,0.003,0.010,0.001,0.011,0.017,
-0.008,-0.005,0.027,0.014,0.010,-0.017,0.001,-0.013,0.027,-0.019,
0.000,0.001)
count <- c(NA,NA,1,1,2,2,3,4,5,6,7,7,7,7,7,NA,NA,NA,1,2,NA,NA)
maxCount <- c(NA,NA,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,0.030,
0.030,0.030,0.030,0.030,NA,NA,NA,0.027,0.027,NA,NA)
sumCount <- c(NA,NA,0.000,0.030,0.042,0.045,0.056,0.056,0.067,0.084,0.077,
0.071,0.098,0.112,0.123,NA,NA,NA,0.000,-0.019,NA,NA)
DT<-data.table(date,returns,count)]
DT[!is.na(count),max:=max(returns),by=count]
DT[!is.na(count),cumSum:= cumsum(returns),by=count]
#if you need an xts object at the end, then.
xtsData <- xts(cbind(DT$returns,DT$count, DT$max,DT$cumSum),DT$date)

Resources