Vectorize ifelse command in R - r

I have this example data
library(quantmod)
getSymbols("NOK",from="2013-01-01",to="2014-05-01",src="yahoo","getSymbols.warning4.0"=FALSE)
data<-NOK
w1<-1
L_dO<-data[,1]
L_dC<-data[,4]
L_Profit_L_1<-((lag(L_dC,-1)-lag(L_dO,-1))/(lag(L_dO,-1)))*100
L_Profit_L_2<-((lag(L_dC,-2)-lag(L_dO,-1))/(lag(L_dO,-1)))*100
L_Profit_L_3<-((lag(L_dC,-3)-lag(L_dO,-1))/(lag(L_dO,-1)))*100
L_Profit_L_4<-((lag(L_dC,-4)-lag(L_dO,-1))/(lag(L_dO,-1)))*100
L_Profit_L_5<-((lag(L_dC,-5)-lag(L_dO,-1))/(lag(L_dO,-1)))*100
L_Profit_L_all<-ifelse(L_Profit_L_1>w1,L_Profit_L_1,
ifelse(L_Profit_L_2>w1,L_Profit_L_2,
ifelse(L_Profit_L_3>w1,L_Profit_L_3,
ifelse(L_Profit_L_4>w1,L_Profit_L_4,
ifelse(L_Profit_L_5>w1,L_Profit_L_5,L_Profit_L_5)))))
What I am interested in is L_Profit_L_all, but this seems a strange and slow way to write it. I have tried to vectorize it like this:
L_Profit_L_all<-ifelse(c(L_Profit_L_1>w1,L_Profit_L_2>w1,L_Profit_L_3>w1,L_Profit_L_4>w1,L_Profit_L_5>w1),c(L_Profit_L_1,L_Profit_L_2,L_Profit_L_3,L_Profit_L_4,L_Profit_L_5),L_Profit_L_5)
But the result is not the same. I want the conditions checked in order, i.e. if the first condition is TRUE, return the first value and ignore whether any later condition is also TRUE, which is what the first version does.
Is there a straightforward way to achieve this? I have a huge dataset, so every millisecond saved helps. Thanks

Here's an example of how you might approach the problem without any ifelse calls. Assume you cbind all your L_Profit_L_X vectors together to get something similar to m in my example.
set.seed(1)
m <- matrix(sample(-5:5, 50, T), ncol = 5)
indx <- max.col(m > 1, ties.method = "first")
sapply(seq_along(indx), function(i) m[i, indx[i]])
#[1] 5 2 2 4 3 4 5 2 4 3
This is not fully vectorized since we're using sapply but I'm sure it will be a lot faster than the initial approach with 5 nested ifelses.
Update
You can vectorize the code by replacing the sapply part above with:
m[cbind(seq_len(nrow(m)), indx)]
# [1] 5 2 2 4 3 4 5 2 4 3
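One caveat worth checking: the nested ifelse version falls back to L_Profit_L_5 when no condition holds, whereas max.col with ties.method = "first" returns column 1 for a row that is all FALSE. The sketch below is one way to restore that fallback. The small matrix m and the threshold w1 are illustrative values, not the NOK data, and the sketch assumes there are no NAs.

```r
# Illustrative data: each row stands for one date, each column for one
# L_Profit_L_X series (assumed values, not the NOK data).
m <- rbind(c(0.5, 2.0, 3.0),
           c(0.2, 0.1, 0.4),
           c(1.5, 0.3, 2.5))
w1 <- 1

cond <- m > w1                                 # logical matrix of conditions
indx <- max.col(cond, ties.method = "first")   # first TRUE column per row
indx[rowSums(cond) == 0] <- ncol(m)            # no TRUE at all: use last column

first_hit <- m[cbind(seq_len(nrow(m)), indx)]  # matrix indexing, no loop
first_hit
# [1] 2.0 0.4 1.5
```

Row 2 has no value above w1, so it takes the last column, which mirrors the final else branch of the nested ifelse.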

Related

Using data.table function in lapply on a list with data.frames elements (Answer = setDT)

First question, let me know if more info or background is needed in the comments please.
Many answers on here and elsewhere deal with calling lapply in a data.table function. I want to do the opposite, which on paper should be easy, lapply(list.of.dfs, fun(x) x), but I can't get it to work with data.table functions.
I have a list that contains several data.frames with the same columns but differing numbers of rows. This comes from the output of several simulation scenarios, so they must be treated separately and not rbind'ed.
#sample list of data.frames
scenarios <- replicate(5, data.frame(a=sample(letters[1:4],10,T),
b=sample(1:2,10,T),
x=sample(1:10, 10),
y =runif(10)), simplify = FALSE)
I want to add a column to every element that is the sum of x/y by a and b.
From the data.table documentation in the examples section the process to do this for one data.frame is the following (search: add new column by reference by group in the doc page):
test <- as.data.table(scenarios[[1]]) #must specify data.table class
test[, newcol := sum(x/y), by = .(a , b)][]
I want to use lapply to do the same thing to every element in the scenarios list and return the list.
My most recent attempt:
lapply(scenarios, function(i) {as.data.table(i[, z := sum(x/y), by=.(a,b)]); i})
but I keep getting the error unused argument (by = .(a, b))
After poring over the results of this and other sites I have been unable to solve this problem, which I'm fairly sure means that there is something I don't understand about calling anonymous functions and/or using the data.table function. Is this one of those cases where you use the [ as the function? Or possibly my as.data.table is out of place.
This answer was a step in the right direction (I think), it covers the use of fun(x) {... ; x} to use an anonymous function and return x.
Thanks!
You can use setDT here instead.
scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by=.(a,b)])
scenarios[[1]]
    a b  x          y         z
 1: c 2  2 0.87002174  2.298793
 2: b 2 10 0.19720775 78.611837
 3: b 2  8 0.47041670 78.611837
 4: b 2  4 0.36705023 78.611837
 5: a 1  5 0.78922686 12.774035
 6: a 1  6 0.93186209 12.774035
 7: b 1  3 0.83118438  3.609307
 8: c 1  1 0.08248658 30.047494
 9: c 1  7 0.89382050 30.047494
10: c 1  9 0.89172831 30.047494
Using as.data.table, the syntax would be
scenarios <- lapply(scenarios, function(i) {i <- as.data.table(i); i[, z := sum(x/y),
by=.(a,b)]})
but this wouldn't be recommended as it will create an additional copy, which is avoided by setDT.
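To sanity-check the grouped sums without data.table, a base R version can be useful. This is only a cross-check sketch, not the approach recommended above: it swaps in ave() for the by-group assignment, and the tiny list scen is made-up data with the same column names as scenarios.

```r
# Made-up stand-in for the scenarios list (same column names, fixed values).
scen <- list(
  data.frame(a = c("p", "p", "q"), b = c(1, 1, 2),
             x = c(1, 2, 3), y = c(1, 2, 3))
)

# For each data.frame, add z = sum(x / y) within each (a, b) group.
scen_z <- lapply(scen, function(d) {
  d$z <- ave(d$x / d$y, d$a, d$b, FUN = sum)
  d
})

scen_z[[1]]$z
# [1] 2 2 1
```

Unlike setDT with :=, this copies each data.frame, so it is better suited to verifying results on a small sample than to large simulations.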

Calling & creating new columns based on string

I have searched quite a bit and not found a question that addresses this issue--but if this has been answered, forgive me, I am still quite green when it comes to coding in general. I have a data frame with a large number of variables that I would like to combine & create new variables from based on names I've put in a 2nd data frame in a loop. The data frame formulas should create & call columns from the main data frame data
USDb = c(1,2,3)
USDc = c(4,5,6)
EURb = c(7,8,9)
EURc = c(10,11,12)
data = data.frame(USDb, USDc, EURb, EURc)
Now I'd like to create a new column data$USDa as defined by
data$USDa = data$USDb - data$USDc
and so on for EUR and other variables. This is easy enough to do manually, but I'd like to create a loop that pulls the names from formulas, something like this:
a = c("USDa", "EURa")
b = c("USDb", "EURb")
c = c("USDc", "EURc")
formulas = data.frame(a,b,c)
for (i in 1:length(formulas[,a])){
data$formulas[i,a] = data$formulas[i,b] - data$formulas[i,c]
}
Obviously data$formulas[i,a] returns NULL, so I tried data$paste0(formulas[i,a]), which returns Error: attempt to apply non-function
How can I get these strings to be recognized as variables in this way? Thanks.
There are simpler ways to do this, but I'll stick to most of your code as a means of explanation. Your code should work so long as you edit your for loop to the following:
for (i in 1:length(formulas[,"a"])){
data[formulas[i,"a"]] = data[formulas[i,"b"]] - data[formulas[i,"c"]]
}
formulas[,a] won't work because a is already defined as a character vector (c("USDa", "EURa")), which is not a valid column index for formulas. Use formulas[, "a"] instead if you want all rows from column "a" in data.frame formulas.
data$formulas is literally searching for the column called "formulas" in the data.frame data. Instead you want to write data[formulas] (of course, knowing that you need to index formulas in order to get a proper string).
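Since [ accepts a character vector of column names, the loop can also be dropped entirely. A minimal sketch, assuming formulas holds plain character columns (hence stringsAsFactors = FALSE, which matters on R versions before 4.0):

```r
data <- data.frame(USDb = c(1, 2, 3), USDc = c(4, 5, 6),
                   EURb = c(7, 8, 9), EURc = c(10, 11, 12))
formulas <- data.frame(a = c("USDa", "EURa"),
                       b = c("USDb", "EURb"),
                       c = c("USDc", "EURc"),
                       stringsAsFactors = FALSE)

# Subtract the "c" columns from the "b" columns in one step and store the
# result under the names listed in formulas$a.
data[formulas$a] <- data[formulas$b] - data[formulas$c]

data$USDa
# [1] -3 -3 -3
```

The subtraction pairs columns by position, so the rows of formulas must line up (USDb with USDc, EURb with EURc).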
Logic: iterate through each of the formulas using apply, which is a for loop internally, and do the calculation based on the formula (column b minus column c, as in the question)
x = apply(formulas, 1, function(x) data[[x[2]]] - data[[x[3]]])
colnames(x) = formulas$a
x
#     USDa EURa
#[1,]   -3   -3
#[2,]   -3   -3
#[3,]   -3   -3
cbind(data, x)
#  USDb USDc EURb EURc USDa EURa
#1    1    4    7   10   -3   -3
#2    2    5    8   11   -3   -3
#3    3    6    9   12   -3   -3
Another option is split with sapply (columns are taken in the order b, c, so the result is b minus c)
sapply(setNames(split.default(as.matrix(formulas[-1]),
row(formulas[-1])), formulas$a), function(x) Reduce(`-`, data[x]))
#     USDa EURa
#[1,]   -3   -3
#[2,]   -3   -3
#[3,]   -3   -3

Using apply and multi argument functions

I want to apply a function over a data frame. The function takes V1 as arg1 and V2 as arg2 and I want to write the result to V3 or some other vector.
Is there an easy and compact way to do this? I've posted a (non-working) example below.
Thanks
Stu
my.func <- function(X, Y) {
return(X + Y)
}
a <- c(1,2,3)
b <- c(4,5,6)
my.df <- data.frame(a, b)
apply(my.df, 1, my.func, X="a", Y="b")
mapply() is made for this.
Either of the following will do the job. The advantage of the second approach is that it scales nicely to functions that take an arbitrary number of arguments.
mapply(my.func, my.df[,1], my.df[,2])
# [1] 5 7 9
do.call(mapply, c(FUN=list(my.func), unname(my.df)))
# [1] 5 7 9
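To illustrate the point about scaling to more arguments, here is a small sketch with a three-argument function. The names my.func3 and my.df3 are made up for this example, and as.list is used so that the columns are passed positionally.

```r
# Hypothetical three-argument function and matching three-column data frame.
my.func3 <- function(X, Y, Z) X + Y + Z
my.df3 <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Each column becomes one positional argument to mapply, so the call does
# not change when the number of columns changes.
res3 <- do.call(mapply, c(list(FUN = my.func3), unname(as.list(my.df3))))
res3
# [1] 12 15 18
```

The columns are matched to X, Y, Z by position, so the column order must follow the function's argument order.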
I feel this would be better approached using with than mapply if you're calling elements inside a data.frame:
with(my.df,my.func(X=a,Y=b))
#[1] 5 7 9
It's still quite a clean method even if you need to do the explicit conversion from a matrix:
with(data.frame(my.mat),my.func(X=a,Y=b))
#[1] 5 7 9
There isn't really any need for an *apply function here. Vectorization would suffice:
my.df$c <- my.df$a + my.df$b
# a b c
#1 1 4 5
#2 2 5 7
#3 3 6 9
Your apply solution can't work the way you have written it because apply hands each row to your function as a single vector, not as separate X and Y arguments, and that vector has no colnames: e.g.
colnames(my.df)
#[1] "a" "b"
apply( my.df , 1 , colnames )
#NULL
For your example, rowSums(my.df) will do the job. For more complicated tasks, you can use the mapply function. For example: mapply(my.func, my.df$a, my.df$b).
Alternatively, you could rewrite your function to take a vector argument:
my.otherfunc <- function(x) sum(x)
apply(my.df, 1, my.otherfunc)
It's important to understand that when apply feeds each row or column into the function, it's sending one vector, not a list of separate entries. So you should give it a function with a single (vector) argument.
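If you want to keep a two-argument function and still use apply, one option is a small wrapper that unpacks the row vector by name. This is a sketch under the assumption that all columns are numeric, since apply converts the data frame to a matrix first and mixed types would be coerced to character.

```r
my.func <- function(X, Y) X + Y
my.df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Each row arrives as one numeric vector whose names are the column names,
# so the wrapper pulls out the elements and passes them on separately.
res <- apply(my.df, 1, function(row) my.func(X = row[["a"]], Y = row[["b"]]))
unname(res)
# [1] 5 7 9
```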

`j` doesn't evaluate to the same number of columns for each group

I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' may end up with two columns, A and B. 'A' and 'B' were returned by the first call to 'fetch' and only 'B' by the second. I would like the example code to return this result:
  id    A B
1  1    a b
2  2 <NA> b
Unfortunately, when run I get this error.
Error in `[.data.table`(data, , fetch(.BY, .SD), by = id) :
j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real world use case plyr is running out of memory. Each call to fetch occurs rather quickly, but the memory crash occurs when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
   A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
   A    V1
1: 1     1
2: 2   2,3    # V1 is a list column. These aren't strings, the
3: 3 4,5,6    # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or, just make the list returned conformant :
allcols = c("A","B")
fetch <- function(by) {
if(by == 1)
list(A=c("a"), B=c("b"))[allcols]
else
list(B=c("b"))[allcols]
}
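Note that indexing a list by a name it does not contain gives a NULL element with an NA name rather than an NA-filled column, so the conformant version may need explicit padding. A minimal sketch of such padding in base R; the helper name conform is hypothetical, and it assumes every column is character.

```r
allcols <- c("A", "B")

# Hypothetical helper: add any missing columns as NA_character_, then
# return the elements in the fixed order given by cols.
conform <- function(x, cols = allcols) {
  missing_cols <- setdiff(cols, names(x))
  x[missing_cols] <- NA_character_
  x[cols]
}

str(conform(list(B = "b")))
str(conform(list(A = "a", B = "b")))
```

fetch could then return conform(list(B = "b")) in its else branch, so every group yields the same columns with the same types.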
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA_character_, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(A=NA_character_, B=c("b")) # typed NA so column A is character in every group
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]

Syntax (and/or functions) for applying an op over elements of one vector, using as arg elements of a 2nd vector

I am trying to find the right expression for creating a vector result by applying an operation over a vector, using, in a vectorised way, elements of a 2nd vector. The use case is that I have a vector of raw values and a vector of breakpoints. What I want is an expression that will give me the result of applying a sum of a logical operation on the breakpoints with respect to the values in the values vector. In other words:
Given:
rawfoo <- c(30, 4, 22, 77, 1,169, 10)
breaksfoo <- c(10,50, 80)
resultfoo <- data.frame(breaks=breaksfoo, matching=numeric(length(breaksfoo)))
I want to write a single expression that delivers the column values for resultfoo$matching, which is: for each value in breaksfoo, sum(rawfoo > breaksfoo[i]),
resultfoo
  breaks nmatching
1     10         3
2     50         2
3     80         1
I have been trying various forms of apply and having problems with how to express the function. Perhaps I am barking up the wrong tree? Can supply multiple demonstration of failure if required. (But my guess is that this question is so simple it doesn't need error messages to disambiguate it ;-)
You can do it in three steps:
Write a function that, given a break, returns a list of two elements: the break itself and the result of sum(break > rawfoo).
Then you can use sapply to apply this function to breaksfoo.
Finally, you need to transform the result of sapply, which is a matrix, to get the data frame you need.
The following code does all of these three steps in one statement:
as.data.frame(t(sapply(breaksfoo,
function(x) list(breaks = x, nmatching = sum(x > rawfoo)))))
returns
  breaks nmatching
1     10         2
2     50         5
3     80         6
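The code above counts values below each break (x > rawfoo). If the comparison is meant to run the other way, as in the question's sum(rawfoo > breaksfoo[i]), a sketch along these lines may help; outer builds the full comparison matrix, so it assumes both vectors are small enough for that to fit in memory.

```r
rawfoo <- c(30, 4, 22, 77, 1, 169, 10)
breaksfoo <- c(10, 50, 80)

# One column per break: TRUE where the raw value exceeds that break.
n_above <- colSums(outer(rawfoo, breaksfoo, ">"))
n_above
# [1] 4 2 1

# Same counts with an explicit per-break function.
n_above_loop <- vapply(breaksfoo, function(b) sum(rawfoo > b), integer(1))
```

These are cumulative counts of values strictly above each break. The sample output in the question (3, 2, 1) appears to be per-interval counts instead, which is what the interval-based answers produce.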
Combining findInterval with table might get you what you're looking for.
#finds which interval rawfoo is in
x <- findInterval(rawfoo,breaksfoo)
#[1] 1 0 1 2 0 3 1
#tabulates the information
table(x)
#0 1 2 3
#2 3 1 1
#cuts off the last element
head(table(x),-1)
#0 1 2
#2 3 1
resultfoo$nmatching <- head(table(x),-1)
This is almost what you want, except that 10 is being placed in the second bucket because findInterval's intervals are inclusive on the lower end, while your example puts it in the first bucket because you want a strict inequality. You can add a corrective vector that will reassign to the right bucket:
y <- table(rawfoo)[as.character(breaksfoo)]
y[is.na(y)] <- 0
y <- y - c(0,head(y,-1))
resultfoo$nmatching <- resultfoo$nmatching + y
To make this easier to do, you can wrap it into a function.
fnfoo <- function(raw,breaks) {
# use the function arguments rather than the global rawfoo/breaksfoo
x <- head(table(findInterval(raw,breaks)),-1)
y <- table(raw)[as.character(breaks)]
y[is.na(y)] <- 0
x + y - c(0,head(y,-1))
}
resultfoo$nmatching <- fnfoo(rawfoo,breaksfoo)
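A possibly simpler route to the same counts, assuming R 3.3.0 or later where findInterval has a left.open argument: with left.open = TRUE the intervals are closed on the right, so a value equal to a break lands in the lower bucket and no correction vector is needed. Using tabulate with a fixed nbins also keeps empty buckets as zeros, which table would drop.

```r
rawfoo <- c(30, 4, 22, 77, 1, 169, 10)
breaksfoo <- c(10, 50, 80)

# Bucket 0 is (-Inf, 10], bucket 1 is (10, 50], bucket 2 is (50, 80],
# bucket 3 is everything above 80.
idx <- findInterval(rawfoo, breaksfoo, left.open = TRUE)

# Shift to 1-based bins and count, keeping empty bins as zeros.
counts <- tabulate(idx + 1L, nbins = length(breaksfoo) + 1L)

head(counts, -1)   # drop the bucket above the last break
# [1] 3 2 1
```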
EDIT: I was browsing another question and realized that cut works better here.
data.frame(table(cut(rawfoo,c(-Inf,breaksfoo),right=TRUE)))
#        Var1 Freq
# 1 (-Inf,10]    3
# 2   (10,50]    2
# 3   (50,80]    1
