I have the following dataframe df:
tickers <- c('AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL')
returns <- c(0.1, 0.2, 0.3, -0.15, -0.25, .09, 0.4, -0.2)
df <- data.frame(tickers, returns)
df
tickers returns
1 AAPL 0.10
2 AAPL 0.20
3 AAPL 0.30
4 AAPL -0.15
5 AAPL -0.25
6 AAPL 0.09
7 AAPL 0.40
8 AAPL -0.20
I would like to add a column with the lagged returns. To do so, I use:
df$lag_1 <- Lag(df$returns , k=1)
Which produces:
tickers returns Lag.1
1 AAPL 0.10 NA
2 AAPL 0.20 0.10
3 AAPL 0.30 0.20
4 AAPL -0.15 0.30
5 AAPL -0.25 -0.15
6 AAPL 0.09 -0.25
7 AAPL 0.40 0.09
8 AAPL -0.20 0.40
So far, so good. But, when I try to use a variable to define the 2-day lag, I get an error message:
lookup <- 'returns'
df$lag_2 <- Lag(paste('df$', lookup) , k=2)
Error in Lag.default(paste("df$", lookup), k = 2) :
x must be a time series or numeric vector
Use [[ instead of $
library(quantmod)
df$lag_2 <- Lag(df[[lookup]], k = 2)[,1]
-output
> df
tickers returns lag_2
1 AAPL 0.10 NA
2 AAPL 0.20 NA
3 AAPL 0.30 0.10
4 AAPL -0.15 0.20
5 AAPL -0.25 0.30
6 AAPL 0.09 -0.15
7 AAPL 0.40 -0.25
8 AAPL -0.20 0.09
The stats::lag function is designed for application to time series objects. It is not designed to "lag" ordinary vectors. The lagging of a time series object is accomplished by altering its time base. The quantmod package's help page for its Lag function describes the differences succinctly:
This function differs from lag by returning the original series modified, as opposed to simply changing the time series properties. It differs from the like named Lag in the Hmisc package as it deals primarily with time-series like objects.
It is important to realize that if there is no applicable method for Lag, the value returned will be from lag in base. That is, coerced to 'ts' if necessary, and subsequently shifted.
Neither the question nor the current answer included the code needed to load the quantmod package:
library(quantmod)
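To see the distinction concretely, a minimal sketch (the values are illustrative):
x <- c(0.1, 0.2, 0.3, 0.4)
Lag(x, k = 1)             # quantmod::Lag: values shifted down, NA-padded, one-column matrix "Lag.1"
stats::lag(ts(x), k = 1)  # stats::lag: the same values; only the time base moves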
The other learning opportunity is that the expression paste('df$', lookup) will never be effective. That approach probably comes from experience with what are called "macro" languages. R does not parse and interpret constructed strings like that: unquoted names typed at the console are handled differently from strings built with paste or paste0. As #akrun demonstrated, it is possible to use the extraction and assignment operators, [[ and [[<-, with string-valued names.
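A quick sketch of the contrast, reusing the same df and lookup as above:
lookup <- 'returns'
paste0('df$', lookup)                                       # just the string "df$returns", not the column
df[[lookup]]                                                # [[ extracts the column named by the string
df[[paste0('lag_', 2)]] <- Lag(df[[lookup]], k = 2)[, 1]    # [[<- assigns to a column by a built name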
And a third learning opportunity comes from noticing that the name that appears at the top of your new column is not the same one that you assigned to it. What happened is that the result from quantmod::Lag was a one-column matrix named "Lag.1" rather than a vector. The quantmod package is designed to work with zoo-like objects, which are matrices rather than dataframes. Note further that trying to access that column with the name that appears in the print representation will not succeed:
> str(df)
'data.frame': 8 obs. of 3 variables:
$ tickers: chr "AAPL" "AAPL" "AAPL" "AAPL" ...
$ returns: num 0.1 0.2 0.3 -0.15 -0.25 0.09 0.4 -0.2
$ lag_1 : num [1:8, 1] NA 0.1 0.2 0.3 -0.15 -0.25 0.09 0.4
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Lag.1"
> df$Lag.1 # FAIL
NULL
> df$lag_1 # Success
Lag.1
[1,] NA
[2,] 0.10
[3,] 0.20
[4,] 0.30
[5,] -0.15
[6,] -0.25
[7,] 0.09
[8,] 0.40
If you will be using "quantmod" or "tidyquant", you will definitely need to understand the differences between accessing values inside matrices and accessing values inside dataframes.
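If you prefer plain vector columns, one option (a sketch, equivalent in effect to the [, 1] subscript used above) is to drop the matrix structure before assigning:
df$lag_1 <- drop(Lag(df$returns, k = 1))   # drop() discards the dim and dimnames
str(df$lag_1)                              # now an ordinary numeric vector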
Imagine there are 4 cards on the desk, laid out over several rows (e.g., 5 rows in the demo). The value of each card is already listed in the demo data frame; however, the exact position of each card is indexed by the pos columns, see the demo data I generated below.
To put the cards' values back into their original positions, I swap them row by row with [ ] indexing; the following code already fulfills that purpose. To avoid the explicit loop, I wonder whether I can achieve the same effect with a vectorized approach from the tidyverse family, e.g. pmap or a related function from the purrr package?
# 1. data generation ------------------------------------------------------
rm(list=ls())
vect<-matrix(round(runif(20),2),nrow=5)
colnames(vect)<-paste0('card',1:4)
order<-rbind(c(2,3,4,1),c(3,4,1,2),c(1,2,3,4),c(4,3,2,1),c(3,4,2,1))
colnames(order)=paste0('pos',1:4)
dat<-data.frame(vect,order,stringsAsFactors = F)
# 2. data swap ------------------------------------------------------------
for (i in 1:dim(dat)[1]){
orders=dat[i,paste0('pos',1:4)]
card=dat[i,paste0('card',1:4)]
vec<-card[order(unlist(orders))]
names(vec)=paste0('deck',1:4)
dat[i,paste0('deck',1:4)]<-vec
}
dat
You could use pmap_dfr:
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))
dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~{
x <- c(...)
as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
One thing to note here: make sure the output of the pmap function does not carry the original column names. If it did, the row-binding step would realign the columns by name and the output would not be in the intended order; unname is used above to strip the names.
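To see why the unname matters, here is a small illustration with made-up values; pmap_dfr row-binds its pieces (via dplyr::bind_rows), which realigns columns by name whenever names are present:
library(dplyr)
a <- c(card1 = 0.86, card2 = 0.05)
b <- c(card2 = 0.20, card1 = 0.79)
bind_rows(as.data.frame(t(a)), as.data.frame(t(b)))                  # rows re-ordered by name
bind_rows(as.data.frame(t(unname(a))), as.data.frame(t(unname(b))))  # kept in position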
I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4x0+0.3x0+0.2x0.05+0.1x0.07, and SWeights = 0.2+0.1.
My initial idea was to assign the weights to each column like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. Soon I realize that this is impossible as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type. I'm not sure whether that is the case with your real data, but I would first convert the columns to an appropriate type:
df1[-1] <- lapply(df1[-1], type.convert)
Next, we will convert the column names to a numeric class too
vec <- as.numeric(names(df1)[-1])
Finally, we can create the new columns in two simple steps. This does incur a matrix-conversion overhead, but perhaps you should be working with matrices in the first place. Either way, this is fully vectorized:
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2000-04-30 0.05 0.03 0.19 0.03      0.070      1.0
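As a quick sanity check against the hand calculation in the question (first row, 2000-01-31; columns 2 to 5 are the return columns, whose names are the weights):
sum(vec * unlist(df1[1, 2:5]))      # 0.4*0 + 0.3*0 + 0.2*0.05 + 0.1*0.07 = 0.017
sum(vec[unlist(df1[1, 2:5]) > 0])   # 0.2 + 0.1 = 0.3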
Or, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient, as these are basically by-row operations:
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE)[
  , c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2000-04-30      0.070      1.0
I created a data.frame that holds my words and their frequencies. Now I would like to run findAssocs against every row of my frame, but I cannot get my code to work. Any help is appreciated.
Here is an example of my data.frame term.df
term.df <- data.frame(word = names(v),freq=v)
word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728
I created a TermDocumentMatrix tdm and the following code works as expected.
findAssocs(tdm, 'frozen', 0.20)
I would like to append the output of findAssocs as a new column.
Here's the code I tried:
library(dplyr)
library(tm)
library(pbapply)
#I would like to append all findings in a new column
res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)
EDIT:
As for the output, the single findAssocs statement above gets me something like this:
$yogurt
greek ellenos fat chobani dannon fage yoplait nonfat wallaby
0.62 0.36 0.25 0.24 0.24 0.24 0.24 0.22 0.20
I was hoping it would be possible to add a single column to my original table (ASSOC) and put the results as comma separated name:value tuples but I'm really open to ideas.
I think a structure that would be simplest to handle would be a nested list:
lapply(seq_len(nrow(text.df)), function(i) {
list(word=text.df$word[i],
freq=text.df$freq[i],
assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
#
# [[1]]$freq
# [1] 3
#
# [[1]]$assoc
# 15.8 opec clearly late trying who winter analysts
# 0.87 0.87 0.80 0.80 0.80 0.80 0.80 0.79
# said meeting above emergency market fixed that prices
# 0.78 0.77 0.76 0.75 0.75 0.73 0.73 0.72
# agreement buyers
# 0.71 0.70
#
#
# [[2]]
# [[2]]$word
# [1] "opec"
#
# [[2]]$freq
# [1] 2
#
# [[2]]$assoc
# meeting emergency oil 15.8 analysts buyers above
# 0.88 0.87 0.87 0.85 0.85 0.83 0.82
# said ability they prices. agreement but clearly
# 0.82 0.80 0.80 0.79 0.76 0.74 0.74
# december. however, late production sell trying who
# 0.74 0.74 0.74 0.74 0.74 0.74 0.74
# winter quota that through bpd market
# 0.74 0.73 0.73 0.73 0.70 0.70
#
#
# [[3]]
# [[3]]$word
# [1] "xyz"
#
# [[3]]$freq
# [1] 1
#
# [[3]]$assoc
# numeric(0)
In my experience this will be easier to handle than a collapsed string, because you can still access the word associations for each row of your original text.df object by accessing the corresponding element of the returned list.
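For example, assuming the lapply() result above was assigned to a variable, say res <- lapply(...) exactly as written (the name res is mine):
res[[2]]$word            # "opec"
res[[2]]$freq            # 2
head(res[[2]]$assoc, 3)  # the first few associated terms and their correlations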
If you really want to keep a data frame structure, then you could pretty easily convert the findAssocs output to a string representation, for instance using toJSON:
library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
# assoc
# 1 { "15.8": 0.87,"opec": 0.87,"clearly": 0.8,"late": 0.8,"trying": 0.8,"who": 0.8,"winter": 0.8,"analysts": 0.79,"said": 0.78,"meeting": 0.77,"above": 0.76,"emergency": 0.75,"market": 0.75,"fixed": 0.73,"that": 0.73,"prices": 0.72,"agreement": 0.71,"buyers": 0.7 }
# 2 { "meeting": 0.88,"emergency": 0.87,"oil": 0.87,"15.8": 0.85,"analysts": 0.85,"buyers": 0.83,"above": 0.82,"said": 0.82,"ability": 0.8,"they": 0.8,"prices.": 0.79,"agreement": 0.76,"but": 0.74,"clearly": 0.74,"december.": 0.74,"however,": 0.74,"late": 0.74,"production": 0.74,"sell": 0.74,"trying": 0.74,"who": 0.74,"winter": 0.74,"quota": 0.73,"that": 0.73,"through": 0.73,"bpd": 0.7,"market": 0.7 }
# 3 [ ]
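And if you later need the numbers back, the JSON strings parse straightforwardly with the same RJSONIO package (a sketch):
head(unlist(fromJSON(text.df$assoc[1])), 3)  # named correlations for "oil"
fromJSON(text.df$assoc[3])                   # the empty case comes back empty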
Data:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
I haven't found something which precisely matches what I need, so I thought I'd post this.
I have a number of calculations that basically apply a function over a rolling window of a variable, and whose results should naturally flow back into the dataframe they came from.
For example,
data<-as.data.frame(as.matrix(seq(1:30)))
data$V1<-data$V1/100
str(data)
data$V1_MA5d<-NA # rolling 5 day product
for (i in 5:nrow(data)){
start<-i-5
end<-i
data$V1_MA5d[i]<- (prod(((data$V1[start:end]/100)+1))-1)*100
}
data
> head(data,15)
V1 V1_MA5d
1 0.01 NA
2 0.02 NA
3 0.03 NA
4 0.04 NA
5 0.05 0.1500850
6 0.06 0.2101751
7 0.07 0.2702952
8 0.08 0.3304453
9 0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389
But really, I should be able to do something like:
data$V1_MA5d<-sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100
But I'm not sure what that would look like.
Likewise, the count of a variable by another variable:
data$V1_MA5_cat<-NA
data$V1_MA5_cat[data$V1_MA5d<.5]<-0
data$V1_MA5_cat[data$V1_MA5d>.5]<-1
data$V1_MA5_cat[data$V1_MA5d>1.5]<-2
table(data$V1_MA5_cat)
data$V1_MA5_cat_n<-NA
data$V1_MA5_cat_n[data$V1_MA5_cat==0]<-nrow(subset(data,V1_MA5_cat==0))
data$V1_MA5_cat_n[data$V1_MA5_cat==1]<-nrow(subset(data,V1_MA5_cat==1))
data$V1_MA5_cat_n[data$V1_MA5_cat==2]<-nrow(subset(data,V1_MA5_cat==2))
> head(data,15)
V1 V1_MA5d V1_MA5_cat V1_MA5_cat_n
1 0.01 NA NA NA
2 0.02 NA NA NA
3 0.03 NA NA NA
4 0.04 NA NA NA
5 0.05 0.1500850 0 6
6 0.06 0.2101751 0 6
7 0.07 0.2702952 0 6
8 0.08 0.3304453 0 6
9 0.09 0.3906255 0 6
10 0.10 0.4508358 0 6
11 0.11 0.5110762 1 17
12 0.12 0.5713467 1 17
13 0.13 0.6316473 1 17
14 0.14 0.6919780 1 17
15 0.15 0.7523389 1 17
I know there is a better way - help!
You can do this one of a few ways. It's worth mentioning here that you did write a "correct" for loop in R: you preallocated the vector by assigning data$V1_MA5d <- NA. This way you are filling rather than growing, and it's actually fairly efficient. However, if you want to use the apply family:
sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
[1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766
Notice my code inside the [] is different from yours. Check out the difference:
i <- 10
i - 5:i    # 5 4 3 2 1 0  -- ":" binds tighter, so this is i - (5:i)
(i-5):i    # 5 6 7 8 9 10 -- the six-index window the original loop uses
Or you can use rollapply from the zoo package:
library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
[1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569
As per the comment, this will give you a vector of length 26. Instead, you can add a few arguments to rollapply to make it line up with your original data:
rollapply(data$V1, 5, myfun, fill=NA, align='right')
In regard to your second question, plyr is handy here.
library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))
But there are many other choices too.
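For instance, one base-R variant of the same count (a sketch reusing the breaks above) is ave(), which writes the group sizes straight back into the frame:
data$cuts <- cut(data$V1_MA5d, breaks = c(-Inf, 0.5, 1.5, Inf))
data$V1_MA5_cat_n <- ave(data$V1_MA5d, data$cuts, FUN = length)  # group size per row; rows with NA cuts stay NA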
Pretty newb question here, but I have not been able to track down a solution for some time:
I have an XTS object of trading indicators (indicate) for stock data that looks like
A XOM MSFT
2000-11-30 -0.59 0.22 0.10
2000-12-29 0.55 -0.23 0.05
2001-01-30 -0.52 0.09 -0.10
And a table with an identical index for the corresponding period returns (return) that looks like
A XOM MSFT
2000-11-30 -0.15 0.10 0.03
2000-12-29 0.03 -0.05 0.02
2001-01-30 -0.04 0.02 -0.05
I have sorted the indicator table and had it return the column names with the following code:
indicate.label <- colnames(indicate)
indicate.rank <- t(apply(indicate, 1, function(x) indicate.label[order(-x)]))
indicate.rank <- xts(indicate.rank, order.by = index(return))
Which gives the table (indicate.rank) of the symbol names ranked by their trading indicator:
1 2 3
2000-11-30 XOM MSFT A
2000-12-29 A MSFT XOM
2001-01-30 XOM A MSFT
I would like to also have a table that gives the period returns based on the indicator rank:
2000-11-30 0.10 0.03 -0.15
2000-12-29 0.03 0.02 -0.05
2001-01-30 0.02 -0.04 -0.05
I cannot figure out how to pick the correct symbol's return for each row, or how to sort the return table based on the ordering of indicate.
Thank you for any suggestions.
Trevor J
I'm not particularly satisfied with this solution, but it works.
row.rank <- t(apply(indicate, 1, order, decreasing=TRUE))
indicate.rank <- return.rank <- indicate # pre-allocate
for(i in 1:NROW(indicate.rank)) {
indicate.rank[i,] <- colnames(indicate)[row.rank[i,]]
return.rank[i,] <- return[i,row.rank[i,]]
}
It would probably be easier to handle this if the returns and the indicators for each symbol were in the same object, but I don't know how that would fit with the rest of your workflow.
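If the explicit loop bothers you, here is a loop-free sketch of the same idea using matrix indexing (it reuses row.rank from above and assumes, as in the question, that indicate and return share the same index and column order):
idx <- cbind(rep(seq_len(nrow(return)), ncol(return)), as.vector(row.rank))
return.rank <- xts(matrix(coredata(return)[idx], nrow = nrow(return)),
                   order.by = index(return))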