Gathering multiple data columns currently in factor form - r

I have a dataset of train carloads. It currently has a number (weekly carload) listed for each company (the row) for each week (the columns) over the course of a couple years (100+ columns). I want to gather this into just two columns: a date and loads.
It currently looks like this:
3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5
I'm looking for:
Date Load
3/29/2017 32.7
3/29/2017 20.5
3/29/2017 24.1
3/29/2017 24.9
4/5/2017 31.6
I've been doing various versions of the following:
rail3 <- rail2 %>%
gather(`3/29/2017`:`1/24/2018`, key = "date", value = "loads")
When I do this it makes a dataset called rail3, but it didn't make the new columns I wanted. It only made the dataset 44 times longer than it was. And it gave me the following message:
Warning message:
attributes are not identical across measure variables;
they will be dropped
I'm assuming this is because the date columns are currently coded as factors. But I'm also not sure how to convert 100+ columns from factors to numeric. I've tried the following and various other methods:
rail2["3/29/2017":"1/24/2018"] <- lapply(rail2["3/29/2017":"1/24/2018"], as.numeric)
None of this has worked. Let me know if you have any advice. Thanks!

If you want to avoid the warning when gathering and end up with a date column and a numeric column in the final data frame, you can do:
library(tidyr)
library(hablar)
# Data from above but with factors
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE) %>%
as_tibble() %>%
convert(fct(everything()))
# Code
rail2 %>%
convert(num(everything())) %>%
gather("date", "load") %>%
convert(dte(date, .args = list(format = "%m/%d/%Y")))
Gives:
# A tibble: 16 x 2
date load
<date> <dbl>
1 2017-03-29 32.7
2 2017-03-29 20.5
3 2017-03-29 24.1
4 2017-03-29 24.9
5 2017-04-05 31.6

Here is a possible solution:
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE)
library(tidyr)
# gather the data from columns and convert to long format.
rail3 <- rail2 %>% gather(key="date", value="load")
rail3
# date load
#1 3/29/2017 32.7
#2 3/29/2017 20.5
#3 3/29/2017 24.1
#4 3/29/2017 24.9
#5 4/5/2017 31.6
#6 4/5/2017 21.8
#7 ...
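If the columns really are factors, as the question suspects, a base R sketch of the conversion could look like the following. The toy data frame is an assumption that mirrors the question's layout. The main point is that as.numeric() on a factor returns the internal level codes, so the values go through as.character() first.

```r
# Toy data in the question's shape; every column is a factor (assumed)
rail2 <- data.frame(
  `3/29/2017` = factor(c("32.7", "20.5", "24.1", "24.9")),
  `4/5/2017`  = factor(c("31.6", "21.8", "24.1", "24.7")),
  check.names = FALSE
)

# Convert factor labels (not level codes) to numbers, column by column
rail2[] <- lapply(rail2, function(x) as.numeric(as.character(x)))

# Long format with base R: one row per (date, load) pair
rail3 <- data.frame(
  date = as.Date(rep(names(rail2), each = nrow(rail2)), format = "%m/%d/%Y"),
  load = unlist(rail2, use.names = FALSE)
)
rail3
```

After the conversion, gather() or pivot_longer() should also work on rail2 without the attributes warning.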


Problem with return calculations using "tq_mutate" function in R

I try to calculate stock returns for different time periods for a very large dataset.
I noticed that there are some inconsistencies with tq_mutate calculations and my checking:
library(tidyquant)
A_stock_prices <- tq_get("A",
get = "stock.prices",
from = "2000-01-01",
to = "2004-12-31")
print(A_stock_prices[A_stock_prices$date>"2000-12-31",])
# A tibble: 1,003 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2001-01-02 38.5 38.5 35.1 36.4 2261684 **31.0**
2 A 2001-01-03 35.1 40.4 34.0 40.1 4502678 34.2
3 A 2001-01-04 40.7 42.7 39.6 41.7 4398388 35.4
4 A 2001-01-05 41.0 41.7 38.3 39.4 3277052 33.5
5 A 2001-01-08 38.8 39.9 37.4 38.1 2273288 32.4
6 A 2001-01-09 38.3 39.3 37.1 37.9 2474180 32.3
...
1 A 2001-12-21 19.7 20.2 19.7 20.0 3732520 17.0
2 A 2001-12-24 20.4 20.5 20.1 20.4 1246177 17.3
3 A 2001-12-26 20.5 20.7 20.1 20.1 2467051 17.1
4 A 2001-12-27 20.0 20.7 20.0 20.6 1909948 17.5
5 A 2001-12-28 20.7 20.9 20.4 20.7 1600430 17.6
6 A 2001-12-31 20.5 20.8 20.4 20.4 2142016 **17.3**
A_stock_prices %>%
tq_transmute (select = adjusted,
mutate_fun = periodReturn,
period = "yearly") %>%
ungroup()
# A tibble: 5 x 2
date yearly.returns
<date> <dbl>
1 2000-12-29 -0.240
2 2001-12-31 -0.479
3 2002-12-31 -0.370
4 2003-12-31 0.628
5 2004-12-30 -0.176
Now, based on the calculation, the yearly return for the year 2001 is: "-0.479"
But when I calculate the yearly return myself (the adjusted price at the end of the period divided by the adjusted price at the beginning of the period, minus 1), I get a different result:
A_stock_prices[A_stock_prices$date=="2001-12-31",]$adjusted/
A_stock_prices[A_stock_prices$date=="2001-01-02",]$adjusted-1
"-0.439"
Same issue persists with other time periods (e.g., monthly or weekly calculations).
What am I missing?
Update: The very strange thing is that if I change the start date in tq_get to 2001:
A_stock_prices <- tq_get("A",
get = "stock.prices",
from = "2001-01-01",
to = "2004-01-01")
I get the correct result for the year 2001 (but not for other years).
Not sure how your dataset is built but what's the first date for the 2001 group? Your manual attempt has it as January 2nd, 2001. If there's data present for January 1st, what's that result?
If that's not it, I'd recommend posting your data, just so we can see how it's structured.
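Eventually I figured it out:
The return calculation (periodReturn, called through tq_transmute) measures each period's return from the last observation before the requested period, not from the first observation inside it. tq_get() only downloads the prices.
I.e., the yearly return for 2022 is calculated from (say) 31/12/2021 to 31/12/2022 (rather than from 01/01/2022). This would also explain the update: when the data starts in 2001, there is no earlier observation, so the first period appears to be measured from its own first observation.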
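To make the difference between the two conventions concrete, here is a small base R sketch. The prices are made-up round numbers, not the actual data for "A".

```r
# Hypothetical adjusted prices: last trading day of 2000,
# first trading day of 2001, last trading day of 2001
prices <- c("2000-12-29" = 100, "2001-01-02" = 90, "2001-12-31" = 120)

# Convention 1: return measured from the previous period's last observation
ret_prev_close <- prices[["2001-12-31"]] / prices[["2000-12-29"]] - 1

# Convention 2: return measured from the first observation inside the period
ret_within_year <- prices[["2001-12-31"]] / prices[["2001-01-02"]] - 1

c(prev_close = ret_prev_close, within_year = ret_within_year)
```

The two numbers differ whenever the price moves between the last day of the old period and the first day of the new one, which is the kind of gap seen in the question.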

How to refer to list-column using purrr function (map()) with varying input

How do I properly refer to a list element in R when I am using map() (or any other purrr function) and want to use the "x" from map() to pick the appropriate element? For example, I have a list of 3 (let's call it testlist), and each element is a data frame with a single column. Each column is a character vector of symbols (in this case, symbols to be passed to tq_get in tidyquant). Below is some simplified code to help illustrate.
The following code works, but it's hardcoded:
library(tidyverse)
library(lubridate)
library(tidyquant)
library(purrr)
library(dplyr)
str(testlist)
List of 3
$ 2010-12-31:'data.frame': 12 obs. of 1 variable:
..$ symbol: chr [1:12] "ASH" "RS" "FUL" "RGLD" ...
$ 2011-12-31:'data.frame': 15 obs. of 1 variable:
..$ symbol: chr [1:15] "CBT" "RS" "TCK" "MEOH" ...
$ 2012-12-31:'data.frame': 13 obs. of 1 variable:
..$ symbol: chr [1:13] "CBT" "ATI" "RS" "SXT" ...
d <- tq_get((pull(testlist$`2012-12-31`)),
get = "stock.prices",
from = "2011-12-30",
to = "2013-12-31")
To clarify, each data frame within the "testlist" list is named with a date, in this case 2012-12-31.
However, I would like to vary the date when referring to each data frame within "testlist". For example:
year <- as.Date("2012-12-31")
d <- tq_get((pull(testlist[year])),
get = "stock.prices",
from = "2011-12-30",
to = "2013-12-31")
This does not work. I have determined that if I'm referring to a column within a data frame, this will work:
testlist[,as.character(year)]
But clearly, referring to a column in a data frame is different from referring to a data frame within a list.
Here is the expected output. It works for the first example and does not work for the second.
d
# A tibble: 6,526 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CBT 2011-12-30 32.2 32.4 32.0 32.1 216100 25.9
2 CBT 2012-01-03 33.2 33.6 32.9 33.2 410500 26.8
3 CBT 2012-01-04 33.1 33.4 32.7 33.2 502100 26.8
4 CBT 2012-01-05 32.9 32.9 32.0 32.7 688400 26.4
5 CBT 2012-01-06 32.8 33.1 31.7 32.8 951900 26.4
6 CBT 2012-01-09 32.9 33.2 32.5 32.7 393100 26.4
7 CBT 2012-01-10 33.3 33.9 33.2 33.3 306300 26.9
8 CBT 2012-01-11 33.3 33.7 33.2 33.5 209700 27.0
9 CBT 2012-01-12 33.7 34.4 33.4 34.3 209800 27.7
10 CBT 2012-01-13 34.0 34.2 33.3 33.9 273200 27.4
# ... with 6,516 more rows
Any help would be appreciated!
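One possible approach, sketched under the assumption that testlist is a plain named list whose names are the date strings shown by str(): list names are character strings, so a Date has to be converted with as.character() before it is used as an index, and [[ ]] is needed to get the data frame itself rather than a one-element list. The testlist below is a shortened, hypothetical stand-in.

```r
# Hypothetical stand-in for testlist: a named list of one-column data frames
testlist <- list(
  "2011-12-31" = data.frame(symbol = c("CBT", "RS", "TCK"), stringsAsFactors = FALSE),
  "2012-12-31" = data.frame(symbol = c("CBT", "ATI", "RS"), stringsAsFactors = FALSE)
)

year <- as.Date("2012-12-31")

# Index by name: convert the Date to character, use [[ ]] to extract the element
symbols <- testlist[[as.character(year)]]$symbol

# The same lookup for every element of the list; inside the function,
# tq_get() could be called on the symbols instead of just returning them
symbols_by_year <- lapply(names(testlist), function(nm) testlist[[nm]]$symbol)
names(symbols_by_year) <- names(testlist)
```

With purrr, the lapply() call could be swapped for map() over names(testlist) in the same way.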

How to reshape my data frame to use TTR in R?

Basically, TTR calculates technical indicators for a ticker, and it expects the data in vertical (long) form, like:
Date Open High Low Close
2014-05-16 16.83 16.84 16.63 16.71
2014-05-19 16.73 16.93 16.66 16.80
2014-05-20 16.80 16.81 16.58 16.70
but my data frame is like:
Sdate Edate Tickers Open_1 Open_2 Open_3 High_1 High_2 High_3 Low_1 Low_2 Low_3 Close_1 Close_2 Close_3
2014-05-16 2014-07-21 TK 31.6 31.8 32.2 32.4 32.4 33.0 31.1 31.5 32.1 32.1 32.1 32.7
2014-05-17 2014-07-22 TGP 25.1 24.8 25.0 25.1 25.3 25.8 24.1 24.4 24.9 24.8 25.0 25.6
2014-05-18 2014-07-23 DNR 3.4 3.5 3.8 3.6 3.8 4.1 3.3 3.5 3.8 3.5 3.7 3.9
As you can see, I have multiple tickers and time ranges. I went through the TTR package documentation, and it does not state how to get technical indicators from data that is laid out horizontally with multiple tickers. My original data has 50 days and thousands of tickers. I understand that I need to make a list with one entry per ticker, but I'm confused about how to do this. How do I achieve this?
You can get the data in vertical shape by using pivot_longer:
out <- tidyr::pivot_longer(df, cols = -c(Sdate,Edate, Tickers),
names_to = c('.value', 'num'),
names_sep = '_')
out
# A tibble: 9 x 8
# Sdate Edate Tickers num Open High Low Close
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 2014-05-16 2014-07-21 TK 1 31.6 32.4 31.1 32.1
#2 2014-05-16 2014-07-21 TK 2 31.8 32.4 31.5 32.1
#3 2014-05-16 2014-07-21 TK 3 32.2 33 32.1 32.7
#4 2014-05-17 2014-07-22 TGP 1 25.1 25.1 24.1 24.8
#5 2014-05-17 2014-07-22 TGP 2 24.8 25.3 24.4 25
#6 2014-05-17 2014-07-22 TGP 3 25 25.8 24.9 25.6
#7 2014-05-18 2014-07-23 DNR 1 3.4 3.6 3.3 3.5
#8 2014-05-18 2014-07-23 DNR 2 3.5 3.8 3.5 3.7
#9 2014-05-18 2014-07-23 DNR 3 3.8 4.1 3.8 3.9
If you want to split the above data into list of dataframes based on Ticker you can use split.
split(out, out$Tickers)
data
df <- structure(list(Sdate = c("2014-05-16", "2014-05-17", "2014-05-18"
), Edate = c("2014-07-21", "2014-07-22", "2014-07-23"), Tickers = c("TK",
"TGP", "DNR"), Open_1 = c(31.6, 25.1, 3.4), Open_2 = c(31.8,
24.8, 3.5), Open_3 = c(32.2, 25, 3.8), High_1 = c(32.4, 25.1,
3.6), High_2 = c(32.4, 25.3, 3.8), High_3 = c(33, 25.8, 4.1),
Low_1 = c(31.1, 24.1, 3.3), Low_2 = c(31.5, 24.4, 3.5), Low_3 = c(32.1,
24.9, 3.8), Close_1 = c(32.1, 24.8, 3.5), Close_2 = c(32.1,
25, 3.7), Close_3 = c(32.7, 25.6, 3.9)),
class = "data.frame", row.names = c(NA, -3L))
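As a follow-up sketch of how the split list might be used: once there is one data frame per ticker, a function can be applied to each with lapply(). The data below is a cut-down, hypothetical version of the long table, and diff() is only a placeholder for whatever indicator function you want to run on each ticker's series.

```r
# Hypothetical long-format data in the shape pivot_longer() produces above
out <- data.frame(
  Tickers = rep(c("TK", "TGP"), each = 3),
  num     = rep(1:3, times = 2),
  Close   = c(32.1, 32.1, 32.7, 24.8, 25.0, 25.6)
)

# One data frame per ticker (list names are the ticker symbols)
by_ticker <- split(out, out$Tickers)

# Apply a function to each ticker's Close series; diff() is a placeholder
close_changes <- lapply(by_ticker, function(d) diff(d$Close))
close_changes
```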

tq_mutate() and Volume indicators in R

I am using the tidyquant package in R to calculate indicators for every symbol in the SP500.
As a sample of code:
stocks_w_price_indicators<- stocks2 %>%
group_by(symbol)%>%
tq_mutate(select=close,mutate_fun=RSI) %>%
tq_mutate(select=c(high,low,close),mutate_fun=CLV)
This works for price-based indicators, but not indicators that include volume.
I get "Evaluation error: argument "volume" is missing, with no default."
stocks_w_price_indicators<- stocks2 %>%
group_by(symbol)%>%
tq_mutate(select=close,mutate_fun=RSI) %>%
tq_mutate(select=c(high,low,close,volume),mutate_fun=CMF)
How can I get indicators that include volume to calculate properly?
A few functions from the TTR package cannot be used directly with tq_mutate, because they need three inputs (like adjRatios) or an HLC object plus a separate volume column (like CMF). Normally you would solve this with tq_mutate_xy, but it cannot pass the HLC object that CMF needs. A function such as OBV from TTR, which needs only a price and a volume column, works fine with tq_mutate_xy.
That leaves two options: either the CMF function is adjusted to handle an (O)HLCV object, or you create your own function.
The second option is the fastest. Since CMF calls CLV internally, you can take your first code block and extend it with a normal dplyr::mutate call to calculate the CMF.
# create a function to calculate the Chaikin money flow
tq_cmf <- function(clv, volume, n = 20){
runSum(clv * volume, n)/runSum(volume, n)
}
stocks_w_price_indicators <- stocks2 %>%
group_by(symbol) %>%
tq_mutate(select = close, mutate_fun = RSI) %>%
tq_mutate(select = c(high, low, close), mutate_fun = CLV) %>%
mutate(cmf = tq_cmf(clv, volume, 20))
# A tibble: 5,452 x 11
# Groups: symbol [2]
symbol date open high low close volume adjusted rsi clv cmf
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MSFT 2008-01-02 35.8 36.0 35 35.2 63004200 27.1 NA -0.542 NA
2 MSFT 2008-01-03 35.2 35.7 34.9 35.4 49599600 27.2 NA 0.291 NA
3 MSFT 2008-01-04 35.2 35.2 34.1 34.4 72090800 26.5 NA -0.477 NA
4 MSFT 2008-01-07 34.5 34.8 34.2 34.6 80164300 26.6 NA 0.309 NA
5 MSFT 2008-01-08 34.7 34.7 33.4 33.5 79148300 25.7 NA -0.924 NA
6 MSFT 2008-01-09 33.4 34.5 33.3 34.4 74305500 26.5 NA 0.832 NA
7 MSFT 2008-01-10 34.3 34.5 33.8 34.3 72446000 26.4 NA 0.528 NA
8 MSFT 2008-01-11 34.1 34.2 33.7 33.9 55187900 26.1 NA -0.269 NA
9 MSFT 2008-01-14 34.5 34.6 34.1 34.4 52792200 26.5 NA 0.265 NA
10 MSFT 2008-01-15 34.0 34.4 34 34 61606200 26.2 NA -1 NA
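For readers who want to check the arithmetic behind tq_cmf, here is a base R sketch of the same calculation on made-up prices. The helper names clv_manual and roll_sum are hypothetical, the window is shortened to n = 2 so the result is easy to verify by hand, and the close location value uses the usual formula ((close - low) - (high - close)) / (high - low).

```r
# Made-up prices and volumes for one symbol
high   <- c(10, 11, 12, 13)
low    <- c(8, 9, 10, 11)
close  <- c(9, 11, 10, 12)
volume <- c(100, 200, 300, 400)

# Close location value: where the close sits within the high-low range (-1 to 1)
clv_manual <- ((close - low) - (high - close)) / (high - low)

# Trailing rolling sum over n observations; the first n - 1 values are NA
roll_sum <- function(x, n) as.numeric(stats::filter(x, rep(1, n), sides = 1))

# Chaikin money flow: volume-weighted average of the CLV over the window
n <- 2
cmf <- roll_sum(clv_manual * volume, n) / roll_sum(volume, n)
cmf
```

The leading NA values in the cmf column of the output above come from the same effect: with n = 20, the first 19 rows of each symbol have no complete window.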

run function on consecutive vals with specific range in the vector with R

Suppose I have a vector tmp of size 100.
I want to know where the average of each 4 consecutive elements is, for example, larger than 10,
i.e. I want to know which of these: mean(tmp[c(1,2,3,4)]), mean(tmp[c(2,3,4,5)]), mean(tmp[c(3,4,5,6)]), and so on up to mean(tmp[c(97,98,99,100)])
are larger than 10.
How can I do it without a loop?
(A loop takes too long, since I have a table of 500000 rows by 60 columns.)
And not only the average, but also the difference, the sum, and so on.
I have tried building shifted index vectors like this:
tmp<-seq(1,100,1)
one<-seq(1,97,1)
two<-seq(2,98,1)
tree<-seq(3,99,1)
four<-seq(4,100,1)
aa<-(tmp[one]+tmp[two]+tmp[tree]+tmp[four])/4
which(aa>10)
It works, but it is not practical if you want, for example, an average over 12 elements.
Here is an example of what I do, to be clear:
b12<-seq(1,988,1)
b11<-seq(2,989,1)
b10<-seq(3, 990,1)
b9<-seq(4,991,1)
b8<-seq(5,992,1)
b7<-seq(6,993,1)
b6<-seq(7,994,1)
b5<-seq(8, 995,1)
b4<-seq(9,996,1)
b3<-seq(10,997,1)
b2<-seq(11,998,1)
b1<-seq(12,999,1)
now<-seq(13, 1000,1)
po<-rpois(1000,4)
nor<-rnorm(1000,5,0.2)
uni<-runif(1000,10,75)
chis<-rchisq(1000,3,0)
which((po[now]/nor[now])>1 & (nor[b12]/nor[now])>1 &
((po[now]/po[b4])>1 | (uni[now]-uni[b4])>=0) &
((chis[now]+chis[b1]+chis[b2]+chis[b3])/4)>2 &
(uni[now]/max(uni[b1],uni[b2],uni[b3],uni[b4],
uni[b5],uni[b6],uni[b7],uni[b8]))>0.5)+12
This code gives me the exact indices in the real table
that match all the conditions.
I have 58 variables with 550000 rows.
Thank you.
The question is not very clear. Based on the wording, I guess, this should help:
n <- 100
res <- sapply(1:(n-3), function(i) mean(tmp[i:(i+3)]))
which(res >10)
Also,
m1 <- matrix(tmp[1:4 + rep(0:96, each = 4)], ncol = 4, byrow = TRUE)
which(rowMeans(m1) >10)
Maybe you should look at the rollapply function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
# [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)
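Since the question also asks about sums and differences and about wider windows, here is another base R sketch using embed(), which lays out every window of a given width as one row of a matrix. The variable names are only illustrative. Changing width to 12 gives the 12-element case without writing twelve index vectors.

```r
tmp <- seq(1, 100, 1)
width <- 4

# Each row is one window; column 1 holds the newest value, column `width` the oldest
windows <- embed(tmp, width)

roll_mean <- rowMeans(windows)                  # average of each window
roll_sum  <- rowSums(windows)                   # sum of each window
roll_diff <- windows[, 1] - windows[, width]    # last minus first value of each window

# Index i refers to the window tmp[i:(i + width - 1)]
which(roll_mean > 10)
```

For a table with hundreds of thousands of rows, this keeps everything vectorized, at the cost of building a matrix with width columns per variable.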
