How to reshape my data frame to use TTR in R? [duplicate] - r

This question already has an answer here:
Using Reshape from wide to long in R [closed]
(1 answer)
Closed 2 years ago.
Basically, TTR computes technical indicators for a ticker, and the data should be in vertical (long) format like:
Date Open High Low Close
2014-05-16 16.83 16.84 16.63 16.71
2014-05-19 16.73 16.93 16.66 16.80
2014-05-20 16.80 16.81 16.58 16.70
but my data frame is like:
Sdate Edate Tickers Open_1 Open_2 Open_3 High_1 High_2 High_3 Low_1 Low_2 Low_3 Close_1 Close_2 Close_3
2014-05-16 2014-07-21 TK 31.6 31.8 32.2 32.4 32.4 33.0 31.1 31.5 32.1 32.1 32.1 32.7
2014-05-17 2014-07-22 TGP 25.1 24.8 25.0 25.1 25.3 25.8 24.1 24.4 24.9 24.8 25.0 25.6
2014-05-18 2014-07-23 DNR 3.4 3.5 3.8 3.6 3.8 4.1 3.3 3.5 3.8 3.5 3.7 3.9
As you can see, I have multiple tickers and time ranges. I went over the TTR package and its documentation does not explain how to get technical indicators from data that is laid out horizontally with multiple tickers. My original data has 50 days and thousands of tickers. I understand that I need to make a list with one element per ticker, but I'm confused about how to do this. How can I achieve this?

You can get the data into vertical (long) shape by using pivot_longer:
out <- tidyr::pivot_longer(df, cols = -c(Sdate,Edate, Tickers),
names_to = c('.value', 'num'),
names_sep = '_')
out
# A tibble: 9 x 8
# Sdate Edate Tickers num Open High Low Close
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 2014-05-16 2014-07-21 TK 1 31.6 32.4 31.1 32.1
#2 2014-05-16 2014-07-21 TK 2 31.8 32.4 31.5 32.1
#3 2014-05-16 2014-07-21 TK 3 32.2 33 32.1 32.7
#4 2014-05-17 2014-07-22 TGP 1 25.1 25.1 24.1 24.8
#5 2014-05-17 2014-07-22 TGP 2 24.8 25.3 24.4 25
#6 2014-05-17 2014-07-22 TGP 3 25 25.8 24.9 25.6
#7 2014-05-18 2014-07-23 DNR 1 3.4 3.6 3.3 3.5
#8 2014-05-18 2014-07-23 DNR 2 3.5 3.8 3.5 3.7
#9 2014-05-18 2014-07-23 DNR 3 3.8 4.1 3.8 3.9
If you want to split the above data into a list of data frames based on Tickers, you can use split.
split(out, out$Tickers)
data
df <- structure(list(Sdate = c("2014-05-16", "2014-05-17", "2014-05-18"
), Edate = c("2014-07-21", "2014-07-22", "2014-07-23"), Tickers = c("TK",
"TGP", "DNR"), Open_1 = c(31.6, 25.1, 3.4), Open_2 = c(31.8,
24.8, 3.5), Open_3 = c(32.2, 25, 3.8), High_1 = c(32.4, 25.1,
3.6), High_2 = c(32.4, 25.3, 3.8), High_3 = c(33, 25.8, 4.1),
Low_1 = c(31.1, 24.1, 3.3), Low_2 = c(31.5, 24.4, 3.5), Low_3 = c(32.1,
24.9, 3.8), Close_1 = c(32.1, 24.8, 3.5), Close_2 = c(32.1,
25, 3.7), Close_3 = c(32.7, 25.6, 3.9)),
class = "data.frame", row.names = c(NA, -3L))
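After the split, an indicator can be applied per ticker with lapply. TTR functions such as SMA() take a numeric vector, so the pattern looks like the sketch below; since most real TTR indicators need longer series than this three-row toy data, a hand-rolled two-period moving average stands in for TTR::SMA(x, n = 2):

```r
# Toy long data, shaped like the pivot_longer output above
out <- data.frame(
  Tickers = rep(c("TK", "TGP"), each = 3),
  Close   = c(32.1, 32.1, 32.7, 24.8, 25.0, 25.6)
)
# Simple 2-period moving average, a stand-in for TTR::SMA(x, n = 2)
sma2 <- function(x) c(NA, (head(x, -1) + tail(x, -1)) / 2)
# Split into a named list of per-ticker data frames, then apply per ticker
by_ticker <- split(out, out$Tickers)
ind <- lapply(by_ticker, function(d) sma2(d$Close))
ind$TK
# [1]   NA 32.1 32.4
```

With real data you would replace sma2 with the TTR function you need and keep the results alongside each data frame in the list.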

Related

Problem with return calculations using "tq_mutate" function in R

I am trying to calculate stock returns over different time periods for a very large dataset.
I noticed some inconsistencies between the tq_mutate calculations and my own checks:
library(tidyquant)
A_stock_prices <- tq_get("A",
get = "stock.prices",
from = "2000-01-01",
to = "2004-12-31")
print(A_stock_prices[A_stock_prices$date>"2000-12-31",])
# A tibble: 1,003 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2001-01-02 38.5 38.5 35.1 36.4 2261684 **31.0**
2 A 2001-01-03 35.1 40.4 34.0 40.1 4502678 34.2
3 A 2001-01-04 40.7 42.7 39.6 41.7 4398388 35.4
4 A 2001-01-05 41.0 41.7 38.3 39.4 3277052 33.5
5 A 2001-01-08 38.8 39.9 37.4 38.1 2273288 32.4
6 A 2001-01-09 38.3 39.3 37.1 37.9 2474180 32.3
...
1 A 2001-12-21 19.7 20.2 19.7 20.0 3732520 17.0
2 A 2001-12-24 20.4 20.5 20.1 20.4 1246177 17.3
3 A 2001-12-26 20.5 20.7 20.1 20.1 2467051 17.1
4 A 2001-12-27 20.0 20.7 20.0 20.6 1909948 17.5
5 A 2001-12-28 20.7 20.9 20.4 20.7 1600430 17.6
6 A 2001-12-31 20.5 20.8 20.4 20.4 2142016 **17.3**
A_stock_prices %>%
tq_transmute (select = adjusted,
mutate_fun = periodReturn,
period = "yearly") %>%
ungroup()
# A tibble: 5 x 2
date yearly.returns
<date> <dbl>
1 2000-12-29 -0.240
2 2001-12-31 -0.479
3 2002-12-31 -0.370
4 2003-12-31 0.628
5 2004-12-30 -0.176
Now, based on the calculation, the yearly return for the year 2001 is: "-0.479"
But, when I calculate the yearly return myself (the close price at the end of the period divided by the close price at the beginning of the period), I get a different result:
A_stock_prices[A_stock_prices$date=="2001-12-31",]$adjusted/
A_stock_prices[A_stock_prices$date=="2001-01-02",]$adjusted-1
"-0.439"
Same issue persists with other time periods (e.g., monthly or weekly calculations).
What am I missing?
Update: The very strange thing is that if I change the time range in tq_get to start in 2001:
A_stock_prices <- tq_get("A",
get = "stock.prices",
from = "2001-01-01",
to = "2004-01-01")
I get the correct result for the year 2001 (but not for other years)..
Not sure how your dataset is built but what's the first date for the 2001 group? Your manual attempt has it as January 2nd, 2001. If there's data present for January 1st, what's that result?
If that's not it, I'd recommend posting your data, just so we can see how it's structured.
Eventually I figured it out:
periodReturn() (called via tq_transmute()/tq_mutate()) calculates the return relative to the last price of the day before the requested period.
I.e., the yearly return for (say) 2022 is computed from 31/12/2021 to 31/12/2022 (rather than from 01/01/2022).
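That convention can be reproduced by hand. A base-R sketch with hypothetical adjusted closes around a year boundary (not the real Yahoo data):

```r
# Hypothetical adjusted closes around the 2000/2001 boundary
adjusted <- c("2000-12-29" = 30.0,   # last trading day of 2000
              "2001-01-02" = 31.0,   # first trading day of 2001
              "2001-12-31" = 17.3)   # last trading day of 2001
# What periodReturn(period = "yearly") reports for 2001:
# last price of 2001 over last price of 2000
r_period <- unname(adjusted["2001-12-31"] / adjusted["2000-12-29"] - 1)
# A hand calculation entirely within 2001 gives a different number
r_manual <- unname(adjusted["2001-12-31"] / adjusted["2001-01-02"] - 1)
```

The gap between r_period and r_manual is exactly the overnight move from the last close of 2000 to the first close of 2001, which explains why restricting the tq_get range "fixes" the first year only.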

Gathering multiple data columns currently in factor form

I have a dataset of train carloads. It currently has a number (weekly carload) listed for each company (the row) for each week (the columns) over the course of a couple years (100+ columns). I want to gather this into just two columns: a date and loads.
It currently looks like this:
3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5
I'm looking for:
Date Load
3/29/2017 32.7
3/29/2017 20.5
3/29/2017 24.1
3/29/2017 24.9
4/5/2017 31.6
I've been doing various versions of the following:
rail3 <- rail2 %>%
gather(`3/29/2017`:`1/24/2018`, key = "date", value = "loads")
When I do this it creates a dataset called rail3, but it doesn't make the two new columns I wanted; it only makes the dataset 44 times longer than it was. And it gives me the following message:
Warning message:
attributes are not identical across measure variables;
they will be dropped
I'm assuming this is because the date columns are currently coded as factors. But I'm also not sure how to convert 100+ columns from factors to numeric. I've tried the following and various other methods:
rail2["3/29/2017":"1/24/2018"] <- lapply(rail2["3/29/2017":"1/24/2018"], as.numeric)
None of this has worked. Let me know if you have any advice. Thanks!
If you want to avoid warnings when gathering and want date and numeric output in the final data frame, you can do:
library(tidyr)
library(hablar)
# Data from above but with factors
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE) %>%
as_tibble() %>%
convert(fct(everything()))
# Code
rail2 %>%
convert(num(everything())) %>%
gather("date", "load") %>%
convert(dte(date, .args = list(format = "%m/%d/%Y")))
Gives:
# A tibble: 16 x 2
date load
<date> <dbl>
1 2017-03-29 32.7
2 2017-03-29 20.5
3 2017-03-29 24.1
4 2017-03-29 24.9
5 2017-04-05 31.6
Here is a possible solution:
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE)
library(tidyr)
# gather the data from columns and convert to long format.
rail3 <- rail2 %>% gather(key="date", value="load")
rail3
# date load
#1 3/29/2017 32.7
#2 3/29/2017 20.5
#3 3/29/2017 24.1
#4 3/29/2017 24.9
#5 4/5/2017 31.6
#6 4/5/2017 21.8
#7 ...
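Note that the gathered date column is still character (or factor), so a small base-R follow-up converts the types. This sketch rebuilds a few rows of the long data rather than depending on the pipeline above:

```r
# A few rows of the long data, as gather() would produce them
rail3 <- data.frame(date = c("3/29/2017", "3/29/2017", "4/5/2017"),
                    load = c("32.7", "20.5", "31.6"),
                    stringsAsFactors = FALSE)
# Convert: character -> Date, character -> numeric
rail3$date <- as.Date(rail3$date, format = "%m/%d/%Y")
rail3$load <- as.numeric(rail3$load)
str(rail3)
```

The same two lines work on the full 100+ column result, since after gathering there is only one date column and one load column left to convert.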

Convert long to wide dataset using data.table::dcast or tidyr

Given the following data in long format, I would like to reshape it to wide, for an arbitrary number of timepoints.
dat <- structure(list(srdr_id = c("172507", "172507", "172507", "172507",
"172619", "172619", "172619", "172619"), arm = c("CBT_Educ",
"CBT_MI", "CBT_Educ", "CBT_MI", "MI", "Educ", "MI", "Educ"),
timepoint = c(0, 0, 3, 3, 0, 0, 3, 3), n = c(102, 103, 100,
101, 58, 61, 45, 53), mean = c(37.69, 40.23, 34.53, 31.8,
4.6, 4.3, 4.4, 4.1), sd = c(16.06, 14.23, 19.78, 19.67, 2.2,
2.2, 2.3, 2.5)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-8L))
Long dataset:
srdr_id arm timepoint n mean sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 172507 CBT_Educ 0 102 37.7 16.1
2 172507 CBT_MI 0 103 40.2 14.2
3 172507 CBT_Educ 3 100 34.5 19.8
4 172507 CBT_MI 3 101 31.8 19.7
5 172619 MI 0 58 4.6 2.2
6 172619 Educ 0 61 4.3 2.2
7 172619 MI 3 45 4.4 2.3
8 172619 Educ 3 53 4.1 2.5
I would like to create a wide dataset, such that within each srdr_id and arm the three variables (n, mean and sd) appear in the same row.
srdr_id arm n.0 mean.0 sd.0 n.3 mean.3 sd.3
1 172507 CBT_Educ 102 37.7 16.1 100 34.5 19.8
2 172507 CBT_MI 103 40.2 14.2 101 31.8 19.7
5 172619 MI 58 4.6 2.2 45 4.4 2.3
6 172619 Educ 61 4.3 2.2 53 4.1 2.5
The following failed:
data.table::dcast(data = dat, srdr_id + arm, value.var = c(n_analyzed, mean, sd))
Error in is.formula(formula) : object 'srdr_id' not found
A common workflow for this type of situation is gathering all the metrics, renaming them, and then spreading again. See below:
tidyverse:
dat %>%
gather("measure", "val", n, mean, sd) %>%
mutate(measure = paste0(measure, ".", timepoint)) %>%
select(-timepoint) %>%
spread(measure, val)
# A tibble: 4 x 8
srdr_id arm mean.0 mean.3 n.0 n.3 sd.0 sd.3
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 172507 CBT_Educ 37.7 34.5 102 100 16.1 19.8
2 172507 CBT_MI 40.2 31.8 103 101 14.2 19.7
3 172619 Educ 4.3 4.1 61 53 2.2 2.5
4 172619 MI 4.6 4.4 58 45 2.2 2.3
data.table:
library(data.table)
dt <- as.data.table(dat)
melt(dt, id.vars = c("srdr_id", "arm", "timepoint"))[
,`:=`(variable = paste0(variable, ".", timepoint), timepoint = NULL)
] %>%
dcast(srdr_id + arm ~ variable, value.var = "value")
srdr_id arm mean.0 mean.3 n.0 n.3 sd.0 sd.3
1: 172507 CBT_Educ 37.69 34.53 102 100 16.06 19.78
2: 172507 CBT_MI 40.23 31.80 103 101 14.23 19.67
3: 172619 Educ 4.30 4.10 61 53 2.20 2.50
4: 172619 MI 4.60 4.40 58 45 2.20 2.30
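The original dcast attempt can also be fixed directly, without a melt step: the call needs a formula (`srdr_id + arm ~ timepoint`) and a character vector for value.var, and data.table's dcast then spreads several value columns at once (data.table >= 1.9.6). A sketch on one study's rows:

```r
library(data.table)

# One srdr_id from the example data
dt <- data.table(
  srdr_id   = rep("172507", 4),
  arm       = c("CBT_Educ", "CBT_MI", "CBT_Educ", "CBT_MI"),
  timepoint = c(0, 0, 3, 3),
  n         = c(102, 103, 100, 101),
  mean      = c(37.69, 40.23, 34.53, 31.8),
  sd        = c(16.06, 14.23, 19.78, 19.67)
)
# Formula on the left keeps id columns; timepoint on the right becomes
# column suffixes; value.var must be a character vector of column names
wide <- dcast(dt, srdr_id + arm ~ timepoint,
              value.var = c("n", "mean", "sd"))
# columns: srdr_id, arm, n_0, n_3, mean_0, mean_3, sd_0, sd_3
```

The column names come out as n_0, n_3, etc. (underscore-separated rather than the dot-separated names in the question), but the layout is the requested one.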
One alternative (probably not the most elegant) is to use group_by() and summarise() from the dplyr package.
Here you don't have to compute anything new (all values are already in your initial dataset), so you can use functions like first() and last() to pick which values you want.
dat %>%
group_by(srdr_id, arm) %>%
summarise(
n0 = first(n), mean0 = first(mean), sd0 = first(sd),
n3 = last(n), mean3 = last(mean), sd3 = last(sd)
)
# srdr_id arm n0 mean0 sd0 n3 mean3 sd3
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 172507 CBT_Educ 102 37.7 16.1 100 34.5 19.8
# 2 172507 CBT_MI 103 40.2 14.2 101 31.8 19.7
# 3 172619 Educ 61 4.3 2.2 53 4.1 2.5
# 4 172619 MI 58 4.6 2.2 45 4.4 2.3

Preparing data for two way anova in R [duplicate]

This question already has answers here:
Extract p-value from aov
(7 answers)
Closed 6 years ago.
I have the following format of the data:
m1 m2 m3 names
1 24.5 28.4 26.1 1
2 23.5 34.2 28.3 2
3 26.4 29.5 24.3 3
4 27.1 32.2 26.2 4
5 29.9 20.1 27.8 5
How can I prepare this data to the format that I can feed to aov in R?
I.e.
values ind name
1 24.5 m1 1
2 23.5 m1 2
3 26.4 m1 3
...
For one way anova I just used stack command. How can I do it for two way anova, without having a loop?
Try this
library(reshape2)
melt(df)
user2100721 has given an answer using a package. Without package imports this can be solved as
a <- read.table(header=TRUE, text="m1 m2 m3 names
24.5 28.4 26.1 1
23.5 34.2 28.3 2
26.4 29.5 24.3 3
27.1 32.2 26.2 4
29.9 20.1 27.8 5")
reshape(a, direction="long", varying=list(c("m1","m2","m3")))
As an alternative to the above answers, which are all fine, you can use gather from the tidyr package. gather takes multiple columns and collapses them into key-value pairs; you pass it two names, one for the key column and one for the value column.
X<- structure(list(m1 = c(24.5, 23.5, 26.4, 27.1, 29.9), m2 = c(28.4,
34.2, 29.5, 32.2, 20.1), m3 = c(26.1, 28.3, 24.3, 26.2, 27.8),
names = 1:5), .Names = c("m1", "m2", "m3", "names"), class = "data.frame", row.names = c(NA,
-5L))
library(tidyr)
dat <- X %>% gather(variable, value, -names)
> head(dat,10)
#   names variable value
#1      1       m1  24.5
#2      2       m1  23.5
#3      3       m1  26.4
#4      4       m1  27.1
#5      5       m1  29.9
#6      1       m2  28.4
#7      2       m2  34.2
#8      3       m2  29.5
#9      4       m2  32.2
#10     5       m2  20.1
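With the data in long form, the two-way aov call is then straightforward. A self-contained base-R sketch (names is treated as a second factor alongside the measurement column):

```r
# Rebuild the long data in base R, matching the table in the question
dat <- data.frame(
  value = c(24.5, 23.5, 26.4, 27.1, 29.9,   # m1
            28.4, 34.2, 29.5, 32.2, 20.1,   # m2
            26.1, 28.3, 24.3, 26.2, 27.8),  # m3
  ind   = rep(c("m1", "m2", "m3"), each = 5),
  names = factor(rep(1:5, times = 3))
)
# Two-way ANOVA: measurement column and subject id as the two factors
fit <- aov(value ~ ind + names, data = dat)
summary(fit)
```

Wrapping names in factor() matters; left as an integer it would be fitted as a numeric covariate rather than a grouping factor.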

How to change a column classed as NULL to class integer?

So I'm starting with a dataframe called max.mins that has 153 rows.
day Tx Hx Tn
1 1 10.0 7.83 2.1
2 2 7.7 6.19 2.5
3 3 7.1 4.86 0.0
4 4 9.8 7.37 2.7
5 5 13.4 12.68 0.4
6 6 17.5 17.47 3.5
7 7 16.5 15.58 6.5
8 8 21.5 20.30 6.2
9 9 21.7 21.41 9.7
10 10 24.4 28.18 8.0
I'm applying these statements to the dataframe to look for specific criteria
temp_warnings <- subset(max.mins, Tx >= 32 & Tn >=20)
humidex_warnings <- subset(max.mins, Hx >= 40)
Now when I open up humidex_warnings for example I have this dataframe
row.names day Tx Hx Tn
1 41 10 31.1 40.51 20.7
2 56 25 33.4 42.53 19.6
3 72 11 34.1 40.78 18.1
4 73 12 33.8 40.18 18.8
5 74 13 34.1 41.10 22.4
6 79 18 30.3 41.57 22.5
7 94 2 31.4 40.81 20.3
8 96 4 30.7 40.39 20.2
The next step is to search for 2 or 3 consective numbers in the column row.names and give me a total of how many times this occurs (I asked this in a previous question and have a function that should work once this problem is sorted out). The issue is that row.names is class NULL which is preventing me from applying further functions to this dataframe.
Help? :)
Thanks in advance,
Nick
If you need the row names as data, convert them to an integer column:
humidex_warnings$seq <- as.integer(row.names(humidex_warnings))
If you don't need the row names, reset them:
row.names(humidex_warnings) <- NULL
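For the follow-up goal of counting runs of 2 or 3 consecutive row numbers, a base-R sketch using diff() and rle(); the vector below is the hypothetical row-name column from the warnings data frame, already converted to integer:

```r
# Hypothetical original row numbers recovered from row.names()
rownums <- c(41, 56, 72, 73, 74, 79, 94, 96)
# Label consecutive stretches: a new group starts wherever the gap != 1
grp <- cumsum(c(1, diff(rownums) != 1))
run_lengths <- rle(grp)$lengths
# Count runs of exactly 2 or 3 consecutive numbers (here 72,73,74 -> one run)
sum(run_lengths %in% c(2, 3))
# [1] 1
```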
