How can I merge data from CSV files? - r

I want to perform an analysis of 2 stocks for the period from 2017-01-01 until 2020-04-14. Unfortunately, I am struggling with importing the data.
I tried to import the data from the CSV files, limit it to the period from 2017-01-01 until 2020-04-14, and merge the two series.
x <- read.csv("data/pkn_d.csv")
y <- read.csv("data/lts_d.csv")
head(x)
Date Open High Low Close Volume
1 1999-11-26 16.307 16.452 15.717 16.229 14845780
2 1999-11-29 16.154 16.229 15.863 15.940 5148506
3 1999-11-30 16.086 16.375 16.086 16.229 3077465
4 1999-12-01 16.375 16.742 16.229 16.742 2881475
5 1999-12-02 16.895 17.407 16.818 17.040 3093313
6 1999-12-03 17.040 17.330 16.895 17.260 2207547
head(y)
Date Open High Low Close Volume
1 2005-06-09 26.676 26.676 25.013 25.013 1795647
2 2005-06-10 25.097 25.433 24.594 24.594 679054
3 2005-06-13 25.013 25.097 24.594 24.762 213950
4 2005-06-14 24.929 24.929 24.762 24.762 181415
5 2005-06-15 24.762 24.845 24.594 24.762 160359
6 2005-06-16 24.762 24.762 24.350 24.350 171475
I'm only interested in data from 2017-01-01 until 2020-04-14 and the 5th column (the close price):
x <- x[4285:5100, 5]
y <- y[2899:3714, 5]
Next, I want to merge these data:
merge(x,y)
However, I don't obtain any meaningful output. How can I solve this issue?

Since the question doesn't include a reproducible example, here is a solution that merges a set of stock prices retrieved from the internet via the quantmod package.
library("quantmod")

symbolList <- c("PKN", "LTS")
from.dat <- as.Date("2017-01-01", format = "%Y-%m-%d")
to.dat <- as.Date("2020-04-14", format = "%Y-%m-%d")
prices <- lapply(symbolList, function(x) {
  getSymbols(x, auto.assign = FALSE, from = from.dat, to = to.dat)[, 4]
})
priceData <- do.call(merge, prices)
head(priceData)
...and the output:
           PKN.Close LTS.Close
2017-01-03    49.370      2.54
2017-01-04    50.370      2.57
2017-01-05    89.340      2.43
2017-01-06    89.340      2.38
2017-01-09    49.855      2.36
2017-01-10    88.300      2.44
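Since the question reads local CSV files, the same result can be obtained without quantmod by keeping the Date column and merging on it, rather than subsetting by row numbers. A minimal sketch, assuming the file names and column layout shown in the question:

```r
x <- read.csv("data/pkn_d.csv")
y <- read.csv("data/lts_d.csv")

# Keep only the Date and Close columns, and convert Date to Date class
x <- data.frame(Date = as.Date(x$Date), Close = x$Close)
y <- data.frame(Date = as.Date(y$Date), Close = y$Close)

# Restrict both series to the period of interest
from <- as.Date("2017-01-01")
to <- as.Date("2020-04-14")
x <- x[x$Date >= from & x$Date <= to, ]
y <- y[y$Date >= from & y$Date <= to, ]

# Merge on the shared Date column; suffixes distinguish the two Close columns
priceData <- merge(x, y, by = "Date", suffixes = c(".pkn", ".lts"))
head(priceData)
```

Merging by Date also guards against the two files having different trading days, which row-number subsetting silently ignores.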

Related

How do I retain all the columns while using tq_transmute() function?

I am trying to replicate a trading strategy and backtest in R. However, I am having a slight problem with the tq_transmute() function. Any help would be appreciated.
So, I have the following code that I have written until now:
#Importing the etfs data
symbols <- c("SPY", "XLF", "XLE")
start <- as.Date("2000-01-01")
end <- as.Date("2018-12-31")
price_data <- lapply(symbols, function(symbol) {
  etfs <- as.data.frame(getSymbols(symbol, src = "yahoo", from = start, to = end,
                                   auto.assign = FALSE))
  colnames(etfs) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
  etfs$Symbol <- symbol
  etfs$Date <- rownames(etfs)
  etfs
})
# Next, I used do.call() with rbind() to combine the data into a single data frame
etfs_df<- do.call(rbind, price_data)
# Convert Date from character to Date class (this was to avoid a POSIXct error)
daily_price <- etfs_df %>%
  mutate(Date = as.Date(Date))
# I have deleted some columns of the table as my work only concerned the "Adjusted" column.
#So, until now we have:
head(daily_price)
Adjusted Symbol Date
1 98.14607 SPY 2000-01-03
2 94.30798 SPY 2000-01-04
3 94.47669 SPY 2000-01-05
4 92.95834 SPY 2000-01-06
5 98.35699 SPY 2000-01-07
6 98.69440 SPY 2000-01-10
#Converting the daily adjusted price to monthly adjusted price
monthly_price <- tq_transmute(daily_price, select = Adjusted,
                              mutate_fun = to.monthly, indexAt = "lastof")
head(monthly_price)
# And now, I get the following table:
# A tibble: 6 x 2
Date Adjusted
<date> <dbl>
1 2000-01-31 16.6
2 2000-02-29 15.9
3 2000-03-31 17.9
4 2000-04-30 17.7
5 2000-05-31 19.7
6 2000-06-30 18.6
So, as you can see, the Date and Adjusted prices have been successfully converted to monthly figures, but my Symbol column has disappeared. Could anyone please tell me why that happened and how I can get it back?
Thank you.
Group the data by Symbol and then apply tq_transmute:
library(dplyr)
library(quantmod)
library(tidyquant)
# Note: with the pipe, the grouped data is already the first argument,
# so daily_price must not be repeated inside tq_transmute()
monthly_price <- daily_price %>%
  group_by(Symbol) %>%
  tq_transmute(select = Adjusted,
               mutate_fun = to.monthly, indexAt = "lastof")
# Symbol Date Adjusted
# <chr> <date> <dbl>
# 1 SPY 2000-01-31 94.2
# 2 SPY 2000-02-29 92.7
# 3 SPY 2000-03-31 102.
# 4 SPY 2000-04-30 98.2
# 5 SPY 2000-05-31 96.6
# 6 SPY 2000-06-30 98.5
# 7 SPY 2000-07-31 97.0
# 8 SPY 2000-08-31 103.
# 9 SPY 2000-09-30 97.6
#10 SPY 2000-10-31 97.2
# … with 674 more rows
I would do it like this:
symbols <- c("SPY", "XLF", "XLE")
start <- as.Date("2000-01-01")
end <- as.Date("2018-12-31")
# Environment to hold data
my_data <- new.env()
# Tell getSymbols() to load the data into 'my_data'
getSymbols(symbols, from = start, to = end, env = my_data)
# Combine all the adjusted close prices into one xts object
price_data <- Reduce(merge, lapply(my_data, Ad))
# Remove "Adjusted" from column names
colnames(price_data) <- sub(".Adjusted", "", colnames(price_data), fixed = TRUE)
# Get the last price for each month
monthly_data <- apply.monthly(price_data, last)
# Convert to a long data.frame
long_data <- fortify.zoo(monthly_data,
                         names = c("Date", "Symbol", "Adjusted"), melt = TRUE)
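The last step can also be sketched with tidyr, in case fortify.zoo's melt interface is unfamiliar; this assumes the monthly_data xts object built above:

```r
library(tidyr)
library(zoo)

# Turn the xts index into an ordinary Date column first...
wide_df <- fortify.zoo(monthly_data, names = "Date")

# ...then stack the per-symbol columns into Symbol/Adjusted pairs
long_df <- pivot_longer(wide_df, cols = -Date,
                        names_to = "Symbol", values_to = "Adjusted")
```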

How do I show x-axis labels only when the year changes in R?

I've downloaded a couple of .csv files, and they look something like this, just a lot longer and the date continues until 2020-03-13.
Date Open High Low Close Adj.Close Volume
1 2015-03-13 2064.56 2064.56 2041.17 2053.40 2053.40 3498560000
2 2015-03-16 2055.35 2081.41 2055.35 2081.19 2081.19 3295600000
3 2015-03-17 2080.59 2080.59 2065.08 2074.28 2074.28 3221840000
4 2015-03-18 2072.84 2106.85 2061.23 2099.50 2099.50 4128210000
5 2015-03-19 2098.69 2098.69 2085.56 2089.27 2089.27 3305220000
6 2015-03-20 2090.32 2113.92 2090.32 2108.10 2108.10 5554120000
I've created a data frame that looks like this based on the data
Date t SandP AMD
1 0 1 0.000000000 0.000000000
2 2015-03-16 2 0.013442909 0.003629768
3 2015-03-17 3 -0.003325698 0.003616640
4 2015-03-18 4 0.012085102 -0.007246409
5 2015-03-19 5 -0.004884489 -0.003642991
6 2015-03-20 6 0.008972382 0.021661497
I am trying to graph the SandP and AMD columns on the same axis, however I only want the axis labels to show each year (when each year changes). Therefore I would only want the 6 ticks on the axis (2015,2016,2017,2018,2019,2020).
If it helps, the .csv files were downloaded from Yahoo Finance data for S&P500.
This is my code up to now:
SPdata <- read.csv("^GSPC.csv")
AMDdata <- read.csv("AMD.csv")
head(SPdata)

R_t <- function(t) {
  S <- log(SPdata[t, 6]) - log(SPdata[t - 1, 6])
  return(S)
}
S_t <- function(t) {
  S <- log(AMDdata[t, 6]) - log(AMDdata[t - 1, 6])
  return(S)
}

comparedata <- data.frame(0, 1, 0, 0)
names(comparedata)[1] <- "Date"
names(comparedata)[2] <- "t"
names(comparedata)[3] <- "SandP"
names(comparedata)[4] <- "AMD"
t <- 2
while (t < 1260) {
  comparedata <- rbind(comparedata, list(AMDdata[t, 1], t, R_t(t), S_t(t)))
  t <- t + 1
}

# install.packages("ggplot2")
library("ggplot2")
ggplot() +
  geom_line(data = comparedata, aes(x = Date, y = SandP), color = "red", group = 1) +
  geom_line(data = comparedata, aes(x = Date, y = AMD), color = "blue", group = 1) +
  labs(x = "Date", y = "Returns")
I think you need to use scale_x_date and set the arguments date_breaks and date_labels (see the official documentation: https://ggplot2.tidyverse.org/reference/scale_date.html).
Here, I recreate an example using the small portion of the data you provided:
library(lubridate)
date <- seq(ymd("2015-03-16"), ymd("2020-03-13"), by = "day")
df <- data.frame(date = date,
                 t = 1:1825,
                 SandP = rnorm(1825),
                 AMD = rnorm(1825))
Starting from this example, I reshape the dataframe into a longer format using the pivot_longer function from tidyr:
library(tidyr)
DF <- df %>%
  pivot_longer(cols = c(SandP, AMD), names_to = "indices", values_to = "values")
# A tibble: 3,650 x 4
date t indices values
<date> <int> <chr> <dbl>
1 2015-03-16 1 SandP 0.566
2 2015-03-16 1 AMD -0.185
3 2015-03-17 2 SandP -1.59
4 2015-03-17 2 AMD 0.236
5 2015-03-18 3 SandP 1.11
6 2015-03-18 3 AMD -1.52
7 2015-03-19 4 SandP -1.02
8 2015-03-19 4 AMD 0.0833
9 2015-03-20 5 SandP 2.78
10 2015-03-20 5 AMD -0.173
# … with 3,640 more rows
Then, I plot both indices according to the date using ggplot2:
library(ggplot2)
ggplot(DF, aes(x = date, y = values, color = indices)) +
  geom_line() +
  labs(x = "Date", y = "Returns") +
  scale_x_date(date_breaks = "year", date_labels = "%Y")
Does this look like what you are trying to achieve?
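One caveat when applying this to the code in the question: scale_x_date requires the x column to be of class Date, and comparedata$Date is built from a mix of a literal 0 (the dummy first row) and character dates. A minimal sketch of the conversion, assuming the comparedata frame from the question:

```r
library(ggplot2)

# Drop the dummy first row, then convert the character dates to Date class
comparedata <- comparedata[-1, ]
comparedata$Date <- as.Date(comparedata$Date)

ggplot(comparedata) +
  geom_line(aes(x = Date, y = SandP), color = "red") +
  geom_line(aes(x = Date, y = AMD), color = "blue") +
  labs(x = "Date", y = "Returns") +
  scale_x_date(date_breaks = "year", date_labels = "%Y")
```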

Prices returns calculation in a df with many tickers with dplyr

I have a dataframe with 3 columns : Dates, Tickers (i.e. financial instruments) and Prices.
I just want to calculate the returns for each ticker.
Some data to play with:
AsofDate <- as.Date(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                      "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                      "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"))
Tickers <- c("Ticker1", "Ticker1", "Ticker1", "Ticker1", "Ticker1",
             "Ticker2", "Ticker2", "Ticker2", "Ticker2", "Ticker2",
             "Ticker3", "Ticker3", "Ticker3", "Ticker3", "Ticker3")
Prices <- c(1, 2, 7, 4, 2,
            6, 5, 7, 9, 12,
            11, 11, 16, 14, 15)
df <- data.frame(AsofDate, Tickers, Prices)
My first idea was just to order the data by Tickers, compute returns over the whole Prices vector, and set the first day of each ticker to NA:
TTR::ROC(x = Prices)
That works in Excel, but I want something cleaner in R.
So I tried something like this:
require(dplyr)
ret <- df %>%
  select(Tickers, Prices) %>%
  group_by(Tickers) %>%
  do(data.frame(LogReturns = TTR::ROC(x = Prices)))
df$LogReturns <- ret$LogReturns
But here I get too many values; it seems the calculation is not done per ticker.
Can you give me a hint ?
Thanks !!
In dplyr, we can use lag to get the previous Prices:
library(dplyr)
df %>%
group_by(Tickers) %>%
mutate(returns = (Prices - lag(Prices))/Prices)
# AsofDate Tickers Prices returns
# <date> <fct> <dbl> <dbl>
# 1 2018-01-01 Ticker1 1 NA
# 2 2018-01-02 Ticker1 2 0.5
# 3 2018-01-03 Ticker1 7 0.714
# 4 2018-01-04 Ticker1 4 -0.75
# 5 2018-01-05 Ticker1 2 -1
# 6 2018-01-01 Ticker2 6 NA
# 7 2018-01-02 Ticker2 5 -0.2
# 8 2018-01-03 Ticker2 7 0.286
# 9 2018-01-04 Ticker2 9 0.222
#10 2018-01-05 Ticker2 12 0.25
#11 2018-01-01 Ticker3 11 NA
#12 2018-01-02 Ticker3 11 0
#13 2018-01-03 Ticker3 16 0.312
#14 2018-01-04 Ticker3 14 -0.143
#15 2018-01-05 Ticker3 15 0.0667
In base R, we can use ave with diff:
df$returns <- with(df, ave(Prices, Tickers, FUN = function(x) c(NA, diff(x))) / Prices)
We can use data.table:
library(data.table)
setDT(df)[, returns := (Prices - shift(Prices)) / Prices, by = Tickers]
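The same grouped pattern also rescues the question's original TTR::ROC approach. The likely culprit in the do() attempt is that the bare name Prices inside do() refers to the global Prices vector rather than the per-group column; inside mutate it resolves to the group's own column. A sketch, assuming the df built above:

```r
library(dplyr)
library(TTR)

# ROC() computes log returns by default; mutate evaluates it per group
df %>%
  group_by(Tickers) %>%
  mutate(LogReturns = ROC(Prices))
```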

Summarise a vector and then append the summary statistics to the original dataframe in R

Intro:
I would like to compute the mean, standard deviation, and standard error of a numeric vector in a given dataframe and then create three new vectors using these summary statistics. I then need to combine them with the original dataframe.
Example Code:
## Creating our dataframe:
datetime <- c("5/12/2017 16:15:00","5/16/2017 16:45:00","5/19/2017 17:00:00")
datetime <- as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S")
values <- c(1,2,3)
df <- data.frame(datetime, values)
## Here's the current output:
head(df)
datetime values
1 2017-05-12 16:15:00 1
2 2017-05-16 16:45:00 2
3 2017-05-19 17:00:00 3
## And here's the desired output:
head(df1)
datetime values mean sd se
1 2017-05-12 16:15:00 1 2 0.816 0.471
2 2017-05-16 16:45:00 2 2 0.816 0.471
3 2017-05-19 17:00:00 3 2 0.816 0.471
Thanks in advance!
For those who are curious as to why I am trying to do this, I am following this tutorial. I need to make one of those line graph plots with errorbars for some calibrations between a low-cost sensor and an expensive reference instrument.
You can do the assignment simultaneously. Suppose you already have the helper functions for your choice of sd and se:
sd0 <- function(x) { sd(x) / sqrt(length(x)) * sqrt(length(x) - 1) }
se0 <- function(x) { sd0(x) / sqrt(length(x)) }
Then you can try:
df[c('mean', 'sd', 'se')] <- lapply(list(mean, sd0, se0), function(f) f(df$values))
# > df
# datetime values mean sd se
# 1 2017-05-12 16:15:00 1 2 0.8164966 0.4714045
# 2 2017-05-16 16:45:00 2 2 0.8164966 0.4714045
# 3 2017-05-19 17:00:00 3 2 0.8164966 0.4714045
Here is the dplyr solution, with sd0 and se0 given in mt1022's answer:
df %>% mutate(mean = mean(values), sd = sd0(values), se = se0(values))
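If the data ever contains more than one group (say, readings from several sensors), the same idea extends to a grouped summarise followed by a join back onto the original rows. A sketch reusing sd0 and se0 from above; the Sensor column is hypothetical and not part of the original data:

```r
library(dplyr)

stats <- df %>%
  group_by(Sensor) %>%  # 'Sensor' is a hypothetical grouping column
  summarise(mean = mean(values), sd = sd0(values), se = se0(values))

# Attach the per-group statistics to every matching row
df1 <- left_join(df, stats, by = "Sensor")
```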

identify date format in R before converting

I have a simple data set which has a date column and a value column. I noticed that the date sometimes comes in %m/%d/%y format and other times in %m/%d/%Y format. What is the best way to standardize the dates so that I can do other calculations without this formatting causing issues?
I tried the answers provided here
Changing date format in R
and here
How to change multiple Date formats in same column
Neither of these were able to fix the problem.
Below is a sample of the data
Date, Market
12/17/09,1.703
12/18/09,1.700
12/21/09,1.700
12/22/09,1.590
12/23/2009,1.568
12/24/2009,1.520
12/28/2009,1.500
12/29/2009,1.450
12/30/2009,1.450
12/31/2009,1.450
1/4/2010,1.440
When I read it into a new vector using something like this:
dt <- as.Date(inp$Date, format="%m/%d/%y")
I get the following output for the above segment
dt Market
2009-12-17 1.703
2009-12-18 1.700
2009-12-21 1.700
2009-12-22 1.590
2020-12-23 1.568
2020-12-24 1.520
2020-12-28 1.500
2020-12-29 1.450
2020-12-30 1.450
2020-12-31 1.450
2020-01-04 1.440
As you can see, the year skips from 2009 to 2020 at 12/23 because of the change in formatting. Any help is appreciated. Thanks.
# Drop the century from 4-digit years so every date is in %m/%d/%y format
dat$Date <- gsub("[0-9]{2}([0-9]{2})$", "\\1", dat$Date)
dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
dat
#          Date Market
# 1  2009-12-17  1.703
# 2  2009-12-18  1.700
# 3  2009-12-21  1.700
# 4  2009-12-22  1.590
# 5  2009-12-23  1.568
# 6  2009-12-24  1.520
# 7  2009-12-28  1.500
# 8  2009-12-29  1.450
# 9  2009-12-30  1.450
# 10 2009-12-31  1.450
# 11 2010-01-04  1.440
