I'm helping a friend with some R homework for an apparently badly taught R class (nothing covered in the class or the supplementary material helps with this).
We have two datasets. One contains daily discrete returns of a company share in percent, and the other contains daily exchange rates between two currencies, say USD to Swiss francs. It looks like this:
Date Mon Day Exchangerate
2000 01 01 1.03405
2000 01 02 1.02987
2000 01 03 1.03021
2000 01 04 1.03456
2000 01 05 1.03200
And the daily discrete returns:
Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789
The task is to write a function that uses both matrices and calculates the daily returns from the perspective of a Swiss investor. We assume an initial investment of 1000 US dollars.
I tried using the tidyverse to calculate the changes in total return and the percent changes from one day to the next, using the lag function from dplyr, as in the code below.
library(tidyverse)

myCHFreturn <- function(matrix1, matrix2) {
  total = dplyr::right_join(matrix1, matrix2, by = "Date") %>%
    dplyr::filter(!is.na(Share1)) %>%
    dplyr::select(-c(Date, Mon, Day)) %>%
    dplyr::mutate(rentShare1_usd = (1 + Share1) * 1000,
                  rentShare1_usd = dplyr::lag(rentShare1_usd) * (1 + Share1),
                  rentShare1_chf = rentShare1_usd * Exchangerate,
                  rentShare1_chfperc = (rentShare1_chf - dplyr::lag(rentShare1_chf)) / dplyr::lag(rentShare1_chf),
                  rentShare1_chfperc = rentShare1_chfperc * 100)
}
The problem is that the rentShare1_usd = dplyr::lag(rentShare1_usd) * (1 + Share1) part of the function relies on the values calculated for the initial 1000 US dollar investment. My impression is that we need some kind of rolling calculation of the changes, based on the initial investment, but I don't know how to implement this in the function, since I've only worked with rolling means. We want to calculate the daily returns based on the change given in the variable Share1 and the value of the investment on the previous day. Any help is very much appreciated.
At least to point you to part of a solution: the value of a unit share on any given day is the cumulative product, from the start date up to that date, of (1 + daily_discrete_return). To take an example using an extended version of your daily discrete returns table:
df = read.table(text = "Date Share1
20000104 -0.03778
20000105 0.02154
20000106 0.01345
20000107 -0.01234
20000108 -0.01789
20000109 0.02154
20000110 0.01345
20000111 0.02154
20000112 0.02154
20000113 0.01345", header = TRUE, stringsAsFactors = FALSE)
library(dplyr)
Shares = 1000
df1 = mutate(df, ShareValue = cumprod(1+Share1) * Shares)
Date Share1 ShareValue
1 20000104 -0.03778 962.2200
2 20000105 0.02154 982.9462
3 20000106 0.01345 996.1668
4 20000107 -0.01234 983.8741
5 20000108 -0.01789 966.2726
6 20000109 0.02154 987.0862
7 20000110 0.01345 1000.3625
8 20000111 0.02154 1021.9103
9 20000112 0.02154 1043.9222
10 20000113 0.01345 1057.9630
Once you've got a table with the share value as at each date, you can join it back to your exchange rate table to calculate the Swiss franc equivalent for that date, and extend it to do percentage changes and so on.
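To sketch that remaining step (an illustration only, not tested against your real data): assume the exchange-rate table is a data frame fx with the Date/Mon/Day/Exchangerate columns from the question, and note that the returns table stores dates as YYYYMMDD integers, so both need a proper Date column before joining:
library(dplyr)

# build a real Date from the year/month/day columns of the
# exchange-rate table (fx and its column names are assumptions
# taken from the question's sample)
fx <- fx %>%
  mutate(Date = as.Date(paste(Date, Mon, Day, sep = "-"))) %>%
  select(Date, Exchangerate)

# convert the YYYYMMDD integer dates, join, and express the share
# value in CHF plus its day-on-day percentage change
df1 %>%
  mutate(Date = as.Date(as.character(Date), "%Y%m%d")) %>%
  inner_join(fx, by = "Date") %>%
  mutate(ValueCHF  = ShareValue * Exchangerate,
         ReturnCHF = 100 * (ValueCHF / lag(ValueCHF) - 1))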
I would like help with replicating a VLOOKUP from Excel in R. I have two data tables of the following kind, but with many more rows and attributes; I have trimmed them down for the sake of simplicity:
library(data.table)
FX <- data.table(Currency = c("USD","EUR","AUD"), Y2014 = c(2.13,3.45,1.8), Y2015 = c(2.16,3.48,1.7), Y2016 = c(2.19,3.49,1.6))
DATA <- data.table(Customer = c("Abc","Def","Ghi","Jkl","Mno"), Year = c(2013,2014,2015,2012,2018), CurrencyCode = c("AUD","USD","USD","EUR","USD"))
FX has a list of currencies as the rows and different years as columns denoting their exchange rate against a fixed currency (SEK) and DATA has some customer deals which were originally reported in that fixed currency (SEK).
I would like to add another attribute to DATA called, say, ConversionRate: first match the Currency attribute in FX to CurrencyCode in DATA, then pick the conversion rate for the year given in Year in DATA by matching it to the corresponding Yxxxx column in FX.
It would result in something like this:
data <- data.table(Customer = c("Abc","Def","Ghi","Jkl","Mno"), Year = c(2013,2014,2015,2012,2018), ConversionRate = c(1.8,2.13,2.16,3.45,2.19))
Please note that for Year < 2014 I would like it to pick up the corresponding currency's rate for 2014, and for Year > 2016 its rate for 2016, as it does for rows 1, 4 and 5.
I have tried using loops, merge, and even a custom vlookup function, but I seem to go wrong when comparing Year to the column names Yxxxx.
Any idea on how this can be achieved?
Thank you!
After melting FX to long format and converting the "Y2016" etc. labels to numbers, you can do an update join onto DATA with this fx_long. Since you want to join on a year other than the year in the data when it falls outside 2014-2016, first create a new column join_year clamped to that range and join on it instead.
library(data.table)
fx_long <- melt(FX, 'Currency')[, Year := as.numeric(sub('Y', '', variable))]
DATA[, join_year := pmin(pmax(Year, 2014), 2016)]
DATA[fx_long, on = .(join_year = Year, CurrencyCode = Currency), ConversionRate := i.value]
DATA
# Customer Year CurrencyCode join_year ConversionRate
# 1: Abc 2013 AUD 2014 1.80
# 2: Def 2014 USD 2014 2.13
# 3: Ghi 2015 USD 2015 2.16
# 4: Jkl 2012 EUR 2014 3.45
# 5: Mno 2018 USD 2016 2.19
I am a beginner in R programming and I have not found a solution to this problem.
I have data saved in a dataframe as displayed below:
Material created_date
1 50890000 29/10/2018
2 50890000 17/10/2018
3 50890000 31/05/2018
4 50890000 08/02/2018
5 50890000 09/01/2018
6 50900000 21/12/2018
7 50900000 27/09/2018
8 50900000 24/08/2018
9 50900000 18/05/2018
10 51200000 13/07/2018
11 51210001 08/08/2018
12 51210001 26/07/2018
13 51210001 27/02/2018
14 51210001 17/01/2018
15 51210001 09/01/2018
16 51210002 29/08/2018
17 51210002 08/08/2018
18 51210002 13/04/2018
I would like to calculate 4 columns:
Average difference between consecutive dates in days
Standard deviation associated
Average difference between consecutive dates in working days
Standard deviation associated
I have been told to use plyr or dplyr, but as a beginner I am not sure how to compute the desired output.
Thank you,
First, you will need to change created_date to a date that R understands. Do that with:
df$R_date <- as.Date(df$created_date, "%d/%m/%Y")
Now, if you simply want to calculate the difference between dates, a loop (shunned by many) can work:
df$date_diff <- NA  # initialise the column before filling it in the loop
for (i in 2:nrow(df)) {
  df$date_diff[i] <- as.integer(df$R_date[i] - df$R_date[i-1])
}
However, seeing your reference to dplyr I wonder if you want to do this for each Material group...
Here's the dplyr approach to the first two of your bullet-pointed questions:
library(dplyr)

df <- df %>%
  mutate(
    created_date = as.Date(created_date, "%d/%m/%Y"),
    diff = as.integer(created_date - lag(created_date)))

df %>%
  summarise(n = n(), mval = mean(diff, na.rm = TRUE), std = sd(diff, na.rm = TRUE))
n mval std
1 18 -11.70588 128.4916
Check out the link in the comments I left you about counting workdays, and try to combine these methods to answer your second two bullets; a sketch of one such combination follows.
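Here is how that combination might look (a sketch only: working_days_between is a hypothetical helper written here, which excludes weekends but ignores public holidays), grouped per Material:
library(dplyr)

# hypothetical helper: counts Mon-Fri days in (from, to]; holidays ignored
working_days_between <- function(from, to) {
  mapply(function(a, b) {
    if (is.na(a) || is.na(b)) return(NA_integer_)
    a <- as.Date(a, origin = "1970-01-01")  # mapply can strip the Date class
    b <- as.Date(b, origin = "1970-01-01")
    if (a >= b) return(0L)
    sum(!format(seq(a + 1, b, by = "day"), "%u") %in% c("6", "7"))
  }, from, to)
}

df %>%
  mutate(created_date = as.Date(created_date, "%d/%m/%Y")) %>%
  group_by(Material) %>%
  arrange(created_date, .by_group = TRUE) %>%
  mutate(diff_days  = as.integer(created_date - lag(created_date)),
         diff_wdays = working_days_between(lag(created_date), created_date)) %>%
  summarise(mean_days  = mean(diff_days,  na.rm = TRUE),
            sd_days    = sd(diff_days,    na.rm = TRUE),
            mean_wdays = mean(diff_wdays, na.rm = TRUE),
            sd_wdays   = sd(diff_wdays,   na.rm = TRUE))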
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM), but only for days that have data points for every hour of that interval (11 AM, 12, 1, 2, 3, 4 PM). I ultimately want to sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data with times between 11 AM and 4 PM, even though for a given day I often only have data for, e.g., 12 and 1 PM.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to #Henry Navarro's answer, solving an additional problem mentioned in the question.
If I understand correctly, another concern of the question is to find the dates that have data points for at least each hour of the given interval within the day. A possible way, following the style of #Henry Navarro's solution, is as follows:
library(lubridate)

your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)

# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(
  days = ymd(names(your_data_by_days_list)),  # split() names keep days aligned with the flags
  all_hours_present = sapply(your_data_by_days_list,
                             function(Z) all(hours_intervals %in% Z$hour_only)),
  row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column all_hours_present indicating whether the data for the corresponding day contains at least one value for each hour in the given hours_intervals. You can use this column to subset your data:
subset(your_data, all_hours_present)
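And if you then want the per-day sums of mat.3 over the question's actual window (a sketch, assuming 11 AM to 4 PM inclusive; the demonstration above narrowed hours_intervals to 11:13):
# keep only fully covered days, restrict to the 11:00-16:00 hours,
# and sum mat.3 per day
full_days <- subset(your_data,
                    all_hours_present & hour_only >= 11 & hour_only <= 16)
aggregate(mat.3 ~ days, data = full_days, FUN = sum)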
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, do the following:
# auxiliary variable flagging your interval of time
your_data$aux_var <- ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
I have a large number of files (~1200), each containing a long time series with data about the height of the groundwater. The starting date and the length of the series differ per file, and there can be large gaps between dates, for example (a small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5-year periods: in the case of the example, 14-1-1980 + 5 years, then 14-1-1985 + 5 years, and so on. The number of data points differs for each average, and it is very likely that the date exactly 5 years later will not be present in the dataset. Hence, I think I need to tell R to take an average over a certain timespan.
I searched the internet but didn't find anything that fitted my needs. A lot of useful packages (uts, zoo, lubridate) and the aggregate function came up, but instead of getting closer to the solution I got more and more confused about which approach is best for my problem.
Thanks a lot in advance!
As #vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would allow you to do something simple like: data.all <- adply(dir([DATA FOLDER]), 1, read.csv).
Once you have the data, the first step is to transform the Date column into proper Date objects. Right now the dates are strings, and we want them to have an underlying numerical representation (which the Date class has):
library(lubridate)
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
Next, we construct a set of date periods over which to compute averages. Your example mentions 5 years, but with the data you provided that's not very illustrative, so here I'm creating 1-year periods starting on every day between Jan 14 1980 and Jan 14 1985:
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start + years(1)
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
group_by(start, end) %>%
summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
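If you want the 5-year averages from the original question instead of the demonstration windows, only the window construction changes. A sketch, assuming non-overlapping blocks anchored at the first observation are what's wanted:
# non-overlapping 5-year blocks starting at the first observed date
# (an assumption; the question doesn't pin down the anchoring)
date.start <- seq(min(df$date.new), max(df$date.new), by = "5 years")
date.end <- date.start + years(5)
dates <- data.frame(start = date.start, end = date.end)
The same group_by/summarize pipeline then applies unchanged.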
The foverlaps function is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have, for each date, an average of all the values that lie between that date and one year ahead of it.
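To match the question's 5-year horizon instead of the one-year demonstration window, only the end date changes (a small adjustment to the code above; everything else stays the same):
# widen the lookahead window to five years before setting the keys
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
           ][, `:=` (begindate = Date, enddate = Date + years(5))]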
Hey, I just tried this after seeing your question, on a sample data frame. Try it on yours after working through the code, and let me know!
By the way, instead of an interval of 5 years I used just 2 months (2*30 days = approx. 2 months) as the interval!
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
                         "14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980",
                         "27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final){
m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y")>=orig &
as.Date(df$Date, "%d-%m-%Y")<dest )])
df1 = rbind(df1,data.frame(orig=orig,dest=dest,avg_ht=m))
orig = dest
dest = dest + 2*30
print(paste("orig:",orig, " + ","dest:",dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
This is my best try, but please keep in mind that I am working with the years instead of the full dates, i.e. based on the example you provided I am averaging over the beginning of 1980 to the end of 1984.
dat <- read.csv("paixnidi.csv")
install.packages("stringr")
library(stringr)
dates <- dat[, 1]
# extract the year of each measurement
years <- as.integer(str_sub(dat[, 1], start = -4))
spread_y <- years[length(years)] - years[1]
ind <- list()
# find how many 5-year intervals there are
groups <- ceiling((spread_y + 1) / 5)
meangroups <- matrix(0, ncol = 2, nrow = groups)
k <- 0
for (i in 1:groups) {
  # extract the indices of the dates vector within the 5-year period
  ind[[i]] <- which(years >= (years[1] + k) & years <= (years[1] + k + 4))
  meangroups[i, 2] <- mean(dat[ind[[i]], 2])
  meangroups[i, 1] <- (years[1] + k)
  k <- k + 5
}
colnames(meangroups) <- c("Year:Year+4", "Mean Height (cm)")
I am trying to do something which seems simple but is proving a bit of a challenge, so I hope someone can help!
I have a time series of observations of temperature:
Lines <-"1971-01-17 298.9197
1971-01-17 298.9197
1971-02-16 299.0429
1971-03-17 299.0753
1971-04-17 299.3250
1971-05-17 299.5606
1971-06-17 299.2380
2010-07-14 298.7876
2010-08-14 298.5529
2010-09-14 298.3642
2010-10-14 297.8739
2010-11-14 297.7455
2010-12-14 297.4790"
DF <- read.table(textConnection(Lines), col.names = c("Date", "Value"))
DF$Date <- as.Date(DF$Date)
mean.ts <- aggregate(DF["Value"], format(DF["Date"], "%m"), mean)
This produces:
> mean.ts
  Date    Value
1   01 298.9197
2   02 299.0429
This is just an example -- my data is for many years so I can calculate a full monthly average of the data.
What I then want to do is calculate, for each individual January, the difference between it and the mean January I calculated above.
If I move away from using the Date/Time classes I could do this with some loops, but I want to see if there is a "neat" way to do this in R. Any ideas?
You can just add the year as an aggregating variable. This is easier using the formula interface:
> aggregate(Value~format(Date,"%m")+format(Date,"%Y"),data=DF,mean)
format(Date, "%m") format(Date, "%Y") Value
1 01 1971 298.9197
2 02 1971 299.0429
3 03 1971 299.0753
4 04 1971 299.3250
5 05 1971 299.5606
6 06 1971 299.2380
7 07 2010 298.7876
8 08 2010 298.5529
9 09 2010 298.3642
10 10 2010 297.8739
11 11 2010 297.7455
12 12 2010 297.4790
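From there, one possible way (an extension, not part of the answer above) to get each individual month's difference from its long-run monthly mean is to merge the two aggregates and subtract:
# per-month climatology across all years
monthly <- aggregate(Value ~ format(Date, "%m"), data = DF, mean)
names(monthly) <- c("Month", "MonthlyMean")

# per month-and-year means, as computed above
yearly <- aggregate(Value ~ format(Date, "%m") + format(Date, "%Y"), data = DF, mean)
names(yearly) <- c("Month", "Year", "Value")

# difference of each individual January (or any month) from its monthly mean
anom <- merge(yearly, monthly, by = "Month")
anom$Diff <- anom$Value - anom$MonthlyMean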
At least as I understand your question, you want the differences of each month from the mean of those months, so you probably want to use ave rather than aggregate:
diff.mean.ts <- ave(DF[["Value"]],
list(format(DF[["Date"]], "%m")), FUN=function(x) x-mean(x) )
If you wanted it in the same dataframe, then just assign it as a column:
DF$diff.mean.ts <- diff.mean.ts
The ave function is designed for adding columns to existing dataframes because it returns a vector of the same length as its first argument, in this case DF[["Value"]]. In the present instance it returns all 0's, which is the correct answer here because, in this small sample, each month holds only one distinct value.