I am using a for loop to iterate through a list of fund codes, make an API call for each, and then combine the data into one large zoo object. I would like to see something like this (shortened version), where there is a simple change between each column name (Zoo1, Zoo2, Zoo3):
However, currently I have column headers like this:
I have researched ways to do this (for example, whether merge.zoo has a parameter for a custom column name), but I can't seem to find anything.
Currently my code looks like this:
## Imported libraries
library(httr) ## Library to make the HTTP get request
library(jsonlite) ## Library to de-code the JSON
library(lattice) ## Help with the output plots
library(zoo) ## Used for the zoo time-series objects
library(ggplot2) ## Used for plotting
## Percentage Change - calculates the percentage difference between a base value and every navPrice value in a data frame
PercentageChange <- function(dataframe) {
Values <- as.vector(dataframe$navPrice) ## Collect the navPrice as a vector
ReturnValues <- c() ## Create a vector for the values that we are going to return
StartValue <- Values[length(Values)] ## Take the last element of the vector as the base value for the percentage change
for (Value in Values) { ## Loop through all of the price values.
PercentageChangeValue <- (Value-StartValue)/StartValue*100 ## Calculate the percentage change difference between the current and start value using the percentage change formula
ReturnValues <- append(ReturnValues, PercentageChangeValue) ## Add that value to the values that we will return
}
return(ReturnValues) ## Return the vector that we created
}
## Call API - based on the fund code given to us, collect the data from the API for a certain date range
CallAPI <- function(Fund, Start="1950-01-01", End="2050-01-01") {
Data <- read.csv("Data/VanguardFundApiNumbers.csv") ## Collect the data from the CSV file with the API Lookup codes and put that into a dataframe
Code <- Data[Data$Ticker == Fund, ]$VanguardCode ## Look up the row with the fund's ticker and then take the Vanguard code used by the API
Res <- GET(paste("https://www.vanguardinvestor.co.uk/api/fund-data/", Code, "/S/price-history?startDate=", Start, "&endDate=", End, sep="")) ## Create the URL that we will send a request to, and then send it
FundData <- fromJSON(rawToChar(Res$content)) ## Turn the raw data into a string, and then into JSON which can be assigned into the return variable
return(FundData) ## Return the return data
}
## Create Just Date Values - returns the dates without the time component, in a more readable format
CreateJustDateValues <- function(Dates) {
ReturnList <- c() ## Create a list for the return values
for (Date in Dates) { ## Loop through the dates
ReturnList <- append(ReturnList, gsub("-", ".", substr(Date, 1, 10))) ## For each date, take the first ten chars, replace any dashes with . and then add that to the return list
}
return(ReturnList) ## Return the list we have created
}
## Create a list of the funds that we want to look at
Funds <- c("VDEE", "VDWE", "VGAC", "VGSC", "VUSE", "VJSI", "VLSH")
StartDate <- "2021-06-01"
EndDate <- "2050-01-01"
## Process the first fund to set up the Combination variable
FundData <- CallAPI(Funds[1], Start=StartDate)
FundData$percentChange <- PercentageChange(FundData)
Combination <- zoo(FundData$percentChange, FundData$date)
## Loops through the remaining funds
for (Fund in Funds[2:length(Funds)]) {
FundData <- CallAPI(Fund, Start=StartDate) ## Returns a DataFrame with the data
FundData$percentChange <- PercentageChange(FundData) ## Create a percentage change column that we can analyse
ZooObject <- zoo(FundData$percentChange, FundData$date) ## Turn the wanted data into a zoo object
Combination <- merge.zoo(Combination, ZooObject) ## Combine it to the previous objects
}
Any explanation of how I could improve the column headers would be great!
If the problem is to merge a zoo vector z0 onto a zoo object z such that the new column has the name "X", then make z0 into a one-column zoo object and use setNames:
library(zoo)
z <- read.zoo(BOD)
z0 <- 10*z
merge(z, setNames(cbind(z0), "X"))
giving:
z X
1 8.3 83
2 10.3 103
3 19.0 190
4 16.0 160
5 15.6 156
7 19.8 198
If hard-coding X is OK, then it can be written as:
merge(z, X = z0)
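Applied to the loop in the question, a minimal sketch (untested, since it depends on the CallAPI and PercentageChange functions defined above) would label each column with its fund code as it is merged in:
Combination <- NULL
for (Fund in Funds) {
  FundData <- CallAPI(Fund, Start = StartDate) ## Returns a data frame with the fund data
  FundData$percentChange <- PercentageChange(FundData)
  ## cbind turns the plain zoo vector into a one-column zoo object,
  ## and setNames labels that column with the fund code
  ZooObject <- setNames(cbind(zoo(FundData$percentChange, FundData$date)), Fund)
  Combination <- if (is.null(Combination)) ZooObject else merge(Combination, ZooObject)
}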
I have a list of data frames on which I have to perform the same operations. I currently do this with a for loop, but it is too slow, so I would like to use lapply instead. The operation I need to perform is to check how many of the values in a date column of each data frame are missing from a vector of dates.
The data have the following structure.
# the dates which are of interest
dates <- seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days")
# the list of data frames
df_1 <- data.frame(seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days"), 1:9)
names(df_1) <- c("date", "value")
df_2 <- data.frame(seq(as.Date("2020-02-01"), as.Date("2020-02-07"), by = "days"), 1:7)
names(df_2) <- c("date", "value")
df_list <- list(df_1, df_2)
The loop which is working but is too slow looks like this.
for (i in 1:length(df_list)) {
# get range of dates in the data frame
df_date_range <- unique(df_list[[i]][["date"]])
# keep only the dates on or after the start of the data frame
dates_reduced <- dates[dates >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing dates is above 10 %
if (missing > 0.1) {
df_list[[i]] <- NULL
}
}
I tried the following lapply approach.
# write function to use in lapply
clean <- function(data, date_range) {
# get range of dates in the data frame
df_date_range <- unique(data$date)
# keep only the dates on or after the start of the data frame
dates_reduced <- date_range[date_range >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing dates is above 10 %
if (missing > 0.1) {
data <- NULL
}
}
# apply the function to the list of data frames
new_df_list <- lapply(df_list, clean, date_range = dates)
This however only yields a list of NULLs. Any help on what I'm doing wrong would be greatly appreciated.
While we don't have to explicitly specify return values in R, it's always better to do so. Your problem illustrates this point! (R functions implicitly return the result of the last evaluated expression, but that is not always what one would expect.)
Consider the following function:
no_explicit_return_value <- function() {
some_non_NULL_value <- 10000
}
If we run:
test_value <- no_explicit_return_value()
test_value
We get back:
[1] 10000
Since the last expression returned 10000... so far so good!
Now, consider this function:
no_explicit_return_value <- function() {
some_non_NULL_value <- 10000
if (1000 < 4) {
x <- NULL
}
}
If we run:
test_value <- no_explicit_return_value()
test_value
We get back:
NULL
Not because the if clause evaluated to TRUE, but because an if statement whose condition is FALSE (and which has no else branch) produces NULL, and that NULL becomes the function's implicit return value.
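You can see this directly at the console; an if statement with a FALSE condition and no else branch evaluates to an invisible NULL:
x <- if (FALSE) 1
x
# NULL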
The Solution:
clean <- function(data, date_range) {
# get range of dates in the data frame
df_date_range <- unique(data$date)
# keep only the dates on or after the start of the data frame
dates_reduced <- date_range[date_range >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing dates is above 10 %
if (missing > 0.1) {
data <- NULL
}
return(data)
}
# apply the function to the list of data frames
new_df_list <- lapply(df_list, clean, date_range = dates)
new_df_list
Returns:
[[1]]
date value
1 2020-02-01 1
2 2020-02-02 2
3 2020-02-03 3
4 2020-02-04 4
5 2020-02-05 5
6 2020-02-06 6
7 2020-02-07 7
8 2020-02-08 8
9 2020-02-09 9
[[2]]
NULL
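Note one difference from the loop: the loop removes elements from df_list, while lapply keeps the NULL placeholders. If you want to drop them afterwards, base R's Filter does it in one line:
## Drop the NULL entries to mimic the loop's removal behaviour
new_df_list <- Filter(Negate(is.null), new_df_list)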
I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner to R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
for (ik in length(arrYear)) {
newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
yy <- newPolicy[which.min(newPolicy$min_dist),]
}
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try to do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
x = TN_97_Lau_Cot,
y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
by = c("YEAR","POLICY"),
all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
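As a quick self-contained illustration of the base approach (the data here is made up to mirror the question's YEAR, POLICY and min_dist columns):
## Mock data standing in for TN_97_Lau_Cot
mock <- data.frame(YEAR = c(2016, 2016, 2017, 2017),
                   POLICY = c(699609, 699609, 699609, 699610),
                   min_dist = c(2.6, 3.1, 1.2, 0.8))
agg <- merge(mock,
             aggregate(min_dist ~ YEAR + POLICY, data = mock, min),
             by = c("YEAR", "POLICY"), all.x = TRUE)
unique(subset(agg, min_dist.x == min_dist.y))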
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
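For what it's worth, a more compact data.table idiom does the same job in one step; note that, unlike the version above, which.min keeps only the first row when there are ties:
## Keep the row with the smallest min_dist per YEAR/POLICY group
TN_97_Lau_Cot[, .SD[which.min(min_dist)], by = .(YEAR, POLICY)]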
I have a DataFrame data in SparkR. It contains user = 12 311, 12 320, ... and type = 1 2 3 4. We have 10000 users.
For example, one user has type = 1 2 3 4 4 4 2 4.
I want to find the most common integer in type for this user. In R I can solve it this way
mostcommon <- which.max(tabulate(user$type))
given that 'user' was a data.frame and not a DataFrame.
I want to do this for all users in 'data'. One way to do it is this:
u<- c()
for(j in 1:10000) {
id <- filter(data, data$user== j)
# For the jth user I make the data local to run the
# which.max and tabulate functions
idlocal <- collect(id)
u[j] <- which.max(tabulate(idlocal$type))
}
This runs in R/SparkR, and u gives me the most common type for every user. But it takes time, because I make the data local in order to run the which.max and tabulate functions. Is there a smarter and faster way to do this?
Furthermore, how could one find the two most common types as well?
Maybe not the best solution, but it works:
Create example data
localData <- data.frame(user = c(1,1,1,2,2,2),
type = c(1,2,2,3,3,2))
data <- createDataFrame(sqlContext, localData)
Group by user and type and count how many times each combination occurs (a sort of tabulate)
groupedData <- groupBy(data, data$user, data$type)
aggregated <- agg(groupedData, number = n(data$user))
Order by this count, since this is the easiest way to find the type with the maximal occurrence.
arranged <- arrange(aggregated, desc(aggregated$number))
Group again on user and take the first type, which is the most frequent one since we have ordered by count.
regroupedData <- groupBy(arranged, arranged$user)
firstItems <- agg(regroupedData, firstType = first(arranged$type), number = first(arranged$number))
Check out the results
collect(firstItems)
If you now want the second most occurring item as well, you can first delete these first items
firstDeleted <- except(arranged,firstItems)
And apply the same method again
rearranged <- arrange(firstDeleted, desc(firstDeleted$number))
reregroupedData <- groupBy(rearranged, rearranged$user)
secondItems <- agg(reregroupedData, secondType = first(rearranged$type))
Delete the unnecessary column and rename a column
firstItems$number <- NULL
secondItems <- withColumnRenamed(secondItems, "user", "user2")
For the final result, join these DataFrames (and delete column user2)
result <- join(firstItems,secondItems, firstItems$user == secondItems$user2)
result$user2 <- NULL
And again to check these results
collect(result)
I am trying to write a function that will return the number of business days between two dates (excluding not just weekends, but holidays as well). I'm approaching it by building a matrix with row names corresponding to days of the week, where each element of the matrix is either a 1 or a 0: a 0 if the day is a holiday or one of the extra padding elements needed to fill out the matrix.
I've checked the length of each vector in the code; it checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this warning message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in RStudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business days
#(not including holidays) until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe:
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications for you later in your code. If you use as.Date rather than strptime, you won't have this problem, because Date objects don't care about time.
But I'm not really sure why you're even bothering with the matrix at all. I think a simpler implementation would look like:
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
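Usage would then be, for example (a made-up target date; as in the question, the "cal" holiday file must be in the working directory, and the target date must not be in the past):
dte("2014-12-31")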
Is there a good package in R that allows one to subset (i.e. index into) a time series by times that are not in the series?
E.g. for financial applications, indexing a price series by a timestamp that is not in the database should return the latest available price before that timestamp.
In code, this is what I would like:
n = 15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series = zoo(rep(1, length(series.dates)), series.dates)
series[full.dates[11]]
This returns:
Data:
numeric(0)
Index:
character(0)
However, I would like this to return the value at the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract the index of the observations in your zoo object. The index can then be used for subsetting the object. Step by step, to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx and this answer)
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[10]]
# 2014-01-03
# 1
If you instead want missing values in your incomplete series to be replaced by the "next observation carried backward", you need to merge your series with a 'dummy' zoo object that contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
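Assigning that result lets you look up the reference date just as before:
series3.filled <- na.locf(series3, fromLast = TRUE)
series3.filled[full.dates[11]]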
na.locf(x, xout = newdate) seems not much worse than subscripting, but at any rate, here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation, but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
if (!missing(i) && inherits(i, class(index(x)))) {
zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
} else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different from what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if(NROW(series.subset)==0) {
# merge series with an empty zoo object
# that contains the index value you want
prior <- merge(series, zoo(,full.dates[11]))
# lag *back* one period so the NA is on the prior value
prior <- lag(prior, 1)
# get the index value at the prior value
prior <- index(prior)[is.na(prior)]
# subset again
series.subset <- series[prior]
}