My question may seem trivial, but I cannot find a solution.
I have a list containing 5 dataframes named by stock (my real list contains 250 stocks).
Each dataframe has the same number and name of columns:
Date: time series formatted as character
returns: time series of stock returns
DJSTOXX: time series of index returns
EPS: Earnings announcement dates, which also contain NAs, formatted as characters.
For each dataframe I would like to run the evReturn function from the erer package, which returns a list of measures of cumulative abnormal returns around given dates. The main input of the function is y, a dataframe containing a column of dates (time series), a column of stock returns, a column of index returns, and a column of earnings announcement dates. As you can see below, this code runs the event analysis for the firm NESN, once for each date on which an announcement was made: for each date in the time series that equals an EPS date (omitting NAs), evReturn is called, and the output is saved in hh3, a list holding the event return for each EPS date.
hh3 <- list()
for (i in na.omit(dflist$NESN$Date[dflist$NESN$Date == dflist$NESN$EPS])) {
  hh3[[i]] <- evReturn(y = dflist$NESN, firm = "return", event.date = i,
                       y.date = "Date", index = "DJSTOXX",
                       event.win = 2, est.win = 50, digits = 3)
}
Now my question is:
How can I set it up to run this code for each stock in the list? I am expecting the resulting list to contain 5 lists (one per stock), each with a list for each EPS date.
As you can see, in the evReturn call I explicitly use $NESN, but I want to run the function for each dataframe in dflist.
I have tried the following with lapply:
lapply(dflist, function(x) {
  for (i in na.omit(x$Date[x$Date == x$EPS])) {
    dflist2[[i]] <- evReturn(y = x, firm = "return", event.date = i,
                             y.date = "Date", index = "DJSTOXX",
                             event.win = 2, est.win = 50, digits = 3)
  }
})
But it returns:
Error in xj[i] : only 0's may be mixed with negative subscripts
I thus tried to nest 2 loops as:
for (j in seq_along(dflist)) {
  for (i in na.omit(dflist$j$Date[dflist$j$Date == dflist$j$EPS])) {
    hh7[[j]][i] <- evReturn(y = dflist$j, firm = "return", event.date = i,
                            y.date = "Date", index = "DJSTOXX",
                            event.win = 2, est.win = 50, digits = 3)
  }
}
But it returns hh7 as a list of length 0.
Any help is highly appreciated since it seems I am missing something.
Thank you
Hard to be sure how to help because you don't provide any sample data, but you should be able to do something like this:
get_ev_return <- function(d) {
  dates <- na.omit(d[d$Date == d$EPS, "Date"])
  lapply(dates, \(date) {
    evReturn(y = d, firm = "return", event.date = date, y.date = "Date",
             index = "DJSTOXX", event.win = 2, est.win = 50, digits = 3)
  })
}
lapply(dflist, get_ev_return)
Before running the lapply(), test get_ev_return() on NESN and see if it works for one frame. Also, you'll need to check that dates within the get_ev_return() function actually contains any dates: maybe you have one or more data frames with no rows where Date == EPS. Again, this is the problem with not providing any sample data in your question.
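To make the empty-dates guard concrete, here is a runnable sketch; the `compute` argument is a hypothetical stand-in for the real evReturn call, which needs the erer package and your actual data:

```r
# Guarded version: returns an empty list when a frame has no announcement dates.
# `compute` is a hypothetical stub standing in for evReturn.
get_ev_return <- function(d, compute = function(date) paste("result for", date)) {
  dates <- na.omit(d[d$Date == d$EPS, "Date"])
  if (length(dates) == 0) return(list())   # frame with no Date == EPS rows
  setNames(lapply(dates, compute), dates)  # name each result by its event date
}

# Toy frame: rows 1 and 3 are announcement dates, row 2 has NA in EPS
d <- data.frame(Date = c("2020-01-01", "2020-04-01", "2020-07-01"),
                EPS  = c("2020-01-01", NA, "2020-07-01"),
                stringsAsFactors = FALSE)
res <- get_ev_return(d)
names(res)  # "2020-01-01" "2020-07-01"
```

Swapping the stub for the real evReturn call gives back the answer above, with the added benefit that frames without matching dates come back as empty lists instead of erroring.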
Related
I would like to store a result of an API call in a data frame. The code should loop through different time periods and countries.
If I do one example, it's like this:
testapicall <- jsonlite::fromJSON("https.api.companyname/jurisdiction=eu_ger&date=2018-01-01:2018-01-31&api_token=XYZ")
testapicall[["results"]]$total_number
Now I want to to get this "total number" for different jurisdictions and date ranges. One column should be country name, one should be the date (e.g., 01-2018), and one should be the total_number.
To set up the loop, I've split the API URL into 3 parts:
base_string1 <- "https.api.companyname/jurisdiction="
base_string2 <- "&date="
end_string <- "&api_token=XYZ"
Then, I can create the dates and countries like this:
dates <- seq(as.Date("1990-01-01"), as.Date("2022-01-01"), by = "month")
dates <- paste(head(dates, -1), tail(dates - 1, -1), sep = ":")
countries<- paste0("eu_", c("fra", "ger"))
Now, I'd like to get the results for each country-date into a data frame, but this is the part I am not sure how to do. I know I have to make an empty data frame and then fill it somehow.
for (y in dates) {
  for (c in countries) {
    api_string <- paste0(base_string1, c, base_string2, y, end_string)
    json <- jsonlite::fromJSON(api_string)
    json[["results"]]$total_number
  }
}
Any help is appreciated!
You can use map_dfr from purrr to iterate over the countries and dates and generate a dataframe with a single row for each iteration. map_dfr will row bind all the dataframes together.
library(purrr)
map_dfr(dates, function(date) {
  map_dfr(countries, function(country) {
    api_string <- paste0(base_string1, country, base_string2, date, end_string)
    json <- jsonlite::fromJSON(api_string)
    data.frame(country = country,
               date = date,
               total_number = json[["results"]]$total_number)
  })
})
Consider expand.grid to build all pairwise combinations of country and month date into a data frame, then mapply to create a new column that retrieves the API data elementwise for each country and month range.
Also, consider a user-defined method that uses tryCatch (useful with API calls) to return NA on error rather than stopping on problematic URLs.
# INPUTS
dates <- seq(as.Date("1990-01-01"), as.Date("2022-01-01"), by="month")
countries <- paste0("eu_", c("fra", "ger"))
# USER-DEFINED METHODS
get_api_data <- function(cnty, rng) {
  url <- paste0(
    "https.api.companyname/jurisdiction=", cnty,
    "&date=", rng, "&api_token=XYZ"
  )
  tryCatch({
    api_response <- jsonlite::fromJSON(url)
    api_response$results$total_number
  }, error = function(e) {
    message(url, " : ", conditionMessage(e))
    NA_real_
  })
}
add.months <- function(date, n)
  seq.Date(date, by = paste(n, "months"), length = 2)[2]

# BUILD DATA FRAME
api_results_df <- expand.grid(
  country = countries, date = dates
) |> within({
  month_add <- sapply(date, add.months, n = 1) |> `class<-`("Date")
  ranges <- paste(date, month_add - 1, sep = ":")
  # PASS COLUMN DATA ELEMENTWISE INTO DEFINED METHOD
  total_number <- mapply(get_api_data, cnty = country, rng = ranges)
  rm(ranges, month_add)
})
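Since the real endpoint can't be queried here, a hypothetical stub standing in for get_api_data() shows the shape of the frame the expand.grid/mapply combination produces (the stub just returns a fake count):

```r
# Hypothetical stub that mimics the API call so the pipeline runs offline
get_api_data_stub <- function(cnty, rng) nchar(cnty) + nchar(rng)

dates <- c("2018-01-01:2018-01-31", "2018-02-01:2018-02-28")
countries <- paste0("eu_", c("fra", "ger"))

# Every country crossed with every date range, one row per combination
api_results_df <- expand.grid(country = countries, date = dates,
                              stringsAsFactors = FALSE)
api_results_df$total_number <- mapply(get_api_data_stub,
                                      cnty = api_results_df$country,
                                      rng  = api_results_df$date)
nrow(api_results_df)  # 4
```

The result has one row per (country, date range) pair with the retrieved value alongside, which is exactly the three-column frame the question asks for.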
In R (RStudio), I have tried so many iterations of storing just the dates into the turning_point_dates data frame, but I have only been able to get it to store the loop numbers. I can print each date as it is found, but I am not able to store them.
dates <- data.frame(Date = seq(from = as.Date("2002-06-01"), to = as.Date("2011-09-30"), by = "day"))
nums <- c(98, 99, 100, 101, 102, 103, 104, 105, 106, 107)
dataframe_of_numbers <- data.frame(nums)
mat <- matrix(ncol = 0, nrow = 0)
turning_point_dates <- data.frame(mat)
for (i in 1:nrow(dataframe_of_numbers)) {
  print(dates$Date[dataframe_of_numbers[i, ]])
  turning_point_dates[i, ] <- dates$Date[dataframe_of_numbers[i, ]]
}
turning_point_dates
turning_point_dates
How can I instead store the actual dates that are being looped over into the turning_point_dates data frame?
turning_point_dates puts out a data frame looking like the following:
Description:df [10 x 0]
1
2
3
4
5
6
7
8
9
10
When I want instead a data frame like so:
"2002-09-06"
"2002-09-07"
"2002-09-08"
"2002-09-09"
"2002-09-10"
"2002-09-11"
"2002-09-12"
"2002-09-13"
"2002-09-14"
"2002-09-15"
It's a bit unclear, but if you want to end up with a smaller data frame that only has the dates corresponding to the row numbers in nums, you don't need a loop. You can just subset the data frame with nums, as shown below.
I'm also suggesting using a tibble instead of a basic data.frame, because subsetting a tibble always returns a tibble, while subsetting a single column of a data.frame simplifies to a vector.
library(tibble)
dates <- tibble::tibble(Date = seq(from = as.Date("2002-06-01"),
to = as.Date("2011-09-30"),
by = 'day'))
nums <- c(98,99,100,101,102,103,104,105,106,107)
dates_subsetted <- dates[nums,]
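To see the behavior difference the tibble suggestion is about, here is a base-R sketch: subsetting a single-column data.frame simplifies to a vector unless `drop = FALSE` is given.

```r
# Same data as above, but as a base data.frame
dates_df <- data.frame(Date = seq(from = as.Date("2002-06-01"),
                                  to = as.Date("2011-09-30"),
                                  by = "day"))
nums <- 98:107

class(dates_df[nums, ])                # "Date" -- simplified to a vector
class(dates_df[nums, , drop = FALSE])  # "data.frame" -- shape preserved
```

So with base data frames, `dates_df[nums, , drop = FALSE]` gives the same one-column frame a tibble would.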
It can also be done with a loop, but in my view it's much clunkier. It will almost certainly be much, much slower if you have a lot of data.
But since it was asked:
library(dplyr)
# set up another tibble for the values we'll extract
dates_looped <- tibble::tibble()

# loop through each row of the input; add that row if its index is in nums
for (i in 1:nrow(dates)) {
  if (i %in% nums) {
    dates_looped <- dplyr::bind_rows(dates_looped, dates[i, ])
  }
}
dates_looped and dates_subsetted are the same, but making dates_subsetted took a single line of code that will run many times faster.
I don't think you need a loop to do this. Here is what I did:
dates <- data.frame(Date = seq(from = as.Date("2002-06-01"), to = as.Date("2011-09-30"), by = 'day'))
nums = c(98,99,100,101,102,103,104,105,106,107)
turning_point_dates <- data.frame(nums, dates = dates$Date[nums])
I have data with date and time columns for different visit types.
Below is the sample data[1]
[1]: https://i.stack.imgur.com/GfnQb.png
From the above data I need to get the most frequently repeated value row-wise. I tried the following:
out1$MAX1 <- do.call('pmax',c(out1[,2:5],list(na.rm=TRUE)))
Output of above code[2]
[2]: https://i.stack.imgur.com/CitAa.png
It is giving wrong values for some of the rows. For example, in the above output the 3rd row contains "SFU", "SFU", "SFU", "SFU,GFU". The maximum code above returns "SFU,GFU", but it should return "SFU". I also have to add a new column with how many times that visit type is repeated (i.e., for the same 3rd row, "SFU" is the most frequent value and it is repeated 4 times).
How to achieve that?
Per your original question, you can get the most frequent categorical value row-wise by (1) creating a function that tabulates a vector and returns the most frequent value, and (2) looping through your data frame, unlisting each row and calling your function:
# Create sample data
df <- data.frame(
  id = 1:5,
  v1 = sample(c("SFU", "GFU"), 5, replace = TRUE),
  v2 = sample(c("SFU", "GFU"), 5, replace = TRUE),
  v3 = sample(c("SFU", "GFU"), 5, replace = TRUE),
  v4 = sample(c("SFU", "GFU"), 5, replace = TRUE),
  stringsAsFactors = FALSE
)
# Create function
get_most_frequent <- function(x) {
  tab <- sort(table(x))
  out <- names(tail(tab, 1))
  out
}

# Loop through data frame
df$most_frequent <- vector(mode = "character", length = nrow(df))
for (i in 1:nrow(df)) {
  r <- unlist(df[i, 2:5])
  df$most_frequent[i] <- get_most_frequent(r)
}
If you need to split up an instance like "SFU, GFU", you can adjust your function accordingly to split the strings by comma.
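For instance, a variant that splits combined entries before tabulating, and also reports how often the winner occurs (a sketch; note that ties are broken arbitrarily by sort order):

```r
# Variant that splits combined entries like "SFU,GFU" before counting,
# and also returns the winner's count
get_most_frequent <- function(x) {
  parts <- unlist(strsplit(x, ",\\s*"))  # "SFU,GFU" -> "SFU", "GFU"
  tab <- sort(table(parts))
  winner <- tail(tab, 1)
  list(value = names(winner), count = as.integer(winner))
}

res <- get_most_frequent(c("SFU", "SFU", "SFU", "SFU,GFU"))
res$value  # "SFU"
res$count  # 4
```

The `count` element covers the second part of the question (the new column with how many times the winning visit type is repeated).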
Currently, I'm having an issue with computation time because I run a triple for loop in R to create anomaly thresholds at the day-of-week and hour level for each unique ID.
My original data frame:
Unique ID, Event Date Hour, Event Date, Event Day of Week, Event Hour, Numeric Variable 1, Numeric Variable 2, etc.
df <- read.csv("mm.csv", header = TRUE, sep = ",")

for (i in unique(df$customer_id)) {
  # I initialize the output data frame so I can rbind as I loop through the grains.
  # This data frame is always emptied out once we move onto our next customer_id
  output.final.df <- data_frame(seller_name = factor(), is_anomaly_date = integer(),
                                event_date_hr = double(), event_day_of_wk = integer(),
                                event_day = double(), ...)
  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      # columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for
      merchant.df <- df[df$merchant_customer_id == i & df$event_day_of_wk == k & df$event_hr == z, 10:19]

      # 1st anomaly threshold - I have multiple different anomaly thresholds
      # TRANSFORM VARIABLES - sometimes within the for loop I run another loop
      # that transforms the subset of data within it.
      for (j in names(merchant.df)) {
        merchant.df[[paste(j, "_log")]] <- log(merchant.df[[j]] + 1)
        # merchant.df[[paste(j, "_scale")]] <- scale(merchant.df[[j]])
        # merchant.df[[paste(j, "_cube")]] <- merchant.df[[j]]**3
        # merchant.df[[paste(j, "_cos")]] <- cos(merchant.df[[j]])
      }
      mu_vector <- apply(merchant.df, 2, mean)
      sigma_matrix <- cov(merchant.df, use = "complete.obs", method = "pearson")
      inv_sigma_matrix <- ginv(sigma_matrix)
      det_sigma_matrix <- det(sigma_matrix)
      z_probas <- apply(merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
      eps <- quantile(z_probas, 0.01)
      mv_outliers <- ifelse(z_probas < eps, TRUE, FALSE)

      # 2nd anomaly threshold
      nov <- ncol(merchant.df)
      pca_result <- PCA(merchant.df, graph = FALSE, ncp = nov, scale.unit = TRUE)
      pca.var <- pca_result$eig[["cumulative percentage of variance"]] / 100
      lambda <- pca_result$eig[, "eigenvalue"]
      anomaly_score <- (as.matrix(pca_result$ind$coord)^2) %*% (1 / as.matrix(lambda, ncol = 1))
      significance <- c(0.99)
      thresh <- qchisq(significance, nov)
      pca_outliers <- ifelse(anomaly_score > thresh, TRUE, FALSE)

      # This is where I bind the anomaly points with the original data frame and then
      # row bind to the final output data frame. The code then goes back to the top and
      # loops through the next hour and then day of the week. temp.output.df is constantly
      # remade and output.df slowly grows bigger.
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df <- rbind(output.df, temp.output.df)
    }
  }
  # Again, this is where I write the output for a particular unique_ID;
  # output.df is recreated at the top for the next unique_ID
  write.csv(output.df, row.names = FALSE)
}
The following code shows the idea of what I'm doing. As you can see I run 3 for loops where I calculate multiple anomaly detections at the lowest grain which is the hour level by day of the week, then once I finish I output every unique customer_id level into a csv.
Each individual computation runs very fast; however, the triple for loop is killing my overall performance. Does anyone know another way I can do an operation like this, given my original data frame and the need to output a csv at every unique_id level?
So don't use a triple-loop. Use dplyr::group_by(customer_id, event_day_of_wk, event_hr), or the data.table equivalent. Both should be faster.
No need for explicit appending on every iteration with rbind and cbind which will kill your performance.
Also, no need to cbind() your entire input df into your output df; your only actual outputs are mv_outliers, pca_outliers; you could join() the input and output dfs later on customer_id, event_day_of_wk, event_hr
EDIT: since you want to collate all results for each customer_id then write.csv() them, that needs to go in an outer level of grouping, and group_by(event_day_of_wk, event_hr) in the inner level.
# Here is pseudocode; you can figure out the rest and build it up incrementally.
# It looks like seller_name, is_anomaly_date, event_date_hr, event_day_of_wk, event_day, ... are variables from your input
require(dplyr)

output.df <- df %>%
  group_by(customer_id) %>%
  group_by(event_day_of_wk, event_hr, .add = TRUE) %>%  # .add = TRUE keeps the customer_id grouping
  # columns 10:19 ('foo', 'bar', 'baz', ...) are the 9 different numeric variables I am creating anomaly thresholds for
  # Either a) you can hardcode their names in mutate(), summarize() calls
  # or b) you can reference the vars by string in mutate_(), summarize_() calls
  # TRANSFORM VARIABLES
  mutate(foo_log = log1p(foo), bar_log = log1p(bar), ...) %>%
  mutate(mu_vector = c(mean(foo_log), mean(bar_log), ...)) %>%
  # compute sigma_matrix, inv_sigma_matrix, det_sigma_matrix ...
  summarize(
    z_probas = mv_gaussian(mu_vector, det_sigma_matrix, inv_sigma_matrix),
    eps = quantile(z_probas, 0.01),
    mv_outliers = (z_probas < eps)
  ) %>%
  # similarly, use mutate() and do.call() for your PCA invocation...
  # Your outputs are mv_outliers and pca_outliers.
  # You don't necessarily need to cbind(merchant.df, mv_outliers, pca_outliers),
  # i.e. cbind all your input data together with your output.
  # Now remove all your temporary variables from your output:
  select(-foo_log, -bar_log, ...) %>%
  # or else just select(mv_outliers, pca_outliers), the variables you want to keep
  ungroup() %>%  # this ends the group_by(event_day_of_wk, event_hr) and combines the intermediate dataframes for you
  write.csv(c(.$mv_outliers, .$pca_outliers), file = "<this_customer_id>.csv")
# finally, ungroup() ends the group_by(customer_id)
See also "write.csv() in dplyr chain"
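If you'd rather see the grouped idea without dplyr, here is a runnable base-R sketch on toy data; the hypothetical quantile flag stands in for the actual anomaly computations, and the columns are made up to match the question's names:

```r
# Toy data standing in for the real input (hypothetical columns)
set.seed(1)
df <- data.frame(customer_id = rep(c("a", "b"), each = 6),
                 event_day_of_wk = rep(1:2, times = 6),
                 event_hr = rep(c(9, 10), each = 3),
                 x = rnorm(12))

# One pass per customer; inside it, one pass per (day-of-week, hour) cell
per_customer <- lapply(split(df, df$customer_id), function(cust) {
  cells <- split(cust, interaction(cust$event_day_of_wk, cust$event_hr, drop = TRUE))
  flagged <- lapply(cells, function(cell) {
    # stand-in for the anomaly thresholds: flag values below the 1% quantile
    cell$mv_outliers <- cell$x < quantile(cell$x, 0.01)
    cell
  })
  do.call(rbind, flagged)  # one rbind per customer instead of one per iteration
})

# write.csv(per_customer[["a"]], "a.csv", row.names = FALSE)  # one file per id
sapply(per_customer, nrow)
```

The key performance point is the same as in the answer: each customer's cells are bound once with a single `do.call(rbind, ...)` rather than growing a data frame inside the loop.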
Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
id name marks gender
1 a1 56 female
2 a2 37 male
I want to divide the students into groups based on the marks obtained, so that the difference between marks within each group is no more than 10. I tried the function table, which gives the number of students in each range (say 20-30, 30-40), but I want to pick the students with marks in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean by "put all their information together in a group", but here is a way to obtain a list of data frames, splitting up your original data frame so that each element contains the students within a 10-mark range:
mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

split(mydata, cut(mydata$marks, seq(20, 100, by = 10)))
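The result is a named list keyed by interval, so each mark band can be pulled out by its label (seeded here so the sampled toy data is reproducible):

```r
set.seed(42)  # make the sampled toy data reproducible
mydata <- data.frame(
  id = 1:100,
  name = paste0("a", 1:100),
  marks = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

groups <- split(mydata, cut(mydata$marks, seq(20, 100, by = 10)))
names(groups)        # "(20,30]" "(30,40]" ... "(90,100]"
groups[["(30,40]"]]  # all students with marks in (30, 40]
```

Note that cut()'s intervals are open on the left by default, so a mark of exactly 20 would be dropped; pass `include.lowest = TRUE` to keep it.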
I think that #Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
name = paste("a", 1:20, sep = ""),
marks = sample(20:100, 20, replace = TRUE),
gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
name = paste("b", 1:17, sep = ""),
marks = sample(30:100, 17, replace = TRUE),
gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially #Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, cut(x$marks,
breaks = c(0, 30, 50, 75, 100),
include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes all the rownames are the same in both sets
myDataALL <- rbind(myData1, myData2)
# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
I hope one of these options serves your needs!
I had the same kind of issue, and after researching answers on Stack Overflow I came up with the following solution:
Step 1: Define the ranges
Step 2: Find the elements that fall in each range
Step 3: Plot
A sample code is as shown below:
# Step 1: define the bin edges (the original loop just rebuilt seq()'s output)
range <- seq(0, max(all$downlink), by = 2000)

# Step 2: count the elements that fall in each range
counts <- numeric(length(range) - 1)
for (i in 1:length(counts)) {
  counts[i] <- length(which(all$downlink >= range[i] & all$downlink < range[i + 1]))
}

# Step 3: plot
countmax <- max(counts)
a <- round(countmax / 1000) * 1000
barplot(counts, col = rainbow(16), ylim = c(0, a))
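For what it's worth, the same binned counts can be computed without an explicit counting loop using cut() and table(). This is a sketch with made-up downlink values; `right = FALSE` matches the `>=` / `<` comparison above, and the top break is extended so the maximum value is not dropped:

```r
# Toy stand-in for the all$downlink column (hypothetical values)
all <- data.frame(downlink = c(100, 2500, 3000, 4100, 7999))

# Extend the top break so the maximum value falls inside the last bin
breaks <- seq(0, max(all$downlink) + 2000, by = 2000)
counts <- table(cut(all$downlink, breaks, right = FALSE, dig.lab = 4))
counts
#    [0,2000) [2000,4000) [4000,6000) [6000,8000)
#           1           2           1           1
barplot(counts, col = rainbow(length(counts)))
```

table() also labels each bar with its interval, which the manual counts vector does not.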