Sum across multiple time frames using R

I have two data frames, x and y. Data frame x has a start and end date for each id, while data frame y has a value for each individual date. For each row of x I want the sum of the y values whose dates fall within that row's range. Thus id "a" would get the sum of all the values from 2019/1/1 through 2019/3/1.
id <- c("a","b","c")
start_date <- as.Date(c("2019/1/1", "2019/2/1", "2019/3/1"))
end_date <- as.Date(c("2019/3/1", "2019/4/1", "2019/5/1"))
x <- data.frame(id, start_date, end_date)
dates <- seq(as.Date("2019/1/1"),as.Date("2019/5/1"),1)
values <- runif(121, min=0, max=7)
y <- data.frame(dates, values)
Desired output
id start_date end_date sum
a 2019/1/1 2019/3/1 221.8892

One base R option is using apply
x$sum <- apply(x, 1, function(v) sum(subset(y,dates >= v["start_date"] & dates<=v["end_date"])$values))
such that
> x
id start_date end_date sum
1 a 2019-01-01 2019-03-01 196.0311
2 b 2019-02-01 2019-04-01 185.6970
3 c 2019-03-01 2019-05-01 173.6429
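Note that apply() flattens each row of x to a character vector; the comparison still works here because comparing a Date with a character string converts the string with as.Date(). A row-wise sketch with mapply() that avoids the flattening (assuming the same x and y as above, not part of the original answer):
# mapply() iterates over the two Date columns directly, so the row is never
# flattened to character the way it is with apply()
x$sum <- mapply(function(s, e) sum(y$values[y$dates >= s & y$dates <= e]),
                x$start_date, x$end_date)
x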
Data
set.seed(1234)
id <- c("a","b","c")
start_date <- as.Date(c("2019/1/1", "2019/2/1", "2019/3/1"))
end_date <- as.Date(c("2019/3/1", "2019/4/1", "2019/5/1"))
x <- data.frame(id, start_date, end_date)
dates <- seq(as.Date("2019/1/1"),as.Date("2019/5/1"),1)
values <- runif(121, min=0, max=7)
y <- data.frame(dates, values)

There are many ways of doing this. One possibility would be:
library(data.table)
x <- setDT(x)
# create a complete series for each id
x <- x[, .(dates = seq(start_date, end_date, 1)), by=id]
# merge the data
m <- merge(x, y, by="dates")
# get the sums
m[, .(sum = sum(values)), by=id]
id sum
1: a 196.0311
2: b 185.6970
3: c 173.6429
You can call set.seed() before you create the random values to reproduce these numbers exactly:
set.seed(1234)
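For larger problems, a non-equi join avoids expanding every date range into individual rows. A sketch of that approach (assuming fresh copies of x and y as built in the Data block, and data.table >= 1.9.8):
library(data.table)
setDT(x); setDT(y)
# for each row of x, sum the y values whose dates fall inside [start_date, end_date]
x[, sum := y[x, on = .(dates >= start_date, dates <= end_date),
             sum(values), by = .EACHI]$V1]
x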

Related

Fill a data frame with increasing date objects in R

I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data frame. However, the values inside df are not dates. The date + 7 calculation itself works (e.g. it correctly moves on to 2010-01-08), but once I assign the result to df it turns into strange numeric values.
Appreciate any help.
Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get into the habit of not giving your variables names that are already used by functions, such as data and date.
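For what it's worth, the loop in the question produced numbers because a Date is stored internally as a day count since 1970-01-01, and assigning it into a plain (non-Date) data frame column drops the class. A sketch of the original loop idea with a pre-allocated Date vector (hypothetical names, not the asker's code):
start <- as.Date("2010-01-01")
out <- rep(as.Date(NA), 398)       # pre-allocating a Date vector keeps the class
for (i in seq_along(out)) {
  out[i] <- start + 7 * (i - 1)    # assigning into a Date vector stays a Date
}
head(out)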

How to filter large data-sets by two attributes and split into subsets? R / Grep

I have hit the limits of the grep() function, or perhaps there are more efficient ways of doing this.
Start with a sample data frame:
Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
"30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
df <- as.data.frame(cbind(Date, ISIN, price))
The desired result is a list() containing subsets of the main data file that look like the one below (times 3, one for each of the individual identifiers in Result_I).
The idea is that the data should first be filtered by ISIN and then by Date; this two-step process should keep my data intact.
Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)
Result_df <- cbind(Result_d, Result_I, Result_P)
Please keep in mind the above is for demo purposes; the real data set has 5M rows and 50 columns over a period of 450+ different dates as per Result_d, so I am looking for something that works irrespective of nrow or ncol.
What I have so far:
I take all unique dates and store them:
Unique_Dates <- unique(df$Date)
The same for the Identifiers:
Unique_ID <- unique(df$ISIN)
Now the grepping issue:
If I wanted all rows containing Unique_Dates I would do something like:
pattern <- paste(Unique_Dates, collapse = "|")
result <- as.matrix(df[grep(pattern, df$Date),])
and this basically retrieves the entire data set. I am wondering if anyone knows an efficient way of doing this.
Thanks in advance.
Using dplyr:
library(dplyr)
Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
"30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
DF <- data.frame(Date, ISIN, price,stringsAsFactors=FALSE)
DF$Date=as.Date(DF$Date,"%d-%b-%Y")
#Examine data ranges and frequencies
#date range
range(DF$Date)
#date frequency count
table(DF$Date)
#ISIN frequency count
table(DF$ISIN)
#select ISINs for filtering, with user defined choice of filters
# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)
subISIN = names(sort(table(DF$ISIN)))[2]
subDF=DF %>%
dplyr::group_by(ISIN) %>%
dplyr::arrange(ISIN,Date) %>%
dplyr::filter(ISIN %in% subISIN) %>%
as.data.frame()
#> subDF
# Date ISIN price
#1 2014-12-29 LU0168343191 7
#2 2014-12-30 LU0168343191 4
#3 2014-12-31 LU0168343191 1
We convert the data.frame to a data.table with setDT(df), specify the i based on the row index returned by grep(), and take the Subset of the Data.table (.SD) for those rows, grouped by Date.
library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
# Date ISIN price
#1: 31-DEC-2014 LU0168343191 1
#2: 30-DEC-2014 LU0168343191 4
#3: 29-DEC-2014 LU0168343191 7
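If the end goal is literally a list() with one subset per identifier, base R split() produces that in a single call. A sketch using the DF built in the dplyr answer above (with Date already parsed):
# one data.frame per ISIN, each ordered by Date
DF <- DF[order(DF$ISIN, DF$Date), ]
by_isin <- split(DF, DF$ISIN)
by_isin[["LU0168343191"]]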

Add 1 in column according to specific dates and count

I have a table containing a time series of daily values (value), a Date column, and a column of zeros. Here are the variables:
value <- c(37,19.75,19.5,14.5,24.75,25,25.5,19.75,19.75,14.25,21.25,21.75,17.5,16.25,14.5,
14.5,14.75,9.5,11.75,15.25,14.25,16.5,13.5,18.25,13.5,11.25,10.75,12,8.5,
9.75,14.75)
Date <- c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10",
"1997-05-11","1997-05-12","1997-05-13","1997-05-14","1997-05-15",
"1997-05-16","1997-05-17","1997-05-18","1997-05-19","1997-05-20",
"1997-05-21","1997-05-22","1997-05-23","1997-05-24","1997-05-25",
"1997-05-26","1997-05-27","1997-05-28","1997-05-29","1997-05-30",
"1997-05-31")
ncol <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
data <- data.frame(value, Date, ncol)
Date is formatted as a Date using the as.Date function. Now I want to add "1" to the values of the zero column (call it "newcol") on five specific days of the time series, e.g. on "1997.05.05", "1997.05.11", "1997.05.14", "1997.05.18" and "1997.05.25".
I wrote the following, but it works for a single date only:
x <- 1
i <- which(format(data$Date, "%Y.%m.%d") == "1997.05.05")
data$newcol[i] <- data$newcol[i] + x
How can I best do that for all five dates?
Then, for each flagged date (newcol = 1), I would like to count the number of times that value is > 20 over the previous 5 days. For example, for 1997.05.25 that means counting how many values are > 20 between 1997.05.21 and 1997.05.25.
This answers the 1st part of your question:
library(data.table)
setDT(data)[ Date %in% c("1997-05-05","1997-05-11","1997-05-14","1997-05-18","1997-05-25"), newcol := ncol+1 ]
# or perhaps better:
setDT(data)[, newcol := ifelse(Date %in% c("1997-05-05","1997-05-11","1997-05-14","1997-05-18","1997-05-25"), ncol+1, 0) ]
With base R this can be done
transform(data, newcol = as.integer(as.character(Date) %in%
c("1997-05-05","1997-05-11","1997-05-14","1997-05-18","1997-05-25")))

R to create a data frame a specific way

I am trying to create a data frame where, in the date column, each date is repeated a number of times (four in the example below) before moving on to the next date. For example:
1983-01-01
1983-01-01
1983-01-01
1983-01-01
1983-01-02
1983-01-02
etc.
for 10 years.
I used this command, but it does not give me the format I need.
data=data.frame(date=as.Date("1983-01-01") +seq(n))
head(data)
date
1 1983-01-02
2 1983-01-03
3 1983-01-04
4 1983-01-05
5 1983-01-06
6 1983-01-07
Here's one way to create the data frame:
library(zoo)
start_date <- as.Date("1983-01-01")
stop_date <- as.Date(as.yearmon(start_date) + 10) - 1
# [1] "1992-12-31"
dat <- data.frame(date = rep(seq(start_date, stop_date, by = 1), each = 4))
Update (based on comment):
dates <- lapply(seq(0, 9), function(x)
rep(as.Date((as.yearmon(start_date) + x) + (0:11)/12), each = 3) + c(0,10,20))
dat <- do.call(rbind, lapply(dates,
function(x) data.frame(date = rep(x, each = 4))))
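A base R variant without zoo is also possible, since the stop date is just the day before the 10-year anniversary (a sketch, repeating each date four times as in the example):
start_date <- as.Date("1983-01-01")
stop_date <- seq(start_date, by = "10 years", length.out = 2)[2] - 1   # 1992-12-31
dat <- data.frame(date = rep(seq(start_date, stop_date, by = "day"), each = 4))
head(dat)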

R applying a data frame on another data frame

I have two data frames.
set.seed(1234)
df <- data.frame(
id = factor(rep(1:24, each = 10)),
price = runif(20)*100,
quantity = sample(1:100,240, replace = T)
)
df2 <- data.frame(
id = factor(seq(1:24)),
eq.quantity = sample(1:100, 24, replace = T)
)
I would like to use df2$eq.quantity to find the value of df$quantity that is closest to it in absolute terms, within each level of the factor variable id. I would like to do that for each id in df2 and bind the results into a new data frame called results.
I can do it like this for each individual id:
d.1 <- df2[df2$id == 1, 2]
df.1 <- subset(df, id == 1)
id.1 <- df.1[which.min(abs(df.1$quantity-d.1)),]
Which would give the solution:
id price quantity
1 66.60838 84
But I would really like a smarter solution that also gathers the results into a data frame; if I did it manually it would look roughly like this:
results <- cbind(id.1, id.2, etc..., id.24)
I had some trouble giving this question a good name?
data.tables are smart!
Adding this to your current example...
library(data.table)
dt = data.table(df)
dt2 = data.table(df2)
setkey(dt, id)
setkey(dt2, id)
dt[dt2, dif:=abs(quantity - eq.quantity)]
dt[,list(price=price[which.min(dif)], quantity=quantity[which.min(dif)]), by=id]
result:
id price quantity
1: 1 66.6083758 84
2: 2 29.2315840 19
3: 3 62.3379442 63
4: 4 54.4974836 31
5: 5 66.6083758 6
6: 6 69.3591292 13
...
Merge the two datasets and use lapply to perform the function on each id.
df3 <- merge(df,df2,all.x=TRUE,by="id")
diffvar <- function(df){
  df4 <- subset(df3, id == df)
  df4[which.min(abs(df4$quantity - df4$eq.quantity)), ]
}
resultslist <- lapply(levels(df3$id),function(df) diffvar(df))
Combine the resulting list elements in a dataframe:
resultsdf <- data.frame(matrix(unlist(resultslist), ncol=4, byrow=T))
Or, more easily:
library(plyr)
resultsdf <- ddply(df3, .(id), function(x)x[which.min(abs(x$quantity-x$eq.quantity)),])
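A dplyr version of the same idea, in case you prefer that syntax (a sketch, assuming df and df2 exactly as defined in the question):
library(dplyr)
results <- df %>%
  inner_join(df2, by = "id") %>%                     # attach eq.quantity to every row of df
  group_by(id) %>%
  slice(which.min(abs(quantity - eq.quantity))) %>%  # keep the closest row per id
  ungroup()
results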
