Get the sum of a specific number of following rows in R

I have to solve this specific problem in R. I have a large data set, with columns and rows in this format:
Day_and_Time Rain1_mm/min Rain2_mm/min
01.12.10 18:01 0 0
.............. .... ...
02.12.10 01:00 0.03 0
02.12.10 01:01 0.03 0
02.12.10 01:02 0.01 0
02.12.10 01:03 0.05 0
02.12.10 01:04 0.03 0.1
02.12.10 01:05 0.04 0
.............. .... ...
02.12.10 18:00 0 0
What I want to do is write a function that sums up six consecutive rows and returns the result as a new row. This means that at the end I have a new list, looking for example like this:
Day_and_Time Rain1_mm/5min Rain2_mm/5min
.............. .... ...
02.12.10 01:05 0.19 0.1
02.12.10 01:10 .... ...
.............. .... ...
Is it possible to do this? The goal is to transform the unit [mm/min] from the first and second column to [mm/5min].
Thank you very much!

Assuming that you read the data in your .csv file as a data frame df, one approach to your problem is to use rollapply from the zoo package to give you a rolling sum:
library(zoo)
ind_keep <- seq(1,floor(nrow(df)/5)*5, by=5) ## 1.
out <- sapply(df[,-1], function(x) rollapply(x,6,sum)) ## 2.
out <- data.frame(df[ind_keep+5,1],out[ind_keep,]) ## 3.
colnames(out) <- c("Day_and_time","Rain1_mm/5min","Rain2_mm/5min") ## 4.
Notes:
1. Define the indices at every 5-minute mark where we want to keep the rolling sum over the following 5 minutes.
2. Apply a rolling sum to each column: sapply runs over every column of df except the first (the column indices in df[,-1] can be adjusted so that you process only certain columns). The function applied is rollapply from the zoo package, with a window width of 6 and sum as the function, so each window covers the closed interval [i, i+5].
3. At this point, out contains a rolling sum ending at every minute, but we only want those at every 5-minute mark. This line combines the Day_and_Time column from the original df with the rows of out at every 5 minutes; note that we keep the last Day_and_Time in each window.
4. This just renames the columns.
Using MikeyMike's data, which is
Day_and_Time rain1 rain2
1 2010-02-12 01:00:00 0.03 0.00
2 2010-02-12 01:01:00 0.03 0.00
3 2010-02-12 01:02:00 0.01 0.00
4 2010-02-12 01:03:00 0.05 0.00
5 2010-02-12 01:04:00 0.03 0.10
6 2010-02-12 01:05:00 0.04 0.00
7 2010-02-12 01:06:00 0.02 0.10
8 2010-02-12 01:07:00 0.10 0.10
9 2010-02-12 01:08:00 0.30 0.00
10 2010-02-12 01:09:00 0.01 0.00
11 2010-02-12 01:10:00 0.00 0.01
this gives:
print(out)
## Day_and_time Rain1_mm/5min Rain2_mm/5min
##1 2010-02-12 01:05:00 0.19 0.10
##2 2010-02-12 01:10:00 0.47 0.21
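The same numbers can be reproduced without zoo; here is a base-R sketch of the same windowed sums, with the example data rebuilt inline (timestamps replaced by minute ids) so the snippet is self-contained:

```r
# MikeyMike's rain columns, minute by minute
df <- data.frame(minute = 0:10,
                 rain1 = c(0.03, 0.03, 0.01, 0.05, 0.03, 0.04, 0.02, 0.10, 0.30, 0.01, 0),
                 rain2 = c(0, 0, 0, 0, 0.10, 0, 0.10, 0.10, 0, 0, 0.01))
ends <- seq(6, nrow(df), by = 5)  # last row of each closed window [i, i+5]
sums <- t(sapply(ends, function(i) colSums(df[(i - 5):i, -1])))
out  <- data.frame(minute = df$minute[ends], sums)
out
#   minute rain1 rain2
# 1      5  0.19  0.10
# 2     10  0.47  0.21
```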
Note the difference in the results: this approach uses overlapping windows, since you specified that you want to sum the six numbers in the closed interval [i, i+5] at each 5-minute mark.
To extend the above to a window over the closed interval [i, i+nMin] at each nMin-minute mark:
library(zoo)
nMin <- 10 ## for example 10 minutes
ind_keep <- seq(1, floor(nrow(df)/nMin)*nMin, by=nMin)
out <- sapply(df[,-1], function(x) rollapply(x, nMin+1, sum))
out <- data.frame(df[ind_keep+nMin, 1],out[ind_keep,])
colnames(out) <- c("Day_and_time",paste0("Rain1_mm/",nMin,"min"),paste0("Rain2_mm/",nMin,"min"))
For this to work, the data must have at least 2 * nMin + 1 rows.
Hope this helps.

Assuming you want the groups to be 0 - 5 minutes, 6 - 10 minutes, etc. this should give you what you're looking for:
library(data.table)
setDT(df)[, .(day_time = max(Day_and_Time),
              rain1_sum = sum(rain1),
              rain2_sum = sum(rain2)),
          by = .(floor(as.numeric(Day_and_Time)/360))]  # 360-second bins
floor day_time rain1_sum rain2_sum
1: 3516540 2010-02-12 01:05:00 0.19 0.10
2: 3516541 2010-02-12 01:10:00 0.43 0.21
Data:
df <- structure(list(Day_and_Time = structure(c(1265954400, 1265954460,
1265954520, 1265954580, 1265954640, 1265954700, 1265954760, 1265954820,
1265954880, 1265954940, 1265955000), class = c("POSIXct", "POSIXt"
), tzone = ""), rain1 = c(0.03, 0.03, 0.01, 0.05, 0.03, 0.04,
0.02, 0.1, 0.3, 0.01, 0), rain2 = c(0, 0, 0, 0, 0.1, 0, 0.1,
0.1, 0, 0, 0.01)), .Names = c("Day_and_Time", "rain1", "rain2"
), row.names = c(NA, -11L), class = c("data.table", "data.frame"))


Using rle function with condition on a column in R

My dataset has 523 rows and 93 columns and it looks like this:
data <- structure(list(`2018-06-21` = c(0.6959635416667, 0.22265625,
0.50341796875, 0.982942708333301, -0.173828125, -1.229259672619
), `2018-06-22` = c(0.6184895833333, 0.16796875, 0.4978841145833,
0.0636718750000007, 0.5338541666667, -1.3009207589286), `2018-06-23` = c(1.6165364583333,
-0.375, 0.570800781250002, 1.603515625, 0.5657552083333, -0.9677734375
), `2018-06-24` = c(1.3776041666667, -0.03125, 0.7815755208333,
1.5376302083333, 0.5188802083333, -0.552966889880999), `2018-06-25` = c(1.7903645833333,
0.03125, 0.724609375, 1.390625, 0.4928385416667, -0.723074776785701
)), row.names = c(NA, 6L), class = "data.frame")
Each row is a city, and each column is a day of the year.
After calculating the row average in this way
data$mn <- apply(data, 1, mean)
I want to create another column data$duration that indicates the average length of a period of consecutive days where the values are > than data$mn.
I tried with this code:
data$duration <- apply(data[-6], 1, function(x) with(rle(x > data$mean), mean(lengths[values])))
But it does not seem to work. In particular, it appears that rle( x > data$mean) fails to recognize the end of a row.
What are your suggestions?
Many thanks
EDIT
Reference dataframe has been changed into a [6x5]
The main challenge you're facing in your code is getting apply (which focuses on one row at a time) to look at the right values of the mean. We can avoid this entirely by keeping the mean out of the data frame, and doing the comparison data > mean to the whole data frame at once. The new columns can be added at the end:
mn = rowMeans(data)
dur = apply(data > mn, 1, function(x) with(rle(x), mean(lengths[values])))
dur
# 1 2 3 4 5 6
# 3.0 1.5 2.0 3.0 4.0 2.0
data = cbind(data, mean = mn, duration = dur)
print(data, digits = 2)
# 2018-06-21 2018-06-22 2018-06-23 2018-06-24 2018-06-25 mean duration
# 1 0.70 0.618 1.62 1.378 1.790 1.2198 3.0
# 2 0.22 0.168 -0.38 -0.031 0.031 0.0031 1.5
# 3 0.50 0.498 0.57 0.782 0.725 0.6157 2.0
# 4 0.98 0.064 1.60 1.538 1.391 1.1157 3.0
# 5 -0.17 0.534 0.57 0.519 0.493 0.3875 4.0
# 6 -1.23 -1.301 -0.97 -0.553 -0.723 -0.9548 2.0
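The with(rle(...), mean(lengths[values])) idiom is the core of this answer; here is a standalone toy illustration of what it computes:

```r
x <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)   # runs: T,T | F | T,T,T
r <- rle(x)
r$lengths                  # 2 1 3
r$values                   # TRUE FALSE TRUE
# mean length of the TRUE runs: (2 + 3) / 2
mean(r$lengths[r$values])  # 2.5
```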

R: calculating interests and balance at each step

I have a simple question, but I can't solve it easily with lag/lead or similar tools.
Let's say I have this table. I have an initial balance of 100; Position says whether I bid or not, and Percentage_change is what I get if I bid. How can I calculate the balance to get something like this?
Position Percentage_change Balance
0 0.01 100
0 -0.01 100
1 0.02 102
1 0.05 107.1
0 -0.02 107.1
1 0.03 110.3
cumprod is the function you are looking for, e.g.
df <- data.frame(Position = c(0, 0, 1, 1, 0, 1),
                 Percentage_change = c(0.01, -0.01, 0.02, 0.05, -0.02, 0.03))
# convert to multiplier form, e.g. 100 * 1.01
df$Multiplier <- df$Percentage_change + 1
# when Position is 0, reset the multiplier to 1 so the balance does not change
df[df$Position == 0, ]$Multiplier <- 1
# take the starting balance of 100 and multiply by the cumulative product of the multipliers
df$Balance <- 100 * cumprod(df$Multiplier)
df
Position Percentage_change Multiplier Balance
1 0 0.01 1.00 100.000
2 0 -0.01 1.00 100.000
3 1 0.02 1.02 102.000
4 1 0.05 1.05 107.100
5 0 -0.02 1.00 107.100
6 1 0.03 1.03 110.313

Dplyr mutate many columns with each new column conditional on two columns

My data frame is like the simple one below with many more columns and rows.
My goal is to add a new column for every "model" based on some ifelse() output using the matching pval and value_IC column.
Here, models are linear, beta and emax.
The closest problem I have found so far was here
https://community.rstudio.com/t/how-to-mutate-at-mutate-if-multiple-columns-using-condition-on-other-column-outside-vars-dplyr/17506/2
where there is always the same "second" column used.
data <- data.frame(pval.linear.orig = c(0.01, 0.06, 0.02),
                   pval.beta.orig = c(0.06, 0.02, 0.01),
                   pval.emax.orig = c(0.03, 0.01, 0.07),
                   value_IC.linear.orig = c(-5, NA, -4),
                   value_IC.beta.orig = c(NA, NA, -10),
                   value_IC.emax.orig = c(NA, -11, NA))
pval.linear.orig pval.beta.orig pval.emax.orig value_IC.linear.orig value_IC.beta.orig value_IC.emax.orig
1 0.01 0.06 0.03 -5 NA NA
2 0.06 0.02 0.01 NA NA -11
3 0.02 0.01 0.07 -4 -10 NA
If I only wanted it for one model, let's say beta, I would do this:
library(dplyr)
data_new <- data %>% mutate(conv.beta.orig = case_when(
  pval.beta.orig > 0.025 ~ NA,
  pval.beta.orig <= 0.025 & !is.na(value_IC.beta.orig) ~ TRUE,
  pval.beta.orig <= 0.025 & is.na(value_IC.beta.orig) ~ FALSE))
data_new
pval.linear.orig pval.beta.orig pval.emax.orig value_IC.linear.orig value_IC.beta.orig value_IC.emax.orig conv.beta.orig
1 0.01 0.06 0.03 -5 NA NA NA
2 0.06 0.02 0.01 NA NA -11 FALSE
3 0.02 0.01 0.07 -4 -10 NA TRUE
to get the conv.beta.orig column. The column name does not have to be exactly in this format.
My problem now is to do so with all models I have each using the pval.MODEL.orig and value_IC.MODEL.orig column as above.
Thank you very much for your help!
This is the first question I ever posted; let me know if I should reformulate something, missed something, or didn't spot this problem in case it already exists.
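One possible sketch for generalizing this to all models (base R, no dplyr needed; it assumes the pval.MODEL.orig / value_IC.MODEL.orig naming pattern from the question) loops over the model names and builds each conv.* column from the matching pair of columns:

```r
data <- data.frame(pval.linear.orig = c(0.01, 0.06, 0.02),
                   pval.beta.orig = c(0.06, 0.02, 0.01),
                   pval.emax.orig = c(0.03, 0.01, 0.07),
                   value_IC.linear.orig = c(-5, NA, -4),
                   value_IC.beta.orig = c(NA, NA, -10),
                   value_IC.emax.orig = c(NA, -11, NA))
models <- c("linear", "beta", "emax")
for (m in models) {
  p  <- data[[paste0("pval.", m, ".orig")]]
  ic <- data[[paste0("value_IC.", m, ".orig")]]
  # NA when not significant; otherwise TRUE exactly when an IC value exists
  data[[paste0("conv.", m, ".orig")]] <- ifelse(p > 0.025, NA, !is.na(ic))
}
data$conv.beta.orig
# NA FALSE TRUE
```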

Assign different values to a large number of columns

I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4*0 + 0.3*0 + 0.2*0.05 + 0.1*0.07, and SWeights = 0.2 + 0.1.
My initial idea was to assign the weights to each column like WCol2 <- 0.4, then use cbind to create new columns and c(as.matrix() %*% ) to get the sums. I soon realized this is impractical, as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type. I'm not sure whether that's the case with your real data, but I would first convert them to an appropriate type:
df1[-1] <- lapply(df1[-1], type.convert, as.is = TRUE)
Next, we will convert the column names to a numeric class too
vec <- as.numeric(names(df1)[-1])
Finally, we can easily create the new columns in two simple steps. This does incur a matrix-conversion overhead, but perhaps you should be working with matrices in the first place. Either way, it is fully vectorized:
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2004-04-30 0.05 0.03 0.19 0.03 0.070 1.0
Or, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient, as these are basically by-row operations:
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE)
res[, c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
        SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2004-04-30 0.070 1.0
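For completeness, the same two sums can be written with sweep() instead of a matrix product. This is a base-R sketch that re-creates only the numeric part of df1 (weight names kept as column names, as in the question):

```r
ret <- data.frame(`0.4` = c(0, 0, 0.03, 0.05),
                  `0.3` = c(0, 0.13, 0.09, 0.03),
                  `0.2` = c(0.05, 0.17, 0.21, 0.19),
                  `0.1` = c(0.07, 0.09, 0.01, 0.03),
                  check.names = FALSE)       # keep the weight names as-is
w <- as.numeric(names(ret))                  # 0.4 0.3 0.2 0.1
# multiply each column by its weight, then sum across each row
SWeightedR <- unname(rowSums(sweep(ret, 2, w, `*`)))
# same idea, but on the nonzero indicator, to sum the active weights
SWeights   <- unname(rowSums(sweep(as.matrix(ret) > 0, 2, w, `*`)))
SWeightedR  # 0.017 0.082 0.082 0.070
SWeights    # 0.3 0.6 1.0 1.0
```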

sapply? tapply? ddply? dataframe variable based on rolling index of previous values of another variable

I haven't found something which precisely matches what I need, so I thought I'd post this.
I have a number of functions that basically apply a function over a rolling index of a variable, and the results should naturally flow back into the data frame they came from.
For example,
data <- as.data.frame(as.matrix(seq(1:30)))
data$V1 <- data$V1/100
str(data)
data$V1_MA5d <- NA # rolling 5 day product
for (i in 5:nrow(data)) {
  start <- i - 5
  end <- i
  data$V1_MA5d[i] <- (prod(((data$V1[start:end]/100) + 1)) - 1) * 100
}
data
> head(data,15)
V1 V1_MA5d
1 0.01 NA
2 0.02 NA
3 0.03 NA
4 0.04 NA
5 0.05 0.1500850
6 0.06 0.2101751
7 0.07 0.2702952
8 0.08 0.3304453
9 0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389
But really, I should be able to do something like:
data$V1_MA5d<-sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100
But I'm not sure what that would look like.
Likewise, the count of a variable by another variable:
data$V1_MA5_cat <- NA
data$V1_MA5_cat[data$V1_MA5d < .5] <- 0
data$V1_MA5_cat[data$V1_MA5d > .5] <- 1
data$V1_MA5_cat[data$V1_MA5d > 1.5] <- 2
table(data$V1_MA5_cat)
data$V1_MA5_cat_n <- NA
data$V1_MA5_cat_n[data$V1_MA5_cat == 0] <- nrow(subset(data, V1_MA5_cat == 0))
data$V1_MA5_cat_n[data$V1_MA5_cat == 1] <- nrow(subset(data, V1_MA5_cat == 1))
data$V1_MA5_cat_n[data$V1_MA5_cat == 2] <- nrow(subset(data, V1_MA5_cat == 2))
> head(data,15)
V1 V1_MA5d V1_MA5_cat V1_MA5_cat_n
1 0.01 NA NA NA
2 0.02 NA NA NA
3 0.03 NA NA NA
4 0.04 NA NA NA
5 0.05 0.1500850 0 6
6 0.06 0.2101751 0 6
7 0.07 0.2702952 0 6
8 0.08 0.3304453 0 6
9 0.09 0.3906255 0 6
10 0.10 0.4508358 0 6
11 0.11 0.5110762 1 17
12 0.12 0.5713467 1 17
13 0.13 0.6316473 1 17
14 0.14 0.6919780 1 17
15 0.15 0.7523389 1 17
I know there is a better way - help!
You can do this one of a few ways. It's worth mentioning that you did write a "correct" for loop in R: you preallocated the vector by assigning data$V1_MA5d <- NA, so you are filling rather than growing, which is actually fairly efficient. However, if you want to use the apply family:
sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
[1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766
Notice my code inside the [] is different from yours. Check out the difference:
i <- 10
i - 5:i    # 5 4 3 2 1 0   (the "-" applies after 5:i is built)
(i-5):i    # 5 6 7 8 9 10
Or you can use rollapply from the zoo package:
library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
[1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569
As per the comment, this will give you a vector of length 26. Instead, you can add a few arguments to rollapply to make the result line up with your initial data:
rollapply(data$V1, 5, myfun, fill=NA, align='right')
In regard to your second question, plyr is handy here.
library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))
But there are many other choices too.
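Since plyr has long been superseded, one of those other choices is a base-R equivalent of the same per-group count using ave(); a toy sketch on its own small vector:

```r
v <- c(NA, NA, 0.2, 0.7, 0.8, 2.0)
cuts <- cut(v, breaks = c(-Inf, 0.5, 1.5, Inf))
# per-row count of how many rows fall in the same cut; NA rows stay NA
ave(as.numeric(cuts), cuts, FUN = length)
# NA NA 1 2 2 1
```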
