Loop diagonal multiplication - 7 * 7 matrix ... and so on - r

I need to do a diagonal multiplication on the table below, taking one 7*7 window at a time:
Step 1: do the diagonal multiplication for the first 7*7 matrix (columns 1 to 7).
Step 2: ignore the first column, select the next 7 columns and 7 rows, and do the diagonal multiplication.
Step 3: ignore the 1st and 2nd columns, select the next 7 columns and 7 rows, and do the diagonal multiplication.
Step 4: as in step 3, keep incrementing the number of ignored columns (1, 2, 3, ...) and so on.
Note: the diagonal runs upward, from the bottom-right corner to the upper-left corner.
Data:
28/02/2013 31/03/2013 30/04/2013 31/05/2013 30/06/2013 31/07/2013 31/08/2013 30/09/2013 31/10/2013 30/11/2013 31/12/2013 31/01/2014 28/02/2014
0.04 0.03 0.03 0.04 0.04 0.07 0.86 0.28 0.05 0.05 0.05 0.04 0.04
0.44 0.44 0.42 0.43 0.40 0.32 0.64 0.02 0.33 0.36 0.30 0.27 0.37
0.57 0.57 0.52 0.59 0.62 0.51 0.79 0.23 0.64 0.66 0.50 0.55 0.60
0.61 0.58 0.60 0.63 0.65 0.59 0.81 0.83 1.00 0.63 0.57 0.63 0.74
0.70 0.65 0.66 0.71 0.73 0.66 0.86 0.90 0.55 0.76 0.65 0.66 0.74
0.76 0.76 0.79 0.74 0.83 0.83 0.86 1.00 0.61 0.83 0.38 0.74 0.75
0.80 0.84 0.89 0.84 0.82 0.83 0.98 0.84 0.44 0.93 0.88 0.78 0.78
Considering each column as A, B, C, D, E, F, G, H, I, J, K and so on ... there will be many columns, but the number of rows will be only 7.
The diagonal products for each 7*7 matrix are calculated as follows.
A is the result for STEP 1, B for STEP 2, C for STEP 3, and so on.
A B C
G8*F7*E6*D5*C4*B3*A2 = 0.00 H8*G7*F6*E5*D4*C3*B2 = 0.02 I8*H7*G6*F5*E4*D3*C2 = 0.00
G8*F7*E6*D5*C4*B3 = 0.08 H8*G7*F6*E5*D4*C3 = 0.08 I8*H7*G6*F5*E4*D3 = 0.06
G8*F7*E6*D5*C4 = 0.19 H8*G7*F6*E5*D4 = 0.18 I8*H7*G6*F5*E4 = 0.14
G8*F7*E6*D5 = 0.37 H8*G7*F6*E5 = 0.31 I8*H7*G6*F5 = 0.22
G8*F7*E6 = 0.59 H8*G7*F6 = 0.47 I8*H7*G6 = 0.38
G8*F7 = 0.81 H8*G7 = 0.72 I8*H7 = 0.44
G8 = 0.98 H8 = 0.84 I8 = 0.44
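Each column of products above is a reverse cumulative product of the window's diagonal, starting from the bottom-right cell. A minimal sketch on a made-up 3*3 matrix (the matrix m and its values are illustrative only, not taken from the data):

```r
# Hypothetical 3*3 matrix; only the diagonal (2, 3, 4) matters here
m <- matrix(c(2, 9, 9,
              9, 3, 9,
              9, 9, 4), nrow = 3, byrow = TRUE)

d <- diag(m)                  # 2 3 4, read from top-left to bottom-right
res <- rev(cumprod(rev(d)))   # accumulate from the bottom-right cell upward
res
# [1] 24 12  4
```

The last element is the bottom-right cell on its own, and the first element is the product of the whole diagonal.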
So result should be printed as.
A B C
0 0.02 0.00
0.08 0.08 0.06
0.19 0.18 0.14
0.37 0.31 0.22
0.59 0.47 0.38
0.81 0.72 0.44
0.98 0.84 0.44
Similarly, there will be results for D, E, F, and so on.
Please help. Thanks in advance.

sapply(
  lapply(7:NCOL(df), function(i) df[, (i - 6):i]),
  function(a) round(x = rev(cumprod(rev(diag(as.matrix(a))))), digits = 2)
)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#[2,] 0.09 0.08 0.06 0.08 0.08 0.03 0.00
#[3,] 0.19 0.18 0.14 0.21 0.26 0.05 0.15
#[4,] 0.37 0.31 0.22 0.41 0.33 0.23 0.24
#[5,] 0.59 0.48 0.38 0.51 0.40 0.23 0.38
#[6,] 0.81 0.72 0.44 0.57 0.73 0.30 0.58
#[7,] 0.98 0.84 0.44 0.93 0.88 0.78 0.78
Let me know if the output is correct
DATA
df = structure(list(A = c(0.04, 0.44, 0.57, 0.61, 0.7, 0.76, 0.8),
B = c(0.03, 0.44, 0.57, 0.58, 0.65, 0.76, 0.84), C = c(0.03,
0.42, 0.52, 0.6, 0.66, 0.79, 0.89), D = c(0.04, 0.43, 0.59,
0.63, 0.71, 0.74, 0.84), E = c(0.04, 0.4, 0.62, 0.65, 0.73,
0.83, 0.82), F = c(0.07, 0.32, 0.51, 0.59, 0.66, 0.83, 0.83
), G = c(0.86, 0.64, 0.79, 0.81, 0.86, 0.86, 0.98), H = c(0.28,
0.02, 0.23, 0.83, 0.9, 1, 0.84), I = c(0.05, 0.33, 0.64,
1, 0.55, 0.61, 0.44), J = c(0.05, 0.36, 0.66, 0.63, 0.76,
0.83, 0.93), K = c(0.05, 0.3, 0.5, 0.57, 0.65, 0.38, 0.88
), L = c(0.04, 0.27, 0.55, 0.63, 0.66, 0.74, 0.78), M = c(0.04,
0.37, 0.6, 0.74, 0.74, 0.75, 0.78)), .Names = c("A", "B",
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"), class = "data.frame", row.names = c(NA,
-7L))

I think a for loop is a good bet here (inspired by this):
n <- nrow(df)
b <- ncol(df) - n + 1
out <- matrix(0, n, b)
ro <- 1:n
for (i in 1:b) {
  co <- i:(n + i - 1)
  out[ro, i] <- rev(cumprod(rev(df[cbind(ro, co)])))
}
out
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.003423605 0.002303868 0.001785601 0.003374663 0.00337162 0.00232112
# [2,] 0.085590113 0.076795599 0.059520050 0.084366587 0.08429050 0.03315886
# [3,] 0.194522983 0.182846664 0.138418720 0.210916467 0.26340780 0.05181072
# [4,] 0.374082660 0.309909600 0.223256000 0.413561700 0.33342760 0.22526400
# [5,] 0.593782000 0.476784000 0.378400000 0.510570000 0.40172000 0.22526400
# [6,] 0.813400000 0.722400000 0.440000000 0.567300000 0.73040000 0.29640000
# [7,] 0.980000000 0.840000000 0.440000000 0.930000000 0.88000000 0.78000000
Wrap the answer in round() to alter how it is printed.
Another way, also using indexing:
ro <- nrow(df)
co <- ncol(df)
b <- co - ro + 1
id <- pmin(ro, b)
ccols <- mapply(seq, 1:b, id:co)
rrows <- rep(1:ro, b)
mat <- matrix(rev(df[cbind(rrows, c(ccols))]), nr=ro)
matrix(rev(matrixStats::colCumprods(mat)), nr=ro)
A quick benchmark on larger data suggests that method two is considerably faster; however, if you convert the data frame to a matrix first, the for loop has similar speed.
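One way to try the matrix variant is to index a matrix rather than a data frame inside the loop. A minimal sketch, assuming a small made-up matrix m standing in for as.matrix(df) (3 rows and 5 columns, so there are 3 windows):

```r
# Made-up matrix in place of as.matrix(df); entry [r, c] equals r + c - 1
m <- matrix(c(1, 2, 3,
              2, 3, 4,
              3, 4, 5,
              4, 5, 6,
              5, 6, 7), nrow = 3)

n <- nrow(m)
b <- ncol(m) - n + 1          # number of square windows
out <- matrix(0, n, b)
ro <- seq_len(n)

for (i in seq_len(b)) {
  co <- i:(n + i - 1)         # columns of the i-th window's diagonal
  out[, i] <- rev(cumprod(rev(m[cbind(ro, co)])))
}
out
#      [,1] [,2] [,3]
# [1,]   15   48  105
# [2,]   15   24   35
# [3,]    5    6    7
```

Indexing a matrix with a two-column matrix of (row, column) pairs avoids the data frame to matrix conversion that would otherwise happen on every iteration.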

Related

Mutating new columns based on common string using existing columns

Sample data:
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.21 0.02 0.61 10 5 3 0.01
2 0.01 0.02 0.37 0.4 0.01 0.8 0.5
3 0.02 0.03 0.55 0.01 0.01 0.3 0.99
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55
5 0.11 0.1 -0.08 0.22 0.015 0.01 0.01
6 0.22 0.21 -0.08 0.02 0.03 0.01 0.01
I have a dataset which has columns of some variable of interest, say alpha, beta, and so on. I also have this saved as a character vector. I want to be able to mutate new columns based on these variable names, suffixed with an identifier, using the existing columns in the dataset as part of some transformation, like this:
df %>% mutate(
alpha_new = ((alpha_5-alpha_1) / (X_5-X_1) * Y),
beta_new = ((beta_5-beta_1) / (X_5-X_1) * Y)
)
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1 alpha_new beta_new
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.21 0.02 0.61 10 5 3 0.01 16.1 9.60
2 0.01 0.02 0.37 0.4 0.01 0.8 0.5 -14.4 -11.1
3 0.02 0.03 0.55 0.01 0.01 0.3 0.99 0 38.0
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55 -0.145 15.1
5 0.11 0.1 -0.08 0.22 0.015 0.01 0.01 -1.64 0
6 0.22 0.21 -0.08 0.02 0.03 0.01 0.01 0.0800 0
In my real data I have many more columns like this, and I'm struggling to implement this in a "tidy" way that isn't hardcoded. What's the best practice for my situation?
Sample code:
structure(
list(
X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
X_1 = c(0.02,
0.02, 0.03, 0.05, 0.10, 0.21),
Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)
),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")
) -> df
variable_of_interest <- c("alpha", "beta")
Here's another way to approach this with dynamic creation of columns. With map_dfc from purrr you can column-bind new results, creating new column names with bang-bang on left hand side of := operator, and using .data to access column values on right hand side.
library(tidyverse)
bind_cols(
df,
map_dfc(
variable_of_interest,
~ transmute(df, !!paste0(.x, '_new') :=
(.data[[paste0(.x, '_5')]] - .data[[paste0(.x, '_1')]]) /
(X_5 - X_1) * Y)
)
)
Output
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1 alpha_new beta_new
1 0.21 0.02 0.61 10.00 5.000 3.00 0.01 16.05263 9.599474
2 0.01 0.02 0.37 0.40 0.010 0.80 0.50 -14.43000 -11.100000
3 0.02 0.03 0.55 0.01 0.010 0.30 0.99 0.00000 37.950000
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55 -0.14500 15.080000
5 0.11 0.10 -0.08 0.22 0.015 0.01 0.01 -1.64000 0.000000
6 0.22 0.21 -0.08 0.02 0.030 0.01 0.01 0.08000 0.000000
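For comparison, the same columns can also be built with a plain loop over the variable names in base R. This is only a sketch of an alternative, not the approach used in the answer above, and it assumes every variable has matching _5 and _1 columns:

```r
# Sample data from the question, as a plain data frame
df <- data.frame(
  X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
  X_1 = c(0.02, 0.02, 0.03, 0.05, 0.10, 0.21),
  Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
  alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
  alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
  beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
  beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)
)
variable_of_interest <- c("alpha", "beta")

# Build one <name>_new column per variable, looking columns up by name
for (v in variable_of_interest) {
  hi <- df[[paste0(v, "_5")]]
  lo <- df[[paste0(v, "_1")]]
  df[[paste0(v, "_new")]] <- (hi - lo) / (df$X_5 - df$X_1) * df$Y
}
df[, c("alpha_new", "beta_new")]
```

The loop keeps the formula in one place, so adding a name to variable_of_interest is enough to get another column.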
Better to pivot the data first
library(dplyr)
library(tidyr)
# your data
df <- structure(list(X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22), X_1 = c(0.02,
0.02, 0.03, 0.05, 0.1, 0.21), Y = c(0.61, 0.37, 0.55, 0.29, -0.08,
-0.08), alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02), alpha_1 = c(5,
0.01, 0.01, 0.005, 0.015, 0.03), beta_5 = c(3, 0.8, 0.3, 0.03,
0.01, 0.01), beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)), class = "data.frame", row.names = c(NA,
-6L))
df <- df |> mutate(id = 1:n()) |>
pivot_longer(cols = -c(id, Y, X_5, X_1),
names_to = c("name", ".value"), names_sep="_") |>
mutate(new= (`5` - `1`) / (X_5 - X_1) * Y) |>
pivot_wider(id_cols = id, names_from = "name", values_from = c(`5`,`1`, `new`),
names_glue = "{name}_{.value}", values_fn = sum)
df
#> # A tibble: 6 × 7
#> id alpha_5 beta_5 alpha_1 beta_1 alpha_new beta_new
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 3 5 0.01 16.1 9.60
#> 2 2 0.4 0.8 0.01 0.5 -14.4 -11.1
#> 3 3 0.01 0.3 0.01 0.99 0 38.0
#> 4 4 0.01 0.03 0.005 0.55 -0.145 15.1
#> 5 5 0.22 0.01 0.015 0.01 -1.64 0
#> 6 6 0.02 0.01 0.03 0.01 0.0800 0
Created on 2023-02-16 with reprex v2.0.2
Note: if you want to add X_5 and X_1 in the output use id_cols = c(id, X_5, X_1) instead.
I modified your data to create a slightly more complicated situation; my hope is that this is close to your real situation. This idea assumes that the two columns you want to pair up sit next to each other. The first job is to collect the column names that begin with lowercase letters. The next job is to create a data frame: the names in odd positions in target go in the first column, and the names in even positions go in the second column. I was thinking along the same lines as Ben, and used map2_dfc to create an output data frame. In this function, I replace the lowercase letters with X so that I can pick out the two matching columns in the original data (i.e., the ones starting with X). Then I do the calculation as you specified. Finally, I create a column name for the outcome inside the loop. If you want to add the result to the original data, run the final line with cbind.
library(tidyverse)  # provides tibble() and map2_dfc()
grep(x = names(df), pattern = "[[:lower:]]+_[0-9]+", value = TRUE) -> target
tibble(first_element = target[c(TRUE, FALSE)],
second_element = target[c(FALSE, TRUE)]) -> mydf
map2_dfc(.x = mydf$first_element,
.y = mydf$second_element,
.f = function(x, y) {
sub(x = x, pattern = "[[:lower:]]+", replacement = "X") -> foo1
sub(x = y, pattern = "[[:lower:]]+", replacement = "X") -> foo2
outcome <- ((df[x] - df[y]) / (df[foo1] - df[foo2]) * df["Y"])
names(outcome) <- paste(x,
sub(x = y, pattern = "[[:lower:]]+", replacement = ""),
sep = "")
return(outcome)
}) -> result
cbind(df, result)
# alpha_5_1 alpha_2_6 beta_5_1 beta_3_4
#1 16.05263 0.10736 9.599474 0.27145
#2 -14.43000 0.10730 -11.100000 0.28564
#3 0.00000 0.28710 37.950000 0.50820
#4 -0.14500 0.21576 15.080000 0.64206
#5 -1.64000 -0.06416 0.000000 -0.61352
#6 0.08000 -0.08480 0.000000 -0.25400
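The odd/even split relies on logical index recycling: c(TRUE, FALSE) is repeated along the vector, so it keeps positions 1, 3, 5, and so on. A small sketch with a made-up target vector:

```r
# Hypothetical column names, already in adjacent pairs
target <- c("alpha_5", "alpha_1", "beta_5", "beta_1")

first_element  <- target[c(TRUE, FALSE)]   # odd positions
second_element <- target[c(FALSE, TRUE)]   # even positions

first_element
# [1] "alpha_5" "beta_5"
second_element
# [1] "alpha_1" "beta_1"

# Swapping the lowercase prefix for X gives the matching X column name
x_cols <- sub(pattern = "[[:lower:]]+", replacement = "X", x = first_element)
x_cols
# [1] "X_5" "X_5"
```

This only pairs columns correctly when each pair is adjacent and in the same order, which is the condition stated in the answer.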
DATA
structure(list(
X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
X_1 = c(0.02,0.02, 0.03, 0.05, 0.10, 0.21),
X_2 = 1:6,
X_6 = 6:11,
X_3 = 21:26,
X_4 = 31:36,
Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
alpha_2 = c(0.12, 0.55, 0.39, 0.28, 0.99, 0.7),
alpha_6 = 1:6,
beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01),
beta_3 = c(0.55, 0.28, 0.76, 0.86, 0.31, 0.25),
beta_4 = c(5, 8, 10, 23, 77, 32)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")) -> df

How to order one column based on another in R

I have this data where I want to order val based on quant:
1 will correspond to the highest value, and so on.
So 1 will correspond to 181.2349.
data = structure(list(quant = c(0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12,
0.14, 0.16, 0.18, 0.2, 0.22, 0.24, 0.26, 0.28, 0.3, 0.32, 0.34,
0.36, 0.38, 0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56,
0.58, 0.6, 0.62, 0.64, 0.66, 0.68, 0.7, 0.72, 0.74, 0.76, 0.78,
0.8, 0.82, 0.84, 0.86, 0.88, 0.9, 0.92, 0.94, 0.96, 0.98, 1),
val = c(47.91623, 90.3489408, 127.16448, 70.526045, 66.3226236,
85.103976, 139.317196, 127.446425999999, 91.5951164, 86.805257,
111.71706, 79.3636359999997, 73.1136444, 147.4201476, 65.2126171999996,
135.85975, 127.401408, 106.597378999999, 101.1695592, 94.1209831999999,
93.1355219999998, 96.3409336000001, 90.2044183999998, 75.7257826,
147.727516, 80.45166, 102.691942399999, 77.5738932, 62.665275199999,
128.7217, 156.20672, 132.990364, 118.481792, 118.512295599999,
57.3580020000001, 110.16883, 145.284928, 155.691106799999,
134.824147999999, 161.223344, 98.6174559999996, 99.0563548,
131.044792000001, 124.3800214, 99.4231451999992, 154.733724999998,
120.806394399999, 86.9254320000016, 139.611945600001, 181.234905600001,
119.7396)), row.names = c(NA, -51L), class = c("data.table",
"data.frame"))
You can do:
data[] <- lapply(data, sort, decreasing = TRUE)
head(data)
quant val
1: 1.00 181.2349
2: 0.98 161.2233
3: 0.96 156.2067
4: 0.94 155.6911
5: 0.92 154.7337
6: 0.90 147.7275
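Note that sort is applied to each column separately here, so the original pairing of quant and val within a row is not kept; that matches the goal of lining up the largest quant with the largest val. A tiny sketch on made-up data:

```r
# Made-up data: val is deliberately not in the same order as quant
d <- data.frame(quant = c(0, 0.5, 1), val = c(10, 30, 20))

# Sort every column on its own, largest first; d[] keeps the data frame shape
d[] <- lapply(d, sort, decreasing = TRUE)
d
#   quant val
# 1   1.0  30
# 2   0.5  20
# 3   0.0  10
```

If the rows had to stay intact, ordering by a single column with d[order(-d$quant), ] would be the usual choice instead.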
You can solve the problem as follows:
data[, `:=`(quant=sort(quant, TRUE), val=sort(val, TRUE))]
head(data)
quant val
1: 1.00 181.2349
2: 0.98 161.2233
3: 0.96 156.2067
4: 0.94 155.6911
5: 0.92 154.7337
6: 0.90 147.7275
# or
cols = c("quant", "val")
data[, (cols) := lapply(.SD, sort, TRUE), .SDcols=cols]
dplyr option (updated thanks to @Adam):
library(dplyr)
data %>%
  mutate(across(everything(), ~ sort(.x, decreasing = TRUE)))
Output:
quant val
1: 1.00 181.23491
2: 0.98 161.22334
3: 0.96 156.20672
4: 0.94 155.69111
5: 0.92 154.73372
6: 0.90 147.72752
7: 0.88 147.42015
8: 0.86 145.28493
9: 0.84 139.61195
10: 0.82 139.31720
11: 0.80 135.85975
12: 0.78 134.82415
13: 0.76 132.99036
14: 0.74 131.04479
15: 0.72 128.72170
16: 0.70 127.44643
17: 0.68 127.40141
18: 0.66 127.16448
19: 0.64 124.38002
20: 0.62 120.80639
21: 0.60 119.73960
22: 0.58 118.51230
23: 0.56 118.48179
24: 0.54 111.71706
25: 0.52 110.16883
26: 0.50 106.59738
27: 0.48 102.69194
28: 0.46 101.16956
29: 0.44 99.42315
30: 0.42 99.05635
31: 0.40 98.61746
32: 0.38 96.34093
33: 0.36 94.12098
34: 0.34 93.13552
35: 0.32 91.59512
36: 0.30 90.34894
37: 0.28 90.20442
38: 0.26 86.92543
39: 0.24 86.80526
40: 0.22 85.10398
41: 0.20 80.45166
42: 0.18 79.36364
43: 0.16 77.57389
44: 0.14 75.72578
45: 0.12 73.11364
46: 0.10 70.52604
47: 0.08 66.32262
48: 0.06 65.21262
49: 0.04 62.66528
50: 0.02 57.35800
51: 0.00 47.91623
quant val

Fetching data from a data table in R

I have two data tables: MP and MPSubSample. MP has monthly data from 1965 to 2018, and MPSubSample has a few data points from MP. I want to expand MPSubSample such that if it contains data for 196801 (January 1968), then I get the data from three months before to three months after January 1968 from the MP data table and add it to the MPSubSample data table. An example is as follows:
MPSubSample:
Month ER SENT SENT+ TS DS D12 E12 Inf
196608 -7.905 -1.12 -1.22 0.26 0.52 2.870 5.493 32.650
MP:
Month ER SENT SENT+ TS DS D12 E12 Inf
196604 2.1373 -1.66 -1.62 0.13 0.45 2.7967 5.38 32.28
196605 2.445 -1.56 -1.55 0.14 0.5 2.8133 5.42 32.35
196606 -1.443 -1.41 -1.49 0.31 0.51 2.83 5.46 32.38
196607 -1.622 -1.31 -1.39 0.22 0.52 2.85 5.4767 32.45
196608 -7.905 -1.12 -1.22 0.26 0.52 2.87 5.4933 32.65
196609 -1.066 -1.36 -1.33 -0.19 0.6 2.89 5.51 32.75
196610 3.8619 -1.31 -1.33 -0.34 0.69 2.8833 5.5233 32.85
196611 1.3946 -1.28 -1.29 -0.16 0.78 2.8767 5.5367 32.88
196612 0.1325 -1.23 -1.18 -0.12 0.79 2.87 5.55 32.92
196701 8.1534 -1.06 -1.08 -0.14 0.77 2.88 5.5167 32.9
I want the final data set to be:
Month ER SENT SENT+ TS DS D12 E12 Inf
196605 2.445 -1.56 -1.55 0.14 0.5 2.8133 5.42 32.35
196606 -1.44 -1.41 -1.49 0.31 0.51 2.83 5.46 32.38
196607 -1.622 -1.31 -1.39 0.22 0.52 2.85 5.4767 32.45
196608 -7.905 -1.12 -1.22 0.26 0.52 2.87 5.4933 32.65
196609 -1.066 -1.36 -1.33 -0.19 0.6 2.89 5.51 32.75
196610 3.8619 -1.31 -1.33 -0.34 0.69 2.8833 5.5233 32.85
196611 1.3946 -1.28 -1.29 -0.16 0.78 2.8767 5.5367 32.88
Try this,
library(data.table)
setDT(MP); setDT(MPSubSample)
YM_plus <- function(a, b) {
month <- a %% 100
newmonth <- month + b
newyear <- (a %/% 100) + (newmonth - 1) %/% 12
newmonth <- (newmonth - 1) %% 12 + 1
100 * newyear + newmonth
}
MP[, c("fromdate", "todate") := .(YM_plus(Month, -3), YM_plus(Month, +3)) ]
MP[MPSubSample, on = .(fromdate <= Month, todate >= Month)][, .SD, .SDcols = names(MPSubSample)]
# Month ER SENT SENT. TS DS D12 E12 Inf.
# 1: 196605 2.4450 -1.56 -1.55 0.14 0.50 2.8133 5.4200 32.35
# 2: 196606 -1.4430 -1.41 -1.49 0.31 0.51 2.8300 5.4600 32.38
# 3: 196607 -1.6220 -1.31 -1.39 0.22 0.52 2.8500 5.4767 32.45
# 4: 196608 -7.9050 -1.12 -1.22 0.26 0.52 2.8700 5.4933 32.65
# 5: 196609 -1.0660 -1.36 -1.33 -0.19 0.60 2.8900 5.5100 32.75
# 6: 196610 3.8619 -1.31 -1.33 -0.34 0.69 2.8833 5.5233 32.85
# 7: 196611 1.3946 -1.28 -1.29 -0.16 0.78 2.8767 5.5367 32.88
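The helper YM_plus treats Month as a YYYYMM integer and carries months across year boundaries. A few spot checks (the function is repeated so the snippet runs on its own; the input values are just examples):

```r
# Same helper as in the answer: add b months to a YYYYMM integer
YM_plus <- function(a, b) {
  month <- a %% 100
  newmonth <- month + b
  newyear <- (a %/% 100) + (newmonth - 1) %/% 12
  newmonth <- (newmonth - 1) %% 12 + 1
  100 * newyear + newmonth
}

YM_plus(196608, -3)   # 196605, stays within the year
YM_plus(196608, 3)    # 196611
YM_plus(196611, 3)    # 196702, rolls forward into the next year
YM_plus(196602, -3)   # 196511, rolls back into the previous year
```

Because %/% and %% use floor semantics in R, negative month offsets wrap correctly without a special case.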
Data:
MPSubSample <- structure(list(Month = 196608L, ER = -7.905, SENT = -1.12, SENT. = -1.22, TS = 0.26, DS = 0.52, D12 = 2.87, E12 = 5.493, Inf. = 32.65), class = "data.frame", row.names = c(NA, -1L))
MP <- structure(list(Month = c(196604L, 196605L, 196606L, 196607L, 196608L, 196609L, 196610L, 196611L, 196612L, 196701L), ER = c(2.1373, 2.445, -1.443, -1.622, -7.905, -1.066, 3.8619, 1.3946, 0.1325, 8.1534), SENT = c(-1.66, -1.56, -1.41, -1.31, -1.12, -1.36, -1.31, -1.28, -1.23, -1.06), SENT. = c(-1.62, -1.55, -1.49, -1.39, -1.22, -1.33, -1.33, -1.29, -1.18, -1.08), TS = c(0.13, 0.14, 0.31, 0.22, 0.26, -0.19, -0.34, -0.16, -0.12, -0.14), DS = c(0.45, 0.5, 0.51, 0.52, 0.52, 0.6, 0.69, 0.78, 0.79, 0.77), D12 = c(2.7967, 2.8133, 2.83, 2.85, 2.87, 2.89, 2.8833, 2.8767, 2.87, 2.88), E12 = c(5.38, 5.42, 5.46, 5.4767, 5.4933, 5.51, 5.5233, 5.5367, 5.55, 5.5167), Inf. = c(32.28, 32.35, 32.38, 32.45, 32.65, 32.75, 32.85, 32.88, 32.92, 32.9)), class = "data.frame", row.names = c(NA, -10L))

Computing mean of different columns depending on date

My data set is about forest fires and NDVI values (a value ranging from 0 to 1, indicating how green is the surface). It has an initial column which says when the forest fire of row one took place, and subsequent columns indicating the NDVI value on different dates, before and after the fire happened. NDVI values before the fire are substantially higher compared with values after the fire. Something like:
data1989 <- data.frame("date_fire" = c("1987-01-01", "1987-07-03", "1988-01-01"),
"1986-01-01" = c(0.5, 0.589, 0.66),
"1986-06-03" = c(0.56, 0.447, 0.75),
"1986-10-19" = c(0.8, NA, 0.83),
"1987-01-19" = c(0.75, 0.65,0.75),
"1987-06-19" = c(0.1, 0.55,0.811),
"1987-10-19" = c(0.15, 0.12, 0.780),
"1988-01-19" = c(0.2, 0.22,0.32),
"1988-06-19" = c(0.18, 0.21,0.23),
"1988-10-19" = c(0.21, 0.24, 0.250),
stringsAsFactors = FALSE)
> data1989
date_fire X1986.01.01 X1986.06.03 X1986.10.19 X1987.01.19 X1987.06.19 X1987.10.19 X1988.01.19 X1988.06.19 X1988.10.19
1 1987-01-01 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21
2 1987-07-03 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24
3 1988-01-01 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25
I would like to compute the average of NDVI values, in a new column, PRIOR to the forest fire. In case one, it would be the average of columns 2, 3, 4 and 5.
What I need to get is:
date_fire X1986.01.01 X1986.06.03 X1986.10.19 X1987.01.19 X1987.06.19 X1987.10.19 X1988.01.19 X1988.06.19 X1988.10.19 meanPreFire
1 1987-01-01 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21 0.653
2 1987-07-03 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24 0.559
3 1988-01-01 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25 0.764
Thanks!
EDIT: SOLUTION
How to adapt the code with more than one column to exclude:
data1989 <- data.frame("date_fire" = c("1987-02-01", "1987-07-03", "1988-01-01"),
"type" = c("oak", "pine", "oak"),
"meanRainfall" = c(600, 300, 450),
"1986.01.01" = c(0.5, 0.589, 0.66),
"1986.06.03" = c(0.56, 0.447, 0.75),
"1986.10.19" = c(0.8, NA, 0.83),
"1987.01.19" = c(0.75, 0.65,0.75),
"1987.06.19" = c(0.1, 0.55,0.811),
"1987.10.19" = c(0.15, 0.12, 0.780),
"1988.01.19" = c(0.2, 0.22,0.32),
"1988.06.19" = c(0.18, 0.21,0.23),
"1988.10.19" = c(0.21, 0.24, 0.250),
check.names = FALSE,
stringsAsFactors = FALSE)
Using:
j1 <- findInterval(as.Date(data1989$date_fire), as.Date(names(data1989)[-(1:3)],format="%Y.%m.%d"))
m1 <- cbind(rep(seq_len(nrow(data1989)), j1), sequence(j1))
data1989$meanPreFire <- tapply(data1989[-(1:3)][m1], m1[,1], FUN = mean, na.rm = TRUE)
> data1989
date_fire type meanRainfall 1986.01.01 1986.06.03 1986.10.19 1987.01.19 1987.06.19 1987.10.19 1988.01.19 1988.06.19 1988.10.19 meanPreFire
1 1987-02-01 oak 600 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21 0.6525
2 1987-07-03 pine 300 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24 0.5590
3 1988-01-01 oak 450 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25 0.7635
Reshape data to the long form and filter dates prior to the forest fire.
library(tidyverse)
data1989 %>%
pivot_longer(-date_fire, names_to = "date") %>%
mutate(date_fire = as.Date(date_fire),
date = as.Date(date, "X%Y.%m.%d")) %>%
filter(date < date_fire) %>%
group_by(date_fire) %>%
summarise(meanPreFire = mean(value, na.rm = TRUE))
# # A tibble: 3 x 2
# date_fire meanPreFire
# <date> <dbl>
# 1 1987-01-01 0.62
# 2 1987-07-03 0.559
# 3 1988-01-01 0.764
The solution would be much more concise if we would keep the data in long(er) form... but this reproduces the desired output:
library(dplyr)
library(tidyr)
data1989 %>%
pivot_longer(-date_fire, names_to = "date_NDVI", values_to = "value", names_prefix = "^X") %>%
mutate(date_fire = as.Date(date_fire, "%Y-%m-%d"),
date_NDVI = as.Date(date_NDVI, "%Y.%m.%d")) %>%
group_by(date_fire) %>%
mutate(period = ifelse(date_NDVI < date_fire, "before_fire", "after_fire")) %>%
group_by(date_fire, period) %>%
mutate(average_NDVI = mean(value, na.rm = TRUE)) %>%
pivot_wider(names_from = date_NDVI, names_prefix = "X", values_from = value) %>%
pivot_wider(names_from = period, values_from = average_NDVI) %>%
group_by(date_fire) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
Returns:
# A tibble: 3 x 12
date_fire `X1986-01-01` `X1986-06-03` `X1986-10-19` `X1987-01-19` `X1987-06-19` `X1987-10-19` `X1988-01-19` `X1988-06-19` `X1988-10-19` before_fire after_fire
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1987-01-01 0.5 0.56 0.8 0.75 0.1 0.15 0.2 0.18 0.21 0.62 0.265
2 1987-07-03 0.589 0.447 0 0.65 0.55 0.12 0.22 0.21 0.24 0.559 0.198
3 1988-01-01 0.66 0.75 0.83 0.75 0.811 0.78 0.32 0.23 0.25 0.764 0.267
Edit:
If we stop the expression right after calculating the averages, we can use the data in this structure to easily calculate the variance or account for a variable number of observations. I think it's OK to keep date_fire as its own column, but I'd suggest leaving the other dates as a column (because they correspond to observations), especially if we want to do more analysis with the data using ggplot2 and other tidyverse functions.
We can use base R by creating a row/column index. The column index can be obtained with findInterval, using the column names and 'date_fire':
j1 <- findInterval(as.Date(data1989$date_fire), as.Date(names(data1989)[-1]))
l1 <- lapply(j1+1, `:`, ncol(data1989)-1)
m1 <- cbind(rep(seq_len(nrow(data1989)), j1), sequence(j1))
m2 <- cbind(rep(seq_len(nrow(data1989)), lengths(l1)), unlist(l1))
data1989$meanPreFire <- tapply(data1989[-1][m1], m1[,1], FUN = mean, na.rm = TRUE)
data1989$meanPostFire <- tapply(data1989[-1][m2], m2[,1], FUN = mean, na.rm = TRUE)
data1989
# date_fire 1986-01-01 1986-06-03 1986-10-19 1987-01-19 1987-06-19 1987-10-19 1988-01-19 1988-06-19 1988-10-19
#1 1987-01-01 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18 0.21
#2 1987-07-03 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21 0.24
#3 1988-01-01 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23 0.25
# meanPreFire meanPostFire
#1 0.6200 0.2650000
#2 0.5590 0.1975000
#3 0.7635 0.2666667
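To see what the index construction does, it can help to look at findInterval and sequence on their own. A minimal sketch, using the observation dates and fire dates from the example data (the variable names cols and fires are made up for illustration):

```r
# Observation dates (the column names) and the fire dates
cols <- as.Date(c("1986-01-01", "1986-06-03", "1986-10-19",
                  "1987-01-19", "1987-06-19", "1987-10-19",
                  "1988-01-19", "1988-06-19", "1988-10-19"))
fires <- as.Date(c("1987-01-01", "1987-07-03", "1988-01-01"))

# For each fire: how many observation dates fall on or before it
j1 <- findInterval(fires, cols)
j1
# [1] 3 5 6

# Row/column pairs of every pre-fire cell: row r is paired with columns 1..j1[r]
m1 <- cbind(rep(seq_along(j1), j1), sequence(j1))
head(m1, 4)
```

Here m1 has one row per pre-fire cell (3 + 5 + 6 = 14 in total), which is why it can be used directly as a matrix index and as the grouping for tapply.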
Or using melt/dcast from data.table
library(data.table)
dcast(melt(setDT(data1989), id.var = 'date_fire')[,
.(value = mean(value, na.rm = TRUE)),
.(date_fire, grp = c('postFire', 'preFire')[1 + (as.IDate(variable) < as.IDate(date_fire))]) ], date_fire ~ grp)[data1989, on = .(date_fire)]
# date_fire postFire preFire 1986-01-01 1986-06-03 1986-10-19 1987-01-19 1987-06-19 1987-10-19 1988-01-19 1988-06-19
#1: 1987-01-01 0.2650000 0.6200 0.500 0.560 0.80 0.75 0.100 0.15 0.20 0.18
#2: 1987-07-03 0.1975000 0.5590 0.589 0.447 NA 0.65 0.550 0.12 0.22 0.21
#3: 1988-01-01 0.2666667 0.7635 0.660 0.750 0.83 0.75 0.811 0.78 0.32 0.23
# 1988-10-19
#1: 0.21
#2: 0.24
#3: 0.25
data
data1989 <- data.frame("date_fire" = c("1987-01-01", "1987-07-03", "1988-01-01"),
"1986-01-01" = c(0.5, 0.589, 0.66),
"1986-06-03" = c(0.56, 0.447, 0.75),
"1986-10-19" = c(0.8, NA, 0.83),
"1987-01-19" = c(0.75, 0.65,0.75),
"1987-06-19" = c(0.1, 0.55,0.811),
"1987-10-19" = c(0.15, 0.12, 0.780),
"1988-01-19" = c(0.2, 0.22,0.32),
"1988-06-19" = c(0.18, 0.21,0.23),
"1988-10-19" = c(0.21, 0.24, 0.250), check.names = FALSE,
stringsAsFactors = FALSE)

Merging data frames based on column and row names, conditional column creation

I have a data frame with monthly returns and their corresponding month.
Data <- read.csv("C:/Users/h/Desktop/overflow.csv", sep=";", dec=",")
Data$Date <- as.Date(as.character(Data$Date), format="%Y-%m-%d")
The data frame looks like this now:
> Data
Fund.A Fund.B Fund.C Fund.D
2012-01-01 -0.01 0.04 0.11 0.10
2012-02-01 -0.04 -0.06 0.08 0.11
2012-03-01 -0.04 -0.07 0.15 -0.03
2012-04-01 0.00 -0.08 -0.04 0.13
2012-05-01 -0.07 0.10 0.06 0.02
2012-06-01 -0.05 0.06 0.06 -0.02
2012-07-01 0.12 -0.06 -0.09 -0.06
2012-08-01 0.08 -0.03 0.05 0.13
2012-09-01 0.10 0.07 -0.02 0.15
2012-10-01 -0.08 0.14 0.00 -0.04
2012-11-01 -0.09 0.11 -0.07 0.12
2012-12-01 -0.01 -0.09 0.07 -0.02
Now I want to continue the time series with new returns from a new csv, by simply matching the new return with the appropriate Fund in "Data". My problem is that new assets might have been added, messing up the order.
import <- read.csv("C:/Users/h/Desktop/import.csv", sep=";", dec=",")
import
2013-01-01
1 Funds: NA
2 Fund A 0.04
3 Fund AA -0.09
4 Fund C -0.10
5 Fund D 0.03
6 Fund B 0.14
As you can see, the "import" csv has new assets (Fund AA) as well as assets already seen in "Data" (Fund A to D), and the funds are in rows rather than columns. How can I write code that matches each value in "import" to the right column (Fund) in "Data" and adds the values as a new row? And, if a new asset has been added, creates a column for the new asset?
As a bonus, the code would add a row only if the date in "import" is more recent than the most recent date in "Data", so that only new returns are imported.
Appreciate it!
For time series purposes, I would recommend using xts. It makes life a bit easier. Borrowing from Arun's usable data:
olddata <- structure(list(Date = structure(c(15340, 15371, 15400, 15431,
15461, 15492, 15522, 15553, 15584, 15614, 15645, 15675), class = "Date"),
Fund.A = c(-0.01, -0.04, -0.04, 0, -0.07, -0.05, 0.12, 0.08, 0.1, -0.08,
-0.09, -0.01), Fund.B = c(0.04, -0.06, -0.07, -0.08, 0.1, 0.06, -0.06,
-0.03, 0.07, 0.14, 0.11, -0.09), Fund.C = c(0.11, 0.08, 0.15, -0.04,
0.06, 0.06, -0.09, 0.05, -0.02, 0, -0.07, 0.07), Fund.D = c(0.1, 0.11,
-0.03, 0.13, 0.02, -0.02, -0.06, 0.13, 0.15, -0.04, 0.12, -0.02)),
.Names = c("Date", "Fund.A", "Fund.B", "Fund.C", "Fund.D"),
row.names = c(NA, 12L), class = "data.frame")
newimport <- structure(list(funds = c("Fund.A", "Fund.AA", "Fund.C",
"Fund.D", "Fund.B"), `2013-01-01` = c(0.04, -0.09, -0.1, 0.03, 0.14)),
.Names = c("funds", "2013-01-01"), row.names = c(NA, -5L),
class = "data.frame")
Convert data to xts for easy datewise subsetting:
library(xts)
olddata <- xts(olddata[,-1], olddata$Date)
newdata <- xts(t(newimport[,-1]), as.Date(colnames(newimport)[-1]))
colnames(newdata) <- newimport[,1]
Merge data together while taking care of any new columns:
cols <- names(newdata) %in% names(olddata)
combineData <- merge(rbind(olddata, newdata[,cols]), newdata[,!cols])
combineData
Fund.A Fund.B Fund.C Fund.D Fund.AA
2012-01-01 -0.01 0.04 0.11 0.10 NA
2012-02-01 -0.04 -0.06 0.08 0.11 NA
2012-03-01 -0.04 -0.07 0.15 -0.03 NA
2012-04-01 0.00 -0.08 -0.04 0.13 NA
2012-05-01 -0.07 0.10 0.06 0.02 NA
2012-06-01 -0.05 0.06 0.06 -0.02 NA
2012-07-01 0.12 -0.06 -0.09 -0.06 NA
2012-08-01 0.08 -0.03 0.05 0.13 NA
2012-09-01 0.10 0.07 -0.02 0.15 NA
2012-10-01 -0.08 0.14 0.00 -0.04 NA
2012-11-01 -0.09 0.11 -0.07 0.12 NA
2012-12-01 -0.01 -0.09 0.07 -0.02 NA
2013-01-01 0.04 0.14 -0.10 0.03 -0.09
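If xts is not available, a similar result can be sketched in base R. This is an alternative rather than the method above, and it assumes the import has already been reshaped to one row per date with one column per fund; the object names old, new_row and combined, and the shortened data, are made up for illustration:

```r
# Shortened, made-up versions of the existing data and the new import
old <- data.frame(Date = as.Date(c("2012-11-01", "2012-12-01")),
                  Fund.A = c(-0.09, -0.01),
                  Fund.B = c(0.11, -0.09))
new_row <- data.frame(Date = as.Date("2013-01-01"),
                      Fund.A = 0.04, Fund.AA = -0.09, Fund.B = 0.14)

combined <- old
# Only append when the import is more recent than the latest existing date
if (max(new_row$Date) > max(old$Date)) {
  all_cols <- union(names(old), names(new_row))
  # Add any missing fund columns on either side, filled with NA
  for (nm in setdiff(all_cols, names(old))) old[[nm]] <- NA_real_
  for (nm in setdiff(all_cols, names(new_row))) new_row[[nm]] <- NA_real_
  # Reorder the new row's columns to match, then stack
  combined <- rbind(old, new_row[all_cols])
}
combined
```

New funds end up as extra columns that are NA for all earlier dates, which mirrors the Fund.AA column in the xts output.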
