Add new blank rows into dataset by group (in R)

I use R. I have a data frame like this:
dat <- data.frame(
group = c(1,1,1,1,1,1,2,2,2,2,2),
horizon = c(1,3,5,6,7,10,1,3,5,9,10),
value = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8),
other = c("a","a","a","a","a","a","b","b","b","b","b")
)
And I would like to add a row for every horizon that is missing (2, 4, 8 and 9 for the first group; 2, 4, 6, 7 and 8 for the second group). Values (value) for the missing horizons would be blank.
I would like to get something like this:
datx <- data.frame(
group = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
horizon = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
value = c(1.0,NA,0.9,NA,0.8,0.6,0.3,NA,NA,0.0,0.5,NA,0.6,NA,0.8,NA,NA,NA,0.9,0.8),
other = c("a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b")
)
i.e. an enlarged dataset with the new horizons, NA in the value variable for them, and the other variable retained.
This is just an example. I am actually working with a much larger dataset.
Without the groups, the problem would be much easier to solve; I would use something like this:
newdat <- merge(data.frame(horizon=seq(1,10,1)),dat,all=TRUE)
newdat <- newdat[order(newdat$horizon),]
Thanks for help!

I'll assume that the values in the variable other are the characters "a" or "b", and that this is completely redundant with your variable group. If this is the case, you could accomplish this with full_join in the dplyr package.
a="a"
b="b"
dat <- data.frame(
group = c(1,1,1,1,1,1,2,2,2,2,2),
horizon = c(1,3,5,6,7,10,1,3,5,9,10),
value = c(1.0,0.9,0.8,0.6,0.3,0.0,0.5,0.6,0.8,0.9,0.8),
other = c(a,a,a,a,a,a,b,b,b,b,b)
)
# all group/horizon combinations, with the matching value of other
groups <- expand.grid(group = c(1, 2), horizon = 1:10)
groups <- groups %>% dplyr::mutate(other = ifelse(group == 1, "a", "b"))
dat %>%
  dplyr::full_join(groups, by = c('group', 'horizon', 'other')) %>%
  dplyr::arrange(group, horizon)
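If the real data has many groups, building groups by hand does not scale. A more general alternative (a sketch, assuming the tidyr package is available and that other is constant within each group) is tidyr::complete with nesting:
library(tidyr)
dat %>%
  complete(nesting(group, other), horizon = 1:10) %>%
  arrange(group, horizon)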

Using data.table: join dat onto the cross join of every group with the full horizon range, then fill the retained columns within each group.
library(data.table)
setDT(dat)
fill = c("other")
RES =
  # right join onto all group/horizon combinations; unmatched rows get NA value
  dat[CJ(group = group, horizon = min(horizon):max(horizon), unique = TRUE),
      on = .(group, horizon)
  # carry each 'fill' column's first non-NA value across its group (\(x) needs R >= 4.1)
  ][, (fill) := lapply(.SD, \(x) x[which.min(is.na(x))]), by = group, .SDcols = fill]
RES[]
# group horizon value other
# <num> <int> <num> <char>
# 1: 1 1 1.0 a
# 2: 1 2 NA a
# 3: 1 3 0.9 a
# 4: 1 4 NA a
# 5: 1 5 0.8 a
# 6: 1 6 0.6 a
# 7: 1 7 0.3 a
# 8: 1 8 NA a
# 9: 1 9 NA a
# 10: 1 10 0.0 a
# 11: 2 1 0.5 b
# 12: 2 2 NA b
# 13: 2 3 0.6 b
# 14: 2 4 NA b
# 15: 2 5 0.8 b
# 16: 2 6 NA b
# 17: 2 7 NA b
# 18: 2 8 NA b
# 19: 2 9 0.9 b
# 20: 2 10 0.8 b
# group horizon value other
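For reference, CJ() (cross join) returns a sorted data.table of every combination of its arguments; this is what supplies the missing group/horizon rows above:
CJ(group = c(1, 2), horizon = 1:3)
#    group horizon
# 1:     1       1
# 2:     1       2
# 3:     1       3
# 4:     2       1
# 5:     2       2
# 6:     2       3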


How to use functions to do a recursive calculation in data.table/R?

I am new to programming and got stuck. I want to calculate the hourly temperature variation of an object throughout the year using some variables that change every hour. The original data contains 60 columns and 8760 rows.
I got the desired output using a for loop, but the model takes a long time to run. I wonder if there is any way to replace the loop with functions, which I suspect could also speed up the calculation.
Here is a small reproducible example to show what I did.
library(data.table)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The for loop:
for (j in 2:nrow(table)) {
  table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
  table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
but it took 15 min to run the whole program in my case (not this example!).
So I tried to use a function instead of the for loop.
I tried this:
library(dplyr)  # for mutate() and lag()
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
myfun <- function(df){
df = df %>% mutate(A = (lag(A) + lag(B)) * B,
C = B * A)
return(df)
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
It seems the function refers to the rows of the original table, not the rows updated during the calculation. Is there any way to obtain the desired output using functions? This is my first R project; any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A so we can do less within the loop:
# update A in place row by row; set() avoids the overhead of [.data.table on each iteration
for (i in 2:nrow(table)) {
  set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]  # vectorised; row 1 is skipped so the initial C is kept
table
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
You can try Reduce like below (embed() and asplit(), illustrated after the data section, feed it the current/lagged pairs of B):
dt[
,
A := Reduce(function(x, Y) (x + Y[2]) * Y[1],
asplit(embed(B, 2), 1),
init = A[1],
accumulate = TRUE
)
][
,
C := A * B
]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
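For clarity, embed(B, 2) pairs each element of B with its predecessor (first column the current value, second the lagged one), and asplit(..., 1) turns those rows into a list for Reduce to step through:
embed(1:5, 2)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4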
Here's a solution using purrr::accumulate2 which lets you use the result of the previous computation as the input to the next one:
library(data.table)
library(purrr)
library(magrittr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
  table$A,            # .x; its values are unused by the formula
  seq(table$A),       # .y; supplies the row index ..3
  ~ (..1 + table$B[..3]) * table$B[..3 + 1],  # ..1 is the running result
  .init = table$A[1]
) %>%
  unlist() %>%
  extract(1:nrow(table))  # drop the extra trailing element
table$C <- table$B * table$A
table
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200

How to update both data.tables in a join

Suppose I would like to track which rows from one data.table were merged to another data.table. Is there a way to do this at once/while merging? Please see my example below and the way I usually do it. However, this seems rather inefficient.
Example
library(data.table)
# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2),
y = c(1,3,6))
# data to merge
DTx <- data.table(x = 1:3,
y = 1,
k = "X")
# regular update join
copy(DT)[DTx,
on = .(x, y),
k := i.k][]
#> x y k
#> 1: 1 1 X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1 X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1 X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>
# DTx remains the same
DTx
#> x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X
What I usually do:
# set an Id variable
DTx[, Id := .I]
# assign the Id in merge
DT[DTx,
on = .(x, y),
`:=`(k = i.k,
matched_id = i.Id)][]
#> x y k matched_id
#> 1: 1 1 X 1
#> 2: 1 3 <NA> NA
#> 3: 1 6 <NA> NA
#> 4: 2 1 X 2
#> 5: 2 3 <NA> NA
#> 6: 1 6 <NA> NA
#> 7: 1 1 X 1
#> 8: 2 3 <NA> NA
#> 9: 2 6 <NA> NA
# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#> x y k Id matched
#> 1: 1 1 X 1 TRUE
#> 2: 2 1 X 2 TRUE
#> 3: 3 1 X 3 FALSE
Following Jan's comment:
This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables.
You can pull the indices:
merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]
x y irow xrow
1: 1 1 1 1
2: 1 1 1 7
3: 2 1 2 4
4: 3 1 3 0
Then apply edits to each table using indices rather than merging or matching a second time:
rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]
How it works:
When joining as x[i], the symbol .I indexes the rows of x
When grouping in a join with by=.EACHI, .GRP indexes each group, which means each row of i here
We drop the non-matching values of .I, which are coded as zeros
On this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]. I'm not sure why these differ.
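For reference, that form gives the following on the data above (both matches of the duplicated row (1, 1) are returned because mult="all" is the default):
DT[DTx, on = .(x, y), which = TRUE]
# [1]  1  7  4 NA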
Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? [...] seems rather inefficient.
I expect this is more efficient than multiple merges or %in% when the merge is costly enough.
It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.
Update logic is already complex in base R, with multiple edits on a single index allowed:
> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3
And there is the question of how to match and edit multiple indices at once:
> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5
In data.table updates, the latter issue is handled with mult=. In the update-two-tables use case, these questions would get much more complicated.

Column order of `.SD` in j argument differs when `get()` is used

I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
# library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD columns are ordered the same way as the names in d.vars, so the differenced values are assigned to the correct columns.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
Which is the expected output because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to align 'd.vars' with the column names of 'dt' (giving 'd.vars1') and then use that to get the order right:
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset:
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
(L - shift(L, n = 1, type='lag')) * get(negate) ),
keyby = col1, .SDcols = d.vars1]
dt2
# col1          b  a neg c         d.b d.a
# 1:    A -0.3539066  1  -1 A          NA  NA
# 2:    A  0.2702374  4  -1 D -0.62414408  -3
# 3:    A -0.7834941  7  -1 G  1.05373150  -3
# 4:    A -1.2765652 10  -1 J  0.49307118  -3
# 5:    B -0.2936422  2  -1 B          NA  NA
# 6:    B -0.2451996  5  -1 E -0.04844252  -3
# 7:    B -1.6577614  8  -1 H  1.41256181  -3
# 8:    B  1.0668059 11  -1 K -2.72456737  -3
# 9:    C -0.1160938  3  -1 C          NA  NA
#10:    C -0.7940771  6  -1 F  0.67798333  -3
#11:    C  0.2951743  9  -1 I -1.08925140  -3
#12:    C -0.4508854 12  -1 L  0.74605969  -3
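To inspect what order .SD actually receives in a given call, a quick diagnostic (a sketch; whether get() changes the order depends on your data.table version) is to print names(.SD) per group:
dt2[, {print(names(.SD)); NULL}, keyby = col1, .SDcols = d.vars]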

How to reshape a data frame in R, conditioned on a maximum value?

I'm having some difficulty re-shaping my data frame in R. I have 5 individuals: A, B, C, D, and E. Some individuals have 1 observation and some have 2. I have measured 3 values for each observation: X, Y, and Z. I would like to transform my data frame from long to wide format, generating one row per individual and two sets of columns labeled X, Y, and Z. But, I want to condition on the value of X such that the set of observations with the maximum value of X appears first. Thus, for a given observation, the values of X, Y, and Z must remain grouped together, but whether the values from observation 1 or 2 appear first depends on which has the maximum value of X.
df = data.frame(
indiv = c("A","A","B","C","C","D","D","E"),
observ = c(1,2,1,1,2,1,2,1),
X = c(rnorm(8, mean = 10, sd = 6)),
Y = c(rnorm(8, mean = 0, sd = 2)),
Z = c(rnorm(8, mean = 4, sd = 4))
)
indiv observ X Y Z
1 A 1 9.959043 1.785043 10.134511
2 A 2 14.122006 -2.257666 5.799366
3 B 1 11.562801 -1.394951 4.988923
4 C 1 12.955644 -4.330272 8.870165
5 C 2 13.582154 -1.727224 -7.5617
6 D 1 4.053437 1.815233 1.789157
7 D 2 12.990071 -1.989307 3.67696
8 E 1 2.820895 -3.754263 3.001725
Below is what I would like my wide data frame to look like. For individual A, X was greater in observation 2, so that set of values (X,Y,Z) appears first. By contrast, for individuals C and D, X was greater in observation 1, so that set appears first. I think it should be some variation on the reshape function, but I'm not sure how to condition on the maximum value of X. Thanks in advance!
indiv observ X Y Z observ X Y Z
1 A 2 18.797087 0.3247862 4.774446 1 8.547868 0.3203667 6.729975
2 B 1 1.646638 0.7986036 6.938825 NA NA NA NA
3 C 1 17.354905 -2.399272 8.357045 2 6.856093 0.6493722 2.420827
4 D 1 16.058101 -1.2370024 4.045489 2 7.641576 3.0820116 4.232615
5 E 1 13.625998 -0.1953445 -5.627932 NA NA NA NA
I would just order before casting. The following uses data.table, as the dcast function is in that package as well; it could be done with a normal data.frame and reshape too.
library(data.table)
set.seed(1)
df = data.frame(
indiv = c("A","A","B","C","C","D","D","E"),
observ = c(1,2,1,1,2,1,2,1),
X = c(rnorm(8, mean = 10, sd = 6)),
Y = c(rnorm(8, mean = 0, sd = 2)),
Z = c(rnorm(8, mean = 4, sd = 4))
)
df
  indiv observ         X           Y         Z
1     A      1  6.241277  1.15156270  3.935239
2     A      2 11.101860 -0.61077677  7.775345
3     B      1  4.986228  3.02356234  7.284885
4     C      1 19.571685  0.77968647  6.375605
5     C      2 11.977047 -1.24248116  7.675909
6     D      1  5.077190 -4.42939977  7.128545
7     D      2 12.924574  2.24986184  4.298260
8     E      1 14.429948 -0.08986722 -3.957407
setDT(df)
df <- df[order(indiv,-X)] #orders your frame
df[, observ := as.numeric(1:.N), by = indiv] #reset observ based on new order
df
indiv observ X Y Z
1: A 1 11.101860 -0.61077677 7.775345
2: A 2 6.241277 1.15156270 3.935239
3: B 1 4.986228 3.02356234 7.284885
4: C 1 19.571685 0.77968647 6.375605
5: C 2 11.977047 -1.24248116 7.675909
6: D 1 12.924574 2.24986184 4.298260
7: D 2 5.077190 -4.42939977 7.128545
8: E 1 14.429948 -0.08986722 -3.957407
Now cast normally:
dcast(df, indiv ~ observ, value.var = c("X","Y","Z"))
indiv X_1 X_2 Y_1 Y_2 Z_1 Z_2
1: A 11.101860 6.241277 -0.61077677 1.151563 7.775345 3.935239
2: B 4.986228 NA 3.02356234 NA 7.284885 NA
3: C 19.571685 11.977047 0.77968647 -1.242481 6.375605 7.675909
4: D 12.924574 5.077190 2.24986184 -4.429400 4.298260 7.128545
5: E 14.429948 NA -0.08986722 NA -3.957407 NA
To get the column order you want, I think you need to melt and then cast:
dcast(melt(df, id.vars = c("indiv","observ")), indiv ~ observ + variable)
indiv 1_X 1_Y 1_Z 2_X 2_Y 2_Z
1: A 11.101860 -0.61077677 7.775345 6.241277 1.151563 3.935239
2: B 4.986228 3.02356234 7.284885 NA NA NA
3: C 19.571685 0.77968647 6.375605 11.977047 -1.242481 7.675909
4: D 12.924574 2.24986184 4.298260 5.077190 -4.429400 7.128545
5: E 14.429948 -0.08986722 -3.957407 NA NA NA
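Alternatively, to keep the X_1/Y_1/Z_1 names from the first dcast but group the columns by observation, you could reorder them with setcolorder (a sketch, with the wide result stored in wide):
wide <- dcast(df, indiv ~ observ, value.var = c("X","Y","Z"))
setcolorder(wide, c("indiv", paste(rep(c("X","Y","Z"), times = 2),
                                   rep(1:2, each = 3), sep = "_")))
wide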

Calculate a mean, by a condition, within a factor

I'm looking to calculate the simple mean of an outcome variable, but only for the outcome associated with the maximal instance of another running variable, grouped by factors.
Of course, the calculated statistic could be substituted with any other function, and so could the within-group condition.
library(data.table) #1.9.5
dt <- data.table(name = rep(LETTERS[1:7], each = 3),
target = rep(c(0,1,2), 7),
filter = 1:21)
dt
## name target filter
## 1: A 0 1
## 2: A 1 2
## 3: A 2 3
## 4: B 0 4
## 5: B 1 5
## 6: B 2 6
## 7: C 0 7
With this frame, the desired output should return, for each name, the mean of target over the rows where filter is at its group maximum; in this example that mean is exactly 2.
Something like:
dt[ , .(mFilter = which.max(filter),
target = target), by = name][ ,
mean(target), by = c("name", "mFilter")]
... seems close, but isn't hitting it quite right.
The solution should return:
## name V1
## 1: A 2
## 2: B 2
## 3: ...
You could do this with:
dt[, .(meantarget = mean(target[filter == max(filter)])), by = name]
# name meantarget
# 1: A 2
# 2: B 2
# 3: C 2
# 4: D 2
# 5: E 2
# 6: F 2
# 7: G 2
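An equivalent idiom keeps the max-filter row of each group first and then summarises; note that which.max() returns only the first maximum, whereas target[filter == max(filter)] above keeps all tied rows:
dt[, .SD[which.max(filter)], by = name][, .(meantarget = mean(target)), by = name]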
