I have recently come across an interesting problem: calculating each value of a vector from its second-prior value (rather than a single .init argument) plus another vector's current value. Here is the sample data set:
library(dplyr)
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- if_else(dt$time == 1, 120, if_else(dt$time == 2, 125, as.numeric(NA)))
id time ret ind
1 a 1 0.005543269 120
2 a 2 -0.002802719 125
3 a 3 0.017751634 NA
4 a 4 0.001873201 NA
5 a 5 0.011425261 NA
6 b 1 0.004155261 120
7 b 2 0.012295066 125
8 b 3 0.002366797 NA
9 b 4 -0.003653828 NA
10 b 5 0.011051443 NA
What I would like to calculate is:
ind_{t} = ind_{t-2}*(1+ret_{t})
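For example, using the printed values for group a above, the first two missing entries would be:
120 * (1 + 0.017751634)   # ind_3 = ind_1 * (1 + ret_3) -> 122.1302
125 * (1 + 0.001873201)   # ind_4 = ind_2 * (1 + ret_4) -> 125.2342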
I tried the following code. Since .init is of no use here, I tried to nullify the original .init and create a virtual one, but unfortunately it won't drag the newly created values (from the third row downward) into the calculation:
dt %>%
group_by(id) %>%
mutate(ind = c(120, accumulate(3:n(), .init = 125,
~ .x * 1/.x * ind[.y - 2] * (1 + ret[.y]))))
# A tibble: 10 x 4
# Groups: id [2]
id time ret ind
<chr> <int> <dbl> <dbl>
1 a 1 0.00554 120
2 a 2 -0.00280 125
3 a 3 0.0178 122.
4 a 4 0.00187 125.
5 a 5 0.0114 NA
6 b 1 0.00416 120
7 b 2 0.0123 125
8 b 3 0.00237 120.
9 b 4 -0.00365 125.
10 b 5 0.0111 NA
I was wondering if there is a tweak I could make to this code to make it work completely. I would greatly appreciate your help.
Use a state vector consisting of the current value of ind and the prior value of ind; that way, the prior state carries the second-prior value of ind. We encode this state as a complex number whose real part equals ind and whose imaginary part equals the prior value of ind. At the end we take the real part.
library(dplyr)
library(purrr)
dt %>%
group_by(id) %>%
mutate(result = c(ind[1],
Re(accumulate(.x = tail(ret, -2),
.f = ~ Im(.x) * (1 + .y) + Re(.x) * 1i,
.init = ind[2] + ind[1] * 1i)))) %>%
ungroup
giving:
# A tibble: 10 x 5
id time ret ind result
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0.00554 120 120
2 a 2 -0.00280 125 125
3 a 3 0.0178 NA 122.
4 a 4 0.00187 NA 125.
5 a 5 0.0114 NA 124.
6 b 1 0.00416 120 120
7 b 2 0.0123 125 125
8 b 3 0.00237 NA 120.
9 b 4 -0.00365 NA 125.
10 b 5 0.0111 NA 122.
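To see the state shuffle in isolation, the transition can be run on its own; a minimal sketch with made-up returns (the real part holds the current ind, the imaginary part the prior one):
library(purrr)
step <- function(state, r) Im(state) * (1 + r) + Re(state) * 1i
accumulate(c(0.01, 0.02), step, .init = 125 + 120i)
#> [1] 125.0+120.0i 121.2+125.0i 127.5+121.2i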
Variation
This variation eliminates the complex numbers, using a two-element vector in place of each complex number: the first element corresponds to the real part in the prior solution and the second to the imaginary part. This could be extended to cases that need more than two numbers of state, or where the dependence involves all of the last N values. For the question here, though, it has the downside of an extra line of code to extract the result from the list of pairs, which is more involved than using Re in the prior solution.
dt %>%
group_by(id) %>%
mutate(result = c(ind[1],
accumulate(.x = tail(ret, -2),
.f = ~ c(.x[2] * (1 + .y), .x[1]),
.init = ind[2:1])),
result = map_dbl(result, first)) %>%
ungroup
Check
We check that the results above are correct. Alternatively, this could be used as a straightforward solution in its own right.
calc <- function(ind, ret) {
  # plain loop form of the recurrence: ind[t] <- ind[t-2] * (1 + ret[t])
  for(i in seq(3, length(ret))) ind[i] <- ind[i-2] * (1 + ret[i])
  ind
}
dt %>%
group_by(id) %>%
mutate(result = calc(ind, ret)) %>%
ungroup
giving:
# A tibble: 10 x 5
id time ret ind result
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0.00554 120 120
2 a 2 -0.00280 125 125
3 a 3 0.0178 NA 122.
4 a 4 0.00187 NA 125.
5 a 5 0.0114 NA 124.
6 b 1 0.00416 120 120
7 b 2 0.0123 125 125
8 b 3 0.00237 NA 120.
9 b 4 -0.00365 NA 125.
10 b 5 0.0111 NA 122.
I would have done it by creating dummy groups for each sequence, so that it works for any lag N. Demonstrating it on new, more elaborate data:
df <- data.frame(
stringsAsFactors = FALSE,
grp = c("a","a","a","a",
"a","a","a","a","a","b","b","b","b","b",
"b","b","b","b"),
rate = c(0.082322056,
0.098491104,0.07294593,0.08741672,0.030179747,
0.061389031,0.011232314,0.08553277,0.091272669,
0.031577847,0.024039791,0.091719552,0.032540636,
0.020411727,0.094521716,0.081729178,0.066429708,
0.04985793),
ind = c(11000L,12000L,
13000L,NA,NA,NA,NA,NA,NA,10000L,13000L,12000L,
NA,NA,NA,NA,NA,NA)
)
df
#> grp rate ind
#> 1 a 0.08232206 11000
#> 2 a 0.09849110 12000
#> 3 a 0.07294593 13000
#> 4 a 0.08741672 NA
#> 5 a 0.03017975 NA
#> 6 a 0.06138903 NA
#> 7 a 0.01123231 NA
#> 8 a 0.08553277 NA
#> 9 a 0.09127267 NA
#> 10 b 0.03157785 10000
#> 11 b 0.02403979 13000
#> 12 b 0.09171955 12000
#> 13 b 0.03254064 NA
#> 14 b 0.02041173 NA
#> 15 b 0.09452172 NA
#> 16 b 0.08172918 NA
#> 17 b 0.06642971 NA
#> 18 b 0.04985793 NA
library(tidyverse)
N = 3
df %>% group_by(grp) %>%
group_by(d = row_number() %% N, .add = TRUE) %>%
mutate(ind = accumulate(rate[-1] + 1, .init = ind[1], ~ .x * .y))
#> # A tibble: 18 x 4
#> # Groups: grp, d [6]
#> grp rate ind d
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 0.0823 11000 1
#> 2 a 0.0985 12000 2
#> 3 a 0.0729 13000 0
#> 4 a 0.0874 11962. 1
#> 5 a 0.0302 12362. 2
#> 6 a 0.0614 13798. 0
#> 7 a 0.0112 12096. 1
#> 8 a 0.0855 13420. 2
#> 9 a 0.0913 15057. 0
#> 10 b 0.0316 10000 1
#> 11 b 0.0240 13000 2
#> 12 b 0.0917 12000 0
#> 13 b 0.0325 10325. 1
#> 14 b 0.0204 13265. 2
#> 15 b 0.0945 13134. 0
#> 16 b 0.0817 11169. 1
#> 17 b 0.0664 14147. 2
#> 18 b 0.0499 13789. 0
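The modulo trick is easiest to see on the dummy group index itself: rows 1, 4, 7, ... land in one group, rows 2, 5, 8, ... in another, so consecutive rows within a dummy group are exactly N positions apart and the recurrence collapses to a plain cumulative product:
seq_len(9) %% 3
#> [1] 1 2 0 1 2 0 1 2 0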
An alternate answer in dplyr (using your own data, only slightly modified):
set.seed(13)
dt <- data.frame(id = rep(letters[1:2], each = 5), time = rep(1:5, 2), ret = rnorm(10)/100)
dt$ind <- ifelse(dt$time == 1, 12000, ifelse(dt$time == 2, 12500, as.numeric(NA)))
library(dplyr, warn.conflicts = F)
dt %>% group_by(id) %>%
group_by(d= row_number() %% 2, .add = TRUE) %>%
mutate(ind = cumprod(1 + duplicated(id) * ret) * ind[1])
#> # A tibble: 10 x 5
#> # Groups: id, d [4]
#> id time ret ind d
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 a 1 0.00554 12000 1
#> 2 a 2 -0.00280 12500 0
#> 3 a 3 0.0178 12213. 1
#> 4 a 4 0.00187 12523. 0
#> 5 a 5 0.0114 12353. 1
#> 6 b 1 0.00416 12000 0
#> 7 b 2 0.0123 12500 1
#> 8 b 3 0.00237 12028. 0
#> 9 b 4 -0.00365 12454. 1
#> 10 b 5 0.0111 12161. 0
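The duplicated(id) factor works because the first row of each (id, d) subgroup is not a duplicate, so its growth factor is 1 + 0 * ret = 1 and the cumulative product starts from ind[1] unchanged:
duplicated(c("a", "a", "a"))
#> [1] FALSE  TRUE  TRUE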
Related
I have a list of all CpG locations (base-pair values) for a gene on a methylation array in one table (table_a), and another table with the locations (base-pair values) of 12 CpGs for the same gene that are not present on the array (table_b). I am trying to work out, for each probe in table_b, which probe in table_a is the closest in bp.
i.e. table_a
# A tibble: 88 x 2
UCSC_RefGene_Name pos
<chr> <int>
1 RXRA 137218280
2 RXRA 137243592
3 RXRA 137330570
4 RXRA 137225311
5 RXRA 137299436
6 RXRA 137277819
7 RXRA 137268074
8 RXRA 137255666
9 RXRA;RXRA 137284989
10 RXRA 137218286
# ... with 78 more rows
table_b
CpG.position Human.genome.19.coordinates
1 1 137215735
2 2 137215739
3 3 137215748
4 4 137215772
5 5 137215779
6 6 137215867
7 7 137215956
8 8 137216015
9 9 137216030
10 10 137216034
11 11 137216036
12 12 137216064
My first step was to sequentially subtract each value in A from the first row in B:
bibs <- function(table, value, column){
position <- sym(column)
smaps <-
table %>%
summarise(
"cpg_pos" = table$CpG.position,
"new_loc" = value - {{position}})
print(smaps)
}
posns <- table_a$positions
abso <- list()
for(i in seq_along(posns)){
a <- bibs(table_b, posns[[i]], "Human.genome.19.coordinates")
abso[[i]] <- a
}
This produces a list (abso) with 88 entries (1st entry below), so seemingly it has only happened for the first value in table_b.
cpg_pos new_loc
1 1 2545
2 2 2541
3 3 2532
4 4 2508
5 5 2501
6 6 2413
7 7 2324
8 8 2265
9 9 2250
10 10 2246
11 11 2244
12 12 2216
I wonder if anyone can help with getting it to move sequentially through each value in B?
Thanks,
Matt
Joining is equivalent to filtering the cross-product. We can sort all combinations of rows from both tables to pick the one with the closest distance:
library(tidyverse)
# example data
genes <- tibble(gene = c("A", "A", "B"), gene_pos = c(1, 30, 50))
genes
#> # A tibble: 3 × 2
#> gene gene_pos
#> <chr> <dbl>
#> 1 A 1
#> 2 A 30
#> 3 B 50
cpgs <- tibble(cpg = seq(3), cpg_pos = c(48, 51, 31))
cpgs
#> # A tibble: 3 × 2
#> cpg cpg_pos
#> <int> <dbl>
#> 1 1 48
#> 2 2 51
#> 3 3 31
cpgs %>%
expand_grid(genes) %>%
mutate(dist = abs(gene_pos - cpg_pos)) %>%
group_by(cpg) %>%
arrange(dist) %>%
slice(1)
#> # A tibble: 3 × 5
#> # Groups: cpg [3]
#> cpg cpg_pos gene gene_pos dist
#> <int> <dbl> <chr> <dbl> <dbl>
#> 1 1 48 B 50 2
#> 2 2 51 B 50 1
#> 3 3 31 A 30 1
Created on 2022-04-14 by the reprex package (v2.0.0)
CPG number 1 is at position 48. The closest gene position is position 50 of gene B which is 2bp apart.
I have a data frame containing weight values. I have to create a new column, the percentage change of weight, where the denominator takes the value of every third row.
df <- data.frame(weight = c(30,30,109,30,309,10,20,20,14))
# expected output
change_of_weight = c(30/109, 30/109, 109/109, 30/10,309/10,10/10,20/14,20/14,14/14)
Subset the weight column where its position %% 3 is zero and repeat each value three times.
df <- transform(df, change_of_weight=weight / rep(weight[1:nrow(df) %% 3 == 0], each=3))
df
weight change_of_weight
1 30 0.2752294
2 30 0.2752294
3 109 1.0000000
4 30 3.0000000
5 309 30.9000000
6 10 1.0000000
7 20 1.4285714
8 20 1.4285714
9 14 1.0000000
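To see what the rep() builds, here are the denominators it lines up for these nine rows (every third weight, each repeated three times):
df$weight[seq_len(9) %% 3 == 0]
#> [1] 109  10  14
rep(c(109, 10, 14), each = 3)
#> [1] 109 109 109  10  10  10  14  14  14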
You can create a group for every 3 rows and divide the weight column by the last value in each group.
df$change <- with(df, ave(df$weight, ceiling(seq_len(nrow(df))/3),
FUN = function(x) x/x[length(x)]))
Or using dplyr :
library(dplyr)
df %>%
group_by(grp = ceiling(row_number()/3)) %>%
mutate(change = weight/last(weight))
# weight grp change
# <dbl> <dbl> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
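The grouping index that ceiling(row_number()/3) produces for nine rows:
ceiling(seq_len(9) / 3)
#> [1] 1 1 1 2 2 2 3 3 3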
We can also use gl to create a grouping column
library(dplyr)
df %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(change = weight/last(weight))
# A tibble: 9 x 3
# Groups: grp [3]
# weight grp change
# <dbl> <int> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
Or using data.table
library(data.table)
setDT(df)[, change := weight/last(weight), .(as.integer(gl(nrow(df), 3, nrow(df))))]
I would like to calculate the mean Euclidean distance between each item and all other items in its group within a data frame. I'd like to do this within the tidyverse, but can't seem to get it to work the way I want.
Example data:
library(tidyverse)
DF <- data.frame(Item = letters[1:20], Grp = rep(1:4, each = 5),
x = runif(20, -0.5, 0.5),
y = runif(20, -0.5, 0.5))
I think Euclidean distances are calculated using:
sqrt(((x[i] - x[i + 1]) ^ 2) + ((y[i] - y[i + 1]) ^ 2))
I've tried, without success, to do something with mutate.
DF %>%
group_by(Grp, Item) %>%
mutate(Dist = mean(sqrt(((x - lag(x, default = x[1])) ^ 2) +
(y - lag(y, default = y[1])) ^ 2)))
But it doesn't work and only returns NAs.
# A tibble: 20 x 5
# Groups: Grp, Item [20]
Item Grp x y Dist
<fct> <int> <dbl> <dbl> <dbl>
1 a 1 -0.212 0.390 NA
2 b 1 0.288 0.193 NA
3 c 1 -0.0910 0.141 NA
4 d 1 0.383 0.494 NA
5 e 1 0.440 0.156 NA
6 f 2 -0.454 0.209 NA
7 g 2 0.0281 0.0441 NA
8 h 2 0.392 0.0941 NA
9 i 2 0.0514 -0.211 NA
10 j 2 -0.0434 -0.353 NA
11 k 3 0.457 0.463 NA
12 l 3 -0.0467 0.402 NA
13 m 3 0.178 0.191 NA
14 n 3 0.0726 0.295 NA
15 o 3 -0.397 -0.475 NA
16 p 4 0.400 -0.0222 NA
17 q 4 -0.254 0.258 NA
18 r 4 -0.458 -0.284 NA
19 s 4 -0.172 -0.182 NA
20 t 4 0.455 -0.268 NA
If I understand lag correctly, it would still be sequential (if it worked), rather than computing distances between all pairs within a group.
How can I get the mean of all 4 distances for each item in a group?
Does anyone have any suggestions?
DF %>% group_by(Grp) %>%
mutate(Dist = colMeans(as.matrix(dist(cbind(x, y)))))
# # A tibble: 20 x 5
# # Groups: Grp [4]
# Item Grp x y Dist
# <fctr> <int> <dbl> <dbl> <dbl>
# 1 a 1 -0.197904299 0.363086055 0.4659160
# 2 b 1 0.090540444 -0.006314185 0.2031230
# 3 c 1 0.101018893 -0.025062949 0.2011672
# 4 d 1 0.006358616 -0.149784267 0.2323359
# 5 e 1 0.219596250 -0.341440596 0.3605274
# 6 f 2 -0.493124602 -0.002935820 0.5155365
# ...
To see how it works, start with one data subset and go piece by piece:
# run these one line at a time and have a look at ?dist
dd = DF[DF$Grp == "1", c("x", "y")]
dist(dd)
as.matrix(dist(dd))
colMeans(as.matrix(dist(dd)))
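One caveat worth checking against your intent: colMeans averages over all n rows of the distance matrix, including the zero self-distance, so each value above is the sum of the 4 real distances divided by 5. If you want the mean of only the distances to the other items, divide the column sums by n() - 1 instead:
DF %>% group_by(Grp) %>%
  mutate(Dist = colSums(as.matrix(dist(cbind(x, y)))) / (n() - 1))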
Question:
I am using dplyr for data analysis in R, and I have come across the following problem.
My data frame is like this:
item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65
The data frame is already arranged by item, day. Now I want to mutate a new column where each row holds the smallest val from the same group among rows whose day is within the next 2 days.
For the example above, I want the resulting data frame to be:
item day val output
1 A 1 90 100 # the smaller of 100 and 110
2 A 2 100 110 # the only value within 2 days
3 A 3 110 80 # the only value within 2 days
4 A 5 80 NA # there is no data within 2 days
5 A 8 70 NA # there is no data within 2 days
6 B 1 75 65 # the only value within 2 days
7 B 3 65 NA # there is no data within 2 days
I understand that I will probably use group_by and mutate, but how to write the inside function in order to achieve my desired result?
Any help is greatly appreciated. Let me know if you need me to clarify anything. Thank you!
Try this:
df %>%
# arrange(item, day) %>% # if not already arranged
# take note of the next two values & corresponding difference in days
group_by(item) %>%
mutate(val.1 = lead(val),
day.1 = lead(day) - day,
val.2 = lead(val, 2),
day.2 = lead(day, 2) - day) %>%
ungroup() %>%
# if the value is associated with a day more than 2 days away, change it to NA
mutate(val.1 = ifelse(day.1 %in% c(1, 2), val.1, NA),
val.2 = ifelse(day.2 %in% c(1, 2), val.2, NA)) %>%
# calculate output normally
group_by(item, day) %>%
mutate(output = min(val.1, val.2, na.rm = TRUE)) %>%
ungroup() %>%
# arrange results
select(item, day, val, output) %>%
mutate(output = ifelse(output == Inf, NA, output)) %>%
arrange(item, day)
# A tibble: 7 x 4
item day val output
<fctr> <int> <int> <dbl>
1 A 1 90 100
2 A 2 100 110
3 A 3 110 80.0
4 A 5 80 NA
5 A 8 70 NA
6 B 1 75 65.0
7 B 3 65 NA
Data:
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
We can use complete from the tidyr package to complete the dataset by day, and then use lead from dplyr and rollapply from zoo to find the minimum of the next two days.
library(dplyr)
library(tidyr)
library(zoo)
DF2 <- DF %>%
group_by(item) %>%
complete(day = full_seq(day, period = 1)) %>%
mutate(output = rollapply(lead(val), width = 2, FUN = min, na.rm = TRUE,
fill = NA, align = "left")) %>%
drop_na(val) %>%
ungroup() %>%
mutate(output = ifelse(output == Inf, NA, output))
DF2
# # A tibble: 7 x 4
# item day val output
# <chr> <dbl> <int> <dbl>
# 1 A 1.00 90 100
# 2 A 2.00 100 110
# 3 A 3.00 110 80.0
# 4 A 5.00 80 NA
# 5 A 8.00 70 NA
# 6 B 1.00 75 65.0
# 7 B 3.00 65 NA
DATA
DF <- read.table(text = "item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65",
header = TRUE, stringsAsFactors = FALSE)
We'll create a dataset with modified day values, so we can left-join it onto the original dataset, keeping only the minimum value.
df %>%
  left_join(
    bind_rows(mutate(., day = day - 1), mutate(., day = day - 2)) %>%
      rename(output = val)) %>%
  group_by(item, day, val) %>%
  summarize_at("output", min) %>%
  ungroup
# # A tibble: 7 x 4
# item day val output
# <fctr> <dbl> <int> <dbl>
# 1 A 1 90 100
# 2 A 2 100 110
# 3 A 3 110 80
# 4 A 5 80 NA
# 5 A 8 70 NA
# 6 B 1 75 65
# 7 B 3 65 NA
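To see what the join builds, look at one shifted copy on its own: pushing day down by 1 (and by 2 in the second copy) makes each row's val a candidate for the one or two days before it, so the left join attaches to every (item, day) exactly the vals occurring 1 or 2 days later, or NA when there are none:
df %>% mutate(day = day - 1) %>% head(3)
#>   item day val
#> 1    A   0  90
#> 2    A   1 100
#> 3    A   2 110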
data
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
To manipulate/summarize data over time, I usually use SQL's ROW_NUMBER() OVER (PARTITION BY ...). I'm new to R, so I'm trying to recreate tables I would otherwise create in SQL. The sqldf package does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include the average of each person's previous 2 instances, for every day after their 2nd instance (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 90
8 7 A 20 35
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to rely on the fact that ID is sequential with the date (i.e., the original table could be rearranged in random date order and the code would still work). Let me know if you need more details. Thanks for your help!
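For reference, here is the example table as runnable data (a reconstruction of the table shown above; the answers below assume it is named df):
df <- data.frame(
  ID = 1:8,
  Day = c(1L, 1L, 2L, 3L, 4L, 4L, 6L, 7L),
  Person = c("A", "B", "A", "B", "A", "B", "B", "A"),
  Cost = c(50L, 25L, 30L, 75L, 35L, 100L, 65L, 20L),
  stringsAsFactors = TRUE  # Person as a factor, matching the <fctr> column printed below
)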
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>% # sort
group_by(Person) %>% # set grouping
mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
  # c(NA, x[-length(x)]) is x lagged by 1; c(NA, NA, ...) is x lagged by 2
  (c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.