I would like to calculate a rolling average for a fixed interval in my dataset.
start  end  value  value_per_unit
    4   20     20            1.25
   21   33     40            3.33
   34   45     30            2.73
   46   60     10            0.71
I would like to obtain the value for a fixed interval of 10 as follows:
start  end  value_per_unit
    4   13           1.25
   14   23           1.874
   24   33           3.33
    .
    .
Where:
for the interval c(4, 13): (1.25*10)/10 = 1.25
for the interval c(14, 23): (1.25*7 + 3.33*3)/10 = 1.874
for the interval c(24, 33): (10*3.33)/10 = 3.33
Is it possible to achieve this in R?
There seems to be some confusion in your question; however, this approach gives the desired output:
library(dplyr, warn.conflicts = FALSE)
# Create the test data
df <- tribble(
  ~start, ~end, ~value, ~value_per_unit,
  4     , 20,   20,     1.25,
  21    , 33,   40,     3.33,
  34    , 45,   30,     2.73,
  46    , 60,   10,     0.71
)
# Some data prep to apply the transformation
df1 <- df %>%
  rowwise() %>%
  mutate(row = list(seq(from = start, to = end))) %>%
  ungroup() %>%
  tidyr::unnest(row) %>%
  mutate(group = (row - 4) %/% 10)
# This hopefully illustrates what's happening here:
print(df1, n = 20)
#> # A tibble: 57 x 6
#> start end value value_per_unit row group
#> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 4 20 20 1.25 4 0
#> 2 4 20 20 1.25 5 0
#> 3 4 20 20 1.25 6 0
#> 4 4 20 20 1.25 7 0
#> 5 4 20 20 1.25 8 0
#> 6 4 20 20 1.25 9 0
#> 7 4 20 20 1.25 10 0
#> 8 4 20 20 1.25 11 0
#> 9 4 20 20 1.25 12 0
#> 10 4 20 20 1.25 13 0
#> 11 4 20 20 1.25 14 1
#> 12 4 20 20 1.25 15 1
#> 13 4 20 20 1.25 16 1
#> 14 4 20 20 1.25 17 1
#> 15 4 20 20 1.25 18 1
#> 16 4 20 20 1.25 19 1
#> 17 4 20 20 1.25 20 1
#> 18 21 33 40 3.33 21 1
#> 19 21 33 40 3.33 22 1
#> 20 21 33 40 3.33 23 1
#> # ... with 37 more rows
# Summarise to create new values of start, end and value_per_unit
df1 %>%
  group_by(group) %>%
  summarise(
    start = min(row),
    end = max(row),
    value_per_unit = mean(value_per_unit),
    .groups = "drop"
  ) %>%
  select(-group)
#> # A tibble: 6 x 3
#> start end value_per_unit
#> <int> <int> <dbl>
#> 1 4 13 1.25
#> 2 14 23 1.87
#> 3 24 33 3.33
#> 4 34 43 2.73
#> 5 44 53 1.11
#> 6 54 60 0.71
Created on 2021-10-12 by the reprex package (v2.0.0)
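The same weighted average can also be computed without expanding every interval into unit rows, by counting how many integer units each new interval shares with each source interval. This is only a minimal base R sketch under the assumption that start and end are inclusive integer positions, as in the example data; the names src, width, new_start, new_end and overlap are made up for illustration.

```r
# Source intervals and their per-unit values (copied from the question).
src <- data.frame(
  start = c(4, 21, 34, 46),
  end   = c(20, 33, 45, 60),
  value_per_unit = c(1.25, 3.33, 2.73, 0.71)
)

width <- 10
new_start <- seq(min(src$start), max(src$end), by = width)
new_end   <- pmin(new_start + width - 1, max(src$end))

# overlap[i, j]: number of integer units shared by new interval i and
# source interval j (0 when they do not intersect).
overlap <- outer(seq_along(new_start), seq_along(src$start), function(i, j) {
  pmax(0, pmin(new_end[i], src$end[j]) - pmax(new_start[i], src$start[j]) + 1)
})

# Weighted mean per new interval; dividing by the covered units (rather than
# a fixed 10) keeps the last, shorter interval at its true per-unit value.
result <- data.frame(
  start = new_start,
  end   = new_end,
  value_per_unit = as.vector(overlap %*% src$value_per_unit) / rowSums(overlap)
)
result
```

Dividing by rowSums(overlap) assumes every new interval overlaps at least one source interval, which holds here because the source intervals cover 4 to 60 without gaps.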
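Here is another solution. It reuses df from above and always divides by 10, so the final, shorter interval (54 to 60) comes out as 0.497 rather than 0.71: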
library(tidyverse)
seq(4, 60, 10) %>%
  enframe(value = "start") %>%
  mutate(end = pmin(start + 9, max(df$end))) %>%
  {map2(.$start, .$end, ~ c(.x:.y))} %>%
  map_dfc(~ df %>%
            rowwise() %>%
            mutate(cnt = length(intersect(.x, seq(start, end, 1)))) %>%
            pull(cnt)) %>%
  bind_cols(as_tibble(df$value_per_unit)) %>%
  summarise(across(matches("\\d+"), ~ sum(.x * value) / 10)) %>%
  pivot_longer(everything(), names_to = "name",
               values_to = "weighted_avg",
               names_pattern = ".*(\\d+)")
# A tibble: 6 x 2
name weighted_avg
<chr> <dbl>
1 1 1.25
2 2 1.87
3 3 3.33
4 4 2.73
5 5 1.11
6 6 0.497
I have a data frame and I want to create multiple columns by applying the same function (ntile) to the original version of each variable. I'm not sure whether a loop or something else is the right tool. The example below is a toy example; my actual data frame has over 20 variables that need this.
Basically I want to make a variable called "original_name"_bin for each of the numeric variables in my data frame. These _bin variables are just the ntile function operated on the original non _bin version:
dat1 <- read.table(text = "x1 x2
10 20
20 30.5
30 40.5
40 20.12
50 25
70 86
80 75
90 45 ", header = TRUE)
num_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]))
bin_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]), "bin", sep = "_")
# Want to make columns in data frame where the var_bin is:
dat1$x1_bin <- ntile(dat1$x1, n = 10)
# loop
for (i in 1:length(bin_names)){
assign(paste0("dat1$", bin_names[i]), ntile(???, 10))
}
Here is one base way to do it using lapply:
dat1 <- read.table(text = "x1 x2
10 20
20 30.5
30 40.5
40 20.12
50 25
70 86
80 75
90 45 ", header = TRUE)
num_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]))
bin_names <- paste(colnames(dat1[sapply(dat1, is.numeric)]), "bin", sep = "_")
dat1[bin_names] <- lapply(dat1[num_names], \(x) dplyr::ntile(x, n = 10))
dat1
#> x1 x2 x1_bin x2_bin
#> 1 10 20.00 1 1
#> 2 20 30.50 2 4
#> 3 30 40.50 3 5
#> 4 40 20.12 4 2
#> 5 50 25.00 5 3
#> 6 70 86.00 6 8
#> 7 80 75.00 7 7
#> 8 90 45.00 8 6
Created on 2021-12-07 by the reprex package (v2.0.1)
As a base R loop (using [[ so that ntile receives a plain vector rather than a one-column data frame):
for (i in seq_along(bin_names)) {
  dat1[[bin_names[i]]] <- dplyr::ntile(dat1[[num_names[i]]], 10)
}
dat1
#> x1 x2 x1_bin x2_bin
#> 1 10 20.00 1 1
#> 2 20 30.50 2 4
#> 3 30 40.50 3 5
#> 4 40 20.12 4 2
#> 5 50 25.00 5 3
#> 6 70 86.00 6 8
#> 7 80 75.00 7 7
#> 8 90 45.00 8 6
With dplyr::across:
library(dplyr)
dat1 %>%
  mutate(across(all_of(num_names),
                ~ ntile(.x, n = 10),
                .names = "{.col}_bin"))
#> x1 x2 x1_bin x2_bin
#> 1 10 20.00 1 1
#> 2 20 30.50 2 4
#> 3 30 40.50 3 5
#> 4 40 20.12 4 2
#> 5 50 25.00 5 3
#> 6 70 86.00 6 8
#> 7 80 75.00 7 7
#> 8 90 45.00 8 6
Created on 2021-12-07 by the reprex package (v2.0.1)
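If the goal is simply "every numeric column", the name vectors can be skipped altogether by selecting on type inside across(). This is a small sketch assuming dplyr 1.0 or later is installed; the result name binned is made up for illustration, and the data is the toy data from the question.

```r
library(dplyr)

dat1 <- data.frame(
  x1 = c(10, 20, 30, 40, 50, 70, 80, 90),
  x2 = c(20, 30.5, 40.5, 20.12, 25, 86, 75, 45)
)

# where(is.numeric) picks every numeric column, so neither num_names nor
# bin_names has to be built by hand; .names appends the "_bin" suffix.
binned <- dat1 %>%
  mutate(across(where(is.numeric), ~ ntile(.x, n = 10), .names = "{.col}_bin"))

binned
```

With over 20 variables this keeps the selection rule in one place; swap where(is.numeric) for all_of(num_names) if only a subset of the numeric columns should be binned.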
I have the following data:
x1 <- rnorm(20,0,1)
x2 <- rnorm(20,0,1)
group <- sample(50:55, size=20, replace=TRUE)
data <- data.frame(x1,x2,group)
head(data)
x1 x2 group
1 -0.88001290 0.53866432 50
2 0.34228653 -0.54503078 52
3 -2.42308971 0.09542262 54
4 0.07310148 -1.03226594 50
5 -0.47786709 2.46726615 55
6 0.45224510 -1.46224926 55
I need to create a grouping indicator based on the group variable, so that rows where group = 50 get 1, rows where group = 51 get 2, and so on.
I tried to do this using the dplyr package in R, but I am not getting the correct answer because I have not defined the indicator variable correctly.
data %>% arrange(group) %>% group_by(group) %>% mutate(Indicator = n() )
Can anyone help me to correct my code?
Thank you
We need cur_group_id() instead of n(); n() returns the number of rows in the current group.
library(dplyr)
data %>%
  arrange(group) %>%
  group_by(group) %>%
  mutate(indicator = cur_group_id()) %>%
  ungroup()
Output:
# A tibble: 20 x 4
# x1 x2 group indicator
# <dbl> <dbl> <int> <int>
# 1 -1.24 -0.497 50 1
# 2 -0.648 1.59 50 1
# 3 0.598 -0.325 51 2
# 4 -0.721 0.510 51 2
# 5 0.259 1.62 51 2
# 6 -0.288 0.872 52 3
# 7 0.403 0.785 52 3
# 8 1.84 1.65 52 3
# 9 0.116 -0.0234 52 3
#10 -1.31 -0.244 52 3
#11 -0.615 0.994 53 4
#12 -0.469 0.695 53 4
#13 -0.324 -0.599 53 4
#14 -0.394 -0.971 53 4
#15 1.30 0.323 54 5
#16 0.0242 -1.46 54 5
#17 -0.342 -1.96 54 5
#18 1.10 -0.569 54 5
#19 -0.967 -0.863 54 5
#20 -0.396 -0.441 55 6
Another option is match():
data %>%
  mutate(indicator = match(group, sort(unique(group))))
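Base R, using factor(). Note that this gives a factor; wrap it in as.integer() if you need numbers:
levels <- 50:55
labels <- 1:6
data$indicator <- factor(data$group, levels, labels)
or, deriving the levels from the data (sorted, so that the smallest group gets 1):
levels <- sort(unique(data$group))
labels <- seq_along(levels)
data$indicator <- factor(data$group, levels, labels)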
dplyr::dense_rank may also help, even without grouping:
data %>% mutate(indicator = dense_rank(group))
Base R way:
data$indicator <- as.numeric(as.factor(data$group))
data
x1 x2 group indicator
1 -1.453628399 -1.78776319 55 6
2 -0.119413813 -0.07656982 52 3
3 0.387951296 -0.26845052 55 6
4 3.117977719 0.69280780 51 2
5 -0.938126762 -0.16898209 50 1
6 -1.596371818 0.35289797 52 3
7 -2.291376398 -1.59385221 55 6
8 0.161164263 -0.99387565 54 5
9 -0.281744752 -0.26801191 53 4
10 0.760719223 -0.28255900 50 1
11 -0.204073022 -1.10262114 51 2
12 0.653628314 0.77778039 54 5
13 0.043736298 -0.37896178 55 6
14 0.002800531 1.17034334 55 6
15 0.451136658 -0.38459588 51 2
16 0.151793862 0.60303631 55 6
17 0.173976519 -0.41745808 53 4
18 0.282827170 -0.16794851 52 3
19 0.737444975 -0.45712603 51 2
20 0.014182869 0.99013155 51 2
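To see how these options relate, here is a small base R sketch on a made-up group vector (the vector and the names by_match, by_factor, unsorted and fixed are purely illustrative). It shows that match() on the sorted unique values and as.integer(factor()) agree, and why the unique values have to be sorted first.

```r
# Hypothetical group values; note that 53 and 54 do not occur.
group <- c(55, 52, 55, 51, 50, 52)

# Position of each value among the sorted distinct values.
by_match <- match(group, sort(unique(group)))    # 4 3 4 2 1 3

# factor() sorts its levels by default, so the integer codes are the same.
by_factor <- as.integer(factor(group))           # 4 3 4 2 1 3

# Without sort(), ids follow order of first appearance instead of value order.
unsorted <- match(group, unique(group))          # 1 2 1 3 4 2

# With fixed levels, ids stay stable even when some groups are absent.
fixed <- as.integer(factor(group, levels = 50:55))  # 6 3 6 2 1 3
```

The first two give dense ids (1 to the number of groups actually present), while the fixed-levels variant always maps 50 to 1 and 55 to 6; which one is wanted depends on whether absent groups should leave gaps.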
I have a tibble like so:
df <- tibble(a = seq(1:10),
             b = seq(21,30),
             c = seq(31,40))
I want to create a new tibble in which some of the columns are lagged. The new columns should be named prev_ plus the original column name, e.g. prev_a.
In my actual data there are a lot of columns, so I don't want to write this out manually. Additionally, I only want to do it for some columns. In this example I have done it manually, but I would like to know whether there is a way to do it with a function.
df_new <- df %>%
  mutate(prev_a = lag(a),
         prev_b = lag(b),
         prev_c = lag(c))
Thanks for your help!
With the current dplyr version you can create new variables with mutate_at. If you pass a named list of functions, the list name is appended to the column name as a suffix. If you want it as a prefix, as in your example, you can use rename_at to fix the names afterwards. With your real data you will need to adjust the vars() selection; for your example data, matches("[a-c]") works.
library(dplyr)
df <- tibble(a = seq(1:10),
b = seq(21,30),
c = seq(31,40))
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x)))
#> # A tibble: 10 x 6
#> a b c a_prev b_prev c_prev
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x))) %>%
  rename_at(vars(contains("_prev")), list(~ paste("prev", gsub("_prev", "", .), sep = "_")))
#> # A tibble: 10 x 6
#> a b c prev_a prev_b prev_c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
Created on 2020-04-29 by the reprex package (v0.3.0)
You could do it this way:
df_new <- bind_cols(
  df,
  df %>% mutate_at(.vars = vars("a","b","c"), function(x) lag(x))
)
The names are a bit nasty, but you can rename them. Or see @Bas's comment to get the names with a suffix.
# A tibble: 10 x 6
a b c a1 b1 c1
<int> <int> <int> <int> <int> <int>
1 1 21 31 NA NA NA
2 2 22 32 1 21 31
3 3 23 33 2 22 32
4 4 24 34 3 23 33
5 5 25 35 4 24 34
6 6 26 36 5 25 35
7 7 27 37 6 26 36
8 8 28 38 7 27 37
9 9 29 39 8 28 38
10 10 30 40 9 29 39
If you have dplyr 1.0 you can use the new across() function.
See some examples from the docs; instead of mean you want lag:
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))
df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))
df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))
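Applied to the lag question, the across() pattern might look like the sketch below. It assumes dplyr 1.0 or later; cols_to_lag is a hypothetical vector naming the columns to lag, and the .names argument puts prev_ in front of each original column name so no renaming step is needed.

```r
library(dplyr)

df <- tibble(a = 1:10, b = 21:30, c = 31:40)

# Hypothetical selection: lag only some of the columns.
cols_to_lag <- c("a", "b")

df_new <- df %>%
  mutate(across(all_of(cols_to_lag),
                ~ dplyr::lag(.x),          # explicit namespace avoids stats::lag
                .names = "prev_{.col}"))   # prefix instead of the default suffix

df_new
```

Any tidyselect helper can replace all_of(cols_to_lag), for example starts_with() or where(is.numeric), if the columns are easier to describe by pattern or type than by name.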
I'm trying to create a column with a randomly sampled value for each row of data, but I'm new to purrr and can't get it to work.
My code:
df<-data.frame(x=rep(1:3,each=4),y=99)
df%>%
group_by(x)%>%
mutate_(val=~purrr::map_dbl(function(x) sample(50,1)))
This didn't work, but the purrr call works on its own:
1:5%>%purrr::map_dbl(function(x) sample(50,1))
[1] 39 30 7 18 45
Thanks for any help!
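You don't need purrr. Note that sample(50, n()) draws without replacement within each group; add replace = TRUE if repeated values within a group should be allowed: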
df <- data.frame(x = rep(1:3, each = 4), y = 99)
df %>%
  group_by(x) %>%
  mutate(val = sample(50, n()))
Output
# A tibble: 12 x 3
# Groups: x [3]
x y val
<int> <dbl> <int>
1 1 99.0 10
2 1 99.0 25
3 1 99.0 2
4 1 99.0 24
5 2 99.0 48
6 2 99.0 19
7 2 99.0 34
8 2 99.0 33
9 3 99.0 24
10 3 99.0 14
11 3 99.0 37
12 3 99.0 12
If you need to use purrr, I guess you could do the following (map_int returns a plain integer column, whereas map would return a list column):
dplyr::mutate(df, val = purrr::map_int(x, ~ sample(50, 1)))
x y val
1 1 99 35
2 1 99 4
3 1 99 43
4 1 99 28
5 2 99 49
6 2 99 31
7 2 99 31
8 2 99 31
9 3 99 19
10 3 99 4
11 3 99 43
12 3 99 20
Or with the pipe:
library(dplyr)
library(purrr)
df %>%
  mutate(val = map_int(x, ~ sample(50, 1)))
Data:
df <- data.frame(x = rep(1:3, each = 4), y = 99)
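If the grouping is not actually needed and each row should simply get its own independent draw, a plain base R assignment does the same job as the map-based version. This is only a sketch; set.seed() is there purely to make the example reproducible and is not part of the original question.

```r
set.seed(1)  # only so that the sketch is reproducible

df <- data.frame(x = rep(1:3, each = 4), y = 99)

# One independent draw from 1..50 per row; replace = TRUE mirrors calling
# sample(50, 1) once per row, so the same value may appear more than once.
df$val <- sample(50, nrow(df), replace = TRUE)

df
```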
I have this data frame:
Votes <- data.frame(
VoteCreationDate = c(1,3,3,5,5,6),
GiverId = c(19,19,38,19,38,19),
CumNumUpVotes = c(1,3,1,7,2,10)
)
Votes
VoteCreationDate GiverId CumNumUpVotes
1 19 1
3 19 3
3 38 1
5 19 7
5 38 2
6 19 10
For each GiverId (19 and 38), all possible dates (number from 1 to 6) should be listed in VoteCreationDate.
Then, for each GiverId and VoteCreationDate, the corresponding CumNumUpVotes should be matched. If there is no corresponding value, the CumNumUpVotes should be taken from the immediately preceding VoteCreationDate.
For example, for VoteCreationDate = 4 and GiverId = 38 there is no corresponding CumNumUpVotes. This cell should be equal to 1, which is the CumNumUpVotes from GiverId = 38 and VoteCreationDate = 3.
Here is how it should look at the end:
VoteCreationDate GiverId CumNumUpVotes
1 19 1
2 19 1
3 19 3
4 19 3
5 19 7
6 19 10
1 38 0
2 38 0
3 38 1
4 38 1
5 38 2
6 38 2
Any idea how to get there?
A dplyr and tidyr solution.
library(dplyr)
library(tidyr)
Votes2 <- Votes %>%
  complete(VoteCreationDate = full_seq(VoteCreationDate, period = 1), GiverId) %>%
  arrange(GiverId, VoteCreationDate) %>%
  group_by(GiverId) %>%
  fill(CumNumUpVotes) %>%
  replace_na(list(CumNumUpVotes = 0)) %>%
  ungroup()
Votes2
# # A tibble: 12 x 3
# VoteCreationDate GiverId CumNumUpVotes
# <dbl> <dbl> <dbl>
# 1 1.00 19.0 1.00
# 2 2.00 19.0 1.00
# 3 3.00 19.0 3.00
# 4 4.00 19.0 3.00
# 5 5.00 19.0 7.00
# 6 6.00 19.0 10.0
# 7 1.00 38.0 0
# 8 2.00 38.0 0
# 9 3.00 38.0 1.00
# 10 4.00 38.0 1.00
# 11 5.00 38.0 2.00
# 12 6.00 38.0 2.00
A base R alternative. It relies on CumNumUpVotes never decreasing within a GiverId, so that cummax carries the last value forward:
do.call(rbind, lapply(split(Votes, Votes$GiverId), function(x){
  temp = merge(x, data.frame(VoteCreationDate = 1:6), all = TRUE)
  temp$GiverId = temp$GiverId[!is.na(temp$GiverId)][1]
  temp$CumNumUpVotes = cummax(replace(temp$CumNumUpVotes, is.na(temp$CumNumUpVotes), 0))
  temp
}))
# VoteCreationDate GiverId CumNumUpVotes
#19.1 1 19 1
#19.2 2 19 1
#19.3 3 19 3
#19.4 4 19 3
#19.5 5 19 7
#19.6 6 19 10
#38.1 1 38 0
#38.2 2 38 0
#38.3 3 38 1
#38.4 4 38 1
#38.5 5 38 2
#38.6 6 38 2
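To make the carry-forward rule from the question explicit, here is a base R sketch that fills each gap with the most recent earlier value and uses 0 before the first observation. The helper locf0 and the intermediate names grid and full are made up for illustration; unlike a cummax-based fill, this does not assume the values are non-decreasing.

```r
Votes <- data.frame(
  VoteCreationDate = c(1, 3, 3, 5, 5, 6),
  GiverId          = c(19, 19, 38, 19, 38, 19),
  CumNumUpVotes    = c(1, 3, 1, 7, 2, 10)
)

# Hypothetical helper: last observation carried forward, 0 before the first one.
locf0 <- function(x) {
  # Index of the most recent non-missing value at each position (0 if none yet).
  idx <- cummax(seq_along(x) * !is.na(x))
  c(0, x)[idx + 1]
}

# Every date for every giver, joined to the observed rows (NA where missing).
grid <- expand.grid(VoteCreationDate = 1:6, GiverId = unique(Votes$GiverId))
full <- merge(grid, Votes, all.x = TRUE)
full <- full[order(full$GiverId, full$VoteCreationDate), ]

# Fill within each giver separately.
full$CumNumUpVotes <- ave(full$CumNumUpVotes, full$GiverId, FUN = locf0)
rownames(full) <- NULL
full
```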