I have a set of 10,000 x,y data points. These points are to be partitioned along the x-axis into non-overlapping bins of 10 points each. From this I need a new dataset in which x is the mean of the 10 x values in a bin and y is the maximum of the 10 y values. The final dataset should contain 1,000 x,y pairs.
The sample was prepared in Excel; I want to perform this task in R.
In tidyverse:
library(tidyverse)
df %>%
arrange(x) %>%
group_by(grp = gl(n(), 10, n())) %>%
summarise(x = mean(x), y = max(y))
In Base R
n <- nrow(df)
do.call(rbind.data.frame, by(df[order(df$x),], gl(n, 10, n),
function(x) cbind(x = mean(x$x), y = max(x$y))))
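To see what the grouping index does, here is a minimal base R sketch on made-up data (30 points rather than 10,000). It assumes an integer bin index built with ceiling(seq_len(n) / 10) instead of gl(), an alternative that avoids unused factor levels; the toy data frame and the names bin and res are illustrative only.

```r
set.seed(42)
# toy data: 30 points, so we expect 3 bins of 10
df <- data.frame(x = runif(30), y = rnorm(30))

# sort along the x-axis so each bin holds 10 neighbouring x values
df <- df[order(df$x), ]

# bin index: rows 1-10 -> 1, rows 11-20 -> 2, rows 21-30 -> 3
bin <- ceiling(seq_len(nrow(df)) / 10)

# per bin: mean of x and maximum of y
res <- data.frame(
  x = as.vector(tapply(df$x, bin, mean)),
  y = as.vector(tapply(df$y, bin, max))
)
res
```

Because the rows are sorted before binning, the x means in res increase from one bin to the next.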
I created some sample data, as you did not provide any.
I use the data.table package, but you could do something similar in dplyr or base R.
library(data.table)
dt <- data.table(
x = sample(40:50, 50, replace = TRUE),
y = sample(1000:3000, 50)
)
dt[, grp := gl(.N, 10, .N)] # edit based on Onyambu's solution
dt[, .(x_avg = mean(x), y_max = max(y)), by = grp]
# grp x_avg y_max
# 1: 1 44.7 2765
# 2: 2 45.3 2861
# 3: 3 44.7 2831
# 4: 4 46.2 2947
# 5: 5 46.7 2684
Related
I want to calculate the mean and standard deviation for subgroups of every column in my dataset.
The membership of the subgroups is based on the values in the column of interest and these subgroups are specific to each column of interest.
# Example data
set.seed(1)
library(data.table)
df <- data.frame(baseline = runif(100), `Week0_12` = runif(100), `Week12_24` = runif(100))
So for column baseline, a row may be assigned to a different subgroup than for column Week0_12.
I can of course create these 'subgroup columns' manually for each column and then calculate the statistics for each column by column subgroup:
df$baseline_subgroup <- ifelse(df$baseline < 0.2, "subgroup_1", "subgroup_2")
df <- as.data.table(df)
df[, .(mean = mean(baseline), sd = sd(baseline)), by = baseline_subgroup]
Giving this output:
baseline_subgroup mean sd
1: subgroup_2 0.58059314 0.22670071
2: subgroup_1 0.09793105 0.05317809
Doing this for every column separately is too much repetition, especially given that I have many columns in my actual data.
df$Week0_12_subgroup <- ifelse(df$Week0_12 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week0_12), sd = sd(Week0_12)), by = Week0_12_subgroup]
df$Week12_24_subgroup <- ifelse(df$Week12_24 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week12_24), sd = sd(Week12_24)), by = Week12_24_subgroup]
What is a more elegant approach to do this?
Here's a tidyverse method that gives an easy-to-read and easy-to-plot output:
library(tidyverse)
set.seed(1)
df <- data.frame(baseline = runif(100),
`Week0_12` = runif(100),
`Week12_24` = runif(100))
df2 <- df %>%
summarize(across(everything(), list(mean_subgroup1 = ~mean(.x[.x < 0.2]),
sd_subgroup1 = ~sd(.x[.x < 0.2]),
mean_subgroup2 = ~mean(.x[.x >= 0.2]),
sd_subgroup2 = ~sd(.x[.x >= 0.2])))) %>%
pivot_longer(everything(), names_pattern = '^(.*)_(.*)_(.*$)',
names_to = c('time', 'measure', 'subgroup')) %>%
pivot_wider(names_from = measure, values_from = value)
df2
#> # A tibble: 6 x 4
#> time subgroup mean sd
#> <chr> <chr> <dbl> <dbl>
#> 1 baseline subgroup1 0.0979 0.0532
#> 2 baseline subgroup2 0.581 0.227
#> 3 Week0_12 subgroup1 0.117 0.0558
#> 4 Week0_12 subgroup2 0.594 0.225
#> 5 Week12_24 subgroup1 0.121 0.0472
#> 6 Week12_24 subgroup2 0.545 0.239
ggplot(df2, aes(time, mean, group = subgroup)) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd, color = subgroup),
width = 0.1) +
geom_point() +
theme_minimal(base_size = 16)
Created on 2022-07-14 by the reprex package (v2.0.1)
You could use lapply() to apply a subgroup function to each column, i.e.
library(data.table)
# x is the column we are interested in; split its values at 0.2 and summarise each subgroup
subgroup <- function(x) {
  dt <- data.table(value = x, subgroup = ifelse(x < 0.2, "subgroup_1", "subgroup_2"))
  dt[, .(mean = mean(value), sd = sd(value)), by = subgroup]
}
# columns to summarise (only the numeric measurement columns)
cols <- c("baseline", "Week0_12", "Week12_24")
# lapply() iterates over the columns and returns a named list of summary tables
summaries <- lapply(as.data.frame(df)[cols], subgroup)
You can create a custom function and apply it using .SD, i.e.
library(data.table)
f1 <- function(x) {
  list(Avg = mean(x), standard_dev = sd(x))
}
setDT(df)[, unlist(lapply(.SD, f1), recursive = FALSE), by = baseline_subgroup][]
baseline_subgroup baseline.Avg baseline.standard_dev Week0.12.Avg Week0.12.standard_dev Week12.24.Avg Week12.24.standard_dev
1: subgroup_2 0.5950020 0.22556590 0.5332555 0.2651810 0.5467046 0.2912027
2: subgroup_1 0.1006693 0.04957005 0.5947161 0.2645519 0.5137543 0.3213723
I'm not very familiar with the dplyr functions in R, but I want to filter my dataset on certain conditions.
Let's say my dataset has more than 100 attributes, and I want to filter on multiple conditions.
Can I write the filter using the positions of the columns instead of their names, as follows?
y = filter(retag, c(4:50) != 8 & c(90:110) == 8)
I've tried code similar to this a few times, but still haven't got the result I want.
I also tried the following, but I'm not sure how to add further conditions to the rowSums() call.
retag[rowSums((retag!=8)[,c(4:50)])>=1,]
The only examples I found used the column names instead of their positions.
Is there any way to filter by column position, given that my data is quite large?
You can use a combination of filter() and across(). I didn't have your version of the retag data frame, so I created my own as an example:
library(dplyr)
set.seed(2000)
retag <- tibble(
col1 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col2 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col3 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col4 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col5 = runif(n = 1000, min = 0, max = 10) %>% round(0)
)
# filter where the first, second, and third column all equal 5 and the fourth column does not equal 5
retag %>%
filter(
across(1:3, function(x) x == 5),
across(4, function(x) x != 5)
)
if_all() and if_any() were introduced in dplyr 1.0.4 for the purpose of filtering across multiple variables.
library(dplyr)
filter(retag, if_all(X:Y, ~ .x > 10 & .x < 35))
# # A tibble: 5 x 2
# X Y
# <int> <int>
# 1 11 30
# 2 12 31
# 3 13 32
# 4 14 33
# 5 15 34
filter(retag, if_any(X:Y, ~ .x == 2 | .x == 25))
# # A tibble: 2 x 2
# X Y
# <int> <int>
# 1 2 21
# 2 6 25
Data
retag <- structure(list(X = 1:20, Y = 20:39), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Here's a base R option.
This selects rows where there is no 8 in columns 4 to 50 and at least one 8 in columns 90 to 110.
result <- retag[rowSums(retag[4:50] == 8, na.rm = TRUE) == 0 &
rowSums(retag[90:110] == 8, na.rm = TRUE) > 0, ]
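The same logic on a small made-up data frame makes it easy to check which rows survive. This is only a sketch: the hypothetical columns a to d stand in for the two column ranges (columns 1:2 play the role of 4:50, columns 3:4 the role of 90:110).

```r
# toy stand-in for retag: four rows, four columns
retag <- data.frame(
  a = c(8, 1, 2, 3),
  b = c(1, 8, 2, 3),
  c = c(8, 8, 8, 1),
  d = c(1, 8, 1, 8)
)

# TRUE where none of the first two columns contains an 8
no8_first <- rowSums(retag[1:2] == 8, na.rm = TRUE) == 0

# TRUE where at least one of the last two columns contains an 8
any8_last <- rowSums(retag[3:4] == 8, na.rm = TRUE) > 0

# keep rows that satisfy both conditions
result <- retag[no8_first & any8_last, ]
result
```

Rows 1 and 2 are dropped because they contain an 8 in a or b; rows 3 and 4 are kept because they have no 8 there and at least one 8 in c or d.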
I want to calculate all moving averages in one statement rather than repeating myself. Is this possible using quantmod or does it require some clever use of tidyeval and/or purrr?
library(tidyquant)
library(quantmod)
library(zoo)
tibble(date = as.Date('2018-01-01') + days(1:100),
value = 100 + cumsum(rnorm(100))) %>%
tq_mutate(mutate_fun = rollapply, select = "value", width = 10, FUN = mean, col_rename = "rm10") %>%
tq_mutate(mutate_fun = rollapply, select = "value", width = 5, FUN = mean, col_rename = "rm5") %>%
gather(series, value, -date) %>%
ggplot(aes(date, value, color = series)) +
geom_line()
Here is a solution using data.table's frollmean() function.
data.table v1.12.0 or higher is required.
sample data
library(data.table)
set.seed(123)
# adding an integer to a Date adds that many days, so lubridate's days() is not needed
dt <- data.table(date = as.Date("2018-01-01") + 1:100,
value = 100 + cumsum(rnorm(100)))
code
# set the windows you want to roll on
windows <- c(5, 10)
# create an rm<window> column for each window
dt[, (paste0("rm", windows)) := lapply(windows, function(x) frollmean(value, x))]
output
head( dt, 15 )
# date value rm5 rm10
# 1: 2018-01-02 99.43952 NA NA
# 2: 2018-01-03 99.20935 NA NA
# 3: 2018-01-04 100.76806 NA NA
# 4: 2018-01-05 100.83856 NA NA
# 5: 2018-01-06 100.96785 100.2447 NA
# 6: 2018-01-07 102.68292 100.8933 NA
# 7: 2018-01-08 103.14383 101.6802 NA
# 8: 2018-01-09 101.87877 101.9024 NA
# 9: 2018-01-10 101.19192 101.9731 NA
# 10: 2018-01-11 100.74626 101.9287 101.0867
# 11: 2018-01-12 101.97034 101.7862 101.3398
# 12: 2018-01-13 102.33015 101.6235 101.6519
# 13: 2018-01-14 102.73092 101.7939 101.8482
# 14: 2018-01-15 102.84161 102.1239 102.0485
# 15: 2018-01-16 102.28577 102.4318 102.1802
plot
#plot molten data
library(ggplot2)
ggplot( data = melt(dt, id.vars = c("date") ),
aes(x = date, y = value, colour = variable)) +
geom_line()
update - grouped data
library(data.table)
library(ggplot2)
set.seed(123)
# changed the sample data a bit, to get different values for grp = 1 and grp = 2
dt <- data.table(grp = rep(1:2, each = 100),
date = rep(as.Date("2018-01-01") + 1:100, 2),
value = 100 + cumsum(rnorm(200)))
dt[, (paste0("rm", windows)) := lapply(windows, function(x) frollmean(value, x)), by = "grp"]
ggplot( data = melt(dt, id.vars = c("date", "grp") ),
aes(x = date, y = value, colour = variable)) +
geom_line() +
facet_wrap(~grp, nrow = 1)
In this example I use the AAPL adjusted close price, downloaded with the getSymbols() function from quantmod.
library(quantmod)  # also attaches TTR, which provides SMA()
library(magrittr)  # for %>%
getSymbols("AAPL")
Let's say you want SMAs with the following lengths:
smaLength <- c(30, 35, 40, 46, 53, 61, 70, 81, 93)
Now create the SMAs like so:
lapply(smaLength, function(x) SMA(AAPL$AAPL.Adjusted, x)) %>% do.call(cbind, .) %>% tail()
result:
SMA SMA.1 SMA.2 SMA.3 SMA.4 SMA.5 SMA.6 SMA.7 SMA.8
2019-03-04 167.3703 165.2570 163.3706 162.1362 161.5904 162.9735 164.7770 169.3341 175.4143
2019-03-05 168.0162 165.9396 164.0682 162.5499 161.7934 162.8342 164.6408 168.9595 174.9418
2019-03-06 168.7454 166.6585 164.7488 162.9638 162.0062 162.8110 164.6165 168.6446 174.5135
2019-03-07 169.3866 167.2323 165.3086 163.3320 162.1409 162.7868 164.5661 168.2780 174.0284
2019-03-08 170.0820 167.7646 165.8150 163.6764 162.3807 162.8711 164.5855 167.8407 173.5334
2019-03-11 170.8092 168.4419 166.4589 164.1471 162.8097 163.0354 164.6573 167.4864 173.0806
Define the input, then lapply over the widths to create a rolling mean for each one, merge the results together, and finally plot them.
library(ggplot2)
library(magrittr)
library(zoo)
set.seed(123)
w <- c(1, 5, 10)
zoo(100 * cumsum(rnorm(100)), as.Date("2018-01-01") + 1:100) %>%
lapply(w, rollmeanr, x = .) %>%
do.call("merge", .) %>%
setNames(w) %>%
autoplot(facet = NULL)
I have a data frame with several variables I want to get the means of and a variable I want to group by. Then, I would like to get the proportion of each group's mean to the overall mean.
I have put together the following, but it is clumsy.
How would you go about it using dplyr or data.table? Bonus points for the option to return both the intermediate step (group and overall mean) and the final proportions.
library(tidyverse)
set.seed(1)
Data <- data.frame(
X1 = sample(1:10),
X2 = sample(11:20),
X3 = sample(21:30),
Y = sample(c("yes", "no"), 10, replace = TRUE)
)
# pass the function directly; funs() is deprecated in current dplyr
groupMeans <- Data %>%
group_by(Y) %>%
summarize_all(mean)
overallMeans <- Data %>%
select(-Y) %>%
summarize_all(mean)
index <- sweep(as.matrix(groupMeans[, -1]), MARGIN = 2, as.matrix(overallMeans), FUN = "/")
Here is one more dplyr solution:
index <- as.data.frame(Data %>%
group_by(Y) %>%
summarise_all(mean) %>%
select(-Y) %>%
rbind(Data %>% select(-Y) %>% summarise_all(mean)) %>%
# row 3 holds the overall means; divide every row by it (a lambda replaces the deprecated funs())
mutate_all(~ . / .[3]))[1:2, ]
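Here is one possible dplyr solution that contains everything you want. Note that it computes the overall average as the mean of the group means, which matches the true overall mean only when the groups are the same size (as in the output shown):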
Data %>%
group_by(Y) %>%
summarise(
group_avg_X1 = mean(X1),
group_avg_X2 = mean(X2),
group_avg_X3 = mean(X3)
) %>%
mutate(
overall_avg_X1 = mean(group_avg_X1),
overall_avg_X2 = mean(group_avg_X2),
overall_avg_X3 = mean(group_avg_X3),
proportion_X1 = group_avg_X1 / overall_avg_X1,
proportion_X2 = group_avg_X2 / overall_avg_X2,
proportion_X3 = group_avg_X3 / overall_avg_X3
)
# # A tibble: 2 x 10
# Y group_avg_X1 group_avg_X2 group_avg_X3 overall_avg_X1 overall_avg_X2 overall_avg_X3 proportion_X1
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 no 6.6 14.6 25.8 5.5 15.5 25.5 1.2
# 2 yes 4.4 16.4 25.2 5.5 15.5 25.5 0.8
# # ... with 2 more variables: proportion_X2 <dbl>, proportion_X3 <dbl>
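For groups of unequal size, the overall mean has to come from the raw rows rather than from the group means. A minimal sketch, assuming dplyr 1.0.0 or later for across() and cur_column(); the object names overall_means, group_means and proportions are illustrative, and the data is the sample from the question.

```r
library(dplyr)

set.seed(1)
Data <- data.frame(
  X1 = sample(1:10),
  X2 = sample(11:20),
  X3 = sample(21:30),
  Y = sample(c("yes", "no"), 10, replace = TRUE)
)

# overall means computed from all rows, independent of group sizes
overall_means <- colMeans(Data[c("X1", "X2", "X3")])

# intermediate step: one row of means per group
group_means <- Data %>%
  group_by(Y) %>%
  summarise(across(X1:X3, mean))

# final step: divide each group mean by the overall mean of the same column
proportions <- group_means %>%
  mutate(across(X1:X3, ~ .x / overall_means[[cur_column()]]))
```

Keeping overall_means and group_means as separate objects returns the intermediate results as well as the final proportions, which the question asked for.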
Here's a method with data.table:
#data
library(data.table)
set.seed(1)
dt <- data.table(
x1 = sample(1:10),
x2 = sample(11:20),
x3 = sample(21:30),
y = sample(c("yes", "no"), 10, replace = TRUE)
)
# group means
group_means <- dt[ , lapply(.SD, mean), by=y, .SDcols=1:3]
# overall means
overall_means <- dt[ , lapply(.SD, mean), .SDcols=1:3]
# clunky combination (sorry!)
group_means[ , perc_x1 := x1 / overall_means[[1]] ]
group_means[ , perc_x2 := x2 / overall_means[[2]] ]
group_means[ , perc_x3 := x3 / overall_means[[3]] ]
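The three near-identical assignments could probably be collapsed into one. A minimal sketch on the same sample data: Map() divides each group-mean column by the matching overall mean, and the perc_ prefix is kept from the answer. The cols vector is introduced here purely for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(
  x1 = sample(1:10),
  x2 = sample(11:20),
  x3 = sample(21:30),
  y = sample(c("yes", "no"), 10, replace = TRUE)
)

cols <- c("x1", "x2", "x3")

# group means and overall means, as in the answer
group_means <- dt[, lapply(.SD, mean), by = y, .SDcols = cols]
overall_means <- dt[, lapply(.SD, mean), .SDcols = cols]

# divide each group-mean column by the overall mean of the same column
group_means[, (paste0("perc_", cols)) := Map(`/`, .SD, overall_means), .SDcols = cols]
group_means
```

Adding another variable then only requires extending cols.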
The following code provides example data:
library(TTR)
set.seed(15)
r <- rnorm(1000, 0, .01)
P_1 <- 100
P <- P_1*cumprod(1+r)
zz <- ZigZag(P, change = 5, percent = TRUE)
set.seed(15)
volume <- round(runif(1000, 50, 550), digits = 0)
data <- as.data.frame(cbind(P, zz, volume))
plot(P, type = "l")
lines(zz, col = "red")
In the end I would like to create a cumulative sum of volume in a new column, which resets whenever the zigzag line (zz) changes direction. I have tried playing with s <- sign(diff(data$zz, lag = 1)), which shows those turning points, but I haven't been able to use cumsum with it.
Here is a solution that uses dplyr:
library(dplyr)
data %>%
mutate(
zz_up = (zz - lag(zz) > 0),
zz_switch = zz_up != lag(zz_up),
zz_switch = ifelse(is.na(zz_switch), FALSE, zz_switch),
group = cumsum(zz_switch)
) %>%
group_by(group) %>%
mutate(cum_volume = cumsum(volume))
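The sign(diff()) idea from the question can also be made to work in base R. A minimal sketch on a made-up zigzag (the zz and volume vectors below are toy values, not the simulated data from the question): a new group starts at the first row after each change of direction, and ave() restarts cumsum() within each group. It assumes zz has no flat stretches, since a zero difference would also register as a change.

```r
# toy zigzag: up, up, down, down, up, up, up
zz <- c(1, 2, 3, 2, 1, 2, 3, 4)
volume <- c(10, 20, 30, 40, 50, 60, 70, 80)

# direction of each step (+1 up, -1 down); one element shorter than zz
s <- sign(diff(zz))

# TRUE on the first row of each new leg; the first two rows can never be a switch
turn <- c(FALSE, FALSE, diff(s) != 0)

# running group id that increases at every change of direction
group <- cumsum(turn)

# cumulative volume, restarted within each group
cum_volume <- ave(volume, group, FUN = cumsum)
data.frame(zz, volume, group, cum_volume)
```

This follows the same convention as the dplyr answer: the row at the peak or trough still belongs to the leg that led up to it.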
Attempt with RcppRoll:
Code
library(magrittr)
library(dplyr)
library(RcppRoll)
data %<>%
# Create difference for ZigZag
mutate(diffZZ = c(0,diff(zz))) %>%
# Use it as a group
group_by(diffZZ) %>%
# Use RcppRoll to compute that sum
mutate(sumVolByDiff = roll_sum(x = volume, n = 2, fill = NA)) %>%
# Clean / not important
ungroup()
Preview
> head(data)
Source: local data frame [6 x 5]
P zz volume diffZZ sumVolByDiff
(dbl) (dbl) (dbl) (dbl) (dbl)
1 100.2588 100.2588 351 0.000000 NA
2 102.0947 100.5596 148 0.300785 523
3 101.7480 100.8604 533 0.300785 1077
4 102.6608 101.1612 375 0.300785 609
5 103.1618 101.4620 234 0.300785 692
6 101.8668 101.7627 544 0.300785 938