data <- structure(list(A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5,
36.7, 44.3, 46.4), E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4,
4.4, 10.6, 16.5), A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4,
21.6, 31.1, 36.2), E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3,
84.7, 71.5, 58.1, 48.7)), row.names = c(NA, -10L), class = "data.frame")
data
#> A_w E_w A_e E_e
#> 1 0.00 1.2 0.00 99.4
#> 2 0.69 1.2 0.18 99.3
#> 3 1.41 1.5 0.37 98.9
#> 4 2.89 1.6 0.79 98.4
#> 5 6.42 1.9 1.93 97.1
#> 6 13.30 2.3 4.82 93.3
#> 7 25.50 3.4 11.40 84.7
#> 8 36.70 4.4 21.60 71.5
#> 9 44.30 10.6 31.10 58.1
#> 10 46.40 16.5 36.20 48.7
Created on 2021-05-31 by the reprex package (v2.0.0)
I am trying to plot this data with all A values on the x-axis and all E values on the y-axis. How can I either a) plot both pairs of columns on one ggplot2 plot, or b) rearrange this data frame so that the A columns and the E columns are each combined, giving a final data frame with only two columns and twice as many rows as shown?
Thanks for any help, I am a beginner (obviously)
Edit for Clarity: It's important that the A_e & E_e values remain as pairs, similar to how the A_w and E_w values remain as pairs. The end result plot should resemble the ORANGE and BLUE lines of this image, but I am trying to replicate this while learning R.
Currently I can plot each pair separately by splitting the data into two data frames of 10 rows by 2 columns:
A_w E_w
1 0.00 1.2
2 0.69 1.2
3 1.41 1.5
4 2.89 1.6
5 6.42 1.9
6 13.30 2.3
7 25.50 3.4
8 36.70 4.4
9 44.30 10.6
10 46.40 16.5
and the second plot
# A tibble: 10 x 2
A_e E_e
<dbl> <dbl>
1 0 99.4
2 0.18 99.3
3 0.37 98.9
4 0.79 98.4
5 1.93 97.1
6 4.82 93.3
7 11.4 84.7
8 21.6 71.5
9 31.1 58.1
10 36.2 48.7
But my end goal is to have them both on the same plot, like in the Excel graph (orange + blue graph) above.
Here is a try
library(dplyr)
library(ggplot2)
line_1_data <- data %>%
select(A_w, E_w) %>%
mutate(xend = lead(A_w), yend = lead(E_w)) %>%
filter(!is.na(xend))
line_2_data <- data %>%
select(A_e, E_e) %>%
mutate(xend = lead(A_e), yend = lead(E_e)) %>%
filter(!is.na(xend))
# Multiple columns, each drawn with its own geom
ggplot(data = data) +
# The blue line
geom_point(aes(x = A_w, y = E_w), color = "blue") +
geom_curve(data = line_1_data, aes(x = A_w, y = E_w, xend = xend,
yend = yend), color = "blue",
curvature = 0.02) +
# The orange line
geom_point(aes(x = A_e, y = E_e), color = "orange") +
geom_curve(data = line_2_data,
aes(x = A_e, y = E_e, xend = xend, yend = yend), color = "orange",
curvature = -0.02) +
# The red connection between the two lines
geom_curve(data = tail(data, 1),
aes(x = A_w, y = E_w, xend = A_e, yend = E_e), curvature = 0.1,
color = "red") +
# The black straight line between each pair
geom_curve(
aes(x = A_w, y = E_w, xend = A_e, yend = E_e), curvature = 0,
color = "black")
Created on 2021-05-31 by the reprex package (v2.0.0)
You may try this
data <- data.frame(
A_w = c(0,0.69,1.41,2.89,6.42,
13.3,25.5,36.7,44.3,46.4),
E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4, 4.4, 10.6, 16.5),
A_e = c(0,0.18,0.37,0.79,1.93,
4.82,11.4,21.6,31.1,36.2),
E_e = c(99.4,99.3,98.9,98.4,
97.1,93.3,84.7,71.5,58.1,48.7)
)
library(tidyverse)
data %>% pivot_longer(everything(), names_sep = '_', names_to = c('.value', 'type')) %>%
ggplot(aes(x = A, y = E, color = type)) +
geom_point() +
geom_line()
Created on 2021-05-31 by the reprex package (v2.0.0)
Doing it "by hand":
# dummy data:
df = data.frame(A_w=rnorm(10), E_w=rnorm(10), A_e=rnorm(10), E_e=rnorm(10))
# stack the w pairs on top of the e pairs; the E column must take E_e (not A_e) for rows 11-20
df2 = data.frame(A=c(df$A_w, df$A_e), E=c(df$E_w, df$E_e))
Output:
> df2
A E
1 1.25522468 -0.2441768
2 -0.50585191 -0.1383637
3 0.42374270 -0.9664189
4 -0.39858532 -0.3442157
5 -1.05665363 -1.3574362
6 0.79191788 -0.8202841
7 -1.31349592 0.7280619
8 -0.05609851 0.6365495
9 1.01068811 2.0222241
10 -1.15572972 -0.2190794
# rows 11 to 20 hold the (A_e, E_e) pairs; the values depend on the random draw
And for the plot: ggplot(df2, aes(x=A, y=E)) + geom_point()
There are ways to do this without having to join the columns by listing their names (with the tidyr package), but I think this solution is easier to understand from a beginner's point of view.
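If the two series need to stay distinguishable after stacking (for example, to colour them separately), a label column can be added at the same time. A minimal base R sketch using the values from the question; the names stacked and type are my own choices for illustration:

```r
# Values copied from the question
data <- data.frame(
  A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5, 36.7, 44.3, 46.4),
  E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4, 4.4, 10.6, 16.5),
  A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4, 21.6, 31.1, 36.2),
  E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3, 84.7, 71.5, 58.1, 48.7)
)

# Stack the (A_w, E_w) pairs on top of the (A_e, E_e) pairs and
# record which series each row came from (hypothetical column "type")
stacked <- rbind(
  data.frame(A = data$A_w, E = data$E_w, type = "w"),
  data.frame(A = data$A_e, E = data$E_e, type = "e")
)

dim(stacked) # 20 rows, 3 columns
```

With ggplot2 loaded, ggplot(stacked, aes(x = A, y = E, colour = type)) + geom_point() + geom_line() should then draw both series on one plot.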
I want to compute the mean exposure to ozone from a dataset like the example below. The mean should cover the ozone values from the year of birth to age 5. Is there a simple way to do this in R?
final = data.frame(ID = c(1, 2, 3, 4, 5, 6),
Zone = c("A", "B", "C", "D", "A", "B"),
dob = c(1993, 1997, 1994, 2001, 1999, 1993),
Ozone_1993 = c(0.12, 0.01, 0.36, 0.78, 0.12, 0.01),
Ozone_1994 = c(0.75, 0.23, 0.14, 0.98, 0.75, 0.23),
Ozone_1995 = c(1.38, 0.45, -0.08, 1.18, 1.38, 0.45),
Ozone_1996 = c(2.01, 0.67, -0.3, 1.38, 2.01, 0.67),
Ozone_1997 = c(2.64, 0.89, -0.52, 1.58, 2.64, 0.89),
Ozone_1998 = c(3.27, 1.11, -0.74, 1.78, 3.27, 1.11),
Ozone_1999 = c(3.9, 1.33, -0.96, 1.98, 3.9, 1.33),
Ozone_2000 = c(4.53, 1.55, -1.18, 2.18, 4.53, 1.55),
Ozone_2001 = c(5.16, 1.77, -1.4, 2.38, 5.16, 1.77),
Ozone_2002 = c(5.79, 1.99, -1.62, 2.58, 5.79, 1.99),
Ozone_2003 = c(6.42, 2.21, -1.84, 2.78, 6.42, 2.21),
Ozone_2004 = c(7.05, 2.43, -2.06, 2.98, 7.05, 2.43),
mean_under5_ozone = c(0.85, 1.33, -0.3, 2.68, 5.16, 0.45))
where the column (variable) mean_under5_ozone is the mean ozone exposure from birth year to age 5 or less, e.g. mean_under5_ozone for ID 1 is the row mean from Ozone_1993 to Ozone_1997.
From a novice,
Here is one way to do it with for loops. (It's not very elegant, but it avoids getting into too much detail of dplyr and rlang syntax.)
First, loop over birth years (dob_yr below) to define a column containing the variable names to use for the custom mean (use_vars below).
Then loop over rows and, for each row, extract the relevant variables using this new column (use_vars) and calculate the custom mean.
library(dplyr)
df <- tibble(id=1:5)
df$zone <- c(rep('A', 5))
df$dob_yr <- c(1991:1995)
for (yr in 1991:1995) {
df[[paste('x_',yr,sep='')]] <- c(abs(rnorm(5)))
}
df # check mock data
add_use_vars <- function(df, dob_yr_varname='dob_yr', prefix='x_', yr_within=3) {
vars <- names(df %>% select(starts_with(prefix)))
vars_yr <- as.integer(sub(prefix, '', vars))
df$use_vars <- NA
for (i in seq_along(df[[dob_yr_varname]])) {
yr <- df[[dob_yr_varname]][i]
idx <- (vars_yr <= yr + yr_within) & (vars_yr >= yr)
df$use_vars[i] <- list(vars[idx]) # list with one element
}
return(df)
}
df <- add_use_vars(df)
df$use_vars[1][[1]] # see the first row in use_vars
custom_mean <- function(df, varname_varlist='use_vars') {
df$custom_mean <- NA
for (i in seq_along(df[[varname_varlist]])) {
vars = df[[varname_varlist]][i][[1]] # extract first element in list
df$custom_mean[i] <- mean(as.numeric(df[i, vars]))
}
return(df)
}
df <- custom_mean(df)
df # see results
Note that for this mock data, for each row, I am averaging over the columns containing value of 0 to 3 years from the birth year.
(Complete rewrite.)
I don't think I understand what mean_under5_ozone means, since I can't reproduce your numbers. For instance, for ID==1, born in 1993, that means we want data from 1993 through 1998 (to include age 5) or 1997 (up to but not including), but neither of those averages is 0.85:
mean(unlist(final[1, 4:9]))
# [1] 1.695
mean(unlist(final[1, 4:8]))
# [1] 1.38
Ignoring this, I'll give you what I think are the correct answers with your final data.
tidyverse
library(dplyr)
library(tidyr) # pivot_longer
final <- select(final, -mean_under5_ozone)
final %>%
pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
mutate(year = as.integer(year)) %>%
group_by(ID) %>%
summarize(mean_under5_ozone = mean(value[ between(year, dob, dob + 5) ]), .groups = "drop")
# # A tibble: 6 x 2
# ID mean_under5_ozone
# <dbl> <dbl>
# 1 1 1.70
# 2 2 1.44
# 3 3 -0.41
# 4 4 2.68
# 5 5 5.48
# 6 6 0.56
data.table
library(data.table)
library(magrittr) # %>%, not required but used for improved readability
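finalDT <- as.data.table(final) # finalDT is not defined above; convert the data frame first
finalDT[, mean_under5_ozone := NULL] # drop the asker's column (skip this line if it was already removed)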
melt(finalDT, 1:3) %>%
.[, year := as.integer(gsub("[^0-9]", "", variable))] %>%
.[ year >= dob, ] %>%
.[, .(mean_under5_ozone = mean(value[ between(year, dob, dob + 5) ])), by = .(ID)] %>%
.[order(ID),]
# ID mean_under5_ozone
# 1: 1 1.695
# 2: 2 1.440
# 3: 3 -0.410
# 4: 4 2.680
# 5: 5 5.475
# 6: 6 0.560
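For completeness, the same window (birth year through birth year + 5, inclusive) can also be sketched in base R. This is only an illustration under that interpretation of "age 5"; ozone_cols and ozone_years are helper names I made up, and the data is the asker's final without the mean_under5_ozone column:

```r
final <- data.frame(
  ID = 1:6,
  Zone = c("A", "B", "C", "D", "A", "B"),
  dob = c(1993, 1997, 1994, 2001, 1999, 1993),
  Ozone_1993 = c(0.12, 0.01, 0.36, 0.78, 0.12, 0.01),
  Ozone_1994 = c(0.75, 0.23, 0.14, 0.98, 0.75, 0.23),
  Ozone_1995 = c(1.38, 0.45, -0.08, 1.18, 1.38, 0.45),
  Ozone_1996 = c(2.01, 0.67, -0.3, 1.38, 2.01, 0.67),
  Ozone_1997 = c(2.64, 0.89, -0.52, 1.58, 2.64, 0.89),
  Ozone_1998 = c(3.27, 1.11, -0.74, 1.78, 3.27, 1.11),
  Ozone_1999 = c(3.9, 1.33, -0.96, 1.98, 3.9, 1.33),
  Ozone_2000 = c(4.53, 1.55, -1.18, 2.18, 4.53, 1.55),
  Ozone_2001 = c(5.16, 1.77, -1.4, 2.38, 5.16, 1.77),
  Ozone_2002 = c(5.79, 1.99, -1.62, 2.58, 5.79, 1.99),
  Ozone_2003 = c(6.42, 2.21, -1.84, 2.78, 6.42, 2.21),
  Ozone_2004 = c(7.05, 2.43, -2.06, 2.98, 7.05, 2.43)
)

# Column names holding ozone values, and the year encoded in each name
ozone_cols <- grep("^Ozone_", names(final), value = TRUE)
ozone_years <- as.integer(sub("^Ozone_", "", ozone_cols))

# For each row, average only the columns whose year lies in [dob, dob + 5]
final$mean_under5_ozone <- sapply(seq_len(nrow(final)), function(i) {
  keep <- ozone_years >= final$dob[i] & ozone_years <= final$dob[i] + 5
  mean(unlist(final[i, ozone_cols[keep]]))
})

final[, c("ID", "dob", "mean_under5_ozone")]
```

Rows whose window runs past the last available year (ID 4 here) are averaged over the years that exist, which matches the behaviour of the between() filter above.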
A few thoughts, using random data.
set.seed(42)
dat <- data.frame(dob = sample(1990:2020, size=1000, replace=TRUE), Ozone_1993=runif(1000), Ozone_1994=runif(1000), Ozone_1995=runif(1000))
head(dat)
# dob Ozone_1993 Ozone_1994 Ozone_1995
# 1 2006 0.37383448 0.68624969 0.1681480
# 2 1994 0.46496563 0.29309851 0.8198724
# 3 1990 0.04660819 0.41994895 0.7501070
# 4 2014 0.98751620 0.73526105 0.2899959
# 5 1999 0.90845233 0.84982125 0.1798130
# 6 1993 0.97939015 0.07746459 0.6172919
tidyverse
library(dplyr)
dat %>%
filter(dob >= 2015) %>%
summarize_at(vars(starts_with("Ozone")), mean)
# Ozone_1993 Ozone_1994 Ozone_1995
# 1 0.5242029 0.4852803 0.4864364
That is the average per year. If you instead need a single statistic, then
# library(tidyr) # pivot_longer
dat %>%
filter(dob >= 2015) %>%
tidyr::pivot_longer(starts_with("Ozone")) %>%
summarize(value = mean(value))
# # A tibble: 1 x 1
# value
# <dbl>
# 1 0.499
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[ dob >= 2015, ][, lapply(.SD, mean), .SDcols = patterns("^Ozone")]
# Ozone_1993 Ozone_1994 Ozone_1995
# 1: 0.5242029 0.4852803 0.4864364
melt(datDT[ dob >= 2015, ], "dob")[, .(value = mean(value))]
# value
# 1: 0.4986398
Base R
apply(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995), 2, mean)
# Ozone_1993 Ozone_1994 Ozone_1995
# 0.5242029 0.4852803 0.4864364
mean(unlist(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995)))
# [1] 0.4986398
I am having all sorts of trouble trying to create a loop, or using lapply to generate multiple plots from one data frame.
df
target A.O2 A.H2O A.conc A.bias B.O2 B.H2O B.conc B.bias C.O2 C.H2O C.conc C.bias
1 85 20.90 0.06 254.96 0.01 20.90 0.06 255.02 0.03 20.90 0.06 254.98 0.01
2 50 20.90 0.09 150.09 0.09 20.90 0.09 150.06 0.08 20.90 0.09 150.00 0.03
3 25 20.94 0.09 75.24 0.31 20.94 0.09 75.47 0.62 20.94 0.09 74.98 -0.04
4 85 10.00 0.08 251.99 -1.22 10.00 0.08 252.02 -1.21 10.00 0.08 252.01 -1.21
5 50 10.00 0.09 148.51 -1.06 10.00 0.09 148.52 -1.05 10.00 0.09 148.50 -1.06
6 25 10.00 0.07 74.00 -1.27 10.00 0.07 74.03 -1.24 10.00 0.07 74.03 -1.24
7 85 0.10 0.06 246.99 -3.13 0.10 0.06 247.01 -3.13 0.10 0.06 247.00 -3.13
8 50 0.10 0.14 146.50 -2.39 0.10 0.14 146.50 -2.39 0.10 0.14 146.45 -2.42
9 25 0.10 0.10 72.97 -2.55 0.10 0.10 73.04 -2.45 0.10 0.10 73.04 -2.44
I want to create plots where X = O2 (A.O2, B.O2, C.O2) and Y = bias (A.bias, B.bias, C.bias) and the points are grouped based off the values in the target column.
library(ggplot2)
library(ggrepel)
ggplot(df, aes(A.O2, A.bias)) +
theme_bw() +
theme(legend.position = 'bottom', plot.title = element_text(hjust=0.5)) +
geom_point(aes(colour = factor(target))) +
geom_line(aes(colour = factor(target))) +
geom_text_repel(aes(label=paste(A.bias),
hjust= 0.4,
vjust=-.8, colour = factor(target)),
size = 3) +
ggtitle('A') +
labs(
x = expression('O'[2]),
y = "bias",
colour = 'conc'
)
I want to repeat the same code where the only thing that changes is the X and Y values in aes() and ggtitle(). I have tried looking up similar posts for using for loops or lapply to do this but nothing seems to work.
Probably the easiest approach is to reshape the data to long format and use facet_grid. This is easy using reshape when we switch suffix and prefix of column names.
names(df) <- sapply(lapply(strsplit(names(df), "\\."), rev), paste, collapse=".")
dfl <- reshape(df, varying=2:13, direction="long")
library(ggplot2)
library(ggrepel)
ggplot(dfl, aes(O2, bias)) +
theme_bw() +
theme(legend.position = 'bottom', plot.title = element_text(hjust=0.5)) +
geom_point(aes(colour = factor(target))) +
geom_line(aes(colour = factor(target)))+
geom_text_repel(aes(label=paste(bias),
hjust= 0.4,
vjust=-.8, colour = factor(target)),
size = 3) +
facet_grid("time") +
# ggtitle(z) + ## not needed
labs(
x = expression('O'[2]),
y = "bias",
colour = 'conc'
)
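To see what the renaming and reshape steps do before any plotting, here is a small sketch on a cut-down data frame. The values are made up (loosely modelled on the question), and small and long are hypothetical names:

```r
small <- data.frame(
  target = c(85, 50, 25),
  A.O2 = c(20.9, 10, 0.1), A.bias = c(0.01, -1.06, -2.55),
  B.O2 = c(20.9, 10, 0.1), B.bias = c(0.03, -1.05, -2.45)
)

# Turn "A.O2" into "O2.A" so that reshape() can split each name at the dot
# into a variable part (O2, bias) and a time part (A, B)
names(small) <- sapply(lapply(strsplit(names(small), "\\."), rev),
                       paste, collapse = ".")
names(small)

# Long format: one row per original row and instrument,
# with the instrument letter in the new column "time"
long <- reshape(small, varying = 2:5, direction = "long")
long
```

The time column created by reshape is what facet_grid uses above to draw one panel per instrument.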
Or if you want three single plots you may put the code into a function together with ggsave to use in an lapply loop.
FUN <- function(x) {
ggplot(dfl[dfl$time == x, ], aes(O2, bias)) +
theme_bw() +
theme(legend.position = 'bottom', plot.title = element_text(hjust=0.5)) +
geom_point(aes(colour = factor(target))) +
geom_line(aes(colour = factor(target)))+
geom_text_repel(aes(label=paste(bias),
hjust= 0.4,
vjust=-.8, colour = factor(target)),
size = 3) +
# facet_grid("time") + ## not needed
ggtitle(x) +
labs(
x = expression('O'[2]),
y = "bias",
colour = 'conc'
)
ggsave(paste0("plot", x, ".png"))
}
times <- c("A", "B", "C")
lapply(times, FUN)
This saves the three plots in your working directory:
dir()
# [1] plotA.png
# [2] plotB.png
# [3] plotC.png
Data:
df <- structure(list(target = c(85L, 50L, 25L, 85L, 50L, 25L, 85L,
50L, 25L), A.O2 = c(20.9, 20.9, 20.94, 10, 10, 10, 0.1, 0.1,
0.1), A.H2O = c(0.06, 0.09, 0.09, 0.08, 0.09, 0.07, 0.06, 0.14,
0.1), A.conc = c(254.96, 150.09, 75.24, 251.99, 148.51, 74, 246.99,
146.5, 72.97), A.bias = c(0.01, 0.09, 0.31, -1.22, -1.06, -1.27,
-3.13, -2.39, -2.55), B.O2 = c(20.9, 20.9, 20.94, 10, 10, 10,
0.1, 0.1, 0.1), B.H2O = c(0.06, 0.09, 0.09, 0.08, 0.09, 0.07,
0.06, 0.14, 0.1), B.conc = c(255.02, 150.06, 75.47, 252.02, 148.52,
74.03, 247.01, 146.5, 73.04), B.bias = c(0.03, 0.08, 0.62, -1.21,
-1.05, -1.24, -3.13, -2.39, -2.45), C.O2 = c(20.9, 20.9, 20.94,
10, 10, 10, 0.1, 0.1, 0.1), C.H2O = c(0.06, 0.09, 0.09, 0.08,
0.09, 0.07, 0.06, 0.14, 0.1), C.conc = c(254.98, 150, 74.98,
252.01, 148.5, 74.03, 247, 146.45, 73.04), C.bias = c(0.01, 0.03,
-0.04, -1.21, -1.06, -1.24, -3.13, -2.42, -2.44)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
I have 2 data sets - one is quarterly which I need to match to monthly data. So the values from the quarterly data will be repeated thrice in the final data set. I have created a one quarter sample below but this would need to be repeated for many quarters.
month <- c("1/20", "2/20", "3/20")
rating <- c(0.5,0.6,0.65)
df1 <- data.frame(month, rating)
quarter <- c("q1/20")
amount <- c(100)
df2 <- data.frame(quarter, amount)
My final data set should have the following structure
month <- c("1/20", "2/20", "3/20")
rating <- c(0.5,0.6,0.65)
quarter <- c("q1/20", "q1/20", "q1/20")
amount <- c(100,100,100)
df3 <- data.frame(month, rating, quarter, amount)
In the full quarterly data set (df2), some observations are also monthly, so it might be a case of matching the monthly observations by month and the quarterly observations by quarter?
Thanks in anticipation.
Assuming you have this data.
head(m.dat)
# month rating
# 1 1/18 0.91
# 2 2/18 0.94
# 3 3/18 0.29
# 4 4/18 0.83
# 5 5/18 0.64
# 6 6/18 0.52
head(q.dat)
# quarter amount
# 1 q1/18 1
# 2 q2/18 21
# 3 q3/18 91
# 4 q4/18 61
# 5 q1/19 38
# 6 q2/19 44
You could match month information to quarters using an assignment matrix qm.
qm <- matrix(c(1:12, paste0("q", rep(1:4, each=3))), 12, 2)
# look up each month number in column 1 of qm, then take the quarter label from column 2
m.dat$quarter <- paste0(qm[match(gsub("(^\\d*).*", "\\1", m.dat$month), qm[, 1]), 2],
"/",
sapply(strsplit(m.dat$month, "/"), `[`, 2))
This enables you to use merge.
res <- merge(m.dat, q.dat, all=TRUE)
head(res)
# quarter month rating amount
# 1 q1/18 1/18 0.91 1
# 2 q1/18 2/18 0.94 1
# 3 q1/18 3/18 0.29 1
# 4 q1/19 1/19 0.93 38
# 5 q1/19 2/19 0.26 38
# 6 q1/19 3/19 0.46 38
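The month-to-quarter lookup can also be done with integer arithmetic instead of an assignment matrix. A minimal sketch, assuming the same "m/yy" and "qN/yy" string formats; the small data frames and their values are invented for illustration:

```r
m.dat <- data.frame(month = c("1/20", "2/20", "3/20", "4/20", "12/19"),
                    rating = c(0.5, 0.6, 0.65, 0.7, 0.8),
                    stringsAsFactors = FALSE)
q.dat <- data.frame(quarter = c("q1/20", "q2/20", "q4/19"),
                    amount = c(100, 200, 300),
                    stringsAsFactors = FALSE)

# Split "m/yy" into the month number and the two-digit year
parts <- strsplit(m.dat$month, "/")
m <- as.integer(sapply(parts, `[`, 1))
y <- sapply(parts, `[`, 2)

# Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on
m.dat$quarter <- paste0("q", (m - 1) %/% 3 + 1, "/", y)

# Each quarterly amount is repeated for every month in that quarter
res <- merge(m.dat, q.dat, by = "quarter", all.x = TRUE)
res
```

Using all.x = TRUE keeps monthly rows even when no quarterly amount exists for them; switch to all = TRUE if unmatched quarters should be kept as well.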
Toy data
m.dat <- structure(list(month = c("1/18", "2/18", "3/18", "4/18", "5/18",
"6/18", "7/18", "8/18", "9/18", "10/18", "11/18", "12/18", "1/19",
"2/19", "3/19", "4/19", "5/19", "6/19", "7/19", "8/19", "9/19",
"10/19", "11/19", "12/19", "1/20", "2/20", "3/20", "4/20", "5/20",
"6/20", "7/20", "8/20", "9/20", "10/20", "11/20", "12/20"), rating = c(0.91,
0.94, 0.29, 0.83, 0.64, 0.52, 0.74, 0.13, 0.66, 0.71, 0.46, 0.72,
0.93, 0.26, 0.46, 0.94, 0.98, 0.12, 0.47, 0.56, 0.9, 0.14, 0.99,
0.95, 0.08, 0.51, 0.39, 0.91, 0.45, 0.84, 0.74, 0.81, 0.39, 0.69,
0, 0.83)), class = "data.frame", row.names = c(NA, -36L))
q.dat <- structure(list(quarter = c("q1/18", "q2/18", "q3/18", "q4/18",
"q1/19", "q2/19", "q3/19", "q4/19", "q1/20", "q2/20", "q3/20",
"q4/20"), amount = c(1, 21, 91, 61, 38, 44, 4, 97, 43, 96, 89,
64)), class = "data.frame", row.names = c(NA, -12L))
Assuming that df1 and df2 are the data frames shown in the Note at the end, create a yq column of class yearqtr in each and merge on that:
library(zoo)
df1 <- transform(df1, yq = as.yearqtr(month, "%m/%y"))
df2 <- transform(df2, yq = as.yearqtr(quarter, "q%q/%y"))
merge(df1, df2, by = "yq", all = TRUE)
giving:
yq month rating quarter amount
1 2020 Q1 1/20 0.50 q1/20 100
2 2020 Q1 2/20 0.60 q1/20 100
3 2020 Q1 3/20 0.65 q1/20 100
We could also consider converting the month column into a yearmon class column using as.yearmon.
Note
df1 <- data.frame(month = c("1/20", "2/20", "3/20"), rating = c(0.5,0.6,0.65))
df2 <- data.frame(quarter = "q1/20", amount = 100)
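As a rough sketch of the as.yearmon idea mentioned above (assuming the zoo package is installed; the column names ym and yq are only illustrative), the monthly key can be built first and the quarter derived from it:

```r
library(zoo)

df1 <- data.frame(month = c("1/20", "2/20", "3/20"),
                  rating = c(0.5, 0.6, 0.65),
                  stringsAsFactors = FALSE)

# Month-level key of class yearmon, parsed from the "m/yy" strings
df1$ym <- as.yearmon(df1$month, "%m/%y")

# Quarter-level key derived from the month key
df1$yq <- as.yearqtr(df1$ym)

df1
```

Keeping both keys would allow monthly observations to be matched on ym and quarterly ones on yq.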
How do I apply a function to many columns of grouped rows? For example:
library(tidyverse)
data <- tribble(
~Date, ~Seq1, ~Component, ~Seq2, ~X1, ~X2, ~X3,
"01/01/18", 1, "Smooth", NA, 3.98, 2.75, 1.82,
"01/01/18", 2, "Smooth", NA, 1.02, 0.02, -0.04,
"01/01/18", 3, "Smooth", NA, 3.48, 3.06, 1.25,
"01/01/18", 3, "Bounce", 1, 2.01, -0.43, -0.52,
"01/01/18", 3, "Bounce", 2, 1.94, 1.53, 1.92) %>%
mutate_at(vars(Date, Seq1, Component, Seq2), funs(factor))
Each column of X values (many more columns, truncated here for clarity) is grouped by Date, Seq1, Component, and Seq2. While Component "Smooth" and Seq2 "NA" are constant, within the Component "Bounce" level there are multiple Seq2 levels, e.g. "1", "2", etc.
How do I sum each X column, always adding the constant "NA" (Smooth) row to each level of Seq2?
The desired results is:
expected <- tribble(
~Date, ~Seq1, ~Component, ~Seq2, ~X1, ~X2, ~X3,
"01/01/18", 1, "Smooth", NA, 3.98, 2.75, 1.82,
"01/01/18", 2, "Smooth", NA, 1.02, 0.02, -0.04,
"01/01/18", 3, "Smooth", NA, 3.48, 3.06, 1.25,
"01/01/18", 3, "Bounce", 1, 5.49, 2.63, 0.73,
"01/01/18", 3, "Bounce", 2, 5.42, 4.59, 3.17)
The following example only adds each Seq1 level.
data %>%
group_by(Date, Seq1) %>%
mutate_at(vars(starts_with("X")), funs(sum(.)))
#> # A tibble: 5 x 7
#> # Groups: Date, Seq1 [3]
#> Date Seq1 Component Seq2 X1 X2 X3
#> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 01/01/18 1 Smooth <NA> 3.98 2.75 1.82
#> 2 01/01/18 2 Smooth <NA> 1.02 0.02 -0.04
#> 3 01/01/18 3 Smooth <NA> 7.43 4.16 2.65
#> 4 01/01/18 3 Bounce 1 7.43 4.16 2.65
#> 5 01/01/18 3 Bounce 2 7.43 4.16 2.65
I am certain there is solution within the purrr or apply function family, however, I have been unsuccessful (for days) in solving this example. The actual data has about 180 X columns, with hundreds of Date and Seq1 combinations, and multiple Seq2 levels.
A similar example could be "Summing Multiple Groups of Columns", "How to apply a function to a subset of columns in r?", or perhaps even https://github.com/jennybc/row-oriented-workflows.
Created on 2018-10-23 by the reprex package (v0.2.1)
Here's my solution. This problem is not really a purrr task, because there is nothing really that you want to map a single function to. Instead, what I understand the problem to be is that you want to match each X value in a Bounce row with the corresponding Smooth row X values of the same Date and Seq1 (and there is only one such row). This means that it is really a merging or joining problem, and then the approach is to set up the join so that you can match the right values and do the sum. So I go as follows:
Split the data into the Smooth rows and the Bounce rows and gather so that all the X values are in one column
Join the smooths onto the bounces with a left_join, so each original Bounce row now has its corresponding Smooth.
mutate the sum into a new column and select/rename the columns to be as in the original
bind_rows to join the newly summed bounces and spread to return to the original layout.
This should be robust to any number of Date, Seq1, Seq2 and X values.
library(tidyverse)
data <- tribble(
~Date, ~Seq1, ~Component, ~Seq2, ~X1, ~X2, ~X3,
"01/01/18", 1, "Smooth", NA, 3.98, 2.75, 1.82,
"01/01/18", 2, "Smooth", NA, 1.02, 0.02, -0.04,
"01/01/18", 3, "Smooth", NA, 3.48, 3.06, 1.25,
"01/01/18", 3, "Bounce", 1, 2.01, -0.43, -0.52,
"01/01/18", 3, "Bounce", 2, 1.94, 1.53, 1.92)
smooths <- data %>%
filter(Component == "Smooth") %>%
gather(X, val, starts_with("X"))
bounces <- data %>%
filter(Component == "Bounce") %>%
gather(X, val, starts_with("X")) %>%
left_join(smooths, by = c("Date", "Seq1", "X")) %>%
mutate(val = val.x + val.y) %>%
select(Date, Seq1, Component = Component.x, Seq2 = Seq2.x, X, val)
bounces %>%
bind_rows(smooths) %>%
spread(X, val)
#> # A tibble: 5 x 7
#> Date Seq1 Component Seq2 X1 X2 X3
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01/01/18 1 Smooth NA 3.98 2.75 1.82
#> 2 01/01/18 2 Smooth NA 1.02 0.02 -0.04
#> 3 01/01/18 3 Bounce 1 5.49 2.63 0.73
#> 4 01/01/18 3 Bounce 2 5.42 4.59 3.17
#> 5 01/01/18 3 Smooth NA 3.48 3.06 1.25
Created on 2018-10-31 by the reprex package (v0.2.1)
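The same matching idea can be sketched in base R without reshaping, which also keeps the rows in their original order. This is only an illustration, assuming exactly one Smooth row per Date and Seq1 combination; key, idx and is_bounce are made-up helper names:

```r
data <- data.frame(
  Date = "01/01/18",
  Seq1 = c(1, 2, 3, 3, 3),
  Component = c("Smooth", "Smooth", "Smooth", "Bounce", "Bounce"),
  Seq2 = c(NA, NA, NA, 1, 2),
  X1 = c(3.98, 1.02, 3.48, 2.01, 1.94),
  X2 = c(2.75, 0.02, 3.06, -0.43, 1.53),
  X3 = c(1.82, -0.04, 1.25, -0.52, 1.92),
  stringsAsFactors = FALSE
)

x_cols <- grep("^X", names(data), value = TRUE)
smooth <- data[data$Component == "Smooth", ]

# For every row, find the Smooth row that shares its Date and Seq1
key <- function(d) paste(d$Date, d$Seq1)
idx <- match(key(data), key(smooth))

# Add the matching Smooth values to the Bounce rows only
is_bounce <- data$Component == "Bounce"
result <- data
result[is_bounce, x_cols] <-
  as.matrix(data[is_bounce, x_cols]) + as.matrix(smooth[idx[is_bounce], x_cols])

result
```

Because the X columns are picked up by pattern, this should extend to the roughly 180 columns mentioned in the question without listing them by name.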