R split by group and create new columns

R split by group and create new columns - r

I'm trying to split a data frame from long to wide format by converting selected rows to columns. Here is the current general long-format structure:
data_long <- data.frame(
id = c("kelp","kelp","fish","fish","beach","beach","kelp","kelp","fish","fish","beach","beach"),
desig = c("mpa","reference","mpa","reference","mpa","reference","mpa","reference","mpa","reference","mpa","reference"),
indicator = c("density","density","density","density","density","density","biomass","biomass","biomass","biomass","biomass","biomass"),
n = c(1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118,1118),
m = c(0.35, 4.28, 1.16, 106.35, 13.44,0.63,0.35, 4.28, 1.16, 106.35, 13.44,0.63),
sd = c(1.19, 8.48, 4.25, 118, 31.77,2.79,1.19, 8.48, 4.25, 118, 31.77,2.79)
)
data_long
I want to keep id and indicator, split by "desig",and move "n", "m", and "sd" into new columns. The final data frame structure I'm trying to obtain is:
data_wide <- data.frame(
id = c("kelp","fish","beach","kelp","fish","beach"),
indicator = c("density","density","density","biomass","biomass","biomass"),
mpa.n = c(1118,1118,1118,1118,1118,1118),
mpa.m = c(0.35, 4.28, 1.16, 106.35, 13.44,0.63),
mpa.sd = c(1.19, 8.48, 4.25, 118, 31.77,2.79),
reference.n = c(1118,1118,1118,1118,1118,1118),
reference.m = c(0.35, 4.28, 1.16, 106.35, 13.44,0.63),
reference.sd = c(1.19, 8.48, 4.25, 118, 31.77,2.79)
)
data_wide
I can't seem to get this right using reshape2. Any suggestions?

We may use pivot_wider
library(tidyr)
library(dplyr)
pivot_wider(data_long, names_from = desig,
values_from = c(n, m, sd), names_glue = "{desig}.{.value}") %>%
select(id, indicator, starts_with("mpa"), starts_with('reference'))
-output
# A tibble: 6 × 8
id indicator mpa.n mpa.m mpa.sd reference.n reference.m reference.sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 kelp density 1118 0.35 1.19 1118 4.28 8.48
2 fish density 1118 1.16 4.25 1118 106. 118
3 beach density 1118 13.4 31.8 1118 0.63 2.79
4 kelp biomass 1118 0.35 1.19 1118 4.28 8.48
5 fish biomass 1118 1.16 4.25 1118 106. 118
6 beach biomass 1118 13.4 31.8 1118 0.63 2.79

Related

Select first occurrence of a decimal value in R

Ok, I have been trying to get an answer for this but I cant find it anywhere, but it seems like an easy task (which is bugging me even more!)
I have a dataframe with a series of numbers in a column which I want to filter to get the first occurrence of a number....for example, if i have 1.01, 1.08, 1.15, I want to filter the rows to get the row with the value 1.01 in that column.
An examples is:
x<- c(2.04, 2.25, 3.99, 3.20, 2.60, 1.85, 3.57, 3.37, 2.59, 1.60, 3.93, 1.33, 1.08, 4.64, 2.09, 4.53, 3.04, 3.85, 3.15, 3.97)
y<- c(2.62, 2.48, 1.40, 2.27, 3.71, 1.86, 3.56, 2.08, 2.36, 3.23, 1.65, 3.43, 1.57, 4.49, 2.29, 3.32, 2.12, 4.45, 1.57, 4.70)
z <- data.frame(x, y)
z <- z[order(z$x, decreasing = FALSE), ]
And the filtered results should be:
x y
1.08 1.57
2.04 2.62
3.04 2.12
4.53 3.32
Any help would be apprreciated

z %>%
arrange(x) %>%
group_by(int = floor(x)) %>%
slice(1) %>%
ungroup()
# A tibble: 4 × 3
x y int
<dbl> <dbl> <dbl>
1 1.08 1.57 1
2 2.04 2.62 2
3 3.04 2.12 3
4 4.53 3.32 4
or
z %>%
arrange(x) %>%
filter(floor(x) != lag(floor(x), default = 0))
x y
1 1.08 1.57
2 2.04 2.62
3 3.04 2.12
4 4.53 3.32

You can also try this:
z1 <- z %>%
group_by(floor(z$x)) %>%
arrange(z$x) %>%
filter(row_number()==1)
z1
# A tibble: 4 × 3
# Groups: floor(z$x) [4]
x y `floor(z$x)`
<dbl> <dbl> <dbl>
1 1.08 1.57 1
2 2.04 2.62 2
3 3.04 2.12 3
4 4.53 3.32 4

Translating loop syntax from Stata to R

I need to write a for loop to calculate the product of year variables (e.g. var1874) * price variables (e.g. num1874), creating a new variable for each year and its corresponding price value (e.g. newvar1874).
Here's my data in R
A tibble: 4 x 7
cty var1874 var1875 var1876 num1874 num1875 num1876
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.78 0.83 0.99 2.64 2.8 3.1
2 2 0.69 0.69 0.89 2.3 2.3 2.58
3 3 0.42 0.48 0.59 2.28 2.44 2.64
4 4 0.82 0.94 1.09 2.28 2.36 3
I've been able to do this using the 'foreach' loop in Stata:
local vn 1874 1875 1876
foreach v of local vn {
gen newvar'v' = var'v'*num'v'
Does anyone know how I would do this same type of command using the for loop in R? I know there may be simpler ways to do this without the for loop, but I need to know how to do this using the for loop.

Using a for loop you could do:
vn <- 1874:1876
for (v in vn) d[[paste0("newvar", v)]] <- d[[paste0("var", v)]] * d[[paste0("num", v)]]
d
#> cty var1874 var1875 var1876 num1874 num1875 num1876 newvar1874 newvar1875
#> 1 1 0.78 0.83 0.99 2.64 2.80 3.10 2.0592 2.3240
#> 2 2 0.69 0.69 0.89 2.30 2.30 2.58 1.5870 1.5870
#> 3 3 0.42 0.48 0.59 2.28 2.44 2.64 0.9576 1.1712
#> 4 4 0.82 0.94 1.09 2.28 2.36 3.00 1.8696 2.2184
#> newvar1876
#> 1 3.0690
#> 2 2.2962
#> 3 1.5576
#> 4 3.2700
Or using lapply you could do:
d[, paste0("newvar", vn)] <- lapply(vn, function(v) d[[paste0("var", v)]] * d[[paste0("num", v)]])
DATA
d <- structure(list(
cty = 1:4, var1874 = c(0.78, 0.69, 0.42, 0.82),
var1875 = c(0.83, 0.69, 0.48, 0.94), var1876 = c(
0.99, 0.89,
0.59, 1.09
), num1874 = c(2.64, 2.3, 2.28, 2.28), num1875 = c(
2.8,
2.3, 2.44, 2.36
), num1876 = c(3.1, 2.58, 2.64, 3)
), class = "data.frame", row.names = c(
"1",
"2", "3", "4"
))

Calculation cumulated values using grouping

I am trying to calculate cumulated acetone and acetaldehyde emission from different soil incubations across three time points. Emission of the compounds was measured from six soils (of different soil_types) on three days. I wish to calculate the cumulated emission for each soil for each time point.
The end goal is to calculate the average emission from all soils and present a graph similar to this one (except there should be error bars on my graph):
Can anyone spot where I'm going wrong?
Here's the code:
library(tidyverse)
library(plotrix)
df%>%
group_by(soil, compound, days)%>%
mutate(cum_emission=cumsum(emission))%>%
summarise(mean=mean(cum_emission, na.rm = TRUE),
sd = sd(cum_emission, na.rm = TRUE),
se = std.error(cum_emission, na.rm = TRUE))
Here's the data:
df <- structure(list(days = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,
4, 4, 4, 4, 4, 4, 4, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 4, 4, 4, 4), soil = c(12, 12, 2, 2, 1, 1, 9, 9, 13, 13,
3, 3, 12, 12, 2, 2, 1, 1, 9, 9, 12, 12, 2, 2, 1, 1, 9, 9, 13,
13, 3, 3, 13, 13, 3, 3), soil_type = c("organic", "organic",
"mineral", "mineral", "mineral", "mineral", "organic", "organic",
"organic", "organic", "mineral", "mineral", "organic", "organic",
"mineral", "mineral", "mineral", "mineral", "organic", "organic",
"organic", "organic", "mineral", "mineral", "mineral", "mineral",
"organic", "organic", "organic", "organic", "mineral", "mineral",
"organic", "organic", "mineral", "mineral"), compound = c("Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde",
"Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde",
"Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde",
"Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde"
), emission = c(0.01, 0, 0.03, 0.03, 0.07, 0.06, 0.33, 0.1, 0.02,
0.01, 0.01, 0, 0.02, 0.01, 0.07, 0.08, 0.09, 0.07, 0.32, 0.22,
0.01, 0, 0.06, 0.06, 0.08, 0.06, 0.23, 0.14, 0.4, 0.04, 0.14,
0, 0.05, 0.05, 0.14, 0)), row.names = c(NA, -36L), class = c("tbl_df",
"tbl", "data.frame"))

This only addresses the setup of the data, not the plotting. (sorry for the partial answer!)
You wrote that you wanted to group by soil, compound, days, did you mean soil_type, compound, days? As #maarvd pointed out, with soil, every row is unique.
When I modified the content to
df %>%
group_by(soil_type, compound, days)%>%
mutate(cum_emission=cumsum(emission))%>%
summarise(mean=mean(cum_emission, na.rm = TRUE),
sd = sd(cum_emission, na.rm = TRUE),
se = std.error(cum_emission, na.rm = TRUE))
I was able to render the following results
# A tibble: 12 x 6
# Groups: soil_type, compound [4]
soil_type compound days mean sd se
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 mineral Acetaldehyde 0 0.0700 0.0346 0.02
2 mineral Acetaldehyde 4 0.127 0.0404 0.0233
3 mineral Acetaldehyde 10 0.10 0.0346 0.02
4 mineral Acetone 0 0.08 0.0436 0.0252
5 mineral Acetone 4 0.177 0.116 0.0669
6 mineral Acetone 10 0.16 0.111 0.0643
7 organic Acetaldehyde 0 0.07 0.0608 0.0351
8 organic Acetaldehyde 4 0.173 0.144 0.0829
9 organic Acetaldehyde 10 0.107 0.0945 0.0546
10 organic Acetone 0 0.237 0.197 0.113
11 organic Acetone 4 0.25 0.201 0.116
12 organic Acetone 10 0.297 0.319 0.184
** changes based on #Tiptop's comment
If you're looking for the cumulative, moving averages, how about this?
I'm sure some of this I didn't originally write, but wherever it originated, I've repurposed it many times.
You won't need plotrix, but you will need the library tidyquant.
library(tidyverse)
library(tidyquant)
UDF_roll <- function(x, na.rm = TRUE) {
m <- mean(x, na.rm = na.rm) # calculate the average (for the rolling average)
s <- sd(x, na.rm = na.rm) # calculate the sd to find the confidence interval
hi <- m + 2*s # CI HI
lo <- m - 2*s # CI Low
vals <- c(Mean = m, SD = s, HI.95 = hi, LO.95 = lo)
return(vals)
}
# loop for each type of compound (I'm assuming that the data you provided is a sample and you have more.)
trends <- vector("list") # empty list to store the results
cp = unique(df$compound) # create a list of unique compound names
for(i in 1:length(unique(df$compound))){ # loop through each compound
trends[[i]] <- df %>% as.data.frame() %>% # add results to the list
filter(compound == cp[i]) %>% # for one compound
arrange(days) %>%
# the rolling functions requires time series with a date; so random dates added as controller
mutate(time = seq(as.Date("2010/1/1"),
by = "month",
length.out = nrow(.)),
cum_emission = cumsum(emission)) %>%
arrange(compound,-days) %>% # most recent on top for TS
tq_mutate(select = cum_emission, # collect mean, sd, error
mutate_fun = rollapply,
width = 2, # 2: current & previous reading
align = "right",
by.column = FALSE,
FUN = UDF_roll, # calls the function UDF
na.rm = TRUE) %>%
ggplot(aes(x = seq_along(time))) +
geom_point(aes(y = cum_emission),
color = "black", alpha = 0.2) + # cumulative
geom_ribbon(aes(ymin = LO.95, ymax = HI.95),
fill = "azure3", alpha = 0.4) + # confidence interval
geom_jitter(aes(y = Mean, color= Mean),
size = 1, alpha = 0.9) + # rolling average
labs(title = paste0(cp[[i]], ": Trends and Volatility\nIncremental Moving Average with 95% CI Bands (+/-2 SD)"),
x = "", y = "Soil Emissions") +
scale_color_viridis_c(end = .8) + theme_bw() +
theme(legend.position="none")
}
trends[[1]]
trends[[2]]
trends[[1]]$data # you can NULL the time column if you use the data another way
This makes the data time series. The plots:
The data is shown below. If you wanted to group it differently, you'll have to add the argument .groups = "drop" to the summarise() call, or you won't be able to get it through tq_mutate.
# A tibble: 18 x 11
days soil soil_type compound emission time cum_emission Mean SD HI.95 LO.95
<dbl> <dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 12 organic Acetone 0.01 2010-01-01 0.01 NA NA NA NA
2 0 2 mineral Acetone 0.03 2010-02-01 0.04 0.025 0.0212 0.0674 -0.0174
3 0 1 mineral Acetone 0.07 2010-03-01 0.11 0.075 0.0495 0.174 -0.0240
4 0 9 organic Acetone 0.33 2010-04-01 0.44 0.275 0.233 0.742 -0.192
5 0 13 organic Acetone 0.02 2010-05-01 0.46 0.45 0.0141 0.478 0.422
6 0 3 mineral Acetone 0.01 2010-06-01 0.47 0.465 0.00707 0.479 0.451
7 4 12 organic Acetone 0.02 2010-07-01 0.49 0.48 0.0141 0.508 0.452
8 4 2 mineral Acetone 0.07 2010-08-01 0.56 0.525 0.0495 0.624 0.426
9 4 1 mineral Acetone 0.09 2010-09-01 0.65 0.605 0.0636 0.732 0.478
10 4 9 organic Acetone 0.32 2010-10-01 0.97 0.81 0.226 1.26 0.357
11 4 13 organic Acetone 0.05 2010-11-01 1.02 0.995 0.0354 1.07 0.924
12 4 3 mineral Acetone 0.14 2010-12-01 1.16 1.09 0.0990 1.29 0.892
13 10 12 organic Acetone 0.01 2011-01-01 1.17 1.16 0.00707 1.18 1.15
14 10 2 mineral Acetone 0.06 2011-02-01 1.23 1.2 0.0424 1.28 1.12
15 10 1 mineral Acetone 0.08 2011-03-01 1.31 1.27 0.0566 1.38 1.16
16 10 9 organic Acetone 0.23 2011-04-01 1.54 1.42 0.163 1.75 1.10
17 10 13 organic Acetone 0.4 2011-05-01 1.94 1.74 0.283 2.31 1.17
18 10 3 mineral Acetone 0.14 2011-06-01 2.08 2.01 0.0990 2.21 1.81

Divide all values by reference row

Although this seems similar to this, I'm looking for a "tidy" solution...
Let's look at the following data (it's rocks compositions for some chemical elements, if you are curious):
# A tibble: 4 x 15
Rock La Ce Pr Nd Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Upper CC 31 63 7.1 27 4.7 1 4 0.7 3.9 0.83 2.3 0.3 1.96 0.31
2 Middle CC 24 53 5.8 25 4.6 1.4 4 0.7 3.8 0.82 2.3 0.32 2.2 0.4
3 Lower CC 8 20 2.4 11 2.8 1.1 3.1 0.48 3.1 0.68 1.9 0.24 1.5 0.25
4 chondrite 0.235 0.603 0.0891 0.452 0.147 0.056 0.197 0.0363 0.243 0.0556 0.159 0.0242 0.162 0.0243
(see at the end for the dput)
This is made of three samples and a reference value (chondrite). I want to normalize the value of each element by the chondrite, for each sample, i.e. get something like that:
# A tibble: 4 x 15
Rock La Ce Pr Nd Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Upper CC 132. 104. 79.7 59.7 32.0 17.9 20.3 19.3 16.0 14.9 14.5 12.4 12.1 12.8
2 Middle CC 102. 87.9 65.1 55.3 31.3 25 20.3 19.3 15.6 14.8 14.5 13.2 13.6 16.5
3 Lower CC 34.0 33.2 26.9 24.3 19.0 19.6 15.7 13.2 12.8 12.2 12.0 9.92 9.26 10.3
4 chondrite 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In which, of course, the first 132 for df["Upper CC","La"] comes from 31 / 0.235, i.e. df["Upper CC","La"] / df["chondrite","La"]
This is trivial in excel, and can be done in plain R with something along the lines of
apply(df[,-1],1,FUN=function(z){return(z/df[4,-1])})
Give or take some unlist() and other niceties.
But how do I do this in tidyverse idiom ? I started constructing
df %>% mutate(across( where(is.numeric), ... ? .... ) )
... but could not go further.
Generalize/related question: instead of normalizing by df[4,], normalize by an arbitrary named vector.
dput(df)
structure(list(Rock = c("Upper CC", "Middle CC", "Lower CC",
"chondrite"), La = c(31, 24, 8, 0.2347), Ce = c(63, 53, 20, 0.6032
), Pr = c(7.1, 5.8, 2.4, 0.0891), Nd = c(27, 25, 11, 0.4524),
Sm = c(4.7, 4.6, 2.8, 0.1471), Eu = c(1, 1.4, 1.1, 0.056),
Gd = c(4, 4, 3.1, 0.1966), Tb = c(0.7, 0.7, 0.48, 0.0363),
Dy = c(3.9, 3.8, 3.1, 0.2427), Ho = c(0.83, 0.82, 0.68, 0.0556
), Er = c(2.3, 2.3, 1.9, 0.1589), Tm = c(0.3, 0.32, 0.24,
0.0242), Yb = c(1.96, 2.2, 1.5, 0.1625), Lu = c(0.31, 0.4,
0.25, 0.0243)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))

You can use :
library(dplyr)
df %>% mutate(across(where(is.numeric), ~./.[Rock == "chondrite"]))
# Rock La Ce Pr Nd Sm Eu Gd Tb Dy
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Upper … 132. 104. 79.7 59.7 32.0 17.9 20.3 19.3 16.1
#2 Middle… 102. 87.9 65.1 55.3 31.3 25.0 20.3 19.3 15.7
#3 Lower … 34.1 33.2 26.9 24.3 19.0 19.6 15.8 13.2 12.8
#4 chondr… 1 1 1 1 1 1 1 1 1
# … with 5 more variables: Ho <dbl>, Er <dbl>, Tm <dbl>,
# Yb <dbl>, Lu <dbl>

Using matrix calculations.
m <- t(dat[-1])
dat[-1] <- t(m / m[,4])
# Rock La Ce Pr Nd Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
# 1 Upper CC 131.91489 104.47761 79.68575 59.73451 31.97279 17.85714 20.30457 19.28375 16.04938 14.92806 14.46541 12.396694 12.098765 12.75720
# 2 Middle CC 102.12766 87.89386 65.09540 55.30973 31.29252 25.00000 20.30457 19.28375 15.63786 14.74820 14.46541 13.223140 13.580247 16.46091
# 3 Lower CC 34.04255 33.16750 26.93603 24.33628 19.04762 19.64286 15.73604 13.22314 12.75720 12.23022 11.94969 9.917355 9.259259 10.28807
# 4 chondrite 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.000000 1.000000 1.00000
Data
dat <- structure(list(Rock = c("Upper CC", "Middle CC", "Lower CC",
"chondrite"), La = c(31, 24, 8, 0.235), Ce = c(63, 53, 20, 0.603
), Pr = c(7.1, 5.8, 2.4, 0.0891), Nd = c(27, 25, 11, 0.452),
Sm = c(4.7, 4.6, 2.8, 0.147), Eu = c(1, 1.4, 1.1, 0.056),
Gd = c(4, 4, 3.1, 0.197), Tb = c(0.7, 0.7, 0.48, 0.0363),
Dy = c(3.9, 3.8, 3.1, 0.243), Ho = c(0.83, 0.82, 0.68, 0.0556
), Er = c(2.3, 2.3, 1.9, 0.159), Tm = c(0.3, 0.32, 0.24,
0.0242), Yb = c(1.96, 2.2, 1.5, 0.162), Lu = c(0.31, 0.4,
0.25, 0.0243)), class = "data.frame", row.names = c("1",
"2", "3", "4"))

Using data.table
library(data.table)
setDT(df1)[, (names(df1)[-1]) := lapply(.SD, function(x)
x/x[match( "chondrite", Rock)]), .SDcols = -1]

Apply a function to a subset of many columns in R

How do I apply a function to many columns of grouped rows? For example;
library(tidyverse)
data <- tribble(
~Date, ~Seq1, ~Component, ~Seq2, ~X1, ~X2, ~X3,
"01/01/18", 1, "Smooth", NA, 3.98, 2.75, 1.82,
"01/01/18", 2, "Smooth", NA, 1.02, 0.02, -0.04,
"01/01/18", 3, "Smooth", NA, 3.48, 3.06, 1.25,
"01/01/18", 3, "Bounce", 1, 2.01, -0.43, -0.52,
"01/01/18", 3, "Bounce", 2, 1.94, 1.53, 1.92) %>%
mutate_at(vars(Date, Seq1, Component, Seq2), funs(factor))
Each column of X values (many more columns, truncated here for clarity) is grouped into Date, Seq1, Component, and Seq2. While Component "Smooth" and Seq1 "NA" are constant, within Component "Bounce" level there are multiple Seq2 levels e.g. "1", "2", etc.
How do I sum each X column, always the constant "NA" with each level of Seq2?
The desired results is:
expected <- tribble(
~Date, ~Seq1, ~Component, ~Seq2, ~X1, ~X2, ~X3,
"01/01/18", 1, "Smooth", NA, 3.98, 2.75, 1.82,
"01/01/18", 2, "Smooth", NA, 1.02, 0.02, -0.04,
"01/01/18", 3, "Smooth", NA, 3.48, 3.06, 1.25,
"01/01/18", 3, "Bounce", 1, 5.49, 3.49, 1.77,
"01/01/18", 3, "Bounce", 2, 5.42, 4.59, 3.17)
The following example only adds each Seq1 level.
data %>%
group_by(Date, Seq1) %>%
mutate_at(vars(starts_with("X")), funs(sum(.)))
#> # A tibble: 5 x 7
#> # Groups: Date, Seq1 [3]
#> Date Seq1 Component Seq2 X1 X2 X3
#> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 01/01/18 1 Smooth <NA> 3.98 2.75 1.82
#> 2 01/01/18 2 Smooth <NA> 1.02 0.02 -0.04
#> 3 01/01/18 3 Smooth <NA> 7.43 4.16 2.65
#> 4 01/01/18 3 Bounce 1 7.43 4.16 2.65
#> 5 01/01/18 3 Bounce 2 7.43 4.16 2.65
I am certain there is solution within the purrr or apply function family, however, I have been unsuccessful (for days) in solving this example. The actual data has about 180 X columns, with hundreds of Date and Seq1 combinations, and multiple Seq2 levels.
A similar example could be Summing Multiple Groups of Columns, How to apply a function to a subset of columns in r?, or even perhaps https://github.com/jennybc/row-oriented-workflows.
Created on 2018-10-23 by the reprex package (v0.2.1)

Here's my solution. This problem is not really a purrr task, because there is nothing really that you want to map a single function to. Instead, what I understand the problem to be is that you want to match each X value in a Bounce row with the corresponding Smooth row X values of the same Date and Seq1 (and there is only one such row). This means that it is really a merging or joining problem, and then the approach is to set up the join so that you can match the right values and do the sum. So I go as follows:
Split the data into the Smooth rows and the Bounce rows and gather so that all the X values are in one column
Join the smooths onto the bounces with a left_join, so each original Bounce row now has its corresponding Smooth.
mutate the sum into a new column and select/rename the columns to be as in the original
bind_rows to join the newly summed bounces and spread to return to the original layout.
This should be robust to any number of Date, Seq1, Seq2 and X values.
library(tidyverse)
data <- tribble(
~Date, ~Seq1, ~Component, ~Seq2, ~X1, ~X2, ~X3,
"01/01/18", 1, "Smooth", NA, 3.98, 2.75, 1.82,
"01/01/18", 2, "Smooth", NA, 1.02, 0.02, -0.04,
"01/01/18", 3, "Smooth", NA, 3.48, 3.06, 1.25,
"01/01/18", 3, "Bounce", 1, 2.01, -0.43, -0.52,
"01/01/18", 3, "Bounce", 2, 1.94, 1.53, 1.92)
smooths <- data %>%
filter(Component == "Smooth") %>%
gather(X, val, starts_with("X"))
bounces <- data %>%
filter(Component == "Bounce") %>%
gather(X, val, starts_with("X")) %>%
left_join(smooths, by = c("Date", "Seq1", "X")) %>%
mutate(val = val.x + val.y) %>%
select(Date, Seq1, Component = Component.x, Seq2 = Seq2.x, X, val)
bounces %>%
bind_rows(smooths) %>%
spread(X, val)
#> # A tibble: 5 x 7
#> Date Seq1 Component Seq2 X1 X2 X3
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01/01/18 1 Smooth NA 3.98 2.75 1.82
#> 2 01/01/18 2 Smooth NA 1.02 0.02 -0.04
#> 3 01/01/18 3 Bounce 1 5.49 2.63 0.73
#> 4 01/01/18 3 Bounce 2 5.42 4.59 3.17
#> 5 01/01/18 3 Smooth NA 3.48 3.06 1.25
Created on 2018-10-31 by the reprex package (v0.2.1)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R split by group and create new columns - r

Related

Select first occurrence of a decimal value in R

Translating loop syntax from Stata to R

Calculation cumulated values using grouping

Divide all values by reference row

Apply a function to a subset of many columns in R

Categories

Resources