How can I do operations among rows in a tibble?

I have a tibble storing variables measured at different points in the sea and at different depths, and I need to condense all the depths of the same point into a single row following a specific formula: the sum, over consecutive depths, of (no3 at X plus no3 at X+1) times (depth at X+1 minus depth at X), and so on. I had written this out in Excel as a way of better explaining what I'm trying to do.
And here is a small sample of the (edited) data I'm working with:
    long   lat station depth   no3
   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 -71.1   -32       1     0     9
 2 -71.1   -32       1     5    14
 3 -71.1   -32       1    10    10
 4 -71.1   -32       1    20    11
 5 -71.6   -32       2     0    13
 6 -71.6   -32       2     5     8
 7 -71.6   -32       2    10     2
 8 -71.6   -32       2    20     6
 9 -71.6   -32       2    50     4
10 -71.6   -32       2    75     9
# ... with 942 more rows
From what I read here in similar questions, I could use aggregate or merge, but those only do the summation, and I don't know how to make them apply the entire equation. I'll appreciate any suggestion. I'm new to R, so I'm sorry if I haven't been very clear (or if the solution is actually quite simple).

You can use the lead function to create the sums based on the following row's data (use lag for the previous row), and then sum this new column in a summarize:
library(dplyr)

df <- data.frame(
  depth = c(0, 5, 10, 20, 50),
  NO3 = c(3, 5, 6, 2, 3)) %>%
  mutate(a = (lead(NO3) + NO3) * (lead(depth) - depth))
df
  depth NO3   a
1     0   3  40
2     5   5  55
3    10   6  80
4    20   2 150
5    50   3  NA
df %>%
  summarize(b = sum(a, na.rm = TRUE))
b
1 325
Note that the na.rm in the sum is key here, since lead creates an NA in the final row. Those NAs can instead be filled using lead's default argument.
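For example, lead accepts a default value to use in place of the trailing NA:

lead(c(1, 2, 3))               # 2 3 NA
lead(c(1, 2, 3), default = 0)  # 2 3 0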
EDIT:
If you'd like to apply this to more than just one column, you can use the "scoped" variants of mutate and summarize, by adding _at or _if to the end of these functions.
df2 <- data.frame(
  depth = c(0, 5, 10, 20, 50),
  NO3 = c(3, 5, 6, 2, 3),
  NO4 = c(1, 2, 3, 4, 5),
  NO5 = c(5, 4, 3, 2, 1))
The _at functions require either a character vector of names or a vector of column positions to determine which columns to operate on. All three of these will return the same thing, where .x refers to the column being modified:
df2 %>%
  mutate_at(c("NO3", "NO4", "NO5"), ~ (lead(.x) + .x) * (lead(depth) - depth)) %>%
  summarize_at(c("NO3", "NO4", "NO5"), sum, na.rm = TRUE)

df2 %>%
  mutate_at(vars(NO3:NO5), ~ (lead(.x) + .x) * (lead(depth) - depth)) %>%
  summarize_at(vars(NO3:NO5), sum, na.rm = TRUE)

df2 %>%
  mutate_at(2:4, ~ (lead(.x) + .x) * (lead(depth) - depth)) %>%
  summarize_at(2:4, sum, na.rm = TRUE)
NO3 NO4 NO5
1 325 380 220
The _if functions need a "predicate" that determines whether a column will be operated on. Either of these, which check the column names (str_detect() comes from stringr, loaded with the tidyverse), would work:
df2 %>%
  mutate_if(str_detect(colnames(.), "NO"), ~ (lead(.x) + .x) * (lead(depth) - depth)) %>%
  summarize_if(str_detect(colnames(.), "NO"), sum, na.rm = TRUE)

df2 %>%
  mutate_if(!str_detect(colnames(.), "depth"), ~ (lead(.x) + .x) * (lead(depth) - depth)) %>%
  summarize_if(!str_detect(colnames(.), "depth"), sum, na.rm = TRUE)
NO3 NO4 NO5
1 325 380 220
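In dplyr 1.0.0 and later, the scoped _at/_if variants are superseded by across(); here is a sketch of the same computation with across, assuming dplyr >= 1.0.0:

library(dplyr)

df2 %>%
  mutate(across(NO3:NO5, ~ (lead(.x) + .x) * (lead(depth) - depth))) %>%
  summarize(across(NO3:NO5, ~ sum(.x, na.rm = TRUE)))

# or select the columns by name pattern, as the _if versions above do:
df2 %>%
  mutate(across(starts_with("NO"), ~ (lead(.x) + .x) * (lead(depth) - depth))) %>%
  summarize(across(starts_with("NO"), ~ sum(.x, na.rm = TRUE)))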

Related

How to select last N observations from each group in dplyr dataframe?

Given a dataframe:
df <- structure(list(a = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
                     b = c(34, 343, 54, 11, 55, 62, 59, -9, 0, -0.5)),
                row.names = c(NA, -10L),
                class = c("tbl_df", "tbl", "data.frame"))
I want to take last N observations / rows from each group:
df %>%
  dplyr::group_by(a) %>%
  dplyr::last(2)
Gives me wrong results.
I want it to be:
a b
1 343
1 54
2 55
2 62
3 59
3 -9
4 0
4 -0.5
Please advise what is wrong here?
The error I get is:
Error in order(order_by)[[n]] : subscript out of bounds
Since this is specifically a dplyr question:
1) After the group_by, use slice on the row_number():
library(tidyverse)
df %>%
  group_by(a) %>%
  slice(tail(row_number(), 2))
# A tibble: 8 x 2
# Groups: a [4]
# a b
# <dbl> <dbl>
#1 1 343
#2 1 54
#3 2 55
#4 2 62
#5 3 59
#6 3 -9
#7 4 0
#8 4 -0.5
2) Or use filter from dplyr
df %>%
  group_by(a) %>%
  filter(row_number() >= (n() - 1))
3) Or with do and tail
df %>%
  group_by(a) %>%
  do(tail(., 2))
4) In addition to the tidyverse methods, we can also use compact data.table:
library(data.table)
setDT(df)[df[, .I[tail(seq_len(.N), 2)], a]$V1]
5) Or by from base R
by(df, df$a, FUN = tail, 2)
6) Or with aggregate from base R
df[aggregate(c ~ a, transform(df, c = seq_len(nrow(df))), FUN = tail, 2)$c,]
7) Or with split from base R
do.call(rbind, lapply(split(df, df$a), tail, 2))
dplyr 1.0.0 introduced slice_tail, which makes this simple:
library(dplyr)
df %>%
  group_by(a) %>%
  slice_tail(n = 2)
Similarly, there is slice_head to get the first n rows.
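For example, to get the first two rows of each group:

df %>%
  group_by(a) %>%
  slice_head(n = 2)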
A base R option using tapply is to subset the last two rows for every group.
df[unlist(tapply(1:nrow(df), df$a, tail, 2)), ]
# a b
# <dbl> <dbl>
#1 1 343
#2 1 54
#3 2 55
#4 2 62
#5 3 59
#6 3 -9
#7 4 0
#8 4 -0.5
Or another option using ave
df[as.logical(with(df, ave(1:nrow(df), a, FUN = function(x) x %in% tail(x, 2)))), ]
Also a tidyverse possibility:
df %>%
  group_by(a) %>%
  top_n(2, row_number())
      a      b
  <dbl>  <dbl>
1     1  343
2     1   54
3     2   55
4     2   62
5     3   59
6     3   -9
7     4    0
8     4   -0.5
It takes the top two rows, ranked by row number, within each group.
Try tail(). In R, the head function lets you preview the first n rows, while tail lets you preview the last n rows.
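On its own, tail() ignores dplyr groups; one way to apply it per group is group_modify() (a sketch, assuming dplyr >= 0.8.1):

df %>%
  group_by(a) %>%
  group_modify(~ tail(.x, 2))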

linear interpolation (approx) by group in a dplyr pipe in R

I have a question that I find kind of hard to explain with an MRE and hard
to answer in an easy way, mostly because I don't fully understand where the
problem lies myself. So that's my sorry-for-being-vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching them back in. But because I take them out first, I
cannot do it nicely in a group_by dplyr pipe. Right now I use a really ugly
workaround where I add empty (NA) newly created columns to the sample tibble,
then fill them with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on irc recommended using a conditional mutate, with case_when, but I
don't fully understand where and how within this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
  # in reality a dttm with the measurement time
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  # measurement cycle, normally 40 for sample, 41 for reference
  cycle = rep(c(rep(1:3, 2), 4), 3),
  # whether the measurement is a reference or a sample
  isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  # measurement intensity for mass 44
  r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
  # measurement intensity for mass 45, normally also masses up to mass 49
  mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
  ggplot(aes(x = cycle, y = r44, colour = isref)) +
  geom_line() +
  geom_line(aes(y = r45), linetype = 2) +
  geom_point() +
  geom_point(aes(y = r45), shape = 1) +
  facet_grid(~ timestamp)
# what I would like to do
data %>%
  group_by(timestamp) %>%
  do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
                           y = data %>% filter(isref) %>% pull(cycle),
                           xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
  unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
  # split the data into sample gas and reference gas
  ref <- filter(dat, isref)
  smp <- filter(dat, !isref)
  # calculate the "target cycle", the points at which the reference intensity
  # 44 matches the sample intensity 44 with linear interpolation
  target_cycle <- approx(x = ref$r44,
                         y = ref$cycle, xout = smp$r44)
  # append the target cycle to the sample gas
  smp <- smp %>%
    group_by(timestamp) %>%
    mutate(target = target_cycle$y)
  # linearly interpolate each reference gas to the target cycle
  ref <- ref %>%
    group_by(timestamp) %>%
    # this is needed because the reference has one more cycle
    mutate(target = c(target_cycle$y, NA)) %>%
    # filter out all the failed ones (no interpolation possible)
    filter(!is.na(target)) %>%
    # calculate interpolated value based on r44 interpolation (i.e., don't
    # actually interpolate this value but shift it based on the 44
    # interpolation)
    mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
           r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
    select(timestamp, target, r44:r45)
  # add new reference gas intensities to the correct sample gasses by the target cycle
  left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
  # filter the sample gasses (since we convert ref to sample)
  filter(!isref) %>%
  # add empty new columns
  mutate(target = NA, r44 = NA, r45 = NA)

# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
  mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
  select(-r45) %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  spread(isref, r44) %>%
  group_by(timestamp) %>%
  mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
  ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit, to address a comment:
To retain r45 you can use a gather-unite-spread approach like this:
data %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  gather(r, value, r44:r45) %>%
  unite(ru, r, isref, sep = "_") %>%
  spread(ru, value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
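In tidyr 1.0 and later, the same gather-unite-spread reshaping can be written with the pivot verbs; a sketch, assuming tidyr >= 1.0:

data %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  pivot_longer(r44:r45, names_to = "r") %>%
  pivot_wider(names_from = c(r, isref), values_from = value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup()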

weighted mean between two specific rows

I have a dataframe of this form
id value
1 10
2 25
5 30
7 15
9 30
10 50
I would like to transform it in the following way
id value
1 10
2 25
5 30
9 30
7+10 43
where the obs with id "7+10" is the weighted mean of the previous obs for 7 and 10, with weights 0.2 and 0.8; in other words, 43 = 0.2*15 + 0.8*50. I tried to use the function aggregate to do this, but it does not work. What can I use to perform operations between specific rows?
Thank you for your help.
Since it is a lot easier to work with columns than with rows, you can transform your data from the long to the wide format with the package tidyr (part of the tidyverse), make your transformations, and then go back to the long format again with tidyr:
library(tidyverse)
dat <- tibble(
  id = c(1, 2, 5, 7, 9, 10),
  value = c(10, 25, 30, 15, 30, 50)
)

dat %>%
  spread(id, value) %>%
  mutate(`7 + 10` = 0.2 * `7` + 0.8 * `10`) %>%
  select(-`7`, -`10`) %>%
  gather("id", "value", everything())
  id     value
  <chr>  <dbl>
1 1         10
2 2         25
3 5         30
4 9         30
5 7 + 10    43
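An alternative that avoids reshaping (a sketch; it assumes the rows with id 7 and 10 appear in that order, so that the weights line up):

dat %>%
  filter(!id %in% c(7, 10)) %>%
  mutate(id = as.character(id)) %>%
  bind_rows(
    dat %>%
      filter(id %in% c(7, 10)) %>%
      # weighted.mean(c(15, 50), c(0.2, 0.8)) = 43
      summarise(id = "7+10", value = weighted.mean(value, c(0.2, 0.8)))
  )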

Splitting a range provided in a row into several smaller ranges in multiple rows

I am somewhat new to programming and I have been struggling to get my desired output, as explained below.
Suppose I have a table like the one below:
My input
which includes the range of coordinates (Start_MP & End_MP) of a specific segment (defined by ID) and the length of the segment (the difference between the range's start and end).
What I need to do is split all the ranges that have a length of more than 2 into ranges of two or less. To make it clearer, I need my output to be like the table below:
My desired output
I would appreciate it if you could let me know how I can handle that with R / R packages.
The function tidyr::expand is the right option for expanding rows into the output the OP wants.
The approach is to first use expand to generate the desired number of rows and then use left_join to join those back to the original data.frame.
# Data
df <- data.frame(Segment_ID = c(1101, 1102, 1103), Start_MP = c(1, 5, 20),
                 End_MP = c(2, 10, 30), Segment_Length = c(1, 5, 10))
library(tidyverse)
df %>%
  group_by(Segment_ID) %>%
  expand(Segment_ID, Segment_Sequence_Number =
           seq(from = Start_MP, to = End_MP, by = 2)) %>%
  left_join(df, by = "Segment_ID") %>%
  mutate(Start_MP = Segment_Sequence_Number) %>%
  group_by(Segment_ID) %>%
  mutate(End_MP_Calc = lead(Start_MP)) %>%
  mutate(End_MP = coalesce(End_MP_Calc, End_MP)) %>%
  filter(Start_MP != End_MP) %>%
  mutate(Segment_Length = End_MP - Start_MP) %>%
  group_by(Segment_ID) %>%
  mutate(Segment_Sequence_Number = row_number()) %>%
  select(-End_MP_Calc) %>%
  as.data.frame()
#Result
# Segment_ID Segment_Sequence_Number Start_MP End_MP Segment_Length
# 1 1101 1 1 2 1
# 2 1102 1 5 7 2
# 3 1102 2 7 9 2
# 4 1102 3 9 10 1
# 5 1103 1 20 22 2
# 6 1103 2 22 24 2
# 7 1103 3 24 26 2
# 8 1103 4 26 28 2
# 9 1103 5 28 30 2
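A more compact alternative (a sketch, assuming dplyr >= 1.1 for reframe() and integer mileposts so the seq() endpoints line up exactly):

df %>%
  group_by(Segment_ID) %>%
  reframe(Start = setdiff(seq(Start_MP, End_MP, by = 2), End_MP),
          End = pmin(Start + 2, End_MP)) %>%
  group_by(Segment_ID) %>%
  mutate(Segment_Sequence_Number = row_number(),
         Segment_Length = End - Start) %>%
  ungroup()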

Create balanced subsets based on similarity of multiple columns

I'm searching for a way to find subsets of rows (each subset should contain 6 rows) where the value means for multiple columns are most similar. So, I would like R to search through my data.frame and create subsets of 6 rows each, such that these subsets end up most similar to each other. Similarity could be measured as the Euclidean distance (as pointed out by @David Robinson).
My data looks like that:
TID Cue1 Cue2 Cue3
1 2.06 1.90 3.82
2 5.18 4.13 5.10
3 5.09 2.85 2.80
4 1.93 4.14 4.75
... ... ... ...
I'd now like to know if there is a way in R to find the following:
-give me e.g. 4 subsets containing 6 rows each, whereby the 4 subsets have the most possible similarity in the Cue1, Cue2 and Cue3 means (SD isn't important), while each subset contains unique rows (no duplicate rows between the subsets).
One example would be (not matching the data in my example):
-subset 1 contains TID 1, TID 6, TID 14, TID 28, TID 39, TID 50 and this subset has the cue means (Cue1 = 3.2, Cue2 = 2.5, Cue3 = 4)
-subset 2 contains TID 3, TID 12, TID 20, TID 40, TID 54, TID 59 and this subset has the cue means (Cue1 = 3.3, Cue2 = 2.6, Cue3 = 4.1).
So the two subsets are very (most) similar in the cue means. R should then give me the row numbers (or the TID values) forming the subsets.
Is there any possibility to do this in R?
Here is a reproducible example of what my data looks like:
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  set.seed(1)
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
}

Cue1 <- mysamp(n = 60, m = 3, s = 1.5, lwr = 1, upr = 6, nnorm = 1000)
Cue2 <- mysamp(n = 60, m = 3, s = 2.5, lwr = 1, upr = 6, nnorm = 1000)
Cue3 <- mysamp(n = 60, m = 4, s = 1.5, lwr = 1, upr = 6, nnorm = 1000)

df <- data.frame(TID = 1:60, Cue1 = Cue1, Cue2 = Cue2, Cue3 = Cue3)
This is a clustering problem, so you'd want to approach it by:
1) Calculating a distance matrix
2) Using that to construct a "tree" of similar groups of nodes
3) Extracting sub-clusters of your size that appear lowest on the tree
The distance matrix and hierarchical clustering can be done as:
distances <- dist(df[, -1])
h <- hclust(distances)
There are many approaches to algorithmically pulling low clusters off the tree; since I'm accustomed to working with dplyr/purrr/tidyr, I'll show one solution. This takes the approach of using cutree to break the tree apart at every possible level, then finding the first time each group of six appears.
library(dplyr)
library(tidyr)
library(purrr)

clusterings <- data_frame(ncluster = seq(nrow(df), 1)) %>%
  unnest(membership = map(ncluster, ~ cutree(h, .))) %>%
  group_by(ncluster) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  nest(-ncluster, -membership) %>%
  mutate(size = map_dbl(data, nrow)) %>%
  filter(size == 6) %>%
  distinct(membership, .keep_all = TRUE) %>%
  unnest(data) %>%
  mutate(TID = df$TID[row])
On your data, this returns:
# A tibble: 42 × 5
ncluster membership size row TID
<int> <int> <dbl> <int> <int>
1 29 9 6 9 9
2 29 9 6 30 30
3 29 9 6 39 39
4 29 9 6 41 41
5 29 9 6 43 43
6 29 9 6 57 57
7 21 13 6 15 15
8 21 13 6 20 20
9 21 13 6 25 25
10 21 13 6 29 29
# ... with 32 more rows
Thus, (9, 30, 39, 41, 43, 57) make up your first group of 6, while the second group starts with (15, 20, 25, 29...)
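Note that this answer uses pre-1.0 tidyr idioms (data_frame(), unnest() with a mutate-style argument, nest() with bare column selections). A rough equivalent in current tidyr/dplyr syntax, as a sketch:

library(dplyr)
library(tidyr)
library(purrr)

clusterings <- tibble(ncluster = seq(nrow(df), 1)) %>%
  mutate(membership = map(ncluster, ~ cutree(h, .x))) %>%
  unnest(membership) %>%
  group_by(ncluster) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  nest(data = -c(ncluster, membership)) %>%
  mutate(size = map_dbl(data, nrow)) %>%
  filter(size == 6) %>%
  distinct(membership, .keep_all = TRUE) %>%
  unnest(data) %>%
  mutate(TID = df$TID[row])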
