I have a dataframe of this form
id value
1 10
2 25
5 30
7 15
9 30
10 50
I would like to transform it in the following way
id value
1 10
2 25
5 30
9 30
7+10 43
where the observation with id "7+10" is the weighted mean of the original observations for ids 7 and 10, with weights 0.2 and 0.8; in other words, 43 = 0.2*15 + 0.8*50. I tried to use the function aggregate to do this, but it did not work. What can I use to perform operations between specific rows?
Thank you for your help.
Since it is a lot easier to work with variables than with rows, you can transform your data from the long to the wide format with the package tidyr (part of the tidyverse), apply your transformations, and then convert back to the long format, again with tidyr:
library(tidyverse)
dat <- tibble(
id = c(1, 2, 5, 7, 9, 10),
value = c(10, 25, 30, 15, 30, 50)
)
dat %>%
spread(id, value) %>%
mutate(`7 + 10` = 0.2 * `7` + 0.8 * `10`) %>%
select(- `7`, - `10`) %>%
gather("id", "value", everything())
id value
<chr> <dbl>
1 1 10
2 2 25
3 5 30
4 9 30
5 7 + 10 43
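If reshaping feels heavy for a single pair of rows, the same result can be reached in base R by computing the weighted mean directly and binding it back on. This is only a sketch: the names merge_ids, w, kept, merged and result are introduced here for illustration and are not part of the answer above.

```r
dat <- data.frame(
  id    = c(1, 2, 5, 7, 9, 10),
  value = c(10, 25, 30, 15, 30, 50)
)

merge_ids <- c(7, 10)     # rows to combine (illustrative name)
w         <- c(0.2, 0.8)  # weights, in the same order as merge_ids

# Rows that stay as they are; id becomes character so "7+10" can be appended
kept <- dat[!dat$id %in% merge_ids, ]
kept$id <- as.character(kept$id)

# Weighted mean of the values belonging to merge_ids
merged <- data.frame(
  id    = paste(merge_ids, collapse = "+"),
  value = weighted.mean(dat$value[match(merge_ids, dat$id)], w),
  stringsAsFactors = FALSE
)

result <- rbind(kept, merged)
rownames(result) <- NULL
result
```

Using match() rather than %in% to pick the values keeps them in the same order as the weights, which matters if the rows are not sorted by id.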
I want to take a df as the one below and want to cut/bin/group/sample into groups of size=20.
Ideally, this "binning" occurs randomly across IDs rather than consecutively from the top row to the bottom row.
E.g. IDs 2, 29 and 71 have counts of 7,7,6 and would fit nicely into a "bin" of size=20.
I want to achieve the minimum number of bins and do not care about order of IDs (the more random they are, the better).
set.seed(123)
df <- data.frame(
ID = as.numeric(1:100),
Count = as.numeric(sample(1:8, size = 100, replace = T)))
Desired outcome would be a dataframe/tibble looking something like the below with optimum random sampling and minimising bin number.
Bin_size=20 is the parameter set by me: a bin total of exactly 20 is ideal and less than 20 is ok, but more than 20 is not ok.
Each Bin should be given a number (e.g. if I have 10 bins, I would like them to be called Bin_number 1-10).
ID, Count, Bin_size, Bin_number
ID 2, 7, 20, 1
ID 29, 7, 20, 1
ID 71, 6, 20, 1
etc.
Where 7+7+6 = 20 (etc.)
Any help with this would be much appreciated. I have been wondering about cumsum and group_by but could not figure it out.
If you need more details, I'm happy to provide them.
thanks!
The BBmisc package has a simple (though not optimized) bin packing algorithm that might be useful:
library(BBmisc)
library(dplyr)
df %>%
as_tibble() %>%
mutate(bin = binPack(Count, 20),
bin_size = ave(Count, bin, FUN = sum)) %>%
arrange(bin)
# A tibble: 100 x 4
ID Count bin bin_size
<dbl> <dbl> <int> <dbl>
1 11 4 1 20
2 17 8 1 20
3 27 8 1 20
4 22 4 2 20
5 42 8 2 20
6 56 8 2 20
7 34 4 3 20
8 62 8 3 20
9 79 8 3 20
10 40 4 4 20
# ... with 90 more rows
Do you want something like this?
df <- df %>%
  arrange(Count) %>%
  mutate(Sum = cumsum(Count), Bin_size = 20)
df <- df %>%
  mutate(Bin_number = cut(Sum, breaks = seq(0, max(Sum) + 20, 20), labels = FALSE, right = TRUE))
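One caveat with the running-total approach is that a row whose cumulative sum crosses a multiple of 20 is assigned to the next bin, so a bin's total may end up above 20. A minimal sketch of an alternative that never exceeds the capacity is the first-fit decreasing heuristic, written here in base R. The function name first_fit_decreasing and its arguments are hypothetical, and the heuristic is not guaranteed to find the true minimum number of bins.

```r
# First-fit decreasing: place items from largest to smallest into the
# first open bin that still has room, opening a new bin when none fits.
first_fit_decreasing <- function(sizes, capacity) {
  stopifnot(all(sizes <= capacity))
  bins <- integer(length(sizes))  # bin number assigned to each item
  free <- numeric(0)              # remaining capacity of each open bin
  for (i in order(sizes, decreasing = TRUE)) {
    fits <- which(free >= sizes[i])
    if (length(fits) == 0) {
      free <- c(free, capacity)   # open a new bin
      b <- length(free)
    } else {
      b <- fits[1]                # first bin with enough room
    }
    free[b] <- free[b] - sizes[i]
    bins[i] <- b
  }
  bins
}

set.seed(123)
df <- data.frame(
  ID    = 1:100,
  Count = sample(1:8, size = 100, replace = TRUE)
)
df$Bin_size   <- 20
df$Bin_number <- first_fit_decreasing(df$Count, capacity = 20)

# Total Count per bin; every total is at most 20
bin_totals <- tapply(df$Count, df$Bin_number, sum)
```

Because ties between equal counts are broken by row position, shuffling the rows of df before the call changes which IDs of the same size end up sharing a bin, which gives some of the randomness asked for.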
I have a tibble that stores variables measured at different points in the sea and at different depths. I need to condense all the depths of the same point into a single row following a specific formula: for each pair of consecutive depths X and X+1, add the two values, multiply by the difference in depth (depth of X+1 minus depth of X), and then sum these products over all pairs. I have written this out in Excel as a way of better explaining what I'm trying to do.
And here is a small sample of the (edited) data I'm working with:
long lat station depth no3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -71.1 -32 1 0 9
2 -71.1 -32 1 5 14
3 -71.1 -32 1 10 10
4 -71.1 -32 1 20 11
5 -71.6 -32 2 0 13
6 -71.6 -32 2 5 8
7 -71.6 -32 2 10 2
8 -71.6 -32 2 20 6
9 -71.6 -32 2 50 4
10 -71.6 -32 2 75 9
# ... with 942 more rows
From what I read here in similar questions, I could use aggregate or merge, but those only do the summation, and I don't know how to get them to evaluate the entire equation. I'd appreciate any suggestions. I'm new to R, so I'm sorry if I haven't been very clear (or if the solution is actually quite simple).
You can use the lead function to create the sums based on the following row's data (use lag for the previous row), and then sum this new column in a summarize:
library(dplyr)
df <- data.frame(
depth = c(0, 5, 10, 20, 50),
NO3 = c(3, 5, 6, 2, 3)) %>%
mutate(a = (lead(NO3) + NO3)*(lead(depth) - depth))
df
depth NO3 a
1 0 3 40
2 5 5 55
3 10 6 80
4 20 2 150
5 50 3 NA
df %>%
summarize(b = sum(a, na.rm = TRUE))
b
1 325
Note that na.rm = TRUE in the sum is key here, since lead returns NA for the final row, which has no following row. These NA values can instead be filled using the default argument of lead.
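Since the question has several stations in one table, the same calculation can be done once per station by grouping first. The sketch below retypes the ten sample rows from the question (only the station, depth and no3 columns); the object names stations and per_station are made up for this example.

```r
library(dplyr)

# The ten sample rows from the question, without long/lat
stations <- data.frame(
  station = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  depth   = c(0, 5, 10, 20, 0, 5, 10, 20, 50, 75),
  no3     = c(9, 14, 10, 11, 13, 8, 2, 6, 4, 9)
)

per_station <- stations %>%
  group_by(station) %>%
  arrange(depth, .by_group = TRUE) %>%  # make sure depths increase within a station
  summarize(b = sum((lead(no3) + no3) * (lead(depth) - depth), na.rm = TRUE))

per_station
# station 1: 445, station 2: 860
```

Within a grouped data frame, lead only looks at rows of the same group, so the last depth of one station is never paired with the first depth of the next.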
EDIT:
If you'd like to apply this to more than just one column, you can use the "scoped" variants of mutate and summarize, by adding _at or _if to the end of these functions.
df2 <- data.frame(
depth = c(0, 5, 10, 20, 50),
NO3 = c(3, 5, 6, 2, 3),
NO4 = c(1, 2, 3, 4, 5),
NO5 = c(5, 4, 3, 2, 1))
_at functions require either a names vector or an index vector to determine which columns to operate on. All three of these will return the same thing, where .x refers to the column being modified:
df2 %>%
mutate_at(c("NO3", "NO4", "NO5"), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_at(c("NO3", "NO4", "NO5"), sum, na.rm = T)
df2 %>%
mutate_at(vars(NO3:NO5), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_at(vars(NO3:NO5), sum, na.rm = T)
df2 %>%
mutate_at(2:4, ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_at(2:4, sum, na.rm = T)
NO3 NO4 NO5
1 325 380 220
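_if functions need a "predicate function" that determines whether a column will be operated on; a logical vector with one entry per column is also accepted, which is what the calls here pass. Either of these, which check the name of the column with str_detect (from stringr), would work: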
df2 %>%
mutate_if(str_detect(colnames(.), "NO"), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_if(str_detect(colnames(.), "NO"), sum, na.rm = T)
df2 %>%
mutate_if(!str_detect(colnames(.), "depth"), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_if(!str_detect(colnames(.), "depth"), sum, na.rm = T)
NO3 NO4 NO5
1 325 380 220
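In dplyr 1.0.0 and later the scoped variants are superseded by across(), which does the same column selection inside plain mutate and summarize. A sketch of the equivalent call, assuming that dplyr version is installed (the name res is illustrative):

```r
library(dplyr)

df2 <- data.frame(
  depth = c(0, 5, 10, 20, 50),
  NO3 = c(3, 5, 6, 2, 3),
  NO4 = c(1, 2, 3, 4, 5),
  NO5 = c(5, 4, 3, 2, 1)
)

res <- df2 %>%
  # .x is each selected column in turn; depth is left untouched
  mutate(across(NO3:NO5, ~ (lead(.x) + .x) * (lead(depth) - depth))) %>%
  summarize(across(NO3:NO5, ~ sum(.x, na.rm = TRUE)))

res
```

Selection helpers such as starts_with("NO") can replace NO3:NO5 inside across() when the columns are not adjacent.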
I have a question that I find kind of hard to explain with a MRE and in an easy
way to answer, mostly because I don't fully understand where the problem lies
myself. So that's my sorry for being vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching it back in. But because I take it out first, I
cannot do it nicely in a group_by dplyr pipe way. Right now I do it with a
really ugly workaround where I add empty (NA) newly created columns to the
sample tibble, then do it with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on irc recommended using a conditional mutate, with case_when, but I
don't fully understand where and how within this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
# in reality a dttm with the measurement time
timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
# measurement cycle, normally 40 for sample, 41 for reference
cycle = rep(c(rep(1:3, 2), 4), 3),
# whether the measurement is a reference or a sample
isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
# measurement intensity for mass 44
r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
# measurement intensity for mass 45, normally also masses up to mass 49
mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
ggplot(aes(x = cycle, y = r44, colour = isref)) +
geom_line() +
geom_line(aes(y = r45), linetype = 2) +
geom_point() +
geom_point(aes(y = r45), shape = 1) +
facet_grid(~ timestamp)
# what I would like to do
data %>%
group_by(timestamp) %>%
do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
y = data %>% filter(isref) %>% pull(cycle),
xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
# split the data into sample gas and reference gas
ref <- filter(dat, isref)
smp <- filter(dat, !isref)
# calculate the "target cycle", the points at which the reference intensity
# 44 matches the sample intensity 44 with linear interpolation
target_cycle <- approx(x = ref$r44,
y = ref$cycle, xout = smp$r44)
# append the target cycle to the sample gas
smp <- smp %>%
group_by(timestamp) %>%
mutate(target = target_cycle$y)
# linearly interpolate each reference gas to the target cycle
ref <- ref %>%
group_by(timestamp) %>%
# this is needed because the reference has one more cycle
mutate(target = c(target_cycle$y, NA)) %>%
# filter out all the failed ones (no interpolation possible)
filter(!is.na(target)) %>%
# calculate interpolated value based on r44 interpolation (i.e., don't
# actually interpolate this value but shift it based on the 44
# interpolation)
mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
select(timestamp, target, r44:r45)
# add new reference gas intensities to the correct sample gasses by the target cycle
left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
# filter the sample gasses (since we convert ref to sample)
filter(!isref) %>%
# add empty new columns
mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
select(-r45) %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
spread(isref, r44) %>%
group_by(timestamp) %>%
mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
data %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
gather(r, value, r44:r45) %>%
unite(ru, r, isref, sep = "_") %>%
spread(ru, value) %>%
group_by(timestamp) %>%
mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
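The original goal of the question, running approx per group inside a pipe without reshaping, can also be sketched with dplyr's group_modify (this assumes dplyr 0.8.1 or later), which hands the rows of each group to a function and binds the results. The sketch reuses the question's fake data without r45; the result name targets is made up, and target_cycle follows the naming used above.

```r
library(dplyr)

# Fake data from the question, without r45
data <- data.frame(
  timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
  cycle = rep(c(rep(1:3, 2), 4), 3),
  isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
  r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)
)

targets <- data %>%
  group_by(timestamp) %>%
  group_modify(~ {
    # .x holds the rows of one timestamp (without the grouping column)
    ref <- filter(.x, isref)
    smp <- filter(.x, !isref)
    # cycle at which the reference r44 matches each sample r44
    smp$target_cycle <- approx(x = ref$r44, y = ref$cycle, xout = smp$r44)$y
    smp
  }) %>%
  ungroup()

targets
```

This returns one row per sample measurement, so the filtering into reference and sample happens inside each group rather than before grouping.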
I'm looking for a way to find subsets of rows (each subset should contain 6 rows) whose column means are as similar as possible across multiple columns. That is, I would like R to search through my data.frame and create subsets of 6 rows each, so that the resulting subsets are as similar to each other as possible. Similarity could be measured as the Euclidean distance (as pointed out by @David Robinson).
My data looks like this:
TID Cue1 Cue2 Cue3
1 2.06 1.90 3.82
2 5.18 4.13 5.10
3 5.09 2.85 2.80
4 1.93 4.14 4.75
... ... ... ...
I'd now like to know whether there is a way in R to do the following:
- give me e.g. 4 subsets containing 6 rows each, such that the 4 subsets are as similar as possible in their Cue1, Cue2 and Cue3 means (SD isn't important), while each subset contains unique rows (no duplicate rows between the subsets).
One example would be (not matching the data in my example):
- subset 1 contains TID 1, TID 6, TID 14, TID 28, TID 39, TID 50 and this subset has the cue means (Cue1 = 3.2; Cue2 = 2.5; Cue3 = 4)
- subset 2 contains TID 3, TID 12, TID 20, TID 40, TID 54, TID 59 and this subset has the cue means (Cue1 = 3.3; Cue2 = 2.6; Cue3 = 4.1).
So the two subsets are very (ideally most) similar in their cue means. R should then give me the row numbers (or the TID values) that form the subsets.
Is there any way to do this in R?
Here is a reproducible example of what my data looks like:
mysamp <- function(n, m, s, lwr, upr, nnorm) {
set.seed(1)
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}
}
Cue1 <- mysamp(n=60, m=3, s=1.5, lwr=1, upr=6, nnorm=1000)
Cue2 <- mysamp(n=60, m=3, s=2.5, lwr=1, upr=6, nnorm=1000)
Cue3 <- mysamp(n=60, m=4, s=1.5, lwr=1, upr=6, nnorm=1000)
df <- data.frame(TID= 1:60, Cue1= Cue1, Cue2= Cue2, Cue3= Cue3)
This is a clustering problem, so you'd want to approach it by:
1. Calculating a distance matrix
2. Using that to construct a "tree" of similar groups of nodes
3. Extracting sub-clusters of your size that appear lowest on the tree
The distance matrix and hierarchical clustering can be done as:
distances <- dist(df[, -1])
h <- hclust(distances)
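To see what these calls produce on something small, here is a toy sketch on six made-up rows. The toy data, the choice of k = 2 and the names toy_dist, toy_tree and group_means are assumptions for illustration, not the question's data.

```r
# Six made-up rows forming two obvious groups
toy <- data.frame(
  TID  = 1:6,
  Cue1 = c(1.0, 1.1, 0.9, 5.0, 5.1, 4.9),
  Cue2 = c(2.0, 2.1, 1.9, 4.0, 4.2, 3.9),
  Cue3 = c(3.0, 2.9, 3.1, 1.0, 1.1, 0.9)
)

toy_dist <- dist(toy[, -1])   # Euclidean distances between rows (TID dropped)
toy_tree <- hclust(toy_dist)  # hierarchical clustering tree

# Cut the tree into two groups and attach the membership to the data
toy$group <- cutree(toy_tree, k = 2)

# Cue means per group, to inspect how the groups differ
group_means <- aggregate(cbind(Cue1, Cue2, Cue3) ~ group, data = toy, FUN = mean)
group_means
```

cutree returns one integer per row, so the membership vector lines up with the rows of the original data frame and can be used to look up TID values.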
There are many approaches to algorithmically pulling off low clusters on the tree; since I'm accustomed to working with dplyr/purrr/tidyr I'll show one solution. This takes the approach of using cutree to break the tree apart at every possible level, then finding the first time each group of six appears.
library(dplyr)
library(tidyr)
library(purrr)
clusterings <- data_frame(ncluster = seq(nrow(df), 1)) %>%
unnest(membership = map(ncluster, ~ cutree(h, .))) %>%
group_by(ncluster) %>%
mutate(row = row_number()) %>%
ungroup() %>%
nest(-ncluster, -membership) %>%
mutate(size = map_dbl(data, nrow)) %>%
filter(size == 6) %>%
distinct(membership, .keep_all = TRUE) %>%
unnest(data) %>%
mutate(TID = df$TID[row])
On your data, this returns:
# A tibble: 42 × 5
ncluster membership size row TID
<int> <int> <dbl> <int> <int>
1 29 9 6 9 9
2 29 9 6 30 30
3 29 9 6 39 39
4 29 9 6 41 41
5 29 9 6 43 43
6 29 9 6 57 57
7 21 13 6 15 15
8 21 13 6 20 20
9 21 13 6 25 25
10 21 13 6 29 29
# ... with 32 more rows
Thus, (9, 30, 39, 41, 43, 57) make up your first group of 6, while the second group starts with (15, 20, 25, 29...)
I have a data table like this
ID DAYS FREQUENCY
"ads" 20 3
"jwa" 45 2
"mno" 4 1
"ads" 13 3
"jwa" 60 2
"ads" 18 3
I want to add a column that, within each ID, subtracts from each DAYS value the closest smaller DAYS value for that ID.
My new table would look like this:
ID DAYS FREQUENCY DAYS DIFF
"ads" 20 3 2 (because 20-18)
"jwa" 45 2 NA (because no value smaller than 45 for that id)
"mno" 4 1 NA
"ads" 13 3 NA
"jwa" 60 2 15
"ads" 18 3 5
Bonus: Is there a way to use the merge function?
Here's an answer using dplyr:
library(dplyr)
mydata %>%
mutate(row.order = row_number()) %>% # row numbers added to preserve original row order
group_by(ID) %>%
arrange(DAYS) %>%
mutate(lag = lag(DAYS)) %>%
mutate(days.diff = DAYS - lag) %>%
ungroup() %>%
arrange(row.order) %>%
select(ID, DAYS, FREQUENCY, days.diff)
Output:
ID DAYS FREQUENCY days.diff
<fctr> <int> <int> <int>
1 ads 20 3 2
2 jwa 45 2 NA
3 mno 4 1 NA
4 ads 13 3 NA
5 jwa 60 2 15
6 ads 18 3 5
You can do this using dplyr and a quick loop:
library(dplyr)
# Rowwise data.frame creation because I'm too lazy not to copy-paste the example data
df <- tibble::tribble(
~ID, ~DAYS, ~FREQUENCY,
"ads", 20, 3,
"jwa", 45, 2,
"mno", 4, 1,
"ads", 13, 3,
"jwa", 60, 2,
"ads", 18, 3
)
# Subtract each number in a numeric vector with the one following it
rolling_subtraction <- function(x) {
out <- vector('numeric', length(x))
for (i in seq_along(out)) {
out[[i]] <- x[i] - x[i + 1] # x[i + 1] is NA if the index is out of bounds
}
out
}
# Arrange data.frame in order of ID / Days and apply rolling subtraction
df %>%
arrange(ID, desc(DAYS)) %>%
group_by(ID) %>%
mutate(days_diff = rolling_subtraction(DAYS))
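The loop can also be replaced by dplyr's vectorized lead, and the original row order can be restored afterwards. A sketch combining both ideas, with the example data retyped as a plain data.frame (the names row_order and result are made up here):

```r
library(dplyr)

df <- data.frame(
  ID        = c("ads", "jwa", "mno", "ads", "jwa", "ads"),
  DAYS      = c(20, 45, 4, 13, 60, 18),
  FREQUENCY = c(3, 2, 1, 3, 2, 3)
)

result <- df %>%
  mutate(row_order = row_number()) %>%       # remember the original row order
  group_by(ID) %>%
  arrange(desc(DAYS), .by_group = TRUE) %>%  # largest DAYS first within each ID
  mutate(days_diff = DAYS - lead(DAYS)) %>%  # NA where no smaller value follows
  ungroup() %>%
  arrange(row_order) %>%
  select(-row_order)

result
```

lead(DAYS) is NA for the last row of each group, which reproduces the NA entries in the desired table without any explicit bounds check.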