I'm looking for some kind of conditional rolling sum. I thought a while loop would do what I need, but I'm having trouble implementing it. The sum should look like PCAR[1]*time[1] + PCAR[2]*time[2] + PCAR[3]*time[3] etc., where [] references the row of the column, and it should accumulate until the cumulative time value reaches 100 years; the loop should then add this value to a column and start again for the interval from 100 up to 200 years, and so on until the bottom of the data set. It's going to be applied to data sets of varying sizes spanning tens of thousands of years.
I hope that makes sense. In the example data below the PCAR_BIN column is what I'm aiming for as the outcome.
df <- tibble(cumulative.time = c(20,40,60,80,100, 120,140,160,180,200),
PCAR =1:10,
time = 1:10,
depth.along.core = 1:10,
Age.cal.BP = 1:10,
AFBD = 1:10,
assumed.C = rep(0.5, 10),
PCAR_BIN = c(55,330,NA,NA,NA,NA,NA,NA,NA,NA))
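For reference, the two non-NA values in PCAR_BIN can be reproduced directly from the toy data; each one is just sum(PCAR * time) within one 100-year block of cumulative.time:
# first block: cumulative.time up to 100
sum(df$PCAR[df$cumulative.time <= 100] * df$time[df$cumulative.time <= 100])
#[1] 55
# second block: cumulative.time between 100 and 200
sum(df$PCAR[df$cumulative.time > 100] * df$time[df$cumulative.time > 100])
#[1] 330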
The function looks like
MBA <- function(data) {
require(dplyr)
data %>% mutate(PCAR=((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up,
PCAR_BIN = ifelse(cumulative.time <= 100, sum(PCAR*time+lead(PCAR)*lead(time),NA)
)}
Obviously I had no luck with the ifelse statement, as it only works for one iteration of time and the sum is wrong. I've tried similar approaches with while and for loops, but with no luck. Part of the problem is that I'm not sure how to express the sum that I need. I've also tried binning the data with case_when and working off that, but again with no luck.
Thanks people :)
EDIT
Following Martin's method I now have the function working up to creating the ROLLSUM column. I now need to create a column that will give the maximum value for each century group. Running the code from slice_max onward gives me the error:
Error in eval(lhs, parent, parent) : object 'tmp' not found
I've added the real data too.
dput(head(EMC))
structure(list(depth.along.core = c(0.5, 1.5, 2.5, 3.5, 4.5,
5.5), Age.cal.BP = c(-56.016347625, -55.075825875, -54.201453125,
-53.365755375, -52.541258625, -51.700488875), time = c(0.94052175,
0.87437275, 0.83569775, 0.82449675, 0.84076975, 0.88451675),
cumulative.time = c(0.94052175, 1.8148945, 2.65059225, 3.475089,
4.31585875, 5.2003755), AFBD = c(0.0711, 0.057, 0.0568, 0.0512,
0.0559, 0.0353), assumed.C = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5
)), row.names = c(NA, 6L), class = "data.frame")
MBA <- function(data) {
require(dplyr)
data %>% mutate(PCAR=((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up)%>%
slice(1:(n()-1))%>%
group_by(Century = cut(cumulative.time, breaks = seq(0, max(cumulative.time), 100)))%>%
mutate(ROLLSUM = rev(cumsum(PCAR*time)))%>%
slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)%>%
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))}
You could try this:
# Get cumulative sums by group (assuming per century groups)
df <- df %>%
group_by(Century = cut(cumulative.time,
breaks = seq(0, max(cumulative.time), 100))) %>%
mutate(ROLLSUM = rev(cumsum(PCAR * time)))
# Get maximum of each group
groupMaxima <- df %>%
slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)
# Fill column as desired
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))
We simply create a factor column to group the cumulative time column by centuries and use that factor to sum up the values. Lastly we edit the rolling sum column to contain only the max values and fill the other rows with NA.
# A tibble: 10 x 10
# Groups: Group [2]
cumulative.time PCAR time depth.along.core Age.cal.BP AFBD assumed.C PCAR_BIN Group ROLLSUM
<dbl> <int> <int> <int> <int> <int> <dbl> <dbl> <fct> <int>
1 20 1 1 1 1 1 0.5 55 (0,100] 55
2 40 2 2 2 2 2 0.5 330 (0,100] 330
3 60 3 3 3 3 3 0.5 NA (0,100] NA
4 80 4 4 4 4 4 0.5 NA (0,100] NA
5 100 5 5 5 5 5 0.5 NA (0,100] NA
6 120 6 6 6 6 6 0.5 NA (100,200] NA
7 140 7 7 7 7 7 0.5 NA (100,200] NA
8 160 8 8 8 8 8 0.5 NA (100,200] NA
9 180 9 9 9 9 9 0.5 NA (100,200] NA
10 200 10 10 10 10 10 0.5 NA (100,200] NA
Edit:
For this special case:
MBA <- function(data) {
require(dplyr)
data <- data %>% mutate(PCAR = ((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP = replace(PCA_NCP, is.na(PCA_NCP), 0),
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up)
data <- data %>%
group_by(CTIME = cut(cumsum(cumulative.time),
breaks = seq(0, max(cumsum(cumulative.time), na.rm = T), 100))) %>%
mutate(ROLLSUM = rev(cumsum(PCAR*time)))
groupMaxima <- data %>% slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)
data$ROLLSUM <- c(groupMaxima, rep(NA, nrow(data) - length(groupMaxima)))
data
}
There are a number of ways to do this. If your steps really are steps of 100 years, and the cumulative time goes 0, 20, 40, ... in constant intervals, you can do it natively:
steps = 100
intervals = 20
ratio = steps / intervals
columns = df[,c("PCAR","time")]
# one index per complete bin: 5, 10, ... up to the last full bin
indices = seq(ratio, nrow(df), by = ratio)
PCAR_BIN = lapply(indices, function(x){
  localRange = (x - ratio + 1):x
  sum(columns[localRange, 1] * columns[localRange, 2])
})%>% unlist
We can now add PCAR_BIN back to df, padding with NA so it matches the number of rows:
df$PCAR_BIN = c(PCAR_BIN, rep(NA, nrow(df) - length(PCAR_BIN)))
I want to perform matching between two groups in a data frame consisting of 10 million rows, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. The end result should be a data frame with 2 columns: (1) id number and (2) id number of the matched row. To do this, I use the outer function. See the toy example below:
set.seed(123)
# Creating data
df <- data.frame(id = c(1:10000000),
group = rbinom(10000000,1, 0.3),
value = round(runif(10000000),2))
threshold <- round(sd(df$value)*0.1,2)
#################################################################
# Identifying matches
library(tidyverse)
library(data.table)
# All values
dist_mat <- df$value
# Adding identifier
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <-dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
dist_mat <- abs(outer(dist_mat_row, dist_mat_col, "-"))
# Identifying matches that fulfills the criteria
dist_mat <- dist_mat <= threshold
# From matrix to a long dataframe
dist_mat <- melt(dist_mat)
# Tidying up the dataframe and dropping unnecessary columns and rows.
dist_mat <- dist_mat %>%
rename(id = Var1,
matched_id = Var2,
cond = value) %>%
filter(cond == TRUE) %>%
left_join(df, by = "id") %>%
select(id, matched_id)
This code works for smaller datasets but is having issues when scaling up the data size (for obvious reasons). You can try to reduce the data frame size to 100 or 1000 rows and it should run more smoothly. The issue is related to the outer function and is stated as: Error: cannot allocate vector of size 156431.9 Gb.
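For context, the size quoted in that error is roughly just the full outer() distance matrix of doubles:
# expected group sizes with rbinom(1e7, 1, 0.3): ~3e6 rows in group 1, ~7e6 in group 0
n1 <- 0.3 * 1e7
n0 <- 0.7 * 1e7
n1 * n0 * 8 / 1024^3
#[1] 156462.2   # ~156,000 GiB, in line with the allocation error (exact group counts differ slightly)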
As a way to solve this, I tried to do the matching row-wise, i.e., one row at a time. But this takes a tremendously long time (2500 rows in 8h, where I have 3 million rows to loop through...). See code below:
dist_mat <- df$value
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <-dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Difference between each value
matched_df <- data.frame()
for (i in 1:length(dist_mat_row)) {
print(i)
dist_mat <- as.matrix(abs(outer(dist_mat_row[i], dist_mat_col, "-")))
colnames(dist_mat) <- names(dist_mat_col)
rownames(dist_mat) <- names(dist_mat_row[i])
dist_mat <- dist_mat <= threshold
# From matrix to a long dataframe
dist_mat <- melt(dist_mat)
# Tidying up the dataframe and dropping unnecessary columns and rows.
dist_mat <- dist_mat %>%
rename(id = Var1,
matched_id = Var2,
cond = value) %>%
filter(cond == TRUE) %>%
left_join(df, by = "id") %>%
select(id, matched_id)
matched_df <- rbind(matched_df, dist_mat)
rm(dist_mat)
gc()
}
Is there any way of doing this that does not run out of memory or takes a tremendous time? So far, I've been trying to "trim some meat" off the data to reduce the size, and perhaps there are any more ways to do this? An alternative is to not do this the "brute" way but to find an alternative. Does anyone have any suggestions or ideas?
Thanks!
This will be my actual answer.
First, we need a function that generates a data set with a given proportion of unique values. Here it is.
library(tidyverse)
library(collapse)
fdf = function(n, nup=.1) {
vp = 1/n/nup
tibble(
id = c(1:n),
group = rbinom(n, 1, 0.3),
value = round(runif(n)/vp)*vp)
}
For example, let's generate a set of 350 records with a ratio of unique values equal to 0.15
fdf(350, .15) %>% funique(cols=3) %>% nrow()
output
[1] 53
Now a second example: 1000 rows with approximately 100 unique values.
fdf(1000, .1) %>% funique(cols=3) %>% nrow()
output
[1] 101
Now for the most important and crucial piece: a binary search function that finds the range of values in x that lie within tresh of val.
fbin = function(x, val, tresh = 0){
vmin = val - tresh
vmax = val + tresh
n = length(x)
e = .Machine$double.eps
if((x[1]-vmax)>=e | (vmin-x[n])>=e) NULL else{
l = 1
r = n
if(abs(x[1]-vmin)<=e | abs(x[1]-vmax)<=e |
((x[1]-vmin)>=e & (vmax-x[1])>=e)) imin=1 else {
while(l <= r){
i = (l + r) %/% 2
if((vmin-x[i])>e){
l = i + 1
} else {
if(!(vmin-x[i-1])>e){
r = i - 1
} else break
}
}
imin=i
}
l = imin
r = n
if(abs(x[n]-vmin)<=e | abs(x[n]-vmax)<=e |
((x[n]-vmin)>=e & (vmax-x[n])>=e)) imax = n else {
while(l <= r){
i = (l + r) %/% 2
if((x[i]-vmax)>e){
r = i - 1
} else {
if(!((x[i+1]-vmax)>e)){
l = l + 1
} else break
}
}
imax=i
}
imin:imax
}
}
First, a few notes about this feature. I took into account the fact that the val and tresh variables of the double type, and thus, due to the inaccuracy of the calculations, ordinary comparisons cannot be used here
such as x[i]>vmax or x[i]==vmax.
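A quick illustration of the problem (not part of the original function):
0.1 + 0.2 == 0.3
#[1] FALSE   # binary doubles cannot represent 0.1 or 0.2 exactly
abs((0.1 + 0.2) - 0.3) <= .Machine$double.eps
#[1] TRUE    # comparing against a tolerance works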
My search function requires the argument x to be sorted in ascending order!
Let's do some unit tests.
set.seed(123)
x = sample(1:10, 30, replace=T) %>% sort()
x
#[1] 1 2 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7 8 9 9 9 9 9 9 9 10 10 10 10 10
x[fbin(x, 100, 0)]
#integer(0)
x[fbin(x, -10, 0)]
#integer(0)
x[fbin(x, 1, 0)]
#[1] 1
x[fbin(x, 10, 0)]
#[1] 10 10 10 10 10
x[fbin(x, 1, 1)]
#[1] 1 2
x[fbin(x, 10, 1)]
# [1] 9 9 9 9 9 9 9 10 10 10 10 10
x[fbin(x, 5, 0)]
#[1] 5 5 5
x[fbin(x, 5, 2)]
#[1] 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7
x[fbin(x, 5, 10)]
# [1] 1 2 3 3 3 3 3 4 4 5 5 5 6 6 7 7 7 8 9 9 9 9 9 9 9 10 10 10 10 10
As you can see, the function returns the indexes for which the values of the vector x fall within the range [val - tresh, val + tresh].
Now it's time for a real test: we'll see how fbin handles a search over a 10,000,000-element vector.
set.seed(123)
n = 10000000
x = runif(n) %>% round(6) %>% sort()
funique(x) %>% length()
x[fbin(x, .5)]
#[1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
x[fbin(x, .5, .000001)]
# [1] 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999 0.499999
# [10] 0.499999 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000
# [19] 0.500000 0.500000 0.500000 0.500000 0.500000 0.500001 0.500001 0.500001 0.500001
# [28] 0.500001 0.500001 0.500001 0.500001
Now let's see how long such a search will take.
library(microbenchmark)
ggplot2::autoplot(microbenchmark(fbin(x, .5, .001),
fbin(x, .5, .002),
fbin(x, .5, .003),
fbin(x, .5, .004),
times=10))
As you can see, the search takes about 1000 us.
Now let's compare that to the subset functions.
ggplot2::autoplot(microbenchmark(x[fbin(x, .5, .001)],
ss(x, x >= (0.5 - 0.001) & x <= (0.5 + 0.001)),
subset(x, x >= (0.5 - 0.001) & x <= (0.5 + 0.001)),
times=10))
As you can see, it is two to three orders of magnitude faster!
Now it's time for the function that actually solves your task.
fmatch = function(df, tresh){
#Adding a column with the row number
df = df %>% ftransform(row = 1:nrow(.))
#Splitting into two sorted subsets
df0 = df %>% roworder(value) %>% fsubset(group == 0)
df1 = df %>% roworder(value) %>% fsubset(group == 1)
#Transformations on matrices
M0 = df0 %>% qM()
M1 = df1 %>% qM()
#Prepare unique values from group 1
uM1 = df1$value %>% funique()
out = list()
for(i in 1:length(uM1)){
iM0 = fbin(M0[,3], uM1[i], tresh)
if(length(iM0)>0){
iM1 = fbin(M1[,3], uM1[i])
out[[paste0(uM1[i])]] = list(
row0 = M0[iM0, 4],
row1 = M1[iM1, 4]
)
}
}
out
}
How does this function work?
I will describe it step by step.
1. Add a column with the row numbers to the data frame.
2. Split the frame into two sorted subsets, for group 1 and group 0.
3. Convert them into matrices to speed up the operations (you may not need this).
4. Prepare the unique values from the group 1 subset.
5. For each unique value in the group 1 subset, do:
5.1 In the group 0 subset, search for all rows whose value differs from the current unique value by no more than the threshold.
5.2 If any such rows exist, store a list containing the row numbers from the group 1 subset whose value equals the current value, and the matching row numbers from the group 0 subset.
Let's see this with an example.
#Preparation of data and threshold
set.seed(123)
df = fdf(100)
threshold = round(sd(df$value)*0.1,2)
out = fmatch(df, threshold)
df[out[[1]]$row1,]
# # A tibble: 1 x 3
# id group value
# <int> <int> <dbl>
# 1 16 1 0.1
df[out[[1]]$row0,]
# # A tibble: 6 x 3
# id group value
# <int> <int> <dbl>
# 1 10 0 0.1
# 2 13 0 0.1
# 3 28 0 0.1
# 4 29 0 0.1
# 5 48 0 0.1
# 6 55 0 0.1
df[out[[2]]$row1,]
# # A tibble: 3 x 3
# id group value
# <int> <int> <dbl>
# 1 24 1 0.2
# 2 58 1 0.2
# 3 68 1 0.2
df[out[[2]]$row0,]
# # A tibble: 9 x 3
# id group value
# <int> <int> <dbl>
# 1 27 0 0.2
# 2 44 0 0.2
# 3 46 0 0.2
# 4 47 0 0.2
# 5 49 0 0.2
# 6 54 0 0.2
# 7 60 0 0.2
# 8 72 0 0.2
# 9 99 0 0.2
Now I will change the threshold to 0.2 and repeat the test.
out = fmatch(df, 0.2)
df[out[[1]]$row1,]
# # A tibble: 1 x 3
# id group value
# <int> <int> <dbl>
# 1 16 1 0.1
df[out[[1]]$row0,]
# # A tibble: 24 x 3
# id group value
# <int> <int> <dbl>
# 1 43 0 0
# 2 10 0 0.1
# 3 13 0 0.1
# 4 28 0 0.1
# 5 29 0 0.1
# 6 48 0 0.1
# 7 55 0 0.1
# 8 27 0 0.2
# 9 44 0 0.2
# 10 46 0 0.2
# # ... with 14 more rows
df[out[[2]]$row1,]
# # A tibble: 3 x 3
# id group value
# <int> <int> <dbl>
# 1 24 1 0.2
# 2 58 1 0.2
# 3 68 1 0.2
df[out[[2]]$row0,]
# # A tibble: 32 x 3
# id group value
# <int> <int> <dbl>
# 1 43 0 0
# 2 10 0 0.1
# 3 13 0 0.1
# 4 28 0 0.1
# 5 29 0 0.1
# 6 48 0 0.1
# 7 55 0 0.1
# 8 27 0 0.2
# 9 44 0 0.2
# 10 46 0 0.2
# # ... with 22 more rows
Now it's time to test with 100,000 rows.
set.seed(123)
df = fdf(100000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch(df, threshold)
end_time <- Sys.time()
end_time - start_time
#Time difference of 13.9958 secs
object.size(out)
#319309040 bytes
As you can see, the whole thing took only 14 seconds. The output list, however, is 320 MB, which could turn out to be crucial.
I ran another test on a set of 500,000 rows.
set.seed(123)
df = fdf(500000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch(df, threshold)
end_time <- Sys.time()
end_time - start_time
#Time difference of 7.982853 mins
length(out)
#47509
object.size(out)
#7889344576 bytes
As you can see, the fivefold increase in the data set made the run 34 times longer. The output list has grown 24 times and now takes almost 8 GB!
There is a very important conclusion from this: for 10,000,000 rows you will probably not have enough memory to complete the operation. So I suggest slightly modifying the fmatch function so that it returns results only for a specific subset of unique values.
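One way to drive such a restricted version, assuming the vmin/vmax interface of the fmatch1 function shown in the last update below, would be to walk over the value range in chunks and write each piece to disk, for example:
breaks = seq(0, 1, by = 0.005)
for(k in 1:(length(breaks) - 1)){
  # each slice of the value range is matched separately and saved,
  # so the full result never has to fit in memory at once
  chunk = fmatch1(df, threshold, vmin = breaks[k], vmax = breaks[k + 1])
  saveRDS(chunk, sprintf("matches_%03d.rds", k))
  rm(chunk)
  gc()
}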
Perhaps the binary search function could also be optimized a bit more, but I would need to know what the values in the value column of your data frame look like.
However, as you can see, the critical factor here is not the execution time, but the memory availability.
I will be waiting for your opinion.
Also write if my solution is clear to you and if you need any additional explanations.
Last update
I did one more test tonight. It required a minimal modification to my fmatch function: I added two additional arguments, vmin and vmax. The function now only runs for unique values in the range [vmin, vmax).
fmatch1 = function(df, tresh, vmin=0, vmax=1){
#Adding a column with the row number
df = df %>% ftransform(row = 1:nrow(.))
#Splitting into two sorted subsets
df0 = df %>% roworder(value) %>% fsubset(group == 0)
df1 = df %>% roworder(value) %>% fsubset(group == 1)
#Transformations on matrices
M0 = df0 %>% qM()
M1 = df1 %>% qM()
#Prepare unique values from group 1
uM1 = df1$value %>% funique() %>% ss(.>=vmin & .<vmax)
out = list()
for(i in 1:length(uM1)){
iM0 = fbin(M0[,3], uM1[i], tresh)
if(length(iM0)>0){
iM1 = fbin(M1[,3], uM1[i])
out[[paste0(uM1[i])]] = list(
row0 = M0[iM0, 4],
row1 = M1[iM1, 4]
)
}
}
out
}
Now I was able to perform a data frame test with 10,000,000 rows.
However, I limited myself to values in the range [0, 0.005).
set.seed(123)
df = fdf(10000000)
threshold = round(sd(df$value)*0.1,2)
start_time <- Sys.time()
out = fmatch1(df, threshold, 0, .005)
end_time <- Sys.time()
end_time - start_time
#Time difference of 6.865934 mins
length(out)
#4706
object.size(out)
#8557061840 bytes
The whole thing took almost 7 minutes and the result takes up almost 9 GB of memory!
If we now assume that the scaling is roughly linear, we can expect that running the function for all unique values in the 10,000,000-row data frame would take approx. 24 hours and produce approx. 1,800 GB of results. Unfortunately, my computer does not have that much memory.
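The rough arithmetic behind those estimates (the [0, 0.005) slice is 1/200 of the full value range):
200 * 6.87 / 60
#[1] 22.9    # ~24 hours of runtime for the whole range
200 * 8.56
#[1] 1712    # ~1,800 GB for the full result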
In fact, what I am writing now is not really an answer; it is going to be quite a long comment that unfortunately would not fit in one or even several comments. Therefore, I am asking everyone to be understanding and not to criticize what I am writing here.
Now to the point.
I looked at your problem. I've even been able to write a program that does the job in much less time: with 100,000 rows, the program ran for only a few minutes, which, compared to the 8 hours you needed for 2,500 rows, is a clear difference. The problem, however, probably lies in the assumptions of the task itself.
As you write, you have 10,000,000 rows. However, of those 10,000,000 rows, you only have 100 unique values, which is due to round(runif(n), 2). So the first question to ask is: is it the same for your real data?
Later you say you want to match a group 0 id to a group 1 id if the difference between the values is less than the specified threshold (let's assume for a moment the threshold is 0.3). So let's check what that gives in the output. If you only have 100 unique values and 10,000,000 rows, you can expect group 0 to contain around 50,000 rows with the value 0.99. Each of these rows, of course, has a different id. In group 1, however, you will have approximately 3,450,000 rows with values less than 0.69. Now, if you want to match each of those 50,000 ids to the 3,450,000 group 1 ids, you get 172,500,000,000 matches in total!! And recall that we only matched the group 0 ids whose value was 0.99.
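Using the counts above, the arithmetic for that single value is:
50000 * 3450000
#[1] 1.725e+11   # 172.5 billion candidate matches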
Finally, my code for 100,000 rows generated a result set of 10,000,000 rows! And although it did that in minutes, it strained my computer's memory a lot.
In addition, I wonder whether you actually want to match ids not as you write, but when the absolute value of the difference between the values is less than the accepted threshold, i.e. abs(value1 - value0) < threshold?
If you are curious, here is the code I mentioned above.
library(tidyverse)
library(collapse)
set.seed(123)
n = 100000
df = tibble(
id = c(1:n),
group = rbinom(n,1, 0.3),
value = round(runif(n),2))
threshold = round(sd(df$value)*0.1,2)
m1 = df %>%
fsubset(group == 1) %>%
roworder(value) %>%
ftransform(row = 1:nrow(.))
m1.idx = m1 %>% funique(cols=3)
m1.M = m1 %>% qM()
m0 = df %>%
fsubset(group == 0) %>%
roworder(value)
m0.idx = m0 %>% funique(cols=3)
m0.M = m0 %>% qM
out = list()
for(i in 1:nrow(m0.M)){
id0 = m0.M[i,1]
value0 = m0.M[i,3]
value1 = round(value0 - threshold, 2)
idx = m1.idx %>% fsubset(value<=value1) %>% qM
if(nrow(idx)>1){
last.row = idx[nrow(idx), 4]-1
out[[paste0(id0)]] = m0 %>% ss(1:last.row,1)
}
}
dfout = unlist2d(out) %>% frename(.id = id0, id = id1) %>% qTBL()
However, I would suggest a slightly different solution. Perhaps it would be enough to store only each of the 100 unique values from one of the groups and to assign to each of them all the ids from group 0 for which this value occurs, and all the ids from group 1 for which the difference in value is less than the set threshold (or for which the absolute difference of the values is smaller than that threshold).
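A minimal base R sketch of that idea, using only the id, group and value columns; match_by_value is a hypothetical name and I use the absolute-difference rule here, which would still need confirming:
match_by_value = function(df, threshold) {
  # for every unique value, store the group-0 ids with exactly that value
  # and the group-1 ids whose value lies within the threshold of it
  u = sort(unique(df$value[df$group == 0]))
  lapply(setNames(u, u), function(v) {
    list(
      id0 = df$id[df$group == 0 & df$value == v],
      id1 = df$id[df$group == 1 & abs(df$value - v) <= threshold]
    )
  })
}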
Unfortunately, I do not know whether such a solution would be acceptable to you. I will be waiting for your comment.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as well as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
purrr::map_dfr(1:3, function(gr) {
c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
})) %>%
nest(bp = -gr) %>%
mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl2`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
~findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient. With my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any ideas?
You can do this in a group_by + mutate step -
library(dplyr)
df %>%
group_by(gr) %>%
mutate(breakpoints = findInterval(val,
c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied to each gr separately.
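If all you need is the within-group quartile index and ties are not a concern, dplyr::ntile is a shorter (though not strictly identical) alternative to the quantile breakpoints:
df %>%
  group_by(gr) %>%
  mutate(lvl = ntile(val, 4)) %>%
  ungroup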
I would like to know how I can find the minimum and maximum day of year (DoY) based on water temperature (Wtemp) for each site (siteID).
Example Dataset:
df1 <- data.frame(matrix(ncol = 4, nrow = 20))
x <- c("siteID", "Date", "DoY", "Wtemp")
colnames(df1) <- x
df1$siteID <- c(101,101,101,101,101,
102,102,102,102,102,
103,103,103,103,103,
104,104,104,104,104)
df1$Date <- rep(seq(from = as.Date("2020-01-01"), to = as.Date("2020-01-05"), by = 1),4)
df1$DoY <- rep(seq(from = 1, to = 5, by = 1),4)
df1$Wtemp <- c(10,2,6,12,15,
20,15,5,10,16,
2,4,6,8,10,
12,14,16,18,20)
The output should look like this:
siteID DoY_MaxWtemp DoY_MinWtemp
1 101 5 2
2 102 1 3
3 103 5 1
4 104 5 1
We can group by 'siteID', get the index of the 'max' and 'min' values of 'Wtemp' with which.max and which.min respectively, and use that to extract the corresponding values of 'DoY' in summarise.
library(dplyr)
df1 %>%
group_by(siteID) %>%
summarise(Doy_MaxWtemp = DoY[which.max(Wtemp)],
Doy_MinWtemp = DoY[which.min(Wtemp)], .groups = 'drop')
-output
# A tibble: 4 x 3
# siteID Doy_MaxWtemp Doy_MinWtemp
#* <dbl> <dbl> <dbl>
#1 101 5 2
#2 102 1 3
#3 103 5 1
#4 104 5 1
I want to calculate the sd for several columns inside a data frame without leaving my dplyr pipe. In the past, I have done this by defaulting to base R. I haven't been able to find a solution here that works.
It may help to provide some context. This is a process I do to validate survey data. We measure the sd of matrix questions to identify straight-liners. An sd of zero across the columns flags a straight line. In the past, I calculated this in base R as follows:
apply(x, 1, sd)
I know there has to be a way to do this within a dplyr pipe. I've tried several options including pmap and various approaches at mutate_at. Here's my latest attempt:
library(tidyverse)
set.seed(858465)
scale_points <- c(1:5)
q1 <- sample(scale_points, replace = TRUE, size = 100)
q2 <- sample(scale_points, replace = TRUE, size = 100)
q3 <- sample(scale_points, replace = TRUE, size = 100)
digits = 0:9
createRandString<- function() {
v = c(sample(LETTERS, 5, replace = TRUE),
sample(digits, 4, replace = TRUE),
sample(LETTERS, 1, replace = TRUE))
return(paste0(v,collapse = ""))
}
s_data <- tibble::tibble(resp_id = 100)
for(i in c(1:100)) {
s_data[i,1] <- createRandString()
}
s_data <- bind_cols(s_data, q1 = q1, q2 = q2, q3 = q3)
s_data %>% mutate(vars(starts_with("q"), ~sd(.)))
In a perfect world, I would keep the resp_id variable in the output so that I could generate a report using filter to identify the respondent IDs with sd == 0.
Any help is greatly appreciated!
If we need a rowwise sd,
library(tidyverse)
s_data %>%
mutate(sdQs = select(., starts_with("q")) %>%
pmap_dbl(~ sd(c(...)))) %>%
filter(sdQs == 0)
# A tibble: 9 x 5
# resp_id q1 q2 q3 sdQs
# <chr> <int> <int> <int> <dbl>
#1 JORTY8990R 3 3 3 0
#2 TFYAF4729I 5 5 5 0
#3 VPUYC0789H 4 4 4 0
#4 LHAPM6293X 1 1 1 0
#5 FZQRQ8530P 3 3 3 0
#6 TKTJU3757T 5 5 5 0
#7 AYVHO1309H 4 4 4 0
#8 BBPTZ4822E 5 5 5 0
#9 NGLXT1705B 3 3 3 0
Or another option is rowSds from matrixStats
library(matrixStats)
s_data %>%
mutate(sdQs = rowSds(as.matrix(.[startsWith(names(.), "q")])))
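On dplyr 1.0 or later, a rowwise() + c_across() version is another option, though it is usually slower than rowSds on large data:
s_data %>%
  rowwise() %>%
  mutate(sdQs = sd(c_across(starts_with("q")))) %>%
  ungroup() %>%
  filter(sdQs == 0)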
I have a question that I find kind of hard to explain with an MRE and hard to answer in an easy way, mostly because I don't fully understand where the problem lies myself. So that's my sorry-for-being-vague preamble.
I have a tibble with many sample and reference measurements, for which I want to do some linear interpolation for each sample. I currently do this by taking out all the reference measurements, rescaling them to the sample measurements using approx, and then patching them back in. But because I take them out first, I cannot do it nicely within a group_by dplyr pipe. Right now I use a really ugly workaround where I add empty (NA) newly created columns to the sample tibble and then fill them with a for-loop.
So my question really is: how can I implement the approx part within groups inside the pipe, so that I can do everything within groups? I've experimented with dplyr::do() and read the vignette on "programming with dplyr", but searching mostly gives me broom::augment and lm stuff that I think operates differently... (e.g. see Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on IRC recommended using a conditional mutate with case_when, but I don't fully understand where and how to apply it in this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's a MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
# in reality a dttm with the measurement time
timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
# measurement cycle, normally 40 for sample, 41 for reference
cycle = rep(c(rep(1:3, 2), 4), 3),
# whether the measurement is a reference or a sample
isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
# measurement intensity for mass 44
r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
# measurement intensity for mass 45, normally also masses up to mass 49
mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
ggplot(aes(x = cycle, y = r44, colour = isref)) +
geom_line() +
geom_line(aes(y = r45), linetype = 2) +
geom_point() +
geom_point(aes(y = r45), shape = 1) +
facet_grid(~ timestamp)
# what I would like to do
data %>%
group_by(timestamp) %>%
do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
y = data %>% filter(isref) %>% pull(cycle),
xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
# split the data into sample gas and reference gas
ref <- filter(dat, isref)
smp <- filter(dat, !isref)
# calculate the "target cycle", the points at which the reference intensity
# 44 matches the sample intensity 44 with linear interpolation
target_cycle <- approx(x = ref$r44,
y = ref$cycle, xout = smp$r44)
# append the target cycle to the sample gas
smp <- smp %>%
group_by(timestamp) %>%
mutate(target = target_cycle$y)
# linearly interpolate each reference gas to the target cycle
ref <- ref %>%
group_by(timestamp) %>%
# this is needed because the reference has one more cycle
mutate(target = c(target_cycle$y, NA)) %>%
# filter out all the failed ones (no interpolation possible)
filter(!is.na(target)) %>%
# calculate interpolated value based on r44 interpolation (i.e., don't
# actually interpolate this value but shift it based on the 44
# interpolation)
mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
select(timestamp, target, r44:r45)
# add new reference gas intensities to the correct sample gasses by the target cycle
left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
# filter the sample gasses (since we convert ref to sample)
filter(!isref) %>%
# add empty new columns
mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
select(-r45) %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
spread(isref, r44) %>%
group_by(timestamp) %>%
mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
data %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
gather(r, value, r44:r45) %>%
unite(ru, r, isref, sep = "_") %>%
spread(ru, value) %>%
group_by(timestamp) %>%
mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA