Proportional subtraction from multiple variables in R

I have two variables that change over time. Addition is known, but for subtraction I know only the total that needs to be subtracted, and I want to subtract it from both variables proportionally.
Here is an example dataset:
library(dplyr)

df = data.frame(id = c(rep(1, 5), rep(2, 3)),
                ord = c(1:5, 1:3),
                a = c(10, NA, 20, 0, NA, 0, 15, NA),
                b = c(0, NA, 0, 15, NA, 10, 0, NA),
                substract = c(NA, -5, NA, NA, -10, NA, NA, -15)) %>%
  rowwise() %>%
  mutate(all = sum(c(a, b, substract), na.rm = TRUE)) %>%
  arrange(id, ord) %>%
  group_by(id) %>%
  mutate(all = cumsum(all)) %>%
  ungroup()
So, I want to replace each NA in a and b with the value from substract multiplied by the cumulative sum of a or b respectively, divided by the value of all on the row right before the NA. The problem is that each replacement must take all previous replacements into account, as the cumulative sums of a and b change after every fill. For example, in group 1 the first NA in a becomes -5 * 10/10 = -5, while the later one becomes -10 * 25/40 = -6.25, because the running sum of a is 25 and the prior all is 40 once the earlier replacement is counted.
I have a solution with a while loop that works, but it is highly inefficient. The original dataset is huge, so it is not an option for me, but it might give some additional insight into what I would like to achieve.
test = df %>%
  group_by(id)

while (any(is.na(test$a))) {
  test = test %>%
    mutate(across(c("a", "b"), ~ ifelse(is.na(.x), lag(cumsum(.x)) / lag(all) * substract, .x)))
}
Could anyone suggest a more efficient solution? For instance, is there any way to make mutate() save changes after each step (so that it does not need to be put into a while loop), or something similar?
EDIT: user63230 suggested using recursive functions. That does seem to be what I want, but I still have difficulties applying it to my case: accumulate2() accepts only a three-argument function and does not seem to work with lag() (I need more than just the previous value of the current variable), so it seems to be not enough. Maybe there is a way to make it work, but I have not discovered it yet. Any help would be very appreciated.
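One way around the three-argument limit, sketched below, is to make the accumulator a list that carries the running sums of a, b, and all, so each step can see every earlier replacement. This is my own untested idea, not from any answer here; it assumes a and b are NA exactly on the substract rows, and it fills both columns:
library(dplyr)
library(purrr)

fill_group <- function(g) {
  acc <- accumulate(
    seq_len(nrow(g)),
    .init = list(cumA = 0, cumB = 0, all = 0, a = NA_real_, b = NA_real_),
    function(s, i) {
      a <- g$a[i]; b <- g$b[i]
      if (is.na(a)) {  # proportional replacement based on the running sums
        a <- s$cumA / s$all * g$substract[i]
        b <- s$cumB / s$all * g$substract[i]
      }
      list(cumA = s$cumA + a, cumB = s$cumB + b, all = s$all + a + b, a = a, b = b)
    }
  )[-1]  # drop the .init element
  g$a <- map_dbl(acc, "a")
  g$b <- map_dbl(acc, "b")
  g
}

df %>%
  group_by(id) %>%
  group_modify(~ fill_group(.x)) %>%
  ungroup()
accumulate() only hands the function the previous state and the current element, but the state itself can hold as many running totals as needed, which is what lag() inside mutate() cannot provide.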

Using a similar approach as here, I think this would work, although it is not pretty:
library(dplyr)

sp <- split(df, df$id)
list_of_dfs <- lapply(sp, function(x){
  for (i in which(is.na(x$a))) {
    tmp <- x[seq_len(i), ]
    x$a[i] <- tail(cumsum(tmp$a)[!is.na(cumsum(tmp$a))], 1) / tail(dplyr::lag(tmp$all), 1) * tail(tmp$substract, 1)
  }
  x
})
bind_rows(list_of_dfs)
# id ord a b substract all
# <dbl> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 10 0 NA 10
# 2 1 2 -5 NA -5 5
# 3 1 3 20 0 NA 25
# 4 1 4 0 15 NA 40
# 5 1 5 -6.25 NA -10 30
# 6 2 1 0 10 NA 10
# 7 2 2 15 0 NA 25
# 8 2 3 -9 NA -15 10
Can this be repeated/automated for b, if suitable?

Conditional cumulative sum from two columns

I can't get my head around the following problem.
Assuming the following data:
library(tidyverse)

df <- tibble(source = c("A", "A", "B", "B", "B", "C"),
             value = c(5, 10, NA, NA, NA, 20),
             add = c(1, 1, 1, 2, 3, 4))
What I want to do is: for all rows where source == "B", calculate a running sum that starts from the previous row's value and adds the current row's add. Of course, for the first "B" row, I need to provide a starting value for value. Note: in this case, it would be fine to just take the value from the last "A" row.
So for row 3, the result would be 10 + 1 = 11.
For row 4, the result would be 11 + 2 = 13.
For row 5, the results would be 13 + 3 = 16.
I tried to use purrr::accumulate, but I failed in many different ways, e.g. I thought I could do:
df %>%
  mutate(test = accumulate(add, .init = 10, ~ .x + .y))
But this leads to an error:
Error: Problem with `mutate()` column `test`.
i `test = accumulate(add, .init = 10, ~.x + .y)`.
i `test` must be size 6 or 1, not 7.
The same happens if I use .init = value.
I also didn't manage to do the job only on group B (although this is probably no issue; I think I can perform it on the full data frame and then just replace the values for all non-B rows).
Expected output:
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
You were essentially in the right direction. Since you provide an .init value to accumulate(), the resulting vector has size n + 1, with the first element being .init. You have to remove that first element to get a vector that fits your column size.
Then, if you want NAs in the remaining rows, here's a way to do it. Also, since the "starting row" is the third, .init has to be set to 8 (the target start of 10 minus the 1 + 1 accumulated over the first two rows).
df %>%
  mutate(test = ifelse(source == "B",
                       accumulate(add, .init = 8, ~ .x + .y)[-1],
                       NA))
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
#tmfmnk provided an awesome answer and they deserve full credit (NOT ME).
Below is the same code from their comment (reposted for more visibility, with an explicit initial value):
init_value = 10

df = df %>%
  mutate(test = lag(value)) %>%
  group_by(source) %>%
  mutate(test = init_value + cumsum(add))
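If hard-coding init_value is undesirable, the starting value can also be derived from the last "A" row. This is a sketch of my own (not from the linked comment), assuming the "B" block always comes after its "A" rows; multiplying add by the logical source == "B" zeroes the contribution of non-B rows:
library(dplyr)

df %>%
  mutate(test = ifelse(source == "B",
                       last(value[source == "A"]) + cumsum(add * (source == "B")),
                       NA))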

I don't know how to express my problem of converting one dataframe into another; please see the details below. Thanks so much!

I want to convert one dataframe (left) into another (right), but I don't know how to do this. Briefly, the nearest preceding value in A should be retained and merged with B. Please take your time to help me! Thank you so much!
[image: the original dataframe (left) and the converted dataframe (right, highlighted background)]
Here is the original data.
df1 = data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  TIME = c(-2, -1, 0, 11, 13, 24, 28, 36, -3, 0, 12, 13, 24),
  A = c(30, 50, NA, 80, NA, NA, 30, NA, 20, NA, NA, 80, NA),
  B = c(NA, NA, 2.0, NA, 2.5, 2.5, NA, 1.0, NA, 1.0, 1.0, NA, 2.5)
)
library(tidyverse)

df <- df1 %>%
  fill(A, .direction = "down") %>%  # fill missing data in A
  drop_na() %>%
  rename(B = A, A = B) %>%          # swap the names of columns A and B
  select(ID, TIME, A, B)            # put the columns in the desired order
Here is how we could do it with a base R method:
# swap A and B
df1[, c(3, 4)] <- df1[, c(4, 3)]
# fill NA with the previous non-NA value from above
df1$B <- na.omit(df1$B)[cumsum(!is.na(df1$B))]
# keep only rows where A is present
df1[complete.cases(df1$A), ]
ID TIME A B
3 1 0 2.0 50
5 1 13 2.5 80
6 1 24 2.5 80
8 1 36 1.0 30
10 2 0 1.0 20
11 2 12 1.0 20
13 2 24 2.5 80
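As an aside, the na.omit()/cumsum() line is a compact last-observation-carried-forward idiom. A toy vector (mine, not from the answer) shows the mechanics; note that it assumes the vector does not start with NA:
x <- c(10, NA, NA, 20, NA)
cumsum(!is.na(x))               # 1 1 1 2 2 -> index of the last non-NA seen so far
na.omit(x)[cumsum(!is.na(x))]   # 10 10 10 20 20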
Here is a solution. It keeps the rows with non-missing values in B after filling the values in A with the previous value. Then it's a matter of rearranging and renaming the columns.
library(dplyr)
library(tidyr)

df1 %>%
  group_by(ID) %>%
  fill(A) %>%
  ungroup() %>%
  na.omit() %>%
  rename(B = A, A = B) %>%
  relocate(B, .after = A)
# A tibble: 7 x 4
# ID TIME A B
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 2 50
#2 1 13 2.5 80
#3 1 24 2.5 80
#4 1 36 1 30
#5 2 0 1 20
#6 2 12 1 20
#7 2 24 2.5 80
Edit
TarJae's comment makes the code above shorter:
df1 %>%
  fill(A) %>%
  select(ID, TIME, A = B, B = A) %>%
  drop_na()
This can also work using the lag() function, though note it relies on the row directly above each kept row having a non-missing A (unlike fill(), it drops the TIME == 24 row of ID 1 here):
df1 %>%
  mutate(new = lag(A)) %>%
  filter(!is.na(new), !is.na(B)) %>%
  select(-A)

R: Count number of times B follows A using dplyr

I have a data.frame of monthly averages of radon measured over a few months. I have labeled each value as either "below" or "above" a threshold and would like to count the number of transitions: below to above, above to below, above to above, or below to below.
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
A bit of digging into a Matlab answer on here suggests that we could use the Matrix package:
require(Matrix)
sparseMatrix(i=c(2,2,2,1), j=c(2,2,2))
This produces a result which I can't yet interpret:
[1,] | |
[2,] | .
Any thoughts about a tidyverse method?
Sure, just use group_by() and count the values:
library(dplyr)

df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

df %>%
  group_by(grp = paste(level, lead(level))) %>%
  summarise(n = n()) %>%
  # drop the observation that does not have a "next" value
  filter(!grepl(pattern = "NA", x = grp))
#> # A tibble: 3 × 2
#> grp n
#> <chr> <int>
#> 1 above above 2
#> 2 above below 1
#> 3 below above 1
You could use table from base R:
table(df$level[-1], df$level[-nrow(df)])
above below
above 2 1
below 1 0
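If a tidy data frame is preferred over the printed contingency table, as.data.frame() flattens it (the to/from labels are my own addition):
as.data.frame(table(to = df$level[-1], from = df$level[-nrow(df)]))
#      to  from Freq
# 1 above above    2
# 2 below above    1
# 3 above below    1
# 4 below below    0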
EDIT in response to #HCAI's comment: applying table to multiple columns:
First, generate some data:
set.seed(1)
U = matrix(runif(4 * 20), nrow = 20)
dfU = data.frame(round(U))
library(plyr) # for mapvalues
df2 = data.frame(apply(dfU,
                       FUN = function(x) mapvalues(x, from = 0:1, to = c('below', 'above')),
                       MARGIN = 2))
so that df2 contains random 'above' and 'below' values:
X1 X2 X3 X4
1 below above above above
2 below below above below
3 above above above below
4 above below above below
5 below below above above
6 above below above below
7 above below below below
8 above below below above
9 above above above below
10 below below above above
11 below below below below
12 below above above above
13 above below below below
14 below below below below
15 above above below below
16 below above below above
17 above above below above
18 above below above below
19 below above above above
20 above below below above
Now apply table to each column and vectorize the output:
apply(df2,
      FUN = function(x) as.vector(table(x[-1],
                                        x[-nrow(df2)])),
      MARGIN = 2)
which gives us
X1 X2 X3 X4
[1,] 5 2 7 2
[2,] 5 6 4 6
[3,] 6 5 3 6
[4,] 3 6 5 5
All that's left is a bit of care in labeling the rows of the output. Maybe someone can come up with a clever way to merge/join the data frames resulting from apply(df2, FUN=function(x) melt(table(x[-1],x[-nrow(df2)])),2), which would maintain the row names. (I spent some time looking into it but couldn't work out how to do it easily.)
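A sketch of the merge/join alluded to above (my own attempt, assuming the reshape2 package for melt()): melt each column's transition table to long form, then merge the per-column counts on the from/to labels so the row labels are kept.
library(reshape2)

melted <- lapply(names(df2), function(nm) {
  x <- df2[[nm]]
  out <- melt(table(x[-1], x[-nrow(df2)]))
  names(out) <- c("to", "from", nm)
  out
})
# successive merges keep the transition labels as ordinary columns
Reduce(function(a, b) merge(a, b, by = c("to", "from")), melted)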
Not run, so there may be a typo, but you get the idea. I'll leave it to you to deal with NA and the first observation. Single pass through the vector.
library(dplyr)

df %>%
  summarize(increase = sum(case_when(value > lag(value) ~ 1, TRUE ~ 0)),
            decrease = sum(case_when(value < lag(value) ~ 1, TRUE ~ 0)),
            constant = sum(case_when(value == lag(value) ~ 1, TRUE ~ 0)))
A slightly different version:
library(dplyr)
library(stringr)

df %>%
  group_by(level = str_c(level, lead(level), sep = " ")) %>%
  count(level) %>%
  na.omit()
level n
<chr> <int>
1 above above 2
2 above below 1
3 below above 1
Another possible solution, based on tidyverse:
library(tidyverse)

df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

df %>%
  mutate(changes = str_c(lag(level), level, sep = "_")) %>%
  count(changes) %>%
  drop_na(changes)
#> changes n
#> 1 above_above 2
#> 2 above_below 1
#> 3 below_above 1
Yet another solution, based on data.table:
library(data.table)

dt <- data.table(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

dt[, changes := paste(shift(level), level, sep = "_")
   ][2:.N][, .(n = .N), keyby = .(changes)]
#> changes n
#> 1: above_above 2
#> 2: above_below 1
#> 3: below_above 1

Conditional rolling sum loop in R

I'm looking for some kind of conditional rolling sum. I thought a while loop would do what I need, but I'm having trouble implementing it. The sum should look like PCAR[1]*time[1] + PCAR[2]*time[2] + PCAR[3]*time[3], and so on, where [] references the row of the column. It should accumulate until the cumulative time value reaches 100 years, write that value to a column, then start again and accumulate while cumulative time is between 100 and 200, and so on to the bottom of the dataset. It will be applied to datasets of varying sizes spanning tens of thousands of years.
I hope that makes sense. In the example data below, the PCAR_BIN column is the outcome I'm aiming for.
library(tibble)

df <- tibble(cumulative.time = c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200),
             PCAR = 1:10,
             time = 1:10,
             depth.along.core = 1:10,
             Age.cal.BP = 1:10,
             AFBD = 1:10,
             assumed.C = rep(0.5, 10),
             PCAR_BIN = c(55, 330, NA, NA, NA, NA, NA, NA, NA, NA))
The function looks like:
MBA <- function(data) {
  require(dplyr)
  data %>%
    mutate(PCAR = ((lead(depth.along.core) - depth.along.core) / (lead(Age.cal.BP) - Age.cal.BP)) * AFBD * assumed.C * 10000,
           PCA_NCP = PCAR * (lead(Age.cal.BP) - Age.cal.BP),
           PCA_NCP = ifelse(is.na(PCA_NCP), 0, PCA_NCP),
           CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
           CCP_Top_Down = CCP_Bottom_Up[1] - CCP_Bottom_Up,
           PCAR_BIN = ifelse(cumulative.time <= 100, sum(PCAR * time + lead(PCAR) * lead(time)), NA))
}
Obviously I had no luck with the ifelse statement, as it would only work for one iteration of time and the sum is wrong. I've tried something similar with while and for loops, but with no luck. Part of the problem is that I'm not sure how to express the sum that I need. I've also tried binning the data with case_when and working off that, again with no luck.
Thanks people :)
EDIT
Following Martin's method, I now have the function working up to creating the ROLLSUM column. I now need to create a column that gives the maximum value for each century group. Running the code from slice_max() onward gives me the error:
Error in eval(lhs, parent, parent) : object 'tmp' not found
I've added the real data too.
dput(head(EMC))
structure(list(depth.along.core = c(0.5, 1.5, 2.5, 3.5, 4.5,
5.5), Age.cal.BP = c(-56.016347625, -55.075825875, -54.201453125,
-53.365755375, -52.541258625, -51.700488875), time = c(0.94052175,
0.87437275, 0.83569775, 0.82449675, 0.84076975, 0.88451675),
cumulative.time = c(0.94052175, 1.8148945, 2.65059225, 3.475089,
4.31585875, 5.2003755), AFBD = c(0.0711, 0.057, 0.0568, 0.0512,
0.0559, 0.0353), assumed.C = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5
)), row.names = c(NA, 6L), class = "data.frame")
MBA <- function(data) {
  require(dplyr)
  data %>%
    mutate(PCAR = ((lead(depth.along.core) - depth.along.core) / (lead(Age.cal.BP) - Age.cal.BP)) * AFBD * assumed.C * 10000,
           PCA_NCP = PCAR * (lead(Age.cal.BP) - Age.cal.BP),
           PCA_NCP = ifelse(is.na(PCA_NCP), 0, PCA_NCP),
           CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
           CCP_Top_Down = CCP_Bottom_Up[1] - CCP_Bottom_Up) %>%
    slice(1:(n() - 1)) %>%
    group_by(Century = cut(cumulative.time, breaks = seq(0, max(cumulative.time), 100))) %>%
    mutate(ROLLSUM = rev(cumsum(PCAR * time))) %>%
    slice_max(order_by = ROLLSUM, n = 1) %>%
    pull(ROLLSUM) %>%
    df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))
}
You could try this:
# Get cumulative sums by group (assuming per-century groups)
df <- df %>%
  group_by(Century = cut(cumulative.time,
                         breaks = seq(0, max(cumulative.time), 100))) %>%
  mutate(ROLLSUM = rev(cumsum(PCAR * time)))

# Get the maximum of each group
groupMaxima <- df %>%
  slice_max(order_by = ROLLSUM, n = 1) %>%
  pull(ROLLSUM)

# Fill the column as desired
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))
We simply create a factor column to group the cumulative time column by centuries and use that factor to sum up the values. Lastly we edit the rolling sum column to contain only the max values and fill the other rows with NA.
# A tibble: 10 x 10
# Groups: Group [2]
cumulative.time PCAR time depth.along.core Age.cal.BP AFBD assumed.C PCAR_BIN Group ROLLSUM
<dbl> <int> <int> <int> <int> <int> <dbl> <dbl> <fct> <int>
1 20 1 1 1 1 1 0.5 55 (0,100] 55
2 40 2 2 2 2 2 0.5 330 (0,100] 330
3 60 3 3 3 3 3 0.5 NA (0,100] NA
4 80 4 4 4 4 4 0.5 NA (0,100] NA
5 100 5 5 5 5 5 0.5 NA (0,100] NA
6 120 6 6 6 6 6 0.5 NA (100,200] NA
7 140 7 7 7 7 7 0.5 NA (100,200] NA
8 160 8 8 8 8 8 0.5 NA (100,200] NA
9 180 9 9 9 9 9 0.5 NA (100,200] NA
10 200 10 10 10 10 10 0.5 NA (100,200] NA
Edit:
For this special case:
MBA <- function(data) {
  require(dplyr)
  data <- data %>%
    mutate(PCAR = ((lead(depth.along.core) - depth.along.core) / (lead(Age.cal.BP) - Age.cal.BP)) * AFBD * assumed.C * 10000,
           PCA_NCP = PCAR * (lead(Age.cal.BP) - Age.cal.BP),
           PCA_NCP = ifelse(is.na(PCA_NCP), 0, PCA_NCP),
           CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
           CCP_Top_Down = CCP_Bottom_Up[1] - CCP_Bottom_Up)
  data <- data %>%
    group_by(CTIME = cut(cumsum(cumulative.time),
                         breaks = seq(0, max(cumsum(cumulative.time), na.rm = TRUE), 100))) %>%
    mutate(ROLLSUM = rev(cumsum(PCAR * time)))
  groupMaxima <- data %>%
    slice_max(order_by = ROLLSUM, n = 1) %>%
    pull(ROLLSUM)
  data$ROLLSUM <- c(groupMaxima, rep(NA, nrow(data) - length(groupMaxima)))
  data
}
There are a number of ways. If your steps are really steps of 100 years and the values go 0, 20, 40, ... in constant intervals, you can do this natively:
steps = 100
intervals = 20
ratio = steps / intervals
columns = df[, c("PCAR", "time")]
# row indices at which each 100-year bin ends: 5, 10
indices = cumsum(rep(ratio, nrow(df) / ratio))
PCAR_BIN = lapply(indices, function(x){
  localRange = (x - ratio + 1):x
  sum(columns[localRange, 1] * columns[localRange, 2])
}) %>% unlist
We can now bind PCAR_BIN, padded with NA to the table height (named PCAR_BIN_calc here so it does not clash with the expected-output column already in df):
df$PCAR_BIN_calc = c(PCAR_BIN, rep(NA, nrow(df) - length(PCAR_BIN)))

Mutate a value across several columns using dplyr selectors only

I want to calculate the sd for several columns inside a data frame without leaving my dplyr pipe. In the past, I have done this by defaulting to base R. I haven't been able to find a solution here that works.
It may help to provide some context. This is a process I do to validate survey data. We measure the sd of matrix questions to identify straight-liners. An sd of zero across the columns flags a straight line. In the past, I calculated this in base R as follows:
apply(x, 1, sd)
I know there has to be a way to do this within a dplyr pipe. I've tried several options including pmap and various approaches at mutate_at. Here's my latest attempt:
library(tidyverse)
set.seed(858465)

scale_points <- c(1:5)
q1 <- sample(scale_points, replace = TRUE, size = 100)
q2 <- sample(scale_points, replace = TRUE, size = 100)
q3 <- sample(scale_points, replace = TRUE, size = 100)

digits = 0:9
createRandString <- function() {
  v = c(sample(LETTERS, 5, replace = TRUE),
        sample(digits, 4, replace = TRUE),
        sample(LETTERS, 1, replace = TRUE))
  return(paste0(v, collapse = ""))
}

s_data <- tibble::tibble(resp_id = 100)
for (i in c(1:100)) {
  s_data[i, 1] <- createRandString()
}
s_data <- bind_cols(s_data, q1 = q1, q2 = q2, q3 = q3)

s_data %>% mutate(vars(starts_with("q"), ~ sd(.)))
In a perfect world, I would keep the resp_id variable in the output so that I could generate a report using filter to identify the respondent IDs with sd == 0.
Any help is greatly appreciated!
If we need a rowwise sd,
library(tidyverse)

s_data %>%
  mutate(sdQs = select(., starts_with("q")) %>%
           pmap_dbl(~ sd(c(...)))) %>%
  filter(sdQs == 0)
# A tibble: 9 x 5
# resp_id q1 q2 q3 sdQs
# <chr> <int> <int> <int> <dbl>
#1 JORTY8990R 3 3 3 0
#2 TFYAF4729I 5 5 5 0
#3 VPUYC0789H 4 4 4 0
#4 LHAPM6293X 1 1 1 0
#5 FZQRQ8530P 3 3 3 0
#6 TKTJU3757T 5 5 5 0
#7 AYVHO1309H 4 4 4 0
#8 BBPTZ4822E 5 5 5 0
#9 NGLXT1705B 3 3 3 0
Or another option is rowSds from matrixStats:
library(matrixStats)

s_data %>%
  mutate(sdQs = rowSds(as.matrix(.[startsWith(names(.), "q")])))
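For completeness, a further option (my addition, assuming dplyr >= 1.0) is rowwise() with c_across(), which keeps everything in one pipe and preserves resp_id for the follow-up filter:
library(dplyr)

s_data %>%
  rowwise() %>%
  mutate(sdQs = sd(c_across(starts_with("q")))) %>%
  ungroup() %>%
  filter(sdQs == 0)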
