How to calculate cumsum with depreciation in a grouped dataframe? - R

I'm trying to calculate a cumulative sum with a depreciation rate.
I have a grouped dataframe with a numeric column num.
I want to accumulate the numbers one by one, applying depreciation at each step.
If the rate is 1, the cumsum function in base R is good enough.
But if not, say a rate of 0.5 (meaning each running total is multiplied by 0.5 before the next number is added), cumsum is not enough.
I tried to write my own function to work with dplyr, but it fails.
library(tidyverse)
# dataframe
id = sample(1:5, 25, replace = TRUE)
num = rnorm(25)
df = data.frame(id, num)
# my custom function
depre = function(data) {
  rate = 0.5
  r = nrow(data)
  sl = data$num
  nl = data$num
  for (i in 2:r) {
    sl[i] = sl[i - 1] * rate + nl[i]
  }
  return(sl)
}
# works on a single group
df %>% filter(id == 1) %>% depre(.)
# fails on the grouped dataframe
df %>% group_by(id) %>% mutate(sl = depre(.))
I expect the first element of column sl to be the same as in column num.
Each following element should be the previous result times 0.5 plus the next num.
It works on one group, but fails on the multi-group dataframe.
The error message is: "Error: Column `sl` must be length 6 (the group size) or one, not 25".
I have no idea what's wrong. Does anyone have a clue?
Thanks

Your function would work if you passed a vector to it instead of a dataframe:
depre <- function(num) {
  rate = 0.5
  r = length(num)
  sl = num
  nl = num
  for (i in 2:r) {
    sl[i] = sl[i - 1] * rate + nl[i]
  }
  return(sl)
}
and then apply it by group.
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
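One caveat worth noting (my addition, not part of the original answer): because the ids are drawn with sample(), a group can end up with a single row, and 2:r then evaluates to c(2, 1), growing the result to length 2 and breaking mutate(). Iterating over seq_along(num)[-1] makes the loop a no-op for single-row groups:
# Guarded sketch of the vectorised function; seq_along(num)[-1] is empty
# when the group has only one row, so the loop body never runs.
depre <- function(num, rate = 0.5) {
  sl <- num
  for (i in seq_along(num)[-1]) {
    sl[i] <- sl[i - 1] * rate + num[i]
  }
  sl
}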

We can split by 'id' and use the OP's function without any changes
library(dplyr)
library(purrr)
df %>%
  group_split(id) %>%
  map_df(~ tibble(id = .$id, sl = depre(.)))
# id sl
# <int> <dbl>
# 1 1 1.07
# 2 1 -0.776
# 3 1 -0.518
# 4 1 0.628
# 5 1 0.601
# 6 1 1.10
# 7 2 -0.734
# 8 2 -0.583
# 9 2 -0.437
#10 2 -3.45
# … with 15 more rows
Or an option is accumulate from purrr, which is more compact:
out <- df %>%
  group_by(id) %>%
  mutate(sl = accumulate(num, ~ .y + .x * 0.5))
out
# A tibble: 25 x 3
# Groups: id [5]
# id num sl
# <int> <dbl> <dbl>
# 1 3 -0.784 -0.784
# 2 2 -0.734 -0.734
# 3 2 -0.216 -0.583
# 4 3 -0.335 -0.727
# 5 5 -1.09 -1.09
# 6 4 -0.0854 -0.0854
# 7 1 1.07 1.07
# 8 2 -0.145 -0.437
# 9 3 -1.17 -1.53
#10 5 -0.819 -1.36
# … with 15 more rows
out %>%
filter(id == 1)
# A tibble: 6 x 3
# Groups: id [1]
# id num sl
# <int> <dbl> <dbl>
#1 1 1.07 1.07
#2 1 -1.31 -0.776
#3 1 -0.129 -0.518
#4 1 0.887 0.628
#5 1 0.287 0.601
#6 1 0.800 1.10
The issue with the OP's function is that its input is the whole dataset: nrow(data) returns the total number of rows (25), not the group size, so the returned vector is too long for mutate(). Inside group_by(), the dplyr idiom for the current group size is n(). With group_split(), the input data.frame is split into per-group data.frames, so nrow() on each subset gives the right length and the original function works unchanged.
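For completeness, a base R sketch of the same recursion (my addition, not from the answers above): Reduce() with accumulate = TRUE plays the role of purrr::accumulate(), and ave() applies it within each id group.
# per-group depreciated cumulative sum in base R
df$sl <- with(df, ave(num, id, FUN = function(x)
  Reduce(function(a, b) a * 0.5 + b, x, accumulate = TRUE)))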

Related

Can I filter a column based on the values of another column within the same tibble?

I'm working with large datasets that have countless rows and am trying to automate some of my analyses. I mostly use the tidyverse to avoid adding more packages, but I'm open to all suggestions. Consider the following tibble:
library(tibble)
id <- rep(1:3, each = 48)  # 3 individuals
time <- rep(seq(0, 23.5, by = .5), 3)  # half-hour steps over a day
count <- runif(48 * 3)
df <- tibble(id, time, count)
I'm trying to filter a 2-hour interval around the time of max count.
I can identify the time of max count using:
df %>%
  group_by(id) %>%
  filter(count == max(count))
# OR
df$time[which.max(df$count)] # Only for 1 id, though
I am struggling to filter a range around the time of max count. I can identify the times correctly as a vector using base R, but I can't filter entire rows with it. I have not yet dealt with potential negative or out-of-range indices.
df$time[(which.max(df$count) - 2):(which.max(df$count) + 2)]
I'm calculating a few different variables using mutate(), so I want to incorporate this filter() into a pipe. I've attempted to use between(), match(), lead(), and lag(); which.max() has been the closest I've gotten to filtering the correct time duration. The following are a dead end and my closest attempt:
# Listed max(count) in a new column; maybe use for matching?
df %>%
  group_by(id) %>%
  mutate(peak = max(count))
# Partially selects time around max count, but not accurately.
df %>%
  group_by(id) %>%
  filter(time == time[(which.max(count) - 1.5):(which.max(count) + 1.5)])
I've been coding for about a year now, but I think I'm missing some basic functions that I just don't know. Similar questions have been posted for SQL, but I have not found any regarding R or tidyverse. If you can help, I'd really appreciate it. Let me know if there's any clarification needed.
We could use slice after the grouping step
library(dplyr)
df %>%
  group_by(id) %>%
  slice({
    i1 <- which.max(count)
    (i1 - 2):(i1 + 2)
  })
# A tibble: 15 x 3
# Groups: id [3]
# id time count
# <int> <dbl> <dbl>
# 1 1 6.5 0.447
# 2 1 7 0.785
# 3 1 7.5 0.984
# 4 1 8 0.133
# 5 1 8.5 0.433
# 6 2 14.5 0.266
# 7 2 15 0.501
# 8 2 15.5 0.965
# 9 2 16 0.214
#10 2 16.5 0.492
#11 3 14 0.894
#12 3 14.5 0.0388
#13 3 15 0.947
#14 3 15.5 0.776
#15 3 16 0.293
Or it can be made more compact
df %>%
  group_by(id) %>%
  slice(which.max(count) + (-2:2))
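A caveat (my note, not from the original answer): if a group's maximum falls within its first two rows, the window mixes negative and positive positions and slice() errors out. Clamping the window to the group bounds avoids that:
# clamp the window so it never leaves the group
df %>%
  group_by(id) %>%
  slice(max(which.max(count) - 2, 1):min(which.max(count) + 2, n()))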
An alternative solution uses row_number() (this one also degrades gracefully when the maximum sits near a group boundary, since filter() simply drops the out-of-range positions):
library(dplyr)
df %>%
  group_by(id) %>%
  filter(abs(row_number() - which.max(count)) <= 2)
which gives
# A tibble: 15 x 3
# Groups: id [3]
id time count
<int> <dbl> <dbl>
1 1 5 0.574
2 1 5.5 0.763
3 1 6 0.985
4 1 6.5 0.701
5 1 7 0.281
6 2 21 0.0563
7 2 21.5 0.274
8 2 22 0.978
9 2 22.5 0.560
10 2 23 0.726
11 3 12 0.889
12 3 12.5 0.767
13 3 13 0.999
14 3 13.5 0.157
15 3 14 0.896
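Since the stated goal is a 2-hour interval in time rather than in rows, one could also filter on the time difference directly (a sketch of mine, not from the answers; unlike the positional approaches it does not assume evenly spaced rows):
# keep rows within +/- 1 hour of each group's peak time
df %>%
  group_by(id) %>%
  filter(abs(time - time[which.max(count)]) <= 1)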

Generating a column with the average value of rows before and after the row index

Given some data like the following:
library(tibble)
set.seed(1234)
df <- tibble(class = rep(c("a", "b"), each = 6),
             value = c(rnorm(n = 6, mean = 0, sd = 1), rnorm(n = 6, mean = 1, sd = 0.1)))
# A tibble: 12 x 2
# class value
# <chr> <dbl>
# 1 a -1.21
# 2 a 0.277
# 3 a 1.08
# 4 a -2.35
# 5 a 0.429
# 6 a 0.506
# 7 b 0.943
# 8 b 0.945
# 9 b 0.944
#10 b 0.911
#11 b 0.952
#12 b 0.900
I'm trying to generate a new column (context) that contains the average of value over the X preceding and X following rows, where possible, computed within each level of the factor in another column. For example, for X = 2, I would expect something like the following:
# A tibble: 12 x 2
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.7135
# 4 a -2.35 0.573
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.9377
#10 b 0.911 0.9278
#11 b 0.952 NA
#12 b 0.900 NA
Note that for the first two rows it is not possible to generate the context value in this case, because they do not have X = 2 preceding rows. The value -0.7135 at row 3 is the average of rows 1, 2, 4 and 5.
Similarly, rows 5 and 6 do not have a context value, because they do not have two following values within the same level of class (row 7 is class "b" while rows 5 and 6 are class "a").
I do not know if this is even possible in R; I haven't found any similar questions, and I can only come up with solutions like the following one, which I don't think is idiomatic for this language.
My solution:
X <- 2
df_list <- df %>% dplyr::group_split(class)
result <- tibble()
for (i in 1:length(df_list)) {
  tmp <- df_list[[i]]
  context <- vector()
  for (j in 1:nrow(tmp)) {
    if (j <= X | j > nrow(tmp) - X) context <- c(context, NA)
    else {
      values <- vector()
      for (k in 1:X) {
        values <- c(values, tmp$value[j - k], tmp$value[j + k])
      }
      context <- c(context, mean(values))
    }
  }
  tmp <- tmp %>% dplyr::mutate(context = context)
  result <- result %>% dplyr::bind_rows(tmp)
}
This gives an approximate match to the expected output above (differences due to rounding). But again, this approach lacks flexibility, e.g. if we want to create several columns at once for different values of X. Are there R functions designed for tasks like this (e.g. vectorized functions)?
# this is your dataframe
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))
# pipes ('%>%') and grouping from the dplyr package
library(tidyverse)
# rolling sum function from the zoo package
library(zoo)
df %>%                  # take df
  group_by(class) %>%   # group it by class
  mutate(context = (rollsum(value, 5, fill = NA) - value) / 4)  # rolling mean of the 4 neighbours
Basically you calculate a rolling sum with a window width of 5 that is centered (the default) and fill the remaining positions with NAs. Since the value of the current row is not to be included in the average, it is subtracted from the sum, and the result is divided by the 4 remaining values in the window.
One way using dplyr:
n <- 2
library(dplyr)
library(purrr)  # for map_dbl
df %>%
  group_by(class) %>%
  mutate(context = map_dbl(row_number(), ~ if (.x <= n | .x > (n() - n))
    NA else mean(value[c((.x - n):(.x - 1), (.x + 1):(.x + n))])))
# class value context
# <chr> <dbl> <dbl>
# 1 a -1.21 NA
# 2 a 0.277 NA
# 3 a 1.08 -0.712
# 4 a -2.35 0.574
# 5 a 0.429 NA
# 6 a 0.506 NA
# 7 b 0.943 NA
# 8 b 0.945 NA
# 9 b 0.944 0.938
#10 b 0.911 0.935
#11 b 0.952 NA
#12 b 0.900 NA
Here is a base R solution using ave(), i.e.,
df <- within(df,
  context <- ave(value, class,
    FUN = function(v, X = 2) sapply(seq_along(v), function(k)
      ifelse(k - X < 1 | k + X > length(v), NA, mean(v[c(k - (X:1), k + (1:X))])))))
such that
> df
# A tibble: 12 x 3
class value context
<chr> <dbl> <dbl>
1 a -1.21 NA
2 a 0.277 NA
3 a 1.08 -0.712
4 a -2.35 0.574
5 a 0.429 NA
6 a 0.506 NA
7 b 0.943 NA
8 b 0.945 NA
9 b 0.944 0.938
10 b 0.911 0.935
11 b 0.952 NA
12 b 0.900 NA
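If several window sizes are needed at once (the flexibility the OP asked about), one hedged sketch (my addition) loops over the X values and lets zoo::rollsum() build one column per window; the glue-style column name in mutate() assumes dplyr >= 1.0:
library(dplyr)
library(zoo)
for (X in 1:2) {  # window widths 3 and 5; wider windows would exceed the 6-row groups
  df <- df %>%
    group_by(class) %>%
    mutate("context_{X}" := (rollsum(value, 2 * X + 1, fill = NA) - value) / (2 * X)) %>%
    ungroup()
}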

Transforming an R dataframe by applying a function rowwise to create (possibly) longer columns

I'm trying to transform a dataframe (tibble) by using each row as function arguments and creating a new column out of the result, which is possibly longer than the number of rows. Consider the following example, where I have some sample observations:
library(dplyr)
library(stringi)
observations <- c("110", "11011", "1100010")
df <- tibble(obs = observations) %>%
transmute(
Failure = stri_count(obs, fixed = "0"),
Success = stri_count(obs, fixed = "1")
)
df is then:
# A tibble: 3 x 2
Failure Success
<int> <int>
1 1 2
2 1 4
3 4 3
I would like to take every row and use it to calculate a vector of values, and save each result vector in a new column. For example I would like to do:
p_values <- seq(from = 0, to = 1, length.out = 11)
df %>%
  rowwise() %>%
  transmute(
    p = p_values,
    likelihood = dbinom(Success,
                        size = Failure + Success,
                        prob = p_values)
  )
Error: Column `p` must be length 1 (the group size), not 11
And get something like:
# A tibble: 11 x 4
#    p_values likelihood_1 likelihood_2 likelihood_3
#       <dbl>        <dbl>        <dbl>        <dbl>
# 1       0          ...          ...          ...
# 2       0.1        ...          ...          ...
# ...   ...          ...          ...          ...
# 10      0.9        ...          ...          ...
# 11      1          ...          ...          ...
I would actually switch to purrr for this. The function pmap() iterates by row; ..1 and ..2 refer to the first and second inputs, respectively. Using pmap_dfc() binds the results by columns (dfc = data frame columns).
library(purrr)
library(tibble)
df %>%
  pmap_dfc(~ dbinom(..2, size = ..1 + ..2, prob = p_values)) %>%
  set_names(paste0("likelihood_", seq_along(.))) %>%
  add_column(p_values = p_values, .before = 1)
# A tibble: 11 x 4
p_values likelihood_1 likelihood_2 likelihood_3
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0.1 0.027 0.00045 0.0230
3 0.2 0.096 0.0064 0.115
4 0.3 0.189 0.0284 0.227
5 0.4 0.288 0.0768 0.290
6 0.5 0.375 0.156 0.273
7 0.6 0.432 0.259 0.194
8 0.7 0.441 0.360 0.0972
9 0.8 0.384 0.410 0.0287
10 0.9 0.243 0.328 0.00255
11 1 0 0 0
This sort of workflow can be somewhat awkward with a tidyverse approach, as the data is not in a 'tidy' format.
I would come at it from the other angle, starting with the p_values vector:
likelihoods <- tibble(p = p_values) %>%
  mutate(likelihood_1 = dbinom(df[1, ]$Success, size = df[1, ]$Failure + df[1, ]$Success, prob = p),
         likelihood_2 = dbinom(df[2, ]$Success, size = df[2, ]$Failure + df[2, ]$Success, prob = p),
         likelihood_3 = dbinom(df[3, ]$Success, size = df[3, ]$Failure + df[3, ]$Success, prob = p))
The issue is that transmute() or mutate() expects each element to have the same length as the number of rows (or, for grouped data, the group size). Here rowwise() makes each row its own group, so the expected length from n() is 1, while the output has the length of p_values. One option is to wrap the results in lists, unnest, and reshape to 'wide' format with pivot_wider() if needed:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  mutate(grp = str_c('likelihood_', row_number())) %>%
  rowwise() %>%
  transmute(grp, p = list(p_values),
            likelihood = list(dbinom(Success,
                                     size = Failure + Success,
                                     prob = p_values))) %>%
  unnest(c(p, likelihood)) %>%
  pivot_wider(names_from = grp, values_from = likelihood)
# A tibble: 11 x 4
# p likelihood_1 likelihood_2 likelihood_3
# <dbl> <dbl> <dbl> <dbl>
# 1 0 0 0 0
# 2 0.1 0.027 0.00045 0.0230
# 3 0.2 0.096 0.0064 0.115
# 4 0.3 0.189 0.0284 0.227
# 5 0.4 0.288 0.0768 0.290
# 6 0.5 0.375 0.156 0.273
# 7 0.6 0.432 0.259 0.194
# 8 0.7 0.441 0.360 0.0972
# 9 0.8 0.384 0.410 0.0287
#10 0.9 0.243 0.328 0.00255
#11 1 0 0 0
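Another hedged sketch (my addition, not from the answers above): cross every observation with every p value and compute the likelihood in long format, which keeps the data tidy and scales to any number of rows; the obs counter is a helper column I introduce here:
library(dplyr)
library(tidyr)
crossing(df %>% mutate(obs = row_number()), p = p_values) %>%
  mutate(likelihood = dbinom(Success, size = Failure + Success, prob = p)) %>%
  select(obs, p, likelihood) %>%
  pivot_wider(names_from = obs, values_from = likelihood,
              names_prefix = "likelihood_")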

Compute percentage of rows in group that have a certain value in another column

I am using the dataset birthwt.
For each age, I want to find the percentage of mothers that are white. My end goal is to display that percentage in a plot by age. How can I do this? I'm learning how to use tidyverse functions so I would prefer to do it that way if possible. Here is my work so far:
library(tidyverse)
library(tidyselect)
library(MASS)  # note: MASS masks dplyr::select()
grouped <- birthwt %>%
  count(race, age) %>%
  spread(key = race, value = n, fill = 0)
grouped
This gets a table where each row represents an age, and there is a column for each race representing the count of mothers of that age. This approach may or may not be on the right path.
We can group by 'age' and take the mean of a logical vector:
library(dplyr)
birthwt %>%
  group_by(age) %>%
  summarise(perc = mean(race == 1))
# A tibble: 24 x 2
# age perc
# <int> <dbl>
# 1 14 0.333
# 2 15 0.333
# 3 16 0.286
# 4 17 0.25
# 5 18 0.6
# 6 19 0.625
# 7 20 0.333
# 8 21 0.417
# 9 22 0.769
#10 23 0.308
# … with 14 more rows
Or an option with data.table
library(data.table)
setDT(birthwt)[, .(perc = mean(race == 1)), age]
Or using base R with ave (note that ave returns a per-row vector, so this adds the percentage as a new column rather than summarising to one row per age):
birthwt$perc <- with(birthwt, ave(race == 1, age))
Or another base R option is
with(birthwt, tapply(race == 1, age, FUN = mean))
Or with aggregate
aggregate(cbind(perc = race == 1) ~ age, birthwt, FUN = mean)
Or with by
by(birthwt$race == 1, birthwt$age, FUN = mean)
We can count the number of rows where race is white for each age and divide it by the total number of rows for that age to get the proportion:
library(dplyr)
birthwt %>%
  group_by(age) %>%
  summarise(perc = sum(race == 1) / n())
# A tibble: 24 x 2
# age perc
# <int> <dbl>
# 1 14 0.333
# 2 15 0.333
# 3 16 0.286
# 4 17 0.25
# 5 18 0.6
# 6 19 0.625
# 7 20 0.333
# 8 21 0.417
# 9 22 0.769
#10 23 0.308
# … with 14 more rows
In base R, we can use aggregate following the same logic
aggregate(race ~ age, birthwt, function(x) sum(x == 1) / length(x))
Or something similar to your approach using table, we could do
tab <- table(birthwt$age, birthwt$race)
tab[, "1"]/rowSums(tab)
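Since the stated end goal is a plot by age, a minimal ggplot2 sketch (my addition; perc_by_age is just a name I use here for the summarise() result above):
library(ggplot2)
perc_by_age <- birthwt %>%
  group_by(age) %>%
  summarise(perc = mean(race == 1))
ggplot(perc_by_age, aes(x = age, y = perc)) +
  geom_col() +
  labs(x = "Mother's age", y = "Proportion of white mothers")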

Changing / coercing multiple columns of a tibble while avoiding a loop

I have a tibble with several columns in which numbers are stored as text:
my_tbl <- tibble(names = letters[1:5],
                 value1 = as.character(runif(5)),
                 value2 = as.character(runif(5)))
Now I'd like to change the type of these columns ("value1" and "value2") from character to numeric. The only option I've found is a for-loop:
for (i in 2:ncol(my_tbl)) {
  my_tbl[[i]] <- as.numeric(my_tbl[[i]])
}
Is there a possibility to do this without a loop?
You can use mutate_if from dplyr; the group_by(names) keeps the names column out of the conversion, since grouping variables are skipped by the scoped verbs. Note that the result has to be assigned back:
library(dplyr)
my_tbl <- my_tbl %>%
  group_by(names) %>%
  mutate_if(is.character, as.numeric)
my_tbl
## A tibble: 5 x 3
## Groups: names [5]
# names value1 value2
# <chr> <dbl> <dbl>
#1 a 0.427 0.0191
#2 b 0.817 0.300
#3 c 0.108 0.158
#4 d 0.394 0.643
#5 e 0.775 0.311
With purrr you could do this:
If you already know your target columns:
library(purrr)
modify_at(my_tbl, -1, as.numeric)
If you need to detect them:
modify_if(my_tbl, ~ is.character(.) && !any(grepl("[[:alpha:]]", .)), as.numeric)
# # A tibble: 5 x 3
# names value1 value2
# <chr> <dbl> <dbl>
# 1 a 0.715 0.943
# 2 b 0.639 0.128
# 3 c 0.471 0.0395
# 4 d 0.374 0.374
# 5 e 0.500 0.800
Using dplyr instead of purrr, these yield the same results:
library(dplyr)
mutate_at(my_tbl, -1, as.numeric)
mutate_if(my_tbl, ~ is.character(.) && !any(grepl("[[:alpha:]]", .)), as.numeric)
The base R translations:
my_tbl[-1] <- lapply(my_tbl[-1], as.numeric)
my_tbl[] <- lapply(my_tbl, function(x)
  if (is.character(x) && !any(grepl("[[:alpha:]]", x))) as.numeric(x)
  else x)
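A hedged update (my addition): in current dplyr (>= 1.0), the scoped verbs mutate_at() and mutate_if() are superseded by across(), so the same conversions can be written as:
library(dplyr)
# by position, mirroring modify_at(my_tbl, -1, as.numeric)
my_tbl %>% mutate(across(-1, as.numeric))
# by predicate, excluding the key column explicitly
my_tbl %>% mutate(across(where(is.character) & !any_of("names"), as.numeric))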
