Retaining the last value of a column in R

I have the following data frame (posted as an image in the original question; the first answer below reproduces it as code).
For all observations where Month > tenor, the last value of the rate column should be retained for each account for the remaining months. For example, customer 1 has tenor = 5, so for all months greater than 5 the rate from month 5 is retained.
I am using the following code
df$rate <- ifelse(df$Month > df$tenor,tail(df$rate, n=1),df$rate)
But tail(df$rate, n = 1) takes the overall last value of the column, which is NA here, so this does not work.
The expected output (posted as an image) keeps 0.6 for customer 1 from month 6 onward and 0.3 for customer 2 from month 4 onward.
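A minimal base R sketch of the idea (my addition, assuming the data frame with customer, Month, tenor and rate columns that the first answer below reproduces as code):
# Per customer, take the rate at Month == tenor and reuse it wherever Month > tenor;
# tail(df$rate, n = 1) fails because it is the overall last element of the column, which is NA.
last_rate <- with(df, ave(ifelse(Month == tenor, rate, NA), customer,
                          FUN = function(x) x[!is.na(x)][1]))
df$rate <- ifelse(df$Month > df$tenor, last_rate, df$rate)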

This will work, but please provide a reproducible example next time. Others want to help you, not do your homework.
library(dplyr)

df <- data.frame(
  customer = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
  Month    = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
  tenor    = c(5,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3),
  rate     = c(0.2,0.3,0.4,0.5,0.6,NA,NA,NA,NA,NA,0.1,0.2,0.3,NA,NA,NA,NA,NA,NA,NA)
)

# For rows past the tenor with a missing rate, look up the rate at Month == tenor
# for that customer; otherwise keep the existing rate.
fn <- function(cus, mon, ten, rat){
  if (mon > ten & is.na(rat)){
    return(dplyr::filter(df, customer == cus, Month == ten, tenor == ten)$rate)
  }
  rat
}

df2 <- mutate(df, newrate = Vectorize(fn)(customer, Month, tenor, rate))

One option is:
library(dplyr)
library(tidyr)

df %>%
  group_by(customer) %>%
  fill(rate, .direction = "down") %>%
  ungroup()
# A tibble: 20 x 4
customer Month tenor rate
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 0.2
2 1 2 5 0.3
3 1 3 5 0.4
4 1 4 5 0.5
5 1 5 5 0.6
6 1 6 5 0.6
7 1 7 5 0.6
8 1 8 5 0.6
9 1 9 5 0.6
10 1 10 5 0.6
11 2 1 3 0.1
12 2 2 3 0.2
13 2 3 3 0.3
14 2 4 3 0.3
15 2 5 3 0.3
16 2 6 3 0.3
17 2 7 3 0.3
18 2 8 3 0.3
19 2 9 3 0.3
20 2 10 3 0.3

I can't replicate your data frame, so this is a guess right now.
I think dplyr (with tidyr for replace_na) should be the solution:
library(dplyr)
library(tidyr)
df %>%
  group_by(customer) %>%
  mutate(rate = replace_na(rate, last(na.omit(rate)))) %>%
  ungroup()
should work.

Related

How to give a consecutive id number for each distinct study in r

I am trying to create consecutive ID numbers for each distinct study. I found an example dataset where such an ID number was created as the esid variable:
Browse[1]> dat <- dat.assink2016
Browse[1]> head(dat, 9)
study esid id yi vi pubstatus year deltype
1 1 1 1 0.9066 0.0740 1 4.5 general
2 1 2 2 0.4295 0.0398 1 4.5 general
3 1 3 3 0.2679 0.0481 1 4.5 general
4 1 4 4 0.2078 0.0239 1 4.5 general
5 1 5 5 0.0526 0.0331 1 4.5 general
6 1 6 6 -0.0507 0.0886 1 4.5 general
7 2 1 7 0.5117 0.0115 1 1.5 general
8 2 2 8 0.4738 0.0076 1 1.5 general
9 2 3 9 0.3544 0.0065 1 1.5 general
I would like to create the same for my study. Can anyone show me how to do it?
The key is to group_by study, then use row_number
library(dplyr)
df %>%
group_by(study) %>%
mutate(esid = row_number())
with the example data from #njp:
# A tibble: 9 × 3
# Groups: study [3]
study id esid
<dbl> <int> <int>
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 1
5 2 5 2
6 2 6 3
7 2 7 4
8 3 8 1
9 3 9 2
If the id column is consecutive (i.e. no jumps or repeated values) you could subtract the minimum value of id for each study and add one:
# Example data
df = data.frame(study = c(1,1,1,2,2,2,2,3,3),
                id = 1:9)
# Calculate minima
min.id = tapply(X = df$id,
                INDEX = df$study,
                FUN = min)
# Merge this with the data
df$min.id = min.id[df$study]
# Calculate consecutive id as required
df$esid = df$id - df$min.id + 1
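A base R alternative that does not rely on the ids being consecutive (my addition, not part of the original answer) is ave() with seq_along, which simply numbers the rows within each study:
# Number rows within each study, regardless of what the id values are
df$esid <- ave(df$id, df$study, FUN = seq_along)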

Erase groups based on a condition with dplyr [duplicate]

This question already has answers here:
Filter group of rows based on sum of values from different column
(2 answers)
Closed 2 years ago.
I have a data.frame that looks like this
data = data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                  time = c(rep(1,3), rep(2,3), rep(3,3)),
                  value = c(0,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.0
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to find an elegant way to remove a group when its value is smaller than 0.2 at two different time points. The time points do not have to be consecutive.
In this case, I would like to filter out group A because its values at time points 1 and 2 are smaller than 0.2.
group time value
1 B 1 1.0
2 C 1 1.0
3 B 2 10.0
4 C 2 20.0
5 B 3 20.0
6 C 3 30.0
This solution keeps only the groups that have fewer than two observations with values under 0.2, as you requested.
library(dplyr)
data %>%
  group_by(group) %>%
  filter(sum(value < 0.2) < 2) %>%
  ungroup()
#> # A tibble: 6 x 3
#> group time value
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 C 1 1
#> 3 B 2 10
#> 4 C 2 20
#> 5 B 3 20
#> 6 C 3 30
But if you are really a fan of base R:
data[ave(data$value<0.2, data$group, FUN = function(x) sum(x)<2), ]
#> group time value
#> 2 B 1 1
#> 3 C 1 1
#> 5 B 2 10
#> 6 C 2 20
#> 8 B 3 20
#> 9 C 3 30
Try this dplyr approach (flagging groups that have at least two values below 0.2, per the question's rule):
library(tidyverse)
#Code
data <- data %>%
  group_by(group) %>%
  mutate(Flag = sum(value < 0.2) >= 2) %>%
  filter(!Flag) %>%
  select(-Flag)
Output:
# A tibble: 6 x 3
# Groups: group [2]
group time value
<fct> <dbl> <dbl>
1 B 1 1
2 C 1 1
3 B 2 10
4 C 2 20
5 B 3 20
6 C 3 30

Create grouping based on cumulative sum and another group

This question is nearly identical to:
Create new group based on cumulative sum and group
However, when I apply the accepted solution to my data, it doesn't have the expected result.
In a nutshell, I have data with two variables: domain and value. Domain is a group variable with multiple observations and value is some continuous value that I would like to accumulate within each domain in order to create a new group variable, newgroup. There are three main rules:
I accumulate only within each domain. If I reach the end of the domain, then the accumulation is reset.
If the accumulated sum is at least 1.0 then the observations whose values added up to at least 1.0 are assigned to a different value for group1. Please note that this rule can be satisfied by a single observation.
If the last group in a domain has an accumulated sum less than 1.0, then merge that with the second to last group within the same domain. This is reflected in the variable group2
The data below has been simplified. The data will usually consist of 10^5 - 10^6 rows, so a vectorized solution would be ideal.
Example Data
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
domain value
1 1.0
1 0.0
1 2.0
1 2.5
1 0.1
2 0.1
2 0.5
2 0.0
2 0.2
2 0.6
2 0.0
2 0.0
2 0.1
Desired Output
cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)
domain value cumsum_val group1 group2
1 1.0 1.0 1 1
1 0.0 0.0 2 2
1 2.0 2.0 2 2
1 2.5 2.5 3 3
1 0.1 0.1 4 3
2 0.1 0.1 5 4
2 0.5 0.6 5 4
2 0.0 0.6 5 4
2 0.2 0.8 5 4
2 0.6 1.4 5 4
2 0.0 0.0 6 4
2 0.0 0.0 6 4
2 0.1 0.1 6 4
I used the following code:
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9
but the last line does not produce the same values as group1 above. Generating group1 output is what is mainly causing me issues. Can someone help me understand the function is_start and how that is supposed to produce the groupings?
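As an annotation (my own trace, not part of the original question): running is_start separately on each domain's values shows where it diverges from the desired group1. is_start flags a row as a group start whenever the running sum after the previous row is exactly 0 (and the first row is always TRUE), which happens both after a genuine reset and after a run of zero values.
# Using the definitions from the question above
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init = 0, x, acc = TRUE)[-1] == 0), -1)

# Domain 1: the running sums after each row are 0, 0, 0, 0, 0.1, so every row is
# flagged as a start and cumsum() gives 1 2 3 4 5 (the desired group1 is 1 2 2 3 4).
is_start(c(1, 0, 2, 2.5, 0.1))
#> [1] TRUE TRUE TRUE TRUE TRUE

# Domain 2: the genuine reset happens at the 0.6 row (running sum 1.4), but rows 6
# and 7 have value 0 and leave the running sum at 0, so rows 7 and 8 are flagged as
# starts as well -- which is why the overall cumsum ends in 7 8 9.
is_start(c(0.1, 0.5, 0, 0.2, 0.6, 0, 0, 0.1))
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE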
EDIT
akrun provided some working code in the comments for the simplified example above. However, there are still some situations where it doesn't work. For example,
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
The output is show below with new coming from akrun's code and group1 and group2 are the desired groupings based on rules #2 and #3. The discrepancy between new and group2 occurs mainly in the first 3 rows.
domain value new group1 group2
1 1.0 1 1 1
1 0.0 2 2 2
1 1.0 3 2 2
1 0.0 4 3 3
1 2.0 4 3 3
1 2.5 5 4 4
1 0.1 5 5 4
2 0.1 6 6 5
2 0.5 6 6 5
2 0.0 6 6 5
2 0.2 6 6 5
2 0.6 6 6 5
2 0.0 6 7 5
2 0.0 6 7 5
2 0.1 6 7 5
EDIT 2
I have updated with a working answer.
This works! It uses a combination of purrr's accumulate (similar to cumsum but more versatile) and cumsum with appropriate use of group_by to get what you're looking for. I've added comments to indicate what each part is doing. I'll note that next_group2 is a bit of a misnomer--it's more of a not_next_group2, but hopefully the rest is clear.
library(tidyverse)
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
  if (include.equals) {
    purrr::accumulate({{val_col}}, ~if_else(.x >= threshold, .y, .x + .y))
  } else {
    purrr::accumulate({{val_col}}, ~if_else(.x > threshold, .y, .x + .y))
  }
}

df_raw %>%
  group_by(domain) %>%
  mutate(cumsum_val = sum_reset_at(value, 1)) %>%
  ## binary indicator of whether a new group should start at this row
  mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>%
  ungroup() %>%
  ## generate the new groups
  mutate(group1 = cumsum(next_group1)) %>%
  group_by(domain, group1) %>%
  ## similar to above, but grouped by the new group1; only transition at the
  ## first value of a group whose running sum never reaches 1
  mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>%
  ungroup() %>%
  ## cancel out the next_group1 indicator when it meets the next_group2 condition
  mutate(group2 = cumsum(next_group1 - next_group2)) %>%
  select(-starts_with("next_"))
And as specified, this produces:
# A tibble: 13 x 5
domain value cumsum_val group1 group2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 1 0 0 2 2
3 1 2 2 2 2
4 1 2.5 2.5 3 3
5 1 0.1 0.1 4 3
6 2 0.1 0.1 5 4
7 2 0.5 0.6 5 4
8 2 0 0.6 5 4
9 2 0.2 0.8 5 4
10 2 0.6 1.4 5 4
11 2 0 0 6 4
12 2 0 0 6 4
13 2 0.1 0.1 6 4
The solution below is adapted from Group vector on conditional sum.
Helper Rcpp Function
library(Rcpp)
cppFunction('
  IntegerVector CreateGroup(NumericVector x, int cutoff) {
    IntegerVector groupVec (x.size());
    int group = 1;
    double runSum = 0;
    for (int i = 0; i < x.size(); i++) {
      runSum += x[i];
      groupVec[i] = group;
      if (runSum >= cutoff) {
        group++;
        runSum = 0;
      }
    }
    return groupVec;
  }
')
Main Function
library(dplyr)
library(data.table)   # for rleid()

domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain, value)

df_raw %>%
  group_by(domain) %>%
  mutate(group1 = CreateGroup(value, 1),
         group1 = ifelse(group1 == max(group1) & last(value) < 1,
                         max(group1) - 1, group1)) %>%
  ungroup() %>%
  mutate(group2 = rleid(group1))
domain value group1 group2
1 1.0 1 1
1 0.0 2 2
1 1.0 2 2
1 0.0 3 3
1 2.0 3 3
1 2.5 4 4
1 0.1 4 4
2 0.1 1 5
2 0.5 1 5
2 0.0 1 5
2 0.2 1 5
2 0.6 1 5
2 0.0 1 5
2 0.0 1 5
2 0.1 1 5

Assigning Subgroups in Data R

I know how to do basic things in R, but I am still a newbie. I am also probably asking a pretty redundant question (but I don't know how to phrase it for Google to find the right hits).
I have been getting hits like the below:
Assign value to group based on condition in column
R - Group by variable and then assign a unique ID
I want to assign subgroups into groups, and create a new column out of them.
I have data like the following:
dataframe:
ID SubID Values
1 15 0.5
1 15 0.2
2 13 0.1
2 13 0
1 14 0.3
1 14 0.3
2 10 0.2
2 10 1.6
6 31 0.7
6 31 1.0
new dataframe:
ID SubID Values groups
1 15 0.5 2
1 15 0.2 2
2 13 0.1 2
2 13 0 2
1 14 0.3 1
1 14 0.3 1
2 10 0.2 1
2 10 1.6 1
6 31 0.7 1
6 31 1.0 1
I have tried the following in R, but I am not getting the desired results:
newdataframe$groups <- dataframe %>% group_indices(,dataframe$ID, dataframe$SubID)
newdataframe<- dataframe %>% group_by(ID, SubID) %>% mutate(groups=group_indices(,dataframe$ID, dataframe$SubID))
I am not sure how to frame the question in R. I want to group by ID and SubID, and then number those subgroups within each ID, resetting the grouping count for each ID.
Any help would be really appreciated.
Here is an alternative approach which uses the rleid() function from the data.table package. rleid() generates a run-length type id column.
According to the expected result, the OP expects SubID to be numbered by order of value and not by order of appearance. Therefore, we need to call arrange().
library(dplyr)
df %>%
group_by(ID) %>%
arrange(SubID) %>%
mutate(groups = data.table::rleid(SubID))
ID SubID Values groups
<int> <int> <dbl> <int>
1 2 10 0.2 1
2 2 10 1.6 1
3 2 13 0.1 2
4 2 13 0 2
5 1 14 0.3 1
6 1 14 0.3 1
7 1 15 0.5 2
8 1 15 0.2 2
9 6 31 0.7 1
10 6 31 1 1
Note that the row order has changed.
BTW: With data.table, the code is less verbose and the original row order is maintained:
library(data.table)
setDT(df)[order(ID, SubID), groups := rleid(SubID), by = ID][]
ID SubID Values groups
1: 1 15 0.5 2
2: 1 15 0.2 2
3: 2 13 0.1 2
4: 2 13 0.0 2
5: 1 14 0.3 1
6: 1 14 0.3 1
7: 2 10 0.2 1
8: 2 10 1.6 1
9: 6 31 0.7 1
10: 6 31 1.0 1
There are multiple ways to do this; one way would be to group_by ID and create a unique number for each SubID by converting it to a factor and then to an integer.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(groups = as.integer(factor(SubID)))
# ID SubID Values groups
# <int> <int> <dbl> <int>
# 1 1 15 0.5 2
# 2 1 15 0.2 2
# 3 2 13 0.1 2
# 4 2 13 0 2
# 5 1 14 0.3 1
# 6 1 14 0.3 1
# 7 2 10 0.2 1
# 8 2 10 1.6 1
# 9 6 31 0.7 1
#10 6 31 1 1
In base R, we can use ave with similar logic
df$groups <- with(df, ave(SubID, ID, FUN = function(x) as.integer(factor(x))))

Calculate product of frequencies in each column

I have a data frame with 3 columns, each containing a small number of distinct values:
> df
# A tibble: 364 x 3
A B C
<dbl> <dbl> <dbl>
0. 1. 0.100
0. 1. 0.200
0. 1. 0.300
0. 1. 0.500
0. 2. 0.100
0. 2. 0.200
0. 2. 0.300
0. 2. 0.600
0. 3. 0.100
0. 3. 0.200
# ... with 354 more rows
> apply(df, 2, table)
$`A`
0 1 2 3 4 5 6 7 8 9 10
34 37 31 32 27 39 29 28 37 39 31
$B
1 2 3 4 5 6 7 8 9 10 11
38 28 38 37 32 34 29 33 30 35 30
$C
0.1 0.2 0.3 0.4 0.5 0.6
62 65 65 56 60 56
I would like to create a fourth column, which will contain for each row the product of the frequencies of each value within its own column. So for example, the first value of the column "Freq" would be the product of the frequency of 0 within column A (34), the frequency of 1 within column B (38), and the frequency of 0.1 within column C (62), i.e. 34 * 38 * 62 = 80104.
How can I do this efficiently with dplyr/base R?
To emphasize: this is not the frequency of the whole row combination, but the product of the single-column frequencies.
An efficient approach using a combination of lapply, Map & Reduce from base R:
l <- lapply(df, table)
m <- Map(function(x,y) unname(y[match(x, names(y))]), df, l)
df$D <- Reduce(`*`, m)
which gives:
> head(df, 15)
A B C D
1 3 5 0.4 57344
2 5 6 0.5 79560
3 0 4 0.1 77996
4 2 6 0.1 65348
5 5 11 0.6 65520
6 3 8 0.5 63360
7 6 6 0.2 64090
8 1 9 0.4 62160
9 10 2 0.2 56420
10 5 2 0.2 70980
11 4 11 0.3 52650
12 7 6 0.5 57120
13 10 1 0.2 76570
14 7 10 0.5 58800
15 8 10 0.3 84175
What this does:
lapply(df, table) creates a list of frequency tables, one for each column.
With Map a list is created using match, where each list item has the same length as the number of rows of df; each list item is a vector of the frequencies corresponding to the values in that column of df (a tiny standalone illustration of this step follows after this list).
With Reduce the product of the vectors in the list m is calculated element-wise: the first values of the vectors in m are multiplied with each other, then the 2nd values, and so on.
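To make the Map step concrete, here is a tiny standalone illustration (my own toy vector, not the question's data) of how match() translates each value into its frequency:
x <- c(3, 5, 3)                 # a column of values
y <- table(x)                   # frequencies: "3" occurs twice, "5" once
unname(y[match(x, names(y))])   # each value replaced by its frequency
#> [1] 2 1 2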
The same approach in tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate(D = map(df, table) %>%
map2(df, ., function(x,y) unname(y[match(x, names(y))])) %>%
reduce(`*`))
Used data:
set.seed(2018)
df <- data.frame(A = sample(rep(0:10, c(34,37,31,32,27,39,29,28,37,39,31)), 364),
B = sample(rep(1:11, c(38,28,38,37,32,34,29,33,30,35,30)), 364),
C = sample(rep(seq(0.1,0.6,0.1), c(62,65,65,56,60,56)), 364))
I will use the following small example:
df
A B C
1 3 5 0.4
2 5 6 0.5
3 0 4 0.1
4 2 6 0.1
5 5 11 0.6
6 3 8 0.5
7 6 6 0.2
8 1 9 0.4
9 10 2 0.2
10 5 2 0.2
sapply(df, table)
$A
0 1 2 3 5 6 10
1 1 1 2 3 1 1
$B
2 4 5 6 8 9 11
2 1 1 3 1 1 1
$C
0.1 0.2 0.4 0.5 0.6
2 3 2 2 1
library(tidyverse)
df %>%
  group_by(A) %>%
  mutate(An = n()) %>%
  group_by(B) %>%
  mutate(Bn = n()) %>%
  group_by(C) %>%
  mutate(Cn = n(), prod = An * Bn * Cn)
A B C An Bn Cn prod
<int> <int> <dbl> <int> <int> <int> <int>
1 3 5 0.400 2 1 2 4
2 5 6 0.500 3 3 2 18
3 0 4 0.100 1 1 2 2
4 2 6 0.100 1 3 2 6
5 5 11 0.600 3 1 1 3
6 3 8 0.500 2 1 2 4
7 6 6 0.200 1 3 3 9
8 1 9 0.400 1 1 2 2
9 10 2 0.200 1 2 3 6
10 5 2 0.200 3 2 3 18
