Find missing month after grouping with dplyr - r

I have a data frame with two columns that I am grouping by with dplyr, a column of months (as numerics, e.g. 1 through 12), and several columns with statistical data following that (values unimportant). An example:
ID_1 ID_2 month st1 st2
1 1 1 0.5 0.2
1 1 2 0.7 0.9
1 1 3 1.1 1.7
1 1 4 2.6 0.8
1 1 5 1.8 1.3
1 1 6 2.1 2.2
1 1 7 0.5 0.2
1 1 8 0.7 0.9
1 1 9 1.1 1.7
1 1 10 2.6 0.8
1 1 11 1.8 1.3
1 1 12 2.1 2.2
1 2 1 0.5 0.2
1 2 2 0.7 0.9
1 2 3 1.1 1.7
1 2 4 2.6 0.8
1 2 5 1.8 1.3
1 2 6 2.1 2.2
1 2 7 0.5 0.2
1 2 9 1.1 1.7
1 2 10 2.6 0.8
1 2 11 1.8 1.3
1 2 12 2.1 2.2
For the second grouping (ID_1 = 1 and ID_2 = 2), there is a month missing from the data (month = 8). Is there a way I can find this month and insert a row with the correct ID_1 and ID_2 values, the missing month value, and NA values for the rest of the columns? I've been playing around with this using dplyr functions and can't seem to figure it out, perhaps there is even a non-dplyr solution out there as well.
PS: If it helps, each unique grouping of ID_1 and ID_2 will have no more than 1 month missing.

Expand grid to make all combos of groups, then merge:
# make reference with all needed rows
ref <- data.frame(expand.grid(unique(df1$ID_1),
                              unique(df1$ID_2),
                              1:12))
colnames(ref) <- colnames(df1)[1:3]
# then merge with all = TRUE to keep rows that exist only in ref
res <- merge(df1, ref, all = TRUE)
# to check output, show only month = 8
res[ res$month == 8, ]
# ID_1 ID_2 month st1 st2
# 8 1 1 8 0.7 0.9
# 20 1 2 8 NA NA
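The expand.grid approach above pairs every ID_1 with every ID_2, so it can also create ID combinations that never occur in the data. A minimal base R sketch of a variant that builds the reference only from the observed groups, and then lists the missing months, might look like this. The toy df1 and its st1/st2 values are made up for illustration, and flagging inserted rows by NA in st1 assumes st1 has no missing values of its own.

```r
# Toy data (illustrative values): group (1, 2) lacks month 8
df1 <- data.frame(
  ID_1  = 1L,
  ID_2  = rep(1:2, times = c(12, 11)),
  month = c(1:12, setdiff(1:12, 8)),
  st1   = (1:23) / 10,
  st2   = (23:1) / 10
)

# Only the ID pairs that actually occur in the data
groups <- unique(df1[c("ID_1", "ID_2")])

# merge() with no shared columns returns the Cartesian product,
# i.e. every observed group crossed with months 1..12
ref <- merge(groups, data.frame(month = 1:12))

# Left join the data onto the reference; absent months get NA statistics
res <- merge(ref, df1, all.x = TRUE)

# Rows that were inserted (assumes st1 is never NA in the original data)
missing <- res[is.na(res$st1), c("ID_1", "ID_2", "month")]
```

Here `res` holds the completed data and `missing` lists the group and month combinations that had to be added.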

This can be done via tidyr::complete:
library(dplyr)
library(tidyr)
dat %>%
group_by(ID_1, ID_2) %>%
complete(month = 1:12)
Tail of dataset:
Source: local data frame [6 x 5]
Groups: ID_1, ID_2 [1]
ID_1 ID_2 month st1 st2
<int> <int> <int> <dbl> <dbl>
1 1 2 7 0.5 0.2
2 1 2 8 NA NA
3 1 2 9 1.1 1.7
4 1 2 10 2.6 0.8
5 1 2 11 1.8 1.3
6 1 2 12 2.1 2.2

If you go with tidyr, the complete function handles this. Wrap ID_1 and ID_2 in nesting() so that only the combinations that actually occur in the data are used as groups:
library(tidyr)
df1 = df %>% complete(nesting(ID_1, ID_2), month)
tail(df1)
# Source: local data frame [6 x 5]
# ID_1 ID_2 month st1 st2
# <int> <int> <int> <dbl> <dbl>
# 1 1 2 7 0.5 0.2
# 2 1 2 8 NA NA
# 3 1 2 9 1.1 1.7
# 4 1 2 10 2.6 0.8
# 5 1 2 11 1.8 1.3
# 6 1 2 12 2.1 2.2
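One difference between the two tidyr answers is worth keeping in mind: passing the bare column name month completes only over the months observed somewhere in the data, while month = 1:12 (or any explicit range) also adds months that are missing from every group. A small sketch with made-up data, using a 1:3 range to keep it short:

```r
library(tidyr)

# Toy data (illustrative): month 2 is absent from every group
df <- data.frame(
  ID_1  = 1L,
  ID_2  = c(1L, 1L, 2L),
  month = c(1L, 3L, 1L),
  st1   = c(0.5, 0.7, 1.1)
)

# Completes over observed months only (1 and 3): month 2 is never added
observed <- complete(df, nesting(ID_1, ID_2), month)

# Completes over an explicit range: month 2 is added for both groups
full <- complete(df, nesting(ID_1, ID_2), month = 1:3)
```

With the question's data this makes no difference, since every month appears in at least one group, but the explicit range is the safer choice when that is not guaranteed.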

Related

Retaining the last value of a column

I have the following data frame,
Input
For all observations where Month > tenor, the last value of the rate column should be retained for each account for the remaining months. Eg:- Customer 1 has tenor = 5, so for all months greater than 5, the last rate value is retained.
I am using the following code
df$rate <- ifelse(df$Month > df$tenor,tail(df$rate, n=1),df$rate)
But here, the last value is NA so it does not work
Expected output is
Output
This will work, but please provide a reproducible example. Others want to help you, not do your homework.
df <- data.frame(
customer = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
Month = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
tenor = c(5,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3),
rate = c(0.2,0.3,0.4,0.5,0.6,NA,NA,NA,NA,NA,0.1,0.2,0.3,NA,NA,NA,NA,NA,NA,NA)
)
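# Scalar helper: past the tenor with no rate, look up the rate at Month == tenor
fn <- function(cus, mon, ten, rat){
  if (mon > ten && is.na(rat)){
    return(dplyr::filter(df, customer == cus, Month == ten, tenor == ten)$rate)
  }
  return(rat)
}
# Vectorize() applies the scalar helper row by row
df2 <- dplyr::mutate(df,
  newrate = Vectorize(fn)(customer, Month, tenor, rate)
)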
One option is:
library(dplyr)
library(tidyr)
df %>%
group_by(customer) %>%
fill(rate, .direction = "down") %>%
ungroup()
# A tibble: 20 x 4
customer Month tenor rate
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 0.2
2 1 2 5 0.3
3 1 3 5 0.4
4 1 4 5 0.5
5 1 5 5 0.6
6 1 6 5 0.6
7 1 7 5 0.6
8 1 8 5 0.6
9 1 9 5 0.6
10 1 10 5 0.6
11 2 1 3 0.1
12 2 2 3 0.2
13 2 3 3 0.3
14 2 4 3 0.3
15 2 5 3 0.3
16 2 6 3 0.3
17 2 7 3 0.3
18 2 8 3 0.3
19 2 9 3 0.3
20 2 10 3 0.3
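For readers who prefer to avoid extra packages, the same carry-forward idea can be sketched in base R. The helper name locf is hypothetical (it is not part of base R), and the data frame is the one from the first answer above.

```r
df <- data.frame(
  customer = rep(1:2, each = 10),
  Month    = rep(1:10, times = 2),
  tenor    = rep(c(5, 3), each = 10),
  rate     = c(0.2, 0.3, 0.4, 0.5, 0.6, rep(NA, 5),
               0.1, 0.2, 0.3, rep(NA, 7))
)

# Hypothetical helper: last observation carried forward
locf <- function(x) {
  # Position of the most recent non-NA value at each index (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA  # leading NAs have nothing to carry forward
  x[idx]
}

# ave() applies the helper within each customer and keeps the row order
df$rate <- ave(df$rate, df$customer, FUN = locf)
```

Like fill(), this fills every gap after a known value, not only the rows where Month > tenor; in the example data the two coincide.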
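I can't replicate your data frame, so this is a guess for now.
I think dplyr offers a solution: group by customer and replace each missing rate with that customer's last non-missing rate.
library(dplyr)
df %>%
group_by(customer) %>%
mutate(rate = coalesce(rate, last(rate[!is.na(rate)]))) %>%
ungroup()
This should work.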

Assigning Subgroups in Data R

I know how to do basic stuff in R, but I am still a newbie. I am also probably asking a pretty redundant question (but I don't know how to enter it into google so that I find the right hits).
I have been getting hits like the below:
Assign value to group based on condition in column
R - Group by variable and then assign a unique ID
I want to assign subgroups into groups, and create a new column out of them.
I have data like the following:
dataframe:
ID SubID Values
1 15 0.5
1 15 0.2
2 13 0.1
2 13 0
1 14 0.3
1 14 0.3
2 10 0.2
2 10 1.6
6 31 0.7
6 31 1.0
new dataframe:
ID SubID Values groups
1 15 0.5 2
1 15 0.2 2
2 13 0.1 2
2 13 0 2
1 14 0.3 1
1 14 0.3 1
2 10 0.2 1
2 10 1.6 1
6 31 0.7 1
6 31 1.0 1
I have tried the following in R, but I am not getting the desired results:
newdataframe$groups <- dataframe %>% group_indices(,dataframe$ID, dataframe$SubID)
newdataframe<- dataframe %>% group_by(ID, SubID) %>% mutate(groups=group_indices(,dataframe$ID, dataframe$SubID))
I am not sure how to frame the question in R. I want to group by ID and SubID, then number the subgroups within each ID, restarting the count for each ID.
Any help would be really appreciated.
Here is an alternative approach which uses the rleid() function from the data.table package. rleid() generates a run-length type id column.
According to the expected result, the OP expects SubID to be numbered by order of value and not by order of appearance. Therefore, we need to call arrange().
library(dplyr)
df %>%
group_by(ID) %>%
arrange(SubID) %>%
mutate(groups = data.table::rleid(SubID))
ID SubID Values groups
<int> <int> <dbl> <int>
1 2 10 0.2 1
2 2 10 1.6 1
3 2 13 0.1 2
4 2 13 0 2
5 1 14 0.3 1
6 1 14 0.3 1
7 1 15 0.5 2
8 1 15 0.2 2
9 6 31 0.7 1
10 6 31 1 1
Note that the row order has changed.
BTW: With data.table, the code is less verbose and the original row order is maintained:
library(data.table)
setDT(df)[order(ID, SubID), groups := rleid(SubID), by = ID][]
ID SubID Values groups
1: 1 15 0.5 2
2: 1 15 0.2 2
3: 2 13 0.1 2
4: 2 13 0.0 2
5: 1 14 0.3 1
6: 1 14 0.3 1
7: 2 10 0.2 1
8: 2 10 1.6 1
9: 6 31 0.7 1
10: 6 31 1.0 1
There are multiple ways to do this. One way is to group_by ID and create a unique number for each SubID by converting it to a factor and then to an integer.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(groups = as.integer(factor(SubID)))
# ID SubID Values groups
# <int> <int> <dbl> <int>
# 1 1 15 0.5 2
# 2 1 15 0.2 2
# 3 2 13 0.1 2
# 4 2 13 0 2
# 5 1 14 0.3 1
# 6 1 14 0.3 1
# 7 2 10 0.2 1
# 8 2 10 1.6 1
# 9 6 31 0.7 1
#10 6 31 1 1
In base R, we can use ave with similar logic:
df$groups <- with(df, ave(SubID, ID, FUN = function(x) as.integer(factor(x))))
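Numbering the distinct SubID values within each ID by their sorted order is a dense rank, so dplyr's dense_rank() is another way to express it. A minimal sketch using the question's data:

```r
library(dplyr)

df <- data.frame(
  ID     = c(1, 1, 2, 2, 1, 1, 2, 2, 6, 6),
  SubID  = c(15, 15, 13, 13, 14, 14, 10, 10, 31, 31),
  Values = c(0.5, 0.2, 0.1, 0, 0.3, 0.3, 0.2, 1.6, 0.7, 1.0)
)

# dense_rank() gives tied values the same rank and leaves no gaps,
# so each distinct SubID within an ID gets 1, 2, ... in sorted order
res <- df %>%
  group_by(ID) %>%
  mutate(groups = dense_rank(SubID)) %>%
  ungroup()
```

The original row order is kept, because no arrange() step is needed.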

Split up grouped binomial data in r

I have data that looks like this
samplesize <- 6
group <- c(1,2,3)
total <- rep(samplesize,length(group))
outcomeTrue <- c(2,1,3)
df <- data.frame(group,total,outcomeTrue)
and would like my data to look like this
group2 <- c(rep(1,6),rep(2,6),rep(3,6))
outcomeTrue2 <- c(rep(1,2),rep(0,6-2),rep(1,1),rep(0,6-1),rep(1,3),rep(0,6-3))
df2 <- data.frame(group2,outcomeTrue2)
That is to say, I have binary data where I am told the total observations and the successful observations, but would prefer it to be organised as individual observations with their explicit outcome as 0 or 1.
Is there an easy way to do this in r, or will I need to write a loop to automate this myself?
Here is one option with tidyverse. We uncount to expand the rows using the 'total' column, then, grouped by 'group', create a binary index with a logical condition based on row_number() and the value of 'outcomeTrue'.
library(tidyverse)
df %>%
uncount(total) %>%
group_by(group) %>%
mutate(outcomeTrue = as.integer(row_number() <= outcomeTrue[1]))
# A tibble: 18 x 2
# Groups: group [3]
# group outcomeTrue
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 0
# 4 1 0
# 5 1 0
# 6 1 0
# 7 2 1
# 8 2 0
# 9 2 0
#10 2 0
#11 2 0
#12 2 0
#13 3 1
#14 3 1
#15 3 1
#16 3 0
#17 3 0
#18 3 0
You are almost there. Just use the group2 variable as the row index (the first argument inside "["):
df[ group2 , ]
group total outcomeTrue
1 1 6 2
1.1 1 6 2
1.2 1 6 2
1.3 1 6 2
1.4 1 6 2
1.5 1 6 2
2 2 6 1
2.1 2 6 1
2.2 2 6 1
2.3 2 6 1
2.4 2 6 1
2.5 2 6 1
3 3 6 3
3.1 3 6 3
3.2 3 6 3
3.3 3 6 3
3.4 3 6 3
3.5 3 6 3
When a number or character value that matches a row name is repeated in the row index of "[", the entire row is replicated once per occurrence.
Here is a base R solution.
# For each group, emit outcomeTrue ones followed by (total - outcomeTrue) zeros
do.call(rbind, lapply(split(df, df$group), function(x) {
  data.frame(group2 = x$group,
             outcome2 = rep(c(1, 0), times = c(x$outcomeTrue, x$total - x$outcomeTrue)))
}))
# group2 outcome2
# 1.1 1 1
# 1.2 1 1
# 1.3 1 0
# 1.4 1 0
# 1.5 1 0
# 1.6 1 0
# 2.1 2 1
# 2.2 2 0
# 2.3 2 0
# 2.4 2 0
# 2.5 2 0
# 2.6 2 0
# 3.1 3 1
# 3.2 3 1
# 3.3 3 1
# 3.4 3 0
# 3.5 3 0
# 3.6 3 0
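The split/lapply/rbind pattern above can also be written without building one data frame per group. A minimal base R sketch, using the question's df and the column names from its desired df2:

```r
df <- data.frame(group = c(1, 2, 3),
                 total = 6,
                 outcomeTrue = c(2, 1, 3))

# One row per observation: repeat each group label 'total' times
group2 <- rep(df$group, times = df$total)

# For each group: k ones (successes) followed by n - k zeros (failures)
outcomeTrue2 <- unlist(Map(function(k, n) rep(c(1, 0), times = c(k, n - k)),
                           df$outcomeTrue, df$total))

long <- data.frame(group2, outcomeTrue2)
```

Summing outcomeTrue2 by group2 recovers the original success counts, which is a quick way to check the expansion.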

fill gap in dataframe [duplicate]

This question already has answers here:
adding default values to item x group pairs that don't have a value (df %>% spread %>% gather seems strange)
(2 answers)
Closed 4 years ago.
Original Data
id hhcode value
1 1 4.1
1 2 4.5
1 3 3.3
10 5 3.2
Required Output
id hhcode value
1 1 4.1
1 2 4.5
1 3 3.3
1 5 0
10 1 0
10 2 0
10 3 0
10 5 3.2
What I've got so far
df <- data.frame(
id = c(1, 1, 1, 10),
hhcode = c(1, 2, 3, 5),
value = c(4.1, 4.5, 3.3, 3.2)
)
library(statar)
library(tidyverse)
df %>%
group_by(id) %>%
fill_gap(hhcode, full = TRUE)
# A tibble: 10 x 3
# Groups: id [2]
id hhcode value
<dbl> <dbl> <dbl>
1 1 1 4.1
2 1 2 4.5
3 1 3 3.3
4 1 4 NA
5 1 5 NA
6 10 1 NA
7 10 2 NA
8 10 3 NA
9 10 4 NA
10 10 5 3.2
Any hint to get the required output?
We could use complete
library(tidyverse)
complete(df, id, hhcode, fill = list(value = 0))
# A tibble: 8 x 3
# id hhcode value
# <dbl> <dbl> <dbl>
#1 1 1 4.1
#2 1 2 4.5
#3 1 3 3.3
#4 1 5 0
#5 10 1 0
#6 10 2 0
#7 10 3 0
#8 10 5 3.2

lagging variables by day and creating new row in the process

I'm trying to lag variables by day but many don't have an observation on the previous day. So I need to add an extra row in the process. Dplyr gets me close but I need a way to add a new row in the process and have many thousands of cases. Any thoughts would be much appreciated.
ID<-c(1,1,1,1,2,2)
day<-c(0,1,2,5,1,3)
v<-c(2.2,3.4,1.2,.8,6.4,2)
dat1<-as.data.frame(cbind(ID,day,v))
dat1
ID day v
1 1 0 2.2
2 1 1 3.4
3 1 2 1.2
4 1 5 0.8
5 2 1 6.4
6 2 3 2.0
Using dplyr gets me here:
dat2<-
dat1 %>%
group_by(ID) %>%
mutate(v.L = dplyr::lead(v, n = 1, default = NA))
dat2
ID day v v.L
1 1 0 2.2 3.4
2 1 1 3.4 1.2
3 1 2 1.2 0.8
4 1 5 0.8 NA
5 2 1 6.4 2.0
6 2 3 2.0 NA
But I need to get here:
ID2<-c(1,1,1,1,1,2,2,2)
day2<-c(0,1,2,4,5,1,2,3)
v2<-c(2.2,3.4,1.2,NA,.8,6.4,NA,2)
v2.L<-c(3.4,1.2,NA,.8,NA,NA,2,NA)
dat3<-as.data.frame(cbind(ID2,day2,v2,v2.L))
dat3
ID2 day2 v2 v2.L
1 1 0 2.2 3.4
2 1 1 3.4 1.2
3 1 2 1.2 NA
4 1 4 NA 0.8
5 1 5 0.8 NA
6 2 1 6.4 NA
7 2 2 NA 2.0
8 2 3 2.0 NA
You could use complete and full_seq from the tidyr package to complete the sequence of days. You'd need to remove at the end the rows that have NA in both v and v.L:
library(dplyr)
library(tidyr)
dat2 = dat1 %>%
group_by(ID) %>%
complete(day = full_seq(day,1)) %>%
mutate(v.L = lead(v)) %>%
filter(!(is.na(v) & is.na(v.L)))
ID day v v.L
<dbl> <dbl> <dbl> <dbl>
1 0 2.2 3.4
1 1 3.4 1.2
1 2 1.2 NA
1 4 NA 0.8
1 5 0.8 NA
2 1 6.4 NA
2 2 NA 2.0
2 3 2.0 NA
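To see why the final filter is needed, it may help to look at the intermediate steps separately. The sketch below reuses the question's dat1; the names days_id1, completed, and result are only illustrative.

```r
library(dplyr)
library(tidyr)

dat1 <- data.frame(ID  = c(1, 1, 1, 1, 2, 2),
                   day = c(0, 1, 2, 5, 1, 3),
                   v   = c(2.2, 3.4, 1.2, 0.8, 6.4, 2))

# full_seq() fills the gaps between the smallest and largest value
days_id1 <- full_seq(c(0, 1, 2, 5), period = 1)

# Intermediate result: every day in each ID's own range; v is NA on added days
completed <- dat1 %>%
  group_by(ID) %>%
  complete(day = full_seq(day, 1)) %>%
  ungroup()

# Lead within each ID, then drop rows that carry no value in either column
result <- completed %>%
  group_by(ID) %>%
  mutate(v.L = lead(v)) %>%
  filter(!(is.na(v) & is.na(v.L))) %>%
  ungroup()
```

For ID 1, days 3 and 4 are both added, but only day 4 has a next-day observation, so day 3 is the row the filter removes.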
