Binning data by row values with minimum sample size - r

I’m trying to figure out how to create bins with a minimum sample size that also accounts for values in a specific column.
So, in the dummy data below, I want to create bins that have a minimum number of 6 samples in them, but if a bin includes a row with a specific value from column a, I want that bin to also include all other rows with that same value. I also do not want any bins to only contain 1 unique value from row a. I then want the output to have a row with a mean of the unique values in column a, a mean of all values in column b and a column with sample size.
df<-data.frame(a=c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
b=c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,11,12,13,14,16,14,13,15,13,15))
I want the output to look something like this:
mean.a mean.b n
1 2.0 12.33333 9
2 5.0 12.83333 6
3 7.0 12.66667 6
4 8.5 13.28571 7
This is what I have so far:
x<-df
final<-NULL
for(i in 1:16){
x1<-x[1:6,]
x2<-x[-c(1:6),]
x3<-rbind(x1, x2[x2$a==x1$a[6],])
n<-nrow(x3)
y<-mean(x3$b)
z<-mean(unique(x3$a))
f<-data.frame(mean.a=z, mean.b=y, n=n)
final<-rbind(final,f)
x<-x[-c(1:n),]
}
final<-final[complete.cases(final),]
The problem I'm having is I can't figure out how to not have a single bin with one unique value in column a. For example, in the third bin, all 6 rows have mean.a$a=7, but I would like to add the next sequential row and all rows with that row value in column a to that bin (which would be all rows that have mean.a$a=8 in this case).
Also, I can't figure out how to get the loop to continue looping through without having 1:number at the top, and then just deleting the rows with NAs afterwards, this isn't a huge deal, but that's the reason it's kind of messy.
I'm not attached to this loop by any means, and if there's a simpler way to answer this question, I'm all for it!

Here is a recursive solution for the problem, where get_6 will return a group variable based on the column a. The conditions are met in get_i function inside, starting from index 6 and move forward until we find the next index that is not equal to the current value and the length of unique values is not equal to 1, every time we found a sequence that satisfies the condition we increase the id by one and the result will be similar to what you get from the rleid function from data.table, from there, summary statistics can be calculated based on this group variable:
get_6 <- function(vec, id = 1) {
if(length(vec) < 6) NULL
else {
get_i <- function(x, i = 6) {
if(length(x) == i) i
else if(x[i + 1] != x[i] && length(unique(x[1:i])) != 1) i
else get_i(x, i + 1)
}
ind <- get_i(vec)
c(rep(id, ind), get_6(vec[-(1:ind)], id + 1))
}
}
s <- get_6(df$a)
s
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
library(dplyr)
df[1:length(s), ] %>%
mutate(g = s) %>% group_by(g) %>%
summarize(n = n(), mean.a = mean(unique(a)), mean.b = mean(b))
#Source: local data frame [4 x 4]
# g n mean.a mean.b
# <dbl> <int> <dbl> <dbl>
#1 1 9 2.0 12.33333
#2 2 6 5.0 12.83333
#3 3 9 7.5 12.44444
#4 4 7 9.5 14.28571

Related

How to Efficiently Populate R Dataframe with Lookup Values from Another Dataframe [duplicate]

This question already has answers here:
Expand ranges defined by "from" and "to" columns
(10 answers)
Closed 2 years ago.
I have a question regarding efficiently populating an R dataframe based on data retrieved from another dataframe.
So my input typically looks like:
dfInput <- data.frame(start = c(1,6,17,29), end = c(5,16,28,42), value = c(1,2,3,4))
start end value
1 5 1
6 16 2
17 28 3
29 42 4
I want to find the min and max values in cols 1 and 2 and create a new dataframe with a row for each value in that range:
rangeMin <- min(dfInput$start)
rangeMax <- max(dfInput$end)
dfOutput <- data.frame(index = c(rangeMin:rangeMax), value = 0)
And then populate it with the appropriate "values" from the input dataframe:
for (i in seq(nrow(dfOutput))) {
lookup <- dfOutput[i,"index"]
dfOutput[i, "value"] <- dfInput[which(dfInput$start <= lookup &
dfInput$end >= lookup),"value"]
}
This for-loop achieves what I want to do, but it feels like this is a very convoluted way to do it.
Is there a way that I can do something like:
dfOutput$value <- dfInput[which(dfInput$start <= dfOutput$index &
dfInput$end >= dfOutput$index),"value"]
Or something else to populate the values when instantiating dfOutput.
I feel like this is pretty basic but I'm new to R, so many thanks for any help!
You can create a sequence between start and end :
library(dplyr)
dfInput %>%
mutate(index = purrr::map2(start, end, seq)) %>%
tidyr::unnest(index) %>%
select(-start, -end)
# A tibble: 42 x 2
# value index
# <dbl> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 6
# 7 2 7
# 8 2 8
# 9 2 9
#10 2 10
# … with 32 more rows
In base R :
do.call(rbind, Map(function(x, y, z)
data.frame(index = x:y, value = z), dfInput$start, dfInput$end, dfInput$value))

(dplyr) Sum of N values most recent to a date

I'm trying to create a function that sums the closest n values to a given date. So if I had 5 weeks of data, and n=2, the value on week 1 would be the sum of weeks 2&3, the value on week 2 would be the sum of weeks 1&3, etc. Example:
library(dplyr)
library(data.table)
Week <- 1:5
Sales <- c(1, 3, 5, 7, 9)
frame <- data.table(Week, Sales)
frame
Week Sales Recent
1: 1 1 8
2: 2 3 6
3: 3 5 10
4: 4 7 14
5: 5 9 12
I want to make a function that does this for me with an input for most recent n (not just 2), but for now I want to get 2 right. Here's my function using lag/lead:
RecentSum = function(Variable, Lags){
Sum = 0
for(i in 1:(Lags/2)){ #Lags/2 because I want half values before and half after
#Check to see if you can go backwards. If not, go foward (i.e. use lead).
if(is.na(lag(Variable, i))){
LoopSum = lead(Variable, i)
}
else{
LoopSum = lag(Variable, i)
}
Sum = Sum + LoopSum
}
for(i in 1:(Lags/2)){
if(is.na(lead(Variable, i))){ #Check to see if you can go forward. If not, go backwards (i.e. use lag).
LoopSum = lag(Variable, i)
}
else{
LoopSum = lead(Variable, i)
}
Sum = Sum + LoopSum
}
Sum
}
When I do RecentSum(frame$Sale,2) I get [1] 6 10 14 18 NA which is wrong for a number of reasons:
My if statements are only hitting on week one, so it will always be NA for lag and always be non-NA for lead.
I need to have a way to see if it uses lag/lead the first time. The first value is 6 instead of 8 because the first for-loop sends it to lead(_,1), but then the second for-loop does the same. I can't think of how I'd make my second for-loop recognize this.
Is there a function or library (Zoo?) that makes this task easy? I'd like to get my own function to work for the sake of practice/understanding, but at this point I'd rather just get it done.
Thanks!
To elaborate on my comment, lead and lag are functions that are meant to be used within vectorized functions such as dplyr. Here is a way to do it within dplyr without using a function:
df <- tibble(week = Week, sales = Sales)
df %>%
mutate(recent = case_when(is.na(lag(sales)) ~ lead(sales, n = 1) + lead(sales, n = 2),
is.na(lead(sales)) ~ lag(sales, n = 1) + lag(sales, n = 2),
TRUE ~ lag(sales) + lead(sales)))
That gives you this:
# A tibble: 5 x 3
week sales recent
<int> <dbl> <dbl>
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
1) Assuming that k is even define to as a vector of indices such that for each element of to we sum the k+1 elements of Sales that end in that index and from that subtract Sales:
k <- 2 # number of elements to sum
n <- nrow(frame)
to <- pmax(k+1, pmin(1:n + k/2, n))
Sum <- function(to, Sales) sum(Sales[seq(to = to, length = k+1)])
frame %>% mutate(recent = sapply(to, Sum, Sales) - Sales)
giving:
Week Sales recent
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
Note that by replacing the last line of code above with the following line the solution can be done entirely in base R:
transform(frame, recent = sapply(to, Sum, Sales) - Sales)
2) This concatenates the appropriate elements before and after the Sales series so that an ordinary rolling sum gives the result.
library(zoo)
ix <- c(seq(to = k+1, length = k/2), 1:n, seq(to = n-k, length = k/2))
frame %>% mutate(recent = rollsum(Sales[ix], k+1) - Sales)
Note that if k=2 then it reduces this to this one-liner:
frame %>% mutate(recent = rollsum(Sales[c(3, 1:n(), n()-2)], 3) - Sales)
giving:
Week Sales recent
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
Update: fixed for k > 2

Count of values, for each group, less than a quantile of the whole column performed on each column

I have a data.frame with two groups and two variables (other than the group variable), for example:
set.seed(1729)
temp <- data.frame(group=c(1,2),value1=rnorm(12),value2=rnorm(12))
temp = temp[order(temp$group),]
# group value1 value2
# 1 0.21531616 0.08679615
# 1 1.08604925 0.36344973
# 1 1.04225410 0.53281840
# 1 -1.40843189 0.52096971
# 1 -0.07130541 -0.47550518
# 1 0.18839979 1.96241245
# 2 0.18374784 -0.64102941
# 2 0.02871298 -0.67746579
# 2 0.08826553 -0.32679060
# 2 0.05522136 -0.31371224
# 2 0.36086719 -0.10004339
# 2 -0.55618926 1.22760816
I calculate the quantiles of temp$value1 & temp$value2
qt1 = quantile(temp$value1,probs = c(0.25,0.5,0.75))
qt2 = quantile(temp$value2,probs = c(0.25,0.5,0.75))
For each group, I need
(1) the count of number of values of temp$value1 < qt1[1] (2) the count of number of values of temp$value1 < qt[2] (3) the count of number of values of temp$value1 < qt1[3]. Similarly, six values (two groups & three quantiles) of counts for temp$value2.
As a code, it is (I manually copied each group+variable into a vector to illustrate what I want)
g1v1=c(0.21531616,1.08604925,1.04225410,-1.40843189,-0.07130541,0.18839979)
length(g1v1[g1v1<qt1[1]])
length(g1v1[g1v1<qt1[2]])
length(g1v1[g1v1<qt1[3]])
g2v1=c(0.18374784,0.02871298,0.08826553,0.05522136,0.36086719,-0.55618926)
length(g1v2[g1v2<qt1[1]])
length(g1v2[g1v2<qt1[2]])
length(g1v2[g1v2<qt1[3]])
g1v2=c(0.08679615,0.36344973,0.53281840,0.52096971,-0.47550518,1.96241245)
length(g2v1[g2v1<qt2[1]])
length(g2v1[g2v1<qt2[2]])
length(g2v1[g2v1<qt2[3]])
#similarly for g2v2
The output has to be a data.frame like:
# group value1.25.ct value1.50.ct value1.75.ct value2.25.ct value2.50.ct value2.75.ct
# 1 2 2 4 1 1 4
# 2 1 4 5 2 5 5
Please recommend an efficient approach. Thank you.

R loops:conditioning a loop in R

Thanks for the feedback, below is a reproducible example with my desire output:
# Example Data where I would like my output
N=24
school.assignment = matrix(NA, ncol = 3, nrow = N)
school.assignment = as.data.frame(school.assignment)
colnames(school.assignment) <- c("ID","Group","Assignment")
# Number of groups and assigments per group
groups = 6
Assignment = 4
school.assignment$Group<-rep(1:groups,Assignment)
school.assignment$Group<- sort(school.assignment$Group)
school.assignment$Assignment<-rep(1:Assignment)
# IDs with number of repeats (i.e repeated students)
Data = matrix(0, ncol = 2, nrow = N/2) # ~half with repeated samples
Data = as.data.frame(Data)
colnames(Data) <- c("ID","Repeats")
Data$ID <-1:(N/2)
length(unique(Data$ID)) # unique IDS
ID=rep(seq(1:8),3)
# Genearte random repeats for each ID
Data$Repeats<-{set.seed(55)
sapply(1:(N/2),
function(x) sample(1:5,1))
}
Data=Data[-1,] #take out first row to match N=24
sum(Data$Repeats) #24 total IDs for all assigments
# List of IDs at random to use
IDs <- vector("list",dim(Data)[1]) #
for (i in 1:dim(Data)[1])
{
IDs[[i]]<-rep(Data$ID[i], times=Data$Repeats[i])
}
head(IDs)
# Object with number of repeated IDs
sample.per.ID <- vector("list",length(IDs)[1])
for (i in 1:length(IDs))
{
sample.per.ID[[i]]<-sum(length((IDs)[[i]]))
}
sum=sum(as.data.frame(sample.per.ID)); sum # 24 total IDs (including repeats)
## Unlist Vector with ransom sequence of samples
SRS.ID.order = unlist(IDs) #order of IDs with repeats
for (i in 1:length(SRS.ID.order ))
{
school.assignment$ID[i]<-SRS.ID.order [i]
}
My last loop is where I attempt to assign IDs to my matrix of school.assignment$ID. However, as you can see some IDs cross different groups and I want to condition ID assignment from the SRS.ID.order to stay within the same group (i.e. constant school.assignment$Group, below you can see that this is not the case, for example ID 4 is in group 1 and 2)
> head(school.assignment)
ID Group Assignment
1 2 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 4 2 1
6 4 2 2
I would like the output of the loop to don't assign any ID (i.e. NA) to that group if the next school.assignment$ID length is longer than the space available in that group.
ID Group Assignment
1 2 1 1
2 2 1 2
3 3 1 3
4 NA 1 4
5 4 2 1
6 4 2 2
I was thinking that I need some type of indicator for the J group like this code below:
########################################
for (i in 1:length(school.assignment$ID))
{
for (j in 1:length(unique(school.assignment$Group)))
{
school.assignment$ID[i]<-ifelse(sum(is.na(school.assignment$ID[i,j]))>=sample.per.ID[i],SRS.ID.order[i],NA)
}
}
Error in school.assignment$ID[i, j] : incorrect number of dimensions
Any help is very much appreciated!
Thanks
OLD POST
I'm currently trying to do a loop in R with a a condition. My data structure is below:
> head(school.assignment)
ID Group Assignment
1 NA 1 1
2 NA 1 2
3 NA 1 3
4 NA 1 4
5 NA 2 1
6 NA 2 2
I would like to assign an ID of the same length as school.assignment to the ID variable shown below:
head(IDs)
[1] 519 519 519 343 251 251...
Not all IDs repeat the same amount of times some 1,2 or even 3 times as shown above. I have an object with the number of repeats per ID, for example:
> head(repeats)
[1] 3 1 2...
Indicating that ID=519 repeats 3 times, ID=343 only once ad ID=251 2 times etc...
There is one condition that I would like to meet:
1) I would like every single ID to be in the same group whenever possible (i.e. if there is only one spot (NA) left for ID in the matrix object "school.assignment" for group 1 then assign the ID to the next group where they will be enough spaces (i.e where NA for school.assignment$ID is >= to repeats for that ID)
My idea was to do a loop but the code below is not working:
########################################
for (i in 1:length(school.assignment$ID))
{
for (j in 1:length(unique(school.assignment$Group)))
{
school.assignment$ID[i]<-ifelse(sum(is.na(school.assignment$ID[i,j]))>=repeats[i],ID[i],NA)
}
}
Is there a way to do this loop while respecting my condition to assign IDs to only one group?
Thank you!
Consider using merge() to assign random group IDs to data frame. No need for nested for loops. Below creates a unique group data frame, assigns random numbers there, and then merges with school.assignment:
# CREATE UNIQUE GROUP DATA FRAME
Group <- unique(school.assignment$Group)
grp.ids <- as.data.frame(Group)
# CREATE RANDOM ID FIELD (THREE DIGITS BETWEEN 100 AND 999)
grp.ids$RandomID <- sample(100:999, size = nrow(grp.ids), replace = TRUE)
# MERGE DATA FRAMES
school.assignment <- merge(school.assignment, grp.ids, by="Group", all=TRUE)
# ASSIGN ID COLUMN
school.assignment$ID <- school.assignment$RandomID
# RESTRUCTURE FINAL DATA FRAME
school.assignment <- school.assignment[c("ID", "Group", "Assignment")]
OUTPUT
ID Group Assignment
977 1 1
977 1 2
977 1 3
977 1 4
368 2 1
368 2 2

R Loop with conditions

I have a series of repeated IDs that I would like to assign to groups with fix size. The subject IDs repeat with different frequencies for example:
# Example Data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
> head(Data)
ID Repeats
1 101 2
2 102 3
3 103 1
4 104 3
I would like the same repeated ID to stay within the same group. However, each group has a fixed capacity (say 3 only). For example, in my desired output matrix each group can only accommodate 3 IDs:
# Create empty data frame for group annotation
# Add 3 rows in order to have more space for IDs
# Some groups will have NAs due to keeping IDs together (I'm OK with that)
Target = data.frame(matrix(NA,nrow=(sum(Data$Repeats)+3),
ncol=dim(Data)[2]))
names(Target)<-c("ID","Group")
Target$Group<-rep(1:3)
Target$Group<-sort(Target$Group)
> head(Target)
ID Group
1 NA 1
2 NA 1
3 NA 1
4 NA 1
5 NA 2
6 NA 2
I can loop each ID to my Target data frame but this does not guarantee that repeated IDs will stay together in the same group:
# Loop repeated IDs the groups
IDs.repeat = rep(Data$ID, times=Data$Repeats)
# loop IDs to Targets to assign IDs to groups
for (i in 1:length(IDs.repeat))
{
Target$ID[i]<-IDs.repeat[i]
}
In my example in the loop above I get the same ID (102) across two different groups (1 and 2), I would like to avoid this!:
> head(Target)
ID Group
1 101 1
2 101 1
3 102 1
4 102 1
5 102 2
6 103 2
Instead I want the output to look the code to put NA if there is no space for that ID in that group.
> head(Target)
ID Group
1 101 1
2 101 1
3 NA 1
4 NA 1
5 102 2
6 102 2
Anyone has a solution for IDs to stay within the same group if there is sufficient space after assigning ID i?
I think that I need a loop and count NAs within that group and see if the NAs>= to the length of that unique ID. However, I don't know how to implement this simultaneously. Maybe nesting another loop for the j group?
Any help with the loop will be appreciated immensely!
Here's one solution,
## This is the data.frame I'll try to match
target <- data.frame(
ID = c(
rep(101, 2),
rep(102, 3),
rep(103, 1),
rep(104, 3)),
Group = c(
rep(1L, 6), # "L", like 1L makes it an int type rather than numeric
rep(2L, 3)
)
)
print(target)
## Your example data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
head(Data)
ids_to_group <- 3 # the number of ids per group is specified here.
Data$Group <- sort(
rep(1:ceiling(length(Data$ID) / ids_to_group),
ids_to_group))[1:length(Data$ID)]
# The do.call(rbind, lapply(x = a series, FUN = function(x) { }))
# pattern is a really useful way to stack data.frames
# lapply is basically a fancy for-loop check it out by sending
# ?lapply to the console (to view the help page).
output <- do.call(
rbind,
lapply(unique(Data$ID), FUN = function(ids) {
print(paste(ids, "done.")) # I like to put print statements to follow along
obs <- Data[Data$ID == ids, ]
data.frame(ID = rep(obs$ID, obs$Repeats))
})
)
output <- merge(output, Data[,c("ID", "Group")], by = "ID")
identical(target, output) # returns true if they're equivalent
# For example inspect each with:
str(target)
str(output)

Resources