Thanks for the feedback, below is a reproducible example with my desired output:
# Example data frame in which I would like my output
N=24
school.assignment = matrix(NA, ncol = 3, nrow = N)
school.assignment = as.data.frame(school.assignment)
colnames(school.assignment) <- c("ID","Group","Assignment")
# Number of groups and assignments per group
groups = 6
Assignment = 4
school.assignment$Group <- rep(1:groups, each = Assignment)
school.assignment$Assignment <- rep(1:Assignment, times = groups)
# IDs with number of repeats (i.e. repeated students)
Data = matrix(0, ncol = 2, nrow = N/2) # ~half with repeated samples
Data = as.data.frame(Data)
colnames(Data) <- c("ID","Repeats")
Data$ID <-1:(N/2)
length(unique(Data$ID)) # unique IDS
ID = rep(1:8, 3)
# Generate random repeats for each ID
set.seed(55)
Data$Repeats <- sapply(1:(N/2), function(x) sample(1:5, 1))
Data=Data[-1,] #take out first row to match N=24
sum(Data$Repeats) # 24 total IDs for all assignments
# List of IDs at random to use
IDs <- vector("list", nrow(Data))
for (i in 1:nrow(Data))
{
  IDs[[i]] <- rep(Data$ID[i], times = Data$Repeats[i])
}
head(IDs)
# Object with the number of repeats per ID
sample.per.ID <- vector("list", length(IDs))
for (i in 1:length(IDs))
{
  sample.per.ID[[i]] <- length(IDs[[i]])
}
total <- sum(unlist(sample.per.ID)); total # 24 total IDs (including repeats)
## Unlist vector with random sequence of samples
SRS.ID.order = unlist(IDs) # order of IDs with repeats
for (i in 1:length(SRS.ID.order))
{
  school.assignment$ID[i] <- SRS.ID.order[i]
}
My last loop above is where I attempt to assign IDs to school.assignment$ID. However, as you can see, some IDs cross different groups, and I want to condition the assignment from SRS.ID.order so that each ID stays within a single group (i.e., constant school.assignment$Group). Below you can see that this is not the case; for example, ID 4 appears in both group 1 and group 2:
> head(school.assignment)
ID Group Assignment
1 2 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 4 2 1
6 4 2 2
I would like the loop to assign NA instead of an ID whenever the next ID's run of repeats is longer than the space remaining in that group:
ID Group Assignment
1 2 1 1
2 2 1 2
3 3 1 3
4 NA 1 4
5 4 2 1
6 4 2 2
I was thinking that I need some type of indicator for the j-th group, like the code below:
########################################
for (i in 1:length(school.assignment$ID))
{
for (j in 1:length(unique(school.assignment$Group)))
{
school.assignment$ID[i]<-ifelse(sum(is.na(school.assignment$ID[i,j]))>=sample.per.ID[i],SRS.ID.order[i],NA)
}
}
Error in school.assignment$ID[i, j] : incorrect number of dimensions
Any help is very much appreciated!
Thanks
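Edit: to make the intended logic concrete, here is a rough sketch of the capacity check I have in mind (untested; it assumes no ID repeats more often than a group has slots, and uses the objects defined above):
# Rough sketch: place each run of repeats only if it fits in the slots
# remaining in the current group; otherwise leave the rest of that
# group as NA and continue at the top of the next group.
school.assignment$ID <- NA                # start from a clean slate
slots.per.group <- Assignment             # 4 slots in each group
pos <- 1                                  # next free row in school.assignment
for (k in 1:length(IDs)) {
  run  <- IDs[[k]]                        # all repeats of one ID
  used <- (pos - 1) %% slots.per.group    # slots already filled in this group
  if (used > 0 && used + length(run) > slots.per.group) {
    pos <- pos + (slots.per.group - used) # skip the rest of the group, leaving NAs
  }
  if (pos + length(run) - 1 > nrow(school.assignment)) break  # no rows left
  school.assignment$ID[pos:(pos + length(run) - 1)] <- run
  pos <- pos + length(run)
}
head(school.assignment)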
OLD POST
I'm currently trying to do a loop in R with a condition. My data structure is below:
> head(school.assignment)
ID Group Assignment
1 NA 1 1
2 NA 1 2
3 NA 1 3
4 NA 1 4
5 NA 2 1
6 NA 2 2
I would like to assign IDs (a vector of the same length as school.assignment) to the ID variable, shown below:
head(IDs)
[1] 519 519 519 343 251 251...
Not all IDs repeat the same number of times; some repeat 1, 2, or even 3 times, as shown above. I have an object with the number of repeats per ID, for example:
> head(repeats)
[1] 3 1 2...
Indicating that ID=519 repeats 3 times, ID=343 only once, and ID=251 twice, etc.
There is one condition that I would like to meet:
1) I would like every single ID to stay in the same group whenever possible (i.e., if there is only one spot (NA) left for an ID in group 1 of "school.assignment", assign that ID to the next group with enough space, i.e., where the count of NA values in school.assignment$ID is >= the number of repeats for that ID).
My idea was to do a loop but the code below is not working:
########################################
for (i in 1:length(school.assignment$ID))
{
for (j in 1:length(unique(school.assignment$Group)))
{
school.assignment$ID[i]<-ifelse(sum(is.na(school.assignment$ID[i,j]))>=repeats[i],ID[i],NA)
}
}
Is there a way to do this loop while respecting my condition to assign IDs to only one group?
Thank you!
Consider using merge() to assign random group IDs to the data frame; there is no need for nested for loops. The code below creates a unique group data frame, assigns random numbers there, and then merges with school.assignment:
# CREATE UNIQUE GROUP DATA FRAME
Group <- unique(school.assignment$Group)
grp.ids <- as.data.frame(Group)
# CREATE RANDOM ID FIELD (THREE DIGITS BETWEEN 100 AND 999)
grp.ids$RandomID <- sample(100:999, size = nrow(grp.ids), replace = TRUE)
# MERGE DATA FRAMES
school.assignment <- merge(school.assignment, grp.ids, by="Group", all=TRUE)
# ASSIGN ID COLUMN
school.assignment$ID <- school.assignment$RandomID
# RESTRUCTURE FINAL DATA FRAME
school.assignment <- school.assignment[c("ID", "Group", "Assignment")]
OUTPUT
ID Group Assignment
977 1 1
977 1 2
977 1 3
977 1 4
368 2 1
368 2 2
Related
I have a data.frame with two groups and two variables (other than the group variable), for example:
set.seed(1729)
temp <- data.frame(group=c(1,2),value1=rnorm(12),value2=rnorm(12))
temp = temp[order(temp$group),]
# group value1 value2
# 1 0.21531616 0.08679615
# 1 1.08604925 0.36344973
# 1 1.04225410 0.53281840
# 1 -1.40843189 0.52096971
# 1 -0.07130541 -0.47550518
# 1 0.18839979 1.96241245
# 2 0.18374784 -0.64102941
# 2 0.02871298 -0.67746579
# 2 0.08826553 -0.32679060
# 2 0.05522136 -0.31371224
# 2 0.36086719 -0.10004339
# 2 -0.55618926 1.22760816
I calculate the quantiles of temp$value1 & temp$value2
qt1 = quantile(temp$value1,probs = c(0.25,0.5,0.75))
qt2 = quantile(temp$value2,probs = c(0.25,0.5,0.75))
For each group, I need
(1) the count of values of temp$value1 < qt1[1], (2) the count of values of temp$value1 < qt1[2], and (3) the count of values of temp$value1 < qt1[3]. Similarly, six values (two groups × three quantiles) of counts for temp$value2.
As code, it is (I manually copied each group + variable combination into a vector to illustrate what I want):
g1v1=c(0.21531616,1.08604925,1.04225410,-1.40843189,-0.07130541,0.18839979)
length(g1v1[g1v1<qt1[1]])
length(g1v1[g1v1<qt1[2]])
length(g1v1[g1v1<qt1[3]])
g2v1=c(0.18374784,0.02871298,0.08826553,0.05522136,0.36086719,-0.55618926)
length(g2v1[g2v1<qt1[1]])
length(g2v1[g2v1<qt1[2]])
length(g2v1[g2v1<qt1[3]])
g1v2=c(0.08679615,0.36344973,0.53281840,0.52096971,-0.47550518,1.96241245)
length(g1v2[g1v2<qt2[1]])
length(g1v2[g1v2<qt2[2]])
length(g1v2[g1v2<qt2[3]])
#similarly for g2v2
The output has to be a data.frame like:
# group value1.25.ct value1.50.ct value1.75.ct value2.25.ct value2.50.ct value2.75.ct
# 1 2 2 4 1 1 4
# 2 1 4 5 2 5 5
Please recommend an efficient approach. Thank you.
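One compact possibility is a sketch using base R's split() and sapply(). As in the question, the quantiles are computed over the whole column, not per group; the helper name counts and the output column names are illustrative choices, not from the original post:
# For each group, count how many values fall below each overall quantile
counts <- function(v, qt) sapply(qt, function(q) sum(v < q))
out <- data.frame(
  group = sort(unique(temp$group)),
  t(sapply(split(temp$value1, temp$group), counts, qt = qt1)),
  t(sapply(split(temp$value2, temp$group), counts, qt = qt2))
)
names(out)[-1] <- c(paste0("value1.", c(25, 50, 75), ".ct"),
                    paste0("value2.", c(25, 50, 75), ".ct"))
out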
I’m trying to figure out how to create bins with a minimum sample size that also accounts for values in a specific column.
So, in the dummy data below, I want to create bins with a minimum of 6 samples each, but if a bin includes a row with a given value in column a, the bin must also include all other rows with that same value. I also do not want any bin to contain only 1 unique value from column a. The output should then have, per bin, the mean of the unique values in column a, the mean of all values in column b, and a column with the sample size.
df<-data.frame(a=c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
b=c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,11,12,13,14,16,14,13,15,13,15))
I want the output to look something like this:
mean.a mean.b n
1 2.0 12.33333 9
2 5.0 12.83333 6
3 7.0 12.66667 6
4 8.5 13.28571 7
This is what I have so far:
x <- df
final <- NULL
for (i in 1:16) {
  x1 <- x[1:6, ]
  x2 <- x[-c(1:6), ]
  x3 <- rbind(x1, x2[x2$a == x1$a[6], ])
  n <- nrow(x3)
  y <- mean(x3$b)
  z <- mean(unique(x3$a))
  f <- data.frame(mean.a = z, mean.b = y, n = n)
  final <- rbind(final, f)
  x <- x[-c(1:n), ]
}
final <- final[complete.cases(final), ]
The problem I'm having is that I can't figure out how to avoid a bin with only one unique value in column a. For example, in the third bin, all 6 rows have a = 7, but I would like to add the next sequential row, plus all rows sharing that row's value in column a, to that bin (here, all rows with a = 8).
Also, I can't figure out how to keep the loop going without hard-coding 1:number at the top and then deleting the NA rows afterwards; this isn't a huge deal, but it's why the code is kind of messy.
I'm not attached to this loop by any means, and if there's a simpler way to answer this question, I'm all for it!
Here is a recursive solution, where get_6 returns a group variable based on column a. The conditions are checked by the inner get_i function: starting from index 6, it moves forward until it finds the next index whose value differs from the current one and where the number of unique values seen so far is not 1. Each time a sequence satisfying the condition is found, the id is increased by one, so the result resembles the output of the rleid function from data.table. From there, summary statistics can be calculated based on this group variable:
get_6 <- function(vec, id = 1) {
  if(length(vec) < 6) NULL               # fewer than 6 values left: drop them
  else {
    # find the index i (>= 6) where the current bin can end
    get_i <- function(x, i = 6) {
      if(length(x) == i) i                                         # end of vector
      else if(x[i + 1] != x[i] && length(unique(x[1:i])) != 1) i   # valid cut point
      else get_i(x, i + 1)
    }
    ind <- get_i(vec)
    c(rep(id, ind), get_6(vec[-(1:ind)], id + 1))  # label this bin, recurse on the rest
  }
}
s <- get_6(df$a)
s
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
library(dplyr)
df[1:length(s), ] %>%
mutate(g = s) %>% group_by(g) %>%
summarize(n = n(), mean.a = mean(unique(a)), mean.b = mean(b))
#Source: local data frame [4 x 4]
# g n mean.a mean.b
# <dbl> <int> <dbl> <dbl>
#1 1 9 2.0 12.33333
#2 2 6 5.0 12.83333
#3 3 9 7.5 12.44444
#4 4 7 9.5 14.28571
I have a series of repeated IDs that I would like to assign to groups of fixed size. The subject IDs repeat with different frequencies, for example:
# Example Data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
> head(Data)
ID Repeats
1 101 2
2 102 3
3 103 1
4 104 3
I would like the same repeated ID to stay within the same group. However, each group has a fixed capacity (say 3 only). For example, in my desired output matrix each group can only accommodate 3 IDs:
# Create empty data frame for group annotation
# Add 3 rows in order to have more space for IDs
# Some groups will have NAs due to keeping IDs together (I'm OK with that)
Target = data.frame(matrix(NA,nrow=(sum(Data$Repeats)+3),
ncol=dim(Data)[2]))
names(Target)<-c("ID","Group")
Target$Group<-rep(1:3)
Target$Group<-sort(Target$Group)
> head(Target)
ID Group
1 NA 1
2 NA 1
3 NA 1
4 NA 1
5 NA 2
6 NA 2
I can loop each ID into my Target data frame, but this does not guarantee that repeated IDs will stay together in the same group:
# Repeat each ID by its number of repeats
IDs.repeat = rep(Data$ID, times=Data$Repeats)
# loop IDs to Targets to assign IDs to groups
for (i in 1:length(IDs.repeat))
{
Target$ID[i]<-IDs.repeat[i]
}
In the loop above I get the same ID (102) across two different groups (1 and 2), which I would like to avoid:
> head(Target)
ID Group
1 101 1
2 101 1
3 102 1
4 102 1
5 102 2
6 103 2
Instead, I want the code to put NA if there is not enough space for that ID in that group:
> head(Target)
ID Group
1 101 1
2 101 1
3 NA 1
4 NA 1
5 102 2
6 102 2
Does anyone have a solution that keeps IDs within the same group, provided there is sufficient space, after assigning ID i?
I think I need a loop that counts the NAs within a group and checks whether that count is >= the number of repeats of that unique ID. However, I don't know how to implement both at once. Maybe nest another loop over the group index j?
Any help with the loop will be appreciated immensely!
Here's one solution,
## This is the data.frame I'll try to match
target <- data.frame(
ID = c(
rep(101, 2),
rep(102, 3),
rep(103, 1),
rep(104, 3)),
Group = c(
rep(1L, 6), # the "L" in 1L makes it an integer type rather than numeric
rep(2L, 3)
)
)
print(target)
## Your example data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
head(Data)
ids_to_group <- 3 # the number of ids per group is specified here.
Data$Group <- sort(
rep(1:ceiling(length(Data$ID) / ids_to_group),
ids_to_group))[1:length(Data$ID)]
# The do.call(rbind, lapply(X = a series, FUN = function(x) { }))
# pattern is a really useful way to stack data.frames.
# lapply is basically a fancy for-loop; check it out by sending
# ?lapply to the console (to view the help page).
output <- do.call(
rbind,
lapply(unique(Data$ID), FUN = function(ids) {
print(paste(ids, "done.")) # I like to put print statements to follow along
obs <- Data[Data$ID == ids, ]
data.frame(ID = rep(obs$ID, obs$Repeats))
})
)
output <- merge(output, Data[,c("ID", "Group")], by = "ID")
identical(target, output) # returns TRUE if they're equivalent
# For example inspect each with:
str(target)
str(output)
There is a data set, A, like:
id grade
1 10
2 20
3 30
4 40
In addition, there is another index data set, B, like:
id
2
3
I would like to extract the subset of A based on B; the result should look like:
id grade
2 20
3 30
Here's a data.table solution. This will be much faster if your dataset A is large, or if you have to do this a large number of times.
set.seed(1) # for reproducible example
A <- data.frame(id=1:1e6,grade=10*(1:1e6)) # 1,000,000 rows
B <- data.frame(id=sample(1:1e6,1000)) # random sample of 1000 ids
library(data.table)
setDT(A) # convert A to a data.table
setkey(A,id) # set the key
result <- A[J(B$id)] # extract records based on id
In this example data.table is about 20 times faster than either %in% or merge(...).
Note also that while all three retrieve the same records, they are not necessarily in the same order.
A$id %in% B$id
creates a logical vector the length of A$id, whose elements are TRUE if the corresponding element is found in B$id, and then uses that vector to subset A. So the records in the result are in the same order as A.
merge(A,B)
sorts the result by the common column (id), so the result is sorted by increasing value of id. In your example and this example, these first two are the same.
A[J(B$id)]
returns a result ordered as B$id (which is random in this example, but would be the same as the other two approaches in your example).
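A quick sanity check along these lines (a sketch reusing the objects from the timing example above; the res_* names are illustrative):
res_in    <- A[A$id %in% B$id, ]          # ordered as in A
res_merge <- merge(A, B)                  # sorted by id
res_dt    <- A[J(B$id)]                   # ordered as B$id
setequal(res_in$id, res_merge$id)         # TRUE: same set of records
setequal(res_in$id, res_dt$id)            # TRUE
identical(sort(res_dt$id), res_merge$id)  # TRUE once reordered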
Try this:
> x <- data.frame(id=1:4, grade=(1:4)*10)
> x
id grade
1 1 10
2 2 20
3 3 30
4 4 40
> id <- 2:3
> x[ x$id %in% id, ]
id grade
2 2 20
3 3 30
Alternatively you can also:
> id <- data.frame(id=2:3)
> merge(x, id)
id grade
1 2 20
2 3 30
I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[, c("group1","group2","value")],
            .(group1, group2),
            choice,
            label = "test")
Note that in this case, I am also adding an extra column (named "state") to the data frame, whose value comes from the extra label argument passed to the ddply function. However, I killed this after about 20 minutes.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from Python to R for exploratory data analysis, but this type of aggregation is crucial for me. In Python, I can perform these operations very rapidly, but it is inconvenient, as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance comparison: data.table is ~15x faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[, list(value = value[sample(length(value), 1)]), 'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) x[sample(nrow(x), 1), ]))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample withing groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique(c(  # the `c` function will make a matrix into a vector
  tapply(idx, list(group1, group2),
         function(x) resample(x, 1)))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)], ]
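For completeness, a sketch of a more recent idiom (an assumption: it requires dplyr >= 1.0, which provides slice_sample; run on the df from the question):
library(dplyr)
df %>%
  group_by(group1, group2) %>%
  slice_sample(n = 1) %>%
  ungroup()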