generate the subset based on a given index set - r

There is a data set, A, like:
id grade
1 10
2 20
3 30
4 40
In addition, there is another index data set, B, like:
id
2
3
I would like to extract the subset of A based on B; the result should look like:
id grade
2 20
3 30

Here's a data.table solution. This will be much faster if your dataset A is large, or if you have to do this a large number of times.
set.seed(1) # for reproducible example
A <- data.frame(id=1:1e6,grade=10*(1:1e6)) # 1,000,000 rows
B <- data.frame(id=sample(1:1e6,1000)) # random sample of 1000 ids
library(data.table)
setDT(A) # convert A to a data.table
setkey(A,id) # set the key
result <- A[J(B$id)] # extract records based on id
In this example, data.table is about 20 times faster than either %in% or merge(...).
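If you want to verify the timing claim on your own machine, something like the following should work (just a sketch; it assumes the microbenchmark package is installed and uses the large A and B created above):
library(microbenchmark)
microbenchmark(
  datatable = A[J(B$id)],
  inop      = A[A$id %in% B$id, ],
  merge     = merge(A, B),
  times     = 10
)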
Note also that while all three retrieve the same records, they are not necessarily in the same order.
A$id %in% B$id
creates a logical vector the length of A$id, whose elements are TRUE if the corresponding element of A$id is found in B$id, and then uses that to subset A. So the records in the result are in the same order as in A.
merge(A,B)
sorts the result by the common column (id), so the result is sorted by increasing value of id. In your example and this example, these first two are the same.
A[J(B$id)]
returns a result ordered as B$id (which is random in this example, but would be the same as the other two approaches in your example).
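To make the ordering difference concrete, here is a minimal sketch on the question's small data set (object names follow the question; the deliberately out-of-order B is my own illustration):
A <- data.frame(id = 1:4, grade = c(10, 20, 30, 40))
B <- data.frame(id = c(3, 2))   # deliberately out of order
A[A$id %in% B$id, ]             # rows come back in A's order: id 2, then 3
merge(A, B)                     # sorted by the common column:  id 2, then 3
library(data.table)
setDT(A); setkey(A, id)
A[J(B$id)]                      # rows come back in B's order:  id 3, then 2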

Try this:
> x <- data.frame(id=1:4, grade=(1:4)*10)
> x
id grade
1 1 10
2 2 20
3 3 30
4 4 40
> id <- 2:3
> x[ x$id %in% id, ]
id grade
2 2 20
3 3 30
Alternatively, you can use merge():
> id <- data.frame(id=2:3)
> merge(x, id)
id grade
1 2 20
2 3 30

Related

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of that group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
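As a quick sanity check (illustrative only), tabulating the names of the sampled columns should show min(7, group size) columns per group, i.e. 7 A's, 7 B's and all 6 C's here:
table(names(dframe)[keep])
# A B C
# 7 7 6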
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)), function(x) {
                 cols <- which(colnames(dframe) == x)
                 dframe[, if (length(cols) <= nc) cols else sample(cols, nc, replace = FALSE)]
               }))
It might look complicated, but it really just takes all columns of a group if there are no more than nc of them, and samples nc random columns if there are more.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('.[[:digit:]]','',colnames(res))

Same function over multiple data frames in R - not over a list of data frames

This question is almost what I want to do, except that the output is given as a list of data frames. Let's reproduce the example of the SE question mentioned above.
Let's say I have 2 data frames:
df1
ID col1 col2
x 0 10
y 10 20
z 20 30
df2
ID col1 col2
a 0 10
b 10 20
c 20 30
What I want is a 4th column with an ifelse result. My rationale is:
if col1 >= 20 in any data frame whose name matches the pattern "df", then the new column res = 1, else res = 0.
But I want to create the new column in each data frame with the same name pattern, not put all of those data frames in a list and apply the function, unless I could "extract" each element of that list back into an individual data frame.
Thanks
Per @Frank... if my understanding of what you are looking for is correct, consider using data.table. MWE:
library(data.table);
addcol <- function(x) x[,res:=ifelse(col1>=20,1,0)]
df1 <- data.table(ID=c("x","y","z"),col1=c(0,10,20),col2=c(10,20,30))
df2 <- data.table(ID=c("x","y","z"),col1=c(20,10,20),col2=c(10,20,30))
#modified df2 so you can see different effects
lapply(list(df1,df2),addcol)
> df1
ID col1 col2 res
1: x 0 10 0
2: y 10 20 0
3: z 20 30 1
> df2
ID col1 col2 res
1: x 20 10 1
2: y 10 20 0
3: z 20 30 1
This works because data.table operates by reference on tables, so inside the function you're actually updating the underlying table, not only the scoped reference to the table.
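If the data.tables really are free-standing objects whose names match the pattern "df", as in the question, one hedged way to avoid building the list by hand is to look them up by name. This sketch assumes only the addcol() helper above plus base R's ls()/mget(), run at the top level so they see the objects in the global environment:
# find all objects named df1, df2, ... and update each by reference;
# because := modifies in place, df1 and df2 themselves gain the res column
invisible(lapply(mget(ls(pattern = "^df[0-9]+$")), addcol))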

R loops: conditioning a loop in R

Thanks for the feedback; below is a reproducible example with my desired output:
# Example Data where I would like my output
N=24
school.assignment = matrix(NA, ncol = 3, nrow = N)
school.assignment = as.data.frame(school.assignment)
colnames(school.assignment) <- c("ID","Group","Assignment")
# Number of groups and assignments per group
groups = 6
Assignment = 4
school.assignment$Group<-rep(1:groups,Assignment)
school.assignment$Group<- sort(school.assignment$Group)
school.assignment$Assignment<-rep(1:Assignment)
# IDs with number of repeats (i.e repeated students)
Data = matrix(0, ncol = 2, nrow = N/2) # ~half with repeated samples
Data = as.data.frame(Data)
colnames(Data) <- c("ID","Repeats")
Data$ID <-1:(N/2)
length(unique(Data$ID)) # unique IDS
ID=rep(seq(1:8),3)
# Generate random repeats for each ID
Data$Repeats <- {
  set.seed(55)
  sapply(1:(N/2), function(x) sample(1:5, 1))
}
Data=Data[-1,] #take out first row to match N=24
sum(Data$Repeats) # 24 total IDs for all assignments
# List of IDs at random to use
IDs <- vector("list", dim(Data)[1])
for (i in 1:dim(Data)[1]) {
  IDs[[i]] <- rep(Data$ID[i], times = Data$Repeats[i])
}
head(IDs)
# Object with number of repeated IDs
sample.per.ID <- vector("list", length(IDs))
for (i in 1:length(IDs)) {
  sample.per.ID[[i]] <- length(IDs[[i]])
}
sum=sum(as.data.frame(sample.per.ID)); sum # 24 total IDs (including repeats)
## Unlist vector with random sequence of samples
SRS.ID.order = unlist(IDs) # order of IDs with repeats
for (i in 1:length(SRS.ID.order)) {
  school.assignment$ID[i] <- SRS.ID.order[i]
}
My last loop is where I attempt to assign IDs to school.assignment$ID. However, as you can see, some IDs cross different groups, and I want to condition the ID assignment from SRS.ID.order so that each ID stays within a single group (i.e. a constant school.assignment$Group). Below you can see that this is not the case; for example, ID 4 is in groups 1 and 2:
> head(school.assignment)
ID Group Assignment
1 2 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 4 2 1
6 4 2 2
I would like the loop not to assign any ID (i.e. leave NA) in a group if the next ID's run of repeats is longer than the space left in that group:
ID Group Assignment
1 2 1 1
2 2 1 2
3 3 1 3
4 NA 1 4
5 4 2 1
6 4 2 2
I was thinking that I need some type of indicator for the j group, like the code below:
########################################
for (i in 1:length(school.assignment$ID)) {
  for (j in 1:length(unique(school.assignment$Group))) {
    school.assignment$ID[i] <- ifelse(sum(is.na(school.assignment$ID[i,j])) >= sample.per.ID[i],
                                      SRS.ID.order[i], NA)
  }
}
Error in school.assignment$ID[i, j] : incorrect number of dimensions
Any help is very much appreciated!
Thanks
OLD POST
I'm currently trying to do a loop in R with a condition. My data structure is below:
> head(school.assignment)
ID Group Assignment
1 NA 1 1
2 NA 1 2
3 NA 1 3
4 NA 1 4
5 NA 2 1
6 NA 2 2
I would like to assign a vector of IDs, of the same length as school.assignment, to the ID variable shown below:
head(IDs)
[1] 519 519 519 343 251 251...
Not all IDs repeat the same number of times: some repeat 1, 2, or even 3 times, as shown above. I have an object with the number of repeats per ID, for example:
> head(repeats)
[1] 3 1 2...
Indicating that ID=519 repeats 3 times, ID=343 only once, and ID=251 twice, etc.
There is one condition that I would like to meet:
1) I would like every single ID to stay in the same group whenever possible (i.e. if there is only one spot (NA) left for an ID in the "school.assignment" object for group 1, then assign that ID to the next group where there will be enough space, i.e. where the number of NAs in school.assignment$ID is >= the number of repeats for that ID).
My idea was to do a loop but the code below is not working:
########################################
for (i in 1:length(school.assignment$ID)) {
  for (j in 1:length(unique(school.assignment$Group))) {
    school.assignment$ID[i] <- ifelse(sum(is.na(school.assignment$ID[i,j])) >= repeats[i],
                                      ID[i], NA)
  }
}
Is there a way to do this loop while respecting my condition to assign IDs to only one group?
Thank you!
Consider using merge() to assign random group IDs to the data frame. No need for nested for loops. The code below creates a unique group data frame, assigns random numbers there, and then merges it with school.assignment:
# CREATE UNIQUE GROUP DATA FRAME
Group <- unique(school.assignment$Group)
grp.ids <- as.data.frame(Group)
# CREATE RANDOM ID FIELD (THREE DIGITS BETWEEN 100 AND 999)
grp.ids$RandomID <- sample(100:999, size = nrow(grp.ids), replace = TRUE)
# MERGE DATA FRAMES
school.assignment <- merge(school.assignment, grp.ids, by="Group", all=TRUE)
# ASSIGN ID COLUMN
school.assignment$ID <- school.assignment$RandomID
# RESTRUCTURE FINAL DATA FRAME
school.assignment <- school.assignment[c("ID", "Group", "Assignment")]
OUTPUT
ID Group Assignment
977 1 1
977 1 2
977 1 3
977 1 4
368 2 1
368 2 2
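A small illustrative check (names follow the answer's code): after the merge, every group should carry exactly one ID, so each element below should be 1.
tapply(school.assignment$ID, school.assignment$Group, function(x) length(unique(x)))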

R Loop with conditions

I have a series of repeated IDs that I would like to assign to groups of fixed size. The subject IDs repeat with different frequencies, for example:
# Example Data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
> head(Data)
ID Repeats
1 101 2
2 102 3
3 103 1
4 104 3
I would like the same repeated ID to stay within the same group. However, each group has a fixed capacity (say 3 only). For example, in my desired output matrix each group can only accommodate 3 IDs:
# Create empty data frame for group annotation
# Add 3 rows in order to have more space for IDs
# Some groups will have NAs due to keeping IDs together (I'm OK with that)
Target = data.frame(matrix(NA, nrow = sum(Data$Repeats) + 3,
                           ncol = dim(Data)[2]))
names(Target)<-c("ID","Group")
Target$Group<-rep(1:3)
Target$Group<-sort(Target$Group)
> head(Target)
ID Group
1 NA 1
2 NA 1
3 NA 1
4 NA 1
5 NA 2
6 NA 2
I can loop each ID to my Target data frame but this does not guarantee that repeated IDs will stay together in the same group:
# Loop repeated IDs into the groups
IDs.repeat = rep(Data$ID, times=Data$Repeats)
# loop IDs to Targets to assign IDs to groups
for (i in 1:length(IDs.repeat)) {
  Target$ID[i] <- IDs.repeat[i]
}
In the loop above I get the same ID (102) across two different groups (1 and 2), which I would like to avoid:
> head(Target)
ID Group
1 101 1
2 101 1
3 102 1
4 102 1
5 102 2
6 103 2
Instead, I want the code to put NA if there is not enough space left for that ID in the group, so the output looks like this:
> head(Target)
ID Group
1 101 1
2 101 1
3 NA 1
4 NA 1
5 102 2
6 102 2
Does anyone have a solution that keeps each ID within the same group when there is sufficient space left after assigning ID i?
I think I need a loop that counts the NAs within a group and checks whether that count is >= the number of repeats of the current ID, but I don't know how to implement this. Maybe by nesting another loop over the j groups?
Any help with the loop will be appreciated immensely!
Here's one solution:
## This is the data.frame I'll try to match
target <- data.frame(
  ID = c(
    rep(101, 2),
    rep(102, 3),
    rep(103, 1),
    rep(104, 3)),
  Group = c(
    rep(1L, 6),  # the "L" in 1L makes it an integer rather than numeric
    rep(2L, 3)
  )
)
print(target)
## Your example data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
head(Data)
ids_to_group <- 3  # the number of IDs per group is specified here
Data$Group <- sort(
  rep(1:ceiling(length(Data$ID) / ids_to_group),
      ids_to_group))[1:length(Data$ID)]
# The do.call(rbind, lapply(X = a series, FUN = function(x) { ... }))
# pattern is a really useful way to stack data.frames.
# lapply is basically a fancy for-loop; check it out by sending
# ?lapply to the console (to view the help page).
output <- do.call(
  rbind,
  lapply(unique(Data$ID), FUN = function(ids) {
    print(paste(ids, "done."))  # I like to put print statements in to follow along
    obs <- Data[Data$ID == ids, ]
    data.frame(ID = rep(obs$ID, obs$Repeats))
  })
)
output <- merge(output, Data[,c("ID", "Group")], by = "ID")
identical(target, output) # returns TRUE if they're equivalent
# For example inspect each with:
str(target)
str(output)
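An optional sanity check (a sketch using only the objects above): if no ID is split across groups, each element below should be 1.
tapply(output$Group, output$ID, function(g) length(unique(g)))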

In R, how to sum certain rows of a data frame with certain logic?

Hi experienced R users,
It's kind of a simple thing.
I want to sum x by Group.1 depending on one controllable variable.
I'd like to sum x by grouping the first two rows when I say something like: number <- 2
If I say 3, it should sum x of the first three rows by Group.1
Any idea how I might tackle this problem? Should I write a function?
Thank y'all in advance.
Group.1 Group.2 x
1 1 Eggs 230299
2 2 Eggs 263066
3 3 Eggs 266504
4 4 Eggs 177196
If the sums you want are always cumulative, there's a function for that, cumsum. It works like this.
> cumsum(c(1,2,3))
[1] 1 3 6
In this case you might want something like
> mysum <- cumsum(yourdata$x)
> mysum[2] # the sum of the first two rows
> mysum[3] # the sum of the first three rows
> mysum[number] # the sum of the first "number" rows
Assuming your data is in mydata:
with(mydata, sum(x[Group.1 <= 2]))
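A parameterized variant of the same idea (a sketch; number is the question's controllable variable and mydata is assumed as above):
number <- 2
with(mydata, sum(x[Group.1 <= number]))  # sum of x over the first "number" groups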
You could use the by function.
For instance, given the following data.frame:
d <- data.frame(Group.1=c(1,1,2,1,3,3,1,3),Group.2=c('Eggs'),x=1:8)
> d
Group.1 Group.2 x
1 1 Eggs 1
2 1 Eggs 2
3 2 Eggs 3
4 1 Eggs 4
5 3 Eggs 5
6 3 Eggs 6
7 1 Eggs 7
8 3 Eggs 8
You can do this:
num <- 3 # sum only the first 3 rows
# The aggregation function:
# it is called for each group receiving the
# data.frame subset as input and returns the aggregated row
innerFunc <- function(subDf) {
  # we create the aggregated row by taking the first row of the subset
  row <- head(subDf, 1)
  # we set the x column in the result row to the sum of the first "num"
  # elements of the subset
  row$x <- sum(head(subDf$x, num))
  return(row)
}
# Here we call the "by" function:
# it returns an object of class "by" that is a list of the resulting
# aggregated rows; we want to convert it to a data.frame, so we call
# rbind repeatedly by using "do.call(rbind, ... )"
d2 <- do.call(rbind,by(data=d,INDICES=d$Group.1,FUN=innerFunc))
> d2
Group.1 Group.2 x
1 1 Eggs 7
2 2 Eggs 3
3 3 Eggs 19
If you want to sum only a subset of your data:
my_data <- data.frame(c("TRUE","FALSE","FALSE","FALSE","TRUE"), c(1,2,3,4,5))
names(my_data)[1] <- "DESCRIPTION" #Change Column Name
names(my_data)[2] <- "NUMBER" #Change Column Name
sum(subset(my_data, my_data$DESCRIPTION=="TRUE")$NUMBER)
You should get 6.
Not sure why Eggs are important here ;)
df1 <- data.frame(Gr = seq(4),
                  x = c(230299, 263066, 266504, 177196))
Now, with n = 2, i.e. the first two rows:
n <- 2
sum(df1[, "x"][df1[, "Gr"]<=n])
The expression df1[, "Gr"] <= n creates a logical vector that is used to subset the elements of df1[, "x"] before summing them.
Also, it appears your Group.1 is the same as the row number. If so, this may be simpler:
sum(df1[, "x"][1:n])
or to get all at once
cumsum(df1[, "x"])
