split dataframe cumulatively by variable level - r

With a df like this:
x=data.frame(id=c(1,1,1,2,2,2,3,3,3), val=c(1,2,3,2,3,4,1,3,0))
I want to get output like this:
[[1]]
id val
1 1 1
2 1 2
3 1 3
[[2]]
id val
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 2 4
[[3]]
id val
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 2 4
7 3 1
8 3 3
9 3 0
where the df is split into a list of as many dataframes as there are levels of the splitting variable, i.e. id. Each dataframe should start at the first level and include all rows up to each successive level.
I can do this with a loop:
out<-NULL
for(i in 1:3){
out[[i]] <- x[x$id<=i,]
}
out
However, is there a simpler method using e.g. split that I am overlooking? Ideally a one liner.

You can do this in base R with split and Reduce, using the accumulate=TRUE argument. split divides the data.frame into a list of data.frames by id. Reduce applies rbind across the list elements, and accumulate=TRUE keeps every intermediate result, so the data.frames are combined successively.
Reduce(rbind, split(x, x$id), accumulate=TRUE)
[[1]]
id val
1 1 1
2 1 2
3 1 3
[[2]]
id val
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 2 4
[[3]]
id val
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 2 4
7 3 1
8 3 3
9 3 0
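If the ids are not consecutive integers, the loop from the question (x$id<=i) no longer applies directly. A minimal sketch of one way around this is to accumulate over the unique id values instead. The helper name cumulative_split is hypothetical, and the data frame is the one from the question. Unlike split, it follows the order in which the ids first appear rather than their sorted order.

```r
x <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                val = c(1, 2, 3, 2, 3, 4, 1, 3, 0))

# Hypothetical helper: one data frame per id value, each containing
# all rows whose id is among the values seen so far.
cumulative_split <- function(df, key) {
  ids <- unique(df[[key]])
  lapply(seq_along(ids), function(i) df[df[[key]] %in% ids[seq_len(i)], ])
}

res <- cumulative_split(x, "id")
sapply(res, nrow)  # row counts of the cumulative pieces
```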

Related

rep and/or seq function to create continuously reducing vector?

Suppose I have a vector from 1 to 5,
a<-c(1:5)
What I need to do is to repeat the vector by losing one element continuously. That is, the final outcome should be like
1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
We can reverse the vector and apply sequence
sequence(rev(a))
#[1] 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
Or another option is toeplitz
m1 <- toeplitz(a)
m1[lower.tri(m1, diag=TRUE)]
#[1] 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1
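The sequence approach works here because the values of a double as the lengths of the runs. For a vector with arbitrary values, one possible sketch is to build the positions with sequence and then index into the vector. The values in a below are made up for illustration.

```r
a <- c(10, 20, 30, 40)

# sequence(n:1) yields the positions 1..n, 1..(n-1), ..., 1;
# indexing a with them repeats a while dropping its last element each time.
idx <- sequence(rev(seq_along(a)))
out <- a[idx]
out
```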

Select rows of data frame based on a vector with duplicated values

What I want can be described as follows: I have a data frame that contains all the case-control pairs. In the following example, y is the id of the case-control pair. There are 3 pairs in my data set. I'm resampling with respect to the values of y (a pair is either selected as a whole or not at all).
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
> select_y
[1] 1 3 3
Now, I have computed a vector containing the pairs I want to resample, which is select_y above. It means that case-control pair number 1 will be in my new sample, and pair number 3 will also be in my new sample, but it will occur twice since there are two 3s. The desired output is:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find an efficient way other than writing a for loop...
Solution:
Based on @HubertL's answer, with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
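The split-based lookup above can be wrapped into a small function. This is only a sketch: the name resample_pairs is hypothetical, and it assumes the pair id lives in a column called y, as in the question.

```r
sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
select_y  <- c(1, 3, 3)

# Hypothetical helper: look up the row numbers of each requested pair
# and repeat them as often as the pair id appears in ids.
resample_pairs <- function(df, ids) {
  rows_by_pair <- split(seq_len(nrow(df)), df$y)
  df[unlist(rows_by_pair[as.character(ids)], use.names = FALSE), ]
}

ans <- resample_pairs(sample_df, select_y)
ans$x  # x values of the resampled rows, in the order given by select_y
```

Because the row numbers are looked up by name, the result follows the order of select_y, and a pair id that occurs several times contributes its rows several times.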
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
lapply(1:max(sel_y$Freq),
function(freq) sample_df[sample_df$y %in%
sel_y[sel_y$Freq>=freq, "select_y"],]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3

changing values in dataframe in R based on criteria

I have a data frame that looks like
> mydata
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I have some code that counts the number of observations per ID, determines which IDs have a number of observations that meets a certain criterion (in this case, >=3 observations), and returns a vector with these IDs:
> vals
[1] 1 3
Now I want to manipulate the X values associated with these IDs, e.g. by adding 1 to each value, giving a data frame like this:
> mydata
ID Observation X
1 1 4
1 2 4
1 3 4
1 4 4
2 1 4
2 2 4
3 1 9
3 2 9
3 3 9
I'm pretty new to R and am uncertain how I might do this. It might help to know that X is constant for each ID.
The call mydata$ID %in% vals returns TRUE or FALSE to indicate whether the ID value for each row is in the vals vector. When you add this to the data currently in mydata$X, the TRUE and FALSE are converted to 1 and 0, respectively, yielding the desired result:
mydata$X <- mydata$X + mydata$ID %in% vals
# mydata
# ID Observation X
# 1 1 1 4
# 2 1 2 4
# 3 1 3 4
# 4 1 4 4
# 5 2 1 4
# 6 2 2 4
# 7 3 1 9
# 8 3 2 9
# 9 3 3 9
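Adding the logical vector works because the increment is exactly 1. For other changes, a sketch using logical indexing might look like the following. The way vals is derived here (via table) is an assumption, since the question does not show that code.

```r
mydata <- data.frame(
  ID          = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Observation = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
  X           = c(3, 3, 3, 3, 4, 4, 8, 8, 8)
)

# Assumed way of deriving vals: IDs with at least 3 observations.
counts <- table(mydata$ID)
vals   <- as.numeric(names(counts)[counts >= 3])

# Logical indexing changes only the matching rows, so the right-hand
# side can be any transformation, not just "+ 1".
hit <- mydata$ID %in% vals
mydata$X[hit] <- mydata$X[hit] + 1
mydata$X
```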

Generating large drawing lists in R

Say I have a list in R like so,
[1] 3 5 4 7
And I want to generate all "drawings" from this list, from 1 up to the value of each number. For example,
1 1 1 1
1 1 1 2
1 1 1 3
...
2 3 3 1
2 3 3 2
2 3 3 3
...
3 5 4 7
I know I have used rep() in the past to do something very similar, which works for lists of 2 or 3 numbers (i.e. something like 1 4 5), but I'm not sure how to generalize this here.
Thoughts?
As suggested in the comments, use the Map function to apply seq to each element of your vector, then use expand.grid to generate a data.frame holding the Cartesian product of the resulting sequences:
head(expand.grid(Map(seq,c(3,5,4,7))))
Var1 Var2 Var3 Var4
1 1 1 1 1
2 2 1 1 1
3 3 1 1 1
4 1 2 1 1
5 2 2 1 1
6 3 2 1 1
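As a quick sanity check on the approach above, the number of rows should equal the product of the upper limits. The sketch below uses lapply with seq_len in place of Map with seq, which is an equivalent substitution for positive integers. The variable names limits and draws are chosen here for illustration.

```r
limits <- c(3, 5, 4, 7)

# seq_len(n) gives 1..n for each limit; expand.grid then forms every combination.
draws <- expand.grid(lapply(limits, seq_len))

nrow(draws)            # number of drawings, equal to prod(limits)
draws[nrow(draws), ]   # the last row holds the upper limits
```

Since the row count grows as the product of the limits, long vectors with large values can exhaust memory quickly.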

Combining an individual and aggregate level data sets

I've got two different data frames, let's call them "Months" and "People".
Months looks like this:
Month Site X
1 1 4
2 1 3
3 1 5
1 2 10
2 2 7
3 2 5
and People looks like this:
ID Month Site
1 1 1
2 1 2
3 1 1
4 2 2
5 2 2
6 2 2
7 3 1
8 3 2
I'd like to combine them so essentially each time an entry in "People" has a particular Month and Site combination, it's added to the appropriate aggregated data frame, so I'd get something like the following:
Month Site X People
1 1 4 2
2 1 3 0
3 1 5 1
1 2 10 1
2 2 7 3
3 2 5 1
But I haven't the foggiest idea of how to go about doing that. Any suggestions?
Using base packages
> aggdata <- aggregate( ID ~ Month + Site, data=People, FUN = length )
> aggdata
Month Site ID
1 1 1 2
2 3 1 1
3 1 2 1
4 2 2 3
5 3 2 1
> res <- merge(Months, aggdata, all.x = TRUE)
> res
Month Site X ID
1 1 1 4 2
2 1 2 10 1
3 2 1 3 NA
4 2 2 7 3
5 3 1 5 1
6 3 2 5 1
> res[is.na(res)] <- 0
> res
Month Site X ID
1 1 1 4 2
2 1 2 10 1
3 2 1 3 0
4 2 2 7 3
5 3 1 5 1
6 3 2 5 1
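For reference, the base R steps above can be put together as one self-contained sketch. The data frames are rebuilt from the tables in the question, and the count column is renamed to People to match the desired output. The renaming step is an addition, not part of the answer above.

```r
Months <- data.frame(Month = c(1, 2, 3, 1, 2, 3),
                     Site  = c(1, 1, 1, 2, 2, 2),
                     X     = c(4, 3, 5, 10, 7, 5))
People <- data.frame(ID    = 1:8,
                     Month = c(1, 1, 1, 2, 2, 2, 3, 3),
                     Site  = c(1, 2, 1, 2, 2, 2, 1, 2))

# Count people per Month/Site combination.
aggdata <- aggregate(ID ~ Month + Site, data = People, FUN = length)
names(aggdata)[names(aggdata) == "ID"] <- "People"

# Left join onto Months; combinations with no people come back as NA.
res <- merge(Months, aggdata, all.x = TRUE)
res$People[is.na(res$People)] <- 0
res
```

Replacing NA only in the People column avoids overwriting genuine missing values in other columns, which res[is.na(res)] <- 0 would do.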
Assuming your data.frames are months and people, here's a data.table solution:
require(data.table)
m.dt <- data.table(months, key=c("Month", "Site"))
p.dt <- data.table(people, key=c("Month", "Site"))
# one-liner (data.table 1.9.4 and later need by=.EACHI to group by each row of m.dt)
dt.f <- p.dt[m.dt, list(X=X[1], People=sum(!is.na(ID))), by=.EACHI]
> dt.f
# Month Site X People
# 1: 1 1 4 2
# 2: 1 2 10 1
# 3: 2 1 3 0
# 4: 2 2 7 3
# 5: 3 1 5 1
# 6: 3 2 5 1
