My actual dataset is composed of repeated measurements for each id, where the number of measurements can vary across individuals. A simplified example is:
dat <- data.frame(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L))
dat
## id
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 2
## 8 2
## 9 3
## 10 3
## 11 3
I am trying to number the rows of dat sequentially within each id. The result should be:
dat
## id s
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 2 1
## 8 2 2
## 9 3 1
## 10 3 2
## 11 3 3
How would you do that? I tried selecting the last row of each id with duplicated(), but that is probably not the right way, since it operates on the whole column at once.
Use ave(). The first argument is the vector you apply the function to; the remaining arguments are your grouping variables, and FUN is the function you want to apply. See ?ave for more details.
transform(dat, s = ave(id, id, FUN = seq_along))
# id s
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 1 6
# 7 2 1
# 8 2 2
# 9 3 1
# 10 3 2
# 11 3 3
If you have a large dataset or are using the data.table package, you can make use of ".N" as follows:
library(data.table)
DT <- data.table(dat)
DT[, s := 1:.N, by = "id"]
## Or
## DT[, s := sequence(.N), id][]
Or, you can use rowid, like this:
library(data.table)
setDT(dat)[, s := rowid(id)][]
# id s
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 1 5
# 6: 1 6
# 7: 2 1
# 8: 2 2
# 9: 3 1
# 10: 3 2
# 11: 3 3
For completeness, here's the "tidyverse" approach:
library(tidyverse)
dat %>%
  group_by(id) %>%
  mutate(s = row_number(id))
## # A tibble: 11 x 2
## # Groups: id [3]
## id s
## <int> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 2 1
## 8 2 2
## 9 3 1
## 10 3 2
## 11 3 3
Another base R option is sequence() applied to the run lengths from rle(), recreating the data first:
dat <- read.table(text = "
id
1
1
1
1
1
1
2
2
3
3
3",
header=TRUE)
data.frame(
id = dat$id,
s = sequence(rle(dat$id)$lengths)
)
Gives:
id s
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 2 1
8 2 2
9 3 1
10 3 2
11 3 3
Using tapply(), though not as elegant as ave():
cbind(dat$id, unlist(tapply(dat$id, dat$id, seq_along)))
[,1] [,2]
11 1 1
12 1 2
13 1 3
14 1 4
15 1 5
16 1 6
21 2 1
22 2 2
31 3 1
32 3 2
33 3 3
I have a list containing 500,000 data frames. I want to split this list into two (zero and nonzero) based on a column value (value == 0 vs. value > 0). How do I do so?
I only know how to split a single data frame into multiple frames; I am not sure whether the same approach applies to splitting a list of data frames.
We can either split each frame (by abs(value) > 0) and then combine the results, or combine the frames and then split afterwards. I'll explore the second option here; a sketch of the first option is at the end of this answer.
Sample data, a list of 4 frames:
set.seed(2022)
frames <- replicate(4, data.frame(id=1:4, value=sample(0:1, size=4, replace=TRUE)), simplify = FALSE)
frames[[1]]
# id value
# 1 1 1
# 2 2 0
# 3 3 1
# 4 4 0
dplyr
library(dplyr)
bind_rows(frames, .id = "elem") %>%
  split(.$value > 0)
# $`FALSE`
# elem id value
# 2 1 2 0
# 4 1 4 0
# 5 2 1 0
# 8 2 4 0
# 11 3 3 0
# 12 3 4 0
# 15 4 3 0
# 16 4 4 0
# $`TRUE`
# elem id value
# 1 1 1 1
# 3 1 3 1
# 6 2 2 1
# 7 2 3 1
# 9 3 1 1
# 10 3 2 1
# 13 4 1 1
# 14 4 2 1
The .id="elem" is in case you want to know from which element of the original list each row was derived.
base R
tmp <- do.call(rbind, lapply(seq_along(frames), function(i) transform(frames[[i]], elem = i)))
split(tmp, tmp$value > 0)
# $`FALSE`
# id value elem
# 2 2 0 1
# 4 4 0 1
# 5 1 0 2
# 8 4 0 2
# 11 3 0 3
# 12 4 0 3
# 15 3 0 4
# 16 4 0 4
# $`TRUE`
# id value elem
# 1 1 1 1
# 3 3 1 1
# 6 2 1 2
# 7 3 1 2
# 9 1 1 3
# 10 2 1 3
# 13 1 1 4
# 14 2 1 4
data.table
library(data.table)
rbindlist(frames, idcol = "elem")[, list(split(.SD, value > 0))]$V1
# [[1]]
# elem id value
# <int> <int> <int>
# 1: 1 2 0
# 2: 1 4 0
# 3: 2 1 0
# 4: 2 4 0
# 5: 3 3 0
# 6: 3 4 0
# 7: 4 3 0
# 8: 4 4 0
# [[2]]
# elem id value
# <int> <int> <int>
# 1: 1 1 1
# 2: 1 3 1
# 3: 2 2 1
# 4: 2 3 1
# 5: 3 1 1
# 6: 3 2 1
# 7: 4 1 1
# 8: 4 2 1
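For completeness, a minimal sketch of the first option mentioned at the top of this answer (split each frame individually, then stack the matching pieces); pieces, zeros and nonzero are just illustrative names, and it assumes every frame has a value column:
# split every frame into its zero / nonzero rows, then stack the matching pieces
pieces  <- lapply(frames, function(d) split(d, d$value > 0))
zeros   <- do.call(rbind, lapply(pieces, `[[`, "FALSE"))  # rows with value == 0
nonzero <- do.call(rbind, lapply(pieces, `[[`, "TRUE"))   # rows with value > 0
With 500,000 frames, the combine-then-split answers above are likely faster, since do.call(rbind, ...) over that many pieces gets slow.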
I am trying to create a subset that keeps the first row of each run of repeated values in a column. I tried to use:
df %>% group_by(x) %>% slice_head(n = 1)
But it only keeps the first occurrence of each distinct value, not the first row of each run.
An example data where x column contains the repeated sequence can be seen below:
x = c(2,2,2,3,3,3,1,1,1,5,5,5,2,2,2,1,1,1,3,3,3)
y = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 2 1
3 2 1
4 3 1
5 3 1
6 3 1
7 1 1
8 1 1
9 1 1
10 5 1
11 5 1
12 5 1
13 2 1
14 2 1
15 2 1
16 1 1
17 1 1
18 1 1
19 3 1
20 3 1
21 3 1
So the end result that I would like to achieve is:
x = c(2,3,1,5,2,1,3)
y = c(1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 3 1
3 1 1
4 5 1
5 2 1
6 1 1
7 3 1
Could you please help or point me to any useful existing topics? I haven't managed to find any.
Thanks
You can try rleid from the data.table package:
> library(data.table)
> setDT(df)[!duplicated(rleid(x))]
x y
1: 2 1
2: 3 1
3: 1 1
4: 5 1
5: 2 1
6: 1 1
7: 3 1
Base R.
df[c(1, diff(df$x)) != 0, ]
Or also with helper functions from data.table.
library(data.table)
df[rowid(rleid(df$x)) == 1L, ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
Using rle, taking the first index of each run directly.
with(rle(df$x), df[cumsum(lengths) - lengths + 1, ])
#    x y
# 1  2 1
# 4  3 1
# 7  1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
(Using match(rle(df$x)$values, df$x) instead may look equivalent, but match() returns the first occurrence of each value in the whole column, so repeated run values point back to earlier rows.)
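Since the question starts from slice_head(), here is a tidyverse sketch of the same run-wise idea; it assumes dplyr >= 1.1.0, which provides consecutive_id() (dplyr's analogue of rleid()) and the per-operation by argument:
library(dplyr)
df %>%
  mutate(run = consecutive_id(x)) %>%  # run counter that increments whenever x changes
  slice_head(n = 1, by = run) %>%      # keep the first row of each run
  select(-run)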
I'm trying to delete rows for which a condition is not satisfied, e.g. remove every row of a Subject that does not have all Period values.
The following is the data frame:
Subject Period
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
4 1
4 2
4 3
The desired output, with Subject 3 removed because it is missing Period 3:
Subject Period
1 1
1 2
1 3
2 1
2 2
2 3
4 1
4 2
4 3
A dplyr solution.
library(dplyr)
dat %>%
  group_by(Subject) %>%
  filter(all(unique(dat$Period) %in% Period)) %>%
  ungroup()
# # A tibble: 9 x 2
# Subject Period
# <int> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 1
# 5 2 2
# 6 2 3
# 7 4 1
# 8 4 2
# 9 4 3
A base R solution.
dat_list <- split(dat, f = dat$Subject)
keep_vec <- sapply(dat_list, function(x) all(unique(dat$Period) %in% x$Period))
dat_keep <- dat_list[keep_vec]
dat2 <- do.call(rbind, dat_keep)
dat2
# Subject Period
# 1.1 1 1
# 1.2 1 2
# 1.3 1 3
# 2.4 2 1
# 2.5 2 2
# 2.6 2 3
# 4.9 4 1
# 4.10 4 2
# 4.11 4 3
A solution using purrr and dplyr.
library(purrr)
library(dplyr)
dat2 <- dat %>%
  split(f = .$Subject) %>%
  keep(~ all(unique(dat$Period) %in% .x$Period)) %>%
  bind_rows()
dat2
# Subject Period
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 1
# 5 2 2
# 6 2 3
# 7 4 1
# 8 4 2
# 9 4 3
DATA
dat <- read.table(text = "Subject Period
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
4 1
4 2
4 3",
header = TRUE)
Consider ave() for inline aggregation, then subset accordingly, keeping the Subjects whose maximum Period reaches 3:
sub_df <- subset(dat, ave(Period, Subject, FUN = max) == 3)
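A data.table sketch in the same spirit, assuming dat is the data frame from the DATA block above and that "all Periods" means every Period value seen anywhere in the data:
library(data.table)
setDT(dat)
# keep only the Subjects whose Periods cover every Period present in the data
dat[, if (uniqueN(Period) == uniqueN(dat$Period)) .SD, by = Subject][]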
In the following dataframe, I have 24 points in the 3D space (2 horizontal locations along X and Y, each with 12 vertical values along Z).
I would like to group together the points vertically if:
they have the same val value and
they follow each other along the Z axis (so two 1s separated by another value would not get the same ID).
This should be done only for the values beyond the first 3 Z values (which automatically get ID = 1, 2 and 3 respectively; the following groups start at 4).
set.seed(50)
library(dplyr)
mydf = data.frame(X = rep(1, 24), Y = rep(1:2, each = 12),
Z = c(sample(1:12,12,replace=F), sample(4:16,12,replace=F)),
val = c(rep(1:3, 8)))
mydf = mydf %>% group_by(X,Y) %>% arrange(X,Y,Z) %>% data.frame()
# X Y Z val
# 1 1 1 1 3 # In this X-Y location, Z starts at 1
# 2 1 1 2 3
# 3 1 1 3 3
# 4 1 1 4 2
# 5 1 1 5 2
# 6 1 1 6 1
# 7 1 1 7 1
# 8 1 1 8 1
# 9 1 1 9 1
# 10 1 1 10 2
# 11 1 1 11 2
# 12 1 1 12 3
# 13 1 2 4 2 # In this X-Y location, Z starts at 4
# [etc (see below)]
Desired output (note for example that lines 4-5 and 10-11 get a different ID):
rle1 = rle(mydf[4:12,]$val)
# Run Length Encoding
# lengths: int [1:4] 2 4 2 1
# values : int [1:4] 2 1 2 3
rle2 = rle(mydf[4:12 + 12,]$val)
# Run Length Encoding
# lengths: int [1:7] 2 1 1 2 1 1 1
# values : int [1:7] 3 1 2 1 3 1 2
mydf$ID = c(1:3, rep(4:(3+length(rle1$lengths)), rle1$lengths),
1:3, rep(4:(3+length(rle2$lengths)), rle2$lengths))
# X Y Z val ID
# 1 1 1 1 3 1
# 2 1 1 2 3 2
# 3 1 1 3 3 3
# 4 1 1 4 2 4
# 5 1 1 5 2 4
# 6 1 1 6 1 5
# 7 1 1 7 1 5
# 8 1 1 8 1 5
# 9 1 1 9 1 5
# 10 1 1 10 2 6
# 11 1 1 11 2 6
# 12 1 1 12 3 7 # In this X-Y location, I have 7 groups in the end
# 13 1 2 4 2 1
# 14 1 2 5 2 2
# 15 1 2 6 3 3
# 16 1 2 7 3 4
# 17 1 2 9 3 4
# 18 1 2 10 1 5
# 19 1 2 11 2 6
# 20 1 2 12 1 7
# 21 1 2 13 1 7
# 22 1 2 14 3 8
# 23 1 2 15 1 9
# 24 1 2 16 2 10 # In this X-Y location, I have 10 groups in the end
How could I do this more efficiently, ideally in one line and with dplyr, given that it applies to many (X, Y) locations, each with the first 3 Z values (which start at a different value in each location) followed by a location-dependent number of ID groups?
My first attempt was to work with a vector from a conditional subset in dplyr, which is wrong:
mydf %>% group_by(X,Y) %>% arrange(X,Y,Z) %>%
mutate(dummy = mean(rle(val)$values))
Error: error in evaluating the argument 'x' in selecting a method for function 'mean': Error in rle(c(1L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L))$function (x, :
invalid subscript type 'closure'
Thanks!
You can use data.table::rleid on val starting from the 4th element and then add an offset of 3; this simplifies the rle calculation:
library(dplyr); library(data.table)
mydf %>%
  group_by(X, Y) %>%
  mutate(ID = c(1:3, rleid(val[-(1:3)]) + 3)) %>%
  as.data.frame() # for print purpose only
# X Y Z val ID
#1 1 1 1 3 1
#2 1 1 2 3 2
#3 1 1 3 3 3
#4 1 1 4 2 4
#5 1 1 5 2 4
#6 1 1 6 1 5
#7 1 1 7 1 5
#8 1 1 8 1 5
#9 1 1 9 1 5
#10 1 1 10 2 6
#11 1 1 11 2 6
#12 1 1 12 3 7
#13 1 2 4 2 1
#14 1 2 5 2 2
#15 1 2 6 3 3
#16 1 2 7 3 4
#17 1 2 9 3 4
#18 1 2 10 1 5
#19 1 2 11 2 6
#20 1 2 12 1 7
#21 1 2 13 1 7
#22 1 2 14 3 8
#23 1 2 15 1 9
#24 1 2 16 2 10
Or without rleid, use cumsum + diff:
mydf %>% group_by(X, Y) %>% mutate(ID = c(1:3, cumsum(c(4, diff(val[-(1:3)]) != 0))))
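A base R sketch of the same idea, assuming mydf is already sorted by X, Y and Z as above and that every (X, Y) location has more than 3 rows; ave() applies the run-counting function within each location:
# IDs 1:3 for the first three Z values, then a counter that increments whenever val changes
mydf$ID <- ave(mydf$val, mydf$X, mydf$Y, FUN = function(v) {
  c(1:3, cumsum(c(4, diff(v[-(1:3)]) != 0)))
})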
I have a data.frame:
ID <-c(2,2,2,2,3,3,5,5)
Pur<-c(0,1,2,3,1,2,4,5)
df<-data.frame(ID,Pur)
I would like to shift Pur up within each ID to get up.Pur, as follows:
ID Pur up.Pur
2 0 1
2 1 2
2 2 3
2 3 NA
3 1 2
3 2 NA
5 4 5
5 5 NA
Would appreciate your help with this.
Here is a dplyr approach
library(dplyr)
ID <-c(2,2,2,2,3,3,5,5)
Pur<-c(0,1,2,3,1,2,4,5)
df<-data.frame(ID,Pur)
df %>%
  group_by(ID) %>%
  mutate(up.Pur = lead(Pur))
# Source: local data frame [8 x 3]
# Groups: ID [3]
#
# ID Pur up.Pur
# <dbl> <dbl> <dbl>
# 1 2 0 1
# 2 2 1 2
# 3 2 2 3
# 4 2 3 NA
# 5 3 1 2
# 6 3 2 NA
# 7 5 4 5
# 8 5 5 NA
For completeness, I've added a base R approach, just in case you don't feel like installing any packages.
dfList = split(df, ID)
dfList = lapply(dfList, function(x){
x$up.Pur = c(x$Pur[-1], NA)
return(x)
})
unsplit(dfList, ID)
# ID Pur up.Pur
# 1 2 0 1
# 2 2 1 2
# 3 2 2 3
# 4 2 3 NA
# 5 3 1 2
# 6 3 2 NA
# 7 5 4 5
# 8 5 5 NA
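Another base R sketch, in the spirit of the ave() answers earlier on this page: shift Pur up within each ID in one line (assuming df as defined in the question):
# within each ID, drop the first Pur value and append NA at the end
df$up.Pur <- ave(df$Pur, df$ID, FUN = function(x) c(x[-1], NA))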
We can use shift from data.table
library(data.table)
setDT(df)[, up.Pur := shift(Pur, type = "lead"), by = ID]
df
# ID Pur up.Pur
#1: 2 0 1
#2: 2 1 2
#3: 2 2 3
#4: 2 3 NA
#5: 3 1 2
#6: 3 2 NA
#7: 5 4 5
#8: 5 5 NA